High-Performance Implementation of Algorithms on Reconfigurable Hardware

Doctoral Dissertation

Christos Gentsos, M.Sc.

Aristotle University of Thessaloniki Faculty of Science School of Physics

July, 2018

Υψηλών Επιδόσεων Υλοποίηση Αλγορίθμων σε Επαναδιαρθρώσιμο Υλικό

Διδακτορική Διατριβή

Χρίστος Γέντσος, M.Sc.

Αριστοτέλειο Πανεπιστήμιο Θεσσαλονίκης Σχολή Θετικών Επιστημών Τμήμα Φυσικής

Ιούλιος, 2018

Copyright © 2018

Christos Gentsos

Aristotle University of Thessaloniki

This thesis must be used only under the normal conditions of scholarly fair dealing for purposes of research, criticism or review. In particular, no results or conclusions should be extracted from it, nor should it be copied or closely paraphrased in whole or in part without the written consent of the author. Proper written acknowledgement should be made for any assistance obtained from this thesis.

Επταμελής Εξεταστική Επιτροπή:

Νικολαΐδης Σπυρίδων∗,+ καθηγητής ΑΠΘ

Αναγνωστόπουλος Αντώνιος+ καθηγητής ΑΠΘ

Θεοδωρίδης Γεώργιος+ επίκουρος καθηγητής, Πανεπιστήμιο Πατρών

Σίσκος Στυλιανός καθηγητής ΑΠΘ

Χατζόπουλος Αλκιβιάδης καθηγητής ΑΠΘ

Κορδάς Κωνσταντίνος επίκουρος καθηγητής ΑΠΘ

Σιώζιος Κωνσταντίνος επίκουρος καθηγητής ΑΠΘ

∗: Επιβλέπων
+: Μέλος τριμελούς συμβουλευτικής επιτροπής

Dedicated to

my wife Daphne, my parents Dimitrios and Eleni

Abstract

This work is concerned with the design of high-performance digital circuits on Field-Programmable Gate Array (FPGA) devices. These are generic devices, offering reconfigurable hardware units onto which digital circuits can be loaded, and their applications range from the Automotive to the Aerospace sector. The work for this dissertation is two-fold and was motivated by practical problems, in the domains of Molecular Diagnostics and High Energy Physics, calling for high-performance implementations of a number of algorithms that map very well to FPGAs. As such, it is arranged in two main parts, one for each application.

The first application calls for a novel implementation of the Canny edge detection algorithm for a real-time machine vision system that powers a microfluidic Lab-on-a-Chip demonstrator. The Canny edge detector algorithm is widely popular, having been designed with the objectives of minimizing the error rate and improving the localization of the identified edges. It comprises individual processing steps; two of them, the Gaussian smoothing and the Sobel edge detector, are also widely used independently, as image processing filters. This novel architecture incorporates various methods and well-researched approximations to optimize for performance, but the main feature that stands out is the strong exploitation of the parallelism capabilities provided by modern FPGAs. The architecture is pipelined at both the cycle level and the block level. The simultaneous exploitation of parallelism and pipelining results in a very efficient design that computes four pixels per clock cycle while maintaining a very high operating frequency. At the same time, the memory requirements remain constant with respect to a design that does not apply any pixel computation parallelism, while memory read accesses are reduced. The performance achieved by this implementation ranges from 800 Mpixel/s to 1900 Mpixel/s, depending on the FPGA device used. As a specific example, this translates to a computation time of 1.5 ms for a 1.2 Mpixel grayscale image on a Spartan-6 FPGA. To the best of the author’s knowledge, this implementation outperforms existing solutions; furthermore, the performance exceeds the system requirements, allowing even high-resolution images to be used in the real-time system. Performance and resource utilization

figures are presented for each component of the implementation, with differences between successive FPGA families briefly discussed. Finally, the integration of the machine vision implementation into an IP core, to be used as a drop-in subsystem in the full design, is also presented for completeness.

The second application involves the redesign of various algorithms used in the Fast TracKer (FTK) project. To perform real-time reconstruction of the trajectories of the particles resulting from collisions inside the ATLAS detector, out of the traces they leave on the silicon detector layers, a system comprising a few thousand FPGAs and custom Application-Specific Integrated Circuits (ASICs) has been realized. The ASICs implement massively parallel comparison operations to perform low-latency pattern matching, each one able to perform 64 billion comparisons per second. The FPGAs handle a wide range of tasks, from complex data-moving operations that facilitate pattern matching to high-performance mathematical operations that manipulate the hit coordinates and eventually compute the track parameters. Novel implementations of key components of this system, namely the Data Organizer (DO), the Combiner, and the Track Fitter (TF), have been designed in order to cope with the higher data rates of future scheduled detector upgrades and to lift certain limitations of the existing implementations. The objective is to facilitate the construction of a system based on the principles of FTK for other, even more demanding applications. The DO functions as the bridge between the pattern matching step, performed in low resolution, and the generation of full-resolution hit combinations that form potential tracks to be used in the track fitting step. Each full-resolution hit is stored based on a low-resolution identifier, and can be retrieved based on it. The operating principle is based on a novel, fast FPGA implementation of an instantly-erasable array of linked lists, with support for features of the AM ASICs, such as variable-size patterns and missing layers, which introduce extra layers of complexity to the architecture. The final implementation supports an operating frequency upwards of 400 MHz, greatly surpassing the specification targets. Advanced design methods, such as the automated generation of Look-up Table (LUT) instantiation code, and the interleaving of reading loops with an initiation interval greater than one, such that one memory read port can form more than one individual read channel, were introduced and applied to achieve that performance. The next component, the Combiner, is given a set of hits for each detector layer; its function is to form all the track-forming combinations out of these hit sets. This design is simpler than the DO; nevertheless, it still outperforms previous designs, and it connects the DO to the final component of this track reconstruction chain, the TF. The latter component performs the track fitting operation by implementing fast scalar products with columns of pre-computed matrices. The goal was to design a flexible, novel architecture, optimized to strike a balance between low latency and resource usage while maintaining an operating frequency that approaches the device limits. An architecture involving systolic arrays of registers, hardened Digital Signal Processing (DSP) blocks provided by modern FPGAs, and their dedicated interconnects, was devised. Combining the principles of parallelism and pipelining, one full track can be processed per clock cycle; by also taking physical layout considerations into account early in the design phase, these clock cycles are short, as an implementation that reaches a frequency of 600 MHz was obtained.
Furthermore, advanced methodologies were employed for the verification of these components, and a novel method was devised to utilize the same high-level testbench to verify correct operation both in Register Transfer Level simulation and in the actual implemented design while it is running on the FPGA. Finally, two demonstrators that make use of these implementations are presented: one ad-hoc demonstrator based on an evaluation board, and a research proposal that offers track reconstruction at the Level-1 track trigger for the 2025 HL-LHC CMS detector upgrade.

The compromises and approximations made in the algorithms and their justification; the strategies and methodologies employed, or devised, in order to derive these implementations; and area, power, and performance metrics of the resulting designs, are described in detail over the subsequent chapters.

Περίληψη

Το πόνημα που ακολουθεί αφορά τον σχεδιασμό ψηφιακών κυκλωμάτων υψηλών επιδόσεων σε συσκευές «προγραμματιζόμενες-στο-πεδίο διατάξεις πυλών» (FPGA). Αυτές είναι συσκευές γενικής χρήσης που προσφέρουν μονάδες επαναδιαρθρώσιμου υλικού για την υλοποίηση ψηφιακών κυκλωμάτων και βρίσκουν εφαρμογή σε μία σειρά τομέων, από την αυτοκινητοβιομηχανία μέχρι την αεροναυπηγική. Η δουλειά που παρουσιάζεται σε αυτήν τη διατριβή αφορά δύο κύριες εφαρμογές που προκύπτουν από τους τομείς της Μοριακής Διαγνωστικής και της Φυσικής Υψηλών Ενεργειών. Κίνητρο για την πραγματοποίησή της αποτέλεσαν υπάρχοντα προβλήματα στους παραπάνω τομείς, όπου παρουσιάζονται απαιτήσεις για υψηλών επιδόσεων υλοποιήσεις διαφόρων αλγορίθμων. Ο χαρακτήρας των εφαρμογών κάνει τις συσκευές FPGA να αποτελούν ιδανική πλατφόρμα για τις εν λόγω υλοποιήσεις. Ως αποτέλεσμα των παραπάνω, η διατριβή είναι οργανωμένη σε δύο κύρια μέρη, ένα για την κάθε εφαρμογή.

Η πρώτη εφαρμογή απαιτεί την ανάπτυξη μιας πρωτότυπης υλοποίησης ενός αλγόριθμου ανίχνευσης ακμών, ονόματι Canny, για ένα σύστημα πραγματικού χρόνου που αποτελεί τη βάση ενός συστήματος επίδειξης για μικρορροϊκό εργαστήριο-σε-τσιπ. Ο αλγόριθμος ανίχνευσης ακμών Canny είναι αρκετά δημοφιλής και έχει σχεδιαστεί με στόχο την ελαχιστοποίηση των σφαλμάτων και τη βελτιστοποίηση του εντοπισμού των αναγνωρισθέντων ακμών. Αυτός απαρτίζεται από ανεξάρτητα επεξεργαστικά βήματα· δύο από αυτά, η Γκαουσιανή εξομάλυνση και η ανίχνευση ακμών Sobel, χρησιμοποιούνται ευρέως και αυτόνομα, ως φίλτρα επεξεργασίας εικόνας. Η πρωτότυπη αρχιτεκτονική που σχεδιάστηκε ενσωματώνει διάφορες μεθόδους και καλά μελετημένες προσεγγίσεις, όντας βελτιστοποιημένη ως προς τις επιδόσεις, αλλά το χαρακτηριστικό που ξεχωρίζει είναι η εκτενής εκμετάλλευση της παραλληλίας που προσφέρουν οι σύγχρονες συσκευές FPGA. Στην αρχιτεκτονική χρησιμοποιείται η τεχνική της διασωλήνωσης τόσο στο επίπεδο του κύκλου ρολογιού όσο και στο επίπεδο των βημάτων επεξεργασίας. Η ταυτόχρονη εκμετάλλευση της παραλληλίας και της διασωλήνωσης οδήγησε σε ένα πολύ αποτελεσματικό κύκλωμα που επεξεργάζεται τέσσερα εικονοστοιχεία ανά κύκλο ρολογιού, ενώ παράλληλα υποστηρίζει υψηλή συχνότητα λειτουργίας.


Επιπλέον, οι απαιτήσεις μνήμης παραμένουν σταθερές σε σχέση με μία υλοποίηση που δεν θα υποστήριζε παραλληλία στο επίπεδο των εικονοστοιχείων, ενώ ο αριθμός αναγνώσεων από τη μνήμη είναι μειωμένος. Οι επιδόσεις που προσφέρει η υλοποίηση που προέκυψε κυμαίνονται στα 800 Mpixel/s με 1900 Mpixel/s, ανάλογα με τη συσκευή που χρησιμοποιείται. Ως συγκεκριμένο παράδειγμα, αυτό μεταφράζεται σε χρόνο επεξεργασίας 1.5 ms για μια εικόνα κλίμακας του γκρι 1.2 Mpixel σε συσκευή FPGA οικογένειας Spartan-6. Μετά από μελέτη της σχετικής βιβλιογραφίας, φαίνεται ότι οι επιδόσεις υπερβαίνουν τις υλοποιήσεις που προϋπήρχαν· επιπλέον, οι επιδόσεις υπερβαίνουν τις προδιαγραφές του συστήματος, επιτρέποντας την χρήση ακόμα και εικόνων υψηλής ανάλυσης στο πραγματικού χρόνου σύστημα. Οι επιδόσεις και η χρήση πόρων της συσκευής παρουσιάζονται για κάθε αυτόνομο μέρος της υλοποίησης, με την παράλληλη παράθεση μιας σύντομης πραγμάτευσης πάνω στις διαφορές μεταξύ διαδοχικών οικογενειών συσκευών FPGA· τέλος, παρουσιάζεται η ενσωμάτωση της υλοποίησης μηχανικής όρασης σε εξάρτημα IP έτοιμο προς χρήση ως υποσύστημα στο τελικό προϊόν, για λόγους πληρότητας.

Η δεύτερη εφαρμογή είναι ο επανασχεδιασμός διαφόρων αλγορίθμων που χρησιμοποιούνται στο σύστημα Fast TracKer (FTK). Αυτό έχει ως σκοπό την ανακατασκευή σε πραγματικό χρόνο των τροχιών των σωματιδίων που παράγονται από τις συγκρούσεις στον ανιχνευτή ATLAS από τα ίχνη (hits) που αφήνουν στα στρώματα των ανιχνευτών πυριτίου. Το σύστημα αυτό απαρτίζεται από κάποιες χιλιάδες συσκευές FPGA και ολοκληρωμένα κυκλώματα για ειδικές εφαρμογές (ASIC). Τα ολοκληρωμένα κυκλώματα υλοποιούν παράλληλα λειτουργίες συγκρίσεως δεδομένων ώστε να εκτελούν ταίριασμα μοτίβων (pattern matching) με χαμηλή καθυστέρηση, με το καθένα τους να είναι ικανό να εκτελέσει εξήντα τέσσερα δισεκατομμύρια συγκρίσεις το δευτερόλεπτο. Οι συσκευές FPGA διαχειρίζονται ένα ευρύ φάσμα εργασιών, από περίπλοκες λειτουργίες διαχείρισης δεδομένων που υποστηρίζουν την ανίχνευση μοτίβων, έως μαθηματικούς υπολογισμούς υψηλών επιδόσεων πάνω στις συντεταγμένες που δίνει ο ανιχνευτής, με στόχο να παραχθούν οι παράμετροι που ορίζουν τις τροχιές των σωματιδίων. Σχεδιάστηκαν πρωτότυπες υλοποιήσεις κομβικών συστατικών του συστήματος, πιο συγκεκριμένα του Οργανωτή Δεδομένων (Data Organizer, DO), του Συνδυαστή (Combiner) και του Προσαρμογέα Τροχιών (Track Fitter, TF), ώστε αυτά να αντεπεξέλθουν στη μεγαλύτερη ροή δεδομένων που θα προκύψει από τις μελλοντικές αναβαθμίσεις του ανιχνευτή και να αρθούν συγκεκριμένοι περιορισμοί των υλοποιήσεων που υπάρχουν ήδη. Ο σκοπός ήταν να καταστεί δυνατή η κατασκευή μελλοντικών συστημάτων βασισμένων στις αρχές λειτουργίας του FTK για άλλες, ακόμα πιο απαιτητικές εφαρμογές. Ο Οργανωτής Δεδομένων λειτουργεί ως γέφυρα ανάμεσα στο στάδιο αναγνώρισης μοτίβων, που

χρησιμοποιεί αναπαράσταση χαμηλής ανάλυσης, και στο στάδιο του υπολογισμού των συνδυασμών συντεταγμένων του ανιχνευτή, οι οποίοι αποτελούν πιθανές τροχιές σωματιδίων, που χρησιμοποιεί δεδομένα σε πλήρη ανάλυση. Κάθε υψηλής ανάλυσης ίχνος (hit) αποθηκεύεται με βάση μία αναπαράσταση χαμηλής ανάλυσης και κατόπιν μπορεί να ανακληθεί χρησιμοποιώντας την. Η αρχή λειτουργίας της αρχιτεκτονικής που σχεδιάστηκε βασίζεται σε μία πρωτότυπη και ταχύτατη υλοποίηση σε FPGA μίας διάταξης συνδεδεμένων λιστών που επιτρέπει τη στιγμιαία διαγραφή της και υποστηρίζει τις δυνατότητες των ολοκληρωμένων κυκλωμάτων αναγνώρισης μοτίβων, όπως είναι τα μοτίβα μεταβλητού μεγέθους και η απουσία στρωμάτων ανιχνευτή, οι οποίες αυξάνουν την περιπλοκότητα της αρχιτεκτονικής. Η τελική υλοποίηση υποστηρίζει συχνότητα λειτουργίας μεγαλύτερη από 400 MHz, ξεπερνώντας κατά πολύ τις αρχικές προδιαγραφές του συστήματος που στοχεύεται. Για να επιτευχθούν αυτές οι επιδόσεις, εισήχθηκαν προχωρημένες τεχνικές σχεδιασμού, όπως η αυτοματοποιημένη παραγωγή κώδικα που συνθέτει πίνακες αναζήτησης (Look-up Tables, LUT), και η διεμπλοκή βρόχων ανάγνωσης με διάστημα έναρξης μεγαλύτερο του ένα, με τρόπο που μία θύρα ανάγνωσης να σχηματίζει περισσότερα από ένα κανάλια ανάγνωσης. Το επόμενο συστατικό, ο Συνδυαστής, δέχεται σύνολα ιχνών για κάθε στρώμα του ανιχνευτή και η λειτουργία του είναι να σχηματίζει όλους τους συνδυασμούς ιχνών που είναι σε θέση να συνιστούν τροχιά. Ο σχεδιασμός αυτού του συστατικού είναι πιο απλός από αυτόν του Οργανωτή Δεδομένων· ωστόσο, οι επιδόσεις του εξακολουθούν να ξεπερνούν αυτές αντίστοιχων υλοποιήσεων και, επιπλέον, συνδέει τον Οργανωτή Δεδομένων με το τελευταίο συστατικό της αλυσίδας ανακατασκευής τροχιών, τον Προσαρμογέα Τροχιών. Το συστατικό αυτό εκτελεί τη λειτουργία προσαρμογής τροχιών υλοποιώντας γρήγορα εσωτερικά γινόμενα με στήλες προ-αποθηκευμένων πινάκων. Ο στόχος ήταν να σχεδιαστεί μια ευέλικτη και πρωτότυπη αρχιτεκτονική, βελτιστοποιημένη ως προς την ισορροπία μεταξύ χαμηλής καθυστέρησης και χρήσης πόρων, διατηρώντας παράλληλα μια συχνότητα λειτουργίας που αγγίζει τα όρια της συσκευής. Προέκυψε, έτσι, μια αρχιτεκτονική που περικλείει συστολικές διατάξεις καταχωρητών, προκατασκευασμένα μπλοκ Ψηφιακής Επεξεργασίας Σήματος (Digital Signal Processing, DSP) και τις αποκλειστικές διασυνδέσεις τους. Συνδυάζοντας τις αρχές της παραλληλίας και της διασωλήνωσης, κατασκευάστηκε ένα κύκλωμα που παράγει μια ολόκληρη τροχιά ανά κύκλο ρολογιού· λαμβάνοντας υπόψιν νωρίς στη διαδικασία του σχεδιασμού τις χαμηλού επιπέδου λεπτομέρειες της υλοποίησης επιτεύχθηκε μία πολύ υψηλή συχνότητα λειτουργίας, στα 600 MHz. Επιπλέον, προχωρημένες μεθοδολογίες χρησιμοποιήθηκαν για την επαλήθευση αυτών των συστατικών και μία πρωτότυπη μέθοδος εισήχθηκε ώστε να χρησιμοποιηθεί το ίδιο περιβάλλον επαλήθευσης τόσο στην προσομοίωση

όσο και στο υλικό, ενώ αυτό λειτουργεί πάνω στο FPGA. Τέλος, παρουσιάζονται δύο συστήματα επίδειξης που κάνουν χρήση αυτών των συστατικών: ένα σύστημα επίδειξης ειδικού σκοπού, βασισμένο σε ένα σύστημα αξιολόγησης FPGA (evaluation board), και ένα σύστημα που αποτελεί μέρος μίας ερευνητικής πρότασης που έχει ως στόχο να επιτρέψει την ανακατασκευή τροχιών στον σκανδαλιστή πρώτου επιπέδου (Level-1 trigger) του ανιχνευτή CMS στην αναβάθμιση HL-LHC, το 2025.

Στα κεφάλαια που ακολουθούν περιγράφονται λεπτομερώς οι τεχνικές, οι συμβιβασμοί και οι προσεγγίσεις που υιοθετήθηκαν ώστε να υλοποιηθούν οι διάφοροι αλγόριθμοι. Τέλος, αναλύονται οι στρατηγικές και οι μεθοδολογίες που ακολουθήθηκαν ή επινοήθηκαν ώστε να σχεδιαστούν οι υλοποιήσεις και εξετάζονται τα χαρακτηριστικά και οι επιδόσεις των κυκλωμάτων που προέκυψαν.

Acknowledgments

First, I would like to thank my thesis supervisor, Prof. Spyridon Nikolaidis, for his valuable guidance and teachings over the years. On a personal level, I am grateful for the patience and dedication he has shown. He has gone above and beyond to help, guide and motivate me—which was especially helpful when it came to publishing results!

To Alberto Annovi, I extend my gratitude for the guidance and continuing support he offered me. He has always offered his keen reasoning and sound advice on matters ranging from the very low-level and technical to the high-level, and always gladly shared his experience and wisdom for my benefit and that of my colleagues.

I thank Fabrizio Palla for the faith he put in my abilities and for allowing me to work in the AM team for the CMS Level-1 Track Trigger upgrade; it was a very exciting time working for this project, filled with enthusiasm. His personal involvement in the day-to-day work, as much as in long-term planning, his feedback, and his support have been invaluable.

It would be hugely remiss of me not to express here my deepest appreciation and gratitude to Paola Giannetti, who, with her multifaceted expertise in matters of both Physics and digital circuit architecture, her keen insights, and her immense dedication to the FTK project, helped me grow as a researcher. Her willingness to give her time so generously and selflessly is greatly appreciated by all.

A special thank you goes out to Kostas Kordas; since the time I was introduced to the FTK project and was feeling lost while trying to catch on, he has been tirelessly and enthusiastically answering my seemingly endless questions about ATLAS and triggering mechanics—and this is but a sample of his ever-present help over the last few years.

To Livio Fanò, Gian-Mario Bilei, Bruno Checcucci, and Alessandro Rossi, at the University of Perugia, I owe my thanks for the excellent collaboration we had, and for supporting me in numerous ways during my stay in Perugia.

I would also like to thank my excellent colleague and friend, Calliope-Louisa Sotiropoulou, for the years of great work and excellent collaboration across multiple projects, and for her lasting encouragement and motivation while writing this thesis.

Many thanks also go to Guido Volpi and Akis Gkaitatzis, who provided valuable help and ideas on the track fitting software and interpreting the performance plots.

Special thanks also to Pantelis Sopasakis, for lending a hand with the linear algebra and for sharing his unsurpassable wit; he provides continuing and venerable competition on high-quality drollery. Many thanks go out to Marietta for putting up with it, too!

A shout-out to my colleagues at the University of Pisa and AUTH, Giacomo Fedi, Guido Magazzù, and Lymberis Voudouris: I am sure I will miss our time working together (that also goes for all the wonderful people mentioned up to this point).

I would like to take this opportunity to also thank my parents for instilling in me the mentality of continuous improvement, motivating me to keep furthering my studies, and providing their support throughout my entire life; I am eternally indebted.

Last but not least, I would like to thank my loving wife, Daphne, without whose lasting support and encouragement this journey (literally as much as figuratively, as she endured a thousand home moves over the last few years) would not have been possible.

The author,

Christos Gentsos

Thessaloniki, 2018-07-04

Table of Contents

Abstract i

Περίληψη v

Acknowledgments ix

List of Figures xix

List of Tables xxi

List of Code Listings xxiii

Acronyms xxv

Glossary xxix

Conventions xxxvii

Chapter 1: Introduction 1 1.1 History of Programmable Logic ...... 1 1.2 The Field-Programmable Gate Array ...... 5 1.2.1 Why FPGAs ...... 6 1.2.2 Design Considerations for FPGA Synthesis ...... 8 1.3 Thesis Contribution ...... 10 1.3.1 Image Processing ...... 10 1.3.2 Track Trigger applications ...... 11 1.4 Thesis Organization ...... 15

Chapter 2: High-Performance Implementations of Image Processing Algorithms 17 2.1 Introduction ...... 17


2.1.1 Related Work ...... 18 2.2 Gaussian Convolution Implementation ...... 19 2.2.1 Implementation Details ...... 21 2.2.1.1 Cache Lines and Control Signals ...... 22 2.2.1.2 Image Border Handling ...... 24 2.2.1.3 Division by a Constant Implementation ...... 25 2.2.1.4 Application of Parallelism ...... 28 2.2.2 Performance and Resource Utilization ...... 28 2.3 Sobel Filter Implementation ...... 30 2.3.1 Implementation Details ...... 31 2.3.1.1 Image Border Handling ...... 32 2.3.1.2 Computation Core ...... 33 2.3.2 Performance and Resource Utilization ...... 35 2.4 Implementation of a Modified Canny Algorithm for Edge Detection . . . 36 2.4.1 NMS Stage ...... 37 2.4.2 Hysteresis Approximation ...... 38 2.4.2.1 Image Compression ...... 39 2.4.3 Performance and Resource Utilization ...... 42 2.5 High Performance Machine Vision System Implementation ...... 45 2.5.1 System Function ...... 45 2.5.2 Implementation ...... 46 2.5.2.1 Frame Detection - Modified Hough Algorithm . . . . . 46 2.5.2.2 Flow Detection ...... 47 2.5.3 IP Core Packaging ...... 49 2.5.3.1 The PLB Bus—Core Configuration and Control . . . . . 49 2.5.3.2 External Memory Access ...... 50 2.5.4 Performance and Resource Utilization ...... 51 2.6 Conclusions ...... 51

Chapter 3: Advanced High-Performance Designs for Track Trigger Applications 55 3.1 Introduction ...... 55 3.1.1 The LHC particle collider ...... 56 3.1.2 The ATLAS detector ...... 58 3.1.3 Triggering ...... 59 3.1.4 The FTK System ...... 60 3.1.5 The CMS detector ...... 63 TABLE OF CONTENTS xiii

3.1.6 The AM-Chip Approach for Level-1 Tracking at CMS ...... 64 3.2 Related Work ...... 66 3.3 The Data Organizer ...... 67 3.3.1 Special Features of the Data Organizer ...... 67 3.3.2 Implementation Details ...... 68 3.3.2.1 Organization of Memory Structures ...... 69 3.3.2.2 Register File Implementation ...... 70 3.3.2.3 Write Details and Data Collision Detection ...... 72 3.3.2.4 Sorting by Data Valid using Python-initialized ROMs . 73 3.3.2.5 Multiple Read Ports ...... 75 3.3.3 Verification Environment ...... 78 3.3.3.1 Use of SVA ...... 78 3.3.3.2 A Short Introduction to UVM ...... 80 3.3.3.3 The Testbench ...... 80 3.3.4 Performance, Resource Utilization and Power ...... 81 3.4 The Combiner ...... 84 3.4.1 Implementation Details ...... 85 3.4.1.1 DO Interface ...... 85 3.4.1.2 Combination generation ...... 86 3.4.2 Performance and Resource Utilization ...... 87 3.5 The Track Fitter ...... 88 3.5.1 Implementation Details ...... 90 3.5.1.1 DSP Block Structure in Modern FPGAs . . . . . 91 3.5.1.2 Computational Core Implementation ...... 92 3.5.1.3 Optimizing the Computational Core for Resources, La- tency and Power ...... 94 3.5.2 Performance and Resource Utilization ...... 95 3.6 Track Reconstruction System Implementation, Testing, and Evaluation . 96 3.6.1 Hardware Setup ...... 97 3.6.1.1 Ethernet Communucation and the IPBus suite . . . . . 97 3.6.2 Verification Environment ...... 99 3.6.2.1 The Testbench ...... 100 3.6.2.2 The BFM and On-Board Verification ...... 101 3.6.3 Track Reconstruction Testing and Evaluation in Python . . . . . 102 3.6.3.1 Toy Detector Model Description ...... 103 3.6.3.2 Functions Performed by the Software ...... 106 xiv TABLE OF CONTENTS

3.6.3.3 Reconstruction Resolution and Performance ...... 107 3.6.3.4 Reconstruction Processing Time ...... 110 3.6.3.5 Testing and Validation ...... 111 3.7 The PRM06 Demonstrator for L1 Track Reconstruction inCMS ...... 112 3.7.1 The PRM06 FPGA design ...... 114 3.7.1.1 DO port configuration and Double instantiation . . . . 116 3.7.1.2 TF Parameter Binning Support ...... 116 3.7.1.3 Performance and Resource Utilization ...... 117 3.8 Conclusions ...... 118

Chapter 4: Conclusions 121 4.1 Review ...... 121 4.2 Future Work ...... 124

Appendices

Chapter A: Designing around Metastability 127 A.1 Introduction ...... 127 A.2 CDC Circuits ...... 128 A.2.1 The basic 2FF synchronizer ...... 128 A.2.2 The handshake synchronizer ...... 130 A.3 Common Pitfalls ...... 131 A.3.1 Convergence ...... 131 A.3.2 Divergence ...... 131 A.3.3 Re-convergence ...... 133

Chapter B: The SystemVerilog Design and Verification Language 135 B.1 Introduction ...... 135 B.2 SystemVerilog Synthesis Features ...... 135 B.3 SystemVerilog Verification Features ...... 141

Chapter C: Vivado® Synthesis and Implementation attributes 145 C.1 Introduction ...... 145 C.2 Attributes ...... 145

Chapter D: Code Listings 149 D.1 Bit Indexing ROM Initialization Using Python ...... 149 D.2 Kintex-7 Optimal Multiplexer ...... 151 TABLE OF CONTENTS xv

D.3 Parametric Delay Line ...... 156 D.4 Reset Synchronizer ...... 157 D.5 Pulse Detection Synchronizer ...... 158 D.6 Combiner Module Core ...... 161 D.7 Data Organizer UVM testbench ...... 162

References 183

Publications 185

List of Figures

1.1 PAL architecture functional block diagram ...... 2 1.2 PLD families ...... 3 1.3 Simplified FPGA architecture ...... 4 1.4 Xilinx 7-series FPGA slice ...... 7

2.1 3 3 kernel convolution ...... 20 × 2.2 Block I/O and timing diagram for the Gaussian smoothing core . . . . . 21 2.3 Gaussian smoothing output formatter implementation ...... 23 2.4 Gaussian smoothing top-level block diagram ...... 23 2.5 Simplified ASM chart for the Gaussian smoothing cache and control block 24 2.6 Gauss convolution matrix normalization factors in different positions . . 26 2.7 Gaussian smoothing computation core implementation ...... 29 2.8 Gauss parallelism ...... 29 2.9 Sobel top-level block diagram ...... 32 2.10 Sobel computation core implementation details ...... 33 2.11 Norm approximation ...... 35 2.12 The angle bins the Sobel filter computes ...... 36 2.13 Main Canny building blocks ...... 37 2.14 NMS core implementation ...... 38 2.15 Hysteresis thresholding example ...... 39 2.16 Compression scheme ...... 40 2.17 Huffman encoder block diagram ...... 40 2.18 Canny edge detection steps, “lena” image ...... 43 2.19 Canny edge detection steps, “LoC frame” image ...... 44 2.20 LoC demonstrator ...... 46 2.21 Video frame, chip frame, and flow windows ...... 48 2.22 VFBC command words ...... 51 2.23 Machine Vision subsystem IP core address space ...... 52


3.1 The ATLAS detector ...... 57 3.2 AM chip pattern matching visual description ...... 62 3.3 The CMS detector ...... 63 3.4 Proposed system hierarchy for L1 tracking ...... 65 3.5 High-level Track Trigger FPGA design block diagram ...... 68 3.6 DO write example ...... 70 3.7 2k multiplexer design ...... 72 3.8 DO waveforms ...... 74 3.9 Sort by Data Valid example ...... 75 3.10 LUT-based parallel sort by Data Valid example ...... 76 3.11 Data Organizer read datapath block diagram ...... 77 3.12 Simplified UVM testbench diagram ...... 79 3.13 DO core constrained in an triangular-like area ...... 82 3.14 DO power consumption breakdown ...... 83 3.15 Combiner memory structure and read diagram ...... 86 3.16 TF block diagram ...... 90 3.17 High-level DSP48E2 block diagram, highlighting its main functionality . 91 3.18 Detailed DSP48E2 block diagram that includes the dedicated interconnect 92 3.19 Original TF computational core ...... 93 3.20 Detailed view of the TF MACC block ...... 93 3.21 Waveform diagram explaining the staggered input to the TF MACC blocks 93 3.22 Resource-optimized TF computational core ...... 94 3.23 The KCU-105 evaluation board ...... 98 3.24 Timing diagram showing various IPBus transactions ...... 99 3.25 Block diagram of the testbench top-level ...... 101 3.26 Toy detector geometry ...... 103 3.27 Track parameter resolution plots ...... 108 3.28 Floating-point and fixed-point χ2 distribution comparison ...... 109 3.29 Processing time vs occupancy ...... 109 3.30 Floating-point and fixed-point efficiency / fake rate comparison . . . . . 109 3.31 Top view of the PRM06 demonstrator board ...... 112 3.32 PRM06 demonstrator system block diagram ...... 114

A.1 Basic 2FF synchronizer schematic and example timing ...... 129 A.2 Handshake synchronizer circuit schematic ...... 130 A.3 Convergence example schematic ...... 132 A.4 Crossover divergence example schematic ...... 132 LIST OF FIGURES xix

A.5 Metastable signal divergence example schematic ...... 132 A.6 Re-convergence example schematic ...... 132

List of Tables

2.1 Gaussian smoothing implementation results ...... 30 2.2 Sobel filter implementation results ...... 36 2.3 NMS implementation results ...... 40 2.4 Hysteresis implementation results ...... 41

2.5 Canny implementation LUT usage broken down by block, Fmax . . . . . 42 2.6 Machine vision implementation results ...... 51

3.1 AM Chip generations ...... 62 3.2 Data Organizer resource utilization results ...... 82

3.3 Data Organizer fmax results on constrained area ...... 82 3.4 Combiner resource utilization results ...... 88 3.5 Track Fitter core resource utilization results ...... 96 3.6 Full Track Fitter resource utilization results ...... 96 3.7 PRM06 resource utilization ...... 117

B.1 VHDL (a) and (b) logic values ...... 139


List of Code Listings

3.1 Count Leading Zeros (clz) ...... 75 3.2 Sorting function algorithmic representation ...... 75 3.3 Code snippet of a simple assertion ...... 78 3.4 Code snippet of a cover property ...... 79 3.5 IPBus interface ...... 99 3.6 write_ipbus task ...... 102

B.1 2-in-1 multiplexer in SystemVerilog ...... 136 B.2 2-in-1 multiplexer in Verilog ...... 136 B.3 Examples of new data types introduced in SystemVerilog ...... 139

D.1 Python program that generates HDL code for ROM initialization . . . . . 149 D.2 Optimal 2 k:1 multiplexer for Kintex-7 devices ...... 152 D.3 Parametric delay line ...... 156 D.4 Parametric reset synchronizer ...... 157 D.5 Pulse detection and synchronization implementation...... 158 D.6 Combiner module core in SystemVerilog ...... 161 D.7 DO UVM testbench top-level source ...... 162 D.8 DO UVM testbench package source ...... 163 D.9 DO UVM testbench test source ...... 164 D.10 DO UVM testbench interface source ...... 165 D.11 DO UVM testbench driver source ...... 168 D.12 DO UVM testbench scoreboard source ...... 171


Acronyms

ABEL Advanced Boolean Expression Lan- 117, 122, 124, 147 guage. 3, CAM Content Addressable Memory. ABV Assertion-Based Verification. 78, 142 CCL Connected-component Labeling. 39 ALU Arithmetic Logic Unit. 91 CDC Clock Domain Crossing. 16, 128, 131, AM Associative Memory. 12, 113, 122 Glos- 133, 145 sary: Associative Memory CDF Collider Detector at Fermilab. 60 Glos- AMBA Advanced Microcontroller Bus Ar- sary: Collider Detector at Fermilab chitecture. CERN Conseil Européen pour la Recherche ANSI American National Standards Insti- Nucléaire. 56, 57, 63 Glossary: CERN tute. CISC Complex Instruction Set . API Application Programming Interface. CLB Configurable Logic Block. 5, 6, 8, 9, 29, 97, 99 71, 73, 145 Glossary: CLBs ASIC Application-Specific Integrated Cir- CMOS Complementary Metal-Oxide- cuit. 6, 9, 12, 60, 61, 79, 122, 128 Glossary: Semiconductor. 61 Glossary: CMOS ASICs CMS Compact Muon Solenoid. 14, 15, 56, ASM Algorithmic State Machine. 23 Glos- 57, 63, 64, 66, 97, 112, 118, 119, 122, 123 sary: ASM Glossary: CMS ATCA Advanced Telecommunications CORDIC COordinate Rotation DIgital Computing Architecture. 65, 97 Glossary: Computer. 35 ATCA CPLD Complex Programmable Logic De- ATLAS A Toroidal LHC ApparatuS. 12, 14, vice. 4 Glossary: CPLD 15, 56, 57, 59, 60, 61, 63, 64, 118, 119, 122 CPU . 6, 7, 59, 64, Glossary: ATLAS 73, 107, 118 AXI Advanced eXtensible Interface. CRC Cyclic Redundancy Check. 115 Glos- BER Bit Error Rate. sary: CRC BFM Bus Functional Model. 101 CUPL Compiler for Universal Program- BRAM Block RAM. 6, 8, 12, 17, 22, 29, 40, mable Logic. 3, 41, 49, 65, 68, 70, 71, 77, 81, 84, 85, 87, 96, DAQ Data Acquisition System. Glossary:


Data Acqisition system 29, 39, 42, 45, 53, 56, 60, 61, 65, 66, 67, 70, DDR Double-Data-Rate. 97, 113 Glossary: 73, 75, 81, 83, 88, 89, 90, 91, 97, 98, 99, 100, DDR 101, 106, 111, 112, 113, 114, 117, 118, 121, DDR SDRAM Double-Data-Rate Syn- 122, 123, 125, 127, 128, 136, 137, 146, 151 chronous Dynamic Random-Access Mem- Glossary: FPGAs ory. 49, FPLA Field-. 2 DIP Dual In-line Package. 3 Glossary: FPLA DMA Direct Memory Access. FSL Fast Simplex Link. DO Data Organizer. 12, 13, 14, 15, 60, 66, FSM Finite State Machine. 9, 22, 23, 41, 141 67, 68, 69, 71, 72, 73, 77, 78, 80, 81, 83, 84, 85, Glossary: FSM 86, 87, 94, 96, 99, 100, 110, 111, 112, 114, 115, FTK Fast TracKer. 12, 15, 60, 61, 65, 66, 89, 116, 117, 118, 122, 123, 124, 125, 142, 151, 118, 122 Glossary: FTK 162 FWFT First-Word Fall-Through. DPI Direct Programming Interface. 99, 101, GAL . 3 Glossary: GAL 118 Glossary: DPI GNU GNU’s Not Unix. DRAM Dynamic Random-Access Memory. GPU Graphics Processing Unit. 18, 118 Glossary: DRAM HCAL Hadron Calorimeter. 58, 63 DRC Design Rule Check. 145 HDL Hardware Description Language. 4, DSP Digital Signal . 6, 7, 8, 9, 14, 12, 16, 75, 99, 123, 149 15, 47, 65, 89, 90, 91, 117, 118, 123, 146, 147 HEP High-Energy Physics. 9, 11, 14, 15, 55, DUT Design Under Test. 79, 100, 101 56, 118, 121 ECAL Electromagnetic Calorimeter. 58, 63 HL-LHC High Luminosity Large Hadron EDA Electronic Design Automation. Glos- Collider. 12, 14, 57, 64, 112, 119, 122, 123 sary: EDA Glossary: High Luminosity LHC EDK Embedded Development Kit. HLS High-Level Synthesis. 7, 124 EEPROM Electrically Erasable Program- HLT High-Level Trigger. 12, 59, 60, 61, 64 mable Read-Only Memory. 3 HTT Hardware Tracking Trigger. 14 FF Flip-Flop. 5, 29, 35, 81, 87, 117, 127, 129, HVL Hardware Verification Language. 135 131, 133, 136, 145, 157 Glossary: Flip-Flop I2C Inter-. Glossary: I2C FIFO First-In First-Out memory. 9, 41, 50, I/O Input / Output. 3, 4, 7, 21, 61, 113, 139 76, 77, 85, 86, 88, 128, 130, 133, 143 Glossary: IBL Insertable B-Layer. 59 FIFO IC Integrated Circuit. 1, 2, 4 Glossary: IC FMC FPGA Mezzanine Card. 97, 113, 114, ICARUS Integrated Circuit ARtwork Util- 115 Glossary: FMC ity System. Glossary: ICARUS FPGA Field-Programmable Gate Array. 5, ID Inner Detector. 58, 59, 61, 63, 64 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 27, 28, IDE Integrated Development Environment. ACRONYMS xxvii

ILA Integrated Logic Analyzer. NDOF Number of Degrees of Freedom. 89 INFN Istituto Nazionale di Fisica Nucleare. NMS Non-Maximum Suppression. 33, 37, Glossary: INFN 38, 42 Glossary: Non-Maximum Suppres- IP Intellectual Property. 15, 49, 50, 53, 122 sion Glossary: OOP Object-oriented programming. Glos- ISP In-System Programming. 4 sary: Object-Oriented Programming JTAG Joint Test Action Group. 4 OSVVM Open Source VHDL Verification L1 Level-1. 11, 12, 14, 59, 60, 61, 64, 66, 112, Methodology. Glossary: OSVVM 119, 122, 123 OTP One-Time Programmable. 2, 6 L2 Level-2. PAL . 3, 4 Glos- LEP Large Electron-Positron collider. Glos- sary: PAL sary: Large Electron-Positron Collider PC Personal Computer. 59, 97 LFSR Linear-Feedback Shift Register. PCA Principal Component Analysis. 65, 89 LHC Large Hadron Collider. 11, 15, 56, 57, Glossary: Principal Component Analysis 59, 64, 101, 119 Glossary: Large Hadron PCB . 3, 6, 61, 112 Collider PCIe PCI Express. 97 Glossary: PCIe LIFO Last-In First-Out memory. 143 Glos- PCR Polymerase chain reaction. 48 Glos- sary: LIFO sary: Polymerase chain reaction LoC Lab-on-Chip. 10, 15, 17, 18, 19, 32, 40, PLA Programmable Logic Array. 2, 3, 5 42, 45, 46, 47, 48, 49, 51, 53, 121 Glossary: Glossary: PLA Lab-on-Chip PLB Processor Local Bus. 49 LOCs Lines of Code. 80 PLL Phase-Locked Loop. 127, LSB Least Significant Bit. 67 PoC Point-of-Care. 18, 42, 121 LUT Look-Up Table. 5, 8, 11, 12, 29, 35, 67, PRM Pattern Recognition Mezzanine. 65, 69, 73, 77, 81, 87, 97, 117, 147, 149 Glossary: 66, 112, 113, 114, 115, 117, 123, 125 LUTs PROM Programmable Read-Only Memory. LVDS Low-Voltage Differential Signaling. 1, 2 113, 115 Glossary: Low-Voltage Differ- PSL Property Specification Language. 142 ential Signaling Glossary: PSL MACC Multiplier-Accumulator. 90, 91 PU Processing Unit. 60 MPMC Multi-Port Memory Controller. 49, RAM Random Access Memory. 1, 5, 6, 9, 50 73, 147 MSB Most Significant Bit. RISC Reduced Instruction Set Computer. MTBF Mean Time Between Failures. 129, RLDRAM Reduced Latency Dynamic 130, 145 Random-Access Memory. 113, 114, Glos- MUX Multiplexer. 5, 8, 9, 12, 128 sary: RLDRAM xxviii ACRONYMS

RLE Run-length encoding. 11, 39, 40, 41 TRT Transition Radiation Tracker. 59 Glossary: run-length encoding TTL Transistor-Transistor Logic. 1 ROM Read-Only Memory. 2, 9, 73, 75 UDP User Datagram Protocol. 97 RTL Register-Transfer Level. 14, 84, 99, 118, UVM Universal Verification Methodology. 123, 124, 142 14, 78, 79, 80, 99, 118, 123, 162 Glossary: SCT Semiconductor Tracker. 58 UVM SDK Software Development Kit. VFBC Video Frame Buffer Controller. 50 SiP System-in-Package. Glossary: System- VHDL Very high speed integrated circuit in-Package Hardware Description Language. 4, 136, 141 SoC System-on-Chip. 7, 97, 135, 138 Glos- Glossary: VHDL sary: System-on-Chip VLSI Very Large Scale Integration. 1 Glos- SPI Serial Peripheral Interface. 65, 97 Glos- sary: VLSI sary: SPI VME VERSAmodule Eurocard. 61 Glossary: SPLD Simple Programmable Logic Device. VME 4 Glossary: SPLD XPS Xilinx Platform Studio. 49 SPS Super Proton Synchrotron. SRAM Static Random Access Memory. 5, 6 SRL Shift Register LUT. 5 SS Super-Strip. 60, 65, 66, 67, 68, 69, 70, 71, 75, 76, 77, 113, 114 Glossary: Super-Strip SSID Super-Strip ID. 12, 66, 67, 68, 69, 71, 73, 76, 97, 100, 114 SSN Simultaneous Switching Noise. SSTL Stub Series Terminated Logic. STA Static Timing Analysis. 137 SVA SystemVerilog Assertions. 78, 142 SVT Silicon Vertex Tracker. 60, 61 TCAM Ternary Content Addressable Mem- ory. TCB Track Candidate Builder. 65, 114, 115, 116, 117 TCP Transmission Control Protocol. TDR Technical Design Report. 66 TF Track Fitter. 13, 14, 15, 60, 65, 66, 84, 85, 87, 88, 89, 90, 91, 94, 95, 96, 97, 100, 106, 110, 111, 112, 114, 115, 117, 118, 123 Glossary

Altera Corporation is an American tech- See also: VME nology company, one of the two major ATLAS is one of the seven particle detector FPGA manufacturers; it is now part of In- experiments (ALICE, ATLAS, CMS, TOTEM, tel®. 5 LHCb, LHCf and MoEDAL) constructed at See also: FPGAs the LHC. It is a general-purpose detector, AM Chip, an Associative Memory-like built in order to confirm and take improved ASIC, developed by the INFN institute, in- measurements on the Standard Model, but tegral part of the FTK project. 60, 61, 62, 64, also to discover possible clues for new phys- 65, 66, 67, 112, 113, 114, 115, 116 ical theories. 12, 14, 15, 56, 57, 59, 60, 61, 63, See also: Associative Memory, FTK 64, 118, 119, 122 ASICs are integrated circuits designed for See also: Large Hadron Collider a particular use, rather than intended for C++ is an ISO-standardized general- general-purpose use. 6, 9, 12, 60, 61, 79, 122, purpose programming language. Despite 128 being grouped with high-level program- ASM charts are used to describe FSMs, but ming languages and facilitating object- in a less formal way than state diagrams, so oriented programming paradigms it allows they are easier to understand. 23 (and demands familiarity with) low-level See also: FSM memory manipulation operations. 97, 99, Associative Memory refers to a type of 101 memory that is addressed by content rather See also: Object-Oriented Programming than by address. Given part of a pattern or Canny edge detector is a popular multi- key, it retrieves the values associated with stage edge detection algorithm that aims for that pattern. 12, 113, 122 low error rate, good localization and only ATCA is a standardization effort for one response per real edge. 10, 15, 17, 18, (initially) telecommunications equipment, 19, 36, 37, 38, 42, 46, 49, 121, 124 specifically shelves, backplanes, and their See also: edge detection complementing boards. It succeeded VME CERN, the European Organization for Nu- as an industry standard. 65, 97 clear Research, founded in 1954.. 56, 57, 63


CLBs are fundamental FPGA building order of the polynomial, and then defined blocks, containing LUTs, MUXs, registers, by the polynomial itself. 115 and routing resources. 5, 6, 8, 9, 29, 71, 73, Data Acquisition system, a system that 145 samples signals representing real-world CMOS is a family of manufacturing pro- data to convert them to some format that cesses used for mostly digital, but in some can be manipulated by a computer. Option- cases also analog and mixed-signal inte- ally but frequently, a DAQ system also per- grated circuit construction. 61 forms some form of signal conditioning (fil- CMS is one of the seven particle detector tering, amplification, etc.). experiments at the LHC. It is based on simi- DDR signifies data transfers that occur on lar construction principles as ATLAS, and both the rising and falling edges of a clock. shares the same goal, as well. 14, 15, 56, 57, 97, 113 63, 64, 66, 97, 112, 118, 119, 122, 123 Dennard scaling states that as the tran- See also: Large Hadron Collider sistors get smaller, their power density re- Collider Detector at Fermilab is an ex- mains constant. This enabled clock fre- pirimental collaboration and one of the two quency increases but since the mid-00s this detector experiments at the Tevatron Parti- law appears to have broken down, resulting cle Collider. In February 1995, it observed in the industry’s focus on multicore proces- for the first time the top quark, the very last sors as a means to improve performance. quark to be observed. 60 See also: Moore’s law, Koomey’s law See also: Tevatron Don’t Care bits is an AM chip feature that Computer Cluster is a set of connected allows the formation of variable-shape pat- that work together to perform terns, facilitating the creation of more com- a certain task, so they can be viewed as a pact and efficient pattern banks. 12, 61, 67, single system. 71, 73, 76, 100, 106, 113 Computer Farm see: Computer Cluster. See also: AM Chip Computer Vision deals with how comput- DPI is an interface between SystemVerilog ers can be made to gain high-level under- and C, allowing a testbench to call C func- standing from digital images or videos. 17 tions. 99, 101, 118 Convolution Matrix see: kernel. 22 See also: SystemVerilog CPLDs are programmable, but non-volatile, DRAM is a type of RAM that stores each bit logic devices. 4 in a separate capacitor. Due to leakage, the CRC is an error detection method based on capacitors discharge and thus, a periodical polynomial long division, widely used in “refresh” prodecure is necessary to retain digital networks. Many standardized vari- the data.. eties exist, characterized primarily by the EDA is a category of software tools for de- GLOSSARY xxxi signing electronic systems. See also: fixed-point Edge detection is the process of locating FMC is an ANSI/VITA standard to define brightness discontinuities in digital images. the connector and mechanical properties 17, 18, 19, 27, 30, 31, 32, 36, 38, 50, 121 for FPGA-based mezzanines. 97, 113, 114, Evaluation Board, a board specifically de- 115 signed to showcase the functionality of an See also: mezzanine FPGA or microcontroller. Almost always it FPGAs are semiconductor devices that features some kind of RAM and a USB port. are based around a matrix of configurable It is usual for one to also host a number of logic blocks (CLBs) connected via program- peripherals and connectors to make it easy mable interconnects. 
FPGAs can be repro- for developers to evaluate the potential of grammed to desired application or function- the main featured device. 6, 97, 100, 101, ality requirements after manufacturing. 5, 118, 123 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 27, 28, FIFO memory, a memory type commonly 29, 39, 42, 45, 53, 56, 60, 61, 65, 66, 67, 70, 73, used as a buffer. The data that enter first, 75, 81, 83, 88, 89, 90, 91, 97, 98, 99, 100, 101, exits first as well. 9, 41, 50, 76, 77, 85, 86, 88, 106, 111, 112, 113, 114, 117, 118, 121, 122, 128, 130, 133, 143 123, 125, 127, 128, 136, 137, 146, 151 Fixed-point representation expresses frac- FPLA is the field-programmable evolution tional values by using a fixed number of of PLA devices. 2 digits after the radix point. 25, 92, 106, 107, See also: PLA 111 FSM is a machine that can be in one of a Flash memory is a type of non-volatile, finite number of states at any given time. electronic digital storage that can be erased Its outputs depend either only on the state and reprogrammed. 65, 97 of the machine (Moore machine), or also on Flip-Flop, the fundamental storage ele- its inputs (Mealy machine). An FSM can ment in digital circuits, the base of sequen- be visually represented as a state diagram. tial logic. 5, 29, 35, 81, 87, 117, 127, 129, 131, FSMs are widely used in digital circuit de- 133, 136, 145, 157 sign, where they are implemented using reg- Floating-point representation uses two isters and combinatorial logic. 9, 22, 23, 41, fields, a fixed-point significand and theex- 141 ponent, to widen the range of real numbers FTK is an approved ATLAS TDAQ upgrade. that can be represented. As an example Its function is to reconstruct tracks in real- in base-2 floating-point, the number repre- time using precise inner detector coordi- sented by a sign bit s, a significand of p, and nates to improve online event selection qual- an exponent of q is (−1)s p 2q. 90, 107, ity. 12, 15, 60, 61, 65, 66, 89, 118, 122 × × 111, 140 See also: ATLAS xxxii GLOSSARY

GALs improve upon the PAL line of 118 programmable logic devices by offer- Karnaugh map is a logic function repre- ing increased capacity, optionally regis- sentation that uses a table. In this tabular tered outputs, better performance, and re- form patterns become visible that help with programmability. 3 minimization and hazard analysis tasks. 1 See also: PAL Kernel in image processing, a kernel is de- High Luminosity LHC is a project for fined as a small matrix that is applied toan a planned upgrade of the LHC in the image by means of convolution to produce mid-2020s, consolidating upgrade projects a certain effect. 19, 20, 21, 24, 30, 31, 32, 35 for its various subsystems. 12, 14, 57, 64, Kintex-7 is a mid-range, 28 nm Xilinx 112, 119, 122, 123 FPGA family. xxiii, 71, 73, 75, 81, 83, 84, See also: Large Hadron Collider 83, 87, 88, 92, 95, 113, 149, 151, 152, 153, 154, Huffman coding is an entropy encoding 155, 156 technique, used not only in lossless data Kintex Ultrascale is the evolution of the compression but also as a back-end in lossy Xilinx Kintex-7 FPGA family at 20 nm. 28, compression schemes. 11, 39, 40, 41 29, 35, 42, 65, 71, 73, 75, 81, 83, 84, 83, 87, 91, I2C is a serial bus, typically used to establish 95, 97, 113, 124, 149 low-speed connections between micropro- See also: Xilinx, Kintex-7 cessors and peripherals in the same board. Koomey’s law describes a trend in the # IC or simply “microchip”, is a collection of of computations/J, specifically that it’s dou- electronic circuits on a small piece (“chip”) bling every 1.5 years. of semiconducting material. 1, 2, 4 See also: Moore’s law, Dennard scaling ICARUS was a revolutionary IC layout Lab-on-Chip, or Lab-on-a-Chip, is a de- CAD software system introduced in 1978. vice that miniaturizes one or many labora- It could run on a Xerox personal computer tory functions on a single chip. 10, 15, 17, of its age. 18, 19, 32, 40, 42, 45, 46, 47, 48, 49, 51, 53, INFN, the coordinating institution for nu- 121 clear, particle and astroparticle physics in Large Electron-Positron Collider is a Italy, founded in 1951.. particle collider based at CERN, built in the Initiation interval is the minimum num- 27-kilometer tunnel that later hosted the ber of clock cycles between two successive LHC. iterations of a loop. 75, 116 See also: Large Hadron Collider IPbus an Ethernet-based software and Large Hadron Collider is the world’s firmware suite that implements a reliable most powerful particle collider. It lies in high-performance control link for particle a tunnel 27 kilometers in circumference, be- physics electronics. 97, 98, 99, 100, 101, 115, neath the France-Switzerland border near GLOSSARY xxxiii

Geneva, Switzerland. 11, 15, 56, 57, 59, 64, problems that might be deterministic in 101, 119 principle. Latency is here defined as the time it takes Moore’s law is the observation that the from the appearance of some input to the number of transistors in an IC doubles ev- production of useful results in the output of ery two years.. a circuit. 61, 62, 66, 71, 73, 75, 90, 118 See also: Koomey’s law, Dennard scaling Layers in this work refer to the layers of Non-Maximum Suppression is an inter- the silicon-based trackers inside ATLAS and mediate step in many edge detection algo- CMS. 60, 64, 67, 68, 84, 85, 86, 87, 88, 103, rithms, suppressing all image information 105, 106, 107, 113, 114, 115 that is not part of local maxima. 33, 37, 38, LIFO memory, a memory type commonly 42 used as a stack. The data that enter last, See also: Canny exits first. 143 Object-Oriented Programming is a pro- Low-Voltage Differential Signaling is gramming paradigm based on the use of an ANSI electrical standard for a serial, dif- “objects” which may contain both data (at- ferential communications protocol. 113, 115 tributes) and code (methods). The methods LUTs are basic structural units that com- of an object can access, and typically mod- prise FPGA devices. 5, 8, 11, 12, 29, 35, 67, ify, the attributes within an instance. 69, 73, 77, 81, 87, 97, 117, 147, 149 OSVVM is a library for VHDL providing Machine Vision is the technology and similar functionality to UVM. methods used to provide imaging-based au- See also: UVM tomatic inspection and analysis for such ap- PALs are programmable logic devices that plications as automatic inspection, process implement combinatorial circuits. They control, and robot guidance in industry. 9, consist of a sequential elements driven by a 10, 15, 17, 18, 19, 28, 42, 45, 49, 51, 53, 121, programmable AND array that is connected 122 to a fixed OR array (not to be confused with See also: computer vision PLAs, that feature a programmable OR ar- Mezzanine, or daughterboard, a PCB de- ray). 3, 4 signed to be mounted on a motherboard. See also: PLA, FPLA 61, 112, 113 Pattern, in the context of FTK, is a set of ModelSim® is an HDL simulation environ- low-resolution coordinates—one for each ment by . detector layer—that may contain valid, and Monte Carlo methods are a class of com- interesting, tracks. 60, 61, 67, 71, 100, 106, putational algorithms that rely on repeated 110, 111, 113, 114, 115, 116, 118, 123 random sampling to obtain results. The es- See also: FTK sential idea is using randomness to solve PCIe is a high-speed serial bus standard, xxxiv GLOSSARY mostly used as a computer expansion bus. Python is a high-level, interpreted pro- 97 gramming language. It’s very popular— Pipelining is a technique used in digital especially in academia—much due to its ease design to increase the throughput of a sys- of use and the abundance of ready-to-use tem. Processing elements are connected in libraries available for it. 14, 27, 34, 75, 97, series via buffer storage, where the output 102, 107, 111, 149 of one element is the input of the next one. QuestaSim® is an HDL simulation envi- In that way, all the processing elements can ronment by Mentor Graphics. It is part of be running in parallel. 8, 17, 45, 60 the Questa platform, which is targeted at PLAs are programmable logic devices that complex FPGA and ASIC designs.. implement combinatorial circuits. 
They consist of a programmable AND array that is connected to a programmable OR array. 2, 3, 5

Polymerase chain reaction is a technique used in molecular biology to amplify a single copy or a few copies of a piece of DNA across several orders of magnitude, generating thousands to millions of copies of a particular DNA sequence. 48

Principal Component Analysis is a statistical technique to identify patterns in data, providing a method for dimensionality reduction. 65, 89

Pseudorapidity (η) is, in experimental particle physics, a dimensionless quantity describing the angle of a particle relative to the beam axis. 64, 104

PSL is a temporal logic description language, eventually made into an IEEE standard, most commonly used in assertion and formal verification scenarios. 142

Pulsar IIb is an FPGA-based, extensible ATCA board designed by Fermilab for HEP applications. 65, 112, 113

Real-time computing describes hardware or software systems that guarantee response within specified time constraints. The response time of such systems is typically of the order of ms, and sometimes even µs. 10, 11, 17, 18, 19, 42, 46, 48, 60, 64, 66, 118, 121, 122

RLDRAM is a type of DRAM memory with an SRAM-like interface, offering lower latency and better random access performance than typical DRAMs. 113, 114. See also: DRAM

Road refers to a matched pattern. 60, 61, 65, 67, 69, 73, 76, 77, 84, 85, 86, 87, 97, 100, 101, 103, 106, 110, 111, 114, 160. See also: pattern

Road ID refers to the ID (#) of a matched pattern. 67, 69, 76, 97. See also: pattern, road

Run-length encoding is a very simple form of lossless data compression, particularly effective on sparse data. Runs of repeated data words are stored as a single word and a count. Notable applications include fax transmissions, JPEG compression, and FPGA bitstream compression. 11, 39, 40, 41

SPI is a very popular four-wire synchronous serial interface that primarily targets embedded systems. 65, 97

SPLD is an umbrella term that encompasses PLA, FPLA, PAL, and GAL devices. 4. See also: PLA, GAL

Stub is a full resolution detector hit that passes some preliminary filtering performed in the readout modules of the CMS detector. For the purposes of this text that can be considered equivalent to a hit; in ATLAS-related sections the word "hit" will be used, while in CMS-related sections the word "stub" will be used. 64, 65, 113, 114, 115, 116

Super-Strip refers to a low-resolution representation of a detector hit. 60, 65, 66, 67, 68, 69, 70, 71, 75, 76, 77, 113, 114. See also: stub

System-in-Package refers to a number of silicon dies enclosed in a single IC package and connected, horizontally or vertically. It is not to be confused with a SoC, which involves a single IC die. See also: System-on-Chip

System-on-Chip is an IC that encapsulates the components of a complete system (e.g., processor core/cores, memory, GPU, interfaces). 7, 97, 135, 138

SystemVerilog is a hardware description and hardware verification language. It is an extension of Verilog, encompassing object-oriented programming techniques to facilitate verification tasks. 16, 78, 80, 99, 123, 135, 137, 138, 139, 141, 142, 143, 151, 156, 160, 162. See also: Object-Oriented Programming

Tcl, pronounced "Tickle", is a high-level programming language commonly used to interface with FPGA and ASIC design EDA software.

Testbench is a construct that provides input (and possibly checks the output) to a circuit in a simulation environment. 14, 78, 79, 80, 99, 100, 101, 111, 118, 123

Tevatron was a synchrotron particle accelerator built at Fermilab (Batavia, IL). Although inactive since 2011, it holds the title of the second highest energy particle accelerator in the world, after the LHC. 60

Transverse Momentum (pT) is, in experimental particle physics, the component of a particle momentum perpendicular to the beam line. Its importance arises from the fact that the momentum along the beamline may just be left over from the beam particles, while the transverse momentum is always associated with whatever physics happened at the vertex. 64, 103

Trigger System is the detector system that performs event selection. 11, 14, 56, 57, 59, 64, 66, 112, 119, 122, 123

Track Trigger is defined as a trigger system that takes into account track parameters in order to make its decision. 95

Truth tables are used in logic to represent boolean functions: each input variable is assigned a column, with all possible input values listed in rows; a last column shows the result of the boolean function for each set of inputs. 1

UVM is a standardized methodology for verifying digital circuit designs. 14, 78, 79, 80, 99, 118, 123, 162

Verilog is a hardware description language used to model electronic circuits. 4, 16, 75, 135, 136, 137, 138, 139, 141, 149

VHDL is a hardware description language primarily used to describe digital circuits. It borrows heavily from the Ada programming language. 4, 136, 141

VLSI is the process of creating an integrated circuit by integrating hundreds of thousands of transistors into a single chip. 1

VME refers to a set of bus standards and associated crate technologies, dating back to the 80s but still in widespread use today. 61

Xilinx PlanAhead™ is an FPGA design suite provided by Xilinx to replace the design suite part of ISE for their older devices. See also: ISE

Xilinx ISE® design suite is an FPGA design toolchain and IDE provided by Xilinx. Even though its development stopped in 2013, the toolchain part of ISE is the only way to target 6-series, and older, devices.

Xilinx Vivado® design suite is an FPGA design toolchain provided by Xilinx to target their 7-series, and newer, devices. It has evolved from PlanAhead. 16. See also: PlanAhead

Xilinx, Inc. is an American technology company, known for inventing the FPGA. 5, 6, 16, 49, 81, 97, 113, 151. See also: FPGAs

Conventions

A number of conventions will be used throughout this document. These conventions are detailed below.

• When introducing physical quantities, the measurement units are typeset in square brackets: q/pT [GeV⁻¹]
• Hexadecimal numbers are typeset in typewriter font and prefixed by '0x', like this: 0x1c. Similarly, binary numbers are prefixed by '0b', like so: 0b1011.
• Bit positions are typeset as such: [7..0], or [1], also in typewriter font so they cannot be mistaken for citations.
• Sections of code are also typeset in typewriter font, usually using syntax highlighting, like this:

assign data_little_endian = {<<8{data_big_endian}};  // completed example: byte-order reversal with the streaming operator (right-hand name illustrative)

• In code syntax examples, optional elements are enclosed in square brackets, like this:

element_type [vector_dimension] array_name [array_dimension]...[more_array_dimensions];

• Caveats may be used to warn the reader about any potential pitfalls the techniques presented may entail. They are typeset as the example below:

Caveat: Here, any warnings, or any common pitfalls about the techniques presented, are explained.


Chapter 1

Introduction

“In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move.” — Douglas Adams

1.1 History of Programmable Logic

To better appreciate the impact programmable logic had on digital design, the components that were used before the first programmable devices were introduced should be briefly discussed.

In 1963, the Transistor-Transistor Logic (TTL) Integrated Circuits (ICs) were brought to the market. Each discrete chip would implement a different type of logic function: inverters, gates, multiplexers, flip-flops, latches, RAM and PROM memories. Depending on the available inventory of basic gates, the available area, and the cost, design logic would often have to be adapted.

Pen and paper constituted the typical design environment: the truth tables of the necessary digital functions would be laid out, and Karnaugh maps would be used for minimization and glitch analysis. The final step would be a component-level schematic design. For prototyping, the components would typically be placed onto a stripboard or a wire-wrap board.

In spite of the complex design process, these ICs were used for constructing processors for computers ranging from minicomputers to mainframes, and a wide range of related equipment, such as graphics terminals and printers, until the advent of Very Large Scale Integration (VLSI) devices. Even then, they remained in use as glue logic components; TTL-compatible ICs are used even today.

[Figure 1.1: PAL architecture functional block diagram]

One might note that the Programmable Read-Only Memory (PROM) memories that were already available are a form of programmable logic and, indeed, each of a Read-Only Memory (ROM)'s output pins can implement a combinatorial function using the address pins as inputs. This is a good solution for some uses but not for others, as: (i) ROMs were slower, more expensive, and consumed more power than gate-based circuits; and (ii) as the inputs switch, the outputs can glitch in ways that cannot be precisely controlled.

When IC technology evolved to allow more switching elements per chip, Programmable Logic Array (PLA) devices such as the TMS 2000 were introduced around 1970 [1, p. 62, 2, p. 4-6] to address these shortcomings. These were comprised of an AND gate array, with its outputs connected to an OR gate array; the outputs of the OR array could be optionally inverted. The interconnections between inputs and AND gate array elements, and AND gate array outputs and OR gate array inputs, were mask-programmable. This meant that an extra mask that defined the functionality was used during fabrication, leading to a part that was programmed to execute a specified function.

In 1975 the 82S100 Field-Programmable Logic Array (FPLA) expanded on the PLA functionality by adding a very important feature: it was field-programmable.1 Using metal fuse technology, it was One-Time Programmable (OTP) by the user, without the need for expensive masks to be produced for the manufacturing stage. This, among other obvious advantages,2 lowers the development cost, as it is much cheaper to just replace a device programmed with faulty logic with another one. The 82S100 chip offered 16 inputs and 8 outputs, with 48 intermediate product terms (AND outputs).

1 Quoting Mitch Richman [3]: "The idea that the customer could actually have some influence on the function of a chip, other than convincing the product planning guy at the manufacturer that he wanted it to do this, was a radical concept".

[Figure 1.2: PLD families. PLDs comprise SPLDs, CPLDs and FPGAs; SPLDs include PROMs, PLAs, PALs and GALs]

Then, the Programmable Array Logic (PAL) devices were introduced in 1978; these omitted the programmability of the OR array. They were faster, smaller, and cheaper than their PLA predecessors. The size improvement was reflected in the package: its width was 300 mil (Dual In-line Package, DIP) instead of 600 mil, with MMI introducing a 20-pin device and AMD following with the most commonly used 24-pin 22V10 devices.3 A major advancement that distinguished these devices was the inclusion of fully programmable "macrocells": these are placed after the sum of products and can be configured to either propagate the signal or its inverse, or a registered version of these. Their output is directly connected to their respective I/O (Input / Output) pin and fed back to the array at the same time, thus enabling the I/O pins of the device to have a programmable direction. The most definitive breakthrough PAL devices brought, though, was the introduction of languages such as PALASM, CUPL and ABEL. These enabled engineers to easily express their functions by directly describing the logic behavior in a text file, which was converted to the fuse map file used to program the device.

In 1985, Lattice Semiconductor introduced the Generic Array Logic (GAL) family of programmable devices. These were enhanced by adding selectively-enabled registers to the outputs, and by basing the reprogrammable logic on Electrically Erasable Programmable Read-Only Memory (EEPROM) technology, allowing them to be erased and reprogrammed using special programmers. At the time of writing, GAL devices are still being produced, providing a capacity equivalent to 500–800 gates and a pin-to-pin delay of 4 ns. They are mostly used in 5 V glue logic. Also, due to the large package options, which are easy to solder onto home-made PCBs (Printed Circuit Boards), these devices are still used by the hobbyist community.

2 Quoting Mitch Richman [3] again: "Oh it didn't work? Well I'm not going to tell my boss that I had an error here. I'll just go, and I'll change it and program another one and drop it in. He'll never know".
3 The technology was bought out in 1999. Common variations that followed include the 16V8 and 20V8: the first number signifies the number of input pins, and the second number determines the number of output pins.

[Figure 1.3: Simplified FPGA architecture]

The devices described so far are collectively referred to as Simple Programmable Logic Devices (SPLDs) (see figure 1.2 on the preceding page). The next step in increasing the amount of logic in a single device was brought on by the introduction of the Complex Programmable Logic Device (CPLD) family of devices. These feature an array of logic blocks, each of which carries logic capabilities equivalent to a full PAL device, connected to each other and to the I/O pins through a programmable interconnect fabric. The I/O pins also have logic built in to provide more control and features, such as tri-state logic on all pins. The configuration is still stored in a non-volatile way, but the increase in logic capacity, features, and packaging called for more advanced programming methods. In-System Programming (ISP) was introduced by means of the 4-wire JTAG (named after the Joint Test Action Group) programming interface, making the devices able to be programmed on the board in which they are to be used.

With CPLDs the typical number of equivalent gates grew to more than 10 k, with ∼512 registers: designs that used a few tens of PAL ICs could be packed in a single device. Around the same period the VHDL (Very high speed integrated circuit Hardware Description Language) [4] and Verilog [5] Hardware Description Languages (HDLs) emerged, smoothing the way for more complex logic to be described with CPLDs. This combination allowed the compact implementation of a wide range of circuits, such as network controllers, graphics controllers, power management and sequencing, and bootloaders; CPLDs are still used in a number of low-power, instant-on glue logic applications.

However, for designs that require many registers or bus interfaces, this architecture does not scale well.

1.2 The Field-Programmable Gate Array

A Field-Programmable Gate Array (FPGA) is a heterogeneous device based on a large, two-dimensional array of logic blocks; a complex interconnect and clocking network; and (typically) hardened higher-level blocks, such as Static Random Access Memory (SRAM) memory blocks, math blocks, external memory controllers, and multi-gigabit serial transceivers.

A major distinction between FPGAs and older programmable logic technologies—such as the PLA—is the capability to program the function of the replicated logic blocks themselves, instead of programming the interconnect networks around predetermined logic functions. This architectural change complemented the technological advancements that ultimately enabled the leap in logic capacity brought about with FPGA devices: in current FPGAs the capacity ranges from a few thousand logic elements and registers up to several million.

Figure 1.3 on the preceding page offers a very simplified view of the FPGA architecture, omitting the heterogeneous features: it is an array of Configurable Logic Blocks (CLBs) connected through an interconnect network that is based on regularly placed routing nodes. Each CLB4 typically contains a number of Look-Up Tables (LUTs), carry logic, a number of multiplexers (MUXes), and an array of Flip-Flops (FFs). Many CLBs also feature memory elements to facilitate efficient implementations of shift registers and small Random Access Memories (RAMs). An example of a Xilinx CLB can be seen on figure 1.4 on page 7 [6, p. 19]: a slice element is shown, with two slice elements making up a CLB.

In reality FPGA architectures are much more complicated. First of all, the core of the architecture, the LUT, is rarely just a LUT: more frequently than not, it implements a number of different LUT configurations. As an example, in Xilinx 7-series devices, the 6-input LUTs can be configured as two different 5-input LUTs that share inputs, or as a 3-input and a 2-input LUT; a subset of LUTs can also be configured to be used as small, synchronous RAMs or as a Shift Register LUT (SRL). The LUTs of Altera FPGAs, named Adaptive Logic Modules (ALMs), have no direct equivalent, but they can implement a wider variety of configurations: they can implement any 6-input function, select 7-input functions, two independent 4-input functions, two identical 6-input functions with 4 shared inputs, and more.

4 Here the term refers to the basic element of an FPGA architecture in general, not specifically Xilinx devices in which it is named as such.

In most FPGA devices, columns of RAM and Digital Signal Processor (DSP) slices are provided and typically laid out in regular intervals, so that the amount of logic that can access these elements with a small routing delay is maximized. Encapsulating such functionality in hardened, dedicated, blocks translates to major improvements in speed and power, and DSP performance of several TMAC/s is not uncommon. Furthermore, typically tens of dedicated global and regional buffers are offered to provide minimal skew distribution for a small number of critical signals, such as clocks and resets. Moreover, neighboring CLBs feature separate, dedicated connections to propagate carry logic and more efficiently implement wider adders and multipliers. In Xilinx devices, DSP slices provide multiple signals in their datapaths that can be cascaded across neighboring elements, without passing through the fabric interconnect. Block RAMs (BRAMs) can also be cascaded in pairs through dedicated interconnect to form deeper memories.

1.2.1 Why FPGAs

FPGAs are used in a very wide range of applications from pocket-size devices to data centers [7]; the following sectors are listed as examples indicative of that breadth:

• Automotive
• Data centers
• Research
• Communications
• Aerospace
• Financial
• Military
• Medical

The main reason for this wide acceptance by both industry and academia is their positioning in the CPU (Central Processing Unit) / Application-Specific Integrated Circuit (ASIC) price / performance spectrum [8, 9, 10]. Compared to ASICs, FPGAs exhibit lower performance and higher power [11], but on the other hand they offer: (i) much faster time-to-market, (ii) significantly smaller Non-Recurring Engineering (NRE) costs, (iii) flexible, in-circuit debug, and (iv) reprogrammability,5 which also means that they are field-upgradable. The very high initial investment inherent with ASICs can be offset by the lower unit price, but only for very high-volume applications. Let us consider the examples of a laboratory within the academic sector, or of a small business, that have a number of low-volume specific applications with high-performance computing needs; it is often difficult to come up with the funds or the time for manufacturing a separate ASIC per application, even by participating in multi-project wafer runs. With FPGAs, it is often possible for the group to select an evaluation board that satisfies the needs of the application(s) and start their development on that, even before designing and manufacturing a PCB. In other sectors, reprogrammability offers an equally significant advantage: applications like High-Frequency Trading (HFT) demand sub-microsecond latency but need the ability to frequently make changes to the algorithms, something that could not be achieved with ASICs.

5 Although OTP FPGAs are available, the dominant types are SRAM-based, which can be reprogrammed.

[Figure 1.4: Xilinx 7-series FPGA slice]

Compared to CPUs and DSPs, they are more difficult to program due to the non-sequential manner in which logic is described6 and are more expensive, but they feature: (i) higher performance—less latency [12], (ii) higher, deterministic I/O bandwidth, and (iii) significantly higher flexibility, by supporting a wide array of I/O standards. Recently, the popularity of System-on-Chip (SoC) devices that feature high-performance, and even multi-core, hard processors is on the rise—FPGAs are becoming ever more versatile as the range of applications they are used for widens [13].

1.2.2 Design Considerations for FPGA Synthesis

The evolution of FPGA devices has enhanced them with features that improve performance or power consumption in certain tasks. In order to exploit these features, the details of the specific FPGA device that is targeted must be taken into account. However, there are some general rules that emerge from high-level FPGA structure characteristics.

FPGAs are designed with a 2:1 register / LUT ratio. Since the propagation delay is predominantly determined by the interconnect delay, ample distribution of registers mitigates the effect of the interconnect architecture on the operating frequency. That fact also means that pipelining techniques must be employed to actually realize a high operating frequency.
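As a minimal, hedged illustration of this point (not code from this design; the module name, signal names and widths are arbitrary), the following SystemVerilog fragment splits a multiply-add over two register stages, so that neither stage has to traverse the full combinational path in one clock cycle:

// Hypothetical example: a two-stage pipelined multiply-add.
// Registering the product before the addition shortens the critical path.
module mac_pipe #(parameter int W = 16) (
  input  logic           clk,
  input  logic [W-1:0]   a, b, c,
  output logic [2*W:0]   y
);
  logic [2*W-1:0] prod_q;          // stage 1: registered product
  logic [W-1:0]   c_q;             // addend delayed to stay aligned with the product
  always_ff @(posedge clk) begin
    prod_q <= a * b;               // stage 1
    c_q    <= c;
    y      <= prod_q + c_q;        // stage 2
  end
endmodule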

However, unrestrained use of deep pipelining would have implications not only on the architecture of the circuit to be implemented, but on two other important factors: power, for obvious reasons; and congestion. Since the interconnect resources on FPGAs are limited, areas that are densely packed with logic and registers can start to utilize sub-optimal paths for routing, which can lead to worse results. There are techniques applied by the synthesis tools that can help increase the operating frequency without adding more pipeline steps: retiming, for example, moves logic between successive register stages such that the path timing of these stages is better balanced. Overall, careful pipelining has to be applied at select points of the design. These select points can be chosen based on particular situations, such as using synchronous blocks other than registers that have a higher clock-to-out time (e.g. unregistered BRAMs or DSPs), or placement that causes long routes, but usually the estimated location of the critical path is found using the logic levels between register stages. Some simple rules of thumb, based on the FPGA CLB architecture, can assist with the estimation of the number of logic levels, which is correlated with the estimation of the critical path. To give a few simple examples for the architecture seen at figure 1.4 on page 7, an 8-bit addition of up to three operands would use 8 LUTs across two slices, and the carry chain (each LUT can compute one bit out of up to three 2-bit operands, and the carry chains would complete the operation); a 4-to-1 MUX would just use a single LUT; and a 16-to-1 MUX can be realized using 4 LUTs and the MUX resources located at the outputs of the LUTs. As a simple rule, by counting the number of inputs of a 1-bit logic function, one can get an estimate for the number of LUTs and, consequently, for the number of logic levels and whether a signal could pose a critical path with respect to the rest of the design.

6 Although the High-Level Synthesis (HLS) tools are advancing fast, so this may be subject to change in the near future.
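To make the multiplexer examples above concrete, the following behavioral description is only a sketch (not thesis code); counting the inputs of each 1-bit function gives the quick LUT estimate mentioned in the text: 6 inputs for the 4-to-1 MUX, hence one 6-input LUT, and 20 inputs for the 16-to-1 MUX, hence four LUTs plus the dedicated MUX resources.

// Hypothetical sketch: behavioral multiplexers; the input count of each
// 1-bit function gives a quick LUT / logic-level estimate.
module mux_estimates (
  input  logic [3:0]  d4,
  input  logic [1:0]  sel4,
  output logic        y4,    // 4 data + 2 select = 6 inputs: fits one 6-LUT
  input  logic [15:0] d16,
  input  logic [3:0]  sel16,
  output logic        y16    // 16 data + 4 select = 20 inputs: 4 LUTs + output MUXes
);
  assign y4  = d4[sel4];
  assign y16 = d16[sel16];
endmodule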

Another design aspect that has an impact on congestion can be the reset strategy that is followed. In contrast with ASICs, which need every register to be reset before normal operation begins, FPGAs benefit from the initialization procedure: when the design is loaded onto the device, every register can be given an initial value that can be specified by the designer. The implication is that only a small subset of the used registers actually needs to be reset; this mostly concerns registers involved in Finite State Machines (FSMs) and control signals. Liberally attaching a reset to every register of the design allocates valuable routing resources, making timing closure more difficult to achieve and increasing the overall power consumption.
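A minimal sketch of this reset strategy (illustrative names only, not code from this thesis): only the FSM state register receives a synchronous reset, while the datapath registers rely on the initial values loaded at configuration time.

// Hypothetical example: only control state is reset; datapath registers
// take their power-up value from the bitstream initialization.
module reset_strategy_example (
  input  logic       clk,
  input  logic       rst,
  input  logic [7:0] din,
  output logic [7:0] dout
);
  typedef enum logic [0:0] {IDLE, RUN} state_t;
  state_t     state  = IDLE;   // initial value comes from configuration
  logic [7:0] pipe_q = '0;     // datapath register: no reset routing needed

  always_ff @(posedge clk) begin
    pipe_q <= din;             // datapath is never reset
    dout   <= pipe_q;
    if (rst) state <= IDLE;    // reset only the control FSM
    else     state <= RUN;
  end
endmodule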

For large memories, whether they are used as a RAM, a ROM, or as part of a First-In First-Out memory (FIFO), using the dedicated memory blocks is essential to obtaining an efficient design, both in terms of performance and of power. Implementing a large memory7 with registers would inevitably cause a congestion point, drop the operating frequency, and allocate a wide area of the device—especially considering the extra-wide MUXes that would be required to construct the memory. The same applies to complex arithmetic on FPGAs, such as multiplication or operations on very wide signals. The DSP slices that have been mentioned before are usually designed full-custom, so their performance and power characteristics are much better than those of even the most optimized CLB-based implementation of the same functionality. To infer these optimized elements, special restrictions can apply, so the target FPGA architecture must be carefully examined before the design phase.
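For illustration (a generic sketch, not code from this design, with arbitrary names and sizes), a synchronously read memory written in the style below is what most synthesis tools map onto dedicated block RAM instead of registers:

// Hypothetical template: a simple dual-port RAM that typically infers BRAM.
module sdp_ram #(parameter int DW = 32, parameter int AW = 10) (
  input  logic          clk,
  input  logic          we,
  input  logic [AW-1:0] waddr,
  input  logic [AW-1:0] raddr,
  input  logic [DW-1:0] wdata,
  output logic [DW-1:0] rdata
);
  logic [DW-1:0] mem [0:2**AW-1];
  always_ff @(posedge clk) begin
    if (we) mem[waddr] <= wdata;
    rdata <= mem[raddr];        // the synchronous read is what makes BRAM inference possible
  end
endmodule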

7 In this context that would mean more than a few kbit.

1.3 Thesis Contribution

The subject of this thesis is to study and introduce techniques involved in deriving high-performance implementations of algorithms on FPGA devices by making appropriate use of the resources made available in modern FPGA families. In order to achieve this goal, applications from the fields of machine vision and High-Energy Physics (HEP) were chosen; each application demands cutting-edge levels of performance, and it will be shown that the implementations obtained match that.

1.3.1 Image Processing

This application concerns a Lab-on-Chip (LoC) project [14, 15] which required a high-performance machine vision subsystem to process the live video stream from a camera, mounted on the demonstrator assembly, so that the volumes of the reactants flowing within a microfluidics chip are monitored in real-time. The volume information produced by the machine vision subsystem is eventually used to adjust actuators that control the reactions that take place on the microfluidics chip. To adjust for any movement or micro-vibrations of the chip that might interfere with the flow detection during operation, real-time detection of the chip frame is needed; this is achieved by applying a Hough transformation to the edges detected in the video stream.

The primary contribution of this thesis in the context of this application is a novel implementation of the Canny edge detection algorithm. The two techniques with the most influence on the shaping of the implementation architecture are pipelining and increased parallelism. Regarding pipelining, this is employed at two levels: at the block level, with each image processing block streaming the data to the next one without the need to finish processing a line or an image; and at the pixel level, with the computations spread across many clock cycles to maintain a high clock speed, both by keeping the critical path short and by allowing data to spread farther in the device through the interconnect fabric to facilitate greater parallelism. The latter is also applied at multiple levels: all the necessary calculations to generate a pixel are performed concurrently; furthermore, the images are processed four pixels at a time, a number that emerges naturally from the specifications of the memory interface that feeds the machine vision subsystem.8 As is discussed later, an added advantage of this architecture is that while the memory requirements remain constant with respect to a design that does not apply any pixel-level parallelism, the memory read accesses are reduced.

8 The memory interface is 32-bit, which translates to exactly four 8-bit grayscale pixels per clock cycle.

Apart from the techniques used for the design of the architecture, certain ad-hoc, application-specific approximations were introduced to increase performance while reducing resource utilization. As an example, in the divisions by constants, used in the normalization of the Gaussian filter, the approximation has no noticeable effect on the results but helps achieve a LUT reduction of more than 80 %, leading to a faster and much more compact circuit. Another example can be found in the Sobel step, in the estimation of the norm of a two-dimensional vector. Here, instead of the full formula for the norm calculation, the Manhattan distance is used; a thorough analysis proves that the impact on the results for this type of application, which focuses on lateral edge detection, is minimal.
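In other words, |G| = sqrt(Gx² + Gy²) is replaced by |Gx| + |Gy|. A minimal sketch of what this buys in hardware (illustrative widths and names, not the thesis code): no multipliers or square root, just two absolute values and one adder.

// Hypothetical sketch: Manhattan-distance approximation of the gradient norm.
module manhattan_norm (
  input  logic signed [10:0] gx, gy,   // Sobel gradients (illustrative width)
  output logic        [11:0] mag       // |Gx| + |Gy| instead of sqrt(Gx^2 + Gy^2)
);
  logic [10:0] ax, ay;
  assign ax  = gx < 0 ? -gx : gx;
  assign ay  = gy < 0 ? -gy : gy;
  assign mag = ax + ay;
endmodule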

Moreover, an image compression implementation based on Run-length encoding (RLE) and Huffman coding has been developed in order to perform 2-pass hysteresis thresholding while storing the intermediate picture on the FPGA device. That architecture offers an image compression ratio of at least 2:1 on edge-detection images but, even though its 240 Mpixels/s performance outperforms comparable implementations at the time of design, it is lower than that of the rest of this edge detection implementation. Since it was determined that removing the second pass did not have a significant impact on the quality, performance was favored and this component was not used in the final design.

The techniques and approximations outlined above have resulted in a novel implementation that, to the best of the author's knowledge, outperforms other existing solutions by achieving a throughput of more than 1 Gpixels/s. Good use of the FPGA resources is also reflected in the resource usage figures, which show that the design, while still delivering real-time performance, is able to be used even in the smallest FPGAs on the market. Furthermore, the performance ended up exceeding the system requirements, thus allowing even high-resolution images to be used in the real-time system, with favorable impact on the precision and capabilities of the full system.

1.3.2 Track Trigger applications

The field of experimental High-Energy Physics is traditionally rich with high-performance data processing applications. The Large Hadron Collider (LHC) experiments use detectors to analyze the particles produced by collisions inside of them and typically produce data volumes that cannot be transferred outside of the detector space in their entirety, much less stored offline. Thus, complex systems are designed that make real-time decisions based on limited information to filter what small fraction of the collision data will be stored offline for detailed analysis; these are called trigger systems. The event selection9 takes place in multiple steps, with Level-1 (L1) providing the lower-latency, crudest filtering, operating at the bunch crossing rate. The subsequent trigger steps perform rate reduction at lower rates and are thus allowed a greater decision-making margin, so they are allowed to use more detailed data, such as information from the silicon trackers, to offer higher-quality event selection. One piece of information that can help improve trigger selection quality is reconstructed track information: from the traces the particles leave on the silicon layers of the detector, their trajectories are reconstructed, providing information about the very particles produced in an event. However, this is an extremely computationally demanding task, and the future scheduled upgrades make it even more challenging.

The Fast TracKer (FTK) is an ATLAS (A Toroidal LHC ApparatuS) upgrade that performs track reconstruction in hardware, to provide track information to the High-Level Trigger (HLT) trigger stage. It combines a few thousand ASICs and FPGAs: the Associative Memory (AM) ASICs perform a pattern matching step using a low-resolution representation of the coordinates to reduce the combinatorics of the problem, and the FPGAs facilitate the pattern matching and perform additional processing, such as track fitting using the full resolution coordinates. The operating principle could be applied in more demanding applications, such as the lower-latency and even more demanding environment of the L1 trigger after the High Luminosity Large Hadron Collider (HL-LHC) upgrade, but only if the performance of the FPGA algorithms is massively increased; a full redesign of performance-critical components is warranted.

The Data Organizer (DO) component was the first one to be redesigned. Its function is to: (i) store full-resolution hits (coordinates) based on their low-resolution representation, the Super-Strip ID (SSID); and (ii) retrieve the list of hits contained in a SSID. A novel implementation of an instantly-erasable array of linked lists with support for features of the AM ASICs, such as variable-size patterns, is the base of the DO architecture. For each SSID that contains one hit, a linked list structure is instantiated; any subsequent hits that belong to the same SSID are added to the list. The linked lists approach has two major implications that set it apart from the current implementation and allow the component to be used in other applications: (i) data do not have to arrive ordered by SSID; and (ii) the hit storage of all linked lists is dynamically created in a shared memory and as a result, the only size limitation is on the overall number of hits that can be stored. There are certain aspects of the application that make the linked list implementation unique. One is the requirement of negligible downtime between finishing the readout

of an event from the DO memories and starting to write the next one. As the array of linked list start pointers is stored in a BRAM-based memory, and those cannot be reset at once, in order to prevent data corruption there has to be a register file that signifies the validity status of each pointer. To reduce the size of this otherwise very large register file, the very wide logic capabilities and the opportunity to create wide BRAM-based memories were exploited; thirty-two pointers are kept in each memory location. Furthermore, the register file is itself pipelined, and the 2 k:1 multiplexer is optimally designed by directly using the device's low-level LUT and MUX resources. Another feature of the linked list structure, related to supporting the Don't Care bits (DC bits) AM function that essentially allows patterns of variable size by requesting data from up to eight consecutive SSIDs, is the ability to read out the heads of up to eight linked lists in a single clock cycle. To effectively support this, a block that performs a fast, fully parallel data sort operation was optimally designed; the core of this block is based on a piece of software written to automatically produce HDL code that directly instantiates ad-hoc configured LUTs. The high operating frequency target of 400 MHz further compounds the complexity of this novel architecture: deep pipelining has to be used throughout the logic that populates the linked lists, which required the introduction of conflict resolution logic. Furthermore, the looped operation that traverses the linked lists and reads out the data is pipelined with an initiation interval of two; this means that to go from one element to the next, two clock cycles are needed. To get around this, time multiplexing between two interleaved reading loops has been applied to each port, such that it forms two individual read channels. By concurrently using all ports in all memory structures, up to four such "virtual" channels can be enabled, leading to a sustained read bandwidth of two times the operating frequency, exceeding 800 Mhit/layer/s. Factoring in the improvements that arise from the elimination of the input ordering restriction, this novel architecture brings a cumulative performance improvement of at least an order of magnitude with respect to the previous implementation.

9 Each collision of particle bunches is called "bunch crossing", or "event".
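To illustrate only the general linked-list idea (a deliberately simplified sketch, not the DO architecture: it omits the pipelined validity register file, the fast erase between events, the DC-bit multi-list readout and the conflict resolution discussed above; for brevity new hits are prepended rather than appended, and all names and sizes are arbitrary), inserting a hit into the list selected by its SSID could look like this:

// Hypothetical, much-simplified sketch of linked-list insertion into a
// shared hit store indexed by SSID; not the DO implementation itself.
module ll_insert_sketch #(parameter int SSID_W = 11, HIT_W = 16, PTR_W = 10) (
  input  logic              clk,
  input  logic              wr_en,
  input  logic [SSID_W-1:0] ssid,
  input  logic [HIT_W-1:0]  hit
);
  // one head pointer (plus a valid flag) per SSID
  logic [PTR_W-1:0] head       [0:2**SSID_W-1];
  logic             head_valid [0:2**SSID_W-1];
  // shared storage: hit payload and next pointer, allocated sequentially
  logic [HIT_W-1:0] hit_mem    [0:2**PTR_W-1];
  logic [PTR_W-1:0] next_ptr   [0:2**PTR_W-1];
  logic             next_valid [0:2**PTR_W-1];
  logic [PTR_W-1:0] alloc_ptr = '0;

  always_ff @(posedge clk) if (wr_en) begin
    hit_mem[alloc_ptr] <= hit;
    if (head_valid[ssid]) begin
      next_ptr[alloc_ptr]   <= head[ssid];   // new node points to the previous head
      next_valid[alloc_ptr] <= 1'b1;
    end else begin
      next_valid[alloc_ptr] <= 1'b0;         // first hit of this SSID
    end
    head[ssid]       <= alloc_ptr;           // hits need not arrive ordered by SSID
    head_valid[ssid] <= 1'b1;
    alloc_ptr        <= alloc_ptr + 1'b1;
  end
endmodule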

The second component, the Combiner, is the interface between the DO and the Track Fitter (TF): it collects the sets of hits read out by the DO and generates all possible track-forming combinations (i.e. sets of hits with one hit per layer, at most). Its input ports support the same high operating frequency and data rates of the DO. The combination generation and the output function at a different, lower-frequency clock. Here the main goal is to have a non-stalling architecture, capable of processing successive roads (sets of input hits) in successive clock cycles. Although this component outperforms the previous comparable implementation, and is more flexible, it is not as technically challenging as the other designs; it is, however, discussed to cover all elements of a pattern matching–based track reconstruction processing chain.

The TF component performs fast linear track fitting using pre-computed matrices and the full-resolution coordinates read out by the DO. The massively parallel potential of FPGAs is used here to obtain an implementation that is capable of performing one full fit per clock cycle; performing a full fit means calculating N scalar products of N-dimensional vectors10 using fixed-point arithmetic. The one-fit-per-clock-cycle architecture involves a combination of systolic arrays of registers and efficient usage of the hardened DSP slices high-performance modern devices offer, and of their dedicated interconnects. The computational core was then further tuned to arrive at the final architecture, in order to reduce resource utilization and power by ≈50 %. By including physical considerations (placement and routing effects) early in the design phase, the implementation reaches an operating frequency close to the advertised limits of the device, surpassing 600 MHz. In a typical mid-range device more than four instances of the TF can be placed: given sufficient input bandwidth, fit performance of up to 2.4 GFits/s is possible; such performance can be especially desirable in very high occupancy environments, as it can help maintain the trigger efficiency by tolerating a smaller data reduction ratio in the previous track reconstruction stages.
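As a rough illustration of the kind of computation involved (a sketch under arbitrary widths and names, not the TF architecture, which additionally exploits the DSP cascade paths directly), one scalar product of an N-dimensional coordinate vector with a vector of pre-computed constants, pipelined so that a new input vector can be accepted every clock cycle:

// Hypothetical sketch: pipelined fixed-point scalar product, written so that
// each multiply-add stage can map onto one DSP slice.
module dot_product_sketch #(parameter int N = 11, W = 18) (
  input  logic                  clk,
  input  logic signed [W-1:0]   a [N],   // one new coordinate vector per clock
  input  logic signed [W-1:0]   c [N],   // pre-computed fit constants
  output logic signed [2*W+4:0] y        // full sum, N+2 cycles later
);
  // systolic delay line so element i meets the partial sum at pipeline stage i
  logic signed [W-1:0]   a_d [N][N];
  logic signed [2*W+4:0] acc [N];

  always_ff @(posedge clk) begin
    for (int i = 0; i < N; i++) begin
      a_d[i][0] <= a[i];
      for (int k = 1; k < N; k++)
        a_d[i][k] <= a_d[i][k-1];
    end
    acc[0] <= a_d[0][0] * c[0];
    for (int i = 1; i < N; i++)
      acc[i] <= acc[i-1] + a_d[i][i] * c[i];   // one multiply-add per stage
    y <= acc[N-1];
  end
endmodule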

To verify these components, advanced methodologies were used: the Universal Verification Methodology (UVM) was employed, and a novel method was used to utilize the same high-level testbench to verify correct operation in both the Register-Transfer Level (RTL) simulation and the actual implemented design while it is running on the FPGA device, on an ad-hoc demonstrator design implemented on an evaluation board. Software that models a simple, toy cylindrical detector has also been written in Python to help with the generation and fine-tuning of the constants used in the TF step, and to evaluate the quality of the results of the simple track reconstruction chain that has been implemented.

The DO and TF implementations described in this thesis have been used in a research project, which is also described here, that proposes a full track reconstruction chain for the L1 trigger for the HL-LHC CMS (Compact Muon Solenoid) detector upgrade in 2025. Finally, the DO is also being considered for inclusion in the Hardware Tracking Trigger (HTT) ATLAS upgrade for HL-LHC. This fact is suggestive of the contribution of this thesis in high-performance implementations of algorithms that are used in trigger systems in HEP.

10 The value of N depends on the application; in FTK, for example, it is eleven.

1.4 Thesis Organization

Section 1.1 has provided a historical perspective on the evolution of programmable logic. Then, in section 1.2 on page 5 the FPGA high-level architecture has been outlined to introduce the reader to the concepts necessary to better evaluate the contents of this thesis; the uses and advantages of FPGAs are also briefly discussed, followed by lower-level details of FPGA architectures. Once these concepts are introduced, the contribution of the work presented in the rest of the thesis is outlined.

Chapter 2 presents a Canny edge detection implementation designed for the machine vision subsystem of a LoC microfluidics project. The chapter opens with a brief introduction to LoC systems and edge detection algorithms, and with related work in the fields of machine vision for microfluidics and Canny FPGA implementations. Section 2.2 on page 19 discusses the implementation of the Gaussian filtering block; it also outlines the handling of image borders and the application of parallelism that also underlie other parts of the implementation. Since the Gaussian filtering block can be used autonomously, the section concludes with performance and resource utilization figures for its implementation. Section 2.3 on page 30 describes the implementation of the other block that can be used autonomously, the Sobel edge detector. Separate performance and resource utilization figures are also presented here. The other blocks that comprise the Canny edge detector and the top-level implementation thereof are detailed in section 2.4 on page 36, along with example images at the various intermediate steps of the algorithm to better illustrate their purpose. The integration of the Canny edge detector in the machine vision subsystem, the other building blocks of the subsystem, and its packaging in an IP core are described in section 2.5 on page 45.

Chapter 3 on page 55 describes in detail the algorithms developed for applications in the field of HEP. A high-level overview of the LHC machine, the ATLAS and CMS experiments, and the FTK project is provided to set the context. Related work, albeit limited, is presented in section 3.2 on page 66. The implementation of the DO algorithm is thoroughly discussed in section 3.3: low-level details on the implementation can be found on pages 68–78, followed by the verification environment that was designed, and concluding with performance and resource utilization figures. The next component, the Combiner, is briefly presented in section 3.4 on page 84. The last component, the TF, is detailed in section 3.5 on page 88; the details of the implementation, the area and power optimizations, and the final performance and resources occupied are presented. A brief overview of the capabilities of the DSP slices featured in modern FPGA devices has also been interjected in this section, as the implementation makes use of low-level features of these blocks. Next, section 3.6 on page 96 provides an evaluation environment for the algorithms presented, based on an evaluation board–based demonstrator and a toy detector model. The objective of this effort has been to provide an environment that helps with: (i) verifying the algorithm implementations on-board, (ii) gathering statistics on the actual event processing speed of the implementations, and (iii) evaluating the quality of the track reconstruction results. The chapter is concluded with section 3.7 on page 112, which describes the FPGA design of a demonstrator board constructed for an R&D project. This is a very suitable way to end the chapter, as the DO and TF algorithms presented have found use in this real-world application.

Chapter 4 on page 121 concludes the main body of the thesis. It comprises a review of the work presented, along with a reiteration of the main achievements, and a discussion on future work that may emerge from this thesis.

A number of appendix chapters were added for completeness. Appendix A on page 127 illustrates some Clock Domain Crossing (CDC) circuits to handle metastability in FPGA designs and some common pitfalls in their use. Appendix B on page 135 outlines some recent advancements in the Verilog HDL, in the form of SystemVerilog, and appendix C on page 145 some important synthesis and implementation attributes for designing with Xilinx Vivado®. Finally, appendix D on page 149 assembles a selection of code listings from the designs described in the thesis.

Chapter 2

High-Performance Implementations of Image Processing Algorithms

“Sooner or later all things are numbers, yes?” — Terry Pratchett

2.1 Introduction

Modern image processing applications demonstrate an increasing demand for computational power and memory space. This stems from the fact that image and video resolutions have multiplied in the past few years, especially after the introduction of high-definition video and high-resolution digital cameras. Therefore there is a need for image processing implementations that can perform demanding computations on substantial amounts of data with high throughput, often while meeting real-time requirements.

Edge detection is the first step in many computer vision algorithms [16]. It is used to identify sharp discontinuities in an image, such as changes in luminosity or intensity due to changes in scene structure. Edge detection has been researched extensively, and a lot of edge detector algorithms have been proposed, such as the Roberts detector, the Prewitt detector, the Kirsch detector, the Gauss-Laplace detector and the Canny detector. Among all the above algorithms, the Canny algorithm [17] is the most widely used due to its good performance and its ability to optimally extract edges even in images that are contaminated by noise. The Canny algorithm has the ability to achieve a low error-rate by eliminating almost all non-edges and improving the localization of all identified edges.


Ours is a novel implementation of a Canny edge detector [18, 19] specifically tailored for the machine vision subsystem of a LoC application. The main distinguishing feature of the architecture is that it takes advantage of four-pixel parallel computation. It is a pipelined architecture that uses on-chip BRAM memories to cache data between the different stages. The exploitation of both hardware parallelism and pipelining produces a very efficient design, and maintains the same memory requirements as a design without any pixel computation parallelism. This results in increased throughputs for high-resolution images and a computation time of 1.5 ms for a 1.2 Mpixel image on a Spartan-6 FPGA.

Initially, high-performance implementations for two key components of the Canny implementation, Gaussian smoothing and Sobel edge detection, will be presented. These are popular image processing filters, and their implementations can also be used outside of the Canny algorithm so they will be discussed in more detail. Moreover, similar techniques have been applied in the development of the rest of the blocks of the Canny algorithm implementation, so this is expected to benefit the reader in understanding them, as well.

The Canny implementation presented was used as part of a machine vision system in a LoC project, which will also be briefly described. LoCs for molecular diagnostics applications offer the capability of executing complicated biological experiments in a miniaturized environment at the Point-of-Care (PoC). The advantages of LoCs are multiple, such as portability, reduced cost and ease of use due to the automated experimental process, which has led to an increased interest in research targeting the improvement of such systems. LoC systems are controlled by off-chip control units, which rely on the information sent by the on-chip sensors. In recent years research groups have experimented with using machine vision for acquiring experimental data, reducing the number of necessary on-chip sensors, thus making the chips less complicated and cheaper to manufacture. Exploiting machine vision for LoC implementations requires real-time response and precision. FPGAs have been exploited for both LoC control and machine vision implementations due to their high speed, ability to host systems on chip, low cost and accelerated time-to-market.

2.1.1 Related Work

Because of its algorithmic efficiency and applicability, many Canny implementations have been proposed. In [20] an implementation of a self-adapting threshold Canny algorithm is proposed. This design is FPGA-based and intended for a mobile robot system. The results presented are for an Altera Cyclone FPGA and the highest frequency achieved is 27 MHz, which results in 2.5 ms computation time for a 360×280 grayscale image. In [21] an industrial implementation for ceramic tiles defect detection is presented, which defines the hysteresis thresholds with a histogram subtraction method. A Canny edge detection on NVIDIA CUDA is presented in [22], which takes advantage of the CUDA framework to implement the entire Canny algorithm on a GPU (Graphics Processing Unit). It achieves a 10.92 ms computation time for a 1024×1024 image. In [23] there is an implementation of an adaptive edge detection filter on an FPGA using a combination of hardware and software components proposed by Altera. In [24] a reconfigurable architecture and implementation of edge detection using Handel-C [25] is presented. This is a pipelined design of a Canny-like edge detection algorithm. It achieves a computation time of 4.2 ms for a 256×256 grayscale image.

Moreover, different research groups have implemented various machine vision algorithms for LoC experiments. In [26], an FPGA-based multi-core system for edge detection on micro-array LoCs is introduced. The implemented algorithm is a simple Sobel edge detection on a multi-Microblaze platform. In [27] a machine vision-based droplet control system for studying enzyme kinetics is presented. A microfluidic system for large DNA molecule arrays, in which an integrated image acquisition system is used for molecule detection, is presented in [28]. An FPGA-based system is proposed in [29], but it performs solely LoC control functions. Finally, a machine vision system with real-time response for robotic motion control is presented in [30]; however, the throughput of that system is much lower than that of the application we targeted.

2.2 Gaussian Convolution Implementation

Many image processing operations can be carried out using the convolution process. Convolution, named after the conceptually similar mathematical operation, is the process of constructing a new image by taking the weighted arithmetic sum of each element of an existing image with its neighbors. The weights of the operation take the form of a 2-D matrix that is called the kernel of the convolution. A visualization of the convolution process using a 3×3 kernel can be seen on figure 2.1 on the following page.

Gaussian Convolution (also known as Gaussian blur) is the first step in various edge detection methods, including the Canny algorithm. It serves to reduce image noise, as the various methods of gradient extraction that follow are sensitive to it. It is also essential for scale space representation applications1 and, in 1-D, it is used in the GSM cellular technology as a part of the GMSK phase modulation. It is a convolution of the image using a Gaussian kernel, and it can be thought of as a 2-D discrete Weierstrass transform.2

[Figure 2.1: 3×3 kernel convolution]
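Stated as a formula (added here for clarity; the notation is not from the original text), the operation sketched in figure 2.1 for a (2k+1)×(2k+1) kernel K and image I is the correlation form, which coincides with convolution for the symmetric kernels used in this chapter:

I'(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} K(i, j)\, I(x+i,\, y+j)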

To compute the kernel for the Gaussian smoothing, one should start with the two-dimensional Gaussian function:3

G(x, y) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}}    (2.1)

First, the σ parameter (named the standard deviation, or Gaussian RMS) needs to be chosen. It controls the intensity of the smoothing, with larger values producing: (i) more blur, (ii) less noise, and (iii) greater loss of any fine image features. That should be done according to the size of the kernel, with the values at the edges of the matrix being much less than the central value, to correctly approximate the Gaussian function, eliminating truncation errors as much as possible. It is evident that for larger σ values, larger kernel sizes are necessary. Finally, a suitable kernel can be obtained by applying symmetric integer values to the x and y variables, according to the desired matrix dimensions.

Values from 0.5 to 3 have been considered for this implementation. The smoothing obtained by setting σ = 1.4 has been found to smooth noisy images while, at the same time, conserving important image features, such as edges.

1 To handle image structures at different scales, an image can be represented as a one-parameter family of smoothed images, the scale-space representation, parametrized by the σ value of the Gaussian kernel, which results in variably suppressed fine-scale structures.
2 The function F(x) = \frac{1}{\sqrt{4\pi}} \int_{-\infty}^{\infty} f(x-y)\, e^{-y^{2}/4}\, dy
3 More generically, the x and y axes can be assigned different σ parameters to obtain different smoothing intensities in each direction, but this is outside the scope of this text.

[Figure 2.2: Block I/O and timing diagram for the Gaussian smoothing core; (a) block I/O, (b) timing diagram]

G = \begin{bmatrix}
0.0121 & 0.0261 & 0.0336 & 0.0261 & 0.0121 \\
0.0261 & 0.0561 & 0.0724 & 0.0561 & 0.0261 \\
0.0336 & 0.0724 & 0.0934 & 0.0724 & 0.0336 \\
0.0261 & 0.0561 & 0.0724 & 0.0561 & 0.0261 \\
0.0121 & 0.0261 & 0.0336 & 0.0261 & 0.0121
\end{bmatrix}    (2.2)

Then, by scaling appropriately and rounding, we get the desired Gaussian kernel which will be used for the implementation. By carefully selecting the scaling factor so rounding errors are minimized, the following 5×5 matrix was obtained; the multiplications by its coefficients happen to be easily computed in hardware, requiring up to just 2 shifts / additions or shifts / subtractions.

24 5 42 4 9 12 9 4 1 G = 5 12 15 12 5 (2.3) 159     4 9 12 9 4   24 5 42     It’s worth to note here that, even though the rule of thumb is for a Gaussian filter of standard deviation σ to have a kernel of size 6σ 1 which in 2-D would translate to 7 7, − × a 5 5 kernel was chosen; it provided satisfactory results while using significantly less × resources.

2.2.1 Implementation Details

The implementation works with 8-bit grayscale images, whose width has to be a multiple of 4 and height larger than 5 (which is almost always the case, anyway). The maximum width and height of the image are 2044 and 2047, respectively.4 The interface of the block is simple and fully synchronous to its single clock, including the reset signal. The I/O ports, and a timing diagram that describes an image frame being read into the block to be processed, can be seen on figures 2.2a and 2.2b on the previous page. The block inputs are the following:

• System clock, clk
• System reset, rst
• 32-bit data for 4 consecutive 8-bit pixels
• Data valid signal, dvalid
• eol signal, that signifies the last pixel quad of an image line
• eof signal, that signifies the last pixel quad of the image

The input of the block is subject to the following constraint: between two consecutive images there should be a pause of at least three times the image width, divided by four; that is, the time it takes to process three image lines, in order for the cache lines that are still in use to be flushed out. Other than that there are no special concerns; the input data flow can stop at any point, as long as the dvalid signal is deasserted. The outputs of the block—dvalid for valid data, eol at the end of an image line, eof at the end of an image—follow the exact same protocol as the inputs.

Coarsely, the implementation is split into three major blocks: (i) cache and control, (ii) core processing block, and (iii) the output formatter. The main function of the output formatter is simple: it rearranges the output so that only 4-pixel words are produced. At each line's beginning or end, the processing pipeline produces only two pixels of output. So at each word the two last (rightmost) pixels are combined with the two leftmost pixels of the next word. That way, only word-size (4-pixel) output is produced. Other than that, it functions as a "sanity check" of the output control signals up to that point; a simplified schematic is shown on figure 2.3 on the facing page. The other two blocks are more complex, and are going to be described over the next subsections, but a top-level overview of how everything is put together can be seen on figure 2.4 on the next page.

2.2.1.1 Cache Lines and Control Signals

Since a 5×5 convolution matrix needs 5 image lines to be applied onto, these data have to be stored in a caching structure. For this reason, a "cache and control" block consisting of five BRAMs, two FSMs and some surrounding logic has been developed. That block also takes care of generating the control signals for the other blocks of the algorithm.

4 This is due to the limited width of some internal counters and can be very easily adjusted, with just a negligible impact on resource usage and performance.

[Figure 2.3: Gaussian smoothing output formatter implementation]

[Figure 2.4: Gaussian smoothing top-level block diagram]

A simplified Algorithmic State Machine (ASM) chart of the control logic can be seen on figure 2.5 on the following page. After the first valid input pixel arrives, the FSM leaves the idle state, starts storing data in the cache lines, and uses the first line to measure the width of the image. After three cache lines have been filled, the control logic enables the processing block, passing it the right data. The cache lines rotate such that they are filled in a cyclic manner; when the cache that handles the uppermost line of the image is emptied, it starts storing the current line (which is locally the bottom one). The cache read multiplexers (figure 2.4) are kept up-to-date with the cache line rotation and rearrange their output to maintain proper line order.
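A minimal sketch of the rotation bookkeeping only (illustrative names, not the actual cache and control block): a modulo-5 counter selects the line buffer being written, and the read multiplexers apply the same cyclic offset so that the five lines are always presented in top-to-bottom image order.

// Hypothetical sketch: cyclic write selection for five line buffers.
module line_rotation_sketch (
  input  logic       clk,
  input  logic       rst,
  input  logic       line_done,      // pulses at the end of each image line
  output logic [2:0] wr_line,        // which of the 5 cache lines is being written
  output logic [2:0] rd_line [5]     // physical line holding logical row i (0 = oldest)
);
  always_ff @(posedge clk)
    if (rst)            wr_line <= 3'd0;
    else if (line_done) wr_line <= (wr_line == 3'd4) ? 3'd0 : wr_line + 3'd1;

  always_comb
    for (int i = 0; i < 5; i++)
      rd_line[i] = 3'((wr_line + 1 + i) % 5);   // cyclic offset keeps line order
endmodule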

For each pair of pixel quads (see subsection 2.2.1.4 on page 28) that the cache and control logic feeds to the processing core, it also generates the signals missing_h and missing_v to pass information about the current location in relation to the image borders. This is important for the processing core in order to adjust details of the convolution to avoid artifacts; that is explained in detail in the next section.

[Figure 2.5: Simplified ASM chart for the Gaussian smoothing cache and control block]

Finally, after the image is over, signified by the eof signal, the control logic continues with emptying the cache lines and generating the control signals for the processing core as appropriate, until the output image is complete. The information about the image width is stored for that reason.

2.2.1.2 Image Border Handling

In convolution operations, some ambiguity can arise on how to handle pixels at—or very near—the image boundary, where data from outside the image is necessary. The four most obvious ways to handle this are to:

1. extend the image, by defining virtual pixels, as necessary, and assigning each of them the value of its closest border pixel
2. zero pad the image
3. mirror the image close to its borders to extend it outside of them
4. crop the offending pixels, resulting in a slightly smaller image

For Gaussian smoothing, the first approach produces somewhat skewed results, slightly darkening or brightening the resulting image along its borders in a visually perceptible way; the second approach gives considerably darker results. The third approach gives slightly better results but, in the end (and especially in the corners), exhibits similar behavior. Finally, the last approach has the downside of wasting potentially useful pixels, by only computing results where image pixels suffice.

To handle image border situations a different approach was adopted, focusing on the kernel rather than on the image. Considering the coefficients of the kernel that fall outside the image to be zero, the normalization factor can be modified; it is adapted to the position relative to the image borders. By exploiting the centrosymmetric nature of the convolution kernel, the combinations that have to be considered are reduced. For a specific position, the matrix elements that match with pixels outside the image are zeroed, and the normalization factor is adjusted accordingly. This method resulted in images with much less perceptible artifacts along the borders. To better clarify the procedure, examples of the possible kernel configurations can be seen on figure 2.6 on the next page.
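To illustrate the renormalization, the following minimal Python sketch recomputes the factor for the configurations of figure 2.6. The 5×5 integer kernel used here is an assumption (the widely used kernel whose coefficients sum to 159); it is, however, consistent with the normalization factors shown in the figure, and the function name is purely illustrative.

    import numpy as np

    # Assumed 5x5 integer Gaussian kernel; its coefficients sum to 159,
    # matching the normalization factors of figure 2.6.
    KERNEL = np.array([[2,  4,  5,  4, 2],
                       [4,  9, 12,  9, 4],
                       [5, 12, 15, 12, 5],
                       [4,  9, 12,  9, 4],
                       [2,  4,  5,  4, 2]], dtype=np.int64)

    def border_normalization(missing_top, missing_bottom, missing_left, missing_right):
        """Zero the kernel rows/columns that fall outside the image and
        return the adjusted normalization factor."""
        k = KERNEL.copy()
        if missing_top:
            k[:missing_top, :] = 0
        if missing_bottom:
            k[-missing_bottom:, :] = 0
        if missing_left:
            k[:, :missing_left] = 0
        if missing_right:
            k[:, -missing_right:] = 0
        return int(k.sum())

    # The six configurations of figure 2.6:
    print(border_normalization(0, 0, 0, 0))  # 159: no pixels missing
    print(border_normalization(1, 0, 0, 0))  # 142: one horizontal line missing
    print(border_normalization(2, 0, 0, 0))  # 104: two horizontal lines missing
    print(border_normalization(1, 0, 1, 0))  # 127: one horizontal and one vertical line missing
    print(border_normalization(2, 0, 1, 0))  #  93: two horizontal and one vertical lines missing
    print(border_normalization(2, 0, 2, 0))  #  68: two horizontal and two vertical lines missing

In hardware, these few factors are precomputed and the appropriate one is selected per pixel position, so no division by a variable is ever needed.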

2.2.1.3 Division by a Constant Implementation

Division of an integer by a constant is usually implemented in hardware by using “magic numbers”, or multiplicative inverses. The operating principle is outlined below.5,6

\frac{a}{b} = \frac{a \ll n}{b} \gg n                                                        (2.4)

            = \frac{a \, 2^n}{b} \gg n                                                       (2.5)

            = \left( a \, \frac{2^n}{b} \right) \gg n = (a M_i) \gg n, \qquad M_i = \left\lceil \frac{2^n}{b} \right\rceil        (2.6)

For this to work for unsigned integers7 in the range [0, amax], the number n has to be chosen such that Mi ≥ amax [31]. The Gaussian smoothing normalization involves divisor constants ranging from 68 (on an image corner) to 159 (inside the image), as can be seen on figures 2.6a to 2.6f on the following page. Let us proceed by examining the application of the multiplicative inverse method for the latter, and largest, divisor. Since the dividend would be in the range [0, 255 × 159] = [0, 40545], needing 16 bits to be represented, the minimum value for Mi would be ⌈2^23/159⌉ = 52759. Multiplying amax by Mi, we get the value 2139113655, which needs 31 bits to be represented.

The method described here offers exact division results, but implementing multiple 31-bit adders (just for the division with 159, 9 terms have to be summed) has an impact on resource usage. Seeing, though, that this application could tolerate some precision loss, a simpler approach was investigated. Consider the binary representation of 1/159, which in Q16 format8 is 0b0.0000000110011100 ≈ 2^−8 + 2^−9 + 2^−12 + 2^−13 + 2^−14. By directly multiplying with this value using shifts and additions, only 5 8-bit adders are necessary.

5 It is assumed here that the result of the division is rounded down (a/b ≡ ⌊a/b⌋).
6 The symbols ≪ and ≫ refer to logical shift operations.
7 This method is described for unsigned integers, but it can also be applied to two’s complement signed integers with slight modifications.
8 Q is a number format for fixed-point arithmetic. Qm.n signifies a number with m integer bits and n fractional bits. The m factor can be omitted, leaving only the number of fractional bits (n for Qn).

(a) No pixels missing (Nf = 159)    (b) Horizontal line missing (Nf = 142)

(c) 2 horizontal lines missing (Nf = 104)    (d) 1 horizontal and 1 vertical lines missing (Nf = 127)

(e) 2 horizontal and 1 vertical lines missing (Nf = 93)    (f) 2 horizontal and 2 vertical lines missing (Nf = 68)

Figure 2.6: Gauss convolution matrix normalization factors in different positions

This is almost equivalent to the previous method but unlike Mi, the multiplication factor here isn’t scaled to exceed amax. As such, some of its bits are truncated, which affects the precision of the result.

But it is possible to achieve a reduction of the terms in these expressions. To do that, let us start with the formula for the sum of the geometric series:

\sum_{k=0}^{n-1} a r^k = a \, \frac{1 - r^n}{1 - r}                                          (2.7)

If a is taken to be r^m and n is incremented, (2.7) becomes

\sum_{k=0}^{n} r^{m+k} = \frac{r^m - r^{m+n+1}}{1 - r}                                       (2.8)

Finally, by setting r = 2 we obtain

\sum_{k=0}^{n} 2^{m+k} = 2^{m+n+1} - 2^m                                                     (2.9)

By exploiting this fact, consecutive additive terms for shift-add multiplication can be replaced by a single subtraction. For example, 2^−8 + 2^−9 + 2^−12 + 2^−13 + 2^−14 can become 2^−8 + 2^−9 + 2^−11 − 2^−14, replacing two additions with a single subtraction and reducing the terms from 5 to 4. By performing this simplification on the multiplicative inverse example, its number of terms can also be reduced, from 9 to 7 in the case presented.

To investigate the suitability of the two methods, a simple Python program was written. It considers all the numbers in the range [0, 40545] and, for each method, compares the results with the floor division operator (integer division in Python) results. As expected, the first method exhibited perfect matching with the reference results. The second method had a Mean Absolute Error of 0.46 and a Maximum Absolute Error of 2. For an edge detection application, that amount of error is considered tolerable, as a difference of 2 in an 8-bit pixel is imperceptible and should not make any difference in the edge detection afterwards. Thus, considering the savings in resources by moving from 7 31-bit adders to 4 8-bit adders/subtractors,9 it has been decided to use the approximation of the second method for the normalization.

9 Post-place and route implementation results on a Kintex Ultrascale device show 85 LUTs / 113 FFs and 15 LUTs / 48 FFs, respectively.
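A minimal Python sketch of such a comparison is shown below. The constants follow the values derived above; the function names are illustrative, and since the exact truncation points of the hardware adders are not spelled out here, the error figures this sketch prints may differ slightly from the reported 0.46 / 2.

    # Sketch of the exact-vs-approximate division check described above.
    # Exact method: multiply by Mi = ceil(2^23/159), then shift right by 23.
    # Approximate method: shift-add multiplication by the truncated Q16 value of
    # 1/159, using the reduced form 2^-8 + 2^-9 + 2^-11 - 2^-14.

    MI = (1 << 23) // 159 + 1          # ceil(2^23 / 159) = 52759

    def div_magic(a):
        return (a * MI) >> 23          # exact for a in [0, 40545]

    def div_shift_add(a):
        # each shifted term is truncated, as a hardware adder tree would do
        return (a >> 8) + (a >> 9) + (a >> 11) - (a >> 14)

    exact_ok = all(div_magic(a) == a // 159 for a in range(40546))
    errors = [abs(div_shift_add(a) - a // 159) for a in range(40546)]
    print(exact_ok, max(errors), sum(errors) / len(errors))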

2.2.1.4 Application of Parallelism

As has already been mentioned, FPGAs provide great parallelism capabilities that should be exploited. Here, parallelism is applied not only at the pixel computation level, but also at the image level, by processing more than one pixel at the same time.

Figure 2.7 on the facing page shows a block diagram of the processing core that produces one pixel of output. By computing all the matrix multiplication elements and all the possible normalizations concurrently, a throughput of one output pixel per clock cycle is achieved. Its heavily pipelined design keeps the clock rate high at the cost of more registers; however, registers are plentiful in modern FPGA architectures for this exact reason.

Additionally, four pixels of output are processed simultaneously, increasing the performance fourfold. An image processing block that generates a set of pixels of the output image typically must have access to a larger window of the input data; the part of the window around the input pixels is called an apron. The shape of the apron when computing four pixels in parallel in the Gaussian smoothing implementation can be seen on figure 2.8 on the next page. Here, the apron is 8 pixels wide, which amounts to exactly two 32-bit pixel quads.

Combining those two levels of parallelism, a processing rate of 4 pixels per clock cycle is achieved. The next section will show that this more than covers the specifications for our application.

2.2.2 Performance and Resource Utilization

The resource utilization of the Gaussian smoothing block can be seen on table 2.1 on page 30. Implementation results were obtained for Spartan-3E, Spartan-6, Virtex-5 and Kintex Ultrascale devices.10 The module has been enclosed in a wrapper providing registers, which was then synthesized, placed and routed with I/O buffer generation turned off; that way, the implementation results closely approximate the usage of the block in a larger design.

The performance ranges from 840 Mpixels/s to 1040 Mpixels/s for the devices considered for the machine vision application; that corresponds to processing times for 1 Mpixel images in the range of 0.96 ms to 1.19 ms. Considering that the full system will need to process 60 fps, which corresponds to 16.6 ms per frame, that leaves room to accommodate the later processing stages.

10 The exact devices are xc3s500e-5vq100, xc6slx75-3csg484, xc5vlx110-3ff676 and xcku060-ffva1156-3, respectively.


Figure 2.7: Gaussian smoothing computation core implementation


Figure 2.8: Gauss parallelism

On the Kintex Ultrascale device the design would achieve a throughput of 1720 Mpixels/s and a corresponding processing time of 0.58 ms. That is a latest-generation device that did not exist at the time of this project; the design has been ported to it just to provide an indication of the progress FPGA technology has made in performance over the last years.

The resource usage is generally low, around 2 %–5 % for the devices considered, except the Spartan-3E. That device is older than the others and comparably very small; it is included here to show that, by means of the approximations introduced in the Gaussian smoothing filter, the design not only fits, but can process 255 Full HD images per second, even on that device.

These devices represent an almost historical timeline of FPGA development—from the 90 nm Spartan-3E device introduced in 2005 to the 20 nm Kintex Ultrascale of 2015. Some interesting points on the progress of their capabilities can be made, outside of the obvious advantage of the maximum frequency and increased LUT, FF and BRAM counts. By looking at the absolute counts of these resources across the different families, a jump in the capabilities of the LUT elements can be observed between Spartan-3E and Spartan-6. The older device family used 4-input LUTs while the newer ones use 6-input LUTs, allowing them to pack more logic in a single LUT. The newer, 6-input LUTs can also be split into two, independent, 5-input LUTs, further enhancing the logic density of the device. The Virtex-5 family displays a significantly higher LUT count when compared to the other modern families—even though it has a similar LUT architecture. It is difficult to uncover the reasons behind this “anomaly”, as they may lie in the software effort to reach the frequency goal, its efficiency, and other low-level CLB architecture specifics that would probably be too low-level for the purpose of this text.

2.3 Sobel Filter Implementation

The Sobel filter, also known as the Sobel-Feldman operator, named after Irwin Sobel and Gary Feldman who co-developed it as PhD candidates, is used as a step in many edge detection algorithms. It uses two convolution operations to approximate the horizontal and vertical derivatives of the image intensity function. It was first suggested in a 1968 talk at Stanford; its derivation was described by Irwin Sobel in an appendix in the 1990 paper “Generalized and Separable Sobel Operators” [32].

For each direction, the kernel is the product of two separable operations: (i) a triangle filter, perpendicular to the derivative direction, for smoothing; and (ii) a central difference calculation in the derivative direction. That way, the kernels obtained for each direction are shown in (2.10) and (2.11) on the next page.

Table 2.1: Gaussian smoothing implementation results

                    FF           LUT          BRAM         Fmax     Throughput
                   (k)    (%)    (k)    (%)          (%)   (MHz)    (Mpixels/s)
Spartan-3E         2.9     31    4.9     53     5     25     133            532
Spartan-6          2.4      2    2.9      6     5      2     210            840
Virtex-5           2.4      3    3.6      5   2.5      2     260           1040
Kintex Ultrascale  2.5    0.4    3      0.9   2.5    0.2     450           1800

G_x = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} \begin{bmatrix} +1 & 0 & -1 \end{bmatrix} * A = \begin{bmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{bmatrix} * A        (2.10)

G_y = \begin{bmatrix} +1 \\ 0 \\ -1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 \end{bmatrix} * A = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A        (2.11)

By combining the results of these two convolutions, an estimate of the image gradient can be obtained. Since they are essentially the coordinates of the gradient vector in the Euclidean space, the formulas that give its norm and angle are shown in (2.12) and (2.13).

|\nabla I| = |G| = \sqrt{G_x^2 + G_y^2}                                                      (2.12)

\theta = \arctan\left( \frac{G_y}{G_x} \right)                                               (2.13)

At this point it needs to be said that, formally, the image gradient is a function that depends on the whole image at each point. That would make this 3×3 pixel approximation one of the most basic that could be constructed. In practice, though, the results obtained are satisfactory for edge detection applications.

2.3.1 Implementation Details

A block diagram of the Sobel filter block can be seen on figure 2.9 on the next page. The input and output ports are the same as for the Gaussian smoothing block (see figure 2.2a on page 21) with the addition of an 8-bit gradient angle output, that corresponds to four 2-bit values. It can be seen that the architecture is very similar to the one of the previous block (Gaussian smoothing), which stands to reason as they are essentially both blocks that perform convolution operations.

The same parallelism levels of the previous block apply here, as well. Each clock cycle, four gradient pixels and their gradient angles are computed. The caching mechanism is also similar to the one in the Gaussian smoothing block, with the difference being that there are only three lines that need to be cached for the Sobel filter, since it involves 3×3 kernels. Even though the apron of the computation changes from 5×8 to 3×6, the cache mechanism still provides two pixel quads, of which only the relevant pixels are used.


Figure 2.9: Sobel top-level block diagram

Other details of the implementation, such as the output formatter, also remain equivalent—apart from the addition of the angle output. The components and techniques that are different compared to those of the previous blocks are described in the next subsections.

2.3.1.1 Image Border Handling

The Sobel filter, involving two 3×3 convolution operations, needs special attention for the pixels along the image border. The kernels are not symmetric as in the Gaussian smoothing step, so a similar approach cannot be transferred here. The image cannot be zero-extended either, as in that case spurious edges might be detected along all the borders.

One approach that might work could be to switch to a simpler operator for those pixels that are at the borders, such as the Roberts cross:

G_x = \begin{bmatrix} +1 & 0 \\ 0 & -1 \end{bmatrix}                                         (2.14)

G_y = \begin{bmatrix} 0 & +1 \\ -1 & 0 \end{bmatrix}                                         (2.15)

or a simple difference operator:

G_x = \begin{bmatrix} +1 & -1 \end{bmatrix} * A                                              (2.16)

G_y = \begin{bmatrix} +1 \\ -1 \end{bmatrix} * A                                             (2.17)

That approach could facilitate edge detection at the borders, but: (i) the detected edge would be shifted by half a pixel, (ii) separate thresholds would need to be defined, and (iii) combining kernels from different techniques simply does not seem very elegant as a solution. Therefore, it has been chosen to simply discard the pixels of the image border. That lifts any ambiguity, while not affecting the LoC application in any adverse way.11


Figure 2.10: Sobel computation core implementation details

2.3.1.2 Computation Core

To compute the image gradient estimate, the Sobel computation core takes a 3×3 area, calculates the two coefficients of the gradient and, from them, calculates an approximation of the norm and angle of the vector. All independent operations are done in parallel, in a pipelined way. A block diagram of the Sobel computation core can be seen on figure 2.10.

For this application, there is no need for a precise estimation of the edge angle, or its direction. Placing the angle into one of the four bins shown in figure 2.12 on page 36 is enough, as that is exactly the precision required by the Non-Maximum Suppression (NMS) step, as explained in subsection 2.4.1 on page 37. These are 45° regions around radii at 0°, 45°, 90° and 135°. For this angle calculation the absolute value and the sign of each coefficient are used. Using the absolute values, the angle bin on quadrant QI can be found; then, it can be projected onto the correct quadrant by using the sign information.

11 At this point, the reader might be tempted to think that the same could be said for the Gaussian smoothing step. If these pixels were also discarded, though, including the Sobel step we would have three pixels missing from each edge—in contrast to just one. To phrase it differently, it is easier to think of an extreme case of an edge being affected by coming within three pixels of some border than by having to touch one.

The binning calculations for the quadrant QI can be simplified in the following way. Let us begin by taking the reflection of the vector G with respect to either the x axis, the y axis, or the point (0, 0), so that it lies in QI. Let us assume that the angle of the resulting vector, which we will call θabs, is larger than 22.5° (π/8 rad). We have

\theta_{abs} > \pi/8                                                                         (2.18)
\iff \tan(\theta_{abs}) > \tan(\pi/8)                                                        (2.19)
\iff |G_y| / |G_x| > \tan(\pi/8)                                                             (2.20)
\iff |G_y| > |G_x| \tan(\pi/8)                                                               (2.21)

Similarly, for θabs < 3π/8:

\tan(\theta_{abs}) < \tan(3\pi/8)                                                            (2.22)
\iff |G_y| / |G_x| < 1 / \tan(\pi/8)                                                         (2.23)
\iff |G_x| > |G_y| \tan(\pi/8)                                                               (2.24)

Thus, by multiplying |Gx| and |Gy| with tan(π/8) and taking the results of the comparisons in (2.21) and (2.24), the correct bin on QI can be obtained. That can then be projected onto the right quadrant, according to the signs of the coefficients Gx and Gy.
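The binning logic can be sketched in a few lines of Python; the bin labels and the mapping of the diagonal bins to the sign combinations are written out explicitly here for illustration (in the hardware, the tan(π/8) multiplications are realized with shifts and additions).

    import math

    TAN_22_5 = math.tan(math.pi / 8)

    def angle_bin(gx, gy):
        """Place the gradient angle into one of the four 45-degree bins
        (0, 45, 90, 135 degrees) using |Gx|, |Gy| and the signs only."""
        ax, ay = abs(gx), abs(gy)
        above_22_5 = ay > ax * TAN_22_5      # comparison (2.21)
        below_67_5 = ax > ay * TAN_22_5      # comparison (2.24)
        if not above_22_5:
            return 0                         # within 22.5 deg of the horizontal axis
        if not below_67_5:
            return 90                        # within 22.5 deg of the vertical axis
        # Diagonal bins: projected onto the right quadrant using the signs.
        return 45 if (gx >= 0) == (gy >= 0) else 135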

Then, the norm of the image gradient, √(Gx² + Gy²), has to be estimated. To make the implementation fast while, at the same time, considering the needs of its application, the following approximation, that uses the Manhattan distance and saturation logic, was constructed:

|G|_{approx} = \begin{cases} |G_x| + |G_y| & \text{if } |G_x| + |G_y| \le 255 \\ 255 & \text{if } |G_x| + |G_y| > 255 \end{cases}        (2.25)

To do an analysis of this method’s suitability, a Python program that computes the relative error of the approximation was written. As can be seen on figure 2.11 on the next page, the maximum relative error is ≈ 40 % for angles close to 45°; for angles up to 7° around the horizontal and vertical axes, the relative error is less than 10 %.
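The error behavior can be reproduced with a few lines of Python, in the spirit of the program mentioned above (the original sweep over Gx and Gy is not shown in the text, so this is only an illustrative check):

    import math

    def norm_error(gx, gy):
        """Relative error of the saturated Manhattan approximation (2.25)
        with respect to the Euclidean norm."""
        exact = math.hypot(gx, gy)
        approx = min(abs(gx) + abs(gy), 255)
        return (approx - exact) / exact if exact else 0.0

    # Worst case is the diagonal: |Gx| = |Gy| overestimates by sqrt(2) - 1 (~41 %)
    # before saturation, in line with the ~40 % maximum of figure 2.11.
    print(norm_error(100, 100))   # ~0.41
    print(norm_error(100, 5))     # a few percent, close to the horizontal axis (~3 deg)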


Figure 2.11: Norm approximation error with respect to Gx and Gy

Effectively, that would translate to a threshold that is lowered by 29 % for diagonal edges. Since the application involves edges that diverge from the horizontal or vertical axes by less than 3° (as explained in subsection 2.5.2.1 on page 46), this approximation can be deemed suitable, and has been chosen for the implementation.

As an alternative, a CORDIC (COordinate Rotation DIgital Computer) algorithm could be used—this would also provide an accurate estimate of the angle—but the resource usage would be higher. For four CORDIC cores the LUT and FF usage would be 1.2 k and 1.5 k, respectively, whereas for the whole Sobel computation core (including the logic for the convolution, the output muxes, and the rest of the necessary logic) the resource usage amounts to just 0.6 k and 0.9 k, respectively. Additionally, the maximum operating frequency of the approximation was found to be higher than that of the CORDIC core (upwards of 600 MHz versus 450 MHz).12

2.3.2 Performance and Resource Utilization

Table 2.2 on the next page shows the resource utilization of the Sobel filter implementation. As before, implementation results were obtained for Spartan-3E, Spartan-6, Virtex-5 and Kintex Ultrascale devices.

It can be observed that this block is more compact than the Gaussian smoothing block (table 2.1 on page 30) and, also, that it allows a higher maximum clock frequency.

12 The results mentioned in this paragraph are taken from implementation runs on a Kintex Ultrascale device.


Figure 2.12: The angle bins the Sobel filter computes

This is due to the smaller convolution kernel dimension and the application-specific optimizations applied.

2.4 Implementation of a Modified Canny Algorithm for Edge Detection

The Canny edge detection algorithm was developed in 1986 by John Canny, as the result of his work “A Computational Approach to Edge Detection” [17]. Using calculus of variations, he came up with an optimal function for edge detection that satisfies the following three criteria:

• Good accuracy, meaning that the number of true edges that are detected should be maximized, and the number of false positives minimized
• Good localization, in that the distance between the detected and the true edges should be minimized
• Single response, in that for a true edge point only one detected edge point should be produced

Table 2.2: Sobel filter implementation results

                    FF           LUT          BRAM         Fmax     Throughput
                   (k)    (%)    (k)    (%)          (%)   (MHz)    (Mpixels/s)
Spartan-3E         1.4     15    1.7     18     3     15     170            680
Spartan-6          1.4      2    1.5      2     2      1     240            960
Virtex-5           1.4      2    1.4      2     2      1     350           1400
Kintex Ultrascale  1.4    0.2    1.2    0.4   1.5    0.1     530           2120


This function is quite complex, and is composed of four exponential terms, but it can be closely approximated by the first derivative of a Gaussian, as described in the original paper [17]. The method that resulted is strictly defined, and it consists of the following five steps:

1. Gaussian filter, to reduce the image noise
2. Sobel filter, to approximate the image gradient
3. NMS step, to thin the resulting edges
4. Double thresholding, to separate edge pixels into strong and weak ones
5. Hysteresis pass, to keep the weak edge pixels that are connected to strong ones, and suppress those that are not

A block diagram showing the different steps of the Canny algorithm is shown on figure 2.13, and images taken after each one can be seen on figures 2.18 and 2.19 on pages 43–44. The first two steps have already been presented in the previous sections; the NMS and Hysteresis steps will be discussed in the current one (the double thresholding is a minor step, involving just two comparisons).

2.4.1 NMS Stage

The NMS pass is used to eliminate any gradients that are adjacent to an edge. This serves mainly to thin the edges, which implies cleaner data for any processing stages that follow (e.g., that would benefit the accuracy of a Hough transform). Only the local maxima need to be kept along the edge, which means suppressing less intense pixels in the positive and negative image gradient direction. The operating principle can be divided into three steps:

1. The edge direction is rounded into one of four bins, coded with different colors in figure 2.12 on the facing page.

2. The center pixel, pc, is compared to the pixel pair which is perpendicular to the edge, defined by the gradient direction (one of the four pairs, p21–p23, p13–p31, p12–p32, and p11–p33).

3. If its value is a local maximum in the gradient direction (the most intense of those pixels), it is passed along as an output. If not, it is suppressed, and the output is zero.


Figure 2.13: Main Canny building blocks


Figure 2.14: NMS core implementation


A block diagram of the NMS implementation is shown on figure 2.14. This is the only step that uses the image gradient angle computed by the Sobel step. Its operating principle clearly explains the reason the Sobel stage only needed to approximate the angle as one of four bins.
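A behavioral Python sketch of the NMS principle is given below; it operates on full arrays rather than the streaming hardware, and the mapping of each angle bin to its pixel pair is the conventional one (an assumption, since the text lists the pairs without assigning them to bins).

    import numpy as np

    # Offsets (dy, dx) of the pixel pair compared against, per angle bin.
    PAIRS = {0:   ((0, -1), (0, 1)),     # p21 - p23
             90:  ((-1, 0), (1, 0)),     # p12 - p32
             45:  ((-1, 1), (1, -1)),    # p13 - p31
             135: ((-1, -1), (1, 1))}    # p11 - p33

    def nms(magnitude, angle_bins):
        """Suppress pixels that are not local maxima along the gradient direction."""
        h, w = magnitude.shape
        out = np.zeros_like(magnitude)
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                (dy1, dx1), (dy2, dx2) = PAIRS[int(angle_bins[y, x])]
                a = magnitude[y + dy1, x + dx1]
                b = magnitude[y + dy2, x + dx2]
                if magnitude[y, x] >= a and magnitude[y, x] >= b:
                    out[y, x] = magnitude[y, x]
        return out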

2.4.2 Hysteresis Approximation

After the NMS pass the intermediate image is a clean-looking, thin gradient. As an output, though, a binary image is needed, so a thresholding pass should take place. Simple thresholding works with a single threshold value, turning all pixels below it to zero and all pixels above it to one. By introducing a thresholding step, working with an arbitrary threshold, the goal is to separate the noise that made its way to this stage of the edge detection process from the actual edges. So, while it is desired for the threshold to be low enough to capture any less intense parts of an edge, this can mean letting spurious pixels go through.

Another, more sophisticated thresholding approach, which is better suited for edge detection applications and the one used in the Canny algorithm, is called hysteresis thresholding. Hysteresis thresholding tackles this problem by using two threshold values, tL and tH, that can be set by the user to allow them to fine-tune the operation according to the specific lighting and noise conditions of the target images. Every pixel that is below tL is suppressed and assigned a value of 0x00 (definitely not an edge), and every pixel that is above tH is translated to a “strong” edge and assigned a value of 0x01 (definitely an edge). A pixel that stands between these two values can be thought of as a potential, “weak” edge, and is assigned a value of 0x10.

Edge pixel Non-edge pixel Potential edge pixel

Figure 2.15: Hysteresis thresholding example

Weak edge pixels, if they are in the 8-neighborhood of an edge pixel13, can be promoted to strong edge pixels themselves, possibly influencing more weak edge pixels next to them. If they are not in direct proximity to any strong edge pixels (or promoted weak edge pixels), they are suppressed. An example of this operating principle is shown on figure 2.15.
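The connectivity rule can be modeled in Python with a simple flood fill from the strong pixels; this is a behavioral sketch of the algorithm, not of the hardware structure discussed next.

    import numpy as np
    from collections import deque

    def hysteresis(img, t_low, t_high):
        """Keep weak pixels (between t_low and t_high) only if they are
        8-connected to a strong pixel (above t_high)."""
        strong = img > t_high
        weak = (img > t_low) & ~strong
        out = strong.copy()
        queue = deque(zip(*np.nonzero(strong)))
        h, w = img.shape
        while queue:
            y, x = queue.popleft()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and weak[ny, nx] and not out[ny, nx]:
                        out[ny, nx] = True      # promote the weak pixel
                        queue.append((ny, nx))
        return out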

The recursive nature of this algorithm is evident. A method involving multiple passes, or a graph-based method resembling Connected-component Labeling (CCL), should be employed. The multiple-pass method has the disadvantage of requiring a sizable intermediate buffer for the image on the FPGA; an on-chip buffer is needed because the valuable bandwidth of the external memory (which is also used by other components of the final application) should be conserved.

2.4.2.1 Image Compression

Initially, a two-pass architecture, involving a forward and a backward pass, was investigated. To keep the buffer on the FPGA memory, a lossless compression scheme involving a combination of RLE and Huffman coding was devised; an overview of it can be seen on figure 2.16 on the next page.

The compression algorithm works with quads of 2-bit data that encode the no / weak / strong edge information. For the RLE implementation, that led to significant adjustments, as data is encoded in groups of four, in parallel. The first optimization arises after observing that—at least in the groups of images relevant to the application—most of the empty, continuous space will correspond to quads with a value of 0x00 and any large horizontal edges will have a value of 0xff.

13 The pixels adjacent to it in a horizontal, vertical or diagonal way; formally stated, they have a Chebyshev distance of 1.


Figure 2.16: Compression scheme


Figure 2.17: Huffman encoder block diagram

Quads with edges inside would rarely have their pattern repeated. Thus, the simplification that only the quads with the values 0x00 or 0xff would be RLE encoded was enforced, which had negligible impact on the compression ratio but allowed a much simpler implementation. The RLE encoder produces three byte-sized types of output: (i) RLE-encoded data, where bit [7] denotes the value of the quad and bits [6..0] the length of the run; (ii) passthrough data, the 8-bit input value; and (iii) a 0x00 word, that indicates an operation mode change, from RLE mode to passthrough mode, and vice-versa. For any input word up to three output words may be produced, at least one of which will be a mode change word. The RLE encoder stops when it encounters the end of a line, and for each new line it is initialized in RLE mode.
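The output format of the RLE encoder can be illustrated with the following behavioral Python sketch of one line; the polarity of bit [7] and the handling of runs of length one are assumptions made for illustration, not details taken from the hardware.

    def rle_encode_line(quads):
        """Run-length encode a line of 8-bit quads: only 0x00 and 0xff quads are
        encoded as runs (bit 7 = quad value, bits 6..0 = run length); other quads
        are passed through; a 0x00 word marks a mode change. Starts in RLE mode."""
        out, i, rle_mode = [], 0, True
        while i < len(quads):
            q = quads[i]
            if q in (0x00, 0xFF):
                if not rle_mode:
                    out.append(0x00)           # switch back to RLE mode
                    rle_mode = True
                run = 1
                while i + run < len(quads) and quads[i + run] == q and run < 127:
                    run += 1
                out.append((0x80 if q == 0xFF else 0x00) | run)
                i += run
            else:
                if rle_mode:
                    out.append(0x00)           # switch to pass-through mode
                    rle_mode = False
                out.append(q)
                i += 1
        return out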

Table 2.3: NMS implementation results

                    FF           LUT          BRAM         Fmax     Throughput
                   (k)    (%)    (k)    (%)          (%)   (MHz)    (Mpixels/s)
Spartan-3E         0.7      7    1.2     13     4     20     170            680
Spartan-6          0.7      1    0.7      1   3.5      1     250           1000
Virtex-5           0.7      1    0.8      1   3.5      2     350           1400
Kintex Ultrascale  0.7    0.1    0.7    0.2     2    0.2     530           2120

For the Huffman encoder, which is the next step in the compression, after different maximum code lengths were examined, the value 12 was chosen as a compromise between compression and memory requirements. The coding table that was produced for a LoC image was shown to be almost equally effective for other images, which indicates a good match of the algorithm to the RLE step; thus, the same static, pre-computed encoding is used for all images. A block diagram of the implementation can be seen on figure 2.17 on the facing page. The main encoding block uses a two-port BRAM to store the Huffman coding look-up table; its dimensions are 256 × 16 bit, where 12 bits in each cell store the encoded word, and the other 4 bits hold its length. The rest of the encoder aligns the output of this block and packs it into 32-bit words.

The Huffman decoder uses a BRAM-based two-port, 4096 × 12 bit memory to decode the coded data stream. Since the RLE decoder may only need up to two words per clock cycle to function, out of which one has to be the mode-change word 0x00, the decoder design is somewhat simplified, having to align two input words instead of three.

Finally, the RLE decoder design is straightforward, featuring an FSM to hold the decoding mode (RLE or pass-through), a counter for the RLE mode, and some multiplexing and control logic.

To link the encoding and the decoding blocks a special temporary buffer has been implemented, that holds the input lines and their encoded lengths, and outputs them in the reverse order. The decoding logic implementation could not reach the operating frequency of the encoding stage (and the rest of the system). Thus, a dual clock scheme was devised around this temporary memory, writing data at the clock rate of the encoder, but switching to a clock of exactly half the frequency to read the data. This was deemed necessary because, in order to keep up the throughput despite the halved operating frequency, both of its ports are used to provide two lines of output, simultaneously. The encoder clock domain was decoupled from this muxed clock domain by means of a FIFO, in order to eliminate the hold violations that would occur in the routing stage of the implementation run due to the delay of the clock mux.

Table 2.4: Hysteresis implementation results

                    FF           LUT          BRAM         Fmax     Throughput
                   (k)    (%)    (k)    (%)          (%)   (MHz)    (Mpixels/s)
Spartan-3E         0.1      1    0.1      1     1      5     166            666
Spartan-6          0.1      1    0.1      1     1      1     285           1140
Virtex-5           0.1      1    0.1      1     1      1     385           1540
Kintex Ultrascale  0.1    0.1    0.1    0.1   0.5    0.1     620           2480

The implementation of this compression scheme achieves a data compression ratio of at least 2:1, and has a maximum operating frequency of 120 MHz (and 60 MHz for the decoders). Since the rest of the system could support frequencies of up to 200 MHz (on a Spartan-6 device), the performance drop was deemed unacceptable, and a 1-pass hysteresis approach was examined. The results were found to be very close to the 2-pass runs with less than 1 % loss of edge pixels (this can be seen on figures 2.19e and 2.19f on page 44) as that specific application involved relatively simple images; thus, given the time constraints that were imposed, it was decided not to make any efforts to further optimize the compression blocks and the 1-pass approach was adopted.

2.4.3 Performance and Resource Utilization

Implementation results of the NMS and hysteresis blocks are shown on tables 2.3 and 2.4 on page 40 and on the previous page, respectively. These blocks are very light in resources and have a high operating frequency, which is indicative of their simple function. In fact, the hysteresis core is able to approach the Kintex Ultrascale frequency limit by reaching a maximum operating frequency of 620 MHz.

The implementation of the complete Canny algorithm resulted in a moderate resource usage, making the algorithm usable by very low-cost systems. As seen on table 2.5, it can still fit on an obsolete Spartan-3E device, while producing acceptable performance for real-time systems. When considering a more modern device, such as the Spartan-6, the footprint is almost negligible at 2 % slice utilization.

Table 2.5: Canny implementation LUT usage broken down by block, Fmax

                   Gauss   Sobel   NMS   Hysteresis   Total   Total    Fmax
                    (k)     (k)    (k)      (k)        (k)     (%)    (MHz)
Spartan-3E          2.6     1.1    0.7      0.1        4.2      28      120
Spartan-6           2.4     1.4    0.7      0.1        4.5       2      200
Virtex-5            2.4     1.4    0.7      0.1        4.5       6      290
Kintex Ultrascale   2.4     1.4    0.7      0.1        4.5       6      470

(a) Original image (b) After Gaussian smoothing

(c) After Sobel edge detection (d) After NMS step

(e) After 1st hysteresis pass (f) After 2nd hysteresis pass

Figure 2.18: Canny edge detection steps for the popular “lena” image

(a) Original image (b) After Gaussian smoothing

(c) After Sobel edge detection (d) After NMS step

(e) After 1st hysteresis pass (f) After 2nd hysteresis pass

Figure 2.19: Canny edge detection steps for the Lab-on-Chip frame

2.5 High Performance Machine Vision System Implementation

The machine vision implementation described in this section is an FPGA-based flow detection system designed to identify flows in a microfluidic LoC where the protocol of the experiment requires the phases of the flows to be visible (e.g., in serology experiments). Up to five different concurrent flows can be identified, differentiating both the head and the tail of each flow; this maximum number of flows is in accordance with the experimental protocol specifications. The machine vision system is specifically designed to compensate for noise in the captured video, induced by non-ideal lighting conditions, as well as any LoC vibration that might occur during the experimental procedure. Such specifications call for a complete hardware implementation to accelerate the machine vision algorithms—as opposed to the use of a microprocessor core (or many). The machine vision algorithm and software model were introduced by Micro2gen, patented in [33], and subsequently implemented on hardware by our team at Aristotle University of Thessaloniki after the required adaptations. The proposed system achieves real-time response and is able to follow the video stream produced by a high-speed camera at frame-rates exceeding 60 fps, at a resolution of 1 Mpixel, which to the best of our knowledge had been a first in the field at the time of its implementation. Figure 2.20 on the next page shows a demonstrator of the resulting LoC that was developed; the final system will be a complete, portable, PoC system, able to function in non-ideal conditions.

2.5.1 System Function

The internal architecture (flow-channel structures, sizes, maximum number of possible flows etc.) of each microfluidic LoC used in the experimental procedure is known prior to the experiment execution. A reduction of the computational complexity of the flow detection algorithm is achieved by constructing a chip-specific input file, which provides the starting coordinates of each flow (using the upper left corner of the LoC chip as a reference). In the same file a number of alarm points can be set for each channel; the flow detection system signals the user as soon as a flow is within an alarm point’s vicinity. In order to use this input file, the machine vision system must detect the LoC’s upper left corner. This is achieved by using a specially adapted chip frame detection algorithm. By detecting the chip frame on the video frame the algorithm calculates the necessary reference point and any small rotation of the reference frame; this also reduces the required memory accesses for the flow calculation.

Figure 2.20: LoC demonstrator

The machine vision system is capable of tackling any sudden small LoC movements, as the chip frame detection is run on each frame. The machine vision algorithm was optimized to be implemented on an FPGA. A number of techniques (parallelization, pipelining, memory space cropping) were used to achieve good performance. In addition, the system is fully parametric, allowing the user to fine-tune the necessary options to adjust noise reduction levels, chip frame and flow detection sensitivity, as well as other functions of the system.

2.5.2 Implementation

The details of the Canny implementation were presented in the previous parts of this chapter. Therefore, the other two major system blocks, namely the Hough and the Flow Detection algorithms, will be briefly described here.

2.5.2.1 Frame Detection - Modified Hough Algorithm

During an experiment, due to the high pressure imposed by the integrated valves and / or external actuators, the microfluidic chip may slightly slide from its position.

These cases must be detected and compensated for in real-time. To that end, a chip frame detector was developed; it is based on a customized implementation of the Hough transform [34] line detection algorithm, which is favored for its tolerance to noise and partial occlusion—these properties are valuable for this application, since a noisy and non-ideally lit environment is possible. Performing chip frame detection by using redundancy patterns on the chips (such as crosses close to the four edges) was rejected because it would restrict the variety of LoCs the system can accommodate. The Frame Detection algorithm works as follows: (i) the horizontal and vertical lines that define the edges of the chip’s frame are detected, (ii) the frame’s corners are calculated from their intersections (the vertices), and (iii) the smallest bounding rectangle that contains these corners is computed.

The angle space of the detection does not cover a uniform [0, 2π) range, but is contained within [−2.5°, +2.5°] from the orthogonal axes that define the video’s coordinate system, taking into consideration that the chip’s rotation angle during operation is minimal and bounded. The detection works with a discrete step of 5°/32 (0.15625°). To improve the performance of the computationally intensive Hough transform, the algorithm was optimized for quadruple parallel pixel processing and quadruple angle processing during the voting process [35]. This led to the splitting of the accumulation memory into 16 discrete blocks and the cloning of the voting module; these, combined with a deep pipeline, enabled a speed improvement of approximately 16× over the original algorithm. Careful mapping of arithmetic operations on dedicated DSP blocks and the use of numerical optimizations increased the operating frequency, which further reduced the algorithm’s processing time.
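As an illustration of the restricted-angle voting, a behavioral Python sketch follows. The accumulator dimensions and variable names are placeholders, and neither the 16-bank memory splitting nor the quadruple pixel/angle parallelism of the hardware is modeled.

    import numpy as np

    # +/-2.5 degrees around an axis in steps of 5/32 degree, as described above.
    ANGLES = np.deg2rad(np.arange(-2.5, 2.5 + 1e-9, 5.0 / 32))   # 33 discrete angles

    def hough_near_vertical(edge_pixels, rho_max):
        """Vote for near-vertical lines x*cos(t) + y*sin(t) = rho over the
        restricted angle range; edge_pixels is an iterable of (x, y) tuples."""
        acc = np.zeros((len(ANGLES), rho_max), dtype=np.uint32)
        cos_t, sin_t = np.cos(ANGLES), np.sin(ANGLES)
        for x, y in edge_pixels:
            rho = np.round(x * cos_t + y * sin_t).astype(int)
            valid = (rho >= 0) & (rho < rho_max)
            acc[valid, rho[valid]] += 1
        return acc

    # The strongest line corresponds to the accumulator maximum, e.g.:
    # t_idx, r = np.unravel_index(acc.argmax(), acc.shape)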

2.5.2.2 Flow Detection

The Flow Detection stage detects the coordinates of the position of the head and tail of up to five different concurrent flows on the LoC [36]. Information for the starting position of each flow with respect to the upper left corner is required. This information is preloaded by the control microprocessor. The Chip Frame is assumed to have already been computed by the Chip Frame Detection stage, limiting the region of interest to the one within the chip frame.

In order to reduce the complexity of the flow detection procedure, flow windows are defined; these are square areas where a flow is expected (figure 2.21 on the following page). The size of the flow windows is adjustable, controlling the sensitivity of the detection. Since a maximum of five concurrent flows can be detected, a maximum of ten such windows are set; one for the head and tail of each flow.

Figure 2.21: Video frame, chip frame, and flow windows

These windows are defined by the Adjust/Load Window component. While in the initialization state, this component sets the starting window for the head of each flow by adjusting its position around the starting flow coordinates within the chip frame. If the coordinates place the flow window near the chip frame, it is transformed to a rectangular shape to exclude the pixels that are beyond the chip frame. The tails of the flows are expected at the same coordinates, but only after the head of the flow has exited the square window around the initial coordinates, so as to avoid detection conflicts.

The flow detection is achieved by capturing the pixels in the flow window from two consecutive frames; in figure 2.21 an overview of a microfluidic Polymerase chain reaction (PCR) can be seen, where the flow is the transparent-looking liquid moving to the left. In the Flow Window Subtraction component the absolute difference between these two sets of pixels is calculated. The resulting pixel window is then thresholded to produce a binary output of the pixels whose luminosity changed drastically between two successive frames. These pixels depict the flow movement. Then, the flow position is calculated by finding the center of mass of these pixels.
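A behavioral Python sketch of this window processing (frame differencing, thresholding, and center-of-mass calculation) is given below; the threshold argument name is illustrative.

    import numpy as np

    def detect_flow_position(prev_win, curr_win, t_pd):
        """Return the (x, y) center of mass of the pixels whose luminosity changed
        by more than t_pd between two consecutive frames, or None if none did."""
        diff = np.abs(curr_win.astype(np.int16) - prev_win.astype(np.int16))
        moving = diff > t_pd                    # binary map of changed pixels
        ys, xs = np.nonzero(moving)
        if len(xs) == 0:
            return None
        return float(xs.mean()), float(ys.mean())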

The Alarm Points component calculates whether the flow coordinates are within the vicinity of any of the alarm points set by the user. The vicinity (number of pixels) is also parametric. The output of the Flow Detection stage consists of: (i) the coordinates of each flow’s head and tail, (ii) the index number of the flow, (iii) whether the flow head or tail has reached an alarm point, and (iv) the index number of the alarm point. This information is sent to the microprocessor in real-time, which forwards it to the LoC control unit subsystem to be used for actuator control.

2.5.3 IP Core Packaging

The completed system has been packaged into an IP core for easy integration with the rest of the LoC subsystems in the Xilinx Platform Studio (XPS) environment, in the form of a pCore. It is interfaced with a Microblaze embedded microprocessor via the Processor Local Bus (PLB), a bus protocol initially designed for the PowerPC platform but also used in Xilinx embedded systems. It is also interfaced with the external DDR SDRAM memory, via the Xilinx-provided Multi-Port Memory Controller (MPMC) IP core.

2.5.3.1 The PLB Bus—Core Configuration and Control

The PLB bus connects the central Microblaze soft-processor with its peripherals, by treating them as memory-mapped I/O; its reads and writes to the memory space assigned to a peripheral are translated by the PLB bus to reads and writes to the peripheral, itself.

A comprehensive description of the address space assigned to the machine vision subsystem IP core can be seen on figure 2.23 on page 52. This memory space has been implemented using one BRAM block; together with some additional logic that triggers an action when specific addresses are written to, it is enough to configure and control the IP core.

For the IP core to function properly, it has to be configured first. The following is a list of the parameters necessary for operation:

• the memory interface needs the frame size and address
• the Canny edge detector needs the Gaussian smoothing factor selector and the tH and tL thresholds
• the Hough transform block needs two thresholds, one for the horizontal (tH) and one for the vertical (tV) direction
• the Flow Detection block needs the number of flows (NF), the size of the detection window (DWW), the initial coordinates of the flows, the number of alarm points (NA) and their coordinates, and the thresholds for pixel detection (tPD), flow detection (tFD), and alarm triggering (tAP)

After the values of these parameters have been configured, a write is performed on address 0x00 so the configuration is read into the actual processing blocks.

To start the frame detection process, a write has to occur to address 0x02. The value of the write, 0x10 or 0x20, activates threshold set A or B, respectively. When the results of the frame detection are ready, an interrupt is issued to the processor. Then, any error code can be read from address 0x00, and the results of the frame detection can be read from addresses 0x28–0x29.

Finally, the flow detection can be started by performing a write to address 0x03. Again, an interrupt is issued when the results are ready, and they can be accessed at addresses 0x2A, and 0x30–0x3F.

2.5.3.2 External Memory Access

External memory access is achieved via the Video Frame Buffer Controller (VFBC) interface [37] of the MPMC IP core. That interface has been specifically designed for video processing designs, such as this. Consequently, it provides an easy way to make transactions that involve 2-D data, that is, a single command can transfer a number of lines of a set width.

The interface involves three FIFOs:

• a Command FIFO, to issue commands to the controller
• a Read FIFO, to read data from the memory
• a Write FIFO, to write data into the memory

In this specific design, since there is no need for this subsystem to write any data to the external memory, only the first two FIFOs are used.

To issue a transaction, a group of four control words is written to the Command FIFO. The format of these command words can be seen on figure 2.22 on the next page. A transaction is defined by: (i) the start address; (ii) the width of a line in bytes; (iii) the distance, in bytes, from the start of one line to the next one; (iv) the number of lines; and (v) whether the transaction requested is a write, or a read. After these four command words are written to the Command FIFO, the transaction is initiated. If it is a write transaction the data can already be written to the write FIFO; in the case of a read transaction, the data will appear on the read FIFO as soon as they become available.

CW1: bits [31..15] Reserved, bits [14..0] X size
CW2: bit [31] WE, bits [30..0] Start address
CW3: bits [31..24] Reserved, bits [23..0] Y size
CW4: bits [31..24] Reserved, bits [23..0] Stride

Figure 2.22: VFBC command words
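Assuming the field layout of figure 2.22 (with the WE flag in bit 31 of the second word), the four command words for a transaction could be assembled as in the following Python sketch; this only illustrates the packing and is not a driver for the actual core.

    def vfbc_command(start_address, x_size, y_size, stride, write=False):
        """Pack the four 32-bit VFBC command words per the layout of figure 2.22."""
        cw1 = x_size & 0x7FFF                                               # CW1 [14..0]: line width in bytes
        cw2 = (start_address & 0x7FFFFFFF) | ((1 << 31) if write else 0)    # CW2: WE | start address
        cw3 = y_size & 0xFFFFFF                                             # CW3 [23..0]: number of lines
        cw4 = stride & 0xFFFFFF                                             # CW4 [23..0]: distance between line starts
        return [cw1, cw2, cw3, cw4]

    # Example: read 1024 lines of 1280 bytes starting at 0x1000000, stride 2048 bytes
    print([hex(w) for w in vfbc_command(0x1000000, 1280, 1024, 2048)])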

2.5.4 Performance and Resource Utilization

The resource utilization of the system and its individual components can be seen on table 2.6. The subsystem control logic and the system memory controller resources were counted towards the edge detection results.

The maximum operating frequency of the resulting system was 180 MHz in the Spartan-6 implementation, and 260 MHz in the Virtex-5 implementation. This can support frame-rates of 75 fps to 110 fps. Both implementations offer the necessary performance to support the system goal, which could lead to two system versions: a lower-cost one, and a readily upgradable one that could support higher-resolution cameras via a simple firmware update.

2.6 Conclusions

In this chapter the novel, high-performance image-processing implementations for a LoC machine vision system, and the system itself, were presented. The edge detection algorithm, built on top of them, feeds the Hough Transform algorithm. This enables the system to find the chip frame in order to use it as a reference for the flow detection algorithms and, additionally, facilitates tracking the chip frame to account for small movements that can occur due to vibrations in the rack.

Table 2.6: Machine vision implementation results

                    FF (k)        LUT (k)        BRAM         DSP        Time (ms)
                   S6     V5     S6     V5     S6    V5     S6   V5     S6      V5
Edge Detection     6.4    6.6    5.2    5.9   11.5    9      —    —    1.46    1.01
Frame Detection    3.7    3.8    7.8    8      24    19     28   28   11.65    8.1
Flow Detection     1.5    1.5    1.3    1.3     6     4      1    1   0.003   0.002
MV subsystem      11.6   11.9   14.3   15.2   41.5   32     29   29   13.14    9.13
LoC system        16.1   16.2   16.3   17.3   178   141     32   32      —       —

0x00               Result of the latest operation or error code / a write here applies parameters
0x01               – reserved –
0x02               A write here starts frame detection (0x10 for param. set A, 0x20 for B)
0x03               A write here starts flow detection
0x04               Number of frames stored
0x05 – 0x08        Addresses of stored frames
0x09               Width | Height
0x0A               Hough tH | Hough tV
0x0B               – reserved –
0x0C               – reserved – | sselA | Canny tH A | Canny tL A
0x0D               – reserved – | sselB | Canny tH B | Canny tL B
0x0E               NF | NA | DWw | tPC | tFD
0x0F               – reserved – | tAP
0x10 – 0x1F        Alarm point X coords | Alarm point Y coords
0x20 – 0x27        Initial coords X | Initial coords Y
0x28               Frame det X | Frame det Y
0x29               Frame det W | Frame det H
0x2A               – reserved – | Alarms
0x2B – 0x2F        – reserved –
0x30 – 0x3F        Flow det X | Flow det Y
0x40 – 0xFFFFFFFF  – free –

Figure 2.23: Machine Vision subsystem IP core address space

The main distinguishing feature of these implementations is the four-pixel parallel computation, which—in combination with the application-specific optimizations devised and optimal design practices—leads to their outstanding performance.

Additionally, the machine vision subsystem integration in the form of an IP Core was broadly described. This packaging has led to a painless integration with the rest of the LoC FPGA subsystems.

Implementation results were obtained for a very wide range of devices. That is in terms of both purpose, having devices ranging from low-end ones that are suited for low-power, low-density, cost-optimized applications, to high-performance ones; it is also in terms of age, including devices launched between 2005 and 2015. The results are not just synthesis estimates; the full timed implementation run, including place and route, was executed. This increases the confidence in the results, especially in the operating frequencies stated.

Finally, to the best of the author’s knowledge, the implementations described display significantly higher performance than others—while at the same time keeping a small footprint in terms of resources.

Chapter 3

Advanced High-Performance Designs for Track Trigger Applications

“I do not keep up with the details of particle physics” — Murray Gell-Mann

3.1 Introduction

The field of HEP studies the nature of the fundamental particles and their interactions. The most common type of experiment involves the construction of very big instruments, called “particle colliders”. There, particles of a certain type (usually electrons or protons) are gathered in bunches and arranged into beams, which are accelerated to relativistic speeds; these beams are then set to intersect, so the particle bunches periodically collide in pairs. For sufficiently high energies, these bunches of colliding particles create torrents of new, different particles. Each collision of particle bunches is called “bunch crossing”, or “event”. The intersection points, where collisions take place, are very precisely located in the center of massive structures, the detectors; these, as their name suggests, detect all the various products of the collisions, allowing us to study the fundamental interactions up to the energy of the collision.

The technological and scientific advancements in the field of High-Energy Physics have fueled developments in a broad range of science and technology disciplines; examples can be found in the fields of medical imaging and research, superconductors, cryogenics, and aerospace. Furthermore, many areas of HEP bring forth very computationally intensive problems, ranging from designing, constructing, and operating particle accelerator experiments, to interpreting their results.


In that context, after providing some necessary background information on particle colliders, detectors and trigger systems, our novel high-performance FPGA implementations for two diverse tracking trigger applications will be presented. These implementations tackle future challenges in the HEP field, as they are designed to target future upgrades of the LHC particle collider.

3.1.1 The LHC particle collider

The LHC is the largest and most powerful particle collider, on account of its 27 km-long main tunnel and 13 TeV record-breaking proton-proton (pp) collision energy. It is a circular, synchrotron-type collider; it is located at the Conseil Européen pour la Recherche Nucléaire (CERN) site on the Franco-Swiss border, and it feeds seven detector experiments: (i) ALICE, which studies heavy-ion collisions in a quark-gluon plasma state; (ii) ATLAS, a general-purpose detector to study the Higgs boson, the Standard Model, and physics beyond that [38]; (iii) CMS, the other general-purpose detector; (iv) TOTEM, which studies elastic scattering using intact protons from collisions at CMS; (v) LHCb, that measures the parameters of CP violation in the interactions of hadrons containing a bottom quark (that can help explain the asymmetry between matter and antimatter in the universe); (vi) LHCf, that measures the energy and numbers of π0 particles using forward particles from ATLAS (which can potentially help identify the origins of ultra-high-energy cosmic rays); and (vii) MoEDAL, situated in the same cavern as LHCb, that examines the existence of magnetic monopoles.

Inside the LHC, the protons are grouped into bunches, each comprising 1.15 × 10^11 protons, and accelerated up to a peak energy of 6.5 TeV. In the nominal beam configuration, up to 2808 bunches can exist in the accelerator ring at any given time, traveling at 0.999 999 99 c (just 11 km/h slower than the speed of light) with the distance between two consecutive bunches kept at a stable 25 ns.1 In order to keep these very high energy protons in a circular path of 27 km, superconducting magnets that produce fields of up to 8 T are employed.

An important measure of a particle accelerator’s performance, luminosity is the ratio of the event rate2 per interaction cross-section area, expressed in cm−2 s−1, or in b−1 s−1 units.3

1 To gain a better understanding of the bunch distance, a bunch is ≈ 7 ns long.
2 The number of events per unit time.

Source: ATLAS Experiment © 2016 CERN
Figure 3.1: The ATLAS detector

The nominal peak luminosity of the LHC is 1 × 10^34 cm−2 s−1, while peaks of up to 2 × 10^34 cm−2 s−1 have been achieved in experimental beam configurations. During the 2012 run (within Run 1, at 8 TeV), each bunch crossing produced an average of 21 pp collisions (defined as ⟨µ⟩ = 21), with peaks of up to 40. At the time of writing, the LHC has delivered more than 100 fb−1 of integrated luminosity, which corresponds to more than 1 × 10^16 potential collisions brought to the centers of the ATLAS and CMS detectors.

The LHC program has a number of upgrades planned, interleaved with data-taking periods, also called “Runs”. At the time of writing it is in the Run 2 phase, having undergone one major upgrade that raised the maximum collision energy from 8 TeV to 13 TeV. The major upgrade that will transform the LHC into the HL-LHC will be split in two separate upgrades: the Phase-I upgrade in 2019–2020; and the Phase-II upgrade in 2024–2025. The Phase-I upgrade will bring the collision energy up to 14 TeV and the peak luminosity to 2.2 × 10^34 cm−2 s−1 with ⟨µ⟩ = 60; the even more ambitious Phase-II upgrade will bring the peak luminosity to 1 × 10^35 cm−2 s−1 for a ⟨µ⟩ of 140 and even up to ⟨µ⟩ = 200 [39]. This will make the number of generated particles per event explode, demanding much more performant trigger systems to handle the significantly larger amounts of data that will be generated; the integrated luminosity in the HL-LHC phase is predicted to be of the order of 3000 fb−1, a figure that is thirty times its current integrated luminosity.

³ A b (barn) is equal to 1 × 10⁻²⁸ m².

3.1.2 The ATLAS detector

At 46 m long and 25 m in diameter, and weighing about 7000 t, ATLAS is the largest detector ever built. A computer rendering can be seen in figure 3.1 on the preceding page. It is one of the two general-purpose detectors, along with the CMS detector, constructed at the LHC particle accelerator at CERN. Its main purpose is to measure the properties of the Higgs boson, after being one of the two experiments that first confirmed its existence, and to search for physics beyond the Standard Model while improving our knowledge of the Standard Model itself.

The detector mainly consists of four parts: (i) the Inner Detector (ID), which will be explained in detail; (ii) the two calorimeters (Electromagnetic and Hadronic), which measure the particle energy; (iii) the magnet system, which produces a magnetic field high enough to curve even the most energetic particles, allowing their momentum to be measured; and (iv) the muon detector system, which tracks and measures the momentum of the muons that are produced by the collisions and have traversed the calorimeters.

The 6 m long Inner Detector starts at a radius of just 5 cm from the beam axis, and extends to a radius of 1.2 m. Its function is to perform precision tracking of charged particles by exploiting their interaction with silicon sensors and their bending within a 2 T solenoid magnetic field. By design it has three subsystems and one recent addition, all of which are detailed in the next paragraphs.

The Pixel Detector, which used to be the ID's innermost component, is the most crucial part for vertex reconstruction. Track reconstruction stands for the reconstruction of a particle's trajectory—its “track”. Vertex reconstruction signifies the particle track's extrapolation back to the interaction region, and the precise determination of the point that identifies its origin. The Pixel Detector features 1744 pixel modules, which result in ≈ 8 × 10⁷ pixel readout channels over three cylindrical layers (with radii of 50.5 mm, 88.5 mm, and 112.5 mm), and three disks on either endcap region. The pixel size is 50 × 400 µm² (50 µm in rφ and 400 µm in z),⁴,⁵ offering a spatial resolution of 12 µm in rφ and 110 µm in z.

The Semiconductor Tracker (SCT) consists of 4088 two-sided silicon microstrip modules, which provide ≈ 6 × 10⁶ readout channels, arranged in four cylindrical layers (at 29.9 cm, 37.1 cm, 44.3 cm, and 51.4 cm) and nine disks for each endcap region. The pitch of the strips is 80 µm, with each strip being 126 mm long; the strips are placed along the beam axis to provide precise measurements in rφ, but due to a small stereo angle between the two sides of the modules, they also provide precise measurements in the z coordinate.

⁴ z refers to the beam axis.
⁵ 11% of the pixels are actually “long” pixels, sized 50 × 600 µm².

The Transition Radiation Tracker (TRT) involves straw-tube detectors that—apart from the track measurements—provide additional information on the type of particle detected. The straws are placed parallel to the beam axis and form an approximately uniform array, spaced about 6.6 mm from each other. The TRT barrel covers the radius range of 36 cm to 108 cm; each straw in the barrel is 144 cm long and read out from both ends, for a total of 50 000 straws. In the endcaps there are 320 000 straws, 39 cm long. The TRT provides 420 000 readout channels in total.

Finally, the Insertable B-Layer (IBL), which was inserted in ATLAS between Run 1 and Run 2 (2013–2014), constitutes an additional inner layer for the Pixel Detector [40]. It was made possible by a reduction of the beam pipe diameter; the space for its installation was very narrow, with only 0.2 mm from the Inner Supporting Tube and 1.9 mm between the supporting tube and the Pixel Detector, for a radius of only 3.2 cm. The pixel size is 50 × 250 µm², offering better resolution than the other Pixel Detector layers and further improving vertex reconstruction. The IBL adds another 12 × 10⁶ readout channels to the Pixel Detector, bringing the total number of Inner Detector channels close to 100 M.

3.1.3 Triggering

As was stated before, the current peak luminosity corresponds to an average of 40 pp collisions per bunch crossing. The ≈ 8 × 10⁷ readout channels of a general-purpose detector roughly translate to 25 MB of data per event. After a process called zero-suppression, they become ≈ 1.6 MB, but that still translates to a bandwidth of ≈ 64 TB per second, which is too high to be processed or stored for further analysis. Since the cross-section for producing interesting events⁶ is considerably lower than the total pp interaction cross-section, the selection of interesting events from the background in real time—a process called Triggering—is essential in order to fully exploit the physics potential of these experiments.

Given the huge volume of data, deciding which events stand out poses an extremely demanding computational task [41]. Thus, the process is organized in multiple levels of increasing detail, typically: (i) a faster, localized, and less precise L1 trigger, implemented in hardware, which provides a decision within a fixed latency of 3.5 µs and reduces the event rate from 40 MHz to 100 kHz; and (ii) a HLT, implemented in software, running on an off-the-shelf PC (Personal Computer) farm, that further reduces this rate to ≈ 1 kHz, with each event requiring a processing time of the order of 1 s. Such multi-level trigger systems present an effective solution for event selection [42, 43, 44]. Typically at the LHC, the L1 trigger decides based on the calorimeters and the muon systems alone, without looking at the ID data, and tracking information is only partially used at the HLT level [45]. Nevertheless, as the instantaneous luminosity increases, the event complexity increases as well, and the tracking task on the CPU farm becomes heavy enough to consider implementing parts of it in dedicated hardware. Such hardware could either do tracking at L1, or act as a pre-processor for the HLT farm, or even as a co-processor of the farm.

⁶ e.g. an event resulting in the production of Higgs bosons is deemed to be interesting.
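The bandwidth figures above follow directly from the event size and the rate at each trigger level; a minimal arithmetic sketch (the 1.6 MB zero-suppressed event size is the figure quoted in the previous paragraphs):

# Data-rate arithmetic behind the multi-level trigger chain described above.
event_size_bytes = 1.6e6                     # zero-suppressed event size

rates_hz = {
    "bunch crossing rate (no trigger)": 40e6,
    "after the L1 trigger":             100e3,
    "after the HLT":                    1e3,
}

for stage, rate in rates_hz.items():
    print(f"{stage:35s} {rate * event_size_bytes / 1e9:8.1f} GB/s")
# 64000 GB/s (= 64 TB/s), 160 GB/s, and 1.6 GB/s respectively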

3.1.4 The FTK System

The FTK [46, 47] is an ATLAS upgrade for Run 3,⁷ which uses hardware to generate the tracking information, as a pre-processor to the HLT farm. The real-time tracking in FTK exploits massively-parallel, high-performance, dedicated hardware. The main building block of the architecture is a Processing Unit (PU) that implements a two-stage track reconstruction algorithm. In these PUs, latest generation FPGAs are interfaced with specially tailored ASICs, named AM Chips, that provide the majority of the computing power [48, 49, 50, 51]. This approach has already been applied with success in the last upgrade of the Silicon Vertex Tracker (SVT) at the Collider Detector at Fermilab (CDF) experiment at the Tevatron particle accelerator [52]. The required high-speed data processing is achieved by employing massively parallel processing and pipelining at multiple levels.

The AM Chip, arguably the core of the FTK system [53], implements the first step of the track reconstruction algorithm: it recognizes track candidates (“roads”) at reduced (low) resolution by performing pattern matching against pre-computed patterns of plausible track candidates. The low resolution hit representation is obtained by subdividing each layer of the detector into bins of equal size, called Super-Strips (SSs); a pattern is a combination of SSs, one in each layer.

The second step, composed of the Data Organizer (DO) and the Track Fitter (TF), is implemented in FPGAs. The DO is a form of smart database built on the fly, which interfaces the low resolution pattern matching stage to the higher resolution track fitting. The TF receives track candidates as possibly track-forming combinations of full (high) resolution hits, to refine the results of the pattern matching. A “hit” here refers to a particle's energy deposition cluster centroid, as the particle traverses the silicon detector layers. Track fitting speed is increased by replacing a helical fit with a simplified calculation that is a linear function of the local hit positions. The calculation is a set of scalar products of the hit coordinates with pre-calculated constants that take into account the detector geometry and alignment. Tracks satisfying a cut on the fit χ² are kept.

⁷ Tracking at L1, and hardware tracking in the form of a co-processor to the HLT farm, are considered for Phase-II (Run 4 and beyond).
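To make the linearized fit concrete, the sketch below expresses each track parameter and each χ² component as a scalar product of the hit coordinates with pre-computed constants; the constant matrices, array sizes, and the χ² cut value are placeholders for illustration, not the actual FTK fit constants.

import numpy as np

# Linearized track fit sketch: parameters and chi^2 components are scalar
# products of the local hit coordinates with pre-computed constants.
# All constants below are random placeholders, not real FTK fit constants.
N_COORDS, N_PARAMS = 11, 5                     # example sizes only
N_CHI2 = N_COORDS - N_PARAMS

rng = np.random.default_rng(0)
C_par, q_par = rng.normal(size=(N_PARAMS, N_COORDS)), rng.normal(size=N_PARAMS)
C_chi, q_chi = rng.normal(size=(N_CHI2, N_COORDS)), rng.normal(size=N_CHI2)

def linear_fit(x, chi2_cut=17.0):
    """x: local hit coordinates of one candidate; returns (params, chi2, keep)."""
    params = C_par @ x + q_par                 # helix parameters
    chi_components = C_chi @ x + q_chi
    chi2 = float(chi_components @ chi_components)
    return params, chi2, chi2 < chi2_cut       # keep the track if chi^2 passes the cut

params, chi2, keep = linear_fit(rng.normal(size=N_COORDS))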

The first two steps of the algorithm take into account up to 8 layers of a potential track in their calculations. In the third step, track fitting is performed again, but taking into account all 12 layers of the ID's silicon detectors. Finally, duplicate tracks are detected and removed from the found tracks, and the results are converted into the appropriate format and forwarded to the HLT. The final goal of the FTK system is to provide the ATLAS HLT with a complete list of tracks for each event accepted by the L1 (at a rate of 100 kHz). That takes place within a latency of the order of 100 ms, in time for the event processing at the HLT.

This high-level description of the FTK project’s operating principle does not do justice to its size and complexity. The whole system stores 1 Gpatterns over 8192 AM Chips [54, 55] and, along with those, more than 2000 FPGAs are used in the 322 main boards and 640 mezzanines, housed in the 7 racks that comprise the full system.

The AM Chip

The AM Chip is an ASIC designed to provide massive parallelism in data correlation searches. Its AM03 version has been used with great success in the SVT at the CDF experiment, and its AM06 version is used to perform pattern matching in the FTK project. The AM Chip generations up to the time of writing are shown in table 3.1 on the next page. Its current version, AM06, designed in 65 nm CMOS (Complementary Metal-Oxide-Semiconductor) technology and featuring 420 million transistors, is able to perform ≈ 6.5536 × 10¹² pattern comparisons per second [56]. The intrinsic parallel nature of the pattern matching combinatorial problem is exploited by concurrently comparing the data to a full set of pre-calculated “expectations”, or patterns, stored in a large, on-chip “pattern bank”. The results of the pattern matching (i.e. the patterns that are found to match the incoming data) are called “roads”.

Due to high bandwidth requirements and dense PCBs housing 64 AM06 ASICs per VME (VERSAmodule Eurocard) slot, the I/O of the AM06 is done via 2 Gbit/s serial transceivers, carrying 16-bit input data at a rate of 100 MHz and 32-bit output roads at a rate of 50 MHz. The input data arrive at 8 input buses—one for each layer. The data for each layer are compared with all the stored patterns concurrently, as they arrive. For a high-level, symbolic overview of the function the chip performs, see figure 3.2 below. At the end of each event all the patterns that have matches in a minimum number of layers are read out, and the road list is cleared. The AM Chip is able to work with fewer than eight layers, and it can be configured to trigger a pattern despite missing some layers (typically one, to allow for some detector inefficiency). Furthermore, the pattern matching logic implements a variable-resolution scheme, the DC bit feature, making the stored pattern bank more flexible while, at the same time, compressing its size.

Figure 3.2: AM chip pattern matching visual description (incoming data on 8 layers are compared against N stored patterns; a majority logic with a threshold of ≥ 7 layers and a Fischer tree read out the matched patterns, i.e. the roads)
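A behavioral sketch of that matching function is given below: every stored pattern is compared against the super-strips received on each layer, and patterns matched on at least a threshold number of layers are reported as roads, with DC bits modeled by ignoring some SSID LSBs. This only mimics the input/output relation of the chip (the names and the 7-of-8 threshold are illustrative), not its internal, fully parallel implementation.

# Behavioral sketch of AM-style pattern matching (software model, not the ASIC).
N_LAYERS = 8

def ss_matches(stored_ss, layer_ssids, dc_bits):
    """True if any SSID seen on this layer falls inside the (widened) stored SS."""
    return any((ss >> dc_bits) == (stored_ss >> dc_bits) for ss in layer_ssids)

def am_match(pattern_bank, event_ssids, threshold=7):
    """pattern_bank: list of (pattern_id, per-layer SS, per-layer DC bits);
    event_ssids: per-layer sets of SSIDs received during the event."""
    roads = []
    for pattern_id, layer_ss, layer_dc in pattern_bank:
        matched_layers = sum(
            ss_matches(layer_ss[l], event_ssids[l], layer_dc[l])
            for l in range(N_LAYERS))
        if matched_layers >= threshold:        # allow e.g. one missing layer
            roads.append(pattern_id)
    return roads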

The next generation of AM Chips (AM08) is under development, with a focus on speed and latency, to provide a flexible and performant pattern recognition solution in future track trigger applications.

Table 3.1: AM Chip generations

Version   Year   Patterns   Fmax (MHz)   Power (W)   Package   Technology   Area (mm²)
AM01      1992   128        —            —           QFP       0.7 µm       —
AM02      1998   128        —            —           QFP       0.35 µm      —
AM03      2004   5 k        40           1.26        QFP       0.18 µm      100
AM04      2012   8 k        100          3.7         QFP       65 nm        14
AM05      2014   3 k        100          —           BGA       65 nm        12
AM06      2015   128 k      100          3.0         BGA       65 nm        150
AM07      2016   16 k       200          0.1–0.2     BGA       28 nm        10

Figure 3.3: The CMS detector (Source: CMS Collection © 2016 CERN, License: CC-BY-4.0)

3.1.5 The CMS detector

The massive CMS detector [57], measuring 21.6 m in length and 14.6 m in diameter, and weighing an impressive 14 000 t, is the second largest general-purpose detector at CERN. ATLAS and CMS share common goals, but use different technical solutions and magnet systems to achieve them [58, 59]. The structure of the detector is similar to that of ATLAS; it, too, consists of an Inner Detector, Electromagnetic and Hadron calorimeters (ECAL and HCAL), a muon detector system, and a magnet system.

A major difference between the two detectors lies in their magnet systems. ATLAS uses a 2 T solenoid magnet for the ID and separate barrel and end-cap toroid magnets for the calorimeters; CMS, on the other hand, uses a single, stronger, 4 T solenoid, which allows the tracker to provide a better momentum resolution. The design of the magnet systems also strongly affects the performance of the two calorimeters: the Electromagnetic Calorimeter (ECAL) of ATLAS lies outside of the solenoid magnet, resulting in losses due to the magnet's material, while the CMS ECAL is located inside the solenoid, taking advantage of its strong magnetic field (figure 3.3). On the other hand, due to the constraints imposed by the CMS solenoid design, its Hadron Calorimeter (HCAL) is lacking in space and energy resolution when compared to the ATLAS one. The muon systems, albeit different themselves, are also affected by the different magnet systems; while the ATLAS muon system has an almost uniform resolution with regard to η, the CMS muon system has a better momentum resolution in the barrel region (|η| < 0.9) which degrades quite fast towards the endcaps (for higher η), but when combined with the ID measurements it allows a better overall muon momentum resolution for the most part of the η–pT space.

The all-silicon CMS Inner Detector, with a total sensitive area of more than 200 m², is the centerpiece of the detector. It consists of an inner pixel detector and a strip detector in the outer region. The inner pixel detector consists of three barrel layers (at radii of 4.3 cm, 7.3 cm, and 10.4 cm) and two disks on each side. It features 65 × 10⁶ readout channels, with the pixel size being 100 × 150 µm². The strip detector has ten more barrel layers, extending up to a radius of 130 cm, and adds another 9.6 × 10⁶ readout channels. That gives the ID a total of 75 × 10⁶ channels.

3.1.6 The AM-Chip Approach for Level-1 Tracking at CMS

As mentioned before, the L1 trigger systems in the LHC general-purpose detectors typically reduce the rate of events from the machine event production rate (40 MHz) down to 100 kHz, within very tight latency constraints of the order of ≈ 4.5 µs. After the L1, the HLT performs the event reconstruction (including tracking), using large dedicated CPU farms [60].

Track reconstruction (tracking) in hadron collider experiments is often considered an important asset to the event selection. However, the processing power necessary to perform high-quality tracking for real-time event selection at the very high rates required by the L1 triggers has never been available in any LHC experiment, as reconstructing the trajectories of thousands of subatomic particles while discarding the huge amount of noise is very computationally expensive.

One of the proposed solutions to introduce track reconstruction at the L1 trigger for the HL-LHC CMS upgrade, currently under study in the collaboration, is based on the usage of AM Chips [61]. The HL-LHC upgrade involves increasing the L1 rate to 750 kHz, while relaxing the latency to a (still very tight) 12.5 µs. The CMS Tracker will be segmented in 48 regions in η−ϕ (pseudorapidity (η) and azimuthal angle) called trigger towers. In each trigger tower, track finding will be performed using data from the silicon modules belonging to that region. Each tower will receive up to about one hundred stubs⁸ per layer for each bunch crossing; currently the system works with a fixed number of six layers. This number is a factor of four smaller than the corresponding number in ATLAS, due to the filtering performed by the pT-modules [62]: they retain and propagate only a fraction of the data, discarding those stubs coming from tracks with Transverse Momenta (pT) lower than ≈ 2 GeV.

Several Advanced Telecommunications Computing Architecture (ATCA) technology–based Pulsar IIb boards [63] will collect data from each trigger tower. Each Pulsar IIb board will house two Pattern Recognition Mezzanine (PRM) boards performing pattern recognition and track fitting (see figure 3.4). The PRMs, in turn, comprise a state-of-the-art FPGA, implementing data flow management, combinatorics reduction algorithms, and the track fitting, and 12 AM Chips.

Figure 3.4: The proposed system hierarchy for L1 tracking: ATCA crates house Pulsar IIb boards that, in turn, house PRM mezzanines

Each PRM receives and temporarily stores full-resolution stub data in a “smart cache”, while evaluating a reduced resolution representation of the stub data (SSs) and transmitting it to the AM Chips; then, the full-resolution stubs corresponding to the roads read back from the AM Chips are retrieved and further filtered by the Track Candidate Builder (TCB) step to reduce the combinatorics. The chain is such that the pattern recognition and TCB steps reduce the stub combinations to be fit by orders of magnitude. Finally, they are fitted by the TF module using a linear approach similar to the one used in the FTK, based on Principal Component Analysis (PCA) methods [64], where the high-resolution parameters, as well as the χ² values used as goodness-of-fit indicators, are calculated using matrix multiplications involving pre-computed matrices. In this specific implementation the problem is split and treated separately for the R−ϕ plane, on which the two track parameters q/pT and ϕ are computed, and for the R−z plane, on which the z₀ and η parameters are extracted, instead.

By efficiently utilizing the DSP and BRAM resources, which are plentiful in the Kintex Ultrascale FPGA devices, it is possible to achieve great fitting performance [65]. The FPGA implements, in addition to the track reconstruction–related processing and the data flow management, several auxiliary functions to control the AM Chips (e.g., storing the pattern bank in a Serial Peripheral Interface (SPI) Flash memory, configuring and loading the AM Chips with the right pattern bank) and several monitoring and debugging features.

⁸ A stub is a full-resolution detector hit that passes some preliminary filtering performed in the readout modules of the CMS detector. For the purposes of this text it can be considered equivalent to a hit.

The final goal of the PRM board is to evaluate the performance of the real-time system described above using state-of-the-art AM Chips and to shape the necessary modifications to the FPGA design to meet the tight CMS bandwidth and latency constraints of 3 µs [62].

3.2 Related Work

The DO algorithm, described in the next section, is a redesign of the existing DO developed for the FTK project, which is described in the Technical Design Report (TDR) [46]. The existing FTK DO architecture [46, pp. 78–79] has been designed with the current system in mind; that involves a number of limitations that impede its usage in a different application or a future, more demanding trigger application. Specifically, it is not optimized for performance, operating at 140 MHz; it targets 14-bit SSIDs, whereas the AM Chip can support up to 16 bits; finally, there is a constraint on the order in which the full-resolution hits have to arrive: they have to be ordered by SS. These facts called for the development of a novel DO architecture, offering performance to match the future AM Chip generations, while imposing as few limitations as possible. As explained below, the architecture described in this text provides superior performance when compared to the original FTK implementation, and eliminates the limitations that would impede its adoption outside the confines of the FTK system. To the best of the author's knowledge, there are no comparable implementations outside of this.

The TF algorithm presented in section 3.5 on page 88 is also an effort to redesign the FTK implementation with goals of better performance, simplicity and flexibility. Indeed, the systolic array architecture presented is simpler than the original presented in [46, pp. 79–81], and for similar resources achieves more fits/s.

The adoption of these algorithms by a CMS L1 tracking trigger R&D demonstrator results in a very performant system that manages to reach the design goals—even before lengthy optimizations. There have been other approaches proposed, such as Tracklet [66] and TMTT [67], but a comparison of full tracking trigger systems would be a system-level comparison. That has to take into account the selection of algorithms that comprise each system—not different implementations of similar algorithms. That would, therefore, be well beyond the scope of this text.

3.3 The Data Organizer

The DO is a core component of an AM Chip–based track reconstruction FPGA firmware, interfacing the pattern matching step with the generation of potential track combinations and the full-resolution track fitting. Its position in a firmware of that kind can be seen in figure 3.32 on page 114, to better outline its role in the context of a complete system.

Pattern recognition in the AM Chip is performed using low-resolution identifiers in place of full-resolution hits. These 16-bit low-resolution identifiers are called SSIDs. The DO stores the full-resolution hits (during the write phase, the datapath of which is shown in red in figure 3.5 on the next page) according to their respective SSID. After the results of the pattern matching are available as road IDs, and translated back into the SSIDs that make up these roads by means of an external memory used as a LUT, the DO uses these SSIDs to retrieve the full-resolution hits for each road (this is called the read phase, the datapath of which is shown in green in figure 3.5 on the following page). Since the same memory structures are used in the two phases, the DO cannot be both written to and read from concurrently—it will become evident during the description of the DO architecture that this would lead to corruption of the memory contents.

3.3.1 Special Features of the Data Organizer

There are a number of requirements that have to be met by the DO component, complicating its architecture and making the design of a fast and compact implementation a challenging task.

First of all, the AM Chip's variable-resolution pattern capability has to be supported [68, 69]. By default, each one of the SSs that comprise a stored pattern⁹ has all of its bits used for the comparison. The AM Chip's DC bit capability allows ignoring a variable number of Least Significant Bits (LSBs), providing the option to effectively widen the pattern on a per-layer basis. With up to three LSBs able to be configured as DC bits, the effective width of a SS can be extended to eight times its basic value. Fine-resolution patterns provide better fake rejection, but require more space; conversely, coarse-resolution patterns intrinsically require less pattern space, but matched patterns may encompass prohibitive amounts of track combinations which then have to be processed. By allowing variable-resolution patterns, a balance between fine and coarse resolution can be achieved, improving the overall efficiency of the pattern bank.

⁹ That would be one SS per layer.

Figure 3.5: High-level Track Trigger FPGA design block diagram, showing the main building blocks of the Data Organizer module

Furthermore, the DO architecture should impose no hard limit on the number of hits that can be stored and subsequently retrieved from a SS, and no restrictions on the relative order in which the full-resolution hits belonging to some SS arrive to be stored (no ordering by SSID should be imposed on the input).

3.3.2 Implementation Details

The basis of the DO architecture is the DO core, which is replicated for each detector layer. The architecture allows all the cores to work in parallel and independently of each other. A very high-level block diagram of the DO core can be seen at figure 3.5 on the preceding page. Inside the DO core, there is a set of BRAM-based memories working together to form a sparse array of linked lists in hardware. In the write phase, for each SSID associated with one or more hits, a singly-linked list is formed to accommodate them, while in the read phase, hit data is extracted by traversing these linked lists.

3.3.2.1 Organization of Memory Structures

The DO core is built around three main memories: the Hit List Memory (HLM), the Hit List Pointer (HLP), and the Previous Address Memory (PAM). The full-resolution hits are written to the HLM in the order in which they arrive (in figure 3.5 on the facing page this is represented by the 10-bit counter, which is incremented at each hit arrival). The HLP memory locations have a one-to-one correspondence to all possible SSIDs. At the same time a hit is written to the HLM, its corresponding SSID addresses the HLP memory; the content of this address is then updated with the address of the hit in the HLM. In case this SS already had some hit stored, the HLP already has some contents—these will be overwritten, but not thrown away. The invalidated address the HLP held for that SS is written to the PAM at the address the new hit occupies in the HLM. Thus, the PAM holds the address in the HLM of the previous hit that belongs to the same SSID, if there is one; or zero, if the first hit of the SSID is concerned (no more previous hits).¹⁰ Auxiliary to those memories, there is also a fourth memory, the Hit Count Memory (HCM). Its function supplements that of the PAM, which keeps tabs on the HLM addresses of previous hits; the HCM tracks how many of those hits there are in a SS. This memory helps increase read performance; its exact purpose will be explained in detail in subsection 3.3.2.5 on page 75. An example of this complex writing procedure, to better visualize it, can be seen on figure 3.6 on the following page.

This memory scheme can be described using programming equivalents. The HLP can be considered a statically allocated array; each of its elements either holds a pointer to the head of some linked list, or a zero. Then, the HLM and the PAM, using a common memory address, represent structs; the HLM part of the struct holds the data of the linked list element, and the PAM part contains the pointer to the next element of the list.

To read the contents of a road received from the pattern matching step, first a LUT called the AMmap (implemented by an external memory) is used to decode the road ID into SSIDs. For each SSID, the HLP memory returns the address that, in the HLM / PAM structure, points to the head of the linked list created for that SS. The PAM memory initially gets addressed by that and then by its own content, producing one by one the HLM addresses of the hits belonging to that SS; at the address of the last hit of the SS it will return an address of zero, signaling that all the hits written for that SS have been read. At that point, the HLM / PAM memory combination is ready to process the linked list of the next SS.

¹⁰ For that reason, address 0 of the HLM / PAM memories is unused, and data is written from address 1 and on.

Figure 3.6: DO write example
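To make the write and read procedures concrete, the following is a reduced-size software model of the HLM / HLP / PAM / HCM scheme described above (the RV register file, event resets, and all pipelining are omitted, and the hit values are purely illustrative); it is a behavioral sketch, not the RTL.

# Reduced-size behavioral model of the DO memory scheme: a sparse array of
# singly-linked lists. Address 0 is reserved as the "no previous hit" marker,
# so writing starts from address 1 (as in the actual design).
HLM_DEPTH, N_SSIDS = 16, 32

HLM = [None] * HLM_DEPTH   # Hit List Memory: hits in arrival order
HLP = [0] * N_SSIDS        # Hit List Pointer: head of each SSID's list
PAM = [0] * HLM_DEPTH      # Previous Address Memory: next (older) list element
HCM = [0] * N_SSIDS        # Hit Count Memory: hits stored per SSID
write_ptr = 1

def write_hit(ssid, hit):
    """Write phase: append the hit and link it in front of its SSID's list."""
    global write_ptr
    addr = write_ptr
    HLM[addr] = hit
    PAM[addr] = HLP[ssid]          # old head of the list (0 if none yet)
    HLP[ssid] = addr               # the new hit becomes the head
    HCM[ssid] += 1
    write_ptr += 1

def read_ssid(ssid):
    """Read phase: walk the linked list of one SSID, newest hit first."""
    hits, addr = [], HLP[ssid]
    while addr != 0:
        hits.append(HLM[addr])
        addr = PAM[addr]
    return hits

write_hit(2, "FAB8"); write_hit(5, "AB1D"); write_hit(2, "13EB")
assert read_ssid(2) == ["13EB", "FAB8"] and HCM[2] == 2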

3.3.2.2 Register File Implementation

In case the missing layer or variable-resolution pattern features are being used, there will be read requests for SSs to which no data have been written. If the HLP, which is the main pointer to the linked list structures the other two memories implement, contains leftover data from previous events, that would lead to invalid data being read instead of the SS being skipped. To cover these cases, the HLP contents could be reset across events. However, BRAM memory contents cannot be reset in a single clock cycle, so a mechanism to indicate which areas of the memory actually contain valid data has to be designed.

To overcome this, a Region Valid (RV) register file that is reset at the beginning of each event is used. When a memory location of the HLP is written to for the first time in an event, the corresponding bit of that register file is raised. That way, in the read phase, it is clear whether the pointers the HLP implements are valid (i.e. whether the SS requested has valid data in the current event). If each HLP address had a corresponding bit in the register file, its size would have to be 64 kbit, which might sound acceptable as it isn't beyond the capacity of a modern FPGA. Multiplying that by eight layers, though, would bring the total number of registers to 512 k, which is excessive.

To reduce the size of the register file, the HLP memory width has been selected in such a way that each HLP memory location covers 32 SSIDs, for a memory width of 320 bits. Thus, each bit of the RV register file indicates whether current event data have been written in any of the 32 SSIDs that make up its corresponding region. This reduces the size of the RV to just 2 kbit, which becomes 16 kbit for eight layers.

The first time in an event that data are written in a region, the whole region gets reset before the writing. In the read phase, if data are requested from a non-activated region (in the case of a missing layer, for example), the request is ignored. This can also have a favorable impact on performance: it makes it possible to read, in one clock cycle, all the possible hit locations of the SSIDs requested by patterns with any number of DC bits enabled. Hence, in the read phase, the HLP can fetch all the addresses of a pattern in a single clock cycle. As will be discussed in subsection 3.3.2.5 on page 75, the HLM / PAM memories can be duplicated, providing a read bandwidth high enough to make use of the fast, parallelized HLP output.
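The region bookkeeping itself is simple: with 32 SSIDs per HLP location, the region index is just the SSID with its five least significant bits dropped. A minimal sketch of the write-side and read-side checks follows (the helper name clear_hlp_region is hypothetical, standing in for the hardware region reset):

# Region Valid (RV) bookkeeping sketch: one bit per 32-SSID region, per layer.
SSIDS_PER_REGION = 32                     # 320-bit HLP word = 32 x 10-bit addresses
N_REGIONS = 64 * 1024 // SSIDS_PER_REGION # 2048 regions for 64k SSIDs

region_valid = [False] * N_REGIONS        # cleared at the start of each event

def region_of(ssid):
    return ssid >> 5                      # same as ssid // SSIDS_PER_REGION

def on_write(ssid):
    r = region_of(ssid)
    if not region_valid[r]:
        # first write to this region in the event: reset the whole HLP region first
        # clear_hlp_region(r)             # hypothetical stand-in for the hardware reset
        region_valid[r] = True

def read_is_valid(ssid):
    # reads from regions without current-event data are simply ignored
    return region_valid[region_of(ssid)]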

To achieve optimal performance in the RV register file, a custom mux structure has been developed. Synthesis tools generally do a good job utilizing the CLB resources to implement small multiplexers, but can prove inefficient when it comes to inferring very wide multiplexer structures [70, p. 2]. To construct the mux of the RV, first the CLB resources are used efficiently to form a 16:1 multiplexer in Kintex-7 [71, p. 42]; a Kintex Ultrascale CLB allows a 32:1 multiplexer to be implemented [72, p. 21]. Then, these multiplexers are chained to form wider ones, with the 2 k:1 multiplexer formed by just three pipelined levels (see figures 3.7a and 3.7b on the following page). The larger multiplexers that the Kintex Ultrascale architecture can accommodate in a single CLB, compared to Kintex-7, can be seen to somewhat mitigate the congestion between slices, as the number of wires between slices is halved (66 compared to 136 in the Kintex-7 implementation); of course, in this application the congestion is dominated by the number of inputs to the multiplexer. The code for the Kintex-7 mux implementation is listed on appendix D.2 on page 151.

Figure 3.7: 2k multiplexer design. (a) Kintex-7 implementation: eight blocks of 16 × 16:1 multiplexers (256 inputs each) followed by a 16:1 stage, combined by a final 8:1; (b) Kintex Ultrascale implementation: two blocks of 32 × 32:1 multiplexers (1024 inputs each) followed by a 32:1 stage, combined by a final 2:1. Register stages separate the pipelined levels.

3.3.2.3 Write Details and Data Collision Detection

The DO write phase, as has already been described, has many moving parts. Although the HLM is simply written to incrementally, both the RV and the HLP have to be read before the latter is written to again, to check the validity of the region and update the PAM with the last known hit address of the SS. That, however, entails a certain latency, especially since the HLP memory is relatively large, spanning eighteen BRAMs.¹¹ More recent hits that arrive within this latency also need to be written, with any collisions necessitating rewrites that would make the write time non-deterministic.

To solve this, the write data are accepted in packets of 8, followed by idle periods of eight clock cycles (figure 3.8a on page 74).¹² During the eight cycles a packet is fed into the DO, reads are issued for the RV and the HLP contents. In the eight cycles that follow, the RV and the HLP reads are finished, all collisions within the batch of eight hits are examined and resolved, and the correct data are written to the memory structures. Thus, the input rate for the hit writing phase equals half the operating frequency, ranging from 200 Mhit/layer/s to 250 Mhit/layer/s.

¹¹ To operate a large memory at a high frequency, both the input and the output have to be pipelined, so the memory inputs are physically spread and the outputs collected throughout the device.
¹² A packet is allowed to comprise less than 8 hits, but 8 cycles still have to pass between the last hit of a packet and the first hit of the next one.
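The collision handling within a packet can be pictured as follows: for a hit whose SSID already appeared earlier in the same packet, the previous-address written to the PAM must come from that earlier hit rather than from the (stale) HLP read. A simplified software view of that resolution step, not the actual RTL:

# Simplified view of same-SSID collision resolution within one 8-hit packet:
# while the HLP read issued at packet start is still in flight, later hits with
# an already-seen SSID must be chained to the earlier hit of the same packet.
def resolve_packet(packet_ssids, hlp_snapshot, first_addr):
    """packet_ssids: SSIDs in arrival order; hlp_snapshot: HLP values read at
    packet start; first_addr: HLM address assigned to the packet's first hit."""
    pam_writes, new_hlp = [], dict(hlp_snapshot)
    for offset, ssid in enumerate(packet_ssids):
        addr = first_addr + offset
        pam_writes.append(new_hlp.get(ssid, 0))  # PAM entry of this hit
        new_hlp[ssid] = addr                     # later same-SSID hits chain here
    return pam_writes, new_hlp                   # PAM contents and final HLP updates

# two hits to SSID 7 inside the packet: the second one points back at the first
pam, hlp = resolve_packet([7, 3, 7], {}, first_addr=1)
assert pam == [0, 0, 1] and hlp[7] == 3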

3.3.2.4 Sorting by Data Valid using Python-initialized ROMs

It has already been mentioned that the HLP provides all the eight possible base SSID addresses in a single clock cycle, if the variable-resolution pattern functionality is in use; that means that it can process a full road per clock cycle. Of course, not all the SSIDs of a road are guaranteed to contain data. That leaves the output, which we can consider to be an 8-bit signal of data valid bits plus the actual address data, in an uncompressed, non-trivial to process, state. Ideally, it would be compressed into a form that resembles the thermometer code (unary coding), so it could be more easily streamed into a shift register–type processing pipeline.

To achieve this sorting by Data Valid operation in a way that supports the high operating frequency target of the DO, custom logic had to be designed. A bubble sort or a merge sort algorithm could be implemented and give a fast circuit, but a fully parallel approach was chosen instead, in order to achieve minimal latency. Bit indexing is performed for each set bit of the 8-bit data valid word in parallel; based on this bit indexing, the input words are compressed into the lower elements of the output array. In code listings 3.1 and 3.2 on page 75, pseudocode is used to describe in an algorithmic way the count leading zeros operation and the bit indexing that is required; also, figure 3.9 on page 75 displays an example to explain better, in a visual way, the transformation that has to be performed.

Operations that find the binary index of the first and the last set bits in a word are very commonly used, with many popular computer architectures (such as the ubiquitous x86) including special instructions to implement them. For CPU architectures without any such instructions,¹³ a very elegant solution involving de Bruijn sequences has been developed [73].
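For reference, the de Bruijn technique works by isolating the lowest set bit, multiplying it by a de Bruijn constant, and using the top bits of the product to index a small table; the 32-bit sketch below illustrates that CPU-side trick from [73], not the LUT-based approach used in this design.

# De Bruijn trick for finding the index of the lowest set bit of a 32-bit word.
# The lookup table is built by construction, so it matches the chosen constant.
DEBRUIJN = 0x077CB531                       # a 32-bit B(2,5) de Bruijn sequence
TABLE = [0] * 32
for i in range(32):
    TABLE[((DEBRUIJN << i) & 0xFFFFFFFF) >> 27] = i

def lowest_set_bit_index(x):
    """Index of the lowest set bit of x (x must be a non-zero 32-bit value)."""
    lsb = x & -x                            # isolate the lowest set bit
    return TABLE[((lsb * DEBRUIJN) & 0xFFFFFFFF) >> 27]

assert lowest_set_bit_index(0b10100000) == 5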

To exploit the FPGA resources in order to solve this problem in a resource-efficient and performant way, custom logic employing ad-hoc programmed LUTs has been developed. The Kintex-7 and Kintex Ultrascale architectures provide the possibility of arranging 6-input LUTs and fast CLB multiplexers to form ROM and RAM¹⁴ memories. Since the input Data Valid signal we want to index is eight bits wide and the index output is three bits for each input bit, three such 256 bit × 1 bit ROMs are enough to index an element, taking up four LUTs, two F7 muxes, and one F8 mux (half a CLB) each. Having the input elements indexed by this compact ROM array (it takes up just 12 CLBs), a mux array is used to take into account the DC bits and provide the data outputs, sorted and filtered. A block diagram of this logic can be seen at figure 3.10 on the next page.

¹³ Nevertheless, a 32-bit multiply instruction with a 64-bit result has to be provided.
¹⁴ RAMs can only be implemented in SLICEM-type slices.

Figure 3.8: DO write phase (a) and read phase (b) waveforms

Figure 3.9: Sort by Data Valid example

function clz(x)
    if x = 0
        return -1
    r ← 0
    while x[r] = 0
        r ← r + 1
    return r

Code Listing 3.1: Count Leading Zeros (clz)

function sort_addrs(dv[0..7], d[0..7])
    i ← 0
    ret_array ← []
    while dv > 0
        a ← clz(dv)
        dv[a] ← 0
        i ← i + 1
        ret_array ← [ret_array, a]
    return ret_array, d[ret_array]

Code Listing 3.2: Sorting function algorithmic representation

To produce the initialization data for the ROMs, a Python program has been written; its code can be seen on appendix D.1 on page 149. This program computes the right output for all possible 8-bit inputs and, based on that, generates the Verilog HDL code that instantiates the memories. The code produced uses the Kintex-7 and Kintex Ultrascale FPGA family primitives directly.
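As a minimal illustration of what such a generator computes (this is not the appendix D.1 program itself): for every possible 8-bit Data Valid value, the n-th ROM has to hold the 3-bit index of the n-th set bit, which in hardware is then split over three 256 × 1 bit ROMs.

# Illustration of the ROM contents for the parallel sort-by-Data-Valid logic:
# for each 8-bit data-valid value, the n-th table entry is the index of the
# n-th set bit. In hardware each 3-bit entry is split across three 1-bit ROMs;
# the real generator additionally emits the Verilog primitive instantiations.
def nth_set_bit(dv, n):
    indices = [i for i in range(8) if (dv >> i) & 1]
    return indices[n] if n < len(indices) else 0   # unused entries default to 0

roms = [[nth_set_bit(dv, n) for dv in range(256)] for n in range(8)]

assert roms[0][0b10010100] == 2     # first set bit of 0b10010100 is bit 2
assert roms[1][0b10010100] == 4     # second set bit is bit 4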

3.3.2.5 Multiple Read Ports

Due to the latency the linked-list structure entails—imposed by the read from the PAM memory, necessary to get the address of successive hits of a SS—the read operation pipeline around the PAM / HLM memories has an initiation interval of two. This means that with one memory read port in use, a new hit for a SS can be generated every two clock cycles. The same memory port, however, can be used to read hits from two different SSs. By replicating some read logic around the memory structures, a memory port can be broken into two “virtual” read ports, each having exactly half the physical port's bandwidth, i.e. one hit per two clock cycles.

Figure 3.10: LUT-based parallel sort by Data Valid example

To maintain the read bandwidth—which is more important to the overall performance than the write bandwidth—the dual “virtual” read port strategy has been implemented. This approach delivers a total sustained read bandwidth of 400 Mhit/layer/s—one hit per layer per clock cycle, at an operating frequency of 400 MHz; that matches the HLP read bandwidth (of 400 Maddrs/layer/s) that feeds the PAM / HLM structures.

The details of the implementation can be seen at figure 3.11 on the next page. The HLP takes the SSIDs and DC bits as main inputs, along with the road ID and End of Event bit; in turn, it produces a block of 32 HLM addresses, along with the five least significant bits of the input SSID that represent a 5-bit address within that block. By combining that data with the forwarded DC bit information and the output from the RV register file, the appropriate logic produces a block of 8 addresses, along with other accompanying data of the requested road. Each block of data corresponds to a road read request. These requests are distributed equally across the replicated logic, by switching between the two datapaths at every request. Each of these datapaths sorts the base addresses and serializes them; as the serialized output is produced, the ports of the HCM memory (one port for each instance of the replicated read logic) are queried to get the count of the hits stored after each base address.¹⁵ These base addresses of the SSs and their associated hit counts are written to a pair of FIFOs.

Figure 3.11: Data Organizer read datapath block diagram

Finally, a complex logic block multiplexes the output of the two FIFO pairs, alternating between them every clock cycle, and, using one of the two ports of the HLM / PAM structure, reads all the hits associated with the SSs and, consequently, with the roads requested. The hits out of the HLM are naturally split into two output read ports, each one of which can have data every second clock cycle.

For more demanding systems that can benefit from further increasing the read bandwidth, the DO can be configured in such a way that both physical HLP / PAM / HLM ports are used. In this configuration, the roads are read out by four “virtual” memory ports, in parallel. With this dual-port option enabled, the DO delivers a sustained output rate of 800 Mhit/layer/s; thus, to exploit it fully, the block that follows the DO and produces the hit combinations to form potential tracks must be able to handle that bandwidth. The increase in resources in the dual-port configuration is comparatively small. In terms of memory, it only requires one more BRAM; each “virtual” memory port uses a physical memory port of the HCM, so two copies of this have to be maintained. The RV register file does not need to be replicated; an extra read port is added to it, instead. Finally, the increase in LUTs due to the logic replication is small compared to the deeper and more complex write logic (that remains unchanged). The configuration is as simple as setting a parameter upon instantiation and making sure the extra ports are connected.

¹⁵ The reader can refer to figure 3.6 on page 70 to go over the memory structure, if needed.

3.3.3 Verification Environment

The DO is a complex component, with lots of corner cases. Simple events could potentially trigger bugs just as easily as huge events, and looking at waveforms to see if the design works as intended is impossible. This is a compelling reason to develop an automated, self-checking testbench to verify correct DO operation. As described in detail in appendix B.3 on page 141, plain SystemVerilog provides handy and powerful tools to build such a testbench. In order to explore more advanced testbench-building methodologies, however, a combination of Assertion-Based Verification (ABV) with a UVM testbench was used instead.

3.3.3.1 Use of SVA

The assertion and cover properties written in the SystemVerilog Assertions (SVA) language and used across the design serve two distinct purposes: (i) to indicate that something went wrong, and where; and (ii) to indicate that a specific case has been covered in the simulation. An example of each will be shown here to better illustrate their usage and importance.

On code listing 3.3, a code snippet of a very simple property assertion is shown. The property shown makes sure that a road_end signal is always accompanied by valid data.

property no_detached_roadend(val, roadend);
    @(posedge clk) disable iff (rst)
        roadend |-> val;
endproperty

generate
    for (genvar i=0; i < N_PORTS; i++)
        assert property (no_detached_roadend(val[i], roadend[i]));
endgenerate

Code Listing 3.3: Code snippet of a simple assertion

Figure 3.12: Simplified UVM testbench diagram (hierarchy: top → test → env → agent; the env contains the Scoreboard and Coverage components, the agent contains the Driver, Sequencer, and Monitor, which connect to the DUT through the Interface)

This is then enforced for each port of the DO. In the event that this condition is violated, an error message will pop up, informing us of this abnormal condition, along with a timestamp. Careful placement of simple assertions like this one throughout the design can greatly speed up the debugging process.

On code listing 3.4, a code snippet of cover property usage is shown. The property triggers when, within the same 8-long packet (see subsection 3.3.2.3 on page 72), there is more than one write request targeting the same address. This is formulated here by the prev_addr_r signal taking the same value the curr_addr signal had some clock cycles before (the number of clock cycles being smaller than 8, as that is the duration of a packet). Since this has been considered an important corner case that we definitely want to see tested during the simulation, the generate block makes cover cases for all the possible arrangements of same-address requests within a packet. Careful placement of such cover properties makes it easy to visualize how effective a test run has been in “stressing” the design.

property same_curr_addr_and_prev_addr_in_block(i);
    logic [9:0] ma;
    @(posedge clk) disable iff (rst)
        fwe ##1 (1, ma=curr_addr) ##i (ma==prev_addr_r) ##2 countmem_web;
endproperty

generate
    for (genvar i=1; i<8; i++)
        cover property (same_curr_addr_and_prev_addr_in_block(i));
endgenerate

Code Listing 3.4: Code snippet of a cover property

3.3.3.2 A Short Introduction to UVM

The UVM is a methodology for constructing functional verification environments; its highlights are its flexibility, its hierarchical structure that facilitates verification component reuse, and its coverage-based verification features. It is typically employed for verifying complex ASICs. An overview of a very simple UVM-based testbench can be seen on figure 3.12 on the previous page. The interface is a central component in this methodology, abstracting the ports of the DUT (Design Under Test) and making them usable by instances of class-based objects. It may feature clocking blocks to facilitate synchronous communication between the DUT and the testbench, tasks that can be called by the testbench to operate on the DUT ports (i.e. instead of initializing all input ports one by one, a single call to a task provided by the interface can be made), and more. The sequencer is the component that forms the (typically) random but constrained input data for the test, as transactions. These are then fed into the driver module, which translates these transactions into wire-level operations, controlling the interface. The monitor is tasked with reading back the response of the DUT and turning it into transaction-level objects. These go to the scoreboard component, which computes the expected output and compares the DUT output with that, logging the results accordingly. The coverage module can also receive input from the monitor, terminating the simulation when the desired level of coverage has been reached. The driver, sequencer, and monitor objects form an agent; this is assigned to a specific logical interface of the DUT. It is evident that the structure is deeply hierarchical, and this fact promotes code reuse and quality, especially for larger designs.

3.3.3.3 The Testbench

The UVM testbench that has been developed generates events in a random but constrained and tunable manner, feeds them to the design, records its output, and compares it to the expected output the testbench computes—which is easy to do when using SystemVerilog like a software language. It does this for a number of events, until it achieves a predetermined amount of coverage or a certain number of events. At the end it reports any errors in the output of the design, and statistics on the simulation run. Such a verification environment has been put in place early in the development cycle, and it has proved to be extremely valuable in catching regressions and finding and debugging corner cases. It is also really small, all things considered, with the initial version that tests a single DO layer in a simplified way, not doing any coverage-driven simulation, being a little below 500 Lines of Code (LOCs); this version can be seen on appendix D.7 on page 162.

This simple testbench was used as a foundation for later, more complex implementations of the testbench, which ensured correct operation of the DO for various scenarios:

• Events without any hits • Events with hits, but without any roads • Events with just one road • Repeated readout of some roads • Events without any collisions in the write phase • Events with all hits (except the first one) leading to write collisions • All write collision combinations • Complex events with loads of hits and roads • Behavior after reset

These cover the cases that will definitely occur in normal operation, but the scenarios were additionally tuned to ensure that as many potentially troublesome corner cases as could be foreseen when designing the component would be sufficiently tested.

3.3.4 Performance, Resource Utilization and Power

The DO component targets Kintex-7 and Kintex Ultrascale series devices, which represent the mid-range of Xilinx FPGAs. The component comprises identical copies of the one-layer “core” and is parametric as to the number of layers that can be supported; to remove this variable from the parameter space, the resource utilization figures will correspond to this one-layer “core”, in various configurations.

The resource utilization results are shown in table 3.2 on the next page. The Kintex-7 and Kintex Ultrascale devices chosen for the implementation runs are xc7k325 and xcku060, respectively; fmax is reported for both medium (-2) and high (-3) speed grades.

It can be seen that the HIT_W parameter affects only the FF and BRAM utilization; that is because it concerns a quantity that is only propagated and stored (it does not enter any logic). There is one more similar parameter in the DO core, the ROAD_ID bit-width, but it is omitted; that is because it is just propagated, and would only have an impact on the number of registers used. Moreover, it practically assumes a range of 17–21, and variations within that (tight) range would hardly have an impact on the FF utilization.

The double read port configuration adds one 18 k BRAM and, as expected, has increased overall LUT and FF resource utilization. That increase is not dramatic, however, especially considering the doubling of read performance that is achieved.

(a) Top-level view (b) Zoomed-in view

Figure 3.13: The DO core constrained in a triangular-like area. Slices primarily implementing the RV register file are shown in green, the read logic in yellow, and the main memory / write logic in purple.

Table 3.2: Data Organizer resource utilization results

                   FF (k)        LUT (k)       BRAM (#)      fmax (-2) (MHz)   fmax (-3) (MHz)
N_PORTS   HIT_W    k-7    us     k-7    us     k-7    us     k-7     us        k-7     us
   2       32      12.6   12.5   7.7    7.4    20     20     400     450       450     500
   2       126     14.2   14.2   7.7    7.7    22.5   22.5   400     450       450     500
   4       32      16.0   16.0   10.0   9.7    20.5   20.5   350     400       400     450
   4       126     18.0   18.0   10.1   10.0   23.0   23.0   333     400       400     450

Table 3.3: Data Organizer fmax results on constrained area

                   fmax (-2) (MHz)    fmax (-3) (MHz)
N_PORTS   HIT_W    k-7       us       k-7       us
   2       32      375       435      425       480
   2       126     360       435      417       500
   4       32      333       385      385       450
   4       126     322       385      370       450

Figure 3.14: DO power consumption breakdown depending on device family and configuration. Approximate values in mW (Clock / Signals / Logic / BRAM): us, 1-port: 120 / 137 / 75 / 182; k-7, 1-port: 322 / 174 / 83 / 97; us, 2-port: 156 / 180 / 104 / 349; k-7, 2-port: 381 / 259 / 117 / 182.

The implementation runs presented on table 3.2 on the facing page took place on an empty device, so the logic was able to spread as much as necessary, which may not be the case in a real, populated, FPGA design. For that reason, more results have been produced with the DO component constrained in a specific area of the device; this way there is around 60% LUT and FF utilization density after synthesis in the area in which they are constrained. The area, containing the placed and routed core, can be seen at figure 3.13 on the preceding page. The shape chosen is a polygon that has a somewhat triangular form; due to characteristics intrinsic to the architecture of the dense write logic (shown in purple), a rectangular area totally crippled the frequency, dropping it by more than 20%, and thus was rejected. To pack multiple layers of the DO in the device, the area chosen can be mirrored with respect to the tilted side, so two layers will still resemble a rectangle and the packing density will not suffer.

Results for those runs are presented on table 3.3 on the facing page, and provide a more accurate representation of the maximum achievable frequency under normal usage of the component, when placed inside a moderately populated FPGA design. The DO involves logic with very wide inputs, such as the write collision detection logic, which has to gather a large number of inputs to be processed, with its results reaching an equally large number of endpoints. When packed into a tight space, that leads to local congestion spots that have an impact on the placement and routing of that logic. It can be seen, though, that the decrease in the fmax is not dramatic; it is within 5–10% for the Kintex-7 and 0–4% for the Kintex Ultrascale.

Estimates for the dynamic power consumption of the DO core were taken for Kintex-7 and Kintex Ultrascale devices, and for single and dual read port configurations, with HIT_W = 32 and f = 400 MHz. As the operating frequency is a defining factor in power consumption calculations, it has been kept constant to compare the two different generations fairly.¹⁶ The results can be seen on figure 3.14 on the previous page. Overall, the Kintex Ultrascale devices display reduced power consumption for the same performance.

¹⁶ The default toggle rate of 12.5 % has been used; also, the memories are always enabled. The Kintex Ultrascale device speed grade was -2, as was the Kintex-7 speed grade. An observant reader might note that some configurations should not make timing at 400 MHz on the -2 Kintex-7s; picking the -3 speed grade part, however, resulted in equal power consumption. That means that we could pick different speed grades for each generation and the results would be the same, so this fact is mentioned for completeness.

It might be interesting to compare the Kintex Ultrascale power estimation for the single-port and dual-port configurations. For the dual-port, the power consumption is 790 mW, out of which ≈ 350 mW is consumed by the BRAMs. For the 1-port configuration the power consumption is ≈ 520 mW, with the BRAMs using 180 mW. It follows that ≈ 63 % of the power consumption difference between the two configurations can be attributed to the increased power consumption when using the BRAMs in dual-port mode.

A comparison of the BRAM power usage between the Kintex-7 and Kintex Ultrascale families brings forth an interesting—and unexpected—observation. Despite showing lower power consumption in all other categories (Clocking, Signals, Logic), the Kintex Ultrascale family displays a BRAM consumption that is almost double that of its predecessor. There is a new power saving feature introduced in the Ultrascale family, the RDADDRCHANGE mode. That skips read operations when both the inputs and the outputs of the memory would stay the same, saving power. However, enabling that feature resulted in a mere 10 mW reduction in BRAM power. The reason behind these unexpected results has been traced to the power estimator tool itself, which for the Kintex-7 failed to maintain the “enabled” status of the BRAMs. Attempts to force it to fully comply were unsuccessful, but it has been made to keep one of the two BRAM ports always enabled; as a result, the BRAM power consumption shot up by 50 %. It stands to reason that if we also managed to keep the other port enabled, the power consumption of the Kintex-7 BRAMs would rise above that of the Kintex Ultrascale BRAMs.

3.4 The Combiner

The DO, for each matched pattern (road) it receives, outputs a list of hits for each detector layer, independently; each layer has its own core and interface, and they all work in parallel. The hits of the different layers have to be combined in all possible ways, and then each one of these combinations has to be fed to the TF module to be evaluated as a track.

The combiner’s function is not complicated but, in order to be able to process one combination per clock cycle, the RTL implementation has a long critical path; this prevents its operating frequency from matching that of the TF. As will be discussed in subsection 3.4.2 on page 87, to effectively utilize the computing resources, two combiner units are assigned to each TF; the combiners run at a frequency exactly half that of the TF (fmax,Comb ∼ 250 MHz), and their results are multiplexed at its input. That approach has the additional benefit of efficiently utilizing the extra read ports that the DO provides.

16The default toggle rate of 12.5 % has been used; also, the memories are always enabled. The Kintex Ultrascale device speed grade was -2, as was the Kintex-7 speed grade. An observant reader might note that some configurations should not make timing at 400 MHz on the -2 Kintex-7s; picking the -3 speed grade part, however, resulted in equal power consumption. That means that we could pick different speed grades for each generation and the results would be the same, so this fact is mentioned for completeness.

Since the combiner’s goal is to produce exactly one combination every clock cycle, it is worth mentioning that most roads contain exactly one combination. Thus, it is imperative that any downtime between successive roads be avoided; such downtime would prevent the combiner from reaching the design goal of one track per clock cycle.

3.4.1 Implementation Details

The architecture is straightforward: each detector layer is assigned a BRAM that stores its hits. Each memory is partitioned into a predetermined number of equivalent sectors; each sector holds the hits contained in a road. In the default configuration the memory depth is 512, partitioned in 8 sectors of 64 hits each; these numbers ensure minimal BRAM usage, while providing adequate road processing buffering and a hit capacity for each sector that supports even the most pessimistic scenario. In any case, these parameters (number of layers, maximum number of hits per layer, number of sectors) can be easily modified to optimize resource usage according to the specifications of the target application.
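The sector-based hit addressing can be illustrated with a short sketch in Python (the language used for the rest of the verification software); the constants come from the default configuration above, while the helper name and the cyclic sector reuse convention are illustrative rather than taken from the RTL:

# Illustrative model of the combiner hit-memory addressing (default configuration).
MEM_DEPTH = 512                           # BRAM depth per layer
N_SECTORS = 8                             # roads that can be buffered per layer
SECTOR_DEPTH = MEM_DEPTH // N_SECTORS     # 64 hits per sector

def hit_address(road_counter: int, hit_offset: int) -> int:
    """Address of the hit_offset-th hit of a road; roads reuse sectors cyclically."""
    assert 0 <= hit_offset < SECTOR_DEPTH
    sector = road_counter % N_SECTORS
    return sector * SECTOR_DEPTH + hit_offset

# e.g. the 3rd hit of road 10 lands in sector 2, at address 2*64 + 2 = 130
print(hit_address(10, 2))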

The hits contained in a road, which are spread across the detector layers, are arranged in all possible combinations that form tracks—at this point, a track can be defined as a collection of hits that belong to different layers, for however many layers actually contain hits for a certain road.

The combiner core has two clock domains: one that allows it to interface with the DO so the hit, road and event data are written to the combiner; and one for the actual track combination generation and the TF interface that reads the track combinations from it. The logic blocks designed for these two clock domains will be described separately, as they function independently from each other.

3.4.1.1 DO Interface

For each layer’s hit storage memory, the hits of a road coming from the DO are written sequentially to some sector, as can be seen in figure 3.15a on the next page.

Figure 3.15: Combiner memory structure (a) and read logic block diagram (b)

For each layer there is a FIFO that holds the number of hits written to a sector minus one.17 That way, the read logic has access to the hit count, so it reads out the right number of hits. This approach has the advantage of not requiring any “end of road” hit—this would allocate one more clock cycle per road, making the combiner unable to reach the goal of processing one combination per clock cycle. These FIFOs have a depth equal to the number of sectors, and they provide an almost_full signal that is used to halt the DO readout on a per-layer basis, before the hit storage memories run out of available sectors. The road-related data to be propagated to the read logic are stored in an additional FIFO, read out every time a new road enters the processing pipeline.

The DO control logic also needs the combiner to produce flags that indicate that the combiner has finished processing: (i) a road, and (ii) an event. These flags are returned from the combination generation logic to the DO interface using another FIFO.18

3.4.1.2 Combination generation

As was mentioned before, each layer’s memory is split into sectors, and each sector gets filled up with the hits from a specific road; the sector data are written by the DO asynchronously to the combiner, in a cyclic way, and FIFOs indicate when the writing of a sector is done, and how many items it contains. The concepts are similar to the way an asynchronous FIFO memory works, with empty and full flags being generated for their corresponding clock domains to protect the hit data from being overwritten.

17Such that one hit would produce a value of zero.
18Due to the fact that the end of event cannot be active for more than one clock cycle, that signal could cross clock domains using a “cheaper” type of synchronizer (see appendix A.2 on page 128 for examples). However, given that the end-of-road information needs a FIFO—since one road can be processed per clock cycle, it should be treated as a data signal and not as a flag—and the safe clock domain crossing logic is there anyway, it is being used to pass the end-of-event signal, too, doing away with the need for extra timing constraints.

The block diagram of the combiner unit read logic can be seen in figure 3.15b on the preceding page. There is a cascade of counters and comparators, with one set for each layer. The counters start off with a value of zero; when they reach the reduced-by-one number of hits written (so, for the example of only one hit written, immediately), the comparators raise a “done” flag indicating that the hits for this layer have been fully read out, and the counter is reset. That flag triggers the counter for the next layer to count up, and the one that comes from the last layer triggers the readout of the next road. The architecture’s operating principle resembles a multi-digit counter, but one that features a variable, per-digit, base.
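The operating principle of the cascaded counter/comparator read logic can be sketched in software; the Python model below is an illustration only (not the RTL), enumerating all hit combinations of a road one per "cycle", exactly like a multi-digit counter whose per-digit base equals each layer's hit count:

def road_combinations(hits_per_layer):
    """Yield one hit-index combination per call, mimicking the cascaded counters.

    hits_per_layer: number of hits stored for each layer of a road (layers with
    no hits would simply be skipped by the real logic).
    """
    counters = [0] * len(hits_per_layer)
    while True:
        yield tuple(counters)            # one combination per "clock cycle"
        # Ripple: a layer's "done" flag (counter == hits - 1) resets it and
        # advances the next layer's counter, like a variable-base counter.
        for layer, n_hits in enumerate(hits_per_layer):
            if counters[layer] == n_hits - 1:
                counters[layer] = 0      # done: reset and carry to next layer
            else:
                counters[layer] += 1
                break
        else:
            return                       # last layer carried out: road finished

# A road with 2, 1 and 3 hits on three layers produces 2*1*3 = 6 combinations.
print(list(road_combinations([2, 1, 3])))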

Due to its simple structure, the architecture is easily adaptable for an arbitrary number of layers. However, the inputs to the comparators are read out anew for each road, and the “done” flag from the last layer is what triggers a new road to be read. Thus, increasing the number of layers will elongate the critical path, bringing the operating frequency down. In case more layers are desired, a higher level of parallelism could be reached by changing the ratio of combiner units per TF.

3.4.2 Performance and Resource Utilization

The resource utilization results are shown in table 3.4 on the following page. As for the DO, fmax is reported for both medium and high speed grades for Kintex-7 and Kintex Ultrascale devices.

The combiner is a small component, taking up less than 0.5 % of LUTs and FFs, and about 1 % of BRAM, of the smallest device we consider. BRAM usage is proportional to the number of layers and the bit-width of the hits; so is the FF usage, with some offset due to the fixed logic that does the road and event handling. The impact of the hit bit-width has been omitted from the tables for two reasons: (i) the resource usage is so low that there is no real impact from the increased FF usage; and (ii) the addresses produced by the critical path–related logic are pipelined, being essentially decoupled from the size and physical placement of the hit-carrying memories, so the maximum operating frequency is not affected at all. Moreover, the input circuitry of the combiner is pipelined and relatively simple; as such, it is very fast, easily reaching the maximum operating frequency of the DO (that drives it) in all the devices we considered. Thus, it is not shown on the results table.

The maximum operating frequency of the combination generation clock domain, while not dropping below 333 MHz, cannot reach the one the Track Fitter can achieve (≈500 MHz to 600 MHz). That necessitates the use of two combiners per TF to efficiently use its processing power; a simple way to implement this is to have the combiners run at exactly half the TF’s operating frequency, and the TF input alternating between the outputs of the two combiners. It should be noted here that when the two combiners are run on a clock that is exactly half the frequency of the TF clock, and phase-locked to it, the obvious solution of directly connecting the outputs of the combiners to the input of the TF to toggle between them has some pitfalls. Even though the two clock domains are synchronous, the place and route tools will have to introduce large hold times to account for both clock edges of the fast clock that correspond to a single period of the slow clock. Accepting a small overhead and adding a pair of FIFOs to cross the data to the fast clock domain constitutes a much more straightforward solution.

The power consumption of the combiner is ≈110 mW for the maximum operating frequency of the 8-layer configuration in a -2 speed grade Kintex-7 device. Since this is modest for the applications it pertains to, it does not warrant additional analysis across different configurations or FPGA families.

3.5 The Track Fitter

In particle physics, track fitting is the last step in a basic track reconstruction processing pipeline.19 Put simply, that step receives as an input a set of hits, distributed across the detector layers, for which it: (i) examines whether they map to an expected, helical, trajectory; and (ii) computes the helix parameters of the trajectory.

Table 3.4: Combiner resource utilization results

N_LAYERS | FF (k)       | LUT (k)      | BRAM (#)   | fmax (-2) (MHz) | fmax (-3) (MHz)
         | k-7    us    | k-7    us    | k-7   us   | k-7    us       | k-7    us
6        | 1.42   1.42  | 0.55   0.5   | 3     3    | 375    450      | 400    533
8        | 1.75   1.74  | 0.75   0.68  | 4     4    | 333    440      | 366    500

19A common step in more complex track reconstruction systems, placed after the track fitting step, is the duplicate removal. Additional track fitting steps to refine the results of the first one can also be implemented if the latency specifications allow it.

The particles generated in a detector undergo several interactions that make tracking their trajectories difficult: (i) multiple scattering; (ii) energy loss; (iii) particle decay and creation; and (iv) inhomogeneous magnetic field. The additional errors incurred by the detector (noise, dead elements) further compound an already difficult situation.

There are several algorithms that can be employed for this task, varying wildly in complexity, such as the Kalman filter [74], the Hough transform [75], and PCA-based methods [64, 76]. Selecting an algorithm for the track fitting stage depends on multiple factors—including, but not limited to, the detector geometry, its magnetic field, the requirements on the quality of results, and the computing time available. In this implementation, a linear fit method has been chosen, similar to the one employed by the FTK project [46].

The way the TF unit extracts the five helix parameters from the hits (which can be expressed in local or global hit coordinates) is by employing linear transformations to make the fit from the coordinate space to the parameter space, using a χ2 as a goodness of fit indicator. The linear fit model for any given sector consists of the fit coefficients

Cij and qi, where i corresponds to the parameter index and j to the layer. The track parameters p˜i can be calculated from the N hit coordinates xj as such:

\[
\tilde{p}_i = \sum_{l=1}^{N} C_{il}\, x_l + q_i
\]

The χ2 is calculated by squaring terms computed in the exact same way as the track parameters, and calculating the sum of those squares. The number of terms that have to be squared and summed to obtain the χ2 would be equal to the total number of coordinates (from all the layers). It can be shown [46, p. 112], however, that the contribution of some terms is negligible, with the number of significant terms being the total number of coordinates minus the number of parameters computed by the linear fit model.20 The formula for the χ2 calculation is shown below:

\[
\chi^2 = \sum_{i=1}^{N - N_p} \left( \sum_{l=1}^{N} C_{il}\, x_l + q_i \right)^{2}
\]

When the detector is (conceptually) split in relatively narrow areas, for each of which a set of fit coefficients is generated, the resolution of the resulting fit approaches that of a non-linear helical fit.

20This is the definition of the Number of Degrees of Freedom (NDOF) for the fit.
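For reference, the two formulas above translate directly into a couple of matrix operations; the short numpy sketch below (with illustrative array names; the actual computation is performed by the simulation software of subsection 3.6.3) evaluates the track parameters and the χ2 for one set of coordinates:

import numpy as np

def linear_fit(x, C_par, q_par, C_chi, q_chi):
    """Evaluate the linear fit for one track candidate.

    x     : (N,)       hit coordinates of the candidate
    C_par : (5, N)     parameter fit coefficients C_il
    q_par : (5,)       parameter additive constants q_i
    C_chi : (N-5, N)   chi-square coefficients
    q_chi : (N-5,)     chi-square additive constants
    """
    p_tilde = C_par @ x + q_par                 # p~_i = sum_l C_il x_l + q_i
    chi2 = np.sum((C_chi @ x + q_chi) ** 2)     # sum of the squared chi-square terms
    return p_tilde, chi2

# Example call for an 11-coordinate sector, with random placeholder numbers:
rng = np.random.default_rng(0)
p, chi2 = linear_fit(rng.normal(size=11), rng.normal(size=(5, 11)),
                     rng.normal(size=5), rng.normal(size=(6, 11)), rng.normal(size=6))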

Figure 3.16: Block diagram of the TF architecture: sector-based constant selection feeding the track parameter computational cores (×5) and the χ2 coefficient computational cores (×6), whose outputs are squared and accumulated.

3.5.1 Implementation Details

The formulas for p̃i and χ2 make it obvious that designing the TF essentially translates to coming up with an architecture to compute scalar products in an efficient and performant manner. By making extensive use of dedicated DSP units—which are frequently abundant in modern FPGA devices—and by being aware of layout concerns early in the design cycle, a very fast implementation can be achieved.

A high-level block diagram of the TF architecture can be seen in figure 3.16. Specifically, what is seen here is the TF configuration for an 8-layer, 11-coordinate tracking trigger application that is described in section 3.6 on page 96. The design is easily adaptable, and it primarily aims to strike a balance between low latency and resource usage (in terms of both DSP and register count), while maintaining a high operating frequency. By keeping resource usage moderate, the parallelism capabilities of the FPGA can be exploited even more effectively, by allowing more track fitting instances to fit in the device. This TF implementation produces one fit per clock cycle, after a short initial latency of the order of 50 clock cycles—that translates to 500 MFits/s and 100 ns for a 500 MHz clock, respectively.

The core computing architecture consists of sequences of Multiplier-Accumulator (MACC) blocks, accompanied by systolic arrays of registers, causing the input hits and coefficients to arrive at each MACC unit staggered in time. These arrays of registers synchronize the latency with which the coefficients and the inputs arrive with the extra latency introduced by each previous MACC step. Any processing logic is contained in the dedicated DSP block, resulting in a fast, power-efficient architecture. But before this architecture is described in more detail, it is worth interjecting a brief overview of the DSP blocks modern FPGA architectures provide.

Figure 3.17: High-level DSP48E2 block diagram, highlighting its main functionality

3.5.1.1 DSP Block Structure in Modern Xilinx FPGAs

As the massively parallel capabilities of FPGAs started being increasingly exploited by DSP-like applications, the FPGA manufacturers, following the demand for that kind of performance, started adding hard arithmetic cores to their devices. With time, these resources got faster, more power-efficient, more versatile, and much more abundant, leading to FPGAs that feature DSP performance well into the TFLOPs range, with some devices offering hard IEEE 754 floating-point [77] support.

Figure 3.17 [78, p. 6] shows a simplified top-level diagram of the Kintex Ultrascale DSP slice, the DSP48E2. Central in the block diagram stands a 27 × 18 multiplier, fed by a pre-adder unit and followed by an ALU (Arithmetic Logic Unit) and specialized wide XOR and pattern detection logic. These processing elements make the DSP slice a very powerful tool in the hands of the FPGA designer, enabling the efficient realization of a very broad range of DSP-centric applications.

Between these arithmetic operators lie registers that can be configured to be bypassed, depending on the operating frequency target, and multiplexers for the inputs of each stage, to increase the flexibility of the slice. Furthermore, since chained MACC operations and wide arithmetic comprise very common usage patterns, dedicated interconnects between adjacent DSP slices are provided to decrease power consumption and enhance the routability of the design, increasing the maximum frequency that can be attained—especially in densely populated designs. As these dedicated interconnects play an important role in the resulting TF computational core architecture, a more detailed block diagram of the DSP slice that displays their signals is shown in figure 3.18 on the following page [78, p. 10].

Figure 3.18: Detailed DSP48E2 block diagram that includes the dedicated interconnect signals (denoted with an asterisk)

3.5.1.2 Computational Core Implementation

The computational core implementation has been designed around taking full advantage of the DSP slices and the dedicated interconnect between them; that should enhance its routability and allow it to run closer to the limits posed by the DSP silicon. A block diagram of the resulting computational core architecture can be seen on figure 3.19 on the next page. Each MACC element takes two inputs, a hit coordinate and a fit constant, multiplies them, and adds the result to the result of the previous block.

Signed fixed-point arithmetic is used, with 18-bit coordinates, 27-bit matrix multiplication constants,21 and 36-bit additive constants; the computation results have a width of 48 bits. As will be seen in subsection 3.6.3.3 on page 107, that fixed-point representation offers a high enough range and accuracy to yield excellent track fitting performance.

The DSP slice configuration that implements this functionality is shown in figure 3.20 on the next page. Since each MACC stage incurs a delay to its output, the inputs have to be delayed such that the signals A and B in figure 3.20 on the facing page are in sync. A timing diagram of the data as it is propagated through successive MACC stages can be seen on figure 3.21 on the next page. The total delay from the hit and fit constant inputs of each stage to its output is 4 clock cycles. However, the delay from the input coming from the previous stage to its output is only one clock cycle. Thus, only one clock cycle of delay has to be added to the inputs of each stage, compared to the previous one, leading to the implementation shown in figure 3.19 on the facing page. That makes the total latency of the computational core N + 4 clock cycles.
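A toy software model helps make the staggered timing concrete; the Python sketch below is an illustration of the scheme in figures 3.19–3.21 (not the RTL), tracking when each partial sum becomes available so that the N + 4 cycle latency quoted above falls out:

def staggered_macc(hits, consts, additive_const=0, stage_delay=4):
    """Toy model of the staggered MACC pipeline.

    Stage k receives its (hit, constant) pair one cycle later than stage k-1,
    so that its cascade input (the partial sum from the previous stage) arrives
    in sync; each stage takes stage_delay cycles from its side inputs to its
    output, and one final output register closes the chain.
    """
    acc, ready_cycle = additive_const, 0
    for k, (hit, coeff) in enumerate(zip(hits, consts)):
        acc += hit * coeff                 # one MACC stage (multiply + cascade add)
        ready_cycle = k + stage_delay      # stage k's inputs are presented at cycle k
    return acc, ready_cycle + 1            # +1 for the final output register

hits, consts = [3, -1, 4, 1, 5], [2, 7, 1, -8, 2]
result, latency = staggered_macc(hits, consts, additive_const=10)
print(result, latency)   # scalar product plus constant, available after N + 4 cycles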

21The DSP48E1 slice in Kintex-7-series devices features a 25 × 18 multiplier, so in the Kintex-7 configuration a 25-bit matrix multiplication constant width is used instead.

Figure 3.19: Block diagram of the TF computational core architecture


Figure 3.20: Detailed view of the TF MACC block

Figure 3.21: Waveform diagram explaining the staggered input to the TF MACC blocks

Figure 3.22: Block diagram of the resource-optimized TF computational core architecture, as configured for a CMS L1 tracking trigger application

3.5.1.3 Optimizing the Computational Core for Resources, Latency and Power

It is evident that this architecture, even though it is symmetrical and straightforward, will result in a large number of registers. The count of the register stages with respect to the number of dimensions is given by the sum of an arithmetic series starting with, and incrementing by, 1:

\[
S_n = \frac{1}{2}\, n(n + 1)
\]

For the 11-coordinate tracking trigger application described in section 3.6 on page 96 that translates to 66 register stages, which in turn correspond to 66 × (27 + 18) ≈ 3000 registers. Considering that this block has to be replicated 5 times to produce the track parameters, and another 6 times to produce the χ2 coefficients, we come up with 33 k registers for the whole TF, which can be considered an “acceptable” count, since only one TF core is instantiated and there are no hard power and utilization restrictions. For the application presented in section 3.7 on page 112, however, where there are 12 coordinates and the χ2 is computed by 10 coefficients, the register count becomes 48 k. Even if the current implementation only uses one TF instance, future upgrades in the module after the DO could lead to more TF instances being desired.

To reduce the register utilization, we split the computational core into smaller pipelines with a length of 4 MACC stages each, joining them with adders at the end. A block diagram of this computational core architecture, configured for the 11-coordinate case, can be seen on figure 3.22 on the preceding page. Using this approach, the number of register stages becomes:

\[
S_n^{*} = 10\left\lfloor \frac{n}{4} \right\rfloor + \frac{1}{2}\,(n \bmod 4)\bigl((n \bmod 4) + 1\bigr)
\]

In the range of 10 to 12 dimensions, the reduction in the number of registers is ≈60 %. In the 11-coordinate configuration that translates to 13 k registers, and in the 12-coordinate configuration to a much more manageable 20 k. Latency also benefits from more, shorter pipelines running in parallel, despite the extra adder stages after them.
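The register savings quoted above can be reproduced with a few lines of Python (a quick check using the widths and core replication counts given earlier; the variable names are only for illustration):

def stages_chain(n):           # S_n: one long chain of n MACC stages
    return n * (n + 1) // 2

def stages_split(n, seg=4):    # S*_n: pipelines of 4 stages joined by adders
    full, rem = divmod(n, seg)
    return full * stages_chain(seg) + stages_chain(rem)

width = 27 + 18                # fit-constant + coordinate register width (bits)
cores = 5 + 6                  # 5 parameter cores + 6 chi-square cores
n = 11
print(stages_chain(n), stages_split(n))          # 66 vs 26 register stages
print(1 - stages_split(n) / stages_chain(n))     # ~0.61 reduction
print(stages_chain(n) * width * cores)           # ~32.7 k registers (original)
print(stages_split(n) * width * cores)           # ~12.9 k registers (optimized)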

These benefits come at the expense of ⌈log(n/4)⌉ extra DSP slices to perform the extra additions. However, it is much more common to have DSP slices to spare than registers. Especially in Track Trigger applications, DSP slices are mostly used in a few modules, but registers are used throughout the design. That makes this an especially small price to pay in exchange for the improvements in latency and the dramatic reduction in register resources.

It should also be noted that the reduction in the number of registers carries a favorable impact on power consumption as well. The initial configuration uses ≈12 mW for the DSP slices and ≈100 mW for the clock distribution and the registers,22 while the optimized configuration uses an increased ≈22 mW for the DSP slices,23 but just ≈40 mW for the clock distribution and the registers. That translates to a 45 % reduction in overall power consumption, as an added benefit of the modified architecture.

3.5.2 Performance and Resource Utilization

In table 3.5 on the following page, the resource utilization and fmax figures of the computational core, as configured for the applications presented at section 3.6 on the next page and section 3.7 on page 112, are presented. It can be seen that the TF computational core operating frequency can exceed fmax,TF = 600 MHz in both Kintex-7 and Kintex Ultrascale devices. However, unless special techniques such as bottom-up hierarchical methods,24 or prioritizing the implementation through partial place and route,25 are used for placement and routing, a more realistic implementation goal for a fully populated design would be 500 MHz.

22A 25 % toggle rate was used for the analysis to account for the increased randomness of the fitting constants.
23The DSP slices used in the adders use more power than the others. They operate on larger, 48-bit operands but, more importantly, dedicated routing can’t be used. To match the timing of the input that uses dedicated routing, data signals from the other DSP slice cannot pass through a register stage; because of placement constraints, these signals would become the critical path of the circuit and lower fmax.

Looking at the full TF implementations for the applications presented at section 3.6 and section 3.7 on page 112, at table 3.6, the maximum attainable operating frequency seems to drop with respect to the computational core implementation. That fact is not related to placement or routing constraints, however, but is due to the fact that BRAMs have a lower fmax and they are used for the storage of the fit coefficients.

The resource utilization for the full implementations shows that four units can easily fit in a modern mid-range device,26 providing a maximum fitting performance of 2 GFits/s. Such a high performance is especially desirable, as it can help maintain the trigger efficiency in high occupancy environments.

3.6 Track Reconstruction System Implementation, Testing, and Evaluation

There is a definite and obvious need for the components described in this chapter, namely the Data Organizer, the Combiner, and the Track Fitter, to be verified exhaustively. The complexity of the DO logic makes its on-board verification all the more important, given the limited number of randomly-generated events that can be simulated within realistic time constraints—even using the fastest simulation type, behavioral simulation.27

Table 3.5: Track Fitter core resource utilization results

N_COORDS | FF (k)       | LUT (k)      | DSP (#)   | fmax (-2) (MHz)
         | k-7    us    | k-7    us    | k-7   us  | k-7    us
6        | 0.79   0.81  | 0.04   0.04  | 9     9   | 600+   600+
11       | 1.67   1.70  | 0.02   0.02  | 14    14  | 600+   600+
12       | 1.73   1.79  | 0.02   0.02  | 15    15  | 600+   600+

Table 3.6: Full Track Fitter resource utilization results

Configuration | FF (k)       | LUT (k)      | BRAM (#)     | DSP (#)    | fmax (-2) (MHz)
              | k-7    us    | k-7    us    | k-7    us    | k-7   us   | k-7    us
Section 3.6   | 34.2   34.8  | 2.15   2.15  | 36.0   36.0  | 208   208  | 540    585
Section 3.7   | 31.4   31.9  | 2.39   2.39  | 72.5   72.5  | 360   360  | 540    585

24Each module is the sole occupant of a given area on the device; that way, placement and routing of each module is completely disentangled from overall utilization and consistently reaches its top speed. Details of such a flow can be found at [79].
25Placing and routing a critical module before others, so it meets timing.
26Assuming typical, for these applications, resource utilization.

A complete verification and optimization suite that allows extensive testing of the track reconstruction processing chain that follows the pattern recognition stage has been developed in both software and firmware. An ad-hoc FPGA-in-the-loop simulation acceleration framework constitutes a major part of this suite.

3.6.1 Hardware Setup

The Xilinx KCU-105 evaluation board, pictured on figure 3.23 on the following page, provides an ideal development environment for Kintex Ultrascale series devices. It features a powerful xcku040 FPGA coupled with various clock generators, SPI Flash memory, DDR4 memory, and a Zynq SoC as a system controller, among other features. It also provides ample connectivity options, offering PCI Express (PCIe), SFP+, RJ-45, HDMI, and VITA 57.1 FPGA Mezzanine Card (FMC) connectors. The FMC connector makes it possible to host mezzanine boards (like the PRM06) on it, making it invaluable for the case in which one wants to emulate a host board using the FPGA on the evaluation board.

The hardware testing setup consists of a PC running a Linux operating system, connected to a network switch (or a router), and the KCU-105 evaluation board connected to the same switch / router via Ethernet. The PC provides the inputs for the FPGA design (the hits, SSIDs, and roads), loads the configuration data (road ID to SSID Look-Up Tables, TF fit constants), and eventually receives the resulting tracks.

3.6.1.1 Ethernet Communication and the IPBus suite

The KCU-105 evaluation board hosts a Marvell tri-speed (10/100/1000) Ethernet PHY controller. That, in combination with the Tri-Mode Ethernet MAC IP provided by Xilinx and the IPbus infrastructure, provides an easy-to-setup platform for Ethernet communications.

IPbus is a protocol and open-source software / hardware suite, developed to facilitate Ethernet communication for ATCA boards in the CMS experiment. The IPbus protocol is an application layer, sitting on top of the UDP (User Datagram Protocol) transport layer protocol; that allows a relatively low-resource implementation in the target FPGA.

27Other simulation types include functional simulation, which involves simulating the synthesized design; and timing simulation, which simulates the placed and routed design, fully modeling logic and routing delays.

Figure 3.23: The KCU-105 evaluation board

Application Programming Interfaces (APIs) for the C++ and Python languages expose functions that give access to a 32-bit data and address memory space within the FPGA devices. Block transfers are available, greatly improving the transfer speeds when large blocks of data have to be transferred (500 Mbit/s for transfers larger than 1 MB). Finally, the way the software is structured scales well from board-on-bench scenarios to multi-crate setups.
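As an illustration of the software side, a minimal IPbus (uHAL) Python session looks roughly like the sketch below; the device URI, address-table file, and register names are placeholders, not the ones used in this work:

import uhal

# Connect directly to the board over UDP; URI and address table are examples.
hw = uhal.getDevice("kcu105", "ipbusudp-2.0://192.168.1.10:50001",
                    "file://address_table.xml")

# Single-register write and read; transactions are queued and sent on dispatch().
hw.getNode("tf.fit_const_addr").write(0)
version = hw.getNode("info.version").read()
hw.dispatch()
print(hex(version.value()))

# Block transfer, e.g. loading a batch of hits into the design's input FIFO.
hw.getNode("do.hit_fifo").writeBlock([0x1234, 0x5678, 0x9ABC])
hw.dispatch()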

On the side of the FPGA design, the IPbus suite provides a module that sits between the Ethernet MAC IP and the user logic, and encapsulates all low-level protocol and data transfer details. The interface of this module that the user design connects to is simple; its ports can be seen on code listing 3.5 on the next page. The ipb_strobe signal signifies an ongoing transaction, while ipb_write indicates whether the transfer is a read or a write. The transaction address is provided by ipb_addr, and the data are carried on ipb_wdata and ipb_rdata for write and read transactions, respectively. Finally, the ipb_ack signal is sent to the core when a transaction is complete; if the transaction is a read, it also signifies that the (valid) read data are placed on the ipb_rdata bus, ready to be read out. Example transaction waveforms can be seen on figure 3.24 on the facing page. Transactions A and B represent cases in which the design does not provide an immediate response, to model the case of a busy processing pipeline that can’t accept new data, or the case in which data is not yet available. In transactions C and D (in each of which two words are transferred) the response is immediate, to represent data transfers to / from a register file. In case the user design does not respond in a timely fashion, a watchdog counter triggers a transaction error that is communicated to the software side. The purpose of the watchdog is to prevent any logic error in the user design from killing the bus, while the timeout threshold is adjustable.

Figure 3.24: Timing diagram showing various IPBus transactions

Finally, it is worth mentioning that IPbus provides the infrastructure to convert the bus that interfaces with the user logic to a master bus that controls a number of slave modules, each of which is assigned a region of the address space. This functionality is integrated with the software in an intuitive way: the names and address spaces of the slave modules are defined in a single file, which is used both to generate the HDL code that implements the interface of the master bus to the slave buses, and to convey the slave definitions to the software, such that they are accessible by name.

3.6.2 Verification Environment

As was discussed before, to verify the correct operation of the various components described, advanced verification methodologies were employed.

A UVM approach was selected as the basis of our simulation framework in order to maintain flexibility. As in the DO verification environment (see subsection 3.3.3 on page 78), multiple scenarios can be run, but here that flexibility is extended: the SystemVerilog Direct Programming Interface (DPI) provides the means to call methods of a different programming language from within SystemVerilog; thus, it allowed us to connect the testbench to the board by making use of the IPbus C++ API from the testbench environment.

type ipb_wbus is record
  ipb_addr   : std_logic_vector(31 downto 0);
  ipb_wdata  : std_logic_vector(31 downto 0);
  ipb_strobe : std_logic;
  ipb_write  : std_logic;
end record;

type ipb_rbus is record
  ipb_rdata : std_logic_vector(31 downto 0);
  ipb_ack   : std_logic;
  ipb_err   : std_logic;
end record;

Code Listing 3.5: IPBus interface

The code reuse characteristics of the UVM methodology [80] allow the same high-level testbench code to be reused both for the RTL behavioral simulation of the design and for its actual FPGA implementation, with just different interfaces written for each verification target. This method represents a viable alternative to the usual practice, which is to design a testbench for the behavioral simulation while relying on a different, and usually simpler, set of scripts to verify the behavior of the implemented design.

3.6.2.1 The Testbench

The testbench can function in two major configurations: (i) constrained-random operation, and (ii) test vector–driven operation. These configurations, despite satisfying different goals and having big differences, are integrated into a single testbench so the complex DUT interface can be reused.

In the constrained-random operation mode, the testbench initially carries out the following tasks: (i) preparing randomly-generated patterns in a constrained area, (ii) loading the road-to-SSID map and the DC bits onto the design memory, (iii) loading the fit coefficients onto the TF memory, and (iv) preparing events that consist of randomly-generated hits in a constrained area. It then computes the expected results, runs the events on the DUT, and reads out the results. In this configuration the patterns and the hits are random: several hundred hits are generated within a relatively narrow area, and then one or two thousand patterns are randomly generated and matched against them, producing approximately one hundred roads, on average. This mode is designed to maximally stress the DO and combiner units. As the TF module is a simple pipeline and, thus, has fewer ways to break, the fit constants are not randomized for this test in an attempt to keep things as simple as possible.

In the test vector–driven operation mode the testbench loads the patterns, hits, and fit coefficients from a set of text files generated by a Python toy detector model, described in subsection 3.6.3 on page 102. The results produced by the design, along with a measurement of the time it takes to process each event, are also dumped in a text file. That provides the necessary data to produce a wide range of plots so the fit resolution and performance can be evaluated.

Figure 3.25: Block diagram of the testbench top-level: the DUT (DO, combiner, TF; either the behavioral model or the real design behind the IPBus interface), memory model, sequencer, and scoreboard.

3.6.2.2 The BFM and On-Board Verification

To facilitate the on-board verification of the implemented design, an FPGA design that is tightly integrated with the testbench has been developed for the KCU-105 evaluation board. The design and the testbench exchange data over an Ethernet link using the IPbus software / hardware suite.

The testbench, apart from having two possible configurations in terms of whether the test vectors are generated or loaded, also provides two options for the DUT instantiation: the behavioral model and the real design, implemented on the evaluation board. More precisely, the testbench interfaces with a Bus Functional Model (BFM) that abstracts all communication with the DUT by providing callable tasks. As an example, to write data to the IPbus interface of the DUT, the testbench makes a single call to the write_ipbus task.

The write_ipbus task, shown in code listing 3.6 on the next page, handles the low-level bit-toggling bus details when using the behavioral simulation model and, using the DPI interface, calls the C++ function that sends data over Ethernet through the IPbus protocol. When the BFM is configured to control the behavioral simulation model, an empty C++ function is defined, instead. Since tasks can be called asynchronously, semaphores are used to make sure no two tasks attempt to take control of the same signal.

It is evident that by abstracting the communication with the DUT through the use of a BFM, not only can the testbench code be simplified—since it does not have to handle the low-level details—but it also becomes much easier to make the test setup more flexible.

3.6.3 Track Reconstruction Testing and Evaluation in Python

Track reconstruction performance has been studied using the simulation of a “toy” detector that resembles the silicon detector geometries of the LHC experiments [81, 82]. The detector simulation is strongly simplified compared to the LHC detector simulations (e.g., detector material interactions and noise are not simulated), however it is enough to satisfy two major goals, which are: (i) to measure the track reconstruction chain processing speed as a function of number of roads to be processed and fits per real track to be executed; and (ii) to verify that the FPGA implementations of the algorithms are bit-wise equivalent to their programmatic implementations. The simulated events aim to reproduce the high-occupancy, crowded environment created by energetic jets. Inside those jets the number of fits to be executed increases exponentially due to the fact that two or more tracks can enter the same road [83].

A stand-alone simulation program has been written to explore track generation and reconstruction in a restricted phase space (i.e., a thin slice of the detector model), that is representative of the system performance when scaled to the whole detector. The simulation has been written in Python. This language was mostly chosen because of its ease of use. It offers a rapid way to easily model complex algorithms, mostly through its many easy-to-use libraries (such as pandas and numpy) and the list comprehension construct.

semaphore ipb_sem = new(1);

task automatic write_ipb(logic [31:0] addr, logic [31:0] data = 0);
  ipb_sem.get(1);  // only one task may drive the bus at a time

  // Drive the behavioral-simulation IPbus signals through the clocking block
  cb_ipb.ipbus_w.ipb_strobe <= 1;
  cb_ipb.ipbus_w.ipb_write  <= 1;
  cb_ipb.ipbus_w.ipb_addr   <= addr;
  cb_ipb.ipbus_w.ipb_wdata  <= data;
  @cb_ipb;
  while (cb_ipb.ipbus_r.ipb_ack == 0) @cb_ipb;  // wait for the acknowledge
  cb_ipb.ipbus_w.ipb_strobe <= 0;
  cb_ipb.ipbus_w.ipb_write  <= 0;

  // DPI call: sends the same transaction over Ethernet when targeting the board
  uhal_write(addr, data);

  ipb_sem.put(1);
endtask // write_ipb

Code Listing 3.6: write_ipbus task

Figure 3.26: The detector geometry with concentric cylindrical layers; the z axis, which defines the beam direction, coincides with the axis of the cylinders. The plotted training track sample has PT > 2 GeV, η ∈ (0.1, 0.2), φ0 ∈ (0.69, 0.88), d0 [µm] ∈ (−1000.0, 1000.0), z0 [µm] ∈ (−2000.0, 2000.0); axes: X [mm], Y [mm].

It also features a very powerful plotting library, matplotlib, that allows converting the computation results into publication-quality plots. Finally, parallel execution features are easily accessible via the high-level concurrent.futures interface; since some of the most time-consuming computations in this simulator lend themselves to parallel implementations, that is particularly helpful.

3.6.3.1 Toy Detector Model Description

As mentioned before, reproducing low-level detector phenomena is avoided for this simple “toy” detector model; the overload caused by generating collimated tracks is enough to reproduce the exponential growth of found roads and track fits that is observed in real-world conditions. Figure 3.26 shows a cross-section of the simple, cylindrical, detector geometry (transverse to the beam), showing tracks confined in a tight space, attempting to represent a particle jet.

In this simple geometry, the three inner cylindrical layers represent pixel detectors; the five outer layers, strip detectors. The pixel detectors of our model are able to produce two coordinates, while the strip detectors are able to produce only one coordinate. That gives our track reconstruction algorithms eleven coordinates over eight layers.

At this point it is important to introduce the five helical track parameters:

• Transverse Momentum (pT). The particle charge, divided by pT, gives the inverse transverse momentum, Q/pT. This defines the particle direction along the helical track and the radius of its curvature. Since more energetic particles will follow straighter trajectories, the pT quantity is closely correlated to the particle energy and thus is associated with whatever physics happened during the collision. That property attributes a special significance to high-pT particles, in contrast with particles of high momentum along the beam-line, which are mostly scattered beam particles (not the physics that is usually desired to be studied).

• Pseudorapidity (η), a quantity signifying the angle of the track relative to the beam axis. This, too, is related to the particle energy, as a particle following a trajectory perpendicular to the beam signifies that the collision was “head-on” and thus that more energy was deposited in the collision.

• d0, the transverse component of the impact parameter, is the distance of closest approach to the beam-line.

• z0, the longitudinal component of the impact parameter, is the value of z at the point that defines d0. The impact parameter defines the point of origin of the tracked particle. The distance of that point from the collision point indicates whether the track is a direct product of the collision (primary, or pile-up, vertex) or a decay product of some physical process (secondary vertex).

• φ0, finally, is the azimuthal angle.

In the helical track equations that follow, B is the magnetic field in the detector along the ~z axis, in units of T, and Q is the charge of the particle in elementary charge units, e. To conveniently work with those equations, it is helpful to first define some quantities:

\[
\begin{aligned}
a &= -0.2998\, B\, Q \\
\lambda &= \cot\theta_0 \equiv \sinh(\eta) \\
p_\perp &= \sqrt{p_x^2 + p_y^2} \\
u &= p_x / p_\perp, \qquad v = p_y / p_\perp \\
u_0 &= p_{0x}/p_\perp \equiv \cos(\phi_0), \qquad v_0 = p_{0y}/p_\perp \equiv \sin(\phi_0) \\
\rho &= \frac{1}{R} \equiv \frac{a}{p_\perp}
\end{aligned}
\]

Then the equations that describe the helical track are given by:

\[
\begin{aligned}
p_x &= p_{0x}\cos(\rho s_\perp) - p_{0y}\sin(\rho s_\perp) \\
p_y &= p_{0y}\cos(\rho s_\perp) + p_{0x}\sin(\rho s_\perp) \\
p_z &= p_{0z} \\
x &= x_0 + \frac{p_{0x}}{a}\sin(\rho s_\perp) - \frac{p_{0y}}{a}\bigl(1 - \cos(\rho s_\perp)\bigr) \\
y &= y_0 + \frac{p_{0y}}{a}\sin(\rho s_\perp) + \frac{p_{0x}}{a}\bigl(1 - \cos(\rho s_\perp)\bigr) \\
z &= z_0 + \lambda s_\perp
\end{aligned}
\]

with

\[
\begin{aligned}
p_{0x} &= p_\perp u_0 = \frac{a}{\rho}\cos(\phi_0) \\
p_{0y} &= p_\perp v_0 = \frac{a}{\rho}\sin(\phi_0) \\
p_{0z} &= p_\perp \lambda = \frac{a}{\rho}\,\lambda \\
x_0 &= -D v_0 = -D\sin(\phi_0) \\
y_0 &= D u_0 = D\cos(\phi_0) \\
z_0 &= z_0
\end{aligned}
\]

Here, the coordinates are expressed as a function of s⊥, the arc length on the x–y plane from the point (x0, y0, z0) to (x, y, z). It is more practical, though, to define the track coordinates as a function of the radius r.

\[
\begin{aligned}
x &= x_0 + \frac{u_0}{\rho}\,2\epsilon B\sqrt{1 - B^2} - \frac{v_0}{\rho}\,2B^2 \\
y &= y_0 + \frac{v_0}{\rho}\,2\epsilon B\sqrt{1 - B^2} + \frac{u_0}{\rho}\,2B^2 \\
z &= z_0 + \lambda s_\perp
\end{aligned}
\]

with

\[
B = \frac{\rho}{2}\sqrt{\frac{r^2 - d_0^2}{1 + \rho d_0}},\qquad
s_\perp =
\begin{cases}
\dfrac{2}{\rho}\sin^{-1} B & \text{for } \epsilon = +1 \\[1.5ex]
\dfrac{2}{\rho}\left(\pi - \sin^{-1} B\right) & \text{for } \epsilon = -1
\end{cases}
\]

These equations are used to generate the track intersections at the layers of our simple detector model, with the track parameters as inputs. Uniform parameter distributions have been selected for the generated tracks: for the q/PT (particle charge to the momentum transverse to the beam axis) the interval is q/PT [GeV−1] ∈ (−1/2, 1/2); for the impact parameter in the transverse plane, d0 [mm] ∈ (−1, 1); and for the distance from the center of the detector along the beam, z0 [mm] ∈ (−2, 2). The angle of the tracks with respect to the z axis, η ≡ −ln(tan(θ/2)) ∈ (0.1, 0.2), and with respect to the x axis on the transverse x–y plane, φ0 [rad] ∈ (0.69, 0.88), represents a narrow region, as shown in figure 3.26 on page 103.

An attempt to choose realistic parameters for the “toy” detector model has been made. The value of the magnetic field has been chosen to be 2 T. The radii that have been selected for the pixel and strip layers are 50 mm, 90 mm, and 120 mm; and 150 mm, 250 mm, 450 mm, 650 mm, and 900 mm, respectively. Finally, the resolutions for the “toy” detector elements that are used to generate the hit positions in the detector layers have been set to 8 µm (in both directions) for the pixel layers, and 20 µm (in rφ) for the strip layers.
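To make the hit-generation procedure concrete, the Python sketch below evaluates the r-parameterized equations at the toy detector's layer radii. It is a simplified illustration only (ǫ = +1, unit charge, no resolution smearing), and the function and variable names are not those of the actual simulation code:

import numpy as np

def helix_point(r_mm, q_over_pt, d0_mm, z0_mm, eta, phi0, B_field=2.0):
    """Intersection of a helical track with a cylinder of radius r (epsilon = +1).

    A direct transcription of the r-parameterized equations above; lengths are
    handled internally in metres (the 0.2998 constant relates GeV, T and m).
    """
    r, d0, z0 = r_mm / 1e3, d0_mm / 1e3, z0_mm / 1e3
    pt = 1.0 / abs(q_over_pt)
    a = -0.2998 * B_field * np.sign(q_over_pt)   # Q = +/-1 assumed
    rho = a / pt                                 # signed curvature, 1/m
    lam = np.sinh(eta)                           # cot(theta)
    u0, v0 = np.cos(phi0), np.sin(phi0)
    x0, y0 = -d0 * v0, d0 * u0

    Bq = (rho / 2.0) * np.sqrt((r**2 - d0**2) / (1.0 + rho * d0))
    s_perp = (2.0 / rho) * np.arcsin(Bq)
    x = x0 + (u0 / rho) * 2.0 * Bq * np.sqrt(1.0 - Bq**2) - (v0 / rho) * 2.0 * Bq**2
    y = y0 + (v0 / rho) * 2.0 * Bq * np.sqrt(1.0 - Bq**2) + (u0 / rho) * 2.0 * Bq**2
    z = z0 + lam * s_perp
    return 1e3 * x, 1e3 * y, 1e3 * z             # back to mm

# One track crossing the toy detector's eight layers (radii in mm):
for r in (50, 90, 120, 150, 250, 450, 650, 900):
    print(r, helix_point(r, q_over_pt=0.3, d0_mm=0.5, z0_mm=1.0, eta=0.15, phi0=0.8))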

3.6.3.2 Functions Performed by the Software

As mentioned before, the main objectives of this software are to evaluate: (i) the quality of the track reconstruction achieved, and (ii) the processing bandwidth (speed) of the implemented algorithms. In order to produce resolution plots to evaluate the fit algorithm quality, the following steps have to be executed:

1. A “training” set of 200 k tracks is generated
2. The fit constants are computed based on these tracks using linear algebra methods (described in [46, p. 112]); a minimal sketch follows this list
3. A different set of “test” tracks (10 k) is generated
4. This set of tracks is fit using the fit constants generated in the previous step
5. The fit constants are converted to the fixed-point representation used in the TF implementation
6. The tracks are fit again, this time using fixed-point arithmetic
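As a minimal illustration of step 2, one standard way to obtain linear fit constants of this form (a generic least-squares formulation, not necessarily the exact procedure of [46]) is to regress the generated parameters against the generated coordinates of the training sample:

import numpy as np

def train_fit_constants(X, P):
    """Least-squares estimate of the fit constants C and q.

    X: (n_tracks, N)  training hit coordinates (N = 11 in this configuration)
    P: (n_tracks, 5)  helix parameters used to generate those tracks
    Returns C of shape (5, N) and q of shape (5,), such that P ~ X @ C.T + q.
    """
    ones = np.ones((X.shape[0], 1))
    A = np.hstack([X, ones])                       # append a column for q
    coef, *_ = np.linalg.lstsq(A, P, rcond=None)   # shape (N + 1, 5)
    C, q = coef[:-1].T, coef[-1]
    return C, q

# Usage: C, q = train_fit_constants(train_coords, train_params)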

The steps involved in evaluating the speed and efficiency / fake rate of the implemented algorithms are more complex:

1. A pattern bank is constructed using the training tracks
2. Pattern recognition is run for the test tracks
3. All possible track-forming combinations are generated based on the pattern recognition results (roads)
4. These tracks are fit using fixed-point arithmetic
5. The fit tracks that make the χ2 cut are compared to the test dataset to compute the efficiency / fake rates
6. The hits of the computed tracks and the roads are sent for processing on the FPGA device to compare the results with the software-generated tracks and directly extract processing time figures from the implemented design

The pattern bank is constructed in such a way that it achieves 100 % efficiency for the training tracks. The resulting size of the pattern bank is ≈100 k. A more efficient pattern bank generation would use the DC bit functionality to reduce the number of patterns; that would also increase the number of track combinations per road. Given the detector setup that is used, however, that figure is already sufficiently large to stress the algorithm implementations; the complexity of compressing the pattern bank and further tuning the detector model to match real-world conditions can be avoided.

Steps 4 and 5 are very computationally intensive. For that reason, the computed pattern bank is segmented and the computations are run in parallel using the Python concurrent.futures library. The workload is CPU intensive and it scales almost linearly with the number of threads, completely saturating a 36-thread Intel® Xeon® CPU. To further reduce run time, especially during the development of the software, disk storage is used to save intermediate results at various points of these processes.
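The parallelization pattern mentioned here is the standard one; a minimal sketch (with fit_segment as a stand-in for the real per-segment work, not the actual simulation code) looks like this:

from concurrent.futures import ProcessPoolExecutor

def fit_segment(segment):
    """Stand-in for the expensive work done on one slice of the pattern bank
    (matching roads and fitting every hit combination in that slice)."""
    return [sum(combination) for combination in segment]   # placeholder work

def fit_all(pattern_bank_segments, workers=36):
    # One worker process per (hyper)thread; results come back in submission order.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_segment, pattern_bank_segments))

if __name__ == "__main__":
    segments = [[(1, 2, 3), (4, 5, 6)], [(7, 8, 9)]]
    print(fit_all(segments, workers=2))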

3.6.3.3 Reconstruction Resolution and Performance

The resolution of the track reconstruction algorithms is described in figure 3.27 on the following page. On the top part of each panel, histogram plots represent the difference between the parameters that were used to actually generate the tracks, and the parameters that were computed by the track fitting algorithm, for 10 k tracks. On the bottom part, a scatter plot highlights the difference, per histogram bin, between the floating-point and fixed-point results.

The resolution for the d0 and z0 parameters is perfectly aligned with the resolution of the “toy” detector itself (8 µm for the pixel layers and 20 µm for the strip layers). For the η, 1/PT, and φ0 parameters, the resolution is less than 1 %. No significant bias errors are observed in any parameter.

As the histogram plots show, no noteworthy differences are found between the results computed by floating-point and fixed-point representations. The scatter plots show differences of up to 40 % close to the outer edges; however, this is not particularly surprising given the fact that the deviations the histogram bins represent at those areas are very small.

On figure 3.28 on page 109, the ideal χ2 distribution for a system with 6 degrees of freedom is shown.28 Drawn on top of that are histograms taken from the floating-point and fixed-point fit results. These values correlate fairly well with each other and further support the good track parameter matching observed in figure 3.27 on page 108.29 Finally, from the dataset that produces this histogram, one can compute the χ2 value that has to be used as a cut to achieve a certain efficiency; for a target efficiency of 99.5 %, the resulting χ2 is 18.3.

28That number is derived from the number of coordinates used for the fit (11) minus the number of parameters computed (5).

Figure 3.27: Track parameter resolution plots: (a) track curvature 1/PT, (b) pseudorapidity η, (c) impact parameter d0, (d) z0 position, (e) azimuthal angle φ0; each panel compares the floating-point and fixed-point results.

Figure 3.28: Floating-point and fixed-point χ2 distribution comparison (theoretical curve for NDOF = 6 overlaid)

Figure 3.29: Processing time vs occupancy: the processing time is shown as a function of the number of generated tracks inside the phase space ∆η × ∆φ ≈ 0.1 × 0.2 on figure (a); histograms of the processing time for the cases of 5 and 10 tracks are shown on figure (b).

Figure 3.30: Floating-point and fixed-point efficiency / fake rate comparison

3.6.3.4 Reconstruction Processing Time as a Function of the Detector Occupancy

Given the extreme effect it has on the combinatorics involved in a track reconstruction system, it becomes crucial to examine the processing bandwidth of these algorithm implementations as a function of the detector occupancy. The number of tracks that appear in a narrow ∆η × ∆φ ≈ 0.1 × 0.2 region has been selected as a reasonable metric of detector occupancy.

Because of the low resolution used for the pattern matching, a road may contain physical hits belonging to different particles. Also, depending on the road size and on the event hit density, there is a level of combinatorial background consisting of fake tracks, i.e., track candidates that will be rejected at full resolution. Therefore, the processing time30 strictly depends on the following quantities: (i) occupancy, which is a measure of the track density; (ii) roads per track, the ratio of the number of matching roads to the number of actual tracks per event; and (iii) combinations per road, the number of hit combinations to be fitted per road, on average.

When many tracks lie in a small phase space, as is the case in high energy jets, the number of fits to be executed increases substantially, and with it, the time needed to process such events. Figure 3.29a on the previous page shows the extreme way occupancy impacts processing time; increasing it by one order of magnitude leads to an increase in processing time of more than two orders of magnitude, due to the geometric increase in combinatorics. It is important to note that in an average occupancy environment the processing rate is greater than 100 kHz, and that even in the worst—and rare—occupancy case that involves ≈25 k track combinations on average, the processing rate still does not drop below 5 kHz for 95 % of events.

29As a side note, the histograms match the ideal χ2 distribution because they are only computed for valid tracks. In real conditions, where the χ2 is also computed for invalid tracks that are rejected, the tail is more pronounced and an additional peak at some large χ2 value (of the order of some millions) appears.
30Here processing time is defined as the amount of time that passes from the first pattern matching result to the last track that exits the design: the pattern matching time itself is irrelevant to the algorithms examined, and the time it takes to write the hits to the DO is also less than the time it takes to write them to the pattern matching structures and, thus, is also deemed irrelevant.

The processing time adheres very well to the formula tproc[ns] = 290 + 2·Nfits, which factors in two distinct components: the fixed initial latency, and the number of fits to be performed. That can be better seen on figure 3.29b on page 109, which shows an asymmetric distribution for five tracks; when the number of combinations is not high, the initial latency becomes a significant component of the distribution, skewing it to the left.31 Finally, it can be observed that both distributions have long tails towards high values; that is owed to the statistical variability of the number of fits to be processed.
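As a quick illustration of the formula (the values below are computed from it, not additional measurements):

\[
N_{\text{fits}} = 1000 \;\Rightarrow\; t_{\text{proc}} \approx 290 + 2\cdot 1000 = 2290\ \text{ns} \approx 2.3\ \mu\text{s},
\qquad
N_{\text{fits}} = 25\,000 \;\Rightarrow\; t_{\text{proc}} \approx 50.3\ \mu\text{s}\ (\approx 20\ \text{kHz}).
\]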

Finally, the impact of occupancy on the efficiency and fake rates can be seen on figure 3.30 on page 109. The efficiency is relatively flat and remains high—around 99.5 %, as dictated by the χ2 cut that has been selected. The fake rate, however, starts to increase as more and more tracks are collimated, with more hits being close to each other—producing tracks that are very close to valid ones. The last point in the plot corresponds to 10 tracks in a region equivalent to a jet cone of radius ∼0.22; this is considered to be a very high track density and the increase in fake production is an expected effect. The good agreement between the floating-point and the fixed-point results can also be seen here. It should be noted that there is a slight disagreement of the order of 0.1 % for an occupancy index of four; this discrepancy can be considered a statistical anomaly, however, as the agreement has been observed to improve by increasing the sample size.

3.6.3.5 Testing and Validation

As an intermediate step of the efficiency and fake rate versus occupancy computations, the Python simulation produces detector hits, roads found by the pattern matching, and fitting constants, before it generates the expected output, the fitted tracks, to compare with the originals and track the quality of the results. As has been mentioned before, the software can export the roads, hits, and fit constants as input test vectors in a textual format. These can then be read in by the testbench to be used in behavioral simulation, or to be loaded on-board to drive the design on the FPGA.

That allowed taking direct measurements for the processing time plots presented in figure 3.29 on page 109; these data points have been taken by counters implemented in the design running on the FPGA, as the events are processed.

31A source of some slight deviation from this formula, that is noticed in low numbers of combinations, is the dependence on the DO read bandwidth (400 Mhit/layer/s, or 2.5 ns per hit), rather than the TF (500 MFits/s, or 2 ns per fit), when the number of combinations approaches one per road. The fact that the processing speed does not drop below this baseline provides additional confirmation of the DO read bandwidth, and the combiner processing speed, being sustainable and impervious to fluctuations in the number of hits per road and number of combinations.

Figure 3.31: Top view of the PRM06 demonstrator board

Finally, with the help of this software, bit-wise comparisons for thousands of randomly- generated events have been performed. As expected, the fixed-point results of the software match perfectly with the results taken from the FPGA design, effectively validating the processing chain. To add to this, the large number of roads and combinations of the higher-occupancy events applies pressure to all processing steps: the DO has to extract a large number of roads per event, the combiners and the TF generate and fit many thousands of track candidates per event; every component is pushed to, and validated at, the maximum of its capabilities.

3.7 The PRM06 Demonstrator for L1 Track Reconstruction in CMS

The DO and TF implementations described on sections 3.3 and 3.5 have been used in the PRM06 board FPGA design [84]. That is a custom 14.9 cm × 14.9 cm PCB that has been designed to perform low-latency track reconstruction [85] so that the AM Chip technology can be evaluated in the context of L1 triggering for the CMS detector after the HL-LHC upgrade. The board is pictured in figure 3.31 on the preceding page. It is a mezzanine, meant to be mounted on a host board such as the Pulsar IIb [63]. It comprises the following major components:

• One Xilinx Kintex Ultrascale FPGA, shown in green
• 12 AM Chips, shown in orange
• Two low-latency, 1.125 GBit RLDRAM3 memory parts, shown in blue
• Power regulator blocks, shown in teal
• Two FMC connectors, shown in red

The Kintex Ultrascale FPGA32 is the core of the mezzanine: it receives and temporarily stores the full-resolution stub data coming from the host board, evaluates the reduced resolution stub data to be used in the pattern recognition and distributes them to the 12 AM Chips, collects the pattern recognition results, uses them to retrieve the full-resolution stubs, and finally performs further combinatorics reduction operations and track fitting.

Each one of the two RLDRAM3 external memory parts provides a total of 1.125 Gbits of fast, low-latency memory storage in a 32 Meg × 36 bit organization. Their purpose is to keep a copy of the data stored in the AM pattern bank, to be used in decoding the pattern recognition results back into SSs. The two memory parts can provide a cumulative random read bandwidth of >900 MT/s. Each transfer represents a 144-bit packet, that carries the SS and DC bit information of up to 8 layers. The read bandwidth fulfills the demands emanating from the maximum burst transfer rate of 50 × 12 = 600 MT/s the AM Chip bank output can achieve.
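As a back-of-the-envelope consistency check derived from the figures above (illustrative only),

\[ \underbrace{50\ \mathrm{MT/s} \times 12}_{\text{AM Chip bank output}} = 600\ \mathrm{MT/s} \;<\; \underbrace{900\ \mathrm{MT/s}}_{\text{cumulative RLDRAM3 read bandwidth}}, \]

so the two memory parts can sustain the worst-case burst rate of the AM Chip bank with some margin.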

The two FMC connectors used to interface the PRM to the Pulsar IIb host board are compliant with the VITA 57.1 standard, carrying both data signals and power supply voltages. The host board provides 3.3 V and 12 V power rails, with a maximum available power draw of 150 W. The power regulators are then used to generate all the voltages (1.0 V, 1.2 V, 1.8 V, 2.5 V) required by the FPGA, AM Chip and RLDRAM3 parts.

Regarding the host board communication, 6 bidirectional high-speed serial links (3 links through each FMC connector) are used to send and retrieve data from the PRM, corresponding to a total symmetric I/O bandwidth of up to 150 Gbit/s.33 Moreover, 34 additional LVDS (Low-Voltage Differential Signaling) lines are used to cover the slower configuration and monitoring needs of the PRM.

The PRM06 board is the successor to the PRM05 board [86] that featured a Kintex-7–family FPGA, Double-Data-Rate (DDR) II memory parts, and AM05 AM Chips.

32The exact part number is xcku060-ffva1156-1. 33The links have been tested for line rates of up to 12.5 Gbit/s.

Figure 3.32: PRM06 demonstrator system block diagram

Selecting a Kintex Ultrascale–family part for the PRM06 allowed for reduced power, faster serial link transceivers, more computing resources, and significant speed improvements in high-utilization scenarios, compared to the previous-generation, Kintex-7–based board [87]. These improvements greatly help the chances of achieving the very tight latency target of 3 µs.34

3.7.1 The PRM06 FPGA design

A block-level overview of the PRM06 FPGA design can be seen on figure 3.32. Dataflow starts with the serial links receiving stubs from the motherboard through the FMC connectors. These are fed to a local-to-global coordinate conversion module so that the stubs stored in the DO can be used directly for the computations that take place in the TCB module. The stubs are also passed to the SSID extraction block that generates the SSs for the pattern recognition. These are sent to the DO and the AM Chip array, through 2 Gbit/s serial links. After the pattern recognition operation is complete, the AM Chips send the roads to the FPGA through the serial links. Then, the external RLDRAM3 memories are used to convert them back to SSIDs to extract the full-resolution stubs from the DO. The DO outputs the stubs across different layers in parallel, while the TCB features a parametric number of processing pipelines,35 each of which receives the stubs that have been extracted for a road in a serial way. A stub serializer module takes care of this rearranging and feeds the processing pipelines of the TCB. Finally, the latter feeds the track candidates to the TF, which performs the track fits; the fitted tracks are sent back to the host board by the serial links that pass through the FMC connectors.

34In this context, latency is defined as the period between the time the first stub is sent to the mezzanine and the time the last track is returned to the host board. 35In this design that number has been set to 12.

Moreover, IPbus communication with the host board is provided by making use of the FMC-provided LVDS lines. The Ethernet-accessible IPbus master module is intended to be instantiated on the host board, and the LVDS link with the PRM implements the master- slave bus shown in code listing 3.5 on page 99. The IPbus interface provides a convenient means of handling board control and configuration operations (e.g. configuring the AM Chips and loading the pattern banks, loading the TCB and TF fit constants, run control).

Apart from the main building blocks, features that improve stability and resilience of the design have been added. Cyclic Redundancy Check (CRC) is performed for the hit data packets that are received by the six input serial links. The correct CRC is received at the end of each data packet and, in case data in the packet have arrived corrupted, this information is propagated to the host board and the event is repeated. Moreover, a watchdog module stands between the serial links that receive the pattern matching results from the AM Chips and the DO-AM handling logic: in the event some AM Chip’s response has timed out, the end-of-event word gets injected into the data-stream.

Overall, this is a moderately complex design. While the number of clock domains does not constitute an absolute metric of design complexity, it provides an indication. The “user logic”36 of the design operates across 10 clock domains:

• two clock domains for the host board serial links (TX/RX)
• two clock domains for the AM Chips serial links (TX/RX)
• two clock domains for the two external memories
• the clock domain for IPbus communication and the configuration logic
• the DO clock domain
• the TCB clock domain
• the TF clock domain

Of the algorithm implementations described in this chapter, this design uses the DO and the TF. Instead of the combiner algorithm, a Track Candidate Builder (TCB) module is used: this implements a combinatorics reduction algorithm37 that extrapolates a cone around a track formed by the three38 innermost layers, and rejects any stubs that lie outside of this cone. The stubs that lie inside of this projected cone are used to generate the track candidates that will be fit by the TF.

36The logic that lies outside of any ready-to-use IP cores. 37This has been developed by a group from the University of Lyon. 38Or two innermost layers, in case the detected pattern does not contain any stubs for one of these layers.

The remainder of this section will detail the adaptations made to these modules and, finally, outline the performance and resource utilization figures.

3.7.1.1 DO Port Configuration and Double Instantiation

In this design, the single-port configuration of the DO has been used. We chose this particular configuration due to considerations regarding the TCB processing rate. As reported earlier, the TCB has multiple processing pipelines, each of which receives hits serially. That will not drop its track processing rate, however, as every clock cycle there is a new track in the output of the DO39 that can go to one of the 12 TCB pipelines; in fact, even an average of two 6-hit tracks could be written to the TCB pipelines every clock cycle. The TCB processing pipeline architecture, however, has a large initiation interval and blocks new data until certain resources, internal to it, are freed. That lowers the effective throughput and would render any additional ports in the DO output redundant.

The demanding processing rate specifications for this application dictate the use of the pipelined mode of the AM Chips. This mode stores the pattern matching results in a buffer so they can be sent at the same time hits of the next event are written to it. On the other hand, as stated in section 3.3 on page 67, the DO can either be in the write phase or the read phase; writing data before the completion of the read phase would corrupt the linked list structures.

To support uninterrupted event processing and the AM Chip pipelined operation, two DO instances are employed. Writing and reading happens in a way similar to a ping-pong buffer; while one DO instance is used to read the last event, the other instance is used to process the incoming event without waiting for the last event to be completely read out, which would cause a prohibitive drop in the processing rate. The outputs of the two DOs towards the TCB module are selected in a similar way; only one can be active at a time.
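A minimal sketch of the ping-pong selection idea is given below; the module and signal names are assumptions for illustration, not the actual PRM06 control logic, which also has to coordinate the AM Chip and TCB hand-offs.

// Ping-pong selection between the two DO instances: one instance absorbs
// the incoming event while the other is read out; the roles swap at every
// event boundary.
module do_pingpong_sel (
  input  logic clk,
  input  logic rst,
  input  logic end_of_event,   // marks the boundary between incoming events
  output logic write_sel,      // 0: write DO 0 / read DO 1, 1: the opposite
  output logic read_sel
);
  always_ff @(posedge clk)
    if (rst)               write_sel <= 1'b0;
    else if (end_of_event) write_sel <= ~write_sel;

  assign read_sel = ~write_sel;  // the instance not being written is read out
endmodule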

3.7.1.2 TF Parameter Binning Support

As was reported previously, linear track fitting can produce good results only when the fit coefficients are used within the narrow area for which they have been generated. That means that the right set of constants has to be selected and loaded according to the region of the track being fitted and, consequently, some information on that has to be known beforehand.

39Assuming the two modules share the same operating frequency.

When a pattern recognition step precedes the track fitting, the ID of the pattern that gave the (potential) track to be fit could be used to retrieve the right set of constants, as it dictates the region in which the track can be found. In this architecture, however, the TCB module, which precedes the track fitter, can produce even more accurate results. This processing step not only filters the stubs to produce a smaller number of combinations but, as was mentioned before, computes estimates for the pT and the η parameters of the potential track as a byproduct.

These preliminary values are then provided to the track fitter and rounded into 16 bins in the pT region and 20 bins in the η region. That allows a selection among 320 constant sets, all of which are stored using BRAM resources, to be used for the track fitting.
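A minimal sketch of how such a selection could be addressed is shown below; the linearization order of the bins and all names are assumptions for illustration, not the actual PRM06 code.

// Map the preliminary pT/η estimates to one of the 320 constant sets
// (16 pT bins × 20 η bins) stored in BRAM.
module fit_const_addr (
  input  logic [3:0] pt_bin,     // 0..15, from the TCB pT estimate
  input  logic [4:0] eta_bin,    // 0..19, from the TCB η estimate
  output logic [8:0] const_addr  // 0..319, BRAM address of the constant set
);
  assign const_addr = pt_bin * 20 + eta_bin;  // row-major linearization
endmodule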

3.7.1.3 Performance and Resource Utilization

The maximum operating frequencies achieved in this design are 500 MHz for the TF, 333 MHz for the TCB, and 250 MHz for the DO. It is worth mentioning that the DO and TF modules do not reach their full potential in terms of the operating frequency. The prime cause of this discrepancy is the logic that has been developed to control the DOs, and the interconnecting logic that transforms the parallel data at the DO output to serial data in a number of lanes to feed the TCB processing pipelines. Unfortunately it was not possible to further optimize the performance of the design because of constraints in the completion timeline of this R&D project. It should be noted, however, that the operating frequencies already achieved are enough to meet the <3 µs latency goal for tt̄ events with an average pile-up of 200.

Table 3.7 shows the resource utilization of the major blocks that comprise the PRM06 FPGA design. It can be seen that almost half of the LUT and FF resources of the device are occupied by the two DO instances. Also, a noticeable amount of resources is allocated by control logic and the Stub serializer module.

Table 3.7: PRM06 resource utilization

module           FF (k)   FF (%)   LUT (k)   LUT (%)   BRAM (units)   BRAM (%)   DSP (units)   DSP (%)
DO 1             99.4     15.0     49.2      14.8      135            12.5       0             0.0
DO 2             99.4     15.0     47.4      14.3      135            12.5       0             0.0
DO-AM Ctrl       26.7     4.0      27.1      8.2       66             6.1        48            1.8
TCB              79.4     12.0     57.1      17.2      180            16.7       83            3.0
Stub serializer  32.0     4.8      27.8      8.4       0              0.0        0             0.0
TF               35.2     5.3      2.1       0.7       72.5           6.7        360           13.0
Total            394.6    59.5     226.4     68.3      643.5          59.6       494           17.9

DSP usage remains low, as the throughput of the TCB core does not justify using more than one TF unit, and increasing the number of processing pipelines in the TCB module would have an impact on the operating frequency, because of the effect the increase in LUTs would have on the routing resources. Overall, it can be seen that the complete design occupies roughly 70 % of the device in terms of LUTs, and roughly 60 % in terms of FFs and BRAM blocks.

3.8 Conclusions

To reiterate the importance of hardware-based systems for track reconstruction it is worth acknowledging an alternative approach that has attracted the attention of the HEP community for real-time applications, GPU processing, due to the GPUs' improved performance over CPUs in a variety of applications [88, 89]. Both ATLAS [90, 91] and CMS [92] have studied the performance of modern GPUs in real-time tracking. Even though performance still looks promising compared to CPU-based solutions, the latency scales up to tens of milliseconds for simplified algorithms and lower detector occupancy, with a sharp increase to hundreds of milliseconds for increased occupancy environments.

In this chapter, novel FPGA implementations for two algorithms of great importance to pattern matching–based track reconstruction were presented: the Data Organizer, which is essentially a high-performance sparse array of linked lists; and the Track Fitter, which computes scalar products in a massively parallel manner, by making efficient use of the DSP slices that abound in modern FPGA devices. Detailed presentations of an evaluation board–based system, developed to evaluate these algorithms' performance, and a hardware-based system that performs low-latency, high-performance track reconstruction utilizing these implementations, conclude the chapter.

Performance and resource utilization figures were discussed for a number of modern, mid-grade FPGA devices. Across all tested devices, the operating frequencies achieved for the DO and TF are found to be greater than 400 MHz and 500 MHz, respectively. These figures are well above the 200 MHz target clock of the corresponding FTK functions [46]. An important advantage of the novel DO implementation, apart from the above-mentioned performance gains, is the elimination of the restrictions the FTK implementation imposed on the data format. This is especially important when considering its use in low-latency applications, as adhering to these restrictions implies spending a significant amount of processing time to reorder data. The techniques employed to achieve such high levels of performance were described in detail.

The verification strategies that were employed for these components were discussed in section 3.6 on page 96. These include a unified hardware/software approach that utilizes a single UVM testbench for both RTL simulation and on-board tests through the DPI interface and the IPbus infrastructure. Furthermore, a software suite that implements a toy detector model has been coded to further assist in validating the implementations and facilitate the exploration of the track reconstruction performance of the pattern recognition–linear track fitting method.

In conclusion, the hardware approaches described in this chapter are orders of mag- nitude faster than any available commercial computing device. Therefore, both ATLAS and CMS consider their possible application at the L1 trigger [68] for the 2025 HL-LHC accelerator upgrade, when the LHC luminosity will cause hundreds of pile-up collisions and will require much faster—and more efficient—trigger selections.

Chapter 4

Conclusions

“Finally, in conclusion, let me say just this.” — Peter Sellers

4.1 Review

The objective of this thesis has been to explore strategies and methodologies for designing high-performance FPGA algorithm implementations. To that end, algorithms from the fields of machine vision and track reconstruction in High-Energy Physics have been used; topics ranging from the architectural decisions, through design methodologies and strategies, to the verification methods that were employed to validate the resulting components have been discussed in detail. Moreover, the integration of said components into fully realized systems has been outlined.

The work has been divided in two chapters, based on the field of application of the algorithms discussed: the algorithms presented in chapter 2 on page 17 target machine vision applications; chapter 3 on page 55 examines algorithms for track reconstruction in High-Energy Physics experiments.

In chapter 2 on page 17, algorithms used in the machine vision subsystem of a Lab-on-Chip application were laid out. As stated in the introduction of this chapter, the LoC system, which targets molecular diagnostics applications, allows performing complex biological experiments in a Point-of-Care environment. To achieve that level of miniaturization while keeping the number of on-chip sensors low, a machine vision system is put in place to monitor the on-chip flows, its first image processing step being

an edge detection algorithm. A Canny algorithm implementation has been developed, exploiting the principles of parallelism, computing four pixels per clock cycle; and pipelining, both at the cycle level and the block level. As a consequence, a throughput of more than 1 Gpixels/s is achieved, making this edge detection implementation ideally suited to feed the remainder of the real-time system; the performance exceeds the system requirements, allowing even high-resolution images to be used. An additional advantage of this architecture is that by employing pixel computation parallelism, the memory requirements have not been increased but the memory read accesses have been reduced. In addition, the algorithm is split in processing steps, two of which (namely the Gaussian smoothing and the Sobel edge detector) are image processing filters that are widely used by themselves, adding to the merit of this novel implementation.

The reasoning behind the approximations used in certain elements of the proposed architecture, such as the gradient computation (see subsection 2.3.1.2 on page 33) and the single hysteresis pass (see subsection 2.4.2 on page 38), has been explained in detail; furthermore, their impact on the quality of results has been fully quantified. Performance and resource utilization figures are presented for each component of the implementation, accompanied by a brief overview of differences between successive FPGA families. Moreover, details of an image compression implementation that has been developed are laid out, along with arguments for not utilizing that component in the final system.

Finally, the packaging of the machine vision subsystem in the form of an IP Core, a task undertaken by the author, is outlined in subsection 2.5.3 on page 49. Performance and resource utilization figures of the machine vision subsystem and the full, integrated, system conclude the chapter.

Chapter 3 on page 55 covers the redesign and verification of key algorithms used in the FTK track reconstruction system. This is a system designed to perform real-time reconstruction of the particle trajectories traced in the silicon layers of the ATLAS detector by using a combination of ASICs and FPGAs. These algorithms are redesigned to support other, even more demanding applications, such as the future L1 trigger systems of the ATLAS and CMS experiments, which will be much more computationally intensive as a consequence of the future HL-LHC upgrades.

The Data Organizer, which is based on a high-performance implementation of an instantly-erasable array of linked lists, performs a data handling task: it stores full-resolution hits based on low-resolution identifiers, and retrieves them after the pattern matching step based on these identifiers. Supporting the features of the AM ASICs, such as variable-size patterns and missing layers, has implications that severely add to the complexity of the architecture. This novel architecture has been based on a combination of wide, BRAM-based memories, collision resolution logic, and a wide register file to avoid any reset of large memories that would lead to downtime between successive events. Advanced design methods, such as the automated generation of Look-up Table (LUT) instantiation code, were employed to arrive at a final implementation that supports an operating frequency of upwards of 400 MHz, greatly surpassing the specification targets.

The combiner, a combination generating module, links the DO to the final component of this track reconstruction chain, the Track Fitter module, by producing all the track- forming combinations out of sets of hits. Its architecture was designed such that one combination is produced per clock cycle. Finally, the TF architecture is presented, which makes efficient use of the DSP slices and their dedicated interconnects to provide high- performance scalar product computations. By adding a network of systolic arrays of registers one full track can be processed per clock cycle. Physical layout considerations were taken into account early in the design phase, leading to an implementation that runs at a frequency of 600 MHz, reaching the silicon limits of the FPGAs used.

The DO and TF implementation descriptions are followed by a detailed discussion on their performance, resource utilization, and power figures for various configurations. Additionally, low-level details in the performance and power differences between two successive FPGA generations are discussed in subsection 3.3.4 on page 81.

The methods used for the thorough and intensive verification of the DO component are detailed in subsection 3.6.2 on page 99. Taking into account the complexity overhead that results from the focus on a high operating frequency, laying out a complete verification strategy is essential, as it is difficult for a typical, rudimentary testbench to cope with the number of testing scenarios and corner cases involved. In addition to that, in order to keep mistakes in design entry as few as possible, advances in Hardware Description Languages have been studied and recent features of the SystemVerilog HDL, such as the interface and always_comb constructs, have been utilized in writing the RTL code.

A track reconstruction system based on an evaluation board has been designed and presented in section 3.6 on page 96. Software that emulates a toy detector model to assess the quality of the track parameter computation and help further validate the algorithm implementations has been written. The principles of this detector model, and metrics on the reconstruction performance, are discussed in detail. In addition, a novel transaction-level software/hardware co-simulation approach that utilizes a single UVM testbench for both RTL simulation and on-board tests has been realized and described.

Finally, the PRM06 demonstrator for track reconstruction at the L1 trigger for the

HL-LHC CMS upgrade, based on pattern matching, is presented in section 3.7 on page 112. The DO and TF implementations comprise critical components of the FPGA design, whose development has been primarily conducted by the author. Adaptations to the algorithm implementations and fine points of the design are presented. Finally, a discussion on the performance and resource utilization of the design concludes the chapter.

4.2 Future Work

This research has given rise to a number of topics in need of further examination; it is, thus, proper to conclude with a brief discussion of the possible directions for future work.

One topic that can be examined regards the Canny edge detector algorithm, which offers itself as a prime candidate for an HLS implementation. Although there exist such implementations, to the best of the author's knowledge an implementation that handles pixel-level parallelism in a way that is similar to this work has not been published yet. A comparison between the well-written RTL implementations that have been presented and the automatically-generated logic the HLS compiler would emit would be of great interest. Furthermore, considering the relative ease with which a configurable convolution core can be implemented using HLS techniques, an exploration of the pixel-level parallelism and the impact of this on the performance and resource utilization would provide an excellent side-topic to this.

Another interesting project that could arise is the following: linked-list structures are very commonly found in software algorithm design; a critical part of the DO is the high-performance array of linked lists implementation. This can be extracted from the rest of the DO, and modified to be used outside its narrow context, as a building block for other, more complex algorithms. This could offer itself for use in an application that may require sequential access to a number of arrays, across which data is distributed, without a size limit on any specific array, but only on the total amount of data to be stored. This feasibility study might lead to a challenging but worthwhile project.

The work described is complete in the sense that it has already found use in its target applications. However, there are further improvements that could be made but had to be deferred to future work, due to project deadlines that had to be respected.

An example of those can be found in the BRAM power consumption in the DO implementation for the Kintex Ultrascale family: a power saving feature has been introduced, the SLEEP mode. That gives the designer the opportunity to completely shut down the memory when more than five clock cycles of inactivity can be foreseen. In the DO that can be exploited easily by turning the memories off in the idle periods where it is in neither the write nor the read phase. Another, more aggressive, power saving option could be to bundle write and read requests together, switching the memory structures off between bursts. That would optimize power consumption in the case of the write or read bandwidth not being fully utilized, at the expense of some added complexity and extra latency.

Also, as was discussed in subsection 3.7.1.1 on page 116, two DO instances are employed to support uninterrupted event processing in the PRM06 FPGA design, and the write and read scheduling behaves in a way similar to the use of a ping-pong buffer. For a relatively uniform event size distribution this natural arrangement does not hinder optimal performance, as the two instances will be equally busy on average. When events that greatly vary in complexity arrive, however, it will inflict a loss of performance. Let us consider an example: a large event is read into DO instance 1, and two small events follow; the first small event is quickly processed by DO instance 2, but the second small event has to wait for the large event to be processed by instance 1—while at the same time DO instance 2, which received the first small event, is already free. Therefore, the scheduling of the DO instances is an area that can be improved. Each instance should be considered as a resource that is shared between successive incoming events. Thus, an arbiter module could be implemented in the DO-AM controller: that would decide which DO instance to write the incoming event to depending on which one is actually free, in contrast with the simple ping-pong logic that can prove less efficient.

Overall, it would be fair to conclude that the design and implementation of increasingly demanding systems will keep challenging us for years.

Appendix A

Designing around Metastability

“…but the simulation works!” — Anonymous FPGA designer

A.1 Introduction

Due to the design complexity in contemporary FPGA designs, such as the ones presented in this work, a considerable number of clock domains is involved. The relationship between two clock domains can be categorized as follows:

• Synchronous, when the clocks have an exact N:M clock period relationship and a fixed phase difference (i.e. they are generated by the same PLL)
• Mesochronous, for clocks that have the same frequency but an unknown phase difference: their phase difference can “wander” significantly enough not to be considered fixed
• Plesiochronous, when the two clocks have a small frequency difference
• Asynchronous, the most generic term, when none of the above applies

In a typical design, only a few clocks are fully synchronous to each other. When moving data across clock domain boundaries that are not synchronous, metastability issues need to be considered carefully to avoid reliability issues as serious as system failures.

To correctly capture data at their input, registers have certain timing requirements: the input must be stable for some amount of time before the clock edge (the setup time, tSU ) and for a different amount of time after that (the hold time, tH ). If these are upheld,


the output of the FF is provided within a clock-to-output delay (tCO) after the clock edge. If these requirements are violated, however, the output of the FF may take an undefined value, between the high and low voltages, which is called a metastable state. It will leave this problematic state after some time and assume either the old or the new input value; how long that takes cannot be guaranteed, but it can be estimated statistically (i.e. a certain probability can be assigned to the register leaving this state within a certain amount of time). For the process technologies and operating frequencies involved in

FPGAs, that can be assumed to be greater than tCO, but usually less than a clock cycle.

As this metastable state can be propagated and affect logic further into the design, unexpected failures can occur. Problems with metastability are difficult in a number of ways:

• they can be hard to trace; they do not show up on most simulations and the design may even be working as intended for extended periods of time
• depending on the specifics, the design may even work one day and not the next, depending on minor variations in voltage or temperature
• they can cause completely unpredictable behavior in a certain part of the design that manifests as a problem in a different part

Specialized circuitry can be placed in Clock Domain Crossings (CDCs) to deal with metastability in a safe way; knowledge of such techniques is equally crucial for ASIC and FPGA engineers alike.

A.2 CDC Circuits

Depending on the specifics of the signals that have to cross clock domains, there are a number of circuits that can be used; here, two of those circuits are presented, the 2FF and the handshake synchronizers. They are enough to explain the basic principles and they cover two very common cases. Other circuits include the recirculation-MUX synchronizer for passing seldom-changing data, gray converters before and after the CDC for handling counters, and the more complex FIFOs [93].

A.2.1 The basic 2FF synchronizer

This is the simplest CDC circuit and it also serves as a basis for other, more complicated ones. It can be used to cross a slowly changing signal between two asynchronous clock domains.

Figure A.1: Basic 2FF synchronizer schematics (a) and example timing diagram (b), illustrating the metastability resolution.

A schematic and its basic operating principle can be seen on figure A.1. The first FF on the clock domain clkB is susceptible to metastability, as its input is generated at the asynchronous clock domain clkA. Its output is connected directly to a second FF, however, leaving ample time for any metastable state to resolve itself towards the new or the old value of the synchronizer input before it is sampled by the second FF at the next clock cycle. Ideally, the two FFs are placed close together such that the slack between them approaches the clock period.
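A minimal SystemVerilog sketch of this circuit is given below (signal names are illustrative); the ASYNC_REG attribute discussed in appendix C is used to keep the two registers together through synthesis and placement.

// Basic 2FF synchronizer: the first register may go metastable, the second
// gives it (almost) a full clock period to resolve before it is used.
module sync_2ff (
  input  logic clkB,
  input  logic sig_a,   // slowly changing signal, registered in the clkA domain
  output logic sig_b    // synchronized version, safe to use in the clkB domain
);
  (* ASYNC_REG = "TRUE" *) logic [1:0] sync_regs;

  always_ff @(posedge clkB)
    sync_regs <= {sync_regs[0], sig_a};

  assign sig_b = sync_regs[1];
endmodule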

The Mean Time Between Failures (MTBF) can be computed by the following formula:

\[ \mathrm{MTBF} = \frac{e^{t_{\mathrm{MET}}/C_2}}{C_1 \cdot f_{\mathrm{clk}} \cdot f_{\mathrm{data}}} \]

Here, tMET is the available metastability resolution time, fclk is the clock frequency, fdata the data toggle rate, and C1 and C2 are constants that depend on the process technology and the operating conditions. Depending on the application, the MTBF values can range from some years to some billions of years. To achieve a larger MTBF, more than two synchronization registers are used. That has a dramatic effect on MTBF, as it changes the tMET term that is used inside the exponential by adding to it the resolution slack from each synchronization register stage.1 While this may seem to be a very safe value, it has to be taken into account that the overall system MTBF, i.e. the probability that some synchronizer will fail, is defined by

\[ \frac{1}{\mathrm{MTBF}_{\mathrm{system}}} = \sum_{i=0}^{N} \frac{1}{\mathrm{MTBF}_i} \]

That has two major implications for the total system failure rate: (i) it is dominated by the least reliable synchronizer, and (ii) assuming equivalent synchronizers, the failure rate is multiplied by their number.

1The MTBFs of the various synchronization steps are effectively multiplied, as they all have to fail at the same time for the whole synchronizer to fail.

Figure A.2: Handshake synchronizer circuit schematic

As an example, a system that has a MTBF of one year and one thousand clock domain crossings will fail every few hours; timing-critical systems have MTBF requirements of many times their expected life cycle. Thus, it is not uncommon to encounter longer synchronizer chains.
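For a quick sanity check of that statement (illustrative numbers only),

\[ \mathrm{MTBF}_{\mathrm{system}} \approx \frac{1\ \text{year}}{1000} \approx \frac{8766\ \mathrm{h}}{1000} \approx 8.8\ \mathrm{h}. \]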

A.2.2 The handshake synchronizer

Synchronizing one-clock pulses using the basic 2FF synchronizer and a pulse detector at its output would work, but only if the destination clock domain is faster than the source clock domain by an amount that would cover any metastability upsets. To guarantee that a pulse that has to cross clock domains finds its way, another approach can be used, instead.

Figure A.2 shows the schematic of a handshake synchronizer circuit. The pulse is sampled through M1, M2, and A so that the output of A stays up. This stable signal crosses clock domains through the B1 and B2 registers, and a positive transition is detected by the PD–AND gate combination. Once the positive signal is out of the 2FF synchronizer, it is crossed back to the source clock domain through a second 2FF synchronizer comprising A1 and A2. That way, M1 propagates a zero to register A, completing the transaction and returning the system to its initial state. In the duration of each transaction any new pulses will be lost, making the generation of a busy signal necessary. This method is suited for “infrequent” transactions; to count all possible clock edges, a more complex, FIFO-based circuit would be needed.2
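The following is a minimal sketch of the same handshaking principle (assumed names; not the exact netlist of figure A.2): a pulse is stretched into a level, crossed with a 2FF synchronizer, edge-detected in the destination domain, and acknowledged back before a new pulse can be accepted.

module handshake_pulse_sync (
  input  logic clkA, clkB,
  input  logic pulse_a,   // single-cycle pulse in the clkA domain
  output logic busy_a,    // high while a transfer is in flight
  output logic pulse_b    // single-cycle pulse in the clkB domain
);
  logic req_a;                                      // level held until acknowledged
  (* ASYNC_REG = "TRUE" *) logic [1:0] req_sync_b;  // B1, B2
  logic req_b_d;                                    // delayed copy, for edge detection
  (* ASYNC_REG = "TRUE" *) logic [1:0] ack_sync_a;  // A1, A2

  // Source domain: set the request on a pulse, clear it once acknowledged
  always_ff @(posedge clkA)
    if (ack_sync_a[1])  req_a <= 1'b0;
    else if (pulse_a)   req_a <= 1'b1;

  assign busy_a = req_a | ack_sync_a[1];  // new pulses are lost while busy

  // Destination domain: synchronize the level and detect its rising edge
  always_ff @(posedge clkB) begin
    req_sync_b <= {req_sync_b[0], req_a};
    req_b_d    <= req_sync_b[1];
  end
  assign pulse_b = req_sync_b[1] & ~req_b_d;

  // Acknowledge path back to the source domain
  always_ff @(posedge clkA)
    ack_sync_a <= {ack_sync_a[0], req_sync_b[1]};
endmodule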

Despite the handshaking mechanism, in the event of a synchronizer failure this system will still fail; the output of the B1–B2 synchronizer directly feeds the gate that provides the output, so the failure will be propagated. More synchronizer stages can be added to obtain a higher MTBF, with the downside of some added latency.

2Even in that case, if the destination clock domain is slower than the source clock domain there may be a busy signal asserted if the pulses are frequent enough.

A.3 Common Pitfalls

Even when using correct synchronizers for CDCs to resolve metastability, there are some pitfalls that have to be avoided. The most common of these are explained below.

A.3.1 Convergence

In synchronous logic, combinatorial functions are positioned between register stages with the place and route tools ensuring that hold and setup times are respected. That implies that the output of any combinatorial logic is stabilized a certain amount of time before the edge of the clock such that any glitches go unnoticed.

Between asynchronous clocks, however, setup and hold times cannot be upheld. Thus, connecting signals that are driven by combinatorial logic to a synchronizer makes it susceptible to pick up any glitches that might occur. That situation is described as convergence, as paths of different delay converge, and it can lead to functional failures in a design due to the occasional propagation of the glitches. An example schematic diagram of such a structure can be seen on figure A.3 on the following page. The two combinatorial logic blocks, that may have different propagation delays, converge at the inputs of the AND gate before they are synchronized at the clkB clock domain. This situation can be avoided by only crossing registered signals between clock domains.
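As a minimal sketch of that fix (illustrative names), the combinatorial result is first registered in the source clock domain, so the synchronizer never samples a glitching net:

module registered_crossing (
  input  logic clkA, clkB,
  input  logic a, b,
  output logic flag_b
);
  logic flag_a;
  (* ASYNC_REG = "TRUE" *) logic [1:0] flag_sync;

  // Register the combinatorial function in its own (source) domain first
  always_ff @(posedge clkA)
    flag_a <= a & b;

  // Only the registered, glitch-free signal crosses into clkB
  always_ff @(posedge clkB)
    flag_sync <= {flag_sync[0], flag_a};

  assign flag_b = flag_sync[1];
endmodule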

A.3.2 Divergence

Divergence occurs when a signal is sampled by two (or more) different FFs in a way that they can sample different values. There are two ways that can happen in CDCs: (i) divergence in the crossover, and (ii) divergence in the metastable signal.

Figure A.4 on the next page shows an example of crossover divergence. The signal crosses clock domains at two different points. Given the probabilistic nature of metastability resolution, in the sense that the output of the synchronizer can resolve either to the old value or the new one, these two synchronizers can produce different values, causing problems with data correlation that can lead to system failure. To avoid this, only one synchronizer should be used to carry a signal between two clock domains.

Figure A.3: Convergence example schematic

Figure A.4: Crossover divergence example schematic

Figure A.5: Metastable signal divergence example schematic

Figure A.6: Re-convergence example schematic

An example of metastable signal divergence can be seen in figure A.5 on the facing page. The output of the first synchronizer FF is used directly by logic. Since that signal often turns metastable, the logic that uses it will propagate the error downstream in the design, defeating the purpose of the synchronizer. By packing the logic of the latter in a dedicated module, that mistake can easily be avoided.

A.3.3 Re-convergence

When metastability occurs in a CDC the synchronizer gives an output that is resolved, but the precise number of cycles before the new signal value is available in the receiving clock domain cannot be guaranteed. Even without metastability, however, when two or more signals are independently synchronized, they may arrive skewed with respect to one another because of slight differences in their routing delays. If these independently synchronized signals have to be correlated to participate in a logic function at the destination clock domain, this is described as a re-convergence defect. The discrepancy in the timing of their synchronized versions has the potential to introduce functional errors in the design. Thus, if more than one signal has to be correlated to participate in a logic function, the correlation has to either be done in the source clock domain, or a multi-bit synchronization method such as a FIFO has to be used, instead.

Appendix B

The SystemVerilog Design and Verification Language

“We shape our tools and afterwards our tools shape us.” — Marshall McLuhan

B.1 Introduction

The SystemVerilog language [94] was developed in order to provide an evolutionary path from Verilog in a way that would support the complexities of SoC designs. The language extends the Verilog standard to add rich functionality in synthesizable logic design [95] but, most importantly, introduces features that make it a full-fledged Hardware Verification Language (HVL). Since it has been used extensively in the designs described in this work, some of its major features and improvements are outlined in this chapter.

B.2 SystemVerilog Synthesis Features

While SystemVerilog's main features focus on verification-related tasks, it has also been enhanced with features aiding synthesizable logic design [96]. The most important of these features will be discussed below.


always_comb
  if (sel)
    O = B;
  else
    O = A;

Code Listing B.1: 2-in-1 multiplexer in SystemVerilog

always @(sel, A, B)
  if (sel)
    O = B;
  else
    O = A;

Code Listing B.2: 2-in-1 multiplexer in Verilog

The always_ff, always_comb and always_latch Procedural Blocks

In large procedural blocks, it's easy to introduce mistakes that may make the design synthesize differently than intended, or result in synthesis/simulation mismatches. The always_ff, always_comb and always_latch procedural blocks introduced with SystemVerilog help with this by defining the design intent more accurately [97].

When designing combinatorial procedural blocks in Verilog or VHDL, one of the most frequent mistakes is to have an incomplete sensitivity list [98, 99]. While the synthesis tools can correctly infer the sensitivity list from the code within the block, and in fact ignore it completely, the simulation tools consider it. This discrepancy can lead to hard-to-trace synthesis/simulation mismatches and, although synthesis tools warn about it, the problem may only be discovered after many hours of simulation time have been invested in some component.

The always_ff procedure is intended to be used to describe Flip-Flop behavior. It should contain one and only one event control, which models the clock, and optionally the asynchronous reset, of the sequential logic. Additionally, as in the circuit synthesized for an FPGA, a signal can only be assigned a value in a single always_ff block; it cannot be written to by another block.
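A short illustrative example (hypothetical signal names) of an always_ff block with a single event control modelling the clock and an asynchronous, active-low reset:

module counter8 (
  input  logic       clk,
  input  logic       rst_n,
  output logic [7:0] count
);
  // count is assigned in this always_ff block only
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) count <= '0;
    else        count <= count + 1'b1;
endmodule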

The always_comb procedure allows describing combinatorial circuits. The tools perform checks and warn if the behavior of the circuit described does not represent purely combinatorial logic (e.g. infers latches). Similar functionality has been provided by the Verilog-2001 always @* construct; however, if functions are used in the code, any external signals that are referenced in them do not get added to the sensitivity list; the always_comb construct fixes that. An additional detail is that in simulation, always_comb blocks execute at time zero, while plain always blocks only execute when some signal in their (defined or implied) sensitivity list changes value; this brings the behavior closer to the function of the synthesized circuit.

Caveat: another frequent mistake when designing combinatorial logic in a procedural block is to accidentally introduce a latch by missing one or more signal assignments in a conditional statement case. The synthesis tool will issue a warning about this, but in a large project it can be easy to miss, buried among hundreds—or even thousands—of benign warnings, causing problems that are hard to trace. Despite that fact, and having introduced a separate always_latch keyword precisely for the purpose of describing latches, the SystemVerilog standard doesn’t consider latch-generating logic inside an always_comb block an error [100].
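An illustrative example of this pitfall (hypothetical signals): grant is only assigned when sel is true, so a latch is inferred to hold its previous value, even inside an always_comb block.

module latch_pitfall (
  input  logic sel, req,
  output logic grant
);
  always_comb
    if (sel)
      grant = req;   // missing "else grant = 1'b0;" -- a latch is inferred
endmodule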

Finally, always_latch can be used to describe logic containing latches. Similarly to the other constructs, the tools warn if the behavior of the circuit described does not represent latched logic. It has to be mentioned, however, that in FPGAs this is rarely recommended: latches require very careful Static Timing Analysis (STA) to ensure correct operation and (unintentional) latches are a very common cause for hard-to-trace system failures.

unique and priority Keywords

These keywords expand and improve the parallel_case and full_case pragmas used in Verilog designs, providing equivalent functionality that is safer to use. The defining characteristic that makes these keywords safer to use is that they are recognized and utilized by all the tools that use the SystemVerilog code: synthesis, simulation, linting, and formal verification tools. The old pragmas were only recognized by the synthesizer, leading to synthesis-simulation mismatches and, eventually, unexpected behavior in the resulting implementation.

The unique keyword is meant to be used to allow optimizations in case or if…else statements that have one and only one item matching the expression. If no item matches the expression in a case statement, the optimizations inferred by the synthesizer can assign any value to the outputs, and the simulator will throw a warning to inform the designer about the unexpected event. Using the parallel_case pragma would allow the same optimizations, but in the event of multiple case items matching at the same time the simulation would use priority logic, as this is the default in case statements, leading to different behavior than the synthesized design and an uninformed designer.

The priority keyword instructs the tools that the case or if…else statements will infer priority logic, and that all valid cases have been specified. The synthesizer is free to make optimizations without any regard for the case none of the items match. The simulation tools are aware of that, however, and will inform the designer if that situation ever comes up by issuing a warning—again, in contrast to the equivalent full_case Verilog pragma.
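A minimal sketch of the unique keyword in the common one-hot multiplexer idiom (hypothetical signals): the designer asserts that exactly one item matches, so the synthesizer may drop the priority logic, and the simulator warns if no item (or more than one) ever matches.

module onehot_mux (
  input  logic [2:0] sel,          // expected to be one-hot
  input  logic [7:0] a, b, c,
  output logic [7:0] out
);
  always_comb begin
    out = '0;                      // default assignment keeps the logic latch-free
    unique case (1'b1)
      sel[0]: out = a;
      sel[1]: out = b;
      sel[2]: out = c;
    endcase
  end
endmodule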

Interfaces

With the ever-increasing complexity of SoC systems, abstraction-modeling constructs can be useful for development teams. One such construct is that of the interface. Modules are connected together by ports, with every connection between them coded explicitly. This offers the benefit of mapping directly to hardware connections but it can be cumbersome and error-prone, especially as systems get larger.

Instead of connecting modules using a port for every signal, the interface construct helps abstract the actual protocol by being able to encapsulate not only the signals that make up the interface, but also, in many cases, logic, tasks and functions to manipulate data in the interface, and assertions to check its correct behavior.
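A minimal sketch of an interface encapsulating a simple valid/ready stream is given below; the protocol and names are hypothetical, chosen only to illustrate the construct and its modports.

interface stream_if #(parameter int WIDTH = 32) (input logic clk);
  logic [WIDTH-1:0] data;
  logic             valid;
  logic             ready;

  modport master (output data, valid, input  ready);
  modport slave  (input  data, valid, output ready);
endinterface

module producer (stream_if.master m);
  // drives m.data and m.valid, observes m.ready
endmodule

module consumer (stream_if.slave s);
  // observes s.data and s.valid, drives s.ready
endmodule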

New Logic Data Types

SystemVerilog has introduced a number of new data types to aid both design entry and verification. First of all, the Verilog reg and wire types1 are still there, but have been essentially replaced by a single data type, logic.

The two-state bit data type has been introduced. Verilog is based solely on four-state signals (see table B.1b on the next page) and this has an impact on the simulation speed and space. Thus, for signals that don't have any use for additional states other than 0 and 1 (like clocks), the bit type can be used to help the simulator a bit (no pun intended).

Other types resembling those of C have been established on top of the bit type.

1In Verilog, the wire data type is used to model connections and combinatorial logic (it can be used in assign statements and port connections), while the reg data type is used to model sequential or combinatorial logic (it can be used in initial and always blocks). This can cause confusion, especially to newcomers.

logic Q_i;           // Four-state "logic" type
bit clk;             // Two-state "bit" type
byte s8;             // Two-state 8-bit signed type
shortint b16;        // Two-state 16-bit signed type
int s32;             // Two-state 32-bit signed type
longint s64;         // Two-state 64-bit signed type
int unsigned u32;    // Two-state 32-bit unsigned type

Code Listing B.3: Examples of new data types introduced in SystemVerilog

The new two-state types are byte (8-bit), shortint (16-bit), int (32-bit), and longint (64-bit). They all behave as signed types but the unsigned keyword can be used to declare unsigned types. Examples can be seen on code listing B.3.

Arrays and Streaming Operators

In Verilog, one can declare an array using the following syntax:

element_type [vector_dimension] array_name [array_dimension]...[more_array_dimensions];

The array is split in vector (packed) elements, and array (unpacked) elements. The packed elements represent a contiguous memory space (one can consider them as bytes in a 32-bit word), and unpacked elements are separate entities. One difference is that Verilog allows only one packed dimension, while SystemVerilog doesn't impose such a limit. Another difference is that Verilog allows access to single elements of the array only, while SystemVerilog makes it more flexible, allowing read and write access to any slice (any number of contiguous elements of an array dimension) of the unpacked array (or even all of it). Finally, arrays can be passed as I/O ports in modules: this long overdue feature improves code readability by removing the code overhead that was necessary to work around this restriction.

Table B.1: VHDL (a) and Verilog (b) logic values

(a) VHDL IEEE 1164 logic values [101]
Character   Value
’U’         uninitialized
’X’         unknown value
’0’         logic zero
’1’         logic one
’Z’         high impedance
’W’         weak unknown
’L’         weak logic zero
’H’         weak logic one
’-’         don't care

(b) Verilog's 4-state logic values
Character   Value
X           unknown value
0           logic zero
1           logic one
Z           high impedance

Streaming operators, also called pack/unpack operators, can be used to reorder data. They can be used either on the right-hand side of an assignment, behaving as packing operators; or on the left-hand side, as unpacking operators. A sample of their use can be seen below.

assign data_little_endian = {<<8{data_big_endian}}; // reverse the byte order (illustrative completion of the original, truncated example)

The use of the << or >> operators determines the order of the blocks the data is sliced in. The >> operator preserves left-to-right order (no reordering is performed), while the << operator reverses it.

Struct, Union, Typedef, and Enum

To facilitate the construction of more complex systems, the data structures of structs and unions have been brought in. Structs can be used to organize multiple signals as sub-fields of a structure. That reduces the lines of code and improves consistency. Unions provide alternative representations of a number of bits; that way, when a signal can carry different types of data depending on some status, the data of each type can be accessed and handled more intuitively in the code.

To complement the added flexibility in array declarations, structs, and unions, the typedef construct has been introduced. When put together, that added functionality can be very useful to define and manipulate entities such as floating-point or complex numbers:

typedef struct {
  bit       sign;
  bit [4:0] exponent;
  bit [9:0] fraction;
} half_float_t;

typedef struct {
  half_float_t real_part;
  half_float_t imag_part;
} complex_t;

complex_t a, b, c, d;

always_ff @(posedge clk)
  c.real_part.exponent <= a.real_part.exponent + b.real_part.exponent;

always_ff @(posedge clk)
  if (a.real_part.sign == 1'b1)
    d <= a;
  else
    d <= b;

The above example shows how the typedef and nested struct constructs can be used to increase the abstraction level in a design. As a result, the number of signal declarations and assignments can be significantly reduced, leading to cleaner and less error-prone code.

Caveat: Union support is still missing from some synthesis tools, while structs seem to be supported better.

Besides these types, SystemVerilog also adds support for enumerated types. That is especially helpful when designing FSMs. With Verilog, naming FSM states involved declaring the value of each state label as a parameter. That approach is not as well-supported by simulation tools as VHDL FSMs, as the state label is not displayed in a waveform—only the declared value of the state is. It is also error-prone, as nothing would prevent the designer from assigning the same bits to two different states. SystemVerilog enum types enforce unique values assigned to the enumerated labels. An example of an FSM state signal can be seen below:

typedef enum logic [3:0] { IDLE, DO_RESET, WAIT_RESET_DOWN, GO } state_t;

state_t rst_state = IDLE;

B.3 SystemVerilog Verification Features

In the previous section some of the most important features SystemVerilog brings to make describing synthesizable logic easier and less error-prone were highlighted, and it can be argued that they constitute a significant contribution.

Since the bulk of the innovation SystemVerilog brings is focused at verification, providing a comprehensive description of even the most important features is a task out of the scope of this text. However, an attempt can be made at giving a brief overview of some of the most prominent features that were used to verify the designs presented in this thesis.

Assertion-Based Verification

ABV is a technique that aims to shorten the simulation time necessary to comprehensively validate a design and help catch bugs earlier and easier. One way to think about assertions is as active specifications, injected in the RTL code not only to better convey design intent but also to catch mistakes—while output verification can indicate that something has gone wrong, a well-written set of assertions can also indicate where. Assertions either delineate legal behavior or describe instances of illegal behavior, and result in either a pass or a fail. The simulation tools use assertions to flag unexpected behavior or to gather coverage statistics to quantify the quality of the functional verification scenarios selected. Formal verification tools also use assertion description languages to describe properties about a circuit and assumptions about its inputs, with the ultimate goal of mathematically and definitively proving its correctness.

Assertion languages, such as the Property Specification Language (PSL) [102, p. 18-8, 103] and the SVA language [104] integrated in SystemVerilog, provide mechanisms for expressing temporal relationships across Boolean expressions and operators in order to describe complex design behavior. An example of SVA assertions used to help with the validation of the DO component can be seen in subsection 3.3.3.1 on page 78.
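As a simple illustration of the flavor of such assertions (the module, the signals req and ack, and the timing window are hypothetical and not tied to any design in this thesis), a concurrent SVA property can state that every request must be acknowledged within one to three clock cycles:

module req_ack_checker (input logic clk, input logic rst, input logic req, input logic ack);
   property p_req_ack;
      @(posedge clk) disable iff (rst)
        req |-> ##[1:3] ack;   // every req must be followed by an ack within 1 to 3 cycles
   endproperty

   assert_req_ack: assert property (p_req_ack)
     else $error("req was not acknowledged within 3 cycles");

   cover_req_ack: cover property (p_req_ack);   // also gather coverage on the same behavior
endmodule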

Dynamic Arrays, Queues, and Associative Arrays

A dynamic array is a powerful data structure: it is a random access list of objects with a variable size that can be dynamically adjusted at run-time. This is a non-synthesizable construct, due to the immutability of hardware memory structures, and is one of the features borrowed from high-level languages to make a range of verification-related tasks more straightforward.

Considering the flexibility introduced by the constrained-random testing facilities of the language, an array structure that can dynamically adapt to the testing conditions is an indispensable tool. Dynamic arrays can be both created and resized by calling the new[] function, as such:

program dynmem_test();

   int mem[];

   initial begin
      mem = new[10];        // allocate a memory of size 10
      for (int i = 0; i < 10; ++i)
        mem[i] = i;

      $display("Array size is %d", mem.size());      // Will print 10

      mem = new[100](mem);  // dynamically resize the memory, retaining old content
      for (int i = 10; i < 100; ++i)
        mem[i] = 2*i;

      $display("Array size is now %d", mem.size());  // Will print 100

      ...

      mem.delete();         // Clears the array
   end
endprogram

The size() and delete() member functions are also provided to work with dynamic arrays. Their names are self-explanatory: the first function returns the current size of the array, and the second function clears its elements and frees it. Apart from those special functions, dynamic arrays can be used like plain unpacked SystemVerilog arrays, using the indexing, concatenation, and slicing operators.

Another useful structure, the queue, is an ordered array of variable size that can be used to naturally model FIFO and Last-In First-Out memory (LIFO) functionality. Like the dynamic array, it can be manipulated using the plain SystemVerilog array operators, but, in addition, a number of member functions are provided to give it its unique features. The push_front(a) and push_back(a) functions can be used to insert a new element at the front and the back of the queue, respectively. The pop_front() and pop_back() functions are provided to read out and remove the element at the front or the back of the queue. The insert(i, a) and delete(i) functions can be used to insert or delete an element at a specified index, but the first set of functions is the one that is more commonly used.
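A minimal sketch of these member functions in use (the program and the queue q below are illustrative):

program queue_test();
   int q[$];   // unbounded queue of int

   initial begin
      q.push_back(1);                           // q = {1}
      q.push_back(2);                           // q = {1, 2}
      q.push_front(0);                          // q = {0, 1, 2}
      $display("front = %0d", q.pop_front());   // prints 0, q is now {1, 2}
      q.insert(1, 5);                           // q = {1, 5, 2}
      q.delete(0);                              // q = {5, 2}
      $display("size = %0d", q.size());         // prints 2
   end
endprogram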

The associative array is another useful and dynamic high-level data structure that is worthy of mention; it can be used to store sparse data in the form of key-value pairs. An associative array instance is essentially a dynamic lookup table, defined by two data types: the data type of its elements, and the data type of its index.
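For instance (a hypothetical sketch, not part of the testbenches of this thesis), an associative array indexed by a 17-bit road ID can store sparse per-road hit counts without allocating storage for every possible ID:

program assoc_test();
   int hit_count[logic [16:0]];   // element type: int, index type: logic [16:0]

   initial begin
      hit_count[17'd42]     = 3;  // only the keys actually used consume storage
      hit_count[17'd100000] = 1;
      if (hit_count.exists(17'd42))
        $display("road 42 has %0d hits", hit_count[17'd42]);
      $display("%0d roads stored", hit_count.num());   // prints 2
   end
endprogram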

Appendix C

Vivado® Synthesis and Implementation Attributes

“Man must shape his tools lest they shape him.” — Arthur Miller

C.1 Introduction

This appendix contains the most useful attributes that were used in the systems developed for this thesis. These directives guide the Synthesis and Implementation tools and convey design intent in a more definitive way.

C.2 Attributes

ASYNC_REG

This attribute is used to specify that a register is used as a part of a synchronization chain. As mentioned in appendix A.2.1 on page 128, the registers of a 2FF CDC synchronizer have to be placed close together to maximize the slack of the potentially metastable signal. That is the main purpose of setting the ASYNC_REG attribute on a register, but not the only one: synthesis, implementation, and timing simulation models are all affected by it.

The effect this attribute has in synthesis is that the register, or any logic surrounding it, cannot be optimized away. In the placement and routing steps of implementation, the FFs of synchronization chains are placed close together, if possible in the same CLB, so that the route is minimized and the MTBF maximized. In timing simulation, the model used for the register is such that when a timing violation occurs, the register returns the last known value instead of an X. That X would otherwise be propagated through the design, causing unexpected behavior that does not correspond to the real behavior, since the metastability is normally resolved by the synchronizer chain. Due to the impact this attribute has on the MTBF, failure to assign it to synchronization chains will trigger a Design Rule Check (DRC) warning about the synchronizer quality.
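A minimal sketch of how the attribute might be attached to the registers of a 2FF synchronizer in RTL (the module and signal names are illustrative):

module sync_2ff_example (input logic clk, input logic async_in, output logic sync_out);
   (* async_reg = "true" *) logic sync_ff1 = 1'b0;
   (* async_reg = "true" *) logic sync_ff2 = 1'b0;

   // async_in originates in another clock domain; only sync_ff2 is used downstream
   always_ff @(posedge clk) begin
      sync_ff1 <= async_in;
      sync_ff2 <= sync_ff1;
   end
   assign sync_out = sync_ff2;
endmodule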

MAX_FANOUT

In FPGA designs, signal fanouts of many hundreds, or even thousands, are not uncommon. In high-speed designs, though, such nets would fail to support the required operating frequency. Thus, the tools can be directed to replicate the drivers of such high-fanout nets, so that they are not a limiting factor in the performance of the design; the MAX_FANOUT attribute sets the maximum number of inputs a net can reach before its driver is replicated by the synthesizer. It has to be taken into account that, by selecting a low value, the drivers of very high-fanout signals will be replicated many times; that just transfers the problem to the previous register level. High-fanout signals in very high-speed designs (for the purposes of this example, above roughly 400 MHz) need to be pipelined appropriately to allow the driver replication to be propagated backwards.
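As an illustration (the module name, signal names, and the threshold value are arbitrary), the attribute can be attached directly to a register declaration:

module ce_fanout_example (input logic clk, input logic ce, output logic ce_out);
   (* max_fanout = 50 *) logic ce_r;   // replicate the driver once the net exceeds 50 loads

   always_ff @(posedge clk)
     ce_r <= ce;

   assign ce_out = ce_r;
endmodule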

Caveat: Signals that go inside IP cores included as netlists do not count towards the MAX_FANOUT attribute setting. Fanout optimization for these cases is handled in the physical optimizations step of the implementation run.

USE_DSP

This attribute can be used to control the inference of DSP slices for arithmetic operations. By default, DSP slices are only used to carry out operations that involve multiplications, such as multiply-add, multiply-subtract, and multiply-accumulate, while plain adders, subtractors, and accumulators are by default implemented in fabric logic.

However, in some cases, using DSP slices for simple arithmetic is desirable. Setting the value of this attribute to “yes” at the signal or hierarchy level enables this behavior. Furthermore, setting this attribute to “logic” directs the synthesizer to implement wide XOR functions (wider than 35 bits) by inferring DSP slices.
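As a sketch (the module and signal names are illustrative), an accumulator that would otherwise be implemented in fabric carry chains can be pushed into a DSP slice:

module acc_dsp_example (input logic clk, input logic [15:0] din, output logic [47:0] acc_out);
   (* use_dsp = "yes" *) logic [47:0] acc = '0;   // force the adder into a DSP slice

   always_ff @(posedge clk)
     acc <= acc + din;

   assign acc_out = acc;
endmodule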

RAM_STYLE

Synthesis can recognize coding patterns that describe memories. Taking into consideration factors such as memory size, resource utilization, and speed, the synthesis tool uses heuristics to try to infer the optimal type of resources to implement the memory. However, there are times when the user would prefer a different kind of resource to be used for a particular memory. Applying this attribute to the array that defines the memory allows directing the synthesis tool in its choice of resources between BRAM, distributed RAM (implemented using special LUTs), and registers.
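A sketch of forcing a small buffer into distributed RAM instead of leaving the choice to the tool heuristics (the module and signal names are illustrative):

module dist_ram_example
  (input logic clk, input logic we,
   input logic [5:0] waddr, input logic [5:0] raddr,
   input logic [31:0] wdata, output logic [31:0] rdata);

   (* ram_style = "distributed" *) logic [31:0] buffer_mem [0:63];

   always_ff @(posedge clk) begin
      if (we)
        buffer_mem[waddr] <= wdata;   // synchronous write
      rdata <= buffer_mem[raddr];     // registered read
   end
endmodule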

Appendix D

Code Listings

“Talk is cheap. Show me the code.” — Linus Torvalds

D.1 Bit Indexing ROM Initialization Using Python

The following Python program computes the right output for all possible 8-bit inputs, as this is described in subsection 3.3.2.4 on page 73; based on that, it generates the Verilog HDL code that constructs the LUTs, by directly instantiating Kintex-7 and Kintex Ultrascale family primitives.

def listtohex(a):
    lj = 0
    lres = 0
    for li in a:
        if (li != 0):
            lres += 2**lj
        lj += 1
    return lres

def getzerobit(a, ib):
    zi = 0
    while ((zi < 8) and (a & 2**zi)):
        zi += 1
    if (zi < 8):
        return zi
    else:
        return ib

res = [[0 for y in range(8)] for x in range(256)]
for x in range(256):
    num = x
    for i in range(8):
        j = i
        while ((j < 8) and not (num & 2**j)):
            j += 1
        if (j < 8):
            num = (num ^ 2**j) | 2**i
            res[x][i] = j
        else:
            if (x & 2**i):
                res[x][i] = getzerobit(x, i)
            else:
                res[x][i] = i

config = [[[0 for z in range(256)] for y in range(3)] for x in range(8)]
for bitno in range(8):
    for i in range(256):
        config[bitno][0][i] = 1 if (res[i][bitno] & 0b001) else 0
        config[bitno][1][i] = 1 if (res[i][bitno] & 0b010) else 0
        config[bitno][2][i] = 1 if (res[i][bitno] & 0b100) else 0
#for bitno in range(8):
#    for j in range(3):
#        config[bitno][j].reverse()

config_str = [[['x' for z in range(64)] for y in range(3)] for x in range(8)]
for bitno in range(8):
    for confbit in range(3):
        for hexno in range(64):
            #print "bitno", bitno, "confbit", confbit, "hexno", hexno
            config_str[bitno][confbit][hexno] = \
                "{:01x}".format(listtohex(config[bitno][confbit][hexno*4:hexno*4+4]))

for bitno in range(8):
    for confbit in range(3):
        config_str[bitno][confbit].reverse()

print res[0xfc]
print

ultrascale = raw_input("For ultrascale output enter 1, for Kintex-7 enter 0: ")

if (ultrascale == '1'):
    print "Generating output for Ultrascale device"
    print
    for bitno in range(8):
        for confbit in range(3):
            if (config_str[bitno][confbit] == ['f' for x in range(64)]):
                print "".join(["assign reorder_i[", str(bitno), "][",
                               str(confbit), "] = 1'b1;"])
            else:
                print "".join(["RAM256X1S #(.INIT(256'h",
                               "".join(config_str[bitno][confbit]),
                               "), .IS_WCLK_INVERTED(1'b0))"])
                print "".join(["RAM256X1S_", str(bitno), str(confbit),
                               " (.O(reorder_i[", str(bitno), "][", str(confbit), "]), "]),
                print ".A(nonzeros8_inter), .WE(1'b0), .WCLK(clk), .D(1'b0));"
                print
                print
else:
    print "Generating output for Kintex-7 device"
    print
    for bitno in range(8):
        for confbit in range(3):
            if (config_str[bitno][confbit] == ['f' for x in range(64)]):
                print "".join(["assign reorder_i[", str(bitno), "][",
                               str(confbit), "] = 1'b1;"])
            else:
                print "".join(["ROM256X1 #(.INIT(256'h",
                               "".join(config_str[bitno][confbit]), "))"])
                print "".join(["ROM256X1_", str(bitno), str(confbit),
                               " (.O(reorder_i[", str(bitno), "][", str(confbit), "]), "]),
                for i in range(7):
                    print "".join([".A", str(i), "(nonzeros8_inter[", str(i), "]), "]),
                print ".A7(nonzeros8_inter[7]));"
                print
                print

Code Listing D.1: Python program that generates HDL code for ROM initialization

D.2 Kintex-7 Optimal Multiplexer

The following SystemVerilog code implements optimal multiplexer structures for Xilinx 7-series devices, according to the Application Note in [70]. The goal was to implement a very high-performance 2k:1 multiplexer on Kintex-7 FPGAs for the Data Organizer, as described in subsection 3.3.2.2 on page 70.

module mux8_fast ( input logic [7:0] A, input logic [2:0] sel, 5 output logic O );

logic LUTouts[1:0];

10 generate for (genvar i=0; i<2; i++) begin: lutgen LUT6 #( .INIT(64'hFF00F0F0CCCCAAAA) // Specify LUT Contents ) LUT6_inst 15 ( .O(LUTouts[i]), // LUT general output .I0(A[4*i+0]), // LUT input .I1(A[4*i+1]), // LUT input .I2(A[4*i+2]), // LUT input 20 .I3(A[4*i+3]), // LUT input .I4(sel[0]), // LUT input .I5(sel[1]) // LUT input ); end 25 endgenerate

MUXF7 MUXF7_inst ( .O(O), // Output of MUX to general routing 30 .I0(LUTouts[0]), // Input (tie to LUT6 O6 pin) .I1(LUTouts[1]), // Input (tie to LUT6 O6 pin) .S(sel[2]) // Input select to MUX ); endmodule // mux8_fast 35

module mux16_fast ( input logic [15:0] A, 40 input logic [3:0] sel, output logic O );

logic LUTouts[3:0]; 45 logic F7outs[1:0];

generate for (genvar i=0; i<4; i++) begin: lutgen LUT6 #( 50 .INIT(64'hFF00F0F0CCCCAAAA) // Specify LUT Contents ) LUT6_inst ( .O(LUTouts[i]), // LUT general output .I0(A[4*i+0]), // LUT input


.I1(A[4*i+1]),  // LUT input
.I2(A[4*i+2]),  // LUT input
.I3(A[4*i+3]),  // LUT input
.I4(sel[0]),    // LUT input
.I5(sel[1])     // LUT input
);
end
for (genvar i=0; i<2; i++) begin: F7gen
MUXF7 MUXF7_inst
(
.O(F7outs[i]),        // Output of MUX to general routing
.I0(LUTouts[2*i+0]),  // Input (tie to LUT6 O6 pin)
.I1(LUTouts[2*i+1]),  // Input (tie to LUT6 O6 pin)
.S(sel[2])            // Input select to MUX
);
end

endgenerate MUXF8 MUXF8_inst ( 75 .O(O), // Output of MUX to general routing .I0(F7outs[0]), // Input (tie to MUXF7 L/LO out) .I1(F7outs[1]), // Input (tie to MUXF7 L/LO out) .S(sel[3]) // Input select to MUX ); 80 endmodule // mux16_fast

module mux64_fast ( 85 input logic [63:0] A, input logic [5:0] sel, output logic O );

90 logic LUTouts[15:0]; logic F7outs[7:0]; logic F8outs[3:0];

generate 95 for (genvar i=0; i<16; i++) begin: lutgen LUT6 #( .INIT(64'hFF00F0F0CCCCAAAA) // Specify LUT Contents ) LUT6_inst ( 100 .O(LUTouts[i]), // LUT general output .I0(A[4*i+0]), // LUT input .I1(A[4*i+1]), // LUT input .I2(A[4*i+2]), // LUT input .I3(A[4*i+3]), // LUT input 105 .I4(sel[0]), // LUT input .I5(sel[1]) // LUT input ); end


for (genvar i=0; i<8; i++) begin: F7gen
MUXF7 MUXF7_inst
(
.O(F7outs[i]),        // Output of MUX to general routing
.I0(LUTouts[2*i+0]),  // Input (tie to LUT6 O6 pin)
.I1(LUTouts[2*i+1]),  // Input (tie to LUT6 O6 pin)
.S(sel[2])            // Input select to MUX
);
end
for (genvar i=0; i<4; i++) begin: F8gen
MUXF8 MUXF8_inst
(
.O(F8outs[i]),        // Output of MUX to general routing
.I0(F7outs[2*i+0]),   // Input (tie to MUXF7 L/LO out)
.I1(F7outs[2*i+1]),   // Input (tie to MUXF7 L/LO out)
.S(sel[3])            // Input select to MUX
);
end
endgenerate
LUT6 #(
.INIT(64'hFF00F0F0CCCCAAAA)  // Specify LUT Contents
) LUT6_inst2
(
.O(O),           // LUT general output
.I0(F8outs[0]),  // LUT input
.I1(F8outs[1]),  // LUT input
.I2(F8outs[2]),  // LUT input
.I3(F8outs[3]),  // LUT input
.I4(sel[4]),     // LUT input
.I5(sel[5])      // LUT input
);
endmodule // mux64_fast

module mux256_fast ( 145 input logic clk, input logic [255:0] A, input logic [7:0] sel, output logic O ); 150 logic [15:0] mux16o; logic [15:0] mux16r;

generate 155 for (genvar i=0; i<16; i++) begin mux16_fast mux16_inst ( .O(mux16o[i]), .A(A[16*(i+1)-1:16*i]), 160 .sel(sel[3:0]) ); end


endgenerate

165 always_ff @(posedge clk) mux16r <= mux16o;

logic [7:0] sel_r; always_ff @(posedge clk) 170 sel_r <= sel;

mux16_fast mux16_inst2 ( .O(O), 175 .A(mux16r), .sel(sel_r[7:4]) ); endmodule // mux256_fast

180 module mux2k_fast ( input logic clk, input logic [2047:0] A, 185 input logic [10:0] sel, output logic O );

logic [7:0] mux256o; 190 logic [7:0] mux256o_r;

generate for (genvar i=0; i<8; i++) begin mux256_fast mux256_inst 195 ( .clk(clk), .O(mux256o[i]), .A(A[256*(i+1)-1:256*i]), .sel(sel[7:0]) 200 ); end endgenerate

always_ff @(posedge clk) 205 mux256o_r <= mux256o;

logic [2:0] sel_high_r; logic [2:0] sel_high_r2; always_ff @(posedge clk) begin 210 sel_high_r <= sel[10:8]; sel_high_r2 <= sel_high_r; end

mux8_fast mux8_inst 215 ( .O(O),


.A(mux256o_r), .sel(sel_high_r2) ); 220 endmodule // mux2k_fast

Code Listing D.2: Optimal 2k:1 multiplexer for Kintex-7 devices

D.3 Parametric Delay Line

The following SystemVerilog module implements a delay line of parametric width, depth and maximum output fanout. It also features switchable shift register extraction, a feature intended to be used to connect synchronous modules that have to be placed far from each other on the device.

module pipe_custom #( parameter WIDTH = 8, parameter DEPTH = 2, 5 parameter RONLY = 0, parameter FANOUT = 512 ) ( input logic clk, 10 input logic [WIDTH-1:0] p_in, (* max_fanout = FANOUT *) output logic [WIDTH-1:0] p_out = '0 );

generate 15 if (DEPTH>2) begin: pipe_deep if (RONLY==0) begin: pipe_deep_sr logic [WIDTH-1:0] pipe[DEPTH-2:0] = '{(DEPTH-1){'0}}; always_ff @(posedge clk) begin pipe[0] <= p_in; 20 pipe[DEPTH-2:1] <= pipe[DEPTH-3:0]; p_out <= pipe[DEPTH-2]; end end else begin: pipe_deep_nosr (* shreg_extract = "no" *) 25 logic [WIDTH-1:0] pipe[DEPTH-2:0] = '{(DEPTH-1){'0}}; always_ff @(posedge clk) begin pipe[0] <= p_in; pipe[DEPTH-2:1] <= pipe[DEPTH-3:0]; p_out <= pipe[DEPTH-2]; 30 end end end else if (DEPTH==2) begin: gen_2regs


logic [WIDTH-1:0] pipe = '0; always_ff @(posedge clk) begin 35 pipe <= p_in; p_out <= pipe; end end else if (DEPTH==1) begin: gen_1reg always_ff @(posedge clk) 40 p_out <= p_in; end endgenerate

endmodule // pipe_custom

Code Listing D.3: Parametric delay line

D.4 Reset Synchronizer

The following module implements an asynchronous reset synchronizer. The reset signal is triggered asynchronously, but is deasserted synchronously to the target clock to provide: (i) a safe transition from the reset state (avoiding metastability); and (ii) a synchronous transition from the reset state for all FFs (e.g. all FFs of a state machine have to reach their post-reset state in the same clock cycle, otherwise an illegal state might occur).

module reset_sync #(parameter INIT = 2'b11, parameter STAGES = 5 ) ( input logic clk, input logic reset_in, output logic reset_out ); logic [STAGES:0] reset_stage;

(* shreg_extract = "no", async_reg = "true" *) begin: gen_ffs0 15 FDP #(.INIT(INIT[0])) reset_sync0 (.C(clk), .PRE(reset_in), .D(1'b0), 20 .Q(reset_stage[0]) ); end


generate 25 for (genvar i = 0; i < STAGES; i++) begin: gen_ffsN (* shreg_extract = "no", async_reg = "true" *) FDP #(.INIT(INIT[0])) reset_sync (.C(clk), 30 .PRE(reset_in), .D(reset_stage[i]), .Q(reset_stage[i + 1]) ); end 35 endgenerate

assign reset_out = reset_stage[STAGES];

endmodule // reset_sync

Code Listing D.4: Reset synchronizer with parametric initialization state and number of clock cycles to keep reset up

D.5 Pulse Detection Synchronizer

The following code extends the input signal over a parametrizable number of cycles to match the output clock. Optionally, a handshake protocol is implemented for the synchronization instead, offering guaranteed pulse delivery to the destination clock domain, irrespective of the frequency relationship, at the expense of some extra latency.

module pulse_sync
  #(parameter HS = 1,     // Make handshake synchronizer (makes N_ext
                          // irrelevant)
    parameter N_ext = 2   // N_ext defines how much to extend the source signal
                          // in the source clock domain, roughly it should
                          // be at least ceil(1 + (f_in / f_out))
    )
   (
    input logic  clki,
    input logic  d,
    input logic  clk,
    input logic  r,
    output logic q = 1'b0
    );

   logic d_ext = 1'b0;


generate if (HS == 1) begin: handshake_version 20 logic handshake = 1'b0; logic handshake_clki; logic waiting_for_input_to_go_down = 1'b0;

always_ff @(posedge clki) begin 25 if (~d_ext) d_ext <= d; else if (handshake_clki) d_ext <= 0; end 30 always_ff @(posedge clki) begin if (~waiting_for_input_to_go_down) waiting_for_input_to_go_down <= d; else if (~d) 35 waiting_for_input_to_go_down <= 0; end

logic syncin = 1'b0; always_ff @(posedge clki) begin 40 syncin <= d_ext | waiting_for_input_to_go_down; end

logic syncsig; sync_tworeg cross_domains 45 (.clk(clk), .p_in(syncin), .p_out(syncsig));

sync_tworeg cross_handshake (.clk(clki), .p_in(handshake), 50 .p_out(handshake_clki));

logic syncsig_r = 1'b0; always_ff @(posedge clk) syncsig_r <= syncsig; 55 always_ff @(posedge clk) begin if (~handshake & syncsig) handshake <= 1; else if (handshake & ~syncsig) 60 handshake <= 0; end

always_ff @(posedge clk) begin if (r) begin 65 q <= 1'b0; end else if (syncsig & ~syncsig_r) begin q <= 1'b1; end else begin q <= 1'b0; 70 end end


end endgenerate

75 generate if (HS == 0) begin: no_handshake_version if (N_ext == 0) begin: no_ext always_ff @(posedge clki) begin d_ext <= d; 80 end end if (N_ext == 1) begin: ext_single logic d_dr; always_ff @(posedge clki) begin 85 d_dr <= d; d_ext <= d | d_dr; end end if (N_ext > 1) begin: ext_delayline 90 logic [N_ext-1:0] d_dl; always_ff @(posedge clki) begin d_dl[0] <= d; d_dl[N_ext-1:1] <= d_dl[N_ext-2:0]; d_ext <= d | (|d_dl); 95 end end

logic syncsig; sync_tworeg cross_domains 100 (.clk(clk), .p_in(d_ext), .p_out(syncsig));

logic syncsig_r; always_ff @(posedge clk) 105 syncsig_r <= syncsig;

always_ff @(posedge clk) begin if (r) begin q <= 1'b0; 110 end else if (syncsig & ~syncsig_r) begin q <= 1'b1; end else begin q <= 1'b0; end 115 end end endgenerate

endmodule // pulse_sync

Code Listing D.5: Pulse detection and synchronization implementation

D.6 Combiner Module Core

The following SystemVerilog code implements the Combiner module combinatorics generation core, as described in subsection 3.4.1 on page 85. The number of layers and, through the address width of the hit-storing memories, the maximum number of hits allowed in each layer in a road, are parametric.

module comb_gencomb #( parameter addr_width = 6, parameter layers = 8 5 ) ( input logic clk, input logic rst, input logic en, 10 input logic [addr_width-1:0] last_addresses[layers-1:0], output logic [addr_width-1:0] addresses[layers-1:0], output logic done, output logic start ); 15 logic [addr_width-1:0] addr[layers-1:0] = '{8{'0}};

logic [layers:0] layer_en; logic [layers:0] layer_start; 20

assign layer_en[0] = en; generate for (genvar i=1; i<=layers; i++) begin: lay_en_gen always_comb layer_en[i] = ( layer_en[i-1] && (addr[i-1] == last_addresses[i-1]) ); end endgenerate

30 assign done = layer_en[layers];

assign layer_start[0] = en; generate for (genvar i=1; i<=layers; i++) begin: lay_start_gen 35 always_comb layer_start[i] = ( layer_start[i-1] && (addr[i-1] == 0) ); end endgenerate

40 assign start = layer_start[layers];


generate for (genvar i=0; i

55 always_ff @(posedge clk) addresses <= addr;

endmodule // comb_gencomb

Code Listing D.6: Combiner module core in SystemVerilog

D.7 Data Organizer UVM testbench

The following SystemVerilog code constitutes the first version of the UVM DO testbench. This is the simplest UVM testbench that has been designed, and only tests the one-layer version (the core) of the component.

`timescale 1ns / 1ps

module top; import uvm_pkg::*; 5 `include "uvm_macros.svh" import dcm_pkg::*;

dcm_if #(.clk_p(2.5)) bfm();

10 do_core_mems dut ( .clk(bfm.clk), .rst(bfm.rst), .new_event(bfm.new_event), 15 .state_write(bfm.state_write), .state_read(bfm.state_read), .ssid(bfm.ssid), .hit_data(bfm.hit_data), .valid_in(bfm.valid_in),


20 .read_valid_in(bfm.read_valid_in), .read_end(bfm.read_end), .read_ssid(bfm.read_ssid), .read_roadID(bfm.read_roadID), .read_DCbits(bfm.read_DCbits), 25 .ready_read_more(bfm.ready_read_more), .finished_reading(bfm.finished_reading), .read_more_roads1(bfm.read_more_roads1), .read_more_roads2(bfm.read_more_roads2), .dout1_valid(bfm.dout1_valid), 30 .dout1(bfm.dout1), .dout1_roadID(bfm.dout1_roadID), .dout1_ends_road(bfm.dout1_ends_road), .dout2_valid(bfm.dout2_valid), .dout2(bfm.dout2), 35 .dout2_roadID(bfm.dout2_roadID), .dout2_ends_road(bfm.dout2_ends_road) );

initial begin 40 uvm_config_db #(virtual dcm_if.tb)::set(null, "*", "bfm", bfm.tb); run_test(); end

endmodule // top

Code Listing D.7: DO UVM testbench top-level source

package dcm_pkg; import uvm_pkg::*; `include "uvm_macros.svh"

5 // virtual dcm_if bfm_g;

class a_road; logic [16:0] roadid; logic [15:0] ssid; 10 logic [1:0] DCvalue; logic [31:0] hit_list[$]; endclass // a_road

class rand_helper #(int max_ssid = 64, 15 int max_roadid = 256); randc logic [16:0] roadid; rand logic [15:0] ssid; rand logic [1:0] DCvalue;

20 constraint road_up_limit { roadid < max_roadid; } constraint ssid_up_limit { ssid < max_ssid;


25 } constraint DC_dist { DCvalue dist { 0 := 65, 1 := 20, 30 2 := 10, 3 := 5 }; }

35 endclass // rand_helper

typedef a_road roads_q_t[$];

`include "../scoreboard.svh" 40 `include "../random_tester.svh" `include "../random_test.svh"

endpackage // dcm_pkg

Code Listing D.8: DO UVM testbench package source

class random_test extends uvm_test;

`uvm_component_utils(random_test)

5 virtual dcm_if.tb bfm;

random_tester random_tester_h; scoreboard scoreboard_h;

10 function new(string name, uvm_component parent); super.new(name, parent); if (!uvm_config_db #(virtual dcm_if.tb)::get(null, "*", "bfm", bfm)) $fatal("failed to get BFM"); endfunction // new 15 function void build_phase(uvm_phase phase); random_tester_h = new("random_tester_h", this); scoreboard_h = new("scoreboard_h", this); endfunction // build_phase 20 function void connect_phase(uvm_phase phase); random_tester_h.roads_ap.connect(scoreboard_h.analysis_export); endfunction // connect_phase

25 task run_phase(uvm_phase phase); phase.raise_objection(this);

for (int i=0; i<100; i++) fork 30 random_tester_h.execute();


scoreboard_h.execute(); join

phase.drop_objection(this); 35 endtask // run_phase

endclass // random_test

Code Listing D.9: DO UVM testbench test source

`timescale 1ns / 1ps import dcm_pkg::roads_q_t;

interface dcm_if #(parameter clk_p = 2.5) (); 5 logic clk, rst;

logic new_event; logic state_write, state_read; logic [15:0] ssid; 10 logic [31:0] hit_data; logic valid_in;

logic read_valid_in; logic read_end; 15 logic [15:0] read_ssid; logic [16:0] read_roadID; logic [1:0] read_DCbits;

logic ready_read_more; 20 logic finished_reading;

logic read_more_roads1; logic read_more_roads2;

25 logic dout1_valid; logic [31:0] dout1; logic [16:0] dout1_roadID; logic dout1_ends_road;

30 logic dout2_valid; logic [31:0] dout2; logic [16:0] dout2_roadID; logic dout2_ends_road;

35 modport dut ( input clk, rst, new_event, state_write, state_read, ssid, hit_data, valid_in, read_valid_in, read_end, read_ssid, read_roadID, read_DCbits, read_more_roads1, read_more_roads2, 40 output ready_read_more, finished_reading, dout1, dout2, dout1_roadID, dout2_roadID, dout1_valid, dout2_valid, dout1_ends_road, dout2_ends_road


);

45 default clocking cb @(posedge clk); input ready_read_more, finished_reading, dout1, dout2, dout1_roadID, dout2_roadID, dout1_valid, dout2_valid, dout1_ends_road, dout2_ends_road; output clk, new_event, state_write, state_read, ssid, 50 hit_data, valid_in, read_valid_in, read_end, read_ssid, read_roadID, read_DCbits, read_more_roads1, read_more_roads2; endclocking // cb

modport tb 55 ( clocking cb, input ready_read_more, finished_reading, dout1, dout2, dout1_roadID, dout2_roadID, dout1_valid, dout2_valid, output clk, new_event, state_write, state_read, ssid, 60 hit_data, valid_in, read_valid_in, read_end, read_ssid, read_roadID, read_DCbits, read_more_roads1, read_more_roads2, import task reset(), task set_new_event(), task write_data(logic [15:0] ssid_l[], logic [31:0] hit_data_l[]), 65 task read_data(roads_q_t roads), is_reading );

initial begin 70 clk = 0; forever begin #(clk_p/2); clk = !clk; end 75 end

logic ready_read_more_r;

task reset(); 80 rst <= 0; ##1; rst <= 1; new_event <= 0; state_write <= 0; 85 state_read <= 0; ssid <= '0; hit_data <= '0; valid_in <= 0; read_valid_in <= 0; 90 read_end <= 0; read_ssid <= '0; read_roadID <= '0; read_DCbits <= '0; ready_read_more_r <= 0; 95 read_more_roads1 <= 1; read_more_roads2 <= 1;


##50; rst <= 0; ##1; 100 endtask // reset

task set_new_event(); ##1; new_event <= 1; 105 ##1; new_event <= 0; ##1; endtask // new_event

110 task write_data(logic [15:0] ssid_l[], logic [31:0] hit_data_l[]); assert(ssid_l.size() == hit_data_l.size()) else $fatal("ssid and hit_data lists have different lengths");

##1; 115 state_write <= 1; ##1;

for (int i=0; i

task register_rrm(); while (1) begin @(posedge clk); 145 ready_read_more_r <= ready_read_more; end endtask // register_rrm

task read_data(roads_q_t roads); 150 fork


register_rrm(); join_none

state_read <= 1; 155 for (int i = 0; i < roads.size(); i++) begin ##1; while (!ready_read_more_r) begin read_valid_in <= 0; read_ssid <= '0; 160 read_DCbits <= '0; read_roadID <= '0; ##1; end read_valid_in <= 1; 165 read_ssid <= roads[i].ssid; read_DCbits <= roads[i].DCvalue; read_roadID <= roads[i].roadid; read_end <= (i==(roads.size()-1))?1:0; end // for (int i = 0; i < roadid_l.size(); i++) 170 ##1; read_valid_in <= 0; read_ssid <= '0; read_DCbits <= '0; read_roadID <= '0; 175 read_end <= 0;

while (!cb.finished_reading) ##1; state_read <= 0; 180 ##1;

disable fork; endtask // read_data

185 function logic is_reading(); return !finished_reading; endfunction // is_reading

endinterface // dcm_if

Code Listing D.10: DO UVM testbench interface source

class random_tester extends uvm_component;

`uvm_component_utils(random_tester);

5 uvm_analysis_port #(roads_q_t) roads_ap;

virtual dcm_if.tb bfm;

function new(string name, uvm_component parent); 10 super.new(name, parent);


endfunction // new

function void build_phase(uvm_phase phase); if (!uvm_config_db #(virtual dcm_if.tb)::get(null, "*", "bfm", bfm)) 15 $fatal("Failed to get BFM");

roads_ap = new("roads_ap", this); endfunction // build_phase

20 typedef int q_of_int[$]; function q_of_int getrange(int num, logic [1:0] dc); if (dc==2'b00) begin return {num}; end else if (dc==2'b01) begin 25 return {2*(num/2), 2*(num/2)+1}; end else if (dc==2'b10) begin return {4*(num/4), 4*(num/4)+1, 4*(num/4)+2, 4*(num/4)+3}; end else if (dc==2'b11) begin return {8*(num/8), 8*(num/8)+1, 8*(num/8)+2, 8*(num/8)+3, 30 8*(num/8)+4, 8*(num/8)+5, 8*(num/8)+6, 8*(num/8)+7}; end endfunction // getrange

parameter max_ssid = 1024; 35 parameter max_roadid = 256; parameter hits_per_ss = 8; parameter no_hits = 1000; parameter no_roads = 128;

40 logic [15:0] ssid_l[no_hits]; logic [31:0] hit_data_l[no_hits];

logic [16:0] road_id_l[no_roads]; logic [15:0] base_ss_l[no_roads]; 45 logic [1:0] dcbits_l[no_roads]; roads_q_t matched_roads;

function void make_hits_and_roads(); automatic rand_helper #(.max_ssid(max_ssid), .max_roadid(max_roadid)) rand_helper_h; automatic a_road roads[no_roads]; automatic int rand_hit; automatic int ind[$];

matched_roads.delete(); rand_helper_h = rand_helper #(.max_ssid(max_ssid), .max_roadid(max_roadid))::new();

foreach (roads[i]) begin roads[i] = new(); 60 assert(rand_helper_h.randomize()) else $fatal("fatal error generating unique Road IDs"); roads[i].roadid = rand_helper_h.roadid;


roads[i].ssid = rand_helper_h.ssid; roads[i].DCvalue = rand_helper_h.DCvalue; 65 end

foreach (ssid_l[i]) begin rand_hit = $urandom_range(hits_per_ss * max_ssid, hits_per_ss); hit_data_l[i] = rand_hit; 70 ssid_l[i] = rand_hit/hits_per_ss; end

foreach (roads[i]) begin ind = ssid_l.find_index with (item inside {getrange(roads[i].ssid, roads[i].DCvalue)}); if (ind.size()) begin foreach (ind[j]) begin roads[i].hit_list.push_back(hit_data_l[ind[j]]); end matched_roads.push_back(roads[i]); end

road_id_l[i] = roads[i].roadid; base_ss_l[i] = roads[i].ssid; 85 dcbits_l[i] = roads[i].DCvalue; end // foreach (roads[i]) roads_ap.write(matched_roads);

endfunction // make_hits 90 static int times = 0; virtual task execute(); if (!times) bfm.reset(); 95 times++;

bfm.set_new_event(); make_hits_and_roads();

100 bfm.write_data(ssid_l, hit_data_l); bfm.read_data(matched_roads);

while (bfm.is_reading()) @bfm.cb; 105 endtask // execute

endclass // random_tester

Code Listing D.11: DO UVM testbench driver source

class scoreboard extends uvm_subscriber #(roads_q_t);

`uvm_component_utils(scoreboard)

5 virtual dcm_if.tb bfm;

local roads_q_t roads;

10 local int roadIDs_invalid[$]; local time roads_time_invalid[$];

local int hits_invalid[$]; local time hits_time_invalid[$]; 15 local int events_with_roads_missed[$]; local roads_q_t roads_missed[$];

local int ends_road_invalid[$]; 20 local time ends_road_time_invalid[$];

local int event_no;

function new(string name="", uvm_component parent); 25 super.new(name, parent); endfunction // new

function void build_phase(uvm_phase phase); if (!uvm_config_db #(virtual dcm_if.tb)::get(null, "*", "bfm", bfm)) 30 $fatal("Failed to get BFM"); endfunction // build_phase

function void write(roads_q_t t); roads = t; 35 endfunction // write

local function int find_road_by_id(logic [16:0] id); foreach (roads[i]) begin if (roads[i].roadid == id) begin 40 return i; end end return (-1); endfunction // find_road_by_id 45 local function int find_hit(int road_index, logic [31:0] hit); foreach (roads[road_index].hit_list[i]) begin if (roads[road_index].hit_list[i] == hit) return i; 50 end return (-1); endfunction // find_hit


֓← local function void remove_hits_and_roads(int road_id, int hit, int ends_road); 55 int i, j; i = find_road_by_id(road_id); if (i == -1) begin // Store invalid road data to show on report phase $warning("Invalid roadID received"); roadIDs_invalid.push_back(road_id); 60 roads_time_invalid.push_back($time); return; end j = find_hit(i, hit); if (j == -1) begin 65 $warning("Invalid hit received"); hits_invalid.push_back(hit); hits_time_invalid.push_back($time); return; end 70 roads[i].hit_list.delete(j); if (ends_road && (roads[i].hit_list.size())!=0) begin $warning("Hit doesn't end road"); ends_road_invalid.push_back(hit); ends_road_time_invalid.push_back($time); 75 end if (roads[i].hit_list.size() == 0) begin roads.delete(i); end endfunction // remove_hits_and_roads 80 function void report_phase(uvm_phase phase); $display("\n Scoreboard report:\n");

if (roadIDs_invalid.size()) 85 $display("Invalid Road IDs:"); else $display("No invalid Road IDs received!"); foreach (roadIDs_invalid[i]) $display("%d - %d", roads_time_invalid[i], roadIDs_invalid[i]); 90 if (hits_invalid.size()) $display("Invalid hits:"); else $display("No invalid hits received!"); 95 foreach (hits_invalid[i]) $display("%d - %d", hits_time_invalid[i], hits_invalid[i]);

if (ends_road_invalid.size()) $display("Hits that don't end roads (despite claiming so):"); else $display("All hits end roads as claimed!"); foreach (ends_road_invalid[i]) $display("%d - %d", ends_road_time_invalid[i], ends_road_invalid[i]);

105 if (roads_missed.size())


$display("Missed some roads:"); else $display("No missing roads!"); foreach (roads_missed[i,j]) begin ֓← ,display("event %d, road [%d]: id = %d, base SS = %d, DC = %d$ 110 ֓← ,hit_list = [", events_with_roads_missed[i], j ֓← ,roads_missed[i][j].roadid, roads_missed[i][j].ssid roads_missed[i][j].DCvalue); foreach (roads[i].hit_list[j]) $display("%d", roads[i].hit_list[j]); $display("]"); end 115 endfunction // report_phase

task check_results(); int i, j; while (bfm.is_reading()) begin @bfm.cb; if (bfm.cb.dout1_valid==1) remove_hits_and_roads(bfm.cb.dout1_roadID, bfm.cb.dout1, bfm.cb.dout1_ends_road); if (bfm.cb.dout2_valid==1) remove_hits_and_roads(bfm.cb.dout2_roadID, bfm.cb.dout2, bfm.cb.dout2_ends_road); end endtask // check_results

task toggle_enable(); while(1) begin 130 repeat (50) @bfm.cb; bfm.read_more_roads1 <= 1; bfm.read_more_roads2 <= 1; repeat (10) @bfm.cb; bfm.read_more_roads1 <= 0; 135 bfm.read_more_roads2 <= 0; end endtask // toggle_enable

virtual task execute(); 140 event_no++;

while (!bfm.state_read) @bfm.cb; 145 fork check_results(); toggle_enable(); join_any 150 if (roads.size()) begin $error("Roads left unread!!!"); roads_missed.push_back(roads); events_with_roads_missed.push_back(event_no);


155 end

disable fork;

endtask // verify_data 160 endclass // scoreboard

Code Listing D.12: DO UVM testbench scoreboard source

References

[1] P. Chaudhuri. Computer Organization and Design. Prentice Hall India Pvt., Limited, 2004. isbn: 9788120312548 (cit. on p. 2). [2] C. A. Mead and L. A. Conway. Introduction to VLSI Systems. Addison-Wesley Pub (Sd), 1978. isbn: 978-0201043587 (cit. on p. 2). [3] M. R. Warren Miller and B. Sievers. (AMD) 22V10 Program- mable Array Logic (PAL) Development Team Oral History Panel. July 2012. url: http://archive.computerhistory.org/resources/access/text/2013/ 03/102746508-05-01-acc.pdf (cit. on pp. 2, 3). [4] “IEEE Standard VHDL Language Reference Manual”. In: IEEE Std 1076-1987 (1988). doi: 10.1109/IEEESTD.1988.122645 (cit. on p. 4). [5] “IEEE Standard Hardware Description Language Based on the Verilog(R) Hard- ware Description Language”. In: IEEE Std 1364-1995 (Oct. 1996), pp. 1–688. doi: 10.1109/IEEESTD.1996.81542 (cit. on p. 4). [6] 7 Series FPGAs CLB User Guide. UG474. v1.8. Xilinx. Sept. 2016. url: https:// www.xilinx.com/support/documentation/user_guides/ug474_7Series_ CLB.pdf (cit. on p. 5). [7] L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction tothe Design of Warehouse-Scale Machines. 2009 (cit. on p. 6). [8] J. Fowers, G. Brown, P. Cooke, and G. Stitt. “A Performance and Energy Com- parison of FPGAs, GPUs, and Multicores for Sliding-window Applications”. In: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. FPGA ’12. Monterey, California, USA: ACM, 2012, pp. 47–56. isbn: 978-1-4503-1155-7. doi: 10.1145/2145694.2145704 (cit. on p. 6). [9] The Xilinx SDAccel Development Environment. Xilinx. 2014. url: http://www. xilinx . com / support / documentation / backgrounders / sdaccel - back grounder.pdf (cit. on p. 6). [10] S. Kestur, J. D. Davis, and O. Williams. “BLAS Comparison on FPGA, CPU and GPU”. In: 2010 IEEE Computer Society Annual Symposium on VLSI. Institute of Electrical and Electronics Engineers (IEEE), July 2010. doi: 10.1109/isvlsi. 2010.84 (cit. on p. 6). [11] I. Kuon and J. Rose. “Measuring the gap between FPGAs and ASICs”. In: Proceedings of the internation symposium on Field programmable gate arrays - FPGA’06. ACM Press, 2006. doi: 10.1145/1117201.1117205 (cit. on p. 6). [12] O. Pell and O. Mencer. “Surviving the End of Frequency Scaling with Reconfig- urable Dataflow Computing”. In: ACM Comp. Ar. 39.4 (Dec. 2011), pp. 60–65. issn: 0163-5964. doi: 10.1145/2082156.2082172 (cit. on p. 8).


[13] M. R. Maheshwarappa, M. D. J. Bowyer, and C. P. Bridges. “Improvements in CPU and FPGA Performance for Small Satellite SDR Applications”. In: IEEE Transactions on Aerospace and Electronic Systems PP.99 (2017), pp. 1–1. issn: 0018-9251. doi: 10.1109/TAES.2017.2650320 (cit. on p. 8). [14] C.-L. Sotiropoulou, L. Voudouris, C. Gentsos, S. Nikolaidis, N. Vassiliadis, et al. “FPGA-based machine vision implementation for Lab-on-Chip flow detection”. In: Circuits and Systems (ISCAS), 2012 IEEE International Symposium on. IEEE. 2012, pp. 2047–2050. doi: 10.1109/ISCAS.2012.6271683 (cit. on p. 10). [15] C.-L. Sotiropoulou, L. Voudouris, C. Gentsos, A. M. Demiris, N. Vassiliadis, et al. “Real-time machine vision FPGA implementation for microfluidic monitoring on lab-on-chips”. In: IEEE Transactions on Biomedical Circuits and Systems 8.2 (Apr. 2014), pp. 268–277. issn: 1932-4545. doi: 10.1109/TBCAS.2013.2260338 (cit. on p. 10). [16] R. C. Gonzalez and R. E. Woods. Digital Image Processing (3rd Edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2006. isbn: 013168728X (cit. on p. 17). [17] J. Canny. “A Computational Approach to Edge Detection”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8.6 (Nov. 1986), pp. 679–698. doi: 10.1109/tpami.1986.4767851 (cit. on pp. 17, 36, 37). [18] C. Gentsos, C.-L. Sotiropoulou, S. Nikolaidis, and N. Vassiliadis. “Real-time canny edge detection parallel implementation for FPGAs”. In: Electronics, Circuits, and Systems (ICECS), 2010 17th IEEE International Conference on. IEEE. 2010, pp. 499– 502. doi: 10.1109/ICECS.2010.5724558 (cit. on p. 18). [19] C.-L. Sotiropoulou, C. Gentsos, S. Nikolaidis, and A. Rjoub. “FPGA-based Canny Edge Detection for Real-Time Applications”. In: Proceedings, Conference on Design of Circuits and Integrated System (DCIS 2011). Nov. 2011, pp. 73–77. url: https:// pdfs . semanticscholar . org / fee0 / 728220e9cccb8dc6fc09b4bff8522 aeef021.pdf (cit. on p. 18). [20] W. He and K. Yuan. “An improved Canny edge detector and its realization on FPGA”. In: 2008 7th World Congress on Intelligent Control and Automation. Institute of Electrical and Electronics Engineers (IEEE), 2008. doi: 10.1109/wcica.2008. 4594570 (cit. on p. 18). [21] Z. Hocenski, S. Vasilic, and V. Hocenski. “Improved Canny Edge Detector in Ceramic Tiles Defect Detection”. In: IECON 2006 - 32nd Annual Conference on IEEE Industrial Electronics. Institute of Electrical and Electronics Engineers (IEEE), Nov. 2006. doi: 10.1109/iecon.2006.347535 (cit. on p. 19). [22] Y. Luo and R. Duraiswami. “Canny edge detection on NVIDIA CUDA”. In: 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Institute of Electrical and Electronics Engineers (IEEE), June 2008. doi: 10.1109/cvprw.2008.4563088 (cit. on p. 19). [23] H. Shan and N. A. Hazanchuk. Adaptive Edge Detection for Real-Time Video Pro- cessing using FPGAs I. 2005. url: http://citeseerx.ist.psu.edu/viewdoc/ summary?doi=10.1.1.118.4139 (cit. on p. 19). REFERENCES 177

[24] D. V. Rao and M. Venkatesan. “An efficient reconfigurable architecture and im- plementation of edge detection algorithm using Handle-C”. In: International Conference on Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. Institute of Electrical and Electronics Engineers (IEEE), 2004. doi: 10.1109/itcc.2004.1286764 (cit. on p. 19). [25] Celoxica Ltd. “DK4: Handel-C Language Reference Manual”. In: (2005). url: http: //babbage.cs.qc.cuny.edu/courses/cs345/Manuals/HandelC.pdf (cit. on p. 19). [26] G. Kornaros. “A soft multi-core architecture for edge detection and data analysis of microarray images”. In: Journal of Systems Architecture 56.1 (Jan. 2010), pp. 48– 62. doi: 10.1016/j.sysarc.2009.11.004 (cit. on p. 19). [27] Y.-J. Shin and J.-B. Lee. “Digital microfluidics-based high-throughput imaging for systems biology”. In: 2008 IEEE Sensors. Institute of Electrical and Electronics Engineers (IEEE), Oct. 2008. doi: 10.1109/icsens.2008.4716658 (cit. on p. 19). [28] E. T. Dimalanta, A. Lim, R. Runnheim, C. Lamers, C. Churas, et al. “A Microfluidic System for Large DNA Molecule Arrays”. In: Analytical Chemistry 76.18 (Sept. 2004), pp. 5293–5301. doi: 10.1021/ac0496401 (cit. on p. 19). [29] M. M. A. Mohamed, A. A. Youssef, Y. H. Ghallab, and W. Badawy. “On the design of digital control for lab-on-chip systems”. In: 2009 IEEE International Symposium on Circuits and Systems. Institute of Electrical and Electronics Engineers (IEEE), May 2009. doi: 10.1109/iscas.2009.5117931 (cit. on p. 19). [30] S. Jin, D. Kim, T. Song, Q. N. Le, and J. W. Jeon. “An FPGA-based motion-vision integrated system for real-time machine vision applications”. In: 2009 IEEE Inter- national Symposium on Industrial Electronics. Institute of Electrical and Electronics Engineers (IEEE), July 2009. doi: 10.1109/isie.2009.5218259 (cit. on p. 19). [31] T. Granlund and P. L. Montgomery. “Division by Invariant Integers Using Mul- tiplication”. In: SIGPLAN Not. 29.6 (June 1994), pp. 61–72. issn: 0362-1340. doi: 10.1145/773473.178249 (cit. on p. 25). [32] P.-E. Danielsson and O. Seger. “Generalized and Separable Sobel Operators”. In: Machine Vision for Three-Dimensional Scenes. Ed. by H. Freeman. Academic Press, 1990, pp. 347–379. isbn: 978-0-12-266722-0. doi: 10.1016/B978-0-12-266722- 0.50016-6 (cit. on p. 30). [33] A. Demiris, S. Blionas. “Integrated System for the Visual Control, Quantitative and Qualitative Flow Measurement in Microfluidics”. Patent, Hellenic Industrial Property Organisation, 20110100390. July 2011 (cit. on p. 45). [34] R. O. Duda and P. E. Hart. “Use of the Hough transformation to detect lines and curves in pictures”. In: Communications of the ACM 15.1 (Jan. 1972), pp. 11–15. doi: 10.1145/361237.361242 (cit. on p. 47). [35] L. Voudouris, S. Nikolaidis, and A. Rjoub. “High speed FPGA implementation of hough transform for real-time applications”. In: 2012 IEEE 15th International Symposium on Design and Diagnostics of Electronic Circuits Systems (DDECS). Apr. 2012, pp. 213–218. doi: 10.1109/DDECS.2012.6219060 (cit. on p. 47). 178 REFERENCES

[36] C.-L. Sotiropoulou, C. Gentsos, and S. Nikolaidis. “High performance median FPGA implementation for machine vision applications”. In: Electronics, Circuits, and Systems (ICECS), 2013 IEEE 20th International Conference on. IEEE. 2013, pp. 173–176. doi: 10.1109/ICECS.2013.6815382 (cit. on p. 47). [37] Integrating a Video Frame Buffer Controller (VFBC) in System Generator. XAPP1136. v1.0. Xilinx Inc. June 2009. url: https://www.xilinx.com/support/document ation/application_notes/xapp1136.pdf (cit. on p. 50). [38] ATLAS Collaboration. “The ATLAS Experiment at the CERN Large Hadron Col- lider”. In: J. Instrum. 3 (2008). Also published by CERN Geneva in 2010, S08003. 437 p. url: https://cdsweb.cern.ch/record/1129811 (cit. on p. 56). [39] C. ATLAS. Letter of Intent for the Phase-II Upgrade of the ATLAS Experiment. Tech. rep. CERN-LHCC-2012-022. LHCC-I-023. Draft version for comments. Geneva: CERN, Dec. 2012. url: https://cds.cern.ch/record/1502664 (cit. on p. 57). [40] A. La Rosa and f. the ATLAS Collaboration. “ATLAS IBL: a challenging first step for ATLAS Upgrade at the sLHC”. In: The 2011 Europhysics Conference on High Energy Physics – HEP2011. Sept. 2011. url: https://arxiv.org/pdf/1109.3372.pdf (cit. on p. 59). [41] W. H. Smith. “Triggering at LHC experiments”. In: Nucl. Instrum. Methods Phys. Res., Sect. A 478.1–2 (2002). Proceedings of the ninth Int.Conf. on Instrumentation, pp. 62–67. issn: 0168-9002. doi: 10.1016/S0168-9002(01)01720-X (cit. on p. 59). [42] M. zur Nedden. “The LHC Run 2 ATLAS trigger system: design, performance and plans”. In: Journal of Instrumentation 12.03 (2017), p. C03024. doi: 10.1088/1748- 0221/12/03/C03024 (cit. on p. 60). [43] C. Collaboration. “The CMS trigger system”. In: Journal of Instrumentation 12.01 (Jan. 2017), P01020–P01020. doi: 10.1088/1748-0221/12/01/p01020 (cit. on p. 60). [44] A. Sambade, M. Frank, D. Galli, C. Gaspar, B. Jost, et al. “Controlling a large CPU farm using industrial tools”. In: 2009 16th IEEE-NPSS Real Time Conference. May 2009, pp. 220–223. doi: 10.1109/RTC.2009.5321887 (cit. on p. 60). [45] J. A. Nacif, T. S. F. Silva, L. F. M. Vieira, A. B. Vieira, A. O. Fernandes, et al. “Tracking Hardware Evolution”. In: 2011 12th International Symposium on Quality Electronic Design. Mar. 2011, pp. 1–6. doi: 10.1109/ISQED.2011.5770764 (cit. on p. 60). [46] ATLAS Collaboration. Fast TracKer (FTK) Technical Design Report. Tech. rep. CERN- LHCC-2013-007. ATLAS-TDR-021. Geneva: CERN, June 2013. url: %7Bhttp : //cdsweb.cern.ch/record/1552953%7D (cit. on pp. 60, 66, 89, 106, 118). [47] V. Cavaliere, J. Adelman, C. Gentsos, et al. “Design of a hardware track finder (Fast Tracker) for the ATLAS trigger”. In: Journal of Instrumentation 11.02 (Feb. 2016), p. C02056. doi: 10.1088/1748-0221/11/02/C02056 (cit. on p. 60). [48] M. Dell’Orso and L. Ristori. “VLSI structures for track finding”. In: Nucl. Instrum. Methods Phys. Res., Sect. A 278.2 (1989), pp. 436–440. issn: 0168-9002. doi: 10. 1016/0168-9002(89)90862-0 (cit. on p. 60). REFERENCES 179

[49] A. Andreani, A. Annovi, C. Gentsos, et al. “The Associative Memory Serial Link Processor for the Fast TracKer (FTK) at ATLAS”. In: Journal of Instrumentation 9.11 (Nov. 2014), p. C11006. doi: 10.1088/1748-0221/9/11/C11006 (cit. on p. 60). [50] L. S. Ancu, A. Annovi, D. Britzger, P. Giannetti, J. W. Howarth, et al. “Associative memory computing power and its simulation”. In: Real Time Conference (RT), 2014 19th IEEE-NPSS. May 2014, pp. 1–5. doi: 10.1109/RTC.2014.7097552 (cit. on p. 60). [51] A. Andreani, A. Annovi, R. Beccherle, M. Beretta, N. Biesuz, et al. “Character- isation of an Associative Memory Chip for high-energy physics experiments”. In: 2014 IEEE International Instrumentation and Measurement Technology Confer- ence (I2MTC) Proceedings. May 2014, pp. 1487–1491. doi: 10.1109/I2MTC.2014. 6860993 (cit. on p. 60). [52] A. Bardi, S. Belforte, J. Berryhill, A. Cerri, A. Clark, et al. “SVT: an online Silicon Vertex Tracker for the CDF upgrade”. In: Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 409.1 (1998), pp. 658–661. issn: 0168-9002. doi: 10 . 1016 / S0168 - 9002(97)01345-4 (cit. on p. 60). [53] C. L. Sotiropoulou, I. Maznas, S. Citraro, A. Annovi, L. S. Ancu, et al. “The As- sociative Memory System Infrastructures for the ATLAS Fast Tracker”. In: IEEE Transactions on Nuclear Science 64.6 (June 2017), pp. 1248–1254. issn: 0018-9499. doi: 10.1109/TNS.2017.2703908 (cit. on p. 60). [54] A. Annovi, F. Bertolucci, C. Gentsos, N. Biesuz, D. Calabro, et al. “Highly paral- lelized pattern matching execution for the ATLAS experiment”. In: Proceedings, 2015 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC 2015). IEEE. IEEE, May 2015. doi: 10.1109/NSSMIC.2015.7581789 (cit. on p. 61). [55] P. Luciano, A. Annovi, C. Gentsos, A. Andreani, R. Beccherle, et al. “The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS”. In: Proceedings, 3rd International Conference on Technology and Instrumentation in Particle Physics (TIPP 2014). Vol. TIPP2014. SISSA. SISSA, 2014. url: http://pos.sissa.it/cgi- bin/reader/conf.cgi?confid=213 (cit. on p. 61). [56] R. S. Shojaii. “An associative memory chip for the trigger system of the ATLAS experiment”. In: PoS Vertex2016 (2017), p. 058. doi: 10.1142/9789814603164_ 0057 (cit. on p. 61). [57] CMS Collaboration. “The CMS experiment at the CERN LHC”. In: Journal of Instrumentation 3 (08 Aug. 2008), S08004–S08004. doi: 10.1088/1748-0221/3/ 08/s08004 (cit. on p. 63). [58] D. Froidevaux and P. Sphicas. “General-Purpose Detectors for the Large Hadron Collider”. In: 56 (Nov. 2006), pp. 375–440. doi: 10.1146/annurev.nucl.54. 070103.181209 (cit. on p. 63). [59] D. Green, W. Erdmann, L. Rossi, M. Mannelli, L. Mandelli, et al. At the leading edge: The ATLAS and CMS LHC Experiments. Ed. by D. Green. World Scientific, Feb. 2010. isbn: 9789814277617 (cit. on p. 63). 180 REFERENCES

[60] CMS Collaboration. “Description and performance of track and primary-vertex reconstruction with the CMS tracker”. In: J. Instrum. 9.10 (2014), P10009. doi: 10.1088/1748-0221/9/10/P10009 (cit. on p. 64). [61] F. Palla, M. Pesaresi, and A. Ryd. “Track Finding in CMS for the Level-1 Trigger at the HL-LHC”. In: Journal of Instrumentation 11 (03 Mar. 2016), pp. C03011–C03011. doi: 10.1088/1748-0221/11/03/c03011 (cit. on p. 64). [62] CMS Collaboration. Technical Proposal for the Phase-II Upgrade of the CMS Detector. Tech. rep. CERN-LHCC-2015-010. LHCC-P-008. CMS-TDR-15-02. Geneva, June 2015. url: http://cds.cern.ch/record/2020886 (cit. on pp. 65, 66). [63] J. Olsen, T. Liu, and Y. Okumura. “A full mesh ATCA-based general purpose data processing board”. In: JINST 9.01 (2014), p. C01041. doi: 10.1088/1748- 0221/9/01/C01041 (cit. on pp. 65, 113). [64] H. Wind. “Principal component analysis and its application to track finding”. In: Formulae and Methods in Experimental Data Evaluation 3 (1984) (cit. on pp. 65, 89). [65] C. Gentsos, F. Crescioli, P. Giannetti, D. Magalotti, and S. Nikolaidis. “Future Evolution of the Fast TracKer (FTK) processing unit”. In: Proceedings, 3rd Inter- national Conference on Technology and Instrumentation in Particle Physics (TIPP 2014). Vol. TIPP2014. SISSA. SISSA, 2014. url: http://pos.sissa.it/cgi- bin/reader/conf.cgi?confid=213 (cit. on p. 66). [66] L. Skinnari. “L1 track triggering at CMS for High Luminosity LHC”. In: Journal of Instrumentation 9.10 (2014), p. C10035. doi: 10.1088/1748-0221/9/10/C10035 (cit. on p. 66). [67] G. Hall. “A time-multiplexed track-trigger for the CMS HL-LHC upgrade”. In: Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spec- trometers, Detectors and Associated Equipment 824 (2016). Frontier Detectors for Frontier Physics: Proceedings of the 13th Pisa Meeting on Advanced Detectors, pp. 292–295. issn: 0168-9002. doi: 10.1016/j.nima.2015.09.075 (cit. on p. 66). [68] A. Annovi, G. Broccolo, A. Ciocci, P. Giannetti, F. Ligabue, et al. “Associative Memory for L1 Track Triggering in LHC Environment”. In: IEEE Trans. Nucl. Sci. 60.5 (Oct. 2013), pp. 3627–3632. issn: 0018-9499. doi: 10.1109/TNS.2013. 2281268 (cit. on pp. 67, 119). [69] S. Amerio, A. Andreani, A. Andreazza, A. Annovi, M. Beretta, et al. “ATLAS FTK: Fast Track Trigger”. In: The 20th Anniversary International Workshop on Vertex Detectors. Sept. 2013. url: https://pos.sissa.it/137/040/pdf (cit. on p. 67). [70] Multiplexer Design Techniques for Datapath Performance with Minimized Rout- ing Resources. XAPP522. v1.2. Xilinx. Oct. 2014. url: https://www.xilinx. com/support/documentation/application_notes/xapp522-mux-design- techniques.pdf (cit. on pp. 71, 151). [71] 7 Series FPGAs Configurable Logic Block. UG474. v1.8. Xilinx. Sept. 2016. url: https://www.xilinx.com/support/documentation/user_guides/ug474_ 7Series_CLB.pdf (cit. on p. 71). [72] UltraScale Architecture Configurable Logic Block. UG574. v1.5. Xilinx. Feb. 2017. url: https://www.xilinx.com/support/documentation/user_guides/ug574- ultrascale-clb.pdf (cit. on p. 71). REFERENCES 181

[73] C. E. Leiserson, H. Prokop, and K. H. Randall. Using de Bruijn Sequences to Index a 1 in a Computer Word. 1998 (cit. on p. 73). [74] R. Frühwirth. “Application of Kalman filtering to track and vertex fitting”. In: Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spec- trometers, Detectors and Associated Equipment 262.2 (1987), pp. 444–450. issn: 0168-9002. doi: https://doi.org/10.1016/0168-9002(87)90887-4 (cit. on p. 89). [75] S. J. Bailey, G. W. Brandenburg, N. Felt, T. Fries, S. Harder, et al. “Rapid 3-D track reconstruction with the BABAR trigger upgrade”. In: IEEE Transactions on Nuclear Science 51.5 (Oct. 2004), pp. 2352–2355. issn: 0018-9499. doi: 10.1109/TNS.2004. 834708 (cit. on p. 89). [76] S. Amerio, A. Annovi, M. Bettini, M. Bucciantonio, P. Catastini, et al. “TheGi- gaFitter: A next generation track fitter to enhance online tracking performances at CDF”. In: Nuclear Science Symposium Conference Record (NSS/MIC), 2009 IEEE. Ed. by R. C. Lanza. IEEE, Oct. 2009, pp. 1143–1146. isbn: 9781424439614. doi: 10.1109/NSSMIC.2009.5402410 (cit. on p. 89). [77] “IEEE Standard for Floating-Point Arithmetic”. In: IEEE Std 754-2008 (Aug. 2008), pp. 1–70. doi: 10.1109/IEEESTD.2008.4610935 (cit. on p. 91). [78] UltraScale Architecture DSP Slice User Guide. UG579. v1.2. Xilinx. Jan. 2015. url: https://www.xilinx.com/support/documentation/user_guides/ug579- ultrascale-dsp.pdf (cit. on p. 91). [79] Vivado Design User Guide: Hierarchical Design. ug905. v2017.4. Xilinx. Dec. 2017. url: https : / / www . xilinx . com / support / documentation / sw _ manuals / xilinx2017_4/ug905-vivado-hierarchical-design.pdf (cit. on p. 96). [80] S. Marconi, E. Conti, J. Christiansen, and P. Placidi. “Reusable SystemVerilog-UVM Design Framework with Constrained Stimuli Modeling for High Energy Physics Applications”. In: 2015 IEEE International Symposium on Systems Engineering (ISSE). Sept. 2015, pp. 391–397. doi: 10.1109/SysEng.2015.7302788 (cit. on p. 100). [81] C. Gentsos, F. Crescioli, F. Bertolucci, D. Magalotti, S. Citraro, et al. “The Future Evolution of the Fast TracKer Processing Unit”. In: Proceedings, 2015 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC 2015). IEEE. IEEE, May 2015. doi: 10.1109/NSSMIC.2015.7581783 (cit. on p. 102). [82] C. Gentsos, G. Volpi, S. Gkaitatzis, P. Giannetti, S. Citraro, et al. “A Coprocessor for the Fast Tracker Simulation”. In: IEEE Transactions on Nuclear Science 64.6 (June 2017), pp. 1255–1262. issn: 0018-9499. doi: 10.1109/TNS.2017.2688586 (cit. on p. 102). [83] A. Annovi et al. “Hadron collider triggers with high-quality tracking at very high event rates”. In: IEEE Trans. Nucl. Sci. 51 (2004), pp. 391–400. doi: 10.1109/TNS. 2004.828639 (cit. on p. 102). [84] G. Fedi, A. Modak, C. Gentsos, B. Checcucci, D. Magalotti, et al. “L1 track trigger for the CMS HL-LHC upgrade using AM chips + FPGA”. In: Proceedings, Connecting The Dots / Intelligent Trackers 2017 (CTD/WIT 2017). Mar. 2017. url: http://cds. cern.ch/record/2263760/files/CR2017_117.pdf (cit. on p. 112). 182 REFERENCES

[85] C. Gentsos, G. Fedi, G. Magazzù, D. Magalotti, A. Modak, et al. “Track Finding Mezzanine for Level-1 Triggering in HL-LHC Experiments”. In: Proceedings, 6th International Conference on Modern Circuits and Systems Technologies (MOCAST 2017). IEEE. IEEE, pp. 1–4. doi: 10.1109/MOCAST.2017.7937676 (cit. on p. 112).
[86] G. Magazzù, G. Fedi, C. Gentsos, et al. “Track finding based on Associative Memories for Level-1 triggering in HL-LHC experiments”. In: Proceedings, 5th International Conference on Modern Circuits and Systems Technologies (MOCAST 2016). IEEE. IEEE, May 2016. doi: 10.1109/MOCAST.2016.7495145 (cit. on p. 113).
[87] Xilinx UltraScale Architecture for High-Performance, Smarter Systems. White Paper WP434. v1.2. Oct. 2015. url: https://www.xilinx.com/support/documentation/white_papers/wp434-ultrascale-smarter-systems.pdf (cit. on p. 114).
[88] A. K. Sampathirao, P. Sopasakis, A. Bemporad, and P. Patrinos. “Distributed solution of stochastic optimal control problems on GPUs”. In: 2015 54th IEEE Conference on Decision and Control (CDC). IEEE, Dec. 2015. doi: 10.1109/cdc.2015.7403352 (cit. on p. 118).
[89] S. Katsigiannis, V. Dimitsas, and D. Maroulis. “A GPU vs CPU performance evaluation of an experimental video compression algorithm”. In: 2015 Seventh International Workshop on Quality of Multimedia Experience (QoMEX). IEEE, May 2015. doi: 10.1109/qomex.2015.7148134 (cit. on p. 118).
[90] D. Emeliyanov and J. Howard. “GPU-Based Tracking Algorithms for the ATLAS High-Level Trigger”. In: J. Phys. Conf. Ser. 396.1 (2012), p. 012018. doi: 10.1088/1742-6596/396/1/012018 (cit. on p. 118).
[91] J. Mattmann and C. Schmitt. “Track finding in ATLAS using GPUs”. In: J. Phys. Conf. Ser. 396.2 (2012), p. 022035. doi: 10.1088/1742-6596/396/2/022035 (cit. on p. 118).
[92] V. Halyo, A. Hunt, P. Jindal, P. LeGresley, and P. Lujan. “GPU enhancement of the trigger to extend physics reach at the LHC”. In: J. Instrum. 8.10 (2013), P10005. doi: 10.1088/1748-0221/8/10/P10005 (cit. on p. 118).
[93] C. E. Cummings. “Simulation and Synthesis Techniques for Asynchronous FIFO Design”. In: Synopsys User Group Silicon Valley 2002 Proceedings (SNUG2002). 2002 (cit. on p. 128).
[94] “IEEE Standard for SystemVerilog–Unified Hardware Design, Specification, and Verification Language”. In: IEEE Std 1800-2012 (Revision of IEEE Std 1800-2009) (Feb. 2013), pp. 1–1315. doi: 10.1109/IEEESTD.2013.6469140 (cit. on p. 135).
[95] P. Flake. “Why SystemVerilog?” In: Proceedings of the 2013 Forum on Specification and Design Languages (FDL). Sept. 2013, pp. 1–6 (cit. on p. 135).
[96] S. Sutherland, S. Davidmann, and P. Flake. SystemVerilog for Design. A Guide to Using SystemVerilog for Hardware Design and Modeling. Springer Nature, 2006. isbn: 978-0-387-33399-1. doi: 10.1007/0-387-36495-1 (cit. on p. 135).
[97] S. Sutherland and D. Mills. “Synthesizing SystemVerilog. Busting the Myth that SystemVerilog is only for Verification”. In: Synopsys User Group Silicon Valley 2013 Proceedings (SNUG2013). 2013. url: https://www.synopsys.com/community/snug/snug-silicon-valley/location-proceedings-2013.html (cit. on p. 136).

[98] S. Ramachandran. Digital VLSI Systems Design. A Design Manual for Implementation of Projects on FPGAs and ASICs Using Verilog. Springer Netherlands, 2007. isbn: 978-1-4020-5829-5. doi: 10.1007/978-1-4020-5829-5 (cit. on p. 136).
[99] R. Jasinski. Effective Coding with VHDL: Principles and Best Practice. The MIT Press, 2016. isbn: 978-0262034227 (cit. on p. 136).
[100] C. E. Cummings. “SystemVerilog Logic Specific Processes for Synthesis - Benefits and Proper Usage”. In: Synopsys User Group Silicon Valley 2016 Proceedings (SNUG2016). 2016. url: https://www.synopsys.com/community/snug/snug-silicon-valley/location-proceedings-2016.html (cit. on p. 137).
[101] “IEEE Standard Multivalue Logic System for VHDL Model Interoperability (Std_logic_1164)”. In: IEEE Std 1164-1993 (May 1993), pp. 1–24. doi: 10.1109/IEEESTD.1993.115571 (cit. on p. 139).
[102] L. Scheffer, L. Lavagno, and G. Martin. EDA for IC System Design, Verification, and Testing. CRC Press, 2006. isbn: 978-0-849-37923-9 (cit. on p. 142).
[103] “IEEE Standard for Property Specification Language (PSL)”. In: IEEE Std 1850-2005 (2005), pp. 1–143. doi: 10.1109/IEEESTD.2005.97780 (cit. on p. 142).
[104] B. Cohen, S. Venkataramanan, A. Kumari, and L. Piper. SystemVerilog Assertions Handbook. For Dynamic and Formal Verification. 4th ed. VhdlCohen Publishing, 2016. isbn: 9781518681448 (cit. on p. 142).
[105] Proceedings, 2015 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC 2015). IEEE. IEEE, May 2015.
[106] Proceedings, 3rd International Conference on Technology and Instrumentation in Particle Physics (TIPP 2014). Vol. TIPP2014. SISSA. SISSA, 2014. url: http://pos.sissa.it/cgi-bin/reader/conf.cgi?confid=213.

Publications

International Peer-Reviewed Journal Publications

[1] C. Gentsos, G. Volpi, S. Gkaitatzis, P. Giannetti, S. Citraro, et al. “A Coprocessor for the Fast Tracker Simulation”. In: IEEE Transactions on Nuclear Science 64.6 (June 2017), pp. 1255–1262. issn: 0018-9499. doi: 10.1109/TNS.2017.2688586.
[2] C. L. Sotiropoulou, I. Maznas, S. Citraro, A. Annovi, L. S. Ancu, et al. “The Associative Memory System Infrastructures for the ATLAS Fast Tracker”. In: IEEE Transactions on Nuclear Science 64.6 (June 2017), pp. 1248–1254. issn: 0018-9499. doi: 10.1109/TNS.2017.2703908.
[3] C.-L. Sotiropoulou, L. Voudouris, C. Gentsos, A. M. Demiris, N. Vassiliadis, et al. “Real-time machine vision FPGA implementation for microfluidic monitoring on lab-on-chips”. In: IEEE Transactions on Biomedical Circuits and Systems 8.2 (Apr. 2014), pp. 268–277. issn: 1932-4545. doi: 10.1109/TBCAS.2013.2260338.

International Peer-Reviewed Journal Publications from Conference Proceedings

[1] V. Cavaliere, J. Adelman, C. Gentsos, et al. “Design of a hardware track finder (Fast Tracker) for the ATLAS trigger”. In: Journal of Instrumentation 11.02 (Feb. 2016), p. C02056. doi: 10.1088/1748-0221/11/02/C02056.
[2] A. Andreani, A. Annovi, C. Gentsos, et al. “The Associative Memory Serial Link Processor for the Fast TracKer (FTK) at ATLAS”. In: Journal of Instrumentation 9.11 (Nov. 2014), p. C11006. doi: 10.1088/1748-0221/9/11/C11006.

Proceedings from International Peer-Reviewed Conferences

[1] C. Gentsos, G. Fedi, G. Magazzù, D. Magalotti, A. Modak, et al. “Track Finding Mezzanine for Level-1 Triggering in HL-LHC Experiments”. In: Proceedings, 6th International Conference on Modern Circuits and Systems Technologies (MOCAST 2017). IEEE. IEEE, pp. 1–4. doi: 10.1109/MOCAST.2017.7937676.
[2] G. Fedi, A. Modak, C. Gentsos, B. Checcucci, D. Magalotti, et al. “L1 track trigger for the CMS HL-LHC upgrade using AM chips + FPGA”. In: Proceedings, Connecting The Dots / Intelligent Trackers 2017 (CTD/WIT 2017). Mar. 2017. url: http://cds.cern.ch/record/2263760/files/CR2017_117.pdf.


[3] G. Magazzù, G. Fedi, C. Gentsos, et al. “Track finding based on Associative Memories for Level-1 triggering in HL-LHC experiments”. In: Proceedings, 5th International Conference on Modern Circuits and Systems Technologies (MOCAST 2016). IEEE. IEEE, May 2016. doi: 10.1109/MOCAST.2016.7495145.
[4] A. Annovi, F. Bertolucci, C. Gentsos, N. Biesuz, D. Calabro, et al. “Highly parallelized pattern matching execution for the ATLAS experiment”. In: Proceedings, 2015 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC 2015). IEEE. IEEE, May 2015. doi: 10.1109/NSSMIC.2015.7581789.
[5] C. Gentsos, F. Crescioli, F. Bertolucci, D. Magalotti, S. Citraro, et al. “The Future Evolution of the Fast TracKer Processing Unit”. In: Proceedings, 2015 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC 2015). IEEE. IEEE, May 2015. doi: 10.1109/NSSMIC.2015.7581783.
[6] C. Gentsos, F. Crescioli, P. Giannetti, D. Magalotti, and S. Nikolaidis. “Future Evolution of the Fast TracKer (FTK) processing unit”. In: Proceedings, 3rd International Conference on Technology and Instrumentation in Particle Physics (TIPP 2014). Vol. TIPP2014. SISSA. SISSA, 2014. url: http://pos.sissa.it/cgi-bin/reader/conf.cgi?confid=213.
[7] P. Luciano, A. Annovi, C. Gentsos, A. Andreani, R. Beccherle, et al. “The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS”. In: Proceedings, 3rd International Conference on Technology and Instrumentation in Particle Physics (TIPP 2014). Vol. TIPP2014. SISSA. SISSA, 2014. url: http://pos.sissa.it/cgi-bin/reader/conf.cgi?confid=213.
[8] C.-L. Sotiropoulou, C. Gentsos, and S. Nikolaidis. “High performance median FPGA implementation for machine vision applications”. In: Electronics, Circuits, and Systems (ICECS), 2013 IEEE 20th International Conference on. IEEE. 2013, pp. 173–176. doi: 10.1109/ICECS.2013.6815382.
[9] C.-L. Sotiropoulou, L. Voudouris, C. Gentsos, S. Nikolaidis, N. Vassiliadis, et al. “FPGA-based machine vision implementation for Lab-on-Chip flow detection”. In: Circuits and Systems (ISCAS), 2012 IEEE International Symposium on. IEEE. 2012, pp. 2047–2050. doi: 10.1109/ISCAS.2012.6271683.
[10] C.-L. Sotiropoulou, C. Gentsos, S. Nikolaidis, and A. Rjoub. “FPGA-based Canny Edge Detection for Real-Time Applications”. In: Proceedings, Conference on Design of Circuits and Integrated System (DCIS 2011). Nov. 2011, pp. 73–77. url: https://pdfs.semanticscholar.org/fee0/728220e9cccb8dc6fc09b4bff8522aeef021.pdf.
[11] C. Gentsos, C.-L. Sotiropoulou, S. Nikolaidis, and N. Vassiliadis. “Real-time canny edge detection parallel implementation for FPGAs”. In: Electronics, Circuits, and Systems (ICECS), 2010 17th IEEE International Conference on. IEEE. 2010, pp. 499–502. doi: 10.1109/ICECS.2010.5724558.

Publications as member of the ATLAS Collaboration

[1] M. Aaboud et al. “A search for high-mass resonances decaying to τν in pp collisions at √s = 13 TeV with the ATLAS detector”. In: Phys. Rev. Lett. 120.CERN-EP-2017-341 (Jan. 2018), 161802. 20 p. url: https://cds.cern.ch/record/2301691.
[2] M. Aaboud et al. “Search for W′ → tb decays in the hadronic final state using pp collisions at √s = 13 TeV with the ATLAS detector”. In: Phys. Lett. B 781.CERN-EP-2017-340 (Jan. 2018), 327. 22 p. url: https://cds.cern.ch/record/2302089.
[3] M. Aaboud et al. “Search for a Structure in the B0s π± Invariant Mass Spectrum with the ATLAS Experiment”. In: Phys. Rev. Lett. 120.CERN-EP-2017-333 (Feb. 2018), 202007. 19 p. url: https://cds.cern.ch/record/2302747.
[4] M. Aaboud et al. “Search for photonic signatures of gauge-mediated supersymmetry in 13 TeV pp collisions with the ATLAS detector”. In: Phys. Rev. D 97.CERN-EP-2017-323 (Feb. 2018), 092006. 32 p. url: https://cds.cern.ch/record/2303558.
[5] M. Aaboud et al. “Search for the Decay of the Higgs Boson to Charm Quarks with the ATLAS Experiment”. In: Phys. Rev. Lett. 120.CERN-EP-2017-334 (Feb. 2018), 211802. 20 p. url: https://cds.cern.ch/record/2304413.
[6] M. Aaboud et al. ZZ → ℓ+ℓ−ℓ′+ℓ′− cross-section measurements and search for anomalous triple gauge couplings in 13 TeV pp collisions with the ATLAS detector. Tech. rep. CERN-EP-2017-163. Geneva: CERN, Sept. 2017. url: https://cds.cern.ch/record/2285386.
[7] M. Aaboud et al. A search for resonances decaying into a Higgs boson and a new particle X in the XH → qqbb final state with the ATLAS detector. Tech. rep. CERN-EP-2017-204. Geneva: CERN, Sept. 2017. url: https://cds.cern.ch/record/2285011.
[8] M. Aaboud et al. Analysis of the Wtb vertex from the measurement of triple-differential angular decay rates of single top quarks produced in the t-channel at √s = 8 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-089. Geneva: CERN, July 2017. url: https://cds.cern.ch/record/2274911.
[9] M. Aaboud et al. Combination of inclusive and differential tt charge asymmetry measurements using ATLAS and CMS data at √s = 7 and 8 TeV. Tech. rep. CMS-TOP-15-016. CMS-TOP-15-016-003. Geneva: CERN, Sept. 2017. url: https://cds.cern.ch/record/2284428.

[10] M. Aaboud et al. Determination of the strong coupling constant αs from transverse energy-energy correlations in multijet events at √s = 8 TeV using the ATLAS detector. Tech. rep. CERN-EP-2017-093. July 2017. url: https://cds.cern.ch/record/2273893.
[11] M. Aaboud et al. Direct top-quark decay width measurement in the tt̄ lepton+jets channel at √s = 8 TeV with the ATLAS experiment. Tech. rep. CERN-EP-2017-187. Geneva: CERN, Sept. 2017. url: https://cds.cern.ch/record/2284083.

[12] M. Aaboud et al. “Evidence for light-by-light scattering in heavy-ion collisions with the ATLAS detector at the LHC”. In: Nature Phys. CERN-EP-2016-316 (Feb. 2017), 31 p. url: https://cds.cern.ch/record/2244408. [13] M. Aaboud et al. Evidence for the H b¯b decay with the ATLAS detector. Tech. rep. → CERN-EP-2017-175. Geneva: CERN, Aug. 2017. url: https://cds.cern.ch/ record/2278245. [14] M. Aaboud et al. “Evidence for the associated production of the Higgs boson and a top quark pair with the ATLAS detector”. In: Phys. Rev. D 97.CERN-EP-2017-281. 7 (Dec. 2017), 072003. 44 p. url: https://cds.cern.ch/record/2299050. [15] M. Aaboud et al. Femtoscopy with identified charged pions in proton-lead collisions at √sNN =5.02 TeV with ATLAS. Tech. rep. CERN-EP-2017-004. Geneva: CERN, Apr. 2017. url: https://cds.cern.ch/record/2258366. [16] M. Aaboud et al. “Fiducial, total and differential cross-section measurements of t-channel single top-quark production in pp collisions at 8 TeV using data collected by the ATLAS detector. Fiducial, total and differential cross-section measurements of t-channel single top-quark production in pp collisions at 8 TeV using data collected by the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP- 2016-223 (Feb. 2017), 531. 70 p. url: https://cds.cern.ch/record/2244654. [17] M. Aaboud et al. “Identification and rejection of pile-up jets at high pseudorapidity with the ATLAS detector. Identification and rejection of pile-up jets athigh pseudorapidity with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2017-055. 9 (May 2017), 580. 49 p. url: https://cds.cern.ch/record/2262419. [18] M. Aaboud et al. Jet energy scale measurements and their systematic uncertainties in proton-proton collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-038. Geneva: CERN, Mar. 2017. url: https://cds.cern.ch/ record/2257300. [19] M. Aaboud et al. “Jet reconstruction and performance using particle flow with the ATLAS Detector”. In: Eur. Phys. J. C 77.CERN-EP-2017-024. 7 (Mar. 2017), 466. 67 p. url: https://cds.cern.ch/record/2257597. [20] M. Aaboud et al. Measurement of τ polarisation in Z/γ∗ ττ decays in proton- → proton collisions at √s = 8 TeV with the ATLAS detector. Tech. rep. CERN-EP- 2017-172-CERN-EP-2017-172-CERN-EP-2017-172-CERN-EP-2017-172. Geneva: CERN, Sept. 2017. url: https://cds.cern.ch/record/2283028. [21] M. Aaboud et al. Measurement of b-hadron pair production with the ATLAS detector in proton-proton collisions at √s =8 TeV. Tech. rep. CERN-EP-2017-057. Geneva: CERN, May 2017. url: https://cds.cern.ch/record/2263035. [22] M. Aaboud et al. “Measurement of WW /WZ ℓνqq′ production with the → hadronically decaying boson reconstructed as one or two jets in pp collisions at √s = 8 TeV with ATLAS, and constraints on anomalous gauge couplings. Measurement of WW /WZ ℓνqq′ production with the hadronically decaying → boson reconstructed as one or two jets in pp collisions at √s =8 TeV with ATLAS, and constraints on anomalous gauge couplings”. In: Eur. J. Phys. C 77.CERN-EP- 2017-060. 8 (June 2017), 563. 47 p. url: https://cds.cern.ch/record/2267592. PUBLICATIONS 189

[23] M. Aaboud et al. “Measurement of charged-particle distributions sensitive to the underlying event in √s = 13 TeV proton-proton collisions with the ATLAS detector at the LHC”. In: JHEP 03.CERN-EP-2016-28 (Jan. 2017), 157. 40 p. url: https://cds.cern.ch/record/2242640.
[24] M. Aaboud et al. Measurement of detector-corrected observables sensitive to the anomalous production of events with jets and large missing transverse momentum in pp collisions at √s = 13 TeV using the ATLAS detector. Tech. rep. CERN-EP-2017-116. July 2017. url: https://cds.cern.ch/record/2274215.
[25] M. Aaboud et al. Measurement of inclusive and differential cross sections in the H → ZZ∗ → 4ℓ decay channel in pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-139. Geneva: CERN, Aug. 2017. url: https://cds.cern.ch/record/2277731.
[26] M. Aaboud et al. “Measurement of internal structure of jets in Pb+Pb collisions at √sNN = 2.76 TeV with the ATLAS detector at the LHC”. In: Eur. Phys. J. C 77.CERN-EP-2017-005. 6 (Feb. 2017), 379. 41 p. url: https://cds.cern.ch/record/2243758.

[27] M. Aaboud et al. Measurement of jet pT correlations in Pb+Pb and pp collisions at √sNN =2.76 TeV with the ATLAS detector. Measurement of jet pT correlations in Pb+Pb and pp collisions at √sNN = 2.76 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-054. June 2017. url: https://cds.cern.ch/record/2272240. [28] M. Aaboud et al. Measurement of jet fragmentation in 5.02 TeV proton–lead and proton–proton collisions with the ATLAS detector. Tech. rep. CERN-EP-2017-065. Geneva: CERN, June 2017. url: https://cds.cern.ch/record/2268255. [29] M. Aaboud et al. Measurement of lepton differential distributions and the top quark mass in tt¯ production in pp collisions at √s = 8 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-200. Geneva: CERN, Sept. 2017. url: https://cds. cern.ch/record/2285937. [30] M. Aaboud et al. Measurement of longitudinal flow de-correlations in Pb+Pb col- lisions at √sNN = 2.76 and 5.02 TeV with the ATLAS detector. Measurement of longitudinal flow de-correlations in Pb+Pb collisions at √sNN =2.76 and 5.02 TeV with the ATLAS detector. Tech. rep. CERN-PH-2017-191. Sept. 2017. url: https: //cds.cern.ch/record/2282815. [31] M. Aaboud et al. “Measurement of multi-particle azimuthal correlations in pp, p+Pb and low-multiplicity Pb+Pb collisions with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2017-048. 6 (May 2017), 428. 54 p. url: https://cds. cern.ch/record/2263609. [32] M. Aaboud et al. Measurement of multi-particle azimuthal correlations with the subevent cumulant method in pp and p+Pb collisions with the ATLAS detector at the LHC. Tech. rep. CERN-EP-2017-160. Geneva: CERN, Aug. 2017. url: https: //cds.cern.ch/record/2278247. 190 PUBLICATIONS

[33] M. Aaboud et al. Measurement of quarkonium production in proton–lead and proton–proton collisions at 5.02 TeV with the ATLAS detector. Tech. rep. CERN- PH-EP-2017-197. Sept. 2017. url: https://cds.cern.ch/record/2283154. [34] M. Aaboud et al. “Measurement of the k splitting scales in Z ℓℓ events in t → pp collisions at √s = 8 TeV with the ATLAS detector. Measurement of the k splitting scales in Z ℓℓ events in pp collisions at √s = 8 TeV with the t → ATLAS detector”. In: JHEP 08.CERN-EP-2017-033 (Apr. 2017), 026. 42 p. url: https://cds.cern.ch/record/2258390. [35] M. Aaboud et al. Measurement of the ttγ¯ production cross section in proton-proton collisions at √s =8 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-086. Geneva: CERN, June 2017. url: https://cds.cern.ch/record/2268389. [36] M. Aaboud et al. “Measurement of the tt¯production cross section in the τ + jets final state in pp collisions at √s = 8 TeV using the ATLAS detector”. In: Phys. Rev. D 95.CERN-EP-2016-288 (Feb. 2017), 072003. 37 p. url: https://cds.cern. ch/record/2253729. [37] M. Aaboud et al. “Measurement of the W +W − production cross section in pp collisions at a centre-of-mass energy of √s = 13 TeV with the ATLAS experiment. Measurement of the W +W − production cross section in pp collisions at a centre- of-mass energy of √s = 13 TeV with the ATLAS experiment”. In: Phys. Lett. B 773.CERN-EP-2016-267 (Feb. 2017), 354–374. 21 p. url: https://cds.cern.ch/ record/2252574. [38] M. Aaboud et al. Measurement of the W -boson mass in pp collisions at √s =7 TeV with the ATLAS detector. Tech. rep. CERN-EP-2016-305. Geneva: CERN, Jan. 2017. url: https://cds.cern.ch/record/2242923. [39] M. Aaboud et al. “Measurement of the cross section for inclusive isolated-photon production in pp collisions at √s = 13 TeV using the ATLAS detector”. In: Phys. Lett. B 770.CERN-EP-2016-291 (Jan. 2017), 473–793. 21 p. url: https://cds. cern.ch/record/2242863. [40] M. Aaboud et al. “Measurement of the cross section for isolated-photon plus jet production in pp collisions at √s = 13 TeV using the ATLAS detector”. In: Phys. Lett. B 780.CERN-EP-2017-265 (Dec. 2017), 578–602. 25 p. url: https : //cds.cern.ch/record/2299162. [41] M. Aaboud et al. Measurement of the cross-section for electroweak production of dijets in association with a Z boson in pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. arXiv:1709.10264. Geneva: CERN, Sept. 2017. url: https://cds.cern.ch/record/2286337. [42] M. Aaboud et al. Measurement of the exclusive γγ µ+µ− process in proton–proton → collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-151. Geneva: CERN, Aug. 2017. url: https://cds.cern.ch/record/2278608. [43] M. Aaboud et al. “Measurement of the inclusive jet cross-sections in proton– proton collisions at √s =8 TeV with the ATLAS detector. Measurement of the inclusive jet cross-sections in proton–proton collisions at √s =8 TeV with the ATLAS detector”. In: JHEP 09.CERN-EP-2017-043 (June 2017), 020. 54 p. url: https://cds.cern.ch/record/2268420. PUBLICATIONS 191

[44] M. Aaboud et al. “Measurements of electroweak Wjj production and constraints on anomalous gauge couplings with the ATLAS detector. Measurements of elec- troweak Wjj production and constraints on anomalous gauge couplings with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2017-008. 7 (Mar. 2017), 474. 103 p. url: https://cds.cern.ch/record/2255384. [45] M. Aaboud et al. “Measurements of integrated and differential cross sections for isolated photon pair production in pp collisions at √s =8 TeV with the ATLAS detector”. In: Phys. Rev. D 95.CERN-EP-2017-040 (Apr. 2017), 112005. 27 p. url: https://cds.cern.ch/record/2259444. [46] M. Aaboud et al. “Measurements of the production cross section of a Z boson in association with jets in pp collisions at √s = 13 TeV with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2016-297. 6 (Feb. 2017), 361. 47 p. url: https: //cds.cern.ch/record/2253057. [47] M. Aaboud et al. Measurements of top-quark pair differential cross-sections in the lepton+jets channel in pp collisions at √s=13 TeV using the ATLAS detector. Tech. rep. CERN-EP-2017-058. Geneva: CERN, Aug. 2017. url: https://cds.cern.ch/ record/2276466. [48] M. Aaboud et al. Performance of the ATLAS Track Reconstruction Algorithms in Dense Environments in LHC run 2. Tech. rep. CERN-EP-2017-045. Geneva: CERN, Apr. 2017. url: https://cds.cern.ch/record/2261156. [49] M. Aaboud et al. “Performance of the ATLAS Transition Radiation Tracker in Run 1 of the LHC: tracker properties”. In: JINST 12.CERN-EP-2016-311 (Feb. 2017), P05002. 45 p. url: https://cds.cern.ch/record/2253059. [50] M. Aaboud et al. “Probing the Wtb vertex structure in t-channel single-top- quark production and decay in pp collisions at √s = 8 TeV with the ATLAS detector”. In: JHEP 04.CERN-EP-2017-011 (Feb. 2017), 124. 48 p. url: https : //cds.cern.ch/record/2253372. [51] M. Aaboud et al. Search for a new heavy gauge boson resonance decaying into a lepton and missing transverse momentum in 36 fb−1 of pp collisions at √s = 13 TeV with the ATLAS experiment. Tech. rep. CERN-EP-2017-082. Geneva: CERN, June 2017. url: https://cds.cern.ch/record/2269048. [52] M. Aaboud et al. Search for a scalar partner of the top quark in the jets plus missing transverse momentum final state at √s=13 TeV with the ATLAS detector. Search for a scalar partner of the top quark in the jets plus missing transverse momentum final state at √s=13 TeV with the ATLAS detector. Tech. rep. CERN-PH-2017-162. Sept. 2017. url: https://cds.cern.ch/record/2285563. [53] M. Aaboud et al. Search for additional heavy neutral Higgs and gauge bosons in the ditau final state produced in 36− fb 1 of pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-199. Geneva: CERN, Sept. 2017. url: https://cds.cern.ch/record/2285183. [54] M. Aaboud et al. Search for an invisibly decaying Higgs boson or dark matter candidates produced in association with a Z boson in pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-166. Geneva: CERN, Aug. 2017. url: https://cds.cern.ch/record/2281451. 192 PUBLICATIONS

[55] M. Aaboud et al. “Search for dark matter at √s = 13 TeV in final states containing an energetic photon and large missing transverse momentum with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2017-044. 6 (Apr. 2017), 393. 45 p. url: https://cds.cern.ch/record/2259751. [56] M. Aaboud et al. Search for dark matter in association with a Higgs boson decaying to two photons at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-079. Geneva: CERN, June 2017. url: https://cds.cern.ch/record/2268735. [57] M. Aaboud et al. Search for diboson resonances with boson-tagged jets in pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-147. Geneva: CERN, Aug. 2017. url: https://cds.cern.ch/record/2278615. [58] M. Aaboud et al. “Search for direct top squark pair production in events with a Higgs or Z boson, and missing transverse momentum in √s = 13 TeV pp collisions with the ATLAS detector”. In: JHEP 08.CERN-EP-2017-106 (June 2017), 006. 29 p. url: https://cds.cern.ch/record/2268747. [59] M. Aaboud et al. Search for direct top squark pair production in final states with two leptons in √s = 13 TeV pp collisions with the ATLAS detector. Tech. rep. CERN- EP-2017-150. Geneva: CERN, Aug. 2017. url: https://cds.cern.ch/record/ 2278172. [60] M. Aaboud et al. Search for heavy Higgs bosons A/H decaying to a top quark pair in pp collisions at √s =8 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-134. Geneva: CERN, July 2017. url: https://cds.cern.ch/record/2275054. [61] M. Aaboud et al. Search for heavy resonances decaying to a W or Z boson and a Higgs boson in the qq¯(′)b¯b final state in pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-111. Geneva: CERN, July 2017. url: https://cds.cern.ch/record/2275252. [62] M. Aaboud et al. Search for new high-mass phenomena in the dilepton final state using 36.1 fb−1 of proton-proton collision data at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-119. July 2017. url: https://cds.cern.ch/ record/2273892. [63] M. Aaboud et al. “Search for new phenomena in a lepton plus high jet multiplicity final state with the ATLAS experiment using √s = 13 TeV proton-proton collision data”. In: JHEP 1709.CERN-EP-2017-053 (Apr. 2017), 088. 51 p. url: https://cds. cern.ch/record/2261361. [64] M. Aaboud et al. “Search for new phenomena in dijet events using 37 fb−1 of pp collision data collected at √s =13 TeV with the ATLAS detector. Search for new phenomena in dijet events using 37 fb−1 of pp collision data collected at √s =13 TeV with the ATLAS detector”. In: Phys. Rev. D 96.CERN-EP-2017-042 (Mar. 2017), 052004. 26 p. url: https://cds.cern.ch/record/2257101. [65] M. Aaboud et al. Search for new phenomena in high-mass diphoton final states using 37 fb−1 of proton–proton collisions collected at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-132. Geneva: CERN, July 2017. url: https: //cds.cern.ch/record/2274538. PUBLICATIONS 193

[66] M. Aaboud et al. Search for new phenomena in high-mass final states with a photon and a jet from pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. arXiv:1709.10440. Geneva: CERN, Sept. 2017. url: https : / / cds . cern . ch / record/2286380. [67] M. Aaboud et al. Search for new phenomena with large jet multiplicities and missing transverse momentum using large-radius jets and flavour-tagging at ATLAS in 13 TeV pp collisions. Tech. rep. CERN-EP-2017-138. Geneva: CERN, Aug. 2017. url: https://cds.cern.ch/record/2277363. [68] M. Aaboud et al. Search for pair production of heavy vector-like quarks decaying to high-pmathrmT W bosons and b quarks in the lepton-plus-jets final state in pp collisions at √s=13 TeV with the ATLAS detector. Search for pair production of heavy vector-like quarks decaying to high-pmathrmT W bosons and b quarks in the lepton- plus-jets final state in pp collisions at √s=13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-094. July 2017. url: https://cds.cern.ch/record/2274216. [69] M. Aaboud et al. “Search for pair production of vector-like top quarks in events with one lepton, jets, and missing transverse momentum in √s = 13 TeV pp collisions with the ATLAS detector”. In: JHEP 08.CERN-EP-2017-075 (May 2017), 052. 39 p. url: https://cds.cern.ch/record/2266574. [70] M. Aaboud et al. Search for squarks and gluinos in events with an isolated lepton, jets and missing transverse momentum at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-140. Geneva: CERN, Aug. 2017. url: https://cds. cern.ch/record/2280982. [71] M. Aaboud et al. Search for supersymmetry in events with b-tagged jets and missing transverse momentum in pp collisions at √s = 13 TeV with the ATLAS detec- tor. Search for supersymmetry in events with b-tagged jets and missing transverse momentum in pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-154. Geneva: CERN, Sept. 2017. url: https://cds.cern.ch/ record/2283170. [72] M. Aaboud et al. “Search for supersymmetry in final states with two same-sign or three leptons and jets using 36 fb−1 of √s = 13 TeV pp collision data with the ATLAS detector”. In: JHEP 1709.CERN-EP-2017-108 (June 2017), 084. 43 p. url: https://cds.cern.ch/record/2268733. [73] M. Aaboud et al. “Search for the dimuon decay of the Higgs boson in pp collisions at √s = 13 TeV with the ATLAS detector”. In: Phys. Rev. Lett. 119.CERN-EP-2017-078 (May 2017), 051802. 20 p. url: https://cds.cern.ch/record/2263640. [74] M. Aaboud et al. Search for the direct production of charginos and neutralinos in √s = 13 TeV pp collisions with the ATLAS detector. Search for the direct production of charginos and neutralinos in √s = 13 TeV pp collisions with the ATLAS detector. Tech. rep. CERN-PH-2017-173. Aug. 2017. url: https://cds.cern.ch/record/ 2281439. [75] M. Aaboud et al. “Search for the Standard Model Higgs boson produced in associ- ation with top quarks and decaying into a b¯b pair in pp collisions at √s = 13 TeV with the ATLAS detector. Search for the Standard Model Higgs boson produced in association with top quarks and decaying into a b¯b pair in pp collisions at √s = 13 TeV with the ATLAS detector”. In: Phys. Rev. D CERN-EP-2017-291 (Dec. 2017), 44 p. url: https://cds.cern.ch/record/2299049. 194 PUBLICATIONS

[76] M. Aaboud et al. Search for top quark decays t qH, with H γγ, in √s = 13 → → TeV pp collisions using the ATLAS detector. Tech. rep. CERN-EP-2017-118. Geneva: CERN, July 2017. url: https://cds.cern.ch/record/2273256. [77] M. Aaboud et al. Searches for heavy ZZ and ZW resonances in the ℓℓqq and ννqq final states in pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-146. Geneva: CERN, Aug. 2017. url: https://cds.cern.ch/ record/2281501. [78] M. Aaboud et al. Searches for the Zγ decay mode of the Higgs boson and for new high-mass resonances in pp collisions at √s = 13 TeV with the ATLAS detector. Tech. rep. CERN-EP-2017-095. Geneva: CERN, July 2017. url: https://cds. cern.ch/record/2276364. [79] M. Aaboud et al. SSearch for Dark Matter Produced in Association with a Higgs Boson Decaying to b¯b using 36 fb−1 of pp collisions at √s = 13 TeV with the ATLAS Detector. Tech. rep. CERN-EP-2017-117. Geneva: CERN, July 2017. url: https://cds.cern.ch/record/2273199. [80] M. Aaboud et al. “Studies of Zγ production in association with a high-mass dijet system in pp collisions at √s = 8 TeV with the ATLAS detector. Studies of Zγ production in association with a high-mass dijet system in pp collisions at √s = 8 TeV with the ATLAS detector”. In: JHEP 07.CERN-EP-2017-046 (May 2017), 107. 50 p. url: https://cds.cern.ch/record/2261980. [81] M. Aaboud et al. “Study of WWγ and WZγ production in pp collisions at √s =8 TeV and search for anomalous quartic gauge couplings with the ATLAS experi- ment”. In: Eur. Phys. J. C 77.CERN-EP-2017-096 (July 2017), 646. 44 p. url: https: //cds.cern.ch/record/2274915. [82] M. Aaboud et al. Study of ordered hadron chains with the ATLAS detector. Tech. rep. CERN-EP-2017-092. Geneva: CERN, Sept. 2017. url: https://cds.cern.ch/ record/2285180. [83] M. Aaboud et al. Study of the material of the ATLAS inner detector for Run 2 of the LHC. Tech. rep. CERN-EP-2017-081. July 2017. url: https://cds.cern.ch/ record/2273894. [84] M. Aaboud et al. “Top-quark mass measurement in the all-hadronic tt¯ decay channel at √s = 8 TeV with the ATLAS detector. Top-quark mass measurement in the all-hadronic tt¯decay channel at √s = 8 TeV with the ATLAS detector”. In: JHEP 09.CERN-EP-2016-264 (Feb. 2017), 118. 39 p. url: https://cds.cern.ch/ record/2253316. [85] W. Adam et al. “Characterisation of irradiated thin silicon sensors for the CMS phase II pixel upgrade”. In: Eur. Phys. J. C 77.8 (2017), 567. 13 p. url: https: //cds.cern.ch/record/2281434. [86] W. Adam et al. “P-Type Silicon Strip Sensors for the new CMS Tracker at HL- LHC”. In: JINST 12.06 (2017), P06018. 27 p. url: https://cds.cern.ch/record/ 2275928. [87] M. Aaboud et al. “A measurement of material in the ATLAS tracker using sec- ondary hadronic interactions in 7 TeV pp collisions”. In: JINST 11.CERN-EP- 2016-137. CERN-EP-2016-137. 11 (Sept. 2016), P11020. 45 p. url: https://cds. cern.ch/record/2215485. PUBLICATIONS 195

[88] M. Aaboud et al. “A measurement of the calorimeter response to single hadrons and determination of the jet energy scale uncertainty using LHC Run-1 pp collision data with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2016-149. 1 (July 2016), 26. 47 p. url: https://cds.cern.ch/record/2202682. [89] M. Aaboud et al. “Charged-particle distributions at low transverse momentum in √s = 13 TeV pp interactions measured with the ATLAS detector at the LHC”. In: Eur. Phys. J. C 76.arXiv:1606.01133. CERN-EP-2016-099 (June 2016), 502. 33 p. url: https://cds.cern.ch/record/2157893. [90] M. Aaboud et al. “Dark matter interpretations of ATLAS searches for the elec- troweak production of supersymmetric particles in √s =8 TeV proton-proton collisions”. In: JHEP 09.CERN-EP-2016-165. CERN-EP-2016-165 (Aug. 2016), 175. 25 p. url: https://cds.cern.ch/record/2203733. [91] M. Aaboud et al. “Electron efficiency measurements with the ATLAS detector using 2012 LHC proton-proton collision data”. In: Eur. Phys. J. C 77.CERN-EP- 2016-262. 3 (Dec. 2016), 195. 64 p. url: https://cds.cern.ch/record/2237544.

[92] M. Aaboud et al. “High-ET isolated-photon plus jets production in pp collisions at √s = 8 TeV with the ATLAS detector”. In: Nucl. Phys. B 918.CERN-EP-2016-252 (Nov. 2016), 257–316. 59 p. url: https://cds.cern.ch/record/2233726. [93] M. Aaboud et al. “Measurement of W +W − production in association with one jet in proton–proton collisions at √s =8 TeV with the ATLAS detector”. In: Phys. Lett. B 763.arXiv:1608.03086. CERN-EP-2016-186 (Aug. 2016), 114–133. 20 p. url: https://cds.cern.ch/record/2206433. [94] M. Aaboud et al. “Measurement of W ±W ± vector-boson scattering and limits on anomalous quartic gauge couplings with the ATLAS detector”. In: Phys. Rev. B D 96.CERN-EP-2016-167 (Nov. 2016), 012007. 34 p. url: https://cds.cern.ch/ record/2230988. [95] M. Aaboud et al. “Measurement of W boson angular distributions in events with high transverse momentum jets at √s = 8 TeV using the ATLAS detector”. In: Phys. Lett. B 765.CERN-EP-2016-182 (Sept. 2016), 132–153. 38 p. url: https : //cds.cern.ch/record/2217557. [96] M. Aaboud et al. “Measurement of exclusive γγ W +W − production and search → for exclusive Higgs boson production in pp collisions at √s =8 TeV using the ATLAS detector”. In: Phys. Rev. D 94.CERN-EP-2016-123. CERN-EP-2016-123. 3 (July 2016), 032011. 31 p. url: https://cds.cern.ch/record/2198983. [97] M. Aaboud et al. “Measurement of forward-backward multiplicity correlations in lead-lead, proton-lead and proton-proton collisions with the ATLAS detector”. In: Phys. Rev. C 95.CERN-EP-2016-124. CERN-EP-2016-124 (June 2016), 064914. 30 p. url: https://cds.cern.ch/record/2194713. [98] M. Aaboud et al. “Measurement of jet activity in top quark events using the eµ final state with two b-tagged jets in pp collisions at √s =8 TeV with the ATLAS detector”. In: JHEP 09.arXiv:1606.09490. CERN-EP-2016-122 (June 2016), 074. 59 p. url: https://cds.cern.ch/record/2195357. 196 PUBLICATIONS

[99] M. Aaboud et al. “Measurement of jet activity produced in top-quark events with an electron, a muon and two b-tagged jets in the final state in pp collisions at √s = 13 TeV with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2016-218. 4 (Oct. 2016), 220. 57 p. url: https://cds.cern.ch/record/2228300. [100] M. Aaboud et al. “Measurement of the W ±Z boson pair-production cross section in pp collisions at √s = 13 TeV with the ATLAS Detector”. In: Phys. Lett. B 762.arXiv:1606.04017. CERN-EP-2016-108 (June 2016), 1–22. 21 p. url: https: //cds.cern.ch/record/2160353. [101] M. Aaboud et al. “Measurement of the W boson polarisation in tt¯events from pp collisions at √s = 8 TeV in the lepton+jets channel with ATLAS”. In: Eur. Phys. J. C 77.CERN-EP-2016-219 (Dec. 2016), 264. 42 p. url: https://cds.cern.ch/ record/2238498. [102] M. Aaboud et al. “Measurement of the ZZ production cross section in proton– proton collisions at √s = 8TeV using the ZZ ℓ−ℓ+ℓ′−ℓ′+ and ZZ ℓ−ℓ+νν¯ → → decay channels with the ATLAS detector”. In: JHEP 01.CERN-EP-2016-194 (Oct. 2016), 099. 55 p. url: https://cds.cern.ch/record/2227023. [103] M. Aaboud et al. Measurement of the cross-section for producing a W boson in association with a single top quark in pp collisions at √s = 13TeV with ATLAS. Tech. rep. CERN-EP-2016-238. Geneva: CERN, Dec. 2016. url: https://cds. cern.ch/record/2239810. [104] M. Aaboud et al. “Measurement of the inclusive cross-sections of single top-quark and top-antiquark t-channel production in pp collisions at √s = 13 TeV with the ATLAS detector”. In: JHEP 04.CERN-EP-2016-197. CERN-EP-2016-197 (Sept. 2016), 086. 24 p. url: https://cds.cern.ch/record/2215404. [105] M. Aaboud et al. “Measurement of the Inelastic Proton-Proton Cross Section at √s = 13 TeV with the ATLAS Detector at the LHC”. In: Phys. Rev. Lett. 117.CERN- EP-2016-140 (June 2016), 182002. 12 p. url: https://cds.cern.ch/record/ 2159578. [106] M. Aaboud et al. “Measurement of the prompt J/ψ pair production cross-section in pp collisions at √s = 8 TeV with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2016-211. 2 (Dec. 2016), 76. 53 p. url: https://cds.cern.ch/ record/2238505. [107] M. Aaboud et al. “Measurement of the top quark mass in the tt¯ dilepton → channel from √s =8 TeV ATLAS data”. In: Phys. Lett. B 761.CERN-EP-2016-114 (June 2016), 350. 18 p. url: https://cds.cern.ch/record/2158842. [108] M. Aaboud et al. “Measurement of the total cross section from elastic scattering in pp collisions at √s =8 TeV with the ATLAS detector”. In: Phys. Lett. B 761.CERN- EP-2016-158. CERN-EP-2016-158 (July 2016), 158–178. 17 p. url: https://cds. cern.ch/record/2201004. [109] M. Aaboud et al. “Measurement of top quark pair differential cross-sections in the dilepton channel in pp collisions at √s = 7 and 8 TeV with ATLAS”. In: Phys. Rev. D 94.CERN-EP-2016-144 (July 2016), 092003. 33 p. url: https://cds.cern. ch/record/2200174. PUBLICATIONS 197

[110] M. Aaboud et al. “Measurements of ψ(2S) and X(3872) J/ψπ+π− production → in pp collisions at √s =8 TeV with the ATLAS detector”. In: JHEP 01.CERN-EP- 2016-193. arXiv:1610.09303 (Oct. 2016), 117. 42 p. url: https://cds.cern.ch/ record/2228304. [111] M. Aaboud et al. “Measurements of charge and CP asymmetries in b-hadron decays using top-quark events collected by the ATLAS detector in pp collisions at √s = 8 TeV”. In: JHEP 02.CERN-EP-2016-221 (Oct. 2016), 071. 47 p. url: https://cds.cern.ch/record/2227245. [112] M. Aaboud et al. “Measurements of long-range azimuthal anisotropies and as- sociated Fourier coefficients for pp collisions at √s = 5.02 and 13 TeV and p+Pb collisions at √sNN = 5.02 TeV with the ATLAS detector”. In: Phys. Rev. C 96.CERN-EP-2016-200. CERN-EP-2016-200. 2 (Sept. 2016), 024908. 37 p. url: https://cds.cern.ch/record/2217088. [113] M. Aaboud et al. “Measurements of top quark spin observables in tt¯events using dilepton final states in √s =8 TeV pp collisions with the ATLAS detector”. In: JHEP 03.CERN-EP-2016-263 (Dec. 2016), 113. 50 p. url: https://cds.cern.ch/ record/2239808. [114] M. Aaboud et al. “Measurements of top-quark pair differential cross-sections in the eµ channel in pp collisions at √s = 13 TeV using the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2016-220. 5 (Dec. 2016), 292. 43 p. url: https: //cds.cern.ch/record/2239330. [115] M. Aaboud et al. “Measurements of top-quark pair to Z-boson cross-section ratios at √s = 13, 8, 7TeV with the ATLAS detector”. In: JHEP 1702.CERN-EP-2016-271 (Dec. 2016), 117. 55 p. url: https://cds.cern.ch/record/2238696. [116] M. Aaboud et al. “Performance of the ATLAS Trigger System in 2015”. In: Eur. Phys. J. C 77.CERN-EP-2016-241. 5 (Nov. 2016), 317. 76 p. url: https://cds. cern.ch/record/2235584. [117] M. Aaboud et al. “Precision measurement and interpretation of inclusive W +, W − and Z/γ∗ production cross sections with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2016-272. 6 (Dec. 2016), 367. 98 p. url: https://cds.cern.ch/ record/2238565. [118] M. Aaboud et al. “Reconstruction of primary vertices at the ATLAS experiment in Run 1 proton–proton collisions at the LHC”. In: Eur. Phys. J. C 77.CERN-EP- 2016-150. CERN-EP-2016-150. 5 (Nov. 2016), 332. 52 p. url: https://cds.cern. ch/record/2235651. [119] M. Aaboud et al. “Search for anomalous electroweak production of WW /WZ in association with a high-mass dijet system in pp collisions at √s = 8 TeV with the ATLAS detector”. In: Phys. Rev. D 95.CERN-EP-2016-171. CERN-EP-2016-171 (Sept. 2016), 032001. 37 p. url: https://cds.cern.ch/record/2216191. [120] M. Aaboud et al. “Search for bottom squark pair production in proton–proton collisions at √s =13 TeV with the ATLAS detector”. In: Eur. Phys. J. C 76.CERN- EP-2016-138. CERN-EP-2016-138 (June 2016), 547. 37 p. url: https://cds.cern. ch/record/2195073. 198 PUBLICATIONS

[121] M. Aaboud et al. “Search for dark matter in association with a Higgs boson decaying to b-quarks in pp collisions at √s = 13 TeV with the ATLAS detector”. In: Phys. Lett. B 765.CERN-EP-2016-181 (Sept. 2016), 11. 36 p. url: https://cds. cern.ch/record/2215917. [122] M. Aaboud et al. “Search for dark matter produced in association with a hadron- ically decaying vector boson in pp collisions at (s)=13 TeV with the ATLAS detector”. In: Phys. Lett. B 763.CERN-EP-2016-180 (Aug. 2016), 251. 10 p. url: p https://cds.cern.ch/record/2206052. [123] M. Aaboud et al. “Search for heavy long-lived charged R-hadrons with the ATLAS detector in 3.2 fb−1 of proton–proton collision data at √s = 13 TeV”. In: Phys. Lett. B 760.CERN-EP-2016-131. CERN-EP-2016-131 (June 2016), 647–665. 15 p. url: https://cds.cern.ch/record/2161142. [124] M. Aaboud et al. “Search for heavy resonances decaying to a Z boson and a photon in pp collisions at √s = 13 TeV with the ATLAS detector”. In: Phys. Lett. B 764.CERN-EP-2016-163. arXiv:1607.06363 (July 2016), 11. 18 p. url: https: //cds.cern.ch/record/2200706. [125] M. Aaboud et al. “Search for Higgs and Z Boson Decays to φγ with the ATLAS Detector”. In: Phys. Rev. Lett. 117.CERN-EP-2016-130 (July 2016), 111802. 9 p. url: https://cds.cern.ch/record/2198619. [126] M. Aaboud et al. “Search for Minimal Supersymmetric Standard Model Higgs bosons H/A and for a Z′ boson in the ττ final state produced in pp collisions at √s = 13 TeV with the ATLAS Detector”. In: Eur. Phys. J. C 76.CERN-EP-2016-164 (Aug. 2016), 585. 28 p. url: https://cds.cern.ch/record/2203593. [127] M. Aaboud et al. “Search for new phenomena in different-flavour high-mass dilepton final states in pp collisions at √s = 13 TeV with the ATLAS detector”. In: Eur. Phys. J. C 76.CERN-EP-2016-168. CERN-EP-2016-168 (July 2016), 541. 42 p. url: https://cds.cern.ch/record/2202137. [128] M. Aaboud et al. “Search for new phenomena in events containing a same-flavour opposite-sign dilepton pair, jets, and large missing transverse momentum in √s = 13 TeV pp collisions with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN- EP-2016-260. 3 (Nov. 2016), 144. 46 p. url: https://cds.cern.ch/record/ 2233837. [129] M. Aaboud et al. “Search for new resonances decaying to a W or Z boson and a Higgs boson in the ℓ+ℓ−b¯b, ℓνb¯b, and ννb¯ ¯b channels with pp collisions at √s = 13 TeV with the ATLAS detector”. In: Phys. Lett. B 765.CERN-EP-2016-148. arXiv:1607.05621 (July 2016), 32–52. 18 p. url: https://cds.cern.ch/record/ 2200145. [130] M. Aaboud et al. “Search for new resonances in events with one lepton and missing transverse momentum in pp collisions at √s = 13 TeV with the ATLAS detector”. In: Phys. Lett. B 762.CERN-EP-2016-143. CERN-EP-2016-143 (June 2016), 334–352. 14 p. url: https://cds.cern.ch/record/2160550. [131] M. Aaboud et al. “Search for resonances in diphoton events at √s=13 TeV with the ATLAS detector”. In: JHEP 09.CERN-EP-2016-120 (June 2016), 001. 32 p. url: https://cds.cern.ch/record/2160228. PUBLICATIONS 199

[132] M. Aaboud et al. “Search for squarks and gluinos in events with hadronically decaying tau leptons, jets and missing transverse momentum in proton-proton collisions at √s = 13 TeV recorded with the ATLAS detector”. In: Eur. Phys. J. C 76.CERN-EP-2016-145. 12 (July 2016), 683. 30 p. url: https://cds.cern.ch/ record/2200242. [133] M. Aaboud et al. “Search for supersymmetry in a final state containing two photons and missing transverse momentum in √s = 13 TeV pp collisions at the LHC using the ATLAS detector”. In: Eur. Phys. J. C 76.arXiv:1606.09150 (June 2016), 517. 34 p. url: https://cds.cern.ch/record/2194987. [134] M. Aaboud et al. “Search for TeV-scale gravity signatures in high-mass final states with leptons and jets with the ATLAS detector at √s = 13 TeV”. In: Phys. Lett. B 760.CERN-EP-2016-109 (June 2016), 520–537. 12 p. url: https : //cds.cern.ch/record/2158821. [135] M. Aaboud et al. “Search for the Higgs boson produced in association with a W boson and decaying to four b-quarks via two spin-zero particles in pp collisions at 13 TeV with the ATLAS detector”. In: Eur. Phys. J. C 76.CERN-EP-2016-135. CERN-EP-2016-135 (June 2016), 605. 41 p. url: https://cds.cern.ch/record/ 2194717. [136] M. Aaboud et al. “Search for triboson W ±W ±W ∓ production in pp collisions at √s =8 TeV with the ATLAS detector”. In: Eur. Phys. J. C 77.CERN-EP-2016-172. CERN-EP-2016-172. 3 (Oct. 2016), 141. 39 p. url: https://cds.cern.ch/record/ 2225426. [137] M. Aaboud et al. “Searches for heavy diboson resonances in pp collisions at √s = 13 TeV with the ATLAS detector”. In: JHEP 09.CERN-EP-2016-106. CERN- EP-2016-106 (June 2016), 173. 46 p. url: https : / / cds . cern . ch / record / 2161140. [138] M. Aaboud et al. “Study of hard double-parton scattering in four-jet events in pp collisions at √s = 7 TeV with the ATLAS experiment”. In: JHEP 11.CERN-EP- 2016-183 (Aug. 2016), 110. 32 p. url: https://cds.cern.ch/record/2205694. [139] A. Annovi, F. others Bertolucci, N. Biesuz, D. Calabro, G. Calderini, et al. “Highly parallelized pattern matching execution for the ATLAS experiment”. In: (2016), 7581789. 3 p. url: https://cds.cern.ch/record/2263725. [140] M. Dragicevic, M. Friedl, J. Hrubec, H. Steininger, A. Gädda, et al. Test Beam Performance Measurements for the Phase I Upgrade of the CMS Pixel Detector. Tech. rep. CMS-NOTE-2017-002. CERN-CMS-NOTE-2017-002. Geneva: CERN, July 2016. url: https://cds.cern.ch/record/2262932. [141] G. Aad et al. Technical Design Report for the Phase-I Upgrade of the ATLAS TDAQ Sys- tem. Tech. rep. CERN-LHCC-2013-018. ATLAS-TDR-023. Sept. 2013. url: https: //cds.cern.ch/record/1602235. 200 PUBLICATIONS

Publications as member of the Tracker Group of the CMS Collaboration

[1] W. Adam, T. Bergauer, E. Brondolin, M. Dragicevic, M. Friedl, et al. “Characterisation of irradiated thin silicon sensors for the CMS phase II pixel upgrade”. In: The European Physical Journal C 77.8 (Aug. 2017). doi: 10.1140/epjc/s10052-017-5115-z.
[2] W. Adam, T. Bergauer, E. Brondolin, M. Dragicevic, M. Friedl, et al. “P-Type Silicon Strip Sensors for the new CMS Tracker at HL-LHC”. In: Journal of Instrumentation 12.06 (June 2017), P06018. doi: 10.1088/1748-0221/12/06/p06018.
[3] M. Dragicevic, M. Friedl, J. Hrubec, H. Steininger, A. Gädda, et al. “Test beam performance measurements for the Phase I upgrade of the CMS pixel detector”. In: Journal of Instrumentation 12.05 (May 2017), P05022. doi: 10.1088/1748-0221/12/05/p05022.