UNIVERSITY OF PATRAS
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

Diploma Thesis of the student of the Department of Electrical and Computer Engineering of the School of Engineering of the University of Patras

Konstantinos Asteriou (son of Nikolaos)

Student Registration Number: 228281

Topic: Implementation and Optimization of the OpenCL Driver for the NEMA GPUs

Supervisor: Michalis Birbas, Assistant Professor, University of Patras

Diploma Thesis Number: 228281/2019

Patras, December 2019

CERTIFICATION

It is certified that the Diploma Thesis entitled: Implementation and Optimization of the OpenCL Driver for the NEMA GPUs

of the student of the Department of Electrical and Computer Engineering

Konstantinos Asteriou (son of Nikolaos), Student Registration Number: 228281

was presented publicly and examined at the Department of Electrical and Computer Engineering on / /2019

The Supervisor                The Division Director

Michalis Birbas, Assistant Professor                Vasileios Paliouras, Professor

Diploma Thesis Number: 228281/2019

Topic: Implementation and Optimization of the OpenCL Driver for the NEMA GPUs

Student: Konstantinos Asteriou
Supervisor: Michalis Birbas

Summary

I) INTRODUCTION

In the past, all software programs were written for serial processing. To solve a problem, an algorithm was constructed and implemented as a serial sequence of instructions. These instructions were executed on a computer with a single processor. Only one instruction executed at a time and, once its execution finished, the next one was executed in turn. The execution time of any program was proportional to the number of instructions, the clock period of the computer and the number of cycles required per instruction. Over the years, computers became more efficient in terms of program execution time, as engineers managed to improve two factors. First, the clock frequency of computers increased significantly; second, following Moore's law, the number of transistors on a given silicon area doubled roughly every 1.5 years. Characteristically, the famous 8086 processor had 29,000 transistors and a 5 MHz clock, whereas a modern Core i7 has over 1 billion transistors and a 4 GHz clock. These improvements, however, dramatically increased the power consumption of processors, which is given by the formula P = C × V² × F, where C is the total capacitance whose input switches per clock cycle, V is the voltage and F is the clock frequency. The engineers' answer to the ever-increasing power consumption was to build multi-core processors with power-efficient cores. A core is the processing unit of the processor, and all cores can access the same memory location simultaneously. To exploit the multiple cores of the processor, programs that execute in parallel were also created. Parallel programs use multiple processing units simultaneously to solve a problem. This is achieved by "breaking" the problem into smaller pieces, each of which can be executed independently. The processing units that can be used vary, ranging from a single computer with multiple cores and many interconnected computers to specialized hardware. To exploit the available parallelism of the hardware to the fullest and minimize program execution time as much as possible, the programmer must restructure and appropriately parallelize the code.

II) OpenCL - POCL

OpenCL is an open standard for parallel programming that includes a language, an API, libraries and a runtime, thus enabling the development of portable yet efficient programs. Using OpenCL, a programmer can write general-purpose programs that execute on all conformant devices without having to change anything in the code when switching devices. The portability of OpenCL programs across a wide range of different heterogeneous platforms is achieved by describing the kernel code as strings, which are then built by the runtime API for the selected device on which we want it to run. While the CUDA framework is supported only by NVIDIA graphics cards, OpenCL applications can run on a range of different devices from different vendors. OpenCL-conformant implementations are available from companies such as Altera, AMD, Xilinx, ARM, Intel and others. To describe the core ideas behind OpenCL, we will use the following hierarchy of models:

• Platform Model: The OpenCL platform model consists of a host device to which one or more OpenCL devices are connected. OpenCL devices are divided into one or more Compute Units, which are further divided into one or more Processing Elements. All processing elements execute OpenCL code; that is, all the computations described in the OpenCL code happen there. During the execution of an OpenCL program, the host submits commands to orchestrate the execution of the OpenCL code on the processing elements of the selected device. The processing elements can execute a single stream of instructions as SIMD or SPMD units. A SIMD unit is defined as a class of parallel computers whose processing elements all execute the same code, in our case OpenCL code, each with its own data and a shared program counter. Conversely, an SPMD unit is defined as the programming model in which multiple processing elements execute the same code, each with its own data and its own program counter.

Figure 1: Platform Model

• Execution Model: The execution of an OpenCL program is split into two parts: one part executes on the host, and the other, namely the OpenCL code, executes on one or more selected devices connected to the host. The host defines a context for the execution of the OpenCL code. The context contains the following resources: the devices that can be used by the host; the kernel, i.e. the OpenCL code that will run on one of the aforementioned devices; the Program Object, i.e. the executable that holds the representation of the OpenCL code; and finally the Memory Objects, a collection of objects that are visible to the host and hold values to be used by the OpenCL code. The context is defined by the host and manipulated by it using functions of the OpenCL API. The host defines a data structure called a command queue to coordinate the execution of the OpenCL code, creating and placing commands on this structure, which will be executed on the device for which the context was previously created. These commands can be kernel-execution commands, commands concerning memory objects, or synchronization commands. When a kernel is submitted for execution by the host, an index space is defined. The index space defined in OpenCL is N-dimensional, where N can be 1, 2 or 3, and is called an NDRange. An NDRange is defined as an integer array of size N, and specifies the extent of the index space in each dimension, starting at an offset F. An instance of the kernel executes for each point in the index space.

Figure 2: Execution Model

Each kernel instance is called a work-item and is identified by its position in the index space, which provides it with a unique global ID. All work-items share the same code, but the instructions each one executes and the data it uses may differ per work-item. Work-items are organized into work-groups. Work-groups are assigned a unique work-group ID with the same dimensionality as the index space. Work-items within a work-group are assigned a unique local ID, so that each work-item can be identified either through its global ID or through the combination of its work-group ID and local ID.

• Memory Model: OpenCL defines a memory model in which the work-items executing a kernel have access to four distinct memory regions. First there is Global Memory, which all work-items from all work-groups can read and write; work-items can read and write every element of a Memory Object. Next there is Constant Memory, which remains unchanged during the execution of the OpenCL code and can be written only before its execution. There is Local Memory, which is shared by all work-items within a work-group; each work-group has its own local memory, and work-items that do not belong to that work-group have no access to this memory region. Finally there is Private Memory: each work-item has its own private memory, which only that work-item can read and write.

• Programming Model: OpenCL defines two kinds of programming models, task parallel and data parallel, as well as hybrids of the two. The dominant model driving application design with OpenCL is data parallel, in which there is a one-to-one mapping between work-items and the elements of a memory object. OpenCL also defines a more relaxed version of this model in which this one-to-one mapping is not necessary.

There are two domains of synchronization in OpenCL. First, there is synchronization between work-items belonging to the same work-group, using the barrier instruction. All work-items must execute this instruction before any of them is allowed to continue executing the OpenCL code. Synchronization between work-items belonging to different work-groups is not possible. The second domain of synchronization in OpenCL is between commands submitted to a command queue in a context. Finally, we close the description of OpenCL with a brief reference to memory objects. There are two kinds of memory objects in OpenCL: buffer objects and image objects. In this thesis we dealt only with buffer objects. Buffer objects are one-dimensional collections of elements. These elements can be any scalar type (int, float, char), a vector or a struct. Image objects are used to store a texture, frame-buffer or image of two or three dimensions.
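As a minimal illustration of the model just described, consider the kernel sketched below: each work-item reads its global ID and processes one element of the buffer objects. The kernel is an illustrative example of ours, not part of the thesis code base.

    /* Illustrative OpenCL C kernel: one work-item per array element. */
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *c)
    {
        size_t i = get_global_id(0);  /* unique global ID of this work-item */
        c[i] = a[i] + b[i];           /* each work-item handles one element */
    }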
POCL is an open-source implementation of OpenCL, making it possible to run OpenCL applications on a range of different architectures. The POCL implementation is divided into parts that execute on the host device and parts whose behavior differs depending on the selected device on which the OpenCL code will run. This was one of the reasons we chose to use POCL, since many functions that run on the host and have no target-specific behavior are already implemented. The other advantage of POCL is that it uses the LLVM toolchain to build the OpenCL code. The NEMA GPUs have a backend for this toolchain, which significantly simplifies the process of producing an executable file. The LLVM toolchain consists of three parts. First is Clang, which reads the OpenCL code and produces a file in LLVM-IR form. Next is the Optimizer, which takes Clang's output as input and can perform various optimizations on the LLVM-IR file. Finally there is the LLVM backend, which uses the appropriate Instruction Set so that the final executable can run on the selected device.

III) DESCRIPTION OF TOOLS

Throughout this thesis, all OpenCL code ran on the Vertex Processor of the NEMA|S graphics card. NEMA|S is a multi-core, multi-threaded GPU of extremely high performance and very low power consumption that can be used both for graphics rendering and for general-purpose computing (GPGPU). In the graphics pipeline, the Vertex Processor is responsible for converting the three-dimensional coordinates of every object in the image we want to display into the corresponding two-dimensional coordinates of the display screen. The location of each object in an image is described by a data structure called a Vertex, which defines the object's position as a set of points in two- or three-dimensional space. Vertex data are stored as a contiguous block in memory called a Vertex buffer. The Vertex Processor of NEMA|S supports only one hardware thread, which hurt the performance of the implementation. The reasons we used only the Vertex Processor of NEMA|S, and not, for example, the Fragment Processor, which supports multiple hardware threads, despite the reduced performance, are several. First, the Fragment Processor is not yet fully implemented and functional. Moreover, debugging on a single-threaded processor is considerably easier than on a processor supporting multiple hardware threads. In addition, the Vertex Processor performs memory read and write operations directly, whereas the Fragment Processor uses a module called the Texture Map, designed for graphics, for these operations, which would introduce an extra degree of difficulty into our implementation. Finally, the processor can spawn a thread directly on the Vertex Processor, whereas the Fragment Processor is driven by the Rasterizer, which in turn is driven by the Vertex Processor, again adding extra difficulty. The objective of this implementation is not performance but to be as functional and correct as possible, since it constitutes the first step towards OpenCL support on the NEMA GPUs. To test our implementation we used the conformance tests provided by Khronos, the consortium that created OpenCL.
Successful completion of the full set of tests for a platform implies that both the driver and the hardware of the platform can support any application written in OpenCL.

IV) IMPLEMENTATION

The implementation of the OpenCL driver for the NEMA GPUs was divided into three stages: building POCL, generating an executable file for the NEMA|S from OpenCL code, and finally implementing all the target-specific functions of the OpenCL API. We first built POCL on a computer with an Intel(R) Core(TM) i3-6100 processor, and a series of simple programs were implemented to gain an understanding of OpenCL. We then moved to the Zynq-7000 SoC ZC706 platform, which has ARM processors and on which the Vertex Processor was loaded. There we built POCL again and ran the programs we had implemented, to verify that the process had completed successfully. The final step of this first stage involved adding the NemaGFX.so and ToolChainAPI.so libraries to the POCL build; these were necessary for completing the next steps. The goal of the next stage was to produce an executable ("binary") file from OpenCL code that can run on the Vertex Processor. This process was completed using the functions of the ToolChainAPI.so library, and all the code for producing the executable was contained in a single file, "nema-gen.cc". The final step was the implementation of the target-specific functions of the OpenCL API. This was completed using the NemaGFX.so library, and the full set of implemented functions was contained in a single file, nema.c.

V) RESULTS

The correctness of our implementation was established through the successful completion of a series of different categories of conformance tests. In all categories, the tests concerning images and samplers were disabled. The categories completed successfully are the following:

• basic
• buffers
• device-partition
• headers
• integer-ops
• multiple-device-context
• relationals
• select

Abstract

The purpose of this diploma thesis is the software implementation of the OpenCL driver for the NEMA GPUs. For the development of this software, POCL version 1.0 was used. POCL is an open source implementation of OpenCL that is platform independent, enabling OpenCL on a wide range of architectures. This implementation is divided into parts that are executed on the host (front-end) and parts that implement device-specific behaviour (back-end). All of the OpenCL code that had the NEMA|S GPU as the target device ran on the Vertex Processor of NEMA|S. Both hardware (Xilinx Zynq-7000 SoC ZC706) and software (libraries) tools from the company Think-Silicon were used throughout this work. The software was developed under the Linux environment, and the programming languages C and C++ were used. A number of tests called conformance tests, provided by Khronos, the consortium that created the OpenCL framework, were run in order to test the accuracy of our work and make optimizations when possible. The final software is the POCL back-end implementation for the NEMA|S GPU.

Acknowledgements

This thesis was carried out in collaboration with the Think-Silicon company. I want to thank the supervisor of this thesis, Professor Michalis Birbas, for our collaboration, as well as the CSO of Think-Silicon, Dr. Georgios Keramidas, for his support and advice throughout the course of this thesis. Furthermore, I would like to thank Mr. Yannis Economou and Mr. Nick Stavropoulos from the Think-Silicon team for all the knowledge and help that were always available.


Contents

1 Introduction
  1.1 OpenCL
  1.2 Thesis subject
  1.3 Structure of the text

2 Theoretical Background
  2.1 Parallel Computing
  2.2 OpenCL
    2.2.1 Introduction to OpenCL
    2.2.2 Platform Model
    2.2.3 Execution Model
    2.2.4 Memory Model
    2.2.5 Programming Model
    2.2.6 Synchronization
    2.2.7 Memory Objects
  2.3 POCL

3 Description of tools
  3.1 GPGPU
  3.2 NEMA|S GPU
  3.3 Vertex Processor
  3.4 CUDA
  3.5 NEMA|GFX

4 Conformance tests
  4.1 Conformance tests description
    4.1.1 Categories of Conformance tests

5 OpenCL implementation for the NEMA|S GPU
  5.1 Building POCL - Learning OpenCL
  5.2 Compiling OpenCL code for the NEMA|S GPU
  5.3 Implementation of the POCL backend

    5.3.1 Implementation of the clCreateBuffer function
    5.3.2 Implementation of the clEnqueueWriteBuffer function
    5.3.3 Implementation of the clEnqueueReadBuffer function
    5.3.4 Implementation of the clEnqueueNDRangeKernel function

6 Proof of Work

7 Summary and Future Work
  7.1 Summary
  7.2 Future Work

List of Figures

1 Platform Model
2 Execution Model
2.1 Serial execution
2.2 Equation for a computer program runtime
2.3 Parallel execution
2.4 Platform model
2.5 An example of an NDRange index space showing work-items, their global IDs and their mapping onto the pair of work-group and local IDs
2.6 The memory regions and how they relate to the platform model
2.7 The memory regions and how they relate to the platform model
2.8 The host layer includes parts that are executed in the OpenCL host. The device layer is used as a hardware abstraction layer to encapsulate the device-specific parts
2.9 The LLVM-Toolchain
3.1 Design differences between GPU and CPU
3.2 Theoretical memory bandwidth of the GPU
3.3 NEMA|S GPU
3.4 Rendering Flow
3.5 A multithreaded program divided into blocks that are allocated on 2 or 4 cores
3.6 NEMA|GFX Architecture
5.1 Our tsi-main function for the kernel compilation


Chapter 1

Introduction

1.1 OpenCL

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages (based on C99 and C++11) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

1.2 Thesis subject

In this thesis the implementation of the OpenCL driver for the NEMA|S GPU is presented. POCL version 1.0 was used, and all of the OpenCL code ran on the Vertex Processor of the NEMA|S GPU. The whole of this work consists of three main parts. Building POCL on two different host machines and writing some demo examples in order to understand the concepts of OpenCL was the first. Building OpenCL code to a binary executable that could run on the Vertex Processor of NEMA|S was the second task. Finally, most of the OpenCL API functions with target-specific behaviour were implemented. This implementation supports kernel execution with all scalar data types (char, int, float), vectors as well as structs; an illustrative sketch of such a kernel follows below. Optimizations were made when possible so as to decrease the execution time and the memory usage. Images and samplers are currently not supported. Conformance tests were used in order to exhaustively test this work.
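To make the supported argument types concrete, a hypothetical kernel is sketched below; the kernel and struct names are illustrative and not taken from the thesis sources. It mixes a scalar, a built-in vector type and a user-defined struct:

    /* Illustrative only: a struct whose layout must match between
       host and device code. */
    typedef struct { int id; float weight; } item_t;

    __kernel void scale(__global item_t *items, /* user-defined struct  */
                        float4 factor,          /* built-in vector type */
                        int n)                  /* scalar type          */
    {
        int i = get_global_id(0);
        if (i < n)
            items[i].weight *= factor.x;  /* use one vector component */
    }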

1.3 Structure of the text

In chapter 2, the theoretical background for parallel computing, OpenCL and POCL is presented. In chapter 3, some key knowledge about GPGPU, the Vertex Processor and the NEMA|GFX library is presented. Finally, a brief overview of the CUDA framework is given. In chapter 4, the conformance tests are thoroughly analyzed and an overview of each of the different categories is presented. In chapter 5, the implementation for the NEMA GPUs is illustrated. This chapter is divided into three sections. At first the building procedure of POCL is described, then the compilation of OpenCL code for the NEMA GPUs, and finally the implementation of target-specific functions of the OpenCL API. In chapter 6, the accuracy of the implementation is presented, based on the conformance tests described in a previous chapter. In chapter 7, a conclusion about the developed implementation and its efficiency is presented, and future plans and work regarding this thesis are proposed.

Chapter 2

Theoretical Background

2.1 Parallel Computing

Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit on one computer.

FIGURE 2.1: Serial execution

Only one instruction may execute at a time; after that instruction is finished, the next one is executed. The amount of time for a program to be executed is proportional to the number of instructions, the number of cycles needed for each instruction of the program and the time for each cycle (the inverse of the clock frequency). There were two main reasons why computer manufacturers managed to improve the performance of processors. Frequency scaling[34] was the dominant reason for improvements in computer performance from the mid-1980s until 2004. The runtime of a program is equal to the number of instructions multiplied by the average time per instruction. Maintaining everything else constant, increasing the clock frequency decreases the average time it takes to execute an instruction. An increase in frequency thus decreases runtime for all compute-bound programs. The second reason was Moore's law[5][6].
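Written out explicitly, the runtime relation just described (the one depicted in Figure 2.2) takes the following form; the symbols are assumed here for illustration:

    runtime = (instructions/program) × (cycles/instruction) × (seconds/cycle)
            = (N_instr × CPI) / f_clock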

FIGURE 2.2: Equation for a computer program runtime

Moore's law is the observation that the number of transistors in a dense integrated circuit doubles about every two years. By increasing the number of transistors, the processing power of the integrated circuit is also increased. The observation is named after Gordon Moore, the co-founder of Fairchild Semiconductor and CEO of Intel, whose 1965 paper described a doubling every year in the number of components per integrated circuit, and projected this rate of growth would continue for at least another decade. In 1975[7], looking forward to the next decade, he revised the forecast to doubling every two years. Advancements in digital electronics are strongly linked to Moore's law: the rapid scaling of MOSFET (metal-oxide-semiconductor field-effect transistor) devices, quality-adjusted prices, memory capacity (RAM and flash), sensors, and even the number and size of pixels in digital cameras[8][10]. The clock frequency has increased by almost four orders of magnitude between the first 8088 Intel processor and today's processors, and the number of transistors also rose from 29,000 for the 8088 to approximately 730 million for an i7-920 processor. However, increasing power consumption made computer manufacturers make drastic changes. Power consumption P of an integrated circuit is given by the equation P = C × V² × F, where C is the capacitance being switched per clock cycle (proportional to the number of transistors whose inputs change), V is the voltage, and F is the processor frequency

(cycles per second). Even though the voltage at which circuits operated went down from 5 volts to as low as 1.2 volts in some cases, the overall power consumption went way up. By doubling the number of transistors following Moore's law, the factor C was also doubled, and the frequency of the circuits was significantly increased. Increasing processor power consumption led ultimately to Intel's May 8, 2004 cancellation of its Tejas and Jayhawk processors, which is generally cited as the end of frequency scaling. To deal with the problem of power consumption and overheating, the major central processing unit (CPU) manufacturers started to produce power-efficient processors with multiple cores[37]. The core is the computing unit of the processor, and in multi-core processors each core is independent and can access the same memory concurrently. Parallel computing[36] uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above.

FIGURE 2.3: Parallel execution

Historically, parallel computing was used for scientific computing and the simulation of scientific problems, particularly in the natural and engineering sciences, such as meteorology[14][15]. This led to the design of parallel hardware and software, as well as high performance computing. An operating system can ensure that different tasks and user programs are run in parallel on the available cores. However, for a serial software program to take full advantage of the multi-core architecture, the programmer needs to restructure and parallelise the code. A speed-up of application software runtime will no longer be achieved through frequency scaling; instead, programmers will need to parallelise their software code to take advantage of the increasing computing power of multicore architectures[29].

2.2 OpenCL

2.2.1 Introduction to OpenCL

OpenCL is an open industry standard for programming a heterogeneous collection of discrete computing devices organized into a single platform. OpenCL is more than a language. It is a framework for parallel programming and includes a language, API, libraries and a runtime system that give programmers the ability to write portable yet efficient code[2]. Using OpenCL, for example, a programmer can write general purpose programs that execute on any OpenCL conformant device without the need to rework their algorithms when switching to another device. Portability of OpenCL programs across a wide range of different heterogeneous platforms is achieved by describing the kernels as source code strings which are then explicitly compiled using the runtime API for the targeted devices. OpenCL supports a wide range of applications, ranging from embedded and consumer software to HPC solutions, through a low-level, high-performance, portable abstraction. It provides a top-level abstraction for low-level hardware routines as well as consistent memory and execution models for dealing with massively-parallel code execution. The advantage of this abstraction layer is the ability to scale code from simple embedded microcontrollers to general purpose CPUs[17] from Intel and AMD, up to massively-parallel GPGPU hardware pipelines, all without reworking code[25]. OpenCL is particularly suited to play an increasingly significant role in emerging interactive graphics applications that combine general parallel compute algorithms with graphics rendering pipelines. OpenCL consists of an API for coordinating parallel computation across heterogeneous processors, and a cross-platform programming language with a well-specified computation environment. The OpenCL standard:

1. Supports both data- and task-based parallel programming models
2. Utilizes a subset of ISO C99 with extensions for parallelism
3. Defines consistent numerical requirements based on IEEE 754
4. Defines a configuration profile for handheld and embedded devices
5. Efficiently interoperates with OpenGL[32], OpenGL ES and other graphics APIs

While the CUDA[11] framework is supported only by NVIDIA GPUs, OpenCL applications can be executed on a number of different platforms from different vendors[2]. Conformant implementations are available from Altera, AMD, Apple, ARM, Creative, IBM, Imagination, Intel, Nvidia, Qualcomm, Samsung, Xilinx and more. To describe the core ideas behind OpenCL, we will use a hierarchy of models:

1. Platform Model
2. Execution Model
3. Memory Model
4. Programming Model

Furthermore, we will discuss synchronization in OpenCL and the different categories of memory objects along with their characteristics.

2.2.2 Platform Model

The platform model for OpenCL is defined in the figure below. The model consists of a host connected to one or more OpenCL devices[2]. An OpenCL device is divided into one or more compute units (CUs) which are further divided into one or more processing elements (PEs). All of these processing elements execute OpenCL code.

All of the computations specified in the OpenCL code ("kernel") occur within the processing elements of the OpenCL device. The specific definition of compute units is different depending on the hardware vendor. An OpenCL application runs on a host according to the models native to the host platform. The OpenCL application submits commands from the host to coordinate the execution of the kernel on the processing elements within a device. The processing elements within a compute unit execute a single stream of instructions as SIMD[22] units or as SPMD units. A SIMD (Single Instruction Multiple Data) unit is defined as a class of parallel computers where multiple processing elements execute the same code, in our case OpenCL code, each with its own set of data and a shared program counter. All processing elements have an identical set of instructions. On the other hand, an SPMD (Single Program Multiple Data) unit is defined as the programming model where multiple processing elements execute the same code, each with its own set of data and its own program counter. Hence, while all computational resources run the same code, they maintain their own instruction counter, and due to branches in the code the actual sequence of instructions can be quite different across the set of processing elements. OpenCL also offers platform mixed-version support.

FIGURE 2.4: Platform model: one host plus one or more compute devices, each with one or more compute units, each with one or more processing elements.

It is designed to support devices which conform to different versions of the OpenCL specification under the same platform. There are three important version identifiers to consider for an OpenCL system: the platform version, the version of a device, and the version(s) of the OpenCL C language supported on a device. The platform version indicates the version of the OpenCL runtime supported. This includes all of the APIs that the host can use to interact with the OpenCL runtime, such as contexts, memory objects, devices, and command queues. The device version is an indication of the device's capabilities, separate from the runtime and compiler, as represented by the device info returned by clGetDeviceInfo. Examples of attributes associated with the device version are resource limits and extended functionality. The version returned corresponds to the highest version of the OpenCL specification for which the device is conformant, but is not higher than the platform version. The language version for a device represents the OpenCL programming language features a developer can assume are supported on a given device. The version reported is the highest version of the language supported. OpenCL C is designed to be backwards compatible, so a device is not required to support more than a single language version to be considered conformant. If multiple language versions are supported, the compiler defaults to using the highest language version supported for the device. The language version is not higher than the platform version, but may exceed the device version.
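As a minimal sketch of how a host program discovers a platform and performs a device query such as the ones mentioned above (the code is illustrative and not taken from the thesis sources; error handling is omitted for brevity):

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint num_platforms, num_devices;
        char name[256];

        /* pick the first available platform and device */
        clGetPlatformIDs(1, &platform, &num_platforms);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, &num_devices);

        /* delegate a query down to the device layer, as described above */
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("Device: %s\n", name);
        return 0;
    }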

2.2.3 Execution Model

The execution of an OpenCL program is divided into two parts: a host program that executes on the host and kernels that execute on one or more OpenCL devices connected to the host[2]. The host defines a context for the execution of the kernels. The context includes the following resources:

1. Devices: The collection of OpenCL devices to be used by the host.
2. Kernels: The OpenCL functions that run on OpenCL devices.
3. Program Objects: The program source and executable that implement the kernels.
4. Memory Objects: A set of memory objects visible to the host and the OpenCL devices. Memory objects contain values that can be operated on by instances of a kernel.

The context is created and manipulated by the host using functions from the OpenCL API. The host creates a data structure called a command-queue to coordinate execution of the kernels on the devices. The host places commands into the command-queue, which are then scheduled onto the devices within the context. These include:

1. Kernel execution commands: Execute a kernel on the processing elements of a device.
2. Memory commands: Transfer data to, from, or between memory objects, or map and unmap memory objects from the host address space.
3. Synchronization commands: Constrain the order of execution of commands.

The command-queue schedules commands for execution on a device. These execute asynchronously between the host and the device. Commands execute relative to each other in one of two modes:

1. In-order Execution: Commands are launched in the order they appear in the command-queue and complete in order. In other words, a prior command on the queue completes before the following command begins. This serializes the execution order of commands in a queue.
2. Out-of-order Execution: Commands are issued in order, but do not wait to complete before following commands execute. Any order constraints are enforced by the programmer through explicit synchronization commands.

The core of the OpenCL execution model is defined by how the kernels execute. The OpenCL execution model supports two categories of kernels:

1. OpenCL kernels are written with the OpenCL C programming language and compiled with the OpenCL compiler. All OpenCL implementations support OpenCL kernels.
2. Native kernels are accessed through a host function pointer. Native kernels are queued for execution along with OpenCL kernels on a device and share memory objects with OpenCL kernels. For example, these native kernels could be functions defined in application code or exported from a library.

When a kernel is submitted for execution by the host, an index space is defined. The index space supported in OpenCL is an N-dimensional index space, where N is one, two or three, and is called an NDRange. An NDRange is defined by an integer array of length N specifying the extent of the index space in each dimension, starting at an offset index F (zero by default). An instance of the kernel executes for each point in this index space. This kernel instance is called a work-item and is identified by its point in the index space, which provides a global ID for the work-item. Each work-item executes the same code, but the specific execution pathway through the code and the data operated upon can vary per work-item. Work-items are organized into work-groups. The work-groups provide a more coarse-grained decomposition of the index space. Work-groups are assigned a unique work-group ID with the same dimensionality as the index space used for the work-items. Work-items are assigned a unique local ID within a work-group so that a single work-item can be uniquely identified by its global ID or by a combination of its local ID and work-group ID. The work-items in a given work-group execute concurrently on the processing elements of a single compute unit. Each work-item's global ID and local ID are N-dimensional tuples. The global ID components are values in the range from F to F plus the number of elements in that dimension minus one. Work-groups are assigned IDs using a similar approach to that used for work-item global IDs. An array of length N defines the number of work-groups in each dimension. Work-items are assigned to a work-group and given a local ID with components in the range from zero to the size of the work-group in that dimension minus one. Hence, the combination of a work-group ID and the local ID within a work-group uniquely defines a work-item. Each work-item is identifiable in two ways: in terms of a global index, and in terms of a work-group index plus a local index within a work-group.
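A hedged host-side sketch of these concepts follows; it assumes a cl_device_id and a cl_kernel have already been obtained, and uses the OpenCL 1.x clCreateCommandQueue API that POCL 1.0 supports. It is an illustration, not code from the thesis:

    #include <CL/cl.h>

    void launch(cl_device_id device, cl_kernel kernel)
    {
        cl_int err;
        /* the context groups devices, kernels, programs and memory objects */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        /* the command-queue coordinates execution on the device */
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        size_t global[1] = { 1024 };  /* NDRange: 1024 work-items, 1 dimension */
        size_t local[1]  = { 64 };    /* work-group size: 64 work-items each   */

        err = clEnqueueNDRangeKernel(q, kernel, 1, NULL, global, local,
                                     0, NULL, NULL);
        clFinish(q);  /* block until all enqueued commands have completed */
    }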

For example, consider the 2-dimensional index space in Figure 2.5. We input the index space for the work-items (Gx, Gy), the size of each work-group (Sx, Sy) and the global ID offset (Fx, Fy). The global indices define a Gx by Gy index space where the total number of work-items is the product of Gx and Gy. The local indices define an Sx by Sy index space where the number of work-items in a single work-group is the product of Sx and Sy.

FIGURE 2.5: An example of an NDRange index space showing work-items, their global IDs and their mapping onto the pair of work-group and local IDs.

Given the size of each work-group and the total number of work-items, we can compute the number of work-groups. A 2-dimensional index space is used to uniquely identify a work-group. Each work-item is identified by its global ID (gx, gy) or by the combination of the work-group ID (wx, wy), the size of each work-group (Sx, Sy) and the local ID (sx, sy) inside the work-group such that

(gx, gy) = (wx * Sx + sx + Fx, wy * Sy + sy + Fy).

The number of work-groups can be computed as:

(Wx, Wy) = (Gx/Sx, Gy/Sy).

Given a global ID and the work-group size, the work-group ID for a work-item is computed as:

(wx, wy) = ((gx - sx - Fx)/Sx, (gy - sy - Fy)/Sy).
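The identity between the two ways of identifying a work-item can be checked directly in a kernel on devices whose OpenCL C implementation supports printf (OpenCL 1.2 and later). The kernel below is an illustrative sketch for the 1-D case, not thesis code:

    __kernel void show_ids(void)
    {
        uint g = (uint)get_global_id(0);     /* gx */
        uint w = (uint)get_group_id(0);      /* wx */
        uint l = (uint)get_local_id(0);      /* sx */
        uint S = (uint)get_local_size(0);    /* Sx */
        uint F = (uint)get_global_offset(0); /* Fx */

        /* verify gx = wx * Sx + sx + Fx; both printed values should match */
        printf("g=%u  w*S+l+F=%u\n", g, w * S + l + F);
    }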

2.2.4 Memory Model

OpenCL defines a memory model in which the work-items executing a kernel have access to four distinct memory regions[2]:

1. Global Memory: This memory region permits read/write access to all work-items in all work-groups. Work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached depending on the capabilities of the device.
2. Constant Memory: A region of global memory that remains constant during the execution of the OpenCL code. Data written to this memory can only be read by the kernel. The host allocates and initializes memory objects placed into constant memory before the execution of the kernel.
3. Local Memory: A memory region local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group. It may be implemented as dedicated regions of memory on the OpenCL device. Alternatively, the local memory region may be mapped onto sections of the global memory.
4. Private Memory: A region of memory private to a work-item. Variables defined in one work-item's private memory are not visible to another work-item. Private memory is the fastest of all of OpenCL's memory spaces, so in some limited cases it might be used to increase execution speed.

OpenCL uses a relaxed consistency memory model; i.e., the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times. Within a work-item, memory has load/store consistency. Local memory is consistent across work-items in a single work-group at a work-group barrier. Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel. Work-groups can communicate through shared memory and synchronization primitives; however, their memory access is independent of other work-groups, as depicted in the figure below.

FIGURE 2.6: The memory regions and how they relate to the platform model

FIGURE 2.7: The memory regions and how they relate to the platform model

An important issue to keep in mind when programming OpenCL kernels is that memory access on the global and local memory blocks is not protected in any way. This means that segfaults are not reported when work-items dereference memory outside their own global storage. As a result, GPU memory set aside for the OS can be clobbered unintentionally, which can result in undefined behavior.
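In OpenCL C, the four memory regions appear as address-space qualifiers on kernel arguments and variables. The small kernel below is an illustrative sketch (not from the thesis sources) that touches all four regions:

    __kernel void regions(__global float *out,     /* global memory          */
                          __constant float *coeff, /* constant memory        */
                          __local float *scratch)  /* local memory, per group */
    {
        float tmp = coeff[0];            /* private memory: per work-item */
        scratch[get_local_id(0)] = tmp;  /* shared within the work-group  */
        barrier(CLK_LOCAL_MEM_FENCE);    /* make local writes visible     */
        out[get_global_id(0)] = scratch[0];
    }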

2.2.5 Programming Model

The OpenCL execution model supports two types of programming models, task parallel and data parallel, as well as hybrids of these two[2]. The primary model driving the OpenCL design is data parallel. The data parallel programming model defines a computation in terms of a sequence of instructions applied to multiple elements in a memory object. The index space defined by the OpenCL execution model defines the work-items and how the data maps onto these work-items. In a strictly data parallel programming model there is a one-to-one mapping between the work-items and the elements in a memory object. OpenCL supports a relaxed version of the data parallel programming model where the strict one-to-one mapping is not necessary. OpenCL provides a hierarchical data parallel programming model. There are two ways to specify the hierarchical subdivision. In the explicit model a programmer defines the total number of work-items to execute in parallel and also how the work-items are divided among work-groups. In the implicit model, a programmer specifies only the total number of work-items to execute in parallel, and the division into work-groups is managed by the OpenCL implementation; a sketch of both follows below. On the other hand, the OpenCL task parallel programming model defines a model in which a single instance of a kernel is executed independently of any index space. It is logically equivalent to executing a kernel on a compute unit with a work-group containing a single work-item. Under this model, users can express parallelism by using vector data types implemented by the device, by enqueuing multiple tasks, and/or by enqueuing native kernels developed using a programming model orthogonal to OpenCL.
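The difference between the explicit and the implicit hierarchical model is visible at the kernel launch. In this illustrative sketch, the queue and kernel are assumed to exist already; the code is not taken from the thesis:

    #include <CL/cl.h>

    void divide(cl_command_queue q, cl_kernel kernel)
    {
        size_t global[1] = { 4096 };
        size_t local[1]  = { 128 };

        /* explicit model: the programmer fixes the work-group size */
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, global, local, 0, NULL, NULL);

        /* implicit model: NULL lets the implementation choose the division */
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, global, NULL, 0, NULL, NULL);
    }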

2.2.6 Synchronization

There are two domains of synchronization in OpenCL[2]. First, there can be synchronization between work-items in a single work-group using a work-group barrier. All work-items of a work-group must execute the barrier before any are allowed to continue execution of the kernel beyond the barrier. It is important to note that the barrier must be encountered by all work-items in a work-group or by none at all. Synchronization between work-items of different work-groups is not possible. The second domain of synchronization is between commands enqueued to a command-queue in a single context. The synchronization points between commands are:

1) Command-queue barrier. This barrier can only be used to synchronize commands in a single command-queue. It ensures that all previously queued commands have finished execution and any resulting updates to memory objects are visible to subsequently enqueued commands before they begin execution.

2) Waiting on an event. All OpenCL API functions that enqueue commands return an event when executed that identifies the command and the memory object it updated. A subsequent command waiting on that event is guaranteed that updates to those memory objects are visible before the command begins execution.
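A classic illustration of the work-group barrier is a local-memory reduction. The sketch below assumes a power-of-two work-group size and is an example of ours, not part of the thesis code; note that every work-item of the group reaches each barrier, as the rule above requires:

    __kernel void group_sum(__global const float *in,
                            __global float *out,
                            __local float *buf)
    {
        size_t l = get_local_id(0), n = get_local_size(0);
        buf[l] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);      /* all writes to buf now visible */

        /* tree reduction; n is assumed to be a power of two */
        for (size_t s = n / 2; s > 0; s /= 2) {
            if (l < s)
                buf[l] += buf[l + s];
            barrier(CLK_LOCAL_MEM_FENCE);  /* executed by ALL work-items */
        }
        if (l == 0)
            out[get_group_id(0)] = buf[0]; /* one partial sum per group */
    }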

2.2.7 Memory Objects

Memory objects are described by a cl_mem object[2]. The host writes the memory objects to the device memory before the kernel execution. Kernels take memory objects as input, and output to one or more memory objects. After the kernel execution, the host reads the memory objects that the kernel has produced back into its own memory. There are two types of memory objects available in OpenCL: buffer objects and image objects. A buffer object stores a one-dimensional collection of items. Items stored in a buffer object can be a scalar type (int, float, char), a vector type or a user-defined structure. Elements of a buffer object are stored in sequential fashion and can be accessed by the kernel using a pointer in the same sequential format. An image object is used to store a two- or three-dimensional texture, frame-buffer or image. Elements of an image are stored in a format that is opaque to the user and cannot be directly accessed using a pointer. There are functions available in the OpenCL API to allow the kernel to read from or write to an image. In the case of an image object, the data format used to store the image elements may not be the same as the data format used inside the kernel. Image elements are always a 4-component vector (each component can be a float or a signed/unsigned integer) in a kernel. The built-in function to read from an image converts the image element from the format in which it is stored into a 4-component vector. Similarly, the built-in function to write to an image converts the image element from a 4-component vector to the appropriate image format specified, such as four 8-bit elements, for example.
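A hedged host-side sketch of the buffer-object lifecycle (create, write, read back, release) follows; it assumes a context and command-queue like those of the earlier sketches and is illustrative, not thesis code:

    #include <CL/cl.h>

    void roundtrip(cl_context ctx, cl_command_queue q)
    {
        float data[1024] = { 0 };
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    sizeof(data), NULL, &err);

        /* host -> device (CL_TRUE makes the call blocking) */
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(data), data,
                             0, NULL, NULL);
        /* ... kernel execution would go here ... */
        /* device -> host (blocking read of the results) */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data,
                            0, NULL, NULL);
        clReleaseMemObject(buf);
    }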

2.3 POCL

The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort[1]. While the OpenCL standard provides an extensive programming platform for portable heterogeneous parallel programming, the standard is quite low-level, exposing plenty of details of the platform to the programmer. Thus, using these platform queries, it is possible to adapt the program to each of the platforms. However, this means that to achieve maximum performance on a different platform, the programmer has to explicitly do the adaptation for each program separately. In addition, implementations of the OpenCL standard are vendor and platform specific, thus acquiring the full performance of an OpenCL application requires the programmer to become familiar with the special characteristics of the implementation at hand and tune the program accordingly. This is a serious drawback for gaining maximum performance when trying to port the same code to another platform, as manual optimizations are needed. POCL is an open source implementation of OpenCL that is platform independent, enabling OpenCL on a wide range of architectures. The implementation is divided into parts that are executed in the host and parts that implement device-specific behavior. The software architecture of POCL is modularized to encourage code reuse and to isolate the device-specific aspects of OpenCL, providing a platform-portable implementation. Most of the API implementations of the OpenCL framework in POCL are generic implementations written in C which call the device layer through a generic host-device interface for device-specific parts. For example, when the OpenCL program queries for the number of devices, POCL returns a list of supported devices without needing to do anything device-specific yet. However, when the application asks for the size of the global memory in a device, the query is delegated down to the device layer implementation of the device at hand. The host layer implementation is portable to targets with operating system and C compiler support. The device layer encapsulates the operating system and instruction-set architecture (ISA) specific parts, such as code generation for the target device and orchestration of the execution of the kernels on the device. It consists of target-specific implementations for functionality such as target-specific parts of the kernel compilation process, the final execution of the command queue including uploading the kernel to the device and launching it, performing data transfers, querying device characteristics, etc. One important responsibility of a device layer implementation is resource management, that is, ensuring that the device resources needed for kernel execution are properly shared and synchronized between multiple kernel executions. The higher-level components of POCL are illustrated in the following figure.

FIGURE 2.8: The host layer includes parts that are executed in the OpenCL host. The device layer is used as a hardware abstraction layer to encapsulate the device-specific parts.

POCL helps the programmer achieve maximum performance by using an OpenCL kernel compiler which exposes the parallelism in the kernels in such a way that it can be mapped to the diverse parallel resources available in the different types of computing devices. The POCL kernel compiler is based on unmodified Clang and LLVM tools, also known as the LLVM toolchain[12][21]. First, Clang[20] parses the OpenCL C kernels and produces an LLVM Intermediate Representation (IR) for the POCL kernel compiler passes. The generated LLVM IR contains the representation of the kernel code for a single work-item, matching the original OpenCL C kernel description as an LLVM IR function. LLVM is designed around a language-independent intermediate representation (LLVM IR) that serves as a portable, high-level assembly language that can be optimized with a variety of transformations over multiple passes. Then the Optimizer[40], which is part of the LLVM toolchain, performs many optimizations on the LLVM IR it is fed. The LLVM IR function that describes the behavior of a single work-item in the work-group is then processed by the LLVM backend, which links the IR against an LLVM IR library of device-specific OpenCL built-in function implementations at the bitcode level. Using a library at the bitcode level for linking the LLVM IR of the kernel code can be really beneficial in terms of runtime and size of the final executable. This library holds the implementations of all the functions that can be called from a kernel. Each time a kernel function is called, LLVM does not copy the content of the whole library into the final executable; instead, only the implementation of the called function is copied. This way we drastically reduce the execution time, since the time needed for the linking is much smaller, and the size of the executable is also significantly reduced, which is much needed in embedded applications where resources are limited. The function is then converted to a work-group function: a version of the function that statically executes all the work-items of the work-group. Finally, the work-group function is passed to the code generator and assembler, which generate the executable kernel binary for the target device. The work-group function is potentially accompanied by a launcher function in the case of a heterogeneous device. In that case the device contains its own main function which executes the work-group function on demand.
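Conceptually, the generated work-group function wraps the single work-item body in loops over the local index space. The sketch below illustrates the idea only; kernel_args_t and kernel_body are hypothetical names, not POCL source:

    #include <stddef.h>

    typedef struct kernel_args kernel_args_t;   /* placeholder argument block */
    void kernel_body(kernel_args_t *args,
                     const size_t group_id[3],
                     const size_t local_id[3]);  /* one work-item's code */

    /* statically execute every work-item of one work-group */
    void workgroup_fn(kernel_args_t *args,
                      const size_t group_id[3],
                      const size_t local_size[3])
    {
        for (size_t z = 0; z < local_size[2]; ++z)
            for (size_t y = 0; y < local_size[1]; ++y)
                for (size_t x = 0; x < local_size[0]; ++x) {
                    const size_t local_id[3] = { x, y, z };
                    kernel_body(args, group_id, local_id);
                }
    }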

FIGURE 2.9: The LLVM-Toolchain

Chapter 3

Description of tools

3.1 GPGPU

A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Early on in the personal computer revolution, as graphics rendering started to stress the ability of the CPU, the GPU was developed to handle these graphics. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics and image processing. In a personal computer, a GPU can be present on a video card or embedded on the motherboard. In certain CPUs, they are embedded on the CPU die. Most high-end desktop PCs will feature a dedicated graphics card, which occupies one of the motherboard's PCIe slots. These usually have their own dedicated memory allocation built into the card, which is reserved exclusively for graphical operations. Some particularly advanced PCs will even use two GPUs hooked up together to provide even more processing power. The term "GPU" was coined by Sony in reference to the PlayStation console's Toshiba-designed Sony GPU in 1994. The term was popularized by Nvidia in 1999, who marketed the GeForce 256 as "the world's first GPU". It was presented as a "single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines". Rival ATI Technologies coined the term "visual processing unit" or VPU with the release of the Radeon 9700 in 2002. A general-purpose GPU (GPGPU) is a graphics processing unit that performs non-specialized calculations that would typically be conducted by the CPU (central processing unit)[13]. A CPU is often called the "brain" or the "heart" of a computer. It is required to run the majority of engineering and office software.

FIGURE 3.1: Design differences between GPU and CPU

However, there is a multitude of tasks that can overwhelm a computer's central processor. That is when using a GPU becomes essential for computing. CPUs and GPUs process tasks in different ways. A CPU can work on a variety of different calculations, while a GPU is best at focusing all its computing abilities on a specific task. That is because a CPU consists of a few cores (up to 24) optimized for sequential serial processing. It is designed to maximize the performance of a single task within a job; however, the range of tasks is wide. Modern GPUs are high performance many-core processors that can obtain high FLOP rates. In the past, the processing units of the GPU were designed only for computer graphics, but now GPUs are truly general-purpose parallel processors. Since the first idea of using the GPU for general purpose computing, GPU programming models have evolved, and there are several approaches to GPU programming now: CUDA (Compute Unified Device Architecture) from NVIDIA and APP (Stream) from AMD. The GPU, unlike the CPU, has parallel processing capability: it can perform operations more quickly, due to the intense numbers of mathematical calculations necessary to render sophisticated graphics, and that is the reason why a great number of applications were ported to use the GPU, obtaining speedups of a few orders of magnitude compared to optimized multicore CPU implementations. Their highly parallel structure makes them more efficient than general-purpose central processing units (CPUs) for algorithms that process large blocks of data in parallel.

FIGURE 3.2: Theoretical memory bandwidth of the GPU nowadays to speed up parts of applications that require intensive nu- merical computations. Traditionally, these parts of applications are handled by the CPUs but GPUs have now MFLOPs rates much bet- ter than CPUs. The reason why GPUs have floating point operations rates much better even than multicore CPUs is that the GPUs are spe- cialized for highly parallel intensive computations and they are de- signed with much more transistors allocated to data processing rather than flow control or data caching. While GPUs operate at lower fre- quencies, they typically use thousands of smaller and more efficient cores for a massively parallel architecture aimed at handling multiple functions at the same time. Modern GPUs provide superior process- ing power, memory bandwidth and efficiency over their CPU counter- parts. Migrating data into graphical form and then using the GPU to scan and analyze it can create a large speedup. They are 50–100 times faster in tasks that require multiple parallel processes, such as 3D Visualization, Image Processing, Deep Machine Learning and many more.

3.2 NEMA|S GPU

NEMA|S implements a multi-core, multi-threaded graphics processing unit (GPU) with extremely high performance and ultra-low power consumption, aimed at both graphics rendering/acceleration and general-purpose computing in embedded applications[33]. The configurable and scalable NEMA|S GPU uses an innovative architecture consisting of one or many processing clusters interconnected with a proprietary Network on Chip (NoC). Each cluster can have one to four floating-point vector processing cores, and each core is able to run up to 128 threads. The resulting performance is extremely competitive, providing, for example, 19.2 GFlops at just 533 MHz with one four-core cluster. NEMA|S combines this processing power with ultra-low power consumption. Proprietary compression techniques minimize the bandwidth to the frame buffer (access to which is the major power consumer of any GPU), and intelligent Dynamic Voltage Frequency Scaling (DVFS) allows adjusting the power consumption to suit the computational load. Optional custom hardware accelerators for typical graphics processing tasks such as Texture Mapping, Pixel Blending, and Polygon Rasterization further reduce power consumption. The NEMA|S GPU is easy to program using the included compiler tool chain, and it supports popular graphics APIs and operating systems such as Android and Linux.

FIGURE 3.3: NEMA|S GPU

The features of this GPU are the following:

1. Multicore Architecture
2. Unified Shader Architecture
3. C/C++/OpenCL LLVM Compiler
4. Ultra-threaded Processor
5. GPGPU Compute

The Vertex Processor of NEMA|S was used to run OpenCL code throughout this thesis. The ultimate goal is to make the NEMA|S GPU fully conformant with the OpenCL standard.

3.3 Vertex Processor

Computer graphics is the science of communicating visually via a display and its interaction devices. It is a cross-disciplinary field in which physics, mathematics, human perception, human-computer interaction and engineering blend towards creating artificial images with the help of programming. It heavily involves computation, creation and manipulation of data, and is based on a set of well-defined principles. In order to display a 2D or 3D model, which is stored as binary data in our system's memory, a number of actions need to take place, known collectively as the graphics pipeline. The term "pipeline" is used due to the sequential steps that perform the actual transformation from mathematical model to pixels; the results of one stage are pushed on to the next stage, so that the first stage can begin processing the next element immediately.

FIGURE 3.4: Rendering Flow

The location of each object in the image is described by a data structure called a vertex, which defines its corners as positions of points in two- or three-dimensional space. All of the vertex data is stored in a contiguous block of memory called a vertex buffer. The first step in the rendering pipeline is to transform each vertex's 3D position in object space to the 2D coordinate at which it will appear on the screen. This action takes place in the Vertex Processor. The Vertex Processing unit in the rendering pipeline handles the processing of individual vertices. It calculates the depth value and manipulates properties such as position, depth, color and texture coordinates. The Vertex Processing unit utilizes 64-bit VLIW (very long instruction word) instructions, performs computations on vertices, and sends the results to the Rasterizer unit[38] through the Configuration Registers. The Rasterizer unit reads the coordinates of the primitives' vertices and feeds the Fragment Processor with the fragments contained in the geometry. The Vertex Processing unit is programmable through binary executables called Vertex Shaders. The Fragment Processor and the Vertex Processor of NEMA|S share the same architecture, but the former supports multiple hardware threads whereas the latter is single-threaded. Using the Vertex Processor instead of the Fragment Processor to run OpenCL code is therefore a serious drawback in terms of performance, but it was not a random choice. First, the hardware of the Fragment Processor of NEMA|S is not fully developed yet. Second, debugging is much easier on a single-threaded processor, and the Vertex Processor performs load-store operations straight from graphics memory; in contrast, the Fragment Processor uses a module called the Texture Map, which is designed for graphics, to perform such an operation. Furthermore, the CPU can directly start a thread in the Vertex Processor, whereas the Fragment Processor is driven by the Rasterizer, so this is not possible there. All these factors add complexity to the development of such an application. In this thesis the first step towards creating a fully conformant driver is made, so performance is not the focus.

3.4 CUDA

CUDA (Compute Unified Device Architecture) was introduced for the first time in 2006 by NVIDIA. It is a general-purpose parallel programming architecture that uses the parallel compute engine in NVIDIA GPUs to solve complex computational problems more efficiently than a CPU does[4]. At the time of its introduction CUDA supported only the C programming language, but nowadays it also supports FORTRAN, C++, Java, Python, etc. CUDA[39] was not used in this thesis, but it is worth mentioning in order to gain a better understanding of parallel programming models. CUDA has three main key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. These abstractions are exposed to the programmer as language extensions. They provide fine-grained data parallelism and thread parallelism, together with task parallelism that can be considered coarse-grained parallelism[12][23]. The CUDA parallel programming model requires programmers to partition the problem into coarse tasks that can be executed independently in parallel by blocks of threads; each task is further divided into finer pieces of code that can be executed cooperatively in parallel by the threads within a block. This model allows threads to cooperate when solving each task, and also enables automatic scalability[28]. Each block of threads can be scheduled for execution on any of the available processor cores, concurrently or sequentially, which allows a CUDA program to be executed on any number of processor cores. Figure 3.5 shows how a program is partitioned into blocks of threads, each block being executed independently of the others. A GPU with more cores will execute the program in less time than a GPU with fewer cores.

FIGURE 3.5: A multithreaded program divided into blocks that are allocated on 2 or 4 cores

The main C language extension of the CUDA programming model allows the programmer to define C functions, called kernels, that are executed N times in parallel by N different CUDA threads[28]. A CUDA kernel definition specifies the number of CUDA threads that execute that kernel for a given call. A unique thread ID identifies each thread that executes the kernel. This ID is accessible within the kernel through the built-in threadIdx variable. The threadIdx variable is a 3-component vector, so that each thread can be identified using a one-dimensional, two-dimensional, or three-dimensional index. By indexing the threads in this way one can execute computations on data elements organized in a vector, matrix or 3D space (the correspondence with OpenCL's indexing built-ins is sketched after the list below). The number of threads per block is limited, because all the threads of a block are executed by one processor core and must share the limited memory resources of that core; on current GPUs this limitation implies that a thread block can contain up to 1024 threads. The blocks are structured into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks. The number of thread blocks in a grid is limited by the size of the data being processed or the number of processors in the GPU. In the CUDA programming model, threads can access data from multiple memory spaces during their execution lifetime: each thread has its own private local memory, each thread block shares a memory visible to all the threads in that block, and all threads have access to the same global memory. CUDA threads are executed on a physically separate device that operates as a coprocessor to the host processor running the C program; the device is the GPU, while the host is the CPU. Since its introduction in early 2007, a variety of applications have benefited from the tremendous computational power of current GPUs, with speedups of a few orders of magnitude over the previous state-of-the-art implementations. Medical imaging is one of the earliest areas that benefited most from the CUDA programming model on the GPU. A few of the GPGPU application areas that benefit from the advantages of the CUDA programming model are:

1. Linear algebra and large-scale numerical simulations;
2. Molecular dynamics, protein folding;
3. Signal processing (FFT);
4. Speech and image recognition;
5. Sorting and searching algorithms;
6. Ray tracing.
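Since the rest of this thesis uses OpenCL rather than CUDA, the indexing scheme just described can be related to the OpenCL built-ins: threadIdx corresponds to get_local_id(), blockIdx to get_group_id(), and blockDim to get_local_size(). The following minimal OpenCL C kernel, written for this text as an illustration, shows the correspondence:

    __kernel void indexing_demo(__global int *out)
    {
        size_t local_id = get_local_id(0);   /* CUDA: threadIdx.x */
        size_t group_id = get_group_id(0);   /* CUDA: blockIdx.x  */
        size_t local_sz = get_local_size(0); /* CUDA: blockDim.x  */

        /* the flattened global index that a CUDA kernel computes by hand;
           with a zero global offset it equals get_global_id(0) */
        size_t gid = group_id * local_sz + local_id;
        out[gid] = (int)gid;
    }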

3.5 NEMA|GFX

The NEMA|GFX library is a low-level library that interfaces directly with the NEMA GPUs and provides a software abstraction layer to organize and employ drawing commands with ease and efficiency[3]. The target of NEMA|GFX is to be usable as a back-end to existing APIs (such as OpenGL, DirectFB, or any proprietary one), but also to expose higher-level drawing functions, so that it can be used as a standalone graphics API. Its small footprint, efficient design and lack of any external dependencies make it ideal for use in embedded applications. By leveraging NEMA's sophisticated architecture, it allows great performance with minimum CPU/MCU usage and power consumption. NEMA|GFX includes a set of higher-level calls, forming a complete standalone graphics API for applications in systems where no other APIs are needed. This API is able to carry out draw operations from ones as simple as lines, triangles and quadrilaterals to more complex ones like blitting and perspective-correct texture mapping. NEMA|GFX is built on a modular architecture: an implementor may use only the lower layers, which provide communication with the NEMA hardware, synchronization and basic primitive drawing. The very thin Hardware Abstraction Layer allows for fast integration with the underlying hardware, while the upper low-level drawing API acts as a back-end interface for accelerating any higher third-party graphics API. The modules are generally stacked one over another, forming a layered scheme, which gives the implementor the freedom to tailor the software stack according to one's needs. The lowest layer is a thin Hardware Abstraction Layer (HAL). It includes hooks for basic interfacing with the hardware, such as register access and interrupt handling. The layer above is the Command List Manager, which provides the appropriate API for creating, organizing and issuing Command Lists.

FIGURE 3.6: NEMA|GFX Architecture

A Command List (CL)[26] is considered among the most important features of the NEMA series. CL usage facilitates GPU and CPU decoupling, while its inherent re-usability greatly contributes to decreasing the computational effort of the CPU. This approach renders the overall architecture capable of drawing complicated scenes while keeping the CPU workload to a minimum. The design principles of CLs allow developers to extend the features of their application while optimizing its functionality at the same time. For instance, a CL is capable of jumping to another CL, thus forming a chain of seamlessly interconnected commands. In addition, a CL is able to branch to another CL and, once the branch execution is concluded, resume its operation after the branching point. The NEMA|GFX library helps developers easily take advantage of all these features through a few basic function calls that trigger the whole spectrum of CL capabilities. Above the Command List Manager lies the Hardware Programming Layer (HPL). This is a set of helper functions that assemble commands for programming the NEMA GPU. These commands actually write to NEMA's Configuration Register File, which is used to program the submodules of the GPU. Alongside the HPL resides the Blender module. This module programs NEMA's Programmable Processing Core: it creates binary executables for the Core, corresponding to the various blending modes supported by the NEMA|GFX library. On top of the NEMA|GFX stack lies the NEMA|GFX graphics API. This API offers function calls to draw geometric primitives (lines, triangles, quadrilaterals etc.), blit images, render text, transform geometry objects, perform perspective-correct texture mapping, and so on. When using NEMA|GFX as a back-end for a third-party graphics API, much of the NEMA|GFX graphics API may be disabled. The NEMA|GFX library has been designed to be easily portable to a variety of different platforms, including systems with or without an operating system. In order to port NEMA|GFX successfully, one must take the target platform into account and adapt the HAL accordingly. NEMA|GFX can be used not only for graphics, as the back-end of an existing graphics API like OpenGL or as a standalone graphics API, but also as the back-end to the OpenCL API, in order to orchestrate the execution of OpenCL code on the NEMA GPUs. We can implement all the device-specific functions defined by the OpenCL API by using the NEMA|GFX library. For example, device-specific operations such as creating a buffer object in the graphics memory, reading and writing these buffer objects, and loading the kernel executables into the graphics memory for execution can easily be implemented by calling the appropriate functions of the NEMA|GFX library.
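As an illustration of the Command List workflow described above, the following sketch strings together the NemaGFX calls that the driver in Chapter 5 also uses; the call names are taken from the code listed there, while the header name is an assumption:

    #include "nema_core.h"   /* assumed NemaGFX header */

    void submit_once(void)
    {
        nema_cmdlist_t cl = nema_cl_create();  /* allocate a Command List */
        nema_cl_bind(&cl);     /* make it the active CL, so subsequent
                                  commands are assembled into it */
        nema_cl_rewind(&cl);   /* reset its write pointer before filling it */

        /* ... assemble GPU commands here (register writes, draw calls) ... */

        nema_cl_submit(&cl);   /* hand the CL over to the GPU */
        nema_cl_wait(&cl);     /* block until the GPU has executed it */
        nema_cl_rewind(&cl);   /* the same CL may now be refilled and reused */
        nema_cl_destroy(&cl);  /* release it when it is no longer needed */
    }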


Chapter 4

Conformance tests

4.1 Conformance tests description

The Khronos Group, Inc. is an American non-profit member-funded industry consortium based in Beaverton, Oregon, focused on the creation of open-standard, royalty-free application programming interfaces (APIs), such as OpenGL, Vulkan and OpenCL, for authoring and accelerated playback of dynamic media on a wide variety of platforms and devices. Any company developing a product that implements or partially implements a Khronos standard must pass conformance tests defined by Khronos before it can use the name or logo of the standard in association with the product, or call the product 'compliant' or 'conformant' with that Khronos specification. This helps ensure that Khronos standards are consistently implemented by multiple vendors, creating a reliable platform for developers. To enable companies to test their products for conformance, Khronos has established an Adopters Program for each standard[35]. All Adopters sign the Khronos Adopters Agreement and pay a one-time fee for that particular version of an API. A company does not have to be a member of Khronos in order to become an Adopter. Adopters are provided access to the Adopters Package on a password-protected section of the Khronos web site and may make an unlimited number of Submissions for any number of Implementations, using any version of the Specification up to the Paid Specification Version. A Submission is a complete set of results created by performing the Tests on an Implementation according to the Process, which is then passed to Khronos. The Adopter should make no changes to the source code that disable or change the intended operation of any test, unless the Adopter identifies a potential bug in a test. Becoming an Adopter of a Khronos standard gives access to the Khronos Conformance Testing Process:

1. Download the source of the Khronos conformance tests to port and run on your implementation. The Tests are provided as-is, and the Adopter is responsible for porting and running the Tests on the implementation to generate the necessary information for a Submission.

2. Access the Adopters mailing list, a priority channel for two-way interaction with Khronos members who can offer assistance on running the tests.

3. Upload the generated test results for Working Group review and approval to become officially conformant.

Once the implementation test results have been approved by the review committee, the implementation is officially conformant, gaining a number of significant benefits.

4.1.1 Categories of Conformance tests

The conformance tests that Khronos provides for OpenCL in order to test our implementation are organised in the following categories.

1. allocations: These tests check whether our implementation is capable of performing allocations for buffer and image objects. The size of the buffer and image objects is not constant, and reading and writing of these memory objects is also checked.

2. api: These tests check various functions of the OpenCL API, whether a number of extensions are supported, and features regarding the kernels.

3. atomics: All of the atomic functions are checked. These functions provide atomic operations on 32-bit signed and unsigned integers and on single-precision floating-point values at locations in global or local memory.

4. basic: These tests check many operations that our implementation should be able to perform regarding the kernels. These operations include the proper memory alignment of kernel parameters, the proper functioning of many kernel constructs such as barriers, if-statements and the async work-group copy function, and also the basic math operations such as add, sub and multiply.

5. buffers: These tests check whether our implementation is capable of performing operations on buffer objects such as fill, read, write, map and migrate. These operations take place on a number of buffer objects of different sizes.

6. commonfns: These tests check whether the common built-in functions of OpenCL work properly. Built-in common functions operate component-wise, are described per component, and are implemented using the round-to-nearest-even rounding mode.

7. compiler: This category checks the compiler's functionality.

8. computeinfo: These tests check whether our implementation is capable of returning all the necessary information regarding the targeted device by calling the clGetDeviceInfo() function.

9. conversions: These tests check whether conversions from one data type to another are possible. They apply to scalar and vector data types.

10. device-partition: The targeted device can be partitioned in a few different ways, resulting in an array of sub-devices that each reference a non-intersecting set of compute units. The resulting sub-devices may be used in every way the root (or parent) device can be used, including creating contexts and building programs. These tests check whether all the ways of partitioning the device specified by the OpenCL specification are possible.

11. events: Kernel-execution and memory commands submitted to a queue generate event objects. These are used to control execution between commands and to coordinate execution between the host and the devices. All the parameters regarding event objects in OpenCL are tested here.

12. geometrics: This category tests the proper functioning of the built-in geometric functions. They all operate component-wise.

13. half: According to the OpenCL specification, the half data type is defined as a 16-bit float and must conform to the IEEE 754-2008 half-precision storage format. This category checks whether our implementation supports this data type.

14. headers: In this category we make sure that the various cl_typen types (the built-in scalar and vector data types) work and conform to expectation. We also verify that the various OpenCL headers compile stand-alone, to ensure that they may be used a la carte. This provides developers a lifeline in case some unneeded part of OpenCL (e.g. CL/GL sharing) brings in a pile of symbols (e.g. all of OpenGL) that collide with other headers needed by the application. We also check that the headers do not cause spurious warnings.

15. images: In this category we verify that image objects are fully supported by our implementation. This includes reading and writing images, passing them as kernel arguments, and more.

16. integer-ops: Here the built-in integer functions are inspected. Built-in integer functions can take scalar or vector arguments.

17. math-brute-force: This category is designed to perform a somewhat exhaustive examination of the single- and double-precision math library functions in OpenCL, for all vector lengths. Math library functions are compared against results from a higher-precision reference function to determine correctness. All possible inputs are examined for unary single-precision functions. Other functions are tested against a table of difficult values, followed by a few billion random values.

18. mem-host-flags: These tests check a number of host flags regarding buffer and image objects. We apply these flags when we create a buffer object or an image. We call them host flags because they define the way the host machine can interact with the memory object.

19. multiple-device-context: In this category we make sure that it is feasible to create multiple contexts on a single device, and a single context containing two or more devices.

20. printf: The OpenCL C programming language implements the printf function. The printf built-in function writes output to an implementation-defined stream, such as stdout, under control of the string pointed to by format, which specifies how subsequent arguments are converted for output. These tests check whether the printf function is fully functional when called in an OpenCL kernel.

21. profiling: These tests measure the time needed to perform simple operations, such as reading, writing or copying a buffer object or an image object.

22. relationals: The tests in this category are designed to perform an exhaustive examination of the built-in relational functions. The built-in relational functions can take built-in scalar or vector types as arguments and return a scalar or vector integer result. They can be extended with cl_khr_fp64 to include versions of the functions that take double and double2|4|8|16 as arguments and return values, and with cl_khr_fp16 to include versions that take half and half2|4|8|16 as arguments and return values.

23. select: The built-in select function is examined thoroughly in this category. Its description is the following: for each component of a vector type, result[i] = (MSB of c[i] set) ? b[i] : a[i]; for a scalar type, result = c ? b : a. (A short illustration follows this list.)

24. thread-dimensions: This category tests thread dimensions by executing a kernel across a range of dimensions. Each kernel instance does an atomic write into a specific location in a buffer to ensure that the correct dimensions are run. To handle large dimensions, the kernel masks its execution region internally. This allows a small (128 MB) buffer to be used for very large executions by running the kernel multiple times.

25. vec-step: The vec_step built-in function takes a built-in scalar or vector data type argument and returns an integer value representing the number of elements in the scalar or vector. For all scalar types, vec_step returns 1; the vec_step built-ins that take a 3-component vector return 4. vec_step may also take a pure type as an argument, e.g. vec_step(float2). These tests carry out a thorough examination of the vec_step function.
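As an illustration of the select and vec_step semantics tested in categories 23 and 25, the following OpenCL C fragment (written for this text, not taken from the test suite) shows both behaviours:

    int4 a = (int4)(1, 2, 3, 4);
    int4 b = (int4)(5, 6, 7, 8);
    int4 c = (int4)(0, -1, 0, -1);  /* MSB is set exactly where c[i] < 0 */

    int4 r = select(a, b, c);       /* r = (1, 6, 3, 8): b[i] is chosen
                                       wherever the MSB of c[i] is set  */

    int n3 = vec_step(float3);      /* 4: 3-component vectors report 4  */
    int n1 = vec_step(float);       /* 1: all scalar types report 1     */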


Chapter 5

OpenCL implementation for the NEMA|S GPU

The OpenCL implementation for the NEMA GPUs will be presented in this chapter. We used POCL version 1.0, and all the work was developed under the Linux environment. The Zynq®-7000 SoC ZC706 platform and a computer with an Intel(R) Core(TM) i3-6100 CPU were used throughout this thesis. The Vertex Processor of the NEMA|S GPU was installed in the ZC706 platform, and all of the kernel code that had the NEMA|S GPU as the targeted device ran on this processing element. The work is divided into three sections. The first section is about building POCL on the host machines and adding the NEMA|S GPU as a target device. In the second section, the procedure of building kernel code for the NEMA|S GPU is analyzed. In the last section, the POCL backend implementation for the same device is presented and discussed.

5.1 Building POCL - Learning OpenCL

The first step of this process was to download POCL version 1.0 from git and build it on the x86-64 machine. The ocl-icd (Installable Client Driver) was enabled at this stage (this was a prerequisite for running the conformance tests later on). The ocl-icd gives the user the ability to choose the desired implementation of OpenCL. To direct the ocl-icd to use only the POCL in the build tree, the appropriate value was assigned to the environment variable OCL_ICD_VENDORS. An x86-64 machine and the ARM Cortex-A9 processors of the ZC706 platform were used as a reference throughout this work, because POCL has backends supporting both of their architectures; this way all of the kernel code that ran on the Vertex Processor could be cross-checked for faults. The process of building POCL was done via CMake files. All the necessary files for building POCL on the x86-64 machine and on the ARM processors were found by the gcc compiler, which was used for this procedure, either in the root file system of the Linux operating system or in the downloaded sources. After having POCL successfully built on the x86-64 machine, a number of demos were developed in order to better understand the concepts of the OpenCL framework. These demos were run on the x86-64 platform so as to verify that the building procedure was successful. The host-side code and the kernel code of an OpenCL application for matrix addition will be briefly presented here. Most of the functions that run on the host side have a device-specific behaviour; the implementation of all these functions had to be written from scratch for the NEMA|S GPU. The kernel code for the matrix addition is presented first. Two buffer objects were used as input arguments and one as the output argument. Calling the get_global_id function returns the unique global work-item ID for the dimension specified in the parentheses; the global work-item ID identifies the work-item among the total number of global work-items specified to execute the kernel. This way the integer "i" iterates over all the elements in this dimension, and the values at the same position in the two arrays are added. The dimensions of this demo are specified later on; this is a single-dimension example.

    __kernel void
    matrixaddition(__global float *output, __global float *frstinput,
                   __global float *sndinput)
    {
        int i = get_global_id(0);  /* i = [0 .. global_dimension - 1] */
        output[i] = frstinput[i] + sndinput[i];
    }

Below, the code that runs on the host side is illustrated. First, the list of available platforms and devices is obtained, and a context is created for the desired device. Contexts are used by the OpenCL runtime for managing objects such as command queues, memory, program and kernel objects, and for executing kernels on one or more devices specified in the context.
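The portion of the listing that performs this platform, device and context setup did not survive the extraction of this document, so a minimal generic sketch of such code is given here (the variable names are chosen to match the rest of the listing):

    cl_platform_id platform;
    cl_device_id   device_id;
    cl_int         err;

    /* take the first available platform and device */
    err = clGetPlatformIDs(1, &platform, NULL);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device_id, NULL);

    /* create a context for the selected device */
    cl_context context = clCreateContext(NULL, 1, &device_id,
                                         NULL, NULL, &err);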

    /* Read OpenCL kernel */
    source_file = fopen("MatrixAddition.cl", "r");
    if (source_file == NULL) {
        source_file = fopen(SRCDIR "/MatrixAddition.cl", "r");
    }

    assert(source_file != NULL && "MatrixAddition.cl not found!");
    fseek(source_file, 0, SEEK_END);
    source_size = ftell(source_file);
    fseek(source_file, 0, SEEK_SET);
    source = (char *) malloc(source_size + 1);
    assert(source != NULL);
    fread(source, source_size, 1, source_file);
    source[source_size] = '\0';
    fclose(source_file);

    /* Finished reading OpenCL kernel */
    input  = (cl_float *) malloc(sizeof(cl_float) * HEIGHT * WIDTH);
    output = (cl_float *) malloc(sizeof(cl_float) * HEIGHT * WIDTH);
    for (i = 0; i
    /* ... listing truncated here in the source: the input initialization
       loop and the clCreateProgramWithSource()/clBuildProgram() calls are
       missing from the extracted text ... */

    kernel = clCreateKernel(program, "matrixaddition", &err);
    if (!kernel || err != CL_SUCCESS)
    {
        printf("Error: Failed to create compute kernel!\n");
        return EXIT_FAILURE;
    }
    read0 = clCreateBuffer(context, CL_MEM_READ_ONLY,
                           sizeof(float) * HEIGHT * WIDTH, NULL, NULL);
    read1 = clCreateBuffer(context, CL_MEM_READ_ONLY,
                           sizeof(float) * HEIGHT * WIDTH, NULL, NULL);
    write = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                           sizeof(float) * HEIGHT * WIDTH, NULL, NULL);
    if (!read0 || !read1 || !write)
    {
        printf("Error: Failed to allocate device memory!\n");
        exit(1);
    }
    cmd_queue = clCreateCommandQueue(context, device_id, 0, &err);
    if (!cmd_queue)
    {
        printf("Error: Failed to create a command queue!\n");
        return EXIT_FAILURE;
    }
    err = clEnqueueWriteBuffer(cmd_queue, read0, CL_TRUE, 0,
                               sizeof(float) * HEIGHT * WIDTH, input,
                               0, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to write to source array!\n");
        exit(1);
    }
    err = clEnqueueWriteBuffer(cmd_queue, read1, CL_TRUE, 0,
                               sizeof(float) * HEIGHT * WIDTH, input,
                               0, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to write to source array!\n");
        exit(1);
    }
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &write);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &read0);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &read1);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to set kernel arguments! %d\n", err);
        exit(1);
    }
    local = 2;

    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                                 &global, &local, 0, NULL, NULL);
    if (err)
    {
        printf("Error: Failed to execute kernel! %d\n", err);
        return EXIT_FAILURE;
    }

    clFinish(cmd_queue);

    err = clEnqueueReadBuffer(cmd_queue, write, CL_TRUE, 0,
                              sizeof(float) * HEIGHT * WIDTH, output,
                              0, NULL, NULL);

    /* error checking */
    error_checking(input, input, output);

    clReleaseMemObject(read0);
    clReleaseMemObject(read1);
    clReleaseMemObject(write);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);

    return 0;

The procedure of reading and building the OpenCL code is also depicted. We read the external file and store the OpenCL code as a string. Then, by calling the clCreateProgramWithSource() function, a program object is created within the previously defined context, and the string holding the kernel code is loaded into the program object. After the OpenCL kernel code is loaded, the program executable is built from the program object by calling the clBuildProgram() function. A kernel object is also created for later use. The creation of a buffer object is done by calling the clCreateBuffer() function. OpenCL defines a plethora of different flags to use when creating a buffer object. In this example the flag CL_MEM_READ_ONLY indicates that the buffer object can only be read by the kernel, whereas the CL_MEM_WRITE_ONLY flag indicates that the buffer object can only be written by the kernel. Afterwards, the buffer objects are written with the desired values (in this case we used the same values in both objects) by calling the clEnqueueWriteBuffer() function. The created buffer objects are passed as kernel arguments, by calling the clSetKernelArg() function, in the right order according to the kernel function declaration. The second parameter of this function is the argument index. Arguments to the kernel are referred to by indices that go from 0, for the leftmost argument, to n - 1, where n is the total number of arguments declared by the kernel. The last argument of this function is a pointer to the data that should be used as the argument value for the argument specified by the index. The data pointed to by the pointer is copied, so the pointer can be reused by the application after clSetKernelArg() returns. The kernel function is launched by calling the clEnqueueNDRangeKernel() function. In this function the number of dimensions is specified; in our case there is only one dimension. The number of dimensions is used to specify the global work-items and the work-items in each work-group, and it must be greater than zero and less than or equal to three. The global variable here defines the total number of work-items, and the local variable the number of work-items that make up a work-group (also referred to as the size of the work-group) that will execute the kernel. After the kernel execution, we read the buffer object that stores the output by calling clEnqueueReadBuffer() and verify that the results are correct. Finally, we release the memory that was occupied by all the objects. After building POCL and running the demo examples on the x86-64 machine, I moved to the ZC706 platform. An SD card was used to boot Linux on this platform, and communication with it was done over the SSH protocol. By utilizing NFS (Network File System), the directory containing POCL's files was mounted from my host x86-64 computer to the ZC706 platform. NFS allows local file systems to be mounted over a network, so that remote hosts can interact with them as if they were mounted locally on the same system. In order to build POCL on the ARM processors, the same procedure was followed as on the x86-64 processor. Later on, I ran the demo examples that were implemented earlier on the new platform and cross-checked the results to verify that everything went smoothly.
The last step of this task was to add the NEMA|S GPU to the list of devices that can be selected to run OpenCL code, and also to add all the necessary files for the GPU to the POCL build process. These files were the ToolChainAPI.so library, which is used to build kernel code into an executable file for NEMA|S (all of the NEMA GPUs have an LLVM backend); the NemaGFX.so library, which is used in the implementation of the target-specific functions of the OpenCL API; and finally all the header files holding the declarations of the functions included in those two libraries. The build of POCL produced three libraries. First, there is the library holding the implementation of all the kernel functions, also called the "libkernel", as a .bc file; the libkernel library is built on the host device but runs on the target device. Then there is libllvmopencl.so, which is used by the compiler. Finally, there is the libpocl.so library holding the implementation of all the OpenCL API functions; against this library we linked the NemaGFX.so and ToolChainAPI.so libraries. The libpocl.so library runs on the host device. The selection of the target device was done from the command line by setting the appropriate value to the environment variable POCL_DEVICES.

5.2 Compiling OpenCL code for the NEMA|S GPU

After having POCL successfully built on the development platform, the next challenge was to produce, from a kernel function, an executable file that could run on the Vertex Processor of the NEMA|S GPU. All of the code for this operation was written in a file called "nema-gen.cc". The procedure of compiling OpenCL code is done in two stages and will be analyzed step by step in this section. All files produced at intermediate stages are stored in a directory defined by the value of the environment variable TEMP_FILES_DIR. First, when the clBuildProgram() function is called, an LLVM-IR file is produced holding the representation of the OpenCL kernel; the rest of the process takes place when the clEnqueueNDRangeKernel() function is called. The implementation of the POCL backend will be presented in the next section. In order to launch the kernel, we produce our own main function, called "tsi_main", whose structure is illustrated in Figure 5.1. To set the value of a kernel argument, the clSetKernelArg() function is called with a pointer to the data that should be used as the argument value. The pointers to all the argument values are stored sequentially: the nema_args pointer holds the address of the first pointer, and by adding 4 bytes to each address (the ZC706 platform has a 32-bit architecture) we can access all of them. The body of the "tsi_main" function consists of the kernel function call only.
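To make this concrete, the following is an illustrative reconstruction of what the generated entry point looks like for the matrix-addition kernel of Section 5.1. It is a sketch written for this text, following the argument layout just described; the literal output of nema-gen.cc is shown in Figure 5.1.

    /* nema_args: address of the first stored argument pointer; each entry
       occupies 4 bytes on the 32-bit target */
    void tsi_main(char *nema_args)
    {
        __global float *output    = *(__global float **)(nema_args + 0);
        __global float *frstinput = *(__global float **)(nema_args + 4);
        __global float *sndinput  = *(__global float **)(nema_args + 8);

        /* the body consists of the kernel function call only */
        matrixaddition(output, frstinput, sndinput);
    }

Below, the procedure of producing the kernel executable is illustrated. First, a new instance of the LLVM toolchain is created and the target device is defined; the LLVM backend must use the right configuration so as to produce the appropriate binary code for NEMA|S.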

    // * CREATE TOOLCHAINAPI
    ToolchainAPI *main_compiler = new ToolchainAPI();
    TapiTarget target = TARGET_NEMA_S;

    // * CL TO LLVM

FIGURE 5.1: Our tsi_main function for the kernel compilation

    std::string llvm_name = path_files + "main.ll";
    //printf("\n llvm_name = %s\n", llvm_name.c_str());
    TapiResult tapi_llvm = main_compiler->ClToLlvmAssembly(main_name_s.c_str(),
                                                           llvm_name.c_str(),
                                                           TARGET_NEMA_S,
                                                           "");

    // * LINK kernel.ll and main.ll
    std::string llvm_output = path_files + "output.ll";

    std::string kernel_llvm_ir;
    kernel_llvm_ir = path_files + (std::string)kernel->name + ".ll";

    TapiResult tapi_llvm_output = main_compiler->LlvmLink(3,
                                                          llvm_name.c_str(),
                                                          kernel_llvm_ir.c_str(),
                                                          llvm_output.c_str());

    if (tapi_llvm != TAPI_SUCCESS) {
        printf("ERROR OCCURRED in CppToLlvmAssembly: %d\n", tapi_llvm);
        return -1;
    }

    // * LLVM TO OBJECT
    int result = llvm_to_object(final_path, "output");

    if (result == -1) {
        assert(0 && "KERNEL DOES NOT COMPILE SUCCESSFULLY");
    }

    // * OBJECT TO EXECUTABLE
    std::string cl_jump_name = (std::string)cl_jump + "cl_jump.o";
    std::string executable_name = path_files + (std::string)kernel->name + ".obj";
    std::string kernel_object_name = path_files + "output.o";

    TapiResult tapi_exe = main_compiler->ObjectToExecutable(kernel_object_name.c_str(),
                                                            executable_name.c_str(),
                                                            TARGET_NEMA_S,
                                                            cl_jump_name.c_str());

    if (tapi_exe != TAPI_SUCCESS) {
        printf("ERROR OCCURRED in ObjectToExecutable: %d\n", tapi_exe);
        return -1;
    }

    // * EXECUTABLE TO BINARY
    std::string binary_name = path_files;
    binary_name = binary_name + kernel->name + ".bin";

    TapiResult tapi_binary = main_compiler->ElfToBinary(executable_name.c_str(),
                                                        binary_name.c_str(),
                                                        TAPI_ELF_EXECUTABLE,
                                                        "");

    if (tapi_binary != TAPI_SUCCESS) {
        printf("ERROR OCCURRED in ElfToBinary: %d\n", tapi_binary);
        return -1;
    }

    main_func.str("");
    delete main_compiler;

    return 1;

Afterwards, the tsi_main function is converted to LLVM assembly (LLVM-IR) by calling ClToLlvmAssembly(); the produced LLVM-IR file is named main.ll and is stored in the TEMP_FILES_DIR directory. The main_name_s variable holds the string representation of the kernel function. The file holding the kernel representation, produced when the clBuildProgram() function was called, and the file "main.ll" that the previous step produced are linked together, producing the "output.ll" file. The final executable will be created from the "output.ll" file: from the LLVM-IR file we produce an assembly file, and from the assembly file the object file emerges. In both functions the target device was specified, because this operation is target-specific; at this stage the LLVM backend used the appropriate ISA. Finally, the binary file is produced, which will be loaded into the graphics memory by the host device in order to be executed by the Vertex Processor. This stage required two function calls: one to get the executable (ELF) file from the object file and one to get the binary file.

5.3 Implementation of the POCL backend

In this section the implementation of the OpenCL API functions that run on the host side but have a target-specific behaviour will be presented. A file called "nema.c" holds the implementation of all these functions; the "nema.c" and "nema-gen.cc" files were included in the POCL build process. Not all of the implemented functions can be analyzed in this document, because they are quite numerous, so only the functions that perform the most crucial operations, such as launching a kernel, will be depicted. Images are not supported in this implementation, because a lot of work still needs to be done in the hardware of NEMA|S. Local parameters and local variables are also not currently supported in kernel functions.
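Before going through the individual functions, the following sketch shows how such driver hooks are typically registered with POCL; the field names follow POCL 1.0's struct pocl_device_ops, while the exact registration code in "nema.c" is not reproduced in this document.

    /* hook the nema driver callbacks into POCL's device-ops table (sketch) */
    void
    pocl_nema_init_device_ops (struct pocl_device_ops *ops)
    {
      ops->device_name   = "nema";
      ops->alloc_mem_obj = pocl_nema_alloc_mem_obj; /* backs clCreateBuffer */
      ops->write         = pocl_nema_write;  /* backs clEnqueueWriteBuffer  */
      ops->read          = pocl_nema_read;   /* backs clEnqueueReadBuffer   */
      ops->run           = pocl_nema_run;    /* backs clEnqueueNDRangeKernel */
    }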

5.3.1 Implementation of the clCreateBuffer function

For the allocation of buffer objects, a function called pocl_nema_alloc_mem_obj was implemented. Whenever a buffer object needs to be stored in the graphics memory, the nema_buffer_create function is called. If there is an allocation failure, pocl_nema_alloc_mem_obj returns the CL_MEM_OBJECT_ALLOCATION_FAILURE value and the program execution is terminated. The nema_buffer_create function, which is part of NemaGFX, takes the size of the buffer as its input argument and returns a nema_buffer_t struct. This struct holds information regarding the size of the buffer and the virtual and physical addresses where the buffer is stored; this way we can easily access the buffer for purposes like writing and reading. The first thing done whenever the clCreateBuffer function is called is to check whether space is already allocated for this memory object. If there is, we use that allocated space; otherwise a number of different actions are performed, depending on the flag that was used in the call to clCreateBuffer. According to the OpenCL specification, the flags that can be used when creating a buffer object are the following.

1. CL_MEM_READ_ONLY: For this flag we just allocate the desired space.

2. CL_MEM_WRITE_ONLY: For this flag we just allocate the desired space.

3. CL_MEM_READ_WRITE: For this flag we just allocate the desired space.

4. CL_MEM_USE_HOST_PTR: For this flag we allocate the desired space and write the buffer object by calling the memcpy function with the data starting at the address host_ptr points to; we also store the address of this void pointer, because after the kernel execution we want to write the results from the buffer object back to host_ptr. When using this flag, there is no need to call the clEnqueueReadBuffer and clEnqueueWriteBuffer functions in order to read or write the buffer object.

5. CL_MEM_ALLOC_HOST_PTR: For this flag we allocate the desired space and assign the virtual address of the buffer object to the mem_host_ptr variable for later use.

6. CL_MEM_COPY_HOST_PTR: For this flag we allocate the desired space, assign the virtual address of the buffer object to the mem_host_ptr variable, and finally write the buffer object by calling the memcpy function with the data starting at the address host_ptr points to.

Whenever we want to access a memory object in the graphics memory from the host side, we have to use the virtual addresses, otherwise a segmentation fault will occur.
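As a usage illustration of the CL_MEM_USE_HOST_PTR path described in item 4 above, the following minimal sketch (with a hypothetical host array "input") shows why no explicit read/write calls are needed:

    /* "input" is a hypothetical host array; the driver copies it into
       graphics memory at creation time */
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                sizeof(cl_float) * HEIGHT * WIDTH,
                                input, &err);
    /* ... set as kernel argument and launch the kernel ... */
    /* after execution, the driver copies the buffer contents back to
       "input"; no clEnqueueReadBuffer is required */

The allocation routine itself is listed below.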

    cl_int
    pocl_nema_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void *host_ptr)
    {
      cl_mem_flags flags = mem_obj->flags;
      unsigned i;
      POCL_MSG_PRINT_MEMORY (" mem %p, dev %d\n", mem_obj, device->dev_id);
      /* check if some driver has already allocated memory for this mem_obj
         in our global address space, and use that */
      for (i = 0; i < mem_obj->context->num_devices; ++i)
        {
          if (!mem_obj->device_ptrs[i].available)
            continue;
          if (mem_obj->device_ptrs[i].global_mem_id == device->global_mem_id
              && mem_obj->device_ptrs[i].mem_ptr != NULL)
            {
              mem_obj->device_ptrs[device->dev_id].mem_ptr =
                  mem_obj->device_ptrs[i].mem_ptr;
              POCL_MSG_PRINT_MEMORY (
                  "mem %p dev %d, using already allocated mem\n", mem_obj,
                  device->dev_id);
              return CL_SUCCESS;
            }
        }

      nema_buffer_t *nema_bo = (nema_buffer_t *) 0;

      if ((flags & CL_MEM_USE_HOST_PTR) && (host_ptr != NULL))
        {
          /* allocate room for the struct plus the stored host_ptr address */
          nema_bo = (nema_buffer_t *) malloc (sizeof (nema_buffer_t)
                                              + sizeof (uintptr_t));
          *nema_bo = nema_buffer_create (mem_obj->size);
          uintptr_t *nema_bo_host_ptr = (uintptr_t *) (nema_bo + 1);
          *nema_bo_host_ptr = (uintptr_t) host_ptr;
          if (nema_bo->base_virt == NULL) {
              free (nema_bo);
              return CL_MEM_OBJECT_ALLOCATION_FAILURE;
          }
          mem_obj->shared_mem_allocation_owner = device;
          memcpy (nema_bo->base_virt, host_ptr, mem_obj->size);
        }
      else
        {
          POCL_MSG_PRINT_MEMORY ("! USE_HOST_PTR\n");
          nema_bo = (nema_buffer_t *) malloc (sizeof (nema_buffer_t));
          *nema_bo = nema_buffer_create (mem_obj->size);
          if (nema_bo->base_virt == NULL) {
              free (nema_bo);
              return CL_MEM_OBJECT_ALLOCATION_FAILURE;
          }
          mem_obj->shared_mem_allocation_owner = device;
        }

      /* use this dev mem allocation as host ptr */
      if ((flags & CL_MEM_ALLOC_HOST_PTR) && (mem_obj->mem_host_ptr == NULL)) {
          mem_obj->mem_host_ptr = nema_bo->base_virt;
      }

      if (flags & CL_MEM_COPY_HOST_PTR)
        {
          assert (host_ptr != NULL);
          memcpy (nema_bo->base_virt, host_ptr, mem_obj->size);
          mem_obj->mem_host_ptr = nema_bo->base_virt;
        }
      mem_obj->device_ptrs[device->dev_id].mem_ptr = (void *) nema_bo;
      return CL_SUCCESS;
    }

5.3.2 Implementation of the clEnqueueWriteBuffer function

In order to write to a buffer object from host memory, a function called pocl_nema_write was implemented. The void pointer device_ptr points to the address in host memory where the nema_buffer_t struct is stored; this struct holds the information regarding the buffer object that was created when the clCreateBuffer function was called. The void pointer host_ptr points to the data that we want to transfer. In order to access from the host side a memory object that is stored in graphics memory, we use virtual addresses. The data transfer itself is done using the memcpy function.

    void
    pocl_nema_write (void *data, const void *host_ptr, void *device_ptr,
                     size_t offset, size_t cb)
    {
      if (host_ptr == device_ptr)
        return;
      const nema_buffer_t *nema_bo = (const nema_buffer_t *) device_ptr;

      memcpy ((char *) nema_bo->base_virt + offset, host_ptr, cb);
      int *intptr = (int *) ((char *) nema_bo->base_virt + offset); /* unused */
    }

5.3.3 Implementation of the clEnqueueReadBuffer function

In order to read from a buffer object into host memory, a function called pocl_nema_read was implemented. The void pointer device_ptr points to the address in host memory where the nema_buffer_t struct is stored; this struct holds the information regarding the buffer object that was created when the clCreateBuffer function was called. The void pointer host_ptr points to an address in host memory at which we want to store the contents of the buffer object. To access from the host side any memory object stored in graphics memory, we use virtual addresses. The data transfer itself is done using the memcpy function.

    void
    pocl_nema_read (void *data, void *host_ptr, const void *device_ptr,
                    size_t offset, size_t cb)
    {
      if (host_ptr == device_ptr)
        return;
      const nema_buffer_t *nema_bo = (const nema_buffer_t *) device_ptr;
      memcpy (host_ptr, (char *) nema_bo->base_virt + offset, cb);
    }

5.3.4 Implementation of the clEnqueueNDRangeKernel function

In order to execute a kernel on the Vertex Processor of the NEMA|S GPU, the pocl_nema_run function was implemented. The first operation of this function is to check the type of the kernel arguments, because samplers, images and local arguments are not supported in our implementation so far. Then we call the main_codegen function, and the procedure described in the previous section follows; we end up with the binary file of the kernel function located in host memory. By calling the load_objects function we load the kernel executable from host memory to graphics memory, and we also load the kernel address to the GPU. After the kernel executable is stored in NEMA's memory, we load the pocl_context to NEMA's constant registers and set up the stack pointer. The registers are written in the following order:

constant register 1: global_offset[3]
constant register 2: group_id[3]
constant register 3: local_size[3]
constant register 4: num_groups[3]
constant register 5: work_dims[3]
constant register 6: local_x
constant register 7: local_y
constant register 8: local_z

The next step is to execute the kernel. The CPU starts sequentially, in a for loop, all the threads required for the kernel execution. If we had used the Fragment Processor in order to exploit multiple hardware threads, this would not be possible. In order to access all the global work-items, we iterate over all the work-groups, and within each work-group we access all of its local items. The kernel execution starts by calling the nema_vertex_start function. After the kernel execution is complete, we check whether a buffer object was created with the CL_MEM_USE_HOST_PTR flag, in order to write the data from the buffer back to host memory. Finally, we destroy all the nema buffers so as to avoid memory leaks.

•1 void 2 pocl_nema_run 3 (void ∗data , _cl_command_node∗ cmd) { 4 5 s t r u c t pocl_argument ∗ pocl_arg ; 5.3. Implementation of the POCL backend 53

6 size_t x, y, z; 7 unsigned i; 8 s t r u c t pocl_context ∗pc = &cmd−>command . run . pc ; 9 cl_kernel kernel = cmd−>command . run . kernel ; 10 pocl_arg = &(cmd−>command.run. arguments [0]) ; 11 nema_buffer_t ∗bo ; 12 char ∗ kernel_bin_path = 0; 13 void∗ nema_gen_buffer = 0; 14 assert (data != NULL); 15 16 i n t arg_idx=0; 17 f o r (i = 0; i < kernel −>num_args; ++i){ 18 pocl_arg = &(cmd−>command.run.arguments[ i ]) ; 19 i f (kernel −>arg_info[i ]. is_local){ 20 a s s e r t (0 &&"arg_info[i]. is_local"); 21 } 22 e l s e if (kernel −>arg_info [ i ] . type == POCL_ARG_TYPE_POINTER) { 23 assert(pocl_arg −>value ) ; 24 cl_mem mem = ∗( cl_mem ∗) pocl_arg −>value ; 25 a s s e r t ( mem−>device_ptrs [cmd−>device −>dev_id ].mem_ptr ) ; 26 nema_buffer_t ∗nema_bo = (nema_buffer_t ∗) mem−> device_ptrs [cmd−>device −>dev_id ].mem_ptr; 27 continue; 28 } 29 e l s e if (kernel −>arg_info [ i ] . type == POCL_ARG_TYPE_IMAGE) { 30 a s s e r t (0 &&"POCL_ARG_TYPE_IMAGE"); 31 } 32 e l s e if (kernel −>arg_info [ i ] . type == POCL_ARG_TYPE_SAMPLER) { 33 a s s e r t (0 &&"POCL_ARG_TYPE_SAMPLER"); 34 } 35 e l s e{ 36 continue; 37 } 38 } 39 40 41 f o r (i = kernel −>num_args; i < kernel −>num_args + kernel −> num_locals; ++i){ 42 pocl_arg = &(cmd−>command.run.arguments[ i ]) ; 43 a s s e r t ( 0 ) ; 44 } 45 nema_buffer_t ∗nema_bobj = (nema_buffer_t ∗) 0 ; 46 i n t result_main = main_codegen(cmd,&nema_gen_buffer) ; 47 i f ( result_main == −1) { 48 a s s e r t (0 &&"main.cl did not generate successfully\n"); 49 } 50 51 //∗ Kernel Path 52 char ∗ kernel_bin_name = (char ∗) malloc(1 + strlen(kernel −>name ) + s t r l e n (".bin")); 53 strcpy(kernel_bin_name , kernel −>name) ; 54 strcat(kernel_bin_name ,".bin"); 55 56 kernel_bin_path = (char ∗) malloc(1 + strlen(cmd−>command . run . tmp_dir)+ strlen(kernel_bin_name) ); 57 strcpy(kernel_bin_path , cmd−>command . run . tmp_dir ) ; 58 strcat (kernel_bin_path , kernel_bin_name) ; 54 Chapter 5. OpenCL implementation for the NEMA|S GPU

//load shader to graphics memory
load_objects(&kernel_bin_path);

nema_cmdlist_t cl = nema_cl_create();
nema_cl_bind(&cl);
nema_cl_rewind(&cl);

//load shader address to the GPU
nema_bind_vertex_assembler(shader_bo.base_phys);

pc->local_size[0] = cmd->command.run.local_x;
pc->local_size[1] = cmd->command.run.local_y;
pc->local_size[2] = cmd->command.run.local_z;

//load the pocl_context to NEMA's constant registers:
//  reg 1: global_offset[3]
//  reg 2: group_id[3]
//  reg 3: local_size[3]
//  reg 4: num_groups[3]
//  reg 5: work_dims[3]
//  reg 6: local_x
//  reg 7: local_y
//  reg 8: local_z

//nema_buffer for global_offset
nema_buffer_t n_global_offset = nema_buffer_create(sizeof(int) * 3);
nema_buffer_map(&n_global_offset);
int global_offsets[3];
global_offsets[0] = pc->global_offset[0];
global_offsets[1] = pc->global_offset[1];
global_offsets[2] = pc->global_offset[2];
memcpy(n_global_offset.base_virt, &global_offsets, sizeof(int) * 3);
nema_vertex_set_const_reg(1, n_global_offset.base_phys);

//nema_buffer for local_size
nema_buffer_t n_local_size = nema_buffer_create(sizeof(int) * 3);
nema_buffer_map(&n_local_size);
int local_sizes[3];
local_sizes[0] = pc->local_size[0];
local_sizes[1] = pc->local_size[1];
local_sizes[2] = pc->local_size[2];
memcpy(n_local_size.base_virt, &local_sizes, sizeof(int) * 3);
nema_vertex_set_const_reg(3, n_local_size.base_phys);

//nema_buffer for num_groups
nema_buffer_t n_num_groups = nema_buffer_create(sizeof(int) * 3);
nema_buffer_map(&n_num_groups);
int groups[3];
groups[0] = pc->num_groups[0];
groups[1] = pc->num_groups[1];
groups[2] = pc->num_groups[2];
memcpy(n_num_groups.base_virt, &groups, sizeof(int) * 3);
nema_vertex_set_const_reg(4, n_num_groups.base_phys);

//nema_buffer for work_dim
nema_buffer_t n_work_dims = nema_buffer_create(sizeof(int) * 3);
nema_buffer_map(&n_work_dims);
uint dim = pc->work_dim;
memcpy(n_work_dims.base_virt, &dim, sizeof(uint));
nema_vertex_set_const_reg(5, n_work_dims.base_phys);

//nema_buffer for group_id (filled inside the loops below)
nema_buffer_t n_group_id = nema_buffer_create(sizeof(int) * 3);
nema_buffer_map(&n_group_id);
int group_ids[3];

//setup stack pointer
nema_buffer_t nema_stack = nema_buffer_create(0x10000);
nema_vertex_setup(&nema_stack);

nema_reg_write(0x304, 0x30000000);
nema_reg_write(0x31c, 0x3E80000);

//iterate over every work-group and every work-item;
//the shader is submitted once per work-item
for (int group_z = 0; group_z < pc->num_groups[2]; ++group_z) {
    for (int group_y = 0; group_y < pc->num_groups[1]; ++group_y) {
        for (int group_x = 0; group_x < pc->num_groups[0]; ++group_x) {
            group_ids[0] = group_x;
            group_ids[1] = group_y;
            group_ids[2] = group_z;

            memcpy(n_group_id.base_virt, &group_ids, sizeof(int) * 3);
            nema_vertex_set_const_reg(2, n_group_id.base_phys);

            for (z = 0; z < cmd->command.run.local_z; ++z) {
                nema_vertex_set_const_reg(8, z);
                for (y = 0; y < cmd->command.run.local_y; ++y) {
                    nema_vertex_set_const_reg(7, y);
                    for (x = 0; x < cmd->command.run.local_x; ++x) {
                        nema_vertex_set_const_reg(6, x);
                        //run the shader
                        nema_vertex_start(0);

                        nema_cl_submit(&cl);
                        nema_cl_wait(&cl);
                        nema_cl_rewind(&cl);
                    }
                }
            }
        }
    }
}

//copy buffers allocated with CL_MEM_USE_HOST_PTR back to the host pointer
for (i = 0; i < kernel->num_args; ++i) {
    pocl_arg = &(cmd->command.run.arguments[i]);
    cl_mem mem = *(cl_mem *)pocl_arg->value;
    if ((kernel->arg_info[i].type == POCL_ARG_TYPE_POINTER) &&
        (mem->flags & CL_MEM_USE_HOST_PTR)) {
        //the address of the nema buffer object
        nema_bobj = (nema_buffer_t *)mem->device_ptrs[cmd->device->dev_id].mem_ptr;
        //the address where the address of host_ptr is written
        uintptr_t *nema_bo_host_ptr = (uintptr_t *)(nema_bobj + 1);
        memcpy((void *)*nema_bo_host_ptr, nema_bobj->base_virt, nema_bobj->size);
    }
}

//free the command list and all the allocated buffers
nema_cl_destroy(&cl);
nema_buffer_destroy(&shader_bo);
nema_buffer_destroy((nema_buffer_t *)nema_gen_buffer);
nema_buffer_destroy(&n_group_id);
nema_buffer_destroy(&n_local_size);
nema_buffer_destroy(&n_global_offset);
nema_buffer_destroy(&n_num_groups);
}
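The constant-register layout above mirrors the fields of POCL's pocl_context structure. To make the convention concrete, the following sketch shows how a work-item's global ID can be reconstructed on the kernel side from the values the driver writes into registers 1-8; the function and variable names are illustrative only, not the identifiers actually emitted by the compiler.

/* Illustrative sketch (not generated driver/kernel code): how a work-item
 * index is derived from the values loaded into the constant registers.
 * Register numbers refer to the layout in the listing above. */
static inline int sketch_get_global_id(unsigned dim,
                                       const int global_offset[3], /* reg 1 */
                                       const int group_id[3],      /* reg 2 */
                                       const int local_size[3],    /* reg 3 */
                                       const int local_id[3])      /* regs 6-8 */
{
    /* OpenCL 1.2: global id = global offset + group id * local size + local id */
    return global_offset[dim] + group_id[dim] * local_size[dim] + local_id[dim];
}

Because the driver iterates over work-groups and work-items on the host and resubmits the command list for every work-item, all of these values are constant for the duration of each shader invocation.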

Chapter 6

Proof of Work

In this final chapter the correctness of the implementation is demonstrated. Performance optimizations could not be made because the NEMA driver currently exposes only one "compute unit": it does not yet exploit multiple hardware threads. Multiple NEMA devices can still be used for task-level parallelism through multiple OpenCL devices. For the same reason, no time measurements were taken; such measurements will become meaningful once the implementation uses the Fragment Processor, which supports multiple hardware threads. To verify our work exhaustively, we ran several categories of conformance tests, disabling in every category the tests that require image-related features or __local kernel arguments. Of the categories described in chapter 4, the following completed successfully, confirming the correctness of our implementation. All tests were launched from the command line, and every test that completed successfully printed "passed". The number of tests per category is too large to describe each one in detail, but each test's name conveys its basic purpose. The full list of tests in each category is given below, except for the select and headers categories, where a single kernel was exercised with multiple data types. A minimal sketch of the common structure of these tests is given right after the list.

1. basic: 55 tests
2. buffers: 82 tests
3. device-partition: 11 tests
4. headers: 1 test
5. integer-ops: 92 tests
6. multiple-device-context: 7 tests
7. relationals: 34 tests
8. select: 1 test
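Each of these tests shares the same host-side skeleton: create a context and a queue, build the kernel, run it, read the results back, compare them against a host-computed reference and print "passed" on success. The following is a minimal hand-written sketch in that spirit, patterned after the basic/arraycopy test; it is illustrative, with error checking omitted for brevity, and is not actual conformance-suite source.

/* Minimal conformance-style test sketch: copy a buffer on the device
 * and verify the result on the host. */
#include <stdio.h>
#include <CL/cl.h>

#define N 1024

static const char *src =
    "__kernel void copy(__global const int *in, __global int *out) {\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = in[i];\n"
    "}\n";

int main(void)
{
    int in[N], out[N];
    for (int i = 0; i < N; ++i) { in[i] = i; out[i] = -1; }

    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "copy", NULL);

    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  sizeof(in), in, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);

    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);

    /* verify against the host-computed reference */
    int ok = 1;
    for (int i = 0; i < N; ++i)
        if (out[i] != in[i]) { ok = 0; break; }
    printf("%s\n", ok ? "passed" : "FAILED");

    clReleaseMemObject(d_in); clReleaseMemObject(d_out);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return ok ? 0 : 1;
}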

relationals:
relational_any, relational_all, relational_bitselect, relational_select_signed,
relational_select_unsigned, relational_isequal, relational_isnotequal,
relational_isgreater, relational_isgreaterequal, relational_isless,
relational_islessequal, relational_islessgreater, shuffle_copy,
shuffle_function_call, shuffle_array_cast, shuffle_built_in,
shuffle_built_in_dual_input

device-partition:
partition_equally, partition_by_counts, partition_by_affinity_domain_numa,
partition_by_affinity_domain_l4_cache, partition_by_affinity_domain_l4_cache,
partition_by_affinity_domain_l3_cache, partition_by_affinity_domain_l2_cache,
partition_by_affinity_domain_l1_cache, test_partition_by_affinity_domain_l1_cache,
partition_by_affinity_domain_next_partitionable, test_partition

buffers (continued in a later table):
buffer_read_async_int, buffer_read_async_uint, buffer_read_async_long,
buffer_read_async_ulong, buffer_read_async_short, buffer_read_async_ushort,
buffer_read_async_char, buffer_read_async_uchar, buffer_read_async_float,
buffer_read_int, buffer_read_uint, buffer_read_long, buffer_read_ulong,
buffer_read_short, buffer_read_ushort, buffer_read_char, buffer_read_uchar,
buffer_read_float, buffer_read_half, buffer_read_struct, buffer_read_random_size,
buffer_map_read_int, buffer_map_read_uint, buffer_map_read_long,
buffer_map_read_ulong, buffer_map_read_short, buffer_map_read_ushort,
buffer_map_read_char, buffer_map_read_uchar, buffer_map_read_float,
buffer_map_read_struct, buffer_write_async_int, buffer_write_async_uint,
buffer_write_async_short, buffer_write_async_ushort, buffer_write_async_char,
buffer_write_async_uchar, buffer_write_async_long, buffer_write_async_ulong,
buffer_write_async_float

integer_ops (continued in a later table):
integer_clz, integer_hadd, integer_rhadd, integer_mul_hi, integer_rotate,
integer_clamp, integer_mad_sat, integer_mad_hi, integer_max, integer_min,
integer_upsample, integer_abs, integer_abs_diff, integer_add_sat,
integer_addAssign, integer_subtractAssign, integer_multiplyAssign,
integer_divideAssign, integer_moduloAssign, integer_andAssign, integer_orAssign,
integer_exclusiveOrAssign, unary_ops_increment, unary_ops_decrement,
unary_ops_full, integer_mul24, integer_mad24, long_math, long_logic, long_shift,
long_compare, ulong_math, ulong_logic, ulong_shift, ulong_compare, int_math,
int_logic, int_shift, int_compare, uint_math


multiple-device-context:
context_multiple_contexts_same_device, context_two_contexts_same_device,
context_three_contexts_same_device, context_four_contexts_same_device,
two_devices, max_devices, hundred_queues

buffers (continued):
buffer_write_int, buffer_write_uint, buffer_write_long, buffer_write_ulong,
buffer_write_short, buffer_write_ushort, buffer_write_char, buffer_write_uchar,
buffer_write_float, buffer_write_half, buffer_write_struct, buffer_map_write_int,
buffer_map_write_uint, buffer_map_write_long, buffer_map_write_ulong,
buffer_map_write_short, buffer_map_write_ushort, buffer_map_write_char,
buffer_map_write_uchar, buffer_map_write_float, buffer_map_write_struct,
buffer_fill_int, buffer_fill_uint, buffer_fill_short, buffer_fill_ushort,
buffer_fill_char, buffer_fill_uchar, buffer_fill_long, buffer_fill_ulong,
buffer_fill_float, buffer_fill_struct, buffer_copy, buffer_partial_copy,
mem_read_write_flags, mem_write_only_flags, mem_read_only_flags,
mem_copy_host_flags, mem_alloc_ref_flags, array_info_size,
sub_buffers_read_write, sub_buffers_read_write_dual_devices,
sub_buffers_overlapping, buffer_migrate

integer_ops (continued):
uint_logic, uint_shift, uint_compare, short_math, short_logic, short_shift,
short_compare, ushort_math, ushort_logic, ushort_shift, ushort_compare,
char_math, char_logic, char_shift, char_compare, uchar_math, uchar_logic,
uchar_shift, uchar_compare, popcount, quick_long_math, quick_long_logic,
quick_long_shift, quick_long_compare, quick_ulong_math, quick_ulong_logic,
quick_ulong_shift, quick_ulong_compare, quick_int_math, quick_int_logic,
quick_int_shift, quick_int_compare, quick_uint_math, quick_uint_logic,
quick_uint_shift, quick_uint_compare, quick_short_math, quick_short_logic,
quick_short_shift, quick_short_compare, quick_ushort_math, quick_ushort_logic,
quick_ushort_shift, quick_ushort_compare, quick_char_math, quick_char_logic,
quick_char_shift, quick_char_compare, quick_uchar_math, quick_uchar_logic,
quick_uchar_shift, quick_uchar_compare, vector_scalar

basic:
hostptr, fpmath_float, fpmath_float2, fpmath_float4, intmath_int, intmath_int2,
intmath_int4, intmath_long, intmath_long2, intmath_long4, hiloeo, if, sizeof,
loop, pointer_cast, constant, constant_source, int2float, float2int,
arrayreadwrite, arraycopy, vload_global, vload_constant, vload_private,
vstore_global, vstore_private, createkernelsinprogram, explicit_s2v_bool,
explicit_s2v_char, explicit_s2v_uchar, explicit_s2v_short, explicit_s2v_ushort,
explicit_s2v_int, explicit_s2v_uint, explicit_s2v_long, explicit_s2v_ulong,
explicit_s2v_float, explicit_s2v_double, enqueue_map_buffer,
work_item_functions, astype, prefetch, kernel_call_kernel_function,
host_numeric_constants, kernel_numeric_constants, kernel_limit_constants,
kernel_preprocessor_macros, parameter_types, vector_creation, vec_type_hint,
kernel_memory_alignment_global, kernel_memory_alignment_constant,
kernel_memory_alignment_private, global_work_offsets, get_global_offset

Chapter 7

Summary and Future Work

7.1 Summary

In the present thesis the basic parts of an OpenCL driver were implemented: linking the appropriate software libraries with the POCL backend, compiling the OpenCL kernel code for a specific architecture, reading and writing buffer objects from and to the graphics memory, passing the kernel arguments, loading the kernel executable into the graphics memory and the POCL context (number of work-groups, local id, work-group id, etc.) into the desired hardware (in our case the constant registers), launching the kernel, and properly freeing all the previously bound memory to avoid memory leaks.

7.2 Future Work

The future plans for this implementation of the OpenCL driver for the NEMA|S GPU include, but are not limited to:

• supporting images and samplers

• implementing the rest of the OpenCL API functions with target-specific behaviour

• executing the kernels differently on the underlying hardware so as to exploit multiple hardware threads

• optimizing wherever possible so as to minimize execution time and memory dependence

• completing the rest of the conformance tests successfully


Bibliography

[1] Pekka Jääskeläinen, Carlos Sanchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg (August 2014). pocl: A Performance-Portable OpenCL Implementation. Retrieved from https://www.researchgate.net/publication/265683693

[2] Khronos OpenCL Working Group (11/14/12). The OpenCL Specification, Version 1.2. Retrieved from www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf

[3] Think Silicon S.A. (6/6/2018). NEMA®|GFX-API Library: A Comprehensive Overview. Retrieved from think-silicon.com/wp-content/uploads/2016/11/NemaGFX_API_Manual.pdf

[4] Bogdan Oancea, Tudorel Andrei, Raluca Mariana Dragoescu (2012). GPGPU Computing. Proceedings of the CKS International Conference. Retrieved from arXiv:1408.6923

[5] Robert R. Schaller. Moore's law: past, present, and future. IEEE Spectrum, Volume 34, Issue 6, June 1997, pp. 52–59. IEEE Press, Piscataway, NJ, USA. doi:10.1109/6.591665

[6] Moore GE (1965). Cramming more components onto integrated circuits. Electronics 38(6). ftp://download.intel.com/museum/Moores_Law/Articles-Press_Releases/Gordon_Moore_1965_Article.pdf

[7] Moore GE (1975). Progress in digital integrated electronics. In: Proceedings of the IEEE Electron Devices Meeting, vol. 21. San Francisco, CA, pp. 21–25

[8] Liddle DE (2006). The wider impact of Moore's law. IEEE Solid-State Circuits

[9] Wirth N (1995). A plea for lean software. Computer 28(2):64–68

[10] Brock, David C., ed. (2006). Understanding Moore's Law: Four Decades of Innovation. Philadelphia, PA: Chemical Heritage Foundation. ISBN 978-0941901413

[11] Kamran Karimi, Neil G. Dickson, Firas Hamze. A Performance Comparison of CUDA and OpenCL. D-Wave Systems Inc., 100-4401 Still Creek Drive, Burnaby, British Columbia, Canada V5C 6G9

[12] Michael Haidl, Simon Moll, Lars Klein, Huihui Sun, Sebastian Hack, Sergei Gorlatch. Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC, Article No. 7. Denver, CO, USA, November 12–17, 2017

[13] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, May 2008

[14] Ian Buck et al. GPUBench: Evaluating GPU performance for numerical and scientific applications. In 2004 ACM Workshop on General Purpose Computing on Graphics Processors, pages C–20, 2004

[15] S. Che, J. Meng, J. Sheaffer, and K. Skadron. A performance study of general purpose applications on graphics processors. In The First Workshop on General Purpose Processing on Graphics Processing Units, 2007

[16] Hiroyuki Takizawa and Hiroaki Kobayashi. Hierarchical parallel processing of large scale data clustering on a PC cluster with GPU co-processing. The Journal of Supercomputing, 38(3):219–234, 2006

[17] Ralf Karrenberg, Sebastian Hack. Improving Performance of OpenCL on CPUs. Published in CC 2012. doi:10.1007/978-3-642-28652-0_1

[18] Michael Haidl, Simon Moll, Lars Klein, Huihui Sun, Sebastian Hack, Sergei Gorlatch. An LLVM-based Portable High-Performance Programming Model. Published in LLVM-HPC@SC 2017. doi:10.1145/3148173.3148185

[19] K. E. Batcher (1980). Design of a Massively Parallel Processor. IEEE Transactions on Computers, Vol. C-29, September, pp. 836–840

[20] Clang: A C language frontend for LLVM. URL http://clang.llvm.org/. [Online; accessed 20-Sep-2019]

[21] LLVM compiler infrastructure. URL http://llvm.org/. [Online; accessed 05-Feb-2014]

[22] Shibata, N.: Efficient evaluation methods of elementary functions suitable for SIMD computation. In: Journal of Computer Science on Research and Development, Proceedings of the International Supercomputing Conference ISC10, vol. 25, pp. 25–32 (2010). doi:10.1007/s00450-010-0108-2

[23] Stratton, J.A., Stone, S.S., Hwu, W.M.W.: MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In: J.N. Amaral (ed.) Languages and Compilers for Parallel Computing, LNCS, vol. 5335, pp. 16–30. Springer-Verlag, Berlin, Heidelberg (2008). doi:10.1007/978-3-540-89740-8_2

[24] Lattner, C., Adve, V.: LLVM: A compilation framework for lifelong program analysis & transformation. In: Proc. Int. Symp. Code Generation and Optimization, p. 75 (2004)

[25] Kazuhiko Komatsu, Katsuto Sato, Yusuke Arai, Kentaro Koyama, Hiroyuki Takizawa, and Hiroaki Kobayashi. Evaluating Performance and Portability of OpenCL Programs. Cyberscience Center, Tohoku University, Sendai, Miyagi 980-8578, Japan

[26] Daniel Lustig, Margaret Martonosi. Reducing GPU offload latency via fine-grained CPU-GPU synchronization. Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pp. 354–365, February 23–27, 2013. doi:10.1109/HPCA.2013.6522332

[27] Jääskeläinen, P., Sanchez de La Lama, C., Huerta, P., Takala, J.: OpenCL-based design methodology for application-specific processors. Transactions on HiPEAC 5 (2011). URL http://www.hipeac.net/node/4310

[28] Du, Peng; Weber, Rick; Luszczek, Piotr; Tomov, Stanimire; Peterson, Gregory; Dongarra, Jack (2012). "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming". Parallel Computing

[29] Tarditi, David; Puri, Sidd; Oglesby, Jose (2006). "Accelerator: using data parallelism to program GPUs for general-purpose uses" (PDF). ACM SIGARCH Computer Architecture News

[30] Thomas Rauber; Gudula Rünger (2013). Parallel Programming: for Multicore and Cluster Systems. Springer Science & Business Media. p. 2. ISBN 9783642378010

[31] Boggan, Sha'Kia and Daniel M. Pressel (August 2007). GPUs: An Emerging Platform for General-Purpose Computation. ARL-SR-154, U.S. Army Research Lab

[32] Kessenich, John; Baldwin, Dave. "The OpenGL® Shading Language, Version 4.60.7". The Khronos Group Inc. Retrieved August 21, 2019

[33] https://think-silicon.com/products/hardware/nema-small/

[34] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. 3rd edition, 2002. Morgan Kaufmann, ISBN 1-55860-724-2. Page 43

[35] https://www.khronos.org/conformance/adopters/

[36] Roosta, Seyed H. (2000). Parallel Processing and Parallel Algorithms: Theory and Computation. New York, NY: Springer. p. 114. ISBN 978-0-387-98716-3

[37] Wilson, Gregory V. (1994). "The History of the Development of Parallel Computing". Virginia Tech/Norfolk State University, Interactive Learning with a Digital Library in Computer Science. Retrieved 2008-01-08

[38] https://www.scratchapixel.com/lessons/3d-basic-rendering/rasterization-practical-implementation?url=3d-basic-rendering/rasterization-practical-implementation

[39] https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

[40] Chris Arthur Lattner (B.S., University of Portland, 2000). LLVM: An Infrastructure for Multi-Stage Optimization. Retrieved from https://llvm.org/pubs/2002-12-LattnerMSThesis.pdf