Implementation and Optimization of the OpenCL driver for the NEMA GPUs

Total Pages: 16

File Type: pdf, Size: 1020 KB

UNIVERSITY OF PATRAS
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

Diploma Thesis of Konstantinos Asteriou, son of Nikolaos, student of the Department of Electrical and Computer Engineering of the Polytechnic School of the University of Patras. Student ID: 228281.

Topic: Development and Optimization of the OpenCL driver for the NEMA GPUs
(Implementation and Optimization of the OpenCL driver for the NEMA GPUs)

Supervisor: Michalis Birbas, Assistant Professor, University of Patras
Diploma Thesis Number: 228281/2019
Patras, 12/2019

CERTIFICATION

It is certified that the Diploma Thesis entitled "Implementation and Optimization of the OpenCL driver for the NEMA GPUs", by Konstantinos Asteriou, son of Nikolaos (Student ID: 228281), of the Department of Electrical and Computer Engineering, was publicly presented and examined at the Department of Electrical and Computer Engineering on / /2019.

The supervisor: Michalis Birbas, Assistant Professor. The division director: Vasileios Paliouras, Professor.

Abstract

I) INTRODUCTION

In the past, all software was written for serial processing. To solve a problem, an algorithm was constructed and implemented as a serial sequence of instructions. These instructions executed on a computer with a single processor. Only one instruction executed at a time, and when it finished, the next one followed. The execution time of any program was proportional to the number of instructions, the clock period of the computer, and the cycles required for each instruction.
Over the years computers became more efficient in terms of program execution time, as engineers managed to improve two factors. First, the clock frequency of computers increased significantly; second, in accordance with Moore's law, the number of transistors on a given silicon area doubled roughly every 1.5 years. Characteristically, the famous 8086 processor had 29,000 transistors and a 5 MHz clock, while a modern Intel Core i7 has over 1 billion transistors and a 4 GHz clock. These improvements, however, had the side effect of dramatically increasing the power consumption of processors, which is given by the formula P = C × V² × F, where C is the total capacitance whose input switches per clock cycle, V is the voltage, and F the clock frequency. The engineers' answer to the ever-increasing power consumption was to build multi-core processors with energy-efficient cores. A core is the processing unit of the processor, and all cores can access the same memory location simultaneously. To exploit the multiple cores of a processor, we also created programs that execute in parallel. Parallel programs use multiple processing units simultaneously to solve a problem. This is achieved by "breaking" the problem into smaller pieces, each of which can be executed independently. The processing units that can be used vary: they may range from a single computer with multiple cores, to many interconnected computers, to specialized hardware. To exploit the available hardware parallelism to the fullest and minimize program execution time as far as possible, the programmer must restructure and suitably parallelize their code.
II) OpenCL - POCL

OpenCL is an open standard for parallel programming that comprises a language, an API, libraries, and a runtime, thereby enabling the writing of portable yet efficient programs. Using OpenCL, a programmer can write general-purpose programs that execute on any device compatible with the standard, without having to change anything in the code when switching devices. The portability of OpenCL programs across a wide range of heterogeneous platforms is achieved by describing the device code (the kernel) as strings that are subsequently built by the runtime API for the selected device on which we want it to run. While the CUDA framework is supported only by NVIDIA graphics cards, OpenCL applications can run on a range of devices from different vendors. OpenCL-compliant implementations are available from companies such as Altera, AMD, Xilinx, ARM, Intel, and others. To describe the basic ideas behind OpenCL we will use the following hierarchy of models:
• Platform Model: The OpenCL platform model consists of a host device to which one or more OpenCL devices are connected. OpenCL devices are divided into one or more Compute Units, which are further divided into one or more Processing Elements. All processing elements execute OpenCL code; that is, all the computations described in the OpenCL code happen there. During the execution of an OpenCL program, the host submits commands to orchestrate the execution of the OpenCL code on the processing elements of the selected device. The processing elements can execute a stream of instructions as SIMD or as SPMD units.
A SIMD unit is defined as a class of parallel computers whose processing elements all execute the same code (in our case, OpenCL code), each on its own data, with a common program counter. An SPMD unit, on the other hand, is defined as the programming model in which multiple processing elements execute the same code, each with its own data and its own program counter.

Figure 1: Platform Model

• Execution Model: The execution of an OpenCL program is divided into two parts: one part executes on the host, and the other, namely the OpenCL code, executes on one or more selected devices connected to the host. The host defines a context for the execution of the OpenCL code. This context contains the following resources: the devices that can be used by the host; the kernel, i.e., the OpenCL code that will run on one of the aforementioned devices; the Program Object, i.e., the executable that holds the representation of the OpenCL code; and finally the Memory Objects, i.e., a collection of objects that are visible to the host and hold values that will be used by the OpenCL code. The context is defined by the host and manipulated by it using functions of the OpenCL API. The host defines a data structure called a command queue to coordinate the execution of the OpenCL code, creating and placing on this structure commands that will execute on the device for which the context was previously created. These commands may be kernel-execution commands, commands concerning the memory objects, or synchronization commands. When a kernel is submitted for execution by the host, an index space is defined.
The index space defined in OpenCL is N-dimensional, where N may be 1, 2, or 3, and is called an NDRange. An NDRange is defined as an integer array of size N, specifying the extent of the index space in each dimension, starting from an offset F.

Figure 2: Execution Model

An instance of the kernel executes for each point in the index space. Each kernel instance is called a work-item and is identified by its position in the index space, which provides it with a unique global ID. All work-items share the same code, but the instructions each of them executes and the data each uses may differ per work-item. Work-items are organized into work-groups. Work-groups are assigned a unique work-group ID with the same dimensionality as the index space. Work-items within a work-group are assigned a unique local ID, so that each work-item can be identified either through its global ID or through the combination of its work-group ID and local ID.
• Memory Model: OpenCL defines a memory model in which the work-items executing a kernel have access to four distinct memory regions. First, there is Global Memory, to which all work-items from all work-groups have read and write access; work-items can read and write every element of a Memory Object. Next, there is Constant Memory, which remains unchanged during the execution of the OpenCL code and can be written only before it runs. There is also Local Memory, which is shared by all work-items within a work-group.