HPC in the summer?

summerofhpc.prace-ri.eu, 2019

A long hot summer is time for a break, right? Not necessarily! PRACE Summer of HPC 2019 reports by participants are here.

Leon Kos

There is no such thing as a lazy summer. At least not for the 25 participants and their mentors at 11 PRACE HPC sites.

Summer of HPC is a PRACE programme that offers university students the opportunity to spend two months of the summer at HPC centres across Europe. The students work on projects related to PRACE work, using HPC resources, with the goal of producing a visualisation or a video.

This year, the training week was in Bologna, and it seems to have been the best training week yet! It was a great start to Summer of HPC and set us up for an amazing summer. At the end of the summer, videos were created and are available on YouTube in the PRACE Summer of HPC 2019 presentations playlist. Together with the following articles, interesting code and results are available. Dozens of blog posts were created as well.

At the end of the activity, two projects out of the 25 participants are selected every year and awarded for their outstanding performance. The award ceremony was held in early November 2019 at the PRACE AISBL office in Brussels. The winners of this year are Mahmoud Elbattah for Best Performance and Arnau Miro Janea as HPC Ambassador. Benjamin surpassed expectations while working on an interesting project, carrying out more work than planned, mostly with no assistance; he carried out benchmarking analysis to compare different programming frameworks, and his report was well written in a clear and scientific style. Pablo carried out a great amount of work during his project. More importantly, he has the capacity to clearly present his project to laypersons and the general public. His blog posts were interesting, and his video was professionally created and presented his summer project in a captivating and pleasant way.

Therefore, I invite you to look at the articles and visit the web pages for details, and experience the fun we had this year. What can I say at the end of this wonderful summer? Really, autumn will be wonderful too. Don't forget to smile!

Leon Kos

Contents

1. Memory: Speed Me Up
2. "It worked on my machine... 5 years ago"
3. High Performance Machine Learning
4. Electronic density of Nanotubes
5. In Situ/Web Visualization of CFD Data using OpenFOAM
6. Explaining the faults of HPC systems
7. Parallel Computing Demos on Wee ARCHIE
8. Performance of Python Program on HPC
9. Investigations on GASNet's active messages
10. Studying an oncogenic mutant protein with HPC
11. Lead Optimization using HPC
12. for Matrix Computations
13. Billion bead baby
14. Switching up Neural Networks
15. Fastest sorting algorithm ever
16. CFD: Colorful Fluid Dynamics? No with HPC!
17. Emotion Recognition using DNNs
18. FMM: A GPU's Task
19. High Performance Lattice Field Theory
20. Encrypted disks in HPC
21. A data pipeline for weather data
22. A glimpse into plasma fusion
23. Predicting Electricity Consumption
24. Energy reporting in Slurm Jobs
25. Benchmarking Deep Learning for SoHPC

PRACE SoHPC 2019 Coordinator: Leon Kos, University of Ljubljana
Phone: +386 4771 436  E-mail: [email protected]
More information: http://summerofhpc.prace-ri.eu

Enabling hardware profiling to exploit heterogeneous memory systems

Memory: Speed Me Up

Dimitrios Voulgaris

Memory interaction is one of the most performance-limiting factors in modern systems. It would be of interest to determine how specific applications are affected, and also to explore whether combining diverse memory architectures can alleviate the problem.

Supercomputers are gradually establishing their position in almost every scientific field. Huge amounts of data require storage and processing, and therefore impel the exascale era to arrive even sooner than anticipated. This immense shift cannot be achieved without seeing the big picture. Processing power is usually considered the determinant factor when it comes to computer performance, and indeed, so far, all efforts in that direction did seem fruitful. CPU performance, however, has been extensively researched and thus optimised, leaving scarcely any opportunities for improvement.

INTRODUCTION

Heterogeneous memory (HM) systems accommodate memories featuring different characteristics such as capacity, bandwidth, latency, energy consumption or volatility. While HM systems present opportunities in different fields, the efficient usage of such systems requires prior application knowledge, because developers need to determine which data objects to place in which of the available memory tiers [1]. Given the limited nature of "fast" memory, it becomes clear that the approach of placing the most often accessed data objects in the fastest memory can be quite misleading. In order to identify the data which benefit the most from being hosted in different memory subsystems, it is important to gain insight into the application behaviour.

Application profiling comes in two flavours, software and hardware based. Tools like EVOP [2], which is an extension of Valgrind [3], VTune [4], as well as others, offer exclusively instruction-based instrumentation, i.e. monitoring and intercepting of every executed instruction. The additional time overhead implied by this technique is more than apparent; therefore, vendors have come up with the idea of enhancing hardware, making it capable of aiding in application profiling. Specialized embedded hardware counters deploy sampling mechanisms to provide rich insight into hardware events. PEBS [5] is the implementation of such a mechanism in the majority of modern processors.

These two approaches essentially have the same ultimate scope of highlighting the application behaviour regarding its memory interaction, in order to arrive at an optimal performance-wise memory object distribution. Nevertheless, the former approach accounts for each and every memory access, while the second one performs a sampling of the triggered events. It would be of great interest to find out whether, by forcing software to imitate the hardware's methodology, i.e. sampling, we are capable of achieving similar final results. This was the original target of the project.

BACKGROUND

Memory architecture describes the methods used to implement computer data storage in the best possible manner. "Best" describes a combination of the fastest, most reliable, most durable, and least expensive way to store and retrieve information [6]. Traditionally, and from a high point of view, a memory system is a mere cache and a RAM memory.

Caches are very specialised and hence complicated pieces of hardware, placed in a hierarchical way very close to the processing unit. Divided into layers (usually 3 in modern machines), every cache element presents different characteristics regarding size and data retrieval latency.

RAM, being an equally complex piece of hardware, presents significantly bigger capacity, which qualifies it as the ideal main storage unit of the system. On the downside, it is characterised by orders of magnitude longer access times, which makes it quite inefficient, yet necessary to access.
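The latency gap between these layers can be made concrete with the classic average memory access time (AMAT) model. The sketch below is illustrative only: the latencies and hit rates are assumed round numbers, not values measured in this project.

```python
# Toy average-memory-access-time (AMAT) model for a hierarchy of
# caches backed by RAM. All latencies (in cycles) and hit rates are
# illustrative assumptions, not measurements from this project.

def amat(levels, ram_latency):
    """levels: list of (hit_latency, hit_rate) tuples, ordered L1..LL."""
    total, reach = 0.0, 1.0
    for latency, hit_rate in levels:
        total += reach * latency        # every access reaching this level pays its latency
        reach *= 1.0 - hit_rate         # fraction missing here falls through to the next level
    return total + reach * ram_latency  # last-level misses pay the RAM latency

hierarchy = [(4, 0.90), (12, 0.70), (40, 0.50)]  # L1, L2, L3
print(round(amat(hierarchy, ram_latency=200), 3))  # -> 9.4
```

Even with only 1.5% of accesses reaching RAM in this toy setup, those misses contribute roughly a third of the average cost, which is why last-level cache misses are singled out as the main cost to minimise.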

Figure 1: Memory hierarchy along with typical latency times.
Figure 2: Valgrind logo.
Figure 3: Pseudo-code for enabling Valgrind instrumentation and detail collection.

Both storage elements interact with each other in the following way: when a memory access instruction is issued, cache memory is the first subsystem to be accessed. If the respective instruction can be completed in cache (that is, the referred value resides there), then we have a "cache hit"; otherwise a "cache miss" will be signalled and the effort of completing the instruction shall move on to the next cache level. In case of a "last level cache miss" the main memory has to be accessed in order to fulfil the need for data. These exact accesses pose the greatest performance limitation, especially when it comes to memory-bound applications. Keeping that in mind, we considered these accesses as the key factor for optimising our workflow.

METHODOLOGY

In order to monitor memory interactions we opted for Valgrind, as it offers the additional possibility of simulating cache behaviour. That is, when we run our application under Valgrind, all data accesses can be monitored and collected, giving the chance of determining all those accesses that missed cache and had to deploy memory subsystems deeper in the hierarchy.

Considering every accessed datum in isolation would be neither intuitive nor helpful for further post-processing. This is where EVOP intervenes, in order to group data into memory objects. An object can be an array of integers, a struct, or any other structure whose semantics suggest that it has to be considered as an entity. Such objects, regardless of whether they are statically or dynamically allocated, are labeled, and every time an access is issued, it is tied to the related object.

Simulating, intercepting and keeping logistics of a few billion data (memory references) can be highly computationally intense. Indeed, instrumenting an application will make it from four to twenty times slower. In order to alleviate this, we decided to restrict the instrumentation only to the regions of code that are of interest. Valgrind provides an adequate interface that eases the aforementioned scope: in Figure 3 you can see the macros that, by framing the relevant lines of code, enable or disable the instrumentation and information collection.

Setting the aforementioned comparison of the software and hardware approaches as our ultimate scope, we enhanced EVOP with the option of performing sampled data collection. In detail, we extended the Valgrind source code by integrating a global counter which increments on every memory access. While the counter's value is below a predetermined threshold, no access shall be accounted for. On the contrary, the memory access that forces counter and threshold to be equal will be monitored and therefore accumulated to the final access number. The threshold is plainly defined when EVOP is called, using the flag --sampling-period=.

What is more, we familiarised ourselves with the internal Valgrind representation of memory accesses in order to distinguish between "load" and "store" ones. Similarly as before, we added a counter that increments every time a load or store is detected. The sole accesses that are to be monitored are the ones that equalise the counter to the predetermined threshold. This time, the user has to define a combination of flags such as -sample-loads=no|yes -sample-stores=no|yes -sampling-period= in order to specify the type of accesses to be sampled as well as the respective sampling period.

Having enabled these two features, our methodology can be summarised in the following procedure: we initially performed a simulation of our benchmarks without sampling, in order to get the total number of memory objects. The latter were processed by a BSC-developed tool in order to be optimally distributed to the available memory subsystems. The distribution takes into consideration the last-level cache misses of each object, as well as the number of loads that refer to each one of them, and sets as its goal the minimisation of the total CPU stall cycles. The total "saved" cycles are calculated along with the object distribution, and thus the final speedup can be determined. This speedup is the maximum achievable given the application and the memory subsystem mosaic.

What follows is a trial-and-error experimentation process of obtaining results using various sampling periods to extract the referenced objects. It can be intuitively assumed that the longer the sampling period is, the fewer the total recorded memory accesses will be. Given that each access is related to one memory object, the fewer the accesses are, the higher the chance that not all existing memory objects are discovered. The latter presumably results in a worse distribution; worse in the sense that the final speedup is lower than the initial, achievable one. Nevertheless, depending on the access pattern of each benchmark, there is a specific sampling period, or a restricted range of neighbouring sampling periods, that identifies all the objects responsible for the initial speedup.

Note that potential access patterns may exist among code loops. A sampling period that always intercepts the same memory access (in a different loop iteration) is considered faulty and shall result in a biased outcome. In order to avoid that, we used exclusively prime numbers as sampling periods.

TEST CASES – RESULTS

Since memory behaviour is of the essence for this project, it is important to choose wisely the applications under test. MiniMD and HPCCG are a pair of

mini-applications intended as benchmarks for assessing the performance of certain production applications [7]. Hence, they are representative of the behaviour of more complex applications.

MiniMD, a molecular dynamics simulator, has a rather sequential access pattern over the objects that it refers to. As a result, the sampling period cannot be of the same order of magnitude as the total access number (discovered by the non-sampled execution). In fact, in order to obtain an optimal speedup, the sampling period has to be relatively small. On the contrary, HPCCG, which offers an iterative computational implementation of the conjugate gradient method on an arbitrary number of processors, entails its main computation in a "for loop". Thus, the same memory objects are referenced every time the loop iterates, enabling the use of greater sampling periods. Indeed, we were able to use intervals of the same order of magnitude as the overall access number of some objects and still get an acceptable speedup.

In the first two figures we present a qualitative representation of the memory access behaviour of each application, as it was understood by us. Sparse sampling, in the first case, is not enough to account for every "important" object, while in the second case, due to the different access pattern, this is allowed. In the following two graphs we picture the exact correlation between the final speedup and the sampling period used to get the objects. While in the first case there is a strict threshold after which the final results are strongly damaged, in the second case this threshold can be generalised to a larger neighbourhood of periods.

Figure 4: Sampling at a specific low rate results in omitting memory objects; a higher sampling frequency is needed.
Figure 5: In rather iterative access behaviour, low-frequency sampling is enough for complete memory object discovery.
Figure 6: Only low periods are allowed in order to achieve an optimal speedup.
Figure 7: Bigger sampling periods can result in equally high speedup as memory objects keep being discovered.

FUTURE RESEARCH

When deploying the hardware profiling approach aiming at performance optimisation, we are restricted by the imposed sampling that has to be performed. So far we have obtained some concrete results regarding the effect that different sampling rates have on the discovered memory objects. Given that the latter are responsible for the final performance speedup, as well as that hardware profiling mechanisms rely solely on sampling, we can safely assume that the same, or at least similar, results should be obtained when profiling using the hardware counters. We have provided the chance to future researchers (and why not future SoHPC participants) to experiment using the sampling periods from our trials, in order to determine the most efficient way to gain insight into an application's memory behaviour and subsequently boost its performance by exclusively focusing on the optimal distribution of its memory objects.

REFERENCES

[1] H. Servat, A. J. Peña, G. Llort, E. Mercadal, H. C. Hoppe, J. Labarta, "Automating the Application Data Placement in Hybrid Memory Systems", 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA, September 5-8, 2017.
[2] A. J. Peña, P. Balaji, "A Framework for Tracking Memory Accesses in Scientific Applications", 2014 43rd International Conference on Parallel Processing Workshops, Minneapolis, MN, USA, September 9-12, 2014.
[3] "Valgrind", last version: valgrind-3.15.0. Available: http://www.valgrind.org/
[4] "Intel VTune Amplifier". Available: https://software.intel.com/en-us/vtune
[5] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual, September 2016, vol. 3B: System Programming Guide, Part 2, ch. 18.4.4.
[6] https://en.wikipedia.org/wiki/Memory_architecture
[7] M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, H. K. Thornquist, and R. W. Numrich, "Improving performance via mini-applications," Sandia National Laboratories, Tech. Rep., 2009.

PRACE SoHPC Project Title: Analysing effects of profiling in heterogeneous memory data placement
PRACE SoHPC Site: Barcelona Supercomputing Center (BSC), Spain (ES)
PRACE SoHPC Author: Dimitrios Voulgaris [University of Thessaly], Greece (GR)
PRACE SoHPC Mentor: Marc Jorda [BSC], Spain (ES)
PRACE SoHPC Contact: Dimitrios Voulgaris, University of Thessaly. E-mail: [email protected]
PRACE SoHPC Acknowledgement: Special thanks to my BSC coordinators Antonio J. Peña and Marc Jorda, as they proved an unlimited source of help. I am also grateful to the SoHPC coordinator, Mr. Leon Kos, for his constant presence and guidance when I needed him. Finally, my deepest gratitude to my colleague and friend Perry Gibson for his humorous comments and advice that made this experience unique.
PRACE SoHPC Project ID: 1901
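The counter-and-threshold sampling scheme described in the Methodology section can be sketched outside Valgrind as follows. This is a Python illustration of the idea only (the actual mechanism was implemented inside EVOP's C sources), and the access stream below is hypothetical.

```python
# Illustration of EVOP-style sampling: a counter increments on every
# memory access, and only the access that makes the counter hit the
# sampling period is recorded. Python stand-in for the C implementation.

def sample_accesses(accesses, sampling_period):
    """Keep every sampling_period-th access from the stream."""
    kept, counter = [], 0
    for access in accesses:
        counter += 1
        if counter == sampling_period:  # threshold reached: record this access
            kept.append(access)
            counter = 0                 # restart counting
    return kept

# Hypothetical stream: a loop touching the same 4 objects over and over.
stream = [f"obj{i % 4}" for i in range(20)]

print(sample_accesses(stream, 4))  # every sample lands on the same object
print(sample_accesses(stream, 7))  # a prime period rotates across objects
```

With a period of 4 on this 4-object loop, every sample intercepts the same object and the other three are never discovered; this is exactly the biased outcome the article avoids by using prime sampling periods such as 7.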

Exploring Reproducible Workflows in Automated Heterogeneous Memory Data Distribution Literature Results

"It worked on my machine... 5 years ago"

Perry Gibson

Leveraging emerging heterogeneous memory subsystems for memory bound HPC applications requires an understanding of the memory access patterns of the target workload. We revisit prior work from 2014, which simulated cache behaviour of HPC test workloads, and produced near-optimal data placement strategies from high-fidelity runtime data. We find that the results of the original simulations cannot be precisely reproduced. We document our approach of attempting to reproduce the results, and put forward concrete solutions to improve the reproducibility of future HPC research endeavours.

HPC applications are the de facto expensive workload. At scale, even sub-percentage acceleration can represent massive cost savings. Thus, opportunities across the stack must be exploited as much as possible. Often the biggest bottleneck in these applications is data access times. If data is stored close to the processor when needed, such as at the lowest level of cache, then we incur very little overhead. However, this isn't feasible in all cases, given typical sizes of L1 caches are on the order of a dozen or so KiB, and higher levels of cache are on the order of a few dozen MiB. Since even modest HPC applications use GiBs of data, or orders of magnitude more, we are forced to store it in main memory, or disk.

A promising mitigation for this problem is moving to a heterogeneous memory model. Instead of the classic hierarchical memory model of L1 cache → L2 cache → ... → LL cache → Main Memory → Disk, we have a choice to place individual data in a number of memory subsystems with different properties. To prepare current and future HPC applications to exploit the potential benefits of this change, it is necessary to devise novel approaches for placing data objects in an appropriate memory subsystem, giving good automatic placement with minimal developer involvement, while providing fine-grained control if needed.

To perform this profiling, in 2015 researchers at BSC heavily adapted the Valgrind instrumentation framework [1], so that they could measure the access patterns for individual data objects in a running program. The fork, developed for a suite of papers from the BSC, is described in more detail in [2]. The main contribution was adding the ability to trace loads and stores to individual variables, and in particular dynamic objects which are created by a number of lines of code and are thus less easy to detect directly (for example, arrays of pointers).

The results of using this system, named EVOP (Extended Valgrind for Object-differentiated Profiling), on two HPC mini-applications [4]* were presented in [5]. The paper demonstrated the scope for acceleration, which is still true today. However, the exact experimental results of the access patterns for the candidate applications and the resulting data placement proposal differ.

* Interested readers can learn more about mini-applications in the SoHPC podcast episode featuring Dr. Mike Heroux.

Methods

We began our work by getting access to the original EVOP codebase and the candidate workloads. Initially, we used the latest version of the code, committed two years after the original paper. We sought to first understand its functionality and configuration, then began to run full experiments. With the latest version of the code, we found that for our first workload, miniMD, the number of executed instructions was 17.9% less than the original results, and the last-level cache miss rate was 64.1% higher. Care was taken to ensure that the cache configuration was identical to what was described in the original paper.

Our next step was to eliminate the possibility that some version change in EVOP had changed its instrumentation behaviour. Using version-control commit timestamps, we examined the differences between candidate code versions that may have been used for the original results. However, none of these versions could be successfully compiled. For commits of interest, source code files used inconsistent API calls, and we hypothesise that during development not all of the working files were tracked correctly. We patched a number of these code versions, and found identical results. This was promising for the consistency of the tool across versions, but did not explain the difference with the original paper.

We hypothesised that the compiler version used to produce the workload binaries, coupled with versions of the standard library, could be an influence. Thus, we used a containerisation workflow (via the HPC-first Singularity system [6]) to systematically compare candidate environments. Many of these older tool versions no longer existed in package repositories or archives, and required manual compilation. The effect of the compiler version on the number of instructions and data access patterns was found to be minimal. Variation of the instruction set extensions used by the compiler (such as dis/enabling vector instructions) changed the number of instructions, but not the data access patterns.

Consulting with the original authors, we reconstructed the original OS environment, using an archived distribution. This tested for an unknown factor of influence. However, again the results remained identical.

With a reproduction of the original work appearing unlikely, we still consider the results we collected to be sensible. Thus, a useful extension to the work would be to compare the results of a simulated run against the results on actual hardware. We followed the approach of [3], which used the Extrae framework developed at BSC. The use of Intel PEBS sampling would allow a similar type of object tracing information to be produced. The advantage over simulation is reduced overhead, at the cost of lower fidelity data. Collaborating with a colleague investigating the use of sampling in EVOP would enable a fair comparison. We had access to a configuration file believed to have been used to gather the results in [3]. However, in the time remaining we were not able to collect the instrumented data of interest, namely object addresses and their access patterns.

Results and Conclusion

We have ruled out with some degree of certainty the influence of compiler and standard library versions. Because of the inconsistent version-control history, we cannot rule out that a different version of the codebase was used to perform instrumentation. However, our patched versions of EVOP generated the same results.

As one would expect, the instruction set architecture used has an influence on the number of instructions that a given program execution takes. However, it does not change the simulated cache behaviour. We followed the description of the cache configuration and input parameters for our test workloads. However, the exhibited behaviour was different. The simulator's behaviour should be agnostic to the underlying hardware it runs on, and we found this was true for small scale experiments.

In conclusion, the replication of results after years of tools lying dormant is difficult. The configurations and sequence of steps used to produce results are easily lost, and can be non-trivial to recover at a later date. These details are generally not included in publications, as they are superfluous to the conceptual contribution of a work. However, it is clear that once lost to time they cannot easily be recovered, and for some systems they are necessary for a work's conclusions. This was not the case for EVOP, as our results using modern toolchains still trace objects and demonstrate theoretical speedups for candidate object distributions. They differ numerically from the original results.

Ideally, containerised workflows should be published, but they can also be kept in internal archives if aspects of their functioning are sensitive. The pitfall of relying on external package repositories for the creation of containerised workflows should be heeded by researchers whose results are possibly influenced by such factors. Including the packages in the project archives should be considered a preferred practice.

Our recommendation is that during the development stages of research tools, continuous integration systems are used to ensure that the versioned project state is as expected. We have created a continuous integration setup to be used in future versions of EVOP. When the results for publications are being produced, containerisation should be used to enshrine the configuration and tool versions used to generate the needed computations.

References

[1] Nethercote, Nicholas and Seward, Julian. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation.
[2] Peña, Antonio J. and Balaji, Pavan. A Framework for Tracking Memory Accesses in Scientific Applications.
[3] Servat, H. and Peña, A. J. and Llort, G. and Mercadal, E. and Hoppe, H. and Labarta, J. (2017). Automating the Application Data Placement in Hybrid Memory Systems.
[4] Crozier, Paul Stewart and Thornquist, Heidi K. and Numrich, Robert W. and Williams, Alan B. and Edwards, Harold Carter and Keiter, Eric Richard and Rajan, Mahesh and Willenbring, James M. and Doerfler, Douglas W. and Heroux, Michael Allen. Improving Performance via Mini-Applications.
[5] Peña, A. J. and Balaji, P. (2014). Toward the Efficient Use of Multiple Explicitly Managed Memory Subsystems.
[6] Kurtzer, Gregory M. and Sochat, Vanessa and Bauer, Michael W. Singularity: Scientific Containers for Mobility of Compute.

PRACE SoHPC Project Title: Reproducing Automated Heterogeneous Memory Data Distribution Literature Results and Beyond
PRACE SoHPC Site: BSC, Catalonia
PRACE SoHPC Author: Perry Gibson
PRACE SoHPC Mentor: Peña, A. J.
PRACE SoHPC Acknowledgement: Many thanks for the support and guidance from my mentor, the researchers and staff of the BSC, the organisers of SoHPC at PRACE for getting things working, and my cohort for bringing interesting stories to the weekly meetings. Special thanks to Dimitris Voulgaris for working closely with me throughout the project.
PRACE SoHPC Project ID: 1902
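The recommendation in this article, recording exactly which toolchain produced a set of results, can be sketched in a few lines. This is a hypothetical helper, not part of EVOP, and the tool list is an assumption to adapt to your own stack.

```python
# Sketch: snapshot the toolchain next to experimental results, so a
# future reproduction attempt knows what produced them. Hypothetical
# helper (not part of EVOP); extend the tool list for your own stack.
import json
import platform
import subprocess

def tool_version(cmd):
    """First line of a tool's --version output, or a marker if absent."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return out.stdout.splitlines()[0]
    except (OSError, subprocess.CalledProcessError, IndexError):
        return "unavailable"

fingerprint = {
    "python": platform.python_version(),
    "machine": platform.machine(),
    "gcc": tool_version(["gcc", "--version"]),
    "valgrind": tool_version(["valgrind", "--version"]),
}
print(json.dumps(fingerprint, indent=2))  # store this beside the results
```

Archiving such a fingerprint (ideally alongside a container image) costs almost nothing at experiment time, and is exactly the kind of detail that the reproduction attempt above found impossible to recover years later.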

What can we gain if we use HPC tools for Machine Learning instead of the JVM-based technologies?

High Performance Machine Learning

Thizirie Ould Amer

Machine Learning (ML) algorithms are designed to process and learn from huge amounts of data and their correlations to adapt and adjust the results. This big quantity of data needs to be analyzed and classified according to various criteria, which means that our ML algorithms have to evaluate these criteria and select the best ones. As you can imagine, this requires a lot of time and energy to be done. So how can we increase the efficiency of ML algorithm runs?

HPC is the solution for many kinds of problems! Combined with other features, HPC tools can have a big impact on the performance of our applications (programs). Big data processing is currently dominated by JVM-based technologies such as Hadoop MapReduce or Apache Spark. Their popular tools are indeed fairly efficient and, in particular, they are easy to use for developers. Our goal is to evaluate traditional HPC tools such as MPI, OpenMP or GASPI for ML / big data processing, and to see for ourselves if they are at least as good as the aforementioned tools, or maybe even better!

In order to successfully carry out this project within a limited time frame, we chose one popular ML algorithm: Gradient Boosting Decision Trees. This algorithm is very popular in Kaggle competitions. Indeed, many of the biggest winners of these contests include Gradient Boosting algorithms in their winning model ensembles, amongst a variety of other models. The most commonly used library is XGBoost, and it aims to provide a Scalable, Portable and Distributed Gradient Boosting (GBM, GBRT, GBDT) Library [1]. The main focus of this project is therefore GBDT, but the other methods are as popular and successful as this one.

Decision Trees

The idea of the Decision Tree algorithm is to find a set of criteria that will help us "predict" (classify or inter/extrapolate) the variables we pick. In the example in Table 1, let's see if we can predict whether an algorithm can be parallelized or not.

Table 1: Example 1

IndepParts | LoopsDep | Parallel?
-----------+----------+----------
    1      |    0     |    1
    0      |    1     |    0
    0      |    0     |    0
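The impurity-driven feature choice on Table 1 can be sketched directly. Gini impurity is used here for illustration only (the article does not say which impurity measure the project implemented):

```python
# Sketch: picking the split feature for Table 1 by impurity.
# Gini impurity is assumed here for illustration; rows are
# (IndepParts, LoopsDep, Parallel?).
rows = [(1, 0, 1), (0, 1, 0), (0, 0, 0)]

def gini(labels):
    """0.0 for a pure group, 0.5 for a 50/50 binary mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(rows, feature):
    """Size-weighted impurity of the two groups made by splitting on a feature."""
    left = [r[-1] for r in rows if r[feature] == 0]
    right = [r[-1] for r in rows if r[feature] == 1]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(split_impurity(rows, 0))  # IndepParts -> 0.0, a pure split
print(split_impurity(rows, 1))  # LoopsDep  -> 0.333..., some misclassification
```

Splitting on IndepParts yields zero impurity, so the tree picks it first; in this toy table that single test already predicts Parallel? perfectly.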

Since the goal is to predict the last variable (Parallel?), our decision tree will have to consider each feature (column) and see which one helps us the most to distinguish whether a code can be parallelized or not. Once such a feature is identified, we split the codes according to that variable. Selecting any feature can naturally lead to misclassification of elements; otherwise the feature on its own would correlate perfectly with the result. This misclassification is also referred to as the impurity: an error-rate parameter ranging from 0 to 1, with 0 for a pure group (no misclassification) and 1 for a group where every element is misclassified. In each step of the construction of the decision tree, we pick the variable whose split has the lowest impurity. Decision trees on their own tend to suffer from overfitting, so we must use additional tricks to make them a good ML method.

Gradient Boosted Decision Trees

Gradient boosting in general is an approach of creating a predictive model in the form of an ensemble of weak prediction models. In Gradient Boosting Decision Trees, the weak prediction model is, unsurprisingly, the Decision Tree. By fitting a decision tree to the pseudo-residuals (i.e. errors) of another tree, we can significantly increase the accuracy of the data representation. Typically, limited-size decision trees (of depth 4 to 8) are iterated multiple (typically less than 100) times, creating (at most) 100 decision trees that have different decision criteria, each iteration "fixing" the errors that the previous tree has made. The algorithm doesn't give more weight to any of the trees, but tree number i focuses on the individuals that didn't fit into their leaves/groups in tree number i-1 (i.e. it focuses on the individuals that have the most important pseudo-residual values).

How to parallelize this?

When it comes to the concept of a parallel version of a program, many aspects have to be taken into account, such as the dependency between various parts of the code, balancing the workload on the workers, etc. This means that we cannot parallelize the gradient boosted algorithm's loop directly, since each iteration requires the result obtained in the previous one. However, it is possible to create a parallel version of the decision-tree building code. The algorithm for building Decision Trees can be parallelized in many ways. Each of the approaches can potentially be the most efficient one depending on the size of the data, the number of features, the number of parallel workers, the data distribution, etc. Let us discuss three viable ways of parallelizing the algorithm, and explain their advantages and drawbacks.

Sharing the tree nodes creation among the parallel processes at each level

The way we implemented the decision tree allows us to split the effort among the parallel tasks. This means that we would end up with the task distribution schematically depicted in Figure 1.

Figure 1: Parallelizing the nodes building at each level. Levels are in different colours.

Yet, we have a problem with this approach. We can imagine a case where we would have 50 "individuals" going to the left node, and 380 going to the right one. We will then expect that one processor will process the data of 50 individuals and the other one will process the data of 380. This is not a balanced distribution of the work, with some processors doing nothing while others may be drowning in work. Furthermore, the number of tree nodes that can be processed in parallel limits the maximum number of utilizable parallel processes... So we thought about another way.

Sharing the best split research in each node

In our implementation of the decision tree algorithm, there is a function that finds the value that splits the data into groups with minimum impurity. It iterates (for a fixed column, i.e. variable) through all the different values and calculates the impurity for the split. The output is the value that reaches the minimum impurity.

Figure 2: Sharing the best split research in each node, for each feature.

As can be seen in Figure 2, this is the part of the code that can be parallelized. So every time a node has to find a split of individuals into (two) groups, many processors will compute the best local splitting value, and we keep the minimum value from the parallel tasks. Then, the same calculations are repeated for the Right data on one side, and for the Left data on the other. In this case, a parallel process will do its job for a node and, when done, it can directly move to another task on another node. So, we got rid of the unbalanced workload, since (almost) all processes will constantly be given tasks to do.

Nevertheless, this also has a notable drawback: the cost in communication is not always worth the effort. Imagine the last tree level, where we would have only a few individuals in each node. The communication goes both ways (the data has to be given to each process, and the output received at the end), and it will eventually slow down the global execution more than the parallelization of the workload speeds it up.

Parallelize the best split research on each level by features

The existing literature2 helped us to merge some features of the two aforementioned approaches to find one that reduces or eliminates their drawbacks. This time, the idea is to parallelize the function that finds the best split, but for each tree level. Each parallel process calculates the impurity for a particular variable across all nodes within the level of the tree. This method is expected to work better because:

1. The workload is now balanced, because each parallel process evaluates the same amount of data for "its feature". As long as the number of features is the same as (or greater, ideally much larger than) the number of parallel tasks, none of the processes will idle.

2. The impact of the problem we described in the second concept is reduced, since we are working on whole levels rather than on a single node. Each parallel task is loaded with much larger (and equal) amounts of data, thus the communication overhead is less significant.
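The per-feature distribution can be sketched as follows. This is a simulation of the idea, not the project's GASPI/MPI code: each "rank" owns a round-robin subset of the features, computes its local best (impurity, feature) pair, and a minimum-reduction (what an MPI code would do with MPI_Allreduce and MPI_MINLOC) picks the global winner. The impurity values below are made up for illustration.

```python
def best_split_by_features(impurity_per_feature, n_ranks):
    """Simulate feature-parallel best-split search across n_ranks workers."""
    local_best = []
    for rank in range(n_ranks):
        # each rank owns every n_ranks-th feature (round-robin distribution)
        owned = list(impurity_per_feature.items())[rank::n_ranks]
        if owned:
            local_best.append(min((imp, f) for f, imp in owned))
    # the "reduction" step: the globally minimal impurity wins
    return min(local_best)

impurities = {"f0": 0.4, "f1": 0.1, "f2": 0.3, "f3": 0.2}
print(best_split_by_features(impurities, n_ranks=2))   # (0.1, 'f1')
```

Because every rank handles the same number of features over the whole level, the workload stays balanced, which is exactly point 1 above.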

The global concept is shown in Figure 3.

Figure 3: Parallelizing the best split research on each level by features

How can this be improved?

There are several ways to improve the performance of our parallel algorithm. Let us focus on the communication, particularly its timing. While we cannot completely avoid losing some wall-clock time in sending and receiving data, i.e. communication among the parallel processes, we can rearrange the algorithm (and use proper libraries) to facilitate overlap of computation and communication. As shown in Figure 4, in the first case (a) the processor waits until the communication is done before launching the calculations, whereas in the second case (b) it starts computing before the end of the communication. The last case, displayed in Figure 5, shows different ways to achieve total computation-communication overlap.

Figure 4: Timing of communication and computation steps: (a) no overlap, (b) partial overlap
Figure 5: Timing of communication and computation steps: (c) complete overlap

One of the libraries that allows asynchronous communication patterns, and thus overlap of computation and communication, is GPI-2 (http://www.gpi-site.com), an open source reference implementation of the GASPI (Global Address Space Programming Interface, http://www.gaspi.de) standard, providing an API for C and C++. It provides non-blocking one-sided and collective operations with a strong focus on fault tolerance. Successful implementations of parallel matrix multiplication, K-means and TeraSort algorithms are described in Pitonak et al. (2019), "Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms".3

Conclusion

We have discussed several possible parallel versions of this algorithm and their efficiency. Unfortunately we did not have enough time to finalize the implementations and compare them with the JVM-based technologies, or with XGBoost's performance. Future work could focus on improving and continuing the implementations and comparing them to the usual tools. Including OpenMP in the parallelization could also be a very interesting approach and lead to better performance. Another aspect worth considering is using the fault-tolerance functions of GPI-2, which would ensure better reliability for the application.

PRACE SoHPC Project Title: High Performance Machine Learning
PRACE SoHPC Site: Computing Center of SAS, Slovakia
PRACE SoHPC Authors: Thizirie Ould Amer, Sorbonne University, France
PRACE SoHPC Mentor: Michal Pitoňák, CCSAS, Slovakia
PRACE SoHPC Contact: Thizirie Ould Amer, Sorbonne University; LinkedIn: in/Thizirie; E-mail: [email protected]
PRACE SoHPC More Information: Author's blog; About PRACE Summer of HPC
PRACE SoHPC Acknowledgement: First things first, I would like to thank my mentors Michal Pitoňák and Marián Gall for their great help. Many thanks to my site coordinator, Lukáš Demovič, and the whole CCSAS team who made my time here extraordinarily amazing. I would also like to thank my HPC and Data Analysis teachers, Pierre Fortin and Fanny Villers, for their support.
PRACE SoHPC Project ID: 1903

References
1 XGBoost GitHub repository: https://github.com/dmlc/xgboost
2 Parallel Gradient Boosting Decision Trees: http://zhanpengfang.github.io/418home.html
3 Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms: http://doi.org/10.5281/zenodo.2807938
4 Gradient Boosting decision trees: https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

Where can you find the electrons of a nanotube? Electron density computation for each orbital.

Electron density of Nanotubes

Irén Simkó

We exploit the helical symmetry of nanotubes in the electronic structure computation. The code was extended with electron density computation for each orbital. I used Message Passing Interface for parallelization and visualized the electron density isosurfaces.

Have you heard about (carbon) nanotubes? Probably you already have, but you will hear a lot more about them in the future. The extraordinary mechanical and electrical properties of nanotubes have a lot of promising applications in engineering and materials science. For example, composite materials with carbon nanotubes have great mechanical strength. Nanotubes are candidates for nano-electronics as well, because they can behave like metals, semiconductors or insulators.

The topic of my project was simulating nanotubes with the SOLID98 quantum chemical program.1 The program, written in FORTRAN77, computes the electronic structure of the nanotubes. According to the principles of quantum mechanics, the states of the electrons have well-defined discrete energies and wave functions (orbitals), which are computed by the program. The parallelization of the code was the task in previous years of Summer of HPC, and my goal was to extend the code with a new feature: electron density computation and visualization. This shows us the distribution of electrons in space, so we can see the chemical bonds and the nodes of the orbitals.

Use symmetry when you can!

The computational cost of a quantum chemical simulation strongly increases with the size of the system. But how can we still compute large systems such as nanotubes or crystals? One solution is periodicity and symmetry, which means that we can build the whole large system by replicating a small building block (unit cell) following symmetry rules. Since the replicas are not independent, the size of the unit cell determines the cost of the computation.

Nanotubes have helical symmetry,2 which allows us to use very small unit cells. For carbon nanotubes it is enough to have only two atoms! Then, imagine a spiral winding on the surface of a cylinder, and replicate the unit cell following the path of the spiral. This way you can build the whole carbon nanotube. Using helical symmetry is a much better approach than considering only the translational symmetry along the axis of the nanotube.

Figure 1: Helical symmetry of the nanotube

The electronic structure computation

To get the electronic orbitals and energies we have to solve the Schrödinger equation. But do not grab a pen and paper, because it has an analytical solution only for simple systems; we have to do it numerically in other cases. The wave function is a linear combination

of basis functions in the simulation. The Schrödinger equation is translated to an eigenvalue equation of the so-called Fock matrix, which is solved iteratively. The diagonalization part of the program is parallelized with the Message Passing Interface (MPI). Thanks to some mathematical tricks the Fock matrix is block diagonal, each block labelled with a k value and diagonalized by a different process. As a result, we get orbitals and energies for each k block.

Electron density: The serial code

The electron density is the probability of finding the electron at a given point in space. We would like to visualize the individual electron orbitals, so we restrict the computation to one orbital at a time. The electron density computation is the last step of the program; we use the results of the last diagonalization. From the eigenvectors, we build the so-called density matrix (P), and contract it with the basis functions (χ). "Contraction" is a double sum over all basis functions of the system:

ρ(r) = Σ_{a,b} P_ab χ_a(r) χ_b(r)

First, we compute the contribution of the central unit cell to the electron density: in the sum we take one basis function from the central unit cell and the other one from the whole nanotube. Then we get the total electron density in the same manner as we built the nanotube from the unit cell, by replicating the contribution of the central unit cell. In Figure 2 you can see an example for the benzene molecule.

Parallelization using MPI

The next step was to parallelize the electron density computation. I decided to use the same technique as in the diagonalization of the Fock matrix. We use the Master-Slave MPI scheme, where the Master process (rank = 0) distributes the work among the Slave processes (rank = 1, 2, ..., N-1). First the Master process does some initialisation and broadcasts the data to the Slaves, then each process does the work assigned to it.

The electron density computation is done in two nested loops: one loop over the k values and one loop over the orbitals of a given k. In the first version of the code, each process computes the orbitals belonging to a different k value. We use a counter variable to distribute the work. The counter is set to 0 at the beginning of the loop. A process takes the upcoming k value if its rank is equal to the current counter value; when a process accepts a value, the counter is increased by 1, or reset to 0 if it was N-1. This way, process 0 does k = 1, process 1 does k = 2, process N-1 does k = N, then process 0 does k = N+1, and so on.

Figure 3: MPI parallelization

The problem with this implementation is that you cannot use more processors than the number of k values. So in the second version I moved the parallelization to the inner loop over the orbitals. If we have M orbitals in each block, the first M processes start working on the k = 1 orbitals, and the next process gets the first orbital of k = 2. This is a much better solution, because we can utilize more processors than in the first case. On the other hand, the diagonalization of the Fock matrix is still parallelized over the k loop, so we have idle processors if we have more processes than the number of k values. The relative time of the diagonalization and the electron density computation determines whether it is worth using the k-loop or the orbital-loop parallelization.
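Both ingredients above can be sketched compactly. This is an illustration in Python/NumPy, not the SOLID98 Fortran code: a round-robin owner function matching the counter scheme (task 0 goes to rank 0, task 1 to rank 1, wrapping around after N-1), and the density contraction ρ(r) = Σ_ab P_ab χ_a(r) χ_b(r) evaluated on a grid. The tiny matrices are made-up example values.

```python
import numpy as np

def owner_of_task(task_index, n_procs):
    """Counter scheme: task i is handled by rank i mod N (0-based tasks)."""
    return task_index % n_procs

def electron_density(P, chi):
    """P: (nbasis, nbasis) density matrix; chi: (nbasis, npoints) basis values.

    Computes rho_g = sum_ab P[a, b] * chi[a, g] * chi[b, g] for every grid point g.
    """
    return np.einsum("ab,ag,bg->g", P, chi, chi)

# Two basis functions evaluated at two grid points (illustrative numbers)
P = np.array([[1.0, 0.2],
              [0.2, 0.5]])
chi = np.array([[0.1, 0.4],
                [0.3, 0.2]])
rho = electron_density(P, chi)
```

The `owner_of_task` helper reproduces the article's schedule: with 4 processes, tasks 0..3 go to ranks 0..3 and task 4 wraps back to rank 0.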

”A picture is worth a thousand words”

Figure 2: Electron density of one unit cell and the whole benzene molecule

One goal of the project was to actually see where we can find the electrons, but visualization turned out to be more challenging than I expected. A good scientific figure should capture the essence of the topic of the research as well as catch the attention of the reader.

Figure 4: Electron density visualization examples

The result of the computation is the electron density in a large number of grid points. If the grid points are on a plane, I plotted the results with Wolfram Mathematica using ListPlot3D and ListDensityPlot. On the figures above you can see an example for a benzene orbital, with the electron density calculated at the plane of the molecule. I visualized the electron density in space by plotting the isosurface, which is a surface where the electron density equals a given value. I used the Visual Molecular Dynamics program to make isosurface plots after transforming the output to the Gaussian cube file format.

Results & discussion

I tested the electron density computation and visualization for the benzene molecule and a small carbon nanotube. I tried to find the physical meaning of each electron density plot. For example, there are bonding benzene orbitals where the electron density is high between the atoms, and there are antibonding orbitals with a nodal surface, where the electron density is zero. In the case of some nanotube orbitals the electron density is localised inside the tube, while in others you can see the bonds between the carbon atoms.

Figure 5: Speed-up of the parallel program

As for the parallelization, I made a test for a nanotube that has 32 k values. In the figure above, you can see the speed-up for the two code versions. The maximal number of cores was 32 for the k-loop version, and 128 for the orbital-loop version. Up to about 64 cores the speed-up is ideal, but after that the program will not get much faster if we increase the number of cores.

PRACE SoHPC Project Title: Electronic structure of nanotubes by utilizing the helical symmetry properties: The code optimization
PRACE SoHPC Site: Computing Centre of the Slovak Academy of Sciences, Slovakia
PRACE SoHPC Authors: Irén Simkó, Eötvös Loránd University, Hungary
PRACE SoHPC Mentor: Prof. Dr. Jozef Noga, DrSc., CCSAS, Slovakia
PRACE SoHPC Contact: Irén Simkó, E-mail: [email protected]
PRACE SoHPC Software applied: SOLID98, Visual Molecular Dynamics, Wolfram Mathematica, Blender
PRACE SoHPC Acknowledgement: I am grateful to my mentor, Jozef Noga, my site co-ordinator, Lukáš Demovič, and all people at CCSAS for their support and the wonderful summer. Calculations were performed in the Computing Centre of the Slovak Academy of Sciences using the supercomputing infrastructure acquired in projects ITMS 26230120002 and 26210120002 (Slovak infrastructure for high-performance computing) supported by the Research & Development Operational Programme funded by the ERDF.
PRACE SoHPC Project ID: 1904

References
1 J. Comp. Chem., Vol. 20, No. 2, 253-261 (1999)
2 J. Phys. Chem. B 2005, 109, 52-65

Live Visualisation of Simulation Results using Paraview Catalyst

In Situ / Web Visualisation of CFD Data using OpenFOAM

Li Juan Chan

Paraview Catalyst has been proposed to overcome the limitations caused by the I/O bottleneck. In this research, the scalability of OpenFOAM with coprocessing has been investigated, and it has been shown to be very scalable.

Over the years, the computational power of processors has developed at a tremendously fast pace. This has made large-scale computing more affordable. The development of parallel computing software is advancing equally as fast as the processor. However, there are subsystems of high-performance computers that are not developed at the same pace as the processor. One of them is the I/O bandwidth. I/O is developed at such a slow pace that it is now becoming the bottleneck for exascale computing. This project hoped to address this issue and get around the bottleneck by using in situ visualisation.

Traditionally, the simulation process consists of pre-processing, processing and post-processing. However, this process is getting more and more expensive due to the I/O bottleneck. Writing and reading data have become very slow when compared to the processing speed. In situ visualisation was designed to solve this problem by moving some of the post-processing tasks in line with the simulation code. Paraview has come out with an in situ visualisation library called Catalyst. Catalyst enables live visualisation of a simulation while the simulation is still running.

There are a few benefits of using Paraview Catalyst. One of them is that the live visualisation feature enables researchers to examine whether the pre-processing step is set up correctly. If the boundary conditions are found to be incorrect, the simulation can be stopped immediately, before investing more time and resources into an incorrect simulation. Furthermore, with Paraview Catalyst, the frequency of data writing is significantly reduced, as only the final result needs to be saved. The intermediate result can be viewed instantly without having to be saved for post-processing.

Therefore, one may wonder if Paraview Catalyst will increase the demand for memory due to live visualisation. The answer is yes and no. The live visualisation definitely requires some memory to run. However, the demand is not significant, because Paraview has the ability to extract a small amount of important data from the result instead of saving the full datasets. The features that can be extracted with Paraview include slice, clip, glyphs, streamline, etc.

After introducing Paraview Catalyst, I would like to introduce another piece of software that is used together with it: OpenFOAM. OpenFOAM is an open-source software package for computational fluid dynamics (CFD) and it is also highly scalable. Therefore, the scalability of OpenFOAM with and without co-processing will be determined and compared.

Aim and Motivation

In the field of high-performance computing, the scalability of a software package is very important. This is because a supercomputer is nothing but a bunch of computers combined together to perform a task. Therefore, researchers may want to know how much benefit can be obtained by running a software package with additional resources. Since Paraview Catalyst is so useful and may be widely used in the future, I, as a researcher, would like to find out how scalable Paraview Catalyst is, and this basically is the aim of this project.

How to use Paraview Catalyst?

The model I was working on is shown in Figure 1. It is a wind tunnel construction with two NACA0012 airfoils, rotating with respect to hinges.

Figure 2 is the zoomed-in image of the model. As shown in the figure, a structured mesh is used upstream and downstream of the model. The mesh in the middle is unstructured due to the rotation of the airfoils. There are approximately 10 million cells in the mesh. A compressible solver named rhoPimpleFoam is used for all simulations.

Figure 1: Model
Figure 2: Mesh

To run OpenFOAM and Paraview Catalyst at the same time, a software tool called Remote Connection Manager (RCM) is used. It allows users to connect to the compute nodes of the supercomputer. One RCM session is used to run Paraview. When Paraview is loaded, Catalyst can be connected from the menu bar of Paraview. Meanwhile, another RCM session is opened to run OpenFOAM. However, an OpenFOAM plugin, co-developed by CINECA with ESI-OpenCFD and Kitware, should be loaded before running OpenFOAM. Once loaded, the simulation can be run and the intermediate result of the simulation can be visualised concurrently, as shown in Figure 3. In Figure 3, the feature displayed is a slice.

Figure 3: Live Visualisation

To extract other types of features, a different pipeline is needed. A pipeline is a Python script that specifies the feature to be extracted and the parameters of the feature. The pipeline can be obtained from the Paraview GUI. The process is repeated with different numbers of cores and different pipelines. In this study, several pipelines, for example slice, clip, glyphs, streamline and region, are produced, as shown in Figures 4, 5, 6, 7 and 8.

Figures 4-8: Slice, Clip, Glyphs, Streamline, Region

What is the outcome?

The three graphs on the next page summarise the results of this research: a scaling graph, a graph of the execution time of OpenFOAM with and without Catalyst, and a graph of the execution time of Paraview Catalyst.

First of all, what is speedup? Speedup is a variable that measures scalability. The speedup of n cores is defined as the execution time on one core divided by the execution time on n cores, where n is the number of cores to be measured. Basically, it measures the relative performance of one core and n cores processing the same problem.

In the scaling graph, one may notice that the speedup of OpenFOAM with coprocessing is very similar regardless of the type of pipeline. Additionally, the speedup of OpenFOAM with and without coprocessing is very similar up until 108 cores. Beyond that point, they start to diverge, with OpenFOAM without coprocessing being more scalable than with coprocessing. The degree of divergence increases as the number of cores increases. We also noticed that the scalability of OpenFOAM with and without coprocessing starts to flatten out beyond 216 cores. The sudden reduction in scalability is due to communication overhead. Therefore, if good scaling beyond 216 cores is wanted, the size of the mesh needs to be increased in order to have a good balance between computation and communication.

Similar to the scaling graph, the execution times of OpenFOAM with and without coprocessing are very similar. However, the execution time without coprocessing is slightly shorter at large numbers of processor cores, and the difference increases as the number of processor cores increases. In contrast, the execution time of Paraview Catalyst does not show as clear a pattern as the two previous graphs. However, all pipelines exhibit a trend in which the execution time decreases up until 108 processor
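The speedup definition above reduces to a pair of one-line helpers. The timings below are hypothetical placeholders (the article's measured values live in the graphs), used only to show the arithmetic; parallel efficiency, speedup divided by the core count, is a common companion metric.

```python
def speedup(t_serial, t_parallel):
    """Speedup on n cores: execution time on 1 core / execution time on n cores."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_cores):
    """Fraction of ideal scaling achieved: speedup / n_cores (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_cores

# Hypothetical numbers, not measurements from this study
t_1 = 1080.0     # seconds on 1 core
t_108 = 11.5     # seconds on 108 cores
print(round(speedup(t_1, t_108), 1))          # ~93.9x
print(round(efficiency(t_1, t_108, 108), 2))  # ~0.87
```

With ideal scaling, speedup equals the core count and efficiency stays at 1.0; the flattening beyond 216 cores described above corresponds to efficiency dropping as communication overhead grows.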

Compilation of the graphs of results

cores. Beyond that point, the execution time increases. This trend agrees with the scaling graph.

In a nutshell

OpenFOAM with Paraview Catalyst is very scalable, only slightly inferior to pure OpenFOAM. However, the benefits that can be obtained from Paraview Catalyst definitely outweigh the slight reduction in scalability. For anyone who is interested in this project, the details can be found in the GitHub repository.1

References
1 https://github.com/ljchan1/SoHPC_19

PRACE SoHPC Project Title: In Situ / Web Visualisation of CFD Data Using OpenFOAM
PRACE SoHPC Site: CINECA, Italy
PRACE SoHPC Authors: Li Juan Chan, University of Manchester, UK
PRACE SoHPC Mentors: Prof. Federico Piscaglia, Dept. of Aerospace Sci. and Technology, Politecnico di Milano, Italy; Simone Bnà, CINECA, Italy
PRACE SoHPC Contact: Li Juan Chan, University of Manchester; Phone: +60 125271362; E-mail: [email protected]
PRACE SoHPC Software applied: OpenFOAM 1806, Paraview 5.5.1, Catalyst, Blender 2.79b
PRACE SoHPC More Information: www.paraview.org, www.openfoam.org
PRACE SoHPC Acknowledgement: I would like to express my utmost gratitude to my site coordinator, Simone Bnà, for providing guidance and advice throughout the project. Without his supervision, this project would never have reached as far as it has. Furthermore, I would also like to thank PRACE for sponsoring me to work at CINECA. I also thank Massimiliano Guarrasi and Leon Kos for arranging my allocation in Bologna. Last but not least, I would like to thank Prof. Federico Piscaglia and Federico Ghioldi for providing me the model.
PRACE SoHPC Project ID: 1905

Building an explainable model for understanding HPC faults

Explaining the faults of HPC systems

Martin Molan

This work provides an initial investigation into possibilities for machine learning on log data collected on HPC systems at the CINECA supercomputer site. A general data transformation/preparation pipeline that can be the basis for further machine learning projects is presented. Additionally, this work explores possibilities for the creation of explainable machine learning models on the processed data.

The goal of the presented work is to explore possibilities for machine learning on top of data collected from high performance computing (HPC) systems. The data was collected on two HPC systems (MARCONI and GALILEO) from the supercomputer site CINECA. The presented work has the following key points:

- Preparation of the dataset for future machine learning experiments. This involves creation of a data transformation/pre-processing pipeline that is capable of handling huge quantities of data in parallel and in batches.
- Building an explainable machine learning (ML) model that investigates a learning task on a subset of the data.
- Implementing scalable parallel induction of stable decision trees.

1 Dataset preparation

The main goal of dataset preparation is to convert the raw data (each row represents a single attribute, a single time stamp and a single node) into data appropriate for machine learning. The processing should be done in parallel and in batches.

1.1 Feature construction and dataset description

The dataset has the following attributes (features): alive::ping, backup::local::status, cluster::status::availability, cluster::status::criticality, cluster::status::internal, dev::raid::status, dev::swc::confcheckself, filesys::local::avail, filesys::local::mount, filesys::shared::mount, memory::phys::total, ssh::daemon, sys::ldapsrv::status, filesys::eurofusion::mount, sys::gpfs::status, dev::ipmi::events, dev::swc::bntfru, dev::swc::bnthealth, dev::swc::bnttemp, dev::swc::confcheck, batchs::client::state, batchs::client, net::opa::pciwidth, net::opa, sys::orphaned-cgroups::count, core::total, sys::cpus::freq, batchs::client::serverrespond, MaxState.

The only constructed feature is MaxState, which is the value of the most serious warning in a timestamp. All features have values between 0 and 3, where 0 means normal operation, 1 means warning and 2 means serious failure (3 signifies missing data).

Feature evaluation. Informativeness of the features is estimated by evaluating them with random forest and extra trees classifiers (ensemble classifiers based on decision trees). The most relevant features (according to both methods of evaluation) are: MaxState, ssh::daemon, batchs::JobsH, alive::ping, batchs::client::state.

2 Supervised learning

The second goal of the project is to construct a supervised learning task. The training set consists of instances, described by attributes and a target value for each instance. In this work the target value will always be discrete - the training task will be classification.

2.1 Comparison of different classification algorithms

Comparison of classification algorithms is done with 10-fold cross validation. In each repetition of the process, the model is trained on 90% of the data and evaluated on the remaining 10%; the results are then averaged across the 10 iterations. 10-fold cross validation generally produces more reliable results than a simple train/test split, as it avoids overly pessimistic or optimistic estimates of classifier performance. The goal of the supervised learning model is to predict when a fault (a warning, or a fault) from a component (attributes described above) is serious enough to warrant a shutdown of a node (binary target class). Such an (explainable) model would enable the operators to make more informed decisions about the seriousness of the warnings provided by the system.

Table 1: Average classification accuracy in 10-fold cross validation

Dummy Classifier    0.9663
Nearest Neighbors   0.9865
Linear SVM          0.9926
RBF SVM             0.9924
Decision Tree       0.9928
Random Forest       0.9744
Neural Net          0.9915
AdaBoost            0.9928
Naive Bayes         0.9936

Table 2: Average gain compared to dummy classifier in 10-fold cross validation (in percentages)

Dummy Classifier    0.0
Nearest Neighbors   2.015
Linear SVM          2.625
RBF SVM             2.612
Decision Tree       2.644
Random Forest       0.805
Neural Net          2.514
AdaBoost            2.644
Naive Bayes         2.73

Classifier evaluation (and the base classifiers themselves) is based on the Scikit-learn library.1 Each iteration was performed in its own thread (parallelization).

The most significant takeaway from classifier performance is the relative strength of decision trees. This suggests that:

• Features are informative. The feature set provides good information about the target label without the need for additional feature construction.
• Features do not require non-linear transformations to be informative (SVMs do not represent an improvement over baseline features).

Informative features are the basis for the creation of an explainable model. If that was not the case - if, for instance, SVMs performed considerably better than baseline trees - features would need non-linear transformations, which would make them much more difficult to interpret.

Decision trees - in their basic implementation (greedy splitting) - are not a reliable explainable model for a problem of this scale. Because they are generally unstable (they can change significantly with small changes to the dataset), reasoning provided by the decision trees cannot be the basis for the final model.

3 Future work

Based on the relatively good performance of decision trees, the next step in building the predictive model is to solve the inherent instability of decision trees. A possible approach is a consensus tree classifier. This is an ensemble learning approach based on decision trees: it essentially combines the predictions of several decision trees into a single (stable) model.

The basic algorithm for consensus tree induction can be summarized as:2

1. Perform an N-fold partition of the data set (same partition as with cross validation) and train a separate decision tree on each partition
2. Compute a dissimilarity matrix for instances with regard to the trained decision trees
3. Construct a hierarchical clustering and clean the dataset
4. Construct a decision tree on the cleaned data set

3.1 Implemented on Summer of HPC

Preliminary experiments with a parallel implementation (computation of the dissimilarity matrix is parallel) of the consensus tree algorithm were performed. The open problem remains how to efficiently implement the parallel version of the algorithm so that it is applicable to bigger datasets (a great number of instances and base estimators). The main problem of this implementation is the memory demands of each individual thread. The parallel implementation and subsequent consensus trees will be the topic of a future paper prepared with mentors from the University of Bologna.

References

1 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825–2830.

2 Kavsek, B., Lavrac, N. and Freligoj, A. (2001). Consensus Decision Trees: Using Consensus Hierarchical Clustering for Data Relabelling and Reduction. European Conference on Machine Learning, 251-262.

PRACE SoHPC Project Title: Anomaly detection of system failures on HPC machines using Machine Learning Techniques
PRACE SoHPC Site: CINECA, Bologna
PRACE SoHPC Authors: Martin Molan, [International Postgraduate School Jozef Stefan Institute] Slovenia
PRACE SoHPC Mentors: Andrea Bartolini, University of Bologna, Italy; Andrea Borghesi, University of Bologna, Italy; Francesco Beneventi, University of Bologna, Italy
PRACE SoHPC Contact: Martin Molan, International Postgraduate School Jozef Stefan Institute, E-mail: [email protected]
PRACE SoHPC Acknowledgement: I am thankful to my mentor (dr. Bartolini) and his colleagues from the University of Bologna for all the help and guidance. I am also thankful to my site supervisor dr. Massimiliano Guarassi for his help with setting up the environment on HPC.
PRACE SoHPC Project ID: 1906
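The 10-fold protocol described above can be sketched without any machine-learning library. The following minimal NumPy version evaluates a majority-class baseline (a stand-in for the dummy classifier; the fold logic is identical for any model). The data is synthetic and the function name is invented for illustration - this is not code from the project.

```python
import numpy as np

def ten_fold_accuracy(X, y, n_folds=10, seed=0):
    """Shuffle, split into folds, train on 90% / evaluate on 10%, average."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # "Training" a majority-class baseline: remember the most common
        # label in the training part (X is unused by this trivial model).
        labels, counts = np.unique(y[train], return_counts=True)
        predicted = labels[np.argmax(counts)]
        scores.append(np.mean(y[test] == predicted))
    return float(np.mean(scores))

# Synthetic, heavily imbalanced data, loosely mimicking the fault-detection
# setting where most samples are "no shutdown needed".
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.03).astype(int)  # ~3% positive class
print(ten_fold_accuracy(X, y))
```

The high score of such a baseline on imbalanced data is exactly why Table 2 reports the gain over the dummy classifier rather than raw accuracy alone.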

Visualising Parallel Computations for Education on Raspberry-Pi Clusters

Parallel Computing Demos on Wee Archie

Caelen Feller

I created a framework for visualising parallel communications on Wee Archie, a Raspberry-Pi-based parallel computer. I used this to create demonstrations illustrating basic parallel concepts, and an interactive coastal defence simulation.

EPCC has developed a small, portable Raspberry-Pi based "supercomputer" which is taken to schools, science festivals etc. to illustrate how parallel computers work. It is called Wee Archie because it is a smaller version of the UK national supercomputer ARCHER. It is actually a standard cluster, and it is straightforward to port C, C++ and Fortran code to it. There are already a number of demonstrators which run on Wee Archie that demonstrate the usefulness of running demonstrations on a parallel computer, but they do not specifically demonstrate how parallel computing works.

In this project, I developed a framework for creating and enhancing new, existing and in-development demonstrations that show more explicitly how a parallel program runs. This is done by showing a real-time visualisation on the front-end of Wee Archie, or by programming the LED lights attached to each of the 16 Wee Archie Pis to indicate when communication is taking place and where it is going (e.g. by displaying arrows). All of these visualisations are triggered via the MPI (Message Passing Interface) profiling interface, a standard feature on all modern HPC systems, making this a drop-in solution for most existing code.

I developed visualisations for a tutorial illustrating the basics of parallel communications and a coastline defence demonstration. I also developed a web interface for the tutorials to create a more cohesive user experience. In these demonstrations, I aimed to make it clear to a general audience what is happening on the computer.

Animation Server

(Figure: client code passes communication info to the animation server, which drives the LED panel.)

To display communications via the LED panels (created by Adafruit), I used the official Adafruit Python library. As such, my visualisations are done in Python. To allow all demonstrations to share the panels safely, I start a queue on each Pi in a separate background process, where any demo can add an 8px × 8px image to be displayed on the 8×8 LED panels. I can also give these images properties such as how long they are to be displayed, and create parcels of images together which form my various animations.

I had two major requirements for my animation system: I needed to allow demonstrations developed in any programming language to create an animation, and to be able to coordinate animations between Pis. To do this, I made a web server on each Pi using Flask which will place an animation in the queue when you make a certain web request against it. You must provide options for animation length, type, and type-specific options as discussed below.

Point to Point Visualisations

In MPI, an important class of communications are "point to point" communications. These are when a message is passed from one node (here meaning a Raspberry Pi) to another directly. In its most basic form, a node will start to send to some destination, and wait until that destination has begun to receive from the correct source before beginning to transfer the message.

Given the ability to add a sequence of images to a queue, how can I accurately visualise a "send" and "receive" operation between two Pis? My approach was to use a Python "pipe". When an animation is reached in the queue, it will show an "entry" and stop, using the pipe to wait until the server allows it to continue. This allowed me to synchronise and wait on animations between Pis, which accurately replicates the behaviour of MPI messaging.

This behaviour is known as a "synchronous" or "blocking" send. There also exists a "non-blocking" send. The main difference is that with non-blocking communication, the Pi will continue working rather than waiting for the communication to start, and does the communication in the background. This is a more efficient but less safe form of communication, and I provided a visualisation for it as it is commonly used. Other differences are discussed in the MPI standard.1

Collective Visualisations

The next major class of MPI communications are "collectives" - when many nodes want to communicate with others at once. Say I want to distribute some result to every node - this is done using a broadcast. According to the standard,1 this will cause every participating node to wait until it has the correct output before continuing, though the exact timing varies.

For clarity and consistency, in collective communication visualisations all participants wait until the communication is done overall before continuing. The implementation is similar to that of waiting in point to point.

There were several other types of communication I visualised (gathering, scattering and reduction), whose implementations are similar. With gathering, the result is collected from each node and stored on one. With scattering, the result on one node is split into small, even parts and distributed to many. With reduction, the result is gathered but an operation, such as a sum or product, is applied to it as it is collected. For more details on these, see the MPI standard.1

Application-Specific Animations

Often a demonstration will require unique animations, such as a context-appropriate "working" animation, or a visualisation of some communication at a higher level of abstraction than MPI, such as the "haloswaps" of the coastal defence demonstration. Here, the animation server can be contacted by a non-MPI process which directly contacts the server; this requires modification of the source code.

Unified Web Interface

As these demonstrations are used for outreach, the surrounding narrative is important for audience engagement, and improving this aspect of the Wee Archie interface is an important part of explaining more complex concepts such as parallel communication. In previous demonstrations for Wee Archie, a framework written in Python is used to display a user interface on a connected computer. This starts the demonstration by contacting a demonstration web server running on Wee Archie, which in turn uses MPI to run the code on all of the other Raspberry Pis. It returns any results to the client continuously.

I created an internal website for Wee Archie, but due to the performance constraints of serving complicated websites from a Raspberry Pi while it is handling so many communications already, I opted to make it a static website - one which does not require processing by the server other than providing the correct files. I did this using the Gatsby framework.

In order to allow the static website to start a demonstration, I wrote my own version of the Wee Archie framework in JavaScript. This framework uses the Axios library to manage communication with the demonstration server, and the React framework to provide a generic demonstration web interface.

C and Fortran Framework

While demonstrations for Wee Archie typically use Python for their user interface, they use the C and Fortran languages in computations for performance reasons. Thus, it was desirable to create a wrapper around MPI in these languages which will automatically let the animation server know when a communication is started. I did this using the MPI profiling interface. MPI internally refers to functions using "weak symbols", which allows you to override the functions provided by the library and allows a library developer to call their visualisation and logging code. The framework includes visualisations for most MPI communications, all shown on the next page.

Basic MPI Tutorials

This series of tutorials consists of a set of ten demonstrations to be run on Wee Archie and accompanying text. They are aimed at a complete beginner who does not have programming experience, but can understand the concept of a program doing work, and take them through all of the concepts Wee MPI has to offer.
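The pipe-based hand-shake used in the point-to-point visualisations can be sketched with Python's standard library. This is only an illustrative stand-in for the real animation server: a thread plays the role of the background animation process so the example is self-contained, and all names are invented for the sketch.

```python
from multiprocessing import Pipe
from threading import Thread

def play_animation(name, conn, log):
    # Show the "entry" frame, then block on the pipe until the coordinator
    # signals that the matching receive has started (like a blocking send).
    log.append(f"{name}: entry frame shown")
    release_msg = conn.recv()          # blocks, as the LED animation would
    log.append(f"{name}: released ({release_msg}), playing transfer frames")
    conn.send(f"{name} done")

# The real framework runs the animation in a background process on each Pi;
# a thread stands in for it here.
log = []
server_end, anim_end = Pipe()
worker = Thread(target=play_animation, args=("send-anim", anim_end, log))
worker.start()
server_end.send("matching receive posted")  # coordinator releases the animation
result = server_end.recv()                  # wait for the animation to finish
worker.join()
print(result)
```

The key point is that the animation itself blocks on `conn.recv()`, so two Pis' animations can be held at their "entry" frames until both sides of the communication are ready.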

Figure 1: Left: Top: Send and Receive, Middle: Broadcast, Bottom: Gather. Right: Top: Scatter, Middle: Reduce (Sum), Bottom: Wave demonstration in progress.

The first two demonstrate the benefits of parallel computing through the analogy of cooking, showing an embarrassingly parallel problem and then, by introducing conflict and making the problem less perfectly parallel, showing the need for efficient communication. The next series goes through point to point communications, first showing a loop of blocking and then non-blocking sends and receives. The final set discusses collective communications, demonstrating what each does and showing their animations. It also shows the way that a broadcast could be created using point to point communications.

Coastal Defence Demonstration

In the demo, the ocean is broken up into wide horizontal strips.2 Each strip is processed by a single Raspberry Pi. However, there needs to be some communication so that waves can propagate throughout the simulation. To do this, after every tick of the equation solver that is run in the simulation, the edges of these strips are exchanged between Pis. This is known as a "haloswap" and is shown by a custom animation. As many thousands of these occur during the simulation, I only show every hundredth haloswap.

The main goal of the project was to create an extensible and easily used framework to visualise parallel communications in any language. By using the animation server-client architecture and the profiling interface, this goal has been accomplished. As demonstrated in the tutorials and the coastal defence simulation, this functions in practice, and as the animations can easily be modified or turned off, there is nothing stopping adoption in future demos.

Recommendations

It would also be an improvement were all demonstrations for Wee Archie done using the web framework, as this would allow users to easily switch between demonstrations, and provide a surrounding explanation. It also allows for novel, interactive visualisations with the use of new features such as WebGL and the D3 JavaScript library.

References

1 Message Passing Interface Forum (2015). MPI: A Message-Passing Interface Standard, Version 3.1.
2 EPCC, The University of Edinburgh (2018). Wee Archie Github Repository.

PRACE SoHPC Project Title: Parallel Computing Demonstrators on Wee Archie
PRACE SoHPC Site: EPCC, Scotland
PRACE SoHPC Authors: Caelen Feller, TCD, Ireland
PRACE SoHPC Mentor: Gordon Gibb, EPCC, Scotland
PRACE SoHPC Project ID: 1907

Performance Comparison of Python and C Programs in Computational Fluid Dynamics (CFD)

Performance of Python Program on HPC

Ebru Diler

This project aims to improve the speed performance of Python by experimenting with high-performance scientific and fast data-processing libraries.

Scientific researchers use HPC to model and simulate complex problems. They require fast supercomputers in order to resolve problems in the areas of Life Sciences, Physics, Climate research, Engineering and many others. Such machines harness the power of thousands of tightly connected high-end servers to deliver enough power to programs. One such machine is ARCHER, located in Edinburgh.

In recent years, the preferred programming language for HPC has generally been C/C++ or Fortran, because they give the programmer very high control. A research project that started with one programming language can be hard to move to another, but there are tradeoffs that should be considered. One other language is Python, which is constantly growing and is developed by a strong community.

Python has become a very popular programming language due to its wide range of uses. If you read programming and technology news or blog posts, then you might have noticed the rise of Python; communities including StackOverflow and CodeAcademy have mentioned the rise of Python as a major programming language. Although it is widely used, it has some disadvantages. One of them is its poor performance. Therefore, it is not preferred for the large-scale modelling and simulation used on high performance computers. However, some high-performance, fast-processing libraries in Python are also growing. Thus, the tradeoff between compiled languages and scripting languages is being reconsidered. Python offers faster implementation time, flexibility and ease of use and learning, which gives it a very strong community, and it should be compared to the languages that have traditionally been used in HPC.

In short, this project aims to improve the speed performance of Python, a modern programming language known for ease of use and flexibility but not for speed. The main purpose will be to optimize the speed by using:

• the MPI parallelization library, mpi4py
• the fast data processing library, numpy

NumPy is one of the most robust and widely used Python libraries. A Python library is a repository of script modules that can be accessed from a Python program; it saves frequently used commands from being rewritten and helps to simplify programming. NumPy provides a multidimensional array object, routines for fast operations on arrays, and a range of routines that produce a variety of derived objects (such as masked arrays and matrices), covering mathematical, logical and basic linear algebra operations, shape manipulation, sorting, and selection.

MPI for Python provides bindings of the Message Passing Interface (MPI)[2] standard for the Python programming language, allowing any Python program to exploit multiple processors. It supports point-to-point (sends, receives) and collective (broadcasts, scatters, gathers) communications of any picklable Python object, as well as optimized communications of Python objects exposing the single-segment buffer interface (NumPy arrays, builtin bytes/string/array objects).

Introduction

Computational fluid dynamics (CFD) is a branch of fluid mechanics that uses analysis methods to analyse and solve problems involving fluid flows. In order to simulate the flow of a liquid, computers are used for the required calculations. Better solutions can be achieved with high-speed supercomputers, and the biggest and most complex problems can only be tackled with them.

Fluid dynamics is a continuous problem that can be described by partial differential equations. In this study, the finite difference approach will be used to solve the equations and determine the fluid flow model in a square cavity with a single inlet on the right side and a single outlet at the bottom of the cavity.

Methods

The fluid flow can be described by the stream function ψ, defined as follows:

∇²ψ = ∂²ψ/∂x² + ∂²ψ/∂y² = 0

Using the finite difference approach, the flow value at each grid point can be calculated as follows:

ψ(i−1,j) + ψ(i+1,j) + ψ(i,j−1) + ψ(i,j+1) − 4ψ(i,j) = 0

With the boundary values fixed, the stream function can be calculated for each point in the grid by averaging the value at that point with its four nearest neighbours. The process continues until the given number of iterations. This simple approach to solving a problem is called the Jacobi algorithm. The velocity field u must be calculated to obtain the fluid flow pattern within the cavity. The x and y components of u are related to the stream function by:

u_x = ∂ψ/∂y = ½ (ψ(i,j+1) − ψ(i,j−1))
u_y = −∂ψ/∂x = −½ (ψ(i+1,j) − ψ(i−1,j))

Implementation

There was already C code written to solve this problem. There are 3 different parameters that affect the algorithm:

1. The scale factor, which has to do with the size of the problem
2. The Reynolds number, related to the flow property of the fluid
3. The tolerance, used to provide convergence control

The flow images resulting from changing the Reynolds number are given in figures 1 and 2.

Figure 1: Fluid flow, Reynolds number: None
Figure 2: Fluid flow, Reynolds number: 2

Serial Program: I have written serial code in 3 different Python versions to compare the speed:

1. Using Python lists and for loops
2. Using NumPy arrays and for loops
3. Using NumPy arrays and built-in NumPy vectorization features [1]

Parallel Program: I started from a serial version running on a single processor and accelerated it by distributing the work over more processor cores, communicating via the MPI interface. In addition, the parallel approach is based on domain decomposition: all cores calculate their part and exchange the necessary information to keep the problem consistent. This is called halo swapping (Figure 3), and it is regulated by MPI.

Figure 3: Visualisation of Halo Swap

Results and Discussion

Comparison of Python Programs

The time spent by the different runs was measured for every one thousand iterations. The fastest program was NumPy with vectorization, which lets us avoid for loops. This matched our expectations. But we were quite surprised by the other two versions (Figure 4): the version with NumPy arrays and for loops ran much more slowly than the one without NumPy, whereas our prediction was that the program would run faster using the numpy library.

Comparison of C Programs

In order to control compilation time and compiler memory usage, and the trade-offs between speed and space for the resulting executable, GCC provides a range of general optimization levels, numbered from 0–3, as well as individual options for specific types of optimization. An optimization level is chosen with the command line option -OLEVEL, where LEVEL is a number from 0 to 3. Figure 5 shows the execution time of the C program according to the optimization level.

We also compared C and Python execution speeds. Figure 7 shows the data generated by the C and Python programs. According to the data, the efficient features of NumPy increased performance 112 times.
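The difference between the serial versions comes down to how the Jacobi update ψ(i,j) ← ¼(ψ(i−1,j) + ψ(i+1,j) + ψ(i,j−1) + ψ(i,j+1)) is written. Here is a minimal sketch of versions 2 and 3 (NumPy with Python loops versus NumPy with whole-array slicing); this is illustrative, not the project's actual code:

```python
import numpy as np

def jacobi_loops(psi, n_iter):
    """Version 2: NumPy storage, but the update uses Python for loops."""
    for _ in range(n_iter):
        new = psi.copy()
        for i in range(1, psi.shape[0] - 1):
            for j in range(1, psi.shape[1] - 1):
                new[i, j] = 0.25 * (psi[i-1, j] + psi[i+1, j]
                                    + psi[i, j-1] + psi[i, j+1])
        psi = new
    return psi

def jacobi_vectorized(psi, n_iter):
    """Version 3: the same update as whole-array slicing, no Python loops."""
    psi = psi.copy()
    for _ in range(n_iter):
        new = psi.copy()
        new[1:-1, 1:-1] = 0.25 * (psi[:-2, 1:-1] + psi[2:, 1:-1]
                                  + psi[1:-1, :-2] + psi[1:-1, 2:])
        psi = new
    return psi

grid = np.zeros((32, 32))
grid[:, -1] = 1.0                   # fixed boundary values on one edge
a = jacobi_loops(grid, 50)
b = jacobi_vectorized(grid, 50)
print(np.allclose(a, b))            # both versions compute the same field
```

The vectorized form pushes the inner loops into compiled NumPy routines, which is exactly the effect measured in Figure 4.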

Figure 4: Comparison of Python programs. Figure 5: Comparison of C programs. Figure 7: Comparison of Python and C programs.

Performance Measures for Parallelization

The speedup of a parallel code is how much faster the parallel version runs compared to a non-parallel version. Taking the time to run the code on 1 processor as T1 and the time to run the code on N processors as TN, the speedup S is found by:

S = T1 / TN

This can be affected by many different factors, including the ratio of communication to calculation. If the times are the same, the speedup is 1 and there was no change.

Figure 6: Parallel Python Program

Figure 6 demonstrates the strong scaling of the parallel Python program running on different numbers of processors. The graphic describes how the execution time of the program is affected by increasing the number of cores as the size of the problem increases.

Conclusion

Although the Python program still falls behind the C program in terms of performance, it is possible to improve its performance thanks to the NumPy library. From this project we can say that using numpy arrays without using the efficient numpy features (like vectorization) affects performance negatively for a Python program.

References

1 www.geeksforgeeks.org, Vectorization in Python
2 https://mpi4py.readthedocs.io/en/stable/, MPI for Python

PRACE SoHPC Project Title: Performance of Python programs on new HPC architectures
PRACE SoHPC Site: Edinburgh Parallel Computing Centre, United Kingdom
PRACE SoHPC Authors: Ebru Diler, [Dokuz Eylul University] Turkey
PRACE SoHPC Mentor: David Henty, EPCC, UK
PRACE SoHPC Contact: Ebru Diler, İzmir, Phone: +90 553 111 71 41, E-mail: [email protected], linkedin.com/in/ebru-diler
PRACE SoHPC Software applied: Python, Numpy, mpi4py
PRACE SoHPC Acknowledgement: Great thanks to Dr. David Henty for continuous support in all matters.
PRACE SoHPC Project ID: 1908
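The speedup S = T1/TN, together with the related parallel efficiency S/N, is easy to compute from measured run times. The timings below are made-up placeholders, not measurements from the project:

```python
def speedup(t1, tn):
    """How much faster the N-process run is than the serial run."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Speedup per core: 1.0 means perfect scaling."""
    return speedup(t1, tn) / n

# Hypothetical timings in seconds for runs on 1, 4 and 16 processes.
t1 = 120.0
for n, tn in [(4, 35.0), (16, 11.0)]:
    print(n, round(speedup(t1, tn), 2), round(efficiency(t1, tn, n), 2))
```

Falling efficiency as N grows is the usual signature of the communication-to-calculation ratio mentioned above.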

Investigation on GASNet’s Active Messaging component as an alternative to MPI

Investigations on GASNet’s active messages

Illustration of the difference between the MPI approach and GASNet’s active message (AM) approach: whereas MPI requires the receiving process to actively take the data, an active message directly jumps in and modifies the local memory.

Benjamin Huth

The project aimed at replacing an MPI-based communication backend with GASNet. However, performance tests showed that GASNet does not perform well on this task, and a subsequent broader investigation supports this observation.

Parallelism is the most important concept in modern supercomputing. On today’s large supercomputers it is mainly achieved by a technique called SPMD (Single Program Multiple Data). This means that there is only one unique program which is executed on many different CPUs (often on several hundreds or thousands) at one time, but each of them has individual data. To communicate between these programs, programmers use specific messaging libraries. The most popular one is MPI (Message Passing Interface). Its basic concept is as follows:

• The sending program calls a send-function.
• The destination program has to call a receive-function.
• Only if there is a matching pair of send- and receive-functions does the communication take place.

Although this is the most popular concept, it is not the only one. Another one, called "Active Messaging", is implemented by GASNet (Global-Address Space Networking). Its approach is very different from the above one:

• The sender sends not just data, but also a function to the destination.
• If the message arrives at the destination, it automatically executes the function, which can work on the data.
• The receiver has no direct control over the message arrivals.

There are tasks which naturally fit this second approach better, e.g. a task-based library, which abstracts the problem of explicitly sending data away from the programmer and lets them focus on the logical structure of their program and data. This was in fact the motivation for my project: we aimed at replacing the communication back-end of a library called "EDAT", which relies on MPI, with a GASNet-based one.

Change of focus

After implementing the first version of EDAT with GASNet, we saw that the performance was not as good as expected. So we changed the focus to investigate the basic properties of GASNet’s active messaging component in more depth. To get more general results, we decided not to analyse the performance of self-made code, but to rely on existing standard benchmarks (programs for measuring hardware performance) for MPI, and to rewrite them with GASNet as closely as possible. These benchmarks should cover a broad spectrum of parallel software, to figure out what the strengths and weaknesses of GASNet are. It is important to mention that we use GASNet here as a drop-in replacement for passive point-to-point communication, which is not what it was designed for.
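The contrast between the two models can be illustrated in plain Python: instead of posting a matching receive, the "receiver" just polls a mailbox and runs whatever handler arrives together with the data. This is only a single-process simulation of the idea, not GASNet's API (real GASNet handlers are registered with the library and run as the network is polled); all names here are invented for the sketch.

```python
from collections import deque

class Node:
    def __init__(self):
        self.mailbox = deque()   # incoming active messages
        self.memory = {}         # local state the handlers may modify

    def send_am(self, dest, handler, payload):
        # An active message carries *code* (the handler) along with the data.
        dest.mailbox.append((handler, payload))

    def poll(self):
        # The receiver never posts a matching receive; it simply runs
        # whatever handlers have arrived.
        while self.mailbox:
            handler, payload = self.mailbox.popleft()
            handler(self, payload)

def store_handler(node, payload):
    key, value = payload
    node.memory[key] = value     # directly modifies the local memory

a, b = Node(), Node()
a.send_am(b, store_handler, ("result", 42))
b.poll()
print(b.memory)  # {'result': 42}
```

Note that node b never specified what it expected to receive: the arriving message itself decided what happens to b's memory, which is exactly the property the illustration at the top of the article highlights.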

Figure 1: Visualisation of the two benchmarks. (a) Visualisation of the swapping process in a simple 2D stencil algorithm: the grey circles represent the processes, whereas each arrow denotes a data transfer. (b) Example of a graph (a road map connecting Hamburg, Berlin, Cologne, Frankfurt, Prague, Stuttgart, Munich, Vienna and Zurich); the bottom shows how the "breadth-first search" goes through the graph above in four steps: in the first step it reaches all points with distance one, in the second all with distance two, and so on.
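The level-by-level search shown in figure 1b can be written in a few lines. This serial sketch (plain Python, not the distributed graph500 code) returns the distance of every reachable vertex from the source; the road-map graph is invented, loosely inspired by the figure:

```python
from collections import deque

def bfs_levels(adj, source):
    """Breadth-first search: step k reaches all vertices at distance k."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w not in dist:          # first visit = shortest distance
                dist[w] = dist[v] + 1
                frontier.append(w)
    return dist

# A tiny road-map-like graph (adjacency lists, undirected).
adj = {
    "Frankfurt": ["Cologne", "Munich", "Berlin"],
    "Cologne":   ["Frankfurt", "Hamburg"],
    "Munich":    ["Frankfurt", "Stuttgart", "Vienna"],
    "Berlin":    ["Frankfurt", "Hamburg", "Prague"],
    "Hamburg":   ["Cologne", "Berlin"],
    "Stuttgart": ["Munich", "Zurich"],
    "Vienna":    ["Munich", "Prague"],
    "Prague":    ["Berlin", "Vienna"],
    "Zurich":    ["Stuttgart"],
}
print(bfs_levels(adj, "Frankfurt"))
```

In the distributed benchmark the vertices of `adj` are spread over the processes, so every edge crossing a process boundary turns into one of the many tiny messages discussed below.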

But if that works out well, it would be easier for developers to implement than redesigning the whole application.

Benchmarks for analysis

The purpose of the first benchmark is to determine the basic messaging properties of each communication system:

• The bandwidth is the amount of data a network can transfer in a certain time. Nowadays, modern networks typically reach a low two-digit number of GB/s, e.g. 10 GB/s. Compared to the bandwidth of the internal memory this is very slow, so it is a big bottleneck for parallel software.
• The latency is the time a network needs to react to a messaging request. It determines the fastest possible communication, and for small messages it is often more important than the bandwidth.

To measure these two quantities, we use a standard benchmark designed by the Ohio State University (see [1]).

The second application we chose for our tests is a stencil benchmark. In this application a grid is distributed over many processes (so each process has a small fraction of the grid in its local memory). During the run, each grid point is updated with the sum of its 4 neighbours (in the simplest 2D case). This is a typical scientific application; the above algorithm, for example, solves the Laplace equation of electrodynamics. To perform an update on the grid points at the edges of a local grid, one must transfer some grid points from a neighbour process to the local one. In the simple 2D case, each process has to send 4 chunks of data to its neighbours, but also receives 4 chunks from them (see figure 1a). As a basis for the GASNet benchmark, we used the implementation in Intel’s Parallel Research Kernels for MPI (see [2]).

The last benchmark is called graph500 (see [3]). It is designed to simulate applications which do not rely on heavy numerical computations, but on complex data structures. It implements a so-called "graph". This is an abstract data structure which consists of points (called "vertices"), some of which are connected by "edges". These edges usually represent a distance, whereas the vertices can contain any kind of information. A common and simple example of data one can represent as a graph is a road map (see figure 1b). The graph500 benchmark first creates a random graph, which is again distributed over the processes, and then starts a "breadth-first search" (see also figure 1b). This results (in contrast to the stencil benchmark) in many, but very small, messages with unpredictable destinations. To implement this benchmark, I have not rewritten the whole program, but just replaced the already abstracted communication layer with a GASNet-based one.

All used source code is available in a repository (see [4]), together with detailed instructions on how to compile GASNet and the different benchmarks.

How to run the benchmarks

We ran these benchmarks on three different systems, but here we show only the results from ARCHER, the UK’s national supercomputer. For its network there exists an optimised MPI as well as an optimised GASNet implementation, but the results are also representative for the rest of the systems. Whereas in the micro-benchmarks we scale the message size but hold the process count constant, for the two big benchmarks we scaled the parallelism.

(a) Performance of the latency-benchmark (latency [µs]): the message size is increased in powers of 2 from 1 byte to about 4 MB. (b) Performance of the bandwidth-benchmark (bandwidth [GB/s]): the message size is increased in powers of 2 from 1 byte to about 4 MB. (c) Performance of the stencil-benchmark (avg runtime per iteration [ms]): the workload is held constant at 1000000 grid-points per process respectively core. (d) Performance of the graph500-benchmark (time [ms]): the workload is held constant at 4096 vertices per process respectively core. Each plot compares GASNet with MPI.

Figure 2: performance plots from ARCHER
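A helpful way to read these plots is the simple first-order model T(n) ≈ L + n/B for transferring an n-byte message over a link with latency L and bandwidth B: for small n the latency term dominates (relevant for graph500's many tiny messages), while for large n the bandwidth term does (relevant for the stencil's few big transfers). The parameters below are illustrative round numbers, not measurements from ARCHER:

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost model for one point-to-point message."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

# Illustrative parameters: 2 microseconds latency, 10 GB/s bandwidth.
L, B = 2e-6, 10e9
small = transfer_time(8, L, B)          # 8-byte message: latency-dominated
large = transfer_time(4 * 2**20, L, B)  # 4 MiB message: bandwidth-dominated
print(small, large)
```

Under this model, doubling the latency roughly doubles the cost of the 8-byte message but barely changes the 4 MiB one, which is why a latency gap shows up so strongly in the graph500 results.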

process, and went up to 3072 processes (for the stencil) and 2048 processes (for the graph500, which requires the number of processes to be a power of two). While scaling the number of processes, we keep the amount of work per process constant (called weak scaling), since we are not interested in how the algorithm benefits from parallelism, but in how well the communication works. For a theoretical communication system with infinite speed and zero latency, the execution time would stay constant, so any increase of execution time reflects the communication overhead that comes with more parallelism.

Results and analysis

Already in the micro-benchmarks (see figures 2a and 2b) we see a significant performance difference. This is not sufficient as proof that GASNet's active messages perform badly on ARCHER, but it is a strong indicator. In the latency plot, the left side is the most interesting one: here we see that the minimal time for a message in GASNet is nearly twice as long as for MPI. On the bandwidth plot, the most interesting area is the right side, and here too we see better performance with MPI, especially for the largest message sizes.

These effects are not directly seen in the stencil benchmark: there is not much difference between the two frameworks (see figure 2c). After a big increase at the beginning (which is probably caused by memory effects inside a single node), the performance stays constant, as expected for weak scaling.

With the graph500 (see figure 2d), we see an entirely different picture. GASNet is now a whole order of magnitude slower than MPI. This could be surprising, because one might think that the active-messaging approach is more suitable for this kind of application (in fact, the original benchmark uses an active-messaging layer implemented with MPI). On the other hand, an obvious explanation can be found in the latency results above: a difference in latency is much more relevant with many small messages than with a few big ones.

Conclusion

With these results in mind, it is probably not worth using GASNet's active messaging as a replacement for MPI's passive point-to-point communication. For this reason, we dropped the previous aim of this project and focused on pure GASNet as an alternative to MPI. I think investigations on this topic are of general interest for the parallel computing community; although MPI is very good and widely used, it is always important to consider alternatives, which can not only help developers choose the right tools for their job, but also push MPI towards further improvements.

However, these are only preliminary results, and it could be interesting to investigate a few things further, such as more extensive profiling to understand the observed performance better. It may also be worth investigating GASNet's second component, RMA (Remote Memory Access), which according to the developers can compete with MPI.

References

1. Panda, D. K. OSU Micro-Benchmarks 5.6.1. Mar. 2019. Accessed 7th August 2019. http://mvapich.cse.ohio-state.edu/benchmarks/.

2. Intel Corporation. Parallel Research Kernels. 2013. Accessed 7th August 2019. https://github.com/ParRes/Kernels.

3. Graph500 Steering Committee. Graph500-3.0.0. May 2019. Accessed 7th August 2019. https://github.com/graph500/graph500.

4. Huth, B., Brown, O. T. & Brown, N. EPCCed/GASNet-AM-benchmarks: PAW-ATM19_submission. Aug. 2019. doi:10.5281/zenodo.3368621.

PRACE SoHPCProject Title
Task-based models on steroids: Accelerating event driven task-based programming with GASNet

PRACE SoHPCSite
EPCC, UK

PRACE SoHPCAuthors
Benjamin Huth, Uni Regensburg, Germany

PRACE SoHPCMentors
Nick Brown, EPCC, UK
Oliver Brown, EPCC, UK

PRACE SoHPCContact
Benjamin Huth, Universität Regensburg
Phone: +49 174 605 4097
E-mail: [email protected]

PRACE SoHPCSoftware applied
MPI, GASNet

PRACE SoHPCMore Information
gasnet.lbl.gov

PRACE SoHPCAcknowledgement
I want to thank my mentors Nick Brown and Oliver Brown for their constant support, Ben Morse for organising our stay in Edinburgh, and PRACE and Leon Kos for organising this event. Last but not least I want to thank my SoHPC colleagues Caelen and Ebru for spending this awesome summer with me!

PRACE SoHPCProject ID
1909
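The weak-scaling measurement described above can be summarised numerically: with constant work per process, any growth of run time over the single-process baseline is communication overhead. A minimal sketch with hypothetical timings (not the benchmark's actual numbers):

```python
# Weak-scaling efficiency: for an ideal (zero-latency, infinite-bandwidth)
# network, run time stays constant as processes are added, so the ratio
# baseline/time measures how much communication overhead has crept in.
# The timings below are made up purely for illustration.

def weak_scaling_efficiency(times):
    """Map {process count: run time} to {process count: efficiency}."""
    baseline = times[min(times)]          # time at the smallest process count
    return {p: baseline / t for p, t in sorted(times.items())}

# hypothetical run times (seconds), fixed work per process
times = {1: 10.0, 48: 10.4, 384: 11.0, 3072: 12.5}
for p, eff in weak_scaling_efficiency(times).items():
    print(f"{p:5d} processes: efficiency {eff:.2f}, overhead {100 * (1 - eff):.0f}%")
```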

Using HPC to investigate the structure and dynamics of a K-Ras4b oncogenic mutant protein

Studying an oncogenic mutant protein with HPC

Rebecca Lait

Molecular Dynamics (MD) simulations and Normal Mode Analysis have been used in order to discover potential binding sites on the K-Ras4b protein, which could be further used in drug design studies.

The oncogenic mutant K-Ras4b is a protein that has been studied for many years, but there has been limited progress in finding new drugs against it. Drugs are small molecules that bind to proteins and inhibit protein function. Also known as ligands, they bind to a specific region of the target protein, the binding site. In some cases, the location of the binding site is not known in advance, even though the protein structure is available. In this context, computational studies can help discover new binding sites. Hence, the goal of this project was to identify binding sites on the oncogenic protein K-Ras4b, which could benefit future therapeutic approaches.

K-Ras4b

K-Ras4b is a small GTPase, belonging to a large family of hydrolase enzymes. It is an essential component of the signalling networks controlling signal transduction pathways and promoting cell proliferation and survival. It operates as a binary switch in signal transduction pathways, cycling between inactive GDP-bound and active GTP-bound states.[1] In the GTP-bound active state, cell proliferation, which refers to an increase in the number of cells due to cell growth and division, can normally occur. However, in the GDP-bound inactive form, cell proliferation is inhibited. Figure 1 shows the GTP molecule bound to the K-Ras4b protein.

Figure 1: The GTP molecule bound to the K-Ras4b protein.

During the hydrolysis of GTP, a phosphate group is lost, which results in the formation of GDP. Thus, these two states act like switches turning the function of the protein on and off, respectively. However, the hydrolysis of GTP to GDP is no longer possible in the G12D mutated K-Ras4b, where the glycine in residue position 12 has been replaced with aspartic acid. Consequently, the protein is 'locked' in the active GTP state.[1] This leads to over-activation of the K-Ras4b protein and, as a result, to cancer growth.

The K-Ras4b protein comprises three functional domains: the effector binding region, defined by amino acid residues 1-86; the allosteric region, by residues 87-166; and the HVR tail, by residues 167-188. Each of these three domains is responsible for a particular interaction or function which contributes to the overall function of the protein.[2] The first domain in K-Ras4b, the effector binding region, is located in the N-terminus of the protein. There are three sub-domains within the effector binding region: the P-loop, switch I and switch II.[1] Once GTP is bound, the protein undergoes conformational changes that involve switch I and switch II. The second domain, the allosteric region, has been suggested to play a role in switch II conformation and the orientation of the membrane, while the third domain, the HVR tail, regulates the biological activity of the protein.[3] Within the aqueous component of the cytoplasm of a cell, the CYS185 of the protein is farnesylated, meaning that a farnesyl group is added, allowing the protein to bind to the inner leaflet of the plasma membrane.[4] There are many positively charged lysine amino acids within the HVR tail that interact strongly with the negatively charged membrane.[1] These functional domains can be seen in Figure 2. The blue region highlights the effector binding region, the red region displays the allosteric region and the green region shows the HVR tail.

Figure 2: The K-Ras4b protein with the three domains highlighted in red, blue and green.

Aims of the project

The aim of this project was to identify new potential binding sites on the surface of the unmutated wild type and the G12D mutant K-Ras4b, where small molecules can bind. This would aid the rational design of selective K-Ras4b inhibitors. In order to achieve this, Molecular Dynamics simulations and Normal Mode Analyses were performed.

Molecular Dynamics simulations

Atomic coordinate files of the protein and the specific attached ligand structure - either GTP or GDP - were downloaded from the Protein Data Bank.[5] Protein preparation, including solvation and ionisation, was then carried out. The pdb files used for the unmutated WT K-Ras4b bound to GDP, G12D mutated K-Ras4b bound to GDP, unmutated WT K-Ras4b bound to GTP and G12D mutated K-Ras4b bound to GTP had pdb codes 5W22, 5US4, 5VQ2 and 6GOF, respectively.[5] Regarding the G12D K-Ras4b GTP structure, a structure was not available and so one was created using the 6GOF structure as a starting point. The HVR region was taken from pdb code 2MSC and appended to the 6GOF structure. Moreover, one of the two magnesium ions was removed and the GPPNHP molecule was replaced with the GTP molecule. The solvation step, which places the protein in a water box, and the ionisation step, which places ions in the water, were carried out in order to represent a more realistic biological environment. Thus, the following systems were created: unmutated wild type K-Ras4b bound to GDP or GTP, and G12D mutated K-Ras4b bound to GDP or GTP.

Once protein preparation was completed for all four K-Ras variants, Molecular Dynamics simulations were performed using the ARIS supercomputer. For the WT K-Ras4b bound to GDP, the simulation was run for 75 ns, and for the remaining 3 variants for 23 ns each.

Identifying K-Ras4b binding sites

After performing the Molecular Dynamics simulations, the last frame of each simulation was collected and imported into the Schrodinger Suite for binding site identification. SiteMap was used to determine whether any cavities with hydrophobic characteristics and hydrogen bonding capacity existed on the simulated proteins.[6] Initially, regions on the protein surface, called 'sites', were identified as potentially promising. These were located using a grid of points called 'site points'. In the second stage, contour maps, 'site maps', were generated and the predicted binding site was visualised. Finally, each site was assessed by calculating various properties, such as druggability and hydrophobicity.[6] All of the predicted binding sites were investigated and only the ones that appeared the most promising would continue to further analysis. The identified binding sites for the mutated G12D K-Ras4b bound to GTP are shown in Figure 3 below, in three colours: red, yellow and purple. They are each highlighted using coloured circles and are each labelled.

Figure 3: The identified binding sites for the G12D mutated K-Ras4b bound to GTP. Each binding site is labelled and is highlighted using coloured circles.

Normal mode analysis

Normal mode analysis was carried out alongside the binding site identification in order to investigate large-scale motions of the protein structures. Normal mode vibrations are simple harmonic oscillations around an energy minimum. The normal modes with the biggest amplitude can indicate the functional motions of the proteins. In the context of this project, the fluctuations around predicted binding sites were investigated. More specifically, if the analysis predicted large-scale motions at a particular area of the protein that SiteMap identified as a potential binding site, it may be suggested that this is a biologically functional region of the protein and could act as a binding site. This type of analysis is qualitative. However, when it is combined with the binding site identification results, it can give useful insights regarding the functional areas of a protein. For each of the four studied K-Ras variants, regions with considerable large-scale motions near the identified binding sites were found. An example of this analysis is shown in Figure 4 below and compared to the binding site identification result from SiteMap.

Figure 4: The normal mode analysis undertaken on the unmutated wild type K-Ras4b structure bound to GDP. The binding site is circled.
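The harmonic picture behind normal mode analysis can be illustrated with a toy system (not the project's actual K-Ras4b analysis): the modes are eigenvectors of the Hessian of the potential energy at a minimum, and the softest (lowest-eigenvalue) modes correspond to the large-amplitude collective motions discussed above.

```python
import numpy as np

# Toy illustration: normal modes are eigenvectors of the Hessian
# (second-derivative matrix) of the potential energy at a minimum.
# Small eigenvalues = soft, large-amplitude collective motions.
# The 1D spring chain below is purely illustrative.

def normal_modes(hessian):
    """Return squared frequencies and mode vectors, sorted soft -> stiff."""
    eigvals, eigvecs = np.linalg.eigh(hessian)   # eigh sorts ascending
    return eigvals, eigvecs

# Hessian of three unit masses joined by unit springs in a 1D chain
H = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])
w2, modes = normal_modes(H)
print(np.round(w2, 3))   # softest mode first; includes one zero (translation) mode
```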

Conclusions

In this project, models of the K-Ras4b structures were constructed and MD simulations were used to study the dynamics of this protein. Binding site identification was performed in order to find cavities where candidate drugs could potentially bind, resulting in successfully identifying sites on the four K-Ras4b variants in different domains of the protein. Normal Mode Analyses were used to identify whether the predicted binding sites lie in areas of the proteins with large fluctuations.

References

1 Lu, S., Jang, H., Gu, S., Zhang, J. and Nussinov, R. (2016). Drugging Ras GTPase: a comprehensive mechanistic and signaling structural view. Chemical Society Reviews, 45(18), pp. 4929-4952.

2 EMBL-EBI Train online. (2019). What are protein domains? [online] Available at: https://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi/protein-classification/what-are-protein-domains [Accessed 2 Aug. 2019].

3 Hobbs, G., Der, C. and Rossman, K. (2016). RAS isoforms and mutations in cancer at a glance. Journal of Cell Science, 129(7), pp. 1287-1292.

4 Jančík, S., Drábek, J., Radzioch, D. and Hajdúch, M. (2010). Clinical Relevance of KRAS in Human Cancers. Journal of Biomedicine and Biotechnology, 2010, pp. 1-13.

5 RCSB PDB: Homepage. [online] Rcsb.org. Available at: https://www.rcsb.org/ [Accessed 29 Aug. 2019].

6 Halgren, T. A. (2009). Identifying and characterizing binding sites and assessing druggability. J Chem Inf Model 49: 377-389.

PRACE SoHPCProject Title
Using HPC to investigate the structure and dynamics of a K-Ras4b oncogenic mutant

PRACE SoHPCSite
Biomedical Research Foundation of Athens (BRFAA), Greece

PRACE SoHPCAuthors
Rebecca Lait, England

PRACE SoHPCMentor
Zoe Cournia, BRFAA, Greece

PRACE SoHPCContact
Rebecca Lait
E-mail: [email protected]

PRACE SoHPCSoftware applied
Virtuoso

PRACE SoHPCMore Information
www.virtouso.org

PRACE SoHPCAcknowledgement
I would like to thank the PRACE Summer of HPC programme for allowing me to undertake this internship. I will forever be grateful for this incredible opportunity and the experience I have gained. In addition, I would like to thank my colleagues at the Biomedical Research Foundation of Athens for their continued support and guidance. They made my project so enjoyable and taught me so many useful lessons that I know will be essential in my future career. Moreover, I am so appreciative of being granted time on the ARIS supercomputer in Athens during this project. Being introduced to the capabilities of this supercomputer has been inspiring. Thank you PRACE!

PRACE SoHPCProject ID
1910

HPC for candidate drug optimization using free energy perturbation calculations

Lead Optimization using HPC

Antonio Miranda

Free Energy Perturbation (FEP) calculations may reduce the time and cost of drug discovery. Specifically, they may help in optimizing the binding affinity between the drug candidate and the therapeutic target. To validate this methodology within the context of drug discovery, FEP calculations were performed and compared with experimental data. Results suggest that FEP has predictive ability and can be used for lead optimization but using HPC resources is crucial.

The cost of creating and launching one drug has been reported to be above US$1 billion. Indeed, there are even studies that estimate this cost at more than US$2.8 billion.1

Figure 1: Drug discovery cost. Data Source: Paul et al. (2010)7

Drug discovery has several phases (Figure 1). One of them, "lead optimization", may account for almost 25% of the total cost of all four phases, according to a study from 2010.1 This phase includes optimizing drug metabolism, pharmacokinetic properties, bioavailability, toxicity, and of course efficacy. Drug efficacy is directly dependent on the binding affinity of the candidate drug onto the pharmaceutical target, which is commonly a protein.

Lead Optimization using computational approaches

This project focuses on the optimization of the binding affinity, i.e. the strength of interaction of the drug candidate onto the target protein, using computational approaches. Specifically, the binding affinity of several modified compounds (analogs) of an original lead inhibitor of a target protein was studied using Molecular Dynamics (MD) simulations coupled with Free Energy Perturbation (FEP) calculations.

Free Energy Perturbation calculations

Free Energy Perturbation (FEP) calculations offer a promising tool for computing the relative binding affinity of two drug candidates, because they have a rigorous statistical mechanics background.2,3 In particular, in FEP calculations the free energy difference, ∆G, between compound A and compound B is computed as an average of a function of their energy (kinetic and potential) difference, obtained by sampling the reference state. Then, using a thermodynamic cycle (Figure 2), the relative binding free energy, ∆∆G, between two congeneric ligands A and B is calculated from the following equation: ∆∆G = ∆GB − ∆GA. If this approach results in ∆∆G < 0, it is concluded that ligand B is favored over ligand A (Figure 2).

Figure 2: FEP thermodynamic cycle. Source: Athanasiou et al. (2017)6

The use of FEP simulations in drug discovery has historically been restricted by the high computational demand, the limited force field accuracy and its technical challenges. However, in recent years, FEP calculations have benefited from advances in GPU and HPC availability, force-field research and the development of more elaborate simulation packages.2

Once these limitations have been overcome, it is time to assess whether FEP calculations using HPC resources are efficient for drug discovery. If this is the case, the time and cost of drug discovery could be reduced. This reduction could have implications for the final market drug price, which would eventually benefit the whole of society. Therefore, the scope of this work is the comparison of FEP simulation results with experimental results, as well as to assess whether using HPC resources would be efficient for FEP calculations. The comparison with experimental results has been done using several analogs of CK666, a micromolar Arp2/3 inhibitor. Arp2/3 is a protein involved in cell locomotion and a mediator of tumor cell migration (thus, it is involved in metastasis).4,5 The HPC resource used was the Greek national supercomputer, ARIS, from the GRNET facility.

Methods

For every ligand, there are two states: the ligand free in solution and the ligand bound to the protein in solution. Then, when computing ∆∆G, there are four states: ligand A in solution, ligand B in solution, the ligand A-protein complex in solution and the ligand B-protein complex in solution. In a FEP, or alchemical, calculation, instead of simulating the binding/unbinding processes directly (∆G′1 and ∆G′2 in Figure 2), which would require a simulation many times the lifetime of the complex, ligand A is alchemically transmuted into ligand B in solution or in the protein environment, through intermediate, non-physical states (∆G′A and ∆G′B in Figure 2). Because free energy is a state function, the thermodynamic cycle of Figure 2 can be used to compute the difference in the free energy of binding between ligand A and ligand B: ∆∆G = ∆G′B − ∆G′A = ∆G′2 − ∆G′1.

In the consecutive non-physical intermediates, the interactions of the atoms are progressively transformed from the reference to the final state. In the simulations performed during this work using the software GROMACS 5.1.4,8,9 21 non-physical intermediates were used, split into 21 windows. This results in performing 21 FEP calculations. During the first 10 windows, van der Waals and bonded interactions were switched on for the appearing atoms. During the last 10 windows, van der Waals and bonded interactions were already switched on and electrostatic interactions were switched on progressively.

The systems need to be prepared before one can perform the FEP calculations. The setup includes preparation of the structures, parameterization of the molecules, alignment of the ligands, complex formation, and solvation and neutralization. For the protein the AMBER ff14 force field10 was used, while for the ligands we used the GAFF2 force field.11 The ligands are aligned because it is assumed that their common core has the same binding mode. By having them aligned, there is more overlap between the ligands' potential energies and the FEP calculations are more likely to converge. All setup steps were performed using the FESetup tool.12 In addition, coordinate files for some of the simulated analogs were created by modifying the reference ligand with the Maestro interface of the Schrödinger software.13

After the setup, MD simulations were run on the ARIS supercomputer. MD simulations consist of four phases: energy minimization, two equilibration phases and production. Energy minimization is performed to relax the structures and guarantee there are no inappropriate geometries. Then, equilibration is performed to stabilize the temperature and pressure of the system. Finally, each of the 21 non-physical intermediate states described above was simulated for 5 ns during the production run.

At the end of the production simulation, the obtained potential and kinetic energies allow calculation of the free energy difference for each of the intermediate steps. In this work, the Bennett Acceptance Ratio was used to compute the free energy differences.14 The total free energy difference results from summing up the free energy differences of each intermediate step. Finally, once both legs were finished, ∆∆G was calculated.

Prior to performing the simulations, scaling curves were obtained in a previous phase of the project, and it was determined that complex-leg calculations should be run on 4 computing nodes. The solvent leg only required 1 computing node.

Results

The figures below show the correlation between ∆∆G values obtained experimentally and with simulations using Gromacs and GAFF2. These correlation plots show an agreement in the values for some analogs. However, 5 out of 11 analogs retrieved results with a discrepancy larger than 1 kcal/mol. The Root Mean Square Error (RMSE) was 1.42 kcal/mol and the Mean Unsigned Error (MUE) was 1.13 kcal/mol.

Results obtained using Gromacs and the GAFF2 force field for ligand parametrization have been compared to previous results with the same software package, using the same pipeline parameters but a different force field, GAFF. From the correlation plots, it is observed that the computed ∆∆G values are highly correlated with those of the previous experiments except in one case (RMSE was 0.31 kcal/mol, MUE was 0.24 kcal/mol and R² was 0.97 excluding that analog). For this outlier analog, the calculations using GAFF2 correlated better with the experimental results.

Figure 3. Left: Correlation plot between experimental and FEP results using Gromacs and GAFF2. Upper-right: correlation between experimental and FEP results with Gromacs and GAFF. Lower-right: correlation between FEP results using Gromacs and GAFF and Gromacs and GAFF2. This figure has fewer points because not all compounds analyzed with GAFF2 were analyzed with GAFF.

Discussion

The correlation plots show that, in 6 out of 11 analogs, FEP results correlate with experimental results within a 1 kcal/mol error. In general, the results obtained an RMSE of 1.42 kcal/mol and an MUE of 1.13 kcal/mol. This shows that FEP calculations may have predictive power and can be integrated in the drug discovery pipeline. With respect to further refinement, more work on force field parameters and on sampling techniques, the treatment of the protein-ligand environment and of chemical effects (e.g. buffer salt conditions or protonation states) may be required to further enhance the accuracy of FEP calculations. This study also shows that HPC resources are absolutely crucial for enabling high-throughput lead optimization with FEP.

It was previously mentioned that each window was simulated for 5 ns. Since there are 21 intermediate windows, the total simulated time is 105 ns per leg. For the solvent leg, every 5 ns simulation was run on 10 cores and took 6.5 hours. For the complex leg, 80 cores were used and every 5 ns simulation ran for 13.5 hours. This means that, if we were to run simulations for 100 analogs in 24 hours, we would need 84000 cores for the complex leg and 5250 cores for the solvent leg.

References

1 DiMasi, J.; Grabowski, H. and Hansen, R. Innovation in the pharmaceutical industry: New estimates of R&D costs. J. Health Econ., 2016, 47, pp. 20-33.

2 Cournia, Z.; Allen, B. and Sherman, W. Relative Binding Free Energy Calculations in Drug Discovery: Recent Advances and Practical Considerations. J. Chem. Inf. Model., 2017, 57(12), pp. 2911-2937.

3 Zwanzig, R. W. High-temperature equation of state by a perturbation method. I. Nonpolar gases. J. Chem. Phys. 1954, 22(8), 1420-1426.

4 Baggett, A. W.; Cournia, Z.; Han, M. S.; Patargias, G.; Glass, A. C.; Liu, S.-Y. and Nolen, B. J. Structural Characterization and Computer-Aided Optimization of a Small-Molecule Inhibitor of the Arp2/3 Complex, a Key Regulator of the Actin Cytoskeleton. ChemMedChem, 2012, vol. 7, pp. 1286-1294.

5 Hetrick, B.; Han, M. S.; Helgeson, L. A. and Nolen, B. J. Small Molecules CK-666 and CK-869 Inhibit Actin-Related Protein 2/3 Complex by Blocking an Activating Conformational Change. Chem. and Bio., 2013, vol. 20, pp. 701-712.

6 Athanasiou, C.; Vasilakaki, S.; Dellis, D. and Cournia, Z. Using physics-based pose predictions and free energy perturbation calculations to predict binding poses and relative binding affinities for FXR ligands in the D3R Grand Challenge 2. J. Comput. Aided Mol. Des. 2018 Jan; 32(1): 21-44.

7 Paul, S.; Mytelka, D.; Dunwiddie, C.; Persinger, C.; Munos, B.; Lindborg, S. and Schacht, A. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nat. Rev. Drug Discov. 2010 Mar; 9(3): 203-14.

8 Lindahl, E.; Hess, B.; van der Spoel, D. GROMACS 3.0: A Package for Molecular Simulation and Trajectory Analysis. J. Mol. Model. 2001, 7, 306-17.

9 Berendsen, H. J. C.; van der Spoel, D.; van Drunen, R. GROMACS: A Message-Passing Parallel Molecular Dynamics Implementation. Comput. Phys. Commun. 1995, 91, 43-56.

10 Maier, J. A.; Martinez, C.; Kasavajhala, K.; Wickstrom, L.; Hauser, K. E. and Simmerling, C. ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB. J. Chem. Theory Comput. 2015, 11, 8, 3696-3713.

11 Wang, J. M.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development and Testing of a General AMBER Force Field. J. Comput. Chem. 2004, 25, 1157-1174.

12 Loeffler, H. H.; Michel, J. and Woods, C. FESetup: Automating Setup for Alchemical Free Energy Simulations. J. Chem. Inf. Model. 2015, 55, 12, 2485-2490.

13 Schrödinger Release 2019-3: Maestro, Schrödinger, LLC, New York, NY, 2019.

14 Bennett, C. H. Efficient estimation of free energy differences from Monte Carlo data. J. Comput. Phys. 1976, 22, 245-268.

PRACE SoHPCProject Title
HPC for candidate drug optimization using free energy perturbation calculations

PRACE SoHPCSite
Biomedical Research Foundation of the Academy of Athens (BRFAA) and Greek Research and Technology Network (GRNET), Greece

PRACE SoHPCAuthors
Antonio Miranda, University Carlos III, Spain

PRACE SoHPCMentor
Zoe Cournia, BRFAA, Athens

PRACE SoHPCAcknowledgement
My thanks to Dr. Zoe Cournia and the Lab members at BRFAA for receiving me, to Dimitris Ntekoumes for all his support and suggestions, and to Dr. Dimitris Dellis at GRNET for his support and for providing access to the ARIS supercomputer.

PRACE SoHPCProject ID
1911
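As a back-of-the-envelope illustration of the free-energy bookkeeping used in this kind of study (the window values below are invented, not simulation output): each leg's total ∆G is the sum of its per-window differences, and ∆∆G is the difference between the two legs of the thermodynamic cycle.

```python
# Sketch of how per-window free energies combine (hypothetical numbers):
# each leg's total ∆G is the sum over its 21 window differences, and the
# relative binding free energy is the difference between the two legs.
complex_leg_windows = [0.31, -0.12, 0.05] * 7   # 21 made-up window ∆Gs (kcal/mol)
solvent_leg_windows = [0.28, -0.10, 0.04] * 7

dG_complex = sum(complex_leg_windows)   # ∆G'_2 in the thermodynamic cycle
dG_solvent = sum(solvent_leg_windows)   # ∆G'_1
ddG = dG_complex - dG_solvent           # ∆∆G = ∆G'_2 - ∆G'_1
print(f"ddG = {ddG:.2f} kcal/mol")      # negative would favour ligand B
```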

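The error metrics quoted in this article (RMSE and MUE between experimental and computed ∆∆G values) are straightforward to compute; a minimal sketch with made-up numbers, not the project's data:

```python
import math

# Hypothetical paired ∆∆G values (kcal/mol), for illustration only.
experimental = [0.5, -1.2, 2.0, -0.3, 1.1]
computed     = [0.9, -0.4, 1.2, -1.5, 1.6]

def rmse(a, b):
    """Root Mean Square Error between paired values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def mue(a, b):
    """Mean Unsigned (absolute) Error between paired values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

print(f"RMSE = {rmse(experimental, computed):.2f} kcal/mol")
print(f"MUE  = {mue(experimental, computed):.2f} kcal/mol")
```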
Hybrid Monte Carlo/Deep Learning Methods for Matrix Computation on Advanced Architectures

Deep Learning for Matrix Computations

Mustafa Emre Şahin

Solving or inverting systems of linear algebraic equations (SLAE) is significant in many engineering applications and scientific fields. In this project, the implementation of a stochastic gradient descent method for inverting SLAE is explained simply, together with High Performance Computing (HPC) methods on advanced architectures.

Have you ever noticed the waiting time for machines decreasing day by day in our daily life? Of course nobody wants to wait for slow automated machines or have slow smart phones, but do you remember how long you used to wait for old devices to start up?

How to express real-world problems to the machines?

We can describe problems with the help of linear algebra and solve them by expressing them as systems of linear algebraic equations. Machines understand things as numbers in a matrix, and any manipulation means matrix operations for the machines.

"Math is boring and hard."

Yes, everything is boring and hard while we are struggling to understand it. Let's start with a basic question:

"Why do we need matrix inversion?"

Let's assume the equation Ax = b, and we want to find x. To find x, both sides could be divided by A: x = b/A. However, you can't simply divide when A is a matrix, so you need another method to solve the problem. Since A⁻¹A = I, the inverse of the matrix can be used to solve the problem by multiplying from the left side: x = A⁻¹b. As you can see, matrix operations are not scary monsters, they are just different.

"What changes in our life when we solve for x?"

X is the hero of our daily lives. It is the voltage in electronic devices, the answer of many simulations like climate change or aerodynamics. Linear systems of equations are the foundation of science and technology.

Figure 1: For example, this is how we see images, but machines see them as numbers.

The Matrix Inversion

Let's recall the good old memories from high school: the inverse matrix can be calculated by using minors, cofactors and the adjugate. However, when the matrix size is large, this method is not very optimal. Iterative methods are important for the sake of rapidness while finding the inverse of the matrix. Assume solving a system of linear equations (Ax = b in matrix form) with b = I, by calculating Ax − b over and over again until a solution under the desired error is obtained. Still, there are too many steps while using iterative solvers, so their number can be reduced with the help of a preconditioner.[1]

Recipe for Matrix Inverse

A⁻¹A = I
x = A⁻¹b

In this project, the aim is improving the non-recovered output of "Markov chain Monte Carlo matrix inversion, (MC)²MI [2]" using a stochastic gradient descent method. Though the (MC)²MI method results in a bad inverse matrix, the inverse can be easily obtained with a time-costly recovery process.[1]

Figure 2: Outputs of the (MC)²MI method, left: Non-Recovered Inverse, right: Recovered Inverse. [1]

Figure 3: Left: Implementation of the mSGD algorithm on the non-recovered inverse of the (MC)²MI. Middle: Speed-up graph of non-uniform mSGD with batches. Right: The time consumption of different parts of the code per process.

"What is Stochastic Gradient Descent (SGD)?"

Gradient descent is an iterative method used to minimise an objective function. If a randomly selected part of the samples is used in every iteration of gradient descent, it is called stochastic gradient descent. Let's check the recipe for the method in our case:

• Assume the system Ax = b, where b is the identity matrix, so that x = A⁻¹. Since the aim is to estimate a solution with low error Ax − b, the objective function to be minimised is F(x) = (1/2m)‖Ax − b‖², where m is the number of rows of A.
• After uniform random selection of a row i in every iteration, the objective function becomes f_i(x) = (1/(2 m p_i)) (A_{i*}x − b_i)².
• If we want delicious stochastic gradient descent, we should add some gradient to our objective function: ∇f_i(x) = (1/(m p_i)) A_{i*}^*(A_{i*}x − b_i), where A_{i*}^* is the conjugate transpose of the row A_{i*}.
• For every round (iteration) k, we update x as x_{k+1} = x_k − α ∇f_i(x_k), where the learning rate α determines the proportion of new information that will be added.

As in deep learning, the stochastic gradient descent algorithm of the paper [3] is implemented in Python to test our case. The mSGD method of that paper assumes a sparse matrix A that is full rank and overdetermined (number of rows ≫ number of columns) in the linear system Ax = b. The ratio p of the number of non-zero elements over the total number of elements of A is also included in the algorithm, to estimate a better solution. With the magic of the math (details are in [3]), the updated x becomes

x_{k+1} = x_k − α( (1/p²) A_{i*}^*(A_{i*}x_k − b_i) − ((1 − p)/p²) diag(A_{i*}^* A_{i*}) x_k ),

where diag(·) keeps only the diagonal matrix of its input.

"Implementing mSGD to the (MC)²MI"

Although the matrix A is square (number of rows = number of columns) in our case, the method successfully decreases the error (Figure 3) when MaximumSteps = 2⁴, α = 4 × 10⁻⁴, and the non-recovered inverse A⁻¹ from (MC)²MI is given as the initial solution x to the system. After successful results in Python, the algorithm was implemented in C++ with the Eigen library on the Scafell Pike X1000 supercomputer.

"Life is not fair. Why should random selection be fair?"

In mSGD, the rows are selected with uniform probability. The results are slightly better (Figure 3) if the probability of selecting a row is instead proportional to the norm of the row.

"Last touch: adding batches to the parallelization"

Stochastic gradient descent and batching are as like as two peas in a pod. Instead of using the whole matrix A for the method, the rows are divided into subsets, one for each process, in a hybrid MPI/OpenMP parallelization. When the rows cannot be divided equally, it is a good trick to give fewer rows to the master process, since it has more work than the others.

Discussion and Conclusion

I successfully implemented and parallelized the Stochastic Gradient Descent method for the (MC)²MI algorithm, with a non-uniform random distribution, batching and a successful speed-up. As the number of processes increases, the communication takes more time than the rest, however the speed-up improves, as shown in the middle and right images of Figure 3. Further acceleration methods for the algorithm can be investigated by future generations of Summer of HPC participants.

References
1. Lebedev, A. (2017). Accelerating stochastic methods for the SLAE. SoHPC Final Reports.
2. Alexandrov, V. N. (1998). Efficient parallel Monte Carlo methods for matrix computations.
3. Ma, A., Needell, D. (2017). Stochastic Gradient Descent for Linear Systems with Missing Data. arXiv:1702.07098.

PRACE SoHPC Project Title: "The Cherry on the Cake": Hybrid Monte Carlo/Deep Learning Methods for Matrix Computation on Advanced Architectures
PRACE SoHPC Site: STFC Hartree Centre, United Kingdom
PRACE SoHPC Authors: Mustafa Emre Şahin, İzmir Institute of Technology, Turkey
PRACE SoHPC Mentor: Prof. Vassil Alexandrov, STFC Hartree Centre, UK
PRACE SoHPC Acknowledgement: I am deeply thankful to my project mentor Prof. Dr. Vassil Alexandrov and my shifu Anton Lebedev. Many thanks to Assoc. Prof. Dr. Leon Kos for his help during problematic times. I am grateful to my mentors Assist. Prof. Dr. Işıl Öz and Prof. Dr. Ferit Acar Savacı for helping me build my personality and career.
PRACE SoHPC Project ID: 1912
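As an illustrative sketch (pure Python, function name and data layout mine, not the project's C++/Eigen code), the SGD update above can be written as follows. For a dense matrix the density p is 1, so the mSGD correction term vanishes and the update reduces to the plain x_{k+1} = x_k − α A_{i*}^*(A_{i*}x_k − b_i):

```python
import random

def msgd_inverse_column(A, e, alpha=4e-4, steps=2000, seed=0):
    """Sketch of the mSGD update (Ma & Needell, 2017) applied to one
    column of A^-1: solve A x = e for a unit vector e.
    p is the fraction of non-zero entries of A (its density)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    nnz = sum(1 for row in A for v in row if v != 0)
    p = nnz / (m * n)
    x = [0.0] * n  # the article instead seeds x with the (MC)^2MI estimate
    for _ in range(steps):
        i = rng.randrange(m)                 # uniform row selection
        Ai = A[i]
        r = sum(Ai[j] * x[j] for j in range(n)) - e[i]   # A_i x - b_i
        for j in range(n):
            grad = (1.0 / p**2) * Ai[j] * r \
                 - ((1.0 - p) / p**2) * Ai[j] * Ai[j] * x[j]
            x[j] -= alpha * grad
    return x
```

For a small dense 2x2 system the iterate converges to the corresponding column of A⁻¹; in the real setting the non-uniform row selection and batching described above would replace the uniform choice of i.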

Performance evaluation of a dissipative particle dynamics code for very large systems using the latest hybrid CPU-GPU architectures. Billion bead baby

Davide Di Giusto

Dissipative particle dynamics simulations allow us to overcome the length-scale limits of molecular dynamics, while maintaining the inner, discrete nature of matter. To simulate realistic applications, a huge system is necessary, with a number of particles of the order of billions. This project aims to extend the current size limits of the DL Meso DPD code and scale it on a large GP-GPU architecture.

The Dissipative Particle Dynamics (DPD) method provides a powerful tool to simulate physical phenomena of common interest. The approach implies the definition of a particle size, which can include several small molecules or a few more complex ones. The action of three forces, conservative, dissipative and stochastic, allows this method to be valid for the simulation of simple and complex fluids. Finally, DPD deals with the mesoscale, between the nanoscale, which describes the size of the molecules, and the microscale, where the description of continuum media starts. Potentially, large scale simulations based on DPD can try to overcome the distance that separates the mesoscale and the microscale, while still maintaining features derived from the particle size, closer to the nanoscale of matter: this could benefit both the research and industrial worlds.

The DL Meso DPD code (DL Meso) implements the Dissipative Particle Dynamics method and, thanks to the usage of GPU accelerators, we will explore its size limits and extend them, in order to perform large scale simulations. The DL Meso DPD code is written in Fortran but offers the possibility to transfer the calculations to GPUs, as the main loop has been ported to CUDA C. Let us consider the original version of the code. The aim of the project was to target its size limitations, so as to extend the simulation possibilities. From previous tests, the largest number of beads that had been deployed was two billion. Not bad at all, we could say! But we should be ambitious in our work, so we tried performing a three billion bead simulation. For it to run, several quantities needed to be stored in the machine memory in the corresponding variables, defined by a size and a type. Moreover, many of these were directly proportional to the number of beads stocked for the simulation, therefore we assumed straightforwardly that this parameter was causing integer overflow for the mentioned variables. In computer programming, overflow occurs when an integer value is stored in a variable whose memory allocation is not big enough for the current necessity. For example, if an integer value exceeds the typical range of a 4-byte integer, which goes from -2147483648 to 2147483647, the code ends up improperly accessing the system memory, leading to its termination. For sure this was happening for the variable that stocked the total number of beads in the system (3 billion), but sneakier ones were hiding in the scripts.
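The wrap-around can be illustrated from Python with a fixed-width integer type (a sketch for illustration only; DL Meso itself uses Fortran/C default integers, not Python):

```python
from ctypes import c_int32

# A bead count of 3 billion does not fit in a signed 4-byte integer
# (range -2147483648 .. 2147483647); c_int32 mimics the wrap-around a
# default Fortran/C integer would suffer.
beads = 3_000_000_000
wrapped = c_int32(beads).value
print(wrapped)  # -1294967296: a negative, meaningless bead count
```

Any array index or loop bound derived from such a count then points outside the allocated memory, which matches the terminations observed above.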

Figure 1: NVIDIA Tesla P100 representation: it is possible to observe the multitude of NVIDIA CUDA cores, organised in streaming multiprocessors. The huge number of cores determines its inner parallelism. (Source: Link)

The complexity of this problem, together with the articulate structure of the code, required proper tools for debugging. The Eclipse™ integrated development environment and the Nsight™ Eclipse Edition were chosen, as they allow a visual exploration of the code lines while running in debug mode: in this way, it was possible to understand which variables needed to be changed and which ones could remain of the same integer type. Once the code was ready to go, we could start the strong scaling experiment. Basically, consecutive runs of the same code were performed: the first was the serial one, which set the reference time required for its completion. Then, the parallel executions were performed, doubling the computational resources each time. This was possible because the DL Meso DPD code is prone to parallel execution, following the typical domain decomposition implemented using MPI libraries: in essence, the system volume is split equally between the processes, while the particles jump from one to the other as they move.

Figure 2: Typical domain decomposition for 4 parallel processes: each colour indicates a different process identification number.

The comparison between the execution times returns the speedup as a function of the number of parallel executions. When the speedup is ideal, meaning that with twice the processes our execution takes half the time, the code is said to scale well; but the ideal scaling is undermined by the communication between parallel processes. As mentioned before, the particles move through the system volume, while the parallel decomposition scheme remains static as the simulation progresses in time. Therefore, at every time-step messages need to be sent between all the processes, in order to redistribute the beads according to their most recent positions. Obviously, the larger the number of parallel executions, the larger the total amount of communication required. Eventually, while scaling our code, the communication time becomes comparable with the time spent on the calculations, resulting in a deviation from the ideal speedup.

Figure 3: DL Meso code scaling for the MPI-only version. For a small number of particles the system does not scale well.

At this point, we just needed a proper computer to work with. Our basic necessity was a large GPU architecture. Luckily, we could obtain access to Piz Daint: this is the largest supercomputer in Europe, currently 6th in the Top 500 list, managed by the Swiss National Supercomputing Centre (CSCS). Piz Daint offers 5704 hybrid nodes, each equipped with one 12-core CPU and one NVIDIA® Tesla® P100. These are indeed terrific numbers, but we probably need to better define the difference between a CPU and a GPU. Whereas a CPU is the general electronic circuit which runs a computer, and can be multicore, already presenting a certain attitude towards parallelism, a GPU is a specialised device, highly efficient in the manipulation of large data blocks. Originally GPUs were created to accelerate image processing, but eventually their potential in High Performance Computing became clear. Since then, they have been deployed as accelerators in the majority of supercomputers.

Figure 5: Representation of the running configuration: each electronic circuit consists of one single-task CPU-GPU hybrid node.

Figure 4: DL Meso DPD code scaling on hybrid CPU-GPU nodes for a total number of particles of: left) 300 million; right) 3 billion.

Table 1: CPU vs GPU acceleration (time elapsed, in seconds)

Nodes | CPU     | GPU
1     | 3401.98 | 195.82
2     | 2264.75 | 147.13
16    | 334.06  | 75.22
32    | 167.28  | 36.27

In Table 1 it is possible to appreciate the acceleration granted by the GPUs. Our running configuration was the following: we wanted one GPU per process, so we would demand single-task CPU-GPU hybrid nodes. In figures 4 and 6 the scaling of the DL Meso DPD code is plotted, for system sizes of 3 million particles in figure 6, and of 300 million and 3 billion in figure 4, left and right respectively. We can notice that the reference case is not always the serial version. This craftiness is necessary, as the GPU memory is limited to 16 GB, whereas the size of the arrays allocated during execution can be larger.

Figure 6: DL Meso DPD code scaling on hybrid CPU-GPU nodes for 3 million particles.

The obtained speedups show that the code scales well for large numbers of beads. The 3 billion case scales almost ideally up to 4000 nodes, meaning this kind of big system simulation can be accelerated efficiently. The result for the 300 million case shows good scaling too, providing also an indication of the match between job size and number of parallel tasks. Instead, when using only 3 million particles, the code does not scale well. In conclusion, we have acknowledged that GPU accelerators are a great opportunity for high performance computing, as they can outclass the GPU-less applications. Still, even on the hybrid CPU-GPU nodes the code can scale non-ideally. This is not straightforward: the GPUs are voracious in terms of total number of particles, as they require large amounts, but grant brilliant speedup. Eventually, the larger the number of particles stocked, the larger the number of nodes whose usage leads to good scaling: this is really meaningful, as it confirms the possibility of efficiently accelerating large scale simulations based on the Dissipative Particle Dynamics method. Moreover, the scaling possibilities are limited by the minimum number of nodes required to run the first experiment. Clearly, this limitation is set by the GPUs, as they possess 16 GB of memory, pretty low when compared to the CPU memory, which counts up to 64 GB per node on Piz Daint.

The results we obtained during these two months suggest the possibility of efficiently running large system size DPD simulations. This outcome can eventually benefit both the industrial world, as more realistic experiments could be performed on computers, reducing costs and environmental footprint, and the research world, opening new perspectives of study. Future developments could try performing simulations for even larger numbers of particles. Having already tried running simulations for 6 billion particles, the communication between processes can become critical and lead to failure, therefore this should be kept into account. Moreover, some other features could be scaled, like the Fast Fourier Transform algorithm, which can be tricky on large hybrid architectures.

I would like to specially thank CSCS for the preparatory access (request class 2010PA5022) that allowed me to use Piz Daint, and for the support during the whole summer.

References
1. M. A. Seaton, R. L. Anderson, S. Metz and W. Smith (2013). DL-MESO: highly scalable mesoscale simulations, Mol. Sim. 39 (10) 796-821.

PRACE SoHPC Project Title: Scaling the Dissipative Particle Dynamic (DPD) code, DL MESO, on large multi-GPGPUs architectures
PRACE SoHPC Site: STFC Hartree, United Kingdom
PRACE SoHPC Contact: Davide Di Giusto, University of Udine, Italy
PRACE SoHPC Mentor: Jony Castagna, STFC Hartree, U.K.
PRACE SoHPC Software applied: Eclipse® (IDE), Nvidia Nsight®
PRACE SoHPC More Information: Nvidia developer blog
PRACE SoHPC Acknowledgement: I would like to thank Leon Kos for the guidance that he has granted all summer long; moreover, I am really thankful to all the people at PRACE for the organisation of the SoHPC. Then, I have nothing but gratitude for my mentor Jony Castagna, who is somebody to look up to and managed to teach me so much in little time. Finally, I would like to thank Nia for the organisation, Vassil Alexandrov and Luke Mason and all the people at the Hartree centre for the continuous support and kindness.
PRACE SoHPC Project ID: 1913
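As a quick illustration (script and function names are mine), the speedup and parallel efficiency implied by the timings in Table 1 can be checked in a few lines:

```python
# Timings from Table 1 (seconds): node counts and elapsed times for the
# CPU-only and the hybrid CPU+GPU runs.
nodes = [1, 2, 16, 32]
cpu = [3401.98, 2264.75, 334.06, 167.28]
gpu = [195.82, 147.13, 75.22, 36.27]

def speedup(times):
    """Speedup of each run relative to the single-node reference."""
    return [times[0] / t for t in times]

def efficiency(times, procs):
    """Achieved fraction of the ideal (linear) speedup."""
    return [s / p for s, p in zip(speedup(times), procs)]

# On a single node, the GPU run is roughly 17x faster than the CPU run.
print(cpu[0] / gpu[0])
```

The drop of efficiency below 1 as the node count grows is the communication effect discussed in the text.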

Investigating possibilities to accelerate applications deployed on the edge that utilize Deep Neural Networks. Switching up Neural Networks

Igor Kunjavskij

Using the Intel Movidius Neural Compute Stick, I researched the possibilities of switching models during runtime.

Smart devices permeate the fabric of our daily lives, be it the phone in everybody's pocket or the smart watches some of us wear. But not only our daily lives are filled with these devices; they also start to appear in areas which used to be inaccessible to humans. This could be a device attached to satellites in space running earth observation missions, or underwater robots conducting maintenance on critical structures such as underwater pipelines. In these areas, small devices tend to take on more and more compute intensive tasks, such as processing the images that they capture. But computer vision and image processing encompass in general very compute intensive tasks, even more so if Deep Neural Networks are involved. Lightweight devices that do not provide too much compute power, as for example the Raspberry Pi, can quickly become overwhelmed by these tasks. A solution to this problem would be to just send the data that a device collects to a data center. But given the fact that these devices are deployed in regions with restricted connectivity, sending data back and forth for processing purposes is not a viable option. That's where the Intel Movidius Neural Compute Stick comes into play, an accelerator in USB format that alleviates a lot of those problems, making it possible for images to be processed on the edge, on the device itself.

Figure 1: The Intel Movidius Neural Compute Stick.

Intel Movidius Neural Compute Stick

The Intel Movidius Neural Compute Stick was developed specifically for computer vision applications on the edge, with dedicated visual processing units at its disposal. Once a Deep Neural Network has been trained on powerful hardware for hours or sometimes days, it needs to be deployed somewhere in a production environment to fulfill its purpose. This can be, for example, in computer vision applications on the edge, which can encompass detecting pedestrians or other objects in images, or classifying objects in detected images, such as different road signs. The deployment of such an application can happen on a server in a data center, an on-premise computer or an edge device. Now the crux with the latter is that applications such as computer vision tend to place quite a high computational load on hardware, whereas edge devices tend to be lightweight and battery powered platforms. It is of course possible to circumvent this problem by streaming the data that an edge device captures to a data center and processing it there. But the next issue, which the term "edge device" already implies, is that these computing platforms are situated in quite inaccessible regions, like on an oil rig in the ocean or attached to a satellite in outer space. Transmitting data is in these cases costly and comes with a lot of latency, rendering real time decision making nearly impossible; especially if this data encompasses images, which as a rule of thumb tend to be large. That's where hardware accelerators like the Neural Compute Stick bring in a lot of value. Instead of sending images or video to some server for further analysis, the processing of data can happen on the device itself, and only the result, i.e. pure text, is sent for storage or statistical and visualization purposes

to some remote location. Such an approach brings numerous benefits, but most importantly alleviates latency and communication bandwidth related concerns. Now what is this Neural Compute Stick all about? The Intel Movidius Neural Compute Stick is a tiny fanless device that can be used to deploy Artificial Neural Networks on the edge. It is powered by an Intel Movidius Vision Processing Unit, which comes equipped with 12 so called "SHAVE" (Streaming Hybrid Architecture Vector Engine) processors. These can be thought of as a set of scalable independent accelerators, each with their own local memory, which allows for a high level of parallel processing of input data.

OpenVino

Complementing this stick is the so called Open Visual Inferencing and Neural Network Optimization (OpenVino) toolkit, a software development kit which enables the development and deployment of applications on the Neural Compute Stick. This toolkit basically abstracts away the hardware that an application will run on and acts as a "mediator". To make use of the Neural Compute Stick, it is necessary to first install this toolkit, which I briefly introduced in my last post. It aims at speeding up the deployment of neural networks for visual applications across different platforms, using a uniform API.

Training a Deep Neural Network is like taking a "black box" with a lot of switches and then tuning those for as long as it takes for this "black box" to produce acceptable answers. During this process, such a network is fed with millions of data points, and using these data points the switches of the network are adjusted systematically, so that it gets as close as possible to the answers we expected. This process is computationally very expensive, as data has to be passed through the network millions of times. For such tasks GPUs perform very well and are the de facto standard when it comes to hardware used to train large neural networks, especially for tasks such as computer vision. As for the frameworks used to train Deep Neural Networks, there are a lot that can be utilized, like Tensorflow, PyTorch or MxNet. All of these frameworks yield a trained network in their own inherent file format. After the training phase is completed, a network is ready to be used on yet unseen data to provide answers for the task it was trained for. Using a network to provide answers in a production environment is referred to as inference. In its simplest form, an application will just feed a network with data and wait for it to output the results. However, while doing so there are many steps that can be optimized, which is what so called inference engines do.

Now where does the OpenVino toolkit fit in concerning both tasks of training and deploying a model? The issue is that using algorithms like Deep Learning is not only computationally expensive during the training phase, but also upon deployment of a trained model in a production environment. Eventually, the hardware on which an application utilizing Deep Learning runs is of crucial importance. Neural Networks, especially those used for visual applications, are usually trained on GPUs. However, using a GPU for inference in the field is very expensive and doesn't pay off in combination with an inexpensive edge device. It is definitely not a viable option to use a GPU that might cost a few hundred dollars to build a surveillance camera or a drone. When it comes to using Deep Neural Networks in the field, most of them are actually running on CPUs. Now there are a lot of platforms out there using different CPU architectures, which of course adds to the complexity of developing an application that runs on a variety of these platforms. That's where OpenVino comes into play: it solves this problem by providing a unified framework for development which abstracts away all of this complexity. All in all, OpenVino enables applications utilizing Neural Networks to run inference on a heterogeneous set of processor architectures.

The OpenVino toolkit can be broken down into two major components, the "Model Optimizer" and the "Inference Engine". The former takes care of the transformation step that produces an optimized Intermediate Representation of a model, which is hardware agnostic and usable by the Inference Engine. This implies that the transformation step is independent of the future hardware that the model will run on; it solely depends on the model to be transformed. Many pre-trained models contain layers that are important for the training process, such as dropout layers. These layers are useless during inference and might increase the inference time. In most cases, these layers can be automatically removed from the resulting Intermediate Representation. Even more so, if a group of layers can be represented as one mathematical operation, and thus as a single layer, the Model Optimizer recognizes such patterns and replaces these layers with just one. The result is an Intermediate Representation that has fewer layers than the original model, which decreases the inference time. This Intermediate Representation comes in the form of an XML file containing the model architecture and a binary file containing the model's weights and biases.

After using the Model Optimizer to create an Intermediate Representation, the next step is to use this representation in combination with the Inference Engine to produce results. The Inference Engine is, broadly speaking, a set of utilities that allow running Deep Neural Networks on different processor architectures. This way a developer is capable of deploying an application on a whole host of different platforms, while using a uniform API. This is made possible by so called "plugins". These are software components that contain complete implementations of the inference engine for a particular Intel device, be it a CPU, an FPGA or a VPU, as in the case of the Neural Compute Stick. Eventually, these plugins take care of translating calls to the uniform, platform independent API of OpenVino into hardware specific instructions. This API encompasses capabilities to read in the network's Intermediate Representation, manipulate network information and, most importantly, pass inputs to the network once it is loaded onto the target device and get outputs back again. A common workflow to use the inference engine includes the following steps:

1. Read the Intermediate Representation – read an Intermediate Representation file into the application, which then holds the network in the host memory. This host can be a Raspberry Pi or any other edge or computing device running an application utilizing the inference engine from OpenVino.
2. Prepare the input and output formats – after loading the network, specify the input and output precision and the layout of the network.
3. Create an Inference Engine Core object – this object allows an application to work with different devices and manages the plugins needed to communicate with the target device.
4. Compile and load the network to the device – in this step the network is compiled and loaded to the target device.
5. Set the input data – with the network loaded onto the device, it is now ready to run inference. The application running on the host can send an "infer request" to the network, in which it signals the memory locations for the inputs and outputs.
6. Execute – with the input and output memory locations now defined, there are two execution modes to choose from:
   (a) Synchronously, to block until an inference request is completed.
   (b) Asynchronously, to check the status of the inference request while continuing with other computations.
7. Get the output – after the inference is completed, get the output memory.

Another problem that remains, though, even with the Neural Compute Stick accelerating computations, is that once a model is deployed on an edge device, it is hard to repurpose the device to run a different kind of model. This is where the "dynamic" part of the title of my project comes into play. The Neural Compute Stick is a highly parallelized piece of hardware, with twelve processing units independently running computations. This way, it is possible to have not only one, but several models loaded into its memory. Even more so, it is possible to switch these models and load new models into memory. This allows adapting to new situations in the field, like when the feature space changes or the things that are supposed to be detected change. The simplest case of such an occurrence might be the sun setting or bad weather conditions coming up. Another motivation to switch models might also be to save power, as edge devices tend to have limited capacities when it comes to energy sources. Instead of deploying one big model that is supposed to cover all cases that could occur in a production environment, it would be possible to have many small models that could be loaded in and out of memory at runtime.

Model switching prototype

In my project I investigated the feasibility of doing so and implemented a small prototype that switches models at runtime. This prototype switches between two models, detecting human bodies and faces. These models are both so called Single Shot Detector (SSD) MobileNets, networks that are better suited to be deployed on smaller devices and that localize and classify an object in a single pass through the network, drawing bounding boxes around it. I used OpenCV for this task, which is a library featuring all sorts of algorithms for image processing. Next to OpenCV I had OpenVino running as a backend, the toolkit accompanying the Neural Compute Stick, which allows utilizing it in an optimized manner. I eventually tested this model switching prototype by loading and offloading the models in and out of the memory of the Neural Compute Stick. I did this with a very high frequency of one switch per frame, to determine what the latency of such a model switch would be in a worst-case scenario. The switching process includes reading the input and output dimensions of a model from the XML representation of its architecture and then loading it into the memory of the Neural Compute Stick. On average this switch caused an extra overhead of about 14 percent of the overall runtime. To put this into absolute numbers: on average it took my application half a second to capture and generate an output for an image, whereas a model switch in between would add a little less than a tenth of a second to this time.

Of course, there is a lot of room for improvement given these numbers. One such improvement concerns the parsing of the model dimensions. I used a simple XML parser and had to read the input and output dimensions of a model on every switch. Doing this once, for all models that will potentially be used on the Neural Compute Stick, when the application starts running, and saving the dimensions into a lookup table, could cut the switch time almost in half. A further speedup of the switch could be achieved by conducting it asynchronously: while a model is loaded onto the Neural Compute Stick, the next frame can already be captured, instead of waiting for the switching process to finish. All in all, I found that although in its current state this prototype would not yet be applicable to real time applications, given the potential for improvement it could get there. And if no hard real time conditions are imposed, as is the case for many applications, it is deployable already.

Figure 2: Example of using an SSD MobileNet in combination with the Neural Compute Stick.

PRACE SoHPC Project Title: Dynamic Deep Learning Inference on the Edge
PRACE SoHPC Site: Irish Centre for High-End Computing, Ireland
PRACE SoHPC Authors: Igor Kunjavskij, University of Stuttgart, Germany
PRACE SoHPC Mentor: Venkatesh Kannan, ICHEC, Ireland
PRACE SoHPC Contact: Igor Kunjavskij, University of Stuttgart, E-mail: [email protected]
PRACE SoHPC Software applied: Intel OpenVino
PRACE SoHPC More Information: https://software.intel.com/en-us/openvino-toolkit
PRACE SoHPC Acknowledgement: I would like to thank my supervisors for their amazing support throughout this whole project, and in general the staff at ICHEC for welcoming me and making this stay such a great experience!
PRACE SoHPC Project ID: 1914
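The logic of the switching prototype described above can be sketched with stand-ins for the OpenVino calls. The IR snippets, tag layout and function names below are invented for illustration only (a real Intermediate Representation is far richer, and loading would target the Neural Compute Stick rather than a Python dict):

```python
import time
import xml.etree.ElementTree as ET

# Minimal stand-ins for two Intermediate Representations: only the input
# dimensions matter for this sketch.
FACE_IR = """<net name="face-ssd">
  <input><dim>1</dim><dim>3</dim><dim>300</dim><dim>300</dim></input>
</net>"""
BODY_IR = """<net name="body-ssd">
  <input><dim>1</dim><dim>3</dim><dim>300</dim><dim>300</dim></input>
</net>"""

def input_dims(ir_xml):
    """Parse the input dimensions from the (simplified) IR XML."""
    root = ET.fromstring(ir_xml)
    return [int(d.text) for d in root.find("input")]

def load_model(ir_xml):
    """Stub for parsing dimensions and loading a network onto the device."""
    return {"name": ET.fromstring(ir_xml).get("name"),
            "dims": input_dims(ir_xml)}

def run(frames):
    """Worst case measured in the text: switch models on every frame,
    and account for the time spent switching versus inferring."""
    switch_time = infer_time = 0.0
    for i, frame in enumerate(frames):
        t0 = time.perf_counter()
        model = load_model(FACE_IR if i % 2 == 0 else BODY_IR)
        t1 = time.perf_counter()
        _ = (model, frame)  # stub inference: a real app would infer here
        switch_time += t1 - t0
        infer_time += time.perf_counter() - t1
    return switch_time, infer_time
```

Caching the result of `input_dims` in a lookup table keyed by model name, instead of re-parsing on every switch, is exactly the improvement estimated above to cut the switch time almost in half.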

Distributed Memory Radix sort (DiMeR) Fastest sorting algorithm

Jordy Ajanohoun

With today's large-scale data and large-scale problems, parallel computing is a powerful ally and important tool. Parallel sorting is one of the important components of parallel computing. The parallel Radix sort allows us to reduce the sorting time and to sort more data, amounts that can't be sorted serially. Indeed, when we want to sort a huge amount of data, it may not fit on one single computer, because computers are memory bounded. Or it may take too much time using only one computer, time we can't afford. Thus, for memory and time reasons, we can use distributed systems, with our data distributed across several computers, and sort them that way. But how?

Introduction

To make our applications and programs run faster, it is im- Indeed, it is become rare to work with data without hav- portant to know where in our programs computers spend ing to sort them in any way. On top of that, we are now in the most of the execution time. Then we have to understand era of Big Data which means we collect and deal with more why and finally, we figure out a solution to improve that. and more data from daily life. More and more data to sort. This is how HPC works. Plus, sorting algorithms are useful to plenty more complex Donald E. Knuth wrote in his book Sorting and Searching: algorithms like searching algorithms. Also, on websites or “Computer manufacturers of the 1960’s estimated that more software, we always sort products or data by either date or than 25 percent of the running time of their computers was price or weight or whatever. It is a super frequent operation. spent on sorting, when all their customers were taken into ac- The question is how can we improve that? What kind of count. In fact, there were many installations in which the task improvement can be done regarding sorting algorithms to go of sorting was responsible for more than half of the computing always faster? This is where the Radix sort and my project time”. come into play. In 2011, John D. Cook, PhD added: “Computing has changed since the 1960’s, but not so much that sorting has gone from It has been proved that for a comparison based sort, (where being extraordinarily important to unimportant.” nothing is assumed about the data to sort except that they

can be compared two by two), the complexity lower bound is O(N log N). This means that you cannot write a sorting algorithm which both compares the data to sort them and has a complexity better than O(N log N).

The Radix sort is a non-comparison-based sorting algorithm that can run in O(N). Unfortunately, it is not used very often, even though, when well implemented, it is the fastest sorting algorithm for long lists. When sorting short lists, almost any algorithm is sufficient, but as soon as there is enough data to sort, we should choose carefully which one to use: the potential time gain is not negligible. It is true that "enough" is quite vague, but roughly, a length of 10,000 elements is already enough to feel the difference between an appropriate and an inappropriate sorting algorithm.

Currently, Quicksort (a comparison-based sort) is probably the most popular sorting algorithm. It is known to be fast enough in most cases, although its complexity is O(N log N). The Radix sort has a complexity of O(N·d). Thus, Radix sort is faster as soon as the number of key digits (the parameter d) is smaller than log(N). This means that the larger our list is, the more we should use the Radix sort. To be more precise, from a certain length onwards, the Radix sort will be more efficient than any comparison-based sort.

My project was to implement a reusable C++ library with a clean interface that uses Radix sort to sort a distributed array using MPI.

Since each base-256 digit of a key corresponds to one byte, the parameter d is fixed and known. For instance, we can write a Radix sort function which sorts int16_t (integers stored on two bytes), knowing in advance (while writing the sort function) that all the numbers will be composed of two base-256 digits. Moreover, with template-enabled programming languages like C++, it is straightforward to make it work with all the other integer sizes (int8_t, int32_t and int64_t) without duplicating the function for each of them. From now on, we assume that we use base 256.

Why use Counting or Bucket sort? Because we take advantage of knowing all the values a byte can take, and that range is small: the value of one byte is between 0 and 255. This helps us because, in such conditions, Counting sort and Bucket sort are simple and fast. They can run in O(N) when the length of the list is greater than the maximum of the list (in absolute value), and especially when this maximum is known in advance. When it is not known in advance, it is trickier to make the Bucket sort run in O(N) than the Counting sort. They can run in O(N) because they are not comparison-based sorting algorithms, which is also why the Radix sort can. In the Radix sort, we always sort according to one byte, so the maximum we can have is 255. If the length of the list is greater than 255, which is a very small length for an array, the Counting and Bucket sorts inside the Radix sort can easily be written with O(N) complexity.
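As an illustrative sketch (in Python for brevity; the project's library is written in C++, and the function name here is mine, not the project's), a Counting sort keyed on a single byte runs in O(N + 256), i.e. O(N), no matter how large the key values themselves are:

```python
def counting_sort_by_byte(a, i):
    """Stable O(N) Counting sort of non-negative integer keys by their
    i-th base-256 digit: the possible digit values are only 0..255."""
    count = [0] * 256
    for x in a:
        count[(x >> (8 * i)) & 0xFF] += 1
    # prefix sums give each digit value its starting offset in the output
    offsets = [0] * 256
    for d in range(1, 256):
        offsets[d] = offsets[d - 1] + count[d - 1]
    out = [0] * len(a)
    for x in a:  # traversal order is preserved within a digit => stable
        d = (x >> (8 * i)) & 0xFF
        out[offsets[d]] = x
        offsets[d] += 1
    return out

counting_sort_by_byte([258, 1, 513, 2, 256], 0)  # -> [256, 1, 513, 258, 2]
```

Note that only the low byte is compared in the usage line above (258 and 513 keep their relative order because the sort is stable), which is exactly the property the Radix sort relies on.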

Why not use only either Counting or Bucket sort to sort the data all at once? Because we would no longer have our assumption about the maximum, as we would no longer be sorting according to one byte. The maximum of the list cannot be known in advance and we have no upper bound. In such conditions, the complexity of Counting sort is O(N + k), and Bucket sort can be worse depending on the implementation, where k is the maximum of the list (in absolute value). In contrast, with the Radix sort we have O(N·d), which is equivalent to O(N) because d cannot be greater than 8 (the biggest numbers are usually stored on 8 bytes in computers). In other words, since the Radix sort iterates through the bytes and always sorts according to one byte, it is insensitive to the parameter k, because we only care about the maximum value of one byte. This is unlike both the Counting and Bucket sorts, whose execution times are highly sensitive to the value of k, a parameter we rarely know in advance. That is dangerous, because the maximum value can be a big number. For instance, if we have only 1,000 numbers to sort and the maximum is 1,000,000, with Counting and Bucket sorts the sorting time is more or less the same as for sorting a list of 1,000,000 elements. This is not the case with the Radix sort: sorting a list of 1,000 elements with it will always take much less time than sorting 1,000,000 elements.

Serial Radix sort

Radix sort takes in a list of N elements whose sorting keys are expressed in a base b (the radix), such that each key has d digits. For example, three digits are needed to represent decimal 255 in base 10; the same number needs two digits in base 16 (FF) and eight in base 2 (1111 1111). This is the algorithm:

Data: A (array to sort), N (the array size), b (the keys radix), d (the number of digits of the keys)
Result: A sorted according to the data keys
begin
    for each key digit i, where i varies from the Least Significant Digit (LSD) to the Most Significant Digit (MSD) do
        sort A according to the i'th key digit using Counting sort or Bucket sort;
    end
end
Algorithm 1: Serial Radix sort

We first sort the elements based on the last key digit (the Least Significant Digit) using Counting sort or Bucket sort. The result is then sorted again by the second digit (from right to left), and this process continues for all key digits until we reach the Most Significant Digit, the last one on the left.
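Algorithm 1 itself fits in a few lines. Below is a sketch in Python (the actual project code is C++, so the names here are illustrative), using a stable Bucket sort per base-256 digit:

```python
def radix_sort(a, num_bytes=4):
    """LSD Radix sort of non-negative integers (Algorithm 1 with b = 256).
    Each pass is a stable Bucket sort on one base-256 digit (one byte)."""
    for i in range(num_bytes):               # LSD to MSD
        buckets = [[] for _ in range(256)]
        for x in a:                          # appending preserves order: stable
            buckets[(x >> (8 * i)) & 0xFF].append(x)
        a = [x for bucket in buckets for x in bucket]
    return a

radix_sort([10, 149, 2, 46, 135, 25])  # -> [2, 10, 25, 46, 135, 149]
```

With num_bytes=4 this covers 32-bit unsigned keys; the complexity is O(N·d) with d = 4 passes, regardless of how large the key values are.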

What if we want to sort a list of numbers having different numbers of digits in base 10, like (10, 149, 2, 46, 135, 25)? In practice, to avoid this problem, we often use base 256 to take advantage of the bitwise representation of the numbers in computers. Indeed, a digit in base 256 corresponds to a byte. Integers and reals are stored on a fixed and known number of bytes, and we can access each of them.

Problem the parallel Radix Sort can solve

The following is the problem, more precisely defined.

What we have:

• P processors able to communicate between themselves, store and process our data. Each processor has a unique rank between 0 and P-1.

• N data, distributed equally across our processors. This means that each processor has the same amount of data.

• The data are too huge to be stored or processed by only one processor (for memory or time reasons).

What we want

• Sort the data across the processors, in a distributed way. After the sort, processor 0 should have the lowest part of the sorted data, processor 1 the next, and so on.

Figure 1: Unsorted versus sorted distributed array across three processors. Our goal with the parallel Radix sort is to get the sorted one from the unsorted.

• We want to be as fast as possible and consume as little memory as possible on each processor.
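The goal above can be phrased as a predicate on the per-processor portions. The following Python sketch (illustrative names only, not the project's API) checks the property we want to hold after the sort:

```python
def is_sorted_distributed(chunks):
    """True if the distributed array is sorted across processors:
    every portion is sorted, all portions have the same size, and
    every element on rank r is <= every element on rank r+1."""
    if len({len(c) for c in chunks}) != 1:
        return False                        # same amount of data everywhere
    for c in chunks:
        if any(a > b for a, b in zip(c, c[1:])):
            return False                    # locally sorted
    for left, right in zip(chunks, chunks[1:]):
        if left and right and left[-1] > right[0]:
            return False                    # rank r holds the lower part
    return True
```

For example, is_sorted_distributed([[1, 2], [3, 4]]) holds, while [[1, 3], [2, 4]] fails the cross-processor condition even though each portion is sorted.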

Parallel Radix sort

The idea of the parallel Radix sort is, for each key digit, first to sort the data locally on each processor according to that digit; all the processors do this concurrently. Then, to compute which processors have to send which portions of their local data to which other processors, in order to have the distributed list sorted across processors according to that digit.

Figure 2: First iteration of the parallel Radix Sort on the distributed array across three processors. The numbers are sorted according to their base 10 LSD (Least Significant Digit). One iteration according to their base 10 MSD remains to complete the algorithm and get the desired final distributed sorted array.

Below is the algorithm.

Data: rank (rank of the processor), L (portion of the distributed array held by the processor), N (the distributed array size), P (number of processors), b (the keys radix), d (the number of digits of the keys)
Result: the distributed array sorted across the processors, with the same amount of data on each processor
begin
    for each key digit i, where i varies from the LSD to the MSD do
        sort L according to the i'th key digit using Counting sort or Bucket sort;

share information with other processors to figure out which local data to send where and what to receive from which processors in order to have the distributed array sorted across processors according to the i’th keys digit;

        proceed to these exchanges of data between processors;
    end
end
Algorithm 2: Parallel Radix sort

Each processor runs the algorithm with its rank and its portion of the distributed data. Let's go through an example to understand it better. We will run the example in base 10 for simplicity, but don't forget that we can use any number base we want; in practice, base 256 is used, as explained.

Figure 3: Last iteration of the parallel Radix Sort on the distributed array across three processors. The numbers are sorted according to their base 10 MSD (Most Significant Digit). At the end, the algorithm is completed and the distributed array is sorted.
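To make the iterations concrete, here is a plain-Python simulation of one pass of Algorithm 2. There is no MPI here (a list of lists stands in for the distributed array, and sorted() with a key stands in for the byte-wise Counting sort); all names are illustrative, not the project's:

```python
def digit(x, i):
    """i-th base-256 digit (byte) of a non-negative integer key."""
    return (x >> (8 * i)) & 0xFF

def parallel_radix_pass(chunks, i):
    """One iteration of Algorithm 2, simulated serially.
    chunks: equally sized per-processor portions of the distributed array."""
    n = len(chunks[0])
    # 1) each "processor" sorts its portion by the i-th digit (stable)
    local = [sorted(c, key=lambda x: digit(x, i)) for c in chunks]
    # 2) exchange step: elements are globally ordered by digit value,
    #    ties broken by processor rank, then local position (stability)
    merged = []
    for d in range(256):
        for c in local:
            merged.extend(x for x in c if digit(x, i) == d)
    # 3) re-split equally, so every processor keeps the same amount of data
    return [merged[r * n:(r + 1) * n] for r in range(len(chunks))]

def parallel_radix_sort(chunks, num_bytes=4):
    for i in range(num_bytes):              # LSD to MSD
        chunks = parallel_radix_pass(chunks, i)
    return chunks
```

Because each pass is stable, after the last (most significant) byte the concatenation of the portions, rank by rank, is the fully sorted array; in the real implementation step 2 is where the MPI communication happens.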

My implementation and performance results

I have used MPI for the communications between processors. As hardware, I have used the ICHEC Kay cluster to run all the benchmarks. The framework used to get the execution times is Google Benchmark. The numbers to sort are generated with std::mt19937, a Mersenne Twister pseudo-random generator of 32-bit numbers with a state size of 19937 bits. For each execution time measurement, I use ten different seeds (1, 2, 13, 38993, 83030, 90, 25, 28, 10 and 73) to generate ten different arrays (of the same length) to sort. The mean of the execution times is then registered as the execution time for that length. These ten seeds were chosen randomly. I proceed like this because the execution time also depends on the data, so it gives us a more accurate estimation.

With the log-log scale we can see better what is happening. We can tell that using dimer::sort we go faster than anything else when the array is big enough, and this is what we expect when we use HPC tools. For small array sizes, we are not really expecting an improvement, because they can easily be managed serially, and quickly enough. Moreover, we add extra steps to treat the problem in a parallel way; therefore, when the problem size is small, it is faster serially, because we do not have these extra steps. It becomes faster in parallel when these extra steps are a small workload compared to the time needed to sort serially.

Future work

I have presented here the LSD Radix sort, but there is a variant called MSD Radix sort which can be parallelized too. The difference is:

• With the LSD, we sort the elements based on the keys' LSD first, and then continue to the left until we reach the MSD
• With the MSD, we sort the elements based on the keys' MSD first, and then continue to the right until we reach the LSD

For more information, feel free to read my blog posts on the PRACE Summer of HPC website. My project is described in more detail there, and you can find everything you need to reproduce the work or understand it more deeply.

Acknowledgement

Figure 4: Sorting time as a function of the number of items to sort. The items here are 4-byte integers. The dimer::sort curve shows the execution times of my parallel Radix sort implementation run with 2 processors.

I would like to express my gratitude to the whole ICHEC staff, with special thanks to Dr. Paddy Ó Conbhuí and Dr. Simon Wong for supervising the project. Special thanks go to PRACE for giving me a place on this Summer of HPC 2019. Many thanks to Dr. Leon Kos and Dr. Massimiliano Guarrasi for all their work with the training week and the programme in general.

References

1 A. Sohn and Y. Kodama, "Load Balanced Parallel Radix Sort," in Proc. 12th ACM International Conference on Supercomputing, Melbourne, Australia, July 14-17, 1998.

We can notice that my implementation seems good for big sizes. It seems to be faster than boost::spreadsort because, when the array is not distributed, I am using ska_sort_copy instead of boost::spreadsort. ska_sort_copy is a great and very fast implementation of the serial Radix sort; actually, it is the fastest I have found.

Figure 5: Same as Figure 4 but in log-log scale.

PRACE SoHPC Project Title: Distributed Memory Radix Sort
PRACE SoHPC Site: ICHEC (Irish Centre for High-End Computing), Dublin, Ireland
PRACE SoHPC Authors: Jordy Ajanohoun, Sorbonne University, Paris, France
PRACE SoHPC Mentor: Paddy Ó Conbhuí, ICHEC, Dublin, Ireland
PRACE SoHPC Contact: Jordy Ajanohoun, Sorbonne University, Phone: +33 6 95 39 87 90, E-mail: [email protected]
PRACE SoHPC Software applied: C++, MPI, Catch2 (C++ test framework), Google Benchmark, Slurm
PRACE SoHPC More Information: https://summerofhpc.prace-ri.eu/author/jordya/
PRACE SoHPC Project ID: 1915

Study on how HPC systems can enhance and interact with CFD at all stages: pre-processing, resolution and visualization

CFD: Colorful Fluid Dynamics? No with HPC!

David Izquierdo Susín

It is typical to obtain some nice pictures of the flow around an object after performing a CFD (Computational Fluid Dynamics) simulation. However, how much truth is there behind those colorful images? The project presented here analyses how HPC systems can be used to make CFD a more trustworthy tool, and shows how some parameters can be optimized for a specific application: a formula car.

ABOUT FLUID DYNAMICS

The Navier-Stokes equations are the set of equations that define how a fluid moves, and they are based on principles such as the conservation of mass and Newton's Second Law. When one sets out to solve an aerodynamics problem, solving these equations will always be, somehow or other, unavoidable. The only problem is that these equations are actually impossible to solve analytically in the vast majority of practical cases, and the accuracy of the results heavily depends on the numerical method used.

Aerodynamics is about computing the forces that the air causes on a given object (in this case, a Formula Student car). To get the forces, one must first compute the pressure at many points over the surface of the object, and the pressure cannot be obtained without computing the velocity in the whole domain; the problem is coupled. When solving the motion of a fluid around an object, the objective is, therefore, to know the velocity and the pressure of the fluid at many different points. In order to obtain a result similar to that shown below in (5), where the scale from blue to red shows the values of the velocity on the symmetry plane of the car, the variables must be solved at a very large number of points, in order to then be able to paint each of these points with a different colour, corresponding to the value of the velocity at that point.

For a case like the Formula Student car, around 100 000 000 of these points are required to get an accurate enough solution. If it is estimated that the computer needs more than 100 operations to solve a single point/cell (the equations are complex), and it is assumed that we want the value of the velocity at 1 000 time instants (to observe how the flow evolves as time passes by), then the total number of computer operations may be of the order of 10 000 000 000 000, or, what is the same, ten of the former UK billions.

All this constitutes just a rough order-of-magnitude approximation to give an idea of the size of the problem that is being faced and its motivation.
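The estimate above is plain arithmetic; assuming exactly the figures quoted in the text:

```python
cells = 100_000_000      # grid points/cells needed for the Formula Student car
ops_per_cell = 100       # operations to solve a single point/cell (at least)
time_instants = 1_000    # velocity snapshots in time

total_ops = cells * ops_per_cell * time_instants
# total_ops == 10 000 000 000 000, i.e. 10**13: "ten of the former UK billions"
```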

WHAT IS THIS TRULY ABOUT?

The engineering problem to be dealt with in this document is the aerodynamic analysis of the TU Ostrava Formula Student car. The main objective is to build from scratch a robust simulation setup with which to obtain results for the aerodynamic forces on the Formula Student car in an automated manner. Apart from that, it is the aim of this project to better understand the capabilities of OpenFOAM when dealing with external aerodynamics cases.

METHODS (HANDS-ON)

The approach used to solve the tasks mentioned in the previous section consists of using the Linux operating system to install the corresponding software and edit the files in a more efficient manner than what could be achieved in a Windows operating system, by means of vim and other tools like sed.

The methodology followed in this project is defined in several steps that can be easily distinguished from one another:

• Pre-processing of the geometry
• Meshing setup
• Simulation setup
• Post-processing of the results

The first step consists of taking the initial car geometry designed by the TU Ostrava Formula Student team and transforming it into a surface mesh of points that can be read by the CFD software. Then, a proper mesh has to be generated, in this case by means of the snappyHexMesh utility. There are more than 50 parameters to be adjusted, and the suitability of the mesh heavily depends on the choices made on these parameters. Above at the left, in images (1) to (3) of Fig. 1, the different steps performed when creating a mesh can be observed: in (1), the mesh after the first step (castellation) is shown. It can be seen that this mesh is not adapted to the surface geometry of the car, since it is just made out of cubes. This is solved in the next step (2), in which the mesh is snapped to provide it with the actual shape of the geometry. Finally, in step (3), layers are added near the surface of the body. This type of cell is beneficial when trying to resolve the behaviour of the flow in the boundary layer.

Figure 1: Visual state of the geometry at some of the steps in the CFD workflow.

In the following step, it is necessary to define the physical and numerical parameters on which the simulation will be based. The choices made at this point will determine the degree of convergence of the simulations, together with the accuracy of the results. Again, the number of parameters is very large and it is significantly difficult to find the ideal configuration. The approach followed to obtain a stable setup, without the need to adjust all the parameters one by one, was to pick as a base a reference case provided by the software and based on a motorbike simulation. In this way, even if the parameters had to be adjusted (since the geometry proposed for the project was far more complex, and the requirements in terms of accuracy were also more restricting), the process was simplified to some extent.

Lastly, the huge bunch of numbers obtained as output from the simulations must be post-processed in order to gain some degree of comprehension from the simulations performed. This was done as an initial checking step in ParaView, an example of which can be seen in (5) above, but later the process was migrated to EnSight, where further and more involved post-processing was conducted, including the generation of isosurfaces and videos.

HPC IN OSTRAVA

Here is where HPC (High Performance Computing) systems come in handy. HPC systems are essentially computers that consist of many smaller computers that are able to work together to solve a specific task. When such a large number of operations is required to solve the problem of interest just once, using a personal computer is not an option: it would not only be too slow (weeks or even months would be needed to get the first results of a complex problem), but it would possibly be incapable of solving the problem, due to the large amount of data that must be handled, which in turn implies a very large requirement in terms of memory.

The IT4Innovations National Supercomputing Centre in Ostrava offers two of these very powerful systems that can be used to better solve these kinds of problems. In particular, the Anselm machine relies on more than 3 300 cores, while Salomon has up to 24 192 cores. To put it into context, a typical laptop can be based on 4 processors and 8 logical units, which is more than 3 000-fold less than Salomon.

Figure 2: Scalability Curves

Given the complexity of the setup of

the simulations in terms of the number of parameters to be defined, it is not possible to show or discuss here the full set of parameters. But, in order to ensure that the work is reproducible, some extra detail is given in this blog post and in the final presentation.

SCALABILITY AND MESH CONVERGENCE

The first outcome of this project is the scalability study. This study is aimed at determining how OpenFOAM performs depending on the number of cores used to run the cases. Results are shown in Fig. 2, where it can be observed that the solving algorithm (orange curve) scales better than the meshing algorithm (blue curve). In the case of the former, between 768 and 1536 cores the reduction in total computing time is still 31.0 %, which is far from the ideal 50 % but is still significantly large given the huge number of cores.

Two approaches were considered to run the simulations: a single-job approach, in which a single job would be launched for each simulation, and a two-jobs approach, which separates the execution of the meshing and solution algorithms into two different scripts and two different jobs.

Figure 3: Single-job Versus Two-jobs Approach

The results in Fig. 3 show that the two-jobs approach, finally adopted in the project, is 12 % faster, which is a significant reduction that does have an impact, since it allows a more optimal allocation of resources. All these computations were performed once mesh convergence had been proved. This is very important, since it ensures that the results of the aerodynamic study are approximately independent of the size of the mesh elements. It was achieved via a mesh convergence study, for which several meshes are first defined in Fig. 4.

Figure 4: Reference Meshes

For the reference meshes with the numbers of cells shown in Fig. 4, the mesh quality characteristics depicted in Fig. 5 have been found. In terms of the number of layers (referring again to step (3) in Fig. 1 for the layer definition), the target was set to three. One of the geometries in which the creation of layers is more complex, due to its shape, is the suspension of the car. Therefore, the number of layers is monitored in this part, and it can be checked that, for the finer meshes, the user requirement is almost met. In terms of illegal cells, that is, the cells that do not pass the user-selected quality check, the number was also significantly reduced with larger meshes (although note that the quality standards were set very strict to perform this evaluation; not all cells that did not pass are necessarily badly shaped).

Figure 5: Monitoring of Layers Creation and Quality Check on Meshes

The next figure, Fig. 6, shows the results of the mesh convergence study. Several force coefficients were monitored for the different mesh configurations with the already described characteristics. In particular, these coefficients are the total coefficient of lift (in absolute value; it is actually negative), the total coefficient of drag and the rear coefficient of lift of the car, and the coefficients of lift and drag of the front wing. From the observation that the values of the coefficients tend to become constant as the mesh gets finer, it may be concluded that the results are mesh independent for the roughly 140-million-cell mesh used for the scalability report.

However, Fig. 6 also shows that, even if the variations tail off, they do not totally disappear, and in fact for some of the coefficients (such as the total CL) these variations seem to slightly increase in the step from the fine to the very fine mesh with respect to the previous steps.

Figure 6: Mesh Convergence Study

All this may indicate that convergence has not been achieved to a very high degree, and this is expected to have something to do with the boundary layer resolution and the y* value (the non-dimensional height of the centroid of the cells touching the body surface, i.e. the size of the cells close to the surface), which should lie everywhere below six, but that is not true for the whole mesh. In fact, y* goes up to 17-20 in some regions, making thus necessary the use

of wall functions to cover the delicate buffer layer, losing some of the advantages of the k-omega SST turbulence model1 used in these simulations. The quality of some of the cells may also be a reason for this imperfection in the mesh convergence results.

In spite of that, the results are reasonable in overall terms and meet the accuracy requirements that had been proposed. This allows us to proceed to the next part, in which the actual post-processing results are discussed by means of images of the car.

POST-PROCESSING

The final study entails analysing the flow around the car, which is done by looking at the results in terms of velocity and pressure fields.

Figure 7: Static Pressure over the Surface of the Car

Fig. 7 illustrates the pressure field over the surface of the car. It is easy to observe that the most drag-generating parts (largest static pressure, in red) are those exposed in a more perpendicular and direct way to the flow, such as the tip of the nosecone, the driver, the front tyres and the steepest zones of the wings. The zones of smallest static pressure (negative with respect to the ambient pressure, i.e. suction) lie below the car, specifically below the front wing and the undertray, thanks to the suction capabilities of those devices, although these cannot be seen in this image. What can be seen is the reduction in pressure in the zones of high curvature, where corner flows emerge. Examples of this are the sides of the front wheels and the side of the driver's helmet. The triangular blue zone on the rear wing end-plates is also relevant, since it shows how the wing-tip vortex that arises at the edges of every wing develops, growing towards the rear end of the wing.

Finally, in Fig. 8 the wake of the car can be observed from the bottom. It is constructed by plotting velocity isosurfaces for a range of velocity values going from 2 m/s up to 15 m/s (with the free stream always at 20 m/s; therefore, all the values plotted are smaller).

Figure 8: Wake of the Car - Bottom View

A first remark to be made is that the flow is not symmetric, since the geometry is not symmetric. Even if the car may look as if it had symmetry at first sight, it turns out that the intake of the engine and the cooling system (radiator and fan) are highly non-symmetrical, which leads to a wake slightly deflected towards the left, from the driver's perspective. Apart from that, the vortical structures created around each of the four tyres are of extreme importance for the whole aerodynamic vehicle concept, due to their influence on all the other aerodynamic devices of the car, and therefore this kind of representation is essential to understand and manage them properly.

DISCUSSION

Overall, the objectives of the project were met and key conclusions were drawn: (1) the OpenFOAM solver simpleFoam scales well, while the meshing utility is the bottleneck; (2) the two-jobs approach is better than the single-job one; and (3) using HPC systems helps to reach more meaningful and trustworthy results with CFD for the same amount of time available. The implication of these results is that better and more extensive knowledge about the use of OpenFOAM on HPC systems will be available for the development of similar projects.

References

1 F. R. Menter, M. Kuntz and R. Langtry (2003). Ten Years of Industrial Experience with the SST Turbulence Model.

PRACE SoHPC Project Title: Computational Fluid Dynamics Simulations of Formula Student Car Using HPC
PRACE SoHPC Site: IT4Innovations - National Supercomputing Centre, Ostrava, Czech Republic
PRACE SoHPC Authors: David Izquierdo Susín, UC3M, Madrid
PRACE SoHPC Mentor: Tomáš Brzobohatý, VŠB, Czech Republic
PRACE SoHPC Contact: David Izquierdo, UC3M, Phone: +34 658 43 09 55, E-mail: [email protected]
PRACE SoHPC Software applied: OpenFOAM, ParaView, EnSight
PRACE SoHPC More Information: www.openfoam.org
PRACE SoHPC Acknowledgement: I would like to thank all the people from PRACE for organizing the programme, as well as the Ostrava coordinators and my mentor, Tomáš Brzobohatý. And above all, I want to thank my family and all the people that helped me to get here.
PRACE SoHPC Project ID: 1916

Real time emotion recognition based on action units using deep neural networks (DNNs) and clustering algorithms. From High Performance Computing (HPC) to the edge.

Emotion Recognition using DNNs

Pablo Lluch Romero

Visual context, particularly facial expressions, plays a key role in understanding the intention of a message. However, visually impaired individuals lack this context. An application able to detect 7 basic emotions in real time with an accuracy of 74% is introduced to compensate for the lack of visual context in conversations.

In order to make a machine understand facial expressions, we need a way to encode an image (an array of pixels) such that its emotion is easily classifiable. Many different approaches can be used to make this encoding, but I will be using a facial action coding system that classifies sets of muscle movements into bundles called action units. Any facial expression can be encoded using this framework. I used 17 different action units, as shown in Figure 1, which are enough to encode the emotions I trained the application to predict: happiness, sadness, surprise, anger, disgust, fear and neutral.

I used a deep neural network to predict the action units from faces. A neural network is a computer algorithm1 that imitates biological neurons, with many interconnected layers, that is able to find underlying patterns in data. A nearest neighbour approach is used on the output of the trained network to classify new images as part of the final application.

Figure 1: Action units used
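As a sketch of that nearest-neighbour step (plain Python with made-up two-dimensional vectors; the real pipeline compares 17-dimensional action unit probability vectors, and all names here are illustrative, not the project's code):

```python
from math import sqrt
from collections import Counter

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def predict_emotion(au_probs, reference, k=10):
    """Nearest-neighbour emotion lookup: compare a new action unit
    probability vector against labelled reference vectors and return
    the most common emotion among the k most similar ones."""
    ranked = sorted(reference,
                    key=lambda item: cosine_similarity(au_probs, item[0]),
                    reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

# toy reference set of (action-unit vector, emotion) pairs
ref = [([1.0, 0.0], "happy"), ([0.9, 0.1], "happy"),
       ([0.0, 1.0], "sad"), ([0.1, 0.9], "sad")]
predict_emotion([0.8, 0.2], ref, k=2)  # -> "happy"
```

Cosine similarity compares only the ratios between components, not the overall vector length, which is the property discussed later in the article.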

Figure 2: Sample layers from superficial [top] down to deeper ones [bottom] from fine-tuned VGGFace

From HPC...

Data gathering and preprocessing

The data I needed to train the neural network came from four existing datasets that consisted of video footage of spontaneous emotions with the active action units annotated frame by frame. These were the Cohn-Kanade AU-Coded Facial Expression dataset,2,3 the MMI facial expression database collected by Valstar and Pantic,4,5 the Denver Intensity of Spontaneous Facial Action Database,6,7 and BP4D-Spontaneous.8,9 The choice of datasets was not trivial, as they had to be consistent in the way the action units were annotated.

Once I combined these four datasets into one with a total of 300K images, I went through an array of different processes to improve its quality. The first issue that I faced was class imbalance. The most popular action unit in the combined dataset (upper lip riser) was present in 80,000 images, while the least popular (lip presor) was in only 3,000. Using such unbalanced data to train a machine learning model is a very bad practice, as these models learn the statistical footprint of the classes. So, very popular classes would yield a lot of false positives and less popular ones a lot of false negatives. Secondly, contiguous frames were removed from the dataset to prevent the model from overfitting (memorising the individual patterns of a dataset without gaining the ability to generalise on unseen data). Finally, in order to increase the size of the dataset and add more variability to the classes, I augmented the data by applying random rotations, shears, zooms and brightness shifts to each image before feeding it into the network, as shown in Figure 3.

Figure 3: Data augmentations from a single image (c) Jeffrey Cohn

Training

I built the neural network using transfer learning. This is a common practice that involves reusing existing networks that were devised to solve problems in a similar domain. I took the convolutional layers from a network called VGGFace, which was originally developed for face detection. A convolution can be understood as a filter that is applied to an image; convolutional layers are just a set of these filters acting both in parallel and in sequence. Superficial convolutional layers encode low-level concepts like edges or corners, and deeper ones encode meaning related to the class of the image. A sample of these layers is shown in Figure 2. I also fine-tuned the weights of these convolutional layers to match the problem of action unit detection better. The output of the last convolutional layer (conv5_3, as shown in Figure 2) can be thought of as the high-level features that define a face. In order to classify these features, I added an array of dense and dropout layers to the network. Dense layers are fully connected layers of neurons that fire an output only if their inputs are high enough, the same way neurons behave in our brain. Dropout layers disable a random fraction of the connections between neurons during the training process.

During the training of the neural network, I explored a number of hyperparameters and settings, such as the learning rate, the batch size, the number and size of dense and dropout layers, and different preprocessings of the data. I found that using dropout layers before and after every dense layer reduced overfitting considerably. However, the model achieved the best validation accuracy when the first dropout layer kept a larger portion of units than the others. The learning rate (a parameter that specifies how fast a model reacts to new information) was also key. Too high or too low a learning rate would not let the model arrive at a decent solution in a reasonable time. I also found out that the larger the size of the dense layers, the lower the learning rate had to be for the model to converge.

The data preprocessing mentioned before also reduced overfitting, as it increased variability and removed similar images. The final setting produced a model able to identify action units with a recall of 64.5% and a precision of 73.6%. For the training process described above I used TensorFlow as the environment, and I ran it on the NVIDIA DGX-2 cluster of the IT4Innovations supercomputing centre. I only used a single slice (a single GPU) of the DGX-2, but that was already 30x faster than the CPU of my development system.

Inferring the emotions in real time

The action units predicted using the deep neural network were translated to the corresponding emotion using a nearest neighbour approach. The peak emotion images in the Cohn-Kanade dataset were fed into the trained neural network and the output (the probability of the image containing each action unit) was stored in a small dataset of around 300 images. Cohn-Kanade also included the corresponding emotions, so I stored these alongside the action unit predictions. Inferring the emotion that corresponds to a new image is then just a matter of getting the vector of action unit probabilities from the network, finding the most similar probability vectors in this small dataset and looking at the most common emotion among them. This can be understood as looking at the neighbours of the action unit probability vector in a hyper-dimensional space where each dimension corresponds to one of the 17 action units. Another, more straightforward, approach would be to check the action units whose probabilities are larger than 0.5 and compare them to a table of the common action units that are archetypal of each emotion. However, the approach I followed is better for two reasons: A) it allows for a wider variability in the emotions, as not every emotion fits within these prototypical definitions, and B) it helps mitigate biases that the model might have. Even if the action unit prediction is not perfect, since the small dataset was created using this same biased model, the imperfection will be accounted for when a new prediction is compared to the existing points in the hyperspace.

Figure 5: 2D representation of the emotion hyperspace corresponding to the action unit predictions

Both cosine similarity and euclidean distance were tried, but cosine similarity yielded the better result. In an ideal world, the neural network would only return values very close to 0 (when the action unit is not present) or values very close to 1 (when the action unit is present at any level of intensity), as a sigmoid function is used on the final layer of the DNN to polarise the probabilities. However, this is not the reality for my model. The reason why cosine similarity gave a better result is that all we care about for classifying an emotion is the ratios between the different action units, not the size of the action unit vector itself, as the latter would encode the intensity of the emotion. Using euclidean distance would result in different intensities being considered as different emotions, which would make it harder to classify new predictions.

This approach needs a way to determine when no emotion is present (or the emotion is neutral). For this, I set a threshold on the probabilities of the action units, so that if every action unit is below this threshold, then the emotion has to be neutral. The emotion displayed in the application is the most common emotion over the last five inferences (i.e. frames), to avoid flickering and foster stability. In order to produce the audio saying the emotion, I used Python's gTTS text-to-speech module. The audio message is only fired once the emotion has settled, to avoid constant interruptions.

Application pipeline

Faces are cropped from a live camera feed using Python OpenCV's Haar Cascade Classifier face detection algorithm, running at approximately 60 ms per image. Images are downscaled before they are fed into the algorithm for speedup reasons, and the face position is then inferred back in the original image, so no resolution is lost. Faces are fed asynchronously to the trained model that predicts the action units, only once the model has finished processing the previous image. The action units predicted by the model are then compared to the small dataset of Cohn-Kanade predictions to find the 10 closest neighbours using cosine similarity, and the most common emotion among them is chosen.

Figure 4: Topology of the final neural network that uses VGGFace's convolutional layers

... to the edge

The original idea of the project was to deploy the application to the Intel Movidius Neural Compute Stick 2 (NCS2). This was achieved using OpenVINO and Intel's own model optimizer. A working application was made, but the NCS2 was giving incorrect inference results, probably due to the lack of

compatibility with some neural network operations. The older NCS 1 was used instead, yielding an inference time of 900 ms. Another version of the application was deployed using the CPU of my development system, where Keras and Tensorflow were used for the inference. This version was able to infer the emotions in 200 ms.

Figure 6: Screenshot of the working application with the action units showing at the left.

Discussion & Conclusion

I believe a very important part of this work is its scalable nature. Making the application detect a new emotion wouldn't require any further training of the neural network, as long as the action units present in the emotion are included in the 17 action units which the network is able to detect. All that would be needed is to add a few new data points – corresponding to the predictions of the model on images containing that new emotion – to the small action unit-emotion dataset that I used for the multi-dimensional comparison.

In order to assess the accuracy of the application, I compiled a dataset of images taken from Google corresponding to the 7 target emotions. This dataset contains faces with clearly identifiable, unambiguous emotions in good lighting and resolution. It includes different ages, genders and ethnicities. The application predicted the correct emotion on this dataset with an accuracy of 74%. The app doesn't perform as well with images that are ambiguous, unclear or contain cues other than muscle movements (for example, the application wouldn't categorise a person with tears in their eyes as sad). It's important to bear in mind that the live application delivers a slightly better accuracy than the individual predictions, as not only one picture is considered to determine the emotion but a sequence of them. An objective metric of the accuracy of the application is still a matter for further analysis. This just serves as a quick verification of how the application is able to infer the emotion on different and diverse subjects.

During the training process of the neural network it's important to have a robust metric of how good a model is. This is true because multiple combinations of hyperparameters are tried during training, so multiple versions of the same model are produced. Even though the neural network is being trained to detect action units in faces, the ultimate goal of these action units is to serve as a basis to infer the emotion on images. Thus, I will suggest a way to evaluate a model by examining how well these action units serve to predict emotions. If we go back to the action unit hyperspace that was discussed before, it can be seen that emotions form clusters in it. A good way of assessing how easily separable these emotions are is to evaluate how little the clusters overlap with each other. Very separate and isolated clusters mean easily classifiable emotions and higher accuracy overall. That's why a metric to evaluate this lack of overlap could be introduced to evaluate the possible models and determine which will yield the best accuracy in the final application. Another possible optimisation would be to reduce the dimensionality of the small action unit-emotion dataset, to reduce the time taken to find the nearest neighbours. This could be done using algorithms like PCA that, as shown in Figure 5, are able to reduce data dimensionality while still showing separable clusters, which, as explained above, translate to easily classifiable emotions.

References
1 Facial Action Coding System, accessed 30 August 2019, imotions.com/blog/facial-action-coding-system/
2 Kanade, T., Cohn, J. F., & Tian, Y. (2000). Comprehensive database for facial expression analysis. Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), Grenoble, France, 46-53.
3 Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., & Matthews, I. (2010). The Extended Cohn-Kanade Dataset (CK+): A complete expression dataset for action unit and emotion-specified expression. Proceedings of the Third International Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB 2010), San Francisco, USA, 94-101.
4 M.F. Valstar, M. Pantic, "Induced Disgust, Happiness and Surprise: an Addition to the MMI Facial Expression Database", Proceedings of the International Language Resources and Evaluation Conference, Malta, May 2010.
5 M. Pantic, M.F. Valstar, R. Rademaker and L. Maat, "Web-based database for facial expression analysis", Proc. IEEE Int'l Conf. on Multimedia and Expo (ICME'05), Amsterdam, The Netherlands, July 2005.
6 Mavadati, S.M.; Mahoor, M.H.; Bartlett, K.; Trinh, P.; Cohn, J.F., "DISFA: A Spontaneous Facial Action Intensity Database", IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151-160, April-June 2013, doi:10.1109/T-AFFC.2013.4
7 Mavadati, S.M.; Mahoor, M.H.; Bartlett, K.; Trinh, P., "Automatic detection of non-posed facial action units", 2012 19th IEEE International Conference on Image Processing (ICIP), pp. 1817-1820, Sept. 30-Oct. 3 2012, doi:10.1109/ICIP.2012.6467235
8 Xing Zhang, Lijun Yin, Jeff Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeff Girard, "BP4D-Spontaneous: A high resolution spontaneous 3D dynamic facial expression database", Image and Vision Computing, 32 (2014), pp. 692-706 (special issue of the Best of FG13).
9 Xing Zhang, Lijun Yin, Jeff Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, and Peng Liu, "A high resolution spontaneous 3D dynamic facial expression database", The 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG13), April 2013.

PRACE SoHPCProject Title: Emotion Recognition using DNNs
PRACE SoHPCSite: IT4Innovations, VŠB - Technical University of Ostrava, Czech Republic
PRACE SoHPCAuthors: Pablo Lluch Romero, The University of Edinburgh, United Kingdom
PRACE SoHPCMentors: Georg Zitzlsberger and Martin Golasowski, IT4Innovations, Czech Republic
PRACE SoHPCContact: Leon Kos, University of Ljubljana, Phone: +12 324 4445 5556, E-mail: [email protected]
PRACE SoHPCSoftware applied: Python, Tensorflow, Keras
PRACE SoHPCAcknowledgement: Thanks to IT4Innovations and my mentors Georg Zitzlsberger and Martin Golasowski for providing me with access to the supercomputer facilities and support throughout my project. Also thanks to PRACE for creating this opportunity and allowing us to get closer to the incredibly interesting and powerful field of HPC. I would also like to thank the owners of the datasets I used, as I was given them for free.
PRACE SoHPCProject ID: 1917
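The cosine-similarity nearest-neighbour lookup at the heart of the application can be sketched in a few lines. This is a toy reconstruction, not the author's code: the function names and the neutral threshold value are invented, and the stand-in dataset uses 3 action units instead of 17; only the mechanism (cosine similarity, k nearest neighbours, majority vote, neutral fallback) follows the article.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Compares the *ratios* between action units, ignoring overall intensity (vector norm)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def infer_emotion(au_probs, dataset, k=10, neutral_threshold=0.2):
    """dataset: list of (action_unit_vector, emotion_label) pairs."""
    if max(au_probs) < neutral_threshold:   # no action unit fires -> neutral
        return "neutral"
    scored = sorted(dataset, reverse=True,
                    key=lambda entry: cosine_similarity(au_probs, entry[0]))
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Tiny stand-in for the ~300 Cohn-Kanade action-unit vectors
dataset = [([0.9, 0.1, 0.8], "happy"), ([0.8, 0.2, 0.9], "happy"),
           ([0.1, 0.9, 0.2], "sad"), ([0.2, 0.8, 0.1], "sad")]
print(infer_emotion([0.45, 0.05, 0.4], dataset, k=3))  # prints: happy
```

Note how the query vector is a scaled-down "happy" pattern: cosine similarity still matches it, which is exactly the intensity-invariance argument made above for preferring it over euclidean distance.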

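The cluster-separability metric proposed in the conclusion above could be prototyped with a crude overlap measure. Everything here (the names, the two-dimensional toy clusters, the particular between/within ratio) is illustrative only, not from the project:

```python
import math

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def mean_dist(points, ref):
    return sum(math.dist(p, ref) for p in points) / len(points)

def separability(clusters):
    """clusters: dict of emotion -> list of action-unit vectors.
    Ratio of the smallest distance between cluster centroids to the
    largest within-cluster spread; higher means less overlap."""
    cents = {k: centroid(v) for k, v in clusters.items()}
    spread = max(mean_dist(v, cents[k]) for k, v in clusters.items())
    labels = list(cents)
    between = min(math.dist(cents[a], cents[b])
                  for i, a in enumerate(labels) for b in labels[i + 1:])
    return between / spread

clusters = {"happy": [[0.9, 0.1], [0.8, 0.2]], "sad": [[0.1, 0.9], [0.2, 0.8]]}
print(separability(clusters))  # well above 1: clusters barely overlap
```

A model whose action-unit predictions yield a higher score of this kind would, by the article's argument, classify emotions more reliably.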
Implementing a Tasking Framework in CUDA for the Fast Multipole Method

FMM: A GPU's Task

Noé Brucy

Is your graphics card able to run N-body simulations in a smart way? A complex tree algorithm driven by a sophisticated tasking system: is all that a task for a GPU? No, some will say; a graphics card can do only basic linear algebra operations. Well, maybe the hardware is capable of much more...

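The accumulation-tree tasking described below can be sketched as a dependency-counting work queue. This is a sequential, CPU-side Python analogue with invented names, not the project's CUDA code: it only illustrates the task and dependency bookkeeping, not the parallelism.

```python
from collections import deque

def accumulate(values, arity=2):
    # Build a complete tree bottom-up as levels; each parent has `arity` children.
    levels = [list(values)]
    while len(levels[-1]) > 1:
        levels.append([0] * (len(levels[-1]) // arity))
    # Unresolved-children counter for every parent node (the dependencies)
    pending = [[arity] * len(lvl) for lvl in levels[1:]]
    queue = deque((0, i) for i in range(len(values)))  # leaf tasks are ready
    while queue:
        lvl, i = queue.popleft()          # one task = "add cell (lvl, i) to its parent"
        if lvl + 1 == len(levels):
            break                         # the root has been computed
        parent = i // arity
        levels[lvl + 1][parent] += levels[lvl][i]
        pending[lvl][parent] -= 1
        if pending[lvl][parent] == 0:     # dependency resolved -> push the new task
            queue.append((lvl + 1, parent))
    return levels[-1][0]

print(accumulate([1] * 16, arity=4))  # 16 leaves initialised with 1 -> 16
```

A task only enters the queue once all its children have been added up, which is exactly the dependency structure the article's rounds walk through.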
The Fast Multipole Method (FMM) is an algorithm to compute long-range forces within a set of particles. It makes it possible to solve the N-body problem numerically with linear complexity, where it would otherwise be quadratic: doubling the number of particles only doubles the computation time instead of quadrupling it. However, the FMM algorithm is hard to parallelize because of data dependencies. Tasking, meaning splitting the work into tasks and putting them into a queue, helps a lot to give work to all threads (GPU or CPU). A working tasking framework for the FMM has been implemented on Central Processing Units (CPU) by D. Haensel (2018). Will such a tasking framework run efficiently on General Purpose Graphical Processing Units (GPGPUs, or more simply GPUs)? The answer is not obvious at all, because CPUs and GPUs have very different architectures. My goal this summer was to shed some light on that topic.

A smooth start

Let me first introduce the problem that we will use to test the tasking framework. It is a simplified version of one of the operators of the FMM, and we named it the accumulation tree. The principle is very simple: the content of each cell is added to its parent. So one task is exactly "adding the content of a cell to its parent". You can already see that dependencies will appear, since a node needs the results from all its children before its own task can be launched. Imagine we have 10 processing units (GPU or CPU threads); the computation of the accumulation proceeds as follows.

Initialisation: all leaves are initialised with 1. All tasks that are ready, that is, all tasks from leaves, are pushed into the queue (blue tasks). Now we are ready to start.

Round 1: ten blue tasks are done. Two new green tasks are ready, so they are pushed into the queue.

Round 2: the last six remaining blue tasks are done, as well as two green tasks. The last two green tasks are ready and pushed into the queue. Here we can see that the tasking permits maximising the thread usage, since all tasks that are ready can be executed.

Round 3: the last two green tasks are executed. We get the correct result, hip hip hooray!

(Figures: an accumulation tree with 16 leaves, each initialised to 1, is reduced into four nodes of value 4 and then the root value 16, with the ten FPUs executing the ready tasks in each round.)

And on a GPU?

As said in the introduction, such a tasking system works well on a CPU. Why can't we just copy-paste the code, translate some instructions and add some annotations to make it work on a GPU? Because many assumptions we can rely on on CPUs do not hold anymore on GPUs. The biggest of them is thread independence. You can compare the CPU to a barbarian army: only a few strong soldiers, any of them being able to act individually. A graphics card, however, is more like the Roman army, with a lot of men divided into units. All soldiers within the same unit are bound to do exactly the same thing (but on different targets). (Credits: Wikimedia CC-BY-SA; https://patricklarkinthrillers.files.wordpress.com)

Many of the problems caused by the special nature of the GPU were already tackled when I joined the project. Indeed, I was provided with a working version of the tasking framework applied to the accumulation tree example. However, it had serious performance issues, since the accumulation took 700 ms on an octree of depth 5. An octree is a tree where each node has 8 children, and an octree of depth 5 has 37449 nodes, which is not that much.

Time to speed things up!

In this section I will tell you my strategies to improve the performance. Let's begin by describing how the tasking framework I was provided with worked.

The starting point

The protagonists of our story are the following. The threads are our Roman soldiers. They work on a Floating Point Processing Unit (FPU) and they do the computation. They are grouped in units of 32 threads called warps. Threads within the same warp should execute the same instruction at the same time, but on different data. Warps are themselves grouped into blocks. Threads within the same block can communicate fast through shared memory and they can synchronise together. However, no inter-block communication or synchronisation is allowed, except through the global memory, which is much slower. The tree is another player. It lives in the global memory, since all threads should be able to access it. At last there is the queue, also living in the global memory. It is implemented as a linked list, meaning that its size can adapt to the number of tasks it contains. To avoid race conditions, accesses to the tree and the queue are protected with a mutual exclusion system (mutex). We rely on a specific implementation of a mutex for GPU, which allows only one thread at a time to access the data protected by the mutex. To avoid deadlocks (a kind of infinite loop where two threads wait for an event that will never occur), only one thread per warp tries to enter the mutex. We call it the warp master, and in this early version only the warp masters can work, as follows:
1. Fetch a task from the queue
2. Synchronisation of the block
3. Execute the task
4. Synchronisation of the block
5. Push new tasks (if any)
Step 5 is done only if the task done at step 3 resolves a dependency and creates a new task.

Cutting allocations

The first improvement was to replace the queue implementation with my own, relying on a fixed-size queue. The reason behind this is that memory allocation is very expensive on a GPU, and with an adaptive-size queue you are allocating and freeing memory each time a thread pushes or pops a task.

Reestablishing private property

The second idea was to reduce the contention on the global queue, since working threads are always trying to access it. I added small private queues for each block, which can be stored either in the cache or in the shared memory for fast access. The threads within a block use the block's private queue in priority, and the global queue as a fallback if the private one is full (push) or empty (pop).

Solving the unemployment crisis

For now only a few threads (the warp masters) are working. It's time to put an end to that. First, since threads within a block are synchronised, access to the private queue is done at the same time and is thus performed sequentially because of the mutex. I decided that only one thread per block will access the queue, and will be in charge to fetch

the work (step 1) and solve dependencies (step 5) for all the threads.

Breaking down the wall

Now that all threads are working, they are all waiting to enter the mutex protecting the tree, even if they are trying to access different parts of it. So I removed the mutex, and ensured that all operations on the tree are atomic. That is a bit as if there was a mutex on each node of the tree, making it possible for several threads to access the tree concurrently.

Fighting inequality

Removing the mutex from the tree resulted in a huge gain in time, so I tried to also get rid of it for the shared queue access. First I split the queue, so that there is one shared queue for each block. Because the queue is fixed in size, the operations of pushing and popping a task are independent. One block master can pop only from its own queues (private and shared) and can push to its own private queue, its own shared queue and also other blocks' shared queues. Thus, we do not need mutex protection for the popping operation. Last but not least, the work must be equally shared between blocks (that is called load balancing). I provided a function that tells a calling thread in which shared queue a newly created task should be pushed.

(Diagrams: per-block shared and private queues, with atomics replacing the mutexes on the tree and on the popping side of the queues.)

Was it worth it?

Now it's time to evaluate how much time we saved with our drastic politics. And the answer is: a lot! The timings below show how long the computation of the accumulation octree of depth 5 took after each of the above mentioned steps:

  Non optimized         700 ms
  Fixed size queues     450 ms
  More queues           200 ms
  More parallelism      100 ms
  Reducing contention    25 ms
  Load balancing        1.5 ms

(Left: approximate timings for the accumulation kernel for an octree of depth 5 on a Tesla P100 GPU, excluding initialisation. Right: total thread execution time for several depths on the Tesla P100.)

The results are quite impressive, as my work reduced the execution time from 700 ms down to 1.5 ms, resulting in a speedup of approximately 450. It also scales pretty well if we increase the size of the tree. Indeed, one would expect that the total execution time (the measured time multiplied by the number of threads) increases eightfold if the depth is incremented by 1. That is because by adding one tree level, we increase the number of tasks approximately by a factor of 8. The right-hand graph shows that the scaling is just as expected from depth 3 on. An important step to get this result was to parallelize the allocation and the initialisation of the tree, gaining a speedup of 900 for depth 5 and thus enabling runs with a depth up to 8. There is not enough memory on the GPU for higher depths. I also wrote a specific Python benchmarking programme to compare different versions of the code.

Towards a full GPU FMM solver

The accumulation tree example is very close to an operator of the FMM called M2M. There are two other tree operators in the FMM algorithm: M2L, which computes interactions between nodes at the same level, and L2L, which is much like M2M but goes from the top of the tree to the bottom. I also implemented operators that have the same execution scheme as M2L and L2L. They work correctly and have execution times within the same order of magnitude as the accumulation tree. Therefore we are not very far from a taskified FMM algorithm running on GPU. The remaining work is the replacement of the simple addition in the tree by the complex mathematical operations of the FMM.

PRACE SoHPCProject Title: Good-bye or Taskify!
PRACE SoHPCSite: Jülich Supercomputing Centre, Germany
PRACE SoHPCAuthor: Noé Brucy, France
PRACE SoHPCMentors: Laura Morgenstern, Ivo Kabadshow and Andreas Beckmann, Jülich, Germany
PRACE SoHPCContact: Noé Brucy, AIM, CEA Saclay, E-mail: [email protected]
PRACE SoHPCSoftware applied: Cuda, Nvidia profiler, Python
PRACE SoHPCAcknowledgement: Thank you very much Laura and Ivo, I really liked working with you!
PRACE SoHPCProject ID: 1918
Credits: the GPU icon is adapted from Arthur Shlain (the Noun Project).

High Performance Lattice Field Theory

High Performance Lattice Field Theory uses the computational power of supercomputers in order to produce a high accuracy lattice field simulation which cannot be created on conventional computers due to its computational expensiveness.

Andreas Nikolaidis

The image above shows a subatomic particle composed of three quarks; it can be a neutron or a proton.

Motivation of project: understanding how the lattice simulation works in order to solve a mass problem that occurred during the simulation of the propagation of a meson particle.

Quantum chromodynamics (QCD) is a quantum field theory that describes the strong interaction between quarks and gluons. Quarks are the subatomic particles composing the well known neutrons and protons, and they occur in six flavours: up, down, strange, charm, top and bottom. The proton for example is a composition of three quarks: 2 up quarks and 1 down quark. If the story ended here, then we would be talking about quantum flavourdynamics. Due to the very small mass of the quarks they should have been constantly moving with speeds close to that of light, so how are they bound together? At this point the strong force is introduced, and it is called strong because it is indeed strong: it can hold the quarks bound together, and it is the strongest of all four of the fundamental forces. The carrier of the strong force is the gluon. Analogous to electric charge, in QCD, is the colour charge, and instead of having positive and negative charge there are three colours representing it: red, green and blue (though colour has nothing to do with the real colours). Lattice QCD is a non-perturbative method for solving QCD problems, but why do we use non-perturbative methods? The reason lies with the highly non-linear nature of the strong force and the large coupling constant at low energies, which makes it impossible for an analytical or perturbative solution to be found.

Figure 1: The small sphere is the gluon carrying an anti-red and blue colour. Once it reaches one of the big red spheres representing the quarks, the anti-red with red will be annihilated and the quark will be blue.

Problem

When there is a big project taking place, the project breaks down into smaller parts, and I was tackling one of the problems that occurred during this project. The main project was a lattice field simulation of how a meson (a subatomic particle composed of a quark and an anti-quark) propagates from a point X to a point X', and what concerns the simulation is what path the particle has used to get from X to X'. The data contains the number of measurements taken between these two points, which is denoted by tmax (i.e. tmax=28 means 28 measurements have been used). Now, while the simulation had been running just fine for two different data inputs, the results showed different rest masses for the same meson particle. So that tells us that there should be a point where there is an error, which might occur for a variety of reasons, i.e. a mistake either at the data input, at the initial parameters added to the code, or at the simulation.

Figure 3: Initial plot of the smsm file, which is the other input file of the simulation code (tmax=28).
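The kind of fit-window scan used on such measurement data can be illustrated with a toy effective-mass analysis. This is not the author's bash script: the synthetic correlator, the crude spread-as-confidence measure, and all names below are invented for illustration only.

```python
import math

def effective_mass(corr):
    # Effective mass from the ratio of neighbouring correlator values:
    # m_eff(t) = log(C(t) / C(t+1)); it flattens out where one state dominates.
    return [math.log(corr[t] / corr[t + 1]) for t in range(len(corr) - 1)]

def scan_windows(corr, tmax):
    """For every fit window [tmin, tmax], report the average effective mass
    and its spread (a crude stand-in for the fit confidence)."""
    meff = effective_mass(corr)
    results = {}
    for tmin in range(tmax - 1):
        window = meff[tmin:tmax]
        avg = sum(window) / len(window)
        spread = max(window) - min(window)
        results[tmin] = (avg, spread)
    return results

# Synthetic two-state correlator: ground-state mass 0.5 plus a heavier state
corr = [math.exp(-0.5 * t) + 0.3 * math.exp(-1.2 * t) for t in range(28)]
res = scan_windows(corr, tmax=27)
# Larger tmin excludes the excited-state contamination, so the average
# settles at the ground-state mass and the spread shrinks.
```

Plotting the extracted mass against tmin for fixed tmax, as the article does next, makes the stable plateau (and any misbehaving input file) visible at a glance.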

Methods to tackle the problem

Firstly, out of the number of measurements mentioned above (tmax), some range of the data gives better confidence in the results; i.e. if tmax=28, instead of using the full range of measurements from 0-28 we take a range out of it, as an example from tmin=5 to tmax=25, and observe what confidence, rest mass and error in mass this data provides. Now, since we get different answers for different ranges, it is reasonable that the results needed to be analysed in order to extrapolate the mass, error in mass and confidence for each of the tmin values and for different tmax values. This task was executed with the help of a bash script as follows. The script iterated over all tmin values for different fixed values of tmax, and for each range the quantities mentioned above were extrapolated into a file. Continuing, initial plots for the two data inputs were produced for mass vs tmin for a fixed tmax value, as shown in figures 2 and 3.

Results and discussion

In order to understand where the problem occurred we can compare the two figures already mentioned. It is rather obvious that figure 2 is at least to a certain degree consistent with the rest mass results, while figure 3 shows very unstable results. Based on this deduction, the problem seemed to lie only with the one data input. So, by checking the input parameters for that file, the error was found to lie with the order in which masses were fed to the fit-code. But what does that even mean? Well, the rest mass that was fed to the code was larger than the mass of the first excited state, which cannot be the case, since in the first excited state the particle holds more energy and thus it is heavier. After arranging that issue, the above figures were re-plotted in order to see if the problem had at last been solved, this time with various values of tmax. The final and most important plot is one that shows the results of the two files on a single graph, where the tmax values selected are the ones providing the greatest confidence. This is figure 4, shown below.

Figures: The figures above were plotted in order to show conformance of the results after the mistake was solved.

Conclusion

The analysis of the data was successful and it showed directly where the errors were. Though at the beginning the issue was thought to exist with the range of values used for the two files, in the end the error occurred due to the program not analysing the correct data. The error of the program was that it did not check whether the mass of the ground state was smaller than the one of the first excited state, and that caused all the issues.

Acknowledgement

I want to thank PRACE Summer of HPC for the opportunity that was given to me, and I would also like to thank my supervisors who helped me learn new things; to be honest, almost everything was unknown to me.

Figure 2: Initial plot of the smpt file, which is one of the data inputs of the simulation code (tmax=28).

Figure 4: It should be noted that the circles shown on the graph symbolise the confidence of the data.

So, by observation of figure 4 we can see that, apart from some issues at the small values of tmin, which are pretty reasonable due to the way that the simulation works, the data for both files is pretty stable and in the same range. We can thus conclude that the problem with the rest masses of the two input files has been solved. Another two graphs were produced that include all the tmax values that were used to analyse the data.

PRACE SoHPCProject Title: Lattice QCD simulation
PRACE SoHPCSite: JSC (Jülich Supercomputing Centre), Germany
PRACE SoHPCAuthors: Andreas Nikolaidis, Cyprus
PRACE SoHPCMentors: Stefan Krieg, JSC, Germany; Eric Gregory, JSC, Germany
PRACE SoHPCContact: Andreas Nikolaidis, Phone: +447871554308
PRACE SoHPCAcknowledgement: Thanks to my supervisors; this summer I learned things that I wasn't aware of before and I am really happy about it.
PRACE SoHPCProject ID: 1919

Configuring and using encrypted disks in an HPC environment

Encrypting disks in HPC

Kara Moraw

Supercomputers which are publicly available have relaxed security measures by default. When it comes to processing sensitive data, further steps towards data security have to be taken. One possible measure is working on encrypted disks which, according to our findings, has an acceptable impact on performance.

There is an increasing demand for processing confidential data, such as the genetic makeup of humans, personal data, and data that, if leaked or modified, could have serious legal consequences. Normally, supercomputers and clusters are shared by many users, which makes it difficult to meet strict security requirements. Furthermore, network restrictions hinder the flexibility and the open character that make clusters popular.

PCOCC (Private Cloud On a Compute Cluster) is a promising technology that enables creating private HPC-clusters on existing clusters by automating the setup of KVM virtual machines connected with an overlay network. These guest private clusters can be tailored to the security requirements per project, whilst maintaining the flexibility and openness of the host cluster. A security weakness in the current PCOCC software is that disk images that are used for storing persistent cluster data are not encrypted. Because the disk images reside on the host cluster file systems, confidential data could be read by users with enough privileges on the system. Since PCOCC is an open-source Python-based tool, it could be modified to run virtual machines on encrypted disks.

Working in an HPC environment, performance is always a priority. So, to evaluate the practicability and usefulness of adding the feature described above to PCOCC, we decided to run a series of benchmarks on a simple setup of KVM virtual machines with LUKS encrypted volumes.

LUKS encryption format

LUKS is short for Linux Unified Key Setup and provides a standard for encrypting disks in Linux operating systems. It takes care of the key management with a two-level key hierarchy: a strong master key is generated by the system and used to encrypt or decrypt the hard disk. The master key itself is encrypted by a user key, and only this encrypted version is stored. To decrypt the hard disk, the user must provide his user key, which is used to decrypt the master key, which in turn decrypts the hard disk. This way, the user key can be changed frequently - this is fundamental for security - without re-encrypting the hard disk every time. Only the master key has to be re-encrypted, which saves a lot of overhead.

LUKS makes use of a series of further security measures to impede attackers calculating the keys, such as AF-splitter and PBKDF2 [REFERENCES]. The details are not relevant for this project and are therefore not discussed here.

It is possible to use different encryption algorithms, modes, key sizes and hash functions within LUKS. In this case, we limited our tests to the algorithms AES, Twofish, Serpent and CAST5 in the modes CBC, ECB and XTS. The following gives a short overview. All four of the considered algorithms remain unbroken and are therefore secure. It is out of the scope of this report to explain their details, but the main thing to know in order to understand the significance of encryption modes is that they are all block ciphers. This means that they work with a key of a predetermined length, and this key can only encrypt a message of the same length. For a short message this might be alright, but when looking to encrypt a 500 GiB hard disk, a 500 GiB long key is hardly of use. Instead, the data is split

[Figure 1 panels: (a) XML file for secret object ("Example secret", /guest_images/vol.img); (b) XML file for volume (vol.img, 7, /guest_images/vol.img); (c) Electronic Code Book mode; (d) Cipher Block Chaining mode; (e) XTS mode]

Figure 1: On the left: XML files for configuration. On the right: different encryption modes.
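The XML files in Figure 1 did not survive extraction; as a rough sketch of their shape (following libvirt's secret and storage-volume encryption schemas - the capacity, cipher attributes and UUID here are illustrative, not the project's exact files):

```xml
<!-- (a) secret definition: kept only in memory, value never revealed -->
<secret ephemeral='yes' private='yes'>
  <description>Example secret</description>
  <usage type='volume'>
    <volume>/guest_images/vol.img</volume>
  </usage>
</secret>

<!-- (b) volume definition referencing the secret; format, secret and
     cipher are declared as described in the text -->
<volume>
  <name>vol.img</name>
  <capacity unit='G'>7</capacity>
  <target>
    <path>/guest_images/vol.img</path>
    <encryption format='luks'>
      <secret type='passphrase' uuid='...'/>
      <cipher name='aes' size='256' mode='cbc' hash='sha256'/>
    </encryption>
  </target>
</volume>
```

The passphrase itself is set separately with `virsh secret-set-value`, as the article notes.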

into data blocks of the same length as the key (e.g. 256 bits). There are different possibilities to apply the encryption algorithm to those blocks, i.e. different modes:

ECB

The Electronic Code Book mode, depicted in 1c, is the simplest and fastest way to encrypt blocks of plaintext. Every block is encrypted separately with the chosen algorithm using the key provided. This can be done in parallel. However, it is highly deterministic: identical plaintexts have identical ciphertexts.

CBC

The Cipher Block Chaining mode, seen in 1d, is less deterministic, but slower. The encryption is randomised by using an initialisation vector to encrypt the first block. Every subsequent block's encryption depends on the block before, i.e. also on the initialisation vector. This cannot be done in parallel.

XTS

The XEX-based tweaked-codebook mode with ciphertext stealing, seen in 1e, uses two keys and supports encryption of sectors which cannot be divided into equal-length blocks.

Test environment

PCOCC itself creates a set of virtual machines working together as a cluster, with KVM as their hypervisor. Our test setup lies on the HPC-Cloud [REFERENCE], a service offered by SURFsara. It lets users create virtual machines running on SSDs. One test instance consists of two virtual machines on the HPC-Cloud: one acts as the host for the virtual machine to be benchmarked, and one acts as an NFS server which will contain the encrypted disk.

Within the virtual machine acting as a host (we will be calling it "host VM" from now on), we recreated what PCOCC does in a simpler way by just defining one virtual machine instead of a cluster. Details on doing this can be found in the REDHAT DOCUMENTATION, but the essential steps for creating the encrypted virtual volume the machine should run on are:

1. Define a secret.
2. Define the encryption to be used.

Defining a secret

With libvirt, an object of type "secret" can be created from an XML file like the one in 1a. The attributes ephemeral and private describe that the secret is only kept in memory as opposed to being stored persistently, and that the value to be associated is never revealed. This definition also declares which volume is to be protected by the secret. Note that this definition does not contain the actual passphrase. It has to be set afterwards by calling virsh secret-set-value.

Defining the encryption

Telling libvirt which encryption to use is part of the volume definition. Again, a new encrypted virtual volume can be created from an XML file like the one in 1b. Here, the format, secret and cipher to be used are declared. The secret object points to the secret created above. There are different encryption algorithms, modes, key sizes and even hashes available.

Benchmarks

To evaluate the performance, we set up eight differently encrypted volumes and had a virtual machine transcode a 17-minute-long video on each of them. The configurations we considered are listed in table 1. The results displayed in 2 and 3 show the average time and the standard deviation after five runs on each configuration.

The results in 2 show that there is

no significant impact on the user time. The average performance is decreased by 0.2-3%. There is a higher impact on the system time, as seen in 3. The performance is decreased by 13-30%. Due to the small number of runs, it is not possible to make a general statement on which algorithm and mode have the lowest or highest impact on performance. Nevertheless, the results indicate that the details of the configuration affect the performance.

Table 1: Configuration of volumes

Volume No. | Cipher  | Mode | Key Size | Hash
1          | None    | None | None     | None
2          | AES     | ECB  | 256      | SHA256
3          | AES     | XTS  | 256      | SHA256
4          | AES     | CBC  | 256      | SHA256
5          | AES     | CBC  | 128      | SHA256
6          | Twofish | CBC  | 256      | SHA256
7          | Serpent | CBC  | 256      | SHA256
8          | CAST5   | CBC  | 128      | SHA256

Discussion

Looking back at the process, we find it was fairly simple to set up an encrypted disk for a virtual machine. While we could not verify the configuration within PCOCC itself due to the time constraints, it looks like encrypted disks could be added in after minor changes to the PCOCC code. The benchmark results show that the encryption does impact the performance, but only to a moderate extent. Also, they suggest that performance demands can be balanced with security demands, since different configurations of the encryption lead to different timings. Further work could research the impact of encryption modes or algorithms on overall performance and work towards the incorporation of encrypted disks in the PCOCC code.
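The determinism contrast between ECB and CBC can be made concrete with a toy sketch. The "cipher" below is a stand-in (XOR with the key, then an add), not a real algorithm; only the block-chaining structure is the point.

```python
BLOCK = 4  # bytes per block (real ciphers use 16)
KEY = b"\x13\x37\xab\xcd"
IV = b"\x01\x02\x03\x04"

def toy_encrypt_block(block, key):
    # stand-in "cipher": XOR with the key, then add 1 to every byte
    return bytes(((b ^ k) + 1) % 256 for b, k in zip(block, key))

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def ecb(data, key):
    # every block encrypted independently: parallelisable, but deterministic
    return [toy_encrypt_block(data[i:i + BLOCK], key)
            for i in range(0, len(data), BLOCK)]

def cbc(data, key, iv):
    # each block is XORed with the previous ciphertext block before encryption
    out, prev = [], iv
    for i in range(0, len(data), BLOCK):
        prev = toy_encrypt_block(xor(data[i:i + BLOCK], prev), key)
        out.append(prev)
    return out

plaintext = b"SAMESAMESAME"  # three identical blocks
ecb_blocks = ecb(plaintext, KEY)
cbc_blocks = cbc(plaintext, KEY, IV)
```

Under ECB the three identical plaintext blocks produce three identical ciphertext blocks, leaking the repetition; under CBC the chaining makes all three differ.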

Figure 2: Average user time when using encrypted disks (blue) compared to a non-encrypted disk (green).

Figure 3: Average system time when using encrypted disks (blue) compared to a non-encrypted disk (green).

PRACE SoHPC Project Title: Encrypted volumes for PCOCC private clusters
PRACE SoHPC Site: SURFsara, Netherlands
PRACE SoHPC Authors: Kara Moraw, Universität Bonn, Germany
PRACE SoHPC Mentors: Lykle Voort, SURFsara, Netherlands; Caspar Van Leeuwen, SURFsara, Netherlands
PRACE SoHPC Contact: Kara Moraw, Universität Bonn, E-mail: [email protected]
PRACE SoHPC Software applied: KVM, PCOCC, libvirt
PRACE SoHPC More Information: KVM: https://www.linux-kvm.org; PCOCC: https://github.com/cea-hpc/pcocc; libvirt: https://libvirt.org/
PRACE SoHPC Acknowledgement: I would like to thank my mentors Lykle and Caspar for organising this project and offering the opportunity to get to know SURFsara. My colleagues at SURFsara also deserve a big thank you for listening to my problems over lunch and offering their expertise. Also, I would like to thank EPCC's Amy Krause and Nick Brown for agreeing to have me as an intern two years ago and introducing me to the HPC world. And last but not least, thanks to Allison Walker for always listening and being the best flatmate you could hope for.
PRACE SoHPC Project ID: 1920

Processing and visualising meteorological data for an improved understanding of bird migration patterns

A data pipeline for weather data

Allison Walker

For ecologists studying bird movement, new data sources can contribute a deeper understanding of migratory patterns. This project focused on building a data pipeline enabling ecologists to efficiently access and visualise a new stream of meteorological radar data.

Birds have amazing capabilities for migration, with some species capable of travelling vast distances each year. While ecologists have traditionally focused on GPS tracking when studying migratory patterns, it is increasingly important to integrate additional data sources. In particular, ecologists at the University of Amsterdam are beginning to work with a new stream of weather radar data to track flocks of birds, and answer questions like: how and when are birds most active? How does weather impact when, where and how birds migrate?

These questions are important to answer for various reasons. Migrating birds share airspace with airplanes, windmills, and other man-made structures like high-rise buildings. It's important for both the birds' safety and the longevity of these aircraft and structures that we are able to accurately predict when flocks will be migrating, and where. Some of the real-world implications of accurately predicting bird migration are:

- The aviation industry, as well as Air Forces across the globe, adjusting the scheduling of flights.
- Wind farms stopping or slowing their wind turbines when a large number of birds are flying through.
- Switching off lights in tall buildings and communication towers, which have been known to confuse birds flying at night.

Figure 1: A falcon with GPS tracker, the traditional means for capturing data on bird migration.

Project Overview

The aim of this project was to improve the accessibility of this weather data so that ecologists can more readily study weather patterns and local bird migration for a given radar.

Where the researchers are used to working with relational databases, this project explored the suitability of replacing these with static storage, along with an accompanying working environment and practices that would form a virtual lab. The intention was that querying this static database still feels natural, with ecologists able to apply their existing IT knowledge. This project was also intended as an investigative analysis into the potential infrastructures for data sources beyond weather and GPS. A set of scientific workflows and a guide to best practice for storing and working with the data stream will provide a foundation for future solution design. For the purposes of this internship, however, the use case was meteorological weather data.

The solution developed allows ecologists to select the metric, the time period, the elevation angle and the visualisation format desired for the output. The final visualisation is efficiently generated, giving researchers more time to focus on ecology.

Key tools used in developing this pipeline include: the Parquet storage format for efficient filtering and loading, Spark for distributed computation, Python as the key programming language, and the Wradlib, Geoviews and Holoviews libraries for visualisation of the transformed data. The following sections of this paper will explain the application of these tools in greater detail.

Learning the Domain

The domain specificity of this project required a deep dive into the world of radar. A critical first step in this project was understanding the way in which radars capture and store data, the metadata collected at each sweep, and the prevailing methodologies for reading and visualising these data.

Radar collects information about objects in the surrounding area by using radio waves. The returned echoes can be used to identify aircraft, weather patterns, and even flocks of birds. Every time a radar completes a sweep, a significant amount of metadata is captured and stored. Examples of this metadata are: radar latitude, radar longitude, radar height, sweep elevation angle, sweep range, and timestamp. In terms of the actual data, information is captured for up to 16 different metrics in each sweep and stored into an array of dimensions (azimuths, radius). All of these data and metadata need to be considered in order to correctly project the data captured into an accurate visualisation.

A typical radar might perform a fresh sweep every 15 minutes, meaning 96 sweeps per day. Data collected in each sweep is stored in its own file, in an HDF5 format. HDF5s are hierarchical files, and somewhat complicated to navigate. The highest level is the elevation angle, and nested within each angle is further information about the different metrics and their associated metadata. In the prevailing approach, the most time-consuming step in processing these files was filtering through the multiple layers and metadata to retrieve only the data arrays for the elevation angle and metric of importance.

Figure 2: HDF5 file structure for radar data.

The many complexities of the radar measurement system meant that it was important that the solution built could easily filter and return only the data that the researcher is interested in for their particular question. Parquet was the most sensible format for these purposes. Parquet is a column-oriented data storage format that is efficiently filtered, and highly compatible with Spark, the next key tool in this project's toolbox.

Therefore, the critical first step in building the data pipeline was to convert HDF5s to Parquet for a year's worth of data from a handful of European radars. This process required an understanding of which metadata would be required to accurately visualise the data, to ensure that all relevant information was transferred from HDF5 to Parquet. The structure of these Parquet files can be seen in figure 3.

Figure 3: Parquet fields.

Spark and Parquet

With the radar data formatted into Parquet and saved in Minio - the object storage software used in this project - the next step in the data pipeline was to load the relevant data. Spark - a general-purpose cluster-computing framework - was critical in this step. Functions for filtering and loading data were written in pyspark, and run in a Kubernetes cluster. This cluster has 25 machines, each with 4 nodes. Already by this stage the efficiencies of the new infrastructure were clear. Where 200 seconds are needed to filter, load and perform a groupby function over relevant data from a series of HDF5 files, only 120 seconds were required for the same process from Parquet.

Radar Visualisation

Transforming a data array and metadata into simple visualisations came next. Wradlib, a python library for Weather Radar Data Processing, was very important for understanding the conventions used for transforming polar data (where the dimensions are azimuths and radius) into gridded data that can be more easily visualised. Again with the help of Spark, the Wradlib library was used to produce visualisations as seen in figure 4. These were further expanded into time series visualisations in a gif format (which unfortunately cannot be demonstrated in this article format).

Figure 4: A PPI visualisation from the Wradlib library.

Georeferencing

While these visualisations are useful, ecologists also require information about the geographic location of these weather patterns. In order to maximise the usability of the radar data, it was therefore important to add a geographical representation to the visualisations.
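The HDF5-to-Parquet conversion described above is essentially a flattening of the nested sweep hierarchy into rows. A minimal sketch, with a plain dictionary standing in for an HDF5 file; the field names here are assumptions for illustration, not the project's actual schema:

```python
# Flatten a nested sweep structure (elevation angle -> metric -> data array)
# into flat records, one per (angle, metric, bin) - the row-oriented layout
# that a columnar format like Parquet can then filter efficiently.
sweep = {  # stand-in for one HDF5 sweep file
    "timestamp": "2019-07-01T12:00",
    "elevations": {
        0.5: {"DBZH": [10.0, 12.5], "VRADH": [1.1, -0.4]},
        1.5: {"DBZH": [8.0, 9.5]},
    },
}

def flatten(sweep):
    rows = []
    for angle, metrics in sweep["elevations"].items():
        for metric, values in metrics.items():
            for bin_idx, value in enumerate(values):
                rows.append({
                    "timestamp": sweep["timestamp"],
                    "elevation": angle,
                    "metric": metric,
                    "bin": bin_idx,
                    "value": value,
                })
    return rows

rows = flatten(sweep)
# selecting one angle/metric is now a flat filter instead of a tree walk
dbzh_low = [r for r in rows if r["elevation"] == 0.5 and r["metric"] == "DBZH"]
```

Once the rows are flat like this, a columnar store can answer the elevation/metric query without walking the HDF5 tree, which is where the reported 200 s to 120 s improvement comes from.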

Key tools used in this stage of the pipeline were: Geoviews, Holoviews and Xarray. These are all python libraries that allow us to place the visualised radar data exactly where it belongs on the map.

Geoviews and Holoviews expect geographical data to be in an Xarray format. These are essentially Numpy arrays, but with the addition of dimensions that summarise the geography and time of each datapoint. Utilising the principles from Wradlib, the raw data arrays were transformed into a gridded format and converted to Xarray. In fact, this step removed the reliance on the Wradlib library altogether. Finally, the Xarray datasets were converted into Geoviews Images and then projected onto a Cartolight map.

The final visualisation, with zoom and time-slide interactivity, is shown in the figure above. Geoviews is capable of projecting information for multiple radars in the one view, so this solution is easily scalable to multiple radars, once that data is also transformed to Parquet format.

Final Product

The final output is, as intended, a functional data pipeline allowing ecologists to query and visualise radar data based on their desired filters. Though the exact timings of the previous approach are unconfirmed, it is clear that the solution developed cuts hours of time from the process to develop these weather visualisations.

Future Work

A number of further optimisations remain. Firstly, the role of colour in radar visualisations cannot be underestimated. The current colourmap is not consistent with what the researchers are familiar with, so it needs to be changed. Next, the ecologists often chop out data from the furthest scans due to loss of accuracy, but all data has been included in the current solution. There is also a need to optimise the structure of the Parquet files and bring in additional metadata from the HDF5s. Finally, this data pipeline can be adapted to produce vertical profiles, a separate visualisation derived from the same HDF5 files.

PRACE SoHPC Project Title: Large scale data techniques for research in ecology
PRACE SoHPC Site: SURFsara, Netherlands
PRACE SoHPC Authors: Allison Walker, Australia
PRACE SoHPC Mentor: Matthijs Moed, SURFsara, Netherlands
PRACE SoHPC Contact: Allison Walker, E-mail: [email protected]
PRACE SoHPC Software applied: Spark, Python, Geoviews, Holoviews, Parquet
PRACE SoHPC Acknowledgement: I would like to acknowledge the invaluable support of the project mentors Matthijs Moed and Ander Astudillo. I learnt even more than I expected, and owe my learning and productivity to both Matthijs and Ander and their constant support. Further, the summer in Amsterdam would not have been so successful and fun without the friendship of Kara Moraw.
PRACE SoHPC Project ID: 1921

Visualising the enormous amount of data from plasma fusion simulations by creating open-source visualisation software.

A glimpse into plasma fusion

Arsenios Chatzigeorgiou

Plasma fusion simulations have the potential to guide a minimisation of the plasma microturbulences and make plasma fusion viable. GENE is an open-source code that performs such simulations. The output, however, is visualised with proprietary IDL software. In this project, we reproduced the IDL plotting schemes using open-source alternatives.

Plasma fusion is the holy grail of energy science. By mimicking the sun's process, a nuclear fusion reactor can produce enormous amounts of energy with no significant waste, using only isotopes of hydrogen (2H and 3H) as fuel. In order to produce such energy, however, very high plasma temperatures are needed.

Research on plasma fusion for energy generation started back in 1940. However, to this day, a functional reactor able to generate power has not become available. The major problem is the inability to create steady-state plasma fusion, meaning plasma that is kept stable and at high temperatures for long periods of time during which energy is constantly emitted. Until now, the plasma process duration has been measured in milliseconds, and keeping plasma in a steady state for just a few seconds is challenging.

One of the reasons the plasma doesn't reach a steady state is plasma microturbulence. This is what happens when you mix plasma of different temperatures together: cold plasma absorbs heat from hot plasma, and the average temperature is lower than desired.

Research on plasma fusion reactors is also very time-consuming and expensive, and a lot of resources (human and material) are needed to build one. A reactor that could potentially generate enough energy is far bigger than a human, making its creation even more expensive. Therefore, a valid prediction of the reactor's behaviour is essential.

The scientific problem

In order to foresee and avoid plasma microturbulences, plasma behaviour is being investigated through plasma modelling and plasma fusion simulations. Scientists try to find the ideal reactor characteristics (the geometry of the reactor, the properties of the magnets that will accelerate the plasma, etc.) that minimise plasma microturbulences and therefore maximise the energy produced.

One such plasma simulation utility is GENE, a Gyrokinetic Electromagnetic Numerical Experiment: an open-source code that can simulate, among other things, the plasma microturbulences detected in reactors. These simulations demand big computational power and may only run on HPC clusters. GENE uses MPI (Message Passing Interface) for the parallelised parts of the code. The GENE algorithm generates some numerical ASCII files, but mostly big, complex binary files as output; no visualisation occurs.

Current GENE status

In the GENE suite, the option offered for analysing and visualising the results is an IDL plugin. This plug-in uses the output files and produces a large list of plots (around 40 different plots) that can be accessed via the IDL GENE diagnostics GUI. IDL, however, is not publicly available: in order to fully analyse and visualise the results, a licence is required. The user interface is also very old-fashioned, and the amount of configuration needed makes the GUI less user-friendly.

Figure 1: On the left, our GUI is presented, with the selected plot "Geometry", which stands for geometric elements of the reactor. The top right plot shows ballooning modes, and the bottom right a toroidal representation.

Our accomplishments

In this project we managed to reproduce some of the IDL plotting schemes in a simpler, modern and open-source GUI. The code was written in Python 3.7 and PyQt5 was used. PyQt5 is a Python toolkit based on the C++ Qt5 application framework. It utilises widgets as graphical interface elements and provides the ability to create layouts of GUIs (Graphical User Interfaces). For plotting the data we employed the matplotlib library, and numpy was used as well. For some plots, we used scripts from a python implementation that existed in the GENE git repository but was not currently in development.

One of the biggest difficulties we came up against was that we didn't know enough about the GENE output files. Initially, we weren't able to access the binary files of the results. Also, the formulas that produced the plots were not documented, and the number of different variables surpassed the symbols of the Greek alphabet. To make matters worse, the IDL scripts were also poorly documented. The deus ex machina was the aforementioned python scripts. Those scripts were able to read the files, manipulate the data, and draw many of the IDL plots, but there was no GUI. Despite the bugs, the chaotic and undocumented way they were written, and some incompatibilities, we managed to use them and provide valid plots and animations from the GENE output files - animations that weren't available in the IDL plug-in, since its GUI didn't support live video or animation representation.

To be fair, the number of different plots we finally provide is not big enough. We managed to produce 13 out of 40 plots, but we didn't have enough time to reproduce all of them. Also, since the GENE IDL plug-in is a proprietary program, we found difficulties in running benchmark cases. The native HPC cluster at the University of Ljubljana didn't have an IDL licence, and neither GENE nor IDL were installed on the supercomputer. Thankfully, my mentor had access to the MARCONI HPC cluster in Bologna, where GENE and IDL are installed and licensed, and we were able to run a few benchmark cases.

Initially, apart from the GUI implementation, we planned to create a manual to help the user navigate our program. But a different approach prevailed: what if we made the GUI so ridiculously simple that the manual would be redundant? So we made everything automated, and the user only has to select the GENE results folder and choose a plot to visualise. The simple instructions needed are printed in the GUI.

Don't get me wrong, the reproduction of IDL is far from done. Re-implementation of the GENE Python scripts, enrichment of the plotting schemes available, and different benchmark cases are some of the things that need to be done. However, a solid foundation for an open-source visualisation of gyrokinetic data has been laid.

PRACE SoHPC Project Title: Visualization schema for HPC gyrokinetic data
PRACE SoHPC Site: University of Ljubljana, Slovenia
PRACE SoHPC Authors: Arsenios Chatzigeorgiou, [UoA] Greece
PRACE SoHPC Mentor: Dejan Penko, UL, Slovenia
PRACE SoHPC Software applied: PyQt5, GENE, IDL
PRACE SoHPC Acknowledgement: I have to thank my mentor Dejan Penko and his colleague Gregor Simic. Without their help I wouldn't have been able to get anywhere close to these results.
PRACE SoHPC Project ID: 1922

Building an Electricity Consumption Prediction Model using R and Hadoop to enable Smart Planning

Predicting Electricity Consumption

Khyati Sethia

Global electrical energy consumption is increasing rapidly. In order to make accurate electricity purchases for a selected time interval and enable smart planning, a predictive model is built for end-user electricity consumption.

The consumption of electrical energy is increasing rapidly. The selected Slovene company, which sells electricity to its customers, wants to build a customer electricity forecasting system in order to make better spending forecasts from the consumer's consumption history and data on influential related factors, thereby making it more accurate to know how much electricity should be bought for the selected time interval. This will ultimately save energy, increase the profits for the company and decrease costs for the end-users.

Here, different influential external factors like geography-wise weather, holidays and time are examined, and a study is done to find the correlations for the prediction of energy consumption over time. A process is developed for predicting consumption, which enables smart ordering and planning of energy consumption and thus great savings. Algorithms for predicting electricity are developed in the R environment and then adapted to work over big data databases and NoSQL MongoDB databases. This will serve as a basis for developing a new billing system for end consumers. The following figure shows the project flow schema:

Figure 1: Process Flow

Data Sources

The dataset is 15-minute electricity consumption data spanning one year for 85 end-users. Fields are:

1. Date of Consumption
2. Time of Consumption
3. Region ID
4. Consumer ID
5. Consumption in kW

After the identification of the data and their sources, the influential external factors for the consumption of end-user electricity are investigated. The focus is on the calendar and the weather. For this, the consumption data is fused with data about the influential external factors, and exploratory analysis is performed to understand how the selected factors influence the energy consumption. The weather data is downloaded from the Environmental Agency of the Republic of Slovenia (ARSO) website. The information on holidays in Slovenia is obtained from the Time and Date Slovenia weblink.
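As a sketch of the record handling involved (the project itself works in R; the exact field layout and formats below are assumptions, and the holiday lookup is omitted):

```python
from datetime import datetime

def parse_record(date_s, time_s, region, consumer, kw_s):
    # European decimal comma -> US decimal point, e.g. "1.234,5" -> 1234.5
    kw = float(kw_s.replace(".", "").replace(",", "."))
    ts = datetime.strptime(date_s + " " + time_s, "%d.%m.%Y %H:%M")
    # derive the Day Type column (holiday lookup omitted in this sketch)
    day_type = "Weekend" if ts.weekday() >= 5 else "Working day"
    return {"timestamp": ts, "region": region,
            "consumer": consumer, "kw": kw, "day_type": day_type}

rec = parse_record("06.07.2019", "12:15", "R1", "C33", "1.234,5")
```

The same per-record normalisation, applied to all files and joined with the weather and holiday tables, yields the merged dataset the next section preprocesses.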

Data Preprocessing

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.

- The files are extracted from the nested zipped folder (received as input data from the Slovene company) and bound together into one file.
- The column formats are changed: the European number format is converted to the US number format, and the date format is changed.
- Date-related columns are derived: week number, day of week, parts of the day, etc.
- Similar data processing as above is done for the external factors - the Weather & Holiday datasets.
- Special characters are replaced in region names.
- Weather columns are normalised.
- All three datasets are then merged.
- Day Type information is derived: whether it is a Holiday, a Weekend or a Working day.
- The long holiday information is derived: i.e. suppose Thursday is a holiday; then it is highly likely that most people also take Friday as a leave day, thus making it a long holiday. A similar case is considered for Tuesday.

Data Analysis

Data analysis is the process of evaluating data using analytical and statistical tools to discover useful information and aid in business decision making. We perform data analysis using two methods - Data Mining and Data Visualisation.

Total Consumption Analysis: Consumer 33 has maximum consumption compared to other consumers.
Day Analysis: Consumer 33 has maximum consumption for all day types, with 65% on working days and 29% on weekends. But Consumer 35 has its highest share of consumption, 31%, on weekends, and Consumer 7 has its highest, 94%, on working days. The majority of consumers (except a few) have high consumption on weekdays and less on weekends.
Period Analysis: We find that 67% of the consumers have their highest consumption around noon, while 29% of consumers have their highest consumption at night. No consumer has their highest consumption during the evening.

In Figure 2, the consumption varies with temperature; more electricity is required at very low and very high temperatures.

Figure 2: Variation of Consumption with Temperature

Figure 3 shows the correlation of various weather factors with consumption. Here the temperature is scaled by subtracting 10 and taking the absolute value. It can be seen that consumption has a high correlation with (scaled) temperature and radiation. There are, though, many consumers who don't exhibit a similar temperature vs consumption behaviour.

Figure 3: Correlation of various weather factors with consumption

In Figure 4, it is shown how the consumption varies within a week day (every 15 mins, hence a total of 96 data points in a day). The consumption is low during the night and high in the day, with a peak in the early morning. It has also been observed that the consumption decreases significantly on weekends.

Figure 4: Electricity Consumption by Day

Unsupervised Learning

Clustering is a Machine Learning technique that involves the grouping of similar data points, while data points in different groups should have highly dissimilar properties. In this project we perform Hierarchical and K-Means Clustering. Cluster analysis is performed to generate clusters of days with a similar kind of electricity consumption, using the Euclidean distance and other custom distance metrics. The Euclidean distance among data points is calculated using the following formula:

EuclideanDistance = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}    (1)

We tried these methods to find the optimal number of clusters for all consumers: the Elbow Method, Gap Statistics and Hartigan. The method that gave the least prediction error is the Hartigan index. The Hartigan index is computed using the following equation:

hartigan = \left( \frac{\mathrm{trace}(W_q)}{\mathrm{trace}(W_{q+1})} - 1 \right) (n - q - 1)    (2)

where W_q = \sum_{k=1}^{q} \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^T is the within-group dispersion matrix for data clustered into q clusters, q \in (1, ..., n-2); x_i is the p-dimensional vector of the i-th object in cluster C_k; \bar{x}_k is the centroid of cluster k; and n is the number of observations in the data matrix.

In Figure 5, we have obtained these highlighted groups or clusters for one

consumer using statistical methods. Here the dendrogram visualises his daily consumption, which is divided into three different groups of days.

Figure 5: This dendrogram shows various clusters for one customer.

Figure 6 below shows the clusters obtained using K-Means Clustering for one consumer. This graph explains the data point variability using PCA (the method that minimises the error orthogonal (perpendicular) to the model line).

Figure 6: This clusplot shows various clusters for one customer.

The clusters are formed so that days within the same cluster have a similar consumption pattern and are very different from other clusters. We also found that a relationship exists between the clusters and the day of week & holidays.

Cluster Data Prediction

Using this cluster information, the relation among various factors like the day of the week and the holidays is found. We get better results using K-Means Clustering than Hierarchical Clustering. For now, a process has been developed which uses this cluster information to predict consumption. To measure the prediction accuracy, the error percentage is calculated using the below equation and represented using Figure 7:

Area = \int_a^b [f_1(y) - f_2(y)] \, dy    (3)

Figure 7: Area under the Actual and Predicted Curve

Adaption to NoSQL Database

NoSQL MongoDB is used to store the dataset for enhanced scalability in terms of storing large volumes of data, for flexibility with JSON-like documents, and for faster retrieval, ad-hoc queries, indexing, replication and MapReduce programs. The R 'mongolite' package is used to work with the NoSQL MongoDB database. Using this package, the following was performed: data retrieval, manipulation, and analysis using the data stored in MongoDB.

Conclusions and Future Work

In this project, we use the electricity consumption data received from the Slovene company. We collect weather and holiday information and find their influence on electricity consumption. We perform clustering to find groups of days with a similar consumption pattern. Using this clustering information, we construct a prediction model. We also adapt the data to NoSQL and the RHadoop MapReduce framework for large datasets & parallel processing. As a future direction, the analysis can be extended to larger amounts of data, which will improve the predictions manyfold. Also, the Advanced Clustering Algorithm (ACA) can be explored to reduce the cluster variances.

References

1. Fazil Kaytez, M. Cengiz Taplamacioglu, Ertugrul Cam, Firat Hardalac (2014). Forecasting electricity consumption: A comparison of regression analysis, neural networks and least squares support vector machines.
2. Usman Ali, Concettina Buccella, Carlo Cecati (2016). Households electricity consumption analysis with data mining techniques.

Adaption to Big Databases

Hadoop is used which is mainly useful Figure 6: This clusplot shows various clus- for large datasets, that can’t be man- ters for one customer. aged by single pc. It has distributed PRACE SoHPCProject Title storage and any time any number of Industrial Big Data analysis with The clusters are formed with simi- RHadoop computers can be added or removed lar consumption pattern in same cluster from this cluster. The files have replicas PRACE SoHPCSite and very different from other clusters. University Of Ljubljana, LECAD on different nodes thus making the sys- We also studied that relationship exists Laboratory, Slovenia tem very reliable & resilient to failure. between the clusters with day of week PRACE SoHPCAuthors It solves the problem by divide and con- Khyati Sethia, The University of & holidays quer approach and achieve parallel pro- Salford, United Kingdom cessing. Some of the algorithms have PRACE SoHPCMentor Janez Povh, University of Ljubljana, Cluster Data Prediction been adapted to big databases for paral- Faculty of Mechanical Engineering, lel processing on multiple nodes using Ljubljana, Slovenia Khyati Sethia Using this cluster information, the re- MapReduce Hadoop scripts in R envi- PRACE SoHPCContact lation among various factors like day ronment. Khyati Sethia, University Of Salford of the week and the holidays is found. Phone: +44 7739897649 E-mail: [email protected] We get better results using K-Means MOOC "Managing Big Data with R PRACE SoHPCSoftware applied Clustering than Hierarchical Clustering. and Hadoop" : R, RHadoop, MongoDB For now, a process has been developed PRACE FutureLearn MOOC is a course PRACE SoHPCMore Information which uses this cluster information to on how to manage large amounts of R, RHadoop, Future Learn, MongoDB predict consumption. To measure the data using Hadoop MapReduce in R PRACE SoHPCAcknowledgement prediction accuracy, the error percent- environment. 
The data has been made The author would like to thank PRACE Summer of HPC age is calculated between the actual available in HDFS. The script developed for this opportunity. Also, author is grateful to Dr. Janez Povh and Dr. Leon Kos for their mentoring during this and predicted consumption. This error is how to do data formatting, aggrega- project. is equal to the difference in area be- tion and calculating mean and standard PRACE SoHPCProject ID tween the two curves and is given by deviation. Then plotting the results. 1923
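The error-area metric of equation (3) can be sketched numerically. Below is a minimal Python illustration (not the project's R code; the consumption values are hypothetical) that approximates the area between an actual and a predicted curve with the trapezoidal rule, taking the absolute difference so that crossings do not cancel:

```python
def area_between(y, f1, f2):
    """Trapezoidal approximation of the area between two curves,
    i.e. equation (3): the integral of |f1 - f2| over the grid y."""
    total = 0.0
    for i in range(1, len(y)):
        g_prev = abs(f1[i - 1] - f2[i - 1])
        g_curr = abs(f1[i] - f2[i])
        total += 0.5 * (g_prev + g_curr) * (y[i] - y[i - 1])
    return total

# Hypothetical hourly consumption (kWh): actual vs. predicted
hours     = [0.0, 1.0, 2.0, 3.0]
actual    = [2.0, 3.0, 2.5, 2.0]
predicted = [2.2, 2.8, 2.5, 2.4]
err_area = area_between(hours, actual, predicted)  # prediction error
```

Dividing this area by the area under the actual curve would then give the error percentage used to judge the prediction.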

Energy reporting in Slurm Jobs

Enhancing energy consumption reporting capabilities in HPC centres by developing a Slurm plugin for NVIDIA GPUs

Matteo Stringher

HPC and Supercomputing centres are very intensive energy consumers. Energy reporting in such centres is a key feature of the system, especially for operations staff. During this project we have worked towards the implementation of a plugin to enhance the Slurm energy reporting capabilities.

Energy consumption is one of the main concerns in the HPC world. As an order of magnitude, the power demand of the largest HPC centres is about 5 to 10 megawatts. In the race to Exascale-level systems, HPC facilities have set a cap on the total power available, to ensure both efficient and economically viable systems.

The power used for cooling the system can be roughly equal to the power consumed to keep up the nodes under user workload. It is clear that measuring the consumption of each component of the system allows the work performed on the system to be better understood and optimised. In the past years, more and more attention has been paid not only to performance parameters, but also to efficiency, such as a system's Gigaflops-per-watt metric, which is used for the Green500 ranking of supercomputers.

Tool set and project description

Slurm

Slurm is one of the main job scheduling and resource management systems in HPC centres. It is adopted by most of the computer systems listed in the TOP500. Energy consumption can be retrieved in two different ways: external sensors and internal monitoring. The project focused on the second methodology. Slurm natively provides support for IPMI and RAPL. The former is focused on baseboard management and consumption retrieval, while the latter operates at a more fine-grained level for Intel chips.

NVML

In the past decade GPUs have been increasingly requested by scientific users; they allow a variety of codes to be sped up, from molecular dynamics applications to the training of deep neural networks. Deep learning training phases can be costly, especially when a hyperparameter analysis is needed to fine-tune the model. GPU spot power consumption can be retrieved for NVIDIA GPUs through IPMI or NVML (provided by NVIDIA). According to the NVIDIA documentation, it is possible to retrieve the power usage of the GPU and its associated circuitry.

To our knowledge, Slurm does not provide an integration for GPU power consumption recording, so we decided to build and experiment with a plugin able to collect data and summarise the results to the user, such as the total consumption of their tasks that use GPUs.

Slurm plugin development

Slurm allows external interaction with its functions in two different ways: the Slurm plugin API or SPANK. As regards the first one, Slurm provides a different interface for each type of task the plugin is meant for (e.g. accounting storage, energy). SPANK, which stands for Slurm Plug-in Architecture for Node and job (K)control, provides an easier approach, where there is no need to access the source code. In the project the latter has been chosen, since our requirements were satisfied by the SPANK capabilities (e.g. the possibility to spawn a metrics collection process for the duration of a user's task). Moreover, the installation is rather straightforward: once the code is compiled into a shared library, the plugin can be easily mounted by adding a line to an internal configuration file, plugstack.conf, which controls where plugins are loaded from and how Slurm should deal with their exit states. It is suggested to recompile the plugin every time the Slurm installation on the operational system is updated. The configuration must be identical on each node where the plugin will be run.
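The NVML readings described above can also be reproduced interactively. Below is a hedged Python sketch using the pynvml bindings (the plugin itself calls the C NVML API; pynvml may not be installed everywhere, so the sketch degrades gracefully):

```python
# Sketch: one power sample per GPU via NVML. Assumption: the pynvml
# bindings are available; the real plugin uses the C NVML API directly.
import time

try:
    import pynvml
except ImportError:          # allow running the sketch without a GPU stack
    pynvml = None

def milliwatts_to_watts(mw):
    """NVML reports power in milliwatts; the plugin stores watts."""
    return mw / 1000.0

def sample_gpu_power():
    """Return [(timestamp, watts), ...], one entry per GPU on the board."""
    if pynvml is None:
        return []
    try:
        pynvml.nvmlInit()
    except Exception:        # no NVML-capable driver on this machine
        return []
    samples = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # GPU + circuitry, mW
        samples.append((time.time(), milliwatts_to_watts(mw)))
    pynvml.nvmlShutdown()
    return samples
```

Calling `sample_gpu_power()` in a loop at the chosen frequency is, conceptually, what the plugin's forked monitoring process does.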

[Figure 1, left panel: SPANK call flow. slurmd runs slurmd_init() and job_prolog(); slurmstepd runs init(), processes spank options, user_init(), then for each task fork(), execve() and wait(), then exit(); finally job_epilog() and slurmd_exit(). Right panel: power (W) vs. time (s) for sampling frequencies of 60 Hz and 1 Hz.]

Figure 1: Left image: Simplified workflow of the SPANK functions that can be called inside the slurmd and slurmstep daemons. Right image: Demo run in the second test phase (before the installation of the plugin) on the Iris cluster to compare the use of different sampling frequencies.

Hierarchical Data Format

HDF5 is a popular file format in the scientific fields, used to store heterogeneous and large amounts of data in a single file. Moreover, it integrates with MPI for writing/reading data in parallel up to large scales.

Slurm inbuilt energy accounting

Data collected through Slurm's IPMI/RAPL mechanisms is retrieved during the computation, and the final consumption value is stored in the Slurm database. This is actually a tunable option that must be set in the proper configuration file. The slurmdb library offers different functions to query the database. Our plugin retrieves the value from the database at the end of the job and informs the user. The Slurm documentation highlights that the value must be considered only if the node has been reserved in exclusive mode, such that a job's consumption is accurately tracked.

Project development

Virtualized system and Slurm architecture

To develop the plugin, a virtualized system has been adopted in order to work in a safe testing environment, without relying on a physical cluster. We have leveraged the Vagrant-VirtualBox combination to deploy a virtualized HPC infrastructure. This improves debugging and development time. More information about the setup and micro-cluster composition is available in the repository documentation.1

The Slurm infrastructure is mainly composed of 3 entities: management, front-end and computation nodes. When launching a job, Slurm allocates the number of nodes requested by the user. One node with relative ID equal to 0 (a job's 'head node') manages task launches. Each computation node runs its own copy of the slurmd daemon, which is in charge of starting a slurmstepd daemon for each of a job's steps. Our plugin interacts with the slurmd daemons thanks to the SPANK library.

SPANK integration

SPANK provides an interface which must be used to let the slurmd and slurmstepd daemons invoke the plugin. SPANK plugins are loaded in up to five separate contexts: local, remote, allocator, slurmd, job_script. Remote was the mainly used one, since most operations are executed at the step level. Figure 1, on the left, shows the different calls that can be used with the SPANK interface.

SPANK easily allows the job behaviour to be modified. Our plugin can be invoked from the sbatch command when launching the job script. Along with the --energy-reporting flag, which activates energy measurements, the user can specify the sampling frequency (expressed in hertz); otherwise a default value of 20 measurements per second is used. The sampling frequency must be equal to or greater than 1.

At each step of the job script, a process is forked on every node, which is in charge of storing in the HDF5 file the timestamp and the measured power consumption, i.e. one timestamp and one measurement for each GPU mounted on the mainboard. All the samples are stored in a buffer which is flushed to the HDF5 file every second. When the user task ends, the forked monitoring process is stopped. The HDF5 file will then contain several datasets, one for each step and for each GPU. In each dataset, two sequences of n samples can be used to calculate the consumption over time.

1 puppet-slurm: github.com/ULHPC/puppet-slurm
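The monitor's buffer-and-flush loop described above can be sketched as a simplified Python stand-in (the actual plugin is written in C and flushes to an HDF5 file; the `flush_cb` callback here is a hypothetical placeholder for that write):

```python
class SampleBuffer:
    """Collects (timestamp, per-GPU power) samples and flushes them once
    at least one second has elapsed since the last flush. A simplified
    stand-in for the plugin's C buffer; `flush_cb` (hypothetical) would
    append the rows to the HDF5 dataset."""

    def __init__(self, flush_cb, flush_interval=1.0):
        self.flush_cb = flush_cb
        self.flush_interval = flush_interval
        self.buffer = []
        self.last_flush = 0.0

    def add(self, timestamp, watts_per_gpu):
        self.buffer.append((timestamp, watts_per_gpu))
        if timestamp - self.last_flush >= self.flush_interval:
            self.flush_cb(self.buffer)   # write buffered rows out
            self.buffer = []
            self.last_flush = timestamp

rows = []                                # stands in for the HDF5 dataset
buf = SampleBuffer(rows.extend)
for t in [0.0, 0.05, 0.10, 1.05, 1.10, 2.10]:   # sample timestamps (s)
    buf.add(t, [95.0, 97.0])                    # two GPUs on the board
```

With these timestamps the buffer flushes twice (at t = 1.05 s and t = 2.10 s), leaving no samples pending, which mirrors the once-per-second flush policy of the plugin.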

[Figure 2 plot: GPU power consumption (W) vs. time (s) during Tensorflow-GPU training on MNIST, one curve per GPU (gpu0-gpu3).]

Figure 2: MNIST convolutional NN training test.

Given P_{1,...,n}, the samples collected at each timestamp t_{1,...,n}, the total consumption can be approximated by the equation:

E_tot = Σ_{i=2}^{n} [(P_i + P_{i−1}) / 2] · (t_i − t_{i−1})

Using the equation above, the consumption of each GPU of each reserved node is computed, and the final value is shown to the user expressed in Joules.

Results & further work

The implemented plugin can seamlessly interact with the Slurm execution. Figure 1, on the right, shows the power consumption for the same run with different sampling frequencies. It can be seen that the choice of the sampling rate has a notable effect: the blue curve has a lot of spikes which cannot be captured at a lower frequency. It must be underlined that the precision of the measurement provided by the NVML library is subject to error.

The plugin has been developed and tested through three stages:

• In the first one, we used the virtualized cluster to ensure the code would run safely on the physical one. This development part took most of the project time.

• In the second one, a prototype of the code was run on the Iris cluster without affecting the Slurm configuration. The results are shown in Figure 1, on the right. The core data retrieval functions at this stage were able to collect information from all the GPUs on the mainboard.

• In the last phase, the plugin was fully assembled and tested by installing it on the cluster and executing it on compute nodes featuring dual Skylake CPUs and quad Volta V100 GPUs.

Figure 2 shows a Slurm job executing a Singularity container with GPU-enabled Tensorflow 1.12 and Python 3. The same code has been tested on a longer run, achieving similar results. In detail, the job steps are:

• 0-5 s: loading the software environment (lmod), Singularity container initialization

• 5-12 s: TensorFlow-GPU initialization, data load from disk to memory

• 12-36 s: training for 10 epochs

• job spindown

The energy consumption data is stored for further analysis if needed. The HDF5 file format supports large datasets and compression, and is used by our plugin to store timestamped energy measurements. The plugin manages parallel jobs and all the contained steps. The user is informed about the energy usage for each job step, and precise identification of hot spots is possible (together with other information from Slurm accounting).

Future work could study and understand the overhead of the retrieval process. This runs mostly on the CPU, so we expect the overhead to be low for GPU-intensive processes.

Conclusions

We have prototyped a new Slurm plugin for energy reporting of a user's tasks, which can interact with NVIDIA GPUs through the NVML interface, allowing for GPU energy consumption retrieval, a novel development to our knowledge. The code is available as open-source.2

PRACE SoHPCProject Title
Energy Reporting in Slurm Jobs

PRACE SoHPCSite
University of Luxembourg, Luxembourg

PRACE SoHPCAuthors
Matteo Stringher

PRACE SoHPCMentor
Valentin Plugaru, Unilu, LU

PRACE SoHPCContact
Matteo Stringher, University of Luxembourg, E-mail: [email protected]

PRACE SoHPCSoftware applied
Slurm, Vagrant, Puppet, Virtualbox, C

PRACE SoHPCAcknowledgement
I gratefully acknowledge the PCOG group for the project and the support given before and throughout the whole summer.

PRACE SoHPCProject ID
1924
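The trapezoidal sum used for the reported energy maps directly to code. Below is a minimal Python sketch (the plugin itself is written in C) turning one GPU's samples into Joules:

```python
def total_energy(timestamps, power_watts):
    """Approximate E_tot = sum_{i=2..n} (P_i + P_{i-1})/2 * (t_i - t_{i-1}),
    i.e. trapezoidal integration of power (W) over time (s) -> Joules."""
    energy = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        energy += 0.5 * (power_watts[i] + power_watts[i - 1]) * dt
    return energy

# 1 Hz samples of a GPU drawing a constant 100 W for 3 s -> 300 J
e = total_energy([0.0, 1.0, 2.0, 3.0], [100.0, 100.0, 100.0, 100.0])
```

Summing this quantity over every GPU of every reserved node gives the per-job total reported to the user.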

2 gitlab.uni.lu/sohpc/public/slurm-spank-energy-reporting

Benchmarking Deep Learning

Performance analysis of Distributed and Scalable Deep Learning

Sean Mahon

This project deals with different ways of evaluating the performance of distributed training of deep learning models and comparing the efficiency of multiple widely used frameworks. In particular, experiments were run to determine the scalability of training Resnet models on the CIFAR10 dataset using both CPUs and GPUs.

Deep learning is becoming an extremely prominent method of solving problems both in academia and in industry. However, training deep neural networks, which often contain hundreds of thousands, if not millions, of parameters, can be extremely computationally expensive. Therefore, it is natural to attempt to distribute this training over multiple processors where possible. There are multiple ways of approaching this task, with the simplest and most common being data parallelism.1

Put simply, data parallelism involves splitting up the sample of the data, or batch, used in each training step between all available processors. The resulting changes in the trainable parameters of the model are then averaged across all processors before the process is repeated. However, due to the high number of parameters to be communicated between processors, it is rarely optimal to keep the same hyperparameters when adding more devices. Normally, this is handled by fixing the batch size per device and adjusting other parameters to compensate.

This creates a problem when considering how to evaluate how well a model is performing. Calculating the speedup in throughput, or the number of data points processed per second, is a common approach taken by chip manufacturers, but may not be helpful if the model does not train as effectively on too many devices. For this reason, many existing ML benchmarking efforts, such as MLPerf, place considerable emphasis on time-to-accuracy results. This involves measuring the walltime required for the model to reach a certain level of accuracy.

Further complications arise from the fact that a number of different libraries and frameworks exist for deep learning, many with very similar functionality. It is important to note that, while these frameworks share many of the same functions, the implementation can vary somewhat between frameworks. While there have been some efforts to create benchmarking tools which work with multiple frameworks, such as Deep500,2 there does not seem to be any individual benchmarking platform which provides comprehensive support for all of the most common frameworks in use.

Structure of the Code

In order to compare the performance of different frameworks on the University of Luxembourg's Iris Cluster, software was written to train models in a variety of frameworks while collecting data relevant to benchmarks, such as those described previously, between epochs. The current version allows the user to import a model and dataset and supply a config file with details of the training, before it automatically distributes everything across the hardware specified, trains for some given number of epochs, and collects common benchmarking metrics at the end of the run. In addition, there is also a script to parse the config file and produce a suitable SLURM batch script to ensure that the resources allocated match what is specified. The frameworks supported are Tensorflow (distributed with Horovod), Keras (native multi_gpu_model and Horovod), MXNet, and PyTorch (DistributedDataParallel models with the Gloo backend).

As mentioned in the last section, there are multiple variables which affect whether the training process can be parallelised efficiently. Some research and experimentation was required to find sensible default settings for the scaling of hyperparameters. However, in general, it was found that current best practice as described in the literature3 (and well summarised in this article) was effective.
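The data-parallel step described above can be sketched in plain Python. This is illustrative only: real frameworks exchange gradients with allreduce collectives rather than a local loop, and the toy gradient function here is hypothetical:

```python
def data_parallel_step(batch, n_workers, grad_fn):
    """One data-parallel training step: shard the batch across workers,
    compute each worker's gradient on its shard, then average the
    gradients as if they had been exchanged via an allreduce."""
    shard_size = len(batch) // n_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(n_workers)]
    grads = [grad_fn(s) for s in shards]              # per-worker gradients
    n_params = len(grads[0])
    return [sum(g[j] for g in grads) / n_workers      # averaged update
            for j in range(n_params)]

# Toy gradient: one "parameter" whose gradient is the shard mean
mean_grad = lambda shard: [sum(shard) / len(shard)]
avg = data_parallel_step([1.0, 2.0, 3.0, 4.0], 2, mean_grad)
```

Each worker sees only a quarter of the communication-free work here, but the averaged result is identical to a single-worker step on the full batch, which is the point of the technique.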

(a) Speedup in Throughput for all Frameworks Considered. (b) Time-to-Accuracy curves for Tensorflow and PyTorch.

Figure 1: Results from GPU Experiments

Sample Problem: Resnet and CIFAR10

While the code written for this project could in principle be used for a variety of different models and tasks, it may be insightful to consider a well known example for experiments. The dataset used was CIFAR10,4 a standard collection of 32×32 pixel images featuring 10 categories. The dataset contains 50,000 images on which to train a model to correctly identify the category of object in a further 10,000 unseen test images.

Figure 2: Sample Images from CIFAR10

The model used to classify these images was a version of Resnet,5 a well known example of a convolutional neural network for image classification problems. In particular, a 44 layer network with over 600,000 trainable parameters was used. While training such a model may sound like a substantial task, it would be considered a medium sized problem in deep learning. While it would be inadvisable to attempt to train this model on a normal laptop, it would not necessarily take several hours to get meaningful results when using more sophisticated hardware.

GPU Results

This model was trained for 40 epochs with a local batch size of 256, using the frameworks mentioned, on varying numbers of GPUs. It should be noted that the same data pipeline was used to load and perform simple augmentations on each batch, to ensure that inconsistencies in the data used do not affect training results. The mean speedup in throughput is shown in figure 1a. The main trends visible for each framework were:

• The PyTorch results (found using the DistributedDataParallel functionality and Gloo backend) seem to scale very well and only begin to substantially deviate from ideal predictions once more than one node (4 GPUs) is used.

• Tensorflow distributed using Horovod also scales well, though inefficiencies are visible at a lower number of GPUs than for PyTorch. It should be noted that, once serial code has been implemented, distributing training with Horovod requires very little extra code compared to other frameworks.

• Throughput for MXNet increased at a slightly lower rate than Tensorflow and PyTorch, with moderate inefficiencies visible when using just 4 GPUs. Distributed training was also less user-friendly to set up than for the other frameworks, meaning support for multiple nodes has yet to be implemented.

• The experiments for Keras did not appear to scale well. This is not surprising, as it is the most high level of the frameworks considered and Horovod was originally designed to work with Tensorflow. There is also the issue that the native multi_gpu_model only executes training steps in parallel, rather than also loading data, potentially creating extra bottlenecks.
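The hyperparameter adjustment mentioned earlier (fixing the per-device batch size and compensating elsewhere) is commonly done with the linear learning-rate scaling rule of Goyal et al.3 A minimal sketch, with hypothetical base values rather than the project's actual settings:

```python
def scaled_hyperparams(n_devices, per_device_batch=256, base_lr=0.1,
                       base_batch=256):
    """Linear learning-rate scaling: keep the per-device batch size fixed,
    so the global batch grows with the device count, and scale the
    learning rate by the same factor (Goyal et al., 2017)."""
    global_batch = per_device_batch * n_devices
    lr = base_lr * global_batch / base_batch
    return global_batch, lr

gb, lr = scaled_hyperparams(4)   # e.g. 4 GPUs: global batch 1024, lr x4
```

The drawback, visible in the results below, is that the global batch (and learning rate) eventually become so large that training quality degrades, which is why throughput alone is a misleading metric.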

As mentioned previously, while examining the throughput is a very concise, scientific way to determine how fast a program is running, it does not give much indication as to whether the efficiency of the training has been adversely affected by adding more devices. For this reason, plots of the walltime required for the model to reach various test accuracies, for the two frameworks with the highest throughput, are shown in figure 1b. From examining these graphs, it appears that once 4 or more GPUs are used, the benefits of adding more seem somewhat limited. Note that Tensorflow seemed to actively slow down and did not reach 80% accuracy in its 40-epoch run once 12 GPUs were used. This is likely due to the fact that the changes to the training hyperparameters, such as increasing the global batch size, eventually start to prevent the model from training effectively.

It also appears that the curves for Horovod are not quite as smooth as those for PyTorch. A possible explanation for this is that Horovod does not synchronise the non-trainable weights in the model's batch normalisation layers. While these weights should converge to common values eventually, it is plausible that discrepancies between processors would add noise to results from the early stages of the training.

CPU Results

In the case of the two most promising frameworks, similar experiments were run using CPUs instead of GPUs for comparison. To avoid memory errors, the batch size had to be reduced to 16 per CPU. As may be expected, these runs took much longer. However, as can be seen in figure 3, the throughput seems to scale at least as efficiently as in the GPU version. PyTorch seemed to stay within a reasonable margin of error of ideal scaling, up to the 4 nodes (112 CPUs) considered. A likely reason for this is that, as using a single CPU for training would be substantially slower than one GPU, communication takes up a smaller fraction of the total training time in these experiments.

Figure 3: Throughput Speedup from CPU Experiments

However, as before, the efficiency of training seemed to be impacted significantly by the high global batch size which accompanies the use of many processors. As seen in figure 4, the improvement in time-to-accuracy results is negligible when more than 56 CPUs are used. Note also that the y axis scale suggests that the performance in this limiting case is similar to what would be expected from using two GPUs.

Figure 4: Time-to-Accuracy Results from CPU Experiments

Conclusions

While some aspects of the results found are likely to be specific to this particular neural network, it appears that, for problems of this type, the DistributedDataParallel features in the PyTorch library are the most efficient of the frameworks considered for distributing the training process across many devices. Tensorflow with Horovod also scales well and was easy to use, but it appears that some of the shortcuts taken to improve throughput had a negative effect on test accuracy. The other frameworks tried generally did not scale as well as these two. It should be noted that, for both PyTorch and Tensorflow, it was evident that some of the common changes to hyperparameters, such as increasing the global batch size, caused the time-to-accuracy results to stop improving once too many processors were used, at least in this case.

Similar results were found when using CPUs and GPUs. In the case of the former, little improvement was seen in the time-to-accuracy results once more than 56 CPUs were used. However, it was noted that, despite being lower than in the GPU case in general, the speedup in throughput when more CPUs were used was very close to ideal, likely due to the proportion of time spent on communication being lower. From this fact, it may be reasonable to claim that, if the problem is sufficiently large and is not particularly sensitive to changes in batch size and other hyperparameters, it would be possible to achieve near-ideal scaling using the correct framework.

References

1 Géron, Aurélien (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media.

2 Ben-Nun, Tal; Besta, Maciej; Huber, Simon; Nikolaos Ziogas, Alexandros; Peter, Daniel; Hoefler, Torsten (2019). A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning.

3 Goyal, Priya; Dollár, Piotr; Girshick, Ross; Noordhuis, Pieter; Wesolowski, Lukasz; Kyrola, Aapo; Tulloch, Andrew; Jia, Yangqing; He, Kaiming (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.

4 Krizhevsky, Alex (2009). Learning Multiple Layers of Features from Tiny Images.

5 Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778.

PRACE SoHPCProject Title
Performance analysis of Distributed and Scalable Deep Learning

PRACE SoHPCSite
University of Luxembourg, Luxembourg

PRACE SoHPCAuthors
Sean Mahon, Trinity College Dublin, Ireland

PRACE SoHPCMentor
Sebastien Varette, University of Luxembourg, Luxembourg; Valentin Plugaru, University of Luxembourg, Luxembourg

PRACE SoHPCAcknowledgement
I would like to thank my project mentors Sebastien Varette and Valentin Plugaru, as well as the other researchers and students in the UniLu HPC group, for their continued help and support during my time in Luxembourg.

PRACE SoHPCProject ID
1925
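The time-to-accuracy metric used throughout can be computed from a log of (walltime, test accuracy) pairs. A minimal sketch with hypothetical log values:

```python
def time_to_accuracy(log, target):
    """Return the first walltime (s) at which test accuracy reached
    `target`, or None if the run never got there (as happened for
    Tensorflow at 80% with 12 GPUs). `log` is [(walltime, accuracy), ...]
    in chronological order."""
    for walltime, acc in log:
        if acc >= target:
            return walltime
    return None

# Hypothetical per-epoch measurements for one run
log = [(30.0, 0.55), (60.0, 0.71), (90.0, 0.78), (120.0, 0.83)]
t80 = time_to_accuracy(log, 0.80)
```

Unlike raw throughput, this metric directly penalises configurations whose extra devices stop translating into faster convergence.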

www.summerofhpc.prace-ri.eu