HPC in the summer?

summerofhpc.prace-ri.eu, 2019

A long hot summer is time for a break, right? Not necessarily! PRACE Summer of HPC 2019 reports by participants are here.

Leon Kos

There is no such thing as a lazy summer. At least not for the 25 participants and their mentors at 11 PRACE HPC sites.

Summer of HPC is a PRACE programme that offers university students the opportunity to spend two months of the summer at HPC centres across Europe. The students work on projects related to PRACE work, using HPC resources, with the goal of producing a visualisation or a video.

This year, the training week was in Bologna, and it seems to have been the best training week yet! It was a great start to Summer of HPC and set us up for an amazing summer. At the end of the summer, videos were created and are available on YouTube in the PRACE Summer of HPC 2019 presentations playlist. Together with the following articles, interesting code and results are available. Dozens of blog posts were created as well.

At the end of the activity, two projects out of the 25 participants are selected every year and awarded for their outstanding performance. The award ceremony was held in early November 2019 at the PRACE AISBL office in Brussels. The winners of this year are Mahmoud Elbattah for Best Performance and Arnau Miro Janea as HPC Ambassador. Benjamin surpassed expectations while working on an interesting project, carrying out more work than planned, mostly with no assistance; he carried out benchmarking analysis to compare different programming frameworks, and his report was well written in a clear and scientific style. Pablo carried out a great amount of work during his project. More importantly, he has the capacity to clearly present his project to laypersons and the general public. His blog posts were interesting, and his video was professionally created and presented his summer project in a captivating and pleasant way.

Therefore, I invite you to look at the articles and visit the web pages for details, and experience the fun we had this year. What can I say at the end of this wonderful summer? Really, autumn will be wonderful too. Don't forget to smile!

Leon Kos

Contents

1. Memory: Speed Me Up
2. "It worked on my machine... 5 years ago"
3. High Performance Machine Learning
4. Electronic density of Nanotubes
5. In Situ/Web Visualization of CFD Data using OpenFOAM
6. Explaining the faults of HPC systems
7. Parallel Computing Demos on Wee ARCHIE
8. Performance of Python Program on HPC
9. Investigations on GASNet's active messages
10. Studying an oncogenic mutant protein with HPC
11. Lead Optimization using HPC
12. for Matrix Computations
13. Billion bead baby
14. Switching up Neural Networks
15. Fastest sorting algorithm ever
16. CFD: Colorful Fluid Dynamics? No with HPC!
17. Emotion Recognition using DNNs
18. FMM: A GPU's Task
19. High Performance Lattice Field Theory
20. Encrypted disks in HPC
21. A data pipeline for weather data
22. A glimpse into plasma fusion
23. Predicting Electricity Consumption
24. Energy reporting in Slurm Jobs
25. Benchmarking Deep Learning for SoHPC

PRACE SoHPC 2019 Coordinator: Leon Kos, University of Ljubljana
Phone: +386 4771 436  E-mail: [email protected]
More information: http://summerofhpc.prace-ri.eu

Enabling hardware profiling to exploit heterogeneous memory systems

Memory: Speed Me Up

Dimitrios Voulgaris

Memory interaction is one of the most performance-limiting factors in modern systems. It would be of interest to determine how specific applications are affected, and also to explore whether combining diverse memory architectures can alleviate the problem.

Supercomputers are gradually establishing their position in almost every scientific field. Huge amounts of data require storage and processing, and therefore impel the exascale era to arrive even sooner than anticipated. This immense shift cannot be achieved without seeing the big picture. Processing power is usually considered the determinant factor when it comes to computer performance, and indeed, so far, all efforts in that direction did seem fruitful. CPU performance, however, has been extensively researched and thus optimised, leaving scarcely any opportunities for improvement.

INTRODUCTION

Heterogeneous memory (HM) systems accommodate memories featuring different characteristics such as capacity, bandwidth, latency, energy consumption or volatility. While HM systems present opportunities in different fields, the efficient usage of such systems requires prior application knowledge, because developers need to determine which data objects to place in which of the available memory tiers [1]. Given the limited nature of "fast" memory, it becomes clear that the approach of placing the most often accessed data objects in the fastest memory can be quite misleading. In order to identify the data which benefit the most from being hosted in different memory subsystems, it is important to gain insight into the application behaviour.

Application profiling comes in two flavours, software and hardware based. Tools like EVOP [2], which is an extension of Valgrind [3], VTune [4], as well as others, offer exclusively instruction-based instrumentation, i.e. monitoring and intercepting of every executed instruction. The additional time overhead implied by this technique is more than apparent; therefore, vendors have come up with the idea of enhancing hardware, making it capable of aiding in application profiling. Specialized embedded hardware counters deploy sampling mechanisms to provide rich insight into hardware events. PEBS [5] is the implementation of such a mechanism in the majority of modern processors.

These two approaches essentially have the same ultimate scope of highlighting the application behaviour regarding its memory interaction, in order to arrive at an optimal performance-wise memory object distribution. Nevertheless, the former approach accounts for each and every memory access, while the second one performs a sampling of the triggered events. It would be of great interest to find out whether, by forcing software to imitate the hardware's methodology, i.e. sampling, we are capable of achieving similar final results. This was the original target of the project.

BACKGROUND

Memory architecture describes the methods used to implement computer data storage in the best possible manner. "Best" describes a combination of the fastest, most reliable, most durable, and least expensive way to store and retrieve information [6]. Traditionally, and from a high point of view, a memory system is a mere cache and a RAM memory.

Caches are very specialised and hence complicated pieces of hardware, placed in a hierarchical way very close to the processing unit. Divided into layers (usually 3 in modern machines), every cache element presents different characteristics regarding size and data retrieval latency.

RAM, being an equally complex piece of hardware, presents significantly bigger capacity, which qualifies it as the ideal main storage unit of the system. On the downside, it is characterised by orders of magnitude longer access times, which makes it quite inefficient, yet necessary to access.
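The latency gap between these layers can be made concrete with the classic average memory access time (AMAT) model. The sketch below is illustrative only: the latencies and hit rates are assumed round numbers, not values measured in this project.

```python
# Toy average-memory-access-time (AMAT) model for a hierarchy of
# caches backed by RAM. All latencies (in cycles) and hit rates are
# illustrative assumptions, not measurements from this project.

def amat(levels, ram_latency):
    """levels: list of (hit_latency, hit_rate) tuples, ordered L1..LL."""
    total, reach = 0.0, 1.0
    for latency, hit_rate in levels:
        total += reach * latency        # every access reaching this level pays its latency
        reach *= 1.0 - hit_rate         # fraction missing here falls through to the next level
    return total + reach * ram_latency  # last-level misses pay the RAM latency

hierarchy = [(4, 0.90), (12, 0.70), (40, 0.50)]  # L1, L2, L3
print(round(amat(hierarchy, ram_latency=200), 3))  # -> 9.4
```

Even with only 1.5% of accesses reaching RAM in this toy setup, those misses contribute roughly a third of the average cost, which is why last-level cache misses are singled out as the main cost to minimise.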

Figure 1: Memory hierarchy along with typical latency times.
Figure 2: Valgrind logo.
Figure 3: Pseudo-code for enabling Valgrind instrumentation and detail collection.

Both storage elements interact with each other in the following way: when a memory access instruction is issued, cache memory is the first subsystem to be accessed. If the respective instruction can be completed in cache (that is, the referred value resides there), then we have a "cache hit"; otherwise a "cache miss" will be signalled and the effort of completing the instruction shall move on to the next cache level. In case of a "last level cache miss" the main memory has to be accessed in order to fulfil the need for data. These exact accesses pose the greatest performance limitation, especially when it comes to memory-bound applications. Keeping that in mind, we considered these accesses as the key factor for optimising our workflow.

METHODOLOGY

In order to monitor memory interactions we opted for Valgrind, as it offers the additional possibility of simulating cache behaviour. That is, when we run our application under Valgrind, all data accesses can be monitored and collected, giving the chance of determining all those accesses that missed cache and had to deploy memory subsystems deeper in the hierarchy.

Considering every accessed datum in isolation would be neither intuitive nor helpful for further post-processing. This is where EVOP intervenes, in order to group data into memory objects. An object can be an array of integers, a struct, or any other structure whose semantics suggest that it has to be considered as an entity. Such objects, regardless of whether they are statically or dynamically allocated, are labeled, and every time an access is issued, it is tied to the related object.

Simulating, intercepting and keeping logistics of a few billion data (memory references) can be highly computationally intense. Indeed, instrumenting an application will make it from four to twenty times slower. In order to alleviate this, we decided to restrict the instrumentation only to the regions of code that are of interest. Valgrind provides an adequate interface that eases the aforementioned scope: in Figure 3 you can see the macros that, by framing the relevant lines of code, enable or disable the instrumentation and information collection.

Setting the aforementioned comparison of the software and hardware approaches as our ultimate scope, we enhanced EVOP with the option of performing sampled data collection. In detail, we extended the Valgrind source code by integrating a global counter which increments on every memory access. While the counter's value is below a predetermined threshold, no access shall be accounted for. On the contrary, the memory access that forces counter and threshold to be equal will be monitored and therefore accumulated to the final access number. The threshold is plainly defined when EVOP is called, using the flag --sampling-period=.

What is more, we familiarised ourselves with the internal Valgrind representation of memory accesses in order to distinguish between "load" and "store" ones. Similarly as before, we added a counter that increments every time a load or store is detected. The sole accesses that are to be monitored are the ones that equalise the counter to the predetermined threshold. This time, the user has to define a combination of flags such as -sample-loads=no|yes -sample-stores=no|yes -sampling-period= in order to specify the type of accesses to be sampled as well as the respective sampling period.

Having enabled these two features, our methodology can be summarised in the following procedure: we initially performed a simulation of our benchmarks without sampling, in order to get the total number of memory objects. The latter were processed by a BSC-developed tool in order to be optimally distributed to the available memory subsystems. The distribution takes into consideration the last-level cache misses of each object, as well as the number of loads that refer to each one of them, and sets as its goal the minimisation of the total CPU stall cycles. The total "saved" cycles are calculated along with the object distribution, and thus the final speedup can be determined. This speedup is the maximum achievable given the application and the memory subsystem mosaic.

What follows is a trial-and-error experimentation process of obtaining results using various sampling periods to extract the referenced objects. It can be intuitively assumed that the longer the sampling period is, the fewer the total recorded memory accesses will be. Given that each access is related to one memory object, the fewer the accesses are, the higher the chance that not all existing memory objects are discovered. The latter presumably results in a worse distribution; worse in the sense that the final speedup is lower than the initial, achievable one. Nevertheless, depending on the access pattern of each benchmark, there is a specific sampling period, or a restricted range of neighbouring sampling periods, that identifies all the objects responsible for the initial speedup.

Note that potential access patterns may exist among code loops. A sampling period that always intercepts the same memory access (in a different loop iteration) is considered faulty and shall result in a biased outcome. In order to avoid that, we used exclusively prime numbers as sampling periods.

TEST CASES – RESULTS

Since memory behaviour is of the essence for this project, it is important to choose wisely the applications under test. MiniMD and HPCCG are a pair of

mini-applications intended as benchmarks for assessing the performance of certain production applications [7]. Hence, they are representative of the behaviour of more complex applications.

MiniMD, a molecular dynamics simulator, has a rather sequential access pattern over the objects that it refers to. As a result, the sampling period cannot be of the same order of magnitude as the total access number (discovered by the non-sampled execution). In fact, in order to obtain an optimal speedup, the sampling period has to be relatively small. On the contrary, HPCCG, which offers an iterative computational implementation of the conjugate gradient method on an arbitrary number of processors, entails its main computation in a "for loop". Thus, the same memory objects are referenced every time the loop iterates, enabling the use of greater sampling periods. Indeed, we were able to use intervals of the same order of magnitude as the overall access number of some objects and still get an acceptable speedup.

In the first two figures we present a qualitative representation of the memory access behaviour of each application, as it was understood by us. Sparse sampling, in the first case, is not enough to account for every "important" object, while in the second case, due to the different access pattern, this is allowed. In the following two graphs we picture the exact correlation between the final speedup and the sampling period used to get the objects. While in the first case there is a strict threshold after which the final results are strongly damaged, in the second case this threshold can be generalised to a larger neighbourhood of periods.

Figure 4: Sampling at a specific low rate results in omitting memory objects; a higher sampling frequency is needed.
Figure 5: In rather iterative access behaviour, low-frequency sampling is enough for complete memory object discovery.
Figure 6: Only low periods are allowed in order to achieve an optimal speedup.
Figure 7: Bigger sampling periods can result in equally high speedup as memory objects keep being discovered.

FUTURE RESEARCH

When deploying the hardware profiling approach aiming at performance optimisation, we are restricted by the imposed sampling that has to be performed. So far we have obtained some concrete results regarding the effect that different sampling rates have on the discovered memory objects. Given that the latter are responsible for the final performance speedup, as well as that hardware profiling mechanisms rely solely on sampling, we can safely assume that the same, or at least similar, results should be obtained when profiling using the hardware counters. We have provided the chance to future researchers (and why not future SoHPC participants) to experiment using the sampling periods from our trials, in order to determine the most efficient way to gain insight into an application's memory behaviour and subsequently boost its performance by exclusively focusing on the optimal distribution of its memory objects.

REFERENCES

[1] H. Servat, A. J. Peña, G. Llort, E. Mercadal, H. C. Hoppe, J. Labarta, "Automating the Application Data Placement in Hybrid Memory Systems", 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA, September 5-8, 2017.
[2] A. J. Peña, P. Balaji, "A Framework for Tracking Memory Accesses in Scientific Applications", 2014 43rd International Conference on Parallel Processing Workshops, Minneapolis, MN, USA, September 9-12, 2014.
[3] "Valgrind", last version: valgrind-3.15.0. Available: http://www.valgrind.org/
[4] "Intel VTune Amplifier". Available: https://software.intel.com/en-us/vtune
[5] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual, September 2016, vol. 3B: System Programming Guide, Part 2, ch. 18.4.4.
[6] https://en.wikipedia.org/wiki/Memory_architecture
[7] M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, H. K. Thornquist, and R. W. Numrich, "Improving performance via mini-applications," Sandia National Laboratories, Tech. Rep., 2009.

PRACE SoHPC Project Title: Analysing effects of profiling in heterogeneous memory data placement
PRACE SoHPC Site: Barcelona Supercomputing Center (BSC), Spain (ES)
PRACE SoHPC Author: Dimitrios Voulgaris [University of Thessaly], Greece (GR)
PRACE SoHPC Mentor: Marc Jorda [BSC], Spain (ES)
PRACE SoHPC Contact: Dimitrios Voulgaris, University of Thessaly. E-mail: [email protected]
PRACE SoHPC Acknowledgement: Special thanks to my BSC coordinators Antonio J. Peña and Marc Jorda, as they proved an unlimited source of help. I am also grateful to the SoHPC coordinator, Mr. Leon Kos, for his constant presence and guidance when I needed him. Finally, my deepest gratitude to my colleague and friend Perry Gibson for his humorous comments and advice that made this experience unique.
PRACE SoHPC Project ID: 1901
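The counter-and-threshold sampling scheme described in the Methodology section can be sketched outside Valgrind as follows. This is a Python illustration of the idea only (the actual mechanism was implemented inside EVOP's C sources), and the access stream below is hypothetical.

```python
# Illustration of EVOP-style sampling: a counter increments on every
# memory access, and only the access that makes the counter hit the
# sampling period is recorded. Python stand-in for the C implementation.

def sample_accesses(accesses, sampling_period):
    """Keep every sampling_period-th access from the stream."""
    kept, counter = [], 0
    for access in accesses:
        counter += 1
        if counter == sampling_period:  # threshold reached: record this access
            kept.append(access)
            counter = 0                 # restart counting
    return kept

# Hypothetical stream: a loop touching the same 4 objects over and over.
stream = [f"obj{i % 4}" for i in range(20)]

print(sample_accesses(stream, 4))  # every sample lands on the same object
print(sample_accesses(stream, 7))  # a prime period rotates across objects
```

With a period of 4 on this 4-object loop, every sample intercepts the same object and the other three are never discovered; this is exactly the biased outcome the article avoids by using prime sampling periods such as 7.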

Exploring Reproducible Workflows in Automated Heterogeneous Memory Data Distribution Literature Results

"It worked on my machine... 5 years ago"

Perry Gibson

Leveraging emerging heterogeneous memory subsystems for memory bound HPC applications requires an understanding of the memory access patterns of the target workload. We revisit prior work from 2014, which simulated cache behaviour of HPC test workloads, and produced near-optimal data placement strategies from high-fidelity runtime data. We find that the results of the original simulations cannot be precisely reproduced. We document our approach of attempting to reproduce the results, and put forward concrete solutions to improve the reproducibility of future HPC research endeavours.

HPC applications are the de facto expensive workload. At scale, even sub-percentage acceleration can represent massive cost savings. Thus, opportunities across the stack must be exploited as much as possible. Often the biggest bottleneck in these applications is data access times. If data is stored close to the processor when needed, such as at the lowest level of cache, then we incur very little overhead. However, this isn't feasible in all cases, given typical sizes of L1 caches are on the order of a dozen or so KiB, and higher levels of cache are on the order of a few dozen MiB. Since even modest HPC applications use GiBs of data, or orders of magnitude more, we are forced to store it in main memory, or disk.

A promising mitigation for this problem is moving to a heterogeneous memory model. Instead of the classic hierarchical memory model of L1 cache → L2 cache → ... → LL cache → Main Memory → Disk, we have a choice to place individual data in a number of memory subsystems with different properties. To prepare current and future HPC applications to exploit the potential benefits of this change, it is necessary to devise novel approaches for placing data objects in an appropriate memory subsystem, giving good automatic placement with minimal developer involvement, while providing fine-grained control if needed.

To perform this profiling, in 2015 researchers at BSC heavily adapted the Valgrind instrumentation framework [1], so that they could measure the access patterns for individual data objects in a running program. The fork, developed for a suite of papers from the BSC, is described in more detail in [2]. The main contribution was adding the ability to trace loads and stores to individual variables, and in particular dynamic objects which are created by a number of lines of code and are thus less easy to detect directly (for example, arrays of pointers).

The results of using this system, named EVOP (Extended Valgrind for Object-differentiated Profiling), on two HPC mini-applications [4]* were presented in [5]. The paper demonstrated the scope for acceleration, which is still true today. However, the exact experimental results of the access patterns for the candidate applications and the resulting data placement proposal differ.

* Interested readers can learn more about mini-applications in the SoHPC podcast episode featuring Dr. Mike Heroux.

Methods

We began our work by getting access to the original EVOP codebase and the candidate workloads. Initially, we used the latest version of the code, committed two years after the original paper. We sought to first understand its functionality and configuration, then began to run full experiments. With the latest version of the code, we found that for our first workload, miniMD, the number of executed instructions was 17.9% less than the original results, and the last-level cache miss rate was 64.1% higher. Care was taken to ensure that the cache configuration was identical to what was described in the original paper.

Our next step was to eliminate the possibility that some version change in EVOP had changed its instrumentation behaviour. Using version-control commit timestamps, we examined the differences between candidate code versions that may have been used for the original results. However, none of these versions could be successfully compiled. For commits of interest, source code files used inconsistent API calls, and we hypothesise that during development not all of the working files were tracked correctly. We patched a number of these code versions, and found identical results. This was promising for the consistency of the tool across versions, but did not explain the difference with the original paper.

We hypothesised that the compiler version used to produce the workload binaries, coupled with versions of the standard library, could be an influence. Thus, we used a containerisation workflow (via the HPC-first Singularity system [6]) to systematically compare candidate environments. Many of these older tool versions no longer existed in package repositories or archives, and required manual compilation. The effect of the compiler version on the number of instructions and data access patterns was found to be minimal. Variation of the instruction set extensions used by the compiler (such as dis/enabling vector instructions) changed the number of instructions, but not the data access patterns.

Consulting with the original authors, we reconstructed the original OS environment, using an archived distribution. This tested for an unknown factor of influence. However, again the results remained identical.

With a reproduction of the original work appearing unlikely, we still consider the results we collected to be sensible. Thus, a useful extension to the work would be to compare the results of a simulated run against the results on actual hardware. We followed the approach of [3], which used the Extrae framework developed at BSC. The use of Intel PEBS sampling would allow a similar type of object tracing information to be produced. The advantage over simulation is reduced overhead, at the cost of lower fidelity data. Collaborating with a colleague investigating the use of sampling in EVOP would enable a fair comparison. We had access to a configuration file believed to have been used to gather the results in [3]. However, in the time remaining we were not able to collect the instrumented data of interest, namely object addresses and their access patterns.

Results and Conclusion

We have ruled out with some degree of certainty the influence of compiler and standard library versions. Because of the inconsistent version-control history, we cannot rule out that a different version of the codebase was used to perform instrumentation. However, our patched versions of EVOP generated the same results.

As one would expect, the instruction set architecture used has an influence on the number of instructions that a given program execution takes. However, it does not change the simulated cache behaviour. We followed the description of the cache configuration and input parameters for our test workloads. However, the exhibited behaviour was different. The simulator's behaviour should be agnostic to the underlying hardware it runs on, and we found this was true for small scale experiments.

In conclusion, the replication of results after years of tools lying dormant is difficult. The configurations and sequence of steps used to produce results are easily lost, and can be non-trivial to recover at a later date. These details are generally not included in publications, as they are superfluous to the conceptual contribution of a work. However, it is clear that once lost to time they cannot easily be recovered, and for some systems they are necessary for a work's conclusions. This was not the case for EVOP, as our results using modern toolchains still trace objects and demonstrate theoretical speedups for candidate object distributions. They differ numerically from the original results.

Ideally, containerised workflows should be published, but they can also be kept in internal archives if aspects of their functioning are sensitive. The pitfall of relying on external package repositories for the creation of containerised workflows should be heeded by researchers whose results are possibly influenced by such factors. Including the packages in the project archives should be considered a preferred practice.

Our recommendation is that during the development stages of research tools, continuous integration systems are used to ensure that the versioned project state is as expected. We have created a continuous integration setup to be used in future versions of EVOP. When the results for publications are being produced, containerisation should be used to enshrine the configuration and tool versions used to generate the needed computations.

References

[1] Nethercote, Nicholas and Seward, Julian. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation.
[2] Peña, Antonio J. and Balaji, Pavan. A Framework for Tracking Memory Accesses in Scientific Applications.
[3] Servat, H. and Peña, A. J. and Llort, G. and Mercadal, E. and Hoppe, H. and Labarta, J. (2017). Automating the Application Data Placement in Hybrid Memory Systems.
[4] Crozier, Paul Stewart and Thornquist, Heidi K. and Numrich, Robert W. and Williams, Alan B. and Edwards, Harold Carter and Keiter, Eric Richard and Rajan, Mahesh and Willenbring, James M. and Doerfler, Douglas W. and Heroux, Michael Allen. Improving Performance via Mini-Applications.
[5] Peña, A. J. and Balaji, P. (2014). Toward the Efficient Use of Multiple Explicitly Managed Memory Subsystems.
[6] Kurtzer, Gregory M. and Sochat, Vanessa and Bauer, Michael W. Singularity: Scientific Containers for Mobility of Compute.

PRACE SoHPC Project Title: Reproducing Automated Heterogeneous Memory Data Distribution Literature Results and Beyond
PRACE SoHPC Site: BSC, Catalonia
PRACE SoHPC Author: Perry Gibson
PRACE SoHPC Mentor: Peña, A. J.
PRACE SoHPC Acknowledgement: Many thanks for the support and guidance from my mentor, the researchers and staff of the BSC, the organisers of SoHPC at PRACE for getting things working, and my cohort for bringing interesting stories to the weekly meetings. Special thanks to Dimitris Voulgaris for working closely with me throughout the project.
PRACE SoHPC Project ID: 1902
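The recommendation in this article, recording exactly which toolchain produced a set of results, can be sketched in a few lines. This is a hypothetical helper, not part of EVOP, and the tool list is an assumption to adapt to your own stack.

```python
# Sketch: snapshot the toolchain next to experimental results, so a
# future reproduction attempt knows what produced them. Hypothetical
# helper (not part of EVOP); extend the tool list for your own stack.
import json
import platform
import subprocess

def tool_version(cmd):
    """First line of a tool's --version output, or a marker if absent."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return out.stdout.splitlines()[0]
    except (OSError, subprocess.CalledProcessError, IndexError):
        return "unavailable"

fingerprint = {
    "python": platform.python_version(),
    "machine": platform.machine(),
    "gcc": tool_version(["gcc", "--version"]),
    "valgrind": tool_version(["valgrind", "--version"]),
}
print(json.dumps(fingerprint, indent=2))  # store this beside the results
```

Archiving such a fingerprint (ideally alongside a container image) costs almost nothing at experiment time, and is exactly the kind of detail that the reproduction attempt above found impossible to recover years later.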

What can we gain if we use HPC tools for Machine Learning instead of the JVM-based technologies?

High Performance Machine Learning

Thizirie Ould Amer

Machine Learning (ML) algorithms are designed to process and learn from huge amounts of data and their correlations to adapt and adjust the results. This big quantity of data needs to be analyzed and classified according to various criteria, which means that our ML algorithms have to evaluate these criteria and select the best ones. As you can imagine, this requires a lot of time and energy to be done. So how can we increase the efficiency of ML algorithm runs?

HPC is the solution for many kinds of problems! Combined with other features, HPC tools can have a big impact on the performance of our applications (programs). Big data processing is currently dominated by JVM-based technologies such as Hadoop MapReduce or Apache Spark. Their popular tools are indeed fairly efficient and, in particular, they are easy to use for developers. Our goal is to evaluate traditional HPC tools such as MPI, OpenMP or GASPI for ML / big data processing, and to see for ourselves if they are at least as good as the aforementioned tools, or maybe even better!

In order to successfully carry out this project within a limited time frame, we chose one popular ML algorithm: Gradient Boosting Decision Trees. This algorithm is very popular in Kaggle competitions. Indeed, many of the biggest winners of these contests include Gradient Boosting algorithms in their winning model ensembles, amongst a variety of other models. The most commonly used library is XGBoost, and it aims to provide a Scalable, Portable and Distributed Gradient Boosting (GBM, GBRT, GBDT) Library [1]. The main focus of this project is therefore GBDT, but the other methods are as popular and successful as this one.

Decision Trees

The idea of the Decision Tree algorithm is to find a set of criteria that will help us "predict" (classify or inter/extrapolate) the variables we pick. In the example in Table 1, let's see if we can predict whether an algorithm can be parallelized or not.

Table 1: Example 1

IndepParts | LoopsDep | Parallel?
-----------+----------+----------
    1      |    0     |    1
    0      |    1     |    0
    0      |    0     |    0
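The impurity-driven feature choice on Table 1 can be sketched directly. Gini impurity is used here for illustration only (the article does not say which impurity measure the project implemented):

```python
# Sketch: picking the split feature for Table 1 by impurity.
# Gini impurity is assumed here for illustration; rows are
# (IndepParts, LoopsDep, Parallel?).
rows = [(1, 0, 1), (0, 1, 0), (0, 0, 0)]

def gini(labels):
    """0.0 for a pure group, 0.5 for a 50/50 binary mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(rows, feature):
    """Size-weighted impurity of the two groups made by splitting on a feature."""
    left = [r[-1] for r in rows if r[feature] == 0]
    right = [r[-1] for r in rows if r[feature] == 1]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(split_impurity(rows, 0))  # IndepParts -> 0.0, a pure split
print(split_impurity(rows, 1))  # LoopsDep  -> 0.333..., some misclassification
```

Splitting on IndepParts yields zero impurity, so the tree picks it first; in this toy table that single test already predicts Parallel? perfectly.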

Since the goal is to predict the last variable (Parallel?), our decision tree will have to consider each feature (column) and see which one helps us the most to distinguish whether a code can be parallelized or not. Once such a feature is identified, we split the codes according to that variable. Selecting any feature can naturally lead to misclassification of elements; otherwise the feature on its own would correlate perfectly with the result. This misclassification is also referred to as the impurity: an error-rate parameter ranging from 0 to 1, with 0 for a pure group (no misclassification) and 1 for a group where every element is misclassified. In each step of the construction of the decision tree, we pick the variable whose split has the lowest impurity. Decision trees on their own tend to suffer from overfitting, so we must use additional tricks to make them a good ML method.

Gradient Boosted Decision Trees

Gradient boosting in general is an approach of creating a predictive model in the form of an ensemble of weak prediction models. In Gradient Boosting Decision Trees, the weak prediction model is, unsurprisingly, the Decision Tree. By fitting a decision tree to the pseudo-residuals (i.e. errors) of another tree, we can significantly increase the accuracy of the data representation. Typically, limited-size decision trees (of depth 4 to 8) are iterated multiple (typically less than 100) times, creating (at most) 100 decision trees that have different decision criteria, each iteration "fixing" the errors that the previous tree has made. The algorithm doesn't give more weight to any of the trees, but tree number i focuses on the individuals that didn't fit into their leaves/groups in tree number i-1 (i.e. it focuses on the individuals that have the most important pseudo-residual values).

How to parallelize this?

When it comes to the concept of a parallel version of a program, many aspects have to be taken into account, such as the dependency between various parts of the code, balancing the workload on the workers, etc. This means that we cannot parallelize the gradient boosted algorithm's loop directly, since each iteration requires the result obtained in the previous one. However, it is possible to create a parallel version of the decision-tree building code. The algorithm for building Decision Trees can be parallelized in many ways. Each of the approaches can potentially be the most efficient one depending on the size of the data, the number of features, the number of parallel workers, the data distribution, etc. Let us discuss three viable ways of parallelizing the algorithm, and explain their advantages and drawbacks.

Sharing the tree nodes creation among the parallel processes at each level

The way we implemented the decision tree allows us to split the effort among the parallel tasks. This means that we would end up with the task distribution schematically depicted in Figure 1.

Figure 1: Parallelizing the nodes building at each level. Levels are in different colours.

Yet, we have a problem with this approach. We can imagine a case where we would have 50 "individuals" going to the left node, and 380 going to the right one. We will then expect that one processor will process the data of 50 individuals and the other one will process the data of 380. This is not a balanced distribution of the work, with some processors doing nothing while others may be drowning in work. Furthermore, the number of tree nodes that can be processed in parallel limits the maximum number of utilizable parallel processes... So we thought about another way.

Sharing the best split research in each node

In our implementation of the decision tree algorithm, there is a function that finds the value that splits the data into groups with minimum impurity. It iterates (for a fixed column, i.e. variable) through all the different values and calculates the impurity for the split. The output is the value that reaches the minimum impurity.

Figure 2: Sharing the best split research in each node, for each feature.

As can be seen in Figure 2, this is the part of the code that can be parallelized. So every time a node has to find a split of individuals into (two) groups, many processors will compute the best local splitting value, and we keep the minimum value from the parallel tasks. Then, the same calculations are repeated for the Right data on one side, and for the Left data on the other. In this case, a parallel process will do its job for a node and, when done, it can directly move to another task on another node. So, we got rid of the unbalanced workload, since (almost) all processes will constantly be given tasks to do.

Nevertheless, this also has a notable drawback: the cost in communication is not always worth the effort. Imagine the last tree level, where we would have only a few individuals in each node. The communication goes both ways (the data has to be given to each process, and the output received at the end), and it will eventually slow down the global execution more than the parallelization of the workload speeds it up.

Parallelize the best split research on each level by features

The existing literature2 helped us to merge some features of the two aforementioned approaches to find one that reduces or eliminates their drawbacks. This time, the idea is to parallelize the function that finds the best split, but for each tree level. Each parallel process calculates the impurity for a particular variable across all nodes within the level of the tree. This method is expected to work better because:

1. The workload is now balanced, because each parallel process evaluates the same amount of data for "its feature". As long as the number of features is the same as (or greater, ideally much larger than) the number of parallel tasks, none of the processes will idle.

2. The impact of the problem we described in the second concept is reduced, since we are working on whole levels rather than on a single node. Each parallel task is loaded with much larger (and equal) amounts of data, thus the communication overhead is less significant.
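The per-feature distribution can be sketched as follows. This is a simulation of the idea, not the project's GASPI/MPI code: each "rank" owns a round-robin subset of the features, computes its local best (impurity, feature) pair, and a minimum-reduction (what an MPI code would do with MPI_Allreduce and MPI_MINLOC) picks the global winner. The impurity values below are made up for illustration.

```python
def best_split_by_features(impurity_per_feature, n_ranks):
    """Simulate feature-parallel best-split search across n_ranks workers."""
    local_best = []
    for rank in range(n_ranks):
        # each rank owns every n_ranks-th feature (round-robin distribution)
        owned = list(impurity_per_feature.items())[rank::n_ranks]
        if owned:
            local_best.append(min((imp, f) for f, imp in owned))
    # the "reduction" step: the globally minimal impurity wins
    return min(local_best)

impurities = {"f0": 0.4, "f1": 0.1, "f2": 0.3, "f3": 0.2}
print(best_split_by_features(impurities, n_ranks=2))   # (0.1, 'f1')
```

Because every rank handles the same number of features over the whole level, the workload stays balanced, which is exactly point 1 above.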

The global concept is shown in Figure 3.

Figure 3: Parallelizing the best split research on each level by features

How can this be improved?

There are several ways to improve the performance of our parallel algorithm. Let us focus on the communication, particularly its timing. While we cannot completely avoid losing some wall-clock time in sending and receiving data, i.e. communication among the parallel processes, we can rearrange the algorithm (and use proper libraries) to facilitate overlap of computation and communication. As shown in Figure 4, in the first case (a) the processor waits until the communication is done before launching the calculations, whereas in the second case (b) it starts computing before the end of the communication. The last case, displayed in Figure 5, shows different ways to achieve total computation-communication overlap.

Figure 4: Timing of communication and computation steps: (a) no overlap, (b) partial overlap
Figure 5: Timing of communication and computation steps: (c) complete overlap

One of the libraries that allows asynchronous communication patterns, and thus overlap of computation and communication, is GPI-2 (http://www.gpi-site.com), an open source reference implementation of the GASPI (Global Address Space Programming Interface, http://www.gaspi.de) standard, providing an API for C and C++. It provides non-blocking one-sided and collective operations with a strong focus on fault tolerance. Successful implementations of parallel matrix multiplication, K-means and TeraSort algorithms are described in Pitonak et al. (2019), "Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms".3

Conclusion

We have discussed several possible parallel versions of this algorithm and their efficiency. Unfortunately we did not have enough time to finalize the implementations and compare them with the JVM-based technologies, or with XGBoost's performance. Future work could focus on improving and continuing the implementations and comparing them to the usual tools. Including OpenMP in the parallelization could also be a very interesting approach and lead to better performance. Another aspect worth considering is using the fault-tolerance functions of GPI-2, which would ensure better reliability for the application.

PRACE SoHPC Project Title: High Performance Machine Learning
PRACE SoHPC Site: Computing Center of SAS, Slovakia
PRACE SoHPC Authors: Thizirie Ould Amer, Sorbonne University, France
PRACE SoHPC Mentor: Michal Pitoňák, CCSAS, Slovakia
PRACE SoHPC Contact: Thizirie Ould Amer, Sorbonne University; LinkedIn: in/Thizirie; E-mail: [email protected]
PRACE SoHPC More Information: Author's blog; About PRACE Summer of HPC
PRACE SoHPC Acknowledgement: First things first, I would like to thank my mentors Michal Pitoňák and Marián Gall for their great help. Many thanks to my site coordinator, Lukáš Demovič, and the whole CCSAS team who made my time here extraordinarily amazing. I would also like to thank my HPC and Data Analysis teachers, Pierre Fortin and Fanny Villers, for their support.
PRACE SoHPC Project ID: 1903

References
1 XGBoost GitHub repository: https://github.com/dmlc/xgboost
2 Parallel Gradient Boosting Decision Trees: http://zhanpengfang.github.io/418home.html
3 Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms: http://doi.org/10.5281/zenodo.2807938
4 Gradient Boosting decision trees: https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

Where can you find the electrons of a nanotube? Electron density computation for each orbital.

Electron density of Nanotubes

Irén Simkó

We exploit the helical symmetry of nanotubes in the electronic structure computation. The code was extended with electron density computation for each orbital. I used Message Passing Interface for parallelization and visualized the electron density isosurfaces.

Have you heard about (carbon) nanotubes? Probably you already have, but you will hear a lot more about them in the future. The extraordinary mechanical and electrical properties of nanotubes have a lot of promising applications in engineering and materials science. For example, composite materials with carbon nanotubes have great mechanical strength. Nanotubes are candidates for nano-electronics as well, because they can behave like metals, semiconductors or insulators.

The topic of my project was simulating nanotubes with the SOLID98 quantum chemical program.1 The program, written in FORTRAN77, computes the electronic structure of the nanotubes. According to the principles of quantum mechanics, the states of the electrons have well-defined discrete energies and wave functions (orbitals), which are computed by the program. The parallelization of the code was the task in previous years of Summer of HPC, and my goal was to extend the code with a new feature: electron density computation and visualization. This shows us the distribution of electrons in space, so we can see the chemical bonds and the nodes of the orbitals.

Use symmetry when you can!

The computational cost of a quantum chemical simulation strongly increases with the size of the system. But how can we still compute large systems such as nanotubes or crystals? One solution is periodicity and symmetry, which means that we can build the whole large system by replicating a small building block (unit cell) following symmetry rules. Since the replicas are not independent, the size of the unit cell determines the cost of the computation.

Nanotubes have helical symmetry,2 which allows us to use very small unit cells. For carbon nanotubes it is enough to have only two atoms! Then, imagine a spiral winding on the surface of a cylinder, and replicate the unit cell following the path of the spiral. This way you can build the whole carbon nanotube. Using helical symmetry is a much better approach than considering only the translational symmetry along the axis of the nanotube.

Figure 1: Helical symmetry of the nanotube

The electronic structure computation

To get the electronic orbitals and energies we have to solve the Schrödinger equation. But do not grab a pen and paper, because it has an analytical solution only for simple systems; we have to do it numerically in other cases. The wave function is a linear combination

of basis functions in the simulation. The Schrödinger equation is translated to an eigenvalue equation of the so-called Fock matrix, which is solved iteratively. The diagonalization part of the program is parallelized with the Message Passing Interface (MPI). Thanks to some mathematical tricks the Fock matrix is block diagonal, each block labelled with a k value and diagonalized by a different process. As a result, we get orbitals and energies for each k block.

Electron density: The serial code

The electron density is the probability of finding the electron at a given point in space. We would like to visualize the individual electron orbitals, so we restrict the computation to one orbital at a time. The electron density computation is the last step of the program; we use the results of the last diagonalization. From the eigenvectors, we build the so-called density matrix (P), and contract it with the basis functions (χ). "Contraction" is a double sum over all basis functions of the system:

ρ(r) = Σ_{a,b} P_ab χ_a(r) χ_b(r)

First, we compute the contribution of the central unit cell to the electron density: in the sum we take one basis function from the central unit cell and the other one from the whole nanotube. Then we get the total electron density in the same manner as we built the nanotube from the unit cell, by replicating the contribution of the central unit cell. In Figure 2 you can see an example for the benzene molecule.

Parallelization using MPI

The next step was to parallelize the electron density computation. I decided to use the same technique as in the diagonalization of the Fock matrix. We use the Master-Slave MPI scheme, where the Master process (rank = 0) distributes the work among the Slave processes (rank = 1, 2, ..., N-1). First the Master process does some initialisation and broadcasts the data to the Slaves, then each process does the work assigned to it.

The electron density computation is done in two nested loops: one loop over the k values and one loop over the orbitals of a given k. In the first version of the code, each process computes the orbitals belonging to a different k value. We use a counter variable to distribute the work. The counter is set to 0 at the beginning of the loop. A process takes the upcoming k value if its rank is equal to the current counter value; when a process accepts a value, the counter is increased by 1, or reset to 0 if it was N-1. This way, process 0 does k = 1, process 1 does k = 2, process N-1 does k = N, then process 0 does k = N+1, and so on.

Figure 3: MPI parallelization

The problem with this implementation is that you cannot use more processors than the number of k values. So in the second version I moved the parallelization to the inner loop over the orbitals. If we have M orbitals in each block, the first M processes start working on the k = 1 orbitals, and the next process gets the first orbital of k = 2. This is a much better solution, because we can utilize more processors than in the first case. On the other hand, the diagonalization of the Fock matrix is still parallelized over the k loop, so we have idle processors if we have more processes than the number of k values. The relative time of the diagonalization and the electron density computation determines whether it is worth using the k-loop or the orbital-loop parallelization.
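Both ingredients above can be sketched compactly. This is an illustration in Python/NumPy, not the SOLID98 Fortran code: a round-robin owner function matching the counter scheme (task 0 goes to rank 0, task 1 to rank 1, wrapping around after N-1), and the density contraction ρ(r) = Σ_ab P_ab χ_a(r) χ_b(r) evaluated on a grid. The tiny matrices are made-up example values.

```python
import numpy as np

def owner_of_task(task_index, n_procs):
    """Counter scheme: task i is handled by rank i mod N (0-based tasks)."""
    return task_index % n_procs

def electron_density(P, chi):
    """P: (nbasis, nbasis) density matrix; chi: (nbasis, npoints) basis values.

    Computes rho_g = sum_ab P[a, b] * chi[a, g] * chi[b, g] for every grid point g.
    """
    return np.einsum("ab,ag,bg->g", P, chi, chi)

# Two basis functions evaluated at two grid points (illustrative numbers)
P = np.array([[1.0, 0.2],
              [0.2, 0.5]])
chi = np.array([[0.1, 0.4],
                [0.3, 0.2]])
rho = electron_density(P, chi)
```

The `owner_of_task` helper reproduces the article's schedule: with 4 processes, tasks 0..3 go to ranks 0..3 and task 4 wraps back to rank 0.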

”A picture is worth a thousand words”

Figure 2: Electron density of one unit cell and the whole benzene molecule

One goal of the project was to actually see where we can find the electrons, but visualization turned out to be more challenging than I expected. A good scientific figure should capture the essence of the topic of the research as well as catch the attention of the reader.

Figure 4: Electron density visualization examples

The result of the computation is the electron density in a large number of grid points. If the grid points are on a plane, I plotted the results with Wolfram Mathematica using ListPlot3D and ListDensityPlot. On the figures above you can see an example for a benzene orbital, with the electron density calculated at the plane of the molecule. I visualized the electron density in space by plotting the isosurface, which is a surface where the electron density equals a given value. I used the Visual Molecular Dynamics program to make isosurface plots after transforming the output to the Gaussian cube file format.

Results & discussion

I tested the electron density computation and visualization for the benzene molecule and a small carbon nanotube. I tried to find the physical meaning of each electron density plot. For example, there are bonding benzene orbitals where the electron density is high between the atoms, and there are antibonding orbitals with a nodal surface, where the electron density is zero. In the case of some nanotube orbitals the electron density is localised inside the tube, while in others you can see the bonds between the carbon atoms.

Figure 5: Speed-up of the parallel program

As for the parallelization, I made a test for a nanotube that has 32 k values. In the figure above, you can see the speed-up for the two code versions. The maximal number of cores was 32 for the k-loop version, and 128 for the orbital-loop version. Up to about 64 cores the speed-up is ideal, but after that the program will not get much faster if we increase the number of cores.

PRACE SoHPC Project Title: Electronic structure of nanotubes by utilizing the helical symmetry properties: The code optimization
PRACE SoHPC Site: Computing Centre of the Slovak Academy of Sciences, Slovakia
PRACE SoHPC Authors: Irén Simkó, Eötvös Loránd University, Hungary
PRACE SoHPC Mentor: Prof. Dr. Jozef Noga, DrSc., CCSAS, Slovakia
PRACE SoHPC Contact: Irén Simkó, E-mail: [email protected]
PRACE SoHPC Software applied: SOLID98, Visual Molecular Dynamics, Wolfram Mathematica, Blender
PRACE SoHPC Acknowledgement: I am grateful to my mentor, Jozef Noga, my site co-ordinator, Lukáš Demovič, and all people at CCSAS for their support and the wonderful summer. Calculations were performed in the Computing Centre of the Slovak Academy of Sciences using the supercomputing infrastructure acquired in projects ITMS 26230120002 and 26210120002 (Slovak infrastructure for high-performance computing) supported by the Research & Development Operational Programme funded by the ERDF.
PRACE SoHPC Project ID: 1904

References
1 J. Comp. Chem., Vol. 20, No. 2, 253-261 (1999)
2 J. Phys. Chem. B 2005, 109, 52-65

Live Visualisation of Simulation Results using Paraview Catalyst

In Situ / Web Visualisation of CFD Data using OpenFOAM

Li Juan Chan

Paraview Catalyst has been proposed to overcome the limitations caused by the I/O bottleneck. In this research, the scalability of OpenFOAM with coprocessing has been investigated, and it has been shown to be very scalable.

Over the years, the computational power of processors has developed at a tremendously fast pace. This has made large-scale computing more affordable. The development of parallel computing software is advancing equally as fast as the processor. However, there are subsystems of high-performance computers that are not developed at the same pace as the processor. One of them is the I/O bandwidth. I/O is developed at such a slow pace that it is now becoming the bottleneck for exascale computing. This project hoped to address this issue and get around the bottleneck by using in situ visualisation.

Traditionally, the simulation process consists of pre-processing, processing and post-processing. However, this process is getting more and more expensive due to the I/O bottleneck. Writing and reading data have become very slow when compared to the processing speed. In situ visualisation was designed to solve this problem by moving some of the post-processing tasks in line with the simulation code. Paraview has come out with an in situ visualisation library called Catalyst. Catalyst enables live visualisation of a simulation while the simulation is still running.

There are a few benefits of using Paraview Catalyst. One of them is that the live visualisation feature enables researchers to examine whether the pre-processing step is set up correctly. If the boundary conditions are found to be incorrect, the simulation can be stopped immediately, before investing more time and resources into an incorrect simulation. Furthermore, with Paraview Catalyst, the frequency of data writing is significantly reduced, as only the final result needs to be saved. The intermediate result can be viewed instantly without having to be saved for post-processing.

Therefore, one may wonder if Paraview Catalyst will increase the demand for memory due to live visualisation. The answer is yes and no. The live visualisation definitely requires some memory to run. However, the demand is not significant, because Paraview has the ability to extract a small amount of important data from the result instead of saving the full datasets. The features that can be extracted with Paraview include slice, clip, glyphs, streamline, etc.

After introducing Paraview Catalyst, I would like to introduce another piece of software that is used together with it: OpenFOAM. OpenFOAM is an open-source software package for computational fluid dynamics (CFD) and it is also highly scalable. Therefore, the scalability of OpenFOAM with and without co-processing will be determined and compared.

Aim and Motivation

In the field of high-performance computing, the scalability of a software package is very important. This is because a supercomputer is nothing but a bunch of computers combined together to perform a task. Therefore, researchers may want to know how much benefit can be obtained by running a software package with additional resources. Since Paraview Catalyst is so useful and may be widely used in the future, I, as a researcher, would like to find out how scalable Paraview Catalyst is, and this basically is the aim of this project.

How to use Paraview Catalyst?

The model I was working on is shown in Figure 1. It is a wind tunnel construction with two NACA0012 airfoils, rotating with respect to hinges.

Figure 2 is the zoomed-in image of the model. As shown in the figure, a structured mesh is used upstream and downstream of the model. The mesh in the middle is unstructured due to the rotation of the airfoils. There are approximately 10 million cells in the mesh. A compressible solver named rhoPimpleFoam is used for all simulations.

Figure 1: Model
Figure 2: Mesh

To run OpenFOAM and Paraview Catalyst at the same time, a software tool called Remote Connection Manager (RCM) is used. It allows users to connect to the compute nodes of the supercomputer. One RCM session is used to run Paraview. When Paraview is loaded, Catalyst can be connected from the menu bar of Paraview. Meanwhile, another RCM session is opened to run OpenFOAM. However, an OpenFOAM plugin, co-developed by CINECA with ESI-OpenCFD and Kitware, should be loaded before running OpenFOAM. Once loaded, the simulation can be run and the intermediate result of the simulation can be visualised concurrently, as shown in Figure 3. In Figure 3, the feature displayed is a slice.

Figure 3: Live Visualisation

To extract other types of features, a different pipeline is needed. A pipeline is a Python script that specifies the feature to be extracted and the parameters of the feature. The pipeline can be obtained from the Paraview GUI. The process is repeated with different numbers of cores and different pipelines. In this study, several pipelines, for example slice, clip, glyphs, streamline and region, are produced, as shown in Figures 4, 5, 6, 7 and 8.

Figures 4-8: Slice, Clip, Glyphs, Streamline, Region

What is the outcome?

The three graphs on the next page summarise the results of this research: a scaling graph, a graph of the execution time of OpenFOAM with and without Catalyst, and a graph of the execution time of Paraview Catalyst.

First of all, what is speedup? Speedup is a variable that measures scalability. The speedup of n cores is defined as the execution time on one core divided by the execution time on n cores, where n is the number of cores to be measured. Basically, it measures the relative performance of one core and n cores processing the same problem.

In the scaling graph, one may notice that the speedup of OpenFOAM with coprocessing is very similar regardless of the type of pipeline. Additionally, the speedup of OpenFOAM with and without coprocessing is very similar up until 108 cores. Beyond that point, they start to diverge, with OpenFOAM without coprocessing being more scalable than with coprocessing. The degree of divergence increases as the number of cores increases. We also noticed that the scalability of OpenFOAM with and without coprocessing starts to flatten out beyond 216 cores. The sudden reduction in scalability is due to communication overhead. Therefore, if good scaling beyond 216 cores is wanted, the size of the mesh needs to be increased in order to have a good balance between computation and communication.

Similar to the scaling graph, the execution times of OpenFOAM with and without coprocessing are very similar. However, the execution time without coprocessing is slightly shorter at large numbers of processor cores, and the difference increases as the number of processor cores increases. In contrast, the execution time of Paraview Catalyst does not show as clear a pattern as the two previous graphs. However, all pipelines exhibit a trend in which the execution time decreases up until 108 processor
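The speedup definition above reduces to a pair of one-line helpers. The timings below are hypothetical placeholders (the article's measured values live in the graphs), used only to show the arithmetic; parallel efficiency, speedup divided by the core count, is a common companion metric.

```python
def speedup(t_serial, t_parallel):
    """Speedup on n cores: execution time on 1 core / execution time on n cores."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_cores):
    """Fraction of ideal scaling achieved: speedup / n_cores (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_cores

# Hypothetical numbers, not measurements from this study
t_1 = 1080.0     # seconds on 1 core
t_108 = 11.5     # seconds on 108 cores
print(round(speedup(t_1, t_108), 1))          # ~93.9x
print(round(efficiency(t_1, t_108, 108), 2))  # ~0.87
```

With ideal scaling, speedup equals the core count and efficiency stays at 1.0; the flattening beyond 216 cores described above corresponds to efficiency dropping as communication overhead grows.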

Compilation of the graphs of results

cores. Beyond that point, the execution time increases. This trend agrees with the scaling graph.

In a nutshell

OpenFOAM with Paraview Catalyst is very scalable, only slightly inferior to pure OpenFOAM. However, the benefits that can be obtained from Paraview Catalyst definitely outweigh the slight reduction in scalability. For anyone who is interested in this project, the details can be found in the GitHub repository.1

References
1 https://github.com/ljchan1/SoHPC_19

PRACE SoHPC Project Title: In Situ / Web Visualisation of CFD Data Using OpenFOAM
PRACE SoHPC Site: CINECA, Italy
PRACE SoHPC Authors: Li Juan Chan, University of Manchester, UK
PRACE SoHPC Mentors: Prof. Federico Piscaglia, Dept. of Aerospace Sci. and Technology, Politecnico di Milano, Italy; Simone Bnà, CINECA, Italy
PRACE SoHPC Contact: Li Juan Chan, University of Manchester; Phone: +60 125271362; E-mail: [email protected]
PRACE SoHPC Software applied: OpenFOAM 1806, Paraview 5.5.1, Catalyst, Blender 2.79b
PRACE SoHPC More Information: www.paraview.org, www.openfoam.org
PRACE SoHPC Acknowledgement: I would like to express my utmost gratitude to my site coordinator, Simone Bnà, for providing guidance and advice throughout the project. Without his supervision, this project would never have reached as far as it has. Furthermore, I would also like to thank PRACE for sponsoring me to work at CINECA. I also thank Massimiliano Guarrasi and Leon Kos for arranging my allocation in Bologna. Last but not least, I would like to thank Prof. Federico Piscaglia and Federico Ghioldi for providing me the model.
PRACE SoHPC Project ID: 1905

Building an explainable model for understanding HPC faults

Explaining the faults of HPC systems

Martin Molan

This work provides an initial investigation into possibilities for machine learning on log data collected on HPC systems at the CINECA supercomputer site. A general data transformation/preparation pipeline that can be the basis for further machine learning projects is presented. Additionally, this work explores possibilities for the creation of explainable machine learning models on the processed data.

The goal of the presented work is to explore possibilities for machine learning on top of data collected from high performance computing (HPC) systems. The data was collected on two HPC systems (MARCONI and GALILEO) from the supercomputer site CINECA. The presented work has the following key points:

- Preparation of the dataset for future machine learning experiments. This involves creation of a data transformation/pre-processing pipeline that is capable of handling huge quantities of data in parallel and in batches.
- Building an explainable machine learning (ML) model that investigates a learning task on a subset of the data.
- Implementing scalable parallel induction of stable decision trees.

1 Dataset preparation

The main goal of dataset preparation is to convert the raw data (each row represents a single attribute, a single time stamp and a single node) into data appropriate for machine learning. The processing should be done in parallel and in batches.

1.1 Feature construction and dataset description

The dataset has the following attributes (features): alive::ping, backup::local::status, cluster::status::availability, cluster::status::criticality, cluster::status::internal, dev::raid::status, dev::swc::confcheckself, filesys::local::avail, filesys::local::mount, filesys::shared::mount, memory::phys::total, ssh::daemon, sys::ldapsrv::status, filesys::eurofusion::mount, sys::gpfs::status, dev::ipmi::events, dev::swc::bntfru, dev::swc::bnthealth, dev::swc::bnttemp, dev::swc::confcheck, batchs::client::state, batchs::client, net::opa::pciwidth, net::opa, sys::orphaned-cgroups::count, core::total, sys::cpus::freq, batchs::client::serverrespond, MaxState.

The only constructed feature is MaxState, which is the value of the most serious warning in a timestamp. All features have values between 0 and 3, where 0 means normal operation, 1 means warning and 2 means serious failure (3 signifies missing data).

Feature evaluation. Informativeness of the features is estimated by evaluating them with random forest and extra trees classifiers (ensemble classifiers based on decision trees). The most relevant features (according to both methods of evaluation) are: MaxState, ssh::daemon, batchs::JobsH, alive::ping, batchs::client::state.

2 Supervised learning

The second goal of the project is to construct a supervised learning task. The training set consists of instances, described by attributes and a target value for each instance. In this work the target value will always be discrete - the training task will be classification.

2.1 Comparison of different classification algorithms

Comparison of classification algorithms is done with 10-fold cross validation. In each repetition of the process, the model is trained on 90% of the data and evaluated on the remaining 10%; the results are then averaged across the 10 iterations. 10-fold cross validation generally produces more reliable results than a simple train/test split, as it avoids overly pessimistic or optimistic estimates of classifier performance. The goal of the supervised learning model is to predict when a fault (a warning, or a fault) from a component (attributes described above) is serious enough to warrant a shutdown of a node (binary target class). Such an (explainable) model would enable the operators to make more informed decisions about the seriousness of the warnings provided by the system.

Table 1: Average classification accuracy in 10-fold cross validation

Dummy Classifier    0.9663
Nearest Neighbors   0.9865
Linear SVM          0.9926
RBF SVM             0.9924
Decision Tree       0.9928
Random Forest       0.9744
Neural Net          0.9915
AdaBoost            0.9928
Naive Bayes         0.9936

Table 2: Average gain compared to dummy classifier in 10-fold cross validation (in percentages)

Dummy Classifier    0.0
Nearest Neighbors   2.015
Linear SVM          2.625
RBF SVM             2.612
Decision Tree       2.644
Random Forest       0.805
Neural Net          2.514
AdaBoost            2.644
Naive Bayes         2.73

Classifier evaluation (and the base classifiers themselves) is based on the Scikit-learn library.1 Each iteration was performed in its own thread (parallelization).

The most significant takeaway from classifier performance is the relative strength of decision trees. This suggests that:

• Features are informative. The feature set provides good information about the target label without the need for additional feature construction.
• Features do not require non-linear transformations to be informative (SVMs do not represent an improvement over baseline features).

Informative features are the basis for the creation of an explainable model. If that was not the case - if, for instance, SVMs performed considerably better than baseline trees - features would need non-linear transformations, which would make them much more difficult to interpret.

Decision trees - in their basic implementation (greedy splitting) - are not a reliable explainable model for a problem of this scale. Because they are generally unstable (they can change significantly with small changes to the dataset), reasoning provided by the decision trees cannot be the basis for the final model.

3 Future work

Based on the relatively good performance of decision trees, the next step in building the predictive model is to solve the inherent instability of decision trees. A possible approach is a consensus tree classifier. This is an ensemble learning approach based on decision trees: it essentially combines the predictions of several decision trees into a single (stable) model.

The basic algorithm for consensus tree induction can be summarized as:2

1. Perform an N-fold partition of the data set (same partition as with cross validation) and train a separate decision tree on each partition
2. Compute a dissimilarity matrix for instances with regard to the trained decision trees
3. Construct a hierarchical clustering and clean the dataset
4. Construct a decision tree on the cleaned data set

3.1 Implemented on Summer of HPC

Preliminary experiments with a parallel implementation (computation of the dissimilarity matrix is parallel) of the consensus tree algorithm were performed. The open problem remains how to efficiently implement the parallel version of the algorithm so that it is applicable to bigger datasets (a great number of instances and base estimators). The main problem of this implementation is the memory demands of each individual thread. The parallel implementation and subsequent consensus trees will be the topic of a future paper prepared with mentors from the University of Bologna.

References

1 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825–2830.

2 Kavsek, B., Lavrac, N. and Freligoj, A. (2001). Consensus Decision Trees: Using Consensus Hierarchical Clustering for Data Relabelling and Reduction. European Conference on Machine Learning, 251-262.

PRACE SoHPC Project Title: Anomaly detection of system failures on HPC machines using Machine Learning Techniques
PRACE SoHPC Site: CINECA, Bologna
PRACE SoHPC Authors: Martin Molan, [International Postgraduate School Jozef Stefan Institute] Slovenia
PRACE SoHPC Mentors: Andrea Bartolini, University of Bologna, Italy; Andrea Borghesi, University of Bologna, Italy; Francesco Beneventi, University of Bologna, Italy
PRACE SoHPC Contact: Martin Molan, International Postgraduate School Jozef Stefan Institute, E-mail: [email protected]
PRACE SoHPC Acknowledgement: I am thankful to my mentor (dr. Bartolini) and his colleagues from the University of Bologna for all the help and guidance. I am also thankful to my site supervisor dr. Massimiliano Guarassi for his help with setting up the environment on HPC.
PRACE SoHPC Project ID: 1906
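The 10-fold protocol described above can be sketched without any machine-learning library. The following minimal NumPy version evaluates a majority-class baseline (a stand-in for the dummy classifier; the fold logic is identical for any model). The data is synthetic and the function name is invented for illustration - this is not code from the project.

```python
import numpy as np

def ten_fold_accuracy(X, y, n_folds=10, seed=0):
    """Shuffle, split into folds, train on 90% / evaluate on 10%, average."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # "Training" a majority-class baseline: remember the most common
        # label in the training part (X is unused by this trivial model).
        labels, counts = np.unique(y[train], return_counts=True)
        predicted = labels[np.argmax(counts)]
        scores.append(np.mean(y[test] == predicted))
    return float(np.mean(scores))

# Synthetic, heavily imbalanced data, loosely mimicking the fault-detection
# setting where most samples are "no shutdown needed".
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.03).astype(int)  # ~3% positive class
print(ten_fold_accuracy(X, y))
```

The high score of such a baseline on imbalanced data is exactly why Table 2 reports the gain over the dummy classifier rather than raw accuracy alone.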

Visualising Parallel Computations for Education on Raspberry-Pi Clusters

Parallel Computing Demos on Wee Archie

Caelen Feller

I created a framework for visualising parallel communications on Wee Archie, a Raspberry-Pi-based parallel computer. I used this to create demonstrations illustrating basic parallel concepts, and an interactive coastal defence simulation.

EPCC has developed a small, portable Raspberry-Pi based "supercomputer" which is taken to schools, science festivals etc. to illustrate how parallel computers work. It is called Wee Archie because it is a smaller version of the UK national supercomputer ARCHER. It is actually a standard cluster, and it is straightforward to port C, C++ and Fortran code to it. There are already a number of demonstrators which run on Wee Archie that demonstrate the usefulness of running demonstrations on a parallel computer, but they do not specifically demonstrate how parallel computing works.

In this project, I developed a framework for creating and enhancing new, existing and in-development demonstrations that show more explicitly how a parallel program runs. This is done by showing a real-time visualisation on the front-end of Wee Archie, or by programming the LED lights attached to each of the 16 Wee Archie Pis to indicate when communication is taking place and where it is going (e.g. by displaying arrows). All of these visualisations are triggered via the MPI (Message Passing Interface) profiling interface, a standard feature on all modern HPC systems, making this a drop-in solution for most existing code.

I developed visualisations for a tutorial illustrating the basics of parallel communications and a coastline defence demonstration. I also developed a web interface for the tutorials to create a more cohesive user experience. In these demonstrations, I aimed to make it clear to a general audience what is happening on the computer.

Animation Server

(Figure: client code passes communication info to the animation server, which drives the LED panel.)

To display communications via the LED panels (created by Adafruit), I used the official Adafruit Python library. As such, my visualisations are done in Python. To allow all demonstrations to share the panels safely, I start a queue on each Pi in a separate background process, where any demo can add an 8px × 8px image to be displayed on the 8×8 LED panels. I can also give these images properties such as how long they are to be displayed, and create parcels of images together which form my various animations.

I had two major requirements for my animation system: I needed to allow demonstrations developed in any programming language to create an animation, and to be able to coordinate animations between Pis. To do this, I made a web server on each Pi using Flask which will place an animation in the queue when you make a certain web request against it. You must provide options for animation length, type, and type-specific options as discussed below.

Point to Point Visualisations

In MPI, an important class of communications are "point to point" communications. These are when a message is passed from one node (here meaning a Raspberry Pi) to another directly. In its most basic form, a node will start to send to some destination, and wait until that destination has begun to receive from the correct source before beginning to transfer the message.

Given the ability to add a sequence of images to a queue, how can I accurately visualise a "send" and "receive" operation between two Pis? My approach was to use a Python "pipe". When an animation is reached in the queue, it will show an "entry" and stop, using the pipe to wait until the server allows it to continue. This allowed me to synchronise and wait on animations between Pis, which accurately replicates the behaviour of MPI messaging.

This behaviour is known as a "synchronous" or "blocking" send. There also exists a "non-blocking" send. The main difference is that with non-blocking communication, the Pi will continue working rather than waiting for the communication to start, and does the communication in the background. This is a more efficient but less safe form of communication, and I provided a visualisation for it as it is commonly used. Other differences are discussed in the MPI standard.1

Collective Visualisations

The next major class of MPI communications are "collectives" - when many nodes want to communicate with others at once. Say I want to distribute some result to every node - this is done using a broadcast. According to the standard,1 this will cause every participating node to wait until it has the correct output before continuing, though the exact timing varies.

For clarity and consistency, in collective communication visualisations all participants wait until the communication is done overall before continuing. The implementation is similar to that of waiting in point to point.

There were several other types of communication I visualised (gathering, scattering and reduction), whose implementations are similar. With gathering, the result is collected from each node and stored on one. With scattering, the result on one node is split into small, even parts and distributed to many. With reduction, the result is gathered but an operation, such as a sum or product, is applied to it as it is collected. For more details on these, see the MPI standard.1

Application-Specific Animations

Often a demonstration will require unique animations, such as a context-appropriate "working" animation, or a visualisation of some communication at a higher level of abstraction than MPI, such as the "haloswaps" of the coastal defence demonstration. Here, the animation server can be contacted by a non-MPI process which directly contacts the server; this requires modification of the source code.

Unified Web Interface

As these demonstrations are used for outreach, the surrounding narrative is important for audience engagement, and improving this aspect of the Wee Archie interface is an important part of explaining more complex concepts such as parallel communication. In previous demonstrations for Wee Archie, a framework written in Python is used to display a user interface on a connected computer. This starts the demonstration by contacting a demonstration web server running on Wee Archie, which in turn uses MPI to run the code on all of the other Raspberry Pis. It returns any results to the client continuously.

I created an internal website for Wee Archie, but due to the performance constraints of serving complicated websites from a Raspberry Pi while it is handling so many communications already, I opted to make it a static website - one which does not require processing by the server other than providing the correct files. I did this using the Gatsby framework.

In order to allow the static website to start a demonstration, I wrote my own version of the Wee Archie framework in JavaScript. This framework uses the Axios library to manage communication with the demonstration server, and the React framework to provide a generic demonstration web interface.

C and Fortran Framework

While demonstrations for Wee Archie typically use Python for their user interface, they use the C and Fortran languages in computations for performance reasons. Thus, it was desirable to create a wrapper around MPI in these languages which will automatically let the animation server know when a communication is started. I did this using the MPI profiling interface. MPI internally refers to functions using "weak symbols", which allows you to override the functions provided by the library and allows a library developer to call their visualisation and logging code. The framework includes visualisations for most MPI communications, all shown on the next page.

Basic MPI Tutorials

This series of tutorials consists of a set of ten demonstrations to be run on Wee Archie and accompanying text. They are aimed at a complete beginner who does not have programming experience, but can understand the concept of a program doing work, and take them through all of the concepts Wee MPI has to offer.
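The pipe-based hand-shake used in the point-to-point visualisations can be sketched with Python's standard library. This is only an illustrative stand-in for the real animation server: a thread plays the role of the background animation process so the example is self-contained, and all names are invented for the sketch.

```python
from multiprocessing import Pipe
from threading import Thread

def play_animation(name, conn, log):
    # Show the "entry" frame, then block on the pipe until the coordinator
    # signals that the matching receive has started (like a blocking send).
    log.append(f"{name}: entry frame shown")
    release_msg = conn.recv()          # blocks, as the LED animation would
    log.append(f"{name}: released ({release_msg}), playing transfer frames")
    conn.send(f"{name} done")

# The real framework runs the animation in a background process on each Pi;
# a thread stands in for it here.
log = []
server_end, anim_end = Pipe()
worker = Thread(target=play_animation, args=("send-anim", anim_end, log))
worker.start()
server_end.send("matching receive posted")  # coordinator releases the animation
result = server_end.recv()                  # wait for the animation to finish
worker.join()
print(result)
```

The key point is that the animation itself blocks on `conn.recv()`, so two Pis' animations can be held at their "entry" frames until both sides of the communication are ready.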

Figure 1: Left: Top: Send and Receive, Middle: Broadcast, Bottom: Gather. Right: Top: Scatter, Middle: Reduce (Sum), Bottom: Wave demonstration in progress.

The first two demonstrate the benefits of parallel computing through the analogy of cooking, showing an embarrassingly parallel problem and then, by introducing conflict and making the problem less perfectly parallel, showing the need for efficient communication. The next series goes through point to point communications, first showing a loop of blocking and then non-blocking sends and receives. The final set discusses collective communications, demonstrating what each does and showing their animations. It also shows the way that a broadcast could be created using point to point communications.

Coastal Defence Demonstration

In the demo, the ocean is broken up into wide horizontal strips.2 Each strip is processed by a single Raspberry Pi. However, there needs to be some communication so that waves can propagate throughout the simulation. To do this, after every tick of the equation solver that is run in the simulation, the edges of these strips are exchanged between Pis. This is known as a "haloswap" and is shown by a custom animation. As many thousands of these occur during the simulation, I only show every hundredth haloswap.

The main goal of the project was to create an extensible and easily used framework to visualise parallel communications in any language. By using the animation server-client architecture and the profiling interface, this goal has been accomplished. As demonstrated in the tutorials and the coastal defence simulation, this functions in practice, and as the animations can easily be modified or turned off, there is nothing stopping adoption in future demos.

Recommendations

It would also be an improvement were all demonstrations for Wee Archie done using the web framework, as this would allow users to easily switch between demonstrations, and provide a surrounding explanation. It also allows for novel, interactive visualisations with the use of new features such as WebGL and the D3 JavaScript library.

References

1 Message Passing Interface Forum (2015). MPI: A Message-Passing Interface Standard, Version 3.1.
2 EPCC, The University of Edinburgh (2018). Wee Archie Github Repository.

PRACE SoHPC Project Title: Parallel Computing Demonstrators on Wee Archie
PRACE SoHPC Site: EPCC, Scotland
PRACE SoHPC Authors: Caelen Feller, TCD, Ireland
PRACE SoHPC Mentor: Gordon Gibb, EPCC, Scotland
PRACE SoHPC Project ID: 1907

Performance Comparison of Python and C Programs in Computational Fluid Dynamics (CFD)

Performance of Python Program on HPC

Ebru Diler

This project aims to improve the speed performance of Python by experimenting with high-performance scientific and fast data-processing libraries.

Scientific researchers use HPC to model and simulate complex problems. They require fast supercomputers in order to resolve problems in the areas of Life Sciences, Physics, Climate research, Engineering and many others. Such machines harness the power of thousands of tightly connected high-end servers to deliver enough power to programs. One such machine is ARCHER, located in Edinburgh.

In recent years, the preferred programming language for HPC has generally been C/C++ or Fortran, because they give the programmer very high control. A research project that started with one programming language can be hard to move to another, but there are tradeoffs that should be considered. One other language is Python, which is constantly growing and is developed by a strong community.

Python has become a very popular programming language due to its wide range of uses. If you read programming and technology news or blog posts, then you might have noticed the rise of Python; communities including StackOverflow and CodeAcademy have mentioned the rise of Python as a major programming language. Although it is widely used, it has some disadvantages. One of them is its poor performance. Therefore, it is not preferred for the large-scale modelling and simulation used on high performance computers. However, some high-performance, fast-processing libraries in Python are also growing. Thus, the tradeoff between compiled languages and scripting languages is being reconsidered. Python offers faster implementation time, flexibility and ease of use and learning, which gives it a very strong community, and it should be compared to the languages that have traditionally been used in HPC.

In short, this project aims to improve the speed performance of Python, a modern programming language known for ease of use and flexibility but not for speed. The main purpose will be to optimize the speed by using:

• the MPI parallelization library, mpi4py
• the fast data processing library, numpy

NumPy is one of the most robust and widely used Python libraries. A Python library is a repository of script modules that can be accessed from a Python program; it saves frequently used commands from being rewritten and helps to simplify programming. NumPy provides a multidimensional array object, routines for fast operations on arrays, and a range of routines that produce a variety of derived objects (such as masked arrays and matrices), covering mathematical, logical and basic linear algebra operations, shape manipulation, sorting, and selection.

MPI for Python provides bindings of the Message Passing Interface (MPI)[2] standard for the Python programming language, allowing any Python program to exploit multiple processors. It supports point-to-point (sends, receives) and collective (broadcasts, scatters, gathers) communications of any picklable Python object, as well as optimized communications of Python objects exposing the single-segment buffer interface (NumPy arrays, builtin bytes/string/array objects).

Introduction

Computational fluid dynamics (CFD) is a branch of fluid mechanics that uses analysis methods to analyse and solve problems involving fluid flows. In order to simulate the flow of a liquid, computers are used for the required calculations. Better solutions can be achieved with high-speed supercomputers, and the biggest and most complex problems can only be tackled with them.

Fluid dynamics is a continuous problem that can be described by partial differential equations. In this study, the finite difference approach will be used to solve the equations and determine the fluid flow model in a square cavity with a single inlet on the right side and a single outlet at the bottom of the cavity.

Methods

The fluid flow can be described by the stream function ψ, defined as follows:

∇²ψ = ∂²ψ/∂x² + ∂²ψ/∂y² = 0

Using the finite difference approach, the flow value at each grid point can be calculated as follows:

ψ(i−1,j) + ψ(i+1,j) + ψ(i,j−1) + ψ(i,j+1) − 4ψ(i,j) = 0

With the boundary values fixed, the stream function can be calculated for each point in the grid by averaging the value at that point with its four nearest neighbours. The process continues until the given number of iterations. This simple approach to solving a problem is called the Jacobi algorithm. The velocity field u must be calculated to obtain the fluid flow pattern within the cavity. The x and y components of u are related to the stream function by:

u_x = ∂ψ/∂y = ½ (ψ(i,j+1) − ψ(i,j−1))
u_y = −∂ψ/∂x = −½ (ψ(i+1,j) − ψ(i−1,j))

Implementation

There was already C code written to solve this problem. There are 3 different parameters that affect the algorithm:

1. The scale factor, which has to do with the size of the problem
2. The Reynolds number, related to the flow property of the fluid
3. The tolerance, used to provide convergence control

The flow images resulting from changing the Reynolds number are given in figures 1 and 2.

Figure 1: Fluid flow, Reynolds number: None
Figure 2: Fluid flow, Reynolds number: 2

Serial Program: I have written serial code in 3 different Python versions to compare the speed:

1. Using Python lists and for loops
2. Using NumPy arrays and for loops
3. Using NumPy arrays and built-in NumPy vectorization features [1]

Parallel Program: I started from a serial version running on a single processor and accelerated it by distributing the work over more processor cores, communicating via the MPI interface. In addition, the parallel approach is based on domain decomposition: all cores calculate their part and exchange the necessary information to keep the problem consistent. This is called halo swapping (Figure 3), and it is regulated by MPI.

Figure 3: Visualisation of Halo Swap

Results and Discussion

Comparison of Python Programs

The time spent by the different runs was measured for every one thousand iterations. The fastest program was NumPy with vectorization, which lets us avoid for loops. This matched our expectations. But we were quite surprised by the other two versions (Figure 4): the version with NumPy arrays and for loops ran much more slowly than the one without NumPy, whereas our prediction was that the program would run faster using the numpy library.

Comparison of C Programs

In order to control compilation time and compiler memory usage, and the trade-offs between speed and space for the resulting executable, GCC provides a range of general optimization levels, numbered from 0–3, as well as individual options for specific types of optimization. An optimization level is chosen with the command line option -OLEVEL, where LEVEL is a number from 0 to 3. Figure 5 shows the execution time of the C program according to the optimization level.

We also compared C and Python execution speeds. Figure 7 shows the data generated by the C and Python programs. According to the data, the efficient features of NumPy increased performance 112 times.
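The difference between the serial versions comes down to how the Jacobi update ψ(i,j) ← ¼(ψ(i−1,j) + ψ(i+1,j) + ψ(i,j−1) + ψ(i,j+1)) is written. Here is a minimal sketch of versions 2 and 3 (NumPy with Python loops versus NumPy with whole-array slicing); this is illustrative, not the project's actual code:

```python
import numpy as np

def jacobi_loops(psi, n_iter):
    """Version 2: NumPy storage, but the update uses Python for loops."""
    for _ in range(n_iter):
        new = psi.copy()
        for i in range(1, psi.shape[0] - 1):
            for j in range(1, psi.shape[1] - 1):
                new[i, j] = 0.25 * (psi[i-1, j] + psi[i+1, j]
                                    + psi[i, j-1] + psi[i, j+1])
        psi = new
    return psi

def jacobi_vectorized(psi, n_iter):
    """Version 3: the same update as whole-array slicing, no Python loops."""
    psi = psi.copy()
    for _ in range(n_iter):
        new = psi.copy()
        new[1:-1, 1:-1] = 0.25 * (psi[:-2, 1:-1] + psi[2:, 1:-1]
                                  + psi[1:-1, :-2] + psi[1:-1, 2:])
        psi = new
    return psi

grid = np.zeros((32, 32))
grid[:, -1] = 1.0                   # fixed boundary values on one edge
a = jacobi_loops(grid, 50)
b = jacobi_vectorized(grid, 50)
print(np.allclose(a, b))            # both versions compute the same field
```

The vectorized form pushes the inner loops into compiled NumPy routines, which is exactly the effect measured in Figure 4.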

Figure 4: Comparison of Python programs. Figure 5: Comparison of C programs. Figure 7: Comparison of Python and C programs.

Performance Measures for Parallelization

The speedup of a parallel code is how much faster the parallel version runs compared to a non-parallel version. Taking the time to run the code on 1 processor as T1 and the time to run the code on N processors as TN, the speedup S is found by:

S = T1 / TN

This can be affected by many different factors, including the ratio of communication to calculation. If the times are the same, the speedup is 1 and there was no change.

Figure 6: Parallel Python Program

Figure 6 demonstrates the strong scaling of the parallel Python program running on different numbers of processors. The graphic describes how the execution time of the program is affected by increasing the number of cores as the size of the problem increases.

Conclusion

Although the Python program still falls behind the C program in terms of performance, it is possible to improve its performance thanks to the NumPy library. From this project we can say that using numpy arrays without using the efficient numpy features (like vectorization) affects performance negatively for a Python program.

References

1 www.geeksforgeeks.org, Vectorization in Python
2 https://mpi4py.readthedocs.io/en/stable/, MPI for Python

PRACE SoHPC Project Title: Performance of Python programs on new HPC architectures
PRACE SoHPC Site: Edinburgh Parallel Computing Centre, United Kingdom
PRACE SoHPC Authors: Ebru Diler, [Dokuz Eylul University] Turkey
PRACE SoHPC Mentor: David Henty, EPCC, UK
PRACE SoHPC Contact: Ebru Diler, İzmir, Phone: +90 553 111 71 41, E-mail: [email protected], linkedin.com/in/ebru-diler
PRACE SoHPC Software applied: Python, Numpy, mpi4py
PRACE SoHPC Acknowledgement: Great thanks to Dr. David Henty for continuous support in all matters.
PRACE SoHPC Project ID: 1908
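The speedup S = T1/TN, together with the related parallel efficiency S/N, is easy to compute from measured run times. The timings below are made-up placeholders, not measurements from the project:

```python
def speedup(t1, tn):
    """How much faster the N-process run is than the serial run."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Speedup per core: 1.0 means perfect scaling."""
    return speedup(t1, tn) / n

# Hypothetical timings in seconds for runs on 1, 4 and 16 processes.
t1 = 120.0
for n, tn in [(4, 35.0), (16, 11.0)]:
    print(n, round(speedup(t1, tn), 2), round(efficiency(t1, tn, n), 2))
```

Falling efficiency as N grows is the usual signature of the communication-to-calculation ratio mentioned above.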

Investigation on GASNet’s Active Messaging component as an alternative to MPI

Investigations on GASNet’s active messages

Illustration of the difference between the MPI approach and GASNet’s active message (AM) approach: whereas MPI requires the receiving process to actively take the data, an active message directly jumps in and modifies the local memory.

Benjamin Huth

The project aimed at replacing an MPI-based communication backend with GASNet. However, performance tests showed that GASNet does not perform well on this task, and a subsequent broader investigation supports this observation.

Parallelism is the most important concept in modern supercomputing. On today’s large supercomputers it is mainly achieved by a technique called SPMD (Single Program Multiple Data). This means that there is only one unique program which is executed on many different CPUs (often on several hundreds or thousands) at one time, but each of them has individual data. To communicate between these programs, programmers use specific messaging libraries. The most popular one is MPI (Message Passing Interface). Its basic concept is as follows:

• The sending program calls a send-function.
• The destination program has to call a receive-function.
• Only if there is a matching pair of send- and receive-functions does the communication take place.

Although this is the most popular concept, it is not the only one. Another one, called "Active Messaging", is implemented by GASNet (Global-Address Space Networking). Its approach is very different from the above one:

• The sender sends not just data, but also a function to the destination.
• If the message arrives at the destination, it automatically executes the function, which can work on the data.
• The receiver has no direct control over the message arrivals.

There are tasks which naturally fit this second approach better, e.g. a task-based library, which abstracts the problem of explicitly sending data away from the programmer and lets them focus on the logical structure of their program and data. This was in fact the motivation for my project: we aimed at replacing the communication back-end of a library called "EDAT", which relies on MPI, with a GASNet-based one.

Change of focus

After implementing the first version of EDAT with GASNet, we saw that the performance was not as good as expected. So we changed the focus to investigate the basic properties of GASNet’s active messaging component in more depth. To get more general results, we decided not to analyse the performance of self-made code, but to rely on existing standard benchmarks (programs for measuring hardware performance) for MPI, and to rewrite them with GASNet as closely as possible. These benchmarks should cover a broad spectrum of parallel software, to figure out what the strengths and weaknesses of GASNet are. It is important to mention that we use GASNet here as a drop-in replacement for passive point-to-point communication, which is not what it was designed for.
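The contrast between the two models can be illustrated in plain Python: instead of posting a matching receive, the "receiver" just polls a mailbox and runs whatever handler arrives together with the data. This is only a single-process simulation of the idea, not GASNet's API (real GASNet handlers are registered with the library and run as the network is polled); all names here are invented for the sketch.

```python
from collections import deque

class Node:
    def __init__(self):
        self.mailbox = deque()   # incoming active messages
        self.memory = {}         # local state the handlers may modify

    def send_am(self, dest, handler, payload):
        # An active message carries *code* (the handler) along with the data.
        dest.mailbox.append((handler, payload))

    def poll(self):
        # The receiver never posts a matching receive; it simply runs
        # whatever handlers have arrived.
        while self.mailbox:
            handler, payload = self.mailbox.popleft()
            handler(self, payload)

def store_handler(node, payload):
    key, value = payload
    node.memory[key] = value     # directly modifies the local memory

a, b = Node(), Node()
a.send_am(b, store_handler, ("result", 42))
b.poll()
print(b.memory)  # {'result': 42}
```

Note that node b never specified what it expected to receive: the arriving message itself decided what happens to b's memory, which is exactly the property the illustration at the top of the article highlights.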

Figure 1: Visualisation of the two benchmarks. (a) Visualisation of the swapping process in a simple 2D stencil algorithm: the grey circles represent the processes, whereas each arrow denotes a data transfer. (b) Example of a graph (a road map connecting Hamburg, Berlin, Cologne, Frankfurt, Prague, Stuttgart, Munich, Vienna and Zurich); the bottom shows how the "breadth-first search" goes through the graph above in four steps: in the first step it reaches all points with distance one, in the second all with distance two, and so on.
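The level-by-level search shown in figure 1b can be written in a few lines. This serial sketch (plain Python, not the distributed graph500 code) returns the distance of every reachable vertex from the source; the road-map graph is invented, loosely inspired by the figure:

```python
from collections import deque

def bfs_levels(adj, source):
    """Breadth-first search: step k reaches all vertices at distance k."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w not in dist:          # first visit = shortest distance
                dist[w] = dist[v] + 1
                frontier.append(w)
    return dist

# A tiny road-map-like graph (adjacency lists, undirected).
adj = {
    "Frankfurt": ["Cologne", "Munich", "Berlin"],
    "Cologne":   ["Frankfurt", "Hamburg"],
    "Munich":    ["Frankfurt", "Stuttgart", "Vienna"],
    "Berlin":    ["Frankfurt", "Hamburg", "Prague"],
    "Hamburg":   ["Cologne", "Berlin"],
    "Stuttgart": ["Munich", "Zurich"],
    "Vienna":    ["Munich", "Prague"],
    "Prague":    ["Berlin", "Vienna"],
    "Zurich":    ["Stuttgart"],
}
print(bfs_levels(adj, "Frankfurt"))
```

In the distributed benchmark the vertices of `adj` are spread over the processes, so every edge crossing a process boundary turns into one of the many tiny messages discussed below.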

But if that works out well, it would be easier for developers to implement than redesigning the whole application.

Benchmarks for analysis

The purpose of the first benchmark is to determine the basic messaging properties of each communication system:

• The bandwidth is the amount of data a network can transfer in a certain time. Nowadays, modern networks typically reach a low two-digit number of GB/s, e.g. 10 GB/s. Compared to the bandwidth of the internal memory this is very slow, so it is a big bottleneck for parallel software.
• The latency is the time a network needs to react to a messaging request. It determines the fastest possible communication, and for small messages it is often more important than the bandwidth.

To measure these two quantities, we use a standard benchmark designed by the Ohio State University (see [1]).

The second application we chose for our tests is a stencil benchmark. In this application a grid is distributed over many processes (so each process has a small fraction of the grid in its local memory). During the run, each grid point is updated with the sum of its 4 neighbours (in the simplest 2D case). This is a typical scientific application; the above algorithm, for example, solves the Laplace equation of electrodynamics. To perform an update on the grid points at the edges of a local grid, one must transfer some grid points from a neighbour process to the local one. In the simple 2D case, each process has to send 4 chunks of data to its neighbours, but also receives 4 chunks from them (see figure 1a). As a basis for the GASNet benchmark, we used the implementation in Intel’s Parallel Research Kernels for MPI (see [2]).

The last benchmark is called graph500 (see [3]). It is designed to simulate applications which do not rely on heavy numerical computations, but on complex data structures. It implements a so-called "graph". This is an abstract data structure which consists of points (called "vertices"), some of which are connected by "edges". These edges usually represent a distance, whereas the vertices can contain any kind of information. A common and simple example of data one can represent as a graph is a road map (see figure 1b). The graph500 benchmark first creates a random graph, which is again distributed over the processes, and then starts a "breadth-first search" (see also figure 1b). This results (in contrast to the stencil benchmark) in many, but very small, messages with unpredictable destinations. To implement this benchmark, I have not rewritten the whole program, but just replaced the already abstracted communication layer with a GASNet-based one.

All used source code is available in a repository (see [4]), together with detailed instructions on how to compile GASNet and the different benchmarks.

How to run the benchmarks

We ran these benchmarks on three different systems, but here we show only the results from ARCHER, the UK’s national supercomputer. For its network there exists an optimised MPI as well as an optimised GASNet implementation, but the results are also representative for the rest of the systems. Whereas in the micro-benchmarks we scale the message size but hold the process count constant, for the two big benchmarks we scaled the parallelism.

(a) Performance of the latency-benchmark (latency [µs]): the message size is increased in powers of 2 from 1 byte to about 4 MB. (b) Performance of the bandwidth-benchmark (bandwidth [GB/s]): the message size is increased in powers of 2 from 1 byte to about 4 MB. (c) Performance of the stencil-benchmark (avg runtime per iteration [ms]): the workload is held constant at 1000000 grid-points per process respectively core. (d) Performance of the graph500-benchmark (time [ms]): the workload is held constant at 4096 vertices per process respectively core. Each plot compares GASNet with MPI.

Figure 2: performance plots from ARCHER
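A helpful way to read these plots is the simple first-order model T(n) ≈ L + n/B for transferring an n-byte message over a link with latency L and bandwidth B: for small n the latency term dominates (relevant for graph500's many tiny messages), while for large n the bandwidth term does (relevant for the stencil's few big transfers). The parameters below are illustrative round numbers, not measurements from ARCHER:

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost model for one point-to-point message."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

# Illustrative parameters: 2 microseconds latency, 10 GB/s bandwidth.
L, B = 2e-6, 10e9
small = transfer_time(8, L, B)          # 8-byte message: latency-dominated
large = transfer_time(4 * 2**20, L, B)  # 4 MiB message: bandwidth-dominated
print(small, large)
```

Under this model, doubling the latency roughly doubles the cost of the 8-byte message but barely changes the 4 MiB one, which is why a latency gap shows up so strongly in the graph500 results.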

process, and went up to 3072 processes (for the stencil) and 2048 processes (for the graph500, which requires the number of processes to be a power of two). While scaling the number of processes, we keep the amount of work per process constant (called weak scaling), since we are not interested in how the algorithm benefits from parallelism, but in how well the communication works. For a theoretical communication system with infinite speed and zero latency, the execution time would stay constant, so any increase of execution time reflects the communication overhead that comes with more parallelism.

Results and analysis

Already in the micro-benchmarks (see figures 2a and 2b) we see a significant performance difference. This is not sufficient as proof that GASNet's active messages perform badly on ARCHER, but it is a strong indicator. In the latency plot, the left side is the most interesting one: here we see that the minimal time for a message in GASNet is nearly twice as long as for MPI. On the bandwidth plot, the most interesting area is the right side, and here too we see better performance with MPI, especially for the largest message sizes.

These effects are not directly seen in the stencil benchmark: there is not much difference between the two frameworks (see figure 2c). After a big increase at the beginning (which is probably caused by memory effects inside a single node), the performance stays constant, as expected for weak scaling.

With the graph500 (see figure 2d), we see an entirely different picture. GASNet is now a whole order of magnitude slower than MPI. This could be surprising, because one might think that the active-messaging approach is more suitable for this kind of application (in fact, the original benchmark uses an active-messaging layer implemented with MPI). On the other hand, an obvious explanation can be found in the latency results above: a difference in latency is much more relevant with many small messages than with a few big ones.

Conclusion

With these results in mind, it is probably not worth using GASNet's active messaging as a replacement for MPI's passive point-to-point communication. For this reason, we dropped the previous aim of this project and focused on pure GASNet as an alternative to MPI. I think investigations on this topic are of general interest for the parallel computing community; although MPI is very good and widely used, it is always important to consider alternatives, which can not only help developers choose the right tools for their job, but also push MPI towards further improvements.

However, these are only preliminary results, and it could be interesting to investigate a few things further, such as more extensive profiling to understand the observed performance better. It may also be worth investigating GASNet's second component, RMA (Remote Memory Access), which according to the developers can compete with MPI.

References

1. Panda, D. K. OSU Micro-Benchmarks 5.6.1. Mar. 2019. Accessed 7th August 2019. http://mvapich.cse.ohio-state.edu/benchmarks/.

2. Intel Corporation. Parallel Research Kernels. 2013. Accessed 7th August 2019. https://github.com/ParRes/Kernels.

3. Graph500 Steering Committee. Graph500-3.0.0. May 2019. Accessed 7th August 2019. https://github.com/graph500/graph500.

4. Huth, B., Brown, O. T. & Brown, N. EPCCed/GASNet-AM-benchmarks: PAW-ATM19_submission. Aug. 2019. doi:10.5281/zenodo.3368621.

PRACE SoHPCProject Title
Task-based models on steroids: Accelerating event driven task-based programming with GASNet

PRACE SoHPCSite
EPCC, UK

PRACE SoHPCAuthors
Benjamin Huth, Uni Regensburg, Germany

PRACE SoHPCMentors
Nick Brown, EPCC, UK
Oliver Brown, EPCC, UK

PRACE SoHPCContact
Benjamin Huth, Universität Regensburg
Phone: +49 174 605 4097
E-mail: [email protected]

PRACE SoHPCSoftware applied
MPI, GASNet

PRACE SoHPCMore Information
gasnet.lbl.gov

PRACE SoHPCAcknowledgement
I want to thank my mentors Nick Brown and Oliver Brown for their constant support, Ben Morse for organising our stay in Edinburgh, and PRACE and Leon Kos for organising this event. Last but not least I want to thank my SoHPC colleagues Caelen and Ebru for spending this awesome summer with me!

PRACE SoHPCProject ID
1909
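The weak-scaling measurement described above can be summarised numerically: with constant work per process, any growth of run time over the single-process baseline is communication overhead. A minimal sketch with hypothetical timings (not the benchmark's actual numbers):

```python
# Weak-scaling efficiency: for an ideal (zero-latency, infinite-bandwidth)
# network, run time stays constant as processes are added, so the ratio
# baseline/time measures how much communication overhead has crept in.
# The timings below are made up purely for illustration.

def weak_scaling_efficiency(times):
    """Map {process count: run time} to {process count: efficiency}."""
    baseline = times[min(times)]          # time at the smallest process count
    return {p: baseline / t for p, t in sorted(times.items())}

# hypothetical run times (seconds), fixed work per process
times = {1: 10.0, 48: 10.4, 384: 11.0, 3072: 12.5}
for p, eff in weak_scaling_efficiency(times).items():
    print(f"{p:5d} processes: efficiency {eff:.2f}, overhead {100 * (1 - eff):.0f}%")
```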

Using HPC to investigate the structure and dynamics of a K-Ras4b oncogenic mutant protein

Studying an oncogenic mutant protein with HPC

Rebecca Lait

Molecular Dynamics (MD) simulations and Normal Mode Analysis have been used in order to discover potential binding sites on the K-Ras4b protein, which could be further used in drug design studies.

The oncogenic mutant K-Ras4b is a protein that has been studied for many years, but there has been limited progress in finding new drugs against it. Drugs are small molecules that bind to proteins and inhibit protein function. Also known as ligands, they bind to a specific region of the target protein, the binding site. In some cases, the location of the binding site is not known in advance, even though the protein structure is available. In this context, computational studies can help discover new binding sites. Hence, the goal of this project was to identify binding sites on the oncogenic protein K-Ras4b, which could benefit future therapeutic approaches.

K-Ras4b

K-Ras4b is a small GTPase, belonging to a large family of hydrolase enzymes. It is an essential component of the signalling networks controlling signal transduction pathways and promoting cell proliferation and survival. It operates as a binary switch in signal transduction pathways, cycling between inactive GDP-bound and active GTP-bound states.[1] In the GTP-bound active state, cell proliferation, which refers to an increase in the number of cells due to cell growth and division, can normally occur. However, in the GDP-bound inactive form, cell proliferation is inhibited. Figure 1 shows the GTP molecule bound to the K-Ras4b protein.

Figure 1: The GTP molecule bound to the K-Ras4b protein.

During the hydrolysis of GTP, a phosphate group is lost, which results in the formation of GDP. Thus, these two states act like switches turning the function of the protein on and off, respectively. However, the hydrolysis of GTP to GDP is no longer possible in the G12D mutated K-Ras4b, where the glycine in residue position 12 has been replaced with aspartic acid. Consequently, the protein is 'locked' in the active GTP state.[1] This leads to over-activation of the K-Ras4b protein and, as a result, to cancer growth.

The K-Ras4b protein comprises three functional domains: the effector binding region, defined by amino acid residues 1-86; the allosteric region, by residues 87-166; and the HVR tail, by residues 167-188. Each of these three domains is responsible for a particular interaction or function which contributes to the overall function of the protein.[2] The first domain in K-Ras4b, the effector binding region, is located in the N-terminus of the protein. There are three sub-domains within the effector binding region: the P-loop, switch I and switch II.[1] Once GTP is bound, the protein undergoes conformational changes that involve switch I and switch II. The second domain, the allosteric region, has been suggested to play a role in switch II conformation and the orientation of the membrane, while the third domain, the HVR tail, regulates the biological activity of the protein.[3] Within the aqueous component of the cytoplasm of a cell, the CYS185 of the protein is farnesylated, meaning that a farnesyl group is added, allowing the protein to bind to the inner leaflet of the plasma membrane.[4] There are many positively charged lysine amino acids within the HVR tail that interact strongly with the negatively charged membrane.[1] These functional domains can be seen in Figure 2. The blue region highlights the effector binding region, the red region displays the allosteric region and the green region shows the HVR tail.

Figure 2: The K-Ras4b protein with the three domains highlighted in red, blue and green.

Aims of the project

The aim of this project was to identify new potential binding sites on the surface of the unmutated wild type and the G12D mutant K-Ras4b, where small molecules can bind. This would aid the rational design of selective K-Ras4b inhibitors. In order to achieve this, Molecular Dynamics simulations and Normal Mode Analyses were performed.

Molecular Dynamics simulations

Atomic coordinate files of the protein and the specific attached ligand structure - either GTP or GDP - were downloaded from the Protein Data Bank.[5] Protein preparation, including solvation and ionisation, was then carried out. The pdb files used for the unmutated WT K-Ras4b bound to GDP, G12D mutated K-Ras4b bound to GDP, unmutated WT K-Ras4b bound to GTP and G12D mutated K-Ras4b bound to GTP had pdb codes 5W22, 5US4, 5VQ2 and 6GOF, respectively.[5] Regarding the G12D K-Ras4b GTP structure, a structure was not available and so one was created using the 6GOF structure as a starting point. The HVR region was taken from pdb code 2MSC and appended to the 6GOF structure. Moreover, one of the two magnesium ions was removed and the GPPNHP molecule was replaced with the GTP molecule. The solvation step, which places the protein in a water box, and the ionisation step, which places ions in the water, were carried out in order to represent a more realistic biological environment. Thus, the following systems were created: unmutated wild type K-Ras4b bound to GDP or GTP, and G12D mutated K-Ras4b bound to GDP or GTP.

Once protein preparation was completed for all four K-Ras variants, Molecular Dynamics simulations were performed using the ARIS supercomputer. For the WT K-Ras4b bound to GDP, the simulation was run for 75 ns, and for the remaining 3 variants for 23 ns each.

Identifying K-Ras4b binding sites

After performing the Molecular Dynamics simulations, the last frame of each simulation was collected and imported into the Schrodinger Suite for binding site identification. SiteMap was used to determine whether any cavities with hydrophobic characteristics and hydrogen bonding capacity existed on the simulated proteins.[6] Initially, regions on the protein surface, called 'sites', were identified as potentially promising. These were located using a grid of points called 'site points'. In the second stage, contour maps, 'site maps', were generated and the predicted binding site was visualised. Finally, each site was assessed by calculating various properties, such as druggability and hydrophobicity.[6] All of the predicted binding sites were investigated and only the ones that appeared the most promising would continue to further analysis. The identified binding sites for the mutated G12D K-Ras4b bound to GTP are shown in Figure 3 below, in three colours: red, yellow and purple. They are each highlighted using coloured circles and are each labelled.

Figure 3: The identified binding sites for the G12D mutated K-Ras4b bound to GTP. Each binding site is labelled and is highlighted using coloured circles.

Normal mode analysis

Normal mode analysis was carried out alongside the binding site identification in order to investigate large-scale motions of the protein structures. Normal mode vibrations are simple harmonic oscillations around an energy minimum. The normal modes with the biggest amplitude can indicate the functional motions of the proteins. In the context of this project, the fluctuations around predicted binding sites were investigated. More specifically, if the analysis predicted large-scale motions at a particular area of the protein that SiteMap identified as a potential binding site, it may be suggested that this is a biologically functional region of the protein and could act as a binding site. This type of analysis is qualitative. However, when it is combined with the binding site identification results, it can give useful insights regarding the functional areas of a protein. For each of the four studied K-Ras variants, regions with considerable large-scale motions near the identified binding sites were found. An example of this analysis is shown in Figure 4 below and compared to the binding site identification result from SiteMap.

Figure 4: The normal mode analysis undertaken on the unmutated wild type K-Ras4b structure bound to GDP. The binding site is circled.
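The harmonic picture behind normal mode analysis can be illustrated with a toy system (not the project's actual K-Ras4b analysis): the modes are eigenvectors of the Hessian of the potential energy at a minimum, and the softest (lowest-eigenvalue) modes correspond to the large-amplitude collective motions discussed above.

```python
import numpy as np

# Toy illustration: normal modes are eigenvectors of the Hessian
# (second-derivative matrix) of the potential energy at a minimum.
# Small eigenvalues = soft, large-amplitude collective motions.
# The 1D spring chain below is purely illustrative.

def normal_modes(hessian):
    """Return squared frequencies and mode vectors, sorted soft -> stiff."""
    eigvals, eigvecs = np.linalg.eigh(hessian)   # eigh sorts ascending
    return eigvals, eigvecs

# Hessian of three unit masses joined by unit springs in a 1D chain
H = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])
w2, modes = normal_modes(H)
print(np.round(w2, 3))   # softest mode first; includes one zero (translation) mode
```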

Conclusions

In this project, models of the K-Ras4b structures were constructed and MD simulations were used to study the dynamics of this protein. Binding site identification was performed in order to find cavities where candidate drugs could potentially bind, resulting in successfully identifying sites on the four K-Ras4b variants in different domains of the protein. Normal Mode Analyses were used to identify whether the predicted binding sites lie in areas of the proteins with large fluctuations.

References

1 Lu, S., Jang, H., Gu, S., Zhang, J. and Nussinov, R. (2016). Drugging Ras GTPase: a comprehensive mechanistic and signaling structural view. Chemical Society Reviews, 45(18), pp. 4929-4952.

2 EMBL-EBI Train online. (2019). What are protein domains? [online] Available at: https://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi/protein-classification/what-are-protein-domains [Accessed 2 Aug. 2019].

3 Hobbs, G., Der, C. and Rossman, K. (2016). RAS isoforms and mutations in cancer at a glance. Journal of Cell Science, 129(7), pp. 1287-1292.

4 Jančík, S., Drábek, J., Radzioch, D. and Hajdúch, M. (2010). Clinical Relevance of KRAS in Human Cancers. Journal of Biomedicine and Biotechnology, 2010, pp. 1-13.

5 RCSB PDB: Homepage. [online] Rcsb.org. Available at: https://www.rcsb.org/ [Accessed 29 Aug. 2019].

6 Halgren, T. A. (2009). Identifying and characterizing binding sites and assessing druggability. J Chem Inf Model 49: 377-389.

PRACE SoHPCProject Title
Using HPC to investigate the structure and dynamics of a K-Ras4b oncogenic mutant

PRACE SoHPCSite
Biomedical Research Foundation of Athens (BRFAA), Greece

PRACE SoHPCAuthors
Rebecca Lait, England

PRACE SoHPCMentor
Zoe Cournia, BRFAA, Greece

PRACE SoHPCContact
Rebecca Lait
E-mail: [email protected]

PRACE SoHPCSoftware applied
Virtuoso

PRACE SoHPCMore Information
www.virtouso.org

PRACE SoHPCAcknowledgement
I would like to thank the PRACE Summer of HPC programme for allowing me to undertake this internship. I will forever be grateful for this incredible opportunity and the experience I have gained. In addition, I would like to thank my colleagues at the Biomedical Research Foundation of Athens for their continued support and guidance. They made my project so enjoyable and taught me so many useful lessons that I know will be essential in my future career. Moreover, I am so appreciative of being granted time on the ARIS supercomputer in Athens during this project. Being introduced to the capabilities of this supercomputer has been inspiring. Thank you PRACE!

PRACE SoHPCProject ID
1910

HPC for candidate drug optimization using free energy perturbation calculations

Lead Optimization using HPC

Antonio Miranda

Free Energy Perturbation (FEP) calculations may reduce the time and cost of drug discovery. Specifically, they may help in optimizing the binding affinity between the drug candidate and the therapeutic target. To validate this methodology within the context of drug discovery, FEP calculations were performed and compared with experimental data. Results suggest that FEP has predictive ability and can be used for lead optimization but using HPC resources is crucial.

The cost of creating and launching one drug has been reported to be above US$1 billion. Indeed, there are even studies that estimate this cost at more than US$2.8 billion.1

Figure 1: Drug discovery cost. Data Source: Paul et al. (2010)7

Drug discovery has several phases (Figure 1). One of them, "lead optimization", may account for almost 25% of the total cost of all four phases, according to a study from 2010.1 This phase includes optimizing drug metabolism, pharmacokinetic properties, bioavailability, toxicity, and of course efficacy. Drug efficacy is directly dependent on the binding affinity of the candidate drug onto the pharmaceutical target, which is commonly a protein.

Lead Optimization using computational approaches

This project focuses on the optimization of the binding affinity, i.e. the strength of interaction of the drug candidate onto the target protein, using computational approaches. Specifically, the binding affinity of several modified compounds (analogs) of an original lead inhibitor of a target protein was studied using Molecular Dynamics (MD) simulations coupled with Free Energy Perturbation (FEP) calculations.

Free Energy Perturbation calculations

Free Energy Perturbation (FEP) calculations offer a promising tool for computing the relative binding affinity of two drug candidates, because they have a rigorous statistical mechanics background.2,3 In particular, in FEP calculations the free energy difference, ∆G, between compound A and compound B is computed as an average of a function of their energy (kinetic and potential) difference, obtained by sampling the reference state. Then, using a thermodynamic cycle (Figure 2), the relative binding free energy, ∆∆G, between two congeneric ligands A and B is calculated from the following equation: ∆∆G = ∆GB − ∆GA. If this approach results in ∆∆G < 0, it is concluded that ligand B is favored over ligand A (Figure 2).

Figure 2: FEP thermodynamic cycle. Source: Athanasiou et al. (2017)6

The use of FEP simulations in drug discovery has historically been restricted by the high computational demand, the limited force field accuracy and its technical challenges. However, in recent years, FEP calculations have benefited from advances in GPU and HPC availability, force-field research and the development of more elaborate simulation packages.2

Once these limitations have been overcome, it is time to assess whether FEP calculations using HPC resources are efficient for drug discovery. If this is the case, the time and cost of drug discovery could be reduced. This reduction could have implications for the final market drug price, which would eventually benefit the whole of society. Therefore, the scope of this work is the comparison of FEP simulation results with experimental results, as well as to assess whether using HPC resources would be efficient for FEP calculations. The comparison with experimental results has been done using several analogs of CK666, a micromolar Arp2/3 inhibitor. Arp2/3 is a protein involved in cell locomotion and a mediator of tumor cell migration (thus, it is involved in metastasis).4,5 The HPC resource used was the Greek national supercomputer, ARIS, from the GRNET facility.

Methods

For every ligand, there are two states: the ligand free in solution and the ligand bound to the protein in solution. Then, when computing ∆∆G, there are four states: ligand A in solution, ligand B in solution, the ligand A-protein complex in solution and the ligand B-protein complex in solution. In a FEP, or alchemical, calculation, instead of simulating the binding/unbinding processes directly (∆G′1 and ∆G′2 in Figure 2), which would require a simulation many times the lifetime of the complex, ligand A is alchemically transmuted into ligand B in solution or in the protein environment, through intermediate, non-physical states (∆G′A and ∆G′B in Figure 2). Because free energy is a state function, the thermodynamic cycle of Figure 2 can be used to compute the difference in the free energy of binding between ligand A and ligand B: ∆∆G = ∆G′B − ∆G′A = ∆G′2 − ∆G′1.

In the consecutive non-physical intermediates, the interactions of the atoms are progressively transformed from the reference to the final state. In the simulations performed during this work using the software GROMACS 5.1.4,8,9 21 non-physical intermediates were used, split into 21 windows. This results in performing 21 FEP calculations. During the first 10 windows, van der Waals and bonded interactions were switched on for the appearing atoms. During the last 10 windows, van der Waals and bonded interactions were already switched on and electrostatic interactions were switched on progressively.

The systems need to be prepared before one can perform the FEP calculations. The setup includes preparation of the structures, parameterization of the molecules, alignment of the ligands, complex formation, and solvation and neutralization. For the protein the AMBER ff14 force field10 was used, while for the ligands we used the GAFF2 force field.11 The ligands are aligned because it is assumed that their common core has the same binding mode. By having them aligned, there is more overlap between the ligands' potential energies and the FEP calculations are more likely to converge. All setup steps were performed using the FESetup tool.12 In addition, coordinate files for some of the simulated analogs were created by modifying the reference ligand with the Maestro interface of the Schrödinger software.13

After the setup, MD simulations were run on the ARIS supercomputer. MD simulations consist of four phases: energy minimization, two equilibration phases and production. Energy minimization is performed to relax the structures and guarantee there are no inappropriate geometries. Then, equilibration is performed to stabilize the temperature and pressure of the system. Finally, each of the 21 non-physical intermediate states described above was simulated for 5 ns during the production run.

At the end of the production simulation, the obtained potential and kinetic energies allow calculation of the free energy difference for each of the intermediate steps. In this work, the Bennett Acceptance Ratio was used to compute the free energy differences.14 The total free energy difference results from summing up the free energy differences of each intermediate step. Finally, once both legs were finished, ∆∆G was calculated.

Prior to performing the simulations, scaling curves were obtained in a previous phase of the project, and it was determined that complex-leg calculations should be run on 4 computing nodes. The solvent leg only required 1 computing node.

Results

The figures below show the correlation between ∆∆G values obtained experimentally and with simulations using Gromacs and GAFF2. These correlation plots show an agreement in the values for some analogs. However, 5 out of 11 analogs retrieved results with a discrepancy larger than 1 kcal/mol. The Root Mean Square Error (RMSE) was 1.42 kcal/mol and the Mean Unsigned Error (MUE) was 1.13 kcal/mol.

Results obtained using Gromacs and the GAFF2 force field for ligand parametrization have been compared to previous results with the same software package, using the same pipeline parameters but a different force field, GAFF. From the correlation plots, it is observed that the computed ∆∆G values are highly correlated with those of the previous experiments except in one case (RMSE was 0.31 kcal/mol, MUE was 0.24 kcal/mol and R² was 0.97 excluding that analog). For this outlier analog, the calculations using GAFF2 correlated better with the experimental results.

Figure 3. Left: Correlation plot between experimental and FEP results using Gromacs and GAFF2. Upper-right: correlation between experimental and FEP results with Gromacs and GAFF. Lower-right: correlation between FEP results using Gromacs and GAFF and Gromacs and GAFF2. This figure has fewer points because not all compounds analyzed with GAFF2 were analyzed with GAFF.

Discussion

The correlation plots show that, in 6 out of 11 analogs, FEP results correlate with experimental results within a 1 kcal/mol error. In general, the results obtained an RMSE of 1.42 kcal/mol and an MUE of 1.13 kcal/mol. This shows that FEP calculations may have predictive power and can be integrated in the drug discovery pipeline. With respect to further refinement, more work on force field parameters and on sampling techniques, the treatment of the protein-ligand environment and of chemical effects (e.g. buffer salt conditions or protonation states) may be required to further enhance the accuracy of FEP calculations. This study also shows that HPC resources are absolutely crucial for enabling high-throughput lead optimization with FEP.

It was previously mentioned that each window was simulated for 5 ns. Since there are 21 intermediate windows, the total simulated time is 105 ns per leg. For the solvent leg, every 5 ns simulation was run on 10 cores and took 6.5 hours. For the complex leg, 80 cores were used and every 5 ns simulation ran for 13.5 hours. This means that, if we were to run simulations for 100 analogs in 24 hours, we would need 84000 cores for the complex leg and 5250 cores for the solvent leg.

References

1 DiMasi, J.; Grabowski, H. and Hansen, R. Innovation in the pharmaceutical industry: New estimates of R&D costs. J. Health Econ., 2016, 47, pp. 20-33.

2 Cournia, Z.; Allen, B. and Sherman, W. Relative Binding Free Energy Calculations in Drug Discovery: Recent Advances and Practical Considerations. J. Chem. Inf. Model., 2017, 57(12), pp. 2911-2937.

3 Zwanzig, R. W. High-temperature equation of state by a perturbation method. I. Nonpolar gases. J. Chem. Phys. 1954, 22(8), 1420-1426.

4 Baggett, A. W.; Cournia, Z.; Han, M. S.; Patargias, G.; Glass, A. C.; Liu, S.-Y. and Nolen, B. J. Structural Characterization and Computer-Aided Optimization of a Small-Molecule Inhibitor of the Arp2/3 Complex, a Key Regulator of the Actin Cytoskeleton. ChemMedChem, 2012, vol. 7, pp. 1286-1294.

5 Hetrick, B.; Han, M. S.; Helgeson, L. A. and Nolen, B. J. Small Molecules CK-666 and CK-869 Inhibit Actin-Related Protein 2/3 Complex by Blocking an Activating Conformational Change. Chem. and Bio., 2013, vol. 20, pp. 701-712.

6 Athanasiou, C.; Vasilakaki, S.; Dellis, D. and Cournia, Z. Using physics-based pose predictions and free energy perturbation calculations to predict binding poses and relative binding affinities for FXR ligands in the D3R Grand Challenge 2. J. Comput. Aided Mol. Des. 2018 Jan; 32(1): 21-44.

7 Paul, S.; Mytelka, D.; Dunwiddie, C.; Persinger, C.; Munos, B.; Lindborg, S. and Schacht, A. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nat. Rev. Drug Discov. 2010 Mar; 9(3): 203-14.

8 Lindahl, E.; Hess, B.; van der Spoel, D. GROMACS 3.0: A Package for Molecular Simulation and Trajectory Analysis. J. Mol. Model. 2001, 7, 306-17.

9 Berendsen, H. J. C.; van der Spoel, D.; van Drunen, R. GROMACS: A Message-Passing Parallel Molecular Dynamics Implementation. Comput. Phys. Commun. 1995, 91, 43-56.

10 Maier, J. A.; Martinez, C.; Kasavajhala, K.; Wickstrom, L.; Hauser, K. E. and Simmerling, C. ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB. J. Chem. Theory Comput. 2015, 11, 8, 3696-3713.

11 Wang, J. M.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development and Testing of a General AMBER Force Field. J. Comput. Chem. 2004, 25, 1157-1174.

12 Loeffler, H. H.; Michel, J. and Woods, C. FESetup: Automating Setup for Alchemical Free Energy Simulations. J. Chem. Inf. Model. 2015, 55, 12, 2485-2490.

13 Schrödinger Release 2019-3: Maestro, Schrödinger, LLC, New York, NY, 2019.

14 Bennett, C. H. Efficient estimation of free energy differences from Monte Carlo data. J. Comput. Phys. 1976, 22, 245-268.

PRACE SoHPCProject Title
HPC for candidate drug optimization using free energy perturbation calculations

PRACE SoHPCSite
Biomedical Research Foundation of the Academy of Athens (BRFAA) and Greek Research and Technology Network (GRNET), Greece

PRACE SoHPCAuthors
Antonio Miranda, University Carlos III, Spain

PRACE SoHPCMentor
Zoe Cournia, BRFAA, Athens

PRACE SoHPCAcknowledgement
My thanks to Dr. Zoe Cournia and the Lab members at BRFAA for receiving me, to Dimitris Ntekoumes for all his support and suggestions, and to Dr. Dimitris Dellis at GRNET for his support and for providing access to the ARIS supercomputer.

PRACE SoHPCProject ID
1911
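As a back-of-the-envelope illustration of the free-energy bookkeeping used in this kind of study (the window values below are invented, not simulation output): each leg's total ∆G is the sum of its per-window differences, and ∆∆G is the difference between the two legs of the thermodynamic cycle.

```python
# Sketch of how per-window free energies combine (hypothetical numbers):
# each leg's total ∆G is the sum over its 21 window differences, and the
# relative binding free energy is the difference between the two legs.
complex_leg_windows = [0.31, -0.12, 0.05] * 7   # 21 made-up window ∆Gs (kcal/mol)
solvent_leg_windows = [0.28, -0.10, 0.04] * 7

dG_complex = sum(complex_leg_windows)   # ∆G'_2 in the thermodynamic cycle
dG_solvent = sum(solvent_leg_windows)   # ∆G'_1
ddG = dG_complex - dG_solvent           # ∆∆G = ∆G'_2 - ∆G'_1
print(f"ddG = {ddG:.2f} kcal/mol")      # negative would favour ligand B
```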

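The error metrics quoted in this article (RMSE and MUE between experimental and computed ∆∆G values) are straightforward to compute; a minimal sketch with made-up numbers, not the project's data:

```python
import math

# Hypothetical paired ∆∆G values (kcal/mol), for illustration only.
experimental = [0.5, -1.2, 2.0, -0.3, 1.1]
computed     = [0.9, -0.4, 1.2, -1.5, 1.6]

def rmse(a, b):
    """Root Mean Square Error between paired values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def mue(a, b):
    """Mean Unsigned (absolute) Error between paired values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

print(f"RMSE = {rmse(experimental, computed):.2f} kcal/mol")
print(f"MUE  = {mue(experimental, computed):.2f} kcal/mol")
```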
Hybrid Monte Carlo/Deep Learning Methods for Matrix Computation on Advanced Architectures

Deep Learning for Matrix Computations

Mustafa Emre Şahin

Solving or inverting systems of linear algebraic equations (SLAE) is significant in many engineering applications and scientific fields. In this project, the implementation of a stochastic gradient descent method for inverting SLAE is explained simply, together with High Performance Computing (HPC) methods on advanced architectures.

Have you ever noticed the waiting time for machines decreasing day by day in our daily life? Of course nobody wants to wait for slow automated machines or have slow smart phones, but do you remember how long you used to wait for old devices to start up?

How to express real-world problems to the machines?

We can describe problems with the help of linear algebra and solve them by expressing them as systems of linear algebraic equations. Machines understand things as numbers in a matrix, and any manipulation means matrix operations for the machines.

"Math is boring and hard."

Yes, everything is boring and hard while we are struggling to understand it. Let's start with a basic question:

"Why do we need matrix inversion?"

Let's assume the equation Ax = b, and we want to find x. To find x, both sides could be divided by A: x = b/A. However, you can't simply divide when A is a matrix, so you need another method to solve the problem. Since A⁻¹A = I, the inverse of the matrix can be used to solve the problem by multiplying from the left side: x = A⁻¹b. As you can see, matrix operations are not scary monsters, they are just different.

"What changes in our life when we solve for x?"

X is the hero of our daily lives. It is the voltage in electronic devices, the answer of many simulations like climate change or aerodynamics. Linear systems of equations are the foundation of science and technology.

Figure 1: For example, this is how we see images, but machines see them as numbers.

The Matrix Inversion

Let's recall the good old memories from high school: the inverse matrix can be calculated by using minors, cofactors and the adjugate. However, when the matrix size is large, this method is not very optimal. Iterative methods are important for the sake of rapidness while finding the inverse of the matrix. Assume solving a system of linear equations (Ax = b in matrix form) with b = I, by calculating Ax − b over and over again until a solution under the desired error is obtained. Still, there are too many steps while using iterative solvers, so their number can be reduced with the help of a preconditioner.[1]

Recipe for Matrix Inverse

A⁻¹A = I
x = A⁻¹b

In this project, the aim is improving the non-recovered output of "Markov chain Monte Carlo matrix inversion, (MC)²MI [2]" using a stochastic gradient descent method. Though the (MC)²MI method results in a bad inverse matrix, the inverse can be easily obtained with a time-costly recovery process.[1]

Figure 2: Outputs of the (MC)²MI method, left: Non-Recovered Inverse, right: Recovered Inverse. [1]

Figure 3: Left: Implementation of the mSGD algorithm on the non-recovered inverse of the (MC)²MI. Middle: Speed-up graph of non-uniform mSGD with batches. Right: The time consumption of different parts of the code per process.

"What is Stochastic Gradient Descent (SGD)?"

Gradient descent is an iterative method used to minimise an objective function. If a randomly selected part of the samples is used in every iteration of gradient descent, it is called stochastic gradient descent. Let's check the recipe for the method in our case:

• Assume the system Ax = b, where b is the identity matrix, so that x = A⁻¹. Since the aim is to estimate a solution with low error Ax − b, the objective function to be minimised is F(x) = (1/2m)‖Ax − b‖², where m is the number of rows of A.
• After uniform random selection of a row i in every iteration, the objective function becomes f_i(x) = (1/(2 m p_i)) (A_{i*}x − b_i)².
• If we want delicious stochastic gradient descent, we should add some gradient to our objective function: ∇f_i(x) = (1/(m p_i)) A_{i*}^*(A_{i*}x − b_i), where A_{i*}^* is the conjugate transpose of the row A_{i*}.
• For every round (iteration) k, we update x as x_{k+1} = x_k − α ∇f_i(x_k), where the learning rate α determines the proportion of new information that will be added.

As in deep learning, the stochastic gradient descent algorithm of the paper [3] is implemented in Python to test our case. The mSGD method of that paper assumes a sparse matrix A that is full rank and overdetermined (number of rows ≫ number of columns) in the linear system Ax = b. The ratio p of the number of non-zero elements over the total number of elements of A is also included in the algorithm, to estimate a better solution. With the magic of the math (details are in [3]), the updated x becomes

x_{k+1} = x_k − α( (1/p²) A_{i*}^*(A_{i*}x_k − b_i) − ((1 − p)/p²) diag(A_{i*}^* A_{i*}) x_k ),

where diag(·) keeps only the diagonal matrix of its input.

"Implementing mSGD to the (MC)²MI"

Although the matrix A is square (number of rows = number of columns) in our case, the method successfully decreases the error (Figure 3) when MaximumSteps = 2⁴, α = 4 × 10⁻⁴, and the non-recovered inverse A⁻¹ from (MC)²MI is given as the initial solution x to the system. After successful results in Python, the algorithm was implemented in C++ with the Eigen library on the Scafell Pike X1000 supercomputer.

"Life is not fair. Why should random selection be fair?"

In mSGD, the rows are selected with uniform probability. The results are slightly better (Figure 3) if the probability of selecting a row is instead proportional to the norm of the row.

"Last touch: adding batches to the parallelization"

Stochastic gradient descent and batching are as like as two peas in a pod. Instead of using the whole matrix A for the method, the rows are divided into subsets, one for each process, in a hybrid MPI/OpenMP parallelization. When the rows cannot be divided equally, it is a good trick to give fewer rows to the master process, since it has more work than the others.

Discussion and Conclusion

I successfully implemented and parallelized the Stochastic Gradient Descent method for the (MC)²MI algorithm, with a non-uniform random distribution, batching and a successful speed-up. As the number of processes increases, the communication takes more time than the rest, however the speed-up improves, as shown in the middle and right images of Figure 3. Further acceleration methods for the algorithm can be investigated by future generations of Summer of HPC participants.

References
1. Lebedev, A. (2017). Accelerating stochastic methods for the SLAE. SoHPC Final Reports.
2. Alexandrov, V. N. (1998). Efficient parallel Monte Carlo methods for matrix computations.
3. Ma, A., Needell, D. (2017). Stochastic Gradient Descent for Linear Systems with Missing Data. arXiv:1702.07098.

PRACE SoHPC Project Title: "The Cherry on the Cake": Hybrid Monte Carlo/Deep Learning Methods for Matrix Computation on Advanced Architectures
PRACE SoHPC Site: STFC Hartree Centre, United Kingdom
PRACE SoHPC Authors: Mustafa Emre Şahin, İzmir Institute of Technology, Turkey
PRACE SoHPC Mentor: Prof. Vassil Alexandrov, STFC Hartree Centre, UK
PRACE SoHPC Acknowledgement: I am deeply thankful to my project mentor Prof. Dr. Vassil Alexandrov and my shifu Anton Lebedev. Many thanks to Assoc. Prof. Dr. Leon Kos for his help during problematic times. I am grateful to my mentors Assist. Prof. Dr. Işıl Öz and Prof. Dr. Ferit Acar Savacı for helping me build my personality and career.
PRACE SoHPC Project ID: 1912
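As an illustrative sketch (pure Python, function name and data layout mine, not the project's C++/Eigen code), the SGD update above can be written as follows. For a dense matrix the density p is 1, so the mSGD correction term vanishes and the update reduces to the plain x_{k+1} = x_k − α A_{i*}^*(A_{i*}x_k − b_i):

```python
import random

def msgd_inverse_column(A, e, alpha=4e-4, steps=2000, seed=0):
    """Sketch of the mSGD update (Ma & Needell, 2017) applied to one
    column of A^-1: solve A x = e for a unit vector e.
    p is the fraction of non-zero entries of A (its density)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    nnz = sum(1 for row in A for v in row if v != 0)
    p = nnz / (m * n)
    x = [0.0] * n  # the article instead seeds x with the (MC)^2MI estimate
    for _ in range(steps):
        i = rng.randrange(m)                 # uniform row selection
        Ai = A[i]
        r = sum(Ai[j] * x[j] for j in range(n)) - e[i]   # A_i x - b_i
        for j in range(n):
            grad = (1.0 / p**2) * Ai[j] * r \
                 - ((1.0 - p) / p**2) * Ai[j] * Ai[j] * x[j]
            x[j] -= alpha * grad
    return x
```

For a small dense 2x2 system the iterate converges to the corresponding column of A⁻¹; in the real setting the non-uniform row selection and batching described above would replace the uniform choice of i.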

Performance evaluation of a dissipative particle dynamics code for very large systems using the latest hybrid CPU-GPU architectures. Billion bead baby

Davide Di Giusto

Dissipative particle dynamics simulations allow us to overcome the length-scale limits of molecular dynamics, while maintaining the inner, discrete nature of matter. To simulate realistic applications, a huge system is necessary, with a number of particles of the order of billions. This project aims to extend the current size limits of the DL Meso DPD code and scale it on a large GP-GPU architecture.

The Dissipative Particle Dynamics (DPD) method provides a powerful tool to simulate physical phenomena of common interest. The approach implies the definition of a particle size, which can include several small molecules or a few more complex ones. The action of three forces, conservative, dissipative and stochastic, allows this method to be valid for the simulation of simple and complex fluids. Finally, DPD deals with the mesoscale, between the nanoscale, which describes the size of the molecules, and the microscale, where the description of continuum media starts. Potentially, large scale simulations based on DPD can try to overcome the distance that separates the mesoscale and the microscale, while still maintaining features derived from the particle size, closer to the nanoscale of matter: this could benefit both the research and industrial worlds.

The DL Meso DPD code (DL Meso) implements the Dissipative Particle Dynamics method and, thanks to the usage of GPU accelerators, we will explore its size limits and extend them, in order to perform large scale simulations. The DL Meso DPD code is written in Fortran but offers the possibility to transfer the calculations to GPUs, as the main loop has been ported to CUDA C. Let us consider the original version of the code. The aim of the project was to target its size limitations, so as to extend the simulation possibilities. From previous tests, the largest number of beads that had been deployed was two billion. Not bad at all, we could say! But we should be ambitious in our work, so we tried performing a three billion bead simulation. For it to run, several quantities needed to be stored in the machine memory in the corresponding variables, defined by a size and a type. Moreover, many of these were directly proportional to the number of beads stocked for the simulation, therefore we assumed straightforwardly that this parameter was causing integer overflow for the mentioned variables. In computer programming, overflow occurs when an integer value is stored in a variable whose memory allocation is not big enough for the current necessity. For example, if an integer value exceeds the typical range of a 4-byte integer, which goes from -2147483648 to 2147483647, the code ends up improperly accessing the system memory, leading to its termination. For sure this was happening for the variable that stocked the total number of beads in the system (3 billion), but sneakier ones were hiding in the scripts.
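The wrap-around can be illustrated from Python with a fixed-width integer type (a sketch for illustration only; DL Meso itself uses Fortran/C default integers, not Python):

```python
from ctypes import c_int32

# A bead count of 3 billion does not fit in a signed 4-byte integer
# (range -2147483648 .. 2147483647); c_int32 mimics the wrap-around a
# default Fortran/C integer would suffer.
beads = 3_000_000_000
wrapped = c_int32(beads).value
print(wrapped)  # -1294967296: a negative, meaningless bead count
```

Any array index or loop bound derived from such a count then points outside the allocated memory, which matches the terminations observed above.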

Figure 1: NVIDIA Tesla P100 representation: it is possible to observe the multitude of NVIDIA CUDA cores, organised in streaming multiprocessors. The huge number of cores determines its inner parallelism. (Source: Link)

The complexity of this problem, together with the articulate structure of the code, required proper tools for debugging. The Eclipse™ integrated development environment and the Nsight™ Eclipse Edition were chosen, as they allow a visual exploration of the code lines while running in debug mode: in this way, it was possible to understand which variables needed to be changed and which ones could remain of the same integer type. Once the code was ready to go, we could start the strong scaling experiment. Basically, consecutive runs of the same code were performed: the first was the serial one, which set the reference time required for its completion. Then, the parallel executions were performed, doubling the computational resources each time. This was possible because the DL Meso DPD code is prone to parallel execution, following the typical domain decomposition implemented using MPI libraries: in essence, the system volume is split equally between the processes, while the particles jump from one to the other as they move.

Figure 2: Typical domain decomposition for 4 parallel processes: each colour indicates a different process identification number.

The comparison between the execution times returns the speedup as a function of the number of parallel executions. When the speedup is ideal, meaning that with twice the processes our execution takes half the time, the code is said to scale well; but the ideal scaling is undermined by the communication between parallel processes. As mentioned before, the particles move through the system volume, while the parallel decomposition scheme remains static as the simulation progresses in time. Therefore, at every time-step messages need to be sent between all the processes, in order to redistribute the beads according to their most recent positions. Obviously, the larger the number of parallel executions, the larger the total amount of communication required. Eventually, while scaling our code, the communication time becomes comparable with the time spent on the calculations, resulting in a deviation from the ideal speedup.

Figure 3: DL Meso code scaling for the MPI-only version. For a small number of particles the system does not scale well.

At this point, we just needed a proper computer to work with. Our basic necessity was a large GPU architecture. Luckily, we could obtain access to Piz Daint: this is the largest supercomputer in Europe, currently 6th in the Top 500 list, managed by the Swiss National Supercomputing Centre (CSCS). Piz Daint offers 5704 hybrid nodes, each equipped with one 12-core CPU and one NVIDIA® Tesla® P100. These are indeed terrific numbers, but we probably need to better define the difference between a CPU and a GPU. Whereas a CPU is the general electronic circuit which runs a computer, and can be multicore, already presenting a certain attitude towards parallelism, a GPU is a specialised device, highly efficient in the manipulation of large data blocks. Originally GPUs were created to accelerate image processing, but eventually their potential in High Performance Computing became clear. Since then, they have been deployed as accelerators in the majority of supercomputers.

Figure 5: Representation of the running configuration: each electronic circuit consists of one single-task CPU-GPU hybrid node.

Figure 4: DL Meso DPD code scaling on hybrid CPU-GPU nodes for a total number of particles of: left) 300 million; right) 3 billion.

Table 1: CPU vs GPU acceleration (time elapsed, in seconds)

Nodes | CPU     | GPU
1     | 3401.98 | 195.82
2     | 2264.75 | 147.13
16    | 334.06  | 75.22
32    | 167.28  | 36.27

In Table 1 it is possible to appreciate the acceleration granted by the GPUs. Our running configuration was the following: we wanted one GPU per process, so we would demand single-task CPU-GPU hybrid nodes. In figures 4 and 6 the scaling of the DL Meso DPD code is plotted, for system sizes of 3 million particles in figure 6, and of 300 million and 3 billion in figure 4, left and right respectively. We can notice that the reference case is not always the serial version. This craftiness is necessary, as the GPU memory is limited to 16 GB, whereas the size of the arrays allocated during execution can be larger.

Figure 6: DL Meso DPD code scaling on hybrid CPU-GPU nodes for 3 million particles.

The obtained speedups show that the code scales well for large numbers of beads. The 3 billion case scales almost ideally up to 4000 nodes, meaning this kind of big system simulation can be accelerated efficiently. The result for the 300 million case shows good scaling too, providing also an indication of the match between job size and number of parallel tasks. Instead, when using only 3 million particles, the code does not scale well. In conclusion, we have acknowledged that GPU accelerators are a great opportunity for high performance computing, as they can outclass the GPU-less applications. Still, even on the hybrid CPU-GPU nodes the code can scale non-ideally. This is not straightforward: the GPUs are voracious in terms of total number of particles, as they require large amounts, but grant brilliant speedup. Eventually, the larger the number of particles stocked, the larger the number of nodes whose usage leads to good scaling: this is really meaningful, as it confirms the possibility of efficiently accelerating large scale simulations based on the Dissipative Particle Dynamics method. Moreover, the scaling possibilities are limited by the minimum number of nodes required to run the first experiment. Clearly, this limitation is set by the GPUs, as they possess 16 GB of memory, pretty low when compared to the CPU memory, which counts up to 64 GB per node on Piz Daint.

The results we obtained during these two months suggest the possibility of efficiently running large system size DPD simulations. This outcome can eventually benefit both the industrial world, as more realistic experiments could be performed on computers, reducing costs and environmental footprint, and the research world, opening new perspectives of study. Future developments could try performing simulations for even larger numbers of particles. Having already tried running simulations for 6 billion particles, the communication between processes can become critical and lead to failure, therefore this should be kept into account. Moreover, some other features could be scaled, like the Fast Fourier Transform algorithm, which can be tricky on large hybrid architectures.

I would like to specially thank CSCS for the preparatory access (request class 2010PA5022) that allowed me to use Piz Daint, and for the support during the whole summer.

References
1. M. A. Seaton, R. L. Anderson, S. Metz and W. Smith (2013). DL-MESO: highly scalable mesoscale simulations, Mol. Sim. 39 (10) 796-821.

PRACE SoHPC Project Title: Scaling the Dissipative Particle Dynamic (DPD) code, DL MESO, on large multi-GPGPUs architectures
PRACE SoHPC Site: STFC Hartree, United Kingdom
PRACE SoHPC Contact: Davide Di Giusto, University of Udine, Italy
PRACE SoHPC Mentor: Jony Castagna, STFC Hartree, U.K.
PRACE SoHPC Software applied: Eclipse® (IDE), Nvidia Nsight®
PRACE SoHPC More Information: Nvidia developer blog
PRACE SoHPC Acknowledgement: I would like to thank Leon Kos for the guidance that he has granted all summer long; moreover, I am really thankful to all the people at PRACE for the organisation of the SoHPC. Then, I have nothing but gratitude for my mentor Jony Castagna, who is somebody to look up to and managed to teach me so much in little time. Finally, I would like to thank Nia for the organisation, Vassil Alexandrov and Luke Mason and all the people at the Hartree centre for the continuous support and kindness.
PRACE SoHPC Project ID: 1913
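As a quick illustration (script and function names are mine), the speedup and parallel efficiency implied by the timings in Table 1 can be checked in a few lines:

```python
# Timings from Table 1 (seconds): node counts and elapsed times for the
# CPU-only and the hybrid CPU+GPU runs.
nodes = [1, 2, 16, 32]
cpu = [3401.98, 2264.75, 334.06, 167.28]
gpu = [195.82, 147.13, 75.22, 36.27]

def speedup(times):
    """Speedup of each run relative to the single-node reference."""
    return [times[0] / t for t in times]

def efficiency(times, procs):
    """Achieved fraction of the ideal (linear) speedup."""
    return [s / p for s, p in zip(speedup(times), procs)]

# On a single node, the GPU run is roughly 17x faster than the CPU run.
print(cpu[0] / gpu[0])
```

The drop of efficiency below 1 as the node count grows is the communication effect discussed in the text.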

Investigating possibilities to accelerate applications deployed on the edge that utilize Deep Neural Networks. Switching up Neural Networks

Igor Kunjavskij

Using the Intel Movidius Neural Compute Stick, I researched the possibilities of switching models during runtime.

Smart devices permeate the fabric of our daily lives, be it the phone in everybody's pocket or the smart watches some of us wear. But not only our daily lives are filled with these devices; they also start to appear in areas which used to be inaccessible to humans. This could be a device attached to satellites in space running earth observation missions, or underwater robots conducting maintenance on critical structures such as underwater pipelines. In these areas, small devices tend to take on more and more compute intensive tasks, such as processing the images that they capture. But computer vision and image processing encompass in general very compute intensive tasks, even more so if Deep Neural Networks are involved. Lightweight devices that do not provide too much compute power, as for example the Raspberry Pi, can quickly become overwhelmed by these tasks. A solution to this problem would be to just send the data that a device collects to a data center. But given the fact that these devices are deployed in regions with restricted connectivity, sending data back and forth for processing purposes is not a viable option. That's where the Intel Movidius Neural Compute Stick comes into play, an accelerator in USB format that alleviates a lot of those problems, making it possible for images to be processed on the edge, on the device itself.

Figure 1: The Intel Movidius Neural Compute Stick.

Intel Movidius Neural Compute Stick

The Intel Movidius Neural Compute Stick was developed specifically for computer vision applications on the edge, with dedicated visual processing units at its disposal. Once a Deep Neural Network has been trained on powerful hardware for hours or sometimes days, it needs to be deployed somewhere in a production environment to fulfill its purpose. This can be, for example, in computer vision applications on the edge, which can encompass detecting pedestrians or other objects in images, or classifying objects in detected images, such as different road signs. The deployment of such an application can happen on a server in a data center, an on-premise computer or an edge device. Now the crux with the latter is that applications such as computer vision tend to place quite a high computational load on hardware, whereas edge devices tend to be lightweight and battery powered platforms. It is of course possible to circumvent this problem by streaming the data that an edge device captures to a data center and processing it there. But the next issue, which the term "edge device" already implies, is that these computing platforms are situated in quite inaccessible regions, like on an oil rig in the ocean or attached to a satellite in outer space. Transmitting data is in these cases costly and comes with a lot of latency, rendering real time decision making nearly impossible; especially if this data encompasses images, which as a rule of thumb tend to be large. That's where hardware accelerators like the Neural Compute Stick bring in a lot of value. Instead of sending images or video to some server for further analysis, the processing of data can happen on the device itself, and only the result, i.e. pure text, is sent for storage or statistical and visualization purposes

to some remote location. Such an approach brings numerous benefits, but most importantly alleviates latency and communication bandwidth related concerns. Now what is this Neural Compute Stick all about? The Intel Movidius Neural Compute Stick is a tiny fanless device that can be used to deploy Artificial Neural Networks on the edge. It is powered by an Intel Movidius Vision Processing Unit, which comes equipped with 12 so called "SHAVE" (Streaming Hybrid Architecture Vector Engine) processors. These can be thought of as a set of scalable independent accelerators, each with their own local memory, which allows for a high level of parallel processing of input data.

OpenVino

Complementing this stick is the so called Open Visual Inferencing and Neural Network Optimization (OpenVino) toolkit, a software development kit which enables the development and deployment of applications on the Neural Compute Stick. This toolkit basically abstracts away the hardware that an application will run on and acts as a "mediator". To make use of the Neural Compute Stick, it is necessary to first install this toolkit, which I briefly introduced in my last post. It aims at speeding up the deployment of neural networks for visual applications across different platforms, using a uniform API.

Training a Deep Neural Network is like taking a "black box" with a lot of switches and then tuning those for as long as it takes for this "black box" to produce acceptable answers. During this process, such a network is fed with millions of data points, and using these data points the switches of the network are adjusted systematically, so that it gets as close as possible to the answers we expected. This process is computationally very expensive, as data has to be passed through the network millions of times. For such tasks GPUs perform very well and are the de facto standard when it comes to hardware used to train large neural networks, especially for tasks such as computer vision. As for the frameworks used to train Deep Neural Networks, there are a lot that can be utilized, like Tensorflow, PyTorch or MxNet. All of these frameworks yield a trained network in their own inherent file format. After the training phase is completed, a network is ready to be used on yet unseen data to provide answers for the task it was trained for. Using a network to provide answers in a production environment is referred to as inference. In its simplest form, an application will just feed a network with data and wait for it to output the results. However, while doing so there are many steps that can be optimized, which is what so called inference engines do.

Now where does the OpenVino toolkit fit in concerning both tasks of training and deploying a model? The issue is that using algorithms like Deep Learning is not only computationally expensive during the training phase, but also upon deployment of a trained model in a production environment. Eventually, the hardware on which an application utilizing Deep Learning runs is of crucial importance. Neural Networks, especially those used for visual applications, are usually trained on GPUs. However, using a GPU for inference in the field is very expensive and doesn't pay off in combination with an inexpensive edge device. It is definitely not a viable option to use a GPU that might cost a few hundred dollars to build a surveillance camera or a drone. When it comes to using Deep Neural Networks in the field, most of them are actually running on CPUs. Now there are a lot of platforms out there using different CPU architectures, which of course adds to the complexity of developing an application that runs on a variety of these platforms. That's where OpenVino comes into play: it solves this problem by providing a unified framework for development which abstracts away all of this complexity. All in all, OpenVino enables applications utilizing Neural Networks to run inference on a heterogeneous set of processor architectures.

The OpenVino toolkit can be broken down into two major components, the "Model Optimizer" and the "Inference Engine". The former takes care of the transformation step that produces an optimized Intermediate Representation of a model, which is hardware agnostic and usable by the Inference Engine. This implies that the transformation step is independent of the future hardware that the model will run on; it solely depends on the model to be transformed. Many pre-trained models contain layers that are important for the training process, such as dropout layers. These layers are useless during inference and might increase the inference time. In most cases, these layers can be automatically removed from the resulting Intermediate Representation. Even more so, if a group of layers can be represented as one mathematical operation, and thus as a single layer, the Model Optimizer recognizes such patterns and replaces these layers with just one. The result is an Intermediate Representation that has fewer layers than the original model, which decreases the inference time. This Intermediate Representation comes in the form of an XML file containing the model architecture and a binary file containing the model's weights and biases.

After using the Model Optimizer to create an Intermediate Representation, the next step is to use this representation in combination with the Inference Engine to produce results. The Inference Engine is, broadly speaking, a set of utilities that allow running Deep Neural Networks on different processor architectures. This way a developer is capable of deploying an application on a whole host of different platforms, while using a uniform API. This is made possible by so called "plugins". These are software components that contain complete implementations of the inference engine for a particular Intel device, be it a CPU, an FPGA or a VPU, as in the case of the Neural Compute Stick. Eventually, these plugins take care of translating calls to the uniform, platform independent API of OpenVino into hardware specific instructions. This API encompasses capabilities to read in the network's Intermediate Representation, manipulate network information and, most importantly, pass inputs to the network once it is loaded onto the target device and get outputs back again. A common workflow to use the inference engine includes the following steps:

1. Read the Intermediate Representation – read an Intermediate Representation file into the application, which then holds the network in the host memory. This host can be a Raspberry Pi or any other edge or computing device running an application utilizing the inference engine from OpenVino.
2. Prepare the input and output formats – after loading the network, specify the input and output precision and the layout of the network.
3. Create an Inference Engine Core object – this object allows an application to work with different devices and manages the plugins needed to communicate with the target device.
4. Compile and load the network to the device – in this step the network is compiled and loaded to the target device.
5. Set the input data – with the network loaded onto the device, it is now ready to run inference. The application running on the host can send an "infer request" to the network, in which it signals the memory locations for the inputs and outputs.
6. Execute – with the input and output memory locations now defined, there are two execution modes to choose from:
   (a) Synchronously, to block until an inference request is completed.
   (b) Asynchronously, to check the status of the inference request while continuing with other computations.
7. Get the output – after the inference is completed, get the output memory.

Another problem that remains, though, even with the Neural Compute Stick accelerating computations, is that once a model is deployed on an edge device, it is hard to repurpose the device to run a different kind of model. This is where the "dynamic" part of the title of my project comes into play. The Neural Compute Stick is a highly parallelized piece of hardware, with twelve processing units independently running computations. This way, it is possible to have not only one, but several models loaded into its memory. Even more so, it is possible to switch these models and load new models into memory. This allows adapting to new situations in the field, like when the feature space changes or the things that are supposed to be detected change. The simplest case of such an occurrence might be the sun setting or bad weather conditions coming up. Another motivation to switch models might also be to save power, as edge devices tend to have limited capacities when it comes to energy sources. Instead of deploying one big model that is supposed to cover all cases that could occur in a production environment, it would be possible to have many small models that could be loaded in and out of memory at runtime.

Model switching prototype

In my project I investigated the feasibility of doing so and implemented a small prototype that switches models at runtime. This prototype switches between two models, detecting human bodies and faces. These models are both so called Single Shot Detector (SSD) MobileNets, networks that are better suited to be deployed on smaller devices and that localize and classify an object in a single pass through the network, drawing bounding boxes around it. I used OpenCV for this task, which is a library featuring all sorts of algorithms for image processing. Next to OpenCV I had OpenVino running as a backend, the toolkit accompanying the Neural Compute Stick, which allows utilizing it in an optimized manner. I eventually tested this model switching prototype by loading and offloading the models in and out of the memory of the Neural Compute Stick. I did this with a very high frequency of one switch per frame, to determine what the latency of such a model switch would be in a worst-case scenario. The switching process includes reading the input and output dimensions of a model from the XML representation of its architecture and then loading it into the memory of the Neural Compute Stick. On average this switch caused an extra overhead of about 14 percent of the overall runtime. To put this into absolute numbers: on average it took my application half a second to capture and generate an output for an image, whereas a model switch in between would add a little less than a tenth of a second to this time.

Of course, there is a lot of room for improvement given these numbers. One such improvement concerns the parsing of the model dimensions. I used a simple XML parser and had to read the input and output dimensions of a model on every switch. Doing this once, for all models that will potentially be used on the Neural Compute Stick, when the application starts running, and saving the dimensions into a lookup table, could cut the switch time almost in half. A further speedup of the switch could be achieved by conducting it asynchronously: while a model is loaded onto the Neural Compute Stick, the next frame can already be captured, instead of waiting for the switching process to finish. All in all, I found that although in its current state this prototype would not yet be applicable to real time applications, given the potential for improvement it could get there. And if no hard real time conditions are imposed, as is the case for many applications, it is deployable already.

Figure 2: Example of using an SSD MobileNet in combination with the Neural Compute Stick.

PRACE SoHPC Project Title: Dynamic Deep Learning Inference on the Edge
PRACE SoHPC Site: Irish Centre for High-End Computing, Ireland
PRACE SoHPC Authors: Igor Kunjavskij, University of Stuttgart, Germany
PRACE SoHPC Mentor: Venkatesh Kannan, ICHEC, Ireland
PRACE SoHPC Contact: Igor Kunjavskij, University of Stuttgart, E-mail: [email protected]
PRACE SoHPC Software applied: Intel OpenVino
PRACE SoHPC More Information: https://software.intel.com/en-us/openvino-toolkit
PRACE SoHPC Acknowledgement: I would like to thank my supervisors for their amazing support throughout this whole project, and in general the staff at ICHEC for welcoming me and making this stay such a great experience!
PRACE SoHPC Project ID: 1914
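The logic of the switching prototype described above can be sketched with stand-ins for the OpenVino calls. The IR snippets, tag layout and function names below are invented for illustration only (a real Intermediate Representation is far richer, and loading would target the Neural Compute Stick rather than a Python dict):

```python
import time
import xml.etree.ElementTree as ET

# Minimal stand-ins for two Intermediate Representations: only the input
# dimensions matter for this sketch.
FACE_IR = """<net name="face-ssd">
  <input><dim>1</dim><dim>3</dim><dim>300</dim><dim>300</dim></input>
</net>"""
BODY_IR = """<net name="body-ssd">
  <input><dim>1</dim><dim>3</dim><dim>300</dim><dim>300</dim></input>
</net>"""

def input_dims(ir_xml):
    """Parse the input dimensions from the (simplified) IR XML."""
    root = ET.fromstring(ir_xml)
    return [int(d.text) for d in root.find("input")]

def load_model(ir_xml):
    """Stub for parsing dimensions and loading a network onto the device."""
    return {"name": ET.fromstring(ir_xml).get("name"),
            "dims": input_dims(ir_xml)}

def run(frames):
    """Worst case measured in the text: switch models on every frame,
    and account for the time spent switching versus inferring."""
    switch_time = infer_time = 0.0
    for i, frame in enumerate(frames):
        t0 = time.perf_counter()
        model = load_model(FACE_IR if i % 2 == 0 else BODY_IR)
        t1 = time.perf_counter()
        _ = (model, frame)  # stub inference: a real app would infer here
        switch_time += t1 - t0
        infer_time += time.perf_counter() - t1
    return switch_time, infer_time
```

Caching the result of `input_dims` in a lookup table keyed by model name, instead of re-parsing on every switch, is exactly the improvement estimated above to cut the switch time almost in half.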

Distributed Memory Radix sort (DiMeR) Fastest sorting algorithm

Jordy Ajanohoun

With today's large-scale data and large-scale problems, parallel computing is a powerful ally and important tool. Parallel sorting is one of the important components of parallel computing. The parallel Radix sort allows us to reduce the sorting time and to sort more data, amounts that can't be sorted serially. Indeed, when we want to sort a huge amount of data, it may not fit on one single computer, because computers are memory bounded. Or it may take too much time using only one computer, time we can't afford. Thus, for memory and time reasons, we can use distributed systems, with our data distributed across several computers, and sort them that way. But how?

Introduction

To make our applications and programs run faster, it is im- Indeed, it is become rare to work with data without hav- portant to know where in our programs computers spend ing to sort them in any way. On top of that, we are now in the most of the execution time. Then we have to understand era of Big Data which means we collect and deal with more why and finally, we figure out a solution to improve that. and more data from daily life. More and more data to sort. This is how HPC works. Plus, sorting algorithms are useful to plenty more complex Donald E. Knuth wrote in his book Sorting and Searching: algorithms like searching algorithms. Also, on websites or “Computer manufacturers of the 1960’s estimated that more software, we always sort products or data by either date or than 25 percent of the running time of their computers was price or weight or whatever. It is a super frequent operation. spent on sorting, when all their customers were taken into ac- The question is how can we improve that? What kind of count. In fact, there were many installations in which the task improvement can be done regarding sorting algorithms to go of sorting was responsible for more than half of the computing always faster? This is where the Radix sort and my project time”. come into play. In 2011, John D. Cook, PhD added: “Computing has changed since the 1960’s, but not so much that sorting has gone from It has been proved that for a comparison based sort, (where being extraordinarily important to unimportant.” nothing is assumed about the data to sort except that they

can be compared two by two), the complexity lower bound is O(N log N). This means that you cannot write a sorting algorithm which both compares the data to sort them and has a complexity better than O(N log N).

The Radix sort is a non-comparison-based sorting algorithm that can run in O(N). Unfortunately, it is not used very often, even though, when well implemented, it is the fastest sorting algorithm for long lists. When sorting short lists, almost any algorithm is sufficient, but as soon as there is enough data to sort, we should choose carefully which one to use: the potential time gain is not negligible. It is true that "enough" is quite vague, but roughly, a length of 10,000 elements is already enough to feel the difference between an appropriate and an inappropriate sorting algorithm.

Currently, Quicksort (a comparison-based sort) is probably the most popular sorting algorithm. It is known to be fast enough in most cases, although its complexity is O(N log N). The Radix sort has a complexity of O(N·d). Thus, Radix sort is faster as soon as the number of key digits (the parameter d) is smaller than log(N). This means that the larger our list is, the more we should use the Radix sort. To be more precise, from a certain length onwards, the Radix sort will be more efficient than any comparison-based sort.

My project was to implement a reusable C++ library with a clean interface that uses Radix sort to sort a distributed array using MPI.

Since each base-256 digit of a key corresponds to one byte, the parameter d is fixed and known. For instance, we can write a Radix sort function which sorts int16_t (integers stored on two bytes), knowing in advance (while writing the sort function) that all the numbers will be composed of two base-256 digits. Moreover, with template-enabled programming languages like C++, it is straightforward to make it work with all the other integer sizes (int8_t, int32_t and int64_t) without duplicating the function for each of them. From now on, we assume that we use base 256.

Why use Counting or Bucket sort? Because we take advantage of knowing all the values a byte can take, and that range is small: the value of one byte is between 0 and 255. This helps us because, in such conditions, Counting sort and Bucket sort are simple and fast. They can run in O(N) when the length of the list is greater than the maximum of the list (in absolute value), and especially when this maximum is known in advance. When it is not known in advance, it is trickier to make the Bucket sort run in O(N) than the Counting sort. They can run in O(N) because they are not comparison-based sorting algorithms, which is also why the Radix sort can. In the Radix sort, we always sort according to one byte, so the maximum we can have is 255. If the length of the list is greater than 255, which is a very small length for an array, the Counting and Bucket sorts inside the Radix sort can easily be written with O(N) complexity.
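As an illustrative sketch (in Python for brevity; the project's library is written in C++, and the function name here is mine, not the project's), a Counting sort keyed on a single byte runs in O(N + 256), i.e. O(N), no matter how large the key values themselves are:

```python
def counting_sort_by_byte(a, i):
    """Stable O(N) Counting sort of non-negative integer keys by their
    i-th base-256 digit: the possible digit values are only 0..255."""
    count = [0] * 256
    for x in a:
        count[(x >> (8 * i)) & 0xFF] += 1
    # prefix sums give each digit value its starting offset in the output
    offsets = [0] * 256
    for d in range(1, 256):
        offsets[d] = offsets[d - 1] + count[d - 1]
    out = [0] * len(a)
    for x in a:  # traversal order is preserved within a digit => stable
        d = (x >> (8 * i)) & 0xFF
        out[offsets[d]] = x
        offsets[d] += 1
    return out

counting_sort_by_byte([258, 1, 513, 2, 256], 0)  # -> [256, 1, 513, 258, 2]
```

Note that only the low byte is compared in the usage line above (258 and 513 keep their relative order because the sort is stable), which is exactly the property the Radix sort relies on.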

Why not use only either Counting or Bucket sort to sort the data all at once? Because we would no longer have our assumption about the maximum, as we would no longer be sorting according to one byte. The maximum of the list cannot be known in advance and we have no upper bound. In such conditions, the complexity of Counting sort is O(N + k), and Bucket sort can be worse depending on the implementation, where k is the maximum of the list (in absolute value). In contrast, with the Radix sort we have O(N·d), which is equivalent to O(N) because d cannot be greater than 8 (the biggest numbers are usually stored on 8 bytes in computers). In other words, since the Radix sort iterates through the bytes and always sorts according to one byte, it is insensitive to the parameter k, because we only care about the maximum value of one byte. This is unlike both the Counting and Bucket sorts, whose execution times are highly sensitive to the value of k, a parameter we rarely know in advance. That is dangerous, because the maximum value can be a big number. For instance, if we have only 1,000 numbers to sort and the maximum is 1,000,000, with Counting and Bucket sorts the sorting time is more or less the same as for sorting a list of 1,000,000 elements. This is not the case with the Radix sort: sorting a list of 1,000 elements with it will always take much less time than sorting 1,000,000 elements.

Serial Radix sort

Radix sort takes in a list of N elements whose sorting keys are expressed in a base b (the radix), such that each key has d digits. For example, three digits are needed to represent decimal 255 in base 10; the same number needs two digits in base 16 (FF) and eight in base 2 (1111 1111). This is the algorithm:

Data: A (array to sort), N (the array size), b (the keys radix), d (the number of digits of the keys)
Result: A sorted according to the data keys
begin
    for each key digit i, where i varies from the Least Significant Digit (LSD) to the Most Significant Digit (MSD) do
        sort A according to the i'th key digit using Counting sort or Bucket sort;
    end
end
Algorithm 1: Serial Radix sort

We first sort the elements based on the last key digit (the Least Significant Digit) using Counting sort or Bucket sort. The result is then sorted again by the second digit (from right to left), and this process continues for all key digits until we reach the Most Significant Digit, the last one on the left.
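Algorithm 1 itself fits in a few lines. Below is a sketch in Python (the actual project code is C++, so the names here are illustrative), using a stable Bucket sort per base-256 digit:

```python
def radix_sort(a, num_bytes=4):
    """LSD Radix sort of non-negative integers (Algorithm 1 with b = 256).
    Each pass is a stable Bucket sort on one base-256 digit (one byte)."""
    for i in range(num_bytes):               # LSD to MSD
        buckets = [[] for _ in range(256)]
        for x in a:                          # appending preserves order: stable
            buckets[(x >> (8 * i)) & 0xFF].append(x)
        a = [x for bucket in buckets for x in bucket]
    return a

radix_sort([10, 149, 2, 46, 135, 25])  # -> [2, 10, 25, 46, 135, 149]
```

With num_bytes=4 this covers 32-bit unsigned keys; the complexity is O(N·d) with d = 4 passes, regardless of how large the key values are.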

What if we want to sort a list of numbers having different numbers of digits in base 10, like (10, 149, 2, 46, 135, 25)? In practice, to avoid this problem, we often use base 256 to take advantage of the bitwise representation of the numbers in computers. Indeed, a digit in base 256 corresponds to a byte. Integers and reals are stored on a fixed and known number of bytes, and we can access each of them.

Problem the parallel Radix Sort can solve

The following is the problem, more precisely defined.

What we have:

• P processors able to communicate between themselves, store and process our data. Each processor has a unique rank between 0 and P-1.

• N data, distributed equally across our processors. This means that each processor has the same amount of data.

• The data are too huge to be stored or processed by only one processor (for memory or time reasons).

What we want

• Sort the data across the processors, in a distributed way. After the sort, processor 0 should have the lowest part of the sorted data, processor 1 the next, and so on.

Figure 1: Unsorted versus sorted distributed array across three processors. Our goal with the parallel Radix sort is to get the sorted one from the unsorted.

• We want to be as fast as possible and consume as little memory as possible on each processor.
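The goal above can be phrased as a predicate on the per-processor portions. The following Python sketch (illustrative names only, not the project's API) checks the property we want to hold after the sort:

```python
def is_sorted_distributed(chunks):
    """True if the distributed array is sorted across processors:
    every portion is sorted, all portions have the same size, and
    every element on rank r is <= every element on rank r+1."""
    if len({len(c) for c in chunks}) != 1:
        return False                        # same amount of data everywhere
    for c in chunks:
        if any(a > b for a, b in zip(c, c[1:])):
            return False                    # locally sorted
    for left, right in zip(chunks, chunks[1:]):
        if left and right and left[-1] > right[0]:
            return False                    # rank r holds the lower part
    return True
```

For example, is_sorted_distributed([[1, 2], [3, 4]]) holds, while [[1, 3], [2, 4]] fails the cross-processor condition even though each portion is sorted.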

Parallel Radix sort

The idea of the parallel Radix sort is, for each key digit, first to sort the data locally on each processor according to that digit; all the processors do this concurrently. Then, to compute which processors have to send which portions of their local data to which other processors, in order to have the distributed list sorted across processors according to that digit.

Figure 2: First iteration of the parallel Radix Sort on the distributed array across three processors. The numbers are sorted according to their base 10 LSD (Least Significant Digit). One iteration according to their base 10 MSD remains to complete the algorithm and get the desired final distributed sorted array.

Below is the algorithm.

Data: rank (rank of the processor), L (portion of the distributed array held by the processor), N (the distributed array size), P (number of processors), b (the keys radix), d (the number of digits of the keys)
Result: the distributed array sorted across the processors, with the same amount of data on each processor
begin
    for each key digit i, where i varies from the LSD to the MSD do
        sort L according to the i'th key digit using Counting sort or Bucket sort;

share information with other processors to figure out which local data to send where and what to receive from which processors in order to have the distributed array sorted across processors according to the i’th keys digit;

        proceed to these exchanges of data between processors;
    end
end
Algorithm 2: Parallel Radix sort

Each processor runs the algorithm with its rank and its portion of the distributed data. Let's go through an example to understand it better. We will run the example in base 10 for simplicity, but don't forget that we can use any number base we want; in practice, base 256 is used, as explained.

Figure 3: Last iteration of the parallel Radix Sort on the distributed array across three processors. The numbers are sorted according to their base 10 MSD (Most Significant Digit). At the end, the algorithm is completed and the distributed array is sorted.
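To make the iterations concrete, here is a plain-Python simulation of one pass of Algorithm 2. There is no MPI here (a list of lists stands in for the distributed array, and sorted() with a key stands in for the byte-wise Counting sort); all names are illustrative, not the project's:

```python
def digit(x, i):
    """i-th base-256 digit (byte) of a non-negative integer key."""
    return (x >> (8 * i)) & 0xFF

def parallel_radix_pass(chunks, i):
    """One iteration of Algorithm 2, simulated serially.
    chunks: equally sized per-processor portions of the distributed array."""
    n = len(chunks[0])
    # 1) each "processor" sorts its portion by the i-th digit (stable)
    local = [sorted(c, key=lambda x: digit(x, i)) for c in chunks]
    # 2) exchange step: elements are globally ordered by digit value,
    #    ties broken by processor rank, then local position (stability)
    merged = []
    for d in range(256):
        for c in local:
            merged.extend(x for x in c if digit(x, i) == d)
    # 3) re-split equally, so every processor keeps the same amount of data
    return [merged[r * n:(r + 1) * n] for r in range(len(chunks))]

def parallel_radix_sort(chunks, num_bytes=4):
    for i in range(num_bytes):              # LSD to MSD
        chunks = parallel_radix_pass(chunks, i)
    return chunks
```

Because each pass is stable, after the last (most significant) byte the concatenation of the portions, rank by rank, is the fully sorted array; in the real implementation step 2 is where the MPI communication happens.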

My implementation and performance results

I have used MPI for the communications between processors. As hardware, I have used the ICHEC Kay cluster to run all the benchmarks. The framework used to get the execution times is Google Benchmark. The numbers to sort are generated with std::mt19937, a Mersenne Twister pseudo-random generator of 32-bit numbers with a state size of 19937 bits. For each execution time measurement, I use ten different seeds (1, 2, 13, 38993, 83030, 90, 25, 28, 10 and 73) to generate ten different arrays (of the same length) to sort. The mean of the execution times is then registered as the execution time for that length. These ten seeds were chosen randomly. I proceed like this because the execution time also depends on the data, so it gives us a more accurate estimation.

With the log-log scale we can see better what is happening. We can tell that using dimer::sort we go faster than anything else when the array is big enough, and this is what we expect when we use HPC tools. For small array sizes, we are not really expecting an improvement, because they can easily be managed serially, and quickly enough. Moreover, we add extra steps to treat the problem in a parallel way; therefore, when the problem size is small, it is faster serially, because we do not have these extra steps. It becomes faster in parallel when these extra steps are a small workload compared to the time needed to sort serially.

Future work

I have presented here the LSD Radix sort, but there is a variant called MSD Radix sort which can be parallelized too. The difference is:

• With the LSD, we sort the elements based on the keys' LSD first, and then continue to the left until we reach the MSD
• With the MSD, we sort the elements based on the keys' MSD first, and then continue to the right until we reach the LSD

For more information, feel free to read my blog posts on the PRACE Summer of HPC website. My project is described in more detail there, and you can find everything you need to reproduce the work or understand it more deeply.

Acknowledgement

Figure 4: Sorting time as a function of the number of items to sort. The items here are 4-byte integers. The dimer::sort curve shows the execution times of my parallel Radix sort implementation run with 2 processors.

I would like to express my gratitude to the whole ICHEC staff, with special thanks to Dr. Paddy Ó Conbhuí and Dr. Simon Wong for supervising the project. Special thanks go to PRACE for giving me a place on this Summer of HPC 2019. Many thanks to Dr. Leon Kos and Dr. Massimiliano Guarrasi for all their work with the training week and the programme in general.

References

1 A. Sohn and Y. Kodama, "Load Balanced Parallel Radix Sort," in Proc. 12th ACM International Conference on Supercomputing, Melbourne, Australia, July 14-17, 1998.

We can notice that my implementation seems good for big sizes. It seems to be faster than boost::spreadsort because, when the array is not distributed, I am using ska_sort_copy instead of boost::spreadsort. ska_sort_copy is a great and very fast implementation of the serial Radix sort; actually, it is the fastest I have found.

Figure 5: Same as Figure 4 but in log-log scale.

PRACE SoHPC Project Title: Distributed Memory Radix Sort
PRACE SoHPC Site: ICHEC (Irish Centre for High-End Computing), Dublin, Ireland
PRACE SoHPC Authors: Jordy Ajanohoun, Sorbonne University, Paris, France
PRACE SoHPC Mentor: Paddy Ó Conbhuí, ICHEC, Dublin, Ireland
PRACE SoHPC Contact: Jordy Ajanohoun, Sorbonne University, Phone: +33 6 95 39 87 90, E-mail: [email protected]
PRACE SoHPC Software applied: C++, MPI, Catch2 (C++ test framework), Google Benchmark, Slurm
PRACE SoHPC More Information: https://summerofhpc.prace-ri.eu/author/jordya/
PRACE SoHPC Project ID: 1915

Study on how HPC systems can enhance and interact with CFD at all stages: pre-processing, resolution and visualization

CFD: Colorful Fluid Dynamics? No with HPC!

David Izquierdo Susín

It is typical to obtain some nice pictures of the flow around an object after performing a CFD (Computational Fluid Dynamics) simulation. However, how much truth is there behind those colorful images? The project presented here analyses how HPC systems can be used to make CFD a more trustworthy tool, and shows how some parameters can be optimized for a specific application: a formula car.

ABOUT FLUID DYNAMICS

The Navier-Stokes equations are the set of equations that define how a fluid moves, and they are based on principles such as the conservation of mass and Newton's Second Law. When one sets out to solve an aerodynamics problem, solving these equations will always be, somehow or other, unavoidable. The only problem is that these equations are actually impossible to solve analytically in the vast majority of practical cases, and the accuracy of the results heavily depends on the numerical method used.

Aerodynamics is about computing the forces that the air causes on a given object (in this case, a Formula Student car). To get the forces, one must first compute the pressure at many points over the surface of the object, and the pressure cannot be obtained without computing the velocity in the whole domain; the problem is coupled. When solving the motion of a fluid around an object, the objective is, therefore, to know the velocity and the pressure of the fluid at many different points. In order to obtain a result similar to that shown below in (5), where the scale from blue to red shows the values of the velocity on the symmetry plane of the car, the variables must be solved at a very large number of points, in order to then be able to paint each of these points with a different colour, corresponding to the value of the velocity at that point.

For a case like the Formula Student car, around 100 000 000 of these points are required to get an accurate enough solution. If it is estimated that the computer needs more than 100 operations to solve a single point/cell (the equations are complex), and it is assumed that we want the value of the velocity at 1 000 time instants (to observe how the flow evolves as time passes by), then the total number of computer operations may be of the order of 10 000 000 000 000, or, what is the same, ten of the former UK billions.

All this constitutes just a rough order-of-magnitude approximation to give an idea of the size of the problem that is being faced and its motivation.
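The estimate above is plain arithmetic; assuming exactly the figures quoted in the text:

```python
cells = 100_000_000      # grid points/cells needed for the Formula Student car
ops_per_cell = 100       # operations to solve a single point/cell (at least)
time_instants = 1_000    # velocity snapshots in time

total_ops = cells * ops_per_cell * time_instants
# total_ops == 10 000 000 000 000, i.e. 10**13: "ten of the former UK billions"
```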

WHAT IS THIS TRULY ABOUT?

The engineering problem to be dealt with in this document is the aerodynamic analysis of the TU Ostrava Formula Student car. The main objective is to build from scratch a robust simulation setup with which to obtain results for the aerodynamic forces on the Formula Student car in an automated manner. Apart from that, it is the aim of this project to better understand the capabilities of OpenFOAM when dealing with external aerodynamics cases.

METHODS (HANDS-ON)

The approach used to solve the tasks mentioned in the previous section consists of using the Linux operating system to install the corresponding software and edit the files in a more efficient manner than what could be achieved in a Windows operating system, by means of vim and other tools like sed.

The methodology followed in this project is defined in several steps that can be easily distinguished from one another:

• Pre-processing of the geometry
• Meshing setup
• Simulation setup
• Post-processing of the results

The first step consists of taking the initial car geometry designed by the TU Ostrava Formula Student team and transforming it into a surface mesh of points that can be read by the CFD software. Then, a proper mesh has to be generated, in this case by means of the snappyHexMesh utility. There are more than 50 parameters to be adjusted, and the suitability of the mesh heavily depends on the choices made on these parameters. Above at the left, in images (1) to (3) of Fig. 1, the different steps performed when creating a mesh can be observed: in (1), the mesh after the first step (castellation) is shown. It can be seen that this mesh is not adapted to the surface geometry of the car, since it is just made out of cubes. This is solved in the next step (2), in which the mesh is snapped to provide it with the actual shape of the geometry. Finally, in step (3), layers are added near the surface of the body. This type of cell is beneficial when trying to resolve the behaviour of the flow in the boundary layer.

Figure 1: Visual state of the geometry at some of the steps in the CFD workflow.

In the following step, it is necessary to define the physical and numerical parameters on which the simulation will be based. The choices made at this point will determine the degree of convergence of the simulations, together with the accuracy of the results. Again, the number of parameters is very large and it is significantly difficult to find the ideal configuration. The approach followed to obtain a stable setup, without the need to adjust all the parameters one by one, was to pick as a base a reference case provided by the software and based on a motorbike simulation. In this way, even if the parameters had to be adjusted (since the geometry proposed for the project was far more complex, and the requirements in terms of accuracy were also more restricting), the process was simplified to some extent.

Lastly, the huge bunch of numbers obtained as output from the simulations must be post-processed in order to gain some degree of comprehension from the simulations performed. This was done as an initial checking step in ParaView, an example of which can be seen in (5) above, but later the process was migrated to EnSight, where further and more involved post-processing was conducted, including the generation of isosurfaces and videos.

HPC IN OSTRAVA

Here is where HPC (High Performance Computing) systems come in handy. HPC systems are essentially computers that consist of many smaller computers that are able to work together to solve a specific task. When such a large number of operations is required to solve the problem of interest just once, using a personal computer is not an option: it would not only be too slow (weeks or even months would be needed to get the first results of a complex problem), but it would possibly be incapable of solving the problem, due to the large amount of data that must be handled, which in turn implies a very large requirement in terms of memory.

The IT4Innovations National Supercomputing Centre in Ostrava offers two of these very powerful systems that can be used to better solve these kinds of problems. In particular, the Anselm machine relies on more than 3 300 cores, while Salomon has up to 24 192 cores. To put it into context, a typical laptop can be based on 4 processors and 8 logical units, which is more than 3 000-fold less than Salomon.

Figure 2: Scalability Curves

Given the complexity of the setup of

the simulations in terms of the number of parameters to be defined, it is not possible to show or discuss here the full set of parameters. But, in order to ensure that the work is reproducible, some extra detail is given in this blog post and in the final presentation.

SCALABILITY AND MESH CONVERGENCE

The first outcome of this project is the scalability study. This study is aimed at determining how OpenFOAM performs depending on the number of cores used to run the cases. Results are shown in Fig. 2, where it can be observed that the solving algorithm (orange curve) scales better than the meshing algorithm (blue curve). In the case of the former, between 768 and 1536 cores the reduction in total computing time is still 31.0 %, which is far from the ideal 50 % but is still significantly large given the huge number of cores.

Two approaches were considered to run the simulations: a single-job approach, in which a single job would be launched for each simulation, and a two-jobs approach, which separates the execution of the meshing and solution algorithms into two different scripts and two different jobs.

Figure 3: Single-job Versus Two-jobs Approach

The results in Fig. 3 show that the two-jobs approach, finally adopted in the project, is 12 % faster, which is a significant reduction that does have an impact, since it allows a more optimal allocation of resources. All these computations were performed once mesh convergence had been proved. This is very important, since it ensures that the results of the aerodynamic study are approximately independent of the size of the mesh elements. It was achieved via a mesh convergence study, for which several meshes are first defined in Fig. 4.

Figure 4: Reference Meshes

For the reference meshes with the numbers of cells shown in Fig. 4, the mesh quality characteristics depicted in Fig. 5 have been found. In terms of the number of layers (referring again to step (3) in Fig. 1 for the layer definition), the target was set to three. One of the geometries in which the creation of layers is more complex, due to its shape, is the suspension of the car. Therefore, the number of layers is monitored in this part, and it can be checked that, for the finer meshes, the user requirement is almost met. In terms of illegal cells, that is, the cells that do not pass the user-selected quality check, the number was also significantly reduced with larger meshes (although note that the quality standards were set very strict to perform this evaluation; not all cells that did not pass are necessarily badly shaped).

Figure 5: Monitoring of Layers Creation and Quality Check on Meshes

The next figure, Fig. 6, shows the results of the mesh convergence study. Several force coefficients were monitored for the different mesh configurations with the already described characteristics. In particular, these coefficients are the total coefficient of lift (in absolute value; it is actually negative), the total coefficient of drag and the rear coefficient of lift of the car, and the coefficients of lift and drag of the front wing. From the observation that the values of the coefficients tend to become constant as the mesh gets finer, it may be concluded that the results are mesh independent for the roughly 140-million-cell mesh used for the scalability report.

However, Fig. 6 also shows that, even if the variations tail off, they do not totally disappear, and in fact for some of the coefficients (such as the total CL) these variations seem to slightly increase in the step from the fine to the very fine mesh with respect to the previous steps.

Figure 6: Mesh Convergence Study

All this may indicate that convergence has not been achieved to a very high degree, and this is expected to have something to do with the boundary layer resolution and the y* value (the non-dimensional height of the centroid of the cells touching the body surface, i.e. the size of the cells close to the surface), which should lie everywhere below six, but that is not true for the whole mesh. In fact, y* goes up to 17-20 in some regions, making thus necessary the use

of wall functions to cover the delicate buffer layer, losing some of the advantages of the k-omega SST turbulence model1 used in these simulations. The quality of some of the cells may also be a reason for this imperfection in the mesh convergence results.

In spite of that, the results are reasonable in overall terms and meet the accuracy requirements that had been proposed. This allows us to proceed to the next part, in which the actual post-processing results are discussed by means of images of the car.

POST-PROCESSING

The final study entails analysing the flow around the car, which is done by looking at the results in terms of velocity and pressure fields.

Figure 7: Static Pressure over the Surface of the Car

Fig. 7 illustrates the pressure field over the surface of the car. It is easy to observe that the most drag-generating parts (largest static pressure, in red) are those exposed in a more perpendicular and direct way to the flow, such as the tip of the nosecone, the driver, the front tyres and the steepest zones of the wings. The zones of smallest static pressure (negative with respect to the ambient pressure, i.e. suction) lie below the car, specifically below the front wing and the undertray, thanks to the suction capabilities of those devices, although these cannot be seen in this image. What can be seen is the reduction in pressure in the zones of high curvature, where corner flows emerge. Examples of this are the sides of the front wheels and the side of the driver's helmet. The triangular blue zone on the rear wing end-plates is also relevant, since it shows how the wing-tip vortex that arises at the edges of every wing develops, growing towards the rear end of the wing.

Finally, in Fig. 8 the wake of the car can be observed from the bottom. It is constructed by plotting velocity isosurfaces for a range of velocity values going from 2 m/s up to 15 m/s (with the free stream always at 20 m/s; therefore, all the values plotted are smaller).

Figure 8: Wake of the Car - Bottom View

A first remark to be made is that the flow is not symmetric, since the geometry is not symmetric. Even if the car may look as if it had symmetry at first sight, it turns out that the intake of the engine and the cooling system (radiator and fan) are highly non-symmetrical, which leads to a wake slightly deflected towards the left, from the driver's perspective. Apart from that, the vortical structures created around each of the four tyres are of extreme importance for the whole aerodynamic vehicle concept, due to their influence on all the other aerodynamic devices of the car, and therefore this kind of representation is essential to understand and manage them properly.

DISCUSSION

Overall, the objectives of the project were met and key conclusions were drawn: (1) the OpenFOAM solver simpleFoam scales well, while the meshing utility is the bottleneck; (2) the two-jobs approach is better than the single-job one; and (3) using HPC systems helps to reach more meaningful and trustworthy results with CFD for the same amount of time available. The implication of these results is that better and more extensive knowledge about the use of OpenFOAM on HPC systems will be available for the development of similar projects.

References

1 F. R. Menter, M. Kuntz and R. Langtry (2003). Ten Years of Industrial Experience with the SST Turbulence Model.

PRACE SoHPC Project Title: Computational Fluid Dynamics Simulations of Formula Student Car Using HPC
PRACE SoHPC Site: IT4Innovations - National Supercomputing Centre, Ostrava, Czech Republic
PRACE SoHPC Authors: David Izquierdo Susín, UC3M, Madrid
PRACE SoHPC Mentor: Tomáš Brzobohatý, VŠB, Czech Republic
PRACE SoHPC Contact: David Izquierdo, UC3M, Phone: +34 658 43 09 55, E-mail: [email protected]
PRACE SoHPC Software applied: OpenFOAM, ParaView, EnSight
PRACE SoHPC More Information: www.openfoam.org
PRACE SoHPC Acknowledgement: I would like to thank all the people from PRACE for organizing the programme, as well as the Ostrava coordinators and my mentor, Tomáš Brzobohatý. And above all, I want to thank my family and all the people that helped me to get here.
PRACE SoHPC Project ID: 1916

Real time emotion recognition based on action units using deep neural networks (DNNs) and clustering algorithms. From High Performance Computing (HPC) to the edge.

Emotion Recognition using DNNs

Pablo Lluch Romero

Visual context, particularly facial expressions, plays a key role in understanding the intention of a message. However, visually impaired individuals lack this context. An application able to detect 7 basic emotions in real time with an accuracy of 74% is introduced to compensate for the lack of visual context in conversations.

In order to make a machine understand facial expressions, we need a way to encode an image (an array of pixels) such that its emotion is easily classifiable. Many different approaches can be used to make this encoding, but I will be using a facial action coding system that classifies sets of muscle movements into bundles called action units. Any facial expression can be encoded using this framework. I used 17 different action units, as shown in Figure 1, which are enough to encode the emotions I trained the application to predict: happiness, sadness, surprise, anger, disgust, fear and neutral.

I used a deep neural network to predict the action units from faces. A neural network is a computer algorithm1 that imitates biological neurons, with many interconnected layers, that is able to find underlying patterns in data. A nearest neighbour approach is used on the output of the trained network to classify new images as part of the final application.

Figure 1: Action units used
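As a sketch of that nearest-neighbour step (plain Python with made-up two-dimensional vectors; the real pipeline compares 17-dimensional action unit probability vectors, and all names here are illustrative, not the project's code):

```python
from math import sqrt
from collections import Counter

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def predict_emotion(au_probs, reference, k=10):
    """Nearest-neighbour emotion lookup: compare a new action unit
    probability vector against labelled reference vectors and return
    the most common emotion among the k most similar ones."""
    ranked = sorted(reference,
                    key=lambda item: cosine_similarity(au_probs, item[0]),
                    reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

# toy reference set of (action-unit vector, emotion) pairs
ref = [([1.0, 0.0], "happy"), ([0.9, 0.1], "happy"),
       ([0.0, 1.0], "sad"), ([0.1, 0.9], "sad")]
predict_emotion([0.8, 0.2], ref, k=2)  # -> "happy"
```

Cosine similarity compares only the ratios between components, not the overall vector length, which is the property discussed later in the article.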

Figure 2: Sample layers from superficial [top] down to deeper ones [bottom] from fine-tuned VGGFace

From HPC...

Data gathering and preprocessing

The data I needed to train the neural network came from four existing datasets that consisted of video footage of spontaneous emotions with the active action units annotated frame by frame. These were the Cohn-Kanade AU-Coded Facial Expression dataset,2,3 the MMI facial expression database collected by Valstar and Pantic,4,5 the Denver Intensity of Spontaneous Facial Action Database,6,7 and BP4D-Spontaneous.8,9 The choice of datasets was not trivial, as they had to be consistent in the way the action units were annotated.

Once I combined these four datasets into one with a total of 300K images, I went through an array of different processes to improve its quality. The first issue that I faced was class imbalance. The most popular action unit in the combined dataset (upper lip riser) was present in 80,000 images, while the least popular (lip presor) was in only 3,000. Using such unbalanced data to train a machine learning model is a very bad practice, as these models learn the statistical footprint of the classes. So, very popular classes would yield a lot of false positives and less popular ones a lot of false negatives. Secondly, contiguous frames were removed from the dataset to prevent the model from overfitting (memorising the individual patterns of a dataset without gaining the ability to generalise on unseen data). Finally, in order to increase the size of the dataset and add more variability to the classes, I augmented the data by applying random rotations, shears, zooms and brightness shifts to each image before feeding it into the network, as shown in Figure 3.

Figure 3: Data augmentations from a single image (c) Jeffrey Cohn

Training

I built the neural network using transfer learning. This is a common practice that involves reusing existing networks that were devised to solve problems in a similar domain. I took the convolutional layers from a network called VGGFace, which was originally developed for face detection. A convolution can be understood as a filter that is applied to an image; convolutional layers are just a set of these filters acting both in parallel and in sequence. Superficial convolutional layers encode low-level concepts like edges or corners, and deeper ones encode meaning related to the class of the image. A sample of these layers is shown in Figure 2. I also fine-tuned the weights of these convolutional layers to match the problem of action unit detection better. The output of the last convolutional layer (conv5_3, as shown in Figure 2) can be thought of as the high-level features that define a face. In order to classify these features, I added an array of dense and dropout layers to the network. Dense layers are fully connected layers of neurons that fire an output only if their inputs are high enough, the same way neurons behave in our brain. Dropout layers disable a random fraction of the connections between neurons during the training process.

During the training of the neural network, I explored a number of hyperparameters and settings, such as the learning rate, the batch size, the number and size of dense and dropout layers, and different preprocessings of the data. I found that using dropout layers before and after every dense layer reduced overfitting considerably. However, the model achieved the best validation accuracy when the first dropout layer kept a larger portion of units than the others. The learning rate (a parameter that specifies how fast a model reacts to new information) was also key. Too high or too low a learning rate would not let the model arrive at a decent solution in a reasonable time. I also found out that the larger the size of the dense layers, the lower the learning rate had to be for the model to converge.

The data preprocessing mentioned before also reduced overfitting, as it increased variability and removed similar images. The final setting produced a model able to identify action units with a recall of 64.5% and a precision of 73.6%. For the training process described above I used TensorFlow as the environment, and I ran it on the NVIDIA DGX-2 cluster of the IT4Innovations supercomputing centre. I only used a single slice (a single GPU) of the DGX-2, but that was already 30x faster than the CPU of my development system.

Inferring the emotions in real time

The action units predicted using the deep neural network were translated to the corresponding emotion using a nearest neighbour approach. The peak emotion images in the Cohn-Kanade dataset were fed into the trained neural network and the output (the probability of the image containing each action unit) was stored in a small dataset of around 300 images. Cohn-Kanade also included the corresponding emotions, so I stored these alongside the action unit predictions. Inferring the emotion that corresponds to a new image is then just a matter of getting the vector of action unit probabilities from the network, finding the most similar probability vectors in this small dataset and looking at the most common emotion among them. This can be understood as looking at the neighbours of the action unit probability vector in a hyper-dimensional space where each dimension corresponds to one of the 17 action units. Another, more straightforward, approach would be to check the action units whose probabilities are larger than 0.5 and compare them to a table of the common action units that are archetypal of each emotion. However, the approach I followed is better for two reasons: A) it allows for a wider variability in the emotions, as not every emotion fits within these prototypical definitions, and B) it helps mitigate biases that the model might have. Even if the action unit prediction is not perfect, since the small dataset was created using this same biased model, the imperfection will be accounted for when a new prediction is compared to the existing points in the hyperspace.

Figure 5: 2D representation of the emotion hyperspace corresponding to the action unit predictions

Both cosine similarity and euclidean distance were tried, but cosine similarity yielded the better result. In an ideal world, the neural network would only return values very close to 0 (when the action unit is not present) or values very close to 1 (when the action unit is present at any level of intensity), as a sigmoid function is used on the final layer of the DNN to polarise the probabilities. However, this is not the reality for my model. The reason why cosine similarity gave a better result is that all we care about for classifying an emotion is the ratios between the different action units, not the size of the action unit vector itself, as the latter would encode the intensity of the emotion. Using euclidean distance would result in different intensities being considered as different emotions, which would make it harder to classify new predictions.

This approach needs a way to determine when no emotion is present (or the emotion is neutral). For this, I set a threshold on the probabilities of the action units, so that if every action unit is below this threshold, then the emotion has to be neutral. The emotion displayed in the application is the most common emotion over the last five inferences (i.e. frames), to avoid flickering and foster stability. In order to produce the audio saying the emotion, I used Python's gTTS text-to-speech module. The audio message is only fired once the emotion has settled, to avoid constant interruptions.

Application pipeline

Faces are cropped from a live camera feed using Python OpenCV's Haar Cascade Classifier face detection algorithm, running at approximately 60 ms per image. Images are downscaled before they are fed into the algorithm for speedup reasons, and the face position is then inferred back in the original image, so no resolution is lost. Faces are fed asynchronously to the trained model that predicts the action units, only once the model has finished processing the previous image. The action units predicted by the model are then compared to the small dataset of Cohn-Kanade predictions to find the 10 closest neighbours using cosine similarity, and the most common emotion among them is chosen.

Figure 4: Topology of the final neural network that uses VGGFace's convolutional layers

... to the edge

The original idea of the project was to deploy the application to the Intel Movidius Neural Compute Stick 2 (NCS2). This was achieved using OpenVINO and Intel's own model optimizer. A working application was made, but the NCS2 was giving incorrect inference results, probably due to the lack of

compatibility with some neural network operations. The older NCS 1 was used instead, yielding an inference time of 900 ms. Another version of the application was deployed using the CPU of my development system, where Keras and Tensorflow were used for the inference. This version was able to infer the emotions in 200 ms.

Figure 6: Screenshot of the working application with the action units showing at the left.

Discussion & Conclusion

I believe a very important part of this work is its scalable nature. Making the application detect a new emotion wouldn't require any further training of the neural network, as long as the action units present in the emotion are included in the 17 action units which the network is able to detect. All that would be needed is to add a few new data points – corresponding to the predictions of the model on images containing that new emotion – to the small action unit-emotion dataset that I used for the multi-dimensional comparison.

In order to assess the accuracy of the application, I compiled a dataset of images taken from Google corresponding to the 7 target emotions. This dataset contains faces with clearly identifiable, unambiguous emotions in good lighting and resolution. It includes different ages, genders and ethnicities. The application predicted the correct emotion on this dataset with an accuracy of 74%. The app doesn't perform as well with images that are ambiguous, unclear or contain cues other than muscle movements (for example, the application wouldn't categorise a person with tears in their eyes as sad). It's important to bear in mind that the live application delivers a slightly better accuracy than the individual predictions, as not only one picture is considered to determine the emotion but a sequence of them. An objective metric of the accuracy of the application is still a matter for further analysis. This just serves as a quick verification of how the application is able to infer the emotion on different and diverse subjects.

During the training process of the neural network it's important to have a robust metric of how good a model is. This is true because multiple combinations of hyperparameters are tried during training, so multiple versions of the same model are produced. Even though the neural network is being trained to detect action units in faces, the ultimate goal of these action units is to serve as a basis to infer the emotion on images. Thus, I will suggest a way to evaluate a model by examining how well these action units serve to predict emotions. If we go back to the action unit hyperspace that was discussed before, it can be seen that emotions form clusters in it. A good way of assessing how easily separable these emotions are is to evaluate how little the clusters overlap with each other. Very separate and isolated clusters mean easily classifiable emotions and higher accuracy overall. That's why a metric to evaluate this lack of overlap could be introduced to evaluate the possible models and determine which will yield the best accuracy in the final application. Another possible optimisation would be to reduce the dimensionality of the small action unit-emotion dataset, to reduce the time taken to find the nearest neighbours. This could be done using algorithms like PCA that, as shown in Figure 5, are able to reduce data dimensionality while still showing separable clusters, which, as explained above, translate to easily classifiable emotions.

References
1 Facial Action Coding System, accessed 30 August 2019, imotions.com/blog/facial-action-coding-system/
2 Kanade, T., Cohn, J. F., & Tian, Y. (2000). Comprehensive database for facial expression analysis. Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), Grenoble, France, 46-53.
3 Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., & Matthews, I. (2010). The Extended Cohn-Kanade Dataset (CK+): A complete expression dataset for action unit and emotion-specified expression. Proceedings of the Third International Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB 2010), San Francisco, USA, 94-101.
4 M.F. Valstar, M. Pantic, "Induced Disgust, Happiness and Surprise: an Addition to the MMI Facial Expression Database", Proceedings of the International Language Resources and Evaluation Conference, Malta, May 2010.
5 M. Pantic, M.F. Valstar, R. Rademaker and L. Maat, "Web-based database for facial expression analysis", Proc. IEEE Int'l Conf. on Multimedia and Expo (ICME'05), Amsterdam, The Netherlands, July 2005.
6 Mavadati, S.M.; Mahoor, M.H.; Bartlett, K.; Trinh, P.; Cohn, J.F., "DISFA: A Spontaneous Facial Action Intensity Database", IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151-160, April-June 2013, doi:10.1109/T-AFFC.2013.4
7 Mavadati, S.M.; Mahoor, M.H.; Bartlett, K.; Trinh, P., "Automatic detection of non-posed facial action units", 2012 19th IEEE International Conference on Image Processing (ICIP), pp. 1817-1820, Sept. 30-Oct. 3 2012, doi:10.1109/ICIP.2012.6467235
8 Xing Zhang, Lijun Yin, Jeff Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeff Girard, "BP4D-Spontaneous: A high resolution spontaneous 3D dynamic facial expression database", Image and Vision Computing, 32 (2014), pp. 692-706 (special issue of the Best of FG13).
9 Xing Zhang, Lijun Yin, Jeff Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, and Peng Liu, "A high resolution spontaneous 3D dynamic facial expression database", The 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG13), April 2013.

PRACE SoHPCProject Title: Emotion Recognition using DNNs
PRACE SoHPCSite: IT4Innovations, VŠB - Technical University of Ostrava, Czech Republic
PRACE SoHPCAuthors: Pablo Lluch Romero, The University of Edinburgh, United Kingdom
PRACE SoHPCMentors: Georg Zitzlsberger and Martin Golasowski, IT4Innovations, Czech Republic
PRACE SoHPCContact: Leon Kos, University of Ljubljana, Phone: +12 324 4445 5556, E-mail: [email protected]
PRACE SoHPCSoftware applied: Python, Tensorflow, Keras
PRACE SoHPCAcknowledgement: Thanks to IT4Innovations and my mentors Georg Zitzlsberger and Martin Golasowski for providing me with access to the supercomputer facilities and support throughout my project. Also thanks to PRACE for creating this opportunity and allowing us to get closer to the incredibly interesting and powerful field of HPC. I would also like to thank the owners of the datasets I used, as I was given them for free.
PRACE SoHPCProject ID: 1917
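The cosine-similarity nearest-neighbour lookup at the heart of the application can be sketched in a few lines. This is a toy reconstruction, not the author's code: the function names and the neutral threshold value are invented, and the stand-in dataset uses 3 action units instead of 17; only the mechanism (cosine similarity, k nearest neighbours, majority vote, neutral fallback) follows the article.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Compares the *ratios* between action units, ignoring overall intensity (vector norm)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def infer_emotion(au_probs, dataset, k=10, neutral_threshold=0.2):
    """dataset: list of (action_unit_vector, emotion_label) pairs."""
    if max(au_probs) < neutral_threshold:   # no action unit fires -> neutral
        return "neutral"
    scored = sorted(dataset, reverse=True,
                    key=lambda entry: cosine_similarity(au_probs, entry[0]))
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Tiny stand-in for the ~300 Cohn-Kanade action-unit vectors
dataset = [([0.9, 0.1, 0.8], "happy"), ([0.8, 0.2, 0.9], "happy"),
           ([0.1, 0.9, 0.2], "sad"), ([0.2, 0.8, 0.1], "sad")]
print(infer_emotion([0.45, 0.05, 0.4], dataset, k=3))  # prints: happy
```

Note how the query vector is a scaled-down "happy" pattern: cosine similarity still matches it, which is exactly the intensity-invariance argument made above for preferring it over euclidean distance.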

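The cluster-separability metric proposed in the conclusion above could be prototyped with a crude overlap measure. Everything here (the names, the two-dimensional toy clusters, the particular between/within ratio) is illustrative only, not from the project:

```python
import math

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def mean_dist(points, ref):
    return sum(math.dist(p, ref) for p in points) / len(points)

def separability(clusters):
    """clusters: dict of emotion -> list of action-unit vectors.
    Ratio of the smallest distance between cluster centroids to the
    largest within-cluster spread; higher means less overlap."""
    cents = {k: centroid(v) for k, v in clusters.items()}
    spread = max(mean_dist(v, cents[k]) for k, v in clusters.items())
    labels = list(cents)
    between = min(math.dist(cents[a], cents[b])
                  for i, a in enumerate(labels) for b in labels[i + 1:])
    return between / spread

clusters = {"happy": [[0.9, 0.1], [0.8, 0.2]], "sad": [[0.1, 0.9], [0.2, 0.8]]}
print(separability(clusters))  # well above 1: clusters barely overlap
```

A model whose action-unit predictions yield a higher score of this kind would, by the article's argument, classify emotions more reliably.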
Implementing a Tasking Framework in CUDA for the Fast Multipole Method

FMM: A GPU's Task

Noé Brucy

Is your graphics card able to run N-body simulations in a smart way? A complex tree algorithm driven by a sophisticated tasking system: is all that a task for a GPU? No, some will say; a graphics card can do only basic linear algebra operations. Well, maybe the hardware is capable of much more...

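The accumulation-tree tasking described below can be sketched as a dependency-counting work queue. This is a sequential, CPU-side Python analogue with invented names, not the project's CUDA code: it only illustrates the task and dependency bookkeeping, not the parallelism.

```python
from collections import deque

def accumulate(values, arity=2):
    # Build a complete tree bottom-up as levels; each parent has `arity` children.
    levels = [list(values)]
    while len(levels[-1]) > 1:
        levels.append([0] * (len(levels[-1]) // arity))
    # Unresolved-children counter for every parent node (the dependencies)
    pending = [[arity] * len(lvl) for lvl in levels[1:]]
    queue = deque((0, i) for i in range(len(values)))  # leaf tasks are ready
    while queue:
        lvl, i = queue.popleft()          # one task = "add cell (lvl, i) to its parent"
        if lvl + 1 == len(levels):
            break                         # the root has been computed
        parent = i // arity
        levels[lvl + 1][parent] += levels[lvl][i]
        pending[lvl][parent] -= 1
        if pending[lvl][parent] == 0:     # dependency resolved -> push the new task
            queue.append((lvl + 1, parent))
    return levels[-1][0]

print(accumulate([1] * 16, arity=4))  # 16 leaves initialised with 1 -> 16
```

A task only enters the queue once all its children have been added up, which is exactly the dependency structure the article's rounds walk through.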
The Fast Multipole Method (FMM) is an algorithm to compute long-range forces within a set of particles. It makes it possible to solve the N-body problem numerically with linear complexity, where it would otherwise be quadratic: doubling the number of particles only doubles the computation time instead of quadrupling it. However, the FMM algorithm is hard to parallelize because of data dependencies. Tasking, meaning splitting the work into tasks and putting them into a queue, helps a lot to give work to all threads (GPU or CPU). A working tasking framework for the FMM has been implemented on Central Processing Units (CPU) by D. Haensel (2018). Will such a tasking framework run efficiently on General Purpose Graphical Processing Units (GPGPUs, or more simply GPUs)? The answer is not obvious at all, because CPUs and GPUs have very different architectures. My goal this summer was to shed some light on that topic.

A smooth start

Let me first introduce the problem that we will use to test the tasking framework. It is a simplified version of one of the operators of the FMM, and we named it the accumulation tree. The principle is very simple: the content of each cell is added to its parent. So one task is exactly "adding the content of a cell to its parent". You can already see that dependencies will appear, since a node needs the results from all its children before its own task can be launched. Imagine we have 10 processing units (GPU or CPU threads); the computation of the accumulation proceeds as follows.

Initialisation: all leaves are initialised with 1. All tasks that are ready, that is, all tasks from leaves, are pushed into the queue (blue tasks). Now we are ready to start.

Round 1: ten blue tasks are done. Two new green tasks are ready, so they are pushed into the queue.

Round 2: the last six remaining blue tasks are done, as well as two green tasks. The last two green tasks are ready and pushed into the queue. Here we can see that the tasking permits maximising the thread usage, since all tasks that are ready can be executed.

Round 3: the last two green tasks are executed. We get the correct result, hip hip hooray!

(Figures: an accumulation tree with 16 leaves, each initialised to 1, is reduced into four nodes of value 4 and then the root value 16, with the ten FPUs executing the ready tasks in each round.)

And on a GPU?

As said in the introduction, such a tasking system works well on a CPU. Why can't we just copy-paste the code, translate some instructions and add some annotations to make it work on a GPU? Because many assumptions we can rely on on CPUs do not hold anymore on GPUs. The biggest of them is thread independence. You can compare the CPU to a barbarian army: only a few strong soldiers, any of them being able to act individually. A graphics card, however, is more like the Roman army, with a lot of men divided into units. All soldiers within the same unit are bound to do exactly the same thing (but on different targets). (Credits: Wikimedia CC-BY-SA; https://patricklarkinthrillers.files.wordpress.com)

Many of the problems caused by the special nature of the GPU were already tackled when I joined the project. Indeed, I was provided with a working version of the tasking framework applied to the accumulation tree example. However, it had serious performance issues, since the accumulation took 700 ms on an octree of depth 5. An octree is a tree where each node has 8 children, and an octree of depth 5 has 37449 nodes, which is not that much.

Time to speed things up!

In this section I will tell you my strategies to improve the performance. Let's begin by describing how the tasking framework I was provided with worked.

The starting point

The protagonists of our story are the following. The threads are our Roman soldiers. They work on a Floating Point Processing Unit (FPU) and they do the computation. They are grouped in units of 32 threads called warps. Threads within the same warp should execute the same instruction at the same time, but on different data. Warps are themselves grouped into blocks. Threads within the same block can communicate fast through shared memory and they can synchronise together. However, no inter-block communication or synchronisation is allowed, except through the global memory, which is much slower. The tree is another player. It lives in the global memory, since all threads should be able to access it. At last there is the queue, also living in the global memory. It is implemented as a linked list, meaning that its size can adapt to the number of tasks it contains. To avoid race conditions, accesses to the tree and the queue are protected with a mutual exclusion system (mutex). We rely on a specific implementation of a mutex for GPU, which allows only one thread at a time to access the data protected by the mutex. To avoid deadlocks (a kind of infinite loop where two threads wait for an event that will never occur), only one thread per warp tries to enter the mutex. We call it the warp master, and in this early version only the warp masters can work, as follows:
1. Fetch a task from the queue
2. Synchronisation of the block
3. Execute the task
4. Synchronisation of the block
5. Push new tasks (if any)
Step 5 is done only if the task done at step 3 resolves a dependency and creates a new task.

Cutting allocations

The first improvement was to replace the queue implementation with my own, relying on a fixed-size queue. The reason behind this is that memory allocation is very expensive on a GPU, and with an adaptive-size queue you are allocating and freeing memory each time a thread pushes or pops a task.

Reestablishing private property

The second idea was to reduce the contention on the global queue, since working threads are always trying to access it. I added small private queues for each block, which can be stored either in the cache or in the shared memory for fast access. The threads within a block use the block's private queue in priority, and the global queue as a fallback if the private one is full (push) or empty (pop).

Solving the unemployment crisis

For now only a few threads (the warp masters) are working. It's time to put an end to that. First, since threads within a block are synchronised, access to the private queue is done at the same time and is thus performed sequentially because of the mutex. I decided that only one thread per block will access the queue, and will be in charge to fetch

the work (step 1) and solve dependencies (step 5) for all the threads.

Breaking down the wall

Now that all threads are working, they are all waiting to enter the mutex protecting the tree, even if they are trying to access different parts of it. So I removed the mutex, and ensured that all operations on the tree are atomic. That is a bit as if there was a mutex on each node of the tree, making it possible for several threads to access the tree concurrently.

Fighting inequality

Removing the mutex from the tree resulted in a huge gain in time, so I tried to also get rid of it for the shared queue access. First I split the queue, so that there is one shared queue for each block. Because the queue is fixed in size, the operations of pushing and popping a task are independent. One block master can pop only from its own queues (private and shared) and can push to its own private queue, its own shared queue and also other blocks' shared queues. Thus, we do not need mutex protection for the popping operation. Last but not least, the work must be equally shared between blocks (that is called load balancing). I provided a function that tells a calling thread in which shared queue a newly created task should be pushed.

(Diagrams: per-block shared and private queues, with atomics replacing the mutexes on the tree and on the popping side of the queues.)

Was it worth it?

Now it's time to evaluate how much time we saved with our drastic politics. And the answer is: a lot! The timings below show how long the computation of the accumulation octree of depth 5 took after each of the above mentioned steps:

  Non optimized         700 ms
  Fixed size queues     450 ms
  More queues           200 ms
  More parallelism      100 ms
  Reducing contention    25 ms
  Load balancing        1.5 ms

(Left: approximate timings for the accumulation kernel for an octree of depth 5 on a Tesla P100 GPU, excluding initialisation. Right: total thread execution time for several depths on the Tesla P100.)

The results are quite impressive, as my work reduced the execution time from 700 ms down to 1.5 ms, resulting in a speedup of approximately 450. It also scales pretty well if we increase the size of the tree. Indeed, one would expect that the total execution time (the measured time multiplied by the number of threads) increases eightfold if the depth is incremented by 1. That is because by adding one tree level, we increase the number of tasks approximately by a factor of 8. The right-hand graph shows that the scaling is just as expected from depth 3 on. An important step to get this result was to parallelize the allocation and the initialisation of the tree, gaining a speedup of 900 for depth 5 and thus enabling runs with a depth up to 8. There is not enough memory on the GPU for higher depths. I also wrote a specific Python benchmarking programme to compare different versions of the code.

Towards a full GPU FMM solver

The accumulation tree example is very close to an operator of the FMM called M2M. There are two other tree operators in the FMM algorithm: M2L, which computes interactions between nodes at the same level, and L2L, which is much like M2M but goes from the top of the tree to the bottom. I also implemented operators that have the same execution scheme as M2L and L2L. They work correctly and have execution times within the same order of magnitude as the accumulation tree. Therefore we are not very far from a taskified FMM algorithm running on GPU. The remaining work is the replacement of the simple addition in the tree by the complex mathematical operations of the FMM.

PRACE SoHPCProject Title: Good-bye or Taskify!
PRACE SoHPCSite: Jülich Supercomputing Centre, Germany
PRACE SoHPCAuthor: Noé Brucy, France
PRACE SoHPCMentors: Laura Morgenstern, Ivo Kabadshow and Andreas Beckmann, Jülich, Germany
PRACE SoHPCContact: Noé Brucy, AIM, CEA Saclay, E-mail: [email protected]
PRACE SoHPCSoftware applied: Cuda, Nvidia profiler, Python
PRACE SoHPCAcknowledgement: Thank you very much Laura and Ivo, I really liked working with you!
PRACE SoHPCProject ID: 1918
Credits: the GPU icon is adapted from Arthur Shlain (the Noun Project).

High Performance Lattice Field Theory

High Performance Lattice Field Theory uses the computational power of supercomputers in order to produce a high accuracy lattice field simulation which cannot be created on conventional computers due to its computational expensiveness.

Andreas Nikolaidis

The image above shows a subatomic particle composed of three quarks; it can be a neutron or a proton.

Motivation of project: understanding how the lattice simulation works in order to solve a mass problem that occurred during the simulation of the propagation of a meson particle.

Quantum chromodynamics (QCD) is a quantum field theory that describes the strong interaction between quarks and gluons. Quarks are the subatomic particles composing the well known neutrons and protons, and they occur in six flavours: up, down, strange, charm, top and bottom. The proton for example is a composition of three quarks: 2 up quarks and 1 down quark. If the story ended here, then we would be talking about quantum flavourdynamics. Due to the very small mass of the quarks they should have been constantly moving with speeds close to that of light, so how are they bound together? At this point the strong force is introduced, and it is called strong because it is indeed strong: it can hold the quarks bound together, and it is the strongest of all four of the fundamental forces. The carrier of the strong force is the gluon. Analogous to electric charge, in QCD, is the colour charge, and instead of having positive and negative charge there are three colours representing it: red, green and blue (though colour has nothing to do with the real colours). Lattice QCD is a non-perturbative method for solving QCD problems, but why do we use non-perturbative methods? The reason lies with the highly non-linear nature of the strong force and the large coupling constant at low energies, which makes it impossible for an analytical or perturbative solution to be found.

Figure 1: The small sphere is the gluon carrying an anti-red and blue colour. Once it reaches one of the big red spheres representing the quarks, the anti-red with red will be annihilated and the quark will be blue.

Problem

When there is a big project taking place, the project breaks down into smaller parts, and I was tackling one of the problems that occurred during this project. The main project was a lattice field simulation of how a meson (a subatomic particle composed of a quark and an anti-quark) propagates from a point X to a point X', and what concerns the simulation is what path the particle has used to get from X to X'. The data contains the number of measurements taken between these two points, which is denoted by tmax (i.e. tmax=28 means 28 measurements have been used). Now, while the simulation had been running just fine for two different data inputs, the results showed different rest masses for the same meson particle. So that tells us that there should be a point where there is an error, which might occur for a variety of reasons, i.e. a mistake either at the data input, at the initial parameters added to the code, or at the simulation.

Figure 3: Initial plot of the smsm file, which is the other input file of the simulation code (tmax=28).
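The kind of fit-window scan used on such measurement data can be illustrated with a toy effective-mass analysis. This is not the author's bash script: the synthetic correlator, the crude spread-as-confidence measure, and all names below are invented for illustration only.

```python
import math

def effective_mass(corr):
    # Effective mass from the ratio of neighbouring correlator values:
    # m_eff(t) = log(C(t) / C(t+1)); it flattens out where one state dominates.
    return [math.log(corr[t] / corr[t + 1]) for t in range(len(corr) - 1)]

def scan_windows(corr, tmax):
    """For every fit window [tmin, tmax], report the average effective mass
    and its spread (a crude stand-in for the fit confidence)."""
    meff = effective_mass(corr)
    results = {}
    for tmin in range(tmax - 1):
        window = meff[tmin:tmax]
        avg = sum(window) / len(window)
        spread = max(window) - min(window)
        results[tmin] = (avg, spread)
    return results

# Synthetic two-state correlator: ground-state mass 0.5 plus a heavier state
corr = [math.exp(-0.5 * t) + 0.3 * math.exp(-1.2 * t) for t in range(28)]
res = scan_windows(corr, tmax=27)
# Larger tmin excludes the excited-state contamination, so the average
# settles at the ground-state mass and the spread shrinks.
```

Plotting the extracted mass against tmin for fixed tmax, as the article does next, makes the stable plateau (and any misbehaving input file) visible at a glance.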

Methods to tackle the problem

Firstly, out of the number of measurements mentioned above (tmax), some range of the data gives better confidence in the results; i.e. if tmax=28, instead of using the full range of measurements from 0-28 we take a range out of it, as an example from tmin=5 to tmax=25, and observe what confidence, rest mass and error in mass this data provides. Now, since we get different answers for different ranges, it is reasonable that the results needed to be analysed in order to extrapolate the mass, error in mass and confidence for each of the tmin values and for different tmax values. This task was executed with the help of a bash script as follows. The script iterated over all tmin values for different fixed values of tmax, and for each range the quantities mentioned above were extrapolated into a file. Continuing, initial plots for the two data inputs were produced for mass vs tmin for a fixed tmax value, as shown in figures 2 and 3.

Results and discussion

In order to understand where the problem occurred we can compare the two figures already mentioned. It is rather obvious that figure 2 is at least to a certain degree consistent with the rest mass results, while figure 3 shows very unstable results. Based on this deduction, the problem seemed to lie only with the one data input. So, by checking the input parameters for that file, the error was found to lie with the order in which masses were fed to the fit-code. But what does that even mean? Well, the rest mass that was fed to the code was larger than the mass of the first excited state, which cannot be the case, since in the first excited state the particle holds more energy and thus it is heavier. After arranging that issue, the above figures were re-plotted in order to see if the problem had at last been solved, this time with various values of tmax. The final and most important plot is one that shows the results of the two files on a single graph, where the tmax values selected are the ones providing the greatest confidence. This is figure 4, shown below.

Figures: The figures above were plotted in order to show conformance of the results after the mistake was solved.

Conclusion

The analysis of the data was successful and it showed directly where the errors were. Though at the beginning the issue was thought to exist with the range of values used for the two files, in the end the error occurred due to the program not analysing the correct data. The error of the program was that it did not check whether the mass of the ground state was smaller than the one of the first excited state, and that caused all the issues.

Acknowledgement

I want to thank PRACE Summer of HPC for the opportunity that was given to me, and I would also like to thank my supervisors who helped me learn new things; to be honest, almost everything was unknown to me.

Figure 2: Initial plot of the smpt file, which is one of the data inputs of the simulation code (tmax=28).

Figure 4: It should be noted that the circles shown on the graph symbolise the confidence of the data.

So, by observation of figure 4 we can see that, apart from some issues at the small values of tmin, which are pretty reasonable due to the way that the simulation works, the data for both files is pretty stable and in the same range. We can thus conclude that the problem with the rest masses of the two input files has been solved. Another two graphs were produced that include all the tmax values that were used to analyse the data.

PRACE SoHPCProject Title: Lattice QCD simulation
PRACE SoHPCSite: JSC (Jülich Supercomputing Centre), Germany
PRACE SoHPCAuthors: Andreas Nikolaidis, Cyprus
PRACE SoHPCMentors: Stefan Krieg, JSC, Germany; Eric Gregory, JSC, Germany
PRACE SoHPCContact: Andreas Nikolaidis, Phone: +447871554308
PRACE SoHPCAcknowledgement: Thanks to my supervisors; this summer I learned things that I wasn't aware of before and I am really happy about it.
PRACE SoHPCProject ID: 1919

Configuring and using encrypted disks in an HPC environment

Encrypting disks in HPC

Kara Moraw

Supercomputers which are publicly available have relaxed security measures by default. When it comes to processing sensitive data, further steps towards data security have to be taken. One possible measure is working on encrypted disks which, according to our findings, has an acceptable impact on performance.

There is an increasing demand for processing confidential data, such as the genetic makeup of humans, personal data, and data that, if leaked or modified, could have serious legal consequences. Normally, supercomputers and clusters are shared by many users, which makes it difficult to meet strict security requirements. Furthermore, network restrictions hinder the flexibility and the open character that make clusters popular.

PCOCC (Private Cloud On a Compute Cluster) is a promising technology that enables creating private HPC-clusters on existing clusters by automating the setup of KVM virtual machines connected with an overlay network. These guest private clusters can be tailored to the security requirements per project, whilst maintaining the flexibility and openness of the host cluster. A security weakness in the current PCOCC software is that disk images that are used for storing persistent cluster data are not encrypted. Because the disk images reside on the host cluster file systems, confidential data could be read by users with enough privileges on the system. Since PCOCC is an open-source Python-based tool, it could be modified to run virtual machines on encrypted disks.

Working in an HPC environment, performance is always a priority. So, to evaluate the practicability and usefulness of adding the feature described above to PCOCC, we decided to run a series of benchmarks on a simple setup of KVM virtual machines with LUKS encrypted volumes.

LUKS encryption format

LUKS is short for Linux Unified Key Setup and provides a standard for encrypting disks in Linux operating systems. It takes care of the key management with a two-level key hierarchy: a strong master key is generated by the system and used to encrypt or decrypt the hard disk. The master key itself is encrypted by a user key, and only this encrypted version is stored. To decrypt the hard disk, the user must provide his user key, which is used to decrypt the master key, which in turn decrypts the hard disk. This way, the user key can be changed frequently - this is fundamental for security - without re-encrypting the hard disk every time. Only the master key has to be re-encrypted, which saves a lot of overhead.

LUKS makes use of a series of further security measures to impede attackers calculating the keys, such as AF-splitter and PBKDF2 [REFERENCES]. The details are not relevant for this project and are therefore not discussed here.

It is possible to use different encryption algorithms, modes, key sizes and hash functions within LUKS. In this case, we limited our tests to the algorithms AES, Twofish, Serpent and CAST5 in the modes CBC, ECB and XTS. The following gives a short overview. All four of the considered algorithms remain unbroken and are therefore secure. It is out of the scope of this report to explain their details, but the main thing to know in order to understand the significance of encryption modes is that they are all block ciphers. This means that they work with a key of a predetermined length, and this key can only encrypt a message of the same length. For a short message this might be alright, but when looking to encrypt a 500 GiB hard disk, a 500 GiB long key is hardly of use. Instead, the data is split

[Figure 1 panels: (a) XML file for secret object ("Example secret", /guest_images/vol.img); (b) XML file for volume (vol.img, 7, /guest_images/vol.img); (c) Electronic Code Book mode; (d) Cipher Block Chaining mode; (e) XTS mode]

Figure 1: On the left: XML files for configuration. On the right: different encryption modes.
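The XML files in Figure 1 did not survive extraction; as a rough sketch of their shape (following libvirt's secret and storage-volume encryption schemas - the capacity, cipher attributes and UUID here are illustrative, not the project's exact files):

```xml
<!-- (a) secret definition: kept only in memory, value never revealed -->
<secret ephemeral='yes' private='yes'>
  <description>Example secret</description>
  <usage type='volume'>
    <volume>/guest_images/vol.img</volume>
  </usage>
</secret>

<!-- (b) volume definition referencing the secret; format, secret and
     cipher are declared as described in the text -->
<volume>
  <name>vol.img</name>
  <capacity unit='G'>7</capacity>
  <target>
    <path>/guest_images/vol.img</path>
    <encryption format='luks'>
      <secret type='passphrase' uuid='...'/>
      <cipher name='aes' size='256' mode='cbc' hash='sha256'/>
    </encryption>
  </target>
</volume>
```

The passphrase itself is set separately with `virsh secret-set-value`, as the article notes.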

into data blocks of the same length as the key (e.g. 256 bits). There are different possibilities to apply the encryption algorithm to those blocks, i.e. different modes:

ECB

The Electronic Code Book mode, depicted in 1c, is the simplest and fastest way to encrypt blocks of plaintext. Every block is encrypted separately with the chosen algorithm using the key provided. This can be done in parallel. However, it is highly deterministic: identical plaintexts have identical ciphertexts.

CBC

The Cipher Block Chaining mode, seen in 1d, is less deterministic, but slower. The encryption is randomised by using an initialisation vector to encrypt the first block. Every subsequent block's encryption depends on the block before, i.e. also on the initialisation vector. This cannot be done in parallel.

XTS

The XEX-based tweaked-codebook mode with ciphertext stealing, seen in 1e, uses two keys and supports encryption of sectors which cannot be divided into equal-length blocks.

Test environment

PCOCC itself creates a set of virtual machines working together as a cluster, with KVM as their hypervisor. Our test setup lies on the HPC-Cloud [REFERENCE], a service offered by SURFsara. It lets users create virtual machines running on SSDs. One test instance consists of two virtual machines on the HPC-Cloud: one acts as the host for the virtual machine to be benchmarked, and one acts as an NFS server which will contain the encrypted disk.

Within the virtual machine acting as a host (we will be calling it "host VM" from now on), we recreated what PCOCC does in a simpler way by just defining one virtual machine instead of a cluster. Details on doing this can be found in the REDHAT DOCUMENTATION, but the essential steps for creating the encrypted virtual volume the machine should run on are:

1. Define a secret.
2. Define the encryption to be used.

Defining a secret

With libvirt, an object of type "secret" can be created from an XML file like the one in 1a. The attributes ephemeral and private describe that the secret is only kept in memory as opposed to being stored persistently, and that the value to be associated is never revealed. This definition also declares which volume is to be protected by the secret. Note that this definition does not contain the actual passphrase. It has to be set afterwards by calling virsh secret-set-value.

Defining the encryption

Telling libvirt which encryption to use is part of the volume definition. Again, a new encrypted virtual volume can be created from an XML file like the one in 1b. Here, the format, secret and cipher to be used are declared. The secret object points to the secret created above. There are different encryption algorithms, modes, key sizes and even hashes available.

Benchmarks

To evaluate the performance, we set up eight differently encrypted volumes and had a virtual machine transcode a 17-minute-long video on each of them. The configurations we considered are listed in table 1. The results displayed in 2 and 3 show the average time and the standard deviation after five runs on each configuration.

The results in 2 show that there is

no significant impact on the user time. The average performance is decreased by 0.2-3%. There is a higher impact on the system time, as seen in 3. The performance is decreased by 13-30%. Due to the small number of runs, it is not possible to make a general statement on which algorithm and mode have the lowest or highest impact on performance. Nevertheless, the results indicate that the details of the configuration affect the performance.

Table 1: Configuration of volumes

Volume No. | Cipher  | Mode | Key Size | Hash
1          | None    | None | None     | None
2          | AES     | ECB  | 256      | SHA256
3          | AES     | XTS  | 256      | SHA256
4          | AES     | CBC  | 256      | SHA256
5          | AES     | CBC  | 128      | SHA256
6          | Twofish | CBC  | 256      | SHA256
7          | Serpent | CBC  | 256      | SHA256
8          | CAST5   | CBC  | 128      | SHA256

Discussion

Looking back at the process, we find it was fairly simple to set up an encrypted disk for a virtual machine. While we could not verify the configuration within PCOCC itself due to the time constraints, it looks like encrypted disks could be added in after minor changes to the PCOCC code. The benchmark results show that the encryption does impact the performance, but only to a moderate extent. Also, they suggest that performance demands can be balanced with security demands, since different configurations of the encryption lead to different timings. Further work could research the impact of encryption modes or algorithms on overall performance and work towards the incorporation of encrypted disks in the PCOCC code.
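The determinism contrast between ECB and CBC can be made concrete with a toy sketch. The "cipher" below is a stand-in (XOR with the key, then an add), not a real algorithm; only the block-chaining structure is the point.

```python
BLOCK = 4  # bytes per block (real ciphers use 16)
KEY = b"\x13\x37\xab\xcd"
IV = b"\x01\x02\x03\x04"

def toy_encrypt_block(block, key):
    # stand-in "cipher": XOR with the key, then add 1 to every byte
    return bytes(((b ^ k) + 1) % 256 for b, k in zip(block, key))

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def ecb(data, key):
    # every block encrypted independently: parallelisable, but deterministic
    return [toy_encrypt_block(data[i:i + BLOCK], key)
            for i in range(0, len(data), BLOCK)]

def cbc(data, key, iv):
    # each block is XORed with the previous ciphertext block before encryption
    out, prev = [], iv
    for i in range(0, len(data), BLOCK):
        prev = toy_encrypt_block(xor(data[i:i + BLOCK], prev), key)
        out.append(prev)
    return out

plaintext = b"SAMESAMESAME"  # three identical blocks
ecb_blocks = ecb(plaintext, KEY)
cbc_blocks = cbc(plaintext, KEY, IV)
```

Under ECB the three identical plaintext blocks produce three identical ciphertext blocks, leaking the repetition; under CBC the chaining makes all three differ.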

Figure 2: Average user time when using encrypted disks (blue) compared to a non-encrypted disk (green).

Figure 3: Average system time when using encrypted disks (blue) compared to a non-encrypted disk (green).

PRACE SoHPC Project Title: Encrypted volumes for PCOCC private clusters
PRACE SoHPC Site: SURFsara, Netherlands
PRACE SoHPC Authors: Kara Moraw, Universität Bonn, Germany
PRACE SoHPC Mentors: Lykle Voort, SURFsara, Netherlands; Caspar Van Leeuwen, SURFsara, Netherlands
PRACE SoHPC Contact: Kara Moraw, Universität Bonn, E-mail: [email protected]
PRACE SoHPC Software applied: KVM, PCOCC, libvirt
PRACE SoHPC More Information: KVM: https://www.linux-kvm.org; PCOCC: https://github.com/cea-hpc/pcocc; libvirt: https://libvirt.org/
PRACE SoHPC Acknowledgement: I would like to thank my mentors Lykle and Caspar for organising this project and offering the opportunity to get to know SURFsara. My colleagues at SURFsara also deserve a big thank you for listening to my problems over lunch and offering their expertise. Also, I would like to thank EPCC's Amy Krause and Nick Brown for agreeing to have me as an intern two years ago and introducing me to the HPC world. And last but not least, thanks to Allison Walker for always listening and being the best flatmate you could hope for.
PRACE SoHPC Project ID: 1920

Processing and visualising meteorological data for an improved understanding of bird migration patterns

A data pipeline for weather data

Allison Walker

For ecologists studying bird movement, new data sources can contribute a deeper understanding of migratory patterns. This project focused on building a data pipeline enabling ecologists to efficiently access and visualise a new stream of meteorological radar data.

Birds have amazing capabilities for migration, with some species capable of travelling vast distances each year. While ecologists have traditionally focused on GPS tracking when studying migratory patterns, it is increasingly important to integrate additional data sources. In particular, ecologists at the University of Amsterdam are beginning to work with a new stream of weather radar data to track flocks of birds, and answer questions like: how and when are birds most active? How does weather impact when, where and how birds migrate?

These questions are important to answer for various reasons. Migrating birds share airspace with airplanes, windmills, and other man-made structures like high-rise buildings. It's important for both the birds' safety and the longevity of these aircraft and structures that we are able to accurately predict when flocks will be migrating, and where. Some of the real-world implications of accurately predicting bird migration are:

- The aviation industry, as well as Air Forces across the globe, adjusting the scheduling of flights.
- Wind farms stopping or slowing their wind turbines when a large number of birds are flying through.
- Switching off lights in tall buildings and communication towers, which have been known to confuse birds flying at night.

Figure 1: A falcon with GPS tracker, the traditional means for capturing data on bird migration.

Project Overview

The aim of this project was to improve the accessibility of this weather data so that ecologists can more readily study weather patterns and local bird migration for a given radar.

Where the researchers are used to working with relational databases, this project explored the suitability of replacing these with static storage, along with an accompanying working environment and practices that would form a virtual lab. The intention was that querying this static database still feels natural, with ecologists able to apply their existing IT knowledge. This project was also intended as an investigative analysis into the potential infrastructures for data sources beyond weather and GPS. A set of scientific workflows and a guide to best practice for storing and working with the data stream will provide a foundation for future solution design. For the purposes of this internship, however, the use case was meteorological weather data.

The solution developed allows ecologists to select the metric, the time period, the elevation angle and the visualisation format desired for the output. The final visualisation is efficiently generated, giving researchers more time to focus on ecology.

Key tools used in developing this pipeline include: the Parquet storage format for efficient filtering and loading, Spark for distributed computation, Python as the key programming language, and the Wradlib, Geoviews and Holoviews libraries for visualisation of the transformed data. The following sections of this paper will explain the application of these tools in greater detail.

Learning the Domain

The domain specificity of this project required a deep dive into the world of radar. A critical first step in this project was understanding the way in which radars capture and store data, the metadata collected at each sweep, and the prevailing methodologies for reading and visualising these data.

Radar collects information about objects in the surrounding area by using radio waves. The returned echoes can be used to identify aircraft, weather patterns, and even flocks of birds. Every time a radar completes a sweep, a significant amount of metadata is captured and stored. Examples of this metadata are: radar latitude, radar longitude, radar height, sweep elevation angle, sweep range, and timestamp. In terms of the actual data, information is captured for up to 16 different metrics in each sweep and stored into an array of dimensions (azimuths, radius). All of these data and metadata need to be considered in order to correctly project the data captured into an accurate visualisation.

A typical radar might perform a fresh sweep every 15 minutes, meaning 96 sweeps per day. Data collected in each sweep is stored in its own file, in an HDF5 format. HDF5s are hierarchical files, and somewhat complicated to navigate. The highest level is the elevation angle, and nested within each angle is further information about the different metrics and their associated metadata. In the prevailing approach, the most time-consuming step in processing these files was filtering through the multiple layers and metadata to retrieve only the data arrays for the elevation angle and metric of importance.

Figure 2: HDF5 file structure for radar data.

The many complexities of the radar measurement system meant that it was important that the solution built could easily filter and return only the data that the researcher is interested in for their particular question. Parquet was the most sensible format for these purposes. Parquet is a column-oriented data storage format that is efficiently filtered, and highly compatible with Spark, the next key tool in this project's toolbox.

Therefore, the critical first step in building the data pipeline was to convert HDF5s to Parquet for a year's worth of data from a handful of European radars. This process required an understanding of which metadata would be required to accurately visualise the data, to ensure that all relevant information was transferred from HDF5 to Parquet. The structure of these Parquet files can be seen in figure 3.

Figure 3: Parquet fields.

Spark and Parquet

With the radar data formatted into Parquet and saved in Minio - the object storage software used in this project - the next step in the data pipeline was to load the relevant data. Spark - a general-purpose cluster-computing framework - was critical in this step. Functions for filtering and loading data were written in pyspark, and run in a Kubernetes cluster. This cluster has 25 machines, each with 4 nodes. Already by this stage the efficiencies of the new infrastructure were clear. Where 200 seconds are needed to filter, load and perform a groupby function over relevant data from a series of HDF5 files, only 120 seconds were required for the same process from Parquet.

Radar Visualisation

Transforming a data array and metadata into simple visualisations came next. Wradlib, a python library for Weather Radar Data Processing, was very important for understanding the conventions used for transforming polar data (where the dimensions are azimuths and radius) into gridded data that can be more easily visualised. Again with the help of Spark, the Wradlib library was used to produce visualisations as seen in figure 4. These were further expanded into time series visualisations in a gif format (which unfortunately cannot be demonstrated in this article format).

Figure 4: A PPI visualisation from the Wradlib library.

Georeferencing

While these visualisations are useful, ecologists also require information about the geographic location of these weather patterns. In order to maximise the usability of the radar data, it was therefore important to add a geographical representation to the visualisations.
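The HDF5-to-Parquet conversion described above is essentially a flattening of the nested sweep hierarchy into rows. A minimal sketch, with a plain dictionary standing in for an HDF5 file; the field names here are assumptions for illustration, not the project's actual schema:

```python
# Flatten a nested sweep structure (elevation angle -> metric -> data array)
# into flat records, one per (angle, metric, bin) - the row-oriented layout
# that a columnar format like Parquet can then filter efficiently.
sweep = {  # stand-in for one HDF5 sweep file
    "timestamp": "2019-07-01T12:00",
    "elevations": {
        0.5: {"DBZH": [10.0, 12.5], "VRADH": [1.1, -0.4]},
        1.5: {"DBZH": [8.0, 9.5]},
    },
}

def flatten(sweep):
    rows = []
    for angle, metrics in sweep["elevations"].items():
        for metric, values in metrics.items():
            for bin_idx, value in enumerate(values):
                rows.append({
                    "timestamp": sweep["timestamp"],
                    "elevation": angle,
                    "metric": metric,
                    "bin": bin_idx,
                    "value": value,
                })
    return rows

rows = flatten(sweep)
# selecting one angle/metric is now a flat filter instead of a tree walk
dbzh_low = [r for r in rows if r["elevation"] == 0.5 and r["metric"] == "DBZH"]
```

Once the rows are flat like this, a columnar store can answer the elevation/metric query without walking the HDF5 tree, which is where the reported 200 s to 120 s improvement comes from.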

Key tools used in this stage of the pipeline were: Geoviews, Holoviews and Xarray. These are all python libraries that allow us to place the visualised radar data exactly where it belongs on the map.

Geoviews and Holoviews expect geographical data to be in an Xarray format. These are essentially Numpy arrays, but with the addition of dimensions that summarise the geography and time of each datapoint. Utilising the principles from Wradlib, the raw data arrays were transformed into a gridded format and converted to Xarray. In fact, this step removed the reliance on the Wradlib library altogether. Finally, the Xarray datasets were converted into Geoviews Images and then projected onto a Cartolight map.

The final visualisation, with zoom and time-slide interactivity, is shown in the figure above. Geoviews is capable of projecting information for multiple radars in the one view, so this solution is easily scalable to multiple radars, once that data is also transformed to Parquet format.

Final Product

The final output is, as intended, a functional data pipeline allowing ecologists to query and visualise radar data based on their desired filters. Though the exact timings of the previous approach are unconfirmed, it is clear that the solution developed cuts hours of time from the process to develop these weather visualisations.

Future Work

A number of further optimisations remain. Firstly, the role of colour in radar visualisations cannot be underestimated. The current colourmap is not consistent with what the researchers are familiar with, so it needs to be changed. Next, the ecologists often chop out data from the furthest scans due to loss of accuracy, but all data has been included in the current solution. There is also a need to optimise the structure of the Parquet files and bring in additional metadata from the HDF5s. Finally, this data pipeline can be adapted to produce vertical profiles, a separate visualisation derived from the same HDF5 files.

PRACE SoHPC Project Title: Large scale data techniques for research in ecology
PRACE SoHPC Site: SURFsara, Netherlands
PRACE SoHPC Authors: Allison Walker, Australia
PRACE SoHPC Mentor: Matthijs Moed, SURFsara, Netherlands
PRACE SoHPC Contact: Allison Walker, E-mail: [email protected]
PRACE SoHPC Software applied: Spark, Python, Geoviews, Holoviews, Parquet
PRACE SoHPC Acknowledgement: I would like to acknowledge the invaluable support of the project mentors Matthijs Moed and Ander Astudillo. I learnt even more than I expected, and owe my learning and productivity to both Matthijs and Ander and their constant support. Further, the summer in Amsterdam would not have been so successful and fun without the friendship of Kara Moraw.
PRACE SoHPC Project ID: 1921

Visualising the enormous amount of data from plasma fusion simulations by creating open-source visualisation software.

A glimpse into plasma fusion

Arsenios Chatzigeorgiou

Plasma fusion simulations have the potential to guide a minimisation of the plasma microturbulences and make plasma fusion viable. GENE is an open-source code that performs such simulations. The output, however, is visualised with proprietary IDL software. In this project, we reproduced the IDL plotting schemes using open-source alternatives.

Plasma fusion is the holy grail of energy science. By mimicking the sun's process, a nuclear fusion reactor can produce enormous amounts of energy with no significant waste, using only isotopes of hydrogen (2H and 3H) as fuel. In order to produce such energy, however, very high plasma temperatures are needed.

Research on plasma fusion for energy generation started back in 1940. However, to this day, a functional reactor able to generate power has not become available. The major problem is the inability to create steady-state plasma fusion, meaning plasma that is kept stable and at high temperatures for long periods of time during which energy is constantly emitted. Until now, the plasma process duration has been measured in milliseconds, and keeping plasma in a steady state for just a few seconds is challenging.

One of the reasons the plasma doesn't reach a steady state is plasma microturbulence. This is what happens when you mix plasma of different temperatures together: cold plasma absorbs heat from hot plasma, and the average temperature is lower than desired.

Research on plasma fusion reactors is also very time-consuming and expensive, and a lot of resources (human and material) are needed to build one. A reactor that could potentially generate enough energy is far bigger than a human, making its creation even more expensive. Therefore, a valid prediction of the reactor's behaviour is essential.

The scientific problem

In order to foresee and avoid plasma microturbulences, plasma behaviour is being investigated through plasma modelling and plasma fusion simulations. Scientists try to find the ideal reactor characteristics (the geometry of the reactor, the properties of the magnets that will accelerate the plasma, etc.) that minimise plasma microturbulences and therefore maximise the energy produced.

One such plasma simulation utility is GENE, a Gyrokinetic Electromagnetic Numerical Experiment: an open-source code that can simulate, among other things, the plasma microturbulences detected in reactors. These simulations demand big computational power and may only run on HPC clusters. GENE uses MPI (Message Passing Interface) for the parallelised parts of the code. The GENE algorithm generates some numerical ASCII files, but mostly big, complex binary files as output; no visualisation occurs.

Current GENE status

In the GENE suite, the option offered for analysing and visualising the results is an IDL plugin. This plug-in uses the output files and produces a large list of plots (around 40 different plots) that can be accessed via the IDL GENE diagnostics GUI. IDL, however, is not publicly available: in order to fully analyse and visualise the results, a licence is required. The user interface is also very old-fashioned, and the amount of configuration needed makes the GUI less user-friendly.

Figure 1: On the left, our GUI is presented, with the selected plot "Geometry", which stands for geometric elements of the reactor. The top right plot shows ballooning modes, and the bottom right a toroidal representation.

Our accomplishments

In this project we managed to reproduce some of the IDL plotting schemes in a simpler, modern and open-source GUI. The code was written in Python 3.7 and PyQt5 was used. PyQt5 is a Python toolkit based on the C++ Qt5 application framework. It utilises widgets as graphical interface elements and provides the ability to create layouts of GUIs (Graphical User Interfaces). For plotting the data we employed the matplotlib library, and numpy was used as well. For some plots, we used scripts from a python implementation that existed in the GENE git repository but was not currently in development.

One of the biggest difficulties we came up against was that we didn't know enough about the GENE output files. Initially, we weren't able to access the binary files of the results. Also, the formulas that produced the plots were not documented, and the number of different variables surpassed the symbols of the Greek alphabet. To make matters worse, the IDL scripts were also poorly documented. The deus ex machina was the aforementioned python scripts. Those scripts were able to read the files, manipulate the data, and draw many of the IDL plots, but there was no GUI. Despite the bugs, the chaotic and undocumented way they were written, and some incompatibilities, we managed to use them and provide valid plots and animations from the GENE output files - animations that weren't available in the IDL plug-in, since its GUI didn't support live video or animation representation.

To be fair, the number of different plots we finally provide is not big enough. We managed to produce 13 out of 40 plots, but we didn't have enough time to reproduce all of them. Also, since the GENE IDL plug-in is a proprietary program, we found difficulties in running benchmark cases. The native HPC cluster at the University of Ljubljana didn't have an IDL licence, and neither GENE nor IDL were installed on the supercomputer. Thankfully, my mentor had access to the MARCONI HPC cluster in Bologna, where GENE and IDL are installed and licensed, and we were able to run a few benchmark cases.

Initially, apart from the GUI implementation, we planned to create a manual to help the user navigate our program. But a different approach prevailed: what if we made the GUI so ridiculously simple that the manual would be redundant? So we made everything automated, and the user only has to select the GENE results folder and choose a plot to visualise. The simple instructions needed are printed in the GUI.

Don't get me wrong, the reproduction of IDL is far from done. Re-implementation of the GENE Python scripts, enrichment of the plotting schemes available, and different benchmark cases are some of the things that need to be done. However, a solid foundation for an open-source visualisation of gyrokinetic data has been laid.

PRACE SoHPC Project Title: Visualization schema for HPC gyrokinetic data
PRACE SoHPC Site: University of Ljubljana, Slovenia
PRACE SoHPC Authors: Arsenios Chatzigeorgiou, [UoA] Greece
PRACE SoHPC Mentor: Dejan Penko, UL, Slovenia
PRACE SoHPC Software applied: PyQt5, GENE, IDL
PRACE SoHPC Acknowledgement: I have to thank my mentor Dejan Penko and his colleague Gregor Simic. Without their help I wouldn't have been able to get anywhere close to these results.
PRACE SoHPC Project ID: 1922

Building an Electricity Consumption Prediction Model using R and Hadoop to enable Smart Planning

Predicting Electricity Consumption

Khyati Sethia

Global electrical energy consumption is increasing rapidly. In order to make accurate electricity purchases for a selected time interval and enable smart planning, a predictive model is built for end-user electricity consumption.

The consumption of electrical energy is increasing rapidly. The selected Slovene company, which sells electricity to its customers, wants to build a customer electricity forecasting system in order to make better spending forecasts from the consumer's consumption history and data on influential related factors, thereby making it more accurate to know how much electricity should be bought for the selected time interval. This will ultimately save energy, increase the profits for the company and decrease costs for the end-users.

Here, different influential external factors like geography-wise weather, holidays and time are examined, and a study is done to find the correlations for the prediction of energy consumption over time. A process is developed for predicting consumption, which enables smart ordering and planning of energy consumption and thus great savings. Algorithms for predicting electricity are developed in the R environment and then adapted to work over big data databases and NoSQL MongoDB databases. This will serve as a basis for developing a new billing system for end consumers. The following figure shows the project flow schema:

Figure 1: Process Flow

Data Sources

The dataset is 15-minute electricity consumption data spanning one year for 85 end-users. Fields are:

1. Date of Consumption
2. Time of Consumption
3. Region ID
4. Consumer ID
5. Consumption in kW

After the identification of the data and their sources, the influential external factors for the consumption of end-user electricity are investigated. The focus is on the calendar and the weather. For this, the consumption data is fused with data about the influential external factors, and exploratory analysis is performed to understand how the selected factors influence the energy consumption. The weather data is downloaded from the Environmental Agency of the Republic of Slovenia (ARSO) website. The information on holidays in Slovenia is obtained from the Time and Date Slovenia weblink.
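As a sketch of the record handling involved (the project itself works in R; the exact field layout and formats below are assumptions, and the holiday lookup is omitted):

```python
from datetime import datetime

def parse_record(date_s, time_s, region, consumer, kw_s):
    # European decimal comma -> US decimal point, e.g. "1.234,5" -> 1234.5
    kw = float(kw_s.replace(".", "").replace(",", "."))
    ts = datetime.strptime(date_s + " " + time_s, "%d.%m.%Y %H:%M")
    # derive the Day Type column (holiday lookup omitted in this sketch)
    day_type = "Weekend" if ts.weekday() >= 5 else "Working day"
    return {"timestamp": ts, "region": region,
            "consumer": consumer, "kw": kw, "day_type": day_type}

rec = parse_record("06.07.2019", "12:15", "R1", "C33", "1.234,5")
```

The same per-record normalisation, applied to all files and joined with the weather and holiday tables, yields the merged dataset the next section preprocesses.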

Data Preprocessing

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.

- The files are extracted from the nested zipped folder (received as input data from the Slovene company) and bound together into one file.
- The column formats are changed: the European number format is converted to the US number format, and the date format is changed.
- Date-related columns are derived: week number, day of week, parts of the day, etc.
- Similar data processing as above is done for the external factors - the Weather & Holiday datasets.
- Special characters are replaced in region names.
- Weather columns are normalised.
- All three datasets are then merged.
- Day Type information is derived: whether it is a Holiday, a Weekend or a Working day.
- The long holiday information is derived: i.e. suppose Thursday is a holiday; then it is highly likely that most people also take Friday as a leave day, thus making it a long holiday. A similar case is considered for Tuesday.

Data Analysis

Data analysis is the process of evaluating data using analytical and statistical tools to discover useful information and aid in business decision making. We perform data analysis using two methods - Data Mining and Data Visualisation.

Total Consumption Analysis: Consumer 33 has maximum consumption compared to other consumers.
Day Analysis: Consumer 33 has maximum consumption for all day types, with 65% on working days and 29% on weekends. But Consumer 35 has its highest share of consumption, 31%, on weekends, and Consumer 7 has its highest, 94%, on working days. The majority of consumers (except a few) have high consumption on weekdays and less on weekends.
Period Analysis: We find that 67% of the consumers have their highest consumption around noon, while 29% of consumers have their highest consumption at night. No consumer has their highest consumption during the evening.

In Figure 2, the consumption varies with temperature; more electricity is required at very low and very high temperatures.

Figure 2: Variation of Consumption with Temperature

Figure 3 shows the correlation of various weather factors with consumption. Here the temperature is scaled by subtracting 10 and taking the absolute value. It can be seen that consumption has a high correlation with (scaled) temperature and radiation. There are, though, many consumers who don't exhibit a similar temperature vs consumption behaviour.

Figure 3: Correlation of various weather factors with consumption

In Figure 4, it is shown how the consumption varies within a week day (every 15 mins, hence a total of 96 data points in a day). The consumption is low during the night and high in the day, with a peak in the early morning. It has also been observed that the consumption decreases significantly on weekends.

Figure 4: Electricity Consumption by Day

Unsupervised Learning

Clustering is a Machine Learning technique that involves the grouping of similar data points, while data points in different groups should have highly dissimilar properties. In this project we perform Hierarchical and K-Means Clustering. Cluster analysis is performed to generate clusters of days with a similar kind of electricity consumption, using the Euclidean distance and other custom distance metrics. The Euclidean distance among data points is calculated using the following formula:

EuclideanDistance = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}    (1)

We tried these methods to find the optimal number of clusters for all consumers: the Elbow Method, Gap Statistics and Hartigan. The method that gave the least prediction error is the Hartigan index. The Hartigan index is computed using the following equation:

hartigan = \left( \frac{\mathrm{trace}(W_q)}{\mathrm{trace}(W_{q+1})} - 1 \right) (n - q - 1)    (2)

where W_q = \sum_{k=1}^{q} \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^T is the within-group dispersion matrix for data clustered into q clusters, q \in (1, ..., n-2); x_i is the p-dimensional vector of the i-th object in cluster C_k; \bar{x}_k is the centroid of cluster k; and n is the number of observations in the data matrix.

In Figure 5, we have obtained these highlighted groups or clusters for one

consumer using statistical methods. Here the dendrogram visualises his daily consumption, which is divided into three different groups of days.

Figure 5: This dendrogram shows various clusters for one customer.

Figure 6 below shows the clusters obtained using K-Means Clustering for one consumer. This graph explains the data point variability using PCA (the method that minimises the error orthogonal (perpendicular) to the model line).

Figure 6: This clusplot shows various clusters for one customer.

The clusters are formed so that days within the same cluster have a similar consumption pattern and are very different from other clusters. We also found that a relationship exists between the clusters and the day of week & holidays.

Cluster Data Prediction

Using this cluster information, the relation among various factors like the day of the week and the holidays is found. We get better results using K-Means Clustering than Hierarchical Clustering. For now, a process has been developed which uses this cluster information to predict consumption. To measure the prediction accuracy, the error percentage is calculated using the below equation and represented using Figure 7:

Area = \int_a^b [f_1(y) - f_2(y)] \, dy    (3)

Figure 7: Area under the Actual and Predicted Curve

Adaption to NoSQL Database

NoSQL MongoDB is used to store the dataset for enhanced scalability in terms of storing large volumes of data, for flexibility with JSON-like documents, and for faster retrieval, ad-hoc queries, indexing, replication and MapReduce programs. The R 'mongolite' package is used to work with the NoSQL MongoDB database. Using this package, the following was performed: data retrieval, manipulation, and analysis using the data stored in MongoDB.

Conclusions and Future Work

In this project, we use the electricity consumption data received from the Slovene company. We collect weather and holiday information and find their influence on electricity consumption. We perform clustering to find groups of days with a similar consumption pattern. Using this clustering information, we construct a prediction model. We also adapt the data to NoSQL and the RHadoop MapReduce framework for large datasets & parallel processing. As a future direction, the analysis can be extended to larger amounts of data, which will improve the predictions manyfold. Also, the Advanced Clustering Algorithm (ACA) can be explored to reduce the cluster variances.

References

1. Fazil Kaytez, M. Cengiz Taplamacioglu, Ertugrul Cam, Firat Hardalac (2014). Forecasting electricity consumption: A comparison of regression analysis, neural networks and least squares support vector machines.
2. Usman Ali, Concettina Buccella, Carlo Cecati (2016). Households electricity consumption analysis with data mining techniques.

Adaption to Big Databases

Hadoop is used which is mainly useful Figure 6: This clusplot shows various clus- for large datasets, that can’t be man- ters for one customer. aged by single pc. It has distributed PRACE SoHPCProject Title storage and any time any number of Industrial Big Data analysis with The clusters are formed with simi- RHadoop computers can be added or removed lar consumption pattern in same cluster from this cluster. The files have replicas PRACE SoHPCSite and very different from other clusters. University Of Ljubljana, LECAD on different nodes thus making the sys- We also studied that relationship exists Laboratory, Slovenia tem very reliable & resilient to failure. between the clusters with day of week PRACE SoHPCAuthors It solves the problem by divide and con- Khyati Sethia, The University of & holidays quer approach and achieve parallel pro- Salford, United Kingdom cessing. Some of the algorithms have PRACE SoHPCMentor Janez Povh, University of Ljubljana, Cluster Data Prediction been adapted to big databases for paral- Faculty of Mechanical Engineering, lel processing on multiple nodes using Ljubljana, Slovenia Khyati Sethia Using this cluster information, the re- MapReduce Hadoop scripts in R envi- PRACE SoHPCContact lation among various factors like day ronment. Khyati Sethia, University Of Salford of the week and the holidays is found. Phone: +44 7739897649 E-mail: [email protected] We get better results using K-Means MOOC "Managing Big Data with R PRACE SoHPCSoftware applied Clustering than Hierarchical Clustering. and Hadoop" : R, RHadoop, MongoDB For now, a process has been developed PRACE FutureLearn MOOC is a course PRACE SoHPCMore Information which uses this cluster information to on how to manage large amounts of R, RHadoop, Future Learn, MongoDB predict consumption. To measure the data using Hadoop MapReduce in R PRACE SoHPCAcknowledgement prediction accuracy, the error percent- environment. 
The data has been made The author would like to thank PRACE Summer of HPC age is calculated between the actual available in HDFS. The script developed for this opportunity. Also, author is grateful to Dr. Janez Povh and Dr. Leon Kos for their mentoring during this and predicted consumption. This error is how to do data formatting, aggrega- project. is equal to the difference in area be- tion and calculating mean and standard PRACE SoHPCProject ID tween the two curves and is given by deviation. Then plotting the results. 1923
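The error-area metric of equation (3) can be sketched numerically. Below is a minimal Python illustration (not the project's R code; the consumption values are hypothetical) that approximates the area between an actual and a predicted curve with the trapezoidal rule, taking the absolute difference so that crossings do not cancel:

```python
def area_between(y, f1, f2):
    """Trapezoidal approximation of the area between two curves,
    i.e. equation (3): the integral of |f1 - f2| over the grid y."""
    total = 0.0
    for i in range(1, len(y)):
        g_prev = abs(f1[i - 1] - f2[i - 1])
        g_curr = abs(f1[i] - f2[i])
        total += 0.5 * (g_prev + g_curr) * (y[i] - y[i - 1])
    return total

# Hypothetical hourly consumption (kWh): actual vs. predicted
hours     = [0.0, 1.0, 2.0, 3.0]
actual    = [2.0, 3.0, 2.5, 2.0]
predicted = [2.2, 2.8, 2.5, 2.4]
err_area = area_between(hours, actual, predicted)  # prediction error
```

Dividing this area by the area under the actual curve would then give the error percentage used to judge the prediction.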

Energy reporting in Slurm Jobs

Enhancing energy consumption reporting capabilities in HPC centres by developing a Slurm plugin for NVIDIA GPUs

Matteo Stringher

HPC and Supercomputing centres are very intensive energy consumers. Energy reporting in such centres is a key feature of the system, especially for operations staff. During this project we have worked towards the implementation of a plugin to enhance the Slurm energy reporting capabilities.

Energy consumption is one of the main concerns in the HPC world. As an order of magnitude, the power demand of the largest HPC centres is about 5 to 10 megawatts. In the race to Exascale-level systems, HPC facilities have set a cap on the total power available, to ensure both efficient and economically viable systems.

The power used for cooling the system can be roughly equal to the power consumed to keep up the nodes under user workload. It is clear that measuring the consumption of each component of the system allows the work performed on the system to be better understood and optimised. In the past years, more and more attention has been paid not only to performance parameters, but also to efficiency, such as a system's Gigaflops-per-watt metric, which is used for the Green500 ranking of supercomputers.

Tool set and project description

Slurm

Slurm is one of the main job scheduling and resource management systems in HPC centres. It is adopted by most of the computer systems listed in the TOP500. Energy consumption can be retrieved in two different ways: external sensors and internal monitoring. The project focused on the second methodology. Slurm natively provides support for IPMI and RAPL. The former is focused on baseboard management and consumption retrieval, while the latter operates at a more fine-grained level for Intel chips.

NVML

In the past decade GPUs have been increasingly requested by scientific users; they allow a variety of codes to be sped up, from molecular dynamics applications to the training of deep neural networks. Deep learning training phases can be costly, especially when a hyperparameter analysis is needed to fine-tune the model. GPU spot power consumption can be retrieved for NVIDIA GPUs through IPMI or NVML (provided by NVIDIA). According to the NVIDIA documentation, it is possible to retrieve the power usage of the GPU and its associated circuitry.

To our knowledge, Slurm does not provide an integration for GPU power consumption recording, so we decided to build and experiment with a plugin able to collect data and summarise the results to the user, such as the total consumption of their tasks that use GPUs.

Slurm plugin development

Slurm allows external interaction with its functions in two different ways: the Slurm plugin API or SPANK. As regards the first one, Slurm provides a different interface for each type of task the plugin is meant for (e.g. accounting storage, energy). SPANK, which stands for Slurm Plug-in Architecture for Node and job (K)control, provides an easier approach, where there is no need to access the source code. In the project the latter has been chosen, since our requirements were satisfied by the SPANK capabilities (e.g. the possibility to spawn a metrics collection process for the duration of a user's task). Moreover, the installation is rather straightforward: once the code is compiled into a shared library, the plugin can be easily mounted by adding a line to an internal configuration file, plugstack.conf, which controls where plugins are loaded from and how Slurm should deal with their exit states. It is suggested to recompile the plugin every time the Slurm installation on the operational system is updated. The configuration must be identical on each node where the plugin will be run.
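The NVML readings described above can also be reproduced interactively. Below is a hedged Python sketch using the pynvml bindings (the plugin itself calls the C NVML API; pynvml may not be installed everywhere, so the sketch degrades gracefully):

```python
# Sketch: one power sample per GPU via NVML. Assumption: the pynvml
# bindings are available; the real plugin uses the C NVML API directly.
import time

try:
    import pynvml
except ImportError:          # allow running the sketch without a GPU stack
    pynvml = None

def milliwatts_to_watts(mw):
    """NVML reports power in milliwatts; the plugin stores watts."""
    return mw / 1000.0

def sample_gpu_power():
    """Return [(timestamp, watts), ...], one entry per GPU on the board."""
    if pynvml is None:
        return []
    try:
        pynvml.nvmlInit()
    except Exception:        # no NVML-capable driver on this machine
        return []
    samples = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # GPU + circuitry, mW
        samples.append((time.time(), milliwatts_to_watts(mw)))
    pynvml.nvmlShutdown()
    return samples
```

Calling `sample_gpu_power()` in a loop at the chosen frequency is, conceptually, what the plugin's forked monitoring process does.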

[Figure 1, left panel: SPANK call flow. slurmd runs slurmd_init() and job_prolog(); slurmstepd runs init(), processes spank options, user_init(), then for each task fork(), execve() and wait(), then exit(); finally job_epilog() and slurmd_exit(). Right panel: power (W) vs. time (s) for sampling frequencies of 60 Hz and 1 Hz.]

Figure 1: Left image: Simplified workflow of the SPANK functions that can be called inside the slurmd and slurmstep daemons. Right image: Demo run in the second test phase (before the installation of the plugin) on the Iris cluster to compare the use of different sampling frequencies.

Hierarchical Data Format

HDF5 is a popular file format in the scientific fields, used to store heterogeneous and large amounts of data in a single file. Moreover, it integrates with MPI for writing/reading data in parallel up to large scales.

Slurm inbuilt energy accounting

Data collected through Slurm's IPMI/RAPL mechanisms is retrieved during the computation, and the final consumption value is stored in the Slurm database. This is actually a tunable option that must be set in the proper configuration file. The slurmdb library offers different functions to query the database. Our plugin retrieves the value from the database at the end of the job and informs the user. The Slurm documentation highlights that the value must be considered only if the node has been reserved in exclusive mode, such that a job's consumption is accurately tracked.

Project development

Virtualized system and Slurm architecture

To develop the plugin, a virtualized system has been adopted in order to work in a safe testing environment, without relying on a physical cluster. We have leveraged the Vagrant-VirtualBox combination to deploy a virtualized HPC infrastructure. This improves debugging and development time. More information about the setup and micro-cluster composition is available in the repository documentation.1

The Slurm infrastructure is mainly composed of 3 entities: management, front-end and computation nodes. When launching a job, Slurm allocates the number of nodes requested by the user. One node with relative ID equal to 0 (a job's 'head node') manages task launches. Each computation node runs its own copy of the slurmd daemon, which is in charge of starting a slurmstepd daemon for each of a job's steps. Our plugin interacts with the slurmd daemons thanks to the SPANK library.

SPANK integration

SPANK provides an interface which must be used to let the slurmd and slurmstepd daemons invoke the plugin. SPANK plugins are loaded in up to five separate contexts: local, remote, allocator, slurmd, job_script. Remote was the mainly used one, since most operations are executed at the step level. Figure 1, on the left, shows the different calls that can be used with the SPANK interface.

SPANK easily allows the job behaviour to be modified. Our plugin can be invoked from the sbatch command when launching the job script. Along with the --energy-reporting flag, which activates energy measurements, the user can specify the sampling frequency (expressed in hertz); otherwise a default value of 20 measurements per second is used. The sampling frequency must be equal to or greater than 1.

At each step of the job script, a process is forked on every node, which is in charge of storing in the HDF5 file the timestamp and the measured power consumption, i.e. one timestamp and one measurement for each GPU mounted on the mainboard. All the samples are stored in a buffer which is flushed to the HDF5 file every second. When the user task ends, the forked monitoring process is stopped. The HDF5 file will then contain several datasets, one for each step and for each GPU. In each dataset, two sequences of n samples can be used to calculate the consumption over time.

1 puppet-slurm: github.com/ULHPC/puppet-slurm
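The monitor's buffer-and-flush loop described above can be sketched as a simplified Python stand-in (the actual plugin is written in C and flushes to an HDF5 file; the `flush_cb` callback here is a hypothetical placeholder for that write):

```python
class SampleBuffer:
    """Collects (timestamp, per-GPU power) samples and flushes them once
    at least one second has elapsed since the last flush. A simplified
    stand-in for the plugin's C buffer; `flush_cb` (hypothetical) would
    append the rows to the HDF5 dataset."""

    def __init__(self, flush_cb, flush_interval=1.0):
        self.flush_cb = flush_cb
        self.flush_interval = flush_interval
        self.buffer = []
        self.last_flush = 0.0

    def add(self, timestamp, watts_per_gpu):
        self.buffer.append((timestamp, watts_per_gpu))
        if timestamp - self.last_flush >= self.flush_interval:
            self.flush_cb(self.buffer)   # write buffered rows out
            self.buffer = []
            self.last_flush = timestamp

rows = []                                # stands in for the HDF5 dataset
buf = SampleBuffer(rows.extend)
for t in [0.0, 0.05, 0.10, 1.05, 1.10, 2.10]:   # sample timestamps (s)
    buf.add(t, [95.0, 97.0])                    # two GPUs on the board
```

With these timestamps the buffer flushes twice (at t = 1.05 s and t = 2.10 s), leaving no samples pending, which mirrors the once-per-second flush policy of the plugin.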

[Figure 2 plot: GPU power consumption (W) vs. time (s) during Tensorflow-GPU training on MNIST, one curve per GPU (gpu0-gpu3).]

Figure 2: MNIST convolutional NN training test.

Given P_{1,...,n}, the samples collected at each timestamp t_{1,...,n}, the total consumption can be approximated by the equation:

E_tot = Σ_{i=2}^{n} [(P_i + P_{i−1}) / 2] · (t_i − t_{i−1})

Using the equation above, the consumption of each GPU of each reserved node is computed, and the final value is shown to the user expressed in Joules.

Results & further work

The implemented plugin can seamlessly interact with the Slurm execution. Figure 1, on the right, shows the power consumption for the same run with different sampling frequencies. It can be seen that the choice of the sampling rate has a notable effect: the blue curve has a lot of spikes which cannot be captured at a lower frequency. It must be underlined that the precision of the measurement provided by the NVML library is subject to error.

The plugin has been developed and tested through three stages:

• In the first one, we used the virtualized cluster to ensure the code would run safely on the physical one. This development part took most of the project time.

• In the second one, a prototype of the code was run on the Iris cluster without affecting the Slurm configuration. The results are shown in Figure 1, on the right. The core data retrieval functions at this stage were able to collect information from all the GPUs on the mainboard.

• In the last phase, the plugin was fully assembled and tested by installing it on the cluster and executing it on compute nodes featuring dual Skylake CPUs and quad Volta V100 GPUs.

Figure 2 shows a Slurm job executing a Singularity container with GPU-enabled Tensorflow 1.12 and Python 3. The same code has been tested on a longer run, achieving similar results. In detail, the job steps are:

• 0-5 s: loading the software environment (lmod), Singularity container initialization

• 5-12 s: TensorFlow-GPU initialization, data load from disk to memory

• 12-36 s: training for 10 epochs

• job spindown

The energy consumption data is stored for further analysis if needed. The HDF5 file format supports large datasets and compression, and is used by our plugin to store timestamped energy measurements. The plugin manages parallel jobs and all the contained steps. The user is informed about the energy usage for each job step, and precise identification of hot spots is possible (together with other information from Slurm accounting).

Future work could study and understand the overhead of the retrieval process. This runs mostly on the CPU, so we expect the overhead to be low for GPU-intensive processes.

Conclusions

We have prototyped a new Slurm plugin for energy reporting of a user's tasks, which can interact with NVIDIA GPUs through the NVML interface, allowing for GPU energy consumption retrieval, a novel development to our knowledge. The code is available as open-source.2

PRACE SoHPCProject Title
Energy Reporting in Slurm Jobs

PRACE SoHPCSite
University of Luxembourg, Luxembourg

PRACE SoHPCAuthors
Matteo Stringher

PRACE SoHPCMentor
Valentin Plugaru, Unilu, LU

PRACE SoHPCContact
Matteo Stringher, University of Luxembourg, E-mail: [email protected]

PRACE SoHPCSoftware applied
Slurm, Vagrant, Puppet, Virtualbox, C

PRACE SoHPCAcknowledgement
I gratefully acknowledge the PCOG group for the project and the support given before and throughout the whole summer.

PRACE SoHPCProject ID
1924
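The trapezoidal sum used for the reported energy maps directly to code. Below is a minimal Python sketch (the plugin itself is written in C) turning one GPU's samples into Joules:

```python
def total_energy(timestamps, power_watts):
    """Approximate E_tot = sum_{i=2..n} (P_i + P_{i-1})/2 * (t_i - t_{i-1}),
    i.e. trapezoidal integration of power (W) over time (s) -> Joules."""
    energy = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        energy += 0.5 * (power_watts[i] + power_watts[i - 1]) * dt
    return energy

# 1 Hz samples of a GPU drawing a constant 100 W for 3 s -> 300 J
e = total_energy([0.0, 1.0, 2.0, 3.0], [100.0, 100.0, 100.0, 100.0])
```

Summing this quantity over every GPU of every reserved node gives the per-job total reported to the user.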

2 gitlab.uni.lu/sohpc/public/slurm-spank-energy-reporting

Benchmarking Deep Learning

Performance analysis of Distributed and Scalable Deep Learning

Sean Mahon

This project deals with different ways of evaluating the performance of distributed training of deep learning models and comparing the efficiency of multiple widely used frameworks. In particular, experiments were run to determine the scalability of training Resnet models on the CIFAR10 dataset using both CPUs and GPUs.

Deep learning is becoming an extremely prominent method of solving problems both in academia and in industry. However, training deep neural networks, which often contain hundreds of thousands, if not millions, of parameters, can be extremely computationally expensive. Therefore, it is natural to attempt to distribute this training over multiple processors where possible. There are multiple ways of approaching this task, with the simplest and most common being data parallelism.1

Put simply, data parallelism involves splitting up the sample of the data, or batch, used in each training step between all available processors. The resulting changes in the trainable parameters of the model are then averaged across all processors before the process is repeated. However, due to the high number of parameters to be communicated between processors, it is rarely optimal to keep the same hyperparameters when adding more devices. Normally, this is handled by fixing the batch size per device and adjusting other parameters to compensate.

This creates a problem when considering how to evaluate how well a model is performing. Calculating the speedup in throughput, or the number of data points processed per second, is a common approach taken by chip manufacturers, but may not be helpful if the model does not train as effectively on too many devices. For this reason, many existing ML benchmarking efforts, such as MLPerf, place considerable emphasis on time-to-accuracy results. This involves measuring the walltime required for the model to reach a certain level of accuracy.

Further complications arise from the fact that a number of different libraries and frameworks exist for deep learning, many with very similar functionality. It is important to note that, while these frameworks share many of the same functions, the implementation can vary somewhat between frameworks. While there have been some efforts to create benchmarking tools which work with multiple frameworks, such as Deep500,2 there does not seem to be any individual benchmarking platform which provides comprehensive support for all of the most common frameworks in use.

Structure of the Code

In order to compare the performance of different frameworks on the University of Luxembourg's Iris Cluster, software was written to train models in a variety of frameworks while collecting data relevant to benchmarks, such as those described previously, between epochs. The current version allows the user to import a model and dataset and supply a config file with details of the training, before it automatically distributes everything across the hardware specified, trains for some given number of epochs, and collects common benchmarking metrics at the end of the run. In addition, there is also a script to parse the config file and produce a suitable SLURM batch script to ensure that the resources allocated match what is specified. The frameworks supported are Tensorflow (distributed with Horovod), Keras (native multi_gpu_model and Horovod), MXNet, and PyTorch (DistributedDataParallel models with the Gloo backend).

As mentioned in the last section, there are multiple variables which affect whether the training process can be parallelised efficiently. Some research and experimentation was required to find sensible default settings for the scaling of hyperparameters. However, in general, it was found that current best practice as described in the literature3 (and well summarised in this article) was effective.
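The data-parallel step described above can be sketched in plain Python. This is illustrative only: real frameworks exchange gradients with allreduce collectives rather than a local loop, and the toy gradient function here is hypothetical:

```python
def data_parallel_step(batch, n_workers, grad_fn):
    """One data-parallel training step: shard the batch across workers,
    compute each worker's gradient on its shard, then average the
    gradients as if they had been exchanged via an allreduce."""
    shard_size = len(batch) // n_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(n_workers)]
    grads = [grad_fn(s) for s in shards]              # per-worker gradients
    n_params = len(grads[0])
    return [sum(g[j] for g in grads) / n_workers      # averaged update
            for j in range(n_params)]

# Toy gradient: one "parameter" whose gradient is the shard mean
mean_grad = lambda shard: [sum(shard) / len(shard)]
avg = data_parallel_step([1.0, 2.0, 3.0, 4.0], 2, mean_grad)
```

Each worker sees only a quarter of the communication-free work here, but the averaged result is identical to a single-worker step on the full batch, which is the point of the technique.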

(a) Speedup in Throughput for all Frameworks Considered. (b) Time-to-Accuracy curves for Tensorflow and PyTorch.

Figure 1: Results from GPU Experiments

Sample Problem: Resnet and CIFAR10

While the code written for this project could in principle be used for a variety of different models and tasks, it may be insightful to consider a well known example for experiments. The dataset used was CIFAR10,4 a standard collection of 32×32 pixel images featuring 10 categories. The dataset contains 50,000 images on which to train a model to correctly identify the category of object in a further 10,000 unseen test images.

Figure 2: Sample Images from CIFAR10

The model used to classify these images was a version of Resnet,5 a well known example of a convolutional neural network for image classification problems. In particular, a 44 layer network with over 600,000 trainable parameters was used. While training such a model may sound like a substantial task, it would be considered a medium sized problem in deep learning. While it would be inadvisable to attempt to train this model on a normal laptop, it would not necessarily take several hours to get meaningful results when using more sophisticated hardware.

GPU Results

This model was trained for 40 epochs with a local batch size of 256, using the frameworks mentioned, on varying numbers of GPUs. It should be noted that the same data pipeline was used to load and perform simple augmentations on each batch, to ensure that inconsistencies in the data used do not affect training results. The mean speedup in throughput is shown in figure 1a. The main trends visible for each framework were:

• The PyTorch results (found using the DistributedDataParallel functionality and Gloo backend) seem to scale very well and only begin to substantially deviate from ideal predictions once more than one node (4 GPUs) is used.

• Tensorflow distributed using Horovod also scales well, though inefficiencies are visible at a lower number of GPUs than for PyTorch. It should be noted that, once serial code has been implemented, distributing training with Horovod requires very little extra code compared to other frameworks.

• Throughput for MXNet increased at a slightly lower rate than Tensorflow and PyTorch, with moderate inefficiencies visible when using just 4 GPUs. Distributed training was also less user-friendly to set up than for the other frameworks, meaning support for multiple nodes has yet to be implemented.

• The experiments for Keras did not appear to scale well. This is not surprising, as it is the most high level of the frameworks considered and Horovod was originally designed to work with Tensorflow. There is also the issue that the native multi_gpu_model only executes training steps in parallel, rather than also loading data, potentially creating extra bottlenecks.
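The hyperparameter adjustment mentioned earlier (fixing the per-device batch size and compensating elsewhere) is commonly done with the linear learning-rate scaling rule of Goyal et al.3 A minimal sketch, with hypothetical base values rather than the project's actual settings:

```python
def scaled_hyperparams(n_devices, per_device_batch=256, base_lr=0.1,
                       base_batch=256):
    """Linear learning-rate scaling: keep the per-device batch size fixed,
    so the global batch grows with the device count, and scale the
    learning rate by the same factor (Goyal et al., 2017)."""
    global_batch = per_device_batch * n_devices
    lr = base_lr * global_batch / base_batch
    return global_batch, lr

gb, lr = scaled_hyperparams(4)   # e.g. 4 GPUs: global batch 1024, lr x4
```

The drawback, visible in the results below, is that the global batch (and learning rate) eventually become so large that training quality degrades, which is why throughput alone is a misleading metric.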

As mentioned previously, while examining the throughput is a very concise, scientific way to determine how fast a program is running, it does not give much indication as to whether the efficiency of the training has been adversely affected by adding more devices. For this reason, plots of the walltime required for the model to reach various test accuracies, for the two frameworks with the highest throughput, are shown in figure 1b. From examining these graphs, it appears that once 4 or more GPUs are used, the benefits of adding more seem somewhat limited. Note that Tensorflow seemed to actively slow down and did not reach 80% accuracy in its 40-epoch run once 12 GPUs were used. This is likely due to the fact that the changes to the training hyperparameters, such as increasing the global batch size, eventually start to prevent the model from training effectively.

It also appears that the curves for Horovod are not quite as smooth as those for PyTorch. A possible explanation for this is that Horovod does not synchronise the non-trainable weights in the model's batch normalisation layers. While these weights should converge to common values eventually, it is plausible that discrepancies between processors would add noise to results from the early stages of the training.

CPU Results

In the case of the two most promising frameworks, similar experiments were run using CPUs instead of GPUs for comparison. To avoid memory errors, the batch size had to be reduced to 16 per CPU. As may be expected, these runs took much longer. However, as can be seen in figure 3, the throughput seems to scale at least as efficiently as in the GPU version. PyTorch seemed to stay within a reasonable margin of error of ideal scaling, up to the 4 nodes (112 CPUs) considered. A likely reason for this is that, as using a single CPU for training would be substantially slower than one GPU, communication takes up a smaller fraction of the total training time in these experiments.

Figure 3: Throughput Speedup from CPU Experiments

However, as before, the efficiency of training seemed to be impacted significantly by the high global batch size which accompanies the use of many processors. As seen in figure 4, the improvement in time-to-accuracy results is negligible when more than 56 CPUs are used. Note also that the y axis scale suggests that the performance in this limiting case is similar to what would be expected from using two GPUs.

Figure 4: Time-to-Accuracy Results from CPU Experiments

Conclusions

While some aspects of the results found are likely to be specific to this particular neural network, it appears that, for problems of this type, the DistributedDataParallel features in the PyTorch library are the most efficient of the frameworks considered for distributing the training process across many devices. Tensorflow with Horovod also scales well and was easy to use, but it appears that some of the shortcuts taken to improve throughput had a negative effect on test accuracy. The other frameworks tried generally did not scale as well as these two. It should be noted that, for both PyTorch and Tensorflow, it was evident that some of the common changes to hyperparameters, such as increasing the global batch size, caused the time-to-accuracy results to stop improving once too many processors were used, at least in this case.

Similar results were found when using CPUs and GPUs. In the case of the former, little improvement was seen in the time-to-accuracy results once more than 56 CPUs were used. However, it was noted that, despite being lower than in the GPU case in general, the speedup in throughput when more CPUs were used was very close to ideal, likely due to the proportion of time spent on communication being lower. From this fact, it may be reasonable to claim that, if the problem is sufficiently large and is not particularly sensitive to changes in batch size and other hyperparameters, it would be possible to achieve near-ideal scaling using the correct framework.

References

1 Géron, Aurélien (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media.

2 Ben-Nun, Tal; Besta, Maciej; Huber, Simon; Nikolaos Ziogas, Alexandros; Peter, Daniel; Hoefler, Torsten (2019). A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning.

3 Goyal, Priya; Dollár, Piotr; Girshick, Ross; Noordhuis, Pieter; Wesolowski, Lukasz; Kyrola, Aapo; Tulloch, Andrew; Jia, Yangqing; He, Kaiming (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.

4 Krizhevsky, Alex (2009). Learning Multiple Layers of Features from Tiny Images.

5 Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778.

PRACE SoHPCProject Title
Performance analysis of Distributed and Scalable Deep Learning

PRACE SoHPCSite
University of Luxembourg, Luxembourg

PRACE SoHPCAuthors
Sean Mahon, Trinity College Dublin, Ireland

PRACE SoHPCMentor
Sebastien Varette, University of Luxembourg, Luxembourg; Valentin Plugaru, University of Luxembourg, Luxembourg

PRACE SoHPCAcknowledgement
I would like to thank my project mentors Sebastien Varette and Valentin Plugaru, as well as the other researchers and students in the UniLu HPC group, for their continued help and support during my time in Luxembourg.

PRACE SoHPCProject ID
1925
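The time-to-accuracy metric used throughout can be computed from a log of (walltime, test accuracy) pairs. A minimal sketch with hypothetical log values:

```python
def time_to_accuracy(log, target):
    """Return the first walltime (s) at which test accuracy reached
    `target`, or None if the run never got there (as happened for
    Tensorflow at 80% with 12 GPUs). `log` is [(walltime, accuracy), ...]
    in chronological order."""
    for walltime, acc in log:
        if acc >= target:
            return walltime
    return None

# Hypothetical per-epoch measurements for one run
log = [(30.0, 0.55), (60.0, 0.71), (90.0, 0.78), (120.0, 0.83)]
t80 = time_to_accuracy(log, 0.80)
```

Unlike raw throughput, this metric directly penalises configurations whose extra devices stop translating into faster convergence.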

www.summerofhpc.prace-ri.eu