summerofhpc.prace-ri.eu — 2019

A long hot summer is time for a break, right? Not necessarily! PRACE Summer of HPC 2019 reports by participants are here.

HPC in the summer?
Leon Kos
There is no such thing as a lazy summer. At least not for the 25 participants and their mentors at 11 PRACE HPC sites.
Summer of HPC is a PRACE programme that offers university students the opportunity to spend two months in the summer at HPC centres across Europe. The students work using HPC resources on projects that are related to PRACE work, with the goal of producing a visualisation or a video.

This year, the training week was in Bologna and it seems to have been the best training week yet! It was a great start to Summer of HPC and set us up to have an amazing summer! At the end of the summer, videos were created and are available on YouTube in the PRACE Summer of HPC 2019 presentations playlist. Together with the following articles, interesting code and results are available. Dozens of blog posts were created as well.

Every year, at the end of the activity, two projects out of the 25 participants are selected and awarded for their outstanding performance. The award ceremony was held early in November 2019 at the PRACE AISBL office in Brussels. The winners of this year are Mahmoud Elbattah for Best Performance and Arnau Miro Janea as HPC Ambassador. Benjamin surpassed expectations while working on an interesting project, carrying out more work than planned. Mostly with no assistance, he carried out benchmarking analysis to compare different programming frameworks. His report was well written in a clear and scientific style. Pablo carried out a great amount of work during his project. More importantly, he has the capacity to clearly present his project to laypersons and the general public. His blog posts were interesting, and his video was professionally created and presented his summer project in a captivating and pleasant way.

Therefore, I invite you to look at the articles and visit the web pages for details and experience the fun we had this year. What can I say at the end of this wonderful summer? Really, autumn will be wonderful too. Don't forget to smile!

Leon Kos

Contents

1 Memory: Speed Me Up 3
2 "It worked on my machine... 5 years ago" 6
3 High Performance Machine Learning 8
4 Electronic density of Nanotubes 11
5 In Situ/Web Visualization of CFD Data using OpenFOAM 14
6 Explaining the faults of HPC systems 17
7 Parallel Computing Demos on Wee ARCHIE 19
8 Performance of Python Program on HPC 22
9 Investigations on GASNet's active messages 25
10 Studying an oncogenic mutant protein with HPC 28
11 Lead Optimization using HPC 31
12 Deep Learning for Matrix Computations 34
13 Billion bead baby 36
14 Switching up Neural Networks 39
15 Fastest sorting algorithm ever 42
16 CFD: Colorful Fluid Dynamics? No with HPC! 46
17 Emotion Recognition using DNNs 50
18 FMM: A GPU's Task 54
19 High Performance Lattice Field Theory 57
20 Encrypted disks in HPC 59
21 A data pipeline for weather data 62
22 A glimpse into plasma fusion 65
23 Predicting Electricity Consumption 67
24 Energy reporting in Slurm Jobs 70
25 Benchmarking Deep Learning for SoHPC 73

PRACE SoHPC Coordinator
Leon Kos, University of Ljubljana
Phone: +386 4771 436
E-mail: [email protected]

PRACE SoHPC More Information
http://summerofhpc.prace-ri.eu
Enabling software and hardware profiling to exploit heterogeneous memory systems

Memory: Speed Me Up
Dimitrios Voulgaris
Memory interaction is one of the most performance limiting factors in modern systems. It would be of interest to determine how specific applications are affected, but also to explore whether combining diverse memory architectures can alleviate the problem.
Supercomputers are gradually establishing their position in almost every scientific field. Huge amounts of data require storage and processing, and therefore impel the exascale era to arrive even sooner than anticipated. This immense shift cannot be achieved without seeing the big picture. Processing power is usually considered the determinant factor when it comes to computer performance, and indeed, so far, all efforts in that direction did seem fruitful. CPU performance, however, has been extensively researched and thus optimised, leaving scarcely any opportunities for improvement.

INTRODUCTION

Heterogeneous memory (HM) systems accommodate memories featuring different characteristics such as capacity, bandwidth, latency, energy consumption or volatility. While HM systems present opportunities in different fields, the efficient usage of such systems requires prior application knowledge, because developers need to determine which data objects to place in which of the available memory tiers [1]. Given the limited nature of "fast" memory, it becomes clear that the approach of simply placing the most often accessed data objects on the fastest memory can be quite misleading. In order to identify the data which benefit the most from being hosted in different memory subsystems, it is important to gain insight into the application's behaviour.

Application profiling comes in two flavours: software and hardware based. Tools like EVOP [2], which is an extension of Valgrind [3], VTune [4], as well as others, offer exclusively instruction-based instrumentation, i.e. monitoring and intercepting of every executed instruction. The additional time overhead implied by this technique is more than apparent; therefore, vendors have come up with the idea of enhancing hardware, making it capable of aiding in application profiling. Specialised embedded hardware counters deploy sampling mechanisms to provide rich insight into hardware events. PEBS [5] is the implementation of such a mechanism in the majority of modern processors.

These two approaches essentially have the same ultimate scope of highlighting the application behaviour regarding its memory interaction, in order to arrive at an optimal, performance-wise distribution of memory objects. Nevertheless, the former approach accounts for each and every memory access, while the second one performs a sampling of the triggered events. It would be of great interest to find out whether, by forcing software to imitate the hardware's methodology, i.e. sampling, we are capable of achieving similar final results. This was the original target of the project.

BACKGROUND

Memory architecture describes the methods used to implement computer data storage in the best possible manner. "Best" describes a combination of the fastest, most reliable, most durable, and least expensive way to store and retrieve information [6]. Traditionally, and from a high-level point of view, a memory system is a mere cache and a RAM memory.

Caches are very specialised and hence complicated pieces of hardware, placed in a hierarchical way very close to the processing unit. Divided into layers (usually 3 in modern machines), every cache element presents different characteristics regarding size and data retrieval latency. RAM, being an equally complex piece of hardware, presents a significantly bigger capacity, which qualifies it as the ideal main storage unit of the system. On the downside, it is characterised by orders of magnitude longer access times, which makes it quite inefficient, yet necessary, to access.
Figure 1: Memory hierarchy along with typical latency times.
Figure 2: Valgrind logo.
Figure 3: Pseudo-code for enabling Valgrind instrumentation and detail collection.

Both storage elements interact with each other in the following way: when a memory access instruction is issued, cache memory is the first subsystem to be accessed. If the respective instruction can be completed in cache (that is, the referred value resides there) then we have a "cache hit"; otherwise a "cache miss" will be signalled and the effort of completing the instruction shall move on to the next cache level. In case of a "last-level cache miss" the main memory has to be accessed in order to fulfil the need for data. These exact accesses pose the greatest performance limitation, especially when it comes to memory-bound applications. Keeping that in mind, we considered these exact accesses as the key factor for optimising our workflow.

METHODOLOGY

In order to monitor memory interactions we opted for Valgrind, as it offers the additional possibility of simulating cache behaviour. That is, when we run our application under Valgrind, all data accesses can be monitored and accounted for. Instrumenting an application, however, will make it from four to twenty times slower. In order to alleviate this, we decided to restrict the instrumentation only to the regions of code that are of interest. Valgrind provides an adequate interface that eases the aforementioned scope: in Figure 3 you can see the macros that, by framing the relevant lines of code, enable or disable the instrumentation and information collection.

Setting the aforementioned comparison of the software and hardware approaches as our ultimate scope, we enhanced EVOP with the option of performing sampled data collection. In detail, we extended the Valgrind source code by integrating a global counter which increments on every memory access. While the counter's value is below a predetermined threshold, no access is accounted for. On the contrary, the memory access that forces counter and threshold to be equal will be monitored and therefore accumulated into the final access number. The threshold is plainly defined when EVOP is called, using the flag --sampling-period=

We first ran the benchmarks without sampling in order to get the total number of memory objects. The latter were processed by a BSC-developed tool in order to be optimally distributed to the available memory subsystems. The distribution takes into consideration the last-level cache misses of each object as well as the number of loads that refer to each one of them, and sets as its goal the minimisation of the total CPU stall cycles. The total "saved" cycles are calculated along with the object distribution, and thus the final speedup can be determined. This speedup is the maximum achievable given the application and the memory subsystem mosaic.

What follows is a trial-and-error experimentation process of obtaining results using various sampling periods to extract the referenced objects. It can be intuitively assumed that the longer the sampling period is, the fewer the total recorded memory accesses will be. Given the fact that each access is related to one memory object, the fewer the recorded accesses are, the greater the possibility that not all existing memory objects are discovered. The latter presumably results in an incomplete set of discovered objects.
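The counter-based sampling mechanism described above can be sketched in a few lines. The following is an illustrative Python toy, not EVOP's actual (C) implementation inside Valgrind; all names are made up for the example:

```python
class SamplingProfiler:
    """Records only every Nth memory access, mimicking the global
    counter added to Valgrind: the counter increments on every access,
    and the access that makes it reach the threshold is recorded."""

    def __init__(self, sampling_period):
        self.period = sampling_period
        self.counter = 0
        self.samples = {}          # object name -> sampled access count

    def on_access(self, obj):
        self.counter += 1
        if self.counter == self.period:   # threshold reached: record
            self.counter = 0
            self.samples[obj] = self.samples.get(obj, 0) + 1

profiler = SamplingProfiler(sampling_period=4)
for _ in range(100):               # 100 accesses to one object
    profiler.on_access("A")
print(profiler.samples)            # {'A': 25} — one sample per 4 accesses
```

With a period of 1 this degenerates to full instrumentation (every access counted), which is exactly the non-sampled baseline run mentioned above.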
Figure 4: Sampling at a specific low rate results in omitting memory objects; a higher sampling frequency is needed.
Figure 5: In rather iterative access behaviour, low-frequency sampling is enough for complete memory object discovery.
Figure 6: Only low periods are allowed in order to achieve an optimal speedup.
Figure 7: Bigger sampling periods can result in equally high speedup, as memory objects keep being discovered.

MiniMD and HPCCG are mini-applications intended as benchmarks for assessing the performance of certain production applications [7]. Hence, they are representative of the behaviour of more complex applications.

MiniMD, a molecular dynamics simulator, has a rather sequential access pattern over the objects that it refers to. As a result, the sampling period cannot be of the same order of magnitude as the total access number (discovered by the non-sampled execution). In fact, in order to have an optimal speedup, the sampling period has to be relatively small. On the contrary, HPCCG, which offers an iterative computational implementation of the conjugate gradient method on an arbitrary number of processors, entails its main computation in a "for loop". Thus, the same memory objects are referenced every time the loop iterates, enabling the use of greater sampling periods. Indeed, we were able to use intervals of the same order of magnitude as the overall access number of some objects and still get an accepted speedup.

In the first two figures we present a qualitative representation of the memory access behaviour of each application, as it was understood by us. Sparse sampling, in the first case, is not enough to account for every "important" object, while in the second case, due to the different access pattern, this is allowed. In the following two graphs we pictured the exact correlation between the final speedup and the sampling period used to get the objects. While in the first case there is a strict threshold after which the final results are strongly damaged, in the second case this threshold can be generalised to a larger neighbourhood of periods.

FUTURE RESEARCH

When deploying the hardware profiling approach aiming at performance optimisation, we are restricted by the imposed sampling that has to be performed. So far we have obtained some concrete results regarding the effect that different sampling rates have on the discovered memory objects. Given that the latter are responsible for the final performance speedup, as well as that hardware profiling mechanisms rely solely on sampling, we can trivially assume that the same, or at least similar, results should be obtained when profiling using the hardware counters. We have provided the chance for future researchers (and why not future SoHPC participants) to experiment using the sampling periods from our trials, in order to determine the most efficient way to gain insight into an application's memory behaviour and subsequently boost its performance by exclusively focusing on the optimal distribution of its memory objects.

REFERENCES

1 H. Servat, A. J. Peña, G. Llort, E. Mercadal, H. C. Hoppe, J. Labarta, "Automating the Application Data Placement in Hybrid Memory Systems", 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA, September 5-8, 2017.
2 A. J. Peña, P. Balaji, "A Framework for Tracking Memory Accesses in Scientific Applications", 2014 43rd International Conference on Parallel Processing Workshops, Minneapolis, MN, USA, September 9-12, 2014.
3 "Valgrind", last version: valgrind-3.15.0. Available: http://www.valgrind.org/
4 "Intel VTune Amplifier". Available: https://software.intel.com/en-us/vtune
5 Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual, September 2016, vol. 3B: System Programming Guide, Part 2, ch. 18.4.4.
6 https://en.wikipedia.org/wiki/Memory_architecture
7 M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, H. K. Thornquist, and R. W. Numrich, "Improving performance via mini-applications", Sandia National Laboratories, Tech. Rep., 2009. http://www.sandia.gov/maherou/docs/MantevoOverview.pdf

PRACE SoHPC Project Title
Analysing effects of profiling in heterogeneous memory data placement

PRACE SoHPC Site
Barcelona Supercomputing Center (BSC), Spain (ES)

PRACE SoHPC Authors
Dimitrios Voulgaris [University of Thessaly] — Greece (GR)

PRACE SoHPC Mentor
Marc Jorda [BSC] — Spain (ES)

PRACE SoHPC Contact
Dimitrios Voulgaris, University of Thessaly
E-mail: [email protected]

PRACE SoHPC Acknowledgement
Special thanks to my BSC coordinators Antonio J. Peña and Marc Jorda, as they proved an unlimited source of help. I am also grateful to the SoHPC coordinator, Mr. Leon Kos, for his constant presence and guidance when I needed him. Finally, my deepest gratitude to my colleague and friend Perry Gibson for his humorous comments and advice that made this experience unique.

PRACE SoHPC Project ID
1901
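The qualitative effect illustrated in Figures 4–7 can be reproduced with a toy trace generator. The sketch below is illustrative Python, not the actual EVOP data: with the same sampling period, a bursty "sequential" pattern (each object touched in one burst, as in miniMD) loses objects, while an iterative pattern (a few objects revisited every loop iteration, as in HPCCG) keeps discovering them.

```python
def discovered(trace, period):
    """Sample every `period`-th access and return the set of objects hit."""
    return {obj for i, obj in enumerate(trace, start=1) if i % period == 0}

# "Sequential" pattern (miniMD-like): 20 objects, each in one burst of 50.
sequential = [f"obj{i}" for i in range(20) for _ in range(50)]
# "Iterative" pattern (HPCCG-like): 5 objects revisited on every iteration.
iterative = ["A", "B", "C", "D", "E"] * 200

seq_found = discovered(sequential, period=101)   # misses over half the objects
it_found = discovered(iterative, period=101)     # still finds all 5 objects
print(len(seq_found) / 20, len(it_found) / 5)    # 0.45 1.0
```

The same large period that covers every object of the iterative trace omits most objects of the sequential one, which is why a relatively small period was needed for miniMD but not for HPCCG.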
Exploring Reproducible Workflows in Automated Heterogeneous Memory Data Distribution Literature Results

"It worked on my machine... 5 years ago"
Perry Gibson
Leveraging emerging heterogeneous memory subsystems for memory bound HPC ap- plications requires an understanding of memory access patterns of the target workload. We revisit prior work from 2014, which simulated cache behaviour of HPC test work- loads, and produced near-optimal data placement strategies from high-fidelity runtime data. We find that the results of the original simulations cannot be precisely reproduced. We document our approach of attempting to reproduce the results, and put forward concrete solutions to improve the reproducibility of future HPC research endeavours.
HPC applications are the de facto expensive workload. At scale, even sub-percentage acceleration can represent massive cost savings. Thus, opportunities across the stack must be exploited as much as possible. Often the biggest bottleneck in these applications is data access times. If data is stored close to the processor when needed, such as at the lowest level of cache, then we incur very little overhead. However, this isn't feasible in all cases, given that typical sizes of L1 caches are on the order of a dozen or so KiB, and higher levels of cache are on the order of a few dozen MiB. Since even modest HPC applications use GiBs of data, or orders of magnitude more, we are forced to store it in main memory, or on disk.

A promising mitigation for this problem is moving to a heterogeneous memory model. Instead of the classic hierarchical memory model of L1 cache → L2 cache → ... → LL cache → Main Memory → Disk, we have a choice to place individual data in a number of memory subsystems with different properties. To prepare current and future HPC applications to exploit the potential benefits of this change, it is necessary to devise novel approaches for placing data objects in an appropriate memory subsystem, giving good automatic placement with minimal programmer involvement, while providing fine-grained control if needed.

To perform this profiling, in 2015 researchers at BSC heavily adapted the Valgrind instrumentation framework [1], so that they could measure the access patterns of individual data objects in a running program. The fork, developed for a suite of papers from the BSC, is described in more detail in [2]. The main contribution was adding the ability to trace loads and stores to individual variables, and in particular dynamic objects which are created by a number of lines of code and are thus less easy to detect directly (for example, arrays of pointers).

The results of using this system, named EVOP (Extended Valgrind for Object-differentiated Profiling), on two HPC mini-applications [4]* were presented in [5]. The paper demonstrated the scope for acceleration, which is still true today. However, the exact experimental results of the access patterns for the candidate applications and the resulting data placement proposal differ.

* Interested readers can learn more about mini-applications in the SoHPC podcast episode featuring Dr. Mike Heroux.
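The object-differentiated tracing idea — attributing each load or store to the data object whose address range contains it — can be sketched as follows. This is a hypothetical Python toy with made-up addresses, not EVOP's actual mechanism:

```python
# Hypothetical object table: name -> (start_address, size_in_bytes).
# EVOP builds such a table from allocations it intercepts at runtime.
objects = {
    "positions": (0x1000, 0x400),
    "forces":    (0x2000, 0x400),
}

def owner(addr):
    """Return the name of the object whose address range contains addr."""
    for name, (start, size) in objects.items():
        if start <= addr < start + size:
            return name
    return "unknown"

# Attribute each address of a (toy) load/store trace to its object.
counts = {}
trace = [0x1000, 0x1008, 0x2000, 0x3000]
for addr in trace:
    counts[owner(addr)] = counts.get(owner(addr), 0) + 1
print(counts)   # {'positions': 2, 'forces': 1, 'unknown': 1}
```

Per-object counts like these are what the data placement proposals of the original paper were derived from.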
Methods

We began our work by getting access to the original EVOP codebase and the candidate workloads. Initially, we used the latest version of the code, committed two years after the original paper. We sought to first understand its functionality and configuration, then began to run full experiments. With the latest version of the code, we found that for our first workload, miniMD, the number of executed instructions was 17.9% less than in the original results, and the last-level cache miss rate was 64.1% higher. Care was taken to ensure that the cache configuration was identical to what was described in the original paper.

Our next step was to eliminate the possibility that some version change in EVOP had changed its instrumentation behaviour. Using version-control commit timestamps, we examined the differences between candidate code versions that may have been used for the original results. However, none of these versions could be successfully compiled. For commits of interest, source code files used inconsistent API calls, and we hypothesise that during development not all of the working files were tracked correctly. We patched a number of these code versions, and found identical results. This was promising for the consistency of the tool across versions, but did not explain the difference with the original paper.

We hypothesised that the compiler version used to produce the workload binaries, coupled with the versions of the standard library, could be an influence. Thus, we used a containerisation workflow (via the HPC-first Singularity system [6]) to systematically compare candidate environments. Many of these older tool versions no longer existed in package repositories or archives, and required manual compilation. The effect of the compiler version on the number of instructions and data access patterns was found to be minimal. Variation of the instruction set extensions used by the compiler (such as dis/enabling vector instructions) changed the number of instructions, but not the data access patterns.

Consulting with the original authors, we reconstructed the original OS environment using an archived distribution. This tested for an unknown factor of influence. However, again the results remained identical.

With a reproduction of the original work appearing unlikely, we still consider the results we collected to be sensible. Thus, a useful extension to the work would be to compare the results of a simulated run against the results on actual hardware. We followed the approach of [3], which used the Extrae framework developed at BSC. The use of Intel PEBS sampling would allow a similar type of object tracing information to be produced. The advantage over simulation is reduced overhead, at the cost of lower-fidelity data. Collaborating with a colleague investigating the use of sampling in EVOP would enable a fair comparison. We had access to a configuration file believed to have been used to gather the results in [3]. However, in the time remaining we were not able to collect the instrumented data of interest, namely object addresses and their access patterns.

Results and Conclusion

We have ruled out with some degree of certainty the influence of compiler and standard library versions. Because of the inconsistent version-control history, we cannot rule out that a different version of the codebase was used to perform instrumentation. However, our patched versions of EVOP generated the same results.

As one would expect, the instruction set architecture used has an influence on the number of instructions that a given program execution takes. However, it does not change the simulated cache behaviour. We followed the description of the cache configuration and the input parameters for our test workloads. However, the exhibited behaviour was different. The simulator's behaviour should be agnostic to the underlying hardware it runs on, and we found this was true for small-scale experiments.

In conclusion, the replication of results after years of tools lying dormant is a difficult task. The configuration and sequence of steps used to produce results are easily lost, and can be non-trivial to recover at a later date.

Our recommendation is that during the development stages of research tools, continuous integration systems are used to ensure that the versioned project state is as expected. We have created a continuous integration setup to be used in future versions of EVOP. When the results for publications are being produced, containerisation should be used to enshrine the configuration and tool versions used to generate the needed computations. These details are generally not included in publications, as they are superfluous to the conceptual contribution of a work. However, it is clear that once lost to time they cannot easily be recovered, and for some systems they are necessary for a work's conclusions. This was not the case for EVOP, as our results using modern toolchains still trace objects and demonstrate theoretical speedups for candidate object distributions; they simply differ numerically from the original results. Ideally, containerised workflows should be published, but they can also be kept in internal archives if aspects of their functioning are sensitive.

The pitfall of relying on external package repositories for the creation of containerised workflows should be heeded by researchers whose results are possibly influenced by such factors. Including the packages in the project archives should be considered a preferred practice.

References

1 Nethercote, Nicholas and Seward, Julian. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation.
2 Peña, Antonio J. and Balaji, Pavan. A Framework for Tracking Memory Accesses in Scientific Applications.
3 Servat, H., Peña, A. J., Llort, G., Mercadal, E., Hoppe, H., and Labarta, J. (2017). Automating the Application Data Placement in Hybrid Memory Systems.
4 Crozier, Paul Stewart, Thornquist, Heidi K., Numrich, Robert W., Williams, Alan B., Edwards, Harold Carter, Keiter, Eric Richard, Rajan, Mahesh, Willenbring, James M., Doerfler, Douglas W., and Heroux, Michael Allen. Improving Performance via Mini-Applications.
5 Peña, A. J. and Balaji, P. (2014). Toward the Efficient Use of Multiple Explicitly Managed Memory Subsystems.
6 Kurtzer, Gregory M., Sochat, Vanessa, and Bauer, Michael W. Singularity: Scientific Containers for Mobility of Compute.

PRACE SoHPC Project Title
Reproducing Automated Heterogeneous Memory Data Distribution Literature Results and Beyond

PRACE SoHPC Site
BSC, Catalonia

PRACE SoHPC Authors
Perry Gibson

PRACE SoHPC Mentor
Peña, A. J.

PRACE SoHPC Acknowledgement
Many thanks for the support and guidance from my mentor, the researchers and staff of the BSC, and the organisers of the SoHPC at PRACE for getting things working, and to my cohort for bringing interesting stories to the weekly meetings. Special thanks to Dimitris Voulgaris for working closely with me throughout the project.

PRACE SoHPC Project ID
1902
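A Singularity definition file of the kind used in such a containerisation workflow — pinning an era-appropriate base OS and compiler so a workload can be rebuilt reproducibly — might look like the sketch below. The base image and package versions here are illustrative placeholders, not the actual environments compared in this work:

```
Bootstrap: docker
From: ubuntu:14.04        # placeholder: pin the era-appropriate base OS

%post
    # Install a pinned compiler toolchain inside the image.
    apt-get update
    apt-get install -y build-essential gcc-4.8 g++-4.8 make

%environment
    export CC=gcc-4.8
    export CXX=g++-4.8
```

Archiving the resulting image (or the packages it installs) alongside the project, rather than relying on external repositories staying available, is the preferred practice argued for above.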
What can we gain if we use HPC tools for Machine Learning instead of the JVM-based technologies?

High Performance Machine Learning
Thizirie Ould Amer
Machine Learning (ML) algorithms are designed to process and learn from huge amounts of data and their correlations to adapt and adjust the results. This large quantity of data needs to be analyzed and classified according to various criteria, which means that our ML algorithms have to evaluate these criteria and select the best ones. As you can imagine, this requires a lot of time and energy. So how can we increase the efficiency of ML algorithm runs?
HPC is the solution for many kinds of problems! Combined with other features, HPC tools can have a big impact on the performance of our applications (programs). Big data processing is currently dominated by JVM-based technologies such as Hadoop MapReduce or Apache Spark. These popular tools are indeed fairly efficient and, in particular, they are easy for developers to use. Our goal is to evaluate the traditional HPC tools such as MPI, OpenMP or GASPI for ML / big data processing, and to see for ourselves if they are at least as good as the aforementioned tools, or maybe even better!

In order to successfully carry out this project within a limited time frame, we chose one popular ML algorithm – Gradient Boosting Decision Trees. This algorithm is very popular in Kaggle competitions. Indeed, many of the biggest winners of these contests include Gradient Boosting algorithms in their winning model ensembles, amongst a variety of other models. The most commonly used library is XGBoost, and it aims to provide a Scalable, Portable and Distributed Gradient Boosting (GBM, GBRT, GBDT) Library.1 The main focus of this project is then the GBDT, but the other methods are as popular and successful as this one.

Decision Trees

The idea of the Decision Trees algorithm is to find a set of criteria that will help us "predict" (classify or inter/extrapolate) the variables we pick. In the example in Table 1, let's see if we can predict whether an algorithm can be parallelized or not.

Table 1: Example 1

IndepParts | LoopsDep | Parallel?
1          | 0        | 1
0          | 1        | 0
0          | 0        | 0
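The split selection on Table 1 can be sketched in a few lines. The code below is an illustrative Python toy (not the project's MPI/GASPI implementation): it computes the impurity of each candidate feature on the Table 1 data and picks the purest split.

```python
def impurity(labels):
    """Misclassification rate of a group: 0 = pure, up to 1 = all wrong."""
    if not labels:
        return 0.0
    majority = max(labels.count(0), labels.count(1))
    return 1 - majority / len(labels)

def split_impurity(rows, labels, feature):
    """Weighted impurity of the two groups created by a binary feature."""
    left = [l for r, l in zip(rows, labels) if r[feature] == 0]
    right = [l for r, l in zip(rows, labels) if r[feature] == 1]
    n = len(labels)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Table 1: (IndepParts, LoopsDep) -> Parallel?
rows = [{"IndepParts": 1, "LoopsDep": 0},
        {"IndepParts": 0, "LoopsDep": 1},
        {"IndepParts": 0, "LoopsDep": 0}]
labels = [1, 0, 0]

best = min(["IndepParts", "LoopsDep"],
           key=lambda f: split_impurity(rows, labels, f))
print(best)   # IndepParts — it splits Table 1 with zero impurity
```

On this tiny table, splitting on IndepParts perfectly separates the parallelizable code from the rest, while splitting on LoopsDep still misclassifies one row.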
Since the goal is to predict the last variable (Parallel?), our decision tree will have to consider each feature (column) and see which one helps us the most to distinguish whether a code can be parallelized or not. Once such a feature is identified, we split the codes according to that variable. Selecting any feature can naturally lead to misclassification of elements; otherwise the feature on its own would correlate perfectly with the result. This misclassification is also referred to as the impurity – an error rate parameter ranging from 0 to 1: 0 for pure (no misclassification) and 1 for a group where every element is misclassified. In each step of the construction of the decision tree, we pick the variable which leads to the split with the lowest impurity. Decision trees on their own tend to suffer from overfitting, so we must use additional tricks to make them a good ML method.

Gradient Boosted Decision Trees

Gradient boosting in general is an approach of creating a predictive model in the form of an ensemble of weak prediction models. In Gradient Boosting Decision Trees, the weak prediction model is, unsurprisingly, the Decision Tree. By fitting a decision tree to the pseudo-residuals (i.e. errors) of another tree, we can significantly increase the accuracy of the data representation. Typically, limited-size decision trees (of depth 4 to 8) are iterated multiple (typically fewer than 100) times, creating (at most) 100 decision trees that have different decision criteria, "fixing" at each iteration the errors that the previous tree has made. The algorithm doesn't give more weight to any of the trees, but tree number i focuses on the individuals that didn't fit into their leaves/groups in tree number i-1 (i.e. it focuses on the individuals that have the most important pseudo-residual values).

How to parallelize this?

When it comes to the concept of a parallel version of a program, many aspects have to be taken into account, such as the dependency between various parts of the code, balancing the workload on the workers, etc. This means that we cannot parallelize the gradient boosted algorithm's loop directly, since each iteration requires the result obtained in the previous one. However, it is possible to create a parallel version of the decision tree building code. The algorithm for building Decision Trees can be parallelized in many ways. Each of the approaches can potentially be the most efficient one depending on the size of the data, the number of features, the number of parallel workers, the data distribution, etc. Let us discuss three viable ways of parallelizing the algorithm, and explain their advantages and drawbacks.

Sharing the tree node creation among the parallel processes at each level

The way we implemented the decision tree allows us to split the effort among the parallel tasks. This means that we would end up with the task distribution schematically depicted in Figure 1.

Figure 1: Parallelizing the node building at each level. Levels are in different colours.

Yet, we have a problem with this approach. We can imagine a case where we would have 50 "individuals" going to the left node, and 380 going to the right one. We will then expect that one processor will process the data of 50 individuals and the other one will process the data of 380. This is not a balanced distribution of the work, with some processors doing nothing while others may be drowning in work. Furthermore, the number of tree nodes that can be processed in parallel limits the maximum number of utilizable parallel processes... So we thought about another way.

Sharing the best split search in each node

In our implementation of the decision tree algorithm, there is a function that finds the value that splits the data into groups with minimum impurity. It iterates (for a fixed column – variable) through all the different values and calculates the impurity of each split. The output is the value that reaches the minimum impurity.

Figure 2: Sharing the best split search in each node, for each feature.

As can be seen in Figure 2, this is the part of the code that can be parallelized. So every time a node has to find a split of individuals into (two) groups, many processors will compute the best local splitting value, and we keep the minimum value from the parallel tasks. Then, the same calculations are repeated for the Right data on one side, and for the Left data on the other. In this case, a parallel process will do its job for a node and, when done, it can directly move to another task on another node. So, we got rid of the unbalanced workload, since (almost) all processes will constantly be given tasks to do.

Nevertheless, this also has a notable drawback. The cost in communication is not always worth the effort. Imagine the last tree level, where we would have only a few individuals in each node. The communication costs in both directions (the data has to be given to each process, and the output received at the end) will eventually slow down the global execution more than the parallelization of the workload speeds it up.

Parallelize the best split search on each level by features

The existing literature2 helped us to merge some features of the two aforementioned approaches to find one that reduces or eliminates their drawbacks. This time, the idea is to parallelize the function that finds the best split, but for each tree level. Each parallel process calculates the impurity for a particular variable across all nodes within the level of the tree. This method is expected to work better because:
9 1. The workload is now balanced ing and continuing the implementations because each parallel process evaluates and comparing them to the usual tools. the same amount of data for "its fea- Including OpenMP in the paralleliza- ture". As long as the number of fea- tion could also be a very interesting tures is the same (or greater, ideally approach and lead to a better perfor- much larger) than the number of par- mance. allel tasks, none of the processes will idle. Another side that could be considered as well is using the fault tolerant func- 2. The impact of the problem we tions of GPI-2 which would ensure a described in the second concept is re- better reliability for the application. duced, since we are working on the whole levels rather than on a single node. Each parallel task is loaded with Figure 4: Timing of communication and com- much larger (and equal) amounts of putation steps: (a) no overlap, (b) partial over- data, thus the communication overhead lap is less significant.
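As an illustration, the per-feature best-split search described above can be sketched in plain Python. This is a simplified sketch, not the project's actual code; the function and variable names are illustrative, and it uses the misclassification impurity described in the text.

```python
def impurity(labels):
    """Misclassification impurity of a group: 0 means pure (no misclassification)."""
    if not labels:
        return 0.0
    most_common = max(labels.count(c) for c in set(labels))
    return 1.0 - most_common / len(labels)

def best_split(values, labels):
    """For one feature (column), try every distinct value as a threshold and
    return the one whose left/right groups have the lowest weighted impurity."""
    best = (None, float("inf"))
    n = len(values)
    for threshold in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

In the parallel variants discussed above, each parallel task would run a search like `best_split` on its own feature (or its own node), and the minimum score across tasks is kept.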
The global concept is shown in Figure 3.

Figure 3: Parallelizing the best split search on each level by features.

How can this be improved?

There are several ways to improve the performance of our parallel algorithm. Let us focus on the communication, particularly its timing. While we cannot completely avoid losing some wall-clock time in sending and receiving data, i.e. communication among the parallel processes, we can rearrange the algorithm (and use proper libraries) to facilitate overlap of the computation and communication. As shown in Figure 4, in the first case (a) the processor waits until the communication is done to launch the calculations, whereas in the second case (b) it starts computation before the end of the communication process. The last case, displayed in Figure 5, shows different ways to facilitate total computation-communication overlap.

Figure 5: Timing of communication and computation steps: (c) complete overlap.

One of the libraries that allows asynchronous communication patterns, and thus overlap of computation and communication, is GPI-2 (http://www.gpi-site.com), an open source reference implementation of the GASPI (Global Address Space Programming Interface, http://www.gaspi.de) standard, providing an API for C and C++. It provides non-blocking one-sided and collective operations with a strong focus on fault tolerance. Successful implementations of parallel matrix multiplication, K-means and TeraSort algorithms are described in Pitoňák et al. (2019), "Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms".3

Conclusion

We have discussed some of the possible parallel versions of this algorithm and their efficiency. Unfortunately we did not have enough time to finalize the implementations and compare them with the JVM-based technologies and with XGBoost performance. Future work could focus on improving and continuing the implementations and comparing them to the usual tools. Including OpenMP in the parallelization could also be a very interesting approach and lead to better performance. Another aspect that could be considered is using the fault-tolerant functions of GPI-2, which would ensure better reliability for the application.

References

1 XGBoost GitHub repository: https://github.com/dmlc/xgboost
2 Parallel Gradient Boosting Decision Trees: http://zhanpengfang.github.io/418home.html
3 Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms: http://doi.org/10.5281/zenodo.2807938
4 Gradient Boosting decision trees: https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

PRACE SoHPC Project Title: High Performance Machine Learning
PRACE SoHPC Site: Computing Center of SAS, Slovakia
PRACE SoHPC Author: Thizirie Ould Amer, Sorbonne University, France
PRACE SoHPC Mentor: Michal Pitoňák, CCSAS, Slovakia
PRACE SoHPC Contact: Thizirie Ould Amer, Sorbonne University; LinkedIn: in/Thizirie; E-mail: [email protected]
PRACE SoHPC More Information: Author's blog; About PRACE Summer of HPC
PRACE SoHPC Acknowledgement: First things first, I would like to thank my mentors Michal Pitoňák and Marián Gall for their great help. Many thanks to my site coordinator, Lukáš Demovič, and the whole CCSAS team who made my time here extraordinarily amazing. I would also like to thank my HPC and Data Analysis teachers, Pierre Fortin and Fanny Villers, for their support.
PRACE SoHPC Project ID: 1903
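The timing effect behind Figures 4 and 5 can be demonstrated with a toy Python sketch in which "communication" and "computation" are just timed sleeps. This only illustrates why overlap pays off; GPI-2 achieves the overlap with non-blocking one-sided operations, not with threads, and the durations here are made up.

```python
import threading
import time

def communicate():
    time.sleep(0.2)   # stand-in for sending/receiving data

def compute():
    time.sleep(0.2)   # stand-in for the local calculations

# (a) no overlap: wait for the communication to finish, then compute
start = time.perf_counter()
communicate()
compute()
blocking = time.perf_counter() - start

# (b)/(c) overlap: start the communication, compute while it is in flight
start = time.perf_counter()
comm = threading.Thread(target=communicate)
comm.start()
compute()
comm.join()
overlapped = time.perf_counter() - start

print(f"no overlap: {blocking:.2f}s, overlapped: {overlapped:.2f}s")
```

With equal communication and computation times, the overlapped version takes roughly half the wall-clock time of the blocking one.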
Where can you find the electrons of a nanotube? Electron density computation for each orbital.
Electron density of Nanotubes
Irén Simkó
We exploit the helical symmetry of nanotubes in the electronic structure computation. The code was extended with electron density computation for each orbital. I used Message Passing Interface for parallelization and visualized the electron density isosurfaces.
Have you heard about (carbon) nanotubes? Probably you already have, but you will hear a lot more about them in the future. The extraordinary mechanical and electrical properties of nanotubes have a lot of promising applications in engineering and material science. For example, composite materials with carbon nanotubes have great mechanical strength. Nanotubes are candidates for nano-electronics as well, because they can behave like metals, semiconductors or insulators.

The topic of my project was simulating nanotubes with the SOLID98 quantum chemical program.1 The program – written in FORTRAN77 – computes the electronic structure of the nanotubes. The parallelization of the code was the task in the previous years of Summer of HPC, and my goal was to extend the code with a new feature: electron density computation and visualization. This shows us the distribution of electrons in space, to see the chemical bonds and the nodes of the orbitals.

Use symmetry when you can!

The computational cost of a quantum chemical simulation strongly increases with the size of the system. But how can we still compute large systems such as nanotubes or crystals? One solution is periodicity and symmetry, which means that we can build the whole large system by replicating a small building block (unit cell) following symmetry rules. Since the replicas are not independent, the size of the unit cell determines the cost of the computation.

Nanotubes have helical symmetry,2 which allows us to use very small unit cells. For carbon nanotubes it is enough to have only two atoms! Then, imagine a spiral that is winding on the surface of a cylinder, and replicate the unit cell following the path of the spiral. This way you can build the whole carbon nanotube. Using helical symmetry is a much better approach than considering only the translational symmetry along the axis of the nanotube.

Figure 1: Helical symmetry of the nanotube.

The electronic structure computation

According to the principles of quantum mechanics, the states of the electrons have well-defined discrete energies and wave functions (orbitals), which are computed by the program. To get the electronic orbitals and energies we have to solve the Schrödinger equation. But do not grab a pen and paper, because it has an analytical solution only for simple systems; in other cases we have to solve it numerically. In the simulation, the wave function is a linear combination of basis functions. The Schrödinger equation is translated to an eigenvalue equation of the so-called Fock matrix, which is solved iteratively. The diagonalization part of the program is parallelized with the Message Passing Interface (MPI). Thanks to some mathematical tricks the Fock matrix is block diagonal, each block labelled with a k value and diagonalized by a different process. As a result, we get orbitals and energies for each k block.

Electron density: The serial code

The electron density is the probability of finding the electron at a given point in space. We would like to visualize the individual electron orbitals, so we restrict the computation to one orbital at a time. Electron density computation is the last step of the program; we use the results of the last diagonalization. From the eigenvectors, we build the so-called density matrix (P), and contract it with the basis functions (χ). "Contraction" is a double sum over all basis functions of the system:

ρ(r) = Σ_{a,b} P_{ab} χ_a(r) χ_b(r)

First, we compute the contribution of the central unit cell to the electron density. Here in the sum we take one basis function from the central unit cell and the other one from the whole nanotube. Then we get the total electron density in the same manner as we built the nanotube from the unit cell, by replicating the contribution of the central unit cell. In the figure below you can see an example for the benzene molecule.

Parallelization using MPI

The next step was to parallelize the electron density computation. I decided to use the same technique as in the diagonalization of the Fock matrix. We use the Master-Slave MPI system, where the Master process (rank = 0) distributes the work among the Slave processes (rank = 1, 2, ..., N−1). First the Master process does some initialisation and broadcasts the data to the Slaves, then each process does the work assigned to it.

The electron density computation is done in two nested loops: one loop over the k values and one loop over the orbitals of a given k. In the first version of the code, each process computes the orbitals belonging to a different k value. We use a counter variable to distribute the work. The counter is set to 0 at the beginning of the loop. A process takes the upcoming value if its rank is equal to the actual counter value. If a process accepts a value, it increases the counter by 1, or resets it to 0 if it was N−1. This way, process 0 does k = 1, process 1 does k = 2, process N−1 does k = N, then process 0 does k = N+1, and so on.

Figure 3: MPI parallelization.

The problem with this implementation is that you cannot use more processors than the number of k values. So in the second version I moved the parallelization to the inner loop over the orbitals. If we have M orbitals in each block, the first M processes start working on the k = 1 orbitals, and the next process gets the first orbital of k = 2. This is a much better solution, because we can utilize more processors than in the first case. On the other hand, the diagonalization of the Fock matrix is still parallelized over the k-loop, so we have idle processors if we use more than the number of k values. The relative time of the diagonalization and the electron density computation determines whether it is worth using the k-loop or the orbital-loop parallelization.
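The counter-based round-robin distribution described above can be simulated in a few lines of Python. This is a sketch of the scheme only – the real code runs inside the FORTRAN/MPI k-loop – and the function name is illustrative.

```python
def assign_work(num_procs, num_k):
    """Simulate the counter scheme: each process takes the next k value
    when the shared counter equals its rank, then the counter advances,
    wrapping back to 0 after rank num_procs - 1."""
    counter = 0
    assignment = {rank: [] for rank in range(num_procs)}
    for k in range(1, num_k + 1):
        assignment[counter].append(k)        # process `counter` accepts this k
        counter = (counter + 1) % num_procs  # increment, reset after N - 1
    return assignment
```

For example, with 3 processes and 7 k values, process 0 handles k = 1, 4, 7, matching "process 0 does k = 1, then k = N+1, and so on".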
”A picture is worth a thousand words”
Figure 2: Electron density of one unit cell and of the whole benzene molecule.

One goal of the project was to actually see where we can find the electrons, but visualization turned out to be more challenging than I expected. A good scientific figure should capture the essence of the topic of the research as well as catch the attention of the reader.
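The contraction formula from the serial-code section, ρ(r) = Σ_{a,b} P_{ab} χ_a(r) χ_b(r), can be sketched with NumPy on a grid of points. This is an illustrative sketch with made-up Gaussian-like basis functions and a made-up density matrix, not the SOLID98 code.

```python
import numpy as np

def electron_density(P, chi):
    """rho(r_g) = sum_{a,b} P[a, b] * chi[a, g] * chi[b, g] for every grid point g.

    P   : (nbasis, nbasis) density matrix
    chi : (nbasis, npoints) basis function values on the grid
    """
    # einsum contracts both basis indices, leaving the grid index
    return np.einsum("ab,ag,bg->g", P, chi, chi)

# toy example: two basis functions on a 1D grid (hypothetical values)
r = np.linspace(-3.0, 3.0, 61)
chi = np.vstack([np.exp(-r**2), r * np.exp(-r**2)])  # s-like and p-like functions
P = np.array([[2.0, 0.0],
              [0.0, 1.0]])  # occupation-weighted toy density matrix
rho = electron_density(P, chi)
```

The real code evaluates this sum for one orbital at a time and replicates the central-cell contribution over the symmetry images to get the density of the whole system.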
Figure 4: Electron density visualization examples
The result of the computation is the electron density at a large number of grid points. If the grid points are on a plane, I plotted the results with Wolfram Mathematica using ListPlot3D and ListDensityPlot. On the figures above you can see an example for a benzene orbital, with the electron density calculated in the plane of the molecule. I visualized the electron density in space by plotting the isosurface, which is a surface where the electron density equals a given value. I used the Visual Molecular Dynamics program to make isosurface plots after transforming the output to the Gaussian cube file format.

Results & discussion

I tested the electron density computation and visualization for the benzene molecule and a small carbon nanotube. I tried to find the physical meaning of each electron density plot. For example, there are bonding benzene orbitals where the electron density is high between the atoms, and there are antibonding orbitals with a nodal surface, where the electron density is zero. In the case of some nanotube orbitals the electron density is localised inside the tube, while in others you can see the bonds between the carbon atoms.

Figure 5: Speed-up of the parallel program.

As for the parallelization, I made a test for a nanotube that has 32 k values. In the figure above, you can see the speed-up for the two code versions. The maximal number of cores was 32 for the k-loop version, and 128 for the orbital-loop version. Up to about 64 cores the speed-up is ideal, but after that the program will not get much faster if we increase the number of cores.

References

1 J. Comp. Chem., Vol. 20, No. 2, 253-261 (1999)
2 J. Phys. Chem. B 2005, 109, 52-65

PRACE SoHPC Project Title: Electronic structure of nanotubes by utilizing the helical symmetry properties: The code optimization
PRACE SoHPC Site: Computing Centre of the Slovak Academy of Sciences, Slovakia
PRACE SoHPC Author: Irén Simkó, Eötvös Loránd University, Hungary
PRACE SoHPC Mentor: Prof. Dr. Jozef Noga, DrSc., CCSAS, Slovakia
PRACE SoHPC Contact: Irén Simkó, E-mail: [email protected]
PRACE SoHPC Software applied: SOLID98, Visual Molecular Dynamics, Wolfram Mathematica, Blender
PRACE SoHPC Acknowledgement: I am grateful to my mentor, Jozef Noga, my site co-ordinator, Lukáš Demovič, and all people at CCSAS for their support and the wonderful summer. Calculations were performed in the Computing Centre of the Slovak Academy of Sciences using the supercomputing infrastructure acquired in projects ITMS 26230120002 and 26210120002 (Slovak infrastructure for high-performance computing) supported by the Research & Development Operational Programme funded by the ERDF.
PRACE SoHPC Project ID: 1904
Live Visualisation of Simulation Result using Paraview Catalyst
In Situ / Web Visualisation of CFD Data using OpenFOAM
Li Juan Chan

Paraview Catalyst has been proposed to overcome the limitations caused by the I/O bottleneck. In this research, the scalability of OpenFOAM with coprocessing has been investigated, and it has been shown to be very scalable.
Over the years, the computational power of processors has developed at a tremendously fast pace. This has made large-scale computing more affordable. The development of parallel computing software is advancing equally as fast as the processor. However, there are subsystems of a high-performance computer that are not developed at the same pace as the processor. One of them is the I/O bandwidth. I/O is developed at such a slow pace that it is now becoming the bottleneck for exascale computing. This project hoped to address this issue and try to get around the bottleneck by using in situ visualisation.

Traditionally, the simulation process consists of pre-processing, processing and post-processing. However, this process is getting more and more expensive due to the I/O bottleneck. Writing and reading data have become very slow when compared to the processing speed. In situ visualisation was designed to solve this problem by moving some of the post-processing tasks in line with the simulation code.

Paraview has come out with an in situ visualisation library called Catalyst. Catalyst enables live visualisation of a simulation while the simulation is still running. There are a few benefits of using Paraview Catalyst. One of them is that the live visualisation feature enables researchers to examine whether the pre-processing step is set up correctly. If the boundary conditions are found to be incorrect, the simulation can be stopped immediately before investing more time and resources into an incorrect simulation. Furthermore, with Paraview Catalyst, the frequency of data writing is significantly reduced, as only the final result needs to be saved. The intermediate result can be viewed instantly without having to be saved for post-processing.

Therefore, one may wonder whether Paraview Catalyst will increase the demand for memory due to live visualisation. The answer is yes and no. The live visualisation definitely requires some memory to run. However, the demand for memory is not significant, because Paraview has the ability to extract a small amount of important data from the result instead of saving the full datasets. The features that can be extracted include slice, clip, glyphs, streamline, etc.

Aim and Motivation

In the field of high-performance computing, the scalability of a software is very important. This is because a supercomputer is nothing but a bunch of computers combined together to perform a task. Therefore, researchers may want to know how much benefit can be obtained by running a software with additional resources. Since Paraview Catalyst is so useful and may be widely used in the future, I, as a researcher, would like to find out how scalable Paraview Catalyst is, and this basically is the aim of this project.

How to use Paraview Catalyst?

After introducing Paraview Catalyst, I would like to introduce another software that is used together with it, which is OpenFOAM. OpenFOAM is an open-source software for computational fluid dynamics (CFD) and it is also highly scalable. Therefore, the scalability of OpenFOAM with and without co-processing will be determined and compared.

The model I was working on is shown in Figure 1. It is a wind tunnel construction with two NACA0012 airfoils, rotating with respect to hinges. Figure 2 is the zoomed-in image of the model. As shown in the figure, structured mesh is used upstream and downstream of the model.

Figure 1: Model
Figure 2: Mesh

The mesh in the middle is unstructured due to the rotation of the airfoils. There are approximately 10 million cells in the mesh. A compressible solver named rhoPimpleFoam is used for all simulations.

To run OpenFOAM and Paraview Catalyst at the same time, a software called Remote Connection Manager (RCM) is used. It allows the users to connect to the compute nodes of the supercomputer. One RCM session is used to run Paraview. When Paraview is loaded, Catalyst can be connected from the menu bar of Paraview. Meanwhile, another RCM session is opened to run OpenFOAM. However, an OpenFOAM plugin, co-developed by CINECA with ESI-OpenCFD and Kitware, should be loaded before running OpenFOAM. Once loaded, the simulation can be run and the intermediate result of the simulation can be visualised concurrently, as shown in Figure 3.

Figure 3: Live Visualisation

In Figure 3, the feature displayed is a slice. To extract other types of features, a different pipeline is needed. A pipeline is a Python script that specifies the feature to be extracted and the parameters of the feature. The pipeline can be obtained from the Paraview GUI. The process is repeated with different numbers of cores and different pipelines. In this study, several pipelines, for example slice, clip, glyphs, streamline and region, are produced, as shown in Figures 4, 5, 6, 7 and 8.

Figure 4: Slice
Figure 5: Clip
Figure 6: Glyphs
Figure 7: Streamline
Figure 8: Region

What is the outcome?

The three graphs on the next page summarise the results of this research: a scaling graph, a graph of the execution time of OpenFOAM with and without Catalyst, and a graph of the execution time of Paraview Catalyst.

First of all, what is speedup? Speedup is a variable that measures scalability. The speedup on n cores is defined as the execution time on one core divided by the execution time on n cores, where n is the number of cores to be measured. Basically, it measures the relative performance of one core and n cores processing the same problem. In the scaling graph, one may notice that the speedup of OpenFOAM with coprocessing is very similar regardless of the type of pipeline. Additionally, the speedup of OpenFOAM with and without coprocessing is very similar up until 108 cores.
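The speedup definition above is simple enough to compute directly. A small sketch, with made-up timing numbers purely for illustration (they are not the measured results of this study):

```python
def speedup(t1, tn):
    """Speedup on n cores: execution time on 1 core / execution time on n cores."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Parallel efficiency: speedup divided by the number of cores (1.0 = ideal)."""
    return speedup(t1, tn) / n

# hypothetical wall-clock times (seconds) for increasing core counts
timings = {1: 1080.0, 27: 42.0, 108: 11.5, 216: 7.8}
for n, tn in timings.items():
    print(n, round(speedup(timings[1], tn), 1), round(efficiency(timings[1], tn, n), 2))
```

Plotting speedup against the core count, as in the scaling graph, makes the divergence from the ideal line visible at a glance.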
Beyond that point, they start to diverge, with OpenFOAM without coprocessing being more scalable than with coprocessing. The degree of divergence increases as the number of cores increases. We also noticed that the scalability of OpenFOAM, both with and without coprocessing, starts to flatten out beyond 216 cores. The sudden reduction of scalability is due to communication overhead. Therefore, if good scaling beyond 216 cores is wanted, the size of the mesh needs to be increased in order to have a good balance between computation and communication.

Similar to the scaling graph, the execution times of OpenFOAM with and without coprocessing are very similar. However, the execution time without coprocessing is slightly lower at large numbers of processor cores, and the difference increases as the number of cores increases. In contrast, the execution time of Paraview Catalyst does not show as clear a pattern as the two previous graphs. However, all pipelines exhibit a trend in which the execution time decreases up until 108 processor cores. Beyond that point, the execution time increases. This trend agrees with the scaling graph.

Compilation of the graphs of result

In a nutshell

OpenFOAM with Paraview Catalyst is very scalable, with only slightly inferior scaling compared to pure OpenFOAM. However, the benefits that can be obtained from Paraview Catalyst definitely outweigh the slight reduction of scalability. For anyone who is interested in this project, the details can be found in the GitHub repository.1

References

1 https://github.com/ljchan1/SoHPC_19

PRACE SoHPC Project Title: In Situ / Web Visualisation of CFD Data Using OpenFOAM
PRACE SoHPC Site: CINECA, Italy
PRACE SoHPC Author: Li Juan Chan, University of Manchester, UK
PRACE SoHPC Mentors: Prof. Federico Piscaglia, Dept. of Aerospace Sci. and Technology, Politecnico di Milano, Italy; Simone Bnà, CINECA, Italy
PRACE SoHPC Contact: Li Juan Chan, University of Manchester; Phone: +60 125271362; E-mail: [email protected]
PRACE SoHPC Software applied: OpenFOAM 1806, Paraview 5.5.1, Catalyst, Blender 2.79b
PRACE SoHPC More Information: www.paraview.org, www.openfoam.org
PRACE SoHPC Acknowledgement: I would like to express my utmost gratitude to my site coordinator, Simone Bnà, for providing guidance and advice throughout the project. Without his supervision, this project would never have reached as far as it has. Furthermore, I would like to thank PRACE for sponsoring me to work at CINECA, and Massimiliano Guarrasi and Leon Kos for arranging my allocation in Bologna. Last but not least, I would like to thank Prof. Federico Piscaglia and Federico Ghioldi for providing me the model.
PRACE SoHPC Project ID: 1905
Building explainable model for understanding HPC faults
Explaining the faults of HPC systems
Martin Molan
This work provides an initial investigation into possibilities for machine learning on log data collected on HPC systems at the CINECA supercomputer site. A general data transformation/preparation pipeline that can be the basis for further machine learning projects is presented. Additionally, this work explores possibilities for the creation of explainable machine learning models on the processed data.
The goal of the presented work is to explore possibilities for machine learning on top of data collected from high performance computing (HPC) systems. The data was collected on two HPC systems (MARCONI and GALILEO) from the supercomputer site CINECA.

The presented work has the following key points:

• Preparation of the dataset for future machine learning experiments. This involves creation of a data transformation/pre-processing pipeline that is capable of handling huge quantities of data in parallel and in batches.
• Building of an explainable machine learning (ML) model that investigates a learning task on a subset of the data.
• Implementing scalable parallel induction of stable decision trees.

1 Dataset preparation

The main goal of dataset preparation is to convert the raw data (where each row represents a single attribute, a single time stamp and a single node) to data appropriate for machine learning. The processing should be done in parallel and in batches.

1.1 Feature construction and dataset description

The dataset has the following attributes (features):

alive::ping
backup::local::status
cluster::status::availability
cluster::status::criticality
cluster::status::internal
dev::raid::status
dev::swc::confcheckself
filesys::local::avail
filesys::local::mount
filesys::shared::mount
memory::phys::total
ssh::daemon
sys::ldapsrv::status
filesys::eurofusion::mount
sys::gpfs::status
dev::ipmi::events
dev::swc::bntfru
dev::swc::bnthealth
dev::swc::bnttemp
dev::swc::confcheck
batchs::client::state
batchs::client
net::opa::pciwidth
net::opa
sys::orphaned-cgroups::count
core::total
sys::cpus::freq
batchs::client::serverrespond
MaxState

The only constructed feature is MaxState, which is the value of the most serious warning in a timestamp. All features have values between 0 and 3, where 0 means normal operation, 1 means warning and 2 means serious failure (3 signifies missing data).

Feature evaluation. The informativeness of the features is estimated by evaluating them with random forest and extra-trees classifiers (ensemble classifiers based on decision trees). The most relevant features (according to both methods of evaluation) are:

MaxState
ssh::daemon
batchs::JobsH
alive::ping

2 Supervised learning

The second goal of the project is to construct a supervised learning task. The training set consists of instances, described by attributes and a target value for each instance. In this work the target value will always be discrete – the training task will be classification.

2.1 Comparison of different classification algorithms

Comparison of classification algorithms is done with 10-fold cross validation. In each repetition of the process, the model is trained on 90% of the data and evaluated on 10%. The results are then averaged across the (10) iterations. 10-fold cross validation generally produces more reliable results than a simple train/test split (it avoids overly pessimistic or optimistic estimates of classifier performance). The goal of the supervised learning model is to predict when a fault (a warning, or a failure) of a component (the attributes described above) is serious enough to warrant a shutdown of a node (binary target class). Such an (explainable) model would enable the operators to make more informed decisions about the seriousness of the warnings provided by the system.

Table 1: Average classification accuracy in 10-fold cross validation

Dummy Classifier    0.9663
Nearest Neighbors   0.9865
Linear SVM          0.9926
RBF SVM             0.9924
Decision Tree       0.9928
Random Forest       0.9744
Neural Net          0.9915
AdaBoost            0.9928
Naive Bayes         0.9936

Table 2: Average gain compared to the dummy classifier in 10-fold cross validation (in percentage points)

Dummy Classifier    0.0
Nearest Neighbors   2.015
Linear SVM          2.625
RBF SVM             2.612
Decision Tree       2.644
Random Forest       0.805
Neural Net          2.514
AdaBoost            2.644
Naive Bayes         2.73

Classifier evaluation (and the base classifiers themselves) is based on the Scikit-learn library.1 Each iteration was performed in its own thread (parallelization).

The most significant takeaway from the classifier performance is the relative strength of decision trees. This suggests that:

• Features are informative. The feature set provides good information about the target label without the need for additional feature construction.
• Features do not require non-linear transformations to be informative (SVMs do not represent an improvement over the baseline).

Informative features are the basis for the creation of an explainable model. If that were not the case – if, for instance, SVMs performed considerably better than baseline trees – the features would need non-linear transformations, which would make them much more difficult to interpret.

Decision trees – in their basic implementation (greedy splitting) – are not a reliable explainable model for a problem of this scale. Because they are generally unstable (they can change significantly with small changes in the dataset), the reasoning provided by decision trees cannot be the basis for the final model.

3 Future work

Based on the relatively good performance of decision trees, the next step in building the predictive model is to address the inherent instability of decision trees. A possible approach is a consensus tree classifier. This is an ensemble learning approach based on decision trees; it essentially combines the predictions of several decision trees into a single (stable) model.

The basic algorithm for consensus tree induction can be summarized as:2

1. Perform an N-fold partition of the data set (the same partition as with cross validation), and train a separate decision tree on each partition
2. Compute the dissimilarity matrix for the instances with regard to the trained decision trees
3. Construct hierarchical clustering and clean the dataset
4. Construct a decision tree on the cleaned data set

3.1 Implemented during Summer of HPC

Preliminary experiments with a parallel implementation (the computation of the dissimilarity matrix is parallel) of the consensus tree algorithm were performed. The open problem remains how to efficiently implement a parallel version of the algorithm that would be applicable to bigger datasets (a great number of instances and base estimators). The main problem of this implementation is the memory demand of each individual thread. The parallel implementation and subsequent consensus trees will be the topic of a future paper prepared with the mentors from the University of Bologna.

References

1 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825–2830

2 Kavsek, B., Lavrac, N., Ferligoj, A. (2001). Consensus Decision Trees: Using Consensus Hierarchical Clustering for Data Relabelling and Reduction. European Conference on Machine Learning, 251–262

PRACE SoHPC Project Title: Anomaly detection of system failures on HPC machines using Machine Learning Techniques
PRACE SoHPC Site: CINECA, Bologna
PRACE SoHPC Author: Martin Molan, International Postgraduate School Jožef Stefan Institute, Slovenia
PRACE SoHPC Mentor: Andrea Bartolini, University of Bologna, Italy
PRACE SoHPC Mentor: Andrea Borghesi, University of Bologna, Italy
PRACE SoHPC Mentor: Francesco Beneventi, University of Bologna, Italy
PRACE SoHPC Contact: Martin Molan, International Postgraduate School Jožef Stefan Institute, E-mail: [email protected]
PRACE SoHPC Acknowledgement: I am thankful to my mentor (dr. Bartolini) and his colleagues from the University of Bologna for all the help and guidance. I am also thankful to my site supervisor, dr. Massimiliano Guarassi, for his help with setting up the environment on the HPC system.
PRACE SoHPC Project ID: 1906
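Step 2 of the consensus-tree algorithm – the dissimilarity matrix – can be sketched in plain Python. This is an illustrative sketch, not the project's implementation: here each "tree" is represented simply by the leaf it assigns to every instance.

```python
def dissimilarity_matrix(leaf_assignments):
    """leaf_assignments[t][i] = leaf id that tree t assigns to instance i.

    The dissimilarity of two instances is the fraction of trees that place
    them in different leaves: 0.0 = always together, 1.0 = never together."""
    n_trees = len(leaf_assignments)
    n_inst = len(leaf_assignments[0])
    dissim = [[0.0] * n_inst for _ in range(n_inst)]
    for i in range(n_inst):
        for j in range(n_inst):
            differ = sum(1 for t in range(n_trees)
                         if leaf_assignments[t][i] != leaf_assignments[t][j])
            dissim[i][j] = differ / n_trees
    return dissim
```

Each row of the matrix can be computed independently of the others, which is the natural place to parallelize (as mentioned in section 3.1); the resulting matrix then feeds the hierarchical clustering step.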
Visualising Parallel Computations for Education on Raspberry-Pi Clusters
Parallel Computing Demos on Wee Archie
Caelen Feller
I created a framework for visualising parallel communications on Wee Archie, a Raspberry-Pi based parallel computer. I used this to create demonstrations illustrating basic parallel concepts, and an interactive coastal defence simulation.
EPCC has developed a small, portable Raspberry-Pi based "supercomputer" which is taken to schools, science festivals etc. to illustrate how parallel computers work. It is called Wee Archie because it is a smaller version of the UK national supercomputer ARCHER. It is actually a standard Linux cluster, and it is straightforward to port C, C++ and Fortran code to it. There are already a number of demonstrators which run on Wee Archie that demonstrate the usefulness of running demonstrations on a parallel computer, but they do not specifically demonstrate how parallel computing works.

In this project, I developed a framework for creating and enhancing new, existing and in-development demonstrations that show more explicitly how a parallel program runs. This is done by showing a real-time visualisation on the front-end of Wee Archie, or by programming the LED lights attached to each of the 16 Wee Archie Pis to indicate when communication is taking place and where it is going (e.g. by displaying arrows). All of these visualisations are triggered via the MPI (Message Passing Interface) profiling interface, a standard feature on all modern HPC systems, making this a drop-in solution for most existing code.

I developed visualisations for a tutorial illustrating the basics of parallel communications, and a coastline defence demonstration. I also developed a web interface for the tutorials to create a more cohesive user experience. In these demonstrations, I aimed to make it clear to a general audience what is happening on the computer.

Animation Server

To display communications via the LED panels (created by Adafruit), I used the official Adafruit Python library; as such, my visualisations are done in Python. To allow all demonstrations to share the panels safely, I start a queue on each Pi in a separate background process, where any demo can add an 8px × 8px image to be displayed on the 8×8 LED panels. I can also give these images properties such as how long to be
displayed, and create parcels of images together which form my various animations.

I had two major requirements for my animation system: I needed to allow demonstrations developed in any programming language to create an animation, and to be able to coordinate animations between Pis. To do this, I made a web server on each Pi using Flask, which will place an animation in the queue when you make a certain web request against it. You must provide options for animation length, type, and type-specific options as discussed below.

Point to Point Visualisations

In MPI, an important class of communications are "point to point" communications. These are when a message is passed from one node (here meaning a Raspberry Pi) to another directly. In its most basic form, a node will start to send to some destination, and wait until that destination has begun to receive from the correct source before beginning to transfer the message.

Given the ability to add a sequence of images to a queue, how can I accurately visualise a "send" and "receive" operation between two Pis? My approach was to use a Python "pipe". When an animation is reached in the queue, it will show an "entry" and stop, using the pipe to wait until the server allows it to continue. This allowed me to synchronise and wait on animations between Pis, and accurately replicates the behaviour of MPI messaging.

This behaviour is known as a "synchronous" or "blocking" send. There also exists a "non-blocking" send. The main difference between blocking and non-blocking communications is that using non-blocking communication, the Pi will continue working, not waiting for the communication to start, and will do the communication in the background. This is a more efficient but less safe form of communication, and I provided visualisation for it as it is commonly used. Other differences are discussed in the MPI standard.1

Collective Visualisations

The next major class of MPI communications are "collectives", when many nodes want to communicate at once. Say I want to distribute some result to every node: this is done using a broadcast. According to the standard,1 this will cause every participating node to wait until it has the correct output before continuing, though the exact timing varies. For clarity and consistency, in collective communication visualisations all participants wait until the communication is done overall before continuing. The implementation is similar to the waiting in point to point.

There were several other types of communication I visualised (gathering, scattering and reduction), whose implementation is similar. With gathering, the result is collected from each node and stored on one. With scattering, the result on one node is split into small, even parts and distributed to many. With reduction, the result is gathered, but an operation such as a sum or product is applied to it as it is collected. For more details on these, see the MPI standard.1

Application-Specific Animations

Often a demonstration will require unique animations, such as a context-appropriate "working" animation, or a visualisation of some communication at a higher level of abstraction than MPI, such as the "haloswaps" of the coastal defence demonstration. Here, the animation server can be contacted by a non-MPI process which directly contacts the server; this requires modification of the source code.

C and Fortran Framework

While demonstrations for Wee Archie typically use Python for their user interface, they use the C and Fortran languages in computations for performance reasons. Thus, it was desirable to create a wrapper around MPI in these languages which will automatically let the animation server know when a communication is started. I did this using the MPI profiling interface: MPI internally refers to functions using "weak symbols", which allows a library developer to override the functions provided by the library and call their own visualisation and logging code. The framework includes visualisations for most MPI communications, all shown on the next page.

Unified Web Interface

As these demonstrations are used for outreach, the surrounding narrative is important for audience engagement, and improving this aspect of the Wee Archie interface is an important part of explaining more complex concepts such as parallel communication. In previous demonstrations for Wee Archie, a framework written in Python is used to display a user interface on a connected computer. This starts the demonstration by contacting a demonstration web server running on Wee Archie, which in turn uses MPI to run the code on all of the other Raspberry Pis. It returns any results to the client continuously.

I created an internal website for Wee Archie, but due to the performance constraints of serving complicated websites from a Raspberry Pi while it's handling so many communications already, I opted to make it a static website: one which does not require processing by the server other than providing the correct files. I did this using the Gatsby framework. In order to allow the static website to start a demonstration, I wrote my own version of the Wee Archie framework in JavaScript. This framework uses the Axios library to manage communication with the demonstration server, and the React framework to provide a generic demonstration web interface.

Basic MPI Tutorials

This series of tutorials consists of a set of ten demonstrations to be run on Wee Archie, and accompanying text. They are aimed at a complete beginner who does not have programming experience but can understand the concept of a program doing work, and take them through all of the concepts Wee MPI has to offer.
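The profiling-interface idea (intercept each communication call, notify the animation server, then forward to the real routine) can be sketched generically in Python. The real framework does this in C and Fortran through MPI's weak symbols; all names below are illustrative.

```python
import functools

events = []  # stands in for notifications sent to the animation server

def visualised(func):
    """Wrap a communication routine so the animation server is told when
    the communication starts and ends, then call the original routine."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        events.append(("start", func.__name__))
        result = func(*args, **kwargs)   # the underlying "real" routine
        events.append(("end", func.__name__))
        return result
    return wrapper

@visualised
def send(dest, data):
    return len(data)  # dummy stand-in for an MPI send

send(1, b"hello")
print(events)  # [('start', 'send'), ('end', 'send')]
```

This is what makes the framework a drop-in solution: existing code keeps calling `send` unchanged, and the interception happens underneath it.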
Figure 1: Left: Top: Send and Receive, Middle: Broadcast, Bottom: Gather. Right: Top: Scatter, Middle: Reduce (Sum), Bottom: Wave demonstration in progress.
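The per-Pi image queue and the pipe handshake used for "entry" frames can be modelled as a toy in plain Python, with a thread standing in for the background process and an event standing in for the pipe; the names are illustrative, not from the Wee Archie code base.

```python
import threading
import queue

class AnimationQueue:
    """Toy per-Pi animation queue: frames are (image, duration, gate)
    tuples; a frame with a gate blocks until the server releases it."""
    def __init__(self):
        self.frames = queue.Queue()
        self.displayed = []          # stands in for the 8x8 LED panel

    def add(self, image, duration=1, gate=None):
        # duration mirrors the "how long to be displayed" property;
        # it is unused in this toy model
        self.frames.put((image, duration, gate))

    def run(self):
        while not self.frames.empty():
            image, duration, gate = self.frames.get()
            self.displayed.append(image)
            if gate is not None:     # "entry" frame: wait for the server
                gate.wait()

gate = threading.Event()
q = AnimationQueue()
q.add("arrow-right")
q.add("entry", gate=gate)
q.add("done")

worker = threading.Thread(target=q.run)
gate.set()                           # server allows the animation to continue
worker.start()
worker.join()
print(q.displayed)                   # ['arrow-right', 'entry', 'done']
```

In the real framework the gate is a Python pipe and the release comes from the Flask server, which is how animations on different Pis are kept in lockstep.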
The first two demonstrate the benefits of parallel computing through the analogy of cooking, showing an embarrassingly parallel problem and then, introducing conflict and making the problem less perfectly parallel, showing the need for efficient communication. The next series goes through point to point communications, first showing a loop of blocking and then non-blocking sends and receives. The final set discusses collective communications, demonstrating what each does and showing their animations. It also shows the way that a broadcast could be created using point to point communications.

Coastal Defence Demonstration

In the demo, the ocean is broken up into wide horizontal strips.2 Each strip is processed by a single Raspberry Pi. However, there needs to be some communication so that waves can propagate throughout the simulation. To do this, after every tick of the equation solver that is run in the simulation, the edges of these strips are exchanged between Pis. This is known as a "haloswap" and is shown by a custom animation. As many thousands of these occur during the simulation, I only show every hundredth haloswap.

Conclusion

The main goal of the project was to create an extensible and easily used framework to visualise parallel communications in any language. By using the animation server-client architecture and the profiling interface, this goal has been accomplished. As demonstrated in the tutorials and coastal defence simulation, it functions in practice, and as the animations can easily be modified or turned off, there is nothing stopping adoption in future demos.

It would also be an improvement were all demonstrations for Wee Archie done using the web framework, as this would allow users to easily switch between demonstrations, and provide a surrounding explanation. It also allows for novel, interactive visualisations with the use of new features such as WebGL and the D3 JavaScript library.

References
1 Message Passing Interface Forum (2015). MPI: A Message-Passing Interface Standard, Version 3.1.
2 EPCC, The University of Edinburgh (2018). Wee Archie Github Repository.

PRACE SoHPCProject Title
Parallel Computing Demonstrators on Wee Archie

PRACE SoHPCSite
EPCC, Scotland

PRACE SoHPCAuthors
Caelen Feller, TCD, Ireland

PRACE SoHPCMentor
Gordon Gibb, EPCC, Scotland

PRACE SoHPCProject ID
1907
Performance Comparison of Python and C Programs in Computational Fluid Dynamics (CFD)
Performance of Python Program on HPC
Ebru Diler
This project aims to improve the speed performance of Python by experimenting with high-performance scientific and fast data processing libraries.
Scientific researchers use HPC to model and simulate complex problems. They require fast supercomputers in order to resolve problems in the areas of Life Sciences, Physics, Climate research, Engineering and many others. Such machines harness the power of thousands of tightly connected high-end servers to deliver enough power to programs. One such machine is ARCHER, located in Edinburgh.

In recent years, the preferred programming language for HPC has generally been C/C++ or Fortran, because they give the programmer very high control. For a research project that started with one programming language it can be hard to change to another, but there are tradeoffs that should be considered. One of the other languages is Python, which is constantly growing and is developed by a strong community.

Python has become a very popular programming language due to its wide range of uses. If you read programming and technology news or blog posts, you might have noticed the rise of Python as a major programming language; communities including StackOverflow and CodeAcademy have mentioned it. Although it is widely used, it has some disadvantages. One of them is its poor performance: it is therefore not preferred for the large-scale modelling and simulation run on high performance computers. However, some high-performance, fast data processing libraries in Python are also growing. Thus, the tradeoff between compiled languages and scripting languages is being reconsidered. Python offers faster implementation time, flexibility and ease of use and learning, and it has a very strong community, so it should be compared with the languages that have traditionally been preferred in HPC.

In short, this project aims to improve the speed performance of Python, a modern programming language known for ease of use and flexibility but not for speed. The main aim is to optimise speed by using:

• mpi4py, an MPI parallelisation library
• numpy, a fast data processing library

NumPy is one of the most robust and widely used Python libraries. A Python library is a repository of script modules that can be accessed from a Python program; it saves some frequently used commands from being rewritten, and helps to simplify programming. NumPy provides a multidimensional array object, routines for fast operations on arrays, and a range of routines covering a variety of derived objects (such as masked arrays and matrices), including mathematical, logical, basic linear algebra, shape manipulation, sorting and selection operations.

MPI for Python provides bindings of the Message Passing Interface (MPI)[2] standard for the Python programming language, allowing any Python program to exploit multiple processors. It supports point-to-point (sends, receives) and collective (broadcasts, scatters, gathers) communications of any picklable Python object, as well as optimized communications of Python objects exposing the single-segment buffer interface (NumPy arrays, builtin bytes/string/array objects).
Introduction

Computational fluid dynamics (CFD) is a branch of fluid mechanics that uses analysis methods to analyse and solve problems involving fluid flows. In order to simulate the flow of a liquid, computers are used for the required calculations. Better solutions can be achieved with high-speed supercomputers, and the biggest and most complex problems can only be solved with them.

Fluid dynamics is a continuous problem that can be described by partial differential equations. In this study, the finite difference approach is used to solve the equations and determine the fluid flow pattern in a square cavity with a single inlet on the right side and a single outlet at the bottom of the cavity.

Implementation

There was already C code written to solve this problem. Three different parameters affect the algorithm:

1. The scale factor, which determines the size of the problem
2. The Reynolds number, related to the flow property of the fluid
3. The tolerance, used to provide convergence control

The flow images resulting from changing the Reynolds number are given in Figures 1 and 2.

The parallel approach is based on domain decomposition: all cores, communicating via the MPI interface, calculate their own part of the domain and exchange the information needed to keep the problem consistent. This is called halo swapping (Figure 3), and it is regulated by MPI.

Figure 1: Fluid flow, Reynolds number: None
Figure 3: Visualisation of Halo Swap

Results and Discussion

Comparison of Python Programs

For each run, we measured the time spent per one thousand iterations. The fastest program was the NumPy version with vectorisation, which lets us avoid for loops.
This was our expectation in the first place, but we were quite surprised by the other two versions (Figure 4): the version with NumPy arrays and for loops ran much more slowly than the one without NumPy, although our prediction had been that the program would run faster using the NumPy library.

Methods

The fluid flow can be described by the stream function ψ, defined as follows:

∇²ψ = ∂²ψ/∂x² + ∂²ψ/∂y² = 0

Using the finite difference approach, the flow value at each grid point can be calculated as follows:

ψi−1,j + ψi+1,j + ψi,j−1 + ψi,j+1 − 4ψi,j = 0

With the boundary values fixed, the stream function can be calculated for each point in the grid by averaging the value at that point with its four nearest neighbours. The process continues for the given number of iterations. This simple approach to solving the problem is called the Jacobi Algorithm. The velocity field u must then be calculated to obtain the fluid flow pattern within the cavity. The x and y components of u are related to the stream function by:

ux = ∂ψ/∂y = ½(ψi,j+1 − ψi,j−1)
uy = −∂ψ/∂x = −½(ψi+1,j − ψi−1,j)

Figure 2: Fluid flow, Reynolds number: 2

Serial Program

I wrote three different serial versions of the Python code to compare their speed:

1. Using Python lists and for loops
2. Using NumPy arrays and for loops
3. Using NumPy arrays and built-in NumPy vectorization features [1]

Parallel Program

I started from a serial version running on a single processor and accelerated it by distributing the work over more processors.

Comparison of C Programs

In order to control compilation time and compiler memory usage, and the trade-offs between speed and space for the resulting executable, GCC provides a range of general optimization levels, numbered from 0 to 3, as well as individual options for specific types of optimization. An optimization level is chosen with the command line option -OLEVEL, where LEVEL is a number from 0 to 3. Figure 5 shows the execution time of the C program at each optimization level.

We also compared C and Python execution speeds. Figure 7 shows the data generated by the C and Python programs; according to these data, the efficient features of NumPy increased performance 112 times.
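The Jacobi update and the halo swapping described above can be sketched together in NumPy. Here the MPI exchange between cores is replaced by direct array copies between two in-process "strips", so this is an illustration of the scheme, not the project's code; the names are mine.

```python
import numpy as np

def jacobi_step(psi):
    new = psi.copy()
    # average each interior point with its four nearest neighbours
    new[1:-1, 1:-1] = 0.25 * (psi[:-2, 1:-1] + psi[2:, 1:-1] +
                              psi[1:-1, :-2] + psi[1:-1, 2:])
    return new

def halo_swap(strips):
    # in the real program these copies are mpi4py sends and receives
    for a, b in zip(strips[:-1], strips[1:]):
        a[-1, :] = b[1, :]   # my bottom halo <- neighbour's first real row
        b[0, :] = a[-2, :]   # neighbour's top halo <- my last real row

# an 8x8 grid with the top boundary fixed at 1, split into two strips;
# each strip carries one extra halo row on the shared edge
full = np.zeros((8, 8))
full[0, :] = 1.0
strips = [full[0:5].copy(), full[3:8].copy()]

for _ in range(200):
    halo_swap(strips)
    strips = [jacobi_step(s) for s in strips]
```

Because the halos are refreshed before every update, the decomposed iteration produces exactly the same values as running the Jacobi step on the undivided grid.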
Figure 4: Comparison of Python programs
Figure 5: Comparison of C programs
Figure 7: Comparison of Python and C programs
Performance measures for Parallelization

The speedup of a parallel code is how much faster the parallel version runs compared to a non-parallel version. Taking the time to run the code on 1 processor as T1 and the time to run the code on N processors as TN, the speedup S is found by:

S = T1 / TN

This can be affected by many different factors, including the ratio of communication to calculation. If the times are the same, the speedup is 1: there was no change.

Figure 6: Parallel Python Program

Figure 7 demonstrates the strong scaling of the parallel Python program running on different numbers of processors. The graphic describes how the execution time of the program is affected by increasing the number of cores as the size of the problem increases.

Conclusion

Although the Python program still falls behind the C program in terms of performance, it is possible to improve its performance thanks to the NumPy library. From this project we can say that using NumPy arrays without using efficient NumPy features (like vectorisation) affects the performance of a Python program negatively.

References
1 www.geeksforgeeks.org Vectorization in Python
2 https://mpi4py.readthedocs.io/en/stable/ MPI for Python

PRACE SoHPCProject Title
Performance of Python programs on new HPC architectures

PRACE SoHPCSite
Edinburgh Parallel Computing Centre, United Kingdom

PRACE SoHPCAuthors
Ebru Diler, [Dokuz Eylul University] Turkey

PRACE SoHPCMentor
David Henty, EPCC, UK

PRACE SoHPCContact
Ebru Diler, İzmir
Phone: +90 553 111 71 41
E-mail: [email protected]
linkedin.com/in/ebru-diler

PRACE SoHPCSoftware applied
Python, Numpy, mpi4py

PRACE SoHPCAcknowledgement
Great thanks to Dr. David Henty for continuous support in all matters.

PRACE SoHPCProject ID
1908
Investigation on GASNet’s Active Messaging component as an alternative to MPI
Investigations on GASNet’s active messages

Benjamin Huth

Illustration of the difference between the MPI approach and GASNet’s active message (AM) approach: whereas MPI requires the receiving process to actively take the data, an active message directly jumps in and modifies the local memory.
The project aimed at replacing an MPI-based communication backend with GASNet. However, performance tests showed that GASNet does not perform well on this task, and a subsequent broader investigation supports this observation.
Parallelism is the most important concept in modern supercomputing. On today’s large supercomputers it is mainly achieved by a technique called SPMD (Single Program Multiple Data). This means that there is only one unique program which is executed on many different CPUs (often on several hundreds or thousands) at one time, but each of them has individual data. To communicate between these programs, programmers use specific messaging libraries. The most popular one is MPI (Message Passing Interface). Its basic concept is as follows:

• The sending program calls a send-function.
• The destination program has to call a receive-function.
• Only if there is a matching pair of send- and receive-functions does the communication take place.

Although this is the most popular concept, it is not the only one. Another one, called "Active Messaging", is implemented by GASNet (Global-Address Space Networking). Its approach is very different from the above one:

• The sender sends not just data, but also a function to the destination.
• When the message arrives at the destination, it automatically executes the function, which can work on the data.
• The receiver has no direct control over the message arrivals.

There are tasks which naturally fit this second approach better, e.g. a task-based library, which abstracts the problem of explicitly sending data away from the programmer, and lets them focus on the logical structure of the program and its data. Actually, this was the motivation for my project: we aimed at replacing the communication back-end of a library called “EDAT”, which relies on MPI, with a GASNet-based one.

Change of focus

After implementing the first version of EDAT with GASNet, we saw that the performance was not as good as expected. So we changed the focus to investigating the basic properties of GASNet’s active messaging component. To get more general results, we decided not to analyse the performance of self-made code, but to rely on existing standard benchmarks (programs for measuring hardware performance) for MPI, and to rewrite them with GASNet as closely as possible. These benchmarks should cover a broad spectrum of parallel software, to figure out the strengths and weaknesses of GASNet. It is important to mention that we use GASNet here as a drop-in replacement for passive point-to-point communication, which is not what it was designed for.
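The two messaging styles contrasted above can be modelled as a toy in plain Python, with a queue standing in for the network; all names are illustrative.

```python
from collections import deque

network = deque()   # stands in for the interconnect

# --- MPI style: the receiver must actively take the data -------------
def mpi_send(data):
    network.append(data)

def mpi_recv():
    return network.popleft()         # a matching receive call is required

# --- Active-message style: the message carries its own handler -------
memory = {}                          # the receiver's local memory

def am_send(handler, data):
    network.append((handler, data))

def am_poll():
    # on arrival the handler runs automatically and works on the data;
    # the receiver never issues an explicit receive for it
    while network:
        handler, data = network.popleft()
        handler(data)

mpi_send(41)
assert mpi_recv() == 41              # explicit matching receive

am_send(lambda d: memory.update(result=d + 1), 41)
am_poll()
print(memory["result"])              # 42
```

The asymmetry is the whole point: in the MPI half both sides participate in every transfer, while in the active-message half the sender alone decides what happens in the receiver's memory.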
(a) Visualisation of the swapping process in a simple 2D stencil algorithm. The grey circles represent the processes; each arrow denotes a data transfer. (b) Example of a graph. At the bottom one sees how the "breadth-first search" goes through the above graph in four steps: in the first step it reaches all points with distance 1, in the second all with distance 2, and so on.
Figure 1: Visualisation of the two benchmarks
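The "breadth-first search" of figure 1b can be written in a few lines. The road-map graph below is a hypothetical reconstruction whose edges reproduce the four search steps of the figure; it is not the benchmark's random graph.

```python
from collections import deque

# illustrative road map in the spirit of figure 1b
roads = {
    "Frankfurt": ["Cologne", "Munich", "Berlin"],
    "Cologne":   ["Frankfurt", "Hamburg"],
    "Munich":    ["Frankfurt", "Stuttgart", "Vienna"],
    "Berlin":    ["Frankfurt", "Hamburg", "Prague"],
    "Hamburg":   ["Cologne", "Berlin"],
    "Stuttgart": ["Munich", "Zurich"],
    "Prague":    ["Berlin", "Vienna"],
    "Vienna":    ["Munich", "Prague"],
    "Zurich":    ["Stuttgart"],
}

def bfs_distances(graph, start):
    dist = {start: 0}
    todo = deque([start])
    while todo:
        v = todo.popleft()
        for w in graph[v]:
            if w not in dist:        # first visit = shortest distance
                dist[w] = dist[v] + 1
                todo.append(w)
    return dist

print(bfs_distances(roads, "Frankfurt")["Zurich"])   # 3
```

In the parallel benchmark the vertices are spread over the processes, so each expansion of the frontier turns into many small messages to the owners of the neighbouring vertices, which is exactly the latency-bound traffic pattern discussed later.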
But if that works out well, it would be easier for a developer to implement than redesigning the whole application.

Benchmarks for analysis

The purpose of the first benchmark is to determine the basic messaging properties of each communication system:

• The bandwidth is the amount of data a network can transfer in a certain time; modern networks typically reach a low two-digit number of GB/s, e.g. 10 GB/s. Compared to the bandwidth of the internal memory this is very slow, so it is a big bottleneck for parallel software.
• The latency is the time a network needs to react to a messaging request. It determines the fastest possible communication, and for small messages it is often more important than the bandwidth.

To measure these two quantities, we use a standard benchmark designed by the Ohio State University (see [1]).

The second application we chose for our tests is a stencil benchmark. In this application a grid is distributed over many processes (so each process has a small fraction of the grid in its local memory). During the run, each grid point is updated with the sum of its 4 neighbours (in the simplest 2D case). This is a typical scientific application; for example, the above algorithm solves the Laplace equation of electrodynamics. To perform an update on the grid points at the edges of a local grid, one must transfer some grid points from a neighbour process to the local one. In the simple 2D case, each process has to send 4 chunks of data to its neighbours, but also receives 4 chunks from them (see figure 1a). As a basis for the GASNet benchmark, we used the implementation of Intel’s Parallel Research Kernels for MPI (see [2]).

The last benchmark is called graph500 (see [3]). It is designed to simulate applications which do not rely on heavy numerical computations, but on complex data structures. It implements a so-called “graph”. This is an abstract data structure which consists of points (called “vertices”), some of which are connected by “edges”. These edges usually represent a distance, whereas the vertices can contain any kind of information. A common and simple example of data one can represent as a graph is a road map (see figure 1b). The graph500 benchmark first creates a random graph, which is again distributed over the processes, and then starts a “breadth-first search” (see also figure 1b). This results (in contrast to the stencil benchmark) in many but very small messages with unpredictable destinations. To implement this benchmark, I haven’t rewritten the whole program, but just replaced the already abstracted communication layer with a GASNet-based one.

All used source code is available in a repository (see [4]), together with detailed instructions on how to compile GASNet and the different benchmarks.

How to run the benchmarks

We ran these benchmarks on three different systems, but here we show only the results from ARCHER, the UK’s national supercomputer. For its network there exists an optimised MPI as well as an optimised GASNet implementation, but the results are also representative of the rest of the systems.

Whereas in the micro-benchmarks we scale the message size but hold the process count constant, for the two big benchmarks we scaled the parallelism: we started with the serial problem and 1
(a) Performance of the latency benchmark. The message size is increased in powers of 2 from 1 byte to about 4 MB.
(b) Performance of the bandwidth benchmark. The message size is increased in powers of 2 from 1 byte to about 4 MB.
(c) Performance of the stencil benchmark. The workload is held constant at 1,000,000 grid points per process/core.
(d) Performance of the graph500 benchmark. The workload is held constant at 4096 vertices per process/core.
Figure 2: performance plots from ARCHER
process, and went up to 3072 processes (for the stencil) respectively 2048 processes (for the graph500, which requires the number of processes to be a power of two). While we scale the number of processes, we try to keep the amount of work per process constant (called weak scaling), since we are not interested in how the algorithm benefits from parallelism, but in how well the communication works. For a theoretical communication system with infinite speed and zero latency, the execution time would stay constant, so an increase in execution time reflects the communication overhead coming with more parallelism.

Results and analysis

Already in the micro-benchmarks (see figures 2a and 2b) we see a significant performance difference. This is not sufficient as proof that GASNet’s active messages perform badly on ARCHER, but it is a strong indicator. In the latency plot, the left side is the most interesting one: here we see that the minimal time for a message in GASNet is nearly twice as long as for MPI. On the bandwidth plot, the most interesting area is the right side, and here we also see better performance with MPI, especially for the largest message sizes.

But these effects are not directly seen in the stencil benchmark: there is not much difference between the two frameworks (see figure 2c). After a big increase at the beginning (which is probably caused by memory effects inside a single node), the performance stays constant, as expected for weak scaling.

With the graph500 (see figure 2d), we see an entirely different picture: GASNet is now a whole magnitude slower than MPI. This could be surprising, because one might think that the active messaging approach is more suitable for this kind of application (in fact, the original benchmark uses an active messaging layer implemented with MPI). On the other hand, an obvious explanation can be found in the latency results above: a difference in latency is much more relevant with many small messages than with a few big ones.

Conclusion

With these results in mind, it is probably not worth using GASNet’s active messaging as a replacement for MPI for passive point-to-point communication. For this reason, we dropped the previous aim of this project and focused on pure GASNet as an alternative to MPI. I think investigations on this topic are of general interest for the parallel computing community; although MPI is very good and widely used, it is always important to consider alternatives, which can not only help developers choose the right tools for their job, but also push MPI to further improvements.

However, these are only preliminary results; it could be interesting to investigate a few things further, like doing more extensive profiling to understand the observed performance better. It may also be worth investigating the possibilities of GASNet’s second component, RMA (Remote Memory Access), which can compete with MPI according to its developers.

References

1. Panda, D. K. OSU Micro-Benchmarks 5.6.1. Mar. 2019. http://mvapich.cse.ohio-state.edu/benchmarks/ (accessed 7th August 2019).
2. Intel Corporation. Parallel Research Kernels. 2013. https://github.com/ParRes/Kernels (accessed 7th August 2019).
3. Graph500 Steering Committee. Graph500-3.0.0. May 2019. https://github.com/graph500/graph500 (accessed 7th August 2019).
4. Huth, B., Brown, O. T. & Brown, N. EPCCed/GASNet-AM-benchmarks: PAW-ATM19_submission. Aug. 2019. doi:10.5281/zenodo.3368621. https://doi.org/10.5281/zenodo.3368621.

PRACE SoHPCProject Title
Task-based models on steroids: Accelerating event driven task-based programming with GASNet

PRACE SoHPCSite
EPCC, UK

PRACE SoHPCAuthors
Benjamin Huth, Uni Regensburg, Germany

PRACE SoHPCMentor
Nick Brown, EPCC, UK

PRACE SoHPCMentor
Oliver Brown, EPCC, UK

PRACE SoHPCContact
Benjamin Huth, Universität Regensburg
Phone: +49 174 605 4097
E-mail: [email protected]

PRACE SoHPCSoftware applied
MPI, GASNet

PRACE SoHPCMore Information
gasnet.lbl.gov

PRACE SoHPCAcknowledgement
I want to thank my mentors Nick Brown and Oliver Brown for their constant support, Ben Morse for organising our stay in Edinburgh, and PRACE and Leon Kos for organising this event. Last but not least I want to thank my SoHPC colleagues Caelen and Ebru for spending this awesome summer with me!

PRACE SoHPCProject ID
1909
Using HPC to investigate the structure and dynamics of a K-Ras4b oncogenic mutant protein
Studying an oncogenic mutant protein with HPC
Rebecca Lait
Molecular Dynamics (MD) simulations and Normal Mode Analysis have been used in order to discover potential binding sites on the K-Ras4b protein, which could be further used in drug design studies.
The oncogenic mutant K-Ras4b is a protein that has been studied for many years, but there has been limited progress regarding finding new drugs against it. Drugs are small molecules that bind to proteins and inhibit protein function. Also known as ligands, they bind to a specific region of the target protein, the binding site. In some cases, the location of the binding site is not known in advance, even though the protein structure is available. In this context, computational studies can help discover new binding sites. Hence, the goal of this project was to identify binding sites on the oncogenic protein K-Ras4b, which could benefit future therapeutic approaches.

K-Ras4b

K-Ras4b is a small GTPase, which refers to a large family of hydrolase enzymes. It is an essential component of the signalling network controlling signal transduction pathways and promoting cell proliferation and survival. It operates as a binary switch in signal transduction pathways, cycling between inactive GDP-bound and active GTP-bound states.[1] In the GTP-bound active state, cell proliferation, which refers to an increase in the number of cells due to cell growth and division, can normally occur. However, in the GDP-bound inactive form, cell proliferation is inhibited. Figure 1 shows the GTP molecule bound to the K-Ras4b protein. During the hydrolysis of GTP, there is a loss of a phosphate group, which results in the formation of GDP. Thus, these two states act like switches, turning the function of the protein on and off, respectively. However, the hydrolysis of GTP to GDP in the G12D mutated K-Ras4b, where the glycine in residue position 12 has been replaced with aspartic acid, is no longer possible. Consequently, the protein is 'locked' in the active GTP state.[1] This leads to over-activation of the K-Ras4b protein and, as a result, to cancer growth.

Figure 1: The GTP molecule bound to the K-Ras4b protein.

The K-Ras4b protein is comprised of three functional domains: the effector binding region, defined by amino acid residues 1-86; the allosteric region, by residues 87-166; and the HVR tail, by residues 167-188. Each of these three domains is responsible for a particular interaction or function which contributes to the overall function of the protein.[2] The first domain in K-Ras4b, the effector binding region, is located in the N-terminus of the protein. There are three sub-domains within the effector binding region: the P-loop, switch I and switch II.[1] Once the GTP is bound, the protein undergoes conformational changes that involve switch I and switch II. The second domain, the allosteric region, has been suggested to play a role in switch II conformation and the orientation of the membrane, while the third domain, the HVR tail, regulates the biological activity of the protein.[3] Within the aqueous component of the cytoplasm of a cell, the CYS185 of the protein is farnesylated, meaning that there is an addition of a farnesyl group, allowing the protein to bind to the inner leaflet of the plasma membrane.[4] There are many positively charged lysine amino acids within the HVR tail that interact strongly with the negatively charged membrane.[1] These functional domains can be seen in Figure 2. The blue region highlights the effector binding region, the red region displays the allosteric region, and the green region shows the HVR tail.

Figure 2: The K-Ras4b protein with the three domains highlighted in red, blue and green.

Aims of the project

The aim of this project was to identify new potential binding sites on the surface of the unmutated wild type and the G12D mutant K-Ras4b, where small molecules can bind. This would aid the rational design of selective K-Ras4b inhibitors. In order to achieve this, Molecular Dynamics simulations and Normal Mode Analyses were performed.

Molecular Dynamics simulations

Atomic coordinate files of the protein and the specific attached ligand structure - either GTP or GDP - were downloaded from the Protein Data Bank.[5] Protein preparation, including solvation and ionisation processes, was carried out. The pdb files used for the unmutated WT K-Ras4b bound to GDP, G12D mutated K-Ras4b bound to GDP, unmutated WT K-Ras4b bound to GTP and G12D mutated K-Ras4b bound to GTP had pdb codes 5W22, 5US4, 5VQ2 and 6GOF, respectively.[5] Regarding the G12D K-Ras4b GTP structure, a structure was not available, and so one was created using the 6GOF structure as a starting point. The HVR region was taken from pdb code 2MSC and appended to the 6GOF structure. Moreover, one of the two magnesium ions was removed and the GPPNHP molecule was replaced with the GTP molecule. The solvation step, which places the protein in a water box, and the ionization step, which places ions in the water, were carried out in order to represent a more realistic biological environment. Thus, the following systems were created: unmutated wild type K-Ras4b bound to GDP or GTP, and G12D mutated K-Ras4b bound to GDP or GTP. Once protein preparation was completed for all four K-Ras variants, Molecular Dynamics simulations were performed using the ARIS supercomputer. For the WT K-Ras4b bound to GDP, the simulation was run for 75 ns, and for the remaining 3 variants for 23 ns each.

Identifying K-Ras4b binding sites

After performing the Molecular Dynamics simulations, the last frame of each simulation was collected and imported into the Schrodinger Suite for binding site identification. SiteMap was used in order to determine whether any cavities with hydrophobic characteristics and hydrogen bonding capacity existed on the simulated proteins.[6] Initially, regions on the protein surface, called 'sites', were identified as potentially promising. These were located using a grid of points called 'site points'. In the second stage, contour maps, 'site maps', were generated and the predicted binding site was visualised. Finally, each site was assessed by calculating various properties, such as druggability and hydrophobicity.[6] All of the predicted binding sites were investigated and only the ones that appeared the most promising would continue in further analysis. The identified binding sites for the mutated G12D K-Ras4b bound to GTP are shown in Figure 3 below. The binding sites identified are shown in three colours: red, yellow and purple. They are each highlighted using coloured circles and are each labelled.
Figure 3: The identified binding sites for the G12D mutated K-Ras4b bound to GTP. Each binding site is labelled and is highlighted using coloured circles.
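The final triage step described above — score each candidate site on several properties and keep only the most promising — can be sketched as follows. The site names, scores and cut-off are invented placeholders; the project used Schrodinger's SiteMap, whose actual descriptors and scoring are more elaborate:

```python
# Sketch of the "keep only promising sites" triage described above.
# Names, score values and the 0.8 cut-off are invented placeholders;
# SiteMap's real descriptors (e.g. SiteScore, Dscore) are more elaborate.
candidate_sites = [
    {"name": "site_red",    "druggability": 0.98, "hydrophobicity": 1.1},
    {"name": "site_yellow", "druggability": 0.86, "hydrophobicity": 0.7},
    {"name": "site_purple", "druggability": 0.81, "hydrophobicity": 0.9},
    {"name": "site_minor",  "druggability": 0.55, "hydrophobicity": 0.3},
]

THRESHOLD = 0.8  # invented cut-off for "promising"

# Keep sites above the threshold, best-scoring first.
promising = sorted(
    (s for s in candidate_sites if s["druggability"] >= THRESHOLD),
    key=lambda s: s["druggability"],
    reverse=True,
)
print([s["name"] for s in promising])
```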
Normal mode analysis

Normal mode analysis was carried out alongside the binding site identification in order to investigate large-scale motions of the protein structures. Normal mode vibrations are simple harmonic oscillations around an energy minimum. The normal modes which have the biggest amplitude can indicate the functional motions of the proteins. In the context of this project, the fluctuations around predicted binding sites were investigated. More specifically, if the analysis predicted large-scale motions at a particular area of the protein that SiteMap identified as a potential binding site, it may be suggested that this is a biologically functional region of the protein and could act as a binding site. This type of analysis is qualitative. However, when it is combined with binding site identification results, it can give useful insights regarding the functional areas of a protein. For each of the four studied K-Ras variants, regions with considerable large-scale motions near the identified binding sites were found. An example of this analysis is shown in Figure 4 below and compared to the binding site identification result from SiteMap.
Figure 4: The normal mode analysis undertaken on the unmutated wild type K-Ras4b structure bound to GDP. The binding site is circled.
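Normal modes as described above are obtained by diagonalising the Hessian (second-derivative matrix) of the potential energy at a minimum: the eigenvectors are the vibration patterns, and the smallest non-zero eigenvalues correspond to the soft, large-amplitude motions. A toy NumPy sketch for a 1-D chain of beads coupled by unit springs (an illustration only, not the project's actual force-field Hessian or software):

```python
import numpy as np

# Toy normal-mode analysis: 4 beads in 1-D joined by unit springs, free ends.
# Potential U = 0.5 * sum_i (x[i+1] - x[i])^2, so the Hessian is the standard
# tridiagonal spring (path-graph Laplacian) matrix.
n = 4
H = np.zeros((n, n))
for i in range(n - 1):
    H[i, i] += 1.0
    H[i + 1, i + 1] += 1.0
    H[i, i + 1] -= 1.0
    H[i + 1, i] -= 1.0

evals, evecs = np.linalg.eigh(H)  # eigenvalues sorted in ascending order

# The zero eigenvalue is overall translation; the next-smallest eigenvalue
# belongs to the softest (largest-amplitude) internal vibrational mode.
print("eigenvalues:", np.round(evals, 3))
print("softest internal mode:", np.round(evecs[:, 1], 3))
```

For a protein the same diagonalisation is done on the mass-weighted Hessian of the full structure, and the per-residue amplitudes of the softest modes are what is compared against the predicted binding sites.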
Conclusions

In this project, models of the K-Ras4b structures were constructed and MD simulations were used to study the dynamics of this protein. Binding site identification was performed in order to find cavities where candidate drugs could potentially bind, resulting in successfully identifying sites on the four K-Ras4b variants in different domains of the protein. Normal Mode Analyses were used to identify whether the predicted binding sites lie in areas of the proteins with large fluctuations.

References

1 Lu, S., Jang, H., Gu, S., Zhang, J. and Nussinov, R. (2016). Drugging Ras GTPase: a comprehensive mechanistic and signaling structural view. Chemical Society Reviews, 45(18), pp.4929-4952.
2 EMBL-EBI Train online. (2019). What are protein domains? [online] Available at: https://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi/protein-classification/what-are-protein-domains [Accessed 2 Aug. 2019].
3 Hobbs, G., Der, C. and Rossman, K. (2016). RAS isoforms and mutations in cancer at a glance. Journal of Cell Science, 129(7), pp.1287-1292.
4 Jančík, S., Drábek, J., Radzioch, D. and Hajdúch, M. (2010). Clinical Relevance of KRAS in Human Cancers. Journal of Biomedicine and Biotechnology, 2010, pp.1-13.
5 RCSB PDB: Homepage. [online] Rcsb.org. Available at: https://www.rcsb.org/ [Accessed 29 Aug. 2019].
6 Halgren, T. A. (2009). Identifying and characterizing binding sites and assessing druggability. J Chem Inf Model, 49, pp.377-389.

PRACE SoHPC Project Title: Using HPC to investigate the structure and dynamics of a K-Ras4b oncogenic mutant
PRACE SoHPC Site: Biomedical Research Foundation of Athens (BRFAA), Greece
PRACE SoHPC Authors: Rebecca Lait, England
PRACE SoHPC Mentor: Zoe Cournia, BRFAA, Greece
PRACE SoHPC Contact: Rebecca Lait, E-mail: [email protected]
PRACE SoHPC Software applied: Virtuoso
PRACE SoHPC More Information: www.virtouso.org
PRACE SoHPC Acknowledgement: I would like to thank the PRACE Summer of HPC programme for allowing me to undertake this internship. I will forever be grateful for this incredible opportunity and the experience I have gained. In addition, I would like to thank my colleagues at the Biomedical Research Foundation of Athens for their continued support and guidance. They made my project so enjoyable and taught me so many useful lessons that I know will be essential in my future career. Moreover, I am so appreciative of being granted time on the ARIS supercomputer in Athens during this final project. Being introduced to the capabilities of this supercomputer has been inspiring. Thank you PRACE!
PRACE SoHPC Project ID: 1910
HPC for candidate drug optimization using free energy perturbation calculations

Lead Optimization using HPC
Antonio Miranda
Free Energy Perturbation (FEP) calculations may reduce the time and cost of drug discovery. Specifically, they may help in optimizing the binding affinity between the drug candidate and the therapeutic target. To validate this methodology within the context of drug discovery, FEP calculations were performed and compared with experimental data. Results suggest that FEP has predictive ability and can be used for lead optimization but using HPC resources is crucial.
The cost of creating and launching one drug has been reported to be above US$1 billion. Indeed, there are even studies that estimate this cost at more than US$2.8 billion.1

Figure 1: Drug discovery cost. Data source: Paul et al. (2010)7

Drug discovery has several phases (Figure 1). One of them, "lead optimization", may account for almost 25% of the total cost of all four phases, according to a study from 2010.1 This phase includes optimizing drug metabolism, pharmacokinetic properties, bioavailability, toxicity, and of course efficacy. Drug efficacy is directly dependent on the binding affinity of the candidate drug onto the pharmaceutical target, which is commonly a protein.

Lead Optimization using computational approaches

This project focuses on the optimization of the binding affinity, i.e. the strength of interaction of the drug candidate onto the target protein, using computational approaches. Specifically, the binding affinity of several modified compounds (analogs) of an original lead inhibitor of a target protein was studied using Molecular Dynamics (MD) simulations coupled with Free Energy Perturbation (FEP) calculations.

Free Energy Perturbation calculations

Free Energy Perturbation (FEP) calculations offer a promising tool for computing the relative binding affinity of two drug candidates, because they have a rigorous statistical mechanics background.2,3 In particular, in FEP calculations the free energy difference, ∆G, between compound A and compound B is computed as an average of a function of their energy (kinetic and potential) difference, sampled in the reference state. Then, using a thermodynamic cycle (Figure 2), the relative binding free energy, ∆∆G, between two
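The averaging step can be illustrated with the classic Zwanzig exponential-averaging form of FEP, ∆G = −kT ln⟨exp(−∆U/kT)⟩, evaluated per window and summed over windows. This is a sketch only: the project itself used the Bennett Acceptance Ratio estimator, and the energy-difference samples below are synthetic:

```python
import math
import random

kT = 0.593  # kcal/mol at roughly 298 K

def zwanzig_dG(dU_samples, kT=kT):
    """Free energy difference of one window via exponential averaging:
    dG = -kT * ln < exp(-dU/kT) >, with dU sampled in the reference state."""
    avg = sum(math.exp(-dU / kT) for dU in dU_samples) / len(dU_samples)
    return -kT * math.log(avg)

# Synthetic per-window energy differences (illustration only, not real data).
random.seed(0)
windows = [[random.gauss(0.2, 0.05) for _ in range(1000)] for _ in range(5)]

# The total dG of one leg is the sum of the per-window estimates.
dG_leg = sum(zwanzig_dG(w) for w in windows)
print(f"dG(leg) = {dG_leg:.3f} kcal/mol")

# Relative binding free energy then follows from the thermodynamic cycle:
# ddG = dG(complex leg) - dG(solvent leg); ddG < 0 favours ligand B.
```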
congeneric ligands A and B is calculated from the following equation: ∆∆G = ∆G_B − ∆G_A. If this approach results in ∆∆G < 0, it is concluded that ligand B is favored over ligand A (Figure 2).

Figure 2: FEP thermodynamic cycle. Source: Athanasiou et al. (2017)6

The use of FEP simulations in drug discovery has historically been restricted by the high computational demand, the limited force field accuracy and its technical challenges. However, in recent years, FEP calculations have benefited from advances in GPU and HPC availability, force-field research and the development of more elaborate simulation packages.2

Once these limitations have been overcome, it is time to assess whether FEP calculations using HPC resources are efficient for drug discovery. If this is the case, the time and cost of drug discovery could be reduced. This reduction could have implications for the final market drug price, which would eventually benefit the whole of society. Therefore, the scope of this work is the comparison of the FEP simulation results with experimental results, as well as assessing whether using HPC resources would be efficient for FEP calculations. The comparison with experimental results has been done using several analogs of CK666, a micromolar Arp2/3 inhibitor. Arp2/3 is a protein involved in cell locomotion and a mediator of tumor cell migration (thus, it is involved in metastasis).4,5 The HPC resources used were the Greek national supercomputer, ARIS, from the GRNET facility.

Methods

For every ligand, there are two states: the ligand free in solution and the ligand bound to the protein in solution. Then, when computing ∆∆G, there are four states: ligand A in solution, ligand B in solution, the ligand A-protein complex in solution and the ligand B-protein complex in solution. In FEP, or alchemical, calculations, instead of simulating the binding/unbinding processes directly (∆G°1 and ∆G°2 in Figure 2), which would require a simulation many times the lifetime of the complex, ligand A is alchemically transmuted into ligand B, in solution or in the protein environment, through intermediate, non-physical states (∆G°A and ∆G°B in Figure 2). Because free energy is a state function, the thermodynamic cycle of Figure 2 can be used to compute the difference in the free energy of binding between ligand A and ligand B: ∆∆G = ∆G°B − ∆G°A = ∆G°2 − ∆G°1.

In the consecutive non-physical intermediates, the interactions of the atoms are progressively transformed from the reference to the final state. In the simulations performed during this work using the software GROMACS 5.1.4,8,9 21 non-physical intermediates were used, split into 21 windows. This results in performing 21 FEP calculations. During the first 10 windows, van der Waals and bonded interactions were switched on for the appearing atoms. During the last 10 windows, van der Waals and bonded interactions were already switched on and electrostatic interactions were switched on progressively.

The systems need to be prepared before one can perform the FEP calculations. The setup includes preparation of the structures, parameterization of the molecules, alignment of the ligands, complex formation, and solvation and neutralization. For the protein, the AMBER ff14 force field10 was used, while for the ligands we used the GAFF2 force field.11 Ligands are aligned because it is assumed that their common core has the same binding mode. By having them aligned, there is more overlap between the ligands' potential energies and the FEP calculations are more likely to converge. All setup steps were performed using the FESetup tool.12 In addition, coordinate files for some of the simulated analogs were created by modifying the reference ligand with the Maestro interface of the Schrödinger software.13

After the setup, MD simulations were run on the ARIS supercomputer. MD simulations consist of four phases: energy minimization, two equilibration phases and production. Energy minimization is performed to relax the structures and guarantee there are no inappropriate geometries. Then, equilibration is performed to stabilize the temperature and pressure of the system. Finally, each of the 21 non-physical intermediate states described above was simulated for 5 ns in the production run. At the end of the production simulation, the obtained potential and kinetic energies allow the calculation of the free energy difference for each of the intermediate steps. In this work, the Bennett Acceptance Ratio was used to compute the free energy differences.14 The total free energy difference results from summing up the free energy differences of each intermediate step. Finally, once both legs were finished, ∆∆G was calculated.

Prior to performing the simulations, scaling curves were obtained in a previous phase of the project, and it was determined that complex-leg calculations should be run on 4 computing nodes. The solvent leg only required 1 computing node.

Results

The figures below show the correlation between ∆∆G values obtained experimentally and with simulations using GROMACS and GAFF2. These correlation plots show agreement in the values for some analogs. However, 5 out of 11 analogs retrieved results with a discrepancy larger than 1 kcal/mol. The Root Mean Square Error (RMSE) was 1.42 kcal/mol and the Mean Unsigned Error (MUE) was 1.13 kcal/mol.

Results obtained using GROMACS and the GAFF2 force field for ligand parametrization have been compared to previous results with the same software package, using the same pipeline parameters but a different force field, GAFF. From the correlation plots, it is observed that the computed ∆∆G values are highly correlated with those of the previous experiments except in one case (RMSE was 0.31 kcal/mol, MUE was 0.24 kcal/mol and R2 was 0.97 ex-
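The agreement metrics quoted above (RMSE and MUE between computed and experimental ∆∆G) are straightforward to compute. A sketch, using made-up predicted/experimental pairs rather than the project's CK666 analog data:

```python
import math

def rmse(pred, exp):
    """Root Mean Square Error between predicted and experimental values."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, exp)) / len(pred))

def mue(pred, exp):
    """Mean Unsigned Error (mean absolute deviation)."""
    return sum(abs(p - e) for p, e in zip(pred, exp)) / len(pred)

# Made-up ddG values in kcal/mol (illustration only, not the project's data).
predicted    = [-0.8, 1.2, 0.4, -1.5, 2.1]
experimental = [-0.5, 0.9, 1.6, -1.2, 0.8]

print(f"RMSE = {rmse(predicted, experimental):.2f} kcal/mol")
print(f"MUE  = {mue(predicted, experimental):.2f} kcal/mol")
```

RMSE penalises large outliers more strongly than MUE, which is why the two numbers are usually reported together when judging whether predictions fall within the often-quoted 1 kcal/mol target accuracy.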