Editorial

inSiDE is published two times a year by the Gauss Centre for Supercomputing (HLRS, LRZ, JSC)

Publishers: Prof. Dr. A. Bode | Prof. Dr. Dr. Th. Lippert | Prof. Dr.-Ing. Dr. h.c. Dr. h.c. M. M. Resch

Editor: F. Rainer Klank, HLRS, [email protected]

Design: Linda Weinmann, HLRS, [email protected] | Sebastian Zeeden, HLRS, [email protected] | Pia Heusel, HLRS, [email protected]

Welcome to this new issue of inSiDE, the journal on Innovative Supercomputing in Germany published by the Gauss Centre for Supercomputing (GCS). In this issue there is a clear focus on applications. While the race for ever faster systems is accelerated by the challenge to build an Exascale system within a reasonable power budget, the increased sustained performance allows for the solution of ever more complex problems. The range of applications presented is impressive, covering a variety of different fields.

HPC continues to be an important issue in the German research community. The working group of the Wissenschaftsrat (Science Council), which is currently evaluating the existing German funding scheme for HPC and discussing possible improvements, is expected to deliver its report in July. At the same time, a number of software initiatives like the HPC Software Initiative of the Federal Ministry of Education and Research (BMBF) or the SPPEXA initiative of the German Research Foundation (DFG) start to create interesting results, as is shown in our project section.

With respect to hardware, GCS is continuously following its roadmap. New systems will become available at HLRS in 2014 and at LRZ in 2015. System shipment for HLRS is planned to start on August 8, with a chance to start general operation as early as October 2014. We will report on further progress.

GCS has been a strong player in PRACE over the last years. In 2015 it will continue to provide access to leading-edge systems for all European researchers through the evaluation process of PRACE. By the end of 2015 GCS will have delivered CPU time worth 100 million Euro. It is noteworthy that computational costs as offered by GCS are even lower than comparable Cloud offerings. The concept of dedicated centers for research is working well both for the user community and the funding agencies. For PRACE we will give a review on the results of the last call for simulation proposals.

As usual, this issue includes information about events in supercomputing in Germany over the last months and gives an outlook on workshops in the field. Readers are invited to participate in these workshops, which are now part of the PRACE Advanced Training Centres and hence open to all European researchers.

• Prof. Dr. A. Bode (LRZ)
• Prof. Dr. Dr. Th. Lippert (JSC)
• Prof. Dr.-Ing. Dr. h.c. Dr. h.c. M. M. Resch (HLRS)

News

PRACE: Results of the 8th Regular Call

The Partnership for Advanced Computing in Europe (PRACE) is continuously offering supercomputing resources on the highest level (tier-0) to European researchers.

The Gauss Centre for Supercomputing (GCS) is currently dedicating shares of its IBM iDataPlex system SuperMUC in Garching, of its Cray XE6 system Hermit in Stuttgart, and of its IBM Blue Gene/Q system JUQUEEN in Jülich. France, Italy, and Spain are dedicating shares on their systems CURIE, hosted by GENCI at CEA-TGCC in Bruyères-le-Châtel, FERMI, hosted by CINECA in Casalecchio di Reno, and MareNostrum, hosted by BSC in Barcelona.

The 8th call for proposals for computing time for the allocation time period March 4, 2014 to March 3, 2015 on the above systems closed October 15, 2013. Nine research projects have been granted a total of about 170 million compute core hours on Hermit, seven have been granted a total of about 120 million compute core hours on SuperMUC, and five proposals have been granted a total of 100 million compute core hours on JUQUEEN. One research project has been granted resources both on Hermit and SuperMUC.

Four of the newly awarded research projects are from Germany and the UK, each, three are from Italy, two are from France and Sweden, each, and one is from Belgium, Finland, the Netherlands, Spain, and Switzerland, each. The research projects awarded computing time again cover many scientific areas, from Astrophysics to Medicine and Life Sciences. More details, also on the projects granted access to the machines in France, Italy, and Spain, can be found via the PRACE web page www.prace-ri.eu/PRACE-8th-Regular-Call.

The 9th call for proposals for the allocation time period September 2, 2014 to September 1, 2015 closed March 25, 2014; its evaluation is still under way as of this writing. The 10th call for proposals is expected to open in September 2015.

Details on calls can be found on www.prace-ri.eu/Call-Announcements.

• Walter Nadler, Jülich Supercomputing Centre (JSC)

contact: Walter Nadler, [email protected]

Optimizing a Seismic Wave Propagation Code for PetaScale-Simulations

The accurate numerical simulation of seismic wave propagation through a realistic three-dimensional Earth model is key to a better understanding of the Earth's interior and is used for earthquake simulation as well as hydrocarbon exploration. Seismic waves generated by an active (e.g. explosions or air-guns) or passive (e.g. earthquakes) source carry valuable information about the subsurface structure. Geophysicists interpret the recorded waveforms to create a geological model (Fig. 1). Forward simulation of waves supports the survey setup assembly and hypothesis testing to expand our knowledge of complicated wave propagation phenomena. One of the most exciting applications is the full simulation of the frictional sliding during an earthquake and of the excited waves that may cause damage on built infrastructure.

One of the biggest challenges today is to correctly simulate the high frequencies in the wave field spectrum. These high frequencies can be in resonance with buildings or enable a clearer image of the subsurface due to their sensitivity to smaller structures. To match reality and simulated results, high accuracy and fine resolution in combination with geometrical flexibility are the demands on modern numerical schemes. Our software package SeisSol implements a highly optimized Arbitrary high-order DERivatives Discontinuous Galerkin (ADER-DG) method that works on unstructured tetrahedral meshes. The ADER-DG algorithm is high-order accurate in space and time to ensure lowest dispersion errors for simulating waves propagating over large distances.

Figure 1: Waveforms of the 2009 L’Aquila earthquake (yellow star). Synthetic seismograms produced by SeisSol are depicted as red lines and compared to recorded real data (black lines) at 9 stations (red triangles). The seismograms span 800 seconds and are filtered to periods of T ≥ 33 s. Figure obtained from Wenk et al. (2013).


In a collaboration of the research groups on Geophysics at LMU Munich (Dr. Christian Pelties, Dr. Alice-Agnes Gabriel, Stefan Wenk) and HPC at TUM, Munich (Prof. Dr. Michael Bader, Alexander Breuer, Sebastian Rettenberger, Alexander Heinecke), SeisSol has been re-engineered with the explicit goal to realize petascale simulations and to prepare SeisSol for the next generation of supercomputers.

Optimizing SeisSol for PetaScale-Simulations
With this goal in mind, three key performance and scalability issues were addressed in SeisSol:

• Optimized Matrix Kernels: In SeisSol's ADER-DG method, the element-local numerical operations are implemented via matrix multiplications. For these operations, highly optimized, hardware-aware kernels were generated by a custom-made code generator. Depending on the sparsity pattern of the matrices and on the achieved time to solution, sparse or dense kernels were selected for each individual operation. Due to a more efficient exploitation of current CPUs' vector computing features, speedups of 5 and higher were achieved.

• Hybrid MPI+OpenMP Parallelization: SeisSol's parallelization was carefully changed towards a hybrid MPI+OpenMP approach. The resulting reduction of MPI processes had a positive effect on parallel efficiency, especially in setups with huge numbers of compute cores. Even communication and management routines were parallelized, as these turned into performance obstacles on large numbers of cores.

• Parallel Input of Meshes: SeisSol's key strength is the accurate modelling of highly complicated geometries via unstructured adaptive tetrahedral meshes. These are generated by an external mesh generator (SimModeler by Simmetrix). Thus, another crucial step towards petascale simulations was the use of netCDF and MPI I/O to input such large meshes. The new mesh reader can handle meshes with more than a billion grid cells, and scales on all 147k cores of SuperMUC.

1 PetaFlop/s Sustained Performance
In cooperation with the Leibniz Supercomputing Centre, several benchmark runs were executed with SeisSol on SuperMUC. In an Extreme Scaling Period at the end of January 2014, a weak-scaling test with 60,000 grid cells per core was conducted: SeisSol achieved a performance of 1.4 PetaFlop/s computing a problem with nearly 9 billion grid cells and over 10^12 degrees of freedom. However, as the true highlight, a simulation under production conditions was performed. We modelled seismic wave propagation in the topographically complex volcano Mount Merapi (Fig. 2), discretized by a mesh with close to 100 million cells. In 3 hours of computing time the first 5 seconds of this process were simulated. During this simulation SeisSol achieved a sustained performance of 1.08 PetaFlop/s (equivalent to 35% of peak performance, see the performance plot in Fig. 3), thus entering a performance region that usually requires noticeably larger machines.

Figure 2: Tetrahedral mesh and simulated wave field (after 4 seconds simulated time) of the Mount Merapi scenario. To show the wave field in the volcano's interior, the front section is virtually removed.

Figure 3: Strong scaling results of the 100 million cell Mount Merapi simulation. Given is the achieved percentage of peak performance (blue) and the respective fraction of non-zero floating point operations (orange). While reaching almost 50% of peak performance on 1,024 cores, a peak efficiency of 35% is maintained on all 147,456 cores of SuperMUC, reaching 1.08 PetaFlop/s. The observed performance decrease from 16,000 to 32,000 cores is caused by the slower inter-island communication on SuperMUC.

Outlook
Details on the optimization of SeisSol, and probably results from further large-scale simulations, will be presented at this year's International Supercomputing Conference (ISC'14) in Leipzig. New challenges to tackle include effective load balancing for simulations with local time stepping and novel accelerator technologies such as Intel MIC.
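The selection between sparse and dense kernels described above can be illustrated with a minimal sketch. This is not SeisSol's actual code generator; the pure flop-count model, the efficiency factor and the example sizes (a 56x56 operator for an order-6 basis applied to 9 quantities) are assumptions chosen for illustration.

```python
# Minimal, illustrative sketch (Python/NumPy): choose a sparse or dense kernel
# for an element-local operation C = A * B, where A is a constant operator
# matrix with a known sparsity pattern. NOT SeisSol's code generator.
import numpy as np

def pick_kernel(A, n_cols_B, dense_efficiency=4.0):
    """Return 'dense' or 'sparse' from a simple flop-count model.
    dense_efficiency is an assumed speed advantage of vectorized GEMM."""
    m, k = A.shape
    nnz = np.count_nonzero(A)
    dense_flops = 2.0 * m * k * n_cols_B        # full matrix-matrix product
    sparse_flops = 2.0 * nnz * n_cols_B         # touch only stored non-zeros
    return 'sparse' if sparse_flops * dense_efficiency < dense_flops else 'dense'

def apply_operator(A, B, kind):
    if kind == 'dense':
        return A @ B                            # would map to a generated GEMM kernel
    C = np.zeros((A.shape[0], B.shape[1]))      # 'sparse': loop over non-zeros only
    rows, cols = np.nonzero(A)
    for r, c in zip(rows, cols):
        C[r, :] += A[r, c] * B[c, :]
    return C

# Example: artificial sparsity pattern, 56 basis functions, 9 quantities.
rng = np.random.default_rng(0)
A = np.triu(rng.random((56, 56)))
A[A < 0.8] = 0.0
B = rng.random((56, 9))
kind = pick_kernel(A, B.shape[1])
print(kind, np.allclose(apply_operator(A, B, kind), A @ B))
```

In the actual code generator the decision is additionally based on measured time to solution, not only on operation counts.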

Acknowledgements
We especially thank Dr. Martin Käser for pioneering the development of SeisSol and for initiating the collaborative project ASCETE. Furthermore, we thank the entire LRZ team (including our co-developer Dr. Gilbert Brietzke) for their strong support in the execution of the SuperMUC runs, and Mikhail Smelyanskiy (Intel Labs) for his support in hardware-aware optimization. Financial support was provided by DFG, KONWIHR and the Volkswagen Foundation (project ASCETE).

References
[1] Breuer, A., Heinecke, A., Rettenberger, S., Bader, M., Gabriel, A.-A., Pelties, C.: Sustained Petascale Performance of Seismic Simulations with SeisSol on SuperMUC, International Supercomputing Conference ISC'14, accepted.
[2] Breuer, A., Heinecke, A., Bader, M., Pelties, C.: Accelerating SeisSol by Generating Vectorized Code for Sparse Matrix Operators, Parallel Computing - Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing 25, IOS Press, 2014.
[3] Pelties, C., de la Puente, J., Ampuero, J.-P., Brietzke, G.B., Käser, M.: Three-dimensional dynamic rupture simulation with a high-order discontinuous Galerkin method on unstructured tetrahedral meshes, J. Geophys. Res., 117, B02309, 2012, http://dx.doi.org/10.1029/2011JB008857.
[4] Wenk, S., Pelties, C., Igel, H., Käser, M.: Regional wave propagation using the discontinuous Galerkin method, Solid Earth, 4, 43-57, 2013, doi:10.5194/se-4-43-2013.

Links
http://www.ascete.de
http://seissol.geophysik.uni-muenchen.de
http://www.lrz.de

• Michael Bader, TU Munich, Institute of Informatics
• Christian Pelties, Department of Earth and Environmental Sciences, Geophysics, University of Munich

contact: Michael Bader, [email protected]

Applications

Golden Spike Award by the HLRS Steering Committee in 2013

Industrial Turbulence Simulations at Large Scale

Simulations that are done during the industrial design process have to adhere to strict time constraints in the development cycle. Thus, the limited time is often the driving factor limiting the accuracy of such simulations. Yet, there is a strong desire to obtain as detailed and accurate simulations as possible early on, to save later development iterations. Accurate but fast simulation results can be achieved by exploiting HPC systems for these "Design of Experiment" studies.

As an example for a demanding simulation, we have a look at turbulent flows in technical devices with complex geometry. Such use cases are challenging, as the complexity of the setup has to be deployed on parallel and distributed computing resources.

Figure 1: Challenges for industrial test-cases are highlighted. Each box is a complex topic on its own, but for efficient usability everything has to be considered under the aspect of scalability.

While current commercial simulation software often employs simple turbulence models like the Reynolds-Averaged Navier-Stokes equations (RANS) to reduce the simulation time (and meshing overhead), these models are not suitable in the given situations, especially when capturing highly instationary processes. Therefore, a more accurate description of turbulence is needed in these cases. The most accurate description of turbulence would be achieved by a direct numerical simulation (DNS). However, a DNS results in extreme computational loads, as the smallest scales have to be resolved. Even on today's supercomputing systems, DNS of complex test-cases with large domains cannot be completed in a time-frame that would suit an industrial development process. A middle ground is offered by Large Eddy Simulation (LES). LES resolves the large turbulent structures directly and employs a modelling only for those structures which are too small for a viable resolution. Thus, LES reduces the modeling assumptions for the turbulence to a minimum, while enabling the simulation of relevant geometrical objects.

In order to achieve suitable runtimes when using the considerably more expensive LES model instead of RANS, the simulations need to be run on more processes. Yet, commercial solvers currently do not scale very well and therefore are of limited use in large simulations. Often, solvers stop scaling in a range below the average cluster size in the development departments of large industries. Therefore, highly scalable academic solvers are investigated and tuned to tackle industrial use-cases.

Academic codes that are used for both research and education evolve over years. With changing developers over the years and functionality-oriented feature implementation, a versatile solver is created. Optimizations of such codes aim for massively parallel execution on distributed memory systems. With the help of parallel performance tools like Vampir and code-analysis tools like the Valgrind tool suite it is possible to remove performance bottlenecks and non-scaling memory problems from the code. Especially the usage of MPI collectives instead of less efficient isend/irecv implementations improves both scalability and memory demand on Infiniband-based clusters, as illustrated in the sketch below.
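The following minimal mpi4py sketch contrasts the two communication styles for a generic pattern (every rank gathering one value from all others); it is an illustration of the idea, not code taken from the solver discussed here, and the file name in the comment is only an example.

```python
# Gathering one value from every rank: first with point-to-point isend/irecv,
# then with a single collective. The collective needs no per-peer request
# bookkeeping and lets the MPI library use optimized (e.g. tree-based)
# algorithms on the interconnect.
# Run with: mpirun -n 4 python gather_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

local_value = rank * rank  # some per-process quantity

# Variant 1: explicit point-to-point exchange.
send_reqs = [comm.isend(local_value, dest=p, tag=0) for p in range(size) if p != rank]
recv_reqs = [comm.irecv(source=p, tag=0) for p in range(size) if p != rank]
received = [req.wait() for req in recv_reqs]   # collect the incoming values
for req in send_reqs:
    req.wait()
p2p_result = sorted(received + [local_value])

# Variant 2: one collective call expressing the same pattern.
coll_result = sorted(comm.allgather(local_value))

assert p2p_result == coll_result
if rank == 0:
    print("gathered values:", coll_result)
```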

Besides code optimizations it is clear that some parts of the code need to be redesigned to achieve higher levels of distributed parallelism. Especially the interface between the mesh generation tool and the solver imposes problems in general unstructured solvers. Usually the solver reads in proprietary formats from different mesh generation tools such as ICEM or Gambit, and uses a partitioning method like ParMetis for mesh partitioning. However, the process of determining the neighborhood relations between the elements and between the different processes leads to scalability problems early in the execution of the solver.

Figure 2: A discretized porous medium inside a fluid domain. (Solid elements are depicted in blue, elements with boundary conditions are represented in red, fluid elements are white.)


To efficiently handle this mesh initialization step, an approach suggested by Klimach [1] can be adopted. The proprietary mesh format of the mesh generation tools is converted by a pre-processing tool, providing a file format that includes neighborhood information and allows simple distributed reading. In this conversion the mesh is read in and the individual elements are mapped into an order which preserves the physical locality of the elements using a space-filling curve. This allows a suitable partitioning for an arbitrary number of processes by simply cutting the list of elements into non-overlapping chunks without any knowledge of the entire mesh topology. That is, the initialization of the simulation run is fully parallel right from the start.

This file format now offers a very flexible and scalable handling of meshes in the solver itself. Due to the already established neighborhood relationship, each process can locally read in only its own part of the mesh, and at the same time determine the position of the direct neighbors of each local element, i.e. determine the process IDs of the communication partners for each domain without global communication. Reading the mesh with this method becomes orders of magnitude faster in the solver, and unnecessary memory allocations early during the solver execution are avoided.
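The partitioning idea can be sketched in a few lines: elements are sorted by their index along a Morton (Z-order) curve and the sorted list is simply cut into equally sized, contiguous chunks, one per process. This is a generic illustration of the principle, not the actual file format or code of the tool-chain described here.

```python
# Illustrative space-filling-curve partitioning (Morton/Z-order). Sorting by
# the interleaved-bit Morton index keeps spatially close elements close in the
# list, so contiguous chunks of the list form compact partitions.
def morton_index(i, j, k, bits=10):
    """Interleave the bits of (i, j, k) into one Morton index."""
    code = 0
    for b in range(bits):
        code |= ((i >> b) & 1) << (3 * b)
        code |= ((j >> b) & 1) << (3 * b + 1)
        code |= ((k >> b) & 1) << (3 * b + 2)
    return code

def partition(elements, num_procs):
    """Sort elements along the curve and cut the list into num_procs chunks."""
    ordered = sorted(elements, key=lambda e: morton_index(*e))
    chunk = (len(ordered) + num_procs - 1) // num_procs
    return [ordered[p * chunk:(p + 1) * chunk] for p in range(num_procs)]

# Example: a small 8x8x8 block of elements distributed to 4 processes.
elems = [(i, j, k) for i in range(8) for j in range(8) for k in range(8)]
parts = partition(elems, 4)
print([len(p) for p in parts])   # balanced chunk sizes: [128, 128, 128, 128]
```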

Parallel Mesh Generation
The described two-step mesh preparation is an example of how the problems of serial parts in a simulation tool-chain can affect overall running times. The inherently serial nature of e.g. commercial mesh-generation tools poses a two-fold problem when thinking in both distributed memory and parallel I/O. The conversion of the format in this case eases the problem on the solver side; however, the problem still remains in a different form in the pre-processing step. On the one hand, the generation of the actual mesh itself is still serial and might therefore be limited in element count due to memory; on the other hand, the conversion still needs to be performed whenever the geometry is changed or the mesh is refined and adapted. A more suitable solution in face of the growing parallelism is the parallel generation of meshes directly in a suitable format.

For this we develop a parallel mesh generation tool based on an octree mesh representation. This tool, called Seeder [2], provides automated parallel mesh generation in the context of the end-to-end parallel APES [3] simulation suite. It directly generates and stores the mesh data in a format suitable for parallel processing on distributed systems. The main performance measure for mesh generation is the size of the meshes it can produce. Therefore, memory scalability is more important than run-time scalability. The largest mesh that was obtained on Hermit at HLRS consists of 66 billion cubes. A mesh of this size requires a total of 4k processes. To run a simulation on this mesh using the Lattice Boltzmann solver Musubi, the solver needs to run on nearly 100k processes. With a serial mesh generation the creation of such a mesh would not be possible, and therefore the efficient usage of the entire supercomputer would not be possible.

Figure 3: Performance map for the Lattice Boltzmann solver MUSUBI (MLUPS per node over the number of elements per node, for 1 to 1,024 nodes).

Summary
Academic solvers are capable of reducing runtime considerably in product development by the efficient usage of HPC systems. But a complete parallel tool-chain is needed, including (memory-)scalable mesh generation and solver initialization steps, in order to be able to also efficiently use the HPC systems of tomorrow for larger, more precise and/or faster simulations.

References
[1] Klimach, H., Roller, S.: Distributed coupling for multiscale simulations. In P. Ivanyi and B. Topping, editors, Proceedings of the Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering. Civil-Comp Ltd., 2011.
[2] Harlacher, D.F., Hasert, M., Klimach, H., Zimny, S., Roller, S.: Tree based voxelization of STL data. In Michael Resch, Xin Wang, Wolfgang Bez, Erich Focht, Hiroaki Kobayashi, and Sabine Roller, editors, High Performance Computing on Vector Systems 2011, pages 81-92. Springer Berlin Heidelberg, DOI 10.1007/978-3-642-22244-3_6, 2012.
[3] Roller, S., Bernsdorf, J., Klimach, H., Hasert, M., Harlacher, D., Cakircali, M., Zimny, S., Masilamani, K., Didinger, L., Zudrop, J.: An adaptable simulation framework based on a linearized octree. In Michael Resch, Xin Wang, Wolfgang Bez, Erich Focht, Hiroaki Kobayashi, and Sabine Roller, editors, High Performance Computing on Vector Systems 2011, pages 93-105. Springer Berlin Heidelberg, DOI 10.1007/978-3-642-22244-3_7, 2012.

• Daniel F. Harlacher
• Harald Klimach
• Sabine Roller
University of Siegen

contact: Daniel F. Harlacher, [email protected]

Golden Spike Award by the HLRS Steering Committee in 2013

High-Resolution Climate Predictions and Short-Range Forecasts to Improve the Process Understanding and the Representation of Land-Surface Interactions in the WRF Model in Southwest Germany (WRFCLIM)

The application of numerical modelling for climate projections is an important task in scientific research since these projections are the most promising means to gain insight into possible future climate changes. During the last two decades, various regional climate models (RCM) have been developed and applied for simulating the present and future climate of Europe at a grid resolution of 25 to 50 km. It was found that these models were able to reproduce the pattern of temperature distributions reasonably well, but a large variability was found with respect to the simulation of precipitation. The performance of the RCMs was strongly dependent on the quality of the boundary forcing, namely if precipitation was due to large-scale synoptic events. Additionally, summertime precipitation was subject to significant systematic errors, as models with coarse grid resolution have difficulties to simulate convective events. This resulted in deficiencies with respect to simulations of the spatial distribution and the diurnal cycle of precipitation. The 50 km resolution RCMs were hardly capable of simulating the statistics of extreme events such as flash floods.

Due to these reasons, RCMs with a higher grid resolution of 10-20 km were developed and extensively verified. While inaccuracies of the coarse forcing data were still transferred to the results, these simulations indicated a gain from high resolution due to the better resolution of orographic effects. These include an improved simulation of the spatial distribution of precipitation and wet-day frequency and extreme values of precipitation. However, three major systematic errors remained: the windward-lee effect (i.e. too much precipitation at the windward side of mountain ranges and too little precipitation on their lee side), phase errors in the diurnal cycle of precipitation, and precipitation return periods. In order to provide high-resolution ensembles and comparisons of regional climate simulations in the future, the World Climate Research Program initiated the COordinated Regional climate Downscaling EXperiment (CORDEX).

The scope of this WRFCLIM project at HLRS is to investigate in detail the performance of regional climate simulations with WRF in the frame of EURO-CORDEX¹ at 12 km down to simulations at the convection permitting scale, e.g. within the DFG funded Research Unit 1695 "Agricultural Landscapes under Climate Change–Processes and Feedbacks on a Regional Scale"².

Within this context the WRFCLIM objectives are to:

• provide high resolution climatological data to the scientific community

• support the quality assessment and interpretation of the regional climate projections for Europe through the contribution to the climate model ensemble for Europe (EURO-CORDEX) for the next IPCC (Intergovernmental Panel on Climate Change) report.

The main objectives of the convection permitting simulations of WRFCLIM are as follows:

• Replace the convection parameterization by the dynamical simulation of the convection chain to better resolve the processes of the specific location and actual weather situation physically.

• Gain an improved spatial distribution and diurnal cycle of precipitation through better land-surface-atmosphere feedback simulation and a more realistic representation of orography to support the interpretation of the 12 km climate simulations for local applications, e.g. in hydrological and agricultural management.

¹ CORDEX for the European Domain: www.euro-cordex.net
² An objective of this research unit is the development and verification of a convection-permitting regional climate model system based on WRF-NOAH including an advanced representation of land-surface-vegetation-atmosphere feedback processes with emphasis on the water and energy cycling between croplands, atmospheric boundary layer and the free atmosphere.

Figure 1: Domain of WRF for CORDEX Europe with 0.33° (black frame), 0.11° (red frame) and 0.0367° (white frame) resolution.


Special attention will be paid to the land-surface-vegetation-atmosphere feedback processes. Further, the model will be validated with high-resolution case studies applying advanced data assimilation to improve the process understanding over a wide range of temporal scales. This will also address whether the model is able to reasonably represent extreme events. In the following, the current status of regional climate simulations and land-atmosphere feedback in convection permitting simulations in WRFCLIM is reported. Further details are given e.g. in Warrach-Sagi et al. (2013a).

Regional Climate Simulations
A validation run for Europe was performed with the Weather Research and Forecasting (WRF) model (Skamarock et al. 2008) for a 22-year period (1989-2009) on the NEC Nehalem Cluster at HLRS at a grid resolution of approx. 12 km with CORDEX simulation requirements. The WRF model had been applied to Europe on a rotated latitude-longitude grid with a horizontal resolution of 0.11° and with 50 vertical layers up to 20 hPa. The model domain (red frame in Fig. 1) covers the area specified in CORDEX. The most recent reanalysis data ERA-Interim of the European Centre for Medium-Range Weather Forecasts (ECMWF) is available at approx. 0.75°.

These data were used to force WRF at the lateral boundaries of the model domain. WRF was applied, one-way nested, in a double nesting approach on 0.33° (~37 km) (black frame in Fig. 1) and 0.11° (~12 km). In an additional experiment for summer 2007, when the Convective and Orographically-induced Precipitation Study (COPS) (Wulfmeyer et al. 2011) took place, a third domain with 0.0367° (~4 km) resolution was nested into the 0.11° domain (white frame in Fig. 1), and a convection permitting simulation was performed.

WRF is coded in Fortran and was compiled with PGI 9.04 with openMPI libraries. At 0.11° the EURO-CORDEX model domain covered 450*450*54 (in total 10,935,000) grid boxes (Fig. 1, red box) and the model was run with a time step of 60 s from 1989 to 2009. For the simulation 20 nodes (160 cores) were requested; the raw output data was about 80 TB. Within 24 hours on the NEC Nehalem Cluster it was possible to simulate 2.5 months. The whole simulation (including waiting time between restarts) took ~8 months and ~800,000 CPU hours. For the convection permitting simulation experiment, a domain of 800*800*54 (in total 34,560,000) grid boxes of approx. 4 km horizontal resolution (white frame in Fig. 1) was nested in the EURO-CORDEX model domain. A simulation with a model time step of 20 s was run from May 15 to September 1, 2007. The simulation took one month on the NEC Nehalem Cluster. Precipitation and soil moisture results of this simulation were analyzed (Warrach-Sagi et al. 2013b, Greve et al., 2013), and due to the results, in the beginning of 2012 the simulation was repeated with an updated version of WRF on the Cray XE6. WRF was compiled with PGI 11 with openMPI 1.4. This simulation was carried out for the 0.11° domain (red frame in Fig. 1) without nesting to an intermediate grid but with a 30 grid box wide boundary. The simulation of the 480*468*54 (in total 12,130,560) grid boxes with a 60 s model time step was run from 1987 to 2009 on 40 nodes (1,280 cores); within 24 hours, 5 months were simulated. Within 3 months the simulation was completed and approx. 90 TB of raw output data were produced. This simulation is currently under evaluation, e.g. within the EURO-CORDEX ensemble of regional climate models (e.g. Vautard et al., 2013; Kotlarski et al., 2014).
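As a quick cross-check of the setups quoted above, the following small Python sketch recomputes the number of grid boxes per domain and the approximate number of model time steps; the grid dimensions, time steps and periods are taken from the text, everything else is plain arithmetic.

```python
# Recompute grid-box counts and rough time-step counts for the WRF runs
# described above (values taken from the text; simple arithmetic only).
domains = {
    # name: (west-east, south-north, vertical levels, time step [s], simulated years)
    "EURO-CORDEX 0.11 deg (NEC Nehalem)": (450, 450, 54, 60.0, 21),   # 1989-2009
    "EURO-CORDEX 0.11 deg (Cray XE6)":    (480, 468, 54, 60.0, 23),   # 1987-2009
    "convection permitting ~4 km":        (800, 800, 54, 20.0, None), # summer 2007 only
}

SECONDS_PER_YEAR = 365.25 * 24 * 3600

for name, (nx, ny, nz, dt, years) in domains.items():
    boxes = nx * ny * nz
    line = f"{name}: {boxes:,} grid boxes"
    if years is not None:
        steps = years * SECONDS_PER_YEAR / dt
        line += f", ~{steps:.1e} model time steps"
    print(line)
```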
Land-Atmosphere Feedback in Convection Permitting Simulations
When setting up a model for a particular region, the foremost issue is the determination of the most appropriate model configuration. Different regions experience varied conditions, and the optimal setup of a model is spatially and temporally dependent. One of the most important aspects in configuring a model is to select the parameterizations to be used, and therefore sensitivity tests are an unavoidable part of improving the model performance. WRF offers a choice of various physical parameterizations to describe subgrid processes like e.g. cloud formation and radiation transport. Warrach-Sagi et al. (2013b) showed that convection permitting climate simulations with WRF bear the potential for a better localization of precipitation events, namely in orographic terrain, as it is needed by impact modelers like hydrologists and agricultural scientists. To set up the most advanced configuration for convection permitting climate simulations with WRF by the University of Hohenheim, to be carried out on the Cray XE6 at HLRS in the near future, the following study is performed in WRFCLIM: experiments with various boundary layer parameterizations, radiation schemes and two land surface models are performed at 2 km resolution (convection permitting scale) to further downscale the climate data for applications in hydrology and agriculture and to achieve an improved representation of feedback processes between the land surface and the atmosphere. The first sensitivity study was performed with 54 vertical levels and a horizontal grid spacing of 2 km for a domain of 501*501*54 (in total 13,554,054) grid boxes with 1,088 cores on the Cray XE6 for a week in September 2009, when atmospheric observations were available from an experiment.

In a case study on the domain of Germany (Fig. 2), various combinations of 4 boundary layer schemes and 2 land surface models, including different configurations of one of them, were analyzed.

Figure 2: Domain of the second and third case study on parameterization schemes.


The resulting ensemble of 44 simulations was compared against measurements of absolute humidity from the DIAL (Differential Absorption Lidar; lidar: light detection and ranging) of the University of Hohenheim (Behrendt et al., 2009) close to Jülich in Western Germany in September 2009, which were taken during a campaign of the DFG Transregio 32 "Patterns in Soil-Vegetation-Atmosphere-Systems" project.

It is important to be able to accurately simulate the vertical humidity and temperature profiles in the boundary layer for convection, cloud development and precipitation. Both the boundary layer schemes and the land surface models impact the simulated vertical absolute humidity profiles. The impact of the land surface models extends also to higher altitudes, even up to the entrainment zone at the top of the boundary layer (Fig. 3). Results also showed that the WRF model is sensitive to the combination of land surface model and boundary layer scheme. The comparison of model data with the high resolution lidar measurements is a first step in this ongoing research. Currently, more case studies and comparisons with scanning lidar measurements are performed for different weather situations in different seasons. For this, data of field experiments in the frame of different German research projects are collected, i.e., the DFG Transregio 32 and Research Unit 1695, and the BMBF project "High definition clouds and precipitation for advancing climate prediction" HD(CP)2.

Figure 3: Vertical profiles of absolute humidity on September 8, 2009 at 16:00 UTC. Black lines represent the mean of the scanning DIAL measurements within a 2 km sector, blue shades indicate the 1-d variance of the DIAL data within the scan. Colored lines indicate different boundary layer schemes. Solid lines denote the NOAH land surface model, while the NOAH-MP land surface model is represented in dashed lines.

Summary of Preliminary Results
At HLRS, within WRFCLIM two climate simulations with WRF were carried out for 1989-2009 to evaluate the model performance within the frame of EURO-CORDEX. The first simulation (WRF 3.1.0 on the NEC Nehalem Cluster) was evaluated by Warrach-Sagi et al. (2013b) with respect to precipitation and by Greve et al. (2013) with respect to soil moisture. During all seasons, this dynamical downscaling reproduced the observed spatial structure of the precipitation fields. However, a wet bias remained. At 12 km model resolution, WRF showed typical systematic errors of RCMs in orographic terrain such as the windward-lee effect. In the convection permitting resolution (4 km) case study in summer 2007, this error vanished and the spatial precipitation patterns further improved. This result indicates the high value of regional climate simulations on the convection-permitting scale. Soil moisture depends on and affects the energy flux partitioning at the land-atmosphere interface. The latent heat flux is limited by the root zone soil moisture and solar radiation. When compared with the in situ soil moisture observations in Southern France, where a soil moisture network was available during 2 years of the study period, WRF generally reproduces the annual cycle. The spatial patterns and temporal variability of the seasonal mean soil moisture from the WRF simulation correspond well with two reanalysis products, while their absolute values differ significantly, especially at the regional scale. Based on these results the updated version WRF 3.3.1 was applied on the Cray XE6 for EURO-CORDEX. The benefits from downscaling the ERA-Interim reanalysis data with WRF become especially evident in the probability distribution of daily precipitation data. Kotlarski et al. (2014) show that WRF 3.3.1 generally simulates the seasonal precipitation and temperature in central Europe well, namely in Germany, and fits well into the EURO-CORDEX ensemble. The evaluation within the EURO-CORDEX ensemble is still ongoing. The data is now available and used e.g. within the DFG funded Research Unit "Agricultural Landscapes under Climate Change–Processes and Feedbacks on a Regional Scale" for impact studies. However, the weakness of the 12 km resolution concerning the windward-lee effect in precipitation remains. So it is essential to proceed towards convection permitting simulations as in WRFCLIM. The first sensitivity study of WRF with respect to the boundary-layer and land-surface parameterizations and the comparisons with Differential Absorption Lidar data for September 2009 showed significant sensitivity of WRF to the choice of the land surface model and planetary boundary layer parameterizations. The model sensitivity to the land surface model is evident not only in the lower PBL, but it extends up to the PBL entrainment zone and even up to the lower free troposphere. This study is ongoing for more experiments in spring 2013 under other weather conditions to understand the processes and their modelling, and to set up a sophisticated simulation with WRF for Germany on the convection permitting scale also for climate simulations.

Acknowledgements
This work is part of the Project PAK 346/RU 1695 funded by DFG and supported by a grant from the Ministry of Science, Research and Arts of Baden-Württemberg (AZ Zu 33-721.3-2) and the Helmholtz Centre for Environmental Research - UFZ, Leipzig (WESS project). DIAL measurements were performed within the Project Transregio 32, also funded by DFG. Model simulations were carried out at HLRS on the Cray XE6 and the NEC Nehalem Cluster within WRFCLIM and we thank the staff for their support.


References
[1] Behrendt, A., Wulfmeyer, V., Riede, A., Wagner, G., Pal, S., Bauer, H., Radlach, M., Späth, F.: 3-Dimensional observations of atmospheric humidity with a scanning differential absorption lidar. In Richard H. Picard, Klaus Schäfer, Adolfo Comeron et al. (Eds.), Remote Sensing of Clouds and the Atmosphere XIV, SPIE Conference Proceedings Vol. 7475, ISBN: 9780819477804, Art. No. 74750L, DOI:10.1117/12.835143, 2009.
[2] Greve, P., Warrach-Sagi, K., Wulfmeyer, V.: Evaluating Soil Water Content in a WRF-Noah Downscaling Experiment, J. Appl. Meteor. Climatol. 52: 2312-2327, 2013.
[3] Kotlarski, S., Keuler, K., Christensen, O.B., Colette, A., Déqué, M., Gobiet, A., Goergen, K., Jacob, D., Lüthi, D., van Meijgaard, E., Nikulin, G., Schär, C., Teichmann, C., Vautard, R., Warrach-Sagi, K., Wulfmeyer, V.: Regional climate modeling on European scales: A joint standard evaluation of the EURO-CORDEX RCM ensemble. Submitted to Geoscientific Model Development, 2014.
[4] Skamarock, W.C., Klemp, J.B., Dudhia, J., Gill, D.O., Barker, D.M., Duda, M.G., Huang, X.-Y., Wang, W., Powers, J.G.: A description of the Advanced Research WRF version 3. NCAR Tech Note, TN-475+STR, 113pp, 2008.
[5] Vautard, R., Gobiet, A., Jacob, D., Belda, M., Colette, A., Deque, M., Fernandez, J., Garcia-Diez, M., Goergen, K., Guettler, I., Halenka, T., Keuler, K., Kotlarski, S., Nikulin, G., Patarcic, M., Suklitsch, M., Teichmann, C., Warrach-Sagi, K., Wulfmeyer, V., Yiou, P.: The simulation of European heat waves from an ensemble of regional climate models within the EURO-CORDEX project, Climate Dynamics, 10.1007/s00382-013-1714-z, 2013.
[6] Warrach-Sagi, K., Bauer, H.-S., Branch, O., Milovac, J., Schwitalla, T., Wulfmeyer, V.: High-resolution climate predictions and short-range forecasts to improve the process understanding and the representation of land-surface interactions in the WRF model in Southwest Germany (WRFCLIM). In: "High Performance Computing in Science and Engineering 13", Eds: Wolfgang E. Nagel, Dietmar H. Kroener, Michael M. Resch, Springer, 529-542, 2013a.
[7] Warrach-Sagi, K., Schwitalla, T., Wulfmeyer, V., Bauer, H.-S.: Evaluation of a CORDEX-Europe simulation with WRF: precipitation in Germany, Climate Dynamics, DOI 10.1007/s00382-013-1727-7, 2013b.
[8] Wulfmeyer, V., Behrendt, A., Kottmeier, C., Corsmeier, U., Barthlott, C., Craig, G.C., Hagen, M., Althausen, D., Aoshima, F., Arpagaus, M., Bauer, H.-S., Bennett, L., Blyth, A., Brandau, C., Champollion, C., Crewell, S., Dick, G., Di Girolamo, P., Dorninger, M., Dufournet, Y., Eigenmann, R., Engelmann, R., Flamant, C., Foken, T., Gorgas, T., Grzeschik, M., Handwerker, J., Hauck, C., Höller, H., Junkermann, W., Kalthoff, N., Kiemle, C., Klink, S., König, M., Krauss, L., Long, C.N., Madonna, F., Mobbs, S., Neininger, B., Pal, S., Peters, G., Pigeon, G., Richard, E., Rotach, M.W., Russchenberg, H., Schwitalla, T., Smith, V., Steinacker, R., Trentmann, J., Turner, D.D., van Baelen, J., Vogt, S., Volkert, H., Weckwerth, T., Wernli, H., Wieser, A., Wirth, M.: The Convective and Orographically Induced Precipitation Study (COPS): The Scientific Strategy, the Field Phase, and First Highlights. COPS Special Issue of the Q. J. R. Meteorol. Soc. 137, 3-30, DOI:10.1002/qj.752, 2011.

• Kirsten Warrach-Sagi
• Josipa Milovac
• Hans-Stefan Bauer
• Andreas Behrendt
• Thomas Schwitalla
• Florian Späth
• Volker Wulfmeyer
Institute of Physics and Meteorology, University of Hohenheim

contact: Kirsten Warrach-Sagi, [email protected]

Golden Spike Award by the HLRS Steering Committee in 2013

A Highly Scalable Parallel LU Decomposition for the Finite Element Method

Numerical simulations with the Finite Element Method (FEM) play an important role in current science for solving partial differential equations on a domain Ω. The equations are formulated into their equivalent weak formulation, which leads to an infinite dimensional problem. By dividing the domain Ω into subdomains and discretizing the weak formulation by a subspace of piecewise polynomial functions, we finally obtain a linear equation Au = b, where A is a sparse matrix and u the solution. To attain a reasonable approximation of the continuous solution it is essential to use a fine mesh size for the subdomains. This can result in several million or billion unknowns. To obtain a reasonable computing time for solving this equation, this can only be realized on parallel machines. Some applications require parallel direct solvers, since they are more general and more robust than iterative solvers. A direct solver often performs well in cases where many iterative solvers fail, e.g. if the problem is indefinite, unsymmetric, or ill-conditioned. Thus, we solve the equation Au = b with the help of a parallel LU decomposition [1]. For further information about the parallel programming model, see [2].

Figure 1: Domain Ω on 8 processors.
Figure 2: Cells near some interfaces.
Figure 3: Nodal points.

The Idea Behind the Decomposition
Our concept of the parallel LU decomposition for matrices resulting from finite element problems is based on a nested dissection approach, see [1,3].


On P = 2^S processors, the algorithm uses S+1 steps in which sets of processors are combined together consistently and problems between these sets are solved, beginning with one processor in the first step.

Let us consider a domain Ω in 2D, distributed on 8 processors. The domain is discretized into cells c, and every cell is given to one processor. The collection of all cells on one processor defines a local domain Ω^p.

Every cell has some indices (or nodal points) for the finite element method. Those indices are given either on only one processor or on an interconnection between two or more processors, e.g. the crosspoint in Fig. 2 is given on processors 1, 2, 3 and 4. By defining a processor set π for each nodal point (here, e.g. π(i) = {1,2,3,4}) we can combine all indices with a common processor set to one interface. We have in total 21 different interfaces, cf. Table 1. Those elements with only one entry in their processor set define the inner domain of each local domain Ω^p. All other elements define a boundary part of the local domains (cf. Fig. 4).

Table 1: Sorted list of interfaces k with processor set π and cluster t (cf. Figure 4).
step 1:  k=1, π={1}, t=1 | k=2, π={2}, t=2 | k=3, π={3}, t=3 | k=4, π={4}, t=4 | k=5, π={5}, t=5 | k=6, π={6}, t=6 | k=7, π={7}, t=7 | k=8, π={8}, t=8
step 2:  k=9, π={1,2}, t=1 | k=10, π={3,4}, t=2 | k=11, π={5,6}, t=3 | k=12, π={7,8}, t=4
step 3:  k=13, π={1,3}, t=1 | k=14, π={2,4}, t=1 | k=15, π={1,2,3,4}, t=1 | k=16, π={5,7}, t=2 | k=17, π={6,8}, t=2 | k=18, π={5,6,7,8}, t=2
step 4:  k=19, π={3,5}, t=1 | k=20, π={4,6}, t=1 | k=21, π={3,4,5,6}, t=1

The main observation is that all indices in the inner domain on processor p are independent (within the meaning of the finite element method) of the indices in the inner domain of any other processor q ≠ p. We can "eliminate" the inner elements (here: k=1,…,8) on every processor in parallel within an LU decomposition. For the elements on the boundary, we build the current Schur complement, which is distributed on the processors. In Fig. 5 we have an example for the inner (π4) and boundary elements (π10, π14, π15, π20, π21) on processor 4.

Figure 4: Interfaces.
Figure 5: Inner and boundary elements in step 1.

For the next step, we combine each processor with another one to a new processor set, such that we now have four clusters each with two processors. The combined processor set defines the new inner elements (cf. step 2 in Table 1 and Fig. 6), and they also can be eliminated in parallel on those four clusters. Combining these processors leads to a new boundary on which the next Schur complement can be computed. By repeating this procedure we end up with one processor set containing all processors. Thus we only have inner elements in the last step, and this final elimination finishes the LU decomposition.

Figure 6: Inner and boundary elements in step 2.

A second parallelization is realized by using all processors in each processor set. In step s there are 2^(s-1) processors in every cluster we can use to execute the local LU decomposition. In Fig. 7 we have an example for step 3; the distribution defines a block matrix

    ( Z  R )
    ( L  S ),

where all matrices are distributed block-columnwise to the processors. Here, in Z we have the inner elements, which should be eliminated, and in S the boundary elements for the Schur complement. L and R define the connection between those elements. The Schur complement computed in this step is S - L Z^(-1) R, where a block LU decomposition of Z is created. Note that there are no boundary elements in the last step; thus, only a distributed Z matrix is given there.

Figure 7: Distributed matrix on 4 processors in step 3.

To execute the parallel LU decomposition, two basic communication routines are necessary. The first one is the one-to-one communication to fill the new block matrix (consisting of Z, R, L and S) from the two Schur complements of the step before. The second one is the broadcast routine to provide Z and L (after updating within the computation of the Schur complement) to all processors in the processor set.
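A serial NumPy sketch of one elimination step may help to make the block structure concrete: the interior block Z is factorized and eliminated, and the interface unknowns are reduced to the Schur complement S - L Z^(-1) R. The names mirror the blocks above; the parallel distribution, the clustering of processor sets, and the communication routines of the actual solver are not shown.

```python
# Illustrative single-process sketch of one nested-dissection elimination step.
# The matrix is ordered so that interior unknowns (block Z) come first and
# interface unknowns (block S) last; eliminating Z leaves the Schur complement
# S - L Z^{-1} R on the interface.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(42)
n_inner, n_bnd = 6, 3
n = n_inner + n_bnd
A = rng.random((n, n)) + 10.0 * np.eye(n)     # well-conditioned test matrix
b = rng.random(n)

# Block view:  A = [[Z, R],
#                   [L, S]]
Z = A[:n_inner, :n_inner]
R = A[:n_inner, n_inner:]
L = A[n_inner:, :n_inner]
S = A[n_inner:, n_inner:]
b_i, b_b = b[:n_inner], b[n_inner:]

# Eliminate the interior block: factorize Z once, then form the Schur
# complement and the reduced right-hand side on the interface.
Z_lu = lu_factor(Z)                            # block LU decomposition of Z
schur = S - L @ lu_solve(Z_lu, R)              # S - L Z^{-1} R
g = b_b - L @ lu_solve(Z_lu, b_i)

# Solve on the interface first, then back-substitute for the interior.
u_b = np.linalg.solve(schur, g)
u_i = lu_solve(Z_lu, b_i - R @ u_b)

u = np.concatenate([u_i, u_b])
print(np.allclose(A @ u, b))                   # True: same result as a direct solve
```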


The Performance on Model Problems
We demonstrate the realization of the solver on two applications, in 3D (Poisson) and 2D (Stokes), respectively. For the Poisson problem -Δu = f in the unit cube Ω = (0,1)^3 with homogeneous boundary conditions we use trilinear finite elements on a regular hexahedral mesh with mesh width h_l = 2^(-l) on refinement level l ≥ 0. The mesh is distributed to P = 2^S processors by recursive coordinate bisection. We consider the Poisson problem up to refinement level l = 7 and N = 2,146,689 degrees of freedom, and we compute the factorization for P = 1-4,096 processors, cf. Table 2. To decompose over 2 million unknowns in 3D, we need less than one minute on 1,024 processors.

Table 2: Parallel execution time on the Cray XE6 cluster in Stuttgart for the decomposition of the Poisson problem in the unit cube (3D).

            N = 35,937     N = 274,625    N = 2,146,689
P = 1       1:05.58 min.   2:03:25 hrs.   -
P = 4       27.05 sec.     -              -
P = 8       6.89 sec.      -              -
P = 16      1.76 sec.      5:06.06 min.   -
P = 32      0.55 sec.      1:13.94 min.   -
P = 64      0.83 sec.      16.06 sec.     -
P = 128     0.77 sec.      6.87 sec.      -
P = 256     1.11 sec.      5.05 sec.      2:45.30 min.
P = 512     1.02 sec.      4.29 sec.      1:19.11 min.
P = 1024    0.93 sec.      3.73 sec.      49.75 sec.
P = 2048    -              3.44 sec.      49.75 sec.
P = 4096    -              -              49.55 sec.

A second realization in 2D is done for the Stokes problem -Δu + ∇p = 0, div u = 0 on an L-shaped domain with inf-sup stable Taylor-Hood-Serendipity Q2/Q1 elements. This saddle point problem leads to a symmetric, but indefinite matrix A. Since this is a two-dimensional problem with one-dimensional interfaces, we obtain a high performance with the parallel solver, even for several thousand processors. The results are shown in Table 3. Over twenty million unknowns are decomposed in about 24 seconds on 4,096 processors.

Table 3: Parallel execution time on the Cray XE6 cluster in Stuttgart for the decomposition of the Stokes problem (2D).

            N = 80,131   N = 317,955   N = 1,266,961   N = 5,056,515   N = 20,205,571
P = 8       13.65 sec.   -             -               -               -
P = 16      4.25 sec.    53.86 sec.    -               -               -
P = 32      1.31 sec.    17.50 sec.    -               -               -
P = 64      0.66 sec.    5.88 sec.     66.44 sec.      -               -
P = 128     0.43 sec.    2.06 sec.     20.43 sec.      -               -
P = 256     0.64 sec.    1.39 sec.     7.30 sec.       78.91 sec.      -
P = 512     0.73 sec.    1.06 sec.     3.04 sec.       24.96 sec.      -
P = 1024    0.61 sec.    0.89 sec.     2.23 sec.       9.76 sec.       98.87 sec.
P = 2048    -            0.78 sec.     1.99 sec.       6.63 sec.       40.10 sec.
P = 4096    -            -             -               -               23.99 sec.

For further information about the parallel direct solver, its programming scheme and more, see [1].

Acknowledgements
The authors acknowledge the financial support from BMBF grant 01IH08014A within the joint research project ASIL (Advanced Solvers Integrated Library) and the access to the Hermit cluster in Stuttgart.

References
[1] Maurer, D.: Ein hochskalierbarer paralleler direkter Löser für Finite Elemente Diskretisierungen [A highly scalable parallel direct solver for finite element discretizations], PhD thesis, Karlsruhe Institute of Technology, 2013.
[2] Wieners, C.: A geometric data structure for parallel finite elements and the application to multigrid methods with block smoothing, Comput. Vis. Sci., 13, pp. 161-175, 2010.
[3] Maurer, D., Wieners, C.: A parallel block LU decomposition method for distributed finite element matrices, Parallel Comput., 37, pp. 742-758, 2011.

• Daniel Maurer
Karlsruher Institut für Technologie (KIT), Institut für Angewandte und Numerische Mathematik

contact: Daniel Maurer, [email protected]


Plasma Acceleration: from the Laboratory to Astrophysics

The study of novel particle acceleration and radiation generation mechanisms can be important to develop advanced technology for industrial and medical applications, but also to advance our understanding of fundamental scientific questions from sub-atomic to astronomical scales. For instance, particle accelerators are widely used in high-energy physics, and in the generation of x-rays for medical and scientific imaging. Although extremely reliable, conventional particle accelerator technology is based on radio-frequency fields, which can lead to very long and very expensive machines. For instance, the LHC at CERN is several tens of kilometers long and cost several billions of Euro. Thus, investigating new, more compact accelerating technologies can be beneficial for science and applications. Plasmas are interesting for this purpose, because they support nearly arbitrarily large electric fields, and thus lead to a future generation of more compact accelerators. Plasma acceleration experiments, for instance, succeeded in doubling the energy of electron beams accelerated for several kilometers at SLAC in less than a one-meter long plasma.

Most of the matter in the known universe is also plasma. Hence, plasma based particle acceleration and radiation generation mechanisms may also play a key role in some of the astrophysical mysteries that remain unresolved, such as the origin of cosmic rays and gamma ray bursts through collisionless shocks and magnetic field generation and amplification mechanisms. Using computing resources at SuperMUC we have explored several questions with impact on the development of plasma acceleration technology and on improving our understanding of the mechanisms that may lead to particle acceleration and radiation in astrophysics. This report will describe some of the highlights that were achieved during our present allocation in these topics.

Plasma Acceleration in the Laboratory: towards Plasma-based Linear Colliders
As conventional particle acceleration techniques are hitting their technological limits, plasma based acceleration is emerging as a leading technology in future generations of higher energy, compact particle accelerators. Although initially proposed more than 30 years ago [1], the first ground-breaking plasma acceleration experimental results appeared in 2005. Plasma acceleration is presently an active field of research, being pursued by several leading laboratories (e.g. SLAC, DESY, RAL, LOA, LBNL). Electron or positron acceleration in plasma waves is similar to sea wave surfing. Plasma accelerators use an intense laser pulse or particle bunch (boats on water) as driver to excite relativistic plasma waves (sea waves for surfing). Accelerating structures are sustained by plasma electrons and are not affected by physical boundary effects as in conventional accelerators. The plasma accelerator only lasts for the driver transit time through the plasma. The resulting plasma wavelength is only a few microns long (<10 m for sea waves), and supports accelerating electric fields up to 3 orders of magnitude higher than conventional particle accelerators. These accelerating fields can then accelerate electrons or positrons (as surfers in sea waves) to high energy in short distances (<1 m).
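To put the micron-scale plasma wavelength and the extreme accelerating fields into numbers, a short sketch can evaluate the standard cold-plasma expressions ω_p = sqrt(n e²/(ε₀ m_e)), λ_p = 2πc/ω_p and the cold wavebreaking field E₀ = m_e c ω_p / e. The example density of 10^18 cm⁻³ is an assumption chosen for illustration and is not a parameter of the simulations reported here.

```python
# Evaluate the cold-plasma wavelength and the cold (non-relativistic)
# wavebreaking field for a given electron density. The density below is an
# illustrative choice, not a value from the simulations described above.
import math

e     = 1.602176634e-19    # elementary charge [C]
m_e   = 9.1093837015e-31   # electron mass [kg]
eps_0 = 8.8541878128e-12   # vacuum permittivity [F/m]
c     = 2.99792458e8       # speed of light [m/s]

def plasma_scales(n_cm3):
    """Return (plasma frequency [rad/s], plasma wavelength [m], E0 [V/m])."""
    n = n_cm3 * 1.0e6                              # cm^-3 -> m^-3
    omega_p = math.sqrt(n * e**2 / (eps_0 * m_e))  # electron plasma frequency
    lambda_p = 2.0 * math.pi * c / omega_p         # plasma wavelength
    e0 = m_e * c * omega_p / e                     # cold wavebreaking field
    return omega_p, lambda_p, e0

omega_p, lambda_p, e0 = plasma_scales(1.0e18)      # assumed density: 1e18 cm^-3
print(f"lambda_p ~ {lambda_p * 1e6:.1f} micron")   # ~33 micron at this density
print(f"E0       ~ {e0 / 1e9:.0f} GV/m")           # ~96 GV/m, versus ~0.01-0.1 GV/m
                                                   # in conventional RF cavities
```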
One of the challenges associated with the design of a plasma based linear collider is positron acceleration. The most important plasma based acceleration experimental results were performed in strongly nonlinear regimes, enabling to optimize the quality of accelerated electron bunches. It has been recognized, however, that these strongly non-linear regimes are not suitable for positron acceleration. Using SuperMUC we then explored novel configurations for positron acceleration in strongly non-linear regimes.

Figure 1: 3D simulation of a plasma wakefield excited by a doughnut shaped laser.

Figure 2: 3D simulation of a long particle bunch in a plasma leading to stable accelerating fields.


positron acceleration in strongly non-linear regimes. We investigated plasma acceleration driven by narrow particle bunch drivers and by doughnut-shaped Laguerre-Gaussian lasers. Although the work using narrow particle bunch drivers is still in progress, we found that the wakefields driven by Laguerre-Gaussian beams can have good properties for positron acceleration in the strongly non-linear regime. Fig. 1 shows a simulation result with a doughnut-shaped plasma wave. Typical simulations require 4 × 10⁴ (doughnut wakefields) to 2 × 10⁵ (narrow drivers) core-hours, pushing 2.5 × 10⁹ to 1.6 × 10¹¹ particles for more than 4 × 10⁴ time-steps.

We also performed simulations to clarify important physical mechanisms associated with a future plasma-based acceleration experiment at CERN using proton bunches from the Super Proton Synchrotron (SPS) at the Large Hadron Collider (LHC) [2,3]. In this experiment, the proton bunch will drive plasma waves through a beam-plasma instability called the self-modulation instability (SMI). The SMI modulates the bunch density profile radially. A fully self-modulated beam then consists of a train of shorter, uniformly spaced bunches. Each of them excites a plasma wave that grows through the beam, thereby producing acceleration gradients that grow through the beam. A long proton bunch will also be subject to the hosing instability (HI). The HI can lead to beam break-up; hence, HI suppression is required for successful experiments. SuperMUC simulations unraveled a new HI suppression mechanism, establishing the conditions for stable plasma wakefield generation in future experiments at CERN. Fig. 2 illustrates a fully self-modulated particle bunch propagating stably in the plasma, without HI growth. Typical simulations ran for roughly 5 × 10⁴ core-hours, pushing 6.4 × 10⁸ particles for more than 2 × 10⁵ time-steps.

Particle Acceleration in the Universe: mimicking Astrophysical Conditions in the Laboratory
The origin of cosmic rays and gamma-ray bursts is a fundamental mystery for our understanding of the universe. These questions are closely related to particle acceleration in shocks and to strong magnetic field generation. Collisionless shocks have been studied for decades in the context of space physics and astrophysics due to their potential for efficient particle acceleration to energies larger than 10¹⁵ eV. Large-scale magnetic fields are also important for particle acceleration in astrophysics, but they additionally allow non-thermal radiation processes to operate. Exploring these extreme scenarios is complex, since astronomical observations are limited. Recent experiments, however, have started to investigate the onset of collisionless shocks in the laboratory, resorting to high-power laser pulses. There are distinct shock types according to the energy transfer to the plasma. We performed simulations on SuperMUC that enabled us to explore the transition between the different shock types. Our simulations showed that electrostatic shocks, which lead to the formation of strong electric fields, can be used to efficiently accelerate plasma ions or protons (see Fig. 3). Electromagnetic shocks, which lead to the formation of intense magnetic fields, can lead to particle acceleration in astrophysics through Fermi-like acceleration mechanisms. Magnetic field generation also occurs in electromagnetic shocks, due to the Weibel or current filamentation instability (WI or CFI). These instabilities can grow when counter-streaming plasma flows interpenetrate. Magnetic field generation and amplification is important in astrophysics, as it can lead to strong bursts of radiation. Thus, in addition to magnetic field amplification in collisionless shocks, simulations on SuperMUC enabled the discovery of a novel mechanism driving large-scale magnetic fields in shearing counter-streaming plasma flows. Velocity shear flows lead to the development of the Kelvin-Helmholtz instability (KHI). We found that the KHI is an important dissipation mechanism, capable of efficiently transforming plasma kinetic energy into electric and magnetic field energy [4]. Fig. 4 shows the development of the characteristic KHI vortices and the associated magnetic field in conditions relevant for astrophysics. Each simulation took 2 × 10⁴ core-hours, pushing 10¹⁰ particles for more than 10⁶ iterations.

On-going Research / Outlook
SuperMUC enabled important advances in plasma acceleration in the laboratory and in astrophysics, in conditions for which purely analytical models are currently unavailable. We managed to perform very large simulations, which would not have been possible on smaller supercomputers.

References and Links
[1] Tajima, T., Dawson, J.: Phys. Rev. Lett. 43, 267, 1979.
[2] Caldwell, A. et al.: Nat. Physics 5, 363, 2009.
[3] http://awake.web.cern.ch/awake/
[4] Grismayer, T. et al.: Phys. Rev. Lett. 111, 015005, 2013.

• Jorge Vieira
• L.D. Amorim
• Paulo Alves
• Anne Stockem
• Thomas Grismayer
• Luís Silva

GoLP/IPFN, Instituto Superior Técnico, Lisbon, Portugal

contact: Jorge Vieira, [email protected]

Figure 3: 3D simulation result showing the irradiation of a dense hydrogen target by a high intensity laser. Colored spheres represent accelerated protons.
Figure 4: 3D simulation result showing electron density vortices and magnetic field due to the KHI.
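To put the quoted figures into perspective, a back-of-the-envelope estimate for the doughnut-wakefield case reads as follows; this is only an order-of-magnitude illustration derived from the numbers above, not part of the original analysis.

```python
# Order-of-magnitude estimate of the particle-push throughput implied by the
# figures quoted in the article (doughnut-wakefield case): ~2.5e9 particles,
# ~4e4 time-steps, ~4e4 core-hours.  The derived rate is illustrative only.
particles = 2.5e9
time_steps = 4e4
core_hours = 4e4

total_pushes = particles * time_steps          # particle updates in the whole run
core_seconds = core_hours * 3600.0             # convert the budget to core-seconds
pushes_per_core_second = total_pushes / core_seconds

print(f"{pushes_per_core_second:.1e} particle pushes per core-second")
# -> roughly 7e5 pushes per core-second for this case
```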

Towards Simulating Plasma Turbulences in Future Large-Scale Tokamaks

GEM is a 3D MPI-parallelised gyrofluid code used in theoretical plasma physics at the Max Planck Institute for Plasma Physics (IPP) at Garching near Munich, Germany. Recently, IPP and LRZ collaborated within a PRACE Preparatory Access Project to analyse various versions of the code on the HPC systems SuperMUC at LRZ and JUQUEEN at Jülich Supercomputing Centre (JSC) in order to improve the weak scalability of the application [1].

Simulation of Turbulence in Plasmas
The code GEM addresses electromagnetic turbulence in tokamak plasmas [2]. Its main focus is on the edge layer, in which several poorly understood phenomena are observed in experiments. The code is written in Fortran/MPI and is based on the electromagnetic gyrofluid model. It can be used with different geometries depending on the targeted use case. Up to now the code has been run on conventional tokamak cases like the ASDEX-Upgrade [3] at IPP in Garching, with a plasma minor radius r = 0.5 m and a plasma volume V = 13 m³ (see Fig. 1). A typical grid to cover the edge layer of such a conventional tokamak is 64 × 4,096 × 16. To simulate medium-sized tokamaks like the JET tokamak (r = 1.25 m, V = 155 m³) in Culham, UK [4], or a large-scale tokamak like the ITER reactor-plasma experiment (r ≈ 2 m, V ≈ 840 m³) currently under construction in Cadarache, France [5], grid sizes up to 1,024 × 16,384 × 64 are necessary.

Scaling Towards Large-Tokamak Cases
The main goal of the project was to establish weak scalability of the gyrofluid code GEM on larger systems: if the thin-strip, small-tokamak case 64 × 4,096 × 16 can be afforded on 512 cores, the largest planned thick-strip, large-tokamak case 1,024 × 16,384 × 64 should be able to run on 131k cores in the same or similar wall-clock time. Since the main bottleneck of the code is the solver, we focused particularly on its improvement. Various versions of the MPI-parallelised code have been set up for the weak scaling analysis. In most cases the I/O routines were eliminated in order to concentrate solely on the solver features. The dummy version, functioning as a baseline for our measurements, uses no boundary or sum information. The CG version implements a conjugate-gradient method for the solver iteration. Furthermore, two different multigrid versions have been analysed: the MGV version (multigrid with V-cycle scheme) and the MGU version (multigrid with U-cycle scheme), which has been developed within this project.

A comparison of the performance of the various versions on SuperMUC is shown in Fig. 2 (a). Detailed performance analysis using the Scalasca tool [6] revealed that the scalability of the dominating MPI routines determines the scalability of the entire code, as shown in Fig. 2 (b). This is particularly significant for the CG solver due to its heavy usage of the MPI_Allreduce function; here the MGU version shows the best scalability. Finally, Fig. 2 (c) presents the good weak scaling behaviour of the MGU version (excluding the I/O) up to 32,768 cores on SuperMUC (IBM System x iDataPlex) at LRZ and up to 131,072 cores on JUQUEEN (IBM Blue Gene/Q) at Jülich Supercomputing Centre.

Acknowledgements
Our work was financially supported by the PRACE project funded in part by the EU's 7th Framework Programme (FP7/2007-2013) under grant agreement nos. RI-283493 and RI-312763. The results were obtained within the PRACE Preparatory Access Type C Project 2010PA1505 "Scalability of gyrofluid components within a multi-scale framework".

References
[1] Scott, B., Weinberg, V., Hoenen, O., Karmakar, A., Fazendeiro, L.: Scalability of the Plasma Physics Code GEM, PRACE Whitepaper, http://www.prace-project.eu/IMG/pdf/wp125.pdf
[2] Scott, B., et al.: Phys. Plasmas 12 (2005) 102307; Plasma Phys. Contr. Fusion 48 (2006) B277; Plasma Phys. Contr. Fusion 49 (2007) S25; Contrib. Plasma Phys. 50 (2010) 228; IEEE Trans. Plasma Sci. 38 (2010) 2159.
[3] http://www.ipp.mpg.de/16195/asdex
[4] http://www.ccfe.ac.uk/JET.aspx
[5] http://www.iter.org
[6] http://www.scalasca.org

• Volker Weinberg
• Anupam Karmakar

Leibniz Supercomputing Centre (LRZ)

contact: Volker Weinberg, [email protected]

Figure 2: (a) Overall wall-clock time for the various versions of the code on SuperMUC. (b) Time spent in the MPI functions (from the Scalasca analysis). (c) Comparison of the weak scaling of the MGU version of GEM on SuperMUC and on JUQUEEN.
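Why the global reductions of the CG solver dominate at scale can be illustrated with a schematic, stand-alone sketch. The following Python/mpi4py example is not GEM itself (GEM is written in Fortran/MPI) and uses a toy diagonal operator in place of the actual field solve; its point is the communication pattern: two blocking MPI_Allreduce calls per conjugate-gradient iteration.

```python
# Schematic CG loop with global reductions (illustration only, not GEM).
# Run with e.g.:  mpiexec -n 4 python cg_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n_local = 100_000                          # unknowns owned by this rank
diag = np.linspace(1.0, 10.0, n_local)     # local slice of a diagonal SPD operator
b = np.ones(n_local)                       # local part of the right-hand side

def global_dot(u, v):
    # Each call is a blocking MPI_Allreduce, the operation whose cost
    # grows with the number of ranks and limits the CG solver version.
    return comm.allreduce(np.dot(u, v), op=MPI.SUM)

x = np.zeros(n_local)
r = b.copy()                               # residual r = b - A x for x = 0
p = r.copy()
rr = global_dot(r, r)

for it in range(200):
    Ap = diag * p                          # local matrix-vector product
    alpha = rr / global_dot(p, Ap)         # 1st Allreduce per iteration
    x += alpha * p
    r -= alpha * Ap
    rr_new = global_dot(r, r)              # 2nd Allreduce per iteration
    if np.sqrt(rr_new) < 1e-10 * np.sqrt(n_local * comm.size):
        break
    p = r + (rr_new / rr) * p
    rr = rr_new

if comm.rank == 0:
    print("CG finished after", it + 1, "iterations")
```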

Figure 1: Plasma vessel of the ASDEX-Upgrade tokamak at IPP. © IPP, Photo: Volker Rohde


A Flexible, Fault-tolerant Approach for Managing Large Numbers of Independent Tasks on SuperMUC

In many HPC applications, small computational tasks have to be re-iterated a thousand times or more. For example, to create a computer-animated movie, several ten thousand single frames have to be rendered. Likewise, in genetic research, millions of gene sequences need to be aligned. A common aspect is that the applied software usually only scales up to one compute node (16 cores on SuperMUC). The runtime of the individual tasks is short, and researchers often resort to creating batch jobs for each task and submitting them to a scheduler like SLURM or SGE. This is a convenient and efficient solution if the number of users on the system is small. On HPC resources like SuperMUC with thousands of users, however, this procedure is not feasible, since the number of jobs per user is limited (on SuperMUC, a maximum of 3 jobs can be run simultaneously per user). The standard way around this limit are job farming approaches.

In this article we describe a novel approach to job farming. It is based on an independent data base run by the user. This makes it much more flexible and easier to handle than most job farming tools integrated into batch systems, whose configuration cannot be changed by the user. As a by-product, it features several extremely convenient properties like fault tolerance, dynamic resizing of the attached resources, and a mechanism for backfilling, which makes it preferable over conventional job farming approaches.

Orchestrating Thousands of Worker Nodes using Redis – a No-SQL Data Base
The concept of our job farming approach is based on the light-weight data base server redis. This no-SQL data base can handle large numbers of queries by thousands of different users efficiently. In our example this redis server runs on a SuperMUC login node, which can be accessed by the compute nodes as well. The data base server stores and manages the subtasks that will be processed by the compute nodes.

In our illustration example we use the approach for rendering a movie. Table 1 was created at the concept stage of the movie. It contains columns like 'Scene Description' which are not needed as input to render the animation. The important information is the file names as well as the variables 'Start Frame' and 'End Frame'. Together they describe several thousand input parameters and input data for the rendering software Blender (www.blender.org). Rendering one individual frame is done using a python script (e.g. rendering frame no. 22 of the first scene using 32 threads):

lrz_render_frame.py -l 1 -f 22 -t 32

Based on the command line parameters and the description in Table 1, this python file creates the correct system call to render this specific frame in Blender. Running the python script from Fig. 1, one can create similar command strings for each frame that has to be rendered. These strings are sent to the redis server ('redis-cli lpush'). Eventually, the redis data base stores several thousand string keys representing the command lines for all frames.

Now, one may spawn an almost arbitrarily large number of worker processes on SuperMUC in order to perform the rendering of the frames. The only requirement for the worker processes is a connection to the redis server, and their number is only restricted by the total number of frames, i.e. one could use all of SuperMUC at once to render the movie using all 155,656 cores. The simplest way to start the worker processes is a LoadLeveler script invoking poe with the shell script 'startworker.sh' in Fig. 2 as argument. This script invokes a single process which iteratively fetches a single string from the redis server and executes it as a shell command.

Since the software Blender has an integrated OpenMP parallelization, it will efficiently use the 16 physical cores (or 32 logical cores using hyperthreading) available in one compute node of SuperMUC.

Filename              Start Frame   End Frame   Scene Description
01_01_opening.blend   1             1400        Flight over Campus
02_01_node.blend      1             415         One Node of SuperMUC
02_02_cpu.blend       1             240         Zoom into a CPU Core

Table 1: Excerpt from the task description file for a computer animated movie.

Figure 1: Python script for sending command strings for each frame to the redis server.
Figure 2: Bash script 'startworker.sh' for a redis worker process which fetches command strings from the redis server and executes them in an endless loop.
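The script of Fig. 1 is not reproduced in the text; a minimal sketch of the same producer step, written with the redis-py client, could look as follows. The file name tasks.csv, the queue name render_queue, the host name, and the Blender command template are assumptions made for this illustration and are not the original LRZ script.

```python
# Illustrative producer in the spirit of Fig. 1: read the task table and push
# one ready-to-run render command per frame into a redis list.
import csv
import redis

r = redis.Redis(host="login01", port=6379)      # redis server on a login node (hypothetical host)

with open("tasks.csv") as f:                    # columns as in Table 1 (assumed CSV layout)
    for row in csv.DictReader(f):
        for frame in range(int(row["Start Frame"]), int(row["End Frame"]) + 1):
            cmd = f"blender -b {row['Filename']} -f {frame}"   # one shell command per frame
            r.lpush("render_queue", cmd)                        # same effect as 'redis-cli lpush'

print("queued", r.llen("render_queue"), "frame-rendering tasks")
```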


The proposed work flow is probably one of the simplest implementations of a hybrid MPI/OpenMP programming model. Once Blender has finished rendering a frame, its parent process reconnects to the redis server and fetches a new command string. This means that the described work flow has an integrated load balancing mechanism. Since one may spawn thousands of these worker processes, one can easily render a complete movie overnight.

doRedis – Embedding the Redis Workflow in Conventional Loop Syntax
In its pure version, the redis workflow might be considered cumbersome, because several scripts have to be run on different machines and the communication with the redis server has to be performed 'manually'. Within the R programming language (www.r-language.org), an easy-to-use package for the redis workflow has been developed: doRedis. Apart from requiring a redis server running in the background, the workflow is completely embedded in the conventional R syntax. As in the official example for doRedis, we will illustrate the usage by a simple example in which π is approximated.

Consider the circle of radius 1 in Fig. 4, which circumscribes an area equal to π. Now, we draw two random numbers from the interval [-1,1], representing arbitrary x- and y-coordinates. The probability P that the new point is located within the circle equals π/4 (the area of the circle divided by the area of the surrounding square of side length 2). We can approximate this probability P by drawing a large number of new points (x,y) and checking whether they are inside the circle, i.e. whether x² + y² ≤ 1.

In R, we can implement this using the code example in Fig. 5, which is based on a simple function for drawing 10 million observations, sample_10M. Additionally, we have a loop statement repeating the function B times. In the redis version, these B iterations are sent to the local worker processes via the foreach statement of doRedis.

Apart from lines 1 to 3, which set up the connection to the redis server, both versions are almost identical. Line 3 starts local R worker processes. Similar to the previous example, these processes fetch strings from the data base and automatically execute these strings as normal R code. However, the mechanism in doRedis executes these strings in a more elaborate way. For example, it has an integrated implementation of fault tolerance: if a worker does not return a result within a certain time, the subtask will be automatically rescheduled, assuming that a failure has occurred on the worker node originally responsible for the subtask.

Once lines 1 to 3 have been run, the redis environment for parallel processing of tasks is established. Subsequently, one may run further parallel computations like the one in lines 5 to 17. It is also worth noting that one can transform the redis version in (b) into a serial version simply by replacing %dopar% by %do%. In this case, R processes the foreach statement as a normal loop statement, which is a convenient feature if the worker processes are temporarily not available.

Another Use Case: Gene Sequence Alignment
In gene sequence analysis, we have to align millions of sequences to a reference genome. A well-established software for this problem is Blast (http://blast.ncbi.nlm.nih.gov/Blast.cgi). Similar to Blender, it provides a good shared-memory parallelization, but an MPI implementation is missing and not expected to yield strong benefits. It is far more efficient to distribute the millions of alignments over many computing nodes. In comparison to the previous case, the alignment features a data base which has to be copied to each node at the beginning of the computation. Fig. 6 shows a Gantt chart for 2,048 Blast processes. In this case, several Blast instances were run on the SuperMUC compute nodes. The green region at the beginning represents the copying of the reference genome data base to the main memory of each participating SuperMUC node.

Figure 3: Redis-based workflow for rendering a movie on SuperMUC using the software Blender.
Figure 4: Approximating π by Monte Carlo integration.
Figure 5: Code examples in the R computing language to approximate π: (a) serial version and (b) redis version.
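The R code of Fig. 5 is only shown as a figure; a plain Python sketch of the same Monte Carlo estimate (an illustration, not the original doRedis code) reads:

```python
# Monte Carlo estimate of pi, mirroring the logic of the R example in Fig. 5.
import numpy as np

def sample_10m(_):
    """Return the number of hits inside the unit circle out of 10 million draws."""
    x = np.random.uniform(-1.0, 1.0, 10_000_000)
    y = np.random.uniform(-1.0, 1.0, 10_000_000)
    return int(np.count_nonzero(x * x + y * y <= 1.0))

B = 20                                     # number of batches (the loop that doRedis distributes)
hits = sum(sample_10m(b) for b in range(B))
pi_estimate = 4.0 * hits / (B * 10_000_000)
print(pi_estimate)
```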


The data base was copied to RAM-Disk (/dev/shm), i.e. it was used as an in-memory data base. Several Blast processes were run on each node, and these processes could share the in-memory data base. This strategy significantly reduces the amount of RAM per core, although it moderately slows down the computation in comparison to the serial version. Panel (a) in Fig. 6 corresponds to a serial version which performed ten alignments (iterating orange and blue regions).

Panel (b) in Fig. 6 depicts the parallel version using the proposed redis workflow. The green region for copying the data shows that certain nodes receive the data base earlier and directly start their alignment tasks. Subsequently, the ten alignments (orange and blue) on the compute nodes take slightly longer than in the serial version. Nonetheless, using this strategy 20,000 alignments could be performed in roughly the same time in which the serial version only managed to perform 10 alignments.

Discussion and Outlook
In comparison to most other job farming approaches, the redis work flow described here offers several desirable features like flexibility and extensibility.

Typical job farming approaches that are integrated into standard batch systems have no automatic load balancing. In the redis work flow, each worker immediately queries a new task as soon as it has finished processing its previous task. Thus, if tasks with varying runtime exist, compute nodes that finished a task will automatically perform another task. Likewise, if we have tasks with different requirements (e.g. main memory), one can define several job queues on the redis server which will be processed by different workers. The R-connector doRedis also provides a fault-tolerant scheduling mechanism: it will automatically restart a task on another worker node if it either loses the connection to one of the workers or an optional, user-defined timeout occurs.

Last but not least, the attachment of the workers as well as the submission of tasks are completely dynamic. It is even possible to start workers before the data base is filled with tasks; the worker processes will simply remain idle until tasks are sent to the redis server. Additionally, it is possible to add tasks during runtime – one simply has to send additional tasks to the job queue on the redis server. This provides a convenient way of implementing 'backfilling', i.e. assigning new work to resources that currently idle because they are waiting for other processes of the same job to finish. Similarly, one may assign additional resources to a job queue to speed up the work. This feature also enables cloud bursting. For example, a user may outsource the computations to the resources at LRZ when their requirements exceed the capacities of the local resources at the user's institute (e.g. during peak hours). During off-peak hours, the computations can be performed on designated local hardware at the user's institute.

Another interesting extension of the redis work flow is worth mentioning. The presented redis approach is capable of combining very heterogeneous resources with various kinds of architectures. The authors have already conducted several test runs where resources of different university departments, the commercial cloud service Amazon EC2, and LRZ systems jointly performed data mining, with several thousand machine learning algorithms to be run. The tasks were distributed among the different resources by a redis server running on a local PC to which all compute nodes had access. The machine learning algorithms were implemented in R. This allowed us to include diverse hardware architectures in the work flow, establishing a fault-tolerant, dynamic, hybrid cloud.

• Christoph Bernau
• Ferdinand Jamitzky
• Helmut Satzger

Leibniz Supercomputing Centre (LRZ)

contact: Christoph Bernau, [email protected]

Figure 7: Implementation of a hybrid cloud using the redis approach.

Figure 6: Gantt chart for gene sequence alignment using Blast (a) serially and (b) the redis-based parallel version.
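A worker along the lines of Fig. 2, extended with the RAM-disk staging used for the Blast runs, can be sketched as follows. The queue name, file paths, host name, and blocking timeout are assumptions made for this illustration; the original worker is the bash script of Fig. 2.

```python
# Illustrative worker process: stage the reference data once per node in
# /dev/shm, then fetch and execute command strings from the redis queue.
import os
import shutil
import subprocess
import redis

r = redis.Redis(host="login01", port=6379)       # hypothetical redis host

DB_SRC = "/project/refgenome.db"                 # shared reference database (hypothetical path)
DB_SHM = "/dev/shm/refgenome.db"                 # node-local in-memory copy

# Copy the reference data once per node so that all local worker
# processes can share the in-memory data base.
if not os.path.exists(DB_SHM):
    shutil.copy(DB_SRC, DB_SHM)

while True:
    task = r.brpop("align_queue", timeout=60)    # block until a command string arrives
    if task is None:                             # queue drained: stop this worker
        break
    _, cmd = task
    subprocess.run(cmd.decode(), shell=True)     # execute the fetched command
```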


SHAKE-IT: Evaluation of Seismic Shaking in Northern Italy

Ground shaking due to an earthquake not only depends on the energy radiated at the source but also on propagation effects and on amplification due to the response of geological structures. A further step in the assessment of seismic hazard, beyond the evaluation of the earthquake generation potential, therefore requires a detailed knowledge of the local Earth structure and of its effects on the seismic wave field. The simulation of seismic wave propagation in heterogeneous crustal structures is thus an important tool for the evaluation of earthquake-generated ground shaking in specific regions, and for estimates of seismic hazard. Current-generation numerical codes and powerful HPC infrastructures now allow for realistic simulations in complex 3D geologic structures. We apply such methodology to the Po Plain in Northern Italy, a region with relatively rare earthquakes but with a large property and industrial exposure, as became clear during the recent events of May 20-29, 2012. The region is characterized by a deep sedimentary basin: the 3D description of the spatial structure of the sedimentary layers is very important, because they are responsible for significant local effects that may substantially amplify the amplitude of ground motion. Our goal has been to produce estimates of expected ground shaking in Northern Italy through detailed deterministic simulations of ground motion due to possible earthquakes.

Results
We started the work with the implementation of a 3D description of the subsurface geological units of the sedimentary Po Plain basin, plus the neighbouring Alps and Northern Apennines mountain chains. Once the spatial characters of the structure had been set by merging detailed information from the scientific literature and databases, the elastic parameters and density of the different units had to be adjusted. Our strategy consisted of defining an a priori plausible space of model parameters, and exploring it following a metaheuristic procedure based on Latin hypercube sampling. This strategy was chosen to direct the trial-and-error procedure, as it provides a more uniform coverage of the subspace of interest than random sampling. Each model was tested for its ability to reproduce observed seismic waveforms for recent events; the best models were those providing the best fit.

We have extensively simulated and independently compared the performance of three different models of the Earth's crust – ranging from a simple 1D model, to a 3D large-scale reference model, to our new "optimal" high-resolution 3D model – to gauge their ability to fit observed seismic waveforms for a set of 20 recent earthquakes that occurred in the region. We observe a strong improvement of the fit to the observed data by the wave field computed in our new detailed 3D model of the plain. Specifically, the new model has proven able to reproduce the long so-called 'coda' of the signals – late-arriving energy due to reverberations and scattering. A 1D model, as expected, cannot account for the waveform complexity due to the effect of the sediments, while the high-resolution 3D model does a very good job in fitting the envelope of the seismograms. For the local earthquakes for which high-quality seismograms were available, the new 3D model reproduces well the peak ground acceleration at periods longer than 2 s, and the overall duration of the shaking – two parameters of highest relevance for engineering purposes, as they are of great consequence for modelling the response of high-rise buildings and soil liquefaction effects. The new high-resolution model therefore shows significant improvement with respect to both the 1D and the large-scale 3D models. Understandably, the differences are most evident for propagation paths crossing the sedimentary basin.

Having verified that we now have a model (and a reliable computational framework) able to reproduce the gross characters of seismically-induced ground motion, we can put them to work to infer the ground shaking that would be caused by seismic sources deemed plausible for the region. Seismic hazard assessment is built on the knowledge of past activity derived from historical catalogues and geological evidence. We can legitimately reason that some event of the past will repeat itself with very similar characters in the future, and the numerical framework (that we have verified on recent events) permits us to reconstruct the ground shaking that will ensue.
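The Latin hypercube sampling used here, and again below for the suites of source parameters, can be written down in a few lines. The following sketch uses placeholder parameter names and ranges, not the values of the study.

```python
# Minimal Latin hypercube sampler over named parameter ranges (an illustration
# of the sampling strategy described in the text).
import numpy as np

def latin_hypercube(ranges, n_samples, seed=0):
    """Draw n_samples points so that every 1D range is covered with one point per stratum."""
    rng = np.random.default_rng(seed)
    names = list(ranges)
    samples = np.empty((n_samples, len(names)))
    for j, (lo, hi) in enumerate(ranges[name] for name in names):
        # one random point inside each of n_samples equal strata, then shuffled
        strata = (np.arange(n_samples) + rng.random(n_samples)) / n_samples
        rng.shuffle(strata)                     # decouple the strata between parameters
        samples[:, j] = lo + strata * (hi - lo)
    return names, samples

# Placeholder ranges, not the values used in the study:
ranges = {"depth_km": (2.0, 12.0), "strike_deg": (90.0, 130.0), "dip_deg": (30.0, 60.0)}
names, suite = latin_hypercube(ranges, 200)     # e.g. 200 source instances per earthquake
print(names, suite.shape)
```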

Figure 1: Mesh of the crust and the mantle, coloured as a function of topography. The mesh honours the topography of the free surface and the crust/mantle boundary.


We have therefore considered two specific cases in detail. One refers to the earthquake that struck the city of Ferrara in 1570, an event very similar to the one that occurred in May 2012; the other to a series of historical events, with similar characters, located along the same linear geological structure roughly comprised between the cities of Modena and Parma. The uncertainty in the knowledge of the source parameters – quite relevant in the case of knowledge deriving from historical catalogues – has been explicitly considered by generating suites of source parameters spanning the uncertainty ranges of hypocentral coordinates and geometrical parameters. For each earthquake, we represent the uncertainty regions by 200 instances generated through Latin hypercube sampling, and analyse the statistical properties of the resulting ensemble of generated wave fields. We may thus address the minimum, maximum, or median of the suite of peak ground accelerations, for all the geologically acceptable source models, in all possible locations. We can also explore the dependence of shaking parameters on source parameters, such as hypocentral depth or fault plane orientation. Different geological settings (i.e. hard rock, vs. consolidated sediments, or water-saturated sediments) in fact show a quite different response.

On-going Research / Outlook
This study builds the basis for a physics-based approach to seismic hazard estimates in Northern Italy. We will need, on one side, to increase the resolution of the geological model in some critical locations, to slightly increase the highest significant frequency. We are currently limited to relatively low frequencies (f < 0.5 Hz) and more efforts are needed to go to higher frequency. A more detailed description of the geological structure may permit an increase to f ~ 1 Hz. Stochastic synthesis is required to reach even higher frequencies – f > 1 Hz, of engineering interest for low-rise residential buildings – so a hybrid deterministic-stochastic approach must be used.

References and Links
[1] Molinari, I., Morelli, A.: Geophys. J. Int., 185, 352-364, doi:10.1111/j.1365-246X.2011.04940.x, 2011.
[2] Morelli, A., Molinari, I., Basini, P., Danecek, P.: Abstract S32B-02 presented at 2012 Fall Meeting, AGU, San Francisco, Calif. (USA), 3-7 Dec., 2012.
[3] Molinari, I., Morelli, A., Basini, P., Berbellini, A.: Abstract S32B-03 presented at 2013 Fall Meeting, AGU, San Francisco, Calif. (USA), 9-13 Dec., 2013.
[4] Tondi, R., Cavazzoni, C., Danecek, P., Morelli, A.: Computers & Geosciences, 48, 143-156, DOI: 10.1016/j.cageo.2012.05.026, 2012.
[5] Gualtieri, L., Serretti, P., Morelli, A.: Geochemistry, Geophysics, Geosystems, DOI: 10.1002/2013GC004988, 2013.

contact: Andrea Morelli, [email protected]

• Andrea Morelli 1
• Peter Danecek 1
• Irene Molinari 1
• Andrea Berbellini 1
• Paride Lagovini 1
• Piero Basini 2

1 Istituto Nazionale di Geofisica e Vulcanologia, Italy
2 Toronto University, Toronto, Canada

Figure 2: Snapshots at 30s and 70s of the simulation of the Mw=5.2 June 21, 2013, earthquake. The effect of the basin structure is clearly apparent when the wave front passes the ESE-WNW-trending boundary with the mountain front.


Quantum Photophysics and Photochemistry of Biosystems

The interaction of molecules with light is central to the vital activity of living organisms and human beings. Photosynthesis, vision in vertebrates, solar energy harvesting and conversion, and light sensing are remarkably efficient processes, and much focus has been on elucidating the role played by the protein environment in their primary events, which occur on timescales down to sub-picoseconds. Our project deals with computationally demanding simulations of the excited-state evolution of photoactive proteins and their light-absorbing molecular units using highly correlated multi-reference methods of Quantum Chemistry (QC) and their efficient parallel implementation. Such large-scale calculations combined with high accuracy have been enabled through the PRACE infrastructure SuperMUC at LRZ and the highly tuned and optimized design of the Firefly package [1].

Results
High-level QC methods are required for describing photo-initiated quantum molecular dynamics that involves multiple electronic states. It is challenging to describe the fast and efficient electronic de-excitation occurring through so-called conical intersections, where the topography around an intersection seam of two degenerate electronic states influences the transition between them. Systems with quasi-degeneracy require multi-reference approaches. The new XMCQDPT2 method [2] is a unique and invariant approach to multi-state multi-reference perturbation theories (PT), allowing an accurate description of large and complex systems, in particular near points of avoided crossings and conical intersections. The use of high-performance, high-core-count clusters becomes a mandatory requirement for predictive modeling of such systems.

Unlike well-established approaches in classical molecular dynamics, QC methods deserve special attention when implementing them on modern computer architectures. Huge data sets of 4-indexed two-electron (2-e) integrals are generated and stored on disk after their evaluation and intermediate sorting in a two-pass transformation. These data are subsequently used in the computationally intensive parts of the code, where most of the computational effort is due to the summation of the individual terms of the PT series. Thus, the QC algorithms inevitably contain different stages which are I/O intensive (e.g., the integral transformation) and computationally intensive (e.g., the direct summation of the PT series). The I/O performance is, therefore, of utmost importance. Besides, the complexity of the theoretical formulae requires careful tuning of the computationally intensive parts of the code. A straightforward implementation of the summation of the PT series is very inefficient, since at least one or more slow divide operations are required to calculate each individual term of the PT series. Moreover, the summation runs over a large amount of data, involving combinations of transformed 2-e integrals, thus making such an algorithm not processor-cache-friendly.

The efficient implementation of the XMCQDPT2 theory within the Firefly package is based on a family of cache-friendly algorithms developed for the direct summation of the PT series [3], as well as on a so-called resolvent-fitting approach [4], which uses a table-driven interpolation for the resolvent operator, thus further reducing the computational costs.

Executed in its recently developed extreme parallel (XP) mode of operation, Firefly is scalable up to thousands of cores. The XP mode includes a three-level parallelization (MPI process → groups of weakly coupled MPI communicators with parallelization over MPI inside each group → optional multithreading within each instance of a parallel process) and efficient asynchronous disk I/O with real-time data compression/decompression (throughput of up to 5-7 GByte/s).

The Firefly package makes it possible to perform efficient large-scale XMCQDPT2 calculations for systems with up to 4,000 basis functions and with large active spaces comprising several million configuration state functions. Protein and solution environments may also be taken into account using multi-scale approaches. The overall size of the model systems may comprise 5,000-10,000 atoms.

Figure 1: When taken out of the Green Fluorescent Protein, its anionic light-absorbing molecular unit shows a remarkable fundamental interplay between electronic and nuclear excited-state decay channels in the ultrafast photo-initiated dynamics occurring on a pile of potential energy surfaces.
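The idea behind a table-driven evaluation of the resolvent can be illustrated generically; the sketch below is not Firefly's implementation, and the energy range and table size are arbitrary. Energy denominators of the form 1/(E0 − E) are read from a precomputed table by interpolation instead of being evaluated with one divide per PT term.

```python
# Generic illustration of table-driven ("resolvent-fitting") evaluation of
# energy denominators 1/(E0 - E); not Firefly's actual implementation.
import numpy as np

E0 = 0.0
e_min, e_max, n_knots = -50.0, -1.0, 4096        # assumed range of denominator energies
grid = np.linspace(e_min, e_max, n_knots)
table = 1.0 / (E0 - grid)                        # precomputed resolvent values

def resolvent_interp(E):
    """Linear interpolation of 1/(E0 - E) from the table (vectorised, no per-term divides)."""
    return np.interp(E, grid, table)

# Compare against the direct evaluation for a batch of denominators.
E = np.random.uniform(e_min, e_max, 1_000_000)
direct = 1.0 / (E0 - E)
approx = resolvent_interp(E)
print("max relative error:", np.max(np.abs(approx - direct) / np.abs(direct)))
```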


The project had a total allocation of 3,885,000 core-hours of CPU time, which has been used entirely. The average number of cores used by a single run is 1,024, and up to 2,944, with a duration of at least 24 hours. A distributed high-capacity, high-bandwidth file system with up to 100 TBytes for temporary storage during a single run is required. The IBM GPFS file system of the LRZ's SuperMUC has been found to stand up best to the high load caused by the most I/O-intensive parts of the XMCQDPT2 code. The direct benefits from tuning and optimizing the Firefly code have been reaped during the project: the most I/O-intensive parts of the code have been redesigned targeting particularly large XMCQDPT2 jobs.

The following scientific and technical milestones have been achieved within the project:

• By using excited-state electronic structure calculations of the isolated anionic Green Fluorescent Protein (GFP) chromophore, we have been able to disclose a striking interplay between electronic and nuclear dynamics that is coupled with a remarkable efficiency of the ultrafast excited-state decay channels of this chromophore (Fig. 1). The energy exchange is found to be fast and, importantly, mode-specific [5]. This finding lays the groundwork for mode-selective photochemistry in biosystems.

• By simulating the photo-initiated early-time nuclear dynamics of the GFP and KFP proteins, their absorption spectral profiles have been retrieved, which appear to be remarkably similar to those of the isolated chromophores [5]. Here, high-frequency stretching modes play an important role in defining the spectral widths. Remarkably, they also serve as reaction coordinates in photo-induced electron transfer, which competes with internal conversion mediated by twisting. Many GFP-related proteins prohibit internal conversion, thus possibly directing the decay towards electron transfer, whereas fluorescence may be regarded as a side channel only enabled in the absence of relevant electron acceptors. This finding adds to our understanding of the biological evolution and function of light-sensitive proteins.

• We have identified higher electronically excited states of the isolated GFP chromophore anion in the previously unexplored UV region down to 210 nm [6]. By forming a dense manifold, these molecular resonances are found to serve as a doorway for very efficient electron detachment in the gas phase. Being resonant with the quasi-continuum of a solvated electron, this electronic band in the protein might play a major role in GFP photophysics, where resonant GFP photoionization in solution triggers the protein photoconversion with UV light.

• Computationally, the largest run ever performed using a highly correlated multi-reference electronic structure theory has been executed in the XP mode using 2,944 cores with a sustained performance close to 60% of the theoretical aggregate peak performance, ca. 40 Teraflop/s. The calculations are aimed at understanding the wavelength tuning mechanism in the retinal-containing visual proteins, where a single 11-cis PSB retinal chromophore is responsible for entire color vision (Fig. 2).

On-going Research / Outlook
Our results indicate the existence of efficient electron-nuclear coupling mechanisms in the photoresponse of biological chromophores. These non-adiabatic mechanisms are found to be dual and, importantly, mode-specific. Numerous ways may be outlined by which the proteins use this electron-to-nuclei coupling to guide the photochemistry and photophysics that lie behind their functioning. In future projects, we plan to unravel the details of such non-adiabatic mechanisms, thus enhancing our understanding of the efficiency of light-triggered processes in nature. Our findings will ultimately lay the groundwork for mode-selective photochemistry and photophysics in biological systems.

We also expect direct benefits from further tuning and optimizing the Firefly code, aiming at efficient large-scale excited-state and non-adiabatic calculations. This will expand the range of biologically relevant systems that are feasible for high-level ab initio calculations, thus further pushing the limits of theory.

Acknowledgements
This work was supported by the RFBR (grant No. 14-03-00887). The results have been achieved using the PRACE-2IP project (FP7 RI-283493) resources LRZ's SuperMUC, HLRS's Laki, and RZG's MPG Hydra, based in Germany.

References and Links
[1] Granovsky, A.A.: Firefly, http://classic.chem.msu.su/gran/firefly/index.html
[2] Granovsky, A.A.: J. Chem. Phys. 134, 214113, 2011.
[3] Granovsky, A.A.: http://classic.chem.msu.su/gran/gamess/qdpt2.pdf
[4] Granovsky, A.A.: http://classic.chem.msu.su/gran/gamess/table_qdpt2.pdf
[5] Bochenkova, A.V., Andersen, L.H.: Faraday Discuss. 163, 297-319, 2013.
[6] Bochenkova, A.V., Klaerke, B., Rahbek, D.B., Rajput, J., Toker, Y., Andersen, L.H.: Phys. Rev. Lett., submitted, 2014.

• Anastasia V. Bochenkova 1
• Alexander A. Granovsky 2

1 Department of Physics and Astronomy, Aarhus University, Denmark
2 Firefly project, Moscow, Russia

contact: Anastasia Bochenkova, [email protected]

Figure 2: (Left) Dim-light visual pigment Rhodopsin, with the retinal chromophore inside (depicted in blue). (Right) The 11-cis PSB retinal chromophore covalently bound to the protein inside its binding pocket.


NIC Excellence Project 2013

Ab Initio Geochemistry of the Deep Earth

Most of what we know today about the formation and evolution of the Earth is derived from a combination of geophysical observations, geochemical information extracted from minerals or rocks, and laboratory experiments. An essential prerequisite for reconstructing the history of our planet is knowledge of the thermodynamics and kinetics of geological processes over a wide range of relevant time and length scales. Chemical elements and their isotopes are redistributed most effectively at high temperatures and in the presence of silicate melts or aqueous fluids. Chemical transport in minerals and rocks is usually much slower. In some cases, signatures of the time and the thermodynamic state of their formation can be stored over billions of years.

While melts and fluids play a very important role in geological processes, the investigation of their molecular structure and their physical and thermodynamic properties remains challenging, especially under the extreme conditions of pressure and temperature prevalent in the Earth's interior. With the rising power of supercomputers, first-principles simulations based on quantum mechanics and density-functional theory [1] have become an increasingly powerful tool to complement experimental approaches in studying mineral-fluid or mineral-melt interactions. Such simulations are not only needed to interpret e.g. the complex fingerprint information obtained by various types of spectroscopy, but they also provide a unique link between chemical interaction, structure, and physical properties of the respective phases.

Molecular Structure of Melts and Fluids at High Pressures and Temperatures
A first step towards a better understanding of the role of melts and fluids in geological processes from a molecular-scale perspective is the development of realistic structure models for these disordered phases at relevant conditions. For this purpose, state-of-the-art experiments using methods such as x-ray diffraction, x-ray absorption, infrared absorption, or Raman spectroscopy need to be performed in situ to account for the continuous structural changes with variation of pressure, temperature, and/or chemical composition. However, there is no unique structure solution from the experimental data alone. Various molecular modeling approaches have been developed over the last decades to address this problem. But only with the feasibility of performing first-principles molecular dynamics simulations on supercomputers do we now have a method available to make realistic structure predictions for chemically complex systems at extreme conditions.

The change in the molecular structure of a fluid with temperature is illustrated in Fig. 1. Ab initio molecular dynamics simulations of an H2O-SiO2 fluid at temperatures of 3,000 K and 2,400 K and pressures of about 4 GPa show that the number of free H2O molecules in the fluid increases with decreasing temperature and pressure [2], which eventually leads to a phase separation into a silica-bearing aqueous fluid and a hydrous silicate melt. The hydrolysis reaction, which transforms bridging Si-O-Si bonds into non-bridging Si-OH groups, can still be observed directly on the timescale of picoseconds at these high temperatures.

Figure 1: Structural evolution of an H2O-SiO2 fluid during isochoric cooling from 3,000 K (left) to 2,400 K (right) [2]. The increasing number of H2O molecules (shown as balls and sticks) may be interpreted as a precursor of fluid-melt phase separation, which is also an important process in explosive volcanism.
Figure 2: Vibrational mode spectra (asymmetric stretching modes of different Cr species) of chloridic Cr(III) complexes in aqueous solution from first-principles molecular dynamics simulations, compared to experimental Raman spectra at T = 873 K and P ~ 1 GPa [5]. Also shown are two snapshots from the simulations.


At temperatures below about 2,000 K, the rate of formation or breaking of strong Si-O bonds becomes too low, and acceleration methods such as constrained dynamics or metadynamics need to be employed to sample the equilibrium state. In systems with weaker bonds, such as aqueous solutions containing mono- or divalent cations, the lowest temperature for direct observation of changes in the cation complexation decreases to below 1,000 K, which then covers relevant conditions of hydrothermal or subduction zone environments.

Depending on the objectives of a specific study, we may be interested in the global structure of a melt or fluid as probed e.g. by diffraction methods, or in the atomic environment of a specific element that is present in the system at very low concentration. In the latter case, a site-sensitive spectroscopic method would provide a reference for the 'real system'. The most direct assessment of the predicted structure model is made by comparing experimental and computed diffraction patterns or vibrational and electronic excitation spectra. For this purpose we use the molecular dynamics trajectories in conjunction with various methods of theoretical spectroscopy [3]. Furthermore, our simulations support the interpretation of experimental spectra. For instance, application of a mode-projection technique [4] allows us to relate bands in experimental Raman spectra of high-pressure / high-temperature fluids to vibrations of individual cation complexes studied in the simulation (Fig. 2) [5].

Equilibrium Stable Isotope Fractionation between Minerals and Fluids
Stable isotopes are widely used as geochemical tracers, which includes applications in deep Earth geochemistry as well as in paleoclimatology. The field of isotope geochemistry has been expanding greatly since modern analytical methods allow measuring variations in isotope concentrations in the sub-per-mil range. From a thermodynamic point of view, isotope fractionation between different phases is caused by small differences in the Gibbs energy due to the mass dependence of the normal mode vibrational frequencies. It is known that the heavy isotope preferentially fractionates into the more strongly bonded environment. Using first-principles methods we are now able to make quantitative predictions of equilibrium stable isotope fractionation between different phases including minerals, melts, and fluids. While density-functional perturbation theory and other quantum chemical methods have been employed by a number of groups to compute fractionation factors for (simple) crystals and molecules, the explicit treatment of a fluid or melt phase remained almost unexplored. Major challenges are (1) the need for a reliable structure model of the fluid at relevant pressures and temperatures and (2) an efficient scheme to compute the fractionation factor from a large number of molecular dynamics configurations representing a complex disordered material.

Within our NIC project, we have developed an approximate method that predicts mineral-fluid isotope fractionation with accuracy similar to experiment [6]. As described above, the structure of the fluid is sampled by first-principles molecular dynamics simulation. The fractionation factor, which is expressed as a reduced partition function ratio, is then derived considering the local nature of the fractionation process. A conventional full normal mode analysis would require a structural relaxation of all atoms in the simulation cell followed by computing harmonic frequencies. This is not only computationally extremely expensive but it also changes the structure of the fluid. It turns out that at high temperatures it is sufficient to relax only the isotopic atom and its nearest neighbors, and then to compute the force constants acting on the isotopic atom in the three Cartesian directions (Bigeleisen and Mayer approximation). If these force constants are transformed into pseudo-frequencies (generally these are not normal modes) and the latter are inserted into the reduced partition functions, the obtained fractionation factors are very similar to those obtained from a full normal mode analysis (Fig. 3). Having demonstrated the accuracy of our new method for Li and B isotopes, we are now confident that we can make reasonable predictions for other important light isotopes such as Si and shed new light on fractionation processes, e.g. during the Earth's core formation.

Figure 3: Comparison of 11B/10B isotope fractionation factors, 1000(β-1) in ‰, as a function of temperature T (K) for the two molecules H3BO3 and H4BO4-, computed using the different methods described in the text (full normal mode analysis, Bigeleisen and Mayer, pseudo-frequency method; modified from [6]).
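For orientation, the reduced partition function ratio entering this procedure can be written in the standard harmonic Bigeleisen-Mayer form; the expression below is the textbook formula, quoted here for reference rather than taken from [6].

```latex
% Reduced partition function ratio (beta-factor) in the harmonic
% Bigeleisen-Mayer form, with u_i = \hbar\omega_i / k_B T and primes
% denoting the heavier isotopologue.  The mineral-fluid fractionation
% factor follows from the beta-factors of the two phases.
\begin{equation}
  \beta \;=\; \prod_i
      \frac{u_i'}{u_i}\,
      \frac{e^{-u_i'/2}}{e^{-u_i/2}}\,
      \frac{1 - e^{-u_i}}{1 - e^{-u_i'}},
  \qquad u_i = \frac{\hbar\omega_i}{k_B T},
  \qquad 1000\,\ln\alpha_{A\text{-}B} = 1000\,\ln\beta_A - 1000\,\ln\beta_B .
\end{equation}
```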

Towards Predictive Modeling of Element Partitioning and Mineral Solubility
In a next step, we have started to address the problem of element redistribution during fluid-rock or melt-rock interaction. This is an important field in geochemistry, ranging from the use of trace elements as geochemical tracers to element mobilization in hydrothermal fluids during ore-forming processes. First-principles prediction of element partitioning between different phases is conceptually more complex than isotope fractionation. For computing Gibbs energy differences, the changes in the chemical interactions upon element substitution need to be accounted for. Furthermore, Gibbs energies cannot be derived directly from a single molecular dynamics trajectory. One promising approach is to use the alchemical change method of thermodynamic integration, as illustrated in Fig. 4, which we recently applied in a study of the melt-dependence of the trace element partitioning of Y3+ between silicate minerals and melts [7]. In a number of molecular dynamics simulations a single particle in the simulation cell is transmuted from one element to another (here from Y3+ to Al3+) by changing the interaction potential V through variation of the integration parameter λ in steps from 0 to 1. After performing this transmutation in the two phases of interest, the exchange coefficient of the two elements between the two phases can be derived. While in our pilot study [7] we used classical interaction potentials, we are currently exploring the possibility to predict realistic element partition coefficients from first-principles simulations. This challenging goal is computationally very demanding and can only be achieved by using supercomputer facilities.

Acknowledgements
This work has been funded by the Deutsche Forschungsgemeinschaft within the Emmy Noether programme (grant no. JA1469/4-1). Simulations were performed on the supercomputers JUQUEEN and JUROPA at JSC thanks to computing time allocated within NIC project HPO15.

References
[1] Marx, D., Hutter, J.: Ab initio molecular dynamics: Basic theory and advanced methods. Cambridge University Press, 2009.
[2] Spiekermann, G.: PhD thesis, FU Berlin, 2012.
[3] Jahn, S., Kowalski, P.M.: Rev. Mineral. Geochem. 78, 691-743, 2014.
[4] Spiekermann, G., Steele-MacInnis, M., Schmidt, C., Jahn, S.: J. Chem. Phys. 136, 154501, 2012.
[5] Watenphul, A., Schmidt, C., Jahn, S.: Geochim. Cosmochim. Acta 126, 212-227, 2014.
[6] Kowalski, P.M., Wunder, B., Jahn, S.: Geochim. Cosmochim. Acta 101, 285-301, 2013.
[7] Haigis, V., Salanne, M., Simon, S., Wilke, M., Jahn, S.: Chem. Geol. 346, 14-21, 2013.

• Sandro Jahn
• Daniela Künzel
• Johannes Wagner

Deutsches GeoForschungsZentrum GFZ, Potsdam

contact: Sandro Jahn, [email protected]

Figure 4: Computing the change in Gibbs energy ∆G = ∫₀¹ ⟨∂V/∂λ⟩_λ dλ = ∫₀¹ ⟨V_Al − V_Y⟩_λ dλ for the transmutation Y3+ → Al3+ in each phase using the alchemical change method. After adding the two partial transmutation reactions, the trace element (here Y3+) partition coefficient between two phases can be extracted, assuming a partition coefficient close to unity for the major element (Al3+).

NIC Excellence Project 2012

Nonlinear Response of Single Particles in Glassforming Fluids to External Forces

In "active micro-rheology" the response of a tagged particle in a fluid to an external force ƒ that acts only on this particle is measured [1]. This new experimental method can be applied to colloidal and biological systems, where it can be realized, e.g., by using a combination of magnetic tweezers and confocal microscopy. Particularly interesting is the use of active micro-rheology to study glass-forming fluids, since these systems show a strong non-linear response even for very small external forces [2], and understanding the glassy freezing of such systems is still one of the grand challenge problems of condensed-matter physics [3].

In order to provide an understanding of how such experiments have to be properly analyzed and interpreted, a faithful model of this approach has been implemented via large-scale computer simulations on the JUROPA supercomputer of the NIC/JSC. As a model glass-former, a binary mixture of colloidal particles that interact with screened Coulomb potentials (so-called "Yukawa potentials") has been used. In the Non-equilibrium Molecular Dynamics (NEMD) simulation, a single particle is pulled with a constant external force of strength ƒ that acts in the x-direction. The simulation box contains altogether 1,600 particles; periodic boundary conditions are applied in all directions to make the system representative of a macroscopic colloidal dispersion. The dynamics of this interacting many-particle system is simulated by numerically solving Newton's equations of motion, as usual; however, it is essential to use a so-called "dissipative particle dynamics" (DPD) thermostat to keep the chosen temperature constant, since the extra heat put into the system via the additional friction that the pulled particle creates needs to be removed. The DPD thermostat is constructed such that it is compatible with the proper hydrodynamic equations and their underlying conservation laws.

Figure 1: Snapshot of typical trajectories of a pulled A-particle in the binary AB-mixture (containing 800 particles of type A and 800 of type B) for a temperature T = 0.14, slightly above the glass transition temperature, and a force ƒ = 1.5 (the force is measured in units of k_B T/d_AA, where k_B is Boltzmann's constant and d_AA the diameter of A particles; temperature is measured in units of the energy scale ε_AA of the Yukawa potential between A-particles, V_αβ(r) = ε_αβ (d_αβ/r) exp[-6(r/d_αβ - 1)], with α, β = A, B). The trajectories are drawn such that they all start at the same initial point.
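The pair potential quoted in the caption of Fig. 1 is easily evaluated directly; the short sketch below does so for the A-A interaction with illustrative unit parameters (the numerical values are placeholders, not simulation parameters).

```python
# Screened Coulomb ("Yukawa") pair potential from the caption of Fig. 1:
# V_ab(r) = eps_ab * (d_ab / r) * exp[-6 (r / d_ab - 1)].
import numpy as np

def yukawa(r, eps=1.0, d=1.0):
    """Pair potential V(r) = eps * (d/r) * exp(-6 * (r/d - 1))."""
    return eps * (d / r) * np.exp(-6.0 * (r / d - 1.0))

r = np.linspace(0.8, 3.0, 5)
print(np.round(yukawa(r), 4))   # the interaction decays rapidly beyond one particle diameter
```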


When one follows the trajectories of the pulled particles in order to study the time evolution of their mean square displacements (note that the latter are anisotropic: directions parallel and perpendicular to the pulling force are not equivalent), one encounters huge sample-to-sample fluctuations from run to run, and therefore huge computational resources need to be invested to obtain physically meaningful results (by sampling thousands of trajectories over a significant range of time). However, these huge fluctuations are one of the central subjects of the study, since they reflect the so-called "dynamical heterogeneity" [3] of the supercooled fluid (Fig. 1). Dynamical heterogeneity means that the fluid has a structure like an irregular mosaic, where in different "stones" of the mosaic the local mobility of the particles may differ from the mobility in a neighboring region by orders of magnitude! Such differences in the mobility of glass-forming fluids can already be caused by tiny density fluctuations. The trajectories of the pulled particles then exhibit a pearl-necklace type appearance (Fig. 1): when the particle is "caught" in a cage formed by its neighbors, a cooperative rearrangement of a rather large surrounding region may be necessary until the particle can "escape from the cage". These motions in the cage are reflected in a pearl-like structure of the trajectory. Conversely, when the particle passes a region of higher mobility, the force can pull it through such a region rather fast, causing an almost linear, string-like piece of the trajectory.

When one follows each trajectory long enough, one can clearly identify a steady-state velocity v and, via ξ = ƒ/v, a friction coefficient ξ of the pulled particle. This friction coefficient is found to be a strongly nonlinear function of the force ƒ at low temperatures (Fig. 2a). Particularly illuminating is this behavior when one describes it in terms of a so-called "Peclet number" Pe*, see Fig. 2b. One finds that at low enough temperatures (T < 0.18) and intermediate forces outside the linear response regime the data can be scaled such that they all fall on a master curve (Fig. 2b, inset). This implies that a "force-temperature superposition principle" holds [4], similar to the well-known "time-temperature superposition principle" [3] for relaxation functions of glass-forming systems in thermal equilibrium.

A very interesting behavior is also detected when one studies the time-dependent mean square displacements (MSDs) of the pulled particles (Fig. 3). For ƒ = 0, there are three regimes: at small times, there is a ballistic regime (MSD ∝ t²), when the particle's motion is not yet constrained by its neighbors. The "cage effect" (i.e., a particle is "caught" by a cage formed by its neighbors) shows up as almost horizontal plateaus in the MSD vs. t curves. When the particle is released from the cage (which happens by cooperative rearrangement of the particles in its neighborhood), ordinary diffusive motion (MSD ∝ D_eq t) sets in. When a force ƒ is applied, we see that the "lifetime" of the cage is reduced (the more so the stronger the force), and afterwards a super-diffusive motion in the direction along the force sets in (while in the perpendicular directions ordinary diffusive behavior still occurs). The anomalous behavior of the MSDs is also reflected in the distribution functions of the displacements, the so-called van Hove correlation functions (see [5,6] for details). Also theoretical concepts (such as the extension of the so-called "mode coupling theory" to the nonlinear response of glass-forming fluids) can be stringently tested [6]. Thus the studies carried out in this research project yield important ingredients towards the theoretical understanding of dynamical heterogeneity in glass-forming fluids.

Acknowledgements
This research was supported by the Deutsche Forschungsgemeinschaft (DFG), Projects No. SFB TR6/A5 and No. FOR 1394/P8, and by a generous grant of computer time at the JUROPA supercomputer of the Jülich Supercomputing Centre (JSC) provided via the John von Neumann Institute for Computing (NIC).

References
[1] Squires, T. M., Mason, T. G.: Annu. Rev. Fluid Mech. 42, 413, 2010.
[2] Habdas, P., Schaar, D., Levitt, A. C., Weeks, E. R.: Europhys. Lett. 67, 477, 2004.
[3] Binder, K., Kob, W.: Glassy Materials and Disordered Solids: An Introduction to Their Statistical Mechanics, Rev. Ed., World Scientific, Singapore, 2011.
[4] Winter, D., Horbach, J., Virnau, P., Binder, K.: Phys. Rev. Lett. 108, 028303, 2012.
[5] Winter, D., Horbach, J.: J. Chem. Phys. 138, 12A512, 2013.
[6] Harrer, C., Winter, D., Horbach, J., Fuchs, M., Voigtmann, T.: J. Phys.: Condens. Matter 24, 464105, 2012.

• Kurt Binder 1
• Jürgen Horbach 2
• Peter Virnau 1
• David Winter 1

1 Institut für Physik, Johannes Gutenberg-Universität Mainz
2 Institut für Theoretische Physik II, Heinrich Heine Universität Düsseldorf

contact: Jürgen Horbach, [email protected]

Figure 2: a) Friction coefficient ξ for A particles and different temperatures T as a function of the force ƒ. b) Peclet number Pe* = ƒ d_AA/(ξ D_eq) plotted vs. the scaled force ƒ_scal = ƒ d_AA/(k_B T). The solid line indicates the linear response behavior, for which Pe* = ƒ_scal holds (D_eq is the self-diffusion coefficient of the A particles in equilibrium). The inset shows Pe* as a function of the ratio ƒ_scal / ƒ_scal^(Pe*=50). From Ref. [4].
Figure 3: Mean-squared displacement (MSD) of A particles in the direction of the force for different values of its strength ƒ, as indicated, for T = 0.14. The straight lines show exponents for ordinary diffusion (∝ t, observed for ƒ = 0) and for super-diffusive behavior (∝ t^1.4). The inset shows a comparison between MSDs for the parallel and perpendicular direction for ƒ = 1.5. From Ref. [4].

50 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 51 Applications Applications

QCD has two distinctive properties: the QCD action. The extremely large Is there Room for a Bound asymptotic freedom at high energies, number of variables in these simulations Dibaryon State in Nature? – and confinement at low energies. makes it impossible to use conventional Asymptotic freedom corresponds to a computing resources. At the same An ab Initio Calculation of small coupling constant in the theory, time, lattice QCD simulations are well and in this regime perturbation theory suited for massively parallel computers, Nuclear Matter gives a highly precise description of as they involve local interactions and the physical phenomena. Nuclear matter, require the inversion of large sparse however, is a low-energy phenomenon matrices. Lattice QCD is thus well suited and a different approach must be used for high performance computing (HPC). to explore hadronization, i.e. the forma- tion of bound states and matter from To set up a lattice QCD calculation, individual quarks. Lattice QCD is a first- spacetime has to be discretized on a principles approach that has exactly hypercubic lattice. This induces system- this capability. atic effects that need to be considered. These are the finiteness of the intro- Lattice QCD duced lattice spacing between the space- In lattice QCD simulations, the properties time points at which measurements of baryons and nuclei can be calculated are made, and the finite extent of the by studying correlation functions of lattice. Effects due to the breaking of local operators. These expectation values symmetry (e.g. rotational) down to a Figure 1: A schematic of a dibaryon (left) and two loosely bound baryons (right). All nuclear matter are linked to functional integrals of subgroup introduced by the hypercubic is made up of quarks and gluons. They are governed by the laws of QCD. QCD that are evaluated stochastically lattice must also be considered. More- using Monte Carlo methods. In addition, over there is the added complication What are the fundamental properties (di)baryons in terms of correlation func- every 'measurement' must be per- that the simulations become increasingly of nuclear matter? How does it behave tions that determine the properties of formed on large ensembles of field expensive as the physical limit is ap- under extreme conditions? – These a particle state, such as its mass and configurations generated by a Markov proached, specifically when tuning the are some questions that ab initio calcu- whether it is a loosely bound system process and distributed according to light quarks to their physical values. To lations of hadrons and nuclear matter of individual particles or a tightly bound using numerical lattice techniques are state (Fig. 1). The interpolating operators trying to answer. Focussing on one, involved in correlation functions are we ask: Is there room for a bound di- related to the particle's wavefunction, baryon state in nature? which is mostly determined by the in- dividual (valence) quarks it consists of. The dibaryon (a 6-quark state) was pre- This is the realm of QCD and our goal dicted over 30 years ago in a model is the numerical ab initio lattice QCD calculation [1]. So far, however, its de- calculation of dibaryon properties. tection has eluded experimental physi- cists, and theoretical studies remain in- Quantum Chromodynamics conclusive. It is therefore interesting to (QCD) take the step from the well determined At the core of all nuclear matter lies model-independent numerical calculations QCD. 
It is the field-theoretical description of lattice Quantum Chromodynamics of the fundamental strong force, which (QCD) for single baryons (3-quark states, determines the interactions between

such as the proton) to dibaryons – thus quarks and gluons, i.e. the constituents Figure 2: In lattice QCD, several calculations are performed at different unphysical quark masses paving the way to ever more compli- of hadrons. (colored points). In this way, the systematic effects can be taken into account and results can be safely cated nuclei. We study the spectrum of extrapolated to the point of the experimentally measured quark masses. The black point shows the experimentally determined point.

52 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 53 Applications Applications

handle these effects, multiple series of Dibaryons i.e. if the H-dibaryon mass is less than The number of combinations of quark measurements are usually made at Hadrons are composite particles, con- twice the single Λ-baryon mass. propagators required for a (di)baryon several unphysical quark masses, several sisting of quarks bound by the strong calculation is equal to the factorials of different lattice spacings and lattice force. There are 6 quarks in total. Numerics their respective quark content

volumes. Through closely monitoring However we simulate only the 3 lightest Our preliminary findings indicate that, in N = N u!Nd!Ns!, which is 2 for the proton the lattice QCD results, a safe and reli- quarks: the up (u), down (d) and strange order to resolve the mass difference (uud), 8 for the H-dibaryon (udsuds), able extrapolation to the physical limit (s) quarks (in ascending order of between two individual Λ-baryons and and 36 for a deuteron (uududd). For may be performed. In the case of the mass). There are two families of had- a true dibaryon state, a statistical other nuclei of mass number A ≥ 2 this single proton, this has been performed rons that have been observed in nature. precision of < 1% for the masses of the rapidly increases. Consequently, the extensively so that a reliable extrapola- The first family is that of quark-anti- individual and dibaryon states is re- H-dibaryon system is much more com- tion to the physical point with controlled quark states called mesons, with the quired. This means that an extremely plex and demanding than the single systematics can be made (Fig. 2). How- pion being its most prominent member. large number of individual lattice QCD proton. To improve the efficiency of the ever, in the much more complicated The second is that of 3-quark states measurements are required, of order calculation, we employ two techniques: case of nuclear matter, i.e. nuclei of referred to as baryons, of which the 30,000 (corresponding to approxi- a 'blocking' algorithm and a 'truncated' mass number A ≥ 2, this type of calcula- stable proton (quark content uud), the mately 3.8 Mcore-hrs on JUQUEEN for solver. tion is still in its infancy. As a conse- long lived neutron (udd, with a lifetime our chosen parameters). quence, the following initial calculation of ~15 minutes) and Λ-baryon (uds) are Blocking Algorithm is on a lattice volume of 64 × 323 (time members. In addition, the dibaryon is To set up a calculation of the H-dibaryon, To tackle the large number of contrac- × space3) points, for a pion mass of a proposed 6-quark state (Fig 1). The a QCD interpolating operator that tions, it is convenient to use the so- 450 MeV and a spatial extent of 2 fm. H-dibaryon is a state with strangeness describes it must be used. This is called 'blocking' algorithm [3] to handle A larger and more physical ensemble quantum number –2 and quark content achieved by forming a combination of and simplify the potentially large num- 3 of (mπ,=320 MeV, 96 × 48 , and 3 fm) udsuds. This state was predicted in the quark fields so that the resulting in- ber of combinations of quark propaga- will shortly be analyzed on JUQUEEN. 1970s, and an indication that such a terpolating operator has the correct tors (Fig. 4). This involves a pre-con- state exists would be that it is energeti- quantum numbers to be related to the traction of the quark propagators at cally favored over two distinct baryons, H-dibaryon wavefunction. These quark the source of the correlation function, fields then have to be connected ('con- so that the pre-contracted 'block' can tracted') to form a measurable corre- be re-used to efficiently combine the lation function. Numerically, any single quark propagators at the sink. This re- lattice QCD measurement of a hadron moves the need for a new source con- mass thereby consists of three main traction for each combination. 
parts: the calculation of quark propa- gators (matrix inversions); techniques Truncated Solver used to obtain a good overlap with the Quark propagators are calculated state of interest, which requires some through the inversion of a large sparse tuning of the operator (Fig. 3); and the matrix using an iterative solving proce- contraction of the quark propagators dure. The truncated solver method [4] to form a hadronic correlation function. utilizes a less precise stopping condi- For a baryon, the first two are the tion (tolerance of 10-3 compared to the most expensive. However, due to the usual 10-10) for the inversion. There larger number of combinations of quark is of course an associated bias with propagator contractions that enter the respect to a high precision solve. How- dibaryon correlation function, stem- ever this can be corrected for, which ming from the additional 3 quarks, all allows many more measurements to three steps are equally significant for a be made in a given time compared to dibaryon measurement, thus making the usual high-precision method and it more computationally demanding. results in at least a 20 % gain in statisti- Figure 3: An optimized spatial profile of quark sources used in our calculation of the H-dibaryon state [2]. cal precision.

54 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 55 Applications Applications

The truncated solver and the aforemen- a determination of the ground state. A tioned blocking algorithm are two very more sophisticated method is required new and modern technical develop- for a study of the excited states. For in- ments that are key to making this cal- stance, we employ the method of solving culation possible. With these improve- a generalized eigenvalue problem (GEVP). ments, the calculation may be achieved The choice of interpolating operator using existing HPC resources in a used to measure the H-dibaryon is not sensible time-frame and has become unique, and one is free to choose any feasible for lattice ensembles large wavefunction that gives the correct enough to model the dibaryon without quantum numbers. Depending on how significant distortion due to the finite- well a given operator describes the ness of the lattice spacing and with lat- actual physical state of the H-dibaryon, tice parameters approaching those the correlation function will have a of the physical situation. better or worse overlap with the true H-dibaryon wavefunction. Solving the Results eigenvalue problem requires a set of We show results gathered on a total of such differently overlapping operators ~16,000 measurements [5] obtained and provides the mass eigenvalues

in 1.9 Mcore-hrs of computing time on of the operator matrix, which allows Figure 5: The GEVP of three interpolating operators gives the first three mass states of the H-di- JUQUEEN. The mass of the H-dibaryon the corresponding state's mass to baryon particle spectrum. If the lowest mass (ground state) lies below the threshold of two individual was determined using the exponential be retrieved. Our analysis uses a set of Λ particles (blue band), it is the energetically favored state [5]. decay of the correlation function. For 4 trial wavefunctions to extract the large time separations the ground state ground and first excited states of the dominates the correlation function H-dibaryon mass spectrum. The pre- its infancy and requires top-tier HPC share of the supercomputer JUQUEEN and excited states are exponentially liminary results in Fig. 5 show that with resources for its study. Our first analysis at Jülich Supercomputing Centre (JSC). suppressed. The ground state mass the present statistical accuracy of ~ 4 %, shows a tendency of the H-dibaryon This work was supported by the DFG is most easily computed from taking the H-dibaryon mass is consistent with to be unbound. However, the statistical via SFB 1044 and grant HA 4470/3-1. the logarithmic derivative of the cor- that of two individual particles. How- error needs to be substantially re- • Hartmut Wittig relation function to determine an 'ef- ever for a truly conclusive result the duced. Additionally, the calculation has References • Thomas D. Rae fective' mass. The ground state mass statistical error on the dibaryon data to be repeated on a series of lattice en- [1] Jaffe, R.L. • Anthony Francis Phys. Rev. Lett. 38 (1977) 195. can then be determined by performing must be significantly reduced. sembles with ever more physical, and a fit in the plateau region of the effec- hence expensive, parameters. Even at [2] von Hippel, G.M., Jäger, B., Rae, T.D., Johannes Gutenberg tive mass (seen in Fig. 5) where it is Conclusions this early stage the results do however Wittig, H. Universität Mainz JHEP 1309 (2013) 014. independent of the time separation. The study of nuclear matter from ab give insights into the nature of light nu- However, this method is only suited for initio lattice QCD calculations is still in clei, specifically the H-dibaryon [6], that [3] Doi, T., Endres, M.G. are beyond the reach of other non-ab Comput. Phys. Commun. 184 (2013) 117.

initio approaches. With the continued [4] Blum, T., Izubuchi, T., Shintani, E. investment of HPC resources on plat- Phys. Rev. D88 (2013) 094503.

forms such as JUQUEEN we will be able [5] Francis, A., Miao, C., Rae, T.D., Wittig, H. to answer the question: is there room PoS (LATTICE 2013) 440, arXiv:1311.3933, for a bound dibaryon state in nature? 2013.

[6] Walker-Loud, A. Acknowledgements PoS (LATTICE 2013) 013, arXiv:1401.8259, The authors gratefully acknowledge the 2013. Gauss Centre for Supercomputing Figure 4: The dibaryon is a much more complex system than the single baryon and efficient algo- rithms have to be used to handle the large number of quark contractions [5]. There are 8 terms in (GCS) for providing computing time for contact: Hartmut Wittig, total for the udsuds dibaryon. a GCS Large Scale Project on the GCS [email protected]

56 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 57 Projects Projects

the selected setup can be found like CoolEmAll–Energy Efficient airflow recirculation or increased tem- and Sustainable Data Centre peratures beyond limits. By repeating the simulation process with different Design and Operation setups an optimized configuration can be found where the energy consump- tion is minimized. Although IT hardware developers in- identified, first as major issue, the com- crease the energy efficiency of new position of hardware from compute The Data centre Efficiency Building systems, the global overall energy nodes up to the room and infrastructure Blocks (DEBBs) are introduced to allow consumption of IT infrastructures rises like cooling and power provisioning for the specification of separate blocks or year by year driven by the dramatically which we used a modular approach (Data sections of a data centre independently increasing usage of IT services cover- centre Efficiency Building Blocks, DEBBs) (i.e. node, enclosures, cooling devices, ing all areas in our dailies life. The need allowing to specifying parts of the storage, …), since specifying the whole

of new strategies to reduce or at least whole systems independently and then data centre in a monolithic way leads Figure 2: DEBB composition minimize energy consumption is obvi- combing them. Secondly the specifica- to very complex structures and in- ous. The European Commission funded tion of the cooling mechanisms and creased effort to maintain the specifi- behavior (like cooling, power provisioning, project CoolEmAll has taken a holistic their efficiencies is part of the base for cation. DEBBs can also be created by centralized infrastructure, etc.) or a approach to the complex problem of the Simulation, Visualization and sup- the manufacturer of each product itself more or less homogeneous part of the how to make data centres more energy port toolkit (SVD Toolkit). The last piece and then combined by the data centre system (i.e. one cluster, one often used and resource efficient IT infrastruc- is the description of the software run- designer or provider. DEBB are used to Rack, etc. with all necessary descrip- tures by developing a range of tools to ning on the systems containing detailed decouple independent section of a sys- tions and details . As shown in Fig. 2 enable data centre designers, operators, information about application charac- tem. this includes Geometry information (3D suppliers and researchers to plan teristics and the workload distribution. model) describing in detail the physical and operate facilities more efficiently. Based on this the SVD Toolkit can start Each DEBB represents a section of the structure used as input for the CFD the simulation process beginning with system with either a certain functional simulations and for visualization, power The main concept for the project as load distribution, calculating energy shown in Fig. 1 includes the different consumptions for all systems as input aspects of data centre provisioning. for the airflow and heat simulation and Three different fields of input data are visualize the results and problems with

Figure 1: CoolEmAll main concept Figure 3: SVD Toolkit architecture

58 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 59 Projects Projects

profiles describing the power consump- and sub sections and refers to their utilization and optimization directly by room layout, hardware selection and tion for any part depending on different corresponding component description, reducing the provided hardware. For cooling setup to reduce the overall con- operational states and loads, metrics geometrical data or profiles. This modu- CoolEmAll the reduction of the overall sumed energy. describing which metrics are available lar approach also allows to specify energy consumption if far more impor- for the specified device, component different section on different levels of tant, so the energy values are used With the SVD Toolkit developed within descriptions similar to fact sheet with detail, depending on the need for the for the second simulation step, the CoolEmAll data centre’s energy con- common parameters for all devices following simulation, so a well known airflow simulations. Within the project sumption can be optimized by combining and the hierarchy which represents a system which is in the focus of the simu- two types of simulations were used different strategies. Selecting other specific instance of the system, a lation can be specified more detailed covering different issues. The first was applications or hardware is often not whole data centre or only a node depend- than surrounding systems which are the simulation of single enclosures as appropriate but changes in the work- ing on the target to be achieved and not homogeneous and equipped different nodes or node groups like blade centers load distribution should not lead to can be seen as the glue between all so effort for specifying them also in etc. This allows hardware manufac- different results or throughput. Never- the other pieces for the overall system. detail would be to high in comparison to turer to optimize the layout of these theless optimizing the rack placement The hierarchy is the entry point and the relevance. It is also easy to replace devices by relocate openings or fans, for rooms or changing the temperature describes the composition of all devices parts of the DEBB description which changing the surface of heat sinks to slightly may decrease the cooling ef- might happen in case of an updated improve the airflow and the dissipated fort considerable. As shown in Fig. 6 version of the description, when single heat resulting in reduced temperatures (worse/good case) the changes in the devices are replaced or a system goes or lower power consumption for fans. out of production. For the design phase Additionally the cooling in enclosures this easy replacement is also useful to can be optimized by changing the order be able to compare different composi- of the nodes in the airflow like for the tions without creating them new from RECS Box (Resource Efficient Computing scratch. and Storage in Fig. 4 and a simplified version for the simulation with only four The SVD Toolkit (Fig. 3) developed in the nodes in Fig. 5) or changing in homoge- project handles all simulations needed neous environments the node to work- from application and workload simula- flow assignment. For data centre de- tions starting with the Data Center signer and providers the other scenario Workload and Resource Management is more important, the simulation of System (DCWoRMS), the CFD simula- the whole server room including the tions with either OpenFoam or Ansys cooling infrastructure. 
Since this is at Figure 4: RECS Box 3D design CFX within the Collaborative simulation a quite larger scale a simulation with and Visualization Environment (COVISE). the same granularity as before will last The simulation workflow definition to- much longer producing results with gether with the parameter and configu- lower accuracy limited by the fact the ration setup as well as the visualization real world cannot be described identi- of the results is included in the CoolEmAll cally, for example cables may influence web GUI. airflows and therefore the accuracy will be not so high compared to the time DCWoRMS uses the application profiles and energy consumed for the simulation. covering the applications behavior Therefore all values are only used for on different loads and the predicted a whole rack and the structure of the or planned workload distribution to racks does also not cover any details simulate mainly the usage rates of all within–it is taken as black box. This al- components and as result the energy lows a fast and precise enough simu- consumptions. This information can lation on the room so that the designer Figure 5: Simplified RECS heat distribution be used to sum up the overall capacity and providers then can optimize the Figure 6: Rack placement with hotspot and optimized

60 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 61 Projects Projects

placement lead to a more homogeneous temperatures may lead to increased temperature distribution in the room, failure rates if the temperatures especially for rack 9 the previous exist- reach the operational limits. With the ing hot spot in the upper left corner simulation capabilities from CoolEmAll can be solved (the temperatures are even the under-floor airflows can be shown in Kelvin), the temperatures at checked and airflows can be assigned the outlets are reduced by up to 30° K. to each air-conditioner in the room to Since the allowed maximum for the op- know the heat load distribution in ad- erational temperature is still the same, vance before any installation is done. is might be possible to increase the Fig. 8 show the scenario for three temperature for the room inlets which air conditioners from two perspectives will lead to a lower energy consump- where it can clearly be seen that the tion for the cooling infrastructures airflows are not even distributed and since the difference between outside there are also some strong vortices environmental air and coolant will be which also means wasted fan power Figure 9: Containerized modular data centre increased and in some cases even free and maybe to less air in some areas cooling becomes possible with dramati- of the room or racks. The SVD Toolkit cally improved power efficiency. Based can also be used to show the effect added as needed. An example of such Project Partners on Fig. 7 showing the temperature of increased room temperatures on a container equipped with RECS Boxes • Poznan Supercomputing and and speed of airflows recirculation can the energy efficiency of the cooling is given in Fig. 9 where airflows are il- Networking Center, PL clearly be identified and countermea- infrastructure, in our test case it was lustrated. • High Performance Computing sures like building walls or changing the possible to save approx 10% of overall Centre University of Stuttgart, D placement can be performed to pre- power consumption by increasing the The outcome of CoolEmAll allow to simu- • Université Paul Sabatier - vent further recirculation and therefore temperature above 22.5° C - just high late data centres and optimize power Institute for Research in reduced heat transfer. The negative enough to allow free cooling. In the consumptions before it is even build Informatics of Toulouse, F effects of recirculation also demand an project we also addressed the creation and whenever changes are required. • christmann informationstechnik + increase in air throughput to balance of blueprints of containerized modular The capabilities developed within the medien GmbH & Co. KG, D the heat transfer and so waste fan power data centers where depending on the CoolEmAll project can be used to de- • 451 Research Ltd, UK • Jochen Buchholz and indirectly more cooling power for load additional container can flexibly be sign data centres, describe all hard- • The Catalonia Institute for Energy • Eugen Volk the additional fan power. Also higher ware devices, specify the expected ap- Research, E plications and workload distribution and • Atos, E University of based on all this information simulate Stuttgart, HLRS the operation of a data center to opti- contact: Jochen Buchholz, mize the overall energy consumption. [email protected] This leads to a more optimized rack placement, optimized temperatures and reduced need for fans and cooling power and may even be used to plan a better structured datacenter building.

CoolEmAll has been funded by the European Commission and started in October 2011 and has been ended in March this year.

Project Website: Figure 7: Recirculation within the room Figure 8: Uneven airflow distribution in the http://www.coolemall.eu room and below the floor

62 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 63 Projects Projects

pact on scalability and performance. ality and reduce as much as possible MyThOS: Many Threads Parallel programming is a challenge for the overhead. The monolithic organiza- developers, and current approaches tion of current operating systems cre- try to increase scalability by improving ated dependencies between features the data segmentation, producing that are not relevant for the execution than expect higher performance from better adapted thread bodies and reduc- of individual threads but affect the the processor updates. But exploiting ing data dependencies. However, the use of memory and computational re- parallelism in any degree can be difficult: operating system is part of the core of sources. Instantiation of new threads not only the data must be segmented, the problem, and has not yet been require several 10,000s of compute but the communication overhead can correctly addressed. cycles [OEC11][BAU09][BOY10][LIN00], lead to reduced performance. Also, the and even if a thread pool can reduce amount of work of concurrent threads To correctly address this issue the archi- this time, it also reduces the degree For decades now, the computing in- must be sufficient to compensate for tecture of operating systems for highly of the dynamicity that can realize the dustry has relied on increasing clock the overhead of thread creation and scalable applications has to be rede- program. We expect the MyThOS op- frequency and architectural changes management. Unrolling loops with signed to eliminate secondary function- erating system kernel design to be a to improve performance of CPUs. The short or little compute-intensive loop situation has changed nowadays; small bodies can be too expensive and can architectural improvements continue, reduce the performance of the program. but the power wall, the memory wall and thermal issues have made the ap- The adequate load per thread is depen- proach of increasing clock frequency dent on several factors, the first is unfeasible. Semiconductor manufactur- the code and data size. It also has to be ers continue developing the silicon taken into consideration the operations manufacturing technology and doubling executed by the operating system to the number of transistor per unit area instantiate and configure the thread every few months, but rather than to and the additional services needed improving the clock speed (because the to make the execution of the threads physics do not allow it any more), they possible. Although in some operating increase the number of processing system kernels implementations threads units (i.e. cores) in the chip die. have fewer dependencies than pro- cesses, the minimum execution environ- This trend has not only been an uptake ment for threads is still too large by HPC, but also by consumer computing and leads to cache misses, instantiation and embedded systems, continuously challenges and expensive context increasing the core count as the main switches when a thread is rescheduled way to increase the (theoretical peak) or invokes a system call. performance. This creates a situation in which only embarrassing parallel ap- This overhead is not just a problem plications can effectively improve the that the programmer should cater execution performance by taking advan- for, but it is also a primary obstacle tage of the additional resources. that keeps parallel applications from scaling better because it prevents the In order to maximize the theoretical effective use of concurrency. 
Several performance of new system, developers studies [OEC11][TOL11] have noted that

must use an extremely high level of the current architecture of operating Figure 1: Kernel Map, for the execution of a thread only a few modules are needed. (C) 2007-2010, parallelism in their programs, rather systems generally have a negative im- http://www.makelinux.net/kernel_map

64 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 65 Projects Projects

suitable solution for increasing the scal- control over the hardware. This is not reduce the operating system overhead management is done by the operating ability of applications in multicore a problem as for the project, compat- by having less expensive system calls system as the different system calls processors and multiprocessor nodes. ibility with legacy software is secondary and improving the predictability of the and memory accesses are handled to performance and scalability, and, system, which ultimately increases the transparently across all boundaries of The MyThOS Project however, this does not contradict the performance of synchronous systems. the system's architecture. Similarly, The MyThOS project addresses a spe- requirements of the scientific, multi- system calls can be executed locally or cific problem area that exists in both media and compute intensive applications This modular approach also allows remotely depending on where the nec- industry and academy and that is be- that MyThOS is targeting. operating system functionality to be essary functionality and resources are coming increasingly relevant with the replaceable and distributed across available, and they interact with each paradigm shift in processors' architec- The project's aim is to reorganize the the system as the functionality will be other without the need to explicitly use ture. Even if the project is technologi- operating system architecture so that implemented in modules that use the appropriate communication proto- cally innovative it is, at the same time, it is exclusively dedicated to tasks message passing as the means for col for each case. looking for direct applicability. related to computing, specially multi- communication, allowing programs threading, with some extra functionality to interchange data between threads The core of the operating system (those Current operating systems are designed necessary for creating the execution instantiated on different system archi- components that involve messaging, to support a variety of architectures environment, communication and han- tectures, supporting communication memory management and thread instan- and application scenarios, thus not sup- dling other necessary services for the across different system boundaries tiation) is loaded by default at system porting the execution of specific use system. These non-core functionality (i.e. between processor cores or be- boot. This is the essential functionality cases optimally. MyThOS aims to ex- of the operating system are loaded ac- tween processor nodes, etc.). to initialize processing units, to interact ecute each application thread with just cordingly to the needs of the system with each other, to load extra function- the necessary support for its execu- or the application at each moment. A To achieve maximum flexibility for ality and to start the execution of the tion. This implies that some aspects, lower operating system overhead the execution, applications running on application. 
As the application runs the such as security, time sharing, inter- means that threads with smaller bodies MyThOS will be abstracted from the distribution of the operating system activity, etc., can be severely limited can be created, allowing for better peculiarities of the different hardware and its modules providing the extra func- as the application will be given more exploitation of concurrency, increasing memory address spaces and com- tionality will change accordingly with the scalability and the use of the avail- munication protocols so that a single the needs of the application and the able computing resources. All this to- uniform memory space and messaging status of the system. Each process gether increases the application perfor- system is presented. For an applica- and each thread is allocated with the mance. tion there is no difference accessing system components that provide the local memory or memory that is not necessary functionality for an efficient Technical Principles physically joined as the appropriate execution of its task. For a better use Classic operating systems are based on a monolithic architecture, which does not allow for fast and dynamic use of parallelism due to the strong linkage of its functions and data structures. To solve this issue MyThOS will use a modu- lar approach, taking concepts from the microkernel architecture paradigm, which have the implication that the computing resources will only be mini- mally loaded by the operating system, leaving more free resources to the pro- gram itself as in each computing unit

Figure 2: Different instances of the operating there will be deployed only the minimal system on two CPUs functionality for it to work. This will also Figure 2: Architecture of distributed operating system with process and threads

66 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 67 Projects Projects

of the resources, different variants of narios from molecular dynamics, fluid References the same system component will be dynamics and distributed computation [1] Oechslein, B., Schedel, J., Kleinöder, J., Bauer, L., Henkel, J., Lohmann, D., provided, optimized for different proces- of multimedia data. The scope of sce- Schröder-Preikschat, W. sors or system architectures, allowing narios addressed in MyThOS however OEC11: OctoPOS, A Parallel Operating Sys- to tune the system for fast deploy- goes beyond the use cases MyThOS tem for Invasive Computing, 2011. ment and management of threads on is well suited for analyzing big data and [2] van Tol, M.W. specific hardware. A single thread is for highly dynamical scenarios. TOL11: A Characterization of the SPARC allocated per core in low-weight “sand- T3-4 System, 2011. boxes”, which can be quickly instanti- To be able to use the capabilities pro- [3] Baumann, A., Barham, P., Dagand, P.-E., ated, configured and later destroyed posed by MyThOS it is necessary to Harris, T., Isaacs, R., Peter, S., Roscoe, T., with the aim to achieve a hierarchical break compatibility with POSIX system Schüpbach, A., Singhania, A. multithreaded organization of the ex- calls, used in Unix-like operating sys- BAU09: The multikernel, a new OS archi- tecture for scalable multicore systems, ecution of the application that allows for tems, as otherwise we would face the 2009. dynamic management, changing its limitations of application design and execution structure (the number and scalability that current systems have. [4] Boyd-Wickizer, S., Clements, A.T., Mao, Y., Pesterev, A., Frans Kaashoek, M., distribution of threads) according to Morris, R., Zeldovich, N. the requirements of the application and During the run time of the project it will BOY10: An analysis of Linux scalability to resources available in the system at not be possible, neither sensible, to many cores, 2010.

each moment. develop the complete OS functionality, [5] Ling, Y., Mullen, T., Lin, X. however, a prototype version of the LIN00: Analysis of Optimal Thread Pool To make this execution model possible, operating system and its programming Size, 2000.

the individual operating system com- model will be developed, as well as [6] Schubert, L., Kipp, A. ponents must be orchestrated and documentation on how to extend it and SCH09a: Principles of Service Oriented coordinated with each other to make how to design applications that can Operating Systems, 2009.

an efficient use of the resources and use the system capabilities correctly [7] Schubert, L., Kipp, A., Wesner, S. lower the communication overhead. will be provided. Thanks to the modular SCH09b: Above the Clouds: From Grids MyThOS will stand on the research re- design and implementation of MyThOS to Service- oriented Operating Systems, 2009. sults of S(o)OS [SCH09a][SCH09b] it will be easily further developed and • Colin Class and BarrelFish [BAU09] projects which extended in the future with new capa- • Daniel Rubio clearly show that a distributed organi- bilities and support for future hardware contact: Colin W. Glass, Bonilla zation can significantly improve scalabil- architectures. [email protected] ity and performance. University of Who is MyThOS? Stuttgart, HLRS The project will be developed using Intel MyThOS is funded by the BMBF (“Bundes- MIC, the Knight's Corner processor ministerium für Bildung und Forschung”, family, as its high CPU core count, inter- German Federal Ministry of Education nal interconnection models and memory and Research) and started in October organization, allow to experiment 2013. The project partners include: with a wide range of future hardware University of Ulm; Brandenburg Univer- scenarios. MyThOS will give support sity of Technology; HLRS, University for writing applications using common of Stuttgart; University of Siegen; Bell programming languages such as C Labs, Alcate-Lucent AG. and Fortran, which are widely used in industry as well as academia. The de- velopments will be validated within the project on challenging use case sce-

68 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 69 Projects Projects

OpenFOAM for JUQUEEN handle frequent production runs of the HPC for Climate-friendly Thus one of the aims of the CEC project required sizes. Power Generation is the provision of CFD software which makes efficient usage of massively paral- At present work at JSC is under way to lel HPC architectures. Within the scope analyze the scaling behaviour of Open- The joint research project CEC [1] fund- key design difficulty in the development of this project the open source software FOAM using modern performance analy- ed by the Federal Ministry of Economics of modern low NOx gas turbine com- package OpenFOAM [2] will be used. sis tools like Scalasca [3, 4] which is a and Technology (BMWi) and the Siemens bustion systems are thermoacoustic joint development of JSC and the German AG was launched in January 2013 instabilities. Thermoacoustic instabilities Before the start of the CEC project Research School for Simulation Sciences. with an initial three year funding phase. can result in pressure oscillations OpenFOAM has already been used by The main goal of the JSC work package Besides Siemens – the project coordina- which require the immediate shut down Siemens for production runs on JSCs is to reduce performance and scal- tor - and JSC the consortium includes of a power plant to avoid mechanical HPC cluster JUROPA. There simula- ing bottlenecks in OpenFOAM to enable seven German research institutes in failure. Avoiding these events is a high tions of a single combustor tube are large simulations on JUQUEEN. the fields of power plant technology and priority. Developing a modeling capability done applying typically a few thousand material science. Aim of the project is for thermoacoustic instabilities is cores / several hundred nodes. Within Project Partners to develop new combustion technologies challenging. The phenomenon involves this project it is planned to compute • Siemens AG, Energy Sector, for climate-friendly power generation. unsteady aerodynamics, chemical reac- the behavior of a complete annular Mühlheim an der Ruhr The focus is on modern gas turbine tion and acoustics resulting in a stiff combustion chamber which will lead to • Institut für Verbrennungstechnik, DLR technology which play an important role multi-physics problem. models where the size (number of CFD • Institut für Verbrennung und Gas- in the present transformation of the cells) is increased by at least a factor dynamik (IVG), Universität Duisburg-Essen energy system towards a sustainable Computational fluid dynamics on mas- of 10. In order to keep the turnaround • Institut für Thermische system based on renewable energy sively parallel computers was found to times acceptable it is necessary to in- Strömungsmaschinen (ITS), KIT sources. The validation of the gas tur- be a suitable approach to predict thermo- crease the number of cores used for • Jülich Supercomputing Centre, bine combustion technologies will be acoustic instabilities for single burner the calculations by a similar factor. Forschungszentrum Jülich GmbH carried out in a new test center called systems. To resolve all relevant effects • Institut für Prozess- und Anwendungs- “Clean Energy Center” (CEC) which simulations consuming approx. Thus as a first step OpenFOAM was technik Keramik, RWTH Aachen is currently being built by Siemens in 200,000 core-hours are needed. While implemented on JSCs Blue Gene/Q sys- • Institut für Strömungsmechanik und Ludwigsfelde near Berlin. 
the behavior of a single burner is similar tem JUQUEEN which is more suitable Technische Akustik (ISTA), TU Berlin • Christian Beck1 to the behavior of a can combustor for simulations running on tens of thou- • Zentrum für Angewandte Raum- • Johannes While testing is a key step in the vali- based gas turbine, annular combustor sands of cores. This was not straight- fahrttechnik und Mikrogravitation, Grotendorst2 dation of new gas turbine combustion based gas turbines require the simulation forward since the dynamical linking ZARM, Universität Bremen • Bernd Körfgen2 systems, numerical methods can pro- of the complete combustion system strategy applied by OpenFOAM is not the • Department Werkstoffwissenschaften, vide significantly more detailed insight comprising up to 24 burners (see Fig. 1). first choice for highly scalable archi- FAU Erlangen-Nürnberg 1 Siemens AG, into the behavior of gas turbine com- tectures like JUQUEEN. The resulting Energy Sector, bustion systems. The detailed insight is The resulting computational require- OpenFOAM installation was success- References Mülheim a. d. Ruhr important since it enables engineers to ments are difficult to satisfy on PC fully tested and made available for use [1] http://www.fz-juelich.de/ias/jsc/EN/ Research/Projects/_projects/cec.html derive new design hypotheses needed technology based clusters – larger scale by Siemens as well as other research 2 Jülich to achieve higher thermodynamic efficien- systems are needed. groups with access to JUQUEEN. [2] http://www.openfoam.org Supercomputing cies and reduced pollutant emissions. A Centre (JSC) [3] Geimer, M., Wolf, F., Wylie, B.J.N., Benchmarking of OpenFOAM on Ábrahám, E., Becker, D., Mohr, B. JUQUEEN with a model of a single com- The Scalasca performance toolset archi- bustion tube provided by Siemens tecture, Concurrency and Computation: Practice and Experience, 22(6):702-719, showed that the single core simulation April 2010. speed is as expected less than on Figure 1: Gas turbine with JUROPA due to the slower clock speed [4] http://www.scalasca.org can combustion system (lhs) vs. annular combus- of JUQUEEN’s A2 processor. But with tion system (rhs) evolving models OpenFOAM on contact: Johannes Grotendorst, JUQUEEN will be the only option to [email protected]

70 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 71 Projects Projects Going DEEP-ER to Exascale

The heterogeneous Cluster-Booster Architecture implemented by DEEP consists of two parts: a Cluster based on multi-core CPUs with InfiniBand October 1, 2013 marked the start of interconnect, and a Booster of many- the EU-funded project “DEEP-ER” core processors connected by the (DEEP–Extended Reach). As indicated EXTOLL network. DEEP-ER (see Fig. 1) by its name, DEEP-ER aims for exz- will simplify this concept by unifying tending the Cluster-Booster Archi- the interconnect merging Cluster and tecture proposed by the currently Booster Nodes into the same network. running EU-project “DEEP” [1] with The DEEP-ER prototype will explore additional functionality. The DEEP-ER new memory technologies and concepts project will have a duration of three like non-volatile memory (NVM) and years and a total budget of more than network attached memory (NAM). Ad- 10 million Euro. Its coordinator, the ditionally, with respect to DEEP the Jülich Supercomputing Centre, leads a processor technology will be updated: consortium of 14 partners from 7 dif- next generation Xeon processors will ferent European countries. be used in the Cluster and the KNL generation of Xeon Phi will populate the Figure 2: Participants of the DEEP-ER kickoff meeting. Booster.

The DEEP-ER multi-level I/O infrastructure extended in order to automatically re- References has been designed to support data- start tasks in the case of transient [1] Suarez, E., Eicker, N., Gürich, W. • Estela Suarez “Dynamical Exascale Entry Platform: the intensive applications and multi-level hardware failures. In combination with • Norbert Eicker DEEP Project”, inSiDE Vol. 9 No.2, Autumn checkpointing/restart techniques. The a multi-level user-based checkpoint in- 2011, http://inside.hlrs.de/htm/Edition_ project will develop an I/O software frastructure to recover from non-tran- 0 2 _11/a r t i c l e _12.h t m l Jülich platform based on the Fraunhofer paral- sient hardware-errors, applications will Supercomputing lel file system (BeeGFS), the parallel I/O be able to cope with the higher failure contact: Estela Suarez, Centre (JSC) library SIONlib, and the I/O software rates expected in Exascale systems. [email protected] package Exascale10. It aims to enable an efficient and transparent use of the DEEP-ER’s I/O and resiliency concepts underlying hardware and to provide all will be evaluated using seven HPC ap- functionality required by applications plications from fields that have proven for standard I/O and checkpointing. the need for Exascale resources.

DEEP-ER proposes an efficient and user- Acknowledgements Figure 1: Cluster-Booster Architecture as implemented friendly resiliency concept combining This project has received funding from in DEEP-ER. (CN: Cluster Node; BN: Booster Node; NAM: Network Attached Memory; NVM: Non-Volatile user-level checkpoints with transparent the European Union's Seventh Frame- Memory; MEM: Memory). In contrast to DEEP, here task-based application restart. OmpSs work Programme for research, techno- Cluster and Booster Nodes are attached to the same is used to identify the application's logical development and demonstration interconnect. Additional memory and storage is added individual tasks and their interdepen- under grant agreement no 610476 at the node, the network, and system levels. dencies. The OmpSs runtime will be (Project "DEEP-ER").

72 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 73 Projects Projects Score-E – Scalable Tools for Energy Analysis and Tuning in HPC

For some time already, computing cen- Vampir [3], Periscope [4], and TAU [5]) ters feel the severe financial impact of will enable software developers to in- energy consumption of modern com- vestigate the energy consumption of puting systems, especially in the area their parallel programs in detail and to of High-Performance Computing (HPC). identify program parts with excessive Today, the share of energy already ac- energy demands, together with sugges- counts for a third of the total cost of tions on how to make improvements ownership, see Fig. 1, and is continu- and to evaluate them quantitatively. In ously growing. addition, the project will develop models to describe not directly measurable The main objective of the Score-E project, energy-related aspects and a powerful funded under the third "HPC software visualization of the measurement results. for scalable parallel computers" call of

the Federal Ministry of Education and Measurement of energy consumption Figure 2: Interaction of the tools Persicope, Scalasca, Vampir, TAU and Scalasca’s profile browser Research (BMBF), is to provide user- will take advantage of hardware coun- CUBE via the data exchange formats OTF2 for traces and CUBE4 for profiles. friendly analysis and optimization tools ters like Intel's "Running Average Power for the energy consumption of HPC Limit" (RAPL) counters introduced with Blue Gene/Q application programming ment system Score-P, see Fig. 2 [6], applications. These tools (Scalasca [2], the Sandy Bridge architecture, IBM's interfaces to query power consumption which forms the common base of all on a node-board level, as well as sys- four tools mentioned above. To serve tem specific power measurement infra- a broad user base, both, within the structure on the node and rack level Gauss Alliance and beyond, most of the e.g., on SuperMUC. The new energy- performance tools are released free related metrics can not only be visual- of charge to the community under an ized by the performance tools mentioned open-source license. Only Vampir, due above, but also be used to trigger to its sophisticated user interface, is specific tuning actions during runtime distributed commercially. by utilizing Score-P's online access in- terface. One particular task in the development will be the extension of the "Scalable With regards to visualization the Score-E I/O library for parallel access to task- project will investigate domain-topology local files" (SIONlib [7]) to support not visualizations as well as linked, multiple- only MPI plus basic OpenMP but arbi- view data presentation to gain deeper trary programming models and hetero- insight of the performance and energy geneity by providing a generic callback- behavior of complex applications. based interface as well as a key-value scheme to handle a variable number At the same time, the project will fur- of tasks per process. This extension ther develop and maintain the com- widens the applicability of SIONlib to all munity instrumentation and measure- Figure 1: Total Cost of Ownership of a typical supercomputer according to a study by RWTH [1].

74 Spring 2014 • Vol. 12 No. 1 • inSiDE Spring 2014 • Vol. 12 No. 1 • inSiDE 75 Projects Projects

programming models currently in use Acknowledgements in HPC. The Score-E project is funded by the UNICORE 7 Released German Federal Ministry of Research The software products are accompanied and Education (BMBF) under Grant by training and support offerings No. 01IH13001. The UNICORE middleware suite is well- service, which is validated by the through the Virtual Institute–High Pro- established as one of the major solutions UNICORE services to assert the user's ductivity Supercomputing (VI-HPS [8]), References for building federations and e-infra- identity. Nevertheless, the strong client and will be maintained and adapted to [1] Bischof, C., an Mey, D., Iwainsky, C. structures, with a history going back to authentication based on X.509 certifi- Brainware for green HPC Computer emerging HPC architectures and pro- 1996 [1]. It is in worldwide use in HPC cates is still available and will continue Science–Research and Development, gramming paradigms beyond the life- 27(4):227–233, 2012. oriented infrastructures, for example to be supported in future releases. time of the Score-E project itself. PRACE, the US-project XSEDE [2] and [2] http://scalasca.org/ in national grid initiatives such as Apart from the enhanced web services Besides the evident economic and envi- [3] http://www.vampir.eu/ PL-Grid. This spring, UNICORE 7 was re- container and improved security stack, ronmental benefits in terms of energy, leased, which is the first major release there are a number of other new fea- [4] http://www.lrr.in.tum.de/~periscop/ Score-E will also empower the optimized since UNICORE 6.0 in August 2007. tures. For example, a new data-oriented programs to unlock new scientific and [5] https://www.cs.uoregon.edu/research/ It is no paradigm change as was the processing feature allows to define tau/home.php commercial potentials. change from UNICORE 5 to UNICORE 6. data processing via user-defined rules. [6] http://www.score-p.org Instead, UNICORE 7 is built on the Jobs can be restarted easily, and data

The academic project partners in [7] http://www.fz-juelich.de/ias/jsc/EN/ same ideas and principles as UNICORE staging now supports wildcards. LMAC are the Jülich Supercomputing Expertise/Support/Software/SIONlib/ 6, and the two versions are compatible. Centre, the German Research School _node.html We decided to make this a major Several changes have been made to

for Simulation Sciences, RWTH Aachen [8] http://www.vi-hps.org release due to a number of improve- improve the performance of UNICORE 7. University, TU Dresden and TU Munich. ments that serve to put the software For example, security sessions have The industrial partner GNS mbH, a Christian Rössel, on an updated technological basis. been introduced to reduce the amount private company that specializes in [email protected] of XML data transferred between services related to metal forming simu- As the most prominent change, we up- client and server, also reducing the CPU lations, such as mesh generation for dated the internal web services stack time required to process the XML • Christian Rössel 1 complex structures and finite element to use the Apache CXF framework [3], messages. Several new batch operations • Bernd Mohr 1 analyses, coordinates the project. which is the most advanced and have been added, for example allow- • Michael Gerndt 2 mature Java services stack available ing to delete multiple files or to check In addition, the University of Oregon, an today. This allows to build both WS/ the status of many jobs using a single 1 Jülich associated partner, complements the SOAP services as are currently used in request/reply web service call. In data Supercomputing Score-E objectives with corresponding UNICORE and RESTful services that staging, the transfer of directories Centre (JSC) extensions to the performance tool will become more and more important or multiple files has been optimized. TAU. Further associated partners are in the future. Now, multiple files can be transferred 2 Technische Engys UG, who specializes in the ap- in a single session, greatly using the Universität plication, support and development of As the major new feature, we have overhead. This works especially well in München Open Source Computational Fluid Dy- added the possiblitity to deploy and run conjunction with the UFTP high-perfor- namics (CFD) software and Munters UNICORE in a way that end-users do mance data transfer protocol. Euroform which its expertise in engi- not need certificates, using the Unity neering droplet separation systems for group management and federated Together with the UNICORE 7.0 release, various industrial purposes. identity solution [4]. Instead of using a first version of the new UNICORE X.509 certificates to identify them- Portal component was made available. For more information see: selves, UNICORE clients request a signed This serves the increasing demand of http://www.vi-hps.org/projects/score-e Security Assertion Markup Language users and infrastructure operators for and http://www.score-p.org (SAML) [5] document from the Unity a web-based access to UNICORE


Together with the UNICORE 7.0 release, a first version of the new UNICORE Portal component was made available. This serves the increasing demand of users and infrastructure operators for web-based access to UNICORE resources. Compared to desktop clients, no installation on end-user machines is required, and updates and fixes can be made available much more easily. The UNICORE Portal was developed in a highly modular and customizable fashion, and allows deployers to disable unneeded functionality and to extend the portal with customized components, for example for user authentication. The underlying technology is Java, using Vaadin [6], an open-source web framework which allows for a rich user experience comparable to desktop clients.

Special care has been taken to allow integrating the Portal with different user authentication methods. The user can authenticate via X.509 certificate in the browser, or via the Unity identity management service, which also allows username/password authentication. Other methods such as authentication via Kerberos are possible, too, since the security subsystem is extensible.

The Portal currently offers access to the most important UNICORE features, which are job submission and monitoring, data access and transfer, and access to a subset of workflow features.

The screenshot (Fig. 1) shows the data browser view, which is a two-window component inspired by tools such as Norton Commander. It allows the user to manage data on both UNICORE storages and the end user's local filesystem. Data can be uploaded to, downloaded from or transferred between UNICORE servers in a very simple way. The user's local file system is made accessible using a Java applet.

UNICORE 7 is a major step forward in many respects, and we expect it will be quickly and widely deployed by our users in PRACE, XSEDE, national infrastructures such as PL-Grid, and others.

A particularly interesting new deployment of UNICORE is under development in the FET Flagship "Human Brain Project", where UNICORE will form the basis for the HPC platform, which will combine High Performance Computing and scalable storage into a future exascale platform for simulating the human brain. The Human Brain Project is expected to bring new requirements, such as interactive supercomputing, large user federations, and others. UNICORE is well placed to take on these requirements. A strong focus of further development will be put on the UNICORE Portal, where we plan to add support for some of the more advanced UNICORE features such as metadata management. Customization of the portal for specific use cases and applications will play a big role, e.g. through the development of an application integration layer. Last but not least, we will work on adding social features and "teamwork" functions, leveraging the group membership management capabilities of the Unity system to allow simple sharing of task definitions or data files, as well as receiving notifications about activities of group members.

References
[1] Streit, A. et al.: Annals of Telecommunications 65, pp. 757–762, 2010.
[2] UNICORE in XSEDE – https://www.xsede.org/software/unicore
[3] Apache CXF – http://cxf.apache.org
[4] Unity identity management solution – http://www.unity-idm.eu
[5] SAML – http://en.wikipedia.org/wiki/Security_Assertion_Markup_Language
[6] Vaadin web application framework – http://vaadin.com

contact: Bernd Schuller, [email protected]

• Bernd Schuller
• Daniel Mallmann

Jülich Supercomputing Centre (JSC)

Figure 1: Screenshot of the data browser view in the UNICORE Portal


Project Mr. SymBioMath

The project High Performance, Cloud and Symbolic Computing in Big-Data Problems applied to Mathematical Modelling of Comparative Genomics (Mr. SymBioMath; http://www.mrsymbiomath.eu) represents an interdisciplinary effort focused on big-data processing within the application domains of bioinformatics and biomedicine. The project partners cover a wide range of biological, medical and technological expertise. The University of Malaga (UMA), Spain, acts as project coordinator and provides expertise in bioinformatics and comparative genomics as well as in High Performance Computing and Cloud Computing. RISC Software GmbH, Austria, is a software company with special expertise in advanced computing technologies like Cloud and Grid Computing. Besides contributing to the software development effort within the project, RISC Software GmbH provides the Cloud Computing infrastructure for the project. The Johannes Kepler University Linz, Austria, contributes specific expertise on bioinformatics algorithms to the project, while Integromics – as a commercial company – will specifically address the commercialization of the project results, while also contributing to the core software development efforts. The project partner Servicio Andaluz de Salud – Hospital Carlos Haya, Spain, is in charge of the biomedical use cases which will be addressed by the project; specifically, this refers to discovering links between allergies and the relevant parts of the genomes of patients. The Leibniz Supercomputing Centre (LRZ), Germany, contributes its expertise on visualization for the representation of the results of the bioinformatics and biomedical computations on different types of devices, ranging from mobile devices to Virtual Reality devices.

Infrastructure
Since the project is clearly focused on Cloud Computing as its resource provision methodology, a community cloud installation provided by project partner RISC Software GmbH is used for development and testing purposes. This cloud installation also gives the project consortium the opportunity to try out different cloud technologies for supporting the bioinformatics and biomedical applications within the project. As Cloud Computing middleware, OpenStack (http://www.openstack.org) is being used, which provides the users with access to different Cloud Computing services such as the execution of virtual instances, access to volumes acting as additional virtual hard disks, and access to data containers where files can be stored to be exchanged between instances. This functionality can be accessed either through a web interface or through web-service calls, which enables the use by external programs like workflow engines. The role of workflow engines is crucial for the Mr. SymBioMath project since the bioinformatics and biomedical applications consist of a set of data transformation modules which can be flexibly linked to each other to achieve the desired overall functionality.

Bioinformatics and Biomedical Algorithms
Considering application domains, the project is clearly focused on two related ones, bioinformatics and biomedicine. On the one hand, the project works on genome comparison at the genome level and related applications, while on the other hand the biomedical aspects of the programme are driven by the project partner Hospital Carlos Haya, which is specifically interested in the relation between genome variations and allergic reactions of patients. Both application areas require the processing of large genomic datasets and thus represent a good use case for the infrastructure's data handling capabilities. The specific bioinformatics and biomedical algorithms rely on the underlying infrastructure and provide their services to the higher layers of the hierarchy, represented by the end-user applications.

End-User Applications and Visualization
The main applications for end users will be a visualization application focused on genome comparison and related functionality, as well as a workflow composition application allowing a user to set up bioinformatics and biomedical workflows. The end-user applications will be designed to run on a range of different devices, spanning from mobile devices like cell phones and tablets over desktop and laptop computers to Virtual Reality devices like, for example, CAVEs (Cruz-Neira et al., 1992).

Workflows
The processing of workflows represents a core element of the software development within Mr. SymBioMath since they are applied both in the biomedical and in the bioinformatics domain. A first prototype implementation of a biomedical workflow as shown in Fig. 1 has been run on the Mr. SymBioMath cloud infrastructure by executing software modules consisting of preexisting C code and Python scripts (Heinzlreiter et al., 2014). The workflow starts with CEL files containing single nucleotide polymorphisms (SNPs) – areas of variation within the genomes – from different human patients. The amount of data to be processed can range from a few megabytes to terabytes depending on the number of patients participating in a specific study. After the files have been uploaded, the birdseed algorithm (Korn et al., 2008) is applied to them.

Figure 1: Biomedical Workflow


After this step, the algorithm's output data has to be filtered to remove areas which do not vary across different patients and areas where the birdseed algorithm did not produce appropriate results. To conclude the workflow, the filtered result has to be converted to the VCF file format, which represents a standard file format for genomic data.

The workflow was instantiated by executing preexisting software modules – a C code for the birdseed algorithm and Python scripts for the other steps – distributed across multiple different cloud instances. The data exchange between the modules was realized by storing files in OpenStack containers. To enable access to the containers from the running cloud instances, the containers have been mounted into the file systems of the instances. This was realized by the cloudfuse mount daemon, which accesses the OpenStack containers through web service calls and makes them locally accessible in a way comparable to Network File System (NFS) mounts. With this infrastructure in place, the workflow was distributed across different instances according to the computational load generated by the different modules. During test runs, the last step of the workflow – the conversion to the VCF format – was identified as the most time-consuming step. Typically the workflow is executed with several CEL files as input, which allows for the parallelization of the VCF-conversion step: the filtered output of the birdseed algorithm also consists of several files, which can be processed independently. Therefore the last step of the workflow was executed by running it on several instances in parallel.
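The sketch below illustrates this fan-out of the per-file conversion step. It is a simplified, single-node stand-in for the distribution across cloud instances described above: the pool of local worker processes plays the role of the parallel instances, and the directory, script and file names are hypothetical placeholders rather than the project's actual modules.

```python
import subprocess
from multiprocessing import Pool
from pathlib import Path

# Hypothetical locations: in the real setup these directories correspond to
# OpenStack containers mounted into the instance's file system.
FILTERED_DIR = Path("/mnt/container/filtered")   # filtered per-file birdseed output
VCF_DIR = Path("/mnt/container/vcf")             # destination for the converted files

def convert_to_vcf(filtered_file: Path) -> Path:
    """Convert one filtered result file to VCF by calling a converter script.
    'convert_to_vcf.py' is a placeholder name for the conversion module."""
    out_file = VCF_DIR / (filtered_file.stem + ".vcf")
    subprocess.run(
        ["python", "convert_to_vcf.py", str(filtered_file), str(out_file)],
        check=True,
    )
    return out_file

if __name__ == "__main__":
    inputs = sorted(FILTERED_DIR.glob("*.txt"))
    # The files are independent, so they can be converted concurrently --
    # here with local worker processes, in the project with several instances.
    with Pool(processes=4) as pool:
        results = pool.map(convert_to_vcf, inputs)
    print(f"converted {len(results)} files")
```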

While the workflow described here is a very simple first step illustrating the viability of workflow execution on the Mr. SymBioMath cloud infrastructure, this setup is by no means user-friendly and needs to be improved and extended considerably to be usable by its actually intended users, such as bioinformaticians and clinicians. To make the execution of workflows more user-friendly and flexible, a workflow engine like Galaxy (http://galaxyproject.org/) will be used. The workflow engine will provide a user-friendly way of setting up workflows and enacting them. Besides that, the storage of workflows will enable different groups of researchers to repeat experiments, thus supporting the scientific verification of published results.

A different example workflow which will be executed on the Mr. SymBioMath infrastructure focuses on bioinformatics and, more specifically, on genome comparison. The overall idea of this workflow is the comparison of two genome sequences and the graphical representation of the results in a dotplot; an example is shown in Fig. 2. A dotplot shows the similarities between two sequences by representing one genome sequence on the x-axis and the other one on the y-axis. If a similarity occurs between the two sequences, it is marked with a dot at the corresponding (x, y) coordinates. The computational steps involved in this workflow will also be executed in a distributed way across multiple instances of the cloud infrastructure, but as a next step this workflow will act as a test case for the invocation through a workflow engine like Galaxy.

Figure 2: Example Dotplot

Client Software
Within the bioinformatics domain, web services are a common methodology to expose computational services to external users. Therefore, project partner UMA has developed the web-service client software jORCA (Martin-Requena et al., 2010), which can be used for calling web services in a unified way. It will represent a core tool for invoking Mr. SymBioMath workflows through web-service interfaces. Within the project, a GlobusOnline (GO) plugin for jORCA has been developed, which enables the use of the GO services directly from the jORCA client software.

After the genome comparison has been executed, its result should typically be viewed as a dotplot. An OpenGL-based application facilitating a flexible visualization of genome comparisons is currently being developed at LRZ and will act as a visualization and partly post-processing tool for the results of the genome comparison. The core feature of the visualization application is the support of a three-dimensional visualization of multiple genomes. An example screenshot from a Virtual Reality device is shown in Fig. 3.

Figure 3: Genome Visualization

Conclusion
This article provided a high-level overview of the project's aims and some initial developments. The core targets for the upcoming months are continued software development to extend the initial components, adding new functionality and tightening their integration.

Acknowledgements
The Mr. SymBioMath project is running from February 2013 to January 2017 and is being funded by the European Union within the 7th Framework Programme for research as an Industry-Academia Partnerships and Pathways (IAPP) project under grant agreement number 324554.

References
[1] Cruz-Neira, C., Sandin, D.J., DeFanti, T.A., Kenyon, R.V., Hart, J.C.: The CAVE: Audio Visual Experience Automatic Virtual Environment, Communications of the ACM, Vol. 35, No. 6, 1992, pp. 64–72.
[2] Heinzlreiter, P., Perkins, J.R., Torreno, O., Karlsson, J., Antonio, J., Mitterecker, A., Blanca, M., Trelles, O.: A Cloud-based GWAS Analysis Pipeline for Clinical Researchers, in Proc. of the 4th International Conference on Cloud Computing and Services Science, 2014, to be published.
[3] Korn, J., Kuruvilla, F., McCarroll, S., Wysoker, A., Nemesh, J., Cawley, S., Hubbell, E., Veitch, J., Collins, P., Darvishi, K., Lee, C., Nizzari, M., Gabriel, S., Purcell, S., Daly, M., Altshuler, D.: Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs, Nature Genetics, Vol. 10, No. 40, pp. 1253–1260, 2008.
[4] Martin-Requena, V., Ríos, J., García, M., Ramírez, S., Trelles, O.: jORCA: easily integrating bioinformatics Web Services, Bioinformatics, Vol. 26, No. 4, pp. 553–559, 2010.

contact: Paul Heinzlreiter, [email protected]

• Paul Heinzlreiter
• Christoph Anthes
• Balazs Tukora

Leibniz Supercomputing Centre (LRZ)


ExaFSA – Exascale Simulation of Fluid-Structure-Acoustics Interactions

For a bit more than one year now, DFG's Priority Programme 1648 "Software for Exascale Computing" (SPPEXA) has been running, with 13 project consortia addressing the multi-faceted challenges of exascale computing. In each issue of inSiDE, one of those projects will present its agenda and progress, starting now with ExaFSA.

The project team ExaFSA, consisting of the groups of Miriam Mehl (Computer Science, Stuttgart), Hester Bijl (Aerospace Engineering, Delft), Thomas Ertl (Computer Science, Stuttgart), Sabine Roller (Computer Science, Siegen), and Dörte Sternel (Mechanical Engineering, Darmstadt), aims at tackling three of today's main challenges in numerical simulation: the increasing complexity of the underlying mathematical models, the massive parallelism of a supercomputer, and the huge amount of data produced as an output. The project focuses on interactions between fluids, structures, and acoustics (see Fig. 1), where application examples range from the sound design of ventilators, cars, aircraft, and wind energy plants to simulations of the human voice. The huge amount of numbers produced as an output is to be translated into interpretable results using fast parallel in-situ visualization methods. As there are numerous highly sophisticated and reliable tools available for each of the three involved physical effects, a fast and flexible way to establish a fully coupled three-physics simulation environment is to combine such solvers.

We focus here on two particular aspects of our project work, both related to the coupling software gluing together the solvers, which is, besides efficient solvers for fluid flow, structural dynamics, and acoustics, a key component of the FSA simulation environment. We use the in-house coupling tool preCICE (see also [1] and Fig. 2), in which we recently implemented a new numerical method developed within our project that allows for inter-solver parallelism without increasing the overall computational work. This stands in contrast to the standard, traditional coupling of a flow and a structure solver, which implies a mutual (staggered) execution of the solvers and, due to non-equal workloads, necessarily leads to a very inefficient usage of computational resources (see Fig. 3 and [2]).

The step from the mutual solver execution to the simultaneous execution implies a change of data dependencies, as the structural solver cannot use the results of the current fluid solver step as an input anymore. If not handled carefully, this can lead to additional instabilities, in particular for the challenging case of an incompressible fluid.
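To illustrate the difference in data dependencies – purely schematically, with placeholder solver functions rather than real fluid or structure codes – the sketch below contrasts the staggered (mutual) iteration, where the structure solver consumes the fluid result of the same iteration, with the simultaneous variant, where both solvers advance concurrently from the previous iterate.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder "solvers": each maps an interface state to a new interface state.
def fluid_solver(structure_state):
    return 0.5 * structure_state + 1.0

def structure_solver(fluid_state):
    return 0.5 * fluid_state + 2.0

def staggered_iteration(f, s, n_iter=30):
    """Mutual execution: the structure solver waits for the current fluid
    result, so the two solvers occupy the machine one after the other."""
    for _ in range(n_iter):
        f = fluid_solver(s)
        s = structure_solver(f)      # uses the *new* fluid result
    return f, s

def simultaneous_iteration(f, s, n_iter=30):
    """Parallel execution: both solvers start from the previous iterate and run
    at the same time; this changed data dependency is what may require the
    quasi-Newton stabilization discussed below."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(n_iter):
            fut_f = pool.submit(fluid_solver, s)      # old structure state
            fut_s = pool.submit(structure_solver, f)  # old fluid state
            f, s = fut_f.result(), fut_s.result()
    return f, s

print(staggered_iteration(0.0, 0.0))
print(simultaneous_iteration(0.0, 0.0))
```

For this contractive toy problem both variants converge to the same coupled solution; the point of the simultaneous version is that the two (in practice unequally expensive) solver calls no longer have to wait for each other.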

Figure 1: The project ExaFSA focuses on the three-field coupled simulation of fluid flow, structural dynamics, and acoustics.

Figure 2: The coupling tool preCICE [1] provides the numerical equation coupling, data mapping, technical communication, and a geometry interface for coupled simulations. Parallel solvers such as solver A in the picture communicate over proxies with a server adapter running as a separate process that takes care of all numerical operations and the communication to the adapter of solver B, which is a sequential solver in the shown example and runs in the same process as solver B itself.

Figure 3: Imbalanced resource usage of the standard consecutive execution of fluid and structure solvers in a coupled fluid-structure simulation (a) versus our new balanced version where fluid and structure solver are executed simultaneously (b). Busy processors are marked green, idle waiting processors orange.

Our new coupling method overcomes such stability and convergence problems using a quasi-Newton approach where unknown internals of the solvers (which we assume to be black boxes) are estimated from input-output observations and used in a suitable manner to stabilize and accelerate the solution of the system of equations associated with each time step. Our results [2] show that, indeed, we can now solve the time step equation with the same number of fluid and structure solver steps as before with the mutual method, but almost twice as fast and without wasting energy by leaving processors in a busy waiting state, as we execute all components in parallel.
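As a rough illustration of the quasi-Newton idea – not the actual preCICE implementation – the following sketch shows one common realization, an interface quasi-Newton update of IQN-ILS type, for a generic black-box fixed-point operator H: differences of previous residuals and solver outputs are used in a least-squares sense to approximate the inverse Jacobian information that the black-box solvers do not expose. The toy operator H and the initial under-relaxation factor are assumptions made purely for demonstration.

```python
import numpy as np

def iqn_ils_step(x_hist, h_hist):
    """One interface quasi-Newton (IQN-ILS-type) update for the fixed-point
    problem x = H(x). x_hist holds previous iterates x_i, h_hist the
    corresponding black-box outputs H(x_i)."""
    r = [h - x for h, x in zip(h_hist, x_hist)]          # interface residuals
    k = len(r) - 1
    if k == 0:                                            # no history yet:
        return x_hist[0] + 0.1 * r[0]                     # plain under-relaxation
    # Differences of residuals and outputs encode the observed input-output behaviour.
    V = np.column_stack([r[i] - r[k] for i in range(k)])
    W = np.column_stack([h_hist[i] - h_hist[k] for i in range(k)])
    alpha, *_ = np.linalg.lstsq(V, -r[k], rcond=None)     # least-squares coefficients
    return h_hist[k] + W @ alpha                          # quasi-Newton update

# Demonstration with a toy "solver" H (a contractive affine map), standing in
# for the combined fluid/structure evaluation executed in parallel.
A = np.array([[0.5, 0.2], [0.1, 0.4]])
b = np.array([1.0, 2.0])
H = lambda x: A @ x + b

x = np.zeros(2)
x_hist, h_hist = [], []
for it in range(20):
    hx = H(x)                          # black-box evaluation
    x_hist.append(x); h_hist.append(hx)
    if np.linalg.norm(hx - x) < 1e-10:
        break
    x = iqn_ils_step(x_hist, h_hist)   # accelerated coupling iteration
print(it, x)
```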


Figure 4: Simulation of a density pulse with Ateles. We compare runs using the usual MPI parallelization of Ateles with runs replacing the MPI communication at a vertical line through the domain by communication via the coupling tool preCICE that we also use to couple fluid, structure and acoustics solvers. (a) Setup of the combination of Ateles with preCICE and simulation results showing the correctness at the coupling interface. (b) Strong scaling results for a total of 512 mesh elements, corresponding to 262,144 degrees of freedom. The graph shows the bottleneck induced by sequential parts of preCICE, which consequently have to be reduced to enable exascale FSA simulations. The simulations have been run on the MAC Cluster (http://www.mac.tum.de/wiki/index.php/MAC Cluster) in Munich. Full nodes, encompassing 16 physical processors, were reserved also for smaller runs, while runs with 32 or more processors used multiple nodes (e.g. 64 processors were run on 4 nodes). The preCICE server processes were always run on separate full nodes and are not taken into account for the scale-up graph.


However, the coupling tool itself can still be a bottleneck, as the following experiment shows: a simple spherical Gaussian pulse in density is transported with a constant fluid velocity through a domain governed by the nonlinear Euler equations. As this represents a pure transport phenomenon, the shape of the pulse should not vary over time. Thus, this setup is not only suitable for performance assessments, but also for validation of the results. We use the discontinuous Galerkin solver Ateles [3] with an 8th-order scheme to solve this smooth flow problem. Due to the design of the DG method, relatively small amounts of data have to be communicated between high-order elements in the mesh, allowing the solver to scale down to single elements per process. For our experiment, the complete computational domain is split into two parts, where each part is computed by Ateles and the coupling of both is done via preCICE. The results (see Fig. 4) show that sequential parts of preCICE become a bottleneck at about a total of 32 processes already, which leaves room for further improvements and innovations in the remaining project runtime.

Many aspects of FSA coupled simulations are typical for the larger class of multi-physics applications that are expected to profit from the project outcome as well. For more information about ExaFSA and about further challenges in and contributions to efficient FSA simulations, see http://www.sppexa.de and http://ipvs.informatik.uni-stuttgart.de/SGS/EXAFSA/.

Acknowledgements
We thankfully acknowledge the financial support by the Priority Programme 1648 – Software for Exascale Computing, funded by the German Research Foundation (DFG).

References
[1] Bungartz, H.-J., Benk, J., Gatzhammer, B., Mehl, M., Neckel, T.: Partitioned simulation of fluid-structure interaction on Cartesian grids. In: H.-J. Bungartz, M. Mehl, and M. Schäfer, editors, Fluid-Structure Interaction – Modelling, Simulation, Optimisation, Part II, volume 73 of LNCSE, pp. 255–284. Springer, Berlin, Heidelberg, October 2010.
[2] Uekermann, B., Bungartz, H.-J., Gatzhammer, B., Mehl, M.: A parallel, black-box coupling for fluid-structure interaction. In: Sergio Idelsohn, Manolis Papadrakakis, and Bernhard Schrefler, editors, Computational Methods for Coupled Problems in Science and Engineering, COUPLED PROBLEMS 2013, Santa Eulalia, Ibiza, Spain, 2013. eBook (http://congress.cimne.com/coupled2013/proceedings/).
[3] Zudrop, J., Klimach, H., Hasert, M., Masilamani, K., Roller, S.: A fully distributed CFD framework for massively parallel systems. In: Cray User Group, Stuttgart, 2012.

contact: Philipp Neumann, [email protected]

• Hester Bijl 1
• Thomas Ertl 2
• Miriam Mehl 2
• Sabine Roller 3
• Dörte Sternel 4

1 Delft University of Technology
2 University of Stuttgart (HLRS)
3 Universität Siegen
4 Technische Universität Darmstadt

Centres

The Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities (Leibniz-Rechenzentrum, LRZ) provides comprehensive services to scientific and academic communities by:

• giving general IT services to more than 100,000 university customers in Munich and for the Bavarian Academy of Sciences
• running and managing the powerful communication infrastructure of the Munich Scientific Network (MWN)
• acting as a competence centre for data communication networks
• being a centre for large-scale archiving and backup, and by
• providing High Performance Computing resources, training and support on the local, regional, national and international level.

Research in HPC is carried out in collaboration with the distributed, statewide Competence Network for Technical and Scientific High Performance Computing in Bavaria (KONWIHR).

Contact:
Leibniz Supercomputing Centre
Prof. Dr. Arndt Bode
Boltzmannstr. 1
85748 Garching near Munich
Germany
Phone +49-89-358-31-8000
[email protected]
www.lrz.de

Compute servers currently operated by LRZ are given in the following table (System | Size | Peak Performance (TFlop/s) | Purpose | User Community):

• IBM iDataPlex "SuperMUC" | 18 thin node islands, 512 nodes per island with 2 Intel Sandy Bridge EP processors each, 147,456 cores, 288 TByte main memory, FDR10 IB | 3,185 | Capability Computing | German universities and research institutes, PRACE projects (Tier-0 system)

• IBM Bladecenter HX5 "SuperMUC" | 1 fat node island with 205 nodes, 4 Intel Westmere EX processors each, 52 TByte main memory, QDR IB | 78 | Capability Computing |

• (accelerated nodes) | 32 accelerated nodes with 2 Intel Ivy Bridge EP and 2 Intel Xeon Phi each, 76 GByte main memory, dual-rail FDR14 IB | 100 | Prototype system |

• Linux-Cluster | 510 nodes with Intel Xeon EM64T/AMD Opteron (2-, 4-, 8-, 16-, 32-way), 2,030 cores, 4.7 TByte memory | 13.2 | Capability Computing | Bavarian and Munich Universities, LCG Grid

• SGI Altix ICE | 64 nodes with Intel Nehalem EP, 512 cores, 1.5 TByte memory | 5.2 | Capacity Computing | Bavarian Universities, PRACE

• SGI Altix Ultraviolet | 2 nodes with Intel Westmere EX, 2,080 cores, 6.0 TByte memory | 20.0 | Capability Computing | Bavarian Universities, PRACE

• Megware IB-Cluster "CoolMUC" | 178 nodes with AMD Magny-Cours, 2,848 cores, 2.8 TByte memory | 22.7 | Capability Computing, PRACE prototype | Bavarian Universities, PRACE

• MAC research cluster | 64 Intel Westmere cores, 528 Intel Sandy Bridge cores, 1,248 AMD Bulldozer cores, 8 NVIDIA GPGPU cards, 8 ATI/AMD GPGPU cards | 40.5 | Testing accelerated architectures and cooling technologies | Munich Centre of Advanced Computing (MAC), Computer Science TUM

Picture of the Petascale system SuperMUC at the Leibniz Supercomputing Centre. A detailed description can be found on LRZ’s web pages: www.lrz.de/services/compute


First German National Center
Based on a long tradition in supercomputing at the University of Stuttgart, HLRS (Höchstleistungsrechenzentrum Stuttgart) was founded in 1995 as the first German federal centre for High Performance Computing. HLRS serves researchers at universities and research laboratories in Europe and Germany and their external and industrial partners with high-end computing power for engineering and scientific applications.

Service for Industry
Service provisioning for industry is done together with T-Systems, T-Systems sfr, and Porsche in the public-private joint venture hww (Höchstleistungsrechner für Wissenschaft und Wirtschaft). Through this co-operation industry always has access to the most recent HPC technology.

Bundling Competencies
In order to bundle service resources in the state of Baden-Württemberg, HLRS has teamed up with the Steinbuch Centre for Computing of the Karlsruhe Institute of Technology. This collaboration has been implemented in the non-profit organization SICOS BW GmbH.

World Class Research
As one of the largest research centres for HPC, HLRS takes a leading role in research. Participation in the German national initiative of excellence makes HLRS an outstanding place in the field.

Contact:
Höchstleistungsrechenzentrum Stuttgart (HLRS)
Universität Stuttgart
Prof. Dr.-Ing. Dr. h.c. Dr. h.c. Michael M. Resch
Nobelstraße 19
70569 Stuttgart
Germany
Phone +49-711-685-87269
[email protected] / www.hlrs.de

Compute servers currently operated by HLRS (System | Size | Peak Performance (TFlop/s) | Purpose | User Community):

• Cray XE6 "Hermit" (Q4 2011) | 3,552 dual-socket nodes with 113,664 AMD Interlagos cores | 1,045 | Capability Computing | European and German research organizations and industry

• NEC Cluster (Laki, Laki2), heterogeneous computing platform of 2 independent clusters | 911 nodes, 9,988 cores, 23 TByte memory | 170 (Laki: 120.5 TFlop/s, Laki2: 47.2 TFlop/s) | | German universities, research institutes and industry

A detailed description can be found on HLRS's web pages: www.hlrs.de/systems

View of the HLRS Cray XE6 "Hermit"


The Jülich Supercomputing Centre (JSC) at Forschungszentrum Jülich enables scientists and engineers to solve grand challenge problems of high complexity in science and engineering in collaborative infrastructures by means of supercomputing and Grid technologies.

Provision of supercomputer resources of the highest performance class for projects in science, research and industry in the fields of modeling and computer simulation including their methods. The selection of the projects is performed by international peer-review procedures implemented by the John von Neumann Institute for Computing (NIC), GCS, and PRACE.

Supercomputer-oriented research and development in selected fields of physics and other natural sciences by research groups and in technology, e.g. by doing co-design together with leading HPC companies.

Implementation of strategic support infrastructures including community-oriented simulation laboratories and cross-sectional teams, e.g. on mathematical methods and algorithms and parallel performance tools, enabling the effective usage of the supercomputer resources.

Higher education for master and doctoral students in cooperation, e.g., with the German Research School for Simulation Sciences.

Contact:
Jülich Supercomputing Centre (JSC)
Forschungszentrum Jülich
Prof. Dr. Dr. Thomas Lippert
52425 Jülich
Germany
Phone +49-2461-61-6402
[email protected]
www.fz-juelich.de/jsc

Compute servers currently operated by JSC (System | Size | Peak Performance (TFlop/s) | Purpose | User Community):

• IBM Blue Gene/Q "JUQUEEN" | 28 racks, 28,672 nodes, 458,752 IBM PowerPC® A2 processors, 448 TByte main memory | 5,872 | Capability Computing | European universities and research institutes, PRACE

• Intel Linux Cluster "JUROPA" | 3,288 SMT nodes with 2 Intel Nehalem-EP quad-core 2.93 GHz processors each, 26,304 cores, 77 TByte memory | 308 | Capacity and Capability Computing | European universities, research institutes and industry, PRACE

• Intel GPU Cluster "JUDGE" | 206 nodes with 2 Intel Westmere 6-core 2.66 GHz processors each, 412 graphics processors (NVIDIA Fermi), 20.0 TByte memory | 240 | Capacity and Capability Computing | Selected HGF projects

• IBM Cell System "QPACE" | 1,024 PowerXCell 8i processors, 4 TByte memory | 100 | Capability Computing | QCD applications, SFB TR55, PRACE

JSC's supercomputer "JUQUEEN", an IBM Blue Gene/Q system.

Activities

Open Dialogue on Pre-Commercial Procurement of Innovative HPC Technology

The Human Brain Project (HBP) [1] is an ambitious, European-led scientific collaborative project, which has been supported since October 2013 by the European Commission through its FET Flagship Initiative [2]. The HBP aims to gather all existing knowledge about the human brain, build multi-scale models of the brain that integrate this knowledge, and use these models to simulate the brain on supercomputers.

Large-scale, memory-intensive brain simulations running on the future pre-exascale and exascale versions of the HBP Supercomputer will need to be interactively visualised and controlled by experimenters, locally and from remote locations. "Interactive Supercomputing" capabilities should allow the supercomputer to be used like a scientific instrument, enabling in silico experiments on virtual human brains. These requirements will affect the whole system design, including the hardware architecture, run-time system, mode of operation, resource management and many other aspects.

While supercomputers with exascale capabilities will become available sooner or later without any HBP intervention, it is unlikely that these future systems will meet unique HBP requirements without additional research and development (R&D). JSC, as the leader of the HBP's HPC Platform, is therefore considering carrying out a Pre-Commercial Procurement (PCP) to drive R&D for innovative HPC technologies that meet the specific requirements of the HBP.

PCP is a relatively new model of public procurement, designed to procure R&D services, which is promoted by the European Commission (EC) [3]. It is organized as a competitive process comprising three phases: 1) solution exploration, leading to design concepts, 2) prototyping, and 3) original development of limited volumes of first products. To select the supplier(s) best able to satisfy the goals of the PCP, the number of candidates is reduced at each phase after an evaluation of the respective results and of the bids for the following phase. The risks and benefits of the PCP are shared between the procurer and the suppliers. For instance, the intellectual property rights (IPR) resulting from the PCP may remain with the suppliers if the procurer is granted a suitable license allowing the procurer to use the IPR.

An Open Dialogue event with interested potential suppliers was held by the HBP in Brussels on December 18, 2013 as part of a market exploration. Its goal was to present the HBP PCP framework and process, as well as a first version of the technical goals. The agenda included an overview of the HBP as a whole and the roadmap for building the HBP's HPC Platform. Discussion sessions offered potential suppliers opportunities to ask questions. The suppliers were encouraged to provide feedback, during the event and afterwards.

The event was attended by representatives of more than 25 different companies. While many of the questions raised during the event were aimed at getting a better understanding of the presented technical goals, the majority of the feedback concerned legal issues, in particular IPR regulations. The feedback received represents valuable information that is being taken into account in the drafting of the call for tender.

All slides presented during the meeting are available online [4].

Acknowledgements
The Human Brain Project receives funding from the European Union's 7th Framework Programme (FP7) under Grant Agreement no. 604102.

Figure 1: The image shows VisNEST [5], an interactive analysis tool for neural activity data, being used in the aixCAVE at RWTH Aachen University (Source: Virtual Reality Group, RWTH Aachen University).

References
[1] https://www.humanbrainproject.eu/
[2] http://cordis.europa.eu/fp7/ict/programme/fet/flagship/
[3] http://cordis.europa.eu/fp7/ict/pcp/
[4] http://www.fz-juelich.de/SharedDocs/Termine/IAS/JSC/EN/events/2013/hbp-od-pcp-2013.html
[5] http://dx.doi.org/10.1109/BioVis.2013.6664348

contact: Boris Orth, [email protected]

• Dirk Pleiter
• Boris Orth

Jülich Supercomputing Centre (JSC)

Second JUQUEEN Porting and Tuning Workshop

Early last year the Jülich Supercomputing Centre (JSC) organized the First JUQUEEN Porting and Tuning Workshop. Encouraged by its success, we decided to start a series of such events and followed it with the Second JUQUEEN Porting and Tuning Workshop from February 3 to 5. This year, as in the previous year, the goal of the course was to make the participants more familiar with the system installed at JSC and to provide tools and ideas to help with porting their codes, analyzing their performance, and improving their efficiency. Apart from those topics, special attention was given to users from the field of Computational Fluid Dynamics (CFD) by providing dedicated talks and discussions.

Although the Blue Gene/Q installation at JSC with its 28 racks has been installed for over a year now, this PRACE Advanced Training Centre (PATC) course again attracted 30 participants, of which roughly half joined the CFD part of the workshop. The topics covered included in-depth talks on very specific hardware features of the Blue Gene/Q architecture, like transactional memory and speculative execution, alongside introductory talks to get the participants started. The latter consisted of best practices for programmers, overviews of available performance tools and debuggers, OpenMP, and parallel I/O. The main focus of the workshop was on hands-on sessions with the users' codes, which were supervised by members of staff from JSC's Simulation Laboratories and cross-sectional teams (Application Optimization, Performance Analysis, Mathematical Methods and Algorithms) as well as from IBM.

Participants working on CFD had an additional series of talks on specialized CFD-related subjects during the hands-on sessions. These talks were partly given by the participants themselves and focused on numerics, various CFD applications, and on experiences with CFD codes in HPC. In addition, every CFD group briefly presented the solver used in their application and discussed HPC-related aspects. These talks greatly enhanced the communication and interaction between the research groups attending the workshop. This special interest group within the workshop was organized by the JARA-HPC simulation laboratory "Highly Scalable Fluids & Solids Engineering", which offers support on questions beyond the pure HPC domain, tailored to specific CFD software tools and applications.

Even though a workshop of only three days' duration is too short to transfer all ideas from the talks to the users' codes, some immediate improvements were achieved. To provide an incentive to the participants, JSC awarded a prize for the best progress made during the course, which was decided via a vote on the last day.

The slides of the talks can be found on the web at http://www.fz-juelich.de/ias/jsc/jqws14

Figure 1: Participants of the Second JUQUEEN Porting and Tuning Workshop (Source: Forschungszentrum Jülich GmbH).

contact: Dirk Brömmel, [email protected]

• Dirk Brömmel 1
• Paolo Crosetto 1
• Mike Nicolai 2

1 Jülich Supercomputing Centre (JSC)
2 JARA-HPC

NIC Symposium 2014

The NIC Symposium is held every two years to give an overview of activities and results obtained by research groups receiving computing time grants on the supercomputers in Jülich through the John von Neumann Institute for Computing (NIC). The 7th NIC Symposium took place at Forschungszentrum Jülich from February 12 to 13, 2014 and was attended by about 140 scientists.

The participants were welcomed by Prof. Achim Bachem (Board of Directors of Forschungszentrum Jülich), who portrayed current research in Jülich, and by Prof. Thomas Lippert (JSC), who presented the Jülich approach to pre-exascale supercomputing.

Prof. Kurt Kremer (MPI for Polymer Science, Mainz) gave a historical account of soft matter computer simulations on the Jülich supercomputers provided through NIC and its predecessor, the Höchstleistungsrechenzentrum (HLRZ). He dedicated his talk to Prof. Kurt Binder on the occasion of his recent 70th birthday. Prof. Binder (University of Mainz) is one of the founding fathers of NIC/HLRZ [1] and currently serves as chairman of the NIC scientific council.

In the scientific programme, recent results in various fields of research, ranging from astrophysics to turbulence, were presented in 13 invited talks and in about 80 posters. Both the attendance and the number of poster presentations were a new record high. Ample discussion sessions after the talks and the poster session gave the participants rich opportunities to exchange ideas and methods in an interdisciplinary setting. The detailed programme, talks, posters, proceedings, and pictures are available at http://www.fz-juelich.de/ias/jsc/nic-symposium/.

References
[1] see inSiDE Spring 2012, p. 100.

contact: Walter Nadler, [email protected]

• Walter Nadler

Jülich Supercomputing Centre (JSC)

Figure 1: Participants of the NIC Symposium 2014 (Source: Forschungszentrum Jülich GmbH).


Opening Workshop of the Simulation Laboratory "Ab Initio"

Ab initio methods play a major role in the fields of chemistry, solid-state physics, nano-science and materials science. The impact of such methods has been steadily growing, currently generating computations which result in several thousands of scientific papers per year [1]. In order to keep up with this impressive output, it is crucial to preserve the performance and accuracy of ab initio simulations as the complexity of the physical systems under scrutiny is increased. Maintaining high performance and great accuracy poses significant challenges such as, for instance, simulating materials with interfaces, parallelizing and porting codes to new architectures, and converging the self-consistent field in Density Functional Theory (DFT) computations. In order to address these challenges, expertise in the fields of numerical mathematics, algorithmic development and hardware evolution is required.

There are three major factors hindering a straightforward achievement of performance and accuracy: 1) model diversity, 2) a need for scientific interdisciplinarity, and 3) implementation heterogeneity. Model diversity emerges from the necessity of solving a large spectrum of diverse scientific problems (e.g. band gaps, structural relaxation, magnetic properties, phonon transport, chemical bonds, crystallization, etc.) by simulating different aspects of a physical system. As a consequence, ab initio methods have been realized in a rich variety of mathematical models, some examples of which are Density Functional Theory with assorted discretizations, hybrid Quantum Mechanics/Molecular Mechanics methods and the Functional Renormalization Group approach, just to name a few. Driving the progress in all these methods requires an increasing cooperation within a multidisciplinary community which almost no working group worldwide can afford. Specifically, there is a necessity to establish extensive collaborations between physicists, chemists, computer scientists and applied mathematicians.

State-of-the-art codes are usually the result of a multi-person effort maintained over many years. The outcome is a large number of heterogeneous implementations, each one focusing on distinct physical aspects such as, for example, finite-size versus periodic systems or molecular versus solid compounds. Moreover, codes are written in a variety of styles and programming languages, culminating in huge legacy codes which are difficult to maintain and port to new parallel architectures.

The Jülich Aachen Research Alliance – High-Performance Computing (JARA-HPC) [2] has established the Simulation Laboratory "ab initio methods in Chemistry and Physics" (SLai) [3] with the purpose of addressing all three main classes of issues. SLai is part of the strategic effort to improve the scientific collaboration between RWTH Aachen and Forschungszentrum Jülich. Multidisciplinarity is deeply encoded in the DNA of the SimLab, which connects the large community of ab initio application users with the code developers and the supercomputing support team. The mission of the Lab is to provide expertise in the field of ab initio simulations for physics, chemistry, nano-science and materials science with a special focus on High Performance Computing at the Jülich supercomputer facilities. It also acts as a high-level support structure in dedicated projects and hosts research projects dealing with fundamental aspects of code development, algorithmic optimization and performance improvement. Examples of current SLai activities are
• development of fast eigensolvers tailored to DFT methods;
• implementation of efficient tensor contraction kernels for Coupled Cluster methods;
• generation of improved data structures for FLAPW-based methods;
• development of a universal preconditioner for the efficient convergence of the charge density in DFT.

On November 8, 2013 SLai held its kick-off event in the Rotunda Hall of the Jülich Supercomputing Centre [4]. The workshop was designed to introduce the SimLab to the local condensed-matter physics and quantum chemistry communities, bringing together 50 scientists from institutes within JARA. By sharing their work and scientific expertise, the participants established a platform defining opportunities for mutual collaboration and the prioritization of the SimLab's activities. Short informal talks were given by speakers from each of the main participating partner institutes and cross-sectional groups, followed by extensive constructive discussions between the speakers and the audience.

Figure 1: Participants of the opening workshop of the ab initio Simulation Laboratory.

References
[1] Burke, K.: Perspective on Density Functional Theory, J. Chem. Phys. 136, 150901, AIP Publishing LLC, 2012.
[2] http://www.jara.org/en/research/jara-hpc/
[3] http://www.jara.org/hpc/slai
[4] http://www.jara.org/hpc/slai/kick-off

contact: Edoardo A. Di Napoli, [email protected]

• Edoardo Di Napoli

JARA-HPC
Jülich Supercomputing Centre (JSC)


New Books in HPC

Sustained Simulation Performance 2013
Resch, M.M., Bez, W., Focht, E., Kobayashi, H., Kovalenko, Y. (Eds.)
2013, X, 157 p., 84 illus., 71 illus. in color.
Proceedings of the joint Workshop on Sustained Simulation Performance, University of Stuttgart (HLRS) and Tohoku University, 2013.

This book presents the state-of-the-art in High Performance Computing and simulation on modern supercomputer architectures. It covers trends in hardware and software development in general and specifically the future of high-performance systems and heterogeneous architectures. The application contributions cover Computational Fluid Dynamics, material science, medical applications and climate research. Innovative fields like coupled multi-physics or multi-scale simulations are presented. All papers were chosen from presentations given at the 16th Workshop on Sustained Simulation Performance, held in December 2012 at HLRS, University of Stuttgart, Germany, and the 17th Workshop on Sustained Simulation Performance at Tohoku University in March 2013.

Tools for High Performance Computing 2012
Cheptsov, A., Brinkmann, S., Gracia, J., Resch, M., Nagel, W.E. (Eds.)
2013, XI, 162 p., 70 illus. in color.

• Written by the leading experts on the HPC tools market
• Describes best practices of parallel tools usage
• Contains real-life practical examples

The latest advances in High Performance Computing hardware have significantly raised the level of available compute performance. At the same time, the growing hardware capabilities of modern supercomputing architectures have caused an increasing complexity of parallel application development. Despite numerous efforts to improve and simplify parallel programming, there is still a lot of manual debugging and tuning work required. This process is supported by special software tools facilitating debugging, performance analysis, and optimization, and thus making a major contribution to the development of robust and efficient parallel software. This book introduces a selection of the tools which were presented and discussed at the 6th International Parallel Tools Workshop, held in Stuttgart, Germany, September 25 to 26, 2012.

High Performance Computing in Science and Engineering '13
Nagel, Wolfgang E., Kröner, Dietmar H., Resch, Michael M. (Eds.)
2013, XIII, 697 p., 393 illus., 314 illus. in color.
Transactions of the High Performance Computing Center, Stuttgart (HLRS) 2013.

This book presents the state-of-the-art in simulation on supercomputers. Leading researchers present results achieved on systems of the High Performance Computing Center Stuttgart (HLRS) for the year 2013. The reports cover all fields of computational science and engineering, ranging from CFD via computational physics and chemistry to computer science, with a special emphasis on industrially relevant applications. Presenting results of one of Europe's leading systems, this volume covers the main methods in High Performance Computing. Its outstanding results in achieving highest performance for production codes are of particular interest for both the scientist and the engineer. The book comes with a wealth of coloured illustrations and tables of results.


HLRS Scientific Tutorials and Workshop Report and Outlook

HLRS has installed Hermit, a Cray XE6 system with AMD Interlagos processors and 1 PFlop/s peak performance, and will extend it with an XC30 system. We strongly encourage you to port your applications to these architectures as early as possible. To support such efforts we invite current and future users to participate in the special Cray XE6/XC30 Optimization Workshops. With these courses, we will give all necessary information to move applications to this Petaflop system. The Cray XE6 provides our users with a new level of performance. To harvest this potential will require all our efforts. We are looking forward to working with our users on these opportunities. The next four-day course in cooperation with Cray and multi-core optimization specialists is on September 23-26, 2014.

Programming of Cray XK7 clusters with GPUs is taught in OpenACC Programming for Parallel Accelerated Supercomputers – an alternative to CUDA from the Cray perspective – in spring 2015.

These Cray XE6/XC30 and XK7 courses are also presented to the European community in the framework of the PRACE Advanced Training Centre (PATC). GCS, i.e., HLRS, LRZ and the Jülich Supercomputing Centre together, serves as one of the first six PATCs in Europe.

Our general course on parallelization, the Parallel Programming Workshop, October 13-17, 2014 at HLRS, will have three parts: the first two days of this course are dedicated to parallelization with the Message Passing Interface (MPI). Shared memory multi-threading is taught on the third day, and in the last two days advanced topics are discussed. This includes MPI-2 functionality, e.g., parallel file I/O and hybrid MPI+OpenMP, as well as the upcoming MPI-3.0. As in all courses, hands-on sessions (in C and Fortran) will allow users to immediately test and understand the parallelization methods. The course language is English. Several three- and four-day courses on MPI & OpenMP will be presented at different locations over the year.

One of the flagships of our courses is the week on Iterative Solvers and Parallelization. Prof. A. Meister teaches basics and details on Krylov Subspace Methods. Lecturers from HLRS give lessons on distributed memory parallelization with the Message Passing Interface (MPI) and shared memory multi-threading with OpenMP. This course will be presented twice, on September 15-19, 2014 at LRZ and in spring 2015 at HLRS in Stuttgart.

Another highlight is the Introduction to Computational Fluid Dynamics. This course was initiated at HLRS by Dr.-Ing. Sabine Roller, who is now a professor at the University of Siegen. It is again scheduled in spring 2014 in Stuttgart and in September/October in Siegen. The emphasis is placed on explicit finite volume methods for the compressible Euler equations. Moreover, outlooks on implicit methods, the extension to the Navier-Stokes equations and turbulence modeling are given. Additional topics are classical numerical methods for the solution of the incompressible Navier-Stokes equations, aeroacoustics, and high-order numerical methods for the solution of systems of partial differential equations.

We also continue our series of Fortran for Scientific Computing on December 8-12, 2014 and in spring 2015, mainly visited by PhD students from Stuttgart and other universities who want to learn not only the basics of programming, but also to get an insight into the principles of developing high-performance applications with Fortran.

With Unified Parallel C (UPC) and Co-Array Fortran (CAF) in spring 2015, the participants will get an introduction to partitioned global address space (PGAS) languages.

In cooperation with Dr. Georg Hager from RRZE in Erlangen and Dr. Gabriele Jost from Supersmith, HLRS also continues with contributions on hybrid MPI & OpenMP programming with tutorials at conferences; see the box below, which also includes a second tutorial with Georg Hager from RRZE.

In the table below, you can find the whole HLRS series of training courses in 2014. They are organized at HLRS and also at several other HPC institutions: LRZ Garching, NIC/ZAM (FZ Jülich), ZIH (TU Dresden), ZDV (Uni Mainz), ZIMT (Uni Siegen) and Uni Oldenburg.

ISC and SC Tutorials
Georg Hager, Gabriele Jost, Rolf Rabenseifner: Hybrid Parallel Programming with MPI & OpenMP. Tutorial 04 at the International Supercomputing Conference, ISC'14, Leipzig, June 22-26, 2014.
Georg Hager, Jan Treibig, Gerhard Wellein: Node-Level Performance Engineering. Tutorial 01 at the International Supercomputing Conference, ISC'14, Leipzig, June 22-26, 2014.
Rolf Rabenseifner, Georg Hager, Gabriele Jost: Hybrid MPI and OpenMP Parallel Programming. Half-day Tutorial at Super Computing 2013, SC13, Denver, Colorado, USA, November 17-22, 2013.

2014 – Workshop Announcements

Scientific Conferences and Workshops at HLRS
13th HLRS/hww Workshop on Scalable Global Parallel File Systems (May 12-14, 2014)
ZIH+HLRS Parallel Tools Workshop (date and location not yet fixed)
High Performance Computing in Science and Engineering – The 17th Results and Review Workshop of the HPC Center Stuttgart (September 29-30, 2014)
IDC International HPC User Forum (October 28-29, 2014)

Parallel Programming Workshops: Training in Parallel Programming and CFD
Parallel Programming and Parallel Tools (TU Dresden, ZIH, February 24-27)
Cray XE6/XC30 Optimization Workshops (HLRS, March 17-20) (PATC)
Iterative Linear Solvers and Parallelization (HLRS, March 24-28)
Introduction to Computational Fluid Dynamics (HLRS, March 31 - April 4)
GPU Programming using CUDA (HLRS, April 7-9)
OpenACC Programming for Parallel Accelerated Supercomputers (HLRS, April 10-11) (PATC)
Unified Parallel C (UPC) and Co-Array Fortran (CAF) (HLRS, April 14-15) (PATC)
Scientific Visualisation (HLRS, April 16-17)
Summer School: Modern Computational Science in Quantum Chemistry (Uni Oldenburg, Aug 25 - Sep 05, 2014)
Iterative Linear Solvers and Parallelization (LRZ, Garching, September 15-19)
Introduction to Computational Fluid Dynamics (ZIMT Siegen, September/October)
Message Passing Interface (MPI) for Beginners (HLRS, October 13-14) (PATC)
Shared Memory Parallelization with OpenMP (HLRS, October 15) (PATC)
Advanced Topics in Parallel Programming (HLRS, October 16-17) (PATC)
Parallel Programming with MPI & OpenMP (FZ Jülich, JSC, December 1-3)

Training in Programming Languages at HLRS
Fortran for Scientific Computing (December 8-12) (PATC)

URLs:
http://www.hlrs.de/events/
http://www.hlrs.de/training/course-list
(PATC): This is a PRACE PATC course

• Rolf Rabenseifner

University of Stuttgart, HLRS

GCS – High Performance Computing Courses and Tutorials

High Performance Computing with Python

Date & Location
June 26-27, 2014
JSC, Forschungszentrum Jülich

Contents
Python is being increasingly used in high-performance computing projects such as GPAW. It can be used either as a high-level interface to existing HPC applications, as an embedded interpreter, or directly. This course combines lectures and hands-on sessions. We will show how Python can be used on parallel architectures and how performance-critical parts of the kernel can be optimized using various tools (a small illustrative code sketch follows after these course listings).

Webpage
http://www.fz-juelich.de/ias/jsc/events/hpc-python

Introduction to SuperMUC - the new Petaflop Supercomputer at LRZ (PATC course)

Date & Location
July 8-11, 2014
LRZ, Garching near Munich

Contents
This four-day workshop gives an introduction to the usage of SuperMUC, the new Petaflop class supercomputer at LRZ. The first three days are dedicated to presentations by Intel on their software development stack (compilers, tools and libraries); the remaining day will comprise talks and exercises delivered by IBM and LRZ on the IBM-specific aspects of the new system (IBM MPI, LoadLeveler, HPC Toolkit) and recommendations on tuning and optimizing for the new system.

Prerequisites
Participants should have good knowledge of HPC-related programming, in particular MPI, OpenMP and at least one of the languages C, C++ or Fortran.

Webpage
http://www.lrz.de/services/compute/courses/

Node-Level Performance Engineering (PATC course)

Date & Location
July 14-15, 2014
HLRS, Stuttgart

Contents
This course teaches performance engineering approaches on the compute node level. "Performance engineering" as we define it is more than employing tools to identify hotspots and bottlenecks. It is about developing a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code that does the actual computational work gets executed. Once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of optimizations can often be predicted. We introduce a "holistic" node-level performance engineering strategy, apply it to different algorithms from computational science, and also show how an awareness of the performance features of an application may lead to notable reductions in power consumption:
• Introduction
• Practical performance analysis
• Microbenchmarks and the memory hierarchy
• Typical node-level software overheads
• Example problems:
  - The 3D Jacobi solver
  - The Lattice-Boltzmann Method
  - Sparse Matrix-Vector Multiplication
  - Backprojection algorithm for CT reconstruction
• Energy & Parallel Scalability
Between each module, there is time for Questions and Answers!

Webpage
www.hlrs.de/training/course-list

Industrial Services of the National HPC Centre Stuttgart (HLRS)

Date & Location
October 1, 2014
HLRS, Stuttgart

Contents
In order to permanently assure their competitiveness, enterprises and institutions are increasingly forced to deliver the highest performance. Powerful computers, among the best in the world, can reliably support them in doing so. These courses are targeted towards decision makers in companies that would like to learn more about the advantages of using high performance computers in their field of business. They will be given extensive information about the properties and the capabilities of the computers as well as access methods and security aspects. In addition, we present our comprehensive service offering, ranging from individual consulting via training courses to visualization. Real world examples will finally allow an interesting insight into our current activities.

Webpage
http://www.hlrs.de/events/

Introduction to Parallel Programming with MPI and OpenMP

Dates & Location
July 16, 2014, and August 5-8, 2014
JSC, Forschungszentrum Jülich

Contents
The course provides an introduction to the two most important standards for parallel programming under the distributed and shared memory paradigms: MPI, the Message Passing Interface, and OpenMP. While intended mainly for the JSC guest students, the course is open to other interested persons upon request.

Webpage
http://www.fz-juelich.de/ias/jsc/events/mpi-gsp
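To give a flavour of what "using Python on parallel architectures" can look like in practice, the following minimal sketch distributes a sum over MPI ranks with mpi4py and NumPy. It is an illustration under our own assumptions (the choice of mpi4py, the script name, and all variable names), not part of the course material.

# Minimal sketch: MPI-style data parallelism from Python.
# Illustrative only; the course may use different packages and examples.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank sums a strided slice of the index range [0, n) ...
n = 1000000
local = np.arange(rank, n, size, dtype=np.float64)
partial = local.sum()

# ... and a reduction combines the per-rank results on every rank.
total = comm.allreduce(partial, op=MPI.SUM)

if rank == 0:
    print("ranks:", size, "total:", total)

Launched with, for example, mpiexec -n 4 python sum.py, every process runs the same script; only the rank-dependent slice of the work differs.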


Advanced Fortran Topics

Date & Location
September 8-12, 2014
LRZ, Garching near Munich

Contents
This course is targeted at scientists who wish to extend their knowledge of Fortran beyond what is provided in the Fortran 95 standard. Some other tools relevant for software engineering are also discussed. Topics covered include
• object oriented features
• design patterns
• generation and handling of shared libraries
• mixed language programming
• standardized IEEE arithmetic and exceptions
• I/O extensions from Fortran 2003
• parallel programming with coarrays
• source code versioning system (subversion)
To consolidate the lecture material, each day's approximately 4 hours of lectures are complemented by 3 hours of hands-on sessions.

Prerequisites
Course participants should have basic UNIX/Linux knowledge (login with secure shell, shell commands, simple scripts, editor vi or emacs). Good knowledge of the Fortran 95 standard is also necessary, such as covered in the February course at LRZ.

Webpage
http://www.lrz.de/services/compute/courses/

Iterative Linear Solvers and Parallelization (PATC course)

Date & Location
September 15-19, 2014
LRZ, Garching near Munich

Contents
The focus is on iterative and parallel solvers, the parallel programming models MPI and OpenMP, and the parallel middleware PETSc. Different modern Krylov subspace methods (CG, GMRES, BiCGSTAB ...) as well as highly efficient preconditioning techniques are presented in the context of real life applications (a small illustrative code sketch follows after these course listings). Hands-on sessions (in C and Fortran) will allow users to immediately test and understand the basic constructs of iterative solvers, the Message Passing Interface (MPI) and the shared memory directives of OpenMP. This course is organized by LRZ, University of Kassel, HLRS, and IAG.

Webpage
http://www.lrz.de/services/compute/courses/

Cray XE6/XC30 Optimization Workshop (PATC course)

Date & Location
September 23-26, 2014
HLRS, Stuttgart

Contents
HLRS installed HERMIT, a Cray XE6 system with AMD Interlagos processors and a performance of 1 PFlop/s. Currently, the system is being extended by a Cray XC30 system. We invite current and future users to participate in this special course on porting applications to our Cray architectures. The Cray XE6 and Cray XC30 will provide our users with a new level of performance. To harvest this potential will require all our efforts. We are looking forward to working with you on these opportunities. On the first three days, specialists from Cray will support you in your effort to port and optimize your application on our Cray XE6 / XC30. On the last day, Georg Hager and Jan Treibig from RRZE will present detailed information on optimizing codes on the multicore processors. Course language is English (if required).

Webpage
http://www.hlrs.de/events/

Message Passing Interface (MPI) for Beginners (PATC course)

Date & Location
October 13-14, 2014
HLRS, Stuttgart

Contents
The course gives a full introduction into MPI-1. Further aspects are domain decomposition, load balancing, and debugging. An MPI-2 overview and the MPI-2 one-sided communication are also taught. Hands-on sessions (in C and Fortran) will allow users to immediately test and understand the basic constructs of the Message Passing Interface (MPI). Course language is English (if required).

Webpage
http://www.hlrs.de/events/

Shared Memory Parallelization with OpenMP

Date & Location
October 15, 2014
HLRS, Stuttgart

Contents
This course teaches shared memory OpenMP parallelization, the key concept on hyper-threading, dual-core, multi-core, shared memory, and ccNUMA platforms. Hands-on sessions (in C and Fortran) will allow users to immediately test and understand the directives and other interfaces of OpenMP. Tools for performance tuning and debugging are presented. Course language is English (if required).

Webpage
http://www.hlrs.de/events/

Advanced Topics in Parallel Programming

Date & Location
October 16-17, 2014
HLRS, Stuttgart

Contents
Topics are MPI-2 parallel file I/O, hybrid mixed model MPI+OpenMP parallelization, MPI-3.0 parallelization of explicit and implicit solvers and of particle-based applications, parallel numerics and libraries, and parallelization with PETSc. Hands-on sessions are included. Course language is English (if required).

Webpage
http://www.hlrs.de/events/
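To make the Krylov subspace idea mentioned in the Iterative Linear Solvers course a little more concrete, here is a textbook, unpreconditioned conjugate gradient loop in NumPy. It is a sketch for illustration only, not course material; the test problem (a small 1D Laplacian) and all names are our own assumptions.

# Textbook (unpreconditioned) conjugate gradient for a symmetric
# positive definite system A x = b. Illustrative sketch only.
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Small SPD test problem: tridiagonal 1D Laplacian.
n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(A, b)
print("residual norm:", np.linalg.norm(b - A @ x))

In practice one would use a library such as PETSc, which the course covers, together with preconditioning and a sparse matrix format rather than the dense toy matrix used here.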


Scientific Visualization

Date & Location
October 20-21, 2014
HLRS, Stuttgart

Contents
This two-day course is targeted at researchers with basic knowledge in numerical simulation who would like to learn how to visualize their simulation results on the desktop but also in Augmented Reality and Virtual Environments. It will start with a short overview of scientific visualization, followed by a hands-on introduction to 3D desktop visualization with COVISE. On the second day, we will discuss how to build interactive 3D models for Virtual Environments and how to set up an Augmented Reality visualization.

Webpage
http://www.hlrs.de/events/

GPU Programming using CUDA

Date & Location
October 22-24, 2014
HLRS, Stuttgart

Contents
The course provides an introduction to the programming language CUDA, which is used to write fast numeric algorithms for NVIDIA graphics processors (GPUs). The focus is on the basic usage of the language, the exploitation of the most important features of the device (massively parallel computation, shared memory, texture memory) and efficient usage of the hardware to maximize performance. An overview of the available development tools and the advanced features of the language is given.

Webpage
http://www.hlrs.de/events/

Data Analysis and Data Mining with Python

Date & Location
November 10-12, 2014
JSC, Forschungszentrum Jülich

Contents
Pandas, matplotlib, and scikit-learn make Python a powerful tool for data analysis, data mining, and visualization. All of these packages and many more can be combined with IPython to provide an interactive, extensible environment. In this course, we will explore matplotlib for visualization, pandas for time series analysis, and scikit-learn for data mining. We will use IPython to show how these and other tools can be used to facilitate interactive data analysis and exploration (a small illustrative code sketch follows after these course listings).

Webpage
http://www.fz-juelich.de/ias/jsc/events/python-data-mining

Introduction to the Programming and Usage of the Supercomputer Resources at Jülich

Date & Location
November 27-28, 2014
JSC, Forschungszentrum Jülich

Contents
This course gives an overview of the supercomputers at Jülich. Especially new users will learn how to program and use these systems efficiently. Topics discussed are: system architecture, usage model, compilers, tools, monitoring, MPI, OpenMP, performance optimization, mathematical software, and application software.

Webpage
http://www.fz-juelich.de/ias/jsc/events/sc-nov

Parallel Programming with MPI and OpenMP

Date & Location
December 1-3, 2014
JSC, Forschungszentrum Jülich

Contents
The focus is on the programming models MPI and OpenMP. Hands-on sessions (in C and Fortran) will allow users to immediately test and understand the basic constructs of the Message Passing Interface (MPI) and the shared memory directives of OpenMP. Course language is German. This course is organized by JSC in collaboration with HLRS. Presented by Dr. Rolf Rabenseifner, HLRS.

Webpage
http://www.fz-juelich.de/ias/jsc/events/mpi

Fortran for Scientific Computing

Date & Location
December 8-12, 2014
HLRS, Stuttgart

Contents
This course is dedicated to scientists and students who want to learn (sequential) programming of scientific applications with Fortran. The course teaches the newest Fortran standards. Hands-on sessions will allow users to immediately test and understand the language constructs.

Webpage
http://www.hlrs.de/events/
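As a small, self-contained taste of the pandas / scikit-learn workflow described in the Data Analysis and Data Mining with Python course, the sketch below builds a toy table and clusters its rows. The data set, the choice of KMeans, and all names are illustrative assumptions, not course material.

# Toy pandas / scikit-learn workflow. Illustrative sketch only.
import pandas as pd
from sklearn.cluster import KMeans

# A small table with two measured quantities per sample.
df = pd.DataFrame({
    "temperature": [20.1, 20.3, 35.2, 35.0, 19.8, 34.9],
    "pressure":    [1.00, 1.01, 2.10, 2.08, 0.99, 2.11],
})

# Basic exploration with pandas ...
print(df.describe())

# ... and a simple data-mining step with scikit-learn:
# group the samples into two clusters.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = model.fit_predict(df[["temperature", "pressure"]])
print(df)

For visualization, matplotlib would typically be added, for example via df.plot.scatter(x="temperature", y="pressure", c="cluster").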

Contents

News
PRACE: Results of the 8th Regular Call 4
Optimizing a Seismic Wave Propagation Code for PetaScale-Simulations 5

Applications
Industrial Turbulence Simulations at Large Scale 8
High-Resolution Climate Predictions and Short-Range Forecasts 12
A Highly Scalable parallel LU Decomposition for the Finite Element Method 19
Plasma acceleration: from the laboratory to astrophysics 24
Towards Simulating Plasma Turbulences in Future large-scale Tokamaks 28
A Flexible, Fault-Tolerant Approach for Managing Large Numbers of Independent Tasks on SuperMUC 30
SHAKE-IT: Evaluation of Seismic Shaking in Northern Italy 36
Quantum Photophysics and Photochemistry of Biosystems 40
Ab Initio Geochemistry of the Deep Earth 44
Nonlinear Response of single particles in glassforming fluids to external forces 49
Is there Room for a Bound Dibaryon State in Nature? – An ab Initio Calculation of Nuclear Matter 52

Projects
CoolEmAll – Energy Efficient and Sustainable Data Centre Design and Operation 58
MyThOS: Many Threads Operating System 64
HPC for Climate-Friendly Power Generation 70
Going DEEP-ER to Exascale 72
Score-E – Scalable Tools for Energy Analysis and Tuning in HPC 74
UNICORE 7 Released 77
Project Mr. SymBioMath 80
ExaFSA – Exascale Simulation of Fluid-Structure-Acoustics Interactions 84

Centres 88

Activities 94

Courses 106


If you would like to receive inside regularly, send an email with your postal address to [email protected] © HLRS 2014