inSiDE • Vol. 8 No.1 • Spring 2010

Innovatives Supercomputing in Deutschland

Editorial

Welcome to this new issue of InSiDE, the journal on innovative supercomputing in Germany published by the Gauss Centre for Supercomputing.

The focus of the last months all over Europe was on the European HPC initiative PRACE (Partnership for Advanced Computing in Europe), and the Gauss Centre for Supercomputing has played a very active and important role in this process. As a result of all these activities PRACE is now ready for the implementation phase. Here you find an overview of the results achieved. In our autumn edition we will be able to report on the next steps, which will include the founding of a formal European organization.

PRACE is an important driving factor also for German supercomputing. Hence the Gauss Centre for Supercomputing has again stepped up its effort to provide leading-edge systems for Europe in the next years. HLRS should be able to announce its system soon after we go to press – so stay tuned for further information. LRZ has started its procurement process; InSiDE expects to report on the results in the autumn issue.

As we enter the Petaflop era, applications are the driving factors of all our activities. InSiDE presents the three winners of a Golden Spike Award of the HLRS. The applications honored by the HLRS steering committee report on direct numerical simulation of active separation control, the modelling of convection over West Africa, and the maturing of galaxies by black hole activity. Further application papers deal with the numerical simulation and analysis of cavitating flows, the driving mechanisms of the geodynamo, the physics of galactic nuclei, and the global parameter optimization of a physics-based united-residue force field for proteins.

Projects are an important driving factor for our community. They provide a chance to develop and test new methods and tools, to further improve existing codes, and to build the tools we need in order to harness the performance of our systems. A comprehensive projects section describes, among others, the recent advances in the UNICORE 6 middleware. This is a good example of how a nationally funded project that was supported by European projects turned into a community tool and is now taking the next step in the improvement process.

Let us also highlight a typical hardware project, which is an excellent example of how the community can drive hardware development. QPACE (Quantum Chromodynamics Parallel Computing on the Cell) is a project driven by several research institutions and IBM. The collaboration shows how we can deviate from mainstream technology and still work closely with leading-edge technology providers. The resulting merger of ideas and technologies provides an outstanding solution for a large group of application researchers.

As usual, this issue includes information about new systems and events in supercomputing in Germany over the last months and gives an outlook on workshops in the field. Readers are invited to participate in these workshops.

• Prof. Dr. H.-G. Hegering (LRZ)
• Prof. Dr. Dr. Th. Lippert (JSC)
• Prof. Dr.-Ing. M. M. Resch (HLRS)

Contents

News
PRACE: Ready for Implementation 4
New JUVIS Visualization Cluster 7

Applications
Direct Numerical Simulation of Active Separation Control Devices 8
Numerical Simulation and Analysis of Cavitating Flows 14
Modelling Convection over West Africa 18
Driving Mechanisms of the Geodynamo 26
The Maturing of Giant Galaxies by Black Hole Activity 32
The Physics of Galactic Nuclei 36
Global Parameter Optimization of a Physics-based United Residue Force Field (UNRES) for Proteins 40

Projects
Recent Advances in the UNICORE 6 Middleware 46
LIKWID Performance Tools 50
Project ParaGauss 54
Real World Application Acceleration with GPGPUs 58

Systems
Storage and Network Design for the JUGENE Petaflop System 62
QPACE 67

Centres 68
Activities 74
Courses 88

PRACE: Ready for Implementation

In October 2009 PRACE [1], the Partnership for Advanced Computing in Europe, demonstrated to a panel of external experts and the European Commission that the project made "satisfactory progress in all areas" [2] and "that PRACE has the potential to have real impact on the future of European HPC, and the quality and outcome of European research that depends on HPC services".

This marked an important milestone towards the implementation of the permanent HPC Research Infrastructure in Europe, consisting of several world-class top-tier centres managed as a single European entity. It will be established temporarily as an association with its seat in Lisbon in spring 2010. The project has prepared signature-ready statutes and other documents required to create the association as a legal entity. Among them is the Agreement for the Initial Period, which covers the commitments of the partners during the first five years, in which national procurements are foreseen by the hosting partners. Germany, France, Italy, and Spain have each committed themselves to provide supercomputing resources worth at least 100 Million Euro to PRACE during the initial period. The Netherlands and the UK are expected to follow suit. Access to the PRACE resources will be governed by a single European Peer Review system.

PRACE is one of 34 Preparatory Phase projects funded in part by the European Commission to prepare the important European Research Infrastructures identified in 2006 by ESFRI [3]. Having started in January 2008, PRACE is among the first projects to complete its work after only two years. This includes not only the legal and administrative tasks and the securing of adequate funds but, due to the specific challenges of the rapidly evolving nature of High Performance Computing, also very important technical work.

PRACE identified and classified a set of over twenty applications that constitute a cross-section of the most important scientific applications that consume a large portion of the resources of Europe's leading HPC centres. When necessary, these applications were ported to the architectures that PRACE identified as likely candidates for near-term Petaflop/s systems. The focus was, however, to evaluate how well the applications scale on the different architectures, how well the systems enable petascaling, and how difficult it is to exploit the parallelism provided by the hardware. Ten of these applications constitute the core of a benchmark suite that may be used in the procurements of the next Tier-0 systems.

In parallel PRACE conducted an in-depth survey of the HPC market involving all relevant vendors. The objective was to classify and select architectures promising to deliver systems capable of a peak performance of one Petaflop/s in 2009/10. As one important result a comprehensive set of architectures was defined, and prototypes covering this set were deployed at partner sites. In detail these were: IBM Blue Gene/P (MPP [4]) at FZJ, CRAY XT5 (MPP) at CSC, IBM Power 6 (SMP-FN) at SARA, IBM Cell at BSC (Hybrid), NEC SX-9 coupled with Nehalem (Hybrid) at HLRS, and Intel Nehalem clusters by Bull and Sun at CEA and FZJ, respectively (SMP-TN). All prototypes were evaluated in close-to-production conditions using synthetic benchmarks to measure system performance, I/O bandwidth, and communication characteristics. Most applications from the above set were evaluated on at least three different prototypes. The results contain a wealth of information for the selection of future systems and show where vendors and application developers should focus to improve performance. The Blue Gene/P installed at FZJ in 2009 is the first European Petaflop/s system that qualifies as a PRACE Tier-0 system. In addition, PRACE made the prototypes available to users through open calls for proposals. In the first and second round, over 8.9 million CPU hours were granted to nine research teams. The results will be published on the PRACE web site.

To stay at the forefront of HPC, PRACE has to play an active role: the project evaluated promising hardware and software components for future multi-Petaflop/s systems, including computing nodes based on NVIDIA as well as AMD GPUs, Cell, ClearSpeed-Petapath, FPGAs, and Intel Nehalem and AMD Barcelona and Istanbul processors; systems to assess memory bandwidth, interconnect, power consumption and I/O performance as well as hybrid and power-efficient solutions; and programming models or languages such as Chapel, CUDA, RapidMind, HMPP, OpenCL, StarSs (CellSs), UPC and CAF. PRACE co-funded work to extend the world's "greenest" supercomputer QPACE [5] for general purpose applications.

Finally, promising technologies, especially with respect to energy efficiency, will be evaluated with the ultimate goal to engage in collaborations with industrial partners to develop products, exploiting STRATOS, the PRACE advisory group for Strategic Technologies created in the PRACE Preparatory Phase project.

PRACE Management Board welcomes Bulgaria and Serbia to the PRACE Initiative (Toulouse, September 8, 2009). Photo by Thomas Eickermann.

PRACE started an advanced training and education programme for scientists and students in advanced programming techniques, novel languages, and tools. Among them were a summer school (Stockholm, 2008) and a winter school (Athens, 2009), six code porting workshops, and two seminars for industry. This will be continued and intensified in the permanent Research Infrastructure to ensure that users can effectively harness the hundred-thousand-fold parallelism and exploit the opportunities of novel or heterogeneous architectures in systems containing Cell processors or GPUs. Over 48 hours of training material are available on the PRACE web site as streaming videos.

The full impact of PRACE for European science and industry can only be achieved if it serves informed users. PRACE partners gave over 90 presentations at scientific, technical, and policy-setting events. PRACE showcased its achievements in the exhibitions of the Supercomputing Conferences in the US, the International Supercomputing Conferences (ISC) in Germany, and the EU-initiated ICT conferences. The PRACE award for the best paper on petascaling by a student or young European scientist at ISC is already a recognized tradition.

The successful achievements of the PRACE project were important prerequisites to become eligible to apply for a grant of 20 Mio Euro supporting the implementation phase. PRACE submitted a related proposal for a PRACE First Implementation Project in November 2009. The number of partners grew from 16 in 14 countries to 21 in 20 states [6], demonstrating the importance of HPC to Europe. Of course, the permanent PRACE Research Infrastructure will be open to all European states. The Gauss Centre for Supercomputing represents Germany both in the legal entity and in the project. The project partners asked the Research Centre Jülich to coordinate also the Implementation Phase project and to continue the very successful management provided by JSC's project management office.

The implementation phase project is expected to start in July 2010. To keep the momentum among the project partners and bridge the gap, an extension of the project until June 2010 has been proposed by the project and granted by the European Commission. This will allow the continuation of dissemination and training, the porting and scaling of additional applications, and the evaluation of future technologies.

PRACE is well positioned and ready to shape the European HPC ecosystem.

References
[1] The PRACE project receives funding from the European Community's Seventh Framework Programme under grant agreement no RI-211528. Additional information can be found under www.prace-project.eu
[2] The reviewers may only tick yes or no beside this predefined text when evaluating a project.
[3] ESFRI: European Strategy Forum on Research Infrastructures, report at ftp://ftp.cordis.europa.eu/pub/esfri/docs/esfri-roadmap-report-26092006_en.pdf
[4] MPP: Massively Parallel Processor; SMP-FN/TN: Symmetric Multi Processor – Fat Node/Thin Node
[5] See: http://www.green500.org/lists/2009/11/top/list.php
[6] Germany is represented by GCS as member and FZJ as coordinator to comply with EC regulations.

New JUVIS Visualization Cluster

JSC offers a new visualization service which was made possible by additional BMBF funding dedicated to the D-Grid project. The compute cluster, named JUVIS, is intended to be used as a parallel server for the visualization of scientific data. JUVIS allows the parallel processing of data as well as parallel rendering, while the final image is displayed on the desktop of the user's local workstation. The cluster has been procured, installed and tested at JSC and has been ready for operation for a couple of months. It consists of one login node, 16 compute nodes, and one file server with a 10.5 TB RAID system. In total the compute nodes are equipped with 128 CPU cores (32 quad-core Intel Xeon E5450 processors, 3.00 GHz), 256 GB main memory, and Nvidia Quadro FX 4600 GPUs.

In a first step, the visualization software ParaView was installed on JUVIS. ParaView is an open-source, multi-platform data analysis and visualization application based on the Visualization Toolkit (VTK). It is able to run in parallel on the new visualization cluster, allowing the post-processing and remote visualization of very large data sets without the need to transfer the data to the local workstation.

In addition to ParaView, many other visualization software systems can be used on JUVIS with remote rendering enabled. For this purpose, a Virtual Network Computing (VNC) server and the VirtualGL package are installed. This infrastructure allows the usage of any OpenGL-based visualization software with full 3D hardware acceleration on JUVIS, while the resulting image is displayed on the user's desktop in a VNC client window.
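To give an impression of the client/server workflow, the following short sketch drives a parallel ParaView session on a remote visualization cluster from the local ParaView Python client. It is not taken from the JUVIS documentation: the host name, port, file path and array name are placeholders, and it assumes that a parallel pvserver has already been started on the cluster through the site's usual launch mechanism.

```python
# Minimal, hypothetical sketch of remote parallel visualization with ParaView's
# Python interface; only the rendered result travels to the local desktop.
from paraview.simple import (Connect, OpenDataFile, Contour, Show, Render,
                             SaveScreenshot)

# Connect the local client to the remote pvserver; the data stay on the cluster.
Connect("juvis.example.org", 11111)                    # placeholder host/port

# Open a data set that resides on the cluster's file system.
data = OpenDataFile("/work/project/flow_0480.vtk")     # placeholder path

# Extract an iso-surface in parallel on the server side.
iso = Contour(Input=data, ContourBy=["POINTS", "pressure"], Isosurfaces=[0.5])

Show(iso)
Render()
SaveScreenshot("iso_pressure.png")
```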

Golden Spike Award by the HLRS Steering Committee in 2009

Direct Numerical Simulation of Active Separation Control Devices

Aircraft design has always been a struggle for the most "efficient" form or method. The most efficient has been defined either by increased peak performance or, in more recent times, by reduced operational cost. In engineering terms reduced operational cost equals reduced or constant drag with constant or increased lift. Today's commercial airliners commonly encounter so-called flow separation on the wing flaps during landing, which has led to the employment of rather complex and heavy flap mechanics to compensate for the reduced lift. Therefore the suppression of such flow separation constitutes a vital step towards more efficient wing and flap design. The cause of flow separation are positive pressure gradients along the streamline, which act as an obstacle to the oncoming fluid. In order to enable the fluid to overcome larger adverse pressure gradients it has to be energized (see Figure 1). One method to do so is to move high-energy fluid out of the undisturbed freestream towards the surface. The necessary vortical motion is induced by a large longitudinal eddy created with so-called Vortex Generators (VG). Already in use are passive vortex generators, i.e. small sheets attached to the wing surface. Indeed these VGs are well able to generate the aforementioned eddies, but there are also disadvantages due to the fact that passive VGs are optimized for a specific point of operation and induce parasitic drag at all other flight attitudes. In 1992 experimental work by Johnston et al. [1] showed the general ability to suppress flow separation using so-called Jet Vortex Generators (JVG). The major advantage over solid VGs stems from the fact that JVGs are actively controllable and thus cause no additional drag outside the point of operation. These promising findings led to intensive research in JVGs and culminated in exhaustive experimental parameter studies covering many aspects such as velocity ratio, radius and blowing angle.

Figure 1: Scheme of Jet Vortex Actuator

Albeit the outcomes of these experiments yield a very good general idea of the mechanisms of active flow control devices, there are still a number of open questions involved, as no detailed picture of the formation of the vortex and its interaction with the boundary layer could be gained from experiments yet. Therefore, any design suggestions rely heavily on empirical data and are difficult to transpose to different configurations. During industrial design processes parameter studies are thus often undertaken numerically by means of Reynolds-Averaged Navier-Stokes (RANS) methods. The crux of RANS lies in the fact that the involved model assumptions have to be adapted to every new configuration and in the fact that the underlying equations are inherently steady-state and thus do not really allow for simulation of transient processes. Within this context the presented work covers simulations of a generic JVG configuration by means of highly accurate Direct Numerical Simulation (DNS) methods. The DNS approach is used for its lack of any model assumptions. Therefore, it is well suited to provide a reference solution for coarser or "more approximate" numerical schemes. Furthermore, DNS allows for a computation of the unsteady flow formation, especially in the beginning of the vortex generation, and a detailed analysis of the fluid dynamics involved.

Figure 2: Instantaneous streamwise velocity contours and isosurfaces of vorticity visualized by the λ2 criterion (β=150°)

Navier-Stokes 3D

The numerical solver Navier-Stokes 3D (NS3D) has been developed at the Institut für Aero- und Gasdynamik (IAG) at Stuttgart University [2]. The program is based on the complete Navier-Stokes equations, i.e. no turbulence or small-scale models are used. The computational domain consists of a rectangular box. Boundary conditions are applied to define an inflow, outflow, freestream and wall region. The fundamental differential equations are converted into approximate difference equations using finite differences on a structured grid. The finite difference stencils are chosen to be of compact form, i.e. information of both the flow variable and its derivative is taken into account, resulting in a numerical scheme of spectral-like wave resolution in space. The evolution in time is simulated by an explicit Runge-Kutta time integration scheme which maintains the accuracy of the spatial resolution. The resulting linear system of equations can be solved very efficiently on a vector CPU like the NEC SX-9 at HLRS Stuttgart because of its strongly structured form. Furthermore the domain can be split into boxes of equal dimensions which are then assigned to one computational process each. Interprocess communication is realized by use of Message Passing Interface (MPI) routines. Within each MPI process the domain is furthermore parallelized employing NEC Microtasking shared-memory parallelization. All computations described here have been run on one node of the NEC SX-9 within 12 hrs computing time.
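The compact ("Padé-type") stencils mentioned above couple the unknown derivative values at neighbouring points through a small banded system, which is what gives them their spectral-like resolution. The sketch below is not taken from NS3D (a Fortran code for the NEC SX-9); it merely illustrates, in Python and under the assumption of a periodic 1-D grid, how the standard fourth-order compact first-derivative scheme leads to such a (cyclic) tridiagonal system.

```python
# Illustrative sketch only; NS3D's actual stencils and boundary treatment differ.
import numpy as np

def compact_first_derivative(f, h):
    """Fourth-order compact (Pade) first derivative on a periodic grid.

    Solves  (1/4) f'_{i-1} + f'_i + (1/4) f'_{i+1} = 3 (f_{i+1} - f_{i-1}) / (4 h)
    for all points simultaneously, i.e. the derivative values are coupled
    through a cyclic tridiagonal system - the "compact" idea described above.
    """
    n = f.size
    A = np.eye(n) + 0.25 * (np.eye(n, k=1) + np.eye(n, k=-1))
    A[0, -1] = A[-1, 0] = 0.25                       # periodic wrap-around
    rhs = 3.0 * (np.roll(f, -1) - np.roll(f, 1)) / (4.0 * h)
    return np.linalg.solve(A, rhs)                   # a banded solver would be used in practice

# Quick check against an analytic derivative.
n = 64
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
df = compact_first_derivative(np.sin(x), x[1] - x[0])
print(np.max(np.abs(df - np.cos(x))))                # small fourth-order error
```

In the production code the analogous banded systems arise along the grid lines of the structured three-dimensional grid, which is why they can be solved so efficiently on the vector CPUs of the NEC SX-9.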

Jet Vortex Generator Simulations

Numerical simulations have been done for two jet-in-crossflow configurations which differ in the jet exit's orientation only. The configurations were chosen to closely match experiments described in [3] and are representative of separation control devices. The crossflow represents a laminar boundary layer on a flat plate with zero pressure gradient at a freestream Mach number of Ma=0.25, which corresponds to landing speed for commercial airliners. The jet is described by a steady velocity distribution at the wall boundary of the computational domain and resembles the profile of a pipe flow. The jet-to-crossflow velocity ratio is set to R=3. The jets are inclined by an angle of attack α=30° and skewed to the downstream direction by angles β=30° and β=150°, respectively. One can imagine an oblique jet to be blown either against the oncoming main flow or in line with it. Altogether 18.4 million grid nodes are used and computations have been carried out for 31 flow-through times. Albeit this does not suffice to provide data for statistical analysis, it yields a good picture of the evolution of the perturbed flow field.

Even though both the jet and the laminar boundary layer are initially in a steady state, the resulting flow regime becomes highly unsteady, with the jet exhibiting unstable modes leading to the formation of large transient vortex structures reaching far out of the boundary layer (Figure 2). The jet affects the boundary layer by mainly two mechanisms, namely the blockage of the oncoming fluid and the strong shear at the jet slopes. The blockage leads to flow structures similar to the wakes found behind solid obstacles, i.e. periodic eddy shedding close to the wall and a so-called horseshoe vortex which wraps around the jet. Furthermore the oncoming flow deflects the jet into the main streamwise direction, which leads to additional flow features unique to jets in crossflow, such as the entrainment of near-wall fluid layers upwards into the jet wake. Both the induced perturbations of the crossflow behind the jet exit and the fluid entrainment evidently depend on the jet parameters. The shear on the slope of the jet, on the other hand, induces a rotational motion which is prolonged and fortified behind the jet exit and leads to the formation of a large longitudinal eddy above the wall. In order to gain an insight into the fluid dynamics, snapshots of the flow field are recorded and evaluated. Figures 3 and 4 depict the flow at the last recorded time step. In case of a skew angle of β=30°, two longitudinal vortices establish in the flow and wrap around each other due to the induced velocities in the transversal plane. The development of a high-speed streak close to the wall takes place with an inclination to the centerline of the flow. The two longitudinal vortices seem not subject to instabilities at that point in time. They stretch rather well-defined downstream and upwards along the plate. The oncoming freestream deflects the jet in both spanwise and wall-normal direction. Thus a strong shear layer develops on the top side of the jet. The sweep of the jet over the boundary layer results in the near-wall fluid layers being entrained upwards into the trajectory of the jet.
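The jet described above enters only through its footprint on the wall boundary: a steady, pipe-flow-like velocity profile tilted by the pitch angle α and the skew angle β and scaled to the velocity ratio R. The short sketch below is not part of NS3D; it simply illustrates, under the assumption of a parabolic (Poiseuille-type) exit profile and a made-up orifice radius, how such a wall boundary condition could be assembled.

```python
# Hedged sketch only: assemble a steady jet-exit velocity field on the wall
# plane for given pitch/skew angles (profile shape and radius are assumptions).
import numpy as np

def jet_wall_velocity(x, z, x0=0.0, z0=0.0, radius=0.5,
                      R=3.0, u_inf=1.0, alpha_deg=30.0, beta_deg=30.0):
    """Return (u, v, w) on the wall plane y = 0 for an inclined, skewed jet.

    The magnitude follows a parabolic pipe-flow profile with peak velocity
    R * u_inf at the orifice centre; the direction is tilted by the pitch
    angle alpha (out of the wall) and the skew angle beta relative to the
    oncoming flow, as in the two configurations beta = 30 and 150 degrees.
    """
    alpha, beta = np.radians(alpha_deg), np.radians(beta_deg)
    r2 = ((x - x0) ** 2 + (z - z0) ** 2) / radius ** 2
    mag = np.where(r2 < 1.0, R * u_inf * (1.0 - r2), 0.0)   # zero outside the orifice
    u = mag * np.cos(alpha) * np.cos(beta)    # streamwise component
    v = mag * np.sin(alpha)                   # wall-normal component
    w = mag * np.cos(alpha) * np.sin(beta)    # spanwise component
    return u, v, w

# Example: sample the boundary condition on a small patch of the wall.
x, z = np.meshgrid(np.linspace(-1, 1, 41), np.linspace(-1, 1, 41), indexing="ij")
u, v, w = jet_wall_velocity(x, z)
```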

Figure 3: Instantaneous downstream velocity and streamlines (β=30°)
Figure 4: Instantaneous downstream velocity and streamlines (β=150°)

The entrainment region does not extend very far downstream though, and the exchange of high- and low-speed layers inside the boundary layer is not very strong. A different flow can be observed when the jet is blown against the main flow. Firstly, the blockage is a lot stronger than in the previous case. Secondly, the wake exhibits instabilities which lead to a mixing of the fluid layers in all directions. Again the jet wake gets deflected, but additionally the entrainment region extends further downstream and the wake rolls up into a longitudinal vortex.

Boundary Layer Control

As already mentioned, jet-in-crossflow configurations are investigated as a means to suppress boundary layer separation. This is to be achieved by moving faster fluid layers closer to the wall and therewith increasing the wall friction. Figure 5 depicts a comparison of the time-averaged wall-shear-stress distribution for the simulated cases. Both configurations lead to a net increase of wall shear stress in a confined stripe behind the jet. In case of a jet angle of β=30°, though, this area overlaps the trajectory of the jet only marginally. The additional energy is fed to the boundary layer in an overly straightforward fashion by adding momentum. In order to exploit such a mechanism with a reasonable input-to-output ratio, a tangential jet seems more advisable. On the other hand, when the jet is directed against the main flow the region of increased shear extends promisingly in both downstream and spanwise direction. The reason is the different fluid dynamics at work. Firstly, the jet acts as a turbulator, which leads to fuller velocity profiles compared to the unperturbed flow; furthermore, the jet-crossflow interaction leads to the formation of a longitudinal vortex which transports fast fluid layers closer to the wall. Therefore this configuration seems feasible for active flow control devices even in already fully turbulent shear flows as they are found on aircraft airfoils and flaps.

Figure 5: Comparison of mean wall shear distribution in actuated flow (top: β=150°, bottom: β=30°)

Conclusion

Simulations of two different jet-in-crossflow configurations have been carried out in order to investigate the effect on a boundary layer at Ma=0.25. The jets have been skewed and inclined by values typically found in active-flow-control device setups. The simulations gave insight into the generation of a jet-vortex system and wake perturbations. In case of a jet aligned with the mean flow, a stable flow established in which the boundary layer was only poorly energized by the jet. A skew of the jet against the main flow led to the generation of a longitudinal vortex, which in turn led to significantly increased wall friction. Following simulations may include further variation of the pitch and skew angles of the jet as well as variation of the velocity ratio. Also of interest is a change of initial conditions towards a fully developed turbulent boundary layer.

Acknowledgements
We thank the High Performance Computing Center Stuttgart (HLRS) for the provision of supercomputing time and technical support within the project "LAMTUR".

References
[1] Johnston, J. P., Nishi, M., "Vortex Generator Jets – a Means for Flow Separation Control", AIAA Journal, 28(6), pp. 989–994, 1990
[2] Babucke, A., Kloker, M., Rist, U., "Direct Numerical Simulation of a Serrated Nozzle End for Jet-noise Reduction", in M. Resch (Ed.), High Performance Computing in Science and Engineering 2007, Springer, 2007
[3] Casper, M., Kähler, C. J., Radespiel, R., "Fundamentals of Boundary Layer Control with Vortex Generator Jet Arrays", 4th Flow Control Conference, AIAA Paper 2008-3995, 2008

• Björn Selent
• Ulrich Rist

Institute of Aero- and Gasdynamics, University of Stuttgart

Numerical Simulation and Analysis of Cavitating Flows

Cavitation is a phenomenon which occurs in liquid flows if the pressure decreases locally below the vapor pressure. The liquid evaporates and a cavity of vapor is generated. With respect to fluid dynamics within hydraulic machines such as injection nozzles, pumps or ship propellers, local decompression of the liquid towards or even below vapor pressure commonly occurs [3]. Thereby, the liquid partially evaporates and completely alters its thermodynamic and fluid dynamic properties. Aside from harmful characteristics like cavitation erosion and noise production, an increasing number of modern applications benefit from controlled cavitation. Modeling, simulation and analysis of cavitating flows is a challenging task due to the strongly increased complexity as compared to pure liquid or pure gas flows. The main challenge is to take compressibility effects of the liquid and of the vapor into account. This is necessary in order to predict the formation and propagation of violent shocks due to collapsing vapor bubbles. Furthermore, the physics of cavitating flows contains multiple spatial and temporal scales that need to be sufficiently resolved by the numerical method.

Figure 1: Three instances in time of the predicted cavitating flow around a standardized shaped hydrofoil (NACA 0015) (Δtpic = 3.4 · 10^-3 s). The iso-surfaces represent the vapor volume fraction of α = 5%, that means 5% of the computational cell volume is covered by vapor. (The images were rendered with Blender using 8 CPU cores on the remote visualization server.)

Figure 2: Vapor volume fraction α within a cavity at a cut plane at midspan position. The detached cavity shows a complex inner structure.

To investigate the fundamental dynamics of cavitation we compute the shedding of sheet and cloud cavitation around a NACA 0015 hydrofoil within a rectangular test section [1]. The simulations were performed by applying the in-house CFD tool CATUM (CAvitation Technische Universität München), which is described in [4]. This tool relies on the Finite Volume Method and operates on block-structured meshes. It is written in FORTRAN 95 and parallelized with MPI. A special feature of CATUM is its low-Mach-number consistency, which is necessary to simulate wave propagation at 1500 m/s even for very low speed liquid flows. To reduce the computational effort a grid sequencing technique is applied as follows. The simulation is initiated on a coarse grid consisting of 2 x 10^5 cells. The solution of this simulation is interpolated onto a refined grid and provides the initial conditions for the next run on the refined grid. This grid sequencing procedure is repeated two times until the finest grid is reached. The finest grid consists of 2.4 x 10^7 cells, 250 in spanwise direction and 600 around the hydrofoil. The complete computation required about 10^5 CPU hours and was performed using 64 to 192 CPU cores on a compute cluster of the Leibniz-Rechenzentrum (LRZ). In order to resolve wave dynamics we apply numerical time steps of 8.5 x 10^-8 s. Every 1000th time step was stored on disk, giving a total of 480 data sets. Only this high resolution in time allows us to investigate the propagation of shock waves in the flow field in the postprocessing phase. Due to the high resolution in time and space, 700 GB of data were produced. To analyze this amount of data the new remote visualization server of the LRZ is used. This machine is directly connected to the high-performance file system of the compute servers and, therefore, no transfer of data to a local machine is necessary.
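The grid sequencing just described can be summarized as a simple driver loop: converge or advance the solution on a coarse mesh, interpolate it to the next finer mesh as the initial condition, and repeat until the target resolution is reached. The following Python sketch is purely illustrative and not part of CATUM (a Fortran 95/MPI code); the solver and interpolation routines are hypothetical stand-ins.

```python
# Illustrative driver for the grid-sequencing idea (not CATUM itself).
# `run_solver` and `interpolate_to` are hypothetical stand-ins for the CFD kernel
# and a prolongation operator onto the finer mesh.
def grid_sequencing(grids, run_solver, interpolate_to, initial_state):
    """Advance the solution through a hierarchy of successively refined grids.

    grids           -- list of meshes ordered from coarsest to finest
                       (e.g. 2e5 cells up to 2.4e7 cells as in the article)
    run_solver      -- run_solver(grid, state) -> advanced state on `grid`
    interpolate_to  -- interpolate_to(state, fine_grid) -> state on `fine_grid`
    initial_state   -- initial condition on the coarsest grid
    """
    state = initial_state
    for level, grid in enumerate(grids):
        state = run_solver(grid, state)                 # solve on the current level
        if level + 1 < len(grids):                      # prolongate to the next level
            state = interpolate_to(state, grids[level + 1])
    return state                                        # solution on the finest grid
```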

Additionally, the throughput rate is significantly higher than that of a local disc, which is crucial for time-dependent flow calculations. For data analysis, VisIt (using up to 24 CPU cores) and Tecplot (using up to 8 CPU cores) on the remote visualization server were used to extract flow features like iso-surfaces of the vapor volume fraction α, slices of the flow field and flow vectors.

The grid sequencing technique enables an investigation of the grid dependency of the solutions. We observe that large-scale structures like the shedding of vapor clouds are grid independent and that the cavity length and the shedding frequency remain constant for all grids. However, the flow computed on the finest grid exhibits more small-scale structures like horseshoe vortices and vortices in spanwise direction, as shown in Figure 1. A detailed study of the inner structure of the vapor regions was carried out on the finest grid. Figure 2 shows the vapor fraction at a cut plane at midspan position of the hydrofoil at one instance in time. It is observed that the attached cavity at the leading edge consists of almost pure vapor (white color), whereas the detached cloud exhibits a complex inner structure with strong variations of the vapor content. Each time such a cloud collapses, strong shocks are generated. Figure 3 depicts a collapse directly at the trailing edge of the hydrofoil which generates a maximum pressure of p = 225 bar. These pressure peaks at the surface are responsible for the damage of the material which is known as cavitation erosion. By recording the impact positions and the maximum pressures at the surface of the hydrofoil during the whole simulation we obtain detailed information on erosion-sensitive areas [2], as shown in Figure 4. We observe instantaneous maximum loads of up to 2500 bar. Our investigation confirms experimental observations that the regions close to the trailing edge are highly vulnerable to cavitation erosion.

Acknowledgments
Our investigations on the numerical simulation of cavitating flows, especially the prediction of erosion-sensitive areas, are supported by the KSB Stiftung Frankenthal. We also would like to thank the Deutsche Forschungsgemeinschaft (DFG) for supporting further investigations on cavitation in injection nozzles of automotive engines. The support and provision of computing time by the LRZ is also gratefully acknowledged.

References
[1] Schmidt, S. J., Thalhamer, M., Schnerr, G. H., "Inertia Controlled Instability and Small Scale Structures of Sheet and Cloud Cavitation", Proceedings of the 7th International Symposium on Cavitation, CAV2009, Paper No. 17
[2] Schmidt, S. J., Thalhamer, M., Schnerr, G. H., "Density Based CFD-Techniques for the Simulation of Cavitation Induced Shock Emission", Proceedings ETC – 8th European Turbomachinery Conference, Graz, Austria, pp. 197-208, 2009
[3] Sezal, I. H., Schmidt, S. J., Schnerr, G. H., Thalhamer, M., Förster, M., "Shock and Wave Dynamics in Cavitating Compressible Liquid Flows in Injection Nozzles", Journal of Shock Waves (Springer), DOI 10.1007/s00193-008-0185-3, 2009
[4] Sezal, I. H., "Compressible Dynamics of Cavitating 3-D Multi-Phase Flows", PhD Thesis, Technische Universität München, 2009

• Matthias Thalhamer • Steffen J. Schmidt • Günter H. Schnerr

Lehrstuhl für Fluidmechanik, Fachgebiet Gasdynamik, TU München

Figure 3: Shock formation and propagation due to collapsing vapor clouds. The collapse directly at the trailing edge generates a pressure of p = 225 bar.
Figure 4: Instantaneous maximum loads of up to p = 2500 bar at the surface of the hydrofoil during the complete computation. High peaks indicate areas where cavitation erosion is likely to occur.

Golden Spike Award by the HLRS Steering Committee in 2009

Modelling Convection over West Africa

In the African Sudanian (10°N-15°N) and Sahelian (15°N-18°N) climate zones convective systems play a key role in the water cycle, because they contribute about 80% to the annual rainfall. The convective systems are thunderstorm complexes with a horizontal extent of several hundreds of kilometers. They are part of the West African monsoon (WAM), which also impacts the downstream tropical Atlantic by providing the seedling disturbances for the majority of Atlantic tropical cyclones [1].

The WAM system is characterized by the interaction of the African easterly jet (AEJ), the African easterly waves (AEWs), the Saharan air layer (SAL), as well as by the low-level monsoon flow, the Harmattan, and the mesoscale convective systems (MCSs). The AEJ can be observed between 10-12°N at a height of about 600 hPa and has a typical wind speed of about 12 m s-1. The AEJ develops as a result of the reversed meridional temperature gradient due to the relatively cool and moist monsoon layer and the hot and dry SAL, which is located above the monsoon layer. The AEWs are synoptic-scale disturbances that propagate westwards across tropical West Africa toward the eastern Atlantic and the eastern Pacific. The AEWs are characterized by propagation speeds of 7-8 m s-1, a period of 2-5 days, and a wavelength of about 2,500-4,000 km. Important parameters for the initiation of convection are the spatial distribution and temporal development of water vapour in the convective boundary layer (CBL). Besides advective processes, water vapor is made available in the atmosphere locally through evapotranspiration from soil and vegetation. Many research findings show that soil moisture exerts a greater influence on the CBL than vegetation.

In the scope of the African Monsoon Multidisciplinary Analyses (AMMA) project we investigate the development of MCSs, the sensitivity of their life cycle to differing surface properties, the role of larger-scale weather systems (AEWs, the Saharan heat low) in their evolution, and the development of tropical cyclones out of such systems. In addition, we study the interaction of the SAL with African monsoon weather systems.

To simulate the weather systems over West Africa we use the COSMO (COnsortium for Small scale MOdelling, www.cosmo-model.org) model [2,3]. COSMO is an operational weather forecast model used by several European weather services, e.g. the German Weather Service (DWD). Additionally, we use COSMO coupled with the aerosol and reactive trace gases module (COSMO-ART) to investigate the interaction of the SAL with WAM systems. COSMO-ART [4,5] was developed in Karlsruhe and computes the emission and the transport of mineral dust. We used the computer facilities of the HP XC 4000 at the Steinbuch Centre for Computing (SCC), as the COSMO model requires substantial supercomputer resources. In the following, we focus on two different topics: on the one hand, the life cycle of an MCS over West Africa is modelled with respect to the soil conditions; on the other hand, the focus lies on a comparison between the simulated MCSs over West Africa and over the eastern Atlantic.

The first part of this study investigates the sensitivity of the life cycle of an MCS to surface conditions [6]. The analysis is based on simulations of a real MCS event on 11 June 2006, which occurred in the pre-onset phase of the monsoon, when vegetation cover is low and the impact of soil moisture is assumed to be dominant. Different conditions for soil moisture were applied for the initialization of the soil model TERRA-ML. The run based on the COSMO soil type distribution and on original ECMWF fields was denoted MOI. However, comparison with AMSR-E satellite data showed that the MOI field contained too much soil moisture in the upper surface layer. Therefore, we reduced the volumetric soil moisture content in all layers by 35% compared to the initial conditions of MOI. This resulted in a soil moisture content in the uppermost level of TERRA-ML similar to the soil moisture values of about 12% derived from the AMSR-E satellite and to in-situ measurements of about 18% for the uppermost 5 cm taken at Dano (3°W, 11°N) [7] for the region around 11°N, where the MCS was observed. The corresponding simulation was designated as the CTRL experiment. To eliminate the effect of spatial soil moisture variability on the initiation of convection, an additional simulation with a homogeneous (HOM) distribution of soil moisture and soil texture was performed. In this case the volumetric soil moisture was specified as the mean of the volumetric soil moisture in the CTRL experiment along 11°N from 4.5°W to 4.5°E. To investigate the effect of dry regions on convective systems, where the soil moisture structure is less complex than the conditions present in the CTRL run, a dry band of 2 degrees longitudinal extension was inserted into the homogeneous soil moisture field. In this band the volumetric soil moisture was reduced by 35% compared to the homogeneous environment. The corresponding experiment was denoted as BAND.
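The HOM and BAND soil moisture fields are derived from the CTRL field by two simple operations: a zonal mean taken around 11°N between 4.5°W and 4.5°E, and a 35% reduction inside a 2°-wide longitude band. The sketch below is not part of the COSMO/TERRA-ML preprocessing; it only illustrates these two constructions on a generic latitude-longitude array, with assumed array layout and a placeholder band position.

```python
# Illustrative only: construct HOM and BAND initial soil moisture fields from a
# CTRL field on a regular lat/lon grid (names and grid layout are assumptions).
import numpy as np

def make_hom_and_band(w_ctrl, lons, lats, band_west=-2.0, band_east=0.0):
    """w_ctrl: volumetric soil moisture with shape (lat, lon); lons/lats in degrees."""
    # HOM: a single mean value, taken around 11N between 4.5W and 4.5E.
    ref = (np.abs(lats - 11.0) < 0.5)[:, None] & ((lons >= -4.5) & (lons <= 4.5))[None, :]
    w_hom = np.full_like(w_ctrl, w_ctrl[ref].mean())

    # BAND: reduce the homogeneous value by 35% inside a 2-degree-wide
    # longitude band (the band position here is a placeholder).
    w_band = w_hom.copy()
    in_band = (lons >= band_west) & (lons <= band_east)
    w_band[:, in_band] *= 1.0 - 0.35
    return w_hom, w_band
```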

Figure 1: Volumetric soil moisture in % at 15 UTC in the uppermost layer (color shaded) and precipitation in mm h-1 (isolines, interval 2) on June 11, 2006 at 17 UTC for the CTRL (a), HOM (b), and MOI (c) case. Taken from [6].

In the CTRL case three separate cells were initiated in the south-eastern part of Burkina Faso. Precipitation of up to 6 mm h-1 was simulated at 17 UTC (Figure 1a). The south-western most cell developed in the lee of an area with orographically induced upward motion. Triggering of convection often occurs in this way in West Africa. Two further cells developed in the east, at 1.9°E and 11.9°N and at 2.2°E and 11.6°N. In Figure 1a the precipitation pattern of the CTRL case at 17 UTC is overlaid on the soil moisture distribution in the uppermost layer at 15 UTC. This figure shows that all three cells developed in the transition zone from a wetter to a dryer surface, while the centres of the precipitating cells were positioned over the dryer surface. In comparison to the CTRL case, only two precipitating cells had developed at 17 UTC in the HOM case (Figure 1b). These two cells were observed at roughly the same locations as the most intensive cells in the CTRL case. However, the precipitation of both cells was less intense than that of the CTRL case at the same time. In the MOI case the favorable conditions with high convective available potential energy (CAPE) and low convective inhibition (CIN) values were more limited in space than in the other cases. In addition, the surface temperature in the region of interest in the MOI case was about 3°C lower than in the CTRL case. Under these conditions only one weak precipitating cell had developed at 17 UTC.

Once triggered, the convective cells developed quickly in the CTRL and HOM cases and moved with the AEJ towards the west. About two hours after their initiation, the cells had already organized into an MCS in the CTRL run. In the HOM case three separate cells could still be distinguished at 19 UTC, which were less intense than in the CTRL case. The MOI run showed only weak convective activity. The impact of the dry band, with a volumetric moisture content of 8.3% surrounded by a homogeneous moisture content of 12.7%, on the modification of a mature MCS is shown in Figures 2a and 2b. Precipitation was significantly reduced between 0 and 2°W, i.e. shifted by about one degree to the east of the dry band. Precipitation was interrupted when the MCS approached the dry band, but re-generated when the system reached the western part of the band (Figure 2b). Convection was also initiated in the western part of the dry band (around 3°W) at about 19 UTC. The cloud cluster was accompanied by significant rainfall and moved to the west. At 18 UTC an area with lower CAPE values had developed within the dry band, while CIN increased to the east and west of it (Figure 2c). East of the band (0.83°W, 12.2°N) CIN further increased during the subsequent hours. Inside the band (1.8°W, 12.2°N) CIN was significantly lower. The lower CAPE inside the dry band resulted mainly from lower near-surface humidity. East of the dry band a lower near-surface temperature led to a higher CIN, which inhibited the convection of the approaching MCS. The increase of CIN inside the dry band later that night resulted from the nocturnal decrease of near-surface temperature and the passage of the MCS in the surrounding.

The simulations showed that convection was initiated in all model experiments, regardless of the initial soil moisture distribution.
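CAPE and CIN, which control whether the approaching MCS can re-develop, are vertical integrals of parcel buoyancy: CAPE sums the positive buoyancy above the level of free convection, CIN the negative buoyancy below it. The sketch below is a generic, simplified parcel-theory calculation on given virtual-temperature profiles; it is not the diagnostic used in [6].

```python
# Simplified illustration of CAPE/CIN from parcel theory (not the COSMO diagnostic).
# tv_parcel and tv_env are virtual-temperature profiles in K on heights z in m.
import numpy as np

G = 9.81  # m s-2

def cape_cin(z, tv_parcel, tv_env):
    """Integrate parcel buoyancy: the positive part above the level of free
    convection contributes to CAPE, the negative part below it to CIN (J kg-1)."""
    buoy = G * (tv_parcel - tv_env) / tv_env          # buoyancy acceleration
    lfc = np.argmax(buoy > 0.0)                       # first positively buoyant level
    cape = np.trapz(np.maximum(buoy[lfc:], 0.0), z[lfc:])
    cin = np.trapz(np.minimum(buoy[:lfc + 1], 0.0), z[:lfc + 1])
    return cape, -cin                                 # report CIN as a positive number

# Toy profiles: a parcel slightly cooler near the surface, warmer aloft.
z = np.linspace(0.0, 12000.0, 121)
tv_env = 300.0 - 6.5e-3 * z
tv_parcel = tv_env + np.where(z < 1500.0, -0.5, 1.0)
print(cape_cin(z, tv_parcel, tv_env))
```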

Figure 2: 24-h accumulated precipitation in mm (color shaded) starting from 06 UTC on 11 June 2006 (a), Hovmöller diagram of precipitation in mm h-1 (color shaded) averaged between 10.5°N and 13.5°N on June 11 and 12, 2006 (b), and CIN in J kg-1 (color shaded) on June 11, 2006 at 18 UTC (c) for the BAND case. The solid lines enframe the area of the dry band. Taken from [6].

The area where convection was initiated in the simulations corresponded roughly with the observations. In the CTRL case all three cells were initiated along soil moisture inhomogeneities and shifted towards the dry surface. Triggering of convection and optimal evolution were simulated in areas with low CIN, high CAPE and low soil moisture content (< 15%) or soil moisture inhomogeneities. In CTRL and HOM the cells developed quickly and merged into organized mesoscale systems while moving westwards. The MCS in the CTRL run experienced a significant modification: the precipitation disappeared when the MCS reached a region which was characterized by high CIN values (> 150 J kg-1), reduced total water content and soil moisture inhomogeneities.

In conclusion, triggering of convection on this day was favoured by drier surfaces and/or soil moisture inhomogeneities, while a mature system weakened in the vicinity of a drier surface. This means that a positive feedback between soil moisture and precipitation existed for a mature system, whereas a negative feedback was found for the triggering of convection. The two mechanisms are illustrative of the complexity of soil-precipitation feedbacks in an area where a high sensitivity of precipitation to soil moisture was proven.

The second study investigates convective systems over West Africa and the eastern Atlantic. The convective systems are embedded in the AEW out of which a hurricane developed. In the afternoon hours of September 9, 2006 an MCS was initiated over land ahead of the trough of this AEW. The convective system increased quickly in intensity and size and developed three westward-moving, arc-shaped convective systems (Figure 3a,b). The near-surface winds depict the mostly westerly monsoon inflow and the weak easterly inflow into the system. As the system decayed, new convective bursts occurred in the remains of the old MCS. This led to structural changes in the form of a squall line crossing the West African coastline around midnight on September 11, 2006. During the next 24 hours, intense convective bursts occurred over the eastern Atlantic. These convective bursts were embedded in a cyclonic circulation which intensified and became a tropical depression on September 12, 2006, 12 UTC, out of which Hurricane Helene developed. The largest and longest-lived convective burst in this intensification period (Figure 3c,d) was analyzed here and compared to the convective system over land. The convective systems modify their environment, and these changes can be related to the structure of the system itself.

A series of COSMO runs over large model domains (1000 x 500 gridpoints) with 2.8 km horizontal resolution was carried out such that the model area was centred around the MCS. As the system moved across West Africa, the position of the model region was adjusted for each subsequent run. This enabled us to identify structural features of convective systems over West Africa and the eastern Atlantic, and to analyze and discuss their differences. All the runs are 72 h in duration. The parameterization of convection is switched off. The model source code was adapted to provide information for moisture, temperature and momentum budgets [8,9,10].

The model was able to capture the above described convective systems as well as their structure and intensity changes very well. We identified three stages in the life cycle of the MCS over West Africa: the developing, the mature, and the decaying phase. To analyze the structure of these convective systems in more detail, an east-west cross section (Figure 4a) is drawn through the convective system shown in Figures 3a,b. During the mature phase, low-level convergence occurs between a strong westerly inflow and an easterly inflow from behind the system and is partly enhanced due to the descending air from the rear of the system. The associated ascent region extends up to about 200 hPa and is tilted eastwards with height (Figure 4a). The convergence continues up to 700 hPa. At this stage strong downdraughts have developed around 7.5°W, just behind the low-level convergence zone that is located under the tilted updraught region.

Figure 3: Comparison between the Rapid Developing Thunderstorm Product (RDT) (left) and COSMO model runs (right) on September 10, 2006 at 04 UTC (upper row) and on September 12, 2006 at 8 UTC (lower row). Left: Convective objects are superimposed over the Meteosat infrared images using shadings of grey up to -65°C, orange-red colors between -65°C and -81°C, and black above -81°C in the RDT images. Courtesy of Météo France. Right: The vertical integral of cloud water, cloud ice and humidity (kg m-2), indicating the convective updraught cores, from the 2.8-km COSMO run which was initialized on September 9, 2006 at 12 UTC. Taken from [10].

The area of the westerly inflow reaches deep into the system, up to about 700 hPa, and the AEJ has its maximum around 600 hPa. Divergence associated with the upper-level outflow can be seen at around 200 hPa, with very strong easterly winds ahead of the system and westerly winds in the rear. Thus, there is strong mid-level convergence east of the tilted updraught.

The cross section through a convective system embedded in the developing tropical depression over the eastern Atlantic is shown in Figure 4b. A major difference between the MCS over West Africa and the convective bursts that are embedded in the circulation over the Atlantic is their lifetime. The MCS over land lasted for about 3 days, the succession of MCSs over the ocean only for about 6 to 24 hours. The region of maximum heating and ascent is vertical and not tilted as in the convective system over land. The AEJ seems to occur at around 700 hPa west of the convective system. This is about 100 hPa lower than for the MCS over West Africa, where the air is accelerated due to air that exited the updraught core at lower levels and slightly descends. It is also apparent from the cross section that the downdraughts are not as strong as for the MCS over land. Furthermore, the low-level convergence covers a much broader region than over the continent.

To quantify the difference between the convective systems over land and over water and to assess the influence of the convective systems on the environment, potential temperature, relative vorticity, and potential vorticity budgets for regions encompassing the convective region were calculated. Details and the results of the budget calculations can be found in [10]. In future studies the analysis could be applied to other cases in order to generalise these results.

Acknowledgment
This project received support from the AMMA-EU project. Based on a French initiative, AMMA was built by an international scientific group and funded by a large number of agencies, especially from France, UK, US and Africa. It has been the beneficiary of a major financial contribution from the European Community's Sixth Framework Research Programme. Detailed information on scientific coordination and funding is available on the AMMA International web site: http://www.amma-international.org

References
[1] Avila, L. A., Pasch, R. J., "Atlantic Tropical Systems of 1993", Monthly Weather Review, 123:887-896, 1995
[2] Steppeler, J., Doms, G., Schättler, U., Bitzer, H. W., Gassmann, A., Damrath, U., Gregoric, G., "Meso-gamma Scale Forecasts using the Nonhydrostatic Model LM", Meteorology and Atmospheric Physics, 82:75-97, 2003
[3] Schättler, U., Doms, G., Schraff, C., "A Description of the Nonhydrostatic Regional Model LM", Part VII: User's Guide, Deutscher Wetterdienst, www.cosmo-model.org, 2008
[4] Vogel, B., Hoose, C., Vogel, H., Kottmeier, C., "A Model of Dust Transport Applied to the Dead Sea Area", Meteorologische Zeitschrift, 15, DOI: 10.1127/0941-2948/2006/0168, 2006
[5] Vogel, B., Vogel, H., Bäumer, D., Bangert, M., Lundgren, K., Rinke, R., Stanelle, T., "The Comprehensive Model System COSMO-ART – Radiative Impact of Aerosol on the State of the Atmosphere on the Regional Scale", Atmospheric Chemistry and Physics, 9:14483-14528, 2009
[6] Gantner, L., Kalthoff, N., "Sensitivity of a Modelled Life Cycle of a Mesoscale Convective System to Soil Conditions over West Africa", Quarterly Journal of the Royal Meteorological Society, 136(s1):471-482, 2010
[7] Kohler, M., Kalthoff, N., Kottmeier, C., "The Impact of Soil Moisture Modifications on CBL Characteristics in West Africa: A Case Study from the AMMA Campaign", Quarterly Journal of the Royal Meteorological Society, 136(s1):442-455, 2010
[8] Grams, C. M., "The Atlantic Inflow: Atmosphere-Land-Ocean Interaction at the South Western Edge of the Saharan Heat Low", Master's thesis, Institut für Meteorologie und Klimaforschung, Universität Karlsruhe, Karlsruhe, Germany, March 2008
[9] Grams, C. M., Jones, S. C., Marsham, J. H., Parker, D. J., Haywood, J. M., Heuveline, V., "The Atlantic Inflow to the Saharan Heat Low: Observations and Modelling", Quarterly Journal of the Royal Meteorological Society, 136(s1):125-140, 2010
[10] Schwendike, J., Jones, S. C., "Convection in an African Easterly Wave over West Africa and the Eastern Atlantic: a Model Case Study of Helene (2006)", Quarterly Journal of the Royal Meteorological Society, 136(s1):364-396, 2010

• Juliane Schwendike
• Leonhard Gantner
• Norbert Kalthoff
• Sarah Jones

Institut für Meteorologie und Klimaforschung, Karlsruher Institut für Technologie

Figure 4: Cross sections through the convective system over land (a) and over the ocean (b). The vertical velocity (shaded) and zonal wind (contour interval is 3 m s-1) are shown. The cross section (a) is along 13.1°N on September 10, 2006, 05 UTC, and (b) along 12.8°N on September 12, 2006, 09 UTC. Taken from [10].
Driving Mechanisms of the Geodynamo

The Earth's outer core is a region of our planet extending from a depth of 2,900 km to about 5,100 km. This region is not accessible to direct measurements. Geophysicists have at least partially revealed the nature of the Earth's outer core. According to today's knowledge, fluid motions in the Earth's metallic liquid outer core drive a machinery which is called the Geodynamo. Potential energy, which stems from the accretion of the Earth, is transformed into kinetic energy and further into magnetic energy by magnetic induction in an electrically conducting fluid.

The liquid outer core of the Earth is assumed to be a mixture composed predominantly of iron and nickel and a small fraction of lighter elements (e.g., sulphur, oxygen or silicon). Two potential sources are available for driving convection and thereby the dynamo in the liquid outer core: density variations due to temperature and due to compositional differences (cf. Jones, 2007 for an overview).

Temperature differences result from secular cooling of the planet. Once the planet has sufficiently cooled to grow a solid inner core, the latent heat released in the solidification process provides another source for thermal convection. Radioactive elements such as potassium (40K) are an additional possible thermal source. From the material physics point of view it remains an open question whether radioactive elements can indeed be present in the core. The additional compositional differences arise from light elements being released from the solidifying material at the Inner Core Boundary (ICB). It is still unclear which driving source dominates core convection and thus the dynamo mechanism. Additional potential sources of dynamo action, but likely less effective, are precession-induced flows due to the inclined rotation axis as well as tidal forces.

Various numerical dynamo simulations have demonstrated that convective flows in rotating fluids can maintain a self-sustaining dynamo. However, due to limited computational resources, these models do not come even close to realistic planetary conditions. On the other hand they are able to produce magnetic fields which show Earth-like features. Typically, numerical dynamo simulations do not take into account the effects that may arise from the presence of two buoyancy sources with significantly different diffusion time scales. Both driving sources differ in terms of their respective molecular diffusivities. The Lewis number Le = κ_T / κ_C gives the ratio of thermal diffusivity κ_T to compositional diffusivity κ_C. For the Earth's outer core this ratio is estimated to be of O(10^3). This implies that thermal instabilities diffuse much faster than compositional ones. It is well known that the spatial and temporal behaviour of convection depends strongly on the diffusivity of the driving component (e.g., Breuer et al., 2004).

For this reason we study thermo-chemically driven core convection and dynamos in rotating spherical shells by means of numerical models. Considering different thermal to chemical forcing ratios we hope to get a better general understanding of convection and magnetic field generation in the Earth's core, with regard to the observed spatial and temporal variations. These simulations are computationally very demanding, since both high spatial grid resolutions and long time series are required. Therefore, high performance clusters are an essential working tool for such an approach and efficient parallel codes are required. In the following sections we will give a short description of the applied model and numerical methods and present some representative results of our work.

Figure 1: Illustration of potential driving sources for flows in the Earth's fluid outer core.

Mathematical Model and Numerical Methods
For modeling core convection and the Geodynamo we consider a thermally and chemically driven flow within a spherical shell, rotating with a constant angular frequency. We employ the Boussinesq approximation where density changes are only retained in the buoyancy terms. Furthermore, all material properties in the core are assumed constant. The governing magneto-hydrodynamic equations in non-dimensional form, using d as the characteristic length scale and the viscous diffusion time scale d^2/ν, where ν is the kinematic viscosity, are expressed as Eqs. 1(a)-(f). These are the Navier-Stokes equation (1a), the equation of continuity (1b) which ensures incompressibility, the heat transport equation (1c), the compositional transport equation (1d), and the magnetic induction equation (1e) with the condition of a divergence-free magnetic field (1f). P, T, C are perturbations in pressure, temperature and composition with respect to an adiabatic, well mixed reference state. u and B are the velocity and magnetic field vector, respectively; the remaining symbols are the unit vector of position and the unit vector indicating the direction of the axis of rotation.

The non-dimensional control parameters appearing in Eq. 1(a)-(f) are:
Ek: Ekman number, relative importance of viscous to Coriolis force
Ra_n^T: thermal Rayleigh number, relative strength of thermal buoyancy
Ra_n^C: compositional Rayleigh number, relative strength of compositional buoyancy
Pr_T: thermal Prandtl number, ratio of viscosity to thermal diffusivity
Pr_C: compositional Prandtl number, ratio of viscosity to chemical diffusivity
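The equation system Eq. 1(a)-(f) itself did not survive the page layout. For orientation, a standard non-dimensional Boussinesq formulation consistent with the description above is sketched below. This is a sketch only: the placement of the prefactors, the radial gravity profile in the buoyancy terms, and the appearance of a magnetic Prandtl number Pm (which is not in the printed parameter list) depend on the chosen scaling and may differ from the original; the unit vectors \hat{\mathbf r} (position) and \hat{\mathbf z} (rotation axis) are assumed notation.

\[
\begin{aligned}
Ek\Big(\frac{\partial\mathbf u}{\partial t}+(\mathbf u\cdot\nabla)\mathbf u-\nabla^{2}\mathbf u\Big)+2\,\hat{\mathbf z}\times\mathbf u+\nabla P &= Ra_n^{T}\,T\,\hat{\mathbf r}+Ra_n^{C}\,C\,\hat{\mathbf r}+\frac{1}{Pm}\,(\nabla\times\mathbf B)\times\mathbf B, &&\text{(1a)}\\
\nabla\cdot\mathbf u &= 0, &&\text{(1b)}\\
\frac{\partial T}{\partial t}+(\mathbf u\cdot\nabla)T &= \frac{1}{Pr_T}\,\nabla^{2}T, &&\text{(1c)}\\
\frac{\partial C}{\partial t}+(\mathbf u\cdot\nabla)C &= \frac{1}{Pr_C}\,\nabla^{2}C, &&\text{(1d)}\\
\frac{\partial\mathbf B}{\partial t} &= \nabla\times(\mathbf u\times\mathbf B)+\frac{1}{Pm}\,\nabla^{2}\mathbf B, &&\text{(1e)}\\
\nabla\cdot\mathbf B &= 0. &&\text{(1f)}
\end{aligned}
\]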

In the present analysis we assume that both temperature and composition provide de-stabilizing buoyancy. Temperature and composition are fixed at Ti = 1 and Ci = 1 at the Inner Core Boundary (ICB) and at To = 0 and Co = 0 at the Core Mantle Boundary (CMB). The difference in temperature (composition) between the two boundaries defines the thermal (compositional) scale. For the compositional component the applied boundary condition at the CMB is unrealistic: it allows light material to pass from the outer core into the mantle. However, as a starting point we wanted to investigate the influence of different diffusivities isolated from distinct boundary conditions. Regarding the model parameters, the most critical one turns out to be the Ekman number Ek (a measure for the relative importance of viscous to Coriolis force in the momentum equation Eq. 1(a)). Motions in the Earth's fluid outer core are assumed to be strongly dominated by the Coriolis force, expressed by a very low Ekman number of Ek < 10^-15. Theoretical analyses predict that the smallest scales in the convective flow pattern decrease strongly with decreasing Ek. For this reason global Geodynamo models are so far restricted to Ek > O(10^-7).

In order to solve the system of partial differential equations Eq. 1(a)-(f) numerically we apply two methodically distinct numerical codes, depending on the required grid resolution and computer platform. The first one is based on a finite volume (FV) discretization with an implicit time-stepping scheme and efficient spatial domain decomposition for parallel computations using the Message Passing Interface (MPI) (Harder & Hansen, 2005). This sophisticated method has been developed over years at the Institute for Geophysics in Münster with the intention to reach more Earth-like parameter regimes, i.e., smaller Ekman numbers, by means of High Performance Computations on massively parallel computing architectures (>1,000 nodes) like Blue Gene at Jülich Supercomputing Centre (JSC). The second code is based on a pseudo-spectral method with a mixed implicit-explicit time stepping scheme (Dormy et al., 1998). Pseudo-spectral means spectral discretization in longitude and latitude and finite differences in radial direction. Parallelization is implemented via decomposition in radial direction using MPI. These methods are commonly used for Geodynamo simulations since they are very efficient on shared memory architectures with fast processors. Therefore, we use this method for extensive parameter studies in more "moderate" parameter regimes.

Parameter Study of Thermo-chemically Driven Convection and Dynamos in a Rotating Spherical Shell at Moderate Ekman Numbers
To investigate the influence of the driving mechanism on the convective flow pattern, we considered different scenarios including the two extreme cases of purely thermally and purely compositionally driven convection and the more likely situations of a joint action of both sources. Since we focus mainly on the influence of the driving mechanism we keep the Ekman number constant at a moderate value of Ek = 10^-4 for simplicity. The thermal and compositional Prandtl numbers are also fixed, with Pr_T = 0.3 and Pr_C = 3.0, respectively. To parameterize different driving scenarios we define a total Rayleigh number Ra_n^tot combining the thermal and compositional contributions (summarized below), where δ = Ra_n^T / Ra_n^tot denotes the fraction of thermal contribution to the total Rayleigh number. We have considered different values of δ including the two end members of purely thermal and purely compositional convection, which merely differ in the Prandtl numbers. For the results presented in this section, the value of Ra_n^tot is kept constant at a value of 4.8 x 10^6. One has to keep in mind that this fixed total Rayleigh number implies unequal super-criticality for the different scenarios defined by δ. This is due to the different diffusivities of the thermal and compositional component. For the purely thermally driven case this leads to a super-criticality of < 3 x Ra_crit, and of < 16 x Ra_crit for the purely compositionally driven case.

Figure 2 illustrates the difference between the spatial structure of the temperature and the compositional field for the predominantly compositionally driven case δ = 20%. The temperature field is much more diffusive compared to the compositional field, where coherent small-scale up-wellings develop, mainly in the polar and equatorial regions. This is due to the difference in their diffusivities expressed by the different Prandtl numbers Pr_T and Pr_C. As a result the overall flow behaviour depends strongly on the individual strength of thermal and chemical buoyancy in the momentum equation Eq. 1(a). This context is exemplarily illustrated in Figure 3 for the two cases δ = 20%, 80% by means of the spatial distribution of the azimuthally and temporally averaged helicity h and differential rotation (defined below), both being characteristic properties of convection in rotating fluids. These structures are mainly aligned parallel to the axis of rotation, reflecting the relative importance of the Coriolis force for the flow dynamics. Helical flows and an azimuthal mean flow component varying with distance from the axis of rotation are thought to be essential for the dynamo mechanism in planetary cores.

Figure 2: Snapshot of iso-surfaces of the temperature field (left) and composition (right) for the predominantly chemically driven case δ = 20%

Figure 3: Spatial distribution of azimuthally and temporally averaged differential rotation and helicity for the predominantly chemically driven case δ = 20% (left) and the predominantly thermally driven case δ = 80% (right).
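Since the displayed definition of the total Rayleigh number and the symbols for helicity and differential rotation did not survive extraction, the quantities used in this parameter study can be summarized as follows. The additive split of Ra_n^tot and the letter δ for the thermal fraction are reconstructions consistent with the surrounding text rather than a verbatim copy of the original:

\[
Ra_n^{tot} = Ra_n^{T} + Ra_n^{C}, \qquad
\delta = \frac{Ra_n^{T}}{Ra_n^{tot}}, \qquad
h = \mathbf{u}\cdot(\nabla\times\mathbf{u}), \qquad
\bar{u}_\varphi = \langle u_\varphi \rangle_{\varphi,\,t}.
\]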

In this respect the most apparent difference between the two cases δ = 20%, 80% is the strong contribution of differential rotation and helicity in the polar regions in the predominantly compositionally driven case. In the case of dominant thermal forcing these flow structures evolve mainly in small bands (Busse-Taylor columns) outside the tangential cylinder (tangent to the inner core). The importance of these structures for the dynamo mechanism is illustrated in Figure 4, showing snapshots of dominant magnetic field lines for the two distinct driving scenarios at arbitrary times. Here, the width of the field lines reflects the relative local strength of the magnetic field. In the predominantly compositionally driven case the magnetic field is mainly generated in the polar regions, forced by strong compositional up-wellings in these regions (cf. Figure 2). By contrast, in the mainly thermally forced situation the dynamo process acts primarily in the bulk region outside the tangential cylinder. These two distinct driving scenarios demonstrate quite plainly the strong importance of the considered driving scenario on the flow structure and the dynamo process.

Figure 4: Snapshots of internal magnetic field lines for the predominantly compositionally driven case δ = 20% and for the predominantly thermally driven case δ = 80% (figures created with DMFI (Aubert et al., 2008))

High Performance Computing Application of Low Ekman Number Dynamo Simulation
Since the liquid outer core is characterized by fast rotation and low viscosity, its dynamical behaviour is dominated by an extremely low Ekman number. This puts severe demands on numerical simulations of core flows due to the evolving small spatial scales in the solution. This is demonstrated in Figure 5, which displays temperature and the radial magnetic field component below the core mantle boundary for a low Ekman number of Ek = 2 x 10^-6 in a convective flow driven by basal heating. This particular simulation was calculated within the framework of the DEISA (Distributed European Infrastructure for Supercomputing Applications) project 3D-Earth utilizing massively parallel computing (1,920 processes). With such computing power it was possible to calculate a time series of nearly 10^5 timesteps with a spatial resolution of < 5 x 10^7 volume cells. This high resolution is needed in the low Ekman number regime to resolve the dominant scales of the solution. As displayed in Figure 5 (top), the temperature field and thus buoyancy is extremely fine-scaled in the equatorial region and drives nearly a hundred Busse-Taylor columns. However, the strong driving also creates strong upflows in the tangential cylinder. This vigorous convection inside the tangential cylinder has a strong impact on the generated magnetic field, which is no longer dominated by the dipole component but has strong higher multipole components (Figure 5, bottom). Dipole dominance is restricted to lower forcing, i.e., lower Rayleigh number, where buoyancy is insufficient to drive vigorous convection inside the tangential cylinder by breaking the Proudman-Taylor constraint. This particular result suggests that the observed different flow fields for thermal and compositional convection will have a strong influence on the generated magnetic field. A more detailed discussion of obtained results on thermally driven dynamos is given by Wicht et al. (2009).

Figure 5: Example of a High Performance Computing dynamo simulation with < 5 x 10^7 volume cells and 10^5 timesteps. Shown are snapshots of temperature T (top) and radial magnetic field component Br (bottom) near the core mantle boundary (CMB) at low Ekman number Ek = 2 x 10^-6.

Outlook
The main ambition for the next years will be to push the dynamo models continuously towards more realistic parameter regimes. In this respect, we have to answer the question whether the observed distinct flow behaviour dependent on the driving scenario will even hold in a regime with strongly increased weight of the Coriolis force. This is a challenging task and is basically constrained by the computational resources available and their future technical improvements. Therefore, modern High Performance Computer Centres with innovative and substantial technical support are essential for further progress in understanding the secrets of the Geodynamo.

• Martin Breuer
• Tobias Trümper
• Helmut Harder
• Ulrich Hansen

Institute for Geophysics, University of Münster

References
[1] Aubert, J., Aurnou, J., Wicht, J. "The Magnetic Structure of Convection-driven Numerical Dynamos", Geophysical Journal International, 172, pp. 945-956, 2008
[2] Breuer, M., Wessling, S., Schmalzl, J., Hansen, U. "Effect of Inertia in Rayleigh-Bénard Convection", Physical Review E, 69, 026302/1-10, 2004
[3] Dormy, E., Cardin, P., Jault, D. "MHD Flow in a Slightly Differentially Rotating Spherical Shell with Conducting Inner Core in a Dipolar Magnetic Field", Earth and Planetary Science Letters, 160, pp. 15-30, 1998
[4] Harder, H., Hansen, U. "A Finite Volume Solution Method for Thermal Convection and Dynamo Problems in Spherical Shells", Geophysical Journal International, 161, pp. 522-532, 2005
[5] Jones, C. A. "Thermal and Compositional Convection in the Outer Core", Vol. 8 of Treatise on Geophysics, chap. 5, pp. 131-185, Elsevier Science, 2007
[6] Wicht, J., Stellmach, S., Harder, H. "Geomagnetic Field Variations", chap. Numerical Models of the Geodynamo: from Fundamental Cartesian Models to 3D Simulations of Field Reversals (Eds: K.H. Glaßmeier, H. Soffel and J. Negendank), Advances in Geophysical and Environmental Mechanics and Mathematics, pp. 107-158, 2009

Golden Spike Award by the HLRS Steering Committee in 2009

the exact importance of jets is still un- assuming axisymmetry. These compu- The Maturing of Giant Galaxies known. This becomes obvious in two tations are extremely demanding if re- by Black Hole Activity opposing processes that are invoked alistic parameters are used: although in the current literature: positive and the jet plasma moves with almost the negative feedback. speed of light, its density is much lower than the ambient density of the extra- Extragalactic jets are amongst the most huge amount of energy clearly has a Think Positive – or Negative? galactic gas (in our simulations typically spectacular phenomena in astrophysics: considerable impact on the energy Dense and very cool clouds of gas are 1,000 times smaller), which results in these dilute but highly energetic beams budget of the respective galaxy and its the birthplaces of stars since only then a much slower propagation of the jet. of plasma are formed in the environs of environment. Since extragalactic jets can gravity surmount thermal pressure Also, a considerable range of spatial active black holes, moving with speeds are most frequent in the most mas- and cause the clouds to collapse, ulti- scales has to be covered. While the very near the speed of light. They run sive galaxies (observed in roughly every mately forming new stars as nuclear fu- jets extend over more than 100 kpc into the hot gas surrounding the galaxy, third among them), they have become sion sets in. If clouds are hit by a jet or (1 kpc = 3 x 1016 km), the jet beams where they are eventually decelerated a common explanation for cosmo- its preceding bow shock, they are com- are 100 times narrower and have to and heat up the gas considerably, dig- logical simulations failing to reproduce pressed by the impact and can become be resolved in detail. All this results in ging large cavities (the so-called jet co- giant galaxies correctly, despite their gravitationally unstable. This would simulations with more than 6 million coon) into the extragalactic gas. otherwise great success in modeling result in an increased rate of star cells and more than one million time Applications the evolution of our universe and its formation (positive feedback). On the steps necessary. The very large num- Applications Both the jets and the cavities are ob- galaxies. While astronomical observa- other hand, the same impact could as ber of time steps makes simulations servable today with radio and X-ray tions show that giant elliptical galaxies well disrupt the cold clouds and thereby time-consuming but does not allow a telescopes on earth or in space. These consist of old and red stars, showing destroy the seeds necessary for the massively parallel approach since the observations have revealed that the almost no signs of current formation of formation of new stars (negative feed- “problem per processor” would become power of these jets is even greater new stars, these galaxies still grow in back). Additionally, the jet would heat too small then and communication than was thought before: more than cosmological simulations and therefore up the hot gas surrounding the galaxy costs between processors would be- one trillion times the total power of our are much bluer, actively forming stars to higher temperatures and prevent it come unwieldy. These simulations were sun (1039 watts) for the most power- and have considerably greater masses. from cooling down, getting compressed ful ones – with activity durations of This has caused an increased interest and forming new cold clouds. 
Both pro- some ten million years. This power in jet physics and in research projects cesses have been demonstrated to be ultimately originates from the active focusing on the interaction of jets with possible, and only a detailed study with supermassive black hole of the galaxy, their host galaxy and its environment, a realistic underlying model can help us since the jet taps the enormous ro- now referred to as “jet feedback”. Yet, reveal their real importance. tational energy of the black hole. This the detailed processes involved and Observations of distant radio galaxies, showing them as they were 10 billion years ago, give additional hints on the interaction between the jet and the disk. There, extended emission-line nebulae aligned with the jet axis are observed and it is conjectured that these nebulae are actually created by the interaction of the jet with a galaxy's gaseous disk.

Computational Challenges
We have conducted a series of magnetohydrodynamic jet simulations to examine the interaction of jets with their environment at very high resolution

Figure 1: Velocity field of a jet (density contrast 0.001) after 30 million years. The jet beam (in red) reaches out to the jet head and then turns back, inflating a turbulent cocoon (mostly green).

Figure 2: 3D simulation showing the interaction of a jet and a galactic gaseous disk (volume rendering of density)

32 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 33 therefore only possible on a powerful The Mist of Distant Galaxies in contrast to a uniformly resolved gas clouds embedded in the ambient supercomputer such as the NEC SX-8 With respect to the riddle of the mesh. The current implementation medium is computed explicitly. Prelimi- at the High Performance Computing emission-line nebulae, we were able to allows a maximum dynamic range in nary results indicate that actually both Center Stuttgart (HLRS), to which we test two models for the location and length scale of 6 orders of magnitude. positive feedback by compression of have adjusted and optimized our code. kinematics of the line-emitting clouds The code is parallelized by MPI and in- clouds in the center as well as nega- Due to its vector capabilities and shared against observed properties and found cludes dynamic load-balancing between tive feedback by removal of gas along memory architecture per node, the that clouds embedded in the jet cocoon the different MPI processes. The 3D the jet axis may be acting at the same code ran very efficiently and allowed us were able to much better reproduce simulations are clearly computationally time. So far, the simulations have only to reach realistic sizes for the jets. observed velocities and morphologies very demanding, since an additional covered a time much shorter than our than clouds embedded in the shocked dimension results in a much larger previous simulations, but runs on a We found that the expansion of the ambient gas. number of cells, even if the resolution large number of processors will allow jet cocoon follows self-similar models is somewhat lower. On the other hand, us to eventually examine the action of only at early times, but deviates from About 10 percent of the total jet power a higher number of cells (in contrast to jet feedback in detail and over realisti- that significantly once the cocoon ap- was measured as kinetic energy in the time steps) generally can be handled cally long times. proaches pressure balance with its jet cocoon, which made it possible to by massive parallelization if an efficient environment. We found this to be link these findings to simulations of method is used. References consistent with observations and multi-phase turbulence that model the [1] Gaibler, V., Krause, M., Camenzind, M. “Very Light Magnetized Jets on Large Applications also achieved bow shock shapes and amounts and emission power of the Outlook Applications Scales – I. Evolution and Magnetic Fields”, strengths such as those observed cool gas phase embedded in these tur- We have successfully tested RAMSES Monthly Notices of the Royal Astronomical typically. Magnetic fields turned out to bulent regions. The expected emission on both the Nehalem cluster at the Society, 400, pp. 1782-1802, 2009 be important to stabilize the contact line luminosities, however, disagree HLRS and the HLRB-II at the Leibniz- [2] Krause, M., Gaibler, V. surface between the jet and the ambi- considerably with the observed range Rechenzentrum (LRZ), and found it to “Physics and Fate of Jet Related ent medium and were moreover found of luminosities and we concluded that be scaling almost linearly with up to Emission Line Regions”, Conference to be efficiently amplified by a shearing the models are still too simple to in- 4,000 cores, if sufficient bandwidth is contribution, at arXiv:0906.2122

mechanism in the jet head. clude all the necessary physics. This available. On the Nehalem cluster, the [3] Gaibler, V., Camenzind, M. was, admittedly, not too surprising performance was ~50 percent smaller “Numerical Models for Emission Line since the models relied on clouds pas- than expected from the single-core Nebulae in High Redshift Radio Galaxies”, Wolfgang E. Nagel, Dietmar B. Kröner, sively advected with and spread all performance since memory bandwidth • Volker Gaibler Michael M. Resch (Eds.), Springer 2010, across the cocoon plasma, while the become a bottleneck for more than 4 High Performance Computing in Science interaction of real clouds of cold gas cores per node; scaling then increased and Engineering ‘09 Max-Planck- would be considerably more complex. almost linearly for a larger number of Institut für [4] Gaibler, V. nodes. HLRB-II, however, did not suffer “Very Light Jets with Magnetic Fields”, extraterres- ­ Moving on – from this and showed excellent scaling PhD Thesis, Ruprecht-Karls-Universität trische Physik, Simulations in 3D behaviour. We conjecture that the Heidelberg, 2008 Garching To improve our model of the jet – cloud memory bandwidth bottleneck on the interaction, we had to extend our simu- Nehalem cluster may be related to de- lations to full three dimensions since tails of the MPI implementation, since a clumpy galactic or intergalactic gas MPI adjustments on a local Harper- cannot be modeled properly within axi- town Beowulf cluster were able to get symmetry. This also made it necessary around this limitation. to move to another code: RAMSES, an adaptive mesh refinement code, that in- For a small test simulation, we have cludes cooling, gravity, star formation, successfully set up a clumpy galactic cosmological evolution and magnetic gaseous disk, similar to disks actually fields. The mesh refinement is highly observed in distant galaxies. A jet is

Figure 3: The gaseous clouds in the disk (green) are compressed (blue) by the suitable for resolving small clouds in an positioned in the center of the galaxy action of the jet and the remaining gas is pushed outwards. otherwise large computational domain, and interaction of the jet with the cold

34 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 35 (radius 3.5 pc, H gas density of the distinction between the two types The Physics of Galactic Nuclei 2 104 cm-3 gas temperature of 50 K). of active galaxies. Several mecha- Figure 1 shows a cut through the mid- nisms have been proposed recently. Active Galactic Nuclei are powered by ing physical processes in these regions plane of the gas density distribution of We concentrate on one of the most supermassive black holes. Whenever are still poorly understood, the aim of our standard model (80 km/s cloud promising solutions, namely the ef- enough gas reaches the centre of oth- our project is to obtain a deep under- velocity, 3 pc offset from the x-axis) fects of infrared radiation pressure erwise normal galaxies, a fast rotating, standing of the origin and evolution of after an evolution time of 0.25 Myrs. [7,8]. We are currently approaching thin and hot gaseous accretion disc is the complex stellar, gaseous and dusty The cloud has approached the black the problem from two directions: (i) A able to form and the centre lights up to structures and their relation to the ob- hole from the right hand side along the parallel N-body like code is developed, become as bright as the stars of the served phases of nuclear activity. x-axis. After passing the black hole, ma- which additionally includes radiation whole galaxy. A surrounding ring-like, terial with opposite angular momentum pressure forces and collisions between dusty, geometrically thick gas reser- Nuclear Disc Formation collides, the angular momentum is effi- dusty gas clouds. This will enable us voir (the torus) provides mass supply in Galactic Nuclei ciently redistributed and a rotating spiral for the first time to assess the dy- and anisotropically blocks the light. Recent observations of our Galactic structure builds up, whereas a part of namical state of clumpy torus models, This gives rise to two characteristic Centre reveal a warped clockwise rotat- the gas is transported through the inner which are vitally discussed in literature observational signatures, depending ing stellar disc with a mean eccentricity boundary. The amount of accreted gas nowadays, mostly treated with radia- whether the line of sight is obscured by of 0.36 ± 0.06 and a second inclined is mainly determined by the cloud’s im- tive transfer codes and show excellent Applications the torus (edge-on view) or not (face-on counter-clockwise rotating disc extend- pact parameter relative to its radius. By agreement with observed properties Applications view). These nuclear regions of nearby ing from 0.04 pc to 0.5 pc (e.g. [1] that time, the inner gas disc has already of tori in nearby active galaxies. (ii) Tori active galaxies, as well as our own Ga- and references therein), made up of relaxed to an approximate equilibrium are actually thought to be made up of a lactic Centre have been observed in roughly 100 young and massive stars state. The instability of such pressure multi-phase medium including gas and great detail with the most up-to-date with a total mass of roughly 104-5 solar stabilized discs can be assessed with dust at various temperatures. To be instruments at the largest available masses [2,3]. Due to the hostility of the the help of the so-called Toomre param- able to treat this mixture correctly, we telescopes and interferometers, yielding environment, they cannot be explained eter. 
Figure 2 shows those regions of are currently implementing a frequency unprecedented resolution. As the occur- by normal means of star formation, as the disc, which are unstable to gravita- resolved radiative transfer module tidal forces would disrupt typical molecu- tional forces according to this Toomre lar clouds in the vicinity of the central criterion and will form stars. Defining black hole. We present the first numeri- the stellar disc in this way enables us to cal realization of a scenario proposed directly compare our results to the ob- by [4], where an infalling cloud covers served properties of the stellar system. the black hole in parts during the pas- Thereby we find that the initial cloud sage and forms a self-gravitating, ec- velocity determines the resulting stellar centric accretion disc, before it starts mass of the disc (the higher the veloc- fragmenting. For our simulations we use ity, the lower the disc mass). Our stan- the N-body Smoothed Particle Hydro­ dard model shows a disc eccentricity of dynamics (SPH) Code GADGET2 [5], order 0.4, an outer radius of 1.35 pc which shows very good scaling on the and a disc mass of 0.7-6.95 ×104 solar Altix 4700 supercomputer at the LRZ. masses, in good agreement with the ob- We gain roughly a factor of 1.8 in speed servations mentioned above. For details when doubling the number of CPUs. of this project we refer to [6]. Most of our simulations were done using 128 CPUs, with typically 10,000 Radiation Pressure Driven to 20,000 CPU hours per run. Dust Tori One of the main questions of current Our simulations start with a spherical AGN torus research is the stabiliz-

Figure 1: Density (colour coded in solar masses per pc3) and velocity (overlayed and homogeneous molecular cloud with ing mechanism of their vertical scale Figure 2: Gravitational unstable parts of the inner gas disc after 250,000 years, arrows) distribution of the cloud interacting with the central supermassive black parameters adapted from observations height against gravity, as needed for which will later form stars hole after 250,000 years

36 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 37 a turbulent and very dense disc builds up on the parsec scale (see Figure 4). PLUTO [9] is a multiphysics, multial- gorithm modular code which was es- pecially designed for the treatment of supersonic, compressible astrophysical flows in the presence of strong discon- tinuities in a conservative form, relying on a modern Godunov-type High-Reso- lution Shock-Capturing (HRSC) scheme. PLUTO's parallel performance on the SGI Altix 4700 supercomputer at the LRZ shows almost ideal scaling.

In order to be able to treat viscosity effects and star formation within the disc correctly, we complement our Applications 3D PLUTO simulations with an effec- Applications tive disc model, where we calculate the evolution of the surface density Figure 3: Gas density distribution (first two panels) and temperature distribution within a meridional distribution. At the current age of NGC Figure 4: Central gas density distribution in an equatorial slice slice through our 3D torus simulation of the Seyfert Galaxy NGC 1068. 1068's nuclear starburst of 250 Myr, through our 3D torus simulation. our simulations yield disc sizes of the into the Magneto-Hydrodynamics code Feeding Supermassive order of 0.8 to 0.9 pc, gas masses of [2] Nayakshin, S., Cuadra, J. NIRVANA. Radiation heats dust grains Black Holes with 106 solar masses and mass transfer "A Self-gravitating Accretion Disk in Sgr A* a Few Million Years Ago: via absorption and accelerates them Nuclear Star Clusters rates of 0.025 solar masses per year Is Sgr A* a Failed Quasar?", A&A 437, with the help of radiation pressure. We Recently, high resolution observa- through the inner rim of the disc. This pp. 437-445, 2005 propagate forward and backward six tions with the help of the near-infrared is in quite good agreement with the • Marc Schartmann radiation fields (one for each direction) adaptive optics integral field spec- disc size as inferred from interfero- [3] Nayakshin, S., et al. • Christian Alig "Weighing the Young Stellar Discs around and relate the thermodynamic quanti- trograph SINFONI at the VLT proved metric observations in the mid-infrared • Martin Krause Sgr A*", MNRAS 366, pp. 1410-1414, 2006 ties by equations of state for the gas the existence of massive and young and compares well to the extent and • Andreas Burkert and dust separately. As the light propa- nuclear star clusters in the centres of mass of a rotating disc structure as [4] Wardle, M., Yusef-Zadeh, F. gation timestep is often much smaller nearby Seyfert galaxies. With the help inferred from water maser observa- "On the Formation of Compact Stellar Max-Planck- Disks around Sagittarius A*", ApJL, 683, than the gas cooling and hydrodynam- of three-dimensional high resolution tions. Several other observational Institute for pp. 37-40, 2008 ics timesteps, the code can subcycle hydrodynamical simulations with the constraints are in good agreement as Extraterrestrial the radiative transfer within the same PLUTO code, we follow the evolution of well. On basis of these comparisons, [5] Springel, V. Physics hydrodynamics timestep. Test runs at the cluster of the Seyfert 2 galaxy NGC we conclude that the proposed sce- “The Cosmological Simulation Code GAD- GET-2”, MNRAS 364, pp. 1105-1134, 2005 LRZ showed that the code scales well 1068. The gas ejection of its stars pro- nario seems to be a reasonable model

down to a domain size of 16 cells and vide both, material for the obscuration and shows that nuclear star formation [6] Alig, Ch., et al. conserves the total energy (radiation, within the so-called unified scheme of activity and subsequent AGN activity MNRAS, 2010, submitted (arXiv:0908.1100) kinetic and thermal) very well. Using AGN and a reservoir to fuel the central, are intimately related. For details we 16 processors at LRZ for two days, active region and it additionally drives refer to [10]. [7] Pier, E. A., Krolik, J. H. “Radiation-pressure-supported Obscuring we are able to compute a periodic box turbulence in the interstellar medium Tori around Active Galactic Nuclei”, 399, 2 of 60 × 64 cells for almost 15,000 on tens of parsec scales. This leads to References pp. 23-26, 1992 timesteps. The code will then be ap- a vertically wide distributed clumpy or [1] Bartko, H., et al. “Evidence for Warped Disks of Young plied to global gas and dust tori, irradi- filamentary inflow of gas on tens of par- [8] Krolik, J. H. Stars in the Galactic Center”, ApJ 697, “AGN Obscuring Tori Supported by Infrared ated by a central accretion disk. sec scale (see Figure 3 a,b,c), whereas pp. 1741-1763, 2009 Radiation Pressure”, 661, pp. 52-59, 2007

38 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 39 the lowest potential energy [4], and the The UNRES force field has been derived Global Parameter Optimization prediction of the folding pathways can as a Restricted Free Energy (RFE) func- of a Physics-based United Residue be formulated as a search for the fam- tion of an all-atom polypeptide chain ily of minimum-action pathways leading plus the surrounding solvent, where Force Field (UNRES) for Proteins to this basin from the unfolded (dena- the all-atom energy function is aver- turated) state. In neither procedure do aged over the degrees of freedom we want to make use of ancillary data that are lost when passing from the One of the major and still unsolved protein structure prediction, methods from protein structural databases. all-atom to the simplified system (i.e., problems of computational biology is that implement direct information from the degrees of freedom of the solvent, to understand how interatomic forces structural data bases (e.g., homology The complexity of a system composed the dihedral angles x for rotation about determine how proteins fold into three- modeling and threading) are, to date, of a protein and the surrounding sol- the bonds in the side chains, and the dimensional structures. The practical more successful compared to physics- vent generally prevents us from seek- torsional angles l¸ for rotation of the aspect of the research on this problem based methods [2]; however only the ing the solution to the protein-structure peptide groups about the Ca · · ·Ca is to design a reliable algorithm for latter will enable us to understand the and folding pathway prediction problem virtual bonds) [7]. The RFE is further the prediction of the three-dimensional folding and structure-formation pro- at atomistic detail [1]. Therefore, great decomposed into factors coming from structure of a protein from amino- cess and protein dynamics. attention has been paid in recent years interactions within and between a given acid sequence, which is of utmost to develop coarse-grained models of number of united interaction sites. The Applications importance because the experimental The underlying principle of physics- polypeptide chains and other biomolec- energy of the virtual-bond chain is ex- Applications methods of the determination of pro- based methods of protein-structure ular systems [5]. As part of this effort, pressed by eq. (1). tein structures cover only about 10 % prediction and protein-folding simula- of new protein sequences. Even more tions is the thermodynamic hypothesis challenging is simulation of the dynam- formulated by Anfinsen [3] according ics of protein folding and unfolding and to which the ensemble called the “na- of the processes for which the dynam- tive structure” of a protein constitutes ics of the proteins involved is critical the basin with the lowest free energy (1) (e.g., enzymatic reactions, signal trans- under given conditions. Thus, energy- duction), because these processes based protein structure prediction is occur within millisecond or second formulated in terms of a search for the time scales, while the integration time basin with the lowest free energy; in a step in MD simulations is of the order simpler approach the task was defined of femtoseconds [1]. In the case of as searching the conformation with we have been developing a physics- The term represents the based united-residue force field, here- mean free energy of the hydrophobic after referred to as UNRES [6,7]. 
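The right-hand side of eq. (1) did not survive the page layout. As published by Liwo et al. [6,7], the UNRES effective energy has the general structure sketched below; the selection of terms shown follows those described in the text (side-chain, side-chain/peptide-group, peptide-group electrostatic and van der Waals, correlation, turn and virtual-bond terms, weighted by the w's and the temperature factors f_n(T)), while the exact functional forms and the full list of terms, including the local torsional, bending and rotamer contributions, should be taken from refs. [6,7]:

\[
U \;=\; w_{\mathrm{SCSC}}\sum_{i<j}U_{\mathrm{SC}_i\mathrm{SC}_j}
      + w_{\mathrm{SCp}}\sum_{i\neq j}U_{\mathrm{SC}_i p_j}
      + w_{\mathrm{el}}\,f_2(T)\sum_{i<j-1}U^{\mathrm{el}}_{p_i p_j}
      + w^{\mathrm{VDW}}_{\mathrm{pp}}\sum_{i<j-1}U^{\mathrm{VDW}}_{p_i p_j}
      + w_{\mathrm{bond}}\sum_{i=1}^{n_{\mathrm{bond}}}U_{\mathrm{bond}}(d_i)
      + \sum_{m}w^{(m)}_{\mathrm{corr}}\,f_m(T)\,U^{(m)}_{\mathrm{corr}}
      + \sum_{m}w^{(m)}_{\mathrm{turn}}\,f_m(T)\,U^{(m)}_{\mathrm{turn}}
      + \dots
\]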
In (hydrophilic) interactions between the the UNRES model [7] a polypeptide side chains, which implicitly contains chain is represented by a sequence the contributions from the interac- of a-carbon (Ca) atoms linked by vir- tions of the side chain with the sol- tual bonds with attached united side vent. The term denotes the chains (SC) and united peptide groups excluded-volume potential of the side- (p). Each united peptide group is lo- chain – peptide-group interactions. The cated in the middle between two con- peptide-group interaction potential is secutive a-carbons. Only these united split into two parts: the Lennard-Jones peptide groups and the united side interaction energy between peptide- chains serve as interaction sites, the group centers and the aver- (a) (b) a-carbons serving only to define the age electrostatic energy between chain geometry. The correspondence peptide-group dipoles ; the sec- Figure 1: Illustration of the correspondence between the all-atom polypeptide chain in water (a) and its UNRES representation (b). The side chains in part (b) are represented by ellipsoids of revolution and the peptide groups are represented by small spheres in between the all-atom and UNRES rep- ond of these terms accounts for the the middle between consecutive a-carbon atoms. The solvent is implicit in the UNRES model. resentations is illustrated in Figure 1. tendency to form backbone hydrogen

40 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 41 bonds between peptide groups and (2) global and not local optimization (as 1. Run the multidimensional MREMD . The terms represent correla- performed so far [6]) of parameter simulations on the training proteins tion or multibody contributions from the where being the ab- space is needed for this purpose. Our for all sets of energy parameters coupling between backbone-local and solute temperature corresponding to recent preliminary results [9] of ran- in the initial population and evaluate backbone-electrostatic interactions, the th trajectory, and denotes the dom scanning the parameter space the target function which makes use and the terms are correlation variables of the UNRES conformation and selecting the best parameter set of the hierarchical structure of the contributions involving consecutive of the th trajectory at the attempted by means of MREMD simulations of two protein energy landscape and se- peptide groups; they are, therefore, exchange point. If and are peptides: 1L2Y and 1LE1, which have lected experimental observables. termed turn contributions. The multi- exchanged, otherwise the exchange is a well-defined a-helical and b-hairpin body terms are indispensable for repro- performed with probability . structure, respectively, produced a 2. Select a subset of energy parameters duction of regular a-helical and b-sheet The multiplexing variant of the REMD force field with only about 5 - 6 Å reso- to generate a new trial population by structures. The terms , where method (MREMD) differs from the lution, but very good transferability. applying standard genetic operators: is the length of th virtual bond and REMD method in that several trajec- Therefore, in the present project, we mutations and crossovers on vectors is the number of virtual bonds, tories are run at a given temperature. aimed at extending this approach to representing energy parameters. are simple harmonic potentials of virtual- Each set of trajectories run at a differ- include a genetic algorithm, larger pro- bond distortion; it has been introduced ent temperature constitutes a layer. teins, and experimental data of protein 3. Run the multidimensional MREMD recently for molecular-dynamics imple- Exchanges are attempted not only folding which are being carried out in simulations on training proteins Applications mentation. The factors account within a single layer but also between our laboratory. after prescreening of trial population Applications for the dependence of the respective layers. In this project we enhanced the of energy parameters on two pep- free-energy terms on temperature [6]. MREMD by implementing multidimen- The goal of our current project is to op- tides with a minimal a-helical and a sional extension, in which pairs of repli- timize the UNRES force field by global minimal b-hairpin fold. The s are the weights of the energy cas run at different temperatures and/ search of the parameter space to re- terms, and they can be determined or different sets of energy-term weights produce structural and thermodynamic 4. Update the current population of (together with the parameters within are exchanged in the multidimensional properties of proteins as functions of energy parameters with best sets each cumulant term) only by optimiza- extension, based on the Metro­polis cri- temperature. The force field will be from trial population. 
If a new set tion of the potential-energy function, terion using the following : applicable to MD simulations of large has a lower score function which is the subject of our present proteins at millisecond time scale in than a previous set and the work. To search the conformational multiple-trajectory mode and to gener- Cartesian distance between these (3) space with UNRES and to study the ating canonical ensembles of proteins. sets is lower than a pre-set cut-off dynamics and thermodynamics of pro- These features will enable us to study value replaces , if the tein folding, we implemented Langevin where and denote different protein folding in both single-molecule distance is greater it is added and dynamics in UNRES [7]. To sample the parameters of the potential energy, and ensemble (kinetics) mode, chap- the “worst” previous set removed. conformational space more efficiently and denotes the variables of the erone interaction with proteins and is decreased as optimization than by canonical MD, we extended [8] UNRES conformation of the replica such biologically important processes progresses. the UNRES/MD approach with the rep- simulated at temperature using and systems as signal transduction, lica exchange and multiplexing replica- potential energy parameters other enzyme action, molecular motors, and 5. Iterate points 2. – 4. exchange molecular dynamics method symbols being the same as in eq. 2. amyloidogenesis, as well as to pursue (MREMD). In the REMD method, ca- This modification was necessary to physics-based prediction of protein The target function mentioned in points nonical MD simulations are carried out carry out runs with multiple sets of en- structures, and the prediction of the 1. and 4. is essentially the sum of the simultaneously, each one at a differ- ergy parameters in global optimization effect of mutations on thermodynamic fourth powers of the deviations of the ent temperature. After every of the parameters of the UNRES force and kinetic stability. points of the calculated and experimen- steps, an exchange of temperatures field. In order to obtain a force field tal free-energy differences between between neighbouring trajectories with folding properties, we developed a The procedure of global optimization of the folded, unfolded, and partially folded is attempted, the decision novel method of force-field optimization the UNRES energy-function parameters forms of the training protein(s). The about the exchange being made based [6,7] which makes use of the hierar- developed within our project consists of reader is referred to our earlier work on the Metropolis criterion, which is chical structure of the protein energy the following 5 steps: for details [6]. To test the genetic algo- expressed by eq 2. landscape. However, we realized that rithm we used the 1LE1 protein, which

Figure 2: The most probable (left, blue; rmsd=2.6 Å) and the lowest-rmsd (right, red; rmsd=1.6 Å) conformation of 1LE1 corresponding to the best initial set of energy-term weights and the experimental (center, green) conformation of this peptide. In the UNRES structures the side chains are represented by virtual Cα ··· SC bonds, while they are represented at atomistic detail in the experimental structure.

Figure 4: The most probable (left, blue; rmsd=2.0 Å) and the lowest-rmsd (right, red; rmsd=1.2 Å) conformation of 1LE1 corresponding to the best final (after 30 cycles) set of energy-term weights and the experimental (center, green) conformation of this peptide. In the UNRES structures the side chains are represented by virtual Cα ··· SC bonds, while they are represented at atomistic detail in the experimental structure.
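The symbols in eqs. (2) and (3), which govern the replica-exchange decisions described in the preceding section, were lost in extraction. In the standard formulation that the text describes, the exchange test between replicas i and j takes the following form (a sketch; sign conventions and the exact notation of the original may differ):

\[
\Delta_{ij} = \left(\beta_i - \beta_j\right)\left[E(\mathbf{q}_j) - E(\mathbf{q}_i)\right],\qquad \beta = \frac{1}{k_B T},
\]

with the exchange accepted unconditionally if \(\Delta_{ij}\le 0\) and with probability \(\exp(-\Delta_{ij})\) otherwise (eq. 2). When the two replicas also run with different energy-parameter sets \(p_i\) and \(p_j\), as in the multidimensional MREMD extension, the analogous quantity is

\[
\Delta_{ij} = \beta_i\left[U_{p_i}(\mathbf{q}_j) - U_{p_i}(\mathbf{q}_i)\right] + \beta_j\left[U_{p_j}(\mathbf{q}_i) - U_{p_j}(\mathbf{q}_j)\right],
\]

used with the same Metropolis acceptance rule (eq. 3).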

forms a single 12-residue b-hairpin formations corresponding to the best References Applications [10]. The initial population consisted set of energy-term weights are shown [1] Scheraga, H. A., Khalili, M., Liwo, A. [7] Liwo, A., Czaplewski, C., Scheraga, H. A. Applications “Protein-folding Dynamics: Overview Rojas, A. V., Murarka, R. K., Ołdziej, S., of 100 randomly-generated sets of in Figure 4. It can be seen that not a of Molecular Simulation Techniques”, Makowski, M., Kazmierkiewicz, R., energy-term weights. The most prob- only the C -rmsd dropped but also the Annual Review of Physical Chemistry, 58, “Simulation of Protein Structure and Dynam­ able and the lowest-Ca-rmsd conforma- strands now have correct orientation pp. 57-83, 2007 ics with the Coarse-grained UNRES Force tion, together with the experimental with respect to the direction of the hair- Field”, In Coarse-Graining of Condensed [2] Rohl, C. A., Strauss, C. E. M., Phase and Biomolecular Systems, 1st edi- 1LE1 structure are shown in Figure 2. pin. Using the developed methodology, Misura, K. M. S., Baker, D. tion; Voth, G., Ed, CRC Press, Taylor & Fran- It can be seen that the most probable we are now optimizing the UNRES force “Protein Structure Prediction Using cis; Boca Raton, FL, pp. 107–122, 2008 conformation has the b-hairpin shape; field with bigger training proteins, which ROSETTA”, Methods in Enzymology, 383, with the b-turn at Asn6-Gly7; however, have more complicated folds. While p. 66, 2004 [8] Czaplewski, C., Kalinowski, S., Liwo, A., Scheraga, H. A. the strands are flipped with respect the reported test required only modest • Dawid Jagiela1 Figure 3: Evolution of [3] Anfinsen, C. B. “Application of Multiplexing Replica Exchange 1,2 fitness score with the to their native positions. The evolution computational resources (20 trajecto- “Principles that Govern the Folding of Protein Molecular Dynamics Method to the UNRES • Cezary Czaplewski number of the cycle for of the fitness-score function with cycle ries and 100 energy-term-weight sets, Chains”, Science, 181, pp. 223-230, 1973 Force Field: Tests with a and a + b Proteins”, • Adam Liwo1,2 the optimization of the number is shown in Figure 3. It can be which makes a total of 2,000 cores), Journal of Chemical Theory and Computa- • Stanislaw Ołdziej2,3 population of 100 sets [4] Scheraga, H. A., Liwo, A., Ołdziej, S., tion, 5, pp. 627–640, 2009 seen that the score dropped by more the actual optimization typically implies • Harold A. Scheraga2 of weights with the Czaplewski, C., Pillardy, J., Ripoll, D. R., 1LE1 peptide. than 2 orders of magnitude. The most much larger jobs. Each energy energy Vila, J. A., Kazmierkiewicz, R., [9] He, Y., Xiao, Y., Liwo, A., Scheraga, H. A. probable and the lowest-Ca-rmsd con- and force calculation in a MREMD run Saunders, J. A., Arnautova, Y. A., “Exploring the Parameter Space of the 1 Faculty of is fine-grained to 8-16 cores and the Jagielska, A., Chinchio, M., Nanias, M. Coarse-grained UNRES Force Field by Chemistry, “The Protein Folding Problem: Global Random Search: Selecting a Transferable number of MREMD trajectories is 128, University of Gdansk Optimization of Force Fields”, Frontiers in Medium-resolution Force Field”, Journal multiplied by 100 energy-term-weight Bioscience, 9, pp. 3296-3323, 2004 of Computational Chemistry, 30, sets run simultaneously. Consequently, pp. 2127–2135, 2009 2 Baker Laboratory the optimization job requires 20,000+ [5] Kolinski, A., Skolnick, J. 
of Chemistry and “Reduced Models of Proteins and their Ap- [10] Cochran, A. G., Skelton, N. J., cores. Consequently, these calcula- Chemical Biology, plications”, Polymer, 45, pp. 511-524, 2004 Starovasnik, M. A. tions can be accomplished only with the “Tryptophan Zippers: Stable, Monomeric Cornell University massively parallel resources provided [6] Liwo, A., Khalili, M., Czaplewski, C., b-hairpins”, Proceedings of the National by the Jülich Supercomputing Centre Kalinowski, S., Ołdziej, S., Wachucik, K., Academy of Sciences U.S.A., 98, 3 Intercollegiate Scheraga, H.A. pp. 5578–5583, 2001 (Blue Gene/P). The optimized force field Faculty of “Modification and Optimization of the United- will be tested in 9th CommunityWide residue (UNRES) Potential Energy Function Biotechnology, Experiment on the Critical Assessment for Canonical Simulations. I. Temperature University of of Techniques for Protein Structure Dependence of the Effective Energy Function Gdansk, Medical and Tests of the Optimization Method with Prediction (CASP9) which will take place University of Gdansk Single Training Proteins”, The Journal of Phys- between March and August 2010. ical Chemistry B, 111, pp. 260-285, 2007

44 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 45 so-called IDB (Incarnation Database). UNICORE user database, Virtual Orga- Recent Advances in the The functionality of the XNJS is ac- nization (VO) management systems can UNICORE 6 Middleware cessible via two sets of Web service be accessed for getting user attributes interfaces. The first set of interfaces such as role, project membership or is called UAS (UNICORE Atomic Ser- local login. To enable interoperability The UNICORE Grid system provides a UNICORE 6 Architecture vices) and offers the full functionality to scenarios, proxy certificates that are seamless, secure and intuitive access A bird’s eye view of the UNICORE 6 higher level services, clients and users. common in other Grid software are op- to distributed computational and data system is given in Figure 1. The system In addition to the UAS, a second set of tionally supported. resources such as supercomputers, can be decomposed into three layers: interfaces based on open standards clusters, and large server farms. the client, services and system layers. defined by the Open Grid Forum is avail- On the lowest layer the TSI (Target UNICORE serves as a solid basis in On the client layer, a variety of client able (depicted as “OGSA-*”). Some of System Interface) component is the many European and international tools are available to the users, rang- the OGSA-* services are still under de- interface between UNICORE and the research projects that use existing ing from the graphical UNICORE Rich velopment or await official ratification, resource management system and op- UNICORE components to implement client (URC), the UNICORE command- and the UNICORE community actively erating system of a Grid resource. In advanced features, higher-level services, line client (UCC) to an easy-to-use high- contributes to these discussions and the TSI component the abstracted com- and support for scientific and business level programming API (HiLA). Custom provides reference implementations for mands from the Grid layer are trans- applications from a growing range of do- clients such as portal solutions can be emerging standards. The service layer lated to system-specific commands, e.g. Projects mains. Since its initial release in August easily implemented. includes a flexible and extensible secu- in the case of job submission, the spe- Projects 2006, the current version UNICORE 6 rity infrastructure, that allows interfac- cific commands like llsubmit or qsub of has been significantly enhanced with The middle layer comprises the UNI- ing to and integrating a variety of secu- the resource management system are new components, features and stan- CORE 6 services. Figure 1 shows three rity components. Instead of the default called. The TSI component is perform- dards, offering a wide range of func- sets of services, the left and right one tionality to its users from science and containing services at a single site while industry. After giving a brief overview of the middle shows the central services, UNICORE 6, this article introduces some such as the central registry, the work- of these new features and components. flow services and the information ser- vice, which serve all sites and users in a Overview of UNICORE 6 UNICORE Grid. The Gateway component UNICORE 6 is built on industry-standard acts as the entry point to a UNICORE technologies, using open, XML-based site and performs the authentication of solutions for secure inter-component all incoming requests. 
It acts as a HTTP communication. UNICORE 6 is designed and Web services router, forwarding cli- according to the principles of Service ent requests to the target service, real- Oriented Architectures (SOAs). It uses izing a secure firewall tunnel. Therefore, SOAP-based stateful Web services, in a typical UNICORE installation, only a as standardized by the standards con- single open firewall port is necessary. sortia W3C, OASIS and the Open Grid Forum (OGF). The security architecture The central UNICORE server compo- is based on X.509 certificates, the Secu- nent is based on UNICORE 6’s Web rity Assertion Markup Language (SAML) services hosting environment. The and the eXtensible Access Control XNJS component is the job manage- Markup Language (XACML). UNICORE 6 ment and execution engine of UNICORE is implemented in the platform-neutral 6. It performs the job incarnation, languages Java and Perl, and UNICORE 6 namely the mapping of the abstract job services successfully run on Linux/Unix, description to the concrete job descrip- Windows and MacOS, requiring only a tion for a specific compute resource Java 5 (or later) runtime environment. according to the rules stored in the

Figure 1: Overview of the UNICORE 6 system

ing tasks under the user's UID and is the only UNICORE component that needs to be executed with root privileges. The TSI is available for a variety of commonly used resource management systems such as Torque, LoadLeveler, LSF, SLURM, and Sun GridEngine. In addition to job submission and job management services, UNICORE 6 provides unified access to storages via its StorageManagement and FileTransfer service interfaces. File systems are accessed directly via the TSI, while other storages can be integrated using custom adaptors. For example, an adaptor to the Apache Hadoop cluster storage system has been realized.

Recent Development Highlights

Workflow System
The UNICORE 6 workflow system offers powerful workflow features such as sequences, branches, loops and conditions, while integrating seamlessly into UNICORE Grids and into the UNICORE clients. The workflow system supports automated resource selection and brokering, and is scalable due to its two-layered architecture. Recently a for-each loop construct has been added that makes it possible to process large file sets or perform parameter studies in a scalable and very user-friendly fashion. Figure 2 shows a screenshot of the URC, with the workflow editor shown on the right-hand side.

Improved Data and Storage Management
UNICORE's data and storage management capabilities have been integrated into the UNICORE Rich Client in an easy-to-use fashion. The URC allows common operations like editing, deleting or renaming, as well as drag and drop of files between the remote sites and the user's desktop. A new StorageFactory service makes it possible to create and manage remote storage resources.

Execution Environments: Improved Support for Parallel Applications
Users running parallel applications are usually required to know the invocation details of the parallel environment (such as OpenMP or MPI). A recently added UNICORE feature allows administrators to easily parametrize the available parallel environments in the UNICORE IDB, so that users can easily choose the required environment and set options and arguments without having to know the system specifics.

Remote Administration
It is often convenient to check site health, get up-to-date performance metrics and even deploy and un-deploy services remotely using the standard communication channels (i.e. Web services), without having to rely on other tools such as Nagios, or needing to actually log in to the UNICORE 6 server. For this reason, an Admin service has been developed that offers various remote administration and monitoring features. It can be used through a graphical URC plugin and through the command-line client. Using the UNICORE XACML-based access control, access to this service is of course limited to privileged users.

Shibboleth Integration in the UNICORE Rich Client
Shibboleth is a standards-based identity management system aiming to provide single sign-on across organisational boundaries. It is considered a prime candidate for simplifying the end-user experience by hiding the specifics of the X.509 certificate based security. Therefore a plugin for the URC was developed that provides access to Shibboleth-enabled attribute authorities, which create short-lived certificates that enable access to the Grid resources.

Interactive Access to Grid Sites
Building on the single sign-on feature of the UNICORE Rich Client, it is desirable to provide interactive terminal access to the remote site. A URC plugin has been developed that can be used to open an interactive connection using a variety of means such as standard SSH with username and password, GSI-SSH and a custom X.509 based solution. The client-side plugin offers VT100 emulation and port forwarding.

Open Source Model
UNICORE 6 is available as open source under the BSD license from SourceForge. At http://www.unicore.eu more information can be obtained and lightweight UNICORE 6 installation packages can be downloaded. A small UNICORE 6 testgrid is available for immediate testing. UNICORE 6 is continually evolving, with regular releases approximately every three months. The main contributors in the international UNICORE open-source community are ICM Warsaw, CINECA, CEA, Technische Universität Dresden, and Forschungszentrum Jülich.

• Bernd Schuller
• Morris Riedel
• Achim Streit

Jülich Supercomputing Centre, Forschungszentrum Jülich

Figure 2: Screen shot of the UNICORE Rich Client

LIKWID Performance Tools

Figure 1: Topology of a typical Intel Nehalem based Cluster Node
Figure 2: Relationship between counter registers, hardware events and LIKWID event groups

Today's multicore processors bear multiple complexities when aiming for high performance. Conventional performance tuning tools like Intel VTune, OProfile, CodeAnalyst, OpenSpeedshop etc. require a lot of experience in order to get sensible results. For this reason they are usually unsuitable for the scientific user, who would often be satisfied with a rough overview of the performance properties of their application code. Moreover, advanced tools often require kernel patches and additional software components, which makes them unwieldy and bug-prone. Additional confusion arises with the complex multi-core, multi-cache, multi-socket structure of modern systems; affinity has become an important concept, but users are all too often at a loss about how to handle it.

LIKWID ("Like I Knew What I'm Doing") is a set of easy-to-use command line tools to support optimization and is targeted towards performance-oriented programming in a Linux environment. It does not require any kernel patching and is suitable for Intel and AMD processor architectures. Multi-threaded and even hybrid shared/distributed-memory parallel code is also supported. Building it only requires a GNU C compiler and is a matter of seconds. In the following we describe all four tools in detail:

likwid-perfCtr
Hardware performance counters are facilities to count hardware events during code execution on a processor. Because this mechanism is implemented directly in hardware there is no overhead involved. All modern processors provide hardware performance counters, but their primary purpose is to support computer architects during the implementation phase. Still, they are also attractive for application programmers, because they allow an in-depth view of what happens on the processor while running applications. Tools for hardware performance counters have a reputation for being complex to install and even more complex to use. In almost all cases a special kernel module is necessary. Many tools are restricted to one processor vendor, others require code changes in the user's application. Hardware performance counters are controlled and accessed using processor-specific hardware registers (also called model specific registers (MSR)). likwid-perfCtr uses the Linux msr module to modify the MSRs from user space. It can be used as a wrapper or with code markers to probe only parts of the application. A major problem with hardware performance events is how to choose and interpret them correctly. The events differ between processors and it is difficult for a beginner to choose the right events to help in their optimization effort. likwid-perfCtr provides preconfigured groups with useful, ready-to-use event sets and derived metrics like bandwidths and event ratios. Still, likwid-perfCtr is fully transparent, i.e. it is clear at any given time on which events the performance groups are based. This ease of use sets likwid-perfCtr apart from existing tools.

Supported architectures:
• Intel Pentium M (Banias, Dothan)
• Intel Core 2 (all variants)
• Intel Nehalem (Xeon, i7, i5)
• AMD K8 (all variants)
• AMD K10 (Barcelona, Shanghai, Istanbul)
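As a rough illustration of this msr-module mechanism (and not of LIKWID's actual code), the following C sketch programs and reads the fixed-purpose "instructions retired" counter of core 0 through the /dev/cpu/0/msr device. The MSR addresses follow Intel's documentation for Nehalem-generation processors; the program assumes the msr kernel module is loaded and sufficient privileges:

/* Minimal sketch: count retired instructions on core 0 via /dev/cpu/0/msr.
   Illustrative only; likwid-perfCtr wraps this kind of low-level access.
   For a meaningful reading the process would also be pinned to core 0. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_FIXED_CTR_CTRL   0x38D  /* enables the fixed-purpose counters */
#define IA32_PERF_GLOBAL_CTRL 0x38F  /* global counter enable bits         */
#define IA32_FIXED_CTR0       0x309  /* counts INSTR_RETIRED.ANY           */

static uint64_t rdmsr(int fd, uint32_t reg)
{
    uint64_t val = 0;
    pread(fd, &val, sizeof(val), reg);   /* file offset selects the MSR */
    return val;
}

static void wrmsr(int fd, uint32_t reg, uint64_t val)
{
    pwrite(fd, &val, sizeof(val), reg);
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDWR);   /* needs root privileges */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    wrmsr(fd, IA32_FIXED_CTR_CTRL, 0x3);          /* fixed ctr 0: user + OS   */
    wrmsr(fd, IA32_PERF_GLOBAL_CTRL, 1ULL << 32); /* enable fixed ctr 0 only;
                                                     clobbers other enable bits,
                                                     acceptable for this toy   */

    uint64_t start = rdmsr(fd, IA32_FIXED_CTR0);

    volatile double x = 0.0;                      /* toy workload to measure  */
    for (int i = 0; i < 10000000; i++) x += 1.0;

    uint64_t stop = rdmsr(fd, IA32_FIXED_CTR0);
    printf("retired instructions: %llu\n", (unsigned long long)(stop - start));

    close(fd);
    return 0;
}

likwid-perfCtr hides exactly this kind of low-level detail behind its preconfigured event groups and derived metrics.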

likwid-topology
Multicore/multisocket machines exhibit complex topologies, and this trend will continue with future architectures. Performance programming requires in-depth knowledge of cache and node topologies, i.e. which caches are shared between which cores and which cores reside on which sockets. The Linux kernel numbers the usable cores and makes this information accessible in /proc/cpuinfo. Still, how this numbering maps onto a processor's topology depends on BIOS settings and may even differ for otherwise identical processors. The processor and cache topology can be queried with the cpuid machine instruction. likwid-topology is based directly on the data provided by cpuid. It extracts the machine topology in an accessible way and can also report on cache characteristics. There is an option to get a quick overview of the node's cache and socket topology in ASCII art.

Figure 3: Intel Core i7

likwid-pin
Thread/process pinning is vital for performance. If topology information is available, it is possible to pin threads according to the application's resource requirements like bandwidth, cache sizes etc. likwid-pin supports thread affinity for all threading models that are based on POSIX threads, which includes most OpenMP implementations. By preloading a library wrapper to the pthread_create API call, likwid-pin is able to bind each thread upon creation (a sketch of this preloading mechanism is given at the end of this article). It configures a dynamic library using environment variables and starts the real application with the library preloaded. No code changes are required, but the application must be dynamically linked.

Supported threading implementations:
• POSIX threads
• Intel OpenMP
• gcc OpenMP

Other threading implementations are supported via a "skip mask". The skip mask specifies which threads should not be pinned by the wrapper library. For instance, the Intel OpenMP implementation always runs OMP_NUM_THREADS+1 threads but uses the first created thread as a management thread. The skip mask also makes it possible to pin hybrid applications by skipping MPI shepherd threads. In addition, likwid-pin can be used for sequential programs as a replacement for the (inferior) taskset tool.

Some compilers provide their own mechanisms for thread affinity. In order to avoid interference effects, those mechanisms should be disabled when using likwid-pin. In case of the Intel compiler, this can be achieved by setting the environment variable KMP_AFFINITY. The big advantage of likwid-pin is that the same tool can be used for all applications, compilers, MPI implementations and processor types.

Figure 6: Pinning threads with likwid

likwid-features
An important hardware optimization on modern processors is to hide data access latencies by hardware prefetching. Intel processors not only have a prefetcher for main memory; several prefetchers are responsible for prefetching the data inside the cache hierarchy. Often it is beneficial to know the influence of the hardware prefetchers, and in some situations turning off hardware prefetching might even increase performance. On the Intel Core 2 processor this can be achieved by setting bits in the IA32_MISC_ENABLE MSR register. Besides the ability to toggle the hardware prefetchers, likwid-features reports on the state of switchable processor features such as Intel SpeedStep. likwid-features only works for Intel Core 2 processors.

Future Plans
Some enhancements to the suite are imminent. Due to the limited number of performance counter registers on current CPUs, a code must often be run multiple times in order to get complete information. Event multiplexing, i.e. switching between sets of event sources on a regular basis, will enable the concurrent measurement of many performance metrics. This will introduce a statistical error into the event counts, which is, however, well controllable. The API for enabling and disabling performance counters in the current version of likwid-perfCtr is only rudimentary and will be enhanced to become more flexible.

Further goals are the combination of LIKWID with the well-known IPM tool for MPI profiling, to facilitate the collection of performance counter data in MPI programs, and the integration of a simple bandwidth map benchmark, which will allow a quick overview of the cache and memory bandwidth bottlenecks in a shared-memory node, including the ccNUMA behaviour.

How to get it
The LIKWID performance tools are open source. You can download them at: http://code.google.com/p/likwid/

• Jan Treibig
• Michael Meier
• Georg Hager
• Gerhard Wellein

Regional Computer Centre (RRZE), University of Erlangen-Nuremberg
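The preloading technique used by likwid-pin can be pictured with a small shim library. The following C sketch is illustrative only (naive round-robin placement, no skip mask, no error handling, hypothetical file name pin.c); it is not likwid-pin's actual implementation:

/* pin.c: sketch of the LD_PRELOAD technique behind likwid-pin.
   Intercepts pthread_create and binds every new thread to a core. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg)
{
    static create_fn real_create = NULL;
    static int next_core = 0;               /* naive round-robin placement   */

    if (!real_create)                       /* look up the real pthread_create */
        real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");

    int ret = real_create(thread, attr, start_routine, arg);
    if (ret == 0) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(next_core, &set);           /* bind the new thread to one core */
        pthread_setaffinity_np(*thread, sizeof(set), &set);
        fprintf(stderr, "pinned new thread to core %d\n", next_core);
        next_core++;
    }
    return ret;
}

Built with gcc -shared -fPIC -o libpin.so pin.c -ldl and activated by starting the dynamically linked application with LD_PRELOAD pointing to libpin.so, every newly created thread is bound to a core without any change to the application source. likwid-pin adds on top of this the selection of target cores, the skip mask and support for the different threading implementations.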

Figure 4: Intel Dunnington
Figure 5: Output of likwid-topology for an Intel Xeon node with two sockets
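The cache characteristics reported by likwid-topology (Figure 5) are ultimately derived from the cpuid instruction mentioned above. On Intel processors the deterministic cache parameters leaf can be walked directly; the following minimal C sketch (using GCC's cpuid.h helpers, with the bit fields as documented by Intel) illustrates only the mechanism, not the tool's implementation:

/* Walk Intel CPUID leaf 4 ("deterministic cache parameters") and print
   the caches seen by the core the program happens to run on. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    for (unsigned int sub = 0; ; sub++) {
        unsigned int eax, ebx, ecx, edx;
        __cpuid_count(4, sub, eax, ebx, ecx, edx);
        (void)edx;

        unsigned int type = eax & 0x1f;          /* 0 means no more caches */
        if (type == 0)
            break;

        unsigned int level      = (eax >> 5) & 0x7;
        unsigned int line_size  = (ebx & 0xfff) + 1;
        unsigned int partitions = ((ebx >> 12) & 0x3ff) + 1;
        unsigned int ways       = ((ebx >> 22) & 0x3ff) + 1;
        unsigned int sets       = ecx + 1;
        unsigned int size_kb    = ways * partitions * line_size * sets / 1024;

        printf("L%u %s cache: %u KiB, %u-way, %u B lines\n",
               level,
               type == 1 ? "data" : type == 2 ? "instruction" : "unified",
               size_kb, ways, line_size);
    }
    return 0;
}

likwid-topology combines this kind of information with the core numbering of the Linux kernel to report which cores share which caches and which cores reside on which sockets.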

52 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 53 als as linear combinations of a finite set In the program Project ParaGauss of localized, most often Gaussian-type ParaGauss function . The resulting matrix ele- the procedure k The interdisciplinary project ParaGauss Compared to other areas of simula- ments are then obtained as integrals, sketched above in the framework of the Munich Center tion, like fluid dynamics or continuum in fact scalar products , over is parallelized in of Advanced Computing aims at devel­ mechanics in engineering or the geo- the various terms V of the Kohn-Sham a coarse-grained oping and implementing new and im- sciences, quantum chemistry meth- operator, representing the kinetic fashion using mes- proved strategies for calculating the ods, such as those based on Density energy of the electron, the attractive sage passing via MPI. properties of molecular systems at the Functional Theory (DFT), lack a com- potential due to the atomic nuclei, the Each individual subtask is level of quantum mechanics, i.e. from mon data structure which would allow repulsion of an electron by the classical divided in as few as possible “first principles” and without empirical an efficient and homogeneous paral- Coulomb potential due to electron den- consecutive blocks to minimize assumptions. The project relies on the lelization of the entire algorithm. sity and the so-called exchange and com­munication. Crucial for an efficient density functional approach to calcu- correlation terms. parallelization is that a large amount late the electronic structure of atoms, At the heart of the problem are the of “integrals” have to be handled molecules or solids and builds on the solutions of the Kohn-Sham equa- These latter terms comprise the when constructing the matrix ele- implementation in the parallel program tion where H is the quantum mechanical aspects of the ments of H. For larger molecules ParaGauss [1]. To reach the goals of Kohn-Sham operator which represents many-body effects. The more demand- this data set reaches several GB. Projects the project, alternatives to current the interaction energies of one electron ing tasks in the Kohn-Sham procedure Also, the computational effort of the Projects parallel implementations have to be with all the others as well as with the are the determination of the matrix different tasks scatters widely, result- developed, to increase the number of atomic nuclei of the system. Summing elements according to analytical ing in bottlenecks that limit the number efficiently usable processors and espe- over a sufficient number of one-elec- integrals and the evaluation of the cially to improve the usability of mod- tron orbitals from lower to higher electron-electron interaction during ern multi-core architectures. energies yields the electronic density the self-consistency cycle from such in- i tegrals (Coulomb term) or by numerical integration (exchange and correlation terms). A smaller task, besides others, The Kohn-Sham operator H itself de- is the diagonalization of the generalized pends on the electron density so eigenvalue problem. To determine the that the Kohn-Sham equation has to be electronic structure at a given molecu- solved iteratively until “self-consistency” lar geometry, i.e. a given set of nuclei has been reached. 
One starts from assumed to be fixed in space, one has an initial guess for , constructs to carry out these tasks repeatedly the Kohn-Sham operator H and solves and consecutively during the iteration the Kohn-Sham equation. With the re- process. In a typical application, one sulting density, one constructs a new determines the geometry of a molecu- operator, and so on, until the initial and lar system (Figure 1), i.e. one has to resulting orbitals are sufficiently close minimize the energy by varying the mo- to each other. lecular geometry, which requires first- order derivatives (forces) of the energy Various subtasks of mathematically dif- with regard to nuclear displacements, ferent structure need to be tackled hence additional analytical and numeri- in this procedure. First, the Kohn- cal integrations. Often one also has to Sham equation is transformed to determine second-order derivatives of a generalized matrix eigenvalue the energy with respect to nuclear co- problem by rep- ordinates, e.g. when one needs an ac- Figure 1: A model catalyst: transition metal hydride cluster Rh6H3 (rhodium resenting curate description (frequencies) of the purple, hydrogen white) adsorbed at the wall of a SiO2 framework (zeolite of the orbit- vibrations of the framework of nuclei. faujasite structure; silicon brown, oxygen grey)
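In compact form, and in generic textbook notation (the symbols below are standard choices, not reproduced from the ParaGauss publications), the procedure sketched above reads:

% Kohn-Sham equation and electron density; H depends on rho, hence the iteration
\hat{H}[\rho]\,\psi_i(\mathbf{r}) = \varepsilon_i\,\psi_i(\mathbf{r}),
\qquad
\rho(\mathbf{r}) = \sum_{i}^{\mathrm{occ.}} \bigl|\psi_i(\mathbf{r})\bigr|^2

% Expansion of the orbitals in a finite set of localized (Gaussian-type) basis functions
\psi_i(\mathbf{r}) = \sum_{k} c_{ki}\,\varphi_k(\mathbf{r})

% which turns the Kohn-Sham equation into a generalized matrix eigenvalue problem
\sum_{l} H_{kl}\,c_{li} = \varepsilon_i \sum_{l} S_{kl}\,c_{li},
\qquad
H_{kl} = \langle \varphi_k \,|\, \hat{H} \,|\, \varphi_l \rangle, \quad
S_{kl} = \langle \varphi_k \,|\, \varphi_l \rangle

Because the operator depends on the density, the matrix elements have to be rebuilt and the generalized eigenvalue problem has to be solved repeatedly until the orbitals, and hence the density, no longer change; this is the self-consistency cycle referred to in the text.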

54 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 55 of efficiently usable MPI tasks by means of OpenMP. On for symmetric molecular frameworks will increase the overall efficiency of a processors, especially a shared-memory multi-core system, or calculations that account for spin simulation. Implementation of this com- for the smaller sub- we will compare the efficiency of an polarization of open-shell systems. bined strategy requires a strict modu- tasks. To improve the OpenMP parallelized variant of a mod- These densely filled submatrices are larization of the overall procedure into usage of multi-core ule that evaluates analytical integrals symmetric (or hermitian) with typi- separate modules for each subtask. architectures as with the existing version on the basis cal dimensions ranging from 100 to a The number of processors used at a well as for a generally of message-passing parallelization. few 1,000, reaching 10,000 for large time will then be determined taking higher efficiency when This module will serve as prototype for problems. Currently these matrices into account the optimum number of engaging a larger number MPI parallelization between nodes. are diagonalized by a serial algorithm, processors for specific tasks and the of processors, several strat- but submatrices are treated in paral- number of currently available cores. egies will be combined in a new imple- As algorithmically simple tasks, e.g. lel. As parallel eigensolvers are not mentation of ParaGauss. Multi-core numerical integration, are easily paral- very efficient for smaller matrices, we Two-level parallelization, modulariza- architectures provide less memory lelized, computationally less demanding constructed a mixed procedure for di- tion, optimization of memory usage per core and typically slower access tasks, that are harder to parallelize, agonalizing a set of matrices. First one and pool-parallel execution of tasks than single-core hardware. This issue consume a growing part of the total determines a hardware-dependent opti- taken together will yield a new soft- will become even more important as run time with growing number of pro- mum number of cores for diagonalizing ware structure for quantum chemical memory access is already a bottleneck cessors. Therefore it is necessary to matrices of increasing size. Then one calculations that is also ready for dis- Projects at current computing platforms. optimize all tasks which take up more optimizes the order in which the matri- tributed computing. We are confident Projects than ~1 % of real time in a sequen- ces are diagonalized, employing varying that the MAC project ParaGauss will One way out is to limit the amount of tial run. Otherwise these tasks will numbers of processors to achieve a deliver a simulation tool of enhanced data of parallel tasks at the expense dampen the acceleration for a larger low overall execution time (Figure 2). usability and efficiency on shared- of communication overhead. Currently number of processors. memory multi-core environments. we are exploring an alternative solution In this spirit, other tasks also may be which uses message passing paral- The solution of the generalized matrix solved more efficiently by allowing vari- References lelization between multi-core shared eigenvalue problem is such a task. In able numbers of cores. 
In this way, [1] Belling, Th., Grauschopf, T., Krüger, S., Mayer, M., Nörtemann, F., Staufer, M., memory nodes and, on each multi-core general the matrix to be diagonalized dynamically variable numbers of cores Zenger, C., Rösch, N., node, low-level parallelization of single is decomposed into diagonal blocks will reduce the overall runtimes while "High Performance Scientific and Engineer- it also wastes hardware resources ing Computing", in Bungartz, H.-J., Durst, • Notker Rösch to a certain degree. In typical applica- F., Zenger, C. (Eds.), Proceedings of the • Sven Krüger First International FORTWIHR Conference tions, one hardly ever carries out just 1998, Lecture Notes in Computational a single huge simulation; rather, sev- Science and Engineering, Springer, Technische eral if not many similar calculations are Heidelberg, Vol. 8, p. 439, 1999 Universität needed. This observation may help to München, resolve the conflict just described by Theoretische tackling several such application tasks Chemie in parallel on a larger pool of cores. Typical problems benefitting from such a pooling strategy are the determina- tion of energies of educts and prod- ucts of a reaction, the determination of a reaction path that is represented by a finite set of intermediate struc- tures, or the comparison of various isomers of a given compound. Combin- ing this approach with an optimized dy-

namical usage of the numbers of cores for different subtasks of the algorithm

Figure 2: Scheme of parallel solution of several matrix eigenvalue problems. Left: serial solution of individual problems in parallel; right: parallel solution of problems, partially parallelized.

Real World Application Acceleration with GPGPUs

Future microprocessor development Molecular Dynamics) is a free-of-charge efforts will continue to concentrate on molecular dynamics simulation pack- adding cores rather than increasing age written using the Charm++ paral- single-thread performance. Similarly, lel programming model, noted for its the highly parallel Graphics Processing parallel efficiency and regularly used Unit (GPU) is rapidly gaining maturity as to simulate large systems (millions of a powerful engine for computationally atoms). It has been developed by the demanding applications [1]. GPUs, de- joint collaboration of the Theoretical signed primarily for 3D graphics appli- and Computational Biophysics Group cations, are starting to become an at- and the Parallel Programming Labo- tractive option to accelerate scientific ratory at the University of Illinois at computations. Modern GPUs do not Urbana-Champaign. At the LRZ, NAMD Projects only offer powerful graphics but also a (version 2.7b1) has been compiled both Projects highly parallel programmable processor with and without GPU support. CUDA featuring peak arithmetic performance acceleration mixes well with task-based and memory bandwidth to the on-board parallelism, allowing NAMD to run on

graphics card memory that substan- clusters with multiple GPUs per node. Figure 1: The Figure depicts a representation of the file apoa1 that was used in this article. This file tially out-paces the CPUs counterpart. consists of 92,224 atoms, including the water box surrounding the molecule (water is not shown in However, making efficient use of GPU GPUs at the LRZ the Figure). The full name of the molecule is “ApoA-1 Milano”. It is a naturally occurring mutated variant of the apolipoprotein A1 protein found in human HDL, the lipoprotein particle that carries cholesterol clusters still presents a number of The LRZ operates two powerful visual- (depicted in yellow in the middle of the protein) from tissues to the liver and it is associated with challenges in application development ization servers since early 2009. Both protection against cardiovascular disease. and system integration. GPUs are opti- machines are SunFire-X4600-M2 serv- mized for structured parallel execution, ers with 32 CPU cores (8 quad-core atomic trajectories by solving equa- yond the specified cutoff distance) are extensive ALU counts and memory Opterons with 2.3 GHz) and 256 GB tions of motion numerically using em- computed less frequently. This reduces bandwidth. GPUs are designed to pro- RAM each. Each server has 4 Nvidia pirical force fields [5,6], such as the the cost of computing the electrostatic vide high aggregate performance for Quadro FX5800 graphics cards at- CHARMM force field [7], that approxi- forces over several time steps. The ap- tens of thousands of independent com- tached. The graphics cards feature mate the actual atomic force in molec- plication decomposes and schedules the putations. This key design choice allows 4GB on-board memory and 240 CUDA ular systems. NAMD is distributed free computation among CPU computation GPUs to spend the vast majority of chip parallel processing cores (the Nvidia of charge to academics, pre-compiled and GPU kernel. The GPU computes the die area (and thus transistors) on arith- Quadro FX5800 graphics card is iden- for popular computing platforms or non-bonded interactions, bonded inter- metic units rather than on caches. tical to the Tesla C1060 card). CUDA in source code form [4]. The velocity actions such as angle, dihedral and tor- (Compute Unified Device Architecture) Verlet integration method [8] is used sion are still done on the CPU. A particularly interesting application is a parallel computing architecture to advance the positions and velocities for GPU acceleration is the field of mo- developed by NVIDIA [3]. CUDA has of the atoms in time. To further reduce At every iteration step, NAMD has to lecular dynamics simulation, which is unlocked the computational power of the cost of the evaluation of long-range calculate the short-range inter­action dominated by N-body atomic force cal- GPUs and made GPUs much more electrostatic forces, a multiple time forces between all pairs of atoms culations. This article re-examines the accessible to computational scientists. step scheme is employed. The local within a cutoff distance. By partitioning performance of GPUs and compares interactions (bonded, van der Waals space into patches slightly larger than the results to pure CPU performance. 
NAMD with CUDA and electrostatic interactions within the cutoff distance, NAMD ensures This comparison will be carried out NAMD is a parallel molecular dynam- a specified distance) are calculated at that all interactions of one atom are using the NAMD molecular dynamics ics (MD) code for large biomolecular each time step. The longer range inter- with atoms in the same or neighboring simulation package. NAMD (NAnoscale systems [2]. MD simulations compute actions (electrostatic interactions be- cubes.
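The cutoff-plus-patch scheme described above can be pictured with a deliberately simplified C kernel. The data layout, the names and the placeholder pair potential below are illustrative assumptions and not NAMD's actual code; in the CUDA-enabled build it is essentially this non-bonded inner loop that is off-loaded to the GPU:

/* Schematic non-bonded kernel: accumulate forces on the atoms of one patch
   ("home" cell) from the atoms of one neighbouring cell. Atoms are assumed
   to have been sorted into cells whose edge length is at least the cutoff. */
#include <math.h>

typedef struct { double x, y, z; } Vec3;

static void pair_forces(const Vec3 *pos_a, int na,      /* home patch      */
                        const Vec3 *pos_b, int nb,      /* neighbour patch */
                        double cutoff, Vec3 *force_a)
{
    const double rc2 = cutoff * cutoff;

    for (int i = 0; i < na; i++) {
        for (int j = 0; j < nb; j++) {
            double dx = pos_a[i].x - pos_b[j].x;
            double dy = pos_a[i].y - pos_b[j].y;
            double dz = pos_a[i].z - pos_b[j].z;
            double r2 = dx*dx + dy*dy + dz*dz;

            if (r2 < rc2 && r2 > 0.0) {
                /* placeholder pair potential: |F| = 1/r^2, directed along
                   the pair vector (a real code evaluates van der Waals and
                   electrostatic terms here, usually via lookup tables)     */
                double inv_r2 = 1.0 / r2;
                double s = inv_r2 * sqrt(inv_r2);        /* equals 1/r^3 */
                force_a[i].x += s * dx;
                force_a[i].y += s * dy;
                force_a[i].z += s * dz;
            }
        }
    }
}

Because each patch edge is at least as long as the cutoff, such a routine only needs to be called for a patch paired with itself and its 26 neighbouring patches, which keeps the work per atom bounded independently of the total system size.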

Figure 2: Run-time as a function of the cutoff value. Values for the CPU-only version of NAMD are represented by black circles, values for the CPU-GPU version of NAMD are depicted by red squares.
Figure 3: Variation of time performance as a function of cutoff and twoAway flags.

Results and Discussions The non-bonded interactions (including software at the LRZ Supercomputing [4] http://www.ks.uiuc.edu/Research/namd/ The performance results as a function electrostatic and van der Waals interac- Centre. The speed-up of the GPU- [5] Allen, M. P., Tildesley, D. J. of the cutoff is shown in Figure 2, where tions) are calculated beyond the cutoff enhanced version of NAMD depends “Computer Simulation of Liquids”, we compare the results of the CPU- distance on the GPUs. The rest of the most strongly on the cutoff parameter Oxford University Press, New York, 1987 only version of NAMD (using 4-Cores calculations such as the bonded interac- and also on the twoAwayX, twoAwayY, Opteron 2.3 GHz) with the CPU-GPU tion are done on the CPU. As the cut- and twoAwayZ options. These options [6] McCammon, J. A., Harvey, S. C. “Dynamics of Proteins and Nucleic Acids”, version of NAMD, using the Nvidia- off gets larger, the ratio of non-bonded reflect how the patches are arranged Cambridge University Press, Cambridge, FX58000 graphics cards. A larger to bonded calculations increases. This and the non-bonded interactions are 1987 • Momme Allalen value for cutoff means more compu- greatly favors the GPU as can be seen passed on to the GPU (some arrange- • Helmut Satzger tation is transferred to the GPU. To in the figures. However, most real-world ments are much better than others). [7] MacKerell, Jr. A. D., Banavali, N., • Ferdinand Jamitzky Foloppe, N. generate the results in Figure 2, one simulations use cutoff values that lead To get the optimum performance out “Development and Current Status of the has to switch off the flags twoAwayX, to the CPU being used more heavily of GPUs, fine-grained parallelism and CHARMM Force Field for Nucleic Acids”, Leibniz twoAwayY and twoAwayZ. The configu- than the GPU. As such, the huge pos- highly optimized code are needed. Biopolymers 56: 257–265 (2001). Supercomputing ration files of the apoa1 benchmark sible performance gain using GPUs can- Centre [8] Ma, Q., Izaguirre, J. A., Skeel, R. D. were used by varying the cutoff values not be fully passed on to the scientist. References “Verlet-I/r-RESPA/Impulse is Limited by [1] Owens, J. D., Houston, M., Luebke, D., from 6\Å to 32\Å. To increase paral- Using the CUDA streaming API for asyn- Nonlinear Instabilities Siam Journal on Green, S., Stone, J. E., Phillips, J. C. lelism on GPUs, every on or off com- chronous memory transfers and kernel Scientific Computing, 24:1951-1973, “GPU computing” Proceedings of the IEEE, 2003 bination of the twoAwayX, twoAwayY invocations to overlap GPU computation 96:879-899, May 2008 and twoAwayZ flags was tested. For with communication and other work example, twoAwayX=yes means that done by the CPU yields speedups of up [2] Phillips, J. C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, W., Schulten, K. the simulation box size is halfed in the to a factor of nine times faster than Chipot, Ch., Skeel, R. D., Kale, L., Villa, E., x direction. The results for different CPU-only runs. “Scalable Molecular Dynamics with NAMD”, simulation box sizes are summarized in Journal of Computational Chemistry, Figure 3. The majority of calculations Conclusion 26:1781-1802, 2005 are of the non-bonded type, and this is Application acceleration using GPUs [3] Nvidia CUDA (Compute Unified Device exactly what is off-loaded to the GPU has shown promising results for the Architecture) “Programming Guide”, in the CUDA-enabled version of NAMD. 
NAMD molecular dynamics simulation Nvidia, Santa Clara, CA 2007

60 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 61 • Memory is the most precious I/O nodes are equally distributed over Storage and Network Design for resource on petascale supercom- those 4 switch fabrics. Rather than the JUGENE Petaflop System puters, and the memory-efficient attaching whole BG/P racks to a single organization and scaling of applica- fabric (which would limit the rack to tion I/O patterns is an active area that one fabric), the 8 I/O nodes of a In conjunction with the installation of challenging than solutions for a single of our petascale research. BG/P rack are distributed over all four the IBM Blue Gene/P Petaflop system supercomputer. The JUST cluster uses fabrics, so smaller BG/P jobs can also JUGENE at Forschungszentrum Jülich 10-Gigabit Ethernet as its networking The Networking Layer benefit from the aggregate bandwidth (inSiDE Vol. 7 No. 1, p. 4), the support- technology, for several reasons: The networking solution needs to sup- of all four planes. ing storage and networking infrastruc- port a sustained GPFS bandwidth of ture has been upgraded to a sustained • The I/O nodes of the Blue Gene/P (as 66 GByte/s and has to provide 10 GbE The following candidates have been bandwidth of 66 GByte/s, matching the the main client) have on-chip 10 GbE. ports for 700 to 800 participants: evaluated for the switch fabric: enhanced computational capabilities. the 600 I/O nodes of the 72-rack • Ethernet is by far the most stable, Blue Gene/P, at least 100 ports in the • Cisco Nexus 7000 in its Starting with the installation of the ini- interoperable, and manageable JUST storage cluster, plus connectivity 18-slot version tial 16-rack JUGENE system in 2007, networking solution. to additional systems. To achieve this, scalable parallel file services have been JSC together with IBM started the • Force10 E1200i as a 14-slot switch Systems moved outside of the supercomputer • 10 GbE bandwidth is adequate, and design and planning phase well ahead Systems systems and are now provided by a the ultra-low latencies of other HPC of the actual deployment. Even today • Myricom Myri-10G switches dedicated storage cluster JUST, which interconnects are not needed for there is no switch solution on the is based on IBM General Parallel File typical HPC storage traffic. market that offers 700 to 800 10 GbE System (GPFS). There are three main ports on a single backplane, and the operational considerations leading to Expanding this infrastructure to support combination of multiple switches intro- this approach: Petaflop systems requires a solution de- duces several design and bandwidth sign which addresses some unique chal- limitations, e.g. due to port channel 1. Data is more long-lived than the lenges which are not present (or not as restrictions (IEEE802.3ad) and the compute systems it is generated on. relevant) at smaller scale. spanning tree algorithm.

2. There is a need to attach more than • The 10 GbE network needs to After discussing different network de- a single supercomputing resource provide very large port counts, and signs and even routing protocol based to the parallel file system (locally and has to sustain Terabits of bandwidth network solutions (e.g. equal-cost multi-­ in different projects, e.g. DEISA). between the supercomputer(s) and path), a rather flat network topology the storage cluster. was chosen, avoiding the risk of badly 3. Cutting edge supercomputers pick converging routing protocols and ad- up many performance-related soft • The storage cluster needs to be ditional costs for big pipes between ware and firmware improvements perfectly balanced in every aspect switch fabrics. Based on the knowledge that a pure storage cluster would of the hardware and software stack of the behaviour of GPFS as the main not need. to be able to deliver the raw perfor- application, a network layout was in- mance of thousands of disks to the vented that uses four switch chassis Separating out the HPC storage greatly end users. but keeps all high-bandwidth storage reduces the above interdependencies, access traffic local within each of the and also makes it easier to offer site- • The solution needs to be resilient switch fabrics. wide parallel file services to a more het- to faults and manageable. This be- erogeneous landscape of computing and comes ever more critical with the ex- To achieve this, quad-homed GPFS visualization resources. On the other ploding number of components both servers were deployed that offer direct hand, designing the networking infra- on the compute and the storage side. storage access local at each switch fab- structure around it then becomes more ric. The high number of Blue Gene/P

Figure 1: One of the four Force10 E1200i 10 GbE switches

62 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 63 While each alternative had some ad- to be the ideal solution for JSC’s Blue 3. The requirement of a seamless 2. $HOME filesystems, with daily vantages and some disadvantages, Gene/P and POWER6 based storage migration from the initial JUST stor- incremental backups. especially in the area of new upcoming cluster deployment. The traffic pat- age cluster installed in 2007 (and DataCenter Ethernet standards like terns observed in production mode via the goal to re-use parts of its hard- 3. $ARCH filesystems, with daily PerPriority Flow Control (IEEE802.1Qb) JSC’s RMON and SFlow/Netflow based ware, in particular the SATA disks). incremental backups and migration and Multichassis Etherchannel, at the monitoring match the assumptions that to tape storage through a combina- end the major decision criteria were were used as a base for this specific As in the initial JUST cluster, the GPFS tion of the IBM GPFS Information the proven stable functionality of the tailored network solution. metadata is stored on dedicated meta- Lifecycle Management (ILM) features existing single E1200 chassis for just data building blocks, with two IBM and IBM TSM/HSM. the functions needed for JSC’s network Using Ethernet as the base technol- DS5300 storage subsystems and 15k setup in combination with the assured ogy is a perfect fit for heterogeneous FC disks. Metadata is protected both by Figure 4 depicts the user view of the availability of the desired hardware in aspects like integrating the Nehalem- Raid1 arrays and by GPFS replication. GPFS filesystems on the expanded sync with the overall Blue Gene/P and based JUROPA cluster, and intercon- JUST cluster. Note that to cope with storage cluster deployment timeline. necting with different campus and proj- Several combinations of servers and the performance impact of the in- ect networks as well as visualization disk storage subsystems have been creased number of files resulting from Using the knowledge of Force10’s mod- solutions. evaluated for the GPFS data building an 5x enlarged Blue Gene/P system, ule and ASIC design in combination with block. For best price/performance, the number of $ARCH filesystems has Systems the known GPFS traffic patterns, JUST GPFS Storage Design nearline SATA disks are used and pro- been increased from one to three and Systems servers and BG/P I/O nodes are con- and Usage Model tected as 8+2P Raid6 arrays. The users are distributed across those. nected to the linecard ports in a layout The design of the hardware building resulting GPFS data building block is that minimizes the congestion prob- blocks for the expanded JUST storage shown in Figure 3: ability on the 4x over-subscribed lin- cluster is driven by three factors: ecards (which are required to achieve • 3 x GPFS-DATA servers p6-570, the large portcounts). At the end of 1. The target of a sustained aggregate each: 2 x CEC, 8 core, 8 x dual-port the multi-stage deployment phase and bandwidth of 66 GByte/s. FC4 adapters, (8 ports used), some additional network optimizations 4 x 10 GigE link into BG/P network (introduction of Jumbo frames, deploy- 2. The requirement to use quad-homed ment of end-to-end flow control, tuning GPFS/NSD servers with sufficient • 2 x DS5300 storage subsystems, of network options and buffers), the bandwidth to drive the four 10 GbE each: 16 x FC host ports (12 ports chosen network design has proven network fabrics. 
used), 24 x EXP5000, each @ 16 x SATA drives (5/6 are 1 TB drives, 1/6 are 500 GB drives reused from JUST-1)

To achieve the desired aggregate GPFS bandwidth, eight of these building blocks are deployed providing a total of 4.2 PByte usable capacity. This stor- age is partitioned into three classes of filesystems with different usage char- acteristics:

1. A large scratch filesystem $WORK with roughly half the capacity and bandwidth, not backed up or archived. Blue Gene/P production jobs should generally use this filesystem.

Figure 3: One JUST Building Block ($WORK uses four, $HOME and $ARCH share the other four building blocks)

Figure 2: JSC Network Overview

64 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 65 GPFS Performance Aspects During the deployment phase, a num- To reach the maximum performance ber of these well-known best practices QPACE of a GPFS cluster, a number of “GPFS have been re-visited. Some of the Golden Rules” need to be followed: design and configuration choices that QPACE (Quantum Chromodynam- and QPACE was identified as one of were optimal with the 16-rack Blue ics Parallel Computing on the Cell) the advanced prototypes. In order to • Performance primarily depends on Gene/P and generation-2007 storage is a massively parallel and scalable extend the range of applications for the number of disks within the file hardware are no longer appropriate for computer architecture optimized for QPACE beyond LQCD, an implementa- system (disks do not get faster any- a 72-rack Petaflop Blue Gene/P and Lattice Quantum ChromoDynamics tion of the high-performance LINPACK more). generation-2009 storage hardware. (LQCD). It has been developed in co- (HPL) benchmark was a first step. operation between several academic Here, different from QCD applications, • The hardware configuration should On the system level, these invaluable institutions (SFB TR 55 under the lead- the transfer of large messages and be symmetric and all system compo- lessons learned (during installation, ership of the University of Regensburg) collective operations must be sup- nents must be balanced. from early adopters and studies by the and the IBM Deutschland Research and ported efficiently. This required exten- JSC application support team) have Development GmbH. sions of both network communication • Settings of the GPFS parameters already been incorporated into the and the software stack. like the filesystem blocksize must be production setup. On the application At JSC, a 4-rack QPACE system is suitable and should match the physi- level, efforts are ongoing to optimize installed. The building block is a node A special FPGA bit-stream and a Systems cal layout. the applications I/O patterns so they card comprising an IBM PowerXCell 8i library of MPI functions for message Systems scale on Petaflop systems to the same processor and a custom FPGA-based passing between compute nodes has • Changed characteristics from one degree that the MPI communication network processor. 32 node cards are been developed at JSC in co-operation generation of storage HW to the scaling has reached. mounted on a single backplane and with IBM. The HPL benchmark pro- next (e.g. DS4000 to DS5000) eight backplanes are arranged inside duced excellent results (43.01 TFlop/s should be reflected in the GPFS one rack, hosting a total of 256 node on 512 nodes) and the 4 rack system configuration. cards. The closed node card housing, in Jülich corresponds to an aggregate which is connected to a liquid-cooled peak performance of more than • Changed usage patterns (e.g. an cold plate, acts as a heat conductor 100 TFlop/s. 5x increase in the number of files thus making QPACE a highly energy- • Olaf Mextorf1 generated) will also have an impact efficient system. Providing 723 MFlop/s/W QPACE • Willi Homberg • Ulrike Schmidt1 on the filesystem performance that was recognized as the most energy- • Lothar needs to be considered. 
To remove the generated heat a efficient supercomputer worldwide and Jülich Wollschläger1 cost-efficient liquid cooling system ranked top in the Green500 list at the Supercomputing • Michael Hennecke2 has been developed, which enables Supercomputing Conference 2009 in Centre, • Karsten Kutzer2 high packaging densities. The maxi- Portland, Oregon. Forschungs- mum power consumption of one zentrum Jülich 1 Jülich QPACE rack is about 32 kW. The 3D- Supercomputing torus network interconnects the node Centre, cards with nearest-neighbor commu- Forschungs- nication links driven by a lean custom zentrum Jülich, protocol optimized for low latencies. For the physical layer of the links 2 IBM Deutschland 10 Gigabit Ethernet PHYs are used GmbH providing a bandwidth of 1 GB/s per link and direction.

In the context of PRACE work-package 8, promising technologies for future supercomputers have been evaluated

Figure 4: User view of the JUST GPFS filesystems and the connected Supercomputers at JSC


Leibniz Supercomputing Centre of Research in HPC is carried out in colla- Compute servers currently operated by LRZ are the Bavarian Academy of Sciences and boration with the distributed, statewide Humanities (Leibniz-Rechenzentrum, LRZ) Competence Network for Technical and Peak ­Performance provides comprehensive services to Scientific High-Performance Computing User scientific and academic communities by: in Bavaria (KONWIHR). System Size (TFlop/s) Purpose ­Community

• giving general IT services to more Contact: SGI Altix 4700 9,728 Cores 62.3 Capability German “HLRB II” 39 TByte ­Computing Universities than 100,000 university customers Leibniz Supercomputing Centre Intel IA64 and Research in Munich and for the Bavarian 19 x 512-way Institutes, Academy of Sciences Prof. Dr. Arndt Bode DEISA Boltzmannstr. 1 • running and managing the powerful 85478 Garching near Munich communication infrastructure of the Germany Centres Centres Munich Scientific Network (MWN) Linux-Cluster 4,438 Cores 36.3 Capacity Bavarian Phone +49 - 89 - 358 - 31- 80 00 Intel Xeon 9.9 TByte Computing and Munich • acting as a competence centre [email protected] EM64T/ Universities, AMD Opteron D-Grid, for data communication networks www.lrz.de 2-, 4-, 8-, 16-, LCG Grid 32-way • being a centre for large-scale archiving and backup, and by

• providing High-Performance SGI ICE 512 Cores 5.2 Capability Bavarian Intel Nehalem 1.6 TByte Computing Universities, Computing resources, training 8-way PRACE and support on the local, regional and national level.

SGI Altix 4700 384 Cores 2.5 Capability Bavarian & 1 TByte Computing Universities SGI Altix 3700 BX2 Intel IA64 128 & 256- way

Linux Cluster 220 Cores 1.3 Capacity Bavarian Intel IA64 1.1 TByte ­Computing and Munich 2, 4- and Universities 8-way

View of “Höchstleistungsrechner in Bayern HLRB II“, an SGI Altix 4700 A detailed description can be found on LRZ’s web pages: www.lrz.de/services/compute (Photo: Kai Hamann, produced by gsiCom)

68 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 69 Based on a long tradition in supercom- the University of Heidelberg in the hkz-bw Compute servers currently operated by HLRS are puting at Universität Stuttgart, HLRS (Höchstleistungsrechner-Kompetenz­ was founded in 1995 as a federal Cen- zentrum Baden-Württemberg). Peak ­Performance tre for High-Performance Computing. User HLRS serves researchers at universities Together with its partners HLRS System Size (TFlop/s) Purpose ­Community and research laboratories in Germany ­provides the right architecture for the and their external and industrial part- right application and can thus serve NEC Hybrid 12 16-way nodes 146 Capability German Architecture SX-9 with 8 TByte Computing Universities, ners with high-end computing power for a wide range of fields and a variety of main memory Research engineering and scientific applications. user groups. + 5,600 Intel Institutes Nehalem cores and Industry, 9 TB memory and D-Grid Operation of its systems is done to­ Contact: 64 NVIDIA Tesla gether with T-Systems, T-Systems sfr, Höchstleistungs­rechenzentrum S1070 and Porsche in the public-private joint ­Stuttgart (HLRS) Centres IBM BW-Grid 3,984 Intel 45.9 Grid D-Grid Centres venture hww (Höchstleistungsrechner Universität Stuttgart Harpertown cores Computing Community für Wissenschaft und Wirtschaft). 8 TByte memory Through this co-operation a variety of Prof. Dr.-Ing. Dr. hc. Michael M. Resch systems can be provided to its users. Nobelstraße 19 Cray XT5m 896 AMD 9 Technical BW Users Shanghai cores Computing and Industry 70569 Stuttgart 1.8 TByte memory In order to bundle service resources in Germany the state of Baden-Württemberg HLRS AMD Cluster 288 AMD cores 3.7 Technical Research has teamed up with the Computing Cen- Phone +49 - 711- 685 - 8 72 69 1.6 TByte memory Computing Institutes tre of the University of Karlsruhe and [email protected] and Industry the Centre for Scientific Computing of www.hlrs.de A detailed description can be found on HLRS’s web pages: www.hlrs.de/systems

View of the HLRS BW-Grid IBM Cluster (Photo: HLRS) View of the HLRS hybrid NEC supercomputer (Intel Nehalem / NVIDIA S1070 / SX-9) (Photos: HLRS)

70 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 71 The Jülich Supercomputing Centre (JSC) Implementation of strategic support Compute servers currently operated by JSC are at Forschungszentrum Jülich enables infrastructures including community-

scientists and engineers to solve grand oriented simulation laboratories and Peak ­Performance challenge problems of high complexity in cross-sectional groups on mathemati- User science and engineering in collaborative cal methods and algorithms and par- System Size (TFlop/s) Purpose ­Community infrastructures by means of supercom- allel performance tools, enabling the puting and Grid technologies. effective usage of the supercomputer IBM 72 racks 1,002.6 Capability European Blue Gene/P 73,728 nodes ­computing Universities resources. “JUGENE” 294,912 processors and Research Provision of supercomputer resources PowerPC 450 Institutes, of the highest performance class for Higher education for master and 144 TByte memory PRACE, DEISA projects in science, research and indus- doctoral students in cooperation e.g.

try in the fields of modeling and comput- with the German Research School for Intel 2,208 SMT nodes with 207 Capacity European er simulation including their methods. Simulation Sciences, a joint venture of Linux CLuster 2 Intel Nehalem-EP and Universities, “JUROPA” quad-core 2.93 GHz Capability Research Centres The selection of the projects is per- Forschungszentrum Jülich and RWTH Centres processors each Computing Institutes formed by an international peer-review Aachen University. 17,664 cores and Industry, procedure implemented by the John von 52 TByte memory PRACE, DEISA Neumann Institute for Computing (NIC), Contact: Intel 1,080 SMT nodes with 101 Capacity EU Fusion a joint foundation of Forschungszentrum Jülich Supercomputing Centre (JSC) Linux CLuster 2 Intel Nehalem-EP and Community Jülich, Deutsches Elektronen-Synchro- Forschungszentrum Jülich “HPC-FF” quad-core 2.93 GHz Capability tron DESY, and GSI Helmholtzzentrum processors each Computing 8,640 cores für Schwerionenforschung. Prof. Dr. Dr. Thomas Lippert 25 TByte memory 52425 Jülich Supercomputer-oriented research Germany IBM 1,024 PowerXCell 100 Capability QCD Cell System 8i processors Computing applications and development in selected fields “QPACE” 4 TByte memory SFB TR55, of physics and other natural sciences Phone +49 - 24 61- 61- 64 02 PRACE by research groups of competence in [email protected] supercomputing applications. www.fz-juelich.de/jsc

IBM 35 Blades 7 Capability German Cell System 70 PowerXCell 8i computing Research “JUICEnext” processors School 280 GByte memory

AMD 125 compute nodes 2.5 Capability EU SoftComp Linux Cluster 500 AMD Opteron computing Community “SoftComp” 2.0 GHz cores 708 GByte memory

AMD 44 compute nodes 0.85 Capacity and D-Grid Linux Cluster 176 AMD Opteron capability Projects “JUGGLE” 2.4 GHz cores computing 352 GByte memory

View on the supercomputers JUGENE, JUST (storage cluster), HPC-FF and JUROPA in Jülich (Photo: Forschungszentrum Jülich)

NIC Symposium 2010 in Jülich
17th EuroMPI Conference at HLRS

The John von Neumann Institute for • Education and training in all areas EuroMPI 2010 will be held at Stuttgart, cussed at the HLRS. Furthermore the Computing (NIC) was established in of supercomputing by symposia, Germany on September 12 - 15, 2010. key questions of scalability will be ad- 1998 by Forschungszentrum Jülich workshops, summer schools, HLRS has a long tradition in MPI and dressed. Handling one million cores is and Deutsches Elektronen-Synchrotron seminars, courses and guest pro- has contributed to the definition of the a task that is beyond currently available DESY to support the supercomputer- grammes for scientists and students. standard and to the proliferation of implementations and the workshop will oriented simulation in natural and engi- MPI over the years. It was therefore be a good chance to discuss new ideas neering sciences. In 2006, GSI Helm- Every two years the NIC organizes a only natural to accept an invitation and strategies. The link to industry will holtzzentrum für Schwerionenforschung symposium to give an overview of the to host the next EuroMPI meeting at also be given in the planned social event. joined NIC as a contract partner. The activities and results obtained by the Stuttgart. The 17th European MPI The Mercedes Museum at Stuttgart is core task of NIC is the peer-reviewed NIC projects and research groups in the Users’ Group Meeting will be a forum the perfect place to get an overview of allocation of supercomputing resources last two years. The fifth NIC Symposium for users and developers of MPI and more than 100 years of automobile his- to computational science projects from took place at the Forschungszentrum other message passing programming tory embedded in the overall world his- Activities Germany and Europe. The NIC partners Jülich from February 24 - 25, 2010. environments. Through the presenta- tory. This visit will bring us close to the Activities support supercomputer-aided research It was attended by more than one hun- tion of contributed papers, poster pre- industrial heart of Stuttgart that beats in science and engineering through a dred scientists that had been using the sentations and invited talks, attendees for cars and everything automotive three-way strategy: supercomputers JUMP, JUROPA, and will have the opportunity to share ideas engineers can imagine. JUGENE in Jülich, and the APE topical and experiences to contribute to the • Provision of supercomputing computer at DESY in Zeuthen in their improvement and furthering of message Invited Speakers: resources for projects in science research. Fifteen talks and a large passing and related parallel program- Jack Dongarra, UTK, USA September research and industry poster session covered selected top- ming paradigms. With HLRS focusing on Bill Gropp, UIUC, USA ics. To accompany the conference, an industrial co-operations and applications Rolf Hempel, DLR, Germany 12-15, 2010 • Supercomputer-oriented research extended proceedings volume was pub- EuroMPI will aim at bringing science Jesus Labarta, BSC, Spain and development by NIC research lished that can also be viewed online at and industry in the field closer together. Jan Westerholm, AAU, Finnland groups in selected fields of physics The potential of MPI for industrial ap- Join us! and natural sciences. www.fz-juelich.de/nic/symposium plications will be one of the issues dis- See also: www.eurompi2010.org

Participants of the NIC Symposium 2010
Schlossplatz Stuttgart at night

74 Spring 2010 • Vol. 8 No. 1 • inSiDE Spring 2010 • Vol. 8 No. 1 • inSiDE 75 automated storage technology used to • Storage virtualization – Consistent Storage Consortium at HLRS provide long-term storage for even the step in protecting virtual infrastruc- largest data sets (Petabyte-Archiving, tures On March 11, 2010, the 7th Confer- data center operator. After Backup). ence for IT Professionals and Data Cen- Leibnitz Rechenzentrum, DESY, Max- • Efficient data management through ter Operators was held on the grounds Planck Institute for Extraterrestrial The conferences of the Storage Consor- file virtualization in the data center of the University of Stuttgart in the Physics and other exciting conference tium are strongly focused on technol- large lecture hall of the HLRS. This con- locations, here at HLRS over 100 par- ogy, planning, and application require- • Storage consolidation and ference was designed and organized by ticipants have given us a clear confirma- ments of most advanced virtualization centralization of desktop services the Storage Consortium – information tion by ways of their positive feedback and storage architectures and are with WAN optimization and communication platform of leading on the symposium that we do meet the targeted at managers and operators of IT manufacturers operating in the needs of IT-/Storage representatives. data centers, storage administrators, • Use of scenarios and reports for German-speaking market. The event was Therefore, another conference in 2010 IT managers and architects, and appli- data deduplication held under the slogan "Energy Efficiency, is still in the pipeline”. cation server administrators, but also Virtualization and Cloud Computing”. controllers and technical buyers. Typical • Convergence of server, network and The event was supported by the follow- concerns of users, such as the planning storage, benefits of virtualization in According to the Managing Director ing partners of the Storage Consortium: and selection of modern storage archi- the storage environment and user Activities and initiator of the Storage Consortium, Riverbed, Cisco Systems, NetApp, tectures, are of particular importance. motivations Activities Mr. Norbert Deuschle, the one-day Quantum, Bull, Symantec, F5 Networks, The overall motto is "from user to user," conference turned out to be a great HP StorageWorks, 3PAR and Fujitsu and the contents of the presentations • Solutions for cloud computing success, especially due to the dedicated Technology Solutions. For user pre- are only partially provided by the manu- support by both the management and sentations (practical relevance), the facturers. This is to ensure practical • Virtualized storage resources as a staff involved at HLRS. Mr. Deuschle companies Sonepar Germany, Freuden- relevance as much as possible which is basis for cloud computing in modern quote: “Thanks to the willingness and berg IT, T-Systems International and, further improved by the informal way data centers. proactive support by Prof. Dr. Michael of course, the HLRS data center itself the various participants are exchanging Resch, Dipl.-Ing. Peter W. Haas as well could be won. information between each other. Inquiries to the Storage Consortium, as Dr. Thomas Bönisch of HLRS with his please contact: Mr. Norbert E. 
Deuschle, exciting keynote speech, we could now Due to its IT infrastructure, HLRS is The agenda of the Storage Consortium [email protected] carry out our seventh successful user very well suited to host such an event, meeting at HLRS on March 11, 2010, conference on the grounds of a large in particular, of course, in the areas did include the following current topics: See also: www.storageconsortium.de of storage networks (10-/40-Gbit/s Ethernet) as well as systems (disk, • Virtualization solutions as a basis tape). During the lunch break, visitors for cloud computing in modern data to the conference had the opportunity centers of going on a total of four tours in order to convince themselves. In addition to • Solutions in the field of storage the large data center with NEC Super systems and file system manage- Computing and storage systems, espe- ment at HLRS cially the data management aspects and the Storage Management (File Manage- • Capacity optimized backups of ment, HSM) found special attention. virtualized server environments, This has to be seen, of course, in the data deduplication and virtualized context of ever-increasing capacities of file management unstructured data to companies. In this context, quite naturally, visitors were • Virtualized Data Center highly interested in having a close look Architectures with standardized, at HLRS' comprehensive disk and tape virtualized IT Services

C²A²S²E and JSC Co-operate to Achieve "Digital Aircraft"

Computer simulations are a key technology for the engineering as well as the natural sciences. Nowadays, numerical simulations are an indispensable ingredient of all aeronautical research activities, since they allow insights which would be impossible with theory and experiment alone. Simulations thus promise to reduce development risks and to shorten development cycles. It is expected that the next generation of aircraft will be developed and tested completely in the computer ("digital aircraft").

The Center for Computer Applications in AeroSpace Science and Engineering (C²A²S²E), founded by DLR and Airbus with support from the European Union and the federal state of Niedersachsen, is a well-known competence centre for numerical aircraft simulations. Besides its activities on improving the modelling of the physical processes and the numerical algorithms, it operates, as a topical computer centre, the most powerful parallel cluster for aeronautics research in Europe. In particular, the CFD code TAU is developed, maintained and applied in projects by C²A²S²E for its customers in the European aircraft industry.

But the computing resources available at a topical centre are not sufficient to simulate a virtual aircraft with all its multidisciplinary interactions. Therefore C²A²S²E needs access to capability computers at the top level of the European HPC ecosystem, such as JSC's JUGENE and JUROPA. Furthermore, the efficiency and scalability of the simulation software, i.e. the TAU code, have to be improved to allow calculations using tens to hundreds of thousands of cores. The necessary optimization and adaptation of this research code will be supported by supercomputing experts from JSC. The formal basis for the common research activities is the co-operation agreement "Numerical simulations for aeronautical research on supercomputers" signed by Deutsches Zentrum für Luft- und Raumfahrt (DLR) and Forschungszentrum Jülich in January 2010. As a first step, the TAU code has been implemented on JUGENE and JUROPA, and successful benchmarks were performed on up to several thousand cores. In order to intensify this collaboration and to achieve the "digital aircraft", JSC and C²A²S²E will in future work together on:

• Optimization of simulation software for up-to-date computer architectures, e.g. multi-core architectures or hybrid systems using GPUs, FPGAs, etc.
• Development of scalable CFD methods for Petaflop systems
• Evaluation of future trends in computer/cluster architectures
• Efficient usage of HPC resources
• Training of students and young scientists
• Initiation of common research projects in HPC

Large-scale Projects at GCS@FZJ

Two calls for large-scale projects were issued by the Gauss Centre for Supercomputing (GCS) in 2009, one in April and one in July. Projects are classified as "large-scale" if they require more than 5% of the available processor core cycles on each member's high-end system. In the case of the petaflop supercomputer JUGENE in Jülich, this amounts to 24 rack-months for a one-year application period. At its meeting in June, the NIC Peer Review Board – in collaboration with the review boards of HLRS and LRZ – decided to award the status of large-scale project to two projects: one from the field of fluid dynamics (Prof. N. Peters, RWTH Aachen: "Geometrical Properties of Small Scale Turbulence") and one from elementary particle physics (Dr. K. Jansen, DESY Zeuthen: "QCD Simulations with Light, Strange and Charm dynamical Quark Flavours").

In October, four more projects could be awarded that status: again one from the field of fluid dynamics (Dr. J. Harting, Universität Stuttgart: "Lattice Boltzmann simulation of emulsification processes in the presence of particles and turbulence"), and three from elementary particle physics (Prof. Z. Fodor, Universität Wuppertal: "Lattice QCD with 2 plus 1 flavours at the physical mass point"; Prof. G. Schierholz, DESY: "Hadron Physics at Physical Quark Masses on Large Lattices"; Prof. S. Katz, Eötvös University, Budapest, Hungary: "QCD Simulations with Light, Strange and Charm dynamical Quark Flavours").

Altogether there are currently six large-scale scientific projects working on JUGENE, with an allocation of more than 40% of the available CPU cycles.

HLRB and KONWIHR Results and Review Workshop

On December 8 and 9, 2009, the Leibniz Supercomputing Centre (LRZ) hosted the "HLRB and KONWIHR Results and Review Workshop". On the one hand, this workshop acts as a review of the scientific research performed on the HLRB II supercomputer in the years 2008 and 2009; on the other hand, it served as a forum for the exchange of experiences between users of different projects facing similar problems, as well as between users and LRZ's HPC support staff.

The workshop was attended by more than 100 participants, among them scientists and researchers from all over Germany, members of the steering committee, and LRZ's HPC support staff. The workshop featured an extraordinary interdisciplinarity, since the covered research areas included computer science, mechanical engineering (predominantly concerned with fluid-dynamics problems), astrophysics, high-energy physics, geosciences, condensed matter physics (including solid-state as well as plasma physics), chemistry, and bio sciences. During the two days, LRZ could offer 42 projects the opportunity to present their results in talks. The talks were divided into two parallel sessions with 10 talks per day; additionally, on each of the two days a plenary talk was given. Due to the considerable interdisciplinarity of the workshop, the speakers had been asked to prepare their talks in such a manner that even non-specialists could follow them to a large extent. This guideline was followed by the speakers in an outstandingly impressive way, leading not only to interesting talks on a high scientific level, but also to fruitful discussions between participants from the various disciplines. The interdisciplinarity was most evident during the plenary talks, which – following their nature – were attended by all workshop participants, yielding an audience of even broader diversity in disciplines than was faced by the speakers in the parallel sessions.

On the first day of the workshop, Bernd Brügmann (an astrophysicist from the University of Jena) spoke on "Black Holes and Gravitational Waves". On the second day, Sven Schmid (working at NASA in a joint project of NASA and the University of Stuttgart) gave a talk on "Characterization of the Aeroacoustic Properties of the SOFIA Cavity and its Passive Control". Both speakers deserve special thanks for their outstanding talks, as they managed to give a general overview on the one hand, while presenting the challenging problems faced in their projects on the other.

Apart from the talks, LRZ offered the possibility to put up posters, which was taken up by another dozen projects, resulting in active discussions among the workshop participants. The participants submitted seventy contributions for the workshop proceedings. Currently, these contributions are being reviewed by referees; presumably, the proceedings will be published in spring 2010.

Since the end of 2009, LRZ has been actively preparing the procurement of the successor system to the HLRB II supercomputer. As the new system will provide Petascale performance, the workshop also gave its participants the unique opportunity to present new challenging projects and their requirements on future hardware and software. LRZ for its part sketched its present plans for this procurement.

LRZ would like to thank the participants of the workshop for the important and interesting research performed on and justifying the HLRB II supercomputer, for their fruitful and comfortable collaboration with LRZ's HPC support staff, and for their active participation in the workshop.

German-Israeli Co-operation: 24th Umbrella Symposium

JSC hosted the "Umbrella Symposium for the Development of Joint Co-operation Ideas", an annual event run jointly by Technion Haifa, RWTH Aachen University and Forschungszentrum Jülich with changing topics since 1984. This year's symposium took place from January 18 - 20, 2010, during which around 50 scientists attended more than 20 presentations and follow-up discussions on the topic of "Modelling and Simulation in Medicine, Engineering and Sciences". This wide-ranging meeting included contributions on surface science, medical engineering, biophysics, nanoelectronics and virtual reality, but with a common emphasis on numerical modelling.

An important aspect of the symposium is to give attendees the opportunity to build new co-operations in their particular field: a call for Umbrella travel grants for young scientists will be published in spring this year to support a selected number of promising ideas arising from the symposium; eight such potential partnerships were already proposed by the end of the meeting. The next symposium on the same topic will be hosted by RWTH Aachen in early 2011.

Blue Gene Extreme Scaling Workshop 2009

From October 26 - 28, 2009, JSC organized the 2009 edition of its Blue Gene Scaling Workshop. This time, the main focus was on application codes able to scale up during the workshop to the full Blue Gene/P system JUGENE, which consists of 72 racks with a total of 294,912 cores – the highest number of cores available worldwide in a single system.

Interested application teams had to submit short proposals, which were evaluated with respect to the required extreme scaling, the application-related constraints which had to be fulfilled by the JUGENE software infrastructure, and the scientific impact that the codes could produce. Surprisingly, not just a handful of proposals were submitted, as the organizers had expected; instead, ten high-quality applications could be selected, among them two 2009 Gordon Bell Prize finalists. Applications came from Harvard University, Massachusetts Institute of Technology, Rensselaer Polytechnic Institute, Argonne National Laboratory, the University of Edinburgh, the Swiss Institute of Technology (Lausanne), Instituto Superior Técnico (Lisbon), as well as from RZG (MPG), DESY Zeuthen and JSC.

During the workshop, the teams were supported by four JSC parallel application experts, the JUGENE system administrators and one IBM MPI expert; however, the participants shared a lot of expertise and knowledge among themselves, too. Every participating team succeeded in submitting one or more successful full 72-rack jobs during the course of the workshop, executing many smaller jobs along the way to investigate the scaling properties of their applications. A total of 398 jobs were launched, using 135.61 rack-days of the total 169.3 rack-days reserved for the workshop. This is an 80% utilization, which is astonishing, especially considering that the system was only partially available on the third day due to necessary hardware repairs: IBM experts repaired half of the machine at a time while the other half was still used for the workshop.

Many interesting results were achieved. For example, the Gordon Bell team of Rensselaer Polytechnic Institute focused on a strong scaling study of their implicit, unstructured-mesh-based flow solver. An adaptive, unstructured mesh with 5 billion elements was considered in this study to model pulsatile transition to turbulence in an abdominal aortic aneurysm geometry that was acquired from image data of a diseased patient. At full system scale, they obtained an overall parallel efficiency of 95% relative to the 16-rack run, which is considered the base of their strong scaling study.

The Lisbon team investigated OSIRIS, a fully relativistic electrodynamic particle-in-cell code written at IST and UCLA. Their goal was to determine the strong scaling results for two different test cases: the first test case ("warm") simulates the evolution of one single, hot species, while the second case ("collision") simulates two counter-streaming species. They were able to obtain strong scaling results up to 294,912 cores with an efficiency of 81% (warm) and 72% (collision) compared to their baseline of 4,096 CPU cores. They also prepared a set of production runs to simulate, for the first time, a complete astrophysical plasma collision in three dimensions. Due to the limited availability of CPU time, these simulations will be the subject of future work. Finally, they have defined a work plan to integrate JSC's SION library into OSIRIS.

All three QCD teams (MIT, Edinburgh, DESY) reported linear scaling of their codes to the full system.

Finally, JSC's program MP2C, which couples MD and mesoscopic hydrodynamics, was scaled up to nearly the full capacity of JUGENE, i.e. 262,144 cores (64 racks), as the present version requires a power-of-two number of processes (P = 2^n). After some adjustments of the message passing protocols, good scaling behaviour was observed. Particle numbers up to N = 8×10^10 were chosen. In terms of memory requirements, about 10 times more particles could have been used; however, since parallel I/O capabilities were also investigated using SIONlib, benchmarking was limited to this particle number: a single restart configuration already produced about 5 TBytes, for which an I/O performance of 8 GB/s for writing and 16 GB/s for reading was measured.

All experiences gained during the workshop are summarized in the technical report FZJ-JSC-IB-2010-02, available at http://www.fz-juelich.de/jsc/docs/autoren2010/mohr1/. On March 22 - 24, 2010, JSC organized a follow-up Blue Gene Extreme Scaling Workshop; a more detailed report about this event will be published in the next edition of inSiDE.
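The scaling and utilization figures quoted above follow the usual strong-scaling conventions; since the report does not spell them out, the short summary below is an editorial addition based on the standard definition of relative parallel efficiency (fixed problem size, runtime T measured on N cores against a baseline run).

% Relative parallel efficiency with respect to a baseline run on
% N_ref cores with runtime T_ref (strong scaling, fixed problem size):
\[
  E(N) = \frac{N_{\mathrm{ref}}\,T_{\mathrm{ref}}}{N\,T(N)}
\]
% e.g. the quoted 95% compares the 72-rack run of the Rensselaer solver
% with its 16-rack baseline, and the OSIRIS values of 81%/72% compare the
% 294,912-core runs with the 4,096-core baseline. The workshop utilization
% is simply the ratio of consumed to reserved machine time:
\[
  \frac{135.61\ \text{rack-days}}{169.3\ \text{rack-days}} \approx 0.80
\]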

The 12th HLRS-NEC Teraflop Workshop

The 12th Teraflop Workshop, held on March 15 and 16, 2010, at HLRS, continued the Teraflop Workshop series initiated by NEC and HLRS in 2004. Once again, the covered topics spanned an arc from leading-edge operating system development to the needs and results of real-life applications.

The opening talk was given by Ryusuke Egawa of Tohoku University in Sendai, Japan, who gave deeper insight into the possibilities of vector computers in meta-computing environments. The following two talks addressed the SX-Linux project [1]. The first talk in the second session was given by Jochen Buchholz from HLRS, who discussed the aims and first ideas of the TIMACS project. He was followed by Georg Hager from RRZE Erlangen, who showed 13 ways to sugarcoat the fact that an application is not performing well on modern supercomputers.

The three talks given during the next session by Ulrich Rist from the IAG, University of Stuttgart, Hans Hasse from the University of Kaiserslautern and Matthias Meinke from the Aerodynamic Institute of RWTH Aachen targeted real-life applications from CFD and molecular dynamics. The afternoon session was shared by Thomas Soddemann of Fraunhofer SCAI, who raised the question whether GPU computing is affordable supercomputing, and Peter Berg from the IMK in Karlsruhe, who presented the work on ensemble simulations done at his department.

The second day started with a talk given by Wolfgang Ehlers from the Institute for Mechanics, University of Stuttgart. He presented various computational issues which have to be resolved when moving from particle dynamics to multiphase continuous media. The next two talks were given by Philip Rauschenberger from the ITLR, University of Stuttgart, and Arne Nägel from the GCSC, University of Frankfurt, who both gave an overview of the applications at their institutes which are related to large-scale computers. In the next session, Ralf Schneider, HLRS, presented the efforts necessary when dealing with direct mechanical simulations. Another research topic which poses a similar challenge is the coupling of multi-scale, multi-physics applications, which was discussed by Sabine Roller from the German Research School for Simulation Sciences, Aachen.

The workshop closed with a session in which distributed multi-scale simulations for medical physics applications and film cooling investigated by large-eddy simulation were discussed by Jörg Bernsdorf from the German Research School for Simulation Sciences, Aachen, and Lars Gräf from ETH Zürich, respectively.

References
[1] SX-Linux project: https://gforge.hlrs.de/projects/sxlinux/
[2] Teraflop Workbench: http://www.teraflop-workbench.de/

Publications in Computational Science and Engineering

Nagel, W. E.; Kröner, D. B.; Resch, M. M. (Eds.): High Performance Computing in Science and Engineering '09, Transactions of the High Performance Computing Center Stuttgart, 1st Edition, 2010, XII, 551 p., Hardcover, ISBN: 978-3-642-04664-3

This book presents the state of the art in simulation on supercomputers. Leading researchers present results achieved on systems of the HLRS for the year 2009. The reports cover all fields of computational science and engineering, ranging from CFD via computational physics and chemistry to computer science, with a special emphasis on industrially relevant applications. Presenting results for both vector systems and microprocessor-based systems, the book allows comparing the performance levels and usability of various architectures. As HLRS operates the largest NEC SX-8 vector system in the world, this book gives an excellent insight into the potential of vector systems. The book covers the main methods in high performance computing, and its outstanding results in achieving the highest performance for production codes are of particular interest for both scientists and engineers.

Resch, M. M.; Roller, S.; Benkert, K.; Galle, M.; Bez, W.; Kobayashi, H. (Eds.): High Performance Computing on Vector Systems 2009, 2010, XIII, 250 p., Hardcover, ISBN: 978-3-642-03912-6

The book presents the state of the art in high performance computing and simulation on modern supercomputer architectures. It covers trends in hardware and software development in general, and specifically the future of vector-based systems and heterogeneous architectures. The application contributions cover computational fluid dynamics, fluid-structure interaction, physics, chemistry, astrophysics, and climate research. Innovative fields like coupled multi-physics or multi-scale simulations are presented. All papers were chosen from presentations given at the 9th Teraflop Workshop held in November 2008 at Tohoku University, Japan, and the 10th Teraflop Workshop held in April 2009 at HLRS, Germany.

Müller, M. S.; Resch, M. M.; Nagel, W. E. (Eds.): Proceedings of the 3rd International Workshop on Parallel Tools for High Performance Computing, September 2009, ZIH, Dresden, 1st Edition, 2010, 200 p., Hardcover, ISBN: 978-3-642-11260-7, to be published in June 2010

As more and more hardware platforms support parallelism, parallel programming is gaining momentum. Applications can only leverage the performance of multi-core processors or graphics processing units if they are able to split a problem into smaller ones that can be solved in parallel. The proceedings of the 3rd International Workshop on Parallel Tools for High Performance Computing provide a technical overview intended to help engineers, developers and computer scientists decide which tools are best suited to enhancing their current development processes.

HLRS Scientific Workshops and Courses – Related to Fidelio?

[Photos 1 - 16: impressions from the social programme accompanying the courses at the participating centres.]

In many of the previous editions of inSiDE, I reported on past and future workshops and courses, on the new "high-productivity" programming paradigm PGAS (partitioned global address space) and the related UPC and CAF courses – but how does this fit with the pictures on these pages? – and also on "old-style" efficient MPI and OpenMP programming. The courses also cover computational fluid dynamics (CFD), iterative solvers, and modern object-oriented paradigms in the latest (2008) Fortran standard. After more than ten years, former participants now send their own new PhD students – still no relation to the pictures!? – and they come from all over Germany (sometimes also from really far away) and have to stay for a week in a boring/nice hotel. Scientific workshops and conferences have exciting social events, but courses? Do participants really prefer the hotel's wireless LAN or a thriller on TV? Should I cite Prof. Chr. Pfeiffer, who teaches that thrillers in the evening significantly reduce the learning effort of the day, and also that men prefer thrillers? As usual in scientific workshops, we believe that also in scientific courses, after a day of intensive study and practical work in hands-on sessions (photo 1), people want to see more than just their hotel room. We try to give them the opportunity to experience together the cities of the High Performance Computing centres.

A guided walk through the city (photo 2) can produce the necessary hunger for a comfortable dinner. In Jülich, the fortress with its several metres thick walls makes a monumental impression at night, and in Garching, on a warm September evening, the "English Garden" invited several participants and lecturers to relax in a beer garden (photo 3). In Stuttgart, after the city tour, Stuttgart's first micro brewery (photo 4) helps to survive (photo 5) the stressful lectures. Some participants also get to visit the first TV tower in the world (photo 6), with its beautiful view over Stuttgart by day and night. Besides the city tour, a tour through the 3D visualization in the CAVE (photo 7) and the computing rooms (photo 8) is always included, together with a special trip in the Porsche simulator (photo 9).

In Dresden, the tour through the famous historical city (photo 10) includes a group photo in front of the "Dresdner Frauenkirche" (photo 11). This year, based on a special offer of the Semper Opera (photos 12, 13), half of the course enjoyed Beethoven's Fidelio with its critical, political, and by now historical stage setting (photo 14), designed in the former "DDR"; the premiere was on October 7, 1989, just 33 days before the Fall of the Wall. Some of the attendees also visited the famous art and jewels in the treasury "Grünes Gewölbe" (photo 15) after an early arrival on Sunday. A really rare event is the coincidence of a course with Stuttgart's "Volksfest" (fair): on a ride with the largest mobile observation wheel, you can enjoy the view from a height of 60 m (photo 16).

This year, the October course at LRZ starts on the last day of Munich's famous "Oktoberfest" … and on the next day the heavy work with theory and practical hands-on sessions starts again – but more relaxed. ☺

Rolf Rabenseifner

2010 – Workshop Announcements

Scientific Conferences and Workshops at HLRS, 2010
12th Teraflop Workshop (March 15 - 16)
9th HLRS/hww Workshop on Scalable Global Parallel File Systems (April 26 - 28)
4th HLRS Parallel Tools Workshop (July, not yet fixed)
17th EuroMPI 2010 (successor of the EuroPVM/MPI series, September 12 - 15)
High-Performance Computing in Science and Engineering – The 13th Results and Review Workshop of the HPC Center Stuttgart (October 4 - 5)
IDC International HPC User Forum (October 7 - 8)

Parallel Programming Workshops: Training in Parallel Programming and CFD
Parallel Programming and Parallel Tools (TU Dresden, ZIH, February 22 - 25)
Introduction to Computational Fluid Dynamics (HLRS, March 15 - 19)
Iterative Linear Solvers and Parallelization (HLRS, March 22 - 26)
Platforms at HLRS (HLRS, March 29 - 30)
Unified Parallel C (UPC) and Co-Array Fortran (CAF) (HLRS, May 18 - 19)
4th Parallel Tools Workshop (not yet fixed)
Parallel Programming with MPI & OpenMP (TU Hamburg-Harburg, July 21 - 23)
Parallel Programming with MPI & OpenMP (CSCS Manno, CH, August 10 - 12)
Introduction to Computational Fluid Dynamics (HLRS or RWTH Aachen, September 6 - 10)
Message Passing Interface (MPI) for Beginners (HLRS, September 27 - 28)
Shared Memory Parallelization with OpenMP (HLRS, September 29)
Advanced Topics in Parallel Programming (HLRS, September 30 - October 1)
Iterative Linear Solvers and Parallelization (LRZ, Garching, October 4 - 8)
Parallel Programming with MPI & OpenMP (FZ Jülich, JSC, November 29 - December 1)
Unified Parallel C (UPC) and Co-Array Fortran (CAF) (HLRS, December 14 - 15)
Introduction to CUDA Programming (not yet fixed)

Training in Programming Languages at HLRS
Fortran for Scientific Computing (March 1 - 5 and October 25 - 29)

URLs:
http://www.hlrs.de/events/
https://fs.hlrs.de/projects/par/events/2010/parallel_prog_2010/

High Performance Computing Courses and Tutorials

GPGPU Programming
Date & Location: June 14 - 16, 2010, LRZ Building, Garching/Munich
Contents: Heterogeneous GPGPU computing promises tremendous acceleration of applications. This programming workshop includes hands-on sessions, application examples and an introduction to CUDA, CAPS, cuBLAS, cuFFT, the Portland Group Fortran compiler, pycuda, and R.
Webpage: http://www.lrz.de/services/compute/courses/#Gpgpu

Visualization of Large Data Sets on Supercomputers
Date & Location: June 29, 2010, LRZ Building, Garching/Munich
Contents: The results of supercomputing simulations are data sets which have grown considerably over the years, giving rise to a need for parallel visualization software packages. The course focuses on the software packages ParaView, VisIt and Vapor and their use to visualize and analyse large data sets generated by supercomputer applications, typically in the fields of CFD, molecular modelling, astrophysics, quantum chemistry and similar. Hands-on sessions featuring example applications are given.
Webpage: http://www.lrz.de/services/compute/courses/#Datavis

Parallel Programming with R
Date & Location: July 6, 2010, LRZ Building, Garching/Munich
Prerequisites: Participants should have some basic knowledge of programming with R.
Contents: R is known as a very powerful language for statistics, but it has also evolved into a tool for the analysis and visualization of large data sets which are typically obtained from supercomputing applications. The course teaches the use of the dynamic language R for parallel programming of supercomputers and features rapid prototyping of simple simulations. Several parallel programming models, including Rmpi, snow, multicore, and gputools, are presented which exploit the multiple processors that are standard on modern supercomputer architectures. Hands-on sessions with example applications are given.
Webpage: http://www.lrz.de/services/compute/courses/#RPar

Data Mining of XML Data Sets Using R
Date & Location: July 14, 2010, LRZ Building, Garching/Munich
Prerequisites: Participants should have some basic knowledge of programming with R, as well as a basic understanding of XML.
Contents: Large data sets are often stored on the web in XML format. Automated access and analysis of these data sets presents a non-trivial task. The dynamic language R is known as a very powerful language for statistics, but it has also evolved into a tool for the analysis and visualization of large data sets. In this course R will be used to perform and automate these tasks and to visualize the results interactively and on the web. The course includes hands-on sessions.
Webpage: http://www.lrz.de/services/compute/courses/#RXML

Scientific High-Quality Visualization with R
Date & Location: July 20, 2010, LRZ Building, Garching/Munich
Contents: R is a powerful software package that can also easily be used for high-quality visualization of large data sets of various kinds, such as particle data, continuous volume data, etc. The range of different plots comprises line plots, contour plots and surface plots up to interactive 3D OpenGL plotting. The course features hands-on sessions with examples.
Webpage: http://www.lrz.de/services/compute/courses/#RVis

Introduction to Parallel Programming with MPI and OpenMP (JSC Guest Student Programme)
Date & Location: August 3 - 5, 2010, JSC, Research Centre Jülich
Contents: The course provides an introduction to the two most important standards for parallel programming under the distributed and shared memory paradigms: MPI, the Message Passing Interface, and OpenMP. While intended mainly for the JSC Guest Students, the course is open to other interested persons upon request.
Webpage: http://www.fz-juelich.de/jsc/neues/termine/parallele_programmierung

Parallel Programming with MPI, OpenMP and PETSc
Dates & Locations: August 10 - 12, 2010, Manno (CH), CSCS; November 29 - December 1, 2010, JSC, Research Centre Jülich
Contents: The focus is on the programming models MPI, OpenMP, and PETSc. Hands-on sessions (in C/Fortran) will allow users to immediately test and understand the basic constructs of the Message Passing Interface (MPI) and the shared memory directives of OpenMP. These two courses are organized by CSCS and JSC in collaboration with HLRS.
Webpage: http://www.fz-juelich.de/jsc/neues/termine/mpi-openmp

Introduction to Computational Fluid Dynamics
Date & Location: September 6 - 10, 2010, RWTH Aachen
Contents: Numerical methods to solve the equations of fluid dynamics are presented. The main focus is on explicit finite volume schemes for the compressible Euler equations. Hands-on sessions will consolidate the content of the lectures. Participants will learn to implement the algorithms, but also to apply existing software and to interpret the solutions correctly. Methods and problems of parallelization are discussed. This course is based on a lecture and practical awarded with the "Landeslehrpreis Baden-Württemberg 2003" and is organized by the German Research School for Simulation Sciences.
Webpage: http://www.hlrs.de/events/

Message Passing Interface (MPI) for Beginners
Date & Location: September 27 - 28, 2010, Stuttgart, HLRS
Contents: The course gives a full introduction to MPI-1. Further aspects are domain decomposition, load balancing, and debugging. An MPI-2 overview and MPI-2 one-sided communication are also taught. Hands-on sessions (in C and Fortran) will allow users to immediately test and understand the basic constructs of the Message Passing Interface (MPI). Course language is English (if required).
Webpage: http://www.hlrs.de/events/
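As an illustration of the kind of program written in the MPI hands-on sessions mentioned in the courses above, the following minimal C sketch shows the basic constructs (initialization, rank and size queries, point-to-point communication). It is an editorial addition, not taken from any course material, and the file name is only illustrative.

/* Minimal MPI sketch: every rank sends its rank number to rank 0,
 * which reports what it received.
 * Build:  mpicc mpi_hello.c -o mpi_hello
 * Run:    mpirun -np 4 ./mpi_hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process number     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

    if (rank == 0) {
        printf("running on %d MPI processes\n", size);
        for (i = 1; i < size; i++) {
            /* receive one integer from any other rank, tag 0 */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            printf("rank 0 received %d from rank %d\n",
                   value, status.MPI_SOURCE);
        }
    } else {
        /* send my own rank number to rank 0, tag 0 */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}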

Shared Memory Parallelization with OpenMP
Date & Location: September 29, 2010, Stuttgart, HLRS
Contents: This course teaches shared memory OpenMP parallelization, the key concept on hyper-threading, dual-core, multi-core, shared memory, and ccNUMA platforms. Hands-on sessions (in C and Fortran) will allow users to immediately test and understand the directives and other interfaces of OpenMP. Tools for performance tuning and debugging are presented. Course language is English (if required).
Webpage: http://www.hlrs.de/events/

Advanced Topics in Parallel Programming
Date & Location: September 30 - October 1, 2010, Stuttgart, HLRS
Contents: Topics are MPI-2 parallel file I/O, hybrid mixed-model MPI+OpenMP parallelization, OpenMP on clusters, parallelization of explicit and implicit solvers and of particle-based applications, parallel numerics and libraries, and parallelization with PETSc. Hands-on sessions are included. Course language is English (if required).
Webpage: http://www.hlrs.de/events/

Iterative Linear Solvers and Parallelization
Date & Location: October 4 - 8, 2010, LRZ Building, Garching/Munich
Prerequisites: Good knowledge of at least one of the programming languages C/C++ or Fortran; basic UNIX skills. Mathematical knowledge of basic linear algebra and analysis is also assumed.
Contents: The focus of this compact course is on iterative and parallel solvers, the parallel programming models MPI and OpenMP, and the parallel middleware PETSc. Different modern Krylov subspace methods (CG, GMRES, BiCGSTAB, ...) as well as highly efficient preconditioning techniques are presented in the context of real-life applications. Hands-on sessions (in C and Fortran) will allow users to immediately test and understand the basic constructs of iterative solvers, the Message Passing Interface (MPI) and the shared memory directives of OpenMP. This course is organized by LRZ, the University of Kassel, HLRS Stuttgart and IAG.
Webpage: https://fs.hlrs.de/projects/par/events/2010/parallel_prog_2010/G.html

Advanced Fortran Topics
Date & Location: October 11 - 15, 2010, LRZ Building, Garching/Munich
Prerequisites: Course participants should have basic UNIX/Linux knowledge (login with secure shell, shell commands, simple scripts, editor vi or emacs). Good knowledge of the Fortran 95 standard is also necessary, such as covered in the previous winter semester course.
Contents: This course is targeted at scientists who wish to extend their knowledge of Fortran beyond what is provided in the Fortran 95 standard. Some other tools relevant for software engineering are also discussed. Topics covered include:
• object oriented features, submodules
• design patterns
• handling of shared libraries
• mixed language programming
• standardized IEEE arithmetic
• I/O extensions from Fortran 2003
• parallel programming with coarrays
• the Eclipse development environment
• source code versioning systems
To consolidate the lecture material, each day's approximately 4 hours of lecture are complemented by 3 hours of hands-on sessions.
Webpage: http://www.lrz.de/services/compute/courses/#TOC16

Parallel Performance Analysis with VAMPIR
Date & Location: October 18, 2010, LRZ Building, Garching/Munich
Prerequisites: Participants are required to have good knowledge of C, C++ or Fortran, as well as of the message passing concepts (MPI) needed for parallel programming.
Contents: Running parallel codes on large-scale systems with thousands of MPI tasks typically requires tuning measures in order to reduce MPI overhead, load balancing problems and serialized execution phases. VAMPIR is a tool that allows users to identify and locate these scalability issues by generating trace files from program runs that can afterwards be analyzed using a GUI.
Webpage: http://www.lrz.de/services/compute/courses/#VampirNG

Parallel I/O and Portable Data Formats
Date & Location: October 25 - 27, 2010, JSC, Research Centre Jülich
Contents: This course will introduce the use of MPI parallel I/O and portable self-describing data formats, such as HDF5 and netCDF. Participants should have experience in parallel programming in general, and in either C/C++ or Fortran in particular.
Webpage: http://www.fz-juelich.de/jsc/neues/termine/parallelio

Fortran for Scientific Computing
Date & Location: October 25 - 29, 2010, Stuttgart, HLRS
Contents: This course, organized by HLRS and the Institute for Computational Physics and taught with lectures and hands-on sessions, is dedicated to scientists and students who want to learn (sequential) programming of scientific applications with Fortran. The course teaches the newest Fortran standards. Hands-on sessions will allow users to immediately test and understand the language constructs.
Webpage: http://www.hlrs.de/events/

Introduction to the Programming and Usage of the Supercomputer Resources in Jülich
Date & Location: November 25 - 26, 2010, JSC, Research Centre Jülich
Contents: This course gives an overview of the supercomputers JUROPA/HPC-FF and JUGENE. Especially new users will learn how to program and use these systems efficiently. Topics discussed are: system architecture, usage model, compilers, tools, monitoring, MPI, OpenMP, performance optimization, mathematical software, and application software.
Webpage: http://www.fz-juelich.de/jsc/neues/termine/supercomputer

Unified Parallel C (UPC) and Co-Array Fortran (CAF)
Date & Location: December 14 - 15, 2010, Stuttgart, HLRS
Contents: Partitioned Global Address Space (PGAS) is a new model for parallel programming. Unified Parallel C (UPC) and Co-Array Fortran (CAF) are PGAS extensions to C and Fortran. PGAS languages allow any processor to directly address memory/data on any other processor. Parallelism can be expressed more easily compared to library-based approaches such as MPI. Hands-on sessions (in UPC and/or CAF) will allow users to test and understand the basic constructs of PGAS languages.
Webpage: http://www.hlrs.de/events/
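To give a flavour of the shared-memory directives covered in the OpenMP-related courses above, here is a minimal C sketch of a work-sharing loop with a reduction clause. It is an editorial addition rather than course material, and the file name is only illustrative.

/* Minimal OpenMP sketch: parallel dot product using a work-sharing
 * loop and a reduction clause.
 * Build: gcc -fopenmp omp_dot.c -o omp_dot
 */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void)
{
    double sum = 0.0;
    int i;

    /* initialize the vectors; the loop index is private per thread */
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        b[i] = 2.0;
    }

    /* each thread accumulates a private partial sum; the reduction
       clause combines the partial sums into the shared variable 'sum' */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %.1f (max OpenMP threads: %d)\n",
           sum, omp_get_max_threads());
    return 0;
}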

inSiDE is published two times a year by The GAUSS Centre for Supercomputing (HLRS, LRZ, JSC)

Publishers Prof. Dr. H.-G. Hegering | Prof. Dr. Dr. T. Lippert | Prof. Dr. M. M. Resch

Editor & Design: F. Rainer Klank, HLRS, [email protected] | Stefanie Pichlmayer, HLRS, [email protected]

Authors
Christian Alig, [email protected]
Momme Allalen, [email protected]
Martin Breuer, [email protected]
Andreas Burkert, [email protected]
Cezary Czaplewski, [email protected]
Volker Gaibler, [email protected]
Leonhard Gantner, [email protected]
Georg Hager, [email protected]
Ulrich Hansen, [email protected]
Helmut Harder, [email protected]
Dawid Jagiela, [email protected]
Ferdinand Jamitzky, [email protected]
Sarah Jones, [email protected]
Norbert Kalthoff, [email protected]
Martin Krause, [email protected]
Sven Krüger, [email protected]
Adam Liwo, [email protected]
Michael Meier, [email protected]
Stanislaw Ołdziej, [email protected]
Morris Riedel, [email protected]
Ulrich Rist, [email protected]
Notker Rösch, [email protected]
Helmut Satzger, [email protected]
Marc Schartmann, [email protected]
Harold A. Scheraga, [email protected]
Steffen J. Schmidt, [email protected]
Günter H. Schnerr, [email protected]
Bernd Schuller, [email protected]
Juliane Schwendike, [email protected]
Björn Selent, [email protected]
Achim Streit, [email protected]
Matthias Thalhamer, [email protected]
Jan Treibig, [email protected]
Tobias Trümper, [email protected]
Gerhard Wellein, [email protected]

If you would like to receive inSiDE regularly, send an email with your postal address to [email protected] © HLRS 2010