
The Compute Courier brings all the news related to both the Cartesius and Lisa systems. Software on Cartesius that supports hardware acceleration for GPUs is indicated with the GPU symbol. All bold URLs in the PDF version of this newsletter are clickable.

Table of Contents

- The 2nd phase of Cartesius is in production
- Change default Intel compiler version to 15.0.0 on Cartesius
- Machine Learning and Neural Networks on Cartesius
- AIMMS modeling system available on Lisa
- New and updated software on Lisa
- New and updated software on Cartesius
- Known issues on Cartesius
- We're hiring!
- PRACE training events
- PRACE call for preparatory access

The 2nd phase of Cartesius is in production

After a few months of installing, configuring and tuning, Cartesius phase 2 has been in production since last Thursday.

Cartesius phase 2 extends Cartesius with 1080 thin nodes containing Intel's next generation of processors, the Intel Xeon E5-2690 v3 (Haswell). The Haswell processor has a slightly higher clock frequency (2.6 GHz) than the Ivy Bridge processor (Intel Xeon E5-2695 v2, 2.4 GHz) in the Phase 1 thin nodes. Furthermore, its memory bandwidth is higher than that of the Ivy Bridge nodes (2133 MT/s DDR4 memory instead of 1866 MT/s DDR3 memory), and it supports Intel® AVX2 instructions, whereas Ivy Bridge only supports AVX instructions.

Both Ivy Bridge and Haswell nodes contain the same number of cores (24) and the same amount of memory (64GB), but the Haswell nodes contain a newer generation of InfiniBand adapters.

Costs of new nodes

All this makes the new Haswell processors roughly 10% to 40% faster than the Ivy Bridge processors, depending on the application. Nevertheless, we have decided to charge both types of nodes in the same way, i.e. 1 SBU corresponds to one core-hour on both Haswell and Ivy Bridge nodes.

Binary compatibility

The new processors are backward compatible with the Ivy Bridge thin nodes and Sandy Bridge fat nodes, so all programs that ran on the Ivy Bridge processors should also run on the new Haswell nodes. To get optimal performance out of the Haswell processors, a program should make use of the AVX2 instructions specific to this processor. One can do this by recompiling the code with the compiler flags that enable these instructions. For the Intel compiler suite, there are two ways of doing this (example compile commands are sketched after this list):

- Using the compiler flag -xCORE-AVX2 (both for Fortran and C). This creates a binary with AVX2 instructions, specifically for the Haswell processors. Note that the executable will not run on Ivy Bridge and Sandy Bridge nodes.
- Using the compiler flags -xAVX -axCORE-AVX2 (both for Fortran and C). This generates multiple, feature-specific auto-dispatch code paths for Intel® processors where there is a performance benefit, so the binary will run both on Ivy Bridge/Sandy Bridge and on Haswell processors. At runtime, the processor you are running on determines which code path is taken. In general this results in larger binaries.
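
As an illustration, the two approaches could look as follows on the command line. This is only a sketch: the source file names, the executable names and the -O2 optimisation level are placeholders, not a prescribed build recipe.

    # Haswell-only binary: uses AVX2, will not run on Ivy Bridge or Sandy Bridge nodes
    ifort -O2 -xCORE-AVX2 my_program.f90 -o my_program_haswell
    icc   -O2 -xCORE-AVX2 my_program.c   -o my_program_haswell

    # Portable binary with an AVX baseline plus an additional AVX2 code path,
    # selected automatically at runtime; runs on both Ivy Bridge/Sandy Bridge
    # and Haswell nodes
    ifort -O2 -xAVX -axCORE-AVX2 my_program.f90 -o my_program
    icc   -O2 -xAVX -axCORE-AVX2 my_program.c   -o my_program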

What does it mean for batch jobs?

To make the most efficient use of the enlarged Cartesius, we have decided to put all 1620 thin nodes, both Haswell and Ivy Bridge, in the same ("normal" and "short") batch partitions, so the total pool of thin nodes can be used to schedule all jobs submitted to this default partition. For testing and debugging your codes, 48 nodes are set aside to run short jobs, of which 16 are Ivy Bridge and 32 are Haswell. Some nodes are reserved for system administration, e.g. to build and install software and to run tests for upgrades. We have configured SLURM with a preference for the Haswell nodes: for jobs that leave the choice to the batch scheduler, the Haswell nodes will be filled first, and Ivy Bridge nodes will only be used if not enough Haswell nodes are available.

You can still control where your jobs will run:

- only on Ivy Bridge nodes;
- only on Haswell nodes;
- on only one of the two node types, without caring which one;
- on a heterogeneous combination of Ivy Bridge and Haswell nodes.

The way to tell SLURM to steer your job to certain processors is the -C or --constraint option. For instance:

- #SBATCH --constraint=ivy – runs only on Ivy Bridge nodes
- #SBATCH --constraint=haswell – runs only on Haswell nodes
- #SBATCH -C "[ivy|haswell]" – runs either on Ivy Bridge or on Haswell nodes, but not on a mix of both
- If you don't specify any constraint, your job can run on any combination of nodes within the partition.

Of course, the more you restrict your jobs, the longer it may take to schedule them. So, although the Haswell nodes are faster, the turnaround time may well be shorter if you leave the choice to the batch scheduler. A sketch of a complete job script using such a constraint follows below.
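
Putting these pieces together, a complete batch job could look like the following sketch. The partition name comes from the text above, but the node count, wall-clock limit and program name are placeholders; only the constraint line reflects the node-type selection described here.

    #!/bin/bash
    # Request the default partition, which now contains both node types.
    #SBATCH --partition=normal
    # Node count and wall-clock limit are placeholders for this sketch.
    #SBATCH --nodes=4
    #SBATCH --time=01:00:00
    # Restrict the job to Haswell nodes; use "ivy" for Ivy Bridge instead,
    # or leave the constraint out to let the scheduler pick any nodes.
    #SBATCH --constraint=haswell

    # my_program is a hypothetical executable name.
    srun ./my_program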

Change default Intel compiler version to 15.0.0 on Cartesius

The start of Cartesius Phase 2 adds many nodes with state-of-the-art processors. These processors are best supported by the latest Intel compilers and libraries. Therefore, the default version of the Intel compiler will be changed from 14.0.2 to 15.0.0 and the default version of the MKL library will be changed from 11.0.2 to 11.2 on Monday morning, February 2, at 8 AM. This compiler release has been tested for correctness with a set of application benchmarks. You can already use the new compiler release with the following commands:

    module unload fortran c mkl
    module load fortran/intel/15.0.0 c/intel/15.0.0 mkl/11.2

Should you run into any issues with the new compiler version, please let us know at [email protected] so that we can investigate and forward these issues to Bull and Intel. Other releases of the Intel compilers and MKL libraries (both older and newer) remain available on the system. For example, if you would like to change back to the older compilers after February 2nd, please use the commands:

    module unload fortran c mkl
    module load fortran/intel/14.0.2 c/intel/14.0.2 mkl/11.0.2
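
After switching modules it can be useful to verify which compiler is actually picked up before rebuilding your code. A minimal sketch, assuming the usual Intel compiler driver names ifort and icc:

    module unload fortran c mkl
    module load fortran/intel/15.0.0 c/intel/15.0.0 mkl/11.2

    # Verify that the 15.0.0 compilers are now the ones found on the PATH.
    ifort --version
    icc --version

    # Check which modules are currently loaded.
    module list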

Machine Learning and Neural Networks on Cartesius

A number of libraries and tools are available on Cartesius to facilitate experiments and research in the area of machine learning, including neural networks. The currently installed tools include Torch7 GPU, cuDNN GPU, cuda-convnet2 GPU, and Caffe. More information about these tools and how to use them can be found on the webpage https://surfsara.nl/systems/cartesius/software/machine-learning.
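
These tools are typically made available through the module system. The sketch below shows how you might locate and load them; the module names in the "module load" line are assumptions and may differ on the system, so check the webpage above or the output of "module avail" first.

    # List modules whose names mention the machine-learning tools.
    # (module avail writes to stderr, hence the redirection.)
    module avail 2>&1 | egrep -i "torch|cudnn|convnet|caffe"

    # Hypothetical example: load Torch7 together with CUDA on a GPU node.
    # The module names below are placeholders; use the names reported above.
    module load cuda torch7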

AIMMS modeling system available on Lisa

AIMMS (an acronym for "Advanced Interactive Multidimensional Modeling System") is a software system designed for modeling and solving large-scale optimization and scheduling-type problems. AIMMS is considered to be one of the five most important algebraic modeling languages, and its creator has been awarded the INFORMS Impact Prize for his work on this language. AIMMS has been used by three recent winners of the Franz Edelman Award, a prize for outstanding achievements in operations research. AIMMS B.V. has provided an Academic Cluster license for use of their software on the Lisa system. The license allows any number of batch jobs to be run in parallel and is therefore well suited for large parameter studies. For more information, see the documentation page.

New and updated software on Lisa

Many software packages and libraries are pre-installed and ready for you to use on Lisa; just have a look at our software documentation. The following packages were newly installed in the last few months:

- MetaPhlAn – Metagenomic Phylogenetic Analysis
- AIMMS – modeling and solving large-scale optimization and scheduling-type problems

The following packages were upgraded in the last few months:

- bowtie 2.2.4 – Memory-efficient short read aligner
- RAxML 8.1.6 – Randomized Axelerated Maximum Likelihood
- NAMD 2.10b2 – Molecular dynamics
- Delft3D 4168 – Modelling suite for integral water solutions
- R 3.1.2 – Statistical computing
- NWChem 6.5 – Computational chemistry
- automake 1.14.1 – A tool for automatically generating Makefiles
- Python 2.7.9 & 3.4.2 – The Python programming language
- VEGAS 2 – Gene-based tests for association
- MATLAB 2014b & runtime 8.4 – High-level technical computing language

If you have any questions or need other software on Lisa, you can send your question or request for installation or upgrade to [email protected].

New and updated software on Cartesius

Many software packages and libraries are pre-installed and ready for you to use on Cartesius; just have a look at our software documentation. Software that supports hardware acceleration for GPUs is indicated with the GPU symbol. Please ask [email protected] for more information about GPU-accelerated software and how you can benefit. The following packages were newly installed since the last newsletter:

- OpenCV – A computer vision and image processing library
- clBLAS GPU – OpenCL implementation of the BLAS level 1, 2, and 3 routines; its primary targets are GPUs, but it can be used on multi-core CPUs as well
- GDL – GNU Data Language
- numdiff – Numerically inspect the differences between two files

The following packages were upgraded since the last Cartesius newsletter:

- ADF 2014.02 GPU – Amsterdam Density Functional package
- Crystal 14 – A general-purpose program for the study of crystalline solids
- PETSc 3.5.0 & 3.5.2 GPU – Partial Differential Equation Solvers library
- SLEPc 3.5.2 – Eigenvalue/eigenvector solver library
- MATLAB 2014b & runtime 8.4 GPU – High-level technical computing language
- Gromacs 5.0.3 GPU – A versatile package to perform molecular dynamics
- Delft3D 4505 – Modelling suite for integral water solutions
- LAMMPS 30oct2014 – Large-scale Atomic/Molecular Massively Parallel Simulator
- Python 2.7.8 – The Python language
- PGI compilers 14.10 GPU – Latest version of the PGI compiler; supports GPU programming through CUDA Fortran and the OpenACC standard
- CPMD 3.17.1 – Car-Parrinello Molecular Dynamics

If you have any questions or need other software on Cartesius, you can send your question or request for installation or upgrade to [email protected].

Known issues on Cartesius

This section lists the known issues on Cartesius. It can take some time before these issues are fixed; in the meantime we try to provide workarounds. Known issues on Cartesius:

- The SLURM option --export=NONE can be used to ensure that environment variable settings from your interactive session do not impact your submitted batch jobs. However, this also causes some SLURM environment variables not to be set. Most notably, the -c or --cpus-per-task flag no longer functions, and your OpenMP application will only use 1 thread. A workaround is to set the environment variable OMP_NUM_THREADS explicitly in your job, e.g.:

    #SBATCH --cpus-per-task=4
    #SBATCH --export=NONE
    export OMP_NUM_THREADS=4
    srun my_application

- The squeue -u [user] command shows incorrect output for pending jobs: the 'NODES' column shows the number of requested cores, not the number of requested nodes. The correct output can be produced with the command:

    squeue | egrep "JOBID|[user]"
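
As a small convenience, the squeue workaround can be wrapped in a shell function, for instance in your ~/.bashrc. This is only a sketch and the function name is arbitrary:

    # Show your own jobs (or another user's) via the full squeue listing,
    # so the NODES column is correct even for pending jobs.
    myqueue() {
        squeue | egrep "JOBID|${1:-$USER}"
    }

    # Usage: myqueue            (your own jobs)
    #        myqueue someuser   (jobs of another user)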

We're hiring!

As the national High Performance Computing and e-Science Support Center, SURFsara supports scientists with state-of-the-art integrated services, expertise and infrastructure: High Performance Computing and Networking, data services, visualization, and e-Science & Cloud services. The Cartesius team is looking for consultants to, among other things, solve daily issues and optimize applications on Cartesius.

Jobs at SURFsara:

- (Senior) Adviser / Scientific programmer
- (Senior) Unix system administrator

Reaction form

Are you looking for an internship or a job in the field of High Performance Computing and Networking? Fill out the information request form or send your CV to [email protected].

PRACE training events

PRACE organizes training events on many topics related to HPC throughout Europe. An overview of all upcoming training events can be found on the website http://www.training.prace-ri.eu/. This website also contains presentations, videos and tutorials from previous events.

One of the upcoming events that PRACE co-organises is the "International Summer School on HPC Challenges in Computational Sciences", from 21 to 26 June 2015 in Toronto, Canada. The summer school is sponsored by Compute/Calcul Canada, the Extreme Science and Engineering Discovery Environment (XSEDE) with funds from the U.S. National Science Foundation, the Partnership for Advanced Computing in Europe (PRACE), and the RIKEN Advanced Institute for Computational Science (RIKEN AICS) in Japan. Leading American, European and Japanese computational scientists and HPC technologists will offer instruction on a variety of topics, including:

- HPC challenges by discipline (e.g. earth, life and materials sciences, physics)
- HPC programming proficiencies
- Performance analysis & profiling
- Algorithmic approaches & numerical libraries
- Data-intensive computing
- Scientific visualization
- Canadian, EU, Japanese and U.S. HPC infrastructures

The expense-paid program will benefit advanced scholars from Canadian, European, Japanese and U.S. institutions who use HPC to conduct research. Interested students should apply by 11 March 2015. Meals, housing, and travel will be covered for the selected participants. Applications from graduate students and postdocs in all science and engineering fields are welcome. Preference will be given to applicants with parallel programming experience and a research plan that will benefit from the use of high performance computing systems. Further information and application: https://ihpcss2015.computecanada.ca

PRACE call for preparatory access

PRACE, the Partnership for Advanced Computing in Europe, is a research infrastructure that enables researchers from across Europe to apply for time on the PRACE resources via a peer review process. This call marks the opening of continuous preparatory access, allowing researchers to apply for code scalability testing and for support for code development and optimization. Preparatory access calls are rolling calls: researchers can apply for resources all year. The next cut-off date is March 2nd at 11 AM CET.

The HPC (High Performance Computing) systems available to researchers through PRACE are:

- IBM BlueGene/Q – JUQUEEN – hosted by the Gauss Centre member site in Jülich, Germany
- BULL Bullx cluster – CURIE – hosted by CEA (funded by GENCI) in Bruyères-le-Châtel, France
- IBM BlueGene/Q – FERMI – hosted by CINECA in Italy
- IBM System X iDataPlex – MareNostrum – hosted by BSC in Barcelona, Spain
- Cray XC40 – Hornet – hosted by the Gauss Centre member site in Stuttgart, Germany
- IBM System X iDataPlex – SuperMUC – hosted by the Gauss Centre member site in Munich, Germany

There are three types of preparatory access:

A. Code scalability testing, to obtain scalability data that can be used as supporting information when responding to future PRACE project calls. This route provides an opportunity to ensure the scalability of the codes within the set of parameters to be used for PRACE project calls, and to document this scalability. Assessment of applications is undertaken using a light-weight application procedure, with applications evaluated at least every two months.

B. Code development and optimization by the applicant using their own personnel resources (i.e. without PRACE support). Applicants need to describe the planning for the development in detail, together with the expert resources that are available to execute the project. Applications are assessed at least every two months.

C. Code development with support from PRACE experts. Assessment of the applications received is carried out at least every two months.

All proposals must be submitted via the PRACE website at www.prace-ri.eu/hpc-access and will undergo PRACE technical and scientific assessment. See also the related document at http://www.prace-ri.eu/IMG/pdf/prace_preparatory_access_call.pdf
