An ARM cluster for running CMSSW jobs

Lirim Osmani¹ and Tomas Lindén²

¹Department of Computer Science, University of Helsinki, ²Helsinki Institute of Physics

Introduction

To meet the computing challenge posed by the High-Luminosity Large Hadron Collider (HL-LHC), improved software performance, changed analysis procedures and new hardware resources are required. ARM computers are developed for the highly competitive mobile phone market, so they might provide less expensive computing resources for HL-LHC computing.

Hardware

CMSSW has been ported in earlier work to armv7 [1, 2] and to armv8 [3, 4]. The available ARM systems are usually either servers or inexpensive development boards. In this work we have used small ARM development boards: two hexa-core Odroid-N2 [5] from HardKernel and two quad-core Raspberry Pi 4 Model B (RPi4 in the following) [6] from the Raspberry Pi Foundation. Both board types were bought with 4 GB of RAM, which is more than the 1–2 GB found on earlier similar boards, but less than the 2 GB per core required by CMS. The machines can still be used for many tasks.

Table 1: Comparison of the Odroid-N2 and the Raspberry Pi 4 Model B.

  Board            Odroid-N2                        Raspberry Pi 4 Model B
  CPU              S922X                            BCM2711B0
  Line width       12 nm                            28 nm
  Clock frequency  1.8 GHz (A73), 1.9 GHz (A53)     1.5 GHz
  Cores            4 x Cortex-A73 + 2 x Cortex-A53  4 x Cortex-A72
  L1 cache         32 kB                            32 kB
  L2 cache         1 MB + 256 kB                    1 MB
  RAM              2, 4 GB                          1, 2, 4 GB
  RAM speed        DDR4 1320                        LPDDR4-3200 SDRAM
  Ethernet         1 Gb/s                           1 Gb/s
  USB              4 x USB 3.0, 1 x USB 2.0 OTG     2 x USB 3.0, 2 x USB 2.0, 1 x USB-C OTG
  Flash storage    µSD UHS-I DDR50, eMMC 5.1        µSD
  Board size       90 x 90 mm²                      85 x 56 mm²

Figure 1: The two Raspberry Pi 4 Model B boards and the two Odroid-N2 boards.

On all boards 128 GB of flash storage was used, eMMC on the Odroid-N2s and µSD on the RPi4s, together with a 6 GB swap file.

The Odroid-N2 bottom is covered by a heat sink providing enough cooling to avoid thermal throttling of the CPU during prolonged high load. The RPi4 CPU is equipped with a metallic heat spreader, which keeps the CPU temperature below the throttling limit of 80 °C only for brief high-load bursts. In order to obtain maximum performance and consistent benchmark results, the cooling was improved on one RPi4 with a 30 x 30 x 7 mm fan and on the other with a 32 x 32 x 20 mm heat sink. Both the fan and the heat sink increased the cooling enough that thermal throttling is avoided when running the stress-ng program on four cores for three hours (a load of the kind sketched below).
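The three-hour throttling test and the stress-ng load column of Table 2 below can be reproduced along the following lines. This is a sketch: the exact stress-ng options used on the boards are not recorded on this poster, and vcgencmd is a Raspberry Pi firmware tool that is not available on the Odroid-N2.

  # Load all four cores for three hours (assumed parameters).
  stress-ng --cpu 4 --timeout 3h --metrics-brief

  # On the RPi4, query the firmware for the CPU temperature and the
  # throttling state (bit 2 of get_throttled means currently throttled).
  vcgencmd measure_temp
  vcgencmd get_throttled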
Power measurements

The RPi4 is powered by a 5 V / 3 A power supply with a USB-C connector. The power drawn by the RPi4 board itself was measured with an AVHzY CT-2 USB power meter, neglecting the losses of the power supply. The Odroid-N2 uses a 12 V / 1.5 A power supply, so a less accurate meter had to be used than the one for the RPi4.

Table 2: Power measurements.

  Computer                                      Idle         stress-ng
  Odroid-N2                                     2 ± 1 W      6 ± 1 W
  Raspberry Pi 4 Model B with heat sink         2.6 ± 0.1 W  5.8 ± 0.1 W
  Kaby Lake 14 nm quad-core i7-7700, Fedora 29  8 ± 1 W      99 ± 1 W

Figure 2: The RPi4 current and voltage during a CMSSW job averaged to 3.229 W.

Performance

The storage I/O performance can be studied quickly with the hdparm tool, which is not an accurate benchmark, but gives an estimate of the read performance of a storage system and of the memory and cache system bandwidth.

Table 3: Average of three hdparm runs.

  Computer                    Cached reads  Buffered disk reads
  Odroid-N2 eMMC              2124 MB/s     144 MB/s
  Raspberry Pi 4 Model B µSD  1098 MB/s     43 MB/s
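These numbers can be reproduced with an invocation of the kind sketched below, assuming the flash storage appears as /dev/mmcblk0; the device name varies between boards and boot media.

  # -T measures cached reads (memory and cache bandwidth),
  # -t measures buffered reads from the device itself.
  # Repeat three times and average, as in Table 3.
  sudo hdparm -Tt /dev/mmcblk0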
ROOT release 6.18/04 [11] was compiled on both an Odroid-N2 and a RPi4. Compiling this ROOT version with four cores on the RPi4 took 3 h 51 min 39 s, with six cores on the Odroid-N2 it took 2 h 7 min 33 s, and on the x86_64 system it took 1 h 2 min 7 s. Running ./stressHepix three times gave an average of 911.0 ROOTMARKS on the RPi4 and 1158.7 ROOTMARKS on the Odroid-N2. The x86_64 system gave an average of 4999.8 ROOTMARKS.

Software environment

The Odroid-N2 was released with an AArch64 userspace available on Ubuntu 18.04 LTS, so that release was used. At the time of this work Ubuntu and Gentoo supported the AArch64 userspace and kernel on the RPi4, so Ubuntu 18.04.3 LTS with a Gentoo kernel and Ubuntu 19.10 were used on the RPi4s.

The CMSSW environment builds on:
- Singularity, an HPC container technology for scientific applications with reproducible, sharable and distributed features [7]
- Singularity AArch64 images built on top of CERN CentOS 7
- CVMFS [8] as a software distribution service, compiled for AArch64
- Development AArch64 CMSSW releases available in CVMFS
A quick sanity check of this stack is sketched after the Cluster software list below.

Cluster software

Our experimental cluster runs on resources at the Computer Science Department of the University of Helsinki. It is a hybrid cluster setup consisting of AArch64 and Intel x86_64 machines for running CMSSW jobs. The software stack consists of:
- ARC 6 as the standard grid interface [9]
- HTCondor as the batch system [10]
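As noted under Software environment, the container and CVMFS layers can be sanity checked on a board roughly as follows. The repository and image names are the ones appearing in the runTheMatrix example further down this poster.

  # Verify that the CMS CVMFS repository is mounted and responding.
  cvmfs_config probe cms.cern.ch

  # Confirm that the AArch64 CERN CentOS 7 image starts and sees /cvmfs.
  singularity exec -B /cvmfs docker://cmssw/cc7:aarch64 ls /cvmfs/cms.cern.ch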
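In a hybrid pool of this kind, jobs can be steered to the ARM boards through the Arch attribute of the machine ClassAd. The submit file below is a minimal, hypothetical sketch; the exact attribute value reported by the boards should first be checked with condor_status.

  # steer-to-arm.sub: run a probe job on an AArch64 worker only.
  executable   = /usr/bin/uname
  arguments    = -m
  requirements = (Arch == "aarch64")
  output       = arch.out
  error        = arch.err
  log          = arch.log
  queue

Submitting it with condor_submit steer-to-arm.sub should print aarch64 in arch.out when the job has matched one of the development boards.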

Local CMSSW jobs were run using the software validation and data quality monitoring tool called runTheMatrix:

  # Enter the AArch64 CERN CentOS 7 container with CVMFS bind-mounted.
  singularity shell -B /cvmfs docker://cmssw/cc7:aarch64
  # Select the AArch64 CMSSW architecture and set up the CMS environment.
  export SCRAM_ARCH=slc7_aarch64_gcc700
  source /cvmfs/cms.cern.ch/cmsset_default.sh
  # Create a CMSSW_10_2_0_pre6 work area and time workflow 2.0 with four jobs.
  cmsrel CMSSW_10_2_0_pre6
  cd CMSSW_10_2_0_pre6
  cmsenv
  time runTheMatrix.py -l 2.0 --job-reports -j 4

The runs were single threaded, so they used only part of the available CPU capacity. The average of three runs was 22 min 0 s on the RPi4, 17 min 13 s on the Odroid-N2 and 4 min 31 s on the Kaby Lake system. The estimated energy consumption for the RPi4 was 4262 J (the 3.229 W average power of Figure 2 times the 1320 s run time), which is about half of the x86_64 consumption of 8000 J.

Future work

Multithreaded benchmarks should be run to study the performance scaling. The configuration of the grid setup should be completed.

Summary

- These computers draw 2–3 W when idling and about 6 W under CPU load
- ROOT was compiled successfully on both boards
- Singularity and CVMFS were compiled on both boards
- The CMSSW software stack can be run in Singularity on these ARM boards
- For the tested applications the Odroid-N2 is faster than the RPi4
- For the CMSSW runTheMatrix job the RPi4 used about half the energy of the x86_64 system

Funding

This work was jointly funded by the Doctoral School in Computer Science (DoCS) and the NODES research lab at the University of Helsinki.

Acknowledgements

We gratefully acknowledge the help of Juha Aaltonen and Sami Lehti.

References

[1] D. Abdurachmanov et al., Initial explorations of ARM processors for scientific computing, 2014 J. Phys.: Conf. Ser. 523 012009.
[2] D. Abdurachmanov et al., Explorations of the viability of ARM and Xeon Phi for physics processing, 2014 J. Phys.: Conf. Ser. 513 052008.
[3] D. Abdurachmanov et al., Techniques and tools for measuring energy efficiency of scientific software applications, 2015 J. Phys.: Conf. Ser. 608 012032.
[4] D. Abdurachmanov et al., Heterogeneous High Throughput Scientific Computing with APM X-Gene and Xeon Phi, 2015 J. Phys.: Conf. Ser. 608 012033.
[5] https://www.hardkernel.com/shop/odroid-n2-with-4gbyte-ram/
[6] https://www.raspberrypi.org/products/raspberry-pi-4-model-b/
[7] G.M. Kurtzer, V. Sochat, M.W. Bauer, Singularity: Scientific containers for mobility of compute, 2017 PLoS ONE 12(5): e0177459, https://doi.org/10.1371/journal.pone.0177459. See also https://sylabs.io/singularity/
[8] J. Blomer et al., Status and future perspectives of CernVM-FS, 2012 J. Phys.: Conf. Ser. 396 052013. See also https://cernvm.cern.ch/portal/filesystem
[9] M. Ellert et al., Advanced Resource Connector middleware for lightweight computational Grids, Future Generation Computer Systems 23 (2007) 219-240. See also http://www.nordugrid.org/arc/arc6/
[10] D. Thain, T. Tannenbaum and M. Livny, Distributed Computing in Practice: The Condor Experience, Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April 2005. See also https://research.cs.wisc.edu/htcondor/
[11] R. Brun and F. Rademakers, ROOT - An Object Oriented Data Analysis Framework, Proceedings AIHENP'96 Workshop, Lausanne, Sep. 1996, Nucl. Inst. & Meth. in Phys. Res. A 389 (1997) 81-86. See also http://root.cern.ch/

L. Osmani and T. Lindén (University of Helsinki, Helsinki Institute of Physics), An ARM cluster for running CMSSW jobs, 4 November 2019