Porting the LHCb Stack from x86 (Intel) to aarch64 (ARM) and ppc64le (PowerPC)
EPJ Web of Conferences 214, 05016 (2019) https://doi.org/10.1051/epjconf/201921405016 CHEP 2018

Porting the LHCb Stack from x86 (Intel) to aarch64 (ARM) and ppc64le (PowerPC)

Laura Promberger1,∗, Marco Clemencic2, Ben Couturier2, Aritz Brosa Iartza3, and Niko Neufeld2 on behalf of the LHCb collaboration

1Fakultät für Informatik und Wirtschaftsinformatik - Fachgebiet Informatik, Hochschule Karlsruhe - Technik und Wirtschaft, Karlsruhe, Germany
2EP, CERN, Meyrin, Switzerland
3Escuela de Ingeniería Informática, Universidad de Oviedo, Oviedo, Asturias, Spain

Abstract. LHCb is undergoing major changes in its data selection and processing chain for the upcoming LHC Run 3 starting in 2021. With this in mind, several initiatives have been launched to optimise the software stack. This contribution discusses porting the LHCb stack from the x86_64 architecture to both the aarch64 and ppc64le architectures, with the goal of evaluating the performance and the cost of the computing infrastructure for the High Level Trigger (HLT). This requires porting a stack with more than five million lines of code and finding working versions of external libraries provided by LCG. Across all software packages the biggest challenge is the growing use of vectorisation, as many vectorisation libraries are specialised for the x86 architecture and do not support any other architecture. In spite of these challenges we have successfully ported the LHCb High Level Trigger code to aarch64 and ppc64le. This contribution discusses the status and plans for the porting of the software as well as the LHCb approach for tackling code vectorisation in a platform-independent way.

1 Introduction

In 2021 the LHCb experiment will undergo a major upgrade for Run 3. With the increased luminosity provided by the LHC and the introduction of new sub-detectors, the LHCb experiment will be able to study new phenomena, and known ones in more detail.
At the same time this results in an increase of the raw detector output by a factor of 100, from 50 GB/s to approximately 4 TB/s. The output data rate of the final event selection written to disk will increase from 0.7 GB/s to 2-10 GB/s.

To cope with the large increase in data volume, a combination of upgrading the hardware resources of the HLT computing farm and increasing the performance of the software stack is necessary. The increase in performance can be achieved by optimising the logic of algorithms and exploiting techniques which maximise hardware efficiency. The main such techniques are multi-threading and vectorisation.

The upgrade of the computing farm will include a new data center and new compute nodes. For the most competitive cost-performance solution it is important to have several options. For this reason LHCb decided to extend the architecture support from Intel x86_64 to ARM aarch64 and PowerPC ppc64le.

∗e-mail: [email protected]
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

1.1 The LHCb software stack

The LHCb software stack is made up of multiple large projects. These projects can be divided into three groups. The LCG project provides external dependencies (e.g. Oracle, ROOT, Python). On top of these is the experiment-independent project Gaudi, which is used by different experiments at CERN. Finally, there are the experiment-specific projects LHCb, Lbcom, Rec and Brunel, which alone sum up to about five million lines of code.

For this study we used the LHCb stack with Brunel v53r1 [1]. Initially the predefined LCG version 91 [2] was used; later LCG was upgraded to version 92 [3].
This version of the stack was selected because vectorisation is considered to be the largest challenge for porting the code. The selected version contains less vectorised code than the software stack currently being developed for Run 3. At the same time this version has the disadvantage that it is not yet multi-threaded.

2 Vectorisation

Before porting the software stack, the vectorisation libraries used by LHCb are evaluated for their cross-platform support. The two libraries used are Vc and Vcl. An overview of their features is shown in Table 1. Neither library supports aarch64 (ARM) or ppc64le (PowerPC). However, Vcl is a light, low-level wrapper on top of the intrinsic functions, which allows the missing architectures to be implemented in a limited time. Vc, in contrast, takes a more high-level approach, which makes it easier to use but gives it a more complex internal structure. It was therefore decided for the port to add cross-platform support to Vcl for all needed functions and to replace the usage of Vc by either Vcl or a generic scalar implementation.

Table 1: Vectorisation libraries

| | Vcl | Vc |
|---|---|---|
| AVX2 | Yes | Yes |
| AVX512 | Yes | In development |
| AltiVec | No | No |
| NEON | No | In development |
| Documentation | Yes | Yes |
| Examples | Yes | Yes |
| Masked Functions | Shuffle, blend and permute | Nearly every function (where construct) |
| Unsupported Functions and Architectures | No | Officially yes, but not implemented |
| License | GPL3 or paid for use in proprietary software | BSD-3-Clause |
| Distribution | Own website [4] | GitHub [5] |
| Main Developer | Agner Fog | Matthias Kretz |
| Vectorisation Style | Wrapper for intrinsics | Targeted for horizontal vectorisation |
| Problems | Only support for Intel architecture, not so many masked functions | Does not cover all use cases; hard to implement vertical vectorisation; vector width not always identifiable |
| Expandability for New Intrinsics | Medium (no unit tests) | Complex |
| Future | | Version 2 will be integrated in the C++ standard |

3 Porting to aarch64 (ARM)

The LHCb stack is first ported to aarch64. For LCG this requires changing compile flags and versions of the external dependencies. Some optional dependencies, such as Oracle, are not supported on this architecture; being optional, they can be deactivated without creating any problems. For the other projects (Gaudi, LHCb, ...) compile flags also have to be changed.
Additionally, all usage of Vc is replaced by either Vcl or a generic scalar implementation.

During the port two major problems occurred. First, there are platform-specific differences when casting double to unsigned int. Intel does not have a specific instruction to cast from double to unsigned int. Instead, Intel casts from double to signed int and then reinterprets the result as unsigned int. ARM, on the other hand, has an instruction to cast from double to unsigned int. As a result, casting the number -2.3 to unsigned int is not valid on ARM. It results in a floating-point exception stating that the operation is invalid, as the value is out of range for unsigned int. To solve this problem the behaviour of Intel is mimicked explicitly: cast double to signed int and then to unsigned int. This approach was selected because enforcing the proper data range would have resulted in an unreasonable number of code-breaking changes for this study.

The second problem is also platform-specific. It concerns the default signedness of char. On Intel, char is expanded to signed char, and on ARM it is expanded to unsigned char. Within the LHCb stack this resulted in a wrongly calculated hash function. To solve the problem the GCC compile flag -fsigned-char has to be used, forcing ARM to behave like Intel: expanding char to signed char.
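The two fixes described above can be illustrated with a minimal C++ sketch. It is not taken from the LHCb code base: the helper names `to_unsigned_via_signed` and `toy_hash` are hypothetical, and the hash is a deliberately simple stand-in for the affected hash function. The first helper makes the two-step conversion explicit; the second shows how the signedness of the character type changes a hash as soon as a byte value above 127 is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: convert double -> unsigned int in two explicit steps,
// mimicking the behaviour the code relied on when built for x86.
inline unsigned int to_unsigned_via_signed(double value) {
    // Step 1: double -> signed int truncates towards zero. This is well
    // defined as long as the truncated value fits into int.
    const int as_signed = static_cast<int>(value);
    // Step 2: signed -> unsigned is always well defined (modulo 2^N).
    return static_cast<unsigned int>(as_signed);
}

// Toy hash for illustration only (NOT the hash used in the LHCb stack).
// CharT can be char, signed char or unsigned char.
template <typename CharT>
std::uint32_t toy_hash(const CharT* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::uint32_t h = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Converting to int sign-extends a signed character type and
        // zero-extends an unsigned one, so byte values above 127
        // contribute differently to the hash.
        h = h * 31u + static_cast<std::uint32_t>(static_cast<int>(data[i]));
    }
    return h;
}
```

With these helpers, `to_unsigned_via_signed(-2.3)` yields the same value as `static_cast<unsigned int>(-2)` on every platform, and hashing the single byte 0xE9 gives different results for `signed char` and `unsigned char` input, while pure ASCII input hashes identically. Calling `toy_hash` on plain `char` data follows whichever signedness the compiler selects, which is what -fsigned-char pins down.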
After successfully building the LHCb stack on aarch64, the results of the full reconstruction test (Brunel) were validated for their numerical accuracy, as the values fluctuate depending on the compiler version and the platform being used.
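A validation of this kind is typically done with a tolerance-based comparison rather than exact equality. The following C++ sketch shows one possible form under stated assumptions: the helper names `approx_equal` and `count_mismatches`, the default tolerances, and the idea of comparing two flat lists of values are all illustrative choices, not the procedure used for the Brunel test.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if a and b agree within a relative tolerance,
// with an absolute floor so that values near zero can still match.
// The default tolerances are assumptions chosen for illustration.
inline bool approx_equal(double a, double b,
                         double rel_tol = 1e-9, double abs_tol = 1e-12) {
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}

// Hypothetical helper: count the entries where the output of the ported
// build (candidate) differs from the reference build beyond the tolerance.
// A size mismatch is reported as the larger of the two sizes.
inline std::size_t count_mismatches(const std::vector<double>& reference,
                                    const std::vector<double>& candidate,
                                    double rel_tol = 1e-9) {
    if (reference.size() != candidate.size()) {
        return std::max(reference.size(), candidate.size());
    }
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (!approx_equal(reference[i], candidate[i], rel_tol)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A relative tolerance is used here because reconstructed quantities span many orders of magnitude; which tolerance is acceptable depends on the quantity being compared.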