CERN-THESIS-2018-094    09/05/2018

Data intensive ATLAS workflows in the Cloud

Dissertation zur Erlangung des mathematisch-naturwissenschaftlichen Doktorgrades „Doctor rerum naturalium" der Georg-August-Universität Göttingen

im Promotionsprogramm ProPhys der Georg-August University School of Science (GAUSS)

vorgelegt von Gerhard Ferdinand Rzehorz aus Bruchsal

Göttingen, 2018

Betreuungsausschuss:
Prof. Dr. Arnulf Quadt
PD Dr. Jörn Große-Knetter

Mitglieder der Prüfungskommission:
Referent: Prof. Dr. Arnulf Quadt, II. Physikalisches Institut, Georg-August-Universität Göttingen
Koreferent: Prof. Dr. Ramin Yahyapour, Institut für Informatik, Georg-August-Universität Göttingen

Weitere Mitglieder der Prüfungskommission:
Prof. Dr. Steffen Schumann, II. Physikalisches Institut, Georg-August-Universität Göttingen
Prof. Dr. Jens Grabowski, Institut für Informatik, Georg-August-Universität Göttingen
Prof. Dr. Ariane Frey, II. Physikalisches Institut, Georg-August-Universität Göttingen
Dr. Oliver Keeble, IT Department, CERN

Tag der mündlichen Prüfung: 09.05.2018

Data intensive ATLAS workflows in the Cloud

Abstract

Large physics experiments, such as ATLAS, have participating physicists and institutes all over the globe. Nowadays, physics analyses are performed on data that is stored thousands of kilometres away. This is possible due to the distributed computing infrastructure known as the Worldwide LHC Computing Grid (WLCG). In addition to the analyses, all the previous data transformation steps, such as raw data reconstruction, are performed within the WLCG. Within the next decade, the computing requirements are projected to exceed the available resources by a factor of ten. In order to mitigate this discrepancy, alternative computing solutions have to be investigated. Within this thesis, the viability of Cloud computing is evaluated. The concept of Cloud computing is to rent infrastructure from a commercial provider. In contrast to that, in the WLCG computing concept the hardware within the computing centres is purchased and operated by the WLCG. In order to examine Cloud computing, a model that predicts the workflow performance on a given infrastructure is created, validated and applied. In parallel, the model was used to evaluate a workflow optimisation technique called overcommitting. Overcommitting means that the workload on a computer consists of more parallel processes than there are CPU cores. This technique is used to fill otherwise idle CPU cycles and thereby increase the CPU utilisation. Using the model, overcommitting is determined to be a viable optimisation technique, especially when using remote data input, taking into account the increased memory footprint. Introducing the overcommitting considerations to the Cloud viability evaluation increases the feasibility of Cloud computing. This is because Cloud computing may not include a storage solution and has the flexibility to provision virtual machines with additional memory. The final conclusion is drawn by taking the above described results and by combining them with the cost of the WLCG and the Cloud. The result is that Cloud computing is not yet competitive compared to the WLCG computing concept.

Data intensive ATLAS workflows in the Cloud

Zusammenfassung

Die großen Physikexperimente, wie zum Beispiel ATLAS, bestehen aus Kollaborationen mit Physikern und Instituten auf der ganzen Welt. Heutzutage werden physikalische Analysen an Daten durchgeführt, die Tausende von Kilometern entfernt gespeichert sind. Dies ist aufgrund der verteilten Computing-Infrastruktur, die als Worldwide LHC Computing Grid (WLCG) bekannt ist, möglich. Zusätzlich zu den Analysen werden alle vorherigen Datentransformationsschritte, wie die Rekonstruktion von Rohdaten, innerhalb des WLCG durchgeführt. Innerhalb des nächsten Jahrzehnts wird erwartet, dass die Anforderungen an die Computerinfrastruktur die verfügbaren Ressourcen um den Faktor zehn übersteigen werden. Um diese Diskrepanz zu mindern, müssen Alternativen zur jetzigen Computerinfrastruktur untersucht werden. Im Rahmen dieser Arbeit wird Cloud Computing evaluiert. Das Konzept von Cloud Computing besteht darin, eine Computerinfrastruktur von einem kommerziellen Anbieter zu mieten. Dies steht im Gegensatz zum WLCG-Konzept, in dem die Ausstattung der Rechenzentren gekauft und selbst betrieben wird. Um Cloud Computing zu untersuchen, wird ein Modell erstellt, validiert und angewendet, das das Verhalten von Arbeitsflüssen auf einer beliebigen Infrastruktur vorhersagt. Parallel dazu wurde das Modell zur Bewertung einer Arbeitsfluss-Optimierungsmethode namens Overcommitting verwendet. Overcommitting bedeutet, dass die Arbeitslast auf einem Computer aus mehr parallelen Prozessen besteht, als CPU-Kerne vorhanden sind. Diese Technik wird verwendet, um ansonsten ungenutzte CPU-Zyklen zu füllen und dadurch die CPU-Auslastung zu erhöhen. Unter der Verwendung des Modells wird das Overcommitting als eine brauchbare Optimierungstechnik ermittelt. Dies gilt insbesondere dann, wenn die Daten nur auf weit entfernten Speichermedien vorhanden sind, und unter der Berücksichtigung des erhöhten Bedarfs an Arbeitsspeicher. Der Einbezug dieser Überlegungen in die Cloud-Computing-Evaluation verbessert dessen Stellung. Dies liegt daran, dass Cloud Computing nicht unbedingt Speichermöglichkeiten enthält und flexibel genug ist, um virtuellen Maschinen zusätzlichen Arbeitsspeicher zuzuweisen. Unter Berücksichtigung all dieser Gesichtspunkte und in Kombination mit den Kostenmodellen des WLCG und der Cloud ergibt sich, dass Cloud Computing noch nicht konkurrenzfähig gegenüber dem bisherigen WLCG-Konzept ist.

Contents

1 Introduction ...... 1
    1.1 Motivation ...... 1
    1.2 Thesis structure ...... 2

2 The Standard Model of particle physics ...... 3
    2.1 Interactions ...... 3
        2.1.1 Weak interaction ...... 3
        2.1.2 Electromagnetic interaction ...... 5
        2.1.3 Electroweak unification ...... 5
        2.1.4 Strong interaction ...... 5
        2.1.5 Quarks and leptons ...... 6
        2.1.6 The Higgs mechanism ...... 6
    2.2 Beyond the Standard Model ...... 7

3 The ATLAS detector ...... 9
    3.1 LHC ...... 9
        3.1.1 CERN ...... 9
        3.1.2 Machine specifics ...... 9
    3.2 ATLAS ...... 12
        3.2.1 Detector components ...... 12
        3.2.2 Inner detector ...... 13
        3.2.3 Calorimeters ...... 13
        3.2.4 Muon spectrometer ...... 14
        3.2.5 Trigger and data acquisition ...... 15

4 LHC offline computing ...... 17
    4.1 Distributed and Grid computing ...... 18
    4.2 Cloud computing ...... 19
        4.2.1 Concept ...... 20
        4.2.2 Pricing ...... 22
        4.2.3 Storage ...... 24
        4.2.4 Security, safety and integrity ...... 27
        4.2.5 Availability ...... 28
    4.3 Grid Computing ...... 29
    4.4 WLCG ...... 29
        4.4.1 Concept and purpose ...... 30
        4.4.2 Composition ...... 31
        4.4.3 Evolution ...... 37
    4.5 ATLAS computing components ...... 37
        4.5.1 XRootD ...... 37
        4.5.2 Athena ...... 38
        4.5.3 AthenaMP ...... 38
        4.5.4 PanDA ...... 38
        4.5.5 Rucio ...... 39
        4.5.6 JEDI ...... 40
        4.5.7 CVMFS ...... 40
        4.5.8 Tags ...... 40
        4.5.9 AMI ...... 40
    4.6 General concepts ...... 40
        4.6.1 Benchmarking ...... 40
        4.6.2 Storage ...... 41
        4.6.3 Swapping ...... 42
        4.6.4 CPU efficiency ...... 43
        4.6.5 Undercommitting ...... 43
        4.6.6 Control groups ...... 44

5 Workflows ...... 45
    5.1 General model ...... 47
        5.1.1 All experiments ...... 49
        5.1.2 ATLAS ...... 50
    5.2 Monte Carlo simulation ...... 51
        5.2.1 Event generation ...... 51
        5.2.2 Simulation ...... 54
    5.3 Reconstruction ...... 56
        5.3.1 Raw data reconstruction ...... 57
        5.3.2 Raw data reconstruction profile ...... 57
        5.3.3 Simulated data reconstruction ...... 64
        5.3.4 Digitisation ...... 66
        5.3.5 Trigger simulation ...... 67
        5.3.6 Reprocessing ...... 67
    5.4 Analysis ...... 68
        5.4.1 Group production ...... 69
        5.4.2 Complete processing ...... 69

6 Models and predictions ...... 71
    6.1 Related work ...... 72
    6.2 The Workflow and Infrastructure Model ...... 73
        6.2.1 Functionalities ...... 74
        6.2.2 Model input ...... 76
    6.3 Model logic ...... 78
        6.3.1 Workflow duration ...... 78
        6.3.2 CPU consumption time ...... 79
        6.3.3 Idle time ...... 80
        6.3.4 I/O wait time ...... 80
        6.3.5 Overhead time ...... 81
        6.3.6 Swap time ...... 82
        6.3.7 Undercommitted idle time ...... 82
        6.3.8 Number of machines ...... 83
        6.3.9 Final result ...... 83
        6.3.10 Estimation of uncertainties ...... 84
    6.4 Programming tools ...... 85
    6.5 Complex workflow model ...... 85

7 Cloud workflow modelling ...... 87
    7.1 Validation ...... 87
        7.1.1 Strategy ...... 87
        7.1.2 Setup ...... 90
        7.1.3 Validation - workflow fluctuation ...... 92
        7.1.4 Results ...... 94
        7.1.5 Conclusion ...... 97
    7.2 Cloud measurement ...... 97
        7.2.1 HNSciCloud: large scale ...... 98
        7.2.2 HNSciCloud: object storage ...... 104
        7.2.3 Grid sites ...... 107
    7.3 Model application ...... 108
        7.3.1 Combining benchmarks ...... 109
        7.3.2 Bandwidth estimation ...... 110
        7.3.3 HNSciCloud large scale ...... 111
        7.3.4 Error sources ...... 111

8 Overcommitting ...... 117
    8.1 Principle ...... 117
        8.1.1 Overcommitting scenarios ...... 119
    8.2 Study ...... 122
        8.2.1 Overcommitting with AthenaMP ...... 122
        8.2.2 Overcommitting job profiles ...... 123
    8.3 Measurements ...... 126
        8.3.1 AthenaMP combinations ...... 126
        8.3.2 Overcommitting for latency hiding ...... 130
        8.3.3 Scheduling ...... 131
    8.4 Model predictions ...... 132
        8.4.1 Overcommitting in the model ...... 132
        8.4.2 Result ...... 133
    8.5 Conclusion ...... 135

9 Cloud viability evaluation ...... 137
        9.0.1 Model results ...... 139
    9.1 Conclusion ...... 140

10 Summary and Outlook ...... 141

Acknowledgements ...... 143

Bibliography ...... 145

Appendices ...... 155
    A.1 Job specifications ...... 157
        A.1.1 Workflows: Event generation ...... 157
        A.1.2 Workflows: Raw data reconstruction ...... 162
    A.2 Additional profiles ...... 163
    A.3 Model implementation ...... 167
        A.3.1 Model usage ...... 168
        A.3.2 Model code ...... 169
    A.4 Overcommitting ...... 192
        A.4.1 Model input parameters ...... 193
    A.5 Additional source code and scripts ...... 193
    A.6 Different workflows ...... 195
        A.6.1 Event Generation ...... 195
        A.6.2 Monte-Carlo simulation ...... 196
        A.6.3 Reconstruction 1 ...... 196
        A.6.4 Reconstruction 2 ...... 197
        A.6.5 Reconstruction 3 ...... 198
        A.6.6 Reconstruction 4 ...... 199
        A.6.7 Reconstruction 5 ...... 200
        A.6.8 Reconstruction 6 ...... 201
        A.6.9 Reconstruction 7 ...... 204
        A.6.10 Digitisation and reconstruction 1 ...... 205
        A.6.11 Digitisation and reconstruction 2 ...... 207
        A.6.12 Digitisation and reconstruction 3 ...... 209
        A.6.13 Digitisation and reconstruction 4 ...... 212
        A.6.14 Digitisation and reconstruction 5 ...... 215
    A.7 Hardware ...... 215
        A.7.1 Göttingen ...... 215
        A.7.2 CERN ...... 216
        A.7.3 Exoscale ...... 217
        A.7.4 IBM ...... 218
        A.7.5 T-systems ...... 219
    A.8 Additional Tables ...... 220

CHAPTER 1

Introduction

It is in our human nature to be curious and to thirst for knowledge. These human characteristics combined with a complex world result in the field of physics. The ultimate goal of physics is to have a complete and accurate description of the universe and everything in it. In order to arrive at such a description, many experiments and measurements are necessary. Nowadays, the boundary of the unknown has been pushed to include particles that are, according to our current understanding, infinitely small, yet the description is far from complete. The search for new particles in the 21st century is taking place through cosmic observations and terrestrial particle acceleration. With recent advancements in technology, the scale of these experiments has increased massively. Not only are the experimental setups several kilometres in size, but the amounts of data that are being collected are also enormous. These efforts culminated in the discovery of the Higgs boson. The Higgs boson was one of the missing pieces needed to complete the Standard Model and the understanding of the universe. Finding the Higgs boson was, however, only one of the many purposes of the ongoing High Energy Physics (HEP) experiments. There are still a multitude of open questions about the universe which, to this day, remain unanswered.

1.1 Motivation

The times in which a single computer could handle the data processing of an experiment have long since passed. Nowadays, thousands of interconnected Central Processing Units (CPUs) are necessary to keep up with the rate at which physics data is collected. There is no end in sight to the steady increase in data that is analysed to further the understanding of the universe.


The increase in computational requirements can even be observed within the same experiments. According to the latest prognoses, the computing and storage requirements that the four major HEP experiments will pose are ten times higher than what can be delivered, assuming a flat budget. There are several approaches to solve this issue. In this thesis, a cost reduction by outsourcing the computing to commercial providers is investigated.

1.2 Thesis structure

In Chapter 2 the Standard Model is introduced. It is the basis of modern particle physics. It is being experimentally tested in large experiments, one of the biggest being ATLAS, which is described in Chapter 3. The ATLAS collaboration, which consists of many physicists located around the globe, is able to work together due to the distributed computing infrastructure. This infrastructure is explained in detail in Chapter 4, which introduces the most important components. The workflows that are enabled by this infrastructure are described in Chapter 5. They are the basis and provide the boundary conditions of a possible outsourcing of the computing infrastructure into the Cloud. In order to understand whether this would be beneficial, and in order to optimise the whole computing infrastructure, the Workflow and Infrastructure Model was created. It is introduced in Chapter 6, where the underlying logic is described in detail. Measurements of the workflow performance within the Cloud are undertaken in Chapter 7. The same chapter also uses the measured data to verify the model as well as to apply it to different use cases. In Chapter 8, overcommitting, an optimisation technique, is investigated in detail. This is done by comparing the results of measurements and by applying the model. In Chapter 9, a final conclusion about the previous measurements and the viability of Cloud computing is drawn. At the end, a summary of all activities described within this thesis is given, see Chapter 10. The thesis concludes with an indication of the future direction of the ATLAS computing.

CHAPTER 2

The Standard Model of particle physics

The Standard Model (SM) describes the current knowledge of elementary particles and the interactions between them. A schematic overview of the SM particles is given in Figure 2.1. There are six different types of quarks and leptons. Each has a corresponding antiparticle, which is not included in the diagram. These antiparticles are similar to their corresponding particles, except that they have the opposite electric charge. The exception are the neutrinos, which do not carry an electric charge. Quarks and leptons are spin-1/2 particles, so-called fermions. In addition, six bosons are included in the SM: five spin-1 gauge bosons and the spin-0 Higgs boson. The five force-carrying gauge bosons and their interactions are described in the following.

2.1 Interactions

Apart from gravitation, which is not included in the SM, there are three fundamental interactions. Particles can interact via the strong, the weak and the electromagnetic force. The forces can be described by fermions interacting with each other by exchanging gauge bosons, the force carriers.

These three interactions all conserve the energy, the momentum, the angular momentum, the electric charge, the weak isospin, the colour, the baryon number, and the lepton numbers.

Figure 2.1: The elementary particles of the Standard Model of Particle Physics, including categorisations in quarks and leptons (in three generations) and bosons. The letter c in the upper right corner indicates particles carrying a colour charge. The indicated masses represent results from measurements.

2.1.1 Weak interaction

The weak force couples to the weak isospin, so charged as well as neutral leptons interact via it. The force carriers of the weak interaction are the W± and Z0 bosons. The high mass of these gauge bosons (80.385 ± 0.015 GeV for the W± and 91.1876 ± 0.0021 GeV for the Z0 boson [1]) leads to a short lifetime, which explains the short range of the weak force. The weak interaction violates the parity (P) and the charge-parity (CP) symmetry. One example of a weak interaction is the β− decay, the Feynman diagram of which can be seen in Figure 2.2. It shows the coupling of a W− boson to quarks as well as leptons. The down quark is converted into an up quark by emitting a W− boson, which decays into an electron and an electron-antineutrino. The weak interaction couples only to left-handed fermions and right-handed antifermions. This is explained by the V−A theory, which introduces a vector minus axial vector Lagrangian for the weak interaction.


Figure 2.2: Feynman diagram of the beta minus decay, transforming a neutron into a proton (down into up quark) via the weak interaction. This process is common in unstable atomic nuclei (excess of neutrons) or free neutrons. The released electron can be identified as beta-radiation.

2.1.2 Electromagnetic interaction

The electromagnetic interaction is mediated by the massless photon and couples to the electric charge. The range of this interaction is infinite, which makes the electromagnetic force (apart from gravity) the predominantly observed force in the macroscopic and visible universe.

2.1.3 Electroweak unification

At high energies, above the electroweak unification energy of around 246 GeV, the weak and electromagnetic interactions appear as one interaction. The electroweak unification theory by Glashow, Salam and Weinberg [2][3][4] describes how these two interactions are manifestations of the same force. According to this theory, the gauge bosons would have to be massless, which is in contradiction to the heavy mass of the W± and Z0 bosons. This conflict is solved by the Higgs mechanism described in Subsection 2.1.6.

2.1.4 Strong interaction

The strong force is mediated by massless gluons and described by quantum chromodynamics (QCD). The gluons couple to the colour charge, which exists in three different versions (red, green, and blue). Each gluon carries a colour and an anticolour. This leads to eight different gluons, according to the combinations that are possible with three different colours and anticolours. The range of the strong interaction is short due to the self-interactions of the gluons. At very short distances or high energies, these self-interactions weaken the strong force (asymptotic freedom).


At long distances or low energies, the interaction becomes very strong, leading to the so-called confinement of the quarks.

2.1.5 Quarks and leptons

In the SM, quarks and leptons are grouped into three generations, with the same properties apart from the mass. The quark doublet of each generation consists of an "up-like" quark (up-, charm-, top-quark) with an electric charge of +2/3 and a "down-like" quark (down-, strange-, bottom-quark) with an electric charge of −1/3. In addition to the electric charge, quarks also carry colour charge and weak isospin. Hence, they interact via the electromagnetic, weak and strong interactions. They can change their flavour via the charged weak interaction. The probability of these decays is described by the unitary Cabibbo-Kobayashi-Maskawa matrix (CKM matrix). Flavour changing neutral currents are suppressed by the Glashow-Iliopoulos-Maiani mechanism. They have been observed for the first time at the Collider Detector at Fermilab (CDF) [5]. The electrically charged leptons of the three generations are called electron, muon and tau. Since they also carry weak isospin, they interact via the electromagnetic and the weak interactions. For each of the charged leptons there exists an electrically neutral neutrino. The masses of the neutrinos are found to be extremely small. They carry only weak isospin and thus interact only via the weak force, which makes them difficult to detect. An explanation for the small mass could be that neutrinos are their own antiparticles (Majorana particles), which is possible because they have no electric charge.

2.1.6 The Higgs mechanism

The problem of the unexplained high mass of the W± and Z0 bosons, which breaks the gauge symmetry, can be solved with the Higgs mechanism. The Higgs mechanism gives mass to the gauge bosons of the weak interaction without introducing mass terms that are inconsistent with local gauge invariance. It can be explained by a Higgs field that is present everywhere. The Lagrangian for this field is:

\mathcal{L}_{\mathrm{Higgs}} = (\partial_\mu \Phi)^\dagger (\partial^\mu \Phi) - V(\Phi) \qquad (2.1)

The Higgs potential can be described by:

V(\Phi) = \mu^2 \Phi^\dagger \Phi + \lambda (\Phi^\dagger \Phi)^2 \qquad (2.2)

where λ has to be positive and µ is not constrained. For µ² > 0, the potential has only one minimum at zero, preserving the symmetry. For µ² < 0, an infinite number of minima exists and the choice of the physical vacuum expectation value spontaneously breaks the symmetry. The asymmetry of the vacuum ground state can be illustrated by looking at the Higgs potential, which has the shape of a Mexican hat, illustrated in Figure 2.3. The spontaneous breaking of the symmetry caused by this asymmetry gives mass to the W± and Z0 bosons. Because the photon is required to remain massless, the remaining degree of freedom of the Higgs field can be identified with the Higgs boson.
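As a brief illustration, minimising the potential in Equation (2.2) for µ² < 0 gives the vacuum expectation value of the Higgs field (a standard textbook step):

\frac{\partial V}{\partial(\Phi^\dagger\Phi)} = \mu^2 + 2\lambda\,\Phi^\dagger\Phi = 0
\quad\Rightarrow\quad
\Phi^\dagger\Phi\,\big|_{\mathrm{min}} = -\frac{\mu^2}{2\lambda} \equiv \frac{v^2}{2},
\qquad
v = \sqrt{-\frac{\mu^2}{\lambda}} \approx 246~\mathrm{GeV}.

The value v corresponds to the electroweak scale quoted in Subsection 2.1.3; picking one particular minimum out of the degenerate set is what breaks the symmetry spontaneously.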


Figure 2.3: Graphical representation of the Higgs potential, where µ² < 0.

The Higgs boson has been discovered by the ATLAS [6] and CMS [7] collaborations at the LHC [8][9]. It has a mass of 125.09 ± 0.21 (stat) ± 0.11 (syst) GeV [10].

2.2 Beyond the Standard Model

There are phenomena that are unexplained by the SM; a few of them are stated briefly in the following. Over one quarter of the energy content of the universe is hypothesised to be made of dark matter. Dark matter has not been observed directly, hence the name. Strong indicators of its existence are the measurements of the cosmic microwave background and the rotational speed of galaxies, which according to calculations would need to have much more mass than is visible. There is no SM candidate which could account for all the dark matter in the universe.

Similarly, there is no SM candidate for dark energy. Dark energy explains the observed acceleration of the universe's expansion. These observations lead to the conclusion that about two thirds of the universe consists of dark energy. It is hypothesised to exist throughout all of space, but has never been measured directly.

Moreover, the matter-antimatter asymmetry cannot be explained by the SM. The big bang should have created equal amounts of matter and antimatter, yet more matter than antimatter is observed. One explanation would be the charge parity (CP) violation in

weak interactions. The measured CP violations are, however, not big enough to explain the extent of the existing asymmetry.

Furthermore, gravity could not yet be included in the SM, but plays an important role at very high energies.

CHAPTER 3

The ATLAS detector

3.1 LHC

3.1.1 CERN

CERN, the European Organization for Nuclear Research, is home to multiple linear¹ as well as circular² particle accelerators. The Large Hadron Collider (LHC) is the biggest amongst them. CERN was founded in 1954 and is located close to Geneva in Switzerland, spanning across the border into France. Amongst its major achievements are the discovery of the W and Z bosons, as well as the invention of the World Wide Web. CERN is funded by contributions from its 22 member states. There are around 2500 staff members employed by CERN, only 3% of which are actually research physicists. The staff is mostly constructing, maintaining and running the machines and experiments. The ratio of physicists is much higher when looking at the over 13000 associated CERN members, who come mostly from international collaborations. They do not have to be based at CERN. The distributed computing infrastructure, see Chapter 4, enables physics analysis to be performed remotely, from anywhere in the world.

3.1.2 Machine specifics

New accelerators in particle physics are pushing the energy levels higher, in order to detect heavier unknown particles and processes. There are several possibilities to reach these energies, the first one being a better acceleration. Acceleration is achieved through electric fields, which only work on electrically charged particles. The more energy a particle should have, the longer it has to be accelerated.

¹ E.g. LINAC3
² E.g. PS, SPS


The size and therefore the material costs are a limit when considering linear colliders, where the length determines the energy. In a circular collider, the acceleration length is increased by having the particles go around the same ring multiple times. This is limited by the circular track on which the particles have to be kept. With increasing energy, either the magnets keeping the particles on the circular orbit through the Lorentz force [11] have to be stronger, or there has to be a larger radius, meaning less curvature, which in turn increases the size. Another possibility to achieve higher energies is to use heavier particles. The LHC uses protons, which in contrast to electrons lose less of their energy via synchrotron radiation when being accelerated by the bending magnets [12]. The LHC itself is a proton-proton collider consisting of two circular beam-pipes with a length of 26.7 km. Within these beam-pipes, proton bunches consisting of around 10¹¹ protons are accelerated in opposite directions. There are four interaction points, where the beams cross paths and where the protons can collide with each other. The four big experiments ALICE, ATLAS, CMS and LHCb are built around these interaction points, in order to capture the resulting particles. The design similarities between ATLAS and CMS exist purposely to verify and cross-check results. One of the goals of the LHC is to generate physics processes beyond the Standard Model [13].
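The interplay between bending field and ring radius can be quantified with the standard relation for a singly charged particle on a circular orbit:

p = e\,B\,r \quad\Longleftrightarrow\quad p~[\mathrm{GeV}/c] \approx 0.3 \cdot B~[\mathrm{T}] \cdot r~[\mathrm{m}],

so for a fixed ring size the reachable momentum grows linearly with the magnetic field strength, and for a fixed field it grows linearly with the radius.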

Event rates

Apart from the energy, another important characteristic of an accelerator is its luminos- ity. Especially when trying to get statistically significant data, as well as when trying to observe rare processes, the amount of collected data, the integrated luminosity, becomes important. A higher luminosity means more collisions per bunch crossing. As a consequence, the background noise in the collected data, called ‘minimum-bias’, increases [14]. In the end, there is a trade-off between luminosity and background [14]. The High-Luminosity LHC (HL-LHC) has been approved and will be commencing operations in 2026. It will increase the design luminosity by a factor of five [15]. This can be achieved by reducing the beam size, and the bunch distance and length, and by having bunches with more protons. The LHC does not consist of only one ring, but is part of a system of pre-accelerators that supplies the LHC with high velocity protons. A schematic view can be seen in Figure 3.1. The starting point of the acceleration is a bottle of hydrogen gas. First, the hydrogen is stripped of its electrons. The resulting protons are then injected into the LINear ACcelerator LINAC 2 that is close to the bottom in Figure 3.1. At the end of this acceleration, the protons have an energy of 50 MeV. Following the path in the picture, the next acceleration step happens in the Booster in front of the Proton Synchrotron. This Booster accelerates the protons until they reach an energy of 1.4 GeV.


Figure 3.1: Schematic view of the LHC accelerator complex.

From there, they pass into the Proton Synchrotron (PS), which can accelerate the protons to 25 GeV.

Afterwards, the protons are accelerated further in the Super Proton Synchrotron (SPS), where the W and Z bosons were discovered. In there, they are accelerated up to 450 GeV, before reaching the final accelerator, the LHC.

The LHC accelerates the protons to their maximum energy of 7 TeV.

It can be said that the previous generations of CERN accelerators act as pre-accelerators for the newest collider.


3.2 ATLAS

With 46 m in length and 25 m in diameter, A Toroidal LHC ApparatuS (ATLAS) [6] is the biggest experiment at the LHC. It is a multi-purpose detector that was designed over a period of 15 years specifically to match the conditions of the LHC. It is built in several layers around the interaction point, as can be seen in Figure 3.2.

3.2.1 Detector components

In order to determine which particles have been created by a collision, without having the possibility to measure them directly, most resulting collision and decay products have to be measured by the detector. The only exception are neutrinos, which only interact via the weak force and therefore pass through matter almost unobstructed [1]. The ATLAS detector cannot detect them directly; the presence of neutrinos is inferred from the missing transverse energy and the missing transverse momentum. There is no favoured direction in the transverse plane, so the transverse momenta and energies of all collision products should add up to zero. The neutrino is therefore detected only indirectly.

Figure 3.2: Schematic view of the ATLAS detector, taken from [6].

In Figure 3.2 it becomes clear that the ATLAS detector encompasses almost the entire space around the interaction point. The only exception is the beam pipe from and to which the particles come and go.


A large part of the detector consists of magnets. The magnetic field, in which the detector is immersed, helps to identify the momentum and charge of a particle: the Lorentz force bends the track of a charged particle into a circular shape. The figure illustrates that the detector consists of several different layers. The layers are built in such a way that low-energy particles, which interact strongly with the detector materials, are detected first. Less strongly interacting particles can traverse these innermost layers almost unaffected.

3.2.2 Inner detector

The Inner Detector (ID) is used for pattern recognition, momentum and vertex measurements, and electron identification. The strength of the magnetic field in the ID is ∼ 2 T. The innermost part is made of the pixel detector and the semiconductor tracking (SCT) detectors, the latter consisting of silicon microstrips [16]. They are constructed in a way that at least four layers of strips and three layers of pixels are crossed by each particle. Further out, the inner detector contains the Transition Radiation Tracker (TRT), which is a straw tube tracker that allows a continuous track-following [16].

3.2.3 Calorimeters

The calorimeters determine the energy of the particles, which is deposited entirely in the calorimeter, and generate an output signal that is proportional to the deposited energy. To do so, calorimeters consist of two types of layers, namely an absorber and an active material that produces the output signal. There are multiple processes by which particles react with matter. Depending on the energy, different processes become more or less influential on the absorption, as can be seen in Figure 3.3. The particles entering the calorimeter generate a particle shower that has to be contained in the calorimeter. The absorption strength and the particle energy determine the size of the calorimeter. In ATLAS, two different kinds of calorimeters are used. The first one is the electromagnetic calorimeter, which can detect electrons and photons. The second type is the hadron calorimeter, which detects pions, protons, kaons, and neutrons. How different particles can be distinguished is shown in Figure 3.4. The types of particles shown leave distinctive signatures within the different detector components/calorimeters. A further differentiation between particles with similar signatures, such as protons and kaons, can be done by looking at additional parameters, such as the ionisation energy loss.

Electromagnetic calorimeters

In ATLAS, the electromagnetic calorimeter uses Liquid Argon (LAr) as its active material [18]. Lead is chosen as the absorber. The size corresponds to over 22 radiation lengths in order to prevent electrons and photons from reaching the next detector layer, where they might not be detected.

Figure 3.3: Fractional energy loss per radiation length in lead as a function of electron or positron energy. Electron and positron scattering is considered as ionisation when the energy loss per collision is below 0.255 MeV, and as Møller (Bhabha) scattering when it is above. X0 is the radiation length and E the particle energy. (Adapted from Fig. 3.2 from [17]. Messel and Crawford use X0(Pb) = 5.82 g/cm², here the figure reflects the value given in the Table of Atomic and Nuclear Properties of Materials (X0(Pb) = 6.37 g/cm²) [1].)

Hadronic calorimeters

The hadronic calorimeters consist of LAr and Tile calorimeters. The LAr calorimeter uses tungsten and copper as absorbers. In the Tile calorimeter, steel is used as the absorber and scintillating tiles made out of plastic as the active material [19].

3.2.4 Muon spectrometer

The muon spectrometer consists of four parts, which detect muons without absorbing them completely. The Monitored Drift Tubes (MDT) measure the curvature of the tracks [20]. In addition, Cathode Strip Chambers track the position in the end caps [20]. The Thin Gap and Resistive Plate Chambers provide the required trigger information, see Subsection 3.2.5 [20].

Figure 3.4: Schematic representation of different particle tracks within the ATLAS detector. Electrically neutral particles are not detected in some parts of the detector, which is represented by a dashed line.

3.2.5 Trigger and data acquisition

The Trigger and Data AcQuisition (TDAQ) systems are needed in order to collect all the relevant data of the particle collisions. While running, the detector parts described above deliver a constant stream of event data. However, due to monetary and technological constraints, it is not possible to process and store every collision. In order to reduce the amount of data coming out of the detector, the trigger system is in place. It reduces the data rate from the design rate of 40 MHz to a few hundred Hertz [21][22].


This large reduction is possible because many bunch crossings do not contain collisions relevant for physics analysis. These can be, for example, scattered protons or already well-known processes that do not have to be examined again. The trigger system consists of several levels that perform increasingly complex decisions on whether to store the event data. The level 1 trigger reduces the incoming event rate to around 100 kHz [22]. The high trigger speed is achieved by looking at only a subset of the data from the detector components; this first filtering level is hardware based. The level 2 trigger further analyses the regions of interest that were indicated by the level 1 trigger. The events that pass from the level 2 trigger to the next trigger level have been reduced to a frequency of around 1 kHz [22]. The level 3 trigger analyses the full event data and reduces the event rate to a final frequency of a few hundred Hertz.
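Taken together, the quoted rates correspond to an overall rejection factor of roughly

\frac{40~\mathrm{MHz}}{\mathcal{O}(100)~\mathrm{Hz}} \approx 10^{5},

i.e. only about one in a hundred thousand bunch crossings is kept for offline processing.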

CHAPTER 4

LHC offline computing

“I do not fear computers. I fear the lack of them.” - Isaac Asimov

In the title of this chapter, a distinction between two different computing concepts is hinted at - online and offline. Offline computing encompasses all data processing from the raw-data input buffer to the result of a physics analysis, whereas online computing encompasses the computing up to that point, meaning triggering and data acquisition (see Subsection 3.2.5). Within this thesis, computing refers to the offline computing unless stated otherwise.

The computing at the LHC faces a drastic increase in required resources. It is difficult to predict what the exact needs will be, as they highly depend on the luminosity, the pileup and the LHC performance. The LHC performance can be characterised by the fraction of time within a year in which data is taken. The HL-LHC is estimated to increase the resource requirements manyfold. The predictions for the year 2016, for example, were wrong, because the LHC performed above expectations, with a luminosity above the design level and a very high availability. In the end, additional computing resources, especially storage, had to be made available.

Moore's law [23] successfully predicted the technological evolution of processors. Indeed, not only have processors been improving, but most components of modern computers, like disks, RAM, networks, etc., have become better and faster over time [24]. This technological evolution is one of the driving reasons why, for the same budget, better infrastructure can be bought at a later point in time.

In terms of computing, the increased luminosity means that there will be an estimated factor of ∼12 more data, and the required CPU power will increase by a factor of ∼60,

compared to 2016¹. In contrast, the budget for computing is flat. With the current technological growth of around 20% per year, the additional requirement for computing power boils down to a factor of ∼10.
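The factor of ∼10 follows from a rough back-of-the-envelope estimate, assuming the ∼20% annual technology growth is compounded over the roughly ten years between 2016 and the HL-LHC era:

1.2^{10} \approx 6.2, \qquad \frac{\sim 60}{6.2} \approx 10,

so the capacity gained from technology alone falls short of the required CPU growth by about an order of magnitude.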

In order to close the gap between estimated and available computing resources, either the software or the hardware has to become better. In this thesis, only infrastructure improvements are considered, whereas software improvements, such as more efficient algorithms, are ignored. The only overlap between the two domains would be the scheduling. It is, however, a combination of workflow (see Chapter 5) and hardware configurations (see Chapter 8) and therefore falls into the infrastructure improvement domain. This scheduling technique is an example of the positive impact of the evolving technology of Cloud computing described in Section 4.2.

In the end, there will most likely not be one miraculous solution that can mitigate the discrepancy. It is expected that a combination of many smaller improvements, like the introduction of Cloud computing, will solve the issue. The problem with Cloud computing so far is that the gains are difficult to quantify, as there are many factors that have to be considered. In this thesis, the issue is addressed by developing and applying a model that is able to directly compare different Cloud computing offers with the Grid computing concept, see Chapters 6 and 7.

4.1 Distributed and Grid computing

A key concept for the LHC data processing is distributed computing. Distributed computing consists of a system of dispersed computers and computing centres that are interconnected through a network and unified by a high-level system with transparent components [25]. From a physicist's point of view, it does not matter where in the world the data pre-processing happens. The location where a physicist's analysis is computed is also of no importance, as long as the correct results are delivered within a reasonable time. However, the main reasons why the computing was distributed were of a sociological and not a technical nature, namely the fact that computing investments are local. This means even small institutes contribute to the computing through their clusters, which would not have happened otherwise. In addition, there are benefits in providing training for students, while being able to leverage the resources for other local uses [26]. A more in-depth look reveals that the scenario profiting the most from distributed computing is the one in which complex problems can be broken down into several smaller problems. This applies to High Energy Physics (HEP) where, as we have seen before, each detector event is a small sub-problem. In practice, multiple events are computed in one process or workflow and multiple workflows are then computed in parallel. This

¹ According to the presentation of Ian Bird at the 2016 WLCG computing workshop in San Francisco.

parallel computing is equivalent to an increase in computing power, which results in a faster solution. An example would be the processing duration of 10000 events on one computer: it can be roughly cut in half by having a computer that is twice as fast, or by processing the events on two computers of the same speed. Easy scaling by adding computers is one of the benefits of distributed computing. Another benefit is that resources can be shared between different groups around the globe, achieving less idle CPU time. This also adds a kind of reliability to the distributed computing, as there is no single point of failure. Finally, it might be cheaper than buying one powerful computer. More on the benefits, challenges, disadvantages and implementations can be found in Section 4.4.
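The event-level parallelism described above can be sketched in a few lines of Python; the event count, the number of workers and the per-event function are illustrative placeholders and not part of any ATLAS software.

```python
from multiprocessing import Pool

def process_event(event_id):
    # Placeholder for the real per-event work (reconstruction, analysis, ...).
    return event_id % 7

if __name__ == "__main__":
    n_events = 10000   # the example used in the text above
    n_workers = 2      # two equally fast computers/cores roughly halve the wall time
    # Each detector event is an independent sub-problem, so the events can be
    # distributed across the workers without any communication between them.
    with Pool(processes=n_workers) as pool:
        results = pool.map(process_event, range(n_events))
    print(f"processed {len(results)} events with {n_workers} workers")
```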

4.2 Cloud computing

“Cloud computing is the third wave of the digital revolution.” - Lowell McAdam

Cloud computing might be a solution to the impending resource shortage. Since the emergence of Cloud computing offers from different commercial companies on the open market, the prices have been falling. The most extreme price cuts happened in the early years, between 2011 and 2015, when the big providers cut their prices roughly in half. This development has slowed down and changed since then, for example with providers offering better hardware for the same prices. One factor is of course the technological evolution that was already mentioned earlier. A white paper by ESG Labs, commissioned by Google, states that Google will pass on price reductions from technology-driven advancements to all customers [27]. A big Cloud company can build computing centres that are multiple times bigger than what each individual customer would have to build. This bigger scale reduces the overall operational and even infrastructural costs, making Cloud computing more profitable for both sides. Another factor is the competition on the market between the individual providers. Since many vendors are selling similar products and competing with their prices, the Cloud market could almost be considered to be in "perfect competition"², as described in [28]. This has downsides, especially for the customer. In order to undercut the competition, the offer of a provider only has to seem as if it is better. This leaves room for Cloud providers to deceive customers, for example by making their Cloud seem cheaper through hidden fees. Another possibility is that the provider supplies a lower computing performance, which is difficult to figure out from the customer side. One possible solution to this problem could be a universal Cloud service certification that would make the different offers comparable, as suggested in [29]. The real cost and performance of Cloud providers are evaluated in Chapter 7. A different prediction of the market behaviour is that the pricing pattern follows longer periods of stable prices, with price wars between providers at certain points [30].

² The opposite of a monopoly.


4.2.1 Concept

The idea of a huge network of interlinked computers has existed since as far back as the 1950s and 1960s³. Selling computing as a commodity is the next logical addition to this picture, but only with the growth of the global networking infrastructure did it become achievable, especially for individual users. The real motors behind the sudden emergence of Cloud computing were big online companies. After realising the extent of their unused infrastructure, which is designed to be able to handle peaks in demand, Cloud computing must have seemed like a lucrative solution, especially since the hardware had already been purchased and was constantly running. Cloud computing is a broad field, and the boundaries between what is considered Cloud computing and what is not are blurry. Even years after the emergence of Cloud computing, there was no clear-cut definition. A summary was attempted in 2008 that can be paraphrased as: Clouds are many easy-to-use and easy-to-access virtualised resources [31]. These are adjustable and configurable to a varying workload on a pay-as-you-go model.

The NIST definition of Cloud computing, which appeared later, is more specific and is therefore the one adopted in this thesis. It is not too different from what was previously found. "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [32]. The main characteristics according to this definition are on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service [32]. Even though the Cloud providers that are discussed later on fulfil these criteria, there will be limitations. These can be found especially in the rapid provisioning and releasing of resources. The NIST definition divides Cloud computing into three service models. The first one, Software as a Service (SaaS), means that a customer can run a provided application (on a provided Cloud infrastructure). Platform as a Service (PaaS) is the second service model, which provides more freedom to the customer to deploy their own applications. These two models are not viable for most computing in high energy physics, as a high level of freedom to control the underlying operating system and storage is required. This is achieved via the Infrastructure as a Service (IaaS) model [32]. In this thesis, when not explicitly stated otherwise, Cloud computing refers to IaaS.

It makes sense to distinguish between three different kinds of Clouds: private, public and hybrid. The private Cloud is deployed within the organisation, reducing the security exposure and legal ramifications that may result from the usage of outside resources.

³ For example in literature: the short story "Answer" by Fredric Brown.


In a public Cloud all the infrastructure is located at the third party provider and provisioning happens over the Wide Area Network (WAN). A hybrid Cloud combines the two previous concepts, by having Cloud resources on- and off-premise, combining them through common technology [33][34][32]. A popular use case is called bursting [35], meaning to absorb peaks in resource demands. An example of an online shop using bursting can be found in Subsection 4.2.2. The workload in HEP could benefit from bursting, as it follows a pattern that has peaks in demands, especially before high-profile physics conferences. In that time frame, everyone wants to include the latest and most recent results in their presentations.

The NIST definition does not explicitly mention the model in which Cloud infrastructure is procured statically, meaning for example purchasing X machines over a period of Y months. It technically falls within the IaaS model. The reason it is mentioned here is because of certain boundary conditions for procurements within some organisations, like CERN. First of all, for these organisations the only possibility to procure something is for a fixed amount of funds, so the pay-as-you-go scenario is not possible. Furthermore, for procurements above a certain threshold it is mandatory to have an open tendering phase, during which companies can bid for the contract. In order to compare offers, the whole procurement has to be well defined in advance. On the upside, studies have shown that these kinds of procurements are more cost effective, because the Cloud providers can also plan their resources better. The Amazon prices for these types of resources, for example, can be significantly cheaper: Amazon states that a discount of up to 75% is possible by purchasing reserved instances instead of the on-demand ones. Other providers also offer discounts and incentives to commit to a certain amount of resources beforehand, which will be discussed in further detail in Subsection 4.2.2.
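The effect of committing to resources in advance can be illustrated with a simple break-even estimate; the hourly price and the 75% discount used below are assumptions for the sake of the example, not quotes from any specific provider.

```python
# Hypothetical break-even estimate: when does a committed ("reserved") instance
# become cheaper than paying on demand only for the hours actually used?
on_demand_price = 0.10                   # assumed EUR per instance-hour
reserved_price = 0.25 * on_demand_price  # "up to 75%" cheaper, but paid for every hour

for utilisation in (0.10, 0.25, 0.50, 1.00):
    on_demand_cost = utilisation * on_demand_price  # pay only for the used fraction
    cheaper = "reserved" if reserved_price < on_demand_cost else "on-demand"
    print(f"utilisation {utilisation:4.0%}: {cheaper} is cheaper per hour")
```

With these assumed numbers, committing pays off as soon as the instances are busy for more than about a quarter of the time.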

This broad Cloud computing definition corresponds to offers from companies such as Amazon, Google, Microsoft, etc. There are differences in what the companies offer and how they offer it. As a customer, the most important thing is to understand these differences and how they impact the performance and overall cost. There are some downsides, since the hardware that one receives from a provider is more or less a black box. One of these downsides is drops in performance due to overallocated resources. An analogy would be the overbooking of an aeroplane: as long as a sufficient number of customers do not make the flight, the airline increases its income; if too many passengers show up, there will be negative consequences. The overallocation is possible due to the virtualisation of physical machines. One example for this is hyperthreading, where two logical processors share one physical processor. Cloud providers could take this even further and overcommit much more. Up to a certain degree this makes sense. Even though customers procure Cloud computing on a per-usage model, they still do not use 100% of what they buy. There are inefficiencies within the applications, such as peaky demand or I/O intensive phases. In addition, suboptimal usage can

stem from overheads and possibly the procurement of more resources than needed. An example would be a customer that cannot accurately predict their own workloads. Overcommitment strategies have been studied, see for example [36], where most workflows used less than 20% of the available CPU capacity, making overcommitment viable even when large safety margins are applied. There are even examples of Cloud providers that do not own hardware themselves, but instead procure their hardware from another Cloud provider. Dropbox is an example of such a provider.
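A small capacity estimate illustrates why such overcommitment is attractive for a provider; the host size and the safety margin below are assumptions, only the sub-20% average usage is taken from the study cited above [36].

```python
# Rough estimate of how many virtual CPUs could be placed on one host whose
# tenants use, on average, only a small fraction of their allocated CPU.
physical_cores = 32    # assumed host size
average_usage = 0.20   # most workflows used less than 20% of their CPU capacity [36]
safety_margin = 2.0    # keep the expected load at half of the physical capacity

virtual_cpus = int(physical_cores / (average_usage * safety_margin))
print(f"{virtual_cpus} virtual CPUs on {physical_cores} physical cores "
      f"-> overcommit ratio {virtual_cpus / physical_cores:.1f}:1")
```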

4.2.2 Pricing

One of the big advantages of Cloud computing is the pricing model. The prices that the providers publish are given as X euros per hour per instance. Therefore, the cost of one server for a million hours is the same as the cost of a million servers for one hour. In practice, the infrastructure of a Cloud provider is not infinite, so this may not hold on large scales. For highly parallelisable workflows this becomes very attractive, as the results can be returned in an instant at no additional cost. Conversely, the Cloud provider benefits from the economy of scale. This effect can be visualised by the fact that the personnel needed to run bigger clusters scales at less than a 1:1 ratio: doubling the size of the cluster needs fewer than twice the employees. Also, the unit price for hardware, for example, drops for a larger bulk order.

In the early stages, however, Cloud computing was a result of companies renting out their spare resources. An example of companies with unused resources are online shops, whose needs in the period leading up to Christmas are many times greater than during the rest of the year. In order to not lose business during these profitable times, their infrastructure must be able to handle these circumstances. As a consequence, these companies have a computing capacity that is many times larger than what they need outside of these rarely occurring peak scenarios. This spare capacity lies idle most of the time. The reason why Cloud computing is attractive from the seller's perspective should now be clear. The same example can be used to explain why it can make sense for a buyer to acquire Cloud resources. An online shop that is not very big cannot afford to invest much money in its infrastructure just to be able to deal with the mentioned peaks. It would therefore lose business, were it not for Cloud computing. Now the shop can buy some on-demand resources during peaks and discard them afterwards, without having to invest heavily in otherwise unneeded infrastructure. Of course these two examples do not work together, as both stores would have computing troubles during the same time, but Cloud providers and customers are not limited to online shops.

Cloud pricing is on a downward trend, as mentioned at the beginning of Section 4.2. Due to the complex pricing system and the large amount of offers, it is difficult to quantify just how fast this trend is. Indeed, it is difficult to estimate how much a customer would pay at each of the providers and which Cloud offer would be the best or cheapest. There are several tens of thousands of price points per provider. In an

article on InfoWorld, only a small subset of price points for each provider is compared, as a comparison between providers is close to impossible otherwise. Even after narrowing these price points down, there is not one provider that wins out over the others, as the use case strongly influences the result. This is exemplary of the challenge that has to be tackled when comparing the Cloud to the WLCG. An estimation of the price developments can be seen in Figures 4.1 and 4.2. The first figure indicates that the average annual price reductions lie at around 10% over the last eight years. This is less than would be expected from Moore's law, but a Cloud site also has other costs than only hardware [37].

Figure 4.1: EC2 price reductions, with the on-demand payment model. Annual price reduction since their public release. Source: [37]

In Figure 4.2, the history of these price drops can be seen. It becomes apparent that the reductions are not a smooth process, but consist of jumps that are triggered by, for example, market competition. The biggest drop happened after Google lowered their fees significantly [37]. Another observation that can be made is that the downward trend has slowed. The Cloud price development becomes important later in the thesis, when evaluating the viability of moving to the Cloud, see Chapter 9. The previous figures showed the different kinds of infrastructures that are on offer, but there is more to the pricing complexity. The simple choice of operating system, for example, can change the pricing structure, as different operating systems are priced differently. In addition, the same operating system is priced differently depending on the provider. More impactful factors are the various discounts that the Cloud providers offer, which can lead to a price reduction of up to 75%. Another factor is the geographical region in which the data centres are located, leading to a difference in pricing of up to 50%. In the end, every use case has to be evaluated separately.

Figure 4.2: The Amazon EC2 historical price development. Displayed are the prices per hour, gauged by the EC2 Compute Unit. This unit is a relative indicator of the compute power. Source: [37]

Another substantial cost factor that a potential Cloud customer should not neglect is inefficiency. In an article published on Rightscale, it is stated that ten billion US dollars were wasted on unused Cloud resources in the year 2017 alone. Two of the biggest sources of this waste are named as overprovisioning of instances and idle instances. This gives an indication that inefficiencies are a general, large scale problem. Independent of these numbers, idle CPUs or even virtual machines (VMs) can also be observed in the CERN data centre. These inefficiencies are causing providers to use optimisations, such as overcommitting. The overcommitting concept is used later in this thesis, see Chapter 8. The consequences of these optimisations can be drops in performance, as well as fluctuating performance, which is examined in Chapter 7.

4.2.3 Storage

The storage plays an important role, as it has a large impact on the overall cost and personnel requirements of a computing centre. Several factors have to be considered, the most important one being the impact of the storage on the overall processing time.


Whenever a CPU is idle because it is waiting for data input, resources, which in this case can be translated directly into money, are wasted. The CPU waiting time depends on the speed of the storage as well as on the distance to the storage.

In order to store data, there are two possibilities. The first one is to procure storage space within the Cloud. This is the most intuitive approach as it mimics the current scenario of Grid sites, which provide internal storage space. The benefits are for example that the required input data of workflows can be placed close to the CPUs. This leads to a reduction in latency and typically the bandwidth is higher within a site than across different sites. In the end this results in a faster workflow execution. Furthermore, having storage available offers the possibility to store intermediate workflow products. More importantly, in case it is not possible to directly write to the remote storage, for example when there is little-to-no bandwidth available, the workflow outputs can be stored inside the Cloud and the VM can continue processing data. In addition, some workflows have common input data. For those, every subsequent workflow can exploit an existing data locality. The benefit in speed is achieved without increasing the required storage space, and therefore without increasing the cost. The WAN traffic would also be greatly reduced, as the data have to be transferred only once. In the case of ATLAS, an example that follows this usage pattern would be the pileup, see Subsection 5.3.3.

Conversely, there are downsides, which for the most part manifest themselves in the cost. They therefore depend highly on the cost models of the providers. The pay-as-you-go model for storage means one usually pays for the amount, duration, and type of used storage space. Sometimes this also includes data egress, ingress, and/or Input/Output Operations per Second (IOPS). Depending on the use case, the cost model of the provider can make the Cloud storage procurement completely unfeasible. An example would be analysis data, see Section 5.4, stored in a Cloud that charges for IOPS. A popular analysis dataset is accessed around the clock. The cost of this scenario would by far outweigh the benefit of the gain in speed.
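To make the structure of such a pay-as-you-go bill concrete, the following sketch adds up the three charge types named above for one month. Every price in it is a hypothetical placeholder, not a quote from any provider.

    # Hypothetical pay-as-you-go storage bill for one month.
    stored_tb = 500                    # average amount of data kept in the Cloud
    egress_tb = 50                     # data transferred out of the Cloud
    io_requests = 2.0e9                # read/write operations during the month

    price_per_gb_month = 0.02          # storage price (invented)
    price_per_gb_egress = 0.08         # egress price (invented)
    price_per_million_requests = 0.40  # IOPS-related price (invented)

    bill = (stored_tb * 1000 * price_per_gb_month
            + egress_tb * 1000 * price_per_gb_egress
            + io_requests / 1.0e6 * price_per_million_requests)
    print(f"monthly storage bill: {bill:.0f} USD")   # -> 14800 USD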

The second possibility is to not make use of the storage offers of the Cloud providers. Instead, the required inputs are either copied to the local disk of the VMs at the beginning of each workflow or read on the fly, during the workflow execution. These capabilities are already in place and actively used by multiple LHC experiments. An example of the latter would be the “Any Data, Anytime, Anywhere (AAA)” service of CMS. It is an XRootD (see Subsection 4.5.1) service that enables CMS data to be read remotely. This helps to distribute the CMS workload to sites that do not have the necessary input data. The money that would have been spent on storage can then be invested in different aspects of the infrastructure. An example would be to invest in better bandwidth and receive a similar performance for a reduced cost, compared to investing in storage. Another possibility would be to acquire more computing power, or a combination of the two.

In the case of workflows that read different input data, such as raw data reconstruction, the input has to be transferred into the Cloud exactly once. In terms of WAN transfers, this scenario would not benefit from additional storage within the Cloud.

The downside to not having data stored within the Cloud is that the bandwidth may become a bottleneck. Generally speaking, today's bandwidths do not scale sufficiently for very large input datasets. “Too large” refers not only to the size, but also to the speed at which the data are processed, since a faster data processing requires more input data in the same amount of time. This can be illustrated by the Amazon “Snowmobile”, a service that transfers up to 100 PB of data by physically transporting storage media on a large truck from the customer's data centre to the Cloud data centre. Other experiments which highlight the WAN's limitations tested the data transfer via the internet versus a pigeon carrying a USB stick. In that experiment, the pigeons came out as viable alternatives [38].
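A back-of-the-envelope calculation shows why services like the Snowmobile exist. The link speeds below are illustrative assumptions; only the 100 PB figure is taken from the text.

    # How long does it take to move 100 PB over a dedicated network link?
    data_bits = 100e15 * 8                    # 100 PB expressed in bits

    for gbps in (10, 100):
        seconds = data_bits / (gbps * 1e9)    # assuming the link is fully saturated
        print(f"{gbps:>3} Gb/s link: {seconds / 86400:.0f} days")
    # ->  10 Gb/s link: 926 days
    # -> 100 Gb/s link: 93 days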

Apart from the data that is under analysis, the LHC experiments also have certain data which is archived. This can be, for example, the raw detector data. Cloud providers also offer storage space for archival purposes. This storage space is usually much cheaper, but accessing the data may come at a cost, usually in terms of time as well as money⁴. A risk of having data archived inside the Cloud is the vendor or data lock-in, which has to be avoided [39][40]. It means a customer depends on a single vendor and changing vendors would be accompanied by substantial cost. One factor may be incompatible storage technologies between the providers. It can also be the case when all data has been transferred (in whichever way) to one provider and the effort of switching providers means having to transfer the data again. Time constraints and the cost of transfer can make this nearly impossible, resulting in a situation where the vendor can abuse this position of power by raising prices.

⁴ An example would be Amazon Glacier.

Object storage

In contrast to the WLCG, most Cloud providers use object storage as their storage technology of choice. Object storage treats data as storage-objects in a flat, unstructured hierarchy. Objects are accessed through the unique identifier attached to them, via RESTful interfaces, which are simpler than regular file systems. In addition, objects come with a flexible amount of metadata. The object store validates user credentials that are attached to each operation; access control therefore happens on a per-object basis [41]. The advantages of this method are scalability and cost-effectiveness, which makes it attractive for Cloud providers: they can simply attach more and more storage devices in order to scale up [42]. Conversely, object storage comes with a decrease in performance [43]. In addition, the REST-based calls of the object storage have to be integrated with the existing WLCG storage infrastructure.
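As an illustration of the per-object, RESTful access pattern, the sketch below fetches one object over HTTP. The endpoint, bucket, object identifier and token-header convention are made-up placeholders (loosely modelled on Swift-style object stores), not the API of any specific provider.

    # Hypothetical RESTful read of a single object from an object store.
    import requests

    endpoint = "https://objectstore.example.com/v1/my-bucket"   # placeholder URL
    object_id = "run001234/RAW/file_0001"                       # unique object identifier
    token = "SECRET-TOKEN"                                       # credential attached to the request

    # Credentials travel with every operation; the store checks them per object.
    response = requests.get(f"{endpoint}/{object_id}",
                            headers={"X-Auth-Token": token},
                            timeout=60)
    response.raise_for_status()
    print(len(response.content), "bytes read")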

4.2.4 Security, safety and integrity

Studies have shown that security is one of the biggest areas of concern for companies regarding Cloud computing [44]. There is a wide variety of problems that go along with computing and storing data at a third-party site. Most of these issues are a result of the fact that Cloud computing customers effectively give up the control over their own data. However, in the case of LHC data processing, some risks that would affect other customers only play a minor role. The LHC data, although it is private, is not very confidential or sensitive, and there are no explicit laws regarding its handling and security. Private companies may run into trouble with the multi-national nature of Cloud providers, which have data centres around the Globe. From the fact that the WLCG also spans the Globe, it can already be seen that no national laws prohibit the distribution of the data, as might be the case for e.g. hospitals. Outside of physics, the LHC data is not valuable. Considering also the large amount of data, it is not attractive for thieves to try and steal it. In case the Cloud provider loses data, the most that will be lost are some results that were temporarily stored within the Cloud. No permanent Cloud storage would be used for archival purposes; the original data will rest within the WLCG. Despite these mitigating factors, there are still risks associated with using Cloud providers. The immediate concern is whether the Cloud provider can be trusted. Dishonest providers or system administrators may violate the privacy by stealing user credentials, making use of their privileged role, for example by directly accessing the memory of a VM as described in [45] or by accessing the private data in the Cloud storage. Another issue may be that the Cloud provider performs dishonest computation, meaning that instead of executing a CPU intensive computation, the provider could simply give a wrong result and thereby save resources [46][44]. Apart from malicious intent, the result may simply be wrong [46][44]. This could be due to, for example, hardware failures or software bugs over which the customer has no control or knowledge. In the worst case, this could falsify many physics analysis results. Data integrity can suffer from similar problems, such as hardware failures. The consequences are similar, therefore the data integrity has to be ensured by the Cloud provider. Furthermore, the security of the data rests with the Cloud provider. In addition to the regular threats, the attack surface is much larger, as attacks may be attempted from within. An attacker may even execute malicious code on the same hardware as the customer, and the Cloud provider has to make sure that these cross-VM attacks cannot threaten any customers [47]. Prominent recent examples, where collocation on the same hardware can be an issue, are Spectre [48] and Meltdown [49]. These are hardware

exploits prevalent in almost all modern (micro-)processors. The associated risk is to lose resources and/or user credentials, as well as having results manipulated. Trust in the provider is also required considering the pay-as-you-go model. Combined with a highly dynamic environment, this can make it difficult to accurately charge for the services used. From the customer's point of view, it is difficult to verify whether the invoice represents reality or not, especially considering the high complexity of the cost models.

Many trust-related issues are handled in the Service Level Agreement (SLA), a binding contract or policy of a Cloud provider, in which infrastructure as well as, for example, privacy and security guarantees are given to the customer. In general there is access control over all data/VMs in the Cloud [45]. Another proposed solution is external audits that a Cloud provider should undergo [44].

4.2.5 Availability

A very important factor when using Clouds is the availability. After outsourcing data or computational workload to the Cloud, a customer wants to be sure that the data or the result of the computation is accessible when needed. Depending on how time critical these processes are, it can be catastrophic for a customer if the Cloud is down at the wrong time. In general, Cloud providers sell their services together with a guaranteed availability level. However, it could be that a Cloud provider knows that its infrastructure is not built to actually reach this guaranteed uptime [28]. The benefits of promising too high a guarantee may outweigh the penalties that the provider incurs when the uptime is not reached. For a customer who depends on the guarantee, this business model can have very bad consequences, as the compensation may be only a small fraction of the lost business. For a WLCG site, it could lead to a loss in reputation and have consequences for its funding if it is not able to provide the availability it guaranteed in its SLA.

The same may happen if the provider suddenly goes bankrupt and all resources are lost. Apart from the Cloud provider, there are outside factors that can compromise a customer. As with any data centre, natural disasters such as floods, earthquakes, lightning, etc., can lead to data loss, as can be seen from the loss of data at Google due to lightning strikes⁵. Further factors, related to law enforcement, are the behaviour of other customers on the same Cloud, which can have an impact on all customers, so-called “fate-sharing” [50]. For example, Spamhaus blacklisted many EC2 IP addresses after a spammer abused the Cloud this way. Afterwards, even legitimate users could not send emails via EC2 anymore [51]. Another example was the FBI raid during which all hardware at a data centre in Texas was confiscated, due to the suspected malicious activity of one customer, disrupting the business of many innocent customers [51].

⁵ Google publishes information on incidents that happened on their platform.

Some example solutions include the fact that Cloud providers allow customers to choose where⁶ they want to store/process their data. This can eliminate some issues with law enforcement agencies and limit the exposure to natural disasters. Data backups across several data centres decrease the risk of data loss. Most importantly, a customer should choose its provider carefully. This can already avoid many problems. One of the problems that can thereby be avoided with high probability is the provider going bankrupt.

4.3 Grid Computing

The term Grid computing is used in analogy to the power grid, which provides standardised, dependable and transparent access to electricity [52][53]. Grid computing is one form of distributed computing. It is the solution to the resource sharing needs of multiple interest groups. These interest groups, which define the access conditions of the users/institutes, are called virtual organisations (VOs). The ATLAS collaboration, as well as the other LHC experiments, form VOs. The Grid gives the VOs access to computers, software, data, and other resources [54]. Sometimes data grids and computational grids are distinguished; in this thesis the Grid refers to a combination of the two.

The Grid handles the entire resource sharing strategies, including the access control, the resource quotas and the placement of workloads. This is handled in a flexible way, as users, resources and VOs can join or leave at any time. It includes management tools for the computational, storage and network resources as well as for code repositories and catalogues [54].

Among the benefits of the Grid are that collaborators achieve better sharing and cooperation, with results being immediately available for everyone. Furthermore, the Grid can achieve a better utilisation of idle resources, as well as provide high computing power for individual use cases that need it at certain intervals [55]. In general, the Grid can be a heterogeneous mixture of different hardware and software components. These are centralised and accessed through a uniform interface [55].

4.4 WLCG

The WLCG focuses on High Throughput Computing (HTC), meaning it tries to exploit its resources efficiently on a long-term basis [56]. The goal is to have processed as many physics events as possible in the end, in contrast to, for example, High Performance Computing (HPC), where the emphasis lies on how many floating point operations per second (FLOPS) can be performed. Since the WLCG workflows are highly parallelisable and oftentimes I/O intensive, it is not important to simply have the most powerful machines. Factors such as the layout, organisation or the scheduling have to be considered, which will be explained in detail in the following sections.

⁶ Geographical location of the data centres, usually specified in regions or zones.


4.4.1 Concept and purpose

The Worldwide LHC Computing Grid is one of the biggest Grids in the world, constructed for the purpose of computing the physics data generated by the LHC [57]. It consists of around 170 interconnected computing centres in 42 different countries around the world. Its sole purpose is to provide computing power and storage space for the LHC experiments and the physicists analysing the LHC data.

Tier structure

The WLCG computing sites are split into different categories, called “Tiers”. These were originally intended to distinguish more powerful and better-connected sites that handle different tasks from less powerful ones. Higher-Tier sites also have stricter requirements for their availability. The network connections were initially chosen according to the hierarchical Models of Networked Analysis at Regional Centres for LHC Experiments (MONARC) model. It assumed that the network bandwidth would be the scarcest resource [58][59]. From there, the concept evolved to the current situation, where the tasks get increasingly distributed to all sites and the computing power does not correspond to the classification anymore [60].

What remained the same is the Tier 0 centre at CERN, which is still the centre of the whole WLCG. Due to the short distance between the Tier 0 centre and the ATLAS detector, all raw data is stored and reconstructed there, see Section 5.3. The raw and reconstructed data is then distributed amongst the other Grid sites, originally only to the Tier 1s. The raw data is distributed in order to have a replica as a fail-safe and to speed up reprocessing campaigns (see Subsection 5.3.6), whereas the reconstructed data can be used by physicists for their analyses, see Section 5.4.

There are 13 Tier 1 sites distributed around the world. Their responsibility is to further distribute the reconstructed data to the Tier 2 sites. In addition they perform some raw data reprocessing and store simulation results (see Section 5.2) from the Tier 2s. The Tier 1s are connected to CERN via the LHC Optical Private Network (LHCOPN), which runs on high-bandwidth optical fibre links [61].

The many Tier 2 sites are located at universities and research institutes, and their tasks are the processing of simulations, the reconstruction of simulated data, and data analysis. The II. Institute of Physics at the Georg-August-University of Göttingen operates a Tier 2 centre called “GoeGrid”, located at the Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG).


At the lowest level are the small institute clusters which are called Tier 3s, even though they are not formally engaged with the WLCG.

Figure 4.3: The MONARC model.
Figure 4.4: Evolution from the MONARC model.

Figure 4.3 gives a schematic view of the original Tier hierarchy and network interconnections according to the MONARC model. Figure 4.4 shows the evolution, giving an overview of the structures explained above. As already mentioned, there are many more sites than shown in the two figures. The same principles apply to all the additional sites, namely that the boundaries between the different Tiers become less and less distinct.

Wigner institute data centre

The CERN Tier 0 data centre was expanded by adding an additional data centre in Budapest at the Wigner institute [62][63]. It was a logical extension that was transparent to the end user. The two data centres are connected via two 100 Gb/s bandwidth circuits. The latency between the two data centres is around 23 ms.

4.4.2 Composition

Apart from typical Grid sites, which consist of pledged hardware from universities and research institutes, there are other types of resources that have been included into the WLCG. These are called opportunistic resources and they were introduced in order to maximise the physics throughput. They include the above mentioned HPC. The big HPC clusters can have spare resources, meaning idle cores, when, for example, not enough workflows are available. Another scenario would be that the available computing power is not enough for some workflows in the queue, which are then waiting until more resources become available. Significant effort has been put into the integration of HPC resources [64]. The WLCG experiments can use these idle resources, for example at NERSC or Titan, without additional cost.

Another type of opportunistic resource used by the experiments is volunteer computing. Volunteer computing is a concept that lets PC owners donate some of their

spare computing resources to, for example, science. In ATLAS this project is called ATLAS@HOME [65]. The contribution from volunteers fluctuates, but looking at the contribution from 02.01.2018 to 02.02.2018 (PanDA monitoring), volunteer computing provided around 2486920 CPU hours for MC simulation. This corresponds to 3% of the overall MC simulation CPU consumption that took place on the whole WLCG during that period. Overall, ATLAS@HOME produces around 2% of MC simulation events [66]. In that sense, volunteer computing contributes as much to the computing as a Grid site. In January, according to the PanDA monitoring, ATLAS@HOME provided around three times as many CPU hours as GoeGrid for production jobs.

Commercial Cloud computing is already being used by several of the LHC experiments to increase their computing capacity. The CMS experiment, for example, used Amazon's AWS over a period of one month on a scale that increased the overall computing capacity of CMS by 33% [67]. In September 2015, ATLAS performed a scale test on Amazon's EC2 that successfully processed 437000 ATLAS event generation and simulation jobs [68]. These examples highlight that the experiments recognise the importance of Cloud computing and are actively working on integrating these resources.

In Figure 4.5 an overview of the overall CPU consumption, including opportunistic resources, is given. The diagram is taken from the ATLAS dashboard⁷. It is important to note that the CPU consumption does not translate one-to-one to the physics throughput. Some machines or CPUs can be faster than others and process more workload in the same time. Of special note is that HPC (light blue and green, around 13%) and Cloud computing (yellow, around 9%) make up over 20% of the overall ATLAS resources. The biggest contribution of around 78% comes from the Grid.

The distribution of the computing in the WLCG and the addition of other resource types make the whole WLCG very heterogeneous. Each computing cluster is responsible for purchasing its own hardware, therefore the WLCG consists of many different hardware components from many different vendors. There is also no unified policy for decommissioning old hardware. This is why many generations of hardware can be found across the WLCG. The hardware therefore varies between sites as well as within a site. This makes it difficult to predict how long a given set of workflows will run on the Grid. The prediction of workflow durations is attempted in Chapter 7.

Storage

There are several possibilities to store data. The first differentiation can be made between the underlying hardware, namely tape storage, hard disk drives (HDDs) and solid state disks (SSDs). Magnetic tapes have a long lifetime and the cost per Gigabyte (GB) is the lowest.

⁷ (02.02.2018) dashb-atlas-job.cern.ch


Figure 4.5: Comparison of CPU hours provided by resource type. (ATLAS dashboard)

Typically, tape cartridges are stored inside libraries. The data is read by mounting a cartridge on a tape drive. Usually there are many cartridges per drive. The speed of sequential reads, once a tape is mounted, exceeds even that of disks. The downsides of tape are that random reads as well as the reading of many small files are slow. This is due to the fact that, to access data at the end of a tape, the whole tape has to be mounted and then wound to that position. If all tape drives are already occupied, the read job has to wait in a queue, which increases the read time as well. In addition, the files can be distributed over different cartridges that have to be mounted and unmounted in order to access them. HDDs contain spinning magnetic disks that are always mounted, either in a storage system or attached to a computer. Therefore the possibly large overheads of the tape system are avoided. On the other hand, they consume power while being idle, in contrast to a tape archive. HDDs are slightly more expensive than tape in terms of cost per GB. They have a good sequential and random read/write speed. The input/output operations per second (IOPS) a hard disk can perform are of the order of one to two hundred. A negative aspect is their relatively short lifetime, which can be understood from the maximum three to five year warranty that manufacturers give.


SSDs are the most expensive storage solution. They deliver the best performance in terms of read/write speed. The IOPS an SSD can perform are of the order of tens of thousands. Their lifetime is rather long, because they have no mechanically moving parts. Instead, the lifetime depends on how often and also how much data is written to them. SSDs incur the highest cost per GB. To get the maximum performance for a minimum in cost, all three of these storage hardware types are used, serving different use cases. Magnetic tapes are usually used for archival purposes. At CERN, physics data is archived in the CERN Advanced STORage manager (CASTOR), which is a hierarchical storage management system. In Figure 4.6⁸ the extent and fast growth of the data stored within CASTOR, reaching up to 200 PB, can be seen.

Figure 4.6: Historical development of the amount of physics data stored within CASTOR. The green and blue curves depict the file size and the size of the data on tape in PB. The yellow curve indicates the number of files stored within CASTOR.

Due to the technological evolution, tape cartridges can usually be repacked within their lifetime. After a repack they have a higher data density and can therefore store more data. Tape has a long lifetime and CERN has even stricter requirements on when to decommission a cartridge. In addition, regular checks are performed. Even though all these safeguards are in place, CERN lost some data due to the contamination of tapes with particles of concrete.

⁸ From the official CASTOR webpage (http://castor.web.cern.ch/, http://castorwww.web.cern.ch/castorwww/namespace_statistics.png, 13.09.2017).

In order to prevent further incidents of this kind, a monitoring system has been installed. This illustrates that tape archival is not 100% guaranteed to preserve all data [69]. The most common use case for HDDs is the storage of frequently accessed data. At CERN, physics analysis data is stored within EOS, a disk-only storage system [70][71]. It provides data access with low latency to physicists. Figure 4.7⁹ shows the development of the EOS usage and space.

Figure 4.7: Development of the amount of physics data stored within EOS over the last two years. The green curve that depicts the used space rises steadily to about 76 PB.

Different Grid sites also use other disk storage technologies, such as the Disk Pool Manager (DPM) [72] or dCache [73]. For all use cases of HDDs, SSDs would be the better alternative, if it were not for the cost. Generally, SSDs are used wherever the speed of HDDs is not sufficient and would be a big bottleneck.

The ratio and area in which the different storage technologies are operated may change in the future, as the prices develop differently. Figures 4.8 and 4.9 show that the annual growth and revenues of manufactured HDDs are declining. They represent numbers that were published by IBM and indicate the trend that can be observed on the whole market. HDDs have been developed for a long time and improving them becomes increasingly expensive. SSDs (labelled as NAND) are a newer technology and have more room to be developed. Development in technology in this case goes hand-in-hand with a decrease in cost per GB.

⁹ From the official EOS dashboard (https://filer-carbon.cern.ch/grafana/dashboard/db/eos-space-dashboard?refresh=5m&orgId=1&from=now-2y&to=now, 13.09.2017).


Figure 4.8: Historical view of manufactured Exabytes of HDD vs SSD (NAND). Numbers published by IBM (Decad, G and Fontana, R.).

Figure 4.9: Historical development of HDD and SSD (NAND) revenue. Numbers pub- lished by IBM (Decad, G and Fontana, R.).

Looking additionally at the development of the prices for tape it can be seen that the downward trend starts to stagnate. In addition, the revenue that companies make from magnetic tapes is receding. Technological advances are difficult to predict, but these market trends indicate that

the landscape of storage will continue to change. If crossover points for the price-per-GB are reached, it is quite possible that one technology will be replaced by another. An example would be that instead of HDDs, only SSDs are purchased. However, it looks like this point lies far in the future.

4.4.3 Evolution

Even though most of the WLCG's infrastructure persists for several years, the WLCG is in flux. This becomes apparent when looking at the transition away from the MONARC model. There are several possible future courses along which this development can continue.

One viable direction is to consolidate the storage further into bigger sites. None of the smaller sites would host any data anymore, except for caching. The extreme of that scenario would be to have one big storage facility per continent. The reason for this centralisation is that the storage is the most manpower- and maintenance-intensive part of the WLCG. Afterwards, fewer experts and personnel would be required to handle the upkeep of the storage, relieving especially the smaller sites. Another path would be to shift the computing more and more into the Cloud. This would most probably be done at the site level. Each WLCG site has pledges it has to fulfil. It does not matter where the pledged resources originate from. The resource origin could even be completely transparent from outside of the site. A second option would be to spend budget on the WLCG level for Cloud computing, for example during times when there is a peak in demand. These considerations depend on whether the Cloud infrastructure will be cheaper than acquiring and running the infrastructure by the sites/WLCG themselves. The situation can be different for each site individually, depending on factors such as the country it is located in. A summary of the viability and cost considerations of the Cloud can be found in Section 9.

4.5 ATLAS computing components

Below, some ATLAS-specific Grid components and implementations are introduced briefly, as they are relevant for the understanding of later chapters.

4.5.1 XRootD

XRootD is a system designed to enable access to data repositories [74]. It is used in HEP for concurrent access to repositories containing multiple petabytes of data. Scalability and high performance were important design parameters that were incorporated alongside features such as authentication [74]. Throughout this thesis, when remote data access is used, it is done via XRootD.
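As an illustration, a file on a remote storage element can be opened directly through an XRootD URL, for example from PyROOT. The redirector host and file path below are placeholders, and the sketch assumes a ROOT installation built with XRootD support.

    # Sketch: read a remote file via XRootD instead of copying it to the local disk.
    import ROOT

    url = "root://some-redirector.example.org//atlas/path/to/file.AOD.pool.root"  # placeholder
    f = ROOT.TFile.Open(url)        # the root:// protocol triggers the XRootD client
    if f and not f.IsZombie():
        print("opened remotely,", f.GetSize(), "bytes")
        f.Close()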


4.5.2 Athena

Athena is the ATLAS control framework. It handles all levels of ATLAS data processing, such as simulation, reconstruction, and analysis [75][76]. Throughout this thesis it was used to execute the different workflows that were tested and examined. Athena was created with the goal in mind to keep the data and algorithms, as well as transient and persistent data, separate. Most notable is that Athena includes performance and resource monitoring, which was used in some cases to compare the performance of different VMs with each other [76]. Also worth noting is that Athena uses Python as a scripting language. This allows for a job steering and configuration that can be understood and reproduced easily. In this thesis, multiple completely different workflows are used. Each of these uses tens of thousands of different lines of code and there are several hundreds of different software versions and millions of different input files. By providing the steering and configuration files for the different workflows (in the Appendix), each of the used workflows can be reproduced.

4.5.3 AthenaMP

AthenaMP (Athena Multi Process) is the framework that lets Athena run in a multi-core environment. ATLAS works hard on not running into the hard RAM limit. Saving memory is one reason why many workflows have been parallelised within the multi-process framework AthenaMP. The idea behind this is that a single process forks into multiple processes, which share parts of their memory. The shared memory reduces the overall memory footprint on a multi-core machine by decreasing the redundancies. An important setting that is inherent to AthenaMP is the number of parallel processes that it spawns. This is steered by setting the environment variable ATHENA_PROC_NUMBER. Setting it to zero results in single-process execution. On the WLCG this setting is used to set the number of parallel processes equal to the number of cores that a VM on the Grid has, which can deviate from one VM to another. In Chapter 8 it is additionally used to deviate from this setting and run a larger or smaller number of parallel processes.
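A minimal sketch of how this environment variable could be set when launching a job is shown below. The transform name and its arguments are placeholders; only ATHENA_PROC_NUMBER itself is taken from the text above.

    # Launch a job with a chosen number of AthenaMP worker processes.
    import os
    import subprocess

    n_workers = 8                                                # e.g. the number of cores of the VM
    env = dict(os.environ, ATHENA_PROC_NUMBER=str(n_workers))   # 0 would mean single-process execution

    # Placeholder command line; in practice this would be the transform of the workflow in question.
    subprocess.run(["Reco_tf.py", "--maxEvents", "100"], env=env, check=True)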

4.5.4 PanDA

The Production and Distributed Analysis (PanDA) workload management system is responsible for processing Monte-Carlo simulations, performing the data reprocessing and executing the user and group production jobs [77]. Figure 4.10 shows how analysis and production jobs are submitted by the users and the production managers. All job information and the task queue are then handled centrally by PanDA [78]. This means PanDA takes care of the entire scheduling. Depending on the job requirements, the jobs subsequently get scheduled to matching available resources. This is done using the pilot model [79][80][81]. Pilot jobs are basically “place-holders” for the actual payload, sent by the pilot factory to a batch system or Grid site. A pilot job prepares the computing element, then pulls the job, executes it, and cleans up afterwards. This means the pilot job also handles the data stage-in and stage-out. All tests performed in this thesis were run in a controlled environment, therefore outside of the Grid, PanDA, and the pilot model. This is important to keep in mind for data staging considerations, as the staging is usually done by the pilot, see Subsection 5.3.2.


Figure 4.10: Schematic of the PanDA System [78] - DQ2 has been updated to Rucio.

Another important aspect of PanDA is its monitoring capability. It collects a plethora of metrics for each job, containing for example: the CPU consumption time, the job duration, and the input size. This information can be accessed from the web and is stored and replicated to multiple places, one of which is the analytix cluster at CERN IT. The analytix cluster was used to perform data analytics in Chapter 7.

4.5.5 Rucio

Rucio is the new version of the ATLAS Distributed Data Management (DDM) system [82][83]. It evolved from Don Quijote 2, which was previously used. It handles the accounts, the distributed storage systems and the distribution of the ATLAS data, including files and datasets. For ease of use it has a CLI and a web interface, which make the replication of datasets easier for users. Rucio was used to locate, move, and replicate input datasets for all the workflows in this thesis that required input data. It was especially important for moving the data when investigating the workflow performance as a function of the input data location, see Subsection 8.2.2.


4.5.6 JEDI

The Job Execution and Definition Interface (JEDI) is a PanDA component that was implemented in order to have workload management at the task level [84]. It translates user-requested task definitions from the Database Engine For Tasks (DEFT) into jobs that are then executed by PanDA.

4.5.7 CVMFS

The CernVM File System (CVMFS) is used across the HEP experiments in order to access their software and conditions data [85][86]. It is a read-only file system capable of delivering the necessary data via HTTP to all the different kinds of VMs located on the WLCG, making use of caching. It acts like a cache for the ATLAS software and data. A short investigation into the impact of the CVMFS cache on a workflow was done in Subsection 5.3.2. Throughout this thesis, all VMs were using CVMFS.

4.5.8 Tags

The TAG data is metadata containing information about key quantities of events, which makes it easier and faster to select specific events for a physics analysis [76][87]. Individual physics events can be identified and selected via their TAG data, which is stored in a relational database.

4.5.9 AMI

The ATLAS metadata is accessed via the ATLAS Metadata Interface (AMI). It enables the aggregation of distributed metadata and its retrieval in web applications [88]. It is the tool for dataset selection within ATLAS [89]. The AMI components include the Tag Collector, which manages the various releases of the ATLAS software [88]. The processing history of datasets is saved using the AMI-Tags interface [88].

4.6 General concepts

In this section, further concepts are introduced that are required for the understanding of the work done in this thesis.

4.6.1 Benchmarking

Especially in distributed computing, almost all hardware and software setups differ from one another. In order to understand how these differences impact the performance, benchmarking is used. Benchmarking means executing the same defined set of operations on a defined dataset in a controlled environment.


Benchmarks are executed on different systems and the individual performances are compared to a reference. The resulting relative performance can be measured for a specific operation, such as a disk read, or for more complex combinations, as in the case of HEP-SPEC06 (see Subsection 4.6.1). Ideally, a benchmark is representative of all the processes that will actually run on the system, in order to get the most useful results. This is especially important when trying to predict the performance of a workflow on a machine and in the context of Cloud computing, as will be seen in Chapter 7. When wanting to purchase Cloud resources, a customer has to choose between different providers. Within each provider, better hardware costs more. In addition, the hardware performance between the providers will differ, even if the underlying hardware were the same. Reading the hardware specifications the Cloud provider publishes does not make it possible to estimate the job performance [90]. The dilemma is, for example, whether to acquire the more expensive hardware or to acquire more of the cheaper hardware instead. This decision, in view of cost-efficiency, cannot be made without benchmarking. In the end, however, benchmarking alone cannot provide all the answers. This is due to the actual workflows differing from the benchmarks or due to changes within the environment, such as neighbouring VMs. What benchmarking can provide is a good first order estimation. There is another type of benchmarking, called passive benchmarking [91]. Instead of running a dedicated piece of software that only produces a measure of the performance, a look at the actual workflows can be taken. Passive benchmarking therefore does not use additional time on the infrastructure. The results are however workflow specific and not universal. In addition, the workflow composition has to consist of workflows that would all perform similarly on the same machine.
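The first order estimation mentioned above can be written down explicitly: if a workflow takes a known time on a reference machine, its duration on a target machine can be scaled by the ratio of the benchmark scores. The numbers in the sketch are invented, and the linear scaling is itself an assumption that only holds if the workflow stresses the machine in the same way the benchmark does.

    # First order runtime estimate from relative benchmark scores (illustrative numbers).
    def predict_wall_time(t_reference, score_reference, score_target):
        """Scale the reference wall time by the ratio of per-core benchmark scores."""
        return t_reference * score_reference / score_target

    # A job that took 3600 s on a machine scoring 10 per core,
    # estimated on a Cloud VM scoring 12 per core:
    print(predict_wall_time(3600.0, 10.0, 12.0), "s")   # -> 3000.0 s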

Computing power

The High Energy Physics - Standard Performance Evaluation Corporation 2006 (HEP-SPEC06) benchmark is a subset of the SPEC CPU2006 benchmark tuned for HEP [92]. It stresses the CPU with operations and algorithms that are common in the HEP community [93]. The HEP-SPEC06 benchmark is resource consuming, as it takes several hours to complete [94][91]. It is therefore usually run only once. It does not account for changes in configurations, updated software or the environment. Understanding the infrastructure is also important for accounting; here the CPU consumption is calculated in HEP-SPEC06 seconds (HS*s), not only for ATLAS [95], but for all the LHC experiments [96].
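The accounting unit can be illustrated with a small calculation. The per-core score and job duration below are hypothetical, and the convention of multiplying the per-core score by the occupied cores and the wall time is stated here as an assumption of how such HS*s numbers are typically formed.

    # Hypothetical accounting example in HEP-SPEC06 seconds (HS*s).
    hs06_per_core = 11.5          # benchmark score of one core (invented)
    cores_occupied = 8            # cores used by the job
    wall_time_s = 12 * 3600       # job duration: 12 hours

    hs06_seconds = hs06_per_core * cores_occupied * wall_time_s
    print(f"{hs06_seconds:.2e} HS*s")   # prints about 3.97e+06 HS*s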

4.6.2 Storage

Over the years, the LHC data and storage strategies evolved alongside the WLCG. In ATLAS, all Grid jobs, except the raw data reconstruction at the Tier 0, are handled by a central PanDA system that manages all workflows. As a consequence, the user should only see a single computing facility. PanDA faces several challenges, the most prevalent

difficulty is to provide the CPUs with the required input data. In the current approach, this is solved by running the jobs on sites where the input data is already available. The logical consequence of this data locality is that if some data is sparsely distributed but very popular, the sites that have this data will receive a disproportionate amount of jobs, while other sites are idle.

As a fail-over mechanism, data can be fetched from a remote location to the local disk. This cannot be done for every job, because there is only a limited amount of bandwidth, which would quickly be saturated, leading to idle CPUs that are waiting for their downloads to finish. A similar, third alternative is to read the input data on the fly, meaning event by event, from a remote location. The same limitations as above apply. There are multiple mechanisms in place to mitigate these negative effects. For one, popular datasets are automatically distributed to multiple sites, so the job load is more evenly distributed. This is handled by the Distributed Data Management (DDM) system.

Additional complexity enters when looking at the different storage media. The above cases only considered data stored on hard disks, but as described in Subsection 4.4.2 there are also magnetic tapes to consider. For jobs that have to access data that is only available on tape, such as the reprocessing, a delay is to be expected, as the data is first copied from tape to disk. Since the number of tape drives that can read cartridges is limited, this delay can be long. Therefore, for large campaigns the experiments do an organised tape-to-disk staging before the job execution. The breakdown of tape usage can be seen in Figure 4.11.

dCache

One storage middleware system that solves some of the above mentioned difficulties is dCache [73][97][98]. The dCache system unifies heterogeneous disk storage systems under a single file system tree. It optimises access to tape storage and handles hot spots, load balancing, and replication.

4.6.3 Swapping

The swapping mechanism is connected to the fact that the CPU performs its I/O operations through the RAM [99][100][101]. Swap space enables the kernel to run processes that exceed the physical memory capacity. This is done by writing pages out from memory to a backing storage, known as “swapping out”. If the pages are needed, this process is reversed, known as “swapping in”. In most cases, and throughout this thesis, the backing storage is a hard disk. Since the disk is much slower than the RAM, the access of a swapped-out page takes significantly longer than if it were in the RAM. The time the CPU is idle due to swapping operations is called swap time. In the following chapters, the swap time is indicated in equations by the variable Swap Time.


Figure 4.11: ATLAS tape recalls over six months from March to September 2017. In this case, WM refers to the on-demand staging for jobs which is composed of around 30% analysis jobs and around 70% production jobs. For DM, this means organised staging for large campaigns. Users are individuals requesting copies on disk.

If there are pages in memory that are accessed rarely or never, swapping them to disk barely impacts the performance, because they do not have to be swapped back in. This is called light swapping. Conversely, heavy swapping, or thrashing, occurs when the space in memory is not sufficient. Then, pages that are accessed by the CPUs more regularly are being swapped out. A vicious circle is entered: in order to swap such a page back in, another regularly accessed page is swapped out. This one has to be swapped in again later, and so on. At this point the CPUs are in I/O wait longer and longer, which can slow down a workflow significantly. An exemplary profile can be found in the Appendix, see Figure A12.

4.6.4 CPU efficiency

One controversial possibility to assess the quality of an infrastructure or workflow is to look at the CPU consumption time divided by the overall duration [102]. This metric is called CPU efficiency and a high value is considered good, because the CPU is busy. On the other hand, this favours slow CPUs, as their ratio between processing and I/O wait is larger.
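Written as a formula, with T_CPU the CPU consumption time and T_wall the overall duration (the wall time), this definition reads:

    \varepsilon_{\mathrm{CPU}} = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{wall}}}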

4.6.5 Undercommitting

Undercommitment (UC) is the practice of using fewer resources than are available. Even though it is wasteful to not utilise CPUs, there are scenarios in which the alternative is even worse. The prime example is applications that need more memory per core

than is available, and which therefore have to keep one or several cores idle. The general WLCG configuration, for example, is to have 2 GB of RAM per CPU core. In some cases, additional memory is required, for example for the ATLAS raw data reconstruction with a high pileup, see Subsection 5.3.1. These jobs did heavy swapping with the standard machine configuration. Therefore, they took a very long time to finish. In order to shorten the workflow duration, fewer parallel processes than available CPU cores were run on the machines. This gave a better than 2 GB of RAM per CPU core ratio. With sufficient RAM to keep the swapping light, the jobs finished much faster, even though fewer CPU cores, and therefore less processing power, were utilised.
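The trade-off can be made concrete with a small calculation. The node size and per-process memory need below are invented and only serve to illustrate why cores are deliberately left idle.

    # Undercommitting sketch: how many processes fit into the available memory?
    cores = 16
    ram_gb = 32                  # the standard 2 GB per core
    mem_per_process_gb = 3.0     # e.g. a high-pileup reconstruction process (invented)

    processes = min(cores, int(ram_gb // mem_per_process_gb))
    print(processes, "processes,", cores - processes, "cores idle")   # -> 10 processes, 6 cores idle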

4.6.6 Control groups

Linux control groups (cgroups) are used to limit the resources of processes within the same operating system [103]. Each cgroup can have different limits on the available resources, such as the memory or the CPU. Processes assigned to a cgroup share the limited assigned resources. In this thesis, cgroups are used mainly to simulate VMs with less memory. This is done by creating one cgroup with only a memory limitation and assigning all processes to it.
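A minimal sketch of such a setup is shown below, assuming a cgroup v1 memory controller mounted under /sys/fs/cgroup and root privileges; the cgroup name and the 8 GB limit are arbitrary.

    # Create a memory-limited cgroup and move the current process (and its children) into it.
    import os

    cgroup_dir = "/sys/fs/cgroup/memory/lowmem_vm"      # arbitrary cgroup name
    os.makedirs(cgroup_dir, exist_ok=True)

    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(8 * 1024**3))                       # pretend the VM only has 8 GB of RAM

    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))                       # child processes inherit the cgroup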

CHAPTER 5

Workflows

In the context of computing, there are several definitions for “workflows”, for example: a workflow passes information or smaller units of work between participants within a defined set of rules, according to their dependencies and in order to reach a common goal [104][105]. Workflows are also defined as computational tasks that are linked by data- and control-flow dependencies [106]. According to these definitions, within the WLCG (and according to ATLAS terminology, which is used in this thesis) the term workflow describes the whole system around and including a job execution. This can consist of, e.g., the scheduling, the data transfers, the job execution, the monitoring and the retrying. Within the WLCG, jobs are submitted by the users and can contain any code, typically performing a data transformation or analysis. They produce outputs that are then used by subsequent jobs, whereas communication takes place through a file system. In the previous chapters, the only workflows considered were for physics analysis. In the general picture, analysis workflows use up a not-insignificant amount of resources, but far more resources are consumed by all the processing that happens beforehand. The resource consumption distribution can be seen in Figure 5.1. The wallclock or wall time corresponds to the total duration of a workflow, Wall Time = T_finish − T_start, where T_finish is the time the workflow finished and T_start the time it started. The name is in analogy to measuring the duration with a clock hanging on a wall. Looking at the pie chart, it becomes clear that MC simulation, event generation and MC reconstruction require a much larger amount of resources than physics user analysis. Over 60% of the overall wallclock time is used by the combination of production jobs, compared to around 10% used by the analyses¹. The ratio may vary amongst the different experiments, but the general trend stays the same.

¹ Tier 0 raw data reconstruction jobs are monitored separately. The monitoring data for these was unfortunately incomplete. The monitoring experts that were contacted did not provide further data.


Figure 5.1: This pie-chart shows the wallclock time consumption of different types of workflows. It is a one year aggregation from 01.04.16 - 01.04.17, taken from the ATLAS job dashboard. The only caveat is that raw data reconstruction jobs are not included in the PanDA monitoring from which this plot is taken.

In this thesis, the focus lies on computing resources for the LHC. Therefore the workflows described cover the whole data processing, from the raw detector data (see Section 5.3.1) of the LHC experiments to the physics analysis results.

The different workflows can be divided into two classes. This classification is done in accordance with the job model, dividing the jobs into data intensive or CPU intensive. It is possible that an I/O intensive job also uses much CPU power. For simplification, the dominating attribute of the job is chosen as the describing class. This may simply be the most limiting factor with respect to an average infrastructure setup, also known as the bottleneck. The performance of the job is then only dependent on that component, see for example [107]. The classification has to be performed individually, by investigating, e.g., what percentage of the job's runtime is spent on CPU compared to how much is spent on I/O. The classification is done in a more abstract way, in order to have a unified classification of

the jobs. This way, CPU intensive jobs of all four experiments and outside HEP can be aggregated in the same class, even though the absolute numbers can differ by a large margin.

5.1 General model

Physics analyses are not performed on the raw recorded data, but on simulated and processed data. The whole chain of workflows can be seen schematically in Figure 5.2. The term chain refers to a set of linked, consecutive workflows, where the results are the input to the following workflow.

Figure 5.2: This diagram describes the chain of workflows. The detector/simulation data is computed several times until it is used for the physics analysis in the end. Workflows are represented in boxes, whereas data types are represented in ellipses.

The different experiments all perform a similar processing, although the details, such as workflows and workflow chains, differ according to their processing models. A short overview can be found in Section 5.1.1.

The general workflow consists of three job categories.

1. Reconstruction


In the reconstruction, the information of the raw detector data is combined into higher-level analysis objects, meaning a different data format that can be analysed by a physicist later on.

2. Monte Carlo simulation

In order to confirm a physics model or to disprove it and therefore possibly discover new physics phenomena, such as a previously undiscovered particle, the real data needs to be compared to Monte Carlo (MC) simulations. These simulate proton-proton collisions and the corresponding decay products. In the digitisation which follows, these particles get translated into electrical detector signals, in order to stay as close to the real data processing as possible. The simulated data at this point has the same format as the real raw data. At the next stage, in order to treat the simulated data the same way as real data, it also undergoes reconstruction. This similar treatment is important for the interpretation of the detector data, since for the simulated signals the truth information² is available. After the reconstruction, the resulting data can be compared to the true input decays to make sure there are no big discrepancies. There can be small discrepancies, which have to be understood and considered in all analyses, as they will also appear in the real data. These discrepancies can, for example, appear due to a dead region in the detector, through which a real or simulated signal or particle passed undetected. This particle will not show up in the reconstructed data, even though it was there originally. Other sources of discrepancies, such as software bugs, can also be found by this comparison.

3. Analysis

The final step is the analysis of the previously reconstructed data, including both simulated and real data.

Reprocessing of the raw data is the same as re-doing the reconstruction, so it is not a category of its own. A reprocessing campaign is started if there are significant developments in either the calibration and alignment and/or the software, so that the resulting data are improved.

The data that have been processed get a tag, a short event summary which is used for event identification and selection. Tag data is stored in a relational database.

The previous three job categories, reconstruction, MC simulation and analysis, can be split further into several intermediate procedures. A more detailed description of these is shown in the following subsections.

² The truth information is information about which decays were originally simulated.


5.1.1 All experiments

Even though the detector layouts and the software frameworks of the four major experiments of the LHC differ from each other, they face similar challenges and have many commonalities. Therefore, from a rough point of view the whole data processing is rather similar. The experiments all collect signals from a detector with a certain geometry. All four of them have the goal of physics analysis in the end, so Monte-Carlo simulations are required. The biggest commonality is the computing infrastructure, the WLCG. In order to give a good overview, the biggest differences between the computing approaches of the experiments are presented.

CMS

CMS has the Any Data, Anytime, Anywhere (AAA) data federation, which is helping to solve the problem of data locality. With AAA, the CMS input data can be read directly over the WAN by the jobs. Traditional pileup mixing has to be done anew for every simulated signal event. CMS instead does pileup premixing. “Premixed” pileup means that the mixing is done only once beforehand and the response is saved to pileup events in a premixed pileup dataset, which amounts to only around 1 MB per event, compared to around 100 MB of traditional pileup input per event. This dataset is basically a minimum-bias-only dataset that is overlaid over the simulated collisions. This reduces the I/O of the job significantly.

ALICE

The ALICE experiment has much larger and more complex events with respect to the other LHC experiments. The size ranges up to 6 MB per raw data event for lead-lead collisions [96]. In order to increase the efficiency of individual analyses, the ALICE collaboration combines analyses using the same or similar datasets into one job. This is done in order to read the data only once and then process it many times, according to the different jobs. The analysis jobs are combined in LEGO (Lightweight Environment for Grid Operators) trains [108].

LHCb

One big difference in the workflows of LHCb compared to the other experiments is that an MC simulation job encompasses event generation, simulation and reconstruction. One job processes the data from the event generation to a physics analysis ready output file. Some simulation jobs even reduce the output data by filtering uninteresting events at the end.

Another fundamental difference between LHCb and, for example, ATLAS lies in the pileup. It is around 1.4 collisions per event for LHCb and therefore significantly lower than for ATLAS, where it is at around 20. The LHCb workflows are adapted to this by simulating the low pileup directly for each collision. In contrast to that, in ATLAS

the MC simulation does not include the pileup generation. The minimum bias is instead overlaid/added afterwards.

5.1.2 ATLAS

In ATLAS terminology, a job can consist of several transformations, see Figure 5.3. Each transformation delivers, as the name suggests, processed or transformed data. These transformations can be grouped together into jobs in order to increase the efficiency by reducing overheads. This can be seen easily when looking into a VM where the input data for a consecutive transformation is already locally available from the previous transformation, instead of having to be staged out and in to the local disk again. Different jobs can partly consist of the same transformations. For example, the RAWtoESD transformation is contained in both a raw data reconstruction job, as well as a digitisation+reconstruction job for simulated data. The job composition is different, but some parts, some transformations, can be similar. The diagram in Figure 5.3 depicts that all jobs within a task are comprised of the same transformations, whereas jobs of a different task can consist of different transformations.

Figure 5.3: This diagram describes the general ATLAS workflow and terminology of computing.

When looking into what happens inside individual transformations, some repeating patterns can be observed. These can occur between jobs inside different tasks or even different requests. For example, this can include the input/output file validations that happen at the beginning/end of many transformations. Another similarity may be the fetching of conditions data that occurs in some transformations. The different parts of a transformation are called substeps, see Figure 5.3. The stage-in and stage-out are also categorised as substeps. Not many transformations within a job have similar substeps.

Below the job level are tasks. Tasks are typically comprised of many jobs that compute the same transformations, i.e. do the same thing, only on different datasets.


Due to various limitations³, a job typically computes between 100 and 10000 events, depending on the jobtype. Therefore, in order to compute all events delivered by the ATLAS detector, there have to be many jobs doing the same processing in parallel, so that everything finishes within a reasonable time. Jobs doing the same processing on different input data are called “similar jobs”. This is not to be confused with executing a job twice with the same input data, which would be called the “same job”. Jobs that do not consist of the same transformations/substeps are “different jobs” or of a “different jobtype”. This is the case even if the input dataset is the same.

On the bottom-most layer are the requests, see Figure 5.3. Requests consist of many different tasks that do different types of processing. These tasks are typically logically linked together, meaning that the outputs of jobs from one task are inputs of jobs of another task of the same request. Thereby requests can represent a whole processing chain, excluding analysis.

At certain points in time, data reprocessing campaigns, see Subsection 5.3.6, are started. A campaign includes all jobs, tasks or requests that were executed on the data that had to be reprocessed.

Conditions data During data taking it is important to know the exact conditions, the state of the whole experiment, in order to achieve a precise simulation and reconstruction. This conditions data set consists of hundreds of parameters, including, for example, alignment, beam position, magnetic field map, cabling, calibration, corrections, detector status, noise, pulse shapes, timing and dead channels [109]. Each subsystem provides these parameters, which are stored in the conditions database. They are updated on the fly, some much more frequently than others; the cabling, for example, stays the same much longer than the alignment parameters.

5.2 Monte Carlo simulation

Monte Carlo simulation generally consists of three subsequent computations: event generation, simulation and digitisation. Each part is described in the following subsections.

5.2.1 Event generation Mostly external, meaning non-ATLAS specific, tools are used to generate events. In the event generation, collisions (events) and their immediate decay products are generated. This covers everything that would happen in an actual collision up to the point when the particles hit the detector.

³ Limitations include, for example, the maximum lifetime of a Grid job. Another reason to keep the wall time short is failure handling: if a long job fails towards the end, much computing time is wasted.

Depending on the lifetime of the resulting particles (cτ < 10 mm), they are already decayed by the event generator and only the resulting particles are considered to interact with the detector. Particles with a longer lifetime are handled later on, by the simulation. Before the events are passed to the simulation, the specific beam conditions are applied. The resulting data can be uniquely identified via the software version and the inputs, e.g. job parameters and random seed [110]. For reproducibility, the processor chip architecture has to be taken into account, since it influences how the pseudo-random numbers are generated, especially between different manufacturers. The information about the parents of unstable particles is preserved. Figure 5.4 shows the profile of a single-core event generation job that was run on a four-core VM. Since the job does not process any input data and the output file is very small, there is very little disk activity. The job is entirely CPU bound (25% CPU usage corresponds to one CPU core), using very little memory and almost no network bandwidth. There is almost no variation of the resource usage during the whole processing. The generator that was used was Pythia v8.186.

Figure 5.4: Profile of a single-core event generation job run on the VM at CERN, processing 1000 events, using Athena version 19.2.3.6.

Event generation fluctuation The fluctuations of the wall time of the different workflows will become extremely important for the model later on in Chapter 6. Initially, the same event generation job, with the same random seed input, was run 50 times on two different machines. The specifications can be found in the appendix: for the VM at CERN see Subsection A.7.2, for the VM at Göttingen see Subsection A.7.1. Table 5.1 shows that the job duration does not fluctuate much.


                Average wall time [s]   Standard deviation [%]
  CERN VM       4488                    1.59
  Göttingen VM  5475                    0.52

Table 5.1: Summary of executing the same event generation job 50 times on two different VMs, located at CERN and Göttingen.

However, the fluctuation is slightly higher on the CERN VM than on the one in Göttingen. The standard deviation of the sample (s) is obtained by:

    s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 }    (5.1)

where n is the number of measurements, x_i are the observed values and \bar{x} is the mean value of the measurements. The standard error of the mean (\sigma_{\bar{x}}) is obtained by:

    \sigma_{\bar{x}} = \frac{s}{\sqrt{n}}    (5.2)

where s is taken from Equation 5.1 and n is the number of measurements.

A plot depicting the wall time average over 1, 2, ..., 50 jobs with the corresponding estimate of the standard error of the mean has been created, see Figure 5.5. It can be described as a sliding-window analysis, where the step size is one and the window size increases by one with each step. In order to exclude effects that appear due to the ordering of the jobs, the same plot has been created multiple times, with the ordering of the input data points rearranged pseudo-randomly by hand, see Figures A1, A2 and A3 in the Appendix. The fluctuations are very small, considering that the y-axis does not start at zero. Approaching a high number of jobs (n > 25), the wall time average converges within a reasonably small standard error of the mean.
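The incrementally growing window average and its uncertainty can be sketched in a few lines of Python. The wall times below are invented placeholders and the snippet only illustrates the procedure; it is not the script that produced Figure 5.5.

    import numpy as np

    # Hypothetical wall times [s] of repeated event generation jobs (invented numbers).
    wall_times = np.array([5470.0, 5490.0, 5450.0, 5480.0, 5460.0, 5475.0])

    for n in range(1, len(wall_times) + 1):
        window = wall_times[:n]                  # window size grows by one per step
        mean = window.mean()
        if n > 1:
            s = window.std(ddof=1)               # sample standard deviation, Eq. (5.1)
            sem = s / np.sqrt(n)                 # standard error of the mean, Eq. (5.2)
        else:
            sem = float("nan")                   # undefined for a single measurement
        print(f"n={n:2d}  mean={mean:7.1f} s  sem={sem:5.1f} s")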

The same investigation has been performed for similar jobs⁴. Since similar jobs differ from each other, the variation in wall time is expected to be larger than before, which is confirmed by the numbers in Table 5.2. Figure 5.6 shows the wall time average for an increasing number of jobs. The plot shows the behaviour for the jobs at Göttingen; the same plot showing the results for the CERN VM can be found in the Appendix, see Figure A4. The fluctuations of the wall time average increased together with the standard deviation.

⁴ Similar jobs have different random seeds as inputs for each job.


Figure 5.5: The wall time average over an incrementing number of the same event generation jobs, run at Göttingen, starting at one. The black error bars show the standard error of the mean (see Equation 5.2). Note: in order to improve readability, the y-axis does not start at zero.

                Average wall time [s]   Standard deviation [%]
  CERN VM       4510                    3.97
  Göttingen VM  5368                    3.34

Table 5.2: Summary of executing 41 similar event generation jobs.

Overall, these fluctuations are still very small and they also converge, albeit only after a larger number of jobs is included. Changing the workflow to an older version, which was used in 2015, does not show different results in terms of fluctuations.

5.2.2 Simulation The simulation workflow simulates the detector and the physics interactions of the particles with the detector. This is done by the Geant4 [111] toolkit, which models the physics and the particle transport through the detector. It takes the resulting data from the event generation as input. The truth information from the event generation is kept and particles from the simulation step are added. In Figure 5.7 the profile of a MC simulation is shown. The whole processing is CPU bound with very little disk and network activity. In this case the job ran on four cores in parallel.


Figure 5.6: The wall time average over an incrementing number of similar event generation jobs, starting at one, at the VM at Göttingen. The black error bars show the standard error of the mean (see Equation 5.2). Note: in order to improve readability, the y-axis does not start at zero.

There is little variation of the resource usage during the processing, from about 200 s to about 5600 s. Towards the end of the job, after 5500 s, it can be observed how the individual processes finish the simulation of the last events at different times and the CPU usage decreases in steps (100% to 75% to 50% to 25%) to zero. The memory profile of the job is rather flat and shows a low RAM requirement.

Monte-Carlo simulation fluctuation The fluctuations of the wall time of the different workflows will become extremely important for the model later on in Chapter 6. The variation when repeating the same simulation job was much lower (0.78% for CERN and 0.28% for Göttingen) than for similar simulation jobs. This is why only the results for similar jobs are shown, see Table 5.3, which in any case include the same-job fluctuations. Figure 5.8 shows the wall time average over an increasing number of similar Monte-Carlo simulation jobs. The outputs of the similar event generation jobs in Subsection 5.2.1 were taken as inputs for these simulation jobs. The plot depicts the results for the VM in Göttingen; the same plot showing the results of the VM at CERN can be found in the Appendix, see Figure A5. Overall the fluctuations are rather small, in the order of a few percent, compared to the wall time. Towards a higher number of jobs (n > 30) a convergence can be seen.


Figure 5.7: Profile of a multicore simulation job run on a 4-core VM at CERN, processing 100 events, using Athena version 19.2.3.3.

                Average wall time [s]   Standard deviation [%]
  CERN VM       5010                    4.37
  Göttingen VM  3097                    4.42

Table 5.3: Summary of executing 26 similar Monte-Carlo simulation jobs at CERN and 38 similar Monte-Carlo simulation jobs at Göttingen.

5.3 Reconstruction

A reconstruction workflow takes detector-hits data and transforms it into physics objects that can be used by a physics analysis. This is done by deriving particle information from the signals of the detector components. Using the geometry of the detector, individual particle tracks can be calculated. The tracks then provide information that can be used to determine the properties and the identity of the individual particles, see Chapter 3 about the ATLAS detector. All of this is done using multiple, sub-detector dependent algorithms, whose results are combined in the end. Combined information includes, for example, jets or missing transverse energy. The same reconstruction is applied to the raw data as well as to the simulated data. The only difference is that simulated data usually has to include the digitisation step, which happens right before the reconstruction. In addition, for simulated data the truth information is available, which can be used to validate and improve the reconstruction.


Figure 5.8: The wall time average over an incrementing number of similar Monte-Carlo simulation jobs, starting at one, at the VM at Göttingen. The black error bars show the standard error of the mean (see Equation 5.2). Note: in order to improve readability, the y-axis does not start at zero.

5.3.1 Raw data reconstruction The individual transformations that the data undergoes are RAWtoESD, ESDtoAOD, ESDtoDPD and POOLMerge.

5.3.2 Raw data reconstruction profile In the previous subsection, the individual processing steps were introduced. These can also be seen when looking at the job profile, as they place different requirements on the hardware. In Figure 5.9, the profile of a raw data reconstruction job is shown. The first aspect to look at is the green solid line that highlights the overall CPU usage. On a four-core machine, 25% CPU usage corresponds to one core being used. Three different states can be observed, namely zero, one or four CPU cores being used.

Secondly, the dashed purple line shows the memory usage of the job. It roughly follows the CPU usage.

The finely dashed red line depicts the network activity. Notably, at the very beginning a high activity can be observed, corresponding to the stage-in of the input data from a remote storage to the local disk. Throughout the rest of the job the network activity remains low.


Figure 5.9: The profile of a raw data reconstruction job, run with AthenaMP on a machine with four CPU cores.

Finally, the disk activity can be seen, depicted by a blue dashed line (disk write) and a yellow dashed/finely-dashed line. Overall the disk activity is rather low throughout the job, even though the input data is read from disk and the output is written back to it.

The overall profile corresponds to the different transformations described in Subsection 5.3.1. In the beginning, up until about 500 s, a high network activity can be seen, corresponding to the stage-in of the data. Afterwards, from about 500 s to 11200 s, the RAWtoESD processing takes place. The ESD data is then processed in the ESDtoAOD step from about 11200 s to 12500 s, followed by a merging step that lasts until 13700 s. In the end, a second processing of the ESD data, ESDtoDPD, happens from about 13700 s to 15300 s, which is again followed by a merging step that lasts to the end. The merging steps are serialised/synchronised, meaning that during merging only one CPU core is used. Technically, creating DPDs is part of the group production, see Subsection 5.4.1. However, it turned out that in some cases this approach can be inefficient if processing is duplicated across the different groups. Therefore a common set of primary DPDs is centrally produced during reconstruction, which the groups can then use further.

Network usage Reconstruction is a workflow that has a large amount of input data. In this example case, it was ∼ 2.6 GB per job. These 2.6 GB correspond to 3025 events, resulting in an average event size of ∼ 860 kB. The workflow standard is to stage the data to disk before the processing. This is not different from any other download and therefore does not require a special examination. In the following, the remote reading is investigated.

In order to distinguish between the noise and the network traffic of interest, a baseline was taken. The baseline network traffic that was captured was very low and almost steady, with background peaks of up to 2 kB/s. This is negligible compared to the traffic that was observed from the job, where the smallest peaks of interest start at ∼ 400 kB/s. For these reasons, the background is ignored in the following plots.

Figure 5.10 depicts the network profile when reconstructing 100 events, using remote input data. The data was captured via the tcpdump command and is displayed in Wireshark⁵. The graph was cut off at around 1450 s in the horizontal direction for better readability; originally it simply continued with the small bumps until ∼ 2550 s, when the job ends. Furthermore, the CVMFS cache was already filled.

Figure 5.10: Network profile of a job that reconstructed one hundred events. It was cut off at approximately t = 1450 s for better readability.

The red curve depicts all traffic, whereas the black curve is what remains after applying a filter that restricts the traffic to port 1095, used by xroot. Around 53 little black and red peaks can be observed. In the plot without the cut-off, the total number of little peaks is ∼ 100, indicating that the individual events are read/streamed one after the other during job execution. Due to the filter, it is possible to identify the source of these peaks as an EOS disk server at CERN, which corroborates that they are the physics events being read. The remaining traffic, displayed by the red curve where it is not overlaid by the black curve, can be seen in the conversations list in Figure 5.11.

⁵ Wireshark is a tool that is used to analyse captured network traffic.


Figure 5.11: List of all network conversations of the VM called ‘lalalalala.cern.ch’, on which the jobs were run. The captured traffic covers the job from beginning to end. The list is sorted by bytes.

The first item on the list is the machine itself. It shows the total of sent/received traffic. The second item (p051....cern.ch) is the EOS disk server eosatlas.cern.ch, which is contacted directly. Its traffic of 84.7 MB, divided by 100 events, corresponds roughly to the expected event size of 860 kB/event; it represents the transfer of the input data. The following element, caproxylbp.cern.ch, covers two things, namely CVMFS and ATLAS Frontier. The Frontier caching system delivers the detector calibrations, geometries and constants that are required for the reconstruction. It also acts as a Squid cache for CVMFS. Since the local CVMFS cache was already filled, the CVMFS contribution is small. At the beginning of the job, ccwbvip14.in2p3.fr is contacted, which is the ATLAS Metadata Interface (AMI). The domain name servers can be found under ip-dns-*.cern.ch. Everything else contributes little to the traffic and can be attributed to background.


In addition, mylcgvoms2.cern.ch is contacted in order to create a proxy.

CVMFS cache It was mentioned above that all these jobs were executed with a filled (hot) local CVMFS cache. To get an idea of how an empty CVMFS cache impacts the network usage, the same job as above was executed with an empty CVMFS cache. The local cache was cleared as much as possible, using “sudo cvmfs_config wipecache”. A quick check afterwards, using “df”, reveals that the cache is mostly, but not entirely, empty, see Table 5.4.

  Filesystem   1K-blocks   Used     Available   Use%   Mounted on
  cvmfs2       4096000     125175   3970826     4%     /cvmfs/atlas.cern.ch

Table 5.4: Output of the ‘df’ command on the VM after the local cache was cleared.

Figure 5.12 depicts the changed profile.

Figure 5.12: Network profile of a job that reconstructs one hundred events with an empty CVMFS cache beforehand. It was cut off at approximately t = 1450 s for better readability.

As expected, compared to the traffic with a hot cache, an increase in network traffic at the beginning of the job can be seen. Looking at the size of the cache afterwards, using “df”, reveals an increase in size of ∼ 1.2 GB, see Table 5.5.

RAM requirement An important next step was to look at how the memory influences the wall time. An understanding of the RAM requirements will be needed for overcommitting, see Chapter 8.


  Filesystem   1K-blocks   Used      Available   Use%   Mounted on
  cvmfs2       4096000     1317537   2778464     33%    /cvmfs/atlas.cern.ch

Table 5.5: Output of the ‘df’ command on the VM after a reconstruction job was run (cache filled).

This was performed by putting a limit in place that restricts the amount of available memory on the VM. This limit was then varied while monitoring the workflow performance. Initially, the RAM limitation was achieved by using cgroups, see Subsection 4.6.6. An even easier solution, adopted later on, was to allocate memory to another process instead (see Appendix 1). One encountered limitation was that when lowering the available RAM too much, at some point the VM runs out of swap space. Then the Linux kernel Out-Of-Memory Killer [112] starts killing processes; this is a protective measure of the kernel in order to stay operational. After this happened, the swap space was increased. In general, VMs performing ATLAS workflows have a limited swap space; the maximum size is twice the amount of available RAM. In Figure 5.13, the wall time is plotted as a function of the available memory. The interesting part is that the jobs at 6 GB of available memory use over 4.5 GB of swap space, but a significant slowdown can only be observed below 4.5 GB of memory. Indeed, the used swap space is not a good indicator when trying to understand how much the wall time is affected. A better metric is the number of pages that are swapped in and out per time interval. In the Appendix, a comparison of the page swap activity of two jobs can be found, see Figures A7 and A8. This highlights the difference between light and heavy swapping as explained in Subsection 4.6.3. The explanation why the job duration increases is swapping, see Subsection 4.6.3: the restricted memory forces the system to swap pages from the memory to the disk, which is much slower. As a consequence, the CPU has to wait longer for input data that has to be retrieved from the disk, which increases the wall time. One conclusion that can be drawn from this plot is that the available memory for this job should not be below 5 GB of RAM, otherwise a slowdown in the wall time has to be expected. This is not true for every raw data reconstruction job, as the memory requirement changes with newer software versions or different input data. The profile of a job that heavily swaps pages in and out of memory can be found in Figure 5.14. As expected, the disk read and write activity is heavily inflated, whereas the CPU usage drops significantly. After around 6000 s the disk is reading and writing at maximum capability and the CPU activity fluctuates between 0 − 70%. Half of the CPU activity after 6000 s is used by the system, which is swapping heavily, trying to handle the low amount of available RAM. These results on memory restrictions are important for the (memory limited) overcommitment, see Chapter 8.
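The “easier solution” of allocating memory to another process can be sketched as follows. This is a minimal, hypothetical memory-balloon script for illustration; it is not the exact tool that was used for these measurements.

    import sys, time

    # Simple memory balloon: grab N GB of RAM and hold it, so that less memory
    # remains available for the ATLAS jobs running on the same VM.
    gb_to_block = float(sys.argv[1]) if len(sys.argv) > 1 else 2.0
    n_bytes = int(gb_to_block * 1024**3)

    block = bytearray(n_bytes)          # allocate the memory
    for i in range(0, n_bytes, 4096):   # touch every page so it is actually resident
        block[i] = 1

    print(f"Holding {gb_to_block} GB of RAM, press Ctrl+C to release.")
    try:
        while True:
            time.sleep(60)              # keep the allocation alive
    except KeyboardInterrupt:
        pass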


Figure 5.13: Average wall time of memory-restricted jobs, averaged over four jobs, using cgroups. These were four parallel processes on a four-core VM. The input data was on the local disk. As can be seen from the error bars indicating the standard deviation (see Equation 5.1), the wall time fluctuates more as soon as the jobs start to swap heavily. The data point at 3.875 GB RAM has a small error bar because it only consists of two instead of four measurements, which incidentally performed similarly. The measurement focused on the interesting region, where the memory limit impacts the wall time; therefore there are no data points between 6 − 8 GB of RAM. The job specifics are in the Appendix A.1.2.

Reconstruction fluctuation

The fluctuations of the wall time of the different workflows will become extremely important for the model later on in Chapter 6. The variation when repeating the same reconstruction job, around 2.02% for CERN and 5.20% for Göttingen, was of the same order as the variation of similar simulation jobs. In Table 5.6, the similar-job results are shown. Figure 5.15 shows the wall time average as a function of an increasing number of similar reconstruction jobs. The plot depicts the results for the VM in Göttingen; the results for the VM at CERN can be found in the Appendix, see Figure A6. Overall the fluctuations are small, in the order of a few percent, compared to the wall time. The reconstruction workflow is the first workflow that uses a large set of input data.


Figure 5.14: Profile of a memory restricted job. The input data were on the local disk. Note the logarithmic scale of the disk transactions.

                Average wall time [s]   Standard deviation [%]
  CERN VM       14863                   2.07
  Göttingen VM  17043                   4.92

Table 5.6: Summary of executing nine similar reconstruction jobs.

This is one explanation why the same-job case already exhibits large fluctuations: they are introduced by having the disk as an additional component that can act as a bottleneck.

5.3.3 Simulated data reconstruction The only difference between raw data reconstruction and simulated data reconstruction is that there are additional steps in the beginning, namely the pileup/digitisation, as can be seen in the diagram in Figure 5.16. The higher the luminosity, the more interactions happen during a beam crossing. In 2012 there were, on average, 20 collisions per bunch crossing. Most of these are, however, inelastic proton-proton interactions, which are well understood and therefore of no interest. These extra interactions are termed minbias events; in the data this situation is referred to as a pileup of 20. For a physics analysis, pileup can impact the choice of cuts or object selections. Also, the efficiencies for separating data from background may change, which is why the pileup of the data and of the simulated data has to be the same. The luminosity will increase over the years and with it the pileup.


Figure 5.15: The wall time average over an incrementing number of similar reconstruction jobs, starting at one, at the VM at Göttingen. The black error bars show the standard error of the mean (see Equation 5.2). Note: in order to improve readability, the y-axis does not start at zero.

Figure 5.16: Simulation workflow chain with regard to the different data types.

This has an impact on the MC simulation, because these additional minbias events have to be simulated. Since it would take too many resources to simulate the pileup every time, in the current approach the simulation leaves the minbias simulation out. The pileup is generated separately and overlaid afterwards.


Simulated data reconstruction profile The profile is shown in Figure 5.17. The part after approximately 21000 seconds looks identical to the raw data reconstruction. The two additional steps, RAWtoRDO and RDOtoRDOTrigger, take up over two thirds of the whole job, from 0 s to 14000 s and from 14000 s to 21000 s, respectively. In addition, the first transformation, up until around 14000 s, reads much more data from disk than is read afterwards or in the other types of workflows.

Figure 5.17: The digitisation and reconstruction workflow profile.

5.3.4 Digitisation Digitisation transforms the simulated data, consisting of detector hits, into the detector response, the “digits” [110]. The detector response can range from a time window in which a threshold in a certain sub-detector part was exceeded, up to the full signal profile. It mirrors how the response would look for the real detector. The threshold is usually a current/voltage in a readout channel. The digitisation also takes into account the specifics of each subdetector component, like channel-dependent variations or electronic noise [110]. The overall detector behaviour is taken from previous laboratory tests of the components, cosmic runs and test beam data. In addition, the conditions for each run, such as dead channels and noise rates, are read from a database and included.

As previously explained, there is not only one clean event in the detector at any time; all of the additional noise has to be included in the pileup [110]. Per bunch crossing, multiple proton-proton collisions happen. In addition, due to long signal integration times, additional background from adjacent bunch crossings can be registered. In the extreme case of the muon chambers, at a bunch spacing of 25 ns, a window of around 70 bunch crossings has to be taken into account for the “out-of-time” pileup.

This background can be simulated and overlaid onto the simulated data from additional hits input files [110]. It scales linearly with the luminosity and the bunch spacing. There are also further inelastic proton-proton reactions and long-lived particles within the detector, as well as beam-gas and beam-halo interactions, that have to be taken into account [110]. The beam-halo originates from the beam interacting with accelerator elements, whereas beam-gas interactions are reactions of the beam with residual gas in the beam pipe. These interactions are taken from “zero bias” trigger data, which is reused with independent simulated data sets.

Digitisation and reconstruction fluctuation In Table 5.7, the similar digitisation and reconstruction (DigiReco) job results are shown.

                Average wall time [s]   Standard deviation [%]
  CERN VM       26142                   0.83
  Göttingen VM  2188                    2.88

Table 5.7: Summary of executing ten similar DigiReco jobs. At CERN 2000 events were processed, whereas at Göttingen only 200 events were processed.

At CERN, ten times as many events were processed, which explains the large discrepancy between the wall times. Figure 5.18 shows the wall time average as a function of an increasing number of similar DigiReco jobs. The plot depicts the results for the VM in Göttingen. Overall the fluctuations are small, in the order of a few percent, compared to the wall time.

5.3.5 Trigger simulation Up to this point, the trigger, see Subsection 3.2.5, has not been considered for the simulated data. For a later analysis, it is important to stay as close to the real data as possible, which is why the trigger has to be taken into account [113]. This provides an accurate simulation of the selection efficiencies as well as of the signal sensitivities. The trigger simulation takes the RDO data format as input. Since the trigger rates and cut-offs change in accordance with the luminosity and pileup, the software releases need to correspond to the releases used for the real data, unlike the MC simulation releases, which evolve over time [113].

5.3.6 Reprocessing Data reprocessing is basically a repetition of the reconstruction that happens at a later point in time. Data is reprocessed due to a major improvement in the conditions status of the detector and/or the software. Reprocessing applies, for example, updated calibrations, conditions and improved tracking, and corrects previously problematic regions of the detector. The data is reprocessed from the raw data format.


Figure 5.18: The wall time average over an incrementing number of similar DigiReco jobs, starting at one, at the VM at Göttingen. The black error bars show the standard error of the mean (see Equation 5.2). Note: in order to improve readability, the y-axis does not start at zero.

The MC simulation data might also be reprocessed to give consistent reconstruction results. Reprocessing is generally well defined and its resource requirements are known beforehand.

5.4 Analysis

In principle, a user can submit any piece of code, as long as the correct permissions are present. This means that even something that has nothing to do with a physics analysis⁶ would be accounted as an analysis job. The fraction of non-physics jobs is probably very low, but it highlights the fact that the job behaviour can be hard or impossible to predict. Because of this, and because the overall load of analysis jobs can vary widely, they are referred to as “chaotic” [114]: one analysis differs very much from another. In general, however, an analysis job runs over large datasets and is therefore mainly I/O intensive. Mostly due to the small cross-sections of the processes of interest, high statistics are required. High statistics means that the input data set has to be large: the decays/processes of interest can be very rare, so more data increases the chance of observing them in the first place. A nice example is the Higgs discovery, which did not happen moments after the LHC recorded the first Higgs event, but after collecting and analysing several years' worth of data.

⁶ For example printing “Hello World!”, or mining crypto currencies.


5.4.1 Group production In the context of a group production, physicists within a group are interested in similar physics processes. Group production is a pre-processing step for the end-user analysis: it selects specific tags and reduces the overall amount of data that has to be processed by the individual analyses. It creates derived physics datasets, which are then used by the group members. This analysis step is more structured and easier to predict than an individual user analysis. It mainly consists of three steps: skimming, slimming and thinning. Skimming means reducing a set of ATLAS events to include only events of interest for subsequent analyses [87]. Depending on the use case, this reduction can be quite significant, reaching several orders of magnitude. Thinning removes objects from within an event, based on the properties of the object [115]. Slimming is the process of reducing the event size by removing event information, such as variables, that is unnecessary for the analysis [116]. This is done uniformly and does not vary between different events.
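The three reduction steps can be pictured with a toy example. The event fields below are invented for illustration; real derivations are produced within the ATLAS software, not with a script like this.

    # Toy events: each event has some global information, a list of jet objects
    # and a variable that the analysis does not need.
    events = [
        {"passes_trigger": True,  "jets": [{"pt": 55.0}, {"pt": 8.0}], "aux_var": 1.2},
        {"passes_trigger": False, "jets": [{"pt": 30.0}],              "aux_var": 0.7},
    ]

    # Skimming: keep only events of interest (here: events passing a trigger selection).
    skimmed = [evt for evt in events if evt["passes_trigger"]]

    # Thinning: remove objects within an event based on object properties (here: soft jets).
    for evt in skimmed:
        evt["jets"] = [jet for jet in evt["jets"] if jet["pt"] > 20.0]

    # Slimming: drop event information (variables) uniformly for every event.
    slimmed = [{k: v for k, v in evt.items() if k != "aux_var"} for evt in skimmed]

    print(slimmed)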

5.4.2 Complete processing Combining Figure 5.2 with the more in-depth information from the previous subsections gives a more complete picture, see Figure 5.19. In principle, each of these steps can be run separately or everything can be combined into one job. This is handled differently by the different experiments; in ATLAS, pileup, digitisation and reconstruction are combined, as previously explained. Another important point is that this graph does not depict only one sequence of workflows. There are many different analyses that require many different inputs and, in the end, a plethora of different simulation, reconstruction and analysis workflows and combinations thereof.

Coincidentally, the classification of the different workflows agrees with their order. Everything up to reconstruction, meaning the simulation chain excluding reconstruction in Figure 5.19, can be classified as CPU intensive, as could be seen from the profiles. From the reconstruction onward, the workflows are I/O intensive.


Figure 5.19: The whole data processing in detail.

CHAPTER 6

Models and predictions

“Prediction is very difficult, especially about the future.” - Niels Bohr

In order to describe the workflow behaviour on an arbitrary infrastructure, a model was created. The goal was to make predictions that indicate the best workflow-infrastructure combinations, where the infrastructures considered include different Cloud offers. The initial approach of the model can be seen in the paper published in the CHEP proceedings, see [117]. At that point, the idea was to split a workflow into very fine substructures which form the basis of the model. This, however, led to a plethora of input parameters. In order to reduce their number, a whole new approach was used. This approach started with as few inputs as possible, to keep the complexity low. From there, input parameters were added successively, once it was clear that the accuracy of the prediction could be improved significantly by the additional value. There is a trade-off to be made between keeping the model simple and making it accurate. With this approach, the ideal model configuration has been found, making the model as simple as possible while retaining good predictive power. The benefits of a simpler model are that it is applicable without expert knowledge of the workflow and infrastructure and that it is more accessible to other experiments or users. Especially for Cloud computing, where some infrastructure aspects are unknown, a less complex model may be the only option. A disadvantage is that it might not be applicable to some special cases or configurations.

In the previous Chapter 5, an in-depth look at the profiles of the different workflows was given. These profiles are examples of how the workflows should look if they are run under good and stable hardware conditions. In reality, everything will be more chaotic due to constant changes to the software and input data, and due to outside influences such as neighbouring VMs with which resources are shared.

The model is introduced to predict how different workflows will perform under these conditions.

The model is specifically tailored to make predictions for the Cloud, presented in Section 4.2, but the results and conclusions can just as easily be applied to the Grid. The significance of having a prediction already becomes apparent during the procurement. Here, the customer has to communicate their requirements to the Cloud provider(s), and the correct balance between computing power, storage, input/output speed and memory has to be known. If the resources do not match the workflow requirements, money is wasted. It may be lost because too much of one resource was procured and that resource stayed partly unused, such as requesting more network bandwidth than needed. It can also be lost because too little of one resource was procured, in which case it becomes the bottleneck, at the expense of the other resources that then lie partly idle, e.g. too little network bandwidth and jobs (CPUs) waiting for input. The same is true for a Grid site. For a fair comparison, not only the workflow performance is important, but also the optimal configuration of the site; the model is able to provide both. In addition, an important question to be answered is at which point using the Cloud becomes economically viable. A wrong understanding of the workflow behaviour in the Cloud can tilt this answer in either direction, as the Cloud could be more or less performant than expected. This can potentially have a big impact on management decisions, e.g. whether to build a computing centre. Furthermore, there is more than one type of workflow. Understanding how each of them would perform individually can help to decide which workflows to run on which infrastructure, even independently of the Cloud. In the Cloud, due to the hardware flexibility, this becomes even more attractive: theoretically, each workflow could run on dedicated, specifically configured hardware.

6.1 Related work

There are different descriptive languages, such as TOSCA [118] and DADL [119], that describe application requirements and infrastructure resources. These were not adopted, as they were not flexible enough and because benchmarking is necessary in the Cloud environment. The descriptive languages match workflows to an infrastructure that fits their requirements. The model, on the other hand, predicts the workflow performance on infrastructures that do not exactly match the workflow requirements, as is described in Chapter 8. This is necessary in order to find the optimal Cloud configuration, meaning to exploit the flexibility of the resource configuration. Additionally, optimisation techniques for the workflows can easily be investigated by the model, which would not have been possible using these descriptive languages. Furthermore, since some of the infrastructure parameters within a Cloud are unknown, benchmarking becomes necessary. The workflows can therefore already be described by their respective benchmarks and thereby be used by the model directly, making an additional descriptive language unnecessary.

Several models that predict workflows on different infrastructures already exist. Some of them concern themselves with different resulting metrics, such as the response time in [120], and are therefore not applicable. The more relevant ones, such as [121] and [122], concern themselves mostly with the memory access in order to predict the performance. When considering different memory limitations and swapping, these models become unreliable or very cumbersome. In [123], a neural network is used to predict the application performance in a virtualised environment; this approach concerns itself with static infrastructures. Further approaches model the entire infrastructure, such as [124] and [125]. Modelling the whole WLCG is out of the scope of predicting and comparing different Cloud providers. In the end, focusing on the Grid as a whole and investigating how adding another Cloud site would impact the overall Grid is unnecessary. A Cloud site can be different, with regard to the resources it provides and the workflows it accepts, from any of the existing Grid sites, and a Cloud site that receives dedicated workflows influences the overall Grid in a very minimal way. Many of the models do not solve the problem of how to obtain the performance metrics for a Cloud site, or a generic benchmark is used. For Cloud sites, not all the hardware and configuration information is available, in contrast to a Grid site. In the model introduced in this thesis, an approach that classifies and benchmarks the ATLAS workflows is used. Another big difference to the presented tools is that they do not consider scenarios in which the available resources are not sufficient. This means, for example, that they cannot explain what happens if a workflow is run on a machine that does not provide enough RAM, or how the performance is impacted in case CPUs are overcommitted, see Chapter 8. The biggest advantage of the model in this thesis over the models presented above is that the latter compute only the performance of one specific configuration. The possibility to find the optimal configuration of a Cloud site is an integral part of the model presented in this chapter. This allows the flexibility of the Cloud infrastructure to be considered, by matching the different workflows with the best corresponding infrastructure configuration.

6.2 The Workflow and Infrastructure Model

In order to predict workflow performances on different infrastructures, the “Workflow and Infrastructure Model” (WIM) was created. The model has been designed in a very general way, so that all ATLAS workflows, all the other experiments’ workflows and even non-physics workflows can be described by it.

For the Cloud use case, there are several scenarios in which the model can be used. The difficulty lies in the fact that, when buying a CPU core, the performance for a given task is almost impossible to predict accurately, even if the manufacturer, generation and clock speed are given.

In addition, in a virtualised environment the provider can decide how many virtual cores run on any given machine. The provider could simply double the number of cores that are accessible to the customer, even though the underlying hardware remains the same, which would lead to a drop in performance. Therefore there is no way around benchmarking, meaning measuring the performance, see Subsection 4.6.1. The first possibility is that the user has access to part of the Cloud infrastructures beforehand. In this case it is possible to run benchmarks and thereby acquire information on the performance of the Cloud infrastructures. The second scenario is that there is no customer Cloud access. In that case, the customer has to supply a benchmark suite to the providers and the results are then published. This would usually already take place during the tendering phase of the procurement. In either case, it is unavoidable to have infrastructure information from benchmarks that can be used as input for the WIM. The standard benchmark in HEP is HEPSPEC06, which could be used for this purpose, see Subsection 4.6.1. However, there are several downsides to the HEPSPEC06 benchmark, one of them being that it takes a long time to finish, in the order of one day, and is therefore very costly. Furthermore, there are very different workflows within HEP, some of which are well and some badly represented by the HEPSPEC06 benchmark. For these reasons it was decided not to use the HEPSPEC06 benchmark as a normalisation in the WIM, but to use the actual ATLAS workflows that represent the workload instead.

6.2.1 Functionalities

Viewed chronologically, the model was originally created to make a prediction of the required bandwidth of a future Cloud site and evolved from there to include further prediction values. In order to compare Cloud sites, the overall output metric events/time/cost (ETC), which is measured in events/second/CHF, was chosen as a good comparative value. The abbreviation CHF stands for Swiss francs, the currency used at CERN. The ETC metric describes the “amount of physics” that can be computed per time and money. It solves the problem of comparing Clouds and different infrastructure configurations, as well as other optimisations, with each other.
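As a simple numerical illustration of this metric (all numbers below are invented):

    # Invented numbers, only to illustrate the ETC (events/time/cost) metric.
    n_events = 1_000_000          # events processed
    duration_s = 24 * 3600        # duration of the activity [s]
    cost_chf = 500.0              # total cost [CHF]

    etc = n_events / duration_s / cost_chf
    print(f"ETC = {etc:.2e} events/s/CHF")   # about 2.3e-2 events/s/CHF here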

Results

The WIM takes a set of input parameters and returns a predictive value. For different use cases, the overall model needs to be more flexible than just returning one possible output. Any input metric can become the result, which is useful for cases where infrastructure requirements are unknown, e.g. the bandwidth, see Subsection 7.3.2. The model provides these results to choose from: ETC, events/cost, events/time, workflow duration, required bandwidth, cost/event, number of events produced in case of a fixed duration Cloud activity, download speed, total input size, download duration or overall CPU time.


Variations The model has a built-in mechanism that makes it possible to vary input parameters over a range. This produces multiple results, one for each variation of the input. Multiple input parameters can even be varied in parallel, leading to additional results. The best resulting value is provided, for example the one maximising the ETC ratio.

Correlations Comparing these different results from varied input parameters with each other can easily show trends and the impact of the inputs. Looking at trends can help to decide how to allocate resources, for example if the result increases or decreases significantly by increasing one infrastructure parameter. In a procurement, beneficial high-impact infrastructure aspects would then be favoured over low-impact ones, when tuning the hardware configuration. A comparison can also highlight correlations between the different infrastructure and/or workflow aspects. This could, for example, indicate a possible bottleneck in one infrastructure aspect in the case where another aspect is varied.

Comparisons Another application of the WIM is the comparison of different Cloud sites. In this case the model is run once for each site or Cloud provider, which shows which site performs better in the given use case. When searching for optimisations on an existing infrastructure, for example, cost does not play a role. Then the WIM can provide the events/s metric as output, which is a measure of the physics throughput and favours fast, usually more expensive hardware. If there is no time pressure, the infrastructure should be optimised for events/CHF, producing physics results as cheaply as possible; this favours cheap, usually slower hardware.

Plots The model can vary input values over a range and display the results in a plot. In order to represent the changing results when input A and input B vary, a 3D plot is generated. An example can be found in Chapter 8, see Figure 8.10. Input A is on the x-axis, input B on the y-axis and the result on the z-axis. The maximum result value is given in the output and highlighted in the plot. The 3D plot is colour coded in the z-value and can be rotated for better visibility. In case more than two input values need to be compared, the model creates multiple of these 3D plots, each of them with a different value of input C; the corresponding input C values are indicated in the file names of the plots. Since it would be tedious to look through all the plots just to find the best result, the overall maximum result value is given in the standard output, together with the corresponding input values.
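A minimal sketch of such a two-parameter scan, including the reporting of the overall maximum and a 3D surface plot, could look as follows. The result function is a made-up stand-in and not the actual WIM calculation.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3D projection)

    def result(n_cores, bandwidth):
        # Stand-in "events/time/cost"-like shape (invented), only for illustration.
        throughput = n_cores * bandwidth / (bandwidth + 2.0)   # made-up events/s
        cost = 1.0 + 0.3 * n_cores + 0.2 * bandwidth           # made-up CHF
        return throughput / cost

    cores = np.arange(4, 17)                 # input A, varied over a range
    bw = np.linspace(1.0, 10.0, 20)          # input B, varied over a range
    A, B = np.meshgrid(cores, bw)
    Z = result(A, B)

    # Report the overall maximum together with the corresponding input values.
    i, j = np.unravel_index(np.argmax(Z), Z.shape)
    print(f"best result {Z[i, j]:.3f} at {int(A[i, j])} cores, {B[i, j]:.1f} Gb/s")

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")    # 3D plot, colour coded in the z-value
    ax.plot_surface(A, B, Z, cmap="viridis")
    ax.set_xlabel("input A: cores")
    ax.set_ylabel("input B: bandwidth [Gb/s]")
    ax.set_zlabel("result")
    plt.savefig("scan_inputA_inputB.png")    # the input C value could be encoded in the file name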

Overcommitting Overcommitting is one of the possible optimisations that can be applied. It is described in detail in Section 8.1. In order to fully integrate overcommitting into the model, further input parameters have been added. With this functionality, which is described in Subsection 8.4.1, the model is able to indicate whether it is beneficial to run more parallel processes than there are available cores on a VM, at the cost of a higher RAM requirement.

Overview With these illustrative capabilities, the model is able not only to deliver the optimal hardware configuration, but also to show which input parameters are of lesser or higher importance for the overall outcome. This immediately indicates bottlenecks and the potential for optimisations. The presented features provide answers to questions such as: for a fixed budget, which Cloud infrastructure should be acquired in order to maximise the number of processed events? Which combination of bandwidth, caches and Cloud storage is the most beneficial? Similarly: which combination of workflows (simulation, reconstruction and analysis) is the best to run on this Cloud? The general use cases include:

- For an infrastructure and a workflow, how much bandwidth is needed?

- For a flat budget, how many events can be produced?

- Does this optimisation have a positive impact?

- How many events per second are produced?

- Does the event size affect the processing duration?

6.2.2 Model input One of the WIM's goals is to describe how a workflow will behave within the Cloud. Therefore, the required input cannot be too specific, as some Cloud specific information is not publicly available. A balance between accuracy and complexity was struck by creating a very basic model and carefully adding only relevant components to it. The relevance was determined by comparing results from the error and validation considerations. The input parameters that were chosen in order to describe the workflow performance depend mostly on the benchmarks. In Figure 6.1, the different inputs and the most commonly used output parameters are depicted.


Figure 6.1: Model outline, depicting the different inputs and possible outputs.

On the bottom, the different possible outputs are shown. The list is not complete, as there are many more possibilities and combinations that can be obtained as a result. The left side depicts infrastructure and workflow related values that are obtained from the benchmarking. Due to the nature of the benchmarking these values are intertwined and cannot be separated into workflow or infrastructure inputs. The values on the right side are either known or can be precisely specified during the procurement, so no benchmarking is necessary. The exhaustive list of input parameters that the model needs for the infrastructures, the workflow and the plot, can be found in the Appendix, see Subsection A.3. The central part labelled “Model”, which contains the transformations of the inputs into the different outputs, is described in detail in the next Section 6.3.

The functionalities described in Subsection 6.2.1 include the possibility to compare the results of varied input parameters. This is achieved in the WIM by using lists as variable inputs. An example is the default value of the number of CPU cores, “Nr Cores”: [8.0, 9.0]. At most three input parameters can be given as lists of values; the model will then create plots of all possible combinations.
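A possible way to picture the expansion of list-valued inputs into individual model configurations is sketched below. The parameter names follow the example above, but the expansion code itself is an illustration and not the actual WIM implementation.

    from itertools import product

    # Inputs: scalars stay fixed, lists are varied (at most three list-valued inputs).
    inputs = {
        "Nr_Cores":  [8.0, 9.0],          # varied input A
        "Bandwidth": [1.0, 5.0, 10.0],    # varied input B (hypothetical values)
        "Nr_Events": 1000.0,              # fixed input
    }

    varied = {k: v for k, v in inputs.items() if isinstance(v, list)}
    fixed = {k: v for k, v in inputs.items() if not isinstance(v, list)}

    # One model evaluation per combination of the varied inputs.
    for combo in product(*varied.values()):
        config = dict(fixed, **dict(zip(varied.keys(), combo)))
        print(config)    # here the model would be evaluated for this configuration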

Usage

The WIM is used in a simple way. First of all, in order to predict a workflow, the model needs some workflow parameters as input. These are obtained by running the workflow on a reference machine and on the target hardware (e.g. the Cloud).

Running the workflow on the reference machine, where all the monitoring possibilities are available, provides the workflow specific metrics, such as the run time and the time the CPU was utilised. The hardware parameters, such as the CPU factor indicating the CPU power, are obtained by additionally looking at the results from the target hardware. Variables related to the reference machine are subscripted with rf, variables related to the target machine with tg. Further details can be found in the appendix, see Subsection A.3.1.

6.3 Model logic

After seeing what the WIM can do, it is important to have a look at how it is done, i.e. at the inner workings of the model. The most crucial part of the model is to determine the workflow duration on a given infrastructure; most of the other outputs can then be derived from this duration. A typical ATLAS workflow consists of several transformations as well as substeps, as could be seen, for example, in Section 5.3.1. Since the model uses the actual workflow as a benchmark, for basic usage it is not necessary to have detailed information about the intricacies of the workflow. These are looked at during the testing and validation in Chapter 7.

6.3.1 Workflow duration The workflow duration prognosis is the central part of the model. It is chosen as the starting point from which the underlying structure of the model is explained in detail. The workflow duration is a complex combination of all the input parameters. It can be divided into times when the CPU is used and times when it is idle. In the following subsections, first the CPU consumption time (Subsection 6.3.2) is explained and then, consecutively, all scenarios in which the CPU is idle (Subsections 6.3.3 to 6.3.6). Afterwards, an additional consideration is undertaken that enters the equation when looking at multi-core systems (Subsection 6.3.7). Then the overall cost considerations are integrated via the number of resulting machines in Subsection 6.3.8. In the end, it is shown how the workflow duration is used to determine further final model results, see Subsection 6.3.9. The addition of the overcommitment functionality is done in Subsection 8.4.1.

At the topmost level, the workflow duration (WF_Time) is a linear combination of three different aspects: the CPU consumption time (CPU_Time), the CPU idle time (Idle_Time) and the CPU I/O wait time (IO_Wait_Time).

    WF_Time = CPU_Time + Idle_Time + IO_Wait_Time    (6.1)

In order to understand some of the factors in the following formulas, it is useful to know in which format the input values are put into the model. As an example, many input values are given on a per-event basis, which is why some results are obtained by multiplying by the number of events (Nr_Evts). More details can be found in the appendix, see Subsection A.3.

6.3.2 CPU consumption time The CPU consumption time (see Equation 6.3) is the duration during which the CPU is in use by a program. This is also called the processing time, and processing happens throughout the whole workflow. The more powerful a CPU is, the more clock cycles it can perform in a given time and the more instructions it can perform per clock cycle. As a result, a program that is mostly CPU intensive will execute faster on a machine with a more powerful CPU. The CPU consumption time is therefore infrastructure and workflow dependent. How the CPU consumption time is obtained and what exactly is put into the WIM can be found in the Appendix, see Section A.3.

CPU power In order to use the CPU consumption time on multiple infrastructures, it has to be normalised. This is done through benchmarks that provide a measure of the relative CPU speed of two or more machines. The reference machine, which can be any (virtual) machine that can run the benchmark, gets an arbitrary CPU power default value (CPU_power_rf). The model input for the CPU power of the target machine, CPU_power_tg, is obtained by:

    CPU_power_tg = (CPU_Time_rf * CPU_power_rf) / CPU_Time_tg    (6.2)

This is the relative power of the target machine compared to the (arbitrary value of the) reference machine. Note that it is possible to use any kind of benchmark for this purpose, even HEPSPEC06 or non-physics workflows. According to this definition, a more powerful CPU gets a larger CPU power value. In the model, the CPU consumption time is easily calculated by multiplying the input CPU time with the Cloud infrastructure’s CPU power, number of events and parallel AthenaMP processes.

    CPU_consumption_time = CPU_Time_rf * CPU_power_tg * nr_processes * Nr_Events    (6.3)

Here, the multiplication by the number of parallel processes is important, because the model is constructed such that more processes mean additional workload: each additional process increases the number of processed events, which will be relevant in Section 8.1.
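Written out in code, the normalisation and the CPU consumption time look as follows. The numbers are invented and the snippet follows Equations 6.2 and 6.3 exactly as given above; it is a sketch, not the WIM implementation.

    # Benchmark input (invented numbers): CPU time per event on the reference and
    # target machine, and the arbitrary CPU power assigned to the reference machine.
    cpu_time_rf = 10.0        # [s/event], measured on the reference machine
    cpu_time_tg = 12.5        # [s/event], measured on the target (Cloud) machine
    cpu_power_rf = 1.0        # arbitrary default value

    # Equation 6.2: relative CPU power of the target machine.
    cpu_power_tg = cpu_time_rf * cpu_power_rf / cpu_time_tg

    # Equation 6.3: CPU consumption time for nr_processes parallel AthenaMP
    # processes, each processing nr_events events.
    nr_processes = 4
    nr_events = 1000
    cpu_consumption_time = cpu_time_rf * cpu_power_tg * nr_processes * nr_events

    print(f"CPU_power_tg = {cpu_power_tg:.2f}, CPU consumption time = {cpu_consumption_time:.0f} s")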


6.3.3 Idle time The idle time, as the name suggests, is the part of a workflow in which the CPUs are idle. This excludes time spans during which the CPU is idle because it is waiting on an I/O operation going through the block layer of the system. In general, CPUs waiting for network I/O operations are accounted in the idle time.

Other big contributors to the idle time can be found in multi-core VMs. At any point in time, if only one CPU core is used, the remaining cores count as idle. This is usually the case during the start-up phase, until all code is loaded into memory, and at the end of the job, when the different threads or processes finish at different times. In ATLAS jobs there are single-core merging steps that additionally contribute to the idle time. The model treats the idle time in a similar manner as the CPU time. Since benchmarks have already been run in order to obtain the CPU power, additional metrics from these results are used for the idle factor. The idle factor for the target machine (Idle_factor_tg) is generated by comparing the idle time of the reference machine (Idle_Time_rf) and of the target machine (Idle_Time_tg):

    Idle_factor_tg = Idle_Time_rf / Idle_Time_tg    (6.4)

For simplicity, the arbitrary idle factor of the reference machine is set to one (Idle_factor_rf = 1).

6.3.4 I/O wait time The I/O wait time refers to the time the CPUs are waiting for I/O operations to be performed. This includes all I/O that goes through the block layer, mainly to the disk or to a network attached storage. Therefore the I/O wait time is mostly dependent on the disk speed. There are different kinds of disks, such as SSDs and HDDs. Some disks can handle more read/write operations per second; these are better suited to handle I/O operations and therefore positively influence the I/O wait time. Of course there are also speed differences between disks of the same kind, albeit much smaller ones. If, however, another process can use the CPU while one process is waiting on an I/O operation, it will do so, and this time is then not accounted as I/O wait time. That is to say, there are scenarios in which the I/O wait time is close to zero even though processes had to wait for I/O.

The model treats I/O wait time in a similar manner as the CPU and idle time, reusing the monitoring results from the previous benchmarks. The I/O power for the target machine (IO_power_tg) is generated by comparing the I/O wait time of the reference (IO_Time_rf) and target machine (IO_Time_tg):

IO\_power_{tg} = \frac{IO\_Time_{rf}}{IO\_Time_{tg}} \qquad (6.5)

The I/O power of the reference machine is set to one (IO_power_rf = 1).
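The three scaling factors of Equations 6.2, 6.4 and 6.5 all follow the same pattern and can be sketched compactly in Python, the language in which the model is implemented; the function names, variable names and example numbers below are illustrative only and are not taken from the actual WIM code:

def relative_factor(time_rf, time_tg, factor_rf=1.0):
    """Generic ratio used by the model: the CPU power (Eq. 6.2), the idle
    factor (Eq. 6.4) and the I/O power (Eq. 6.5) all compare the reference
    to the target machine in the same way."""
    return time_rf * factor_rf / time_tg

# Illustrative benchmark numbers: a target machine that needs 20% more
# CPU time than the reference gets a CPU power below one.
cpu_power_tg = relative_factor(time_rf=3600.0, time_tg=4320.0)   # ~0.83
idle_factor_tg = relative_factor(time_rf=800.0, time_tg=650.0)   # ~1.23
io_power_tg = relative_factor(time_rf=50.0, time_tg=90.0)        # ~0.56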


6.3.5 Overhead time

There are some overheads that occur during a workflow. Their total duration is relatively short, therefore any improvements, such as better hardware, have only a slight impact on the overall workflow duration.

The most impactful overheads are the stage-in and stage-out durations. In the WIM they are intended to be treated separately from the idle time. The stage-in duration can be described by:

StageIn\_Time = \frac{Nr\_Events\_Files \cdot Size\_Evt\_input}{Download\_Speed} \qquad (6.6)

Here, Nr_Events_Files is the number of events within the input files and Size_Evt_input is the size of an event on disk. Since the model does not describe only one workflow, but an entire site, the overall download speed (Download_Speed [B/s]) is limited. The total download speed of a worker node is limited by the bandwidth (Bandwidth). The worst case can be determined from the available bandwidth and the number of parallel downloads:

Download\_Speed \leq \frac{Bandwidth}{Nr\_Machines} \qquad (6.7)

Here, Nr_Machines is the number of available (virtual) machines. Since the time it takes to download the input is smaller than the processing time, there are two scenarios. The first scenario is that all workflows try to download at the same time. This can happen when the Cloud infrastructure is first initiated. Here, the many parallel downloads are limited by the bandwidth and compete with each other. After some time, however, with some workloads finishing faster than others, the second scenario takes place: the downloads of the workflows are distributed in time and only a small fraction of them overlap. A similar calculation is done for the stage-out time, replacing "input" with "output" and "download" with "upload".
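The stage-in estimate of Equations 6.6 and 6.7 can be sketched in the same way; the names and the example numbers are illustrative only, and the worst case of fully overlapping downloads is assumed:

def download_speed_worst_case(bandwidth_bps, nr_machines):
    """Worst-case download speed per machine in B/s when all machines
    download at the same time (Equation 6.7)."""
    return bandwidth_bps / float(nr_machines)

def stage_in_time(nr_events_files, size_evt_input, download_speed):
    """Stage-in duration in seconds for a given per-machine download speed
    in B/s and event size in bytes (Equation 6.6)."""
    return nr_events_files * size_evt_input / float(download_speed)

# Illustrative example: a 10 Gb/s link shared by 500 machines leaves
# 2.5 MB/s per machine; 2000 input events of 2 MB each then take ~1600 s.
speed = download_speed_worst_case(10e9 / 8.0, 500)   # 2.5e6 B/s
print(stage_in_time(2000, 2e6, speed))               # 1600.0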

Another possibility is that, instead of downloading the data, it could be read event-by-event during job execution. This approach is additionally negatively affected by high latencies, also known as the "event access time". In this case, it is not feasible to separate the stage-in and stage-out from the idle time. The idle time would be used as it is, while bypassing the bandwidth considerations.

The remaining overheads can be summarised into a single input value. For most scenarios it can be set to zero, as all overheads should already be included. There is one exception, however, namely if a Cloud user creates and destroys a VM for every workflow. Since this happens completely outside of and independently of the workflow, it has to be considered in addition.


6.3.6 Swap time

Swapping (see Subsection 4.6.3) is a special case of I/O wait that is accounted as I/O wait time. An investigation of swap time is therefore only necessary when trying to describe the job behaviour in scenarios where too little RAM is available. This is mostly relevant when overcommitting, see Chapter 8, and when changing the available amount of RAM, for example within the Cloud. There have been efforts to describe the swap time as a function of the swap space, but this is difficult and different for each transformation. An investigation into how the available RAM affects the wall time of a reconstruction workflow can be found in Subsection 5.3.2. Since heavy swapping slows down the workflow significantly, it will never be a realistic option for an acceptable event throughput. This was illustrated in Subsection 4.6.5, where having an idle CPU core was preferred over heavy swapping. Heavy swapping occurs if the available RAM (RAM) is smaller than the required amount (RAM_limit). Therefore, instead of describing the exact (non-linear) progression, a large but constant penalty is applied wherever heavy swapping occurs:

Swap\_Time =
\begin{cases}
0, & \text{for } RAM\_limit \leq RAM \\
10^{14}, & \text{else}
\end{cases} \qquad (6.8)

This effectively cuts off any heavy swapping scenario from the viable results.
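Translated directly into code, the penalty of Equation 6.8 is a simple step function (an illustrative sketch, not the actual WIM implementation):

def swap_time(ram, ram_limit, penalty=1e14):
    """Return zero if enough RAM is available, otherwise a constant penalty
    that removes the configuration from the viable results (Equation 6.8)."""
    return 0.0 if ram_limit <= ram else penalty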

6.3.7 Undercommitted idle time

Undercommitment is another scenario that is implemented in the model, but can be ignored for regular usage. The idle time when undercommitting (UC_Idle_Time) is only relevant for cases in which fewer processes than cores are executed in parallel, see for example Subsection 4.6.5. This effectively wastes CPU time in the amount of:

UC\_Idle\_Time = (Nr\_cores - nr\_processes) \cdot Time\_One\_Workflow \qquad (6.9)

which is accounted in the workflow time at the end. In general, the UC_Idle_Time is already included in the idle time. When comparing VMs with a similar core count and number of processes, it is also not needed. It can be useful when the reference machine has more cores than the target machine, which is not recommended. In that case, in order to get comparable results, one would run as many parallel processes on the reference machine as there are cores on the target machine. The UC idle time then helps to compare the results.

The overall duration of the workflow, Overall_WF_Time, is calculated by applying the undercommitment from Equation 6.9 to the standard workflow time:

Overall\_WF\_Time = (WF\_Time - UC\_Idle\_Time) / Nr\_Cores \qquad (6.10)


For regular workflows, the UC idle time is zero.
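Equations 6.9 and 6.10 could be sketched as follows; the names are illustrative, and WF_Time is taken to be the workflow time summed over all cores, which the division by Nr_Cores in Equation 6.10 suggests:

def uc_idle_time(nr_cores, nr_processes, time_one_workflow):
    """CPU time wasted when running fewer processes than cores (Equation 6.9)."""
    return (nr_cores - nr_processes) * time_one_workflow

def overall_wf_time(wf_time, uc_idle, nr_cores):
    """Overall workflow duration after subtracting the undercommitted
    idle time (Equation 6.10)."""
    return (wf_time - uc_idle) / float(nr_cores)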

6.3.8 Number of machines

Depending on the use case, there are several possibilities for determining the resulting number of machines. The important final metric is how many machines are available over which period of time (Nr_machines). In the Cloud scenario, the assumption is that there is a budget (Total_Cost) that cannot be exceeded and a given time frame in which the Cloud activity takes place (infrastructure_duration)¹:

Nr\_machines = \frac{Total\_Cost}{machine\_cost \cdot infrastructure\_duration} \qquad (6.11)

When increasing or lowering the amount of RAM per core (from RAM to RAM_new), the number of machines decreases or increases accordingly (from Nr_machines to Nr_machines_new). The new number of machines is calculated with the help of an exchange rate between CPUs and RAM (RAM_CPU_factor). This exchange rate is Cloud provider specific and depends on the provider's pricing scheme.

Nr\_machines\_new = \frac{\left(Nr\_Cores - (RAM\_new - RAM) \cdot RAM\_CPU\_factor\right) \cdot Nr\_machines}{Nr\_Cores} \qquad (6.12)

Alternatively, the infrastructure may already be owned by the user, in which case the number of machines does not have to be computed and can simply be put into the model. In that case, the infrastructure duration becomes obsolete. The overall duration of the exercise in the Grid case is determined by the workload.

6.3.9 Final result

Several different metrics are meaningful as the final result. The model can output any of them, as they are just different combinations of the already calculated values. The most useful metric, the events/time/cost value ETC, is obtained by the following equation:

ETC = \frac{Nr\_Events \cdot OCF \cdot Nr\_machines\_new}{Overall\_WF\_Time \cdot Total\_Cost} \qquad (6.13)

where OCF is the overcommit factor that will be introduced in Subsection 8.1.1. For the general case without overcommitting, OCF = 1. The metric events/time can be obtained by removing Total_Cost from Equation 6.13.
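Equations 6.11 to 6.13 can be combined into a short sketch; the grouping of Equation 6.11 assumes that machine_cost is quoted per machine and per unit of time, and all names are illustrative rather than taken from the WIM code:

def nr_machines(total_cost, machine_cost, infrastructure_duration):
    """Number of machines affordable within the budget (Equation 6.11)."""
    return total_cost / (machine_cost * infrastructure_duration)

def nr_machines_new(nr_cores, ram_new, ram, ram_cpu_factor, n_machines):
    """Number of machines after trading RAM against CPU cores (Equation 6.12)."""
    return (nr_cores - (ram_new - ram) * ram_cpu_factor) * n_machines / float(nr_cores)

def etc_metric(nr_events, ocf, n_machines_new, overall_wf_time, total_cost):
    """Events/time/cost metric (Equation 6.13); OCF = 1 without overcommitting."""
    return nr_events * ocf * n_machines_new / (overall_wf_time * total_cost)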

The schematic of how the inputs are computed up to the final result can be found in Figure 6.2.

¹ The infrastructure duration cannot be smaller than the workflow duration.


Figure 6.2: The model calculation schematics. The red numbers on the left side indicate the level [117].

The calculation starts at the bottom with the input values, some of which are displayed on level zero. These are combined into higher level input metrics, which are shown on level one. On level two, combined workflow and infrastructure metrics are created. These are combined to represent the durations of the different workflow substeps on level three. A combination of those gives the wall clock time of the entire workflow. In parallel to this, the cost considerations on level four are computed, using inputs from levels zero and one. The final result on level five is obtained by combining the workflow duration with the cost.

6.3.10 Estimation of uncertainties

The WIM provides an uncertainty for each of its results, which indicates the accuracy of the prediction. The variations of each workflow can be measured in detail, which was done in Subsection 7.1.3. In the case of Cloud computing, the infrastructure is subject to further fluctuations. This is dealt with by considering the standard deviation on the target infrastructure, thereby including workflow as well as infrastructure fluctuations. The CPU time, I/O wait time and idle time (x_i) each have a standard deviation (s_i). These are propagated up to the final result R, which automatically takes care of the impact they have on it:

R = R(x_1, x_2, \ldots) \qquad (6.14)


s_R = \sqrt{\sum_{i=1}^{m} \left( \frac{\partial R}{\partial x_i} s_i \right)^2 + 2 \sum_{i=1}^{m-1} \sum_{k=i+1}^{m} \frac{\partial R}{\partial x_i} \frac{\partial R}{\partial x_k} \, s(x_i, x_k)} \qquad (6.15)

In this case m = 3, corresponding to the number of input values x_i. This formula includes covariance terms (s(x_i, x_k)) that become zero if the two values are uncorrelated. This is the case for all pairs except the CPU time and the idle time, for which the exact correlation is unknown. An upper bound of the uncertainty is estimated through:

s_R = \sqrt{\left( \sum_{i=1}^{m} \frac{\partial R}{\partial x_i} s_i \right)^2} \qquad (6.16)
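The upper bound of Equation 6.16 can be evaluated numerically with the modules used for the model; the following numpy-based sketch estimates the partial derivatives with finite differences and is purely illustrative, not the actual WIM implementation:

import numpy as np

def upper_bound_uncertainty(result_func, x, s, rel_step=1e-6):
    """Upper bound on the propagated uncertainty of R(x_1, ..., x_m), Equation 6.16.
    result_func takes an array of the m central values (CPU, I/O wait and idle time),
    x are those central values and s their standard deviations."""
    x = np.asarray(x, dtype=float)
    s = np.asarray(s, dtype=float)
    total = 0.0
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = rel_step * max(abs(x[i]), 1.0)
        # central finite difference for the partial derivative dR/dx_i
        derivative = (result_func(x + step) - result_func(x - step)) / (2.0 * step[i])
        total += derivative * s[i]
    return abs(total)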

6.4 Programming tools

The model was realised in Python v2.7.12. The modules used were numpy v1.11.0, json v2.0.9, math, pandas v0.17.1, unittest and matplotlib v1.5.1.

6.5 Complex workflow model

Initially, in order to give a good prediction, a job was split into several transformations. These were put into the model separately. Transformations were treated independently of each other and put together afterwards in a modular way, because different jobs and workflows can consist of different combinations of these transformations. An example would be DigiReco and raw data reconstruction, which both use the same reconstruction transformation, but also contain transformations that cannot be found in the other workflow. Ultimately, a job or a workflow consisting of multiple jobs is treated as a series of its constituent transformations. As mentioned at the beginning of this chapter, this approach has been abandoned. A positive side effect is that the possibility to use several transformations as input still remains. This now provides the functionality that a mix of several workflows can be entered into the model at the same time.


CHAPTER 7

Cloud workflow modelling

Over the course of this thesis, several commercial Cloud procurements took place within CERN IT. The most recent pre-commercial procurement tender, called HNSciCloud, took place within the HelixNebula initiative. Most notably, this EU project provided access to several commercial Cloud providers and consortia. With access to multiple providers, a good understanding of the Cloud environment and its behaviour could be gained and measured, see Section 7.2.

In order to make statements about the Cloud, especially comparisons between the providers, the WIM is used. Before using the WIM, which was described in detail in Chapter 6, it is imperative to understand the Model's accuracy. This includes the inherent uncertainty estimation that is provided by the Model. This chapter therefore first explores the parameter space of the WIM, while comparing its predictions to independent measurements, see Section 7.1.

Two of the Cloud providers offered an object storage. The possibilities to make use of it were, however, very limited. The results are explained in Subsection 7.2.2.

7.1 Validation

7.1.1 Strategy

In order to do a precise validation, the strategy was to start from a simple, controlled environment and to increase the complexity step by step. The validation is only meaningful if it is ensured that different combinations of workflows and infrastructures will be predicted correctly. The scheme can be seen in detail in Figure 7.1.

Figure 7.1: Detailed plan of how the validation was performed. The starting point is at the top and the validation progresses towards the bottom. In reality, this flow would continue further down; for space reasons the diagram has been cut off at the end, since the principle can already be understood from it.

The validation begins with a well-defined, simple workflow on a well-known infrastructure. Even though the infrastructure is already known, it is benchmarked to provide a precise Model input. Then, hardware and workflow aspects are changed alternatingly. This is indicated by a differently coloured downward arrow, either on the left or the right side. Each change in the configuration requires a new benchmarking iteration. The different colours of the benchmarks in Figure 7.1 indicate whether the same benchmark is used to describe hardware or workflow aspects. After every step, the Model results are compared to the measurement and, if necessary, improved. Necessary in this case means that the difference between the results is large. The improvements range from removing software bugs to adding workflow or infrastructure parameters in order to increase the accuracy. Each improvement requires the results of all previous configurations to be cross-checked. Initially, an overall accuracy of 20% was aimed for. After seeing the results, this goal became more ambitious.

After reaching the bottom of the diagram, the validation is not finished. Further workflow and hardware parameters can be varied, following the same principle. In reality the steps were parallelised as much as possible in order to save time. This meant having several different hardware configurations on different VMs at the same time and processing the tests in parallel.

The workflows that were used were ATLAS workflows. An increase in complexity was achieved by going from the simplest to the most complex workflows. The workflows that were chosen, in the above described order, were: event generation, Monte-Carlo simulation, raw data reconstruction, and simulated data reconstruction. Event generation is the simplest workflow, because it does not use input data and creates very little output data. It is therefore CPU bound and also the only workflow that runs on only a single core. Monte-Carlo simulation uses only the output generated by event generation as input, which is very little. It is therefore also CPU bound, using multiple cores. Raw data reconstruction is more data intensive than Monte-Carlo simulation. Simulated data reconstruction, including digitisation and pileup, is the most data intensive of the investigated workflows.

Direct validation

Before this structured approach, a more direct validation had been attempted. In it, the WIM results were immediately compared to Cloud results within T-Systems. This T-Systems Cloud procurement happened before the HNSciCloud activity. The direct validation approach skipped over the small iterations in which the complexity would be increased stepwise. In the diagram in Figure 7.1, it would mean starting from the bottommost step.

The T-Systems Cloud was integrated into the ATLAS production system as an additional Grid site. In order to account for the differences to a Grid site, not all job types were allowed to run on the T-Systems infrastructure. The workflows that were run were taken from the ATLAS job queue. This means that no specific tests were performed on this infrastructure, which makes a comparison more challenging.


In order to decrease the parameter space, the job variety for the model was limited by looking only at ATLAS reconstruction jobs. In order to have a point of reference and a possibility to compare, a further condition was that these had to be similar jobs, running both on the T-Systems Cloud and at CERN. While this gave an indication that the Model is able to describe the job behaviour on these infrastructures within a reasonable error of 20%, it was not clear where this error originates from and what influences it. The complexity of the non-homogeneous infrastructures and the reconstruction workflow did not allow for a thorough investigation. This becomes clear when attempting to improve the Model. A very simple example would be that the overall wall time of the workflows is underestimated by the Model. By modifying the time it takes to download the input data, this can be compensated, but what if in truth this discrepancy comes from a slower disk? In this case, a look at the actual download duration of the jobs can indicate that this compensation is incorrect, but other factors, such as slower CPUs within the infrastructure, can be harder to detect. The results of the above mentioned T-Systems activity can be found in Subsection 7.3.3.

7.1.2 Setup

In order to give a prediction, the Model needs input information about the workflow and the target infrastructure. In the following, an overview of the parameters that have been varied during the testing is given. Each of these variations can manifest itself in multiple Model input parameters.

The input parameters are obtained from benchmarking. In the following sections, actual ATLAS workflows are chosen as benchmarks. The range of different jobtypes for ATLAS is sufficiently narrow that not too many benchmarks have to be performed. The results are more precise than when using a generic benchmark. The downside is that they are also more specific to the types of benchmarks used, meaning each jobtype needs its own benchmark.

Workflows

The workflow variations happened on three different levels, which were applied in the tests. First, all different jobtypes, namely event generation, Monte-Carlo simulation, reconstruction, digitisation and pileup jobs, are covered. On the second level, software versions and the corresponding input data from different time periods are varied. The third and bottommost level covers similar jobs, meaning the variation of input data within the same ATLAS run or fill.

Infrastructure

The infrastructure differences are covered by running on many different VMs. The most precise measurements were achieved by using VMs that were located on dedicated hardware at CERN and Göttingen. These two VMs were hosted on machines that no other users had access to, so there was no influence from neighbouring VMs, known as "noisy neighbours". True independence of these machines is guaranteed by their physical separation. The specifications for these VMs can be found in the Appendix, see A.7.2 and A.7.1. Initially, there was a 4-core VM at CERN and an 8-core VM at Göttingen, but as later results show, it is sub-optimal to compare machines with different numbers of CPU cores. Therefore, all measurements were re-done with an 8-core VM at CERN as well. Some more specifications can be found in the Appendix: for the 4- and 8-core CERN VMs see Subsection A.7.2, for the 8-core VM at Göttingen see A.7.1. More information on the specifications of the particular CPUs of the two VMs can be found in the Appendix, see 2 and 6. The reason eight cores were chosen is that this is the most commonly used ATLAS multi-core setup, see Figure 7.2. The many single-core jobs are event generation jobs as well as individual merging jobs and specialised jobs such as trigger reconstruction. They also include the many ganga test jobs that check whether a site is functional. In addition, almost all analysis jobs are single-core jobs.

Figure 7.2: Cumulative number of jobs per core configuration over one year.


Further infrastructure variations result from using VMs from commercial Cloud providers. There is an overlap with the investigation into the Cloud, as the results can also be used to validate the Model. Through HNSciCloud, access to three different Cloud providers was granted. Due to the nature of these resources, it was difficult to obtain the specifications of the underlying hardware. The three providers were: Exoscale, IBM and T-Systems. Their machine specifications can be found in the Appendix, see Subsections A.7.3, A.7.4 and A.7.5. There was no control over what the provider and other customers did on the underlying hardware and neighbouring VMs.

Initially, the tests were performed on only one VM at T-Systems, one VM at IBM and three VMs at Exoscale. These tests took place from the end of September until the middle of October. The conclusions about Cloud computing that could be drawn from the single-VM scenario were limited. It was therefore mainly used to validate the Model; the results can be found in Subsection 7.1.4. At a later stage, from the middle of October until the end of November, additional resources became available. This increased the number of VMs to a total of 10 per provider. A better understanding of the job behaviour on commercial Cloud providers could be gained from the larger scale tests. These results can be found in Subsection 7.2.1.

Some providers were not very flexible with the amount of RAM per VM they could offer. If the desired configuration was unavailable, one with too much RAM was chosen. The standard 2 GB RAM per core configuration of the Grid was then achieved by limiting the RAM on the machine artificially. This was realised by executing a program that reserved a block of memory the size of the difference between the existing and the desired configuration. The source code of the program can be found in the Appendix, see 1.
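The actual program is listed in the Appendix; a minimal sketch of the same idea, written in Python with an illustrative command line interface, could look like this — it allocates a block of the requested size, touches every page so that the memory is really resident, and then simply holds on to it:

import sys
import time

def reserve_memory(size_gb):
    """Allocate size_gb gigabytes, touch every page so the block is backed
    by physical RAM, and keep the allocation alive indefinitely."""
    size_bytes = int(size_gb * 1024 ** 3)
    block = bytearray(size_bytes)
    offset = 0
    while offset < size_bytes:
        block[offset] = 1          # touch one byte per 4 KiB page
        offset += 4096
    while True:
        time.sleep(60)             # hold the memory until the process is killed

if __name__ == "__main__":
    reserve_memory(float(sys.argv[1]))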

Monitoring and accounting

The different metrics used in this investigation were obtained from several sources. First of all, the ATLAS workflows themselves have a built-in monitoring functionality, whose results can be obtained from the log files (mem.full.*, mem.summary.* and jobReport.json) and the standard output. In order to get a more granular monitoring, the 'sar' commands of the sysstat package were executed every 5 seconds, see Appendix 3. Additional metrics are collected via the /proc and /sys pseudo-filesystems that provide interfaces to the system kernel, see Appendix 4 and 5. Since the 'sar' and '/sys' metrics refer to the entire system, not just the processes of interest, it was ensured that no other processes influenced these measurements. Nothing else was executed and all automatic 'cron' processes were disabled.
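As an illustration of the granularity of this monitoring, the following sketch samples the system-wide CPU counters from /proc/stat every five seconds, similar in spirit to the sar based collection used here (the exact sar invocation is given in the Appendix); all names are illustrative:

import time

def read_cpu_counters():
    """Read the aggregated CPU counters (in clock ticks) from /proc/stat."""
    with open("/proc/stat") as stat_file:
        fields = stat_file.readline().split()
    # fields[0] is the literal string 'cpu'; the counters follow
    user, nice, system, idle, iowait = (int(value) for value in fields[1:6])
    return {"user": user, "nice": nice, "system": system,
            "idle": idle, "iowait": iowait}

def monitor(interval=5):
    """Print the per-interval increments of the counters every `interval` seconds."""
    previous = read_cpu_counters()
    while True:
        time.sleep(interval)
        current = read_cpu_counters()
        print({key: current[key] - previous[key] for key in current})
        previous = current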

7.1.3 Validation - workflow fluctuation

When measuring, or even when running the exact same workflow on the same machine at a different time, the duration of the processing will experience statistical fluctuations.


Moreover, a set of similar workflows will show additional fluctuations regarding its duration. Both of these types of fluctuations have been investigated in Chapter 5. The workflow behaviour will also differ when moving to newer software and input data, as well as when looking at different jobtypes. It is important to understand these systematic differences, as the model has to consider them, either by adapting the input or by including them in the error estimation. In the following, they are eliminated by re-benchmarking for all different jobtypes. Some jobtypes were split further into additional benchmarking categories.

This was done for cases in which the resource usage pattern within the jobtype showed large variations and a discrepancy between the infrastructure input parameters was seen. These large variations can easily be found by looking at the discrepancy between the WIM prediction and the measurement. If the discrepancy becomes larger than 10%, it is useful to benchmark anew. If the new results are improved significantly and a discrepancy between the infrastructure input parameters can be seen, a new benchmarking category should be used. Instead of this phenomenological approach, the ATLAS workflows have an inherent separation into different jobtypes. These jobtypes roughly classify the resource usage patterns. Some additional differences enter when different transformations are used. An investigation into covering the different workflow behaviour not by re-benchmarking, but by an error estimation, has been undertaken in Subsection 7.3.1. Those numbers highlight how the discrepancy quickly surpasses the 10% threshold when very different jobs are combined into one benchmark.

The WIM predicts the run time of a large ensemble of jobs, giving a prediction for the mean duration. In fact, it is assumed that the number of jobs that will be run, for example on a Cloud, will average out most fluctuations that can be observed in individual workflows. This is confirmed for all workflows by the convergence of the averages in Figures 5.6, 5.8 and 5.15.

However, the WIM accuracy still suffers from fluctuations within the workflows. This is due to the fact that reference workflows are used in order to benchmark the infrastructure and to describe the behaviour of future workloads. Therefore, before any statements about the accuracy of the Model can be made, a closer look at the workflow behaviour and its fluctuations has to be undertaken. This was done for each individual workflow in the corresponding Subsections 5.2.1, 5.2.2, 5.3.2, and 5.3.4. For the WIM, only the fluctuations among similar jobs are relevant, as they represent what is done in a real production environment. The effect of the fluctuations is lessened by benchmarking multiple times, but they cannot be completely discarded. If the input values deviate too much from reality, the predictions of any Model will differ from reality. How often the benchmark should be run in order to obtain sufficiently good input values for the Model can be answered by Figures 5.6, 5.8 and 5.15. Judging from the plots and the low fluctuations therein, it is enough to run a small number of event generation, simulation or reconstruction jobs in order to get good input values for the Model. In principle, the small amount of fluctuation would justify benchmarking with only one job. In order to be sure that the first job is not an outlier of some kind, three (n = 3) jobs have been used for the benchmarking throughout this thesis. Using at least this many jobs is generally recommended. Another reason for multiple benchmarks is the uncertainty estimation, which is derived from the fluctuations of the benchmark, see Subsection 6.3.10.

The goal of the WIM is to have an estimation that agrees with reality at a better than 20% level. A successful validation shows that the predictions do not deviate more than 20% from the measurements.

7.1.4 Results

In this subsection the validation results are presented. This validation of the WIM was done by predicting the wall time of a set of jobs on different VMs. The predictions are then compared to the results of subsequent measurements.

This was performed as follows. The reference VM at CERN was used to obtain the workflow input parameters. Three benchmark jobs per workflow were then performed on different target VMs located at Göttingen, IBM, T-Systems and Exoscale. On Exoscale three different VMs were used: EXO 1, EXO 2 and EXO 3. Afterwards, the Model is used to predict how subsequent workflows will perform. Finally, further workflows are executed on these infrastructures and the wall time is compared to the Model prediction. In order to correctly interpret the results, it has to be kept in mind that there are workflow fluctuations. These were described in Sections 5.2 and 5.3.

Event generation

The resulting Model prediction for the event generation workflow agrees well with reality, as can be seen in Table 7.1. The biggest deviation in wall time was found on the EXO 3 VM. Since the event generation workflow is dominated by CPU processing, the large deviation originates from that component. This also explains why a deviation in the I/O wait time of over 60% for the TSY VM has only a small impact on the wall time, which is predicted with over 99% accuracy: the fraction of I/O wait time compared to the wall time is very small.

Monte-Carlo simulation

There is good agreement between prediction and measurement, as can be seen in Table 7.2. The CPU factor and the CPU, idle and I/O wait times have been obtained from 25 reference Monte-Carlo simulation jobs run at CERN and three reference jobs run at the respective provider.


Model discrepancies [%]   Wall Time   CPU Time   I/O Wait Time   Idle Time
IBM                          0.15        0.95         0.78          0.95
TSY                          0.84        1.48       -66.14          1.50
EXO 1                       -0.34        0.72        16.92          0.85
EXO 2                        0.02        0.86       -18.13          1.08
EXO 3                        3.00        3.97        -8.35          4.13
GOE                          0.68        0.90        -2.17          0.84

Table 7.1: Deviation of the Model prediction from the measurement of similar event generation jobs. The wall time has been split into its components in the last three columns. The benchmark was performed three times; the results are the average over 23 jobs.


Model discrepancies [%]   Wall Time   CPU Time   I/O Wait Time   Idle Time
IBM                          2.49        2.58        -1.56          3.01
TSY                          1.98        0.96         7.28          6.83
EXO 1                        2.63        1.92         2.02          5.77
EXO 2                        1.83        1.12        11.29          4.53
EXO 3                        2.68        1.47        16.95          9.13
GOE                          2.49        1.86         3.03          1.11

Table 7.2: Deviation of the Model prediction from the measurement of similar Monte-Carlo simulation jobs. The benchmark was performed three times; the results are the average over 23 jobs.

Trying to model the simulation with event generation input factors resulted in much worse results, with a large deviation between the prediction and the measurement. This is described further in Subsection 7.3.4. An attempt to find a constant conversion factor between the event generation and the simulation CPU factors was unsuccessful: the factor changed too much between different workloads.

Reconstruction

The reconstruction workflow includes more I/O operations than the event generation and Monte-Carlo simulation workflows. The results can be found in Table 7.3. Since reconstruction is more data intensive, the I/O wait and idle durations become more relevant.


Model discrepancies [%]   Wall Time   CPU Time   Idle Time   I/O Wait Time
IBM                          1.85        0.22        4.12        15.16
TSY                         -0.40       -0.11        0.22        -2.83
EXO 1                       -0.17        0.18       -0.50         4.37
EXO 2                        0.51        0.26        1.53         1.85
EXO 3                       -0.05       -0.07        0.61         3.70
GOE                          1.02        0.17        1.66         3.29

Table 7.3: Deviation of the Model prediction from the measurement of similar reconstruction jobs. The benchmark was performed three times; the results are the average over 10 jobs.

The results of the Model agree at the 2% level; however, there are some larger discrepancies in the wall time components. Especially the I/O wait time shows a large relative deviation for IBM. In absolute numbers, however, the I/O wait time is comparatively small, see Table 7.4. This simply shows that small fluctuations have a high impact on the relative numbers. For the model validation these discrepancies can therefore be disregarded.

Durations   Wall Time [s]   CPU Time [s]   Idle Time [s]   I/O Wait Time [s]
IBM              4849           17225          17740            3281
TSY              4330           18531          15473             501
EXO 1            3144           13080          12012              56
EXO 2            3107           13253          11462              62
EXO 3            3145           13358          11662              62
GOE              5437           25532          17394             669

Table 7.4: Absolute numbers of the job duration measurement from similar reconstruction jobs. The results are the average over 10 jobs.

The high idle time can be explained by the single-core merging steps. The reconstruction profile in Figure 5.9 shows how only one of eight cores is active during this merging. The merging therefore quickly accumulates idle time.

Uncertainties

In Table 7.5, the deviation of the Model prediction from the measurement is compared to the predicted uncertainty. How the model uncertainty is propagated was described in Subsection 6.3.10. The input values are obtained from the three benchmarks, in particular from their fluctuations. What can be seen is that the predicted uncertainty fluctuates a little between the different providers, but differs significantly between different workflows. This means that the workflows themselves are the limiting factor for the prediction in this scenario.


         Event Generation          Simulation              Reconstruction
[%]      Error   Uncertainty    Error   Uncertainty     Error   Uncertainty
IBM       0.15      1.56         2.49      4.23          1.85      4.34
TSY       0.84      0.74         1.98      4.46         -0.40      0.35
EXO 1    -0.34      1.12         2.63      4.36         -0.17      0.62
EXO 2     0.02      1.78         1.83      4.33         -0.51      0.93
EXO 3     3.00      1.03         2.68      4.43         -0.05      0.94
GOE       0.68      1.11         2.49      4.41          1.02      0.73

Table 7.5: The columns labelled "Error" show the already presented wall time deviation [%] between the measurement and the Model prediction. The columns labelled "Uncertainty" show the percentage uncertainty that is estimated by the Model and provided together with the prediction.

It must be kept in mind, however, that these are single VMs. Another observation that can be made is that the range of the uncertainty does not always include the measured deviation, e.g. for event generation on EXO 3. This is to be expected, as the uncertainty estimation is not an upper boundary. As most deviations lie within the given uncertainty, this highlights that the uncertainty estimation is a good and useful indicator.

7.1.5 Conclusion

In conclusion, for single VMs within different Cloud providers, the WIM is sufficiently accurate. This can be seen from the fact that the model deviations do not exceed 5%, which is well within the 20% goal. It should be kept in mind, however, that the WIM is dependent on the benchmark. In addition to the accuracy, the estimations of the uncertainties agree well with the deviations of the Model predictions. For these reasons the WIM is deemed complete and ready for usage. The WIM is applied in different scenarios in Section 7.3.

7.2 Cloud measurement

Before applying the model to the large scale Cloud procurement, the different providers are investigated in detail. The HNSciCloud procurement at CERN provided access to VMs at three different commercial Cloud providers, namely Exoscale, IBM and T-Systems. In the following, different (previously introduced) workflows are run on the VMs of these providers. For this real life Cloud exercise, there is no knowledge of which VMs share the same hardware, how many there are, or what workload is run on them. Furthermore, the conditions are not stable, meaning that any of the infrastructure parameters can suddenly change during a measurement. These measurements provide valuable information about how the workflows behave in this kind of environment. In addition, insights are obtained into whether the WIM predictions are still valid, or whether the uncertainty estimation has to be adapted.

7.2.1 HNSciCloud: large scale

The single-VM measurements were already presented along with the Model validation. In the later stages of the prototyping phase, additional resources became available. A total of ten VMs per provider, with eight CPU cores per VM, were provisioned. This increased the scale of the tests significantly, to 240 CPU cores. There was no information available on which VMs were collocated on the same underlying hardware. Therefore, resource contention among these ten VMs was possible, as all tests were run in parallel on all VMs.

The same workflows have been run on all providers, see Tables 7.6, 7.7 and 7.8. The in-depth details of the underlying workflows, with which the tests could be repeated, can be found in the Appendix in Section A.6. On a coarse level, the workflows that have been run correspond to the different ATLAS job categories, namely event generation, simulation, reconstruction and digitisation. A more fine grained division has been undertaken for the reconstruction workflows. This was done because reconstruction is data intensive and therefore more complex, so additional parameters could be varied. Each job category was repeated at least ten times per VM. This means that for each job category around 300 measurements were done. For various reasons, individual jobs failed. It was attempted to re-run each failed job, but time constraints limited the number of repetitions of failed jobs. After this became clear, more than ten jobs per category and per VM were run. The total number of measurements per provider and job category therefore stayed above 100, even with job failures. The exception is Reco 2, with slightly fewer jobs. Due to the failures, the averages that are presented below can consist of a varying number of underlying measurements.

Exoscale

In Table 7.6, all Exoscale results are summarised together with their standard deviations. The wall time is split into its components. From the previous measurements it is already understood that the different job categories represent a diverse mixture of jobs. This is reflected in the measurements. A difference can be observed between the measurements over ten VMs at a Cloud provider and those on single VMs in a controlled environment, see Sections 5.2 and 5.3: the standard deviation is increased for the larger setup. This is due to the fact that, on top of the fluctuations that were already observed, there are infrastructure fluctuations. These result from differences among the VMs and the more volatile environment.


Exoscale       Wall Time [s]    CPU Time [s]     Idle Time [s]     I/O Wait Time [s]
EvGen           2915 ± 137       2875 ± 171      20179 ± 980           4 ± 1
MC Sim          1279 ± 61        8489 ± 434       1669 ± 244          10 ± 1
Reco 1          5737 ± 440      29524 ± 2225     15293 ± 2046        732 ± 134
Reco 2          5700 ± 190      26222 ± 748      18519 ± 1203        489 ± 290
Reco 3          4547 ± 478      18672 ± 2142     17235 ± 1877        184 ± 57
Reco 4          4528 ± 398      18578 ± 1874     17185 ± 1529        180 ± 25
Reco 5          3147 ± 88       13250 ± 410      11731 ± 594          54 ± 6
Reco 6          3529 ± 451      16909 ± 2267     11173 ± 1563        101 ± 91
Reco 7          5550 ± 943      29660 ± 5587     13801 ± 1610        725 ± 646
Digi Reco 1     1210 ± 123       4945 ± 214       4568 ± 928          30 ± 5
Digi Reco 2     8381 ± 681      55780 ± 4814     10495 ± 912         121 ± 15

Table 7.6: In this table all test results of the jobs run on the Exoscale infrastructure are shown. Each line constitutes similar jobs. Jobs that took less than 500 seconds CPU time are excluded, as these failed.

One example is the different performance of the VMs, as illustrated in Figure 7.3. What can be seen is that VMs 5 and 7 take much longer to complete the same workload than the other VMs. Furthermore, the other eight faster VMs differ slightly in their performance as well. These kinds of differences in the performance of VMs can have different root causes, which are not necessarily stable in time. In addition, different workflows may be more or less sensitive to these differences, meaning the impact varies. Single jobs also appear within VMs that deviate largely from the others. The three cases can be found in VM 1 job number 10, in VM 5 job number 3 and in VM 8 job number 3. These outliers represent an additional type of short-term fluctuation. In Figure 7.4 the same VMs as in Figure 7.3 are used. Ignoring the fact that a different workflow is run, what can be seen immediately is that now only VM 5 takes longer to finish the workload. VM 7, which was slower than VM 5 in Figure 7.3, is now performing at the level of the other eight faster VMs. This drastic change in performance hints at a change to VM 7 that lay outside of the influence or knowledge of the Cloud procurer. Luckily for the procurer, the two figures are in chronological order, so this change represents an increase in overall computing power. This shows that including volatile hardware can have a negative impact on the stability of the wall time, making predictions more difficult and more error-prone. Assuming that the provider offers better hardware as a bonus, the model can however predict a lower bound for the event throughput.

IBM

An overview of the IBM results can be found in Table 7.7.


Figure 7.3: Wall times [s] of all Reco 6 jobs executed on Exoscale. The jobs are organised according to the VMs they ran on and in the same order.

Figure 7.4: Wall times [s] of all Digi Reco 2 jobs executed on Exoscale. The jobs are organised according to the VMs they ran on and in the same order.


IBM            Wall Time [s]     CPU Time [s]      Idle Time [s]     I/O Wait Time [s]
EvGen           3927 ± 118        3906 ± 120       27263 ± 850           9 ± 1
MC Sim          2321 ± 353       16184 ± 2709       2352 ± 324          11 ± 2
Reco 1          8193 ± 571       43588 ± 4073      19600 ± 876        2065 ± 641
Reco 2          8165 ± 598       37346 ± 2776      26573 ± 2008        848 ± 130
Reco 3          7061 ± 629       30786 ± 5083      23708 ± 3257        617 ± 120
Reco 4          7009 ± 604       29713 ± 4002      25223 ± 1605        643 ± 101
Reco 5          4756 ± 680       20352 ± 3155      17214 ± 2501        257 ± 79
Reco 6          4919 ± 468       25036 ± 3562      13943 ± 645         247 ± 59
Reco 7          7630 ± 760       39786 ± 5063      19953 ± 1519        836 ± 148
Digi Reco 1     2019 ± 172        8690 ± 1145       7276 ± 358          61 ± 14
Digi Reco 2    15116 ± 2395     104959 ± 18841     15155 ± 846         439 ± 75

Table 7.7: In this table all test results of the jobs run on the IBM infrastructure are shown. Jobs that took less than 500 seconds CPU time are excluded, as these failed.

The VM performance is lower than at Exoscale. What can be seen is that different job types are impacted to different degrees by the differences between the two infrastructures. Some workflows are significantly slower, such as Digi Reco 2, whereas others are not slowed down as much, such as Reco 7. The differences in the fluctuations are shown and described in detail later.

The IBM infrastructure was found to be heterogeneous, as was seen before for Exoscale in Figure 7.3.

T-Systems

One difference between the T-Systems VMs and the VMs from the other providers is that eight of the T-Systems VMs were located on the same underlying host. This was achieved by choosing a “dedicated host” as the underlying infrastructure for these VMs in the T-Systems interface.

The T-Systems results are summarised in Table 7.8. In comparison to what has been seen for Exoscale in Table 7.6 and for IBM in Table 7.7, the wall times are increased. An interesting fact is that different job categories are affected differently by the difference in the infrastructure: the slow-down factor between the wall times from Exoscale and T-Systems differs between the job categories, ranging from around 1.4 for event generation to around 3.75 for simulation. The fluctuations are also higher, as is shown and discussed in detail later in Table 7.9.


T-Systems      Wall Time [s]      CPU Time [s]       Idle Time [s]     I/O Wait Time [s]
EvGen           4089 ± 126         4098 ± 126        28394 ± 885          15 ± 25
MC Sim          4808 ± 1298       22863 ± 3930        3944 ± 1227         78 ± 74
Reco 1               -                  -                  -                 -
Reco 2         15430 ± 2612       59086 ± 9645       34790 ± 3420      14318 ± 3326
Reco 3          8681 ± 1827       32646 ± 5262       29511 ± 4847       5061 ± 4525
Reco 4         10785 ± 1534       42030 ± 4942       31941 ± 3296       3876 ± 3162
Reco 5          7099 ± 1203       28419 ± 3597       18479 ± 2247        387 ± 62
Reco 6          7986 ± 1810       34097 ± 4936       20870 ± 5020       1712 ± 3241
Reco 7         15348 ± 2886       61478 ± 11789      27058 ± 5703      17861 ± 3272
Digi Reco 1     2789 ± 524        10497 ± 1546        7588 ± 1200        208 ± 53
Digi Reco 2    25821 ± 6151      168476 ± 16656      23207 ± 5077       1441 ± 573

Table 7.8: In this table all test results of the jobs run on the T-Systems infrastructure are shown. Jobs that took less than 500 seconds CPU time are excluded, as these failed.

Figure 7.5 highlights the heterogeneity of the infrastructure that was also observed for Exoscale. One issue that can be spotted when looking at Figure 7.5 is that there appears to be a pattern. This wall time pattern can be seen throughout VMs 2-8 and VM 10. The series of jobs was started at the same time and the VMs in question were located on the same host. The pattern manifests itself especially in the second job, which took longer than the neighbouring ones. In addition, from the third job onward, the wall time seems to rise steadily. Due to the almost flat wall time distribution in VMs 1 and 9, the possibility that the pattern originates from differences between the jobs can be excluded. This has also been double-checked by repeating the same jobs on different VMs; no pattern of this kind has been found in the wall time distributions of the other VMs. In the role of a customer, the possibilities to investigate the origin of this pattern were very limited. One possible factor that played a role in creating the pattern is that the VMs may have contended for a hardware resource, which could have impacted the different workflows differently. Another explanation would be outside interference, such as neighbouring VMs. Finally, it could also have been a combination of the two, or something entirely different.

Comparison

First of all, for all three providers the wall times fluctuated more when using multiple VMs than with single VMs. The standard deviations of the wall time within the three providers are compared in Table 7.9. The different workflows themselves seem to behave more or less stably in their duration.


Figure 7.5: Wall times [s] of all Reco 5 jobs executed on T-Systems. The jobs are organised according to the VMs they ran on and in the same order.

For example, EvGen has a lower standard deviation than Reco 6 for all three providers. It has to be kept in mind, however, that the tests took place at different periods in time. A change in outside influences, such as the VM speed-up that was demonstrated in Figures 7.3 and 7.4, could have taken place between and during the tests. EvGen seems to be impacted the least by the VM heterogeneity. This was not investigated in detail, but the EvGen workflow differs the most from the other workflows: it is the only workflow that does not have the built-in functionality of running on multiple CPU cores. Modern CPUs are very complex, making it probable that the different workflows do not touch on the same CPU aspects and optimisations. Especially for code that is less modern and less optimised for specific CPU functionalities, the differences within a heterogeneous mixture of CPUs can appear smaller.

The numbers in Table 7.9 also indicate differences between the providers. Exoscale and IBM appear to fluctuate less than T-Systems. An important thing to keep in mind is that a high standard deviation is not necessarily a reflection of an unstable infrastructure. This can be made clear from Figures 7.3, 7.4 and 7.5: there appears to be a mixture of more and less powerful VMs, which results in a large standard deviation. Due to the independence of the workflows running on the different VMs, a large variation in performance is not negative. What counts in the end is to have a high event throughput. Therefore, a diverse and difficult to predict infrastructure with high throughput is preferable over a homogeneous infrastructure with a low throughput. A high variability can, however, lead to a low performance if the provider did not account for it correctly. This could be the case if the provider guarantees a computing power that is equal to the average of the infrastructure. Depending on how the VMs are provisioned, the customer could get 'unlucky' and therefore get less than the agreed upon performance. Different hardware types or generations within the infrastructure make an accurate prediction of the overall performance more difficult. This can be seen in Subsection 7.3.3, where the model results are shown.


Wall Time       Exoscale          IBM          T-Systems
               StdDev [%]      StdDev [%]      StdDev [%]
EvGen             4.70            3.02            3.07
MC Sim            4.77           15.19           27.00
Reco 1            7.68            6.97              -
Reco 2            3.34            7.32           16.93
Reco 3           10.50            8.90           21.05
Reco 4            8.78            8.62           14.22
Reco 5            2.79           14.31           16.95
Reco 6           12.77            9.52           22.66
Reco 7           16.99            9.95           18.80
Digi Reco 1      10.21            8.52           18.77
Digi Reco 2       8.12           15.85           23.82

Table 7.9: Comparing the Clouds. Displayed are the standard deviations of the wall time. A comparison between the wall times can be found in the Appendix in Table A2.

Taking into account that there are two different generations of hardware reduces the standard deviation of the wall time, as can be seen in Table 7.10. A comparison between these results and the ones in Table 7.9 shows that the large spread that is reflected in the standard deviation of T-Systems decreases. The IBM results are impacted less when splitting them into two categories. The biggest impact can be found in Digi Reco 1 and 2 (IBM and T-Systems) and in MC Sim (T-Systems). The reason why the T-Systems standard deviation seems to benefit more from this split is that the difference between the faster and regular machines is bigger than for IBM.

7.2.2 HNSciCloud: object storage

In Table 7.11, the last three T-Systems Digi Reco results are summarised. The last three digitisation measurements are T-Systems specific, because IBM did not provide an object storage and all jobs using the Exoscale object storage failed. The latter were doing direct I/O, accessing the data remotely.


Wall Time       IBM fast         IBM        T-Systems fast    T-Systems
               StdDev [%]     StdDev [%]      StdDev [%]      StdDev [%]
EvGen               -            3.02              -             3.07
MC Sim            4.07          12.79           19.28            5.40
Reco 1              -            6.97              -               -
Reco 2            1.42           7.06           11.38            5.22
Reco 3            1.27           7.49            5.25           21.50
Reco 4            9.04           7.44            4.85            8.68
Reco 5            2.56          13.93            8.95            4.74
Reco 6            1.29           7.98           10.75           13.34
Reco 7            4.08          10.13           10.72           11.85
Digi Reco 1       5.48           3.94           18.20            4.47
Digi Reco 2       9.66           7.63           16.19            3.55

Table 7.10: Comparing the Clouds. Displayed are the standard deviations of the wall time, after splitting the VMs of T-Systems and IBM into two categories. The fast category contains two VMs for T-Systems and three VMs for IBM, excluding event generation which was not impacted by the different speeds.

T-Systems       Wall Time [s]      CPU Time [s]      Idle Time [s]     I/O Wait Time [s]
Digi Reco 3      14552 ± 1091      17429 ± 5501      97190 ± 8370          93 ± 13
Digi Reco 4     109193 ± 1276      81535 ± 1235     784891 ± 8975         182 ± 9
Digi Reco 5     109897 ± 1146      81094 ± 1529     790626 ± 7596         223 ± 8

Table 7.11: In this table, the digitisation test results of the jobs run on the T-Systems infrastructure are shown. Jobs that took less than 500 seconds CPU time are excluded, as these failed.

Digi Reco 3-5 represent the attempt to evaluate the workflow performance with data input from the T-Systems object storage. Digi Reco was chosen because it is the workflow with the biggest input file size. Furthermore, Digi Reco is the only workflow that reuses input data between different jobs. Therefore, this workflow would be the most logical choice to make use of native Cloud storage. Due to time constraints and frequent failures, the two last digitisation measurements (Digi Reco 4 and 5) contain only a few (18 and 9) measurements, which are evenly distributed amongst the VMs. Digi Reco 3 fetches its input data via XRootD remotely from EOS, which is located at CERN. In contrast to that, Digi Reco 4 and 5 fetch it from the T-Systems object storage; Digi Reco 5 is a repetition of Digi Reco 4. The bucket within the object storage was accessed via the https interface that T-Systems provides for files within buckets. The job details for Digi Reco 3, 4 and 5 can be found in the Appendix, see Subsections A.6.12, A.6.13 and A.6.14. Another problem that occurred was that constraints with the utilised AthenaMP version made it impossible to run Digi Reco 3-5 with AthenaMP. They were therefore run in single-core mode and cannot be compared easily to the previous Digi Reco 1 and 2 workflows.

What can be seen is that the results of Digi Reco 4 and 5 are similar, which is unsurprising as they represent the same jobs. Comparing these to Digi Reco 3 shows that direct I/O access to the input data through EOS is much faster than through the object storage. Considering that EOS was located in a different data centre, further away, this is unexpected. As a cross-check, the input data was copied to the local disk from the two different sources. Downloading the input dataset from EOS to the local disk took about 20 minutes, while downloading it via the S3 API from the object storage took around 16 to 18 minutes. This is the opposite behaviour from what was seen for the job duration using direct I/O. This means that the bandwidth does not explain the large discrepancy and that the direct download slightly favours the object storage. This was investigated further, even after the T-Systems exercise was finished. One possible explanation for the inflated numbers of Digi Reco 4 and 5 is the different access mechanism, which happened via https. The "s" in https stands for secure and it has been shown that it can have a negative impact on the performance compared to http. In order to reject this theory, the job was repeated with input data from CERNBox that was exposed via https as well. A significant slow-down was observed there as well, so the theory could not be rejected. As there was no possibility for further tests on the T-Systems infrastructure, it will remain uncertain whether the object storage contributed to the slow-down. The high idle time is a result of the high network activity.


7.2.3 Grid sites

In order to make a comparison, the numbers of how the jobs performed within the WLCG are given below. The two Grid sites chosen for comparison are CERN (CERN-PROD-preprod MCORE), located in Geneva, and GoeGrid (GoeGrid MCORE) in Göttingen. Both queues consist of 8-core VMs. In addition, the single-core queue of Göttingen is used for a comparison of the event generation jobs. The usage of queues already indicates that the jobs run within the Grid were regular production jobs, run through PanDA. There was no special treatment or customisation, which is why they differ in the number of processed events compared to what was run within the Cloud. In addition, not all jobs that ran within the Cloud corresponded to production jobs, meaning only a subset of jobs is compared. Furthermore, the monitoring information that can be extracted from the PanDA monitoring is limited to the wall and CPU times.

CERN        Wall Time [s]      CPU Time [s]
MC Sim      28952 ± 10821     222503 ± 82196
Reco 5       3470 ± 1163       10660 ± 2606

Table 7.12: In this table, successful Grid jobs run at CERN that correspond to the tests run within the Cloud are displayed. The Reco 5 jobs at CERN processed only 2025 events, whereas the ones within the Cloud providers processed double that number. The Reco 5 numbers are taken from 69 jobs. MC Sim processed 1000 events on the Grid and 100 events within the HNSciCloud exercise.

Göttingen      Wall Time [s]      CPU Time [s]
EvGen           1895 ± 257         1774 ± 240
MC Sim         14090 ± 4361      109106 ± 34246

Table 7.13: In this table, successful Grid jobs run at Göttingen that correspond to the tests run within the Cloud are displayed. The EvGen jobs were run on single-core VMs within the GoeGrid queue. EvGen processed 500 events on the Grid, whereas it processed 1000 events within the Cloud. MC Sim processed 1000 events on the Grid and 100 events within the HNSciCloud exercise.

What can be seen in Tables 7.12 and 7.13 is that the fluctuations within similar jobs are high, both for CERN and for Göttingen. For MC Sim and Reco 5 they are above 30%; only the single-core EvGen workflow has a lower fluctuation of around 14%. This is unsurprising, as it is known that there are several different generations of hardware within both computing centres. It has also already been observed when comparing the Cloud measurements that EvGen seems to be less affected by different types of hardware. A direct comparison of the absolute numbers is possible between the two Grid sites, which shows that the performance of GoeGrid is much higher than that of CERN. This can be gleaned from the direct comparison of the two sites' MC Sim numbers. The absolute performance of the Grid and the Cloud is more difficult to compare, as the jobs processed a different number of events. A brief discussion about comparing jobs with a different number of events can be found in Subsection 7.3.4. A rough estimation can be gained by, for example, doubling the wall time of Reco 5 and comparing it to the Cloud performances. Following this, CERN performs roughly between T-Systems and IBM. For MC Sim, the event discrepancy is a factor of 10. Multiplying the Cloud wall times by this factor leads to an inflation, as the overheads are then included ten times instead of once, as for the Grid numbers. The performance of CERN would still be placed in between T-Systems and IBM. Comparing the GoeGrid performance to that of the Cloud providers places it between Exoscale and IBM.
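As a concrete illustration of the Reco 5 scaling: the CERN average of 3470 s from Table 7.12 refers to 2025 events, so for the 4050 events processed in the Cloud tests it corresponds to roughly 6900 s, which indeed lies between the IBM average of 4756 s and the T-Systems average of 7099 s in Tables 7.7 and 7.8.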

7.3 Model application

The above measurements show that the largest fluctuations originate from differences in the infrastructure rather than from variations in the workflows. This opens up the question of whether it is necessary for the WIM to benchmark on more than one VM. Increasing the number of VMs will increase the accuracy, but the question is: how many VMs should be benchmarked? If the subset of benchmarked VMs does not represent the final composition, it is not very useful. In order to get the most accurate prediction, it would be ideal to benchmark all of the VMs that will be used. Additionally, as was shown above, the performance of a VM can change over time, so regular benchmarks would be required. Following these considerations, the cost that is incurred by benchmarking is very high. For Cloud computing, the duration of the benchmarks can be directly translated into a cost; on the Grid, every second a benchmark is run, no data processing is taking place.

For the WIM application, as little benchmarking as possible was used. Since the benchmarking is done with actual workflows, no resources are wasted, see passive benchmarking in Subsection 4.6.1. As it turned out, the provider with the largest variations used two different generations of VMs, which were modelled separately. In that case, looking at more than one VM was useful, although a larger number of VMs would be needed to be reasonably certain that all variations are included. When using only one VM for the benchmarking, it needs to be clear that the VM is not an outlier. In general, the VM performance should be guaranteed in the contract with the Cloud provider, so deviations from that contract should only be towards higher performance. The WIM then estimates the lower boundary, and the average should be higher.


In addition to this section, another WIM application can be found in Chapter 8.

7.3.1 Combining benchmarks

The idea behind combining benchmarks is that not every workflow needs to benchmark the infrastructure anew. This approach was already used to a certain extent, as similar jobs are grouped together and benchmarked only once. The question that remains is: to what extent can workflows be grouped together and still produce accurate WIM results? Using the benchmark results of one workflow to describe the infrastructure for a second workflow is from here on called cross-modelling. Table 7.14 shows the cross-modelling results of three different reconstruction workflows on single VMs. Single VMs are used to keep the baseline of fluctuations low, in order to detect smaller model discrepancies.

Model discrepancies [%] | IBM VM1 Wall Time | T-Systems VM1 Wall Time | Exoscale VM1 Wall Time
Reco 1        |   1.11 |  -0.80 |  -0.91
Reco 2        |  -3.74 |   1.64 |   0.97
Reco 3        |   0.48 |  -1.53 |  -0.52
Reco 1 with 2 |  31.96 | -13.96 |  13.90
Reco 1 with 3 |   7.41 | -19.46 |   3.04
Reco 2 with 1 | -23.22 |   6.60 | -13.59
Reco 2 with 3 | -18.73 |  -6.63 | -10.16
Reco 3 with 2 |  18.56 |   5.94 |  11.22

Table 7.14: In this table, model deviations from the measurements are shown. The top three rows are the regular scenario that has been shown before. In the other rows, cross-modelling is taking place. The benchmark results from one reconstruction were used in order to describe the infrastructure of another reconstruction. For example, Reco 1 with 2 means that Reco 1 was modelled with the infrastructure input from Reco 2. Percentages larger than 10% are highlighted in bold.

What can be seen in the table is that the model discrepancies are below 4% for the single VM and typical benchmark cases. When cross-modelling, the discrepancies increase drastically to over 10% for most cases, reaching up to 32% for Reco 1 with 2 at IBM. Some infrastructure-workflow combinations work better than others, but there is no clear pattern that can be discerned. When looking into the wall time components of the cross-modelling, the biggest CPU discrepancy is 4.53%, staying below 3% in most cases. The I/O wait time is in general at least a factor of ten smaller than the CPU or idle time, so the impact of a wrong prediction on the wall time is relatively small. The large wall time discrepancies therefore result from the idle time.


7.3.2 Bandwidth estimation

This use case was already presented in the CHEP paper [117]. It is one of the reasons the WIM was created in the first place. The task was to determine how much bandwidth would be required by running ATLAS reconstruction on a given Cloud infrastructure. The Cloud site was T-Systems, but this activity took place before the HNSciCloud project. There was a valid concern that the existing 10 Gb/s WAN link between this Cloud site and CERN would not be sufficient. The link would have to satisfy the input requirement of 4000 CPU cores running reconstruction in parallel. In order to evaluate this concern, the WIM was used.

This highlights an example in which a typical infrastructure input value can also be investigated as an output. The result showed that the existing network would be sufficient for the given specifications: 1000 x 4-core VMs, 116 HS*s/evt CPU time, 850 kB/evt input, 2701 events per job and 1417 kB/s instantaneous network read. The caveat was that the exact specifications of the infrastructure, for example the computing power, were not known. Back then, the benchmarking approach did not yet exist. The solution to accommodate different types of hardware was to model a broad range of, e.g., the computing power. This means that the computing power became a varied input and the WIM calculated results for many different values. In order to be thorough, changes within the ATLAS job configuration that were known to be coming were also considered and these parameters were varied as well. The resulting bandwidth gave an indication of what is to be expected, as well as worst case estimations. Figure 7.6 shows the result.

Figure 7.6: Infrastructure limit and Model output [117].

The 10 Gb/s bandwidth limit (Bandwidth Limit) is depicted by the black horizontal line. Any bandwidth estimations going over this line indicate a possible bottleneck, if the underlying conditions are met. Within the chosen range, the impact of the size of the events (Size Event), the number of events per job (Number Events) and the instantaneous bandwidth (Network InstRead) is small enough not to be of concern. The instantaneous bandwidth is relevant when reading the input data event by event. The one variable that could make the result reach the critical value is the CPU time

per event (CPU Time Event). Taking a closer look, it can be seen that the crossover point is reached at ∼20 HS*s. For the processing, this would mean either a decrease in event complexity, which is unlikely as the complexity rises with the pileup, or a drastic increase in CPU power. Extrapolating from the technological evolution of the past several years leads to the conclusion that the yearly gain in computing power is small, so this is also unlikely to happen.
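The order of magnitude of this estimate can be checked with a simple back-of-the-envelope calculation using the specifications listed above. The sketch below is not the WIM itself; in particular, the assumed CPU power per core (in HS06) is an illustrative value, since the exact hardware specification was not known at the time.

```python
# Back-of-the-envelope WAN bandwidth estimate for 4000 cores running
# reconstruction, following the specifications quoted in the text.
# The HS06 rating per core is an assumption for illustration only.

N_CORES        = 4000        # 1000 x 4-core VMs
CPU_PER_EVENT  = 116.0       # HS06*s per event
EVENT_SIZE_KB  = 850.0       # input size per event [kB]
HS06_PER_CORE  = 10.0        # assumed CPU power per core [HS06]
LIMIT_GBPS     = 10.0        # existing WAN link

def aggregate_bandwidth_gbps(cpu_per_event):
    events_per_second = N_CORES * HS06_PER_CORE / cpu_per_event
    kilobytes_per_second = events_per_second * EVENT_SIZE_KB
    return kilobytes_per_second * 8 / 1e6   # kB/s -> Gb/s

# Nominal specification stays well below the 10 Gb/s link; a much lower CPU
# time per event would push the aggregate demand above the limit, which is
# qualitatively the behaviour shown in Figure 7.6.
print(f"nominal: {aggregate_bandwidth_gbps(CPU_PER_EVENT):.1f} Gb/s (limit {LIMIT_GBPS} Gb/s)")
print(f"at 20 HS06*s/evt: {aggregate_bandwidth_gbps(20.0):.1f} Gb/s")
```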

7.3.3 HNSciCloud large scale

Previously, the results of the measurements on the HNSciCloud were discussed, and the WIM was used to predict the performance of the different Cloud providers. As the resulting performance has already been covered, Table 7.15 shows only the deviation of the model results from the measurements.

Model Discrepancies [%] | Exoscale | IBM   | IBM fast | T-Systems | T-Systems fast
EvGen       |   1.92 |  0.62 |  2.15 |    —   |    —
MC Sim      |   4.47 |  6.39 |  3.07 | -36.01 |  21.80
Reco 1      |  -4.46 | -0.87 | -1.38 |    —   |    —
Reco 2      |  -3.97 | -2.09 |  0.83 | -18.80 |   7.71
Reco 3      |  -6.01 |  1.73 | -2.26 |  -3.05 |  -5.70
Reco 4      |   2.39 |  0.73 |  3.03 |   3.41 |   2.00
Reco 5      |  -3.97 | -2.09 |  0.83 | -18.80 |   7.71
Reco 6      |   1.25 |  1.09 | -1.30 | -19.02 |  10.88
Reco 7      |   8.04 | -5.08 |  8.72 | -23.24 |   5.77
Digi Reco 1 | -10.55 |  1.72 |  8.24 | -21.77 |  14.18
Digi Reco 2 |   0.80 |  3.43 | 11.92 |  -4.77 |  14.23

Table 7.15: Model error on the large scale HNSciCloud tests using ten VMs per provider.

The model predictions for Exoscale and IBM deviate from the actual performance by only a small margin. The biggest discrepancies, exceeding the 20% goal, are found for T-Systems, even though it is already divided into the two VM generations. The explanation is that the VM on which the benchmarks were performed deviated in its performance from the average. In addition, the workflow performance within a VM fluctuated more than at the other providers. This is consistent with the higher standard deviation that was found for T-Systems compared with the other providers.

7.3.4 Error sources

Idle time measurement

One of the reasons for the T-Systems model discrepancy is the idle time. The setup of the T-Systems VMs differs from the others. It seems to be slow with regard to I/O

operations, which manifests itself in a low IO-factor. In this particular case there seems to be a correlation between the I/O and the idle time, as shown below. In Figures 7.7 and 7.8, the monitoring numbers of the T-Systems VM and the Exoscale VM are plotted. Comparing the idle profiles of the two figures (the pink curve in the bottom plot of each) indicates a difference in the overall profile. This is especially apparent between 16:55 and 17:25 in the T-Systems plot, a period that falls within a CPU intensive stage of the job.

Figure 7.7: T-Systems VM CPU sysstat monitoring result of a reconstruction job with merging at the end.

In Figure 7.8, the CPUs are either busy or waiting for I/O in that phase. For T-Systems, a fraction of the CPUs appears to be in the idle state in parallel with a high I/O wait, which can be seen from the black curve of the top plot in Figure 7.7. The high I/O wait is not constant throughout the job; between 18:00 and 18:30, for example, no idle time was measured. A change in the amount of idle time that occurs in parallel to the heavy I/O-wait period will have an impact on the idle factor. One reason why the T-Systems VMs may behave differently is their different setup, such as the storage being attached through the network. Influences from other users may lead to congestion and therefore to this CPU-inefficient job behaviour.

Idle factor

Due to the job characteristics, the initial assumption for jobs that have very little network activity was that the idle time of the target machine can be obtained from the reference


Figure 7.8: For comparison: Exoscale VM CPU sysstat monitoring result of a reconstruction job with merging at the end.

machine, through the following equation:

Idle time_tg = (Idle Time_rf × CPU power_rf × Nr Evts) / CPU power_tg    (7.1)

The logic is that the faster the CPU, the lower the idle time. This is because during a serialised phase such as merging, the duration of the serialisation depends on how fast the merging is performed using only one core at a time. If the serial part that is waited upon finishes faster, the idle time is decreased. Depending on how many CPU cores the VM has, the decrease can be significant. This assumption held for event generation and Monte-Carlo simulation jobs, for which the idle time is low, but not for reconstruction jobs (see Chapter 7). In order to gain a more accurate result, the idle factor was introduced.

The idle factor greatly increased the accuracy of the model when the benchmark was similar to the workflow. When modelling, for example, a reconstruction job with little merging using input values from a job that merges heavily, the results with an idle factor (see Table 7.14) were worse than those without (see Table 7.16). This behaviour seems surprising, considering that the infrastructure itself is the same for both reconstruction 2 and 3. It can be explained by a change in the overall workflow pattern. The disk access, for example, happens in a completely different manner than before, inflating or deflating the apparent disk speed. An explanation lies in the different access

patterns, which are associated with different disk performances, such as random and sequential read speed. In addition, the results for the I/O wait time are much worse, but the absolute I/O wait time is much smaller than the idle time, making its impact on the wall time small. In Table 7.16, the cross-modelling discrepancies without an idle factor input are shown.

Model discrepancies [%] | IBM VM1 Wall Time | T-Systems VM1 Wall Time | Exoscale VM1 Wall Time
Reco 1 with 2 |  17.68 | -18.63 |   3.18
Reco 1 with 3 |   1.41 | -18.78 |   0.58
Reco 2 with 1 | -25.25 |   1.94 | -12.48
Reco 2 with 3 | -23.22 |  -5.80 | -12.26
Reco 3 with 1 |  -8.71 |   7.83 |  -4.35
Reco 3 with 2 |   3.41 |  -0.83 |  -0.54

Table 7.16: Modelling across workflows without an Idle factor.

Comparing these results with the previous results that had an idle factor as input (see Table 7.14) shows that the model deviations have mostly become smaller. The results are still not satisfactory. They show, however, that in cross-modelling it is better to estimate the idle factor from the CPU power rather than from the performance of a different workflow. This is not the case for regular modelling, where the benchmarked idle factor results are more accurate. Since no additional benchmarking is needed to retrieve the idle factor and since the workflows are carefully classified, the approach that uses an idle factor as input was chosen as the WIM default.

Number of cores

After several measurements on single-, 4- and 8-core VMs, it became clear that the number of cores is an additional potential error source. The error is introduced when benchmarking a VM with a certain number of cores and then trying to model an infrastructure that consists of VMs with a different number of cores. It can easily be avoided by only comparing VMs with the same number of cores.

It was found that modelling, for example, MC simulation with event generation infrastructure input parameters gives large discrepancies. This seems surprising, as both workflows are CPU intensive and do not include many I/O operations. The explanation is the single-core nature of the event generation job. Using single-core CPU factors to predict multi-core CPU factors introduces a large error, even if the jobs are similar. This is in part due to overheads that are introduced by running on multiple cores. One example is that all AthenaMP workflows process the first event on only a single core. For reconstruction, a large discrepancy between the CPU factors stems from the serial merging step, which is not performed in the single-core scenario. Additionally, the CPUs and operating systems themselves use different

optimisations that have a different impact on the single- or multi-core scenario.

Number of events

It is also possible to extrapolate the number of events that are processed per workflow between the benchmark and the actual workflow. This is not recommended, as the overheads are not considered separately in the latest version of the model. The overheads will therefore have a higher impact in a job with fewer events. This means that when benchmarking, for example, a workflow with five times fewer events than are to be modelled, the resulting wall time will be too large by four times the overhead. This can be avoided by benchmarking the workflows with the correct length. Such an extrapolation in the number of events was nevertheless used in Subsection 7.2.3 to compare the different results. The difference in the overheads was taken into consideration and the overall wall times were only compared very roughly.


CHAPTER 8

Overcommitting

“The perils of overwork are slight compared with the dangers of inactivity.” - Thomas A. Edison

8.1 Principle

Overcommitting means allocating more resources than are available. In computing, this means hosting virtual workloads whose maximum demand exceeds the physical availability [126]. This can apply to all components: CPU, network, storage or RAM. The idea behind overcommitment (OC) is that it counteracts underused resources. This can happen in several ways. In Cloud computing, the number of VMs (users) located on a physical machine is maximised, so that the distributed maximum workload exceeds the physical capacity of the machine [127]. Cloud providers use this technique to exploit the effects of resource overprovisioning by customers, which can be significant.

One example is the Cloud computing cluster data released by Google. Google allocated 80% of the actual memory and over 100% of the actual CPU capacity of its cluster. Still, the overall usage was only at 50% of the total capacity for memory and 60% for CPU [128]. This difference between resource requests and actual usage is a pattern that can be observed throughout the whole IaaS business. It can also be seen, for example, in papers by Liu, by Ghosh and Naik and by Dabbagh et al. [129] [36] [127]. In these examples, the actual resource utilisation is even lower than described above.

There are several reasons for these wasted resources. Due to the Bullwhip Effect [130], it is often hard for a customer to know the exact amount of resources that are needed,

leading to a VM sprawl [131]. The resource consumption of a VM can vary with time, meaning it is often not running at maximum capacity. An example would be a web server that is accessed more or less frequently during the course of a day [127]. What is true for one VM applies equally to a customer that uses several VMs. In general, Cloud providers use some sort of OC; however, they do not disclose how, or how much of, their resources are overcommitted.

A second way in which OC can be beneficial is for CPUs that are underutilised due to the workflow profile. Jobs at CERN, for example, run around the clock and almost all of the available resources are occupied by jobs at all times. However, the CPUs are not utilised at 100%, leading to another kind of inefficiency. As seen before, a CPU that is computing a job can be in an idle state or waiting for input data to compute, known as the I/O wait time. Depending on the job type, the CPU efficiency (see Subsection 4.6.4) varies greatly. Simulation jobs have a very high efficiency, greater than 95%, while multicore digitisation and reconstruction (DigiReco) jobs in particular have a low efficiency of around 60%. These DigiReco jobs read a large amount of input data. The example jobs used in Subsection 5.3.4, for instance, read 32 GB per 1000 events and the CPU is often waiting for input. Hyperthreading is one example of OC that is used at several WLCG sites [132]. It presents two logical processors for every physical core. Overcommitting is thus introduced at the infrastructure level. Some workflows, such as DigiReco, profit more from hyperthreading than others, such as simulation, which does not benefit from it at all [133].

A low job efficiency can be observed throughout the WLCG; this is due to the job mix described in Chapter 5. A large portion of the jobs is I/O intensive. The idea for improving the CPU utilisation is to use OC in order to occupy CPU cycles during the time they are idle or in I/O wait. In contrast to the previously explained OC use cases at the infrastructure level, this OC happens at the job level. This means that the infrastructure setup is not changed, but the jobs themselves run more parallel processes than there are (virtual) CPU cores. The biggest difference between the two levels of OC is the knowledge about the resource usage. For example, a Cloud provider does not know how many resources a customer will consume. Therefore, the provider overcommits a percentage of its resources, including a safety margin, based on its best guess. For OC at the job level, the profiles of the workflows are well known and can be adapted to each other by choosing the best combinations. The obvious choice would be to combine a very I/O intensive job with a CPU bound job, or vice versa. In the first case, the OC makes use of the fact that disk I/O is a bottleneck and CPU resources are free, while at the same time not piling additional I/O on top. An example would be the workflow profile presented in Subsection 5.3.2, which indicates that there is room for optimisation, especially during the single-core merging steps.

There are also downsides to OC. The biggest drawback is that more workload on a


VM means that it will require more memory. This is similar to the case in which more logical CPU cores than physical cores are created. Secondly, the job submission frameworks of the experiments are very inflexible. In the ATLAS case, the attempt to test OC on the Grid failed due to various restrictions. Circumventing the standard user submission and attempting to overcommit via the HammerCloud functional and stress testing framework [134] [135] was also unsuccessful. Therefore, it is difficult to implement OC at the job level. Finally, if too many parallel processes are running at some point, they might suffer from resource contention. This means that the processes compete for resources amongst each other, slowing down the VM.

8.1.1 Overcommitting scenarios

In Figures 8.1 and 8.2, a theoretical look at the progression of overcommitted multicore applications is sketched. It highlights how OC works and where the possible gains come from, making it clear which of the different scenarios can benefit more or less from OC. For simplicity, the example in these figures does not include memory or other I/O considerations. In Figure 8.1, the overall duration increases from 32 s to 40 s, whereas in Figure 8.2 it only increases from 32 s to 34 s. The only difference between these two scenarios is the type of workflows that are run. In the first example, no CPU idle or I/O wait time is present. Therefore, the five processes cannot be parallelised further and the overall duration is the sum of the durations of all workloads divided by the number of cores. The second example shows how this result can change if the workflows have a lower CPU efficiency. For example, data-intensive workflows with an increased number of I/O wait periods provide the potential for more workload to be done. In Figure 8.1, the event yield is exactly the same with or without OC. This can be seen when interpreting the figure in the following way: one second of CPU activity of Process 5 (red box) represents one event. Since Process 5 (red) and Processes 1-4 (blue) are equivalent, one second of Processes 1-4 (blue) also corresponds to one event. The non-OC and OC yields are therefore equal:

non-OC yield = 128 evts / 32 s = 4 evts/s ≡ OC yield = 160 evts / 40 s = 4 evts/s    (8.1)

Here, evts stands for events. The scenario and the yield change when looking at Figure 8.2. There, CPU idle and I/O wait times (white boxes) existed and could be replaced by Process 5 (red boxes). This means additional workload was done in the OC case and the yield therefore increases:

non-OC yield = 92 evts / 32 s ≈ 2.9 evts/s < OC yield = 124 evts / 34 s ≈ 3.6 evts/s    (8.2)


Figure 8.1: Schematic view of the OC progression. The green scale at the top represents the elapsed time in arbitrary units. Underneath are four processes, indicated by four blue bars. These bars indicate the usual progression of these four processes on a four core machine (non-OC scenario). After adding another process (indicated in red), the picture changes (following the black arrows) to the four bars labelled “Core 1-4”. The fifth process is equivalent to the other four, the red colour is to highlight the differences between the two scenarios. This is a schematic view of how the five processes would run on a four core machine (overcommitted). Since the processes are similar, the scheduler distributes them equally amongst the CPUs.

The distribution of the processes, and the interactions between them and the CPUs, are much more complex than shown in Figure 8.2. In reality, each of the blocks from the example would last only a few milliseconds and be switching between the CPU cores much more. Furthermore, the fifth process would include some CPU idle and I/O wait blocks.


Figure 8.2: Schematic view of the OC progression. The green scale at the top represents the elapsed time in arbitrary units. Underneath are four processes, indicated by four blue bars. These bars indicate the usual progression of these four processes on a four core machine (non-OC scenario). The white boxes indicate idle or I/O wait times. After adding another process (indicated in red), the picture changes (following the black arrows) to the four bars labelled “Core 1-4”. The fifth process is equivalent to the other four, the red colour is to highlight the differences between the two scenarios. This is a schematic view of how the five processes would run on a four core machine (overcommitted). Since the processes are similar, the scheduler distributes them equally amongst the CPUs.

Overcommit factor

In order to quantify the benefits of OC, the overcommit factor “OCF” is introduced. Unused CPU cycles of the non-OC case can potentially be used by OC. In reality, not all of this potential is actually used. The OCF gives the percentage of the OC workflows that were computed using the unused cycles of the non-OC scenario.

This can be illustrated with the help of two examples. In Figure 8.1, there is no CPU

idle or I/O wait time in the non-OC case, indicated by white boxes. The additional yield (see Subsection 8.1.1) and the OCF are zero, OCF = 0/32 = 0. The fraction represents the additional workload (red boxes) computed instead of the idle and I/O wait time (white boxes), divided by the total workload.

In contrast to the first example, there are unused CPU cycles in the second example: in Figure 8.2 the OC used potential CPU cycles that were available in the non-OC scenario. Comparing the non-OC and OC scenarios, some idle and I/O wait times (white boxes) have disappeared, replaced by Process 5 processing (red boxes). The yield in the OC case is therefore increased. The OCF is OCF = 24/32 = 0.75 ≡ 75%. The numerator of 24 corresponds to the number of red blocks that were computed instead of previously white blocks.

According to this definition, 0 ≤ OCF ≤ 1. The worst case, with no OC benefit and OCF = 0, has already been shown in Figure 8.1. The best case of OCF = 1 would be achieved if the OC scenario had the same wall time as the non-OC scenario. In Figure 8.2, an OCF = 1 would mean an OC wall time of 32 s. This can only be the case if all Process 5 processing (red blocks) replaces previously unused CPU cycles (white blocks).
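A minimal sketch, using only the numbers read off Figures 8.1 and 8.2, reproduces the yields of Equations 8.1 and 8.2 and the two OCF values:

```python
# Toy calculation for the schematic scenarios in Figures 8.1 and 8.2.
# All numbers are read directly off the figures (in arbitrary time units).

def yields_and_ocf(non_oc_events, non_oc_wall, oc_events, oc_wall,
                   replaced_idle_units, total_non_oc_units):
    """Return (non-OC yield, OC yield, OCF) for a schematic scenario."""
    non_oc_yield = non_oc_events / non_oc_wall
    oc_yield = oc_events / oc_wall
    # OCF: fraction of previously unused (white) units filled by the extra
    # process, relative to the total non-OC workload span.
    ocf = replaced_idle_units / total_non_oc_units
    return non_oc_yield, oc_yield, ocf

# Figure 8.1: fully CPU-bound processes, no idle/IO-wait time to fill.
print(yields_and_ocf(128, 32, 160, 40, replaced_idle_units=0,
                     total_non_oc_units=32))   # -> (4.0, 4.0, 0.0)

# Figure 8.2: 24 previously idle units are filled by the fifth process.
print(yields_and_ocf(92, 32, 124, 34, replaced_idle_units=24,
                     total_non_oc_units=32))   # -> (2.875, ~3.65, 0.75)
```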

The next Section 8.2 shows the OC results of real ATLAS workflows.

8.2 Study

8.2.1 Overcommitting with AthenaMP

The simplest way to overcommit is to make use of the built-in AthenaMP functionality that determines the number of parallel processes that are spawned. Changing the number of parallel processes does not change the workload, meaning the number of events that are processed stays the same. In Figure 8.3, the wall times resulting from a varied number of parallel processes are shown. Each bar represents the average over three jobs and the error bars depict the standard deviation. The variation of the wall time amongst the sets of three jobs was very low. The best configuration is eight processes, as it takes the shortest amount of time for the same workload. This is expected when considering the job profile, one example of which is shown in Figure 5.9. The profile regions in which the setup would benefit the most from additional processes are the serialised phases, i.e. the merging steps. However, these regions do not see an increase in CPU utilisation, because the AthenaMP restrictions produce the same outcome: still only one CPU core is used when merging. Therefore, the nine and ten process scenarios do not benefit from OC but suffer from the increased memory requirement and the resource contention. The effects of too little memory have been shown in Subsection 5.3.2. In this case, OC will only make the situation worse. This is due to the fact that the memory requirement


Figure 8.3: Raw data reconstruction job duration averages, combining three repetitions, including the standard deviation (see Equation 5.1), for an increasing number of parallel processes on an 8-core VM. Note: for better visibility, the y-axis does not start at zero. The exact numbers can be found in the appendix, see Table A1. The input data was on the local disk.

goes up with each additional process. The effect of increasing the number of processes is therefore similar to lowering the amount of available RAM, in the sense that the VM will have to start swapping in both cases. Therefore, it is important to know the memory requirements and availabilities. Only then can the decision be made whether OC will deliver additional performance, or the opposite. The reason why the undercommitted case takes longer is that a part of the VM is not used for processing, lowering the CPU efficiency further and increasing the duration.

Since this kind of OC does not produce any benefits, it is not used in the following. The approach used for the rest of this chapter is to start multiple AthenaMP instances simultaneously, instead of just one. In that scenario, OC is achieved through the combined number of parallel processes of the two or more AthenaMP instances, even if each individual instance does not overcommit.

8.2.2 Overcommitting job profiles

In Chapter 5, the profiles of the different workflows were examined. This subsection shows how the situation changes when OC is taking place. The following examples were executed on the same VM, located at CERN; for more details see Subsection A.7.2 in

the appendix. The time axis of the following graphs has the same range, so the wall times of different jobs can be compared more easily. The same applies to the y-axes. In order to limit the number of profiles displayed in this chapter and to avoid confusion, only the two most significant ones, Figures 8.4 and 8.5, are shown. Together they combine all the information and changes. Additional graphs that highlight the changes step by step can be found in the appendix.

Figure 8.4: Reconstruction profile, showing the execution of eight parallel AthenaMP processes on an 8-core VM. The input data was not on the local disk. It was read through the network from a remote storage at BNL across the Atlantic.

In Figure 8.4, a reconstruction profile of a job reading the input data from a remote location is shown. Remote reading is used by CMS, and much effort has been put into investigating it by ATLAS. The usual setup of eight parallel processes on eight cores was used. It can be seen that there is high network activity during the first large processing step (RAWtoESD, 0 s to around 6500 s). At the same time, the CPU usage is below 100%; it stays below 60% most of the time. This is in contrast to the case of local data shown in the appendix in Figure A9, where the network activity is low and the CPU usage is high. The large contrast results from the fact that the remote data location is purposely chosen to be far away at BNL, so a higher latency comes into effect. The “ping” command between the VM and a ping-able host at BNL shows an average latency over 10 packets of 93.775 ms with a standard deviation of 0.221 ms. The latency manifests itself in the lower CPU usage (from 100% down to 30-60%), as the CPUs are constantly waiting for input data to be read from the remote storage. For an even more granular comparison, a profile of the same job with remote data input, but a low

latency, can be found in the appendix, see Figure A10. There, the negative impact on the CPU usage is much smaller. What can also be noted is the merging step at the end, after around 9000 s. The merging is done serially, meaning only a single CPU core is used. The disk read/write limit is not reached during the merging and most of the merging step is CPU bound.

The profile changes significantly with OC. In Figure 8.5, the number of processes was doubled by executing the same reconstruction twice at the same time. In contrast to before, the total workload was also doubled. This increased the number of parallel processes to 16. The choice of doubling the processes and the workload is a consequence of an AthenaMP restriction that is discussed in Subsection 8.3.1.

Figure 8.5: Overcommitted profile, showing the execution of two times 8 (16) parallel AthenaMP processes on the same 8-core VM. The workload is doubled. The input data was read through the network from a remote storage at BNL across the Atlantic.

The two jobs together did not require double the amount of time, even though they were computing twice the workload, see Figure 8.5. The profile shows that during the serialisation the CPU utilisation is doubled. This gain in CPU utilisation in the serialised part of the job is present in the other scenarios, local data or low latency remote data input, as well. In the appendix, see Figure A11, a profile of the same overcommitted job with remote data input, but a low latency, can be found. It can also be seen that the memory requirement is much higher for all OC cases: it is more than 30 GB, compared to around 16 GB before, see the purple dashed lines.

There is an additional increase in the CPU usage between the non-OC and OC scenarios in Figures 8.4 and 8.5. It can be seen during the first processing step (RAWtoESD,


0 s to around 6500 s and 7500 s, respectively) and it raises the CPU usage from 30-60% to 60-100%. It originates from the fact that, while some processes are waiting for input, others can be computed. The high latency to BNL guarantees that there is a significant percentage of CPU inactivity. The additional processes can make use of these inactive times, in addition to the CPU inactivity resulting from the serialisation. Overall, the wall time increased by only 20% for double the workload, making OC a worthwhile optimisation technique for this case. This example showcases that OC is useful as a latency hiding method.

In conclusion, it can be seen that OC increases both the memory requirement and the CPU usage. This is the trade-off that has to be considered before implementing OC. In the above examples, the overall wall time increased in both OC cases. What has been shown is that the increase in wall time depends on the proportion of CPU inactivity in the non-overcommitted scenario. In principle, a scenario in which the wall time does not increase at all (very low CPU activity in the non-OC case) or increases by exactly the added work (100% CPU usage in the non-OC case) is possible.

These profiles open up several questions that have been investigated in Section 8.3.

• Which workflows and scenarios can OC be useful for?

• Does the RAM configuration of the existing machines allow for OC, or how much RAM is required?

• How many processes should be overcommitted?

• Should the processes be doubled, as in the above example?

A cost/benefit analysis is performed in Subsection 8.4.2.

8.3 Measurements

8.3.1 AthenaMP combinations

With AthenaMP, OC is easily done on an individual VM. It was already found that running only one AthenaMP instance in OC mode does not yield any benefit in the case of local data, because it does not influence the overall profile, see Subsection 8.2.1. Running multiple instances in parallel, for example two as in Subsection 8.2.2, gives the expected improvement in CPU efficiency. How many instances and which combination give the best result remains to be answered. Possible scenarios are plentiful, e.g. running as many instances as there are parallel processes. On a 4-core VM, adding one additional process could look like: 4+1; 3+2; 3+1+1; 2+2+1; 2+1+1+1 and 1+1+1+1+1. The numbers indicate how many parallel processes are run in one AthenaMP instance, and different instances are separated by a plus sign. In addition to that, there can be more than five parallel processes on a 4-core VM.
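For illustration, the possible instance combinations for a given total number of processes are simply the integer partitions of that number. A short sketch that enumerates them (the single-instance case, 5, is included by the enumeration although it was already excluded above):

```python
# Enumerate the possible ways to split a given number of parallel processes
# into AthenaMP instances (the "4+1; 3+2; ..." notation used above), as a
# simple illustration of how quickly the number of combinations grows.

def partitions(n, largest=None):
    """Yield the integer partitions of n as descending tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

# One additional process on a 4-core VM: five processes in total.
for combo in partitions(5):
    print("+".join(str(c) for c in combo))
# -> 5, 4+1, 3+2, 3+1+1, 2+2+1, 2+1+1+1, 1+1+1+1+1
```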


Some scenarios can be excluded beforehand. If there are too many parallel processes, the RAM requirement becomes too high while the CPU efficiency no longer benefits. A scenario with many AthenaMP instances has a larger memory footprint than one with fewer instances. This is due to the fact that the processes share part of their memory to reduce redundancies. AthenaMP increases this sharing by having the processes share additional data, such as the detector geometry, but this additional AthenaMP sharing only takes place for processes within the same instance.

After excluding some scenarios, what remains is to measure and compare. The results of the initial measurements are shown in Figure 8.6.

Figure 8.6: Runtimes of different AthenaMP OC configurations. The standard eight processes on the 8-core configuration is marked in blue. Note: for better readability, the wall time does not start at zero.

At first glance, this looks surprising, as all possible scenarios seem to take less time than the standard configuration. Even the ones that are not overcommitted, such as 7+1, took less time. The explanation can be found in the way the jobs are executed and in the resulting files. First of all, a fair comparison has to be ensured; the interpretation of the results is done afterwards. Invoking AthenaMP once with eight parallel processes delivers the results in a single merged file. In the case of the 7+1 invocation, the results are split among two files, one containing 7/8 of the results, the other 1/8. These two files would have to be merged in addition.

If further merging steps are considered for the multiple AthenaMP instance scenarios, the durations change accordingly. This means that the wall time of, for example, the


7+1 scenario has to be increased by the time it takes to merge the two resulting output files. This additional merging time, which has to be added to all scenarios with multiple output files, is extrapolated from the standard scenario¹. In Figure 8.7, the wall time corrections for the additional merging have been applied.
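The correction itself is a one-liner: every output file beyond the first requires one additional merge, whose duration is extrapolated from the standard scenario. A minimal sketch with purely hypothetical numbers (neither the wall time nor the per-file merge time shown here are the measured values):

```python
# Hypothetical illustration of the merging correction applied in Figure 8.7.
# An N-instance scenario produces N output files; (N - 1) extra merges are
# added, each estimated from the merging time of the standard 8-process job.

MERGE_TIME_PER_EXTRA_FILE = 600.0   # placeholder extrapolated value [s]

def corrected_wall_time(measured_wall, n_output_files):
    extra_merges = n_output_files - 1
    return measured_wall + extra_merges * MERGE_TIME_PER_EXTRA_FILE

# e.g. a hypothetical 7+1 scenario measured at 9000 s -> corrected to 9600 s
print(corrected_wall_time(9000.0, n_output_files=2))
```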

Figure 8.7: Runtimes of different AthenaMP OC configurations after applying merging corrections. The standard eight processes on the 8-core configuration is marked in blue. Note: for better readability, the wall time does not start at zero.

What can be seen is that the order of the combinations changes, as some would have to undergo more merging than others. More importantly, the difference between the standard eight-process scenario and the others shrinks significantly. Some scenarios even take longer than the standard one.

In order to avoid differences due to the merging, a workaround has been found. Instead of comparing single jobs with each other, sets of jobs were compared. This was done in the following way: in the usual scenario, four jobs run successively and produce four output files. The output is therefore the same as for the 2+2+2+2 scenario, which processes four times the workload of a usual job. In both cases, four output files containing roughly 3000 physics events each are produced. These are not merged further. Figure 8.8 compares the runtime, divided by four, of these scenarios.

¹ Initially it was attempted to compute and measure the additional merging steps manually. Due to configuration errors, they could not be executed correctly, however, and the results were inconsistent with the original 8-core AthenaMP job. After consulting experts, it was decided that extrapolating would be the better alternative.


Figure 8.8: Runtimes of different AthenaMP configurations after applying the workaround. Not all scenarios are overcommitted. Note: for better readability, the wall time does not start at zero.

Without the need to apply merging corrections, the difference in the runtime is significant. This holds even for cases in which technically no OC is performed (2+2+2+2 and 4+4). The speed-up in these non-OC cases can be explained by the additional flexibility and by the fact that multiple cores are active during the merging. This becomes apparent when looking at the profile of the 2+2+2+2 scenario, shown in Figure 8.9.

Figure 8.9: Profile for the 2+2+2+2 scenario. Note: no logarithmic scale is used for the network and disk activity.


Especially towards the end, after 18000 s, and in between the RAWtoESD and ESDtoAOD steps, between 16000 s and 17000 s, the CPU usage stays high. In the usual scenario, the CPU usage would drop to 12.5% during these serial merging steps. This can be seen, for example, after 9000 s in Figure 8.4. What has to be highlighted at this point is that the single-core merging step does not reach the disk read/write limit². This is why multiple CPU cores can execute different single-core merging processes at the same time without being stuck in I/O wait³. Finally, this explains the ordering of the processes in Figure 8.9: the higher the number of AthenaMP instances, the better the CPU utilisation during the merging period, and the shorter the overall wall time. This is limited by the memory footprint, which increases with each additional instance.

In conclusion, this section has shown that there are many possibilities to overcommit. Due to the particulars of the merging, the comparison between all scenarios is not trivial. The combinations that provide the highest event throughput are the ones that exploit the serialised part the most. Instead of testing all possible combinations, a better way to fully use the CPU cycles during the serialisation is described in Subsection 8.3.3.

8.3.2 Overcommitting for latency hiding

After looking at the profiles of overcommitted VMs, it became apparent that OC can be used to compensate for negative effects of higher latencies. There are multiple scenarios in which remote access is beneficial, such as partial reads, or is already used, as in CMS. This has been investigated in more detail; the results can be found in Table 8.1.

Number of processes | RAM [GB] | Input data location | Overall node throughput [s/event] | Overcommit improvement [%] | Improvement to standard config [%]
8   | 32 | BNL   | 4.19 ± 0.05 |  —  | -32
2x8 | 32 | BNL   | 2.55 ± 0.01 |  39 |  19
8   | 16 | BNL   | 4.31 ± 0.08 |  —  | -36
2x8 | 16 | BNL   | 3.51 ± 0.08 |  19 | -11
8   | 32 | local | 3.07 ± 0.04 |  —  |   3
2x8 | 32 | local | 2.24 ± 0.01 |  27 |  29
8   | 16 | local | 3.17 ± 0.09 |  —  |   0
2x8 | 16 | local | 3.33 ± 0.01 |  -5 |  -5

Table 8.1: Overcommitment results summary [117].

Adjacent rows form pairs that share the same infrastructure configuration.

² The VM on which the profiling was done used a hard disk. The disk limit was not reached.
³ ATLAS is currently working on removing the limitation of writing from a single core to a single file, see Subsection 8.5.

The difference is that the upper row within a pair shows the situation without OC, while the lower row is overcommitted by a factor of two, including double the workload. This factor is chosen in order to circumvent the merging issue that was already discussed in Subsection 8.3.1. Adjacent pairs of rows differ with regard to the available RAM: the pairs alternate between 16 GB, which is the standard Grid configuration for an 8-core machine, and 32 GB. In this case, 32 GB was the maximum the VM had available. It was chosen to make sure the jobs do not run into any memory limitation, i.e. to avoid heavy swapping. In addition, in the top four rows the input data were located at BNL. The data were read from there event by event during the workflow execution. This is compared to the bottom four rows, depicting the standard scenario of local input data. Local in this case means on the VM's disk. The download duration from a local storage, such as EOS or DPM, to the VM's disk is very small compared to the job duration, so it is neglected.

The results from Table 8.1 show that the best event throughput is not achieved by the standard Grid configuration. Instead, the OC scenarios with additional memory are 19% faster, with input data at BNL, and 29% faster, with local input data.

It can also be highlighted that OC works as a latency hiding mechanism. An improvement through OC can be observed in both remote data measurements. This effect was almost twice as large with more RAM, 39% compared to 19%. In the standard RAM configuration, OC was not enough to completely compensate for the remote data slowdown: even with OC, it was still 11% slower than with local input data.

In the standard Grid configuration, OC has a negative impact on the performance (−5%). When additional memory is available, however, a significant performance gain (+27%) can be observed.
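Both percentage columns of Table 8.1 follow directly from the measured throughputs. The short sketch below recomputes them (to within rounding of the quoted values), taking the 8-process, 16 GB, local-data row as the standard configuration:

```python
# Recompute the two improvement columns of Table 8.1 from the measured
# per-event throughputs (lower s/event is better).

throughput = {  # (processes, RAM [GB], input): overall node throughput [s/event]
    ("8",   32, "BNL"):   4.19,
    ("2x8", 32, "BNL"):   2.55,
    ("8",   16, "BNL"):   4.31,
    ("2x8", 16, "BNL"):   3.51,
    ("8",   32, "local"): 3.07,
    ("2x8", 32, "local"): 2.24,
    ("8",   16, "local"): 3.17,
    ("2x8", 16, "local"): 3.33,
}
standard = throughput[("8", 16, "local")]   # standard Grid configuration

for (procs, ram, loc), t in throughput.items():
    to_standard = 100 * (standard - t) / standard
    line = f"{procs:>3} proc, {ram:2d} GB, {loc:5s}: {to_standard:+4.0f}% vs standard"
    if procs == "2x8":   # OC row: compare with the matching non-OC row
        t_non_oc = throughput[("8", ram, loc)]
        line += f", OC improvement {100 * (t_non_oc - t) / t_non_oc:+4.0f}%"
    print(line)
```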

In conclusion, OC brings advantages to the remote data reading scenario. Since the benefits of OC depend on the available memory, the ideal RAM-to-core ratio has to be found. One possible approach to finding the correct ratio was already shown in the RAM investigation undertaken in Section 5.3.2. In addition, the question of how many processes to overcommit remains. These would have to be tested against every RAM configuration, as the number of processes influences the RAM requirement. An alternative is to use the WIM, see Section 8.4, which can correlate the RAM with the number of processes and even include the Cloud cost.

8.3.3 Scheduling

By now it should be clear that the merging step is the major contributor to the CPU inefficiencies. Instead of using trial and error, as in Subsection 8.3.1, the merging can be targeted specifically for improvement through OC. The result is a scenario that avoids the large CPU inefficiencies of merging, while at the same time making the

processing beforehand more efficient.

Instead of doing the merging consecutively, it can be done in parallel: a second reconstruction job would start its processing as soon as the first job goes into merging, and so on. In principle, this can be achieved through intelligent scheduling. Alternatively, reconstruction jobs could be run consecutively without any merging, and merging jobs would be run in parallel to them. The first merging job could only start after the first reconstruction finishes. In both scenarios, there is an overhead at the beginning of the exercise, when there is no unmerged data on the VM yet and no OC takes place. In the Grid scenario, an additional boundary is the lifetime of the pilot job: when the pilot job expires, this OC processing chain is stopped and the last jobs are killed. The maximum lifetime of a pilot job is three days. A reconstruction job duration of 9711 s, a job duration without merging of 5871 s and a job duration with the OC merge in parallel of 6334 s highlight that a good improvement can be achieved. In a 72 hour job slot, 38 jobs with OC merging in parallel could be executed, whereas in the usual scenario it would only be around 27. This is around 40% more workload that can be performed on the same CPUs. One caveat is that the available memory for this measurement was 32 GB instead of 16 GB.

8.4 Model predictions

In this section, the WIM is combined with OC, which can then be combined with the flexible hardware configuration of Cloud computing.

8.4.1 Overcommitting in the model

In Subsection 8.3.2 it was mentioned that the WIM can be used to make predictions for OC. In order for the model to make a meaningful prediction for OC, some tweaks had to be incorporated. The specific OC input values that are introduced below can simply be set to zero in the usual (non-OC) case.

Overcommit improvement time

The overcommit improvement time (OC improve time) describes the gain achieved by the OC. In simple terms, it is the wall time difference between the non-OC and the OC scenario. This can easily be visualised on a single-core machine: if running two processes consecutively takes 100 s, but running them in parallel takes 85 s, then the OC improve time = (100 − 85) s = 15 s. A more complex scenario was shown in Figure 8.2. There, the OC improve time = (40 − 34) s = 6 s. The 34 s are the measured OC wall time; the 40 s are the wall time it would take if there were no OC benefit, as can be seen from Figure 8.1.


The OCF gives the percentage of the OC workflows that were computed using the unused cycles of the non-OC scenario. Using the OCF, the OC improve time can be calculated as follows:

OC improve time = OCF × (nr processes − Nr Cores) × (CPU consumption time + IO Wait Time + Idle Time + Overhead Time) / Nr Cores    (8.3)

where nr processes is the total number of processes and Nr Cores is the number of available CPU cores. The term in brackets represents the non-OC wall time. In the example shown in Figure 8.2, the OC improve time = 0.75 × (5 − 4) × 32 s / 4 = 6 s, which agrees with the earlier result. This gain has to be accounted for in the workflow duration.
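A minimal sketch of Equation 8.3, assuming (as in the schematic example) that the bracketed time terms sum to the non-OC wall time, reproduces the 6 s of the Figure 8.2 example:

```python
# Minimal sketch of Equation 8.3, checked against the Figure 8.2 example.
# The bracketed sum (CPU + I/O wait + idle + overhead time) is represented
# here by a single number, as in the schematic example.

def oc_improve_time(ocf, nr_processes, nr_cores, non_oc_time_sum):
    """Equation 8.3: wall-time gain attributed to overcommitting."""
    return ocf * (nr_processes - nr_cores) * non_oc_time_sum / nr_cores

# Figure 8.2: OCF = 0.75, 5 processes on 4 cores, non-OC time terms = 32 s
print(oc_improve_time(0.75, 5, 4, 32.0))   # -> 6.0 s
```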

Overall workflow duration

The overall duration of the workflow (Overall WF Time) is now calculated by additionally applying the OC correction, see Equation 8.3, to the standard workflow time, see Equation 6.10:

Overall WF Time = (WF Time − UC Idle Time − OC improve time) / Nr Cores    (8.4)

where UC Idle Time is the undercommitment idle time, see Subsection 4.6.5. In the end, either the OC or the UC time has to be zero. For workflows configured as on the Grid, both are zero.

8.4.2 Result

With these additional implementations, the WIM is able to make predictions for OC. One use case, already mentioned in Subsection 8.3.2, is to find the best memory/OC combination. Furthermore, all other WIM functionalities can be used. A concrete example is to combine OC with the flexibility of a Cloud provider. In this example, a flat budget of 1000 CHF is available to procure Cloud resources that will run ATLAS raw data reconstruction. The detailed input parameters can be found in the appendix, see Subsection A.4.1. Instead of immediately provisioning VMs that have the standard configuration of eight CPU cores with 16 GB of RAM, the WIM is used to determine the best configuration. Taking the Cloud prices of CloudSigma and the reconstruction workflow parameters as model input, the graph in Figure 8.10 is created. According to the specifications, the graph is three-dimensional, with the number of parallel processes and the available RAM as parameters.


Figure 8.10: Graphical display of the Model output. The maximum is highlighted by the blue dot [117].

The shape can be explained by the boundary conditions. It can be seen that when decreasing the RAM, a sharp edge is reached where the graph drops off. This is the cut-off caused by heavy swapping when not enough memory is available. It is also logical that unused CPU cores lead to a less-than-optimal yield, which explains the downward trend towards stronger undercommitting. The reason why a rising number of processes can have a negative impact on the event throughput is that the memory requirement rises with the number of processes. With a flat budget, additional RAM lowers the budget for the other components, in this case the CPUs; more RAM translates directly into fewer CPUs.

The position of the maximum value in the x-y plane provides the best RAM/processes configuration, which maximises the number of produced events while minimising the time and the cost. For 1000 CHF in 10000 s, the model predicts that 23740 ± 30 events can be reconstructed. This maximum of ∼2.374 ± 0.003 events per second for the 1000 CHF budget lies at 14 GB of RAM, indicating that the current 2-to-1 ratio of RAM [GB] to CPU [cores] is not the best configuration. With this improved configuration, a 12% higher event throughput is achieved compared to the standard scenario. This result is valid for the chosen reconstruction workflow and with an OC of nine processes per VM. The same graph can be created for several Cloud providers to compare their offers; the highlighted maximum values would then be compared between the different providers. This automatically compares the optimal configuration of each provider. Since the pricing models differ considerably between providers, the optimal configurations can differ as well. This way, a fairer comparison is achieved than just comparing one exact configuration between

all providers.
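Structurally, the scan behind Figure 8.10 can be thought of as a grid search over the (RAM, number of processes) plane: for each point, a throughput model predicts events per VM and a pricing model converts the flat budget into a number of affordable VMs. The sketch below only illustrates this structure; the throughput and pricing functions are crude placeholders, not the actual WIM or CloudSigma pricing, so the numbers it prints are not the thesis results.

```python
# Structural sketch of the Figure 8.10 scan: maximise events produced for a
# flat budget over the (RAM, parallel processes) plane. The two model
# functions below are placeholders, not the real WIM or provider pricing.

BUDGET_CHF = 1000.0
WALLTIME_S = 10000.0

def price_per_vm_hour(ram_gb):                   # placeholder pricing model
    return 0.20 + 0.01 * ram_gb

def events_per_vm_second(n_processes, ram_gb):   # placeholder throughput model
    if ram_gb < 1.5 * n_processes:               # crude swapping cut-off
        return 0.0
    return 0.001 * min(n_processes, 10)          # saturating OC gain

best = max(
    ((ram, procs,
      events_per_vm_second(procs, ram) * WALLTIME_S
      * BUDGET_CHF / (price_per_vm_hour(ram) * WALLTIME_S / 3600.0))
     for ram in range(8, 33, 2) for procs in range(6, 13)),
    key=lambda x: x[2])

print(f"best: {best[0]} GB RAM, {best[1]} processes, ~{best[2]:.0f} events")
```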

8.5 Conclusion

This chapter showed the possibilities of overcommitment. The focus was on OC at the job level, on top of the OC done at the infrastructure level. It has been found that OC works well as a latency hiding method. Given enough RAM, additional scenarios can benefit from OC, such as workflows that make heavy use of the disk. OC is therefore especially attractive for Cloud computing, where the infrastructure, in particular the memory per CPU core, is flexible and can be adapted to the workflow. Furthermore, the idle computing capacity during serialised periods of the workflows can be exploited. This is achieved by executing a heavier load, or by adapting the workload to the workflow profile, for example by scheduling a merge in parallel to a processing step.

What has also been demonstrated is that there is a large potential for improvement regarding the single-core merging step within the ATLAS reconstruction. For now, ATLAS has chosen a path other than OC to improve the workflows. Instead of directly using OC to eliminate the merging inefficiencies, the workflow was reworked: the merging will be avoided by having the parallel AthenaMP processes write their outputs directly to a single file. This approach is called the “SharedWriter”. At the time of writing, it has not gone into production yet. It may turn out that this new workflow would also benefit from OC, due to its I/O activity.

With some modifications, the WIM provided a good way to compare overcommitted scenarios with each other, even taking different Cloud providers into account. The model then indicated the best configuration in which to run a workflow within a given Cloud, by maximising the ETC, or the event throughput.


CHAPTER 9

Cloud viability evaluation

The WIM, which was introduced in detail in the previous chapters, can be used to evaluate whether Cloud computing is a viable alternative to Grid computing. Grid computing in this context means purchasing and operating hardware in the many computing centres, including the personnel. Cloud computing means renting computing resources from a commercial provider. In Subsection 4.2.2, the development of Cloud pricing was described, highlighting the downward trend. It is, however, difficult to predict how workflows would perform within a Cloud. Looking only at the price of a Cloud site is therefore not sufficient to evaluate whether moving to the Cloud is beneficial. For that reason the WIM was introduced, which combines the pricing information with workflow performance estimations. One of the big advantages of the WIM is the possibility to compare many different Cloud configurations with each other, as was shown in Chapter 8. As was also shown before, not all workflows have the same infrastructure requirements, e.g. the CPU core to RAM ratio. In contrast to the Grid, where the resources are static, the Cloud can be provisioned flexibly. In the end, the model provides a result that includes the best Cloud configuration for the given workflows, thus making sure that the comparison is as fair as possible. A Cloud provider that gives the best price-per-event ratio for one workflow is not necessarily the best choice for a different workflow.

In this section, not every possible combination of all the Cloud providers and all Grid centres is evaluated. The inputs change on a case-by-case basis, which can change the results and conclusions. Some of the reasons why the viability of Cloud computing should be investigated individually are given in the following. First of all, the Cloud prices change over time, which can make the given results outdated within a short time. More importantly, the Cloud prices depend heavily on the negotiation skill of each procurer, on discounts, as well as on the type of services included

and the duration of the procurement. Similar considerations can be made for the Grid sites, which are located around the globe and for which the total cost of ownership differs widely. The cost factors include, amongst others, the cost of electricity, the power and cooling efficiency of the Grid centre, the personnel cost and the negotiated prices for the hardware, which is cheaper if bought in bulk.

Grid prices

In order to understand the differences between the Grid and the Cloud, a Grid site has to be chosen that acts as a reference. The CERN computing centre is chosen as a representative example, as it is at the centre of the WLCG, hosting computing capacity for all major LHC experiments. The total cost of ownership of a server is taken from the recent paper “Cloud computing, markets and costs” by Bernd Panzer-Steindel, the CERN IT CTO. As of the time of writing, it has not yet been published. The total yearly cost of a 40-core compute server is given in that paper as 1570 € (consisting of: physical server = 900 €; rack = 10 €; power distribution unit = 20 €; network infrastructure = 30 €; floor space = 70 €; electricity = 135 €; personnel cost = 405 €). In order to account for memory differences, the estimate in the paper increases to 0.006 € per core hour (rounded up, no VAT included). For an eight-core VM this leads to 0.048 € per VM hour.
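The quoted per-core-hour figure can be checked with simple arithmetic from the listed cost components; the sketch below reproduces the raw number before the memory adjustment applied in the paper.

```python
# Back-of-the-envelope check of the CERN cost figures quoted above.

cost_items_eur_per_year = {          # 40-core compute server, yearly cost
    "physical server": 900, "rack": 10, "power distribution unit": 20,
    "network infrastructure": 30, "floor space": 70, "electricity": 135,
    "personnel": 405,
}
total = sum(cost_items_eur_per_year.values())    # 1570 EUR per year
core_hours_per_year = 40 * 365 * 24              # 350400 core hours

print(f"raw cost: {total / core_hours_per_year:.4f} EUR per core hour")
# ~0.0045 EUR/core-hour; adjusted for memory differences and rounded up,
# the paper arrives at 0.006 EUR/core-hour, i.e. 0.048 EUR per 8-core VM hour.
print(f"8-core VM hour at 0.006 EUR/core-hour: {0.006 * 8:.3f} EUR")
```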

Cloud prices

There is a plethora of choices on the side of the Cloud providers, each with their own specifications. Amazon is the market leader and provides the most attractive prices on its spot market (0.014 € per core hour). Unfortunately, no benchmark data could be collected for Amazon. In the following, the results of papers performing a cost comparison between a Grid centre and the Cloud (Amazon AWS) are summarised. ATLAS has been using and considering Cloud resources for several years. In a paper from 2010 [136], which compares the cost of a Grid centre in Germany with the Amazon cost, the Grid centre is found to be more cost effective. In a paper from 2014 [137], the conclusion is that the Amazon spot prices are competitive compared to the cost of the RHIC and ATLAS computing facility at BNL.

There are also results from CMS, which used the Amazon AWS infrastructure to run some of their workflows. The cost of AWS was compared to the cost of the resources at Vanderbilt University in a paper from 2012 [138]. The result shows that if the university resources are utilised more than 14.5% of the year, they are more cost effective. In a more recent paper from 2017 [139], the cost of a CPU core hour for CMS at Fermilab is given as 0.009 ± 25%. The computing power of a CPU in Amazon is at the level of 97% of the ones at Fermilab, according to the benchmark in [67].

The paper [67] concludes that the hardware at Fermilab is more cost effective than the AWS offer. It has to be kept in mind that the spot market is volatile and it is not guaranteed that resources are available at the stated price when needed.

For the WIM, the providers that were investigated in detail were the three providers described within the HNSciCloud context in Chapter 7. The prices were obtained from their web pages. Event generation uses the single core setup, whereas the other workflows are estimated via the 8 core VMs. Currently, the price for an IBM B1.8x16x100 instance with 8 CPU cores, 16 GB of RAM and 100 GB of storage is 0.31 € per hour¹. The single core VM setup, called B1.1x2x25, has 1 CPU core, 2 GB of RAM and 25 GB of storage space and costs 0.039 € per hour. The T-Systems instance called CO II-4 comes with the same configuration, namely 8 CPU cores, 16 GB of RAM and 100 GB of storage. The price is 0.406 € per hour². The single core VM setup is called CO II-1, with 2 GB of RAM and 25 GB of storage, and costs 0.023 € per hour. For Exoscale, a slightly different configuration has to be considered, as it is not as flexible as the other providers. The "huge" instance consists of 8 CPU cores, 32 GB of RAM and 100 GB of storage and costs 0.332 € per hour³. The only VM configuration that has 2 GB of RAM comes with 2 CPU cores. The price of 0.026 € per hour of this 2 core configuration, called "small", is used to calculate the single core workflows. It has 50 GB of storage space.
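To put the quoted instance prices on a common footing, they can be normalised to a price per core hour. The following short sketch (an illustration only, using the prices listed above; it ignores the differing RAM and storage sizes, in particular the 32 GB of the Exoscale instance) compares the multi-core offers with the CERN estimate and the Amazon spot price mentioned earlier:

# Normalise the quoted hourly instance prices to EUR per core hour.
offers = {
    "CERN (estimate)":       (0.048, 8),   # 8-core VM equivalent, see Grid prices above
    "IBM B1.8x16x100":       (0.31,  8),
    "T-Systems CO II-4":     (0.406, 8),
    "Exoscale huge (32 GB)": (0.332, 8),   # note: twice the RAM of the other 8-core offers
    "Amazon spot (quoted)":  (0.014, 1),   # already a per-core-hour figure
}

for name, (price_eur_per_hour, cores) in sorted(offers.items()):
    print("%-23s %.4f EUR per core hour" % (name, price_eur_per_hour / cores))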

9.0.1 Model results

In Table 9.1, the WIM comparison between the different Cloud providers and CERN is shown. The result is given in Events/Euro, i.e. the maximum number of events that can be processed per Euro. In this case, the 'Time' component of the ETC does not add any benefit, as the WIM was exercised as a Cloud procurement with a fixed contractual time limit. What can be seen immediately is that the Grid solution "CERN" is much more cost effective than the Cloud offers. This shows that the performance differences between the sites are outweighed by the cost: possible performance benefits are on a scale that is too small to compensate for the much lower cost of the Grid hardware at CERN. The uncertainties estimated by the model are not shown, as they are too small, of the order of a few percent. A better estimate can be gained from the comparison between the model results and the large scale measurements. In the end, even if the largest deviations are considered, they cannot possibly change the conclusion, because the pricing differences outweigh the performance variations, even if the latter are of the order of 30%.

¹ The IBM prices were taken from their web interface (02.04.18): https://www.softlayer.com/cloud-computing/bluemix/Store/orderComputingInstance
² The T-Systems prices were taken from their web interface (02.04.18): https://cloud.telekom.de/en/infrastructure/open-telekom-cloud/price-calculator/
³ The Exoscale prices were taken from their web interface (02.04.18): https://www.exoscale.com/pricing/#/compute/huge


Model result    CERN      Exoscale   IBM       T-Systems
[Events/Euro]   Grid      Cloud      Cloud     Cloud
EvGen           132881    50785      23358     37560
MC Sim          2315      782        383       255
Reco 1          18458     4058       2769      -
Reco 2          16904     4810       2942      1521
Reco 3          38958     10306      5407      4130
Reco 4          34716     9431       5540      3872
Reco 5          54708     13947      8189      5763
Reco 6          39413     10262      6113      4041
Reco 7          36591     7844       5390      2845
Digi Reco 1     6775      1915       914       746
Digi Reco 2     10246     2559       1154      646

Table 9.1: Model results according to the price schema from the provider web pages and the VM performance from the benchmarks performed during the HNSciCloud prototyping phase.
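As a rough cross-check of Table 9.1 (an illustration only; the full WIM also accounts for stage-in times, memory limits and other overheads not reflected in this simple product), the Events/Euro values can be combined with the hourly VM prices quoted above to obtain an implied event throughput per VM hour, here for the Reco 5 row:

# Implied events per VM hour for the Reco 5 row of Table 9.1,
# using the hourly 8-core VM prices quoted earlier in this chapter.
# Back-of-the-envelope reading only; the WIM includes further effects.
events_per_euro  = {"CERN": 54708, "Exoscale": 13947, "IBM": 8189, "T-Systems": 5763}
euro_per_vm_hour = {"CERN": 0.048, "Exoscale": 0.332, "IBM": 0.31, "T-Systems": 0.406}

for site in ("CERN", "Exoscale", "IBM", "T-Systems"):
    implied_rate = events_per_euro[site] * euro_per_vm_hour[site]
    print("%-10s ~%.0f events per VM hour" % (site, implied_rate))

# CERN ~2600, Exoscale ~4600, IBM ~2500, T-Systems ~2300 events per VM hour:
# the implied throughputs differ by at most about a factor of two, while CERN
# still processes roughly four times more events per Euro than the best Cloud offer.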

9.1 Conclusion

According to the prices that the Cloud providers publish, they are not yet competitive compared to purchasing and operating computing hardware within the WLCG. The WIM results are in line with the findings of the previously summarised Cloud viability studies. The difference is that the WIM findings show a larger gap between the Grid and the Cloud resource cost. This is a result of the lower total cost of ownership of the hardware at CERN compared to e.g. Fermilab, and of the higher cost of the investigated Cloud offers compared to the Amazon spot market. The fact that the cost difference between Cloud and Grid resources varies from study to study highlights that these evaluations have to be performed individually, because the costs differ for every site. The WIM is easily adaptable to the use cases of different Grid sites and Cloud offers. What can also be seen is that the Cloud prices have become more competitive over the last eight years. This development has, however, stalled.

CHAPTER 10

Summary and Outlook

During the course of this thesis, a model (WIM) was created, which predicts the performance of a given set of workflows on an arbitrary infrastructure. The goal was to understand the behaviour of data intensive workflows within the Cloud in order to assess its feasibility. The WIM includes the possibility to correlate the performance prediction with cost considerations, so a comparison between different Cloud offers and the Grid is possible.

In order to create the WIM, the general workflow behaviour had to be understood. This is important when choosing a minimal set of model input parameters that is still able to describe the workflows with high accuracy. Furthermore, this led to the discovery of optimisations to the workflows as well as to the infrastructure configuration. In the early stages, the model was already used to correctly predict that a 10 Gb/s bandwidth link between a 4000 CPU core strong Cloud site and CERN would be sufficient to support running ATLAS reconstruction workflows efficiently. This shows the versatility of the WIM, in which even typical input parameters can become the result. At a later stage, the WIM was improved to be able to vary and correlate different input parameters, which made it possible to find optimisations. The modelling and comparison of different infrastructure and workflow combinations can help to avoid many time consuming measurements.

One of the optimisation techniques that was found and investigated alongside the WIM development is overcommitting (OC). OC means running additional workload on a machine in order to fill unused CPU cycles. Through the profiling of the workflows, it became clear that raw data reconstruction could benefit from OC during its serial merging phase. The trade-off is an increased memory footprint. Intensive tests and modelling were performed in order to quantify the usefulness of OC within the Grid and the Cloud. The best result was achieved when using OC in conjunction with reconstruction workflows that retrieve their input data from a storage located far away. The improvement in this scenario ranged up to an increased event throughput of 39%.

From there on, the model was used to make predictions within the Cloud. The flexibility of the infrastructure when using Cloud computing can be better exploited through the usage of the model. One benefit is that the WIM provides the optimal site configuration along with the maximal achievable event throughput. This makes a better comparison between different Cloud offers possible, as only the best event per cost values are compared. This configuration should then also be procured and used. The WIM was intensively validated on several different infrastructures, running a broad spectrum of ATLAS workflows. It was demonstrated that the predictions are accurate within a 20% margin, but also sensitive to heterogeneous infrastructures and to differences between the benchmark and the workflows. In the big picture, the WIM was used to evaluate the viability of Cloud computing compared to the classical approach of purchasing and operating hardware. At this point in time, Cloud computing is less cost effective than operating a classical WLCG Grid centre.

Looking towards the future, it is good to see that ATLAS has already picked up on the inefficiencies that were found during this thesis. An improvement to all AthenaMP workflows, called the "shared writer", is under development. It is supposed to avoid the serial merging steps that happen during the raw data reconstruction. The shared writer is only one possibility to optimise the workflows, and it remains to be seen whether OC can improve even these workflows. It is possible that the shared writer leads to an increase in the CPU idle and I/O wait time footprint, which could again be optimised by OC. Furthermore, OC improves workflows with remote data input, which will become more relevant with increasing data sizes.

While investigating how best to set up a Cloud site, it became clear that data storage is a personpower intensive task. The native Cloud storage is object storage, which is not directly compatible with the storage solutions of the WLCG. Even within the WLCG, the storage is very maintenance intensive. One vision of the future is that the storage, which is currently distributed over many sites, will be consolidated into fewer dedicated storage sites. The remaining sites would then only operate data caches.

According to the recent price development of the Cloud market, it does not look as if Cloud computing will become more cost effective than operating Grid centres over the next few years. How the situation will have changed by the time of the HL-LHC is not realistically foreseeable, due to the inconsistent development of the Cloud prices. Considering the stagnating prices, Cloud computing will most likely not be the solution that bridges the estimated HL-LHC computing resource discrepancy. Cloud resources may, however, be used as burst capacity to bring relief during short-term peaks in demand.

Acknowledgements

I would like to warmly thank Arnulf Quadt, who through countless efforts enabled this work. Without his dedication and support, this work and my participation in the Gentner programme would not have been possible. Arnulf gave me the freedom to pursue three years of study at CERN, while at the same time accepting the additional overhead of supervising me over a distance of 800 km. I also very much appreciate the opportunities to present my work at international conferences.

Special thanks also go to Oliver Keeble, who, in addition to his responsibilities at CERN, took up the task of supervising me. He helped shape this thesis through many fruitful discussions, giving me countless insights.

Thank you, Gen Kawamura, for your patient supervision. Your contributions were always helpful and are greatly appreciated.

In addition, I would like to thank Clara Nellist for taking the time to proofread this thesis.

Thanks also go to Jörn Große-Knetter, who agreed to be part of my thesis committee.

Furthermore, I would like to thank Ramin Yahyapour, who agreed to be a referee of this thesis.

I would like to thank all of the people at the II. Physics Institute, for their support.

Last but not least, thanks go to my friends and family, on whose support I can always count.

This work has been sponsored by the Wolfgang Gentner Programme of the German Federal Ministry of Education and Research (grant no. 05E12CHA).


Bibliography

[1] C. Patrignani, et al., Review of Particle Physics, Chin. Phys. C40, 100001 (2016)
[2] S. L. Glashow, Partial Symmetries of Weak Interactions, Nucl. Phys. 22, 579 (1961)
[3] A. Salam, Weak and Electromagnetic Interactions, Conf. Proc. C680519, 367 (1968)
[4] S. Weinberg, A Model of Leptons, Phys. Rev. Lett. 19, 1264 (1967)

[5] The CDF Collaboration, Evidence for B_s^0 → φφ Decay and Measurements of Branching Ratio and A_CP for B^+ → φK^+, Phys. Rev. Lett. 95(3), 031801 (2005)
[6] The ATLAS Collaboration, The ATLAS Experiment at the CERN Large Hadron Collider, JINST 3(08), S08003 (2008)
[7] The CMS Collaboration, The CMS experiment at the CERN LHC, JINST 3(08), S08004 (2008)
[8] The ATLAS Collaboration, Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC, Phys. Lett. B 716(1), 1 (2012), arXiv:1207.7214
[9] The CMS Collaboration, Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC, Phys. Lett. B 716(1), 30 (2012)
[10] The ATLAS and CMS Collaborations, Combined Measurement of the Higgs Boson Mass in pp Collisions at √s = 7 and 8 TeV with the ATLAS and CMS Experiments, Phys. Rev. Lett. 114, 191803 (2015)
[11] H. A. Lorentz, Electromagnetic Phenomena in a System Moving with any Velocity Smaller than that of Light, in Collected Papers, pages 172–197, Springer, Dordrecht (1937)


[12] J. Tuckmantel, Synchrotron radiation damping in LHC and longitudinal bunch shape, CERN-LHC-Project-Report-819 (2005)

[13] L. Evans, P. Bryant, LHC Machine, JINST 3(08), S08001 (2008)

[14] G. Barr, et al., Particle Physics in the LHC Era, Oxford Master Series in Physics, Oxford University Press (2016)

[15] G. Apollinari, et al., High Luminosity Large Hadron Collider HL-LHC, arXiv:1705.08830 [physics] (2017)

[16] ATLAS Inner Detector Community, ATLAS inner detector : Technical Design Report, 1, CERN-LHCC-97-016 (1997)

[17] H. Messel, D. F. Crawford, Electron–photon shower distribution function tables for lead, copper, and air absorbers, Kent : Elsevier Science, 2013. - 1519 p. (1970)

[18] ATLAS LARG Unit, ATLAS liquid-argon calorimeter : Technical Design Report, CERN-LHCC-96-041 (1996)

[19] ATLAS Tile Calorimeter Collaboration, ATLAS tile calorimeter : Technical Design Report, CERN-LHCC-96-042 (1996)

[20] ATLAS Muon Collaboration, ATLAS muon spectrometer : Technical Design Report, CERN-LHCC-97-022 (1997)

[21] W. Buttinger, The ATLAS Level-1 Trigger System, JPCS 396(1), 012010 (2012)

[22] A. Martinez, The ATLAS Collaboration, The Run-2 ATLAS Trigger System, JPCS 762(1), 012003 (2016)

[23] G. E. Moore, Cramming More Components Onto Integrated Circuits, Proc. IEEE 86(1), 82 (1998)

[24] C. Walter, Kryder’s Law, Sci. Am. 293(2), 32 (2005)

[25] P. H. Enslow, What is a "Distributed" Data Processing System?, Computer 11(1), 13 (1978)

[26] I. Bird, Computing for the Large Hadron Collider, ANNU REV NUCL PART S 61(1), 99 (2011)

[27] A. Kaufmann, K. Dolan, ESG Lab White Paper: Price Comparison: Google Cloud Platform vs. Amazon Web Services (2015)

[28] D. Durkee, Why Cloud Computing Will Never Be Free, ACMQueue 8(4), 20:20 (2010)

[29] A. Sunyaev, S. Schneider, Cloud Services Certification, Commun. ACM 56(2), 33 (2013)


[30] C. Kilcioglu, J. M. Rao, Competition on Price and Quality in Cloud Computing, in Proc. 25th Int. Conf. WWW, WWW ’16, pages 1123–1132, Int. WWW Conf. Steering Comm. (2016)

[31] L. M. Vaquero, et al., A Break in the Clouds: Towards a Cloud Definition, SIGCOMM Comput. Commun. Rev. 39(1), 50 (2008)

[32] P. M. Mell, T. Grance, The NIST Definition of Cloud Computing, SP 800-145 (2011)

[33] S. K. Nair, et al., Towards Secure Cloud Bursting, Brokerage and Aggregation, in 2010 8th IEEE Europ. Conf. on Web Services, pages 189–196 (2010)

[34] B. Rimal, E. Choi, I. Lumb, A Taxonomy and Survey of Cloud Computing Systems, in 5th Int. Joint Conf. on INC, IMS and IDC, 2009. NCM ’09, pages 44–51 (2009)

[35] T. Guo, et al., Cost-Aware Cloud Bursting for Enterprise Applications, ACM Trans. Internet Technol. 13(3), 10:1 (2014)

[36] R. Ghosh, V. Naik, Biting Off Safely More Than You Can Chew: Predictive Analytics for Resource Over-Commit in IaaS Cloud, in 2012 IEEE 5th Int. Conf. on Cloud Computing (CLOUD), pages 25–32 (2012)

[37] I. M. Llorente, The Limits to Cloud Price Reduction, IEEE Cloud Comp. 4(3), 8 (2017)

[38] J. Scholl, A. Lindgren, Considering Pigeons for Carrying Delay Tolerant Networking based Internet traffic in Developing Countries, EJISDC 54(0) (2012)

[39] M. Armbrust, et al., A View of Cloud Computing, Commun. ACM 53(4), 50 (2010)

[40] B. Satzger, et al., Winds of Change: From Vendor Lock-In to the Meta Cloud, IEEE Internet Comp. 17(1), 69 (2013)

[41] M. Factor, et al., Object storage: the future building block for storage systems, in 2005 IEEE Int. Symp. on Mass Storage Systems and Technology, pages 119–123 (2005)

[42] Q. Zheng, et al., COSBench: A Benchmark Tool for Cloud Object Storage Services, in 2012 IEEE 5th Int. Conf. on Cloud Comp., pages 998–999 (2012)

[43] A. Azagury, et al., Towards an object store, in 20th IEEE/11th NASA Goddard Conf. Mass Storage Systems and Technologies, 2003. (MSST 2003). Proc., pages 165–176 (2003)

[44] K. Popovic, Z. Hocenski, Cloud computing security issues and challenges, in The 33rd Int. Convention MIPRO, pages 344–349 (2010)


[45] Z. Xiao, Y. Xiao, Security and Privacy in Cloud Computing, IEEE Comm. Surveys Tutorials 15(2), 843 (2013)
[46] C. Wang, K. Ren, J. Wang, Secure and practical outsourcing of linear programming in cloud computing, in 2011 Proc. IEEE INFOCOM, pages 820–828 (2011)
[47] T. Ristenpart, et al., Hey, You, Get off of My Cloud: Exploring Information Leakage in 3rd-party Compute Clouds, in Proc. 16th ACM Conf. Computer and Communications Security, CCS '09, pages 199–212, ACM (2009)
[48] P. Kocher, et al., Spectre Attacks: Exploiting Speculative Execution, arXiv:1801.01203 [cs] (2018)
[49] M. Lipp, et al., Meltdown, arXiv:1801.01207 [cs] (2018)
[50] M. Armbrust, et al., Above the clouds: A Berkeley view of cloud computing, UCB/EECS-2009-28 (2009)
[51] Y. Chen, V. Paxson, R. Katz, What is new in cloud security, UCB/EECS-2010-5 (2010)
[52] M. Chetty, R. Buyya, Weaving computational grids: how analogous are they with electrical grids?, Comput. Sci. Eng. 4(4), 61 (2002)
[53] M. Baker, R. Buyya, D. Laforenza, Grids and Grid technologies for wide-area distributed computing, Software: Practice and Experience 32(15), 1437 (2002)
[54] I. Foster, C. Kesselman, S. Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations, IJHPCA 15(3), 200 (2001)
[55] I. Foster, C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publisher (1999)
[56] M. Livny, et al., Mechanisms for High Throughput Computing, Speedup 11-1 (1997)
[57] J. Shiers, The Worldwide LHC Computing Grid (worldwide LCG), Comput. Phys. Commun. 177(12), 219 (2007)
[58] D. Espriu, Report from the Computing Scrutiny Group, CERN-RRB-2010-052 (2010)
[59] M. Aderholz, et al., Models of networked analysis at regional centres for LHC experiments (MONARC) phase 2 report, CERN-LCB-2000-001 (2000)
[60] D. Britton, S. L. Lloyd, How to deal with petabytes of data: the LHC Grid project, Rep. Prog. Phys. 77(6), 065902 (2014)
[61] E. Martelli, S. Stancu, LHCOPN and LHCONE: Status and Future Evolution, JPCS 664(5), 052025 (2015)


[62] X. Espinal, et al., CERN data services for LHC computing, JPCS 898(6), 062028 (2017)

[63] P. Andrade, et al., Review of CERN Data Centre Infrastructure, JPCS 396(4), 042002 (2012)

[64] D. Cameron, The ATLAS Collaboration, Exploiting opportunistic resources for ATLAS with ARC CE and the Event Service, JPCS 898(5), 052010 (2017)

[65] C. Adam-Bourdarios, The ATLAS Collaboration, ATLAS@Home: Harnessing Volunteer Computing for HEP, JPCS 664(2), 022009 (2015)

[66] C. Adam-Bourdarios, The ATLAS Collaboration, Volunteer Computing Experience with ATLAS@Home, JPCS 898(5), 052009 (2017)

[67] L. Bauerdick, The CMS Collaboration, Experience in using commercial clouds in CMS, JPCS 898(5), 052019 (2017)

[68] R. P. Taylor, The ATLAS Collaboration, Consolidation of cloud computing in ATLAS, JPCS 898(5), 052008 (2017)

[69] G. Cancio, et al., Experiences and challenges running CERN’s high capacity tape archive, JPCS 664(4), 042006 (2015)

[70] A. J. Peters, E. A. Sindrilaru, G. Adde, EOS as the present and future solution for data storage at CERN, JPCS 664(4), 042042 (2015)

[71] A. J. Peters, L. Janyst, Exabyte Scale Storage at CERN, JPCS 331(5), 052015 (2011)

[72] A. Alvarez, et al., DPM: Future Proof Storage, JPCS 396(3), 032015 (2012)

[73] P. Fuhrmann, dCache: the commodity cache, in In 12th NASA Goddard and 21st IEEE Conf. on Mass Storage Sys. and Tech. (2004)

[74] A. Dorigo, et al., XROOTD/TXNetFile: A Highly Scalable Architecture for Data Access in the ROOT Environment, in Proc. 4th WSEAS Int. Conf. Telecom. and Inf., E-INFO’05, pages 46:1–46:6, World Scientific and Engineering Academy and Society (WSEAS) (2005)

[75] P. Calafiura, et al., The Athena control framework in production, new developments and lessons learned, in Comp. HEP and nuc. phys. Proc., Conf., CHEP'04, Interlaken, Switzerland, September 27 - October 1, 2004, pages 456–458 (2005)

[76] G. Duckeck, et al., ATLAS computing: Technical design report, CERN-LHCC- 2005-022 (2005)

[77] T. Maeno, PanDA: distributed production and distributed analysis system for AT- LAS, JPCS 119(6), 062036 (2008)


[78] T. Maeno, The ATLAS Collaboration, Overview of ATLAS PanDA Workload Management, JPCS 331(7), 072024 (2011)

[79] P. Nilsson, Experience from a pilot based system for ATLAS, JPCS 119(6), 062038 (2008)

[80] I. Sfiligoi, et al., The Pilot Way to Grid Resources Using glideinWMS, in 2009 WRI World Cong. on Comp. Sci. and Inf. Eng., volume 2, pages 428–432 (2009)

[81] R. Brun, F. Carminati, G. Galli-Carminati, From the Web to the Grid and Beyond: Computing Paradigms Driven by High-Energy Physics, Springer Science & Business Media (2012)

[82] V. Garonne, et al., The ATLAS Distributed Data Management project: Past and Future, JPCS 396(3), 032045 (2012)

[83] V. Garonne, The ATLAS Collaboration, Rucio The next generation of large scale distributed system for ATLAS Data Management, JPCS 513(4), 042021 (2014)

[84] K. De, The ATLAS Collaboration, Task Management in the New ATLAS Production System, JPCS 513(3), 032078 (2014)

[85] J. Blomer, et al., Distributing LHC application software and conditions databases using the CernVM file system, JPCS 331(4), 042003 (2011)

[86] J. Blomer, et al., Status and future perspectives of CernVM-FS, JPCS 396(5), 052013 (2012)

[87] T. Doherty, et al., TAG Based Skimming In ATLAS, JPCS 396(5), 052028 (2012)

[88] J. Fulachier, The ATLAS Collaboration, ATLAS Metadata Interface (AMI), a generic metadata framework, JPCS 898(6), 062001 (2017)

[89] S. Albrand, et al., The ATLAS metadata interface, JPCS 119(7), 072003 (2008)

[90] A. Lenk, et al., What Are You Paying For? Performance Benchmarking for Infrastructure-as-a-Service Offerings, in 2011 IEEE Int. Conf. on Cloud Comp. (CLOUD), pages 484–491 (2011)

[91] C. Nieke, W. T. Balke, Monitoring Performance in Large Scale Computing Clouds with Passive Benchmarking, in 2017 IEEE 10th Int. Conf. on Cloud Comp. (CLOUD), pages 188–195 (2017)

[92] M. Michelotto, et al., A comparison of HEP code with SPEC 1 benchmarks on multi-core worker nodes, JPCS 219(5), 052009 (2010)

[93] S. Blagodurov, A. Fedorova, In Search for Contention-descriptive Metrics in HPC Cluster Environment, in Proc. 2Nd ACM/SPEC Int. Conf. Perf. Eng., ICPE ’11, pages 457–462, ACM (2011)


[94] D. Duellmann, C. Nieke, 1st results from a combined analysis of CERN computing infrastructure metrics, JPCS 898(7), 072035 (2017)

[95] E. Karavakis, The ATLAS Collaboration, Common Accounting System for Monitoring the ATLAS Distributed Computing Resources, JPCS 513(6), 062024 (2014)

[96] I. Bird, et al., Update of the Computing Models of the WLCG and the LHC Experiments, CERN-LHCC-2014-014 (2014)

[97] G. Behrmann, et al., A distributed storage system with dCache, JPCS 119(6), 062014 (2008)

[98] A. Agarwal, et al., dCache with tape storage for High Energy Physics applications, JPCS 219(7), 072024 (2010)

[99] M. Gorman, Understanding the Linux Virtual Memory Manager, Prentice Hall (2004)

[100] D. P. Bovet, M. Cesati, Understanding the Linux Kernel: From I/O Ports to Process Management, O'Reilly Media, Inc. (2005)

[101] P. Werstein, X. Jia, Z. Huang, A Remote Memory Swapping System for Cluster Computers, in 8th Int. Conf. on Parallel and Dist. Comp., App. and Tech. (PDCAT 2007), pages 75–81 (2007)

[102] C. Nieke, et al., Analysis of CERN computing infrastructure and monitoring data, JPCS 664(5), 052029 (2015)

[103] H. Ishii, Fujitsu's Activities for Improving Linux as Primary OS for PRIMEQUEST, Fujitsu Sci. Tech. J. 47(2), 239 (2011)

[104] K. Salimifard, M. Wright, Petri net-based modelling of workflow systems: An overview, EJOR 134(3), 664 (2001)

[105] J. Yu, R. Buyya, A Taxonomy of Workflow Management Systems for Grid Computing, J. Grid Comput. 3(3-4), 171 (2006)

[106] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. Berman, P. Maechling, Scientific workflow applications on Amazon EC2, in 2009 5th IEEE Int. Conf. on E-Science Workshops, pages 59–66 (2009)

[107] M. R. Ebling, L. B. Mummert, D. C. Steere, Overcoming the network bottleneck in mobile computing, in Workshop on Mobile Comp. Sys. App., pages 34–36 (1994)

[108] M. Zimmermann, The ALICE Collaboration, The ALICE analysis train system, JPCS 608(1), 012019 (2015)

[109] M. Böhler, et al., Evolution of ATLAS conditions data and its management for LHC Run-2, JPCS 664(4), 042005 (2015)


[110] The ATLAS Collaboration, The ATLAS Simulation Infrastructure, EPJ C 70(3), 823 (2010)

[111] S. Agostinelli, et al., Geant4 - a simulation toolkit, Nucl. Instrum. Methods Phys. Res. A 506(3), 250 (2003)

[112] J. Kook, et al., Optimization of out of Memory Killer for Embedded Linux Environments, in Proc. 2011 ACM Symp. Applied Computing, SAC '11, pages 633–634, ACM (2011)

[113] G. Galster, J. Stelzer, W. Wiedenmann, A new Scheme for ATLAS Trigger Simulation using Legacy Code, JPCS 513(2), 022039 (2014)

[114] D. Adams, et al., The ATLAS Computing Model, CERN-LHCC-2004-037/G-085 (2004)

[115] J. Catmore, et al., A new petabyte-scale data derivation framework for ATLAS, JPCS 664(7), 072007 (2015)

[116] M. Mambelli, et al., Job optimization in ATLAS TAG-based distributed analysis, JPCS 219(7), 072042 (2010)

[117] G. F. Rzehorz, The ATLAS Collaboration, Data intensive ATLAS workflows in the Cloud, JPCS 898(6), 062008 (2017)

[118] T. Binz, et al., TOSCA: Portable Automated Deployment and Management of Cloud Applications, in Advanced Web Services, pages 527–549, Springer New York (2014)

[119] J. Mirkovic, et al., DADL: Distributed Application Description Language, ISI-TR- 664 (2010)

[120] C. Stewart, K. Shen, Performance Modeling and System Management for Multicomponent Online Services, in Proc. 2nd Conf. Symp. Networked Systems Design & Implementation - Volume 2, NSDI'05, pages 71–84, USENIX Association (2005)

[121] A. Snavely, N. Wolter, L. Carrington, Modeling application performance by convolving machine signatures with application profiles, in Workload Characterization, Annu. IEEE Int. Workshop, pages 149–156, IEEE Comp. Soc. (2001)

[122] I. Chihaia, T. Gross, Effectiveness of simple memory models for performance prediction, in IEEE Int. Symp. on - ISPASS Performance Analysis of Systems and Software, 2004, pages 98–105 (2004)

[123] S. Kundu, et al., Application performance modeling in a virtualized environment, in HPCA - 16 2010 16th Int. Symp. High-Performance Computer Architecture, pages 1–10 (2010)


[124] J. O'Reilly, AweSim: Introduction to AweSim, in Proc. 34th Conf. Winter Simulation: Exploring New Frontiers, WSC '02, pages 221–224, Winter Simulation Conf. (2002)

[125] A. T. Thor, et al., ViGs: A grid simulation and monitoring tool for ATLAS workflows, in 2008 Workshop on Many-Task Computing on Grids and Supercomputers, pages 1–9 (2008)

[126] D. Breitgand, et al., SLA-aware Resource Over-commit in an IaaS Cloud, in Proc. 8th Int. Conf. Network and Service Management, CNSM ’12, pages 73–81, Int. Federation for Information Processing (2013)

[127] M. Dabbagh, et al., Toward energy-efficient cloud computing: Prediction, consolidation, and overcommitment, IEEE Network 29(2), 56 (2015)

[128] C. Reiss, et al., Towards Understanding Heterogeneous Clouds at Scale: Google Trace Analysis, ISTC-CC-TR-12-101 (2012)

[129] H. Liu, A Measurement Study of Server Utilization in Public Clouds, in 2011 IEEE 9th Int. Conf. on Dependable, Autonomic and Secure Computing, pages 435–442 (2011)

[130] H. L. Lee, V. Padmanabhan, S. Whang, The bullwhip effect in supply chains (1997)

[131] M. Lindner, et al., Towards automated business-driven indication and mitigation of VM sprawl in Cloud supply chains, in 12th IFIP/IEEE Int. Symp. on Integrated Network Management (IM 2011) and Workshops, pages 1062–1065 (2011)

[132] T. Leng, et al., An Empirical Study of Hyper-Threading in High Performance Computing Clusters, Linux HPC Revolution (2002)

[133] L. Gilbert, et al., Performance Implications of Virtualization and Hyper-Threading on High Energy Physics Applications in a Grid Environment, in 19th IEEE Int. Parallel and Dist. Proc. Symp., pages 32a–32a (2005)

[134] J. Elmsheuser, et al., Improving ATLAS grid site reliability with functional tests using HammerCloud, JPCS 396(3), 032066 (2012)

[135] D. C. van der Ster, et al., HammerCloud: A Stress Testing System for Distributed Analysis, JPCS 331(7), 072036 (2011)

[136] J.-P. Gehrcke, S. Kluth, S. Stonjek, ATLAS@AWS, JPCS 219(5), 052020 (2010)

[137] M. Ernst, et al., Operating Dedicated Data Centers - Is It Cost-Effective?, JPCS 513(6), 062053 (2014)

[138] A. Melo, P. Sheldon, Integrating Amazon EC2 with the CMS production framework, JPCS 368(1), 012007 (2012)


[139] S. Fuess, et al., The HEPCloud Facility: elastic computing for High Energy Physics - The NOvA Use Case, JPCS 898(5), 052014 (2017)

Appendices



A.1 Job specifications

A.1.1 Workflows: Event generation

Fluctuations

Figure A1: The wall time average over an incrementing number of same event generation jobs (starting at one), at Göttingen. The black error bars show the standard error. Note: in order to improve readability, the y-axis does not start at zero. The input data has been rearranged.

Figure A2: The wall time average over an incrementing number of same event generation jobs (starting at one), at Göttingen. The black error bars show the standard error. Note: in order to improve readability, the y-axis does not start at zero. The input data has been rearranged.


Figure A3: The wall time average over an incrementing number of same event generation jobs (starting at one), at Göttingen. The black error bars show the standard error. Note: in order to improve readability, the y-axis does not start at zero. The input data has been rearranged.

Figure A4: The wall time average over an incrementing number of similar event generation jobs, starting at one, at the VM at CERN. The black error bars show the standard error. Note: in order to improve readability, the y-axis does not start at zero.


Figure A5: The wall time average over an incrementing number of similar Monte-Carlo simulation jobs, starting at one, at the VM at CERN. The black error bars show the standard error. Note: in order to improve readability, the y-axis does not start at zero.

Figure A6: The wall time average over an incrementing number of similar raw data reconstruction jobs, starting at one, at the VM at CERN. The black error bars show the standard error. Note: in order to improve readability, the y-axis does not start at zero.

A.1.2 Workflows: Raw data reconstruction

Memory limitations

The job that was executed multiple times with restricted memory was invoked with the following command:

'/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc49-opt/20.7.6/AtlasProduction/20.7.6.7/InstallArea/share/bin/Reco_tf.py'
'--inputBSFile=data16_13TeV.00303832.physics_Main.daq.RAW._lb0059._SFO-6._0001.data'
'--athenaopts=--pmon=sdmon'
'--athenaMPStrategy' 'SharedQueue'
'--maxEvents' '-1'
'--postExec' 'e2d:from AthenaCommon.AppMgr import ServiceMgr;import MuonRPC_Cabling.MuonRPC_CablingConfig; ServiceMgr.MuonRPC_CablingSvc.RPCMapfromCool=False; ServiceMgr.MuonRPC_CablingSvc.CorrFileName="LVL1confAtlasRUN2_ver016.corr"; ServiceMgr.MuonRPC_CablingSvc.ConfFileName="LVL1confAtlasRUN2_ver016.data";'
'--preExec' 'all:jobproperties.Beam.bunchSpacing.set_Value_and_Lock(25);' 'r2e:from InDetPrepRawDataToxAOD.SCTxAODJobProperties import SCTxAODFlags; SCTxAODFlags.Prescale.set_Value_and_Lock(50);'
'--autoConfiguration' 'everything'
'--conditionsTag' 'all:CONDBR2-BLKPA-2016-15'
'--geometryVersion' 'all:ATLAS-R2-2015-04-00-00'
'--runNumber' '303832'
'--AMITag' 'f718'
'--outputDAOD_IDTIDEFile=DAOD_IDTIDE.09066867._002142.pool.root.1'
'--outputDESDM_CALJETFile=DESDM_CALJET.09066867._002142.pool.root.1'
'--outputDESDM_EGAMMAFile=DESDM_EGAMMA.09066867._002142.pool.root.1'
'--outputDESDM_EXOTHIPFile=DESDM_EXOTHIP.09066867._002142.pool.root.1'
'--outputDESDM_MCPFile=DESDM_MCP.09066867._002142.pool.root.1'
'--outputDESDM_PHOJETFile=DESDM_PHOJET.09066867._002142.pool.root.1'
'--outputDESDM_SGLELFile=DESDM_SGLEL.09066867._002142.pool.root.1'
'--outputDESDM_SLTTMUFile=DESDM_SLTTMU.09066867._002142.pool.root.1'
'--outputDESDM_TILEMUFile=DESDM_TILEMU.09066867._002142.pool.root.1'
'--outputDRAW_EGZFile=DRAW_EGZ.09066867._002142.pool.root.1'
'--outputDRAW_RPVLLFile=DRAW_RPVLL.09066867._002142.pool.root.1'
'--outputDRAW_TAUMUHFile=DRAW_TAUMUH.09066867._002142.pool.root.1'
'--outputDRAW_ZMUMUFile=DRAW_ZMUMU.09066867._002142.pool.root.1'
'--outputAODFile=AOD.09066867._002142.pool.root.1'
'--outputHISTFile=HIST.09066867._002142.pool.root.1'
'--jobNumber' '1603'
'--ignoreErrors' 'false'

Swap statistic figures

Figure A7: This plot depicts the page swap statistics for the job that had 5 GB of memory available and a wall time close to the scenario without memory limitations.

A.2 Additional profiles

Figure A8: This plot depicts the page swap statistics for the job that had 3.75 GB of memory available and an increased wall time.

Figure A9: Normal reconstruction workflow profile (for comparison). 8 parallel AthenaMP processes on an 8 core VM.


Figure A10: Reconstruction profile, showing the execution of 8 parallel AthenaMP processes on the same 8 core VM. The input data was read through the network from a remote storage at CERN.

Figure A11: Overcommitted profile, showing the execution of two times 8 (16) parallel AthenaMP processes on the same 8 core VM. In addition the input data was not on the local disk. It was read through the network from a remote storage at CERN.

Figure A12: Job execution on a multi-core machine in which not enough RAM is available. The high disk activity stems from constant swapping between the disk and the memory, leading to a low CPU utilisation.


A.3 Model implementation

The model is computed by executing "workflow_infrastructure_model.py". There are three separate JSON input files that have to be adjusted to the use case: the first file (input_infrastructure.json) contains the infrastructure input parameters, the second file (input_workflow.json) the workflow input parameters and the third file (input_plot_parameters.json) the plot parameters.

input_infrastructure.json

The infrastructure input parameters are:

{
    "Nr_Cores": 4.0,
    "cpu_power": 10.0,
    "io_power": 10.0,
    "idle_factor": 10.0,
    "bandwidth": 104200000,
    "RAM_machine": 8.0,
    "Storage_machine": 100.0,
    "ram_to_cpu_factor": 0.358974359,
    "bandwidth_workflow_in": 104200000,
    "bandwidth_workflow_out": 104200000,
    "disk_throughput_read": 0,
    "disk_throughput_write": 0,
    "cost_1machine_1sec": 0.0,
    "budget": 0.0,
    "infrastructure_duration": 0,
    "nr_machine_override": 1
}

input_workflow.json

The workflow input parameters are:

{
    "RAM_per_thread": 1.99,
    "Nr_Evts": 2478,
    "outbound_traffic_month": 0.0,
    "transformations": {
        "Reco_General": {
            "CPU_Idle_Time": 3.912333333,
            "CPU_Idle_Time_stdev": 1000,
            "Merge_Time": 0,
            "Validation_Time": 0,
            "Cleanup_Time": 0,
            "CPU_Time": 153.2647296,
            "CPU_Time_stdev": 1000,
            "IO_Time": 10.0,
            "IO_Time_stdev": 1000,
            "Size_Evt_In": 0,
            "Size_Evt_Out": 0,
            "is_single_core": 0,
            "position": 1,
            "OC_efficiency_gain": 0
        }
    }
}

input_plot_parameters.json

The plot input parameters are:

{
    "ram_lower_limit": 8,
    "ram_upper_limit": 9,
    "ram_nr_points": 2,
    "thread_min": 4,
    "thread_max": 5,
    "thread_stepsize": 1,
    "Z_variable": 3
}

They determine the x- and y-axis ranges (x-axis: from ram_lower_limit to ram_upper_limit; y-axis: from thread_min to thread_max) as well as their granularity (x-axis: ram_nr_points, the number of points in the interval; y-axis: thread_stepsize, the step size within the interval). The Z_variable value corresponds to an output metric; the chosen metric will be plotted. It can be, for example, 0: ETC; 1: EventCost; 2: EventTime; 3: Time or 4: Bandwidth (the full list of metrics is defined in workflow_infrastructure_model.py). The input variables need to keep their type (either int or float, also within lists).

A.3.1 Model usage

After obtaining the input values and introducing them into the model by modifying the input_infrastructure.json and the input_workflow.json files, the model only has to be adjusted to provide the desired output metric, which is achieved by modifying the input_plot_parameters.json file. All three of these files can be found in the dedicated "input" folder. Afterwards, the model is run by executing the ./workflow_infrastructure_model.py script.
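The metric selection and the run can also be scripted. The following minimal driver sketch (not part of the WIM repository; it only assumes the file layout and Z_variable indices described above, and that the model script is executed with Python 2, as indicated by its print statements) rewrites input/input_plot_parameters.json and then invokes the model:

#!/usr/bin/env python
# Minimal driver sketch: select an output metric and run the WIM.
import json
import subprocess

PLOT_PARAMS = "input/input_plot_parameters.json"

# Load the existing plot parameters, switch the output metric and write the file back;
# the other keys (axis ranges, step sizes) are kept unchanged.
with open(PLOT_PARAMS) as f:
    params = json.load(f)
params["Z_variable"] = 0  # 0: ETC, 1: EventCost, 2: EventTime, 3: Time, 4: Bandwidth, ...
with open(PLOT_PARAMS, "w") as f:
    json.dump(params, f, indent=4)

# Run the model itself (equivalently: ./workflow_infrastructure_model.py).
subprocess.check_call(["python2", "workflow_infrastructure_model.py"])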


A.3.2 Model code

The WIM is organised in the following way: the base directory contains

Workflow_Infrastructure_Model/input
Workflow_Infrastructure_Model/modules
Workflow_Infrastructure_Model/unittest
Workflow_Infrastructure_Model/.git
Workflow_Infrastructure_Model/README.md
Workflow_Infrastructure_Model/unittest.sh
Workflow_Infrastructure_Model/workflow_infrastructure_model.py
Workflow_Infrastructure_Model/.gitignore

workflow_infrastructure_model.py

"workflow_infrastructure_model.py" is the core executable; its contents are:

#!/usr/bin/python import numpy as np import json from modules.plot import PLOT from modules.combined_time_calc import combined_times from modules.result_calc import result_calc if __name__ == "__main__":

#get input with open('input/input_infrastructure.json') as infra_input: infrastructure_input= json.load(infra_input)

with open('input/input_workflow.json') as workfl_input: workflow_input= json.load(workfl_input)

#read inputs transformation_types=[] for value in workflow_input['transformations']: transformation_types.append(value) input_variables={"ram_to_cpu_factor": ,→ infrastructure_input['ram_to_cpu_factor'], "cpu_power": ,→ infrastructure_input['cpu_power'],

169 "RAM_machine": ,→ infrastructure_input['RAM_machine'], ,→ "Storage_machine": ,→ infrastructure_input['Storage_machine'], "outbound_traffic_month": ,→ workflow_input['outbound_traffic_month'], ,→ "RAM_per_thread": ,→ workflow_input['RAM_per_thread'], "Nr_Cores": infrastructure_input['Nr_Cores'], ,→ "cost_1machine_1sec": ,→ infrastructure_input['cost_1machine_1sec'], "budget": infrastructure_input['budget'], ,→ "io_power": infrastructure_input['io_power'] ,→ , "idle_factor": ,→ infrastructure_input['idle_factor']} input_variables_int={ "Nr_Evts": workflow_input['Nr_Evts'], ,→ "bandwidth_workflow_in": ,→ infrastructure_input['bandwidth_workflow_in'], "bandwidth": ,→ infrastructure_input['bandwidth'],

,→ "bandwidth_workflow_out": infrastructur c ,→ e_input['bandwidth_workflow_out'],

"disk_throughput_read": infrastructure_inpu c ,→ t['disk_throughput_read'],

"disk_throughput_write": infrastructure_inp c ,→ ut['disk_throughput_write'],

"infrastructure_duration": infrastructure_i c ,→ nput['infrastructure_duration'],

"nr_machine_override": infrastructure_input c ,→ ['nr_machine_override']}

#variable to remember if there should be multiple plots - default ,→ one plot multiplot=["Nr_Cores", infrastructure_input['Nr_Cores'], ,→ 'cpu_power', infrastructure_input['cpu_power']]

for input_var, value in input_variables.iteritems(): #check if input makes sense: floats if not isinstance(value, (float, unicode)): #check for lists to create multiple plots if isinstance(value, (list, unicode)):


if multiplot ==["Nr_Cores", ,→ infrastructure_input['Nr_Cores'], 'cpu_power', ,→ infrastructure_input['cpu_power']]: multiplot=[] multiplot.append(input_var) multiplot.append(value) else: raise ValueError('Make sure all float-inputs are ,→ actually floats', input_variables) for input_var2, value2 in input_variables_int.iteritems(): #check if input makes sense: integers or long if not isinstance(value2, (int, unicode)) and not ,→ isinstance(value2, (long, unicode)): #check for lists to create multiple plots if isinstance(value2, (list, unicode)): if multiplot ==["Nr_Cores", ,→ infrastructure_input['Nr_Cores'], 'cpu_power', ,→ infrastructure_input['cpu_power']]: multiplot=[] multiplot.append(input_var2) multiplot.append(value2) else: raise ValueError('Make sure all int-inputs are actually ,→ ints (or longs)', input_variables_int)

#combine inputs into input_variables input_variables.update(input_variables_int)

#catch case where there is no input list (only single entries) iterator= (multiplot[1],) if not isinstance(multiplot[1], (tuple, ,→ list)) else multiplot[1] if len(multiplot) ==2: multiplot.append('cpu_power') multiplot.append(infrastructure_input['cpu_power']) iterator2= (multiplot[3],) if not isinstance(multiplot[3], (tuple, ,→ list)) else multiplot[3] #loop over input list for variable_vary in iterator: #loop over second input list for variable_vary2 in iterator2:

input_variables[multiplot[0]]= variable_vary

171 input_variables[multiplot[2]]= variable_vary2 #parse results to plot #all z values to plot FINAL_RESULT=[] #maximum z point x/y/z coordinates max_result_ram= 0.0 max_result_threads= 0.0 max_result_z= 0.0 #standard (8 threads 16 GB RAM) point x/y/z coordinates standard_z= 0.0 standard_RAM= 0.0 standard_threads= 0.0

#plot ranges, stepsize with open('input/input_plot_parameters.json') as plot_data: data= json.load(plot_data) ram_lower_limit= data['ram_lower_limit'] ram_upper_limit= data['ram_upper_limit'] ram_nr_points= data['ram_nr_points'] #create multiple plots with varying nr threads: thread_min= data['thread_min'] thread_max= data['thread_max'] thread_stepsize= data['thread_stepsize'] #what to plot: [name, unit, function] Z_variable_index= data['Z_variable']

#loop over y-axis (threads) for counter, nr_threads in enumerate(np.arange(thread_min, ,→ thread_max, thread_stepsize)):

result11= np.linspace(ram_lower_limit,ram_upper_limit, c ,→ ram_nr_points) #loop over x-axis (ram) for counter1, amount_RAM in ,→ enumerate(np.linspace(ram_lower_limit, ,→ ram_upper_limit, ram_nr_points)): processing_time_l=[] downloadtime_l=0 inputsize_l=0 CPU_Time_l=0 #loop over transformations within a workflow for transformation in transformation_types: #calculate durations

combined_time= combined_times(nr_threads, ,→ transformation, amount_RAM,

processing_time_t= workflow_in c

,→ put["transformations"][tran c ,→ sformation]["CPU_Time"],

Merge_Time_t= workflow_input[" c

,→ transformations"][transform c ,→ ation]["Merge_Time"],

cpu_idle_t= workflow_input["tr c

,→ ansformations"][transformat c ,→ ion]["CPU_Idle_Time"],

validation_t= workflow_input[" c

,→ transformations"][transform c ,→ ation]["Validation_Time"],

cleanup_t= workflow_input["tra c

,→ nsformations"][transformati c ,→ on]["Cleanup_Time"],

Size_Evt_Out= workflow_input[" c

,→ transformations"][transform c ,→ ation]["Size_Evt_Out"],

Size_Evt_In= workflow_input["t c

,→ ransformations"][transforma c ,→ tion]["Size_Evt_In"],

singlecore_t= workflow_input[" c

,→ transformations"][transform c ,→ ation]["is_single_core"],

Nr_Cores= input_variables['Nr_ c ,→ Cores'], cpu_power=

,→ input_variables['cpu_power' c ,→ ],

RAM_per_thread= input_variable c ,→ s['RAM_per_thread'], ,→ Nr_Evts= ,→ input_variables['Nr_Evts'], bandwidth_workflow_out=

,→ input_variables['bandwidth_ c ,→ workflow_out'], bandwidth_workflow_in=

,→ input_variables['bandwidth_ c ,→ workflow_in'], disk_throughput_read=

,→ input_variables['disk_throu c ,→ ghput_read'],

173 disk_throughput_write=

,→ input_variables['disk_throu c ,→ ghput_write'],

position_t= workflow_input["tr c

,→ ansformations"][transformat c ,→ ion]["position"], OC_efficiency_gain_t=

,→ workflow_input["transformat c

,→ ions"][transformation]['OC_ c ,→ efficiency_gain'], io_power= ,→ input_variables['io_power'],

idle_factor= input_variables[' c ,→ idle_factor'],

IO_time_t= workflow_input["tra c

,→ nsformations"][transformati c ,→ on]["IO_Time"])

processing_time_l.append(combined_time.total_pr c ,→ ocessing_time(Nr_Cores= ,→ input_variables['Nr_Cores'],

Nr_Evts= input_variab c ,→ les['Nr_Evts'])) downloadtime_l= downloadtime_l+ ,→ combined_time.stagein inputsize_l= inputsize_l+ ,→ combined_time.inputsize CPU_Time_l= CPU_Time_l+ ,→ combined_time.CPU_Time_total total_processing_time= sum(processing_time_l) #final result

result= result_calc(total_processing_time=total_pr c ,→ ocessing_time, nr_threads=nr_threads, ,→ transformation=transformation, RAM_j=amount_RAM, ,→ Nr_Cores=input_variables['Nr_Cores'],

RAM_machine=input_variables['RAM_machine'] c ,→ ,

,→ ram_to_cpu_factor=input_variables['ram c ,→ _to_cpu_factor'],

input_size= workflow_input["transformatio c ,→ ns"][transformation]["Size_Evt_In"],


budget=input_variables['budget'],

,→ cost_1machine_1sec=input_variables['co c ,→ st_1machine_1sec'],

infrastructure_duration=input_variables['i c ,→ nfrastructure_duration'],

,→ nr_machine_override=input_variables['n c ,→ r_machine_override'], CPU_Idle_Time_stdev=

,→ workflow_input["transformations"][tran c ,→ sformation]["CPU_Idle_Time_stdev"], CPU_Time_stdev=

,→ workflow_input["transformations"][tran c ,→ sformation]["CPU_Time_stdev"], IO_Time_stdev=

,→ workflow_input["transformations"][tran c ,→ sformation]["IO_Time_stdev"], Nr_Evts= input_variables['Nr_Evts']) #list of possible results chosen by index Z_variable=[["ETC", "Events/second/CHF", ,→ result.ETC_calc, result.ETC_calc_unc], ["EventCost", "Event/CHF", ,→ result.EventCost_calc, ,→ result.EventCost_calc_unc], ["EventTime", "Events/second", ,→ result.EventTime_calc, ,→ result.EventTime_calc_unc], ["Time", "seconds", result.Time_calc, ,→ result.Time_all_calc_unc], ["Bandwidth", "Gb/second", ,→ result.Bandwidth_calc, ,→ result.Bandwidth_calc_unc], ["CostEvent", "CHF/Event", ,→ result.CostEvent_calc, ,→ result.CostEvent_calc_unc], ["Events", "Events", ,→ result.Events_calc, ,→ result.Events_calc_unc], ["Downloadspeed", "Bytes/second",

,→ result.Input_bytes_second_calc, re c ,→ sult.Input_bytes_second_calc_unc], ["InputSize", "Bytes", ,→ result.Input_bytes_calc, ,→ result.Input_bytes_calc_unc],

175 ["DownloadTime", "seconds", ,→ result.DownloadTime_calc, ,→ result.DownloadTime_calc_unc], ["CPUTime", "seconds", ,→ result.CPUTime_calc, ,→ result.CPUTime_calc_unc]]

if (Z_variable_index ==7) or (Z_variable_index == ,→ 8) or (Z_variable_index ==9):

chosen_result= Z_variable[Z_variable_index][2] c ,→ (downloadtime_l, ,→ inputsize_l) Total_Uncertainty= ,→ Z_variable[Z_variable_index][3]()

elif (Z_variable_index == 10): chosen_result= ,→ Z_variable[Z_variable_index][2](CPU_Time_l, ,→ Nr_Evts= input_variables['Nr_Evts']) Total_Uncertainty= ,→ Z_variable[Z_variable_index][3]()

else: #duration of one full workflow (sum of all ,→ transformations)

chosen_result= Z_variable[Z_variable_index][2] c ,→ (total_processing_time) Total_Uncertainty= ,→ Z_variable[Z_variable_index][3]()

if result.nr_machines !=0: #make sure input bandwidthes agree with each ,→ other: factor of 8 to account for downloads ,→ not in parallel if (input_variables['bandwidth']/ ,→ result.nr_machines*8)< ,→ (input_variables['bandwidth_workflow_in']): raise ValueError('Input disagreement: ,→ bandwidth not possible with ,→ bandwidth_workflow_in')


if (input_variables['bandwidth']/ ,→ result.nr_machines*8)< ,→ (input_variables['bandwidth_workflow_out']): raise ValueError('Input disagreement: ,→ bandwidth not possible with ,→ bandwidth_workflow_out') else: continue #find maximum if chosen_result> max_result_z: max_result_z= chosen_result max_result_ram= amount_RAM max_result_threads= nr_threads max_Total_Uncertainty= Total_Uncertainty #find standard point if abs(nr_threads-input_variables['Nr_Cores']) <=

,→ abs(standard_threads-input_variables['Nr_Cores' c ,→ ]):

if abs(amount_RAM-input_variables['Nr_Cores']*2 c ,→ ) <=

,→ abs(standard_RAM-input_variables['Nr_Cores' c ,→ ]*2): standard_z= chosen_result standard_threads= nr_threads standard_RAM= amount_RAM standard_Total_Uncertainty= ,→ Total_Uncertainty #put result in plottable format result11[counter1]= chosen_result FINAL_RESULT.append(result11) #create + save plot PLOT(ram_lower_limit, ram_upper_limit, ram_nr_points, ,→ thread_min, thread_max, thread_stepsize, ,→ max_result_ram, max_result_threads, max_result_z, standard_RAM, standard_threads, ,→ standard_z, FINAL_RESULT, multiplot, ,→ variable_vary, Z_variable[Z_variable_index], variable_vary2, max_Total_Uncertainty, ,→ standard_Total_Uncertainty)

Workflow_Infrastructure_Model/input

The subdirectory "input" contains the files:

input_infrastructure.json
input_plot_parameters.json
input_workflow.json

The contents of these files have been shown at the beginning of this section (A.3).

Workflow Infrastructure Model/modules The subdirectory ‘modules’ contains the files: “combined time calc.py efficiency calculation.py plot.py transformation time calc.py cost calculation.py init .py result calc.py uncertainty estimation.py” These modules form the core logic of the model. They combine the different input parameters to inter- mediate and final results. combined time calc.py from math import sqrt from transformation_time_calc import transformation_time from efficiency_calculation import overcommit_efficiency class combined_times(transformation_time, overcommit_efficiency): "combining the times" def __init__(self, nr_threads, transformation, RAM_j, ,→ processing_time_t, Merge_Time_t, Size_Evt_In, Size_Evt_Out, cpu_idle_t, validation_t, cleanup_t, singlecore_t, ,→ Nr_Cores, cpu_power, RAM_per_thread, Nr_Evts, ,→ OC_efficiency_gain_t, bandwidth_workflow_in, bandwidth_workflow_out, ,→ position_t, disk_throughput_read, ,→ disk_throughput_write, io_power, idle_factor, ,→ IO_time_t): transformation_time.__init__(self, nr_threads, transformation, ,→ RAM_j, processing_time_t, Merge_Time_t, bandwidth_workflow_in, cpu_idle_t, ,→ validation_t, cleanup_t, ,→ bandwidth_workflow_out, ,→ singlecore_t, Nr_Cores, cpu_power, ,→ RAM_per_thread, Size_Evt_In, ,→ Size_Evt_Out, Nr_Evts, ,→ position_t, disk_throughput_read, ,→ disk_throughput_write, ,→ io_power, idle_factor, ,→ IO_time_t) overcommit_efficiency.__init__(self, nr_threads, Nr_Cores, ,→ OC_efficiency_gain_t)


self.workflow_time= self.workflow_time_calculate(Nr_Evts) self.idle_time= self.idle_time_uc_calculate(Nr_Cores) self.cpu_impact, self.idle_impact, self.IO_impact= ,→ self.component_impact_workflow(Nr_Evts) self.total_error= self.fixed_error(Nr_Evts) def workflow_time_calculate(self, Nr_Evts): "calculate workflow duration (sum all cores)" workflow_time=(self.CPU_Time_total* Nr_Evts+ ,→ (self.IO_Time_total+ self.Swap_Time_total+ ,→ self.Merge_Time_total)* self.nr_threads+ self.Const_overhead_Time_total+ ,→ self.Singlecore_Idle_total* Nr_Evts) return workflow_time def component_impact_workflow(self, Nr_Evts): "calculate the impact of cpu time, idle time and IO time on the ,→ overall wall time in percent" cpu_impact= self.CPU_Time_total* Nr_Evts/ ,→ self.workflow_time* 100 idle_impact=(self.Const_overhead_Time_total+ ,→ self.Singlecore_Idle_total* Nr_Evts)/ self.workflow_time ,→ * 100 IO_impact=(self.IO_Time_total+ self.Swap_Time_total+ ,→ self.Merge_Time_total)* self.nr_threads/ ,→ self.workflow_time* 100 return cpu_impact, idle_impact, IO_impact def fixed_error(self, Nr_Evts): "the error is estimated by the maximum deviation found from ,→ many measurements of 'similar jobs' reconstruction. Then ,→ error propagation is applied." error_cpu= self.CPU_Time_total* Nr_Evts* 0.05 error_idle=(self.Const_overhead_Time_total+ ,→ self.Singlecore_Idle_total* Nr_Evts)* 0.13

if(self.IO_Time_total+ self.Swap_Time_total+ ,→ self.Merge_Time_total)* self.nr_threads/ ,→ self.workflow_time< 0.05: error_io=(self.IO_Time_total+ self.Swap_Time_total+ ,→ self.Merge_Time_total)* self.nr_threads* 0.66 else: error_io= 1000

179 error_total= sqrt(error_cpu* error_cpu+ error_idle* ,→ error_idle+ error_io* error_io) return error_total

def idle_time_uc_calculate(self, Nr_Cores): "calculate idle time when under-committing - basically the job ,→ duration multiplied by number of idle cores" if Nr_Cores> self.nr_threads: idle_time= self.workflow_time/ self.nr_threads* ,→ (Nr_Cores- self.nr_threads) else: idle_time=0 return idle_time

def overhead_calculate(self, Nr_Cores, Nr_Evts): "calculate overhead that can be subtracted" overhead_time= self.OCF*(self.nr_threads- Nr_Cores)/ ,→ self.nr_threads*(self.CPU_Time_total* Nr_Evts+ self.IO_Time_total+ self.Swap_Time_total+ ,→ self.Merge_Time_total+ ,→ self.Const_overhead_Time_total) return overhead_time

def total_processing_time(self, Nr_Cores, Nr_Evts): "calculate processing duration" processing_duration= self.workflow_time- ,→ self.overhead_calculate(Nr_Cores, Nr_Evts)+ self.idle_time return processing_duration efficiency calculation.py class overcommit_efficiency: "workflow specific factor" def __init__(self, nr_threads, Nr_Cores, OC_efficiency_gain): self.overcommit_factor= nr_threads/ Nr_Cores self.OCF= ,→ self.calculate_overcommit_efficiency(OC_efficiency_gain)

def calculate_overcommit_efficiency(self, OC_efficiency_gain): """calculate overcommit efficiency (only if unset meaning ,→ zero)... meaning the percentage of overcommitted time that ,→ doesn't influence the job duration, because it uses (previous) CPU idle time"""


        if OC_efficiency_gain == 0:
            if self.overcommit_factor > 1:
                if self.overcommit_factor > 2:
                    self.OCF = 0.7 / (2**(self.overcommit_factor - 2))
                else:
                    self.OCF = 0.05 * self.overcommit_factor + 0.6
            else:
                self.OCF = 0
        return self.OCF

plot.py

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from mpl_toolkits.mplot3d import Axes3D
from math import log10, floor


class PLOT():
    "plot results"
    def __init__(self, ram_lower_limit, ram_upper_limit, ram_nr_points,
                 thread_min, thread_max, thread_stepsize, max_result_ram,
                 max_result_threads, max_result_z, standard_RAM, standard_threads,
                 standard_z, FINAL_RESULT, multiplot, variable_vary, Z_variable,
                 variable_vary2, max_Total_Uncertainty, standard_Total_Uncertainty):
        x = np.linspace(ram_lower_limit, ram_upper_limit, ram_nr_points)
        fig = plt.figure()
        ax = fig.gca(projection='3d')
        ax.set_xlabel('RAM [GB]', color='b', fontsize=20)
        ax.set_zlabel(Z_variable[0] + ' ' + Z_variable[1], color='b', fontsize=20)
        ax.set_ylabel('Number of processes', color='b', fontsize=20)

        ax.set_xlim([ram_lower_limit - 3, ram_upper_limit + 3])
        ax.set_ylim([thread_min - 3, thread_max + 1])
        if Z_variable[0] == 'Time':
            ax.set_zlim([0, 30000])
        ax.xaxis.labelpad = 12
        ax.yaxis.labelpad = 10
        ax.zaxis.labelpad = 10

        thread_list = np.arange(thread_min, thread_max, thread_stepsize)
        X, Y = np.meshgrid(x, thread_list)

        Z = FINAL_RESULT
        ax.plot_surface(X, Y, Z, rstride=4, cstride=4, alpha=0.9, cmap=plt.cm.jet, linewidth=0)

        ax.scatter([max_result_ram], [max_result_threads], [max_result_z], 'ok', s=150)
        ax.text(max_result_ram, max_result_threads, max_result_z + 0.0001,
                self.round_to_1(max_result_z), size=20)
        ax.scatter([standard_RAM], [standard_threads], [standard_z], 'ok')

        cset = ax.contour(X, Y, Z, zdir='z', offset=-1, cmap=cm.coolwarm)
        cset = ax.contour(X, Y, Z, zdir='x', offset=+3, cmap=cm.coolwarm)
        cset = ax.contour(X, Y, Z, zdir='y', offset=+3, cmap=cm.coolwarm)

        print 'Maximum: RAM: ', max_result_ram, 'Threads: ', max_result_threads, \
            Z_variable[0] + ':', max_result_z, '+-', max_Total_Uncertainty, \
            ' [' + Z_variable[1] + ']', ', Error [%] = ', (max_Total_Uncertainty / max_result_z * 100)
        print 'Standard: RAM: ', standard_RAM, 'Threads: ', standard_threads, \
            Z_variable[0] + ':', standard_z, '+-', standard_Total_Uncertainty, \
            ' [' + Z_variable[1] + ']', ', Error [%] = ', (standard_Total_Uncertainty / standard_z * 100)

        ax.legend()
        ax.azim = 45
        ax.elev = 8.
        fig.savefig(multiplot[0] + '_' + str(variable_vary) + '_' + multiplot[2] + '_'
                    + str(variable_vary2) + '_' + str(Z_variable[0]) + '.jpg', bbox_inches='tight')
        plt.show()
        plt.close()

    def round_to_1(self, x):
        if x == 0.0:
            return x

        elif not isinstance(x, (int, unicode)) and not isinstance(x, (long, unicode)):
            return round(x, -int(floor(log10(abs(x))) - 1))
        else:
            return int(round(x))

transformation_time_calc.py

class transformation_time:
    "calculate all relevant durations: swap, processing and constant time"
    def __init__(self, nr_threads, transformation, RAM_j, processing_time_t, Merge_Time_t,
                 bandwidth_workflow_in, cpu_idle_t, validation_t, cleanup_t,
                 bandwidth_workflow_out, singlecore_t, Nr_Cores, cpu_power, RAM_per_thread,
                 Size_Evt_In, Size_Evt_Out, Nr_Evts, position_t, disk_throughput_read,
                 disk_throughput_write, io_power, idle_factor, IO_time_t):
        self.nr_threads = nr_threads
        self.Nr_Cores = Nr_Cores
        self.transformation = transformation
        self.RAM_j = RAM_j
        self.overcommit_factor = self.nr_threads / Nr_Cores
        self.processing_time = processing_time_t
        self.idle_cpu_time = cpu_idle_t
        self.validation = validation_t
        self.cleanup = cleanup_t
        self.singlecore = singlecore_t
        self.position = position_t
        self.stagein, self.inputsize = self.Network_Read_Time(bandwidth_workflow_in, Size_Evt_In, Nr_Evts)
        self.stageout = self.Network_Write_Time(bandwidth_workflow_out, Size_Evt_Out, Nr_Evts)
        self.CPU_Time_total = self.CPU_Time(cpu_power, Nr_Evts)
        self.IO_Time_total = self.IO_Time(disk_throughput_read, disk_throughput_write,
                                          Size_Evt_In, Size_Evt_Out, Nr_Evts, io_power, IO_time_t)
        self.Idle_Time_total = self.Idle_Time(idle_factor, Nr_Evts)
        self.Swap_Time_total = self.Swap_Time(RAM_per_thread)
        self.Const_overhead_Time_total = self.Const_overhead_Time(cpu_power, Nr_Evts)
        self.Merge_Time_total = self.Merge_Time(Merge_Time_t)
        self.Singlecore_Idle_total = self.CPU_Singlecore_Idle_Time()

    def Network_Read_Time(self, bandwidth_workflow_in, Size_Evt_In, Nr_Evts):
        "return the stage-in duration (network) and disk read duration"
        if self.position == 0:
            stagein_t = Size_Evt_In * Nr_Evts / bandwidth_workflow_in
            inputsize = Size_Evt_In * Nr_Evts
        else:
            stagein_t = 0
            inputsize = 0
        return stagein_t, inputsize

    def Network_Write_Time(self, bandwidth_workflow_out, Size_Evt_Out, Nr_Evts):
        "return the stage-out duration (network)"
        if self.position == 2:
            stageout_t = Size_Evt_Out * Nr_Evts / bandwidth_workflow_out
        else:
            stageout_t = 0
        return stageout_t

    def CPU_Time(self, cpu_power, Nr_Evts):
        "return total CPU Time"
        # single core means idle cores that do nothing
        if self.singlecore:
            CPU_Time1 = self.processing_time / cpu_power
        else:
            CPU_Time1 = self.processing_time / cpu_power * self.nr_threads / self.Nr_Cores
        return CPU_Time1

    def IO_Time(self, disk_throughput_read, disk_throughput_write, Size_Evt_In, Size_Evt_Out,
                Nr_Evts, io_power, IO_time_t):
        "return I/O time"
        IO_Time_total = IO_time_t / io_power / self.Nr_Cores * Nr_Evts
        return IO_Time_total

    def Idle_Time(self, idle_factor, Nr_Evts):
        "return Idle time"
        Idle_time_res = self.idle_cpu_time / idle_factor * 10 * Nr_Evts
        return Idle_time_res

    def swap_penalty(self):
        "calculate the swap overhead time given a RAM value... simplification: penalised heavily by constant"
        # only for transformations that actually swap
        if "RAWtoESD" in self.transformation:
            swap_overhead = 100000000000000
        else:
            swap_overhead = 0
        return swap_overhead

    def Swap_Time(self, RAM_per_thread):
        "calculate swap time through penalty function (applied after RAM per Core ratio becomes too low)"
        penalty_lim = RAM_per_thread * self.nr_threads  # below this, the swapping penalty (runtime) applies:
        if self.RAM_j < penalty_lim:
            if self.swap_penalty() > 0:
                Swap_Time_1 = self.swap_penalty()
            else:
                Swap_Time_1 = 0
        else:
            Swap_Time_1 = 0
        return Swap_Time_1

    def Merge_Time(self, Merge_Time_t):
        return Merge_Time_t * self.nr_threads

    def CPU_Singlecore_Idle_Time(self):
        "if (serial step) merging happens, add idle time"
        CPU_idle_singlecore = 0
        if self.singlecore:
            CPU_idle_singlecore = self.CPU_Time_total * (self.nr_threads - 1)
        else:
            CPU_idle_singlecore = 0
        return CPU_idle_singlecore

    def Const_overhead_Time(self, cpu_power, Nr_Evts):
        "sum over all the constant times: startup, setup etc..."
        self.Const_overhead_Time_total = (self.stagein + self.Idle_Time_total + self.validation
                                          + self.cleanup + self.stageout) * self.nr_threads
        return self.Const_overhead_Time_total

cost_calculation.py

class cost_calculation:
    "calculate cost of infrastructure"
    def __init__(self, RAM_i, Nr_Cores, RAM_machine, ram_to_cpu_factor, budget,
                 cost_1machine_1sec, infrastructure_duration, nr_machine_override):
        "create an instance for every RAM amount"
        self.RAM_i = RAM_i
        self.Nr_Cores = Nr_Cores
        self.RAM_machine = RAM_machine
        self.ram_to_cpu_factor = ram_to_cpu_factor
        self.cost_1machine_1sec = cost_1machine_1sec
        self.infrastructure_duration = infrastructure_duration
        self.machines = self.budget_machines(budget, nr_machine_override)
        self.nr_machines = self.total_machines(budget, nr_machine_override)
        self.total_cost_machines = budget

    def budget_machines(self, budget, nr_machine_override):
        "amount of standard machines the budget allows for"
        if nr_machine_override == 0:
            if (budget / self.cost_1machine_1sec * self.infrastructure_duration) >= 1:
                return int(budget / (self.cost_1machine_1sec * self.infrastructure_duration))
            else:
                raise ValueError('Input disagreement: budget cannot afford machines')

    def total_machines(self, budget, nr_machine_override):
        "return number of machines, considering RAM variation. nr_machine_override, if non-zero, overrides all budget, ram and core considerations"
        if nr_machine_override != 0:
            return nr_machine_override
        else:
            if self.Nr_Cores <= 0:
                nr_machines_1 = 0
            else:

                nr_machines_1 = (self.Nr_Cores - (self.RAM_i - self.RAM_machine)
                                 * self.ram_to_cpu_factor) * self.machines / self.Nr_Cores
            if nr_machines_1 >= 0:
                return nr_machines_1
            else:
                return 0

result_calc.py

from cost_calculation import cost_calculation
from uncertainty_estimation import estimate_uncertainty


class result_calc(cost_calculation, estimate_uncertainty):
    "calculate the chosen result"
    def __init__(self, total_processing_time, nr_threads, transformation, RAM_j, Nr_Cores,
                 RAM_machine, ram_to_cpu_factor, input_size, budget, cost_1machine_1sec,
                 infrastructure_duration, nr_machine_override, CPU_Idle_Time_stdev,
                 CPU_Time_stdev, IO_Time_stdev, Nr_Evts):
        cost_calculation.__init__(self, RAM_i=RAM_j, Nr_Cores=Nr_Cores,
                                  RAM_machine=RAM_machine,
                                  ram_to_cpu_factor=ram_to_cpu_factor,
                                  budget=budget,
                                  cost_1machine_1sec=cost_1machine_1sec,
                                  infrastructure_duration=infrastructure_duration,
                                  nr_machine_override=nr_machine_override)
        estimate_uncertainty.__init__(self, Nr_Evts)
        self.total_processing_time = total_processing_time
        self.overcommit_factor = nr_threads / Nr_Cores
        self.Nr_Cores = Nr_Cores
        self.input_size = input_size
        self.CPU_Idle_Time_stdev = CPU_Idle_Time_stdev
        self.CPU_Time_stdev = CPU_Time_stdev
        self.IO_Time_stdev = IO_Time_stdev
        self.num_workflows = self.num_workflows_calc()
        self.sum_walltimes = self.sum_walltimes_calc()
        self.Total_Events = self.total_events_calc(Nr_Evts)

    def total_events_calc(self, Nr_Evts):
        "total number of events produced"
        if self.sum_walltimes == 0:
            raise ValueError('wall time is zero')
        Total_Events = Nr_Evts * self.overcommit_factor * self.num_workflows
        return Total_Events

    def sum_walltimes_calc(self):
        "sum of all walltimes"
        if self.Nr_Cores == 0:
            raise ValueError('Nr_Cores input is zero')
        sum_walltimes = self.total_processing_time * self.num_workflows / self.Nr_Cores
        return sum_walltimes

    def num_workflows_calc(self):
        "how many workflows are run in total"
        if self.Nr_Cores == 0:
            raise ValueError('Nr_Cores input is zero')
        num_workflows = self.infrastructure_duration * self.nr_machines / (self.total_processing_time / self.Nr_Cores)
        return num_workflows

    def ETC_calc(self, Nr_Evts):
        "events/second/chf"
        if self.total_cost_machines == 0:
            raise ValueError('budget input is zero')
        if self.sum_walltimes == 0:
            raise ValueError('wall time is zero')
        ETC_res = self.Total_Events / self.sum_walltimes / self.total_cost_machines
        return ETC_res

    def EventCost_calc(self, Nr_Evts):
        "events/chf"
        if self.total_cost_machines == 0:
            raise ValueError('budget input is zero')
        EC_res = self.Total_Events / self.total_cost_machines
        return EC_res

    def EventTime_calc(self, Nr_Evts):
        "events/second"
        if self.sum_walltimes == 0:

            raise ValueError('wall time is zero')
        ET_res = self.Total_Events / self.sum_walltimes
        return ET_res

    def Time_calc(self, Nr_Evts):
        "total time"
        T_res = self.sum_walltimes
        return T_res

    def Bandwidth_calc(self, Nr_Evts):
        "bandwidth"
        Bandwidth_res = self.input_size * self.Total_Events / self.sum_walltimes / 1000000000 * 8
        return Bandwidth_res

    def CostEvent_calc(self, Nr_Evts):
        "chf/event"
        if self.Total_Events == 0:
            raise ValueError('total events are zero')
        CE_res = self.total_cost_machines / self.Total_Events
        return CE_res

    def Events_calc(self, Nr_Evts):
        "events"
        Events_res = self.Total_Events
        return Events_res

    def Input_bytes_second_calc(self, downloadtime_l, inputsize_l, Nr_Evts):
        "input size per downloadtime [bytes/second]"
        if downloadtime_l == 0:
            raise ValueError('downloadtime input is zero')
        if inputsize_l == 0:
            raise ValueError('inputsize input is zero')
        Input_bytes_second_res = inputsize_l / downloadtime_l
        return Input_bytes_second_res

    def Input_bytes_calc(self, downloadtime_l, inputsize_l, Nr_Evts):
        "input size [bytes]"
        if inputsize_l == 0:
            raise ValueError('inputsize input is zero')
        Input_bytes_res = inputsize_l * self.Total_Events / Nr_Evts
        return Input_bytes_res

    def DownloadTime_calc(self, downloadtime_l, inputsize_l, Nr_Evts):
        "downloadtime [second]"
        if downloadtime_l == 0:
            raise ValueError('downloadtime input is zero')
        downloadtime = downloadtime_l / Nr_Evts * self.Total_Events
        return downloadtime

    def CPUTime_calc(self, CPUTime_l, Nr_Evts):
        "CPU Time [seconds]"
        CPUTime = CPUTime_l * self.Total_Events
        return CPUTime

uncertainty_estimation.py

from math import sqrt


class estimate_uncertainty:
    "uncertainty estimations"
    def __init__(self, Nr_Evts):
        self.Nr_Evts = Nr_Evts

    def Time_calc_unc(self):
        "calculate the wall time uncertainty. error combination: idle and cpu time dependent, io time independent"
        time_Uncertainty = sqrt((self.CPU_Idle_Time_stdev / self.Nr_Cores)**2
                                + (self.CPU_Time_stdev / self.Nr_Cores)**2
                                + (self.IO_Time_stdev / self.Nr_Cores)**2)
        return time_Uncertainty

    def num_workflows_unc(self):
        "calculate the sum workflows"
        num_workflow_Uncertainty = (self.infrastructure_duration * self.nr_machines * self.Nr_Cores
                                    / (self.total_processing_time)**2 * self.Time_calc_unc())
        return num_workflow_Uncertainty

    def total_events_unc(self):
        "calculate the total_events uncertainty"
        total_events = self.Nr_Evts * self.overcommit_factor * self.num_workflows_unc()
        return total_events

    def sum_walltime_unc(self):
        "calculate the sum wall time uncertainty"
        sum_walltime = 0
        return sum_walltime

    def ETC_calc_unc(self):
        "calculate the ETC uncertainty"
        ETC_Uncertainty = (self.Time_calc_unc() * self.Nr_Cores / (self.total_processing_time)**2
                           / self.total_cost_machines * self.Nr_Evts * self.overcommit_factor)
        return ETC_Uncertainty

    def EventCost_calc_unc(self):
        "calculate the EC uncertainty"
        EC_Uncertainty = self.total_events_unc() / self.total_cost_machines
        return EC_Uncertainty

    def EventTime_calc_unc(self):
        "calculate the ET uncertainty"
        ET_Uncertainty = (self.total_events_unc() / self.sum_walltimes)
        return ET_Uncertainty

    def Time_all_calc_unc(self):
        "uncertainty on duration of exercise"
        time_all_Uncertainty = 0
        return time_all_Uncertainty

    def Bandwidth_calc_unc(self):
        "calculate the bandwidth uncertainty"
        Bandwidth_Uncertainty = (self.input_size * self.total_events_unc() / self.sum_walltimes
                                 / 1000000000 * 8)
        return Bandwidth_Uncertainty

    def CostEvent_calc_unc(self):
        "calculate the cost/event uncertainty"
        cost_event_Uncertainty = self.total_cost_machines / (self.Total_Events)**2 * self.total_events_unc()
        return cost_event_Uncertainty

    def Events_calc_unc(self):
        "calculate the #event uncertainty"
        event_Uncertainty = self.total_events_unc()
        return event_Uncertainty

    def Input_bytes_second_calc_unc(self):
        "calculate the input speed uncertainty"
        input_speed_Uncertainty = 0
        return input_speed_Uncertainty

    def Input_bytes_calc_unc(self):
        "calculate the input size uncertainty"
        input_size_Uncertainty = inputsize_l * self.total_events_unc() / Nr_Evts
        return input_size_Uncertainty

    def DownloadTime_calc_unc(self):
        "calculate the download duration uncertainty"
        download_time_Uncertainty = downloadtime_l / Nr_Evts * self.total_events_unc()
        return download_time_Uncertainty

    def CPUTime_calc_unc(self):
        "calculate the CPU time uncertainty"
        CPU_time_Uncertainty = CPUTime_l * self.total_events_unc()
        return CPU_time_Uncertainty
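For orientation, the sketch below, which is not part of the thesis code, evaluates the overcommit efficiency factor defined in calculate_overcommit_efficiency for a few process counts; the module name efficiency_calculation and the four-core machine are assumptions made here.

# Illustrative only: evaluate the overcommit efficiency factor (OCF) for a
# few overcommit factors on an assumed 4-core machine.  Floats are used so
# that nr_threads / Nr_Cores gives the intended ratio under Python 2 division.
from efficiency_calculation import overcommit_efficiency  # assumed module name

for nr_threads in (4.0, 6.0, 8.0, 12.0):
    eff = overcommit_efficiency(nr_threads=nr_threads, Nr_Cores=4.0,
                                OC_efficiency_gain=0)
    print 'processes:', int(nr_threads), \
        'overcommit factor:', eff.overcommit_factor, 'OCF:', eff.OCF

With no measured efficiency gain supplied, the factor rises to 0.7 at an overcommit factor of two and falls off exponentially beyond that.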

A.4 Overcommitting

Processes   Wall [s]   Wall STDEV [%]   CPU [s]   Idle [s]   I/O Wait [s]
    4        11306         0.46          37388     52554         152
    5         9836         0.12          38259     40029         156
    6         8898         0.24          38688     32010         194
    7         8665         1.16          40309     28214         416
    8         8427         0.28          42044     24016        1140
    9         8839         0.12          43940     24838        1644
   10         9328         0.17          45817     25962        2470

Table A1: Job duration with increasing number of parallel processes.
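One rough way to read Table A1 is to express the CPU utilisation of the machine as CPU / (CPU + Idle + I/O wait). The following sketch is illustrative only; the numbers are copied from the table.

# Rough CPU utilisation per configuration, using the columns of Table A1.
# This is a post-hoc reading of the table, not part of the model code.
rows = {4: (37388, 52554, 152), 5: (38259, 40029, 156), 6: (38688, 32010, 194),
        7: (40309, 28214, 416), 8: (42044, 24016, 1140),
        9: (43940, 24838, 1644), 10: (45817, 25962, 2470)}
for procs in sorted(rows):
    cpu, idle, iowait = rows[procs]
    utilisation = 100.0 * cpu / (cpu + idle + iowait)
    print procs, 'processes -> CPU utilisation: %.1f%%' % utilisation

Read this way, the utilisation grows from roughly 40% with four processes to just above 60% with eight processes, which is also where the wall time in the table reaches its minimum.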


A.4.1 Model input parameters

Budget = 1000
CPU power = 10
Bandwidth = 1250000000
Ram-to-CPU-ratio = 0.358
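As a rough indication of where these inputs enter the model, the sketch below instantiates the cost model with the budget listed above. The machine cost rate, infrastructure duration, RAM and core values are placeholders chosen here, and the Ram-to-CPU-ratio is assumed to correspond to the ram_to_cpu_factor argument.

# Illustrative only: feed the A.4.1 budget into the cost model.  All values
# marked 'placeholder' are assumptions made for this sketch, not thesis inputs.
from cost_calculation import cost_calculation  # assumed module name

cost = cost_calculation(RAM_i=8, Nr_Cores=4, RAM_machine=8,     # placeholder machine shape
                        ram_to_cpu_factor=0.358,                # Ram-to-CPU-ratio from above
                        budget=1000,                            # Budget from above
                        cost_1machine_1sec=1e-5,                # placeholder rate
                        infrastructure_duration=86400,          # placeholder duration [s]
                        nr_machine_override=0)
print 'machines affordable within the budget:', cost.machines

budget_machines then converts the budget and the per-machine cost rate into the number of standard machines that can be rented for the chosen duration.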

A.5 Additional source code and scripts

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv) {
    size_t s = (unsigned long long int) atol(argv[1]);
    char *c = (char *) malloc(s);
    mlock(c, s);
    memset(c, 0, s);
    sleep(86400);
    return 0;
}

Listing 1: Code to allocate and lock the amount of RAM that is specified upon execution of the program for 86400 seconds.

Listing 2: "lscpu" result on the 4-core VM at CERN.

[root@datacloud02 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU L5640 @ 2.27GHz
Stepping:              2
CPU MHz:               2266.746
BogoMIPS:              4533.49
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-3

Listing 3: Monitoring with the 'sar' commands of the sysstat package.

sar -P ALL 5 20000 > sar_P_ALL_CPU.txt &
sar -u ALL 5 20000 > sar_u.txt &
sar -n DEV 5 20000 > sar_n_DEV_network.txt &
sar -b 5 20000 > sar_b_IO.txt &
sar -r 5 20000 > sar_r_memory.txt &
sar -S 5 20000 > sar_S_swap.txt &
sar -B 5 20000 > sar_B_paging_statistics.txt &
sar -W 5 20000 > sar_W_page_swap_statistics.txt &
sar -p -d 5 20000 > sar_d_individual_block_device.txt &
sar -w 5 20000 > sar_w_context_switch.txt &
sar -q 5 20000 > sar_q_load_average.txt &

Listing 4: Monitoring with the /proc pseudo-file system.

cat /proc/stat > proc_stat_beginning.txt
cat /proc/uptime > proc_uptime_start.txt
cat /proc/uptime > proc_uptime.txt
while kill -0 $pid2 2> /dev/null
do
    cat /proc/$pid2/stat > proc_pid_stat.txt
    cat /proc/$pid2/io > proc_pid_io.txt
    cat /proc/$pid2/status > proc_pid_status.txt
    cat /proc/stat > proc_stat_end.txt
    cat /proc/uptime > proc_uptime_end.txt
    function get_cpid () {
        cpids=`pgrep -P $1 | xargs`
        for cpid in $cpids;
        do
            cat /proc/$cpid/stat > proc_pid_stat_$cpid.txt
            cat /proc/$cpid/io > proc_pid_io_$cpid.txt
            cat /proc/$cpid/status > proc_pid_status_$cpid.txt
            get_cpid $cpid
        done
    }
    get_cpid $pid2
    sleep 5s
done
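The snapshots written by this loop can later be converted into CPU time. A minimal sketch along these lines, which is not one of the thesis scripts, is given below; the file name is taken from the listing above.

# A minimal sketch (not one of the thesis scripts): turn a saved
# proc_pid_stat.txt snapshot into CPU seconds.  Fields 14 and 15 of
# /proc/<pid>/stat are utime and stime in clock ticks (typically 100 per
# second); the field split assumes the process name contains no spaces.
def cpu_seconds(stat_snapshot, ticks_per_second=100.0):
    fields = stat_snapshot.split()
    utime, stime = int(fields[13]), int(fields[14])
    return (utime + stime) / ticks_per_second

with open('proc_pid_stat.txt') as f:
    print 'CPU time used so far [s]:', cpu_seconds(f.read())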

Listing 5: Monitoring with the /sys pseudo-file system.

echo `date` >> diskstat.txt
echo "third value 512 is bytes read, 7th value 512 bytes written" >> diskstat.txt
echo "vda3, /swap" >> diskstat.txt
tail /sys/block/vda/vda3/stat >> diskstat.txt
echo "vda, alles" >> diskstat.txt
tail /sys/block/vda/stat >> diskstat.txt
echo "vda4, /test" >> diskstat.txt
tail /sys/block/vda/vda4/stat >> diskstat.txt
echo "vdb /storage" >> diskstat.txt
tail /sys/block/vdb/stat >> diskstat.txt

A.6 Different workflows

A.6.1 Event Generation

Athena version: asetup --cmtconfig=x86_64-slc6-gcc47-opt AtlasProduction,19.2.3.6,notest
The command line input was:

Generate_tf.py --AMITag=e3735 --ecmEnergy=13000 \
    --evgenJobOpts=/test/evgen/MC15JobOpts-00-00-33_v2.tar.gz \
    --jobConfig=/test/evgen/MC15.424100.Pythia8B_A14_CTEQ6L1_Jpsimu4mu4.py \
    --maxEvents=1000 --outputEVNTFile=EVNT.11321089._026174.pool.root.1 \
    --randomSeed=70575..70600 --runNumber=424100 --skipEvents=0 &> output.txt &

The file ‘MC15.424100.Pythia8B_A14_CTEQ6L1_Jpsimu4mu4.py’ contains:

#--------------------------------------------------------------
# JO fragment for pp->J/psi(mu4mu4)X, Pythia 8B
#--------------------------------------------------------------

evgenConfig.description = "Inclusive pp->J/psi(mu4mu4) production with Photos"
evgenConfig.keywords = ["charmonium", "2muon", "inclusive"]
evgenConfig.minevents = 500

include('MC15JobOptions/nonStandard/Pythia8B_A14_CTEQ6L1_Common.py')
include('MC15JobOptions/nonStandard/Pythia8B_Photospp.py')
include("MC15JobOptions/Pythia8B_Charmonium_Common.py")

genSeq.Pythia8B.Commands += ['PhaseSpace:pTHatMin = 4.']
genSeq.Pythia8B.Commands += ['443:onMode = off']
genSeq.Pythia8B.Commands += ['443:2:onMode = on']
genSeq.Pythia8B.SignalPDGCodes = [443, -13, 13]
genSeq.Pythia8B.TriggerPDGCode = 13
genSeq.Pythia8B.TriggerStatePtCut = [4.0]
genSeq.Pythia8B.TriggerStateEtaCut = 2.7
genSeq.Pythia8B.MinimumCountPerCut = [2]

A.6.2 Monte-Carlo simulation

Input files are the outputs from the event generation in Subsection A.6.1.
Athena version: asetup --cmtconfig=x86_64-slc6-gcc49-opt AtlasOffline,21.0.15,notest,here
The command line input was:

Sim_tf.py --inputEVNTFile=/test/evgen/"counter"_1000evts/EVNT.11321089._026174.pool.root.1 \
    --maxEvents=100 --postInclude default:RecJobTransforms/UseFrontier.py \
    --athenaMPMergeTargetSize=ALL:0.0 \
    --preExec 'EVNTtoHITS:simFlags.SimBarcodeOffset.set_Value_and_Lock(200000)' \
    'EVNTtoHITS:simFlags.TRTRangeCut=30.0; simFlags.TightMuonStepping=True' \
    --preInclude EVNTtoHITS:SimulationJobOptions/preInclude.BeamPipeKill.py,SimulationJobOptions/preInclude.FrozenShowersFCalOnly.py \
    --skipEvents=0 --firstEvent=0 --outputHITSFile=HITS.11363361._041300.pool.root.1 \
    --physicsList=FTFP_BERT_ATL_VALIDATION --randomSeed=13056 --DBRelease=all:current \
    --conditionsTag default:OFLCOND-MC16-SDR-14 \
    --geometryVersion=default:ATLAS-R2-2016-01-00-01_VALIDATION \
    --runNumber=361100 --AMITag=s3126 --DataRunNumber=284500 \
    --simulator=FullG4 --truthStrategy=MC15aPlus &> output.txt &

A.6.3 Reconstruction 1

Uses the same input file for all iterations. Athena version: asetup --cmtconfig=x86_64-slc6-gcc49-opt AtlasProduction,20.7.7.6
The command line input was:


Reco_tf.py '--inputBSFile=/test/reco/data16_13TeV.00304008.physics_ c ,→ Main.daq.RAW._lb0167._SFO-4._0002.data' ,→ '--maxEvents=default:-1' '--postExec' 'ESDtoDPD:from ,→ AthenaCommon.AppMgr import ServiceMgr;import

,→ MuonRPC_Cabling.MuonRPC_CablingConfig;ServiceMgr.MuonRPC_Cablin c

,→ gSvc.RPCMapfromCool=False;ServiceMgr.MuonRPC_CablingSvc.CorrFil c

,→ eName="LVL1confAtlasRUN2_ver016.corr";ServiceMgr.MuonRPC_Cablin c ,→ gSvc.ConfFileName="LVL1confAtlasRUN2_ver016.data";' '--preExec' ,→ 'all:DQMonFlags.enableLumiAccess=False;' ,→ '--autoConfiguration=everything' '--beamType=collisions' ,→ '--conditionsTag' 'default:CONDBR2-BLKPA-2016-16' ,→ '--geometryVersion=default:ATLAS-R2-2015-04-00-00' ,→ '--runNumber=304008' '--AMITag=r8519' ,→ '--outputAODFile=AOD.09455490._003112.pool.root.1' ,→ '--outputHISTFile=HIST.09455490._003112.pool.root.1' ,→ '--jobNumber=3106' '--ignoreErrors=True'

,→ '--ignorePatterns=ToolSvc.InDetSCTRodDecoder.+ERROR.+Unknown.+o c ,→ fflineId.+for.+OnlineId' &> output.txt ,→ &

A.6.4 Reconstruction 2

Same input file for all iterations. Athena version: asetup --cmtconfig=x86_64-slc6-gcc62-opt Athena,21.0.30
The command line input was:

Reco_tf.py--inputBSFile=/test/reco/17data_1/data17_13TeV.00329964. c ,→ physics_Main.daq.RAW._lb0366._SFO-8._0003.data--maxEvents-1 ,→ --postExec 'e2d:from AthenaCommon.AppMgr import ServiceMgr; import MuonRPC_Cabling.MuonRPC_CablingConfig; ServiceMgr.MuonRPC_CablingSvc.RPCMapfromCool=False;

ServiceMgr.MuonRPC_CablingSvc.CorrFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.corr";

ServiceMgr.MuonRPC_CablingSvc.ConfFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.data"; ''e2a:if "TileJetMonTool/TileJetMonTool" in ToolSvc.getSequence(): ,→ ToolSvc.TileJetMonTool.do_1dim_histos=True; ' --preExec 'all:from InDetRecExample.InDetJobProperties import ,→ InDetFlags; InDetFlags.useDynamicAlignFolders.set_Value_and_Lock(True); from InDetPrepRawDataToxAOD.SCTxAODJobProperties import ,→ SCTxAODFlags; SCTxAODFlags.Prescale.set_Value_and_Lock(50); TriggerFlags.AODEDMSet="AODFULL";

from TrigHLTMonitoring.HLTMonFlags import HLTMonFlags; HLTMonFlags.doGeneral=False; HLTMonFlags.doJet=False' --autoConfiguration everything ,→ --conditionsTag all:CONDBR2-BLKPA-2017-10 --geometryVersion

,→ all:ATLAS-R2-2016-01-00-01 --runNumber 329964 --AMITag f844 --o c ,→ utputDESDM_CALJETFile=DESDM_CALJET.11776717._004044.pool.root.1

,→ --outputDESDM_EXOTHIPFile=DESDM_EXOTHIP.11776717._004044.pool.r c ,→ oot.1 ,→ --outputDESDM_MCPFile=DESDM_MCP.11776717._004044.pool.root.1

,→ --outputDESDM_PHOJETFile=DESDM_PHOJET.11776717._004044.pool.roo c ,→ t.1

,→ --outputDESDM_SGLELFile=DESDM_SGLEL.11776717._004044.pool.root. c ,→ 1

,→ --outputDESDM_TILEMUFile=DESDM_TILEMU.11776717._004044.pool.roo c ,→ t.1 --outputDRAW_EGZFile=DRAW_EGZ.11776717._004044.pool.root.1 ,→ --outputDRAW_RPVLLFile=DRAW_RPVLL.11776717._004044.pool.root.1

,→ --outputDRAW_TAUMUHFile=DRAW_TAUMUH.11776717._004044.pool.root. c ,→ 1 ,→ --outputDRAW_ZMUMUFile=DRAW_ZMUMU.11776717._004044.pool.root.1 ,→ --outputAODFile=AOD.11776717._004044.pool.root.1 ,→ --outputHISTFile=HIST.11776717._004044.pool.root.1 --jobNumber ,→ 3941 --ignoreErrors false &> output.txt &

A.6.5 Reconstruction 3

Athena version: asetup --cmtconfig=x86_64-slc6-gcc62-opt Athena,21.0.30
Different input file for each job inside a VM (same input file across VMs), e.g.:
data17_13TeV.00325030.physics_Main.daq.RAW._lb0685._SFO-7._0001.data,
data17_13TeV.00325030.physics_Main.daq.RAW._lb0685._SFO-5._0001.data, ...
The command line input was:

Reco_tf.py--inputBSFile=/test/reco/counter--maxEvents-1 ,→ --postExec 'e2d:from AthenaCommon.AppMgr import ServiceMgr; import MuonRPC_Cabling.MuonRPC_CablingConfig; ServiceMgr.MuonRPC_CablingSvc.RPCMapfromCool=False;

ServiceMgr.MuonRPC_CablingSvc.CorrFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.corr";

ServiceMgr.MuonRPC_CablingSvc.ConfFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.data"; ''e2a:if "TileJetMonTool/TileJetMonTool" in ToolSvc.getSequence(): ,→ ToolSvc.TileJetMonTool.do_1dim_histos=True; ' --preExec 'all:from InDetRecExample.InDetJobProperties import ,→ InDetFlags; InDetFlags.useDynamicAlignFolders.set_Value_and_Lock(True);


from InDetPrepRawDataToxAOD.SCTxAODJobProperties import ,→ SCTxAODFlags; SCTxAODFlags.Prescale.set_Value_and_Lock(50); TriggerFlags.AODEDMSet="AODFULL"; from TrigHLTMonitoring.HLTMonFlags import HLTMonFlags; HLTMonFlags.doGeneral=False; HLTMonFlags.doJet=False' --autoConfiguration everything ,→ --conditionsTag all:CONDBR2-BLKPA-2017-10 --geometryVersion

,→ all:ATLAS-R2-2016-01-00-01 --runNumber 329964 --AMITag f844 --o c ,→ utputDESDM_CALJETFile=DESDM_CALJET.11776717._004044.pool.root.1

,→ --outputDESDM_EXOTHIPFile=DESDM_EXOTHIP.11776717._004044.pool.r c ,→ oot.1 ,→ --outputDESDM_MCPFile=DESDM_MCP.11776717._004044.pool.root.1

,→ --outputDESDM_PHOJETFile=DESDM_PHOJET.11776717._004044.pool.roo c ,→ t.1

,→ --outputDESDM_SGLELFile=DESDM_SGLEL.11776717._004044.pool.root. c ,→ 1

,→ --outputDESDM_TILEMUFile=DESDM_TILEMU.11776717._004044.pool.roo c ,→ t.1 --outputDRAW_EGZFile=DRAW_EGZ.11776717._004044.pool.root.1 ,→ --outputDRAW_RPVLLFile=DRAW_RPVLL.11776717._004044.pool.root.1

,→ --outputDRAW_TAUMUHFile=DRAW_TAUMUH.11776717._004044.pool.root. c ,→ 1 ,→ --outputDRAW_ZMUMUFile=DRAW_ZMUMU.11776717._004044.pool.root.1 ,→ --outputAODFile=AOD.11776717._004044.pool.root.1 ,→ --outputHISTFile=HIST.11776717._004044.pool.root.1 --jobNumber ,→ 3941 --ignoreErrors false &> output.txt &

A.6.6 Reconstruction 4

Athena version: asetup --cmtconfig=x86_64-slc6-gcc62-opt Athena,21.0.30
Different input file for each job inside a VM (same input file across VMs), e.g.:
data17_13TeV.00325030.physics_Main.daq.RAW._lb0684._SFO-3._0001.data,
data17_13TeV.00325030.physics_Main.daq.RAW._lb0660._SFO-8._0001.data, ...
The command line input was:

Reco_tf.py--inputBSFile=/test/reco/counter--maxEvents-1 ,→ --postExec 'e2d:from AthenaCommon.AppMgr import ServiceMgr; import MuonRPC_Cabling.MuonRPC_CablingConfig; ServiceMgr.MuonRPC_CablingSvc.RPCMapfromCool=False;

ServiceMgr.MuonRPC_CablingSvc.CorrFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.corr";

ServiceMgr.MuonRPC_CablingSvc.ConfFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.data";

199 ''e2a:if "TileJetMonTool/TileJetMonTool" in ToolSvc.getSequence(): ,→ ToolSvc.TileJetMonTool.do_1dim_histos=True; ' --preExec 'all:from InDetRecExample.InDetJobProperties import ,→ InDetFlags; InDetFlags.useDynamicAlignFolders.set_Value_and_Lock(True); from InDetPrepRawDataToxAOD.SCTxAODJobProperties import ,→ SCTxAODFlags; SCTxAODFlags.Prescale.set_Value_and_Lock(50); TriggerFlags.AODEDMSet="AODFULL"; from TrigHLTMonitoring.HLTMonFlags import HLTMonFlags; HLTMonFlags.doGeneral=False; HLTMonFlags.doJet=False' --autoConfiguration everything ,→ --conditionsTag all:CONDBR2-BLKPA-2017-10 --geometryVersion

,→ all:ATLAS-R2-2016-01-00-01 --runNumber 329964 --AMITag f844 --o c ,→ utputDESDM_CALJETFile=DESDM_CALJET.11776717._004044.pool.root.1

,→ --outputDESDM_EXOTHIPFile=DESDM_EXOTHIP.11776717._004044.pool.r c ,→ oot.1 ,→ --outputDESDM_MCPFile=DESDM_MCP.11776717._004044.pool.root.1

,→ --outputDESDM_PHOJETFile=DESDM_PHOJET.11776717._004044.pool.roo c ,→ t.1

,→ --outputDESDM_SGLELFile=DESDM_SGLEL.11776717._004044.pool.root. c ,→ 1

,→ --outputDESDM_TILEMUFile=DESDM_TILEMU.11776717._004044.pool.roo c ,→ t.1 --outputDRAW_EGZFile=DRAW_EGZ.11776717._004044.pool.root.1 ,→ --outputDRAW_RPVLLFile=DRAW_RPVLL.11776717._004044.pool.root.1

,→ --outputDRAW_TAUMUHFile=DRAW_TAUMUH.11776717._004044.pool.root. c ,→ 1 ,→ --outputDRAW_ZMUMUFile=DRAW_ZMUMU.11776717._004044.pool.root.1 ,→ --outputAODFile=AOD.11776717._004044.pool.root.1 ,→ --outputHISTFile=HIST.11776717._004044.pool.root.1 --jobNumber ,→ 3941 --ignoreErrors false &> output.txt &

A.6.7 Reconstruction 5

Different input file for each job inside a VM (same input file across VMs), e.g.:
data17_13TeV.00325030.physics_Main.daq.RAW._lb0684._SFO-3._0001.data,
data17_13TeV.00325030.physics_Main.daq.RAW._lb0660._SFO-8._0001.data, ...
Athena version: asetup --cmtconfig=x86_64-slc6-gcc62-opt Athena,21.0.37,notest
The command line input was:

Reco_tf.py--inputBSFile=/test/reco/counter--maxEvents=default:-1 ,→ --postExec 'all:from AthenaCommon.AppMgr import ServiceMgr as ,→ svcMgr; svcMgr.AthenaPoolCnvSvc.MaxFileSizes=["15000000000"];


from IOVDbSvc.CondDB import conddb; conddb.addOverride("/MUONALIGN/CSC/ILINES", ,→ "MuonAlignCscIlines-UPD1-03"); ''ESDtoDPD:from AthenaCommon.AppMgr import ServiceMgr; import MuonRPC_Cabling.MuonRPC_CablingConfig; ServiceMgr.MuonRPC_CablingSvc.RPCMapfromCool=False;

ServiceMgr.MuonRPC_CablingSvc.CorrFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.corr";

ServiceMgr.MuonRPC_CablingSvc.ConfFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.data"; ' --preExec 'all:from InDetRecExample.InDetJobProperties import ,→ InDetFlags; InDetFlags.useDynamicAlignFolders.set_Value_and_Lock(True); from PrimaryDPDMaker.PrimaryDESDMFlags_PerfMS import ,→ PrimaryDESDMFlags_PerfMSStream;

jobproperties.PrimaryDESDMFlags_PerfMSStream.doAlignmentFormat.set c ,→ _Value_and_Lock(True); ' --skipEvents=0 --conditionsTag all:CONDBR2-BLKPA-2017-12 ,→ --geometryVersion=all:ATLAS-R2-2016-01-00-01 --runNumber=325030

,→ --AMITag=r9962 --outputDESDM_ALLCELLSFile=DESDM_MCP.12214515._0 c ,→ 25888.pool.root.1 --jobNumber=20791 &> output.txt ,→ &

A.6.8 Reconstruction 6

Software version and input data from the beginning of 2015. Different input file for each job inside a VM (same input file across VMs), e.g.:
data15_13TeV.00270588.physics_Main.daq.RAW._lb0233._SFO-2._0001.data,
data15_13TeV.00270588.physics_Main.daq.RAW._lb0238._SFO-1._0001.data, ...
Athena version: asetup --cmtconfig=x86_64-slc6-gcc48-opt 20.1.5.5,AtlasProduction,here,notest
The command line input was:

Reco_tf.py--inputBSFile=/test/reco_beg_run/data15_13TeV.00270588.physi c ,→ cs_Main.daq.RAW._lb0233._SFO-2._0001.data--maxEvents=default:-1 ,→ --postExec 'ESDtoDPD:from AthenaCommon.AppMgr import ServiceMgr; import MuonRPC_Cabling.MuonRPC_CablingConfig; ServiceMgr.MuonRPC_CablingSvc.RPCMapfromCool=False;

ServiceMgr.MuonRPC_CablingSvc.CorrFileName="LVL1confAtlasRUN2_ver016.co c ,→ rr";

ServiceMgr.MuonRPC_CablingSvc.ConfFileName="LVL1confAtlasRUN2_ver016.da c ,→ ta"; ''RAWtoESD:from AthenaCommon.AppMgr import ServiceMgr as svcMgr; svcMgr.AthenaPoolCnvSvc.MaxFileSizes=["15000000000"];

''ESDtoAOD:CILMergeAOD.removeItem("xAOD::CaloClusterAuxContainer#CaloC c

,→ alTopoClustersAux.LATERAL.LONGITUDINAL.SECOND_R.SECOND_LAMBDA.CENTE c

,→ R_MAG.CENTER_LAMBDA.FIRST_ENG_DENS.ENG_FRAC_MAX.ISOLATION.ENG_BAD_C c

,→ ELLS.N_BAD_CELLS.BADLARQ_FRAC.ENG_BAD_HV_CELLS.N_BAD_HV_CELLS.ENG_P c

,→ OS.SIGNIFICANCE.CELL_SIGNIFICANCE.CELL_SIG_SAMPLING.AVG_LAR_Q.AVG_T c ,→ ILE_Q.EM_PROBABILITY.PTD.BadChannelList");

CILMergeAOD.add("xAOD::CaloClusterAuxContainer#CaloCalTopoClustersAux.N c

,→ _BAD_CELLS.ENG_BAD_CELLS.BADLARQ_FRAC.AVG_TILE_Q.AVG_LAR_Q.CENTER_M c

,→ AG.ENG_POS.CENTER_LAMBDA.SECOND_LAMBDA.SECOND_R.ISOLATION.EM_PROBAB c ,→ ILITY"); StreamAOD.ItemList=CILMergeAOD(); ' --preExec ,→ 'DQMonFlags.enableLumiAccess=False;DQMonFlags.doCTPMon=False;from

,→ MuonRecExample.MuonRecFlags import muonRecFlags;muonRecFlags.useLoo c ,→ seErrorTuning.set_Value_and_Lock(True); DQMonFlags.enableLumiAccess=False; from InDetRecExample.InDetJobProperties import InDetFlags; from BTagging.BTaggingFlags import BTaggingFlags;

BTaggingFlags.btaggingAODList=["xAOD::BTaggingContainer#BTagging_AntiKt c

,→ 4EMTopo","xAOD::BTaggingAuxContainer#BTagging_AntiKt4EMTopoAux.","x c

,→ AOD::BTagVertexContainer#BTagging_AntiKt4EMTopoJFVtx","xAOD::BTagVe c

,→ rtexAuxContainer#BTagging_AntiKt4EMTopoJFVtxAux.","xAOD::VertexCont c

,→ ainer#BTagging_AntiKt4EMTopoSecVtx","xAOD::VertexAuxContainer#BTagg c ,→ ing_AntiKt4EMTopoSecVtxAux.-vxTrackAtVertex"]; ''RAWtoESD:from InDetRecExample.InDetJobProperties import InDetFlags; InDetFlags.cutLevel.set_Value_and_Lock(14); from JetRec import JetRecUtils; f=lambda s:["xAOD::JetContainer#AntiKt4\%sJets"\%(s,),"xAOD::JetAuxCont c

,→ ainer#AntiKt4\%sJetsAux."\%(s,),"xAOD::EventShape#Kt4\%sEventShape" c

,→ \%(s,),"xAOD::EventShapeAuxInfo#Kt4\%sEventShapeAux."\%(s,),"xAOD:: c

,→ EventShape#Kt4\%sOriginEventShape"\%(s,),"xAOD::EventShapeAuxInfo#K c ,→ t4\%sOriginEventShapeAux."\%(s,)];


JetRecUtils.retrieveAODList= lambda:

,→ f("EMPFlow")+f("LCTopo")+f("EMTopo")+["xAOD::EventShape#NeutralPar c

,→ ticleFlowIsoCentralEventShape","xAOD::EventShapeAuxInfo#NeutralPar c

,→ ticleFlowIsoCentralEventShapeAux.","xAOD::EventShape#NeutralPartic c

,→ leFlowIsoForwardEventShape","xAOD::EventShapeAuxInfo#NeutralPartic c

,→ leFlowIsoForwardEventShapeAux.","xAOD::EventShape#ParticleFlowIsoC c

,→ entralEventShape","xAOD::EventShapeAuxInfo#ParticleFlowIsoCentralE c

,→ ventShapeAux.","xAOD::EventShape#ParticleFlowIsoForwardEventShape" c

,→ ,"xAOD::EventShapeAuxInfo#ParticleFlowIsoForwardEventShapeAux.","x c

,→ AOD::EventShape#TopoClusterIsoCentralEventShape","xAOD::EventShape c

,→ AuxInfo#TopoClusterIsoCentralEventShapeAux.","xAOD::EventShape#Top c

,→ oClusterIsoForwardEventShape","xAOD::EventShapeAuxInfo#TopoCluster c

,→ IsoForwardEventShapeAux.","xAOD::CaloClusterContainer#EMOriginTopo c

,→ Clusters","xAOD::ShallowAuxContainer#EMOriginTopoClustersAux.","xA c

,→ OD::CaloClusterContainer#LCOriginTopoClusters","xAOD::ShallowAuxCo c ,→ ntainer#LCOriginTopoClustersAux."]; from eflowRec.eflowRecFlags import jobproperties;

jobproperties.eflowRecFlags.useAODReductionClusterMomentList.set_Value c ,→ _and_Lock(True); from TriggerJobOpts.TriggerFlags import TriggerFlags; TriggerFlags.AODEDMSet.set_Value_and_Lock("AODFULL"); from LArConditionsCommon.LArCondFlags import larCondFlags; larCondFlags.OFCShapeFolder.set_Value_and_Lock("4samples1phase"); ''ESDtoAOD:from ParticleBuilderOptions.AODFlags import AODFlags; AODFlags.ThinNegativeEnergyCaloClusters.set_Value_and_Lock(True); AODFlags.ThinNegativeEnergyNeutralPFOs.set_Value_and_Lock(True); from JetRec import JetRecUtils; aodlist= JetRecUtils.retrieveAODList(); JetRecUtils.retrieveAODList= lambda : [item for item in aodlist if ,→ not "OriginTopoClusters" in item]; ' --autoConfiguration=everything --beamType=collisions --conditionsTag ,→ default:CONDBR2-BLKPA-2015-05 ,→ --geometryVersion=ATLAS-R2-2015-03-01-00 --runNumber=270588 ,→ --AMITag=f594 ,→ --outputDESDM_EGAMMAFile=DESDM_EGAMMA.11258128._000821.pool.root.1 ,→ --outputDRAW_EGZFile=DRAW_EGZ.11258128._000821.pool.root.1 ,→ --outputDRAW_EMUFile=DRAW_EMU.11258128._000821.pool.root.1 ,→ --outputDRAW_TAUMUHFile=DRAW_TAUMUH.11258128._000821.pool.root.1 ,→ --outputDRAW_ZMUMUFile=DRAW_ZMUMU.11258128._000821.pool.root.1 ,→ --outputAODFile=AOD.11258128._000821.pool.root.1 ,→ --outputHISTFile=HIST.11258128._000821.pool.root.1 --jobNumber=821 ,→ &> output.txt &

A.6.9 Reconstruction 7

Different input file for each job inside a VM and also different input files across VMs.

Athena version: asetup --cmtconfig=x86_64-slc6-gcc62-opt Athena,21.0.39,notest

The command line input was:

Reco_tf.py--inputBSFile=/test/reco_diffdiff/*.data--postExec ,→ 'all:from AthenaCommon.AppMgr import ServiceMgr as svcMgr; svcMgr.AthenaPoolCnvSvc.MaxFileSizes=["15000000000"]; ' --preExec 'all:from InDetRecExample.InDetJobProperties import ,→ InDetFlags; InDetFlags.useDynamicAlignFolders.set_Value_and_Lock(True); ''ESDtoDPD:from AthenaCommon.AppMgr import ServiceMgr; import MuonRPC_Cabling.MuonRPC_CablingConfig; ServiceMgr.MuonRPC_CablingSvc.RPCMapfromCool=False;

ServiceMgr.MuonRPC_CablingSvc.CorrFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.corr";

ServiceMgr.MuonRPC_CablingSvc.ConfFileName="LVL1confAtlasRUN2_ver10 c ,→ 4.data";


' --conditionsTag all:CONDBR2-BLKPA-2017-13 ,→ --geometryVersion=all:ATLAS-R2-2016-01-00-01 --runNumber=331085

,→ --AMITag=r10053 --outputDAOD_IDTIDEFile=DAOD_IDTIDE.12444252._0 c ,→ 16806.pool.root.1

,→ --outputDESDM_CALJETFile=DESDM_CALJET.12444252._016806.pool.roo c ,→ t.1

,→ --outputDESDM_EGAMMAFile=DESDM_EGAMMA.12444252._016806.pool.roo c ,→ t.1

,→ --outputDESDM_EXOTHIPFile=DESDM_EXOTHIP.12444252._016806.pool.r c ,→ oot.1 ,→ --outputDESDM_MCPFile=DESDM_MCP.12444252._016806.pool.root.1

,→ --outputDESDM_PHOJETFile=DESDM_PHOJET.12444252._016806.pool.roo c ,→ t.1

,→ --outputDESDM_SGLELFile=DESDM_SGLEL.12444252._016806.pool.root. c ,→ 1

,→ --outputDESDM_SLTTMUFile=DESDM_SLTTMU.12444252._016806.pool.roo c ,→ t.1 --outputDRAW_EGZFile=DRAW_EGZ.12444252._016806.pool.root.1 ,→ --outputDRAW_EMUFile=DRAW_EMU.12444252._016806.pool.root.1 ,→ --outputDRAW_RPVLLFile=DRAW_RPVLL.12444252._016806.pool.root.1

,→ --outputDRAW_TAUMUHFile=DRAW_TAUMUH.12444252._016806.pool.root. c ,→ 1

,→ --outputDRAW_TOPSLMUFile=DRAW_TOPSLMU.12444252._016806.pool.roo c ,→ t.1 ,→ --outputDRAW_ZMUMUFile=DRAW_ZMUMU.12444252._016806.pool.root.1 ,→ --outputAODFile=AOD.12444252._016806.pool.root.1 ,→ --outputHISTFile=HIST.12444252._016806.pool.root.1 ,→ --jobNumber=10016 &> output.txt &

A.6.10 Digitisation and reconstruction 1

Athena version: asetup --cmtconfig=x86_64-slc6-gcc62-opt AtlasOffline,21.0.20

The command line input was:

205 Reco_tf.py--inputHITSFile=/test/digireco/HITS.12118682._003334.poo c ,→ l.root.1,/test/digireco/HITS.12118682._003335.pool.root.1 ,→ --maxEvents=200--postExec ,→ "all:CfgMgr.MessageSvc().setError+=[\"HepMcParticleLink\"]"

,→ "RDOtoRDOTrigger:conddb.addOverride(\"/CALO/Ofl/Noise/PileUpNoi c ,→ seLumi\",\"CALOOflNoisePileUpNoiseLumi-mc15-mu30-dt25ns\")" ,→ --postInclude "default:PyJobTransforms/UseFrontier.py" ,→ --preExec "all:rec.Commissioning.set_Value_and_Lock(True);from

,→ AthenaCommon.BeamFlags import jobproperties;jobproperties.Beam. c ,→ numberOfCollisions.set_Value_and_Lock(20.0);from

,→ LArROD.LArRODFlags import larRODFlags;larRODFlags.NumberOfColli c

,→ sions.set_Value_and_Lock(20);larRODFlags.nSamples.set_Value_and c

,→ _Lock(4);larRODFlags.doOFCPileupOptimization.set_Value_and_Lock c

,→ (True);larRODFlags.firstSample.set_Value_and_Lock(0);larRODFlag c ,→ s.useHighestGainAutoCorr.set_Value_and_Lock(True)" "all:from ,→ TriggerJobOpts.TriggerFlags import TriggerFlags as ,→ TF;TF.run2Config='2016'"--preInclude

,→ "HITtoRDO:Digitization/ForceUseOfPileUpTools.py,SimulationJobOp c

,→ tions/preInclude.PileUpBunchTrainsMC15_2015_25ns_Config1.py,Run c ,→ DependentSimData/configLumi_run284500_mc16a.py"--skipEvents=0 ,→ --autoConfiguration=everything--conditionsTag ,→ "default:OFLCOND-MC16-SDR-16" ,→ --geometryVersion="default:ATLAS-R2-2016-01-00-01" ,→ --runNumber=363868--digiSeedOffset1=1664 ,→ --digiSeedOffset2=1664 ,→ --digiSteeringConf='StandardSignalOnlyTruth'--AMITag=r9364 ,→ --steering="doRDO_TRIG"

,→ --inputHighPtMinbiasHitsFile=/test/digireco/HITS.10701335._0040 c ,→ 03.pool.root.1,/test/digireco/HITS.10701335._004004.pool.root.1

,→ --inputLowPtMinbiasHitsFile=/test/digireco/HITS.10701323._00861 c

,→ 0.pool.root.1,/test/digireco/HITS.10701323._008611.pool.root.1, c

,→ /test/digireco/HITS.10701323._008612.pool.root.1,/test/digireco c

,→ /HITS.10701323._008613.pool.root.1,/test/digireco/HITS.10701323 c

,→ ._008614.pool.root.1,/test/digireco/HITS.10701323._008615.pool. c ,→ root.1,/test/digireco/HITS.10701323._008616.pool.root.1 ,→ --numberOfCavernBkg=0--numberOfHighPtMinBias=0.116075313 ,→ --numberOfLowPtMinBias=44.3839246425--pileupFinalBunch=6 ,→ --outputRDO_TRIGFile=RDOTrigger.12118684._001664.pool.root.1

,→ --jobNumber=1664--triggerConfig="RDOtoRDOTrigger=MCRECO:DBF:TR c ,→ IGGERDBMC:2136,35,160" &> output.txt ,→ &


A.6.11 Digitisation and reconstruction 2

Athena version: asetup --cmtconfig=x86_64-slc6-gcc62-opt AtlasOffline,21.0.20
The command line input was:

Reco_tf.py

,→ --inputHITSFile=/test/digireco_new/HITS.12425264._000086.pool.r c ,→ oot.1,/test/digireco_new/HITS.12425264._000087.pool.root.1 ,→ --maxEvents=2000--postExec ,→ 'all:CfgMgr.MessageSvc().setError+=["HepMcParticleLink"]' ,→ 'ESDtoAOD:fixedAttrib=[s if "CONTAINER_SPLITLEVEL = ,→ '"'"'99'"'"'" not in s else "" for s in ,→ svcMgr.AthenaPoolCnvSvc.PoolAttributes]; svcMgr.AthenaPoolCnvSvc.PoolAttributes=fixedAttrib'

,→ 'RDOtoRDOTrigger:conddb.addOverride("/CALO/Ofl/Noise/PileUpNois c ,→ eLumi","CALOOflNoisePileUpNoiseLumi-mc15-mu30-dt25ns")'

,→ 'ESDtoAOD:CILMergeAOD.removeItem("xAOD::CaloClusterAuxContainer c

,→ #CaloCalTopoClustersAux.LATERAL.LONGITUDINAL.SECOND_R.SECOND_LA c

,→ MBDA.CENTER_MAG.CENTER_LAMBDA.FIRST_ENG_DENS.ENG_FRAC_MAX.ISOLA c

,→ TION.ENG_BAD_CELLS.N_BAD_CELLS.BADLARQ_FRAC.ENG_BAD_HV_CELLS.N_ c

,→ BAD_HV_CELLS.ENG_POS.SIGNIFICANCE.CELL_SIGNIFICANCE.CELL_SIG_SA c ,→ MPLING.AVG_LAR_Q.AVG_TILE_Q.EM_PROBABILITY.PTD.BadChannelList");

CILMergeAOD.add("xAOD::CaloClusterAuxContainer#CaloCalTopoClustersA c

,→ ux.N_BAD_CELLS.ENG_BAD_CELLS.BADLARQ_FRAC.AVG_TILE_Q.AVG_LAR_Q. c

,→ CENTER_MAG.ENG_POS.CENTER_LAMBDA.SECOND_LAMBDA.SECOND_R.ISOLATI c ,→ ON.EM_PROBABILITY"); StreamAOD.ItemList=CILMergeAOD()' --postInclude ,→ default:PyJobTransforms/UseFrontier.py --preExec ,→ 'all:rec.Commissioning.set_Value_and_Lock(True); from AthenaCommon.BeamFlags import jobproperties; jobproperties.Beam.numberOfCollisions.set_Value_and_Lock(20.0); from LArROD.LArRODFlags import larRODFlags; larRODFlags.NumberOfCollisions.set_Value_and_Lock(20); larRODFlags.nSamples.set_Value_and_Lock(4); larRODFlags.doOFCPileupOptimization.set_Value_and_Lock(True); larRODFlags.firstSample.set_Value_and_Lock(0); larRODFlags.useHighestGainAutoCorr.set_Value_and_Lock(True)' ,→ 'all:from TriggerJobOpts.TriggerFlags import TriggerFlags as TF; TF.run2Config='"'"'2016'"'"'' 'RAWtoESD:from ,→ InDetRecExample.InDetJobProperties import InDetFlags; InDetFlags.cutLevel.set_Value_and_Lock(14); from JetRec import JetRecUtils;

f=lambda s:["xAOD::JetContainer#AntiKt4\%sJets"\%(s,),"xAOD::JetAux c

,→ Container#AntiKt4\%sJetsAux."\%(s,),"xAOD::EventShape#Kt4\%sEve c

,→ ntShape"\%(s,),"xAOD::EventShapeAuxInfo#Kt4\%sEventShapeAux."\% c

,→ (s,),"xAOD::EventShape#Kt4\%sOriginEventShape"\%(s,),"xAOD::Eve c ,→ ntShapeAuxInfo#Kt4\%sOriginEventShapeAux."\%(s,)]; JetRecUtils.retrieveAODList= lambda:

,→ f("EMPFlow")+f("LCTopo")+f("EMTopo")+["xAOD::EventShape#Neutra c

,→ lParticleFlowIsoCentralEventShape","xAOD::EventShapeAuxInfo#Ne c ,→ utralParticleFlowIsoCentralEventShapeAux.",

,→ "xAOD::EventShape#NeutralParticleFlowIsoForwardEventShape","xA c

,→ OD::EventShapeAuxInfo#NeutralParticleFlowIsoForwardEventShapeA c ,→ ux.",

,→ "xAOD::EventShape#ParticleFlowIsoCentralEventShape","xAOD::Eve c ,→ ntShapeAuxInfo#ParticleFlowIsoCentralEventShapeAux.",

,→ "xAOD::EventShape#ParticleFlowIsoForwardEventShape","xAOD::Eve c ,→ ntShapeAuxInfo#ParticleFlowIsoForwardEventShapeAux.",

,→ "xAOD::EventShape#TopoClusterIsoCentralEventShape","xAOD::Even c ,→ tShapeAuxInfo#TopoClusterIsoCentralEventShapeAux.",

,→ "xAOD::EventShape#TopoClusterIsoForwardEventShape","xAOD::Even c

,→ tShapeAuxInfo#TopoClusterIsoForwardEventShapeAux.","xAOD::Calo c

,→ ClusterContainer#EMOriginTopoClusters","xAOD::ShallowAuxContai c

,→ ner#EMOriginTopoClustersAux.","xAOD::CaloClusterContainer#LCOr c

,→ iginTopoClusters","xAOD::ShallowAuxContainer#LCOriginTopoClust c ,→ ersAux."]; from eflowRec.eflowRecFlags import jobproperties;

jobproperties.eflowRecFlags.useAODReductionClusterMomentList.set_V c ,→ alue_and_Lock(True); from TriggerJobOpts.TriggerFlags import TriggerFlags; TriggerFlags.AODEDMSet.set_Value_and_Lock("AODSLIM"); ''all:from BTagging.BTaggingFlags import BTaggingFlags;

BTaggingFlags.btaggingAODList=["xAOD::BTaggingContainer#BTagging_An c

,→ tiKt4EMTopo","xAOD::BTaggingAuxContainer#BTagging_AntiKt4EMTopo c

,→ Aux.","xAOD::BTagVertexContainer#BTagging_AntiKt4EMTopoJFVtx"," c

,→ xAOD::BTagVertexAuxContainer#BTagging_AntiKt4EMTopoJFVtxAux."," c

,→ xAOD::VertexContainer#BTagging_AntiKt4EMTopoSecVtx","xAOD::Vert c

,→ exAuxContainer#BTagging_AntiKt4EMTopoSecVtxAux.-vxTrackAtVertex c ,→ "]; ''ESDtoAOD:from ParticleBuilderOptions.AODFlags import AODFlags; AODFlags.ThinGeantTruth.set_Value_and_Lock(True); AODFlags.ThinNegativeEnergyCaloClusters.set_Value_and_Lock(True); AODFlags.ThinNegativeEnergyNeutralPFOs.set_Value_and_Lock(True); from JetRec import JetRecUtils; aodlist= JetRecUtils.retrieveAODList();


JetRecUtils.retrieveAODList= lambda : [item for item in aodlist ,→ if not "OriginTopoClusters" in item];

' --preInclude HITtoRDO:Digitization/ForceUseOfPileUpTools.py,Simul c

,→ ationJobOptions/preInclude.PileUpBunchTrainsMC15_2015_25ns_Conf c ,→ ig1.py,RunDependentSimData/configLumi_run284500_mc16a.py ,→ --skipEvents=0 --autoConfiguration=everything --conditionsTag ,→ default:OFLCOND-MC16-SDR-16 ,→ --geometryVersion=default:ATLAS-R2-2016-01-00-01 ,→ --runNumber=300307 --digiSeedOffset1=23 --digiSeedOffset2=23 ,→ --digiSteeringConf=StandardSignalOnlyTruth --AMITag=r9364

,→ --steering=doRDO_TRIG --inputHighPtMinbiasHitsFile=/test/digire c

,→ co_new/HITS.10701335._000080.pool.root.1,/test/digireco_new/HIT c ,→ S.10701335._000081.pool.root.1

,→ --inputLowPtMinbiasHitsFile=/test/digireco_new/HITS.10701323._0 c

,→ 01011.pool.root.1,/test/digireco_new/HITS.10701323._001012.pool c

,→ .root.1,/test/digireco_new/HITS.10701323._001013.pool.root.1,/t c

,→ est/digireco_new/HITS.10701323._001014.pool.root.1,/test/digire c

,→ co_new/HITS.10701323._001015.pool.root.1,/test/digireco_new/HIT c ,→ S.10701323._001016.pool.root.1 --numberOfCavernBkg=0 ,→ --numberOfHighPtMinBias=0.116075313 ,→ --numberOfLowPtMinBias=44.3839246425 --pileupFinalBunch=6 ,→ --outputAODFile=AOD.12425266._000023.pool.root.1 --jobNumber=23

,→ --triggerConfig=RDOtoRDOTrigger=MCRECO:DBF:TRIGGERDBMC:2136,35, c ,→ 160 &> output.txt ,→ &

A.6.12 Digitisation and reconstruction 3

Fetching the remote input data from EOS. Single core only. Athena version: asetup --cmtconfig=x86_64-slc6-gcc62-opt AtlasOffline,21.0.20
The command line input was:

export ATHENA_PROC_NUMBER=0

Reco_tf.py--inputHITSFile=root://eosatlas.cern.ch:1094//eos/atlas/ c

,→ atlasscratchdisk/rucio/mc16_13TeV/37/41/HITS.12425264._000086.p c

,→ ool.root.1,root://eosatlas.cern.ch:1094//eos/atlas/atlasscratch c ,→ disk/rucio/mc16_13TeV/a3/85/HITS.12425264._000087.pool.root.1 ,→ --maxEvents=200--postExec ,→ 'all:CfgMgr.MessageSvc().setError+=["HepMcParticleLink"]' ,→ 'ESDtoAOD:fixedAttrib=[s if "CONTAINER_SPLITLEVEL = ,→ '"'"'99'"'"'" not in s else "" for s in ,→ svcMgr.AthenaPoolCnvSvc.PoolAttributes];

svcMgr.AthenaPoolCnvSvc.PoolAttributes=fixedAttrib'

,→ 'RDOtoRDOTrigger:conddb.addOverride("/CALO/Ofl/Noise/PileUpNois c ,→ eLumi","CALOOflNoisePileUpNoiseLumi-mc15-mu30-dt25ns")'

,→ 'ESDtoAOD:CILMergeAOD.removeItem("xAOD::CaloClusterAuxContainer c

,→ #CaloCalTopoClustersAux.LATERAL.LONGITUDINAL.SECOND_R.SECOND_LA c

,→ MBDA.CENTER_MAG.CENTER_LAMBDA.FIRST_ENG_DENS.ENG_FRAC_MAX.ISOLA c

,→ TION.ENG_BAD_CELLS.N_BAD_CELLS.BADLARQ_FRAC.ENG_BAD_HV_CELLS.N_ c

,→ BAD_HV_CELLS.ENG_POS.SIGNIFICANCE.CELL_SIGNIFICANCE.CELL_SIG_SA c ,→ MPLING.AVG_LAR_Q.AVG_TILE_Q.EM_PROBABILITY.PTD.BadChannelList");

CILMergeAOD.add("xAOD::CaloClusterAuxContainer#CaloCalTopoClustersA c

,→ ux.N_BAD_CELLS.ENG_BAD_CELLS.BADLARQ_FRAC.AVG_TILE_Q.AVG_LAR_Q. c

,→ CENTER_MAG.ENG_POS.CENTER_LAMBDA.SECOND_LAMBDA.SECOND_R.ISOLATI c ,→ ON.EM_PROBABILITY"); StreamAOD.ItemList=CILMergeAOD()' --postInclude ,→ default:PyJobTransforms/UseFrontier.py --preExec ,→ 'all:rec.Commissioning.set_Value_and_Lock(True); from AthenaCommon.BeamFlags import jobproperties; jobproperties.Beam.numberOfCollisions.set_Value_and_Lock(20.0); from LArROD.LArRODFlags import larRODFlags; larRODFlags.NumberOfCollisions.set_Value_and_Lock(20); larRODFlags.nSamples.set_Value_and_Lock(4); larRODFlags.doOFCPileupOptimization.set_Value_and_Lock(True); larRODFlags.firstSample.set_Value_and_Lock(0); larRODFlags.useHighestGainAutoCorr.set_Value_and_Lock(True)' ,→ 'all:from TriggerJobOpts.TriggerFlags import TriggerFlags as TF; TF.run2Config='"'"'2016'"'"'' 'RAWtoESD:from ,→ InDetRecExample.InDetJobProperties import InDetFlags; InDetFlags.cutLevel.set_Value_and_Lock(14); from JetRec import JetRecUtils;

f=lambda s:["xAOD::JetContainer#AntiKt4\%sJets"\%(s,),"xAOD::JetAux c

,→ Container#AntiKt4\%sJetsAux."\%(s,),"xAOD::EventShape#Kt4\%sEve c

,→ ntShape"\%(s,),"xAOD::EventShapeAuxInfo#Kt4\%sEventShapeAux."\% c

,→ (s,),"xAOD::EventShape#Kt4\%sOriginEventShape"\%(s,),"xAOD::Eve c ,→ ntShapeAuxInfo#Kt4\%sOriginEventShapeAux."\%(s,)];


JetRecUtils.retrieveAODList= lambda:

,→ f("EMPFlow")+f("LCTopo")+f("EMTopo")+["xAOD::EventShape#Neutra c

,→ lParticleFlowIsoCentralEventShape","xAOD::EventShapeAuxInfo#Ne c ,→ utralParticleFlowIsoCentralEventShapeAux.",

,→ "xAOD::EventShape#NeutralParticleFlowIsoForwardEventShape","xA c

,→ OD::EventShapeAuxInfo#NeutralParticleFlowIsoForwardEventShapeA c ,→ ux.",

,→ "xAOD::EventShape#ParticleFlowIsoCentralEventShape","xAOD::Eve c ,→ ntShapeAuxInfo#ParticleFlowIsoCentralEventShapeAux.",

,→ "xAOD::EventShape#ParticleFlowIsoForwardEventShape","xAOD::Eve c ,→ ntShapeAuxInfo#ParticleFlowIsoForwardEventShapeAux.",

,→ "xAOD::EventShape#TopoClusterIsoCentralEventShape","xAOD::Even c ,→ tShapeAuxInfo#TopoClusterIsoCentralEventShapeAux.",

,→ "xAOD::EventShape#TopoClusterIsoForwardEventShape","xAOD::Even c

,→ tShapeAuxInfo#TopoClusterIsoForwardEventShapeAux.","xAOD::Calo c

,→ ClusterContainer#EMOriginTopoClusters","xAOD::ShallowAuxContai c

,→ ner#EMOriginTopoClustersAux.","xAOD::CaloClusterContainer#LCOr c

,→ iginTopoClusters","xAOD::ShallowAuxContainer#LCOriginTopoClust c ,→ ersAux."]; from eflowRec.eflowRecFlags import jobproperties;

jobproperties.eflowRecFlags.useAODReductionClusterMomentList.set_V c ,→ alue_and_Lock(True); from TriggerJobOpts.TriggerFlags import TriggerFlags; TriggerFlags.AODEDMSet.set_Value_and_Lock("AODSLIM"); ''all:from BTagging.BTaggingFlags import BTaggingFlags;

BTaggingFlags.btaggingAODList=["xAOD::BTaggingContainer#BTagging_An c

,→ tiKt4EMTopo","xAOD::BTaggingAuxContainer#BTagging_AntiKt4EMTopo c

,→ Aux.","xAOD::BTagVertexContainer#BTagging_AntiKt4EMTopoJFVtx"," c

,→ xAOD::BTagVertexAuxContainer#BTagging_AntiKt4EMTopoJFVtxAux."," c

,→ xAOD::VertexContainer#BTagging_AntiKt4EMTopoSecVtx","xAOD::Vert c

,→ exAuxContainer#BTagging_AntiKt4EMTopoSecVtxAux.-vxTrackAtVertex c ,→ "]; ''ESDtoAOD:from ParticleBuilderOptions.AODFlags import AODFlags; AODFlags.ThinGeantTruth.set_Value_and_Lock(True); AODFlags.ThinNegativeEnergyCaloClusters.set_Value_and_Lock(True); AODFlags.ThinNegativeEnergyNeutralPFOs.set_Value_and_Lock(True); from JetRec import JetRecUtils; aodlist= JetRecUtils.retrieveAODList(); JetRecUtils.retrieveAODList= lambda : [item for item in aodlist ,→ if not "OriginTopoClusters" in item];

' --preInclude HITtoRDO:Digitization/ForceUseOfPileUpTools.py,Simul c

,→ ationJobOptions/preInclude.PileUpBunchTrainsMC15_2015_25ns_Conf c ,→ ig1.py,RunDependentSimData/configLumi_run284500_mc16a.py ,→ --skipEvents=0 --autoConfiguration=everything --conditionsTag ,→ default:OFLCOND-MC16-SDR-16 ,→ --geometryVersion=default:ATLAS-R2-2016-01-00-01 ,→ --runNumber=300307 --digiSeedOffset1=23 --digiSeedOffset2=23 ,→ --digiSteeringConf=StandardSignalOnlyTruth --AMITag=r9364

,→ --steering=doRDO_TRIG --inputHighPtMinbiasHitsFile=root://eosat c

,→ las.cern.ch:1094//eos/atlas/atlasscratchdisk/rucio/mc16_13TeV/0 c

,→ 9/44/HITS.10701335._000080.pool.root.1,root://eosatlas.cern.ch: c

,→ 1094//eos/atlas/atlasscratchdisk/rucio/mc16_13TeV/1a/79/HITS.10 c ,→ 701335._000081.pool.root.1

,→ --inputLowPtMinbiasHitsFile=root://eosatlas.cern.ch:1094//eos/a c

,→ tlas/atlasscratchdisk/rucio/mc16_13TeV/6b/c7/HITS.10701323._001 c

,→ 011.pool.root.1,root://eosatlas.cern.ch:1094//eos/atlas/atlassc c

,→ ratchdisk/rucio/mc16_13TeV/cf/e6/HITS.10701323._001012.pool.roo c

,→ t.1,root://eosatlas.cern.ch:1094//eos/atlas/atlasscratchdisk/ru c

,→ cio/mc16_13TeV/a7/01/HITS.10701323._001013.pool.root.1,root://e c

,→ osatlas.cern.ch:1094//eos/atlas/atlasscratchdisk/rucio/mc16_13T c

,→ eV/5e/1c/HITS.10701323._001014.pool.root.1,root://eosatlas.cern c

,→ .ch:1094//eos/atlas/atlasscratchdisk/rucio/mc16_13TeV/ac/b4/HIT c

,→ S.10701323._001015.pool.root.1,root://eosatlas.cern.ch:1094//eo c

,→ s/atlas/atlasscratchdisk/rucio/mc16_13TeV/7d/35/HITS.10701323._ c ,→ 001016.pool.root.1 --numberOfCavernBkg=0 ,→ --numberOfHighPtMinBias=0.116075313 ,→ --numberOfLowPtMinBias=44.3839246425 --pileupFinalBunch=6 ,→ --outputAODFile=AOD.12425266._000023.pool.root.1 --jobNumber=23

,→ --triggerConfig=RDOtoRDOTrigger=MCRECO:DBF:TRIGGERDBMC:2136,35, c ,→ 160 &> output.txt ,→ &

A.6.13 Digitisation and reconstruction 4

Fetching the remote input data from T-Systems. Single core only. Athena version: asetup --cmtconfig=x86_64-slc6-gcc62-opt AtlasOffline,21.0.20
The command line input was:

export ATHENA_PROC_NUMBER=0


Reco_tf.py--inputHITSFile=https://obs.eu-de.otc.t-systems.com/tsy- c

,→ bucket1/HITS.12425264._000086.pool.root.1,https://obs.eu-de.otc c ,→ .t-systems.com/tsy-bucket1/HITS.12425264._000087.pool.root.1 ,→ --maxEvents=200--postExec ,→ 'all:CfgMgr.MessageSvc().setError+=["HepMcParticleLink"]' ,→ 'ESDtoAOD:fixedAttrib=[s if "CONTAINER_SPLITLEVEL = ,→ '"'"'99'"'"'" not in s else "" for s in ,→ svcMgr.AthenaPoolCnvSvc.PoolAttributes]; svcMgr.AthenaPoolCnvSvc.PoolAttributes=fixedAttrib'

,→ 'RDOtoRDOTrigger:conddb.addOverride("/CALO/Ofl/Noise/PileUpNois c ,→ eLumi","CALOOflNoisePileUpNoiseLumi-mc15-mu30-dt25ns")'

,→ 'ESDtoAOD:CILMergeAOD.removeItem("xAOD::CaloClusterAuxContainer c

,→ #CaloCalTopoClustersAux.LATERAL.LONGITUDINAL.SECOND_R.SECOND_LA c

,→ MBDA.CENTER_MAG.CENTER_LAMBDA.FIRST_ENG_DENS.ENG_FRAC_MAX.ISOLA c

,→ TION.ENG_BAD_CELLS.N_BAD_CELLS.BADLARQ_FRAC.ENG_BAD_HV_CELLS.N_ c

,→ BAD_HV_CELLS.ENG_POS.SIGNIFICANCE.CELL_SIGNIFICANCE.CELL_SIG_SA c ,→ MPLING.AVG_LAR_Q.AVG_TILE_Q.EM_PROBABILITY.PTD.BadChannelList");

CILMergeAOD.add("xAOD::CaloClusterAuxContainer#CaloCalTopoClustersA c

,→ ux.N_BAD_CELLS.ENG_BAD_CELLS.BADLARQ_FRAC.AVG_TILE_Q.AVG_LAR_Q. c

,→ CENTER_MAG.ENG_POS.CENTER_LAMBDA.SECOND_LAMBDA.SECOND_R.ISOLATI c ,→ ON.EM_PROBABILITY"); StreamAOD.ItemList=CILMergeAOD()' --postInclude ,→ default:PyJobTransforms/UseFrontier.py --preExec ,→ 'all:rec.Commissioning.set_Value_and_Lock(True); from AthenaCommon.BeamFlags import jobproperties; jobproperties.Beam.numberOfCollisions.set_Value_and_Lock(20.0); from LArROD.LArRODFlags import larRODFlags; larRODFlags.NumberOfCollisions.set_Value_and_Lock(20); larRODFlags.nSamples.set_Value_and_Lock(4); larRODFlags.doOFCPileupOptimization.set_Value_and_Lock(True); larRODFlags.firstSample.set_Value_and_Lock(0); larRODFlags.useHighestGainAutoCorr.set_Value_and_Lock(True)' ,→ 'all:from TriggerJobOpts.TriggerFlags import TriggerFlags as TF; TF.run2Config='"'"'2016'"'"'' 'RAWtoESD:from ,→ InDetRecExample.InDetJobProperties import InDetFlags; InDetFlags.cutLevel.set_Value_and_Lock(14); from JetRec import JetRecUtils; f=lambda s:["xAOD::JetContainer#AntiKt4%sJets"%(s,),"xAOD::JetAuxCo c

,→ ntainer#AntiKt4%sJetsAux."%(s,),"xAOD::EventShape#Kt4%sEventSha c

,→ pe"%(s,),"xAOD::EventShapeAuxInfo#Kt4%sEventShapeAux."%(s,),"xA c

,→ OD::EventShape#Kt4%sOriginEventShape"%(s,),"xAOD::EventShapeAux c ,→ Info#Kt4%sOriginEventShapeAux."%(s,)];

JetRecUtils.retrieveAODList= lambda:

,→ f("EMPFlow")+f("LCTopo")+f("EMTopo")+["xAOD::EventShape#Neutra c

,→ lParticleFlowIsoCentralEventShape","xAOD::EventShapeAuxInfo#Ne c ,→ utralParticleFlowIsoCentralEventShapeAux.",

,→ "xAOD::EventShape#NeutralParticleFlowIsoForwardEventShape","xA c

,→ OD::EventShapeAuxInfo#NeutralParticleFlowIsoForwardEventShapeA c ,→ ux.",

,→ "xAOD::EventShape#ParticleFlowIsoCentralEventShape","xAOD::Eve c ,→ ntShapeAuxInfo#ParticleFlowIsoCentralEventShapeAux.",

,→ "xAOD::EventShape#ParticleFlowIsoForwardEventShape","xAOD::Eve c ,→ ntShapeAuxInfo#ParticleFlowIsoForwardEventShapeAux.",

,→ "xAOD::EventShape#TopoClusterIsoCentralEventShape","xAOD::Even c ,→ tShapeAuxInfo#TopoClusterIsoCentralEventShapeAux.",

,→ "xAOD::EventShape#TopoClusterIsoForwardEventShape","xAOD::Even c

,→ tShapeAuxInfo#TopoClusterIsoForwardEventShapeAux.","xAOD::Calo c

,→ ClusterContainer#EMOriginTopoClusters","xAOD::ShallowAuxContai c

,→ ner#EMOriginTopoClustersAux.","xAOD::CaloClusterContainer#LCOr c

,→ iginTopoClusters","xAOD::ShallowAuxContainer#LCOriginTopoClust c ,→ ersAux."]; from eflowRec.eflowRecFlags import jobproperties;

jobproperties.eflowRecFlags.useAODReductionClusterMomentList.set_V c ,→ alue_and_Lock(True); from TriggerJobOpts.TriggerFlags import TriggerFlags; TriggerFlags.AODEDMSet.set_Value_and_Lock("AODSLIM"); ''all:from BTagging.BTaggingFlags import BTaggingFlags;

BTaggingFlags.btaggingAODList=["xAOD::BTaggingContainer#BTagging_An c

,→ tiKt4EMTopo","xAOD::BTaggingAuxContainer#BTagging_AntiKt4EMTopo c

,→ Aux.","xAOD::BTagVertexContainer#BTagging_AntiKt4EMTopoJFVtx"," c

,→ xAOD::BTagVertexAuxContainer#BTagging_AntiKt4EMTopoJFVtxAux."," c

,→ xAOD::VertexContainer#BTagging_AntiKt4EMTopoSecVtx","xAOD::Vert c

,→ exAuxContainer#BTagging_AntiKt4EMTopoSecVtxAux.-vxTrackAtVertex c ,→ "]; ''ESDtoAOD:from ParticleBuilderOptions.AODFlags import AODFlags; AODFlags.ThinGeantTruth.set_Value_and_Lock(True); AODFlags.ThinNegativeEnergyCaloClusters.set_Value_and_Lock(True); AODFlags.ThinNegativeEnergyNeutralPFOs.set_Value_and_Lock(True); from JetRec import JetRecUtils; aodlist= JetRecUtils.retrieveAODList(); JetRecUtils.retrieveAODList= lambda : [item for item in aodlist ,→ if not "OriginTopoClusters" in item];

214 A.7 Hardware

' --preInclude HITtoRDO:Digitization/ForceUseOfPileUpTools.py,Simul c

,→ ationJobOptions/preInclude.PileUpBunchTrainsMC15_2015_25ns_Conf c ,→ ig1.py,RunDependentSimData/configLumi_run284500_mc16a.py ,→ --skipEvents=0 --autoConfiguration=everything --conditionsTag ,→ default:OFLCOND-MC16-SDR-16 ,→ --geometryVersion=default:ATLAS-R2-2016-01-00-01 ,→ --runNumber=300307 --digiSeedOffset1=23 --digiSeedOffset2=23 ,→ --digiSteeringConf=StandardSignalOnlyTruth --AMITag=r9364

,→ --steering=doRDO_TRIG --inputHighPtMinbiasHitsFile=https://obs. c

,→ eu-de.otc.t-systems.com/tsy-bucket1/HITS.10701335._000080.pool. c

,→ root.1,https://obs.eu-de.otc.t-systems.com/tsy-bucket1/HITS.107 c ,→ 01335._000081.pool.root.1

,→ --inputLowPtMinbiasHitsFile=https://obs.eu-de.otc.t-systems.com c

,→ /tsy-bucket1/HITS.10701323._001011.pool.root.1,https://obs.eu-d c

,→ e.otc.t-systems.com/tsy-bucket1/HITS.10701323._001012.pool.root c

,→ .1,https://obs.eu-de.otc.t-systems.com/tsy-bucket1/HITS.1070132 c

,→ 3._001013.pool.root.1,https://obs.eu-de.otc.t-systems.com/tsy-b c

,→ ucket1/HITS.10701323._001014.pool.root.1,https://obs.eu-de.otc. c

,→ t-systems.com/tsy-bucket1/HITS.10701323._001015.pool.root.1,htt c

,→ ps://obs.eu-de.otc.t-systems.com/tsy-bucket1/HITS.10701323._001 c ,→ 016.pool.root.1 --numberOfCavernBkg=0 ,→ --numberOfHighPtMinBias=0.116075313 ,→ --numberOfLowPtMinBias=44.3839246425 --pileupFinalBunch=6 ,→ --outputAODFile=AOD.12425266._000023.pool.root.1 --jobNumber=23

,→ --triggerConfig=RDOtoRDOTrigger=MCRECO:DBF:TRIGGERDBMC:2136,35, c ,→ 160 &> output.txt ,→ &
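
The HITS input files in the command above are not staged to local disk but read directly from the T-Systems object store (OBS) over HTTPS. Purely as an illustration, and not as part of the job transform itself, a minimal Python sketch such as the following could be used to check that the remote inputs are reachable before the job is launched; the two example URLs and the availability of the third-party requests package on the VM are assumptions.

import requests  # assumption: the requests package is installed on the VM

# Hypothetical subset of the remote HITS inputs referenced in the command above.
input_urls = [
    "https://obs.eu-de.otc.t-systems.com/tsy-bucket1/HITS.12425264._000086.pool.root.1",
    "https://obs.eu-de.otc.t-systems.com/tsy-bucket1/HITS.12425264._000087.pool.root.1",
]

for url in input_urls:
    # A HEAD request avoids downloading the multi-gigabyte file itself.
    response = requests.head(url, allow_redirects=True, timeout=30)
    size = response.headers.get("Content-Length", "unknown")
    print(f"{url}: HTTP {response.status_code}, {size} bytes")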

A.6.14 Digitisation and reconstruction 5

This workflow is a repeat of the one described in Subsection A.6.13.

A.7 Hardware

More details on the hardware specifications of the utilised VMs are given below.

A.7.1 Göttingen

One VM from Göttingen was used.

PCATLAS46

The VM was on a hypervisor to which no other user had access. It was a completely controlled environment. Therefore, the tests that were performed on PCATLAS46 could not have been influenced by neighbouring VMs.

Listing 6: “lscpu” result on PCATLAS46 at Göttingen.

[root@pcatlas46 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 15
Model name:            Intel(R) Xeon(R) CPU X5355 @ 2.66GHz
Stepping:              11
CPU MHz:               1999.998
BogoMIPS:              5333.37
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7
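
The “lscpu” dumps in this appendix were collected in the same way on every VM. As a minimal sketch, and not part of the thesis tooling, the output can be parsed into key/value pairs so that properties such as the core count or the cache sizes can be compared across providers programmatically (assuming Python 3.7 or newer on the VM):

import subprocess

def read_lscpu():
    # Run lscpu and split every "Key: value" line into a dictionary entry.
    output = subprocess.run(["lscpu"], capture_output=True, text=True, check=True).stdout
    info = {}
    for line in output.splitlines():
        key, _, value = line.partition(":")
        if value:
            info[key.strip()] = value.strip()
    return info

if __name__ == "__main__":
    cpu = read_lscpu()
    print(cpu.get("Model name"), "with", cpu.get("CPU(s)"), "logical CPUs")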

A.7.2 CERN

The details of the CERN VMs follow in this subsection.

DataCloud14

The VM was on a hypervisor to which no other user had access. It was a completely controlled environment. Therefore, the tests that were performed on DataCloud14 could not have been influenced by neighbouring VMs. The hypervisor was within the CERN OpenStack cluster.

Clouddata02

This 4-core VM was on a shared hypervisor within the OpenStack cluster. Some initial profiling was performed on this VM before the dedicated infrastructures became available. The profiling results are therefore only accurate to a certain degree, but the measurements have been repeated multiple times and verified on the dedicated machines in order to exclude interference from neighbouring VMs.

Listing 7: “lscpu” result on Clouddata02 at CERN.

[root@datacloud02 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU L5640 @ 2.27GHz
Stepping:              2
CPU MHz:               2266.746
BogoMIPS:              4533.49
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-3

A.7.3 Exoscale

The Exoscale VMs were of the type ‘Huge’, with eight CPU cores and 32 GB of RAM. In order to allow a better performance comparison with the other providers, the available RAM was limited to 16 GB before each test; how this was done is shown in Listing 1 in the appendix.

Listing 8: “lscpu” result on an Exoscale VM.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Model name:            Intel Xeon E312xx (Sandy Bridge)
Stepping:              1
CPU MHz:               2399.998
BogoMIPS:              4799.99
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush fxsr sse ht syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq vmx cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm arat tpr_shadow vnmi flexpriority ept vpid xsaveopt
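
Independently of how the RAM cap was applied (the actual procedure is given in Listing 1 and is not reproduced here), the kernel's own view of the available memory can be checked before each test. The following Python sketch and its 16 GB threshold are illustrative assumptions only:

# Minimal sanity check: confirm that the VM sees no more than roughly 16 GB of RAM.
def total_memory_gib():
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])  # MemTotal is reported in kiB
                return kib / (1024 ** 2)
    raise RuntimeError("MemTotal not found in /proc/meminfo")

if __name__ == "__main__":
    mem = total_memory_gib()
    print(f"MemTotal: {mem:.1f} GiB")
    assert mem <= 16.5, "RAM limit does not appear to be active"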

A.7.4 IBM

The IBM VMs had eight CPU cores and 16 GB of memory.

Listing 9: “lscpu” result on an IBM VM.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
Stepping:              1
CPU MHz:               2100.013
BogoMIPS:              4200.08
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              40960K
NUMA node0 CPU(s):     0-7

A.7.5 T-Systems

The T-Systems VMs were of the type ‘Computing II, c2.2xlarge, 8 vCPUs, 16 GB’. The attached disks had a size of 150 GB.

Listing 10: “lscpu” result on a T-Systems VM.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2658A v3 @ 2.20GHz
Stepping:              2
CPU MHz:               2194.941
BogoMIPS:              4389.93
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c hypervisor lahf_lm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt

A.8 Additional Tables

             Exoscale        IBM             T-Systems
             Wall Time [s]   Wall Time [s]   Wall Time [s]
EvGen        2915 ± 137      3927 ± 118      4089 ± 126
MC Sim       1279 ± 61       2321 ± 353      4808 ± 1298
Reco 1       5737 ± 440      8193 ± 571      –
Reco 2       5700 ± 190      8165 ± 598      15430 ± 2612
Reco 3       4547 ± 478      7061 ± 629      8681 ± 1827
Reco 4       4528 ± 398      7009 ± 604      10785 ± 1534
Reco 5       3147 ± 88       4756 ± 680      7099 ± 1203
Reco 6       3529 ± 451      4919 ± 468      7986 ± 1810
Reco 7       5550 ± 943      7630 ± 760      15348 ± 2886
Digi Reco 1  1210 ± 123      2019 ± 172      2789 ± 524
Digi Reco 2  8381 ± 681      15116 ± 2395    25821 ± 6151

Table A2: Comparing the Clouds.

             Exoscale               IBM                    T-Systems
             Wall Time [s]    [%]   Wall Time [s]    [%]   Wall Time [s]    [%]
EvGen        2915 ± 137      4.70   3927 ± 118      3.02   4089 ± 126      3.07
MC Sim       1279 ± 61       4.77   2321 ± 353     15.19   4808 ± 1298    27.00
Reco 1       5737 ± 440      7.68   8193 ± 571      6.97   –                  –
Reco 2       5700 ± 190      3.34   8165 ± 598      7.32   15430 ± 2612   16.93
Reco 3       4547 ± 478     10.50   7061 ± 629      8.90   8681 ± 1827    21.05
Reco 4       4528 ± 398      8.78   7009 ± 604      8.62   10785 ± 1534   14.22
Reco 5       3147 ± 88       2.79   4756 ± 680     14.31   7099 ± 1203    16.95
Reco 6       3529 ± 451     12.77   4919 ± 468      9.52   7986 ± 1810    22.66
Reco 7       5550 ± 943     16.99   7630 ± 760      9.95   15348 ± 2886   18.80
Digi Reco 1  1210 ± 123     10.21   2019 ± 172      8.52   2789 ± 524     18.77
Digi Reco 2  8381 ± 681      8.12   15116 ± 2395   15.85   25821 ± 6151   23.82

Table A3: Comparing the Clouds. Displayed are the wall times together with their standard deviations; next to each, the standard deviation in percent of the wall time is shown.
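
The percentage columns of Table A3 are the relative standard deviations of the measured wall times, i.e. the standard deviation divided by the mean wall time. A minimal sketch reproducing two of the entries (numbers taken from the Exoscale column of Table A2):

# Relative standard deviation: sigma / mean * 100.
measurements = {
    "EvGen (Exoscale)": (2915, 137),    # wall time [s], standard deviation [s]
    "MC Sim (Exoscale)": (1279, 61),
}

for name, (mean, sigma) in measurements.items():
    print(f"{name}: {sigma / mean * 100:.2f} %")   # prints 4.70 % and 4.77 %, as in Table A3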


Model error   EvGen [%]   MC Sim [%]   Reco 1 [%]   Reco 2 [%]   Reco 3 [%]
IBM                0.62       -22.93        -7.40        -1.84       -14.19
TSY                1.88       -60.30       -33.89       -38.00       -28.87
EXO 1             -6.48        -0.41        -3.19         0.88        -5.17
EXO 2              3.09         0.68        -1.44        -0.51         7.32
EXO 3              3.84         5.14         2.88         0.69            –

Table A4: Model error on the large scale.

                 Model difference
                 Wall Time [%]
EvGen            0.49
MC Sim           2.68
Reconstruction   -0.28

Table A5: Deviation of the model prediction from the measurement, using the VMs at CERN and at Göttingen.
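
Tables A4 and A5 quote the deviation of the model prediction from the measurement in percent. Assuming the straightforward definition of a relative deviation (prediction minus measurement, divided by the measurement), a short sketch with hypothetical numbers reads:

def model_error_percent(predicted, measured):
    # Relative deviation of the model prediction from the measurement.
    return (predicted - measured) / measured * 100.0

# Hypothetical example: a predicted wall time of 4050 s against a measured 4089 s.
print(f"{model_error_percent(4050.0, 4089.0):.2f} %")   # approximately -0.95 %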


Gerhard Ferdinand Rzehorz ● Pestalozzistr 3 ● 76646 Bruchsal, Germany

E-mail: [email protected] ● Phone: +49 176 2425 5777

Curriculum Vitae

Personal Details

Name: Gerhard Ferdinand Rzehorz

Date of birth: 29.01.1987

Place of birth: Bruchsal

Nationality: German

Address: Pestalozzistr 3, D-76646 Bruchsal

Academic Education

November 2014 - present: Doctoral candidate at the II. Physikalisches Institut, Georg-August-Universität Göttingen. Thesis topic: Data intensive ATLAS workflows in the Cloud

April 2007 - March 2013: Diplom at the Institut für Experimentelle Kern- und Teilchenphysik, Karlsruher Institut für Technologie. Thesis topic: Grid-Site-Monitoring at Belle II

Military Service

July 2006 - March 2007: Luftwaffenausbildungsbataillon (air force training battalion) in Germersheim

School Education

September 1997 - June 2006: Justus-Knecht-Gymnasium in Bruchsal. Qualification: Abitur

August 1993 - July 1997: Johann-Peter-Hebel Grundschule in Bruchsal. Qualification: primary school certificate