A Ground-Up Approach to High-Throughput Cloud Computing in High-Energy Physics

Doctoral thesis (Tesi di Dottorato) by Dario Berzano
Università degli Studi di Torino, Scuola di Dottorato in Scienze ed Alte Tecnologie, XXVI cycle
Academic Year 2012–2013
CERN-THESIS-2014-376, 16/04/2014

Supervisor (Relatore): Prof. Massimo Masera, Università degli Studi di Torino
Co-supervisors (Correlatori): Dott. Gerardo Ganis, CERN; Dott. Stefano Bagnasco, INFN Torino
External examiner (Controrelatore): Prof. Marco Paganoni, Università degli Studi di Milano-Bicocca

«Simplicity is the ultimate sophistication.» (Apple II commercial, 1977)

Table of Contents

1 Virtualization and Cloud Computing in High Energy Physics
  1.1 Computing in the LHC era
    1.1.1 Hierarchy of computing centres
    1.1.2 Event processing paradigm
    1.1.3 Analysis framework and data format: ROOT
    1.1.4 The ALICE Experiment
  1.2 Virtualization and Cloud Computing
    1.2.1 State of the art
  1.3 From the Grid to clouds?
    1.3.1 Grid over clouds and more
  1.4 Overview of this work

2 Building a Private Cloud out of a Grid Tier-2
  2.1 Overview of the INFN Torino Computing Centre
    2.1.1 Grid VOs and ALICE
  2.2 Rationale of the Cloud migration
    2.2.1 No performance loss through virtualization
    2.2.2 Abstraction of resources in the Grid and the cloud
    2.2.3 Harmonization of diverse use cases in a private cloud
    2.2.4 IaaS: separating administrative domains
      2.2.4.1 Virtual farm administrator
      2.2.4.2 Rolling updates
    2.2.5 Computing on demand and elasticity
    2.2.6 Preparing for the near future
  2.3 The cloud controller: OpenNebula
    2.3.1 Datacenter virtualization vs. infrastructure provisioning
    2.3.2 OpenNebula features
  2.4 Designing and building the private cloud
    2.4.1 Management of the physical infrastructure
      2.4.1.1 Provisioning: Cobbler
      2.4.1.2 Configuration: Puppet
      2.4.1.3 Monitoring: Zabbix
      2.4.1.4 Network connectivity: NAT and proxy
    2.4.2 Classes of hypervisors
      2.4.2.1 Working class hypervisors
      2.4.2.2 Service hypervisors
    2.4.3 Level 2 and 3 networking
    2.4.4 Shared storage
      2.4.4.1 GlusterFS
      2.4.4.2 Virtual machine image repository
    2.4.5 Disks of running virtual guests
    2.4.6 Benchmarks
  2.5 Performance tunings and other issues
    2.5.1 Kernel SamePage Merging issues
    2.5.2 OpenNebula backing database: MySQL vs. SQLite
    2.5.3 Fast creation of ephemeral storage
      2.5.3.1 Allocation of large files
      2.5.3.2 Filesystem formatting
    2.5.4 Direct QEMU GlusterFS support
    2.5.5 Fast deployment: caching and snapshotting
      2.5.5.1 LVM
      2.5.5.2 QCow2
  2.6 Custom virtual guest images

3 Running elastic applications on the Cloud
  3.1 Virtualizing the Grid Tier-2
    3.1.1 Preparation and update of a worker node
      3.1.1.1 Creation of the base image
      3.1.1.2 Contextualization
      3.1.1.3 Dynamic configuration: Puppet
      3.1.1.4 Creating images vs. contextualizing
    3.1.2 Automatic addition of worker nodes to TORQUE
    3.1.3 Nodes self registration to Puppet
    3.1.4 CPU efficiency measurements
    3.1.5 Grid services on the cloud
  3.2 Sandboxed virtual farms
    3.2.1 Virtual networks: ebtables
    3.2.2 Virtual Router: OpenWRT
      3.2.2.1 Building a custom OpenWRT image
      3.2.2.2 Contextualization
      3.2.2.3 Virtual Routers management and security
      3.2.2.4 VPN for geographically distributed virtual farms
    3.2.3 OpenNebula EC2 interface: econe-server
      3.2.3.1 EC2 credentials
      3.2.3.2 Mapping flavours to templates
      3.2.3.3 SSH keys and user-data
      3.2.3.4 Improved econe-server Elastic IPs support
      3.2.3.5 EC2 API traffic encryption
    3.2.4 Elastic IPs
      3.2.4.1 Integration of econe-server and Virtual Routers
    3.2.5 Creating a user sandbox
    3.2.6 Using the cloud
  3.3 M5L-CAD: an elastic farm for medical imaging
    3.3.1 Challenges
    3.3.2 The WIDEN web frontend
    3.3.3 Processing algorithms
    3.3.4 Elastic cloud backend
      3.3.4.1 Evolution towards a generic Virtual Analysis Facility
    3.3.5 Privacy and patient confidentiality

4 PROOF as a Service: deploying PROOF on clouds and the Grid
  4.1 Overview of the PROOF computing model
    4.1.1 Event based parallelism
    4.1.2 Interactivity
    4.1.3 PROOF and ROOT
    4.1.4 Dynamic scheduling
    4.1.5 Data locality and XRootD
    4.1.6 PROOF-Lite
  4.2 Static and dynamic PROOF deployments
    4.2.1 Dedicated PROOF clusters
    4.2.2 Issues of static deployments
    4.2.3 Overcoming deployment issues
    4.2.4 PROOF on Demand
  4.3 ALICE Analysis Facilities and their evolution
    4.3.1 Specifications
      4.3.1.1 The analysis framework
      4.3.1.2 Deployment
    4.3.2 Users authentication and authorization
    4.3.3 Data storage and access model
      4.3.3.1 PROOF datasets
      4.3.3.2 The AliEn File Catalog
    4.3.4 Issues of the AAF data model
      4.3.4.1 Very slow first time data access
      4.3.4.2 Lack of a pre-staging mechanism
      4.3.4.3 Data integrity concerns and resiliency
    4.3.5 The dataset stager
      4.3.5.1 Features of the dataset stager
      4.3.5.2 Robustness of the staging daemon
      4.3.5.3 Daemon status monitoring: the MonALISA plugin
      4.3.5.4 The ALICE download script
      4.3.5.5 The dataset management utilities for ALICE
      4.3.5.6 Alternative usage of the staging daemon
    4.3.6 A PROOF interface to the AliEn File Catalog
      4.3.6.1 Dataset queries
      4.3.6.2 Issuing staging requests
    4.3.7 The first virtual Torino Analysis Facility for ALICE
      4.3.7.1 Overview
      4.3.7.2 Storage and datasets model
      4.3.7.3 Software distribution model
      4.3.7.4 Dynamically changing PROOF resources
      4.3.7.5 Virtual machine images and Tier-2 integration
    4.3.8 Spawning and scaling the TAF
  4.4 The Virtual Analysis Facility
    4.4.1 Cloud awareness
      4.4.1.1 PROOF is cloud aware
    4.4.2 Components
    4.4.3 The CernVM ecosystem
      4.4.3.1 µCernVM: an on-demand operating system
      4.4.3.2 CernVM-FS
      4.4.3.3 CernVM Online
    4.4.4 External components
      4.4.4.1 HTCondor
      4.4.4.2 PROOF on Demand
    4.4.5 Authentication: sshcertauth
      4.4.5.1 Rationale
      4.4.5.2 Components
      4.4.5.3 The web interface
      4.4.5.4 Users mapping plugins
      4.4.5.5 The keys keeper
      4.4.5.6 Restricting access to some users
    4.4.6 PROOF dynamic workers
      4.4.6.1 Chain of propagation of the new workers
      4.4.6.2 User’s workflow
      4.4.6.3 Updating workers list in xproofd
      4.4.6.4 Configuring workers from the PROOF master
      4.4.6.5 Adding workers
