14.SCHULZ Grid.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
Grid Computing ESI 2011 Markus Schulz IT Grid Technology Group, CERN WLCG [email protected] Overview • Grid Computing – Definition, History, Fundamental Problems, Technology – EMI/gLite • WLCG – Challenge – Infrastructure – Usage • Grid Computing – Who? – How? • What’s Next? Markus Schulz 2 Focus • Understanding concepts • Understanding the current usage • Understanding whether you can profit from grid computing • Not much about projects, history, details… • If you want practical examples: • https://edms.cern.ch/file/7223 98/1.4/gLite-3-UserGuide.pdf Markus Schulz 3 What is a Computing Grid? • There are many conflicting definitions – Has been used for several years for marketing… • Marketing moved recently to “Cloud-computing” • Ian Foster and Karl Kesselman – “coordinated resource sharing and problem solving in dynamic, multi- institutional virtual organizations. “ – These are the people who started globus, the first grid middleware project • From the user’s perspective: – I want to be able to use computing resources as I need – I don’t care who owns resources, or where they are – Have to be secure – My programs have to run there • The owners of computing resources (CPU cycles, storage, bandwidth) – My resources can be used by any authorized person (not for free) – Authorization is not tied to my administrative organization • – NO centralized control of resources or users Markus Schulz 4 Other Problems? • The world is a fairly heterogeneous place – Computing services are extremely heterogeneous • Examples: – Batch Systems (controlling the execution of your jobs ) • LSF, PBS, TorQue, Condor, SUN-GridEngine, BQS, ….. • Each comes with its own commands and status messages – Storage: Xroot, CASTOR, dCache, DPM, STORM,+++ – Operating Systems: • Windows, Linux ( 5 popular flavors), Solaris, MacOS,…. • All come in several versions – Site managers • Highly experienced professionals • Scientists forced to do it ( or volunteering ) • Summer students doing it for 3 months……. Markus Schulz 5 What is a Virtual Organization? • A Virtual Organization is a group of people that agree to share resources for solving a common problem – The members often belong to different organizations – The organizations are often in different countries – High Energy Physics Collaborations are a good example Markus Schulz 6 Fundamental Problems • Provide security without central control • Hide and manage heterogeneity • Facilitate communication between users and providers • Not only a technical problem! • Grid Middleware is the software to address these problems Markus Schulz, CERN, IT Department Short History • 1998 The GRID by Ian Foster & Carl Kesselman – Made the idea popular • 1998 Globus-1 first middleware widely available – Proof of concept – www.globus.org evolved to gt-5 ( 2010 ) • Since 1998 several hundred middleware solutions • OpenGridForum works towards standardization – Progress is slow….. • LHC experiments use: – gLite, ARC, OSG (globus, VDT), Alien Markus Schulz 8 Software Approach • Identify an AAA system that all can agree on • Authentication, Authorization, Auditing – That doesn’t require local user registration – That delegates “details” to the users ( Virtual Organizations) • Define and implement abstraction layers for resources – Computing, Storage, etc. • Define and implement a way to announce your resources (Information System) • Build high level services to optimize the usage • Interface your applications to the system Markus Schulz 9 gLite as an example Markus Schulz 10 gLite middleware Job Information & Security Access Data Services Management Monitoring Services Services Services Services Computing Storage Element Information Element System Authorization User Interface Worker Node File and Job Monitoring Replica Catalog Workload Management Authentication API Metadata Accounting Catalog Job Provenance CERN, IT Department The Big Picture 12 Security • Authentication – Who you are • Authorization – What you can do • Auditing/Accounting – What you have done • How to establish trust? Markus Schulz 13 Authentication • Authentication is based on X.509 PKI infrastructure ( Public Key) – Certificate Authorities (CA) issue (long lived) certificates identifying individuals (much like a passport) • Commonly used in web browsers to authenticate to sites – Trust between CAs and sites is established (offline) – In order to reduce vulnerability, on the Grid user identification is done by using (short lived) proxies of their certificates • Short-Lived Credential Services (SLCS) – issue short lived certificates or proxies to its local users • e.g. from Kerberos or from Shibboleth credentials • Proxies can – Be delegated to a service such that it can act on the user’s behalf – Be stored in an external proxy store (MyProxy) – Be renewed (in case they are about to expire) – Include additional attributes - Authorization 14 gLite The EGEE Middleware Distribution Public Key Based Security • How to exchange secret keys? – 340 Sites ( global) • With hundreds of nodes each? – 200 User Communities ( non local) – 10000 Users (global) • And keep them secret!!! Markus Schulz 15 Authorization • VOMS is now a de-facto standard – Attribute Certificates provide users with additional capabilities defined by the VO. – Allows group and role based authorization – Basis for the authorization process • Authorization: currently via mapping to a local user on the resource ( or ACLs) – glexec changes the local identity (based on suexec from Apache) • Designing an authorization service with a common interface agreed with multiple partners – Uniform implementation of authorization in gLite services – Easier interoperability with other infrastructures – ARGUS Security - overview Certification Authority 1 per country/region/la b 17 gLite The EGEE Middleware Distribution Common AuthZ interface Pilot job on CREAM OSG Worker Node EGEE/EGI (both EGEE and pre-WS GT4 gk, GT4 OSG) gridftp, opensshd gatekeeper,g ridftp, gt4-interface (opensshd) edg-gk dCache glexec edg-gridftp Prima + gPlazma: LCAS + LCMAPS SAML-XACML L&L plug-in: SAML-XACML Common SAML XACML library SAML-XACML Query Q: map.user.to.some.pool R: Oblg: user001, somegrp <other obligations> SAML-XACML interface Common SAML XACML library GPBox Site Central: GUMS (+ SAZ) Site Central: LCAS + LCMAPS LCMAPS L&L plug-ins plug-in Trust • You can setup a Certification Authority and several Registration Authorities – Using openssl – 1h work • No one will trust your certificates • Trust is based on: – Common policies – Common infrastructure to follow up on security problems Markus Schulz 19 Security groups • Joint Security Policy Group: TAGPMA EUGridPMA APGridPMA – Joint with WLCG, OSG, and others – Focus on policy issues – Strong input to e-IRG EUGridPMA • The Asia- – Pan-European trust federation of CAs Americas European Pacific Grid PMA Grid PMA Grid PMA – Included in IGTF (and was model for it) – Success: most grid projects now subscribe to the IGTF • Grid Security Vulnerability Group Incident Certification Response Audit – Looking at how to manage vulnerabilities Authorities Requirements – Risk analysis is fundamental – Balance between openness and security Usage Security & Availability • Operational Security Coordination Team VO Rules Policy Security – Main day-to-day operational work – Incident response and follow up – Members in all NGI and sites User Registration Application Development & VO Management & Network Admin Guide – Frequent tests (Security Challenges) Computing Access Computing Elements (CE) • CE CE – gateways to farms LFS Site CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU • Workload Management – WMS/LB – Matches resources and requests Including data location • UI – Handles failures (resubmission) WM S – Manages complex workflows UI – Tracks job status 21 Workload Management (compact) Desktops A few ~50 nodes 1-20 per site 1-24000 per site Job Description Language • [ • Executable = “my_exe”; • StdOutput = “out”; • StdError = “err”; • Arguments = “a b c”; • InputSandbox = {“/home/giaco/my_exe”}; • OutputSandbox = {“out”, “err”}; • Requirements = Member( • other.GlueHostApplicationSoftwareRunTimeEnvironment, • "ALICE3.07.01“ • ); • Rank = -other.GlueCEStateEstimatedResponseTime; • RetryCount = 3 • ] 23 ECSAC'09 - Veli Lošinj, Croatia, 25-29 August 2009 Pilot Jobs • All WLCG experiments use a form of pilot jobs • They have given a number of advantages • Responsive scheduling • Knowledge of environment • They have also faced some resistance • Security • Traceability 17/05/2011 Markus Schulz - ESI 2011 24 Information System • BDII = Yellow Pages – realtime • Light weight Database • LDAP protocol • GLUE 1.3 (2) Schema – Describes resources and their state – Approx 100MB – Update 2min • Several hundred instances 25 Data Management • Storage Elements (SEs) – External interfaces based on SRM 2.2 and gridFTP – Many implementations: • CASTOR, Storm, DPM, dCache, BestMan…. Many local interfaces: – Grid rfio CPU • POSIX, dcap, secure rfio, rfio, xrootd FTP CPU xrootd SRM SE CPU • Catalogue: LFC (local and global) Site CPU • File Transfer Service (FTS) • Data management clients gfal/LCG-Utils 26 Data Management VO User Tools Frameworks lcg_utils Data Management FTS GFAL Cataloging Storage Data transfer Information System/Environment Variables Vendor (Classic gridftp Specific (RLS) LFC SRM SE) RFIO APIs General Storage Element • Storage Resource Manager (SRM) – hides the storage system implementation (disk or tape) – handles authorization – translates SURLs (Storage