The Open Grid Forum: History, Introduction and Process. Alan Sill, VP of Standards, OGF; Director, NSF Cloud and Autonomic Computing Center; Senior Scientist, High Performance Computing Center; Adjunct Professor of Physics, Texas Tech University

Total Pages: 16

File Type: PDF, Size: 1020 KB

The Open Grid Forum: History, Introduction and Process

Alan Sill
VP of Standards, OGF
Director, NSF Cloud and Autonomic Computing Center
Senior Scientist, High Performance Computing Center
Adjunct Professor of Physics, Texas Tech University

Open Grid Forum 44, May 21-22, 2015, EGI Conference, Lisbon, Portugal
©2015 Open Grid Forum

About the Open Grid Forum

Open Grid Forum (OGF) is a global organization operating in the areas of cloud, grid and related forms of advanced distributed computing. The OGF community pursues these topics through an open process for the development, creation and promotion of relevant specifications and use cases. OGF actively engages partners and participants throughout the international arena through an open forum with open processes to champion architectural blueprints related to cloud and grid computing. The resulting specifications and standards enable pervasive adoption of advanced distributed computing techniques for business and research worldwide.

History and Background

• Began in 2001 as an organization to promote the advancement of distributed computing worldwide.
• Grid Forum --> Global Grid Forum --> GGF + Enterprise Grid Alliance --> formation of OGF in 2005.
• Mandate is to take on all forms of distributed computing and to work to promote cooperation, information exchange, and best practices in use and standardization.
• OGF is best known for a series of important computing, security and network standards that form the basis for major science and business-based distributed computing (BES, GridFTP, DRMAA, JSDL, RNS, GLUE, UR, etc.).
• We also develop cloud, networking and data standards (OCCI, DFDL, WS-Agreement, NSI/NML, etc.) in wide use.
• Cooperative work agreements with other SDOs are in place.

OGF Standards

OGF has an extensive set of applicable standards related to advanced distributed grid and cloud computing and associated storage management and network operation:
- Managing the trust ecosystem (CA operations, AuthN/AuthZ)
- Job submission and workflow management (JSDL, BES)
- Network management (NSI, NML, NMC, NM)
- Federated identity management (FedSec-CG)
- Virtual organizations (VOMS)
- Secure, fast multi-party data transfer (GridFTP, SRM)
- Service agreements (WS-Agreement, WS-Agreement Negotiation)
- Data format description (DFDL)
- Cloud computing interfaces (OCCI)
- Distributed resource management APIs (DRMAA, SAGA, etc.)
- Firewall traversal (FiTP)
- (Many others under development)

OGF HPC Standards In Use In Industry

• DRMAA (Distributed Resource Management Application API): Grid Engine (Univa); Open Grid Scheduler (open source); TORQUE and related products (Adaptive Computing); GridWay (DSA Research); PBS Works (Altair Engineering); HTCondor (U. of Wisconsin / Red Hat). (A usage sketch follows this list.)
• OGSA® Basic Execution Service Version 1.0 and BES HPC Profile: BES++ for LSF/SGE/PBS (Platform Computing); Windows HPC Server 2008 (Microsoft Corporation); PBS Works, client only (Altair Engineering).
• JSDL, Job Submission Description Language (family of specifications): BES++ for LSF/SGE/PBS and Platform LSF (Platform Computing); Windows HPC Server (Microsoft Corporation); PBS Works and PBS Pro (Altair Engineering); Tivoli Workload Scheduler (IBM Corporation).
• WS-Agreement (family of specifications): ElasticLM License-as-a-Service (ElasticLM); BEinGrid SLA Negotiator, LM-Architecture and Framework (multiple partners); BREIN SLA Management Framework (multiple partners); WSAG4J, Web Services Agreement for Java framework implementation (Fraunhofer SCAI).
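DRMAA is the OGF standard in this list aimed most directly at application programmers. As a concrete illustration (not taken from the slides), the minimal sketch below submits a job through the DRMAAv1 interface using the Python "drmaa" bindings; it assumes a DRMAA-enabled resource manager (such as Grid Engine, PBS or HTCondor) and its libdrmaa library are present on the submit host.

# Minimal DRMAAv1 sketch using the Python "drmaa" bindings (assumes a
# DRMAA-enabled scheduler and its libdrmaa are installed on this host).
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/bin/sleep"   # executable handed to the resource manager
    jt.args = ["10"]                  # its arguments
    job_id = session.runJob(jt)       # submit through the DRMAA library
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("job", job_id, "finished with exit status", info.exitStatus)
    session.deleteJobTemplate(jt)

The same code runs unchanged against any scheduler that ships a DRMAA implementation, which is exactly the portability argument behind the list above.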
OCCI Working Group: http://occi-wg.org

Starting Point: OGF Documents
http://ogf.org/documents

Public Comment process
http://redmine.ogf.org/projects/editor-pubcom/boards/

OGF Document Types

• Informational: To inform the community about a useful idea or set of ideas.
• Experimental: To inform the community about a useful experiment, testbed or implementation of an idea or set of ideas.
• Community Practice: To inform the community of common practice or process, with the objective to influence the community and/or document its current practices.
• Recommendations: To publish a specification, analogous to an Internet Standards track document. Recommendations are initially designated as "proposed," and following further experience and review may become full recommendations.
• Further information, including guidance and advice, is contained in GFD.152 at http://ogf.org/documents/GFD.152.pdf

EGI international presence

(Summary figures from the slide's chart:) 361,300 CPU cores across 53 countries, handling roughly 1.44 million jobs/day; storage of 235 PB disk and 176 PB tape, with year-on-year increases of +69% and +32%. (EGI-InSPIRE RI-261323, www.egi.eu)

Standards-based international collaboration

EGI Federated Cloud: a successful standards-based international federated cloud infrastructure.
• Members: about 70 individuals from 40 institutions in 13 countries (updated July 2014), including Cyfronet, FZJ, OeRC, EGI.eu, CESNET, GWDG, TUD, CNRS, IN2P3, KTH, Masaryk, INFN, FCTSG, CETA, CESGA, IGI, SARA, RADICAL, IFCA, STFC, SZTAKI, BSC, GRNET, Imperial, DANTE, LMU, IPHC, IISAS, SixSq, 100%IT, IFAE and SRCE.
• Technologies: OpenStack, OpenNebula, StratusLab, CloudStack (in evaluation), Synnefo, WNoDeS.
• Stakeholders: 23 resource providers, 10 technology providers, 7 user communities, 4 liaisons.
• Standards: OCCI (control), OVF (images), X.509 (authN), CDMI (storage, under development).
Credit: David Wallom, Chair, EGI Federated Cloud Task Force.

Federated Cloud architecture

Standards used to enable federation:
• OCCI: VM management (a request sketch follows below)
• OVF: VM image format
• GLUE2: resource discovery and description
• X.509: authentication
• (CDMI: storage)
• Others in development
Federation operation tools:
• Information system (BDII)
• Monitoring (SAM)
• Accounting (APEL)
• AAI (Perun)
Each cloud site (academic or commercial) runs its own hypervisor or cloud stack (e.g. OpenStack, OpenNebula, EmotiveCloud, Okeanos…) behind the FedCloud interfaces; domain-specific services, user interfaces, virtual machine images and virtual organisations sit on top.
Open to new members: www.egi.eu. Join as a user, or as an IaaS/PaaS/SaaS service provider: http://go.egi.eu/cloud
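To make the federation concrete, the minimal sketch below (not from the slides) creates a compute resource through OCCI's HTTP text rendering, the control interface listed above. The endpoint URL and authentication token are placeholders, not real FedCloud values.

# Minimal OCCI HTTP-rendering sketch; the endpoint and token are placeholders.
import requests

endpoint = "https://occi.example.org:8787/compute/"      # hypothetical OCCI service
headers = {
    "Content-Type": "text/occi",
    "X-Auth-Token": "REPLACE_WITH_SITE_TOKEN",           # site-specific authentication
    "Category": ('compute; scheme="http://schemas.ogf.org/occi/infrastructure#"; '
                 'class="kind"'),
    "X-OCCI-Attribute": 'occi.core.title="test-vm", occi.compute.cores=2',
}

response = requests.post(endpoint, headers=headers)
print(response.status_code, response.headers.get("Location"))   # URL of the new resource

Because every FedCloud site exposes the same rendering, the same request works against any provider in the federation; only the endpoint and credentials change.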
Example: Worldwide LHC Computing Grid

• ~450,000 CPU cores
• ~430 PB storage
• Typical data transfer rate: ~12 GByte/sec
• Total worldwide grid capacity across all grids and VOs: ~2x WLCG

XSEDE: The Next Generation of US National Supercomputing Infrastructure

The Role of Standards for Risk Reduction and Inter-operation in XSEDE (LSN-MAGIC Meeting, February 22, 2012). Cloud and grid standards now power some of the largest academic supercomputing infrastructures in the world!

XSEDE Services Layer: simple services combined in many ways (examples, not a complete list):
– Resource Namespace Service 1.1
– OGSA Basic Execution Service
– OGSA WSRF BP (metadata and notification)
– OGSA-ByteIO
– GridFTP
– JSDL, BES, BES HPC Profile (a JSDL sketch appears at the end of this section)
– WS Trust Secure Token Services
– WSI BSP for transport of credentials
– … (more than we have room to cover here)

Basic message: XSEDE represents best-of-breed engagement of open computing standards with the US cyberinfrastructure.

US National Cyberinfrastructure Grids

(Overview diagram of XSEDE-connected resources, including:) Blacklight (shared memory), Trestles, Blue Waters (leadership class), Gordon (data intensive, 64 TB memory, 300 TB flash), Yellowstone (geosciences), FutureGrid, Darter (24k cores), Nautilus (visualization), SuperMIC (380 nodes, ~1 PF), Stampede (460K cores: Ivy Bridge, Xeon Phi, GPU), Wrangler (data analytics), Comet ("long tail science", 47k cores / 2 PF, upgrade in 2015), Keeneland (CPU/GPGPU), Maverick (visualization and data analytics), and the high-throughput Open Science Grid (124 sites), alongside ACI-REF campus sharing and NSF Cloud. XSEDE's goals: extend the impact of cyberinfrastructure; prepare the current and next generation; promote an open, robust, collaborative and innovative ecosystem; collaborate with other CI groups and projects; adopt, create and disseminate knowledge; provide technical expertise and support services. Over 13 million service units/day were typically delivered as of 2014 across all XSEDE supercomputing sites (about 3 million core hours/day), totaling about 1.6 billion core hours per year. Credit: Irene Qualters, US National Science Foundation.

Why Open Standards? (LSN-MAGIC Meeting, February 22, 2012)

• Risk reduction
• Best-of-breed mix-and-match
• Allows innovation/competition at more interesting layers
• Facilitates interoperation with other infrastructures

Takeaway message: the use of standards permits XSEDE to interoperate with other infrastructures, reduces risks including vendor lock-in, and allows us to focus on higher-level capabilities and less on the mundane. (Andrew Grimshaw)

Distributed Across 124 Sites

600k - 800k jobs/day! The Open Science Grid currently consists of over 124 geographical sites, operating on a wide variety of computing systems.

OGF Cooperative Agreements In Place as of May 2015

OGF and IEEE:
• OGF co-sponsors activities at many IEEE conferences; pursuing engagements with P2301 & P2302
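JSDL, referenced both in the industry list earlier and in the XSEDE services layer above, is an XML job-description format. The sketch below (not from the slides) builds a minimal JSDL 1.0 document with Python's standard library, using the published JSDL and POSIX-application namespaces.

# Build a minimal JSDL 1.0 job description (GFD.56 namespaces) in Python.
import xml.etree.ElementTree as ET

JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"
ET.register_namespace("jsdl", JSDL)
ET.register_namespace("jsdl-posix", POSIX)

job = ET.Element(f"{{{JSDL}}}JobDefinition")
desc = ET.SubElement(job, f"{{{JSDL}}}JobDescription")
app = ET.SubElement(desc, f"{{{JSDL}}}Application")
posix_app = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
ET.SubElement(posix_app, f"{{{POSIX}}}Executable").text = "/bin/hostname"
ET.SubElement(posix_app, f"{{{POSIX}}}Output").text = "job.out"

print(ET.tostring(job, encoding="unicode"))   # document a BES endpoint could accept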
Recommended publications
  • Distributed Resource Management Application API (DRMAA) Version 2
    Distributed Resource Management Application API (DRMAA) Version 2
    Dr. Peter Tröger, Hasso-Plattner-Institute, University of Potsdam ([email protected]), DRMAA-WG Co-Chair, http://www.drmaa.org/

    About the speaker:
    • Senior Researcher at Hasso-Plattner-Institute, Potsdam
    • Research field: dependable systems
    • New online failure prediction and recovery techniques (SAP ByDesign, TACC Ranger, IBM z196, Intel)
    • Fault injection on firmware level (Fujitsu Technology Solutions)
    • New reliability modeling approaches (DSN paper pending)
    • Virtualization-based fault tolerance (VMware, Xen, KVM)
    • Teaching: dependable systems, parallel programming concepts, operating systems, middleware and distributed systems
    • Standardization in Open Grid Forum as a side activity

    Hasso-Plattner-Institute for Software Engineering (HPI):
    • Privately funded and independent research institute, founded in 1999
    • Associated with the University of Potsdam, Germany
    • B.Sc. and M.Sc. curriculum in IT-Systems Engineering, plus a Ph.D. programme
    • Rich experience in research projects that are typically conducted with industrial partners, both on a national and international level
    • Research school for PhDs with international departments (Cape Town, Haifa, China)

    (Architecture slide: end-user applications and portals reach resource managers and meta-schedulers either through proprietary interfaces or through OGF-standard APIs such as DRMAA, SAGA, OGSA and OCCI; the standard APIs are what provide portability across back-ends. A SAGA-based sketch of that portability idea follows below.)
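    The sketch below is not from the talk; it uses the radical.saga implementation of the OGF SAGA API to show the portability idea from the architecture slide: the same job-submission code can target different back-end schedulers by changing only the service URL (the remote host shown is hypothetical).

    # Portability sketch using radical.saga (an implementation of the OGF SAGA API).
    import radical.saga as rs

    # "fork://localhost" executes locally; a URL such as "slurm+ssh://cluster.example.org"
    # (hypothetical host) would route the same job to a remote resource manager instead.
    service = rs.job.Service("fork://localhost")

    jd = rs.job.Description()
    jd.executable = "/bin/date"
    jd.output = "saga_job.out"

    job = service.create_job(jd)
    job.run()
    job.wait()
    print("job finished in state", job.state, "with exit code", job.exit_code)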
  • Push-Based Job Submission Using Reverse SSH Connections
    rvGAHP – Push-Based Job Submission using Reverse SSH Connections
    Scott Callaghan (University of Southern California, Los Angeles, California), Gideon Juve (USC Information Sciences Institute, Marina Del Rey, California), Karan Vahi (USC Information Sciences Institute, Marina Del Rey, California), Philip J. Maechling (University of Southern California, Los Angeles, California), Thomas H. Jordan (University of Southern California, Los Angeles, California), Ewa Deelman (USC Information Sciences Institute, Marina Del Rey, California)

    ABSTRACT: Computational science researchers running large-scale scientific workflow applications often want to run their workflows on the largest available compute systems to improve time to solution. Workflow tools used in distributed, heterogeneous, high performance computing environments typically rely on either a push-based or a pull-based approach for resource provisioning from these compute systems. However, many large clusters have moved to two-factor authentication for job submission, making traditional automated push-based job submission impossible. On the other hand, pull-based approaches such as pilot jobs may lead to increased complexity and a reduction in node-hour efficiency.

    Citation: In WORKS'17: 12th Workshop on Workflows in Support of Large-Scale Science, November 12–17, 2017, Denver, CO, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3150994.3151003
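    The rvGAHP implementation itself is not shown in this excerpt; purely as an illustration of the reverse-connection idea it describes, the sketch below has the cluster login node (outbound connections only) open a reverse SSH tunnel so that a workflow host can push submissions back through it. The host name and ports are hypothetical.

    # Illustration only (not the rvGAHP code): from the cluster login node, open a
    # reverse tunnel so that workflow-host.example.org:2222 reaches this node's sshd,
    # letting the workflow host push job submissions without inbound access.
    import subprocess

    subprocess.run([
        "ssh", "-N",                    # no remote command; keep the tunnel only
        "-R", "2222:localhost:22",      # remote port 2222 -> local port 22
        "user@workflow-host.example.org",
    ])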
  • Sun Grid Engine Update (Daniel Gruber, Software Engineer, Sun Microsystems Deutschland GmbH; Sun is a wholly-owned subsidiary of Oracle)
    Sun Grid Engine Update
    Daniel Gruber, Software Engineer, Sun Microsystems Deutschland GmbH (Sun is a wholly-owned subsidiary of Oracle)

    Content: What's new in SGE? DRMAA. Customer feedback.

    Sun Grid Engine releases (release, announcement date, some features):
    • 6.2 major, 23.09.2008: SDM, scalability (> 60,000 cores), AR, IJS
    • 6.2 update 1, 18.12.2008: maintenance release
    • 6.2 update 2, 31.03.2009: GUI Installer, JSV, Per Job Resources, jemalloc
    • 6.2 update 3, 23.06.2009: SGE Inspect, SDM Cloud Adapter, Exclusive Host
    • 6.2 update 4, 23.10.2009: maintenance release
    • 6.2 update 5, 22.12.2009: Slot-wise Preemption, Core Binding, enhanced Inspect, Java JSV, Array Job Throttling, Hadoop Support

    SDM (Service Domain Manager): manages multiple Grid Engine clusters (A, B, C) together with a spare pool, a power-saving spare pool (via IPMI) and a cloud service.

    JSV (Job Submission Verifier):
    • Administrators (or users) can reformulate (insert, delete) job submission parameters based on JSV scripts
    • Jobs can be rejected based on parameters
    • bash, csh, tcl, perl and Java JSV scripts are supported

    GUI Installer: installs a complete SGE cluster.

    Slot-wise preemption:
    • Slot limit per host
    • Suspends jobs from subordinate queues in order to get high-priority jobs to run
    • Suspends longest/shortest running jobs
    • Multiple layers (suspend trees) possible; per layer, the order is definable

    Core Binding: job submission extension
  • Release Notes for IBM Spectrum LSF Version 10 Release 1
    IBM Spectrum LSF Version 10 Release 1: Release Notes

    Note: Before using this information and the product it supports, read the information in "Notices" on page 41. This edition applies to version 10, release 1 of IBM Spectrum LSF (product numbers 5725G82 and 5725L25) and to all subsequent releases and modifications until otherwise indicated in new editions. Significant changes or additions to the text and illustrations are indicated by a vertical line (|) to the left of the change. If you find an error in any IBM Spectrum Computing documentation, or you have a suggestion for improving it, let us know: log in to IBM Knowledge Center with your IBMid, and add your comments and feedback to any topic. © Copyright IBM Corporation 1992, 2017. US Government Users Restricted Rights: use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

    Contents (page numbers refer to the release notes):
    • Release Notes for IBM Spectrum LSF Version 10.1 (1)
    • What's new in IBM Spectrum LSF Version 10.1 Fix Pack 3 (1): Job scheduling and execution (1); Resource management (1); Container support (5); Command output formatting (5); Logging and troubleshooting (5); Other changes to IBM Spectrum LSF (6)
    • What's new in IBM Spectrum LSF Version 10.1 Fix …: Performance enhancements (14); Pending job management (16); Job scheduling and execution (21); Host-related features (27); Other changes to LSF behavior (30)
    • Learn more about IBM Spectrum LSF (31): Product notifications (31); IBM Spectrum LSF documentation (32)
    • Product compatibility (32): Server host compatibility (32); LSF add-on compatibility …
  • Introduction to Python, University of Oxford Department of Particle Physics
    Particle Physics Cluster Infrastructure: Introduction
    University of Oxford, Department of Particle Physics, October 2019
    Vipul Davda, Particle Physics Linux Systems Administrator, Room 661, Telephone: x73389, [email protected]

    Particle Physics Linux infrastructure (overview diagram): a gluster distributed file system exported over NFS (/data, with /data/atlas and /data/lhcb) and NFS /home; worker nodes and an HTCondor batch server; interactive servers; the physics_s/eduroam network with network printers, managed laptops and managed desktops.

    Introduction to the Unix operating system:
    • Unix is a multi-user, multi-tasking operating system.
    • Developed in 1969 at AT&T's Bell Labs by Ken Thompson (Unix) and Dennis Ritchie (C).
    • Unix is written in the C programming language.
    • Unix was originally a command-line OS, but now has a graphical user interface.
    • It is available in many different forms: Linux, Solaris, AIX, HP-UX, FreeBSD.
    • It is a well-suited environment for program development: C, C++, Java, Fortran, Python…
    • Unix is mainly used on large servers for scientific applications.

    Linux distributions (logo chart; source: https://www.muylinux.com/2009/04/24/logos-de-distribuciones-gnulinux/). Particle Physics uses CentOS Linux on the cluster; it is a free version of Red Hat Enterprise Linux.

    Basic Linux commands:
    • ls: list directory contents
    • ls -l: long listing
  • The Translational Journey of the HTCondor-CE
    Principles, technologies, and time: the translational journey of the HTCondor-CE
    Brian Bockelman (Morgridge Institute for Research, Madison, USA), Miron Livny (Morgridge Institute for Research and Department of Computer Sciences, University of Wisconsin-Madison, USA), Brian Lin (Department of Computer Sciences, University of Wisconsin-Madison, USA), Francesco Prelz (INFN Milan, Milan, Italy). Journal of Computational Science xxx (xxxx) xxx.

    Keywords: distributed high throughput computing; high throughput computing; translational computing; distributed computing.

    ABSTRACT: Mechanisms for remote execution of computational tasks enable a distributed system to effectively utilize all available resources. This ability is essential to attaining the objectives of high availability, system reliability, and graceful degradation, and directly contributes to flexibility, adaptability, and incremental growth. As part of a national fabric of Distributed High Throughput Computing (dHTC) services, remote execution is a cornerstone of the Open Science Grid (OSG) Compute Federation. Most of the organizations that harness the computing capacity provided by the OSG also deploy HTCondor pools on resources acquired from the OSG. The HTCondor Compute Entrypoint (CE) facilitates the remote acquisition of resources by all organizations. The HTCondor-CE is the product of a most recent translational cycle that is part of a multidecade translational process. The process is rooted in a partnership between members of the High Energy Physics community and computer scientists that evolved over three decades and involved testing and evaluation with active users and production infrastructures. Through several translational cycles that involved researchers from different organizations and continents, principles, ideas, frameworks and technologies were translated into a widely adopted software artifact that is responsible for provisioning approximately 9 million core hours per day across 170 endpoints.
  • Extend/Alter Condor via Developer APIs/Plugins
    Extend/alter Condor via developer APIs/plugins
    Todd Tannenbaum, Condor Project, Computer Sciences Department, University of Wisconsin-Madison (CERN, Feb 14 2011)

    Some classifications:
    • Application Program Interfaces (APIs): job control; operational monitoring
    • Extensions

    Job Control APIs, the biggies:
    › Command Line Tools
    › DRMAA
    › Condor DBQ
    › Web Service Interface (SOAP): http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=SoapWisdom

    Command Line Tools
    › Don't underestimate them!
    › Your program can create a submit file on disk and simply invoke condor_submit:

        system("echo universe=VANILLA > /tmp/condor.sub");
        system("echo executable=myprog >> /tmp/condor.sub");
        ...
        system("echo queue >> /tmp/condor.sub");
        system("condor_submit /tmp/condor.sub");

    › Your program can create a submit file and give it to condor_submit through stdin:

        Perl:
        open(SUBMIT, "|condor_submit");
        print SUBMIT "universe=VANILLA\n";
        ...

        C/C++:
        FILE *submit = popen("condor_submit", "w");
        fputs("universe=VANILLA\n", submit);
        ...

    › Using the +Attribute syntax with condor_submit:

        universe = VANILLA
        executable = /bin/hostname
        output = job.out
        log = job.log
        +webuser = "zmiller"
        queue

    › Use -constraint and -format with condor_q:

        % condor_q -constraint 'webuser=="zmiller"'
        -- Submitter: bio.cs.wisc.edu : <128.105.147.96:37866> : bio.cs.wisc.edu
         ID       OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
        213503.0  zmiller  10/11 06:00   0+00:00:00

    (www.cs.wisc.edu/Condor)
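    Not covered in this excerpt, but worth noting alongside the command-line route: recent HTCondor releases also ship Python bindings. The sketch below is a minimal illustration that mirrors the submit file above; it assumes a reachable condor_schedd and a recent bindings version that provides Schedd.submit().

    # Minimal sketch using the HTCondor Python bindings ("htcondor" module).
    import htcondor

    submit = htcondor.Submit({
        "universe": "vanilla",
        "executable": "/bin/hostname",
        "output": "job.out",
        "log": "job.log",
        "MY.webuser": '"zmiller"',   # custom ClassAd attribute (+webuser in submit-file syntax)
    })

    schedd = htcondor.Schedd()       # connect to the local schedd
    result = schedd.submit(submit)   # queue one job
    print("submitted cluster", result.cluster())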
  • Sun Grid Engine Update
    Sun Grid Engine Update
    SGE Workshop 2007, Regensburg, September 10-12, 2007
    Andy Schwierskott, Sun Microsystems

    What is Grid Computing?
    • "The network is the computer"™: distributed resources, management infrastructure, a targeted service or workload
    • Utilization and performance go up; costs and complexity go down
    • Examples:
      > Aggregating desktops for computation, aka cycle stealing (e.g. SETI@Home; use engineers' desktops at night)
      > Managing an entire rack from a single interface
      > Rendering and simulation "farms"

    What Sun Grid Engine does in Grid Computing
    • Helps solve problems horizontally: High Performance (Technical) Computing; data center optimization
    • Examples: EDA, modeling, transaction validation, MCAD
    • Increases utilization and reduces turnaround times: 10%-25% is typical, going up to 90%++; cycle stealing
    • In short: intelligently automate batch and interactive job distribution for jobs running from seconds to days and weeks

    Target industries and typical workloads (table of industries and computing tasks not reproduced in this excerpt).

    Sun Grid Engine feature overview (diagram): allocation and resource prioritization policies; extensible workload-to-resource matching (selection); customizable system load and access regulation; definable job execution contexts (resource control); web-based reporting and resource analysis; open and integratable data source (accounting).

    Sun Grid Engine administration (diagram): hierarchical configuration; ease of administration; integration with N1 systems management products and third-party software; standards-compliant; full CLI functionality; heterogeneous environments; wide commercial OS support.

    Sun Grid Engine components: qsub, qrsh, qlogin, qmon, qtcsh; shadow master.

    Sun Grid Engine 6
    • SGE 6.0 released in 2004
      > Sites slowly adopt new functionality
      > …
  • Beginner's Guide to Oracle Grid Engine 6.2 (Oracle White Paper)
    An Oracle White Paper, August 2010: Beginner's Guide to Oracle Grid Engine 6.2

    Contents:
    • Executive Overview (1)
    • Introduction (1)
    • Chapter 1: Introduction to Oracle Grid Engine (3): Oracle Grid Engine Jobs (3); Oracle Grid Engine Component Architecture (3); Oracle Grid Engine Basics (5)
    • Chapter 2: Oracle Grid Engine Scheduler (10): Job Selection (10); Job Scheduling (17); Other Scheduling Features (18); Additional Information on Job Scheduling (20)
    • Chapter 3: Planning an Oracle Grid Engine Installation (21): Installation Layout (21); QMaster Data Spooling (22); Execution Daemon Data …
  • A Comprehensive Perspective on Pilot-Job Systems
    A Comprehensive Perspective on Pilot-Job Systems
    Matteo Turilli, Mark Santcroos, Shantenu Jha (RADICAL Laboratory, ECE, Rutgers University, New Brunswick, NJ, USA)

    ABSTRACT: Pilot-Job systems play an important role in supporting distributed scientific computing. They are used to consume more than 700 million CPU hours a year by the Open Science Grid communities, and by processing up to 1 million jobs a day for the ATLAS experiment on the Worldwide LHC Computing Grid. With the increasing importance of task-level parallelism in high-performance computing, Pilot-Job systems are also witnessing an adoption beyond traditional domains. Notwithstanding the growing impact on scientific research, there is no agreement upon a definition of Pilot-Job system…

    From the introduction: Pilot-Jobs provide a multi-stage mechanism to execute workloads. Resources are acquired via a placeholder job and subsequently assigned to workloads. Pilot-Jobs are having a high impact on scientific and distributed computing [1]. They are used to consume more than 700 million CPU hours a year [2] by the Open Science Grid (OSG) [3, 4] communities, and process up to 1 million jobs a day [5] for the ATLAS experiment [6] on the Large Hadron Collider (LHC) [7] Computing Grid (WLCG) [8, 9]. A variety of Pilot-Job systems are used on distributed computing infrastructures (DCI): Glidein/GlideinWMS [10, 11], the Coaster System [12], DIANE [13], DIRAC [14], PanDA [15], GWPilot [16], Nimrod/G [17], Falkon [18], MyCluster [19], to name a few. (A toy sketch of the pilot pattern follows below.)
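    None of the systems named above are reproduced here, but the core pattern the excerpt describes (acquire a placeholder first, bind tasks to it later) can be shown in a toy, single-process sketch:

    # Toy, single-process sketch of the pilot pattern: a placeholder "pilot" is
    # started first, then repeatedly pulls tasks from a queue, so workload
    # assignment is decoupled from resource acquisition (late binding).
    import queue
    import subprocess

    task_queue: "queue.Queue[list[str]]" = queue.Queue()
    for label in ("one", "two", "three"):      # workload can be defined after the pilot starts
        task_queue.put(["echo", f"task {label}"])

    def pilot() -> None:
        """Stand-in for a placeholder job that a real system would submit to a cluster."""
        while True:
            try:
                cmd = task_queue.get_nowait()  # late binding: fetch the next task
            except queue.Empty:
                return                         # queue drained: the pilot exits
            subprocess.run(cmd, check=True)
            task_queue.task_done()

    pilot()                                    # a real system would launch many pilots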
  • EGI Federated Platforms Supporting Accelerated Computing
    EGI federated platforms supporting accelerated computing, PoS(ISGC2017)020
    Paolo Andreetto (INFN, Sezione di Padova, Via Marzolo 8, 35131 Padova, Italy), Jan Astalos (Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia), Miroslav Dobrucky (Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia), Andrea Giachetti (CERM Magnetic Resonance Center, CIRMMP and University of Florence, Italy), David Rebatto (INFN, Sezione di Milano, Via Celoria 16, 20133 Milano, Italy), Antonio Rosato (CERM Magnetic Resonance Center, CIRMMP and University of Florence, Italy), Viet Tran (Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia), Marco Verlato (INFN, Sezione di Padova, Italy), Lisa Zangrando (INFN, Sezione di Padova, Italy)

    While accelerated computing instances providing access to NVIDIA™ GPUs have been available for a couple of years in commercial public clouds like Amazon EC2, the EGI Federated Cloud put its first OpenStack-based site providing GPU-equipped instances into production at the end of 2015. However, many EGI sites which provide GPUs or MIC coprocessors to enable high performance processing are not yet directly supported in a federated manner by the EGI HTC and Cloud platforms. In fact, to use the accelerator card capabilities available at resource centre level, users must directly interact with the local provider to get information about the type of resources and software libraries available, and which submission queues must be used to submit accelerated computing workloads.
  • Experimental Study of Remote Job Submission and Execution on LRM Through Grid Computing Mechanisms
    Experimental Study of Remote Job Submission and Execution on LRM through Grid Computing Mechanisms
    Harshadkumar B. Prajapati (Information Technology Department, Dharmsinh Desai University, Nadiad, India), Vipul A. Shah (Instrumentation & Control Engineering Department, Dharmsinh Desai University, Nadiad, India)

    Abstract: Remote job submission and execution is a fundamental requirement of distributed computing done using cluster computing. However, cluster computing limits usage to within a single organization. A Grid computing environment can allow the use of resources for remote job execution that are available in other organizations. This paper discusses concepts of batch-job execution using an LRM and using Grid mechanisms, and two ways of preparing a test Grid computing environment that we use for experimental testing of these concepts. It presents experimental testing of remote job submission and execution mechanisms through the LRM-specific way and through Grid computing ways, and also discusses various problems faced while working with Grid…

    From the introduction: The fundamental unit of work done in Grid computing is the successful execution of a job. A job in Grid is generally a batch job, which does not interact with the user while it is running. Higher-level Grid services and applications use job submission and execution facilities that are supported by a Grid middleware. Therefore, it is very important to understand how a job is submitted, executed, and monitored in a Grid computing environment. Researchers and scientists who are working on higher-level Grid services and applications do not pay much attention to the underlying Grid infrastructure in their published work.