Experimental Study of Remote Job Submission and Execution on LRM Through Grid Computing Mechanisms

Experimental Study of Remote Job Submission and Execution on LRM through Grid Computing Mechanisms Harshadkumar B. Prajapati Vipul A. Shah Information Technology Department Instrumentation & Control Engineering Dharmsinh Desai University Department Nadiad, INDIA Dharmsinh Desai University e-mail: [email protected], Nadiad, INDIA [email protected] e-mail: [email protected], [email protected] Abstract —Remote job submission and execution is The fundamental unit of work-done in Grid fundamental requirement of distributed computing done computing is successful execution of a job. A job in using Cluster computing. However, Cluster computing Grid is generally a batch-job, which does not interact limits usage within a single organization. Grid computing with user while it is running. Higher-level Grid environment can allow use of resources for remote job services and applications use job submission and execution that are available in other organizations. This execution facilities that are supported by a Grid paper discusses concepts of batch-job execution using LRM and using Grid. The paper discusses two ways of middleware. Therefore, it is very important to preparing test Grid computing environment that we use understand how a job is submitted, executed, and for experimental testing of concepts. This paper presents monitored in Grid computing environment. experimental testing of remote job submission and Researchers and scientists who are working in higher- execution mechanisms through LRM specific way and level Grid services and applications do not pay much Grid computing ways. Moreover, the paper also discusses attention to the underlying Grid infrastructure in their various problems faced while working with Grid published work. However, such knowledge is very computing environment and discusses their trouble- useful to beginners. Moreover, understanding of shootings. The understanding and experimental testing internal working of job submission and execution presented in this paper would become very useful to becomes very useful, in trouble-shooting, when the researchers who are new to the field of job management in higher-level services or applications report failures. Grid. In this research paper, we concisely present Keywords-LRM; Grid computing; remote job important concepts of Local Resource Manager and submission; remote job execution; Globus; Condor; Grid Grid computing environment. We discuss about testbed. preparing required Grid computing environment without spending much efforts and doing rework on I. INTRODUCTION real computing machines, for which we discuss about Grid environment made of Virtual Machines and made Research efforts in Grid computing [1] started due of real machines. Moreover, through experiments to limitation faced in the number of resources offered performed on our prepared Grid environment, we by Cluster computing [2]. Cluster computing contains show direct remote job submission on LRM and homogeneous resources, which are under a single through Grid computing mechanisms. The presented administrative control, whereas Grid computing can experimental study of remote job submission and allow use of resources present across boundary of an execution on LRM through Grid computing organization [3]. Therefore, use of Grid computing mechanism becomes very useful in learning concepts becomes essential when resources of a single practically and also when trouble-shooting underlying organization are not sufficient to solve their mechanisms of Grid environment. computation-demanding problem, generally, a The work in Introduction to Grid Computing with scientific experiment. Grid computing technology and Globus [4] discusses installation steps of building Grid software can aggregate resources of various infrastructure using GT 4. A recent work in [5] focuses organization in collaborative way with considering on implementation steps of achieving security in Grid various heterogeneity present in terms of architecture through Grid Security Infrastructure (GSI). Specifically, and software used at Cluster computing level. their work [5] focuses on installation and testing of host certificates and client certificates and their testing. As Distributed Resource Manager. Basic features of any compared to [5], our work has wider scope, not just LRM include following: security in Grid. As compared to [4], we focus on • Scripting language for defining batch job providing practical understanding of job execution on • Interfaces for submission of batch jobs LRM using LRM mechanisms and two Grid computing • Interfaces to monitor the executions mechanisms. Moreover, we also present an easy way of • Interfaces to submit input and gather output preparing Grid testbed without use of any networking data hardware, which researchers can implement in their • Mechanism for defining priorities for jobs Personal Computer or Laptop. Furthermore, our work • Match-making of jobs with resources also focuses on trouble-shooting of important problems • Scheduling jobs present in queue(s) to faced during use of Grid computing environment. determine execution order of jobs based on job- The rest of the paper is structured as follows. priority, resource status, and resource allocation Section II concisely describes Local Resource Manager configuration. and its responsibilities and discusses why there is need Advance reservation based LRMs are also used in of Grid mechanism to bridge heterogeneity of different Grid. Advance reservation allows reserving resources LRMs. Section III describes Grid computing in advance for their use in future. Examples of such environment and its responsibilities and provided LRMs include Platform LSF [7], PBS Pro/Torque [11], services. Section IV presents how to prepare Grid [12], Maui [13], and SGE [9]. computing environment for practical study. Section V Grid allows use of heterogeneous LRMs through a provides experimental testing of LRM using LRM common interface. In Globus based Grid, which is mechanisms and discusses need of Grid computing widely used, GRAM (Globus Resource Allocation mechanism. Section VI provides experimental testing Manager) interface is used for job submission on of remote job submission and execution on LRM using heterogeneous LRMs. GRAM is a standardized way of Grid computing mechanisms. Section VII discusses accessing any LRM, as it is LRM neutral. GRAM major problems faced during use of Grid computing messages remain same irrespective of LRM, whether environment and also shows their trouble-shooting. LRM is PBS, LSF, or Condor. Each LRM can be Finally, Section VIII provides conclusions and future connected into a Globus based Grid using Globus-LRM work. adapter. Clients submit job requests using GRAM protocol and GRAM-LRM connector can translate those messages into a language understood by a II. LOCAL RESOURCE MANAGER specific LRM. For example, GRAM5 supports Condor, In most Grid deployments, individual resources of PBS, and LSF; similarly, Unicore [14] supports SGE, a single organization are centrally managed using LoadLeveler and Torque. batch queue oriented local resource management We use HTCondor as LRM in our test Grid system (LRM). In such system, user submitted batch computing environment. Therefore, we concisely jobs are introduced into the queue of LRM, from which discuss about it. HTCondor [15], which is formerly resource scheduler decides about execution of jobs. known as Condor [8], is a set of daemons and Examples of such LRMs include PBS [6], LSF [7], commands that enable to implement concept of batch- Condor [8], Sun Grid Engine [9], and Load Leveler [10]. queue controlled Cluster computing [2] of jobs. A resource scheduler is also known as a low-level HTCondor is available for various platforms including scheduler or local scheduler, as it deals with the jobs Unix, Linux, and Windows OSes. We use HTCondor submitted in a single administrative domain. For and Condor interchangeably to refer to the same thing. compute resources, LRM can be configured for (i) The main goal of HTCondor is to provide high which user is allowed to run jobs, (ii) what policies are throughput distributed computing infrastructure that associated with selection of jobs for running, and (iii) transparently executes a large number of jobs over a what policies are associated with individual machines long period. Though, HTCondor allows use of for considering them to be idle. computing power of a computing machine to users, it An LRM manages two types of jobs: local jobs is the owner of the computing machine that defines generated inside a resource domain and jobs generated under which conditions Condor can allow use of the by external users, i.e. Grid users. The common purpose computing machine to batch jobs. For example, of any LRM is to manage, control, and schedule batch HTCondor can be configured to run a job when processes on the resources under its control. Since, keyboard and CPU are idle. Condor implements LRM deals with distributed resources, it is also called ClassAds (Class Advertisements) to describe requirements of jobs and specification of computing resources. Computing resources advertise their higher-level, bigger, global resource. A Grid resource properties such as available memory, CPU middleware offers following services: type, CPU speed, virtual memory size, physical • Submission and monitoring of remote jobs location, and current load average. Interaction with • Allocation and co-allocation of resources Condor system for various activities is done through

Experimental Study of Remote Job Submission and Execution on LRM Through Grid Computing Mechanisms

Push-Based Job Submission Using Reverse SSH Connections

Grid Computing: What Is It, and Why Do I Care?*

Introduction to Python University of Oxford Department of Particle Physics

The Translational Journey of the Htcondor-CE

Platform LSF Foundations Chapter 1

IBM Platform Computing Cloud Service Ready to Use Platform LSF & Symphony Clusters in the Softlayer Cloud

GT 4.0 RLS GT 4.0 RLS Table of Contents

A Globus Primer Or, Everything You Wanted to Know About Globus, but Were Afraid to Ask Describing Globus Toolkit Version 4

Parallelism of Machine Learning Algorithms

The Advanced Resource Connector User Guide

Platform LSF Quick Reference IBM Platform LSF 9.1.2 Quick Reference

Reflections on Building a High-Performance Computing Cluster Using Freebsd