An Extremely Quick Introduction to

Joel Hollingsworth ([email protected])
Department of Computing Sciences
Elon University

What is the Grid?

“The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.”

“The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem solving...”

The Anatomy of the Grid
Ian Foster, Carl Kesselman, and Steven Tuecke

A Three Point Checklist (What is a Grid?):
- computing resources are not administered centrally
- open standards are used
- non-trivial quality of service is achieved

Types of Grids:
- Computational Grids
- Data Grids (distributed data)
- Equipment Grids (telescope, etc.)

http://devresource.hp.com/drc/technical_papers/grid_soa/04.png

Grid Software (http://www.globus.org/)

So, are these Grids?

Distributed Resource Management using Sun Grid Engine

Joel Hollingsworth ([email protected])
Department of Computing Sciences
Elon University

What is Distributed Resource Management?

A DRM system is a software application in charge of the unattended background execution of jobs.

Basic features include:
- interface to define workflows and/or job dependencies
- automatic submission of jobs across multiple machines
- interfaces to monitor and report on jobs
- priorities and/or queues to control the execution order
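The last feature, priorities and queues, can be pictured with a toy model. The sketch below is only an illustration: the class `ToyJobQueue`, its `Job` fields, and the rule "higher priority first, ties broken by submission order" are assumptions made for this example and do not describe how any particular DRM schedules jobs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of a DRM pending queue (hypothetical; not taken from any real DRM).
class ToyJobQueue {
    // A pending job: its name, its priority, and its submission order.
    static final class Job {
        final String name;
        final int priority;
        final long seq;

        Job(String name, int priority, long seq) {
            this.name = name;
            this.priority = priority;
            this.seq = seq;
        }
    }

    private long nextSeq = 0;

    // Higher priority runs first; equal priorities run in submission order.
    private final PriorityQueue<Job> pending = new PriorityQueue<Job>((x, y) -> {
        if (x.priority != y.priority) {
            return Integer.compare(y.priority, x.priority);
        }
        return Long.compare(x.seq, y.seq);
    });

    void submit(String name, int priority) {
        pending.add(new Job(name, priority, nextSeq++));
    }

    // Removes all pending jobs and returns their names in execution order.
    List<String> drain() {
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            order.add(pending.poll().name);
        }
        return order;
    }

    public static void main(String[] args) {
        ToyJobQueue q = new ToyJobQueue();
        q.submit("nightly-report", 1);
        q.submit("urgent-fix", 10);
        q.submit("backup", 1);
        System.out.println(q.drain()); // [urgent-fix, nightly-report, backup]
    }
}
```

A real DRM also weighs load, resource requests, and fair-share policy; this model orders jobs by priority alone.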

Examples: Sun Grid Engine, Condor, LSF, PBS

DRMs are typically implemented on a cluster of machines.

What is a Cluster?

“Cluster is a widely-used term meaning independent computers combined into a unified system through software and networking.”

“Clusters are typically used for High Availability for greater reliability or High Performance Computing to provide greater computational power than a single computer can provide.”

http://beowulf.org/overview/

A cluster is a powerful Grid resource, but it is not considered a Grid system because its hosts are under centralized control.

Sun Grid Engine (SGE)

DRM software from Sun (free and open source)
- http://gridengine.sunsource.net
- aggregates computer power
- creates computer farms

SGE is used to do the following:
- optimally place computing tasks and balance the load on a set of networked computers
- allow users to generate and queue more computing tasks than can be run at the moment
- ensure that tasks are executed with respect to priority and provide all users with a fair share of access over time

[Figure: SGE cluster diagram. Labels: network submission, queue submission, master daemon, scheduler, execution daemons, compute tasks; hosts grid0 through grid8.]

qmon - graphical user interface for SGE usage/administration

[jkh@grid]$ qmon

[Figure: qmon screenshots. Labels: Job Control, Submit Jobs.]

qrsh - run shell script or executable
  -now no   queue the job if it cannot run immediately

[jkh@grid]$ qrsh uname -a
Linux node2.cs.appstate.edu 2.6.9-34.EL #1 Fri Feb 24 16:44:51 EST 2006 i686 i686 i386 GNU/Linux

qsub - submit job scripts
  no arguments   accepts input from STDIN (^D to end input and submit)
  -cwd           run the job from the current working directory
  -o             redirect standard output (default: home directory)
  -e             redirect standard error (default: home directory)
  -j             merge stderr/stdout (y/n)
  -S             specifies the interpreting shell for the job

u.sh:

#!/bin/bash

#$ -cwd
#$ -o uname.out
#$ -j y
#$ -S /bin/sh

uname -a

[jkh@grid]$ qsub u.sh
Your job 138 ("u.sh") has been submitted.

[jkh@grid]$ cat uname.out
Linux node5.cs.appstate.edu 2.6.9-34.EL #1 Fri Feb 24 16:44:51 EST 2006 i686 i686 i386 GNU/Linux
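Lines in u.sh that begin with `#$` are read by qsub as if they were command-line options. As a rough illustration, the sketch below collects those lines and prints the equivalent explicit qsub command. The class and method names (`EmbeddedOptions`, `qsubArgs`) are made up for this example, and the parsing is deliberately naive (whitespace splitting only, no quoting), so it is not how SGE itself parses scripts.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: extracts "#$" option lines from a job script.
class EmbeddedOptions {
    // Returns the option tokens found on lines that start with "#$".
    static List<String> qsubArgs(String script) {
        List<String> args = new ArrayList<>();
        for (String line : script.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("#$")) {
                // Drop the "#$" prefix and split the rest on whitespace.
                for (String token : trimmed.substring(2).trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        args.add(token);
                    }
                }
            }
        }
        return args;
    }

    public static void main(String[] args) {
        String script = String.join("\n",
            "#!/bin/bash",
            "#$ -cwd",
            "#$ -o uname.out",
            "#$ -j y",
            "#$ -S /bin/sh",
            "uname -a");
        // Prints: qsub -cwd -o uname.out -j y -S /bin/sh u.sh
        System.out.println("qsub " + String.join(" ", qsubArgs(script)) + " u.sh");
    }
}
```

Embedding the options in the script keeps the submission command short and makes the job reproducible from the script file alone.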

sleeper.sh:

#!/bin/bash

#$ -cwd
#$ -j y
#$ -S /bin/sh

sleep $1

[jkh@grid]$ qsub sleeper.sh 5
Your job 139 ("sleeper.sh") has been submitted.

qstat - show job/queue status
  no arguments   show currently running/pending jobs
  -f             show full listing of all jobs/queues
  -F             show full-format listing of all jobs/queues

[jkh@grid]$ qstat -f
queuename            qtype  used/tot.  load_avg  arch      states
[email protected]     BIP    0/2        0.00      lx24-x86
[email protected]     BIP    0/2        0.00      lx24-x86
[email protected]     BIP    0/2        0.00      lx24-x86
[email protected]     BIP    0/2        0.00      lx24-x86
[email protected]     BIP    0/2        0.00      lx24-x86
[email protected]     BIP    0/2        0.00      lx24-x86
[email protected]     BIP    0/2        0.00      lx24-x86

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    138 0.00000 u.sh       jhollingswor qw    11/02/2006 10:09:41     1

qdel - delete jobs
  -f   force action for running jobs

[jkh@grid]$ qdel 138
jkh has deleted job 138

Getting SGE to run Java:

[jkh@grid]$ javac Factorial.java
[jkh@grid]$ java Factorial 4
Factorial(4) = 24
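The source of Factorial.java is not shown in these slides. A minimal sketch that would produce the output above might look like the following; the method name `factorial`, the use of `long`, and the usage message are assumptions made for this example.

```java
// Hypothetical reconstruction of Factorial.java; the original source is not shown.
class Factorial {
    // Computes n! iteratively. A long overflows for n > 20.
    static long factorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        long result = 1;
        for (int i = 2; i <= n; i++) {
            result *= i;
        }
        return result;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java Factorial <n>");
            return;
        }
        int n = Integer.parseInt(args[0]);
        System.out.println("Factorial(" + n + ") = " + factorial(n));
    }
}
```

Reading the argument from the command line is what lets the same program run unchanged under SGE, where the job script forwards `$1`.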

fact.sh:

#!/bin/bash

#$ -cwd
#$ -j y
#$ -S /bin/sh

java Factorial $1

[jkh@grid]$ qsub fact.sh 4
Your job 65816 ("fact.sh") has been submitted.
[jkh@grid]$ cat fact.sh.o65816
Factorial(4) = 24

Cryptex Challenge

Assume we have a cryptex with four wheels containing the numbers 0-9.

Your task is to find the correct four digit sequence that opens the cryptex.

You must find all four digits before the cryptex can be opened.

The only feedback you receive is whether the four digit sequence opens the cryptex or not.

What is your approach to solving this problem?

How many tests (in the worst case) would you have to perform?

Assuming each test takes 1/4 second to perform, how long will this process take?

CryptexChallenge/Java:

[jkh@grid0 Java]$ ls
Cryptex.class  SolveCryptex.java

SolveCryptex.java:

public class SolveCryptex {
    public static void main(String[] args) {
        Cryptex cryptex = new Cryptex("VSBMNABHIHCKUOSF");

        // Candidate digits for the four wheels.
        int a = 1, b = 2, c = 3, d = 4;

        // Print the combination only if it opens the cryptex.
        if (cryptex.test(a, b, c, d)) {
            System.out.println(a + " " + b + " " + c + " " + d);
        }
    }
}

Compiling/Running Java:

[jkh@grid0 Java]$ javac SolveCryptex.java
[jkh@grid0 Java]$ java SolveCryptex

We have limited each account to no more than 20 jobs in the queue at one time.

You could write some code that sequentially tests all possibilities. Is that a good approach?
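To get a feel for the sequential approach, the sketch below counts how many tests an exhaustive search needs. Since only Cryptex.class is provided, the `Lock` interface here is a hypothetical stand-in that mirrors the `test(a, b, c, d)` call; the class name `BruteForce` and the helper `testsUntilOpen` are also made up for this example.

```java
// Sequential exhaustive search over four wheels holding the digits 0-9.
class BruteForce {
    // Hypothetical stand-in for the provided Cryptex class.
    interface Lock {
        boolean test(int a, int b, int c, int d);
    }

    // Returns how many tests were performed before the lock opened, or -1 if it never did.
    static int testsUntilOpen(Lock lock) {
        int tests = 0;
        for (int a = 0; a < 10; a++) {
            for (int b = 0; b < 10; b++) {
                for (int c = 0; c < 10; c++) {
                    for (int d = 0; d < 10; d++) {
                        tests++;
                        if (lock.test(a, b, c, d)) {
                            return tests;
                        }
                    }
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Worst case for this search order: the secret is the last combination tried.
        Lock worstCase = (a, b, c, d) -> a == 9 && b == 9 && c == 9 && d == 9;
        int tests = testsUntilOpen(worstCase);
        double seconds = tests * 0.25; // 1/4 second per test, as stated in the challenge
        System.out.println(tests + " tests, about " + seconds + " seconds");
    }
}
```

With ten digits on each of four wheels, the worst case is 10 x 10 x 10 x 10 = 10,000 tests, or 2,500 seconds (about 42 minutes) at 1/4 second per test.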

You could write some code that tests a group of possible combinations (a, b, c, d), for example:
0 ≤ a < 10, 0 ≤ b < 10, 0 ≤ c < 10, d = 1
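One possible shape for such a group test is sketched below, assuming each wheel runs over 0-9 as stated in the challenge. The program fixes `d` from its first command-line argument and searches every (a, b, c). The `Lock` interface is again a hypothetical stand-in for the provided Cryptex class, and the secret 4 0 7 1 is invented purely so the example runs.

```java
// Searches one group of combinations: all (a, b, c) for a fixed last digit d.
class GroupSearch {
    // Hypothetical stand-in for the provided Cryptex class.
    interface Lock {
        boolean test(int a, int b, int c, int d);
    }

    // Returns the opening combination within this group, or null if none opens the lock.
    static int[] searchGroup(Lock lock, int d) {
        for (int a = 0; a < 10; a++) {
            for (int b = 0; b < 10; b++) {
                for (int c = 0; c < 10; c++) {
                    if (lock.test(a, b, c, d)) {
                        return new int[] {a, b, c, d};
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The group to search is chosen by the first argument (defaults to 0).
        int d = args.length > 0 ? Integer.parseInt(args[0]) : 0;

        // Invented secret for demonstration only.
        Lock demo = (a, b, c, x) -> a == 4 && b == 0 && c == 7 && x == 1;

        int[] found = searchGroup(demo, d);
        if (found != null) {
            System.out.println(found[0] + " " + found[1] + " " + found[2] + " " + found[3]);
        }
    }
}
```

Splitting on `d` gives ten groups of 1,000 tests each, so every group finishes in at most 250 seconds at 1/4 second per test, and ten jobs fit within the limit of 20 queued jobs per account.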

Run this code in parallel over the different groups.

DRMAA

Distributed Resource Management Application API
- API for job submission and control
- lets the application programmer easily submit jobs to a DRM from within a program

Sun Grid Engine - C and Java class library

Condor 6/PBS/Torque/Gridway - C library

Perl/Python/Ruby modules based on the DRMAA C interface

http://drmaa.org/wiki/
http://gridengine.sunsource.net/howto/drmaa_java.html

Howto1.java:

import org.ggf.drmaa.*;

// Open and close an SGE session.

public class Howto1 {
    public static void main (String[] args) {
        SessionFactory factory = SessionFactory.getFactory ();
        Session session = factory.getSession ();

        try {
            session.init (null);
            session.exit ();
        } catch (DrmaaException e) {
            System.out.println ("Error: " + e.getMessage ());
        }
    }
}

Howto2.java:

import org.ggf.drmaa.*;

// Run a single SGE job.

public class Howto2 {
    public static void main (String[] args) {
        SessionFactory factory = SessionFactory.getFactory ();
        Session session = factory.getSession ();

        try {
            session.init (null);
            JobTemplate jt = session.createJobTemplate ();
            jt.setRemoteCommand ("sleeper.sh");
            jt.setArgs (new String[] {"5"});

            String id = session.runJob (jt);

            System.out.println ("Your job has been submitted with id " + id);

            session.deleteJobTemplate (jt);
            session.exit ();
        } catch (DrmaaException e) {
            System.out.println ("Error: " + e.getMessage ());
        }
    }
}

Howto2_1.java:

import org.ggf.drmaa.*;

// Run the same SGE job multiple times.

public class Howto2_1 {
    public static void main (String[] args) {
        SessionFactory factory = SessionFactory.getFactory ();
        Session session = factory.getSession ();

        try {
            session.init (null);
            JobTemplate jt = session.createJobTemplate ();
            jt.setRemoteCommand ("sleeper.sh");
            jt.setArgs (new String[] {"5"});

            // Bulk submission: task indices run from 1 to 30 in steps of 2.
            java.util.List ids = session.runBulkJobs (jt, 1, 30, 2);
            java.util.Iterator i = ids.iterator ();

            while (i.hasNext ()) {
                System.out.println ("Your job has been submitted with id " + i.next ());
            }

            session.deleteJobTemplate (jt);
            session.exit ();
        } catch (DrmaaException e) {
            System.out.println ("Error: " + e.getMessage ());
        }
    }
}

Howto2_2.java:

import org.ggf.drmaa.*;

// Queue multiple versions of a program.

public class Howto2_2 {
    public static void main (String[] args) {
        SessionFactory factory = SessionFactory.getFactory ();
        Session session = factory.getSession ();

        try {
            session.init (null);
            JobTemplate jt = session.createJobTemplate ();
            jt.setRemoteCommand ("sleeper.sh");

            // Reuse one template, changing only the argument for each job.
            for (int i = 0; i < 10; i++) {
                jt.setArgs (new String[] {"" + i});
                String id = session.runJob (jt);
                System.out.println ("Your job has been submitted with id " + id);
            }

            session.deleteJobTemplate (jt);
            session.exit ();
        } catch (DrmaaException e) {
            System.out.println ("Error: " + e.getMessage ());
        }
    }
}

Howto3.java:

import org.ggf.drmaa.*;

// Wait for a single job to finish.

public class Howto3 {
    public static void main (String[] args) {
        SessionFactory factory = SessionFactory.getFactory ();
        Session session = factory.getSession ();

        try {
            session.init (null);
            JobTemplate jt = session.createJobTemplate ();
            jt.setRemoteCommand ("sleeper.sh");
            jt.setArgs (new String[] {"5"});

            String id = session.runJob (jt);
            System.out.println ("Your job has been submitted with id " + id);
            session.deleteJobTemplate (jt);

            // Block until the job finishes, then report its exit status.
            JobInfo info = session.wait (id, Session.TIMEOUT_WAIT_FOREVER);
            System.out.println ("Job " + info.getJobId () + " finished with exit status " + info.getExitStatus ());
            session.exit ();
        } catch (DrmaaException e) {
            System.out.println ("Error: " + e.getMessage ());
        }
    }
}

Real-World Usage

Multi-Objective NSGA-II (http://www.iitk.ac.in/kangal/codes.shtml)

Genetic Algorithm

[Figure: genetic algorithm cycle. Labels: Populate, Evaluate, Select Parents, Crossover/Mutation, Offspring, Decoded Strings, Scores, Pairs/Singles.]

for i = 0 to population_size:
    Evaluate score(population[i])

The evaluation is done for every generation. There will be many.

What happens when score(population[i]) is actually a simulation and takes multiple minutes to run?
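The evaluation loop can be sketched in Java to show why slow scoring matters and what changes when evaluations run concurrently. Everything here is an assumption made for illustration: `score` is a stand-in that sleeps briefly instead of running a simulation, individuals are plain integers, and a local thread pool plays the role that SGE execution hosts would play on a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelEvaluation {
    // Hypothetical stand-in for a slow simulation: sleeps, then returns a score.
    static double score(int individual) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return (double) individual * individual;
    }

    // One evaluation after another: total time grows with the population size.
    static double[] evaluateSequential(int[] population) {
        double[] scores = new double[population.length];
        for (int i = 0; i < population.length; i++) {
            scores[i] = score(population[i]);
        }
        return scores;
    }

    // Independent evaluations are handed to a pool of workers and collected in order.
    static double[] evaluateParallel(int[] population, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int individual : population) {
                Callable<Double> task = () -> score(individual);
                futures.add(pool.submit(task));
            }
            double[] scores = new double[population.length];
            for (int i = 0; i < scores.length; i++) {
                scores[i] = futures.get(i).get();
            }
            return scores;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] population = {1, 2, 3, 4, 5, 6, 7, 8};
        long t0 = System.nanoTime();
        evaluateSequential(population);
        long t1 = System.nanoTime();
        evaluateParallel(population, 4);
        long t2 = System.nanoTime();
        System.out.println("sequential ms: " + (t1 - t0) / 1_000_000);
        System.out.println("parallel ms:   " + (t2 - t1) / 1_000_000);
    }
}
```

Both versions return the same scores in the same order; only the wall-clock time differs, and this is safe only when no evaluation depends on another's result.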

Each score(population[i]) is independent of all others ... we can parallelize the evaluation loop.

[Figure: system architecture. Labels: Client (NSGA-II, Java Wrapper Class, AMPL, Model File), SGE, Local Area Network, Cluster (evalSolver instances, each with an AMPL Library).]

Thanks!