Administering IBM Platform

Platform LSF Version 9 Release 1.1 Administering Platform LSF SC27-5302-01 Platform LSF Version 9 Release 1.1 Administering Platform LSF SC27-5302-01 Note Before using this information and the product it supports, read the information in “Notices” on page 751. First edition This edition applies to version 9, release 1 of IBM Platform LSF (product number 5725G82) and to all subsequent releases and modifications until otherwise indicated in new editions. © Copyright IBM Corporation 1992, 2013. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Managing Your Cluster ........1 Resource Preemption ...........378 Working with Your Cluster .........1 Guaranteed Resource Pools.........383 LSF Daemon Startup Control ........22 Goal-Oriented SLA-Driven Scheduling .....393 Working with Hosts ...........30 Exclusive Scheduling ...........412 Managing Jobs .............66 Working with Queues ...........90 Job Scheduling and Dispatch ....415 LSF Resources .............99 Working with Application Profiles ......415 External Load Indices ..........122 Job Directories and Data..........430 Managing Users and User Groups ......135 Resource Allocation Limits .........433 External Host and User Groups .......140 Reserving Resources ...........445 Between-Host User Account Mapping .....144 Job Dependency and Job Priority .......459 Cross-Cluster User Account Mapping .....150 Job Requeue and Job Rerun ........477 UNIX/Windows User Account Mapping ....154 Job Migration .............482 Job Checkpoint and Restart.........491 Cluster Version Management and Resizable Jobs .............505 Patching on UNIX and Linux .....163 Chunk Jobs and Job Arrays.........514 Scope ................163 Job Packs...............527 Enable LSF HPC Features .........173 Running Parallel Jobs...........529 Monitoring Your Cluster .......179 Job Execution and Interactive Jobs 613 Achieving Performance and Scalability .....179 Runtime Resource Usage Limits .......613 Event Generation ............190 Load Thresholds ............627 Tuning the Cluster............192 Pre-Execution and Post-Execution Processing . 631 Authentication and Authorization ......203 Job Starters ..............646 Submitting Jobs with SSH .........210 Job Controls ..............651 External Authentication ..........214 External Job Submission and Execution Controls 657 Job Email and Job File Spooling .......225 Interactive Jobs with bsub .........674 Non-Shared File Systems .........230 Interactive and Remote Tasks ........685 Error and Event Logging .........235 Troubleshooting and Error Messages .....244 Appendices ............693 Submitting Jobs Using JSDL ........693 Time-Based Configuration ......261 Using lstch ..............703 Time Configuration ...........261 Using Session Scheduler ..........711 Advance Reservation ...........268 Using lsmake .............726 Managing LSF on EGO ..........732 Job Scheduling Policies.......289 Notices ..............751 Preemptive Scheduling ..........289 Specifying Resource Requirements ......304 Trademarks ..............753 Fairshare Scheduling ...........345 © Copyright IBM Corp. 1992, 2013 iii iv Administering IBM Platform LSF Managing Your Cluster “Working with Your Cluster” “LSF Daemon Startup Control” on page 22 “Working with Hosts” on page 30 “Managing Jobs” on page 66 “Working with Queues” on page 90 “LSF Resources” on page 99 “External Load Indices” on page 122 “Managing Users and User Groups” on page 135 “External Host and User Groups” on page 140 “Between-Host User Account Mapping” on page 144 “Cross-Cluster User Account Mapping” on page 150 “UNIX/Windows User Account Mapping” on page 154 Working with Your Cluster “Learn about LSF” “View cluster information” on page 4 “Example directory structures” on page 8 “Add cluster administrators” on page 10 “Control daemons” on page 11 “Control mbatchd” on page 12 “Reconfigure your cluster” on page 14 “Live reconfiguration” on page 16 Learn about LSF Before using LSF for the first time, you should download and read LSF Foundations Guide for an overall understanding of how LSF works. Basic concepts Job states: LSF jobs have the following states: v PEND: Waiting in a queue for scheduling and dispatch v RUN: Dispatched to a host and running v DONE: Finished normally with zero exit value v EXIT: Finished with non-zero exit value v PSUSP: Suspended while pending v USUSP: Suspended by user v SSUSP: Suspended by the LSF system v POST_DONE: Post-processing completed without errors v POST_ERR: Post-processing completed with errors v UNKWN: mbatchd lost contact with sbatchd on the host on which the job runs v WAIT: For jobs submitted to a chunk job queue, members of a chunk job that are waiting to run © Copyright IBM Corp. 1992, 2013 1 v ZOMBI: A job becomes ZOMBI if the execution host is unreachable when a non-rerunnable job is killed or a rerunnable job is requeued Host: An individual computer in the cluster. Each host might have more than one processor. Multiprocessor hosts are used to run parallel jobs. A multiprocessor host with a single process queue is considered a single machine, while a box full of processors that each have their own process queue is treated as a group of separate machines. Tip: The names of your hosts should be unique. They should not be the same as the cluster name or any queue defined for the cluster. Job: A unit of work that is run in the LSF system. A job is a command submitted to LSF for execution, using the bsub command. LSF schedules, controls, and tracks the job according to configured policies. Jobs can be complex problems, simulation scenarios, extensive calculations, anything that needs compute power. Job files When a job is submitted to a queue, LSF holds it in a job file until conditions are right for it to be executed. Then the job file is used to execute the job. UNIX: The job file is a Bourne shell script that is run at execution time. Windows: The job file is a batch file that is processed at execution time. Interactive batch job: A batch job that allows you to interact with the application and still take advantage of LSF scheduling policies and fault tolerance. All input and output are through the terminal that you used to type the job submission command. When you submit an interactive job, a message is displayed while the job is awaiting scheduling. A new job cannot be submitted until the interactive job is completed or terminated. Interactive task: A command that is not submitted to a batch queue and scheduled by LSF, but is dispatched immediately. LSF locates the resources needed by the task and chooses the best host among the candidate hosts that has the required resources and is lightly loaded. Each command can be a single process, or it can be a group of cooperating processes. Tasks are run without using the batch processing features of LSF but still with the advantage of resource requirements and selection of the best host to run the task based on load. Local task: An application or command that does not make sense to run remotely. For example, the ls command on UNIX. 2 Administering IBM Platform LSF Remote task: An application or command that can be run on another machine in the cluster. Host types and host models: Hosts in LSF are characterized by host type and host model. The following example is a host with type X86_64, with host models Opteron240, Opteron840, Intel_EM64T, etc. Host type: The combination of operating system and host CPU architecture. All computers that run the same operating system on the same computer architecture are of the same type - in other words, binary-compatible with each other. Each host type usually requires a different set of LSF binary files. Host model: The host type of the computer, which determines the CPU speed scaling factor that is applied in load and placement calculations. The CPU factor is considered when jobs are being dispatched. Resources: Resource usage: The LSF system uses built-in and configured resources to track resource availability and usage. Jobs are scheduled according to the resources available on individual hosts. Jobs submitted through the LSF system will have the resources they use monitored while they are running. This information is used to enforce resource limits and load thresholds as well as fairshare scheduling. LSF collects information such as: v Total CPU time that is consumed by all processes in the job v Total resident memory usage in KB of all currently running processes in a job v Total virtual memory usage in KB of all currently running processes in a job v Currently active process group ID in a job v Currently active processes in a job On UNIX, job-level resource usage is collected through PIM. Load indices: Load indices measure the availability of dynamic, non-shared resources on hosts in the cluster. Load indices built into the LIM are updated at fixed time intervals. Managing Your Cluster 3 External load indices: Defined and configured by the LSF administrator and collected by an External Load Information Manager (ELIM) program. The ELIM also updates LIM when new values are received. Static resources: Built-in resources that represent host information that does not change over time, such as the maximum RAM available to user processes or the number of processors in a machine. Most static resources are determined by the LIM at start-up time. Static resources can be used to select appropriate hosts for particular jobs that are based on binary architecture, relative CPU speed, and system configuration. Load thresholds: Two types of load thresholds can be configured by your LSF administrator to schedule jobs in queues. Each load threshold specifies a load index value: v loadSched determines the load condition for dispatching pending jobs. If a host’s load is beyond any defined loadSched, a job cannot be started on the host. This threshold is also used as the condition for resuming suspended jobs. v loadStop determines when running jobs should be suspended. To schedule a job on a host, the load levels on that host must satisfy both the thresholds configured for that host and the thresholds for the queue from which the job is being dispatched.

Administering IBM Platform

Platform LSF Foundations Chapter 1

IBM Platform Computing Cloud Service Ready to Use Platform LSF & Symphony Clusters in the Softlayer Cloud

QEMU Version 4.2.0 User Documentation I

Platform LSF Quick Reference IBM Platform LSF 9.1.2 Quick Reference

How to Configure the NFX250

Administering IBM Platform LSF Chapter 1

Experimental Study of Remote Job Submission and Execution on LRM Through Grid Computing Mechanisms

The X-Bone & Its Virtual Internet Architecture

Turbo-Charging Open Source Hadoop for Faster, More Meaningful Insights

Multihoming with ILNP in Freebsd

Platform LSF Command Reference Chapter 99

IP Flow Mobility Support for Proxy Mobile Ipv6 Based Networks