Administering IBM Platform LSF Chapter 1
Platform LSF Version 9 Release 1.2
Administering Platform LSF
SC27-5302-02

Note
Before using this information and the product it supports, read the information in “Notices” on page 813.

First edition
This edition applies to version 9, release 1 of IBM Platform LSF (product number 5725G82) and to all subsequent releases and modifications until otherwise indicated in new editions.

Significant changes or additions to the text and illustrations are indicated by a vertical line (|) to the left of the change.

If you find an error in any Platform Computing documentation, or you have a suggestion for improving it, please let us know. Send your suggestions, comments, and questions to the following email address: [email protected] Be sure to include the publication title and order number and, if applicable, the specific location of the information about which you have comments (for example, a page number or a browser URL). When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

© Copyright IBM Corporation 1992, 2013. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Chapter 1. Managing Your Cluster ... 1
  Working with Your Cluster ... 1
  LSF Daemon Startup Control ... 22
  Working with Hosts ... 29
  Managing Jobs ... 64
  Working with Queues ... 95
  LSF Resources ... 103
  External Load Indices ... 130
  Managing Users and User Groups ... 144
  External Host and User Groups ... 149
  Between-Host User Account Mapping ... 153
  Cross-Cluster User Account Mapping ... 158
  UNIX/Windows User Account Mapping ... 163

Chapter 2. Cluster Version Management and Patching on UNIX and Linux ... 171
  Scope ... 171

Chapter 3. Monitoring Your Cluster ... 183
  Achieving Performance and Scalability ... 183
  Event Generation ... 194
  Tuning the Cluster ... 195
  Authentication and Authorization ... 206
  Submitting Jobs with SSH ... 212
  External Authentication ... 216
  Job Email and Job File Spooling ... 229
  Non-Shared File Systems ... 235
  Error and Event Logging ... 240
  Troubleshooting and Error Messages ... 249

Chapter 4. Time-Based Configuration ... 267
  Time Configuration ... 267
  Advance Reservation ... 273

Chapter 5. Job Scheduling Policies ... 293
  Preemptive Scheduling ... 293
  Specifying Resource Requirements ... 307
  Fairshare Scheduling ... 351
  Resource Preemption ... 383
  Guaranteed Resource Pools ... 388
  Goal-Oriented SLA-Driven Scheduling ... 398
  Exclusive Scheduling ... 416

Chapter 6. Job Scheduling and Dispatch ... 419
  Working with Application Profiles ... 419
  Job Directories and Data ... 434
  Resource Allocation Limits ... 436
  Reserving Resources ... 449
  Job Dependency and Job Priority ... 462
  Job Requeue and Job Rerun ... 480
  Job Migration ... 484
  Job Checkpoint and Restart ... 493
  Resizable Jobs ... 507
  Chunk Jobs and Job Arrays ... 515
  Job Packs ... 527

Chapter 7. Energy Aware Scheduling ... 531
  About Energy Aware Scheduling (EAS) ... 531
  Managing host power states ... 531
  CPU frequency management ... 540
  Automatic CPU frequency selection ... 543

Chapter 8. Job Execution and Interactive Jobs ... 555
  Runtime Resource Usage Limits ... 555
  Load Thresholds ... 570
  Pre-Execution and Post-Execution Processing ... 574
  Job Starters ... 592
  Job Controls ... 597
  External Job Submission and Execution Controls ... 603
  Interactive Jobs with bsub ... 622
  Interactive and Remote Tasks ... 632
  Running Parallel Jobs ... 638

Chapter 9. Appendices ... 715
  Submitting Jobs Using JSDL ... 715
  Using lstch ... 725
  Using Session Scheduler ... 733
  Using lsmake ... 748
  Managing LSF on EGO ... 754
  LSF Integrations ... 772
  Launching ANSYS Jobs ... 810
  PVM Jobs ... 810

Notices ... 813
  Trademarks ... 815
  Privacy policy considerations ... 815

Index ... 817

Chapter 1. Managing Your Cluster

Working with Your Cluster

Learn about LSF

Before using LSF for the first time, you should download and read the LSF Foundations Guide for an overall understanding of how LSF works.
Basic concepts

Job states: LSF jobs have the following states:
- PEND: Waiting in a queue for scheduling and dispatch
- RUN: Dispatched to a host and running
- DONE: Finished normally with zero exit value
- EXIT: Finished with non-zero exit value
- PSUSP: Suspended while pending
- USUSP: Suspended by user
- SSUSP: Suspended by the LSF system
- POST_DONE: Post-processing completed without errors
- POST_ERR: Post-processing completed with errors
- UNKWN: mbatchd lost contact with sbatchd on the host on which the job runs
- WAIT: For jobs submitted to a chunk job queue, members of a chunk job that are waiting to run
- ZOMBI: A job becomes ZOMBI if the execution host is unreachable when a non-rerunnable job is killed or a rerunnable job is requeued

Host: An individual computer in the cluster. Each host might have more than one processor. Multiprocessor hosts are used to run parallel jobs. A multiprocessor host with a single process queue is considered a single machine, while a box full of processors that each have their own process queue is treated as a group of separate machines.

Tip: The names of your hosts should be unique. They should not be the same as the cluster name or any queue defined for the cluster.

Job: A unit of work that is run in the LSF system. A job is a command submitted to LSF for execution, using the bsub command. LSF schedules, controls, and tracks the job according to configured policies. Jobs can be complex problems, simulation scenarios, extensive calculations, or anything else that needs compute power.

Job files: When a job is submitted to a queue, LSF holds it in a job file until conditions are right for it to be executed. Then the job file is used to execute the job.
- UNIX: The job file is a Bourne shell script that is run at execution time.
- Windows: The job file is a batch file that is processed at execution time.
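
As an illustration of the job states listed under Basic concepts, the following minimal sketch captures them in a Python enumeration so that a script can look up a state by name. The `JobState` class, the `SUSPENDED` grouping, and the `is_suspended` helper are hypothetical names chosen for this example; they are not part of LSF or of any LSF API.

```python
from enum import Enum


class JobState(Enum):
    """LSF job states, with descriptions condensed from the list of job states."""
    PEND = "Waiting in a queue for scheduling and dispatch"
    RUN = "Dispatched to a host and running"
    DONE = "Finished normally with zero exit value"
    EXIT = "Finished with non-zero exit value"
    PSUSP = "Suspended while pending"
    USUSP = "Suspended by user"
    SSUSP = "Suspended by the LSF system"
    POST_DONE = "Post-processing completed without errors"
    POST_ERR = "Post-processing completed with errors"
    UNKWN = "mbatchd lost contact with sbatchd on the execution host"
    WAIT = "Chunk job member waiting to run"
    ZOMBI = "Execution host unreachable when the job was killed or requeued"


# Assumption: this grouping exists only for illustration; LSF does not define it.
SUSPENDED = {JobState.PSUSP, JobState.USUSP, JobState.SSUSP}


def is_suspended(state_name: str) -> bool:
    """Return True if the named state is one of the three suspended states."""
    return JobState[state_name] in SUSPENDED


if __name__ == "__main__":
    for name in ("PEND", "USUSP", "ZOMBI"):
        print(name, "-", JobState[name].value, "- suspended:", is_suspended(name))
```

Looking up an unknown state name raises `KeyError`, which may be preferable to silently accepting a misspelled state in a monitoring script.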
Interactive batch job: A batch job that allows you to interact with the application and still take advantage of LSF scheduling policies and fault tolerance. All input and output are through the terminal that you used to type the job submission command. When you submit an interactive job, a message is displayed while the job is awaiting scheduling. A new job cannot be submitted until the interactive job is completed or terminated.

Interactive task: A command that is not submitted to a batch queue and scheduled by LSF, but is dispatched immediately. LSF locates the resources needed by the task and chooses the best host among the candidate hosts that has the required resources and is lightly loaded. Each command can be a single process, or it can be a group of cooperating processes. Tasks run without the batch processing features of LSF, but they still benefit from resource requirements and from load-based selection of the best host.

Local task: An application or command that does not make sense to run remotely, for example, the ls command on UNIX.

Remote task: An application or command that can be run on another machine in the cluster.

Host types and host models: Hosts in LSF are characterized by host type and host model. For example, a host with type X86_64 can have host model Opteron240, Opteron840, Intel_EM64T, and so on.

Host type: The combination of operating system and host CPU architecture. All computers that run the same operating system on the same computer architecture are of the same type; in other words, they are binary-compatible with each other. Each host type usually requires a different set of LSF binary files.

Host model: The host type of the computer, which determines the CPU speed scaling factor that is applied in load and placement calculations. The CPU factor is considered when jobs are being dispatched.
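
The relationship between host type and host model can be sketched as a small data structure. The example below groups hosts by type, reflecting the statement that hosts of the same type are binary-compatible. This is an illustrative sketch only: the `Host` class, the `group_by_type` function, the host names, and the `TYPE_B`/`MODEL_B` placeholders are assumptions made for this example, not LSF identifiers; only X86_64 and its three models come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str        # hypothetical host name
    host_type: str   # operating system plus CPU architecture, e.g. X86_64
    host_model: str  # model within the type, e.g. Opteron240


def group_by_type(hosts):
    """Map each host type to the sorted names of the hosts of that type."""
    groups = defaultdict(list)
    for host in hosts:
        groups[host.host_type].append(host.name)
    return {host_type: sorted(names) for host_type, names in groups.items()}


hosts = [
    Host("hostA", "X86_64", "Opteron240"),
    Host("hostB", "X86_64", "Opteron840"),
    Host("hostC", "X86_64", "Intel_EM64T"),
    Host("hostD", "TYPE_B", "MODEL_B"),  # placeholder type and model, not real LSF names
]

if __name__ == "__main__":
    for host_type, names in group_by_type(hosts).items():
        print(host_type, names)
```

Hosts that land in the same group would typically share one set of LSF binary files, while each additional group usually needs its own set.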
Resources:

Resource usage: The LSF system uses built-in and configured resources to track resource availability and usage. Jobs are scheduled according to the resources available on individual hosts. LSF monitors the resources that jobs submitted through it use while they are running. This information is used to enforce resource limits and load thresholds, as well as fairshare scheduling. LSF collects information such as:
- Total CPU time that is consumed by all processes in the job
- Total resident memory usage in KB of all currently running processes in a job
- Total virtual memory usage in KB of all currently running processes in a job
- Currently active process group ID in a job
- Currently active processes in a job

On UNIX, job-level resource usage is collected through PIM.

Load indices: Load indices measure the availability of dynamic, non-shared resources on hosts in the cluster. Load indices built into the LIM are updated at fixed time intervals.

External load indices: Defined and configured by the LSF administrator and collected by an External Load Information Manager (ELIM) program. The ELIM also