Administering IBM Platform LSF Chapter 1
Platform LSF Version 9 Release 1.3
Administering Platform LSF
SC27-5302-03

Note
Before using this information and the product it supports, read the information in “Notices” on page 827.

First edition
This edition applies to version 9, release 1 of IBM Platform LSF (product number 5725G82) and to all subsequent releases and modifications until otherwise indicated in new editions.

Significant changes or additions to the text and illustrations are indicated by a vertical line (|) to the left of the change.

If you find an error in any Platform Computing documentation, or you have a suggestion for improving it, please let us know. In the IBM Knowledge Center, add your comments and feedback to any topic. You can also send your suggestions, comments, and questions to the following email address: [email protected]

Be sure to include the publication title and order number and, if applicable, the specific location of the information about which you have comments (for example, a page number or a browser URL). When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

© Copyright IBM Corporation 1992, 2014.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Chapter 1. Managing Your Cluster
    Working with Your Cluster
    LSF Daemon Startup Control
    Working with Hosts
    Managing Jobs
    Working with Queues
    LSF Resources
    External Load Indices
    Managing Users and User Groups
    External Host and User Groups
    Between-Host User Account Mapping
    Cross-Cluster User Account Mapping
    UNIX/Windows User Account Mapping

Chapter 2. Cluster Version Management and Patching on UNIX and Linux
    Scope

Chapter 3. Monitoring Your Cluster
    Achieving Performance and Scalability
    Event Generation
    Tuning the Cluster
    Authentication and Authorization
    Submitting Jobs with SSH
    External Authentication
    Job Email and Job File Spooling
    Non-Shared File Systems
    Error and Event Logging
    Troubleshooting and Error Messages

Chapter 4. Time-Based Configuration
    Time Configuration
    Advance Reservation

Chapter 5. Job Scheduling Policies
    Preemptive Scheduling
    Specifying Resource Requirements
    Fairshare Scheduling
    Resource Preemption
    Guaranteed Resource Pools
    Goal-Oriented SLA-Driven Scheduling
    Exclusive Scheduling

Chapter 6. Job Scheduling and Dispatch
    Working with Application Profiles
    Job Directories and Data
    Resource Allocation Limits
    Reserving Resources
    Job Dependency and Job Priority
    Job Requeue and Job Rerun
    Job Migration
    Job Checkpoint and Restart
    Resizable Jobs
    Chunk Jobs and Job Arrays
    Job Packs

Chapter 7. Energy Aware Scheduling
    About Energy Aware Scheduling (EAS)
    Managing host power states
    CPU frequency management
    Automatic CPU frequency selection

Chapter 8. Job Execution and Interactive Jobs
    Runtime Resource Usage Limits
    Load Thresholds
    Pre-Execution and Post-Execution Processing
    Job Starters
    Job Controls
    External Job Submission and Execution Controls
    Interactive Jobs with bsub
    Interactive and Remote Tasks
    Running Parallel Jobs

Chapter 9. Appendices
    Submitting Jobs Using JSDL
    Using lstch
    Using Session Scheduler
    Using lsmake
    Managing LSF on EGO
    LSF Integrations
    Launching ANSYS Jobs
    PVM Jobs

Notices
    Trademarks
    Privacy policy considerations

Index

Chapter 1. Managing Your Cluster

Working with Your Cluster

Learn about LSF

Before using LSF for the first time, you should download and read LSF Foundations Guide for an overall understanding of how LSF works.
Basic concepts

Job states: LSF jobs have the following states:
- PEND: Waiting in a queue for scheduling and dispatch
- RUN: Dispatched to a host and running
- DONE: Finished normally with zero exit value
- EXIT: Finished with non-zero exit value
- PSUSP: Suspended while pending
- USUSP: Suspended by user
- SSUSP: Suspended by the LSF system
- POST_DONE: Post-processing completed without errors
- POST_ERR: Post-processing completed with errors
- UNKWN: mbatchd lost contact with sbatchd on the host on which the job runs
- WAIT: For jobs submitted to a chunk job queue, members of a chunk job that are waiting to run
- ZOMBI: A job becomes ZOMBI if the execution host is unreachable when a non-rerunnable job is killed or a rerunnable job is requeued

Host: An individual computer in the cluster. Each host might have more than one processor. Multiprocessor hosts are used to run parallel jobs. A multiprocessor host with a single process queue is considered a single machine, while a box full of processors that each have their own process queue is treated as a group of separate machines.

Tip: The names of your hosts should be unique. They should not be the same as the cluster name or any queue defined for the cluster.

Job: A unit of work that is run in the LSF system. A job is a command submitted to LSF for execution, using the bsub command. LSF schedules, controls, and tracks the job according to configured policies. Jobs can be complex problems, simulation scenarios, extensive calculations, anything that needs compute power.

Job files

When a job is submitted to a queue, LSF holds it in a job file until conditions are right for it to be executed. Then the job file is used to execute the job.

UNIX: The job file is a Bourne shell script that is run at execution time.

Windows: The job file is a batch file that is processed at execution time.
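The job states listed under Basic concepts can be captured as a small lookup table for use in monitoring scripts. The following is a minimal sketch only: the names JobState, SUSPENDED, FINISHED, is_suspended, and describe are hypothetical helpers invented for illustration and are not part of LSF. The state names and descriptions are taken from the list above.

```python
from enum import Enum


class JobState(Enum):
    """LSF job states; each value is the description given in this chapter."""
    PEND = "Waiting in a queue for scheduling and dispatch"
    RUN = "Dispatched to a host and running"
    DONE = "Finished normally with zero exit value"
    EXIT = "Finished with non-zero exit value"
    PSUSP = "Suspended while pending"
    USUSP = "Suspended by user"
    SSUSP = "Suspended by the LSF system"
    POST_DONE = "Post-processing completed without errors"
    POST_ERR = "Post-processing completed with errors"
    UNKWN = "mbatchd lost contact with sbatchd on the execution host"
    WAIT = "Chunk job member waiting to run"
    ZOMBI = "Execution host unreachable when the job was killed or requeued"


# Illustrative groupings (an assumption of this sketch, derived from the
# state descriptions; LSF itself does not define these sets).
SUSPENDED = {JobState.PSUSP, JobState.USUSP, JobState.SSUSP}
FINISHED = {JobState.DONE, JobState.EXIT}


def is_suspended(state_name: str) -> bool:
    """Return True if the named state is one of the three suspended states."""
    return JobState[state_name] in SUSPENDED


def describe(state_name: str) -> str:
    """Return the description for a state name such as 'PEND'."""
    return JobState[state_name].value
```

A script could, for example, call describe("UNKWN") to print a readable explanation next to a state name it has already obtained from elsewhere.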
Interactive batch job: A batch job that allows you to interact with the application and still take advantage of LSF scheduling policies and fault tolerance. All input and output are through the terminal that you used to type the job submission command. When you submit an interactive job, a message is displayed while the job is awaiting scheduling. A new job cannot be submitted until the interactive job is completed or terminated.

Interactive task: A command that is not submitted to a batch queue and scheduled by LSF, but is dispatched immediately. LSF locates the resources needed by the task and chooses the best host among the candidate hosts that has the required resources and is lightly loaded. Each command can be a single process, or it can be a group of cooperating processes. Tasks are run without using the batch processing features of LSF but still with the advantage of resource requirements and selection of the best host to run the task based on load.

Local task: An application or command that does not make sense to run remotely. For example, the ls command on UNIX.

Remote task: An application or command that can be run on another machine in the cluster.

Host types and host models

Hosts in LSF are characterized by host type and host model. The following example is a host with type X86_64, with host models Opteron240, Opteron840, Intel_EM64T, etc.

Host type: The combination of operating system and host CPU architecture. All computers that run the same operating system on the same computer architecture are of the same type - in other words, binary-compatible with each other. Each host type usually requires a different set of LSF binary files.

Host model: The host type of the computer, which determines the CPU speed scaling factor that is applied in load and placement calculations. The CPU factor is considered when jobs are being dispatched.
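The interactive task definition above describes placement in two steps: keep only the candidate hosts that have the required resources, then prefer a lightly loaded one. The sketch below illustrates that idea under simplifying assumptions. It is not LSF's actual placement algorithm; the Host class, its single load number, and the choose_host function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class Host:
    """Hypothetical host record used only for this illustration."""
    name: str
    resources: Set[str] = field(default_factory=set)
    load: float = 0.0  # assumed single load figure; lower means lighter load


def choose_host(hosts: Iterable[Host], required: Set[str]) -> Optional[Host]:
    """Pick the most lightly loaded host that has every required resource."""
    # Step 1: keep only hosts whose resources include everything required.
    candidates = [h for h in hosts if required <= h.resources]
    if not candidates:
        return None
    # Step 2: among the candidates, choose the one with the lowest load.
    return min(candidates, key=lambda h: h.load)
```

For example, given two hosts that both offer a resource and a third that does not, choose_host returns whichever of the first two has the lower load value, and returns None when no host qualifies.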
Resources

Resource usage: The LSF system uses built-in and configured resources to track resource availability and usage. Jobs are scheduled according to the resources available on individual hosts. Jobs submitted through the LSF system will have the resources they use monitored while they are running. This information is used to enforce resource limits and load thresholds as well as fairshare scheduling.

LSF collects information such as:
- Total CPU time that is consumed by all processes in the job
- Total resident memory usage in KB of all currently running processes in a job
- Total virtual memory usage in KB of all currently running processes in a job
- Currently active process group ID in a job
- Currently active processes in a job

On UNIX, job-level resource usage is collected through PIM.

Load indices: Load indices measure the availability of dynamic, non-shared resources on hosts in the cluster. Load indices built into the LIM are updated at fixed time intervals.

External load indices: Defined and configured by the LSF administrator and collected