Administering IBM Spectrum LSF
Total Page:16
File Type:pdf, Size:1020Kb
IBM Spectrum LSF Version 10 Release 1 Administering IBM Spectrum LSF IBM IBM Spectrum LSF Version 10 Release 1 Administering IBM Spectrum LSF IBM Note Before using this information and the product it supports, read the information in “Notices” on page 841. This edition applies to version 10, release 1 of IBM Spectrum LSF (product numbers 5725G82 and 5725L25) and to all subsequent releases and modifications until otherwise indicated in new editions. Significant changes or additions to the text and illustrations are indicated by a vertical line (|) to the left of the change. If you find an error in any IBM Spectrum Computing documentation, or you have a suggestion for improving it, let us know. Log in to IBM Knowledge Center with your IBMid, and add your comments and feedback to any topic. © Copyright IBM Corporation 1992, 2017. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Chapter 1. Managing Your Cluster ... 1 Limiting job resource allocations ....... 455 Work with your cluster ........... 1 Reserving Resources ........... 468 LSF daemon startup control ......... 25 Job dependency and job priority ....... 481 Working with hosts............ 33 Job requeue and job rerun ......... 500 Managing job execution .......... 70 Job Migration ............. 505 Working with Queues .......... 109 Job Checkpoint and Restart......... 514 Enable host-based resources ........ 120 Resizable Jobs ............. 527 External Load Indices .......... 147 Chunk Jobs and Job Arrays......... 536 Managing LSF user groups ......... 161 Job Packs............... 548 External Host and User Groups ....... 166 Between-Host User Account Mapping ..... 170 Chapter 6. Energy Aware Scheduling 553 Cross-Cluster User Account Mapping ..... 175 About Energy Aware Scheduling (EAS)..... 553 UNIX/Windows User Account Mapping .... 180 Managing host power states ........ 553 CPU frequency management ........ 562 Chapter 2. Monitoring Your Cluster 189 Automatic CPU frequency selection ...... 565 Achieve performance and scalability...... 189 Event generation ............ 200 Chapter 7. Control job execution ... 575 Tuning the Cluster............ 201 Runtime Resource Usage Limits ....... 575 Authentication and authorization ....... 211 Load Thresholds ............ 592 External authentication .......... 217 Pre-execution and post-execution processing ... 596 Job file spooling ............ 225 Job starters .............. 615 Non-Shared File Systems ......... 230 Job Controls .............. 620 Error and event logging .......... 235 External job submission and execution controls .. 627 Troubleshooting LSF problems........ 244 Interactive jobs with bsub ......... 650 Interactive and remote tasks ........ 659 Chapter 3. Time-Based Configuration 261 Running Parallel Jobs........... 665 Time Configuration ........... 261 Advance Reservation ........... 267 Chapter 8. Appendices ....... 747 Submitting Jobs Using JSDL ........ 747 Chapter 4. Job Scheduling Policies 293 Using lstch .............. 757 Preemptive Scheduling .......... 293 Using Session Scheduler.......... 765 Specifying Resource Requirements ...... 307 Using lsmake ............. 779 Fairshare Scheduling ........... 352 Manage LSF on EGO ........... 785 Global Fairshare Scheduling ........ 384 LSF Integrations ............ 801 Resource Preemption ........... 400 Launching ANSYS Jobs .......... 838 Guaranteed resource pools ......... 405 PVM jobs............... 839 Goal-Oriented SLA-Driven Scheduling ..... 416 Exclusive Scheduling ........... 435 Notices .............. 841 Trademarks .............. 843 Chapter 5. Job scheduling and Terms and conditions for product documentation 843 dispatch ............. 437 Privacy policy considerations ........ 844 Share resources with application profiles .... 437 Job directories and data .......... 452 Index ............... 845 © Copyright IBM Corp. 1992, 2017 iii iv Administering IBM Spectrum LSF Chapter 1. Managing Your Cluster Work with your cluster Learn about LSF directories and files, commands to see cluster information, control workload daemons, and how to configure your cluster. LSF Terms and Concepts Before you use LSF for the first time, you should read the LSF Foundations Guide for a basic understanding of workload management and job submission, and the Administrator Foundations Guide for an overview of cluster management and operations. Job states IBM Spectrum LSF jobs have several states. PEND Waiting in a queue for scheduling and dispatch. RUN Dispatched to a host and running. DONE Finished normally with zero exit value. EXIT Finished with nonzero exit value. PSUSP Suspended while the job is pending. USUSP Suspended by user. SSUSP Suspended by the LSF system. POST_DONE Post-processing completed without errors. POST_ERR Post-processing completed with errors. UNKWN The mbatchd daemon lost contact with the sbatchd daemon on the host where the job runs. WAIT For jobs submitted to a chunk job queue, members of a chunk job that are waiting to run. ZOMBI A job becomes ZOMBI if the execution host is unreachable when a non-rerunnable job is killed or a rerunnable job is requeued. Host An LSF host is an individual computer in the cluster. Each host might have more than one processor. Multiprocessor hosts are used to run parallel jobs. A multiprocessor host with a single process queue is considered a single machine. A box full of processors that each have their own process queue is treated as a group of separate machines. © Copyright IBM Corp. 1992, 2017 1 Managing Your Cluster Tip: The names of your hosts should be unique. They cannot be the same as the cluster name or any queue that is defined for the cluster. Job An LSF job is a unit of work that runs in the LSF system. A job is a command that is submitted to LSF for execution, by using the bsub command. LSF schedules, controls, and tracks the job according to configured policies. Jobs can be complex problems, simulation scenarios, extensive calculations, anything that needs compute power. Job files When a job is submitted to a queue, LSF holds it in a job file until conditions are right for it run. Then, the job file is used to run the job. On UNIX, the job file is a Bourne shell script that is run at execution time. On Windows, the job file is a batch file that is processed at execution time. Interactive batch job An interactive batch job is a batch job that allows you to interact with the application and still take advantage of LSF scheduling policies and fault tolerance. All input and output are through the terminal that you used to type the job submission command. When you submit an interactive job, a message is displayed while the job is awaiting scheduling. A new job cannot be submitted until the interactive job is completed or terminated. Interactive task An interactive task is a command that is not submitted to a batch queue and scheduled by LSF, but is dispatched immediately. LSF locates the resources that are needed by the task and chooses the best host among the candidate hosts that has the required resources and is lightly loaded. Each command can be a single process, or it can be a group of cooperating processes. Tasks are run without using the batch processing features of LSF but still with the advantage of resource requirements and selection of the best host to run the task based on load. Local task A local task is an application or command that does not make sense to run remotely. For example, the ls command on UNIX. 2 Administering IBM Spectrum LSF Managing Your Cluster Remote task A remote task is an application or command that that can be run on another machine in the cluster. Host types and host models Hosts in LSF are characterized by host type and host model. The following example is a host with type X86_64, with host models Opteron240, Opteron840, Intel_EM64T, and so on. Host type: An LSF host type is the combination of operating system and host CPU architecture. All computers that run the same operating system on the same computer architecture are of the same type. These hosts are binary-compatible with each other. Each host type usually requires a different set of LSF binary files. Host model: An LSF host model is the host type of the computer, which determines the CPU speed scaling factor that is applied in load and placement calculations. The CPU factor is considered when jobs are being dispatched. Resources LSF resources are objects in the LSF system resources that LSF uses track job requirements and schedule jobs according to their availability on individual hosts. Resource usage: The LSF system uses built-in and configured resources to track resource availability and usage. Jobs are scheduled according to the resources available on individual hosts. Jobs that are submitted through the LSF system will have the resources that they use monitored while they are running. This information is used to enforce resource limits and load thresholds as well as fairshare scheduling. LSF collects the following kinds of information: v Total CPU time that is consumed by all processes in the job v Total resident memory usage in KB of all currently running processes in a job v Total virtual memory usage in KB of all currently running processes in a job v Currently active process group ID in a job Chapter 1. Managing Your Cluster 3 Managing Your Cluster v Currently active processes in a job On UNIX and Linux, job-level resource usage is collected through PIM. Load indices: Load indices measure the availability of dynamic,