Loadleveler: Using and Administering
Total Page:16
File Type:pdf, Size:1020Kb
IBM LoadLeveler Version 5 Release 1 Using and Administering SC23-6792-04 IBM LoadLeveler Version 5 Release 1 Using and Administering SC23-6792-04 Note Before using this information and the product it supports, read the information in “Notices” on page 423. This edition applies to version 5, release 1, modification 0 of IBM LoadLeveler (product numbers 5725-G01, 5641-LL1, 5641-LL3, 5765-L50, and 5765-LLP) and to all subsequent releases and modifications until otherwise indicated in new editions. This edition replaces SC23-6792-03. © Copyright 1986, 1987, 1988, 1989, 1990, 1991 by the Condor Design Team. © Copyright IBM Corporation 1986, 2012. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Figures ..............vii LoadLeveler for AIX and LoadLeveler for Linux compatibility ..............35 Tables ...............ix Restrictions for LoadLeveler for Linux ....36 Features not supported in LoadLeveler for Linux 36 Restrictions for LoadLeveler for AIX and About this information ........xi LoadLeveler for Linux mixed clusters ....36 Who should use this information .......xi Conventions and terminology used in this information ..............xi Part 2. Configuring and managing Prerequisite and related information ......xii the LoadLeveler environment . 37 How to send your comments ........xiii Chapter 4. Configuring the LoadLeveler Summary of changes ........xv environment ............39 The master configuration file ........40 Part 1. Overview of LoadLeveler Setting the LoadLeveler user .......40 concepts and operation .......1 Setting the configuration source ......41 Overriding the shared memory key .....41 File-based configuration ..........42 Chapter 1. What is LoadLeveler? ....3 Database configuration option ........43 LoadLeveler basics ............4 Understanding remotely configured nodes . 43 LoadLeveler: A network job management and Using the configuration editor ........44 scheduling system ............4 Modifying configuration data ........45 Job definition .............5 Defining LoadLeveler administrators .....45 Machine definition ...........5 Defining a LoadLeveler cluster .......45 How LoadLeveler schedules jobs .......7 Defining LoadLeveler machine characteristics . 59 How LoadLeveler daemons process jobs .....8 Defining security mechanisms .......60 The master daemon ...........9 Defining usage policies for consumable resources 65 The Schedd daemon ..........10 Gathering job accounting data .......65 The startd daemon ...........12 Managing job status through control expressions 72 The region manager daemon .......14 Tracking job processes ..........73 The resource manager daemon .......15 Querying multiple LoadLeveler clusters ....74 The kbdd daemon ...........15 Handling switch-table errors........75 The negotiator daemon .........15 Providing additional job-processing controls The LoadLeveler job cycle .........16 through installation exits .........75 LoadLeveler job states ..........19 Consumable resources ...........22 Chapter 5. Defining LoadLeveler Consumable resources and Workload Manager 23 resources to administer .......89 Overview of reservations ..........24 Fair share scheduling overview ........27 Defining machines ............89 Planning considerations for defining machines . 90 Chapter 2. Getting a quick start using Machine_group stanza format and keyword summary ..............90 the default configuration .......29 Machine substanza format and keyword What you need to know before you begin ....29 summary ..............91 Using the default configuration files ......29 Machine stanza format and keyword summary 91 LoadLeveler for Linux quick start .......30 Default values for machine_group and machine Quick installation ...........30 stanzas ...............92 Quick configuration ..........31 Examples of machine_group and machine stanzas 92 Quick verification ...........31 Dynamic adapter discovery .........93 Post-installation considerations ........32 LoadLeveler adapter and node status monitoring. 94 Starting LoadLeveler ..........32 Defining classes .............94 Directory considerations .........33 Using limit keywords ..........94 Allowing users to use a class .......97 Chapter 3. What operating systems are Class stanza format and keyword summary . 97 supported by LoadLeveler?......35 Examples: Class stanzas .........98 Defining user substanzas in class stanzas ....99 © Copyright IBM Corp. 1986, 2012 iii Examples: Substanzas ..........99 Using the ckpt_dir and ckpt_subdir keywords 143 Defining users .............102 Removing old checkpoint files.......144 User stanza format and keyword summary . 102 Using the ckpt_execute_dir keyword ....144 Examples: User stanzas .........102 Initiating a checkpoint using the ll_ckpt() API 146 Defining groups ............103 LoadLeveler scheduling affinity support ....147 Group stanza format and keyword summary 104 Configuring LoadLeveler to use scheduling Examples: Group stanzas ........104 affinity ..............148 Defining clusters ............104 LoadLeveler multicluster support.......149 Cluster stanza format and keyword summary 104 Configuring a LoadLeveler multicluster . 150 Examples: Cluster stanzas ........105 LoadLeveler Blue Gene support .......153 Defining regions ............106 Configuring LoadLeveler Blue Gene support 155 Region stanza format and keyword summary 106 Blue Gene reservation support.......157 Examples: Region stanzas ........106 Blue Gene fair share scheduling support . 157 Blue Gene heterogeneous memory support . 157 Chapter 6. Performing additional Blue Gene preemption support ......157 administrator tasks .........109 Using fair share scheduling.........158 Fair share scheduling keywords ......158 Setting up the environment for parallel jobs . 110 Reconfiguring fair share scheduling keywords 161 Scheduling considerations for parallel jobs. 110 Example: three groups share a LoadLeveler Steps for reducing job launch overhead for cluster...............161 parallel jobs .............111 Example: two thousand students share a Steps for allowing users to submit interactive LoadLeveler cluster ..........162 POE jobs ..............112 Querying information about fair share Setting up a class for parallel jobs .....112 scheduling .............163 Striping when some networks fail .....113 Resetting fair share scheduling ......163 Setting up a parallel master node......113 Saving historic data ..........163 Using the BACKFILL scheduler .......114 Restoring saved historic data .......164 Tips for using the BACKFILL scheduler . 116 Procedure for recovering a job spool......164 Example: BACKFILL scheduling ......117 Configuring and using island scheduling ....165 Data staging ..............117 Energy aware job support .........166 Configuring LoadLeveler to support data S3 state support ............166 staging ..............118 Using an external scheduler ........119 Replacing the default LoadLeveler scheduling Part 3. Submitting and managing algorithm with an external scheduler ....120 LoadLeveler jobs .........169 Customizing the configuration file to define an external scheduler ...........121 Chapter 7. Building and submitting Example: Retrieving specific information . 122 Example: Changing scheduler types ......122 jobs ...............171 Preempting and resuming jobs .......122 Building a job command file ........171 Overview of preemption ........123 Using multiple steps in a job command file . 172 Planning to preempt jobs ........124 Examples: Job command files .......173 Steps for configuring a scheduler to preempt Editing job command files .........176 jobs ...............126 Defining resources for a job step .......177 Configuring LoadLeveler to support reservations 127 Submitting jobs requesting data staging ....177 Steps for configuring reservations in a Working with coscheduled job steps......178 LoadLeveler cluster ..........128 Submitting coscheduled job steps......178 Steps for integrating LoadLeveler with the Determining priority for coscheduled job steps 178 Workload Manager ...........133 Supporting preemption of coscheduled job steps 179 LoadLeveler support for checkpointing jobs . 135 Coscheduled job steps and commands and APIs 179 Checkpoint keyword summary ......136 Termination of coscheduled steps......179 Planning considerations for checkpointing jobs 136 Using bulk data transfer..........180 Additional planning considerations for Preparing a job for checkpoint/restart .....180 checkpointing MetaCluster HPC jobs on AIX . 138 Preparing a job for preemption .......183 Checkpoint and restart limitations .....138 Submitting a job command file .......183 Submitting a MetaCluster HPC checkpoint job to Job state monitoring ..........184 LoadLeveler ..............138 Submitting a job using a submit-only machine 184 job_1.cmd - A checkpointable job command file 138 Working with parallel jobs .........184 Using the llckpt command to checkpoint a job Step for controlling whether LoadLeveler copies step ...............139 environment variables to all executing nodes . 185 Restarting a job step from a checkpoint....140 Making periodic checkpoints .......142 iv LoadLeveler: Using and Administering Ensuring that parallel jobs in a cluster run on 64-bit support for configuration file keywords the correct levels of PE and LoadLeveler and expressions ...........232 software ..............185 Configuration keyword descriptions......233 Task-assignment considerations ......186 User-defined keywords ..........284 Submitting jobs that use striping