
SYSTEM PERFORMANCE OPTIMIZATION

R.J. Bednarz, CYFRONET, Swierk, Poland

0. Introduction

System Performance Optimization has become an important and difficult field for large scientific computer centres. Important, because the centres must satisfy increasing user demands at the lowest possible cost. Difficult, because System Performance Optimization requires a deep understanding of hardware, software and workload. The optimization is a dynamic process depending on the changes in hardware configuration, the current level of the software and the user generated workload. With the increasing complication of the computer system and software, the field for the optimization manoeuvres broadens.

In the three hour lecture it is of course difficult to cover all aspects of System Performance Optimization. First of all it was necessary to talk in Chapter 1 about the hardware of only two manufacturers, IBM and CDC. Chapter 2 contains the description of four IBM and two CDC operating systems. The description concentrates on the organization of the operating systems, the job scheduling and the I/O handling. The performance definitions, workload specification and tools for the system stimulation are presented in Chapter 3. Chapter 4 is devoted to the description of the measurement tools for System Performance Optimization. In that Chapter I am going to present software, hardware and hybrid monitors. The results of the measurements and various methods used for operating system tuning will be discussed in Chapter 5. Unfortunately it was not possible to cover during the lectures the theoretical aspects related to System Performance Optimization. Therefore the author intends to cover the subject of computer models and simulation in a separate publication.

1. Hardware Overview

In this Chapter I would like to present examples of hardware used for large scale scientific computations /batch and interactive/. The examples will be taken from the CERN and INR computer centres, which use IBM and CDC equipment. It is well known that scientific computations are almost monopolized by these two manufacturers, especially in the field of physics. Assuming an elementary knowledge of computer structure, I will concentrate on the hardware aspects which are crucial for performance and programming. Since the IBM and CDC computers differ very much in their organization, it is necessary to describe them separately.

1.1. IBM Hardware

Figure 1 shows the IBM 370/168-3 computer which is installed at CERN. The central processing unit is equipped with 3 megabytes of memory, byte and block multiplexer channels, controllers and devices. Up to 16 controllers can be linked to a shared channel, and therefore the reconfiguration of the controllers linked to block multiplexer channels can easily be done manually. The 3333 and 3350 disk storage can be accessed from two independent storage controllers. Such a feature is called dual access, and it causes a complicated routing of disk requests through the system.

The central processing units of the higher models of the IBM 370 line include a buffer storage. The buffer storage can sharply reduce the time required for fetching currently used sections of main storage. On the Model 165, for example, the CPU can obtain eight bytes from the buffer in two cycles /160 nanoseconds/, and a request can be initiated every cycle. This compares with 18 cycles /1440 nanoseconds/ required to obtain eight bytes directly from main storage. On average, the high-speed buffer storage operates to make the effective system storage cycle time one-third to one-quarter of the actual main-storage cycle time. Buffer operation is handled entirely by hardware and it is transparent to the programmer, who does not need to adhere to any particular program structure in order to achieve close-to-optimum use of the buffer.
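The figures quoted above invite a quick plausibility check. The short Python sketch below computes the average time to obtain eight bytes under a simple hit-ratio model; the hit-ratio values are not given in the text and are purely illustrative assumptions.

# Illustrative check of the buffer storage figures quoted for the Model 165.
# The hit ratios are assumptions for the sake of the example; the text only
# states that the effective cycle time comes out at 1/3 to 1/4 of the
# main-storage cycle time.

BUFFER_NS = 160.0    # two 80 ns cycles to fetch 8 bytes from the buffer
MAIN_NS = 1440.0     # 18 cycles to fetch 8 bytes directly from main storage

def effective_cycle(hit_ratio):
    """Average time to obtain 8 bytes for a given buffer hit ratio."""
    return hit_ratio * BUFFER_NS + (1.0 - hit_ratio) * MAIN_NS

for hit in (0.75, 0.80, 0.85):
    eff = effective_cycle(hit)
    print(f"hit ratio {hit:.2f}: {eff:6.0f} ns  "
          f"(= 1/{MAIN_NS / eff:.1f} of the main storage time)")

Hit ratios in the region of 0.75 to 0.85 reproduce the quoted reduction to between one-third and one-quarter of the main-storage figure.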

A very important hardware feature of the IBM 370 is the dynamic address translation /DAT/. This feature is essential for virtual storage operating systems. The CPU can operate with the virtual storage features disabled /Basic Control mode/ or enabled /Extended Control mode/. For ease in storage management, virtual storage, real storage, and the direct access storage used to contain virtual contents are divided into contiguous fixed-length sections of equal size. Virtual storage is divided into 64K-byte segments. A maximum virtual storage of 16,777,216 bytes therefore contains 256 segments. Each segment of virtual storage is divided into 4K-byte pages. A page frame is a 4K-byte block of real storage that can hold one page at a time. The equivalent of a frame on the direct access storage is called a slot.

In a virtual storage system, a mechanism is required to associate the virtual storage addresses of data and instructions with their actual location in real storage. This function is performed by DAT. To translate the addresses, DAT uses tables in real storage. These tables, which are maintained by the control program, are the segment table and the page tables. One segment table and a corresponding set of page tables exist for each address space in the system /see Figure 2/. There is one page table for each segment in the address space. The page table indicates which pages are currently in real storage and the real storage location of those pages. DAT translates the virtual storage addresses contained in an instruction during execution of the instruction.

First, DAT obtains the address of the appropriate segment table from a system control register. To this segment table address, DAT adds the segment address bits to obtain the segment table entry. Next DAT obtains the page table address from the segment table entry and adds the page address bits to it in order to obtain the page table entry. Finally, DAT forms the 24 bit real storage address by appending the displacement to the page frame address. To reduce the amount of time required for address translation, DAT retains up to 128 previously translated addresses in a translation lookaside buffer /TLB/. Prior to performing a translation using the segment and page tables, DAT searches the TLB for the required address.

A program interruption occurs during address translation if DAT attempts to translate a virtual storage address to a real storage address and the required page is not in real storage. This interruption, called a page fault, alerts the control program that the page must be loaded from external page storage into a page frame of real storage. The transfer of a page into real storage is a page-in. The page-in process is shown in Figure 3. First, when a needed page is not in real storage /indicated by a bit in the page table entry/, storage management automatically goes to the corresponding entry in an external page table. The external page table entry gives the slot location for the page.

Next, storage management selects a frame in real storage to hold the required page. To do so, it refers to the page frame table, which indicates which frames are allocated. Storage management finds an available frame and brings in the required page from its slot in external page storage. To complete the page-in process, storage management updates the appropriate page frame table entry and page table entry.
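The translation walk and the page-fault case described above can be condensed into a small sketch. The following Python fragment is only an illustration of the table walk for the 64K-byte segment and 4K-byte page sizes given in the text; the table contents, the TLB replacement rule and the exception-based page-fault signalling are assumptions, not IBM's implementation.

# A minimal sketch of the two-level DAT walk described above, for the
# 64K-byte segment / 4K-byte page layout of a 16M-byte address space.

PAGE_SIZE = 4096          # 4K-byte pages
PAGES_PER_SEGMENT = 16    # 64K-byte segment = 16 pages

class PageFault(Exception):
    """Raised when the required page is not in real storage."""

class DAT:
    def __init__(self, segment_table, external_page_table):
        self.segment_table = segment_table               # segment -> page table
        self.external_page_table = external_page_table   # (seg, page) -> slot
        self.tlb = {}                                     # up to 128 recent entries

    def translate(self, virtual_address):
        segment = virtual_address // (PAGE_SIZE * PAGES_PER_SEGMENT)
        page = (virtual_address // PAGE_SIZE) % PAGES_PER_SEGMENT
        displacement = virtual_address % PAGE_SIZE

        # The translation lookaside buffer is searched first.
        frame = self.tlb.get((segment, page))
        if frame is None:
            page_table = self.segment_table[segment]
            frame = page_table[page]              # None means "not in real storage"
            if frame is None:
                slot = self.external_page_table[(segment, page)]
                raise PageFault(f"page ({segment},{page}) must be paged in from slot {slot}")
            if len(self.tlb) >= 128:
                self.tlb.pop(next(iter(self.tlb)))   # crude replacement
            self.tlb[(segment, page)] = frame
        return frame * PAGE_SIZE + displacement      # 24-bit real address

# A tiny example: segment 0 has page 1 in frame 5, page 2 paged out to slot 9.
dat = DAT({0: {1: 5, 2: None}}, {(0, 2): 9})
print(hex(dat.translate(0x1234)))                    # page 1 -> frame 5
try:
    dat.translate(0x2000)                            # page 2 -> page fault
except PageFault as fault:
    print(fault)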

In order to keep a supply of frames available for page-in, the control program removes pages from real storage that have not been recently referenced. Prior to removing a page from a frame, the control program determines whether the page contents were modified during processing. If so, storage management performs a page-out. Otherwise an exact copy of the page already exists in external page storage. A page-out copies the modified page from its real storage frame to a slot. The slot need not be the one that contains the old version of the page. Storage management need only update the external page table entry to designate the new slot.

At the end of this section I would like to quote some characteristics of the Model 165. The Model 165 has a Basic Machine Time of 80 nanoseconds. The Storage Cycle Time is 2 microseconds with an 8 byte Storage Access Width and four-way interleaving. The High-Speed Buffer Storage can have 8192 bytes or 16384 bytes. Block Multiplexor Channels are buffered to a width of 16 bytes for communication with storage. The maximum data rate for Block Multiplexor Channels is 3 million bytes a second. Byte Multiplexor Channels are not buffered and they are used for slow peripheral devices. Finally, the 3330 disk has an average access time of 30 ms, an average latency of 8.4 ms and a data transfer rate of 806 Kbytes per second.

1.2. CDC Hardware

In this section I will describe the architecture of the CD 6000, CD 7000, CYBER 70 and CYBER 170 computers. The above computer families consist of central processors, central memory, peripheral processors, channels, controllers and devices. In comparison with IBM computers, the peripheral processors form an additional layer in the system, introducing a high degree of multiprocessing. Peripheral processors perform a majority of the system tasks, which usually take a lot of CPU time on computers with the standard architecture.

Figure 4 shows the configuration of the CYBER-73 at the Institute of Nuclear Research. The configuration consists of a central processor unit performing 1.2 million instructions per second, 96K of central memory /60 bit words/ and fourteen peripheral processors with 4K memory /12 bit words/. The peripheral equipment may be attached to 24 channels with a maximum transfer rate of 2 million characters per second. Two types of disks are used at INR. The characteristics of the disks are the following:

Disk                                        CD 841   CD 844
Number of units                                  7        2
Unit capacity /millions of characters/          36      118
Transfer rate /thousands of characters/        179      461
Access time /milliseconds/:
  Maximum                                      135       55
  Average                                       75       30
  Minimum                                       25        5
Average rotational latency /milliseconds/     12.5      8.3

Dual access is provided for the CD 841 disks.
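Assuming the usual first-order model of a disk transfer /seek plus rotational latency plus transfer time/, the table above gives a feeling for single-request service times. The request size in the Python sketch below is an arbitrary example value and the model itself is a simplification, not a CDC formula.

# Rough service-time estimate for a single read, based on the disk
# characteristics quoted above.

DISKS = {
    # name: (average access ms, average rotational latency ms, chars per ms)
    "CD 841": (75.0, 12.5, 179.0),
    "CD 844": (30.0,  8.3, 461.0),
}

def read_time_ms(disk, characters):
    access, latency, rate = DISKS[disk]
    return access + latency + characters / rate

REQUEST = 3840   # arbitrary example request size in characters

for name in DISKS:
    print(f"{name}: about {read_time_ms(name, REQUEST):5.1f} ms "
          f"for a {REQUEST}-character transfer")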

Five Low Speed Batch Terminals and seven TTYs are connected to the Local Communications Controller through modems. Four PDP-11/45 minicomputers are linked to a 6671 Multiplexer. Slow peripheral devices include two printers, a Card Reader and Punch, a Paper Tape Reader/Punch, and a Plotter. The devices form two separate lines linked to channels by means of controllers. Four 659 Magnetic Tape Units /9 track/ and one 657 Magnetic Tape Unit /7 track/ are served by one controller and channel.

Central Memory is organized in banks of 4,096 words each. The storage cycle is one microsecond, however the Central Memory address and data control mechanisms permit moving a word to or from Central Memory every 100 nanoseconds.

The Central Processor is interrupted by means of the Exchange Jump. This operation may be initiated by a peripheral processor or by the Central Processor. The effect of this operation is to interrupt the currently active central program and to initiate another program. To initiate the operation a PPU executes the Exchange Jump referring to the beginning of the Exchange Jump Package. This package contains the 24 operating registers /X, A, B/, the program address /P/, the reference address /RA/ and field length /FL/ for Central Memory, and the Monitor Address /MA/.

If the CPU is in "monitor state" it may not be interrupted. In this state the Central Processor may set up and initiate jobs or tasks in a direct manner via the Central Exchange Jump instructions. When this instruction is executed the CPU state is switched to "program state". A user may also initiate a Central Exchange Jump instruction, however he is not allowed to set up the exchange package. The lower CYBER computers use a much simpler interruption mechanism for the Central Processor than IBM computers, and the execution of peripheral processors can not be interrupted at all.

In the second part of this section, I would like to describe some features of the CD 7600, CYBER 76 and CYBER 176 computers. Figure 5 shows the CD 7600 configuration at CERN. The computer has a 64 K small core memory /SCM/ and a 512 K large core memory /LCM/. Six high speed channels are used to connect the six 817 disk units, and another channel is used to connect the processor to eight 844 disks; four other channels are used to connect the maintenance control unit /MCU/, the first level instrumentation peripheral processor /FLIPP/ and the CDC front-ends. The RIOS stations are connected to the CDC 6500 and approximately 70 INTERCOM terminals can connect to either front-end.

In March 1978, Control Data announced a new model, the CYBER 176. The main difference between the CYBER 176 and the CD 7600 is the additional CYBER 170 Peripheral Processor Subsystem, which replaces the MCU of the CD 7600. Therefore the CYBER 176 has two different groups of peripheral processors. The High Speed Peripheral Processors are dedicated to the control of the 817 or 819 disk subsystems, and the Peripheral Processor Subsystem handles the peripheral equipment which on the CD 7600 was attached to the front-end.

The CYBER 176 Central Processor has a 36.3 MHz frequency clock which provides an internal minor cycle, or clock period, of 27.5 nanoseconds. The CPU contains a 12 word instruction stack with a two word look-ahead feature plus the ability to execute contiguous or non-contiguous instruction loops, which reduces the need to reference central memory in order to access the instructions.

Instructions are issued from the instruction stack at a maximum rate of one per clock period to any of the nine functional units. Each functional unit is independent of the others and all functional units are segmented, so that although it may take several clock periods for a given operation to be completed, several separate operations can be at various stages of completion within a single functional unit.

Unlike other CYBER models, the CYBER 176 contains an interrupt system designed specially to ease the handling of high speed I/O while at the same time minimizing the system memory requirements. This interrupt capability is invoked primarily by the HSPPs handling transfers to the 819 disk subsystem.

The CYBER 176 Central Memory /SCM/ consists of bipolar semiconductor memory ranging in size from 131 K to 262 K words of 60 bit memory. Associated with each 60 bit word is an additional 8 bit SECDED field, and logic associated with this field permits the correction of single bit errors and the detection of double bit errors. Memory is interleaved 16 ways in order to minimize the occurrence of bank conflicts. The bank cycle time is 82.5 nanoseconds for read operations and 165 nanoseconds for write operations, and this together with the 16 way interleaving provides for a maximum transfer rate of up to thirty-six 60 bit words per microsecond.
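A back-of-the-envelope check of the quoted transfer rate is easy to make. In the sketch below, treating the sustained rate as the smaller of the bank-limited and the port-limited rates is a simplifying assumption; the one-word-per-clock-period ceiling comes from the text.

# Back-of-the-envelope check of the Central Memory transfer rate quoted above.

CLOCK_PERIOD_NS = 27.5      # minor cycle
READ_CYCLE_NS = 82.5        # bank busy time for a read
BANKS = 16                  # 16-way interleaving

port_limit = 1000.0 / CLOCK_PERIOD_NS            # words per microsecond
bank_limit = BANKS * 1000.0 / READ_CYCLE_NS      # words per microsecond

print(f"port limit : {port_limit:5.1f} words/us")    # ~36.4, the quoted figure
print(f"bank limit : {bank_limit:5.1f} words/us")    # interleaving is not the bottleneck
print(f"sustained  : {min(port_limit, bank_limit):5.1f} words/us")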

The CYBER 176 Extended Memory /LCM/ is composed of ferrite core with integrated circuit control logic. The memory size ranges from 524 K to 2097 K words. Extended Memory is equipped with SECDED logic. The memory is made up from individual banks of 262 K words with a bank cycle time of 1.76 microseconds. Associated with each memory bank is a 16 word bank register, and this feature together with bank interleaving permits a maximum transfer rate of one word per clock period on the one or two million word configurations. The Central Processor can also directly reference elements in Extended Memory on a single word basis.

The next element of the CYBER 176 I should like to describe is the Input/Output Multiplexer. The multiplexer provides the interface between the High Speed Peripheral Processors and the Peripheral Processor Subsystem on the one hand and Central Memory on the other. The single path to CM can handle data up to a speed of 18 million characters per second. The interface to the Peripheral Processor Subsystem is via a 60 bit port and allows each of the ten PPs access to any area of Central Memory. The maximum data transfer rate across this 60 bit wide path is also 18 million characters per second. The High Speed Peripheral Processors are each connected to the I/O Multiplexer by 12 bit channels.

A CYBER 176 includes four such channels, and four HSPPs, for handling the 819 disk subsystem, but up to a maximum of 14 such channels can be connected to the I/O Multiplexer. Unlike the Peripheral Processor Subsystem, data transferred across one of these channels is directed to a dedicated circular buffer area in Central Memory. Management of these buffers is accomplished with a central processor interrupt system. As data flows into or out of the buffer, thresholds at the midpoint and at the end of the buffer cause a central processor interrupt to occur, permitting the processor to remove or provide new data in parallel with the continuing data flow across the channel. These channels can operate at data rates up to six million characters per second.
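The threshold scheme just described is in effect a double-buffering protocol. The Python sketch below imitates it in a very simplified form; the buffer length and the interrupt callback are invented for illustration only.

# A minimal sketch of the threshold-interrupt scheme described above: the
# channel deposits data into a circular buffer in Central Memory, and crossing
# the midpoint or the end of the buffer "interrupts" the central processor so
# that it can empty one half while the other half is still being filled.

class CircularChannelBuffer:
    def __init__(self, size, interrupt):
        assert size % 2 == 0
        self.buffer = [None] * size
        self.size = size
        self.midpoint = size // 2
        self.fill = 0                 # next position the channel will write
        self.interrupt = interrupt    # called with (first, last) of a ready half

    def channel_write(self, word):
        self.buffer[self.fill] = word
        self.fill = (self.fill + 1) % self.size
        if self.fill == self.midpoint:            # midpoint threshold reached
            self.interrupt(0, self.midpoint)
        elif self.fill == 0:                      # end-of-buffer threshold reached
            self.interrupt(self.midpoint, self.size)

def cpu_interrupt(first, last):
    # In the real machine the CPU would remove or process buffer[first:last]
    # here, in parallel with the continuing transfer into the other half.
    print(f"interrupt: words {first}..{last - 1} ready for the CPU")

channel = CircularChannelBuffer(8, cpu_interrupt)
for word in range(20):                            # simulate a stream from an 819 drive
    channel.channel_write(word)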

The Peripheral Processor Subsystem has a memory cycle time of 500 nanoseconds. Its other characteristics are the same as for the CYBER-73. The High Speed Peripheral Processors have a 4 K twelve bit memory with a 275 nanosecond cycle time. The timing in these processors is controlled by the 27.5 nanosecond clock period.

The characteristics of the disk subsystem include a very high transfer rate of just over six million characters per second, which is very closely matched to the actual six million char/sec rate of one of the HSPP channels. A Peripheral Processor Subsystem channel can only handle a maximum rate of four million char/sec and hence would be unable to fully exploit the 819 capabilities. The 819 disk drive uses a twenty platter non-interchangeable pack that has a capacity of 412 million characters.

From the above hardware examples it is obvious that the various manufacturers use different approaches to achieve high computer performance. IBM applies the concepts of the high speed buffers and paging. CDC uses multiprocessing with peripheral processors, instruction stacks and the pipeline concept.

2. Operating Systems

In this chapter I am going to present an overview of the operating systems which are most commonly used in scientific computing centres. As in the first chapter, only IBM and CDC operating systems will be presented. The features of the operating system influence to a large extent the overall system performance. From the user point of view it is also important how the operating system treats various classes of jobs. In the description of the operating systems I will concentrate on the general organization, the utilization of system resources and job scheduling. Two relatively old IBM systems, MFT and MVT, are presented in order to show the progress in operating system design. VS2 is an example of a virtual storage operating system. VM/370 represents the virtual machine super operating systems. The description of two CDC operating systems, SCOPE and NOS, should help us to understand the problems connected with multiprocessor systems.

2.1. OS/MFT

MFT is the name of a control program working in the frame of the IBM System/360 Operating System. MFT can control a fixed number of tasks concurrently for computers with a storage size of no less than 128 K bytes.

Figure 6 shows the structure of main storage under MFT. An installation may define up to 52 partitions, however only 15 may be problem-program partitions. The boundaries between partitions are established at system generation time, but the operator may redefine the partition sizes during operation. MFT is a spooling system, which can read up to 3 input streams and produce up to 36 output streams /Figure 7/. Job scheduling consists of assigning jobs to partitions. For this purpose jobs are divided into 15 classes from A to O and each program partition may process up to 3 job classes /Figure 8/. Inside a class, jobs are scheduled according to the job card priority.

Immediately after a job ends in a partition, the initiator program is loaded and it selects a new job according to the above rules. The initiator also allocates the requested data sets and I/O devices. The relocating loader program loads the job step relative to the beginning of the partition. The loader is a nucleus program, which has access to the whole storage. Finally, the job can start execution and at its completion the initiator places the completed job on the output queue.

The task dispatching algorithm used by MFT is called highest-static-priority-first-served /HSFS/. The priority decreases with the partition number. The highest priorities are assigned to the system tasks residing in the nucleus. The tasks in the partitions with high numbers /low priority/ can get access to the CPU only during I/O operations of the tasks with low partition numbers. Therefore it is important to schedule I/O bound jobs into lower-numbered partitions. In the reverse situation, CPU bound jobs "screen" the I/O bound jobs, which get access to the CPU only during infrequent I/O processing.

The MFT system nucleus occupies a fixed area in main storage of at least 34 K bytes and it contains the resident portion of the control program that performs the following control functions:
- Task management routines
- Data management routines
- Job management routines
- The system queue area
The I/O requests are scheduled by the I/O supervisor residing in the nucleus according to the first-come-first-served /FCFS/ rule.

Because of the static partition sizes, it is difficult for MFT to avoid storage fragmentation. Also the indirect control of job-dispatching priorities usually leads to low CPU utilization. The design of MFT reflects the state of the art in system programming of the early sixties, when performance was a secondary objective. Further developments of IBM operating systems cope with the above deficiencies.
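The HSFS rule described in this section can be stated in a few lines. The sketch below is only an illustration; the partition contents are invented.

# A sketch of the highest-static-priority-first-served rule described in the
# OS/MFT section above: the dispatcher always gives the CPU to the ready task
# in the lowest-numbered partition, so a CPU bound job in a low-numbered
# partition can "screen" the others.

def hsfs_dispatch(partitions):
    """partitions: list of (partition_number, job_name, state) with
    state 'ready' or 'waiting-for-I/O'.  Returns the job to run next."""
    ready = [p for p in sorted(partitions) if p[2] == "ready"]
    return ready[0][1] if ready else None

partitions = [
    (1, "SYSTEM TASK",        "waiting-for-I/O"),
    (2, "IOJOB (I/O bound)",  "waiting-for-I/O"),
    (3, "CPUJOB (CPU bound)", "ready"),
    (4, "LOWJOB",             "ready"),
]
print(hsfs_dispatch(partitions))   # -> 'CPUJOB (CPU bound)': partition 4 only runs while 3 waits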

2.2. OS/MVT

Another version of the IBM control program is OS/MVT, which is devoted to processing a variable number of tasks. The processing of Job Steps by an MVT Control Program is presented in Figure 9. A new feature of MVT is the possibility to execute several tasks in parallel in the frame of one Job Step. The second feature is the dynamic memory allocation scheme. Unlike in MFT, the MVT user assigns the parameter /REGION = nnn/ indicating the main storage requirement for each Job Step. MVT supplies only this amount of storage to the Job Step, and not a partition of prespecified size for the entire job.

Figure 10 shows the organization of main storage under MVT. The Link Pack Area is reserved for routines that can be used concurrently by the control program and by any jobs. These include access method routines, storage management routines and job scheduling routines used in conjunction with the MVT initiator program. The Master Scheduler provides the main communication link between the operator and the operating system.

Job Steps, Readers, Writers and Initiators are scheduled into the dynamic region area between the Nucleus and the Master Scheduler. Jobs are divided into 15 job classes using the same scheme described for MFT. The operator loads an initiator for specified job classes into a region of 52 K bytes. The initiator requests a region for the first job step, after selecting the job from its first class with the highest priority. If the appropriate region is not available, the initiator goes into the wait state. After a contiguous block of storage appears in the system, the initiator is loaded into it. The I/O devices are allocated to the job step and a routine in the Link Pack Area attaches the job step as a user task in the allocated region. All programs related to this task are loaded and the execution of the task begins. After completion, the region is released and a new initiator is loaded. Immediately after releasing the I/O devices of the first job step, the initiator can proceed to the second job step, if any. The minimum region size can not be less than 52 K, unless some reentrant routines of the initiators are moved to the L.P.A. The maximum number of initiators can not exceed 15, however the degree of multiprogramming is variable and depends on the storage requirements of the job steps in execution.

Some privileged regions, like teleprocessing jobs, may increase their size during job step execution. In this case they roll out other jobs, which must come back to the same region to complete their execution. The task dispatcher in the system nucleus controls the tasks residing in the storage according to the HSFS rule. The priorities are taken from the JOB statement or from the EXEC statement for each job step in the latter case. It is possible then to assign a high priority to the I/O bound job steps and a low priority to the CPU bound job steps. Among the tasks with the same priority, the FCFS rule is used.

The main advantages of MVT in comparison with MFT are: the reduction of storage fragmentation, the elimination of fitting jobs to partition sizes and the elastic dispatching scheme. The last one however assumes the user's knowledge and good will to dispatch job steps correctly. Some options of MVT allow these doubtful assumptions to be eliminated.

The first is the time-slice option, which handles a selected dispatching priority according to the round-robin scheduling rule. The other dispatching priorities are using the HSFS rule. The round-robin scheduling for one dispatching priority protects at least partly from the monopolization of the CPU by a CPU bound task and it is helpful to improve the response time for short jobs.

The second option is the Houston Automatic Spooling System /HASP/, developed by a group of users. HASP includes the heuristic dispatching option, which is able to monitor some characteristics of a job step. A job step is classified as I/O bound if it uses only a part of the time slice. In the opposite case it is classified as CPU bound. The I/O bound tasks are scheduled first and if all are in the wait state, the CPU bound tasks are served in a round-robin scheme. The division between the task classes is dynamic.

HASP can also establish a desired mix of I/O and CPU bound tasks by changing the time slice at installation specified intervals. The above features of HASP are an example of the efforts toward the full automatization of the operating systems.
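The heuristic classification used by HASP can be sketched as follows. The time-slice value and the job-step data are invented; only the rule /I/O bound steps first, CPU bound steps round-robin, reclassification every slice/ comes from the text.

# A sketch of the HASP heuristic dispatching idea summarized above: a step
# that gives up the CPU before its time slice expires is treated as I/O bound
# and is dispatched ahead of the CPU bound steps, which share the CPU
# round-robin.

from collections import deque

TIME_SLICE_MS = 200          # installation-specified value (assumed)

def classify(cpu_used_ms):
    """Reclassified after every slice, so the division stays dynamic."""
    return "I/O bound" if cpu_used_ms < TIME_SLICE_MS else "CPU bound"

def dispatch(steps):
    """steps: list of (name, cpu_used_in_last_slice_ms).  Returns run order:
    all I/O bound steps first, then CPU bound steps in round-robin order."""
    io_bound = [name for name, used in steps if classify(used) == "I/O bound"]
    cpu_bound = deque(name for name, used in steps if classify(used) == "CPU bound")
    return io_bound + list(cpu_bound)

steps = [("PRINT1", 40), ("SORT", 200), ("EDIT", 15), ("MATRIX", 200)]
print(dispatch(steps))       # ['PRINT1', 'EDIT', 'SORT', 'MATRIX']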

2.3. OS/VS2

Further development of the IBM operating systems is connected with the concept of virtual storage. The system control program OS/VS2 is known in two versions. Release 1 provides one virtual address range. Jobs are assigned regions from this just as jobs are assigned regions in the real storage of MVT. Release 2 provides multiple virtual address spaces: each user receives his own copy of virtual storage, minus the space used for certain system functions /e.g. the nucleus/.

Figure 11 shows the virtual storage overview for Release 1. The abbreviation LSQA stands for the local system queue area. Basically the concept of Release 1 is close to OS/MVT except of course for the virtual storage. Therefore we will concentrate on the description of Release 2.

Figure 12 shows the virtual storage lay-out for Release 2. Each address space is divided into the system area, the user area and the common area. The system area contains the nucleus, which is fixed in real storage and then mapped into the low addresses of virtual storage. The System Queue Area contains tables and queues relating to the entire system. The Pageable Link Pack Area contains supervisor call routines, access methods and any reentrant read-only system and user programs that can be shared among the users of the system. The Common Service Area contains data for communication among the private user address spaces.

On the right part of Figure 12 we see the structure of the user address space in virtual storage. The LSQA contains tables and queues associated with the user's job and address space. The Scheduler Work Area contains control blocks and tables created during JCL interpretation and used by the initiator during job step scheduling. Subpools 229 and 230 are used for control blocks, but can be obtained only by authorized programs. The remainder of the private address space is available to its user, with space being allocated from the low address up.

Figure 13 shows the VS2 Release 2 control program overview. After the system is initialized and the job entry subsystem is active, jobs may be submitted for processing. To schedule a batch job, the job entry subsystem issues a START command for an initiator. To schedule a time sharing job, a user issues a LOGON command. As a result of START and LOGON a new address space is required. The address space creation routine notifies the system resources manager that a new address space is to be created. The system resources manager decides, based on factors like priorities and the number of already existing address spaces, whether or not a new address space is advantageous. If not, the new space will not be created until the system resources manager finds the conditions suitable. If the new address space is acceptable, the address space creation routine invokes virtual storage management /VSM/ to assign virtual storage and set up addressability for the address space.

VSM builds an LSQA and sets up a segment table, page tables and external page tables in it. VSM also creates control blocks to operate the control task for the address space.

Then the region control task /RCT/ receives control. The RCT builds control blocks that further define the address space, then attaches the started task control routine /STC/.

Next, STC uses an initiator as a subroutine to select the job. The initiator passes the job ID to the job entry subsystem. The job entry subsystem invokes the interpreter to build scheduler control blocks in the scheduler work area /SWA/ for the address space. Upon return from the job entry subsystem, the initiator performs allocation and it issues an ATTACH for the task related to the address space: the terminal monitor program /LOGON/, or any started program /START/. If the START command is for an initiator, the initiator asks the job entry subsystem for a problem program job that is ready for execution. The job entry subsystem calls the interpreter to build the scheduler tables in the SWA. When the initiator receives control again, the problem program is attached.

Task management is performed by the supervisor. Basically, the supervisor controls the use of the CPU, real storage, and virtual storage. All supervisor activity begins with an interruption. The supervisor interruption handler saves the critical information necessary to return control to the interrupted program after the interruption is processed. In most cases, the interruption handler passes control to one of the following routines to process the interruption:
- Task supervisor, which performs services /such as attaching and detaching a subtask/ requested by tasks and allocates the CPU among competing tasks.
- Contents supervisor, which locates requested programs, fetches them to virtual storage if necessary, and schedules their execution.
- Real storage manager, which directs the movement of pages between real storage and external page storage.
- Auxiliary storage manager, which handles external page storage including virtual I/O data sets.
- Virtual storage manager, which services GETMAIN and FREEMAIN by allocating and deallocating storage within the virtual address space.
- Service manager, which improves system response through a new dispatching technique that allows internal system functions to run enabled, unserialized, and in parallel on a multiprocessing system.

In VS2 Release 2 an installation can specify, in measurable terms, the performance that any member of any subset of its users is to receive, under any workload conditions and during any period in the life of a job. The system resources manager is responsible for tracking and controlling the rate at which resources are provided to users in order to meet the installation's requirements.

The installation defines:
- Performance groups - subsets of users that should be managed in distinguishable ways.
- Performance objectives - distinct rates, called service rates, at which CPU, I/O, and real storage resources are provided to users in a performance group at a certain workload level in the system.

The service rate is the number of service units per second a user should receive. A service unit is a measure of processing resources. The system resources manager monitors the rate at which service is supplied to a user in order to ensure that the installation performance specification is met.

To take advantage of the system resources manager, a user simply identifies the performance group in which he is to be included, as prescribed by the installation. Service units are used to measure the amount of processing resources provided to each address space. They are computed as a combination of the three basic processing resources:

Service units = A x CPU + B x I/O + C x /frames/

The coefficients A, B, C are supplied by the installation.

When an installation specifies performance objectives, it specifies one or more service rates, i.e. how many service units per second a user should receive. The installation is not specifying any particular amount of the individual resources that a user is to receive. It is assumed that different users will use CPU, I/O and real storage in different proportions. However, by supplying the coefficients, the installation can adjust the relative importance of the CPU, I/O, or real storage resources within the service definition.

The purpose of performance groups is to group user transactions that the installation considers to have similar performance requirements. Basically, a user transaction in a batch environment is a job or job step and in a time-sharing environment, an interaction. The installation can define as many as 255 performance groups, each identified by a distinct performance group number.

Each performance group can be divided into as many as eight periods. By dividing a performance group into periods, an installation can associate different performance objectives with different periods in the life of a transaction. The duration of a period can be specified either as a number of real-time seconds or as a number of accumulated service units.

A performance objective states service rates, i.e. how many service units per second an associated transaction should receive under different system workload conditions.
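A small worked example may make the service-unit accounting more concrete. In the Python sketch below only the formula and the notion of a service rate come from the text; the coefficients, the resource figures and the interpretation of the storage term as frame-seconds are assumptions.

# A worked example of the service-unit accounting described above.

A, B, C = 10.0, 5.0, 0.1          # installation-supplied coefficients (assumed)

def service_units(cpu_seconds, io_requests, frame_seconds):
    # service units = A*CPU + B*I/O + C*frames
    return A * cpu_seconds + B * io_requests + C * frame_seconds

def service_rate(units, elapsed_seconds):
    return units / elapsed_seconds

# One interval in the life of a batch transaction (invented figures).
units = service_units(cpu_seconds=2.0, io_requests=30, frame_seconds=400.0)
rate = service_rate(units, elapsed_seconds=20.0)

objective_rate = 15.0             # units/second promised to this performance group (assumed)
print(f"{units:.0f} service units, {rate:.1f} units/s "
      f"({'meets' if rate >= objective_rate else 'misses'} the objective of {objective_rate})")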

The installation can define as many as 64 performance objectives, each identified by a distinct number from 1 to 64. Figure 14 shows an example of performance objectives.

The system resources manager tracks the service rates provided to users and the average workload level of the system. As the level increases or decreases, it adjusts the service rates to maintain the relationships between the performance objectives that the installation has defined.

User transactions are associated with performance objectives by means of the performance group and the periods within each performance group: each period of a performance group definition includes a performance objective number. Figure 15 shows how a user is associated with a performance objective.

To manage system resources, the system resources manager serves as the centralized decision-maker. It monitors a wide range of data about the condition of the system, seeking to control such key variables as:
- Amount of real storage allocated.
- Distribution of I/O load.
- Swapping frequency
- Level of multiprogramming
- Paging rate.
By centralizing the control of these variables, the system resources manager can better make decisions that will affect the overall system performance.

VS2 Release 2 is an example of a fully automatized operating system, which can be tuned by a large number of installation parameters. Such a tuning process also involves very complex measurements, which will be described in one of the following chapters.

2.4. VM/370

Virtual Machine Facility/370 is a system control program that manages a real computing system so that all its resources are available to many users at the same time. Each user has at his disposal the functional equivalent of a real dedicated computing system.

While the control program of VM/370 manages the concurrent execution of the virtual machines, an operating system must manage the work flow within each virtual machine. Because each virtual machine executes independently of the other machines, each one may use a different operating system, or different releases of the same operating system. The following operating systems can execute in virtual machines: DOS, DOS/VS, OS/PCP, OS/MFT, OS/MVT, OS/VS1, OS/VS2, OS-ASP, PS/44, CMS, RSCS.

Figure 16 shows an example of Multiple Virtual Machine operation. A virtual machine consists of the following components:
- Virtual system console
- Virtual storage
- Virtual CPU
- Virtual channels and I/O devices

As a virtual system console usually serves an IBM 2741 Communication Terminal or an IBM 3277 Display Station. By entering commands at his terminal a user can perform almost all the functions an operator can perform on a real machine system console. He can load an operating system, stop and start virtual machine execution, and display and change the contents of registers and storage.

Each virtual machine has its own virtual storage space from 8 K to 16 million bytes. The Control Program brings into real storage whatever part of virtual storage is needed for the virtual machine's execution, but does not necessarily keep in storage those parts that are not needed immediately.

The Control Program provides CPU resources to each active virtual machine through time slicing. The virtual CPU can execute in either basic or extended control mode. For example OS/MVT and OS/VS2 can execute in virtual machines.

A virtual machine supports the same devices as a real machine. Virtual devices are logically controlled by the virtual machine and not by VM/370.

In most cases input/output operations, and any error recovery processing, are the complete responsibility of the virtual machine operating system.

Virtual and real device addresses may differ. CP converts virtual channel and device addresses to their real channel and device equivalents and performs any data translation that is necessary. All virtual devices must have real counterparts. A virtual disk must have a real disk counterpart, or a virtual tape must have a real tape counterpart. Some virtual devices, such as tapes, must have a one-to-one relationship with a real device. Others may be assigned a portion of a real device. For example, a virtual disk may occupy all or part of a real disk. In other words a real disk can be divided into several virtual minidisks.

Two operating systems, CMS and RSCS, are considered as part of VM/370. The Conversational Monitor System /CMS/ provides users with a wide range of conversational, time-sharing functions. The Remote Spooling Communications Subsystem allows users to transmit files to remote stations in the RSCS teleprocessing network.

The main advantages of VM/370 are connected with operating system development, the testing of complex systems and interactive program development. Programs developed in a virtual machine can exceed the real storage size of the computer. Programmers can use debugging aids at their terminal which are normally reserved for the computer operator. They can display and store into registers, stop execution at an instruction address and alter the normal flow of execution. CMS simplifies the creation and manipulation of source programs on disk, and allows the user to examine selected parts of program listings and storage dumps at his terminal.

The disadvantages of VM/370 are related to the low execution speed of the batch oriented operating systems. For example a virtual machine with OS/MVT might execute at only one half of its speed on a real IBM 370.

2.5. SCOPE 3.4

From the complicated organization of the CD 6000 and CYBER computers one can easily see that it is a challenging task to write an effective operating system which could fully exploit the multiprocessing capability of these computers. In the past many mistakes were made in the design of early versions of such operating systems. The Supervisory Control Program Execution /SCOPE/ is the name of an operating system which is now, together with its successor NOS/BE, most often used on CD 6000 and CYBER computers. SCOPE 3.4 and NOS/BE have basically the same organization. SCOPE has been primarily designed for batch processing, but later on an interactive subsystem, INTERCOM, has been added.

Job and interactive command execution is controlled by the SCOPE peripheral processor monitor /PPMTR/ and the central processor monitor /CPMTR/. The activities of the monitors are supported by more than 300 peripheral routines. SCOPE uses some areas of the central memory to store: the system tables and CPMTR programs /CMR area/, the central memory resident library, and also the block addresses for files on mass storage /KBT area/.

Figure 17 shows the usage of the central memory /CM/ for the system and the users. A user job in execution is assigned a contiguous area in CM and a control point number. A control point is a concept used to facilitate book-keeping. SCOPE permits up to 15 control points. CPMTR programs do not run on the control points and only one control point is taken for the spooling subsystem JANUS. The others are used for user or operator jobs.

The communication between the various processes running in SCOPE is rather complicated. When a user program is loaded and executed as the result of a control card call, the system must place any parameters specified on the job card within the field length of the job.

No central processor instruction allows a CP program to perform I/O, therefore a request must be sent to the system to load a PP routine to execute the I/O. The request is placed in a register located at the reference address plus one /RA+1/, as it is shown in Figure 18. CPMTR will pick up the request, inserting the control point number. Therefore the user's program should use the exchange jump instruction immediately after placing a call in RA+1. This will cause CPMTR to begin execution immediately. If CPMTR determines that the RA+1 call should be assigned to a PP, it will pass the call to PPMTR.

When a PP is available, PPMTR will write the call into its PP input register in CMR. The PP resident is permanently checking its input register and when it sees the call, the appropriate routine will be loaded and executed.

SCOPE peripheral routines can not load and often can not execute without the help of the monitors. In fact the monitors will be asked to perform such functions as: channel reservation, loading of peripheral transient programs and overlays, sending of dayfile messages, changing of control point assignment and requesting another peripheral job.

When the PP resident has a monitor request, it places a message into the PP output register in the PP communication area. After making the request, the PP resident waits for the first byte of the output register to be set to zero, signalling that the monitor has processed the request.

The PPMTR is in general control of the system. It is loaded into PP0 at deadstart and remains there for the duration of system execution. Primarily, PPMTR controls and coordinates system activities to avoid conflicts between the various system processors. It allocates peripheral processors, Central Memory and control points. During the execution of its main loop, PPMTR scans the CPMTR request stack /T.MTRRS/ for a PPMTR function call or a PP program request. Moreover PPMTR looks for the pending functions coming from the peripheral processors, it advances the system clock, and it checks the RA+1 address of the executing job. Upon completion of these four high speed functions PPMTR will process one of the slower functions. The list of slower functions includes: advancing of control points, scanning of the PP output registers, processing of the delay stack and some others.

CPMTR processes certain PP output register monitor functions, it checks user program RA+1 requests and it schedules the CPU. The CPU dispatching was based in earlier versions of SCOPE on a one level round-robin rule. Recently a multilevel round-robin rule has been introduced. Active jobs from one class form a ring, which is served by the CPU until all jobs disappear from the ring. Then a lower priority ring is served. The value of the time slice depends on the job priority and field length.

Input and output request processing depends upon the source of each request. Active user CM programs issue RA+1 requests for I/O which are cycled through CPMTR. PP programs request I/O by placing a monitor request into their PP output register. CPMTR assigns the I/O request to CPCIO which in turn assigns it to the proper peripheral processor, CIO or ISP /see Figure 19/. CIO /circular input/output/ processes requests for magnetic tape, teletype and unit record equipment. ISP /stack processor/ processes all requests for mass storage I/O.

Before calling CIO, the program must set up the circular buffer parameters and the CIO operation code in the file environment table /FET/. The relative address of the FET is placed in the CIO call. When the file is opened CPCIO determines if the file is on an allocatable or a nonallocatable device. If the file is on an allocatable device, CPCIO calls the Stack Processor Manager /SPM/ to enter the request in the I/O Request Stack. SPM performs request scheduling and device optimization.
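The request path just described can be caricatured in a few lines of Python. Everything below - the request format, the device names and the queues - is invented for illustration; only the routing idea /RA+1 call picked up by CPMTR, CPCIO sending allocatable-device requests to the SPM stack and the rest to CIO/ comes from the text.

# A much-simplified sketch of the request path described above.

ALLOCATABLE = {"841-DISK", "844-DISK"}           # mass storage devices
io_request_stack = []                            # requests for SPM to optimize
cio_queue = []                                   # requests for CIO and its overlays

def exchange_jump(ra_plus_1, control_point):
    """Models the user program issuing an exchange jump after filling RA+1."""
    request = ra_plus_1.pop("call", None)
    if request is not None:
        cpmtr_process(dict(request, control_point=control_point))

def cpmtr_process(request):
    if request["kind"] == "io":
        cpcio_route(request)                     # CPU-resident I/O dispatcher
    # other monitor functions (advance clock, scan PP registers, ...) omitted

def cpcio_route(request):
    if request["device"] in ALLOCATABLE:
        io_request_stack.append(request)         # SPM will schedule and optimize
    else:
        cio_queue.append(request)                # CIO overlay handles it directly

ra1 = {"call": {"kind": "io", "op": "READ", "device": "844-DISK", "fet": 0o4321}}
exchange_jump(ra1, control_point=5)
ra1["call"] = {"kind": "io", "op": "READ", "device": "7-TRACK-TAPE", "fet": 0o4400}
exchange_jump(ra1, control_point=7)

print("stack:", io_request_stack)
print("CIO  :", cio_queue)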

Rotational Mass Storage I/O is performed by SPM selecting a stack request and assigning it to an ISP driver. The request is placed into the ISP communication area. ISP comes up and initializes the driver overlay appropriate to the specified RMS device /3SW for the 841 disk or 3SY for the 844 disk/. ISP performs the I/O requested, obtaining field access as necessary, and at I/O completion returns the request to SPM for termination processing.

If the file device code is for a non-allocatable device, CIO and its overlays will process the request. For example, if a user issues a request to read data from a file on a SCOPE standard format 7-track tape, CIO will call the overlay 1RT into its PP. 1RT will reserve one of the hardware channels connected to the equipment. It then issues the function code to connect the controller and tape driver. 1RT issues functions to transmit one PRU of data from the tape driver over the data channel. When the entire PRU is transmitted or an end-of-record is encountered, 1RT picks up the pointers to the circular buffer from the FET and it transfers the data from the PP to the buffer. 1RT updates the PRU count in the file name table /FNT/, releases the channel, sets the completion bits in the FNT and FET, and drops out.

The above description of SCOPE 3.4 I/O corresponds to level 430. Earlier versions of SCOPE 3.4 were using the ISP driver for device optimization, instead of SPM. In both cases the device optimization algorithm schedules the requests for which the disk head displacement and the rotational latency are minimal.

SCOPE processes jobs in the system in three independent stages: Input, Scheduling, Output. Jobs can be loaded into the computer by reading card decks into the system using the system package JANUS. Alternately, they may be input from tape with the tape loader /1LT/ or from a user terminal through INTERCOM /see Figure 20/. As each job is read into the computer, it is placed into an input or preinput file. The preinput queue is formed for the jobs with magnetic tape requests. These jobs will be individually staged to execution by the operator.

Whenever one control point and 2000B words of Central Memory are available, the Scheduler calls 1IB to initiate another batch job. 1IB scans the input queue and it calculates the input queue priority according to the job card priority and the job age. Many installations, including INR, change the algorithm for the input queue priority to include the job field length and the time limit instead of the card priority. A job with the highest input queue priority is assigned to a control point and it starts the execution.
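A sketch of the two input-queue priority schemes mentioned above is given below. The weights, and the exact shape of the INR variant that favours small field lengths and short time limits, are assumptions made for illustration; the text only states which quantities enter the calculation.

# A sketch of the input-queue priority calculations mentioned above.

def standard_priority(job_card_priority, age_minutes, age_weight=1.0):
    return job_card_priority + age_weight * age_minutes

def inr_priority(field_length_words, time_limit_seconds, age_minutes,
                 age_weight=1.0):
    # Smaller jobs and shorter time limits get in first; age still helps.
    return (1_000_000.0 / field_length_words
            + 10_000.0 / time_limit_seconds
            + age_weight * age_minutes)

jobs = [("BIGJOB", 60000, 2000, 30), ("SMALL", 15000, 100, 5)]
for name, fl, tl, age in jobs:
    print(f"{name:7s} INR-style priority {inr_priority(fl, tl, age):8.1f}")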

After initiation the job is under the control of the Scheduler, which is a CPMTR program running in user mode. The Scheduler is responsible for allocating control points and central memory.

The Central Memory Queue /CMQ/ contains all jobs waiting to be run at a control point. If a job is in the CMQ, it must have all the resources that it currently needs to run except for Central Memory and a control point.

The Device Queue is formed by jobs requesting a non-allocatable device. The program ITS is responsible for detecting when the appropriate device is ready, and for calling the Device Queue manager /1DM/. 1DM will in turn call the Scheduler to put the job descriptor back into the CMQ.

The Permanent File Queue consists of all jobs waiting to attach a permanent file. If a job at a control point tries to attach a permanent file, the PP routine PFA is called. If PFA determines that the job cannot attach the permanent file because it is temporarily unavailable, PFA calls the permanent file queue manager /1PF/. Whenever a permanent file is detached, 1PF checks if there are jobs waiting for the file. If there are, 1PF will select the job which has been waiting the longest. 1PF will then call the Scheduler to put the job into the Central Memory Queue.

Jobs which are waiting for operator action will be in the Operator Action Queue. When the operator enters an appropriate type-in the job will be put into the CMQ and eventually it will be rolled in and initiated at a control point.

Interactive INTERCOM jobs which are waiting for input from a terminal or waiting until output can be sent to a terminal are swapped out and put into the INTERCOM Queue. When the terminal I/O completes, 1CI will request the Scheduler to place the job in the CMQ.

The function of the CP Scheduler program is to decide which jobs should be run at any given time and to use CM efficiently. The Scheduler decides which jobs to swap in and it then calls the swapper PP program to perform the actual swapping. Each job in the CMQ or executing at a control point has an associated priority called the "queue priority". The job card priority /JCP/ has a weighting effect on the queue priority. The Scheduler makes its decision based entirely on the queue priority. It will schedule in the highest queue priority job in the CMQ which will fit in the currently available memory and in the memory assigned to jobs of lower queue priority.

When a job is swapped into a control point, it is given a high queue priority. At the end of a period of time called a "quantum", the queue priority is changed to a lower value, thereby making the job a more likely candidate for swap-out. The quantum for a job is considered elapsed when

(X + Y/4) x 64 > BQ

where X denotes the amount of CPU time, Y denotes the PPU time and BQ stands for the quantum value.

Jobs in the system are divided into five classes:
1. Batch jobs using no non-allocatable devices.
2. Batch jobs using one or more non-allocatable devices.
3. INTERCOM /interactive/ jobs.
4. Multi-user jobs.
5. Express jobs.
A multiuser job is a program that runs under the control of INTERCOM and that simultaneously processes several terminals in a serial manner /EDITOR/. An express job is a job for which the operator entered DROP, KILL or RERUN.

For each class there is an entry in a table in CMR called the Job Control Area which contains the parameters for scheduling the jobs in the class. These parameters include:
1. Minimum queue priority /MINQP/
2. Maximum queue priority /MAXQP/
3. Aging rate /AR/
4. Quantum priority /QP/ at a control point
5. Quantum length /BQ/

When a job has been swapped out and enters the CMQ, it is assigned a queue priority equal to its base priority /BP/. The base priority is normally a combination of the minimum queue priority for the class and the job card priority /BP = MINQP + 8 x JCP/. The priority is incremented with time at a rate equal to the aging rate of the class. When the priority of a job in the CMQ reaches the maximum priority of the class, its priority is no longer aged.

When a job is swapped into a control point, it is given a priority equal to the quantum priority plus eight times the job card priority. When the quantum for the job has elapsed, the priority of the job is set to the base priority.

The following tables give the CDC standard set of Scheduler parameters:

Class of Jobs      MINQP   MAXQP
1. Batch             100    1000
2. Dependent         200    1000
3. Interactive      1100    2400
4. Multiuser        2410    2510
5. Express          1000    3200

Class of Jobs         AR      QP      BQ
1. Batch               4    1400    2000
2. Dependent          10    1400    2000
3. Interactive      1000    2500     200
4. Multiuser         200    3000    4000
5. Express           400    3200     400

For the above parameter set, an interactive job needs only 0.6 sec to swap out a batch job and a multiuser job will swap out a batch job immediately. Therefore some installations, like Washington University and INR, use different parameter sets, which reduce the amount of swapping caused by interactive jobs.

SCOPE, like other modern operating systems, needs a very careful tuning of a large number of system parameters in order to achieve high performance. The tuning process may be performed only by means of the system performance monitors, which will be described in one of the following chapters.
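The scheduling rules and parameters collected above fit into a short simulation. In the Python sketch below the parameter values are taken from the tables, but the time units of the aging rate and of the quantum formula /here taken as seconds/ are assumptions, so the absolute times printed are only indicative.

# A sketch of the queue-priority rules collected above: base priority
# BP = MINQP + 8*JCP, aging towards MAXQP at the class aging rate, control
# point priority QP + 8*JCP, and the quantum test (X + Y/4) * 64 > BQ.

PARAMS = {  # class: (MINQP, MAXQP, AR, QP, BQ)
    "batch":       (100, 1000,    4, 1400, 2000),
    "dependent":   (200, 1000,   10, 1400, 2000),
    "interactive": (1100, 2400, 1000, 2500,  200),
    "multiuser":   (2410, 2510,  200, 3000, 4000),
    "express":     (1000, 3200,  400, 3200,  400),
}

def base_priority(job_class, jcp):
    minqp, maxqp, *_ = PARAMS[job_class]
    return min(minqp + 8 * jcp, maxqp)

def aged_priority(job_class, jcp, seconds_in_cmq):
    minqp, maxqp, ar, *_ = PARAMS[job_class]
    return min(base_priority(job_class, jcp) + ar * seconds_in_cmq, maxqp)

def control_point_priority(job_class, jcp):
    return PARAMS[job_class][3] + 8 * jcp

def quantum_elapsed(job_class, cpu_sec, ppu_sec):
    bq = PARAMS[job_class][4]
    return (cpu_sec + ppu_sec / 4.0) * 64 > bq

# How long must an interactive job age in the CMQ before it outbids a batch
# job that is executing at a control point?
batch_running = control_point_priority("batch", jcp=1)
for t in (0.0, 0.2, 0.4):
    waiting = aged_priority("interactive", jcp=1, seconds_in_cmq=t)
    print(f"t={t:.1f}s  interactive {waiting:6.0f} vs batch {batch_running} "
          f"-> {'swap' if waiting > batch_running else 'wait'}")

print("batch quantum used up after 40 s of CPU time?",
      quantum_elapsed("batch", 40.0, 0.0))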

2.6. KRONOS

KRONOS is another operating system developed by CDC for the same range of computers as SCOPE. KRONOS was designed to provide time-sharing and transaction capabilities for a large number of interactive terminals. At a certain stage of development the name of the operating system has been changed into NETWORK OPERATING SYSTEM /NOS/, which basically has the same organization. The new features of NOS are related to the communication processor CD 2550, which can serve as the node of a complicated computer network.

NOS uses many concepts of SCOPE like: the peripheral and central monitors, control points, the PP communication area, RA+1 requests and recall. The recall program status is provided in both systems to enable efficient use of the central processor in the multiprogramming environment. Often, a CP program must wait for an I/O operation to be completed before more computation can be performed. To eliminate the CPU time wasted if the CP program were placed in a loop to await I/O completion, a CP program can ask the monitor to put the control point into recall status until a later time. Then the CPU may be assigned to execute a program at some other control point. Recall may be automatic or periodic. Auto-recall should be used when a program requests I/O or another system action and cannot proceed until the request is completed. The monitor will not return control until the specific request has been satisfied. Periodic recall can be used when the program is waiting for any one of several requests to be completed. The program will be activated periodically, so that it can determine which request has been satisfied and whether or not it can proceed.

The main differences between NOS and SCOPE are connected with the modular decentralized structure, the organization of disk I/O and the simplicity of the NOS system tables.

Figure 21 shows the residency of the NOS operating system. The subsystems run on control points like user jobs. Up to 23 control points can exist in NOS. Each subsystem is in charge of the user interface, the submission of jobs to the queues, and the initiation of system commands to the monitor. An exception is the MAGNET subsystem, which handles the automatic tape assignment.

The TELEX subsystem handles time-sharing and deferred batch /coming from interactive terminals/. TELEX passes the transaction messages to TRANEX. The transaction subsystem TRANEX is one in which messages received from terminals trigger the execution of one or more tasks to interact with a data file or data base. The EXPORT/IMPORT subsystem processes the remote batch from 200 User Terminals. The equivalent of JANUS, which handles the local batch, is called the BATCHIO subsystem.

The main advantage of a modular subsystem is the release of all the CM space taken by the subsystem tables in the case of an operator drop. In other words the CMR for NOS has fewer permanent system tables than for SCOPE.

The second difference between NOS and SCOPE is connected with the decentralized disk I/O processing of NOS. At first glance, the SCOPE centralized disk I/O processing looks very attractive. It gives the possibility to choose from the stack the request with the shortest execution time and therefore to minimize the disk head movements. However measurements show that the stack processor preparations for the request processing take a lot of time. Therefore it is better to have the parallel preparation of many requests by independent processors. This is the case with NOS, where each pool peripheral processor has a disk driver included in the PP resident. It gives the possibility of the channel reservation only if the request is ready. Otherwise, another PP can reserve the disk channel and it performs the disk I/O. Figure 22 shows an example of the interactive job flow using the NOS decentralized disk I/O feature.

Job scheduling in the NOS operating system is performed similarly as in SCOPE 3.4. An example of the Job Control Parameters is presented in Figure 23.
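As a small illustration of the recall mechanism shared by SCOPE and NOS and described earlier in this section, the sketch below contrasts auto-recall /control returns only when the specific request is complete/ with periodic recall /the program is reactivated at intervals to test several outstanding requests/. The request objects and the tick-based clock are inventions for the purpose of the example.

# A small sketch of the two recall variants described earlier in this section.

import itertools

class Request:
    def __init__(self, name, finishes_at_tick):
        self.name, self.finishes_at_tick = name, finishes_at_tick
    def done(self, tick):
        return tick >= self.finishes_at_tick

def auto_recall(request, clock):
    """Control is not returned until this specific request is complete."""
    for tick in clock:
        if request.done(tick):
            return tick

def periodic_recall(requests, clock, period=2):
    """The program is reactivated every `period` ticks and checks which of
    several outstanding requests has completed."""
    for tick in clock:
        if tick % period == 0:
            finished = [r.name for r in requests if r.done(tick)]
            if finished:
                return tick, finished

clock = itertools.count(1)
print("auto-recall resumed at tick", auto_recall(Request("READ", 3), clock))
tick, ready = periodic_recall([Request("TAPE", 9), Request("DISK", 6)], clock)
print(f"periodic recall at tick {tick}: completed {ready}")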

NOS provides a very powerful control language, which allows the programmer to transfer control and to perform arithmetic and test functions within the control statement record. The control language consists of statements /GOTO, SET, CALL, IF, DISPLAY/ similar to programming language statements. An important feature of the control language is the capability to create procedure files. A procedure file is a group of control language statements which can be called, much like a subroutine, for insertion anywhere within the control statement record.

Another feature, important for interactive users, is the existence of indirect permanent files. Indirect permanent files are accessed by using a working copy of the permanent file as a local file attached to the user's job. If the user wishes the working copy to remain permanent after the file has been altered, the SAVE or REPLACE function must be issued.

In NOS an overflow on disk units is not possible. It is also impossible to create dependencies between jobs or for the user to access Extended Core Storage. In spite of these deficiencies, NOS is considered an extremely effective operating system for installations with a very large number of interactive terminals.

3. Performance Definitions

The computer system performance may be seen from different points of view. Since the process of computer evaluation involves computer system designers, managers, system programmers and application programmers, it is obvious that "performance" may be a highly subjective term. Therefore it is sometimes defined as follows: "performance is the degree to which a computing system meets the expectations of the person involved with it".

In more popular terms, performance is understood as measures of system speed and resource use. From the external point of view, what matters is the "effectiveness" with which the system handles the specific application. From the internal point of view, what matters is the "efficiency" of the resource utilization in processing the workload.

The workload is the most difficult factor in the computer performance evaluation. In the production environment the workload depends very much on the time of day, the day of the week, user customs and the work schedule. Using powerful measurement tools it is possible to compare the effects caused by changes in an operating system during production time. The measurement tools help us to exclude the workload which is not "typical" for a certain time of the day and to introduce only typical workload periods into the statistical analysis. However, the production measurements can only be applied to establish a rather significant change in the computer performance.

Therefore installations and manufacturers use benchmarks and stimulators to achieve repeatable results. It is of course a problem how far a benchmark or stimulator can imitate the "real" workload. The problem is to a large extent solved by appropriate dayfile analysers. The dayfile analysers collect such workload characteristics as:
1. Job CPU time
2. Job channel time
3. Number of I/O requests
4. Average or maximum memory size requested by a job
5. Job PPU time
6. Priorities assigned to a job.
Some other workload characteristics are more difficult to get: system request interarrival time, blocked time, working set size and locality of reference.
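A dayfile analyser of the kind described above can be sketched as follows; the record layout and field names are assumptions for illustration and do not reproduce the real SCOPE dayfile format.

    # A minimal sketch of a dayfile analyser collecting the workload
    # characteristics listed above; the record layout is hypothetical.
    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class JobRecord:                 # one accounting record per job
        cpu_time: float              # seconds
        channel_time: float          # seconds
        io_requests: int
        max_memory: int              # words
        ppu_time: float              # seconds
        priority: int

    def workload_characteristics(records):
        """Aggregate per-job records into the workload figures listed above."""
        return {
            "jobs":              len(records),
            "mean cpu time":     mean(r.cpu_time for r in records),
            "mean channel time": mean(r.channel_time for r in records),
            "mean i/o requests": mean(r.io_requests for r in records),
            "mean max memory":   mean(r.max_memory for r in records),
            "mean ppu time":     mean(r.ppu_time for r in records),
            "priority mix":      {p: sum(1 for r in records if r.priority == p)
                                  for p in sorted({r.priority for r in records})},
        }

    if __name__ == "__main__":
        sample = [JobRecord(12.0, 4.0, 350, 0o40000, 2.1, 3),
                  JobRecord(2.5, 1.0, 80, 0o20000, 0.4, 6)]
        for key, value in workload_characteristics(sample).items():
            print(key, value)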
Figure 24 shows the result of the measurements performed using the program WKLOAD by D.Makosa at INR. WKLOAD is devoted mostly to the measurement of interactive job characteristics and it requires certain changes in SCOPE to obtain additional dayfile messages. The characteristics obtained by means of WKLOAD may be used in the stimulation technique.

The stimulators were developed for the SCOPE and NOS operating systems by N. Williams from CDC. The SCOPE STIMULATOR is a set of CP and PP programs that simulate the 6766 multiplexer and TTY terminals. Under normal conditions, when the hardware is present, the INTERCOM driver resides in a PPU and communicates with the multiplexer over a data channel. The multiplexer in turn transmits the data to the appropriate terminals. The most basic function of the STIMULATOR is to communicate with the INTERCOM driver over a common data channel. When the STIMULATOR is present, the STIMULATOR routines take the place of the multiplexer hardware. The STIMULATOR processes the driver's function codes as well as receiving and transmitting data destined for each terminal. A STIMULATOR run consists of two phases. The first is the stimulation phase, where all data received from INTERCOM and the calculated response times to each command are written to tape. The second phase consists of data reduction and report generation.

The resources required for stimulation are one control point, one or two dedicated PP's /depending on whether or not the option to save the output to tape is requested/, one dedicated data channel, one or two tapes, and central memory whose size depends on the number of terminals being simulated and the length of the input tests. The stimulation phase is performed by two PP routines, 1VG and VSM, and one CP routine, SIP /see Figure 25/.

The selection of a benchmark is not usually considered a difficult problem. However, it is necessary to include in the benchmark not only the most frequent jobs but also long-run jobs, which take most of the system time. It is also popular to use synthetic benchmarks, which simulate the usage of system resources without performing the functions of normal jobs. By means of control parameters a synthetic job can change its CPU, central memory and I/O requirements, and therefore the whole benchmark can be easily constructed.

With a well defined workload we can proceed to further steps in the performance evaluation. These steps involve the definition of performance measures, the determination of the quantitative values of the performance measures and the discussion of the results. Only the first step will be covered in this chapter; the others will be discussed in the following chapters.

Now I would like to remind you of some basic definitions of performance measures which describe the system effectiveness:
1. THROUGHPUT is the inverse of the elapsed time required to process a given workload /or the number of user jobs processed per unit of time/.
2. TURNAROUND TIME is the time period between submitting a job and receiving the output.
3. RESPONSE TIME is the turnaround time for an interactive command.
Another set of performance measures describes the system efficiency:
4. CPU UTILIZATION is the percentage of time a CPU is working for the system and users.
5. UNIT OVERLAP is the percentage of time during which two computer units operate simultaneously /for example the CPU and a channel/.
6. EXTERNAL DELAY FACTOR is the ratio between multi- and monoprogramming turnaround time.
7. SYSTEM UTILITY is the weighted sum of the utilizations of all system units /usually the weights are proportional to the prices of the units/.
8. REQUEST WAIT TIME is the time required to process a request /CPU or I/O/.
9. PAGE FAULT FREQUENCY is the number of page faults per second.
As you can see from the above definitions, some performance measures are of a general nature and some others, like PAGE FAULT FREQUENCY, are applicable only to special operating systems.
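The effectiveness measures defined above can be computed directly from a job log; the following sketch assumes a hypothetical log format and is only meant to fix the definitions.

    # A minimal sketch computing a few of the measures defined above from a
    # hypothetical job log; field names are illustrative, not a real format.
    def throughput(jobs, elapsed_seconds):
        # jobs processed per unit of time
        return len(jobs) / elapsed_seconds

    def turnaround(job):
        # time between submitting a job and receiving the output
        return job["output_time"] - job["submit_time"]

    def cpu_utilization(busy_seconds, elapsed_seconds):
        return 100.0 * busy_seconds / elapsed_seconds

    def unit_overlap(cpu_busy_intervals, channel_busy_intervals, elapsed_seconds):
        # percentage of time during which CPU and channel are busy
        # simultaneously, with busy periods given as (start, end) pairs
        overlap = 0.0
        for a0, a1 in cpu_busy_intervals:
            for b0, b1 in channel_busy_intervals:
                overlap += max(0.0, min(a1, b1) - max(a0, b0))
        return 100.0 * overlap / elapsed_seconds

    def external_delay_factor(multi_turnaround, mono_turnaround):
        # ratio between multi- and monoprogramming turnaround time
        return multi_turnaround / mono_turnaround

    if __name__ == "__main__":
        jobs = [{"submit_time": 0, "output_time": 900},
                {"submit_time": 60, "output_time": 1500}]
        print("throughput /jobs per hour/:", throughput(jobs, 3600) * 3600)
        print("turnaround /s/:", [turnaround(j) for j in jobs])
        print("CPU utilization /%/:", cpu_utilization(2520, 3600))
        print("unit overlap /%/:", unit_overlap([(0, 1800)], [(900, 2700)], 3600))
        print("EDF:", external_delay_factor(30.0, 12.0))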

4. Measurement Tools

The software and hardware resource utilization is measured by means of software, hardware and hybrid performance monitors. By a software monitor we understand a special program incorporated into the operating system. Depending on the computer and monitor design we can distinguish between a Central Software Monitor /CSM/ and a Peripheral Software Monitor /PSM/. A CSM runs in the same processor as user programs and therefore it is necessary to interrupt the processor activity in order to take measurements. The interrupts may occur regularly at specified time intervals /sampling technique/ or may appear as the result of an event /event driven technique/. A PSM is permanently or temporarily resident in a peripheral processor and its activity does not cause additional interrupts of the central processor. A PSM may however slightly influence the performance of the operating system due to I/O requests and conflicts on memory banks.

The hardware monitors usually consist of probes, a logic box, counters and recording devices /see Figure 26/. The probes sense the electronic signals in the circuitry of the measured system, indicating the resource states being monitored. The logic box allows logic functions to be performed on the signals /like AND, OR, NOT/. The results are accumulated in the counters, whose contents can be displayed or recorded on magnetic tape.

The hybrid monitor combines the features of a hardware monitor with the ability to analyze the software resources, like files, tables, etc.

In this chapter I am going to describe examples of the CSM, PSM and hardware monitor.

4.1. System Activity Measurement Facility for VS2

The System Activity Measurement Facility /MF/1/ is a CSM using the sampling technique for the VS2 operating system. MF/1 permits the gathering of information on the following classes of system activity:
1. CPU activity
2. Channel activity and channel-CPU overlap
3. I/O device activity and contention for:
   - Unit record devices
   - Graphics devices
   - Direct access storage devices
   - Communication equipment
   - Magnetic tape devices
   - Character reader devices
4. Paging activity
5. Workload activity.

MF/1 is limited to reporting on system activity as that activity is communicated to the system /for example, by the setting of flags/. As a result of this indirect reporting, statistically sampled values can approach in accuracy only the internal system indications, not necessarily the external activity itself. For example, if a CPU is disabled so that the freeing of a device /device end interruption/ cannot be communicated to the system, the device will appear busy for a longer period of time than it would if it were measured by a hardware monitor.

The sampling technique is used to collect only a part of the measurements. For example, the percentage of channel busy time is derived by dividing the number of sampling observations during the reporting interval in which the channel was busy by the total number of observations. However, the channel activity count is taken from an operating system counter and it is independent of the sampling cycle, which can change from 50 to 999 milliseconds. The channel activity count gives the number of successful Start I/O instructions which were issued to the channel during the reporting period.
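The sampling estimate described above can be sketched as follows; the probing function and the Start I/O count are hypothetical stand-ins, not the MF/1 implementation.

    # A minimal sketch of the sampling estimate described above: channel busy
    # percentage as busy observations divided by total observations, next to
    # an event count kept independently of the sampling cycle.
    import random

    def sample_channel(reporting_interval_ms, cycle_ms, channel_busy):
        """channel_busy(t_ms) -> bool is a stand-in for probing the channel."""
        busy = total = 0
        t = 0
        while t < reporting_interval_ms:
            total += 1
            if channel_busy(t):
                busy += 1
            t += cycle_ms
        return 100.0 * busy / total          # per cent busy over the interval

    if __name__ == "__main__":
        # Hypothetical channel that is busy roughly 40 per cent of the time.
        busy_fn = lambda t: random.random() < 0.40
        start_io_count = 1234                # taken from a system counter,
                                             # independent of the sampling cycle
        pct = sample_channel(15 * 60 * 1000, 500, busy_fn)
        print("channel busy: %.1f %%, successful Start I/O: %d" % (pct, start_io_count))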
The system operator initiates MF/1 monitoring with the START command. MF/1 can also be started as a batch job. The report formatting takes place at the time specified by the parameter INTERVAL. Figure 27 shows the Paging Activity Report of MF/1.

This report provides detailed information about the demands on the system paging facilities and the utilization of real and auxiliary storage during the reporting interval. Explanations of some fields appearing in the report are as follows:
- Pageable System Area includes the pageable link pack area and its directory, and the common storage area.
- Page Reclaim Rate is the per-second rate of page frames that are disconnected /stolen/ from an address space or the system pageable area, but are retrieved for reuse before being re-allocated.
- Swap Page-in Rate is the per-second rate of pages read into real storage as a result of address space swap-ins.
- VIO Page-in is the transfer of a VIO file page from auxiliary to real storage, resulting from a page fault or a PGLOAD on a VIO window.
- Swaps are the number of address space swap sequences, where a swap sequence consists of an address space swap-out and swap-in.

The System Activity Measurement Facility can also produce SMF records, which contain the results of the measurements. From the SMF records it is relatively easy to produce other types of reports, e.g. the utilization of a resource as a function of time.

4.2. SCOPE Monitors CIA and GSS

In this section I am going to describe two PSM which were designed for SCOPE 3.4 performance measurements. The first version of CIA was developed by P.Jalics at Stuttgart University in 1973. The Stuttgart version of CIA was applicable only to a particular CD 6600 configuration. In 1974 the author of this lecture introduced a CERN version of CIA which is able to measure the performance of all CD 6000 and CYBER computers except the CD 7600.

Figure 28 shows the organization of the CIA modules. The main part of the CIA monitor is a peripheral program CIA which collects all necessary information from the SCOPE system tables or CYBER channels. The classes of information are placed into tables contained in the body of the CM resident peripheral program GOG. At the end of the measurements, the statistical tables of GOG are rewritten by a peripheral program GOL into the field length of the Fortran program CIAIN. With the help of the text file CIATXT, CIAIN produces a final report in the form of distribution tables.

CIA can be initiated by an operator peripheral call, which also specifies a sampling delay in seconds. Delay zero forces CIA to take permanent measurements, which roughly corresponds to 20 samplings per second. Otherwise CIA bounces into a PP every specified number of seconds. Internally the peripheral program CIA is divided into the following experiments:
A1. Measures the time it takes to read a hundred 60 bit words from CM to a PP /indicates PP-read saturation level/
A2. Measures the time it takes to write a hundred 60 bit words from a PP to CM /indicates PP-write saturation level/
AA-AB. Measure the status of CPUA and CPUB /RESCHEDULE, CPMTR, IDLE, STORAGE, MOVE, SCHEDULER, USER/
BB. Measures the number of free PP in the system
CC. Measures the control point status /CPU-WAIT, X-RECALL, A-COMPUTE, B-COMPUTE, Y-RECALL, M-STORMOV, NEXT/
DD. Measures the amount of free CM
EE. Measures the amount of free ECS
FM. Measures the number of seven track tape drives free
FN. Measures the number of nine track tape drives free
HH. Measures the distribution of individual user CM field lengths
II. Measures the distribution of individual user ECS field lengths
J1. Measures the number of entries in the input queue
J2. Measures the number of entries in the output queue
J3. Measures the number of local files in the system
K1-K9. Measure the number of stack requests for a DST entry
NX. Measures the utilization of two character or one character PP routines

PX. Measures the use of PP routines according to three character names
QA-QX. Measure the CPU idle and channel active overlap
RA-RX. Measure the channel activity.

Figure 29 shows a typical result for Experiment K2, which measures the number of stack requests attached to the first physical Device Status Table. For each possible number of stack requests, the experiment gives the corresponding rate of occurrence. At the bottom of the table the mean request stack length is given.

Groningen System Statistics /GSS/ is a bouncing peripheral processor program with a sampling period of 20 seconds. GSS samples about 250 data items pertinent to the status of the system. However, the data are not recorded in a CM table, like in CIA, but on a permanent file /see Figure 30/. The permanent file is processed afterwards by a central processor program giving daily or cumulative system statistics. The analysis of the data collected on a file is more flexible than the one applied to the CIA histograms. In GSS it is possible to calculate not only the mean values of the measured variables but also the dispersions. The measurements are often presented as a function of the time of day and some of them are specified for the job classes. GSS measures not only the outstanding stack requests but also the total number of processed stack requests, which was a difficult task for earlier versions of SCOPE 3.4.

Figure 31 shows the result of the GSS experiment for control point occupation.
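A minimal sketch of the statistics GSS is described above as producing, means and dispersions of a sampled variable broken down by hour of day, follows; the sample format is hypothetical.

    # Mean and dispersion of a periodically sampled variable, optionally
    # broken down by hour of day; the sample format is hypothetical.
    from math import sqrt

    def mean_and_dispersion(samples):
        """samples: list of numeric observations taken every sampling period."""
        n = len(samples)
        m = sum(samples) / n
        variance = sum((x - m) ** 2 for x in samples) / n
        return m, sqrt(variance)

    def by_hour(timed_samples):
        """timed_samples: list of (hour_of_day, value); returns per-hour stats."""
        buckets = {}
        for hour, value in timed_samples:
            buckets.setdefault(hour, []).append(value)
        return {hour: mean_and_dispersion(vals)
                for hour, vals in sorted(buckets.items())}

    if __name__ == "__main__":
        # e.g. outstanding stack requests sampled every 20 seconds
        stack_requests = [(10, 2), (10, 4), (10, 3), (11, 1), (11, 0), (11, 2)]
        print(by_hour(stack_requests))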

Apart from complex monitors like CIA and GSS, a lot of smaller programs are used for the performance measurements of SCOPE, like KIVIAT, STATS, PPSTAT and REPORT.

KIVIAT /University of Stuttgart/ consists of measurement code that is included into the PP Monitor, a table in CM and the KIVIAT program, which processes the table to produce the results. It runs automatically every hour, gathering data on a permanent file for later statistical analysis. The information about CPU utilization and channel activity is produced in the form of clear diagrams.

PPSTAT /INR, H.Wojciechowicz/ consists of measurement code included into CPMTR which counts the number of calls of each PP program. This information is analyzed later by PPSTAT and the frequency of calls and the PP program residency are printed. It makes it possible to define the proper residency of the PP programs.

REPORT /INR, J.Dzieciaszek/ scans the dayfile, selects the desired information and immediately prints out a short report of the computer workload status. This report is divided into three parts:
a/ Important events in system activity /time and type of deadstart and interactive subsystem work/
b/ Status of user and system jobs /time of execution, central processor and channels, time of waiting in the input queue, central memory used/
c/ Diagram of central processor idle time /sent to the dayfile every five minutes/.

As one can see from the above examples, using an appropriate measurement tool it is possible to get very detailed information about the performance of the SCOPE 3.4 operating system. In fact this is not the case with KRONOS /NOS/, for which there are practically no software monitors. Therefore in the next section I am going to describe a hardware monitor for KRONOS measurements.

4.3. Hardware Monitor of the KRONOS System

A hardware monitor was used for the study of a CDC-6600 computer running at the Federal Computer Performance Evaluation and Simulation Center by D.S. Lindsay. The computer is equipped with 500 K of Extended Core Storage /ECS/, twenty Peripheral Processor Units and twenty-four I/O channels /see Figure 32/. The transfer between the 131K Central Memory and ECS is possible at a rate of ten words per microsecond. Four I/O channels are used to drive twenty 844 disk drives. Two other channels are used for eight 659 tape drives and ten 657 tape drives.

The measurement tool was the COMRESS DYNAPROBE 7900 hardware monitor, whose sensors were attached to points on the back panels of the central memory, central processor and channels. With the help of differential amplifiers, the sensors are able to detect the fluctuations which correspond to changes in the status of computer components, such as busy/not busy. DYNAPROBE can collect data with sixteen counters capable of counting 10 pulses per microsecond, and with twelve bits used to sample twelve signals at a selected rate. The sample interval was chosen to be 10 seconds for this measurement. Also at intervals of 10 seconds, the sixteen counters and twelve bits are written to a 9-track self-contained magnetic tape. The data tape is then evaluated by a software reduction program called DYNAPAR, which reads the data recorded on the DYNAPROBE tape and produces analysis reports. The analyser provides 16 pseudo counters for arithmetic operations on the binary counters. The analyser also provides the ability to define combinations of the 12 bits as additional counters.

From the description of the KRONOS scheduler it is obvious that many interactive and batch users can be rolled out very often from CM. The rollout can be made to ECS or, in the case of overflow, to disk. Since DYNAPROBE is not a hybrid monitor, it was necessary to make changes in KRONOS to measure ECS and disk rollouts. Code was added to the PPU program 1RO, the rollout processor, to activate an otherwise unused I/O channel during ECS rollout, so that the hardware monitor could detect such rollouts. The same technique was used on another normally unused channel to report disk rollouts.

Using DYNAPROBE the following measurements were performed:
1. CPU utilization
2. Activity on the four disk and two tape channels
3. Number of ECS and disk rollouts
4. Number of Exchange-Jumps
5. Utilization of the Central Memory hopper, through which all accesses to CM are performed.
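The data reduction performed by DYNAPAR can be illustrated with a short sketch; the record layout below is an assumption, only the 10 second sampling interval comes from the text.

    # A minimal sketch of the kind of reduction DYNAPAR is described as doing:
    # turning counter snapshots taken every 10 seconds into busy percentages;
    # the record layout and signal names are hypothetical.
    SAMPLE_INTERVAL = 10.0            # seconds between tape records

    def busy_percent(busy_seconds_counter):
        """Counter value interpreted as seconds busy within one interval."""
        return 100.0 * busy_seconds_counter / SAMPLE_INTERVAL

    def reduce_tape(records):
        """records: list of dicts {signal_name: seconds busy in the interval}."""
        report = {}
        for rec in records:
            for signal, busy in rec.items():
                report.setdefault(signal, []).append(busy_percent(busy))
        # average utilization of each monitored point over the whole run
        return {sig: sum(vals) / len(vals) for sig, vals in report.items()}

    if __name__ == "__main__":
        tape = [{"cpu": 8.1, "disk channel 1": 3.9},
                {"cpu": 7.4, "disk channel 1": 4.6}]
        for signal, pct in reduce_tape(tape).items():
            print("%-16s %5.1f %%" % (signal, pct))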
Another piece of equipment, namely the Rand Monitor/Stimulus Generator /RMS/, was used to measure CD 6600 response times to time-sharing terminal interactions. At predetermined times the RMS issued a teletype message to a user program executing as a time-sharing job on the CD 6600. The program was designed and written to consume specified amounts of CPU time and to perform specified numbers of disk I/O operations. Thus the RMS was able to time large numbers of interactions requiring varying amounts of system resources.

4.4. Hybrid Monitor for the CD 7600

In 1972, with the purchase of the CD 7600, CERN requested a nonstandard First Level Instrumentation Peripheral Processor /FLIPP/. FLIPP was designed in close collaboration between CERN and CDC in order to provide performance monitoring in the production environment. FLIPP uses a normal CD 7600 PPU with a 4K core memory. One I/O Multiplexer channel is connected to the CPU for ordinary I/O operations, one specialized channel is also connected to the CPU, and furthermore there is one channel to each of the other PPU's /see Figure 33/. As an absolute minimum a system monitor must of course have access to all system tables. Therefore, due to the restrictions on SCM accesses for a normal 7600 PPU, a special channel was connected to FLIPP, giving it direct read/write access to all of SCM.

The access to the operating system tables in LCM is solved in two steps. Using the SCM write feature, FLIPP puts a request for LCM resident tables into a fixed area in SCM. It then sends an interrupt to the CPU. The interrupt is taken care of by a special FLIPP interrupt handler, which in this case stores the requested LCM tables into another fixed area of SCM and then notifies FLIPP that the interrupt has been honoured. FLIPP is then able to obtain the information using its SCM read capability.

Continuously interrupting the CPU in order to scan a system table in LCM may affect the system to an appreciable degree. However, in most cases there is no real need to scan the tables at such a high rate. Also, the interrupts sent from FLIPP have the lowest priority of all CPU interrupts.

The System Status Interface consists of special hardware which is loaded when some triggering condition arises. The information is held in one of three 60-bit disassembly registers /ranks A-C/. These 60 bits are broken down into 12-bit bytes as shown in Figure 34. The information is gated into the rank C register of the monitor hardware by some triggering condition. From here it passes, one clock period later, to rank B and then to rank A as these become free. From rank A it can be read by FLIPP. If a trigger occurs when rank C contains data, the new data will be lost. These ranks are used to buffer data which can come at a peak rate faster than a PPU channel can handle.

The triggering conditions which can cause FLIPP to have access to the system status interface data are:
1. FLIPP generates a trigger itself in order to sample the system status.
2. A manual switch on the console is set to cause a stream of triggers at some fixed clock rate.
3. An external signal occurs at some part of the machine which has been selected as a source of triggers. The signal is detected via a hardware probe.
4. A keypoint code is sent from a selected PPU.
5. A SCM-LCM block transfer is starting or stopping.
6. A keypoint code is received from the CPU.
7. A CPU exchange jump takes place.

Each PPU is connected to the system status interface via an ordinary channel. Using a normal output instruction the PPU may then send a keypoint code bit pattern /6 bits/ indicating what it is doing. For the CPU, a modified form of the no-operation instruction is used. When executed, the three bit operand field is sent to the system interface. This makes possible only 8 different keypoint codes, which of course is far from sufficient for all the CPU code of the operating system. However, with the additional job identification code it is possible to distinguish the task which executed a no-operation instruction.
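The 60 bit disassembly words described above can be split into 12-bit bytes as in the following sketch; the particular field meanings of Figure 34 are not reproduced, only the packing is illustrated.

    # A minimal sketch of reading one 60 bit disassembly register word and
    # splitting it into five 12 bit bytes; field assignments are not those
    # of Figure 34, only the 60/12 bit packing is shown.
    def unpack_rank_word(word60):
        """Return the five 12 bit bytes of a 60 bit word, high byte first."""
        assert 0 <= word60 < 1 << 60
        return [(word60 >> shift) & 0o7777 for shift in (48, 36, 24, 12, 0)]

    def ppu_keypoint(byte12):
        # A PPU keypoint code is a 6 bit pattern sent with an output
        # instruction; here we simply take the low 6 bits of a byte.
        return byte12 & 0o77

    if __name__ == "__main__":
        sample = 0o1234_5670_1234_5670_1234      # one rank A word, in octal
        bytes12 = unpack_rank_word(sample)
        print("12 bit bytes:", [oct(b) for b in bytes12])
        print("keypoint code:", oct(ppu_keypoint(bytes12[-1])))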
The peripheral equipment connected to FLIPP consists of a console, a display and a disk drive. To be able to control and utilize this equipment, as well as the measurement hardware, an operating system /called FLOSS/ was developed. The resident part of FLOSS contains the keyboard, display and disk drivers. All other code, and in particular the measurement routines, is loaded upon demand from the disk. The console is used for initiating and controlling the measurement runs. Their progress and the results obtained may be presented on the display. The disk is used both for storing measurement data and for keeping the non-resident FLIPP software. The recorded data may then be transmitted over the file transfer link to an analysis program in the CPU. For long term storage the data may be routed to the 6000 front-ends and then written onto magnetic tape.

5. Operating System Tuning

In general, tuning a system is the process of modifying its hardware or software characteristics to bring performance closer to installation defined objectives. Usually the operating systems contain a lot of parameters which can influence system performance. The manufacturers supply default values for the parameters; however, it is impossible to provide an optimal parameter set for a wide range of computer models and various possible workloads. Apart from the parameter choice, the installations usually introduce many modifications to the operating systems, which can influence the performance to a large extent. In this chapter I am going to discuss the impact of various measurement techniques on the operating system tuning and also its future development.

5.1. Tuning of VS2

The measurement facility MF/1 can be used to determine whether the VS2 Installation Performance Specification /IPS/ reflects the installation's turnaround/response requirements. Suppose that the installation's jobs are divided into three performance groups:

    Job Class              Performance Objective
    Terminal Jobs                    2
    High Priority Batch              8
    Low Priority Batch               5

The above performance objectives are presented in Figure 35.

The installation performs a certain number of MF/1 half-hour measurements in order to find the reason for the poor response time at its terminals. Two figures of interest in this case are the average system workload level and the average response time for terminal transactions. Suppose that these reports indicate that, during the time of MF/1 monitoring, the system was operating approximately at workload level 4 and the average response time for time sharing users was 30 seconds /considered by the installation to be a poor response time/.

An analysis of the performance objective specification shows that, at workload level 4, performance objective 2 specified that terminal transactions receive 40 service units per unit time. The installation might reasonably wish to increase the service rate for its time sharing users in order to improve the response time. This increase would raise the performance objective curve for performance objective 2, as shown in Figure 35.

Figure 36 shows that raising the service rate for performance objective 2 raises the system workload level, all other factors being equal. Transactions associated with performance objectives 5 and 8 would therefore be affected. Transactions associated with performance objective 5 would be most severely affected, because their demand curve slopes most severely to the right of workload level 4.

Before the installation modifies performance objective 2, it is able to perform further analysis to determine the quantitative effects of such a change. Suppose the MF/1 reports showed an average of 20 jobs associated with performance objective 2, 8 jobs with performance objective 8, and 4 jobs with performance objective 5. Then some measure of the average system rate of service supplied to all jobs can be obtained by multiplying the number of jobs at each performance objective by the service rate for that performance objective at workload level 4 /i.e., 20 x 40 + 8 x 20 + 4 x 10 = 800 + 160 + 40 = 1000 total service units per unit time/.

In this case, increasing performance objective 2 by 10 service units results in a total increase in demand of 200 service units. Thus the workload level of the system shifts to the right to compensate for this increase in demand, while maintaining the system supply of 1000 total service units approximately constant /see Figure 36/. If the installation considers the service rate provided for all performance objectives at the projected new workload to be acceptable, it may modify the IPS to contain the new performance objective 2. The installation may then use MF/1 monitoring to verify the accuracy of the projections, repeating the entire procedure, if desired, until satisfactory results are achieved.
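The service rate computation above can be checked with a few lines of code, using the job counts and the workload level 4 rates quoted in the text.

    # A worked check of the service rate computation described above, using
    # the job counts and workload level 4 service rates given in the text.
    jobs_per_objective = {2: 20, 8: 8, 5: 4}          # average number of jobs
    service_rate_at_level_4 = {2: 40, 8: 20, 5: 10}   # service units per unit time

    total_supply = sum(jobs_per_objective[obj] * service_rate_at_level_4[obj]
                       for obj in jobs_per_objective)
    print("total service units per unit time:", total_supply)    # 1000

    # Raising performance objective 2 by 10 service units adds
    # 20 jobs * 10 units = 200 service units of demand; the workload level
    # shifts right while the supply of about 1000 units stays constant.
    extra_demand = jobs_per_objective[2] * 10
    print("additional demand:", extra_demand)                     # 200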
5.2. Tuning of SCOPE 3.4

During the last five years some important modifications have been developed at INR in order to improve SCOPE 3.4 performance.

The first modification was related to the system disk bottleneck, which was the main limiting factor for the early version of the INR configuration. At this time a CD 841 disk with single access was used as the system disk.

The author of this lecture made an observation that it would be possible to reduce the access time on the system disk if one could place the Frequently Used Routines /FUR/ close together in order to limit disk head movements. The measurements indicated 33 FUR in the peripheral program library. These routines, with a total size of 10 K words /octal/, were loaded 5.3 times per second, which corresponds to 90 per cent of all peripheral program loads. Since the size of an 841 disk cylinder is 17920 words /decimal/, it was possible to place the FUR on one cylinder /see Figure 37/.

On the SCOPE 3.4 deadstart tape the bodies of the FUR were placed after the other PP-programs. Since the FUR should start from the beginning of a new cylinder, it is usually necessary to cover the rest of the preceding cylinder with empty PP-routines. During preloading the PP-program bodies are placed on the system disk. The normal loading program IRCP is unable to load the bodies in such an order. The modified IRCP reads the program name table /PPNT/ in alphabetical order and rearranges it, placing all FUR entries at the end of the PPNT. Now the PP-program bodies match the PPNT entries and the disk addresses are properly written to each entry. After loading of the PP-program bodies, the alphabetical order of the PPNT entries is restored. With the version SCOPE 3.4 - FUR, the throughput was increased by about ten per cent.

The second modification was a new scheduling for the input and output queues. According to the needs of the majority of users, the criterion of top priority in the input queue for the highest paid job was changed to top priority for the job requiring the least computer resources. In such a way an essential decrease of the wait time for small jobs has been achieved.

The new algorithm takes into consideration the job's magnitude J, defined as a function of the required system resources. The following table shows the possible values of J:

                              MCM
    TCP          54000       72000       72000
     40            6           5           4
    100            5           4           3
    300            4           3           2
    600            3           2           1
    2600           2           1           0

MCM denotes the maximum CM size /octal/, TCP denotes the CPU time in seconds /octal/.

The new expression for the input queue priority may be written as follows:

    P_I = J · 1000B + P · 100B + A_I · t                          /1/

where
    P   - financial priority ranging from 0 to 6
    A_I - aging rate in the input queue
    t   - job time in the input queue.

Additionally, to obtain a significant improvement in the throughput of small jobs, the latter are treated like express jobs and may enter execution independently of large jobs. It enables us to avoid the undesirable situation that small jobs, even with high priorities, are completely blocked in the input queue because all JDT entries for the batch class are occupied by large ones.

Another algorithm has been worked out for the output queue priority:

    P_O = D · (1 - S_O/B) / (1 + S_O/C) + 5 + A_O · t             /2/

where
    S_O     - output file size /PRU's/
    A_O     - output queue aging rate
    D, B, C - constants.

As you can see from the above formula, a hyperbolic function was used in order to obtain a very sharp dependence on file size. The constant B describes the maximum size of an output file which can be processed automatically. Files of larger size are processed by the operator /S_O >= 10000B/. Normally the other constants are set to C = 400B and D = 7000B. The above algorithms are very convenient for the user; they also lead to a substantial reduction of the input and output queue lengths.
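The two priority expressions, in the form reconstructed above as formulas /1/ and /2/, can be sketched as follows; the constants are the octal values quoted in the text /B is taken here as 10000B, the operator threshold/, and everything else is illustrative rather than the actual SCOPE code.

    # A minimal sketch of the queue priority expressions as reconstructed in
    # formulas /1/ and /2/; B-suffixed constants are octal values.
    def input_queue_priority(J, P, aging_rate, t):
        """P_I = J*1000B + P*100B + A_I*t."""
        return J * 0o1000 + P * 0o100 + aging_rate * t

    def output_queue_priority(file_size_prus, aging_rate, t,
                              B=0o10000, C=0o400, D=0o7000):
        """P_O = D*(1 - S_O/B)/(1 + S_O/C) + 5 + A_O*t."""
        S = file_size_prus
        return D * (1 - S / B) / (1 + S / C) + 5 + aging_rate * t

    if __name__ == "__main__":
        # a small job /J = 0/ with top financial priority versus a large one
        print(input_queue_priority(J=0, P=6, aging_rate=2, t=30))
        print(input_queue_priority(J=6, P=6, aging_rate=2, t=30))
        # output priority falls off sharply with file size
        for size in (0o100, 0o1000, 0o10000):
            print(oct(size), round(output_queue_priority(size, aging_rate=1, t=10), 1))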

The third modification was the automatic change of the INTERCOM peripheral program residency. The performance measurements made at INR led us to the conclusion that the residency of some PP programs should be dynamically changed, according to the current system load, job mixture and other external circumstances. Finally, a special program NRI was written /D.Makosa/, which is automatically called by DSD when the operator starts INTERCOM work. NRI creates a CPU job which moves the proper PP programs into the central memory. After the INTERCOM drop, NRI moves all these programs back to disk memory. Next, some PP programs useful in batch processing are placed in the released area in the central memory. The rest of this area /about 10 K words/ is left for user programs /see Figure 38/.

The fourth modification was related to the I/O scheduling policy of SCOPE 3.4 level 373. As one can see in Figure 4, the INR configuration consists of a single access 844 disk subsystem and a double access 841 disk subsystem. The CIA measurements have shown that the usage of the 844 disks was low compared with the 841 disk usage. For example, experiment K2 for the 844 gave the following result:

    Number of requests in stack - i     Probability - P_i
                  0                          0.603
                  1                          0.263
                  2                          0.098
                  3                          0.029
                  4                          0.006
                  5                          0.001

From the above data we can calculate the utilization factor:

    ρ = 1 - P_0 = 0.397                                           /3/

Similarly, the average queue length is:

    L = Σ_{i≥2} (i - 1) · P_i = 0.179                             /4/

For the dual access 841, the K experiment gives the following probability distribution:

    Number of requests in stack - i     Probability - P_i
                  0                          0.293
                  1                          0.218
                  2                          0.164
                  3                          0.126
                  4                          0.094
                  5                          0.061
                  6                          0.031
                  7                          0.010
                  8                          0.002

We can consider the dual access 841 disk as a multiserver consisting of two servers /M = 2/. In this case the utilization factor is given by the formula:

    ρ = M · (1 - P_0 - P_1) + P_1 = 1.196                         /5/

The queue length is given by the formula:

    L = Σ_{i≥3} (i - 2) · P_i = 0.681                             /6/

As we see from formulas /3/ and /5/, the utilization factor per server is 50 per cent higher for the 841 disks. Similarly, from formulas /4/ and /6/ it follows that the queue is 4 times longer for the 841 disk than for the 844 disk.
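Formulas /3/ to /6/ can be verified against the measured distributions quoted above with a short calculation.

    # A small check of formulas /3/ - /6/ using the measured distributions.
    p_844 = {0: 0.603, 1: 0.263, 2: 0.098, 3: 0.029, 4: 0.006, 5: 0.001}
    p_841 = {0: 0.293, 1: 0.218, 2: 0.164, 3: 0.126, 4: 0.094,
             5: 0.061, 6: 0.031, 7: 0.010, 8: 0.002}

    def single_server(p):
        rho = 1 - p[0]                                               # formula /3/
        queue = sum((i - 1) * pi for i, pi in p.items() if i >= 2)   # formula /4/
        return rho, queue

    def two_servers(p, M=2):
        rho = M * (1 - p[0] - p[1]) + p[1]                           # formula /5/
        queue = sum((i - 2) * pi for i, pi in p.items() if i >= 3)   # formula /6/
        return rho, queue

    print("844:", single_server(p_844))   # about (0.397, 0.18)
    print("841:", two_servers(p_841))     # about (1.196, 0.68)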

The measurements of the channel activity gave the following results:

    Channel   Equipment                  Activity
       1      First Access 841 Disk        0.269
       2      Second Access 841 Disk       0.123
       3      844 Disk                     0.286

At first glance it looks as if the 844 disk controller is more heavily used than the 841 controllers. However, this fact has no importance, since the dual access on the 841 disk is used almost exclusively for one disk drive containing the local files /PUBLIC/. Therefore we can say that in fact we should add the activities on both accesses of the 841 disks and compare the sum with the activity of the 844 disks. The sum is 37 per cent higher than the activity of the 844 disks.

From the above measurements it was obvious that more work should be put on the faster 844 disks. The investigations, performed by the author and D.Makosa, were concentrated on the mechanism of file opening. In order to find a record block for a new or overflowing local file, the peripheral program 3DO searches for the disk equipment with the lowest ACTIVITY. From that equipment 3DO takes the unit with the highest number of free record blocks. The ACTIVITY is stored for each disk equipment in a byte of the Device Activity Table and it is calculated every second by 1RN from the formula:

    ACTIVITY = (OLDACTIVITY + NEWACTIVITY) / 2                    /7/

    NEWACTIVITY = [3 + (S + SPEED) · COUNT] / D

where for the 841 disk S = 0 /no system/, D = 2 /dual access/ and for the 844 disk S = 1 /system/, D = 1 /single access/. For the standard SCOPE 3.4.1 system, 1RN contains SPEED = 8 for the 841 disk and SPEED = 4 for the 844 disk.

Since the 844 disk contains the operating system, the value of COUNT is usually high. Therefore local files were opened almost entirely on the 841 PUBLIC unit. In order to increase the ACTIVITY of the 841 disk, 1RN was changed so that SPEED = 20 for the 841 disk and SPEED = 2 for the 844.

As a result of the 1RN changes, the utilization factor and the queue length of the 844 disk increased from 0.397 to 0.476 and from 0.178 to 0.345 respectively. The utilization and the queue length of the 841 disk decreased dramatically, from 1.196 to 0.660 and from 0.685 to 0.085 respectively. A substantial increase of the central processor utilization by users was observed, from 0.647 to 0.795, which roughly corresponds to a 23 per cent increase of the throughput.

5.3. Tuning of KRONOS

The hardware monitor presented in Section 4.3 was used to specify requirements for KRONOS tuning. The measurements, performed by D.S. Lindsay, were taken during approximately three weeks on the first shift.

The average CPU utilization was 13.4 % for the Supervisor and 52.1 % for user programs. In comparison with SCOPE, KRONOS uses much more CPU time in the supervisor state, because PPMTR executes only a limited number of functions. The average utilization of the disk channels ranged from 30.5 % to 38.2 %, so the disk channels are remarkably evenly balanced in utilization.

The high I/O-CPU overlap of 54.3 % can be understood if we assume that I/O and CPU processing are statistically independent events. Then the probability of simultaneous I/O and CPU activity should be just the product of the separate probabilities of I/O processing /89.6 %/ and CPU processing /61.1 %/. The product is 54.7 % and it is almost equal to the measured overlap.

The measurements show that approximately 40 % of real time is spent rolling jobs in and out of CM. Moreover, the disk rollout consumes about half as much time as the ECS rollout, despite the fact that the system only rolls a job to disk if ECS will not hold the job. The first conclusion is to implement more ECS in order to eliminate slow disk rollouts. The second conclusion is to apply the direct CM/ECS transfer, instead of using the 10 times slower distributive data path.

The measurements of Central Memory Lockout have shown this effect to be only 1.68 %. However, the number of Exchange Jumps is incredibly high, namely about 2800 per second. This may cause a substantial overhead, because each Exchange-Jump consumes some overhead in monitor time, perhaps in the range 10 - 100 μsec.
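The independence argument above is easy to check: if I/O and CPU processing were statistically independent, the expected overlap would be the product of the separate busy fractions.

    # A quick check of the independence argument, using the figures quoted
    # in the text.
    p_io, p_cpu = 0.896, 0.611        # measured I/O and CPU busy fractions
    measured_overlap = 0.543
    expected_overlap = p_io * p_cpu
    print("expected overlap: %.3f" % expected_overlap)   # about 0.547
    print("measured overlap: %.3f" % measured_overlap)
    print("difference:       %.3f" % abs(expected_overlap - measured_overlap))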

5.4. Tuning of the CD 7600

The hybrid performance monitor FLIPP was used for tuning SCOPE 2 operating at CERN. The measurements were performed by O. Martin and A. Tengvald. The normal workload of the CD 7600 consists of the following job mix:

    Job Class    % Jobs    % Total CP Time
    EXPRESS       50 %
    SHORT         30 %
    MEDIUM        16 %           3 %
    LONG           4 %          52 %

    Job Class    Mean CP Time Used    Mean Turnaround Time
                 /seconds/            /minutes/
    EXPRESS            2.5                  3
    SHORT               12                15 - 30
    MEDIUM              50              not applicable
    LONG               370              not applicable

The CPU utilization was 84 %, of which the Supervisor took 11 % and the rest was consumed by user jobs.

FLIPP is mainly used for the SPOOK display, the System Information File /SIF/ and the SPY monitor. The SPOOK display shows the job statistics, disk space and queues, the CPU utilization by various subsystems and the job status. SIF is used as input by most of the performance analysis programs and the following two records are extracted:
1. The Job Termination Record provides the wait time in the input queue, the time spent in execution, the number of tapes staged, the job class and the job history.
2. The System Activity Record provides information on channel and disk activity /response time, data rates/ as well as on the main contributors to the Input/Output load.
SPY helps users to monitor the CPU time used by various parts of a job, and therefore it makes possible a substantial reduction of CPU bottlenecks.

Significant drops in CPU utilization have been observed during peak hours. With the help of the SPOOK display it was easy to diagnose the problem, which was in fact twofold:
a. The I/O load was very badly distributed across the disk channels, and this was the result of the very bad disk allocation algorithm used by SCOPE 2.0, where the least full disk was always selected, and of the high activity on the device holding the NUCLEUS library and the permanent file directory /PFD/.
b. The 817 disks were nearly saturated because they had to perform too many arm movements.

Assuming it takes 100 msec to handle every disk request, and ignoring the fact that every pair of 817 disks shares the same two channels, one can see that the rate of 300 disk requests per minute /50 % device utilization/ is about the threshold beyond which device contention will develop. Therefore it was decided to cut down the I/O load and to achieve a better balancing of the I/O load. The improvements in these directions led to a 20 % increase in CPU utilization, a 40 % decrease in disk activity and a 50 % increase in maximum job throughput.

In particular it was necessary to change the default allocation and transfer size so that more words would be transferred in fewer disk accesses. The A1/T1 combination was chosen /i.e. 10 x 512 words on an 817 disk or 14 x 512 words on an 844/, because the average access time per I/O request would be about the same whether the file was on the 844 or 817 disks; the cost of transferring 5 more blocks is only 11 ms. This change was first made for File Router I/O requests under SCOPE 2.0 and the effect was very visible. With a 6000-7600 link running at 300 KC/second the corresponding load on the disks is approximately 700 I/O requests per minute if A0/T0 is used and only 350 I/O requests with A1/T1.

The next logical step was then to use the same concept for the NUCLEUS library.
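The contention threshold reasoning above can be checked with a short calculation, using the 100 msec service time and the request rates quoted in the text.

    # A quick check of the device contention threshold reasoning above.
    SERVICE_TIME_MS = 100.0          # assumed handling time per disk request

    def device_utilization(requests_per_minute):
        return 100.0 * requests_per_minute * SERVICE_TIME_MS / 60000.0

    print("300 req/min ->", device_utilization(300), "% busy")   # about 50 %
    # File Router traffic over the 6000-7600 link at 300 KC/second:
    # about 700 requests per minute with A0/T0, about 350 with A1/T1.
    for label, rate in (("A0/T0", 700), ("A1/T1", 350)):
        print(label, "->", device_utilization(rate), "% busy")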

The advantages were not as great as one expected, due to the fact that the loader does not know the length of the NUCLEUS routines but only the displacement within the first allocation unit where the routine resides. The solution to the above problem was to write a utility program to ensure that the most frequently used NUCLEUS library routines are positioned such that they end on an allocation unit boundary. The NUCLEUS library is now kept on an 817 disk unit in A3 style /i.e. 40 x 512 words/.

In order to obtain better I/O balancing, the PFD was placed on an 844 disk for speed /twice as fast as an 817 disk for a single block transfer/.

Moreover, with the change from SCOPE 2.0 to SCOPE 2.1.2 the disk allocation algorithm has changed. This new algorithm is basically the following:
1. Select a subset of devices eligible for allocation by using a space filter.
2. Select a device from the eligible device list in a round-robin fashion.
Therefore, if the space filter is very small the least full device will always be selected, as under SCOPE 2.0, but if the space filter is very big, all devices are eligible for allocation and are selected in a round robin way. This scheme does not take into account device characteristics and channel configuration and therefore can easily lead to an overutilization of the 844 channel. This scheme has been enhanced by simply adding two more levels of filtering, so that the procedure is as follows:
1. Apply the space filter.
2. Select the devices with the smallest current queue size.
3. Select the least active devices, where device activity is the number of I/O requests submitted divided by a weight.
4. Select the final device in a round robin way.
At CERN a big space filter was used, roughly equivalent to the capacity of an 844 disk, and this new algorithm has helped in balancing the I/O load over seven disk channels.

Finally, a study of the frequency of usage of Job Supervisor overlays allowed approximately 40 K of LCM to be saved by using disk resident groups of overlays. It was then decided to implement the capability to keep the overlays of the FTN compiler in LCM, in order to avoid the overhead of loading them from the device holding the NUCLEUS library for every subroutine to be compiled /over 1000 subroutines compiled per hour on average during the day/. The results of this change have been very satisfactory.

Acknowledgements

I am greatly indebted to Mr. Czeslaw Nowicki and to Mr. Andrzej Plewicki from International Business Machines Corporation for the valuable material concerning operating systems. I wish to express my appreciation to my colleagues Danuta Makosa, Jerzy Dzieciaszek and Henryk Wojciechowicz for many important contributions to the lecture. I am also very grateful to Miss B. Trenel and Mr. O. Martin from CERN for the documentation about FLIPP. Finally, I would like to thank Miss Ewa Piwek, who patiently typed the manuscript.

Literature

1. "IBM System/370 Summary", IBM /1970/
2. "IBM System/370 Model 155 Functional Characteristics", IBM /1970/
3. "MFT Guide", IBM /1970/
4. "Introduction to OS/VS2 Release 2", IBM /1973/
5. "IBM Virtual Machine Facility/370: Introduction", IBM /1976/
6. "SCOPE System Programmer's Reference Manual", CDC /1971/
7. "KRONOS General Information Manual", CDC /1971/
8. "KRONOS 2.0 Operating Guide", CDC /1971/
9. White C.E., "Network Operating System Status", Proceedings of the ECODU-XVIII, London /1974/
10. Skagestein G., "Comparison of NOS and Master", Proceedings of the ECODU-XXIV, Montreux /1977/

11. Hellerman H., Conroy T.F., "Computer System Performance", McGraw-Hill Book Company /1975/
12. Svobodova L., "Computer Performance Measurement and Evaluation Methods: Analysis and Applications", Elsevier /1976/
13. "OS/VS2 Planning Guide for Release 2", IBM /1973/
14. Bednarz R., "A New Version of CIA", Proceedings of the ECODU-XVIII, London /1974/
15. Lindsay D.S., "A Hardware Monitor Study of a CDC KRONOS System", International Symposium on Computer Modeling, Measurement and Evaluation, Harvard University /1976/
16. Dzieciaszek J., Gluski K., Makosa D., Wojciechowicz H., "NRI - CYFRONET Way of the SCOPE Operating System Development", Proceedings of the ECODU-XXV, Liege /1978/
17. Bednarz R., "SCOPE 3.4 - FUR a New Loading Procedure", Proceedings of the ECODU-XXI, Geilo /1976/
18. Bednarz R., "On the Interpretation of the Performance Measurements", Proceedings of the ECODU-XXIII, Toulouse /1977/

Fig. 1  Configuration of IBM 370/168 at CERN
Fig. 2  Dynamic Address Translation Procedure
Fig. 3  Page-in Process
Fig. 4  Configuration of CYBER-73 at Institute of Nuclear Research
Fig. 5  Configuration of CD 7600 at CERN
Fig. 6  MFT Main Storage Organization
Fig. 7  Job Processing in MFT
Fig. 8  Job Class and Priority Scheduling in MFT
Fig. 9  Parallel Task Processing by MVT
Fig. 10 MVT Main Storage Organization
Fig. 11 Virtual Storage Overviews
Fig. 12 Virtual Storage Layout /Release 2/
Fig. 13 VS2 Release 2 Control Program Overview
Fig. 14 Performance Objectives
Fig. 15 Associating User With a Performance Objective
Fig. 16 Multiple Virtual Machine
Fig. 17 SCOPE Central Memory Layout
Fig. 18 SCOPE Monitor Request Processing
Fig. 19 I/O Processing in SCOPE
Fig. 20 Job Processing by SCOPE
Fig. 21 NOS Memory Layout
Fig. 22 Interactive Job Flow in NOS
Fig. 23 NOS Job Control Parameters

Batch jobs completed - 406
INTERCOM sessions completed - 83
EDITOR sessions completed - 39
Number of INTERCOM commands - 982
Interactions performed - 403
Maximal number of jobs active concurrently - 16
Maximal sessions active concurrently - 10
Maximal EDITOR users active concurrently - 5
The average number of interactions per command is equal to 1.2
The average response time /total response time/number of commands/ - 8 sec.
The average think time /total INTERCOM queue time/number of interactions/ - 32.9 sec.
The central processor time per command - 0.6 sec.
Approximately 900 commands were monitored
The multiple interactions type of commands - 7.9 %
The use of command classes is the following:
    File manipulation     - 15.7 %
    Batch dispositions    -  4.5 %
    Permanent file        - 10 %
    Compilers/Application -  8 %
    Load/Execute          -  7.5 %
    Information           - 50.1 %
    Miscellaneous         -  4.2 %

Fig. 24 WKLOAD Program Results

Fig. 25 SCOPE STIMULATOR
Fig. 27 Paging Activity of VS2
Fig. 28 Organization of CIA Measurements
Fig. 29 The Stack Request Distribution Measured by CIA Monitor
Fig. 30 Organization of GSS Measurements
Fig. 31 GSS Report on Control Point Occupation
Fig. 32 Hardware Monitor for KRONOS System
Fig. 33 FLIPP Configuration
Fig. 34 Disassembly Register Bit Assignments
Fig. 37 Concentration of Frequently Used Routines on One Cylinder