
Operating System of the VX/VPP300/VPP700 Series of Vector-Parallel Supercomputer Systems


UDC 681.3.06


Yuji Koeda (Manuscript received April 29, 1997)

The computers used in science and technology are generally either low-end, high cost-performance machines owned by individual research and development sections, or high-end ultra high-speed machines. This paper describes the features of UXP/V, the operating system of the VX/VPP300/VPP700 series of vector-parallel supercomputers, which were developed to flexibly cover the requirements of science and technology computing. It also looks at the following important topics regarding vector-parallel supercomputers:
- The method of allocating resources such as the CPU and memory
- The technique for batch processing
- The technology used to achieve high-speed I/O processing and network processing
This paper also describes a function for easy installation and administration and a strengthened operational management function for the computer center.

1. Introduction
The cost effectiveness of science and technology calculation environments has rapidly improved due to increases in the processing speeds of high-end EWSs and the falling prices of supercomputers and minicomputers. These changes have encouraged more research departments and researchers to use science and technology calculation machines. The VX, VPP300, and VPP700 series vector-parallel supercomputers were developed to meet the demands for very high processing speeds to perform complicated, high-level calculations.
This paper looks at problems connected with the operating system of vector-parallel supercomputers, and explains the development of UXP/V, which is the operating system of the VX, VPP300, and VPP700 series.

2. Development history
The history of UXP/V began with UTS/M, which is a UNIXNote1) system that operates on a mainframe. UTS/M was based on UTS, which was developed between 1982 and 1985 by Amdahl Corporation in the USA as a UNIX system on Amdahl's mainframe. At first, UTS/M was based on SVR2Note2) and was supplied as a business-oriented UNIX system for mainframes; then a UNIX system for supercomputers was requested. In 1990, VPONote3) was supplied as an additional function for vector processing so that the UNIX system could be used on a supercomputer.
In 1989, System V UNIX and BSD UNIX were unified into SVR4Note4). Taking this opportunity, we developed UXP/M by introducing the base of UTS/M into SVR4. We released UXP/M in 1991.

Note1) UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Limited.
Note2) SVR2 is an abbreviation of UNIX System V Release 2.
Note3) VPO is an abbreviation of Vector Processor Option.
Note4) SVR4 is an abbreviation of UNIX System V Release 4.

FUJITSU Sci. Tech. J., 33, 1, pp.15-23 (June 1997)

Then, calculation capabilities about 20 times better than that of a conventional single-processor or TCMP-typeNote5) vector computer were required in the aircraft and weather service fields. In response, in 1993, we developed and supplied the VPP500 series of distributed-memory-type vector-parallel supercomputers. This series has the characteristics of both a conventional vector computer and a new parallel processing machine. The operating system for this series consists of two parts: UXP/M, which supports the standard UNIX function in a front-end processor, and a parallel operating system in a back-end processor for batch processing only.
Then, we developed UXP/V for the VX, VPP300, and VPP700 series to provide flexible support of various applications in research and development departments and superior cost effectiveness. UXP/V is a standalone UNIX operating system with extended parallel and vector functions. It can be stored in each processor. We began sales of UXP/V in 1995.

3. Aim of development
The VX, VPP300, and VPP700 series are distributed-memory-type vector-parallel supercomputers appropriate for high-speed processing of large amounts of data in science and technology calculations. Because these systems can be extended scalably according to the application, they can be used under various conditions. That is, these systems can be used as departmental systems exclusively used in research and development departments or as center systems shared by various research and development departments. The operating system for the VX, VPP300, and VPP700 series must have a higher performance, superior operation management functions, and wider application fields; also, it must be easier to operate, install, and manage.
1) Open architecture
This operating system is based on SVR4, which is the standard UNIX operating system. A standard GUI and language environment, such as X11Note6), MotifNote7), Fortran90, ANSI C, and the Message Passing Library, are used. Moreover, the function for vector-parallel supercomputer processing is extended while maintaining an open architecture. This improves the connectivity and operation of the system as a calculation server under a distributed system environment.
2) Higher performance
Files can be accessed at high speed using the disk array unit, large-capacity high-speed file system, and software striping function.
This system can be connected to a LAN conforming to the ANSI standard HIPPI (800 megabits per second)Note8) or to ATM-LAN (155.2 megabits per second)Note9). Such a connection increases the network transmission speed, which is important in a distributed system environment. Moreover, the performance of application programs is enhanced by using the compiler, which has superior optimization techniques, and by using the tuning tools for vectorization and parallel processing creation.
3) Easier installation and management
The operating system and software products are provided on disk. The user can start using this product by performing the minimum environment setup at installation. This system can be started and stopped using the power ON/OFF button, which enables the system to be easily used as a departmental system. Online manuals are enhanced so that required information can be effectively retrieved from a workstation.
4) Superior center operation management function

Note5) TCMP is an abbreviation of Tightly Coupled Multi-Processing.
Note6) X11 is an abbreviation of X Window System V11. X Window System is a trademark of X Consortium, Inc.
Note7) Motif is a trademark of Open Software Foundation, Inc.
Note8) HIPPI is an abbreviation of High Performance Parallel Interface.
Note9) ATM is an abbreviation of Asynchronous Transfer Mode.


Interference between batch processing and interactive processing is prevented. The jobNote10) management function, which is the nucleus of center operation, is enhanced for flexible handling of various types of operating conditions. The center operation management functions such as the security functions (AUDIT and ACL) are enhanced. Also, the automatic operation function is enhanced (system start by power-on and system stop by power-off can be automatically performed at a specific time using the calendar function of the hardware).
5) Continuity of resources
For inheritance of user resources at system changeover, binary compatibility with the VPP500 vector-parallel supercomputers and source code compatibility with the VP and VPX series vector supercomputers are assured.

4. Characteristics of the operating system
4.1 Standalone parallel operating system
The VX, VPP300, and VPP700 series are supercomputers having two or more processing elements (PEs) connected through a high-speed crossbar network. Figure 1 shows the system configuration. Each PE consists of a scalar unit, vector unit, main memory, data transfer unit, and other units. Standard UNIX with the extended parallel processing function and extended vector processing function operates on all PEs. Therefore, every PE can perform vector processing, parallel processing, and interactive processing. A PE to which input/output devices such as disk drives, tape units, and network units can be connected is also called an IOPENote11).

Fig. 1— Standalone parallel operating system. (The crossbar network connects PEs and IOPEs (Input/Output PEs); each runs UXP/V (conforming to UNIX SVR4) with parallel processing, vector processing, and a high-speed communication function, and supports both NQS batch processing and interactive processing (compilation and execution). Disk drives and tape units attach to the IOPE, which also connects to the LAN.)

Note10) A job is a series of tasks to be executed for a user.
Note11) IOPE is an abbreviation of Input/Output PE.


If there are a large number of PEs, a single IOPE may not be able to handle all of their I/O requests, and the many accesses to system disks via the IOPE will lengthen the system start-up time. To solve these problems, two or more IOPEs are prepared to distribute the disk and network load. Also, system disks are connected to two or more IOPEs so that the PEs using the system disks are grouped. The start-up time for a large-scale system can be shortened by starting the system in parallel in group units.

4.2 Interactive processing and batch processing
Interactive processing and batch processing can be executed on each PE. Each PE can be handled as a host on the network and can accept a remote log-in from an arbitrary host. In batch processing, CPU distribution can be freely controlled without affecting interactive processing. UXP/V can select the appropriate PEs (PEs for interactive processing only or PEs for batch processing only can be selected). In a PE for batch processing only, a log-in from an external network is suppressed. In a PE for interactive processing only, execution of a batch job is suppressed. Memory and CPU resources are carefully allocated so that batch processing and interactive processing do not interfere with each other, even if batch and interactive processing are executed on the same PE. Memory is divided according to the purpose of use so that interactive processing does not conflict with batch processing. The distribution of CPU use between interactive processing and batch processing is set to assure a quick response in interactive processing.

4.3 Cooperation with workstations
Vector-parallel supercomputers should be used for calculations, and other operations suitable for workstations should be executed on workstations. For example, the development environment and GUI software products should be run on workstations as much as possible. To facilitate this, software products that must cooperate with workstations must be prepared. Some examples of these software products are the VPP Workbench (for supplying a program development environment with a GUI equivalent to that for workstations), NQSNote12) (for submitting jobs), and the OLIASNote13) browser (a viewer for online manuals).

4.4 Job management and job scheduling
1) Job management
UXP/V jobs are managed by the partition manager (PM). The PM allocates and releases resources to jobs and controls the priority. A job is entered as an NQS batch request. NQS requests the PM to allocate resources. A batch request to which the PM allocates resources is executed as a job. When the job execution terminates, the resources are released by the PM and the batch request terminates. Then, NQS reports the execution results to the person who entered the batch request.
2) Job scheduling mechanism
Jobs must be scheduled to use the parallel processing system effectively. The job scheduling mechanism of UXP/V is explained below.
i) PE allocation according to job class
The maximum processor count, execution mode (SIMPLEX or SHARED mode, explained later), PEs to be allocated, and other information are specified in a job class. The job class to be used is specified in the NQS queue. When the user selects an NQS queue and submits a job, the PM allocates PEs according to the job class setup information.
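As a rough sketch of this batch-request lifecycle — NQS submits the request, the PM allocates PEs, and resources are released when the job terminates — the following Python model may help. The `PartitionManager` class and its methods are hypothetical illustrations, not the actual UXP/V interfaces.

```python
# Hypothetical sketch of the NQS/PM batch-request lifecycle (not the real UXP/V API).

class PartitionManager:
    """Allocates PEs to batch requests and releases them on completion."""
    def __init__(self, num_pes):
        self.free_pes = set(range(1, num_pes + 1))

    def allocate(self, count):
        if len(self.free_pes) < count:
            return None                      # resources not yet available
        return {self.free_pes.pop() for _ in range(count)}

    def release(self, pes):
        self.free_pes |= pes

def run_batch_request(pm, name, pe_count):
    """NQS-side view: request resources, run the job, release, report."""
    pes = pm.allocate(pe_count)
    if pes is None:
        return f"{name}: queued (waiting for {pe_count} PEs)"
    # ... the job executes on the allocated PEs ...
    pm.release(pes)
    return f"{name}: completed on {len(pes)} PEs, results reported to submitter"

pm = PartitionManager(num_pes=4)
print(run_batch_request(pm, "job-A", 2))   # runs immediately
print(run_batch_request(pm, "job-B", 8))   # more PEs than exist: stays queued
```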

Note12) NQS is an abbreviation of Network Queuing System. NQS was ported from a program that was developed in response to a request from NASA.
Note13) OLIAS is an abbreviation of Online Information Access System.


Figure 2 shows an example of PE assignment according to the job class. In this example, NQS batch queue A is assigned to job class 0. The PM allocates 2-parallel job A1 to PE1 and PE2 according to the definition of job class 0. NQS batch queue D is used to submit a 4-parallel job. Using job class 3, 4-parallel job D1 is allocated to PE3 to PE6. Thus, PE allocation to submitted jobs can be freely controlled according to the job class.
ii) CPU distribution control
The CPU distribution ratio can be set for the scheduling class for interactive processing and the scheduling class for batch processing. Moreover, the CPU distribution value can be specified in batch job units. Units of CPU time called tickets are allocated to each job and then taken away as each job spends them on CPU time. A job that has spent all of its tickets loses execution priority and is not executed while there are jobs that still have tickets. The scheduler issues to each job at fixed intervals an additional ticket whose length is proportional to the CPU distribution value.
Figure 3 shows an example of the CPU distribution control. In this example, 20% of the CPUs are assigned to interactive processing and 80% are assigned to batch processing. For the jobs to be executed in batch processing, the CPU time is distributed according to the CPU distribution values defined in batch queues A and B. At execution, jobs a1, b1, and b2 are allocated 20%, 30%, and 30% of the CPU time, respectively. Even if interactive processing and batch processing are both executed or there are many jobs to be executed, the CPU time to be allocated to the jobs can be freely controlled by this control function.
iii) Memory priority
In resource allocation, jobs with a higher memory priority are the first jobs to acquire resources. If there are two or more jobs with the same memory priority, the job that first requested resource allocation is given priority.
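The ticket mechanism of ii) can be illustrated with a small simulation. This is only a sketch of proportional-share scheduling under assumed refill behavior (tickets reissued each period in proportion to the distribution values), not the actual UXP/V scheduler; the names and the 40/60 split are illustrative, mirroring batch queues A and B in Fig. 3.

```python
# Illustrative simulation of ticket-based CPU distribution (a sketch of
# proportional-share scheduling, not the actual UXP/V scheduler).
# Each job receives tickets in proportion to its CPU distribution value and
# spends one ticket per CPU slice; a job with no tickets left must wait.

def simulate(jobs, periods, slices_per_period=10):
    """jobs: {name: CPU distribution value}; returns CPU slices received per job."""
    usage = {name: 0 for name in jobs}
    total = sum(jobs.values())
    for _ in range(periods):
        # issue this period's tickets in proportion to the distribution values
        tickets = {n: v * slices_per_period / total for n, v in jobs.items()}
        for _ in range(slices_per_period):
            runnable = [n for n in jobs if tickets[n] > 0]
            if not runnable:
                break
            chosen = max(runnable, key=lambda n: tickets[n])
            tickets[chosen] -= 1          # spend one ticket on this CPU slice
            usage[chosen] += 1
    return usage

# Distribution values 40 and 60, as for batch queues A and B in Fig. 3:
print(simulate({"a1": 40, "b1": 60}, periods=100))  # -> {'a1': 400, 'b1': 600}
```

The 40:60 ticket ratio yields a 40:60 split of CPU slices, matching the proportions in the figure.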

Fig. 2— PE allocation according to job class. (NQS batch queues A to D are assigned to job classes 0 to 3; the partition manager (PM) places each submitted batch request on PEs according to the class definitions below.)

  Job class   PEs to be assigned   Maximum processor count   Mode
  0           PE 1, 2              2                         SIMPLEX
  1           PE 3, 4, 5           3                         SHARED
  2           PE 6                 1                         SHARED
  3           PE 3, 4, 5, 6        4                         SHARED
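The job-class rules of Fig. 2 can be modeled as a simple lookup that a partition manager might consult. The `JOB_CLASSES` mapping and `allocate_pes` helper below are hypothetical, written only to restate the figure's rules in executable form.

```python
# Illustrative model of PE allocation by job class (Fig. 2); not the real PM code.

JOB_CLASSES = {
    # class: (PEs that may be assigned, maximum processor count, execution mode)
    0: ([1, 2],       2, "SIMPLEX"),
    1: ([3, 4, 5],    3, "SHARED"),
    2: ([6],          1, "SHARED"),
    3: ([3, 4, 5, 6], 4, "SHARED"),
}

def allocate_pes(job_class, parallelism):
    """Return the PEs a job of the given parallelism would run on, and the mode."""
    pes, max_count, mode = JOB_CLASSES[job_class]
    if parallelism > max_count:
        raise ValueError(f"class {job_class} allows at most {max_count} PEs")
    return pes[:parallelism], mode

print(allocate_pes(0, 2))  # 2-parallel job A1 -> PE1 and PE2, SIMPLEX mode
print(allocate_pes(3, 4))  # 4-parallel job D1 -> PE3 to PE6, SHARED mode
```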


After jobs have been submitted from NQS, their memory priority can be changed as required. Therefore, a specific job can be executed earlier than the other jobs by changing its memory priority.
iv) SIMPLEX mode and SHARED mode
Jobs are executed in SIMPLEX mode or SHARED mode. SIMPLEX mode is used to execute jobs by exclusively using the PE. SHARED mode is used to execute jobs with a high throughput. While a job is being executed in SIMPLEX mode, other jobs are not executed on the PE but wait until the PE becomes idle. If a job is being executed, another job in SIMPLEX mode waits until the job being executed terminates. On the other hand, jobs in SHARED mode can share the same PE. If there is an idle part of a PE, the submitted jobs are sequentially allocated to the PE and executed.
v) Fixed priority
The absolute priority for job execution can be specified (this priority is called the fixed priority). To execute an urgent job on a PE that is currently executing another job, the PE's CPU can be allocated to the urgent job by increasing the fixed priority.
vi) Harmonized scheduling
Some parallel jobs can be executed as concurrently as possible on PEs by increasing the priority of these jobs only for a fixed time at specific intervals. Therefore, several parallel jobs can operate as if they are exclusively and periodically using the system (this control is called harmonized scheduling). This minimizes the overhead of synchronization between PEs. Figure 4 shows an example of CPU assignment using the harmonized scheduling function. In this example, the CPU assignment period is set to 100 milliseconds. The CPU of every PE is assigned to parallel job A for 40 milliseconds and then to parallel job B for 40 milliseconds. The CPU is assigned to other processes for the remaining 20 milliseconds and during the input/output operations for the two jobs.

Fig. 3— CPU distribution control. (Interactive processing is assigned 20% of the CPU; the batch scheduling class receives the remaining 80%, which is divided among jobs a1, b1, and b2 as 20%, 30%, and 30% according to the CPU distribution values 40 and 60 defined in batch queues A and B.)

3) Job swap function
The job swap function temporarily moves a currently executed job to a job swap area or returns the job to the original area. Using this function, an urgent job can be executed and operation can be switched smoothly.
i) Execution of urgent job by NQS queue selection
To execute an urgent job, an NQS batch queue with a high priority must be prepared. In the job execution, an urgent job can be executed earlier than other jobs by submitting the urgent job to the queue. While another job is being executed on a PE, a job swap automatically occurs because of job resource contention. Figure 5 shows an example in which an urgent job is executed by selecting an NQS queue. In this example, job 1 in the batch queue with memory priority 10 is being executed on PE1 to PE4. When an urgent job is submitted to the batch queue with memory priority 20, job 1, which has a lower memory priority, is swapped out to the swap device. Then, the urgent job is allocated to PE1 to PE4 and executed. When the execution of the urgent job terminates, job 1 in the swap device is swapped into PE1 to PE4 and its execution is resumed.
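The harmonized scheduling example in vi) — a 100 ms period split 40/40/20 across two parallel jobs and other processes on every PE — can be sketched as a simple timeline generator. This only illustrates the resulting CPU assignment, not the priority-boost mechanism UXP/V actually uses; names and numbers follow Fig. 4.

```python
# Illustrative timeline of harmonized scheduling (Fig. 4): within each 100 ms
# period, all PEs run parallel job A, then parallel job B, then other processes.

def harmonized_timeline(slots=(("A", 40), ("B", 40), ("others", 20)), periods=2):
    """Return (start_ms, end_ms, owner) triples covering each scheduling period."""
    timeline = []
    t = 0
    for _ in range(periods):
        for owner, length in slots:
            timeline.append((t, t + length, owner))
            t += length
    return timeline

for start, end, owner in harmonized_timeline():
    print(f"{start:4d}-{end:4d} ms: all PEs run {owner}")
```

Because every PE switches to the same job at the same time, the jobs' synchronization points fall within their own time slots, which is how harmonized scheduling keeps inter-PE synchronization overhead low.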


ii) Forced job swap-out
A job management command for forced swap-out of a currently executed job is provided. Using this command, the operation status can be changed according to the time. Because the resources to be used after operation switching can be reserved, the next job execution can be quickly started.
4) Job freeze function
If the system must be stopped for periodic maintenance when a job is still running, the job must be stopped temporarily and the frozen data stored in files. After the system is restarted, the job can be restarted using the frozen data.
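As an illustration of the freeze/thaw cycle, the sketch below serializes a job's state to a file and restores it after a "restart". Python's `pickle` is merely a stand-in here; the paper does not describe UXP/V's actual frozen-data format, and real job state would include far more than a dictionary.

```python
# Illustrative freeze/thaw of job state to a file (pickle is only a stand-in
# for UXP/V's frozen-data format, which this paper does not specify).
import os
import pickle
import tempfile

def freeze(job_state, path):
    with open(path, "wb") as f:
        pickle.dump(job_state, f)        # store the frozen data in a file

def thaw(path):
    with open(path, "rb") as f:
        return pickle.load(f)            # recover the state after system restart

state = {"job": "sim-run", "step": 1500, "partial_results": [0.1, 0.2]}
path = os.path.join(tempfile.mkdtemp(), "job.frozen")
freeze(state, path)
assert thaw(path) == state               # the job resumes from where it stopped
```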

Fig. 4— Harmonized scheduler. (In each 100 ms scheduling period, the CPU of every PE is assigned to parallel job A for 40 ms and to parallel job B for 40 ms; the remaining 20 ms and the jobs' I/O wait times are assigned to other processes.)

4.5 High-speed input/output processing
To process jobs at high speed, the input/output processing speed must be increased in addition to the speed of the vector operation and parallel processing features. The following explains the high-speed input/output functions developed for UXP/V.

Fig. 5— Execution of urgent job by NQS queue selection. (Job 1, submitted to batch queue A with memory priority 10, is executing on PE1 to PE4. When an urgent job is submitted to batch queue B with memory priority 20, a resource shortage is detected and job 1 is swapped out to the swap device; the urgent job then runs on PE1 to PE4. When the urgent job ends and outputs its results, job 1 is swapped back in and resumes.)


1) Memory resident file system (mrfs)
The mrfsNote14) is a file system constructed in the real memory of a PE. Because an access to a file on mrfs uses the vector function, data is transferred between memory areas at high speed. There are two types of mrfs: the shared mrfs (which can be shared between two or more jobs) and the job mrfs (which can be used only during job execution). A shared mrfs can be handled like an ordinary file system. Because the shared mrfs is a volatile file system, the files in mrfs are deleted when the file system is unmounted or the PE is shut down. The job mrfs is assigned to an area when a job starts. When the job terminates, the job mrfs is automatically released. Figure 6 shows an example use of mrfs.
The compiler, linker, library, and other resources to be commonly used between two or more jobs are arranged in the shared mrfs. These resources are used by the jobs that execute a series of operations such as compilation, linkage, and execution. The intermediate creation results (e.g., objects, load modules, and temporary files) of the job use the job mrfs. Thus, the jobs are executed at high speed without input/output accesses to the disks.

Fig. 6— Example of mrfs use. (During compilation, linkage, and execution of a job, the compiler, linker, and library reside in the shared mrfs, while the objects, load module, temporary files, and result-data output file use the job mrfs; source files are the input.)

2) Very Fast and Large File System (VFL-FS)
The VFL-FSNote15) is a file system for high-speed input/output of large amounts of data. In this file system, data blocks are allocated to contiguous areas on a medium. Compared with the standard UNIX file system (in which data blocks are allocated to discrete areas on a medium), the disk-head seek time and rotational delay are reduced. Moreover, because a large amount of data from a job is transferred directly to a medium and not through the system buffer, the overhead of copying data to the system buffer is avoided. Thus, VFL-FS has a high input/output performance for large amounts of data.
3) Disk array units
RAID disk array units can be connected. A RAID disk array unit contains two or more disk drives that perform input/output operations in parallel to increase the disk access speed and improve reliability.
4) LVCF (logical volume manager)
LVCFNote16) is used to construct logical volumes such as large-capacity volumes and stripe volumes by combining various real devices. A large-capacity volume is a continuous logical volume created by connecting two or more physically different real devices. By using a large-capacity volume, a large-capacity file system can be created and devices can be used effectively. A stripe volume is created by arranging two or more physically different real devices so that input/output operations on them proceed in parallel. The file access speed for these stripe volumes is increased by transferring data to these real devices simultaneously. When this stripe volume function is combined with the above disk array unit, parallel input/output operations are executed by both hardware and software, which increases the file access speed.
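The striping idea behind stripe volumes — consecutive fixed-size stripe units distributed round-robin across several real devices so that transfers can proceed in parallel — can be sketched minimally as follows, with in-memory byte buffers standing in for real devices and a deliberately tiny stripe unit. This is an illustration of the general technique, not LVCF's implementation.

```python
# Minimal sketch of software striping: a logical byte stream is split into
# fixed-size stripe units distributed round-robin over several "devices"
# (plain byte buffers here; a real system writes the devices in parallel).

STRIPE_UNIT = 4  # bytes per stripe unit (real systems use much larger units)

def stripe_write(data, num_devices):
    devices = [bytearray() for _ in range(num_devices)]
    for i in range(0, len(data), STRIPE_UNIT):
        # stripe unit k goes to device k mod num_devices
        devices[(i // STRIPE_UNIT) % num_devices] += data[i:i + STRIPE_UNIT]
    return devices

def stripe_read(devices, length):
    out = bytearray()
    offsets = [0] * len(devices)
    i = 0
    while len(out) < length:
        d = i % len(devices)
        out += devices[d][offsets[d]:offsets[d] + STRIPE_UNIT]
        offsets[d] += STRIPE_UNIT
        i += 1
    return bytes(out[:length])

data = b"abcdefghijklmnopqrstuvwx"
devs = stripe_write(data, 3)
assert stripe_read(devs, len(data)) == data   # round-trip recovers the stream
```

Because consecutive units land on different devices, a large sequential transfer keeps all devices busy at once, which is the source of the speedup described above.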

Note14) mrfs stands for Memory Resident File System.
Note15) VFL-FS stands for Very Fast and Large File System.
Note16) LVCF is an abbreviation of Logical Volume Control Facility, which is a logical volume manager developed by VERITAS Software Corporation of the United States.


4.6 High-speed network
When a vector-parallel supercomputer is connected as a calculation server to a network in a distributed system environment, the network processing speed must be increased. Increasing the network processing speed increases the processing speed of the total system, including the vector-parallel supercomputer. The TCP/IP communication protocol, which is widely used in research and development fields, makes it possible to connect the communication mediums listed below and extend the system functions.
1) HIPPI-LAN
The TCP/IP communication function conforms to RFC1374 and is created on an HIPPI network conforming to ANSI X3.218, X3.210, and X3.222. A file transfer function and NFSNote17) that use a high-speed (800 megabits per second) communication path are supported.
2) ATM-LAN
The point-to-point TCP/IP communication function can be used by connecting an ATM switch (exchange) and using the RFC1483 encapsulation method. High-speed (155.2 megabits per second) communication with each host can be performed using the star-type switch connection.
3) Inter-PE TCP/IP communication function
The TCP/IP communication function is supported on the crossbar network. The TCP window size, which determines the maximum amount of data that can be transferred continuously, and the transfer data length are extended. Interprocess communication programs that use standard interfaces such as the socket, TLI, and RPC can be used on the high-speed (570 megabytes per second) communication path of the hardware.
4) Inter-PE NFS
File access between PEs is executed using NFS. The processing speed of the standard NFS is much lower than that of the crossbar network, which is 570 megabytes per second. To solve this problem, the NFS protocol was modified so that the slower part of the NFS protocol was removed and the inter-PE communication function of the crossbar network was used directly. Direct data transfer between PE memories increases the remote file access speed.

4.7 High reliability
Maintenance is important in a large-scale system having many PEs. Even if some PEs become faulty, UXP/V can continue system operation by isolating the faulty PEs. Moreover, a PE can be maintained during system operation because its power can be turned off separately.
We are also attempting to enhance AUDIT, ACL, and other security functions that are important to center operation.

5. Conclusion
This paper looked at the resource assignment, scheduling, high-speed input/output, system installation, and operation functions of vector-parallel supercomputers. Also, this paper explained the operating system function support in UXP/V.
In the future, we will examine how to increase the processing speed (the most important factor of supercomputers), and study various operating environments in order to find ways to enhance the operation management function and simplify operation. Also, we will continue to develop better operating systems for supercomputers.

Yuji Koeda received the Master degree in Electrical Engineering from the University of Yamagata, Yamagata, Japan in 1985. He joined Fujitsu Limited, Kawasaki in 1985, where he is currently engaged in development of network communication systems for UNIX mainframe computers and supercomputers.

Note17) NFS is an abbreviation of Network File System, which is a trademark of Sun Microsystems, Inc. of the United States.
