Operating System of AP3000 Series Scalar-Type Parallel Servers

UDC 681.3.06 Operating System of AP3000 Series Scalar-Type Parallel Servers VHiroyuki Oyake VYuji Iguchi VTsunemi Yamane (Manuscript received May 6, 1997) This paper outlines the control software for the AP3000 series of scalar-type parallel servers. The series provides high-performance computing power in many fields, for example, R&D, database processing, decision making support, and multimedia processing. Each AP3000 series machine is a scalar-type parallel computer system consist- ing of four or more node computers that are interconnected by a high-speed communication network called AP-Net. The AP3000 series has a high processing performance to cover the higher-level models of equipment ranging from Sym- metrical Multiprocessors ( SMPs ) to Massively Parallel Processors ( MPPs ). The AP3000 series contains an SMP as a node computer and has the same scalability as that of an MPP. The series has the characteristics of a cluster-type computer and has all the characteristics of typical parallel computers. Solaris is used as the control software of the AP3000 series. By combining Solaris with operation management software products, all node computers connected to AP-Net can be operated as one system. The set of node computers can also be divided into various groups as required. 1. Introduction used a workstation made by SUN Microsystems A computer having multiple processors is ge- as the front-end machine. From the experience nerically called a parallel computer. Parallel com- gained in creating the AP1000, Fujitsu concluded puters are classified into three types: SMPNote1), that the fundamental requirements for MPP com- MPPNote2), and cluster-type computers. An SMP puters are hardware that enables high-speed, low- computer has several tens of processors and a latency, and high-throughput communication be- shared memory architecture. An MPP computer tween many processors, and software that makes has several hundreds of processors and a distrib- the best use of the hardware performance. uted memory architecture. A cluster-type com- Before creating the operating system for the puter has several computers mutually connected AP3000 series, the successor of the AP1000, it was by a network so that the computers can be used decided that no operating systems specific to MPP as a single system. Processors used in parallel computers would be created, and instead a generic computers are called processor elements (PEs) or operating system with an add-on communication node computers. In this paper, such processors driver would be used. This decision was made are referred to as nodes. because when an MPINote3) or PVMNote4) provides Fujitsu created the MPP computer AP1000 high transmission rates, there are no differences in 1990. The AP1000, in which a specific Cell OS between a specific OS and a generic OS for appli- was adopted, was a back-end type computer that cation software, which is the only beneficiary of Note1) Abbreviation of symmetrical multi-processor. Note3) Abbreviation of message-passing interface. Note2) Abbreviation of massively parallel processor. Note4) Abbreviation of parallel virtual machine. Public domain software developed by the Oak Ridge National Laboratory (USA). FUJITSU Sci. Tech. J.,33,1,pp.31-38(June 1997) 31 H. Oyake et al.: Operating System of AP3000 Series Scalar-Type Parallel Servers parallel processing. 2.2 MPP To help satisfy the demand for high-perfor- Venture companies have looked for new mar- mance computing, the AP3000 series supports an kets for MPP, but the markets are not as exten- environment in which multiple users can execute sive as they had expected and their investment parallel processing programs at the same time. returns are probably quite poor. Also, to provide the high throughput required to Some have blamed the poor diffusion on the construct a computer center system, the AP3000 small number of application programs available, series provides distributed features for nodes and while others believe it is because parallel pro- enables reinforced batch processing. grams are difficult to create and use. To solve these In this paper, we introduce technology to ex- problems, the AP3000 adopts a generic OS. ploit the computational capabilities of the AP3000 In the AP3000, the nodes are connected as series and to operate the system. clusters only as viewed from inside the system. However, from the application software view, the 2. Purposes of the AP3000 AP3000 does exhibit some of the features of clus- In the field of high-performance computation, ter-type and MPP computers. the demand for scalar-type parallel computers as well as conventional vector-type computers is in- 2.3 Cluster creasing. The AP3000 series are scalar-type par- Although most cluster-type computers consist allel computers that support high-grade models of multiple computers, to increase the computa- of the SMP and MPP. The AP3000 was developed tional capabilities there are some cluster-type to include all the features of the SMP, MPP, and computers that consist of hundreds of computers, cluster-type computers. for example, the IBM SP2. The AP3000 can simi- larly connect a number of nodes as clusters as well. 2.1 SMP SMPs can also be connected as clusters to con- The computational capabilities of SMPs have struct a flexible and high-speed computational been significantly increased by improved process- server. ing abilities and a newly developed architecture for basic processors that enables connection of up 3. Features of the System to about 30 CPUs. 3.1 System configuration The AP3000 can connect Sun Microsystem’s SMP The scalar parallel server AP3000 series uses models as nodes. Dual-CPU models such as the Ul- a 64-bit-microprocessor UltraSparcNote6) and Sun tra EnterpriseNote5) 2 Model 2200 can be installed in a Microsystems workstations as node computers. cabinet without any modification. Node computers The AP3000 can connect up to 1,024 nodes through with many CPUs such as the Ultra Enterprise 6000 the high-speed network AP-Net. A control work- will be externally connected in the future. station is externally connected to AP3000 nodes Although the type of nodes can be flexibly by using a control network. A system control selected, there is a trade-off between the capabil- mechanism directly connected with the control ity of a node and the size of the area affected by a workstation manages the power and provides an failure. To localize a failure, the number of pro- integrated console. Figure 1 shows the AP3000 cessors per node must be reduced by increasing system structure. the number of nodes. Note5) A registered trademark of Sun Microsystems, Note6) A registered trademark of SPARC Interna- Inc. (USA). tional, Inc. 32 FUJITSU Sci. Tech. J.,33,1,(June 1997) H. Oyake et al.: Operating System of AP3000 Series Scalar-Type Parallel Servers AP3000 series AP-Net RS232C Control System control workstation feature Node Node Node Node Node UPS Control network (Optional) External network Fig.1— AP3000 system structure. 3.2 Generic OS is efficient and economical to run a highly-multi- In the AP3000, the Solaris operating systemNote7) plexed program with only a few nodes or to ex- functions in each node. The operating system layer ecute multiple programs simultaneously. is a distributed processing system. To control the The appropriate mode can be selected when nodes efficiently, distributed control softwares that programs are executed, and programs need not are commonly used on Solaris are also available be modified. on the AP3000. The administrator regards the AP3000 as a single computer, while users regard 3.4 Network address it as a system of multiple computers. This double The AP3000 not only enables nodes to be characteristic is important because it enables handled collectively as a MPP but also as indi- popular application software to be used without vidual computers like computers in a cluster sys- modification. tem. Therefore, IP addresses are assigned to each node. A node has IP addresses for the control net- 3.3 High-speed parallel computation work and AP-Net. For practical operations, either The AP3000 supports two parallel computa- the control network or AP-Net must provide rout- tional modes to enable the user to choose between ing information to external networks to commu- high-performance and multiplexed computing. nicate with them. The two modes are the SIMPLEX mode and SHARE mode. 3.5 High availability In SIMPLEX mode, a single parallel-process- The AP3000 has a mechanism that makes it ing application program exclusively uses multiple possible to continue operation without stopping nodes and communication paths to reduce the the entire system when a hardware failure occurs. turn-around time. In SHARE mode, programs can The AP3000 provides conversational and be created and executed by multiple users using batch services in multiple nodes. If a node drops TCP/IP, but at a lower data transmission rate. out due to a failure, the remaining nodes continue SHARE mode is provided because, when develop- service. In server systems similar to the AP3000, ing or debugging parallel-processing programs, it abnormal termination of an externally connected Note7) A registered trademark of Sun Microsystems, Inc. (USA). FUJITSU Sci. Tech. J.,33,1,(June 1997) 33 H. Oyake et al.: Operating System of AP3000 Series Scalar-Type Parallel Servers node causes the service to end. To avoid this, ex- as well as general network devices. The high- ternally connected nodes must be multiplexed. speed communication made possible using the AP- When an externally connected node becomes Net stream driver can accelerate NFS and backup faulty, another node must inherit the address and through the network without the need to modify name made available to the outside; therefore, the applications. Thus, AP-Net allows more computer system must be designed carefully. connections than are allowed in a LAN device.

Load more