Hardware Systems: Processor and Board Alternatives

Afshin Attarzadeh

INFOTECH, Universität Stuttgart, [email protected]

Abstract: Parallel computing has a great influence on daily life. Weather forecasting, air traffic control, simulating nuclear experiments instead of actually performing them, and many other tasks rely directly on parallel computing. Within parallel computing, clusters are now commonly used. Like other concepts in parallelism, clustering is a system issue that involves both software and hardware and the relation between them. This paper mainly considers the hardware part.

Introduction – What is cluster computing?

As Pfister mentions in his book [1], there are three ways to improve performance:
− Work harder
− Work smarter
− Get help
Working harder means using faster hardware: faster processors, faster and larger memory and storage, and more capable peripheral devices. Working smarter means doing things more efficiently, through more efficient and faster algorithms and techniques. Getting help refers to parallel processing.

Clusters can be a good way to exploit all three of these approaches to performance improvement at a reasonable cost. Clusters are flexible enough that commodity components can be used in their hardware structure, and this flexibility lets designers develop high-performance parallel algorithms. Clusters are easy to configure and, moreover, scalable, so they can readily follow advances in both hardware and software. According to Buyya [2], the most common scalable parallel computer architectures can be classified as follows:
• Massively Parallel Processors (MPP)
• Symmetric Multiprocessors (SMP)
• Cache-Coherent Nonuniform Memory Access (CC-NUMA)
• Distributed Systems
• Clusters

Clusters – Definition and Architecture

There are many different definitions of clusters. Some of them are given below:
− “A commonly found computing environment consists of many workstations connected together by a local area network. The workstations, which have become increasingly powerful over the years, can together, be viewed as a significant computing resource. This resource is commonly known as cluster of workstations.” [3]
− “A computer cluster is a group of locally connected computers that work together as a unit.” [4]
− “A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computers working together as a single, integrated computing resource.” [2]
Although the definitions above differ slightly, they all define a cluster as a single computing resource even though it contains several nodes. Additionally, all these nodes are located close to one another and are locally connected.

Clusters are typically used for two major reasons. One is High Availability (HA), for greater reliability; the other is High Performance Computing (HPC), to provide greater computational power than a single computer can provide. Fig. 1 gives a schematic view of the cluster architecture.

Fig. 1. Cluster Computer Architecture [4]

4T-Machines

Clusters are also referred to as 4T machines. The name reflects four "tera" figures: computing power measured in TIPS (tera instructions per second), terabytes of memory, terabytes per second of I/O bandwidth, and terabytes per second of communication bandwidth. []

Why clusters?

One reason for using clusters in parallel computing is the freedom to use commodity or commercial hardware components. This leads to a large cost reduction, while overall performance decreases comparatively little. Another reason is scalability: a cluster can be built from scratch with an arbitrary number of computing nodes, and that number can be changed later.

Classification of clusters

Clusters are widely used today because they offer the following features at a relatively low cost [2]:
• High Performance
• Expandability and Scalability
• High Throughput
• High Availability

Clusters are classified into many categories based on various factors. Some of these classifications are given below in brief [2]:
1. Application Target
• High Performance Clusters for computational science applications
• High Availability Clusters for mission-critical applications
2. Node Ownership
• Dedicated Clusters: all nodes are reserved for the cluster, and every node has a dedicated task.
• Non-dedicated Clusters: the nodes also serve as workstations, and for cluster tasks the server steals idle cycles from them.
3. Node Hardware
• Clusters of PCs (CoPs) or Piles of PCs (PoPs)
• Clusters of Workstations (COWs)
• Clusters of SMPs (CLUMPs)

Cluster Hardware Components

The major hardware components needed to build a cluster are processors, memory and cache, disk storage and I/O interfaces, and network interfaces. Among these, processor and memory technology have improved rapidly. It was once said that 1 GHz would be the ultimate achievable processor speed, but processors are now produced with clock frequencies of up to 3 GHz, three times that figure.

Processors

During the past decades there has been outstanding progress in processor architecture, which has made single-chip processors as powerful as supercomputers, but at a much lower price. Some of the architectures developed are RISC, CISC, VLIW¹ and vector [2]. The following is brief information on the latest products of several major processor vendors.

Alpha [8]

The DEC Alpha, also known as the Alpha AXP, is a 64-bit RISC microprocessor originally developed and fabricated by Digital Equipment Corp. (DEC), which used it in its own line of workstations and servers. Designed as a successor to the VAX line of computers, it supported the VMS operating system, as well as Digital UNIX. Later open source operating systems also ran on the Alpha, notably Linux and BSD UNIX flavors. Microsoft supported the processor until Windows NT 4.0 SP6 and did not extend Alpha support beyond beta 3 of Windows 2000.

Intel [9][10][11]

Intel processors are now popular among PC users. The current generation of the Intel processor family is the Pentium 4 family. Intel offers different CPUs for different applications, in four general classes: desktop CPUs, mobile CPUs, server CPUs and workstation CPUs.

The Itanium, the new generation of Intel processors for server solutions, is designed with an array of features to extract greater instruction-level parallelism, including speculation, predication, large register files, a register stack, an advanced branch architecture, and many others. Its 64-bit address registers let it address large memories, which results in higher performance in server applications. The Itanium also has a floating-point architecture that supports the high performance requirements of workstation applications such as digital content creation, design engineering, and scientific analysis. For compatibility, Itanium-based processors can run IA-32 applications on an Itanium-based operating system that supports their execution. Mixed execution of IA-32 and Itanium-based code is also supported.

For the server and workstation classes, Intel has introduced a technology named Hyper-Threading [10]. Hyper-Threading Technology enables multi-threaded software applications to execute threads in parallel. According to Intel, this level of threading had not been seen before in a general-purpose microprocessor. In the past, threading was enabled in software to improve performance, by splitting instructions into multiple streams so that multiple processors could act upon them. With Hyper-Threading Technology, threading is available at the processor level, which makes more efficient use of processor resources and gives greater parallelism and improved performance on today's multi-threaded software.

¹ Very Long Instruction Word

Hyper-Threading Technology provides thread-level parallelism (TLP) on each processor, which increases the utilization of the processor's execution resources and, in turn, yields higher processing throughput. Hyper-Threading Technology is a form of simultaneous multi-threading (SMT), in which multiple threads of software applications can run simultaneously on one processor. This is achieved by duplicating the architectural state on each processor while sharing one set of processor execution resources.
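The effect of sharing one set of execution resources between two architectural states can be illustrated with a toy model. The sketch below is a deliberate simplification and does not describe Intel's implementation: each thread is a hypothetical trace in which "X" is a cycle that needs the execution unit and "-" is a cycle spent waiting (for example on memory). The trace contents and all function names are assumptions made for illustration.

```python
def cycles_one_at_a_time(traces):
    """Cycles needed when the threads run one after another on the execution unit."""
    return sum(len(trace) for trace in traces)


def cycles_smt(traces):
    """Cycles needed when all threads share one execution unit, SMT style."""
    # One position per thread stands in for the duplicated architectural state.
    position = [0] * len(traces)
    cycles = 0
    while any(p < len(t) for p, t in zip(position, traces)):
        cycles += 1
        unit_free = True  # the single shared execution unit, free at the start of each cycle
        for i, trace in enumerate(traces):
            if position[i] >= len(trace):
                continue  # this thread has already finished
            if trace[position[i]] == "-":
                position[i] += 1  # waiting cycles elapse without using the unit
            elif unit_free:
                position[i] += 1  # this thread gets the unit for this cycle
                unit_free = False
            # otherwise the thread is ready but must wait for the unit
    return cycles


def utilization(traces, cycles):
    """Fraction of cycles in which the execution unit did useful work."""
    busy = sum(trace.count("X") for trace in traces)
    return busy / cycles


# Hypothetical workloads: one useful cycle followed by two waiting cycles.
TRACES = ["X--X--X", "X--X--X"]

if __name__ == "__main__":
    for label, cycles in (
        ("one at a time", cycles_one_at_a_time(TRACES)),
        ("shared (SMT)", cycles_smt(TRACES)),
    ):
        print(f"{label}: {cycles} cycles, utilization {utilization(TRACES, cycles):.2f}")
```

In this model the second thread fills cycles in which the first one is waiting, so the same work completes in fewer cycles and the execution unit is busy for a larger fraction of the time. Real processors share many more resources, so the figures indicate only the direction of the effect.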

AMD [12][13]

AMD and Intel are progressing at the same pace in the area of processors. Like Intel, AMD offers processors for four different applications: desktop, mobile, server and workstation. For server solutions AMD currently offers the Athlon MP and the Opteron. The Athlon MP is based on AMD's desktop Athlon design, with the addition of the Smart MP technology introduced by AMD. Smart MP technology uses dual, independent point-to-point system buses to increase the available bus bandwidth. Together with its cache management system, Smart MP technology allows high-speed communication between processors, helps reduce data transfer latencies, and helps ensure that both processors work to their full potential. []

In the Opteron, a 64-bit processor available in a dual-core design, AMD has introduced HyperTransport technology. HyperTransport technology is a high-speed, low-latency, point-to-point link designed to increase the communication speed between integrated circuits in computers, servers, embedded systems, and networking and telecommunications equipment, up to 48 times faster than some existing technologies. HyperTransport technology helps reduce the number of buses in a system, which can reduce system bottlenecks and enable today's faster processors to use system memory more efficiently in high-end multiprocessor systems. HyperTransport technology was invented at AMD with contributions from industry partners and is managed and licensed by the HyperTransport Technology Consortium, a Texas non-profit corporation. []

Sun SPARC [15]

The latest Sun processor is the UltraSPARC IV [15]. The UltraSPARC IV is derived from Sun Microsystems' high-end UltraSPARC III processor, providing the same fundamental features and offering high throughput through Chip Multithreading (CMT²) technology. The UltraSPARC IV features two cores, each based on the UltraSPARC III; in other words, two processors in a single chip. The UltraSPARC IV provides extra RAS³ features by supporting Chip-Kill-DIMM. This feature inside the CPU corrects errors resulting from one failed SDRAM.

² Chip Multithreading: A processor capable of executing two or more software threads simultaneously without resorting to a software context switch. This feature could be achieved through the use of multiple processor cores, supporting multiple threads per core, or a combination of these strategies. [15]
³ Reliability, Availability, Serviceability

IBM PowerPC [14]

IBM has also introduced processors suited to parallel computing. The PowerPC 440 core is a 32-bit RISC CPU core running at 667 MHz, implemented in IBM's advanced 130 nm copper CMOS technology. The core's performance, power specifications, and design attributes make it well suited to storage and massive computing applications. Because it is a core, it can be specialized and configured for specific uses; see Fig. 2.

Fig. 2. Sample PowerPC 440x5 core application

Some Cluster Projects

The first commodity clustering product was ARCnet⁴, developed in 1977. ARCnet was not a commercial success, and clustering did not really take off until DEC released its VAXcluster product in the 1980s for the VAX/VMS operating system. The ARCnet and VAXcluster products supported not only parallel computing but also shared file systems and peripheral devices. They were intended to provide the advantages of parallel processing while maintaining data reliability and uniqueness.

⁴ Attached Resource

The Beowulf Project

One of the more popular implementations is a cluster whose nodes run Linux as the OS and use free software to implement the parallelism. This configuration is often referred to as a Beowulf cluster. The aim of the Beowulf project was to investigate the potential of PC clusters for performing computational tasks. Beowulf follows the Pile-of-PCs (PoPC) approach [2], which insists on commodity components, dedicated processors, and a private communication network. The final goal of Beowulf is to build a cluster system with the best cost/performance ratio.

Early Beowulf systems used Intel DX4 processors; in 1994 the DX4 was the 100 MHz version of the 80486 chip. The cluster was built with 16 processors and could deliver over one GigaOPS of peak performance, although its floating-point performance did not approach an equivalent level. In 1994 a DX4 cost $550, or roughly five and a half dollars per MegaOPS. Two years later the Pentium improved price/performance by a factor of 2, to about two and three quarter dollars per MegaOPS. Prices have continued to fall since then. Today the latest Intel processor, with a clock frequency of 2.8 GHz, can be obtained for around $175, which means 62.5 dollars per GigaOPS, or six and a quarter cents per MegaOPS.

Beowulf systems are not limited to Intel processors. Avalon is a 140-processor Beowulf cluster that uses DEC Alpha processors and is constructed entirely from commodity technology.
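The price/performance figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes, as the figures themselves imply, a peak rate of one operation per clock cycle, so that a 100 MHz processor peaks at 100 MegaOPS; the function name and the `ops_per_cycle` parameter are illustrative assumptions, not taken from the Beowulf project.

```python
def dollars_per_mops(price_usd, clock_mhz, ops_per_cycle=1.0):
    """Price per MegaOPS of peak rate, assuming ops_per_cycle operations per clock."""
    peak_mops = clock_mhz * ops_per_cycle
    return price_usd / peak_mops


# Intel DX4 in 1994: $550 for a 100 MHz part.
dx4 = dollars_per_mops(550, 100)

# A factor-of-2 improvement in price/performance halves the cost per MegaOPS.
pentium = dx4 / 2

# A 2.8 GHz processor at about $175.
recent = dollars_per_mops(175, 2800)

# Peak rate of the 16-processor DX4 cluster, in GigaOPS.
cluster_peak_gops = 16 * 100 / 1000

if __name__ == "__main__":
    print(f"DX4:     ${dx4:.2f} per MegaOPS")
    print(f"Pentium: ${pentium:.2f} per MegaOPS")
    print(f"2.8 GHz: ${recent * 1000:.1f} per GigaOPS, {recent * 100:.2f} cents per MegaOPS")
    print(f"16 x DX4 cluster peak: {cluster_peak_gops} GigaOPS")
```

Under this assumption the cost falls from $5.50 per MegaOPS for the DX4 to 6.25 cents per MegaOPS for the 2.8 GHz part, a factor of 88, and halving the DX4 figure gives $2.75 per MegaOPS.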

Other Projects

Sun Microsystems has also released a clustering product called Grid Engine. OpenSSI is another clustering project; it provides single-system image (SSI) capabilities. It leverages HP's NonStop Clusters for UnixWare technology and other open source technology to provide a full, highly reliable SSI environment for Linux.

Comparison between clusters

The Top500 project [16] publishes a list of the 500 fastest computer systems twice a year. It is a collaboration between the University of Mannheim, the University of Tennessee, and the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory. The following table shows some of the clusters and their specifications:

Cluster Name               Processor Type           No. of Processors  Theoretical Max Performance (GFlops)  Max Performance (GFlops)
BlueGene/L                 PowerPC 0.7GHz           32768              91750                                 70720
SC-Alpha Server Cluster    Alpha Server SC45, 1.25  8192               20480                                 13880
Tiger4 Cluster - Quadrics  Itanium2 1.4GHz          4096               22938                                 19940
HP DL145 Cluster           Opteron 2.2GHz           512                2252.8                                1576
Atipa’s NOW Cluster        Intel Xeon 1.8 GHz       1024               3686.4                                2207

Table 1. Specifications of some clusters [16]
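Two derived quantities make the entries of Table 1 easier to compare: efficiency, the fraction of the theoretical peak that is actually achieved, and peak performance per processor. The sketch below computes both. It assumes that the three numeric columns of the table are, in order, the number of processors, the theoretical peak and the achieved maximum; the variable and function names are illustrative.

```python
# (cluster name, processors, theoretical peak in GFlops, achieved maximum in GFlops),
# copied from Table 1 under the column reading stated above.
TABLE_1 = [
    ("BlueGene/L", 32768, 91750.0, 70720.0),
    ("SC-Alpha Server Cluster", 8192, 20480.0, 13880.0),
    ("Tiger4 Cluster - Quadrics", 4096, 22938.0, 19940.0),
    ("HP DL145 Cluster", 512, 2252.8, 1576.0),
    ("Atipa's NOW Cluster", 1024, 3686.4, 2207.0),
]


def efficiency(peak_gflops, achieved_gflops):
    """Fraction of the theoretical peak that the cluster actually delivers."""
    return achieved_gflops / peak_gflops


def peak_per_processor(processors, peak_gflops):
    """Theoretical peak of a single processor, in GFlops."""
    return peak_gflops / processors


if __name__ == "__main__":
    for name, processors, peak, achieved in TABLE_1:
        print(
            f"{name:<27} efficiency {efficiency(peak, achieved):6.1%}  "
            f"peak per processor {peak_per_processor(processors, peak):.2f} GFlops"
        )
```

Of the five systems, the Tiger4 cluster comes closest to its theoretical peak and the Atipa cluster is furthest from it. The number of processors alone therefore does not determine delivered performance.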

Clusters vs. Distributed Systems

Why are clusters distinguished from distributed systems? Clusters are in fact a special kind of distributed system, but they have qualitative properties that distinguish them from general distributed systems. These properties can be categorized as follows [3]:

Homogeneous hardware and software: A distributed system necessarily involves different kinds of hardware and software systems. Communication between these heterogeneous systems comes at a cost: it requires general algorithms for communication between nodes, and giving transparent access to all data on all nodes is very expensive. In a cluster, all the nodes have the same hardware specification and run the same software. This simplicity has huge benefits: it improves performance and provides transparency, since the whole cluster can be treated as a single computer system.

Single administrative domain: One of the major reasons for designing distributed systems is to cross geographical and organizational boundaries. A cluster, in contrast, may consist of thousands of devices but is managed as a single authentication domain, a single performance domain, and a single accounting domain. To the administrator it appears as a single entity with no internal boundaries.

Ideal communication: In a distributed system, communication has three major deficiencies: it is slow, expensive and unreliable. The finite speed of light and the long distances make it slow; a response time of 100 ms is typical. The long distances and large capital costs make communication the most expensive part of a distributed system. Last but not least, individual connections in public networks may be lost, which can mean denial of service for extended periods.

Communication within a cluster, by contrast, is ideal: the distances are short, and bandwidth is plentiful and inexpensive, which results in highly reliable communication. Because all the nodes in the cluster are the same, very efficient communication protocols can be used.

To summarize, a cluster is a kind of distributed system, but simpler and faster. A cluster acting as a server can in turn be a key part, a super-server node, of a distributed system.
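The role of distance in the latency argument above can be made concrete with a back-of-the-envelope calculation. The sketch below considers propagation delay only and ignores switching, queuing and protocol overhead. The signal speed (about two thirds of the speed of light, a common figure for optical fibre) and both example distances are assumptions chosen for illustration.

```python
# Assumption: signals travel at roughly two thirds of the speed of light.
SIGNAL_SPEED_M_PER_S = 2.0e8


def round_trip_ms(distance_m, speed_m_per_s=SIGNAL_SPEED_M_PER_S):
    """Round-trip propagation delay in milliseconds over the given one-way distance."""
    return 2 * distance_m / speed_m_per_s * 1000


# Hypothetical distances: nodes in one machine room versus sites on different continents.
CLUSTER_DISTANCE_M = 10
WIDE_AREA_DISTANCE_M = 10_000_000  # 10,000 km

if __name__ == "__main__":
    print(f"within a cluster:  {round_trip_ms(CLUSTER_DISTANCE_M) * 1e6:.0f} ns round trip")
    print(f"across continents: {round_trip_ms(WIDE_AREA_DISTANCE_M):.0f} ms round trip")
```

With these assumed values, a 10,000 km link has a round-trip propagation delay of about 100 ms, the same order as the typical response time mentioned above, while a 10 m link within a cluster takes about 100 ns, six orders of magnitude less.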

References:
[1] Gregory F. Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, Prentice Hall, 1998
[2] Rajkumar Buyya, High Performance Cluster Computing, Volume 1: Architectures and Systems, Prentice Hall PTR, 1999
[3] Jim Gray, Super-Servers: Commodity Computer Clusters Pose a Software Challenge
[4] L. Silva and R. Buyya, Parallel Programming Models and Paradigms, High Performance Cluster Computing: Programming and Applications, Rajkumar Buyya (editor), ISBN 0-13-013785-5, Prentice Hall PTR, NJ, USA, 1999.
[5] NHSE Review, Cluster Management Software, 1996 May; http://nhse.cs.rice.edu/NHSEreview/CMS/index.html
[6] Computer Cluster, Wikipedia the free encyclopedia; http://en.wikipedia.org/wiki/Cluster_computing
[6] The Beowulf Project; http://www.beowulf.org
[7] World supercomputing professionals; http://www.supercomputingonline.com
[8] DEC Alpha, Wikipedia the free encyclopedia; http://en.wikipedia.org/wiki/DEC_Alpha
[9] Intel Corporation; http://www.intel.com
[10] Hyper Threading Technology; http://www.intel.com/technology/hyperthread/
[11] Intel Itanium Architecture Software Developer’s Manual Volume 1: Application Architecture, http://encyclopedia.thefreedictionary.com/computer%20cluster
[12] www.amd.com
[13] HyperTransport Technology; www.hypertransport.org
[14] IBM PowerPC 440 core; http://www.ibm.com
[15] Sun Microsystems, UltraSPARC IV Processor user’s manual supplement, Version 1.0 April 2004
[15] Atipa Technology; http://www.atipa.com/
[16] www.top500.org