
Journal Of Harmonized Research (JOHR)

Journal Of Harmonized Research in Engineering 1(2), 2013, 54-64 ISSN 2347 – 7393

Original Research Article

High Performance Computing through Parallel and Distributed Processing

Shikha Yadav, Preeti Dhanda, Nisha Yadav

Department of Computer Science and Engineering, Dronacharya College of Engineering, Khentawas, Farukhnagar, Gurgaon, India

Abstract:

There is a pressing need for High Performance Computing (HPC) in many applications, from space science to Artificial Intelligence. HPC can be attained through Parallel and Distributed Processing. In this paper, parallel and distributed algorithms are discussed based on parallel and distributed processors to achieve HPC. Programming concepts like threads, fork and sockets are discussed with some simple examples for HPC.

Keywords: High Performance Computing, Parallel and Distributed Processing, Computer Architecture

For Correspondence:
preeti.dhanda01ATgmail.com
Received on: October 2013
Accepted after revision: December 2013
Downloaded from: www.johronline.com

Introduction

Computer Architecture and Programming play a significant role in High Performance Computing (HPC) for large applications, from space science to Artificial Intelligence. Algorithms are problem-solving procedures, and these algorithms are later transformed into a particular programming language for HPC. There is a need to study algorithms for High Performance Computing. These algorithms are to be designed to compute in reasonable time to solve large problems like weather forecasting, Tsunami, Remote Sensing, national calamities, Defence, Mineral exploration, Finite-element analysis, Cloud Computing, and Expert Systems.
The algorithms are Non-Recursive Algorithms, Recursive Algorithms, Parallel Algorithms and Distributed Algorithms. The algorithms must be supported by the Computer Architecture. Computer Architecture is characterized by Flynn's Classification: SISD, SIMD, MIMD, and MISD. Most Computer Architectures support SIMD (Single Instruction Multiple Data Streams). The classes of Computer Architecture are VLSI, Multi-processor, and Multiple Processor systems.



High-Performance Computing (HPC) is used to describe computing environments which utilize supercomputers and computer clusters to address complex computational requirements, support applications with significant processing time requirements, or require processing of significant amounts of data. Supercomputers have generally been associated with scientific research and compute-intensive types of problems, but more and more supercomputer technology is appropriate for both compute-intensive and data-intensive applications. A new trend in supercomputer design for high-performance computing is using clusters of independent processors connected in parallel.
Many computing problems are suitable for parallelization; often, problems can be divided in a manner so that each independent processing node can work on a portion of the problem in parallel by simply dividing the data to be processed, and then combining the final processing results for each portion. This type of parallelism is often referred to as data-parallelism, and data-parallel applications are a potential solution to petabyte-scale data processing requirements. Data-parallelism can be defined as a computation applied independently to each data item of a set of data, which allows the degree of parallelism to be scaled with the volume of data. The most important reason for developing data-parallel applications is the potential for scalable performance in high-performance computing, which may result in several orders of magnitude of performance improvement.
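As a hedged illustration of the data-parallelism described above, the following C sketch divides an array among a fixed number of POSIX threads and combines the partial sums at the end; the thread count, array size, and names are illustrative choices, not taken from the paper.

```c
/* A minimal data-parallel sketch in C with POSIX threads: the input array is
 * split into equal chunks and each thread sums its own chunk independently;
 * the partial sums are combined at the end. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];

struct chunk { size_t begin, end; double partial; };

static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    c->partial = 0.0;
    for (size_t i = c->begin; i < c->end; i++)
        c->partial += data[i];          /* each thread touches only its own slice */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];
    size_t step = N / NTHREADS;

    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;                  /* simple test data: the sum should be N */

    for (int t = 0; t < NTHREADS; t++) {
        chunks[t].begin = t * step;
        chunks[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * step;
        pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].partial;     /* combine the per-chunk results */
    }
    printf("total = %.0f\n", total);
    return 0;
}
```

The same pattern scales with the volume of data: a larger array simply gives each thread a larger, still independent, chunk.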

The discussion below focuses on the case of multiple computers, although many of the issues are the same for concurrent processes running on a single computer. Three viewpoints are commonly used:
Parallel algorithms in shared-memory model - All computers have access to a shared memory. The algorithm designer chooses the program executed by each computer. One theoretical model is the parallel random access machine (PRAM). However, the classical PRAM model assumes synchronous access to the shared memory. A model that is closer to the behaviour of real-world multiprocessor machines and takes into account the use of machine instructions, such as compare-and-swap (CAS), is that of asynchronous shared memory.
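The compare-and-swap instruction mentioned above can be sketched in C11 with the standard atomics library; the example below is a minimal, assumed setup (a shared counter incremented by several threads) rather than anything specified in the paper.

```c
/* A small sketch of the compare-and-swap (CAS) idea from the asynchronous
 * shared-memory model, using C11 atomics. Several threads increment one
 * shared counter without a lock: each reads the current value, computes the
 * new value, and retries if another thread changed the counter in between. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define INCREMENTS 100000

static atomic_int counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        int old = atomic_load(&counter);
        /* CAS retry loop: succeed only if nobody else updated the counter. */
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
            /* 'old' now holds the freshly observed value; try again. */
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("counter = %d (expected %d)\n", counter, NTHREADS * INCREMENTS);
    return 0;
}
```

The retry loop is the essence of asynchronous shared memory: a thread's update succeeds only if no other thread has modified the location in the meantime.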



Parallel algorithms in message-passing model - The algorithm designer chooses the structure of the network, as well as the program executed by each computer. Models such as Boolean circuits and sorting networks are used. A Boolean circuit can be seen as a computer network: each gate is a computer that runs an extremely simple computer program. Similarly, a sorting network can be seen as a computer network: each comparator is a computer.
Distributed algorithms in message-passing model - The algorithm designer only chooses the computer program. All computers run the same program. The system must work correctly regardless of the structure of the network. A commonly used model is a graph with one finite-state machine per node.

Architecture of high performance computers:
The Department of High Performance Computer Architecture is concerned with the architecture, the application and the continued development of high performance computers beneficial to the natural and life sciences. Our focus is on the selection and analysis of experimental data generated by facilities such as GSI (the "Society for Heavy Ion Research" in Darmstadt, Germany) and CERN (the European Centre for Nuclear Research in Geneva, Switzerland). Both of these facilities employ shared, typically massively parallel systems and clusters operating under high-level, real-time and dependability standards. Our task is the research and development of new computer architectures and algorithms to achieve better energy efficiency. Within the context of shared computing we implement both GRID and virtualization technologies and systems.

High Performance Computing Cluster (HPCC) System Architecture-
The HPCC system architecture includes two distinct cluster processing environments, each of which can be optimized independently for its parallel data processing purpose. The first of these platforms is called a Data Refinery, whose overall purpose is the general processing of massive volumes of raw data of any type for any purpose, but which is typically used for data cleansing and hygiene, ETL processing of the raw data, record linking and entity resolution, large-scale ad-hoc complex analytics, and creation of keyed data and indexes to support high-performance structured queries and data warehouse applications. The Data Refinery is also referred to as Thor, a reference to the mythical Norse god of thunder with the large hammer symbolic of crushing large amounts of raw data into useful information. A Thor cluster is similar in its function, execution environment, filesystem, and capabilities to the Google MapReduce platform. The figure given below shows a representation of a physical Thor processing cluster, which functions as a batch job execution engine for scalable data-intensive computing applications. In addition to the Thor master and slave nodes, additional auxiliary and common components are needed to implement a complete HPCC processing environment.

Figure: Thor Processing Cluster

The second of the parallel data processing platforms is called Roxie and functions as a rapid data delivery engine. This platform is designed as an online high-performance structured query and analysis platform or data warehouse, delivering the parallel data access processing requirements of online applications through Web services interfaces, supporting thousands of simultaneous queries and users with sub-second response times. Roxie utilizes a distributed indexed file system to provide parallel processing of queries using an optimized execution environment and filesystem for high-performance online processing. Both Thor and Roxie clusters utilize the ECL programming language for implementing applications, increasing continuity and programmer productivity.


The figure given below shows a representation of a physical Roxie processing cluster, which functions as an online query execution engine for high-performance query and data warehousing applications. A Roxie cluster includes multiple nodes with server and worker processes for processing queries; an additional auxiliary component called an ESP server, which provides interfaces for external client access to the cluster; and additional common components which are shared with a Thor cluster in an HPCC environment. Although a Thor processing cluster can be implemented and used without a Roxie cluster, an HPCC environment which includes a Roxie cluster should also include a Thor cluster. The Thor cluster is used to build the distributed index files used by the Roxie cluster and to develop online queries which will be deployed with the index files to the Roxie cluster.

Figure: Roxie Processing Cluster

Distributed processing:
Distributed computing is a field of computer science that studies distributed systems. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. There are many substitutes for the message passing mechanism, including RPC-like connectors and message queues (a small sketch of message passing between two processes is given after the examples below). Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. An important aim and challenge of distributed systems is location transparency. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.
A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers which communicate with each other by message passing.

Properties of distributed systems -
So far the focus has been on designing a distributed system that solves a given problem. A complementary research problem is studying the properties of a given distributed system. The halting problem is an analogous example from the field of centralised computation: we are given a computer program and the task is to decide whether it halts or runs forever. The halting problem is not decidable in the general case, and naturally understanding the behaviour of a computer network is at least as hard as understanding the behaviour of one computer. However, there are many interesting special cases that are decidable. In specific cases, it is possible to reason about the behaviour of a network of finite-state machines. One example is telling whether a given network of interacting (asynchronous and non-deterministic) finite-state machines can reach a deadlock. This problem is PSPACE-complete, i.e., it is decidable, but it is not likely that there is an efficient (centralised, parallel or distributed) algorithm that solves the problem in the case of large networks.

Examples:
Examples of distributed systems and applications of distributed computing include the following:
1. Telecommunication networks:
Telephone networks and cellular networks
Computer networks such as the Internet
Wireless sensor networks
Routing algorithms
2. Network applications:
World wide web and peer-to-peer networks



Massively multiplayer online games and virtual reality communities
Distributed databases and distributed database management systems
Network file systems
Distributed information processing systems such as banking systems and airline reservation systems
3. Real-time process control:
Aircraft control systems
Industrial control systems
4. Parallel computation:
Scientific computing, including cluster computing and grid computing and various volunteer computing projects (see the list of distributed computing projects)
Distributed rendering in computer graphics

Applications:
Reasons for using distributed systems and distributed computing may include the following. The very nature of an application may require the use of a communication network that connects several computers: for example, data produced in one physical location and required in another location. There are many cases in which the use of a single computer would be possible in principle, but the use of a distributed system is beneficial for practical reasons. For example, it may be more cost-efficient to obtain the desired level of performance by using a cluster of several low-end computers, in comparison with a single high-end computer. A distributed system can provide more reliability than a non-distributed system, as there is no single point of failure. Moreover, a distributed system may be easier to expand and manage than a monolithic uniprocessor system.
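Since the abstract names fork and sockets among the programming concepts for HPC, a minimal hedged sketch of the message passing described above is given below; it uses a local socket pair, with the parent process standing in for a coordinator node and the child for a worker node, and all names and values are illustrative.

```c
/* A minimal message-passing sketch in C using fork() and a socket pair.
 * The parent sends a small task description, the child computes and sends
 * the result back. On a real cluster the same pattern would use TCP sockets
 * or a message-passing library instead of a local socketpair. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sv[2];                                  /* sv[0] for parent, sv[1] for child */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        perror("socketpair");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {                             /* child: the "worker node" */
        close(sv[0]);
        int numbers[4];
        read(sv[1], numbers, sizeof numbers);   /* receive the task */
        int sum = 0;
        for (int i = 0; i < 4; i++)
            sum += numbers[i];
        write(sv[1], &sum, sizeof sum);         /* send the result back */
        close(sv[1]);
        exit(0);
    }

    close(sv[1]);                               /* parent: the "coordinator node" */
    int task[4] = { 1, 2, 3, 4 };
    write(sv[0], task, sizeof task);
    int result = 0;
    read(sv[0], &result, sizeof result);
    printf("result from worker = %d\n", result);
    close(sv[0]);
    wait(NULL);
    return 0;
}
```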

Parallel processing:

Figure: IBM's Blue Gene supercomputer

Parallel computing is a form of computation in which many calculations are carried out concurrently, operating on the principle that large problems can frequently be divided into smaller ones, which are then solved simultaneously ("in parallel"). There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has been employed for many years, primarily in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling. As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multicore processors.
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Specialized parallel computer architectures are sometimes used alongside traditional processors, for accelerating specific tasks.


Parallel computer programs are more difficult to write than sequential ones, because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common. Communication and synchronization between the different subtasks are typically some of the greatest obstacles to getting good parallel program performance.
Parallel computing is the simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads. Ideally, parallel processing makes programs run faster because there are more engines (CPUs or cores) running them. In practice, it is often difficult to divide a program in such a way that separate CPUs or cores can execute different portions without interfering with each other. Most computers have just one CPU, but some models have several, and multi-core processor chips are becoming the norm. There are even computers with thousands of CPUs.
With single-CPU, single-core computers, it is possible to perform parallel processing by connecting the computers in a network. However, this type of parallel processing requires very sophisticated software called distributed processing software.
Note that parallelism differs from concurrency. Concurrency is a term used in the operating systems and databases communities to refer to the property of a system in which multiple tasks remain logically active and make progress at the same time by interleaving the execution order of the tasks, thereby creating an illusion of simultaneously executing instructions. Parallelism, on the other hand, is a term typically used by the supercomputing community to describe executions that physically execute simultaneously with the goal of solving a problem in less time or solving a larger problem in the same time. Parallelism exploits concurrency. Parallel processing is also called parallel computing.
In the quest for cheaper computing alternatives, parallel processing provides a viable option. The idle time of processor cycles across a network can be used effectively by sophisticated distributed computing software. The term parallel processing is used to represent a large class of techniques which are used to provide simultaneous data processing tasks for the purpose of increasing the computational speed of a computer system.

Types of parallelism -
Bit-level parallelism-
From the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in computer architecture was driven by doubling the computer word size, the amount of information the processor can manipulate per cycle. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, where an 8-bit processor must add two 16-bit integers, the processor must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower-order addition; thus, an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would be able to complete the operation with a single instruction.
Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which have been a standard in general-purpose computing for two decades. Not until recently (c. 2003–2004), with the advent of x86-64 architectures, have 64-bit processors become commonplace.
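The two-instruction addition described in the bit-level parallelism example above can be written out directly in C; the sketch below decomposes a 16-bit addition into two 8-bit additions with an explicit carry and checks the result against a native 16-bit addition (the input values are arbitrary).

```c
/* The two-step addition described above, written out in C: a 16-bit sum is
 * built from two 8-bit additions plus a carry, mimicking what an 8-bit
 * processor has to do, and then checked against a native 16-bit addition. */
#include <stdint.h>
#include <stdio.h>

/* Add two 16-bit values using only 8-bit arithmetic and an explicit carry. */
static uint16_t add16_with_8bit_ops(uint16_t a, uint16_t b)
{
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint8_t lo = (uint8_t)(a_lo + b_lo);          /* standard 8-bit addition */
    uint8_t carry = (lo < a_lo) ? 1 : 0;          /* did the low byte overflow? */
    uint8_t hi = (uint8_t)(a_hi + b_hi + carry);  /* add-with-carry step */

    return (uint16_t)((hi << 8) | lo);
}

int main(void)
{
    uint16_t a = 0x12F0, b = 0x0420;
    printf("8-bit style: 0x%04X, native 16-bit: 0x%04X\n",
           add16_with_8bit_ops(a, b), (uint16_t)(a + b));
    return 0;
}
```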



Instruction-level parallelism-
A computer program is, in essence, a stream of instructions executed by a processor. These instructions can be re-ordered and combined into groups which are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture from the mid-1980s until the mid-1990s.
Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on that instruction in that stage; a processor with an N-stage pipeline can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory access, and write back. The Pentium 4 processor had a 35-stage pipeline.

Figure: A canonical five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back)

In addition to instruction-level parallelism from pipelining, some processors can issue more than one instruction at a time. These are known as superscalar processors. Instructions can be grouped together only if there is no data dependency between them. Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use of register renaming) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.

Figure: A five-stage pipelined superscalar processor, capable of issuing two instructions per cycle. It can have two instructions in each stage of the pipeline, for a total of up to 10 instructions being simultaneously executed.

Task parallelism-
Task parallelism is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data". This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism does not usually scale with the size of a problem.

Classes of parallel computers-
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism. This classification is broadly analogous to the distance between basic computing nodes. The classes are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common.
Multicore computing-
A multicore processor is a processor that includes multiple execution units ("cores") on the same chip. These processors differ from superscalar processors, which can issue multiple instructions per cycle from one instruction stream (thread); in contrast, a multicore processor can issue multiple instructions per cycle from multiple instruction streams. IBM's Cell microprocessor, designed for use in the Sony PlayStation 3, is a prominent multicore processor.
Each core in a multicore processor can potentially be superscalar as well; that is, on every cycle, each core can issue multiple instructions from one instruction stream. Simultaneous multithreading (of which Intel's Hyper-Threading is the best known) was an early form of pseudo-multicoreism. A processor capable of simultaneous multithreading has only one execution unit ("core"), but when that execution unit is idling (such as during a cache miss), it uses that execution unit to process a second thread.
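As a hedged illustration of the task parallelism contrasted with data parallelism above, the sketch below runs two entirely different calculations (a sum and a maximum) in separate POSIX threads, which a multicore processor can schedule on different cores; the tasks and data are illustrative only.

```c
/* A sketch of task parallelism with POSIX threads: two different
 * calculations run concurrently over the same data, one per thread. */
#include <pthread.h>
#include <stdio.h>

#define N 1000

static int  data[N];
static long sum_result;
static int  max_result;

static void *sum_task(void *arg)
{
    (void)arg;
    long s = 0;
    for (int i = 0; i < N; i++)
        s += data[i];
    sum_result = s;
    return NULL;
}

static void *max_task(void *arg)
{
    (void)arg;
    int m = data[0];
    for (int i = 1; i < N; i++)
        if (data[i] > m)
            m = data[i];
    max_result = m;
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = i;                             /* simple test data */

    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, NULL);   /* task 1: sum */
    pthread_create(&t2, NULL, max_task, NULL);   /* task 2: maximum */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("sum = %ld, max = %d\n", sum_result, max_result);
    return 0;
}
```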



Symmetric multiprocessing-
A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus. Bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors. "Because of the small size of the processors and the significant reduction in the requirements for bus bandwidth achieved by large caches, such symmetric multiprocessors are extremely cost-effective, provided that a sufficient amount of memory bandwidth exists."
Distributed computing-
A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.
Cluster computing-
A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer. Clusters are composed of multiple standalone machines connected by a network. While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not. The most common type of cluster is the Beowulf cluster, which is a cluster implemented on multiple identical commercial off-the-shelf computers connected with a TCP/IP Ethernet local area network. Beowulf technology was originally developed by Thomas Sterling and Donald Becker. The vast majority of the TOP500 supercomputers are clusters (a small message-passing sketch for such machines is given after this overview of machine classes).
Massively parallel processing-
A massively parallel processor (MPP) is a single computer with many networked processors. MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks (whereas clusters use commodity hardware for networking). MPPs also tend to be larger than clusters, typically having "far more" than 100 processors. In an MPP, "each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect."
Grid computing-
Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, distributed computing typically deals only with embarrassingly parallel problems. Many distributed computing applications have been created, of which SETI@home and Folding@home are the best-known examples. Most grid computing applications use middleware, software that sits between the operating system and the application to manage network resources and standardize the software interface. The most common distributed computing middleware is the Berkeley Open Infrastructure for Network Computing (BOINC). Often, distributed computing software makes use of "spare cycles", performing computations at times when a computer is idling.
Specialized parallel computers-
Within parallel computing, there are specialized parallel devices that remain niche areas of interest. While not domain-specific, they tend to be applicable to only a few classes of parallel problems.
Reconfigurable computing with field-programmable gate arrays-
Reconfigurable computing is the use of a field-programmable gate array (FPGA) as a co-processor to a general-purpose computer. An FPGA is, in essence, a computer chip that can rewire itself for a given task.
FPGAs can be programmed with hardware description languages such as VHDL or Verilog. However, programming in these languages can be tedious. Several vendors have created C to HDL languages that attempt to emulate the syntax and/or semantics of the C programming language, with which most programmers are familiar.
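For the cluster and MPP machines described above, inter-node communication is commonly programmed with a message-passing library such as MPI; the following is a minimal, assumed sketch (not a program from the paper) in which every process contributes a partial value and rank 0 collects the sum.

```c
/* A hedged sketch of message passing on a cluster using MPI. Each process
 * computes a partial value derived from its rank, and MPI_Reduce combines
 * the partial values on rank 0. Compile with an MPI wrapper such as mpicc
 * and launch with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes in the job */

    int local = rank + 1;                   /* stand-in for a local partial result */
    int total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```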


General-purpose computing on graphics processing units (GPGPU)-
General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for computer graphics processing. Computer graphics processing is a field dominated by data-parallel operations, particularly linear algebra matrix operations. In the early days, GPGPU programs used the normal graphics APIs for executing programs. However, several new programming languages and platforms have been built to do general-purpose computation on GPUs, with both Nvidia and AMD releasing programming environments with CUDA and Stream SDK respectively. Other GPU programming languages include BrookGPU, PeakStream, and RapidMind. Nvidia has also released specific products for computation in their Tesla series. The technology consortium Khronos Group has released the OpenCL specification, which is a framework for writing programs that execute across platforms consisting of CPUs and GPUs. AMD, Apple, Intel, Nvidia and others are supporting OpenCL.

Application-specific integrated circuits-
Several application-specific integrated circuit (ASIC) approaches have been devised for dealing with parallel applications. Because an ASIC is (by definition) specific to a given application, it can be fully optimized for that application. As a result, for a given application, an ASIC tends to outperform a general-purpose computer. However, ASICs are created by X-ray lithography. This process requires a mask, which can be extremely expensive. A single mask can cost over a million US dollars. (The smaller the transistors required for the chip, the more expensive the mask will be.) Meanwhile, performance increases in general-purpose computing over time (as described by Moore's Law) tend to wipe out these gains in only one or two chip generations. The high initial cost, and the tendency to be overtaken by Moore's-law-driven general-purpose computing, have rendered ASICs unfeasible for most parallel computing applications. However, some have been built. One example is the peta-flop RIKEN MDGRAPE-3 machine which uses custom ASICs for molecular dynamics simulation.

Vector processors-
A vector processor is a CPU or computer system that can execute the same instruction on large sets of data. "Vector processors have high-level operations that work on linear arrays of numbers or vectors. An example vector operation is A = B × C, where A, B, and C are each 64-element vectors of 64-bit floating-point numbers." They are closely related to Flynn's SIMD classification. Cray computers became famous for their vector-processing computers in the 1970s and 1980s. However, vector processors, both as CPUs and as full computer systems, have generally disappeared. Modern processor instruction sets do include some vector processing instructions, such as AltiVec and Streaming SIMD Extensions (SSE).
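The quoted vector operation A = B × C can be approximated on a modern CPU with the SSE instructions mentioned above; the hedged C sketch below multiplies two 64-element arrays four single-precision elements at a time, which is narrower than the 64-element, 64-bit vectors of a classic vector machine.

```c
/* A small sketch of the vector operation A = B * C using SSE intrinsics.
 * SSE works on 4 single-precision floats at a time, so a 64-element vector
 * is processed in 16 chunks; a classic vector machine would instead apply
 * one instruction to the whole array. */
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

#define N 64

int main(void)
{
    float b[N], c[N], a[N];
    for (int i = 0; i < N; i++) {
        b[i] = (float)i;
        c[i] = 2.0f;
    }

    for (int i = 0; i < N; i += 4) {
        __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 elements of B */
        __m128 vc = _mm_loadu_ps(&c[i]);   /* load 4 elements of C */
        __m128 va = _mm_mul_ps(vb, vc);    /* 4 multiplications in one instruction */
        _mm_storeu_ps(&a[i], va);
    }

    printf("a[10] = %.1f (expected %.1f)\n", a[10], 10.0f * 2.0f);
    return 0;
}
```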
Advantages-
Faster execution time and, therefore, higher throughput.
Disadvantages-
More hardware is required, and with it more power; this approach is not well suited to low-power and mobile devices.

Parallel and distributed computing:
Distributed systems are groups of networked computers which share a common goal for their work. The terms "concurrent computing", "parallel computing", and "distributed computing" have a lot of overlap, and no clear distinction exists between them. The same system may be characterised both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel. Parallel computing may be seen as a particular tightly coupled form of distributed computing, and distributed computing may be seen as a loosely coupled form of parallel computing. Still, it is possible to roughly classify concurrent systems as "parallel" or "distributed" using the following criteria:



Figure: (a), (b) A distributed system; (c) a parallel system

In parallel computing, all processors may have access to a shared memory to exchange information between processors. In distributed computing, each processor has its own private memory (distributed memory), and information is exchanged by passing messages between the processors.
The figure above illustrates the difference between distributed and parallel systems. Figure (a) is a schematic view of a typical distributed system; as usual, the system is represented as a network topology in which each node is a computer and each line connecting the nodes is a communication link. Figure (b) shows the same distributed system in more detail: each computer has its own local memory, and information can be exchanged only by passing messages from one node to another using the available communication links. Figure (c) shows a parallel system in which each processor has direct access to a shared memory.
The situation is further complicated by the traditional uses of the terms parallel and distributed algorithm that do not quite match the above definitions of parallel and distributed systems. Nevertheless, as a rule of thumb, high-performance parallel computation in a shared-memory multiprocessor uses parallel algorithms, while the coordination of a large-scale distributed system uses distributed algorithms.

Conclusion
High Performance Computing is required when problems involve large computations. HPC can be achieved through parallel and distributed processing. Parallel and distributed algorithms have been discussed based on computer architecture, along with the classes of algorithms and the classes of computer architecture. Programming concepts like threads, fork and sockets have been discussed for HPC. This work can be extended to large problems like Grid Computing and Cloud Computing. Usually Fortran is used for HPC; Perl and Java programming are also useful for HPC.



