Benchmarking an Amdahl-balanced Cluster for Data Intensive Computing

Author: Omkar Kulkarni

Supervisor: Dr. Adam Carter

August 19, 2011

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2011

Abstract

Technology has been advancing fast, but datasets have been growing even faster because it has become easier to generate and capture data. These large datasets, referred to as Big Data, are a storehouse of information waiting to be uncovered. The primary challenge in the analysis of Big Data is to overcome the I/O bottleneck present on most modern systems. Sluggish I/O systems defeat the very purpose of having high-end processors; they simply cannot supply data fast enough to utilize all of the available processing power. This results in wastage of power and increases the cost of operating large clusters.

This project evaluates the performance of an experimental cluster called EDIM1 which has been commissioned by EPCC specifically for data-intensive research projects. The cluster architecture is based on some of the recommendations from recent research in the area of data-intensive computing, particularly Amdahl-balanced blades. The idea is to use low-powered processors with high-throughput disks so that the performance gap between them is narrowed down to tolerable limits. The cluster relies on the notion of accessing multiple disks simultaneously to attain better aggregate throughput and also makes use of the latest entrant in the storage technology market: the low-powered, high-performance solid-state drive (SSD).

The project used numerous benchmarks to characterise individual components of a node and also the cluster as a whole. Special attention was paid to the performance of the secondary storage. A custom benchmark was written to measure the balance of the cluster, that is, the extent of the difference in performance between the processor and the I/O subsystem. The tests also included distributed computing benchmarks based on the popular MapReduce programming model using the Hadoop framework. The results of these tests have been very encouraging and show the practicality of this configuration. We demonstrate that not only is this configuration balanced, it is also well suited to meet the scalability requirements of data-intensive computing.

Contents

1 Introduction ...... 1

2 Background...... 4

2.1 The Big Data Challenge ...... 4

2.2 Design Principles for Data-intensive Computing ...... 6

2.2.1 Balanced Systems ...... 7

2.2.2 Scale-up vs. Scale-out ...... 8

2.3 Programming Abstractions ...... 8

2.3.1 MapReduce ...... 9

2.3.2 Hadoop ...... 9

2.4 Energy-efficient Computing...... 10

2.4.1 Solid State Drives ...... 10

2.4.2 Low Powered CPUs ...... 11

2.5 EDIM1: A New Machine for Data Intensive Research ...... 11

3 EDIM1 Benchmarks ...... 16

3.1 Single Node Benchmarks ...... 16

3.1.1 LINPACK ...... 16

3.1.2 FLOPS ...... 16

3.1.3 STREAM ...... 17

3.1.4 IOzone ...... 18

3.2 The Amdahl Synthetic Benchmark ...... 18

3.3 Distributed (MapReduce) Benchmarks ...... 22

3.3.1 TestDFSIO ...... 23

3.3.2 TeraSort ...... 23

4 Performance Analysis...... 25

4.1 Results of CPU and Memory Tests ...... 25


4.2 Results of I/O tests ...... 28

4.3 Results of the Amdahl Synthetic Benchmark ...... 32

4.4 Results of the Distributed (MapReduce) Benchmarks ...... 37

5 Conclusion ...... 40

6 Further Work ...... 42

Appendix A Results of Tests ...... 44

Appendix B Compiling and Running ...... 56

Appendix C Hadoop Configuration ...... 59

References ...... 61


List of Figures

Figure 2-1 Extracting meaningful information from large datasets ...... 5

Figure 2-2 EDIM1: High Level Cluster Architecture ...... 12

Figure 2-3 EDIM1: High level Node Architecture (Amdahl Blade) ...... 13

Figure 3-1 State Transition Diagram for the Amdahl Benchmark ...... 19

Figure 3-2 Pseudo-code for the I/O thread routine ...... 19

Figure 3-3 Pseudo-code for the compute thread routine ...... 20

Figure 3-4 Application and I/O views of the data buffer ...... 21

Figure 4-1 Results of the LINPACK and FLOPS benchmarks ...... 25

Figure 4-2 FLOPs expressed as a percentage of the CPU Clock Frequency ...... 27

Figure 4-3 IOZone: Sequential Read Throughput for Hard Disk Drive (HDD) ...... 29

Figure 4-4 IOzone: Sequential Read Throughput for Solid State Drive (SSD) ...... 29

Figure 4-5 Random and Sequential Read Throughput for a file of 8 GB ...... 30

Figure 4-6 Random and Sequential Write Throughput for a file of 8 GB ...... 30

Figure 4-7 SSD Write Patterns ...... 31

Figure 4-8 Aggregate read throughput for a combination of disks ...... 32

Figure 4-9 Amdahl balance achieved by combining multiple disks ...... 33

Figure 4-10 Variation in CPU utilization with computational intensity (1 HDD) ...... 34

Figure 4-11 Variation in CPU utilization with computational intensity (2 HDDs) ...... 35

Figure 4-12 Variation in CPU utilization with computational intensity (3 HDDs) ...... 35

Figure 4-13 Variation in CPU utilization using different data types ...... 36

Figure 4-14 Variation in aggregate throughput with the number of nodes ...... 37


Figure 4-15 Speed-up graph for the TeraSort benchmark ...... 38


List of Tables

Table 2-1 Values for individual disks ...... 15

Table 2-2 Calculated values for the system ...... 15

Table 3-1 Combination of floating point operations for the FLOPS kernels ...... 17

Table 4-1 Latency and Throughput for Floating Point Instructions [27] ...... 26

Table 4-2 STREAM Results with O4 optimization and double precision arithmetic ...... 28


Acknowledgements

I sincerely thank my supervisor Dr. Adam Carter for his guidance and support throughout the course of this project. His feedback during our weekly meetings was highly informative and invaluable to the success of this project.

I want to thank Adrian Jackson for guiding me while Adam was away for a brief period and Gareth Francis for his prompt responses to my e-mails in spite of his busy schedule.

I want to thank all my professors at EPCC who encouraged me to do my best. Last but not least, I want to thank my family back home, particularly my parents, for inspiring me with their love and unquestioning support during my stay in Edinburgh.

Chapter 1

Introduction

Big Data poses a serious challenge to existing cyber-infrastructures as datasets continue to grow exponentially, almost doubling every year [1]. The cost of storage technologies continues to fall while their sizes have increased several orders of magnitude in the last couple of decades. Organizations have a tendency to make utmost use of their available storage capacity causing scientific and enterprise datasets to run into several hundred terabytes and in extreme cases, several petabytes [2]. On the other hand, the performance of storage devices has neither kept pace with their growing sizes nor with the performance of CPUs, making them a significant constraint in the design of cluster architectures. In such a scenario, conventional solutions prove to be severely inadequate to manage and analyze such large volumes of data. New methodologies have evolved to tackle the challenge of performing computations over massive datasets, spinning off a whole new branch of computing, called data-intensive computing (DIC).

Novel data-intensive architectures have emerged which are optimized for data analysis rather than computational performance. The primary challenge for data-intensive systems is to stream data into the CPUs fast enough to maintain optimum CPU utilization, because idle CPUs result in sub-optimal performance and wastage of power. Since I/O throughput is the bottleneck, balanced systems require the performance of I/O subsystems to be matched evenly with that of processors. The rules for attaining balance in data-intensive architectures were laid down by Gene Amdahl [3], and reviewed again recently [4]. As far back as 1995, Microsoft researcher Jim Gray, one of the pioneers in the field of data-intensive computing, advocated the "cyber bricks" architecture [5] in which each node (brick) is a balanced system in itself, comprised of dedicated storage, processing and networking units.

Studies conducted on the GrayWulf system [1], built along these lines, show that Amdahl's laws for balanced systems are indeed relevant in quantifying the performance of data-intensive computing clusters. The GrayWulf system is extremely power hungry, with each node consuming 1150W of power. A subsequent study [6] leverages the latest in storage technology, solid state drives, in conjunction with low-powered Intel Atom CPUs to build an energy-efficient alternative to the GrayWulf system. EPCC has commissioned an experimental cluster, EDIM1, based on this model to evaluate its usefulness in data-intensive research projects. The purpose of this project is to quantify and understand the performance characteristics of the EDIM1 cluster using a set of benchmarks.

Chapter 2 outlines the changing trends in the area of high-performance computing and the need for scalable data-intensive computing solutions. It reviews the existing literature on the subject and highlights the issues and challenges faced in deploying large-scale applications. It also introduces the solutions to these challenges and lays out principles that can be used to develop such solutions. The last section of the chapter describes the high-level architecture for the EDIM1 cluster, complete with the hardware configuration details.

Chapter 3 provides a brief description of the benchmarks to be used for quantifying and understanding the performance of this novel architecture. It also describes in detail the principle behind a custom benchmark written during the course of this project for measuring the balance of a machine according to Amdahl's principles. Two distributed benchmarks based on the MapReduce [7] programming model also feature in the list of tests; these are used to verify that the system is indeed scalable enough to be used on much larger datasets.

Chapter 4 contains the results of all the benchmarks described in the preceding chapter along with their detailed analysis and interpretations. The first part of this chapter discusses some of the conventional performance criteria for computer systems, namely the CPU performance, memory bandwidth and disk throughput.


A significant portion of the chapter is dedicated to studying the synergism of all these components using the custom benchmark. It demonstrates how a change in the nature of the application (type of data being processed, computational intensity) tilts the balance either in favour of the processor or I/O. It also shows that multiple disks can indeed be used in parallel to increase the overall throughput. The last part of the chapter is dedicated to evaluating the performance of the system as a multi-node cluster using distributed computing benchmarks.

Chapter 5 concludes the report by summarizing the outcome of the tests qualitatively while Chapter 6 discusses the scope for further work on this subject and issues that could not be completely addressed in this report.


Chapter 2

Background

2.1 The Big Data Challenge

Advancement in computer technology has fundamentally changed the nature of science over the last few decades. In addition to the two classical paradigms of experimental and theoretical science, computational science has emerged as the third paradigm of scientific research, driven by the availability of high-end computing power. Real-world phenomena that were hitherto considered too complex for theory or experiment can now be simulated on supercomputers using numerical models. Large-scale simulations on modern supercomputers have resulted in a proliferation of scientific data. In addition, scientists today have highly sophisticated scientific instruments and sensor technologies at their disposal that produce vast amounts of observational data. For example, the ATLAS experiment at the Large Hadron Collider (LHC) undertaken at CERN generated raw data at the rate of 2 petabytes per second in 2008 translating to around 10 petabytes of processed data to be stored per annum [2]. Data being beamed down by satellites had broken the petabyte barrier more than a decade ago [8]. Unravelling the knowledge that is buried deep within this tsunami of data has the potential to become a driving factor behind many scientific breakthroughs. This has resulted in the emergence of data-intensive science as the fourth paradigm [9] in scientific exploration. It involves complex processing and analysis of highly voluminous scientific data to yield results that are concise enough to facilitate visualization and interpretation.

Huge datasets are not an occurrence unique to the domain of scientific applications alone.

Business transactions routinely generate large amounts of heterogeneous data which must be transformed and presented meaningfully in order to aid business decisions and gain competitive advantage. The opportunities presented by effective mining of data have been the driving factor behind the emergence of decision support systems (DSS). Similarly, websites capture vast amounts of information generated by millions of users in log files that can yield insights into the behavioural patterns of users, which can in turn be used to enrich user experience. As of 2009, the social networking site Facebook had 2.5 petabytes of stored data growing at the rate of 15 terabytes per day, while Google had been processing around 20 petabytes of data per day in 2008 [2].

The process for extracting meaningful information from ultra-large datasets is similar, whether it is in the scientific or the business sector; it encompasses a broad range of activities that progressively shrink the total volume of data at each stage to make it more meaningful and easier to interpret. This process is depicted in Figure 2-1.

Figure 2-1 Extracting meaningful information from large datasets

1. Data Capture: The first stage is that of capturing raw inputs from data sources. It is crucial that the data capture devices be able to record information in real time, at the same rate as it is generated.


2. Data Staging: The captured data must be organized in ways that prepare it to be processed. This may include some pre-processing such as cleaning, padding, schematization, addition of metadata, digital curation and warehousing.

3. Data Mining: Once the data has been staged, it can be mined for information using various algorithmic and statistical methods. For timely delivery of results, the processing must be carried out in parallel on multiple compute nodes of a cluster and requires the application of distributed computing methodologies.

4. Data Presentation: When data from all the previous stages has been reduced to manageable proportions, it can be presented to end users with the help of reporting and visualization software. Information can now be understood, interpreted and acted upon by humans.

Such large datasets generated by scientific and business applications are typically referred to as "Big Data". The exponential growth in datasets eventually stretched the limits of existing technology to a point where a data gap was created: our ability to generate raw data far outstripped our ability to analyze and comprehend what we generated, solely due to technical limitations. Tried and tested methods fell severely short of meeting the challenges encountered in dealing with this explosion of data. Data-intensive computing sprouted from the urgent need for a scalable, integrated solution consisting of both the hardware and software required to bridge the gap.

2.2 Design Principles for Data-intensive Computing

Data-intensive computing represents a significant deviation from conventional high performance computing, which typically focuses on maximising the CPU performance measured in terms of the number of floating-point operations per second (FLOPs). HPC has traditionally dealt with problems that fit in the main memory, where the latency of memory chips can be masked by designing applications that exploit data locality and the use of sophisticated multi-level caching.


Big Data applications are characteristically I/O bound because the datasets being processed are too large to fit in the main memory of computing clusters and eventually spill over to secondary storage devices. While disk capacities have increased exponentially in the last decade, the improvement in their data rates has been just about linear. On the other hand, CPU performance has steadily followed Moore's law, hugely overtaking the performance of secondary storage. This acute latency gap between processors and I/O subsystems essentially defeats data locality and caching mechanisms [10].

2.2.1 Balanced Systems

In order to extract optimum performance, the I/O subsystem must be able to transfer data at sustainable rates, fast enough to prevent the CPU from stalling. Such a system is known as a balanced system. Gene Amdahl laid down certain rules of thumb for building such systems more than four decades ago [4], which are still relevant for modern systems. Accordingly, for a system to be balanced, it requires the following (summarised as formulas below the list):

1. One bit of sequential I/O per second for every instruction per second. This is known as the Amdahl Number and can be expressed as the ratio of the I/O bandwidth to the CPU clock rate. The assumption here is that the CPU executes one instruction per clock cycle. However, this may not be true, and certain instructions may take more than one clock cycle to finish executing. FLOPs is a better measure of a CPU's performance. But then again, the value of FLOPs varies according to the nature of the code: the type of operations involved and the numerous optimizations that can possibly be applied to the code being executed.

2. One megabyte of main memory (RAM) per million instructions per second (MIPS), known as the Amdahl memory ratio (α). Interestingly, this law takes into account only the size of the memory and not its bandwidth. The assumption here is that the memory bandwidth is always greater than the I/O throughput, which is true for current technologies.

3. One I/O operation for every 50,000 instructions, called the Amdahl IOPS ratio.
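Written out explicitly, in the form used for the EDIM1 calculations in Section 2.5, the three rules correspond to the ratios

\[
\text{Amdahl Number} = \frac{\text{I/O bandwidth (bits/s)}}{\text{CPU rate (instructions/s)}}, \quad
\alpha = \frac{\text{memory size (MB)}}{\text{CPU rate (MIPS)}}, \quad
\text{Amdahl IOPS Ratio} = \frac{\text{IOPS}}{\text{CPU rate (instructions/s)} / 50000},
\]

and a system is considered balanced when all three ratios are close to unity.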


2.2.2 Scale-up vs. Scale-out

In order to obtain a balanced system, the throughput of the I/O subsystem can be stepped up to match the CPU performance of high-end servers. One way is to attach a high performance storage area network (SAN) to dedicated compute nodes. However, not only is this expensive, it is also not scalable because interconnect performance is not improving fast enough to keep up with the exponentially increasing storage volume [10]. Alternatively, we can try scaling up the I/O throughput to match the server performance using an array of disks and multiple high-end I/O controllers. This is the approach adopted by the architects of the GrayWulf cluster [1]. However, it has some inherent drawbacks: firstly, high-end controllers are expensive and will increase the overall cost of the system. Secondly, the solution is not scalable because there is a high probability of the system exhausting its PCI bandwidth. And lastly, the system will have very high power consumption, further increasing the cost of ownership.

A more sensible approach is to power down the processors to match the disk throughput and scale out to a very large number of nodes. This, in essence, is the "data bricks" model that was originally put forward by Jim Gray [5]. In fact, companies like Google and Facebook have been using large clusters built out of commodity hardware sold in the desktop computing market for processing petabytes of data. At scale, this proves to be extremely cost-effective, almost four times cheaper than high-end server platforms, while incurring a negligible performance penalty [11].

2.3 Programming Abstractions

Since data-intensive computing relies on a fundamentally different set of principles than conventional computing, the programming models themselves have evolved. New models for scalable computing seek to minimize data movement in order to avoid network bottlenecks by bringing computation to where the data are located. In order to ensure delivery of results within realistic time frames, data-centric algorithms are designed to reduce analysis cycles by exploiting opportunities for massive parallelism in data. Scalability comes with the risk of an increased rate of hardware failure.


Hence, software frameworks for data-intensive computing must be fault-tolerant in the event of node breakdowns and must ensure high availability of data in addition to speed of access.

2.3.1 MapReduce

MapReduce [7], the brainchild of Google Inc., has emerged as the dominant programming model for processing large datasets on clusters, clouds and grids using a data parallel approach. It adopts a functional style of programming by expressing computations on data in terms of two user-specified primitives, "map" and "reduce", both of which operate on key/value pairs in successive passes. The map phase processes the input key/value pairs to generate an intermediate set of key/value pairs, which serves as the input to the reduce phase. The reduce phase essentially merges values from the intermediate set that are associated with the same intermediate key and generates an output set of key/value pairs. MapReduce also includes an execution platform that shields programmers from low-level implementation details such as data partitioning, task scheduling and synchronization, load balancing, and recovery from node failures.
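As an illustration of the model only (this is not Hadoop code), the sketch below mimics the two primitives in plain C for the classic word-count task: map emits a (word, 1) pair for every word in an input record, and reduce sums the values that share a key. All names and the in-memory tables are invented for this example; a real framework would partition the input, shuffle the intermediate pairs by key and run many map and reduce tasks in parallel on different nodes.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PAIRS 4096
#define MAX_KEY   64

struct kv { char key[MAX_KEY]; long value; };

static struct kv intermediate[MAX_PAIRS];   /* output of the map phase    */
static int n_intermediate = 0;
static struct kv output[MAX_PAIRS];         /* output of the reduce phase */
static int n_output = 0;

static void emit_intermediate(const char *key, long value)
{
    if (n_intermediate < MAX_PAIRS) {
        snprintf(intermediate[n_intermediate].key, MAX_KEY, "%s", key);
        intermediate[n_intermediate].value = value;
        n_intermediate++;
    }
}

/* map: one input record (a line of text) -> a stream of (word, 1) pairs */
static void map(const char *record)
{
    char buf[256];
    snprintf(buf, sizeof buf, "%s", record);
    for (char *w = strtok(buf, " \t\n"); w != NULL; w = strtok(NULL, " \t\n"))
        emit_intermediate(w, 1);
}

/* reduce: merge all intermediate values that share the same key */
static void reduce(const char *key)
{
    long sum = 0;
    for (int i = 0; i < n_intermediate; i++)
        if (strcmp(intermediate[i].key, key) == 0)
            sum += intermediate[i].value;
    snprintf(output[n_output].key, MAX_KEY, "%s", key);
    output[n_output].value = sum;
    n_output++;
}

static int already_reduced(const char *key)
{
    for (int i = 0; i < n_output; i++)
        if (strcmp(output[i].key, key) == 0)
            return 1;
    return 0;
}

int main(void)
{
    const char *input[] = { "the quick brown fox", "the lazy dog", "the fox" };

    for (size_t i = 0; i < sizeof input / sizeof input[0]; i++)
        map(input[i]);                        /* map phase                  */

    for (int i = 0; i < n_intermediate; i++)  /* group by key, then reduce  */
        if (!already_reduced(intermediate[i].key))
            reduce(intermediate[i].key);

    for (int i = 0; i < n_output; i++)
        printf("%s\t%ld\n", output[i].key, output[i].value);
    return 0;
}
```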

The advantage of using the MapReduce programming model is that it is not required of application developers to be experts in parallel programming. They can simply write sequential code which is automatically parallelised by the framework. Additionally, it minimizes data movement by moving the computation to nodes which host the data. As a result, it does not require the use of expensive high performance interconnects and can run well on low-end clusters that use commodity networking components.

2.3.2 Hadoop

Hadoop [12] is a popular open source implementation of MapReduce written in Java. It was originally developed by Yahoo and is currently being managed by the Apache Software Foundation. The Hadoop MapReduce framework uses a master/slave layout of cluster nodes for executing MapReduce tasks. A single master node is designated as the jobtracker, which serves as the front-end to which users can submit jobs.

The jobtracker splits up a job into a number of map and reduce tasks and assigns each task to a tasktracker for execution. The Hadoop framework comes bundled with its own user-level distributed file system called HDFS (Hadoop Distributed File System). HDFS provides reliable storage of large files by organizing each file as a sequence of equal-sized blocks and replicating them across multiple nodes for fault-tolerance. HDFS follows a master/slave layout similar to that of the Hadoop MapReduce framework, with a single namenode managing the file system namespace through an RPC interface. The remaining nodes on the cluster are designated as datanodes that manage the storage of file blocks and serve requests from clients. The Hadoop MapReduce framework can also be made to work with other types of file systems, such as cloud-based storage and parallel file systems.

2.4 Energy-efficient Computing

Power does not come cheap, and the maintenance of large clusters entails a high cost of ownership. Since extreme scalability is one of the key requirements for data-intensive computing, the power consumption goes up proportionally as facilities grow larger. Alternative solutions have looked at leveraging the latest in hardware technology to bring down power consumption considerably without adversely affecting system throughput. Low-powered Amdahl-balanced blades [6] combine two of the latest innovations from the hardware industry: solid-state drives (SSDs) and energy-efficient processors that are typically found in netbooks.

2.4.1 Solid State Drives

Solid-state drives (SSDs) [13] are the latest in storage technology and are quickly gaining in popularity due to their high performance and low power consumption. SSDs are a class of non-volatile data storage devices built from silicon chips, the same technology underlying main memory chips, instead of the magnetic storage used in hard disks. However, unlike main memory, which loses its state once powered off, SSDs provide persistent storage.


Earlier prototypes required additional battery backup to ensure persistence, but modern SSDs are based on NAND flash technology that does not require a sustained power supply in order to retain its state. Owing to the lack of mechanical parts, namely the rotating platter and read-write head found in HDDs, seek time is completely eliminated, resulting in extremely low latency. It also keeps the power consumption low. Moreover, since data can be accessed directly from any location within the flash memory, the performance is almost independent of access patterns. As a result, SSDs can achieve data rates far greater than conventional HDDs. The only factor that may prevent them from completely replacing HDDs is their high cost. However, they may be used alongside HDDs to complement one another: HDDs where large capacity is the requirement, and SSDs in places where performance is critical.

2.4.2 Low Powered CPUs

With the advent of netbooks and the rising popularity of handhelds and mobile devices, processor manufacturers introduced a whole generation of energy-efficient CPUs. Netbooks are a class of inexpensive, light-weight laptop computers focusing on longer battery life than on outright performance. They serve most generic purposes such as surfing, word processing and media playback, but may not support CPU/GPU intensive applications like computer games. CPUs mounted on such devices are very low on power consumption and the performance is just a fraction of that offered by higher end CPUs operating at similar clock frequencies. For example, the performance of the Intel ATOM N330 is almost half of that exhibited by the Intel Pentium E2140 in almost all the benchmarks[14] at 8 times less power consumption (based on Thermal Design Power rating as given in the spec sheets [15][16]). Both are dual core processors running at 1.6 GHz. In the scale-out model described previously, this means we can achieve four times better performance at the same level of power consumption.

2.5 EDIM1: A New Machine for Data Intensive Research

EDIM1 is an experimental cluster, purpose-built to fulfil the requirements of data-intensive research.


It is the result of a collaboration between EPCC and the School of Informatics, both within the University of Edinburgh. Its purpose is not to compete with high-end machines like HECToR, but to serve as a low-cost, power-efficient alternative for research projects. Currently, it has not been deployed as a full-fledged service and is under evaluation for its effectiveness in handling I/O bound workloads.

Figure 2-2 EDIM1: High Level Cluster Architecture

As shown in Figure 2-2, the architecture of the EDIM1 cluster is a hybrid based on the low-powered Amdahl-balanced bricks and traditional Beowulf clusters. It consists of 120 data processing nodes within three racks of 40 nodes each. Nodes within a rack are interconnected via Gigabit Ethernet. The racks themselves are connected to one another by 10 Gigabit Ethernet using a high-speed switch. Users connect to the system through SSH via a login node that acts as the front-end for the cluster. Figure 2-3 shows the configuration of a compute node, each consisting of a dual-core Intel Atom N330 CPU mounted on a low-powered Zotac mini-ITX motherboard with the NVIDIA ION chipset. The 4GB DDR3 main memory is shared with the NVIDIA ION GPU, which has 16 CUDA-enabled cores.


The secondary storage is comprised of three 2TB (Hitachi Deskstar 7K3000) hard-disk drives and a single 256GB (Crucial RealSSD C300) solid-state drive, all connected to a 3 Gbps SATA controller. There is also a Gigabit Ethernet adapter that is connected to the rack switch. Each node runs CentOS, and the software stack is configurable using the Rocks [17] cluster management suite.

Figure 2-3 EDIM1: High level Node Architecture (Amdahl Blade)

The theoretical values of the Amdahl ratios can be calculated based on the manufacturer specifications [18][19]. The sustainable transfer (read as well as write) rate of the HDD is quoted as 162 MB/s, while the read speed for the SSD is 265 MB/s (on a 3 Gbps SATA interface). Although the SSD is much faster than the HDDs, the aggregate bandwidth cannot be calculated merely as a sum of bandwidths. For that, all the disks would have to finish reading at the same time, meaning that more data must be stored on the SSD than on the HDDs. This is infeasible, because SSDs are expensive and are favoured for their performance rather than storage capacity.

The SSD is primarily meant to be used as a high-speed cache for applications to store temporary files generated during their runtime. The total time required to completely read all the disks simultaneously is the maximum of the time taken to read the individual disks:

The HDDs would take the longest to be read completely,

\[
\text{Time}_{\text{max}} = \frac{2 \times 1024 \times 1024\ \text{MB}}{162\ \text{MB/s}} = 12945\ \text{seconds} \approx 3\ \text{hours}\ 36\ \text{minutes}
\]

Aggregate throughput for 4 disks is therefore,

\[
\text{Throughput}(4) = \frac{\sum_{i=1}^{4} \text{capacity}_i}{\text{Time}_{\text{max}}} = \frac{3 \times 2 \times 1024 \times 1024 + 256 \times 1024}{12945} = 506.25\ \text{MB/s}
\]

Based on this estimate, the Amdahl number is calculated as,

\[
\text{Amdahl Number} = \frac{\text{Throughput (in bits/s)}}{\text{CPU Rate (in cycles/s)}} = \frac{506.25 \times 1024 \times 1024 \times 8}{3.2 \times 10^{9}} = 1.33
\]

\[
\text{Amdahl Memory Ratio} = \frac{\text{Memory (in MB)}}{\text{CPU Rate (in MIPS)}} = \frac{4 \times 1024}{3200} = 1.28
\]

According to the datasheet, the SSD is able to perform reads at a rate of 60 KIOPS. The IOPS for each HDD can be calculated using its average seek time (0.5ms) and latency (4.16ms) as given in the datasheets.

\[
\text{IOPS} = \frac{1000}{\text{Avg. Seek Time (in ms)} + \text{Avg. Latency (in ms)}} = \frac{1000}{0.5 + 4.16} = 214
\]

The aggregate IOPS is therefore 3 × 214 + 60000 = 60642.

\[
\text{Amdahl IOPS Ratio} = \frac{\text{IOPS}}{\text{CPU Rate (in instructions/s)} / 50000} = \frac{60642}{3.2 \times 10^{9} / 50000} = 0.95
\]


Table 2-1 and Table 2-2 summarise all the calculations.

                            HDD      SSD
Capacity (in GB)            2048     256
Read Throughput (in MB/s)   162      265
IOPS                        214      60000

Table 2-1 Values for individual disks

Aggregate Throughput (MB/s)   506.25
Aggregate IOPS                60642
Amdahl Number                 1.33
Amdahl Memory Ratio           1.28
Amdahl IOPS Ratio             0.95

Table 2-2 Calculated values for the system
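The theoretical values in Tables 2-1 and 2-2 can be reproduced with a few lines of arithmetic. The sketch below simply encodes the manufacturer figures quoted above; it is illustrative only and not part of the benchmark suite.

```c
#include <stdio.h>

/* Reproduces the theoretical per-node figures from Tables 2-1 and 2-2 using
 * the manufacturer specifications quoted in Section 2.5 (illustrative only). */
int main(void)
{
    const double hdd_capacity_mb = 2.0 * 1024 * 1024;  /* one 2 TB HDD           */
    const double ssd_capacity_mb = 256.0 * 1024;        /* one 256 GB SSD         */
    const double hdd_rate_mbps   = 162.0;               /* sustained HDD MB/s     */
    const double memory_mb       = 4.0 * 1024;          /* 4 GB RAM               */
    const double cpu_rate        = 3.2e9;               /* 2 cores x 1.6 GHz,
                                                           one instruction/cycle  */
    const double hdd_iops        = 214.0;               /* 1000 / (0.5 + 4.16) ms */
    const double ssd_iops        = 60000.0;

    /* The three HDDs finish last, so they set the total read time. */
    double t_max          = hdd_capacity_mb / hdd_rate_mbps;            /* ~12945 s */
    double aggregate_mbps = (3.0 * hdd_capacity_mb + ssd_capacity_mb) / t_max;
    double aggregate_iops = 3.0 * hdd_iops + ssd_iops;

    double amdahl_number  = aggregate_mbps * 1024.0 * 1024.0 * 8.0 / cpu_rate;
    double memory_ratio   = memory_mb / (cpu_rate / 1.0e6);  /* MB per MIPS        */
    double iops_ratio     = aggregate_iops / (cpu_rate / 50000.0);

    printf("Aggregate throughput : %.2f MB/s\n", aggregate_mbps);
    printf("Aggregate IOPS       : %.0f\n", aggregate_iops);
    printf("Amdahl number        : %.2f\n", amdahl_number);
    printf("Amdahl memory ratio  : %.2f\n", memory_ratio);
    printf("Amdahl IOPS ratio    : %.2f\n", iops_ratio);
    return 0;
}
```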


Chapter 3

EDIM1 Benchmarks

3.1 Single Node Benchmarks

3.1.1 LINPACK

The LINPACK [20] benchmark is a popular benchmark used for ranking the world's fastest supercomputers in terms of their floating point performance. The LINPACK library consists of a set of routines written in FORTRAN for performing numerical linear algebra. The benchmark itself is a specification: to solve a dense system of linear equations using LU factorization with partial pivoting. It is popular because it can report FLOPs values that are very close to the theoretical peak values of machines due to the regular nature of the problem. It is, therefore, a good estimate of the CPU's raw computational power.

3.1.2 FLOPS

The FLOPS [21] benchmark is a simple C program consisting of 8 computational kernels, each executing a different numerical algorithm. Most of the kernels carry out numerical integration of algebraic and trigonometric identities, with the exception of Kernel 2, which estimates the value of π based on the Maclaurin series for tan⁻¹(x). Each kernel contains different proportions of the various floating point operations: FADD, FSUB, FMUL and FDIV. The breakdown of floating point operations per loop iteration for the kernels is given in Table 3-1.


Kernel   FADD   FSUB   FMUL   FDIV   TOTAL
   1       7      0      6      1      14
   2       3      2      1      1       7
   3       6      2      9      0      17
   4       7      0      8      0      15
   5      13      0     15      1      29
   6      13      0     16      0      29
   7       3      3      3      3      12
   8      13      0     17      0      30

Table 3-1 Combination of floating point operations for the FLOPS kernels

FDIV is known to be the most compute-intensive operation of the four listed and too much usage can significantly slow down the code. The kernels are small enough to be held in the CPU cache and mostly rely on register variables to perform calculations. The results therefore best reflect the floating-point performance of the processor.

3.1.3 STREAM

Unlike the previous two benchmarks, the STREAM[22] benchmark is designed specifically to stress the memory system. It measures the sustainable rate at which the memory system can deliver data to the CPU. It uses a set of four simple operations on vectors of length N:

\[
\begin{aligned}
\text{COPY:}  \quad & c[i] = a[i] \\
\text{SCALE:} \quad & c[i] = \mathit{scalar} \times b[i] \\
\text{ADD:}   \quad & c[i] = a[i] + b[i] \\
\text{TRIAD:} \quad & c[i] = a[i] + \mathit{scalar} \times b[i]
\end{aligned}
\]


The lengths of the vectors are selected such that they do not fit into any of the caches so that it may reflect the true performance of the memory system. The performance is measured in MB/sec.
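For reference, the shape of such a kernel is sketched below. This is a simplified triad loop rather than the official STREAM source, and the array length chosen here is arbitrary; it only needs to be large enough that the three arrays overflow every level of cache.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Simplified STREAM-style TRIAD: streams three large arrays through the CPU
 * and reports a sustainable memory rate. Not the official STREAM code. */
#define N (20 * 1000 * 1000)   /* ~160 MB per array of doubles */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    const double scalar = 3.0;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        c[i] = a[i] + scalar * b[i];          /* TRIAD */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* Three 8-byte arrays move through memory: two reads and one write. */
    double mbytes = 3.0 * N * sizeof(double) / (1024.0 * 1024.0);
    printf("TRIAD: %.2f MB/s\n", mbytes / seconds);

    free(a); free(b); free(c);
    return 0;
}
```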

3.1.4 IOzone

IOzone is a benchmark written in C that measures the performance of the file system for a variety of I/O operations. It can be run using different combinations of file sizes and record sizes (smallest unit of data to be fetched by an application). The benchmark can test the I/O system for operations like sequential and random reads/writes. It also supports other operations like reading/writing in reverse and strided data transfers. The benchmark also supports advanced features meant to thoroughly test the I/O and file system performance. However, the purpose of this report is not to test the disk performance alone, but to quantify the performance of the system as a whole and determine its suitability for data-intensive applications. Therefore, only sequential and random I/O have been considered in this report.

3.2 The Amdahl Synthetic Benchmark

All the single node benchmarks listed above give useful insights into the performance of individual components of the system. However, a more holistic benchmark is needed to understand the performance of the cluster in light of Amdahl's laws for balanced systems. In this project, a synthetic benchmark has been written just for this purpose: to measure the Amdahl number for a given node (this value can be extended to the whole cluster since all the nodes are identical). The benchmark is used to study the effect of changes in I/O throughput and the compute intensity of applications on the overall CPU utilization of the node. It is also used to verify that, by accessing multiple disks concurrently, the throughput can be aggregated and an Amdahl number close to unity can be achieved. The code has been written in C and relies on the Pthreads library to implement streaming I/O so that it may be overlapped with computation.


The streaming access pattern has the potential to improve performance in many data-intensive applications over the serialized Read-Wait-Compute pattern, in which the I/O occurs in periodic bursts and threads have to wait for I/O requests to be completed before they can continue processing.

Figure 3-1 State Transition Diagram for the Amdahl Benchmark

Figure 3-1 depicts the state transition diagram for the benchmark code. The program is implemented using a producer-consumer model: given t compute threads and f files, a total of (t + f) buffers are allocated in memory. Initially, all the buffers are marked free. The f producer threads read data from the files and fill up the free buffers continuously until all the files have been read.

Loop until End Of File
    DATA = Find_Buffer(FREE)
    If DATA == NULL
        Begin_Idle_Timer()
        Wait_For_Condition(BUFFER_FREE)
        End_Idle_Timer()
        Go to the start of the Loop
    End If
    Set_Buffer_State(DATA, LOADING)
    Read_From_File(DATA)
    Set_Buffer_State(DATA, READY)
    Broadcast(BUFFER_READY)
End Loop

Figure 3-2 Pseudo-code for the I/O thread routine


REDUCTION = 0
Loop while the I/O threads are active
    DATA = Find_Buffer(READY)
    If DATA == NULL
        Begin_Idle_Timer()
        Wait_For_Condition(BUFFER_READY)
        End_Idle_Timer()
        Go to the start of the Loop
    End If
    Set_Buffer_State(DATA, BUSY)
    Loop from k = 0 to Length_Of(DATA)
        Loop from i = 0 to NUM_OPERATIONS
            REDUCTION = REDUCTION + DATA[k]
        End Loop
    End Loop
    Set_Buffer_State(DATA, FREE)
    Broadcast(BUFFER_FREE)
End Loop

Figure 3-3 Pseudo-code for the compute thread routine

When data is being read into a buffer, its state is set to loading. As soon as a buffer is completely filled, it is marked ready by the producer thread. At this point, the producer broadcasts a signal informing any compute threads which may be waiting for data that a buffer is now available to be processed. The producer then moves on to the next available free buffer and starts reading data into it. Figure 3-2 shows the simplified pseudo-code for the I/O thread routine. If a free buffer is not found at the start of a read cycle, the I/O thread waits until one becomes available. This happens when all the buffers have been read into, but have not yet been processed by the compute threads, i.e. all the buffers are either in the ready or busy states. It is an indication that the application is compute bound and the I/O is capable of streaming data faster than the application logic can process it. The t compute threads follow a similar routine, as shown in the pseudo-code listed in Figure 3-3. When a buffer becomes ready, a consumer (compute) thread picks it up for processing and marks the buffer busy.


Once all the data in that buffer has been processed, the buffer is marked free again so that it can be reused by a producer thread, and a signal conveying this message is broadcast. Likewise, if no buffer is found in a state ready to be processed, the compute thread waits for a signal from the I/O threads.
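The handshake just described can be sketched with POSIX threads roughly as follows. This is a simplified reconstruction rather than the actual benchmark source: there is a single producer and a single consumer, the disk read and the reduction are stubbed out with sleeps, and the idle-time accounting is omitted. It compiles with -pthread.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Simplified reconstruction of the buffer handshake of Figures 3-2 and 3-3:
 * buffer states are protected by a mutex and signalled via condition variables. */
enum state { FREE, LOADING, READY, BUSY };

#define NBUF    4
#define NBLOCKS 16                               /* pretend file of 16 blocks */

static enum state buf_state[NBUF];               /* all start as FREE (0)     */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t buffer_free  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t buffer_ready = PTHREAD_COND_INITIALIZER;

static int find_buffer(enum state s)
{
    for (int i = 0; i < NBUF; i++)
        if (buf_state[i] == s) return i;
    return -1;
}

static void *io_thread(void *arg)
{
    (void)arg;
    for (int block = 0; block < NBLOCKS; block++) {
        pthread_mutex_lock(&lock);
        int i;
        while ((i = find_buffer(FREE)) < 0)       /* compute side is behind   */
            pthread_cond_wait(&buffer_free, &lock);
        buf_state[i] = LOADING;
        pthread_mutex_unlock(&lock);

        usleep(1000);                             /* stands in for a disk read */

        pthread_mutex_lock(&lock);
        buf_state[i] = READY;
        pthread_cond_broadcast(&buffer_ready);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *compute_thread(void *arg)
{
    (void)arg;
    for (int block = 0; block < NBLOCKS; block++) {
        pthread_mutex_lock(&lock);
        int i;
        while ((i = find_buffer(READY)) < 0)      /* I/O side is behind        */
            pthread_cond_wait(&buffer_ready, &lock);
        buf_state[i] = BUSY;
        pthread_mutex_unlock(&lock);

        usleep(1000);                             /* stands in for the reduction */

        pthread_mutex_lock(&lock);
        buf_state[i] = FREE;
        pthread_cond_broadcast(&buffer_free);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t io, comp;
    pthread_create(&io, NULL, io_thread, NULL);
    pthread_create(&comp, NULL, compute_thread, NULL);
    pthread_join(io, NULL);
    pthread_join(comp, NULL);
    puts("done");
    return 0;
}
```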

Figure 3-4 Application and I/O views of the data buffer

From the I/O perspective, every buffer is treated as a stream of bytes. From the point of view of the computations, however, each buffer is an array of elements belonging to a particular data type. The different views are shown in Figure 3-4. The data type can be configured in the code, but recompilation is then necessary. The processing of the buffer occurs inside a loop which carries out a reduction by applying a binary operation to each element in the array a specified number of times. The operation can be any binary operation (arithmetic, bitwise, logical) supported by the C programming language and can also be configured in the source code; again, recompilation is necessary. The size of the buffers, the number of threads, the number of files and the number of times to repeat the operation can all be specified via the command line when running the program.
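The dual view of a buffer can be expressed in C by simply casting the byte stream to the configured element type. The fragment below is a sketch of this idea; ELEMENT_TYPE and OPERATION are illustrative stand-ins for the benchmark's compile-time settings and are not the actual macro names used in the code.

```c
#include <stddef.h>
#include <stdio.h>

/* The I/O side fills the buffer as raw bytes; the compute side reinterprets
 * it as an array of a configured type and reduces it. Illustrative sketch. */
#define ELEMENT_TYPE double
#define OPERATION(acc, x) ((acc) + (x))   /* any binary operation on elements */

static ELEMENT_TYPE process_buffer(const char *bytes, size_t nbytes,
                                   int num_operations)
{
    const ELEMENT_TYPE *data = (const ELEMENT_TYPE *)bytes;  /* compute view */
    size_t nelems = nbytes / sizeof(ELEMENT_TYPE);
    ELEMENT_TYPE reduction = 0;

    for (size_t k = 0; k < nelems; k++)              /* every element...        */
        for (int i = 0; i < num_operations; i++)     /* ...operated on i times  */
            reduction = OPERATION(reduction, data[k]);
    return reduction;
}

int main(void)
{
    double sample[4] = { 1.0, 2.0, 3.0, 4.0 };       /* stands in for file data */
    ELEMENT_TYPE r = process_buffer((const char *)sample, sizeof sample, 2);
    printf("reduction = %f\n", (double)r);
    return 0;
}
```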

The Amdahl number is calculated based on the total amount of data transferred (in bits) and the total number of CPU cycles consumed (based on the time stamp counter) throughout the runtime:

\[
\text{Amdahl Number} = \frac{8 \times \sum_{i=1}^{\text{number of files}} \text{filesize}_i}{\sum_{i=1}^{\text{number of threads}} \text{clockticks}_i}
\]
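On x86 the cycle count can be read directly from the time stamp counter; a minimal sketch of the calculation is shown below, assuming GCC or Clang on x86. The byte count and the placeholder work loop are invented values, not measurements, and in the real benchmark the tick counts would be summed over all threads.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */

/* Minimal sketch of the Amdahl-number calculation (not the benchmark source). */
int main(void)
{
    uint64_t start = __rdtsc();

    /* ... benchmark run: threads stream files and reduce them ... */
    volatile double sink = 0.0;
    for (long i = 0; i < 100000000L; i++) sink += 1.0;   /* placeholder work */

    uint64_t ticks = __rdtsc() - start;                  /* CPU clock ticks   */
    uint64_t bytes_read = 4ULL * 1024 * 1024 * 1024;     /* e.g. one 4 GB file */

    double amdahl_number = (8.0 * (double)bytes_read) / (double)ticks;
    printf("Amdahl number = %.2f\n", amdahl_number);
    return 0;
}
```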

The benchmark records the time spent by all the threads in their wait states. For I/O bound applications, the compute threads would be idle for extended periods (and vice-versa for compute-intensive applications). The time when I/O and computation are overlapped can be expressed as

\[
t_{\mathrm{overlap}} = t_{\mathrm{runtime}} - (\delta_{\mathrm{compute}} + \delta_{\mathrm{IO}})
\]

where

$t_{\mathrm{runtime}}$ = overall application runtime

$\delta_{\mathrm{compute}}$ = average wait time for the compute threads

$\delta_{\mathrm{IO}}$ = average wait time for the I/O threads

Theoretically, in the case of perfect overlap between I/O and computation, $t_{\mathrm{runtime}}$ would be equal to $t_{\mathrm{overlap}}$ and the quantity $(\delta_{\mathrm{compute}} + \delta_{\mathrm{IO}})$ would be 0 (no threads would have to wait). In practice, however, certain minimum wait times are unavoidable due to the overhead caused by thread synchronization. For example, the compute threads inevitably have to wait until the first block of data arrives. The threads also use locks to update buffer states atomically. A good indication of balance is when the wait times for the I/O and compute threads are almost equal ($\delta_{\mathrm{compute}} \approx \delta_{\mathrm{IO}}$) and their sum is just a small fraction (approximately 15%) of $t_{\mathrm{runtime}}$. The benchmark can be used to verify the efficacy of an Amdahl-balanced system in ensuring optimum CPU utilization even for applications with low-intensity computations.
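As a purely hypothetical illustration of this criterion (the numbers below are invented, not measured): for a run with $t_{\mathrm{runtime}} = 30$ s, $\delta_{\mathrm{compute}} = 2.5$ s and $\delta_{\mathrm{IO}} = 2.0$ s,

\[
t_{\mathrm{overlap}} = 30 - (2.5 + 2.0) = 25.5\ \mathrm{s}, \qquad
\frac{\delta_{\mathrm{compute}} + \delta_{\mathrm{IO}}}{t_{\mathrm{runtime}}} = \frac{4.5}{30} = 15\%,
\]

so the two wait times are of comparable size and together account for only a small fraction of the runtime, which is the signature the benchmark reports for a balanced configuration.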

3.3 Distributed (MapReduce) Benchmarks

Apache Hadoop is used as the default framework for running large parallel batch jobs on the EDIM1 cluster by many of the allocated applications. The Hadoop installation comes bundled with a number of benchmarks in the form of the hadoop-test-*.jar and the hadoop-examples-*.jar files. Two of these are used here to study the scaling behaviour of applications when running on multiple nodes.


3.3.1 TestDFSIO

The TestDFSIO benchmark can be used to perform read/write tests on the Hadoop Distributed File System (HDFS). It is a useful tool for stress testing HDFS and identifying potential performance problems. It consists of two tests, Read and Write, which accept the size per file and the number of files as command line arguments. The Read test expects the files to already exist in the file system and therefore must be run after the files have been generated by the Write test. There is also a clean command to clean up the files after the tests have been performed. The tests run as MapReduce jobs, accessing files from a hard-coded directory in the distributed file system. The tests are configured to launch one Map task per file. The statistics reported for N files are the overall throughput per node and the average I/O rate, both of which are calculated as follows:

\[
\text{Throughput}(N) = \frac{\sum_{i=1}^{N} \text{filesize}_i}{\sum_{i=1}^{N} \text{time}_i}
\]

\[
\text{Average I/O Rate}(N) = \frac{\sum_{i=1}^{N} \dfrac{\text{filesize}_i}{\text{time}_i}}{N}
\]

However, for analyzing the performance scalability in a distributed environment, what we are really interested in is the aggregate I/O throughput achieved by the distributed file system. For analysing the results of the tests, the aggregate throughput is calculated as,

\[
\text{Aggregate Throughput}(N) = \frac{\sum_{i=1}^{N} \text{filesize}_i}{\text{time}(N)}
\]
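The difference between the three metrics can be made concrete with a small calculation. The per-file sizes and times below are invented purely for illustration; the point is that the aggregate throughput uses the wall-clock time of the whole job, so it exceeds the per-task figures whenever the map tasks run in parallel.

```c
#include <stdio.h>

/* Illustrates the three TestDFSIO metrics with made-up numbers:
 * 4 files of 1024 MB each, written in slightly different per-task times,
 * with the whole job taking 400 s of wall-clock time. */
int main(void)
{
    const int N = 4;
    double filesize_mb[] = { 1024, 1024, 1024, 1024 };
    double time_s[]      = { 310, 330, 350, 340 };   /* per-file task times */
    double job_time_s    = 400;                      /* wall-clock time(N)  */

    double total_mb = 0, total_s = 0, rate_sum = 0;
    for (int i = 0; i < N; i++) {
        total_mb += filesize_mb[i];
        total_s  += time_s[i];
        rate_sum += filesize_mb[i] / time_s[i];
    }

    printf("Throughput(N)           = %.2f MB/s\n", total_mb / total_s);
    printf("Average I/O rate(N)     = %.2f MB/s\n", rate_sum / N);
    printf("Aggregate throughput(N) = %.2f MB/s\n", total_mb / job_time_s);
    return 0;
}
```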

3.3.2 TeraSort

TeraSort is an extremely popular benchmark in data-intensive computing and has repeatedly been used over the years to show off the performance of large clusters by many competing organizations.

Back in 2008, Yahoo managed to sort a terabyte of data in 209 seconds using 910 nodes [23]. A year later, they sorted a terabyte of data in just 62 seconds using 1460 nodes [24]. The benchmark was originally part of the sort benchmark challenge [25] laid down by Jim Gray. He proposed [26] disk-based sorting as a simple but highly effective performance metric for a system's I/O.

The terasort benchmark suite is part of the hadoop-examples-*.jar package and is comprised of three tools:

1. teragen: used to generate a large number of records in a random order to be sorted. Each record is 100 bytes long; the first 10 bytes of a record form the key based on which the data is to be sorted. The next 10 bytes represent the row number as a decimal integer, followed by filler data containing an arbitrary alphabetic sequence (a sketch of this record layout is given after the list).

2. terasort: the actual sort program, which reads records from the distributed file system, sorts them in a non-decreasing order and writes them out to the file system. It is essentially a distributed sort that uses a sorted list of sampled keys to define the bounds for each reducer.

3. teravalidate: a tool to verify that the output records are in a globally sorted order. It runs through the records ensuring that each key it reads is less than or equal to the next one. If the sorting was successful, no output is generated. Otherwise, it outputs the keys which were found to be out of order.
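To make the record format concrete, the fragment below writes records in the 100-byte layout described above (10-byte key, 10-byte decimal row number, alphabetic filler). It is an illustrative sketch rather than the actual teragen implementation; in particular, the key generation here is simplistic.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch of the 100-byte TeraSort record layout described above.
 * Not the real teragen: the key is just pseudo-random printable bytes. */
#define RECORD_LEN 100
#define KEY_LEN     10
#define ROWID_LEN   10

static void make_record(char *rec, long row)
{
    for (int i = 0; i < KEY_LEN; i++)                       /* 10-byte key     */
        rec[i] = ' ' + rand() % 95;
    snprintf(rec + KEY_LEN, ROWID_LEN + 1, "%010ld", row);  /* 10-digit row id */
    for (int i = KEY_LEN + ROWID_LEN; i < RECORD_LEN; i++)  /* filler          */
        rec[i] = 'A' + (i % 26);
}

int main(void)
{
    char rec[RECORD_LEN + 1] = { 0 };
    for (long row = 0; row < 3; row++) {
        make_record(rec, row);
        printf("%.*s\n", RECORD_LEN, rec);
    }
    return 0;
}
```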

Since disk-based sorting is inherently I/O-bound, the Terasort benchmark thoroughly tests the sequential I/O performance of a system and gives a good idea about its capacity for scaling the overall I/O throughput.


Chapter 4

Performance Analysis

4.1 Results of CPU and Memory Tests

The FLOPS benchmark runs in a single thread and thus exercises just a single core of the CPU. Considering typical data-intensive workloads to be (almost) embarrassingly parallel, the overall floating point performance for the N330 processor can be obtained by doubling the values obtained from the FLOPS benchmark. Also, the overhead of running multiple threads within a node is negligible compared to those caused by the I/O subsystem and inter-node communication. On the other hand, the LINPACK benchmark makes use of all the available cores to solve a system of linear equations.


Figure 4-1 Results of the LINPACK and FLOPS benchmarks


The LINPACK benchmark was run with a matrix of size 10000 × 10000, and used around 800 MB of memory while the FLOPS benchmark was run with default parameters. As seen in Figure 4-1, the recorded FLOPs values are significantly lower than the processor clock rate. This is because the ATOM processor is designed for low power consumption at the expense of floating point performance. It is optimized to execute simple instructions (such as integer addition, comparison and shift) quickly and complex instructions such as division are heavily penalised. As shown in Table 4-1, the instruction latencies1 are almost twice those of high-end processors like the Core i3 while the throughput2 is just half.

Instruction        Latency (ATOM)   Latency (Core i3)   Throughput (ATOM)   Throughput (Core i3)
x87 Instructions
  FADD, FSUB              5                 3                   1                    1
  FMUL                    5                 5                  1/2                   1
  FDIV                   71                21                 1/71                 1/20
XMM Instructions
  ADDSD, SUBSD            5                 3                   1                    1
  MULSD                   5                 5                  1/2                   1
  DIVSD                  60                22                 1/60                 1/22

Table 4-1 Latency and Throughput for Floating Point Instructions [27]

In addition, ATOM supports only in-order execution of instructions, i.e. instructions will be executed by the CPU in the same order that they occur in the instruction stream. This results in lower performance because the execution unit cannot exploit the potential ILP (Instruction Level Parallelism) in the code.

1 Latency of an instruction is the delay in clock cycles for it to execute completely.

2 Throughput is the maximum number of instructions that can be executed in a single clock cycle. It denotes the sustainable rate at which a stream of independent instructions can be executed.


Performance is further inhibited by the fact that the execution unit (FPU) is shared between integer and floating point arithmetic operations. The same FLOPS benchmark was run on a single core of two higher-end processors: the Core i3 (2.27 GHz) on a laptop and the AMD Opteron 1218 (2.6 GHz), which is the NESS front-end. Figure 4-2 shows the results of the tests, with the FLOPs recorded on each of the processors expressed as a percentage of the CPU clock rate. Even in the best case, the ATOM gives a FLOPs value which is only 30% of the processor clock. The other two processors achieve peak FLOPs values very close to their respective CPU clock rates.


Figure 4-2 FLOPs expressed as a percentage of the CPU Clock Frequency

In the context of data-intensive computing, the FLOPs value indicates the data-processing ability of the cores. For double precision operations, a processor can potentially process "FLOPs times 8" bytes of data per second (given that the size of an IEEE 754 double precision floating point number is 8 bytes). Consider the result from Kernel 4 of the FLOPS benchmark, which represents the average-case performance for the ATOM: 600 MFLOPs (combining the two cores). Based on this estimate, the ATOM CPU can process data at an average rate of around 4.5 GB/sec. This value is bound to be even higher for integer and character based operations using 64-bit operations.
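For the figure quoted above, the arithmetic is simply

\[
600 \times 10^{6}\ \text{FLOPs} \times 8\ \text{bytes} = 4.8 \times 10^{9}\ \text{bytes/s} \approx 4.5\ \text{GB/s}
\]

(taking 1 GB = $1024^{3}$ bytes).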


This is more than sufficient to process data-streams for applications which are memory-bound, as seen from the results of the STREAM benchmark (Table 4-2).

Function   Data Rate (MB/sec)
Copy            2215.45
Scale           2209.43
Add             2575.96
Triad           2575.36

Table 4-2 STREAM Results with O4 optimization and double precision arithmetic

For the Triad kernel, the maximum sustainable bandwidth is around 2.5 GB/sec. Even if we replace the ATOM with a faster processor, it does not translate into an increase in the overall throughput of the system because there is an upper bound on the rate at which data gets delivered to the processor. For data-intensive applications, this limitation becomes even more acute because the bottleneck is disk I/O.

4.2 Results of I/O tests

The performance of the two types of disk attached to the node, the HDD and the SSD, was measured using the IOzone benchmark. The performance plots using different file sizes and record sizes are shown in Figure 4-3 and Figure 4-4 respectively. In both cases, the throughput drops sharply for file sizes greater than 4GB, which is the size of the main memory per node. This is because, for smaller files, the overall throughput is affected by the file getting cached in the main memory. Our focus is, therefore, on files larger than the main memory, which is typical for data-intensive applications.

Figure 4-5 shows that the sequential read throughput for both types of drives is independent of the record size. This is due to the fact that the file I/O occurs in the form of block transfers, i.e. the smallest unit of data that can be fetched into memory is a block. When a single record is requested, the entire block containing that record is fetched into memory.

If a single block spans multiple records, they too get transferred during the same request and are already present in memory when the next request is made. This is not true for random access, where successive requests are not guaranteed to be for records in the same block and pre-fetched records may get discarded. As the record size increases, the number of records per block decreases, causing the random access throughput to approach that of sequential access.


Figure 4-3 IOZone: Sequential Read Throughput for Hard Disk Drive (HDD)


Figure 4-4 IOzone: Sequential Read Throughput for Solid State Drive (SSD)



Figure 4-5 Random and Sequential Read Throughput for a file of 8 GB


Figure 4-6 Random and Sequential Write Throughput for a file of 8 GB

The IOPS can be calculated based on the random access rates for 4 KB records:

\[
\text{IOPS} = \frac{\text{Throughput (in KB/s)}}{\text{Record Size (in KB)}}
\]


In the case of the HDD, this works out to 194, which is close to the value arrived at using the latency calculations. Surprisingly, for the SSD it comes to 6380, which is just around 10% of the value quoted in the datasheets. This puts the Amdahl IOPS ratio of the machine at 0.11.

Another unexpected result is that the random write throughput for the SSD is consistently faster than the sequential write throughput, as seen in Figure 4-6. One possible explanation for this has to do with the manner in which SSDs physically organize data storage, as explained in [28]. An SSD is made up of NAND-flash cells which can store 1 or 2 bits of data, depending on the type of cell used. However, the smallest addressable unit of data that can be accessed by a read or write request is a group of multiple cells called a page, which is typically 4 KB in size. Although pages can be read from and written to, they cannot be overwritten; pages need to be erased before being written to. Pages can only be erased in groups of 128 called blocks, which makes each block 512 KB in size.

Let us consider a scenario where a 1 MB record must be written to the SSD. If the record is perfectly aligned with the blocks, i.e. it spans exactly 2 blocks, they would simply be erased and the data written to them. However, if records are misaligned, then a single record spans 3 blocks instead of 2. The record must be written into the blocks in such a way that the data already present in the pages which are not being overwritten remains intact. To ensure this, the data from those adjoining pages is first read and appended to the record. The blocks are then erased and the concatenated data is written into them. Both cases are shown in Figure 4-7.

Figure 4-7 SSD Write Patterns

In case of sequential writes, there would be a dependency between successive records since they would share a block at their adjoining boundary. Therefore, write requests would have to be serialized because the next record cannot be written until the current record has finished being written. The SSD technology relies on parallel operations for its speed, and serializing the requests in this manner is bound to slow it down. This dependency does not exist when the writes are performed randomly because the probability of records from consecutive requests sharing a block is minimal. This limitation may be a result of the file system partition being misaligned, rather than a performance issue with the SSD. Further tests may be needed to verify the effect of partition alignment on SSD write performance.

4.3 Results of the Amdahl Synthetic Benchmark


Figure 4-8 Aggregate read throughput for a combination of disks

The tests in the previous section show that outright CPU performance or even memory bandwidth is not a priority for data-intensive applications because the I/O throughput lags both by a considerable margin, even when using a low-end processor.


It is established that the processor is capable of handling any amount of data it receives from the I/O; the question remains whether the I/O throughput can be maximized so that a greater percentage of CPU time can be utilized effectively.


Figure 4-9 Amdahl balance achieved by combining multiple disks

The most logical way to step up the aggregate I/O throughput is by combining multiple disks and accessing them in parallel. There are many ways to perform parallel I/O; one of them is to use disk striping, in which a single file is striped across multiple disks. This is also known as RAID level 0 and can be configured in software or hardware. Parallel I/O can also be performed at the application level using multithreading, by creating reader/writer threads. These threads hardly take up any CPU time because they mostly remain in a blocked state waiting for I/O requests to be completed. This is the technique implemented in the Amdahl synthetic benchmark. The results in Figure 4-8 show the aggregate throughput achieved using different combinations of disks by reading a 4 GB file from each disk concurrently.

While all the benchmarks in the previous section only quantified the performance of the individual components of a node, the purpose of the Amdahl benchmark is to measure the Amdahl balance for the given configuration.

The ideal value of the Amdahl number is unity and can be achieved by combining the 3 HDDs. The overall Amdahl number of the node is almost equal to the theoretical value calculated in Section 2.5. Figure 4-9 shows the variation in the Amdahl number of the node with respect to the number of disks. It is directly proportional to the aggregate bandwidth. For these tests, both cores of the ATOM processor were utilized to perform computations.


Figure 4-10 Variation in CPU utilization with computational intensity (1 HDD)

The benchmark also provides an option of varying the compute intensity by changing the number of operations performed on each element in the data stream. The data element can be of any type supported by the C language (refer to Figure 3-4 for more details). Figure 4-10, Figure 4-11 and Figure 4-12 show the variation in the CPU utilization as a result of changing the number of operations per element. The length of the bar graph represents the overall application runtime. The "Idle CPU" part of the graph represents the average time for which the compute threads remained blocked waiting for data.

34

„‡‡ ˆ‹ŽŽ‡†Ǥ Š‡ Dz˜‡”Žƒ’dz ’‘”–‹‘ ‘ˆ –Š‡ ‰”ƒ’Š •Š‘™• –Š‡ ’ƒ”– ‘ˆ –Š‡ ƒ’’Ž‹ ƒ–‹‘ runtime when computation and I/O are overlapped.
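For example, in the sample run listed in Appendix A-4 the overall runtime was 32.54 seconds, with 4.10 seconds of compute-thread idle time and 1.08 seconds of I/O-thread idle time; assuming the three segments stack to the total runtime, the overlapped portion works out to roughly 32.54 - 4.10 - 1.08 ≈ 27.4 seconds.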

[Stacked bar chart; series: Idle CPU, Idle I/O, Overlap; y-axis: Application Runtime (Seconds), 0 to 45; x-axis: Number of Operations per Element, 1 to 14]

Figure 4-11 Variation in CPU utilization with computational intensity (2 HDDs)

[Stacked bar chart; series: Idle CPU, Idle I/O, Overlap; y-axis: Application Runtime (Seconds), 0 to 70; x-axis: Number of Operations per Element, 1 to 14]

Figure 4-12 Variation in CPU utilization with computational intensity (3 HDDs)

For the above tests, the compute threads processed the data buffer as an array of double-precision floating point numbers, applying simple addition operations. Initially, the CPU is underutilized since the compute threads spend the greater part of the runtime waiting for data, as shown by the significant Idle CPU time. As the number of operations increases, the CPU utilization improves, shown by the growth in the Overlap portion. Beyond a certain threshold the balance tilts and the application becomes compute bound, which is indicated by an increase in the Idle I/O time.

[Stacked bar chart; series: Idle CPU, Idle I/O, Processing; y-axis: Application Runtime (Seconds), 0 to 160; x-axis: Number of Operations per Element, 1 to 4, grouped by data type (32-bit Integer, 32-bit float, 8-bit character)]

Figure 4-13 Variation in CPU utilization using different data types

Using a single disk, the CPU remained severely underutilized even when the number of operations per element was very high. For example, even with 12 operations the CPU remains idle for approximately 50% of the runtime. Simply by introducing another disk, the CPU utilization improved drastically: the idle period dropped to less than 10% for the same number of operations. If yet another disk is added, the application becomes compute bound. Using 3 disks, the optimal value is around 6 operations per data element. Each element is a double-precision floating point number, which is 8 bytes long. Therefore, in order to keep the CPU sufficiently busy, the optimal number of operations required for every byte of data being read is less than 1.
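For example, at the 3-disk optimum of about 6 operations per 8-byte element, this works out to 6/8 = 0.75 operations per byte of data read.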

This is evident in Figure 4-13, where only the data type was changed, keeping the number of disks constant at 3. When the buffer was processed as an array of characters (1 byte), even a single operation per element seemed to keep the CPU completely utilized; on the other hand, the I/O threads remained in wait states for almost 50% of the time. This is well within the range of data-intensive applications, which seldom perform heavy-duty numerical computation and instead involve activities like sorting, searching, indexing and information retrieval: usually string or integer based operations.

4.4 Results of the Distributed (MapReduce) Benchmarks

The TestDFSIO and Terasort benchmarks were run on the EDIM1 cluster with datasets of 20GB each. The node count was increased progressively to study its effect on the overall throughput of the system. HDFS data nodes were configured to use the 3 SATA HDDs to store the blocks, while the namenode used the SSD for storing the namespace and other metadata. The jobtracker and namenode were configured on the same host while the remaining nodes served as slaves.

[Chart; series: Write, Read; y-axis: Aggregate Throughput (MB/sec), 0 to 180; x-axis: Number of Nodes, 1 to 5]

Figure 4-14 Variation in aggregate throughput with the number of nodes


For the TestDFSIO benchmark the dataset was divided into 20 files of 1GB each. The throughput scaled reasonably as shown in Figure 4-14, but it was not linear. One of the reasons for this is the nature of the test itself. The benchmark does not perform any processing of data; it simply reads from and writes to the distributed file system. The slave nodes are required to query the namenode for block location information before they can perform the actual I/O, even if the blocks are located on the same node. This adds a level of indirection to the I/O. There is also contention because the namenode has to serve multiple requests coming in from various slave nodes. The namenode is only able to serve one request at a time and therefore has to queue the others. Both these factors cause the performance to drop.

[Chart; series: 1 × HDD, 2 × HDD, 3 × HDD; y-axis: Speed-up, 0 to 6; x-axis: Number of Nodes, 1 to 5]

Figure 4-15 Speed-up graph for the TeraSort benchmark

The TeraSort benchmark scaled much better than TestDFSIO, as shown by the speed-up graph in Figure 4-15. The sort was performed on 200,000,000 records (approximately 20GB of data) and the number of reduce tasks was set to 2 per node. For TeraSort there is a considerable amount of compute workload, in spite of it being an I/O-bound problem. It showed almost perfectly linear scaling with respect to the number of nodes.


It was expected that increasing the number of disks would improve performance, but that did not happen: the runtimes remained almost unchanged. A previous study [29] reveals that HDFS has many performance-related issues that need to be addressed in order to utilize its full potential. When a data node has multiple disks attached to it, HDFS simply writes blocks to them on a round-robin basis. Furthermore, the HDFS client code does not exploit the potential for overlap between I/O and computation, particularly during read operations, and instead uses a periodic access pattern. The study recommends decoupling I/O from computation and implementing streaming, pipelining and pre-fetching in order to improve performance. All of these require significant changes to the framework code. In the meantime, it would be interesting to see the impact of striping data across the 3 disks using RAID level 0.


Chapter 5

Conclusion

This project has been an attempt to quantify and analyze the performance characteristics of the EDIM1 cluster. It is a novel architecture that leverages principles laid down decades ago, but which still hold true. It does not try to mask technical limitations; instead, it embraces them with the intention of getting the best performance within those limits. The cluster is targeted primarily at data-intensive research projects within the University of Edinburgh, and therefore the focus of this report has been mainly on I/O throughput, which has for a while now been the root cause of most performance scalability issues. The ATOM processor turns out to be well suited to low-intensity computations and brings CPU performance within the reach of the current generation of I/O controllers. The results from the tests showed that the performance gap can indeed be bridged by interconnecting a number of Amdahl-balanced blades: nodes with low-powered processors reading from high-throughput disks. It was also shown that the disk throughput can be scaled up to attain the required balance by using multiple disks. The actual Amdahl number for this configuration was calculated using the synthetic benchmark, and it almost coincided with the theoretical value calculated from the hardware datasheets. The Amdahl IOPS ratio, though, was not up to the mark. However, this is not a stringent requirement for modern machines, where main memory is large enough to compensate for a low IOPS value.

The project also evaluated Hadoop, one of the most popular frameworks for scalable data-intensive computing, to study the impact of an Amdahl-balanced configuration on distributed performance. It turned out that Hadoop, particularly HDFS, was not able to exploit the increased aggregate I/O throughput. It accessed the drives serially, which defeated the purpose of attaching multiple disks to a node. It seems HDFS prioritizes availability and reliability over performance. The configuration did, however, scale well with the number of nodes, and that would eventually translate into improved performance. The purpose of using the MapReduce programming model is not to extract higher performance from a fixed configuration for a given problem size, but to be able to scale up both of these parameters and still deliver results within reasonable time frames. To this end, it serves its purpose well.

Nonetheless, the Amdahl-balanced configuration is ideal for data-intensive applications; not only is this model scalable, it also has the potential to bring down power consumption costs considerably. The advent of SSDs has been a big leap for I/O technology, which was severely lagging behind processor performance. Some issues related to SSD performance were also discussed, along with ways to address them. As SSD technology matures, the price per GB is sure to drop and it may become possible to use SSDs on a larger scale to boost I/O performance. That would bring down power consumption further, and the improvement in per-disk performance would also benefit MapReduce-style applications.


Chapter 6

Further Work

There is still plenty of scope for further exploration in this area, since the project brings together several relatively new technologies and attempts to make them work in concert. Firstly, we came across performance issues when writing large files to the SSD: the random-access write performance was better than the sequential write performance, which is somewhat counter-intuitive. The prescribed solution to this problem is to create partitions aligned to 4 KB boundaries. This requires further testing and verification.
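As a sketch of such a test (the device name is only an example), a partition starting at 1 MiB is guaranteed to be 4 KB-aligned; it could be created with parted and the IOzone write tests repeated on it:

$ parted -a optimal /dev/sdd mklabel gpt
$ parted -a optimal /dev/sdd mkpart primary 1MiB 100%

Note that mklabel destroys the existing partition table, so this should only be done on a drive whose contents can be discarded.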

There is also scope for improvement in the sequential read performance by implementing RAID level 0, that is, by striping files across all available disks. The synthetic benchmark already showed that a high aggregate throughput is attainable, and it would be interesting to see whether RAID has any impact on the performance of HDFS. The TestDFSIO benchmark could possibly yield better numbers. On that note, the TeraSort benchmark scaled linearly up to 5 nodes for a 20 GB dataset. An article [30] reports that a 100GB dataset was sorted in 236 seconds (roughly 4 minutes) using 16 nodes of Sun Fire X2270 M2 servers, each attached to a single SATA disk. What remains to be seen is the number of Amdahl blades from the EDIM1 cluster required to sort a 100GB dataset in a similar time frame. A comparison of the two results based on cost of equipment, power consumption and simply the aggregate number of clock ticks used could provide many insights into the advantages and limitations of this configuration.
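As a sketch of how such a configuration could be set up (device names and mount point are only examples), the three HDDs could be combined into a software RAID-0 array with mdadm, formatted and mounted before repeating the I/O benchmarks:

$ mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
$ mkfs.ext4 /dev/md0
$ mount /dev/md0 /state/raid0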

Hadoop is also able to work with file systems other than HDFS. Given the performance limitations in HDFS, Hadoop MapReduce applications could be run on top of high-performance file systems like Lustre or Sector. Lustre in particular implements RAID level 0 internally and stripes data not just across disks, but across multiple nodes, similar to the way HDFS distributes blocks. By plugging in a layer which allows Hadoop to communicate with the Lustre file system APIs, information regarding the location of stripes can be conveyed to the MapReduce framework to help with task scheduling on slave nodes.


Appendix A Results of Tests

A-1 Output of the LINPACK Benchmark

Number of equations to solve (problem size): 10000
Leading dimension of array: 10000
Number of trials to run: 1
Data alignment value (in Kbytes): 4
Current date/time: Thu Aug 11 07:56:22 2011

CPU frequency: 1.448 GHz
Number of CPUs: 1
Number of cores: 2
Number of threads: 4

Parameters are set to:

Number of tests : 1
Number of equations to solve (problem size) : 10000
Leading dimension of array : 10000
Number of trials to run : 1
Data alignment value (in Kbytes) : 4

Maximum memory requested that can be used = 800204096, at the size = 10000

======Timing linear equation system solver ======

Size   LDA    Align.  Time(s)   GFlops  Residual      Residual(norm)
10000  10000  4       908.375   0.7341  9.915883e-11  3.496441e-02

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
10000  10000  4       0.7341   0.7341

End of tests


A-2 Output of the STREAM Benchmark

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 10000000, Offset = 0
Total memory required = 228.9 MB.
Each test is run 10 times, but only the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 83662 microseconds.
   (= 41831 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         2215.4501     0.0725     0.0722     0.0735
Scale:        2209.4253     0.0726     0.0724     0.0729
Add:          2575.9648     0.0939     0.0932     0.0973
Triad:        2575.3585     0.0936     0.0932     0.0947
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------


A-3 Output of the FLOPS Benchmark

FLOPS Benchmark on EDIM1 (ATOM N330)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module   Error         RunTime   MFLOPS
  1      -7.6739e-13   0.0673    207.8720
  2      -5.7021e-13   0.0460    152.3135
  3      -2.4314e-14   0.0522    325.3840
  4       6.8501e-14   0.0491    305.5114
  5      -1.6320e-14   0.1265    229.2202
  6       1.3961e-13   0.0585    495.3388
  7      -3.6152e-11   0.1322     90.7832
  8       9.0483e-15   0.0950    315.6428

Iterations      = 256000000
NullTime (usec) = 0.0000
MFLOPS(1)       = 184.3741
MFLOPS(2)       = 172.2405
MFLOPS(3)       = 251.2987
MFLOPS(4)       = 356.9553

FLOPS Benchmark on NESS Front-end (AMD Opteron)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module   Error         RunTime   MFLOPS
  1       4.0146e-13   0.0094   1492.0061
  2      -1.4166e-13   0.0090    781.2871
  3       4.7184e-14   0.0080   2112.4296
  4      -1.2557e-13   0.0074   2029.3709
  5      -1.3800e-13   0.0161   1797.1992
  6       3.2380e-13   0.0139   2085.7102
  7      -8.4583e-11   0.0219    548.2632
  8       3.4867e-13   0.0145   2071.5122

Iterations      = 512000000
NullTime (usec) = 0.0000
MFLOPS(1)       = 984.0010
MFLOPS(2)       = 1067.3262
MFLOPS(3)       = 1600.3117
MFLOPS(4)       = 2076.4229


FLOPS Benchmark on a laptop (Intel Core i3)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module   Error         RunTime   MFLOPS
  1      -8.1208e-11   0.0099   1418.0020
  2      -1.5485e-13   0.0134    520.9302
  3       1.5740e-13   0.0090   1884.7986
  4       9.3701e-14   0.0081   1857.7649
  5      -4.6208e-13   0.0147   1970.5375
  6       3.9450e-13   0.0142   2038.1606
  7      -6.6161e-13   0.0318    377.6043
  8       4.8178e-13   0.0158   1893.4911

Iterations      = 512000000
NullTime (usec) = 0.0009
MFLOPS(1)       = 682.3517
MFLOPS(2)       = 830.4681
MFLOPS(3)       = 1410.1490
MFLOPS(4)       = 1929.3553

A-4 Sample Output of Amdahl Benchmark

$ time ./amdahl --threads=2 --operations=1 \
>     /state/mscs1052544/sdb/data /state/mscs1052544/sdc/data

Clock Frequency: 1.60 GHz
Overall Run Time: 32.54 seconds
Idle Time For Compute Threads: 4.10 Seconds
Idle Time For I/O Threads: 1.08 Seconds
I/O Throughput: 261.87 MB/sec

Amdahl Number: 0.686474
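This value appears consistent with interpreting the Amdahl number as bits of I/O per second divided by instructions per second, assuming one instruction per clock cycle on each of the two cores and 1 MB = 2^20 bytes:

Amdahl number ≈ (261.87 × 2^20 × 8) / (2 × 1.6 × 10^9) ≈ 0.686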


A-5 IOzone results for SSD – Sequential Read (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      919720  1043268  918130  1043940  1015885
8k      1258464  1231324  1074468  1232885  1228745
16K     1120582  1303013  1128547  1307464  1301920
32K     1412649  1406833  1404725  1402607  1406646
64K     1453371  1217547  1461162  1460167  1466049
128K    1461158  1225385  1459633  1460039  1467837
256K    1278506  1260493  1104247  1277156  1276236
512K    1030898  1018558  912085  1029403  995103
1M      900367  874189  809226  895864  895991
2M      783022  869339  869707  872011  872899
4M      880479  784263  874967  874349  873746
8M      782835  785914  874255  873149  874988
16M     883422  784917  869465  875400  870750

File size:   512M  1G  2G  4G  8G
4k      1065277  1063990  1065073  267053  251836
8k      1263758  739581  1312207  265833  266282
16K     1312906  1355968  1386937  267020  265744
32K     1429371  1455623  1473557  274589  265664
64K     1494119  1479624  1566336  266267  266118
128K    1500697  1532535  1595284  265349  266239
256K    1304269  1330897  1367007  276513  265787
512K    1040691  1064484  1083698  266534  266143
1M      908073  921089  931688  266187  266582
2M      886394  744814  909264  280673  265866
4M      884556  898795  910547  281895  266486
8M      888384  898874  906750  266511  265310
16M     887606  899677  909671  281446  264701


A-6 IOzone results for SSD – Sequential Write (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      78277  78759  78674  78722  78959
8k      86252  85032  85463  85462  85241
16K     89623  88532  89594  63430  89234
32K     92113  91622  91446  90465  91057
64K     90482  90373  92132  91982  91536
128K    92082  91714  90998  91840  91202
256K    91899  91486  91437  90516  90876
512K    92332  91628  91451  91724  91679
1M      89698  91622  91991  90880  91645
2M      92235  92186  91705  91532  91580
4M      92389  91534  92116  92058  91738
8M      92220  92194  91772  92037  91501
16M     92009  91333  92045  91011  91490

File size:   512M  1G  2G  4G  8G
4k      93120  109030  116838  119678  118808
8k      101926  116926  127311  132744  133361
16K     105254  127109  136418  145118  144074
32K     108708  133452  140634  146577  148737
64K     115371  134601  142272  148483  151182
128K    113869  134103  141286  146743  149704
256K    108746  133113  140410  148884  147497
512K    113770  132335  150526  150320  150514
1M      113854  134695  142044  150866  149740
2M      115853  134680  148923  151626  152249
4M      115288  132975  148919  147294  148913
8M      109766  136061  152466  149145  151687
16M     114649  135482  149150  150150  151160


A-7 IOzone results for SSD – Random Read (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      845020  830603  827193  813161  808896
8k      1083667  1053973  1081630  1063491  1045965
16K     1208981  1198541  1203094  1202783  1117089
32K     1342838  1343281  1349479  1343365  1338097
64K     1425928  1342297  1428955  1428030  1432206
128K    1447219  1451066  1448984  1448819  1449694
256K    1268500  1267433  1267817  1270313  1273836
512K    1026570  1028720  1026645  1027627  1027822
1M      890485  900496  899243  898627  891398
2M      882525  875938  878204  873626  872934
4M      880434  879323  879500  871738  875436
8M      883183  881217  874897  876195  876463
16M     882525  881031  879607  878799  877446

File size:   512M  1G  2G  4G  8G
4k      814240  805325  811094  37889  25523
8k      1065312  1074459  1075070  61225  41590
16K     1199853  1221909  1227661  90464  62872
32K     1341645  1368582  1409912  131684  90516
64K     1451513  1473998  1507753  168987  123784
128K    1480232  1517210  1565972  197839  152524
256K    1295959  1326524  1356718  219290  166407
512K    1027223  1057453  1080785  251128  203517
1M      907749  922529  932256  281234  234286
2M      886547  899189  910053  304630  253690
4M      887775  896596  898130  316930  265693
8M      888900  900141  912775  315350  271066
16M     889175  901903  913577  309887  270681


A-8 IOzone results for SSD – Random Write (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      121211  123290  117731  118470  119022
8k      139074  137908  133473  132171  132826
16K     138088  143696  141155  140885  140704
32K     151528  152835  145331  145765  145047
64K     149094  145480  150310  144981  111001
128K    149747  149622  148462  145774  145336
256K    145239  145235  145830  145458  144143
512K    152857  147942  143193  145260  143456
1M      151000  147969  146356  146632  142646
2M      148757  149198  146460  145283  147427
4M      147242  145482  146791  145581  147236
8M      150385  145081  149508  146672  146927
16M     152619  143420  149905  144541  147080

File size:   512M  1G  2G  4G  8G
4k      138030  145702  131559  127116  122984
8k      148231  169160  170143  159622  153085
16K     155037  177389  182351  177025  183356
32K     159492  179904  187901  190423  190717
64K     159608  178698  187947  185319  187095
128K    160747  184223  192374  192603  193251
256K    161882  182861  192765  191057  196148
512K    161541  185285  192367  196895  197724
1M      159063  183941  194595  191850  197169
2M      160617  185710  194902  197120  196939
4M      159758  184124  194027  192500  197533
8M      160034  185413  193787  198170  197996
16M     160591  182324  193656  191893  198311


A-9 IOzone results for HDD – Sequential Read (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      662811  791535  877123  985303  1049869
8k      662811  791517  987388  1213115  1260120
16K     663291  792759  1129071  1213225  1312768
32K     663534  991982  1129447  1314941  1432480
64K     665178  993480  1130850  1315561  1501243
128K    667971  996629  1132252  1316632  1502102
256K    673708  799468  992970  1217429  1315243
512K    508977  808165  887241  991503  1053430
1M      523817  683650  806477  886255  905330
2M      557377  710907  747036  849125  885597
4M      481485  646616  780152  824273  871947
8M      468327  634951  771329  819568  893914
16M     575362  612888  755221  854172  888575

File size:   512M  1G  2G  4G  8G
4k      1073273  1050436  1089106  129932  127282
8k      1332577  1337041  1353985  129957  127229
16K     1390226  1395929  1416621  129919  127249
32K     1499675  1405629  1524139  129920  127231
64K     1551359  1582591  1562241  129961  127177
128K    1569602  1579109  1661905  129994  127231
256K    1382877  1346176  1429987  129945  127246
512K    1084377  1110032  1090745  129882  127373
1M      841228  947635  961290  130028  130017
2M      885968  910422  925754  129924  127292
4M      900338  916437  926763  130035  126242
8M      868800  878282  928377  130157  127326
16M     856101  889744  923311  130291  127449


A-10 IOzone results for HDD – Sequential Write (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      47830  57819  54613  48547  48250
8k      56158  52423  56987  51235  51070
16K     57809  64452  58253  51571  53960
32K     59561  54610  59129  51912  53410
64K     59567  65530  54995  51912  54053
128K    53131  66644  59579  56378  55095
256K    57828  66650  59134  57830  53321
512K    57850  65552  57834  57202  55290
1M      57889  65579  55013  56793  52523
2M      57977  66748  59620  52447  53419
4M      58140  63615  59665  57447  53428
8M      53689  60852  60210  52482  55521
16M     59735  63090  59930  56332  52941

File size:   512M  1G  2G  4G  8G
4k      63424  68889  68556  68960  68826
8k      65539  68426  70279  69036  69239
16K     65744  71857  71133  69452  69224
32K     67147  71420  69522  69594  68860
64K     66298  68896  69967  69493  68816
128K    67005  70894  70833  69369  69536
256K    69909  69852  70162  69474  69686
512K    66230  70809  70832  69102  69689
1M      64533  70730  69832  69551  69524
2M      65751  70493  70557  69600  69208
4M      66172  71234  70421  69019  69031
8M      67973  72720  69684  69143  69244
16M     71723  69170  70669  69855  69397


A-11 IOzone results for HDD – Random Read (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      857484  832051  842992  829249  819330
8k      1141428  1117409  1085140  1112344  1102307
16K     1234567  1246128  1245884  1252574  1266396
32K     1383080  1380632  1412688  1443873  1444375
64K     1483523  1480372  1525617  1531895  1557850
128K    1490279  1499822  1544858  1572304  1601720
256K    1294546  1315610  1347752  1362992  1391431
512K    1070500  1071232  1094730  1100639  1115382
1M      928694  927877  938102  946997  954490
2M      906842  905145  914964  920838  928452
4M      905994  872579  915997  922313  928334
8M      907801  907296  913088  922768  919793
16M     908197  906997  917796  922990  930209

File size:   512M  1G  2G  4G  8G
4k      796320  795919  806929  1439  777
8k      1112826  1040385  1099298  2739  1515
16K     1264804  1246251  1246025  5158  2909
32K     1459680  1492413  1463903  9194  5433
64K     1585614  1599875  1605738  15465  9788
128K    1638589  1656300  1617663  25399  15793
256K    1419262  1314718  1427422  38669  28202
512K    1125610  1129463  1098654  63821  46903
1M      959766  930326  959161  94144  69469
2M      901372  937511  937953  124246  92649
4M      902465  938525  935592  148792  111391
8M      918153  937806  934209  162456  124248
16M     895926  910441  941299  168204  129074


A-12 IOzone results for HDD – Random Write (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      56548  57517  56492  56359  54828
8k      60136  60198  59310  59178  57979
16K     61657  61571  61494  60699  58371
32K     61133  62038  62361  61530  60988
64K     52701  58970  62053  61516  62504
128K    60900  62723  62027  61432  59353
256K    54848  57907  61328  61468  60512
512K    61211  62961  61221  61611  60669
1M      52589  61297  61497  61822  61230
2M      61507  62806  60911  61645  61619
4M      62738  62591  62169  60832  60978
8M      56027  62667  61812  61663  61155
16M     62265  62182  59231  61246  59437

File size:   512M  1G  2G  4G  8G
4k      62933  52875  31889  24698  18699
8k      68406  61143  46806  31198  19629
16K     71214  68161  53649  32399  19647
32K     72013  70453  51988  33825  21977
64K     67193  70843  53308  36319  27509
128K    68782  72452  57184  45738  40843
256K    67634  73143  63938  55736  52307
512K    67791  78103  72409  69233  67063
1M      72427  80391  81757  78413  77502
2M      71181  77223  85947  88259  88373
4M      70884  75578  80096  86439  89602
8M      71325  72246  75016  75320  75853
16M     70384  72060  72844  74092  74809


Appendix B Compiling and Running

B-1 LINPACK Benchmark

The precompiled binaries for the Intel Optimized LINPACK Benchmark can be downloaded from: http://software.intel.com/en-us/articles/intel-math-kernel-library-- download/

$ tar xvf l_lpk_p_10.3.4.008.tgz
$ linpack_10.3.4/benchmarks/linpack/xlinpack_xeon64

Input data or print help ? Type [data]/help :

Number of equations to solve (problem size): 10000
Leading dimension of array: 10000
Number of trials to run: 1
Data alignment value (in Kbytes): 4
Current date/time: Thu Aug 11 07:56:22 2011

B-2 FLOPS Benchmark

The C source code for the FLOPS benchmark is available at: http://home.iae.nl/users/mhx/flops.c

$ gcc -o flops flops.c -g -O4 -DUNIX -DROPT
$ ./flops

B-3 STREAM Benchmark

The C source code for the STREAM benchmark is available at: http://www.cs.virginia.edu/stream/FTP/Code/stream.c

$ gcc -o stream stream.c -g -O4
$ ./stream


B-4 IOzone Benchmark

The C source code for the IOzone benchmark is available at: http://www.iozone.org/src/current/iozone3_397.tar

$ tar xvf iozone3_397.tar
$ cd iozone3_397/src/current
$ make linux-AMD64
$ ./iozone -Raz -i 0 -i 1 -n 16M -g 8G -y 4K -q 16M
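For reference, the options select an automatic sweep over the file and record sizes listed in Appendix A: -R produces an Excel-style report, -a enables automatic mode, -z includes the small record sizes even for large files, -i 0 and -i 1 select the write/rewrite and read/re-read tests, -n and -g set the minimum and maximum file size, and -y and -q set the minimum and maximum record size.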

B-5 Amdahl Benchmark

The C source code and make file are provided as a g-zipped tar-ball:

$ make OS=UNIX DATATYPE=double OPERATOR='*'
$ ./amdahl --threads=2 --operations=1 file1 [file2 ...]

The compilation requires the platform (operating system), the data type (each element in the array will be processed as an element of the specified type) and the operator (the operation to be performed on the data) to be supplied as command-line arguments to "make". When running the program, the following options are available:

--threads = number of threads to use. Ideally should be equal to the number of cores

--operations = number of operations to be performed on each element in the array

--bufsize = size of the read buffer

--filesize = the amount of data (in bytes) to be read from each file

The filesize and bufsize values can also be specified using the qualifiers K (kilobytes), M (megabytes) or G (gigabytes).

At least one file must be specified for reading. The file must already exist on the file system. Multiple files can be specified, one for each disk to be used.
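As an illustration (the buffer size and operation count below are arbitrary), a run that uses both cores and reads 4 GB from a file on each of two disks could be invoked as:

$ ./amdahl --threads=2 --operations=4 --bufsize=64M --filesize=4G \
>     /state/mscs1052544/sdb/data /state/mscs1052544/sdc/data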


B-6 Hadoop TestDFSIO benchmark

$ bin/hadoop jar hadoop-test-0.20.203.0.jar \
>     TestDFSIO -Dmapred.reduce.tasks=10 \
>     -write -nrFiles 20 -fileSize 1000

$ bin/hadoop jar hadoop-test-0.20.203.0.jar \
>     TestDFSIO -Dmapred.reduce.tasks=10 \
>     -read -nrFiles 20 -fileSize 1000

B-7 Hadoop Terasort benchmark

The terasort dataset must be generated before the benchmark can be run.

$ bin/hadoop jar hadoop-examples-0.20.203.0.jar \
>     teragen -Dmapred.reduce.tasks=10 \
>     200000000 /user/terasort-input

Since we would not want to repeat this step every time we run the benchmark, we can generate the dataset once and save it to the local file system.

$ bin/hadoop fs -get \
>     /user/terasort-input /state/partition1/terainput

In case we reformat the distributed file system, we can load the data saved locally.

$ bin/hadoop fs -put \
>     /state/partition1/terainput /user/terasort-input

The Hadoop terasort benchmark can be run as follows:

$ bin/hadoop jar hadoop-examples-0.20.203.0.jar \
>     terasort -Dmapred.reduce.tasks=10 \
>     /user/terasort-input /user/terasort-output

For validating the output, we run teravalidate, which should output empty files:

$ bin/hadoop jar hadoop-examples-0.20.203.0.jar \
>     teravalidate -Dmapred.reduce.tasks=10 \
>     /user/terasort-output /user/terasort-validate


Appendix C Hadoop Configuration

C-1 core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-jobtracker-1-1:54324</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/state/partition1/test/tmp</value>
  </property>
</configuration>

C-2 mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-jobtracker-1-1:54325</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/state/mscs1052544/sdb/test/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
    <final>true</final>
  </property>
</configuration>


C-3 hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/state/partition1/test/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/state/mscs1052544/sdb/test/hdfs/data</value>
  </property>
</configuration>
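(The dfs.block.size value of 67108864 bytes corresponds to an HDFS block size of 64 MB.)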

C-4 slaves

hadoop-slave-1-2
hadoop-slave-1-3
hadoop-slave-1-4
hadoop-slave-1-5
hadoop-namenode-1-0

C-5 hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.6.0_16
export HADOOP_HEAPSIZE=1500


References

[1] Szalay, A., Bell, G., Vandenberg, J., Wonders, A., Burns, R., Fay, D., et al. (2009). GrayWulf: Scalable Clustered Architecture for Data Intensive Computing. 42nd Hawaii International Conference on System Sciences (pp. 1-10). IEEE Computer Society Washington, DC, USA.

[2] Kouzes, R., Anderson, G., Elbert, S., Gorton, I., & Gracio, D. K. (2009). The Changing Paradigm of Data-Intensive Computing. Computer , 42, 26-34.

[3] Amdahl, G. (2007). Computer Architecture and Amdahl's Law. IEEE Solid State Circuits Society News , 4-9.

[4] Bell, G., Gray, J., & Szalay, A. (2006). Petascale Computational Systems: Balanced CyberInfrastructure in a Data-Centric World. Computer , 39, 110-112.

[5] Barclay, T., Chong, W., & Gray, J. (2004). TerraServer Bricks Ȃ A High Availability Cluster Alternative. Microsoft Research.

[6] Szalay, A. S., Bell, G. C., Huang, H. H., Terzis, A., & White, A. (2010). Low-power amdahl-balanced blades for data intensive computing. ACM SIGOPS Operating Systems Review , 71-75.

[7] Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI '04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation (pp. 10-10). Berkeley, CA, USA: Usenix Association.

[8] Simmhan, Y., Barga, R., van Ingen, C., Nieto-Santisteban, M., Dobos, L., Li, N., et al. (2009). GrayWulf: Scalable Software Architecture for Data Intensive Computing. Hawaii International Conference on System Sciences (HICSS) (pp. 1-10). IEEE Computer Society Washington, DC, USA.


[9] Szalay, A., Bell, G., & Hey, T. (2009, March 6). Beyond the Data Deluge. Science Magazine , 323, pp. 1297-1298.

[10] Szalay, A., & Blakeley, J. (2009). Gray's Laws: Database-centric Computing in Science. In The Fourth Paradigm: Data-Intensive Scientific Discovery (pp. 5-11). Microsoft Research.

[11] Lin, J., & Dyer, C. (2010). Data-Intensive Text Processing with MapReduce. Maryland: Morgan & Claypool.

[12] Apache Software Foundation. (2011, July 13). Retrieved from Hadoop: http://hadoop.apache.org/

[13] Ekker, N., Coughlin, T., & Handy, J. (2009). Solid State Storage 101 - An introduction to Solid State Storage. San Francisco, CA: Storage Networking Industry Association (SNIA).

[14] CPU Product Benchmarks. (n.d.). Retrieved August 02, 2011, from AnandTech: http://www.anandtech.com/bench/Product/91?vs=70

[15] Intel Corporation. (n.d.). Intel® Atom™ Processor 330. Retrieved August 02, 2011, from Intel: http://ark.intel.com/products/35641

[16] Intel Corporation. (n.d.). Intel® Pentium® Processor E2140 . Retrieved August 02, 2011, from Intel: http://ark.intel.com/products/29738

[17] Rocks Homepage. (n.d.). Retrieved from Rocks Clusters: http://www.rocksclusters.org/wordpress/

[18] Hitachi. (2010, October 28). Hitachi Deskstar 7K3000 - Hard Disk Drive Specification. Retrieved from Hitachi Global Storage Technologies: http://www.hitachigst.com/tech/techlib.nsf/techdocs/6487C1BF0107A2E58825781D0033EE6E/$file/DS7K3000-2TB_andUnder_OEM_manual_1.0.pdf


[19] Micron Technology, Inc. (2010, May 02). RealSSD™ C300 Technical Specification. Retrieved from Crucial: http://www.crucial.com/pdf/Datasheets-letter_C300_RealSSD_v2-5-10_online.pdf

[20] Top500. (n.d.). The Linpack Benchmark. Retrieved from Top500: www.top500.org/project/linpack

[21] Aburto, A. (1992, December 18). The FLOPS Benchmark. Retrieved from http://home.iae.nl/users/mhx/flops.c

[22] McCalpin, J. (n.d.). Stream Benchmark. Retrieved from http://www.streambench.org/

[23] O'Malley, O. (2008, May). TeraByte Sort on Apache Hadoop. Retrieved from http://www.hpl.hp.com/hosted/sortbenchmark/YahooHadoop.pdf

[24] Anand, A. (2009, May 11). Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds. Retrieved from http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte _in_162/

[25] Gray, J. (n.d.). Sort Benchmark Home Page. Retrieved from http://sortbenchmark.org/

[26] A Measure of Transaction Processing Power. (1985). Datamation, 31 (7), 112-118.

[27] Fog, A. (2011, June 08). Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. Retrieved from Agner.org.

[28] Shimpi, A. L. (2009, March 03). The Anatomy of an SSD. Retrieved from AnandTech: http://www.anandtech.com/show/2738/5


[29] Shafer, J., Rixner, S., & Cox, A. (2010, April 19). The Hadoop Distributed Filesystem: Balancing Portability and Performance. International Symposium on Performance Analysis of Systems & Software (ISPASS) , 122-133.

[30] Oracle Corp. (2010, September 27). Sun Fire X2270 M2 Super-Linear Scaling of Hadoop Terasort and CloudBurst Benchmarks. Retrieved from Oracle Blogs: http://blogs.oracle.com/BestPerf/entry/20090920_x2270m2_hadoop

[31] Aggarwal, A., & Vitter, J. (1988, September). The Input/Output Complexity of Sorting and Related Problems. Communications of the ACM , 1116-1127.
