Distributed GPU-Based K-Means Algorithm for Data-Intensive Applications: Large-Sized Image Segmentation Case
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 12, 2017

Hicham Fakhi*, Omar Bouattane, Mohamed Youssfi, Hassan Ouajji
Signals, Distributed Systems and Artificial Intelligence Laboratory
ENSET, University Hassan II, Casablanca, Morocco

Abstract—K-means is a compute-intensive iterative algorithm. Its use in complex scenarios is cumbersome, specifically in data-intensive applications. In order to accelerate the K-means running time for data-intensive applications, such as large-sized image segmentation, we use a distributed multi-agent system accelerated by GPUs. In this K-means version, the input image data are divided into subsets of image data which can be processed independently on GPUs. In each GPU, we offload the data assignment and the K-centroids recalculation steps of the K-means algorithm for massively parallel processing. We have implemented this K-means version on the Nvidia GPU with the Compute Unified Device Architecture. The distributed multi-agent system was written with the Java Agent Development framework.

Keywords—Distributed computing; GPU computing; K-means; image segmentation

I. INTRODUCTION

In our decade, a huge amount of data must be processed continuously by computers to meet the needs of end users in many business areas. A simple search on Google turns up official statistics showing how big the data processed in image processing, web semantics, data storage, profiling and other scientific fields used by Google and Facebook has become. For example, Facebook stores 300 petabytes and processes 600 terabytes per day; it deals with 1 billion users per month, and 300 million photos are uploaded per day. Google stores much more than Facebook: it stores 15 exabytes, processes 100 petabytes per day, indexes 60 trillion pages and performs 2.3 million searches per second. In brief, the data to be processed in many application areas are becoming larger than ever.

In this paper, we focus on image processing and its applications. Understanding images and extracting information from them so that the information can be used for other tasks is an important aspect, for example cancer detection in Magnetic Resonance Imaging (MRI). Such analyses and extraction of useful information from images are ensured by image processing techniques such as image segmentation [6], [7], which is one of the clustering problems. K-means is an unsupervised learning algorithm that solves the clustering problem. It is an iterative algorithm; each iteration consists of two steps, the assignment of data objects and the recalculation of the K centroids.

Nonetheless, there are two important factors to consider when doing image segmentation. First is the number of images to be processed in a given use case. Second is the image quality, which has evolved considerably during the last few years: the number of pixels that make up an image has been multiplied by 200, from 720 x 480 pixels to 9600 x 7200 pixels. This has resulted in a much better definition of the image (more detail visibility) and more nuances in the colors and shades.

Thus, during the last decade, image processing techniques have become cumbersome in computing time for monolithic computers due to the huge number of pixels. This obvious need has naturally led image processing researchers to more powerful computers and to new High-Performance Computing (HPC) strategies based on parallelism and distributed approaches such as 2D or 3D reconfigurable meshes [10], FPGAs, and recently GPUs [8], [9] and Hadoop.

In GPU computing, the most important advance is the Nvidia CUDA (Compute Unified Device Architecture) solution. The Nvidia TITAN X is the fastest GPU at the time of this writing. This GPU has 3584 shader units, also called CUDA cores or elementary processors. It has a base clock of 1417 MHz, which can be boosted to 1531 MHz, and 12 GB of GDDR5 memory with 480 GB/s of memory bandwidth. To obtain more computational power, four TITAN X GPUs can be interconnected with Nvidia's Scalable Link Interface (4-way SLI); the result is a powerful GPU with 14336 CUDA cores and 48 GB of GDDR5 memory which, in collaboration with an Intel Core i7 5960X CPU, can give an interesting optimization not only for image processing but also for many other application domains.

Unfortunately, in some cases the use of multi-GPU systems is not sufficient to obtain high enough computing performance for certain scientific or engineering applications, for instance when these applications have to process a large amount of data and perform complex tasks, as in medical imaging, where an analysis must be performed on large-sized cerebral MRI images using image-processing techniques such as the K-means clustering algorithm. In addition, scalability is not guaranteed and strongly depends on the evolution of the GPU and CPU hardware proposed by Nvidia, AMD and Intel. Thus, using a multi-GPU system on a single node is constrained by hardware limitations; in other words, the computing and data communication capabilities of the processing environment become the dominating bottleneck.

To overcome these limitations, we have studied distributed programming libraries with the objective of combining the GPU computing and distributed computing paradigms.

In the distributed computing paradigm, we found a set of distributed programming libraries and standards, for instance MPI (Message Passing Interface) [13], [14], OpenMP (Open Multi-Processing) [15], [16], HPX [19] and Hadoop [20]. The idea of distributed computing is to combine machines, typically commodity hardware, that can be used to parallelize tasks, as with the libraries and standards cited above, which have been used in many scientific domains [11], [12], [17], [18]. But the limitation of a distributed system lies in the fact that its machines are limited in computing power (number of processors in each machine) and data storage capacity. The scaling of such a system is slow and expensive: to improve its computation power, we have to connect new machines. For example, to add 384 processors to the system, we must connect 48 octa-processor machines.

Additionally, in the distributed computing paradigm, all researchers agree that the challenge is to find a library or framework which provides both ease of programming with a high-level programming language (without memory management or other low-level programming routines) and the best performance exploitation of the hardware. Unfortunately, these two goals are contradictory: some researchers obtained the best performance by using low-level communication libraries known to be error-prone, like MPI (Message Passing Interface) [21], [24] or OpenMP (Open Multi-Processing) [22], [23], while other researchers [25], [26] have used libraries and frameworks with a high-level programming language, which ensures simplicity of programming and portability of the code, although bringing a loss of performance and preventing efficient access to the CPU and GPU due to high-level abstractions of the hardware.

To tackle these problems, we have used a distributed Multi-Agent System (MAS) on GPU-accelerated nodes to accelerate the large-sized image segmentation using the K-means algorithm. The MAS distributed on connected nodes is used to divide the data into a subset of dispatched data through

[31]. From our experience, GPU CUDA and Hadoop implementations are by far the most efficient.

Poteras et al. [27] focused on optimization of the data assignment step of the K-means algorithm. The idea is that, at each iteration, before the data assignment step they add a procedure that determines which of the data objects could be affected by a centroid move. Thus, they no longer need to visit all the data objects to define their membership, but just a small list of data objects.

Fang et al. [4] propose a GPU-based implementation of K-means. This version copies all the data to the texture memory, which uses a cache mechanism. It then uses constant memory to store the K centroids, which is also more efficient than using global memory. Each thread is responsible for finding the nearest centroid of a data point; each block has 256 threads, and the grid has n/256 blocks.

The workflow of [4] is straightforward. First, each thread calculates the distance from one corresponding data point to every centroid and finds the minimum distance and the corresponding centroid. Second, each block calculates a temporary centroid set based on a subset of the data points, each thread calculating one dimension of the temporary centroid. Third, the temporary centroid sets are copied from GPU to CPU, and the final new centroid set is calculated on the CPU.

In [4], each data point is assigned to one thread, and the cache mechanism is used to get high reading efficiency. However, the efficiency could be further improved by other memory access mechanisms such as registers and shared memory.

Che et al. [5] present another optimized GPU-based K-means implementation on a single node. They store all input data in global memory and load the k centroids into shared memory. Each block has 128 threads, and the grid has n/128 blocks. The main characteristic of [5] is the design of a bitmap.

The workflow of [5] is as follows. First, each thread calculates the distance from one data point to every centroid, and changes the suitable bit to true in the bit array, which stores the nearest centroid for each data point.
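The two K-means steps discussed in the introduction (data assignment, then K-centroids recalculation) can be sketched as a minimal sequential Python function. This is an illustrative sketch only, not the paper's CUDA or multi-agent implementation; the function name is ours.

```python
def kmeans_iteration(points, centroids):
    """One K-means iteration: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    # Step 1: data assignment -- nearest centroid by squared distance.
    for p in points:
        nearest = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
        clusters[nearest].append(p)
    # Step 2: K-centroids recalculation -- mean of each cluster
    # (an empty cluster keeps its previous centroid).
    new_centroids = []
    for c in range(k):
        if clusters[c]:
            dims = len(clusters[c][0])
            new_centroids.append(tuple(
                sum(p[d] for p in clusters[c]) / len(clusters[c])
                for d in range(dims)))
        else:
            new_centroids.append(centroids[c])
    return new_centroids, clusters
```

Iterating this function until the centroids stop moving yields the full algorithm; the parallel versions surveyed below all parallelize step 1 and, to varying degrees, step 2.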
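One plausible reading of the candidate-list idea attributed to Poteras et al. [27] is a margin test: a point can only change cluster if the gap between its nearest and second-nearest centroid could be closed by the centroids' largest displacement. The sketch below is our own hedged interpretation; the exact criterion used in [27] may differ.

```python
import math

def candidate_points(points, centroids, max_shift):
    """Return indices of points that might change membership after
    centroids move by at most max_shift (cf. the idea in [27])."""
    candidates = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, c) for c in centroids)
        # A move of at most max_shift can increase the nearest distance
        # by at most max_shift and decrease any other by at most
        # max_shift, so membership is stable when the margin between
        # the two closest centroids exceeds 2 * max_shift.
        if dists[1] - dists[0] <= 2 * max_shift:
            candidates.append(i)
    return candidates
```

Only the returned indices need to be revisited in the next assignment step; all other points provably keep their membership.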
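The three-phase workflow attributed to Fang et al. [4] (per-thread nearest-centroid search, per-block temporary centroid sets, final reduction on the CPU) can be mimicked sequentially. The block partitioning below is an illustrative stand-in for the CUDA grid, not the actual kernel; function and parameter names are ours.

```python
def blockwise_centroids(points, centroids, block_size=256):
    """Mimic the workflow of [4]: each 'block' of points produces a
    temporary centroid set (per-cluster sums and counts), and the final
    centroids are reduced at the end, as on the CPU in the original design."""
    k = len(centroids)
    dims = len(centroids[0])
    partial = []  # one (sums, counts) pair per block
    for start in range(0, len(points), block_size):
        block = points[start:start + block_size]
        sums = [[0.0] * dims for _ in range(k)]
        counts = [0] * k
        # Phase 1: each point finds its nearest centroid.
        for p in block:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            counts[nearest] += 1
            for d in range(dims):
                sums[nearest][d] += p[d]
        # Phase 2: the block's temporary centroid set.
        partial.append((sums, counts))
    # Phase 3: final reduction of the temporary sets.
    totals = [sum(c[cl] for _, c in partial) for cl in range(k)]
    return [tuple(sum(s[cl][d] for s, _ in partial) / max(totals[cl], 1)
                  for d in range(dims)) for cl in range(k)]
```

The per-block partial sums are what make the GPU version efficient: each block reduces its own subset in fast on-chip memory before the small temporary sets are copied back.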
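The bitmap design attributed to Che et al. [5] records, for each data point, which centroid is nearest as a true bit in a bit array. A hedged Python illustration, using a plain n x k boolean matrix in place of a packed bit array (the function name is ours):

```python
def assignment_bitmap(points, centroids):
    """Build an n x k membership bitmap: bitmap[i][c] is True iff
    centroid c is the nearest centroid of point i (cf. [5])."""
    k = len(centroids)
    bitmap = []
    for p in points:
        nearest = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
        row = [False] * k
        row[nearest] = True  # flip the suitable bit to true
        bitmap.append(row)
    return bitmap
```

Each row contains exactly one true bit, so the centroid recalculation step can scan column c of the bitmap to collect cluster c's members without re-running any distance computation.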