
USING CELL B.E. AND SYSTEM-Z TO ACCELERATE

AN IMAGE STITCHING APPLICATION

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

By

Sree Vardhani Malladi

Jawaharlal Technological University

Bachelor of Technology in Information Technology, 2007

August 2009 University of Arkansas

ABSTRACT

Image stitching is application software that combines multiple images with overlapping fields of view to produce a high-resolution composite image. This thesis examines the performance of a parallelizable portion of an image stitching algorithm when it is executed on the Cell Broadband Engine, a multicore chip that has been architected for intensive gaming and high performance computing applications. Performance is compared for the same algorithm running on a PC, an IBM System-z mainframe, an IBM QS21 Cell BE using just the main processor, and a QS21 Cell BE using the main and 16 satellite processors running in parallel.

This thesis is approved for recommendation to the Graduate Council.

Thesis Director:

______Dr. Craig W. Thompson

Thesis Committee:

______Dr. Amy Apon

______Dr. Gordon Beavers

______Dr. Jackson Cothren

THESIS DUPLICATION RELEASE

I hereby authorize the University of Arkansas Libraries to duplicate this thesis when needed for research and/or scholarship.

Agreed ______Sree Vardhani Malladi

Refused ______

ACKNOWLEDGEMENTS

I thank my thesis advisor Dr. Craig Thompson; my thesis committee, Dr. Amy Apon, Dr. Gordon Beavers, and Dr. Jackson Cothren; Dr. David Douglas; and my project mates Hung Bui and Wesley Emeneker for all their help, guidance, encouragement and support in completing the project.

I also thank my father, mother, family and friends for all their support and encouragement during this project.

Finally, I thank IBM for supporting this research project and especially Hema Reddy from the IBM Systems and Technologies University Alliances team for her guidance.

TABLE OF CONTENTS

1. Introduction
1.1 Context
1.2 Problem
1.3 Thesis Statement
1.4 Approach
1.5 Organization of this Thesis
2. Background
2.1 Image Stitching Algorithm
2.1.1 Image Processing to Find Key Points
2.1.2 Image Matching
2.1.3 Image Orientation
2.1.4 Rebuild Image
2.2 System-z and DB2
2.3 Cell BE
2.3.1 Power Processing Element (PPE)
2.3.2 Synergistic Processing Elements (SPE)
2.3.3 QS21 and QS22
2.3.4 Programming the Cell BE
2.3.5 Performance Optimizations of the QS21
3. Architecture
3.1 High Level Design
3.2 Design
3.3 Implementation
4. Methodology, Results and Analysis
4.1 Methodology
4.2 Results
4.2.1 Timing the Overall Application
4.2.2 Timing the Key Point Detection
4.2.3 Timing Results in more detail
4.2.4 Timing the Communication
4.3 Analysis
4.4 Comparison with Recent Results
5. Conclusions
5.1 Summary
5.2 Contributions
5.3 Future Work
References

LIST OF FIGURES

Figure 1: Stitched Image
Figure 2: High Level Block Diagram of Cell Broadband Engine
Figure 3: Block Diagram of PPE
Figure 4: Block Diagram of SPE
Figure 5: High Level Block Diagram of IBM QS21
Figure 6: Architecture Design
Figure 7: Image Extraction Runtime
Figure 8: Key Point Detection Run Time
Figure 9: Communication Time from Z to Cell
Figure 10: Communication Time from Cell to Z


1. INTRODUCTION

1.1 Context

Images often overlap. Image registration involves mapping and matching image features to a common scale and coordinate system so the images can be overlaid to form a larger image or even a model. When there are many images in an image set, one can refer to the process of registering them all into one large mosaic image as image stitching.

Image stitching is important in military and industrial applications. These applications include urban modeling, precision agriculture, disaster management, and medical imaging. In all these applications, we take many images of a location, then match the features and line them up to build an image collage. The consumer market is beginning to explore applications of image stitching – for instance, composing many pictures of the Taj Mahal taken from different points of view into a composite panorama.

Image stitching involves first identifying interesting points (key features) in each image along with their descriptors, then matching key points across images that may overlap, then transforming the images’ scale and coordinate system to create the composite image.

Below is an example of a Stitched Image.


Figure 1 : Stitched Image

1.2 Problem

Image stitching is often a manual process in which humans identify key registration features, and then a computer is used to transform the images’ scale and coordinate systems. A small number of automated image stitching algorithms have been developed by research groups including Google and the University of Arkansas.

The process of matching many images is computationally expensive. Image sets often range from hundreds to thousands of images and may require a database for storing and retrieving the images. Many data sets, like Google Earth, are not refreshed often because of expense, but they could be if the process of image stitching were automated and fast. If a parallel model can accelerate image stitching, then running this application on large image databases can be accomplished more quickly.

1.3 Thesis Statement

The objective of this thesis is to determine how performance changes when an existing single-threaded (non-parallel) implementation of an image stitching algorithm is modified and instrumented so that one of the computationally intensive portions of the algorithm can execute in an integrated environment consisting of an IBM System-z mainframe communicating with a Cell Broadband Engine multicore accelerator. The IBM mainframe is used to store a database of images and key point files, and the Cell BE uses its multicore architecture to run portions of the image stitching algorithm in parallel.

1.4 Approach

Our approach can be seen as a sequence of moves. In this project, we started with an existing image stitching algorithm that runs on PCs; that implementation does not take advantage of parallelism. To take advantage of a mainframe DBMS (IBM DB2) on IBM System-z, where we can store a large collection of images (as BLOBs), we first ported the image stitching algorithm to the mainframe. Then, to take advantage of the availability of a Cell Broadband Engine, we ported a data intensive portion of the image stitching algorithm, the key point extraction phase, to the IBM QS21, a Cell BE-based accelerator, to take advantage of its sixteen synergistic processing units, which are computationally intensive units that provide vector processing. We instrumented our ported system so we could measure the communication, computation, and overall processing times in order to compare the results with the original configuration.

1.5 Organization of this Thesis

Chapter 2 provides the reader with an understanding of the Image Stitching application as well as the architecture and performance of the IBM z900 System-z mainframe and the Cell BE QS21 accelerator. Chapter 3 describes the overall design and implementation of the test configuration and test harness used to measure performance of the image application before and after modification to run on the System-z/Cell BE.

Chapter 4 describes and analyzes the result of measuring performance (in terms of communication time and overall application time) to identify tradeoffs (advantages and disadvantages) in the design. Chapter 5 provides conclusions and suggests future work.


2. BACKGROUND

This section provides the reader with background on the three key concepts of the present project: the image stitching algorithm, the IBM System-z mainframe including the DB2 database system, and the IBM Cell Broadband Engine accelerator.

2.1 Image Stitching Algorithm

Image stitching is the process of combining multiple images with overlapping fields of view to produce a high-resolution image. There are four stages of the stitching process, described in the sections below.

2.1.1 Image Processing to Find Key Points

The first step of the algorithm is to process each individual image to identify key points (also called interesting points). This computationally intensive step is the only step where we have so far explored parallelism using the Cell BE, as will be reported in Chapters 3 and 4.

There are several approaches to finding key points. One approach uses scale invariant feature transformation (SIFT) [1], in which key points that are invariant to scale and orientation are identified. Another approach detects affine invariant key points, in which key points that are invariant to affine transformations (translations, rotation, scaling and shearing) are detected [2].

Lowe’s algorithm to find the interesting points based on SIFT [1] has four major phases: scale-space extrema detection, key point localization, orientation assignment, and key point description.

The scale-space extrema detection step searches over all scales and image locations to identify potentially interesting points that are invariant to scale and orientation. The search is implemented efficiently by using a difference-of-Gaussian function. The image is upsampled by a factor of two and then convolved over the Gaussian blur scales, and candidate key points are detected in the differences of the Gaussians. This is repeated over all the octaves, with the image downscaled by a factor of two at each octave until it reaches one fourth of its original size.
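As a minimal illustration (a C sketch written for this explanation, not the thesis code), one difference-of-Gaussian octave can be formed by pixel-wise subtraction of adjacent Gaussian-blurred levels; the array names are hypothetical, and the blurring itself is assumed to be done elsewhere:

    /* Sketch: one octave of the difference-of-Gaussian pyramid.
     * gauss[s] holds the image blurred at scale s; each DoG level is the
     * pixel-wise difference of two adjacent scales. */
    void dog_octave(float **gauss, float **dog, int levels, int npixels)
    {
        for (int s = 0; s < levels - 1; s++)
            for (int i = 0; i < npixels; i++)
                dog[s][i] = gauss[s + 1][i] - gauss[s][i];
    }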

In the key point localization step, a quadratic surface model is fit at each candidate pixel of the Gaussian scale space (obtained from the above process) over a 3x3 neighborhood around it to determine its location and scale, and key points are retained based on their stability (peak values of the derivatives around the pixel surface) [1].

In the orientation assignment step, one or more orientations are assigned to every key point location depending on the local image gradient directions. In this stage, the image data is transformed into key points, each with orientation(s), scale, and location. At this point, the key point representation is invariant to scale and rotation.

The final step of SIFT is to describe each key point using a 128-dimensional descriptor vector (a 4x4 spatial grid around the key point with 8 orientation bins per cell: 4*4*8 = 128). These descriptor vectors are used in the next stages of the image stitching application.

There are existing implementations of optimized image matching applications that use SIFT or affine invariant features to detect the key point descriptors and then match them accordingly. SIFT++ is one such implementation [3]: Andrea Vedaldi developed this C++-based implementation of SIFT, which has the flexibility to choose the number of octaves (eight by default), which octave to start with, and other options such as processing more than one image. With eight octaves, an image is upsampled once and downsampled seven times. In our implementation, we use Vedaldi’s code as our base and port it onto the QS21 to accelerate the process.

2.1.2 Image Matching

The image matching step of the image stitching algorithm involves identifying all key points that appear in multiple images. Matching is done pair-wise between images over the 128-dimensional descriptor vectors obtained from the extraction stage, and the same matching can then be extended to multiple images. There are several image matching implementations available for clusters, grids, and graphics processing unit (GPU) accelerators [4, 5].

2.1.3 Image Orientation

The image orientation step uses the matched key points to determine the relative orientation of each image to each of its overlapping images. Relative orientation matrices are used in bundle adjustment1 to refine the orientation of each image with respect to the scene. The 3D scene coordinates of each key point matched in two or more images are computed as well.

2.1.4 Rebuild Image

Once the scene structure is developed, each image is projected onto the scene to stitch together a panorama image.

1 “Bundle adjustment is the problem of refining a visual reconstruction to produce jointly optimal 3D structure and viewing parameter estimates.” -- Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon, “Bundle Adjustment - A Modern Synthesis,” Proceedings of the International Workshop on Vision Algorithms: Theory and Practice, pp. 298-372, September 21-22, 1999.


2.2 System-z and DB2

The mainframe used in this application for storing the images is IBM’s System z900. The IBM z-Series mainframes are descendants of the IBM 360, 370 and 390 mainframes; the z stands for “zero down time.” IBM z-Series mainframes are architected for enhanced flexibility, availability and reliability. Redundant I/O interconnect and enhanced driver maintenance help to protect not just the mainframe but also its applications from unplanned down time.

The current z900 is based on the 64-bit z/Architecture, which reduces memory and storage bottlenecks. The z900 has two processor clusters, each with six to ten processing units (PUs) [7], and a large instruction and data cache size and bandwidth [8]. The z/Architecture has 64-bit registers, integer instructions, branch instructions, address generation, control registers, I/O architecture, operands, and general registers that are used for cryptographic instructions [7].

The z900 series supports DB2, COBOL, C, C++, Windows and more. The two most common ways of running Linux on the z-Series are z/VM and the Integrated Facility for Linux. In the z/VM mode, each Linux system runs in its own virtual machine. System-z is capable of multiple levels of virtualization, including virtualization of the CPU, memory, and network. Using CPU virtualization, one physical machine can be divided into multiple logical partitions (LPARs), where each LPAR is a separate virtual machine running its own operating system; one can run tens of Linux images on the same LPAR [9]. How the real hardware resources are allocated to each virtual machine is specified by the tuning controls in z/VM (z Virtual Machine) [9].

One advantage of the System-z virtual Linux machine [10] is improved scalability: z/VM can support real storage from 2 GB to 256 GB. Linux on System-z also reduces cost (due to virtualization), improves service, and manages risk.

IBM’s DB2 is relational database management software. Images in DB2 can be stored as Large Objects (LOBs), Binary Large Objects (BLOBs), or Double-Byte Character Large Objects (DBCLOBs). When creating tables in DB2, we can use one table per table space, multiple table spaces, or even partitioned table spaces [11]. DB2 on System-z provides database security and data scalability along with constant availability.

2.3 Cell BE

The IBM Cell Broadband Engine (Cell BE) is a heterogeneous multicore chip that addresses three major design challenges: memory latency, power, and frequency.

Source: IBM Corporation, CELL architecture tutorials, March 2009

Figure 2 : High Level Block diagram of Cell Broadband Engine

Figure 2 shows the architecture of the Cell BE chip, which includes two kinds of processor elements: a main processor called the Power Processing Element (PPE) and eight satellite processors called Synergistic Processor Elements (SPEs). The PPE is a 64-bit implementation of IBM’s PowerPC Reduced Instruction Set Computer (RISC) architecture and is used for program control and operating system functions. Each Synergistic Processor Element (SPE) consists of a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC).

The Cell BE is based on Single Instruction Multiple Data (SIMD) architecture and has two sorts of storage: main storage associated with the PPE and local storage (LS) for each of the SPEs.

The PPE and SPE processing elements are connected to each other, and to the input/output and memory interfaces, via an Element Interconnect Bus (EIB) [12]. The following subsections describe these parts in more detail.

2.3.1 Power Processing Element (PPE)

The Power Processing Element is a 64-bit IBM PowerPC architecture processor that can run both 32-bit and 64-bit instructions and applications. It is the main processor and is mainly used for program control, system and memory management, the operating system, task management, and managing the SPEs. It supports multithreading and integrated vector multimedia operations [13].

The PPE has an instruction unit, an execution unit, and a vector/scalar unit. The instruction unit has a level 1 (L1) instruction cache, instruction buffers, and dependency checking logic; it is responsible for instruction fetch, issue, decode, branch, and completion. The execution unit has an integer execution unit and a load/store unit; it takes care of all the fixed-point instructions along with load and store instructions. The vector/scalar unit consists of a floating-point unit, vector multimedia extension units, and individual instruction queues; it increases the overall throughput of the processor and is responsible for the floating-point and vector operations [14].

Source: IBM Corporation, CELL architecture tutorials, March 2009

Figure 3 : Block Diagram of PPE

Figure 3 is a block diagram of a PPE showing the Power Processing Unit (PPU) and its two-level cache. The PPE consists of the Power Processing Unit and the power processor storage subsystem (PPSS) [15]. Instruction and execution control are handled by the PPU. It includes a set of 64-bit PowerPC registers, 32 128-bit vector registers, a 32-KB level 1 (L1) instruction cache, a 32-KB level 1 (L1) data cache, an instruction-control unit, a load and store unit, a fixed-point integer unit, a floating-point unit, a vector unit, a branch unit, and a virtual-memory management unit. The PPSS consists of a unified 512-KB level 2 (L2) instruction and data cache, queues, and a bus interface unit that handles bus arbitration and pacing on the EIB. Delayed execution pipelines are used by the PPE in order to prevent out-of-order execution while improving the performance of the system.

2.3.2 Synergistic Processing Elements (SPE)

The eight synergistic processing elements (SPEs) of each Cell BE chip are the accelerating units of the system. As shown in Figure 4, each SPE consists of a synergistic processing unit (SPU), a memory flow controller (MFC), a local store (LS) of size 256 KB, and a file of 128-bit registers. Figure 4 is a block diagram of a Synergistic Processing Element.

Source: IBM Corporation, CELL architecture tutorials, March 2009

Figure 4 : Block Diagram of SPE

Unlike the PowerPC registers, the SPU’s SIMD registers can operate on both integer and floating-point instructions, and on both scalar and vector operations. The SPEs can be controlled by the PPE to initiate actions like start, stop, signal, and interrupt, and they can work independently or in coordination with the PPE or other SPEs.

Every SPU has its own program counter and fetches its instructions and loads and stores its data from its own local store. An SPU can issue two instructions per cycle, one for computation and another for memory access. A memory access operation can also be initiated by the PPU or another SPE.

The 256-KB local store is the only memory directly addressable by the SPU. The main memory cannot be accessed by the SPU directly; data must be copied explicitly from the PPU main memory to each SPU local store. A DMA operation handles both address translation and moving the data.

The SPU interacts with the other elements of the processor through its memory flow controller via the Element Interconnect Bus (EIB). To load or store data into its local store, the SPU makes asynchronous DMA requests, in a pipelined fashion, to the DMA controllers in the memory flow controller, which communicates with the other connected elements over the EIB.

2.3.3 QS21 and QS22

QS21

As shown in Figure 5, the IBM QS21 is a Cell BE-based blade server consisting of two high performance Cell BE processors connected to each other in a symmetric multiprocessing (SMP) fashion and delivered as a single-wide (one rack unit) blade that supports the Linux operating system. The two processors are connected by a coherent BIF bus via the south bridges, and a dual-channel Broadcom Gigabit Ethernet controller supports a fault-tolerant network setup [18].

Source: IBM QS21 performance, March 20, 2009

Figure 5 : High level Block Diagram of IBM QS21

Advantages of the Cell BE include the following [from 17]:

 Performance acceleration of computationally intensive workloads.

 Automatic server recovery and restart, and recovery after boot hang.

 High performance in applications like image/signal processing and graphics rendering.

QS22

The QS22 is a next-generation Cell BE blade server based on the PowerXCell 8i processor. The QS22 improves over the QS21 by adding enhanced double-precision floating-point support, which yields application results faster and with more reliability [19]. Like the QS21, the QS22 is a single-wide system that supports Linux. It has two 3.2 GHz PowerXCell 8i processors and 32 GB of DDR2 memory, along with two Gigabit Ethernet ports.

2.3.4 Programming the Cell BE

From the programming point of view, programming the Cell BE’s SPEs is similar to working with Linux threads. The SDK contains libraries that assist in managing the code running on the SPEs and communicating with this code during execution. Writing a program for the Cell BE involves the following steps [20]; a minimal sketch of the PPE side is shown after the list.

1. The PPE program starts and loads the SPE program into the SPE’s local store.

2. The PPE program instructs the SPEs to execute the program.

3. The SPE transfers required data to its local store from the PPE main memory (under control of the SPE program).

4. The SPE program processes the data.

5. The SPE program transfers the processed data from its local store to the PPE main memory.

6. The SPE notifies the PPE of program termination.
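The following is a minimal PPE-side sketch of this flow using the SDK’s libspe2 API; it is illustrative only, and the embedded SPE program handle name (sift_spu) is hypothetical rather than taken from the thesis code.

    /* PPE-side sketch: create an SPE context, load the embedded SPE program,
     * and run it in a POSIX thread (spe_context_run blocks until the SPE stops). */
    #include <libspe2.h>
    #include <pthread.h>
    #include <stdio.h>

    extern spe_program_handle_t sift_spu;   /* hypothetical, produced by embedspu */

    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)  /* steps 3-5 run inside the SPE program */
            perror("spe_context_run");
        return NULL;
    }

    int main(void)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);   /* step 1 */
        spe_program_load(ctx, &sift_spu);
        pthread_t t;
        pthread_create(&t, NULL, run_spe, ctx);                 /* step 2 */
        pthread_join(t, NULL);                                   /* step 6: SPE has stopped */
        spe_context_destroy(ctx);
        return 0;
    }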

Both the PPE and SPEs can be programmed in C or C++ using a common API that the SDK libraries provide. The Power Processing Element manages each SPE as a POSIX thread, and IBM’s Cell SDK libspe2 library provides the SPE management for these processes. Compiler tools, including make, are used for embedding SPE executables into the PPE executable. Both the PPE and SPEs can execute SIMD instructions, but they use different instruction sets.

The PPE supports both the Vector/SIMD (Single Instruction Multiple Data) multimedia extension instruction set and the PowerPC instruction set. Generally, it is preferred to use the eight SPEs to perform the massive SIMD operations and let the PPE program manage the application flow; however, it may be useful in some cases to handle some SIMD computation on the PPE. The operating system kernel running on the PPE manages virtual memory, along with mapping each SPE’s local store and problem state into their respective effective-address spaces [16].

Each SPE has a SIMD instruction set, 128 vector registers, two in-order execution units, and no operating system. The SPU’s local store is unprotected and untranslated storage with no virtual memory; thus, safeguarding against stack overflow and illegal accesses outside of array bounds is handled on the PPU side. An illegal instruction halts the SPU.

Data transfers are done with the help of DMA requests. Data must be moved between PPE main memory and the 256 KB SPE local stores with explicit DMA commands [16]. Each DMA transfer can move only 16 KB at a time; if the programmer needs to transfer more than 16 KB, then she has to use the DMA list command. A DMA list can contain up to 2,048 DMA transfers of up to 16 KB each. The DMA commands transfer data from an effective address to the local store, or from the local store to an effective address, depending on whether a get or a put command is issued; each transfer moves a contiguous block of the requested size starting at the given effective address or local store address.
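As a small illustration of these constraints (a sketch, not the thesis code), the following SPU-side fragment pulls a 64 KB block from main memory in 16 KB pieces with mfc_get and waits on a tag group before using the data; the buffer name and sizes are hypothetical, and the effective address is assumed to be suitably aligned.

    /* SPU-side sketch: fetch 64 KB of floats in 16 KB DMA chunks, since a
     * single DMA transfer is limited to 16 KB. */
    #include <spu_mfcio.h>

    #define CHUNK   16384                     /* hardware limit: 16 KB per DMA transfer */
    #define NCHUNKS 4                         /* hypothetical: fetch 64 KB in total     */

    static float ls_buf[NCHUNKS * CHUNK / sizeof(float)] __attribute__((aligned(128)));

    void fetch_block(unsigned long long ea)   /* ea: aligned main-memory address */
    {
        unsigned int tag = 1;
        for (int i = 0; i < NCHUNKS; i++)
            mfc_get((char *)ls_buf + i * CHUNK, ea + (unsigned long long)i * CHUNK,
                    CHUNK, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);         /* select tag group 1            */
        mfc_read_tag_status_all();            /* block until all chunks arrive */
    }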

2.3.5 Performance Optimizations of the QS21

Performance of the QS21 depends on the performance of the Cell BE, which in turn depends upon the performance of the SPUs. The performance of a QS21 server is expected to be good, as it combines two Cell BE chips.

The performance of the SPU can be predicted because the DMA transfers are pipelined. Even though the SPUs need to DMA all the data they use, the impact of the DMA operations is limited: instruction prefetch delivers at least seventeen instructions sequentially from the branch target, DMA operations are buffered, and they can access the LS at most once every eight cycles [15].

The local store of the SPU is filled and emptied only by DMA operations. The bandwidth of the local store and of DMA transfers is as follows [15]:

 16-bytes-per-cycle load and store bandwidth, quadword aligned only

 128-bytes-per-cycle DMA-transfer bandwidth

 128-byte instruction prefetch per cycle

The Memory Flow Controller (MFC) maintains queues of DMA commands to support DMA transfers. Once a DMA command has been queued to the MFC, the MFC can autonomously execute a sequence of DMA transfers while the SPU continues to execute instructions. This overlap of SPU instructions and DMA commands makes it easier to schedule DMA transfers so that memory latency is hidden [15]. The page and segment tables of the PPU take care of the system storage attributes.

The Element Interconnect Bus (EIB) can transfer 16 bytes of data every bus cycle. As each address request can transfer up to 128 bytes, the theoretical maximum data bandwidth on the EIB is 204.8 GB per second (at a frequency of 1.6 GHz: 1.6 GHz x 128 bytes = 204.8 GB/s) [15].

3. ARCHITECTURE

3.1 High Level Design

Figure 6, shown below, provides a high level view of the architecture we are evaluating. The IBM System z900 (mainframe) stores and manages a collection of images using the DB2 relational database management system. A QS21 Cell BE accelerator is accessed across a network. The image stitching application can run on the mainframe, on the Cell BE’s main PPU alone, on the PPU with portions of the algorithm distributed over the sixteen SPUs, or on a standard Intel processor (not shown). The objective of this study is to compare the performance of the SIFT portion of the image stitching algorithm on these four machine configurations.

Figure 6 : Architecture Design

3.2 Design

This design is developed to demonstrate that the Cell BE and System-z, integrated together, can perform image stitching for large data sets and may also accelerate the algorithm.

To implement the first step of the design, one needs to determine whether the two environments, the System z900 and the QS21, can talk to each other, and also to instrument the timing of their communication and tune it accordingly. The communication between the Cell and the System-z should be bi-directional, and either one can act as client or server.

As a first step, the first phase of the image stitching algorithm, image extraction, is implemented. Image extraction is computationally intensive, taking about 40% of the time of the entire algorithm, and needs to be accelerated. The mapping of this part onto the System-z and the Cell BE is done in the following way.

The images that need to be stitched are placed in DB2 on the System-z via JDBC. Then, depending on the requirement, the desired images are retrieved and sent over for processing. These images are stored as BLOBs (binary large objects) inside DB2. TIFF libraries and ImageMagick libraries are used for chunking and formatting the images required by the application before storing them into the database.

A BLOB can store up to 2 GB of data, so images bigger than that need to be chunked before storing. Working with very large images without chunking them is also not recommended, because the local store on the Cell BE has memory limitations. These images can be stored in tables and managed with the help of hash tables for storage and retrieval.

For example, when this application is used for larger image sets, such as thousands of images, more storage space is needed to send and receive data back and forth to the mainframe. In the same way, when stitching larger images, more space is needed on the Cell to store the images and the key point files. So, eventually, chunking or striping the images is more advantageous.

These images can be of any format, including .TIFF or .PIF. For now, .TIFF files are tiled and stored; later, depending upon the required application, they are retrieved, converted into specific formats, and used.

During implementation, however, this subsection of retrieving images from DB2 is not used; instead, images are loaded into main memory via the file system.

Once the image storing step is done, portions of the image stitching algorithm are moved onto the Cell to accelerate the application. The transfer of images and data to and from the Cell processor is done using socket programming. Different sizes of data were transferred and timed.
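As a minimal sketch of this style of transfer (in C, with a hypothetical receiver address; not the thesis socket code), the sender connects a TCP socket and streams the image buffer, prefixed by its length so the receiver knows how many bytes to expect:

    /* Client-side sketch: send an image buffer over TCP, length first. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int send_image(const char *ip, int port, const unsigned char *img, uint32_t len)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons((uint16_t)port);
        inet_pton(AF_INET, ip, &addr.sin_addr);        /* e.g. the QS21's address */
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(fd);
            return -1;
        }

        uint32_t netlen = htonl(len);                  /* length prefix, network order */
        write(fd, &netlen, sizeof(netlen));

        uint32_t sent = 0;                             /* stream the image bytes */
        while (sent < len) {
            ssize_t n = write(fd, img + sent, len - sent);
            if (n <= 0) { close(fd); return -1; }
            sent += (uint32_t)n;
        }
        close(fd);
        return 0;
    }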

On the Cell side, the focus so far is just on image extraction, the first step in image stitching, because it is the most computationally intensive part of the algorithm and takes a relatively long time. Upon receiving images, the Cell BE passes them as parameters to the extraction program to find the key points and compute the descriptor vectors.

The transferred images pass through the four stages of the extraction process: scale-space extrema detection, key point localization, orientation assignment, and key point description. At the beginning, the MATLAB version of Lowe’s SIFT detector was used, but due to MATLAB licensing issues on the Cell BE and System-z, the C++ version by Andrea Vedaldi is used as the base code. This open source C++ image extraction code [3] is what runs on the mainframe and the Cell BE.

As the algorithm’s stages are sequential and interdependent, data parallelism is implemented within each stage: for the computationally intensive inner loops, the image is split into strips and processed over the available SPEs in parallel.

As each SPU has a local store of 256 KB and is programmed with statically sized buffers, one needs to work on portions of the image that best fit the local store, because utilizing the maximum available storage space increases the performance of the system. Depending on the size of the image, before computing over the SPEs one needs to assess what portion of the image is sent over, how many times the SPEs need to be called to complete the given task, and how many times each SPE needs to DMA data.
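A simple sketch of this kind of bookkeeping (hypothetical sizes, not the thesis code) computes how many image rows fit in a local-store data buffer, how many strips that implies, and how many 16 KB DMA transfers each strip needs:

    /* PPE-side sketch: plan image strips for the SPEs. */
    #include <stdio.h>

    #define NUM_SPES  16                /* QS21: two Cell BE chips, 8 SPEs each         */
    #define LS_BUDGET (128 * 1024)      /* hypothetical local-store bytes left for data */
    #define DMA_MAX   16384             /* 16 KB per DMA transfer                       */

    int main(void)
    {
        int width = 1024, height = 1024;                /* test image size             */
        int row_bytes = width * (int)sizeof(float);     /* image stored as floats      */
        int rows_per_strip = LS_BUDGET / row_bytes;     /* rows that fit in the budget */
        int strips = (height + rows_per_strip - 1) / rows_per_strip;
        int dmas_per_strip = (rows_per_strip * row_bytes + DMA_MAX - 1) / DMA_MAX;

        printf("rows/strip=%d strips=%d (about %d per SPE) DMAs/strip=%d\n",
               rows_per_strip, strips, (strips + NUM_SPES - 1) / NUM_SPES, dmas_per_strip);
        return 0;
    }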

Using image strips to run the application in parallel is challenging, because the algorithm first upsamples the image to twice its size, so the memory requirement doubles; the allocation of the strips and their sizes must therefore be accounted for, as a lot of memory is required during the scale spacing.

Before attempting the parallel architecture, the image extraction portion of the algorithm was compiled and run on the mainframe alone, on the PC alone, and then on the Power Processing Element of the Cell BE processor without using the SPEs. Then, based on profiling, parts of it were selected to run on the synergistic processing elements to accelerate the process.

The first phase produces files with a “.key” extension; the names of the files are the same as the names of the input image files. These files hold the locations and the 128 x n descriptor vectors for the key points of the image that are used for matching the images.

The matching phase of the algorithm does not care about the order of the descriptor vectors in the file, so the writes can be done in any order, but synchronization is needed because all write operations for an image go to the same file. These files can be used directly for matching or stored back on the System-z for later use, depending upon the application. The files are sent back to the System-z using the socket programs.

We open both client and server ports in both environments, and the sockets communicate with each other whenever a step completes.

3.3 Implementation

As mentioned in the design phase, the images are stored and retrieved in both the file system and the DB2 database of the System-z. Images are read into memory, then transferred across the network to the QS21 blade server using sockets. Initially, the sockets were implemented in C and later in Java. Once the blade server receives an image into its main memory, portions of that image are in turn transferred to its sub-processors (SPUs) to run SIFT in parallel on strips of the image.

The performance of the Cell is highest if data is 128-byte aligned during data transfers; at a minimum, the data must be 16-byte (quadword) aligned. So the data needs to be 128- or 16-byte aligned before transfers between the PPE and the SPEs. The chunk sizes for the data transfers should also be powers of 2; for this reason, we divided the images into power-of-2 chunks for processing.
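A minimal sketch of allocating such buffers on the PPE side (not the thesis code) uses posix_memalign so the start address meets the 128-byte preference, with the size padded to a multiple of 16 bytes:

    /* PPE-side sketch: allocate a 128-byte aligned, 16-byte padded buffer for
     * DMA-friendly transfers (static buffers can use __attribute__((aligned(128)))). */
    #include <stdlib.h>
    #include <string.h>

    float *alloc_chunk(size_t n_floats)
    {
        void *p = NULL;
        /* round the byte count up to a multiple of 16 so DMA sizes stay legal */
        size_t bytes = (n_floats * sizeof(float) + 15u) & ~(size_t)15u;
        if (posix_memalign(&p, 128, bytes) != 0)
            return NULL;
        memset(p, 0, bytes);
        return (float *)p;
    }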

As the SPEs are treated as POSIX threads, threads, along with direct memory access (DMA) techniques, are used for creating SPE contexts and for transferring data between the main Power Processor Element and the SPEs, respectively. As parallelism is implemented by distributing data, we look for the available SPEs and try to utilize as many as possible. Since the QS21 has 16 SPUs (8 per Cell BE chip), context creation, invocation, and waiting are done on all 16 SPEs each time so that every SPE finishes its task.

The synergistic processing elements access the PPU main memory with the help of DMA. Each DMA can transfer only 16 KB of data, so if larger sizes are needed, the algorithm must break big chunks into small enough pieces. For this application, as the image is stored in the 'float' data type, the maximum data transfer per DMA is 4,096 floating-point values, i.e., 16,384 bytes.

The data transfer could use single or double buffering or DMA lists. The present implementation used single buffering.
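For reference, a double-buffered variant (a sketch under the same SDK assumptions as above, not the thesis implementation) alternates two local-store buffers and two tag groups so that the DMA for the next chunk overlaps computation on the current one:

    /* SPU-side sketch: double buffering with two buffers and two tag groups. */
    #include <spu_mfcio.h>

    #define CHUNK 16384
    static float buf[2][CHUNK / sizeof(float)] __attribute__((aligned(128)));
    static float checksum;                                /* placeholder result */

    static void process_chunk(const float *data, int n)   /* hypothetical per-chunk work */
    {
        for (int i = 0; i < n; i++)
            checksum += data[i];
    }

    void stream_chunks(unsigned long long ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);                 /* prefetch chunk 0      */
        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                                  /* start the next fetch  */
                mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);                         /* wait for current only */
            mfc_read_tag_status_all();
            process_chunk(buf[cur], (int)(CHUNK / sizeof(float)));
            cur = nxt;
        }
    }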

The Gaussian scale spacing portion of the image extraction was transferred to the SPEs for parallel processing. Specifically, the CopyAndDownSample step of Lowe’s extraction was used first. This portion of code takes an image of size n x n and downsamples it to a quarter of its size; two 'for' loops traverse all the points of the image to resize it. Even though CopyAndDownSample does not take significant time during computation, it was selected as a starting point.
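A plain-C sketch of this kind of downsampling (keeping every other pixel in each direction; written for illustration, not copied from Vedaldi’s code) shows the loop nest whose rows can be divided among the SPEs:

    /* Sketch: copy-and-downsample an n x n image to (n/2) x (n/2) by taking
     * every other pixel in each direction. */
    void copy_and_downsample(const float *src, float *dst, int n)
    {
        int half = n / 2;
        for (int y = 0; y < half; y++)          /* output row y reads input row 2*y */
            for (int x = 0; x < half; x++)
                dst[y * half + x] = src[(2 * y) * n + (2 * x)];
    }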

For this, the sizes of the source and destination are required, so the size of the image and the number of iterations assigned to each SPE are calculated beforehand and the information is sent to the SPEs with the help of control blocks. The transfer size differs for every loop, as the image size is different for each level of each octave. The CopyAndUpSample function takes an image of size n x n and quadruples its size; in this case the source image to be transferred is smaller than the destination. Care has to be taken that the source, destination, and SPE program instruction sizes together do not exceed the 256 KB limit of an SPE local store.

This implementation can be extended to the rest of the Gaussian scale space, including CopyAndUpSample, and then to the scale-space extrema detection. The convolution function was ported to the SPUs because it takes about 40% of the execution time when we run the image processing stage. This phase was implemented in several different ways, which are compared with each other below.

In the very first implementation, the data is transferred with simple DMA transfers and then put back in the same way; the Cell SDK's mfc_get and mfc_put are used to get data from and put data back to main memory, using single buffering. As SIFT++ [3] uses 1D convolution, the data is put back transposed: after the data is put back, it has to be transposed, and with this transposed array as input the 1D convolution is called again. This process is followed at all levels of each octave.

Thus, a transpose step was added to the existing code after each convolution call. This was later extended to use double buffering with the same DMA transfers. In both the single- and double-buffered versions the data is taken row-wise, and the SPU local store was not completely utilized.

In an improved implementation, convolution was re-implemented as two-dimensional (2D) convolution using an existing vectorized 2D convolution provided in the image libraries of the Cell SDK. In theory, 2D convolution takes more time than 1D convolution, but the unoptimized transpose was so expensive that 2D convolution became much faster than 1D convolution plus transposition. There are existing 3x3, 5x5, 7x7 and 9x9 convolutions for single-precision floating point, unsigned integers, etc.; as the base code uses single-precision floating point, we used the existing '1f' convolutions.

For this we needed to calculate the weight matrix from the value of the standard deviation sigma, and as a 13x13 '1f' kernel is needed for the last octave, the existing convolution was extended to 13x13. These convolution functions take care of clamping and wrapping data at the borders, but the data above and below the scanline is the programmer's responsibility.
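A minimal sketch of building such a weight matrix (a normalized 2D Gaussian of a given odd size for a given sigma; not the SDK's or the thesis code) follows:

    /* Sketch: fill a size x size (size odd, e.g. 13) Gaussian weight matrix for
     * standard deviation sigma, normalized so the weights sum to 1. */
    #include <math.h>

    void gaussian_weights(float *w, int size, float sigma)
    {
        int r = size / 2;
        float sum = 0.0f;
        for (int y = -r; y <= r; y++)
            for (int x = -r; x <= r; x++) {
                float v = expf(-(float)(x * x + y * y) / (2.0f * sigma * sigma));
                w[(y + r) * size + (x + r)] = v;
                sum += v;
            }
        for (int i = 0; i < size * size; i++)   /* normalize */
            w[i] /= sum;
    }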

To account for the scanline, we padded extra rows above the data when sending the first few rows of the image and padded rows below when sending the last few rows.

The resulting image was put back row-wise into the destination buffer on the PPU using DMA transfers. As the convolution is 2D, we need to call the convolution only once; this cuts the number of convolution calls in the main algorithm to half of the original.

The first, 1D implementation was then re-implemented using DMA lists, in which a scatter/gather technique is used. Here the data is DMAed in the same way, but instead of one row at a time, four rows of data are transferred at a time.

This data is then processed and put back into the destination buffer on the PPU using the DMA list command mfc_putl. Here we fill a DMA list that holds how much data needs to be transferred and where to transfer it; if data must be transferred to n places, the DMA list needs to be twice n in size. In this case, as each data transfer is the size of one row, a DMA list of twice the row count is used.
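A sketch of building and issuing such a list on the SPU side is shown below; it assumes the SDK's mfc_list_element_t layout and uses hypothetical row geometry, so it should be read as an illustration rather than the thesis code.

    /* SPU-side sketch: scatter four processed rows back to main memory with a
     * single DMA-list put. Each list element gives one transfer's size and the
     * low 32 bits of its effective address; the high bits come from 'ea_base'. */
    #include <spu_mfcio.h>

    #define ROWS      4
    #define ROW_BYTES 4096                      /* e.g. 1024 floats per row */

    static float rows_ls[ROWS][ROW_BYTES / sizeof(float)] __attribute__((aligned(128)));
    static mfc_list_element_t list[ROWS] __attribute__((aligned(8)));

    void put_rows(unsigned long long ea_base, unsigned int dst_stride_bytes)
    {
        unsigned int tag = 2;
        for (int i = 0; i < ROWS; i++) {
            list[i].notify = 0;
            list[i].size   = ROW_BYTES;
            list[i].eal    = (unsigned int)(ea_base + (unsigned long long)i * dst_stride_bytes);
        }
        /* the list size argument is in bytes: ROWS * sizeof(mfc_list_element_t) */
        mfc_putl(rows_ls, ea_base, list, ROWS * sizeof(mfc_list_element_t), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }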

This implementation avoids the transpose step that was added after each convolution in the first implementation. It can be extended to process more than four rows at a time, depending upon the space available in the SPU local store. The convolution is still called the same number of times as in the original.

The compute-orientations portion of the extraction phase can also be parallelized: the data to be distributed over the SPEs would be the detected key points, and computing the orientations for each point can be spread over the SPEs. Here again we use data parallelism, as the task itself cannot be parallelized.

The next phase, the matching algorithm, is a brute-force search in which key points with descriptor vectors are compared and matched according to thresholds. This could be implemented on the Cell to take advantage of the SPU cores and reduce the time taken to match the points. Matching is presently implemented for two images and can be extended to match multiple images.
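A plain-C sketch of such a brute-force match (squared Euclidean distance over the 128-dimensional descriptors with a simple distance threshold; the threshold and array layout are hypothetical, and this is not the thesis matching code):

    /* Sketch: for each descriptor in image A, find the nearest descriptor in
     * image B and accept the pair if the squared distance is under a threshold. */
    #define DESC_DIM 128

    static float sq_dist(const float *a, const float *b)
    {
        float d = 0.0f;
        for (int i = 0; i < DESC_DIM; i++) {
            float t = a[i] - b[i];
            d += t * t;
        }
        return d;
    }

    /* descA: na x 128 values, descB: nb x 128 values; match[i] = index in B or -1. */
    void brute_force_match(const float *descA, int na, const float *descB, int nb,
                           int *match, float threshold)
    {
        for (int i = 0; i < na; i++) {
            float best = threshold;
            match[i] = -1;
            for (int j = 0; j < nb; j++) {
                float d = sq_dist(descA + i * DESC_DIM, descB + j * DESC_DIM);
                if (d < best) { best = d; match[i] = j; }
            }
        }
    }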

The image matching, orientation, and image rebuilding phases may also be parallelized in the future.

The implemented algorithm is instrumented to collect performance times related to sending images of varying sizes across the communication channels and to completing the main processing steps. Performance results, analysis, and conclusions are the subject of the next chapter.

4. METHODOLOGY, RESULTS AND ANALYSIS

4.1 Methodology

This section describes how we tested the SIFT algorithm under various system configurations to determine how performance varied.

Four configurations for running the SIFT step of the image stitching algorithm were considered and instrumented with timings:

 Stand-alone Intel-class machine – We used an Intel Celeron machine with a clock speed of 1.86 GHz.

 Stand-alone System-z – We used the Walton College IBM System z900. The clock speed of the machine is 1.2 GHz.

 Stand-alone QS21 using only the PPE – We used a QS21 on loan from IBM. The clock speed of the PPE is 1.42 GHz.

 QS21 using the PPE and SPEs – We used a QS21 on loan from IBM. The clock speed of this machine is 3.2 GHz.

We used images of size 1024x1024 (which can fit comfortably in the memory of a Cell BE PPE). For all testing, we considered the times to access data from the file system of the local host. [In the case of the IBM System-z, we separately stored the data in the DB2 database and used JDBC to access it, but we do not report the performance results of accessing that data.]

To test each configuration, we started with the same SIFT implementation (the C++-based code from Vedaldi [3]). We initially used 1D convolution, but it proved too slow, so we converted the code to use 2D convolution (only in the final configuration that used the SPUs). We modified the code to make use of the SPEs in the fourth configuration.

The code was unit tested over one octave at the beginning to make sure the results obtained were correct, then over the entire SIFT algorithm (that is, over all the octaves). For this testing, all portions of the parallel version were run over one octave level with a defined sigma, and the end results were checked by comparing the data computed on the PPU with the data obtained after using the SPUs. Once these results were satisfactory, the same approach was extended to the entire SIFT++ code. Because the image size and the input parameters vary for every call, the DMA transfer sizes need special care: to utilize the available local store, one tends to transfer 16 KB at a time, and though this size is constant, the image size varies with each call depending upon the functionality. This requires particular attention during testing as the code is extended from a single octave to several octaves.
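A small sketch of the kind of check used for this verification (comparing the PPU-only output against the PPU+SPU output within a floating-point tolerance; the tolerance is a hypothetical choice):

    /* Sketch: report the largest absolute difference between two result buffers. */
    #include <math.h>
    #include <stdio.h>

    int compare_results(const float *ppu_out, const float *spu_out, int n, float tol)
    {
        float worst = 0.0f;
        for (int i = 0; i < n; i++) {
            float d = fabsf(ppu_out[i] - spu_out[i]);
            if (d > worst)
                worst = d;
        }
        printf("max abs diff = %g\n", worst);
        return worst <= tol;                    /* 1 = results agree within tolerance */
    }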

4.2 Results

4.2.1 Timing the Overall Application

This section describes the overall performance (in seconds) of the first phase of the image stitching algorithm (image extraction/processing) when it is run in the four different configurations. All timings are for one 1024x1024 pixel image; the key point file generated in this case is 4.42 MB.

Figure 7 : Image Extraction Runtime

These results show the Intel PC outperforming the IBM mainframe as well as the Cell BE PPU-only and Cell BE PPU+SPU configurations. These results seem counterintuitive in that we might expect an Intel-based PC to underperform a mainframe coupled with a hardware accelerator, so it is necessary to look at more detailed results to understand the timings. When we separately time the substeps, the picture is clearer, as shown in the following table.

Configuration      Transfer image     Load image    Key point    Compute key points    Transfer key point
                   from z to QS21     from file     detection    and store in file     file from QS21 to z
                   (seconds)          (seconds)     (seconds)    (seconds)             (seconds)

PC alone           N/A                0.07          6            5.9                   N/A
System-z           N/A                0.06          34           51.9                  N/A
QS21 PPU           N/A                0.06          9            39.9                  N/A
QS21 PPU+SPU       N/A                0.06          3            39.9                  N/A
System-z + QS21    0.353              0.06          3            39.9                  0.141

Times to load images from files appear fairly constant and small (around 0.06 seconds) for all configurations. Similarly, the time to transfer an image from the IBM mainframe to the Cell BE is small.

The dominant timing differences appear in the Key Point Detection step and the time it takes to compute and store the key points.

4.2.2 Timing the Key Point Detection

When key point detection is implemented with the fast 2D convolution, the times taken to detect the key points in the four configurations are as shown in Figure 8.

[Bar chart: key point detection time in seconds. PC alone: 6, System-z: 34, PPU: 9, PPU+SPU: 3]

Figure 8 : Key point Detection Run time

This graph shows that, considering just the SIFT algorithm, the Cell PPU alone underperforms the PC running the same algorithm; this is because the Intel chip is faster than the Cell's 64-bit PowerPC core. The IBM mainframe version is noticeably slower. But when the SPUs are included, the parallelism of this step vaults the Cell BE past the PC for the convolution portion.

4.2.3 Timing Results in more detail

The convolution step is described here in more detail, considering a 1024x1024 image that traverses the extraction process once over one octave level; the Gaussian scale-space copy-and-downsample step and the different convolution implementations are discussed below.

Our initial SPU implementation of the SIFT operation CopyAndDownSample turned out to be slower than running the algorithm on the PPU alone:

PPU alone: 0.004 Seconds

PPU + SPU: 0.027 Seconds

The time taken for the SPE context creation is 0.013 seconds.

When 1D convolution was re-implemented with single buffer DMA transfers the timings were worse:

PPU alone: 0.26 Seconds

PPU + SPU: 0.39 Seconds

Transpose alone: 0.16 Seconds

1D convolution alone: 0.035 Seconds


When 1D convolution was re-implemented with single buffer DMA-lists, the timings improved:

PPU alone: 0.26 Seconds

PPU + SPU: 0.08 Seconds

1D convolution alone: 0.04 Seconds

When 2D convolution was implemented with single buffer DMA transfers, the overall time improved substantially:

PPU+SPU: 0.027 Seconds

As the 2D convolution is implemented using SIMD instructions, it runs only on the SPU, and as mentioned in the implementation section this method has no additional transpose overhead. Even though the local store of the SPU is limited to 256 KB, asynchronous DMA lists and direct memory access operations work around these memory limitations through prioritized and looped data transfers.

The 1D convolution convolves the matrix once, then transposes the matrix and convolves it again. When this is implemented with the SPU, we do the convolution, send the data back to the PPU, transpose the matrix, and call the SPU again for the next convolution. The transpose can be avoided if we use DMA lists instead of plain DMA transfers when moving the data between the PPU and the SPUs.

4.2.4 Timing the Communication

To understand the communication overhead between the System-z and the Cell BE, we used a round-trip messaging utility to measure the time to send an image from the System-z to the Cell BE and back again for image sizes ranging from 1 MB to 10 MB. Figure 9 shows the results: communication time grows linearly with image size.

We also considered the time to send multiple 2 MB images round trip from the System-z to the Cell BE and back, varying the number of images sent from 1 to 10. Again, communication time grows linearly, as shown in Figure 9.

Figure 9 : Communication time from Z to Cell

After images are processed, files containing key points are sent back to the mainframe for storage. The images used here are 1024x1024 pixels, sent eight at a time. The time taken to send the processed data (key points) back to the System-z was measured, as shown in Figure 10. When running large data sets, the resulting key points have to be stored back on the System-z, as the Cell cannot hold the results; these key points are used in the later stages for image matching and orientation.

We observed that the time taken to send the images to the Cell is less than the time taken by the Cell to send the key point files back to the System-z. This is because the key point files are big, as they hold a 128-dimensional descriptor vector for each key point of the image. This difference in timings may vary depending upon the number of key points detected in the respective images.

Figure 10 : Communication time from Cell to Z

4.3 Analysis

The image stitching algorithm operates correctly in all four configurations, giving the same results. Still, the timings vary in a way we did not initially expect. Considered overall, the PC implementation outperformed the mainframe substantially and, to a lesser extent, the Cell BE PPE and Cell BE PPE+SPE configurations.

The computationally expensive 2D convolution step did show a significant speedup on the Cell BE PPE+SPE configuration (due to parallelism) over the PC.

The overall performance of the Cell PPE+SPE was still slower than the PC because of communication costs. After sending the image, the mainframe waits for the task to complete and then receives the results; in the current implementation, this step is not pipelined. This is a clear area for future work.

Similarly, within the Cell BE, the CopyAndDownSample process of the Gaussian scale step does not show any improvement when using the 16 co-processors. This is because the data transfer between the processors is more expensive than the computation itself, so for trivial computations, such as nested loops with little work per element, it is not advisable to parallelize. In this architecture the memory is not shared, so extra steps are required to create the contexts, call the SPUs, and transfer the data over to the local stores.

4.4 Comparison with Recent Results

One other research team has recently reported an implementation of the SIFT transformation, the first phase of the image stitching application, that uses a PlayStation 3 Cell BE [21]. The base code used is AutoPano-C. The Cell BE of the PS3 has six working SPEs, and they do not use a large database for image storage. They implemented SIFT by dividing it into four tasks: image blurring, difference of Gaussians, gradient map calculation, and descriptor calculation. They compared PPU vs. PPU+SPU but did not compare against other environments, and their results stated that the experiment achieved linear scalability.

5. CONCLUSIONS

5.1 Summary

This project took over a year to complete, and several of the steps below took months. We started with a Cell BE simulator, initially demonstrated connectivity with it via sockets from a calling program running in a UNIX environment and later from the IBM System-z mainframe environment running Linux, and at that point determined that a simulator approach could not yield realistic performance timings for any application.

Then we acquired the Cell BE QS21 hardware accelerator from IBM and learned to program it. Around that time, we considered several candidate algorithms for showing off mainframe-Cell BE integration (among them sorting and the image stitching algorithm). We settled on the latter because of its high potential value and our access to the author of that algorithm, Dr. Jackson Cothren.

Soon after, we found that considerable effort was needed to port the algorithm to the IBM mainframe and Cell BE environments (due to licensing issues with the MATLAB implementation), so we re-wrote the application to remove dependencies on MATLAB, first by porting to Octave and later by using the C++ version developed by Andrea Vedaldi.

We initially ran the image stitching application only on the System-z. Then we ported the SIFT portion (which finds interesting points in an image and is part of the overall image stitching algorithm) to the QS21 main processor. We developed the DB2 database to store chunked images, then re-sized the database images to extract smaller image chunks from larger images so the chunks could fit the Cell BE memory. Finally, we distributed the SIFT algorithm to the Cell BE's satellite SPU processors. We instrumented the code, ran the experiments, and reported the results (Chapter 4).

We found that the System-z helps in storing large image data sets and that the System-z and the Cell BE, combined, can be used for stitching images from large databases of images. Still, our current databases of images are not very large (measured in thousands of images) and could comfortably fit in a database such as MySQL on a PC. We also found that, in overall time, a PC image stitching implementation outperforms the mainframe/Cell BE running image stitching, because the Intel processor is faster than the 64-bit PowerPC architecture (the architecture of the Cell BE main processor). We found that the Cell BE does accelerate the application over a desktop PC (just for the SIFT algorithm, not overall) and that the co-processors of the Cell can be used to optimize that part of the solution, but when combined with the mainframe, the overall performance degrades.

5.2 Contributions

Our work provides several contributions. We demonstrated that we could store a collection of images in DB2 as BLOBs and that, to use the Cell BE with its small memory, we need to break big images into smaller ones for processing on the Cell BE. We showed how to port the application (the C++ version) onto the z900 and from there onto the Cell BE QS21. The application was developed as a Single Instruction Multiple Data model, in which the instructions of a particular portion of the application remain the same for all the SPUs and are executed over different portions of the images. We initially developed a one-dimensional DMA/DMA-list convolution on the QS21, but when this proved too slow, we re-worked the algorithm to demonstrate speedup using a two-dimensional convolution and extended it to a two-dimensional weight matrix of size 13x13. We instrumented our system, measured the communication and overall processing times, and analyzed the results, showing that the Cell BE currently provides limited speed advantages over a stand-alone PC in certain steps of the algorithm, but not overall.

5.3 Future Work

The above results are preliminary. More work is needed before we can judge whether, and by how much, the image stitching algorithm can be substantially accelerated by the Cell BE architecture. The following are areas for future research:

 The communication from the System-z to the Cell BE and from the Cell BE main processor to the peripheral processors can be pipelined.

 We could make better use of the memory of the QS21 PPU and SPUs to transfer larger chunk sizes than in our reported results. The downside is that more revisions of the image stitching algorithm would be needed.

 The data can be stored and managed using hash tables for different data sets in larger applications, so that when the images are chunked for processing they need not be re-stitched, as we already know how they match.

 Other optimization techniques can be implemented, such as external addressing, where RAM can be accessed directly from the SPU local store. These techniques may improve performance.

 We could repeat our experiments with the IBM Cell BE QS22, a next-generation Cell BE architecture based on the PowerXCell 8i processor. The QS22 has larger memory than the QS21 and supports double-precision floating-point arithmetic for greater precision.

 Our present experiment only handles the SIFT portion of the overall image stitching algorithm, in which interesting points are identified. The step of matching images could also be parallelized using the Cell BE, and rescaling and translating the coordinate systems of individual images could also be accomplished in parallel.

 Finally, it would be interesting to parallelize the entire image stitching algorithm in such a way that portions could be accelerated using either the Cell BE or a grid/cluster of commodity PCs, to compare these two parallelism architectures. Then we could compare optimization of this algorithm using general purpose commodity software with the benefits of vector accelerators.


REFERENCES

[1] D. Lowe, “Distinctive Image Features from Scale-Invariant Key Points,” International Journal of Computer Vision, vol. 60, Feb. 2004, pp. 91-110.

[2] K. Mikolajczyk and C. Schmid, “An Affine Invariant Interest Point Detector,” Proceedings of the 7th European Conference on Computer Vision, Part I (ECCV '02), Springer-Verlag, London, UK, 2002, pp. 128-142.

[3] A. Vedaldi, “A Lightweight C++ Implementation of SIFT,” http://vision.ucla.edu/~vedaldi/code/siftpp/siftpp.html, accessed November 6, 2008.

[4] J. Xing and Z. Miao, “An Improved Algorithm on Image Stitching Based on SIFT Features,” Proceedings of the Second International Conference on Innovative Computing, Information and Control, September 5-7, p. 453.

[5] Xiaohua Wang and Weiping Fu, “Optimized SIFT Image Matching Algorithm,” IEEE International Conference on Automation and Logistics (ICAL 2008), Sept. 2008, pp. 843-847.

[6] IBM System-z/OS Concepts, ftp://www.redbooks.ibm.com/redbooks/SG246366/zosbasics_textbook.pdf, accessed January 31, 2009.

[7] IBM eServer z900 Series Technical Guide, http://www.redbooks.ibm.com/redbooks/SG245975.html, accessed March 28, 2009.

[8] E. M. Schwarz, M. A. Check, C.-L. K. Shum, T. Koehler, S. B. Swaney, J. D. MacDougall, and C. A. Krygowski, “The Microarchitecture of the IBM eServer z900 Processor,” IBM Journal of Research and Development, vol. 46, no. 4/5, 2002, pp. 381-396.

[9] Linux on IBM System-z: Performance and Tuning, http://www.redbooks.ibm.com/redbooks/SG246926/wwhelp/wwhimpl/js/html/wwhelp.htm, accessed March 29, 2009.

[10] Linux on IBM System-z with z/VM, http://www.vm.ibm.com/library/zvmlinux.pdf, accessed March 29, 2009.

[11] LOBs with DB2 for z/OS: Stronger and Faster, http://www.redbooks.ibm.com/redpieces/pdfs/sg247270.pdf, accessed April 17, 2009.

[12] C. R. Johns and D. A. Brokenshire, “Introduction to the Cell Broadband Engine Architecture,” IBM Journal of Research and Development, vol. 51, no. 5, Sep. 2007, pp. 502-520.


[13] T. Chen, R. Raghavan, J. N. Dale, and E. Iwata, “Cell Broadband Engine Architecture and its First Implementation: A Performance View,” IBM Journal of Research and Development, vol. 51, no. 5, Sep. 2007, pp. 559-572.

[14] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, “Introduction to the Cell Multiprocessor,” IBM Journal of Research and Development, vol. 49, no. 4/5, pp. 589-604.

[15] Cell Broadband Engine Architecture Overview, http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/FC857AE550F7EB83872571A80061F788/$file/CBE_Programming_Tutorial_v3.0.pdf, accessed March 21, 2009.

[16] The Cell Project at IBM Research, http://www.research.ibm.com/cell/heterogeneousCMP.html, accessed May 6, 2008.

[17] IBM BladeCenter QS21, http://www-03.ibm.com/systems/bladecenter/hardware/servers/qs21/index.html, accessed March 28, 2009.

[18] IBM BladeCenter Products and Technology, http://www.redbooks.ibm.com/redbooks/pdfs/sg247523.pdf, accessed March 28, 2009.

[19] IBM BladeCenter QS22, http://www-03.ibm.com/systems/bladecenter/hardware/servers/qs22/index.html, accessed March 25, 2009.

[20] Basics of SPE Programming, http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/ps3-linux-docs-08.06.09/CellProgrammingTutorial/BasicsOfSPEProgramming.html, accessed November 6, 2008.

[21] K. Bomjun, T. Choi, C. Heejin, and G. Kim, “Parallelization of the Scale-Invariant Key Point Detection Algorithm for Cell Broadband Engine Architecture,” 5th IEEE Consumer Communications and Networking Conference (CCNC 2008), 10-12 Jan. 2008, pp. 1030-1034.
