ENTE PER LE NUOVE TECNOLOGIE, L'ENERGIA E L'AMBIENTE

Dipartimento Innovazione IT9700543

FASTER IMAGE I/O ON QUADRICS THROUGH DATA PACKING AND UNPACKING

SERGIO TARAGLIO ENEA - Dipartimento Innovazione Centro Ricerche Casaccia, Roma

FRANCO VALENTINOTTI

ENEA - Progetto Interdipartimentale Calcolo e Reti ad Alte Prestazioni (HPCN) Centro Ricerche Casaccia, Roma

RT/INN/97/2 Text received in January 1997

The technical and scientific contents of ENEA technical reports reflect the opinion of the authors and not necessarily that of the Agency.

SUMMARY We present an extremely simple but efficient algorithm able to enhance the I/O performance of the Quadrics supercomputer family while loading images. The algorithm packs/unpacks integer data into 4 byte floating point numbers, the standard word of the machine. The resulting performance is interesting for image analysis applications.

Faster image I/O on Quadrics through data packing and unpacking

Sergio Taraglio, Franco Valentinotti

Introduction

One of the typical data structures we encounter when processing images is the array of single byte integers, since each pixel of an image is typically sampled into 256 grey levels. For an image size of 640 by 480 pixels, the total amount of data adds up to 300 Kbyte.

The application field we have in mind is the use of images as input to sensory based real time systems able to extract useful information. Examples are the use of visual information for the detection and classification of defects in rolled strip steel [1], or for the guidance of autonomous vehicles in indoor or outdoor environments [2]. The first application needs an input data rate of around 20 Mbyte per second, while for the second we aim at a frame rate of 10 frames per second, i.e. 3 Mbyte per second.

The Quadrics machine operates on single precision 4 byte floating point numbers (the so called word). Since we are dealing with images, i.e. with single byte integers, writing each byte inside its own floating point number and loading it is an evident waste of I/O bandwidth.

In the following we present an extremely simple algorithm that reduces this waste of I/O bandwidth. It is useful for image input, but also whenever positive integer numbers must be loaded, e.g. the 3-D chemical description of a material to be radiographed [3].

In the first section we present the I/O features of the Quadrics supercomputer family. In the second we analyse the feasibility of a compression algorithm for the task. In the third we show our packing/unpacking approach. In the fourth we present some experimental measurements performed on three different machines. Finally we discuss the algorithm and the further work needed to design a practical hardware board for I/O from a TV camera, for sensory based real time applications to be developed on Quadrics.

The I/O on the Quadrics machines

It is well known that the I/O performance of the Quadrics supercomputer family is far from acceptable for the huge quantity of data of a visual real time application. The I/O rate supported by the machine is around 720 Kbyte per second. An I/O board based on the HiPPI interface is already available, and is announced as able to sustain at least around 20 Mbyte per second [4].

As mentioned above, the Quadrics machine operates on words of 4 byte floating point numbers, and the word is also the minimal I/O unit. Therefore the I/O rates quoted above must be divided by 4, giving an effective rate of 180 Kword per second for the standard board and of 5 Mword per second for the HiPPI board.

The compression algorithm solution

Since we have to input the grey levels of an image, i.e. integer values, the simplest approach is to transform each number into a single real one and then load the floating point image so obtained. With 256 grey levels this means using 4 bytes to store a single byte, i.e. enlarging the original image four times. The waste of I/O bandwidth and disk space is evident.

What we are looking for is a more efficient way to transform the input data into real numbers. A possible solution is to compress and/or pack the data into real numbers, load the resulting real numbers and, finally, uncompress and/or unpack the data inside each processor of the machine. For a complete definition of compression and packing see [5].

Compressing the data with standard compression algorithms is not viable because of the parallel SIMD nature of Quadrics. First of all, we cannot compress the whole image, since we want to distribute it among the processors. We might instead compress each subset of the image addressed to a given processor. However, compression algorithms are usually not deterministic, i.e. they are data dependent, so the uncompression of each sub image would follow a path unique to its processor, which does not match the SIMD requirements of the machine. For example, take a 512x512 image and a QH4 machine, composed of 512 processors: each processor should uncompress one row of the image, but each row would have a different encoding, due to its different pixel content.
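The data dependence described above can be illustrated with a minimal sketch. Run-length encoding is used here only as a simple stand-in for a "standard compression algorithm"; the source does not name a specific scheme, and the function name and sample rows are hypothetical.

```python
def run_length_encode(row):
    """Encode a row of pixel values as [value, run_length] pairs."""
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([pixel, 1])   # start a new run
    return runs


# Two rows of the same length but with different pixel content.
flat_row = [7] * 8
busy_row = [7, 7, 9, 9, 9, 1, 7, 7]

flat_code = run_length_encode(flat_row)
busy_code = run_length_encode(busy_row)

# The encoded lengths differ, so a decoder would loop a different number
# of times on each row: the processors could not run in lockstep.
print(len(flat_code), len(busy_code))  # 1 4
```

With packing, by contrast, every processor executes exactly the same sequence of operations whatever the pixel values are.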

We have therefore been forced to choose packing and unpacking, which provides a SIMD structure for the decoding phase and a constant reduction ratio independent of the data.

The packing/unpacking algorithm solution

The algorithm is based on the idea of representing several integer numbers inside a single floating point one.

Let us now examine in detail the real number representation used by Quadrics (IEEE 754 standard). As can be seen in figure 1, a real number is composed of 4 bytes. Of these 32 bits, 23 are for the mantissa of the number, 8 are for the exponent and one is for the sign of the number [6].

The only way to code integer numbers inside such a real number is to exploit the mantissa and leave the exponent untouched. This avoids the use of transcendental functions such as the logarithm or the exponential, which would be a notable burden for the unpacking algorithm on the Quadrics processors.

Bit layout: bit 31, sign; bits 30-23, exponent; bits 22-0, mantissa.

Figure 1. The floating point representation of Quadrics.
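As a minimal sketch of the layout in figure 1, the three fields of an IEEE 754 single precision number can be extracted with Python's standard `struct` module. The helper name `float32_fields` is hypothetical, and the sketch describes the generic IEEE 754 format rather than anything specific to the Quadrics hardware.

```python
import struct


def float32_fields(value):
    """Return (sign, exponent, mantissa) of value stored as IEEE 754 float32."""
    # Reinterpret the 4 bytes of the float as a 32 bit unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30-23, biased by 127
    mantissa = bits & 0x7FFFFF       # bits 22-0, hidden leading 1 not stored
    return sign, exponent, mantissa


print(float32_fields(1.0))   # (0, 127, 0)
print(float32_fields(-1.5))  # (1, 127, 4194304)
```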

Under the IEEE 754 standard, the mantissa of a normalised number lies in the range [1, 2), so its leading bit is always 1 and need not be stored. This hidden bit enhances the precision of the number: the 23 stored bits give 24 bits of precision, and consequently it is possible to use 24 bits for the packing.

We have to pack a given sequence of positive integer numbers. Let us call this sequence $\{I_j;\ j \in [0, N-1]\}$, where $N$ is the maximum number of packable numbers, which may be called the packing factor.

The real number $F$ is obtained through the following expression:

$$F = \sum_{j=0}^{N-1} b^{-j} I_j \qquad (1)$$

where the base $b$ is equal to the number of different values that the $I_j$ may assume.

The packing factor $N$ can be obtained through the formula $N = \left[ 24 / \log_2 b \right]$, where the square brackets indicate the integer part of the expression.

Relation (1) is fully one-to-one: for each $N$-tuple of $I_j$ there exists one and only one real number $F$. The relation also holds the other way around, which is the unpacking part of the algorithm: given a real number $F$ and the same base, we may recover the original $N$-tuple.

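This unpacking is implemented via the following expressions, where fint is the library function returning the integer part of its argument:

$$I_0 = \mathrm{fint}(F) \qquad (2)$$

$$I_j = \mathrm{fint}\Big(b^{j} F - \sum_{k=0}^{j-1} b^{\,j-k} I_k\Big) \quad \text{for } 1 \le j \le N-1 \qquad (3)$$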

Since the $I_j$ numbers are positive, we may modify the fint function in order to avoid some conditional instructions on the sign of the number.

In this way the pipeline filling of the code rises to 7%, compared with 1% previously.

The typical data we use are single byte grey levels, hence the base we will use is 256 and, from what we have described above, we can pack three pixels into each real number. With a different input sampling we may pack more (or fewer) integer numbers into each floating point one: for example, with 64 grey levels we may pack four pixels per word.
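The packing and unpacking steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code run on Quadrics: the function names are hypothetical, `np.floor` stands in for the fint library function, the base is assumed to be a power of two (so that every division by the base is exact in floating point), and the unpacking is written in the equivalent step-by-step form "take the integer part, subtract it, multiply by the base".

```python
import numpy as np

MANTISSA_BITS = 24  # 23 stored bits plus the hidden bit


def packing_factor(base):
    """N = [24 / log2(b)], for a power-of-two base b (assumption of this sketch)."""
    bits_per_value = base.bit_length() - 1
    if base < 2 or (1 << bits_per_value) != base:
        raise ValueError("this sketch assumes a power-of-two base")
    return MANTISSA_BITS // bits_per_value


def pack(values, base=256):
    """Pack integers in [0, base) into float32 words, N values per word."""
    n = packing_factor(base)
    groups = np.asarray(values, dtype=np.float64).reshape(-1, n)
    weights = float(base) ** -np.arange(n)          # b^0, b^-1, ..., b^-(N-1)
    return (groups * weights).sum(axis=1).astype(np.float32)


def unpack(words, base=256):
    """Recover the integers from float32 words; identical steps for every word."""
    n = packing_factor(base)
    rest = np.asarray(words, dtype=np.float32)
    out = np.empty((rest.size, n), dtype=np.int32)
    for j in range(n):
        digit = np.floor(rest)                      # stand-in for fint
        out[:, j] = digit
        rest = (rest - digit) * np.float32(base)    # shift the next value up
    return out.reshape(-1)


pixels = [255, 0, 128, 17, 34, 51]   # length must be a multiple of N
words = pack(pixels)
print(words.size, unpack(words).tolist())
```

The loop in `unpack` has a fixed trip count N and no data dependent branch, which is what makes the decoding phase suitable for a SIMD machine.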

This simple algorithm is deterministic in the sense that we know a priori its main features. First of all the total amount of data is reduced by a factor N. In this way we can not only reduce the I/O time but also the disk requirements to store such packed images in the host machine.

A second important feature is the perfect SIMD structure of the unpacking part of the algorithm. It does not require any inter processor communication and is fully and linearly scalable with the number of processors and the quantity of input data.

In the following we will call $t_{I/O}$ the original I/O time needed to input a full image with one pixel per word, $t_P$ the I/O time for the packed input ($N$ pixels per word) and $t_U$ the time spent in unpacking the data inside the processors.

It is evident that this algorithm gives a better performance whenever:

$$T = t_P + t_U < t_{I/O} \qquad (4)$$

This is the conclusive constraint for the applicability of our approach.

Testing the algorithm

It is important to test both the actual I/O and the unpacking time in real experiments on the machine, since the I/O time is usually a function of many variables. Among them some are due to the Quadrics itself, such as the configuration in use (number of processors and topology) or the total number of input data. Other variables in our scenario are represented by the attention offered by the host system to the Quadrics during its asynchronous mode while doing I/O, an unpredictable function of the global work load.

We have chosen 256 as the base for our packing algorithm, which means images with 256 grey levels and a packing rate of $N = 3$.

The tests have been performed on the three different configurations present in the ENEA Casaccia centre. The available machines are the Q1 with 8 processors, the QH1 with 128 and the QH4 with 512 processors. All the tests have been performed with the standard I/O board and in working condition, i.e. without reserving the host resources. In order to smooth the resulting fluctuations, several experiments have been performed for each test and overall averages have been taken.

Figure 2. I/O performances on the Quadrics Q1: load time and gain as functions of the data input (Kpixel).

In figure 2 we show the timing performance recorded on the Q1 with 8 processors as a function of the total input data, expressed in kilo pixels. Some noise is evident in the behaviour, due to the working condition of our test. At the same time we want to stress that the ratio depends on the amount of input data and converges rapidly to its maximum value, not far from the theoretically expected one.

If we consider the typical image, we have to input 640 x 480 pixels, i.e. 300 Kpixel. From the chart we see that $t_{I/O}$ is around 1.68 seconds, to be compared in our approach with a $t_P$ of 0.56 seconds plus a $t_U$ of 0.08 seconds. Globally we gain a factor of approximately 2.61. In other words we pass from an I/O rate of approximately 179 Kpixel per second to 468 Kpixel/s.
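The figures quoted above can be checked with a short worked calculation. The symbol $G$ for the gain is introduced here only for convenience, and $t_U$ denotes the unpacking time.

```latex
\begin{align*}
G &= \frac{t_{I/O}}{t_P + t_U} = \frac{1.68}{0.56 + 0.08} = \frac{1.68}{0.64} \approx 2.6 \\
\text{rate, one pixel per word} &= \frac{300\ \text{Kpixel}}{1.68\ \text{s}} \approx 179\ \text{Kpixel/s} \\
\text{rate, packed} &= \frac{300\ \text{Kpixel}}{0.64\ \text{s}} \approx 468.75\ \text{Kpixel/s}
\end{align*}
```

The ratio of the two rounded rates, 468/179, gives the factor 2.61 quoted in the text.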

In the following graphs we have explored a wider range of image sizes in order to compare the different machines.

In figure 3 all the I/O timings are presented, both the original one (I/O) and the packed one (P) as a function of the input data for the three different machines available. Here we can note that for a given image size the time needed to input it in the different computers increases with the number of processors. The overall behaviour is nearly linear in the input data.

Finally in figure 4 we show the gain in I/O performance we have obtained. The average gain is between 2.5 and 3 since we are packing three pixels for each floating point number loaded.

In figure 4 it is evident that this ratio depends on the input image size differently for the different machines. The larger the ratio between input data and number of processors, the faster the convergence towards the maximum gain value. When the input data are large enough, the bigger machines perform better, since the unpacking time becomes more and more negligible.

Figure 3. The I/O performances for the Q1, QH1 and QH4 machines, as a function of the input data (Kpixel).

Figure 4. The I/O gain for the three different machines (Quadrics configurations), as a function of the input data (Kpixel).

Possible future use of the algorithm: an I/O board for image acquisition

We think that it is possible, and advisable, to conceive and to assemble an I/O board exclusively dedicated to the loading of images into Quadrics, directly connected to a TV camera. The realisation of such a device would open countless fields of application for the Quadrics machine, ranging from the artificial vision systems to quality control systems.

A possible implementation of such a board may well reuse a large part of what has already been developed for the HiPPI interface. The board must contain a hardwired version of the packing phase of the present algorithm. At the same time it is advisable to provide the possibility of switching the sampling dynamics and the image size.

On the Quadrics side, the unpacking algorithm may be implemented through the ZZ parser in order to become a statement of the TAO language, such as read image [0,0,0][1,1,1] with base 256 imgname. In this way everything would be transparent to the end user.

Let us assume for this hypothetical board an I/O rate of 20 Mbyte/s, i.e. 5 Mpixel/s with one pixel per word. This figure is the least that is expected from the HiPPI board. Naturally, experiments on this device are needed for a more conclusive statement on the gain offered by our approach, because the algorithm has a physical limitation linked to the actual speed of the I/O device (see equation 4).

Let us also assume the image size used above (300 Kpixel). Under these assumptions $t_{I/O}$ is equal to 60 milliseconds and $t_P$ is equal to 20 milliseconds, so from equation (4) the unpacking time $t_U$ must be less than 40 milliseconds. Presently $t_U$ ranges from 80 milliseconds on the Q1 (8 processors) to 5 milliseconds on the QH1 (128 processors) and 1.25 milliseconds on the QH4 (512 processors).

It is evident that our approach is not convenient on a Q1, unless some optimisation can be made on the unpacking algorithm.
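The comparison required by equation (4) can be sketched as a small script using the timings quoted above for the hypothetical 20 Mbyte/s board. The function and variable names are illustrative only.

```python
def packed_gain(t_io_ms, t_p_ms, t_u_ms):
    """Ratio between the plain I/O time and the packed I/O plus unpacking time."""
    return t_io_ms / (t_p_ms + t_u_ms)


T_IO_MS = 60.0   # 300 Kpixel at 5 Mpixel/s, one pixel per word
T_P_MS = 20.0    # the same image packed three pixels per word

# Unpacking times quoted in the text, in milliseconds.
unpack_ms = {"Q1": 80.0, "QH1": 5.0, "QH4": 1.25}

gains = {}
for machine, t_u in unpack_ms.items():
    gains[machine] = packed_gain(T_IO_MS, T_P_MS, t_u)
    verdict = "convenient" if T_P_MS + t_u < T_IO_MS else "not convenient"
    print(f"{machine}: total {T_P_MS + t_u:.2f} ms, "
          f"gain {gains[machine]:.2f}, {verdict}")
```

A gain below 1 means that the unpacking time more than cancels the time saved on the transfer, which is the case of the Q1 before any optimisation.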

As said above, the unpacking makes massive use of the library function fint. This function contains some conditional instructions (where) in its code. A first step was the removal of two tests on the sign of the operand, since we are using only positive numbers, but one conditional instruction is still left in the code.

A possible enhancement is the removal of this last instruction as well. We have tested that in this way the pipeline filling of the unpacking code rises from 7% to 19%, making $t_U$ drop by a factor of four.

Let us now see the reason for the existence of such a conditional instruction in the fint function. The rounding in the IEEE standard is such that whenever the fractional part of a number exceeds 0.5 the number is rounded to the upper integer, and whenever it is strictly less than 0.5 to the lower integer. The problem stems from the case in which the fractional part is exactly 0.5. Should it be rounded to the upper or to the lower number? Whichever choice we make, there will always be a distortion in the distribution of the numbers. Such distortion becomes evident in big Monte Carlo codes. Hence the standard was designed to remove this problem as much as possible. The idea is to round the numbers with a fractional part of 0.5 to the nearest even integer. In this way such numbers are rounded sometimes towards the lower number and sometimes towards the upper one, eliminating the above mentioned distortion.

The conditional instruction left in our fint function takes into account this rounding mechanism. In fact this function uses the precision of the machine to isolate the integer part of a number, needing these rounding properties.
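The rounding behaviour described above can be observed directly in single precision arithmetic. The sketch below rests on an assumption: the source does not list the code of fint, so the add-and-subtract trick with the constant 2^23 is shown only as one common way of "using the precision of the machine" to isolate an integer, and the function names are hypothetical.

```python
import numpy as np

MAGIC = np.float32(2.0 ** 23)  # above this value float32 has no fractional bits


def round_via_precision(x):
    """Round non-negative values below 2**23 using only an add and a subtract."""
    x = np.asarray(x, dtype=np.float32)
    # The sum cannot hold a fractional part, so the hardware rounds it
    # to the nearest integer, with ties going to the even neighbour.
    return (x + MAGIC) - MAGIC


def integer_part(x):
    """Integer part of non-negative values, built on the rounding above."""
    x = np.asarray(x, dtype=np.float32)
    r = round_via_precision(x)
    # Conditional correction: if rounding went up, step back by one.
    return np.where(r > x, r - np.float32(1), r)


print(round_via_precision([0.5, 1.5, 2.5]).tolist())  # [0.0, 2.0, 2.0]
print(integer_part([0.5, 1.5, 2.7]).tolist())         # [0.0, 1.0, 2.0]
```

In this sketch the `np.where` plays the role of the conditional instruction: it is needed only because the rounding may overshoot the integer part.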

In our approach the only case in which we need the rounding is represented by the unpacking of N pixel values among which there is one or more pixels equal to zero. Therefore we may skip also this conditional statement if we assume a sampling in the image ranging in [1, 255].

For all the applications we are considering, the loss of a single grey level is of absolutely no importance compared with the gain in performance it offers to our procedure.

Finally, on a Q1 we may input a 300 Kpixel image, sampled in [1, 255], in 42 milliseconds, down from the previous 100 milliseconds. This figure must be compared with the plain I/O time of 60 milliseconds obtained with one pixel per word.

A possible target machine for an embedded Quadrics real time visual application is the Quadrics QDeSC machine, composed of 32 processors, equipped with the TV interface hypothesised above. Assuming an I/O rate of 20 Mbyte/s for the board, with our enhanced approach we reach an input time of about 25 ms per image (40 frames per second).

If we think of a frame rate for our application of 10 frames per second, this result means that on a QDeSC we have 75 milliseconds left to analyse the 300 Kpixels. Since the QDeSC has a peak CPU power of 1.6 GFLOPS, this means that we may use up to 120 MFLOP per frame, i.e. up to 400 floating point operations per pixel.
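The processing budget quoted above follows from a short calculation, taking 300 Kpixel as $3 \times 10^{5}$ pixels. The symbols $t_{\mathrm{frame}}$ and $t_{\mathrm{free}}$ are introduced here only for convenience.

```latex
\begin{align*}
t_{\mathrm{frame}} &= \frac{1}{10\ \text{frames/s}} = 100\ \text{ms} \\
t_{\mathrm{free}} &= t_{\mathrm{frame}} - 25\ \text{ms} = 75\ \text{ms} \\
\text{operations per frame} &= 1.6 \times 10^{9}\ \text{flop/s} \times 0.075\ \text{s} = 1.2 \times 10^{8}\ \text{flop} = 120\ \text{MFLOP} \\
\text{operations per pixel} &= \frac{1.2 \times 10^{8}}{3 \times 10^{5}} = 400
\end{align*}
```

Since the 1.6 GFLOPS figure is a peak value, the 400 operations per pixel are an upper bound.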

Conclusions

The algorithm presented is general purpose, i.e. it is applicable whenever integer data, images or not, have to be input into Quadrics.

The algorithm offers a good gain in I/O performance, tending asymptotically to N, the packing rate. This packing rate is naturally a function of the number of grey levels used to sample the image to be input.

For example while loading one byte integers we get a gain factor of about 2.5, in other words we may reach an input performance of 12 Mpixel per second instead of a 5 Mpixel/s performance while inputting one pixel per real number. On the present I/O board we reach 468 Kpixel/s instead of 179 Kpixel per second.

As mentioned above, these I/O performances may allow the use of Quadrics for all those applications concerning the real time processing of sensory data. For example we may think of artificial vision for robotics or quality control. In this field of application the SIMD characteristics of the machine are perfectly tuned to the processing of images, the typical algorithm being totally local to the pixel neighbourhood.

There exists a physical limitation for the algorithm, as shown in equation (4), which tells us that our approach is useful if the I/O of the machine is not extremely fast, since some time must be spent in unpacking the data. This unpacking time is, of course, a function of the machine under consideration, being inversely proportional to the number of processors. This means that this restriction may be present in the lower end of the Quadrics series, but never on the medium to big machines.

In any case the use of the present idea allows a great reduction in disk space usage since it reduces the size of the file containing the image by a factor N.

References

[1] Technical Annex of the HIPERCLASS PCI-CAPRI Project.

[2] S. Taraglio, F. Di Fonzo, P. Burrascano, "Training data representation in a neural based robot position estimation", Neural Network World, 6, No. 3, pp. 393-399, 1996.

[3] A.B. Della Rocca, L. La Porta, F. Valentinotti, "Radiographic process simulation by Boltzmann equation on SIMD architecture (Quadrics QH4)", in Proceedings of HPCN 96, Springer-Verlag, 1996.

[4] FAST I/O Sub system demonstration, deliverable D2.0/FAST of the HIPERCLASS PCI-CAPRI Project.

[5] J.D. Murray, W. van Ryper, "Encyclopedia of graphics file formats", O'Reilly & Associates, Sebastopol, 1994.

[6] A. Bartoloni et al., "A hardware implementation of the APE100 architecture", International Journal of Modern Physics C, 4, No. 5, pp. 969-976, 1993.

Published by the Unità Comunicazione e Informazione, Lungotevere Grande Ammiraglio Thaon di Revel, 76 - 00196 Roma

Printing: COM-Centro Stampa Tecnografico - C. R. Frascati

Printed in March 1997