Faster Image I/O on Quadrics Through Data Packing and Unpacking
ENTE PER LE NUOVE TECNOLOGIE, L'ENERGIA E L'AMBIENTE
Dipartimento Innovazione

IT9700543

SERGIO TARAGLIO
ENEA - Dipartimento Innovazione, Centro Ricerche Casaccia, Roma

FRANCO VALENTINOTTI
ENEA - Progetto Interdipartimentale Calcolo e Reti ad Alte Prestazioni (HPCN), Centro Ricerche Casaccia, Roma

RT/INN/97/2

Text received in January 1997.

The technical and scientific contents of ENEA technical reports reflect the opinion of the authors and not necessarily that of the Agency.

SOMMARIO

We present an extremely simple algorithm that nevertheless noticeably improves the I/O characteristics of the Quadrics supercomputer family when images are read. The algorithm packs and unpacks integer data into 4-byte floating point numbers, the standard word of the machine. The results obtained are extremely interesting for image processing applications.

SUMMARY

We present an extremely simple but efficient algorithm able to improve the I/O performance of the Quadrics supercomputer family when loading images. The algorithm packs/unpacks integer data into 4-byte floating point numbers, the standard word of the machine. The resulting performance is interesting for image analysis applications.

Introduction

One of the typical data structures encountered in image processing is the array of single-byte integers: each pixel of an image is typically sampled into 256 grey levels. For an image of 640 by 480 pixels, the total amount of data is 300 Kbyte. The application field we have in mind is the use of images as input to sensor-based real-time systems able to extract useful information.
Examples are the use of visual information for the detection and classification of defects in rolled strip steel [1], or the guidance of autonomous vehicles in indoor or outdoor environments [2]. The first application needs an input data rate of around 20 Mbyte per second, while for the second we aim at a frame rate of 10 frames per second, i.e. 3 Mbyte per second.

The Quadrics machine operates on single precision 4-byte floating point numbers (the so-called word). Since images consist of single-byte integers, writing each byte inside its own floating point number and loading it clearly wastes I/O bandwidth. In the following we present an extremely simple algorithm that reduces this misuse of the I/O bandwidth. It is useful for image input but also whenever positive integer numbers must be loaded, e.g. the 3-D chemical description of a material to be radiographed [3].

In the first section we present the I/O features of the Quadrics supercomputer family. In the second we analyse the feasibility of a compression algorithm for the task. In the third we show our packing/unpacking approach. In the fourth we present some experimental measurements performed on three different machines. Finally we discuss the algorithm and present the further work needed to design a practical hardware board for I/O from a TV camera, for sensor-based real-time applications to be developed on Quadrics.

The I/O on the Quadrics machines

It is well known that the I/O performance of the Quadrics supercomputer family is far from acceptable for the huge quantity of data of a visual real-time application. The I/O rate supported by the machine is around 720 Kbytes per second. An I/O board based on the HiPPI interface is already available, announced as able to sustain at least around 20 Mbyte per second [4].
As mentioned above, the Quadrics machine operates on words of 4-byte floating point numbers, and the word is also the minimal I/O unit. The I/O rates quoted above must therefore be divided by 4, giving an effective rate of 180 Kword per second for the standard interface and 5 Mword per second for the HiPPI board.

The compression algorithm solution

Since we have to input the grey levels of an image, i.e. integer values, the simplest approach is to convert each number into a single real one and then load the resulting floating point image. With 256 grey levels this means using 4 bytes to store a single one, i.e. enlarging the original image four times. This clearly wastes I/O bandwidth and disk space. What we are looking for is a more efficient way of transforming into real numbers the data we have to input into the machine. The solution may be the compression and/or packing of data into real numbers, the loading of the resulting real numbers and, finally, the decompression and/or unpacking of the data inside each processor of the machine. For a complete definition of compression and packing see [5].

Compressing the data with standard compression algorithms is not viable because of the parallel SIMD features of Quadrics. First of all, we cannot compress the whole image, since we want to distribute it among the processors. We could instead compress each subset of the image addressed to a given processor, but since compression algorithms are usually not deterministic, i.e. they are data dependent, the decompression of each sub-image would be unique to its processor, which does not match the SIMD requirements of the machine. For example, assume a 512x512 image and a QH4 machine composed of 512 processors: each processor should decompress one row of the image, but each row would have a different encoding because of its different pixel content.
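The data dependence can be illustrated with a small Python sketch. It uses zlib purely as a stand-in for a standard compression algorithm (the report does not name one) and two made-up 512-pixel rows. The fixed ratio of three pixels per word is assumed here; it is the figure the report arrives at later for 256 grey levels.

```python
import zlib

WIDTH = 512  # pixels per image row, as in the 512x512 example

# Two hypothetical rows of single-byte grey levels (illustrative data only).
flat_row = bytes([128] * WIDTH)  # uniform grey
busy_row = bytes((7 * x * x + 13 * x) % 256 for x in range(WIDTH))  # varied content


def compressed_size(row: bytes) -> int:
    """Size in bytes after zlib compression (stand-in for any standard codec)."""
    return len(zlib.compress(row))


def packed_size_in_words(row: bytes, pixels_per_word: int = 3) -> int:
    """Number of 4-byte words needed when a fixed number of pixels share a word."""
    return -(-len(row) // pixels_per_word)  # ceiling division


if __name__ == "__main__":
    for name, row in (("flat", flat_row), ("busy", busy_row)):
        print(name, compressed_size(row), "bytes compressed,",
              packed_size_in_words(row), "words packed")
```

The compressed sizes differ from row to row, so each processor would have to follow its own decoding path, whereas the packed size is the same for every row whatever its content.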
We were therefore forced to choose packing and unpacking, which provides a SIMD structure for the decoding phase and a constant reduction ratio independent of the data.

The packing/unpacking algorithm solution

The algorithm is based on the idea of representing several integer numbers inside a single floating point one. Let us examine in detail the real number representation used by Quadrics (IEEE 754 standard). As can be seen in figure 1, a real number is composed of 4 bytes. Of these 32 bits, 23 are for the mantissa of the number, 8 are for the exponent and one is for the sign [6]. The only way to code integer numbers inside such a real number is to exploit the mantissa and leave the exponent part alone; this avoids the use of transcendental functions such as the logarithm or the exponential, which would be a notable burden for the unpacking algorithm on the Quadrics processors.

[Figure 1. The floating point representation of Quadrics: bit 31 sign, bits 30-23 exponent, bits 22-0 mantissa.]

Under the IEEE 754 standard the 23-bit mantissa represents a value in the range [1,2), so that a hidden bit can be used to enhance the precision of the number; consequently 24 bits are available for the packing.

We have to pack a given sequence of positive integer numbers; let us call this sequence $\{I_j ;\ j \in [0, N-1]\}$, where $N$ is the maximum number of packable numbers, which may be called the packing factor. The real number $F$ is obtained through the following expression:

$$F = \sum_{j=0}^{N-1} b^{-j} I_j \qquad (1)$$

where the base $b$ is equal to the maximum number of different values that the $I_j$ may assume. The packing factor $N$ can be obtained through the formula

$$N = \left[ \frac{24}{\log_2 b} \right]$$

where the square brackets indicate the integer part of the expression. Relation (1) is fully one-to-one: for each $N$-tuple of $I_j$ there exists one and only one real number $F$.
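As a quick check of the packing factor, the sketch below computes the largest N for which b^N still fits in the 24 available bits, which is equivalent to taking the integer part of 24 / log2(b). The function name and the integer-only formulation are choices made for this illustration, not part of the report.

```python
SIGNIFICAND_BITS = 24  # 23 stored mantissa bits plus the hidden bit


def packing_factor(b: int) -> int:
    """Largest N such that b**N <= 2**24, i.e. the integer part of 24 / log2(b)."""
    if b < 2:
        raise ValueError("the base must allow at least two distinct values")
    n = 0
    # Integer arithmetic avoids any rounding in a floating point logarithm.
    while b ** (n + 1) <= 2 ** SIGNIFICAND_BITS:
        n += 1
    return n


if __name__ == "__main__":
    for levels in (256, 64, 16, 2):
        print(levels, "levels ->", packing_factor(levels), "values per word")
```

For 256 grey levels this gives three values per word, and for 64 grey levels four.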
The relation also holds the other way round, which is the unpacking part of the algorithm: given a real number $F$ and the same base $b$, we can recover the original $N$-tuple. This unpacking is implemented via the following expression:

$$\begin{aligned} I_0 &= \mathrm{fint}(F) \\ I_j &= \mathrm{fint}\Big(F\, b^{j} - \sum_{k=0}^{j-1} b^{\,j-k} I_k\Big) \quad \text{for } 1 \le j \le N-1 \end{aligned} \qquad (2)$$

Here fint is the library function that converts a real number into another one whose fractional part is zeroed. In other words, it returns the integer part of the input number, but written as a floating point number (remember that the MAD only processes real numbers). Since the $I_j$ are positive, we can modify the fint function to avoid some conditional instructions on the sign of the number. In this way the pipe filling performance of the code rises to 7%, compared with the previous 1%.

The typical data we use are single-byte grey levels, hence the base is 256 and, from what we have described above, we can pack three pixels into each real number. With a different input sampling we may pack more (or fewer) integer numbers into each floating point one; for example, with 64 grey levels we may pack four pixels per word.

This simple algorithm is deterministic in the sense that its main features are known a priori. First of all, the total amount of data is reduced by a factor $N$. In this way we reduce not only the I/O time but also the disk space needed to store such packed images on the host machine.
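The packing and unpacking steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the Quadrics code: relation (1) is read as F = sum of I_j * b^(-j), the base b is assumed to be a power of two so that every operation is exact, and single precision is emulated with a hypothetical helper `f32`. Unpacking is written as a running remainder (take the integer part, subtract it, multiply by b), which yields the same values as relation (2). All function names are invented for the example.

```python
import struct

SIGNIFICAND_BITS = 24  # 23 stored mantissa bits plus the hidden bit


def f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fint(x: float) -> float:
    """Integer part of a non-negative number, returned as a float."""
    return float(int(x))


def pack(values, b: int) -> float:
    """Relation (1): F = sum over j of I_j * b**(-j), held in one 4-byte word."""
    if b ** len(values) > 2 ** SIGNIFICAND_BITS:
        raise ValueError("too many values for a 24-bit significand")
    f = 0.0
    for j, value in enumerate(values):
        if not 0 <= value < b:
            raise ValueError("each value must lie in [0, b)")
        f = f32(f + value / b ** j)  # exact when b is a power of two
    return f


def unpack(f: float, b: int, n: int):
    """Recover n values from F using a running remainder (equivalent to (2))."""
    values = []
    remainder = f
    for _ in range(n):
        integer_part = fint(remainder)
        values.append(int(integer_part))
        remainder = f32((remainder - integer_part) * b)
    return values


if __name__ == "__main__":
    pixels = [200, 17, 255]  # three hypothetical grey levels
    word = pack(pixels, 256)
    print(word, unpack(word, 256, 3))
    print("words for a 640x480 image:", 640 * 480 // 3)
```

With three pixels per word, a 640 by 480 image occupies 102400 words instead of 307200, which is the factor N reduction described above.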