High Performance Image Processing Solution with Intel® Platform Technology

Yang Lu

Intel Corporation 2015.

White Paper: High Performance Image Processing Solution with Intel® Platform Technology

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm. Intel, the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others © 2015 Intel Corporation.

Software and workloads used in performance tests of this paper may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.


Contents

1. Abstract
2. Image Processing Introduction
3. Performance Characteristics
3.1 Image Processing Performance Overview
3.2 Simultaneous Multithreading and Turbo Boost
3.3 Microarchitecture Characteristics
4. High Performance Solution Based on Intel® Xeon™ Platform
4.1 Image Compression Tuning
4.2 Image Scaling Program Tuning
4.2.1 Down-Sampling Algorithm
4.2.2 Intel® High Performance Tools
5. WebP Image Processing
6. Summary
Reference

Author contact: Yang Lu, Senior Application Engineer


1. Abstract

With the growing popularity of internet and media cloud applications, a huge volume of image data is generated, used and shared every day, which presents major computing and storage challenges to media-related industries and companies. In this paper, we study the most popular image processing techniques, analyze their performance challenges, explore tuning methodology, and implement the most effective solutions on the IA (Intel Architecture) platform. We aim to maximize the capabilities of IA platforms for typical image processing workloads and to achieve the best performance and efficiency for the internet and media industry.

2. Image Processing Introduction

A picture is worth a thousand words. People increasingly upload pictures to social networking services (SNS) to describe their status, feelings and events. Sellers upload many distinct pictures to describe and advertise their products on B2B, B2C and C2C platforms. We can even search for knowledge or information through images on popular search engines. Images of all kinds fill every corner of our lives. This flood of images presents a computing and storage challenge: how can those images be processed, compressed, stored, managed and transmitted effectively? And which platforms and technologies are most efficient for image processing? In this paper, we analyze the typical image processing framework and its performance characteristics, then explore the most effective IA technologies to maximize the performance of image processing applications, and propose the best solution in terms of processing performance and efficiency.

Generally, most companies need to scale and edit the images that customers upload or the media content purchased from third parties. Typical operations include scaling the original images to the different dimensions that fit different terminal devices, compressing the images to a target format that further saves storage and network bandwidth, and editing the images (adding a logo or watermark) to meet business requirements. Figure 1 shows the typical image processing flow that most companies adopt.


Figure 1: Typical Image Processing Flow for Cloud Applications

To conduct this processing, the following software stacks are mainly adopted:

| Software | Type | License | Developed by | Language | OS |
|---|---|---|---|---|---|
| ImageMagick | Image manipulation | Apache 2.0 License | ImageMagick Studio LLC | - | Cross platform |
| GraphicsMagick | Fork from ImageMagick version 5.5.2, emphasizing stability and performance | MIT License | GraphicsMagick Group | C | Cross platform |
| OpenCV | Computer Vision library and framework | BSD License | Intel Corporation, Willow Garage, Itseez | C/C++ | Cross platform |
| OpenGL | Graphics application API, aiming at hardware-accelerated rendering | SGI Open source license and Trademark License | formerly: OpenGL Architecture Review Board (ARB); now: Khronos Group | C | Cross platform |

Table 1: Common Image Processing Software Stack


Currently, most media companies store and distribute images in formats such as JPEG, GIF and PNG to achieve the best compression ratio and flexible internet content expression. Table 2 lists the characteristics, usage models, advantages and disadvantages of these image formats.

| Name | Image Format | File Extensions | Developed by | Licensed | Lossless | Animation Support | Transparency Support | Usage | Pros | Cons | Browser Support (without plugin) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BMP | Windows Bitmap | .bmp .dib | Microsoft | No | Yes | No | Yes | - | - | Large file size | No |
| GIF | Graphics Interchange Format | .gif | CompuServe | No (expired) | Yes | Yes | Yes | Animation | Animation; widely supported format; transparency support | Limited to 256 colors | ie, firefox, chrome, safari, opera |
| JPEG | Joint Photographic Experts Group | .jpg .jpeg .jpe .jif .jfif .jfi | Joint Photographic Experts Group | No (invalid) | No | No | No | Photography | Small file size; widely supported format | Lossy compression | ie, firefox, chrome, safari, opera |
| JPEG2000 | Joint Photographic Experts Group 2000 | .jp2 .j2c .j2k .jpx .jpf | Joint Photographic Experts Group | Yes (ISO/IEC 15444-2) | Yes (lossless and lossy) | Yes | Yes | JPEG replacement, HD imaging | Small file size | Computing intensive | safari |
| MNG | Multiple-image Network Graphics | .mng | W3C (donated by PNG Development Group) | - | Yes | Yes | Yes | Animation | - | Not widely supported | No |
| PNG | Portable Network Graphics | .png | W3C (donated by PNG Development Group) | No | Yes | No | Yes | Icons | Lossless; widely supported format; transparency support | - | ie, firefox, chrome, safari, opera |
| PSD | Photoshop Image Document | .psd .pdd | Adobe Systems | - | Yes | No | Yes | Editing | Lossless; layers support; transparency support | - | No |
| RAW | RAW Image file | .crw .cr2 .raw .rw2 .nef .nrw .orf ... | Camera manufacturer | - | Yes | No | No | HDR photography, archiving | Lossless | Large file size | No |
| TIFF | Tagged Image File Format | .tiff .tif | Adobe | - | Yes | No | No | Images from scanner, HD imaging | Lossless | Large file size | No |
| WebP | WebP | .webp | Google | - | No | No | No | - | Small file size | Lossy compression | chrome, opera |

Table 2: Common Image Standard and Format


Efficiently handling so many types of images, each with its dedicated processing, is a big challenge for backend clusters. End users generate and upload all kinds of images every day, and new media content is created and distributed frequently, so image processing clusters are always under heavy processing pressure. In the following sections we analyze which IA technologies are most important for image processing applications, and how to take advantage of those technologies to improve image processing performance and ultimately benefit media-related businesses.

3. Performance Characteristics

3.1 Image Processing Performance Overview

Image processing is a typical compute-intensive workload that consumes a lot of CPU and memory resources. The performance of image processing applications depends heavily on the capabilities of the CPU cores, the cache and the memory bandwidth. As examples, we take traditional image scaling, compression and rotation, and run those applications on different IA platforms, from low-end to high-end processors, as listed in Table 3. The results are shown in Figure 2.

Intel® Xeon™ Platform

| Processor Number | E5-2620V2 | E5-2640V2 | E5-2697V2 | E5-2697V3 HSW |
|---|---|---|---|---|
| # of Cores | 6 | 8 | 12 | 14 |
| # of Threads | 12 | 16 | 24 | 28 |
| Clock Speed | 2.1 GHz | 2 GHz | 2.7 GHz | 2.3 GHz |
| Max Turbo Frequency | 2.6 GHz | 2.5 GHz | 3.5 GHz | 4.0 GHz |
| Intel® Smart Cache | 15 MB | 20 MB | 30 MB | 35 MB |
| Intel® QPI Speed | 7.2 GT/s | 7.2 GT/s | 8 GT/s | 9.6 GT/s |

Table 3: Intel® Xeon™ Ivy Bridge and Haswell Platform Configuration

The performance metric here is processing time; lower is better. Figure 2 shows that processing performance improves as CPU frequency, cache size and core count increase, and with the IA architecture upgrade. Therefore, higher-bin processors are more efficient for image processing applications.


[Chart: image processing performance at different IA platforms. Y axis: time (s), 1.5 to 4.5. Series: rotate (s), scaling (s), compress (s). Platforms: E5-2620 v2, E5-2640 v2, E5-2697 v2, E5-2695 v3 (HSW).]

Figure 2: Image Processing Applications Performance at Different Platforms

We also ran the traditional image scaling application on different Haswell platforms and used IPP (Intel® Integrated Performance Primitives)[6] to optimize its performance. Figure 3 shows that image scaling performance increases as CPU frequency and cache capacity grow, and that CPU frequency plays the more important role for this application.

[Chart: image scaling performance at Haswell platform. Y axis: scaling time (ms), 100 to 240.]

| Platform | original (ms) | ipp tuning (ms) |
|---|---|---|
| E5-2670 v3 (2.30 GHz, 30M cache) | 233 | 192 |
| E5-2699 v3 (2.30 GHz, 45M cache) | 221 | 158 |
| E5-2697 v3 (2.60 GHz, 35M cache) | 207 | 152 |

Figure 3: Image Scaling Application Performance at Haswell Platforms
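The gain from IPP tuning in Figure 3 can be expressed as a speedup ratio, baseline time divided by tuned time. The following is a minimal sketch of that arithmetic using the values listed in the figure; the function names are illustrative and not part of any Intel tool.

```cpp
#include <cstdio>

// Speedup of a tuned run over a baseline, both given as elapsed times.
// Lower time is better, so speedup = baseline / tuned.
double speedup(double baseline_ms, double tuned_ms) {
    return baseline_ms / tuned_ms;
}

// Prints the IPP speedup for the three Haswell parts listed in Figure 3.
void print_figure3_speedups() {
    const char *cpu[] = {"E5-2670 v3", "E5-2699 v3", "E5-2697 v3"};
    const double original_ms[] = {233.0, 221.0, 207.0};
    const double ipp_ms[] = {192.0, 158.0, 152.0};
    for (int i = 0; i < 3; ++i) {
        std::printf("%s: %.2fx\n", cpu[i], speedup(original_ms[i], ipp_ms[i]));
    }
}
```

With the figure's numbers, the ratios come out at roughly 1.2x to 1.4x across the three processors.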


3.2 Simultaneous Multithreading and Turbo Boost

Intel® Simultaneous Multithreading (SMT), also called Hyper-Threading (HT), and Intel® Turbo Boost are two key technologies that the IA platform provides. They are supported on most IA platforms and contribute substantial speedups to many media-related applications.

− With SMT, the operating system addresses two virtual (logical) cores for each physical core, and the core's resources are shared between them when possible. The main function of Hyper-Threading is to decrease the number of dependent instructions in the pipeline. It offers performance benefits when the CPU cores are heavily loaded, but not in every application: when cores would otherwise stay idle, SMT only introduces task/thread switching overhead.
− Intel® Turbo Boost increases performance by translating temperature, power and current headroom into higher frequency. The actual Turbo Boost frequency is determined by the number of active cores (in C0 state), the type of workload, the estimated power consumption, and the processor temperature. For most customers, Intel® Turbo Boost Technology has a positive impact on application performance, since it provides opportunistic frequency upside above the rated frequency when conditions allow, and no action is required. For customers who need a deterministic processor frequency, it is recommended to disable Intel® Turbo Boost Technology.

Figure 4 shows an example of the SMT and Turbo Boost contribution to the image scaling workload. SMT provides around a 30% speedup and Turbo Boost adds about 5%. These two IA technologies improve performance noticeably and have been widely adopted in customer image processing applications.

Figure 4: SMT and Turbo Boost Contribution for Image Processing Application


3.3 Microarchitecture Characteristics

From the processor microarchitecture perspective, a well-tuned compute-intensive workload should keep the relevant execution units almost fully busy, with few cache misses, a low branch misprediction ratio, and a good CPI (cycles per instruction). Generally, a CPI below 1 is the ideal state. Figure 5 shows microarchitecture profiling data for a well-tuned image scaling application. It also shows that the image scaling application consumes a lot of memory bandwidth and resources, so memory capability is also very important for the performance of image processing applications.

Figure 5: Processor Microarchitecture Characteristics for Image Processing
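As a small illustration of the CPI rule of thumb above, the sketch below derives CPI from two hardware counter readings. The counter values would come from a profiler; the function names and the example values are hypothetical and are not taken from the profiling data in the figure.

```cpp
#include <cstdint>

// Cycles per instruction from two hardware counter readings.
// Returns 0.0 when no instructions retired, to avoid dividing by zero.
double cycles_per_instruction(std::uint64_t cycles, std::uint64_t instructions) {
    if (instructions == 0) return 0.0;
    return static_cast<double>(cycles) / static_cast<double>(instructions);
}

// Rule of thumb from the text: a CPI below 1 is the ideal state.
bool looks_well_tuned(std::uint64_t cycles, std::uint64_t instructions) {
    return instructions != 0 &&
           cycles_per_instruction(cycles, instructions) < 1.0;
}
```

For instance, 800 cycles for 1000 retired instructions gives a CPI of 0.8, which satisfies the rule, while 1500 cycles for the same instruction count does not.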

With these performance characteristics in mind, the following sections analyze the IA platform technologies and investigate how to use them to benefit image processing applications.


4. High Performance Solution Based on Intel® Xeon™ Platform

As illustrated in section 3, image processing is a standard CPU- and memory-intensive workload that demands a lot from the server platform, including core computing efficiency, reliability and stability. In this section, we introduce the key IA technologies that bring a significant performance boost to image processing.

4.1 Image Compression Tuning

Most images are stored and distributed in a compressed format such as JPEG, GIF or PNG, so image compression and decompression is one of the most critical modules in image processing clusters. Every image consists of several [width*height] matrices, so image processing is essentially a set of matrix computations. This kind of calculation can be optimized by vectorization, for example with IA SIMD (single instruction, multiple data) instructions, which perform the same operation on multiple data elements simultaneously, as shown in Figure 6. This greatly improves data throughput and execution efficiency.

Figure 6: SIMD Methodology
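To make the SIMD idea concrete, here is a minimal sketch that brightens 8-bit pixels with SSE2 intrinsics, processing 16 pixels per instruction. It is an illustration only, not code from libjpeg-turbo or IPP; the function name `brighten` is hypothetical, and a scalar loop handles the tail and any build where SSE2 is not enabled.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Adds `delta` to every 8-bit pixel, saturating at 255.
void brighten(std::uint8_t *pixels, std::size_t count, std::uint8_t delta) {
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i vdelta = _mm_set1_epi8(static_cast<char>(delta));
    for (; i + 16 <= count; i += 16) {
        __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>(pixels + i));
        v = _mm_adds_epu8(v, vdelta);  // 16 saturating adds in one instruction
        _mm_storeu_si128(reinterpret_cast<__m128i *>(pixels + i), v);
    }
#endif
    // Scalar tail, and the full fallback when SSE2 is not available.
    for (; i < count; ++i) {
        const unsigned sum = static_cast<unsigned>(pixels[i]) + delta;
        pixels[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}
```

The vector loop and the scalar loop compute the same result; the difference is only how many pixels each instruction touches, which is where the throughput gain comes from.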

SIMD instructions are widely supported in x86 processors, evolving from MMX through SSE and AVX to AVX2 across x86 platform generations. Many projects use these vector instructions to optimize image compression and decompression. libjpeg-turbo[3] is one open source project that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86 platforms. Generally, libjpeg-turbo provides a 2-4x speedup for JPEG compression and decompression. Figure 7 shows one example in which we used the libjpeg-turbo library to optimize a JPEG image compression application via SIMD.

Figure 7: JPEG Image Compression Tuning by libjpeg-turbo with SIMD

4.2 Image Scaling Program Tuning

As illustrated in section 2, image scaling is another heavily used function in the image processing cluster, since most companies scale original images to the target resolution and format when they receive the image source. The image scaling application can be optimized at both the algorithm level and the code level.

4.2.1 Down-Sampling Algorithm

As shown in Figure 8, a traditional image scaling application first decodes the original compressed image to raw data, then scales it down or up as the application requires, and finally saves the output in the target format.


Figure 8: Image Scaling Application

A. N. Skodras and C. A. Christopoulos proposed an algorithm, "Down-Sampling of Compressed Images in the DCT Domain"[8]: during the decompression stage, only the most important part of the data is sampled and decompressed instead of decoding at full size, which saves a lot of computing time and improves performance. JPEG uses an 8x8 DCT compression algorithm, so a JPEG image can be decompressed directly to n/8 of its size (n=4,2,1). In this way, we can decompress the image to a suitable n/8 resolution instead of the original full size.

For example, with the ImageMagick[10] framework, scaling a source image from 4352x3264 to a target of 500x375:
o Step 1: decompress the source image to (4352/8) x (3264/8) = 544x408
o Step 2: zoom the image to 500x375
These two steps can be implemented by ImageMagick with the following command:
    # convert -size 500x375 -scale 500x375 source.jpg target.jpg
(-size 500x375: ImageMagick will decompress to 544x408 automatically)

The performance is shown in Table 4:

| Scaling Method | Time (s) | Comments |
|---|---|---|
| # convert -scale 500x375 source.jpg target.jpg | 0.825 | original full size decompression and scaling |
| # convert -size 500x375 -scale 500x375 source.jpg target.jpg | 0.230 | the -size option first decompresses to 544x408 automatically in the IM framework, then scales to 500x375 |

Performance speedup: 0.825/0.230 = 3.6x

Table 4: down-sampling algorithm performance
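The choice of the n/8 decode scale in the example above can be sketched as follows. This is an illustrative helper, not ImageMagick or libjpeg code: the function names are hypothetical, and rounding the decoded dimension up is an assumption. It picks the smallest n for which the decoded image is still at least as large as the target, so that the final scaling step only ever shrinks.

```cpp
// Decoded size of one JPEG dimension at scale n/8, rounded up.
int scaled_dim(int dim, int n) {
    return (dim * n + 7) / 8;
}

// Returns the smallest n in {1, 2, 4, 8} such that decoding at n/8 still
// yields an image at least as large as the target in both dimensions.
int pick_dct_scale(int src_w, int src_h, int dst_w, int dst_h) {
    const int candidates[] = {1, 2, 4, 8};
    for (int n : candidates) {
        if (scaled_dim(src_w, n) >= dst_w && scaled_dim(src_h, n) >= dst_h) {
            return n;
        }
    }
    return 8;  // target is larger than the source: decode at full size
}
```

For the 4352x3264 source and 500x375 target, the helper selects n=1, which corresponds to the 544x408 intermediate image in the example.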

In some ImageMagick versions, the "-size" parameter is not supported well. Therefore the most stable solution is to modify the code in the file "ImageMagick_DIR/coders/jpeg.c" as follows (the changed line is marked by the comment):

    static Image *ReadJPEGImage(const ImageInfo *image_info, ExceptionInfo *exception)
    {
      …
      if (units == 2)
        image->units=PixelsPerCentimeterResolution;
      number_pixels=(MagickSizeType) image->columns*image->rows;
      //option=GetImageOption(image_info,"jpeg:size"); //change this line to the following line
      option=image_info->size;
      if (option != (const char *) NULL)
        {
          …
        }
      …
    }

4.2.2 Intel® High Performance Tools

To maximize the capabilities of the IA platform and help customers use and deploy advanced IA technologies, Intel developed a full set of high performance tools and libraries[4] for all IA-based client and server platforms, many operating systems (Windows*, Linux*, OS X* and Android), and various domains, such as system profiling, compilers, math kernel libraries, cluster analysis, graphics SDKs, and multithreaded programming tools. The Intel® Compiler and IPP (Intel® Integrated Performance Primitives) are widely used to optimize image processing applications.

4.2.2.1 Intel® Compiler

ICC (Intel® Compiler)[5] automatically generates optimized code for all IA-based platforms, including auto-vectorization and auto-parallelization, memory and cache-line tuning, and a series of high-level optimizations based on the advanced features of Intel architecture. It tries to complete the task in the fewest CPU cycles, and it is compatible with Microsoft Visual C++ on Windows and GCC (GNU Compiler Collection) on Linux.

Figures 9 and 10 show two examples of the ICC contribution to the image scaling application, in single-thread/single-core and multi-thread/multi-core scenarios respectively. Replacing GCC with ICC while using the same optimization switches provides a distinct performance improvement here.


Figure 9: icc contribution for single thread image scaling application

Figure 10: icc contribution for multi thread image scaling application

Generally, for applications running on Intel architecture, ICC can help customers obtain further performance improvement easily and flexibly.

4.2.2.2 Intel® Integrated Performance Primitives

IPP (Intel® Integrated Performance Primitives)[6] exploits thread-level parallelism and the best Intel architecture instruction set implementation for the following applications and algorithms:
1) Image, video and audio processing
2) Data communication
3) Data compression and encryption
4) Signal processing, etc.

IPP functions achieve significant performance improvement via the following technologies:
− thread-level parallelism
  o Multi-Core
  o Hyper-Threading
− instruction set architecture
  o SIMD vectorization instructions: MMX, SSE, AVX
  o processing data in larger chunks with each instruction
  o any new instructions based on IA arch.
− microarchitecture, by
  o pre-fetching data and avoiding cache blocking
  o resolving data and trace cache misses
  o avoiding branch mis-predictions

Here we take a common image scaling workload to demonstrate how to use the IPP library to replace the original implementation and achieve better performance on the IA platform:

    #include "lanczos.h"
    /* The names of three further #include headers are missing from the source text. */

    //------ Original Scaling Implementation ------//

    void original_scaling(IplImage *src, IplImage *dst, int width, int height)
    {
        double x_factor;
        double y_factor;
        LanczosResizeFilter *resize_filter;
        int long span = 0;
        IplImage *filter_image = NULL;
        CvSize filter_size;
        unsigned int status;

        x_factor = (double)width/src->width;
        y_factor = (double)height/src->height;

        // set the value of filter structure //
        resize_filter = AcquireLanczosResizeFilter();
        filter_size.width = width;
        filter_size.height = height;

        // create temp matrix //
        if ((x_factor*y_factor) > WorkLoadFactor) {
            filter_size.width = width;
            filter_size.height = src->height;
            filter_image = cvCreateImage(filter_size, src->depth, src->nChannels);
        } else {
            filter_size.width = src->width;
            filter_size.height = height;
            filter_image = cvCreateImage(filter_size, src->depth, src->nChannels);
        }

        // compute pixels of dest matrix //
        if ((x_factor*y_factor) > WorkLoadFactor) {
            // horizontal pass first, then vertical
            span = (int long)filter_image->width + height;
            cv::Mat srcMat = cv::cvarrToMat(src);
            cv::Mat filterMat = cv::cvarrToMat(filter_image);
            cv::Mat dstMat = cv::cvarrToMat(dst);
            status = lanczosHorizontalFilter(resize_filter, &srcMat, &filterMat, x_factor, span);
            status &= lanczosVerticalFilter(resize_filter, &filterMat, &dstMat, y_factor, span);
        } else {
            // vertical pass first, then horizontal
            span = (int long)filter_image->height + width;
            cv::Mat srcMat = cv::cvarrToMat(src);
            cv::Mat filterMat = cv::cvarrToMat(filter_image);
            cv::Mat dstMat = cv::cvarrToMat(dst);
            status = lanczosVerticalFilter(resize_filter, &srcMat, &filterMat, y_factor, span);
            status &= lanczosHorizontalFilter(resize_filter, &filterMat, &dstMat, x_factor, span);
        }

        // free the temporary matrix and the filter //
        cvReleaseImage(&filter_image);
        DestoryLanczosResizeFilter(resize_filter);
    }

    //------ IPP Code ------//

    void ipp_scaling(IplImage *src, IplImage *dst, int width, int height)
    {
        LanczosResizeFilter *resize_filter;
        /* set the value of filter structure (not used by the IPP path below) */
        resize_filter = AcquireLanczosResizeFilter();
        double x_factor;
        double y_factor;
        x_factor = (double)width/src->width;
        y_factor = (double)height/src->height;

        ippSetNumThreads(1);

        // define ipp parameters
        IppiRect srcRoi = {0, 0, src->width, src->height};
        IppiRect dstRoi = {0, 0, width, height};
        IppiSize srcSize = {src->width, src->height};
        IppiSize dstSize = {width, height};

        int interpolation = IPPI_INTER_LANCZOS;
        int srcStep, dstStep;
        int channel = src->nChannels;

        // allocate the working buffer
        int BufferSize;
        ippiResizeGetBufSize(srcRoi, dstRoi, channel, interpolation, &BufferSize);
        Ipp8u *pBuffer = ippsMalloc_8u(BufferSize);

        if (channel == 1) {
            //ippiConvert_32f8u_C1R((Ipp32f*)Temdst->imageData, Temdst->widthStep,
            //    (Ipp8u*)dst->imageData, dst->widthStep, dstSize, ippRndNear);
            ippiResizeSqrPixel_8u_C1R((Ipp8u*)src->imageData, srcSize, src->widthStep, srcRoi,
                (Ipp8u*)dst->imageData, dst->widthStep, dstRoi,
                x_factor, y_factor, 0.0, 0.0, interpolation, pBuffer);
        } else {
            //ippiConvert_32f8u_C3R((Ipp32f*)Temdst->imageData, Temdst->widthStep,
            //    (Ipp8u*)dst->imageData, dst->widthStep, dstSize, ippRndNear);
            ippiResizeSqrPixel_8u_C3R((Ipp8u*)src->imageData, srcSize, src->widthStep, srcRoi,
                (Ipp8u*)dst->imageData, dst->widthStep, dstRoi,
                x_factor, y_factor, 0.0, 0.0, interpolation, pBuffer);
        }
        ippsFree(pBuffer);
        // release the filter acquired above so that it is not leaked
        DestoryLanczosResizeFilter(resize_filter);
    }

Figure 11 shows the result of tuning the image scaling application with the Intel optimized solution. On an Intel® Xeon™ CPU E5-2697 v3 @ 2.60GHz system, we used the libjpeg-turbo library to optimize image compression and decompression and the IPP library to tune the scaling process. Because libjpeg-turbo and IPP make good use of the IA SIMD/AVX instructions, we achieved a 2-3x speedup in this application.

[Chart: Image Scaling Application Tuning on IA platform. Y axis: time (s), 0 to 0.08. Series: test1.jpg to test6.jpg. Configurations: base, turbojpeg-SIMD, IPP, IPP+turbojpeg.]

Figure 11: Image Scaling Application Tuning

5. WebP Image Processing

WebP[11][12] is a new image format proposed and developed by Google, based on technology from On2. It aims to reduce image size by around half or more compared with traditional formats such as JPEG, PNG and GIF, which significantly reduces storage volume and network bandwidth for image processing platforms. WebP adopts a block-based transformation and prediction scheme, with eight bits of color depth and a luminance-chrominance model. Each block is predicted from the values of the three blocks above it and the one block to its left (block decoding is done in raster-scan order: left to right and top to bottom). It supports four basic modes of block prediction: horizontal, vertical, DC (one color), and TrueMotion. Mis-predicted data and non-predicted blocks are compressed in a 4×4 pixel sub-block with a discrete cosine transform or a Walsh–Hadamard transform. The lossy WebP algorithm only supports the 8-bit YUV 4:2:0 format, which may cause color loss. Furthermore, WebP is derived from the VP8 video format.
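The four basic prediction modes named above can be sketched for a single 4x4 block as follows. This is a simplified illustration, not code from the WebP library: real encoders apply these modes to larger blocks and offer further sub-block modes, and the types and names here (`Block4x4`, `Mode`, `predict`) are hypothetical. The inputs are the row of pixels above the block, the column to its left, and the top-left corner pixel.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

using Block4x4 = std::array<std::array<std::uint8_t, 4>, 4>;  // [row][col]

enum class Mode { Horizontal, Vertical, DC, TrueMotion };

// Horizontal: each row repeats the pixel to its left.
// Vertical:   each column repeats the pixel above it.
// DC:         every pixel is the rounded average of the 8 neighbours.
// TrueMotion: left + above - top_left, clamped to [0, 255].
Block4x4 predict(Mode mode, const std::array<std::uint8_t, 4> &above,
                 const std::array<std::uint8_t, 4> &left,
                 std::uint8_t top_left) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) sum += above[i] + left[i];
    const int dc = (sum + 4) / 8;

    Block4x4 out{};
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            int v = 0;
            switch (mode) {
                case Mode::Horizontal: v = left[r]; break;
                case Mode::Vertical:   v = above[c]; break;
                case Mode::DC:         v = dc; break;
                case Mode::TrueMotion: v = left[r] + above[c] - top_left; break;
            }
            out[r][c] = static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
        }
    }
    return out;
}
```

The encoder would compare each predicted block with the actual pixels and transform only the residual, which is why good prediction translates into smaller files.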

Table 5 and Table 6 compare WebP with JPEG and other image formats in terms of color rendering and compression performance. WebP delivers the same image quality as JPEG, but behaves somewhat differently in the color domain.

JPEG Image WebP Image

Table 5: JPEG and WebP Image Compare at Color Domain

For the five test images in Table 6, WebP files are roughly 2.4-3.8x smaller than the JPEG files, which saves storage and bandwidth, but WebP compression takes 3-4.8x more computing time.


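| Image | Format | time (s) | Size (KB) | quality (PSNR) | time webp/jpeg | file size jpeg/webp |
|---|---|---|---|---|---|---|
| 1.tiff (1419x1001, 1922 kB) | Webp | 0.208 | 75 | 41.02 | | |
| | Jpeg | 0.056 | 287 | 38.2934 | 3.714286 | 3.826667 |
| | PNG | 0.911 | 1917 | | | |
| | Gif | 0.633 | 887 | | | |
| | Jpeg2000 | 1.621 | 1905 | | | |
| 2.tiff (800x600, 253 kB) | Webp | 0.078 | 30 | 39.9 | | |
| | Jpeg | 0.026 | 113 | 34.2956 | 3 | 3.766667 |
| | PNG | 0.294 | 252 | | | |
| | Gif | 0.227 | 198 | | | |
| | Jpeg2000 | 0.593 | 647 | | | |
| 3.tiff (5120x3840, 15877 kB) | Webp | 2.302 | 523 | 43.74 | | |
| | Jpeg | 0.548 | 1967 | 43.2104 | 4.20073 | 3.760994 |
| | PNG | 12.186 | 15813 | | | |
| | Gif | 5.203 | 9222 | | | |
| | Jpeg2000 | 15.071 | 14309 | | | |
| 4.tiff (2560x1600, 6869 kB) | Webp | 0.701 | 604 | 38.57 | | |
| | Jpeg | 0.176 | 1442 | 37.2701 | 3.982955 | 2.387417 |
| | PNG | 1.963 | 6824 | | | |
| | Gif | 1.363 | 3310 | | | |
| | Jpeg2000 | 5.095 | 5367 | | | |
| 5.tiff (3942x4684, 17344 kB) | Webp | 2.62 | 772 | 43.26 | | |
| | Jpeg | 0.549 | 2533 | 46.8093 | 4.772313 | 3.281088 |
| | PNG | 12.047 | 17122 | | | |
| | Gif | 5.591 | 11865 | | | |
| | Jpeg2000 | 13.005 | 12983 | | | |

Table 6: Image Compression Performance Compare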
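The quality column in Table 6 is reported as PSNR. The paper does not state how the values were computed, so the following is only a sketch of the standard definition for 8-bit samples, PSNR = 10 log10(255^2 / MSE); the function name and the flattened-vector input layout are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// PSNR in dB between two 8-bit images of equal size (flattened samples).
// Returns NaN for mismatched or empty inputs, +infinity for identical images.
double psnr_8bit(const std::vector<std::uint8_t> &reference,
                 const std::vector<std::uint8_t> &decoded) {
    if (reference.size() != decoded.size() || reference.empty()) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    double squared_error = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double diff = static_cast<double>(reference[i]) -
                            static_cast<double>(decoded[i]);
        squared_error += diff * diff;
    }
    if (squared_error == 0.0) {
        return std::numeric_limits<double>::infinity();
    }
    const double mse = squared_error / static_cast<double>(reference.size());
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}
```

A higher PSNR means the decoded image is closer to the reference, so values around 40 dB, as in the table, indicate small per-pixel errors.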

Currently, more and more media cloud customers are adopting the WebP format to serve images to supported client devices, which saves considerable storage and network bandwidth cost. However, WebP image compression consumes over 3 times more computing resources, so these customers are also looking for the most efficient WebP processing solution to meet the computing requirement.


As with other image processing applications, the most time-consuming modules of WebP processing are block-based, data-intensive functions, which can also be optimized with IA SIMD vectorization. Figure 12 shows a workload that converts an original JPEG image to WebP format and delivers it to end users whose browsers support WebP. For the JPEG decompression step, libjpeg-turbo can be adopted to leverage SIMD optimization. For the WebP compression step, we compiled with the ICC SIMD optimization switches and obtained a 17% speedup.

[Chart: WebP Image Workload. Y axis: time (s), 0.082 to 0.092.]

Figure 12: WebP Image Tuning with SIMD

The profiling data in Figure 13 shows that Google has developed IA SIMD/SSE2 code to optimize WebP compression, but with little AVX2 support. AVX2 performs integer computation on 256-bit registers, which theoretically doubles the throughput of the previous 128-bit SSE code, and it is already supported on the Intel® Xeon™ E5-2600 v3 platform. We can therefore expect a further significant performance improvement when the SSE code is upgraded to AVX2 on the E5-2600 v3 platform.


Figure 13: Profiling Result of the WebP Processing

6. Summary

In this paper, we analyzed the architecture and performance characteristics of the most popular image processing applications, and demonstrated the capabilities of Intel server platforms and their leadership in the media processing domain through architecture design, rigorous manufacturing and excellent performance. As new image technologies and usage models emerge and the IA platform continues to advance, more and more applications and customers will benefit from the high performance and high reliability of IA technologies.

Reference
[1] The JPEG: http://www.jpeg.org/
[2] Wikipedia: http://es.wikipedia.org
[3] libjpeg-turbo: http://www.libjpeg-turbo.org/Main/HomePage
[4] https://software.intel.com/en-us/intel-sdp-home
[5] https://software.intel.com/en-us/c-compilers/
[6] https://software.intel.com/en-us/intel-ipp/
[7] http://software.intel.com/en-us/intel-isa-extensions
[8] https://www.academia.edu/3036328/Down-sampling_of_compressed_images_in_the_DCT_domain
[9] http://jpegclub.org/jidctred/
[10] http://www.imagemagick.org/
[11] https://code.google.com/p/webp/
[12] http://en.wikipedia.org/wiki/WebP
