Tencent Speeds MD5 Image Identification by 2X

Case Study Software Speeding MD5 Image Identification by 2x Intel® Integrated Performance Primitives High-Performance Computing TenCent Optimizes Image Identification Tencent, Inc., is China’s largest and most-used Internet service portal. It owns both the largest online game community and the largest Web portal (qq.com), as well as the No. 1 and No. 2 applications (WeChat*, QQ*) in China. Every day, Tencent needs to process billions of new user-generated images from WeChat, QQ, and QQ Album*. Some hot applications even have hundreds of mil - lions of images to be uploaded, stored, processed, and downloaded in a single day―which consumes vast computing resources. To manage, store, and process these images, Tencent developed Tencent File Sys - tem* (TFS*). But even with compression, the image volume reached hundreds of petabytes. Moreover, it is still growing explosively―and the supported cluster has more than 20,000 servers. Technical Background Based on TFS, the image processing system provides uploading, scaling, encod - ing, and downloading services. As an image uploads, TFS scales it into a different resolution and creates the related ID by Message Digest Algorithm 5 (MD5). 1 Next, the image is transcoded into WebP* format for storage. While downloading an image, the system must find the right place to read the image, and then transcode it into the user-required image format and resolution (Figure 1). Figure 1. TenCent File System* Image Processing Speeding MD5 Image Identification by 2x 2 Higher Performance and More Efficient Data Management Across a Wide Range of Applications “Through close collaboration Because the website has tons of visits the Intel® SIMD processor supplemen - each second, there’s a small possibility tary instruction sets. Intel SSE2 is sup - with Intel engineers, we that the image download component will plemented by Intel SSE3, Intel SSE4.x, adopted the Intel® Integrated read the wrong image. Avoiding this kind and Intel Advanced Vector Extensions of error requires an MD5 calculation and (Intel AVX). Performance Primitives library check. However, this is a huge computing Intel AVX is a 256-bit instruction set ex - workload―so Tencent needed to maxi - for the image identification tension to Intel SSE, designed to provide mize MD5 computing performance. component in our online even higher performance for applications Originally, Tencent used the md5sum* that are compute-intensive. Intel AVX image storage and processing utility tool along with the Operator* OS to adds new functionality to the Intel SIMD application. The application’s compute the MD5 value for each image instruction set (based on Intel SSE) on file. Intel worked closely with Tencent en - floating-point and integer computing, performance improved gineers to help them optimize perform - and it includes a more compact SIMD in - ance with Intel® Integrated Performance struction set. significantly, and our cost of Primitives (Intel® IPP)―which helped Ten - Figure 2 shows one SIMD operation on cent achieve a 100 percent performance operations reduced greatly. eight data (32-bit integer type, floating improvement on the Intel® architecture- point type) instructions. We really appreciate the based platform. collaboration with Intel and Intel AVX improves performance by ex - Intel® Streaming SIMD Extensions tending the breadth of vector processing are looking forward to and Software Optimization capability across floating-point and inte - ger data domains. This results in higher more collaboration.” Intel introduced an instruction set exten - performance and more efficient data ―Nicholas, sion with the Intel® Pentium® III processor management across a wide range of ap - Leader of the TFS-Based called Intel® Streaming SIMD Extensions plications such as image and audio/ Image Storage and Processing Team, (Intel® SSE). This was a major redesign of video processing, scientific simulations, TenCent an earlier single-instruction, multiple-data financial analytics, and 3D modeling and (SIMD) instruction set called MMX®, intro - analysis. duced with the Intel Pentium processor. Intel evolved the Intel SSE instruction Algorithms That Benefit from Intel SSE set along with Intel architecture, extend - Algorithms that can benefit from Intel ing it by wider vectors and adding a new SSE 2 include those that employ logical or extensible syntax and rich functionality. mathematical operations on data sets The latest SIMD instruction set, Intel® larger than a single 32-bit or 64-bit word. Advanced Vector Extensions 2 (Intel® Intel SSE uses vector instructions, or AVX2), can be found in the Intel® Core™ SIMD architecture, to complete opera - i7 processor. tions such as bitwise XOR, integer or Most of the Intel® Xeon® processors in the floating-point multiply-and-accumulate, TFS system support Intel SSE2, one of and scaling in a single clock cycle for mul- Figure 2. SIMD Operation on 8 Data Instructions Speeding MD5 Image Identification by 2x 3 Table 1. Intel® IPP Features Transform) that operate over a large number of time series samples. Optimized for Intel Engineered and Wide Range of Cross- • Digest, hashing, and encoding . Performance and Future-Proofed to Platform and OS Power Efficiency Shorten Development Time Functionalities Algorithms used for security, data corruption protection, and data loss • Highly tuned routines • Fully optimized for • Thousands of highly protection such as simple parity, current and past optimized signal, CRC (cyclic redundancy check), MD5, • Highly optimized processors data, and media SHA (secure hash algorithm), Galois using SSSE4, SSSE3, functions math, Reed-Solomon encoding, and Intel® SSE, and Intel® • Saves development, CBC (cypher-block-chaining) all • Broad domain sup - AVX2, Intel® AVX12 debug, and mainte - make use of logical and mathemati - port instruction sets nance time cal operators over blocks of data, often many kilobytes in size. • Performance beyond • Code once now, • Supports Intel® Quark™, Intel® Core™, what an optimize receive future opti - • Data transformation and data Intel® Xeon®, and compiler produces mizations later compression . Most often, simula - Intel® Xeon Phi™ alone tions in engineering and scientific platforms computing involve data transforma - tion over time and can include grids of data that are transformed. For Table 2. Processor-Specific Codes example, in physical thermodynam - Associated with Power-Specific Libraries ic, mechanical, fluid-dynamic, or electrical-field models, a grid of px mx Generic code optimized for processors with Intel® floating-point values is used to rep - Streaming SIMD Extensions (Intel® SSE) resent the physical fields as finite elements. These finite element grids W7 Optimized for processors with Intel SSE2 are then updated through mathe - matical transformations over time m7 Optimized for processors with Intel SSE3 to simulate a physical process. v8 u8 Optimized for processors with Supplemental Streaming SIMD Extensions 3 (SSSE3), including Optimized for Performance Intel® Atom™ processor Intel IPP is a performance building block for all kinds of image and signal pro - P8 y8 Optimized for processors with Intel SSE4.1 cessing, data compression, and cryptog - raphy needs. These ready-to-use, g9 e9 Optimized for processors with Intel® Advanced Vec - royalty-free functions are highly opti - tor Extensions (Intel® AVX) and Intel® Advanced En - mized using Intel SSE and Intel AVX and cryption Standard New Instructions Intel AVX2 instruction sets, which often outperform what an optimized compiler h9 i9 Optimized for processors with Intel AVX2 can produce alone. 3 Table 1 summarizes the features of Intel IPP. tiple 32-bit or 64-bit words. Speed-up comes intensity and color) and both bene - from the parallel operation and the size of fit from speedup relative to pro - The Intel IPP library is optimized for a the vector (multiword data) to which each cessing frame rates. variety of SIMD instruction sets. Besides mathematical or logical operator is applied. the optimization, Intel IPP also provides • Digital signal processing (DSP) . an automatic “dispatching” mechanism, Examples of algorithms that can signifi - Samples digitized from sensors and which can detect the SIMD instruction cantly benefit from SIMD vector instruc - instrumentation have resolution like set that is available on the running tions include: images as well as data acquisition processor and select the optimal SIMD • Image processing and graphics. rates. Often, a time series of digi - instructions for that processor. tized data that is one-dimensional Both scale in terms of resolution Table 2 shows processor-specific codes will still be transformed using algo - (pixels per unit area) and the pixel that Intel IPP uses. encoding (bits per pixel to represent rithms, like a DFT (Discrete Fourier Speeding MD5 Image Identification by 2x 4 See Understanding CPU Dispatching in Most Linux*-based operating systems in - function to set up the initial context the Intel® IPP Library for more informa - clude md5sum utilities in their distribu - state with the MD5-specified initial - tion on dispatching. For more information tion packages. ization vectors. on Intel IPP functions optimized for Intel The Intel IPP MD5 functions apply hash AVX, read the article Intel® IPP Functions 3. Keep calling the function MD5Update algorithms to digesting streaming mes - Optimized for Intel® AVX . to digest the incoming message sages. It uses a state context (for example, stream in the queue until its comple - ippsSHA1State) as an

Tencent Speeds MD5 Image Identification by 2X

07 Vectorization for Intel C++ & Fortran Compiler .Pdf

New Instruction Set Extensions

Operating Guide

SIMD Extensions

Amd Epyc 7351

Intel® Architecture and Tools

CS 110 Discussion 15 Programming with SIMD Intrinsics

PCLMULQDQ Instruction Definition

Intel® Software Guard Extensions (Intel® SGX)

AMD Ryzen 5 1600 Specifications

X86 Intrinsics Cheat Sheet Jan Finis [email protected]

AMD A10-6700 Specifications General Information