API for Real Time/Dynamic Image and Video Processing Presented By
Total Page:16
File Type:pdf, Size:1020Kb
API for Real time/Dynamic Image and Video Processing Presented By Ivan Wong VP and Co-founder Background CTAccelAbout & XilinxCTAccel partnership CTAccel Limited Partner with Xilinx Inc since 2016 Delivered first Xilinx FPGA based Off-Premise Founded in Mar, 2016 solution in 2016 Based in Hong Kong & Shenzhen Delivered first Xilinx FPGA based Cloud Focus on FPGA-based accelerated computing for Data solution in 2017 Center US Patent Core value of our company: Quality & Innovation Background Market demand and challenges More Data Users generate more image and video data everyday Higher resolution in capture and display More Better Quality Better Data Quality Users crave better viewing experience Faster Access Faster Customers demand instant access to the resource Access Challenges: • Huge consumption of computational and storage resource • Server and storage performance IS NOT KEEPING PACE Data storage supply and demand worldwide, from 2009 to 2020 (in exabytes)* *Source: https://www.statista.com/statistics/751749/worldwide-data-storage-capacity-and-demand/ Background Constraints in existing solutions Internet traffic increases by 26% annually, CPU performance per core has not image and video contributes a large increased since 2005 portion of internet data. Source: Cisco--VNI Forecast Highlights Tool Source: The Future of Computing Performance (2011) Chip Frequency graphed against year of introduction How to achieve the fastest image processing with the least amount of computational resources? Background Modern image and video formats continue to emerge Guetzli HEIF Mozjpeg OpenCV Professional Lepton image BPG processing GraphicsMagick accelerator ImageMagick WebP JPEG Baseline Pixels Processing /Progressive Image Video AI Where is CTAccel image processor located in customers’ production environment? Cloud album Social network E-commerce News app Scenarios UGC Web portals Cloud storage …… CIP in Object-oriented storage Image Video AI CIP software stack and accelerated functions System Stack Accelerated Function Units Pixel Decode Encode /FFmpeg Processing JPEG Resizing JPEG WebP Crop WebP Lepton Rotate Lepton HEIF HEIF Image Video AI Why Xilinx? Applications Financial Machine Video/Image Processing Big Data Analytics Genomics Security Analytics Learning Machine Software-Defined Learning Development Environments Video & Image Processing Solution Development Partners & Customers Compute Acceleration Data Analytics Increase Service Level Cloud, Enterprise to Edge-Computing Genomics Image Video AI Product 1:JPEG 2 JPEG Problems solved Solution Scenarios Internet apps with JPEG Accelerate the speed of JPEG End-user E-commerce platform Cloud album image thumbnail generation, Smart phone/PAD Camera … PC reduce image-processing Social network with pictures servers. Web portals App News application Key features Cloud album Social network … News app Values TCO reduction: High Throughput CIP Reduce image-processing servers Low Latency OPEX reduction JPEG JPEG FPGA Resize Low CPU Utilization Decode Encode Improve customer experience: Accelerate the image rendering speed Image Video AI JPEG 2 JPEG performance on cloud VU9P (HUAWEI fp1c.2xlarge) 8vCPU (HUAWEI fp1c.2xlarge) Throughput(MB/s) Latency(ms) CPU Utilization 350 310.4 1400 120.00% 1122 1162 99.00% 99.00% 300 1200 100.00% 95.00% 250 1000 80.00% 200 800 63.00% 60.00% 150 600 100 400 40.00% 62.2 201 20.00% 50 7.7 7 200 28 0 0 0.00% 640x480 2048x1080 640x480 2048x1080 640x480 2048x1080 Input:10000*4096x2160(Average Size 803k) Throughput(MB/s) Latency(ms) CPU Utilization 100 94.9 250 100.00% 83.00% 87.00% 197 80 200 80.00% 70.00% 60 50.8 150 60.00% 40 100 40.00% 37.00% 48 20 13.3 8.5 50 23 20.00% 4 0 0 0.00% 240x180 768x576 240x180 768x576 240x180 768x576 Input:10000*1024x768 (Average Size 130k) Image Video AI Product 2: JPEG 2 WebP(M6) – Highest compression ratio Problems solved Solution Scenarios Vast amount of computational Internet apps with WebP resources and long time when End-user E-commerce platform processing WebP encoding by Smart phone/PAD Camera … PC Social network with pictures CPU; Web portals Accelerate the transcoding from News application JPEG to WebP; App Key features E-commerce Social network … News app Values TCO reduction: High Throughput CIP Save CDN traffic Reduce image-processing servers Low Latency JPEG WebP FPGA Resize OPEX reduction Low CPU Utilization Decode Encode Improve customer experience: Accelerate the image rendering speed Image Video AI JPEG to WebP performance FPGA Used: QPS Throughput (MB/s) Single Xilinx UltraScale+ 800 721 700 Dual Xilinx UltraScale+ 700 599 600 600 470 500 500 QPS 384 390 Single FPGA is 5.11 times of CPU 400 400 318 286 Dual FPGA is 8.39 times of CPU 300 300 237 208 173 200 200 99 108 82 89 100 56 25 100 46 21 Latency 0 0 Single FPGA is 0.20 times of CPU 640x480 1024x768 2048x1080 640x480 1024x768 2048x1080 Dual FPGA is times of CPU 0.12 FPGA*1 FPGA*2 CPU FPGA*1 FPGA*2 CPU Input: 10000*4096 x2160 images Latency (ms) CPU Utilization 1000 941.62 120% 100% 100% 100% 800 100% Test environment: 80% CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2 600 RAM: 128GB 422.73 60% 44% 45% 45% OS: CentOS Linux release 7.2.1511 400 240.64 40% 220.07 25% 23% 21% 200 82.79 113.03 60.9732.7 49.94 20% 0 0% 640x480 1024x768 2048x1080 640x480 1024x768 2048x1080 FPGA*1 FPGA*2 CPU FPGA*1 FPGA*2 CPU Image Video AI Modern image format-Lepton and its problem Lepton is a new image compression format developed by Dropbox and made open source in 2016. Lepton compresses JPEG images losslessly. It reduces file size by an average of 22%. It preserves the original JPEG file bit-by-bite perfectly, including all metadata. Lepton can be applied into the scenarios with massive images storage. It can effectively reduce the storage cost. JPEG Lepton Compression Ratio 1024x768 420 1.3G 1.1G 15% 1024x768 422 1.4G 1.1G 21% 1024x768 444 1.6G 1.3G 18% 4096x2160 420 8.3G 6.3G 24% 4096x2160 422 8.7G 6.6G 24% 4096x2160 444 9.5G 7.1G 25% Average 21% The downside of Lepton is that it requires heavy computation power for both compression and decompression. Problem A conventional X86 server with dual E5-2630 CPU can only compress JPEG files into Lepton format at a rate of 20 megabytes per second. Image Video AI Product 3: JPEG to Lepton Problems solved Solution Scenarios Vast amount of computational resources and long time when End-user Scenarios with massive images storage : processing Lepton encoding by Smart phone/PAD Camera … PC ; CPU; Cloud album ; Accelerate the transcoding from Cloud storage JPEG to Lepton; App Key features Cloud album Cloud storage Values TCO Reduction: High Throughput CIP Save CDN traffic Low Latency Reduce image-processing servers FPGA JPEG Lepton Low CPU Utilization Decode Encode Reduce image-storage servers OPEX reduction Image Video AI JPEG to Lepton performance FPGA Used: QPS U200 Throughput(MB/s) U200 CPU Virtex UltraScale+ FPGA Alveo U200 30.0 28.3 CPU 120.0 108.2 25.0 100.0 20.0 80.0 QPS FPGA is 4.2 times of CPU 15.0 60.0 10.0 6.7 40.0 25.7 Latency 5.0 20.0 FPGA is 0.24 times of CPU 0.0 0.0 24 threads 24 threads U200 Input images: Latency(ms) CPU Utilization U200 Total number: 999 images CPU 4000.00 120% CPU 3492.08 100% Total file size: 3.8GB 3500.00 100% 3000.00 2500.00 80% Test environment : 2000.00 60% CPU: Intel(R) Xeon(R) E5-2630v2 x2 1500.00 40% 1000.00 827.77 RAM: 128GB 500.00 20% 7% OS: CentOS Linux release 7.3.1611 0.00 0% 24 threads 24 threads Image Video AI Product 4: JPEG to HEIF Problems solved Solution Scenarios Internet apps with HEIF Vast amount of computational End-user E-commerce platform resources and long time when Smart phone/PAD Camera … PC Social network with pictures processing HEIC encoding by CPU; Web portals Poor customer experience; News application App Key features Cloud album Social network … News app Values TCO reduction: High Throughput CIP Reduce image-processing servers Low Latency OPEX reduction JPEG HEIF FPGA Resize Low CPU Utilization Decode Encode Improve customer experience: Accelerate the image rendering speed Image Video AI JPEG to HEIF performance QPS: FPGA*1 is 10 times that of Params JPG FPGA CPU CPU Images 100 100 100 Latency: FPGA*1 is 0.1 times that of Total 173738658 135858024 133290287 CPU Size(Bytes) Input: Compression 78.2% 76.72% Resolution:1920*1080 Ratio VQ: preset=medium Total files: 100 Param -slice_qp 5 crf=12 Output: (10179.74 kb/s) Resolution: 1920*1080 FPS 200.12 20.02 Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2680v4 PSNR 54.51 57.06 RAM: 128GB OS: CentOS Linux release 7.2.1511 Kernel version: 3.10.0-862.el7.x86_64 VMAF 97.25 97.31 Image Video AI SummaryFeatures of CIP’s of valuesCIP High 5-10 times throughput promotion compare to CPU performance 3 times latency reduction compare to CPU 20W-40W per FPGA card Low power TCO reduction: improve computing density of DC, reduce racks Software Support: ImageMagick, OpenCV, GraphicsMagick, Lepton compatible Allow seamless migration from software-based implementation to CIP Remote upgrading Ease of PR(Partial Reconfiguration) technology allows fast and easy context switch in maintenance accelerator functionality without rebooting the server Image Video AI Case study 1 Customer Advantages Leading mobile phone High TCO Better Ease of manufacturer in China performance reduction QoS maintenance Reduce the total Reduce TCO by 3x latency reduction Smaller cluster of number of server 25% servers Scenario cluster by 50% Improve the throughput by 50% Deterministic