
API for Real Time/Dynamic Image and Video Processing Presented By

Ivan Wong, VP and Co-founder

Background: About CTAccel; CTAccel & Xilinx partnership

CTAccel Limited
• Founded in Mar, 2016
• Based in Hong Kong & Shenzhen
• Focus on FPGA-based accelerated computing for Data Center
• Core value of our company: Quality & Innovation

Partner with Xilinx Inc since 2016
• Delivered first Xilinx FPGA based Off-Premise solution in 2016
• Delivered first Xilinx FPGA based Cloud solution in 2017
• US Patent

Background: Market demand and challenges

More Data
• Users generate more image and video data every day
• Higher resolution in capture and display

Better Quality
• Users crave a better viewing experience

Faster Access
• Customers demand instant access to the resource

Challenges:
• Huge consumption of computational and storage resources
• Server and storage performance IS NOT KEEPING PACE

Data storage supply and demand worldwide, from 2009 to 2020 (in exabytes)*

*Source: https://www.statista.com/statistics/751749/worldwide-data-storage-capacity-and-demand/

Background: Constraints in existing solutions

Internet traffic increases by 26% annually, and image and video contribute a large portion of internet data. Source: Cisco VNI Forecast Highlights Tool

CPU performance per core has not increased since 2005. Source: The Future of Computing Performance (2011)

Chip Frequency graphed against year of introduction

How to achieve the fastest image processing with the least amount of computational resources?

Background: Modern image and video formats continue to emerge

Formats and tools shown: Guetzli, HEIF, Mozjpeg, OpenCV, Lepton, BPG, GraphicsMagick, ImageMagick, WebP, JPEG Baseline/Progressive, Pixels Processing

Professional image processing accelerator

Image Video AI

Where is the CTAccel image processor located in customers' production environment?

Scenarios: Cloud album, Social network, E-commerce, News app, UGC, Web portals, Cloud storage, ……

CIP in Object-oriented storage

CIP software stack and accelerated functions

System Stack
[Diagram: software stack; only the label "/FFmpeg" survives in the text.]

Accelerated Function Units
• Decode: JPEG, WebP, Lepton, HEIF
• Pixel Processing: Resizing, Crop, Rotate
• Encode: JPEG, WebP, Lepton, HEIF

Why Xilinx?

Applications: Financial Analytics, Machine Learning, Video/Image Processing, Big Data Analytics, Genomics, Security

[Diagram labels: Machine Learning; Video & Image Processing; Data Analytics; Genomics; Software-Defined Development Environments; Solution Development Partners & Customers; Compute Acceleration; Increase Service Level; Cloud, Enterprise to Edge-Computing]

Product 1: JPEG 2 JPEG

Problems solved
• Accelerate the speed of JPEG image thumbnail generation, reduce image-processing servers.

Scenarios: Internet apps with JPEG
• E-commerce platform
• Cloud album
• Social network with pictures
• Web portals
• News application

Solution
• End-user: Smart phone/PAD, Camera, …, PC
• App: Cloud album, Social network, …, News app
• CIP (FPGA): JPEG Decode -> Resize -> JPEG Encode

Key features
• High Throughput
• Low Latency
• Low CPU Utilization

Values
• TCO reduction: reduce image-processing servers; OPEX reduction
• Improve customer experience: accelerate the image rendering speed

JPEG 2 JPEG performance on cloud

[Charts: Throughput (MB/s), Latency (ms), and CPU Utilization; series: VU9P (HUAWEI fp1c.2xlarge) vs. 8vCPU (HUAWEI fp1c.2xlarge). The series behind each bar label is not recoverable from the extracted text.]

Input: 10000*4096x2160 (Average Size 803k); output sizes 640x480 and 2048x1080
• Throughput (MB/s) bar labels: 310.4, 62.2, 7.7, 7
• Latency (ms) bar labels: 1122, 1162, 201, 28
• CPU Utilization bar labels: 99%, 99%, 95%, 63%

Input: 10000*1024x768 (Average Size 130k); output sizes 240x180 and 768x576
• Throughput (MB/s) bar labels: 94.9, 50.8, 13.3, 8.5
• Latency (ms) bar labels: 197, 48, 23, 4
• CPU Utilization bar labels: 83%, 87%, 70%, 37%

Product 2: JPEG 2 WebP(M6) – Highest compression ratio

Problems solved
• Vast amount of computational resources and long time when processing WebP encoding by CPU;
• Accelerate the transcoding from JPEG to WebP;

Scenarios: Internet apps with WebP
• E-commerce platform
• Social network with pictures
• Web portals
• News application

Solution
• End-user: Smart phone/PAD, Camera, …, PC
• App: E-commerce, Social network, …, News app
• CIP (FPGA): JPEG Decode -> Resize -> WebP Encode

Key features
• High Throughput
• Low Latency
• Low CPU Utilization

Values
• TCO reduction: save CDN traffic; reduce image-processing servers; OPEX reduction
• Improve customer experience: accelerate the image rendering speed

JPEG to WebP performance

FPGA Used
• Single Xilinx UltraScale+
• Dual Xilinx UltraScale+

QPS
• Single FPGA is 5.11 times of CPU
• Dual FPGA is 8.39 times of CPU

Latency
• Single FPGA is 0.20 times of CPU
• Dual FPGA is 0.12 times of CPU

Input: 10000*4096x2160 images

Test environment
• CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2
• RAM: 128GB
• OS: CentOS Linux release 7.2.1511

[Charts: QPS, Throughput (MB/s), Latency (ms), and CPU Utilization at output sizes 640x480, 1024x768, and 2048x1080; series: FPGA*1, FPGA*2, CPU. The series behind each bar label is not recoverable from the extracted text.]
• QPS and Throughput bar labels (the two panels cannot be separated): 721, 599, 470, 384, 390, 318, 286, 237, 208, 173, 99, 108, 82, 89, 56, 25, 46, 21
• Latency (ms) bar labels: 941.62, 422.73, 240.64, 220.07, 113.03, 82.79, 60.97, 49.94, 32.7
• CPU Utilization bar labels: 100%, 100%, 100%, 44%, 45%, 45%, 25%, 23%, 21%

Modern image format: Lepton and its problem

• Lepton is a new format developed by Dropbox and made open source in 2016.
• Lepton compresses JPEG images losslessly. It reduces file size by an average of 22%.
• It preserves the original JPEG file bit-for-bit perfectly, including all metadata.
• Lepton can be applied in scenarios with massive image storage, where it can effectively reduce the storage cost.

Resolution   Subsampling   JPEG    Lepton   Compression Ratio
1024x768     420           1.3G    1.1G     15%
1024x768     422           1.4G    1.1G     21%
1024x768     444           1.6G    1.3G     18%
4096x2160    420           8.3G    6.3G     24%
4096x2160    422           8.7G    6.6G     24%
4096x2160    444           9.5G    7.1G     25%
Average                                     21%
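The ratio column can be recomputed from the two size columns. The sketch below is illustrative only: it assumes the ratio is the fraction of the JPEG size that Lepton removes, and it uses the sizes as printed (rounded to 0.1G), so the recomputed values can differ from the slide by a fraction of a percentage point. The names `ROWS` and `saving_percent` are introduced here for illustration and do not come from the deck.

```python
# Rows from the table above: resolution, subsampling, JPEG size (G),
# Lepton size (G), and the ratio printed on the slide (percent).
ROWS = [
    ("1024x768", "420", 1.3, 1.1, 15),
    ("1024x768", "422", 1.4, 1.1, 21),
    ("1024x768", "444", 1.6, 1.3, 18),
    ("4096x2160", "420", 8.3, 6.3, 24),
    ("4096x2160", "422", 8.7, 6.6, 24),
    ("4096x2160", "444", 9.5, 7.1, 25),
]


def saving_percent(jpeg_size: float, lepton_size: float) -> float:
    """Share of the JPEG size removed by Lepton, in percent (assumed definition)."""
    return (jpeg_size - lepton_size) / jpeg_size * 100.0


# Recompute every row and the unweighted average across rows.
savings = [saving_percent(jpeg, lepton) for _, _, jpeg, lepton, _ in ROWS]
average = sum(savings) / len(savings)

for (resolution, subsampling, jpeg, lepton, listed), saving in zip(ROWS, savings):
    print(f"{resolution} {subsampling}: {jpeg}G -> {lepton}G, "
          f"recomputed {saving:.1f}% (slide: {listed}%)")
print(f"average recomputed saving: {average:.1f}%")
```

Because the inputs are rounded, the comparison is only meaningful to within about one percentage point per row.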

The downside of Lepton is that it requires heavy computation power for both compression and decompression.

Problem: A conventional x86 server with dual E5-2630 CPUs can only convert JPEG into Lepton format at a rate of 20 megabytes per second.

Product 3: JPEG to Lepton

Problems solved
• Vast amount of computational resources and long time when processing Lepton encoding by CPU;
• Accelerate the transcoding from JPEG to Lepton;

Scenarios: Scenarios with massive images storage
• Cloud album;
• Cloud storage;

Solution
• End-user: Smart phone/PAD, Camera, …, PC
• App: Cloud album, Cloud storage
• CIP (FPGA): JPEG Decode -> Lepton Encode

Key features
• High Throughput
• Low Latency
• Low CPU Utilization

Values
• TCO Reduction: save CDN traffic; reduce image-processing servers; reduce image-storage servers; OPEX reduction

JPEG to Lepton performance

FPGA Used: Virtex UltraScale+ FPGA, Alveo U200

QPS: FPGA is 4.2 times of CPU
Latency: FPGA is 0.24 times of CPU

Input images
• Total number: 999 images
• Total file size: 3.8GB

Test environment
• CPU: Intel(R) Xeon(R) E5-2630v2 x2
• RAM: 128GB
• OS: CentOS Linux release 7.3.1611

Results (24 threads)

Metric              U200      CPU
QPS                 28.3      6.7
Throughput (MB/s)   108.2     25.7
Latency (ms)        827.77    3492.08
CPU Utilization     7%        100%

Product 4: JPEG to HEIF

Problems solved
• Vast amount of computational resources and long time when processing HEIC encoding by CPU;
• Poor customer experience;

Scenarios: Internet apps with HEIF
• E-commerce platform
• Social network with pictures
• Web portals
• News application

Solution
• End-user: Smart phone/PAD, Camera, …, PC
• App: Cloud album, Social network, …, News app
• CIP (FPGA): JPEG Decode -> Resize -> HEIF Encode

Key features
• High Throughput
• Low Latency
• Low CPU Utilization

Values
• TCO reduction: reduce image-processing servers; OPEX reduction
• Improve customer experience: accelerate the image rendering speed

JPEG to HEIF performance

QPS: FPGA*1 is 10 times that of CPU
Latency: FPGA*1 is 0.1 times that of CPU

Input
• Resolution: 1920*1080
• VQ: preset=medium
• Total files: 100

Output
• Resolution: 1920*1080

Test environment
• CPU: 2*Intel(R) Xeon(R) CPU E5-2680v4
• RAM: 128GB
• OS: CentOS Linux release 7.2.1511
• Kernel version: 3.10.0-862.el7.x86_64

Params              JPG          FPGA          CPU
Images              100          100           100
Total Size()        173738658    135858024     133290287
Compression Ratio                78.2%         76.72%
Param                            -slice_qp 5   crf=12
FPS                              200.12        20.02
PSNR                             54.51         57.06
VMAF                             97.25         97.31

(10179.74 kb/s) also appears next to the Param row; the column it belongs to is not recoverable from the extracted text.

Summary: Features and values of CIP

High performance
• 5-10 times throughput improvement compared to CPU
• 3 times latency reduction compared to CPU

Low power
• 20W-40W per FPGA card
• TCO reduction: improve computing density of DC, reduce racks

Software compatible
• Support: ImageMagick, OpenCV, GraphicsMagick, Lepton
• Allows seamless migration from a software-based implementation to CIP

Ease of maintenance
• Remote upgrading
• PR (Partial Reconfiguration) technology allows a fast and easy context switch in accelerator functionality without rebooting the server

Case study 1

Customer: Leading mobile phone manufacturer in China

Scenario: Image thumbnail generation for cloud album

Advantages
• High performance: reduce the total number of servers in the cluster by 50%; improve the throughput by 50%
• TCO reduction: reduce TCO by 25%
• Better QoS: 3x latency reduction; deterministic timing
• Ease of maintenance: smaller cluster of servers

CIP (Xilinx FPGA): JPEG decode -> Resize

Feedback from customer
· CIP has been deployed in the production environment for more than 16 months
· Maintaining stable operating status

Case study 2

Customer: Famous video portal website in China

Scenario: Transcoding from JPEG to WebP

Advantages
• High performance: 1 server with one CIP accelerator has the equivalent computing capability of 3 servers without CIP
• TCO reduction: reduce TCO by at least 50%
• Improve customer experience: reduce latency by xx% (the specific values have not been announced by our customer); deterministic timing: the same low latency at full-load and no-load
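The 3-to-1 capacity claim can be turned into a server count with simple arithmetic. The sketch below is a back-of-the-envelope estimate, not CTAccel's sizing method: the function name `servers_needed` and the fleet size of 30 are hypothetical, and it counts servers only, so it says nothing about accelerator cost or any other component of a TCO figure.

```python
import math


def servers_needed(baseline_servers: int, capacity_ratio: float) -> int:
    """Accelerated servers required to match a baseline fleet.

    capacity_ratio is how many plain servers one accelerated server
    replaces (3 in the case study above).
    """
    if baseline_servers < 0 or capacity_ratio <= 0:
        raise ValueError("baseline_servers must be >= 0 and capacity_ratio > 0")
    # Round up: a fraction of a server still needs a whole machine.
    return math.ceil(baseline_servers / capacity_ratio)


CAPACITY_RATIO = 3  # from the case study: 1 server with CIP ~ 3 servers without
baseline = 30       # hypothetical fleet size, chosen only for illustration

accelerated = servers_needed(baseline, CAPACITY_RATIO)
reduction = 1 - accelerated / baseline

print(f"{baseline} servers without CIP -> {accelerated} servers with CIP")
print(f"server count reduced by {reduction:.0%}")
```

For fleets that are not a multiple of three, the rounding step makes the reduction slightly smaller than two thirds.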

CIP: JPEG decode -> Resize -> WebP encode

Feedback from customer
· Test period: 2 months
· CIP has passed the test of customer
· CIP will be deployed in the production environment in 2018Q4

Evolution planning

Accelerated image/video transcoding
- Image codec: JPEG, WebP, HEIF, Lepton, BPG, GIF, PNG
- Image processing: High quality resizing, High performance resizing, smart-crop, super-resolution
- Video codec: H264, H265, AV1, AVS3

Accelerated image/video analytics
- AI function products around pictures and video content: content identification, face detection, pedestrian detection, etc.
- Functional services such as searching for pictures, searching for videos, etc.

Software as a Service
- Ultimately all CTAccel accelerated functions will be served by a unified SaaS framework
- Unified APIs across CSPs and on-premise

Products
• CIP, Image Codecs. Ready to Market: JPEG -> Lepton, JPEG -> WebP, JPEG -> HEIF, JPEG Resize/Crop/Rotate. Plan & Development: PNG, GIF, Smart-crop, Super-resolution, AVIF
• CVP, Video Codecs. Ready to Market: MPSoC - H265, MPSoC - H264. Plan & Development: AV1, AVS3
• Video/Image Analytics: Scan for pornographic content, Video game recognition, AI assist security inspection

Partners: Cloud platform support

Hardware platform support: Alveo U200, Alveo U50, Kintex® UltraScale™ 115, Zynq MPSoC

Product line of CTAccel

Image Video AI

• Higher performance density
• Improve quality of service
• TCO reduction

High-quality FPGA-based accelerated computing solution provider

Xilinx MPSoC with VCU - H264/H265

Problems solved
• Accelerate the speed of video processing, improve customer experience

Scenarios: Applications which require video transcoding
• Video portal websites;
• Short video apps;

Solution
• End-user: Smartphone, PC, …, TV
• App: Video portal website, Short video app, …, Live broadcast
• Xilinx MPSoC with VCU (FPGA): H264/H265 Decode -> Resize -> H264/H265 Encode

Key features
• FPS
• PSNR

Values
• TCO Reduction: reduce processing servers; OPEX reduction
• Improve customer experience

Performance of H.264 and H.265 (MPSoC)

[Charts: H.264 Encode and H.265 Encode, FPS at 1920*1080, 1280*720, 960*540, and 640*480; series: CVP vs. medium. The panel and series behind each bar label are not recoverable from the extracted text.]
• FPS bar labels as extracted: 643.87, 712.39, 578, 636, 516.21, 449.94, 478.07, 377.57, 307.2, 324.77, 227.13, 150.97, 118.96, 100.34, 30.37, 63.89

Test environment
• CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2
• RAM: 128GB
• OS: CentOS Linux release 7

Performance of H.264 and H.265 (MPSoC)

[Charts: FPS for CVP H.264, CVP H.265, X264 medium, and X265 medium on two servers, 2*E5-2630v2 and 2*E5-2680v4. The series behind each bar label is not recoverable from the extracted text.]
• FPS bar labels as extracted: 236.41, 236.03, 200.42, 200.68, 183.22, 85.89, 91.09, 28.3

Test environment
• CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2 & 2*Intel(R) Xeon(R) CPU E5-2680v4
• RAM: 128GB
• OS: CentOS Linux release 7

Input: Crowdrun 1080p, H264
Output: 720p, 540p, 480p

Quality of H.264 and H.265 (MPSoC)

[Charts: H.265 VMAF, H.265 PSNR, H.264 VMAF, and H.264 PSNR plotted against Bitrate (Mbps). Series in the H.265 panels: CVP H.265, X265 Medium, NVP4 H.265 Medium. Series in the H.264 panels: CVP H.264, X264 Medium, NVP4 H.264 Medium.]

Input: High dynamic and high complexity video (Crowdrun, park_joy)

Test environment
• CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2
• RAM: 128GB
• OS: CentOS Linux release 7

I-Frame 2 JPEG

Problems solved
• Accelerate the speed of "I-Frame 2 JPEG", improve customer experience

Scenarios: Applications which require "I-Frame 2 JPEG"
• Video portal websites;
• Short video apps;

Solution
• End-user: Smartphone, PC, …, TV
• App: Video portal website, Short video app, …, Live broadcast
• CIP + MPSoC (FPGA): H264/H265 Decode -> Resize -> JPEG Encode

Key features
• High Throughput
• Low Latency
• Low CPU Utilization

Values
• TCO Reduction: reduce processing servers; OPEX reduction
• Improve customer experience

AI accelerator: face detection/recognition

Problems solved
• Accelerate the speed of "face detection/recognition", improve customer experience, reduce TCO

Scenarios
• Smart City;
• Intelligent Transportation;
• Safe City;
• Security system;

Solution
• End-user: Smartphone, PAD, …, PC
• App: Intelligent Transportation, Safe City, Smart City, …
• Pipeline: Image Decode -> Resize (FPGA) -> Face recognition (GPU/FPGA)

Key features
• High Throughput
• Low Latency
• Low CPU Utilization

Values
• TCO Reduction: reduce image-processing servers; OPEX reduction
• Improve customer experience: improve the speed of face detection/recognition

AI accelerator: test scenarios

Tests
• Test-1: Alexnet model
• Test-2: nsfw model
• Test-3: Inception V4 model
• Test-4: ResNet-50 model

[Diagram: FPGA #1, FPGA #2, FPGA #3 -> Memory caching -> GPU computing matrix (five GPUs)]

Test environment
• CPU: 2*Intel(R) Xeon(R) CPU E5-2690v4 x 2
• RAM: 128GB
• OS: CentOS Linux release 7.4
• Kernel version: 3.10.0-514.2.2.el7.x86_64
• Python version: 2.7

Test data

Image type      Resolution   Quantity
8 Mega Pixel    2647X3278    14708
16 Mega Pixel   5312x2988    9280
1080p           1920x1080    21400

Test result 1-Alexnet

Alexnet model
• QPS: FPGA+GPU is 2.5 times faster than CPU+GPU
• Latency: FPGA+GPU is 30% of CPU+GPU
• CPU usage: FPGA+GPU is 20% of CPU+GPU

[Charts: Throughput (MB/S), Latency (ms), and CPU utilization (%) for 8M Pixels, 16M Pixels, and 1080p inputs; series: 3 FPGA + 5 GPU vs. CPU + 5 GPU. No bar values survive in the extracted text.]

Test result 2-nsfw

nsfw model
• QPS: FPGA+GPU is 3 times faster than CPU+GPU
• Latency: FPGA+GPU is 30% of CPU+GPU
• CPU usage: FPGA+GPU is 20% of CPU+GPU

[Charts: Throughput (MB/S), Latency (ms), and CPU utilization (%) for 8M Pixels, 16M Pixels, and 1080p inputs; series: 3 FPGA + 5 GPU vs. CPU + 5 GPU. No bar values survive in the extracted text.]

Test result 3-Inception V4

InceptionV4 model
• QPS: FPGA+GPU is 2 times faster than CPU+GPU
• Latency: FPGA+GPU is 40% of CPU+GPU
• CPU usage: FPGA+GPU is 30% of CPU+GPU

[Charts: Throughput (MB/S), Latency (ms), and CPU utilization (%) for 8M Pixels, 16M Pixels, and 1080p inputs; series: 3 FPGA + 5 GPU vs. CPU + 5 GPU. No bar values survive in the extracted text.]

Test result 4-ResNet-50

ResNet-50 model
• QPS: FPGA+GPU is 3 times faster than CPU+GPU
• Latency: FPGA+GPU is 30% of CPU+GPU
• CPU usage: FPGA+GPU is 20% of CPU+GPU

[Charts: Throughput (MB/S), Latency (ms), and CPU utilization (%) for 8M Pixels, 16M Pixels, and 1080p inputs; series: 3 FPGA + 5 GPU vs. CPU + 5 GPU. No bar values survive in the extracted text.]

Test result 5-Accuracy

Model         Category     Accuracy(top1)   Accuracy(top5)
Alexnet       TensorFlow   0.49             0.74
Alexnet       accel        0.49             0.73
Inceptionv4   TensorFlow   0.80             0.95
Inceptionv4   accel        0.79             0.95
ResNet50      TensorFlow   0.73             0.91
ResNet50      accel        0.72             0.90
nsfw          TensorFlow   0.75             N/A
nsfw          accel        0.75             N/A

For any product related enquiries:

Product enquiry email: [email protected]
Product website: www.ct-accel.com
Product hotline: +86-0755-88914045

Cloud partners: AWS, HUAWEI Cloud, Alibaba Cloud, Baidu Cloud, Tencent Cloud