API for Real Time/Dynamic Image and Video Processing Presented By
API for Real time/Dynamic Image and Video Processing Presented By
Ivan Wong VP and Co-founder Background CTAccelAbout & XilinxCTAccel partnership
CTAccel Limited Partner with Xilinx Inc since 2016 Delivered first Xilinx FPGA based Off-Premise Founded in Mar, 2016 solution in 2016 Based in Hong Kong & Shenzhen Delivered first Xilinx FPGA based Cloud Focus on FPGA-based accelerated computing for Data solution in 2017 Center US Patent Core value of our company: Quality & Innovation Background Market demand and challenges
More Data
Users generate more image and video data everyday Higher resolution in capture and display More Better Quality Better Data Quality Users crave better viewing experience Faster Access Faster Customers demand instant access to the resource Access
Challenges: • Huge consumption of computational and storage resource • Server and storage performance IS NOT KEEPING PACE
Data storage supply and demand worldwide, from 2009 to 2020 (in exabytes)*
*Source: https://www.statista.com/statistics/751749/worldwide-data-storage-capacity-and-demand/ Background Constraints in existing solutions
Internet traffic increases by 26% annually, CPU performance per core has not image and video contributes a large increased since 2005 portion of internet data. Source: Cisco--VNI Forecast Highlights Tool Source: The Future of Computing Performance (2011)
Chip Frequency graphed against year of introduction
How to achieve the fastest image processing with the least amount of computational resources? Background Modern image and video formats continue to emerge
Guetzli HEIF Mozjpeg
OpenCV
Professional Lepton image BPG processing GraphicsMagick accelerator
ImageMagick WebP JPEG Baseline Pixels Processing /Progressive Image Video AI Where is CTAccel image processor located in customers’ production environment?
Cloud album Social network E-commerce News app Scenarios UGC Web portals Cloud storage ……
CIP in Object-oriented storage Image Video AI CIP software stack and accelerated functions
System Stack Accelerated Function Units
Pixel Decode Encode /FFmpeg Processing JPEG Resizing JPEG WebP Crop WebP Lepton Rotate Lepton HEIF HEIF Image Video AI Why Xilinx? Applications Financial Machine Video/Image Processing Big Data Analytics Genomics Security Analytics Learning
Machine Software-Defined Learning Development Environments
Video & Image Processing Solution Development Partners & Customers Compute Acceleration Data Analytics Increase Service Level Cloud, Enterprise to Edge-Computing Genomics Image Video AI Product 1:JPEG 2 JPEG
Problems solved Solution Scenarios
Internet apps with JPEG
Accelerate the speed of JPEG End-user E-commerce platform Cloud album image thumbnail generation, Smart phone/PAD Camera … PC reduce image-processing Social network with pictures servers. Web portals App News application Key features Cloud album Social network … News app Values
TCO reduction: High Throughput CIP Reduce image-processing servers Low Latency OPEX reduction JPEG JPEG FPGA Resize Low CPU Utilization Decode Encode Improve customer experience: Accelerate the image rendering speed Image Video AI JPEG 2 JPEG performance on cloud VU9P (HUAWEI fp1c.2xlarge) 8vCPU (HUAWEI fp1c.2xlarge) Throughput(MB/s) Latency(ms) CPU Utilization 350 310.4 1400 120.00% 1122 1162 99.00% 99.00% 300 1200 100.00% 95.00% 250 1000 80.00% 200 800 63.00% 60.00% 150 600 100 400 40.00% 62.2 201 20.00% 50 7.7 7 200 28 0 0 0.00% 640x480 2048x1080 640x480 2048x1080 640x480 2048x1080 Input:10000*4096x2160(Average Size 803k) Throughput(MB/s) Latency(ms) CPU Utilization 100 94.9 250 100.00% 83.00% 87.00% 197 80 200 80.00% 70.00% 60 50.8 150 60.00% 40 100 40.00% 37.00% 48 20 13.3 8.5 50 23 20.00% 4 0 0 0.00% 240x180 768x576 240x180 768x576 240x180 768x576 Input:10000*1024x768 (Average Size 130k) Image Video AI Product 2: JPEG 2 WebP(M6) – Highest compression ratio Problems solved Solution Scenarios
Vast amount of computational Internet apps with WebP resources and long time when End-user E-commerce platform
processing WebP encoding by Smart phone/PAD Camera … PC Social network with pictures CPU; Web portals Accelerate the transcoding from News application JPEG to WebP; App Key features E-commerce Social network … News app Values
TCO reduction: High Throughput CIP Save CDN traffic Reduce image-processing servers Low Latency JPEG WebP FPGA Resize OPEX reduction Low CPU Utilization Decode Encode Improve customer experience: Accelerate the image rendering speed Image Video AI JPEG to WebP performance
FPGA Used: QPS Throughput (MB/s) Single Xilinx UltraScale+ 800 721 700 Dual Xilinx UltraScale+ 700 599 600 600 470 500 500 QPS 384 390 Single FPGA is 5.11 times of CPU 400 400 318 286 Dual FPGA is 8.39 times of CPU 300 300 237 208 173 200 200 99 108 82 89 100 56 25 100 46 21 Latency 0 0 Single FPGA is 0.20 times of CPU 640x480 1024x768 2048x1080 640x480 1024x768 2048x1080 Dual FPGA is times of CPU 0.12 FPGA*1 FPGA*2 CPU FPGA*1 FPGA*2 CPU
Input: 10000*4096 x2160 images Latency (ms) CPU Utilization 1000 941.62 120% 100% 100% 100% 800 100% Test environment: 80% CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2 600 RAM: 128GB 422.73 60% 44% 45% 45% OS: CentOS Linux release 7.2.1511 400 240.64 40% 220.07 25% 23% 21% 200 82.79 113.03 60.9732.7 49.94 20% 0 0% 640x480 1024x768 2048x1080 640x480 1024x768 2048x1080 FPGA*1 FPGA*2 CPU FPGA*1 FPGA*2 CPU Image Video AI Modern image format-Lepton and its problem
Lepton is a new image compression format developed by Dropbox and made open source in 2016. Lepton compresses JPEG images losslessly. It reduces file size by an average of 22%. It preserves the original JPEG file bit-by-bite perfectly, including all metadata. Lepton can be applied into the scenarios with massive images storage. It can effectively reduce the storage cost.
JPEG Lepton Compression Ratio
1024x768 420 1.3G 1.1G 15%
1024x768 422 1.4G 1.1G 21%
1024x768 444 1.6G 1.3G 18%
4096x2160 420 8.3G 6.3G 24%
4096x2160 422 8.7G 6.6G 24%
4096x2160 444 9.5G 7.1G 25% Average 21%
The downside of Lepton is that it requires heavy computation power for both compression and decompression. Problem A conventional X86 server with dual E5-2630 CPU can only compress JPEG files into Lepton format at a rate of 20 megabytes per second. Image Video AI Product 3: JPEG to Lepton
Problems solved Solution Scenarios
Vast amount of computational resources and long time when End-user Scenarios with massive images storage : processing Lepton encoding by Smart phone/PAD Camera … PC ; CPU; Cloud album ; Accelerate the transcoding from Cloud storage JPEG to Lepton; App Key features Cloud album Cloud storage Values
TCO Reduction: High Throughput CIP Save CDN traffic Low Latency Reduce image-processing servers FPGA JPEG Lepton Low CPU Utilization Decode Encode Reduce image-storage servers OPEX reduction Image Video AI JPEG to Lepton performance
FPGA Used: QPS U200 Throughput(MB/s) U200 CPU Virtex UltraScale+ FPGA Alveo U200 30.0 28.3 CPU 120.0 108.2 25.0 100.0 20.0 80.0 QPS FPGA is 4.2 times of CPU 15.0 60.0 10.0 6.7 40.0 25.7 Latency 5.0 20.0 FPGA is 0.24 times of CPU 0.0 0.0 24 threads 24 threads U200 Input images: Latency(ms) CPU Utilization U200 Total number: 999 images CPU 4000.00 120% CPU 3492.08 100% Total file size: 3.8GB 3500.00 100% 3000.00 2500.00 80% Test environment : 2000.00 60% CPU: Intel(R) Xeon(R) E5-2630v2 x2 1500.00 40% 1000.00 827.77 RAM: 128GB 500.00 20% 7% OS: CentOS Linux release 7.3.1611 0.00 0% 24 threads 24 threads Image Video AI Product 4: JPEG to HEIF
Problems solved Solution Scenarios
Internet apps with HEIF Vast amount of computational End-user E-commerce platform resources and long time when Smart phone/PAD Camera … PC Social network with pictures processing HEIC encoding by CPU; Web portals Poor customer experience; News application App Key features Cloud album Social network … News app Values
TCO reduction: High Throughput CIP Reduce image-processing servers Low Latency OPEX reduction JPEG HEIF FPGA Resize Low CPU Utilization Decode Encode Improve customer experience: Accelerate the image rendering speed Image Video AI JPEG to HEIF performance
QPS: FPGA*1 is 10 times that of Params JPG FPGA CPU CPU Images 100 100 100 Latency: FPGA*1 is 0.1 times that of Total 173738658 135858024 133290287 CPU Size(Bytes)
Input: Compression 78.2% 76.72% Resolution:1920*1080 Ratio VQ: preset=medium Total files: 100 Param -slice_qp 5 crf=12 Output: (10179.74 kb/s) Resolution: 1920*1080 FPS 200.12 20.02 Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2680v4 PSNR 54.51 57.06 RAM: 128GB OS: CentOS Linux release 7.2.1511 Kernel version: 3.10.0-862.el7.x86_64 VMAF 97.25 97.31 Image Video AI SummaryFeatures of CIP’s of valuesCIP
High 5-10 times throughput promotion compare to CPU performance 3 times latency reduction compare to CPU
20W-40W per FPGA card Low power TCO reduction: improve computing density of DC, reduce racks
Software Support: ImageMagick, OpenCV, GraphicsMagick, Lepton compatible Allow seamless migration from software-based implementation to CIP
Remote upgrading Ease of PR(Partial Reconfiguration) technology allows fast and easy context switch in maintenance accelerator functionality without rebooting the server Image Video AI Case study 1
Customer Advantages
Leading mobile phone High TCO Better Ease of manufacturer in China performance reduction QoS maintenance
Reduce the total Reduce TCO by 3x latency reduction Smaller cluster of number of server 25% servers Scenario cluster by 50% Improve the throughput by 50% Deterministic Image thumbnail timing generation for cloud album
CIP Feedback from customer JPEG Resize · CIP has been deployed in the production environment for more than 16 months decode · Maintaining stable operating status Xilinx FPGA Image Video AI Case study 2
Customer Advantages
Famous video portal High TCO Improve customer website in China performance reduction experience
1 server with one CIP Reduce TCO by Reduce latency by accelerator has the 50% at least xx% (not announced the Scenario specific values by our equivalent computing customer) capability with 3 Transcoding from servers without CIP Deterministic JPEG to WebP timing: the same low latency at full- load and no-load
CIP Feedback from customer JPEG Resize · Test period: 2 months decode · CIP has passed the test of customer WebP · CIP will be deployed in the production environment in 2018Q4 encode Image Video AI Evolution planning Products Ready to Market Plan & Development
Accelerated image/video transcoding JPEG Lepton PNG - Image codec : JPEG, Webp, HEIF, Lepton, BPG, GIF, PNG - Pixel processing: High quality resizing, High performance JPEG WebP GIF resizing, smart-crop, super-resolution CIP - Video codec: H264, H265, AV1, AVS3 JPEG HEIF Smart-crop Accelerated image/video analytics Image Codecs Super-resolution JPEG - AI function products around pictures and video AVIF content: content identification, face detection, pedestrian detection, etc. Resize/Crop/ Rotate - Functional services such as searching for pictures, searching for videos, etc. CVP MPSoC - H265 AV1 Software as a Service Video Codecs - Ultimately all CTAccel accelerated functions will MPSoC - H264 AVS3 be served by a unified SaaS framework - Unified APIs across CSPs and on-premise Scan for pornographic content Video/Image Video game recognition Analytics AI assist security inspection Image Video AI Cloud platformPartners support
Hardware platform support
Alveo Alveo Kintex® Zynq MPSoC U200 U50 UltraScale™115 Image Video AI Product line of CTAccel
Image Video AI
Higher performance Improve quality of TCO reduction density service
High-quality FPGA-based accelerated computing solution provider Image Video AI Xilinx MPSoC with VCU - H264/H265
Problems solved Solution Scenarios
Applications which require video
Accelerate the speed of video End-user transcoding: processing, improve customer Smartphone PC … TV Video portal websites ; experience Short video apps;
App Video portal Short video … Live Key features website app broadcast Values
TCO Reduction: FPS Xilinx MPSoC with VCU Reduce processing servers PSNR H264/H265 H264/H265 OPEX reduction FPGA Resize Decode Encode Improve customer experience Image Video AI Performance of H.264 and H.265 (MPSoC)
FPS FPS 643.87 712.39 578 636 516.21 449.94 478.07 377.57 307.2 324.77 227.13 150.97 118.96 100.34 30.37 63.89
1920*1080 1280*720 960*540 640*480 1920*1080 1280*720 960*540 640*480 CVP X264 medium CVP X265 medium
H.264 Encode H.265 Encode
Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2 RAM: 128GB OS: CentOS Linux release 7 Image Video AI Performance of H.264 and H.265 (MPSoC)
FPS FPS 236.41 236.03 200.42 200.68 183.22
85.89 91.09
28.3
CVP H.264 CVP H.265 X264 medium X265 medium CVP H.264 CVP H.265 X264 medium X265 medium
Server: 2*E5-2630v2 Server: 2*E5-2680v4
Test environment: Input: CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2 Crowdrun1080p, H264 & 2*Intel(R) Xeon(R) CPU E5-2680v4 Output: RAM: 128GB 720p OS: CentOS Linux release 7 540p 480p Image Video AI Quality of H.264 and H.265 (MPSoC)
H.265 VMAF H.265 PSNR 36.00 100.00 90.00 34.00 80.00 32.00 70.00 60.00 30.00 50.00 28.00 40.00 30.00 26.00 20.00 Input: 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 24.00 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 High dynamic and CVP X265 NVP4 H.265 Bitrate(Mbps) CVP X265 NVP4 H.265 Bitrate(Mbps) high complexity video H.265 Medium Medium H.265 Medium Medium
- Crowdrun, park_joy H.264 VMAF H.264 PSNR 36.00 100.00 90.00 34.00 80.00 32.00 70.00 30.00 60.00 50.00 28.00 40.00 26.00 30.00 24.00 Test environment: 20.00 CPU: 2*Intel(R) Xeon(R) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 22.00 CPU E5-2630v2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 CVP X264 NVP4 H.264 Bitrate(Mbps) CVP X264 NVP4 H.264 Bitrate(Mbps) RAM: 128GB H.264 Medium Medium H.264 Medium Medium OS: CentOS Linux release 7 Image Video AI I-Frame 2 JPEG
Problems solved Solution Scenarios
Applications which require “I-
Accelerate the speed of ”I-Frame 2 End-user Frame 2 JPEG”: Video portal websites ; JPEG”, improve customer Smartphone PC … TV experience Short video apps;
App Video portal Short video … Live Key features website app broadcast Values
TCO Reduction: High Throughput CIP + MPSoC Reduce processing servers Low Latency H264/H265 JPEG OPEX reduction FPGA Resize Low CPU Utilization Decode Encode Improve customer experience Image Video AI AI accelerator: face detection/recognition
Problems solved Solution Scenarios
Smart City; Accelerate the speed of ”face End-user Intelligent Transportation; detection/recognition”, improve Smartphone PAD … PC Safe City; customer experience, reduce Security system; TCO App Intelligent Safe City Smart City … Key features Transportation Values TCO Reduction: High Throughput Reduce image-processing servers Image Face OPEX reduction Resize Low Latency Decode recognition Improve customer experience Low CPU Utilization FPGA GPU/FPGA Improve the speed of face detection/recognition Image Video AI AI accelerator: test scenarios
Test-1:Alexnet model
FPGA #1 GPU GPU
Test-2:nsfw model
Memory FPGA #2 GPU GPU caching Test-3:Inception V4 model
FPGA #3 GPU Test-4:ResNet-50 model
GPU computing matrix
Test environment Test data CPU: 2*Intel(R) Xeon(R) CPU E5-2690v4 x 2 Image type Resolution Quantity RAM: 128GB 8 Mega Pixel 2647X3278 14708 OS: CentOS Linux release 7.4 Kernel version: 3.10.0-514.2.2.el7.x86_64 16 Mega Pixel 5312x2988 9280 Python version: 2.7 1080p 1920x1080 21400 Image Video AI Test result 1-Alexnet
Alexnet model Throughput(MB/S) 2500.00
QPS: 2000.00 FPGA+GPU is 2.5 faster than CPU+GPU 1500.00 Latency: 1000.00 FPGA+GPU is 30% of CPU+GPU 500.00 CPU usage: FPGA+GPU is 20% of CPU+GPU 0.00 8M Pixels 16M Pixels 1080p
3 FPGA + 5 GPU CPU+ 5 GPU
Latency(ms) CPU utilization(%) 80.00 80.00 70.00 70.00 60.00 60.00 50.00 50.00 40.00 40.00 30.00 30.00 20.00 20.00 10.00 10.00 0.00 0.00 8M Pixels 16M Pixels 1080p 8M Pixels 16M Pixels 1080p
3 FPGA+ 5 GPU CPU+ 5 GPU 3 FPGA+ 5 GPU CPU + 5 GPU Image Video AI Test result 2-nsfw
nsfw model Throughput(MB/S) 2500.00
QPS: 2000.00 FPGA+GPU is 3 faster than CPU+GPU 1500.00 Latency: 1000.00 FPGA+GPU is 30% of CPU+GPU 500.00 CPU usage: FPGA+GPU is 20% of CPU+GPU 0.00 8M Pixels 16M Pixels 1080p
3 FPGA+ 5 GPU CPU+5 GPU
Latency(ms) CPU utilization(%) 90.00 70.00 80.00 60.00 70.00 50.00 60.00 50.00 40.00 40.00 30.00 30.00 20.00 20.00 10.00 10.00 0.00 0.00 8M Pixels 16M Pixels 1080p 8M Pixels 16M Pixels 1080p
3 FPGA+ 5 GPU CPU+5 GPU 3 FPGA+ 5 GPU CPU+5 GPU Image Video AI Test result 3-Inception V4
InceptionV4 model Throughput(MB/S) 2500
QPS: 2000 FPGA+GPU is 2 faster than CPU+GPU 1500 Latency: 1000 FPGA+GPU is 40% of CPU+GPU 500 CPU usage: FPGA+GPU is 30% of CPU+GPU 0 8M Pixels 16M Pixels 1080p
3 FPGA+ 5 GPU CPU+5 GPU
Latency(ms) CPU utilization(%) 90 70 80 60 70 50 60 50 40
40 30 30 20 20 10 10 0 0 8M Pixels 16M Pixels 1080p 8M Pixels 16M Pixels 1080p
3 FPGA+ 5 GPU CPU+5 GPU 3 FPGA+ 5 GPU CPU+5 GPU Image Video AI Test result 4-ResNet-50
ResNet-50 model Throughput(MB/S) 2500.00
QPS: 2000.00 FPGA+GPU is 3 faster than CPU+GPU 1500.00 Latency: 1000.00 FPGA+GPU is 30% of CPU+GPU 500.00 CPU usage: 0.00 FPGA+GPU is 20% of CPU+GPU 8M Pixels 16M Pixels 1080p
3 FPGA+ 5 GPU CPU+5 GPU
Latency(ms) CPU utilization(%) 80.00 70.00
70.00 60.00 60.00 50.00 50.00 40.00 40.00 30.00 30.00 20.00 20.00 10.00 10.00 0.00 0.00 8M Pixels 16M Pixels 1080p 8M Pixels 16M Pixels 1080p
3 FPGA+ 5 GPU CPU+5 GPU 3 FPGA+ 5 GPU CPU+5 GPU Image Video AI Test result 5-Accuracy
Model Category Accuracy(top1) Accuracy(top5)
TensorFlow 0.49 0.74 Alexnet accel 0.49 0.73
TensorFlow 0.80 0.95 Inceptionv4 accel 0.79 0.95
TensorFlow 0.73 0.91 ResNet50 accel 0.72 0.90
TensorFlow 0.75 N/A nsfw accel 0.75 N/A For any product related enquiries:
Product enquiry email: Product website: [email protected] www.ct-accel.com
Cloud partners: AWS HUAWEI Cloud Product hotline: Alibaba Cloud +86-0755-88914045 Baidu Cloud Tencent Cloud