Cross-Platform Software Optimization with Intel's Ct Technology

Cross-Platform Software Optimization with Intel's Ct Technology

Cross-platform Software Optimization with Intel’s Ct Technology Software AND Services Copyright © 2014, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Agenda • What is Intel’s Ct Technology? • How does Intel’s Ct Technology work? • How to port C++ code to Intel’s Ct Technology? 2 What is Intel’s Ct Technology? 3 Throughput and Visual Computing Applications Entertainment “RMS” Applications Recognition Mining TIPS Learning & Travel Synthesis Personal Media RMS Creation and GIPS Management 3D & Video Tera-scale Performance MIPS Mult- Media Multi-core IPSInstruction= per second KIPS Text Single Core Health Kilobytes Megabytes Gigabytes Terabytes Dataset Size 4 Software Opportunities and Risks Opportunities: SW Performance ISV Differentiation Increasing Performance Many-core architecture offers unprecedented opportunities for differentiation: • Model-based applications: New functionalities enriching the user’s experience • Improved quality: Higher resolution/accuracy in results • Increased usability: Raw power to fuel more sophisticated usage models 5 Software Opportunities and Risks Opportunities: Risk: SW Performance Programmer ISV Differentiation "Headaches" (Reduced Productivity) Increasing Performance Reduced Productivity: – Data races: New class of bugs that increase exponentially with degree of parallelism – Performance tuning: Programmers can expect to spend most time here – Forward scaling: Anticipating future HW enhancements – Modularity: Difficult to compose parallel programs – Proprietary Tools: Reduces choice and challenges build infrastructure 6 Software Opportunities and Risks Risk: Opportunities: Programmer SW Performance “Headaches” ISV Differentiation Throughput (Reduced Computing SW Productivity) Technologies Increasing Performance Visual computing software technologies make it easier: Industry Standard APIs: DirectX*, OpenGL* Native Tools: SSE/AVX/Co-processor Native Compilers, OpenMP*, Intel® Threading Building Blocks, etc. New Tools, including: – Ct - Forward-scaling, easy-to-use, and deterministic 7 Software Opportunities and Risks Risk: Opportunities: Programmer SW Performance “Headaches” ISV Differentiation Throughput (Reduced Computing SW Productivity) Technologies Increasing Performance Productivity is greatly increased with the right choice of tools 8 Intel’s Ct Technology • Productive Data Parallel Programming – Presented as high-level abstraction, natural notation – Delivers application performance with ease of programming • Ct forward-scales software written today – Ct is designed to be dynamically retargetable to SSE, AVX, co-processor, and beyond • Extends standard C++ using templates – No changes to standard C++ compilers • Ct abstracts away architectural details – Vector ISA width / Core count / Memory model / Cache sizes • Ct is deterministic – No data races 9 Forward Scaling with Ct Single source Productivity . Integrates with existing tools . Applicable to many problem domains . Safe by default: maintainable CPU with CPU Compute Performance Co-processor . Efficient and scalable SSE 4 . Harnesses both vectors and threads . Eliminates modularity overhead of C++ Portability AVX Hybrid . High-level abstraction Processor . Hardware independent . Forward scaling 10 What Does the Product Based on Intel Ct Technology Look Like? • Core API – Flexible, forward scaling data parallelism in C++ • Application Libraries – Linear Algebra, FFT, Random Number Generation • Powered by Intel® Math Kernel Library (Intel® MKL)! • Samples – Medical Imaging, Financial Analytics, Seismic Processing, and more • Initial release on Windows, followed by Linux* – IA-32 and Intel® 64 – Works with Intel® C/C++ Compiler, Microsoft* Visual C++*, and GCC* – Works with Intel® VTune™ Analyzer 11 How does Intel’s Ct Technology work? 12 Ct’s Parallel Collections Regular dense containers Irregular dense containers & growing… Ct operates on data collections called containers Based on C++ templates Ct’s containers represent a variety of regular and irregular data collections dense<T>, dense<T,2>, dense<T,3>, nested<T>, indexed <T> 13 Expressing Computation in Ct The Ct JIT Automates This Transformation Vector Processing Scalar Processing Native/Intrinsic Coding CMP cviVPREFETCH cviFMADD INC JMP f32 kernel(f32 a, b, c, d) { native(ndense<f32> a, b, c, d) { return a + (b/c)*d; __asm__ { dense<f32> A, B, C, D; } … A += B/C * D; … }; dense <f32> A, B, C, D; } … A = map(kernel)(A, B, C, D); ndense<f32> A, B, C, D; A = nmap(native)(A, B, C, D); Or Programmers Can Choose Desired Level of Abstraction 14 The Ct Runtime • C++ APIs Virtual machine • Other language API •Intel Ct Technology offers bindings a standards compliant C++ library… JIT Compiler …backed by a runtime Virtual Machine Implementation Backend Virtual Debug/ Memory Threading •Runtime generates and JIT ISA Svcs Manager Runtime manages threads and Compiler vector code, via – Machine independent optimization – Offload management – Machine specific code generation and optimizations – Scalable threading runtime (based on Intel® Threading Building Blocks) CPU Accelerator Future 15 Dynamic Ct: Supporting All Intel Platforms Parallel Runtime Ct API CtVM Programs API … dense<i32>a(src, …),b(src, …); Memory Manager dense<i32>c =a+ a * (b/127); IA32/ Ct Next c.copyOut(dest, …); Intel® 64 Compilericc/gcc Generation … Compiler for LRBni Compiler JIT CVI 32-bit/64 Ct FFTs 32-bit/64 X86 binary Next -bit binary -bit binary Generation SSE AVX Cores with SSE withwith LRBni Ct Binary Ct Image Ct Dynamic Engine Ct BLAS/LAPACK CPU Cores Accelerator Future … Once compiled, run on all Intel platforms 16 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] dense<i32> A(ar_a, …); dense<i32> B(ar_b,…); call( work ) ( A, B ); Ct Dynamic Engine 17 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] dense<i32> A(ar_a, …); dense<i32> B(ar_b,…); call( work ) ( A, B ); Memory Manager a b Ct Dynamic Engine 18 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] dense<i32> A(ar_a, …); IR Builder dense<i32> B(ar_b,…); call( work ) ( A, B ); void work(dense<i32> a, dense<i32> & b ) { b = a + 1; } Memory Manager a b Ct Dynamic Engine 19 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] dense<i32> A(ar_a, …); IR Builder dense<i32> B(ar_b,…); call( work ) ( A, B ); V1 1 + void work(dense<i32> a, dense<i32>& b ) { V2 b = a + 1; } Memory Manager a b Ct Dynamic Engine 20 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] Trigger JIT IR Builder JIT dense<i32> A(ar_a, …); V1 1 dense<i32> B(ar_b,…); call( work ) ( A, B ); + void work(dense<i32>a, V2 dense<i32>& b ) { b = a + 1; } Memory Manager a b Ct Dynamic Engine 21 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] Trigger JIT IR Builder JIT dense<i32> A(ar_a, …); High-Level Optimizer V1 1 dense<i32> B(ar_b,…); call( work ) ( A, B ); + Low-Level Optimizer void work(dense<i32> a, CVI Code Gen V2 dense<i32> & b ) { b = a + 1; } SSE LNI AVX Memory Manager a b Ct Dynamic Engine 22 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] Trigger JIT IR Builder JIT dense<i32> A(ar_a, …); High-Level Optimizer V1 1 dense<i32> B(ar_b,…); call( work ) ( A, B ); + Low-Level Optimizer void work(dense<i32> a, CVI Code Gen V2 dense<i32> & b ) { b = a + 1; } SSE LNI AVX Memory Manager Parallel Runtime a b Ct Dynamic Engine 23 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] Trigger JIT IR Builder JIT dense<i32> A(ar_a, …); High-Level Optimizer V1 1 dense<i32> B(ar_b,…); call( work ) ( A, B ); + Low-Level Optimizer void work(dense<i32> a, CVI Code Gen V2 dense<i32> & b ) { b = a + 1; } SSE LNI AVX Memory Manager Parallel Runtime a Thread Scheduler b Ct Dynamic Engine Data Partition 24 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] Trigger JIT IR Builder JIT dense<i32> A(ar_a, …); High-Level Optimizer V1 1 dense<i32> B(ar_b,…); call( work ) ( A, B ); + Low-Level Optimizer void work(dense<i32> a, CVI Code Gen V2 dense<i32> & b ) { b = a + 1; } SSE LNI AVX Memory Manager Parallel Runtime a Thread Scheduler b Ct Dynamic Engine Data Partition Different Platforms 25 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] dense<i32> A(ar_a, …); dense<i32> B(ar_b,…); Code Manager call( work ) ( A, B ); Emitted Code for ‘work’ void work(dense<i32> a, Compute dense<i32> & b ) { Kernel b = a + 1; } Again Memory Manager Parallel Runtime a Thread Scheduler b Ct Dynamic Engine 26 Ct Dynamic Engine Execution int ar_a[1024], ar_b[1024] dense<i32> A(ar_a, …); dense<i32> B(ar_b,…); Code Manager Code Cache call( work ) ( A, B ); Emitted Code for ‘work’ void work(dense<i32> a, Compute dense<i32> & b ) { Kernel b = a + 1; } Again Memory Manager Parallel Runtime a Thread Scheduler b Ct Dynamic Engine Data Partition Different Platforms 27 Ct Dynamic Engine Implications • Second stage of compilation at runtime Allows run-time specialization • Ct kernel: exists as C function, but – Executes only once => for dynamic compilation – Symbolic “capture” of computation – Takes a snapshot of current bindings and state • Ct call: execution is a remote function call – Data flow dependencies inferred – Automatic (and always safe) synchronization – Enables remote execution and data storage 28 Why Does this Matter? Widely used libraries often give up performance for well designed generic interfaces and modularity – Virtual function calls, function pointers, and control flow dependent on parameters not known until runtime can limit performance – Modularity inherently spreads computation across methods – In extreme cases,

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    45 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us