Intel® Software Conference 2015
Greg Anderson, Director, Worldwide Software Sales, Software and Services Group

What are you developing software for?

Technical Computing Media Embedded System Responsiveness Web Multi-Platform & Performance

• Sciences / HPC, Enterprise apps, Big Data, Servers / Clusters. Performance through Parallel Processing: Vectorization, Threading, Message Passing
• Encoding / Decoding, Streaming. Performance through hardware acceleration: MPEG4, etc., HEVC
• Internet of Things. Hardware-based embedded programming: BIOS/UEFI/FW, Kernel/OS, Drivers, Embedded Applications
• Cross-Platform Multimedia performance: Android, Windows, OS X
• Cross Device – Multiple APP stores: Mobile Apps, HTML5 technology, Write-once

Intel® Software Development Products

Technical Computing Media Embedded System Responsiveness Web Multi-Platform & Performance

• Application performance, scalability & reliability
• Video streaming performance
• Fast, efficient embedded & mobile devices/systems
• Immersive interactivity for multimedia apps
• Deploy apps on multiple platforms using one codebase

Intel’s Vision: If it computes, it does it best with Intel (hardware & software)

Data Center Client Ultra-Mobile Embedded/IoT

Create Faster Code…Faster
Intel® Parallel Studio XE 2016

• C++ and Fortran tools optimized for performance
• What’s new
  o Accelerate Data Analytics
  o Simplified Vectorization
  o MPI Cluster Performance
  o Standards Updates
  o Intel® Cluster Checker
  o Microsoft* Windows* 10 support
  o KNL support (code name for upcoming Intel® Xeon Phi™ processor)

What’s Inside: Intel® Parallel Studio XE 2016

Editions: Composer Edition, Professional Edition, Cluster Edition

Included in all three editions:
• Intel® C++ Compiler
• Intel® Fortran Compiler
• Intel® Threading Building Blocks
• Intel® Integrated Performance Primitives
• Intel® Data Analytics Acceleration Library (DAAL)
• Intel® Math Kernel Library
• Intel® Cilk™ Plus
• Intel® OpenMP*

Added in Professional Edition and Cluster Edition:
• Intel® Advisor XE
• Intel® Inspector XE
• Intel® VTune™ Amplifier XE

Added in Cluster Edition only:
• Intel® MPI Library
• Intel® Trace Analyzer and Collector

Rogue Wave IMSL* Library: bundle or add-on for Composer Edition; add-on for Professional Edition and Cluster Edition.

Additional configurations, including floating and academic, are available at: http://intel.ly/perf-tools

Intel® Parallel Studio XE 2016 Suites: What’s New

• Accelerate Data Analytics: Easily Build IA-Optimized Data Analytics Applications
  Intel® Data Analytics Acceleration Library (DAAL) helps data scientists speed through big data challenges with optimized IA functions.
• Simplified Vectorization: Boost Performance by Utilizing Vector Instructions / Units
  The Vectorization Advisor in Intel® Advisor XE identifies new vectorization opportunities as well as improvements to existing vectorization, and highlights them in your code. It makes actionable coding recommendations to boost performance and estimates the speedup.
• MPI Cluster Performance: Fast & Lightweight Analysis for 32K+ Ranks
  Intel® Trace Analyzer and Collector adds the MPI Performance Snapshot feature for easy-to-use, scalable MPI statistics collection and analysis of large MPI jobs to identify areas for improvement.
• Standards: Scaling Development Efforts Forward
  Intel® Compilers & performance libraries support the evolution of the OpenMP, MPI, TBB, Fortran and C++ industry standards.

Intel® Data Analytics Acceleration Library (DAAL)

• Library of optimized building blocks for Data Analytics
• “MKL” for Data Analytics with a few key differences
• Optimizes entire data flow, not algorithmic part only
• Targets both Data Center (Xeon & Phi) and edge devices (including Atom & Quark based)
• Abstracted from cross-device communication layer
• Allows plugging in different Big Data & End-to-End analytics frameworks
• Builds upon MKL/IPP low level primitives for best performance
• Scale/Horizontal product for different market segments/verticals
• Essential for IA stickiness in Data Center and edge devices

Vision: Industry leading end-to-end IA-based data analytics acceleration library of fundamental algorithms covering all data analysis stages

Data Analysis Stages & End-To-End Analytics

• Same data analysis process and analytics building blocks despite variety of data formats, usages and domains
• Analytics Targets:
  o Perform analysis close to data source (sensor/client/server) to optimize response latency, decrease network bandwidth utilization, and maximize security.
  o Offload data to server/cluster for complex and large-scale analytics only.

[Diagram labels: Data Source (Business, Web/Social, Scientific/Engineering); Edge; Client; Compute (Server, Desktop, …)]

Data analysis stages shown in the diagram:
• Pre-processing: Decompression, Filtering, Normalization
• Transformation: Aggregation, Dimension Reduction
• Analysis: Summary Statistics, Clustering, etc.
• Modeling: Machine Learning (Training), Parameter Estimation, Simulation
• Validation: Hypothesis testing, Model errors
• Decision Making: Forecasting, Decision Trees, etc.
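The idea of analyzing data close to its source and offloading only what the server needs can be illustrated with summary statistics. The following is a minimal sketch in plain C++, not the Intel® DAAL API: each edge node reduces its samples to a small partial result (Welford's online update), and the server combines the partial results with the pairwise merge formula of Chan et al. The names `PartialStats`, `add_sample`, `summarize`, `merge_stats` and `variance` are hypothetical and chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical partial result that an edge node could send to a server
// instead of the raw samples: count, mean, and sum of squared deviations.
struct PartialStats {
    std::size_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;  // sum of (x - mean)^2 over the samples seen so far
};

// Welford's online update: fold one sample into the running statistics.
void add_sample(PartialStats& s, double x) {
    s.count += 1;
    const double delta = x - s.mean;
    s.mean += delta / static_cast<double>(s.count);
    s.m2 += delta * (x - s.mean);
}

// Summarize a block of samples close to where the data is produced.
PartialStats summarize(const std::vector<double>& samples) {
    PartialStats s;
    for (double x : samples) add_sample(s, x);
    return s;
}

// Combine two partial results on the server without seeing the raw samples.
PartialStats merge_stats(const PartialStats& a, const PartialStats& b) {
    if (a.count == 0) return b;
    if (b.count == 0) return a;
    PartialStats out;
    out.count = a.count + b.count;
    const double n = static_cast<double>(out.count);
    const double na = static_cast<double>(a.count);
    const double nb = static_cast<double>(b.count);
    const double delta = b.mean - a.mean;
    out.mean = a.mean + delta * nb / n;
    out.m2 = a.m2 + b.m2 + delta * delta * na * nb / n;
    return out;
}

// Population variance of everything summarized so far.
double variance(const PartialStats& s) {
    assert(s.count > 0);
    return s.m2 / static_cast<double>(s.count);
}
```

In this sketch only three numbers per node cross the network, which is the bandwidth argument made on the slide; a real deployment would also have to decide on serialization and on how often nodes report.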

Turn Big Data Into Information Faster with Intel® Data Analytics Acceleration Library

Designed and Built by Intel to Delight Data Scientists

• Advanced analytics algorithms supporting all data analysis stages (Pre-processing, Transformation, Analysis, Modeling, Validation, Decision Making) for Business, Scientific, Engineering and Web/Social data.
• Simple to incorporate object-oriented APIs for C++ and Java
• Easy connections to:
  o Popular analytics platforms (Hadoop, Spark)
  o Data sources (SQL, non-SQL, files, in-memory)

[Chart: PCA Performance Boost Using Intel® DAAL vs. Spark* MLLib. Y-axis: Speedup (0 to 8). X-axis: Table Size (1M x 200, 1M x 400, 1M x 600, 1M x 800, 1M x 1000). Bar labels range from 4X to 7X.]

Configuration Info - Versions: Intel® Data Analytics Acceleration Library 2016, CDH v5.3.1, Apache Spark* v1.2.0; Hardware: Intel® Xeon® Processor E5-2699 v3, 2 Eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of RAM per node; Operating System: CentOS 6.6 x86_64. PCA normalized input.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804.

What’s new: Intel® Advisor XE Vectorization Optimization

Have you:
• Recompiled for AVX2 with little gain
• Wondered where to vectorize?
• Recoded intrinsics for new arch.?
• Struggled with compiler reports?

Data Driven Vectorization (New!):
• What vectorization will pay off most?
• What’s blocking vectorization? Why?
• Are my loops vector friendly?
• Will reorganizing data increase performance?
• Is it safe to just use pragma simd?
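Whether a SIMD pragma is safe depends on whether loop iterations are independent. The sketch below contrasts one loop where the annotation is safe with one where it is not. It uses the OpenMP 4.0 `#pragma omp simd` form rather than the Intel-specific `#pragma simd` the slide mentions, and the function names `scale_add` and `running_sum` are hypothetical. Compilers typically ignore the pragma unless OpenMP SIMD support is enabled (for example `-fopenmp-simd` with GCC or Clang), so the code remains valid serial C++ either way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Safe to annotate: iteration i reads x[i] and y[i] and writes only y[i],
// so no iteration depends on a value produced by another iteration.
void scale_add(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = y.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}

// Not safe to annotate as written: iteration i needs the value that
// iteration i - 1 has just stored (a loop-carried dependence), so forcing
// SIMD execution here could produce wrong results. The loop is left scalar.
void running_sum(std::vector<float>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```

A dependence like the one in `running_sum` is the kind of issue a vectorization analysis is meant to surface before a pragma is added by hand.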

How much potential lies untapped today?

[Chart: Binomial Options SP, options per second (0 to 160,000) by processor generation. Series labels: Parallelized, Vectorized, Single Thread, Scalar. Callout: 179x.]

Configurations for Binomial Options SP (configuration info on slide at the end of this presentation):
• 2007: Intel® Xeon™ Processor X5472, formerly codenamed Harpertown
• 2009: Intel® Xeon™ Processor X5570, formerly codenamed Nehalem
• 2010: Intel® Xeon™ Processor X5680, formerly codenamed Westmere
• 2012: Intel® Xeon™ Processor E5-2600 family, formerly codenamed Sandy Bridge
• 2013: Intel® Xeon™ Processor E5-2600 v2 family, formerly codenamed Ivy Bridge
• 2014: Intel® Xeon™ Processor E5-2600 v3 family, formerly codenamed Haswell

Parallel + Vectorized is much faster than either one alone

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
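To make "parallel + vectorized" concrete, here is a minimal sketch of a binomial options workload that combines both levels. It is not the benchmark code behind the chart: it assumes European call options priced on a Cox-Ross-Rubinstein tree, uses double precision rather than the single precision (SP) of the benchmark, and the names `Option`, `price_call` and `price_all` are hypothetical. Threads split the independent options (`#pragma omp parallel for`), while the inner backward-induction loop is marked for SIMD execution (`#pragma omp simd`).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle for one European call option.
struct Option {
    double spot;    // current price of the underlying
    double strike;  // strike price
    double years;   // time to expiry in years
    double rate;    // risk-free rate, continuously compounded
    double vol;     // annualized volatility
};

// Price one option on a Cox-Ross-Rubinstein binomial tree.
double price_call(const Option& o, int steps) {
    assert(steps > 0 && o.vol > 0.0 && o.years > 0.0);
    const double dt = o.years / steps;
    const double up = std::exp(o.vol * std::sqrt(dt));
    const double down = 1.0 / up;
    const double growth = std::exp(o.rate * dt);
    const double p_up = (growth - down) / (up - down);
    const double w_up = p_up / growth;            // discounted up weight
    const double w_down = (1.0 - p_up) / growth;  // discounted down weight

    const std::size_t nodes = static_cast<std::size_t>(steps) + 1;
    std::vector<double> cur(nodes), next(nodes);

    // Payoff at expiry for j up-moves and (steps - j) down-moves.
    for (int j = 0; j <= steps; ++j) {
        const double s = o.spot * std::pow(up, 2.0 * j - steps);
        cur[j] = std::max(s - o.strike, 0.0);
    }

    // Backward induction. Reading from one buffer and writing to another
    // keeps the iterations of the inner loop independent of each other.
    for (int level = steps; level > 0; --level) {
        const double* in = cur.data();
        double* out = next.data();
        #pragma omp simd
        for (int j = 0; j < level; ++j) {
            out[j] = w_down * in[j] + w_up * in[j + 1];
        }
        cur.swap(next);
    }
    return cur[0];
}

// Options are independent of each other, so threads can share the batch.
std::vector<double> price_all(const std::vector<Option>& opts, int steps) {
    std::vector<double> prices(opts.size());
    const long n = static_cast<long>(opts.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        prices[i] = price_call(opts[i], steps);
    }
    return prices;
}
```

Without OpenMP enabled at compile time both pragmas are ignored and the code runs serially with the same results, which makes it easy to compare the scalar, threaded and vectorized builds of the same source.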

Vectorization Made Easy
Get all the data you need for high impact vectorization

[Screenshot callouts:]
• Filter by which loops are vectorized!
• What prevents vectorization?
• Trip Counts
• Focus on hot loops
• What vectorization issues do I have?
• Which Vector instructions are being used?
• How efficient is the code?

Get Faster Code Faster!
Intel® Advisor XE Vectorization Optimization and Thread Prototyping

MPI Performance Snapshot for Cluster MPI applications

Advantage
• Get an initial profile of the application very quickly
• Performance variation at scale can be detected sooner
• Provides development recommendations to developers based on analysis
• Easy to use functionality for out of the box use

Benefits
• Difficult performance issues are easier to spot
• Application performance is improved faster
• Experienced & non-experienced developers can adopt quickly

Intel® Cluster Checker

A systems diagnostic tool suitable for high-performance computing (HPC) cluster experts and those new to HPC
• For years available only to customers who were part of the Intel Cluster Ready program (https://clusterready.intel.com/whatisclusterready/)
• Inspects more than 100 characteristics that indicate cluster health
• Examines the system at both the node and cluster level
• Makes sure all components work together and can deliver optimal performance
• Verifies compliance and ensures each cluster will function as it should
• Spend less time solving issues with individual deployments

https://clusterready.intel.com/intel-cluster-checker/

Who benefits?

Finance eCommerce IT Manufacturing Government Energy Aerospace Healthcare

NASA

CERN Bank of America Digital

Intel® System Studio
A Comprehensive Suite of Tools That Provides Deep System-wide Insight for System and Embedded Developers

[Diagram labels:]
• Compiler & Libraries: C/C++ Compiler; Image, Signal, Math and Data Processing
• Analyzers: Power & Performance; Memory & Threading
• Debuggers: Application & System; Debug & Trace
• Connections: JTAG, JTAG over USB, UEFI* (2) Agent, Simics* Simulation
• System and Application Code Running on Linux* (1), Android*, Windows*, FreeBSD* or VxWorks*
• Target system: Intel® Architecture-based Platforms

(1) Linux*, Embedded Linux, Wind River* Linux*, Yocto Project*
(2) UEFI: Unified Extensible Firmware Interface

Intel® System Studio
Helps System and Embedded Developers Address Unique Needs Across Usages and Platforms

• Device Manufacturers: Shorter system bring-up and validation cycles
• System Integrators: Faster software stack integration and optimization
• Embedded Application Developers: Efficiently introduce compelling new device capabilities

Wide-Ranging System and Embedded Platforms

• Cloud / data centers / storage
• Transportation
• Retail
• Industrial
• Imaging
• Digital Security Surveillance
• Military, Aerospace, Government
• Medical
• Client & Mobile
• Networks & Communication
• IoT Devices
• Gateways


Accelerate Time to Market; Strengthen System Reliability; Boost Power Efficiency & Performance

Support Newest Platforms
Added Support for New Intel Processors and Target Operating Systems

• Support for recently launched versions of Intel® processors
  o Intel® Atom™ x3 processors, formerly code-named SoFIA
  o Intel® Atom™ x5, x7 processors, formerly code-named Cherry Trail
  o 6th Generation Intel® Core™ processors, formerly code-named Skylake
• Microsoft* Windows* 10
• FreeBSD*

System-wide Closed Chassis Debugging
JTAG-based Debug and Trace over Low-cost USB Connection

• Flexibility – alleviates requirement for an accessible hardware JTAG port
• Low-cost – debug over standard USB connection instead of expensive JTAG probe

[Diagram labels: Intel® System Debugger; Intel® SVT Closed Chassis Adapter (1); USB cable; Target System; JTAG data over physical USB port]

• Debug & trace from CPU reset
• Debug & trace OS boot

(1) SVT = Silicon View Technology – more details: https://designintools.intel.com/product_p/itpxdpsvt.htm

“With the new Intel® System Studio 2015, we improved the user experience of our recently launched Android* based tablet tolino tab 8" (optimized for eReading) drastically by a factor of 3x (200ms vs. 500-700ms); which reduced the CPU workload and the resulting power consumption by at least the same factor.”
Dirk Hofmann, Chief Product Owner, Deutsche Telekom

“It is well apparent that if a new customer will develop Intel Architecture based products, then Intel® System Studio tools are essential for success…”
Wayne Merrill, Manager, International Dept., Flatoak Co., Ltd./JAPAN

“I am very satisfied with Intel System Studio 2015 product. The C++ compiler is very fast and complete. The development tools that are part of these suites are very useful and they help detecting performance issues quite easily.”
Eduardo Quintana, SFTWY CDI Ltda., Microsoft Partner

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.


Intel® System Studio 2016 Summary
Deep System-wide Insight for System and Embedded Developers

• Improves developer productivity with expanded usability and capabilities
• Increases performance with highly optimized compiler and libraries
• Improved performance profiling and identification of complex defects
• Improves power efficiency and performance with enhanced analyzers
• Extends support to the newest Intel platforms and operating systems
• Support for the broad Intel processor spectrum – Quark, Atom, Core, Skylake

Create smarter code, smarter, with Intel System Studio
Learn more at: http://intel.ly/system-studio
