Intel® System Studio 2014 for Linux*
Complete Development Solution for Intelligent/Embedded Systems

Intel® VTune™ Amplifier 2014 for Systems Overview

System Developer Challenges

– Meeting release schedule
– System reliability
– Power efficiency & application performance

If you could improve one thing about your embedded design activities, what would it be?

• Debugging Tools: 22%
• Engineering Team/Skill levels: 16%
• Schedule: 15%
• Programming Tools: 8%
• Microprocessor: 8%

Sources: VDC Research – Strategic Insights 2012: Embedded Software & Tools Market, October 2012; UBM Electronics - 2012 Embedded Market Survey

Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Intel® System Studio 2014 for Linux*: Deep System Insights for Embedded and Mobile Developers

• Accelerate Time To Market: Speed up development and testing with deep hardware and software insights
• Strengthen System Reliability: Enhance code stability using in-depth system-wide debuggers and analyzers
• Boost Power Efficiency and Performance: Boost system power efficiency and performance using system-wide analyzers, compilers and libraries

Introducing Intel® System Studio 2014 for Linux*

Integrated software tool suite that provides deep system-wide insights to help:
• Accelerate Time To Market
• Strengthen System Reliability
• Boost Power Efficiency and Performance

• Debuggers: System, Application
• Analyzers: Power & Performance, Memory & Thread errors
• Compiler and Libraries: C/C++ Compiler; Signal, Media, Data & Math Processing

JTAG Interface System & Application Code running Linux*

Embedded or Mobile System

Support for Latest Intel Processors & SoCs

Support by microarchitecture (Silvermont / Ivy Bridge / Haswell):
• Intel® JTAG Debugger† – System Debug: ✔ / -- / ✔
• Enhanced GDB* Debugger – Application Debug: ✔ / ✔ / ✔
• Intel® Inspector – Memory & Thread Analysis: ✔ / ✔ / ✔
• Intel® VTune™ Amplifier†† – Power & Performance: ✔ / ✔ / ✔ (Hardware Events)
• Intel® C++ Compiler: ✔ SSE4.2 / ✔ SSE, AVX / ✔ SSE, AVX, AVX2, FMA3
• Intel® MKL: -- / ✔ SSE, AVX / ✔ SSE, AVX, AVX2, FMA3
• Intel® IPP: ✔ / ✔ / ✔

† Hardware platform debug coverage added as new processors ship
†† Hardware events for new processors added as new processors ship

Updated Studio Components

Columns: Major Update; Windows* & Linux* Host; Eclipse* Integration; *; Yocto Project* 1.4; Intel® Atom™ Processor E3xxx; 4th Gen Intel® Core™ Processor

Build
• Intel® C++ Compiler 14.0 for Embedded OS Linux*: ✔ ✔ ✔ ✔ ✔ ✔ ✔
• Intel® Integrated Performance Primitives 8.0 Update 1: ✔ ✔ ✔ ✔ ✔ ✔
• Intel® 11.1 (Intel® 64 only): ✔ ✔ ✔ ✔

Analysis
• Intel® VTune™ Amplifier 2014 for Systems: ✔ ✔ ✔ ✔ ✔ ✔
• Intel® VTune™ Amplifier Sampling Enabling Product (SEP) 3.10 Update 12: ✔ ✔ ✔ ✔ ✔ ✔
• Intel® Inspector 2014 for Systems: ✔ ✔ ✔ ✔ ✔ ✔

Debug
• The GNU* Project Debugger – GDB v7.5 (Provided under GNU Public License v3): ✔ ✔ ✔ ✔ ✔ ✔
• Intel® JTAG Debugger 2014 for Linux*: ✔ ✔ ✔ ✔ ✔ ✔
• Intel® JTAG Debugger notification module xdbntf.ko (Provided under GNU Public License v2): ✔ ✔ ✔ ✔ ✔
• SVEN Technology 1.0 (Provided under GNU Public License v2): ✔ ✔ ✔ ✔ ✔ ✔

Supported OSs

Host:
• Red Hat Enterprise* Linux* 5, 6
• Ubuntu* 10.04 LTS, 12.04 LTS, 13.04
• Fedora* 17, 18
• Wind River* Linux* 4, 5
• openSUSE 12.1
• SUSE LINUX Enterprise Server* 11 SP2
• Windows* 7, 8

Target:
• Yocto Project* 1.3, 1.4, and newer based environment
• CE Linux* PR32 based environment
• Tizen* IVI 1.0, 2.0
• Wind River* Linux* 4, 5 based environment

Agenda

This presentation will focus on Performance and Power Analysis using VTune™ Amplifier 2014 for Systems

• Performance Analysis
  – Hotspots
  – General Exploration

• Power Analysis
  – Sleep state analysis
  – Frequency state analysis

• Analysis using sep

Intel® VTune™ Amplifier 2014 for Systems Power &

“The ability for Intel® VTune™ Amplifier to exactly pinpoint performance bottlenecks in our code was a big time saver and made it a far better choice compared to other analysis tools that we used.”
Jagadish Kamath, Co-founder and Software Architect, RiverSilica Technologies

Where is my system…

Spending Time?
• Focus tuning on functions taking time
• See call stacks
• See time on source

Wasting Time?
• See cache misses on your source
• See functions sorted by # of cache misses

Waiting Too Long?
• See locks by wait time
• Red/Green for CPU utilization during wait

Waking-up Too Often?
• See wakeup causes on your source
• See CPU frequencies per core

Advanced profiling for power efficiency and scalable multicore performance

Performance profiling: Intel® VTune™ Amplifier 2014 for Systems

Host:
• VTune GUI and VTune result
• amplxe-runss.py: control and collection over SSH

Target device:
• VTune collector binary amplxe-runss runs on target and stores result on target (local storage like card or NFS mounted)
• VTune driver

Result data/modules are transferred from target to host over SSH.

• Data is opened in GUI and symbols are resolved using modules stored in result dir. User can specify search dir with separate debug files if needed.
• CLI interface for remote collection. Transfers data collected remotely back to host automatically together with application modules for symbol resolution.

• Simple python script (no remote collection in GUI)
• Using SSH protocol for data transfers
• Flexible collection configuration + control (pause/resume/stop)

Intel® VTune™ Amplifier 2014 for Systems

More Profiling Data
• CPU power and frequency
• Statistical call counts
• Hardware events + stacks
  – Lower overhead, Higher resolution
  – Finds hot spots in small functions
• Uncore event counting
  – More accurate bandwidth analysis
• Ivy Bridge events
• Haswell events
  – Updates as new processors ship

Easier To Use
• Source view for inlined code (For Intel® and GCC* compilers)
• Remote Collection
• Task annotation API
  – Label and visualize tasks.
• User defined metrics
  – Create meaningful metrics from events
• More/better advanced profiles (e.g., Bandwidth)

Activity in CPU

Easy to use, wealth of data, powerful analysis

Performance Tuning Methodology using VTune™ Amplifier 2014 for Systems

Use a top-down approach: system tuning, then algorithmic/application tuning, then micro-architectural tuning

General process for algorithm and micro-architectural tuning:
• Find hotspots
• Focus on top hotspot
  – Determine efficiency: use Concurrency Analysis, Stalls/Uop Analysis, or code examination
  – If inefficient, look for the source of inefficiency using Locks and Waits Analysis, Micro-Architectural metrics, or code examination (if efficient, go to next hotspot)
  – Optimize if necessary
• Repeat!

Intel® VTune™ Amplifier 2014 for Systems – Hotspot Analysis

By drilling down to the source code level, you can see line by line and instruction by instruction where your application is spending its time.

Intel® VTune™ Amplifier 2014 for Systems – General Exploration

In the general exploration viewpoint you can see if your application has exceeded the thresholds for our performance metrics. Metrics that exceed defined thresholds are colored in pink.

Intel® VTune™ Amplifier 2014 for Systems – General Exploration (cont’d)

You can also see which functions in your program triggered the most occurrences of a particular event (for example, Branch Mispredict).

CPU Power Analysis: Intel® VTune™ Amplifier 2014 for Systems

To decrease CPU power usage, minimize wake-ups:
• Identify wake-up causes
  – Timers triggered by application
  – Interrupts mapped to HW interrupt level
  – Show wake-up rate
• Display source code for events that wake up the processor
• Show CPU frequencies by CPU core (CPU frequencies can change with CPU activity level)

Uniquely identifies the cause of wake-ups and gives timer call stacks

Overview of power analysis

Idle vs. Active
• Do nothing efficiently
• Hurry up and get idle, e.g. multi-threading (distributing work evenly across cores)

Optimize Sleep Behavior
• Minimize sporadic wakeups.
• Schedule all periodic activities from the app into same wakeup period.
• What is waking h/w from low power states? Why?

Optimize Utilization
• What is active? Why is it active?
• Minimize polling loops. Use event driven framework when possible.
• Turn devices off. Open devices can prevent the system from entering power saving state.
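The advice to replace polling loops with an event-driven design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`wait_by_polling`, `wait_by_event`, `compare`) and the 10 ms polling interval are hypothetical, and the wakeup counts are those of the waiting thread in CPython, not measurements from VTune Amplifier.

```python
import threading
import time


def wait_by_polling(flag, interval_s=0.01):
    """Poll a flag; each pass through the loop is a timer-driven wakeup."""
    wakeups = 0
    while not flag.is_set():
        time.sleep(interval_s)  # the thread wakes every interval_s just to re-check
        wakeups += 1
    return wakeups


def wait_by_event(flag):
    """Block until the flag is set instead of re-checking it on a timer."""
    flag.wait()
    return 1  # the waiter resumes once, when the producer signals


def compare(delay_s=0.2):
    """Return (polling_wakeups, event_wakeups) for a signal sent after delay_s."""
    results = []
    for waiter in (wait_by_polling, wait_by_event):
        flag = threading.Event()
        timer = threading.Timer(delay_s, flag.set)  # stands in for the producer
        timer.start()
        results.append(waiter(flag))
        timer.join()
    return tuple(results)


if __name__ == "__main__":
    polled, signalled = compare()
    print("polling wakeups:", polled, "event-driven wakeups:", signalled)
```

The polling variant wakes repeatedly while nothing has happened, which is the kind of sporadic wakeup that keeps a core out of deeper sleep states; the event-driven variant stays blocked until there is work to do.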


CPU C-States / P-States

P-states:
• P0 - CPU active at highest frequency (HFM)
• Pn - CPU active at lowest frequency (LFM)

C-states:
• C0 - CPU active (in any P-state: P0, P1, ... Pn)
• C1 - Core clock is off
• C3/C4 - Reduced voltage, partial L2 cache flush
• C6 - Core off, L2 cache flush, state saved to SRAM

CPU sleep states run from C1 through C6 (C1, C2, C3, C4, C5, C6): power is higher in the shallow states and wake-up latency is greater in the deep states.

The deeper the sleep state, the more power is saved, but the longer it takes to wake up.

CPU Sleep States

• Flexible C-States to Select Idle Power Level vs. Responsiveness

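[Table: C-states C0 (active state), C1, C3, C4, C6 versus core voltage*, core clock, PLL, L1 caches, L2 cache, wakeup time*, and idle power*. Deeper states progressively switch off the core clock and the PLL, flush and then power off the L1 caches, and partially flush and then power off the L2 cache.]

* Rough approximation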
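On a Linux target, the kernel's own view of the available idle states can be inspected alongside the VTune Amplifier results. The sketch below reads the standard cpuidle sysfs files (`name`, `latency`, `usage`, `time`); the function names and the returned dictionary keys are illustrative choices, and the state names reported depend on the platform and idle driver rather than matching the C-state labels above one to one.

```python
from pathlib import Path


def read_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return one dict per cpuidle state of the given CPU (empty if unavailable)."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    state_dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state_dir in state_dirs:
        def field(name, directory=state_dir):
            return (directory / name).read_text().strip()

        states.append({
            "name": field("name"),                     # driver's label for the state
            "exit_latency_us": int(field("latency")),  # wake-up latency, microseconds
            "entries": int(field("usage")),            # times the state was entered
            "residency_us": int(field("time")),        # total time spent, microseconds
        })
    return states


def residency_share(states):
    """Fraction of recorded idle time spent in each state."""
    total = sum(s["residency_us"] for s in states)
    if total == 0:
        return {s["name"]: 0.0 for s in states}
    return {s["name"]: s["residency_us"] / total for s in states}


if __name__ == "__main__":
    for name, share in residency_share(read_idle_states()).items():
        print(f"{name}: {share:.1%} of idle time")
```

A workload that spends most of its idle time in the shallowest state while showing a high entry count is a candidate for the wake-up analysis described in this section.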

Tracing C-States

The VTune power driver uses kernel tracepoints to drive data collection, so it does not itself cause wakeups.

Intel® VTune™ Amplifier 2014 for Systems Sleep states power analysis view

Processor Power and Processor Frequency

[Figure: Power vs. Frequency Curve for Single Architecture. Y axis: Power (W); X axis: Frequency (GHz).]

Small increases in processor speed result in large increases in power.
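One common way to reason about the shape of this curve is the first-order CMOS dynamic power model, in which power scales with voltage squared times frequency. The sketch below is a simplified textbook model and not the data behind the slide's chart; the function name and the default assumption that voltage scales linearly with frequency are illustrative.

```python
def relative_dynamic_power(freq_ratio, voltage_ratio=None):
    """Dynamic power relative to a baseline, using P proportional to V^2 * f.

    freq_ratio: new frequency divided by baseline frequency.
    voltage_ratio: new voltage divided by baseline voltage. If omitted, the
    sketch ASSUMES voltage scales linearly with frequency, which gives a
    cubic dependence on frequency.
    """
    if voltage_ratio is None:
        voltage_ratio = freq_ratio
    return voltage_ratio ** 2 * freq_ratio


if __name__ == "__main__":
    for ratio in (1.0, 1.1, 1.5, 2.0):
        print(f"{ratio:.1f}x frequency: {relative_dynamic_power(ratio):.2f}x power")
```

Under that assumption a 10% frequency increase costs roughly a third more dynamic power, which is why running at a lower P-state, or finishing quickly and dropping into a sleep state, can save energy.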

Intel® VTune™ Amplifier 2014 for Systems Frequency states power analysis view

Intel® VTune™ Amplifier 2014 for Systems System-wide Analysis

Analysis of Intel uncore blocks is supported via SEP. Details:
- Memory bandwidth for Intel® Core™ Processor
- Memory bandwidth, QPI bandwidth for Intel® Xeon™ Processor
- Cache Box (Cbo) is supported for both client and server parts

Product Installation of sep

• On target (whether you have built on target or host)
• Load driver
  – ./insmod-sep3
• Once you have loaded the sep driver, source the environment to have access to sep.
  – source $SEP_INSTALL/bin/setup_sep_runtime_env.sh

Collecting performance data on the Yocto Project* using sep

1. Prepare target
   – Command line tool for collecting performance data.
   – Learn the installation requirements and set up device drivers.

2. Pick an event to sample and configure PMU
   – Cache misses, branch mis-predictions, dependency/pipeline stalls

3. Start SEP sampling routine and application
   – Performance Monitoring Unit (PMU) periodically interrupts the processor
   – Time based sampling
   – Event based sampling
   – Both architectural and non-architectural processor events can be monitored using sampling and counting technologies

SEP command line example

sep -start -d 20 -ec "CPU_CLK_UNHALTED.CORE,INST_RETIRED.ANY,CPU_CLK_UNHALTED.REF,DATA_TLB_MISSES.DTLB_MISS,MEM_LOAD_RETIRED.L2_MISS" -out my_data

With this run of sep:

• -d 20 specifies a run of 20 seconds.

• -ec specifies the events to be collected.

• -out specifies the output file name; a suffix of .tb6 will be appended.

• For a list of supported events, run: sep -el
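When collections are scripted, it can help to assemble the sep invocation from a list of event names rather than editing a long quoted string by hand. The helper below is a hypothetical sketch: `build_sep_command` is not part of SEP, it uses only the options shown on these slides (`-start`, `-d`, `-ec`, `-out`), and it assumes the comma-separated single-argument form of `-ec` used in the General Exploration example in this deck.

```python
import shlex


def build_sep_command(events, duration_s, out_name):
    """Build the argument list for a timed sep collection.

    Uses only the options that appear on the slides; consult the SEP
    documentation for anything beyond them.
    """
    if not events:
        raise ValueError("at least one event name is required")
    return [
        "sep", "-start",
        "-d", str(duration_s),    # run length in seconds
        "-ec", ",".join(events),  # events as one comma-separated argument
        "-out", out_name,         # sep adds the .tb6 suffix itself
    ]


if __name__ == "__main__":
    cmd = build_sep_command(
        ["CPU_CLK_UNHALTED.CORE", "INST_RETIRED.ANY", "CPU_CLK_UNHALTED.REF"],
        duration_s=20,
        out_name="my_data",
    )
    # Print a shell-safe form; subprocess.run(cmd) could execute it on the target.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the command as a list avoids shell quoting problems with the event string when the script launches sep itself.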

Intel® VTune™ Amplifier 2014 for Systems General exploration analysis types

Through extensive analysis, Intel has determined a list of events and metrics that are often useful for providing initial data on applications.

In addition to providing useful metrics, VTune Amplifier provides built-in rules that notify you when a metric's threshold has been exceeded.

VTune Amplifier XE has “General Exploration” analysis types built in for many of the “big core” processors. When run on an embedded system, this data must be collected using sep. (see the following slide)
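To make the idea of metrics and threshold rules concrete, the sketch below derives two widely used ratios from raw event counts and flags those above a limit. It is not VTune Amplifier's implementation: the metric formulas are the generic cycles-per-instruction and misses-per-thousand-instructions ratios, the threshold values are hypothetical placeholders, and the event names are the ones from the SEP command line example in this deck.

```python
# Hypothetical limits for illustration only; VTune Amplifier ships its own rules.
EXAMPLE_THRESHOLDS = {"cpi": 1.0, "l2_misses_per_kilo_instr": 5.0}


def derived_metrics(counts):
    """Compute simple ratios from a dict of event name -> event count."""
    instructions = counts["INST_RETIRED.ANY"]
    if instructions <= 0:
        raise ValueError("INST_RETIRED.ANY must be positive")
    return {
        # Cycles per instruction: lower generally means better pipeline use.
        "cpi": counts["CPU_CLK_UNHALTED.CORE"] / instructions,
        # L2 load misses per thousand retired instructions.
        "l2_misses_per_kilo_instr":
            1000 * counts["MEM_LOAD_RETIRED.L2_MISS"] / instructions,
    }


def flag_metrics(metrics, thresholds=EXAMPLE_THRESHOLDS):
    """Return the names of metrics whose value exceeds its threshold."""
    return sorted(name for name, value in metrics.items()
                  if name in thresholds and value > thresholds[name])


if __name__ == "__main__":
    sample_counts = {  # made-up counts, not measured data
        "CPU_CLK_UNHALTED.CORE": 3_000_000,
        "INST_RETIRED.ANY": 2_000_000,
        "MEM_LOAD_RETIRED.L2_MISS": 4_000,
    }
    metrics = derived_metrics(sample_counts)
    print(metrics, "flagged:", flag_metrics(metrics))
```

In the GUI the flagged metrics are the ones highlighted in the General Exploration viewpoint; a script like this is only useful for quick checks on raw counts.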

Running an Intel® Atom™ processor general exploration via sep

In order for Intel® VTune™ Amplifier 2014 for Systems to report Atom processor based metrics we need to specify a specific sequence of events. This event list is known as the “General Exploration” event list for the Intel® Atom™ processor.

sep -start -em -ec "BR_INST_RETIRED.MISPRED.PS,BUS_LOCK_CLOCKS.ALL_AGENTS,CPU_CLK_UNHALTED.CORE,CPU_CLK_UNHALTED.REF,CYCLES_DIV_BUSY,DATA_TLB_MISSES.DTLB_MISS,EXT_SNOOP.ALL_AGENTS.HITM,FP_ASSIST.S,ICACHE.MISSES,INST_RETIRED.ANY,ITLB.MISSES,MACHINE_CLEARS.SMC,MEM_LOAD_RETIRED.L2_HIT.PS,MEM_LOAD_RETIRED.L2_MISS.PS,MISALIGN_MEM_REF.LD_SPLIT.AR,MISALIGN_MEM_REF.ST_SPLIT.AR,PAGE_WALKS.CYCLES,REISSUE.OVERLAP_STORE.AR,SIMD_ASSIST,UOPS.MS_CYCLES,UOPS_RETIRED.ANY" -app ./tachyon_find_hotspots

Importing SEP data into the Intel® VTune™ Amplifier 2014 for Systems GUI

1. Create new project
   • File->New->Project

2. Set Search directories for the project
   – Source
   – Symbols
   – Binaries

3. Copy file.tb6 from target.

4. File->Import Result
   • “Import file.tb6” into project

NDA packages of Intel® VTune™ Amplifier 2014 for Systems

Prerequisites:
• VTune must already be installed.

Install the NDA add-on package over the installed product.
• On Windows:
  – Run Amplifier_XE_2013-update*_win_nda.msi
• On Linux:
  – Unpack vtune_amplifier_xe_2013_update*_nda_tar.gz
  – Run install.sh from top-level folder

Memory Bandwidth Limitations

Why: Bandwidth bottlenecks increase the latency at which cache misses are being serviced

How: Bandwidth Profile

What Now:
• Compute the maximum theoretical bandwidth per socket for your platform in GB/s: ()/1000
• Run bandwidth analysis on your application. If total bandwidth per socket is > 75% of the maximum theoretical bandwidth, your application may be experiencing loaded (higher) latencies
• If appropriate, make system tuning adjustments (upgrading/balancing DIMMs, disabling HW prefetchers)
• Reduce bandwidth usage if possible: remove ineffective SW prefetches, make algorithmic changes to reduce data storage/sharing, reduce data updates, and balance memory accesses across sockets.
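The 75% check above can be sketched as a short calculation. The terms of the slide's formula are missing in this copy, so the sketch assumes the usual estimate for DDR memory: transfer rate in MT/s times 8 bytes per transfer (a 64-bit channel) times the number of channels per socket, divided by 1000 to get GB/s. The function names and the DDR3-1600, four-channel example are illustrative, not values taken from the presentation.

```python
def max_theoretical_bandwidth_gbs(transfer_rate_mts, channels, bytes_per_transfer=8):
    """Peak memory bandwidth per socket in GB/s.

    ASSUMES 64-bit (8-byte) channels; check your platform's memory
    configuration before relying on the result.
    """
    return transfer_rate_mts * bytes_per_transfer * channels / 1000


def may_be_bandwidth_limited(measured_gbs, max_gbs, threshold=0.75):
    """True if measured bandwidth exceeds the given fraction of the peak."""
    return measured_gbs > threshold * max_gbs


if __name__ == "__main__":
    peak = max_theoretical_bandwidth_gbs(transfer_rate_mts=1600, channels=4)
    for measured in (30.0, 40.0):  # made-up measurements in GB/s
        verdict = may_be_bandwidth_limited(measured, peak)
        print(f"{measured} GB/s of {peak} GB/s peak -> limited: {verdict}")
```

The measured figure would come from the bandwidth analysis result per socket; anything above the threshold suggests looking at the tuning steps listed above.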

Summary

• Comprehensive software development tools solution set for embedded devices and intelligent systems

• Integrates into cross-build environments for Yocto Project*, Wind River* Linux*, and custom Linux*

• Covers all phases of development

• Powerful open source debug enhancements through GDB and SVEN

• Power Analysis, Performance Analysis, Thread Checking & Memory Checking

For more information, to evaluate, or purchase: http://intel.ly/system-studio

Questions?

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Atom, Core, Xeon, and VTune are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
