Intel(R) C++ Compiler for VxWorks* User and Reference Guides

Document Number: 324063-003US



Contents

Legal Information
Getting Help and Support

Chapter 1: Introduction
    Introducing the Intel(R) C++ Compiler
    Notational Conventions
    Related Information

Part I: Building Applications

Chapter 2: Overview: Building Applications
    Introduction to the Compiler
    Compilation Phases
    Default Output Files
    Using Compiler Options
    Saving Compiler Information in Your Executable
    Redistributing Libraries When Deploying Applications

Chapter 3: Building Applications from the Command Line
    Invoking the Compiler from the Command Line
    Invoking the Compiler from the Command Line with make
    Passing Options to the Linker
    Compiler Input Files
    Output Files
    Specifying Compilation Output
    Specifying Executable Files
    Specifying Object Files
    Specifying Assembly Files
    Specifying Alternate Tools and Paths
    Using Precompiled Header Files
    Compiler Option Mapping Tool
    Open Source Tools

Chapter 4: Using Preprocessor Options
    About Preprocessor Options
    Using Options for Preprocessing
    Using Options to Define Macros

Chapter 5: Modifying the Compilation Environment
    About Modifying the Compilation Environment
    Setting Environment Variables
    Using Configuration Files
    Specifying Include Files
    Using Response Files

Chapter 6: Debugging
    Using the Debugger
    Preparing for Debugging
    Symbolic Debugging and Optimizations
    Using Options for Debug Information

Chapter 7: Creating and Using Libraries
    Overview: Using Libraries
    Supplied Libraries
    Managing Libraries
    Creating Libraries
    Compiling for Non-shared Libraries
    Overview: Compiling for Non-shared Libraries
    Global Symbols and Visibility Attributes
    Symbol Preemption
    Specifying Symbol Visibility Explicitly
    Other Visibility-related Command-line Options

Chapter 8: gcc* Compatibility
    gcc Compatibility
    gcc* Interoperability
    gcc Interoperability
    Compiler Options for Interoperability
    Predefined Macros for Interoperability
    gcc Built-in Functions
    Thread-local Storage

Chapter 9: Language Conformance
    Conformance to the C Standard
    Conformance to the C++ Standard
    Exported Templates
    Template Instantiation

Chapter 10: Porting Applications
    Overview: Porting Applications
    Modifying Your makefile
    Equivalent Macros
    Equivalent Environment Variables
    Other Considerations
    Porting from GNU gcc* to Visual C++*

Chapter 11: Error Handling
    Remarks, Warnings, and Errors

Chapter 12: Reference
    C/C++ Calling Conventions
    ANSI Standard Predefined Macros
    Additional Predefined Macros

Part II: Compiler Options
    Overview
    Overview: Compiler Options

Chapter 13: Alphabetical Compiler Options
    Compiler Option Descriptions and General Rules
    A Options: A; A-; alias-const; align; ansi; ansi-alias; ansi-alias-check; auto-ilp32; auto-p32; ax
    B Options: B; Bdynamic; Bstatic; Bsymbolic; Bsymbolic-functions
    C Options: C; c; c99; check-uninit; complex-limited-range; cxxlib
    D Options: D; dD; debug; diag; diag-dump; diag-error-limit; diag-file; diag-file-append; diag-id-numbers; diag-once; dM; dN; dryrun; dumpmachine; dumpversion; dynamic-linker
    E Options: E; early-template-check; EP; export; export-dir
    F Options: fabi-version; falias; falign-functions; falign-stack; fargument-alias; fargument-noalias-global; fasm-blocks; fast; fast-transcendentals; fbuiltin, Oi; fcode-asm; fcommon; fexceptions; ffnalias; ffreestanding; ffriend-injection; ffunction-sections; fimf-absolute-error; fimf-accuracy-bits; fimf-arch-consistency; fimf-max-error; fimf-precision; finline; finline-functions; finline-limit; finstrument-functions; fjump-tables; fkeep-static-consts; fmath-errno; fminshared; fmudflap; fno-defer-pop; fno-gnu-keywords; fno-implicit-fp-sse; fno-implicit-inline-templates; fno-implicit-templates; fno-operator-names; fno-rtti; fnon-call-exceptions; fnon-lvalue-assign; fomit-frame-pointer, Oy; fp; fp-model, fp; fp-port; fp-speculation; fp-stack-check; fp-trap; fp-trap-all; fpack-struct; fpascal-strings; fpermissive; fpic; fpie; freg-struct-return; fshort-enums; fsource-asm; fstack-protector; fstack-security-check, GS; fstrict-aliasing; fsyntax-only; ftemplate-depth; ftls-model; ftrapuv; ftz; funroll-all-loops; funroll-loops; funsigned-bitfields; funsigned-char; fvar-tracking; fvar-tracking-assignments; fverbose-asm; fvisibility; fvisibility-inlines-hidden; fzero-initialized-in-bss, Qzero-initialized-in-bss
    G Options: g, Zi, Z7; g0; gcc; gcc-name; gcc-version; gdwarf-2; global-hoist; fstack-security-check, GS; guide; guide-data-trans; guide-opts; guide-vec; gxx-name
    H Options: H; help; help-pragma
    I Options: I; i-dynamic; i-static; icc; idirafter; imacros; inline-calloc; inline-debug-info; inline-factor; inline-forceinline; inline-level, Ob; inline-max-per-compile; inline-max-per-routine; inline-max-size; inline-max-total-size; inline-min-size; intel-extensions; ip; ip-no-inlining; ip-no-pinlining; ipo; ipo-c; ipo-jobs; ipo-S; ipo-separate; ipp; iprefix; iquote; isystem; iwithprefix; iwithprefixbefore
    K Options: Kc++, TP
    L Options: l; L
    M Options: m; M; m32, m64; malign-double; map-opts; march; mcmodel; mcpu; mc++-abr; MD; membedded; MF; MG; minstruction; MM; MMD; mp; MP; multiple-processes, MP; mp1; MQ; mregparm; mrtp; MT; mtune; multibyte-chars
    N Options: no-bss-init; no-libgcc; nodefaultlibs; nolib-inline; non-static; nostartfiles; nostdinc++; nostdlib
    O Options: o; O; inline-level, Ob; fbuiltin, Oi; opt-args-in-regs; opt-block-factor; opt-calloc; opt-class-analysis; opt-jump-tables; opt-malloc-options; opt-matmul; opt-multi-version-aggressive; opt-prefetch; opt-ra-region-strategy; opt-report; opt-report-file; opt-report-help; opt-report-phase; opt-report-routine; opt-streaming-stores; opt-subscript-in-range; Os; fomit-frame-pointer, Oy
    P Options: P; pc; pch; pch-create; pch-dir; pch-use; pie; pragma-optimization-level; prec-div; prec-sqrt; print-multi-lib; prof-data-order; prof-dir; prof-file; prof-func-groups; prof-func-order; prof-gen; prof-genx; prof-hotness-threshold; prof-src-dir; prof-src-root; prof-src-root-cwd; prof-use; prof-value-profiling; profile-functions; profile-loops; profile-loops-report; pthread
    Q Options: Qinstall; Qlocation; Qoption
    R Options: rcd; regcall; restrict
    S Options: S; save-temps; scalar-rep; shared; shared-intel; shared-libgcc; simd; sox; static; static-intel; static-libgcc; std; strict-ansi
    T Options: T; Kc++, TP
    U Options: u (* OS); U; undef; unroll; unroll-aggressive; use-asm; use-intel-optimized-headers; use-msasm
    V Options: v; V; vec; vec-guard-write; vec-report; vec-threshold; version
    W Options: w; w, W; Wa; Wabi; Wall; Wbrief; Wcheck; Wcomment; Wcontext-limit; wd; Wdeprecated; we; Weffc++; Werror, WX; Werror-all; Wextra-tokens; Wformat; Wformat-security; Winline; Wl; WL; Wmain; Wmissing-declarations; Wmissing-prototypes; wn; Wnon-virtual-dtor; wo; Wp; Wp64; Wpointer-arith; Wpragma-once; wr; Wreorder; Wremarks; Wreturn-type; Wshadow; Wsign-compare; Wstrict-aliasing; Wstrict-prototypes; Wtrigraphs; Wuninitialized; Wunknown-pragmas; Wunused-function; Wunused-variable; ww; Wwrite-strings
    X Options: x (type option); x; X; Xbind-lazy; Xbind-now; xHost; Xlinker
    Z Options: g, Zi, Z7; Zd; ZI; Zp

Chapter 14: Related Options
    Related Options
    Linking Tools and Options
    Portability Options

Part III: Optimizing Applications

Chapter 15: Using Interprocedural Optimization (IPO)
    Interprocedural Optimization (IPO) Overview
    Interprocedural Optimization (IPO) Quick Reference
    Using IPO
    IPO-Related Performance Issues
    IPO for Large Programs
    Understanding Code Layout and Multi-Object IPO
    Creating a Library from IPO Objects
    Requesting Compiler Reports with the xi* Tools
    Inline Expansion of Functions
    Inline Function Expansion
    Compiler Directed Inline Expansion of Functions
    Developer Directed Inline Expansion of User Functions

Chapter 16: Using Profile-Guided Optimization (PGO)
    Profile-Guided Optimizations Overview
    Profile-Guided Optimization (PGO) Quick Reference
    Profile an Application
    Profile Function or Loop Execution Time
    PGO Tools
    PGO Tools Overview
    code coverage Tool
    test prioritization Tool
    profmerge and proforder Tools
    Using Function Order Lists, Function Grouping, Function Ordering, and Data Ordering Optimizations
    Comparison of Function Order Lists and IPO Code Layout
    PGO API Support
    API Support Overview
    PGO Environment Variables
    Dumping Profile Information
    Interval Profile Dumping
    Resetting the Dynamic Profile Counters
    Dumping and Resetting Profile Information

Chapter 17: Using High-Level Optimization (HLO)
    High-Level Optimizations (HLO) Overview

Part IV: Creating Parallel Applications

Chapter 18: Automatic Vectorization
    Automatic Vectorization Overview
    Automatic Vectorization Options Quick Reference
    Programming Guidelines for Vectorization
    Using Automatic Vectorization
    Vectorization and Loops
    Loop Constructs
    User-mandated or SIMD Vectorization
    Function Annotations and the SIMD Directive for Vectorization

Part V: Floating-point Operations

Chapter 19: Overview
    Overview: Floating-point Operations
    Floating-point Options Quick Reference

Chapter 20: Understanding Floating-point Operations
    Programming Tradeoffs in Floating-point Applications
    Floating-point Optimizations
    Using the -fp-model (/fp) Option
    Denormal Numbers
    Floating-point Environment
    Setting the FTZ and DAZ Flags
    Checking the Floating-point Stack State

Chapter 21: Tuning Performance
    Overview: Tuning Performance
    Handling Floating-point Array Operations in a Loop Body
    Reducing the Impact of Denormal Exceptions
    Avoiding Mixed Data Type Arithmetic Expressions
    Using Efficient Data Types

Chapter 22: Understanding IEEE Floating-point Operations
    Overview: Understanding IEEE Floating-point Standard
    Floating-point Formats
    Special Values

Part VI: Intrinsics Reference

Chapter 23: Overview: Intrinsics Reference
    Overview: Intrinsics Reference
    Details about Intrinsics
    Naming and Usage Syntax
    References

Chapter 24: Intrinsics for All Intel Architectures
    Overview: Intrinsics across Intel Architectures
    Integer Arithmetic Intrinsics
    Floating-point Intrinsics
    String and Block Copy Intrinsics
    Miscellaneous Intrinsics

Chapter 25: Data Alignment, Memory Allocation Intrinsics, and Inline Assembly
    Overview: Data Alignment, Memory Allocation Intrinsics, and Inline Assembly
    Alignment Support
    Allocating and Freeing Aligned Memory Blocks
    Inline Assembly

Chapter 26: MMX(TM) Technology Intrinsics
    Overview: MMX(TM) Technology Intrinsics
    Details about MMX(TM) Technology Intrinsics
    The EMMS Instruction: Why You Need It
    EMMS Usage Guidelines
    MMX(TM) Technology General Support Intrinsics
    MMX(TM) Technology Packed Arithmetic Intrinsics
    MMX(TM) Technology Shift Intrinsics
    MMX(TM) Technology Logical Intrinsics
    MMX(TM) Technology Compare Intrinsics
    MMX(TM) Technology Set Intrinsics

Chapter 27: Intrinsics for Intel(R) Streaming SIMD Extensions
    Overview: Intel(R) Streaming SIMD Extensions
    Details about Intel(R) Streaming SIMD Extension Intrinsics
    Writing Programs with Intel(R) Streaming SIMD Extensions Intrinsics
    Arithmetic Intrinsics
    Logical Intrinsics
    Compare Intrinsics
    Conversion Intrinsics
    Load Intrinsics
    Set Intrinsics
    Store Intrinsics
    Cacheability Support Intrinsics
    Integer Intrinsics
    Intrinsics to Read and Write Registers
    Miscellaneous Intrinsics
    Macro Functions
    Macro Function for Shuffle Operations
    Macro Functions to Read and Write Control Registers
    Macro Function for Matrix Transposition

Chapter 28: Intrinsics for Intel(R) Streaming SIMD Extensions 2
    Overview: Intel(R) Streaming SIMD Extensions 2
    Floating-point Intrinsics: Arithmetic Intrinsics; Logical Intrinsics; Compare Intrinsics; Conversion Intrinsics; Load Intrinsics; Set Intrinsics; Store Intrinsics
    Integer Intrinsics: Arithmetic Intrinsics; Logical Intrinsics; Shift Intrinsics; Compare Intrinsics; Conversion Intrinsics; Move Intrinsics; Load Intrinsics; Set Intrinsics; Store Intrinsics
    Miscellaneous Functions and Intrinsics: Cacheability Support Intrinsics; Miscellaneous Intrinsics; Casting Support Intrinsics; Pause Intrinsic; Macro Function for Shuffle; Intrinsics Returning Vectors of Undefined Values

Chapter 29: Intrinsics for Intel(R) Streaming SIMD Extensions 3
    Overview: Intel(R) Streaming SIMD Extensions 3
    Integer Vector Intrinsics
    Single-precision Floating-point Vector Intrinsics
    Double-precision Floating-point Vector Intrinsics
    Miscellaneous Intrinsics
    Macro Functions

Chapter 30: Intrinsics for Intel(R) Supplemental Streaming SIMD Extensions 3
    Overview: Intel(R) Supplemental Streaming SIMD Extensions 3
    Addition Intrinsics
    Subtraction Intrinsics
    Multiplication Intrinsics
    Absolute Value Intrinsics
    Shuffle Intrinsics
    Concatenate Intrinsics
    Negation Intrinsics

Chapter 31: Intrinsics for Intel(R) Streaming SIMD Extensions 4
    Overview: Intel(R) Streaming SIMD Extensions 4
    Vectorizing Compiler and Media Accelerators
    Overview: Vectorizing Compiler and Media Accelerators
    Packed Blending Intrinsics
    Floating Point Dot Product Intrinsics
    Packed Format Conversion Intrinsics
    Packed Integer Min/Max Intrinsics
    Floating Point Rounding Intrinsics
    DWORD Multiply Intrinsics
    Register Insertion/Extraction Intrinsics
    Test Intrinsics
    Packed DWORD to Unsigned WORD Intrinsic
    Packed Compare for Equal Intrinsic
    Cacheability Support Intrinsic
    Efficient Accelerated String and Text Processing
    Overview: Efficient Accelerated String and Text Processing
    Packed Compare Intrinsics
    Application Targeted Accelerators Intrinsics

Chapter 32: Intrinsics for Advanced Encryption Standard Implementation
    Overview: Intrinsics for Carry-less Multiplication Instruction and Advanced Encryption Standard Instructions
    Intrinsics for Carry-less Multiplication Instruction and Advanced Encryption Standard Instructions

Chapter 33: Intrinsics for Intel(R) Advanced Vector Extensions
    Overview: Intrinsics for Intel(R) Advanced Vector Extensions Instructions
    Details of Intel(R) Advanced Vector Extensions Intrinsics and Fused Multiply Add Intrinsics
    Intrinsics for Arithmetic Operations: _mm256_add_pd; _mm256_add_ps; _mm256_addsub_pd; _mm256_addsub_ps; _mm256_hadd_pd; _mm256_hadd_ps; _mm256_sub_pd; _mm256_sub_ps; _mm256_hsub_pd; _mm256_hsub_ps; _mm256_mul_pd; _mm256_mul_ps; _mm256_div_pd; _mm256_div_ps; _mm256_dp_ps; _mm256_sqrt_pd; _mm256_sqrt_ps; _mm256_rsqrt_ps; _mm256_rcp_ps
    Intrinsics for Bitwise Operations: _mm256_and_pd; _mm256_and_ps; _mm256_andnot_pd; _mm256_andnot_ps; _mm256_or_pd; _mm256_or_ps; _mm256_xor_pd; _mm256_xor_ps
    Intrinsics for Blend and Conditional Merge Operations: _mm256_blend_pd; _mm256_blend_ps; _mm256_blendv_pd; _mm256_blendv_ps
    Intrinsics for Compare Operations: _mm_cmp_pd, _mm256_cmp_pd; _mm_cmp_ps, _mm256_cmp_ps; _mm_cmp_sd; _mm_cmp_ss
    Intrinsics for Conversion Operations: _mm256_cvtepi32_pd; _mm256_cvtepi32_ps; _mm256_cvtpd_epi32; _mm256_cvtps_epi32; _mm256_cvtpd_ps; _mm256_cvtps_pd; _mm256_cvttpd_epi32; _mm256_cvttps_epi32
    Intrinsics to Determine Maximum and Minimum Values: _mm256_max_pd; _mm256_max_ps; _mm256_min_pd; _mm256_min_ps
    Intrinsics for Load Operations: _mm256_broadcast_pd; _mm256_broadcast_ps; _mm256_broadcast_sd; _mm256_broadcast_ss, _mm_broadcast_ss; _mm256_load_pd; _mm256_load_ps; _mm256_load_si256; _mm256_loadu_pd; _mm256_loadu_ps; _mm256_loadu_si256; _mm256_maskload_pd, _mm_maskload_pd; _mm256_maskload_ps, _mm_maskload_ps
    Intrinsics for Store Operations: _mm256_store_pd; _mm256_store_ps; _mm256_store_si256; _mm256_storeu_pd; _mm256_storeu_ps; _mm256_storeu_si256; _mm256_stream_pd; _mm256_stream_ps; _mm256_stream_si256; _mm256_maskstore_pd, _mm_maskstore_pd; _mm256_maskstore_ps, _mm_maskstore_ps
    Intrinsics for Miscellaneous Operations: _mm256_extractf128_pd; _mm256_extractf128_ps; _mm256_extractf128_si256; _mm256_insertf128_pd; _mm256_insertf128_ps; _mm256_insertf128_si256; _mm256_lddqu_si256; _mm256_movedup_pd; _mm256_movehdup_ps; _mm256_moveldup_ps; _mm256_movemask_pd; _mm256_movemask_ps; _mm256_round_pd; _mm256_round_ps; _mm256_set_pd; _mm256_set_ps; _mm256_set_epi32; _mm256_set1_pd; _mm256_set1_ps; _mm256_set1_epi32; _mm256_setzero_pd; _mm256_setzero_ps; _mm256_setzero_si256; _mm256_zeroall; _mm256_zeroupper
    Intrinsics for Packed Test Operations: _mm256_testz_si256; _mm256_testc_si256; _mm256_testnzc_si256; _mm256_testz_pd, _mm_testz_pd; _mm256_testz_ps, _mm_testz_ps; _mm256_testc_pd, _mm_testc_pd; _mm256_testc_ps, _mm_testc_ps; _mm256_testnzc_pd, _mm_testnzc_pd; _mm256_testnzc_ps, _mm_testnzc_ps
    Intrinsics for Permute Operations: _mm256_permute_pd, _mm_permute_pd; _mm256_permute_ps; _mm256_permutevar_pd, _mm_permutevar_pd; _mm256_permutevar_ps; _mm256_permute2f128_pd; _mm256_permute2f128_ps; _mm256_permute2f128_si256
    Intrinsics for Shuffle Operations: _mm256_shuffle_pd; _mm256_shuffle_ps
    Intrinsics for Unpack and Interleave Operations: _mm256_unpackhi_pd; _mm256_unpackhi_ps; _mm256_unpacklo_pd; _mm256_unpacklo_ps
    Support Intrinsics for Vector Typecasting Operations: _mm256_castpd_ps; _mm256_castps_pd; _mm256_castpd_si256; _mm256_castps_si256; _mm256_castsi256_pd; _mm256_castsi256_ps; _mm256_castpd128_pd256; _mm256_castpd256_pd128; _mm256_castps128_ps256; _mm256_castps256_ps128; _mm256_castsi128_si256; _mm256_castsi256_si128
    Intrinsics for Fused Multiply Add Operations: _mm_fmadd_pd, _mm256_fmadd_pd; _mm_fmadd_ps, _mm256_fmadd_ps; _mm_fmadd_sd; _mm_fmadd_ss; _mm_fmaddsub_pd, _mm256_fmaddsub_pd; _mm_fmaddsub_ps, _mm256_fmaddsub_ps; _mm_fmsubadd_pd, _mm256_fmsubadd_pd; _mm_fmsubadd_ps, _mm256_fmsubadd_ps; _mm_fmsub_pd, _mm256_fmsub_pd; _mm_fmsub_ps, _mm256_fmsub_ps; _mm_fmsub_sd; _mm_fmsub_ss; _mm_fnmadd_pd, _mm256_fnmadd_pd; _mm_fnmadd_ps, _mm256_fnmadd_ps; _mm_fnmadd_sd; _mm_fnmadd_ss; _mm_fnmsub_pd, _mm256_fnmsub_pd; _mm_fnmsub_ps, _mm256_fnmsub_ps; _mm_fnmsub_sd; _mm_fnmsub_ss
    Intrinsics Generating Vectors of Undefined Values: _mm256_undefined_ps(); _mm256_undefined_pd(); _mm256_undefined_si256()

Chapter 34: Intrinsics for Intel(R) Post-32nm Processor Instruction Extensions
    Overview: Intrinsics for Intel(R) Post-32nm Processor Instruction Extensions
    Intrinsics for Converting Half Floats that Map to Intel(R) Post-32nm Processor Instructions: _mm_cvtph_ps(); _mm256_cvtph_ps(); _mm_cvtps_ph(); _mm256_cvtps_ph()
    Intrinsics that Generate Random Numbers of 16/32/64 Bit Wide Random Integers: _rdrand_u16(), _rdrand_u32(), _rdrand_u64()
    Intrinsics that Allow Reading from and Writing to the FS Base and GS Base Registers: _readfsbase_u32(), _readfsbase_u64(); _readgsbase_u32(), _readgsbase_u64(); _writefsbase_u32(), _writefsbase_u64(); _writegsbase_u32(), _writegsbase_u64()

Chapter 35: Intrinsics for Managing Extended Processor States and Registers
    Overview: Intrinsics for Managing Extended Processor States and Registers
    Intrinsics for Reading and Writing the Content of Extended Control Registers: _xgetbv(); _xsetbv()
    Intrinsics for Saving and Restoring the Extended Processor States: _fxsave(); _fxsave64(); _fxrstor(); _fxrstor64(); _xsave(); _xsave64(); _xsaveopt(); _xsaveopt64(); _xrstor(); _xrstor64()

Chapter 36: Intrinsics for Converting Half Floats
    Overview: Intrinsics to Convert Half Float Types
    Intrinsics for Converting Half Floats

Chapter 37: Intrinsics for Short Vector Math Library Operations
    Overview: Intrinsics for Short Vector Math Library Functions
    Intrinsics for Error Function Operations: _mm_erf_pd, _mm256_erf_pd; _mm_erf_ps, _mm256_erf_ps; _mm_erfc_pd, _mm256_erfc_pd; _mm_erfc_ps, _mm256_erfc_ps
    Intrinsics for Exponential Operations: _mm_exp2_pd, _mm256_exp2_pd; _mm_exp2_ps, _mm256_exp2_ps; _mm_exp_pd, _mm256_exp_pd; _mm_exp_ps, _mm256_exp_ps; _mm_cexp_ps, _mm256_cexp_ps; _mm_pow_pd, _mm256_pow_pd; _mm_pow_ps, _mm256_pow_ps; _mm_cpow_ps, _mm256_cpow_ps
    Intrinsics for Logarithmic Operations: _mm_log2_pd, _mm256_log2_pd; _mm_log2_ps, _mm256_log2_ps; _mm_log10_pd, _mm256_log10_pd; _mm_log10_ps, _mm256_log10_ps; _mm_log_pd, _mm256_log_pd; _mm_log_ps, _mm256_log_ps; _mm_clog2_ps, _mm256_clog2_ps; _mm_clog10_ps, _mm256_clog10_ps; _mm_clog_ps, _mm256_clog_ps
    Intrinsics for Square Root and Cube Root Operations: _mm_sqrt_pd, _mm256_sqrt_pd; _mm_sqrt_ps, _mm256_sqrt_ps; _mm_invsqrt_pd, _mm256_invsqrt_pd; _mm_invsqrt_ps, _mm256_invsqrt_ps; _mm_cbrt_pd, _mm256_cbrt_pd; _mm_cbrt_ps, _mm256_cbrt_ps; _mm_invcbrt_pd, _mm256_invcbrt_pd; _mm_invcbrt_ps, _mm256_invcbrt_ps; _mm_csqrt_ps, _mm256_csqrt_ps
    Intrinsics for Trigonometric Operations: _mm_acos_pd, _mm256_acos_pd; _mm_acos_ps, _mm256_acos_ps; _mm_acosh_pd, _mm256_acosh_pd; _mm_acosh_ps, _mm256_acosh_ps; _mm_asin_pd, _mm256_asin_pd; _mm_asin_ps, _mm256_asin_ps; _mm_asinh_pd, _mm256_asinh_pd; _mm_asinh_ps, _mm256_asinh_ps; _mm_atan_pd, _mm256_atan_pd; _mm_atan_ps, _mm256_atan_ps; _mm_atan2_pd, _mm256_atan2_pd; _mm_atan2_ps, _mm256_atan2_ps; _mm_atanh_pd, _mm256_atanh_pd; _mm_atanh_ps, _mm256_atanh_ps; _mm_cos_pd, _mm256_cos_pd; _mm_cos_ps, _mm256_cos_ps; _mm_cosh_pd, _mm256_cosh_pd; _mm_cosh_ps, _mm256_cosh_ps; _mm_sin_pd, _mm256_sin_pd; _mm_sin_ps, _mm256_sin_ps; _mm_sinh_pd, _mm256_sinh_pd; _mm_sinh_ps, _mm256_sinh_ps; _mm_tan_pd, _mm256_tan_pd; _mm_tan_ps, _mm256_tan_ps; _mm_tanh_pd, _mm256_tanh_pd; _mm_tanh_ps, _mm256_tanh_ps; _mm_sincos_pd, _mm256_sincos_pd; _mm_sincos_ps, _mm256_sincos_ps
    Intrinsics for Trigonometric Operations with Complex Numbers

Part VII: Compiler Reference

Chapter 38: Intel C++ Compiler Pragmas
    Overview: Intel(R) C++ Compiler Pragmas
    Intel-Specific Pragmas
    Intel-specific Pragma Reference: alloc_section; distribute_point; inline, noinline, and forceinline; ivdep; loop_count; memref_control; novector; optimize; optimization_level; prefetch/noprefetch; simd; swp/noswp; unroll/nounroll; unroll_and_jam/nounroll_and_jam; unused; vector
    Intel Supported Pragmas

Chapter 39: Intel Math Library
    Overview: Intel(R) Math Library
    Using the Intel Math Library
    Math Functions: Function List; Trigonometric Functions; Hyperbolic Functions; Exponential Functions; Special Functions; Nearest Integer Functions; Remainder Functions; Miscellaneous Functions; Complex Functions; C99 Macros

Chapter 40: Intel C++ Class Libraries Introduction to the Class Libraries...... 1153 Overview: Intel C++ Class Libraries...... 1153

xxxiv Contents

Hardware and Requirements...... 1153 About the Classes...... 1153 Details About the Libraries...... 1154 C++ Classes and SIMD Operations...... 1155 Capabilities of C++ SIMD Classes...... 1159 Integer Vector Classes...... 1161 Overview: Integer Vector Classes...... 1161 Terms, Conventions, and Syntax Defined...... 1162 Rules for Operators...... 1164 Assignment Operator...... 1167 Logical Operators...... 1168 Addition and Subtraction Operators...... 1170 Multiplication Operators...... 1173 Shift Operators...... 1175 Comparison Operators...... 1177 Conditional Select Operators...... 1179 Debug Operations...... 1182 Unpack Operators...... 1185 Pack Operators...... 1190 Clear MMX™ State Operator...... 1190 Integer Functions for Streaming SIMD Extensions...... 1191 Conversions between Fvec and Ivec...... 1192 Floating-point Vector Classes...... 1193 Overview: Floating-point Vector Classes...... 1193 Fvec Notation Conventions...... 1194 Data Alignment...... 1195 Conversions...... 1195 Constructors and Initialization...... 1196 Arithmetic Operators...... 1198 Minimum and Maximum Operators...... 1204 Logical Operators...... 1205 Compare Operators...... 1207


Conditional Select Operators for Fvec Classes...... 1211 Cacheability Support Operators...... 1217 Debug Operations...... 1217 Load and Store Operators...... 1219 Unpack Operators...... 1219 Move Mask Operators...... 1220 Classes Quick Reference...... 1220 Programming Example...... 1230 C++ Library Extensions...... 1232 Introduction...... 1232 Intel's valarray implementation...... 1232

Chapter 41: Intel's C/C++ Language Extensions Introduction...... 1237 Intel's C++ lambda extensions...... 1237 Introduction...... 1237 Details on Using Lambda Expressions in C++...... 1237 Understanding Lambda-Capture...... 1239 Lambda Function Object...... 1241

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site. Intel processor numbers are not a measure of performance. 
Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile, i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others. Copyright (C) 1996-2010, Intel Corporation. All rights reserved. Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.


Getting Help and Support

The Intel® C++ Compiler lets you build and optimize C/C++ applications for the VxWorks* OS (operating system). You can use the compiler in VxWorks* Development Shell or in Wind River Workbench*. For more information about the compiler features and other components, see your Release Notes. This documentation assumes that you are familiar with the C++ programming language and with your processor's architecture. You should also be familiar with the host computer's operating system.

System Requirements For detailed information on system requirements, see the Release Notes.


Chapter 1: Introduction

Introducing the Intel® C++ Compiler

The Intel® C++ Compiler can generate code for IA-32 or Intel® 64 architecture applications on any Intel®-based VxWorks* system. IA-32 architecture applications (32-bit) can run on all Intel®-based VxWorks systems. Intel® 64 architecture applications can run only on Intel® 64 architecture-based VxWorks systems. You can use the compiler in VxWorks* Development Shell or in Wind River Workbench*.

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.


Notational Conventions

Information in this documentation applies to all supported operating systems and architectures unless otherwise specified. This documentation uses the following conventions:

Notational Conventions

this type Indicates command-line or option arguments.

This type Indicates a code example.

This type Indicates what you type as input.

This type Indicates menu names, menu items, button names, dialog window names, and other user-interface items.

File > Open Menu names and menu items joined by a greater than (>) sign indicate a sequence of actions. For example, "Click File>Open" indicates that in the File menu, click Open to perform this action.

{value | value} Indicates a choice of items or values. You can usually only choose one of the values in the braces.

[item] Indicates items that are optional.

item [, item ]... Indicates that the item preceding the ellipsis (three dots) can be repeated.

Windows* OS, Windows operating system These terms refer to all supported Microsoft* Windows* operating systems.

Linux* OS, Linux operating system These terms refer to all supported Linux* operating systems.


VxWorks* OS, VxWorks operating system These terms refer to all supported VxWorks* operating systems.

Microsoft Windows XP* An asterisk at the end of a word or name indicates it is a third-party product trademark.

compiler option This term refers to Windows* OS options, Linux* OS options, and VxWorks* OS options that can be used on the compiler command line.

Conventions Used in Compiler Options

/option or -option
A slash before an option name indicates the option is available on Windows OS. A dash before an option name indicates the option is available on Linux* OS and VxWorks* OS systems. For example:
Windows OS option: /fast
VxWorks OS option: -fast
Note: If an option is available on Windows* OS, Linux* OS, and VxWorks* OS systems, no slash or dash appears in the general description of the option. The slash and dash will only appear where the option syntax is described.

/option:argument or -option argument
Indicates that an option requires an argument (parameter). For example, you must specify an argument for the following options:
Windows OS option: /Qdiag-error-limit:n
Linux OS, VxWorks OS option: -diag-error-limit n

/option:keyword or -option keyword
Indicates that an option requires one of the keyword values.

/option[:keyword] or -option [keyword]
Indicates that the option can be used alone or with an optional keyword.


option[n] or option[:n] or option[=n]
Indicates that the option can be used alone or with an optional value; for example, in /Qfnalign[:n] and -falign-functions[=n], the n can be omitted or a valid value can be specified for n.

option[-]
Indicates that a trailing hyphen disables the option; for example, /Qglobal_hoist- disables the Windows OS option /Qglobal_hoist.

[no]option or [no-]option
Indicates that "no" or "no-" preceding an option disables the option. For example:
In the Windows OS option /[no]traceback, /traceback enables the option, while /notraceback disables it.
In the Linux OS, VxWorks OS option -[no-]global_hoist, -global_hoist enables the option, while -no-global_hoist disables it.
In some options, the "no" appears later in the option name; for example, -fno-alias disables the -falias option.

Related Information

Associated Intel Documents The following Intel documents provide additional information about the Intel® C++ Compiler, Intel® architecture, Intel® processors, or tools:

• Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture, Intel Corporation
• Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M, Intel Corporation
• Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2B: Instruction Set Reference, N-Z, Intel Corporation


• Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Intel Corporation
• Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Intel Corporation
• Intel® 64 and IA-32 Architectures Optimization Reference Manual
• Intel® Itanium® Architecture Software Developer's Manual - Volume 1: Application Architecture, Revision 2.2
• Intel® Itanium® Architecture Software Developer's Manual - Volume 2: System Architecture, Revision 2.2
• Intel® Itanium® Architecture Software Developer's Manual - Volume 3: Instruction Set Reference, Revision 2.2
• Intel® Processor Identification with the CPUID Instruction, Intel Corporation, doc. number 241618
• IA-64 Architecture Assembler User's Guide
• IA-64 Architecture Assembly Language Reference Guide

Most Intel documents can be found at the Intel web site http://www.intel.com/software/products/

Optimization and Vectorization Terminology and Technology The following documents provide details on basic optimization and vectorization terminology and technology:

• Intel® Architecture Optimization Reference Manual
• Dependence Analysis, Utpal Banerjee (A Book Series on Loop Transformations for Restructuring Compilers). Kluwer Academic Publishers. 1997.
• The Structure of Computers and Computation: Volume I, David J. Kuck. John Wiley and Sons, New York, 1978.
• Loop Transformations for Restructuring Compilers: The Foundations, Utpal Banerjee (A Book Series on Loop Transformations for Restructuring Compilers). Kluwer Academic Publishers. 1993.
• Loop Parallelization, Utpal Banerjee (A Book Series on Loop Transformations for Restructuring Compilers). Kluwer Academic Publishers. 1994.
• High Performance Compilers for Parallel Computers, Michael J. Wolfe. Addison-Wesley, Redwood City. 1996.


• Supercompilers for Parallel and Vector Computers, H. Zima. ACM Press, New York, 1990.
• An Auto-vectorizing Compiler for the Intel® Architecture, Aart Bik, Paul Grey, Milind Girkar, and Xinmin Tian. Submitted for publication.
• Efficient Exploitation of Parallelism on Pentium® III and Pentium® 4 Processor-Based Systems, Aart Bik, Milind Girkar, Paul Grey, and Xinmin Tian.
• The Software Vectorization Handbook. Applying Multimedia Extensions for Maximum Performance, A.J.C. Bik. Intel Press, June, 2004.
• Multi-Core Programming: Increasing Performance through Software Multithreading, Shameem Akhter and Jason Roberts. Intel Press, April, 2006.

Additional Training For additional technical product information including white papers about Intel compilers, open the page associated with your product at http://www.intel.com/software/products/

Part I: Building Applications

Topics:

• Overview: Building Applications
• Building Applications from the Command Line
• Using Preprocessor Options
• Modifying the Compilation Environment
• Debugging
• Creating and Using Libraries
• gcc* Compatibility
• Language Conformance
• Porting Applications
• Error Handling
• Reference


Chapter 2: Overview: Building Applications

This section describes how to build your applications with the Intel® C++ Compiler. It includes information on using the compiler on the command line and how to use the compiler with supported integrated development environments. Also in this section are helpful topics on linking, debugging, libraries, compatibility, and language conformance.

Introduction to the Compiler

You can invoke the Intel® C++ Compiler from a system command prompt with the icc or icpc command after setting the environment variables.

Getting Help On the command line, you can execute icpc -help for a summary of command-line options.

Compilation Phases

The Intel® C++ Compiler processes C and C++ language source files. By default, the compiler performs the compile and link phases of compilation and produces an executable file. The compiler also determines which compilation phases to perform based on the file name extension and the compilation options specified. The compiler passes object files and any unrecognized file names to the linker. The linker then determines whether the file is an object file or a library file.

Default Output Files

A default invocation of the compiler requires only a C or C++ file name argument, such as: icpc x.cpp

You can compile more than one input file: icpc x.cpp y.cpp z.cpp

This command does the following:

• Compiles and links three input source files.
• Produces one executable file, a.out, in the current directory.


Using Compiler Options

A compiler option is a case-sensitive, command-line expression used to change the compiler's default operation. Compiler options are not required to compile your program, but they are extremely useful in helping you control different aspects of your application, such as:

• code generation
• optimization
• output file (type, name, location)
• linking properties
• size of the executable
• speed of the executable

See Option Categories below for a broader view of the option capabilities included with the Intel® C++ Compiler.

Command-line Syntax When you specify compiler options on the command-line, the following syntax applies: icc [options] [@response_file] file1 [file2...]

where options represents zero or more compiler options, and file is any of the following:

• C or C++ source file (.C .c .cc .cpp .cxx .c++ .i .ii)
• assembly file (.s .S)
• object file (.o)
• static library (.a)

If you are compiling just C language sources, invoke the compiler with icc. You should invoke the compiler with icpc if you are compiling just C++ language sources or a combination of C and C++. The optional response_file is a text file that lists compiler options you want to include during compilation. See Using Response Files.


The compiler reads command-line options from left to right. If your compilation includes competing options, then the compiler uses the one furthest to the right. In other words, "the last one wins." In this example:

icc -xP main.c file1.c -xW file2.c

the compiler sees -xP and -xW as two forms of the same option where only one form can be used. Since -xW is last (furthest to the right), it wins. All options specified on the command line are used to compile each file. The compiler will NOT compile individual files with specific options as this example may suggest:

icc -O3 main.c file1.c -mp1 file2.c

It may seem that main.c and file1.c are compiled with -O3, and file2.c is compiled with the -mp1 option. This is not the case. All files are compiled with both options. A rare exception to this general rule is the -x type option:

icc -x c file1 -x c++ file2 -x assembler file3

where the type argument identifies each file type for the compiler.

Default Operation The compiler invokes many options by default. For example, the -O2 option is on by default for systems based on IA-32 architecture. In this simple example, the compiler includes -O2 (and the other default options) in the compilation: icc main.c

See the Compiler Options reference for the default status of each compiler option. Each time you invoke the compiler, options listed in the corresponding configuration file (icc.cfg or icpc.cfg) override any competing default options. For example, if your icc.cfg file includes the -O3 option, the compiler will use -O3 rather than the default -O2 option. Use the configuration file to list options you'd like the compiler to use for every compilation. See Using Configuration Files. Finally, options used on the command-line override any competing options specified elsewhere (default options, options in the configuration file). If you specify -O1 on the command line, this option setting would "win" over competing option defaults and competing options in configuration files, in addition to competing options in the CL environment variable.
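As a sketch of the configuration-file mechanism described above, an icc.cfg can contain nothing more than a list of options, one per line. The contents and paths below are hypothetical, and the leading-# comment syntax is assumed to match the sample configuration files shipped with the compiler:

```text
## Sample icc.cfg (hypothetical): every option listed here is added
## to each icc invocation and overrides competing default options.
-O3
-I/project/include
```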


Certain #pragma statements in your source code can override competing options specified on the command line. For example, if a function in your code is preceded by #pragma optimize("", off), then optimization for that function is turned off, even though -O2 optimization is on by default, -O3 is listed in the configuration file, and -O1 is specified on the command-line for the rest of the program.

Using Options with Arguments Compiler options can be as simple as a single letter, such as -E/E. However, many options accept or require arguments. The -O option, for example, accepts a single-value argument that the compiler uses to determine the degree of optimization. Other options require at least one argument and can accept multiple arguments. For most options that accept arguments, the compiler will warn you if your option and argument are not recognized. If you specify -O9, for example, the compiler will issue a warning, ignore the unrecognized -O9 option, and proceed with compilation. While the -O option does not require an argument, there are other options that must include an argument. The -I option requires an argument that identifies the directory to add to the include file search path. If you use this option without an argument, the compiler will not finish compilation. See the Compiler Options reference for a complete description of options and their supported arguments.

Other Forms of Options You can toggle some options on or off by using the negation convention. For example, the -complex-limited-range option, and many others, include a negation form, -no-complex-limited-range, to change the state of the option. Since this option is disabled by default, using -complex-limited-range on the command line would toggle it to the "ON" state.

Option Categories When you invoke the Intel C++ Compiler and specify a compiler option, you have a wide range of choices to influence the compiler's default operation. Intel compiler options typically correspond to one or more of the following categories:

• Advanced Optimization
• Code Generation
• Compatibility
• Component Control
• Data
• Deprecated
• Diagnostics
• Floating Point
• Help
• Inlining
• Interprocedural Optimizations (IPO)
• Language
• Linking/Linker
• Miscellaneous
• Parallel Processing
• Optimization
• Output
• Profile Guided Optimization (PGO)
• Preprocessor

To see which options are included in each category, invoke the compiler from the command line with the -help category option. For example: icc -help codegen

will print to stdout the names and syntax of the options in the Code Generation category.

Saving Compiler Information in Your Executable

If you want to save information about the compiler in your executable, use the -sox (Linux* OS and VxWorks* OS) or /Qsox (Windows* OS) option. When you use this option, the following information is saved:

• compiler version number
• compiler options that were used to produce the executable

On Linux OS and VxWorks OS: To view the information stored in the object file, use the following command:

objdump -sj .comment a.out


strings -a a.out | grep comment:

On Windows OS: To view the linker directives stored in string format in the object file, use the following command:

link /dump /directives filename.obj

In the output, the -comment linker directive displays the compiler version information. To search your executable for compiler information, use the following command:

findstr "Compiler" filename.exe

This searches for any strings that have the substring "Compiler" in them.

Redistributing Libraries When Deploying Applications

When you deploy your application to systems on which the Intel Compiler is not installed, you need to redistribute certain Intel libraries to which your application is linked. You can do so in one of the following ways:

• Statically link your application. An application built with statically-linked libraries eliminates the need to distribute runtime libraries with the application executable. By linking the application to the static libraries, you are not dependent on the Intel C/C++ dynamic shared libraries.
• Dynamically link your application. If you must build your application with dynamically linked (or shared) compiler libraries, you should address the following concerns:
  • You must build your application with shared or dynamic libraries that are redistributable.
  • Pay careful attention to the directory where the redistributables are installed and how the OS finds them.
  • You should determine which shared or dynamic libraries your application needs.

Chapter 3: Building Applications from the Command Line

Invoking the Compiler from the Command Line

Set the Environment Variables VxWorks* Development Shell sets the environment variables for the Intel® C++ Compiler automatically.

Invoking the Compiler with icc or icpc You can invoke the Intel C++ Compiler on the command line with either icc or icpc.

• When you invoke the compiler with icc, the compiler builds C source files using C libraries and C include files. If you use icc with a C++ source file, it is compiled as a C++ file. Use icc to link C object files.
• When you invoke the compiler with icpc, the compiler builds C++ source files using C++ libraries and C++ include files. If you use icpc with a C source file, it is compiled as a C++ file. Use icpc to link C++ object files.

Command-line Syntax When you invoke the Intel C++ Compiler with icc or icpc, use the following syntax: {icc|icpc} [options] file1 [file2 . . .]

Argument Description

options Indicates one or more command-line options. The compiler recognizes one or more letters preceded by a hyphen (-). This includes linker options.

file1, file2... Indicates one or more files to be processed by the compiler. You can specify more than one file. Use a space as a delimiter for multiple files.


Invoking the Compiler from the Command Line with make

To run make from the command line using Intel® C++ Compiler, make sure that /usr/bin is in your path. To use the Intel compiler, your makefile must include the setting CC=icpc. Use the same setting on the command line to instruct the makefile to use the Intel compiler. If your makefile is written for gcc, the GNU* C compiler, you will need to change those command line options not recognized by the Intel compiler. Then you can compile: make -f my_makefile
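A minimal makefile for the setup described above might look like the following sketch; the project file names (main.cpp, utils.cpp, app) are hypothetical:

```makefile
# Select the Intel C++ compiler instead of the default compiler.
CC = icpc
CFLAGS = -O2

app: main.o utils.o
	$(CC) $(CFLAGS) -o app main.o utils.o

main.o: main.cpp
	$(CC) $(CFLAGS) -c main.cpp

utils.o: utils.cpp
	$(CC) $(CFLAGS) -c utils.cpp
```

Running make -f my_makefile then picks up the CC=icpc setting; alternatively, pass CC=icpc on the make command line to override whatever the makefile sets.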

See Also • Modifying Your makefile

Passing Options to the Linker

This topic describes the options that let you control and customize the linking with tools and libraries and define the output of the ld linker. See the ld man page for more information on the linker.

Option Description

-Ldirectory Instructs the linker to search directory for libraries.

-Qoption,tool,list Passes an argument list to another program in the compilation sequence, such as the assembler or linker.

-shared Instructs the compiler to build a Dynamic Shared Object (DSO) instead of an executable.

-shared-libgcc -shared-libgcc has the opposite effect of -static-libgcc. When it is used, the GNU standard libraries are linked in dynamically, allowing the user to override the static linking behavior when the -static option is used. Note: By default, all C++ standard and support libraries are linked dynamically.

-shared-intel Specifies that all Intel-provided libraries should be linked dynamically.


Option Description

-static Causes the executable to link all libraries statically, as opposed to dynamically. When -static is not used:

• /lib/ld-linux.so.2 is linked in • all other libs are linked dynamically

When -static is used:

• /lib/ld-linux.so.2 is not linked in • all other libs are linked statically

-static-libgcc This option causes the GNU standard libraries to be linked in statically.

-Bstatic This option is placed in the linker command line at the position corresponding to its location on the user command line. It causes libraries that appear after it on the command line to be linked statically.

-Bdynamic This option is placed in the linker command line at the position corresponding to its location on the user command line. It causes libraries that appear after it on the command line to be linked dynamically.

-static-intel This option causes Intel-provided libraries to be linked in statically. It is the opposite of -shared-intel.

-Wl,optlist This option passes a comma-separated list (optlist) of linker options to the linker.

-Xlinker val This option passes a value (val), such as a linker option, an object, or a library, directly to the linker.

Compiler Input Files

The Intel® C++ Compiler recognizes input files with the extensions listed in the following table:


File Name Interpretation Action

file.c C source file Passed to compiler

file.C, file.CC, file.cc, file.cpp, file.cxx C++ source file Passed to compiler

file.a, file.so Library file Passed to linker

file.i Preprocessed file Passed to stdout

file.o Object file Passed to linker

file.s Assembly file Passed to assembler

Output Files

The Intel® C++ Compiler produces output files with the extensions listed in the following table:

File Name Description

file.i Preprocessed file -- produced with the -P option.

file.o Object file -- produced with the -c option.

file.s Assembly language file -- produced with the -S option.

a.out Executable file -- produced by default compilation.

See Also • Using Options for Preprocessing • Specifying Object Files • Specifying Assembly Files • Specifying Executable Files


Specifying Compilation Output

Specifying Executable Files

You can use the -o option to specify the name of the executable file. In the following example, the compiler produces an executable file named startup. icpc -ostartup prog1.cpp

See Also • -o

Specifying Object Files

You can use the -c and -o options to specify an alternate name for an object file. In this example, the compiler generates an object file named myobj.o: icpc -c -omyobj.o x.cpp

See Also • -c • -o

Specifying Assembly Files

You can use the -S and -o options to specify an alternate name for an assembly file. In this example, the compiler generates an assembly file named myasm.s: icpc -S -omyasm.s x.cpp

See Also • -S

Specifying Alternate Tools and Paths

Use the -Qlocation option to specify an alternate path for a tool. This option accepts two arguments using the following syntax: -Qlocation,tool,path where tool designates which compilation tool is associated with the alternate path.


tool Description

cpp Specifies the compiler front-end preprocessor.

c Specifies the C++ compiler.

asm Specifies the assembler.

link Specifies the linker.

Use the -Qoption option to pass an option specified by optlist to a tool, where optlist is a comma-separated list of options. The syntax for this command is: -Qoption,tool,optlist where tool designates which compilation tool receives the optlist.

tool Description

cpp Specifies the compiler front-end preprocessor.

c Specifies the C++ compiler.

asm Specifies the assembler.

link Specifies the linker.

optlist indicates one or more valid argument strings for the designated program. If the argument is a command-line option, you must include the hyphen. If the argument contains a space or tab character, you must enclose the entire argument in quotation characters (""). You must separate multiple arguments with commas.

Using Precompiled Header Files

The Intel® C++ Compiler supports precompiled header (PCH) files to significantly reduce compile times using the following options:

• -pch
• -pch-dir dirname
• -pch-create filename
• -pch-use filename


CAUTION. Depending on how you organize the header files listed in your sources, these options may increase compile times. See Organizing Source Files to learn how to optimize compile times using the PCH options.

Using -pch The -pch option directs the compiler to use appropriate PCH files. If none are available, they are created as sourcefile.pchi. This option supports multiple source files, such as the ones shown in Example 1: Example 1 command line: icpc -pch source1.cpp source2.cpp

Example 1 output when .pchi files do not exist: "source1.cpp": creating precompiled header file "source1.pchi" "source2.cpp": creating precompiled header file "source2.pchi"

Example 1 output when .pchi files do exist: "source1.cpp": using precompiled header file "source1.pchi" "source2.cpp": using precompiled header file "source2.pchi"

NOTE. The -pch option will use PCH files created from other sources if the header files are the same. For example, if you compile source1.cpp using -pch, then source1.pchi is created. If you then compile source2.cpp using -pch, the compiler will use source1.pchi if it detects the same headers.

Using -pch-create Use the -pch-create filename option if you want the compiler to create a PCH file called filename. Note the following regarding this option:

• The filename parameter must be specified.
• The filename parameter can be a full path name.
• The full path to filename must exist.
• The .pchi extension is not automatically appended to filename.
• This option cannot be used in the same compilation as -pch-use filename.
• The -pch-create filename option is supported for single source file compilations only.


Example 2 command line: icpc -pch-create /pch/source32.pchi source.cpp

Example 2 output: "source.cpp": creating precompiled header file "/pch/source32.pchi"

Using -pch-use filename

This option directs the compiler to use the PCH file specified by filename. It cannot be used in the same compilation as -pch-create filename. The -pch-use filename option supports full path names and supports multiple source files when all source files use the same .pchi file.

Example 3 command line:

icpc -pch-use /pch/source32.pchi source.cpp

Example 3 output: "source.cpp": using precompiled header file /pch/source32.pchi

Using -pch-dir dirname

Use the -pch-dir dirname option to specify the path (dirname) to the PCH file. You can use this option with -pch, -pch-create filename, and -pch-use filename.

Example 4 command line:

icpc -pch -pch-dir /pch source32.cpp

Example 4 output: "source32.cpp": creating precompiled header file /pch/source32.pchi

Organizing Source Files

If many of your source files include a common set of header files, place the common headers first, followed by the #pragma hdrstop directive. This pragma marks the end of the headers that are captured in the PCH file; headers that follow it are compiled normally. For example, if source1.cpp, source2.cpp, and source3.cpp all include common.h, then place #pragma hdrstop after common.h to optimize compile times.

#include "common.h"
#pragma hdrstop
#include "noncommon.h"


When you compile using the -pch option: icpc -pch source1.cpp source2.cpp source3.cpp

the compiler will generate one PCH file for all three source files:

"source1.cpp": creating precompiled header file "source1.pchi"
"source2.cpp": using precompiled header file "source1.pchi"
"source3.cpp": using precompiled header file "source1.pchi"

If you don't use #pragma hdrstop, a different PCH file is created for each source file if different headers follow common.h, and the subsequent compile times will be longer. #pragma hdrstop has no effect on compilations that do not use these PCH options.

See Also • -pch • -pch-dir • -pch-create • -pch-use

Compiler Option Mapping Tool

The Intel compiler's Option Mapping Tool provides an easy method to derive equivalent options between Windows* and Linux*. If you are a Windows-based application developer who is developing an application for Linux OS, you may want to know, for example, the Linux OS equivalent for the /Oy- option. Likewise, the Option Mapping Tool provides Windows OS equivalents for Intel compiler options supported on Linux OS.

Using the Compiler Option Mapping Tool

You can start the Option Mapping Tool from the command line by:

• invoking the compiler and using the -map-opts option • or, executing the tool directly

NOTE. Compiler options are mapped to their equivalent on the architecture you are using.

Calling the Option Mapping Tool with the Compiler


If you use the compiler to execute the Option Mapping Tool, the following syntax applies:

Example: Finding the Windows OS equivalent for -fp

icpc -map-opts -fp

Intel(R) Compiler option mapping tool

mapping Linux options to Windows OS for C++

'-map-opts' Linux option maps to --> '-Qmap-opts' option on Windows --> '-Qmap_opts' option on Windows

'-fp' Linux option maps to --> '-Oy-' option on Windows

Output from the Option Mapping Tool also includes:

• option mapping information (not shown here) for options included in the compiler configuration file • alternate forms of the option that are supported but may not be documented

When you call the Option Mapping Tool with the compiler, your source file is not compiled.

Calling the Option Mapping Tool Directly

Use the following syntax to execute the Option Mapping Tool directly from a command-line environment where the full path to the map_opts executable is known (compiler bin directory):

map_opts [-nologo] -t<target_OS> -l<language> -opts <option_list>

where:

• <target_OS> = {l|linux|w|windows}
• <language> specifies the source language (c in the example below)


Example: Finding the Windows OS equivalent for -fp

map_opts -tw -lc -opts -fp

Intel(R) Compiler option mapping tool

mapping Linux options to Windows for C++

'-fp' Linux option maps to --> '-Oy-' option on Windows

Open Source Tools

This version of the Intel® C++ Compiler includes improved support for the following open source tools: • GNU Libtool – a script that allows package developers to provide generic shared library support. • Valgrind – a flexible system for debugging and profiling executables running on x86 processors. • GNU Automake – a tool for automatically generating Makefile.ins from files called Makefile.am.

See Also GNU Libtool documentation – http://www.gnu.org/software/libtool/manual.html Valgrind documentation – http://valgrind.org/docs/manual/manual.html GNU Automake documentation – http://sources.redhat.com/automake/automake.html


Using Preprocessor Options

About Preprocessor Options

This section explains how preprocessor options are implemented in the Intel® C++ Compiler to perform preliminary operations on C and C++ source files. The preprocessor options are summarized in the following table:

Option Description

-E preprocess to stdout. #line directives included.

-P preprocess to a file. #line directives omitted.

-EP preprocess to stdout omitting #line directives.

-C retain comments in intermediate file (use with -E or -P).

-D define a macro.

-U undefine a macro.

-I add directory to include file search path.

-X remove standard directories from include file search path.

-H print include file order.

-M generate makefile dependency information.

See Also • -E • -P • -EP • -C • -D • -U


• -I • -X • -H • -M

Using Options for Preprocessing

Use these options to preprocess your source files without compiling them. When using these options, only the preprocessing phase of compilation is activated.

Using -E

Use this option to preprocess to stdout. For example, to preprocess two source files and write them to stdout, enter the following command:

icpc -E prog1.cpp prog2.cpp

Using -P

Use this option to preprocess to a .i file, omitting #line directives. For example, the following command creates two files named prog1.i and prog2.i, which you can use as input to another compilation:

icpc -P prog1.cpp prog2.cpp

CAUTION. Existing files with the same name and extension are overwritten when you use this option.

Using -EP

Use this option to preprocess to stdout, omitting #line directives.

icpc -EP prog1.cpp prog2.cpp

Using -C

Use this option to retain comments. In this example:

icpc -C -P prog1.cpp prog2.cpp

the compiler preserves comments in the prog1.i preprocessed file.
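The same preprocessing flags exist in gcc, which makes them easy to try without the Intel toolchain; a sketch follows (gcc is a stand-in, and note one difference: gcc's -P suppresses #line directives on stdout rather than writing a .i file as icpc does). The file name is illustrative:

```shell
# A tiny input with a macro and a comment.
printf '#define N 3\nint x = N; /* keep me */\n' > ppdemo.c
gcc -E ppdemo.c          # expanded source, #line markers present
gcc -E -P ppdemo.c       # markers suppressed; comments stripped by default
gcc -E -P -C ppdemo.c    # -C retains the comment
```

Comparing the three outputs shows exactly what each flag adds or removes.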


Option Summary

The following table summarizes the preprocessing options:

Option    Includes #line Directives    Output

-E        Yes                          stdout

-P        No                           .i file

-EP       No                           stdout

-P -EP    No                           .i file

See Also • -E • -P • -EP • -C

Using Options to Define Macros

You can use compiler options to define or undefine predefined macros.

Using -D

Use this option to define a macro. For example, to define a macro called SIZE with the value 100, use the following command:

icpc -DSIZE=100 prog1.cpp

If you define a macro, but do not assign a value, the compiler defaults to 1 for the value of the macro.

Using -U

Use this option to undefine a macro. For example, this command:

icpc -Uia32 prog1.cpp


undefines the ia32 predefined macro. If you attempt to undefine an ANSI C macro, the compiler emits an error: invalid macro undefinition.

See Also • ANSI Standard Predefined Macros • Additional Predefined Macros

Modifying the Compilation Environment

About Modifying the Compilation Environment

To run the Intel® C++ Compiler, you need to start with the proper environment. The compiler includes scripts which set the environment for all necessary components. You can modify the compilation environment by specifying different settings for:

• Environment Variables
• Configuration Files
• Include Files

See Also • Response Files

Setting Environment Variables

You can customize your system environment by specifying the paths where the compiler searches for certain files, such as libraries, include files, and configuration files, and by adjusting related settings.

Compiler Environment Variables The Intel® C++ Compiler supports the following environment variables on VxWorks* OS:

Environment Variable Description

GCCROOT Specifies the location of the gcc binaries. Set this variable only when the compiler cannot locate the gcc binaries when using the -gcc-name option.

GXX_INCLUDE Specifies the location of the gcc headers. Set this variable to specify the locations of the GCC installed files when the compiler does not find the needed values as specified by the use of the -gcc-name=/gcc option.


GXX_ROOT Specifies the location of the gcc binaries. Set this variable to specify the locations of the GCC installed files when the compiler does not find the needed values as specified by the use of the -gcc-name=/gcc option.

IA32ROOT (IA-32 architecture and Intel® 64 architecture) Points to the directories containing the include and library files for a non-standard installation structure.

ICCCFG Specifies the configuration file for customizing compilations when invoking the compiler using icc.

ICPCCFG Specifies the configuration file for customizing compilations when invoking the compiler using icpc.

LD_LIBRARY_PATH Specifies the location for shared objects. (Linux OS)

PATH Specifies the directories the system searches for binary executable files.

PROF_DIR Specifies the directory where profiling files (files with extensions .dyn, .dpi, .spi, and so forth) are stored. The default is to store the .dyn files in the source directory of the file containing the first executed instrumented routine in the binary compiled with -prof-gen.

PROF_DPI Specifies the name for the .dpi file. The default is pgopti.dpi.

TMP, TMPDIR, TEMP Specify the location for temporary files. If none of these is specified, the compiler stores temporary files in /tmp.

GNU* Environment Variables and Extensions

CPATH Specifies path to include directory for C/C++ compilations.


C_INCLUDE_PATH Specifies path to include directory for C compilations.

CPLUS_INCLUDE_PATH Specifies path to include directory for C++ compilations.

DEPENDENCIES_OUTPUT Specifies how to output dependencies for make based on the non-system header files processed by the compiler. System header files are ignored in the dependency output.

GCC_EXEC_PREFIX Specifies alternative names for the linker (ld) and assembler (as).

LIBRARY_PATH Specifies a colon-separated list of directories, much like PATH.

SUNPRO_DEPENDENCIES This variable is the same as DEPENDENCIES_OUTPUT, except that system header files are not ignored.

PGO Environment Variables

INTEL_PROF_DUMP_CUMULATIVE When using interval profile dumping (initiated by INTEL_PROF_DUMP_INTERVAL or the function _PGOPTI_Set_Interval_Prof_Dump) during the execution of an instrumented user application, allows creation of a single .dyn file to contain profiling information instead of multiple .dyn files. If this environment variable is not set, executing an instrumented user application creates a new .dyn file for each interval. Setting this environment variable is useful for applications that do not terminate or those that terminate abnormally (bypassing the normal exit code).

INTEL_PROF_DUMP_INTERVAL Initiates interval profile dumping in an instrumented user application.

PROF_DIR Specifies the directory in which dynamic information files are created. This variable applies to all three phases of the profiling process.


PROF_DUMP_INTERVAL Deprecated; use INTEL_PROF_DUMP_INTERVAL instead.

PROF_NO_CLOBBER Alters the feedback compilation phase slightly. By default, during the feedback compilation phase, the compiler merges data from all dynamic information files and creates a new pgopti.dpi file if the .dyn files are newer than an existing pgopti.dpi file. When this variable is set, the compiler does not overwrite the existing pgopti.dpi file; instead, it issues a warning. You must remove the pgopti.dpi file if you want to use additional dynamic information files.

Using Configuration Files

You can decrease the time you spend entering command-line options by using the configuration file to automate command-line entries. Add any valid command-line option to the configuration file. The compiler processes options in the configuration file in the order they appear followed by the command-line options that you specify when you invoke the compiler. Options in the configuration file are executed every time you run the compiler. If you have varying option requirements for different projects, use response files.

How to Use Configuration Files

The following example illustrates a basic configuration file. The text following the "#" character is recognized as a comment. The configuration files, icc.cfg and icpc.cfg, are located in the same directory as the compiler's executable file. You should modify the configuration file environment variable if you need to specify a different location for the configuration file.

# Sample configuration file.
-I/my_headers

In this example, the compiler reads the configuration file and invokes the -I option every time you run the compiler, along with any options you specify on the command line.

See Also • Environment Variables • Using Response Files


Specifying Include Files

The Intel® C++ Compiler searches the default system areas for include files and whatever is specified by the -I compiler option. The compiler searches directories for include files in the following order:

1. Directories specified by the -I option
2. Directories specified in the environment variables
3. Default include directory

Use the -X option to remove default directories from the include file search path. For example, to direct the compiler to search the path /alt/include instead of the default path, do the following: icpc -X -I/alt/include prog1.cpp

See Also • -I compiler option • -X compiler option • Environment Variables

Using Response Files

Use response files to specify options used during particular compilations. Response files are invoked as an option on the command line. Options in a response file are inserted in the command line at the point where the response file is invoked.


Sample Response Files

# response file: response1.txt
# compile with these options
-w0
# end of response1 file

# response file: response2.txt
# compile with these options
-O0
# end of response2 file

Use response files to decrease the time spent entering command-line options and to ensure consistency by automating command-line entries. Use individual response files to maintain options for specific projects. Any number of options or file names can be placed on a line in a response file. Several response files can be referenced in the same command line. The following example shows how to specify a response file on the command line:

icpc @response1.txt prog1.cpp @response2.txt prog2.cpp

NOTE. An "@" symbol must precede the name of the response file on the command line.

See Also • Using Configuration Files

Debugging

Using the Debugger

See the Intel® Debugger (IDB) documentation for complete information on using the debugger. You can also use the GNU Debugger (gdb) to debug programs compiled with the Intel® C++ Compiler.

Preparing for Debugging

Use the -g option to direct the compiler to generate code to support symbolic debugging. For example:

icpc -g prog1.cpp

The compiler does not support the generation of debugging information in assembly files. When you specify this option, the resulting object file will contain debugging information, but the assembly file will not. This option changes the default optimization from -O2 to -O0.

See Also • -g

Symbolic Debugging and Optimizations

When you use debugging options to generate code with debug symbols, the compiler disables -On optimizations. If you specify an -On option with debugging options, some of the generated debug information may be inaccurate as a side-effect of optimization. The table below summarizes the effects of using the debugging options with the optimization options.

Option(s) Result

-g Debugging information produced, -O0 and -fp enabled.

-g -O1 Debugging information produced, -O1 optimizations enabled.

-g -O2 Debugging information produced, -O2 optimizations enabled.

-g -O3 -fp Debugging information produced, -O3 optimizations and -fp are enabled.


See Also • -g • -fp • -O1, -O2, -O3

Using Options for Debug Information

The Intel® C++ Compiler provides basic debugging information and new features for enhanced debugging of code. The basic debugging options are listed in the following table.

Option Description

-debug all, -debug full These options are equivalent to -g. They turn on production of basic debug information. They are off by default.

-debug none This option turns off production of debug information. This option is on by default.

The Intel C++ Compiler improves debuggability of optimized code through enhanced support for: • tracebacks • variable locations • breakpoints and stepping

The options described in the following table control generation of enhanced debug information.

Option Description

-debug expr-source-pos This option enables expanded line number information.

-debug inline-debug-info This option produces enhanced debug information for inlined code. It provides more information to debuggers for function call traceback.

-debug emit_column This option generates column number information for debugging.

-debug semantic-stepping This option generates information useful for breakpoints and stepping. It tells the debugger to stop only at machine instructions that achieve the final effect of a source statement.


-debug variable-locations This option produces additional debug information for scalar local variables using a feature of the DWARF object module format known as "location lists." The runtime locations of local scalar variables are specified more accurately using this feature, i.e., whether at a given position in the code a variable value is found in memory or in a machine register.

-debug extended This option turns on the following -debug options:

• -debug semantic-stepping • -debug variable-locations

It also specifies that column numbers should appear in the line information.

-debug parallel (Linux* OS, IA-32 and Intel® 64 architectures only) This option determines whether the compiler generates parallel debug code instrumentation useful for thread data sharing and reentrant call detection. It must be used in conjunction with the -g option.

NOTE. When the compiler needs to choose between optimization and quality of debug information, optimization is given priority.


Creating and Using Libraries

Overview: Using Libraries

The Intel® C++ Compiler uses the GNU* C Library and the Standard C++ Library. These libraries are documented at the following URLs: • GNU C Library -- http://www.gnu.org/software/libc/manual/ • Standard C++ Library -- http://gcc.gnu.org/onlinedocs/libstdc/documentation.html

Supplied Libraries

Libraries are an indexed collection of object files that are included as needed in a linked program. Combining object files into a library makes it easy to distribute your code without disclosing the source. It also reduces the number of command-line entries needed to compile your project. The Intel® compiler for VxWorks* provides static libraries only. The file credist.txt in install-dir/Documentation/language lists the Intel compiler libraries that are redistributable. The table below shows the libraries provided for the compiler.

Library Description

libimf.a Intel math library. Many routines in this library are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

libdecimal.a Intel IEEE* 754-2008 decimal floating-point math library, static.

libsvml.a Short vector math library. Many routines in this library are more highly optimized for Intel microprocessors than for non-Intel microprocessors.


libirc.a, libirc_s.a Intel support libraries for CPU dispatch, intel_fast_*, and traceback support routines; libirc_s.a is the static thread-safe version. Many routines in this library are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

libirc_fp.a (IA-32 architecture only) Intel support libraries for CPU dispatch, intel_fast_*, and traceback support routines. This is the static thread-safe version built with the -fno-omit-frame-pointer option to aid in debugging. Many routines in this library are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

When you invoke the -cxxlib option, libcprts is replaced with libstdc++ from the gcc* distribution (3.2 or newer).

CAUTION. The Linux* OS and VxWorks* OS system libraries and the compiler libraries are not built with the -align option. Therefore, if you compile with the -align option and make a call to a compiler distributed or system library, and have long long, double, or long double types in your interface, you will get the wrong answer due to the difference in alignment. Any code built with -align cannot make calls to libraries that use these types in their interfaces unless they are built with -align (in which case they will not work without -align).

Math Libraries

The Intel math library, libimf.a, contains optimized versions of math functions found in the standard C run-time library. The functions in libimf.a are optimized for program execution speed on Intel processors. The Intel math library is linked by default.


Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

See Also • Managing Libraries • Overview: Intel® Math Library

Managing Libraries

During compilation, the compiler reads the LIBRARY_PATH environment variable for static libraries it needs to link when building the executable. At runtime, the executable will link against dynamic libraries referenced in the LD_LIBRARY_PATH environment variable.

Modifying LIBRARY_PATH

If you want to add a directory, /libs for example, to the LIBRARY_PATH, you can do either of the following:

• command line: prompt> export LIBRARY_PATH=/libs:$LIBRARY_PATH


• startup file: export LIBRARY_PATH=/libs:$LIBRARY_PATH

To compile file.cpp and link it with the library mylib.a, enter the following command: icpc file.cpp mylib.a

The compiler passes file names to the linker in the following order:

1. the object file
2. any objects or libraries specified on the command line, in a response file, or in a configuration file
3. the Intel® Math Library, libimf.a

Creating Libraries

Libraries are simply an indexed collection of object files that are included as needed in a linked program. Combining object files into a library makes it easy to distribute your code without disclosing the source. It also reduces the number of command-line entries needed to compile your project.

Static Libraries

Executables generated using static libraries are no different than executables generated from individual source or object files. Static libraries are not required at runtime, so you do not need to include them when you distribute your executable. At compile time, linking to a static library is generally faster than linking to individual source files.

To build a static library on Linux* OS and VxWorks* OS:

1. Use the -c option to generate object files from the source files:
   icpc -c my_source1.cpp my_source2.cpp my_source3.cpp
2. Use the GNU tool ar to create the library file from the object files:
   ar rc my_lib.a my_source1.o my_source2.o my_source3.o
3. Compile and link your project with your new library:
   icpc main.cpp my_lib.a

If your library file and source files are in different directories, use the -Ldir option to indicate where your library is located: icpc -L/cpp/libs main.cpp my_lib.a


If you are using Interprocedural Optimization, see Creating a Library from IPO Objects using xiar.

Shared Libraries

Shared libraries, also referred to as dynamic libraries or Dynamic Shared Objects (DSO), are linked differently than static libraries. At compile time, the linker ensures that all the necessary symbols are either linked into the executable or can be linked at runtime from the shared library. Executables compiled from shared libraries are smaller, but the shared libraries must be included with the executable to function correctly. When multiple programs use the same shared library, only one copy of the library is required in memory.

To build a shared library on Linux* OS and VxWorks* OS:

1. Use the -fPIC and -c options to generate object files from the source files:
   icpc -fPIC -c my_source1.cpp my_source2.cpp my_source3.cpp
2. Use the -shared option to create the library file from the object files:
   icpc -shared -o my_lib.so my_source1.o my_source2.o my_source3.o
3. Compile and link your project with your new library:
   icpc main.cpp my_lib.so

See Also • Using Intel Shared Libraries • Compiling for Non-shared Libraries

Compiling for Non-shared Libraries

Overview: Compiling for Non-shared Libraries

This section includes information on:

• Global Symbols and Visibility Attributes
• Symbol Preemption
• Specifying Symbol Visibility Explicitly
• Other Visibility-related Command-line Options


Global Symbols and Visibility Attributes

A global symbol is one that is visible outside the compilation unit (single source file and its include files) in which it is declared. In C/C++, this means anything declared at file level without the static keyword. For example:

int x = 5;          // global data definition
extern int y;       // global data reference
int five()          // global function definition
{ return 5; }
extern int four();  // global function reference

A complete program consists of a main program file and possibly one or more shareable object (.so) files that contain the definitions for data or functions referenced by the main program. Similarly, shareable objects might reference data or functions defined in other shareable objects. Shareable objects are so called because if more than one simultaneously executing process has the shareable object mapped into its virtual memory, there is only one copy of the read-only portion of the object resident in physical memory. The main program file and any shareable objects that it references are collectively called the components of the program.

Each global symbol definition or reference in a compilation unit has a visibility attribute that controls how (or if) it may be referenced from outside the component in which it is defined. There are five possible values for visibility:

• EXTERNAL – The compiler must treat the symbol as though it is defined in another component. For a definition, this means that the compiler must assume that the symbol will be overridden (preempted) by a definition of the same name in another component. See Symbol Preemption. If a function symbol has external visibility, the compiler knows that it must be called indirectly and can inline the indirect call stub.
• DEFAULT – Other components can reference the symbol. Furthermore, the symbol definition may be overridden (preempted) by a definition of the same name in another component.
• PROTECTED – Other components can reference the symbol, but it cannot be preempted by a definition of the same name in another component.
• HIDDEN – Other components cannot directly reference the symbol. However, its address might be passed to other components indirectly (for example, as an argument to a call to a function in another component, or by having its address stored in a data item referenced by a function in another component).
• INTERNAL – The symbol cannot be referenced outside its defining component, either directly or indirectly.


Static local symbols (in C/C++, declared at file scope or elsewhere with the keyword static) usually have HIDDEN visibility--they cannot be referenced directly by other components (or, for that matter, other compilation units within the same component), but they might be referenced indirectly.

NOTE. Visibility applies to references as well as definitions. A symbol reference's visibility attribute is an assertion that the corresponding definition will have that visibility.

Symbol Preemption

Sometimes you may need to use some of the functions or data items from a shareable object, but may wish to replace others with your own definitions. For example, you may want to use the standard C runtime library shareable object, libc.so, but to use your own definitions of the heap management routines malloc() and free(). In this case it is important that calls to malloc() and free() within libc.so call your definition of the routines and not the definitions present in libc.so. Your definition should override, or preempt, the definition within the shareable object. This feature of shareable objects is called symbol preemption.

When the runtime loader loads a component, all symbols within the component that have default visibility are subject to preemption by symbols of the same name in components that are already loaded. Since the main program image is always loaded first, none of the symbols it defines will be preempted.

The possibility of symbol preemption inhibits many valuable compiler optimizations because symbols with default visibility are not bound to a memory address until runtime. For example, calls to a routine with default visibility cannot be inlined because the routine might be preempted if the compilation unit is linked into a shareable object. A preemptable data symbol cannot be accessed using GP-relative addressing because the name may be bound to a symbol in a different component; the GP-relative address is not known at compile time.

Symbol preemption is a very rarely used feature that has drastic negative consequences for compiler optimization. For this reason, by default the compiler treats all global symbol definitions as non-preemptable (i.e., protected visibility). Global references to symbols defined in other compilation units are assumed by default to be preemptable (i.e., default visibility).
In those rare cases when you need all global definitions, as well as references, to be preemptable, specify the -fpic option to override this default.


Specifying Symbol Visibility Explicitly

You can explicitly set the visibility of an individual symbol using the visibility attribute on a data or function declaration. For example:

int i __attribute__ ((visibility("default")));
void __attribute__ ((visibility("hidden"))) x () {...}
extern void y() __attribute__ ((visibility("protected")));

The visibility declaration attribute accepts one of the following five keywords:

• external
• default
• protected
• hidden
• internal

The value of the visibility declaration attribute overrides the default set by the -fvisibility, -fpic, or -fno-common options. If you have a number of symbols for which you wish to specify the same visibility attribute, you can set the visibility using one of the five command-line options:

• -fvisibility-external=file
• -fvisibility-default=file
• -fvisibility-protected=file
• -fvisibility-hidden=file
• -fvisibility-internal=file

where file is the pathname of a file containing a list of the symbol names whose visibility you wish to set. The symbol names in the file are separated by white space (blanks, TAB characters, or newlines). For example, the command-line option:

-fvisibility-protected=prot.txt

where file prot.txt contains:

a b c d e

sets protected visibility for symbols a, b, c, d, and e. This has the same effect as


__attribute__ ((visibility("protected")))

on the declaration for each of the symbols. Note that these two ways to explicitly set visibility are mutually exclusive: you may use __attribute__((visibility())) on the declaration, or specify the symbol name in a file, but not both. You can set the default visibility for symbols using one of the command-line options:

• -fvisibility=external
• -fvisibility=default
• -fvisibility=protected
• -fvisibility=hidden
• -fvisibility=internal

This option sets the visibility for symbols not specified in a visibility list file and that do not have __attribute__((visibility())) in their declaration. For example, the command-line options:

-fvisibility=protected -fvisibility-default=prot.txt

where file prot.txt is as previously described, will cause all global symbols except a, b, c, d, and e to have protected visibility. Those five symbols, however, will have default visibility and thus be preemptable.

Other Visibility-related Command-line Options

-fminshared

The -fminshared option specifies that the compilation unit will be part of a main program component and will not be linked as part of a shareable object. Since symbols defined in the main program cannot be preempted, this allows the compiler to treat symbols declared with default visibility as though they have protected visibility (i.e., -fminshared implies -fvisibility=protected). Also, the compiler need not generate position-independent code for the main program. It can use absolute addressing, which may reduce the size of the global offset table (GOT) and may reduce memory traffic.

-fpic

The -fpic option specifies full symbol preemption. Global symbol definitions as well as global symbol references get default (i.e., preemptable) visibility unless explicitly specified otherwise.


-fno-common

Normally a C/C++ file-scope declaration with no initializer and without the extern or static keyword

int i;

is represented as a common symbol. Such a symbol is treated as an external reference, except that if no other compilation unit has a global definition for the name, the linker allocates memory for it. The -fno-common option causes the compiler to treat what otherwise would be common symbols as global definitions and to allocate memory for the symbol at compile time. This may permit the compiler to use the more efficient GP-relative addressing mode when accessing the symbol.

See Also
• -fminshared
• -fpic
• -fno-common

gcc* Compatibility

gcc Compatibility

C language object files created with the Intel® C++ Compiler are binary compatible with the GNU gcc* compiler and glibc*, the GNU C language library. You can use the Intel compiler or the gcc compiler to pass object files to the linker. However, to correctly pass the Intel libraries to the linker, use the Intel compiler. The Intel C++ Compiler supports many of the language extensions provided by the GNU compilers.

gcc Extensions to the C Language

GNU C includes several non-standard features not found in ISO standard C. This version of the Intel C++ Compiler supports most of these extensions, as listed in the following table. See http://www.gnu.org for more information.

gcc Language Extension Intel Support

Statements and Declarations in Expressions Yes

Locally Declared Labels Yes

Labels as Values Yes

Nested Functions No

Constructing Function Calls No

Naming an Expression's Type Yes

Referring to a Type with typeof Yes

Generalized Lvalues Yes

Conditionals with Omitted Operands Yes

Double-Word Integers Yes



Complex Numbers Yes

Hex Floats Yes

Arrays of Length Zero Yes

Arrays of Variable Length Yes

Macros with a Variable Number of Arguments Yes

Slightly Looser Rules for Escaped Newlines No

String Literals with Embedded Newlines Yes

Non-Lvalue Arrays May Have Subscripts Yes

Arithmetic on void-Pointers Yes

Arithmetic on Function-Pointers Yes

Non-Constant Initializers Yes

Compound Literals Yes

Designated Initializers Yes

Cast to a Union Type Yes

Case Ranges Yes

Mixed Declarations and Code Yes

Declaring Attributes of Functions Yes

Attribute Syntax Yes

Prototypes and Old-Style Function Definitions No

C++ Style Comments Yes


Dollar Signs in Identifier Names Yes

ESC Character in Constants Yes

Specifying Attributes of Variables Yes

Specifying Attributes of Types Yes

Inquiring on Alignment of Types or Variables Yes

Inline Function is As Fast As a Macro Yes

Assembler Instructions with C Expression Operands Yes

Controlling Names Used in Assembler Code Yes

Variables in Specified Registers Yes

Alternate Keywords Yes

Incomplete enum Types Yes

Function Names as Strings Yes

Getting the Return or Frame Address of a Function Yes

Using Vector Instructions Through Built-in Functions No

Other built-in functions provided by GCC Yes

Built-in Functions Specific to Particular Target Machines No

Pragmas Accepted by GCC Yes

Unnamed struct/union fields within structs/unions Yes

Decimal floating types Yes


g++* Extensions to the C++ Language

GNU C++ includes several non-standard features not found in ISO standard C++. This version of the Intel C++ Compiler supports many of these extensions, as listed in the following table. See http://www.gnu.org for more information.

g++ Language Extension Intel Support

Minimum and Maximum operators in C++ Yes

When is a Volatile Object Accessed? No

Restricting Pointer Aliasing Yes

Vague Linkage Yes

Declarations and Definitions in One Header No

Where's the Template? extern template supported

Extracting the function pointer from a bound pointer to member function Yes

C++-Specific Variable, Function, and Type Attributes Yes

Java Exceptions No

Deprecated Features No

Backwards Compatibility No


NOTE. Statement expressions are supported, except that the following are prohibited inside them:

• dynamically initialized local static variables
• local non-POD class definitions
• try/catch
• variable-length arrays

Branching out of a statement expression and statement expressions in constructor initializers are also not allowed.

NOTE. The Intel C++ Compiler supports gcc-style inline ASM if the assembler code uses AT&T* System V/386 syntax.

gcc* Interoperability

gcc Interoperability

C++ compilers are interoperable if they can link object files and libraries generated by one compiler with object files and libraries generated by another compiler, and the resulting executable runs successfully. The Intel® C++ Compiler is highly compatible with the gcc and g++ compilers. This section describes features of the Intel C++ Compiler that provide interoperability with gcc and g++.

See Also
• gcc Compatibility
• Compiler Options for Interoperability
• Predefined Macros for Interoperability

Compiler Options for Interoperability

The Intel® C++ Compiler options that affect gcc* interoperability include:

• -gcc-name=dir
• -gcc-version=nnn
• -gxx-name=dir


• -cxxlib
• -fabi-version=n
• -no-gcc

-gcc-name option

The -gcc-name=dir option, used with -cxxlib, lets you specify the full-path location of gcc if the compiler cannot locate the gcc C++ libraries. Use this option when referencing a non-standard gcc installation.

-gcc-version option

The -gcc-version=nnn option provides compatible behavior with gcc, where nnn indicates the gcc version. The -gcc-version option is ON by default, and the value of nnn depends on the version of gcc installed on your system. This option selects the version of gcc with which you achieve ABI interoperability.

Installed Version of gcc    Default Value of -gcc-version

older than 3.2              not set

3.2                         320

3.3                         330

3.4                         340

4.0                         400

4.1                         410

4.2                         420

4.3                         430


-gxx-name option

The -gxx-name=dir option specifies that the g++ compiler should be used to set up the environment for C++ compilations.

-cxxlib option

The -cxxlib[=dir] option (ON by default) builds your applications using the C++ libraries and header files included with the gcc compiler. They include:

• libstdc++ standard C++ header files
• libstdc++ standard C++ library
• libgcc C++ language support

Use the optional argument, =dir, to specify the top-level location for the gcc binaries and libraries.

NOTE. The Intel C++ Compiler is compatible with gcc 3.2, 3.3, 3.4, 4.0, 4.1 and 4.2.

When you compile and link your application, the resulting C++ object files and libraries can interoperate with C++ object files and libraries generated by gcc 3.2 or higher. This means that third-party C++ libraries built with gcc 3.2 will work with C++ code generated by the Intel Compiler.

NOTE. gcc 3.2, 3.3, and 3.4 are not interoperable with each other. gcc 4.0, 4.1, and 4.2 are interoperable with each other. By default, the Intel compiler will generate code that is interoperable with the version of gcc it finds on your system.

When the system includes a version of gcc older than 3.2, the Intel C++ Compiler by default uses the headers and libraries included with the product. If you build one shared library against the Intel C++ libraries, build a second shared library against the GNU C++ libraries, and use both libraries in a single application, you will have two C++ run-time libraries in use. Since the application might use symbols from both libraries, the following problems may occur:

• partially initialized libraries
• lost I/O operations from data put in unaccessed buffers


• other unpredictable results, such as jumbled output

The Intel C++ Compiler does not support more than one run-time library in one application.

CAUTION. If you successfully compile your application using more than one run-time library, the resulting program will likely be very unstable, especially when new code is linked against the shared libraries.

-fabi-version

The -fabi-version=n option directs the compiler to select a specific ABI implementation. By default, the Intel compiler uses the ABI implementation that corresponds to the installed version of gcc. Note that neither gcc 3.2 nor gcc 3.3 is fully ABI-compliant.

Value of n Description

n=0 Select most recent ABI implementation

n=1 Select g++ 3.2 compatible ABI implementation

n=2 Select most conformant ABI implementation

See Also
• Specifying Alternate Tools and Paths
• -gcc-name
• -gcc-version
• -cxxlib
• -fabi-version
• -no-gcc

For more information on ABI conformance, see http://www.codesourcery.com/

Predefined Macros for Interoperability

The Intel® C++ Compiler and gcc*/g++* support the following predefined macros:

• __GNUC__
• __GNUG__
• __GNUC_MINOR__


• __GNUC_PATCHLEVEL__

You can specify the -no-gcc option to undefine these macros. If you need gcc interoperability (-cxxlib), do not use the -no-gcc compiler option.

CAUTION. Not defining these macros results in different paths through system header files. These alternate paths may be poorly tested or otherwise incompatible.

See Also
• Predefined Macros
• GNU Environment Variables

gcc Built-in Functions

This version of the Intel® C++ compiler supports the following gcc* built-in functions:

__builtin_abs
__builtin_labs
__builtin_cos
__builtin_cosf
__builtin_expect
__builtin_fabs
__builtin_fabsf
__builtin_memcmp
__builtin_memcpy
__builtin_sin
__builtin_sinf
__builtin_sqrt
__builtin_sqrtf
__builtin_strcmp
__builtin_strlen
__builtin_strncmp
__builtin_abort
__builtin_prefetch
__builtin_constant_p
__builtin_printf
__builtin_fprintf
__builtin_fscanf
__builtin_scanf
__builtin_fputs
__builtin_memset


__builtin_strcat
__builtin_strcpy
__builtin_strncpy
__builtin_exit
__builtin_strchr
__builtin_strspn
__builtin_strcspn
__builtin_strstr
__builtin_strpbrk
__builtin_strrchr
__builtin_strncat
__builtin_alloca
__builtin_ffs
__builtin_index
__builtin_rindex
__builtin_bcmp
__builtin_bzero
__builtin_sinl
__builtin_cosl
__builtin_sqrtl
__builtin_fabsl
__builtin_frame_address (IA-32 architecture only)
__builtin_return_address (IA-32 architecture only)

See Also • fbuiltin, Oi

Thread-local Storage

The Intel® C++ Compiler supports the storage class keyword __thread, which can be used in variable definitions and declarations. Variables defined and declared this way are automatically allocated locally to each thread:

__thread int i;
__thread struct state s;
extern __thread char *p;

NOTE. The __thread keyword is only recognized when the GNU compatibility version is 3.3 or higher. You may need to specify the -gcc-version=330 compiler option to enable thread-local storage.

Language Conformance

Conformance to the C Standard

The Intel® C++ Compiler provides conformance to the ANSI/ISO standard for C language compilation (ISO/IEC 9899:1990). This standard requires that conforming C compilers accept minimum translation limits. This compiler exceeds all of the ANSI/ISO requirements for minimum translation limits.

C99 Support

The following C99 features are supported in this version of the Intel C++ Compiler:

• restricted pointers (restrict keyword)
• variable-length arrays
• flexible array members
• complex number support (_Complex keyword)
• hexadecimal floating-point constants
• compound literals
• designated initializers
• mixed declarations and code
• macros with a variable number of arguments
• inline functions (inline keyword)
• boolean type (_Bool keyword)

Conformance to the C++ Standard

The Intel® C++ Compiler conforms to the ANSI/ISO standard (ISO/IEC 14882:1998) for the C++ language.

Exported Templates

The Intel® C++ Compiler supports exported templates using the following options:


Option Description

-export Enable recognition of exported templates. Supported in C++ mode only.

-export-dir dir Specifies a directory name to be placed on the exported template search path. The directories are used to find the definitions of exported templates and are searched in the order in which they are specified on the command line. The current directory is always the first entry on the search path.

NOTE. Exported templates are not supported with gcc 3.3 or earlier. You need gcc 3.4 or newer to use this feature.


Exported templates are templates declared with the export keyword. Exporting a class template is equivalent to exporting each of its static data members and each of its non-inline member functions. An exported template is unique because its definition does not need to be present in a translation unit that uses that template. For example, the following C++ program consists of two separate translation units:

// file1.cpp
#include <stdio.h>
static void trace() { printf("File 1\n"); }
export template<class T> T const& min(T const&, T const&);
int main() { trace(); return min(2, 3); }

// file2.cpp
#include <stdio.h>
static void trace() { printf("File 2\n"); }
export template<class T> T const& min(T const &a, T const &b)
{ trace(); return a < b ? a : b; }

Note that these two files are separate translation units: one is not included in the other. That allows the two functions trace() to coexist (with internal linkage).

Usage

icpc -export -export-dir /usr2/export/ -c file1.cpp
icpc -export -export-dir /usr2/export/ -c file2.cpp
icpc -export -export-dir /usr2/export/ file1.o file2.o

See Also
• -export
• -export-dir


Template Instantiation

The Intel® C++ Compiler supports extern template, which lets you specify that a template in a specific translation unit will not be instantiated because it will be instantiated in a different translation unit or different library. The compiler now includes additional support for:

• inline template – instantiates the compiler support data for the class (i.e., the vtable) without instantiating its members.
• static template – instantiates the static data members of the template, but not the virtual tables or member functions.

You can now use the following options to gain more control over the point of template instantiation:

Option Description

-fno-implicit-templates Never emit code for non-inline templates that are instantiated implicitly (i.e., by use); only emit code for explicit instantiations.

-fno-implicit-inline-templates Do not emit code for implicit instantiations of inline templates either. The default is to handle inlines differently, so that compilations with and without optimization will need the same set of explicit instantiations.

See Also
• -fno-implicit-templates
• -fno-implicit-inline-templates

Porting Applications

Overview: Porting Applications

This section describes a basic approach to porting applications from GCC's C/C++ compilers to the Intel® C/C++ compilers. These compilers correspond to each other in the following ways:

Language Intel Compiler GCC* Compiler

C icc gcc

C++ icpc g++

It also contains information on how to use the -diag-enable port-win option to issue warnings about general syntactical problems when porting from GNU gcc* to Microsoft Visual C++*.

NOTE. To simplify this discussion on porting applications, the term "gcc" , unless otherwise indicated, refers to both gcc and g++ compilers from the GNU Compiler Collection*.

Advantages to Using the Intel Compiler

In many cases, porting applications from gcc to the Intel compiler can be as easy as modifying your makefile to invoke the Intel compiler (icc) instead of gcc. Using the Intel compiler typically improves the performance of your application, especially for applications that run on Intel processors. In many cases, your application's performance may also show improvement when running on non-Intel processors. When you compile your application with the Intel compiler, you have access to:

• compiler options that optimize your code for the latest Intel processors
• advanced profiling tools (PGO) similar to gprof
• high-level optimizations (HLO)
• interprocedural optimization (IPO)
• Intel intrinsic functions that the compiler uses to inline instructions, including SSE, SSE2, SSE3, SSSE3, and SSE4
• highly optimized Intel Math Library for improved accuracy


Since the Intel compiler is compatible and interoperable with gcc, porting your gcc application to the Intel compiler includes the benefits of binary compatibility. As a result, you should not have to re-build libraries from your gcc applications. The Intel compiler also supports many of the same compiler options, macros, and environment variables you already use in your gcc work.

Porting Strategy

For many gcc applications, porting to the Intel compiler requires little more than modifying your makefile to account for differences that may exist between compiling with gcc and compiling with icc. One challenge in porting applications from one compiler to another is making sure there is support for the compiler options you use to build your application. The Compiler Options reference lists compiler options that are supported by both the Intel® C++ Compiler and gcc.

Next Steps

• Modifying Your Makefile
• Other Considerations

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Modifying Your makefile

If you use makefiles to build your gcc application, you need to change the value for the CC compiler variable to use the Intel compiler. You may also want to review the options specified by CFLAGS. A simple example follows:

gcc makefile

# Use gcc compiler
CC = gcc

# Compile-time flags
CFLAGS = -O2 -std=c99

all: area_app

area_app: area_main.o area_functions.o
	$(CC) area_main.o area_functions.o -o area

area_main.o: area_main.c
	$(CC) -c $(CFLAGS) area_main.c

area_functions.o: area_functions.c
	$(CC) -c -fno-asm $(CFLAGS) area_functions.c

clean:
	rm -rf *o area


Modified makefile for Intel Compiler

# Use Intel C compiler
CC = icc

# Compile-time flags
CFLAGS = -std=c99

all: area-app

area-app: area_main.o area_functions.o
	$(CC) area_main.o area_functions.o -o area

area_main.o: area_main.c
	$(CC) -c $(CFLAGS) area_main.c

area_functions.o: area_functions.c
	gcc -c -O2 -fno-asm $(CFLAGS) area_functions.c

clean:
	rm -rf *o area

If your gcc code includes features that are not supported with the Intel compiler, such as compiler options, language extensions, macros, pragmas, etc., you can compile those sources separately with gcc if necessary. In the above makefile, area_functions.c is an example of a source file that includes features unique to gcc. Since the Intel compiler uses the -O2 compiler option by default and gcc's default is -O0, we instruct gcc to compile with -O2. We also include the -fno-asm switch from the original makefile, since this switch is not supported with the Intel compiler. With the modified makefile, the output of make is:

icc -c -std=c99 area_main.c
gcc -c -O2 -fno-asm -std=c99 area_functions.c
icc area_main.o area_functions.o -o area

See Also
• Equivalent Macros
• Equivalent Environment Variables
• Compiler Options for Interoperability
• Support for gcc and g++ Language Extensions
• Predefined Macros


• gcc Builtin Functions

Equivalent Macros

Macro support is an important aspect in porting applications from gcc to the Intel compiler. The following table lists the most common macros used in both compilers.

__CHAR_BIT__

__DATE__

__DBL_DENORM_MIN__

__DBL_DIG__

__DBL_EPSILON__

__DBL_HAS_INFINITY__

__DBL_HAS_QUIET_NAN__

__DBL_MANT_DIG__

__DBL_MAX__

__DBL_MAX_10_EXP__

__DBL_MAX_EXP__

__DBL_MIN__

__DBL_MIN_10_EXP__

__DBL_MIN_EXP__

__DECIMAL_DIG__

__ELF__

__FINITE_MATH_ONLY__

__FLT_DENORM_MIN__


__FLT_DIG__

__FLT_EPSILON__

__FLT_EVAL_METHOD__

__FLT_HAS_INFINITY__

__FLT_HAS_QUIET_NAN__

__FLT_MANT_DIG__

__FLT_MAX__

__FLT_MAX_10_EXP__

__FLT_MAX_EXP__

__FLT_MIN__

__FLT_MIN_10_EXP__

__FLT_MIN_EXP__

__FLT_RADIX__

__gnu_linux__

__GNUC__

__GNUG__

__GNUC_MINOR__

__GNUC_PATCHLEVEL__

__GXX_ABI_VERSION

__i386

__i386__

__INT_MAX__


__LDBL_DENORM_MIN__

__LDBL_DIG__

__LDBL_EPSILON__

__LDBL_HAS_INFINITY__

__LDBL_HAS_QUIET_NAN__

__LDBL_MANT_DIG__

__LDBL_MAX__

__LDBL_MAX_10_EXP__

__LDBL_MAX_EXP__

__LDBL_MIN__

__LDBL_MIN_10_EXP__

__LDBL_MIN_EXP__

__linux

__linux__

__LONG_LONG_MAX__

__LONG_MAX__

__NO_INLINE__

__OPTIMIZE__

__PTRDIFF_TYPE__

__REGISTER_PREFIX__

__SCHAR_MAX__

__SHRT_MAX__


__SIZE_TYPE__

__STDC__

__STDC_HOSTED__

__TIME__

__unix

__unix__

__USER_LABEL_PREFIX__

__VERSION__

__WCHAR_MAX__

__WCHAR_TYPE__

__WINT_TYPE__

i386

linux

unix

See Also
• Predefined Macros

Equivalent Environment Variables

The Intel® C++ Compiler supports many of the same environment variables used with gcc as well as other compiler-specific variables. See Setting Environment Variables for a list of supported variables.

Other Considerations

There are some notable differences between the Intel® C++ Compiler and gcc. Consider the following as you begin compiling your source code with the Intel C++ Compiler.


Setting the Environment

The Intel compilers rely on environment variables for the location of compiler binaries, libraries, man pages, and license files. In some cases these are different from the environment variables that gcc uses. Another difference is that these variables are not set by default after installing the Intel compiler. The following environment variables need to be set prior to running the Intel compiler:

• PATH – add the location of the compiler binaries to PATH
• LD_LIBRARY_PATH (Linux* OS and VxWorks* OS) – sets the location of the compiler libraries as well as the resulting binary generated by the compiler
• MANPATH – add the location of the compiler man pages (icc and icpc) to MANPATH

Using Optimization

The Intel C++ Compiler is an optimizing compiler that begins with the assumption that you want improved performance from your application when it is executed on Intel architecture. Consequently, certain optimizations, such as -O2, are part of the default invocation of the Intel compiler. By default, gcc turns off optimization, the equivalent of compiling with -O0. Other forms of the -O option compare as follows:

-O0
  Intel: Turns off optimization.
  gcc: Default. Turns off optimization.

-O1
  Intel: Decreases code size with some increase in speed.
  gcc: Decreases code size with some increase in speed.

-O2
  Intel: Default. Favors speed optimization with some increase in code size. Same as -O. Intrinsics, loop unrolling, and inlining are performed.
  gcc: Optimizes for speed as long as there is not an increase in code size. Loop unrolling and function inlining, for example, are not performed.

-O3
  Intel: Enables -O2 optimizations plus more aggressive optimizations, such as prefetching, scalar replacement, and loop and memory access transformations.
  gcc: Optimizes for speed while generating larger code size. Includes -O2 optimizations plus loop unrolling and inlining. Similar to -O2 -ip on Intel compiler.


Targeting Intel Processors While many of the same options that target specific processors are supported with both compilers, Intel includes options that utilize processor-specific instruction scheduling to target the latest Intel processors. If you compile your gcc application with the -march or -mtune option, consider using Intel's -x or -ax options for applications that run on IA-32 or Intel® 64 architecture.

Modifying Your Configuration

The Intel compiler lets you maintain configuration and response files that are part of compilation. Options stored in the configuration file apply to every compilation, while options stored in response files apply only where they are added on the command line. If you have several options in your makefile that apply to every build, you may find it easier to move these options to the configuration file (../bin/icc.cfg and ../bin/icpc.cfg).

In a multi-user, networked environment, options listed in the icc.cfg and icpc.cfg files are generally intended for everyone who uses the compiler. If you need a separate configuration, you can use the ICCCFG or ICPCCFG environment variable to specify the name and location of your own .cfg file, such as /my_code/my_config.cfg. Any time you instruct the compiler to use a different configuration file, the system configuration files (icc.cfg and icpc.cfg) are ignored.

Using the Intel Libraries

The Intel C++ Compiler supplies additional libraries that contain optimized implementations of many commonly used functions. Some of these functions are implemented using CPU dispatch. This means that different code may be executed when run on different processors. Supplied libraries include the Intel® Math Library (libimf), the Short Vector Math Library (libsvml), libirc, as well as others. These libraries are linked in by default when the compiler sees that references to them have been generated. Some library functions, such as sin or memset, may not require a call to the library, since the compiler may inline the code for the function.

Intel® Math Library (libimf)

With the Intel C++ Compiler, the Intel® Math Library, libimf, is linked by default when calling math functions that require the library. Some functions, such as sin, may not require a call to the library, since the compiler already knows how to compute the sin function. The Intel Math Library also includes some functions not found in the standard math library.

NOTE. You cannot make calls to the Intel Math Library with gcc.


Many routines in this library are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

Short Vector Math Library (libsvml)

When vectorization is being done, the compiler may translate some calls to the libm math library functions into calls to libsvml functions. These functions implement the same basic operations as the Intel Math Library, but operate on short vectors of operands. This results in greater efficiency. In some cases, the libsvml functions are slightly less precise than the equivalent libimf functions. Many routines in this library are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

libirc

libirc contains optimized implementations of some commonly used string and memory functions. For example, it contains functions that are optimized versions of memcpy and memset. The compiler will automatically generate calls to these functions when it sees calls to memcpy and memset. The compiler may also transform loops that are equivalent to memcpy or memset into calls to these functions. Many routines in this library are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other


optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

See Also • -O • Using Configuration Files • Using Response Files

Porting from GNU gcc* to Microsoft Visual C++*

You can use the -diag-enable port-win option to alert you to general syntactic problems when porting an application from GNU gcc* to Microsoft Visual C++*. The following examples illustrate use of this option.

Example 1:

$ cat -n attribute.c
     1  int main()
     2  {
     3      int i __attribute__((unused));
     4      return 0;
     5  }
$ icc -c -diag-enable port-win attribute.c
attribute.c(3): warning #2133: attributes are a GNU extension
    int i __attribute__((unused));


Example 2:

$ cat -n state_expr.c
     1  int main()
     2  {
     3      int i = ({ int j; j = 1; j--; j;});
     4      return i;
     5  }
$ icc -c -diag-enable port-win state_expr.c
state_expr.c(3): warning #2132: statement expressions are a GNU extension
    int i = ({ int j; j = 1; j--; j;});


Error Handling

Remarks, Warnings, and Errors

This topic describes compiler remarks, warnings, and errors. The Intel® C++ Compiler displays these messages, along with the erroneous source line, on the standard output.

Remarks

Remark messages report common but sometimes unconventional use of C or C++. The compiler does not print or display remarks unless you specify the -w2 option. Remarks do not stop translation or linking. Remarks do not interfere with any output files. The following are some representative remark messages: • function declared implicitly • type qualifiers are meaningless in this declaration • controlling expression is constant

Warnings

Warning messages report legal but questionable use of C or C++. The compiler displays warnings by default. You can suppress all warning messages with the -w compiler option. Warnings do not stop translation or linking. Warnings do not interfere with any output files. The following are some representative warning messages: • declaration does not declare anything • pointless comparison of unsigned integer with zero • possible use of = where == was intended

Additional Warnings

This version of the Intel C++ Compiler includes support for the following options:

Option                       Result

-W[no-]missing-prototypes    Warn for missing prototypes

-W[no-]missing-declarations  Warn for missing declarations

-W[no-]unused-variable       Warn for unused variable

-W[no-]pointer-arith         Warn for questionable pointer arithmetic

-W[no-]uninitialized         Warn if a variable is used before being initialized

-W[no-]deprecated            Display warnings related to deprecated features

-W[no-]abi                   Warn if generated code is not C++ ABI compliant

-W[no-]unused-function       Warn if a declared function is not used

-W[no-]unknown-pragmas       Warn if an unknown #pragma directive is used

-W[no-]main                  Warn if the return type of main is not as expected

-W[no-]comment[s]            Warn when /* appears in the middle of a /* */ comment

-W[no-]return-type           Warn when a function uses the default int return type; warn when a return statement is used in a void function

Errors

These messages report syntactic or semantic misuse of C or C++. The compiler always displays error messages. Errors suppress object code for the module containing the error and prevent linking, but they allow parsing to continue to detect other possible errors. Some representative error messages are: • missing closing quote • expression must have arithmetic type • expected a ";"


Option Summary

Use the following compiler options to control remarks, warnings, and errors:

Option Result

-w Suppress all warnings

-w0 Display only errors

-w2,-Wall Display only errors and warnings

-Wremarks Display remarks and comments

-Wbrief Display brief one-line diagnostics

-Wcheck Enable more strict diagnostics

-Werror-all Change all warnings and remarks to errors

-Werror Change all warnings to errors

You can also control the display of diagnostic information with variations of the -diag compiler option. This compiler option accepts numerous arguments and values, allowing you wide control over displayed diagnostic messages and reports. Some of the most common variations include the following:

Option Result

-diag-enable list Enables a diagnostic message or a group of messages

-diag-disable list Disables a diagnostic message or a group of messages

-diag-warning list Tells the compiler to change diagnostics to warnings

-diag-error list Tells the compiler to change diagnostics to errors

-diag-remark list Tells the compiler to change diagnostics to remarks (comments)

The list items can be specific diagnostic IDs; one of the keywords warn, remark, or error; or a keyword specifying a certain group (par, vec, sc, driver, thread, port-win). For more information, see -diag.


Other diagnostic-related options include the following:

Option Result

-diag-dump Tells the compiler to print all enabled diagnostic messages and stop compilation

-diag-file[=file] Causes the results of diagnostic analysis to be output to a file

-diag-file-append[=file] Causes the results of diagnostic analysis to be appended to a file

-diag-error-limit n Specifies the maximum number of errors allowed before compilation stops

Reference

C/C++ Calling Conventions

There are a number of calling conventions that set the rules for how arguments are passed to a function and how values are returned from it.

Calling Conventions on Windows* OS The following table summarizes the supported calling conventions on Windows* OS:

Calling Convention Compiler option Description

__cdecl /Gd Default calling convention for C/C++ programs. Can be specified on a function with variable arguments.

__thiscall none Default calling convention used by C++ member functions that do not use variable arguments.

__clrcall none Calling convention that specifies that a function can only be called from managed code.

__stdcall /Gz Standard calling convention used for Win32 API functions.

__fastcall /Gr Fast calling convention that specifies that arguments are passed in registers rather than on the stack.

__regcall /Qregcall Intel Compiler calling convention that specifies that as many arguments as possible are passed in registers; likewise, __regcall uses registers whenever possible to return values. This calling convention is ignored if specified on a function with variable arguments. Specifying /Qregcall makes __regcall the default calling convention for functions in the compilation, unless another calling convention is specified on a declaration.

Calling Conventions on Linux* OS, VxWorks* OS The following table summarizes the supported calling conventions on Linux* OS, VxWorks* OS:

Calling Convention Compiler Option Description

__attribute((cdecl)) none Default calling convention for C/C++ programs. Can be specified on a function with variable arguments.

__attribute((stdcall)) none Calling convention that specifies the arguments are passed on the stack. Cannot be specified on a function with variable arguments.

__attribute((regparm(number))) none On systems based on IA-32 architecture (the Intel 386), the regparm attribute causes the compiler to pass up to number arguments in registers EAX, EDX, and ECX instead of on the stack. Functions that take a variable number of arguments will continue to pass all of their arguments on the stack.



__regcall __attribute__((regcall)) -regcall Intel Compiler calling convention that specifies that as many arguments as possible are passed in registers; likewise, __regcall uses registers whenever possible to return values. This calling convention is ignored if specified on a function with variable arguments. Specifying -regcall makes __regcall the default calling convention for functions in the compilation, unless another calling convention is specified on a declaration.

The __regcall Calling Convention

The __regcall calling convention is unique to the Intel compiler and requires some additional explanation. To use __regcall, place the keyword before a function declaration. For example:

__regcall int foo (int i, int j);
__attribute__((regcall)) int foo (int i, int j);

Available Registers

The number of registers that can be used for passing parameters or returning a value is normally all registers available on the platform, except those reserved by the compiler. The following table lists the registers available in each register class, depending on the default ABI for the compilation. The registers are used in the order shown in the table.

Register class/Architecture IA-32 Intel® 64

GPR (see Note 1) EAX, ECX, EDX, EDI, ESI RAX, RCX, RDX, RDI, RSI, R8 - R15

FP ST0 ST0

MMX None None



XMM XMM0 - XMM7 XMM0 - XMM15

YMM YMM0 - YMM7 YMM0 - YMM15

Note 1: On Windows systems based on Intel® 64 architecture, R13 is not used for passing parameters or returning values and should be treated as reserved.

Data Type Classification

Parameters and return values are classified by data type and are passed in the registers of the classes shown in the following table.

Type (for both unsigned and signed types)    IA-32    Intel® 64

bool, char, int, enum, _Decimal32, long, pointer    GPR    GPR

short, __mmask GPR GPR

long long, __int64    See Note 3; also see Structured Data Type Classification Rules    GPR

_Decimal64 XMM GPR

long double FP FP

float, double, __float128, _Decimal128    XMM    XMM

__m128, __m128i, __m128d XMM XMM

__m256, __m256i, __m256d YMM YMM

__m512 n/a n/a

complex, struct, union    See Structured Data Type Classification Rules    See Structured Data Type Classification Rules


Note 2: For the purposes of structured types, the classification of the GPR class is used. Note 3: On systems based on IA-32 architecture, these 64-bit integer types (long long, __int64) are classified to the GPR class and are passed in two registers, as if they were implemented as a structure of two 32-bit integer fields. Types that are smaller in size than the registers of their associated class are passed in the lower part of those registers; for example, float is passed in the lower 4 bytes of an XMM register.

Structured Data Type Classification Rules

Structures/unions and complex types are classified similarly to what is described in the x86_64 ABI, with the following exceptions:

• There is no limitation on the overall size of a structure. • The register classes for basic types are given in Data Type Classification. • For systems based on the IA-32 architecture, classification is performed on four-byte chunks. For systems based on other architectures, classification is performed on eight-byte chunks.

Placement in Registers or on the Stack

After the classification described in Data Type Classification and Structured Data Type Classification Rules, parameters and return values are either put into registers specified in Available Registers or placed in memory, according to the following:

• Each chunk (eight bytes on systems based on Intel® 64 architecture, or four bytes on systems based on IA-32 architecture) of a value of a data type is assigned a register class. If enough registers from Available Registers are available, the whole value is passed in registers; otherwise, the value is passed using the stack.
• If the classification uses one or more register classes, then the registers of these classes from the table in Available Registers are used, in the order given there.
• If no more registers are available in one of the required register classes, then the whole value is put on the stack.

Registers that Preserve Their Values

The following registers preserve their values across a call, as long as they were not used for passing a parameter or returning a value:


Register class/ABI IA-32 Intel® 64

GPR ESI, EDI, EBX, EBP, ESP R10 - R15, RBX, RBP, RSP

FP None None

MMX None None

XMM XMM4 - XMM7 XMM8 - XMM15

YMM XMM4 - XMM7 XMM8 - XMM15

All other registers do not preserve their values across the call.

Decoration

Function names used with __regcall are decorated. Specifically, they are prepended with __regcall__ before any further traditional mangling occurs. For example, the function foo would be decorated as follows: __regcall__foo. This helps avoid improper linking with a name following a different calling convention, while allowing the full range of manipulations to be done with foo (such as setting a breakpoint in the debugger).

ANSI Standard Predefined Macros

The ANSI/ISO standard for the C language requires that certain predefined macros be supplied with conforming compilers. The following table lists the macros that the Intel® C++ Compiler supplies in accordance with this standard.

The compiler includes predefined macros in addition to those required by the standard. The default predefined macros differ among the Windows*, Linux*, and VxWorks* operating systems due to the default /Za compiler option on Windows. Differences also exist on VxWorks OS as a result of the -std compiler option.

Macro Value

__DATE__ The date of compilation as a string literal in the form Mmm dd yyyy.

__FILE__ A string literal representing the name of the file being compiled.

__LINE__ The current line number as a decimal constant.

__STDC__ The name __STDC__ is defined when compiling a C translation unit.



__STDC_HOSTED__ 1

__TIME__ The time of compilation as a string literal in the form hh:mm:ss.

See Also • Additional Predefined Macros

Additional Predefined Macros

The Intel® C++ Compiler includes a number of predefined macros. The compiler also includes predefined macros specified by the ISO/ANSI standard.

Predefined Macros on Systems based on IA-32 or Intel® 64 Architectures

The following table lists the predefined macros on systems based on either the IA-32 or Intel® 64 architecture. Unless otherwise stated, the macros are supported on systems based on IA-32 architecture and also on systems based on Intel® 64 architecture.

Macro (IA-32 and Intel® 64 architecture)    Value

__ARRAY_OPERATORS 1

__BASE_FILE__ Name of source file

_BOOL 1

__cplusplus 1 (with C++ compiler)

__DEPRECATED 1

__EDG__ 1

__EDG_VERSION__ EDG version

__ELF__ 1



__extension__

__EXCEPTIONS Defined as 1 when -fno-exceptions is not used.

__GNUC__ The major version number of gcc installed on the system.

__GNUG__ The major version number of g++ installed on the system.

__gnu_linux__ 1

__GNUC_MINOR__ The minor version number of gcc or g++ installed on the system.

__GNUC_PATCHLEVEL__ The patch level version number of gcc or g++ installed on the system.

__GXX_ABI_VERSION 102

__HONOR_STD 1

__i386 1 Available only on systems based on IA-32 architecture.

__i386__ 1 Available only on systems based on IA-32 architecture.

i386 1 Available only on systems based on IA-32 architecture.

__ICC Intel compiler version

__INTEL_COMPILER Intel compiler version



__INTEL_COMPILER_BUILD_DATE YYYYMMDD

__INTEL_RTTI__ Defined as 1 when -fno-rtti is not specified.

__INTEL_STRICT_ANSI__ Defined as 1 when -strict-ansi is specified.

__linux 1

__linux__ 1

linux 1

__LONG_DOUBLE_SIZE__ 80

__LONG_MAX__ 9223372036854775807L Available only on systems based on Intel® 64 architecture.

__LP64__ 1 Available only on systems based on Intel® 64 architecture.

_LP64 1 Available only on systems based on Intel® 64 architecture.

_MT 1 Available only on systems based on Intel® 64 architecture.

__MMX__ 1 Available only on systems based on Intel® 64 architecture.

__NO_INLINE__ 1



__NO_MATH_INLINES 1

__NO_STRING_INLINES 1

__OPTIMIZE__ 1

__pentium4 1

__pentium4__ 1

__PIC__ Defined as 1 when -fPIC is specified.

__pic__ Defined as 1 when -fPIC is specified.

_PGO_INSTRUMENT Defined as 1 when -prof-gen[x] is specified.

_PLACEMENT_DELETE 1

__PTRDIFF_TYPE__ int on IA-32 architecture; long on Intel® 64 architecture

__REGISTER_PREFIX__

__SIGNED_CHARS__ 1

__SIZE_TYPE__ unsigned on IA-32 architecture; unsigned long on Intel® 64 architecture

__SSE__ Defined as 1 for processors that support SSE instructions.

__SSE2__ Defined as 1 for processors that support SSE2 instructions.

__SSE3__ Defined as 1 for processors that support SSE3 instructions.



__SSSE3__ Defined as 1 for processors that support SSSE3 instructions.

__unix 1

__unix__ 1

unix 1

__USER_LABEL_PREFIX__

__VERSION__ Intel version string

__WCHAR_T 1

__WCHAR_TYPE__ long int on IA-32 architecture; int on Intel® 64 architecture

__WINT_TYPE__ unsigned int

__x86_64 1 Available only on systems based on Intel® 64 architecture.

__x86_64__ 1 Available only on systems based on Intel® 64 architecture.

See Also • -D • -U • ANSI Standard Predefined Macros


Part II: Compiler Options

Topics:

• Overview • Alphabetical Compiler Options • Related Options


Overview

Overview: Compiler Options

This document provides details on all current VxWorks* OS compiler options. It provides the following information:

• Alphabetical Compiler Options This topic is the main source in the documentation set for general information on all compiler options. Options are described in alphabetical order. The Overview describes what information appears in each compiler option description. • Related Options This topic lists related options that can be used under certain conditions.

In this guide, compiler options are available on all supported operating systems and architectures unless otherwise identified. For information on compiler options that are equivalent to gcc options, see Portability Options. For further information on compiler options, see Building Applications and Optimizing Applications.

Functional Groupings of Compiler Options

To see functional groupings of compiler options, specify a functional category for option help on the command line. For example, to see a list of options that affect diagnostic messages displayed by the compiler, enter the following command on VxWorks systems:

-help diagnostics

For details on the categories you can specify, see help.

Alphabetical Compiler Options

Compiler Option Descriptions and General Rules

This section includes descriptions of all the current VxWorks* OS compiler options. The option names are listed in alphabetical order.

Option Descriptions Each option description contains the following information:

• A short description of the option. • Architectures This shows the architectures where the option is valid. Possible architectures are:

• IA-32 architecture • Intel® 64 architecture • IA-64 architecture

• Syntax This shows the syntax on VxWorks system. • Arguments This shows any arguments (parameters) that are related to the option. If the option has no arguments, it will specify "None". • Default This shows the default setting for the option. • Description This shows the full description of the option. It may also include further information on any applicable arguments. • Alternate Options These are options that are synonyms of the described option. If there are no alternate options, it will specify "None". Many options have an older spelling where underscores ("_") instead of hyphens ("-") connect the main option names. The older spelling is a valid alternate option name.


Some option descriptions may also have the following:

• Example This shows a short example that includes the option • See Also This shows where you can get further information on the option or related options.

General Rules for Compiler Options You cannot combine options with a single dash (VxWorks OS). For example:

• On VxWorks systems: This is incorrect: -wc; this is correct: -w -c

All compiler options are case sensitive. Some options have different meanings depending on their case; for example, option "c" prevents linking, but option "C" places comments in preprocessed source output. Options specified on the command line apply to all files named on the command line. Options can take arguments in the form of file names, strings, letters, or numbers. If a string includes spaces, the string must be enclosed in quotation marks. For example:

• On VxWorks systems, -dynamic-linker mylink (file name) or -Umacro3 (string)

Compiler options can appear in any order. Unless you specify certain options, the command line will both compile and link the files you specify. You can abbreviate some option names, entering as many characters as are needed to uniquely identify the option. Certain options accept one or more keyword arguments following the option name. For example, the arch option accepts several keywords. To specify multiple keywords, you typically specify the option multiple times. However, there are exceptions; for example, the following are valid: -axNB (VxWorks OS). Compiler options remain in effect for the whole compilation unless overridden by a compiler #pragma. To disable an option, specify the negative form of the option. If there are enabling and disabling versions of an option on the command line, the last one on the command line takes precedence.


Lists and Functional Groupings of Compiler Options

To see a list of all the compiler options, specify option help on the command line. To see functional groupings of compiler options, specify a functional category for option help. For example, to see a list of options that affect diagnostic messages displayed by the compiler, enter the following command on VxWorks systems:

-help diagnostics

For details on the categories you can specify, see help.

A Options

A Specifies an identifier for an assertion.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Aname[(value)]

Arguments

name Is the identifier for the assertion.

value Is an optional value for the assertion. If a value is specified, it must be within quotes, including the parentheses delimiting it.

Default

OFF Assertions have no identifiers or symbol names.

Description This option specifies an identifier (symbol name) for an assertion. It is equivalent to an #assert preprocessing directive. Note that this option is not the positive form of the C++ /QA- option.


Alternate Options None

Example

To make an assertion for the identifier fruit with the associated values orange and banana, use the following command on VxWorks* systems:

icpc -A"fruit(orange,banana)" prog1.cpp

A- Disables all predefined macros. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -A-

Arguments None

Default

OFF Predefined macros remain enabled.

Description This option disables all predefined macros. It causes all predefined macros and assertions to become inactive. Note that this option is not the negative form of the C++ /QA option.

Alternate Options VxWorks: None

alias-const Determines whether the compiler assumes a parameter of type pointer-to-const does not alias with a parameter of type pointer-to-non-const.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -alias-const -no-alias-const

Arguments None

Default

-no-alias-const or /Qalias-const- The compiler uses standard C/C++ rules for the interpretation of const.

Description This option determines whether the compiler assumes a parameter of type pointer-to-const does not alias with a parameter of type pointer-to-non-const. It implies an additional attribute for const. This functionality complies with the input/output buffer rule, which assumes that input and output buffer arguments do not overlap. This option allows the compiler to do some additional optimizations with those parameters. In C99, you can also get the same result if you additionally declare your pointer parameters with the restrict keyword.

Alternate Options None


align Determines whether variables and arrays are naturally aligned.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -align -noalign

Arguments None

Default

OFF Variables and arrays are aligned according to the gcc model, which means they are aligned to 4-byte boundaries.

Description This option determines whether variables and arrays are naturally aligned. Option -align forces the following natural alignment:

Type Alignment

double 8 bytes

long long 8 bytes

long double 16 bytes

If you are not interacting with system libraries or other libraries that are compiled without -align, this option can improve performance by reducing misaligned accesses.

CAUTION. If you are interacting with system libraries or other libraries that are compiled without -align, your application may not perform as expected.


Alternate Options None

ansi Enables language compatibility with the gcc option -ansi.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ansi

Arguments None

Default

OFF GNU C++ is more strongly supported than ANSI C.

Description This option enables language compatibility with the gcc option -ansi and provides the same level of ANSI standard conformance as that option. This option sets option fmath-errno. If you want strict ANSI conformance, use the -strict-ansi option.

Alternate Options None

ansi-alias Enable use of ANSI aliasing rules in optimizations.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -ansi-alias -no-ansi-alias

Arguments None

Default

-no-ansi-alias or /Qansi-alias- Disable use of ANSI aliasing rules in optimizations.

Description This option tells the compiler to assume that the program adheres to ISO C Standard aliasability rules. If your program adheres to these rules, then this option allows the compiler to optimize more aggressively. If it doesn't adhere to these rules, then it can cause the compiler to generate incorrect code. When you specify -ansi-alias or /Qansi-alias, the ansi-alias checker is enabled by default. To disable the ansi-alias checker, you must specify -no-ansi-alias-check (VxWorks).

Alternate Options VxWorks: -fstrict-aliasing

See Also • ansi-alias-check, Qansi-alias-check

ansi-alias-check Enables or disables the ansi-alias checker.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -ansi-alias-check -no-ansi-alias-check

Arguments None

Default

-no-ansi-alias-check or /Qansi-alias-check- The ansi-alias checker is disabled unless option -ansi-alias-check or /Qansi-alias-check has been specified.

Description This option enables or disables the ansi-alias checker. The ansi-alias checker checks the source code for potential violations of ANSI aliasing rules and disables unsafe optimizations related to the code for those statements that are identified as potential violations. You can use option -Wstrict-aliasing to identify potential violations. If option -ansi-alias (VxWorks) has been specified, the ansi-alias checker is enabled by default. You can use option -no-ansi-alias-check or /Qansi-alias-check- to disable the checker.

Alternate Options None

See Also • ansi-alias, Qansi-alias • Wstrict-aliasing


auto-ilp32 Instructs the compiler to analyze the program to determine if there are 64-bit pointers that can be safely shrunk into 32-bit pointers and if there are 64-bit longs (on VxWorks* OS) that can be safely shrunk into 32-bit longs.

Architectures Intel® 64 architecture

Syntax

VxWorks: -auto-ilp32

Arguments None

Default

OFF The optimization is not attempted.

Description This option instructs the compiler to analyze the program to determine if there are 64-bit pointers that can be safely shrunk into 32-bit pointers and if there are 64-bit longs (on VxWorks* OS) that can be safely shrunk into 32-bit longs. For this option to be effective, the compiler must be able to optimize using the -ipo (VxWorks OS) option and must be able to analyze all library calls or external calls the program makes. This option has no effect on VxWorks* systems unless you specify setting SSE3 or higher for option -x (VxWorks). This option requires that the size of the program executable never exceeds 2^32 bytes and all data values can be represented within 32 bits. If the program can run correctly in a 32-bit system, these requirements are implicitly satisfied. If the program violates these size restrictions, unpredictable behavior may occur.

Alternate Options None


See Also • auto-p32 • ipo, Qipo • x, Qx

auto-p32 Instructs the compiler to analyze the program to determine if there are 64-bit pointers that can be safely shrunk to 32-bit pointers.

Architectures Intel® 64 architecture

Syntax

VxWorks: -auto-p32

Arguments None

Default

OFF The optimization is not performed.

Description This option instructs the compiler to analyze and transform the program so that 64-bit pointers are shrunk to 32-bit pointers, wherever it is legal and safe to do so. For this option to be effective, the compiler must be able to optimize using the -ipo (VxWorks) option and it must be able to analyze all library calls or external calls the program makes. This option has no effect unless you specify setting SSE3 or higher for option -x. The application cannot exceed a 32-bit address space; otherwise, unpredictable results can occur.

Alternate Options None


See Also • auto-ilp32, Qauto-ilp32 • ipo, Qipo • x, Qx

ax Tells the compiler to generate multiple, processor-specific auto-dispatch code paths for Intel processors if there is a performance benefit.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -axcode

Arguments

code Indicates the instructions to be generated for the set of processors in each description. The following descriptions refer to Intel® Streaming SIMD Extensions (Intel® SSE) and Supplemental Streaming SIMD Extensions (Intel® SSSE). Possible values are:

AVX May generate Intel® Advanced Vector Extensions (Intel® AVX), SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors.

SSE4.2 May generate Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel processors.

SSE4.1 May generate Intel® SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel processors. This replaces value S, which is deprecated.


SSSE3 May generate Intel® SSSE3, SSE3, SSE2, and SSE instructions for Intel processors. This replaces value T, which is deprecated.

SSE3 May generate Intel® SSE3, SSE2, and SSE instructions for Intel processors. This replaces value P, which is deprecated.

SSE2 May generate Intel® SSE2 and SSE instructions for Intel processors. This replaces value N, which is deprecated.

Default

OFF No auto-dispatch code is generated. Processor-specific code is generated and is controlled by the setting of compiler options -m and -x (VxWorks).

Description This option tells the compiler to generate multiple, processor-specific auto-dispatch code paths for Intel processors if there is a performance benefit. It also generates a baseline code path. The Intel processor-specific auto-dispatch path is usually more optimized than the baseline path. Other options, such as O3, control how much optimization is performed on the baseline path. The baseline code path is determined by the architecture specified by options -m or -x (VxWorks). While there are defaults for the -x option that depend on the operating system being used, you can specify an architecture and optimization level for the baseline code that is higher or lower than the default. The specified architecture becomes the effective minimum architecture for the baseline code path. If you specify both the -ax and -x options (VxWorks), the baseline code will only execute on Intel® processors compatible with the processor type specified by the -x option. If you specify both the -ax and -m options (VxWorks), the baseline code will execute on non-Intel processors compatible with the processor type specified by the -m option. The -ax option tells the compiler to find opportunities to generate separate versions of functions that take advantage of features of the specified Intel® processor.


If the compiler finds such an opportunity, it first checks whether generating a processor-specific version of a function is likely to result in a performance gain. If this is the case, the compiler generates both a processor-specific version of a function and a baseline version of the function. At run time, one of the versions is chosen to execute, depending on the Intel processor in use. In this way, the program can benefit from performance gains on more advanced Intel processors, while still working properly on older processors and non-Intel processors. A non-Intel processor always executes the baseline code path. You can use more than one of the processor values by combining them. For example, you can specify -axSSE4.1,SSSE3 (VxWorks). You cannot combine the old style, deprecated options and the new options. For example, you cannot specify -axSSE4.1,T (VxWorks). Previous values W and K are deprecated. The details on replacements are as follows:

• VxWorks systems: The replacement for W is -msse2 (VxWorks). There is no exact replacement for K. However, on VxWorks systems, -axK is interpreted as -mia32. You can also do one of the following:

• Upgrade to option -msse2 (VxWorks). This will produce one code path that is specialized for Intel® SSE2. It will not run on earlier processors.
• Specify the two option combination -mia32 -axSSE2 (VxWorks). This combination will produce an executable that runs on any processor with IA-32 architecture but with an additional specialized Intel® SSE2 code path.

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options None

See Also • x, Qx • xHost, QxHost • m

B Options

B Specifies a directory that can be used to find include files, libraries, and executables.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Bdir

Arguments

dir Is the directory to be used. If necessary, the compiler adds a directory separator character at the end of dir.


Default

OFF The compiler looks for files in the directories specified in your PATH environment variable.

Description This option specifies a directory that can be used to find include files, libraries, and executables. The compiler uses dir as a prefix. For include files, dir is converted to -I/dir/include. This command is added to the front of the includes passed to the preprocessor. For libraries, dir is converted to -L/dir. This command is added to the front of the standard -L inclusions before system libraries are added. For executables, if dir contains the name of a tool, such as ld or as, the compiler will use it instead of those found in the default directories. The compiler looks for include files in dir/include, while library files are looked for in dir. Another way to get the behavior of this option is to use the environment variable GCC_EXEC_PREFIX.

Alternate Options None

Bdynamic Enables dynamic linking of libraries at run time.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Bdynamic

Arguments None


Default

OFF Limited dynamic linking occurs.

Description This option enables dynamic linking of libraries at run time. Smaller executables are created than with static linking. This option is placed in the linker command line corresponding to its location on the user command line. It controls the linking behavior of any library that is passed using the command line. All libraries on the command line following option -Bdynamic are linked dynamically until the end of the command line or until a -Bstatic option is encountered. The -Bstatic option enables static linking of libraries.

Alternate Options None

See Also • Bstatic

Bstatic Enables static linking of a user's library.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Bstatic

Arguments None

Default

OFF Default static linking occurs.


Description This option enables static linking of a user's library. This option is placed in the linker command line corresponding to its location on the user command line. It controls the linking behavior of any library that is passed using the command line. All libraries on the command line following option -Bstatic are linked statically until the end of the command line or until a -Bdynamic option is encountered. The -Bdynamic option enables dynamic linking of libraries.

Alternate Options None

See Also • Bdynamic

Bsymbolic Binds references to all global symbols in a program to the definitions within a user's shared library.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Bsymbolic

Arguments None

Default

OFF When a program is linked to a shared library, it can override the definition within the shared library.


Description This option binds references to all global symbols in a program to the definitions within a user's shared library. This option is only meaningful on Executable and Linkable Format (ELF) platforms that support shared libraries.

CAUTION. This option can have the unintended side effect of disabling symbol preemption in the shared library.

Alternate Options None

See Also • Bsymbolic-functions

Bsymbolic-functions Binds references to all global function symbols in a program to the definitions within a user's shared library.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Bsymbolic-functions

Arguments None

Default

OFF When a program is linked to a shared library, it can override the definition within the shared library.


Description This option binds references to all global function symbols in a program to the definitions within a user's shared library. This option is only meaningful on Executable and Linkable Format (ELF) platforms that support shared libraries.

CAUTION. This option can have the unintended side effect of disabling symbol preemption in the shared library.

Alternate Options None

See Also • Bsymbolic

C Options

C Places comments in preprocessed source output.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -C

Arguments None

Default

OFF No comments are placed in preprocessed source output.


Description This option places (or preserves) comments in preprocessed source output. Comments following preprocessing directives, however, are not preserved.

Alternate Options None

See Also Building Applications: About Preprocessor Options

c Prevents linking.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -c

Arguments None

Default

OFF Linking is performed.

Description This option prevents linking. Compilation stops after the object file is generated. The compiler generates an object file for each C or C++ source file or preprocessed source file. It also takes an assembler file and invokes the assembler to generate an object file.

Alternate Options None


c99 Determines whether C99 support is enabled for C programs. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -c99 -no-c99

Arguments None

Default

-no-c99 C99 support is disabled for C programs on VxWorks* OS.

Description This option determines whether C99 support is enabled for C programs. One of the features enabled by -c99 (VxWorks), restricted pointers, is available by using option restrict. For more information, see restrict. Note that C99 features are only available in C; they are not available in C++.

Alternate Options VxWorks: -std=c99

See Also • restrict, Qrestrict

check-uninit Determines whether checking occurs for uninitialized variables.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -check-uninit -no-check-uninit

Arguments None

Default

-no-check-uninit

Description Enables run-time checking for uninitialized variables. If a variable is read before it is written, a run-time error routine will be called. Run-time checking of undefined variables is only implemented on local, scalar variables. It is not implemented on dynamically allocated variables, extern variables or static variables. It is not implemented on structs, classes, unions or arrays.

Alternate Options None

complex-limited-range Determines whether the use of basic algebraic expansions of some arithmetic operations involving data of type COMPLEX is enabled.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -complex-limited-range -no-complex-limited-range

Arguments None

Default

-no-complex-limited-range Basic algebraic expansions of some arithmetic operations involving data of type COMPLEX are disabled.

Description This option determines whether the use of basic algebraic expansions of some arithmetic operations involving data of type COMPLEX is enabled. When the option is enabled, this can cause performance improvements in programs that use a lot of COMPLEX arithmetic. However, values at the extremes of the exponent range may not compute correctly.

Alternate Options None

cxxlib Determines whether the compiler links using the C++ run-time libraries and header files provided by gcc.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -cxxlib[=dir] -cxxlib-nostd


-no-cxxlib

Arguments

dir Is an optional top-level location for the gcc binaries and libraries.

Default

C++: -cxxlib For C++, the compiler uses the run-time libraries and headers provided by gcc.
C: -no-cxxlib For C, the compiler uses the default run-time libraries and headers and does not link to any additional C++ run-time libraries and headers. However, if you specify compiler option -std=gnu++98, the default is -cxxlib.

Description This option determines whether the compiler links using the C++ run-time libraries and header files provided by gcc. Option -cxxlib-nostd prevents the compiler from linking with the standard C++ library.

Alternate Options None

See Also Building Applications: Options for Interoperability

D Options

D Defines a macro name that can be associated with an optional value.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -Dname[=value]

Arguments

name Is the name of the macro.
value Is an optional integer or an optional character string delimited by double quotes; for example, Dname=string.

Default

OFF Only default symbols or macros are defined.

Description Defines a macro name that can be associated with an optional value. This option is equivalent to a #define preprocessor directive. If a value is not specified, name is defined as "1".

CAUTION. On VxWorks systems, if you are not specifying a value, do not use D for name, because it will conflict with the -DD option.

Alternate Options None

See Also Building Applications: Predefined Preprocessor Symbols

dD Same as option -dM, but outputs #define directives in preprocessed source.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -dD

Arguments None

Default

OFF The compiler does not output #define directives.

Description Same as -dM, but outputs #define directives in preprocessed source. To use this option, you must also specify the E option.

Alternate Options None

debug Enables or disables generation of debugging information.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -debug[keyword]

Arguments keyword Is the type of debugging information to be generated. Possible values are: none Disables generation of debugging information.


full or all Generates complete debugging information.

minimal Generates line number information for debugging.

[no]emit_column Determines whether the compiler generates column number information for debugging.

[no]expr-source-pos Determines whether the compiler generates source position information at the expression level of granularity.

[no]inline-debug-info Determines whether the compiler generates enhanced debug information for inlined code.

[no]semantic-stepping Determines whether the compiler generates enhanced debug information useful for breakpoints and stepping.

[no]variable-locations Determines whether the compiler generates enhanced debug information useful in finding scalar local variables.

extended Sets keyword values semantic-stepping and variable-locations.

[no]parallel Determines whether the compiler generates parallel debug code instrumentations useful for thread data sharing and reentrant call detection.

For information on the non-default settings for these keywords, see the Description section.

Default

-debug none No debugging information is generated.

Description This option enables or disables generation of debugging information. Note that if you turn debugging on, optimization is turned off.


Keywords semantic-stepping, inline-debug-info, variable-locations, and extended can be used in combination with each other. If conflicting keywords are used in combination, the last one specified on the command line has precedence.

Option Description

-debug none Disables generation of debugging information.

-debug full or -debug all Generates complete debugging information. It is the same as specifying -debug with no keyword.

-debug minimal Generates line number information for debugging.

-debug emit_column Generates column number information for debugging.

-debug expr-source-pos Generates source position information at the expression level of granularity.

-debug inline-debug-info Generates enhanced debug information for inlined code. On inlined functions, symbols are (by default) associated with the caller. This option causes symbols for inlined functions to be associated with the source of the called function.

-debug semantic-stepping Generates enhanced debug information useful for breakpoints and stepping. It tells the debugger to stop only at machine instructions that achieve the final effect of a source statement. For example, in the case of an assignment statement, this might be a store instruction that assigns a value to a program variable; for a function call, it might be the machine instruction that executes the call. Other instructions generated for those source statements are not displayed during stepping. This option has no impact unless optimizations have also been enabled.

-debug variable-locations Generates enhanced debug information useful in finding scalar local variables. It uses a feature of the Dwarf object module known as "location lists".


This feature allows the run-time locations of local scalar variables to be specified more accurately; that is, whether, at a given position in the code, a variable value is found in memory or a machine register.

-debug extended Sets keyword values semantic-stepping and variable-locations. It also tells the compiler to include column numbers in the line information.

-debug parallel Generates parallel debug code instrumentations needed for the thread data sharing and reentrant call detection of the Intel® Parallel Debugger Extension.

On VxWorks* systems, debuggers read debug information from executable images. As a result, information is written to object files and then added to the executable by the linker.

Alternate Options

For -debug full, -debug all, or -debug: VxWorks: -g
For -debug inline-debug-info: VxWorks: -inline-debug-info (this is a deprecated option)
For -debug variable-locations: VxWorks: -fvar-tracking
For -debug semantic-stepping: VxWorks: -fvar-tracking-assignments

See Also Building Applications: Chapter on Debugging

diag Controls the display of diagnostic information.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -diag-type diag-list

Arguments

type Is an action to perform on diagnostics. Possible values are:

enable Enables a diagnostic message or a group of messages.
disable Disables a diagnostic message or a group of messages.
error Tells the compiler to change diagnostics to errors.
warning Tells the compiler to change diagnostics to warnings.
remark Tells the compiler to change diagnostics to remarks (comments).

diag-list Is a diagnostic group or ID value. Possible values are:

driver Specifies diagnostic messages issued by the compiler driver.
port-linux Specifies diagnostic messages for language features that may cause errors when porting to Linux* systems. This diagnostic group is only available on Windows* systems.
port-win Specifies diagnostic messages for GNU extensions that may cause errors when porting to Windows. This diagnostic group is only available on VxWorks* systems.
thread Specifies diagnostic messages that help in thread-enabling a program.
vec Specifies diagnostic messages issued by the vectorizer.
warn Specifies diagnostic messages that have a "warning" severity level.


error Specifies diagnostic messages that have an "error" severity level.
remark Specifies diagnostic messages that are remarks or comments.
cpu-dispatch Specifies the CPU dispatch remarks for diagnostic messages. These remarks are enabled by default.
id[,id,...] Specifies the ID number of one or more messages. If you specify more than one message number, they must be separated by commas. There can be no intervening white space between each "id".
tag[,tag,...] Specifies the mnemonic name of one or more messages. If you specify more than one mnemonic name, they must be separated by commas. There can be no intervening white space between each "tag".

The diagnostic messages generated can be affected by certain options, such as -m or -x (VxWorks).

Default

OFF The compiler issues certain diagnostic messages by default.

Description This option controls the display of diagnostic information. Diagnostic messages are output to stderr unless compiler option -diag-file (VxWorks*) or /Qdiag-file (Windows*) is specified. To control the diagnostic information reported by the vectorizer, use the -vec-report (VxWorks) option.

Alternate Options

enable vec VxWorks: -vec-report Windows: /Qvec-report
disable vec VxWorks: -vec-report0 Windows: /Qvec-report0


Example The following example shows how to enable diagnostic IDs 117, 230 and 450:

-diag-enable 117,230,450 ! VxWorks systems
/Qdiag-enable:117,230,450 ! Windows systems

The following example shows how to change vectorizer diagnostic messages to warnings:

-diag-enable vec -diag-warning vec ! VxWorks systems
/Qdiag-enable:vec /Qdiag-warning:vec ! Windows systems

Note that you need to enable the vectorizer diagnostics before you can change them to warnings. The following example shows how to change all diagnostic warnings and remarks to errors:

-diag-error warn,remark ! VxWorks systems
/Qdiag-error:warn,remark ! Windows systems

See Also • diag-dump, Qdiag-dump • diag-id-numbers, Qdiag-id-numbers • diag-file, Qdiag-file • vec-report, Qvec-report diag-dump Tells the compiler to print all enabled diagnostic messages and stop compilation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -diag-dump

Arguments None


Default

OFF The compiler issues certain diagnostic messages by default.

Description This option tells the compiler to print all enabled diagnostic messages and stop compilation. The diagnostic messages are output to stdout. This option prints the enabled diagnostics from all possible diagnostics that the compiler can issue, including any default diagnostics. If -diag-enable diag-list (VxWorks) or /Qdiag-enable diag-list (Windows) is specified, the print out will include the diag-list diagnostics.

Alternate Options None

Example The following example adds vectorizer diagnostic messages to the printout of default diagnostics:

-diag-enable vec -diag-dump ! VxWorks systems
/Qdiag-enable:vec /Qdiag-dump ! Windows systems

See Also • diag, Qdiag

diag-error-limit Specifies the maximum number of errors allowed before compilation stops.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -diag-error-limitn -no-diag-error-limit


Arguments n Is the maximum number of error-level or fatal-level compiler errors allowed.

Default

30 A maximum of 30 error-level and fatal-level messages are allowed.

Description This option specifies the maximum number of errors allowed before compilation stops. It indicates the maximum number of error-level or fatal-level compiler errors allowed for a file specified on the command line. If you specify -no-diag-error-limit (VxWorks) on the command line, there is no limit on the number of errors that are allowed. If the maximum number of errors is reached, a warning message is issued and the next file (if any) on the command line is compiled.

Alternate Options VxWorks: -wn (this is a deprecated option) Windows: /Qwn (this is a deprecated option)

diag-file Causes the results of diagnostic analysis to be output to a file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -diag-file[=filename]

Arguments filename Is the name of the file for output.


Default

OFF Diagnostic messages are output to stderr.

Description This option causes the results of diagnostic analysis to be output to a file. The file is placed in the current working directory. You can include a file extension in filename. For example, if file.txt is specified, the name of the output file is file.txt. If you do not provide a file extension, the name of the file is filename.diag. If filename is not specified, the name of the file is name-of-the-first-source-file.diag. This is also the name of the file if the name specified for file conflicts with a source file name provided in the command line.

NOTE. If you specify -diag-file (VxWorks) and you also specify -diag-file-append (VxWorks), the last option specified on the command line takes precedence.

Alternate Options None

Example The following example shows how to cause diagnostic analysis to be output to a file named my_diagnostics.diag:

-diag-file=my_diagnostics ! VxWorks systems
/Qdiag-file:my_diagnostics ! Windows systems

See Also • diag-file-append, Qdiag-file-append

diag-file-append Causes the results of diagnostic analysis to be appended to a file.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -diag-file-append[=filename]

Arguments filename Is the name of the file to be appended to. It can include a path.

Default

OFF Diagnostic messages are output to stderr.

Description This option causes the results of diagnostic analysis to be appended to a file. If you do not specify a path, the driver will look for filename in the current working directory. If filename is not found, then a new file with that name is created in the current working directory. If the name specified for file conflicts with a source file name provided in the command line, the name of the file is name-of-the-first-source-file.diag.

NOTE. If you specify -diag-file-append (VxWorks) and you also specify -diag-file (VxWorks), the last option specified on the command line takes precedence.

Alternate Options None

Example The following example shows how to cause diagnostic analysis to be appended to a file named my_diagnostics.txt:

-diag-file-append=my_diagnostics.txt ! VxWorks systems
/Qdiag-file-append:my_diagnostics.txt ! Windows systems

See Also • diag-file, Qdiag-file


diag-id-numbers Determines whether the compiler displays diagnostic messages by using their ID number values.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -diag-id-numbers -no-diag-id-numbers

Arguments None

Default

-diag-id-numbers The compiler displays diagnostic messages by using their ID number values.

Description This option determines whether the compiler displays diagnostic messages by using their ID number values. If you specify -no-diag-id-numbers (VxWorks), mnemonic names are output for driver diagnostics only.

Alternate Options None

See Also • diag, Qdiag

diag-once Tells the compiler to issue one or more diagnostic messages only once.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -diag-onceid[,id,...]

Arguments id Is the ID number of the diagnostic message. If you specify more than one message number, they must be separated by commas. There can be no intervening white space between each id.

Default

OFF The compiler issues certain diagnostic messages by default.

Description This option tells the compiler to issue one or more diagnostic messages only once.

Alternate Options VxWorks: -wo (this is a deprecated option)

dM Tells the compiler to output macro definitions in effect after preprocessing.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -dM

Arguments None

Default

OFF The compiler does not output macro definitions after preprocessing.

Description This option tells the compiler to output macro definitions in effect after preprocessing. To use this option, you must also specify the E option.

Alternate Options None

See Also • E

dN Same as -dD, but output #define directives contain only macro names.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -dN

Arguments None


Default

OFF The compiler does not output #define directives.

Description Same as -dD, but output #define directives contain only macro names. To use this option, you must also specify the E option.

Alternate Options None

dryrun Specifies that driver tool commands should be shown but not executed.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -dryrun

Arguments None

Default

OFF Tool commands are executed but not shown.

Description This option specifies that driver tool commands should be shown but not executed.

Alternate Options None

See Also • v


dumpmachine Displays the target machine and operating system configuration.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -dumpmachine

Arguments None

Default

OFF The compiler does not display target machine or operating system information.

Description This option displays the target machine and operating system configuration. No compilation is performed.

Alternate Options None

See Also • dumpversion

dumpversion Displays the version number of the compiler.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -dumpversion

Arguments None

Default

OFF The compiler does not display the compiler version number.

Description This option displays the version number of the compiler. It does not compile your source files.

Alternate Options None

Example Consider the following command: icc -dumpversion

If it is specified when using the Intel C++ Compiler 10.1, the compiler displays "10.1".

See Also • dumpmachine

dynamic-linker Specifies a dynamic linker other than the default.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -dynamic-linker file


Arguments

file Is the name of the dynamic linker to be used.

Default

OFF The default dynamic linker is used.

Description This option lets you specify a dynamic linker other than the default.

Alternate Options None

E Options

E Causes the preprocessor to send output to stdout.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -E

Arguments None

Default

OFF Preprocessed source files are output to the compiler.

Description This option causes the preprocessor to send output to stdout. Compilation stops when the files have been preprocessed.


When you specify this option, the compiler's preprocessor expands your source module and writes the result to stdout. The preprocessed source contains #line directives, which the compiler uses to determine the source file and line number.

Alternate Options None

early-template-check Lets you semantically check function template prototypes before instantiation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -early-template-check -no-early-template-check

Arguments None

Default

-no-early-template-check The prototype instantiation of function templates and function members of class templates is deferred.

Description Lets you semantically check function template prototypes before instantiation. On Linux* OS platforms, gcc 3.4 (or newer) compatibility modes (that is, -gcc-version=340 and later) must be in effect.

Alternate Options None


EP Causes the preprocessor to send output to stdout, omitting #line directives.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -EP

Arguments None

Default

OFF Preprocessed source files are output to the compiler.

Description This option causes the preprocessor to send output to stdout, omitting #line directives. If you also specify option P or Linux* OS option F, the preprocessor will write the results (without #line directives) to a file instead of stdout.

Alternate Options None

export Enables support for the C++ export template feature.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -export

Arguments None

Default

OFF The export template feature is not enabled.

Description This option enables support for the C++ export template feature. This option is supported only in C++ mode.

Alternate Options None

See Also • export-dir

export-dir Specifies a directory name for the exported template search path.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -export-dir dir

Arguments dir Is the directory name to add to the search path.


Default

OFF The compiler does not recognize exported templates.

Description This option specifies a directory name for the exported template search path. To use this option, you must also specify the -export option. Directories in the search path are used to find the definitions of exported templates and are searched in the order in which they are specified on the command-line. The current directory is always the first entry in the search path.

Alternate Options None

See Also • export

F Options

fabi-version Instructs the compiler to select a specific ABI implementation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fabi-version=n

Arguments

n Is the ABI implementation. Possible values are:

0 Requests the latest ABI implementation.

1 Requests the ABI implementation used in gcc 3.2 and gcc 3.3.


2 Requests the ABI implementation used in gcc 3.4 and higher.

Default

Varies The compiler uses the ABI implementation that corresponds to the installed version of gcc.

Description This option tells the compiler to select a specific ABI implementation. This option is compatible with gcc option -fabi-version. If you have multiple versions of gcc installed, the compiler may change the value of n depending on which gcc is detected in your path.

NOTE. gcc 3.2 and 3.3 are not fully ABI-compliant, but gcc 3.4 is highly ABI-compliant.

CAUTION. Do not mix different values for -fabi-version in one link.

Alternate Options None

falias Determines whether aliasing should be assumed in the program.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -falias -fno-alias

Arguments None


Default

-falias Aliasing is assumed in the program.

Description This option determines whether aliasing should be assumed in the program. If you do not want aliasing to be assumed in the program, specify -fno-alias.

Alternate Options VxWorks: None Windows: /Oa[-]

See Also • ffnalias

falign-functions Tells the compiler to align functions on an optimal byte boundary.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -falign-functions[=n] -fno-align-functions

Arguments

n Is the byte boundary for function alignment. Possible values are 2 or 16.

Default

-fno-align-functions or /Qfnalign- The compiler aligns functions on 2-byte boundaries. This is the same as specifying -falign-functions=2 (VxWorks).


Description This option tells the compiler to align functions on an optimal byte boundary. If you do not specify n, the compiler aligns the start of functions on 16-byte boundaries.

Alternate Options None

falign-stack Tells the compiler the stack alignment to use on entry to routines.

Architectures IA-32 architecture

Syntax

VxWorks: -falign-stack=mode

Arguments

mode Is the method to use for stack alignment. Possible values are:

assume-4-byte Tells the compiler to assume the stack is aligned on 4-byte boundaries. The compiler can dynamically adjust the stack to 16-byte alignment if needed.

maintain-16-byte Tells the compiler to not assume any specific stack alignment, but attempt to maintain alignment in case the stack is already aligned. The compiler can dynamically align the stack if needed. This setting is compatible with gcc.

assume-16-byte Tells the compiler to assume the stack is aligned on 16-byte boundaries and to continue to maintain 16-byte alignment. This setting is compatible with gcc.


Default

-falign-stack=assume-16-byte The compiler assumes the stack is aligned on 16-byte boundaries and continues to maintain 16-byte alignment.

Description This option tells the compiler the stack alignment to use on entry to routines.

Alternate Options None

fargument-alias Determines whether function arguments can alias each other.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fargument-alias -fargument-noalias

Arguments None

Default

-fargument-alias or /Qalias-args Function arguments can alias each other and can alias global storage.

Description This option determines whether function arguments can alias each other. If you specify -fargument-noalias, function arguments cannot alias each other, but they can alias global storage. On VxWorks systems, you can also disable aliasing for global storage by specifying option -fargument-noalias-global.


See Also • fargument-noalias-global

fargument-noalias-global Tells the compiler that function arguments cannot alias each other and cannot alias global storage.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fargument-noalias-global

Arguments None

Default

OFF Function arguments can alias each other and can alias global storage.

Description This option tells the compiler that function arguments cannot alias each other and they cannot alias global storage. If you only want to prevent function arguments from being able to alias each other, specify option -fargument-noalias.

Alternate Options None

See Also • fargument-alias, Qalias-args


fasm-blocks Enables the use of blocks and entire functions of assembly code within a C or C++ file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fasm-blocks

Arguments None

Default

OFF The compiler allows a GNU*-style inline assembly format.

Description This option enables the use of blocks and entire functions of assembly code within a C or C++ file. It allows a Microsoft* MASM-style inline assembly block, not a GNU*-style inline assembly block.

Alternate Options -use-msasm

fast Maximizes speed across the entire program.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fast


Arguments None

Default

OFF The optimizations that maximize speed are not enabled.

Description This option maximizes speed across the entire program. It sets the following options:

• On systems using IA-32 architecture and Intel® 64 architecture: Linux: -ipo, -O3, -no-prec-div, -static, and -xHost

When option fast is specified on systems using IA-32 architecture or Intel® 64 architecture, you can override the -xHost or /QxHost setting by specifying a different processor-specific -x or /Qx option on the command line. However, the last option specified on the command line takes precedence. For example, if you specify -fast -xSSE3 (Linux), option -xSSE3 or /QxSSE3 takes effect. However, if you specify -xSSE3 -fast (Linux), option -xHost or /QxHost takes effect. For implications on non-Intel processors, refer to the xHost documentation.

NOTE. The options set by option fast may change from release to release.

Alternate Options None

See Also • xHost, QxHost • x, Qx


fast-transcendentals Enables the compiler to replace calls to transcendental functions with faster but less precise implementations.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fast-transcendentals -no-fast-transcendentals

Default

Depends on the setting of -fp-model (VxWorks). If you do not specify option -[no-]fast-transcendentals or option /Qfast-transcendentals[-]:

• The default is ON if option -fp-model fast or /fp:fast is specified or is in effect. • The default is OFF if a value-safe setting is specified for -fp- model or /fp (such as "precise", "source", etc.).

Description This option enables the compiler to replace calls to transcendental functions with implementations that may be faster but less precise. It allows the compiler to perform certain optimizations on transcendental functions, such as replacing individual calls to sine in a loop with a single call to a less precise vectorized sine library routine. This option does not affect explicit Short Vector Math Library (SVML) intrinsics. It only affects scalar calls to the standard math library routines. You cannot use option -fast-transcendentals with option -fp-model strict, and you cannot use option /Qfast-transcendentals with option /fp:strict. This option determines the setting for the maximum allowable relative error for math library function results (max-error) if none of the following options are specified:


• -fimf-accuracy-bits (VxWorks* OS) or /Qimf-accuracy-bits (Windows* OS) • -fimf-max-error (VxWorks OS) or /Qimf-max-error (Windows OS) • -fimf-precision (VxWorks OS) or /Qimf-precision (Windows OS)

This option enables extra optimization that only applies to Intel® processors.

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options None

See Also • fp-model, fp • fimf-accuracy-bits, Qimf-accuracy-bits • fimf-max-error, Qimf-max-error


• fimf-precision, Qimf-precision

fbuiltin, Oi Enables or disables inline expansion of intrinsic functions.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fbuiltin[-func] -fno-builtin[-func]

Arguments

func A comma-separated list of intrinsic functions.

Default

OFF Inline expansion of intrinsic functions disabled.

Description This option enables or disables inline expansion of one or more intrinsic functions. If -func is not specified, -fno-builtin disables inline expansion for all intrinsic functions. For a list of built-in functions affected by -fbuiltin, search for "built-in functions" in the appropriate gcc* documentation. For a list of built-in functions affected by /Oi, search for "/Oi" in the appropriate Microsoft* Visual C/C++* documentation.

Alternate Options None

fcode-asm Produces an assembly listing with machine code annotations.

IDE Equivalent None

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fcode-asm

Arguments None

Default

OFF No machine code annotations appear in the assembly listing file, if one is produced.

Description This option produces an assembly listing file with machine code annotations. The assembly listing file shows the hex machine instructions at the beginning of each line of assembly code. The file cannot be assembled; the file name is the name of the source file with an extension of .cod. To use this option, you must also specify option -S, which causes an assembly listing to be generated.

Alternate Options VxWorks: None Windows: /FAc


See Also • S

fcommon Determines whether the compiler treats common symbols as global definitions.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fcommon -fno-common

Arguments None

Default

-fcommon The compiler does not treat common symbols as global definitions.

Description This option determines whether the compiler treats common symbols as global definitions and allocates memory for each symbol at compile time. Option -fno-common tells the compiler to treat common symbols as global definitions. When using this option, you can only have a common variable declared in one module; otherwise, a link-time error will occur for multiple defined symbols. Normally, a file-scope declaration with no initializer and without the extern or static keyword, such as "int i;", is represented as a common symbol. Such a symbol is treated as an external reference. However, if no other compilation unit has a global definition for the name, the linker allocates memory for it.

Alternate Options None

fexceptions Enables exception handling table generation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fexceptions -fno-exceptions

Arguments None

Default

-fexceptions Exception handling table generation is enabled. Default for C++. -fno-exceptions Exception handling table generation is disabled. Default for C.

Description This option enables exception handling table generation. The -fno-exceptions option disables exception handling table generation, resulting in smaller code. When this option is used, any use of exception handling constructs (such as try blocks and throw statements) will produce an error. Exception specifications are parsed but ignored. It also undefines the preprocessor symbol __EXCEPTIONS.

Alternate Options None

ffnalias Specifies that aliasing should be assumed within functions.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -ffnalias -fno-fnalias

Arguments None

Default

-ffnalias Aliasing is assumed within functions.

Description This option specifies that aliasing should be assumed within functions. The -fno-fnalias option specifies that aliasing should not be assumed within functions, but should be assumed across calls.

Alternate Options VxWorks: None Windows: /Ow[-]

See Also • falias

ffreestanding Ensures that compilation takes place in a freestanding environment.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ffreestanding


Arguments None

Default

OFF Standard libraries are used during compilation.

Description This option ensures that compilation takes place in a freestanding environment. The compiler assumes that the standard library may not exist and program startup may not necessarily be at main. This environment meets the definition of a freestanding environment as described in the C and C++ standard. An example of an application requiring such an environment is an OS kernel.

NOTE. When you specify this option, the compiler will not assume the presence of compiler-specific libraries. It will only generate calls that appear in the source code.

Alternate Options None

ffriend-injection Causes the compiler to inject friend functions into the enclosing namespace.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ffriend-injection -fno-friend-injection

Arguments None


Default

-fno-friend-injection The compiler does not inject friend functions into the enclosing namespace. A friend function that is not declared in an enclosing scope can only be found using argument-dependent lookup.

Description This option causes the compiler to inject friend functions into the enclosing namespace, so they are visible outside the scope of the class in which they are declared. In gcc versions 4.1 or later, this is not the default behavior. This option allows compatibility with gcc 4.0 or earlier.

Alternate Options None

ffunction-sections Places each function in its own COMDAT section.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ffunction-sections

Arguments None

Default

OFF Functions are not placed in separate COMDAT sections.

Description This option places each function in its own COMDAT section.

Alternate Options -fdata-sections

fimf-absolute-error Defines the maximum allowable absolute error for math library function results.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fimf-absolute-error=value[:funclist]

Arguments

value Is a positive, floating-point number indicating the maximum allowable absolute error the compiler should use. The format for the number is [digits] [.digits] [ { e | E }[sign]digits]

funclist Is an optional list of one or more math library functions to which the attribute should be applied. If you specify more than one function, they must be separated with commas.

Default

0 (zero) The compiler uses default heuristics when calling math library functions. An absolute-error setting of 0 means that the function is bound by the relative error setting. This is the default behavior.

Description This option defines the maximum allowable absolute error for math library function results. This option can improve run-time performance, but it may decrease the accuracy of results. This option only affects functions that have zero as a possible return value, such as log, sin, asin, etc. The relative error requirements for a particular function are determined by options that set max-error and precision. The return value from a function must have a relative error less than the max-error value, or an absolute error less than the absolute-error value.


NOTE. Many routines in libraries libm (Math Library) and svml (Short Vector Math Library) are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Alternate Options None

See Also • fimf-accuracy-bits, Qimf-accuracy-bits • fimf-max-error, Qimf-max-error • fimf-arch-consistency, Qimf-arch-consistency • fimf-precision, Qimf-precision

fimf-accuracy-bits Defines the relative error for math library function results.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fimf-accuracy-bits=bits[:funclist]

Arguments

bits Is a positive, floating-point number indicating the number of correct bits the compiler should use.

funclist Is an optional list of one or more math library functions to which the attribute should be applied. If you specify more than one function, they must be separated with commas.

Default

OFF The compiler uses default heuristics when calling math library functions.


Description This option defines the relative error, measured by the number of correct bits, for math library function results. The following formula is used to convert bits into ulps: ulps = 2^(p-1-bits), where p is the number of mantissa bits in the target format (24, 53, and 64 for single, double, and long double, respectively). This option can improve run-time performance, but it may decrease the accuracy of results. If option -fimf-precision, -fimf-max-error, or -fimf-accuracy-bits (VxWorks* OS) is specified, the default value for max-error is determined by that option. If more than one of these options is specified, the default value for max-error is determined by the last one specified on the command line. If none of these options is specified, the default value for max-error is determined by the setting specified for option -[no-]fast-transcendentals (VxWorks OS). If that option also has not been specified, the default value is determined by the setting of option -fp-model (VxWorks OS).

NOTE. Many routines in libraries libm (Math Library) and svml (Short Vector Math Library) are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Alternate Options None

See Also • fimf-absolute-error, Qimf-absolute-error • fimf-max-error, Qimf-max-error • fimf-arch-consistency, Qimf-arch-consistency • fimf-precision, Qimf-precision


fimf-arch-consistency Ensures that the math library functions produce consistent results across different implementations of the same architecture.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fimf-arch-consistency=value[:funclist]

Arguments

value Is one of the logical values "true" or "false".

funclist Is an optional list of one or more math library functions to which the attribute should be applied. If you specify more than one function, they must be separated with commas.

Default

OFF The compiler uses default heuristics when calling math library functions.

Description This option ensures that the math library functions produce consistent results across different implementations of the same architecture. If -fimf-arch-consistency=true (VxWorks* OS) or /Qimf-arch-consistency:true is specified, it takes precedence over precision settings in the following options:

• -fimf-absolute-error (VxWorks* OS) • -fimf-accuracy-bits (VxWorks OS) • -fimf-max-error (VxWorks OS) • -fimf-precision (VxWorks OS)


The -fimf-arch-consistency (VxWorks* OS) and /Qimf-arch-consistency (Windows* OS) options may decrease run-time performance, but they provide bit-wise consistent results on all Intel® processors and compatible non-Intel processors, regardless of microarchitecture. This option may not provide bit-wise consistent results between different architectures, for example, between IA-32 and Intel® 64 architectures.

NOTE. Many routines in libraries libm (Math Library) and svml (Short Vector Math Library) are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Alternate Options None

See Also • fimf-absolute-error, Qimf-absolute-error • fimf-accuracy-bits, Qimf-accuracy-bits • fimf-max-error, Qimf-max-error • fimf-precision, Qimf-precision

fimf-max-error Defines the maximum allowable relative error for math library function results.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fimf-max-error=ulps[:funclist]

Arguments

ulps Is a positive, floating-point number indicating the maximum allowable relative error the compiler should use. The format for the number is [digits] [.digits] [ { e | E }[sign]digits]


funclist Is an optional list of one or more math library functions to which the attribute should be applied. If you specify more than one function, they must be separated with commas.

Default

OFF The compiler uses default heuristics when calling math library functions.

Description This option defines the maximum allowable relative error, measured in ulps, for math library function results. This option can improve run-time performance, but it may decrease the accuracy of results. If option -fimf-precision, -fimf-max-error, or -fimf-accuracy-bits (VxWorks* OS) is specified, the default value for max-error is determined by that option. If more than one of these options is specified, the default value for max-error is determined by the last one specified on the command line. If none of these options is specified, the default value for max-error is determined by the setting specified for option -[no-]fast-transcendentals (VxWorks OS). If that option also has not been specified, the default value is determined by the setting of option -fp-model (VxWorks OS).

NOTE. Many routines in libraries libm (Math Library) and svml (Short Vector Math Library) are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Alternate Options None

See Also • fimf-absolute-error, Qimf-absolute-error • fimf-accuracy-bits, Qimf-accuracy-bits • fimf-arch-consistency, Qimf-arch-consistency • fimf-precision, Qimf-precision

fimf-precision Defines the accuracy for math library functions.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fimf-precision[=value[:funclist]]

Arguments

value Is one of the following values denoting the desired accuracy:

high This is equivalent to max-error = 0.6.

medium This is equivalent to max-error = 4; this is the default setting if the option is specified and value is omitted.

low This is equivalent to accuracy-bits = 11 if the value is single precision; accuracy-bits = 26 if the value is double precision.

In the above explanations, max-error means option -fimf-max-error (VxWorks* OS); accuracy-bits means option -fimf-accuracy-bits (VxWorks* OS).

funclist Is an optional list of one or more math library functions to which the attribute should be applied. If you specify more than one function, they must be separated with commas.

Default

OFF The compiler uses default heuristics when calling math library functions.

Description This option defines the accuracy (precision) for math library functions.


This option can be used to improve run-time performance if reduced accuracy is sufficient for the application, or it can be used to increase the accuracy of math library functions. In general, using a lower precision can improve run-time performance and using a higher precision may reduce run-time performance. If option -fimf-precision, -fimf-max-error, or -fimf-accuracy-bits (VxWorks* OS) is specified, the default value for max-error is determined by that option. If more than one of these options is specified, the default value for max-error is determined by the last one specified on the command line. If none of these options is specified, the default value for max-error is determined by the setting specified for option -[no-]fast-transcendentals (VxWorks OS). If that option also has not been specified, the default value is determined by the setting of option -fp-model (VxWorks OS).

NOTE. Many routines in libraries libm (Math Library) and svml (Short Vector Math Library) are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Alternate Options None

See Also • fimf-absolute-error, Qimf-absolute-error • fimf-accuracy-bits, Qimf-accuracy-bits • fimf-max-error, Qimf-max-error • fimf-arch-consistency, Qimf-arch-consistency

finline Tells the compiler to inline functions declared with __inline and perform C++ inlining.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -finline -fno-inline

Arguments None

Default

-fno-inline The compiler does not inline functions declared with __inline.

Description This option tells the compiler to inline functions declared with __inline and perform C++ inlining.

Alternate Options None

finline-functions Enables function inlining for single file compilation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -finline-functions -fno-inline-functions

Arguments None


Default

-finline-functions Interprocedural optimizations occur. However, if you specify -O0, the default is OFF.

Description This option enables function inlining for single file compilation. It enables the compiler to perform inline function expansion for calls to functions defined within the current source file. The compiler applies a heuristic to perform the function expansion. To specify the size of the function to be expanded, use the -finline-limit option.

Alternate Options VxWorks: -inline-level=2 Windows: /Ob2

See Also • ip, Qip • finline-limit

finline-limit Lets you specify the maximum size of a function to be inlined.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -finline-limit=n

Arguments

n Must be an integer greater than or equal to zero. It is the maximum number of lines the function can have to be considered for inlining.


Default

OFF The compiler uses default heuristics when inlining functions.

Description This option lets you specify the maximum size of a function to be inlined. The compiler inlines smaller functions, but this option lets you inline large functions. For example, to indicate a large function, you could specify 100 or 1000 for n. Note that only whole functions can be inlined, not parts of functions. This option modifies the behavior of the -finline-functions option, which is on by default.

Alternate Options None

See Also • finline-functions

finstrument-functions Determines whether function entry and exit points are instrumented.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -finstrument-functions -fno-instrument-functions

Arguments None


Default

-fno-instrument-functions or /Qinstrument-functions- Function entry and exit points are not instrumented.

Description This option determines whether function entry and exit points are instrumented. It may increase execution time. The following profiling functions are called with the address of the current function and the address of where the function was called (its "call site"):

• This function is called upon function entry:

• On IA-32 architecture and Intel® 64 architecture: void __cyg_profile_func_enter (void *this_fn, void *call_site);

• This function is called upon function exit:

• On IA-32 architecture and Intel® 64 architecture: void __cyg_profile_func_exit (void *this_fn, void *call_site);

These functions can be used to gather more information, such as profiling information or timing information. Note that it is the user's responsibility to provide these profiling functions. If you specify -finstrument-functions (VxWorks), function inlining is disabled. If you specify -fno-instrument-functions or /Qinstrument-functions-, inlining is not disabled. On VxWorks systems, you can use the following attribute to stop an individual function from being instrumented: __attribute__((__no_instrument_function__))

It also stops inlining from being disabled for that individual function. This option is provided for compatibility with gcc.


Alternate Options None

fjump-tables Determines whether jump tables are generated for switch statements.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fjump-tables -fno-jump-tables

Arguments None

Default

-fjump-tables The compiler uses jump tables for switch statements.

Description This option determines whether jump tables are generated for switch statements. Option -fno-jump-tables prevents the compiler from generating jump tables for switch statements. This action is performed unconditionally and independent of any generated code performance consideration. Option -fno-jump-tables also prevents the compiler from creating switch statements internally as a result of optimizations. Use -fno-jump-tables with -fpic when compiling objects that will be loaded in a way where the jump table relocation cannot be resolved.

Alternate Options None


See Also • fpic

fkeep-static-consts Tells the compiler to preserve allocation of variables that are not referenced in the source.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fkeep-static-consts -fno-keep-static-consts

Arguments None

Default

-fno-keep-static-consts or /Qkeep-static-consts- If a variable is never referenced in a routine, the variable is discarded unless optimizations are disabled by option -O0 (VxWorks* OS).

Description This option tells the compiler to preserve allocation of variables that are not referenced in the source. The negated form can be useful when optimizations are enabled to reduce the memory usage of static data.

Alternate Options None

fmath-errno Tells the compiler that errno can be reliably tested after calls to standard math library functions.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fmath-errno -fno-math-errno

Arguments None

Default

-fno-math-errno The compiler assumes that the program does not test errno after calls to standard math library functions.

Description This option tells the compiler to assume that the program tests errno after calls to math library functions. This restricts optimization because it causes the compiler to treat most math functions as having side effects. Option -fno-math-errno tells the compiler to assume that the program does not test errno after calls to math library functions. This frequently allows the compiler to generate faster code. Floating-point code that relies on IEEE exceptions instead of errno to detect errors can safely use this option to improve performance.

Alternate Options None


fminshared Specifies that a compilation unit is a component of a main program and should not be linked as part of a shareable object.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fminshared

Arguments None

Default

OFF Source files are compiled together to form a single object file.

Description This option specifies that a compilation unit is a component of a main program and should not be linked as part of a shareable object. This option allows the compiler to optimize references to defined symbols without special visibility settings. To ensure that external and common symbol references are optimized, you need to specify visibility hidden or protected by using the -fvisibility, -fvisibility-hidden, or -fvisibility-protected option. Also, the compiler does not need to generate position-independent code for the main program. It can use absolute addressing, which may reduce the size of the global offset table (GOT) and may reduce memory traffic.

Alternate Options None

See Also • fvisibility

fmudflap The compiler instruments risky pointer operations to prevent buffer overflows and invalid heap use.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fmudflap

Arguments None

Default

OFF The compiler does not instrument risky pointer operations.

Description The compiler instruments risky pointer operations to prevent buffer overflows and invalid heap use. Requires gcc 4.0 or newer. When using this compiler option, you must specify linker option -lmudflap in the link command line to resolve references to the libmudflap library.

Alternate Options None

fno-defer-pop Always pops the arguments to each function call as soon as that function returns.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -fno-defer-pop

Arguments None

Default

OFF

Description Always pops the arguments to each function call as soon as that function returns. Default is -fdefer-pop.

Alternate Options None

fno-gnu-keywords Do not recognize typeof as a keyword.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fno-gnu-keywords

Arguments None

Default

OFF Keyword typeof is recognized.


Description Do not recognize typeof as a keyword.

Alternate Options None

fno-implicit-fp-sse Disables implicit use of floating point and SSE registers.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fno-implicit-fp-sse

Arguments None

Default

OFF

Description Disables implicit use of floating point and SSE registers. Default is -fimplicit-fp-sse.

Alternate Options None

fno-implicit-inline-templates Tells the compiler to not emit code for implicit instantiations of inline templates.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -fno-implicit-inline-templates

Arguments None

Default

OFF The compiler handles inlines so that compilations, with and without optimization, will need the same set of explicit instantiations.

Description This option tells the compiler to not emit code for implicit instantiations of inline templates.

Alternate Options None

fno-implicit-templates Tells the compiler to not emit code for non-inline templates that are instantiated implicitly.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fno-implicit-templates

Arguments None

Default

OFF The compiler handles inlines so that compilations, with and without optimization, will need the same set of explicit instantiations.


Description This option tells the compiler to not emit code for non-inline templates that are instantiated implicitly, but to only emit code for explicit instantiations.

Alternate Options None

fno-operator-names Disables support for the operator names specified in the standard.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fno-operator-names

Arguments None

Default

OFF

Description Disables support for the operator names specified in the standard.

Alternate Options None


fno-rtti Disables support for run-time type information (RTTI).

Architectures IA-32 architecture

Syntax

VxWorks: -fno-rtti

Arguments None

Default

OFF

Description This option disables support for run-time type information (RTTI).

Alternate Options None

fnon-call-exceptions Allows trapping instructions to throw C++ exceptions.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fnon-call-exceptions -fno-non-call-exceptions


Arguments None

Default

-fno-non-call-exceptions C++ exceptions are not thrown from trapping instructions.

Description This option allows trapping instructions to throw C++ exceptions. It allows hardware signals generated by trapping instructions to be converted into C++ exceptions and caught using the standard C++ exception handling mechanism. Examples of such signals are SIGFPE (floating-point exception) and SIGSEGV (segmentation violation). You must write a signal handler that catches the signal and throws a C++ exception. After that, any occurrence of that signal within a C++ try block can be caught by a C++ catch handler of the same type as the C++ exception thrown within the signal handler. Only signals generated by trapping instructions (that is, memory access instructions and floating-point instructions) can be caught. Signals that can occur at any time, such as SIGALRM, cannot be caught in this manner.

Alternate Options None

fnon-lvalue-assign Determines whether casts and conditional expressions can be used as lvalues.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fnon-lvalue-assign -fno-non-lvalue-assign


Arguments None

Default

-fnon-lvalue-assign The compiler allows casts and conditional expressions to be used as lvalues.

Description This option determines whether casts and conditional expressions can be used as lvalues.

Alternate Options None

fomit-frame-pointer, Oy Determines whether EBP is used as a general-purpose register in optimizations.

Architectures -f[no-]omit-frame-pointer: IA-32, Intel® 64 architectures /Oy[-]: IA-32 architecture

Syntax

VxWorks: -fomit-frame-pointer -fno-omit-frame-pointer

Arguments None

Default

-fomit-frame-pointer or /Oy EBP is used as a general-purpose register in optimizations. However, on VxWorks* systems, the default is -fno-omit-frame-pointer if option -O0 or -g is specified. On Windows* systems, the default is /Oy- if option /Od is specified.


Description These options determine whether EBP is used as a general-purpose register in optimizations. Options -fomit-frame-pointer and /Oy allow this use. Options -fno-omit-frame-pointer and /Oy- disallow it. Some debuggers expect EBP to be used as a stack frame pointer, and cannot produce a stack backtrace unless this is so. The -fno-omit-frame-pointer and /Oy- options direct the compiler to generate code that maintains and uses EBP as a stack frame pointer for all functions so that a debugger can still produce a stack backtrace without doing the following:

• For -fno-omit-frame-pointer: turning off optimizations with -O0 • For /Oy-: turning off /O1, /O2, or /O3 optimizations

The -fno-omit-frame-pointer option is set when you specify option -O0 or the -g option. The -fomit-frame-pointer option is set when you specify option -O1, -O2, or -O3. The /Oy option is set when you specify the /O1, /O2, or /O3 option. Option /Oy- is set when you specify the /Od option. Using the -fno-omit-frame-pointer or /Oy- option reduces the number of available general-purpose registers by 1, and can result in slightly less efficient code.

NOTE. There is currently an issue with GCC 3.2 exception handling. Therefore, the Intel compiler ignores this option when GCC 3.2 is installed for C++ and exception handling is turned on (the default).

Alternate Options VxWorks: -fp (this is a deprecated option) Windows: None


fp This is a deprecated, alternate option name for option -fomit-frame-pointer.

fp-model, fp Controls the semantics of floating-point calculations.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fp-model keyword

Arguments

keyword Specifies the semantics to be used. Possible values are:

precise Enables value-safe optimizations on floating-point data.

fast[=1|2] Enables more aggressive optimizations on floating-point data.

strict Enables precise and except, disables contractions, and enables pragma stdc fenv_access.

source Rounds intermediate results to source-defined precision and enables value-safe optimizations.

double Rounds intermediate results to 53-bit (double) precision.

extended Rounds intermediate results to 64-bit (extended) precision.

[no-]except (VxWorks) or except[-] (Windows) Determines whether floating-point exception semantics are used.


Default

-fp-model fast=1 or /fp:fast=1 The compiler uses more aggressive optimizations on floating-point calculations.

Description This option controls the semantics of floating-point calculations. The keywords can be considered in groups:

• Group A: precise, fast, strict • Group B: source, double, extended • Group C: except (or the negative form)

You can use more than one keyword. However, the following rules apply:

• You cannot specify fast and except together in the same compilation. You can specify any other combination of group A, group B, and group C. Since fast is the default, you must not specify except without a group A or group B keyword. • You should specify only one keyword from group A and only one keyword from group B. If you try to specify more than one keyword from either group A or group B, the last (rightmost) one takes effect. • If you specify except more than once, the last (rightmost) one takes effect.

Option Description

-fp-model precise or /fp:precise Tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations. It disables optimizations that can change the result of floating-point calculations, which is required for strict ANSI conformance. These semantics ensure the accuracy of floating-point computations, but they may slow performance. The compiler assumes the default floating-point environment; you are not allowed to modify it.


Intermediate results are computed with the precision shown in the following table, unless it is overridden by a keyword from Group B:

Architecture            Windows    Linux and VxWorks
IA-32 architecture      Double     Extended
Intel® 64 architecture  Source     Source

Floating-point exception semantics are disabled by default. To enable these semantics, you must also specify -fp-model except or /fp:except.

-fp-model fast[=1|2] or /fp:fast[=1|2] Tells the compiler to use more aggressive optimizations when implementing floating-point calculations. These optimizations increase speed, but may alter the accuracy of floating-point computations. Specifying fast is the same as specifying fast=1. fast=2 may produce faster and less accurate results. Floating-point exception semantics are disabled by default and they cannot be enabled because you cannot specify fast and except together in the same compilation. To enable exception semantics, you must explicitly specify another keyword (see other keyword descriptions for details).



-fp-model strict or /fp:strict Tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations and enables floating-point exception semantics. This is the strictest floating-point model. The compiler does not assume the default floating-point environment; you are allowed to modify it. Floating-point exception semantics can be disabled by explicitly specifying -fp-model no-except or /fp:except-.

-fp-model source or /fp:source This option causes intermediate results to be rounded to the precision defined in the source code. It also implies keyword precise unless it is overridden by a keyword from Group A. Intermediate expressions use the precision of the operand with higher precision, if any.

long double: 64-bit precision, 80-bit data type, 15-bit exponent

double: 53-bit precision, 64-bit data type, 11-bit exponent; on Windows systems using IA-32 architecture, the exponent may be 15-bit if an x87 register is used to hold the value

float: 24-bit precision, 32-bit data type, 8-bit exponent

The compiler assumes the default floating-point environment; you are not allowed to modify it.

-fp-model double or /fp:double This option causes intermediate results to be rounded as follows:

53-bit (double) precision, 64-bit data type, 11-bit exponent; on Windows systems using IA-32 architecture, the exponent may be 15-bit if an x87 register is used to hold the value.

This option also implies keyword precise unless it is overridden by a keyword from Group A. The compiler assumes the default floating-point environment; you are not allowed to modify it.

-fp-model extended or /fp:extended This option causes intermediate results to be rounded as follows:

64-bit (extended) precision, 80-bit data type, 15-bit exponent.


This option also implies keyword precise unless it is overridden by a keyword from Group A. The compiler assumes the default floating-point environment; you are not allowed to modify it.

-fp-model except or /fp:except Tells the compiler to use floating-point exception semantics.

This option determines the setting for the maximum allowable relative error for math library function results (max-error) if none of the following options are specified:

• -fimf-accuracy-bits (VxWorks* OS) or /Qimf-accuracy-bits (Windows* OS) • -fimf-max-error (VxWorks OS) or /Qimf-max-error (Windows OS) • -fimf-precision (VxWorks OS) or /Qimf-precision (Windows OS) • -[no-]fast-transcendentals (VxWorks OS )

NOTE. On Windows, VxWorks and Linux operating systems on IA-32 architecture, the compiler, by default, implements floating-point (FP) arithmetic using SSE2 and SSE instructions. This can cause differences in floating-point results when compared to previous x87 implementations.

This option enables extra optimization that only applies to Intel® processors.

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors.


While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options None

See Also • O • mp1, Qprec • fimf-accuracy-bits, Qimf-accuracy-bits • fimf-max-error, Qimf-max-error • fimf-precision, Qimf-precision • fast-transcendentals, Qfast-transcendentals The MSDN article Microsoft Visual C++ Floating-Point Optimization, which discusses concepts that apply to this option.

fp-port Rounds floating-point results after floating-point operations.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -fp-port -no-fp-port

Arguments None

Default

-no-fp-port or /Qfp-port- The default rounding behavior depends on the compiler's code generation decisions and the precision parameters of the operating system.

Description This option rounds floating-point results after floating-point operations. Rounding to user-specified precision occurs at assignments and type conversions. This has some impact on speed. The default is to keep results of floating-point operations in higher precision. This provides better performance but less consistent floating-point results.

Alternate Options None

fp-speculation Tells the compiler the mode in which to speculate on floating-point operations.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fp-speculation=mode


Arguments

mode Is the mode for floating-point operations. Possible values are: fast Tells the compiler to speculate on floating-point operations. safe Tells the compiler to disable speculation if there is a possibility that the speculation may cause a floating-point exception. strict Tells the compiler to disable speculation on floating-point operations. off This is the same as specifying strict.

Default

-fp-speculation=fast or /Qfp-speculation:fast The compiler speculates on floating-point operations. This is also the behavior when optimizations are enabled. However, if you specify no optimizations (-O0 on Linux* OS; /Od on Windows* OS), the default is -fp-speculation=safe (Linux* OS).

Description This option tells the compiler the mode in which to speculate on floating-point operations.

Alternate Options None

See Also Floating-point Operations: Floating-point Options Quick Reference

fp-stack-check Tells the compiler to generate extra code after every function call to ensure that the floating-point stack is in the expected state.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -fp-stack-check

Arguments None

Default

OFF There is no checking to ensure that the floating-point (FP) stack is in the expected state.

Description This option tells the compiler to generate extra code after every function call to ensure that the floating-point (FP) stack is in the expected state. By default, there is no checking. So when the FP stack overflows, a NaN value is put into FP calculations and the program's results differ. Unfortunately, the overflow point can be far away from the point of the actual bug. This option places code that causes an access violation exception immediately after an incorrect call occurs, thus making it easier to locate these issues.

Alternate Options None

fp-trap Sets the floating-point trapping mode for the main routine.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fp-trap=mode[,mode,...]


Arguments

mode Is the floating-point trapping mode. If you specify more than one mode value, the list is processed sequentially from left to right. Possible values are: [no]divzero Enables or disables the IEEE trap for division by zero. [no]inexact Enables or disables the IEEE trap for inexact result. [no]invalid Enables or disables the IEEE trap for invalid operation. [no]overflow Enables or disables the IEEE trap for overflow. [no]underflow Enables or disables the IEEE trap for underflow. [no]denormal Enables or disables the trap for denormal. all Enables all of the above traps. none Disables all of the above traps. common Sets the most commonly used IEEE traps: division by zero, invalid operation, and overflow.

Default

-fp-trap=none or /Qfp-trap:none No traps are enabled when a program starts.

Description This option sets the floating-point trapping mode for the main routine. It does not set a handler for floating-point exceptions. The [no] form of a mode value is only used to modify the meaning of mode values all and common, and can only be used with one of those values. The [no] form of the option by itself does not explicitly cause a particular trap to be disabled. Use mode value inexact with caution. This results in the trap being enabled whenever a floating-point value cannot be represented exactly, which can cause unexpected results.


If mode value underflow is specified, the compiler ignores the FTZ (flush-to-zero) bit state of Intel® Streaming SIMD Extensions (Intel® SSE) floating-point units. When a DAZ (denormals are zero) bit is set in an Intel® SSE floating-point unit control word, a denormal operand exception is never generated. To set the floating-point trapping mode for all routines, specify option -fp-trap-all (VxWorks).

NOTE. Option -[no-]ftz (VxWorks) and /Qftz[-] (Windows) can be used to set or reset the FTZ and the DAZ hardware flags.

Alternate Options None

See Also • ftz, Qftz • fp-trap-all, Qfp-trap-all

fp-trap-all Sets the floating-point trapping mode for all routines.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fp-trap-all=mode[,mode,...]

Arguments mode Is the floating-point trapping mode. If you specify more than one mode value, the list is processed sequentially from left to right. Possible values are: [no]divzero Enables or disables the IEEE trap for division by zero.


[no]inexact Enables or disables the IEEE trap for inexact result. [no]invalid Enables or disables the IEEE trap for invalid operation. [no]overflow Enables or disables the IEEE trap for overflow. [no]underflow Enables or disables the IEEE trap for underflow. [no]denormal Enables or disables the trap for denormal. all Enables all of the above traps. none Disables all of the above traps. common Sets the most commonly used IEEE traps: division by zero, invalid operation, and overflow.

Default

-fp-trap-all=none or /Qfp-trap-all:none No traps are enabled for any routine.

Description This option sets the floating-point trapping mode for all routines. It does not set a handler for floating-point exceptions. The [no] form of a mode value is only used to modify the meaning of mode values all and common, and can only be used with one of those values. The [no] form of the option by itself does not explicitly cause a particular trap to be disabled. Use mode value inexact with caution. This results in the trap being enabled whenever a floating-point value cannot be represented exactly, which can cause unexpected results. If mode value underflow is specified, the compiler ignores the FTZ (flush-to-zero) bit state of Intel® Streaming SIMD Extensions (Intel® SSE) floating-point units. When a DAZ (denormals are zero) bit is set in an Intel® SSE floating-point unit control word, a denormal operand exception is never generated. To set the floating-point trapping mode for the main routine only, specify option -fp-trap (VxWorks).


NOTE. Option -[no-]ftz (VxWorks) and /Qftz[-] (Windows) can be used to set or reset the FTZ and the DAZ hardware flags.

Alternate Options None

See Also • ftz, Qftz • fp-trap, Qfp-trap

fpack-struct Specifies that structure members should be packed together.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fpack-struct

Arguments None

Default

OFF

Description Specifies that structure members should be packed together. Note: Using this option may result in code that is not usable with the standard (system) C and C++ libraries.

Alternate Options -Zp1


fpascal-strings Allow for Pascal-style string literals.

Architectures IA-32 architecture

Syntax

VxWorks: -fpascal-strings

Arguments None

Default

OFF The compiler does not allow for Pascal-style string literals.

Description Allow for Pascal-style string literals.

Alternate Options None

fpermissive Allow for non-conformant code.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fpermissive

Arguments None


Default

OFF

Description Allow for non-conformant code.

Alternate Options None

fpic Determines whether the compiler generates position-independent code.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fpic -fno-pic

Arguments None

Default

-fno-pic On systems using IA-32 or Intel® 64 architecture, the compiler does not generate position-independent code.

Description This option determines whether the compiler generates position-independent code. Option -fpic specifies full symbol preemption. Global symbol definitions as well as global symbol references get default (that is, preemptable) visibility unless explicitly specified otherwise. Option -fno-pic is only valid on systems using IA-32 or Intel® 64 architecture.


On systems using IA-32 or Intel® 64 architecture, -fpic must be used when building shared objects. This option can also be specified as -fPIC.

Alternate Options None

fpie Tells the compiler to generate position-independent code. The generated code can only be linked into executables.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fpie

Arguments None

Default

OFF The compiler does not generate position-independent code for an executable-only object.

Description This option tells the compiler to generate position-independent code. It is similar to -fpic, but code generated by -fpie can only be linked into an executable. Because the object is linked into an executable, this option causes better optimization of some symbol references. To ensure that run-time libraries are set up properly for the executable, you should also specify option -pie to the compiler driver on the link command line. Option -fpie can also be specified as -fPIE.


Alternate Options None

See Also • fpic • pie

freg-struct-return Return struct and union values in registers when possible.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -freg-struct-return

Arguments None

Default

OFF

Description Return struct and union values in registers when possible.

Alternate Options None


fshort-enums Tells the compiler to allocate as many bytes as needed for enumerated types.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fshort-enums

Arguments None

Default

OFF The compiler allocates a default number of bytes for enumerated types.

Description This option tells the compiler to allocate as many bytes as needed for enumerated types.

Alternate Options None

fsource-asm Produces an assembly listing with source code annotations.

IDE Equivalent None

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -fsource-asm

Arguments None

Default

OFF No source code annotations appear in the assembly listing file, if one is produced.

Description This option produces an assembly listing file with source code annotations. The assembly listing file shows the source code as interspersed comments. To use this option, you must also specify option -S, which causes an assembly listing to be generated.

Alternate Options None

See Also • S

fstack-protector This is an alternate name for option -fstack-security-check.

fstack-security-check, GS Determines whether the compiler generates code that detects some buffer overruns.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -fstack-security-check -fno-stack-security-check

Arguments None

Default

-fno-stack-security-check or /GS- The compiler does not detect buffer overruns.

Description This option determines whether the compiler generates code that detects some buffer overruns that overwrite the return address. This is a common technique for exploiting code that does not enforce buffer size restrictions.

Alternate Options VxWorks: -f[no-]stack-protector Windows: None

fstrict-aliasing This is an alternate name for option -ansi-alias.

fsyntax-only Tells the compiler to check only for correct syntax.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fsyntax-only


Arguments None

Default

OFF Normal compilation is performed.

Description This option tells the compiler to check only for correct syntax. No object file is produced.

Alternate Options /Zs

ftemplate-depth Control the depth to which recursive templates are expanded.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ftemplate-depth-n

Arguments n The number of recursive templates that are expanded.

Default

OFF

Description Control the depth to which recursive templates are expanded. On Linux*, this option is supported only by invoking the compiler with icpc.


Alternate Options None

ftls-model Change thread local storage model.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ftls-model=model

Arguments

model Possible values are: global-dynamic, local-dynamic, initial-exec, or local-exec.

Default

OFF

Description Change thread local storage model.

Alternate Options None

ftrapuv Initializes stack local variables to an unusual value to aid error detection.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ftrapuv

Arguments None

Default

OFF The compiler does not initialize local variables.

Description This option initializes stack local variables to an unusual value to aid error detection. Normally, these local variables should be initialized in the application. The option sets any uninitialized local variables that are allocated on the stack to a value that is typically interpreted as a very large integer or an invalid address. References to these variables are then likely to cause run-time errors that can help you detect coding errors. This option sets option -g (VxWorks) and /Zi (Windows).

Alternate Options None

See Also • g, Zi, Z7


ftz Flushes denormal results to zero.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ftz -no-ftz

Arguments None

Default

Systems using IA-32 architecture and Intel® 64 architecture: -ftz or /Qftz Denormal results are flushed to zero. Every optimization level (O option), except O0, sets -ftz and /Qftz.

Description This option flushes denormal results to zero when the application is in the gradual underflow mode. It may improve performance if the denormal values are not critical to your application's behavior. This option sets or resets the FTZ and the DAZ hardware flags. If FTZ is ON, denormal results from floating-point calculations will be set to the value zero. If FTZ is OFF, denormal results remain as is. If DAZ is ON, denormal values used as input to floating-point instructions will be treated as zero. If DAZ is OFF, denormal instruction inputs remain as is. Systems using Intel® 64 architecture have both FTZ and DAZ. FTZ and DAZ are not supported on all IA-32 architectures. When -ftz (VxWorks) is used in combination with an SSE-enabling option on systems using IA-32 architecture (for example, xN or QxN), the compiler will insert code in the main routine to set FTZ and DAZ. When -ftz or /Qftz is used without such an option, the compiler will insert code to conditionally set FTZ/DAZ based on a run-time processor check. -no-ftz (VxWorks) will prevent the compiler from inserting any code that might set FTZ or DAZ.


This option only has an effect when the main program is being compiled. It sets the FTZ/DAZ mode for the process. The initial thread and any threads subsequently created by that process will operate in FTZ/DAZ mode. If this option produces undesirable results in the numerical behavior of your program, you can turn the FTZ/DAZ mode off by using -no-ftz or /Qftz- on the command line while still benefiting from the O3 optimizations.

NOTE. Options -ftz and /Qftz are performance options. Setting these options does not guarantee that all denormals in a program are flushed to zero. They only cause denormals generated at run time to be flushed to zero.

Alternate Options None

Example To see sample code showing the state of the FTZ and DAZ flags, see Setting the FTZ and DAZ Flags.

See Also • x, Qx

funroll-all-loops Unroll all loops even if the number of iterations is uncertain when the loop is entered.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -funroll-all-loops

Arguments None


Default

OFF Do not unroll all loops.

Description Unroll all loops, even if the number of iterations is uncertain when the loop is entered. There may be a performance impact with this option.

Alternate Options None

funroll-loops This is an alternate name for option -unroll.

funsigned-bitfields Determines whether the default bitfield type is changed to unsigned.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -funsigned-bitfields -fno-unsigned-bitfields

Arguments None

Default

-fno-unsigned-bitfields The default bitfield type is signed.

Description This option determines whether the default bitfield type is changed to unsigned.


Alternate Options None

funsigned-char Change default char type to unsigned.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -funsigned-char

Arguments None

Default

OFF Do not change default char type to unsigned.

Description Change default char type to unsigned.

Alternate Options None


fvar-tracking This is an alternate name for option -debug variable-locations.

fvar-tracking-assignments This is an alternate name for option -debug semantic-stepping.

fverbose-asm Produces an assembly listing with compiler comments, including options and version information.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fverbose-asm -fno-verbose-asm

Arguments None

Default

-fno-verbose-asm No source code annotations appear in the assembly listing file, if one is produced.

Description This option produces an assembly listing file with compiler comments, including options and version information. To use this option, you must also specify -S, which sets -fverbose-asm. If you do not want this default when you specify -S, specify -fno-verbose-asm.


Alternate Options None

See Also • S fvisibility Specifies the default visibility for global symbols or the visibility for symbols in a file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fvisibility=keyword -fvisibility-keyword=filename

Arguments

keyword Specifies the visibility setting. Possible values are: default Sets visibility to default. extern Sets visibility to extern. hidden Sets visibility to hidden. protected Sets visibility to protected.

filename Is the pathname of a file containing the list of symbols whose visibility you want to set. The symbols must be separated by whitespace (spaces, tabs, or newlines).

Default

-fvisibility=default The compiler sets visibility of symbols to default.


Description This option specifies the default visibility for global symbols (syntax -fvisibility=keyword) or the visibility for symbols in a file (syntax -fvisibility-keyword=filename). Visibility specified by -fvisibility-keyword=filename overrides visibility specified by -fvisibility=keyword for symbols specified in a file.

Option Description

-fvisibility=default -fvisibility-default=filename Sets visibility of symbols to default. This means other components can reference the symbol, and the symbol definition can be overridden (preempted) by a definition of the same name in another component.

-fvisibility=extern -fvisibility-extern=filename Sets visibility of symbols to extern. This means the symbol is treated as though it is defined in another component. It also means that the symbol can be overridden by a definition of the same name in another component.

-fvisibility=hidden -fvisibility-hidden=filename Sets visibility of symbols to hidden. This means that other components cannot directly reference the symbol. However, its address may be passed to other components indirectly.

-fvisibility=protected -fvisibility-protected=filename Sets visibility of symbols to protected. This means other components can reference the symbol, but it cannot be overridden (preempted) by a definition of the same name in another component.

If an -fvisibility option is specified more than once on the command line, the last specification takes precedence over any others. If a symbol appears in more than one visibility filename, the setting with the least visibility takes precedence. The following shows the precedence of the visibility settings (from greatest to least visibility): • extern • default


• protected • hidden

Note that extern visibility only applies to functions. If a variable symbol is specified as extern, it is assumed to be default.

Alternate Options None

Example A file named prot.txt contains symbols a, b, c, d, and e. Consider the following: -fvisibility-protected=prot.txt

This option sets protected visibility for all the symbols in the file. It has the same effect as specifying fvisibility=protected in the declaration for each of the symbols.

fvisibility-inlines-hidden Causes inline member functions (those defined in the class declaration) to be marked hidden.

Architectures IA-32 architecture

Syntax

VxWorks: -fvisibility-inlines-hidden

Arguments None

Default

OFF The compiler does not cause inline member functions to be marked hidden.


Description Causes inline member functions (those defined in the class declaration) to be marked hidden. This option is particularly useful for templates.

Alternate Options None

fzero-initialized-in-bss, Qzero-initialized-in-bss Determines whether the compiler places in the DATA section any variables explicitly initialized with zeros.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -fzero-initialized-in-bss -fno-zero-initialized-in-bss

Arguments None

Default

-fno-zero-initialized-in-bss or /Qzero-initialized-in-bss- Variables explicitly initialized with zeros are placed in the BSS section. This can save space in the resulting code.

Description This option determines whether the compiler places in the DATA section any variables explicitly initialized with zeros. If option -fno-zero-initialized-in-bss (VxWorks) is specified, the compiler places in the DATA section any variables that are initialized to zero.


Alternate Options None

G Options

g, Zi, Z7 Tells the compiler to generate full debugging information in the object file or a project database (PDB) file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -g

Arguments None

Default

OFF No debugging information is produced in the object file or in a PDB file.

Description Options -g (VxWorks* OS) and /Z7 (Windows*) tell the compiler to generate symbolic debugging information in the object file, which increases the size of the object file. The /Zi option (Windows) tells the compiler to generate symbolic debugging information in a PDB file. The compiler does not support the generation of debugging information in assemblable files. If you specify these options, the resulting object file will contain debugging information, but the assemblable file will not. These options turn off O2 and make O0 (VxWorks OS) the default unless O2 (or higher) is explicitly specified in the same command line.


On Linux* OS and VxWorks* OS using Intel® 64 architecture and VxWorks* OS systems using IA-32 architecture, specifying the -g or -O0 option sets the -fno-omit-frame-pointer option.

Alternate Options

VxWorks: None Windows: /Zi, /debug:full, /debug:all, /debug, or /ZI

See Also • Z Options

g0 Disables generation of symbolic debug information.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -g0

Arguments None

Default

OFF The compiler generates symbolic debug information.

Description This option disables generation of symbolic debug information.

Alternate Options None

gcc Determines whether certain GNU macros are defined or undefined.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -gcc -no-gcc -gcc-sys

Arguments None

Default

-gcc The compiler defines the GNU macros __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__.

Description This option determines whether the GNU macros __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__ are defined or undefined.

Option Description

-gcc Defines GNU macros.

-no-gcc Undefines GNU macros.

-gcc-sys Defines GNU macros only during compilation of system headers.

Alternate Options None


gcc-name Specifies the name of the gcc compiler that should be used to set up the environment for C compilations.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -gcc-name=name

Arguments

name Is the name of the gcc compiler to use. It can include a full path.

Default

OFF The compiler locates the gcc libraries in the gcc install directory.

Description This option specifies the name of the gcc compiler that should be used to set up the environment for C compilations. This option is helpful when you are referencing a non-standard gcc installation. The C++ equivalent to option -gcc-name is -gxx-name.

Alternate Options None

See Also • gxx-name • cxxlib


gcc-version Provides compatible behavior with gcc.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -gcc-version=n

Arguments

n Is the gcc compatibility version. Possible values are: 320 Specifies gcc 3.2 compatibility. 330 Specifies gcc 3.3 compatibility. 340 Specifies gcc 3.4 compatibility. 400 Specifies gcc 4.0 compatibility. 410 Specifies gcc 4.1 compatibility. 411 Specifies gcc 4.1.1 compatibility. 420 Specifies gcc 4.2 compatibility. 430 Specifies gcc 4.3 compatibility.

Default

OFF This option defaults to the installed version of gcc.

Description This option provides compatible behavior with gcc. It selects the version of gcc with which you achieve ABI interoperability.

Alternate Options None

gdwarf-2 Enables generation of debug information using the DWARF2 format.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -gdwarf-2

Arguments None

Default

OFF No debug information is generated. However, if compiler option -g is specified, debug information is generated in the latest DWARF format, which is currently DWARF2.

Description This option enables generation of debug information using the DWARF2 format. This is currently the default when compiler option -g is specified.

Alternate Options None

See Also • g


global-hoist Enables certain optimizations that can move memory loads to a point earlier in the program execution than where they appear in the source.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -global-hoist -no-global-hoist

Arguments None

Default

-global-hoist or /Qglobal-hoist Certain optimizations are enabled that can move memory loads.

Description This option enables certain optimizations that can move memory loads to a point earlier in the program execution than where they appear in the source. In most cases, these optimizations are safe and can improve performance. The -no-global-hoist (VxWorks) option is useful for some applications, such as those that use shared or dynamically mapped memory, which can fail if a load is moved too early in the execution stream (for example, before the memory is mapped).

Alternate Options None

guide Lets you set a level of guidance for auto-vectorization and data transformation.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -guide[=n]

Arguments

n Is an optional value specifying the level of guidance to be provided. The values available are 1 through 4. Value 1 indicates a standard level of guidance. Value 4 indicates the most advanced level of guidance. If n is omitted, the default is 4.

Default

OFF You do not receive guidance about how to improve optimizations for vectorization and data transformation.

Description This option lets you set a level of guidance (advice) for auto-vectorization and data transformation. It causes the compiler to generate messages suggesting ways to improve these optimizations. When this option is specified, the compiler does not produce any objects or executables. You can set levels of guidance for the individual guide optimizations by specifying one of the following options:

data transformation: -guide-data-trans (VxWorks OS)
auto-vectorization: -guide-vec (VxWorks OS)

If you specify -guide or /Qguide and also specify one of the options setting a level of guidance for an individual guide optimization, the value set for the individual guide optimization will override the setting specified in -guide or /Qguide. If you do not specify -guide or /Qguide, but specify one of the options setting a level of guidance for an individual guide optimization, option -guide or /Qguide is enabled with the greatest value passed among the individual guide optimizations specified. In debug mode, this option has no effect unless option O2 (or higher) is explicitly specified in the same command line.


NOTE. You can specify -diag-disable (VxWorks OS) or /Qdiag-disable (Windows OS) to prevent the compiler from issuing one or more diagnostic messages.

Alternate Options None

See Also • guide-data-trans, Qguide-data-trans • guide-vec, Qguide-vec • guide-opts, Qguide-opts • diag and Qdiag

guide-data-trans Lets you set a level of guidance for data transformation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -guide-data-trans[=n]

Arguments n Is an optional value specifying the level of guidance to be provided. The values available are 1 and 2. Value 1 indicates a standard level of guidance. Value 2 indicates a more advanced level of guidance. If n is omitted, the default is 2.

Default

OFF You do not receive guidance about how to improve optimizations for data transformation.


Description This option lets you set a level of guidance for data transformation. It causes the compiler to generate messages suggesting ways to improve that optimization.

Alternate Options None

See Also • guide, Qguide • guide-vec, Qguide-vec

guide-opts Tells the compiler to analyze certain code and generate recommendations that may improve optimizations.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -guide-opts=string

Arguments

string Is the text denoting the code to analyze. The string must appear within quotes. It can take one or more of the following forms:

filename filename, routine filename, range [, range]... filename, routine, range [, range]...

If you specify more than one of the above forms in a string, a semicolon must appear between each form. If you specify more than one range in a string, a comma must appear between each range. Optional blanks can follow each parameter in the forms above and they can also follow each form in a string.

filename Specifies the name of a file to be analyzed. It can include a path. If you do not specify a path, the compiler looks for filename in the current working directory.

routine Specifies the name of a routine to be analyzed. You can include an identifying parameter. The name, including any parameter, must be enclosed in single quotes. The compiler tries to uniquely identify the routine that corresponds to the specified routine name. It may select multiple routines to analyze, especially if the following is true:

• More than one routine has the specified routine name, so the routine cannot be uniquely identified. • No parameter information has been specified to narrow the number of routines selected as matches.

range Specifies a range of line numbers to analyze in the file or routine specified. The range must be specified in integers in the form: first_line_number-last_line_number The hyphen between the line numbers is required.


Default

OFF You do not receive guidance on how to improve optimizations. However, if you specify option -guide (VxWorks* OS), the compiler analyzes and generates recommendations for all the code in an application.

Description This option tells the compiler to analyze certain code and generate recommendations that may improve optimizations. This option is ignored unless you also specify one or more of the following options: -guide, -guide-vec, or -guide-data-trans (VxWorks* OS). When option -guide-opts or /Qguide-opts is specified, a message is output that includes which parts of the input files are being analyzed. If a routine is selected to be analyzed, the complete routine name will appear in the generated message. When inlining is involved, you should specify callee line numbers. Generated messages also use callee line numbers.

Alternate Options None

Example Consider the following: -guide-opts="v.c, 1-10; v2.c, 1-40, 50-90, 100-200; v5.c, 300-400; x.c, 'funca(int)', 22-44, 55-77, 88-99; y.c, 'funcb'"

The above command causes the following to be analyzed:

file v.c, line numbers 1 to 10 file v2.c, line numbers 1 to 40, 50 to 90, and 100 to 200 file v5.c, line numbers 300 to 400 file x.c, routine funca with parameter (int), line numbers 22 to 44, 55 to 77, and 88 to 99 file y.c, routine funcb

See Also • guide, Qguide • guide-data-trans, Qguide-data-trans


• guide-vec, Qguide-vec

guide-vec Lets you set a level of guidance for auto-vectorization.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -guide-vec[=n]

Arguments n Is an optional value specifying the level of guidance to be provided. The values available are 1 and 2. Value 1 indicates a standard level of guidance. Value 2 indicates a more advanced level of guidance. If n is omitted, the default is 2.

Default

OFF You do not receive guidance about how to improve optimizations for vectorization.

Description This option lets you set a level of guidance for auto-vectorization. It causes the compiler to generate messages suggesting ways to improve that optimization.

Alternate Options None

See Also • guide, Qguide • guide-data-trans, Qguide-data-trans


gxx-name Specifies the name of the g++ compiler that should be used to set up the environment for C++ compilations.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -gxx-name=name

Arguments

name Is the name of the g++ compiler to use. It can include a full path.

Default

OFF The compiler uses the PATH setting to find the g++ compiler and resolve environment settings.

Description This option specifies the name of the g++ compiler that should be used to set up the environment for C++ compilations. The C equivalent to option -gxx-name is -gcc-name.

NOTE. When compiling a C++ file with icc, g++ is used to get the environment.

Alternate Options None

See Also • gcc-name


H Options

H Tells the compiler to display the include file order and continue compilation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -H

Arguments None

Default

OFF Compilation occurs as usual.

Description This option tells the compiler to display the include file order and continue compilation.

Alternate Options None

help Displays all available compiler options or a category of compiler options.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -help [category]

Arguments

category Is a category or class of options to display. Possible values are:
advanced Displays advanced optimization options that allow fine tuning of compilation or allow control over advanced features of the compiler.
codegen Displays Code Generation options.
compatibility Displays options affecting language compatibility.
component Displays options for component control.
data Displays options related to interpretation of data in programs or the storage of data.
deprecated Displays options that have been deprecated.
diagnostics Displays options that affect diagnostic messages displayed by the compiler.
float Displays options that affect floating-point operations.
help Displays all the available help categories.
inline Displays options that affect inlining.
ipo Displays Interprocedural Optimization (IPO) options.
language Displays options affecting the behavior of the compiler language features.
link Displays linking or linker options.
misc Displays miscellaneous options that do not fit within other categories.


opt Displays options that help you optimize code.
output Displays options that provide control over compiler output.
pgo Displays Profile Guided Optimization (PGO) options.
preproc Displays options that affect preprocessing operations.
reports Displays options for optimization reports.

Default

OFF No list is displayed unless this compiler option is specified.

Description This option displays all available compiler options or a category of compiler options. If category is not specified, all available compiler options are displayed. This option can also be specified as --help.

Alternate Options VxWorks: None Windows: /?

help-pragma Displays all supported pragmas.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -help-pragma

Arguments None


Default

OFF No list is displayed unless this compiler option is specified.

Description This option displays all supported pragmas and shows their syntaxes.

Alternate Options None

I Options

I Specifies an additional directory to search for include files.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Idir

Arguments

dir Is the additional directory for the search.

Default

OFF The default directory is searched for include files.

Description This option specifies an additional directory to search for include files. To specify multiple directories on the command line, repeat the include option for each directory.

Alternate Options None

i-dynamic This is a deprecated option. Its replacement is shared-intel.

i-static This is a deprecated option. Its replacement is static-intel.

icc Determines whether certain Intel compiler macros are defined or undefined.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -icc -no-icc

Arguments None

Default

-icc The __ICC and __INTEL_COMPILER macros are set to represent the current version of the compiler.

Description This option determines whether certain Intel compiler macros are defined or undefined. If you specify -no-icc, the compiler undefines the __ICC and __INTEL_COMPILER macros. These macros are defined by default or by specifying -icc.

Alternate Options None


idirafter Adds a directory to the second include file search path.

IDE Equivalent None

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -idirafterdir

Arguments

dir Is the name of the directory to add.

Default

OFF Include file search paths include certain default directories.

Description This option adds a directory to the second include file search path (after -I).

Alternate Options None

imacros Allows a header to be specified that is included in front of the other headers in the translation unit.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -imacros file

Arguments file Name of header file.

Default

OFF

Description Allows a header to be specified that is included in front of the other headers in the translation unit.

Alternate Options None

inline-calloc Tells the compiler to inline calls to calloc() as calls to malloc() and memset().

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-calloc -no-inline-calloc

Arguments None

Default

-no-inline-calloc or /Qinline-calloc- The compiler does not inline calls to calloc(); calls to calloc() remain as is.

Description This option tells the compiler to inline calls to calloc() as calls to malloc() and memset(). This enables additional memset() optimizations. For example, it can enable inlining as a sequence of store operations when the size is a compile time constant.

NOTE. Many routines in the supplied libraries are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options None

inline-debug-info Produces enhanced source position information for inlined code. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-debug-info

Arguments None

Default

OFF No enhanced source position information is produced for inlined code.

Description This option produces enhanced source position information for inlined code. This leads to greater accuracy when reporting the source location of any instruction. It also provides enhanced debug information useful for function call traceback. To use this option for debugging, you must also specify a debug enabling option, such as -g (Linux).

Alternate Options VxWorks: -debug inline-debug-info Windows: None


inline-factor Specifies the percentage multiplier that should be applied to all inlining options that define upper limits.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-factor=n -no-inline-factor

Arguments

n Is a positive integer specifying the percentage value. The default value is 100 (a factor of 1).

Default

-no-inline-factor or /Qinline-factor- The compiler uses default heuristics for inline routine expansion.

Description This option specifies the percentage multiplier that should be applied to all inlining options that define upper limits:

• -inline-max-size and /Qinline-max-size • -inline-max-total-size and /Qinline-max-total-size • -inline-max-per-routine and /Qinline-max-per-routine • -inline-max-per-compile and /Qinline-max-per-compile

This option takes the default value for each of the above options and multiplies it by n divided by 100. For example, if 200 is specified, all inlining options that define upper limits are multiplied by a factor of 2. This option is useful if you do not want to individually increase each option limit. If you specify -no-inline-factor (VxWorks), the following occurs:


• Every function is considered to be a small or medium function; there are no large functions. • There is no limit to the size a routine may grow when inline expansion is performed. • There is no limit to the number of times some routine may be inlined into a particular routine. • There is no limit to the number of times inlining can be applied to a compilation unit.
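The multiplier rule described above can be expressed as a one-line computation (illustrative only; the default limits themselves are chosen internally by the compiler):

```c
/* Effective inlining limit under -inline-factor=n: the option's
 * default limit multiplied by n/100. For example, -inline-factor=200
 * doubles every upper limit. */
int scaled_inline_limit(int default_limit, int factor_percent) {
    return default_limit * factor_percent / 100;
}
```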

To see compiler values for important inlining limits, specify compiler option -opt-report (VxWorks).

CAUTION. When you use this option to increase default limits, the compiler may do so much additional inlining that it runs out of memory and terminates with an "out of memory" message.

Alternate Options None

See Also • inline-max-size, Qinline-max-size • inline-max-total-size, Qinline-max-total-size • inline-max-per-routine, Qinline-max-per-routine • inline-max-per-compile, Qinline-max-per-compile • opt-report, Qopt-report

inline-forceinline Specifies that an inline routine should be inlined whenever the compiler can do so.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-forceinline


Default

OFF The compiler uses default heuristics for inline routine expansion.

Description This option specifies that an inline routine should be inlined whenever the compiler can do so. This causes routines marked with an inline keyword or attribute to be treated as if they were "forceinline".

NOTE. Because C++ member functions whose definitions are included in the class declaration are considered inline functions by default, using this option will also make these member functions "forceinline" functions.

The "forceinline" condition can also be specified by using the keyword __forceinline. To see compiler values for important inlining limits, specify compiler option -opt-report (VxWorks* OS).
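For illustration, the routines affected are those declared with the inline keyword or attribute (hypothetical example):

```c
/* Under -inline-forceinline the compiler treats a routine such as
 * square() as "forceinline": it inlines the call wherever it
 * legally can, rather than applying its usual size heuristics. */
static inline int square(int x) { return x * x; }

int sum_of_squares(int a, int b) {
    /* Both calls below become forced inlining candidates. */
    return square(a) + square(b);
}
```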

CAUTION. When you use this option to change the meaning of inline to "forceinline", the compiler may do so much additional inlining that it runs out of memory and terminates with an "out of memory" message.

Alternate Options None

See Also • opt-report, Qopt-report

inline-level, Ob Specifies the level of inline function expansion.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -inline-level=n

Arguments n Is the inline function expansion level. Possible values are 0, 1, and 2.

Default

-inline-level=2 or /Ob2 This is the default if option O2 is specified or is in effect by default. On Windows* systems, this is also the default if option O3 is specified.
-inline-level=0 or /Ob0 This is the default if option -O0 (VxWorks* OS) is specified.

Description This option specifies the level of inline function expansion. Inlining procedures can greatly improve the run-time performance of certain programs.

Option Description

-inline-level=0 or Ob0 Disables inlining of user-defined functions. Note that statement functions are always inlined.

-inline-level=1 or Ob1 Enables inlining when an inline keyword or an inline attribute is specified. Also enables inlining according to the C++ language.

-inline-level=2 or Ob2 Enables inlining of any function at the compiler's discretion.

Alternate Options Linux: -Ob (this is a deprecated option)

See Also Related Information • O Options


inline-max-per-compile Specifies the maximum number of times inlining may be applied to an entire compilation unit.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-max-per-compile=n -no-inline-max-per-compile

Arguments

n Is a positive integer that specifies the number of times inlining may be applied.

Default

-no-inline-max-per-compile or /Qinline-max-per-compile- The compiler uses default heuristics for inline routine expansion.

Description This option specifies the maximum number of times inlining may be applied to an entire compilation unit. It limits the number of times that inlining can be applied. For compilations using Interprocedural Optimizations (IPO), the entire compilation is a compilation unit. For other compilations, a compilation unit is a file. If you specify -no-inline-max-per-compile (VxWorks), there is no limit to the number of times inlining may be applied to a compilation unit. To see compiler values for important inlining limits, specify compiler option -opt-report (VxWorks).

CAUTION. When you use this option to increase the default limit, the compiler may do so much additional inlining that it runs out of memory and terminates with an "out of memory" message.


Alternate Options None

See Also • inline-factor, Qinline-factor • opt-report, Qopt-report

inline-max-per-routine Specifies the maximum number of times the inliner may inline into a particular routine.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-max-per-routine=n -no-inline-max-per-routine

Arguments n Is a positive integer that specifies the maximum number of times the inliner may inline into a particular routine.

Default

-no-inline-max-per-routine or /Qinline-max-per-routine- The compiler uses default heuristics for inline routine expansion.

Description This option specifies the maximum number of times the inliner may inline into a particular routine. It limits the number of times that inlining can be applied to any routine. If you specify -no-inline-max-per-routine (VxWorks), there is no limit to the number of times some routine may be inlined into a particular routine. To see compiler values for important inlining limits, specify compiler option -opt-report (VxWorks).



CAUTION. When you use this option to increase the default limit, the compiler may do so much additional inlining that it runs out of memory and terminates with an "out of memory" message.

Alternate Options None

See Also • inline-factor, Qinline-factor • opt-report, Qopt-report

inline-max-size Specifies the lower limit for the size of what the inliner considers to be a large routine.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-max-size=n -no-inline-max-size

Arguments

n Is a positive integer that specifies the minimum size of what the inliner considers to be a large routine.

Default

-inline-max-size or /Qinline-max-size The compiler sets the maximum size (n) dynamically, based on the platform.


Description This option specifies the lower limit for the size of what the inliner considers to be a large routine (a function). The inliner classifies routines as small, medium, or large. This option specifies the boundary between what the inliner considers to be medium and large-size routines. The inliner prefers to inline small routines. It has a preference against inlining large routines. So, any large routine is highly unlikely to be inlined. If you specify -no-inline-max-size (VxWorks), there are no large routines. Every routine is either a small or medium routine. To see compiler values for important inlining limits, specify compiler option -opt-report (VxWorks).

CAUTION. When you use this option to increase the default limit, the compiler may do so much additional inlining that it runs out of memory and terminates with an "out of memory" message.

Alternate Options None

See Also • inline-min-size, Qinline-min-size • inline-factor, Qinline-factor • opt-report, Qopt-report

inline-max-total-size Specifies how much larger a routine can normally grow when inline expansion is performed.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-max-total-size=n


-no-inline-max-total-size

Arguments

n Is a positive integer that specifies the permitted increase in the routine's size when inline expansion is performed.

Default

-no-inline-max-total-size or /Qinline-max-total-size- The compiler uses default heuristics for inline routine expansion.

Description This option specifies how much larger a routine can normally grow when inline expansion is performed. It limits the potential size of the routine. For example, if 2000 is specified for n, the size of any routine will normally not increase by more than 2000. If you specify -no-inline-max-total-size (VxWorks), there is no limit to the size a routine may grow when inline expansion is performed. To see compiler values for important inlining limits, specify compiler option -opt-report (VxWorks).

CAUTION. When you use this option to increase the default limit, the compiler may do so much additional inlining that it runs out of memory and terminates with an "out of memory" message.

Alternate Options None

See Also • inline-factor, Qinline-factor • opt-report, Qopt-report

inline-min-size Specifies the upper limit for the size of what the inliner considers to be a small routine.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-min-size=n -no-inline-min-size

Arguments n Is a positive integer that specifies the maximum size of what the inliner considers to be a small routine.

Default

-no-inline-min-size or /Qinline-min-size- The compiler uses default heuristics for inline routine expansion.

Description This option specifies the upper limit for the size of what the inliner considers to be a small routine (a function). The inliner classifies routines as small, medium, or large. This option specifies the boundary between what the inliner considers to be small and medium-size routines. The inliner has a preference to inline small routines. So, when a routine is smaller than or equal to the specified size, it is very likely to be inlined. If you specify -no-inline-min-size (VxWorks), there is no limit to the size of small routines. Every routine is a small routine; there are no medium or large routines. To see compiler values for important inlining limits, specify compiler option -opt-report (VxWorks).
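Taken together with -inline-max-size, the two boundaries partition routines as sketched below (hypothetical helper; the size metric itself is internal to the compiler):

```c
/* Inliner size classes, per the -inline-min-size / -inline-max-size
 * boundaries: 0 = small (size <= min, very likely inlined),
 * 2 = large (size > max, highly unlikely to be inlined),
 * 1 = medium (everything in between). */
int classify_routine(int size, int min_size, int max_size) {
    if (size <= min_size) return 0;  /* small  */
    if (size > max_size)  return 2;  /* large  */
    return 1;                        /* medium */
}
```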


CAUTION. When you use this option to increase the default limit, the compiler may do so much additional inlining that it runs out of memory and terminates with an "out of memory" message.

Alternate Options None

See Also • inline-max-size, Qinline-max-size • opt-report, Qopt-report

intel-extensions Enables or disables all Intel® C/C++ language extensions.

Architectures IA-32, Intel® 64 architectures

Syntax

Linux: -intel-extensions -no-intel-extensions

Arguments None

Default

OFF The Intel C/C++ language extensions are enabled.

Description This option enables or disables all Intel® C/C++ language extensions. Option -no-intel-extensions (Linux* OS) disables all Intel® C/C++ language extensions. For example, it disables support for the decimal floating-point types.


Alternate Options None

ip Determines whether additional interprocedural optimizations for single-file compilation are enabled.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ip -no-ip

Arguments None

Default

OFF Some limited interprocedural optimizations occur, including inline function expansion for calls to functions defined within the current source file. These optimizations are a subset of full intra-file interprocedural optimizations. Note that this setting is not the same as -no-ip (VxWorks).

Description This option determines whether additional interprocedural optimizations for single-file compilation are enabled. Options -ip (VxWorks) and /Qip (Windows) enable additional interprocedural optimizations for single-file compilation.


Options -no-ip (VxWorks) and /Qip- (Windows) may not disable inlining. To ensure that inlining of user-defined functions is disabled, specify -inline-level=0 or -fno-inline (VxWorks), or specify /Ob0 (Windows). To ensure that inlining of compiler intrinsic functions is disabled, specify -fno-builtin (VxWorks).

Alternate Options None

See Also • finline-functions

ip-no-inlining Disables full and partial inlining enabled by interprocedural optimization options.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ip-no-inlining

Arguments None

Default

OFF Inlining enabled by interprocedural optimization options is performed.

Description This option disables full and partial inlining enabled by the following interprocedural optimization options:

• On VxWorks systems: -ip or -ipo • On Windows systems: /Qip, /Qipo, or /Ob2

It has no effect on other interprocedural optimizations.


On Windows systems, this option also has no effect on user-directed inlining specified by option /Ob1.

Alternate Options None

ip-no-pinlining Disables partial inlining enabled by interprocedural optimization options.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ip-no-pinlining

Arguments None

Default

OFF Inlining enabled by interprocedural optimization options is performed.

Description This option disables partial inlining enabled by the following interprocedural optimization options:

• On VxWorks systems: -ip or -ipo • On Windows systems: /Qip or /Qipo

It has no effect on other interprocedural optimizations.

Alternate Options None


ipo Enables interprocedural optimization between files.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ipo[n]

Arguments

n Is an optional integer that specifies the number of object files the compiler should create. The integer must be greater than or equal to 0.

Default

OFF Multifile interprocedural optimization is not enabled.

Description This option enables interprocedural optimization between files. This is also called multifile interprocedural optimization (multifile IPO) or Whole Program Optimization (WPO). When you specify this option, the compiler performs inline function expansion for calls to functions defined in separate files. You cannot specify the names for the files that are created. If n is 0, the compiler decides whether to create one or more object files based on an estimate of the size of the application. It generates one object file for small applications, and two or more object files for large applications. If n is greater than 0, the compiler generates n object files, unless n exceeds the number of source files (m), in which case the compiler generates only m object files. If you do not specify n, the default is 0.
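The rules above for the number of object files can be summarized as follows (illustrative sketch; the n == 0 case is actually a size-based heuristic, modeled here as a single object file):

```c
/* Number of object files produced by -ipo[n] for m source files:
 * n == 0    -> the compiler decides (modeled here as 1);
 * n > m     -> only m object files are generated;
 * otherwise -> exactly n object files. */
int ipo_object_files(int n, int m) {
    if (n == 0) return 1;            /* simplified size heuristic */
    return (n < m) ? n : m;
}
```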

Alternate Options None


See Also Optimizing Applications: Interprocedural Optimization (IPO) Quick Reference Interprocedural Optimization (IPO) Overview Using IPO

ipo-c Tells the compiler to optimize across multiple files and generate a single object file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ipo-c

Arguments None

Default

OFF The compiler does not generate a multifile object file.

Description This option tells the compiler to optimize across multiple files and generate a single object file (named ipo_out.o on VxWorks systems; ipo_out.obj on Windows systems). It performs the same optimizations as -ipo (VxWorks), but compilation stops before the final link stage, leaving an optimized object file that can be used in further link steps.

Alternate Options None

See Also • ipo, Qipo


ipo-jobs Specifies the number of commands (jobs) to be executed simultaneously during the link phase of Interprocedural Optimization (IPO).

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ipo-jobsn

Arguments

n Is the number of commands (jobs) to run simultaneously. The number must be greater than or equal to 1.

Default

-ipo-jobs1 or /Qipo-jobs:1 One command (job) is executed in an interprocedural optimization parallel build.

Description This option specifies the number of commands (jobs) to be executed simultaneously during the link phase of Interprocedural Optimization (IPO). It should only be used if the link-time compilation is generating more than one object. In this case, each object is generated by a separate compilation, which can be done in parallel. This option can be affected by the following compiler options:

• -ipo (VxWorks) when applications are large enough that the compiler decides to generate multiple object files. • -ipon (VxWorks) or /Qipon (Windows) when n is greater than 1. • -ipo-separate (Linux and VxWorks)


CAUTION. Be careful when using this option. On a multi-processor system with lots of memory, it can speed application build time. However, if n is greater than the number of processors, or if there is not enough memory to avoid thrashing, this option can increase application build time.

Alternate Options None

See Also • ipo, Qipo • ipo-separate, Qipo-separate

ipo-S Tells the compiler to optimize across multiple files and generate a single assembly file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ipo-S

Arguments None

Default

OFF The compiler does not generate a multifile assembly file.

Description This option tells the compiler to optimize across multiple files and generate a single assembly file (named ipo_out.s on VxWorks systems; ipo_out.asm on Windows systems). It performs the same optimizations as -ipo (VxWorks), but compilation stops before the final link stage, leaving an optimized assembly file that can be used in further link steps.


Alternate Options None

See Also • ipo, Qipo

ipo-separate Tells the compiler to generate one object file for every source file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ipo-separate

Arguments None

Default

OFF The compiler decides whether to create one or more object files.

Description This option tells the compiler to generate one object file for every source file. It overrides any -ipo (Linux) specification.

Alternate Options None

See Also • ipo, Qipo

ipp Tells the compiler to link to some or all of the Intel® Integrated Performance Primitives (Intel® IPP) libraries.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -ipp[=lib]

Arguments
lib Indicates the Intel® IPP libraries that the compiler should link to. Possible values are:
common Tells the compiler to link using the main libraries set. This is the default if the option is specified with no lib.
crypto Tells the compiler to link using the main libraries set and the crypto library.

Default

OFF The compiler does not link to the Intel® IPP libraries.

Description This option tells the compiler to link to some or all of the Intel® Integrated Performance Primitives (Intel® IPP) libraries and include the Intel® IPP headers.

Optimization Notice

The Intel® Integrated Performance Primitives (Intel® IPP) library contains functions that are more highly optimized for Intel microprocessors than for other microprocessors. While the functions in the Intel® IPP library offer optimizations for both Intel and Intel-compatible microprocessors, depending on your code and other factors, you will likely get extra performance on Intel microprocessors.


While the paragraph above describes the basic optimization approach for the Intel® IPP library as a whole, the library may or may not be optimized to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options None

iprefix Specifies the prefix for referencing directories containing header files.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -iprefix

Arguments None

Default

OFF


Description This option specifies the prefix for referencing directories containing header files. Use it together with -iwithprefix, which appends a directory to this prefix.

Alternate Options None

iquote Adds a directory to the front of the include file search path for files included with quotes but not brackets.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -iquote dir

Arguments dir Is the name of the directory to add.

Default

OFF The compiler does not add a directory to the front of the include file search path.

Description This option adds the specified directory to the front of the include file search path. It applies to files included with quotes (#include "file.h") but not to files included with angle brackets (#include <file.h>).

Alternate Options None


isystem Specifies a directory to add to the start of the system include path.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -isystemdir

Arguments

dir Is the directory to add to the system include path.

Default

OFF The default system include path is used.

Description This option specifies a directory to add to the system include path. The compiler searches the specified directory for include files after it searches all directories specified by the -I compiler option but before it searches the standard system directories. This option is provided for compatibility with gcc.

Alternate Options None

iwithprefix Appends a directory to the prefix passed in by -iprefix and puts it on the include search path at the end of the include directories.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -iwithprefix

Arguments None

Default

OFF

Description This option appends a directory to the prefix passed in by -iprefix and puts it on the include search path at the end of the include directories.

Alternate Options None

iwithprefixbefore Similar to -iwithprefix except the include directory is placed in the same place as -I command line include directories.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -iwithprefixbefore

Arguments None

Default

OFF


Description This option is similar to -iwithprefix, except that the include directory is placed at the same position in the search order as directories specified with -I on the command line.

Alternate Options None

K Options

Kc++, TP Tells the compiler to process all source or unrecognized file types as C++ source files. This option is deprecated.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Kc++

Arguments None

Default

OFF The compiler uses default rules for determining whether a file is a C++ source file.

Description This option tells the compiler to process all source or unrecognized file types as C++ source files.

Alternate Options None


L Options

l Tells the linker to search for a specified library when linking.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -lstring

Arguments

string Specifies the library (libstring) that the linker should search.

Default

OFF The linker searches for standard libraries in standard directories.

Description This option tells the linker to search for a specified library when linking. When resolving references, the linker normally searches for libraries in several standard directories, then in directories specified by the L option, and then in the libraries specified by the l option. The linker searches and processes libraries and object files in the order they are specified, so you should specify this option following the last object file it applies to.

Alternate Options None

See Also • L


L Tells the linker to search for libraries in a specified directory before searching the standard directories.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Ldir

Arguments

dir Is the name of the directory to search for libraries.

Default

OFF The linker searches the standard directories for libraries.

Description This option tells the linker to search for libraries in a specified directory before searching for them in the standard directories.

Alternate Options None

See Also • l


M Options

m Tells the compiler to generate code specialized for the processor that executes your program.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -mcode

Arguments

code Indicates the instructions to be generated for the set of processors in each description. Many of the following descriptions refer to Intel® Streaming SIMD Extensions (Intel® SSE) and Supplemental Streaming SIMD Extensions (Intel® SSSE). Possible values are:
sse4.1 May generate Intel® SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions.
ssse3 May generate Intel® SSSE3, SSE3, SSE2, and SSE instructions.
sse3 May generate Intel® SSE3, SSE2, and SSE instructions.
sse2 May generate Intel® SSE2 and SSE instructions. This value is only available on Linux and VxWorks systems.
sse This option has been deprecated; it is now the same as specifying ia32.
ia32 Generates x86/x87 generic code that is compatible with IA-32 architecture. Disables any default extended instruction settings, and any previously set extended instruction settings. It also disables all processor-specific optimizations and instructions. This value is only available on Linux and VxWorks systems using IA-32 architecture.

Default

VxWorks systems: -msse2 For more information on the default values, see Arguments above.

Description This option tells the compiler to generate code specialized for the processor that executes your program. Code generated with these options should execute on any compatible, non-Intel processor with support for the corresponding instruction set. Options -x and -m are mutually exclusive. If both are specified, the compiler uses the last one specified and generates a warning.

Alternate Options VxWorks: None Windows: /arch

See Also • x, Qx • xHost, QxHost • ax, Qax

M Tells the compiler to generate makefile dependency lines for each source file.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -M

Arguments None

Default

OFF The compiler does not generate makefile dependency lines for each source file.

Description This option tells the compiler to generate makefile dependency lines for each source file, based on the #include lines found in the source file.
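A sketch of the dependency-line format this produces (hypothetical file names; the real output is derived from the #include lines the preprocessor actually sees):

```c
#include <stdio.h>
#include <string.h>

/* Builds a make-style dependency line of the form
 * "main.o: main.c util.h", as emitted by -M for a source file
 * that includes one header. Returns the number of characters
 * written (per snprintf). */
int dependency_line(char *buf, size_t bufsz,
                    const char *obj, const char *src, const char *hdr) {
    return snprintf(buf, bufsz, "%s: %s %s", obj, src, hdr);
}
```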

Alternate Options None

m32, m64 Tells the compiler to generate code for a specific architecture.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -m32 -m64

Arguments None


Default

OFF The compiler's behavior depends on the host system.

Description These options tell the compiler to generate code for a specific architecture.

Option Description

-m32 Tells the compiler to generate code for IA-32 architecture.

-m64 Tells the compiler to generate code for Intel® 64 architecture.

Note that these options are provided for compatibility with gcc. They are not related to the Intel® C++ compiler option arch.

Alternate Options None

malign-double Aligns double, long double, and long long types for better performance for systems based on IA-32 architecture.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -malign-double

Arguments None

Default

OFF


Description Aligns double, long double, and long long types for better performance for systems based on IA-32 architecture.

Alternate Options -align

map-opts Maps one or more compiler options to their equivalent on a different operating system.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -map-opts

Arguments None

Default

OFF No platform mappings are performed.

Description This option maps one or more compiler options to their equivalent on a different operating system. The result is output to stdout. On Windows systems, the options you provide are presumed to be Windows options, so the options that are output to stdout will be Linux equivalents. On Linux systems, the options you provide are presumed to be Linux options, so the options that are output to stdout will be Windows equivalents. The tool can be invoked from the compiler command line or it can be used directly. No compilation is performed when the option mapping tool is used. This option is useful if you have both compilers and want to convert scripts or makefiles.


NOTE. Compiler options are mapped to their equivalent on the architecture you are using. For example, if you are using a processor with IA-32 architecture, you will only see equivalent options that are available on processors with IA-32 architecture.

Alternate Options None

Example The following command line invokes the option mapping tool, which maps the Linux options to Windows-based options, and then outputs the results to stdout: icc -map-opts -xP -O2

The following command line invokes the option mapping tool, which maps the Windows options to Linux-based options, and then outputs the results to stdout: icl /Qmap-opts /QxP /O2

See Also Building Applications: Using the Option Mapping Tool

march Tells the compiler to generate code for a specified processor.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -march=processor

Arguments

processor Is the processor for which the compiler should generate code. Possible values are:


pentium3 Generates code for Intel® Pentium® III processors.
pentium4 Generates code for Intel® Pentium® 4 processors.
core2 Generates code for the Intel® Core™ 2 processor family.

Default

OFF or -march=pentium4 On IA-32 architecture, the compiler does not generate processor-specific code unless it is told to do so. On systems using Intel® 64 architecture, the compiler generates code for Intel Pentium 4 processors.

Description This option tells the compiler to generate code for a specified processor. Specifying -march=pentium4 sets -mtune=pentium4. For compatibility, a number of historical processor values are also supported, but the generated code will not differ from the default.

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and


Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options

-march=pentium3 Linux: -xSSE
-march=pentium4 Linux: -xSSE2
-march=core2 Linux: -xSSSE3

mcmodel Tells the compiler to use a specific memory model to generate code and store data.

Architectures Intel® 64 architecture

Syntax

VxWorks: -mcmodel=mem_model

Arguments

mem_model Is the memory model to use. Possible values are: small Tells the compiler to restrict code and data to the first 2GB of address space. All accesses of code and data can be done with Instruction Pointer (IP)-relative addressing. medium Tells the compiler to restrict code to the first 2GB; it places no memory restriction on data. Accesses of code can be done

318 13

with IP-relative addressing, but accesses of data must be done with absolute addressing. large Places no memory restriction on code or data. All accesses of code and data must be done with absolute addressing.

Default

-mcmodel=small On systems using Intel® 64 architecture, the compiler restricts code and data to the first 2GB of address space. Instruction Pointer (IP)-relative addressing can be used to access code and data.

Description This option tells the compiler to use a specific memory model to generate code and store data. It can affect code size and performance. If your program has global and static data with a total size smaller than 2GB, -mcmodel=small is sufficient. Global and static data larger than 2GB requires -mcmodel=medium or -mcmodel=large. Allocation of memory larger than 2GB can be done with any setting of -mcmodel. IP-relative addressing requires only 32 bits, whereas absolute addressing requires 64-bits. IP-relative addressing is somewhat faster. So, the small memory model has the least impact on performance.

NOTE. When you specify -mcmodel=medium or -mcmodel=large, you must also specify compiler option -shared-intel to ensure that the correct dynamic versions of the Intel run-time libraries are used.

Alternate Options None

Example The following example shows how to compile using -mcmodel: icc -shared-intel -mcmodel=medium -o prog prog.c

See Also • shared-intel


• fpic

mcpu This is a deprecated, alternate option for option -mtune.

mc++-abr Uses the Dinkum Abridged C++ Library instead of the Dinkum Standard C++ Library.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -mc++-abr

Arguments None

Default

OFF

Description Uses the Dinkum Abridged C++ Library instead of the Dinkum Standard C++ Library. This option defines the preprocessor macro _ECPP. You can use it with or without -membedded.

Alternate Options None

See Also • membedded


MD Preprocess and compile, generating an output file containing dependency information ending with extension .d.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -MD

Arguments None

Default

OFF The compiler does not generate dependency information.

Description Preprocess and compile, generating an output file containing dependency information ending with extension .d.

Alternate Options None

membedded Disables exceptions and run-time type information.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -membedded


Arguments None

Default

OFF

Description Disables exceptions and run-time type information (same as specifying -fno-exceptions -fno-rtti). This option also defines the preprocessor macro _NO_EX. You can use it with or without -mc++-abr.

Alternate Options -fno-exceptions -fno-rtti

See Also • mc++-abr

MF Tells the compiler to generate makefile dependency information in a file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -MFfilename

Arguments

filename Is the name of the file where the makefile dependency information should be placed.

Default

OFF The compiler does not generate makefile dependency information in files.


Description This option tells the compiler to generate makefile dependency information in a file. To use this option, you must also specify -M or -MM.

Alternate Options None

See Also • QM • QMM

MG Tells the compiler to generate makefile dependency lines for each source file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -MG

Arguments None

Default

OFF The compiler does not generate makefile dependency information in files.

Description This option tells the compiler to generate makefile dependency lines for each source file. It is similar to -M, but it treats missing header files as generated files.

Alternate Options None


See Also • QM

minstruction Determines whether MOVBE instructions are generated for Intel processors.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -minstruction=[no]movbe

Arguments None

Default

-minstruction=nomovbe or /Qinstruction:nomovbe The compiler does not generate MOVBE instructions for Intel® Atom™ processors.

Description This option determines whether MOVBE instructions are generated for Intel processors. To use this option, you must also specify -xSSE3_ATOM (VxWorks). If -minstruction=movbe or /Qinstruction:movbe is specified, the following occurs:

• MOVBE instructions are generated that are specific to the Intel® Atom™ processor. • The options are ON by default when -xSSE3_ATOM or /QxSSE3_ATOM is specified. • Generated executables can only be run on Intel® Atom™ processors or processors that support Intel® Streaming SIMD Extensions 3 (Intel® SSE3) and MOVBE.

If -minstruction=nomovbe or /Qinstruction:nomovbe is specified, the following occurs:

• The compiler optimizes code for the Intel® Atom™ processor, but it does not generate MOVBE instructions. • Generated executables can be run on non-Intel® Atom™ processors that support Intel® SSE3.


Alternate Options None

See Also • x, Qx

MM Tells the compiler to generate makefile dependency lines for each source file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -MM

Arguments None

Default

OFF The compiler does not generate makefile dependency information in files.

Description This option tells the compiler to generate makefile dependency lines for each source file. It is similar to -M, but it does not include system header files.
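A short sketch of the difference using gcc, which shares the GNU-style -MM flag (the file layout is illustrative):

```shell
# -MM lists only user headers as prerequisites; system headers such as
# <stdio.h> are omitted from the generated rule.
mkdir -p /tmp/icc_mm_demo && cd /tmp/icc_mm_demo
printf '#define N 1\n' > util.h
printf '#include <stdio.h>\n#include "util.h"\nint x = N;\n' > src.c
gcc -MM src.c > deps.txt
cat deps.txt   # e.g.:  src.o: src.c util.h   (no stdio.h)
```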

Alternate Options None

See Also • QM


MMD Tells the compiler to generate an output file containing dependency information.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -MMD

Arguments None

Default

OFF The compiler does not generate an output file containing dependency information.

Description This option tells the compiler to preprocess and compile a file, then generate an output file (with extension .d) containing dependency information. It is similar to -MD, but it does not include system header files.

Alternate Options None

mp Maintains floating-point precision while disabling some optimizations. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -mp

Arguments None

Default

OFF The compiler provides good accuracy and run-time performance at the expense of less consistent floating-point results.

Description This option maintains floating-point precision while disabling some optimizations, such as -fma (Linux) and /Qfma (Windows). It restricts optimization to maintain declared precision and to ensure that floating-point arithmetic conforms more closely to the ANSI* language and IEEE* arithmetic standards. For most programs, specifying this option adversely affects performance. If you are not sure whether your application needs this option, try compiling and running your program both with and without it to evaluate the effects on both performance and precision. The recommended method to control the semantics of floating-point calculations is to use option -fp-model.

NOTE. This option may enable extra optimization that only applies to Intel® microprocessors.

Alternate Options VxWorks: -mieee-fp Windows: None

See Also • mp1, Qprec • fp-model, fp


MP Tells the compiler to add a phony target for each dependency.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -MP

Arguments None

Default

OFF The compiler does not generate dependency information unless it is told to do so.

Description This option tells the compiler to add a phony target for each dependency. Note that this option is not related to Windows* option /MP.

Alternate Options None

multiple-processes, MP Creates multiple processes that can be used to compile large numbers of source files at the same time.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -multiple-processes[=n]

Arguments n Is the maximum number of processes that the compiler should create.

Default

OFF A single process is used to compile source files.

Description This option creates multiple processes that can be used to compile large numbers of source files at the same time. It can improve performance by reducing the time it takes to compile source files on the command line. This option causes the compiler to create one or more copies of itself, each in a separate process. These copies simultaneously compile the source files. If n is not specified for this option, the default value is as follows:

• On VxWorks OS, the value is 2.

This option applies to compilations, but not to linking or link-time code generation.

Alternate Options None

mp1 Improves floating-point precision and consistency.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -mp1


Arguments None

Default

OFF The compiler provides good accuracy and run-time performance at the expense of less consistent floating-point results.

Description This option improves floating-point consistency. It ensures the out-of-range check of operands of transcendental functions and improves the accuracy of floating-point compares. This option prevents the compiler from performing optimizations that change NaN comparison semantics and causes all values to be truncated to declared precision before they are used in comparisons. It also causes the compiler to use library routines that give better precision results compared to the X87 transcendental instructions. This option disables fewer optimizations and has less impact on performance than option mp or Op.

Alternate Options None

See Also • mp

MQ Changes the default target rule for dependency generation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -MQtarget


Arguments target Is the target rule to use.

Default

OFF The default target rule applies to dependency generation.

Description This option changes the default target rule for dependency generation. It is similar to -MT, but quotes special Make characters.

Alternate Options None

mregparm Controls the number of registers used to pass integer arguments.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -mregparm=value

Arguments value Is the number of registers to use when passing integer arguments.

Default

OFF The compiler does not use registers to pass arguments.

Description This option controls the number of registers used to pass integer arguments.


Alternate Options None

mrtp Generates code for VxWorks real time processes (RTPs).

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -mrtp

Arguments None

Default

OFF

Description Generates code for VxWorks real time processes (RTPs). This option defines the preprocessor macro __RTP__. By default, the compiler generates code for the kernel.

Alternate Options None

MT Changes the default target rule for dependency generation.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -MTtarget

Arguments target Is the target rule to use.

Default

OFF The default target rule applies to dependency generation.

Description This option changes the default target rule for dependency generation.

Alternate Options None

mtune Performs optimizations for specific processors.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -mtune=processor

Arguments processor Is the processor for which the compiler should perform optimizations. Possible values are: generic Generates code for the compiler's default behavior.


core2 Optimizes for the Intel® Core™ 2 processor family, including support for MMX™, Intel® SSE, SSE2, SSE3, and SSSE3 instruction sets.
pentium Optimizes for Intel® Pentium® processors.
pentium-mmx Optimizes for Intel® Pentium® processors with MMX technology.
pentiumpro Optimizes for Intel® Pentium® Pro, Intel Pentium II, and Intel Pentium III processors.
pentium4 Optimizes for Intel® Pentium® 4 processors.
pentium4m Optimizes for Intel® Pentium® 4 processors with MMX technology.

Default

generic Code is generated for the compiler's default behavior.

Description This option performs optimizations for specific processors. The resulting executable is backwards compatible and generated code is optimized for specific processors. For example, code generated with -mtune=pentium4 will run correctly on Intel® Core™ 2 processors, but it might not run as fast as if it had been generated using -mtune=core2. The following table shows on which architecture you can use each value.

Architecture

processor Value IA-32 architecture Intel® 64 architecture

generic X X

core2 X X

pentium X

pentium-mmx X



pentiumpro X

pentium4 X

pentium4m X

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options

VxWorks: -mcpu (this is a deprecated option)


multibyte-chars Determines whether multi-byte characters are supported.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -multibyte-chars -no-multibyte-chars

Arguments None

Default

-multibyte-chars or /Qmultibyte-chars Multi-byte characters are supported.

Description This option determines whether multi-byte characters are supported.

Alternate Options None


N Options

no-bss-init Tells the compiler to place in the DATA section any uninitialized variables and explicitly zero-initialized variables.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -no-bss-init

Arguments None

Default

OFF Uninitialized variables and explicitly zero-initialized variables are placed in the BSS section.

Description This option tells the compiler to place in the DATA section any uninitialized variables and explicitly zero-initialized variables.

Alternate Options VxWorks: -nobss-init (this is a deprecated option) Windows: None

no-libgcc Prevents the linking of certain gcc-specific libraries.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -no-libgcc

Arguments None

Default

OFF


Description This option prevents the linking of certain gcc-specific libraries. This option is not recommended for general use.

Alternate Options None

nodefaultlibs Prevents the compiler from using standard libraries when linking.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -nodefaultlibs

Arguments None

Default

OFF The standard libraries are linked.

Description This option prevents the compiler from using standard libraries when linking. It is provided for GNU compatibility.

Alternate Options None

See Also • nostdlib


nolib-inline Disables inline expansion of standard library or intrinsic functions.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -nolib-inline

Arguments None

Default

OFF The compiler inlines many standard library and intrinsic functions.

Description This option disables inline expansion of standard library or intrinsic functions. It prevents the unexpected results that can arise from inline expansion of these functions.

Alternate Options None

non-static Links an RTP executable against shared libraries rather than static libraries.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -non-static


Arguments None

Default

OFF

Description Links an RTP executable against shared libraries rather than static libraries; -static is the default.

Alternate Options None

nostartfiles Prevents the compiler from using standard startup files when linking.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -nostartfiles

Arguments None

Default

OFF The compiler uses standard startup files when linking.

Description This option prevents the compiler from using standard startup files when linking.

Alternate Options None


See Also • nostdlib

nostdinc++ Do not search for header files in the standard directories for C++, but search the other standard directories.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -nostdinc++

Arguments None

Default

OFF

Description Do not search for header files in the standard directories for C++, but search the other standard directories.

Alternate Options None

nostdlib Prevents the compiler from using standard libraries and startup files when linking.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -nostdlib

Arguments None

Default

OFF The compiler uses standard startup files and standard libraries when linking.

Description This option prevents the compiler from using standard libraries and startup files when linking. It is provided for GNU compatibility.

Alternate Options None

See Also • nodefaultlibs • nostartfiles

O Options

o Specifies the name for an output file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -o filename


Arguments

filename Is the name for the output file. The space before filename is optional.

Default

OFF The compiler uses the default file name for an output file.

Description This option specifies the name for an output file as follows:

• If -c is specified, it specifies the name of the generated object file. • If -S is specified, it specifies the name of the generated assembly listing file. • If -P is specified, it specifies the name of the generated preprocessor file.

Otherwise, it specifies the name of the executable file.

NOTE. If you misspell a compiler option beginning with "o", such as -opt-report, the compiler interprets the misspelled option as an -o filename option. For example, if you misspell "-opt-report" as "-opt-reprt", the compiler interprets it as "-o pt-reprt", where pt-reprt is the output file name.

Alternate Options VxWorks: None Windows: /Fe

See Also

O Specifies the code optimization for applications.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -O[n]

Arguments n Is the optimization level. Possible values are 1, 2, or 3. On VxWorks* OS systems, you can also specify 0.

Default

O2 Optimizes for code speed. This default may change depending on which other compiler options are specified. For details, see below.

Description This option specifies the code optimization for applications.

Option Description

O (VxWorks* OS) This is the same as specifying O2.

O0 (VxWorks OS) Disables all optimizations. This option may set other options. This is determined by the compiler, depending on which operating system and architecture you are using. The options that are set may change from release to release.

O1 Enables optimizations for speed and disables some optimizations that increase code size and affect speed. To limit code size, this option:

• Enables global optimization; this includes data-flow analysis, code motion, strength reduction and test replacement, split-lifetime analysis, and instruction scheduling. • Disables inlining of some intrinsics.


This option may set other options. This is determined by the compiler, depending on which operating system and architecture you are using. The options that are set may change from release to release. The O1 option may improve performance for applications with very large code size, many branches, and execution time not dominated by code within loops.

O2 Enables optimizations for speed. This is the generally recommended optimization level. Vectorization is enabled at O2 and higher levels. On systems using IA-32 architecture, some basic loop optimizations such as Distribution, Predicate Opt, Interchange, multi-versioning, and scalar replacements are performed. This option also enables:

• Inlining of intrinsics • Intra-file interprocedural optimization, which includes:

• inlining • constant propagation • forward substitution • routine attribute propagation • variable address-taken analysis • dead static function elimination • removal of unreferenced variables

• The following capabilities for performance gain:

• constant propagation • copy propagation • dead-code elimination • global register allocation • global instruction scheduling and control speculation


• loop unrolling • optimized code selection • partial redundancy elimination • strength reduction/induction variable simplification • variable renaming • exception handling optimizations • tail recursions • peephole optimizations • structure assignment lowering and optimizations • dead store elimination

This option may set other options, especially options that optimize for code speed. This is determined by the compiler, depending on which operating system and architecture you are using. The options that are set may change from release to release. Many routines in the shared libraries are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

O3 Performs O2 optimizations and enables more aggressive loop transformations such as Fusion, Block-Unroll-and-Jam, and collapsing IF statements. This option may set other options. This is determined by the compiler, depending on which operating system and architecture you are using. The options that are set may change from release to release. On systems using IA-32 architecture or Intel® 64 architecture, when O3 is used with options -ax or -x (Linux OS) or with options /Qax or /Qx (Windows OS), the compiler performs more aggressive data dependency analysis than for O2, which may result in longer compilation times.


The O3 optimizations may not cause higher performance unless loop and memory access transformations take place. The optimizations may slow down code in some cases compared to O2 optimizations. The O3 option is recommended for applications that have loops that heavily use floating-point calculations and process large data sets. Many routines in the shared libraries are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

The last O option specified on the command line takes precedence over any others.

Alternate Options

O0 VxWorks: None Windows: /Od

See Also

inline-level, Ob Specifies the level of inline function expansion.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -inline-level=n

Arguments

n Is the inline function expansion level. Possible values are 0, 1, and 2.


Default

-inline-level=2 or /Ob2 This is the default if option O2 is specified or is in effect by default. On Windows* systems, this is also the default if option O3 is specified.
-inline-level=0 or /Ob0 This is the default if option -O0 (VxWorks* OS) is specified.

Description This option specifies the level of inline function expansion. Inlining procedures can greatly improve the run-time performance of certain programs.

Option Description

-inline-level=0 or Ob0 Disables inlining of user-defined functions. Note that statement functions are always inlined.

-inline-level=1 or Ob1 Enables inlining when an inline keyword or an inline attribute is specified. Also enables inlining according to the C++ language.

-inline-level=2 or Ob2 Enables inlining of any function at the compiler's discretion.

Alternate Options Linux: -Ob (this is a deprecated option)

See Also • O Options

fbuiltin, Oi Enables or disables inline expansion of intrinsic functions.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -fbuiltin[-func] -fno-builtin[-func]

Arguments

func A comma-separated list of intrinsic functions.

Default

OFF Inline expansion of intrinsic functions is disabled.

Description This option enables or disables inline expansion of one or more intrinsic functions. If -func is not specified, -fno-builtin disables inline expansion for all intrinsic functions. For a list of built-in functions affected by -fbuiltin, search for "built-in functions" in the appropriate gcc* documentation. For a list of built-in functions affected by /Oi, search for "/Oi" in the appropriate Microsoft* Visual C/C++* documentation.

Alternate Options None

opt-args-in-regs Determines whether calls to routines are optimized by passing parameters in registers instead of on the stack.

Architectures IA-32 architecture

Syntax

VxWorks: -opt-args-in-regs[=keyword]


Arguments keyword Specifies whether the optimization should be performed and under what conditions. Possible values are:

none The optimization is not performed. No parameters are passed in registers; they are put on the stack.

seen Causes parameters to be passed in registers when they are passed to routines whose definition can be seen in the same compilation unit.

all Causes parameters to be passed in registers, whether or not the definition of the called routine can be seen in the same compilation unit.

Default

-opt-args-in-regs=seen or /Qopt-args-in-regs:seen Parameters are passed in registers when they are passed to routines whose definition is seen in the same compilation unit.

Description This option determines whether calls to routines are optimized by passing parameters in registers instead of on the stack. It also indicates the conditions when the optimization will be performed. This option can improve performance for Application Binary Interfaces (ABIs) that require parameters to be passed in memory and compiled without interprocedural optimization (IPO). Note that if all is specified, a small overhead may be paid when calling 'unseen' routines that have not been compiled with the same option. This is because the call will need to go through a 'thunk' to ensure that parameters are placed back on the stack where the callee expects them.

Alternate Options None


opt-block-factor Lets you specify a loop blocking factor.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-block-factor=n

Arguments

n Is the blocking factor. It must be an integer. The compiler may ignore the blocking factor if the value is 0 or 1.

Default

OFF The compiler uses default heuristics for loop blocking.

Description This option lets you specify a loop blocking factor.

Alternate Options None

opt-calloc Tells the compiler to substitute a call to _intel_fast_calloc() for a call to calloc().

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-calloc -no-opt-calloc


Arguments None

Default

-no-opt-calloc The compiler does not substitute a call to _intel_fast_calloc() for a call to calloc().

Description This option tells the compiler to substitute a call to _intel_fast_calloc() for a call to calloc(). This option may increase the performance of long-running programs that use calloc() frequently. It is recommended for these programs over combinations of options -inline-calloc and -opt-malloc-options=3 because this option causes less memory fragmentation.

NOTE. Many routines in the libirc library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Alternate Options None

opt-class-analysis Determines whether C++ class hierarchy information is used to analyze and resolve C++ virtual function calls at compile time.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-class-analysis -no-opt-class-analysis


Arguments None

Default

-no-opt-class-analysis or /Qopt-class-analysis- C++ class hierarchy information is not used to analyze and resolve C++ virtual function calls at compile time.

Description This option determines whether C++ class hierarchy information is used to analyze and resolve C++ virtual function calls at compile time. The option is turned on by default with the -ipo compiler option, enabling improved C++ optimization. If a C++ application contains non-standard C++ constructs, such as pointer down-casting, it may result in different behaviors.

Alternate Options None

opt-jump-tables Enables or disables generation of jump tables for switch statements.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-jump-tables=keyword -no-opt-jump-tables

Arguments

keyword Is the instruction for generating jump tables. Possible values are:

never Tells the compiler to never generate jump tables. All switch statements are implemented as chains of if-then-elses. This is the same as specifying -no-opt-jump-tables (VxWorks* OS).
default The compiler uses default heuristics to determine when to generate jump tables.
large Tells the compiler to generate jump tables up to a certain pre-defined size (64K entries).
n Must be an integer. Tells the compiler to generate jump tables up to n entries in size.

Default

-opt-jump-tables=default or /Qopt-jump-tables:default The compiler uses default heuristics to determine when to generate jump tables for switch statements.

Description This option enables or disables generation of jump tables for switch statements. When the option is enabled, it may improve performance for programs with large switch statements.

Alternate Options None

opt-malloc-options Lets you specify an alternate algorithm for malloc().

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -opt-malloc-options=n

Arguments

n Specifies the algorithm to use for malloc(). Possible values are:

0 Tells the compiler to use the default algorithm for malloc(). This is the default.
1 Causes the following adjustments to the malloc() algorithm: M_MMAP_MAX=2 and M_TRIM_THRESHOLD=0x10000000.
2 Causes the following adjustments to the malloc() algorithm: M_MMAP_MAX=2 and M_TRIM_THRESHOLD=0x40000000.
3 Causes the following adjustments to the malloc() algorithm: M_MMAP_MAX=0 and M_TRIM_THRESHOLD=-1.
4 Causes the following adjustments to the malloc() algorithm: M_MMAP_MAX=0, M_TRIM_THRESHOLD=-1, M_TOP_PAD=4096.

Default

-opt-malloc-options=0 The compiler uses the default algorithm when malloc() is called. No call is made to mallopt().

Description This option lets you specify an alternate algorithm for malloc(). If you specify a non-zero value for n, it causes alternate configuration parameters to be set for how malloc() allocates and frees memory. It tells the compiler to insert calls to mallopt() to adjust these parameters to malloc() for dynamic memory allocation. This may improve speed.


Alternate Options None

See Also malloc(3) man page; mallopt function (defined in malloc.h)

opt-matmul Enables or disables a compiler-generated Matrix Multiply (matmul) library call.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-matmul -no-opt-matmul

Arguments None

Default

-no-opt-matmul or /Qopt-matmul- The matmul library call optimization does not occur unless this option is enabled or certain other compiler options are specified (see below).

Description This option enables or disables a compiler-generated Matrix Multiply (matmul) library call. Options -opt-matmul and /Qopt-matmul tell the compiler to identify matrix multiplication loop nests (if any) and replace them with a matmul library call for improved performance. The resulting executable may show a greater performance gain on Intel® microprocessors than on non-Intel microprocessors.


This option is enabled by default if options O3 and -parallel (Linux* OS) or /Qparallel (Windows*) are specified. To disable this optimization, specify -no-opt-matmul or /Qopt-matmul-. This option has no effect unless option O2 or higher is set.

NOTE. Many routines in the matmul library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Alternate Options None

See Also • O

opt-multi-version-aggressive Tells the compiler to use aggressive multi-versioning to check for pointer aliasing and scalar replacement.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-multi-version-aggressive -no-opt-multi-version-aggressive

Arguments None

Default

-no-opt-multi-version-aggressive or /Qopt-multi-version-aggressive- The compiler uses default heuristics when checking for pointer aliasing and scalar replacement.

Description This option tells the compiler to use aggressive multi-versioning to check for pointer aliasing and scalar replacement. This option may improve performance. The performance can be affected by certain options, such as /arch or -m or -x (VxWorks* OS).

Alternate Options None

opt-prefetch Enables or disables prefetch insertion optimization.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-prefetch[=n] -no-opt-prefetch

Arguments

n Is the level of software prefetching. Possible values are:

0 Disables software prefetching. This is the same as specifying -no-opt-prefetch (VxWorks).
1 to 4 Enables different levels of software prefetching. If you do not specify a value for n, the default is 2 on IA-32 and Intel® 64 architectures. Use lower values to reduce the amount of prefetching.


Default

-no-opt-prefetch or /Qopt-prefetch- On IA-32 architecture and Intel® 64 architecture, prefetch insertion optimization is disabled.

Description This option enables or disables prefetch insertion optimization. The goal of prefetching is to reduce cache misses by providing hints to the processor about when data should be loaded into the cache. On IA-32 architecture and Intel® 64 architecture, this option enables prefetching when higher optimization levels are specified.

opt-ra-region-strategy Selects the method that the register allocator uses to partition each routine into regions.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-ra-region-strategy[=keyword]

Arguments

keyword Is the method used for partitioning. Possible values are:

routine Creates a single region for each routine.
block Partitions each routine into one region per basic block.
trace Partitions each routine into one region per trace.
region Partitions each routine into one region per loop.
default The compiler determines which method is used for partitioning.

Default

-opt-ra-region-strategy=default or /Qopt-ra-region-strategy:default The compiler determines which method is used for partitioning. This is also the default if keyword is not specified.

Description This option selects the method that the register allocator uses to partition each routine into regions. When setting default is in effect, the compiler attempts to optimize the tradeoff between compile-time performance and generated code performance. This option is only relevant when optimizations are enabled (O1 or higher).

Alternate Options None

See Also • O

opt-report Tells the compiler to generate an optimization report to stderr.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-report [n]


Arguments

n Is the level of detail in the report. On VxWorks systems, a space must appear before the n. Possible values are:

0 Tells the compiler to generate no optimization report.
1 Tells the compiler to generate a report with the minimum level of detail.
2 Tells the compiler to generate a report with the medium level of detail.
3 Tells the compiler to generate a report with the maximum level of detail.

Default

-opt-report 2 or /Qopt-report:2 If you do not specify n, the compiler generates a report with medium detail. If you do not specify the option on the command line, the compiler does not generate an optimization report.

Description This option tells the compiler to generate an optimization report to stderr.

Alternate Options None

See Also • opt-report-file, Qopt-report-file

opt-report-file Specifies the name for an optimization report.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -opt-report-file=filename

Arguments filename Is the name for the optimization report.

Default

OFF No optimization report is generated.

Description This option specifies the name for an optimization report. If you use this option, you do not have to specify -opt-report (VxWorks).

Alternate Options None

See Also • opt-report, Qopt-report

opt-report-help Displays the optimizer phases available for report generation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-report-help

Arguments None


Default

OFF No optimization reports are generated.

Description This option displays the optimizer phases available for report generation using -opt-report-phase (VxWorks). No compilation is performed.

Alternate Options None

See Also • opt-report, Qopt-report • opt-report-phase, Qopt-report-phase

opt-report-phase Specifies an optimizer phase to use when optimization reports are generated.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-report-phase=phase

Arguments

phase Is the phase to generate reports for. Some of the possible values are:

ipo The Interprocedural Optimizer phase
hlo The High Level Optimizer phase
hpo The High Performance Optimizer phase
ilo The Intermediate Language Scalar Optimizer phase
pgo The Profile Guided Optimization phase
all All optimizer phases

Default

OFF No optimization reports are generated.

Description This option specifies an optimizer phase to use when optimization reports are generated. To use this option, you must also specify -opt-report (VxWorks). This option can be used multiple times on the same command line to generate reports for multiple optimizer phases. When one of the logical names for optimizer phases is specified for phase, all reports from that optimizer phase are generated. To find all phase possibilities, use option -opt-report-help (VxWorks).

Alternate Options None

See Also • opt-report, Qopt-report

opt-report-routine Tells the compiler to generate reports on the routines containing specified text.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-report-routine=string

Arguments string Is the text (string) to look for.


Default

OFF No optimization reports are generated.

Description This option tells the compiler to generate reports on the routines containing specified text as part of their name.

Alternate Options None

See Also • opt-report, Qopt-report

opt-streaming-stores Enables generation of streaming stores for optimization.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-streaming-stores keyword

Arguments

keyword Specifies whether streaming stores are generated. Possible values are:

always Enables generation of streaming stores for optimization. The compiler optimizes under the assumption that the application is memory bound.
never Disables generation of streaming stores for optimization. Normal stores are performed.
auto Lets the compiler decide which instructions to use.

Default

-opt-streaming-stores auto or /Qopt-streaming-stores:auto The compiler decides whether to use streaming stores or normal stores.

Description This option enables generation of streaming stores for optimization. This method stores data with instructions that use a non-temporal buffer, which minimizes memory hierarchy pollution. This option may be useful for applications that can benefit from streaming stores.

Alternate Options None

See Also • ax, Qax • x, Qx opt-subscript-in-range Determines whether the compiler assumes that there are no "large" integers being used or being computed inside loops.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -opt-subscript-in-range -no-opt-subscript-in-range


Arguments None

Default

-no-opt-subscript-in-range or /Qopt-subscript-in-range- The compiler assumes there are "large" integers being used or being computed within loops.

Description This option determines whether the compiler assumes that there are no "large" integers being used or being computed inside loops. If you specify -opt-subscript-in-range (VxWorks), the compiler assumes that there are no "large" integers being used or being computed inside loops. A "large" integer is typically greater than 2^31. This feature can enable more loop transformations.

Alternate Options None

Example The following shows an example where these options can be useful. m is declared as type long (64-bits) and all other variables inside the subscript are declared as type int (32-bits): A[ i + j + ( n + k) * m ]

Os Enables optimizations that do not increase code size and produces smaller code size than O2.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Os


Arguments None

Default

OFF Optimizations are made for code speed. However, if O1 is specified, Os is the default.

Description This option enables optimizations that do not increase code size and produces smaller code size than O2. It disables some optimizations that increase code size for a small speed benefit. This option tells the compiler to favor transformations that reduce code size over transformations that produce maximum performance.

Alternate Options None

See Also • O

fomit-frame-pointer, Oy Determines whether EBP is used as a general-purpose register in optimizations.

Architectures -f[no-]omit-frame-pointer: IA-32, Intel® 64 architectures /Oy[-]: IA-32 architecture

Syntax

VxWorks: -fomit-frame-pointer -fno-omit-frame-pointer

Arguments None


Default

-fomit-frame-pointer or /Oy EBP is used as a general-purpose register in optimizations. However, on VxWorks* systems, the default is -fno-omit-frame-pointer if option -O0 or -g is specified. On Windows* systems, the default is /Oy- if option /Od is specified.

Description These options determine whether EBP is used as a general-purpose register in optimizations. Options -fomit-frame-pointer and /Oy allow this use. Options -fno-omit-frame-pointer and /Oy- disallow it. Some debuggers expect EBP to be used as a stack frame pointer, and cannot produce a stack backtrace unless this is so. The -fno-omit-frame-pointer and /Oy- options direct the compiler to generate code that maintains and uses EBP as a stack frame pointer for all functions so that a debugger can still produce a stack backtrace without doing the following:

• For -fno-omit-frame-pointer: turning off optimizations with -O0
• For /Oy-: turning off /O1, /O2, or /O3 optimizations

The -fno-omit-frame-pointer option is set when you specify option -O0 or the -g option. The -fomit-frame-pointer option is set when you specify option -O1, -O2, or -O3. The /Oy option is set when you specify the /O1, /O2, or /O3 option. Option /Oy- is set when you specify the /Od option. Using the -fno-omit-frame-pointer or /Oy- option reduces the number of available general-purpose registers by 1, and can result in slightly less efficient code.

NOTE. There is currently an issue with GCC 3.2 exception handling. Therefore, the Intel compiler ignores this option when GCC 3.2 is installed for C++ and exception handling is turned on (the default).

Alternate Options VxWorks: -fp (this is a deprecated option) Windows: None


P Options

P Tells the compiler to stop the compilation process and write the results to a file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -P

Arguments None

Default

OFF Normal compilation is performed.

Description This option tells the compiler to stop the compilation process after C or C++ source files have been preprocessed and write the results to files named according to the compiler's default file-naming conventions. On Linux systems, this option causes the preprocessor to expand your source module and direct the output to a .i file instead of stdout. Unlike the -E option, the output from -P on Linux does not include #line number directives. By default, the preprocessor creates the name of the output file using the prefix of the source file name with a .i extension. You can change this by using the -o option.

Alternate Options -F

See Also Building Applications: About Preprocessor Options


pc Enables control of floating-point significand precision.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -pcn

Arguments

n Is the floating-point significand precision. Possible values are:

32 Rounds the significand to 24 bits (single precision).
64 Rounds the significand to 53 bits (double precision).
80 Rounds the significand to 64 bits (extended precision).

Default

-pc80 or /Qpc64 On VxWorks* systems, the floating-point significand is rounded to 64 bits. On Windows* systems, the floating-point significand is rounded to 53 bits.

Description This option enables control of floating-point significand precision. Some floating-point algorithms are sensitive to the accuracy of the significand, or fractional part of the floating-point value. For example, iterative operations like division and finding the square root can run faster if you lower the precision with this option. Note that a change of the default precision control or rounding mode, for example, by using the -pc32 (VxWorks) option or by user intervention, may affect the results returned by some of the mathematical functions.


Alternate Options None

See Also Floating-point Operations: Floating-point Options Quick Reference pch Tells the compiler to use appropriate precompiled header files.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -pch

Arguments None

Default

OFF The compiler does not create or use precompiled headers unless you tell it to do so.

Description This option tells the compiler to use appropriate precompiled header (PCH) files. If none are available, they are created as sourcefile.pchi. This option is supported for multiple source files. The -pch option will use PCH files created from other sources if the header files are the same. For example, if you compile source1.cpp using -pch, then source1.pchi is created. If you then compile source2.cpp using -pch, the compiler will use source1.pchi if it detects the same headers.


CAUTION. Depending on how you organize the header files listed in your sources, this option may increase compile times. To learn how to optimize compile times using the PCH options, see "Using Precompiled Header Files" in the User's Guide.

Alternate Options None

Example Consider the following command line: icpc -pch source1.cpp source2.cpp

It produces the following output when .pchi files do not exist: "source1.cpp": creating precompiled header file "source1.pchi" "source2.cpp": creating precompiled header file "source2.pchi"

It produces the following output when .pchi files do exist: "source1.cpp": using precompiled header file "source1.pchi" "source2.cpp": using precompiled header file "source2.pchi"

See Also • -pch-create • -pch-dir • -pch-use • Using Precompiled Header Files

pch-create Tells the compiler to create a precompiled header file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -pch-create filename


Arguments filename Is the name for the precompiled header file. A space must appear before the file name. It can include a path.

Default

OFF The compiler does not create or use precompiled headers unless you tell it to do so.

Description This option tells the compiler to create a precompiled header (PCH) file. It is supported only for single source file compilations. Note that the .pchi extension is not automatically appended to the file name. This option cannot be used in the same compilation as the -pch-use option.

Alternate Options VxWorks: None Windows: /Yc

Example Consider the following command line: icpc -pch-create /pch/foo.pchi foo.cpp

This creates the precompiled header file "/pch/foo.pchi".

See Also • pch-use pch-dir Tells the compiler the location for precompiled header files.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -pch-dir dir

Arguments

dir Is the path for precompiled header files. The path must exist.

Default

OFF The compiler does not create or use precompiled headers unless you tell it to do so.

Description This option tells the compiler the location for precompiled header files. It denotes where to find precompiled header files, and where new PCH files should be placed. This option can be used with the -pch, -pch-create, and -pch-use options.

Alternate Options None

Example Consider the following command line: icpc -pch -pch-dir /pch source32.cpp

It produces the following output: "source32.cpp": creating precompiled header file /pch/source32.pchi

See Also • pch • pch-create • pch-use • Using Precompiled Header Files

pch-use Tells the compiler to use a precompiled header file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -pch-use filename

Arguments filename Is the name of the precompiled header file to use. A space must appear before the file name. It can include a path.

Default

OFF The compiler does not create or use precompiled headers unless you tell it to do so.

Description This option tells the compiler to use a precompiled header (PCH) file. It is supported for multiple source files when all source files use the same .pchi file. This option cannot be used in the same compilation as the -pch-create option.

Alternate Options VxWorks: None Windows: /Yu

Example Consider the following command line: icpc -pch-use /pch/source32.pchi source.cpp

It produces the following output: "source.cpp": using precompiled header file /pch/source32.pchi


See Also • -pch-create • Using Precompiled Header Files

pie Produces a position-independent executable on processors that support it.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -pie

Arguments None

Default

OFF The driver does not set up special run-time libraries and the linker does not perform the optimizations on executables.

Description This option produces a position-independent executable on processors that support it. It is both a compiler option and a linker option. When used as a compiler option, this option ensures the linker sets up run-time libraries correctly. Normally the object linked has been compiled with option -fpie. When you specify -pie, it is recommended that you specify the same options that were used during compilation of the object.

Alternate Options None

See Also • fpie

pragma-optimization-level Specifies which interpretation of the optimization_level pragma should be used if no prefix is specified.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -pragma-optimization-level=interpretation

Arguments

interpretation Compiler-specific interpretation of the optimization_level pragma. Possible values are:

Intel Specify the Intel interpretation.
GCC Specify the GCC interpretation.

Default

-pragma-optimization- Use the Intel interpretation of the optimization_level pragma. level=Intel

Description Specifies which interpretation of the optimization_level pragma should be used if no prefix is specified.

Alternate Options None

See Also Compiler Reference: pragma optimization_level


prec-div Improves precision of floating-point divides.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prec-div -no-prec-div

Arguments None

Default

-prec-div The compiler uses this method for floating-point divides. or/Qprec-div

Description This option improves precision of floating-point divides. It has a slight impact on speed. With some optimizations, such as -msse2 (Linux and VxWorks), the compiler may change floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A * (1/B) to improve the speed of the computation. However, sometimes the value produced by this transformation is not as accurate as full IEEE division. When it is important to have fully precise IEEE division, use this option to disable the floating-point division-to-multiplication optimization. The result is more accurate, with some loss of performance. If you specify -no-prec-div (VxWorks), it enables optimizations that give slightly less precise results than full IEEE division.

Alternate Options None


See Also Floating-point Operations: Floating-point Options Quick Reference prec-sqrt Improves precision of square root implementations.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prec-sqrt -no-prec-sqrt

Arguments None

Default

-no-prec-sqrt or /Qprec-sqrt- The compiler uses a faster but less precise implementation of square root. However, the default is -prec-sqrt or /Qprec-sqrt if any of the following options are specified: /Od, /Op, or /Qprec on Windows systems; -O0, -mp, or -mp1 on VxWorks* systems.

Description This option improves precision of square root implementations. It has a slight impact on speed. This option inhibits any optimizations that can adversely affect the precision of a square root computation. The result is fully precise square root implementations, with some loss of performance.

Alternate Options None


print-multi-lib Prints information about where system libraries should be found.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -print-multi-lib

Arguments None

Default

OFF No information is printed unless the option is specified.

Description This option prints information about where system libraries should be found, but no compilation occurs. It is provided for compatibility with gcc.

Alternate Options None

prof-data-order Enables or disables data ordering if profiling information is enabled.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-data-order -no-prof-data-order

Arguments None

Default

-no-prof-data-order or /Qprof-data-order- Data ordering is disabled.

Description This option enables or disables data ordering if profiling information is enabled. It controls the use of profiling information to order static program data items. For this option to be effective, you must do the following:

• For instrumentation compilation, you must specify -prof-gen=globdata (Linux).
• For feedback compilation, you must specify -prof-use (Linux). You must not use multi-file optimization by specifying options such as option -ipo (Linux) or option -ipo-c (Linux).

Alternate Options None

See Also • prof-gen, Qprof-gen • prof-use, Qprof-use • prof-func-order, Qprof-func-order prof-dir Specifies a directory for profiling information output files.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-dir dir


Arguments

dir Is the name of the directory.

Default

OFF Profiling output files are placed in the directory where the program is compiled.

Description This option specifies a directory for profiling information output files (*.dyn and *.dpi). The specified directory must already exist. You should specify this option using the same directory name for both instrumentation and feedback compilations. If you move the .dyn files, you need to specify the new path. Option /Qprof-dir is equivalent to option /Qcov-dir. If you specify both options, the last option specified on the command line takes precedence.

Alternate Options None

See Also Optimizing Applications: Profile-guided Optimization (PGO) Quick Reference

prof-file Specifies an alternate file name for the profiling summary files.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-file filename


Arguments filename Is the name of the profiling summary file.

Default

OFF The profiling summary files have the file name pgopti.*

Description This option specifies an alternate file name for the profiling summary files. The filename is used as the base name for files created by different profiling passes. If you add this option to profmerge, the .dpi file will be named filename.dpi instead of pgopti.dpi. If you specify this option with option -prof-genx (VxWorks), the .spi and .spl files will be named filename.spi and filename.spl instead of pgopti.spi and pgopti.spl. If you specify this option with option -prof-use (VxWorks), the .dpi file will be named filename.dpi instead of pgopti.dpi. Option /Qprof-file is equivalent to option /Qcov-file. If you specify both options, the last option specified on the command line takes precedence.

Alternate Options None

See Also • prof-gen, Qprof-gen • prof-use, Qprof-use prof-func-groups Enables or disables function grouping if profiling information is enabled.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-func-groups -no-prof-func-groups

Arguments None

Default

-no-prof-func-groups Function grouping is disabled.

Description This option enables or disables function grouping if profiling information is enabled. "Function grouping" is a profiling optimization in which entire routines are placed either in the cold code section or the hot code section. If profiling information is enabled by option -prof-use, option -prof-func-groups is set and function grouping is enabled. However, if you explicitly enable -prof-func-order (Linux), function ordering is performed instead of function grouping. If you want to disable function grouping when profiling information is enabled, specify -no-prof-func-groups. To set the hotness threshold for function grouping, use option -prof-hotness-threshold (Linux).

See Also • prof-use, Qprof-use • prof-func-order, Qprof-func-order • prof-hotness-threshold, Qprof-hotness-threshold

prof-func-order Enables or disables function ordering if profiling information is enabled.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-func-order


-no-prof-func-order

Arguments None

Default

-no-prof-func-order or /Qprof-func-order- Function ordering is disabled.

Description This option enables or disables function ordering if profiling information is enabled. For this option to be effective, you must do the following:

• For instrumentation compilation, you must specify -prof-gen=srcpos (VxWorks).
• For feedback compilation, you must specify -prof-use (VxWorks).

You must not use multi-file optimization by specifying options such as option -ipo (VxWorks) or option -ipo-c (VxWorks).

If you enable profiling information by specifying option -prof-use (VxWorks), -prof-func-groups (VxWorks) and /Qprof-func-groups (Windows) are set and function grouping is enabled. However, if you explicitly enable -prof-func-order (VxWorks), function ordering is performed instead of function grouping. On systems using the GNU linker, this option is only available with linker version 2.15.94.0.1 or later. To set the hotness threshold for function grouping and function ordering, use option -prof-hotness-threshold (VxWorks).

Alternate Options None

Example The following example shows how to use this option on a Windows system:
icl /Qprof-gen:globdata file1.c file2.c /Feinstrumented.exe
instrumented.exe
icl /Qprof-use /Qprof-func-order file1.c file2.c /Fefeedback.exe


The following example shows how to use this option on a Linux system:
icc -prof-gen=globdata file1.c file2.c -o instrumented
./instrumented
icc -prof-use -prof-func-order file1.c file2.c -o feedback

See Also • prof-hotness-threshold, Qprof-hotness-threshold • prof-gen, Qprof-gen • prof-use, Qprof-use • prof-data-order, Qprof-data-order • prof-func-groups

prof-gen Produces an instrumented object file that can be used in profile-guided optimization.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-gen[=keyword] -no-prof-gen

Arguments

keyword Specifies details for the instrumented file. Possible values are:

default Produces an instrumented object file. This is the same as specifying -prof-gen (VxWorks*) or /Qprof-gen (Windows*) with no keyword.

srcpos Produces an instrumented object file that includes extra source position information. This is the same as option -prof-genx (VxWorks*) or /Qprof-genx (Windows*), which are deprecated.

globdata Produces an instrumented object file that includes information for global data layout.

Default

-no-prof-gen or /Qprof-gen- Profile generation is disabled.

Description This option produces an instrumented object file that can be used in profile-guided optimization. It gets the execution count of each basic block. If you specify keyword srcpos or globdata, a static profile information file (.spi) is created. These settings may increase the time needed to do a parallel build using -prof-gen, because of contention writing the .spi file. These options are used in phase 1 of the Profile Guided Optimizer (PGO) to instruct the compiler to produce instrumented code in your object files in preparation for instrumented execution.

Alternate Options None

See Also Optimizing Applications: Profile-Guided Optimizations Overview
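The two phases described above fit together as a sketch like the following; the file and directory names are illustrative, not prescribed by the compiler:

```shell
# Phase 1: instrumented build
icc -prof-gen -prof-dir ./profdata -c file1.c file2.c
icc -o instrumented file1.o file2.o

# Run the instrumented executable with representative input;
# each run writes a .dyn file into ./profdata
./instrumented

# Phase 2: feedback build; the compiler merges the .dyn files
# into pgopti.dpi and uses the profile to optimize
icc -prof-use -prof-dir ./profdata -c file1.c file2.c
icc -o optimized file1.o file2.o
```

The training run should exercise the code paths that matter in production; profile data from an unrepresentative run can make the feedback build slower, not faster.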


prof-genx This is a deprecated, alternate option name for option -prof-gen=srcpos or option /Qprof-gen:srcpos.

prof-hotness-threshold Lets you set the hotness threshold for function grouping and function ordering.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-hotness-threshold=n

Arguments

n Is the hotness threshold. n is a percentage having a value between 0 and 100 inclusive. If you specify 0, there will be no hotness threshold setting in effect for function grouping and function ordering.

Default

OFF The compiler's default hotness threshold setting of 10 percent is in effect for function grouping and function ordering.

Description This option lets you set the hotness threshold for function grouping and function ordering. The "hotness threshold" is the percentage of functions in the application that should be placed in the application's hot region. The hot region is the most frequently executed part of the application. By grouping these functions together into one hot region, they have a greater probability of remaining resident in the instruction cache. This can enhance the application's performance. For this option to take effect, you must specify option -prof-use (VxWorks) or /Qprof-use (Windows) and one of the following:


• On VxWorks systems: -prof-func-groups or -prof-func-order
• On Windows systems: /Qprof-func-order

Alternate Options None

See Also • prof-use, Qprof-use • prof-func-groups • prof-func-order, Qprof-func-order

prof-src-dir Determines whether directory information of the source file under compilation is considered when looking up profile data records.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-src-dir -no-prof-src-dir

Arguments None

Default

-prof-src-dir or /Qprof-src-dir Directory information is used when looking up profile data records in the .dpi file.

Description This option determines whether directory information of the source file under compilation is considered when looking up profile data records in the .dpi file. To use this option, you must also specify option -prof-use (VxWorks).


If the option is enabled, directory information is considered when looking up the profile data records within the .dpi file. You can specify directory information by using one of the following options:

• VxWorks: -prof-src-root or -prof-src-root-cwd • Windows: /Qprof-src-root or /Qprof-src-root-cwd

If the option is disabled, directory information is ignored and only the name of the file is used to find the profile data record. Note that options -prof-src-dir (VxWorks) and /Qprof-src-dir (Windows) control how the names of the user's source files get represented within the .dyn or .dpi files. Options -prof-dir (VxWorks) and /Qprof-dir (Windows) specify the location of the .dyn or the .dpi files.

Alternate Options None

See Also • prof-use, Qprof-use • prof-src-root, Qprof-src-root • prof-src-root-cwd, Qprof-src-root-cwd

prof-src-root Lets you use relative directory paths when looking up profile data and specifies a directory as the base.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-src-root=dir

Arguments

dir Is the base for the relative paths.


Default

OFF The setting of relevant options determines the path used when looking up profile data records.

Description This option lets you use relative directory paths when looking up profile data in .dpi files. It lets you specify a directory as the base. The paths are relative to a base directory specified during the -prof-gen (VxWorks) compilation phase. This option is available during the following phases of compilation:

• VxWorks: -prof-gen and -prof-use phases • Windows: /Qprof-gen and /Qprof-use phases

When this option is specified during the -prof-gen or /Qprof-gen phase, it stores information into the .dyn or .dpi file. Then, when .dyn files are merged together or the .dpi file is loaded, only the directory information below the root directory is used for forming the lookup key. When this option is specified during the -prof-use or /Qprof-use phase, it specifies a root directory that replaces the root directory specified at the -prof-gen or /Qprof-gen phase for forming the lookup keys. To be effective, this option or option -prof-src-root-cwd (VxWorks) must be specified during the -prof-gen or /Qprof-gen phase. In addition, if one of these options is not specified, absolute paths are used in the .dpi file.

Alternate Options None

Example Consider the initial -prof-gen compilation of the source file c:\user1\feature_foo\myproject\common\glob.c: icc -prof-gen -prof-src-root=c:\user1\feature_foo\myproject -c common\glob.c

For the -prof-use phase, the file glob.c could be moved into the directory c:\user2\feature_bar\myproject\common\glob.c and profile information would be found from the .dpi when using the following: icc -prof-use -prof-src-root=c:\user2\feature_bar\myproject -c common\glob.c

If you do not use option -prof-src-root during the -prof-gen phase, by default, the -prof-use compilation can only find the profile data if the file is compiled in the c:\user1\feature_foo\myproject\common directory.


See Also • prof-gen, Qprof-gen • prof-use, Qprof-use • prof-src-dir, Qprof-src-dir • prof-src-root-cwd, Qprof-src-root-cwd

prof-src-root-cwd Lets you use relative directory paths when looking up profile data and specifies the current working directory as the base.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-src-root-cwd

Arguments None

Default

OFF The setting of relevant options determines the path used when looking up profile data records.

Description This option lets you use relative directory paths when looking up profile data in .dpi files. It specifies the current working directory as the base. To use this option, you must also specify option -prof-use (VxWorks). This option is available during the following phases of compilation:

• VxWorks: -prof-gen and -prof-use phases • Windows: /Qprof-gen and /Qprof-use phases


When this option is specified during the -prof-gen or /Qprof-gen phase, it stores information into the .dyn or .dpi file. Then, when .dyn files are merged together or the .dpi file is loaded, only the directory information below the root directory is used for forming the lookup key. When this option is specified during the -prof-use or /Qprof-use phase, it specifies a root directory that replaces the root directory specified at the -prof-gen or /Qprof-gen phase for forming the lookup keys. To be effective, this option or option -prof-src-root (VxWorks) must be specified during the -prof-gen or /Qprof-gen phase. In addition, if one of these options is not specified, absolute paths are used in the .dpi file.

Alternate Options None

See Also • prof-gen, Qprof-gen • prof-use, Qprof-use • prof-src-dir, Qprof-src-dir • prof-src-root, Qprof-src-root

prof-use Enables the use of profiling information during optimization.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-use[=keyword] -no-prof-use

Arguments keyword Specifies additional instructions. Possible values are:


weighted Tells the profmerge utility to apply a weighting to the .dyn file values when creating the .dpi file to normalize the data counts when the training runs have different execution durations. This argument only has an effect when the compiler invokes the profmerge utility to create the .dpi file. This argument does not have an effect if the .dpi file was previously created without weighting.

[no]merge Enables or disables automatic invocation of the profmerge utility. The default is merge. Note that you cannot specify both weighted and nomerge. If you try to specify both values, a warning will be displayed and nomerge takes precedence.

default Enables the use of profiling information during optimization. The profmerge utility is invoked by default. This value is the same as specifying -prof-use (VxWorks) with no argument.

Default

-no-prof-use or /Qprof-use- Profiling information is not used during optimization.

Description This option enables the use of profiling information (including function splitting and function grouping) during optimization. It enables option /Qfnsplit (Windows). This option instructs the compiler to produce a profile-optimized executable, and it merges available profiling output files into a pgopti.dpi file. Note that there is no way to turn off function grouping if you enable it using this option. To set the hotness threshold for function grouping and function ordering, use option -prof-hotness-threshold (VxWorks).


Alternate Options None

See Also • prof-hotness-threshold, Qprof-hotness-threshold

prof-value-profiling Controls which values are value profiled.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -prof-value-profiling[=keyword]

Arguments keyword Controls which type of value profiling is performed. Possible values are:

none Prevents all types of value profiling.

nodivide Prevents value profiling of non-compile time constants used in division or remainder operations.

noindcall Prevents value profiling of function addresses at indirect call sites.

all Enables all types of value profiling.

You can specify more than one keyword, but they must be separated by commas.

Default

-prof-value-profiling=all or /Qprof-value-profiling:all All value profile types are enabled and value profiling is performed.


Description This option controls which features are value profiled. If this option is specified with option -prof-gen (VxWorks*) or /Qprof-gen (Windows*), it turns off instrumentation of operations of the specified type. This also prevents feedback of values for the operations. If this option is specified with option -prof-use (VxWorks), it turns off feedback of values collected of the specified type. If you specify -opt-report (VxWorks) or /Qopt-report (Windows) level 2 or higher, the value profiling specialization information will be reported within the PGO optimization report.

Alternate Options None

See Also • prof-gen, Qprof-gen • prof-use, Qprof-use • opt-report, Qopt-report

profile-functions Inserts instrumentation calls at a function's entry and exit points.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -profile-functions

Arguments None


Default

OFF No instrumentation calls are inserted at a function's entry and exit points.

Description This option inserts instrumentation calls at a function's entry and exit points within a single-threaded application to collect the cycles spent within the function and to produce reports that can help in identifying code hotspots. When the instrumented application is run, this option causes the generation of a loop_prof_funcs_<timestamp>.dump file, where <timestamp> is a timestamp for the run. The same data values are also dumped into the file loop_prof_<timestamp>.xml for use with the data viewer application, unless you turn off that output format by setting the environment variable INTEL_LOOP_PROF_XML_DUMP to 0.
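To suppress the XML companion file mentioned above, set the environment variable before running the instrumented binary; for example, in a POSIX shell:

```shell
# Disable the loop_prof_<timestamp>.xml companion dump;
# only the .dump text report is then written.
export INTEL_LOOP_PROF_XML_DUMP=0
echo "INTEL_LOOP_PROF_XML_DUMP=$INTEL_LOOP_PROF_XML_DUMP"
```

The variable is read at run time by the instrumented executable, so it must be set in the environment of the run, not at compile time.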

Alternate Options None

See Also • profile-loops, Qprofile-loops • profile-loops-report, Qprofile-loops-report

profile-loops Inserts instrumentation calls at a function's entry and exit points, and before and after instrumentable loops.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -profile-loops=keyword


Arguments

keyword Specifies which type of loops should have instrumentation. Possible values are:

inner Inserts instrumentation before and after inner loops.

outer Inserts instrumentation before and after outer loops.

all Inserts instrumentation before and after all loops.

Default

OFF No instrumentation calls are inserted at a function's entry and exit points, or before and after instrumentable loops.

Description This option inserts instrumentation calls at a function's entry and exit points within a single-threaded application. For unthreaded applications, it also inserts instrumentation before and after instrumentable loops of the type listed in keyword. When the instrumented application is run, this option causes the generation of a loop_prof_funcs_<timestamp>.dump file and a loop_prof_loops_<timestamp>.dump file, where <timestamp> is a timestamp for the run. The same timestamp is used for the loop file and the function file, which identifies that the loop data and function data came from the same program run. The same data values are also dumped into the file loop_prof_<timestamp>.xml for use with the data viewer application, unless you turn off that output format by setting the environment variable INTEL_LOOP_PROF_XML_DUMP to 0.

Alternate Options None

See Also • profile-functions, Qprofile-functions • profile-loops-report, Qprofile-loops-report

profile-loops-report Controls the level of detail for the data collected when instrumentation occurs before and after certain loops.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -profile-loops-report[=n]

Arguments n Is a value denoting the level of detail to report. Possible values are:

1 Reports the cycle counts on entry and exits of loops. This is the default if n is not specified.

2 Reports the level 1 details, and also includes the loop min/max and average loop iteration counts. To collect the loop iteration counts, additional instrumentation is inserted. This can increase overhead in the instrumented application and slow performance.

Default

-profile-loops-report=1 or /Qprofile-loops-report:1 The report shows the cycle counts on entry and exits of loops.


Description This option controls the level of detail for the data collected when instrumentation occurs before and after certain loops. To use this option, you must also specify option -profile-loops (VxWorks). The report appears in file loop_prof_loops_<timestamp>.dump, where <timestamp> is a timestamp value for the run. The columns listed in the report will be based on the level of detail that was selected during instrumentation. It is recommended that the same report level be used for all files that are instrumented for the application. If different files of the application were instrumented with different levels, the report will contain all the columns of the highest detail level, but with default values for unavailable fields for files that were instrumented at lower levels.

Alternate Options None

See Also • profile-functions, Qprofile-functions • profile-loops, Qprofile-loops

pthread Tells the compiler to use the pthreads library for multithreading support.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -pthread

Arguments None


Default

OFF The compiler does not use the pthreads library for multithreading support.

Description Tells the compiler to use the pthreads library for multithreading support.

Alternate Options None

Q Options

Qinstall Specifies the root directory where the compiler installation was performed.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Qinstall dir

Arguments

dir Is the root directory where the installation was performed.

Default

OFF The default root directory for compiler installation is searched for the compiler.

Description This option specifies the root directory where the compiler installation was performed. It is useful if you want to use a different compiler or if you did not use the iccvars shell script to set your environment variables.


Alternate Options None

Qlocation Specifies the directory for supporting tools.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Qlocation,string,dir

Arguments

string Is the name of the tool. dir Is the directory (path) where the tool is located.

Default

OFF The compiler looks for tools in a default area.

Description This option specifies the directory for supporting tools. string can be any of the following:

• c - Indicates the Intel C++ compiler.
• cpp (or fpp) - Indicates the Intel C++ preprocessor.
• cxxinc - Indicates C++ header files.
• cinc - Indicates C header files.
• asm - Indicates the assembler.
• link - Indicates the linker.
• prof - Indicates the profiler.

On Windows systems, the following is also available:

• masm - Indicates the Microsoft assembler.

On VxWorks systems, the following are also available:

• as - Indicates the assembler.
• gas - Indicates the GNU assembler.
• ld - Indicates the loader.
• gld - Indicates the GNU loader.
• lib - Indicates an additional library.
• crt - Indicates the crt%.o files linked into executables to contain the place to start execution.

Alternate Options None

See Also • Qoption

Qoption Passes options to a specified tool.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Qoption,string,options

Arguments string Is the name of the tool. options Are one or more comma-separated, valid options for the designated tool.


Default

OFF No options are passed to tools.

Description This option passes options to a specified tool. If an argument contains a space or tab character, you must enclose the entire argument in quotation marks (" "). You must separate multiple arguments with commas. string can be any of the following:

• cpp - Indicates the Intel compiler preprocessor.
• c - Indicates the C++ compiler.
• asm - Indicates the assembler.
• link - Indicates the linker.
• prof - Indicates the profiler.

On Windows systems, the following is also available:

• masm - Indicates the Microsoft assembler.

On VxWorks systems, the following are also available:

• as - Indicates the assembler.
• gas - Indicates the GNU assembler.
• ld - Indicates the loader.
• gld - Indicates the GNU loader.
• lib - Indicates an additional library.
• crt - Indicates the crt%.o files linked into executables to contain the place to start execution.

Alternate Options None
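Two hypothetical invocations as a sketch; they assume a linker that accepts -v and an assembler installed under /opt/alt-binutils/bin, neither of which is implied by the manual:

```shell
# Pass -v through to the linker:
icc -Qoption,link,-v file.c

# With -Qlocation, point the driver at an alternate assembler directory:
icc -Qlocation,as,/opt/alt-binutils/bin file.c
```

Note that -Qoption forwards the option text verbatim; the designated tool, not the compiler driver, must understand it.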

See Also • Qlocation


R Options

rcd Enables fast float-to-integer conversions.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -rcd

Arguments None

Default

OFF Floating-point values are truncated when a conversion to an integer is involved.

Description This option enables fast float-to-integer conversions. It can improve the performance of code that requires floating-point-to-integer conversions. The system default floating-point rounding mode is round-to-nearest. However, the C language requires floating-point values to be truncated when a conversion to an integer is involved. To do this, the compiler must change the rounding mode to truncation before each floating-point-to-integer conversion and change it back afterwards. This option disables the change to truncation of the rounding mode for all floating-point calculations, including floating-point-to-integer conversions. This option can improve performance, but floating-point conversions to integer will not conform to C semantics.

Alternate Options VxWorks: None Windows: /QIfist (this is a deprecated option)


regcall Tells the compiler that the __regcall calling convention should be used for functions that do not directly specify a calling convention.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -regcall

Arguments None

Default

OFF The __regcall calling convention will only be used if a function explicitly specifies it.

Description This option tells the compiler that the __regcall calling convention should be used for functions that do not directly specify a calling convention. This calling convention ensures that as many values as possible are passed or returned in registers. It ensures that __regcall is the default calling convention for functions in the compilation, unless another calling convention is specified in a declaration. This calling convention is ignored if it is specified for a function with variable arguments. Note that all __regcall functions must have prototypes.

Alternate Options None

See Also • C/C++ Calling Conventions

restrict Determines whether pointer disambiguation is enabled with the restrict qualifier.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -restrict -no-restrict

Arguments None

Default

-no-restrict or /Qrestrict- Pointers are not qualified with the restrict keyword.

Description This option determines whether pointer disambiguation is enabled with the restrict qualifier. Options -restrict (VxWorks) and /Qrestrict (Windows OS) enable the recognition of the restrict keyword as defined by the ANSI standard. By qualifying a pointer with the restrict keyword, you assert that an object accessed by the pointer is only accessed by that pointer in the given scope. You should use the restrict keyword only when this is true. When the assertion is true, the restrict option will have no effect on program correctness, but may allow better optimization.

Alternate Options None

See Also • c99, Qc99


S Options

S Causes the compiler to compile to an assembly file only and not link.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -S

Arguments None

Default

OFF Normal compilation and linking occur.

Description This option causes the compiler to compile to an assembly file only and not link. On VxWorks systems, the assembly file name has a .s suffix. On Windows systems, the assembly file name has an .asm suffix.

Alternate Options VxWorks: None Windows: /Fa


See Also • save-temps

save-temps Tells the compiler to save intermediate files created during compilation.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -save-temps -no-save-temps

Arguments None

Default

VxWorks: -no-save-temps
On VxWorks systems, the compiler deletes intermediate files after compilation is completed.

Windows: .obj files are saved
On Windows systems, the compiler saves only intermediate object files after compilation is completed.

Description This option tells the compiler to save intermediate files created during compilation. The names of the files saved are based on the name of the source file; the files are saved in the current working directory. If -save-temps or /Qsave-temps is specified, the following occurs:

• The object .o file (VxWorks) or .obj file (Windows) is saved. • The assembler .s file (VxWorks) or .asm file (Windows) is saved if you specified -use-asm (VxWorks).

If -no-save-temps is specified on VxWorks systems, the following occurs:

• The .o file is put into /tmp and deleted after calling ld. • The preprocessed file is not saved after it has been used by the compiler.


If /Qsave-temps- is specified on Windows systems, the following occurs:

• The .obj file is not saved after the linker step. • The preprocessed file is not saved after it has been used by the compiler.

NOTE. This option only saves intermediate files that are normally created during compilation.

Alternate Options None

Example If you compile program my_foo.c on a VxWorks system and you specify option -save-temps and option -use-asm, the compilation will produce files my_foo.o and my_foo.s. If you compile program my_foo.c on a Windows system and you specify option /Qsave-temps and option /Quse-asm, the compilation will produce files my_foo.obj and my_foo.asm.

scalar-rep Enables scalar replacement performed during loop transformation.

Architectures IA-32 architecture

Syntax

VxWorks: -scalar-rep -no-scalar-rep

Arguments None

Default

-no-scalar-rep or /Qscalar-rep- Scalar replacement is not performed during loop transformation.

Description This option enables scalar replacement performed during loop transformation. To use this option, you must also specify O3.

Alternate Options None

See Also • O

shared Tells the compiler to produce a dynamic shared object instead of an executable.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -shared

Arguments None

Default

OFF The compiler produces an executable.

Description This option tells the compiler to produce a dynamic shared object (DSO) instead of an executable. This includes linking in all libraries dynamically and passing -shared to the linker. On systems using IA-32 architecture and Intel® 64 architecture, you must specify option -fpic for the compilation of each object file you want to include in the shared library.
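A sketch of the build steps, with illustrative file names:

```shell
# Each object destined for the DSO is compiled with -fpic,
# then the objects are linked with -shared:
icc -fpic -c glob.c util.c
icc -shared -o libglob.so glob.o util.o
```

Omitting -fpic on any constituent object typically produces a link-time error or a shared object that cannot be relocated correctly.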


Alternate Options None

See Also • fpic • Xlinker

shared-intel Causes Intel-provided libraries to be linked in dynamically.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -shared-intel

Arguments None

Default

OFF Intel® libraries are linked in statically.

Description This option causes Intel-provided libraries to be linked in dynamically. It is the opposite of -static-intel.

Alternate Options VxWorks: -i-dynamic (this is a deprecated option) Windows: None

See Also • static-intel

shared-libgcc Links the GNU libgcc library dynamically.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -shared-libgcc

Arguments None

Default

-shared-libgcc The compiler links the libgcc library dynamically.

Description This option links the GNU libgcc library dynamically. It is the opposite of option static-libgcc. This option is useful when you want to override the default behavior of the static option, which causes all libraries to be linked statically.

Alternate Options None

See Also • static-libgcc

simd Enables or disables the SIMD vectorization feature of the compiler.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -simd -no-simd

Arguments None

Default

-simd or /Qsimd The SIMD vectorization feature is enabled.

Description This option enables or disables the SIMD vectorization feature of the compiler. To disable the SIMD transformations for vectorization, specify -no-simd (VxWorks). To disable vectorization entirely, specify option -no-vec (VxWorks* OS).

NOTE. The SIMD vectorization feature is available for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in greater performance gains on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (VxWorks* OS).

Alternate Options None

See Also • vec, Qvec • Function Annotations and the SIMD Directive for Vectorization • simd pragma

sox Tells the compiler to save the compilation options and version number in the VxWorks* OS executable or the Windows* OS object file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -sox -no-sox

Arguments None

Default

-no-sox or /Qsox- The compiler does not save the compiler options and version number in the executable.

Description Tells the compiler to save the compilation options and version number in the VxWorks* OS executable or the Windows* OS object file. On VxWorks systems, the size of the executable on disk is increased slightly by the inclusion of these information strings. This option forces the compiler to embed in each object file or assembly output a string that contains information about the compiler version and compilation options for each source file that has been compiled. On Windows systems, the information stays in the object file. On VxWorks systems, when you link the object files into an executable file, the linker places each of the information strings into the header of the executable. It is then possible to use a tool, such as a strings utility, to determine what options were used to build the executable file. If -no-sox or /Qsox- is specified, this extra information is not put into the object or assembly output generated by the compiler.


Alternate Options None

static Prevents linking with shared libraries.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -static

Arguments None

Default

OFF The compiler links with shared libraries.

Description This option prevents linking with shared libraries. It causes the executable to link all libraries statically.

Alternate Options None

See Also • static-intel

static-intel Causes Intel-provided libraries to be linked in statically.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -static-intel

Arguments None

Default

ON Intel® libraries are linked in statically.

Description This option causes Intel-provided libraries to be linked in statically. It is the opposite of -shared-intel.

Alternate Options VxWorks: i-static (this is a deprecated option) Windows: None

See Also • shared-intel

static-libgcc Links the GNU libgcc library statically.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -static-libgcc

Arguments None


Default

OFF The compiler links the GNU libgcc library dynamically.

Description This option links the GNU libgcc library statically. It is the opposite of option shared-libgcc. This option is useful when you want to override the default behavior of the shared-libgcc option, which causes the GNU libgcc library to be linked dynamically.

Alternate Options None

See Also • shared-libgcc

std Tells the compiler to conform to a specific language standard.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -std=val

Arguments

val Possible values are:
c89 Conforms to the ISO/IEC 9899:1990 International Standard.
c99 Conforms to the ISO/IEC 9899:1999 International Standard.
gnu89 Conforms to ISO C90 plus GNU* extensions.

gnu99 Conforms to ISO C99 plus GNU* extensions.
gnu++98 Conforms to the 1998 ISO C++ standard plus GNU* extensions.
c++0x Enables support for the following C++0x features:

• Atomic types and operations
• Scoped enumeration types
• Defaulted and deleted functions
• Rvalue references
• Empty macro arguments
• Variadic macros
• Type long long
• Trailing comma in enum definition
• Concatenation of mixed-width string literals
• Extended friend declarations
• Use of ">>" to close two template argument lists
• Relaxed rules for use of "typename"
• Relaxed rules for disambiguation using the "template" keyword
• "extern template" to suppress instantiation of an entity
• "auto" type specifier
• decltype operator
• static_assert
• compliant __func__
• lambda expressions


Default

-std=gnu89 (default for C) Conforms to ISO C90 plus GNU* extensions.
-std=gnu++98 (default for C++) Conforms to the 1998 ISO C++ standard plus GNU* extensions.
/Qstd=c89 Conforms to the ISO/IEC 9899:1990 International Standard.

Description Tells the compiler to conform to a specific language standard.
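
As an illustration, a small translation unit exercising a few of the C++0x features listed above (auto, decltype, static_assert, and long long); this is a hypothetical example, not part of the compiler distribution, and would be built with -std=c++0x:

```cpp
#include <cassert>

// static_assert: checked at compile time under -std=c++0x
static_assert(sizeof(int) >= 2, "int is too small");

// Type long long is part of the standard under -std=c++0x
long long big = 1LL << 40;

int add(int a, int b) { return a + b; }

int demo() {
    auto x = add(2, 3);      // "auto" type specifier: x deduced as int
    decltype(x) y = x * 2;   // decltype operator: y has the same type as x
    return y;
}
```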

Alternate Options None

strict-ansi Tells the compiler to implement strict ANSI conformance dialect.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -strict-ansi

Arguments None

Default

OFF The compiler conforms to default standards.

Description This option tells the compiler to implement the strict ANSI conformance dialect. If you need to be compatible with gcc, use the -ansi option. This option sets option fmath-errno.


Alternate Options None

T Options

T Tells the linker to read link commands from a file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Tfilename

Arguments

filename Is the name of the file.

Default

OFF The linker does not read link commands from a file.

Description This option tells the linker to read link commands from a file.

Alternate Options None

Kc++, TP Tells the compiler to process all source or unrecognized file types as C++ source files. This option is deprecated.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -Kc++

Arguments None

Default

OFF The compiler uses default rules for determining whether a file is a C++ source file.

Description This option tells the compiler to process all source or unrecognized file types as C++ source files.

Alternate Options None

U Options

u (Linux* OS) Tells the compiler the specified symbol is undefined.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -u symbol

Arguments symbol Is the name of the symbol.


Default

OFF Standard rules are in effect for variables.

Description This option tells the compiler the specified symbol is undefined.

Alternate Options None

U Undefines any definition currently in effect for the specified macro.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Uname

Arguments name Is the name of the macro to be undefined.

Default

OFF Macro definitions are in effect until they are undefined.

Description This option undefines any definition currently in effect for the specified macro. It is equivalent to a #undef preprocessing directive. On Windows systems, use the /u option to undefine all previously defined preprocessor values.
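
For comparison, the #undef directive below has the same effect from within a source file that -UVERBOSE would have on the command line; the VERBOSE macro name is purely illustrative:

```cpp
#include <cassert>

#define VERBOSE
#undef VERBOSE   /* equivalent to compiling with -UVERBOSE */

// With VERBOSE undefined, the #else branch is compiled.
int verbosity() {
#ifdef VERBOSE
    return 1;
#else
    return 0;
#endif
}
```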

Alternate Options None


See Also Building Applications: About Preprocessor Options

undef Disables all predefined macros.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -undef

Arguments None

Default

OFF Defined macros are in effect until they are undefined.

Description This option disables all predefined macros.

Alternate Options -A- (this is a deprecated option)


unroll Tells the compiler the maximum number of times to unroll loops.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -unroll[=n]

Arguments n Is the maximum number of times a loop can be unrolled. To disable loop enrolling, specify 0.

Default

-unroll or /Qunroll The compiler uses default heuristics when unrolling loops.

Description This option tells the compiler the maximum number of times to unroll loops. If you do not specify n, the optimizer determines how many times loops can be unrolled.
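
As a sketch of the transformation, the hand-unrolled reduction below approximates what unrolling by a factor of 4 produces; the function names and unroll factor are illustrative only:

```cpp
#include <cassert>

// Hand-unrolled equivalent of what -unroll=4 might generate for a
// simple reduction: four accumulations per iteration, plus a cleanup
// loop for the remaining elements.
int sum_unrolled(const int *a, int n) {
    int s = 0;
    int i = 0;
    for (; i + 3 < n; i += 4)                         // main loop, unrolled by 4
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)                                // remainder loop
        s += a[i];
    return s;
}

// Helper for demonstration: sums 1..n (n <= 64 assumed).
int sum_first_n(int n) {
    int buf[64];
    for (int i = 0; i < n; ++i) buf[i] = i + 1;
    return sum_unrolled(buf, n);
}
```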

Alternate Options VxWorks: -funroll-loops Windows: None

See Also Optimizing Applications: Loop Unrolling

unroll-aggressive Determines whether the compiler uses more aggressive unrolling for certain loops.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -unroll-aggressive


-no-unroll-aggressive

Arguments None

Default

-no-unroll-aggressive or /Qunroll-aggressive- The compiler uses default heuristics when unrolling loops.

Description This option determines whether the compiler uses more aggressive unrolling for certain loops. The positive form of the option may improve performance. On IA-32 architecture and Intel® 64 architecture, this option enables aggressive, complete unrolling for loops with small constant trip counts.

Alternate Options None

use-asm Tells the compiler to produce objects through the assembler.

Architectures -use-asm: IA-32, Intel® 64 architectures

Syntax

VxWorks: -use-asm -no-use-asm

Arguments None

Default

OFF The compiler produces objects directly.


Description This option tells the compiler to produce objects through the assembler.

Alternate Options None

use-intel-optimized-headers Determines whether the performance headers directory is added to the include path search list.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -use-intel-optimized-headers

Arguments None

Default

-no-use-intel-optimized-headers or /Quse-intel-optimized-headers- The performance headers directory is not added to the include path search list.

Description This option determines whether the performance headers directory is added to the include path search list. The performance headers directory is added if you specify -use-intel-optimized-headers (VxWorks) or /Quse-intel-optimized-headers (Windows OS). Appropriate libraries are also linked in, as needed, for proper functionality.


Optimization Notice

The Intel® Integrated Performance Primitives (Intel® IPP) library contains functions that are more highly optimized for Intel microprocessors than for other microprocessors. While the functions in the Intel® IPP library offer optimizations for both Intel and Intel-compatible microprocessors, depending on your code and other factors, you will likely get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for the Intel® IPP library as a whole, the library may or may not be optimized to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options None

use-msasm Enables the use of blocks and entire functions of assembly code within a C or C++ file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -use-msasm

Arguments None


Default

OFF The compiler allows a GNU*-style inline assembly format.

Description This option enables the use of blocks and entire functions of assembly code within a C or C++ file. It allows a Microsoft* MASM-style inline assembly block, not a GNU*-style inline assembly block.

Alternate Options -fasm-blocks

V Options

v Specifies that driver tool commands should be displayed and executed.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -v[filename]

Arguments

filename Is the name of a file.

Default

OFF No tool commands are shown.

Description This option specifies that driver tool commands should be displayed and executed.


If you use this option without specifying a file name, the compiler displays only the version of the compiler.

Alternate Options None

See Also • dryrun

V Displays the compiler version information.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -V

Arguments None

Default

OFF The compiler version information is not displayed.

Description This option displays the startup banner, which contains the following compiler information:

• The name of the compiler and its applicable architecture
• The major and minor version of the compiler, the update number, and the package number (for example, Version 11.1.0.047)
• The specific build and build date (for example, Build )
• The copyright date of the software

This option can be placed anywhere on the command line.


Alternate Options None

vec Enables or disables vectorization.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -vec -no-vec

Arguments None

Default

-vec or /Qvec Vectorization is enabled.

Description This option enables or disables vectorization. To disable vectorization, specify -no-vec (VxWorks). To disable the SIMD transformations, specify option -no-simd (VxWorks).

NOTE. Using this option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (VxWorks* OS).


Alternate Options None

See Also • simd, Qsimd • ax, Qax • x, Qx • vec-report, Qvec-report • vec-guard-write, Qvec-guard-write • vec-threshold, Qvec-threshold

vec-guard-write Tells the compiler to perform a conditional check in a vectorized loop.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -vec-guard-write -no-vec-guard-write

Arguments None

Default

-vec-guard-write or /Qvec-guard-write The compiler performs a conditional check in a vectorized loop.

Description This option tells the compiler to perform a conditional check in a vectorized loop. This checking avoids unnecessary stores and may improve performance.
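
The kind of loop this option affects is one with a conditional (guarded) store, as in this illustrative sketch:

```cpp
#include <cassert>

// A loop with a guarded store: the write happens only when the
// condition holds. With -vec-guard-write, the vectorized code keeps
// this conditional check so elements failing the test are never
// stored to, avoiding unnecessary stores.
int clamp_negatives(int *a, int n) {
    int changed = 0;
    for (int i = 0; i < n; ++i) {
        if (a[i] < 0) {      // guard: store only when needed
            a[i] = 0;
            ++changed;
        }
    }
    return changed;
}

int demo_clamp() {
    int v[5] = {-1, 2, -3, 4, 0};
    return clamp_negatives(v, 5);   // two negative elements are clamped
}
```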


Alternate Options None

vec-report Controls the diagnostic information reported by the vectorizer.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -vec-report[n]

Arguments n Is a value denoting which diagnostic messages to report. Possible values are: 0 Tells the vectorizer to report no diagnostic information. 1 Tells the vectorizer to report on vectorized loops. 2 Tells the vectorizer to report on vectorized and non-vectorized loops. 3 Tells the vectorizer to report on vectorized and non-vectorized loops and any proven or assumed data dependences. 4 Tells the vectorizer to report on non-vectorized loops. 5 Tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized.


Default

-vec-report1 or /Qvec-report1 If the vectorizer has been enabled and you do not specify n, the compiler reports diagnostics on vectorized loops. If you do not specify the option on the command line, the default is to display no messages.

Description This option controls the diagnostic information reported by the vectorizer. The vectorizer report is sent to stdout. If you do not specify n, it is the same as specifying -vec-report1 (VxWorks).

Alternate Options None

vec-threshold Sets a threshold for the vectorization of loops.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -vec-threshold[n]

Arguments

n Is an integer whose value is the threshold for the vectorization of loops. Possible values are 0 through 100. If n is 0, loops are always vectorized, regardless of the computation work volume. If n is 100, loops are vectorized only when performance gains are predicted based on the compiler analysis data; that is, only if profitable vector-level parallel execution is almost certain.


The intermediate 1 to 99 values represent the percentage probability for profitable speed-up. For example, n=50 directs the compiler to vectorize only if there is a 50% probability of the code speeding up if executed in vector form.

Default

-vec-threshold100 or /Qvec-threshold100 Loops get vectorized only if profitable vector-level parallel execution is almost certain. This is also the default if you do not specify n.

Description This option sets a threshold for the vectorization of loops based on the probability of profitable execution of the vectorized loop in parallel. This option is useful for loops whose computation work volume cannot be determined at compile-time. The threshold is usually relevant when the loop trip count is unknown at compile-time. The compiler applies a heuristic that tries to balance the overhead of vectorizing the loop against the expected performance gain.
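
A typical candidate is a loop whose trip count is a run-time value, as in this illustrative sketch; whether vectorizing pays off depends on how large n turns out to be:

```cpp
#include <cassert>

// A reduction loop whose trip count n is only known at run time --
// the case where the -vec-threshold profitability estimate applies.
long dot(const int *x, const int *y, int n) {
    long s = 0;
    for (int i = 0; i < n; ++i)
        s += (long)x[i] * y[i];
    return s;
}

long demo_dot() {
    int x[4] = {1, 2, 3, 4};
    int y[4] = {5, 6, 7, 8};
    return dot(x, y, 4);    // 5 + 12 + 21 + 32
}
```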

Alternate Options None

version Displays GCC-style version information.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: --version

Arguments None


Default

OFF

Description This option displays GCC-style version information.

Alternate Options None

W Options

w Disables all warning messages.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -w

Arguments None

Default

OFF Default warning messages are enabled.

Description This option disables all warning messages.

Alternate Options VxWorks: -W0 Windows: /W0

w, W Specifies the level of diagnostic messages to be generated by the compiler.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -wn

Arguments n Is the level of diagnostic messages to be generated. Possible values are 0 Disables warnings; displays errors. 1 Displays warnings and errors. 2 Displays warnings and errors. This setting is equivalent to level 1 (n=1).

Default

n=1 The compiler displays warnings and errors.

Description This option specifies the level of diagnostic messages to be generated by the compiler.

Alternate Options None

Wa Passes options to the assembler for processing.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wa,option1[,option2,...]


Arguments option Is an assembler option. This option is not processed by the driver and is directly passed to the assembler.

Default

OFF No options are passed to the assembler.

Description This option passes one or more options to the assembler for processing. If the assembler is not invoked, these options are ignored.

Alternate Options None

Wabi Determines whether a warning is issued if generated code is not C++ ABI compliant.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wabi -Wno-abi

Arguments None

Default

-Wno-abi No warning is issued when generated code is not C++ ABI compliant.


Description This option determines whether a warning is issued if generated code is not C++ ABI compliant.

Alternate Options None

Wall Tells the compiler to display errors and warnings.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wall

Arguments None

Default

OFF Default warning messages are enabled.

Description This option tells the compiler to display errors and warnings. On Windows systems, this is the same as specifying the /W2 option. If you want to display remarks and comments, specify option -Wremarks.

Alternate Options None


Wbrief Tells the compiler to display a shorter form of diagnostic output.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wbrief

Arguments None

Default

OFF The compiler displays its normal diagnostic output.

Description This option tells the compiler to display a shorter form of diagnostic output. In this form, the original source line is not displayed and the error message text is not wrapped when too long to fit on a single line.

Alternate Options Linux: None Windows: /WL

Wcheck Tells the compiler to perform compile-time code checking for certain code.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -Wcheck

Arguments None

Default

OFF No compile-time code checking is performed.

Description This option tells the compiler to perform compile-time code checking for certain code. It specifies to check for code that exhibits non-portable behavior, represents a possible unintended code sequence, or possibly affects operation of the program because of a quiet change in the ANSI C Standard.

Alternate Options None

Wcomment Determines whether a warning is issued when /* appears in the middle of a /* */ comment.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wcomment -Wno-comment

Arguments None


Default

-Wno-comment No warning is issued when /* appears in the middle of a /* */ comment.

Description This option determines whether a warning is issued when /* appears in the middle of a /* */ comment.

Alternate Options None

Wcontext-limit Sets the maximum number of template instantiation contexts shown in a diagnostic.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wcontext-limit=n

Arguments n Number of template instantiation contexts.

Default

OFF

Description This option sets the maximum number of template instantiation contexts shown in a diagnostic.

Alternate Options None


wd Disables a soft diagnostic. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -wdn[,n]...

Arguments

n Is the number of the diagnostic to disable.

Default

OFF The compiler returns soft diagnostics as usual.

Description This option disables the soft diagnostic that corresponds to the specified number. If you specify more than one n, each n must be separated by a comma.

Alternate Options VxWorks: -diag-disable

Wdeprecated Determines whether warnings are issued for deprecated features.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -Wdeprecated -Wno-deprecated

Arguments None

Default

-Wno-deprecated No warnings are issued for deprecated features.

Description This option determines whether warnings are issued for deprecated features.

Alternate Options None

we Changes a soft diagnostic to an error. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -weLn[,Ln,...]

Arguments

Ln Is the number of the diagnostic to be changed.

Default

OFF The compiler returns soft diagnostics as usual.


Description This option overrides the severity of the soft diagnostic that corresponds to the specified number and changes it to an error. If you specify more than one Ln, each Ln must be separated by a comma.

Alternate Options None

Weffc++ Enables warnings based on certain C++ programming guidelines.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Weffc++

Arguments None

Default

OFF Diagnostics are not enabled.

Description This option enables warnings based on certain programming guidelines developed by Scott Meyers in his books on effective C++ programming. With this option, the compiler emits warnings for these guidelines:

• Use const and inline rather than #define. Note that you will only get this in user code, not system header code.
• Use <iostream> rather than <stdio.h>.
• Use new and delete rather than malloc and free.


• Use C++ style comments in preference to C style comments. C comments in system headers are not diagnosed.
• Use delete on pointer members in destructors. The compiler diagnoses any pointer that does not have a delete.
• Make sure you have a user copy constructor and assignment operator in classes containing pointers.
• Use initialization rather than assignment to members in constructors.
• Make sure the initialization list ordering matches the declaration list ordering in constructors.
• Make sure base classes have virtual destructors.
• Make sure operator= returns *this.
• Make sure prefix forms of increment and decrement return a const object.
• Never overload operators &&, ||, and ,.

NOTE. The warnings generated by this compiler option are based on the following books from Scott Meyers:

• Effective C++ Second Edition - 50 Specific Ways to Improve Your Programs and Designs
• More Effective C++ - 35 New Ways to Improve Your Programs and Designs
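
As an illustrative sketch (not from the guides), a small class that follows several of these guidelines: a user copy constructor and assignment operator because of the pointer member, delete on that member in the destructor, initialization lists in declaration order, and operator= returning *this:

```cpp
#include <cassert>
#include <cstring>

class Buffer {
    char *data_;    // pointer member => user copy/assign required
    int   size_;
public:
    // Initialization list, in declaration order (data_, then size_)
    explicit Buffer(int n) : data_(new char[n]), size_(n) {}
    Buffer(const Buffer &o) : data_(new char[o.size_]), size_(o.size_) {
        std::memcpy(data_, o.data_, size_);
    }
    Buffer &operator=(const Buffer &o) {
        if (this != &o) {
            char *p = new char[o.size_];
            std::memcpy(p, o.data_, o.size_);
            delete[] data_;
            data_ = p;
            size_ = o.size_;
        }
        return *this;                      // operator= returns *this
    }
    ~Buffer() { delete[] data_; }          // delete on pointer member
    int size() const { return size_; }
};
```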

Alternate Options None

Werror, WX Changes all warnings to errors.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Werror


Arguments None

Default

OFF The compiler returns diagnostics as usual.

Description This option changes all warnings to errors.

Alternate Options VxWorks: -diag-error warn Windows: /Qdiag-error:warn

Werror-all Causes all warnings and currently-enabled remarks to be reported as errors.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Werror-all

Arguments None

Default

OFF The compiler returns diagnostics as usual.

Description This option causes all warnings and currently-enabled remarks to be reported as errors. To enable display of remarks, specify option -Wremarks.


Alternate Options VxWorks: -diag-error warn, remark Windows: /Qdiag-error:warn, remark

Wextra-tokens Determines whether warnings are issued about extra tokens at the end of preprocessor directives.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wextra-tokens -Wno-extra-tokens

Arguments None

Default

-Wno-extra-tokens The compiler does not warn about extra tokens at the end of preprocessor directives.

Description This option determines whether warnings are issued about extra tokens at the end of preprocessor directives.

Alternate Options None


Wformat Determines whether argument checking is enabled for calls to printf, scanf, and so forth.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wformat -Wno-format

Arguments None

Default

-Wno-format Argument checking is not enabled for calls to printf, scanf, and so forth.

Description This option determines whether argument checking is enabled for calls to printf, scanf, and so forth.

Alternate Options None

Wformat-security Determines whether the compiler issues a warning when the use of format functions may cause security problems.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -Wformat-security -Wno-format-security

Arguments None

Default

-Wno-format-security No warning is issued when the use of format functions may cause security problems.

Description This option determines whether the compiler issues a warning when the use of format functions may cause security problems. When -Wformat-security is specified, it warns about uses of format functions where the format string is not a string literal and there are no format arguments.
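
The pattern the warning targets, and a safe alternative, can be sketched as follows; the helper name is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// What -Wformat-security flags: passing externally supplied text as
// the format string itself.
//
//     printf(msg);          // unsafe: msg may contain %-directives
//     printf("%s", msg);    // safe: msg is printed as plain data
//
// A safe formatting helper along those lines:
int format_message(char *out, std::size_t outsz, const char *msg) {
    return std::snprintf(out, outsz, "%s", msg);  // msg is data, not a format
}
```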

Alternate Options None

Winline Enables diagnostics about what is inlined and what is not inlined.

IDE Equivalent None

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Winline


Arguments None

Default

OFF No diagnostics are produced about what is inlined and what is not inlined.

Description This option enables diagnostics about what is inlined and what is not inlined. The diagnostics depend on what interprocedural functionality is available.

Alternate Options None

Wl Passes options to the linker for processing.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wl,option1[,option2,...]

Arguments

option Is a linker option. This option is not processed by the driver and is directly passed to the linker.

Default

OFF No options are passed to the linker.

Description This option passes one or more options to the linker for processing. If the linker is not invoked, these options are ignored.


This option is equivalent to specifying option -Qoption,link,options.

Alternate Options None

See Also • Qoption

WL Tells the compiler to display a shorter form of diagnostic output.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: See Wbrief.

Arguments None

Default

OFF The compiler displays its normal diagnostic output.

Description This option tells the compiler to display a shorter form of diagnostic output. In this form, the original source line is not displayed and the error message text is not wrapped when too long to fit on a single line.

Alternate Options Linux: -Wbrief Windows: None


Wmain Determines whether a warning is issued if the return type of main is not expected.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wmain -Wno-main

Arguments None

Default

-Wno-main No warning is issued if the return type of main is not expected.

Description This option determines whether a warning is issued if the return type of main is not expected.

Alternate Options None

Wmissing-declarations Determines whether warnings are issued for global functions and variables without prior declaration.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wmissing-declarations


-Wno-missing-declarations

Arguments None

Default

-Wno-missing-declara- No warnings are issued for global functions and variables without tions prior declaration.

Description This option determines whether warnings are issued for global functions and variables without prior declaration.

Alternate Options None

Wmissing-prototypes Determines whether warnings are issued for missing prototypes.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wmissing-prototypes -Wno-missing-prototypes

Arguments None

Default

-Wno-missing-prototypes No warnings are issued for missing prototypes.


Description Determines whether warnings are issued for missing prototypes.

Alternate Options None

wn Controls the number of errors displayed before compilation stops. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -wnn

Arguments

n Is the number of errors to display.

Default

100 The compiler displays a maximum of 100 errors before aborting compilation.

Description This option controls the number of errors displayed before compilation stops.

Alternate Options None


Wnon-virtual-dtor Issues a warning when a class appears to be polymorphic, yet it declares a non-virtual destructor.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wnon-virtual-dtor

Arguments None

Default

OFF The compiler does not issue a warning.

Description This option issues a warning when a class appears to be polymorphic, yet it declares a non-virtual destructor. This option is supported in C++ only.
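
The sketch below (illustrative only) shows why the warning matters: with a virtual base destructor, deleting a derived object through a base pointer runs both destructors:

```cpp
#include <cassert>

static int destroyed = 0;

struct Base {
    virtual ~Base() { ++destroyed; }   // virtual: derived dtor runs too
    virtual int id() const { return 0; }
};

struct Derived : Base {
    ~Derived() { ++destroyed; }
    int id() const { return 1; }
};

// Deleting through a base pointer is only well-defined because the
// base destructor is virtual; a non-virtual one is what the warning
// flags in polymorphic classes.
int destroy_via_base() {
    destroyed = 0;
    Base *p = new Derived();
    delete p;              // runs ~Derived, then ~Base
    return destroyed;
}
```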

Alternate Options None

wo Tells the compiler to issue one or more diagnostic messages only once. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -woLn[,Ln,...]


Arguments

Ln Is the number of the diagnostic.

Default

OFF

Description Specifies the ID number of one or more messages. If you specify more than one Ln, each Ln must be separated by a comma.

Alternate Options None

Wp Passes options to the preprocessor.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wp,option1[,option2,...]

Arguments

option Is a preprocessor option. This option is not processed by the driver and is directly passed to the preprocessor.

Default

OFF No options are passed to the preprocessor.

Description This option passes one or more options to the preprocessor. If the preprocessor is not invoked, these options are ignored. This option is equivalent to specifying option -Qoption, cpp, options.


Alternate Options None

See Also • Qoption

Wp64 Tells the compiler to display diagnostics for 64-bit porting.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wp64

Arguments None

Default

OFF The compiler does not display diagnostics for 64-bit porting.

Description This option tells the compiler to display diagnostics for 64-bit porting.

Alternate Options None

Wpointer-arith Determines whether warnings are issued for questionable pointer arithmetic.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -Wpointer-arith -Wno-pointer-arith

Arguments None

Default

-Wno-pointer-arith No warnings are issued for questionable pointer arithmetic.

Description Determines whether warnings are issued for questionable pointer arithmetic.

Alternate Options None

Wpragma-once Determines whether a warning is issued about the use of #pragma once.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wpragma-once -Wno-pragma-once

Arguments None


Default

-Wno-pragma-once No warning is issued about the use of #pragma once.

Description This option determines whether a warning is issued about the use of #pragma once.

Alternate Options None

wr Changes a soft diagnostic to a remark. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -wrLn[,Ln,...]

Arguments

Ln Is the number of the diagnostic to be changed.

Default

OFF The compiler returns soft diagnostics as usual.

Description This option overrides the severity of the soft diagnostic that corresponds to the specified number and changes it to a remark. If you specify more than one Ln, each Ln must be separated by a comma.

Alternate Options None


Wreorder Tells the compiler to issue a warning when the order of member initializers does not match the order in which they must be executed.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wreorder

Arguments None

Default

OFF The compiler does not issue a warning.

Description This option tells the compiler to issue a warning when the order of member initializers does not match the order in which they must be executed. This option is supported for C++ only.

Alternate Options None

Wremarks Tells the compiler to display remarks and comments.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wremarks


Arguments None

Default

OFF Default warning messages are enabled.

Description This option tells the compiler to display remarks and comments. If you want to display warnings and errors, specify option -Wall.

Alternate Options None

Wreturn-type Determines whether warnings are issued when a function uses the default int return type or when a return statement is used in a void function.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wreturn-type -Wno-return-type

Arguments None

Default

-Wno-return-type No warnings are issued when a function uses the default int return type or when a return statement is used in a void function.


Description This option determines whether warnings are issued when a function uses the default int return type or when a return statement is used in a void function.

Alternate Options None

Wshadow Determines whether a warning is issued when a variable declaration hides a previous declaration.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wshadow -Wno-shadow

Arguments None

Default

-Wno-shadow No warning is issued when a variable declaration hides a previous declaration.

Description This option determines whether a warning is issued when a variable declaration hides a previous declaration. Same as -ww1599.

Alternate Options None


Wsign-compare Determines whether warnings are issued when a comparison between signed and unsigned values could produce an incorrect result when the signed value is converted to unsigned.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wsign-compare -Wno-sign-compare

Arguments None

Default

-Wno-sign-compare The compiler does not issue these warnings.

Description This option determines whether warnings are issued when a comparison between signed and unsigned values could produce an incorrect result when the signed value is converted to unsigned. This option is provided for compatibility with gcc.

Alternate Options None


Wstrict-aliasing Determines whether warnings are issued for code that might violate the optimizer's strict aliasing rules.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wstrict-aliasing -Wno-strict-aliasing

Arguments None

Default

-Wno-strict-aliasing No warnings are issued for code that might violate the optimizer's strict aliasing rules.

Description This option determines whether warnings are issued for code that might violate the optimizer's strict aliasing rules. These warnings will only be issued if you also specify option -ansi-alias or option -fstrict-aliasing.

Alternate Options None

See Also • ansi-alias, Qansi-alias


Wstrict-prototypes Determines whether warnings are issued for functions declared or defined without specified argument types.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wstrict-prototypes -Wno-strict-prototypes

Arguments None

Default

-Wno-strict-prototypes No warnings are issued for functions declared or defined without specified argument types.

Description This option determines whether warnings are issued for functions declared or defined without specified argument types.

Alternate Options None

Wtrigraphs Determines whether warnings are issued if any trigraphs are encountered that might change the meaning of the program.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -Wtrigraphs -Wno-trigraphs

Arguments None

Default

-Wno-trigraphs No warnings are issued if any trigraphs are encountered that might change the meaning of the program.

Description This option determines whether warnings are issued if any trigraphs are encountered that might change the meaning of the program.

Alternate Options None

Wuninitialized Determines whether a warning is issued if a variable is used before being initialized.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wuninitialized -Wno-uninitialized

Arguments None


Default

-Wno-uninitialized No warning is issued if a variable is used before being initialized.

Description This option determines whether a warning is issued if a variable is used before being initialized. -Wuninitialized is equivalent to -ww592; -Wno-uninitialized is equivalent to -wd592.

Alternate Options -ww592 and -wd592

Wunknown-pragmas Determines whether a warning is issued if an unknown #pragma directive is used.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wunknown-pragmas -Wno-unknown-pragmas

Arguments None

Default

-Wunknown-pragmas A warning is issued if an unknown #pragma directive is used.

Description This option determines whether a warning is issued if an unknown #pragma directive is used.

Alternate Options None


Wunused-function Determines whether a warning is issued if a declared function is not used.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wunused-function -Wno-unused-function

Arguments None

Default

-Wno-unused-function No warning is issued if a declared function is not used.

Description This option determines whether a warning is issued if a declared function is not used.

Alternate Options None

Wunused-variable Determines whether a warning is issued if a local or non-constant static variable is unused after being declared.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -Wunused-variable -Wno-unused-variable

Arguments None

Default

-Wno-unused-variable No warning is issued if a local or non-constant static variable is unused after being declared.

Description This option determines whether a warning is issued if a local or non-constant static variable is unused after being declared.

Alternate Options None

ww Changes a soft diagnostic to a warning. This is a deprecated option.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -wwLn[,Ln,...]

Arguments

Ln Is the number of the diagnostic to be changed.


Default

OFF The compiler returns soft diagnostics as usual.

Description This option overrides the severity of the soft diagnostic that corresponds to the specified number and changes it to a warning. If you specify more than one Ln, each Ln must be separated by a comma.

Alternate Options None

Wwrite-strings Issues a diagnostic message if const char * is converted to (non-const) char *.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Wwrite-strings

Arguments None

Default

OFF No diagnostic message is issued if const char * is converted to (non-const) char*.

Description This option issues a diagnostic message if const char* is converted to (non-const) char *.

Alternate Options None


Werror, WX Changes all warnings to errors.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Werror

Arguments None

Default

OFF The compiler returns diagnostics as usual.

Description This option changes all warnings to errors.

Alternate Options VxWorks: -diag-error warn Windows: /Qdiag-error:warn

X Options

x (type option) All source files found subsequent to -x type will be recognized as a particular type.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -x type

Arguments

type Is the type of source file. Possible values are:

c                   C source file
c++                 C++ source file
c-header            C header file
cpp-output          C preprocessed file
c++-cpp-output      C++ preprocessed file
assembler           Assembly file
assembler-with-cpp  Assembly file that needs to be preprocessed
none                Disable recognition and revert to file extension

Default

none Disable recognition and revert to file extension.

Description All source files found subsequent to -x will be recognized as a particular type.

Alternate Options None

Example Suppose you want to compile the following C and C++ source files whose extensions are not recognized by the compiler:

File Name Language

file1.c99 C


File Name Language

file2.cplusplus C++

We will also include these files whose extensions are recognized:

File Name Language

file3.c C

file4.cpp C++

The command-line invocation using the -x option follows:

icpc -x c file1.c99 -x c++ file2.cplusplus -x none file3.c file4.cpp

x Tells the compiler to generate optimized code specialized for the Intel processor that executes your program.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -xcode

Arguments code Indicates the instructions and optimizations to be generated for the set of processors in each description. Many of the following descriptions refer to Intel® Streaming SIMD Extensions (Intel® SSE) and Supplemental Streaming SIMD Extensions (Intel® SSSE). Possible values are:


AVX        May generate Intel® Advanced Vector Extensions (Intel® AVX), Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors. Optimizes for a future Intel processor.

SSE4.2     May generate Intel® SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel® Core™ i7 processors. May generate Intel® SSE4 Vectorizing Compiler and Media Accelerator, Intel® SSSE3, SSE3, SSE2, and SSE instructions and it may optimize for the Intel® Core™ processor family.

SSE4.1     May generate Intel® SSE4 Vectorizing Compiler and Media Accelerator instructions for Intel processors. May generate Intel® SSSE3, SSE3, SSE2, and SSE instructions and it may optimize for Intel® 45nm Hi-k next generation Intel® Core™ microarchitecture. This replaces value S, which is deprecated.

SSE3_ATOM  May generate MOVBE instructions for Intel processors, depending on the setting of option -minstruction (VxWorks). May also generate Intel® SSSE3, SSE3, SSE2, and SSE instructions for Intel processors. Optimizes for the Intel® Atom™ processor and Intel® Centrino® Atom™ Processor Technology.

SSSE3      May generate Intel® SSSE3, SSE3, SSE2, and SSE instructions for Intel processors. Optimizes for the Intel® Core™ microarchitecture. This replaces value T, which is deprecated.

SSE3       May generate Intel® SSE3, SSE2, and SSE instructions for Intel processors. Optimizes for the enhanced Pentium® M processor microarchitecture and Intel NetBurst® microarchitecture. This replaces value P, which is deprecated.

SSE2       May generate Intel® SSE2 and SSE instructions for Intel processors. Optimizes for the Intel NetBurst® microarchitecture. This replaces value N, which is deprecated.

You can also specify Host. For more information, see option -xHost (VxWorks* OS) or /QxHost (Windows*).

Default

VxWorks* systems: None On Windows systems, if neither /Qx nor /arch is specified, the default is /arch:SSE2. On Linux and VxWorks* systems, if neither -x nor -m is specified, the default is -msse2.

Description This option tells the compiler to generate optimized code specialized for the Intel processor that executes your program. It also enables optimizations in addition to Intel processor-specific optimizations.

The specialized code generated by this option may run only on a subset of Intel processors. The resulting executables from these processor-specific options can only be run on the specified or later Intel® processors, as they incorporate optimizations specific to those processors and use a specific version of the Intel® Streaming SIMD Extensions (Intel® SSE) instruction set. The binaries produced by these code values will run on Intel processors that support all of the features for the targeted processor.

Do not use code values to create binaries that will execute on a processor that is not compatible with the targeted processor. The resulting program may fail with an illegal instruction exception or display other unexpected behavior. Compiling the function main() with any of the code values produces binaries that display a fatal run-time error if they are executed on unsupported processors, including all non-Intel processors. For more information, see Optimizing Applications.

Compiler options m and arch produce binaries that should run on processors not made by Intel that implement the same capabilities as the corresponding Intel processors.


Previous value O is deprecated and has been replaced by option -msse3 (VxWorks) and option /arch:SSE3 (Windows). Previous values W and K are deprecated. The details on replacements are as follows:

• Windows, VxWorks and Linux systems: The replacement for W is -msse2 (Linux and VxWorks). There is no exact replacement for K. However, on Windows systems, /QxK is interpreted as /arch:IA32; on Linux and VxWorks systems, -xK is interpreted as -mia32. You can also do one of the following:

• Upgrade to option -msse2 (Linux and VxWorks) or option /arch:SSE2 (Windows). This will produce one code path that is specialized for Intel® SSE2. It will not run on earlier processors.
• Specify the two option combination -mia32 -axSSE2 (Linux and VxWorks). This combination will produce an executable that runs on any processor with IA-32 architecture but with an additional specialized Intel® SSE2 code path.

The -x and /Qx options enable additional optimizations not enabled with options -m or /arch (nor with options -ax and /Qax). On Windows* systems, options /Qx and /arch are mutually exclusive. If both are specified, the compiler uses the last one specified and generates a warning. Similarly, on VxWorks* systems, options -x and -m are mutually exclusive. If both are specified, the compiler uses the last one specified and generates a warning.

Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that


are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options None

See Also • xHost, QxHost • ax, Qax • m • minstruction, Qinstruction

X Removes standard directories from the include file search path.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -X

Arguments None

Default

OFF Standard directories are in the include file search path.


Description This option removes standard directories from the include file search path. It prevents the compiler from searching the default path specified by the INCLUDE environment variable. On VxWorks systems, specifying -X (or -noinclude) prevents the compiler from searching in /usr/include for files specified in an INCLUDE statement. You can use this option with the I option to prevent the compiler from searching the default path for include files and direct it to use an alternate path.

Alternate Options VxWorks: -nostdinc Windows: None

See Also • I

Xbind-lazy Enables lazy binding of function calls.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Xbind-lazy

Arguments None

Default

OFF

Description Enables lazy binding of function calls. This option is equivalent to -Wl,-z,lazy.


Alternate Options -Wl,-z,lazy

See Also • Xbind-now

Xbind-now Disables lazy binding of function calls.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Xbind-now

Arguments None

Default

ON

Description Disables lazy binding of function calls. This option is the default.

Alternate Options None

See Also • Xbind-lazy


xHost Tells the compiler to generate instructions for the highest instruction set available on the compilation host processor.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -xHost

Arguments None

Default

VxWorks* systems: None On VxWorks systems, if neither -x nor -m is specified, the default is -msse2.

Description This option tells the compiler to generate instructions for the highest instruction set available on the compilation host processor. The instructions generated by Host differ depending on the compilation host processor. The following table describes the effects of specifying Host and it tells whether the resulting executable will run on processors different from the host processor. Descriptions in the table refer to Intel® Advanced Vector Extensions (Intel® AVX), Intel® Streaming SIMD Extensions (Intel® SSE), and Intel® Supplemental Streaming SIMD Extensions (Intel® SSSE).

Instruction Set of Host Processor    Effects When Host Is Specified

Intel® AVX On Intel® Processors:


Corresponds to option -xAVX (VxWorks* OS) or /QxAVX (Windows*). The executable will not run on non-Intel processors and it will not run on Intel® processors that do not support Intel® AVX instructions. On Non-Intel Processors: Corresponds to option -msse3 (VxWorks* OS). The executable will run on Intel® processors and non-Intel processors that support at least Intel® SSE3 instructions. You may see a run-time error if the run-time processor does not support Intel® SSE3 instructions.

Intel® SSE4.2 On Intel® Processors: Corresponds to option -xSSE4.2 (VxWorks* OS) or /QxSSE4.2 (Windows*). The executable will not run on non-Intel processors and it will not run on Intel® processors that do not support Intel® SSE4.2 instructions. On Non-Intel Processors: Corresponds to option -msse3 (VxWorks* OS) or /arch:SSE3 (Windows). The executable will run on Intel® processors and non-Intel processors that support at least Intel® SSE3 instructions. You may see a run-time error if the run-time processor does not support Intel® SSE3 instructions.

Intel® SSE4.1 On Intel® Processors: Corresponds to option -xSSE4.1 (VxWorks* OS) or /QxSSE4.1 (Windows*). The executable will not run on non-Intel processors and it will not run on Intel® processors that do not support Intel® SSE4.1 instructions. On Non-Intel Processors: Corresponds to option -msse3 (VxWorks* OS) or /arch:SSE3 (Windows). The executable will run on Intel® processors and non-Intel processors that support at least Intel® SSE3 instructions. You may see a run-time error if the run-time processor does not support Intel® SSE3 instructions.

Intel® SSSE3 On Intel® Processors:


Corresponds to option -xSSSE3 (VxWorks* OS) or /QxSSSE3 (Windows*). The executable will not run on non-Intel processors and it will not run on Intel® processors that do not support Intel® SSSE3 instructions. On Non-Intel Processors: Corresponds to option -msse3 (VxWorks* OS) or /arch:SSE3 (Windows). The executable will run on Intel® processors and non-Intel processors that support at least Intel® SSE3 instructions. You may see a run-time error if the run-time processor does not support Intel® SSE3 instructions.

Intel® SSE3 On Intel® Processors: Corresponds to option -xSSE3 (VxWorks* OS) or /QxSSE3 (Windows*). The executable will not run on non-Intel processors and it will not run on Intel® processors that do not support Intel® SSE3 instructions. On Non-Intel Processors: Corresponds to option -msse3 (VxWorks* OS) or /arch:SSE3 (Windows). The executable will run on Intel® processors and non-Intel processors that support at least Intel® SSE3 instructions. You may see a run-time error if the run-time processor does not support Intel® SSE3 instructions.

Intel® SSE2 On Intel® Processors: Corresponds to option -msse2 (VxWorks* OS) or /arch:SSE2 (Windows*). The executable will run on non-Intel processors. You may see a run-time error if the run-time processor does not support Intel® SSE2 instructions. On Non-Intel Processors: Corresponds to option -msse2 (VxWorks* OS) or /arch:SSE2 (Windows). The executable will run on Intel® processors and non-Intel processors that support at least Intel® SSE2 instructions. You may see a run-time error if the run-time processor does not support Intel® SSE2 instructions.

For more information on other settings for option -x (VxWorks* OS) and /Qx (Windows*), see that option description.


Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Alternate Options None

See Also • x, Qx • ax, Qax • m

Xlinker Passes a linker option directly to the linker.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -Xlinker option

Arguments

option Is a linker option.

Default

OFF No options are passed directly to the linker.

Description This option passes a linker option directly to the linker. If -Xlinker -shared is specified, only -shared is passed to the linker and no special work is done to ensure proper linkage for generating a shared object. -Xlinker just takes whatever arguments are supplied and passes them directly to the linker. If you want to pass compound options to the linker, for example "-L $HOME/lib", you must use the following method: -Xlinker -L -Xlinker $HOME/lib

Alternate Options None

See Also • shared

Z Options

g, Zi, Z7 Tells the compiler to generate full debugging information in the object file or a project database (PDB) file.

Architectures IA-32, Intel® 64 architectures


Syntax

VxWorks: -g

Arguments None

Default

OFF No debugging information is produced in the object file or in a PDB file.

Description Options -g (VxWorks* OS) and /Z7 (Windows*) tell the compiler to generate symbolic debugging information in the object file, which increases the size of the object file. The /Zi option (Windows) tells the compiler to generate symbolic debugging information in a PDB file. The compiler does not support the generation of debugging information in assemblable files. If you specify these options, the resulting object file will contain debugging information, but the assemblable file will not. These options turn off O2 and make O0 (VxWorks OS) the default unless O2 (or higher) is explicitly specified in the same command line. On Linux* OS and VxWorks* OS systems using Intel® 64 architecture and VxWorks* OS systems using IA-32 architecture, specifying the -g or -O0 option sets the -fno-omit-frame-pointer option.

Alternate Options

/Zi VxWorks: None Windows: /debug:full, /debug:all, /debug, or /ZI

See Also • Z Options


Zd This option has been deprecated. Use keyword minimal in option /debug.


ZI Tells the compiler to generate full debugging information in the object file.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: See g.

Arguments None

Default

OFF No debugging information is produced in the object file.

Description For details, see /Zi.

Alternate Options

• Linux: -g • Windows: /Zi, /Z7


Zp Specifies alignment for structures on byte boundaries.

Architectures IA-32, Intel® 64 architectures

Syntax

VxWorks: -Zp[n]

Arguments

n Is the byte size boundary. Possible values are 1, 2, 4, 8, or 16.

Default

Zp16 Structures are aligned on a 16-byte boundary or on the boundary that will naturally align them, whichever is smaller.

Description This option specifies alignment for structures on byte boundaries. If you do not specify n, you get Zp16.

Alternate Options None


Related Options

This topic lists related options that can be used under certain conditions.

Linking Tools and Options

This topic describes how to use the Intel® linking tools, xild (VxWorks* OS) or xilink (Windows* OS). The Intel linking tools behave differently on different platforms. The following sections summarize the primary differences between the linking behaviors.

VxWorks OS Linking Behavior Summary

The linking tool invokes the compiler to perform IPO if objects containing IR (intermediate representation) are found. (These are mock objects.) It invokes GNU ld to link the application. The command-line syntax for xild is the same as that of the GNU linker: xild [<options>] <LD_commandline>

where:

• [<options>]: (optional) one or more options supported only by xild.
• <LD_commandline>: linker command line containing a set of valid arguments for ld.

To create app using IPO, use the option -ofilename as shown in the following example: xild -qipo-fas -oapp a.o b.o c.o

The linking tool calls the compiler to perform IPO for objects containing IR and creates a new list of object(s) to be linked. The linker then calls ld to link the object files that are specified in the new list and produce the application with the name specified by the -o option. The linker supports the -ipo[n] and -ipo-separate options.


Windows OS Linking Behavior Summary

The linking tool invokes the Intel compiler to perform multi-file IPO if objects containing IR (intermediate representation) are found. These are mock objects. It invokes Microsoft* link.exe to link the application. The command-line syntax for the Intel® linker is the same as that of the Microsoft linker: xilink [<options>] <LINK_commandline>

where:

• [<options>]: (optional) one or more options supported only by xilink.
• <LINK_commandline>: linker command line containing a set of valid arguments for the Microsoft linker.

To place the multifile IPO executable in ipo_file.exe, use the linker option /out:filename; for example: xilink -qipo-fas /out:ipo_file.exe a.obj b.obj c.obj

The linker calls the compiler to perform IPO for objects containing IR and creates a new list of object(s) to be linked. The linker calls Microsoft link.exe to link the object files that are specified in the new list and produce the application with the name specified by the /out:filename linker option.

Using the Linking Tools You must use the Intel linking tools to link your application if the following conditions apply:

• Your source files were compiled with multi-file IPO enabled. Multi-file IPO is enabled by specifying compiler option -ipo (VxWorks OS) or /Qipo (Windows OS).
• You normally would invoke either the GNU linker (ld) or the Microsoft linker (link.exe) to link your application.

The following table lists the available, case-insensitive options supported by the Intel linking tools and briefly describes the behavior of each option:

Linking Tools Option Description

-qdiag-type=diag-list Controls the display of diagnostic information.


The type is an action to perform on diagnostics. Possible values are:

enable     Enables a diagnostic message or a group of messages.
disable    Disables a diagnostic message or a group of messages.

The diag-list is a diagnostic group or ID value. Possible values are:

thread          Specifies diagnostic messages that help in thread-enabling a program.
vec             Specifies diagnostic messages issued by the vectorizer.
warn            Specifies diagnostic messages that have a "warning" severity level.
error           Specifies diagnostic messages that have an "error" severity level.
remark          Specifies diagnostic messages that are remarks or comments.
cpu-dispatch    Specifies the CPU dispatch remarks for diagnostic messages. These remarks are enabled by default.
id[,id,...]     Specifies the ID number of one or more messages. If you specify more than one message number, they must be separated by commas. There can be no intervening white space between each "id".
tag[,tag,...]   Specifies the mnemonic name of one or more messages. If you specify more than one mnemonic name, they must be separated by commas. There can be no intervening white space between each "tag".

The diagnostic messages generated can be affected by certain options, such as /arch (Windows) or -m or -x (VxWorks).

-qhelp Lists the available linking tool options. Same as passing no option.

-qnoipo Disables multi-file IPO compilation.

-qipo-fa [{filename | dir}] Produces an assembly listing for the multi-file IPO compilation. You may specify an optional name for the listing file, or a directory (with the backslash) in which to place the file. The default listing name depends on the platform:

• VxWorks OS: ipo_out.s • Windows OS: ipo_out.asm

If the Intel linking tool invocation results in multi-object compilation, either because the application is big or because the user explicitly instructed the compiler to generate multiple objects, the first .s (VxWorks OS) or .asm (Windows OS) file takes its name from the -qipo-fa option. The compiler derives the names of subsequent .s (VxWorks OS) or .asm (Windows OS) files by appending an incrementing number to the name, for example, foo.asm and foo1.asm for -qipo-fafoo.asm. The same is true for the -qipo-fo option (listed below).

-qipo-fo [{filename | dir}] Produces an object file for the multi-file IPO compilation. You may specify an optional name for the object file, or a directory (with the backslash) in which to place the file. The default object file name depends on the platform:

• VxWorks OS: ipo_out.o • Windows OS: ipo_out.obj

-qipo-fas Adds source lines to the assembly listing.

-qipo-fac Adds code bytes to the assembly listing.

-qipo-facs Adds code bytes and source lines to the assembly listing.

-qomit-il Omits il-data from relocatable objects generated when using "xild -r". This prevents these relocatable objects from being included in any further IPO compilations, which may affect further optimization. It reduces the size of the generated object file.

-quseenv Disables any overrides of existing PATH, LIB, and INCLUDE variables.



-lib Invokes the librarian instead of the linker.

-qv Displays version information.

Note that in the above options, you can specify an underscore (_) instead of a dash (-).

Portability Options

A challenge in porting applications from one compiler to another is making sure there is support for the compiler options you use to build your application. The Intel® C++ Compiler supports many of the options that are valid on other compilers you may be using. The following sections list compiler options that are supported by both the Intel® C++ Compiler and the following compiler:

• gcc* Compiler

Options that are unique to either compiler are not listed in these sections.

Options Equivalent to gcc* Options (Linux* OS and VxWorks* OS)

The following table lists compiler options that are supported by both the Intel® C++ Compiler and the gcc* compiler.

-A

-ansi

-B

-C

-c

-D

-dD

-dM


-dN

-E

-fargument-noalias

-fargument-noalias-global

-fdata-sections

-ffunction-sections

-fmudflap

-f[no-]builtin

-f[no-]common

-f[no-]freestanding

-f[no-]gnu-keywords

-fno-implicit-inline-templates

-fno-implicit-templates

-f[no-]inline

-f[no-]inline-functions

-f[no-]math-errno

-f[no-]operator-names

-f[no-]stack-protector

-f[no-]unsigned-bitfields

-fpack-struct

-fpermissive


-fPIC

-fpic

-freg-struct-return

-fshort-enums

-fsyntax-only

-ftemplate-depth

-ftls-model=global-dynamic

-ftls-model=initial-exec

-ftls-model=local-dynamic

-ftls-model=local-exec

-funroll-loops

-funsigned-char

-fverbose-asm

-fvisibility=default

-fvisibility=hidden

-fvisibility=internal

-fvisibility=protected

-H

-help

-I

-idirafter


-imacros

-iprefix

-iwithprefix

-iwithprefixbefore

-l

-L

-M

-malign-double

-march

-mcpu

-MD

-MF

-MG

-MM

-MMD

-m[no-]ieee-fp

-MP

-mp

-MQ

-msse

-msse2


-msse3

-MT

-mtune

-nodefaultlibs

-nostartfiles

-nostdinc

-nostdinc++

-nostdlib

-o

-O

-O0

-O1

-O2

-O3

-Os

-P

-S

-shared

-static

-std

-trigraphs


-U

-u

-v

-V

-w

-Wall

-Werror

-Winline

-W[no-]cast-qual

-W[no-]comment

-W[no-]comments

-W[no-]deprecated

-W[no-]fatal-errors

-W[no-]format-security

-W[no-]main

-W[no-]missing-declarations

-W[no-]missing-prototypes

-W[no-]overflow

-W[no-]overloaded-virtual

-W[no-]pointer-arith

-W[no-]return-type


-W[no-]strict-prototypes

-W[no-]trigraphs

-W[no-]uninitialized

-W[no-]unknown-pragmas

-W[no-]unused-function

-W[no-]unused-variable

-X

-x assembler-with-cpp

-x c

-x c++

-Xlinker

Part III: Optimizing Applications

Topics:

• Using Interprocedural Optimization (IPO)
• Using Profile-Guided Optimization (PGO)
• Using High-Level Optimization (HLO)


Chapter 15: Using Interprocedural Optimization (IPO)

Interprocedural Optimization (IPO) Overview

Interprocedural Optimization (IPO) allows the compiler to analyze your code to determine where you can benefit from specific optimizations. When you use IPO with the -x or -ax (Linux* OS) options, or the /Qx or /Qax (Windows* OS) options, you may see more optimizations for Intel microprocessors than for non-Intel microprocessors. The compiler might apply the following optimizations for the listed architectures:

Architecture Optimization

All supported Intel® architectures:

• inlining
• constant propagation
• mod/ref analysis
• alias analysis
• forward substitution
• routine key-attribute propagation
• address-taken analysis
• partial dead call elimination
• symbol table data promotion
• common block variable coalescing
• dead function elimination
• unreferenced variable removal
• whole program analysis
• array dimension padding
• common block splitting
• stack frame alignment
• structure splitting and field reordering
• formal parameter alignment analysis
• C++ class hierarchy analysis
• indirect call conversion
• specialization

IA-32 and Intel® 64 architectures:

• passing arguments in registers to optimize calls and register usage

IPO is an automatic, multi-step process with compilation and linking; however, IPO supports two compilation models: single-file compilation and multi-file compilation.

Single-file compilation, which uses the -ip (VxWorks* OS) option, results in one real object file for each source file being compiled. During single-file compilation the compiler performs inline function expansion for calls to procedures defined within the current source file. The compiler performs some single-file interprocedural optimization at the default optimization level, -O2 (VxWorks*) or /O2 (Windows*); additionally, the compiler sometimes performs some inlining at the -O1 (VxWorks*) or /O1 (Windows*) optimization level, such as inlining functions marked with inlining pragmas or attributes (GNU C and C++) and C++ class member functions with bodies included in the class declaration.

Multi-file compilation, which uses the -ipo (VxWorks) or /Qipo (Windows) option, results in one or more mock object files rather than normal object files. (See the Compilation section below for information about mock object files.) Additionally, the compiler collects information from the individual source files that make up the program. Using this information, the compiler performs optimizations across functions and procedures in different source files. See Inline Function Expansion.

NOTE. Inlining and other optimizations are improved by profile information. For a description of how to use IPO with profile information for further optimization, see Profile an Application.

Compilation

As each source file is compiled with IPO, the compiler stores an intermediate representation (IR) of the source code in a mock object file, which includes summary information used for optimization. The mock object files contain the IR instead of the normal object code. Mock object files can be ten times larger, and in some cases more, than the size of normal object files. During the IPO compilation phase only the mock object files are visible. The Intel compiler does not expose the real object files during IPO unless you also specify the -ipo-c (VxWorks) or /Qipo-c (Windows) option.


Linkage

When you link with the -ipo (VxWorks) or /Qipo (Windows) option the compiler is invoked a final time. The compiler performs IPO across all object files that have an IR equivalent. The mock objects must be linked with the Intel compiler or by using the Intel linking tools. The compiler calls the linkers indirectly by using aliases (or wrappers) for the native linkers, so you must modify make files to account for the different linking tool names. For information on using the linking tools, see Using IPO; see the Linking Tools and Options topic for detailed information.

CAUTION. Linking the mock object files with ld (VxWorks) or link.exe (Windows) will cause linkage errors. You must use the Intel linking tools to link mock object files.

During the compilation process, the compiler first analyzes the summary information and then produces mock object files for the source files of the application.

Whole program analysis

The compiler supports a large number of IPO optimizations that either can be applied or have their effectiveness greatly increased when the whole program condition is satisfied. Whole program analysis, when it can be done, enables many interprocedural optimizations. During the analysis process, the compiler reads all intermediate representation (IR) in the mock object files, object files, and library files to determine if all references are resolved and whether or not a given symbol is defined in a mock object file. Symbols that are included in the IR in a mock object file, for both data and functions, are candidates for manipulation based on the results of whole program analysis. There are two types of whole program analysis: the object reader method and the table method. Most optimizations can be applied if either type of whole program analysis determines that the whole program condition exists; however, some optimizations require the results of the object reader method, and some require the results of the table method.

NOTE. The IPO report provides details about whether whole program analysis was satisfied and indicates the method used during IPO compilation.

In the first type of whole program analysis, the object reader method, the object reader emulates the behavior of the native linker and attempts to resolve the symbols in the application. If all symbols are resolved correctly, the whole program condition is satisfied. This type of whole program analysis is more likely to detect the whole program condition.


Often the object files and libraries accessed by the compiler do not represent the whole program; there are many dependencies on well-known libraries. During IPO linking, whole program analysis determines whether or not the whole program can be detected using the available compiler resources.

In the second type of whole program analysis, the table method, the compiler analyzes the mock object files and generates a call graph. The compiler contains detailed tables about all of the functions for all important language-specific libraries, such as libc. In this second method, the compiler constructs a call graph for the application and then compares the function tables and the application call graph. For each unresolved function in the call graph, the compiler attempts to resolve the calls. If the compiler can resolve the function calls, the whole program condition exists.

Interprocedural Optimization (IPO) Quick Reference

IPO is a two-step process: compile and link. See Using IPO.

VxWorks* Windows* Description

-ipo or -ipoN    /Qipo or /QipoN    Enables interprocedural optimization for multi-file compilations. Normally, multi-file compilations result in a single object file only. Passing an integer value for N allows you to specify the number of true object files to generate; the default value is 0, which means the compiler determines the appropriate number of object files to generate. (See IPO for Large Programs.)

-ipo-separate /Qipo-separate Instructs the compiler to generate a separate, real object file for each mock object file. Using this option overrides any integer value passed for ipoN. (See IPO for Large Programs for specifics.)



-ip /Qip Enables interprocedural optimizations for single file compilations. Instructs the compiler to generate a separate, real object file for each source file.

Additionally, the compiler supports options that provide support for compiler-directed or developer-directed inline function expansion. Refer to Quick Reference Lists for a complete listing of the quick reference topics.

Using IPO

This topic discusses how to use IPO from a command line. For specific information on using IPO from within an Integrated Development Environment (IDE), refer to the appropriate section in Building Applications.

Compiling and Linking Using IPO

The steps to enable IPO for compilations targeted for all supported architectures are the same: compile and link. First, compile your source files with -ipo (VxWorks*) or /Qipo (Windows*) as demonstrated below:

Operating System Example Command

VxWorks icpc -ipo -c a.cpp b.cpp c.cpp

Windows* icl /Qipo /c a.cpp b.cpp c.cpp

The output of the above example command differs according to operating system:

• VxWorks: The commands produce a.o, b.o, and c.o object files.
• Windows: The commands produce a.obj, b.obj, and c.obj object files.

Use the -c (VxWorks) or /c (Windows) option to stop compilation after generating .o or .obj files. The output files contain the compiler intermediate representation (IR) corresponding to the compiled source files. (See the section below on Capturing Intermediate IPO Output.)


Second, link the resulting files. The following example command will produce an executable named app:

Operating System Example Command

VxWorks icpc -o app a.o b.o c.o

Windows icl /Feapp a.obj b.obj c.obj

The command invokes the compiler on the objects containing IR and creates a new list of objects to be linked. Alternatively, you can use the xild (VxWorks) or xilink (Windows) tool with the appropriate linking options.

Combining the Steps

The separate compile and link commands demonstrated above can be combined into a single command, as shown in the following examples:

Operating System Example Command

VxWorks icpc -ipo -o app a.cpp b.cpp c.cpp

Windows icl /Qipo /Feapp a.cpp b.cpp c.cpp

The icl/icpc command, shown in the examples above, calls GNU ld (VxWorks) or Microsoft* link.exe (Windows only) to link the specified object files and produce the executable application, which is specified by the -o (VxWorks) option.

NOTE. VxWorks: Using icpc allows the compiler to use the standard C++ libraries automatically; icc will not use the standard C++ libraries automatically. The Intel linking tools emulate the behavior of compiling with the -O0 (VxWorks) option. Multi-file IPO is applied only to the source files that have intermediate representation (IR); otherwise, the object file passes directly to the link stage.

Capturing Intermediate IPO Output

The -ipo-c (VxWorks) or /Qipo-c (Windows*) and -ipo-S (VxWorks) or /Qipo-S (Windows) options are useful for analyzing the effects of multi-file IPO, or when experimenting with multi-file IPO between modules that do not make up a complete program.


• Use the -ipo-c option to optimize across files and produce an object file. The option performs optimizations as described for the -ipo option but stops prior to the final link stage, leaving an optimized object file. The default name for this file is ipo_out.o (VxWorks) or ipo_out.obj (Windows).
• Use the -ipo-S option to optimize across files and produce an assembly file. The option performs optimizations as described for -ipo, but stops prior to the final link stage, leaving an optimized assembly file. The default name for this file is ipo_out.s (VxWorks) or ipo_out.asm (Windows).

For both options, you can use the -o (VxWorks) option to specify a different name. These options generate multiple outputs if multi-object IPO is being used. The name of the first file is taken from the value of the -o (VxWorks) or /Fe (Windows) option. The names of subsequent files are derived from the first file by appending a numeric value to the file name. For example, if the first object file is named foo.o (VxWorks) or foo.obj (Windows), the second object file will be named foo1.o or foo1.obj.

You can use the object file generated with the -ipo-c (VxWorks) or /Qipo-c (Windows) option, but you will not get the same benefit from the applied optimizations that you would otherwise get if the whole program analysis condition had been satisfied. The object file created using the -ipo-c option is a real object file, in contrast to the mock file normally generated using IPO; however, the generated object file is significantly different from the mock object file. It does not contain the IR information needed to fully optimize the application using IPO.

The compiler generates a message indicating the name of each object or assembly file it generates. These files can be added to the real link step to build the final application.
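The derived-name convention described above (foo.o, foo1.o, foo2.o, and so on) can be sketched in shell. This is an illustration only; the helper name derive_ipo_name and its arguments are hypothetical, not compiler tools:

```shell
#!/bin/sh
# Illustrative sketch of the multi-object naming convention: the first
# output keeps the given name, and each subsequent output gets an
# incrementing number inserted before the extension.
# Usage: derive_ipo_name <first-file-name> <index>   (index 0 = first file)
derive_ipo_name() {
    base=${1%.*}    # e.g. "foo" from "foo.o"
    ext=${1##*.}    # e.g. "o"
    if [ "$2" -eq 0 ]; then
        printf '%s.%s\n' "$base" "$ext"
    else
        printf '%s%d.%s\n' "$base" "$2" "$ext"
    fi
}

derive_ipo_name foo.o 0    # prints foo.o
derive_ipo_name foo.o 1    # prints foo1.o
```

The same scheme applies to the Windows file names; only the extension (.obj) differs.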

Using Option -auto-ilp32 (Linux* OS and VxWorks* OS) or /Qauto-ilp32 (Windows* OS)

On Linux systems based on Intel® 64 architecture, the -auto-ilp32 option has no effect unless you specify an SSE3 or higher suffix for the -x option.

IPO-Related Performance Issues

There are some general optimization guidelines for IPO that you should keep in mind:

• Large IPO compilations might trigger internal limits of other compiler optimization phases.


• Combining IPO and PGO can be a key technique for C++ applications. The -O3, -ipo, and -prof-use (VxWorks*) or /O3, /Qipo, and /Qprof-use (Windows*) options can result in performance gains.
• IPO benefits C more than C++, since C++ compilations include intra-file inlining by default.
• Applications where the compiler does not have sufficient intermediate representation (IR) coverage to do whole program analysis might not perform as well as those where IR information is complete.

In addition to the general guidelines, there are some practices to avoid while using IPO. The following list summarizes the activities to avoid:

• Do not use the link phase of an IPO compilation with mock object files produced for your application by a different compiler. The Intel® Compiler cannot inspect mock object files generated by other compilers for optimization opportunities.
• Do not link mock files with the -prof-use (VxWorks*) or /Qprof-use (Windows*) option unless the mock files were also compiled with the -prof-use (VxWorks) option.
• Update make files to call the appropriate Intel linkers when using IPO from scripts. For VxWorks, replace all instances of ld with xild; for Windows, replace all instances of link with xilink.
• Update make files to call the appropriate Intel archiver. Replace all instances of ar with xiar.
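The make file substitutions described above can be scripted. The following is a hedged sketch: Makefile.ipo is a hypothetical file created here only for the demonstration, GNU sed is assumed (for -i and \b word boundaries), and a real build file should be reviewed by hand after the substitution:

```shell
#!/bin/sh
# Sketch: rewrite linker/archiver tool names in a make file for IPO builds.
# Create a minimal example make file (hypothetical, for illustration only).
cat > Makefile.ipo <<'EOF'
LD = ld
AR = ar
EOF

# Replace the native tools with the Intel linking tool and archiver.
sed -i -e 's/\bld\b/xild/g' -e 's/\bar\b/xiar/g' Makefile.ipo

cat Makefile.ipo    # the file now reads: LD = xild / AR = xiar
```

A word-boundary match is used so that the substitution does not corrupt longer names that merely contain "ld" or "ar".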

IPO for Large Programs

In most cases, IPO generates a single object file for the link-time compilation. This behavior is not optimal for very large programs, perhaps even making it impossible to use -ipo (VxWorks*) or /Qipo (Windows*) on the application. The compiler provides two methods to avoid this problem. The first method is an automatic size-based heuristic, which causes the compiler to generate multiple object files for large link-time compilations. The second method is to manually instruct the compiler to perform multi-object IPO:

• Use the -ipoN (VxWorks) or /QipoN (Windows) option and pass an integer value in the place of N.
• Use the -ipo-separate (VxWorks) or /Qipo-separate (Windows) option.

The number of true object files generated by the link-time compilation is not visible to you unless the -ipo-c or -ipo-S (VxWorks) or /Qipo-c or /Qipo-S (Windows) option is used.

Regardless of the method used, it is best to use the compiler defaults first and examine the results. If the defaults do not provide the expected results then experiment with generating more object files.


You can use the -ipo-jobs (VxWorks) or /Qipo-jobs (Windows) option to control the number of commands, or jobs, executed during parallel builds.

Using -ipoN or /QipoN to Create Multiple Object Files

If you specify -ipo0 (VxWorks) or /Qipo0 (Windows), which is the same as not specifying a value, the compiler uses heuristics to determine whether to create one or more object files based on the expected size of the application. The compiler generates one object file for small applications, and two or more object files for large applications. If you specify any value greater than 0, the compiler generates that number of object files, unless you pass a value that exceeds the number of source files. In that case, the compiler creates one object file for each source file, then stops generating object files. The following example commands demonstrate how to use -ipo2 (VxWorks) or /Qipo2 (Windows) to compile large programs.

Operating System Example Command

VxWorks icpc -ipo2 -c a.cpp b.cpp

Windows icl /Qipo2 /c a.cpp b.cpp

Given the example commands shown above, the compiler generates object files using an OS-dependent naming convention. On VxWorks, the example command results in object files named ipo_out.o, ipo_out1.o, ipo_out2.o, and ipo_out3.o. On Windows, the file names follow the same convention; however, the file extensions will be .obj. Link the resulting object files as shown in Using IPO or Linking Tools and Options.
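As a quick illustration of the default naming convention for multi-object output, the shell sketch below prints the expected file names for a given object count and extension. The helper ipo_default_names is hypothetical, not a compiler tool:

```shell
#!/bin/sh
# Sketch: print the default multi-object IPO output names described above.
# Usage: ipo_default_names <count> <extension>   (extension: o or obj)
ipo_default_names() {
    i=0
    while [ "$i" -lt "$1" ]; do
        if [ "$i" -eq 0 ]; then
            printf 'ipo_out.%s\n' "$2"
        else
            printf 'ipo_out%d.%s\n' "$i" "$2"
        fi
        i=$((i + 1))
    done
}

ipo_default_names 4 o    # ipo_out.o ipo_out1.o ipo_out2.o ipo_out3.o
```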

Creating the Maximum Number of Object Files

Using -ipo-separate (VxWorks) or /Qipo-separate (Windows) allows you to force the compiler to generate the maximum number of true object files that the compiler will support during multiple object compilation. For example, assume you passed example commands similar to the following:

Operating System Example Command

VxWorks icpc -ipo-separate -ipo-c a.o b.o c.o



Windows icl a.obj b.obj c.obj /Qipo-separate /Qipo-c

The compiler will generate multiple object files, which use the same naming convention discussed above. The first object file contains global variables. The other object files contain code for the functions or routines used in the source files. Link the resulting object files as shown in Using IPO or Linking Tools and Options.

Considerations for Large Program Compilation

For many large programs, compiling with IPO can result in a single, large object file. Compiling to produce large objects can create problems for efficient compilation. During compilation of a large object file, the system might swap memory heavily; this can result in out-of-memory messages or long compilation times. Using multiple, relatively small object files during compilation causes the system to use resources more efficiently.

Understanding Code Layout and Multi-Object IPO

One of the optimizations performed during an IPO compilation is code layout. The analysis performed by the compiler during multi-file IPO determines a layout order for all of the routines for which it has intermediate representation (IR) information. For a multi-object IPO compilation, the compiler must tell the linker about the desired order. If you are generating an executable in the link step, the compiler does all of this automatically. However, if you are generating object files instead of an executable, the compiler generates a layout script, which contains the correct information needed to optimally link the executable when you are ready to create it. This linking tool script must be taken into account if you use either -ipo-c or -ipo-S (Linux*) or /Qipo-c or /Qipo-S (Windows*). With these options, the IPO compilation and the actual linking are done by different invocations of the compiler. When this occurs, the compiler issues a message indicating that it is generating an explicit linker script, ipo_layout.script.

The compiler first puts each routine in a named text section that varies depending on the operating system:

Windows:

• The first routine is placed in .text$00001, the second is placed in .text$00002, and so on. The Windows linker (link.exe) automatically collates these sections lexicographically in the desired order.

Linux:

• The first routine is placed in .text00001, the second is placed in .text00002, and so on. The compiler then generates a linker script that tells the linker to first link contributions from .text00001, then .text00002, and so on.

When ipo_layout.script is generated, you should modify your link command if you want to use the script to optimize code layout:

Example --script=ipo_layout.script

If your application already requires a custom linker script, you can place the necessary contents of ipo_layout.script in your script. The layout-specific content of ipo_layout.script is at the beginning of the description of the .text section. For example, to describe the layout order for 12 routines:

Example output

.text :
{
  *(.text00001)
  *(.text00002)
  *(.text00003)
  *(.text00004)
  *(.text00005)
  *(.text00006)
  *(.text00007)
  *(.text00008)
  *(.text00009)
  *(.text00010)
  *(.text00011)
  *(.text00012)
  ...

For applications that already require a linker script, you can add this portion of the .text section description to the customized linker script. If you add these lines to your linker script, it is desirable to add additional entries to account for future development. The addition is harmless, since the "*(...)" syntax makes these contributions optional. If you choose not to use the linker script, your application will still build, but the layout order will be random. This may have an adverse effect on application performance, particularly for large applications.
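Because unmatched input-section entries are simply ignored by the linker, a generated snippet can safely include more entries than the current routine count. The following shell sketch produces such a snippet; the routine count and the output file name layout_snippet.script are arbitrary choices for illustration (a real ipo_layout.script is produced by the compiler itself):

```shell
#!/bin/sh
# Sketch: generate the layout portion of a .text section description,
# with extra optional entries to allow for future development.
count=16    # intentionally more entries than the 12 routines in the example
{
    printf '.text :\n{\n'
    i=1
    while [ "$i" -le "$count" ]; do
        printf '  *(.text%05d)\n' "$i"
        i=$((i + 1))
    done
    printf '  *(.text)\n}\n'    # pick up any remaining .text input sections
} > layout_snippet.script
```

The generated file can then be merged into a custom linker script by hand.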


Creating a Library from IPO Objects

VxWorks*

Libraries are often created using a library manager such as lib. Given a list of objects, the library manager will insert the objects into a named library to be used in subsequent link steps.

Example xiar cru user.a a.o b.o

The above command creates a library named user.a containing the a.o and b.o objects. If the objects have been created using -ipo -c, then the archive will contain not only a valid object, but also the intermediate representation (IR) for that object file. For example, the following command will produce a.o and b.o object files that may be archived to produce a library containing both object code and IR for each source file.

Example icc -ipo -c a.cpp b.cpp

When you link, the compiler is invoked on the IR saved in the object file and generates a valid object that can be inserted into a library. Using xiar is the same as specifying xild -lib.

Windows* Only

Use xilib or xilink -lib to create libraries of IPO mock object files and link them on the command line. For example, assume that you create three mock object files by using a command similar to the following:

Example icl /c /Qipo a.cpp b.cpp c.cpp


Further assume a.obj contains the main subprogram. You can enter commands similar to the following to create the library:

Example xilib -out:main.lib b.obj c.obj
or xilink -lib -out:main.lib b.obj c.obj

You can link the library and the main program object file by entering a command similar to the following:

Example xilink -out:result.exe a.obj main.lib

Requesting Compiler Reports with the xi* Tools

The compiler option -opt-report (VxWorks*) generates optimization reports with different levels of detail. Related compiler options described in Compiler Reports Quick Reference allow you to specify the phase, direct output to a file (instead of stderr), and request reports from all routines with names containing a specified string. The xi* tools are used with interprocedural optimization (IPO) during the final stage of IPO compilation. You can request compiler reports to be generated during the final IPO compilation by using certain options. The supported xi* tools are:

• Linker tools: xilink (Windows) and xild (VxWorks)
• Library tools: xilib (Windows) and xiar (VxWorks)

The following table lists the compiler report options for the xi* tools. These options are equivalent to the corresponding compiler options, but take effect during the final IPO compilation.

Tool Option Description

-qopt-report[=n] Enables optimization report generation with different levels of detail, directed to stderr. Valid values for n are 0 through 3. By default, when you specify this option without passing a value the compiler will generate a report with a medium level of detail.



-qopt-report-file=file Generates an optimization report and directs the report output to the specified file name. If you omit the path, the file is created in the current directory. To create the file in a different directory, specify the full path to the output file and its file name.

-qopt-report-phase=name Specifies the optimization phase name to use when generating reports. If you do not specify a phase, the compiler defaults to all. You can request a list of all available phases by using the -qopt-report-help option.

-qopt-report-routine=string Generates reports from all routines with names containing string as part of their name. If not specified, the compiler will generate reports on all routines.

-qopt-report-help Displays the optimization phases available.

To understand the compiler reports, use the links provided in Compiler Reports Overview.

Inline Expansion of Functions

Inline Function Expansion

Inline function expansion is one of the primary optimizations used in Interprocedural Optimization (IPO) because it does not require that the application meet the criteria for whole-program analysis normally required by IPO. For function calls that the compiler believes are frequently executed, the Intel® compiler often decides to replace the instructions of the call with code for the function itself. In the compiler, inline function expansion typically favors relatively small user functions over functions that are relatively large. This optimization improves application performance by:

• Removing the need to set up parameters for a function call
• Eliminating the function call branch
• Propagating constants


Function inlining can improve execution time by removing the runtime overhead of function calls; however, function inlining can increase code size, code complexity, and compile times. In general, when you instruct the compiler to perform function inlining, the compiler can examine the source code in a much larger context, and the compiler can find more opportunities to apply optimizations. Specifying -ip (VxWorks*) or /Qip (Windows*), single-file IPO, causes the compiler to perform inline function expansion for calls to procedures defined within the current source file; in contrast, specifying -ipo (VxWorks) or /Qipo (Windows), multi-file IPO, causes the compiler to perform inline function expansion for calls to procedures defined in other files.

CAUTION. Using the -ip and -ipo (VxWorks) or /Qip and /Qipo (Windows) options can, in some cases, significantly increase compile time and code size.

Selecting Routines for Inlining

The compiler attempts to select the routines whose inline expansions provide the greatest benefit to program performance. The selection is done using default heuristics. The inlining heuristics used by the compiler differ based on whether or not you use options for Profile-Guided Optimizations (PGO): -prof-use (VxWorks) or /Qprof-use (Windows). When you use PGO with -ip or -ipo (VxWorks) or /Qip or /Qipo (Windows), the compiler uses the following guidelines for applying heuristics:

• The default heuristic focuses on the most frequently executed call sites, based on the profile information gathered for the program. • The default heuristic always inlines very small functions that meet the minimum inline criteria.

Using IPO with PGO (Windows)

Combining IPO and PGO typically produces better results than using IPO alone. PGO produces dynamic profiling information that can usually provide better optimization opportunities than the static profiling information used in IPO. The compiler uses characteristics of the source code to estimate which function calls are executed most frequently. It applies these estimates to the PGO-based guidelines described above. The estimation of frequency, based on static characteristics of the source, is not always accurate. Avoid using static profile information when combining PGO and IPO; with static profile information, the compiler can only estimate the application performance for the source files being used. Using dynamically generated profile information allows the compiler to accurately determine the real performance characteristics of the application.


Inline Expansion of Library Functions

By default, the compiler automatically inlines (expands) a number of standard and math library functions at the point of the call to that function, which usually results in faster computation. Many routines in the libirc, libm, or svml libraries are more highly optimized for Intel microprocessors than for non-Intel microprocessors. The -fno-builtin (VxWorks) option disables inlining for intrinsic functions and disables the by-name recognition support of intrinsic functions and the resulting optimizations. Use this option if you redefine standard library routines with your own version and your version of the routine has the same name as the standard library routine.

Inlining and Function Preemption (Linux and VxWorks)

You must specify -fpic or -fPIC to use function preemption. By default, the compiler does not generate the position-independent code needed for preemption.

Compiler Directed Inline Expansion of Functions

Without direction from the user, the compiler attempts to estimate which functions should be inlined to optimize application performance. See Criteria for Inline Function Expansion for more information. The following options are useful in situations where an application can benefit from user function inlining but does not need specific direction about inlining limits. Except where noted, these options are supported on all Intel architectures.

VxWorks* Windows* Effect

-inline-level (VxWorks) /Ob (Windows)
Specifies the level of inline function expansion. Depending on the value specified, the option can disable or enable inlining. By default, the option enables inlining of any function if the compiler believes the function can be inlined. For more information, see the following topic:


• -inline-level compiler option

-ip-no-inlining (VxWorks) /Qip-no-inlining (Windows)
Disables only the inlining normally enabled by the following options:
• VxWorks: -ip or -ipo
• Windows: /Qip, /Qipo, or /Ob2
No other IPO optimizations are disabled. For more information, see the following topic:
• -ip-no-inlining compiler option

-ip-no-pinlining (VxWorks) /Qip-no-pinlining (Windows)
Disables partial inlining normally enabled by the following options:
• VxWorks: -ip or -ipo
• Windows: /Qip or /Qipo
No other IPO optimizations are disabled. For more information, see the following topic:
• -ip-no-pinlining compiler option

-fno-builtin (VxWorks) /Oi- (Windows)
Disables inlining for intrinsic functions. Disables the by-name recognition support of intrinsic functions and the resulting optimizations. Use this option if you redefine


standard library routines with your own version and your version of the routine has the same name as the standard library routine. By default, the compiler automatically inlines (expands) a number of standard and math library functions at the point of the call to that function, which usually results in faster computation. Many routines in the libirc, libm, or svml libraries are more highly optimized for Intel microprocessors than for non-Intel microprocessors.
• -fno-builtin compiler option

-inline-debug-info (VxWorks) /Qinline-debug-info (Windows)
Keeps source information for inlined functions. The additional source information can be used by the Intel® Debugger to track the user-defined call stack while using inlining. To use this option, you must also specify an additional option to enable debugging:
• Linux and VxWorks: -g

For more information, see the following topic: • -inline-debug-info compiler option


Developer Directed Inline Expansion of User Functions

In addition to the options that support compiler directed inline expansion of user functions, the compiler also provides compiler options that allow you to more precisely direct when and if inline function expansion should occur.

The compiler measures the relative size of a routine in an abstract value of intermediate language units, which is approximately equivalent to the number of instructions that will be generated. The compiler uses the intermediate language unit estimates to classify routines and functions as relatively small, medium, or large functions. The compiler then uses the estimates to determine when to inline a function; if the minimum criteria for inlining are met and all other things are equal, the compiler has an affinity for inlining relatively small functions and not inlining relatively large functions.

The following developer directed inlining options provide the ability to change the boundaries used by the inlining optimizer to distinguish between small and large functions. These options are supported on all Intel architectures. In general, you should use the -inline-factor (VxWorks*) and /Qinline-factor (Windows*) option before using the individual inlining options listed below; this single option effectively controls several other upper-limit options.

VxWorks* Windows* Effect

-inline-factor (VxWorks) /Qinline-factor (Windows)
Controls the multiplier applied to all inlining options that define upper limits: inline-max-size, inline-max-total-size, inline-max-per-routine, and inline-max-per-compile. While you can specify an individual increase in any of the upper-limit options, this single option provides an efficient means of controlling all of the upper-limit options with a single command. By default, this option uses a multiplier of 100, which corresponds to a factor of 1. Specifying 200 implies a factor of 2, and so on.


Experiment with the multiplier carefully. Increasing the upper limits too far can allow too much inlining, which might result in your system running out of memory. For more information, see the following topic:
• -inline-factor compiler option

-inline-forceinline (VxWorks) /Qinline-forceinline (Windows)
Instructs the compiler to force inlining of functions suggested for inlining whenever the compiler is capable of doing so. Typically, the compiler targets functions that have been marked for inlining based on the presence of keywords, like __forceinline, in the source code; however, all such directives in the source code are treated only as suggestions for inlining. This option instructs the compiler to treat the inlining suggestion as mandatory and inline the marked function if it can be done legally. For more information, see the following topic:
• -inline-forceinline compiler option



-inline-min-size (VxWorks) /Qinline-min-size (Windows)
Redefines the maximum size of small routines; routines that are equal to or smaller than the value specified are more likely to be inlined. For more information, see the following topic:
• -inline-min-size compiler option

-inline-max-size (VxWorks) /Qinline-max-size (Windows)
Redefines the minimum size of large routines; routines that are equal to or larger than the value specified are less likely to be inlined. For more information, see the following topic:
• -inline-max-size compiler option

-inline-max-total-size (VxWorks) /Qinline-max-total-size (Windows)
Limits the expanded size of inlined functions. For more information, see the following topic:
• -inline-max-total-size compiler option

-inline-max-per-routine (VxWorks) /Qinline-max-per-routine (Windows)
Limits the number of times inlining can be applied within a routine. For more information, see the following topic:
• -inline-max-per-routine compiler option



-inline-max-per-compile (VxWorks) /Qinline-max-per-compile (Windows)
Limits the number of times inlining can be applied within a compilation unit. The compilation unit limit depends on whether or not you specify the -ipo (VxWorks) or /Qipo (Windows) option. If you enable IPO, all source files that are part of the compilation are considered one compilation unit. For compilations not involving IPO, each source file is considered an individual compilation unit. For more information, see the following topic:
• -inline-max-per-compile compiler option

Using Profile-Guided Optimization (PGO)

Profile-Guided Optimizations Overview

Profile-guided Optimization (PGO) improves application performance by reorganizing code layout to reduce instruction-cache problems, shrinking code size, and reducing branch mispredictions. PGO provides information to the compiler about areas of an application that are most frequently executed. By knowing these areas, the compiler is able to be more selective and specific in optimizing the application. PGO consists of three phases, or steps:

1. Instrument the program. In this phase, the compiler creates and links an instrumented program from your source code and special code from the compiler.
2. Run the instrumented executable. Each time you execute the instrumented code, the instrumented program generates a dynamic information file, which is used in the final compilation.
3. Perform a final compilation. When you compile a second time, the dynamic information files are merged into a summary file. Using the summary of the profile information in this file, the compiler attempts to optimize the execution of the most heavily traveled paths in the program.

See Profile-guided Optimization Quick Reference for information about the supported options and Profile an Application for specific details about using PGO from a command line. PGO provides the following benefits:

• Uses profile information for register allocation to optimize the location of spill code.
• Improves branch prediction for indirect function calls by identifying the most likely targets. (Some processors have longer pipelines, which improves branch prediction and translates into high performance gains.)
• Detects loops that execute only a small number of iterations and does not vectorize them, reducing the runtime overhead that vectorization might otherwise add.

Interprocedural Optimization (IPO) and PGO can affect each other; using PGO can often enable the compiler to make better decisions about function inlining, which increases the effectiveness of interprocedural optimizations. Unlike other optimizations, such as those strictly for size or speed, the results of IPO and PGO vary. This variability is due to the unique characteristics of each program, which often include different profiles and different opportunities for optimizations.


Performance Improvements with PGO

PGO works best for code with many frequently executed branches that are difficult to predict at compile time. An example is code with intensive error checking in which the error conditions are false most of the time. The infrequently executed (cold) error-handling code can be relocated so the branch is rarely predicted incorrectly. Minimizing cold code interleaved into the frequently executed (hot) code improves instruction cache behavior. When you use PGO, consider the following guidelines:

• Minimize changes to your program after you execute the instrumented code and before feedback compilation. During feedback compilation, the compiler ignores dynamic information for functions modified after that information was generated. (If you modify your program the compiler issues a warning that the dynamic information does not correspond to a modified function.)

• Repeat the instrumentation compilation if you make many changes to your source files after execution and before feedback compilation.
• Know the sections of your code that are the most heavily used. If the data set provided to your program is very consistent and displays similar behavior on every execution, then PGO can probably help optimize your program execution.
• Different data sets can result in different algorithms being called. This can cause the behavior of your program to vary for each execution. In cases where your code behavior differs greatly between executions, PGO may not provide noticeable benefits. If it takes multiple data sets to accurately characterize application performance, execute the application with each data set and then merge the dynamic profiles; this technique should result in an optimized application.

You must ensure that the benefit of the profiled information is worth the effort required to maintain up-to-date profiles.

Profile-Guided Optimization (PGO) Quick Reference

Profile-Guided Optimization consists of three phases (or steps):

1. Generating instrumented code by compiling with the -prof-gen (VxWorks* OS) option when creating the instrumented executable.
2. Running the instrumented executable, which produces dynamic-information (.dyn) files.
3. Compiling the application with the -prof-use (VxWorks) option, which uses the profile information.


The figure illustrates the phases and the results of each phase.

See Profile an Application for details about using each phase. The following table lists the compiler options used in PGO:

VxWorks* Windows* Effect

-prof-gen (VxWorks) /Qprof-gen (Windows)
Instruments a program for profiling to get the execution counts of each basic block. The option is used in phase 1 (instrumenting the code) to instruct the compiler to produce instrumented code for your object files in preparation for instrumented execution. By default, each instrumented execution creates one dynamic-information (dyn) file for each executable and


(on Windows OS) one for each DLL invoked by the application. You can specify keywords, such as -prof-gen=default (VxWorks). The keywords control the amount of source information gathered during phase 2 (running the instrumented executable). The prof-gen keywords are:
• Specify default (or omit the keyword) to request profiling information for use with the prof-use option and optimization when the instrumented application is run (phase 2).
• Specify srcpos or globdata to request additional profiling information for the code coverage and test prioritization tools when the instrumented application is run (phase 2). The phase 1 compilation creates an spi file.
• Specify globdata to request additional profiling information for data ordering optimization when the instrumented application is run (phase 2). The phase 1 compilation creates an spi file.


If you are performing a parallel make, this option will not affect it.

None (VxWorks) /Qcov-gen (Windows)
Instruments the program minimally, for the code coverage tool only. Use /Qcov-gen to reduce the impact on performance while using the code coverage tool.

-prof-use (VxWorks) /Qprof-use (Windows)
Instructs the compiler to produce a profile-optimized executable and merges available dynamic-information (dyn) files into a pgopti.dpi file. This option implicitly invokes the profmerge tool. The dynamic-information files are produced in phase 2 when you run the instrumented executable. If you perform multiple executions of the instrumented program to create additional dynamic-information files that are newer than the current summary pgopti.dpi file, this option merges the dynamic-information files again and overwrites the previous pgopti.dpi file. (You can set the environment variable PROF_NO_CLOBBER to prevent the previous dpi file from being overwritten.) When you compile with prof-use, all dynamic information and summary information files should be in the same directory (the current


directory or the directory specified by the prof-dir option). If you need to use certain profmerge options not available as compiler options (such as specifying multiple directories), use the profmerge tool directly. For example, you can use profmerge to create a new summary dpi file before you compile with the prof-use option to create the optimized application. You can specify keywords, such as -prof-use=weighted (VxWorks). If you omit the weighted keyword, the merged dynamic-information (dyn) files are weighted proportionally to the length of time each application execution runs. If you specify the weighted keyword, the profiler applies an equal weighting (regardless of execution times) to the dyn file values to normalize the data counts. This keyword is useful when the execution runs have different time durations and you want them to be treated equally. When you use prof-use, you can also specify the prof-file option to rename the summary dpi file and the prof-dir option to specify the directory for dynamic-information (dyn) and summary (dpi) files.


Linux and VxWorks:
• Using this option with -prof-func-groups allows you to control function grouping behavior.

-prof-func-groups (VxWorks) /Qprof-func-order (Windows)
Enables ordering of program routines using profile information when specified with prof-use (phase 3). The instrumented program (phase 1) must have been compiled with the prof-gen option's srcpos keyword. Not valid for multi-file compilation with the ipo options. For more information, see Using Function Ordering, Function Order Lists, Function Grouping, and Data Ordering Optimizations.

-prof-data-order (VxWorks) /Qprof-data-order (Windows)
Enables ordering of static program data items based on profiling information when specified with prof-use. The instrumented program (phase 1) must have been compiled with the prof-gen option's srcpos keyword. Not valid for multi-file compilation with the ipo options. For more information, see Using Function Ordering, Function Order Lists, Function Grouping, and Data Ordering Optimizations.



-prof-src-dir (VxWorks) /Qprof-src-dir (Windows)
Controls whether full directory information for the source directory path is stored in or read from dynamic-information (dyn) files. When used during phase 1 compilation (prof-gen), this determines whether the full path is added into the dyn file created during instrumented application execution. When used during profmerge or phase 3 compilation (prof-use), this determines whether the full path for source file names is used or ignored when reading the dyn or dpi files. Using the default -prof-src-dir (VxWorks) uses the full directory information and also enables the use of the prof-src-root and prof-src-cwd options. If you specify -no-prof-src-dir (VxWorks), only the file name (and not the full path) is stored or used. If you do this, all dyn or dpi files must be in the current directory, and the prof-src-root and prof-src-cwd options are ignored.

-prof-dir (VxWorks) /Qprof-dir (Windows)
Specifies the directory in which dynamic information (dyn) files are created, read, and stored; otherwise, the dyn files are created in or read from the


current directory used during compilation. For example, you can use this option when compiling in phase 1 (prof-gen option) to define where dynamic information files will be created when running the instrumented executable in phase 2. You can also use this option when compiling in phase 3 (prof-use option) to define where the dynamic information files will be read from and a summary file (dpi) created.

-prof-src-root or -prof-src-cwd (VxWorks) /Qprof-src-root or /Qprof-src-cwd (Windows)
Specifies a directory path prefix for the root directory where the user's application files are stored:
• To specify the directory prefix root where source files are stored, specify the -prof-src-root (VxWorks) option.
• To use the current working directory, specify the -prof-src-cwd (VxWorks) option.

This option is ignored if you specify -no-prof-src-dir (VxWorks).

-prof-file (VxWorks) /Qprof-file (Windows)
Specifies the file name for the profiling summary file. If this option is not specified, the name of the file containing summary information is pgopti.dpi.


Refer to Quick Reference Lists for a complete listing of the quick reference topics.

Profile an Application

Profiling an application includes the following three phases: • Instrumentation compilation and linking • Instrumented execution • Feedback compilation

This topic provides detailed information on how to profile an application by providing sample commands for each of the three phases (or steps).

1. Instrumentation compilation and linking

Use -prof-gen (VxWorks*) or /Qprof-gen (Windows*) to produce an executable with instrumented information included. Use the /Qcov-gen (Windows) option to obtain minimum instrumentation only for code coverage.

Operating System Commands

VxWorks icpc -prof-gen -prof-dir /usr/profiled a1.cpp a2.cpp a3.cpp
icpc a1.o a2.o a3.o

Windows icl /Qprof-gen /Qprof-dir c:\profiled a1.cpp a2.cpp a3.cpp
icl a1.obj a2.obj a3.obj

Windows icl /Qcov-gen /Qcov-dir c:/cov_data a1.cpp a2.cpp a3.cpp
icl a1.obj a2.obj a3.obj

Use the -prof-dir (VxWorks) option if the application's source files are located in multiple directories; using the option ensures the profile information is generated in one consistent place. The example commands demonstrate how to combine these options on multiple sources.


The compiler gathers extra information when you use the -prof-gen=srcpos (VxWorks) option; however, the extra information is collected to provide support for specific Intel tools, like the code-coverage tool. If you do not expect to use such tools, do not specify -prof-gen=srcpos (VxWorks); the extended option does not provide better optimization and could slow parallel compile times. If you are interested in using the instrumentation only for code coverage, use the /Qcov-gen (Windows) option, instead of the /Qprof-gen:srcpos (Windows) option, to minimize instrumentation overhead.

2. Instrumented execution

Run your instrumented program with a representative set of data to create one or more dynamic information files.

Operating System Command

VxWorks ./a1.out

Windows a1.exe

Executing the instrumented application generates a dynamic information file that has a unique name and a .dyn suffix. A new dynamic information file is created every time you execute the instrumented program. You can run the program more than once with different input data.

3. Feedback compilation

Before this step, copy all .dyn and .dpi files into the same directory. Compile and link the source files with -prof-use (VxWorks); the option instructs the compiler to use the generated dynamic information to guide the optimization:

Operating System Examples

VxWorks icpc -prof-use -ipo -prof-dir /usr/profiled a1.cpp a2.cpp a3.cpp

Windows icl /Qprof-use /Qipo /Qprof-dir c:\profiled a1.cpp a2.cpp a3.cpp

This final phase compiles and links the source files using the data from the dynamic information files generated during instrumented execution (phase 2). In addition to the optimized executable, the compiler produces a pgopti.dpi file.


Most of the time, you should specify the default optimization level, -O2 (VxWorks), for phase 1, and specify more advanced optimizations, -ipo (Linux and VxWorks), during the final (phase 3) compilation. For example, the example shown above used -O2 (VxWorks) in step 1 and -ipo (VxWorks) in step 3.

NOTE. The compiler ignores the -ipo or -ip (VxWorks) or /Qipo option during phase 1 with -prof-gen (VxWorks).

Profile Function or Loop Execution Time

This topic describes how to profile an application for function or loop execution time. Using the instrumentation method to profile function or loop execution time provides an easy means of getting a view of where cycles are being spent in your application. The Intel® compiler inserts instrumentation code into your application to collect the time spent in various locations. Such data helps you to identify hotspots that may be candidates for optimization tuning or targeted parallelization. Different levels of instrumentation are available because of the difference in the overhead associated with the inserted probe points that read the CPU cycle counter. The instrumentation levels, in increasing amount of collected data, are grouped as follows:

• function level
• function and loop level
• loop body level

You need not profile the entire application, but it is recommended that the same level of instrumentation be used for all instrumented files. Compile your application with the command-line options described below for all the files for which you require profile information. After the instrumented application is run, data files containing a summary of the collected data are created in the working directory. View these data files to identify the time proportions of the different parts of your application.

Compiler Options to Obtain Instrumented Program

Use the following command-line compiler options to instrument your application code to profile execution times for functions and loops. When you compile your application using any one of these options, you get an instrumented program.


Function level: -profile-functions (VxWorks*) /Qprofile-functions (Windows*)
Instruments function entry/exit points only.

• Inserts instrumentation calls on the function entry and exit points to collect the cycles spent within the function.
• Data collection is supported for single-threaded applications only.
• The collected data from execution of the instrumented application is placed in the file loop_prof_funcs_<timestamp>.dump in the current working directory.

Function and loop level: -profile-loops=<keyword> (VxWorks*) /Qprofile-loops:<keyword> (Windows*)
Instruments function and loop entry/exit points:

• The argument allows you to specify the types of the loops to be instrumented. The value must be one of the following:

• inner - to instrument inner loops • outer - to instrument outer loops • all - to instrument both inner and outer loops

• Inserts instrumentation calls for function entry and exit points, and before and after instrumentable loops as listed in the argument.



NOTE. The compiler may be unable to instrument some loops.

• Data collection is supported for single-threaded applications only.
• The collected data from execution of the instrumented application is placed in the files loop_prof_funcs_<timestamp>.dump and loop_prof_loops_<timestamp>.dump in the current working directory, for the function data and loop data, respectively.

NOTE. The instrumentation takes place after many compiler transformations have taken place, so loops that have been restructured, or replicated by the compiler may result in multiple entries in the generated reports. Also, functions may have been inlined into other functions; this may cause the same loop to be listed in multiple functions.


Loop body level: -profile-loops-report=[n] (VxWorks*) /Qprofile-loops-report:[n] (Windows*)
Controls the level of instrumentation for loop bodies, where:
• Specifying n=1 collects the cycle counts on entry and exit of loops. This is the default value if no value is specified for n.
• Specifying n=2 inserts additional instrumentation for collecting loop iteration counts to compute loop min/max and average iteration counts. Such instrumentation increases overhead in the instrumented application.

NOTE.

• Instrumentation takes place after compiler transformations have taken place, so the loop trip count is based on the generated form of the loop, which may be unrolled and may differ from the original source form.
• This option is effective only when used with the -profile-loops (VxWorks*) or /Qprofile-loops (Windows*) option enabled; it acts as a modifier of the -profile-loops or /Qprofile-loops option.


NOTE. Because this instrumentation relies on the CPU time stamp counter for collecting timing information, only single threaded applications should be profiled with these options. Using these options with other compiler options that enable parallel code generation is not supported.

Output Formats Obtained on Running Instrumented Program

After compiling your application using the options, you obtain an instrumented program. Run this instrumented program with a representative set of data to create one or more data files. The output files are produced by a routine that is registered with the atexit() routine, and are created only when the program exits via a standard exit procedure. Two formats of the output files are created:

• A plain text file using ‘tab’ delimiters, which can be viewed within a text editor or a spreadsheet. • An XML data file that can be loaded into the data viewer application.

Text Output Format

The text output is written into separate files in the current directory. The file names are loop_prof_funcs_<timestamp>.dump and loop_prof_loops_<timestamp>.dump for functions and loops, respectively. The <timestamp> value is the same for both files, indicating that they come from the same instrumentation run. You can load the text files into a spreadsheet by specifying that the data is delimited with tab characters. The function report file contains the following columns:

Column Heading Description

Time (abs) Total accumulated cycles spent between function entry and exit.

Time % Percentage of total application time for cycles between function entry and exit.

Self (abs) Accumulated cycles spent between function entry and exit less the cycles spent within calls to other instrumented functions.

NOTE. Time spent in calls to routines or libraries that were not instrumented is included within this cycle count value.


Self % Percentage that self(abs) represents of total application time.

Call count Number of times the function was entered.

Exit count Number of times the function exit instrumentation was invoked. (In some instances, such as cases involving exception handling, it may be possible for control to transfer out of a function without the exit instrumentation being seen, resulting in this value differing from the call count value.)

Loop ticks (%) The percentage of total application time for the self time cycles of loops within the function. This field is relevant only when you use the -profile-loops (VxWorks*) or /Qprofile-loops (Windows*) compiler option.

Function Demangled function name

File:line Source file containing the code for the function, and the line number the function begins on.

The loop report file contains the following columns:

Column Heading Description

Time (abs) Total accumulated cycles spent between loop entry and exit.

Time % Percentage of total application time for cycles between loop entry and exit.

Self (abs) Accumulated cycles spent between loop entry and exit less the cycles spent within calls to other instrumented functions/loops.

Self % Percentage that Self (abs) represents of total application time.

Loop entries Number of times the loop was entered.

Loop exits Number of times the loop exit instrumentation was invoked. (In some instances, such as cases involving exception handling, it may be possible for control to transfer out of a loop without the exit instrumentation being seen, resulting in this value differing from the loop entries value.)

Min iteration counts The smallest iteration count for an invocation of the loop. This field is relevant only when you use instrumentation level 2, via the -profile-loops-report=2 (VxWorks*) or /Qprofile-loops-report:2 (Windows*) compiler option.


Avg iteration counts The average iteration count for an invocation of the loop. This field is relevant only when you use instrumentation level 2, via the -profile-loops-report=2 (VxWorks*) or /Qprofile-loops-report:2 (Windows*) compiler option.

Max iteration counts The largest iteration count for an invocation of the loop. This field is relevant only when you use instrumentation level 2, via the -profile-loops-report=2 (VxWorks*) or /Qprofile-loops-report:2 (Windows*) compiler option.

Function Demangled name of the function that executed the loop; this may be different from the function that contained the loop source code, if that function was inlined into another function.

Function file:line The name of the source file containing the function that executed the loop, and the line number the function begins on.

Loop file:line The name of the source file containing the loop body, and the line number the loop body begins on.

XML Output Format

The INTEL_LOOP_PROF_XML_DUMP environment variable controls the creation of the XML output format. When INTEL_LOOP_PROF_XML_DUMP=0, no XML file is created. Otherwise, the XML output file, named loop_prof_<timestamp>.xml, is written into the current directory. This file is viewed with the data viewer utility. The XML data file contains both the function-level and loop-level data. Remember, setting the environment variable to 0 implies that the data viewer utility cannot be used, since no XML data files are created.
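For example, XML output can be suppressed from the shell before running the instrumented application. This is a sketch: the variable name is taken from this section, and POSIX shell syntax is assumed.

```shell
# Suppress the XML data file so only the text .dump reports are produced;
# the data viewer utility will then have no input file to load.
export INTEL_LOOP_PROF_XML_DUMP=0
# Run the instrumented application afterward in this same shell session.
```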

Data Viewer Utility

You must have the Java Runtime Environment (JRE), version 1.6, installed on your system to use the data viewer utility. JRE 1.6 is not included with the Intel® Compiler package. The Data Viewer displays the data collected for the functions and loops in a sequence of sortable columns. The Data Viewer also provides the ability to filter the functions or loops to display only items that meet some threshold (total time % or self time %) of the total application time, or to show only the loops that are associated with a particular function.


The Data Viewer is located in the \bin directory of the compiler installation. Double-click the file loopprofileviewer.jar in this directory to launch the Data Viewer utility. Alternatively, launch it from the command line; on VxWorks* operating systems, use loopprofileviewer.sh or loopprofileviewer.csh.

The main user interface displays an upper pane, Function Profile, containing function level data, and a lower pane, Loop Profile, containing loop level data. If the XML data file is created using only function level instrumentation, the lower pane is not displayed. Drag the separator


between individual panes to resize them or click the up/down arrows available on the separator line.

The Function Profile pane and Loop Profile pane contain the set of columns described in the section Text Output Format. Click on any of these columns to sort the data based on that column. In addition to the generic File and Help menus, the Data Viewer utility has two specific menus in the menu bar: the View menu and the Filter menu.


View Menu

The View menu allows you to open a window that shows the source code for the line you have selected in the table (these menu items are grayed out if no line is selected). To use the View menu:

1. Select a line in the Function Profile pane.
2. Go to View > Function source for selected function to see the source code for the selected function.
3. Select a line in the Loop Profile pane.
4. Go to View > Function source for selected loop to see the source code for the selected function.
5. Go to View > Loop source for selected loop to see the source code for the selected loop.

Alternatively, when the cursor is within the Function Profile pane or the Loop Profile pane, right-click on any line to display a popup menu that gives access to these View menu options. By default, the Data Viewer utility displays the most relevant columns based on the data file loaded. Go to View > Select Columns to display or hide columns. In the dialog box that opens, columns not available for the currently loaded data file are grayed out.

Filter Menu

The Filter menu allows you to limit the data displayed in the tables to items that meet certain criteria. As an example, you can use the Filter menu as follows:

1. Go to Filter > Function Profile > “Self time > 1.00%” to list only the functions in the Function Profile pane that represent more than 1.0% of the application time.
2. Go to Filter > Set thresholds to modify the default thresholds used when enabling the filters for total time % or self time %.

To display only the loops of a particular function, do the following:

1. Select a function in the Function column of the Function Profile pane.
2. Go to Filter > Loop Profile > Show loops for selected function to display the loops for the selected function.

Alternatively, when the cursor is within the Function Profile pane or the Loop Profile pane, you can right-click any function to display a popup menu that gives access to these Filter menu options.


PGO Tools

PGO Tools Overview

This section describes the tools that take advantage of or support the Profile-guided Optimization (PGO) capability of the compiler:

• code coverage tool
• test prioritization tool
• profmerge tool
• proforder tool

In addition to the tools, this section also contains information on using Software-based Speculative Precomputation, which will allow you to optimize applications using profiling- and sampling-feedback methods on IA-32 architectures.

code coverage Tool

The code coverage tool provides software developers with a view of how much application code is exercised when a specific workload is applied to the application. To determine which code is used, the code coverage tool uses Profile-guided Optimization (PGO) options and optimizations. The major features of the code coverage tool are:

• Visually presenting code coverage information for an application, with a customizable code coverage coloring scheme
• Displaying dynamic execution counts of each basic block of the application
• Providing differential coverage, or comparison, profile data for two runs of an application

The information about using the code coverage tool is separated into the following sections:

• code coverage tool Requirements
• Visually Presenting Code Coverage for an Application
• Excluding Code from Coverage Analysis
• Exporting Coverage Data

The tool analyzes static profile information generated by the compiler, as well as dynamic profile information generated by running an instrumented form of the application binaries on the workload. The tool can generate an HTML-formatted report and export data in both text- and XML-formatted files. The reports can be further customized to show color-coded, annotated source-code listings that distinguish between used and unused code.


The code coverage tool is available on all supported Intel architectures on Linux*, VxWorks* and Windows* operating systems. You can use the tool in a number of ways to improve development efficiency, reduce defects, and increase application performance:

• During the project testing phase, the tool can measure the overall quality of testing by showing how much code is actually tested.
• When applied to the profile of a performance workload, the code coverage tool can reveal how well the workload exercises the critical code in an application. High coverage of performance-critical modules is essential to taking full advantage of the Profile-Guided Optimizations that Intel Compilers offer.
• The tool provides an option, useful for both coverage and performance tuning, enabling developers to display the dynamic execution count for each basic block of the application.
• The code coverage tool can compare the profile of two different application runs. This feature can help locate portions of the code in an application that are not revealed during testing but are exercised when the application is used outside the test space, for example, when used by a customer.

code coverage tool Requirements

To run the code coverage tool on an application, you must have the following items:

• The application sources.
• The .spi file generated by the Intel® compiler when compiling the application for the instrumented binaries using the -prof-gen=srcpos (VxWorks) option.

NOTE. Use the -[Q]prof-gen:srcpos option if you intend to use the collected data for both code coverage and profile feedback. If you are only interested in using the instrumentation for code coverage, use the /Qcov-gen option, which saves time and improves performance. The /Qcov-gen option can be used only on the Windows* platform, for all architectures.

• A pgopti.dpi file that contains the results of merging the dynamic profile information (.dyn) files, which is most easily generated by the profmerge tool. This file is also generated implicitly by the Intel® compilers when compiling an application with the -prof-use (VxWorks) option with available .dyn and .dpi files.

See Profile an Application and Example of Profile-guided Optimization for general information on creating the files needed to run this tool.


Using the Tool

The tool uses the following syntax:

Tool Syntax

codecov [-codecov_option]

where -codecov_option is one or more optional parameters specifying the tool options passed to the tool. The available tool options are listed in code coverage tool Options (below). If you do not use any additional tool options, the tool provides the top-level code coverage for the entire application. In general, you must perform the following steps to use the code coverage tool:

1. Compile the application using the -prof-gen=srcpos (VxWorks) and/or /Qcov-gen (Windows) option. This step generates an instrumented executable and a corresponding static profile information (pgopti.spi) file when the -[Q]prof-gen=srcpos option is used. When the /Qcov-gen option is used, only the minimum instrumentation needed for code coverage, along with generation of the .spi file, is enabled.

NOTE. You can specify both the /Qprof-gen=srcpos and /Qcov-gen options on the command line. The higher level of instrumentation needed for profile feedback is enabled along with the profile option for generating the .spi file, regardless of the order the options are specified on the command line.

2. Run the instrumented application. This step creates the dynamic profile information (.dyn) file. Each time you run the instrumented application, a unique .dyn file is generated, either in the current directory or in the directory specified by prof_dir.
3. Use the profmerge tool to merge all the .dyn files into one .dpi (pgopti.dpi) file. This step consolidates results from all runs and represents the total profile information for the application, creating the .dpi file needed by the code coverage tool.


You can use the profmerge tool to merge the .dyn files into a .dpi file without recompiling the application. The profmerge tool can also merge multiple .dpi files into one .dpi file using the profmerge -a option. Select the name of the output .dpi file using the profmerge -prof_dpi option.

CAUTION. The profmerge tool merges all .dyn files that exist in the given directory. Make sure unrelated .dyn files, which may remain from unrelated runs, are not present. Otherwise, the profile information will be skewed with invalid profile data, which can result in misleading coverage information and adverse performance of the optimized code.

4. Run the code coverage tool. (The valid syntax and tool options are shown below.) This step creates a report or exported data as specified. If no other options are specified, the code coverage tool creates a single HTML file (CODE_COVERAGE.HTML) and a sub-directory (CodeCoverage) in the current directory. Open the file in a web browser to view the reports.
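The four steps above can be sketched as a command sequence. This is a sketch only: the compiler driver name (icc) and the source and project names are assumptions, and each command is echoed rather than executed so the sequence can be read as a checklist.

```shell
# Each step is printed rather than executed; substitute your real driver
# and file names, then run the commands directly.
step() { printf 'step: %s\n' "$*"; }

step icc -prof-gen=srcpos -o app app.c                    # 1. instrumented build; writes pgopti.spi
step ./app                                                # 2. run with representative data; writes .dyn files
step profmerge                                            # 3. merge .dyn files into pgopti.dpi
step codecov -prj myApp -spi pgopti.spi -dpi pgopti.dpi   # 4. generate the HTML report
```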

NOTE. Windows* only: Unlike the compiler options, which are preceded by forward slash ("/"), the tool options are preceded by a hyphen ("-").

The code coverage tool allows you to name the project and specify paths to specific, necessary files. The following example demonstrates how to name a project and specify .dpi and .spi files to use:

Example: specify .dpi and .spi files

codecov -prj myProject -spi pgopti.spi -dpi pgopti.dpi

The tool can add a contact name and generate an email link for that contact at the bottom of each HTML page. This provides a way to send an electronic message to the named contact. The following example demonstrates how to specify a contact and the email link:

Example: add contact information

codecov -prj myProject -mname JoeSmith -maddr [email protected]


The following example demonstrates how to use the tool to specify the project name, specify the dynamic profile information file, and specify the output format and file name.

Example: export data to text

codecov -prj test1 -dpi test1.dpi -txtbcvrg test1_bcvrg.txt

code coverage tool Options

The tool uses the options listed in the following table:

Option Default Description

-bcolor color #FFFF99 Specifies the HTML color name for code in the uncovered blocks.

-beginblkdsbl string Specifies the comment that marks the beginning of the code fragment to be ignored by the coverage tool.

-ccolor color #FFFFFF Specifies the HTML color name or code of the covered code.

-comp file Specifies the file name that contains the list of files to be included in (or excluded from) the analysis.

-counts Generates dynamic execution counts.

-demang Demangles both function names and their arguments.

-dpi file pgopti.dpi Specifies the file name of the dynamic profile information file (.dpi).

-endblkdsbl string Specifies the comment that marks the end of the code fragment to be ignored by the coverage tool.

-fcolor color #FFCCCC Specifies the HTML color name for code of the uncovered functions.

-help, -h Prints tool option descriptions.


-icolor color #FFFFFF Specifies the HTML color name or code of the information lines, such as basic-block markers and dynamic counts.

-maddr string Nobody Sets the email address of the web-page owner

-mname string Nobody Sets the name of the web-page owner.

-nopartial Treats partially covered code as fully covered code.

-nopmeter Turns off the progress meter. The meter is enabled by default.

-onelinedsbl string Specifies the comment that marks individual lines of code or whole functions to be ignored by the coverage tool.

-pcolor color #FAFAD2 Specifies the HTML color name or code of the partially covered code.

-prj string Sets the project name.

-ref ref_dpi_file Finds the differential coverage with respect to ref_dpi_file.

-spi file pgopti.spi Specifies the file name of the static profile information file (.spi).

-srcroot dir Specifies a different top level project directory than was used during compiler instrumentation run to use for relative paths to source files in place of absolute paths.

-txtbcvrg file Exports block coverage for covered functions in text format. The file parameter must be in the form of a valid file name.


-txtbcvrgfull file Export block-coverage for entire application in text and HTML formats. The file parameter must be in the form of a valid file name.

-txtdcg file Generates the dynamic call-graph information in text format. The file parameter must be in the form of a valid file name.

-txtfcvrg file Exports function coverage for covered functions in text format. The file parameter must be in the form of a valid file name.

-ucolor color #FFFFFF Specifies the HTML color name or code of the unknown code.

-xcolor color #90EE90 Specifies the HTML color of the code ignored.

-xmlbcvrg file Exports the block coverage for the covered functions in XML format. The file parameter must be in the form of a valid file name.

-xmlbcvrgfull file Export function coverage for entire application in XML format in addition to HTML output. The file parameter must be in the form of a valid file name.

-xmlfcvrg file Export function coverage for covered function in XML format. The file parameter must be in the form of a valid file name.

Visually Presenting Code Coverage for an Application

Based on the profile information collected from running the instrumented binaries when testing an application, the Intel® compiler creates HTML-formatted reports using the code coverage tool. These reports indicate portions of the source code that were or were not exercised by the tests. When applied to the profile of the performance workloads, the code coverage information shows how well the training workload covers the application's critical code. High coverage of performance-critical modules is essential to taking full advantage of the profile-guided optimizations.


The code coverage tool can create two levels of coverage:

• Top level (for a group of selected modules) • Individual module source views

Top Level Coverage

The top-level coverage reports the overall code coverage of the modules that were selected. The following capabilities are provided:

• Select the modules of interest.
• For the selected modules, the tool generates a list with their coverage information. The information includes the total number of functions and blocks in a module and the portions that were covered.
• By clicking on the title of columns in the reported tables, the lists may be sorted in ascending or descending order based on:

• basic block coverage • function coverage • function name

By default, the code coverage tool generates a single HTML file (CODE_COVERAGE.HTML) and a subdirectory (CodeCoverage) in the current directory. The HTML file defines a frameset to display all of the other generated reports. Open the HTML file in a web-browser. The tool places all other generated report files in a CodeCoverage subdirectory. If you choose to generate the html-formatted version of the report, you can view coverage source of that particular module directly from a browser. The following figure shows the top-level coverage report.


The coverage tool creates a frame set that allows quick browsing through the code to identify uncovered code. The top frame displays the list of uncovered functions, while the bottom frame displays the list of covered functions. For uncovered functions, the total number of basic blocks of each function is also displayed. For covered functions, the total number of blocks, the number of covered blocks, and their ratio (that is, the coverage rate) are displayed. For example, 66.67 (4/6) indicates that four out of the six blocks of the corresponding function were covered; the block coverage rate of that function is thus 66.67%. These lists can be sorted based on the coverage rate, number of blocks, or function names. Function names are linked to the position in the source view where the function body starts. With one click you can select the least-covered function in the list, and with another click the browser displays the body of that function. You can scroll down in the source view and browse through the function body.

Individual Module Source View

Within the individual module source views, the tool provides the list of uncovered functions as well as the list of covered functions. The lists are reported in two distinct frames that provide easy navigation of the source code. The lists can be sorted based on:

• Number of blocks within uncovered functions • Block coverage in the case of covered functions • Function names


Setting the Coloring Scheme for the Code Coverage

The tool provides a visible coloring distinction of the following coverage categories: covered code, uncovered basic blocks, uncovered functions, partially covered code, and unknown code. The default colors that the tool uses for presenting the coverage information are shown in the table that follows:

Category Default Description

Covered code #FFFFFF Indicates code was exercised by the tests. You can override the default color with the -ccolor tool option.

Uncovered basic block #FFFF99 Indicates the basic blocks that were not exercised by any of the tests. However, these blocks were within functions that were executed during the tests. You can override the default color with the -bcolor tool option.

Uncovered function #FFCCCC Indicates functions that were never called during the tests. You can override the default color with the -fcolor tool option.

Partially covered code #FAFAD2 Indicates that more than one basic block was generated for the code at this position. Some of the blocks were covered and some were not. You can override the default color with the -pcolor tool option.

Ignored code #90EE90 Indicates code that was specifically marked to be ignored. You can override this default color using the -xcolor tool option.

Information lines #FFFFFF Indicates basic-block markers and dynamic counts. You can override the default color with the -icolor tool option.

Unknown #FFFFFF Indicates that no code was generated for this source line. Most probably, the source at this position is a comment, a header-file inclusion, or a variable declaration. You can override the default color with the -ucolor tool option.

The default colors can be customized to be any valid HTML color name or hexadecimal value using the options mentioned for each coverage category in the table above.


For code coverage colored presentation, the coverage tool uses the following heuristic: source characters are scanned until reaching a position in the source that is indicated by the profile information as the beginning of a basic block. If the profile information for that basic block indicates that a coverage category changes, then the tool changes the color corresponding to the coverage condition of that portion of the code, and the coverage tool inserts the appropriate color change in the HTML-formatted report files.

NOTE. You need to interpret the colors in the context of the code. For instance, comment lines that follow a basic block that was never executed would be colored in the same color as the uncovered blocks. Another example is the closing brackets in C/C++ applications.

Dynamic Counters

The coverage tool can be configured to generate information about dynamic execution counts. This ability can display the dynamic execution count of each basic block of the application and is useful for both coverage and performance tuning. The custom configuration requires using the -counts option. The counts information is displayed under the code after a "^" sign, precisely under the source position where the corresponding basic block begins. If more than one basic block is generated for the code at a source position (for example, for macros), then the total number of such blocks and the number of the blocks that were executed are also displayed in front of the execution count. For example, line 11 in the code is an if statement:

Example

11      if ((N == 1) || (N == 0))
        ^ 10 (1/2)
12          printf("%d\n", N);
        ^

The coverage lines under code lines 11 and 12 contain the following information:

• The if statement in line 11 was executed 10 times.
• Two basic blocks were generated for the if statement in line 11.
• Only one of the two blocks was executed, hence the partial coverage color.


• Only seven out of the ten times variable n had a value of 0 or 1.

In certain situations, it may be desirable to consider all the blocks generated for a single source position as one entity. In such cases, it is necessary to assume that all blocks generated for one source position are covered when at least one of the blocks is covered. This assumption can be configured with the -nopartial option. When this option is specified, decision coverage is disabled, and the related statistics are adjusted accordingly. The code lines 11 and 12 indicate that the print statement in line 12 was covered. However, only one of the conditions in line 11 was ever true. With the -nopartial option, the tool treats the partially covered code (like the code on line 11) as covered.

Differential Coverage

Using the code coverage tool, you can compare the profiles from two runs of an application: a reference run and a new run, identifying the code that is covered by the new run but not covered by the reference run. Use this feature to find the portion of the application's code that is not covered by the application's tests but is executed when the application is run by a customer. It can also be used to find the incremental coverage impact of newly added tests to an application's test space.

Generating Reference Data

Create the dynamic profile information for the reference data, which can be used in differential coverage reporting later, by using the -ref option. The following command demonstrates a typical command for generating the reference data:

Example: generating reference data

codecov -prj Project_Name -dpi customer.dpi -ref appTests.dpi

The coverage statistics of a differential-coverage run show the percentage of the code exercised on a new run but missed in the reference run. In such cases, the tool shows only the modules that included the code that was not covered. Keep this in mind when viewing the coloring scheme in the source views. Code that has the same coverage property (covered or not covered) on both runs is considered covered code. Otherwise, if the new run indicates that the code was executed while in the reference run it was not, then the code is treated as uncovered. On the other hand, if the code is covered in the reference run but not covered in the new run, the differential-coverage source view shows the code as covered.

Running Differential Coverage


To run the code coverage tool for differential coverage, you must have the application sources, the .spi file, and the .dpi file, as described in the code coverage tool Requirements section (above). Once the required files are available, enter a command similar to the following to begin the process of differential coverage analysis:

Example

codecov -prj Project_Name -spi pgopti.spi -dpi pgopti.dpi

Specify the .dpi and .spi files using the -spi and -dpi options.

Excluding Code from Coverage Analysis

The code coverage tool allows you to exclude portions of your code from coverage analysis. This ability can be useful during development; for example, certain portions of code might include functions used for debugging only. The test case should not include tests for functionality that will be unavailable in the final application. Another example is code that is designed to deal with internal errors unlikely to occur in the application. In such cases, the lack of a test case is preferred. You might also want to ignore infeasible (dead) code in the coverage analysis. The code coverage tool provides several options for marking portions of the code as infeasible (dead) and ignoring them at the file level, function level, line level, or at arbitrary code boundaries indicated by user-specific comments. The following sections explain how to exclude code at different levels.

Including and Excluding Coverage at the File Level

The code coverage tool provides the ability to selectively include or exclude files for analysis. Create a component file and add the appropriate string values that indicate the file and directory names for the code you want included or excluded. Pass the file name as a parameter of the -comp option. The following example shows the general command:

Example: specifying a component file

codecov -comp file

where file is the name of a text file containing strings that act as file and directory name masks for including and excluding files from the analysis. For example, assume the following:

• You want to include all files with the string "source" in the file name or directory name.


• You create a component text file named myComp.txt with the selective inclusion string.

Once you have a component file, enter a command similar to the following:

Example

codecov -comp myComp.txt

In this example, individual files whose names include the string "source" (like source1.c and source2.c) and files in directories whose names contain the string "source" (like source/file1.c and source2\file2.c) are included in the analysis. Excluding files is done in the same way; however, the string must have a tilde (~) prefix. Inclusion and exclusion can be specified in the same component file. For example, assume you want to analyze all individual files, or files contained in directories, whose name includes the string "source", and you want to exclude all individual files, or files contained in directories, whose name includes the string "skip". You would add content similar to the following to the component file (myComp.txt) and pass it to the -comp option:

Example: inclusion and exclusion strings

source ~skip

Entering the codecov -comp myComp.txt command with both instructions in the component file instructs the tool to include individual files whose names contain "source" (like source1.c and source2.c) and files in directories whose names contain "source" (like source/file1.c and source2\file2.c), and to exclude any individual files whose names contain "skip" (like skipthis1.c and skipthis2.c) and files in directories whose names contain "skip" (like skipthese1\debug1.c and skipthese2\debug2.c).

Excluding Coverage at the Line and Function Level

You can mark individual lines for exclusion by passing string values to the -onelinedsbl option. For example, assume that you have some code similar to the following:

Sample code

printf ("internal error 123 - please report!\n"); // NO_COVER
printf ("internal error 456 - please report!\n"); /* INF IA-32 architecture */


If you wanted to exclude all lines marked with the comments NO_COVER or INF IA-32 architecture, you would enter a command similar to the following:

Example

codecov -onelinedsbl NO_COVER -onelinedsbl "INF IA-32 architecture"

You can specify multiple exclusion strings simultaneously, and you can specify any string values for the markers; however, you must remember the following guidelines when using this option:

• Inline comments must occur at the end of the statement.
• The string must be a part of an inline comment.

An entire function can be excluded from coverage analysis using the same method. For example, the following function will be excluded from the coverage analysis when you issue the example command shown above.

Sample code

void dumpInfo (int n) { // NO_COVER
  ...
}

Additionally, you can use the code coverage tool to color the infeasible code with any valid HTML color code by combining the -onelinedsbl and -xcolor options. The following example commands demonstrate the combination:

Example: combining tool options

codecov -onelinedsbl INF -xcolor lightgreen
codecov -onelinedsbl INF -xcolor #CCFFCC

Excluding Code by Defining Arbitrary Boundaries

The code coverage tool provides the ability to arbitrarily exclude code from coverage analysis. This feature is most useful where the excluded code either occurs inside of a function or spans several functions.


Use the -beginblkdsbl and -endblkdsbl options to mark the beginning and end, respectively, of any arbitrarily defined boundary to exclude code from analysis. Remember the following guidelines when using these options:

• Inline comments must occur at the end of the statement.
• The string must be a part of an inline comment.

For example, assume that you have the following code:

Sample code

void div (int m, int n)
{
  if (n == 0) /* BEGIN_INF */
  {
    printf ("internal error 314 - please report!\n");
    recover ();
  } /* END_INF */
  else
  {
    ...
  }
}
...
// BINF
void recover ()
{
  ...
} // EINF


The following example commands demonstrate how to use the -beginblkdsbl option to mark the beginning and the -endblkdsbl option to mark the end of code to exclude from the sample shown above.

Example: arbitrary code marker commands

codecov -xcolor #ccFFCC -beginblkdsbl BINF -endblkdsbl EINF codecov -xcolor #ccFFCC -beginblkdsbl "BEGIN_INF" -endblkdsbl "END_INF"

Notice that you can use these options in combination with the -xcolor option.

Exporting Coverage Data

The code coverage tool provides specific options to extract coverage data from the dynamic profile information (.dpi files) that results from running instrumented application binaries under various workloads. The tool can export the coverage data in various formats for post-processing and direct loading into databases: the default HTML, text, and XML. You can choose to export data at the function and basic block levels.

There are two basic methods for exporting the data: quick export and combined export. Each method has associated options supported by the tool.

• Quick export: The first method is to export the coverage data to text- or XML-formatted files without generating the default HTML report. The application sources need not be present for this method. The code coverage tool creates reports and provides statistics only about the portions of the application that were executed. The resulting analysis and reporting occurs quickly, which makes it practical to apply the coverage tool to the dynamic profile information (the .dpi file) for every test case in a given test space, instead of applying the tool to the profile of individual test suites or the merge of all test suites. The -xmlfcvrg, -txtfcvrg, -xmlbcvrg, and -txtbcvrg options support the first method.
• Combined export: The second method is to generate the default HTML and simultaneously export the data to text- and XML-formatted files. This process is slower than the first method, since the application sources are parsed and reports are generated. The -xmlbcvrgfull and -txtbcvrgfull options support the second method.

These export methods provide the means to quickly extend the code coverage reporting capabilities by supplying consistently formatted output from the code coverage tool. You can extend these capabilities by creating additional reporting tools on top of these report files.

Quick Export


The profile of covered functions of an application can be exported quickly using the -xmlfcvrg, -txtfcvrg, -xmlbcvrg, and -txtbcvrg options. When using any of these options, specify the output file that will contain the coverage report. For example, enter a command similar to the following to generate a report of covered functions in XML formatted output:

Example: quick export of function data

codecov -prj test1 -dpi test1.dpi -xmlfcvrg test1_fcvrg.xml

The resulting report will show how many times each function was executed and the total number of blocks of each function together with the number of covered blocks and the block coverage of each function. The following example shows some of the content of a typical XML report.

XML-formatted report example ... ...

In the above example, we note that function f0, which is defined in file sample.c, has been executed twice. It has a total number of six basic blocks, five of which are executed, resulting in an 83.33% basic block coverage.


You can also export the data in text format using the -txtfcvrg option. For the above example, the generated text report would be similar to the following:

Text-formatted report example

Covered Functions in File: "D:\SAMPLE.C"
"f0" 2 6 5 83.33
"f1" 1 6 4 66.67
"f2" 1 6 3 50.00
...

In the text-formatted version of the report, each line should be read in the following manner:

Column 1: function name
Column 2: execution frequency of the function
Column 3: line number of the start of the function definition
Column 4: column number of the start of the function definition
Column 5: percentage of basic-block coverage of the function

Additionally, the tool supports exporting the block level coverage data using the -xmlbcvrg option. For example, enter a command similar to the following to generate a report of covered blocks in XML formatted output:

Example: quick export of block data to XML

codecov -prj test1 -dpi test1.dpi -xmlbcvrg test1_bcvrg.xml


The example command shown above would generate XML-formatted results similar to the following:

XML-formatted report example ...

In the sample report, notice that one basic block is generated for the code in function f0 at line 11, column 2 of the file sample.cpp. This particular block has been executed only once. Also notice that there are two basic blocks generated for the code that starts at line 12, column 3 of the file. One of these blocks, which has id = 1, has been executed two times, while the other block has been executed only once. A similar report in text format can be generated through the -txtbcvrg option.

Combined Exports

The code coverage tool also has the capability of exporting coverage data in the default HTML format while simultaneously generating the text- and XML-formatted reports. Use the -xmlbcvrgfull and -txtbcvrgfull options to generate reports in all supported formats in a single run. These options export the basic-block level coverage data while simultaneously generating the HTML reports. These options generate more complete reports, since they include analysis of functions that were not executed at all. However, exporting the coverage data using these options requires access to the application source files and takes much longer to run.

Dynamic Call Graphs


Using the -txtdcg option, the tool can provide detailed information about the dynamic call graphs in an application. Specify an output file for the dynamic call-graph report. The resulting call-graph report contains information about the percentage of static and dynamic calls (direct, indirect, and virtual) at the application, module, and function levels.

Test Prioritization Tool

The test prioritization tool enables users of profile-guided optimization, on all supported Intel architectures and on Linux*, VxWorks*, and Windows* operating systems, to select and prioritize tests for an application based on prior execution profiles. The tool offers the potential of significant time savings in testing and developing large-scale applications where testing is the major bottleneck. Development often requires changing application modules. As applications change, developers can have a difficult time retaining the quality of their functional and performance tests so that they remain current and on-target. The test prioritization tool lets software developers select and prioritize application tests as application profiles change.

The information about the tool is separated into the following sections:

• Features and benefits
• Requirements and syntax
• Usage model
• Tool options
• Running the tool

Features and Benefits

The test prioritization tool provides an effective testing hierarchy based on the code coverage for an application. The following list summarizes the advantages of using the tool:

• Minimizing the number of tests required to achieve a given overall coverage for any subset of the application: the tool defines the smallest subset of the application tests that achieves exactly the same code coverage as the entire set of tests.
• Reducing the turn-around time of testing: instead of spending a long time finding a possibly large number of failures, the tool enables users to quickly find a small number of tests that expose the defects associated with the regressions caused by a change set.
• Selecting and prioritizing the tests to achieve a certain level of code coverage in minimal time, based on data about the tests' execution times.


See Understanding Profile-guided Optimization and Example of Profile-guided Optimization for general information on creating the files needed to run this tool.

Test Prioritization Tool Requirements

The test prioritization tool needs the following items to work:

• The .spi file generated by Intel® compilers when compiling the application for the instrumented binaries with the -prof-gen=srcpos (VxWorks*) or /Qprof-gen:srcpos (Windows*) option.
• The .dpi files generated by the profmerge tool as a result of merging the dynamic profile information (.dyn) files of each of the application tests. (Run the profmerge tool on all .dyn files that are generated for each individual test, and name the resulting .dpi file in a fashion that uniquely identifies the test.)

CAUTION. The profmerge tool merges all .dyn files that exist in the given directory. Make sure unrelated .dyn files, which may remain from unrelated runs, are not present. Otherwise, the profile information will be skewed with invalid profile data, which can result in misleading coverage information and adverse performance of the optimized code.

• A user-generated file containing the list of tests to be prioritized. For a successful instrumented code run, you should:

• Name each test .dpi file so that the file names uniquely identify each test.
• Create a .dpi list file, which is a text file that contains the names of all .dpi test files. Each line of the .dpi list file should include one, and only one, .dpi file name. The name can optionally be followed by the duration of the execution time for the corresponding test, in dd:hh:mm:ss format. For example, the line Test1.dpi 00:00:60:35 states that Test1 lasted 0 days, 0 hours, 60 minutes, and 35 seconds. The execution time is optional. However, if it is not provided, then the tool will not prioritize the tests to minimize execution time; it will prioritize to minimize the number of tests only.

The tool uses the following general syntax:

Tool Syntax

tselect -dpi_list file


-dpi_list is a required tool option that sets the path to the list file containing the list of all .dpi files. All other tool options are optional.

NOTE. Windows* only: Unlike the compiler options, which are preceded by forward slash ("/"), the tool options are preceded by a hyphen ("-").

Usage Model

The following figure illustrates a typical test prioritization tool usage model.

Test Prioritization Tool Options

The tool uses the options that are listed in the following table:


Option Default Description

-help Prints tool option descriptions.

-dpi_list file Required. Specifies the file name of the file that contains the names of the dynamic profile information (.dpi) files. Each line of the file must contain only one .dpi file name, which can be optionally followed by its execution time. The name must uniquely identify the test.

-spi file pgopti.spi Specifies the file name of the static profile information file (.SPI).

-o file Specifies the file name of the output report file.

-comp file Specifies the file name that contains the list of files of interest.

-cutoff value Instructs the tool to terminate when the cumulative block coverage reaches a preset percentage of the pre-computed total coverage, as specified by value. value must be greater than 0.0 (for example, 99.00) but not greater than 100. value can be set to 100.



-nototal Instructs the tool to skip the pre-computation of total coverage.

-mintime Instructs the tool to minimize testing execution time. The execution time of each test must be provided on the same line of the dpi_list file, after the test name, in dd:hh:mm:ss format.

-srcbasedir dir Specifies a different top level project directory than was used during compiler instrumentation run with the prof-src-root compiler option to support relative paths to source files in place of absolute paths.

-verbose Instructs the tool to generate more logging information about program progress.

Running the Tool

The following steps demonstrate one simple example of running the tool on IA-32 architectures.

1. Specify the directory by entering a command similar to the following:

Example

set prof-dir=c:\myApp\prof-dir

2. Compile the program and generate instrumented binary by issuing commands similar to the following:


Operating System Command

VxWorks icpc -prof-gen=srcpos myApp.cpp

Windows icl /Qprof-gen:srcpos myApp.cpp

The commands shown above compile the program and generate the instrumented binary myApp, as well as the corresponding static profile information pgopti.spi. 3. Make sure that unrelated .dyn files are not present by issuing a command similar to the following:

Example

rm prof-dir/*.dyn

4. Run the instrumented files by issuing a command similar to the following:

Example

myApp < data1

The command runs the instrumented application and generates one or more new dynamic profile information files that have the extension .dyn in the directory specified by -prof-dir in step 1. 5. Merge all .dyn files into a single file by issuing a command similar to the following:

Example

profmerge -prof_dpi Test1.dpi

The profmerge tool merges all the .dyn files into one file (Test1.dpi) that represents the total profile information of the application on Test1. 6. Make sure that unrelated .dyn files are not present a second time by issuing a command similar to the following:

Example

rm prof-dir/*.dyn


7. Run the instrumented application to generate one or more new dynamic profile information files that have the extension .dyn in the directory specified by -prof-dir in step 1, by issuing a command similar to the following:

Example

myApp < data2

8. Merge all .dyn files into a single file by issuing a command similar to the following:

Example

profmerge -prof_dpi Test2.dpi

At this step, the profmerge tool merges all the .dyn files into one file (Test2.dpi) that represents the total profile information of the application on Test2. 9. Make sure that there are no unrelated .dyn files present, for the final time, by issuing a command similar to the following:

Example

rm prof-dir/*.dyn

10. Run the instrumented application to generate one or more new dynamic profile information files that have the extension .dyn in the directory specified by -prof-dir, by issuing a command similar to the following:

Example

myApp < data3

11. Merge all .dyn files into a single file by issuing a command similar to the following:

Example

profmerge -prof_dpi Test3.dpi

At this step, the profmerge tool merges all the .dyn files into one file (Test3.dpi) that represents the total profile information of the application on Test3. 12. Create a file named tests_list with three lines. The first line contains Test1.dpi, the second line contains Test2.dpi, and the third line contains Test3.dpi.


Tool Usage Examples

When these items are available, the test prioritization tool may be launched from the command line in the prof-dir directory, as described in the following examples.

Example 1: Minimizing the Number of Tests

The following example describes how to minimize the number of test runs.

Example Syntax

tselect -dpi_list tests_list -spi pgopti.spi

where the -spi option specifies the path to the .spi file. The following sample output shows typical results.

Sample Output

Total number of tests   = 3
Total block coverage    ~ 52.17
Total function coverage ~ 50.00

num  %RatCvrg  %BlkCvrg  %FncCvrg  Test Name @ Options
------------------------------------------------------
1    87.50     45.65     37.50     Test3.dpi
2    100.00    52.17     50.00     Test2.dpi

In this example, the results provide the following information:

• By running all three tests, we achieve 52.17% block coverage and 50.00% function coverage.
• Test3 by itself covers 45.65% of the basic blocks of the application, which is 87.50% of the total block coverage that can be achieved from all three tests.
• By adding Test2, we achieve a cumulative block coverage of 52.17%, or 100% of the total block coverage of Test1, Test2, and Test3.
• Elimination of Test1 has no negative impact on the total block coverage.


Example 2: Minimizing Execution Time

Suppose we have the following execution times for each test in the tests_list file:

Sample Output

Test1.dpi 00:00:60:35
Test2.dpi 00:00:10:15
Test3.dpi 00:00:30:45

The following command minimizes the execution time by passing the -mintime option:

Sample Syntax

tselect -dpi_list tests_list -spi pgopti.spi -mintime

The following sample output shows possible results:

Sample Output

Total number of tests   = 3
Total block coverage    ~ 52.17
Total function coverage ~ 50.00
Total execution time    = 1:41:35

num  elapsedTime  %RatCvrg  %BlkCvrg  %FncCvrg  Test Name @ Options
-------------------------------------------------------------------
1    10:15        75.00     39.13     25.00     Test2.dpi
2    41:00        100.00    52.17     50.00     Test3.dpi

In this case, the results indicate that running all tests sequentially would require one hour, 41 minutes, and 35 seconds, while the selected tests would achieve the same total block coverage in only 41 minutes. The order of tests when based on minimizing time (first Test2, then Test3) can be different than when prioritization is based on minimizing the number of tests (see Example 1 above: first Test3, then Test2). In Example 2, Test2 is the test that gives the highest coverage per unit of execution time, so Test2 is picked as the first test to run.


Using Other Options

The -cutoff option enables the tool to exit when it reaches a given level of basic block coverage. The following example demonstrates how to use the option:

Example

tselect -dpi_list tests_list -spi pgopti.spi -cutoff 85.00

If the tool is run with the cutoff value of 85.00, as in the above example, only Test3 will be selected, as it achieves 45.65% block coverage, which corresponds to 87.50% of the total block coverage that is reached from all three tests.

The tool does an initial merging of all the profile information to figure out the total coverage that is obtained by running all the tests. The -nototal option enables you to skip this step. In such a case, only the absolute coverage information will be reported, as the overall coverage remains unknown.

profmerge and proforder Tools

profmerge Tool

Use the profmerge tool to merge dynamic profile information (.dyn) files and any specified summary files (.dpi). The compiler executes profmerge automatically during the feedback compilation phase when you specify -prof-use (VxWorks*) or /Qprof-use (Windows*). The command-line usage for profmerge is as follows:

Syntax

profmerge [-prof_dir dir_name]

The tool merges all .dyn files in the current directory, or the directory specified by -prof_dir, and produces a summary file: pgopti.dpi.

NOTE. The spelling of tool options may differ slightly from compiler options. Tool options use an underscore (for example, -prof_dir) instead of the hyphen used by compiler options (for example, -prof-dir or /Qprof-dir) to join words. Also, on Windows* OS systems, the tool options are preceded by a hyphen ("-"), unlike Windows compiler options, which are preceded by a forward slash ("/").


You can use the profmerge tool to merge .dyn files into a .dpi file without recompiling the application. Thus, you can run the instrumented executable file on multiple systems to generate .dyn files, and optionally use profmerge with the -prof_dpi option to name each summary .dpi file created from the multiple .dyn files. Since the profmerge tool merges all the .dyn files that exist in the given directory, make sure unrelated .dyn files are not present; otherwise, profile information will be based on invalid profile data, which can negatively impact the performance of optimized code.

profmerge Options

The profmerge tool supports the following options:

Tool Option Description

-dump Displays profile information.

-help Lists supported options.

-nologo Disables version information. This option is supported on Windows* only.

-exclude_funcs functions Excludes functions from the profile. The list items must be separated by a comma (","); you can use a period (".") as a wild card character in function names.

-prof_dir dir Specifies the directory from which to read the .dyn and .dpi files, and in which to write the .dpi file. Alternatively, you can set the environment variable PROF_DIR.

-prof_dpi file Specifies the name of the .dpi file being generated.

-prof_file file Merges information from file matching: dpi_file_and_dyn_tag.

-src_old dir -src_new dir Changes the directory path stored within the .dpi file.



-src_no_dir Uses only the file name and not the directory name when reading dyn/dpi records. If you specify -src_no_dir, the directory name of the source file will be ignored when deciding which profile data records correspond to a specific application routine, and the -src-root option is ignored.

-src-root dir Specifies a directory path prefix for the root directory where the user's application files are stored. This option is ignored if you specify -src_no_dir.

-a file1.dpi ... fileN.dpi Specifies and merges available .dpi files.

-verbose Instructs the tool to display full information during merge.

-weighted Instructs the tool to apply an equal weighting (regardless of execution times) to the dyn file values to normalize the data counts. This keyword is useful when the execution runs have different time durations and you want them to be treated equally.

-gen_weight_spec file Instructs the tool to generate a text file containing a list of the .dyn and .dpi files that were merged, with a default weight of 1/run_count. The text file is created in the directory specified by the -prof_dir option.

-weight_spec weight_spec.txt Instructs the profmerge tool to generate and use the text file, weight_spec.txt, listing individual .dyn/.dpi files or directory names along with weight values for them. When the -weight_spec option is used:


• a new .dpi file is always created
• only files called out by the specification file are merged
• dyn timestamps are ignored and the merge always takes place

The -prof_dir option controls where the input/output weight_spec.txt is located, and the destination of the .dpi file. The -weight_spec option overrides:

• any values of the -a option
• any computation from using the -weighted option

Weighting the Runs

Using the -weight_spec option results in a new .dpi file. Only the files listed in the text file are merged. (No files in the current directory are used unless they are included in the text file.)

Relocating source files using profmerge

The Intel® compiler uses the full path to the source file for each routine to look up the profile summary information associated with that routine. By default, this prevents you from:

• Using the profile summary file (.dpi) if you move your application sources.
• Sharing the profile summary file with another user who is building identical application sources that are located in a different directory.

You can disable the use of directory names when reading dyn/dpi file records by specifying the profmerge option -src_no_dir. This profmerge option is the same as the compiler option -no-prof-src-dir (VxWorks) and /Qprof-src-dir- (Windows). To enable the movement of application sources, as well as the sharing of profile summary files, use the profmerge option -src-root to specify a directory path prefix for the root directory where the application files are stored. Alternatively, you can specify the option pair -src_old -src_new to modify the data in an existing summary .dpi file. For example:


Example: relocation command syntax

profmerge -prof_dir dir_name -src_old old_dir -src_new new_dir

where dir_name is the full path to the directory containing the dynamic information file (.dpi), old_dir is the old full path to the source files, and new_dir is the new full path to the source files. The example command (above) reads the pgopti.dpi file in the location specified by dir_name. For each function represented in the pgopti.dpi file whose source path begins with the old_dir prefix, profmerge replaces that prefix with new_dir. The pgopti.dpi file is updated with the new source path information.

You can run profmerge more than once on a given pgopti.dpi file. For example, you may need to do this if the source files are located in multiple directories:

Operating System Command Examples

VxWorks:
profmerge -prof_dir dir_name -src_old /src/prog_1 -src_new /src/prog_2
profmerge -prof_dir dir_name -src_old /proj_1 -src_new /proj_2

Windows:
profmerge -src_old "c:/program files" -src_new "e:/program files"
profmerge -src_old c:/proj/application -src_new d:/app

In the values specified for -src_old and -src_new, uppercase and lowercase characters are treated as identical. Likewise, forward slash ( /) and backward slash ( \) characters are treated as identical.

NOTE. Because the source relocation feature of profmerge modifies the pgopti.dpi file, consider making a backup copy of the file before performing the source relocation.

proforder Tool

The proforder tool is used as part of the feedback compilation phase to improve program performance. Use proforder to generate a function order list for use with the /ORDER linker option. The tool uses the following syntax:


Syntax

proforder [-prof_dir dir] [-o file]

where dir is the directory containing the profile files (.dpi and .spi), and file is the optional name of the function order list file. The default name is proford.txt.

NOTE. The spelling of tool options may differ slightly from compiler options. Tool options use an underscore (for example, -prof_dir) instead of the hyphen used by compiler options (for example, -prof-dir or /Qprof-dir) to join words. Also, on Windows* OS systems, the tool options are preceded by a hyphen ("-"), unlike Windows compiler options, which are preceded by a forward slash ("/").

proforder Options

The proforder tool supports the following options:

Tool Option Default Description

-help Lists supported options.

-nologo Disables version information. This option is supported on Windows* only.

-omit_static Instructs the tool to omit static functions from function ordering.

-prof_dir dir Specifies the directory where the .spi and .dpi file reside.

-prof_dpi file Specifies the name of the .dpi file.

-prof_file string Selects the .dpi and .spi files whose file names include the substring passed as string.



-prof_spi file Specifies the name of the .spi file.

-o file proford.txt Specifies an alternate name for the output file.

Using Function Order Lists, Function Grouping, Function Ordering, and Data Ordering Optimizations

Instead of doing a full multi-file interprocedural build of your application by using the compiler option -ipo (Linux* OS and VxWorks* OS) or /Qipo (Windows* OS), you can obtain some of the benefits by having the compiler and linker work together to make global decisions about where to place the functions and data in your application. The following list describes each optimization, the type of functions or global data it applies to, and the operating systems and architectures on which it is supported.

Function Order Lists
Specifies the order in which the linker should link the non-static routines (functions) of your program. This optimization can improve application performance by improving code locality and reducing paging. Also see Comparison of Function Order Lists and IPO Code Layout.
Type of function or data: extern functions, procedures, and library functions only (not static functions).
Supported OS and architectures: Windows OS: all Intel architectures. Linux OS and VxWorks OS: not supported.

Function Grouping
Specifies that the linker should place the extern and static routines (functions) of your program into hot or cold program sections. This optimization can improve application performance by improving code locality and reducing paging.
Type of function or data: extern functions and static functions only (not library functions).
Supported OS and architectures: Linux OS and VxWorks OS: IA-32 and Intel(R) 64 architectures. Windows OS: not supported.

Function Ordering
Enables ordering of static and extern routines using profile information. Specifies the order in which the linker should link the routines (functions) of your program. This optimization can improve application performance by improving code locality and reducing paging.
Type of function or data: extern functions and static functions only (not library functions).
Supported OS and architectures: Linux, VxWorks, and Windows OS: all Intel architectures.

Data Ordering
Enables ordering of static global data items based on profiling information. Specifies the order in which the linker should link the global data of your program. This optimization can improve application performance by improving the locality of static global data, reducing paging of large data sets, and improving data cache use.
Type of function or data: static global data only.
Supported OS and architectures: Linux, VxWorks, and Windows OS: all Intel architectures.

You can use only one of the function-related ordering optimizations listed above. However, you can use the Data Ordering optimization with any one of the function-related ordering optimizations, such as Data Ordering with Function Ordering, or Data Ordering with Function Grouping. In that case, specify the prof-gen option keyword globdata (needed for Data Ordering) instead of srcpos (needed for function-related ordering).

The following sections show the commands needed to implement each of these optimizations: function order list, function grouping, function ordering, and data ordering. For all of these optimizations, omit the -ipo (Linux* OS and VxWorks* OS) or /Qipo (Windows OS) or equivalent compiler option.


Generating a Function Order List (Windows OS)

This section provides an example of the process for generating a function order list. Assume you have a C++ program that consists of the following files: file1.cpp and file2.cpp. Additionally, assume you have created a directory for the profile data files called c:\profdata. You would enter commands similar to the following to generate and use a function order list for your Windows application.

1. Compile your program using the /Qprof-gen:srcpos option. Use the /Qprof-dir option to specify the directory location of the profile files. This step creates an instrumented executable.

Example commands

icl /Femyprog /Qprof-gen=srcpos /Qprof-dir c:\profdata file1.cpp file2.cpp

2. Run the instrumented program with one or more sets of input data. Change your directory to the directory where the executables are located. The program produces a .dyn file each time it is executed.

Example commands

myprog.exe

3. Merge the data from one or more runs of the instrumented program by using the profmerge tool to produce the pgopti.dpi file. Before this step, copy all .dyn and .dpi files into the same directory. Use the /prof_dir option to specify the directory location of the .dyn files.

Example commands

profmerge /prof_dir c:\profdata

4. Generate the function order list using the proforder tool. (By default, the function order list is produced in the file proford.txt.)

Example commands

proforder /prof_dir c:\profdata /o myprog.txt


5. Compile the application with the generated profile feedback by specifying the ORDER option to the linker. Use the /Qprof-dir option to specify the directory location of the profile files.

Example commands

icl /Femyprog /Qprof-dir c:\profdata file1.cpp file2.cpp /link -ORDER:@myprog.txt

Using Function Grouping (Linux OS and VxWorks OS)

This section provides a general example of the process for using the function grouping optimization. Assume you have a C++ program that consists of the following files: file1.cpp and file2.cpp. Additionally, assume you have created a directory for the profile data files called profdata. You would enter commands similar to the following to use function grouping for your Linux or VxWorks application.

1. Compile your program using the -prof-gen option. Use the -prof-dir option to specify the directory location of the profile files. This step creates an instrumented executable.

Example commands

icc -o myprog -prof-gen -prof-dir ./profdata file1.cpp file2.cpp

2. Run the instrumented program with one or more sets of input data. Change your directory to the directory where the executables are located. The program produces a .dyn file each time it is executed.

Example commands

./myprog

3. Copy all .dyn and .dpi files into the same directory. If needed, you can merge the data from one or more runs of the instrumented program by using the profmerge tool to produce the pgopti.dpi file.

4. Compile the application with the generated profile feedback by specifying the -prof-func-group option to request the function grouping as well as the -prof-use option to request feedback compilation. Again, use the -prof-dir option to specify the location of the profile files.


Example commands

icc -o myprog file1.cpp file2.cpp -prof-func-group -prof-use -prof-dir ./profdata

Using Function Ordering

This section provides an example of the process for using the function ordering optimization. Assume you have a C++ program that consists of the following files: file1.cpp and file2.cpp. Additionally, assume you have created a directory for the profile data files called c:\profdata (on Windows) or ./profdata (on Linux). You would enter commands similar to the following to generate and use function ordering for your application.

1. Compile your program using the -prof-gen=srcpos (Linux and VxWorks) option. Use the -prof-dir (Linux and VxWorks) option to specify the directory location of the profile files. This step creates an instrumented executable.

Operating System Example commands

Linux and VxWorks icc -o myprog -prof-gen=srcpos -prof-dir ./profdata file1.cpp file2.cpp

Windows icl /Femyprog /Qprof-gen:srcpos /Qprof-dir c:\profdata file1.cpp file2.cpp

2. Run the instrumented program with one or more sets of input data. Change your directory to the directory where the executables are located. The program produces a .dyn file each time it is executed.

Operating System Example commands

Linux and VxWorks ./myprog

Windows myprog.exe

3. Copy all .dyn and .dpi files into the same directory. If needed, you can merge the data from one or more runs of the instrumented program by using the profmerge tool to produce the pgopti.dpi file.


4. Compile the application with the generated profile feedback by specifying the -prof-func-order (Linux and VxWorks) option to request the function ordering as well as the -prof-use (Linux and VxWorks) option to request feedback compilation. Again, use the -prof-dir (Linux and VxWorks) option to specify the location of the profile files.

Operating System Example commands

Linux and VxWorks icpc -o myprog -prof-dir ./profdata file1.cpp file2.cpp -prof-func-order -prof-use

Windows icl /Femyprog /Qprof-dir c:\profdata file1.cpp file2.cpp /Qprof-func-order /Qprof-use

Using Data Ordering

This section provides an example of the process for using the data order optimization. Assume you have a C++ program that consists of the following files: file1.cpp and file2.cpp. Additionally, assume you have created a directory for the profile data files called c:\profdata (on Windows) or ./profdata (on Linux). You would enter commands similar to the following to use data ordering for your application.

1. Compile your program using the -prof-gen=globdata (Linux and VxWorks) option. Use the -prof-dir (Linux and VxWorks) option to specify the directory location of the profile files. This step creates an instrumented executable.

Operating System Example commands

Linux and VxWorks icc -o myprog -prof-gen=globdata -prof-dir ./profdata file1.cpp file2.cpp

Windows icl /Femyprog /Qprof-gen=globdata /Qprof-dir c:\profdata file1.cpp file2.cpp

2. Run the instrumented program with one or more sets of input data. If you specified a location other than the current directory, change your directory to the directory where the executables are located. The program produces a .dyn file each time it is executed.


Operating System Example commands

Linux and VxWorks ./myprog

Windows myprog.exe

3. Copy all .dyn and .dpi files into the same directory. If needed, you can merge the data from one or more runs of the instrumented program by using the profmerge tool to produce the pgopti.dpi file.

4. Compile the application with the generated profile feedback by specifying the -prof-data-order (Linux and VxWorks) or /Qprof-data-order (Windows) option to request the data ordering as well as the -prof-use (Linux and VxWorks) option to request feedback compilation. Again, use the -prof-dir (Linux and VxWorks) option to specify the location of the profile files.

Operating System Example commands

Linux and VxWorks icpc -o myprog -prof-dir ./profdata file1.cpp file2.cpp -prof-data-order -prof-use

Windows icl /Femyprog /Qprof-dir c:\profdata file1.cpp file2.cpp /Qprof-data-order /Qprof-use

Comparison of Function Order Lists and IPO Code Layout

The Intel® compiler provides two methods of optimizing the layout of functions in the executable:

• Using a function order list
• Using the /Qipo (Windows) compiler option

Each method has its advantages. A function order list, created with proforder, lets you optimize the layout of non-static functions: that is, external and library functions whose names are exposed to the linker. The linker cannot directly affect the layout order for static functions because the names of these functions are not available in the object files.


Alternately, using the /Qipo (Windows) option allows you to optimize the layout of all static or extern functions compiled with the Intel® C++ Compiler. The compiler cannot affect the layout order for functions it does not compile, such as library functions. The function layout optimization is performed automatically when IPO is active.

Function Order List Effects

Function Type        IPO Code Layout      Function Ordering with proforder

Static               X                    No effect

Extern               X                    X

Library              No effect            X

PGO API Support

API Support Overview

The Profile Information Generation Support (Profile IGS) lets you control the generation of profile information during the instrumented execution phase of profile-guided optimizations. A set of functions and an environment variable comprise the Profile IGS. The remaining topics in this section describe the associated functions and environment variables.

The compiler defines the _PGO_INSTRUMENT preprocessor macro when you compile with either the -prof-gen (VxWorks*) or /Qprof-gen (Windows*) option to instrument your code. Without instrumentation, the Profile IGS functions cannot provide PGO API support.

Normally, profile information is generated by an instrumented application when it terminates by calling the standard exit() function. To ensure that profile information is generated, the functions described in this section may be necessary or useful in the following situations:

• The instrumented application exits using a non-standard exit routine.
• The instrumented application is a non-terminating application: exit() is never called.
• The application requires control of when the profile information is generated.

You can use the Profile IGS functions in your application by including the pgouser.h header file at the top of any source file where the functions may be used.


Example

#include <pgouser.h>

NOTE. You do not need to remove the Profile IGS API functions from your code when you have completed the instrumentation step. Changes to the source code at this stage could inhibit obtaining profile data feedback on the routines that were modified. For the instrumentation step (using -[Q]prof-gen or -[Q]prof-genx option), the definition for the _PGO_INSTRUMENT macro is automatically set, allowing instrumentation library routines to be used. For the production step (using -[Q]prof-use option), the definition for the _PGO_INSTRUMENT macro is removed, allowing profile data to be fed back.

The Profile IGS Environment Variable

The environment variable for Profile IGS is INTEL_PROF_DUMP_INTERVAL. This environment variable may be used to initiate Interval Profile Dumping in an instrumented user application.

See Also • _PGOPTI_Set_Interval_Prof_Dump()

PGO Environment Variables

The environment variables determine the directory in which to store dynamic information files, control the creation of one or multiple dyn files to collect profiling information, and determine whether to overwrite pgopti.dpi. The environment variables are described in the table below.

Variable Description

INTEL_PROF_DUMP_CUMULATIVE: When using interval profile dumping (initiated by INTEL_PROF_DUMP_INTERVAL or the function _PGOPTI_Set_Interval_Prof_Dump) during the execution of an instrumented user application, allows creation of a single .dyn file to contain profiling information instead of multiple .dyn files. If this environment variable is not set, executing an instrumented user application creates a new .dyn file for each interval. Setting this environment variable is useful for applications that do not terminate or those that terminate abnormally (bypass the normal exit code).

INTEL_PROF_DUMP_INTERVAL Initiates interval profile dumping in an instrumented user application. This environment variable may be used to initiate Interval Profile Dumping in an instrumented application. See Interval Profile Dumping for more information.

PROF_DIR Specifies the directory in which dynamic information files are created. This variable applies to all three phases of the profiling process.

PROF_DUMP_INTERVAL Deprecated. Please use INTEL_PROF_DUMP_INTERVAL instead.

PROF_NO_CLOBBER Alters the feedback compilation phase slightly. By default, during the feedback compilation phase, the compiler merges data from all dynamic information files and creates a new pgopti.dpi file if the .dyn files are newer than an existing pgopti.dpi file. When this variable is set the compiler does not overwrite the existing pgopti.dpi file. Instead, the compiler issues a warning. You must remove the pgopti.dpi file if you want to use additional dynamic information files.

See the appropriate operating system documentation for instructions on how to specify environment variables and their values.

Dumping Profile Information

The _PGOPTI_Prof_Dump_All() function dumps the profile information collected by the instrumented application. The prototype of the function call is listed below.


Syntax

void _PGOPTI_Prof_Dump_All(void);

An older version of this function, _PGOPTI_Prof_Dump(), is still available; the older function operates much like _PGOPTI_Prof_Dump_All(), except on Linux* and VxWorks* operating systems when used in connection with shared libraries (.so) and the _exit() function to terminate a program. When _PGOPTI_Prof_Dump_All() is called before _exit() to terminate the program, the new function ensures that a .dyn file is created for all shared libraries needing to create one. Use _PGOPTI_Prof_Dump_All() on Linux OS and VxWorks OS to ensure portability and correct functionality.

The profile information is generated in a .dyn file (generated in phase 2 of PGO). The environment variables that affect the _PGOPTI_Prof_Dump_All() function are PROF_DIR and PROF_DPI:

Set the PROF_DIR environment variable to specify the directory in which the .dyn file is stored. Alternately, you can use the -prof_dir profmerge tool option to specify this directory without setting the PROF_DIR variable.

Set the PROF_DPI environment variable to specify an alternative .dpi filename. The default filename is pgopti.dpi. Alternately, you can use the -prof_dpi profmerge tool option to specify the filename for the summary .dpi file.


Recommended usage

Insert a single call to this function in the body of the function that terminates the user application. Normally, _PGOPTI_Prof_Dump_All() should be called just once. It is also possible to use this function in conjunction with the _PGOPTI_Prof_Reset() function to generate multiple .dyn files (presumably from multiple sets of input data).

Example

#include <pgouser.h>
void process_data(int foo) {}
int get_input_data() { return 1; }
int main(void)
{
  // Selectively collect profile information for the portion
  // of the application involved in processing input data.
  int input_data = get_input_data();
  while (input_data)
  {
    _PGOPTI_Prof_Reset();
    process_data(input_data);
    _PGOPTI_Prof_Dump_All();
    input_data = get_input_data();
  }
  return 0;
}

Interval Profile Dumping

The _PGOPTI_Set_Interval_Prof_Dump() function activates Interval Profile Dumping and sets the approximate frequency at which dumps occur. This function is used in non-terminating applications. The prototype of the function call is listed below.


Syntax

void _PGOPTI_Set_Interval_Prof_Dump(int interval);

This function is used in non-terminating applications. The interval parameter specifies the time interval at which profile dumping occurs, measured in milliseconds. For example, if interval is set to 5000, then a profile dump and reset will occur approximately every 5 seconds. The interval is approximate because the time-check controlling the dump and reset is performed only upon entry to any instrumented function in your application.

Setting the interval to zero or a negative number disables interval profile dumping, and setting a very small value for the interval may cause the instrumented application to spend nearly all of its time dumping profile information. Be sure to set interval to a large enough value so that the application can perform actual work and substantial profile information is collected. The following example demonstrates one way of using interval profile dumping in non-terminating code.

Example

#include <stdio.h>
// The next include is to access
// _PGOPTI_Set_Interval_Prof_Dump
#include <pgouser.h>
int returnValue() { return 100; }
int main()
{
  int ans;
  printf("CTRL-C to quit.\n");
  _PGOPTI_Set_Interval_Prof_Dump(5000);
  while (1)
    ans = returnValue();
}

You can compile the code shown above by entering commands similar to the following:


Operating System Example

VxWorks* icc -prof-gen -o instrumented_number number.c

Windows* icl /Qprof-gen /Feinstrumented_number number.c

When compiled and run, the code shown above will dump profile information to a .dyn file about every five seconds until the program is ended. You can use the profmerge tool to merge the .dyn files.

Recommended usage

Call this function at the start of a non-terminating user application to initiate interval profile dumping. Note that an alternative method of initiating interval profile dumping is by setting the environment variable INTEL_PROF_DUMP_INTERVAL to the desired interval value prior to starting the application. Using interval profile dumping, you should be able to profile a non-terminating application with minimal changes to the application source code.

Resetting the Dynamic Profile Counters

The _PGOPTI_Prof_Reset() function resets the dynamic profile counters. The prototype of the function call is listed below.

Syntax

void _PGOPTI_Prof_Reset(void);

One of the activities performed under profile-guided optimization is value-profiling. With value-profiling, the compiler inserts instrumentation to obtain a sample of the variable values in the program. Based upon this sampling, the compiler optimizes the user's code using frequently observed values. In effect, during a run of an instrumented executable, the segment maintained for each routine executed during that run stores the following information:

• Number of times each basic block in the routine is executed
• Specific values observed at each data point undergoing value-profiling
• Number of times each of these values is observed


When the _PGOPTI_Prof_Reset() function is called, the basic block execution count for each basic block is set to 0, and the value-profiling information is reset to indicate that no values were observed at any data point.

Recommended usage

Use this function to clear the profile counters prior to collecting profile information on a section of the instrumented application. See the example under Dumping Profile Information.

Dumping and Resetting Profile Information

The _PGOPTI_Prof_Dump_And_Reset() function dumps the profile information to a new .dyn file and then resets the dynamic profile counters. Then the execution of the instrumented application continues. The prototype of the function call is listed below.

Syntax

void _PGOPTI_Prof_Dump_And_Reset(void);

This function is used in non-terminating applications and may be called more than once. Each call will dump the profile information to a new .dyn file.

Recommended usage

Periodic calls to this function enable a non-terminating application to generate one or more profile information files (.dyn files). These files are merged during the feedback phase (phase 3) of profile-guided optimizations. The direct use of this function enables your application to control precisely when the profile information is generated.


Using High-Level Optimization (HLO)

High-Level Optimizations (HLO) Overview

High-level Optimizations (HLO) exploit the properties of source code constructs (for example, loops and arrays) in applications developed in high-level programming languages. While the default optimization level, the -O2 (VxWorks* OS) option, performs some high-level optimizations, specifying -O3 (VxWorks* OS) or /O3 (Windows) provides the best chance for performing loop transformations to optimize memory accesses.

Loop optimizations may result in calls to library routines that can result in additional performance gain on Intel® microprocessors than on non-Intel microprocessors. The optimizations performed can also be affected by certain options, such as /arch or -m or -x (VxWorks* OS), where, under the /Qx (Windows) or -x (VxWorks* OS) option, additional HLO transformations may be performed for Intel® microprocessors than for non-Intel microprocessors.

Within HLO, loop transformation techniques include:

• Loop Permutation or Interchange
• Loop Distribution
• Loop Fusion
• Loop Unrolling
• Data Prefetching
• Scalar Replacement
• Unroll and Jam
• Loop Blocking or Tiling
• Partial-Sum Optimization
• Predicate Optimization
• Loop Reversal
• Profile-Guided Loop Unrolling
• Loop Peeling
• Data Transformation: Malloc Combining and Memset Combining
• Loop Rerolling
• Memset and Memcpy Recognition
• Statement Sinking for Creating Perfect Loopnests


Part IV: Creating Parallel Applications

Topics:

• Automatic Vectorization


Automatic Vectorization

Automatic Vectorization Overview

The automatic vectorizer (also called the auto-vectorizer) is a component of the Intel® compiler that automatically uses SIMD instructions in the Intel® Streaming SIMD Extensions (Intel® SSE, SSE2, SSE3 and SSE4 Vectorizing Compiler and Media Accelerators), the Supplemental Streaming SIMD Extensions (SSSE3), and the Intel® Advanced Vector Extensions instruction sets. The vectorizer detects operations in the program that can be done in parallel, and then converts the sequential operations to parallel operations; for example, using one SIMD instruction that processes 2, 4, 8 or up to 16 elements in parallel, depending on the data type.

Automatic vectorization is supported on IA-32 and Intel® 64 architectures. The section discusses the following topics, among others:

• High-level discussion of compiler options used to control or influence vectorization
• Vectorization Key Programming Guidelines
• Loop parallelization and vectorization
• Descriptions of the C++ language features to control vectorization
• Discussion and general guidelines on vectorization types:
  • automatic vectorization
  • vectorization with user intervention (user-mandated vectorization)

• Examples demonstrating typical vectorization issues and resolutions

The compiler supports a variety of auto-vectorizing hints and user-mandated pragmas that can help the compiler to generate effective vector instructions. See The Software Vectorization Handbook: Applying Multimedia Extensions for Maximum Performance, A.J.C. Bik, Intel Press, June 2004, for a detailed discussion of how to vectorize code using the Intel® compiler. Additionally, see the Related Publications topic in this document for other resources.

Using the -vec (Linux*) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in additional performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (VxWorks).


Automatic Vectorization Options Quick Reference

These options are supported on IA-32 and Intel® 64 architectures.

-vec, -no-vec (Linux* OS, VxWorks* OS) / /Qvec, /Qvec- (Windows* OS): Enables or disables vectorization and transformations enabled for vectorization. Vectorization is enabled by default. To disable, use the -no-vec (VxWorks*) or /Qvec- (Windows*) option. Supported on IA-32 and Intel® 64 architectures only.

-vec-report (Linux* OS, VxWorks* OS) / /Qvec-report (Windows* OS): Controls the diagnostic messages from the vectorizer. See Vectorization Report.

-simd, -no-simd (Linux* OS, VxWorks* OS) / /Qsimd, /Qsimd- (Windows* OS): Controls user-mandated (SIMD) vectorization. User-mandated (SIMD) vectorization is enabled by default. Use the -no-simd (Linux*, VxWorks*) or /Qsimd- (Windows*) option to disable SIMD transformations for vectorization.

When used with -x or -ax, the compiler may perform additional vectorizations for Intel® microprocessors than it performs for non-Intel microprocessors. Vectorization within the Intel® compiler depends upon the ability of the compiler to disambiguate memory references. Certain options may enable the compiler to do better vectorization.


Using the -vec (Linux*) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in additional performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or -m or -x (VxWorks). Refer to Quick Reference Lists for a complete listing of the quick reference topics.

See Also • -x compiler option • -ax compiler option • -vec compiler option • -simd compiler option

Programming Guidelines for Vectorization

The goal of including the vectorizer component in the Intel® compiler is to exploit single-instruction multiple data (SIMD) processing automatically. Users can help, however, by supplying the compiler with additional information; for example, by using auto-vectorizer hints or pragmas.

Using the -vec (Linux*) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in additional performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or -m or -x (VxWorks* OS).

Guidelines

Follow these guidelines to vectorize innermost loop bodies. Use:

• straight-line code (a single basic block) • vector data only; that is, arrays and invariant expressions on the right hand side of assignments. Array references can appear on the left hand side of assignments. • only assignment statements

Avoid:

• function calls (other than math library calls)


• unvectorizable operations (either because the loop cannot be vectorized, or because an operation is emulated through a number of instructions) • mixing vectorizable types in the same loop (leads to lower resource utilization) • data-dependent loop exit conditions (leads to loss of vectorization)

To make your code vectorizable, you will often need to make some changes to your loops. However, you should make only those changes needed to enable vectorization. In particular, you should avoid these common changes:

• loop unrolling; the compiler does it automatically. • decomposing one loop with several statements in the body into several single-statement loops.

Restrictions

There are a number of restrictions that you should consider. Vectorization depends on two major factors: hardware and style of source code.

Factor Description

Hardware The compiler is limited by restrictions imposed by the underlying hardware. In the case of Streaming SIMD Extensions, the vector memory operations are limited to stride-1 accesses with a preference to 16-byte-aligned memory references. This means that if the compiler abstractly recognizes a loop as vectorizable, it still might not vectorize it for a distinct target architecture.

Style of source code The style in which you write source code can inhibit vectorization. For example, a common problem with global pointers is that they often prevent the compiler from being able to prove that two memory references refer to distinct locations. Consequently, this prevents certain reordering transformations.


Many stylistic issues that prevent automatic vectorization by compilers are found in loop structures. The ambiguity arises from the complexity of the keywords, operators, data references, pointer arithmetic, and memory operations within the loop bodies. However, by understanding these limitations and by knowing how to interpret diagnostic messages, you can modify your program to overcome the known limitations and enable effective vectorization.

Guidelines for Writing Vectorizable Code

Follow these guidelines to write vectorizable code:

• Use simple for loops. Avoid complex loop termination conditions; the upper iteration limit must be invariant within the loop. For the innermost loop in a nest of loops, you could set the upper iteration limit to be a function of the outer loop indices.
• Write straight-line code. Avoid branches such as switch, goto or return statements, most function calls, or if constructs that cannot be treated as masked assignments.
• Avoid dependencies between loop iterations, or at the least, avoid read-after-write dependencies.
• Try to use array notations instead of pointers. C programs in particular impose very few restrictions on the use of pointers; aliased pointers may lead to unexpected dependencies. Without help, the compiler often cannot tell whether it is safe to vectorize code containing pointers.
• Wherever possible, use the loop index directly in array subscripts instead of incrementing a separate counter for use as an array address.
• Use efficient memory accesses; it is recommended that you:

• Favor inner loops with unit stride. • Minimize indirect addressing. • Align your data to 16 byte boundaries (for Streaming SIMD Extensions (SSE) instructions).

• Choose a suitable data layout with care. Most multimedia extension instruction sets are rather sensitive to alignment. The data movement instructions of SSE, for example, operate much more efficiently on data that is aligned at a 16-byte boundary in memory. Therefore, the success of a vectorizing compiler also depends on its ability to select an appropriate data layout which, in combination with code restructuring (like loop peeling), results in aligned memory accesses throughout the program.
• Use aligned data structures: Data structure alignment is the adjustment of any data object in relation to other objects. You can use the declaration __declspec(align(base,offset)), where 0 <= offset < base and base is a power of two, to allocate a data structure at offset from a certain base. However, if incoming access patterns are guaranteed to be aligned at a 16-byte boundary, you can avoid the overhead with the hint __assume_aligned(x, 16); in the function to convey this information to the compiler.

CAUTION. Use this hint with care. Incorrect usage of aligned data movements results in an exception for SSE.

• Use structure of arrays (SoA) instead of array of structures (AoS): An array is the most common type of data structure; it contains a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation, it can be a hindrance to vector processing. To make vectorization of the resulting code more effective, select appropriate data structures.

Using Aligned Data Structures

Data structure alignment is the adjustment of any data object in relation to other objects. Intel® Compilers may align individual variables to start at certain addresses to speed up memory access. Misaligned memory accesses can incur large performance losses on certain target processors that do not support them in hardware.

Alignment is a property of a memory address, expressed as the numeric address modulo powers of two. In addition to its address, a single datum also has a size. A datum is called 'naturally aligned' if its address is aligned to its size; otherwise it is called 'misaligned'. For example, an 8-byte floating-point datum is naturally aligned if the address used to identify it is aligned to 8.

A data structure is a way of storing data in a computer so that it can be used efficiently. Often, a carefully chosen data structure allows a more efficient algorithm to be used. A well-designed data structure allows a variety of critical operations to be performed using as few resources, both execution time and memory space, as possible.

  struct MyData{
    short Data1;
    short Data2;
    short Data3;
  };


In the example data structure above, if the type short is stored in two bytes of memory then each member of the data structure is aligned to a boundary of 2 bytes. Data1 would be at offset 0, Data2 at offset 2 and Data3 at offset 4. The size of this structure is 6 bytes. The type of each member of the structure usually has a required alignment, meaning that it is aligned on a pre-determined boundary, unless you request otherwise. In cases where the compiler has made sub-optimal alignment decisions, you can use the declaration __declspec(align(base,offset)), where 0 <= offset < base and base is a power of two, to allocate a data structure at offset from a certain base. Consider as an example that most of the execution time of an application is spent in a loop of the following form:

  double a[N], b[N];
  ...
  for (i = 0; i < N; i++){
    a[i+1] = b[i] * 3;
  }

If the first element of both arrays is aligned at a 16-byte boundary, then either an unaligned load of elements from b or an unaligned store of elements into a must be used after vectorization.

NOTE. In this case, peeling off an iteration will not help.

However, you can enforce the alignment shown below, which results in two aligned access patterns after vectorization (assuming an 8-byte size for doubles):

  __declspec(align(16, 8)) double a[N];
  __declspec(align(16, 0)) double b[N]; /* or simply "align(16)" */
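The natural-alignment layout described above for the MyData structure can be checked mechanically. The following is a minimal sketch, assuming a typical ABI where short occupies 2 bytes (the C standard does not guarantee this); the helper name is illustrative:

```c
#include <stddef.h>

/* Mirror of the MyData structure from the text. */
struct MyData {
    short Data1;
    short Data2;
    short Data3;
};

/* Returns nonzero if the layout matches the offsets 0/2/4 and total
 * size 6 described in the text (true on typical 2-byte-short ABIs). */
int mydata_layout_is_natural(void) {
    return offsetof(struct MyData, Data1) == 0 &&
           offsetof(struct MyData, Data2) == 2 &&
           offsetof(struct MyData, Data3) == 4 &&
           sizeof(struct MyData) == 6;
}
```

On mainstream IA-32 and Intel® 64 ABIs this check passes; an alignment directive such as __declspec(align(...)) would change the layout and the check would report it.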


If pointer variables are used, the compiler is usually not able to determine the alignment of access patterns at compile time. Consider the following simple fill() function:

  void fill(char *x) {
    int i;
    for (i = 0; i < 1024; i++){
      x[i] = 1;
    }
  }

Without more information, the compiler cannot make any assumption on the alignment of the memory region accessed by the above loop. At this point, the compiler may decide to vectorize this loop using unaligned data movement instructions or to generate the run-time alignment optimization shown here:

  peel = x & 0x0f;
  if (peel != 0) {
    peel = 16 - peel;
    /* runtime peeling loop */
    for (i = 0; i < peel; i++){
      x[i] = 1;
    }
  }
  /* aligned access */
  for (i = peel; i < 1024; i++){
    x[i] = 1;
  }

Run-time optimization provides a generally effective way to obtain aligned access patterns at the expense of a slight increase in code size and testing. However, if incoming access patterns are guaranteed to be aligned at a 16-byte boundary, you can avoid this overhead with the hint __assume_aligned(x, 16); in the function to convey this information to the compiler.
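The peel computation used in the run-time alignment code above can be modeled in plain C. In this sketch the address is modeled as an integer (in the generated code the value comes from the pointer x); the helper names are illustrative:

```c
/* Models the peel computation from the run-time alignment code above:
 * the number of leading 1-byte elements to execute scalar so that the
 * remaining accesses start on a 16-byte boundary. */
int peel_count(unsigned long addr) {
    unsigned long mis = addr & 0x0f;      /* misalignment modulo 16 */
    return mis ? (int)(16 - mis) : 0;     /* elements to peel off */
}

/* Every starting address plus its peel count lands on a 16-byte
 * boundary, and the peel count is always in [0, 15]. */
int peel_demo(void) {
    unsigned long addr;
    for (addr = 0; addr < 64; addr++) {
        int p = peel_count(addr);
        if (p < 0 || p > 15) return 0;
        if (((addr + (unsigned long)p) & 0x0f) != 0) return 0;
    }
    return 1;
}
```

This is exactly why the peeling loop runs at most 15 scalar iterations before the aligned vector loop takes over.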


CAUTION. Use this hint with care. Incorrect use of aligned data movements results in an exception for SSE.

Using Structure of Arrays versus Array of Structures

The most common and well-known data structure is the array, which contains a contiguous collection of data items that can be accessed by an ordinal index. This data can be organized as an array of structures (AoS) or as a structure of arrays (SoA). While the AoS arrangement works well for encapsulation, it works poorly for vector processing. You can select appropriate data structures to make vectorization of the resulting code more effective. To illustrate this point, compare the traditional array of structures (AoS) arrangement for storing the r, g, b components of a set of three-dimensional points with the alternative structure of arrays (SoA) arrangement for storing this set. Point Structure:

  // Point structure with data in AoS arrangement
  struct Point{
    float r;
    float g;
    float b;
  };

Points Structure:


  // Points structure with data in SoA arrangement
  struct Points{
    float* r;
    float* g;
    float* b;
  };

With the AoS arrangement, a loop that visits all components of an RGB point before moving to the next point exhibits a good locality of reference because all elements in the fetched cache lines are utilized. The disadvantage of the AoS arrangement is that each individual memory reference in such a loop exhibits a non-unit stride, which, in general, adversely affects vector performance. Furthermore, a loop that visits only one component of all points exhibits less satisfactory locality of reference because many of the elements in the fetched cache lines remain unused. In contrast, with the SoA arrangement the unit-stride memory references are more amenable to effective vectorization and still exhibit good locality of reference within each of the three data streams. Consequently, an application that uses the SoA arrangement may ultimately outperform an application based on the AoS arrangement when compiled with a vectorizing compiler, even if this performance difference is not directly apparent during the early implementation phase. Before you start vectorization, try out some simple rules:

• Make your data structures vector-friendly. • Make sure that inner loop indices correspond to the outermost (last) array index in your data (row-major order). • Use structure of arrays over array of structures.

To avoid dependencies between loop iterations that would eventually prevent vectorization, use three separate arrays for each component (SoA) instead of one array of three-component structures (AoS). For instance, when dealing with three-dimensional coordinates, use three separate arrays for each component. When you use the AoS arrangement, each iteration produces one result by computing XYZ, but it can at best use only 75% of the SSE unit because the fourth component is not used. Sometimes, the compiler may use only one component (25%). When you use the SoA arrangement, each iteration produces four results by computing XXXX, YYYY and ZZZZ, using 100% of the SSE unit. A drawback of the SoA arrangement is that your code will likely be three times as long. On the other hand, the compiler might not be able to vectorize AoS-arranged code at all. If your original data layout is in AoS format, you may even want to consider a conversion to SoA on the fly, before the critical loop. If it gets vectorized, it may be worth the effort! To summarize:

• Use the smallest data types that give the needed precision, to maximize potential SIMD width. (If only 16 bits are needed, using a short rather than an int can make the difference between eight-way and four-way SIMD parallelism.)
• Avoid mixing data types, to minimize type conversions.
• Avoid operations not supported in SIMD hardware.
• Use all the instruction sets available for your processor. Use the appropriate command line option for your processor type, or select the appropriate IDE option:

• Go to Project > Properties > C/C++ > Code Generation > Intel Processor-Specific Optimization, if your application runs only on Intel processors, or Project > Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set, if your application runs on compatible, non-Intel processors.

• Vectorizing compilers usually have some built-in efficiency heuristics to decide whether vectorization is likely to improve performance. The Intel® C/C++ Compiler disables vectorization of loops with many unaligned or non-unit stride data access patterns. However, if experimentation reveals that vectorization improves performance, you can override this behavior using the #pragma vector always hint before the loop; the compiler vectorizes any loop regardless of the outcome of the efficiency analysis (provided, of course, that vectorization is safe).
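The SoA recommendation above can be illustrated with a scalar sketch (the Point3 type and helper names are illustrative, not from the Intel documentation). Both layouts compute the same sums, but the SoA version reads three unit-stride streams that a vectorizer can handle directly, while each component stream in the AoS version has a stride of three floats:

```c
#define N 8

/* AoS: one array of three-component structures. */
struct Point3 { float r, g, b; };

/* Sum of all components in AoS form: each component stream is
 * accessed with a stride of 3 floats. */
float sum_aos(const struct Point3 *p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p[i].r + p[i].g + p[i].b;
    return s;
}

/* Same computation in SoA form: three unit-stride streams. */
float sum_soa(const float *r, const float *g, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += r[i] + g[i] + b[i];
    return s;
}

/* Fills both layouts with identical values and compares the results;
 * the operations execute in the same order, so the sums match exactly. */
int layouts_agree(void) {
    struct Point3 p[N];
    float r[N], g[N], b[N];
    for (int i = 0; i < N; i++) {
        p[i].r = r[i] = (float)i;
        p[i].g = g[i] = (float)(2 * i);
        p[i].b = b[i] = (float)(3 * i);
    }
    return sum_aos(p, N) == sum_soa(r, g, b, N);
}
```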

See Also
• Vectorization and Loops
• Loop Constructs

Using Automatic Vectorization

The automatic vectorizer (also called the auto-vectorizer) is a component of the Intel® compiler that automatically uses SIMD instructions from the Intel® Streaming SIMD Extensions (Intel® SSE, SSE2, SSE3, and SSE4), the Supplemental Streaming SIMD Extensions (SSSE3), and the Intel® Advanced Vector Extensions (Intel® AVX) instruction sets. The vectorizer detects operations in the program that can be done in parallel and converts the sequential operations to parallel; for example, one SIMD instruction can process 2, 4, 8, or up to 16 elements, depending on the data type.

So, what is vectorization? Vectorization is the process of converting an algorithm from a scalar implementation, which does an operation on one pair of operands at a time, to a vector implementation, in which a single instruction can refer to a vector (a series of adjacent values). SIMD instructions operate on multiple data elements in one instruction and make use of the 128-bit SIMD floating-point registers.

Automatic vectorization occurs when the Intel® Compiler generates packed SIMD instructions to unroll a loop. Because the packed instructions operate on more than one data element at a time, the loop executes more efficiently. This process is referred to as auto-vectorization only to emphasize that the compiler identifies and optimizes suitable loops on its own, without requiring any special action by you. However, it is useful to note that in some cases, certain keywords or directives may be applied in the code for auto-vectorization to occur. Automatic vectorization is supported on IA-32 and Intel® 64 architectures.

Using the -vec (VxWorks*) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in additional performance gains on Intel microprocessors compared to non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or -m or -x (VxWorks* OS).

Vectorization Speed-up

Where does the vectorization speedup come from? Consider the following sample code fragment, where a, b and c are integer arrays:

  for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];

If vectorization is not enabled (that is, you compile using the /O1 (-O1) or /Qvec- (-no-vec) option), for each iteration the compiler processes the code such that there is a lot of unused space in the SIMD registers, even though each of the registers could hold three additional integers. If vectorization is enabled (you compile using /O2 (-O2) or higher), the compiler may use the additional registers to perform four additions in a single instruction. The compiler looks for vectorization opportunities whenever you compile at default optimization (-O2) or higher.
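The four-additions-per-instruction behavior can be modeled in scalar C. This sketch mimics what the compiler effectively emits for the loop above: a four-wide main loop plus a scalar remainder loop (the function names are illustrative):

```c
#define MAX 16

/* Scalar version of the example loop (note the inclusive bound, as in
 * the text, so the arrays hold MAX+1 elements). */
void add_scalar(int *c, const int *a, const int *b) {
    for (int i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];
}

/* Model of the vectorized form: one "SIMD step" performs four
 * independent additions, followed by a scalar remainder loop. */
void add_4wide(int *c, const int *a, const int *b) {
    int i;
    for (i = 0; i + 3 <= MAX; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i <= MAX; i++)   /* remainder iterations */
        c[i] = a[i] + b[i];
}

/* The iterations are independent, so both versions must agree. */
int add_versions_agree(void) {
    int a[MAX + 1], b[MAX + 1], c1[MAX + 1], c2[MAX + 1];
    for (int i = 0; i <= MAX; i++) { a[i] = i; b[i] = 10 * i; }
    add_scalar(c1, a, b);
    add_4wide(c2, a, b);
    for (int i = 0; i <= MAX; i++)
        if (c1[i] != c2[i]) return 0;
    return 1;
}
```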

TIP. To allow comparisons between vectorized and not-vectorized code, disable vectorization using the -no-vec (VxWorks*) option; enable vectorization using the -O2 option.


To learn whether a loop was vectorized, enable generation of the vectorization report using the /Qopt-report:1 /Qopt-report-phase hpo (Windows) or -opt-report1 -opt-report-phase hpo (VxWorks) option. You will get a one-line message for every loop that is vectorized, as follows:

  > icl /Qvec-report1 MultArray.c
  MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.

The source line number (92 in the above example) refers to either the beginning or the end of the loop. So, how significant is the performance enhancement? To evaluate performance enhancement yourself, run example1:

1. Open an Intel® Compiler command line window. Source an environment script such as iccvars_intel64.sh in the compiler bin/intel64 directory, or iccvars_ia32.sh in the bin/ia32 directory, as appropriate.
2. Navigate to the \example1 directory. The small application multiplies a vector by a matrix using the following loop:

  for (j = 0;j < size2; j++) {
    b[i] += a[i][j] * x[j];
  }

3. Build and run the application, first without enabling auto-vectorization. Note the time taken for the application to run. On VxWorks* platforms, enter:

  icc -O2 -no-vec MultArray.c -o NoVectMult
  ./NoVectMult

4. Now build and run the application, this time enabling auto-vectorization. Note the time taken for the application to run. On VxWorks* platforms, enter:

  icc -O2 -vec-report1 MultArray.c -o VectMult
  ./VectMult

When you compare the timing of the two runs, you may see that the vectorized version runs faster. The time for the non-vectorized version is only slightly faster than would be obtained by compiling with the /O1 or the -O1 option.


Obstacles to Vectorization

The following do not always prevent vectorization, but frequently either prevent it or cause the compiler to decide that vectorization would not be worthwhile.

• Non-contiguous memory access: Four consecutive integers or floating-point values, or two consecutive doubles, may be loaded directly from memory in a single SSE instruction. But if the four integers are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient. The most common examples of non-contiguous memory access are loops with non-unit stride or with indirect addressing, as in the examples below. The compiler rarely vectorizes such loops, unless the amount of computational work is large compared to the overhead from non-contiguous memory access.

  // arrays accessed with stride 2
  for (int i=0; i<SIZE; i+=2) b[i] += a[i] * x[i];

  // inner loop accesses a with stride SIZE
  for (int j=0; j<SIZE; j++) {
    for (int i=0; i<SIZE; i++) b[i] += a[i][j] * x[j];
  }

  // indirect addressing of x using index array
  for (int i=0; i<SIZE; i+=2) b[i] += a[i] * x[index[i]];

The typical message from the vectorization report is: vectorization possible but seems inefficient, although indirect addressing may also result in the following report: Existence of vector dependence.
• Data dependencies: Vectorization entails changes in the order of operations within a loop, since each SIMD instruction operates on several data elements at once. Vectorization is only possible if this change of order does not change the results of the calculation.

• The simplest case is when data elements that are written (stored to) do not appear in any other iteration of the individual loop. In this case, all the iterations of the original loop are independent of each other, and can be executed in any order, without changing the result. The loop may be safely executed using any parallel method, including vectorization. All the examples considered so far fall into this category.


• When a variable is written in one iteration and read in a subsequent iteration, there is a "read-after-write" dependency, also known as a flow dependency, as in this example:

  A[0]=0;
  for (j=1; j<MAX; j++) A[j]=A[j-1]+1;

So the value of j gets propagated to all A[j]. This cannot safely be vectorized: if the first two iterations are executed simultaneously by a SIMD instruction, the value of A[1] is used by the second iteration before it has been calculated by the first iteration.
• When a variable is read in one iteration and written in a subsequent iteration, this is a write-after-read dependency (sometimes also known as an anti-dependency), as in the following example:

  for (j=1; j<MAX; j++) A[j-1]=A[j]+1;

This is not safe for general parallel execution, since the iteration with the write may execute before the iteration with the read. However, for vectorization, no iteration with a higher value of j can complete before an iteration with a lower value of j, and so vectorization is safe (that is, it gives the same result as non-vectorized code) in this case. The following example, however, may not be safe, since vectorization might cause some elements of A to be overwritten by the first SIMD instruction before being used for the second SIMD instruction:

  for (j=1; j<MAX; j++) A[j-1]=A[j+1]+1;

• Read-after-read situations aren't really dependencies and do not prevent vectorization or parallel execution. If a variable is not written, it does not matter how often it is read.
• Write-after-write, or 'output', dependencies, where the same variable is written to in more than one iteration, are in general unsafe for parallel execution, including vectorization.


• However, there is a very important exception that apparently contains all of the above types of dependency:

  sum=0;
  for (j=1; j<MAX; j++) sum = sum + A[j];

Although sum is both read and written in every iteration, the compiler recognizes such reduction idioms and is able to vectorize them safely. The loop in example 1 was another example of a reduction, with a loop-invariant array element in place of a scalar. These types of dependencies between loop iterations are sometimes known as loop-carried dependencies. The above examples are of proven dependencies. However, the compiler cannot safely vectorize a loop if there is even a potential dependency. Consider the following example:

  for (i = 0; i < size; i++) {
    c[i] = a[i] * b[i];
  }

In the above example, the compiler needs to determine whether, for some iteration i, c[i] might refer to the same memory location as a[i] or b[i] for a different iteration. (Such memory locations are sometimes said to be “aliased”). For example, if a[i] pointed to the same memory location as c[i-1], there would be a read-after-write dependency as in the earlier example. If the compiler cannot exclude this possibility, it will not vectorize the loop unless you provide the compiler with hints.
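The dependency categories above can be made concrete with a scalar model (a sketch with illustrative names; a two-wide "vectorization" is simulated by reading both lanes before writing either). The flow-dependent loop changes its result under such reordering, while the reduction can safely be computed with independent partial sums:

```c
#define MAX 8

/* Serial semantics of the flow-dependent loop: A[j] = A[j-1] + 1. */
void flow_serial(int *A) {
    A[0] = 0;
    for (int j = 1; j < MAX; j++)
        A[j] = A[j - 1] + 1;
}

/* A naive two-wide evaluation: both lanes read before either writes,
 * so lane 1 sees a stale A[j] - exactly why this loop must not be
 * vectorized. */
void flow_naive_simd2(int *A) {
    A[0] = 0;
    for (int j = 1; j < MAX; j += 2) {
        int r0 = A[j - 1] + 1;
        int r1 = A[j] + 1;          /* stale read */
        A[j] = r0;
        if (j + 1 < MAX) A[j + 1] = r1;
    }
}

int flow_results_differ(void) {
    int a[MAX] = {0}, b[MAX] = {0};
    flow_serial(a);
    flow_naive_simd2(b);
    for (int j = 0; j < MAX; j++)
        if (a[j] != b[j]) return 1;
    return 0;
}

/* The reduction idiom, by contrast, is safe with partial sums: four
 * independent accumulators (one per lane), combined after the loop. */
int sum_serial(const int *A) {
    int sum = 0;
    for (int j = 1; j < MAX; j++) sum += A[j];
    return sum;
}

int sum_partial4(const int *A) {
    int p[4] = {0, 0, 0, 0}, j;
    for (j = 1; j + 3 < MAX; j += 4) {
        p[0] += A[j];     p[1] += A[j + 1];
        p[2] += A[j + 2]; p[3] += A[j + 3];
    }
    for (; j < MAX; j++) p[0] += A[j];   /* remainder */
    return p[0] + p[1] + p[2] + p[3];
}

int reduction_matches(void) {
    int A[MAX];
    for (int j = 0; j < MAX; j++) A[j] = j * j;
    return sum_serial(A) == sum_partial4(A);
}
```

Integer partial sums are exact; for floating-point reductions the partial-sum order differs from the serial order, which is why the compiler may require relaxed floating-point semantics to vectorize them.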

Helping the Intel® Compiler to Vectorize

Sometimes the compiler has insufficient information to decide to vectorize a loop. There are several ways to provide additional information to the compiler:

• Pragmas:


• #pragma ivdep: may be used to tell the compiler that it may safely ignore any potential data dependencies. (The compiler will not ignore proven dependencies.) Use of this pragma when there are dependencies may lead to incorrect results. There are cases where the compiler cannot tell by static analysis that it is safe to vectorize, but you, as the developer, know. Consider the following loop:

  void copy(char *cp_a, char *cp_b, int n) {
    for (int i = 0; i < n; i++) {
      cp_a[i] = cp_b[i];
    }
  }

Without more information, a vectorizing compiler must conservatively assume that the memory regions accessed by the pointer variables cp_a and cp_b may (partially) overlap, which gives rise to potential data dependencies that prohibit straightforward conversion of this loop into SIMD instructions. At this point, the compiler may decide to keep the loop serial or, as done by the Intel® C/C++ compiler, generate a run-time test for overlap, where the loop in the true-branch can be converted into SIMD instructions:

  if (cp_a + n < cp_b || cp_b + n < cp_a)
    /* vector loop */
    for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];
  else
    /* serial loop */
    for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];

Run-time data-dependency testing provides a generally effective way to exploit implicit parallelism in C or C++ code at the expense of a slight increase in code size and testing overhead. If the function copy is only used in specific ways, however, you can assist the vectorizing compiler as follows:

• If the function is mainly used for small values of n or for overlapping memory regions, you can simply prevent vectorization and, hence, the corresponding run-time overhead by inserting a #pragma novector hint before the loop.


• Conversely, if the loop is guaranteed to operate on non-overlapping memory regions, you can provide this information to the compiler by means of a #pragma ivdep hint before the loop, which informs the compiler that conservatively assumed data dependencies that prevent vectorization can be ignored. This results in vectorization of the loop without run-time data-dependency testing.

  void copy(char *cp_a, char *cp_b, int n) {
  #pragma ivdep
    for (int i = 0; i < n; i++) {
      cp_a[i] = cp_b[i];
    }
  }

NOTE. You can also use the restrict keyword.

• #pragma loop count (n): may be used to advise the compiler of the typical trip count of the loop. This may help the compiler decide whether vectorization is worthwhile, or whether it should generate alternative code paths for the loop.
• #pragma vector always: asks the compiler to vectorize the loop if it is safe to do so, whether or not the compiler thinks that will improve performance.
• #pragma vector align: asserts that data within the following loop is aligned (to a 16-byte boundary, for SSE instruction sets).
• #pragma novector: asks the compiler not to vectorize a particular loop.
• #pragma vector nontemporal: gives a hint to the compiler that data will not be reused, and therefore to use streaming stores that bypass cache.

• Keywords: The restrict keyword may be used to assert that the memory referenced by a pointer is not aliased, that is, it is not accessed in any other way. The keyword requires the use of either the /Qrestrict (-restrict) or /Qc99 (-c99) compiler option. The example under #pragma ivdep above can also be handled using the restrict keyword. You may use the restrict keyword in the declarations of cp_a and cp_b, as shown below, to inform the compiler that each pointer variable provides exclusive access to a certain memory region. The restrict qualifier in the argument list lets the compiler know that there are no other aliases to the memory to which the pointers point: the pointer for which it is used provides the only means of accessing the memory in question in the scope in which the pointers live. If the code gets vectorized without the restrict keyword, the compiler inserts run-time checks for aliasing; with the restrict keyword, those checks are not needed. You may have to use an extra compiler option, such as the /Qrestrict (Windows*) or -restrict (VxWorks*) option for the Intel C/C++ compiler.

  void copy(char * __restrict cp_a, char * __restrict cp_b, int n) {
    for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];
  }

This method is convenient when the exclusive-access property holds for pointer variables that are used in a large portion of code with many loops, because it avoids the need to annotate each of the vectorizable loops individually. Note, however, that both the loop-specific #pragma ivdep hint and the pointer-variable-specific restrict hint must be used with care, because incorrect usage may change the semantics intended in the original program. Another example is the following loop, which may also not get vectorized because of a potential aliasing problem between the pointers a, b and c:

  // potential unsupported loop structure
  void add(float *a, float *b, float *c) {
    for (int i=0; i<SIZE; i++) {
      c[i] += a[i] + b[i];
    }
  }

If the restrict keyword is added to the parameters, the compiler will trust that you will not access the memory in question through any other pointer, and vectorize the code properly:

  // let the compiler know the pointers are safe with restrict
  void add(float * __restrict a, float * __restrict b, float * __restrict c) {
    for (int i=0; i<SIZE; i++) {
      c[i] += a[i] + b[i];
    }
  }


The downside of using restrict is that not all compilers support this keyword, so your source code may lose portability. If you care about source code portability, consider using the compiler option /Qansi-alias (Windows*) or -ansi-alias (VxWorks*) instead. However, compiler options work globally, so you have to make sure they do not cause harm to other code fragments.
• Options/switches: You can use options to enable different levels of optimizations to achieve automatic vectorization:

• Interprocedural optimization (IPO): Enable IPO using the /Qip (-ip) option within a single source file, or using /Qipo (-ipo) across source files. You provide the compiler with additional information about a loop, such as trip counts, alignment or data dependencies. Enabling IPO may also allow inlining of function calls.
• Disambiguation of pointers and arrays: Use the options /Oa (Windows*) or -fno-alias (VxWorks*) to assert there is no aliasing of memory references, that is, the same memory location is not accessed via different arrays or pointers. Other options make more limited assertions; for example, /Qalias-args- (-fargument-noalias) asserts that function arguments cannot alias each other (that is, they cannot overlap). The /Qansi-alias (-ansi-alias) option allows the compiler to assume strict adherence to the aliasing rules in the ISO C standard. Use these options responsibly; if you use them when memory is aliased, they may lead to incorrect results. Use the /Qansi-alias-check (-ansi-alias-check) option to enable the ansi-alias checker, which checks the source code for potential violations of ANSI aliasing rules and disables unsafe optimizations related to the code for those statements that are identified as potential violations.
• High-level loop optimizations (HLO): Enable HLO with the /O3 (-O3) option. These additional loop optimizations make it easier for the compiler to vectorize the transformed loops. The HLO report, obtained using the /Qopt-report-phase:hlo (-opt-report-phase hlo) option or the corresponding IDE selection, tells you whether some of these additional transformations occurred.
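The run-time overlap test shown earlier for copy() can be sketched as a stand-alone helper (conservative, like the compiler-generated check; the names are illustrative):

```c
/* Returns 1 when the two n-byte regions are provably disjoint, using
 * the same conservative comparison as the generated run-time test:
 * cp_a + n < cp_b || cp_b + n < cp_a. The pointers are assumed to
 * point into the same underlying object, so the comparison is
 * well-defined C. */
int regions_disjoint(const char *cp_a, const char *cp_b, int n) {
    return (cp_a + n < cp_b) || (cp_b + n < cp_a);
}

/* Exercise the check on slices of one buffer. */
int overlap_demo(void) {
    static char buf[64];
    return regions_disjoint(buf, buf + 40, 8) &&   /* disjoint slices */
           !regions_disjoint(buf, buf + 4, 8);     /* overlapping slices */
}
```

When this test is true, the compiler dispatches to the SIMD version of the loop; otherwise it falls back to the serial version.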

Vectorization and Loops

This topic discusses loop parallelization in the context of vectorization. Using the -parallel (VxWorks*) option enables parallelization for both Intel® microprocessors and non-Intel microprocessors. The resulting executable may get additional performance gains on Intel microprocessors compared to non-Intel microprocessors. The parallelization can also be affected by certain options, such as /arch or -m or -x (VxWorks* OS).


Using the -vec (VxWorks*) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in additional performance gains on Intel microprocessors compared to non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or -m or -x (VxWorks* OS).

Types of Vectorized Loops

For integer loops, the 128-bit Intel® Streaming SIMD Extensions (Intel® SSE) and the Intel® Advanced Vector Extensions (Intel® AVX) provide SIMD instructions for most arithmetic and logical operators on 32-bit, 16-bit, and 8-bit integer data types, with limited support for the 64-bit integer data type. Vectorization may proceed if the final precision of integer wrap-around arithmetic is preserved. A 32-bit shift-right operator, for instance, is not vectorized in 16-bit mode if the final stored value is a 16-bit integer. Also, note that because the Intel® SSE and the Intel® AVX instruction sets are not fully orthogonal (shifts on byte operands, for instance, are not supported), not all integer operations can actually be vectorized.

For loops that operate on 32-bit single-precision and 64-bit double-precision floating-point numbers, Intel® SSE provides SIMD instructions for the following arithmetic operators: addition (+), subtraction (-), multiplication (*), and division (/). Additionally, the Streaming SIMD Extensions provide SIMD instructions for the binary MIN and MAX and unary SQRT operators. SIMD versions of several other mathematical operators (like the trigonometric functions SIN, COS, and TAN) are supported in software in a vector mathematical run-time library provided with the Intel® compiler, which the compiler uses automatically.

To be vectorizable, loops must be:

• Countable: The loop trip count must be known at entry to the loop at runtime, though it need not be known at compile time (that is, the trip count can be a variable but the variable must remain constant for the duration of the loop). This implies that exit from the loop must not be data-dependent.


• Single entry and single exit, as is implied by stating that the loop must be countable. Consider the following example of a loop that is not vectorizable, due to a second, data-dependent exit:

  void no_vec(float a[], float b[], float c[]){
    int i = 0;
    while (i < 100) {
      a[i] = b[i] * c[i];
      // this is a data-dependent exit condition:
      if (a[i] < 0.0)
        break;
      ++i;
    }
  }

  > icc -c -O2 -vec-report2 two_exits.cpp
  two_exits.cpp(4) (col. 9): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

• Contain straight-line code: SIMD instructions perform the same operation on data elements from multiple iterations of the original loop; therefore, it is not possible for different iterations to have different control flow, that is, they must not branch. It follows that switch statements are not allowed. However, if statements are allowed if they can be implemented as masked assignments, which is usually the case. The calculation is performed for all data elements, but the result is stored only for those elements for which the mask evaluates to true. To illustrate this point, consider the following example that may be vectorized:

  #include <math.h>
  void quad(int length, float *a, float *b, float *c, float *restrict x1, float *restrict x2){
    for (int i=0; i<length; i++) {
      float s = b[i]*b[i] - 4*a[i]*c[i];
      if ( s >= 0 ) {
        s = sqrt(s);
        x2[i] = (-b[i]+s)/(2.*a[i]);
        x1[i] = (-b[i]-s)/(2.*a[i]);
      } else {
        x2[i] = 0.;
        x1[i] = 0.;
      }
    }
  }

  > icc -c -restrict -vec-report2 quad.cpp
  quad5.cpp(5) (col. 3): remark: LOOP WAS VECTORIZED.

• Innermost loop of a nest: The only exception is if an original outer loop is transformed into an inner loop as a result of some other prior optimization phase, such as unrolling, loop collapsing or interchange.
• Without function calls: Even a print statement is sufficient to prevent a loop from being vectorized. The vectorization report message is typically: nonstandard loop is not a vectorization candidate. The two major exceptions are for intrinsic math functions and for functions that may be inlined.

Intrinsic math functions such as sin(), log(), fmax(), and so on, are allowed, because the compiler runtime library contains vectorized versions of these functions. See the table below for a list of these functions; most exist in both float and double versions.


  acos     ceil     fabs     round
  acosh    cos      floor    sin
  asin     cosh     fmax     sinh
  asinh    erf      fmin     sqrt
  atan     erfc     log      tan
  atan2    erfinv   log10    tanh
  atanh    exp      log2     trunc
  cbrt     exp2     pow

The loop in the following example may be vectorized because sqrtf() is vectorizable and func() gets inlined. Inlining is enabled at default optimization for functions in the same source file. An inlining report may be obtained by setting the /Qopt-report-phase ipo_inl (Windows*) or -opt-report-phase ipo_inl (VxWorks*) option.

  float func(float x, float y, float xp, float yp) {
    float denom;
    denom = (x-xp)*(x-xp) + (y-yp)*(y-yp);
    denom = 1./sqrtf(denom);
    return denom;
  }

  float trap_int(float y, float x0, float xn, int nx, float xp, float yp) {
    float x, h, sumx;
    int i;
    h = (xn-x0) / nx;
    sumx = 0.5*( func(x0,y,xp,yp) + func(xn,y,xp,yp) );
    for (i=1;i<nx;i++) {
      x = x0 + i*h;
      sumx = sumx + func(x,y,xp,yp);
    }
    sumx = sumx * h;
    return sumx;
  }

  // Command line
  > icc -c -vec-report2 trap_integ.c
  trap_int.c(16) (col. 3): remark: LOOP WAS VECTORIZED.

Statements in the Loop Body

The vectorizable operations are different for floating-point and integer data.

Integer Array Operations

The statements within the loop body may contain char, unsigned char, short, unsigned short, int, and unsigned int data. Calls to functions such as sqrt and fabs are also supported. Arithmetic operations are limited to addition, subtraction, bitwise AND, OR, and XOR operators, division (via run-time library call), multiplication, min, and max. You can mix data types, but doing so may lower efficiency. Some example operators where you can mix data types are multiplication, shift, or unary operators.

Other Operations

No statements other than the preceding floating-point and integer operations are allowed. In particular, note that the special __m64, __m128, and __m256 data types are not vectorizable. The loop body cannot contain any function calls. Use of the Streaming SIMD Extensions intrinsics (for example, _mm_add_ps) or the Intel® AVX intrinsics (for example, _mm256_add_ps) is not allowed.

Data Dependency

Data dependency relations represent the required ordering constraints on the operations in serial loops. Because vectorization rearranges the order in which operations are executed, any auto-vectorizer must have at its disposal some form of data dependency analysis.


An example where data dependencies prohibit vectorization is shown below. In this example, the value of each element of an array is dependent on the value of its neighbor that was computed in the previous iteration.

Example 1: Data-dependent Loop

int i;
void dep(float *data)
{
  for (i=1; i<100; i++) {
    data[i] = data[i-1]*0.25 + data[i]*0.5 + data[i+1]*0.25;
  }
}

The loop in the above example is not vectorizable because the WRITE to the current element data[i] is dependent on the use of the preceding element data[i-1], which has already been written to and changed in the previous iteration. To see this, look at the access patterns of the array for the first two iterations as shown below.

Example 1: Data-dependency Vectorization Patterns

i=1: READ  data[0]
     READ  data[1]
     READ  data[2]
     WRITE data[1]
i=2: READ  data[1]
     READ  data[2]
     READ  data[3]
     WRITE data[2]


In the normal sequential version of this loop, the value of data[1] read during the second iteration was written during the first iteration. For vectorization, it must be possible to do the iterations in parallel without changing the semantics of the original loop.

Example 2: Data-independent Loop

for (i=0; i<100; i++) a[i] = b[i];

// which has the following access pattern:
read  b[0]
write a[0]
read  b[1]
write a[1]

Data Dependency Analysis Data dependency analysis involves finding the conditions under which two memory accesses may overlap. Given two references in a program, the conditions are defined by:

• whether the referenced variables may be aliases for the same (or overlapping) regions in memory
• for array references, the relationship between the subscripts

The data dependency analyzer for array references is organized as a series of tests, which progressively increase in power as well as in time and space costs. First, a number of simple tests are performed in a dimension-by-dimension manner, since independence in any dimension will exclude any dependency relationship. Multidimensional array references that may cross their declared dimension boundaries can be converted to their linearized form before the tests are applied. Some of the simple tests that can be used are the fast greatest common divisor (GCD) test and the extended bounds test. The GCD test proves independence if the GCD of the coefficients of loop indices cannot evenly divide the constant term. The extended bounds test checks for potential overlap of the extreme values in subscript expressions. If all simple tests fail to prove independence, the compiler will eventually resort to a powerful hierarchical dependency solver that uses Fourier-Motzkin elimination to solve the data dependency problem in all dimensions.


Optimization Notice

Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options". Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.


Loop Constructs

Loops can be formed with the usual for and while constructs. The loops must have a single entry and a single exit to be vectorized. The following examples illustrate loop constructs that can and cannot be vectorized.

Example: Vectorizable structure

void vec(float a[], float b[], float c[])
{
  int i = 0;
  while (i < 100) {
    // The if branch is inside the body of the loop.
    a[i] = b[i] * c[i];
    if (a[i] < 0.0)
      a[i] = 0.0;
    i++;
  }
}


The following example shows a loop that cannot be vectorized because of the inherent potential for an early exit from the loop.

Example: Non-vectorizable structure

void no_vec(float a[], float b[], float c[])
{
  int i = 0;
  while (i < 100) {
    if (a[i] < 50)
      // The next statement is a second exit
      // that allows an early exit from the loop.
      break;
    ++i;
  }
}

Loop Exit Conditions

Loop exit conditions determine the number of iterations a loop executes. For example, fixed array indexes for the loops determine the iterations. The loop iterations must be countable; in other words, the number of iterations must be expressible as one of the following:

• A constant
• A loop-invariant term
• A linear function of outermost loop indices


When a loop's exit depends on computation, the loop is not countable. The examples below show loop constructs that are countable and non-countable.

Example: Countable Loop

void cnt1(float a[], float b[], float c[], int n, int lb)
{
  // Exit condition specified by "n-lb+1".
  int cnt=n, i=0;
  while (cnt >= lb) {
    // lb is not affected within the loop.
    a[i] = b[i] * c[i];
    cnt--;
    i++;
  }
}

The following example demonstrates a different countable loop construct.

Example: Countable Loop

void cnt2(float a[], float b[], float c[], int m, int n)
{
  // Number of iterations is "(n-m+2)/2".
  int i=0, l;
  for (l=m; l<n; l+=2) {
    a[i] = b[i] * c[i];
    i++;
  }
}


The following example demonstrates a loop construct that is non-countable because the trip count depends on a loop-variant value.

Example: Non-countable Loop

void no_cnt(float a[], float b[], float c[])
{
  int i=0;
  // Iterations dependent on a[i].
  while (a[i] > 0.0) {
    a[i] = b[i] * c[i];
    i++;
  }
}

Strip-mining and Cleanup

Strip-mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encodings of loops, as well as a means of improving memory performance. By fragmenting a large loop into smaller segments or strips, this technique transforms the loop structure in two ways:

• It increases the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm.
• It reduces the number of iterations of the loop by a factor of the length of each vector, that is, the number of operations performed per SIMD operation. In the case of Streaming SIMD Extensions, this vector or strip length is 4: four single-precision floating-point data items are processed per single Streaming SIMD Extensions SIMD operation.

First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine.


The compiler automatically strip-mines your loop and generates a cleanup loop. For example, assume the compiler attempts to strip-mine the following loop:

Example 1: Before Vectorization

i=0;
while (i < n) {
  a[i] = b[i] + c[i];
  ++i;
}

The compiler might handle the strip mining and loop cleaning by restructuring the loop in the following manner:

Example 2: After Vectorization

// The vectorizer generates the following two loops


i=0;
while (i < (n - n%4)) {
  // Vector strip-mined loop
  // Subscript [i:i+3] denotes SIMD execution
  a[i:i+3] = b[i:i+3] + c[i:i+3];
  i = i + 4;
}
while (i < n) {
  // Scalar cleanup loop
  a[i] = b[i] + c[i];
  ++i;
}

Loop Blocking

It is possible to treat loop blocking as strip-mining in two or more dimensions. Loop blocking is a useful technique for memory performance optimization. The main purpose of loop blocking is to eliminate as many cache misses as possible. This technique transforms the memory domain into smaller chunks rather than sequentially traversing through the entire memory domain. Each chunk should be small enough to fit all the data for a given computation into the cache, thereby maximizing data reuse.


Consider the following example. Loop blocking allows arrays A and B to be blocked into smaller rectangular chunks so that the total combined size of the two blocked chunks (A and B) is smaller than the cache size, which can improve data reuse.

Example 3: Original loop

#include <stdio.h>
#include <time.h>
#define MAX 7000
void add(int a[][MAX], int b[][MAX]);
int main()
{
  int i, j;
  int A[MAX][MAX];
  int B[MAX][MAX];
  time_t start, elaspe;
  int sec;
  //Initialize array
  for (i=0; i<MAX; i++) {
    for (j=0; j<MAX; j++) {
      A[i][j] = j;
      B[i][j] = j;
    }
  }
  start = time(NULL);
  add(A, B);            //Call add function
  elaspe = time(NULL);
  sec = elaspe - start;
  printf("Time %d", sec); //List time taken to complete add function
}
void add(int a[][MAX], int b[][MAX])
{
  int i, j;
  for (i=0; i<MAX; i++) {
    for (j=0; j<MAX; j++) {
      a[i][j] = a[i][j] + b[j][i]; //Transposed access to b
    }
  }
}


The following example illustrates loop blocking the add function (from the previous example). In order to benefit from this optimization you might have to increase the cache size.

Example 4: Transformed Loop after blocking

#include <stdio.h>
#include <time.h>
#define MAX 7000
#define BS 8  //Block size is selected as the loop-blocking factor.
void add(int a[][MAX], int b[][MAX]);
int main()
{
  int i, j;
  int A[MAX][MAX];
  int B[MAX][MAX];
  time_t start, elaspe;
  int sec;
  //Initialize array
  for (i=0; i<MAX; i++) {
    for (j=0; j<MAX; j++) {
      A[i][j] = j;
      B[i][j] = j;
    }
  }
  start = time(NULL);
  add(A, B);            //Call add function
  elaspe = time(NULL);
  sec = elaspe - start;
  printf("Time %d", sec); //Display time taken to complete blocked add function
}
void add(int a[][MAX], int b[][MAX])
{
  int i, j, ii, jj;
  for (i=0; i<MAX; i+=BS) {
    for (j=0; j<MAX; j+=BS) {
      for (ii=i; ii<i+BS; ii++) {    //outer loop blocking
        for (jj=j; jj<j+BS; jj++) {  //inner loop blocking
          a[ii][jj] = a[ii][jj] + b[jj][ii];
        }
      }
    }
  }
}


Loop Interchange and Subscripts: Matrix Multiply

Loop interchange is often used for improving memory access patterns. Matrix multiplication is commonly written as shown in the following example:

Example: Typical Matrix Multiplication

void matmul_slow(float *a[], float *b[], float *c[])
{
  int N = 100;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

The use of b[k][j] is not a stride-1 reference and therefore will not be vectorized efficiently. If the loops are interchanged, however, all the references become stride-1, as shown in the following example.

Example: Matrix Multiplication with Stride-1

void matmul_fast(float *a[], float *b[], float *c[])
{
  int N = 100;
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      for (int j = 0; j < N; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

Interchanging is not always possible because of dependencies, which can lead to different results.


User-mandated or SIMD Vectorization

User-mandated or SIMD vectorization supplements automatic vectorization. The following figure illustrates this relationship. User-mandated vectorization is implemented as a single-instruction-multiple-data (SIMD) feature and is referred to as SIMD vectorization. The SIMD vectorization feature is available for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in greater performance gains on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch (Windows*) or -m or -x (VxWorks* OS).

The following figure illustrates how SIMD vectorization is positioned among various approaches that you can take to generate vector code that exploits vector hardware capabilities. The programs written with SIMD vectorization are very similar to those written using auto-vectorization hints. You can use SIMD vectorization to minimize the amount of code changes that you may have to go through in order to obtain vectorized code.


SIMD vectorization uses the #pragma simd pragma to effect loop vectorization. You must add the pragma to a loop and recompile for the loop to get vectorized (the options -Qsimd [on Windows* OS] or -simd [on Linux* OS and VxWorks* OS] are enabled by default).


Consider an example in C++ where the function add_floats() uses too many unknown pointers for the compiler's automatic runtime independence check optimization to kick in. You can give a data dependence assertion using the auto-vectorization hint via #pragma ivdep and let the compiler decide whether to apply the auto-vectorization optimization to the loop. Alternatively, you can enforce vectorization of this loop by using #pragma simd.

Example: without #pragma simd

[D:/simd] cat example1.c
void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
  int i;
  for (i=0; i<n; i++) {
    a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
  }
}

Command line entry and output

[D:/simd] icl example1.c -nologo -Qvec-report2


example1.c
D:\simd\example1.c(3): (col. 3) remark: loop was not vectorized: existence of vector dependence.

Example: with #pragma simd

[D:/simd] cat example1.c
void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
  int i;
#pragma simd
  for (i=0; i<n; i++) {
    a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
  }
}

Command line entry and output

[D:/simd] icl example1.c -nologo -Qvec-report2
example1.c
D:\simd\example1.c(4): (col. 3) remark: SIMD LOOP WAS VECTORIZED.

The one big difference between using the SIMD pragma and the auto-vectorization hints is that with the SIMD pragma, the compiler generates a warning when it is unable to vectorize the loop. With auto-vectorization hints, actual vectorization remains at the discretion of the compiler, even when you use the #pragma vector always hint. The SIMD pragma has five optional clauses to guide the compiler on how vectorization must proceed. Use these clauses appropriately so that the compiler obtains enough information to generate correct vector code. For more information on the clauses, see the #pragma simd description.

Additional Semantics

Note the following points when using the #pragma simd pragma.


• A variable may belong to at most one of private, linear, or reduction (or none of them).
• Within the vector loop, an expression is evaluated as a vector value if it is private, linear, reduction, or has a sub-expression that is evaluated to a vector value. Otherwise, it is evaluated as a scalar value (that is, the same value is broadcast to all iterations). A scalar value is not necessarily loop-invariant, although that is the most common usage pattern of scalar values.
• A vector value may not be assigned to a scalar L-value; doing so is an error.
• A scalar L-value may not be assigned under a vector condition; doing so is an error.
• The switch statement is not supported.

NOTE. You may find it difficult to describe vector semantics using SIMD pragma for some auto-vectorizable loops. One example is MIN/MAX reduction in C since the language does not have MIN/MAX operators.

Using vector Declaration

Consider the following C++ example code with a loop containing the math function, sinf(). All code examples in this section are applicable for the Windows* operating system only.

Example: Loop with math function is auto-vectorized

[D:/simd] cat example2.c
void vsin(float *restrict a, float *restrict b, int n)
{
  int i;
  for (i=0; i<n; i++) {
    a[i] = sinf(b[i]);
  }
}

Command-line entry and output

[D:/simd] icl example2.c -nologo -O3 -Qvec-report2 -Qrestrict
example2.c
D:\simd\example2.c(3): (col. 3) remark: LOOP WAS VECTORIZED.


When you compile the above code, the loop with the sinf() function is auto-vectorized using the appropriate SVML library function provided by the Intel compiler. The auto-vectorizer identifies the entry points, matches the scalar math library function to the SVML function, and invokes it. However, if within this loop you have a call to your own function foo() that has the same prototype as sinf(), the auto-vectorizer fails to vectorize the loop, because it does not know what foo() does unless it is inlined at this call site.

Example: Loop with user-defined function is NOT auto-vectorized

[D:/simd] cat example2.c
float foo(float);
void vfoo(float *restrict a, float *restrict b, int n)
{
  int i;
  for (i=0; i<n; i++) {
    a[i] = foo(b[i]);
  }
}

Command-line entry and output

[D:/simd] icl example2.c -nologo -O3 -Qvec-report2 -Qrestrict
example2.c
D:\simd\example2.c(4): (col. 3) remark: loop was not vectorized: existence of vector dependence.


In such cases, you can use the __attribute__((vector)) declaration to vectorize the loop. All you need to do is add the vector declaration to the function declaration and recompile both the caller and callee code. The loop and the function are then vectorized.

Example: Loop with user-defined function with vector declaration is vectorized

[D:/simd] cat example3.c
// foo() and vfoo() do not have to be in the same compilation unit as long
// as both see the same declspec.
__declspec(vector) float foo(float);

void vfoo(float *restrict a, float *restrict b, int n)
{
  int i;
  for (i=0; i<n; i++) {
    a[i] = foo(b[i]);
  }
}

Command-line entry and output

[D:/simd] icl example3.c -nologo -O3 -Qvec-report3 -Qrestrict

example3.c
D:\simd\example3.c(9): (col. 3) remark: LOOP WAS VECTORIZED
D:\simd\example3.c(14): (col. 3) remark: FUNCTION WAS VECTORIZED

Object-level compatibility considerations: The functions vfoo() and foo() do not have to reside in the same compilation unit. However, if vfoo() is compiled with the vector declaration, foo() also has to be compiled with the same declaration, because the vectorization of vfoo() creates a function call to some_mangled_name_for_vectorized_foo(), and the compilation of foo() has to provide the new function, some_mangled_name_for_vectorized_foo(), in addition to the original foo() itself.

Passing multi-dimensional arrays: You can pass a multi-dimensional array to a vector declared function. When the corresponding parameter is classified as scalar, a single array is passed for that parameter. If it is classified as private, multiple arrays are passed, just as if the original scalar function were called multiple times for consecutive iterations. For a parameter that is classified as linear, you must take an additional step: the array parameter A must have defined semantics for A+step. For example, for a vector length of 4, the callee sees [A, A+step, A+2*step, A+3*step]. If the parameter does not have defined semantics for +step, classifying it as linear is a syntax error.

Example: Passing a multi-dimensional array to a vector declared function

__declspec(vector(linear(a:1))) float foo(float *a)   { return *a; }
__declspec(vector(linear(a:1))) float foo1(float **a) { return **a; }

void vfoo(float *X[])
{
  int i, j;
  float A[100][100];
  for (i=0; i<100; i++) {
    for (j=0; j<100; j++) {
      foo(A+j);
      foo1(X+j);
    }
  }
}


Calling vector declaration under a condition: The following example code illustrates the use of the vector declaration with the mask clause. The mask clause is used when the vector declaration is called under a condition. In the code below, the function fib() is called in the main() function and also recursively called in the function fib() under an if condition. The compiler creates a masked vector version and a non-masked vector version for the function fib() while retaining the original scalar version of the fib() function.

Example: Using vector declaration with mask clause

#include <stdio.h>
#include <stdlib.h>
#define N 45
int a[N], b[N], c[N];

// The "mask" clause needs to be used when the vector function is
// called under a condition. Function fib() is called in the main() function and
// also recursively called in the fib() function under an if-condition.
// The compiler creates masked and non-masked vector versions for function fib()
// while keeping the original scalar version of fib().
__declspec(vector(mask)) int fib(int n)
{
  if (n <= 2) return n;
  else {
    return fib(n-1) + fib(n-2);
  }
}

int main(int argc, char *argv[])
{
  int i;
  for (i=0; i < N; i++) b[i] = i;
  #pragma simd
  for (i=0; i < N; i++) {
    a[i] = fib(b[i]); // after vectorization, the non-masked vector fib() is called
  }
  printf("Done a[%d] = %d\n", N-1, a[N-1]);
}

Command-line entry and output

[D:/simd] icl example3.c -nologo -O3 -Qvec-report3 -Qrestrict
example3.c
D:\SIMD\vfib_example.c(20) (col. 3): remark: LOOP WAS VECTORIZED.
D:\SIMD\vfib_example.c(23) (col. 3): remark: SIMD LOOP WAS VECTORIZED.
D:\SIMD\vfib_example.c(9) (col. 1): remark: FUNCTION WAS VECTORIZED.
D:\SIMD\vfib_example.c(9) (col. 1): remark: FUNCTION WAS VECTORIZED.


Restrictions for Using vector Declaration

Vectorization depends on two major factors: the hardware and the style of the source code. The current implementation of the vector declaration imposes certain restrictions. The following features are not allowed:

• Thread creation and joining through explicit threading API calls
• Using setjmp, longjmp, EH, SEH
• Inline ASM code and VML
• Calling non-vector functions (note that all SVML functions are considered vector functions)
• Locks, barriers, atomic constructs, critical sections (a special case of the previous item)
• goto statements
• Intrinsics (for example, SVML intrinsics)
• Function calls through function pointers and virtual functions
• Any loop/array notation constructs
• Struct access
• The switch statement

Formal parameters must be of the following data types:

• (un)signed 8-, 16-, 32-, or 64-bit integer
• 32- or 64-bit floating point
• 64- or 128-bit complex
• a pointer (a C++ reference is considered a pointer data type)

See Also • simd pragma • Function Annotations and the SIMD Directive for Vectorization

Function Annotations and the SIMD Directive for Vectorization

This topic presents specific C++ language features that help vectorize code. The SIMD vectorization feature is available for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in greater performance gains on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch (Windows*) or -m or -x (VxWorks* OS).


The __declspec(align(n)) declaration enables you to overcome hardware alignment constraints. The restrict qualifier and the auto-vectorization hints address the stylistic issues due to lexical scope, data dependency, and ambiguity resolution. The SIMD feature's pragma allows you to enforce vectorization of innermost loops.

The __declspec(vector) and __declspec(vector[clauses]) declarations can be used to vectorize user-defined functions and loops. For SIMD usage, the vector function is called from a loop that is being vectorized, and the function must be implemented in vector operations as part of the loop.

The C/C++ extensions for array notations map operations can be defined to provide general data-parallel semantics, where you do not express the implementation strategy. Using array notations, you can write the same operation regardless of the size of the problem, and let the implementation use the right construct, combining SIMD, loops, and tasking to implement the operation. With these semantics, you can choose more elaborate programming and express a single-dimensional operation at two levels, using both task constructs and array operations to force a preferred parallel and vector execution.

The usage model of the vector function is that the code generated for the function takes a small section (vector length) of the array, by value, and exploits SIMD parallelism, whereas the implementation of task parallelism is done at the call site.

The following table summarizes the language features that help vectorize code.

Feature Description

__declspec(align(n)) Directs the compiler to align the variable to an n-byte boundary. Address of the variable is address mod n=0.

__declspec(align(n,off)) Directs the compiler to align the variable to an n-byte boundary with offset off within each n-byte boundary. Address of the variable is address mod n=off.

restrict Permits the disambiguator flexibility in alias assumptions, which enables more vectorization.

__assume_aligned(a,n) Instructs the compiler to assume that array a is aligned on an n-byte boundary; used in cases where the compiler has failed to obtain alignment information.


Auto-vectorization Hints

#pragma ivdep Instructs the compiler to ignore assumed vector dependencies.

#pragma vector {aligned|unaligned|always|temporal|nontemporal} Specifies how to vectorize the loop and indicates that efficiency heuristics should be ignored. Using the assert keyword with the vector {always} pragma generates an error-level assertion message saying that the compiler efficiency heuristics indicate that the loop cannot be vectorized. Use #pragma ivdep to ignore the assumed dependencies.

#pragma novector Specifies that the loop should never be vectorized.

User-mandated pragma

#pragma simd Enforces vectorization of innermost loops.

See Also • ivdep pragma • simd pragma • vector pragma • User-mandated Vectorization

Part V: Floating-point Operations

Topics:

• Overview
• Understanding Floating-point Operations
• Tuning Performance
• Understanding IEEE Floating-point Operations


Overview

Overview: Floating-point Operations

This section introduces the floating-point support in the Intel® C++ Compiler and provides information about using floating-point operations in your applications. The following table lists some possible starting points:

If you are trying to...  Then start with...

• Understand the programming trade-offs in floating-point applications: Programming Tradeoffs in Floating-point Applications
• Use the -fp-model (VxWorks*) or /fp (Windows*) option: Using the -fp-model or /fp Option
• Set the flush-to-zero (FTZ) or denormals-are-zero (DAZ) flags: Setting the FTZ and DAZ Flags
• Tune the performance of floating-point applications for consistency: Overview: Tuning Performance of Floating-point Applications

Floating-point Options Quick Reference

The Intel® Compiler provides various options for you to optimize floating-point calculations with varying degrees of accuracy and predictability on different Intel architectures. This topic lists these compiler options and provides information about their supported architectures and operating systems.

Options for All Supported Intel® Architectures

VxWorks* Windows* Description

-fp-model /fp Specifies semantics used in floating-point calculations. Values are precise, fast [=1/2], strict, source, double, extended, [no-]except (VxWorks*) and except[-] (Windows*).



• -fp-model compiler option

-fp-speculation /Qfp-speculation Specifies the speculation mode for floating-point operations. Values are fast, safe, strict, and off.

• -fp-speculation compiler option

-prec-div, -no-prec-div /Qprec-div, Attempts to use slower but more /Qprec-div- accurate implementation of floating-point divide. Use this option to disable the divide optimizations in cases where it is important to maintain the full range and precision for floating-point division. Using this option results in greater accuracy with some loss of performance. Specifying -no-prec-div (VxWorks*) or /Qprec-div- (Windows*) enables the divide-to-reciprocal multiply optimization; these results are slightly less precise than full IEEE division results. However, you may get different answers if you use these options on non-Intel processors.

• -prec-div compiler option

-complex-limited-range /Qcomplex-limited-range Enables the use of basic algebraic expansions of some arithmetic operations involving data of type COMPLEX. This can cause performance improvements in programs that use a lot of COMPLEX arithmetic. Values at the extremes of the exponent range might not compute correctly; for example, for single-precision floating-point operations, values >1.E20 or <1.E-20 will not compute correctly.

• -complex-limited-range compiler option

-ftz /Qftz May flush denormal results to zero. The default behavior depends on the architecture. Refer to the following topic for details:

• -ftz compiler option

The options -fp-model or /fp, -[no-]fast-transcendentals or /Qfast-transcendentals-, -fp-speculation or /Qfp-speculation, and the -fimf-<> set of options may influence the choice of math routines that are invoked. Many routines in the libirc, libm, and svml library are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

Options for IA-32 and Intel® 64 Architectures

VxWorks* Windows* Description

-prec-sqrt, -no-prec-sqrt /Qprec-sqrt, /Qprec-sqrt- Improves the accuracy of square root implementations, but using this option may impact speed. The default is the -no-prec-sqrt (VxWorks* OS) option, with which the compiler uses a faster but less precise implementation of the square root calculation. You may get a different answer if you use the -no-prec-sqrt (VxWorks* OS) option on non-Intel processors.

• -prec-sqrt compiler option



-pc /Qpc Changes the floating point significand precision in the x87 control word. The application must use main() as the entry point, and you must compile the source file containing main() with this option.

• -pc compiler option

-rcd (VxWorks*); /Qrcd (Windows*)
Disables rounding mode changes for floating-point-to-integer conversions.

• -rcd compiler option

-fp-port (VxWorks*); /Qfp-port (Windows*)
Causes floating-point values to be rounded to the source precision at assignments and casts.

• -fp-port compiler option

-mp1 (VxWorks*); /Qprec (Windows*)
Rounds floating-point values to the precision specified in the source program prior to comparisons. It also implies -prec-div and -prec-sqrt (VxWorks*). This option has less impact on performance and disables fewer optimizations than the -fp-model precise (VxWorks*) option. You may get a different answer if you use this option on non-Intel processors.

• -mp1 compiler option

Understanding Floating-point Operations

Programming Tradeoffs in Floating-point Applications

In general, the programming objectives for floating-point applications fall into the following categories:
• Accuracy: The application produces results that are close to the correct result.
• Reproducibility and portability: The application produces consistent results across different runs, different sets of build options, different compilers, different platforms, and different architectures.
• Performance: The application produces fast, efficient code.

Based on the goals of an application, you will need to make tradeoffs among these objectives. For example, if you are developing a 3D graphics engine, then performance may be the most important factor to consider, with reproducibility and accuracy as secondary concerns. The Intel® Compiler provides several compiler options that allow you to tune your applications based on specific objectives. Broadly speaking, there are the floating-point specific options, such as the -fp-model (VxWorks* operating system) or /fp (Windows* operating system) option, and the fast-but-low-accuracy options, such as the -fimf-max-error (VxWorks*) or /Qimf-max-error (Windows*) option. The compiler optimizes and generates code differently when you specify these different compiler options. Select appropriate compiler options by carefully balancing your programming objectives and making tradeoffs among them. Some of these options may influence the choice of math routines that are invoked. Many routines in the libirc, libm, and svml libraries are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

Using Floating-point Options

Take the following code as an example:

float t0, t1, t2;
...
t0 = t1 + t2 + 4.0f + 0.1f;


If you specify the -fp-model extended (VxWorks*) or /fp:extended (Windows*) option in favor of accuracy, the compiler generates the following assembly code:

fld DWORD PTR _t1
fadd DWORD PTR _t2
fadd DWORD PTR _Cnst4.0
fadd DWORD PTR _Cnst0.1
fstp DWORD PTR _t0

The above code maximizes accuracy because it utilizes the highest mantissa precision available on the target platform. However, the code might suffer in performance due to the overhead of managing the x87 stack, and it might yield results that cannot be reproduced on other platforms that do not have an equivalent extended precision type. If you specify the -fp-model source (VxWorks*) or /fp:source (Windows*) option in favor of reproducibility and portability, the compiler generates the following assembly code:

movss xmm0, DWORD PTR _t1
addss xmm0, DWORD PTR _t2
addss xmm0, DWORD PTR _Cnst4.0
addss xmm0, DWORD PTR _Cnst0.1
movss DWORD PTR _t0, xmm0

The above code maximizes portability by preserving the original order of the computation and by using the well-defined IEEE single-precision type for all computations. It is not as accurate as the previous implementation because the intermediate rounding error is greater compared to extended precision. It is also not the highest performance implementation because it does not take advantage of the opportunity to precompute 4.0f + 0.1f. If you specify the -fp-model fast (VxWorks*) or /fp:fast (Windows*) option in favor of performance, the compiler generates the following assembly code:

movss xmm0, DWORD PTR _Cnst4.1
addss xmm0, DWORD PTR _t1
addss xmm0, DWORD PTR _t2
movss DWORD PTR _t0, xmm0

The above code maximizes performance by using Intel® SSE instructions and precomputing 4.0f + 0.1f. It is not as accurate as the first implementation, again due to greater intermediate rounding error. It will not provide reproducible results like the second implementation because it must reorder the addition in order to precompute 4.0f + 0.1f, and you cannot expect that all compilers, on all platforms, at all optimization levels, will reorder the addition in the same way. For many other applications, the considerations may be more complicated.

Using Fast-but-low-accuracy Options

The fast-but-low-accuracy options provide an easy way to control the accuracy of mathematical functions and utilize the performance/accuracy tradeoffs offered by the Intel® Math Libraries. You can specify accuracy, via the command line, for all math functions or for a selected set of math functions, at a finer level than low, medium, or high. You specify the accuracy requirements as a set of function attributes that the compiler uses to select an appropriate function implementation in the math libraries. Examples using the max-error attribute are presented here. For example, use the following option:

-fimf-max-error=2

to specify relative error of 2 ulps (units in the last place) for all single, double, long double, and quad precision functions. To specify 12 bits of accuracy for a sin function, use:

-fimf-sin-accuracy-bits=12

To specify relative error of 10 ulps for a sin function, and 4 ulps for other math functions called in the source file you are compiling, use: -fimf-sin-max-error=10 -fimf-max-error=4

The Intel® compiler defines the default value for the max-error attribute depending on the -fp-model (/fp) and -fast-transcendentals (/Qfast-transcendentals) settings. In -fp-model fast (/fp:fast) mode, or if fast but less accurate math functions are explicitly enabled by -fast-transcendentals (/Qfast-transcendentals), the compiler sets max-error=4.0 for the call. Otherwise, it sets max-error=0.6.

Floating-point Optimizations

Application performance is an important goal of the Intel® Compilers, even at default optimization levels. A number of optimizations involve transformations that might affect the floating-point behavior of the application, such as evaluation of constant expressions at compile time, hoisting invariant expressions out of loops, or changes in the order of evaluation of expressions. These optimizations usually help the compiler to produce the most efficient code possible. However, the optimizations might be contrary to the floating-point requirements of the application.


Some optimizations are not consistent with strict interpretation of the ANSI or ISO standards for C and C++. Such optimizations can cause differences in rounding and small variations in floating-point results that may be more or less accurate than the ANSI-conformant result. Intel Compilers provide the -fp-model (VxWorks*) or /fp (Windows*) option, which allows you to control the optimizations performed when you build an application. The option allows you to specify the compiler rules for:
• Value safety: Whether the compiler may perform transformations that could affect the result. For example, in the SAFE mode, the compiler won't transform x/x to 1.0 because the value of x at runtime might be a zero or a NaN. The UNSAFE mode is the default.
• Floating-point expression evaluation: How the compiler should handle the rounding of intermediate expressions. For example, when double precision is specified, the compiler interprets the statement t0=4.0f+0.1f+t1+t2; as t0=(float)(4.1+(double)t1+(double)t2);
• Floating-point contractions: Whether the compiler should generate fused multiply-add (FMA) instructions on processors that support them. When enabled, the compiler may generate FMA instructions for combining multiply and add operations; when disabled, the compiler must generate separate multiply and add instructions with intermediate rounding.
• Floating-point environment access: Whether the compiler must account for the possibility that the program might access the floating-point environment, either by changing the default floating-point control settings or by reading the floating-point status flags. This is disabled by default. You can use the -fp-model strict (VxWorks*) or /fp:strict (Windows*) option to enable it.
• Precise floating-point exceptions: Whether the compiler should account for the possibility that floating-point operations might produce an exception. This is disabled by default. You can use -fp-model strict or -fp-model except (VxWorks*), or /fp:strict or /fp:except (Windows*), to enable it.


Consider the following example:

double a=1.5;
int x=0;
...
__try {
  int t0=a; //raises inexact
  x=1;
  a*=2;
} __except(1) {
  printf("SEH Exception: x=%d\n", x);
}

Without precise floating-point exceptions, the result is SEH Exception: x=1; with precise floating-point exceptions, the result is SEH Exception: x=0. The following table describes the impact of different keywords of the option on compiler rules and optimizations:

Keyword           Value Safety   FP Expression   FP            FP Environment   Precise FP
                                 Evaluation      Contractions  Access           Exceptions

precise           Varies         Source          Yes           No               No
source            Varies         Source          Yes           No               No
double            Varies         Double          Yes           No               No
extended          Varies         Extended        Yes           No               No

strict            Varies         Source          No            Yes              Yes

fast=1 (default)  Unsafe         Unknown         Yes           No               No

fast=2            Unsafe         Unknown         Yes           No               No

except            Unaffected     Source          Unaffected    Unaffected       Yes
except-           Unaffected     Source          Unaffected    Unaffected       No


NOTE. It is illegal to specify the except keyword in an unsafe safety mode.

Based on the objectives of an application, you can choose to use different sets of compiler options and keywords to enable or disable certain optimizations, so that you can get the desired result.

See Also
• Using the -fp-model (/fp) Option

Using the -fp-model (/fp) Option

The -fp-model (VxWorks*) or /fp (Windows*) option allows you to control the optimizations on floating-point data. You can use this option to tune the performance, level of accuracy, or result consistency for floating-point applications across platforms and optimization levels. For applications that do not require support for denormalized numbers, the -fp-model or /fp option can be combined with the -ftz (VxWorks*) or /Qftz (Windows*) option to flush denormalized results to zero in order to obtain improved runtime performance on processors based on all Intel® architectures. You can use keywords to specify the semantics to be used. The keywords specified for this option may influence the choice of math routines that are invoked. Many routines in the libirc, libm, and svml library are more highly optimized for Intel microprocessors than for non-Intel microprocessors. Possible values of the keywords are as follows:

Keyword Description

precise Enables value-safe optimizations on floating-point data.

fast[=1|2] Enables more aggressive optimizations on floating-point data.

strict Enables precise and except, disables contractions, and enables the #pragma STDC FENV_ACCESS.

source Rounds intermediate results to source-defined precision and enables value-safe optimizations.

double Rounds intermediate results to 53-bit (double) precision and enables value-safe optimizations.

extended Rounds intermediate results to 64-bit (extended) precision and enables value-safe optimizations.



[no-]except (VxWorks*) or except[-] (Windows*) Determines whether strict floating-point exception semantics are used.

The default value of the option is -fp-model fast=1 or /fp:fast=1, which means that the compiler uses more aggressive optimizations on floating-point calculations.

NOTE. Using the default option keyword -fp-model fast or /fp:fast, you may get significant differences in your result depending on whether the compiler uses x87 or SSE2 instructions to implement floating-point operations. Results are more consistent when the other option keywords are used.

Several examples are provided to illustrate the usage of the keywords. These examples show:
• A small example of source code. Note that the same source code is considered in all the included examples.
• The semantics that are used to interpret floating-point calculations in the source code.
• One or more possible ways the compiler may interpret the source code. Note that there are several ways the compiler may interpret the code; we show just some of these possibilities.

-fp-model fast or /fp:fast

Example source code:

float t0, t1, t2;
...
t0 = 4.0f + 0.1f + t1 + t2;

When this option is specified, the compiler applies the following semantics:

• Additions may be performed in any order
• Intermediate expressions may use single, double, or extended double precision
• The constant addition may be pre-computed, assuming the default rounding mode


Using these semantics, some possible ways the compiler may interpret the original code are given below:

float t0, t1, t2;
...
t0 = (float)((double)t1 + (double)t2) + 4.1f;

float t0, t1, t2;
...
t0 = (t1 + t2) + 4.1f;

float t0, t1, t2;
...
t0 = (t1 + 4.1f) + t2;

-fp-model extended or /fp:extended

This setting is equivalent to -fp-model precise on Linux* operating systems based on the IA-32 architecture. Example source code:

float t0, t1, t2;
...
t0 = 4.0f + 0.1f + t1 + t2;

When this option is specified, the compiler applies the following semantics:

• Additions are performed in program order
• Intermediate expressions use extended double precision
• The constant addition may be pre-computed, assuming the default rounding mode

Using these semantics, a possible way the compiler may interpret the original code is shown below:

float t0, t1, t2;
...
t0 = (float)(((long double)4.1 + (long double)t1) + (long double)t2);


-fp-model source or /fp:source

This setting is equivalent to -fp-model precise or /fp:precise on systems based on the Intel® 64 architecture. Example source code:

float t0, t1, t2;
...
t0 = 4.0f + 0.1f + t1 + t2;

When this option is specified, the compiler applies the following semantics:

• Additions are performed in program order
• Intermediate expressions use the precision specified in the source code, that is, single precision
• The constant addition may be pre-computed, assuming the default rounding mode

Using these semantics, a possible way the compiler may interpret the original code is shown below:

float t0, t1, t2;
...
t0 = ((4.1f + t1) + t2);

-fp-model double or /fp:double

This setting is equivalent to -fp-model precise or /fp:precise on Windows systems based on the IA-32 architecture. Example source code:

float t0, t1, t2;
...
t0 = 4.0f + 0.1f + t1 + t2;

When this option is specified, the compiler applies the following semantics:

• Additions are performed in program order
• Intermediate expressions use double precision
• The constant addition may be pre-computed, assuming the default rounding mode


Using these semantics, a possible way the compiler may interpret the original code is shown below:

float t0, t1, t2;
...
t0 = (float)(((double)4.1 + (double)t1) + (double)t2);

-fp-model strict or /fp:strict

Example source code:

float t0, t1, t2;
...
t0 = 4.0f + 0.1f + t1 + t2;

When this option is specified, the compiler applies the following semantics:

• Additions are performed in program order
• Expression evaluation matches expression evaluation under the keyword precise
• The constant addition is not pre-computed, because there is no way to tell what rounding mode will be active when the program runs

Using these semantics, a possible way the compiler may interpret the original code is shown below:

float t0, t1, t2;
...
t0 = (float)((((long double)4.0f + (long double)0.1f) + (long double)t1) + (long double)t2);

See Also • -fp-model compiler option

Denormal Numbers

A normalized number is a number for which both the exponent (including bias) and the most significant bit of the mantissa are non-zero. For such numbers, all the bits of the mantissa contribute to the precision of the representation.


The smallest normalized single-precision floating-point number greater than zero is about 1.17549435E-38. Smaller numbers are possible, but those numbers must be represented with a zero exponent and a mantissa whose leading bit(s) are zero, which leads to a loss of precision. These numbers are called denormalized numbers or denormals (newer specifications refer to them as subnormal numbers). Denormal computations use hardware and/or operating system resources to handle denormals; these can cost hundreds of clock cycles. Denormal computations take much longer to calculate on processors based on IA-32 and Intel® 64 architectures than normal computations. There are several ways to avoid denormals and increase the performance of your application:
• Scale the values into the normalized range.
• Use a higher precision data type with a larger range.
• Flush denormals to zero.

See Also
• Reducing Impact of Denormal Exceptions
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture
• Intel® Itanium® Architecture Software Developer's Manual, Volume 1: Application Architecture
• Institute of Electrical and Electronics Engineers, Inc.* (IEEE) web site for information about the current floating-point standards and recommendations

Floating-point Environment

The floating-point environment is a collection of registers that control the behavior of the floating-point machine instructions and indicate the current floating-point status. The floating-point environment can include rounding mode controls, exception masks, flush-to-zero (FTZ) controls, exception status flags, and other floating-point related features. For example, on IA-32 and Intel® 64 architectures, bit 15 of the MXCSR register enables the flush-to-zero mode, which controls the masked response to a single-instruction multiple-data (SIMD) floating-point underflow condition. The floating-point environment affects most floating-point operations; therefore, correct configuration to meet your specific needs is important. For example, the exception mask bits define which exceptional conditions will be raised as exceptions by the processor. In general, the default floating-point environment is set by the operating system. You don't need to configure the floating-point environment unless the default floating-point environment does not suit your needs. There are several methods available if you want to modify the default floating-point environment. For example, you can use inline assembly, compiler built-in functions, library functions, or command line options.


Changing the default floating-point environment affects runtime results only; it does not affect any calculations that are pre-computed at compile time. If strict reproducibility and consistency are important, do not change the floating-point environment without also using either the -fp-model strict (VxWorks* OS) option or the fenv_access pragma.

See Also
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture

Setting the FTZ and DAZ Flags

In Intel® processors, the flush-to-zero (FTZ) and denormals-are-zero (DAZ) flags in the MXCSR register are used to control floating-point calculations. The Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® SSE2 instructions, including scalar and vector instructions, benefit from enabling the FTZ and DAZ flags. Floating-point computations using these Intel® SSE instructions are accelerated when the FTZ and DAZ flags are enabled, which improves the performance of the application. You can use the -ftz (VxWorks*) or /Qftz (Windows*) option to flush denormal results to zero when the application is in the gradual underflow mode. This option may improve performance if the denormal values are not critical to your application's behavior. The -ftz and /Qftz options, when applied to the main program, set the FTZ and DAZ hardware flags. The -no-ftz and /Qftz- options leave the flags as they are. The following table describes how the compiler processes denormal values based on the status of the FTZ and DAZ flags:

FTZ (flush-to-zero)
  When set to ON: sets denormal results from floating-point calculations to zero.
  When set to OFF: does not change the denormal results.
  Supported on: Intel® 64 architecture and some IA-32 architectures.

DAZ (denormals-are-zero)
  When set to ON: treats denormal values used as input to floating-point instructions as zero.
  When set to OFF: does not change the denormal instruction inputs.
  Supported on: Intel® 64 architecture and some IA-32 architectures.

• FTZ and DAZ are not supported on all IA-32 architectures. The FTZ flag is supported only on IA-32 architectures that support Intel® SSE instructions.


• On systems based on the IA-32 and Intel® 64 architectures, FTZ only applies to Intel® SSE instructions. Hence, if your application happens to generate denormals using x87 instructions, FTZ does not apply.
• DAZ and FTZ flags are not compatible with IEEE Standard 754, so you should only consider enabling them when strict compliance to the IEEE standard is not required.

Options -ftz and /Qftz are performance options. Setting these options does not guarantee that all denormals in a program are flushed to zero; they only cause denormals generated at run time to be flushed to zero. On Intel® 64 and IA-32 architecture-based systems, the compiler, by default, inserts code into the main routine to set the FTZ and DAZ flags. When -ftz or /Qftz is used in combination with an Intel® SSE-enabling option on systems based on the IA-32 architecture (for example, -msse2 or /arch:sse2), the compiler will insert code in the main routine to set FTZ and DAZ. When -ftz or /Qftz is used without such an option, the compiler will insert code to conditionally set FTZ/DAZ based on a run-time processor check. The -no-ftz (VxWorks*) and /Qftz- (Windows*) options prevent the compiler from inserting any code that might set FTZ or DAZ. The -ftz or /Qftz option only has an effect when the main program is being compiled. It sets the FTZ/DAZ mode for the process; the initial thread and any threads subsequently created by that process will operate in FTZ/DAZ mode. On systems based on the IA-32 and Intel® 64 architectures, every optimization level (every O level except O0) sets -ftz and /Qftz. If this produces undesirable effects on the numerical behavior of your program, you can turn the FTZ/DAZ mode off by specifying -no-ftz or /Qftz- on the command line while still benefiting from the O3 optimizations. For some non-Intel processors, you can set the flags manually with the following macros:

Feature Examples

Enable FTZ _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON)

Enable DAZ _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON)

The prototypes for these macros are in xmmintrin.h (FTZ) and pmmintrin.h (DAZ).

See Also
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture


Checking the Floating-point Stack State

On systems based on the IA-32 architectures, when an application calls a function that returns a floating-point value, the returned floating-point value is supposed to be on the top of the floating-point stack. If the return value is not used, the compiler must pop the value off of the floating-point stack in order to keep the floating-point stack in the correct state. On systems based on Intel® 64 architectures, floating-point values are usually returned in the xmm0 register; the floating-point stack is used only when the return value is a long double on VxWorks* systems.

If the application calls a function without defining, or incorrectly defining, the function's prototype, the compiler cannot determine if the function must return a floating-point value. Consequently, the return value is not popped off the floating-point stack if it is not used. This can cause the floating-point stack to overflow. The overflow of the stack results in two undesirable situations:
• A NaN value gets involved in the floating-point calculations.
• The program results become unpredictable; the point where the program starts making errors can be arbitrarily far away from the point of the actual error.

For systems based on the IA-32 and Intel® 64 architectures, the -fp-stack-check (VxWorks*) or /Qfp-stack-check (Windows*) option checks whether a program makes a correct call to a function that should return a floating-point value. If an incorrect call is detected, the option inserts code that marks the incorrect call in the program, making the error easy to find.

NOTE. The -fp-stack-check (VxWorks*) and /Qfp-stack-check (Windows*) options cause significant code to be generated after every function/subroutine call to ensure that the floating-point stack is maintained in the correct state. Therefore, using this option slows down the program being compiled. Use the option only as a debugging aid to find floating-point stack underflow/overflow problems, which can otherwise be hard to find.

See Also • -fp-stack-check, /Qfp-stack-check option

Tuning Performance

Overview: Tuning Performance

This section describes several programming guidelines that can help you improve the performance of floating-point applications:
• Avoid exceeding representable ranges during computation; handling these cases can have a performance impact.
• Use a single-precision type (for example, float) unless the extra precision and/or range obtained through double or long double is required. Greater precision types increase memory size and bandwidth requirements. See the Using Efficient Data Types section.
• Reduce the impact of denormal exceptions for all supported architectures.
• Avoid mixed data type arithmetic expressions.

Handling Floating-point Array Operations in a Loop Body

Following the guidelines below will help autovectorization of the loop:
• Statements within the loop body may contain float or double operations (typically on arrays). The following arithmetic operations are supported: addition, subtraction, multiplication, division, negation, square root, MAX, MIN, and mathematical functions such as SIN and COS.
• Writing to a single-precision scalar/array and a double scalar/array within the same loop decreases the chance of autovectorization due to the differences in the vector length (that is, the number of elements in the vector register) between float and double types. If autovectorization fails, try to avoid using mixed data types.

NOTE. The special __m64 and __m128 datatypes are not vectorizable. The loop body cannot contain any function calls. Use of the Intel® Streaming SIMD Extensions intrinsics (for example, mm_add_ps) is not allowed.

See Also • Programming Guidelines for Vectorization


Reducing the Impact of Denormal Exceptions

Denormalized floating-point values are those that are too small to be represented in the normal manner; that is, the mantissa cannot be left-justified. Denormal values require hardware or operating system interventions to handle the computation, so floating-point computations that result in denormal values may have an adverse impact on performance. There are several ways to handle denormals to increase the performance of your application:
• Scale the values into the normalized range.
• Use a higher precision data type with a larger range.
• Flush denormals to zero.

For example, you can translate denormals to normalized numbers by multiplying them by a large scalar number, doing the remaining computations in the normal space, then scaling back down to the denormal range. Consider using this method when the small denormal values benefit the program design.

Consider using a higher precision data type with a larger range; for example, by converting variables declared as float to be declared as double. Understand that making this change can potentially slow down your program: storage requirements will increase, which increases the amount of time for loading and storing data from memory, and higher precision data types can also decrease the potential throughput of Intel® SSE operations. If you change the type declaration of a variable, you might also need to change associated library calls, unless these are generic. You should verify that the gain in performance from eliminating denormals is greater than the overhead of using a data type with higher precision and greater dynamic range.

Another strategy that might result in increased performance is to increase the precision of intermediate values using the -fp-model [double|extended] option. However, this strategy might not eliminate all denormal exceptions, so you must experiment with the performance of your application.

Finally, in many cases denormal numbers can be treated safely as zero without adverse effects on program results. Depending on the target architecture, use the flush-to-zero (FTZ) options.


IA-32 and Intel® 64 Architectures

These architectures take advantage of the FTZ (flush-to-zero) and DAZ (denormals-are-zero) capabilities of Intel® Streaming SIMD Extensions (Intel® SSE) instructions. On Intel® 64 and IA-32 architecture-based systems, the compiler, by default, inserts code into the main routine to enable FTZ and DAZ at optimization levels higher than -O0. To enable FTZ and DAZ at -O0, compile the source file containing main() using the -ftz or /Qftz option. When the -ftz or /Qftz option is used on IA-32 architecture-based systems with the option -mia32 or /arch:IA32, the compiler inserts code to conditionally enable the FTZ and DAZ flags based on a run-time processor check.

NOTE. After using flush-to-zero, ensure that your program still gives correct results when treating denormalized values as zero.

See Also • Setting the FTZ and DAZ Flags Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture

Avoiding Mixed Data Type Arithmetic Expressions

Avoid mixing integer and floating-point (float, double, or long double) data in the same computation. Expressing all numbers in a floating-point arithmetic expression (assignment statement) as floating-point values eliminates the need to convert data between fixed and floating-point formats. Expressing all numbers in an integer arithmetic expression as integer values also achieves this. This improves run-time performance. For example, assuming that I and J are both int variables, expressing a constant number (2.) as an integer value (2) eliminates the need to convert the data. The following examples demonstrate inefficient and efficient code.

Examples

Example 1: Inefficient Code

int I, J;
I = J / 2.;


Example 2: Efficient Code

int I, J;
I = J / 2;

Special Considerations for Auto-Vectorization of the Innermost Loops Auto-vectorization of an innermost loop packs multiple data elements from consecutive loop iterations into a vector register, which is 128 bits in size. Consider a loop that uses different sized data, for example, REAL and DOUBLE PRECISION. For REAL data, the compiler tries to pack data elements from four (4) consecutive iterations (32 bits x 4 = 128 bits). For DOUBLE PRECISION data, the compiler tries to pack data elements from two (2) consecutive iterations (64 bits x 2 = 128 bits). Because of the mismatched number of iterations, the compiler sometimes fails to auto-vectorize the loop, even after trying to automatically remedy the situation. If your attempt to auto-vectorize an innermost loop fails, it is a good practice to try using the same sized data. INTEGER and REAL are considered the same sized data since both are 32 bits in size.


Examples

Example 1: Non-autovectorizable code

DOUBLE PRECISION A(N), B(N)
REAL C(N), D(N)
DO I=1, N
  A(I)=D(I)
  C(I)=B(I)
ENDDO

Example 2: Auto-vectorizable after automatic distribution into two loops

DOUBLE PRECISION A(N), B(N)
REAL C(N), D(N)
DO I=1, N
  A(I)=B(I)
  C(I)=D(I)
ENDDO

Example 3: Auto-vectorizable as one loop

REAL A(N), B(N)
REAL C(N), D(N)
DO I=1, N
  A(I)=B(I)
  C(I)=D(I)
ENDDO


Using Efficient Data Types

In cases where more than one data type can be used for a variable, consider selecting the data types based on the following hierarchy, listed from most to least efficient: • char • short • int • long • long long • float • double • long double

However, keep in mind that in an arithmetic expression, you should avoid mixing integer and floating-point data. You can use integer data types (int, long int, and so on) in loops to improve floating-point performance: convert the data to an integer data type, process it, then convert it back to the original type.

Understanding IEEE Floating-point Operations

Overview: Understanding IEEE Floating-point Standard

This version of the Intel® Compiler uses a close approximation to the IEEE floating-point standard (ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic, 1985) unless otherwise stated. This standard is common to many microcomputer-based systems due to the availability of fast processors that implement the required characteristics. This section outlines the characteristics of the standard and its implementation for the Intel Compilers. Except as noted, the description includes both the IEEE standard and the Intel Compiler implementation.

Floating-point Formats

The IEEE Standard 754 specifies values and requirements for floating-point representation (such as base 2). The standard outlines requirements for two formats: basic and extended, and for two word-lengths within each format: single and double. On IA-32 and Intel® 64 architecture-based systems, the Intel C/C++ Compiler supports the GCC decimal floating point feature and the following decimal floating types: _Decimal32, _Decimal64, and _Decimal128. This feature is supported for applications written in C only. These three decimal formats are defined in IEEE 754-2008 standard. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1312.pdf.

Special Values

The following lists special values that the Intel® Compiler supports and provides a brief description.

Signed zero The sign of zero is the same as the sign of a nonzero number. Comparisons, however, consider +0 to be equal to -0. A signed zero is useful in certain numerical analysis algorithms, but in most applications the sign of zero is invisible.


Denormalized numbers Denormalized numbers (denormals) fill the gap between zero and the smallest normalized number; otherwise only (+/-) 0 would occur in that interval. Denormalized numbers extend the range of computable results by allowing for gradual underflow. Systems based on the IA-32 architecture support a Denormal Operand status flag, which, when set, means at least one of the input operands to a floating-point operation is a denormal. The Underflow status flag is set when a number loses precision and becomes a denormal.

Signed infinity Infinities are the result of arithmetic in the limiting case of operands with arbitrarily large magnitude. They provide a way to continue when an overflow occurs. The sign of an infinity is the sign you would obtain for a finite result of the same operation as the operands approach infinite magnitude. By retrieving the status flags, you can differentiate between an infinity that results from an overflow and one that results from division by zero. The Intel® Compiler treats infinity as signed by default. The output value of infinity is +Infinity or -Infinity.

Not a Number Not a Number (NaN) results from an invalid operation. For instance, 0/0 and SQRT(-1) result in NaN. In general, an operation involving a NaN produces another NaN. Because the fraction of a NaN is unspecified, there are many possible NaNs. The Intel® Compiler treats all NaNs identically, but provides two different types:

• Signaling NaN, which has an initial fraction bit of 0 (zero). This usually raises an invalid exception when used in an operation. • Quiet NaN, which has an initial fraction bit of 1.

The floating-point hardware changes a signaling NaN into a quiet NaN during many arithmetic operations, including the assignment operation. An invalid exception may be raised, but the resulting floating-point value will be a quiet NaN. Unformatted input and output do not change the internal representations of the values as they are handled. Therefore, signaling and quiet NaNs may be read into real data and output to files in binary form. The output value of NaN is NaN.

Part VI: Intrinsics Reference

Topics:

• Overview: Intrinsics Reference • Intrinsics for All Intel Architectures • Data Alignment, Memory Allocation Intrinsics, and Inline Assembly • MMX(TM) Technology Intrinsics • Intrinsics for Intel(R) Streaming SIMD Extensions • Intrinsics for Intel(R) Streaming SIMD Extensions 2 • Intrinsics for Intel(R) Streaming SIMD Extensions 3 • Intrinsics for Intel(R) Supplemental Streaming SIMD Extensions 3 • Intrinsics for Intel(R) Streaming SIMD Extensions 4 • Intrinsics for Advanced Encryption Standard Implementation • Intrinsics for Intel(R) Advanced Vector Extensions • Intrinsics for Intel(R) Post-32nm Processor Instruction Extensions • Intrinsics for Managing Extended Processor States and Registers • Intrinsics for Converting Half Floats


• Intrinsics for Short Vector Math Library Operations


Overview: Intrinsics Reference

Intrinsics are assembly-coded functions that let you use C++ function calls and variables in place of assembly instructions. Intrinsics are expanded inline, eliminating function call overhead. While providing the same benefit as inline assembly, intrinsics improve code readability, assist instruction scheduling, and simplify debugging. Intrinsics provide access to instructions that cannot be generated using the standard constructs of the C and C++ languages.

Intrinsics for Intel® C++ Compilers The Intel® C++ Compiler enables easy implementation of assembly instructions through the use of intrinsics. Intrinsics are provided for the following instructions:

• Intel® Advanced Vector Extensions instructions • Carry-less multiplication instruction and Advanced Encryption Standard Extensions instructions • Half-float conversion instructions • MMX™ Technology instructions • Intel® Streaming SIMD Extensions (SSE) instructions • Intel® Streaming SIMD Extensions 2 (SSE2) instructions • Intel® Streaming SIMD Extensions 3 (SSE3) instructions • Supplemental Streaming SIMD Extensions 3 (SSSE3) instructions • Intel® Streaming SIMD Extensions 4 (SSE4) instructions

The Short Vector Math Library (svml) intrinsics are documented in this reference. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel C++ Compiler provides intrinsics that work across IA-32, Intel® 64, and IA-64 architectures. Most intrinsics map directly to a corresponding assembly instruction; some map to several assembly instructions.


Availability of Intrinsics on Intel Processors Not all Intel processors support all intrinsics. For information on which intrinsics are supported on Intel processors, visit http://processorfinder.intel.com. The Processor Spec Finder tool links directly to all processor documentation and the data sheets list the features, including intrinsics, supported by each processor.

Details about Intrinsics

All instructions use the following features:

• Registers • Data Types

Registers Intel® processors provide special register sets for different instructions. The MMX™ instructions use eight 64-bit registers (mm0 to mm7) which are aliased on the floating-point stack registers. The Intel® Streaming SIMD Extensions (Intel® SSE) and the Advanced Encryption Standard (AES) instructions use eight 128-bit registers (xmm0 to xmm7). The Intel® Advanced Vector Extension (Intel® AVX) instructions use 256-bit registers which are extensions of the 128-bit SIMD registers. Because each of these registers can hold more than one data element, the processor can process more than one data element simultaneously. This processing capability is also known as single-instruction multiple data processing (SIMD). For each computational and data manipulation instruction in the new extension sets, there is a corresponding C intrinsic that implements that instruction directly. This frees you from managing registers and assembly programming. Further, the compiler optimizes the instruction scheduling so that your executable runs faster.

Data Types Intrinsic functions use new C data types as operands, representing the new registers that are used as the operands to these intrinsic functions.


The following table details for which instructions each of the new data types are available. A 'Yes' indicates that the data type is available for that group of intrinsics; an 'NA' indicates that the data type is not available for that group of intrinsics.

Technology                                       __m64  __m128  __m128d  __m128i  __m256  __m256d  __m256i

MMX™ Technology Intrinsics                       Yes    NA      NA       NA       NA      NA       NA

Intel® Streaming SIMD Extensions Intrinsics      Yes    Yes     NA       NA       NA      NA       NA

Intel® Streaming SIMD Extensions 2 Intrinsics    Yes    Yes     Yes      Yes      NA      NA       NA

Intel® Streaming SIMD Extensions 3 Intrinsics    Yes    Yes     Yes      Yes      NA      NA       NA

Advanced Encryption Standard Intrinsics +
Carry-less Multiplication Intrinsic              Yes    Yes     Yes      Yes      NA      NA       NA

Half-Float Intrinsics                            Yes    Yes     Yes      Yes      NA      NA       NA

Intel® Advanced Vector Extensions Intrinsics     Yes    Yes     Yes      Yes      Yes     Yes      Yes


__m64 Data Type The __m64 data type is used to represent the contents of an MMX register, which is the register that is used by the MMX technology intrinsics. The __m64 data type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.

__m128 Data Types The __m128 data type is used to represent the contents of a Streaming SIMD Extension register used by the Streaming SIMD Extension intrinsics. The __m128 data type can hold four 32-bit floating-point values, the __m128d data type can hold two 64-bit floating-point values, and the __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values. The compiler aligns __m128, __m128d, and __m128i local and global data to 16-byte boundaries on the stack. To align integer, float, or double arrays, use __declspec(align) as follows:

typedef struct __declspec(align(16)) { float f[4]; } __m128;
typedef struct __declspec(align(16)) { double d[2]; } __m128d;
typedef struct __declspec(align(16)) { int i[4]; } __m128i;

NOTE. The __m128 data type is handled conventionally (that is, interpreted as four floating-point values) on IA-32 architecture-based platforms (for x86 processors). The Intel® Debugger, however, interprets the __m128 data type as four floating-point values on all platforms.

Accessing __m128i Data On IA-32 and Intel® 64 architecture-based systems, to access 8-bit data, use the _mm_extract intrinsics as follows:

#define _mm_extract_epi8(x, imm) \
  ((((imm) & 0x1) == 0) ? \
    _mm_extract_epi16((x), (imm) >> 1) & 0xff : \
    _mm_extract_epi16(_mm_srli_epi16((x), 8), (imm) >> 1))

To access 16-bit data, use: int _mm_extract_epi16(__m128i a, int imm)


To access 32-bit data, use:

#define _mm_extract_epi32(x, imm) \
  _mm_cvtsi128_si32(_mm_srli_si128((x), 4 * (imm)))

To access 64-bit data (Intel® 64 architecture only), use:

#define _mm_extract_epi64(x, imm) \
  _mm_cvtsi128_si64(_mm_srli_si128((x), 8 * (imm)))

__m256 Data Types The __m256 data type is used to represent the contents of the extended SSE register - the YMM register, used by the Intel® AVX intrinsics. The __m256 data type can hold eight 32-bit floating-point values, while the __m256d data type can hold four 64-bit double precision floating-point values, and the __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values. See Details for Intel® AVX Intrinsics for more information.

Data Types Usage Guidelines These data types are not basic ANSI C data types. You must observe the following usage restrictions:

• Use these data types only on either side of an assignment, as a return value, or as a parameter. You cannot use them in other arithmetic expressions (+, -, and so on). • Use these data types as objects in aggregates, such as unions, to access the byte elements and structures. • Use these data types only with the respective intrinsics described in this documentation.

Naming and Usage Syntax

Most intrinsic names use the following notational convention: _mm_<operation>_<suffix>

The following table explains each item in the syntax.

<operation> Indicates the basic operation of the intrinsic; for example, add for addition and sub for subtraction.


<suffix> Denotes the type of data the instruction operates on. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters and numbers denote the type, with notation as follows:

• s single-precision floating point • d double-precision floating point • i128 signed 128-bit integer • i64 signed 64-bit integer • u64 unsigned 64-bit integer • i32 signed 32-bit integer • u32 unsigned 32-bit integer • i16 signed 16-bit integer • u16 unsigned 16-bit integer • i8 signed 8-bit integer • u8 unsigned 8-bit integer

A number appended to a variable name indicates the element of a packed object. For example, r0 is the lowest word of r. Some intrinsics are "composites" because they require more than one instruction to implement them. The packed values are represented in right-to-left order, with the lowest value being used for scalar operations. Consider the following example operation: double a[2] = {1.0, 2.0}; __m128d t = _mm_load_pd(a);

The result is the same as either of the following: __m128d t = _mm_set_pd(2.0, 1.0); __m128d t = _mm_setr_pd(1.0, 2.0);

In other words, the xmm register that holds the value t contains 2.0 in its upper 64 bits and 1.0 in its lower 64 bits.


The "scalar" element is 1.0. Due to the nature of the instruction, some intrinsics require their arguments to be immediates (constant integer literals).

References

See the following publications and internet locations for more information about intrinsics and the Intel architectures that support them. You can find all publications on the Intel website.

Internet Location or Publication / Description

http://www.intel.com/software/products
    Technical resource center for hardware designers and developers; contains links to product pages and documentation.

Intel® 64 and IA-32 architecture manuals
    Intel website for the Intel® 64 and IA-32 architecture manuals.

Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M
    Describes the format of the instruction set of Intel® 64 and IA-32 architectures and covers the instructions from A to M.

Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference, N-Z
    Describes the format of the instruction set of Intel® 64 and IA-32 architectures and covers the instructions from N to Z.


Intrinsics for All Intel Architectures

Overview: Intrinsics across Intel Architectures

Most of the intrinsics documented in this section function across all supported Intel architectures. Some of them function across two Intel architectures. The intrinsics are categorized as follows:

• Integer Arithmetic Intrinsics • Floating-Point Intrinsics • String and Block Copy Intrinsics (only for IA-32 and Intel® 64 architectures) • Miscellaneous Intrinsics

Integer Arithmetic Intrinsics

The following table lists and describes integer arithmetic intrinsics that you can use across Intel architectures.

Intrinsic Syntax Description

Intrinsics for all Supported Intel Architectures

int abs(int) Returns the absolute value of an integer.

long labs(long) Returns the absolute value of a long integer.

unsigned long _lrotl(unsigned long value, int shift) Implements 64-bit left rotate of value by shift positions.

unsigned long _lrotr(unsigned long value, int shift) Implements 64-bit right rotate of value by shift positions.

unsigned int _rotl(unsigned int value, int shift) Implements 32-bit left rotate of value by shift positions.

unsigned int _rotr(unsigned int value, int shift) Implements 32-bit right rotate of value by shift positions.

Intrinsics for IA-32 and Intel® 64 Architectures

unsigned short _rotwl(unsigned short value, int shift) Implements 16-bit left rotate of value by shift positions.

unsigned short _rotwr(unsigned short value, int shift) Implements 16-bit right rotate of value by shift positions.

NOTE. Passing a constant shift value in the rotate intrinsics results in higher performance.

Floating-point Intrinsics

The following table lists and describes floating point intrinsics that you can use across all Intel and compatible architectures. Floating-point intrinsic functions may invoke library functions that are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Intrinsic Description

double fabs(double) Returns the absolute value of a floating-point value.

double log(double) Returns the natural logarithm ln(x), x>0, with double precision.

float logf(float) Returns the natural logarithm ln(x), x>0, with single precision.

double log10(double) Returns the base 10 logarithm log10(x), x>0, with double precision.

float log10f(float) Returns the base 10 logarithm log10(x), x>0, with single precision.

double exp(double) Returns the exponential function with double precision.

float expf(float) Returns the exponential function with single precision.

double pow(double, double) Returns the value of x to the power y with double precision.

float powf(float, float) Returns the value of x to the power y with single precision.

double sin(double) Returns the sine of x with double precision.

float sinf(float) Returns the sine of x with single precision.

double cos(double) Returns the cosine of x with double precision.

float cosf(float) Returns the cosine of x with single precision.

double tan(double) Returns the tangent of x with double precision.

float tanf(float) Returns the tangent of x with single precision.

double acos(double) Returns the inverse cosine of x with double precision.

float acosf(float) Returns the inverse cosine of x with single precision.

double acosh(double) Computes the inverse hyperbolic cosine of the argument with double precision.

float acoshf(float) Computes the inverse hyperbolic cosine of the argument with single precision.

double asin(double) Computes the inverse sine of the argument with double precision.

float asinf(float) Computes the inverse sine of the argument with single precision.

double asinh(double) Computes the inverse hyperbolic sine of the argument with double precision.

float asinhf(float) Computes the inverse hyperbolic sine of the argument with single precision.

double atan(double) Computes the inverse tangent of the argument with double precision.

float atanf(float) Computes the inverse tangent of the argument with single precision.

double atanh(double) Computes the inverse hyperbolic tangent of the argument with double precision.

float atanhf(float) Computes the inverse hyperbolic tangent of the argument with single precision.

double cabs(double complex z) Computes the absolute value of a complex number. The argument z is a complex number made up of two double precision values, one real and one imaginary, passed together as a single argument.

float cabsf(float complex z) Computes the absolute value of a complex number. The argument z is a complex number made up of two single precision values, one real and one imaginary, passed together as a single argument.

double ceil(double) Computes the smallest integral value not less than the double precision argument.

float ceilf(float) Computes the smallest integral value not less than the single precision argument.

double cosh(double) Computes the hyperbolic cosine of the double precision argument.

float coshf(float) Computes the hyperbolic cosine of the single precision argument.

float fabsf(float) Computes the absolute value of the single precision argument.

double floor(double) Computes the largest integral value not greater than the double precision argument.

float floorf(float) Computes the largest integral value not greater than the single precision argument.

double fmod(double, double) Computes the floating-point remainder of the division of the first argument by the second argument with double precision.

float fmodf(float, float) Computes the floating-point remainder of the division of the first argument by the second argument with single precision.

double hypot(double, double) Computes the length of the hypotenuse of a right-angled triangle with double precision.

float hypotf(float, float) Computes the length of the hypotenuse of a right-angled triangle with single precision.

double rint(double) Computes the integral value represented as double using the current IEEE rounding mode.

float rintf(float) Computes the integral value represented with single precision using the current IEEE rounding mode.

double sinh(double) Computes the hyperbolic sine of the double precision argument.

float sinhf(float) Computes the hyperbolic sine of the single precision argument.

float sqrtf(float) Computes the square root of the single precision argument.

double tanh(double) Computes the hyperbolic tangent of the double precision argument.

float tanhf(float) Computes the hyperbolic tangent of the single precision argument.


String and Block Copy Intrinsics

The following table lists and describes string and block copy intrinsics that you can use on systems based on IA-32 and Intel® 64 architectures. They may invoke library functions that are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

NOTE. strncpy() and strncmp() functions are implemented as intrinsics depending on compiler version and compiler switches like optimization level.

Intrinsic Description

char *_strset(char *, __int32) Sets all characters in a string to a fixed value.

int memcmp(const void *cs, const void *ct, size_t n) Compares two regions of memory. Returns <0 if cs<ct, 0 if cs==ct, or >0 if cs>ct.

void *memcpy(void *s, const void *ct, size_t n) Copies from memory. Returns s.

void *memset(void *s, int c, size_t n) Sets memory to a fixed value. Returns s.

char *strcat(char *s, const char *ct) Appends to a string. Returns s.

int strcmp(const char *cs, const char *ct) Compares two strings. Returns <0 if cs<ct, 0 if cs==ct, or >0 if cs>ct.

char *strcpy(char *s, const char *ct) Copies a string. Returns s.

size_t strlen(const char *cs) Returns the length of string cs.

int strncmp(const char *, const char *, size_t) Compares two strings, but only up to a specified number of characters.

char *strncpy(char *, const char *, size_t) Copies a string, but only up to a specified number of characters.


Miscellaneous Intrinsics

The following tables list and describe intrinsics that you can use across all Intel architectures, except where noted. These intrinsics are available for both Intel® and non-Intel microprocessors but they may perform additional optimizations for Intel® microprocessors than they perform for non-Intel microprocessors.

Intrinsic Description

Intrinsics for all Supported Intel® Architectures

__cpuid Queries the processor for information about processor type and supported features. The Intel® C++ Compiler supports the Microsoft* implementation of this intrinsic. See the Microsoft documentation for details.

void *_alloca(int) Allocates memory in the local stack frame. The memory is automatically freed upon return from the function.

int _bit_scan_forward(int x) Returns the bit index of the least significant set bit of x. If x is 0, the result is undefined.

int _bit_scan_reverse(int x) Returns the bit index of the most significant set bit of x. If x is 0, the result is undefined.

int _bswap(int) Reverses the byte order of x. Swaps 4 bytes; bits 0-7 are swapped with bits 24-31, bits 8-15 are swapped with bits 16-23.

__int64 _bswap64(__int64 x) Reverses the byte order of x. Swaps 8 bytes; bits 0-7 are swapped with bits 56-63, bits 8-15 are swapped with bits 48-55, bits 16-23 are swapped with bits 40-47, and bits 24-31 are swapped with bits 32-39.

unsigned int __cacheSize(unsigned int cacheLevel) __cacheSize(n) returns the size in kilobytes of the cache at level n. 1 represents the first-level cache. 0 is returned for a non-existent cache level. For example, an application may query the cache size and use it to select block sizes in algorithms that operate on matrices.


void _enable(void) Enables the interrupt.

void _disable(void) Disables the interrupt.

int _in_byte(int) Intrinsic that maps to the IA-32 instruction IN. Transfer data byte from port specified by argument.

int _in_dword(int) Intrinsic that maps to the IA-32 instruction IN. Transfer double word from port specified by argument.

int _in_word(int) Intrinsic that maps to the IA-32 instruction IN. Transfer word from port specified by argument.

int _inp(int) Same as _in_byte

int _inpd(int) Same as _in_dword

int _inpw(int) Same as _in_word

int _out_byte(int, int) Intrinsic that maps to the IA-32 instruction OUT. Transfer data byte in second argument to port specified by first argument.

int _out_dword(int, int) Intrinsic that maps to the IA-32 instruction OUT. Transfer double word in second argument to port specified by first argument.

int _out_word(int, int) Intrinsic that maps to the IA-32 instruction OUT. Transfer word in second argument to port specified by first argument.

int _outp(int, int) Same as _out_byte

int _outpw(int, int) Same as _out_word

int _outpd(int, int) Same as _out_dword


int _popcnt32(int x) Returns the number of set bits in x.

int _popcnt64(__int64 x) Returns the number of set bits in x.

__int64 _rdpmc(int p) Returns the current value of the 40-bit performance monitoring counter specified by p.

Intrinsics for IA-32 and Intel® 64 Architectures

__int64 _rdtsc(void) Returns the current value of the processor's 64-bit time stamp counter.

int _setjmp(jmp_buf) A fast version of setjmp(), which bypasses the termination handling. Saves the callee-save registers, stack pointer, and return address.

int __pin_value(char *annotation) Bypasses code that executes only in PIN mode.


Data Alignment, Memory Allocation Intrinsics, and Inline Assembly

Overview: Data Alignment, Memory Allocation Intrinsics, and Inline Assembly

This section describes features that support usage of the intrinsics. The following topics are described:

• Alignment Support • Allocating and Freeing Aligned Memory Blocks • Inline Assembly

Alignment Support

Aligning data improves the performance of intrinsics. When using the Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics, you should align data to 16 bytes in memory operations. Specifically, you must align __m128 objects and the addresses passed to the _mm_load and _mm_store intrinsics. If you want to declare arrays of floats and treat them as __m128 objects by casting, you need to ensure that the float arrays are properly aligned. Use __declspec(align) to direct the compiler to align data more strictly than it otherwise would. For example, a data object of type int is allocated at a byte address which is a multiple of 4 by default. However, by using __declspec(align), you can direct the compiler to instead use an address which is a multiple of 8, 16, or 32, with the following restriction on IA-32 architecture:

• 16-byte addresses can be locally or statically allocated

You can use this data alignment support as an advantage in optimizing cache line usage. By clustering small objects that are commonly used together into a struct, and forcing the struct to be allocated at the beginning of a cache line, you can effectively guarantee that each object is loaded into the cache as soon as any one is accessed, resulting in a significant performance benefit. The syntax of this extended-attribute is as follows: align(n)

where n is an integral power of 2, up to 4096. The value specified is the requested alignment.


NOTE. If a value is specified that is less than the alignment of the affected data type, it has no effect. In other words, data is aligned to the maximum of its own alignment or the alignment specified with __declspec(align).

You can request alignments for individual variables, whether of static or automatic storage duration. (Global and static variables have static storage duration; local variables have automatic storage duration by default.) You cannot adjust the alignment of a parameter, nor a field of a struct or class. You can, however, increase the alignment of a struct (or union or class), in which case every object of that type is affected. As an example, suppose that a function uses local variables i and j as subscripts into a 2-dimensional array. They might be declared as follows: int i, j;

These variables are commonly used together. But they can fall in different cache lines, which could be detrimental to performance. You can instead declare them as follows: __declspec(align(16)) struct { int i, j; } sub;

The compiler now ensures that they are allocated in the same cache line. In C++, you can omit the struct variable name (written as sub in the previous example). In C, however, it is required, and you must write references to i and j as sub.i and sub.j. If you use many functions with such subscript pairs, it is more convenient to declare and use a struct type for them, as in the following example: typedef struct __declspec(align(16)) { int i, j; } Sub;

By placing the __declspec(align) after the keyword struct, you are requesting the appropriate alignment for all objects of that type. Note that allocation of parameters is unaffected by __declspec(align). (If necessary, you can assign the value of a parameter to a local variable with the appropriate alignment.) You can also force alignment of global variables, such as arrays:

__declspec(align(16)) float array[1000];

Allocating and Freeing Aligned Memory Blocks

To allocate and free aligned blocks of memory, use the _mm_malloc and _mm_free intrinsics. These intrinsics are based on malloc and free, which are in the libirc.a library. You need to include malloc.h. The syntax for these intrinsics is as follows:

void* _mm_malloc (size_t size, size_t align)


void _mm_free (void *p)

The _mm_malloc routine takes an extra parameter, which is the alignment constraint. This constraint must be a power of two. The pointer that is returned from _mm_malloc is guaranteed to be aligned on the specified boundary.

NOTE. Memory that is allocated using _mm_malloc must be freed using _mm_free. Calling free on memory allocated with _mm_malloc or calling _mm_free on memory allocated with malloc will cause unpredictable behavior.

Inline Assembly

Microsoft Style Inline Assembly

The Intel® C++ Compiler supports Microsoft-style inline assembly with the -use-msasm compiler option. See your Microsoft documentation for the proper syntax.

GNU*-like Style Inline Assembly (IA-32 architecture and Intel(R) 64 architecture only)

The Intel® C++ Compiler supports GNU-like style inline assembly. The syntax is as follows:

asm-keyword [ volatile-keyword ] ( asm-template [ asm-interface ] ) ;

The Intel C++ Compiler also supports mixing UNIX and Microsoft style asms. Use the __asm__ keyword for GNU-style ASM when using the -use-msasm switch.

NOTE. The Intel C++ Compiler supports gcc-style inline ASM if the assembler code uses AT&T* System V/386 syntax.

Syntax Element: Description

asm-keyword
    Assembly statements begin with the keyword asm. Alternatively, either __asm or __asm__ may be used for compatibility. When mixing UNIX and Microsoft style asm, use the __asm__ keyword; in that case the compiler only accepts the __asm__ keyword, because the asm and __asm keywords are reserved for Microsoft style assembly statements.

volatile-keyword
    If the optional keyword volatile is given, the asm is volatile. Two volatile asm statements are never moved past each other, and a reference to a volatile variable is not moved relative to a volatile asm. Alternate keywords __volatile and __volatile__ may be used for compatibility.

asm-template
    The asm-template is a C language ASCII string that specifies how to output the assembly code for an instruction. Most of the template is a fixed string; everything but the substitution-directives, if any, is passed through to the assembler. The syntax for a substitution directive is a % followed by one or two characters.

asm-interface
    The asm-interface consists of three parts:

    1. An optional output-list
    2. An optional input-list
    3. An optional clobber-list

    These are separated by colon (:) characters. If the output-list is missing, but an input-list is given, the input list may be preceded by two colons (::) to take the place of the missing output-list. If the asm-interface is omitted altogether, the asm statement is considered volatile regardless of whether a volatile-keyword was specified.

output-list
    An output-list consists of one or more output-specs separated by commas. For the purposes of substitution in the asm-template, each output-spec is numbered. The first operand in the output-list is numbered 0, the second is 1, and so on. Numbering is continuous through the output-list and into the input-list. The total number of operands is limited to 30 (i.e. 0-29).

input-list
    Similar to an output-list, an input-list consists of one or more input-specs separated by commas. For the purposes of substitution in the asm-template, each input-spec is numbered, with the numbers continuing from those in the output-list.

clobber-list
    A clobber-list tells the compiler that the asm uses or changes a specific machine register that is either coded directly into the asm or is changed implicitly by the assembly instruction. The clobber-list is a comma-separated list of clobber-specs.

input-spec
    The input-specs tell the compiler about expressions whose values may be needed by the inserted assembly instruction. In order to describe fully the input requirements of the asm, you can list input-specs that are not actually referenced in the asm-template.

clobber-spec
    Each clobber-spec specifies the name of a single machine register that is clobbered. The register name may optionally be preceded by a %. You can specify any valid machine register name. It is also legal to specify "memory" in a clobber-spec. This prevents the compiler from keeping data cached in registers across the asm statement.

When compiling an assembly statement on Linux*, the compiler simply emits the asm-template to the assembly file after making any necessary operand substitutions. The compiler then calls the GNU assembler to generate machine code. In contrast, on Windows* the compiler itself must assemble the text contained in the asm-template string into machine code. In essence, the compiler contains a built-in assembler. The compiler’s built-in assembler supports the GNU .byte directive but does not support other functionality of the GNU assembler, so there are limitations in the contents of the asm-template. The following assembler features are not currently supported.

• Directives other than the .byte directive
• Symbols*

NOTE. * Direct symbol references in the asm-template are not supported. To access a C++ object, use the asm-interface with a substitution directive.

Example

Incorrect method for accessing a C++ object:

__asm__("addl $5, _x");

Proper method for accessing a C++ object:

__asm__("addl $5, %0" : "+rm" (x));


Additionally, there are some restrictions on the usage of labels. The compiler only allows local labels, and only references to labels within the same assembly statement are permitted. A local label has the form “N:”, where N is a non-negative integer. N does not have to be unique, even within the same assembly statement. To reference the most recent definition of label N, use “Nb”. To reference the next definition of label N, use “Nf”. In this context, “b” means backward and “f” means forward. For more information, refer to the GNU assembler documentation.

GNU-style inline assembly statements on Windows* use the same assembly instruction format as on Linux*. This means that destination operands are on the right and source operands are on the left. This operand order is the reverse of Intel assembly syntax.

Due to the limitations of the compiler's built-in assembler, many assembly statements that compile and run on Linux* will not compile on Windows*. On the other hand, assembly statements that compile and run on Windows* should also compile and run on Linux*.


This feature provides a high-performance alternative to Microsoft-style inline assembly statements when portability between Windows* and Linux* is important. Its intended use is in small primitives where high-performance integration with the surrounding C++ code is essential.

#ifdef _WIN64
#define INT64_PRINTF_FORMAT "I64"
#else
#define __int64 long long
#define INT64_PRINTF_FORMAT "L"
#endif

#include <stdio.h>

typedef struct {
    __int64 lo64;
    __int64 hi64;
} my_i128;

#define ADD128(out, in1, in2) \
    __asm__("addq %2, %0; adcq %3, %1" : \
            "=r"(out.lo64), "=r"(out.hi64) : \
            "emr" (in2.lo64), "emr"(in2.hi64), \
            "0" (in1.lo64), "1" (in1.hi64));

extern int main()
{
    my_i128 val1, val2, result;
    val1.lo64 = ~0;
    val1.hi64 = 0;
    val2.lo64 = 1;
    val2.hi64 = 65;
    ADD128(result, val1, val2);
    printf("0x%016" INT64_PRINTF_FORMAT "x%016" INT64_PRINTF_FORMAT "x\n",
           val1.hi64, val1.lo64);
    printf("+0x%016" INT64_PRINTF_FORMAT "x%016" INT64_PRINTF_FORMAT "x\n",
           val2.hi64, val2.lo64);
    printf("-----------------------------------\n");
    printf("0x%016" INT64_PRINTF_FORMAT "x%016" INT64_PRINTF_FORMAT "x\n",
           result.hi64, result.lo64);
    return 0;
}

This example, written for Intel(R) 64 architecture, shows how to use a GNU-style inline assembly statement to add two 128-bit integers. In this example, a 128-bit integer is represented as two __int64 objects in the my_i128 structure. The inline assembly statement used to implement the addition is contained in the ADD128 macro, which takes 3 my_i128 arguments representing 3 128-bit integers. The first argument is the output. The next two arguments are the inputs. The example compiles and runs using the Intel Compiler on Linux* or Windows*, producing the following output:

 0x0000000000000000ffffffffffffffff
+0x00000000000000410000000000000001
-----------------------------------
 0x00000000000000420000000000000000

In the GNU-style inline assembly implementation, the asm interface specifies all the inputs, outputs, and side effects of the asm statement, enabling the compiler to generate very efficient code:

mov r13, 0xffffffffffffffff
mov r12, 0x000000000
add r13, 1
adc r12, 65

It is worth noting that when the compiler generates an assembly file on Windows*, it uses Intel syntax even though the assembly statement was written using Linux* assembly syntax.


The compiler moves in1.lo64 into a register to match the constraint of operand 4. Operand 4's constraint of "0" indicates that it must be assigned the same location as output operand 0. And operand 0's constraint is "=r", indicating that it must be assigned an integer register. In this case, the compiler chooses r13. In the same way, the compiler moves in1.hi64 into register r12. The constraints for input operands 2 and 3 allow the operands to be assigned a register location ("r"), a memory location ("m"), or a constant signed 32-bit integer value ("e"). In this case, the compiler chooses to match operands 2 and 3 with the constant values 1 and 65, enabling the add and adc instructions to utilize the "register-immediate" forms.

The same operation is much more expensive using a Microsoft-style inline assembly statement, because the interface between the assembly statement and the surrounding C++ code is entirely through memory. Using Microsoft assembly, the ADD128 macro might be written as follows:

#define ADD128(out, in1, in2) \
{ \
    __asm mov rax, in1.lo64 \
    __asm mov rdx, in1.hi64 \
    __asm add rax, in2.lo64 \
    __asm adc rdx, in2.hi64 \
    __asm mov out.lo64, rax \
    __asm mov out.hi64, rdx \
}


The compiler must add code before the assembly statement to move the inputs into memory, and it must add code after the assembly statement to retrieve the outputs from memory. This prevents the compiler from exploiting some optimization opportunities. Thus, the following assembly code is produced:

mov QWORD PTR [rsp+32], -1
mov QWORD PTR [rsp+40], 0
mov QWORD PTR [rsp+48], 1
mov QWORD PTR [rsp+56], 65
; Begin ASM
mov rax, QWORD PTR [rsp+32]
mov rdx, QWORD PTR [rsp+40]
add rax, QWORD PTR [rsp+48]
adc rdx, QWORD PTR [rsp+56]
mov QWORD PTR [rsp+64], rax
mov QWORD PTR [rsp+72], rdx
; End ASM
mov rdx, QWORD PTR [rsp+72]
mov r8, QWORD PTR [rsp+64]

The operation that took only 4 instructions and 0 memory references using GNU-style inline assembly takes 12 instructions with 12 memory references using Microsoft-style inline assembly.

MMX(TM) Technology Intrinsics

Overview: MMX™ Technology Intrinsics

MMX™ technology is an extension to the Intel architecture instruction set. The MMX instruction set adds 57 opcodes, a 64-bit quadword data type, and eight 64-bit registers. Each of the eight registers can be directly addressed using the register names MM0 to MM7. The prototypes for MMX technology intrinsics are in the mmintrin.h header file.

Details about MMX(TM) Technology Intrinsics

The MMX™ technology instructions use the following features:

• Registers – Enable packed data of up to 64 bits in length for optimal single-instruction multiple data (SIMD) processing.
• Data Types – Enable packing of up to eight elements of data in one register.

Registers

The MMX instructions use eight 64-bit registers (mm0 to mm7) which are aliased on the floating-point stack registers. Because each of these registers can hold more than one data element, the processor can process more than one data element simultaneously. This processing capability is also known as single-instruction multiple data (SIMD) processing. For each computational and data manipulation instruction in the new extension sets, there is a corresponding C intrinsic that implements that instruction directly. This frees you from managing registers and assembly programming. Further, the compiler optimizes the instruction scheduling so that your executable runs faster.

Data Types

Intrinsic functions use new C data types as operands, representing the registers that are used as the operands to these intrinsic functions.


__m64 Data Type

The __m64 data type is used to represent the contents of an MMX register, which is the register that is used by the MMX technology intrinsics. The __m64 data type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.

Data Types Usage Guidelines

These data types are not basic ANSI C data types. You must observe the following usage restrictions:

• Use data types only on either side of an assignment, as a return value, or as a parameter. You cannot use them with other arithmetic expressions (+, -, etc.).
• Use data types as objects in aggregates, such as unions, to access the byte elements and structures.
• Use data types only with the respective intrinsics described in this documentation.

The EMMS Instruction: Why You Need It

Using EMMS is like emptying a container to accommodate new content. The EMMS instruction clears the MMX™ registers and sets the value of the floating-point tag word to empty. You should clear the MMX registers before issuing a floating-point instruction because floating-point convention specifies that the floating-point stack be cleared after use. Insert the EMMS instruction at the end of all MMX code segments to avoid a floating-point overflow exception.


Why You Need EMMS to Reset After an MMX™ Instruction

CAUTION. Failure to empty the multimedia state after using an MMX instruction and before using a floating-point instruction can result in unexpected execution or poor performance.


EMMS Usage Guidelines

Here are guidelines for when to use the EMMS instruction:

• Use _mm_empty() after an MMX™ instruction if the next instruction is a floating-point (FP) instruction. For example, you should use the EMMS instruction before performing calculations on float, double or long double. You must be aware of all situations in which your code generates an MMX instruction:

  • when using an MMX technology intrinsic
  • when using Intel® Streaming SIMD Extensions (Intel® SSE) integer intrinsics that use the __m64 data type
  • when referencing an __m64 data type variable
  • when using an MMX instruction through inline assembly

• Use different functions for operations that use floating point instructions and those that use MMX instructions. This action eliminates the need to empty the multimedia state within the body of a critical loop.
• Use _mm_empty() during runtime initialization of __m64 and FP data types. This ensures resetting the register between data type transitions.
• Do not use _mm_empty() before an MMX instruction, since using _mm_empty() before an MMX instruction incurs an operation with no benefit (no-op).
• See the Correct Usage and Incorrect Usage coding examples in the following table.

Incorrect Usage:

__m64 x = _m_paddd(y, z);
float f = init();

Correct Usage:

__m64 x = _m_paddd(y, z);
float f = ( _mm_empty(), init());

MMX™ Technology General Support Intrinsics

The prototypes for MMX™ technology general support intrinsics are in the mmintrin.h header file.


Intrinsic Name Operation Corresponding MMX Instruction

_mm_empty Empty MM state EMMS

_mm_cvtsi32_si64 Convert from int MOVD

_mm_cvtsi64_si32 Convert to int MOVD

_mm_cvtsi64_m64 Convert from __int64 MOVQ

_mm_cvtm64_si64 Convert to __int64 MOVQ

_mm_packs_pi16 Pack PACKSSWB

_mm_packs_pi32 Pack PACKSSDW

_mm_packs_pu16 Pack PACKUSWB

_mm_unpackhi_pi8 Interleave PUNPCKHBW

_mm_unpackhi_pi16 Interleave PUNPCKHWD

_mm_unpackhi_pi32 Interleave PUNPCKHDQ

_mm_unpacklo_pi8 Interleave PUNPCKLBW

_mm_unpacklo_pi16 Interleave PUNPCKLWD

_mm_unpacklo_pi32 Interleave PUNPCKLDQ

void _mm_empty(void)

Empties the multimedia state. __m64 _mm_cvtsi32_si64(int i)

Converts the integer object i to a 64-bit __m64 object. The integer value is zero-extended to 64 bits. int _mm_cvtsi64_si32(__m64 m)

Converts the lower 32 bits of the __m64 object m to an integer. __m64 _mm_cvtsi64_m64(__int64 i)


Moves the 64-bit integer object i to an __m64 object. __m64 _mm_cvtm64_si64(__m64 m)

Moves the __m64 object m to a 64-bit integer. __m64 _mm_packs_pi16(__m64 m1, __m64 m2)

Packs the four 16-bit values from m1 into the lower four 8-bit values of the result with signed saturation, and packs the four 16-bit values from m2 into the upper four 8-bit values of the result with signed saturation. __m64 _mm_packs_pi32(__m64 m1, __m64 m2)

Packs the two 32-bit values from m1 into the lower two 16-bit values of the result with signed saturation, and packs the two 32-bit values from m2 into the upper two 16-bit values of the result with signed saturation. __m64 _mm_packs_pu16(__m64 m1, __m64 m2)

Packs the four 16-bit values from m1 into the lower four 8-bit values of the result with unsigned saturation, and packs the four 16-bit values from m2 into the upper four 8-bit values of the result with unsigned saturation. __m64 _mm_unpackhi_pi8(__m64 m1, __m64 m2)

Interleaves the four 8-bit values from the high half of m1 with the four values from the high half of m2. The interleaving begins with the data from m1. __m64 _mm_unpackhi_pi16(__m64 m1, __m64 m2)

Interleaves the two 16-bit values from the high half of m1 with the two values from the high half of m2. The interleaving begins with the data from m1. __m64 _mm_unpackhi_pi32(__m64 m1, __m64 m2)

Interleaves the 32-bit value from the high half of m1 with the 32-bit value from the high half of m2. The interleaving begins with the data from m1. __m64 _mm_unpacklo_pi8(__m64 m1, __m64 m2)

Interleaves the four 8-bit values from the low half of m1 with the four values from the low half of m2. The interleaving begins with the data from m1. __m64 _mm_unpacklo_pi16(__m64 m1, __m64 m2)

Interleaves the two 16-bit values from the low half of m1 with the two values from the low half of m2. The interleaving begins with the data from m1.


__m64 _mm_unpacklo_pi32(__m64 m1, __m64 m2)

Interleaves the 32-bit value from the low half of m1 with the 32-bit value from the low half of m2. The interleaving begins with the data from m1.

MMX™ Technology Packed Arithmetic Intrinsics

The prototypes for MMX™ technology packed arithmetic intrinsics are in the mmintrin.h header file.

Intrinsic Name Operation Corresponding MMX Instruction

_mm_add_pi8 Addition PADDB

_mm_add_pi16 Addition PADDW

_mm_add_pi32 Addition PADDD

_mm_adds_pi8 Addition PADDSB

_mm_adds_pi16 Addition PADDSW

_mm_adds_pu8 Addition PADDUSB

_mm_adds_pu16 Addition PADDUSW

_mm_sub_pi8 Subtraction PSUBB

_mm_sub_pi16 Subtraction PSUBW

_mm_sub_pi32 Subtraction PSUBD

_mm_subs_pi8 Subtraction PSUBSB

_mm_subs_pi16 Subtraction PSUBSW

_mm_subs_pu8 Subtraction PSUBUSB

_mm_subs_pu16 Subtraction PSUBUSW

_mm_madd_pi16 Multiply and add PMADDWD


_mm_mulhi_pi16 Multiplication PMULHW

_mm_mullo_pi16 Multiplication PMULLW

__m64 _mm_add_pi8(__m64 m1, __m64 m2)

Add the eight 8-bit values in m1 to the eight 8-bit values in m2. __m64 _mm_add_pi16(__m64 m1, __m64 m2)

Add the four 16-bit values in m1 to the four 16-bit values in m2. __m64 _mm_add_pi32(__m64 m1, __m64 m2)

Add the two 32-bit values in m1 to the two 32-bit values in m2. __m64 _mm_adds_pi8(__m64 m1, __m64 m2)

Add the eight signed 8-bit values in m1 to the eight signed 8-bit values in m2 using saturating arithmetic. __m64 _mm_adds_pi16(__m64 m1, __m64 m2)

Add the four signed 16-bit values in m1 to the four signed 16-bit values in m2 using saturating arithmetic. __m64 _mm_adds_pu8(__m64 m1, __m64 m2)

Add the eight unsigned 8-bit values in m1 to the eight unsigned 8-bit values in m2 using saturating arithmetic. __m64 _mm_adds_pu16(__m64 m1, __m64 m2)

Add the four unsigned 16-bit values in m1 to the four unsigned 16-bit values in m2 using saturating arithmetic. __m64 _mm_sub_pi8(__m64 m1, __m64 m2)

Subtract the eight 8-bit values in m2 from the eight 8-bit values in m1. __m64 _mm_sub_pi16(__m64 m1, __m64 m2)

Subtract the four 16-bit values in m2 from the four 16-bit values in m1. __m64 _mm_sub_pi32(__m64 m1, __m64 m2)


Subtract the two 32-bit values in m2 from the two 32-bit values in m1. __m64 _mm_subs_pi8(__m64 m1, __m64 m2)

Subtract the eight signed 8-bit values in m2 from the eight signed 8-bit values in m1 using saturating arithmetic. __m64 _mm_subs_pi16(__m64 m1, __m64 m2)

Subtract the four signed 16-bit values in m2 from the four signed 16-bit values in m1 using saturating arithmetic. __m64 _mm_subs_pu8(__m64 m1, __m64 m2)

Subtract the eight unsigned 8-bit values in m2 from the eight unsigned 8-bit values in m1 using saturating arithmetic. __m64 _mm_subs_pu16(__m64 m1, __m64 m2)

Subtract the four unsigned 16-bit values in m2 from the four unsigned 16-bit values in m1 using saturating arithmetic. __m64 _mm_madd_pi16(__m64 m1, __m64 m2)

Multiply four 16-bit values in m1 by four 16-bit values in m2 producing four 32-bit intermediate results, which are then summed by pairs to produce two 32-bit results. __m64 _mm_mulhi_pi16(__m64 m1, __m64 m2)

Multiply four signed 16-bit values in m1 by four signed 16-bit values in m2 and produce the high 16 bits of the four results. __m64 _mm_mullo_pi16(__m64 m1, __m64 m2)

Multiply four 16-bit values in m1 by four 16-bit values in m2 and produce the low 16 bits of the four results.

MMX™ Technology Shift Intrinsics

The prototypes for MMX™ technology shift intrinsics are in the mmintrin.h header file.

Intrinsic Name Operation Corresponding MMX Instruction

_mm_sll_pi16 Logical shift left PSLLW

_mm_slli_pi16 Logical shift left PSLLWI


_mm_sll_pi32 Logical shift left PSLLD

_mm_slli_pi32 Logical shift left PSLLDI

_mm_sll_si64 Logical shift left PSLLQ

_mm_slli_si64 Logical shift left PSLLQI

_mm_sra_pi16 Arithmetic shift right PSRAW

_mm_srai_pi16 Arithmetic shift right PSRAWI

_mm_sra_pi32 Arithmetic shift right PSRAD

_mm_srai_pi32 Arithmetic shift right PSRADI

_mm_srl_pi16 Logical shift right PSRLW

_mm_srli_pi16 Logical shift right PSRLWI

_mm_srl_pi32 Logical shift right PSRLD

_mm_srli_pi32 Logical shift right PSRLDI

_mm_srl_si64 Logical shift right PSRLQ

_mm_srli_si64 Logical shift right PSRLQI

__m64 _mm_sll_pi16(__m64 m, __m64 count)

Shifts four 16-bit values in m left the amount specified by count while shifting in zeros. __m64 _mm_slli_pi16(__m64 m, int count)

Shifts four 16-bit values in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant. __m64 _mm_sll_pi32(__m64 m, __m64 count)

Shifts two 32-bit values in m left the amount specified by count while shifting in zeros.


__m64 _mm_slli_pi32(__m64 m, int count)

Shifts two 32-bit values in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant. __m64 _mm_sll_si64(__m64 m, __m64 count)

Shifts the 64-bit value in m left the amount specified by count while shifting in zeros. __m64 _mm_slli_si64(__m64 m, int count)

Shifts the 64-bit value in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant. __m64 _mm_sra_pi16(__m64 m, __m64 count)

Shifts four 16-bit values in m right the amount specified by count while shifting in the sign bit. __m64 _mm_srai_pi16(__m64 m, int count)

Shifts four 16-bit values in m right the amount specified by count while shifting in the sign bit. For the best performance, count should be a constant. __m64 _mm_sra_pi32(__m64 m, __m64 count)

Shifts two 32-bit values in m right the amount specified by count while shifting in the sign bit. __m64 _mm_srai_pi32(__m64 m, int count)

Shifts two 32-bit values in m right the amount specified by count while shifting in the sign bit. For the best performance, count should be a constant. __m64 _mm_srl_pi16(__m64 m, __m64 count)

Shifts four 16-bit values in m right the amount specified by count while shifting in zeros. __m64 _mm_srli_pi16(__m64 m, int count)

Shifts four 16-bit values in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant. __m64 _mm_srl_pi32(__m64 m, __m64 count)

Shifts two 32-bit values in m right the amount specified by count while shifting in zeros. __m64 _mm_srli_pi32(__m64 m, int count)

Shifts two 32-bit values in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant. __m64 _mm_srl_si64(__m64 m, __m64 count)


Shifts the 64-bit value in m right the amount specified by count while shifting in zeros. __m64 _mm_srli_si64(__m64 m, int count)

Shifts the 64-bit value in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

MMX™ Technology Logical Intrinsics

The prototypes for MMX™ technology logical intrinsics are in the mmintrin.h header file.

Intrinsic Name Operation Corresponding MMX Instruction

_mm_and_si64 Bitwise AND PAND

_mm_andnot_si64 Bitwise ANDNOT PANDN

_mm_or_si64 Bitwise OR POR

_mm_xor_si64 Bitwise Exclusive OR PXOR

__m64 _mm_and_si64(__m64 m1, __m64 m2)

Perform a bitwise AND of the 64-bit value in m1 with the 64-bit value in m2. __m64 _mm_andnot_si64(__m64 m1, __m64 m2)

Perform a bitwise NOT on the 64-bit value in m1 and use the result in a bitwise AND with the 64-bit value in m2. __m64 _mm_or_si64(__m64 m1, __m64 m2)

Perform a bitwise OR of the 64-bit value in m1 with the 64-bit value in m2. __m64 _mm_xor_si64(__m64 m1, __m64 m2)

Perform a bitwise XOR of the 64-bit value in m1 with the 64-bit value in m2.

MMX™ Technology Compare Intrinsics

The prototypes for MMX™ technology compare intrinsics are in the mmintrin.h header file.


Intrinsic Name Operation Corresponding MMX Instruction

_mm_cmpeq_pi8 Equal PCMPEQB

_mm_cmpeq_pi16 Equal PCMPEQW

_mm_cmpeq_pi32 Equal PCMPEQD

_mm_cmpgt_pi8 Greater Than PCMPGTB

_mm_cmpgt_pi16 Greater Than PCMPGTW

_mm_cmpgt_pi32 Greater Than PCMPGTD

__m64 _mm_cmpeq_pi8(__m64 m1, __m64 m2)

Sets the corresponding 8-bit resulting values to all ones if the 8-bit values in m1 are equal to the corresponding 8-bit values in m2; otherwise sets them to all zeros. __m64 _mm_cmpeq_pi16(__m64 m1, __m64 m2)

Sets the corresponding 16-bit resulting values to all ones if the 16-bit values in m1 are equal to the corresponding 16-bit values in m2; otherwise sets them to all zeros. __m64 _mm_cmpeq_pi32(__m64 m1, __m64 m2)

Sets the corresponding 32-bit resulting values to all ones if the 32-bit values in m1 are equal to the corresponding 32-bit values in m2; otherwise sets them to all zeros. __m64 _mm_cmpgt_pi8(__m64 m1, __m64 m2)

Sets the corresponding 8-bit resulting values to all ones if the 8-bit signed values in m1 are greater than the corresponding 8-bit signed values in m2; otherwise sets them to all zeros. __m64 _mm_cmpgt_pi16(__m64 m1, __m64 m2)

Sets the corresponding 16-bit resulting values to all ones if the 16-bit signed values in m1 are greater than the corresponding 16-bit signed values in m2; otherwise sets them to all zeros. __m64 _mm_cmpgt_pi32(__m64 m1, __m64 m2)

Sets the corresponding 32-bit resulting values to all ones if the 32-bit signed values in m1 are greater than the corresponding 32-bit signed values in m2; otherwise sets them to all zeros.


MMX™ Technology Set Intrinsics

The prototypes for MMX™ technology intrinsics are in the mmintrin.h header file.

NOTE. In the descriptions regarding the bits of the MMX register, bit 0 is the least significant and bit 63 is the most significant.

Intrinsic Name Operation Corresponding MMX Instruction

_mm_setzero_si64 set to zero PXOR

_mm_set_pi32 set integer values Composite

_mm_set_pi16 set integer values Composite

_mm_set_pi8 set integer values Composite

_mm_set1_pi32 set integer values Composite

_mm_set1_pi16 set integer values Composite

_mm_set1_pi8 set integer values Composite

_mm_setr_pi32 set integer values Composite

_mm_setr_pi16 set integer values Composite

_mm_setr_pi8 set integer values Composite

__m64 _mm_setzero_si64()

Sets the 64-bit value to zero.

R

0x0

__m64 _mm_set_pi32(int i1, int i0)


Sets the 2 signed 32-bit integer values.

R0 R1

i0 i1

__m64 _mm_set_pi16(short s3, short s2, short s1, short s0)

Sets the 4 signed 16-bit integer values.

R0 R1 R2 R3

s0 s1 s2 s3

__m64 _mm_set_pi8(char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)

Sets the 8 signed 8-bit integer values.

R0 R1 ... R7

b0 b1 ... b7

__m64 _mm_set1_pi32(int i)

Sets the 2 signed 32-bit integer values to i.

R0 R1

i i

__m64 _mm_set1_pi16(short s)

Sets the 4 signed 16-bit integer values to s.

R0 R1 R2 R3

s s s s

__m64 _mm_set1_pi8(char b)

Sets the 8 signed 8-bit integer values to b.


R0 R1 ... R7

b b ... b

__m64 _mm_setr_pi32(int i1, int i0)

Sets the 2 signed 32-bit integer values in reverse order.

R0 R1

i1 i0

__m64 _mm_setr_pi16(short s3, short s2, short s1, short s0)

Sets the 4 signed 16-bit integer values in reverse order.

R0 R1 R2 R3

s3 s2 s1 s0

__m64 _mm_setr_pi8(char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)

Sets the 8 signed 8-bit integer values in reverse order.

R0 R1 ... R7

b7 b6 ... b0

Intrinsics for Intel(R) Streaming SIMD Extensions

Overview: Intel(R) Streaming SIMD Extensions

This section describes the C++ language-level features supporting the Intel® Streaming SIMD Extensions (Intel® SSE) in the Intel® C++ Compiler. These topics explain the following features of the intrinsics:

• Floating Point Intrinsics • Arithmetic Operation Intrinsics • Logical Operation Intrinsics • Comparison Intrinsics • Conversion Intrinsics • Load Operations • Set Operations • Store Operations • Cacheability Support • Integer Intrinsics • Intrinsics to Read and Write Registers • Miscellaneous Intrinsics

The prototypes for SSE intrinsics are in the xmmintrin.h header file.

NOTE. You can also use the single ia32intrin.h header file for any IA-32 architecture-based intrinsics.

Details about Intel(R) Streaming SIMD Extension Intrinsics

The Intel® Streaming SIMD Extension (Intel® SSE) instructions use the following features:

• Registers – Enable packed data of up to 128 bits in length for optimal SIMD processing


• Data Types – Enable packing of up to 16 elements of data in one register

Registers

The Streaming SIMD Extensions use eight 128-bit registers (XMM0 to XMM7). Because each of these registers can hold more than one data element, the processor can process more than one data element simultaneously. This processing capability is known as single-instruction, multiple-data (SIMD) processing. For each computational and data manipulation instruction in the new extension sets, there is a corresponding C intrinsic that implements that instruction directly. This frees you from managing registers and writing assembly code. Further, the compiler optimizes the instruction scheduling so that your executable runs faster.

Data Types

Intrinsic functions use four new C data types as operands, representing the new registers that are used as the operands to these intrinsic functions.

New Data Types

The following table shows for which instruction-set extensions each of the new data types is available.

New Data Type    Intel® Streaming SIMD Extensions Intrinsics    Intel® Streaming SIMD Extensions 2 Intrinsics    Intel® Streaming SIMD Extensions 3 Intrinsics

__m64 Available Available Available

__m128 Available Available Available

__m128d Not available Available Available

__m128i Not available Available Available

__m128 Data Types

The __m128 data type is used to represent the contents of an Intel® SSE register used by the Intel® SSE intrinsics. The __m128 data type can hold four 32-bit floating-point values.


The __m128d data type can hold two 64-bit floating-point values. The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values. The compiler aligns __m128d and __m128i local and global data to 16-byte boundaries on the stack. To align integer, float, or double arrays, you can use the __declspec(align) declaration.

Data Types Usage Guidelines

These data types are not basic ANSI C data types. You must observe the following usage restrictions:

• Use data types only on either side of an assignment, as a return value, or as a parameter. You cannot use them in other arithmetic expressions (+, -, and so on).
• Use data types as objects in aggregates, such as unions (to access the byte elements) and structures.
• Use data types only with the respective intrinsics described in this documentation.

Accessing __m128i Data

To access 8-bit data:

#define _mm_extract_epi8(x, imm) \
    ((((imm) & 0x1) == 0) ? \
    _mm_extract_epi16((x), (imm) >> 1) & 0xff : \
    _mm_extract_epi16(_mm_srli_epi16((x), 8), (imm) >> 1))

For 16-bit data, use the following intrinsic:

int _mm_extract_epi16(__m128i a, int imm)

To access 32-bit data:

#define _mm_extract_epi32(x, imm) \
    _mm_cvtsi128_si32(_mm_srli_si128((x), 4 * (imm)))

To access 64-bit data (Intel® 64 architecture only):

#define _mm_extract_epi64(x, imm) \
    _mm_cvtsi128_si64(_mm_srli_si128((x), 8 * (imm)))


Writing Programs with Intel(R) Streaming SIMD Extensions Intrinsics

You should be familiar with the hardware features provided by the Intel® Streaming SIMD Extensions (Intel® SSE) when writing programs with the intrinsics. The following are four important issues to keep in mind:

• Certain intrinsics, such as _mm_loadr_ps and _mm_cmpgt_ss, are not directly supported by the instruction set. While these intrinsics are convenient programming aids, be mindful that they may consist of more than one machine-language instruction.
• Floating-point data loaded or stored as __m128 objects must generally be 16-byte-aligned.
• Some intrinsics require that their arguments be immediates, that is, constant integers (literals), due to the nature of the instruction.
• The result of arithmetic operations acting on two NaN (Not a Number) arguments is undefined. Therefore, FP operations using NaN arguments may not match the expected behavior of the corresponding assembly instructions.

Arithmetic Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for arithmetic operations are in the xmmintrin.h header file. The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the four 32-bit pieces of the result register.

Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_add_ss Addition ADDSS

_mm_add_ps Addition ADDPS

_mm_sub_ss Subtraction SUBSS

_mm_sub_ps Subtraction SUBPS

_mm_mul_ss Multiplication MULSS


Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_mul_ps Multiplication MULPS

_mm_div_ss Division DIVSS

_mm_div_ps Division DIVPS

_mm_sqrt_ss Square Root SQRTSS

_mm_sqrt_ps Square Root SQRTPS

_mm_rcp_ss Reciprocal RCPSS

_mm_rcp_ps Reciprocal RCPPS

_mm_rsqrt_ss Reciprocal Square Root RSQRTSS

_mm_rsqrt_ps Reciprocal Square Root RSQRTPS

_mm_min_ss Computes Minimum MINSS

_mm_min_ps Computes Minimum MINPS

_mm_max_ss Computes Maximum MAXSS

_mm_max_ps Computes Maximum MAXPS

__m128 _mm_add_ss(__m128 a, __m128 b)

Adds the lower single-precision, floating-point (SP FP) values of a and b; the upper 3 SP FP values are passed through from a.

R0 R1 R2 R3

a0 + b0 a1 a2 a3

__m128 _mm_add_ps(__m128 a, __m128 b)

Adds the four SP FP values of a and b.


R0 R1 R2 R3

a0 + b0 a1 + b1 a2 + b2 a3 + b3

__m128 _mm_sub_ss(__m128 a, __m128 b)

Subtracts the lower SP FP values of a and b. The upper 3 SP FP values are passed through from a.

R0 R1 R2 R3

a0 - b0 a1 a2 a3

__m128 _mm_sub_ps(__m128 a, __m128 b)

Subtracts the four SP FP values of a and b.

R0 R1 R2 R3

a0 - b0 a1 - b1 a2 - b2 a3 - b3

__m128 _mm_mul_ss(__m128 a, __m128 b)

Multiplies the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.

R0 R1 R2 R3

a0 * b0 a1 a2 a3

__m128 _mm_mul_ps(__m128 a, __m128 b)

Multiplies the four SP FP values of a and b.

R0 R1 R2 R3

a0 * b0 a1 * b1 a2 * b2 a3 * b3

__m128 _mm_div_ss(__m128 a, __m128 b )

Divides the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.


R0 R1 R2 R3

a0 / b0 a1 a2 a3

__m128 _mm_div_ps(__m128 a, __m128 b)

Divides the four SP FP values of a and b.

R0 R1 R2 R3

a0 / b0 a1 / b1 a2 / b2 a3 / b3

__m128 _mm_sqrt_ss(__m128 a)

Computes the square root of the lower SP FP value of a ; the upper 3 SP FP values are passed through.

R0 R1 R2 R3

sqrt(a0) a1 a2 a3

__m128 _mm_sqrt_ps(__m128 a)

Computes the square roots of the four SP FP values of a.

R0 R1 R2 R3

sqrt(a0) sqrt(a1) sqrt(a2) sqrt(a3)

__m128 _mm_rcp_ss(__m128 a)

Computes the approximation of the reciprocal of the lower SP FP value of a; the upper 3 SP FP values are passed through.

R0 R1 R2 R3

recip(a0) a1 a2 a3

__m128 _mm_rcp_ps(__m128 a)

Computes the approximations of reciprocals of the four SP FP values of a.


R0 R1 R2 R3

recip(a0) recip(a1) recip(a2) recip(a3)

__m128 _mm_rsqrt_ss(__m128 a)

Computes the approximation of the reciprocal of the square root of the lower SP FP value of a; the upper 3 SP FP values are passed through.

R0 R1 R2 R3

recip(sqrt(a0)) a1 a2 a3

__m128 _mm_rsqrt_ps(__m128 a)

Computes the approximations of the reciprocals of the square roots of the four SP FP values of a.

R0 R1 R2 R3

recip(sqrt(a0)) recip(sqrt(a1)) recip(sqrt(a2)) recip(sqrt(a3))

__m128 _mm_min_ss(__m128 a, __m128 b)

Computes the minimum of the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.

R0 R1 R2 R3

min(a0, b0) a1 a2 a3

__m128 _mm_min_ps(__m128 a, __m128 b)

Computes the minimum of the four SP FP values of a and b.

R0 R1 R2 R3

min(a0, b0) min(a1, b1) min(a2, b2) min(a3, b3)

__m128 _mm_max_ss(__m128 a, __m128 b)

Computes the maximum of the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.


R0 R1 R2 R3

max(a0, b0) a1 a2 a3

__m128 _mm_max_ps(__m128 a, __m128 b)

Computes the maximum of the four SP FP values of a and b.

R0 R1 R2 R3

max(a0, b0) max(a1, b1) max(a2, b2) max(a3, b3)

Logical Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for logical operations are in the xmmintrin.h header file. The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the four 32-bit pieces of the result register.

Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_and_ps Bitwise AND ANDPS

_mm_andnot_ps Bitwise ANDNOT ANDNPS

_mm_or_ps Bitwise OR ORPS

_mm_xor_ps Bitwise Exclusive OR XORPS

__m128 _mm_and_ps(__m128 a, __m128 b)

Computes the bitwise AND of the four SP FP values of a and b.

R0 R1 R2 R3

a0 & b0 a1 & b1 a2 & b2 a3 & b3

__m128 _mm_andnot_ps(__m128 a, __m128 b)


Computes the bitwise AND-NOT of the four SP FP values of a and b.

R0 R1 R2 R3

~a0 & b0 ~a1 & b1 ~a2 & b2 ~a3 & b3

__m128 _mm_or_ps(__m128 a, __m128 b)

Computes the bitwise OR of the four SP FP values of a and b.

R0 R1 R2 R3

a0 | b0 a1 | b1 a2 | b2 a3 | b3

__m128 _mm_xor_ps(__m128 a, __m128 b)

Computes bitwise XOR (exclusive-or) of the four SP FP values of a and b.

R0 R1 R2 R3

a0 ^ b0 a1 ^ b1 a2 ^ b2 a3 ^ b3

Compare Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for comparison operations are in the xmmintrin.h header file. Each comparison intrinsic performs a comparison of a and b. For the packed form, the four SP FP values of a and b are compared, and a 128-bit mask is returned. For the scalar form, the lower SP FP values of a and b are compared, and a 32-bit mask is returned; the upper three SP FP values are passed through from a. The mask is set to 0xffffffff for each element where the comparison is true and 0x0 where the comparison is false. The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R or R0-R3. R0, R1, R2 and R3 each represent one of the four 32-bit pieces of the result register.

Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_cmpeq_ss Equal CMPEQSS


Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_cmpeq_ps Equal CMPEQPS

_mm_cmplt_ss Less Than CMPLTSS

_mm_cmplt_ps Less Than CMPLTPS

_mm_cmple_ss Less Than or Equal CMPLESS

_mm_cmple_ps Less Than or Equal CMPLEPS

_mm_cmpgt_ss Greater Than CMPLTSS

_mm_cmpgt_ps Greater Than CMPLTPS

_mm_cmpge_ss Greater Than or Equal CMPLESS

_mm_cmpge_ps Greater Than or Equal CMPLEPS

_mm_cmpneq_ss Not Equal CMPNEQSS

_mm_cmpneq_ps Not Equal CMPNEQPS

_mm_cmpnlt_ss Not Less Than CMPNLTSS

_mm_cmpnlt_ps Not Less Than CMPNLTPS

_mm_cmpnle_ss Not Less Than or Equal CMPNLESS

_mm_cmpnle_ps Not Less Than or Equal CMPNLEPS

_mm_cmpngt_ss Not Greater Than CMPNLTSS

_mm_cmpngt_ps Not Greater Than CMPNLTPS

_mm_cmpnge_ss Not Greater Than or Equal CMPNLESS

_mm_cmpnge_ps Not Greater Than or Equal CMPNLEPS

_mm_cmpord_ss Ordered CMPORDSS


Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_cmpord_ps Ordered CMPORDPS

_mm_cmpunord_ss Unordered CMPUNORDSS

_mm_cmpunord_ps Unordered CMPUNORDPS

_mm_comieq_ss Equal COMISS

_mm_comilt_ss Less Than COMISS

_mm_comile_ss Less Than or Equal COMISS

_mm_comigt_ss Greater Than COMISS

_mm_comige_ss Greater Than or Equal COMISS

_mm_comineq_ss Not Equal COMISS

_mm_ucomieq_ss Equal UCOMISS

_mm_ucomilt_ss Less Than UCOMISS

_mm_ucomile_ss Less Than or Equal UCOMISS

_mm_ucomigt_ss Greater Than UCOMISS

_mm_ucomige_ss Greater Than or Equal UCOMISS

_mm_ucomineq_ss Not Equal UCOMISS

__m128 _mm_cmpeq_ss(__m128 a, __m128 b)

Compares for equality.

R0 R1 R2 R3

(a0 == b0) ? 0xffffffff : 0x0    a1    a2    a3


__m128 _mm_cmpeq_ps(__m128 a, __m128 b)

Compares for equality.

R0 R1 R2 R3

(a0 == b0) ? 0xffffffff : 0x0    (a1 == b1) ? 0xffffffff : 0x0    (a2 == b2) ? 0xffffffff : 0x0    (a3 == b3) ? 0xffffffff : 0x0

__m128 _mm_cmplt_ss(__m128 a, __m128 b)

Compares for less-than.

R0 R1 R2 R3

(a0 < b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmplt_ps(__m128 a, __m128 b)

Compares for less-than.

R0 R1 R2 R3

(a0 < b0) ? 0xffffffff : 0x0    (a1 < b1) ? 0xffffffff : 0x0    (a2 < b2) ? 0xffffffff : 0x0    (a3 < b3) ? 0xffffffff : 0x0

__m128 _mm_cmple_ss(__m128 a, __m128 b)

Compares for less-than-or-equal.

R0 R1 R2 R3

(a0 <= b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmple_ps(__m128 a, __m128 b)

Compares for less-than-or-equal.


R0 R1 R2 R3

(a0 <= b0) ? 0xffffffff : 0x0    (a1 <= b1) ? 0xffffffff : 0x0    (a2 <= b2) ? 0xffffffff : 0x0    (a3 <= b3) ? 0xffffffff : 0x0

__m128 _mm_cmpgt_ss(__m128 a, __m128 b)

Compares for greater-than.

R0 R1 R2 R3

(a0 > b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmpgt_ps(__m128 a, __m128 b)

Compares for greater-than.

R0 R1 R2 R3

(a0 > b0) ? 0xffffffff : 0x0    (a1 > b1) ? 0xffffffff : 0x0    (a2 > b2) ? 0xffffffff : 0x0    (a3 > b3) ? 0xffffffff : 0x0

__m128 _mm_cmpge_ss(__m128 a, __m128 b)

Compares for greater-than-or-equal.

R0 R1 R2 R3

(a0 >= b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmpge_ps(__m128 a, __m128 b)

Compares for greater-than-or-equal.

R0 R1 R2 R3

(a0 >= b0) ? 0xffffffff : 0x0    (a1 >= b1) ? 0xffffffff : 0x0    (a2 >= b2) ? 0xffffffff : 0x0    (a3 >= b3) ? 0xffffffff : 0x0

__m128 _mm_cmpneq_ss(__m128 a, __m128 b)


Compares for inequality.

R0 R1 R2 R3

(a0 != b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmpneq_ps(__m128 a, __m128 b)

Compares for inequality.

R0 R1 R2 R3

(a0 != b0) ? 0xffffffff : 0x0    (a1 != b1) ? 0xffffffff : 0x0    (a2 != b2) ? 0xffffffff : 0x0    (a3 != b3) ? 0xffffffff : 0x0

__m128 _mm_cmpnlt_ss(__m128 a, __m128 b)

Compares for not-less-than.

R0 R1 R2 R3

!(a0 < b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmpnlt_ps(__m128 a, __m128 b)

Compares for not-less-than.

R0 R1 R2 R3

!(a0 < b0) ? 0xffffffff : 0x0    !(a1 < b1) ? 0xffffffff : 0x0    !(a2 < b2) ? 0xffffffff : 0x0    !(a3 < b3) ? 0xffffffff : 0x0

__m128 _mm_cmpnle_ss(__m128 a, __m128 b)

Compares for not-less-than-or-equal.

R0 R1 R2 R3

!(a0 <= b0) ? 0xffffffff : 0x0    a1    a2    a3


__m128 _mm_cmpnle_ps(__m128 a, __m128 b)

Compares for not-less-than-or-equal.

R0 R1 R2 R3

!(a0 <= b0) ? 0xffffffff : 0x0    !(a1 <= b1) ? 0xffffffff : 0x0    !(a2 <= b2) ? 0xffffffff : 0x0    !(a3 <= b3) ? 0xffffffff : 0x0

__m128 _mm_cmpngt_ss(__m128 a, __m128 b)

Compares for not-greater-than.

R0 R1 R2 R3

!(a0 > b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmpngt_ps(__m128 a, __m128 b)

Compares for not-greater-than.

R0 R1 R2 R3

!(a0 > b0) ? 0xffffffff : 0x0    !(a1 > b1) ? 0xffffffff : 0x0    !(a2 > b2) ? 0xffffffff : 0x0    !(a3 > b3) ? 0xffffffff : 0x0

__m128 _mm_cmpnge_ss(__m128 a, __m128 b)

Compares for not-greater-than-or-equal.

R0 R1 R2 R3

!(a0 >= b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmpnge_ps(__m128 a, __m128 b)

Compares for not-greater-than-or-equal.


R0 R1 R2 R3

!(a0 >= b0) ? 0xffffffff : 0x0    !(a1 >= b1) ? 0xffffffff : 0x0    !(a2 >= b2) ? 0xffffffff : 0x0    !(a3 >= b3) ? 0xffffffff : 0x0

__m128 _mm_cmpord_ss(__m128 a, __m128 b)

Compares for ordered.

R0 R1 R2 R3

(a0 ord? b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmpord_ps(__m128 a, __m128 b)

Compares for ordered.

R0 R1 R2 R3

(a0 ord? b0) ? 0xffffffff : 0x0    (a1 ord? b1) ? 0xffffffff : 0x0    (a2 ord? b2) ? 0xffffffff : 0x0    (a3 ord? b3) ? 0xffffffff : 0x0

__m128 _mm_cmpunord_ss(__m128 a, __m128 b)

Compares for unordered.

R0 R1 R2 R3

(a0 unord? b0) ? 0xffffffff : 0x0    a1    a2    a3

__m128 _mm_cmpunord_ps(__m128 a, __m128 b)

Compares for unordered.

R0 R1 R2 R3

(a0 unord? b0) ? 0xffffffff : 0x0    (a1 unord? b1) ? 0xffffffff : 0x0    (a2 unord? b2) ? 0xffffffff : 0x0    (a3 unord? b3) ? 0xffffffff : 0x0

int _mm_comieq_ss(__m128 a, __m128 b)


Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.

R

(a0 == b0) ? 0x1 : 0x0

int _mm_comilt_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.

R

(a0 < b0) ? 0x1 : 0x0

int _mm_comile_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.

R

(a0 <= b0) ? 0x1 : 0x0

int _mm_comigt_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.

R

(a0 > b0) ? 0x1 : 0x0

int _mm_comige_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.

R

(a0 >= b0) ? 0x1 : 0x0

int _mm_comineq_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.

R

(a0 != b0) ? 0x1 : 0x0

int _mm_ucomieq_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.

R

(a0 == b0) ? 0x1 : 0x0

int _mm_ucomilt_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.

R

(a0 < b0) ? 0x1 : 0x0

int _mm_ucomile_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.

R

(a0 <= b0) ? 0x1 : 0x0

int _mm_ucomigt_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.


R

(a0 > b0) ? 0x1 : 0x0

int _mm_ucomige_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.

R

(a0 >= b0) ? 0x1 : 0x0

int _mm_ucomineq_ss(__m128 a, __m128 b)

Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.

R

(a0 != b0) ? 0x1 : 0x0

Conversion Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for conversion operations are in the xmmintrin.h header file. The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R or R0-R3. R0, R1, R2 and R3 each represent one of the four 32-bit pieces of the result register. Intrinsics marked with * are implemented only on Intel® 64 architectures. The rest of the intrinsics are implemented on both IA-32 and Intel® 64 architectures.

Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_cvtss_si32 Convert to 32-bit integer CVTSS2SI

_mm_cvtss_si64* Convert to 64-bit integer CVTSS2SI


Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_cvtps_pi32 Convert to two 32-bit integers CVTPS2PI

_mm_cvttss_si32 Convert to 32-bit integer CVTTSS2SI

_mm_cvttss_si64* Convert to 64-bit integer CVTTSS2SI

_mm_cvttps_pi32 Convert to two 32-bit integers CVTTPS2PI

_mm_cvtsi32_ss Convert from 32-bit integer CVTSI2SS

_mm_cvtsi64_ss* Convert from 64-bit integer CVTSI2SS

_mm_cvtpi32_ps Convert from two 32-bit integers CVTPI2PS

_mm_cvtpi16_ps Convert from four 16-bit integers composite

_mm_cvtpu16_ps Convert from four 16-bit unsigned integers composite

_mm_cvtpi8_ps Convert from four 8-bit integers composite

_mm_cvtpu8_ps Convert from four 8-bit unsigned integers composite

_mm_cvtpi32x2_ps Convert from four 32-bit integers composite

_mm_cvtps_pi16 Convert to four 16-bit integers composite

_mm_cvtps_pi8 Convert to four 8-bit integers composite

_mm_cvtss_f32 Extract lowest SP FP value composite


int _mm_cvtss_si32(__m128 a)

Converts the lower SP FP value of a to a 32-bit integer according to the current rounding mode.

R

(int)a0

__int64 _mm_cvtss_si64(__m128 a)

Converts the lower SP FP value of a to a 64-bit signed integer according to the current rounding mode. Use only on Intel® 64 architectures.

R

(__int64)a0

__m64 _mm_cvtps_pi32(__m128 a)

Converts the two lower SP FP values of a to two 32-bit integers according to the current rounding mode, returning the integers in packed form.

R0 R1

(int)a0 (int)a1

int _mm_cvttss_si32(__m128 a)

Converts the lower SP FP value of a to a 32-bit integer with truncation.

R

(int)a0

__int64 _mm_cvttss_si64(__m128 a)

Converts the lower SP FP value of a to a 64-bit signed integer with truncation. Use only on Intel® 64 architectures.

R

(__int64)a0


__m64 _mm_cvttps_pi32(__m128 a)

Converts the two lower SP FP values of a to two 32-bit integers with truncation, returning the integers in packed form.

R0 R1

(int)a0 (int)a1

__m128 _mm_cvtsi32_ss(__m128 a, int b)

Converts the 32-bit integer value b to an SP FP value; the upper three SP FP values are passed through from a.

R0 R1 R2 R3

(float)b a1 a2 a3

__m128 _mm_cvtsi64_ss(__m128 a, __int64 b)

Converts the signed 64-bit integer value b to an SP FP value; the upper three SP FP values are passed through from a. Use only on Intel® 64 architectures.

R0 R1 R2 R3

(float)b a1 a2 a3

__m128 _mm_cvtpi32_ps(__m128 a, __m64 b)

Converts the two 32-bit integer values in packed form in b to two SP FP values; the upper two SP FP values are passed through from a.

R0 R1 R2 R3

(float)b0 (float)b1 a2 a3

__m128 _mm_cvtpi16_ps(__m64 a)

Converts the four 16-bit signed integer values in a to four single precision FP values.

R0 R1 R2 R3

(float)a0 (float)a1 (float)a2 (float)a3


__m128 _mm_cvtpu16_ps(__m64 a)

Converts the four 16-bit unsigned integer values in a to four single precision FP values.

R0 R1 R2 R3

(float)a0 (float)a1 (float)a2 (float)a3

__m128 _mm_cvtpi8_ps(__m64 a)

Converts the lower four 8-bit signed integer values in a to four single precision FP values.

R0 R1 R2 R3

(float)a0 (float)a1 (float)a2 (float)a3

__m128 _mm_cvtpu8_ps(__m64 a)

Converts the lower four 8-bit unsigned integer values in a to four single precision FP values.

R0 R1 R2 R3

(float)a0 (float)a1 (float)a2 (float)a3

__m128 _mm_cvtpi32x2_ps(__m64 a, __m64 b)

Converts the two 32-bit signed integer values in a and the two 32-bit signed integer values in b to four single precision FP values.

R0 R1 R2 R3

(float)a0 (float)a1 (float)b0 (float)b1

__m64 _mm_cvtps_pi16(__m128 a)

Converts the four single precision FP values in a to four signed 16-bit integer values.

R0 R1 R2 R3

(short)a0 (short)a1 (short)a2 (short)a3

__m64 _mm_cvtps_pi8(__m128 a)


Converts the four single precision FP values in a to the lower four signed 8-bit integer values of the result.

R0 R1 R2 R3

(char)a0 (char)a1 (char)a2 (char)a3

float _mm_cvtss_f32(__m128 a)

Extracts a single precision floating point value from the first vector element of an __m128. It does so in the most efficient manner possible in the context used.

Load Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for load operations are in the xmmintrin.h header file. The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the four 32-bit pieces of the result register.

Intrinsic Name Operation Corresponding Intel® SSE Instructions

_mm_loadh_pi Load high MOVHPS reg, mem

_mm_loadl_pi Load low MOVLPS reg, mem

_mm_load_ss Load the low value and clear the three high values MOVSS

_mm_load1_ps Load one value into all four words MOVSS + Shuffling

_mm_load_ps Load four values, address aligned MOVAPS

_mm_loadu_ps Load four values, address unaligned MOVUPS

_mm_loadr_ps Load four values in reverse MOVAPS + Shuffling


__m128 _mm_loadh_pi(__m128 a, __m64 const *p)

Sets the upper two SP FP values with 64 bits of data loaded from the address p.

R0 R1 R2 R3

a0 a1 *p0 *p1

__m128 _mm_loadl_pi(__m128 a, __m64 const *p)

Sets the lower two SP FP values with 64 bits of data loaded from the address p; the upper two values are passed through from a.

R0 R1 R2 R3

*p0 *p1 a2 a3

__m128 _mm_load_ss(float * p )

Loads an SP FP value into the low word and clears the upper three words.

R0 R1 R2 R3

*p 0.0 0.0 0.0

__m128 _mm_load1_ps(float * p)

Loads a single SP FP value, copying it into all four words.

R0 R1 R2 R3

*p *p *p *p

__m128 _mm_load_ps(float * p )

Loads four SP FP values. The address must be 16-byte-aligned.

R0 R1 R2 R3

p[0] p[1] p[2] p[3]

__m128 _mm_loadu_ps(float * p)

Loads four SP FP values. The address need not be 16-byte-aligned.


R0 R1 R2 R3

p[0] p[1] p[2] p[3]

__m128 _mm_loadr_ps(float * p)

Loads four SP FP values in reverse order. The address must be 16-byte-aligned.

R0 R1 R2 R3

p[3] p[2] p[1] p[0]

Set Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for set operations are in the xmmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R0, R1, R2 and R3 represent the registers in which results are placed.

Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_set_ss Set the low value and clear the three high values Composite

_mm_set1_ps Set all four words with the same value Composite

_mm_set_ps Set four values, address aligned Composite

_mm_setr_ps Set four values, in reverse order Composite

_mm_setzero_ps Clear all four values Composite

__m128 _mm_set_ss(float w )

Sets the low word of an SP FP value to w and clears the upper three words.


R0 R1 R2 R3

w 0.0 0.0 0.0

__m128 _mm_set1_ps(float w )

Sets the four SP FP values to w.

R0 R1 R2 R3

w w w w

__m128 _mm_set_ps(float z, float y, float x, float w )

Sets the four SP FP values to the four inputs.

R0 R1 R2 R3

w x y z

__m128 _mm_setr_ps (float z, float y, float x, float w )

Sets the four SP FP values to the four inputs in reverse order.

R0 R1 R2 R3

z y x w

__m128 _mm_setzero_ps (void)

Clears the four SP FP values.

R0 R1 R2 R3

0.0 0.0 0.0 0.0

Store Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for store operations are in the xmmintrin.h header file.


The description for each intrinsic contains a table detailing the returns. In these tables, p[n] is an access to the nth element of the result.

Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_storeh_pi Store high MOVHPS mem, reg

_mm_storel_pi Store low MOVLPS mem, reg

_mm_store_ss Store the low value MOVSS

_mm_store1_ps Store the low value across all four words, address aligned Shuffling + MOVSS

_mm_store_ps Store four values, address aligned MOVAPS

_mm_storeu_ps Store four values, address unaligned MOVUPS

_mm_storer_ps Store four values, in reverse order MOVAPS + Shuffling

void _mm_storeh_pi(__m64 *p, __m128 a)

Stores the upper two SP FP values to the address p.

*p0 *p1

a2 a3

void _mm_storel_pi(__m64 *p, __m128 a)

Stores the lower two SP FP values of a to the address p.

*p0 *p1

a0 a1

void _mm_store_ss(float *p, __m128 a)

Stores the lower SP FP value.


*p

a0

void _mm_store1_ps(float * p, __m128 a )

Stores the lower SP FP value across four words.

p[0] p[1] p[2] p[3]

a0 a0 a0 a0

void _mm_store_ps(float *p, __m128 a)

Stores four SP FP values. The address must be 16-byte-aligned.

p[0] p[1] p[2] p[3]

a0 a1 a2 a3

void _mm_storeu_ps(float *p, __m128 a)

Stores four SP FP values. The address need not be 16-byte-aligned.

p[0] p[1] p[2] p[3]

a0 a1 a2 a3

void _mm_storer_ps(float * p, __m128 a )

Stores four SP FP values in reverse order. The address must be 16-byte-aligned.

p[0] p[1] p[2] p[3]

a3 a2 a1 a0

Cacheability Support Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for cacheability support are in the xmmintrin.h header file.


Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_prefetch Load PREFETCH

_mm_stream_pi Store MOVNTQ

_mm_stream_ps Store MOVNTPS

_mm256_stream_ps Store VMOVNTPS

_mm_sfence Store fence SFENCE

void _mm_prefetch(char const *a, int sel)

Loads one cache line of data from address a to a location "closer" to the processor. The value sel specifies the type of prefetch operation: the constants _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA should be used for systems based on IA-32 architecture, and correspond to the type of prefetch instruction.

void _mm_stream_pi(__m64 *p, __m64 a)

Stores the data in a to the address p without polluting the caches. This intrinsic requires you to empty the multimedia state for the MMX™ register. See the topic The EMMS Instruction: Why You Need It.

void _mm_stream_ps(float *p, __m128 a)

Stores the data in a to the address p without polluting the caches. The address must be 16-byte-aligned.

void _mm256_stream_ps(float *p, __m256 a)

Stores the data in a to the address p without polluting the caches. The address must be 32-byte-aligned (VEX.256 encoded version).

void _mm_sfence(void)

Guarantees that every preceding store is globally visible before any subsequent store.

Integer Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for integer operations are in the xmmintrin.h header file.


The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1...R7 represent the registers in which results are placed. Before using these intrinsics, you must empty the multimedia state for the MMX™ technology register. See The EMMS Instruction: Why You Need It for more details.

Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_extract_pi16 Extract one of four words PEXTRW

_mm_insert_pi16 Insert word PINSRW

_mm_max_pi16 Compute maximum PMAXSW

_mm_max_pu8 Compute maximum, unsigned PMAXUB

_mm_min_pi16 Compute minimum PMINSW

_mm_min_pu8 Compute minimum, unsigned PMINUB

_mm_movemask_pi8 Create eight-bit mask PMOVMSKB

_mm_mulhi_pu16 Multiply, return high bits PMULHUW

_mm_shuffle_pi16 Return a combination of four words PSHUFW

_mm_maskmove_si64 Conditional Store MASKMOVQ

_mm_avg_pu8 Compute rounded average PAVGB

_mm_avg_pu16 Compute rounded average PAVGW

_mm_sad_pu8 Compute sum of absolute differences PSADBW

int _mm_extract_pi16(__m64 a, int n)

Extracts one of the four words of a. The selector n must be an immediate.


R

(n==0) ? a0 : ( (n==1) ? a1 : ( (n==2) ? a2 : a3 ) )

__m64 _mm_insert_pi16(__m64 a, int d, int n)

Inserts word d into one of four words of a. The selector n must be an immediate.

R0 R1 R2 R3

(n==0) ? d : a0; (n==1) ? d : a1; (n==2) ? d : a2; (n==3) ? d : a3;

__m64 _mm_max_pi16(__m64 a, __m64 b)

Computes the element-wise maximum of the words in a and b.

R0 R1 R2 R3

max(a0, b0) max(a1, b1) max(a2, b2) max(a3, b3)

__m64 _mm_max_pu8(__m64 a, __m64 b)

Computes the element-wise maximum of the unsigned bytes in a and b.

R0 R1 ... R7

max(a0, b0) max(a1, b1) ... max(a7, b7)

__m64 _mm_min_pi16(__m64 a, __m64 b)

Computes the element-wise minimum of the words in a and b.

R0 R1 R2 R3

min(a0, b0) min(a1, b1) min(a2, b2) min(a3, b3)

__m64 _mm_min_pu8(__m64 a, __m64 b)

Computes the element-wise minimum of the unsigned bytes in a and b.

R0 R1 ... R7

min(a0, b0) min(a1, b1) ... min(a7, b7)


int _mm_movemask_pi8(__m64 a)

Creates an 8-bit mask from the most significant bits of the bytes in a.

R

sign(a7)<<7 | sign(a6)<<6 | ... | sign(a0)

__m64 _mm_mulhi_pu16(__m64 a, __m64 b)

Multiplies the unsigned words in a and b, returning the upper 16 bits of the 32-bit intermediate results.

R0 R1 R2 R3

hiword(a0 * b0) hiword(a1 * b1) hiword(a2 * b2) hiword(a3 * b3)

__m64 _mm_shuffle_pi16(__m64 a, int n)

Returns a combination of the four words of a. The selector n must be an immediate.

R0 R1 R2 R3

word (n&0x3) of a word ((n>>2)&0x3) of a word ((n>>4)&0x3) of a word ((n>>6)&0x3) of a

void _mm_maskmove_si64(__m64 d, __m64 n, char *p)

Conditionally stores byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored.

if (sign(n0)) if (sign(n1)) ... if (sign(n7))

p[0] := d0 p[1] := d1 ... p[7] := d7

__m64 _mm_avg_pu8(__m64 a, __m64 b)

Computes the (rounded) averages of the unsigned bytes in a and b.


R0: (t >> 1) | (t & 0x01), where t = (unsigned char)a0 + (unsigned char)b0
R1: (t >> 1) | (t & 0x01), where t = (unsigned char)a1 + (unsigned char)b1
...
R7: (t >> 1) | (t & 0x01), where t = (unsigned char)a7 + (unsigned char)b7

__m64 _mm_avg_pu16(__m64 a, __m64 b)

Computes the (rounded) averages of the unsigned words in a and b.

R0: (t >> 1) | (t & 0x01), where t = (unsigned int)a0 + (unsigned int)b0
R1: (t >> 1) | (t & 0x01), where t = (unsigned int)a1 + (unsigned int)b1
R2: (t >> 1) | (t & 0x01), where t = (unsigned int)a2 + (unsigned int)b2
R3: (t >> 1) | (t & 0x01), where t = (unsigned int)a3 + (unsigned int)b3

__m64 _mm_sad_pu8(__m64 a, __m64 b)

Computes the sum of the absolute differences of the unsigned bytes in a and b, returning the value in the lower word. The upper three words are cleared.

R0 R1 R2 R3

abs(a0-b0) + ... + abs(a7-b7) 0 0 0

Intrinsics to Read and Write Registers

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics to read from and write to registers are in the xmmintrin.h header file.

Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_getcsr Return control register STMXCSR


_mm_setcsr Set control register LDMXCSR

unsigned int _mm_getcsr(void)

Returns the contents of the control register.

void _mm_setcsr(unsigned int i)

Sets the control register to the value specified.

Miscellaneous Intrinsics

The prototypes for Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for miscellaneous operations are in the xmmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1, R2 and R3 represent the registers in which results are placed.

Intrinsic Name Operation Corresponding Intel® SSE Instruction

_mm_shuffle_ps Shuffle SHUFPS

_mm_unpackhi_ps Unpack High UNPCKHPS

_mm_unpacklo_ps Unpack Low UNPCKLPS

_mm_move_ss Set low word, pass in three high values MOVSS

_mm_movehl_ps Move High to Low MOVHLPS

_mm_movelh_ps Move Low to High MOVLHPS

_mm_movemask_ps Create four-bit mask MOVMSKPS

__m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int imm8)


Selects four specific SP FP values from a and b, based on the mask imm8. The mask must be an immediate. See Macro Function for Shuffle Using Streaming SIMD Extensions for a description of the shuffle semantics.

__m128 _mm_unpackhi_ps(__m128 a, __m128 b)

Selects and interleaves the upper two SP FP values from a and b.

R0 R1 R2 R3

a2 b2 a3 b3

__m128 _mm_unpacklo_ps(__m128 a, __m128 b)

Selects and interleaves the lower two SP FP values from a and b.

R0 R1 R2 R3

a0 b0 a1 b1

__m128 _mm_move_ss( __m128 a, __m128 b)

Sets the low word to the SP FP value of b. The upper 3 SP FP values are passed through from a.

R0 R1 R2 R3

b0 a1 a2 a3

__m128 _mm_movehl_ps(__m128 a, __m128 b)

Moves the upper 2 SP FP values of b to the lower 2 SP FP values of the result. The upper 2 SP FP values of a are passed through to the result.

R0 R1 R2 R3

b2 b3 a2 a3

__m128 _mm_movelh_ps(__m128 a, __m128 b)

Moves the lower 2 SP FP values of b to the upper 2 SP FP values of the result. The lower 2 SP FP values of a are passed through to the result.


R0 R1 R2 R3

a0 a1 b0 b1

int _mm_movemask_ps(__m128 a)

Creates a 4-bit mask from the most significant bits of the four SP FP values.

R

sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)

Macro Functions

Macro Function for Shuffle Operations

The Intel® Streaming SIMD Extensions (Intel® SSE) provide a macro function to help create constants that describe shuffle operations. The macro takes four small integers (in the range of 0 to 3) and combines them into an 8-bit immediate value used by the SHUFPS instruction.

Shuffle Function Macro

You can view the four integers as selectors for choosing which two words from the first input operand and which two words from the second are to be put into the result word.


View of Original and Result Words with Shuffle Function Macro

Macro Functions to Read and Write Control Registers

The following macro functions enable you to read and write bits to and from the control register.

Exception State Macros: _MM_SET_EXCEPTION_STATE(x), _MM_GET_EXCEPTION_STATE()

Macro Arguments: _MM_EXCEPT_INVALID, _MM_EXCEPT_DIV_ZERO, _MM_EXCEPT_DENORM, _MM_EXCEPT_OVERFLOW, _MM_EXCEPT_UNDERFLOW, _MM_EXCEPT_INEXACT

Macro Definitions: These macros write to and read from the six least significant control register bits, respectively.

The following example tests for a divide-by-zero exception.


Exception State Macros with _MM_EXCEPT_DIV_ZERO

Exception Mask Macros: _MM_SET_EXCEPTION_MASK(x), _MM_GET_EXCEPTION_MASK()

Macro Arguments: _MM_MASK_INVALID, _MM_MASK_DIV_ZERO, _MM_MASK_DENORM, _MM_MASK_OVERFLOW, _MM_MASK_UNDERFLOW, _MM_MASK_INEXACT

Macro Definitions: These macros write to and read from bits 7 through 12 of the control register, respectively.

NOTE. All six exception mask bits are always affected. Bits not set explicitly are cleared.

To mask the overflow and underflow exceptions and unmask all other exceptions, use the macros as follows:

_MM_SET_EXCEPTION_MASK(_MM_MASK_OVERFLOW | _MM_MASK_UNDERFLOW)

The following table lists the macros to set and get rounding modes, and the macro arguments that can be passed with the macros.

770 27

Rounding Mode Macros: _MM_SET_ROUNDING_MODE(x), _MM_GET_ROUNDING_MODE()

Macro Arguments: _MM_ROUND_NEAREST, _MM_ROUND_DOWN, _MM_ROUND_UP, _MM_ROUND_TOWARD_ZERO

Macro Definition: These macros write to and read from bits 13 and 14 of the control register.

To test the rounding mode for round toward zero, use the _MM_ROUND_TOWARD_ZERO macro as follows:

if (_MM_GET_ROUNDING_MODE() == _MM_ROUND_TOWARD_ZERO) { /* Rounding mode is round toward zero */ }

The following table lists the macros to set and get the flush-to-zero mode and the macro arguments that can be used.

Flush-to-Zero Mode Macros: _MM_SET_FLUSH_ZERO_MODE(x), _MM_GET_FLUSH_ZERO_MODE()

Macro Arguments: _MM_FLUSH_ZERO_ON, _MM_FLUSH_ZERO_OFF

Macro Definition: These macros write to and read from bit 15 of the control register.

To disable the flush-to-zero mode, use the _MM_FLUSH_ZERO_OFF macro:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF)

See Also • Intrinsics to Read and Write Registers


Macro Function for Matrix Transposition

The Intel® Streaming SIMD Extensions (Intel® SSE) provide the following macro function to transpose a 4 by 4 matrix of single-precision floating-point values:

_MM_TRANSPOSE4_PS(row0, row1, row2, row3)

The arguments row0, row1, row2, and row3 are __m128 values whose elements form the corresponding rows of a 4 by 4 matrix. The matrix transposition is returned in arguments row0, row1, row2, and row3 where row0 now holds column 0 of the original matrix, row1 now holds column 1 of the original matrix, and so on. The transposition function of this macro is illustrated in the figure below.

Matrix Transposition Using _MM_TRANSPOSE4_PS Macro

Intrinsics for Intel(R) Streaming SIMD Extensions 2

Overview: Intel(R) Streaming SIMD Extensions 2

This section describes the C++ language-level features supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) in the Intel® C++ Compiler. The features are divided into two categories:

• Floating-Point Intrinsics -- describes the arithmetic, logical, compare, conversion, memory, and initialization intrinsics for the double-precision floating-point data type (__m128d).
• Integer Intrinsics -- describes the arithmetic, logical, compare, conversion, memory, and initialization intrinsics for the extended-precision integer data type (__m128i).

NOTE.

1. There are no intrinsics for floating-point move operations. To move data from one register to another, a simple assignment, A = B, suffices, where A and B are the source and target registers for the move operation.
2. On processors that do not support Intel® SSE2 instructions but do support MMX™ Technology, you can use the sse2mmx.h emulation pack to enable support for Intel® SSE2 instructions.

Use the sse2mmx.h header file for the following processors:

• Pentium® III processor
• Pentium® II processor
• Pentium® processor with MMX™ technology

Some intrinsics are "composites" because they require more than one instruction to implement them; intrinsics that map to a single instruction are referred to as "simple". You should be familiar with the hardware features provided by Intel® SSE2 when writing programs with the intrinsics. The following are three important issues to keep in mind:

• Certain intrinsics, such as _mm_loadr_pd and _mm_cmpgt_sd, are not directly supported by the instruction set. While these intrinsics are convenient programming aids, be mindful of their implementation cost.


• Data loaded or stored as __m128d objects must generally be 16-byte aligned.
• Some intrinsics require that their arguments be immediates, that is, constant integers (literals), due to the nature of the instruction.

The prototypes for Intel® SSE2 intrinsics are in the emmintrin.h header file.

NOTE. You can also use the single ia32intrin.h header file for any IA-32 architecture-based intrinsics.

Floating-point Intrinsics

Arithmetic Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for floating-point arithmetic operations are listed in this topic. The prototypes for the SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in a register. The information about what is placed in each register appears in the tables below, in the detailed explanation for each intrinsic. For each intrinsic, the resulting register is represented by R0 and R1, where R0 and R1 each represent one piece of the result register.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_add_sd Addition ADDSD

_mm_add_pd Addition ADDPD

_mm_sub_sd Subtraction SUBSD

_mm_sub_pd Subtraction SUBPD

_mm_mul_sd Multiplication MULSD

_mm_mul_pd Multiplication MULPD

_mm_div_sd Division DIVSD


_mm_div_pd Division DIVPD

_mm_sqrt_sd Computes Square Root SQRTSD

_mm_sqrt_pd Computes Square Root SQRTPD

_mm_min_sd Computes Minimum MINSD

_mm_min_pd Computes Minimum MINPD

_mm_max_sd Computes Maximum MAXSD

_mm_max_pd Computes Maximum MAXPD

__m128d _mm_add_sd(__m128d a, __m128d b)

Adds the lower DP FP (double-precision, floating-point) values of a and b; the upper DP FP value is passed through from a.

R0 R1

a0 + b0 a1

__m128d _mm_add_pd(__m128d a, __m128d b)

Adds the two DP FP values of a and b.

R0 R1

a0 + b0 a1 + b1

__m128d _mm_sub_sd(__m128d a, __m128d b)

Subtracts the lower DP FP value of b from a. The upper DP FP value is passed through from a.

R0 R1

a0 - b0 a1


__m128d _mm_sub_pd(__m128d a, __m128d b)

Subtracts the two DP FP values of b from a.

R0 R1

a0 - b0 a1 - b1

__m128d _mm_mul_sd(__m128d a, __m128d b)

Multiplies the lower DP FP values of a and b. The upper DP FP is passed through from a.

R0 R1

a0 * b0 a1

__m128d _mm_mul_pd(__m128d a, __m128d b)

Multiplies the two DP FP values of a and b.

R0 R1

a0 * b0 a1 * b1

__m128d _mm_div_sd(__m128d a, __m128d b)

Divides the lower DP FP values of a and b. The upper DP FP value is passed through from a.

R0 R1

a0 / b0 a1

__m128d _mm_div_pd(__m128d a, __m128d b)

Divides the two DP FP values of a and b.

R0 R1

a0 / b0 a1 / b1

__m128d _mm_sqrt_sd(__m128d a, __m128d b)

Computes the square root of the lower DP FP value of b. The upper DP FP value is passed through from a.


R0 R1

sqrt(b0) a1

__m128d _mm_sqrt_pd(__m128d a)

Computes the square roots of the two DP FP values of a.

R0 R1

sqrt(a0) sqrt(a1)

__m128d _mm_min_sd(__m128d a, __m128d b)

Computes the minimum of the lower DP FP values of a and b. The upper DP FP value is passed through from a.

R0 R1

min (a0, b0) a1

__m128d _mm_min_pd(__m128d a, __m128d b)

Computes the minima of the two DP FP values of a and b.

R0 R1

min (a0, b0) min(a1, b1)

__m128d _mm_max_sd(__m128d a, __m128d b)

Computes the maximum of the lower DP FP values of a and b. The upper DP FP value is passed through from a.

R0 R1

max (a0, b0) a1

__m128d _mm_max_pd(__m128d a, __m128d b)

Computes the maxima of the two DP FP values of a and b.


R0 R1

max (a0, b0) max (a1, b1)

Logical Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for floating-point logical operations are listed in the following table. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation for each intrinsic. For each intrinsic, the resulting register is represented by R0 and R1, where R0 and R1 each represent one piece of the result register.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_and_pd Computes AND ANDPD

_mm_andnot_pd Computes AND and NOT ANDNPD

_mm_or_pd Computes OR ORPD

_mm_xor_pd Computes XOR XORPD

__m128d _mm_and_pd(__m128d a, __m128d b)

Computes the bitwise AND of the two DP FP values of a and b.

R0 R1

a0 & b0 a1 & b1

__m128d _mm_andnot_pd(__m128d a, __m128d b)

Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.


R0 R1

(~a0) & b0 (~a1) & b1

__m128d _mm_or_pd(__m128d a, __m128d b)

Computes the bitwise OR of the two DP FP values of a and b.

R0 R1

a0 | b0 a1 | b1

__m128d _mm_xor_pd(__m128d a, __m128d b)

Computes the bitwise XOR of the two DP FP values of a and b.

R0 R1

a0 ^ b0 a1 ^ b1

Compare Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for floating-point comparison operations are listed in the following table. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file.

Each comparison intrinsic performs a comparison of a and b. For the packed form, the two DP FP values of a and b are compared, and a 128-bit mask is returned. For the scalar form, the lower DP FP values of a and b are compared, and a 64-bit mask is returned; the upper DP FP value is passed through from a. The mask is set to 0xffffffffffffffff for each element where the comparison is true, and set to 0x0 where the comparison is false. The r following the instruction name indicates that the operands to the instruction are reversed in the actual implementation.

The results of each intrinsic operation are placed in a register. The information about what is placed in each register appears in the tables below, in the detailed explanation for each intrinsic. For each intrinsic, the resulting register is represented by R, R0 and R1, where R, R0 and R1 each represent one piece of the result register.


Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_cmpeq_pd Equality CMPEQPD

_mm_cmplt_pd Less Than CMPLTPD

_mm_cmple_pd Less Than or Equal CMPLEPD

_mm_cmpgt_pd Greater Than CMPLTPDr

_mm_cmpge_pd Greater Than or Equal CMPLEPDr

_mm_cmpord_pd Ordered CMPORDPD

_mm_cmpunord_pd Unordered CMPUNORDPD

_mm_cmpneq_pd Inequality CMPNEQPD

_mm_cmpnlt_pd Not Less Than CMPNLTPD

_mm_cmpnle_pd Not Less Than or Equal CMPNLEPD

_mm_cmpngt_pd Not Greater Than CMPNLTPDr

_mm_cmpnge_pd Not Greater Than or Equal CMPNLEPDr

_mm_cmpeq_sd Equality CMPEQSD

_mm_cmplt_sd Less Than CMPLTSD

_mm_cmple_sd Less Than or Equal CMPLESD

_mm_cmpgt_sd Greater Than CMPLTSDr

_mm_cmpge_sd Greater Than or Equal CMPLESDr

_mm_cmpord_sd Ordered CMPORDSD

_mm_cmpunord_sd Unordered CMPUNORDSD

_mm_cmpneq_sd Inequality CMPNEQSD


_mm_cmpnlt_sd Not Less Than CMPNLTSD

_mm_cmpnle_sd Not Less Than or Equal CMPNLESD

_mm_cmpngt_sd Not Greater Than CMPNLTSDr

_mm_cmpnge_sd Not Greater Than or Equal CMPNLESDr

_mm_comieq_sd Equality COMISD

_mm_comilt_sd Less Than COMISD

_mm_comile_sd Less Than or Equal COMISD

_mm_comigt_sd Greater Than COMISD

_mm_comige_sd Greater Than or Equal COMISD

_mm_comineq_sd Not Equal COMISD

_mm_ucomieq_sd Equality UCOMISD

_mm_ucomilt_sd Less Than UCOMISD

_mm_ucomile_sd Less Than or Equal UCOMISD

_mm_ucomigt_sd Greater Than UCOMISD

_mm_ucomige_sd Greater Than or Equal UCOMISD

_mm_ucomineq_sd Not Equal UCOMISD

__m128d _mm_cmpeq_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for equality.


R0 R1

(a0 == b0) ? 0xffffffffffffffff : 0x0 (a1 == b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmplt_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for a less than b.

R0 R1

(a0 < b0) ? 0xffffffffffffffff : 0x0 (a1 < b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmple_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for a less than or equal to b.

R0 R1

(a0 <= b0) ? 0xffffffffffffffff : 0x0 (a1 <= b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpgt_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for a greater than b.

R0 R1

(a0 > b0) ? 0xffffffffffffffff : 0x0 (a1 > b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpge_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for a greater than or equal to b.

R0 R1

(a0 >= b0) ? 0xffffffffffffffff : 0x0 (a1 >= b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpord_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for ordered.


R0 R1

(a0 ord b0) ? 0xffffffffffffffff : 0x0 (a1 ord b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpunord_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for unordered.

R0 R1

(a0 unord b0) ? 0xffffffffffffffff : 0x0 (a1 unord b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpneq_pd ( __m128d a, __m128d b)

Compares the two DP FP values of a and b for inequality.

R0 R1

(a0 != b0) ? 0xffffffffffffffff : 0x0 (a1 != b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpnlt_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for a not less than b.

R0 R1

!(a0 < b0) ? 0xffffffffffffffff : 0x0 !(a1 < b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpnle_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for a not less than or equal to b.

R0 R1

!(a0 <= b0) ? 0xffffffffffffffff : 0x0 !(a1 <= b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpngt_pd(__m128d a, __m128d b)


Compares the two DP FP values of a and b for a not greater than b.

R0 R1

!(a0 > b0) ? 0xffffffffffffffff : 0x0 !(a1 > b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpnge_pd(__m128d a, __m128d b)

Compares the two DP FP values of a and b for a not greater than or equal to b.

R0 R1

!(a0 >= b0) ? 0xffffffffffffffff : 0x0 !(a1 >= b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpeq_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for equality. The upper DP FP value is passed through from a.

R0 R1

(a0 == b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmplt_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a less than b. The upper DP FP value is passed through from a.

R0 R1

(a0 < b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmple_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a less than or equal to b. The upper DP FP value is passed through from a.


R0 R1

(a0 <= b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmpgt_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a greater than b. The upper DP FP value is passed through from a.

R0 R1

(a0 > b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmpge_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a greater than or equal to b. The upper DP FP value is passed through from a.

R0 R1

(a0 >= b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmpord_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for ordered. The upper DP FP value is passed through from a.

R0 R1

(a0 ord b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmpunord_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for unordered. The upper DP FP value is passed through from a.


R0 R1

(a0 unord b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmpneq_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for inequality. The upper DP FP value is passed through from a.

R0 R1

(a0 != b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmpnlt_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a not less than b. The upper DP FP value is passed through from a.

R0 R1

!(a0 < b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmpnle_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a not less than or equal to b. The upper DP FP value is passed through from a.

R0 R1

!(a0 <= b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmpngt_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a not greater than b. The upper DP FP value is passed through from a.


R0 R1

!(a0 > b0) ? 0xffffffffffffffff : 0x0 a1

__m128d _mm_cmpnge_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a not greater than or equal to b. The upper DP FP value is passed through from a.

R0 R1

!(a0 >= b0) ? 0xffffffffffffffff : 0x0 a1

int _mm_comieq_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.

R

(a0 == b0) ? 0x1 : 0x0

int _mm_comilt_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.

R

(a0 < b0) ? 0x1 : 0x0

int _mm_comile_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.

R

(a0 <= b0) ? 0x1 : 0x0


int _mm_comigt_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.

R

(a0 > b0) ? 0x1 : 0x0

int _mm_comige_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.

R

(a0 >= b0) ? 0x1 : 0x0

int _mm_comineq_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.

R

(a0 != b0) ? 0x1 : 0x0

int _mm_ucomieq_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.

R

(a0 == b0) ? 0x1 : 0x0

int _mm_ucomilt_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.


R

(a0 < b0) ? 0x1 : 0x0

int _mm_ucomile_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.

R

(a0 <= b0) ? 0x1 : 0x0

int _mm_ucomigt_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.

R

(a0 > b0) ? 0x1 : 0x0

int _mm_ucomige_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.

R

(a0 >= b0) ? 0x1 : 0x0

int _mm_ucomineq_sd(__m128d a, __m128d b)

Compares the lower DP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.

R

(a0 != b0) ? 0x1 : 0x0


Conversion Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for floating-point conversion operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. Each conversion intrinsic takes one data type and performs a conversion to a different type. Some conversions, such as those performed by the _mm_cvtpd_ps intrinsic, result in a loss of precision. The rounding mode used in such cases is determined by the value in the MXCSR register. The default rounding mode is round-to-nearest.

NOTE. The rounding mode used by the C and C++ languages when performing a type conversion is to truncate. The _mm_cvttpd_epi32 and _mm_cvttsd_si32 intrinsics use the truncate rounding mode regardless of the mode specified by the MXCSR register.

The results of each intrinsic operation are placed in a register. The information about what is placed in each register appears in the tables below, in the detailed explanation for each intrinsic. For each intrinsic, the resulting register is represented by R, R0, R1, R2, and R3, each representing a register in which results are placed.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_cvtpd_ps Convert DP FP to SP FP CVTPD2PS

_mm_cvtps_pd Convert from SP FP to DP FP CVTPS2PD

_mm_cvtepi32_pd Convert lower integer values to DP FP CVTDQ2PD

_mm_cvtpd_epi32 Convert DP FP values to integer values CVTPD2DQ

_mm_cvtsd_si32 Convert lower DP FP value to integer value CVTSD2SI

_mm_cvtsd_ss Convert lower DP FP value to SP FP CVTSD2SS


_mm_cvtsi32_sd Convert signed integer value to DP FP CVTSI2SD

_mm_cvtss_sd Convert lower SP FP value to DP FP CVTSS2SD

_mm_cvttpd_epi32 Convert DP FP values to signed integers using truncate CVTTPD2DQ

_mm_cvttsd_si32 Convert lower DP FP to signed integer using truncate CVTTSD2SI

_mm_cvtpd_pi32 Convert two DP FP values to signed integer values CVTPD2PI

_mm_cvttpd_pi32 Convert two DP FP values to signed integer values using truncate CVTTPD2PI

_mm_cvtpi32_pd Convert two signed integer values to DP FP CVTPI2PD

_mm_cvtsd_f64 Extract DP FP value from first vector element None

__m128 _mm_cvtpd_ps(__m128d a)

Converts the two DP FP values of a to SP FP values.

R0 R1 R2 R3

(float) a0 (float) a1 0.0 0.0

__m128d _mm_cvtps_pd(__m128 a)

Converts the lower two SP FP values of a to DP FP values.


R0 R1

(double) a0 (double) a1

__m128d _mm_cvtepi32_pd(__m128i a)

Converts the lower two signed 32-bit integer values of a to DP FP values.

R0 R1

(double) a0 (double) a1

__m128i _mm_cvtpd_epi32(__m128d a)

Converts the two DP FP values of a to 32-bit signed integer values.

R0 R1 R2 R3

(int) a0 (int) a1 0x0 0x0

int _mm_cvtsd_si32(__m128d a)

Converts the lower DP FP value of a to a 32-bit signed integer value.

R

(int) a0

__m128 _mm_cvtsd_ss(__m128 a, __m128d b)

Converts the lower DP FP value of b to an SP FP value. The upper SP FP values in a are passed through.

R0 R1 R2 R3

(float) b0 a1 a2 a3

__m128d _mm_cvtsi32_sd(__m128d a, int b)

Converts the signed integer value in b to a DP FP value. The upper DP FP value in a is passed through.


R0 R1

(double) b a1

__m128d _mm_cvtss_sd(__m128d a, __m128 b)

Converts the lower SP FP value of b to a DP FP value. The upper DP FP value in a is passed through.

R0 R1

(double) b0 a1

__m128i _mm_cvttpd_epi32(__m128d a)

Converts the two DP FP values of a to 32-bit signed integers using truncate.

R0 R1 R2 R3

(int) a0 (int) a1 0x0 0x0

int _mm_cvttsd_si32(__m128d a)

Converts the lower DP FP value of a to a 32-bit signed integer using truncate.

R

(int) a0

__m64 _mm_cvtpd_pi32(__m128d a)

Converts the two DP FP values of a to 32-bit signed integer values.

R0 R1

(int)a0 (int) a1

__m64 _mm_cvttpd_pi32(__m128d a)

Converts the two DP FP values of a to 32-bit signed integer values using truncate.


R0 R1

(int)a0 (int) a1

__m128d _mm_cvtpi32_pd(__m64 a)

Converts the two 32-bit signed integer values of a to DP FP values.

R0 R1

(double)a0 (double)a1

double _mm_cvtsd_f64(__m128d a)

This intrinsic extracts a double precision floating point value from the first vector element of an __m128d. It does so in the most efficient manner possible in the context used. This intrinsic does not map to any specific SSE2 instruction.

Load Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for floating-point load operations are listed in this topic. The prototypes for Intel® SSE2 intrinsics are in the emmintrin.h header file. The load and set operations are similar in that both initialize __m128d data. However, the set operations take a double argument and are intended for initialization with constants, while the load operations take a double pointer argument and are intended to mimic the instructions for loading data from memory. The results of each intrinsic operation are placed in a register. The information about what is placed in each register appears in the tables below, in the detailed explanation for each intrinsic. For each intrinsic, the resulting register is represented by R0 and R1, where R0 and R1 each represent one piece of the result register.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_load_pd Loads two DP FP values MOVAPD

_mm_load1_pd Loads a single DP FP value, copying to both elements MOVSD + shuffling


_mm_loadr_pd Loads two DP FP values in reverse order MOVAPD + shuffling

_mm_loadu_pd Loads two DP FP values MOVUPD

_mm_load_sd Loads a DP FP value, sets upper DP FP to zero MOVSD

_mm_loadh_pd Loads a DP FP value as the upper DP FP value of the result MOVHPD

_mm_loadl_pd Loads a DP FP value as the lower DP FP value of the result MOVLPD

__m128d _mm_load_pd(double const*dp)

Loads two DP FP values. The address p must be 16-byte aligned.

R0 R1

p[0] p[1]

__m128d _mm_load1_pd(double const*dp)

Loads a single DP FP value, copying to both elements. The address p need not be 16-byte aligned.

R0 R1

*p *p

__m128d _mm_loadr_pd(double const *p)

Loads two DP FP values in reverse order. The address p must be 16-byte aligned.

R0 R1

p[1] p[0]

__m128d _mm_loadu_pd(double const *p)


Loads two DP FP values. The address p need not be 16-byte aligned.

R0 R1

p[0] p[1]

__m128d _mm_load_sd(double const *p)

Loads a DP FP value. The upper DP FP is set to zero. The address p need not be 16-byte aligned.

R0 R1

*p 0.0

__m128d _mm_loadh_pd(__m128d a, double const *p)

Loads a DP FP value as the upper DP FP value of the result. The lower DP FP value is passed through from a. The address p need not be 16-byte aligned.

R0 R1

a0 *p

__m128d _mm_loadl_pd(__m128d a, double const *p)

Loads a DP FP value as the lower DP FP value of the result. The upper DP FP value is passed through from a. The address p need not be 16-byte aligned.

R0 R1

*p a1

Set Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for floating-point set operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The load and set operations are similar in that both initialize __m128d data. However, the set operations take a double argument and are intended for initialization with constants, while the load operations take a double pointer argument and are intended to mimic the instructions for loading data from memory.


Some of the these intrinsics are composite intrinsics because they require more than one instruction to implement them. The results of each intrinsic operation are placed in a register. The information about what is placed in each register appears in the tables below, in the detailed explanation for each intrinsic. For each intrinsic, the resulting register is represented by R0 and R1, where R0 and R1 each represent one piece of the result register.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_set_sd Sets lower DP FP value to w and upper to zero Composite

_mm_set1_pd Sets two DP FP values to w Composite

_mm_set_pd Sets lower DP FP to x and upper to w Composite

_mm_setr_pd Sets lower DP FP to w and upper to x Composite

_mm_setzero_pd Sets two DP FP values to zero XORPD

_mm_move_sd Sets lower DP FP value to the lower DP FP value of b MOVSD

__m128d _mm_set_sd(double w)

Sets the lower DP FP value to w and sets the upper DP FP value to zero.

R0 R1

w 0.0

__m128d _mm_set1_pd(double w)

Sets the 2 DP FP values to w.


R0 R1

w w

__m128d _mm_set_pd(double w, double x)

Sets the lower DP FP value to x and sets the upper DP FP value to w.

R0 R1

x w

__m128d _mm_setr_pd(double w, double x)

Sets the lower DP FP value to w and sets the upper DP FP value to x.

R0 R1

w x

__m128d _mm_setzero_pd(void)

Sets the 2 DP FP values to zero.

R0 R1

0.0 0.0

__m128d _mm_move_sd( __m128d a, __m128d b)

Sets the lower DP FP value to the lower DP FP value of b. The upper DP FP value is passed through from a.

R0 R1

b0 a1

Store Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for floating-point store operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file.


The store operations assign the initialized data to the address. The detailed description of each intrinsic contains a table detailing the returns. In these tables, dp[n] is an access to the nth element of the result.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_stream_pd Store MOVNTPD

_mm_store_sd Stores lower DP FP value of a MOVSD

_mm_store1_pd Stores lower DP FP value of a twice MOVAPD + shuffling

_mm_store_pd Stores two DP FP values MOVAPD

_mm_storeu_pd Stores two DP FP values MOVUPD

_mm_storer_pd Stores two DP FP values in reverse order MOVAPD + shuffling

_mm_storeh_pd Stores upper DP FP value of a MOVHPD

_mm_storel_pd Stores lower DP FP value of a MOVLPD

void _mm_store_sd(double *dp, __m128d a)

Stores the lower DP FP value of a. The address dp need not be 16-byte aligned.

*dp

a0

void _mm_store1_pd(double *dp, __m128d a)

Stores the lower DP FP value of a twice. The address dp must be 16-byte aligned.


dp[0] dp[1]

a0 a0

void _mm_store_pd(double *dp, __m128d a)

Stores two DP FP values. The address dp must be 16-byte aligned.

dp[0] dp[1]

a0 a1

void _mm_storeu_pd(double *dp, __m128d a)

Stores two DP FP values. The address dp need not be 16-byte aligned.

dp[0] dp[1]

a0 a1

void _mm_storer_pd(double *dp, __m128d a)

Stores two DP FP values in reverse order. The address dp must be 16-byte aligned.

dp[0] dp[1]

a1 a0

void _mm_storeh_pd(double *dp, __m128d a)

Stores the upper DP FP value of a.

*dp

a1

void _mm_storel_pd(double *dp, __m128d a)

Stores the lower DP FP value of a.


*dp

a0

Integer Intrinsics

Arithmetic Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for integer arithmetic operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1...R15 represent the registers in which results are placed.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_add_epi8 Addition PADDB

_mm_add_epi16 Addition PADDW

_mm_add_epi32 Addition PADDD

_mm_add_si64 Addition PADDQ

_mm_add_epi64 Addition PADDQ

_mm_adds_epi8 Addition PADDSB

_mm_adds_epi16 Addition PADDSW

_mm_adds_epu8 Addition PADDUSB

_mm_adds_epu16 Addition PADDUSW

_mm_avg_epu8 Computes Average PAVGB


Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_avg_epu16 Computes Average PAVGW

_mm_madd_epi16 Multiplication and Addition PMADDWD

_mm_max_epi16 Computes Maxima PMAXSW

_mm_max_epu8 Computes Maxima PMAXUB

_mm_min_epi16 Computes Minima PMINSW

_mm_min_epu8 Computes Minima PMINUB

_mm_mulhi_epi16 Multiplication PMULHW

_mm_mulhi_epu16 Multiplication PMULHUW

_mm_mullo_epi16 Multiplication PMULLW

_mm_mul_su32 Multiplication PMULUDQ

_mm_mul_epu32 Multiplication PMULUDQ

_mm_sad_epu8 Computes Difference/Adds PSADBW

_mm_sub_epi8 Subtraction PSUBB

_mm_sub_epi16 Subtraction PSUBW

_mm_sub_epi32 Subtraction PSUBD

_mm_sub_si64 Subtraction PSUBQ

_mm_sub_epi64 Subtraction PSUBQ

_mm_subs_epi8 Subtraction PSUBSB

_mm_subs_epi16 Subtraction PSUBSW

_mm_subs_epu8 Subtraction PSUBUSB


Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_subs_epu16 Subtraction PSUBUSW

__m128i _mm_add_epi8(__m128i a, __m128i b)

Adds the 16 signed or unsigned 8-bit integers in a to the 16 signed or unsigned 8-bit integers in b.

R0 R1 ... R15

a0 + b0 a1 + b1 ... a15 + b15

__m128i _mm_add_epi16(__m128i a, __m128i b)

Adds the 8 signed or unsigned 16-bit integers in a to the 8 signed or unsigned 16-bit integers in b.

R0 R1 ... R7

a0 + b0 a1 + b1 ... a7 + b7

__m128i _mm_add_epi32(__m128i a, __m128i b)

Adds the 4 signed or unsigned 32-bit integers in a to the 4 signed or unsigned 32-bit integers in b.

R0 R1 R2 R3

a0 + b0 a1 + b1 a2 + b2 a3 + b3

__m64 _mm_add_si64(__m64 a, __m64 b)

Adds the signed or unsigned 64-bit integer a to the signed or unsigned 64-bit integer b.

R0

a + b

__m128i _mm_add_epi64(__m128i a, __m128i b)


Adds the 2 signed or unsigned 64-bit integers in a to the 2 signed or unsigned 64-bit integers in b.

R0 R1

a0 + b0 a1 + b1

__m128i _mm_adds_epi8(__m128i a, __m128i b)

Adds the 16 signed 8-bit integers in a to the 16 signed 8-bit integers in b using saturating arithmetic.

R0 R1 ... R15

SignedSaturate(a0 + b0) SignedSaturate(a1 + b1) ... SignedSaturate(a15 + b15)

__m128i _mm_adds_epi16(__m128i a, __m128i b)

Adds the 8 signed 16-bit integers in a to the 8 signed 16-bit integers in b using saturating arithmetic.

R0 R1 ... R7

SignedSaturate(a0 + b0) SignedSaturate(a1 + b1) ... SignedSaturate(a7 + b7)

__m128i _mm_adds_epu8(__m128i a, __m128i b)

Adds the 16 unsigned 8-bit integers in a to the 16 unsigned 8-bit integers in b using saturating arithmetic.

R0 R1 ... R15

UnsignedSaturate(a0 + b0) UnsignedSaturate(a1 + b1) ... UnsignedSaturate(a15 + b15)

__m128i _mm_adds_epu16(__m128i a, __m128i b)

Adds the 8 unsigned 16-bit integers in a to the 8 unsigned 16-bit integers in b using saturating arithmetic.


R0 R1 ... R7

UnsignedSaturate(a0 + b0) UnsignedSaturate(a1 + b1) ... UnsignedSaturate(a7 + b7)

__m128i _mm_avg_epu8(__m128i a, __m128i b)

Computes the average of the 16 unsigned 8-bit integers in a and the 16 unsigned 8-bit integers in b and rounds.

R0 R1 ... R15

(a0 + b0) / 2 (a1 + b1) / 2 ... (a15 + b15) / 2

__m128i _mm_avg_epu16(__m128i a, __m128i b)

Computes the average of the 8 unsigned 16-bit integers in a and the 8 unsigned 16-bit integers in b and rounds.

R0 R1 ... R7

(a0 + b0) / 2 (a1 + b1) / 2 ... (a7 + b7) / 2

__m128i _mm_madd_epi16(__m128i a, __m128i b)

Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Adds the signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer results.

R0 R1 R2 R3

(a0 * b0) + (a1 * b1) (a2 * b2) + (a3 * b3) (a4 * b4) + (a5 * b5) (a6 * b6) + (a7 * b7)

__m128i _mm_max_epi16(__m128i a, __m128i b)

Computes the pairwise maxima of the 8 signed 16-bit integers from a and the 8 signed 16-bit integers from b.

R0 R1 ... R7

max(a0, b0) max(a1, b1) ... max(a7, b7)


__m128i _mm_max_epu8(__m128i a, __m128i b)

Computes the pairwise maxima of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-bit integers from b.

R0 R1 ... R15

max(a0, b0) max(a1, b1) ... max(a15, b15)

__m128i _mm_min_epi16(__m128i a, __m128i b)

Computes the pairwise minima of the 8 signed 16-bit integers from a and the 8 signed 16-bit integers from b.

R0 R1 ... R7

min(a0, b0) min(a1, b1) ... min(a7, b7)

__m128i _mm_min_epu8(__m128i a, __m128i b)

Computes the pairwise minima of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-bit integers from b.

R0 R1 ... R15

min(a0, b0) min(a1, b1) ... min(a15, b15)

__m128i _mm_mulhi_epi16(__m128i a, __m128i b)

Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Packs the upper 16-bits of the 8 signed 32-bit results.

R0 R1 ... R7

(a0 * b0)[31:16] (a1 * b1)[31:16] ... (a7 * b7)[31:16]

__m128i _mm_mulhi_epu16(__m128i a, __m128i b)

Multiplies the 8 unsigned 16-bit integers from a by the 8 unsigned 16-bit integers from b. Packs the upper 16-bits of the 8 unsigned 32-bit results.


R0 R1 ... R7

(a0 * b0)[31:16] (a1 * b1)[31:16] ... (a7 * b7)[31:16]

__m128i _mm_mullo_epi16(__m128i a, __m128i b)

Multiplies the 8 signed or unsigned 16-bit integers from a by the 8 signed or unsigned 16-bit integers from b. Packs the lower 16-bits of the 8 signed or unsigned 32-bit results.

R0 R1 ... R7

(a0 * b0)[15:0] (a1 * b1)[15:0] ... (a7 * b7)[15:0]

__m64 _mm_mul_su32(__m64 a, __m64 b)

Multiplies the lower 32-bit integer from a by the lower 32-bit integer from b, and returns the 64-bit integer result.

R0

a0 * b0

__m128i _mm_mul_epu32(__m128i a, __m128i b)

Multiplies 2 unsigned 32-bit integers from a by 2 unsigned 32-bit integers from b. Packs the 2 unsigned 64-bit integer results.

R0 R1

a0 * b0 a2 * b2

__m128i _mm_sad_epu8(__m128i a, __m128i b)

Computes the absolute difference of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-bit integers from b. Sums the upper 8 differences and lower 8 differences, and packs the resulting 2 unsigned 16-bit integers into the upper and lower 64-bit elements.


R0 R1 to R3 R4 R5 to R7

abs(a0 - b0) + ... + abs(a7 - b7) 0x0 abs(a8 - b8) + ... + abs(a15 - b15) 0x0

__m128i _mm_sub_epi8(__m128i a, __m128i b)

Subtracts the 16 signed or unsigned 8-bit integers of b from the 16 signed or unsigned 8-bit integers of a.

R0 R1 ... R15

a0 - b0 a1 - b1 ... a15 - b15

__m128i _mm_sub_epi16(__m128i a, __m128i b)

Subtracts the 8 signed or unsigned 16-bit integers of b from the 8 signed or unsigned 16-bit integers of a.

R0 R1 ... R7

a0 - b0 a1 - b1 ... a7 - b7

__m128i _mm_sub_epi32(__m128i a, __m128i b)

Subtracts the 4 signed or unsigned 32-bit integers of b from the 4 signed or unsigned 32-bit integers of a.

R0 R1 R2 R3

a0 - b0 a1 - b1 a2 - b2 a3 - b3

__m64 _mm_sub_si64 (__m64 a, __m64 b)

Subtracts the signed or unsigned 64-bit integer b from the signed or unsigned 64-bit integer a.

R

a - b


__m128i _mm_sub_epi64(__m128i a, __m128i b)

Subtracts the 2 signed or unsigned 64-bit integers in b from the 2 signed or unsigned 64-bit integers in a.

R0 R1

a0 - b0 a1 - b1

__m128i _mm_subs_epi8(__m128i a, __m128i b)

Subtracts the 16 signed 8-bit integers of b from the 16 signed 8-bit integers of a using saturating arithmetic.

R0 R1 ... R15

SignedSaturate(a0 - b0) SignedSaturate(a1 - b1) ... SignedSaturate(a15 - b15)

__m128i _mm_subs_epi16(__m128i a, __m128i b)

Subtracts the 8 signed 16-bit integers of b from the 8 signed 16-bit integers of a using saturating arithmetic.

R0 R1 ... R7

SignedSaturate(a0 - b0) SignedSaturate(a1 - b1) ... SignedSaturate(a7 - b7)

__m128i _mm_subs_epu8 (__m128i a, __m128i b)

Subtracts the 16 unsigned 8-bit integers of b from the 16 unsigned 8-bit integers of a using saturating arithmetic.

R0 R1 ... R15

UnsignedSaturate(a0 - b0) UnsignedSaturate(a1 - b1) ... UnsignedSaturate(a15 - b15)

__m128i _mm_subs_epu16 (__m128i a, __m128i b)

Subtracts the 8 unsigned 16-bit integers of b from the 8 unsigned 16-bit integers of a using saturating arithmetic.


R0 R1 ... R7

UnsignedSaturate(a0 - b0) UnsignedSaturate(a1 - b1) ... UnsignedSaturate(a7 - b7)

Logical Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for integer logical operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in register R. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_and_si128 Computes AND PAND

_mm_andnot_si128 Computes AND and NOT PANDN

_mm_or_si128 Computes OR POR

_mm_xor_si128 Computes XOR PXOR

__m128i _mm_and_si128(__m128i a, __m128i b)

Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.

R0

a & b

__m128i _mm_andnot_si128(__m128i a, __m128i b)

Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.

R0

(~a) & b


__m128i _mm_or_si128(__m128i a, __m128i b)

Computes the bitwise OR of the 128-bit value in a and the 128-bit value in b.

R0

a | b

__m128i _mm_xor_si128(__m128i a, __m128i b)

Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.

R0

a ^ b

Shift Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for integer shift operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1...R7 represent the registers in which results are placed.

NOTE. The count argument is one shift count that applies to all elements of the operand being shifted. It is not a vector shift count that shifts each element by a different amount.

Intrinsic Operation Shift Type Corresponding Intel® SSE2 Instruction

_mm_slli_si128 Shift left Logical PSLLDQ

_mm_slli_epi16 Shift left Logical PSLLW

_mm_sll_epi16 Shift left Logical PSLLW

_mm_slli_epi32 Shift left Logical PSLLD


Intrinsic Operation Shift Type Corresponding Intel® SSE2 Instruction

_mm_sll_epi32 Shift left Logical PSLLD

_mm_slli_epi64 Shift left Logical PSLLQ

_mm_sll_epi64 Shift left Logical PSLLQ

_mm_srai_epi16 Shift right Arithmetic PSRAW

_mm_sra_epi16 Shift right Arithmetic PSRAW

_mm_srai_epi32 Shift right Arithmetic PSRAD

_mm_sra_epi32 Shift right Arithmetic PSRAD

_mm_srli_si128 Shift right Logical PSRLDQ

_mm_srli_epi16 Shift right Logical PSRLW

_mm_srl_epi16 Shift right Logical PSRLW

_mm_srli_epi32 Shift right Logical PSRLD

_mm_srl_epi32 Shift right Logical PSRLD

_mm_srli_epi64 Shift right Logical PSRLQ

_mm_srl_epi64 Shift right Logical PSRLQ

__m128i _mm_slli_si128(__m128i a, int imm)

Shifts the 128-bit value in a left by imm bytes while shifting in zeros. imm must be an immediate.

R

a << (imm * 8)

__m128i _mm_slli_epi16(__m128i a, int count)


Shifts the 8 signed or unsigned 16-bit integers in a left by count bits while shifting in zeros.

R0 R1 ... R7

a0 << count a1 << count ... a7 << count

__m128i _mm_sll_epi16(__m128i a, __m128i count)

Shifts the 8 signed or unsigned 16-bit integers in a left by count bits while shifting in zeros.

R0 R1 ... R7

a0 << count a1 << count ... a7 << count

__m128i _mm_slli_epi32(__m128i a, int count)

Shifts the 4 signed or unsigned 32-bit integers in a left by count bits while shifting in zeros.

R0 R1 R2 R3

a0 << count a1 << count a2 << count a3 << count

__m128i _mm_sll_epi32(__m128i a, __m128i count)

Shifts the 4 signed or unsigned 32-bit integers in a left by count bits while shifting in zeros.

R0 R1 R2 R3

a0 << count a1 << count a2 << count a3 << count

__m128i _mm_slli_epi64(__m128i a, int count)

Shifts the 2 signed or unsigned 64-bit integers in a left by count bits while shifting in zeros.

R0 R1

a0 << count a1 << count

__m128i _mm_sll_epi64(__m128i a, __m128i count)

Shifts the 2 signed or unsigned 64-bit integers in a left by count bits while shifting in zeros.


R0 R1

a0 << count a1 << count

__m128i _mm_srai_epi16(__m128i a, int count)

Shifts the 8 signed 16-bit integers in a right by count bits while shifting in the sign bit.

R0 R1 ... R7

a0 >> count a1 >> count ... a7 >> count

__m128i _mm_sra_epi16(__m128i a, __m128i count)

Shifts the 8 signed 16-bit integers in a right by count bits while shifting in the sign bit.

R0 R1 ... R7

a0 >> count a1 >> count ... a7 >> count

__m128i _mm_srai_epi32(__m128i a, int count)

Shifts the 4 signed 32-bit integers in a right by count bits while shifting in the sign bit.

R0 R1 R2 R3

a0 >> count a1 >> count a2 >> count a3 >> count

__m128i _mm_sra_epi32(__m128i a, __m128i count)

Shifts the 4 signed 32-bit integers in a right by count bits while shifting in the sign bit.

R0 R1 R2 R3

a0 >> count a1 >> count a2 >> count a3 >> count

__m128i _mm_srli_si128(__m128i a, int imm)

Shifts the 128-bit value in a right by imm bytes while shifting in zeros. imm must be an immediate.


R

srl(a, imm*8)

__m128i _mm_srli_epi16(__m128i a, int count)

Shifts the 8 signed or unsigned 16-bit integers in a right by count bits while shifting in zeros.

R0 R1 ... R7

srl(a0, count) srl(a1, count) ... srl(a7, count)

__m128i _mm_srl_epi16(__m128i a, __m128i count)

Shifts the 8 signed or unsigned 16-bit integers in a right by count bits while shifting in zeros.

R0 R1 ... R7

srl(a0, count) srl(a1, count) ... srl(a7, count)

__m128i _mm_srli_epi32(__m128i a, int count)

Shifts the 4 signed or unsigned 32-bit integers in a right by count bits while shifting in zeros.

R0 R1 R2 R3

srl(a0, count) srl(a1, count) srl(a2, count) srl(a3, count)

__m128i _mm_srl_epi32(__m128i a, __m128i count)

Shifts the 4 signed or unsigned 32-bit integers in a right by count bits while shifting in zeros.

R0 R1 R2 R3

srl(a0, count) srl(a1, count) srl(a2, count) srl(a3, count)

__m128i _mm_srli_epi64(__m128i a, int count)

Shifts the 2 signed or unsigned 64-bit integers in a right by count bits while shifting in zeros.

R0 R1

srl(a0, count) srl(a1, count)


__m128i _mm_srl_epi64(__m128i a, __m128i count)

Shifts the 2 signed or unsigned 64-bit integers in a right by count bits while shifting in zeros.

R0 R1

srl(a0, count) srl(a1, count)

Compare Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for integer comparison operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1...R15 represent the registers in which results are placed.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_cmpeq_epi8 Equality PCMPEQB

_mm_cmpeq_epi16 Equality PCMPEQW

_mm_cmpeq_epi32 Equality PCMPEQD

_mm_cmpgt_epi8 Greater Than PCMPGTB

_mm_cmpgt_epi16 Greater Than PCMPGTW

_mm_cmpgt_epi32 Greater Than PCMPGTD

_mm_cmplt_epi8 Less Than PCMPGTBr

_mm_cmplt_epi16 Less Than PCMPGTWr

_mm_cmplt_epi32 Less Than PCMPGTDr

__m128i _mm_cmpeq_epi8(__m128i a, __m128i b)


Compares the 16 signed or unsigned 8-bit integers in a and the 16 signed or unsigned 8-bit integers in b for equality.

R0 R1 ... R15

(a0 == b0) ? 0xff : 0x0 (a1 == b1) ? 0xff : 0x0 ... (a15 == b15) ? 0xff : 0x0

__m128i _mm_cmpeq_epi16(__m128i a, __m128i b)

Compares the 8 signed or unsigned 16-bit integers in a and the 8 signed or unsigned 16-bit integers in b for equality.

R0 R1 ... R7

(a0 == b0) ? 0xffff : 0x0 (a1 == b1) ? 0xffff : 0x0 ... (a7 == b7) ? 0xffff : 0x0

__m128i _mm_cmpeq_epi32(__m128i a, __m128i b)

Compares the 4 signed or unsigned 32-bit integers in a and the 4 signed or unsigned 32-bit integers in b for equality.

R0 R1 R2 R3

(a0 == b0) ? 0xffffffff : 0x0 (a1 == b1) ? 0xffffffff : 0x0 (a2 == b2) ? 0xffffffff : 0x0 (a3 == b3) ? 0xffffffff : 0x0

__m128i _mm_cmpgt_epi8(__m128i a, __m128i b)

Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for greater than.

R0 R1 ... R15

(a0 > b0) ? 0xff : 0x0 (a1 > b1) ? 0xff : 0x0 ... (a15 > b15) ? 0xff : 0x0

__m128i _mm_cmpgt_epi16(__m128i a, __m128i b)

Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for greater than.


R0 R1 ... R7

(a0 > b0) ? 0xffff : 0x0 (a1 > b1) ? 0xffff : 0x0 ... (a7 > b7) ? 0xffff : 0x0

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b)

Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for greater than.

R0 R1 R2 R3

(a0 > b0) ? 0xffffffff : 0x0 (a1 > b1) ? 0xffffffff : 0x0 (a2 > b2) ? 0xffffffff : 0x0 (a3 > b3) ? 0xffffffff : 0x0

__m128i _mm_cmplt_epi8( __m128i a, __m128i b)

Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for less than.

R0 R1 ... R15

(a0 < b0) ? 0xff : 0x0 (a1 < b1) ? 0xff : 0x0 ... (a15 < b15) ? 0xff : 0x0

__m128i _mm_cmplt_epi16( __m128i a, __m128i b)

Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for less than.

R0 R1 ... R7

(a0 < b0) ? 0xffff : 0x0 (a1 < b1) ? 0xffff : 0x0 ... (a7 < b7) ? 0xffff : 0x0

__m128i _mm_cmplt_epi32( __m128i a, __m128i b)

Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for less than.

R0 R1 R2 R3

(a0 < b0) ? 0xffffffff : 0x0 (a1 < b1) ? 0xffffffff : 0x0 (a2 < b2) ? 0xffffffff : 0x0 (a3 < b3) ? 0xffffffff : 0x0


Conversion Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for integer conversion operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1, R2 and R3 represent the registers in which results are placed. Intrinsics marked with * are implemented only on Intel® 64 architectures. The rest of the intrinsics are implemented on both IA-32 and Intel® 64 architectures.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_cvtsi64_sd* Convert and pass through CVTSI2SD

_mm_cvtsd_si64* Convert according to rounding CVTSD2SI

_mm_cvttsd_si64* Convert using truncation CVTTSD2SI

_mm_cvtepi32_ps Convert to SP FP None

_mm_cvtps_epi32 Convert from SP FP None

_mm_cvttps_epi32 Convert from SP FP using truncation None

__m128d _mm_cvtsi64_sd(__m128d a, __int64 b)

Converts the signed 64-bit integer value in b to a DP FP value. The upper DP FP value in a is passed through. Use only on Intel® 64 architectures.

R0 R1

(double)b a1

__int64 _mm_cvtsd_si64(__m128d a)


Converts the lower DP FP value of a to a 64-bit signed integer value according to the current rounding mode. Use only on Intel® 64 architectures.

R

(__int64) a0

__int64 _mm_cvttsd_si64(__m128d a)

Converts the lower DP FP value of a to a 64-bit signed integer value using truncation. Use only on Intel® 64 architectures.

R

(__int64) a0

__m128 _mm_cvtepi32_ps(__m128i a)

Converts the 4 signed 32-bit integer values of a to SP FP values.

R0 R1 R2 R3

(float) a0 (float) a1 (float) a2 (float) a3

__m128i _mm_cvtps_epi32(__m128 a)

Converts the 4 SP FP values of a to signed 32-bit integer values.

R0 R1 R2 R3

(int) a0 (int) a1 (int) a2 (int) a3

__m128i _mm_cvttps_epi32(__m128 a)

Converts the 4 SP FP values of a to signed 32-bit integer values using truncation.

R0 R1 R2 R3

(int) a0 (int) a1 (int) a2 (int) a3


Move Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for integer move operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1, R2 and R3 represent the registers in which results are placed. Intrinsics marked with * are implemented only on Intel® 64 architectures. The rest of the intrinsics are implemented on both IA-32 and Intel® 64 architectures.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_cvtsi32_si128 Move and zero MOVD

_mm_cvtsi64_si128* Move and zero MOVQ

_mm_cvtsi128_si32 Move lowest 32 bits MOVD

_mm_cvtsi128_si64* Move lowest 64 bits MOVQ

__m128i _mm_cvtsi32_si128(int a)

Moves 32-bit integer a to the least significant 32 bits of an __m128i object. Zeroes the upper 96 bits of the __m128i object.

R0 R1 R2 R3

a 0x0 0x0 0x0

__m128i _mm_cvtsi64_si128(__int64 a)

Moves 64-bit integer a to the lower 64 bits of an __m128i object, zeroing the upper bits. Use only on Intel® 64 architectures.

R0 R1

a 0x0


int _mm_cvtsi128_si32(__m128i a)

Moves the least significant 32 bits of a to a 32-bit integer.

R

a0

__int64 _mm_cvtsi128_si64(__m128i a)

Moves the lower 64 bits of a to a 64-bit integer. Use only on Intel® 64 architectures.

R

a0

Load Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for integer load operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0 and R1 represent the registers in which results are placed.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_load_si128 Load MOVDQA

_mm_loadu_si128 Load MOVDQU

_mm_loadl_epi64 Load and zero MOVQ

__m128i _mm_load_si128(__m128i const*p)

Loads a 128-bit value. The address p must be 16-byte aligned.

R

*p


__m128i _mm_loadu_si128(__m128i const*p)

Loads a 128-bit value. The address p need not be 16-byte aligned.

R

*p

__m128i _mm_loadl_epi64(__m128i const*p)

Loads the lower 64 bits of the value pointed to by p into the lower 64 bits of the result, zeroing the upper 64 bits of the result.

R0 R1

*p[63:0] 0x0

Set Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for integer set operations are listed in this topic. These intrinsics are composite intrinsics because they require more than one instruction to implement them. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1...R15 represent the registers in which results are placed.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_set_epi64 Set two integer values Composite

_mm_set_epi32 Set four integer values Composite

_mm_set_epi16 Set eight integer values Composite

_mm_set_epi8 Set sixteen integer values Composite

_mm_set1_epi64 Set two integer values Composite

_mm_set1_epi32 Set four integer values Composite


Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_set1_epi16 Set eight integer values Composite

_mm_set1_epi8 Set sixteen integer values Composite

_mm_setr_epi64 Set two integer values in reverse order Composite

_mm_setr_epi32 Set four integer values in reverse order Composite

_mm_setr_epi16 Set eight integer values in reverse order Composite

_mm_setr_epi8 Set sixteen integer values in reverse order Composite

_mm_setzero_si128 Set to zero Composite

__m128i _mm_set_epi64(__m64 q1, __m64 q0)

Sets the 2 64-bit integer values.

R0 R1

q0 q1

__m128i _mm_set_epi32(int i3, int i2, int i1, int i0)

Sets the 4 signed 32-bit integer values.

R0 R1 R2 R3

i0 i1 i2 i3

__m128i _mm_set_epi16(short w7, short w6, short w5, short w4, short w3, short w2, short w1, short w0)

Sets the 8 signed 16-bit integer values.


R0 R1 ... R7

w0 w1 ... w7

__m128i _mm_set_epi8(char b15, char b14, char b13, char b12, char b11, char b10, char b9, char b8, char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)

Sets the 16 signed 8-bit integer values.

R0 R1 ... R15

b0 b1 ... b15

__m128i _mm_set1_epi64(__m64 q)

Sets the 2 64-bit integer values to q.

R0 R1

q q

__m128i _mm_set1_epi32(int i)

Sets the 4 signed 32-bit integer values to i.

R0 R1 R2 R3

i i i i

__m128i _mm_set1_epi16(short w)

Sets the 8 signed 16-bit integer values to w.

R0 R1 ... R7

w w ... w

__m128i _mm_set1_epi8(char b)

Sets the 16 signed 8-bit integer values to b.


R0 R1 ... R15

b b ... b

__m128i _mm_setr_epi64(__m64 q0, __m64 q1)

Sets the 2 64-bit integer values in reverse order.

R0 R1

q0 q1

__m128i _mm_setr_epi32(int i0, int i1, int i2, int i3)

Sets the 4 signed 32-bit integer values in reverse order.

R0 R1 R2 R3

i0 i1 i2 i3

__m128i _mm_setr_epi16(short w0, short w1, short w2, short w3, short w4, short w5, short w6, short w7)

Sets the 8 signed 16-bit integer values in reverse order.

R0 R1 ... R7

w0 w1 ... w7

__m128i _mm_setr_epi8(char b15, char b14, char b13, char b12, char b11, char b10, char b9, char b8, char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)

Sets the 16 signed 8-bit integer values in reverse order.

R0 R1 ... R15

b0 b1 ... b15
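The difference between the _mm_set and _mm_setr families is purely argument order; both produce the same element layout, with R0 in the lowest position. The following portable sketch (a scalar model, not the intrinsics themselves; the model_* names are illustrative) shows the ordering:

```c
#include <assert.h>
#include <string.h>

/* Scalar model of the element ordering. _mm_set_epi32 takes arguments
   from the most significant element down (i3 first), while
   _mm_setr_epi32 takes them in memory order (i0 first). */
typedef struct { int r[4]; } xmm_i32;

static xmm_i32 model_set_epi32(int i3, int i2, int i1, int i0) {
    xmm_i32 v = { { i0, i1, i2, i3 } };  /* R0 receives the last argument */
    return v;
}

static xmm_i32 model_setr_epi32(int i0, int i1, int i2, int i3) {
    xmm_i32 v = { { i0, i1, i2, i3 } };  /* R0 receives the first argument */
    return v;
}
```

With matching arguments, _mm_set_epi32(3, 2, 1, 0) and _mm_setr_epi32(0, 1, 2, 3) therefore denote the same vector.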

__m128i _mm_setzero_si128()

Sets the 128-bit value to zero.


R

0x0

Store Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for integer store operations are listed in this topic. The prototypes for the Intel® SSE2 intrinsics are in the emmintrin.h header file. The detailed description of each intrinsic contains a table detailing the returns. In these tables, *p denotes the memory location that receives the result.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_stream_si128 Store MOVNTDQ

_mm_stream_si32 Store MOVNTI

_mm_store_si128 Store MOVDQA

_mm_storeu_si128 Store MOVDQU

_mm_maskmoveu_si128 Conditional store MASKMOVDQU

_mm_storel_epi64 Store lowest MOVQ

void _mm_stream_si128(__m128i *p, __m128i a)

Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16-byte aligned.

*p

a

void _mm_stream_si32(int *p, int a)

Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated.


*p

a

void _mm_store_si128(__m128i *p, __m128i a)

Stores the 128-bit value in a. Address p must be 16-byte aligned.

*p

a

void _mm_storeu_si128(__m128i *p, __m128i a)

Stores the 128-bit value in a. Address p need not be 16-byte aligned.

*p

a

void _mm_maskmoveu_si128(__m128i d, __m128i n, char *p)

Conditionally store byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. Address p need not be 16-byte aligned.

if (n0[7]) p[0] := d0
if (n1[7]) p[1] := d1
...
if (n15[7]) p[15] := d15
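A scalar sketch of this conditional store (the model name is illustrative, not the intrinsic itself):

```c
#include <assert.h>

/* Scalar sketch of _mm_maskmoveu_si128 semantics: byte i of d is stored
   to p[i] only when the high bit of the selector byte n[i] is set;
   all other destination bytes are left untouched. */
static void model_maskmoveu(const unsigned char d[16],
                            const unsigned char n[16],
                            unsigned char *p) {
    for (int i = 0; i < 16; i++)
        if (n[i] & 0x80)        /* high bit enables the store */
            p[i] = d[i];
}
```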

void _mm_storel_epi64(__m128i *p, __m128i a)

Stores the lower 64 bits of a to the address p.

*p[63:0]

a0


Miscellaneous Functions and Intrinsics

Cacheability Support Intrinsics

The prototypes for Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for cacheability support are in the emmintrin.h header file.

Intrinsic Name Operation Corresponding Intel® SSE2 Instruction

_mm_stream_pd Store MOVNTPD

_mm256_stream_pd Store VMOVNTPD

_mm_stream_si128 Store MOVNTDQ

_mm256_stream_si256 Store VMOVNTDQ

_mm_stream_si32 Store MOVNTI

_mm_stream_si64* Store MOVNTI

_mm_clflush Flush CLFLUSH

_mm_lfence Guarantee visibility LFENCE

_mm_mfence Guarantee visibility MFENCE

void _mm_stream_pd(double *p, __m128d a)

Stores the data in a to the address p without polluting caches. The address p must be 16-byte (128-bit version) aligned. If the cache line containing address p is already in the cache, the cache will be updated.

p[0] p[1]

a0 a1

void _mm256_stream_pd(double *p, __m256d a)


Stores the data in a to the address p without polluting caches. The address p must be 32-byte (VEX.256 encoded version) aligned. If the cache line containing address p is already in the cache, the cache will be updated.

p[0] p[1]

a0 a1

void _mm_stream_si128(__m128i *p, __m128i a)

Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16-byte (128-bit version) aligned.

*p

a

void _mm256_stream_si256(__m256i *p, __m256i a)

Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 32-byte (VEX.256 encoded version) aligned.

*p

a

void _mm_stream_si32(int *p, int a)

Stores the 32-bit integer data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated.

*p

a

void _mm_stream_si64(__int64 *p, __int64 a)

Stores the 64-bit integer data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache is updated.


*p

a

void _mm_clflush(void const *p)

The cache line containing address p is flushed and invalidated from all caches in the coherency domain.

void _mm_lfence(void)

Guarantees that every load instruction that precedes, in program order, the load fence instruction is globally visible before any load instruction that follows the fence in program order.

void _mm_mfence(void)

Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.

Miscellaneous Intrinsics

The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsics for miscellaneous operations are listed in the following table followed by descriptions. The prototypes for Intel® SSE2 intrinsics are in the emmintrin.h header file.

Intrinsic Operation Corresponding Intel® SSE2 Instruction

_mm_packs_epi16 Packed Saturation PACKSSWB

_mm_packs_epi32 Packed Saturation PACKSSDW

_mm_packus_epi16 Packed Saturation PACKUSWB

_mm_extract_epi16 Extraction PEXTRW

_mm_insert_epi16 Insertion PINSRW


_mm_movemask_epi8 Mask Creation PMOVMSKB

_mm_shuffle_epi32 Shuffle PSHUFD

_mm_shufflehi_epi16 Shuffle PSHUFHW

_mm_shufflelo_epi16 Shuffle PSHUFLW

_mm_unpackhi_epi8 Interleave PUNPCKHBW

_mm_unpackhi_epi16 Interleave PUNPCKHWD

_mm_unpackhi_epi32 Interleave PUNPCKHDQ

_mm_unpackhi_epi64 Interleave PUNPCKHQDQ

_mm_unpacklo_epi8 Interleave PUNPCKLBW

_mm_unpacklo_epi16 Interleave PUNPCKLWD

_mm_unpacklo_epi32 Interleave PUNPCKLDQ

_mm_unpacklo_epi64 Interleave PUNPCKLQDQ

_mm_movepi64_pi64 Move MOVDQ2Q

_mm_movpi64_epi64 Move MOVQ2DQ

_mm_move_epi64 Move MOVQ

_mm_unpackhi_pd Interleave UNPCKHPD

_mm_unpacklo_pd Interleave UNPCKLPD

_mm_movemask_pd Create mask MOVMSKPD

_mm_shuffle_pd Select values SHUFPD

__m128i _mm_packs_epi16(__m128i a, __m128i b)


Packs the 16 signed 16-bit integers from a and b into signed 8-bit integers and saturates.

R0 ... R7 R8 ... R15

Signed Saturate(a0) ... Signed Saturate(a7) Signed Saturate(b0) ... Signed Saturate(b7)

__m128i _mm_packs_epi32(__m128i a, __m128i b)

Packs the 8 signed 32-bit integers from a and b into signed 16-bit integers and saturates.

R0 ... R3 R4 ... R7

Signed Saturate(a0) ... Signed Saturate(a3) Signed Saturate(b0) ... Signed Saturate(b3)

__m128i _mm_packus_epi16(__m128i a, __m128i b)

Packs the 16 signed 16-bit integers from a and b into 8-bit unsigned integers and saturates.

R0 ... R7 R8 ... R15

Unsigned Saturate(a0) ... Unsigned Saturate(a7) Unsigned Saturate(b0) ... Unsigned Saturate(b7)

int _mm_extract_epi16(__m128i a, int imm)

Extracts the selected signed or unsigned 16-bit integer from a and zero extends. The selector imm must be an immediate.

R0

(imm == 0) ? a0 : (imm == 1) ? a1 : ... : a7

__m128i _mm_insert_epi16(__m128i a, int b, int imm)

Inserts the least significant 16 bits of b into the selected 16-bit integer of a. The selector imm must be an immediate.


R0 R1 ... R7

(imm == 0) ? b : a0; (imm == 1) ? b : a1; ... (imm == 7) ? b : a7;

int _mm_movemask_epi8(__m128i a)

Creates a 16-bit mask from the most significant bits of the 16 signed or unsigned 8-bit integers in a and zero extends the upper bits.

R0

a15[7] << 15 | a14[7] << 14 | ... a1[7] << 1 | a0[7]
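The mask construction can be sketched in scalar C (the model name is illustrative); a common use is combining this mask with a packed compare result to locate matching byte positions:

```c
#include <assert.h>

/* Scalar sketch of _mm_movemask_epi8: bit i of the result is the most
   significant bit of byte a[i]; the upper bits of the result are zero. */
static int model_movemask_epi8(const unsigned char a[16]) {
    int mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= ((a[i] >> 7) & 1) << i;   /* a[i] bit 7 goes to result bit i */
    return mask;
}
```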

__m128i _mm_shuffle_epi32(__m128i a, int imm)

Shuffles the 4 signed or unsigned 32-bit integers in a as specified by imm. The shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a description of shuffle semantics.

__m128i _mm_shufflehi_epi16(__m128i a, int imm)

Shuffles the upper 4 signed or unsigned 16-bit integers in a as specified by imm. The shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a description of shuffle semantics.

__m128i _mm_shufflelo_epi16(__m128i a, int imm)

Shuffles the lower 4 signed or unsigned 16-bit integers in a as specified by imm. The shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a description of shuffle semantics.

__m128i _mm_unpackhi_epi8(__m128i a, __m128i b)

Interleaves the upper 8 signed or unsigned 8-bit integers in a with the upper 8 signed or unsigned 8-bit integers in b.

R0 R1 R2 R3 ... R14 R15

a8 b8 a9 b9 ... a15 b15

__m128i _mm_unpackhi_epi16(__m128i a, __m128i b)

Interleaves the upper 4 signed or unsigned 16-bit integers in a with the upper 4 signed or unsigned 16-bit integers in b.


R0 R1 R2 R3 R4 R5 R6 R7

a4 b4 a5 b5 a6 b6 a7 b7

__m128i _mm_unpackhi_epi32(__m128i a, __m128i b)

Interleaves the upper 2 signed or unsigned 32-bit integers in a with the upper 2 signed or unsigned 32-bit integers in b.

R0 R1 R2 R3

a2 b2 a3 b3

__m128i _mm_unpackhi_epi64(__m128i a, __m128i b)

Interleaves the upper signed or unsigned 64-bit integer in a with the upper signed or unsigned 64-bit integer in b.

R0 R1

a1 b1

__m128i _mm_unpacklo_epi8(__m128i a, __m128i b)

Interleaves the lower 8 signed or unsigned 8-bit integers in a with the lower 8 signed or unsigned 8-bit integers in b.

R0 R1 R2 R3 ... R14 R15

a0 b0 a1 b1 ... a7 b7

__m128i _mm_unpacklo_epi16(__m128i a, __m128i b)

Interleaves the lower 4 signed or unsigned 16-bit integers in a with the lower 4 signed or unsigned 16-bit integers in b.

R0 R1 R2 R3 R4 R5 R6 R7

a0 b0 a1 b1 a2 b2 a3 b3

__m128i _mm_unpacklo_epi32(__m128i a, __m128i b)


Interleaves the lower 2 signed or unsigned 32-bit integers in a with the lower 2 signed or unsigned 32-bit integers in b.

R0 R1 R2 R3

a0 b0 a1 b1

__m128i _mm_unpacklo_epi64(__m128i a, __m128i b)

Interleaves the lower signed or unsigned 64-bit integer in a with the lower signed or unsigned 64-bit integer in b.

R0 R1

a0 b0

__m64 _mm_movepi64_pi64(__m128i a)

Returns the lower 64 bits of a as an __m64 type.

R0

a0

__m128i _mm_movpi64_epi64(__m64 a)

Moves the 64 bits of a to the lower 64 bits of the result, zeroing the upper bits.

R0 R1

a0 0x0

__m128i _mm_move_epi64(__m128i a)

Moves the lower 64 bits of a to the lower 64 bits of the result, zeroing the upper bits.

R0 R1

a0 0x0

__m128d _mm_unpackhi_pd(__m128d a, __m128d b)

Interleaves the upper DP FP values of a and b.


R0 R1

a1 b1

__m128d _mm_unpacklo_pd(__m128d a, __m128d b)

Interleaves the lower DP FP values of a and b.

R0 R1

a0 b0

int _mm_movemask_pd(__m128d a)

Creates a two-bit mask from the sign bits of the two DP FP values of a.

R

sign(a1) << 1 | sign(a0)

__m128d _mm_shuffle_pd(__m128d a, __m128d b, int i)

Selects two specific DP FP values from a and b, based on the mask i. The mask must be an immediate. See Macro Function for Shuffle for a description of the shuffle semantics.

Casting Support Intrinsics

The Intel® C++ Compiler supports casting between various single-precision, double-precision, and integer vector types. These intrinsics do not convert values; they reinterpret one data type as another without changing the value. The intrinsics for casting support do not have any corresponding Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions. The syntax for the casting support intrinsics is as follows:

__m128 _mm_castpd_ps(__m128d in);
__m128i _mm_castpd_si128(__m128d in);
__m128d _mm_castps_pd(__m128 in);
__m128i _mm_castps_si128(__m128 in);
__m128 _mm_castsi128_ps(__m128i in);


__m128d _mm_castsi128_pd(__m128i in);
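Since the cast intrinsics only reinterpret bits, they compile to no instructions. The distinction from a value conversion can be shown with a scalar sketch for one 32-bit lane (the float_bits helper is illustrative, not part of the intrinsics):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The casting intrinsics reinterpret the same bits under a new type and
   never convert values. memcpy below models that bit reinterpretation
   for a single float, the way _mm_castps_si128 treats a whole register. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* bit reinterpretation, no conversion */
    return u;
}
```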

Pause Intrinsic

The prototype for this Intel® Streaming SIMD Extensions 2 (Intel® SSE2) intrinsic is in the xmmintrin.h header file.

PAUSE Intrinsic

void _mm_pause(void)

The pause intrinsic is used in spin-wait loops with processors implementing dynamic execution (especially out-of-order execution). In a spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides an especially significant performance gain. The execution of the next instruction is delayed for an implementation-specific amount of time. The PAUSE instruction does not modify the architectural state. For dynamic scheduling, the PAUSE instruction reduces the penalty of exiting from the spin loop.

In the following example, the program spins until memory location A matches the value in register eax. The code sequence shows a test-and-test-and-set: the spin occurs only after the attempt to get the lock has failed.

get_lock:
mov eax, 1
xchg eax, A      ; Try to get lock
cmp eax, 0       ; Test if successful
jne spin_loop

; critical_section code
mov A, 0         ; Release lock
jmp continue

spin_loop:
pause            ; Spin-loop hint
cmp 0, A         ; Check lock availability
jne spin_loop
jmp get_lock

continue:        ; Other code

NOTE. The first branch is predicted to fall through to the critical section in anticipation of successfully gaining access to the lock. It is highly recommended that all spin-wait loops include the PAUSE instruction. Since PAUSE is backward compatible with all existing IA-32 architecture-based processor generations, a test for processor type (a CPUID test) is not needed. All legacy processors execute the PAUSE instruction as a NOP, but on processors that use the PAUSE instruction as a hint there can be a significant performance benefit.
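The same spin-wait pattern can be sketched in C. This is a minimal model, assuming a hypothetical cpu_relax() wrapper that expands to _mm_pause() on x86 targets and to nothing elsewhere:

```c
#include <stdatomic.h>

/* Hypothetical portability wrapper: on SSE2-capable x86 targets it emits
   the PAUSE hint via _mm_pause(); on other targets it is a no-op. */
#if defined(__SSE2__)
# include <emmintrin.h>
# define cpu_relax() _mm_pause()
#else
# define cpu_relax() ((void)0)
#endif

static atomic_int lock_word = 0;

/* Test-and-test-and-set: spin on plain loads, retry the exchange only
   when the lock looks free, and issue the pause hint in the inner loop. */
static void spin_lock(void) {
    while (atomic_exchange(&lock_word, 1) != 0)   /* try to get lock   */
        while (atomic_load(&lock_word) != 0)      /* spin on a read    */
            cpu_relax();                          /* spin-loop hint    */
}

static void spin_unlock(void) {
    atomic_store(&lock_word, 0);                  /* release lock */
}
```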

Macro Function for Shuffle

Intel® Streaming SIMD Extensions 2 (Intel® SSE2) provides a macro function to help create constants that describe shuffle operations. The macro takes two small integers (each in the range 0 to 1) and combines them into a 2-bit immediate value used by the SHUFPD instruction. See the following example.


Shuffle Function Macro

You can view the two integers as selectors: one chooses which value from the first input operand, and the other which value from the second input operand, is placed in the result.

View of Original and Result Words with Shuffle Function Macro
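As a concrete sketch of the macro's semantics (the MODEL_* names are illustrative; the SSE2 headers define the real macro as _MM_SHUFFLE2), the 2-bit immediate selects one element from each operand:

```c
#include <assert.h>

/* Scalar model of SHUFPD selection. The real SSE2 macro is
   _MM_SHUFFLE2(x, y) = (x << 1) | y; the MODEL_* names are illustrative. */
#define MODEL_MM_SHUFFLE2(x, y) (((x) << 1) | (y))

static void model_shuffle_pd(const double a[2], const double b[2],
                             int imm, double r[2]) {
    r[0] = a[imm & 1];          /* low selector bit picks from a  */
    r[1] = b[(imm >> 1) & 1];   /* high selector bit picks from b */
}
```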

Intrinsics Returning Vectors of Undefined Values

These intrinsics generate vectors of undefined values. The result of each intrinsic is usually used as an argument to another intrinsic that requires all operands to be initialized, when the content of a particular argument does not matter. These intrinsics are declared in the immintrin.h header file. For example, you can use such an intrinsic when you need to calculate the sum of packed double-precision floating-point values located in an xmm register. To avoid unnecessary moves, you can use the following code to obtain the required result in the low 64 bits:

__m128d HILO = doSomeWork();
__m128d HI = _mm_unpackhi_pd(HILO, _mm_undefined_pd());
__m128d result = _mm_add_sd(HI, HILO);


_mm_undefined_ps()

This intrinsic returns a vector of four single-precision floating-point elements. The content of the vector is not specified.

Syntax

extern __m128 _mm_undefined_ps(void);

_mm_undefined_pd()

This intrinsic returns a vector of two double-precision floating-point elements. The content of the vector is not specified.

Syntax

extern __m128d _mm_undefined_pd(void);

_mm_undefined_si128()

This intrinsic returns a vector of four packed doubleword integer elements. The content of the vector is not specified.

Syntax

extern __m128i _mm_undefined_si128(void);

See Also

• _mm256_undefined_ps()
• _mm256_undefined_pd()
• _mm256_undefined_si256()


Intrinsics for Intel(R) Streaming SIMD Extensions 3

Overview: Intel(R) Streaming SIMD Extensions 3

The Intel® C++ intrinsics listed in this section are designed for the Intel® Pentium® 4 processor with Intel® Streaming SIMD Extensions 3 (Intel® SSE3). The Intel® SSE3 intrinsics include:

• Floating-point Vector Intrinsics
• Integer Vector Intrinsics
• Miscellaneous Intrinsics
• Macro Functions

The prototypes for these intrinsics are in the pmmintrin.h header file.

NOTE. You can also use the single ia32intrin.h header file for any IA-32 architecture-based intrinsics.

Integer Vector Intrinsics

The integer vector intrinsic listed here is designed for the Intel® Pentium® 4 processor with Intel® Streaming SIMD Extensions 3 (Intel® SSE3). The prototype for this intrinsic is in the pmmintrin.h header file. R represents the register into which the return value is placed.

__m128i _mm_lddqu_si128(__m128i const *p);

Loads an unaligned 128-bit value. This intrinsic differs from MOVDQU in that it can provide higher performance in some cases; however, it may provide lower performance than MOVDQU if the memory value being read was just previously written.

R

*p;


Single-precision Floating-point Vector Intrinsics

The single-precision floating-point vector intrinsics listed here are designed for the Intel® Pentium® 4 processor with Intel® Streaming SIMD Extensions 3 (Intel® SSE3). The results of each intrinsic operation are placed in the registers R0, R1, R2, and R3. The prototypes for these intrinsics are in the pmmintrin.h header file.

Intrinsic Name Operation Corresponding Intel® SSE3 Instruction

_mm_addsub_ps Subtract and add ADDSUBPS

_mm_hadd_ps Add HADDPS

_mm_hsub_ps Subtract HSUBPS

_mm_movehdup_ps Duplicate MOVSHDUP

_mm_moveldup_ps Duplicate MOVSLDUP

extern __m128 _mm_addsub_ps(__m128 a, __m128 b);

Subtracts even vector elements while adding odd vector elements.

R0 R1 R2 R3

a0 - b0; a1 + b1; a2 - b2; a3 + b3;

extern __m128 _mm_hadd_ps(__m128 a, __m128 b);

Adds adjacent vector elements.

R0 R1 R2 R3

a0 + a1; a2 + a3; b0 + b1; b2 + b3;

extern __m128 _mm_hsub_ps(__m128 a, __m128 b);

Subtracts adjacent vector elements.


R0 R1 R2 R3

a0 - a1; a2 - a3; b0 - b1; b2 - b3;

extern __m128 _mm_movehdup_ps(__m128 a);

Duplicates odd vector elements into even vector elements.

R0 R1 R2 R3

a1; a1; a3; a3;

extern __m128 _mm_moveldup_ps(__m128 a);

Duplicates even vector elements into odd vector elements.

R0 R1 R2 R3

a0; a0; a2; a2;
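Two applications of horizontal add reduce four lanes to a full sum. The sketch below models _mm_hadd_ps in scalar C (the model names are illustrative, not the intrinsics themselves) to show the reduction pattern:

```c
#include <assert.h>

/* Scalar model of _mm_hadd_ps: adjacent pairs of a fill R0/R1, adjacent
   pairs of b fill R2/R3. */
static void model_hadd_ps(const float a[4], const float b[4], float r[4]) {
    r[0] = a[0] + a[1];
    r[1] = a[2] + a[3];
    r[2] = b[0] + b[1];
    r[3] = b[2] + b[3];
}

/* Reducing a 4-element vector to its sum takes two hadd steps. */
static float sum4(const float v[4]) {
    float t[4], s[4];
    model_hadd_ps(v, v, t);   /* t = {v0+v1, v2+v3, v0+v1, v2+v3} */
    model_hadd_ps(t, t, s);   /* s[0] = v0+v1+v2+v3 */
    return s[0];
}
```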

Double-precision Floating-point Vector Intrinsics

The double-precision floating-point intrinsics listed here are designed for the Intel® Pentium® 4 processor with Intel® Streaming SIMD Extensions 3 (Intel® SSE3). The results of each intrinsic operation are placed in the registers R0 and R1. The prototypes for these intrinsics are in the pmmintrin.h header file.

Intrinsic Name Operation Corresponding Intel® SSE3 Instruction

_mm_addsub_pd Subtract and add ADDSUBPD

_mm_hadd_pd Add HADDPD

_mm_hsub_pd Subtract HSUBPD

_mm_loaddup_pd Duplicate MOVDDUP

_mm_movedup_pd Duplicate MOVDDUP


extern __m128d _mm_addsub_pd(__m128d a, __m128d b);

Adds upper vector element while subtracting lower vector element.

R0 R1

a0 - b0; a1 + b1;

extern __m128d _mm_hadd_pd(__m128d a, __m128d b);

Adds adjacent vector elements.

R0 R1

a0 + a1; b0 + b1;

extern __m128d _mm_hsub_pd(__m128d a, __m128d b);

Subtracts adjacent vector elements.

R0 R1

a0 - a1; b0 - b1;

extern __m128d _mm_loaddup_pd(double const * dp);

Duplicates a double value into upper and lower vector elements.

R0 R1

*dp; *dp;

extern __m128d _mm_movedup_pd(__m128d a);

Duplicates lower vector element into upper vector element.

R0 R1

a0; a0;


Miscellaneous Intrinsics

The miscellaneous intrinsics listed here are designed for the Intel® Pentium® 4 processor with Intel® Streaming SIMD Extensions 3 (Intel® SSE3). The prototypes for these intrinsics are in the pmmintrin.h header file.

extern void _mm_monitor(void const *p, unsigned extensions, unsigned hints);

Generates the MONITOR instruction. This sets up an address range for the monitor hardware using p to provide the logical address, which is passed to the MONITOR instruction in register eax. The extensions parameter contains optional extensions to the monitor hardware and is passed in ecx. The hints parameter contains hints to the monitor hardware and is passed in edx. A non-zero value for extensions causes a general protection fault.

extern void _mm_mwait(unsigned extensions, unsigned hints);

Generates the MWAIT instruction. This instruction is a hint that allows the processor to stop execution and enter an implementation-dependent optimized state until occurrence of a class of events. In future processor designs, the extensions and hints parameters may be used to convey additional information to the processor. All non-zero values of extensions and hints are reserved. A non-zero value for extensions causes a general protection fault.

Macro Functions

The macro function intrinsics listed here are designed for the Intel® Pentium® 4 processor with Intel® Streaming SIMD Extensions 3 (Intel® SSE3). The prototypes for these intrinsics are in the pmmintrin.h header file.

_MM_SET_DENORMALS_ZERO_MODE(x)

Macro arguments: _MM_DENORMALS_ZERO_ON or _MM_DENORMALS_ZERO_OFF. This macro causes the "denormals are zero" mode to be turned on or off by setting the appropriate bit of the control register.

_MM_GET_DENORMALS_ZERO_MODE()

No arguments. This macro returns the current value of the "denormals are zero" mode bit of the control register.


Intrinsics for Intel(R) Supplemental Streaming SIMD Extensions 3

Overview: Intel(R) Supplemental Streaming SIMD Extensions 3

The Intel® C++ intrinsics listed in this section correspond to the Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instructions. The prototypes for these intrinsics are in tmmintrin.h. You can also use the ia32intrin.h header file for these intrinsics.

• Addition Intrinsics
• Subtraction Intrinsics
• Multiplication Intrinsics
• Absolute Value Intrinsics
• Shuffle Intrinsics
• Concatenate Intrinsics
• Negation Intrinsics

Addition Intrinsics

These Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) intrinsics are used for horizontal addition. The prototypes for these intrinsics are in tmmintrin.h. You can also use the ia32intrin.h header file for these intrinsics.

extern __m128i _mm_hadd_epi16 (__m128i a, __m128i b);

Add horizontally packed signed words. Interpreting a, b, and r as arrays of 16-bit signed integers: for (i = 0; i < 4; i++) { r[i] = a[2*i] + a[2*i+1]; r[i+4] = b[2*i] + b[2*i+1]; }


extern __m128i _mm_hadd_epi32 (__m128i a, __m128i b);

Add horizontally packed signed dwords. Interpreting a, b, and r as arrays of 32-bit signed integers: for (i = 0; i < 2; i++) { r[i] = a[2*i] + a[2*i+1]; r[i+2] = b[2*i] + b[2*i+1]; }

extern __m128i _mm_hadds_epi16 (__m128i a, __m128i b);

Add horizontally packed signed words with signed saturation. Interpreting a, b, and r as arrays of 16-bit signed integers: for (i = 0; i < 4; i++) { r[i] = signed_saturate_to_word(a[2*i] + a[2*i+1]); r[i+4] = signed_saturate_to_word(b[2*i] + b[2*i+1]); }

extern __m64 _mm_hadd_pi16 (__m64 a, __m64 b);

Add horizontally packed signed words. Interpreting a, b, and r as arrays of 16-bit signed integers: for (i = 0; i < 2; i++) { r[i] = a[2*i] + a[2*i+1]; r[i+2] = b[2*i] + b[2*i+1]; }

extern __m64 _mm_hadd_pi32 (__m64 a, __m64 b);

Add horizontally packed signed dwords. Interpreting a, b, and r as arrays of 32-bit signed integers: r[0] = a[1] + a[0]; r[1] = b[1] + b[0];

extern __m64 _mm_hadds_pi16 (__m64 a, __m64 b);

Add horizontally packed signed words with signed saturation.


Interpreting a, b, and r as arrays of 16-bit signed integers: for (i = 0; i < 2; i++) { r[i] = signed_saturate_to_word(a[2*i] + a[2*i+1]); r[i+2] = signed_saturate_to_word(b[2*i] + b[2*i+1]); }

Subtraction Intrinsics

These Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) intrinsics are used for horizontal subtraction. The prototypes for these intrinsics are in tmmintrin.h. You can also use the ia32intrin.h header file for these intrinsics.

extern __m128i _mm_hsub_epi16 (__m128i a, __m128i b);

Subtract horizontally packed signed words. Interpreting a, b, and r as arrays of 16-bit signed integers: for (i = 0; i < 4; i++) { r[i] = a[2*i] - a[2*i+1]; r[i+4] = b[2*i] - b[2*i+1]; }

extern __m128i _mm_hsub_epi32 (__m128i a, __m128i b);

Subtract horizontally packed signed dwords. Interpreting a, b, and r as arrays of 32-bit signed integers: for (i = 0; i < 2; i++) { r[i] = a[2*i] - a[2*i+1]; r[i+2] = b[2*i] - b[2*i+1]; }

extern __m128i _mm_hsubs_epi16 (__m128i a, __m128i b);

Subtract horizontally packed signed words with signed saturation.


Interpreting a, b, and r as arrays of 16-bit signed integers: for (i = 0; i < 4; i++) { r[i] = signed_saturate_to_word(a[2*i] - a[2*i+1]); r[i+4] = signed_saturate_to_word(b[2*i] - b[2*i+1]); }

extern __m64 _mm_hsub_pi16 (__m64 a, __m64 b);

Subtract horizontally packed signed words. Interpreting a, b, and r as arrays of 16-bit signed integers: for (i = 0; i < 2; i++) { r[i] = a[2*i] - a[2*i+1]; r[i+2] = b[2*i] - b[2*i+1]; }

extern __m64 _mm_hsub_pi32 (__m64 a, __m64 b);

Subtract horizontally packed signed dwords. Interpreting a, b, and r as arrays of 32-bit signed integers: r[0] = a[0] - a[1]; r[1] = b[0] - b[1];

extern __m64 _mm_hsubs_pi16 (__m64 a, __m64 b);

Subtract horizontally packed signed words with signed saturation. Interpreting a, b, and r as arrays of 16-bit signed integers: for (i = 0; i < 2; i++) { r[i] = signed_saturate_to_word(a[2*i] - a[2*i+1]); r[i+2] = signed_saturate_to_word(b[2*i] - b[2*i+1]); }


Multiplication Intrinsics

These Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) intrinsics are used for multiplication. The prototypes for these intrinsics are in tmmintrin.h. You can also use the ia32intrin.h header file for these intrinsics.

extern __m128i _mm_maddubs_epi16 (__m128i a, __m128i b);

Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed words. Interpreting a as an array of unsigned 8-bit integers, b as an array of signed 8-bit integers, and r as an array of 16-bit signed integers: for (i = 0; i < 8; i++) { r[i] = signed_saturate_to_word(a[2*i+1] * b[2*i+1] + a[2*i] * b[2*i]); }

extern __m64 _mm_maddubs_pi16 (__m64 a, __m64 b);

Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed words. Interpreting a as an array of unsigned 8-bit integers, b as an array of signed 8-bit integers, and r as an array of 16-bit signed integers: for (i = 0; i < 4; i++) { r[i] = signed_saturate_to_word(a[2*i+1] * b[2*i+1] + a[2*i] * b[2*i]); }

extern __m128i _mm_mulhrs_epi16 (__m128i a, __m128i b);

Multiply signed words, scale and round signed dwords, pack high 16 bits. Interpreting a, b, and r as arrays of signed 16-bit integers: for (i = 0; i < 8; i++) { r[i] = (( (int32)((a[i] * b[i]) >> 14) + 1) >> 1) & 0xFFFF; }

extern __m64 _mm_mulhrs_pi16 (__m64 a, __m64 b);

Multiply signed words, scale and round signed dwords, pack high 16 bits.


Interpreting a, b, and r as arrays of signed 16-bit integers: for (i = 0; i < 4; i++) { r[i] = (( (int32)((a[i] * b[i]) >> 14) + 1) >> 1) & 0xFFFF; }
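One lane of the multiply-high-with-rounding operation can be modeled in scalar C as follows (the model name is illustrative). In Q15 fixed-point terms it computes a rounded a*b:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one _mm_mulhrs_epi16 lane: take the 32-bit product,
   shift right by 14, round by adding 1 and shifting once more, and keep
   the low 16 bits of the result. */
static int16_t model_mulhrs_lane(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;
    return (int16_t)((((p >> 14) + 1) >> 1) & 0xFFFF);
}
```

In Q15 terms, multiplying 0.5 by 0.5 (16384 * 16384) yields 0.25 (8192).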

Absolute Value Intrinsics

These Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) intrinsics are used to compute absolute values. The prototypes for these intrinsics are in tmmintrin.h. You can also use the ia32intrin.h header file for these intrinsics.

extern __m128i _mm_abs_epi8 (__m128i a);

Compute absolute value of signed bytes. Interpreting a and r as arrays of signed 8-bit integers: for (i = 0; i < 16; i++) { r[i] = abs(a[i]); }

extern __m128i _mm_abs_epi16 (__m128i a);

Compute absolute value of signed words. Interpreting a and r as arrays of signed 16-bit integers: for (i = 0; i < 8; i++) { r[i] = abs(a[i]); }

extern __m128i _mm_abs_epi32 (__m128i a);

Compute absolute value of signed dwords. Interpreting a and r as arrays of signed 32-bit integers: for (i = 0; i < 4; i++) { r[i] = abs(a[i]); }

extern __m64 _mm_abs_pi8 (__m64 a);


Compute absolute value of signed bytes. Interpreting a and r as arrays of signed 8-bit integers: for (i = 0; i < 8; i++) { r[i] = abs(a[i]); }

extern __m64 _mm_abs_pi16 (__m64 a);

Compute absolute value of signed words. Interpreting a and r as arrays of signed 16-bit integers: for (i = 0; i < 4; i++) { r[i] = abs(a[i]); }

extern __m64 _mm_abs_pi32 (__m64 a);

Compute absolute value of signed dwords. Interpreting a and r as arrays of signed 32-bit integers: for (i = 0; i < 2; i++) { r[i] = abs(a[i]); }

Shuffle Intrinsics

These Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) intrinsics are used to perform shuffle operations. The prototypes for these intrinsics are in tmmintrin.h. You can also use the ia32intrin.h header file for these intrinsics.

extern __m128i _mm_shuffle_epi8 (__m128i a, __m128i b);

Shuffle bytes from a according to contents of b.


Interpreting a, b, and r as arrays of unsigned 8-bit integers: for (i = 0; i < 16; i++){ if (b[i] & 0x80){ r[i] = 0; } else { r[i] = a[b[i] & 0x0F]; } }

extern __m64 _mm_shuffle_pi8 (__m64 a, __m64 b);

Shuffle bytes from a according to contents of b.

Interpreting a, b, and r as arrays of unsigned 8-bit integers: for (i = 0; i < 8; i++){ if (b[i] & 0x80){ r[i] = 0; } else { r[i] = a[b[i] & 0x07]; } }
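The byte-shuffle semantics can be sketched in scalar C (the model name is illustrative); each control byte either zeroes the result byte or selects a source byte by its low four bits:

```c
#include <assert.h>

/* Scalar model of _mm_shuffle_epi8: a control byte with its high bit set
   zeroes the result byte; otherwise its low 4 bits index into a. */
static void model_shuffle_epi8(const unsigned char a[16],
                               const unsigned char b[16],
                               unsigned char r[16]) {
    for (int i = 0; i < 16; i++)
        r[i] = (b[i] & 0x80) ? 0 : a[b[i] & 0x0F];
}
```

A control vector of {15, 14, ..., 0} therefore reverses the 16 bytes.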

Concatenate Intrinsics

These Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) intrinsics are used for concatenation. The prototypes for these intrinsics are in tmmintrin.h. You can also use the ia32intrin.h header file for these intrinsics.

extern __m128i _mm_alignr_epi8 (__m128i a, __m128i b, int n);


Concatenate a and b, extract byte-aligned result shifted to the right by n.

Interpreting t1 as a 256-bit unsigned integer, and a, b, and r as 128-bit unsigned integers: t1[255:128] = a; t1[127:0] = b; t1[255:0] = t1[255:0] >> (8 * n); // unsigned shift r[127:0] = t1[127:0];

extern __m64 _mm_alignr_pi8 (__m64 a, __m64 b, int n);

Concatenate a and b, extract byte-aligned result shifted to the right by n.

Interpreting t1 as a 128-bit unsigned integer, and a, b, and r as 64-bit unsigned integers: t1[127:64] = a; t1[63:0] = b; t1[127:0] = t1[127:0] >> (8 * n); // unsigned shift r[63:0] = t1[63:0];
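The 128-bit alignr operation can be sketched in scalar C (the model name is illustrative): concatenate the two operands with b in the low half, shift right by n bytes, and keep the low 16 bytes:

```c
#include <assert.h>

/* Scalar model of _mm_alignr_epi8: form the 32-byte value a:b with b in
   the low half, shift right by n bytes, keep the low 16 bytes. */
static void model_alignr_epi8(const unsigned char a[16],
                              const unsigned char b[16],
                              int n, unsigned char r[16]) {
    unsigned char t[32];
    for (int i = 0; i < 16; i++) { t[i] = b[i]; t[i + 16] = a[i]; }
    for (int i = 0; i < 16; i++)
        r[i] = (n + i < 32) ? t[n + i] : 0;   /* shifted-in bytes are zero */
}
```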

Negation Intrinsics

These Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) intrinsics are used for negation. The prototypes for these intrinsics are in tmmintrin.h. You can also use the ia32intrin.h header file for these intrinsics.

extern __m128i _mm_sign_epi8 (__m128i a, __m128i b);

Negate packed bytes in a if corresponding sign in b is less than zero.


Interpreting a, b, and r as arrays of signed 8-bit integers: for (i = 0; i < 16; i++){ if (b[i] < 0){ r[i] = -a[i]; } else if (b[i] == 0){ r[i] = 0; } else { r[i] = a[i]; } }

extern __m128i _mm_sign_epi16 (__m128i a, __m128i b);

Negate packed words in a if corresponding sign in b is less than zero.


Interpreting a, b, and r as arrays of signed 16-bit integers: for (i = 0; i < 8; i++){ if (b[i] < 0){ r[i] = -a[i]; } else if (b[i] == 0){ r[i] = 0; } else { r[i] = a[i]; } }

extern __m128i _mm_sign_epi32 (__m128i a, __m128i b);

Negate packed dwords in a if corresponding sign in b is less than zero.


Interpreting a, b, and r as arrays of signed 32-bit integers: for (i = 0; i < 4; i++){ if (b[i] < 0){ r[i] = -a[i]; } else if (b[i] == 0){ r[i] = 0; } else { r[i] = a[i]; } }

extern __m64 _mm_sign_pi8 (__m64 a, __m64 b);

Negate packed bytes in a if corresponding sign in b is less than zero.


Interpreting a, b, and r as arrays of signed 8-bit integers (a __m64 operand holds 8 bytes):

for (i = 0; i < 8; i++){
    if (b[i] < 0) { r[i] = -a[i]; }
    else if (b[i] == 0) { r[i] = 0; }
    else { r[i] = a[i]; }
}

extern __m64 _mm_sign_pi16 (__m64 a, __m64 b);

Negate packed words in a if corresponding sign in b is less than zero.


Interpreting a, b, and r as arrays of signed 16-bit integers (a __m64 operand holds 4 words):

for (i = 0; i < 4; i++){
    if (b[i] < 0) { r[i] = -a[i]; }
    else if (b[i] == 0) { r[i] = 0; }
    else { r[i] = a[i]; }
}

extern __m64 _mm_sign_pi32 (__m64 a, __m64 b);

Negate packed dwords in a if corresponding sign in b is less than zero.


Interpreting a, b, and r as arrays of signed 32-bit integers:

for (i = 0; i < 2; i++){
    if (b[i] < 0) { r[i] = -a[i]; }
    else if (b[i] == 0) { r[i] = 0; }
    else { r[i] = a[i]; }
}


Intrinsics for Intel(R) Streaming SIMD Extensions 4

Overview: Intel(R) Streaming SIMD Extensions 4

The intrinsics in this section correspond to Intel® Streaming SIMD Extensions 4 (Intel® SSE4) instructions. Intel® SSE4 includes the following categories:

• Vectorizing Compiler and Media Accelerators. The prototypes for these intrinsics are in the smmintrin.h file.
• Efficient Accelerated String and Text Processing. The prototypes for these intrinsics are in the nmmintrin.h file.

Vectorizing Compiler and Media Accelerators

Overview: Vectorizing Compiler and Media Accelerators

The intrinsics in this section correspond to Intel® Streaming SIMD Extensions 4 (Intel® SSE4) Vectorizing Compiler and Media Accelerators instructions.

• Packed Blending Intrinsics for Streaming SIMD Extensions 4
• Floating Point Dot Product Intrinsics for Streaming SIMD Extensions 4
• Packed Format Conversion Intrinsics for Streaming SIMD Extensions 4
• Packed Integer Min/Max Intrinsics for Streaming SIMD Extensions 4
• Floating Point Rounding Intrinsics for Streaming SIMD Extensions 4
• DWORD Multiply Intrinsics for Streaming SIMD Extensions 4
• Register Insertion/Extraction Intrinsics for Streaming SIMD Extensions 4
• Test Intrinsics for Streaming SIMD Extensions 4
• Packed DWORD to Unsigned WORD Intrinsic for Streaming SIMD Extensions 4
• Packed Compare for Equal Intrinsics for Streaming SIMD Extensions 4
• Cacheability Support Intrinsic for Streaming SIMD Extensions 4

The prototypes for these intrinsics are in the smmintrin.h file.


Packed Blending Intrinsics

These Intel® Streaming SIMD Extensions 4 (Intel® SSE4) intrinsics pack multiple operations in a single instruction. Blending conditionally copies one field in the source onto the corresponding field in the destination. The prototypes for these intrinsics are in the smmintrin.h file.

Intrinsic Syntax / Operation / Corresponding Intel® SSE4 Instruction

__m128 _mm_blend_ps(__m128 v1, __m128 v2, const int mask)
    Selects float single precision data from 2 sources using constant mask. (BLENDPS)

__m128d _mm_blend_pd(__m128d v1, __m128d v2, const int mask)
    Selects float double precision data from 2 sources using constant mask. (BLENDPD)

__m128 _mm_blendv_ps(__m128 v1, __m128 v2, __m128 v3)
    Selects float single precision data from 2 sources using variable mask. (BLENDVPS)

__m128d _mm_blendv_pd(__m128d v1, __m128d v2, __m128d v3)
    Selects float double precision data from 2 sources using variable mask. (BLENDVPD)

__m128i _mm_blendv_epi8(__m128i v1, __m128i v2, __m128i mask)
    Selects integer bytes from 2 sources using variable mask. (PBLENDVB)

__m128i _mm_blend_epi16(__m128i v1, __m128i v2, const int mask)
    Selects integer words from 2 sources using constant mask. (PBLENDW)


Floating Point Dot Product Intrinsics

These Intel® Streaming SIMD Extensions 4 (Intel® SSE4) intrinsics enable floating point single precision and double precision dot products. The prototypes for these intrinsics are in the smmintrin.h file.

Intrinsic Operation Corresponding Intel® SSE4 Instruction

_mm_dp_pd Double precision dot product DPPD

_mm_dp_ps Single precision dot product DPPS

__m128d _mm_dp_pd ( __m128d a, __m128d b, const int mask)

This intrinsic calculates the dot product of double precision packed values with mask-defined summing and zeroing of the parts of the result.

__m128 _mm_dp_ps ( __m128 a, __m128 b, const int mask)

This intrinsic calculates the dot product of single precision packed values with mask-defined summing and zeroing of the parts of the result.

Packed Format Conversion Intrinsics

These Intel® Streaming SIMD Extensions 4 (Intel® SSE4) intrinsics convert a packed integer to a zero-extended or sign-extended integer with wider type. The prototypes for these intrinsics are in the smmintrin.h file.

Intrinsic Syntax / Operation / Corresponding Intel® SSE4 Instruction

__m128i _mm_cvtepi8_epi32(__m128i a)
    Sign extend 4 bytes into 4 double words. (PMOVSXBD)

__m128i _mm_cvtepi8_epi64(__m128i a)
    Sign extend 2 bytes into 2 quad words. (PMOVSXBQ)

__m128i _mm_cvtepi8_epi16(__m128i a)
    Sign extend 8 bytes into 8 words. (PMOVSXBW)

__m128i _mm_cvtepi32_epi64(__m128i a)
    Sign extend 2 double words into 2 quad words. (PMOVSXDQ)

__m128i _mm_cvtepi16_epi32(__m128i a)
    Sign extend 4 words into 4 double words. (PMOVSXWD)

__m128i _mm_cvtepi16_epi64(__m128i a)
    Sign extend 2 words into 2 quad words. (PMOVSXWQ)

__m128i _mm_cvtepu8_epi32(__m128i a)
    Zero extend 4 bytes into 4 double words. (PMOVZXBD)

__m128i _mm_cvtepu8_epi64(__m128i a)
    Zero extend 2 bytes into 2 quad words. (PMOVZXBQ)

__m128i _mm_cvtepu8_epi16(__m128i a)
    Zero extend 8 bytes into 8 words. (PMOVZXBW)

__m128i _mm_cvtepu32_epi64(__m128i a)
    Zero extend 2 double words into 2 quad words. (PMOVZXDQ)

__m128i _mm_cvtepu16_epi32(__m128i a)
    Zero extend 4 words into 4 double words. (PMOVZXWD)

__m128i _mm_cvtepu16_epi64(__m128i a)
    Zero extend 2 words into 2 quad words. (PMOVZXWQ)

Packed Integer Min/Max Intrinsics

These Intel® Streaming SIMD Extensions 4 (Intel® SSE4) intrinsics compare packed integers in the destination operand and the source operand, and return the minimum or maximum for each packed operand in the destination operand. The prototypes for these intrinsics are in the smmintrin.h file.


Intrinsic Syntax / Operation / Corresponding Intel® SSE4 Instruction

__m128i _mm_max_epi8(__m128i a, __m128i b)
    Calculates maximum of signed packed integer bytes. (PMAXSB)

__m128i _mm_max_epi32(__m128i a, __m128i b)
    Calculates maximum of signed packed integer double words. (PMAXSD)

__m128i _mm_max_epu32(__m128i a, __m128i b)
    Calculates maximum of unsigned packed integer double words. (PMAXUD)

__m128i _mm_max_epu16(__m128i a, __m128i b)
    Calculates maximum of unsigned packed integer words. (PMAXUW)

__m128i _mm_min_epi8(__m128i a, __m128i b)
    Calculates minimum of signed packed integer bytes. (PMINSB)

__m128i _mm_min_epi32(__m128i a, __m128i b)
    Calculates minimum of signed packed integer double words. (PMINSD)

__m128i _mm_min_epu32(__m128i a, __m128i b)
    Calculates minimum of unsigned packed integer double words. (PMINUD)

__m128i _mm_min_epu16(__m128i a, __m128i b)
    Calculates minimum of unsigned packed integer words. (PMINUW)


Floating Point Rounding Intrinsics

These Intel® Streaming SIMD Extensions 4 (Intel® SSE4) rounding intrinsics cover scalar and packed single-precision and double-precision floating-point operands. The prototypes for these intrinsics are in the smmintrin.h file. The floor and ceil intrinsics correspond to the definitions of floor and ceil in the ISO 9899:1999 standard for the C programming language.

Intrinsic Syntax / Operation / Corresponding Intel® SSE4 Instruction

__m128d _mm_round_pd(__m128d s1, int iRoundMode)
__m128d _mm_floor_pd(__m128d s1)
__m128d _mm_ceil_pd(__m128d s1)
    Packed float double precision rounding. (ROUNDPD)

__m128 _mm_round_ps(__m128 s1, int iRoundMode)
__m128 _mm_floor_ps(__m128 s1)
__m128 _mm_ceil_ps(__m128 s1)
    Packed float single precision rounding. (ROUNDPS)

__m128d _mm_round_sd(__m128d dst, __m128d s1, int iRoundMode)
__m128d _mm_floor_sd(__m128d dst, __m128d s1)
__m128d _mm_ceil_sd(__m128d dst, __m128d s1)
    Single float double precision rounding. (ROUNDSD)

__m128 _mm_round_ss(__m128 dst, __m128 s1, int iRoundMode)
__m128 _mm_floor_ss(__m128 dst, __m128 s1)
__m128 _mm_ceil_ss(__m128 dst, __m128 s1)
    Single float single precision rounding. (ROUNDSS)


DWORD Multiply Intrinsics

These Intel® Streaming SIMD Extensions 4 (Intel® SSE4) DWORD multiply intrinsics are designed to aid vectorization. They enable four simultaneous 32-bit by 32-bit multiplies. The prototypes for these intrinsics are in the smmintrin.h file.

Intrinsic Syntax / Operation / Corresponding Intel® SSE4 Instruction

__m128i _mm_mul_epi32(__m128i a, __m128i b)
    Packed integer 32-bit multiplication of 2 low pairs of operands, producing two 64-bit results. (PMULDQ)

__m128i _mm_mullo_epi32(__m128i a, __m128i b)
    Packed integer 32-bit multiplication with truncation of upper halves of results. (PMULLD)

Register Insertion/Extraction Intrinsics

These Intel® Streaming SIMD Extensions 4 (Intel® SSE4) intrinsics enable data insertion and extraction between general purpose registers and XMM registers. The prototypes for these intrinsics are in the smmintrin.h file. Intrinsics marked with * are implemented only on Intel® 64 architectures. The rest of the intrinsics are implemented on both IA-32 and Intel® 64 architectures.

Intrinsic Syntax / Operation / Corresponding Intel® SSE4 Instruction

__m128 _mm_insert_ps(__m128 dst, __m128 src, const int ndx)
    Insert single precision float into packed single precision array element selected by index. (INSERTPS)

int _mm_extract_ps(__m128 src, const int ndx)
    Extract single precision float from packed single precision array element selected by index. (EXTRACTPS)

__m128i _mm_insert_epi8(__m128i s1, int s2, const int ndx)
    Insert integer byte into packed integer array element selected by index. (PINSRB)

int _mm_extract_epi8(__m128i src, const int ndx)
    Extract integer byte from packed integer array element selected by index. (PEXTRB)

int _mm_extract_epi16(__m128i src, int ndx)
    Extract integer word from packed integer array element selected by index. (PEXTRW)

__m128i _mm_insert_epi32(__m128i s1, int s2, const int ndx)
    Insert integer double word into packed integer array element selected by index. (PINSRD)

int _mm_extract_epi32(__m128i src, const int ndx)
    Extract integer double word from packed integer array element selected by index. (PEXTRD)

__m128i _mm_insert_epi64(__m128i s1, __int64 s2, const int ndx)*
    Insert integer quad word into packed integer array element selected by index. Use only on Intel® 64 architectures. (PINSRQ)

__int64 _mm_extract_epi64(__m128i src, const int ndx)*
    Extract integer quad word from packed integer array element selected by index. Use only on Intel® 64 architectures. (PEXTRQ)


Test Intrinsics

These Intel® Streaming SIMD Extensions 4 (Intel® SSE4) intrinsics perform packed integer 128-bit comparisons. The prototypes for these intrinsics are in the smmintrin.h file.

Intrinsic Name / Operation / Corresponding Intel® SSE4 Instruction

_mm_testz_si128
    Check for all zeros in specified bits of a 128-bit value. (PTEST)

_mm_testc_si128
    Check for all ones in specified bits of a 128-bit value. (PTEST)

_mm_testnzc_si128
    Check for at least one zero and at least one one in the specified bits of a 128-bit value. (PTEST)

int _mm_testz_si128 (__m128i s1, __m128i s2)

Returns 1 if the bitwise AND operation on s1 and s2 results in all zeros, else returns 0. That is, _mm_testz_si128 := ( (s1 & s2) == 0 ? 1 : 0 )

This intrinsic checks if the ZF flag equals 1 as a result of the instruction PTEST s1, s2. For example, it allows you to check if all set bits in s2 (mask) are zeros in s1. Corresponding instruction: PTEST

int _mm_testc_si128 (__m128i s1, __m128i s2)

Returns 1 if the bitwise AND operation on s2 and logical NOT s1 results in all zeros, else returns 0. That is, _mm_testc_si128 := ( (~s1 & s2) == 0 ? 1 : 0 )

This intrinsic checks if the CF flag equals 1 as a result of the instruction PTEST s1, s2. For example, it allows you to check if all set bits in s2 (mask) are also set in s1. Corresponding instruction: PTEST

int _mm_testnzc_si128 (__m128i s1, __m128i s2)

Returns 1 if the following conditions are true: bitwise operation of s1 AND s2 does not equal all zeros and bitwise operation of NOT s1 AND s2 does not equal all zeros, otherwise returns 0. That is, _mm_testnzc_si128 := ( ( (s1 & s2) != 0 && (~s1 & s2) != 0 ) ? 1 : 0 )


This intrinsic checks if both the CF and ZF flags are not 1 as a result of the instruction PTEST s1, s2. For example, it allows you to check that the result has both zeros and ones in s1 on positions specified as set bits in s2 (mask). Corresponding instruction: PTEST

Packed DWORD to Unsigned WORD Intrinsic

The prototype for this Intel® Streaming SIMD Extensions 4 (Intel® SSE4) intrinsic is in the smmintrin.h file.

__m128i _mm_packus_epi32(__m128i m1, __m128i m2);

Converts 8 packed signed DWORDs into 8 packed unsigned WORDs, using unsigned saturation to handle overflow condition. Corresponding instruction: PACKUSDW

Packed Compare for Equal Intrinsic

The prototype for this Intel® Streaming SIMD Extensions 4 (Intel® SSE4) intrinsic is in the smmintrin.h file.

__m128i _mm_cmpeq_epi64(__m128i a, __m128i b)

Performs a packed integer 64-bit comparison for equality. The intrinsic zeroes or fills with ones the corresponding parts of the result. Corresponding instruction: PCMPEQQ

Cacheability Support Intrinsic

The prototype for this Intel® Streaming SIMD Extensions 4 (Intel® SSE4) intrinsic is in the smmintrin.h file.

extern __m128i _mm_stream_load_si128(__m128i* v1);

Loads __m128i data from a 16-byte aligned address (v1) to the destination operand without polluting the caches. Corresponding instruction: MOVNTDQA


Efficient Accelerated String and Text Processing

Overview: Efficient Accelerated String and Text Processing

The intrinsics in this section correspond to Intel® Streaming SIMD Extensions 4 (Intel® SSE4) Efficient Accelerated String and Text Processing instructions. These intrinsics include:

• Packed Comparison Intrinsics for Streaming SIMD Extensions 4
• Application Targeted Accelerators Intrinsics

The prototypes for these intrinsics are in the nmmintrin.h file.

Packed Compare Intrinsics

These Intel® Streaming SIMD Extensions (Intel® SSE4) intrinsics perform packed comparisons. Some of these intrinsics could map to more than one instruction; the Intel® C++ Compiler selects the instruction to generate. The prototypes for these intrinsics are in the nmmintrin.h file.

Intrinsic Name / Operation / Corresponding Intel® SSE4 Instruction

_mm_cmpestri    Packed comparison, generates index. (PCMPESTRI)
_mm_cmpestrm    Packed comparison, generates mask. (PCMPESTRM)
_mm_cmpistri    Packed comparison, generates index. (PCMPISTRI)
_mm_cmpistrm    Packed comparison, generates mask. (PCMPISTRM)
_mm_cmpestrz    Packed comparison. (PCMPESTRM or PCMPESTRI)
_mm_cmpestrc    Packed comparison. (PCMPESTRM or PCMPESTRI)
_mm_cmpestrs    Packed comparison. (PCMPESTRM or PCMPESTRI)
_mm_cmpestro    Packed comparison. (PCMPESTRM or PCMPESTRI)
_mm_cmpestra    Packed comparison. (PCMPESTRM or PCMPESTRI)
_mm_cmpistrz    Packed comparison. (PCMPISTRM or PCMPISTRI)
_mm_cmpistrc    Packed comparison. (PCMPISTRM or PCMPISTRI)
_mm_cmpistrs    Packed comparison. (PCMPISTRM or PCMPISTRI)
_mm_cmpistro    Packed comparison. (PCMPISTRM or PCMPISTRI)
_mm_cmpistra    Packed comparison. (PCMPISTRM or PCMPISTRI)

int _mm_cmpestri(__m128i src1, int len1, __m128i src2, int len2, const int mode)

This intrinsic performs a packed comparison of string data with explicit lengths, generating an index and storing the result in ECX.

__m128i _mm_cmpestrm(__m128i src1, int len1, __m128i src2, int len2, const int mode)

This intrinsic performs a packed comparison of string data with explicit lengths, generating a mask and storing the result in XMM0.

int _mm_cmpistri(__m128i src1, __m128i src2, const int mode)

This intrinsic performs a packed comparison of string data with implicit lengths, generating an index and storing the result in ECX.

__m128i _mm_cmpistrm(__m128i src1, __m128i src2, const int mode)

This intrinsic performs a packed comparison of string data with implicit lengths, generating a mask and storing the result in XMM0.

int _mm_cmpestrz(__m128i src1, int len1, __m128i src2, int len2, const int mode);

This intrinsic performs a packed comparison of string data with explicit lengths. Returns 1 if ZFlag == 1, otherwise 0.

int _mm_cmpestrc(__m128i src1, int len1, __m128i src2, int len2, const int mode);

This intrinsic performs a packed comparison of string data with explicit lengths. Returns 1 if CFlag == 1, otherwise 0.

int _mm_cmpestrs(__m128i src1, int len1, __m128i src2, int len2, const int mode);

This intrinsic performs a packed comparison of string data with explicit lengths. Returns 1 if SFlag == 1, otherwise 0.

int _mm_cmpestro(__m128i src1, int len1, __m128i src2, int len2, const int mode);

This intrinsic performs a packed comparison of string data with explicit lengths. Returns 1 if OFlag == 1, otherwise 0.

int _mm_cmpestra(__m128i src1, int len1, __m128i src2, int len2, const int mode);

This intrinsic performs a packed comparison of string data with explicit lengths. Returns 1 if CFlag == 0 and ZFlag == 0, otherwise 0.

int _mm_cmpistrz(__m128i src1, __m128i src2, const int mode);

This intrinsic performs a packed comparison of string data with implicit lengths. Returns 1 if ZFlag == 1, otherwise 0.

int _mm_cmpistrc(__m128i src1, __m128i src2, const int mode);

This intrinsic performs a packed comparison of string data with implicit lengths. Returns 1 if CFlag == 1, otherwise 0.

int _mm_cmpistrs(__m128i src1, __m128i src2, const int mode);

This intrinsic performs a packed comparison of string data with implicit lengths. Returns 1 if SFlag == 1, otherwise 0.

int _mm_cmpistro(__m128i src1, __m128i src2, const int mode);

This intrinsic performs a packed comparison of string data with implicit lengths. Returns 1 if OFlag == 1, otherwise 0.

int _mm_cmpistra(__m128i src1, __m128i src2, const int mode);

This intrinsic performs a packed comparison of string data with implicit lengths. Returns 1 if ZFlag == 0 and CFlag == 0, otherwise 0.


Application Targeted Accelerators Intrinsics

These Intel® Streaming SIMD Extensions (Intel® SSE4) intrinsics extend the capabilities of Intel architecture by adding performance-optimized, low-latency, lower power fixed-function accelerators on the processor die to benefit specific applications. The prototypes for application targeted accelerator intrinsics are in the file nmmintrin.h. Intrinsics marked with * are implemented only on Intel® 64 architectures. The rest of the intrinsics are implemented on both IA-32 and Intel® 64 architectures.

Intrinsic Name / Operation / Corresponding Intel® SSE4 Instruction

_mm_popcnt_u32     Counts number of set bits in a data operation. (POPCNT)
_mm_popcnt_u64*    Counts number of set bits in a data operation. (POPCNT)
_mm_crc32_u8       Accumulates cyclic redundancy check. (CRC32)
_mm_crc32_u16      Accumulates cyclic redundancy check. (CRC32)
_mm_crc32_u32      Accumulates cyclic redundancy check. (CRC32)
_mm_crc32_u64*     Accumulates cyclic redundancy check. (CRC32)

unsigned int _mm_popcnt_u32(unsigned int v);

Counts the number of set bits in a data operation.

__int64 _mm_popcnt_u64(unsigned __int64 v)

Counts the number of set bits in a data operation. Use only on Intel® 64 architectures.

unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v);


Starting with an initial value in the first operand, accumulates a CRC32 value for the second operand and stores the result in the destination operand. Accumulates CRC32 on r/m8.

unsigned int _mm_crc32_u16(unsigned int crc, unsigned short v);

Starting with an initial value in the first operand, accumulates a CRC32 value for the second operand and stores the result in the destination operand. Accumulates CRC32 on r/m16.

unsigned int _mm_crc32_u32(unsigned int crc, unsigned int v);

Starting with an initial value in the first operand, accumulates a CRC32 value for the second operand and stores the result in the destination operand. Accumulates CRC32 on r/m32.

unsigned __int64 _mm_crc32_u64(unsigned __int64 crc, unsigned __int64 v);

Starting with an initial value in the first operand, accumulates a CRC32 value for the second operand and stores the result in the destination operand. Accumulates CRC32 on r/m64. Use only on Intel® 64 architectures.


Intrinsics for Advanced Encryption Standard Implementation

Overview: Intrinsics for Carry-less Multiplication Instruction and Advanced Encryption Standard Instructions

The Intel® C++ Compiler provides intrinsics to enable carry-less multiplication and encryption based on Advanced Encryption Standard (AES) specifications. The carry-less multiplication intrinsic corresponds to a single new instruction, PCLMULQDQ. The AES extension intrinsics correspond to AES extension instructions. The AES extension instructions and the PCLMULQDQ instruction follow the same system software requirements for XMM state support and single-instruction multiple data (SIMD) floating-point exception support as the Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® SSE3, Intel® Supplemental SSE3, and Intel® SSE4 extensions. Intel® 64 processors using 32nm processing technology support the AES extension instructions as well as the PCLMULQDQ instruction.

AES Encryption and Cryptographic Processing

AES encryption involves processing 128-bit input data (plaintext) through a finite number of iterative operations, referred to as "AES rounds", into a 128-bit encrypted block (ciphertext). Decryption follows the reverse direction of iterative operation using the "equivalent inverse cipher" instead of the "inverse cipher". The cryptographic processing at each round involves two inputs: one is the "state", the other is the "round key". Each round uses a different "round key". The round keys are derived from the cipher key using a "key schedule" algorithm. The "key schedule" algorithm is independent of the data processing of encryption/decryption, and can be carried out independently from the encryption/decryption phase. The AES standard supports cipher keys of sizes 128, 192, and 256 bits. The respective cipher key sizes correspond to 10, 12, and 14 rounds of iteration.


Carry-less Multiplication Instruction and AES Extension Instructions

A single instruction, PCLMULQDQ, performs carry-less multiplication for two binary numbers that are up to 64 bits wide. The AES extensions provide:

• two instructions to accelerate AES rounds on encryption (AESENC and AESENCLAST)
• two instructions for AES rounds on decryption using the equivalent inverse cipher (AESDEC and AESDECLAST)
• instructions for the generation of key schedules (AESIMC and AESKEYGENASSIST)

Detecting Support for Using Instructions

Before any application attempts to use the PCLMULQDQ or the AES extension instructions, it must first detect if the instructions are supported by the processor. To detect support for the PCLMULQDQ instruction, your application must check the following: CPUID.01H:ECX.PCLMULQDQ[bit 1] = 1.

To detect support for the AES extension instructions, your application must check the following: CPUID.01H:ECX.AES[bit 25] = 1.

Operating systems that support handling of the SSE state also support applications that use the AES extension instructions and the PCLMULQDQ instruction.

Intrinsics for Carry-less Multiplication Instruction and Advanced Encryption Standard Instructions

The prototypes for the carry-less multiplication intrinsic and the AES intrinsics are defined in the wmmintrin.h file.

Carry-less Multiplication Intrinsic

The single carry-less multiplication intrinsic is described below.

__m128i _mm_clmulepi64_si128(__m128i v1, __m128i v2, const int imm8);

Performs a carry-less multiplication of one quadword of v1 by one quadword of v2, and returns the result. The imm8 value is used to determine which quadwords of v1 and v2 should be used. Corresponding Instruction: PCLMULQDQ


Advanced Encryption Standard Intrinsics

The AES intrinsics are described below.

__m128i _mm_aesdec_si128(__m128i v, __m128i rkey);

Performs one round of an AES decryption flow using the Equivalent Inverse Cipher operating on a 128-bit data (state) from v with a 128-bit round key from rkey. Corresponding instruction: AESDEC

__m128i _mm_aesdeclast_si128(__m128i v, __m128i rkey);

Performs the last round of an AES decryption flow using the Equivalent Inverse Cipher operating on a 128-bit data (state) from v with a 128-bit round key from rkey. Corresponding instruction: AESDECLAST

__m128i _mm_aesenc_si128(__m128i v, __m128i rkey);

Performs one round of an AES encryption flow operating on a 128-bit data (state) from v with a 128-bit round key from rkey. Corresponding instruction: AESENC

__m128i _mm_aesenclast_si128(__m128i v, __m128i rkey);

Performs the last round of an AES encryption flow operating on a 128-bit data (state) from v with a 128-bit round key from rkey. Corresponding instruction: AESENCLAST

__m128i _mm_aesimc_si128(__m128i v);

Performs the InvMixColumn transformation on a 128-bit round key from v and returns the result. Corresponding instruction: AESIMC

__m128i _mm_aeskeygenassist_si128(__m128i ckey, const int rcon);

Assists in AES round key generation using an 8-bit Round Constant (RCON) specified in rcon operating on 128 bits of data specified in ckey and returns the result. Corresponding Instruction: AESKEYGENASSIST

See Also

• Overview: Intrinsics for Carry-less Multiplication Instruction and Advanced Encryption Standard Instructions


Intrinsics for Intel(R) Advanced Vector Extensions

Overview: Intrinsics for Intel® Advanced Vector Extensions Instructions

The Intel® Advanced Vector Extensions (Intel® AVX) intrinsics are assembly-coded functions that call on Intel® AVX instructions, which are new vector SIMD instruction extensions for the IA-32 and Intel® 64 architectures. The Intel® Advanced Vector Extensions are architecturally similar to the Intel® Streaming SIMD Extensions (Intel® SSE) and double-precision floating-point portions of Intel® SSE2.

The prototypes for the Intel® AVX intrinsics are available in the immintrin.h file. The Intel® AVX intrinsics are supported on the IA-32 and Intel® 64 architectures built from 32nm process technology. They map directly to the Intel® AVX new instructions and other enhanced 128-bit SIMD instructions.

Intel® AVX introduces 256-bit vector processing capability and includes two components on the Intel processor generations built from 32nm process and beyond:

• The first generation Intel® AVX provides 256-bit SIMD register support, 256-bit vector floating-point instructions, enhancements to 128-bit SIMD instructions, and support for three and four operand syntax.
• The Fused Multiply-Add (FMA) is another extension of Intel® AVX that provides floating-point, fused multiply-add instructions supporting 256-bit and 128-bit SIMD vectors.

Functional Overview

Intel® AVX and FMA provide comprehensive functional improvements over previous generations of SIMD instruction extensions. The functional improvements include:

• 256-bit floating-point arithmetic primitives: Intel® AVX enhances existing 128-bit floating-point arithmetic instructions with 256-bit capabilities for floating-point processing. FMA provides an additional set of 256-bit floating-point processing capabilities with a rich set of fused multiply-add and fused multiply-subtract primitives.
• Enhancements for flexible SIMD data movements: Intel® AVX provides a number of new data movement primitives to enable efficient SIMD programming in relation to loading non-unit-strided data into SIMD registers, intra-register SIMD data manipulation, conditional expression and branch handling, etc. Enhancements for SIMD data movement primitives cover 256-bit and 128-bit vector floating-point data, and across 128-bit integer SIMD data processing using VEX-encoded instructions.


See Also

• Intel® AVX site at http://software.intel.com/en-us/avx/
• Details about the Intel® AVX Intrinsics

Details of Intel® Advanced Vector Extensions Intrinsics and Fused Multiply Add Intrinsics

The Intel® Advanced Vector Extensions (Intel® AVX) intrinsics map directly to the Intel® AVX instructions and other enhanced 128-bit single-instruction multiple data processing (SIMD) instructions. The Intel® AVX instructions are architecturally similar to extensions of the existing Intel® 64 architecture-based vector streaming SIMD portions of Intel® Streaming SIMD Extensions (Intel® SSE) instructions, and double-precision floating-point portions of Intel® SSE2 instructions. However, Intel® AVX introduces the following architectural enhancements:

• Support for 256-bit wide vectors and SIMD register set.
• Instruction syntax support for three and four operand syntax, to improve instruction programming flexibility and efficiency for new instruction extensions.
• Enhancement of legacy 128-bit SIMD instruction extensions to support three-operand syntax and to simplify compiler vectorization of high-level language expressions.
• Instruction encoding format using a new prefix (referred to as VEX) to provide compact, efficient encoding for three-operand syntax, vector lengths, compaction of legacy SIMD prefixes and REX functionality.
• Intel® AVX data types allow packing of up to 32 elements in a register if bytes are used. The number of elements depends upon the element type: 8 single-precision or 4 double-precision floating point types.
• Fused Multiply Add (FMA) extensions and enhanced floating-point compare instructions add support for the IEEE-754-2008 standard.

Intel® Advanced Vector Extensions and Fused Multiply Add Registers

Intel® AVX adds 16 registers (YMM0-YMM15), each 256 bits wide, aliased onto the 16 SIMD (XMM0-XMM15) registers. The Intel® AVX new instructions operate on the YMM registers. Intel® AVX extends certain existing instructions to operate on the YMM registers, defining a new way of encoding up to three sources and one destination in a single instruction.


Because each of these registers can hold more than one data element, the processor can process more than one data element simultaneously. This processing capability is also known as single-instruction multiple data processing (SIMD). For each computational and data manipulation instruction in the new extension sets, there is a corresponding C intrinsic that implements that instruction directly. This frees you from managing registers and assembly programming. Further, the compiler optimizes the instruction scheduling so that your executable runs faster. The Fused Multiply Add (FMA) new instructions comprise 256-bit and 128-bit SIMD instructions operating on YMM states.

Intel® Advanced Vector Extensions and FMA Data Types

The Intel® AVX intrinsic functions use three new C data types as operands, representing the new registers used as operands to the intrinsic functions. These are the __m256, __m256d, and __m256i data types.


The __m256 data type represents the contents of the extended SSE register (the YMM register) used by the Intel® AVX intrinsics. The __m256 data type can hold eight 32-bit floating-point values. The __m256d data type can hold four 64-bit double-precision floating-point values. The __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values. The compiler aligns __m256, __m256d, and __m256i local and global data to 32-byte boundaries on the stack. To align integer, float, or double arrays, use the __declspec(align(32)) declaration as follows:

typedef struct __declspec(align(32)) { float f[8]; } __m256;
typedef struct __declspec(align(32)) { double d[4]; } __m256d;
typedef struct __declspec(align(32)) { int i[8]; } __m256i;
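The __declspec(align(32)) syntax above is compiler-specific. As a portable sketch of the same layout requirement — using a hypothetical avx_like_vec struct with C11 _Alignas rather than the compiler-provided __m256 type — 32-byte alignment can be expressed and checked like this:

```c
#include <stdint.h>

/* Hypothetical container mirroring the __m256 layout for illustration:
   eight 32-bit floats placed on a 32-byte boundary via C11 _Alignas.
   This is NOT the compiler-provided __m256 type. */
typedef struct {
    _Alignas(32) float f[8];
} avx_like_vec;

/* Returns 1 if p sits on a 32-byte boundary, 0 otherwise. */
int is_32byte_aligned(const void *p) {
    return ((uintptr_t)p % 32) == 0;
}
```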

The Intel® AVX intrinsics also use Intel® SSE2 data types like __m128, __m128d, and __m128i for some operations. See the Details of Intrinsics topic for more information.

VEX Prefix Instruction Encoding Support for Intel® AVX and FMA Intel® AVX introduces a new prefix, referred to as VEX, in the Intel® 64 and IA-32 instruction encoding format. Instruction encoding using the VEX prefix provides several capabilities:

• direct encoding of a register operand within the VEX prefix
• efficient encoding of instruction syntax operating on 128-bit and 256-bit register sets
• compaction of REX prefix functionality
• compaction of SIMD prefix functionality and escape byte encoding
• relaxed memory alignment requirements for most VEX-encoded SIMD numeric and data processing instructions with a memory operand, as compared to instructions encoded using SIMD prefixes

The VEX prefix encoding applies to SIMD instructions operating on YMM registers, XMM registers, and in some cases with a general-purpose register as one of the operands. The VEX prefix is not supported for instructions operating on MMX™ or x87 registers. It is recommended to use Intel® AVX and FMA intrinsics with the /QxAVX [on Windows* operating systems] or -xAVX [on Linux* operating systems] option because their corresponding instructions are encoded with the VEX prefix. The /QxAVX or -xAVX option forces other packed instructions to be encoded with VEX too. As a result, there are fewer performance stalls due to AVX/FMA to legacy SSE code transitions.


Naming and Usage Syntax Most Intel® AVX and FMA intrinsic names use the following notational convention:

_mm256_<intrinsic_op>_<suffix>(<data_type> <operand1>, <data_type> <operand2>, <data_type> <operand3>)

The following table explains each item in the syntax.

_mm256/_mm128     Prefix representing the Intel® AVX vector register size of 256 bits; this could be _mm128 for FMA intrinsics representing a vector size of 128 bits
<intrinsic_op>    Indicates the basic operation of the intrinsic; for example, add for addition and sub for subtraction
<suffix>          Denotes the type of data the instruction operates on. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters and numbers denote the type, with notation as follows:

• s single-precision floating point
• d double-precision floating point
• i128 signed 128-bit integer
• i64 signed 64-bit integer
• u64 unsigned 64-bit integer
• i32 signed 32-bit integer
• u32 unsigned 32-bit integer
• i16 signed 16-bit integer
• u16 unsigned 16-bit integer
• i8 signed 8-bit integer
• u8 unsigned 8-bit integer
• ps packed single-precision floating point
• pd packed double-precision floating point
• sd scalar double-precision floating point
• epi32 extended packed 32-bit integer
• si256 scalar 256-bit integer

<data_type>           Parameter data type, such as __m256, __m256d, __m256i, const int, etc.
m1/s1/a               Represents a source vector register
m2/s2/b               Represents another source vector register
mask/select/offset    Represents an integer value


The third parameter is an integer value whose bits represent a conditionality based on which the intrinsic performs an operation.

Example Usage

extern __m256d _mm256_add_pd(__m256d m1, __m256d m2);

where:
add indicates that an addition operation is performed
pd indicates packed double-precision floating-point values

The packed values are represented in right-to-left order, with the lowest value used for scalar operations. Consider the following example operation:

double a[4] = {1.0, 2.0, 3.0, 4.0};
__m256d t = _mm256_load_pd(a);

The result is the following:

__m256d t = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);

In other words, the YMM register that holds the value t contains, from the most significant element to the least significant, the values 4.0, 3.0, 2.0, and 1.0. The “scalar” element is 1.0. Due to the nature of the instruction, some intrinsics require their arguments to be immediates (constant integer literals).
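The ordering can be sketched in plain scalar C — a hypothetical emu_load_pd helper that emulates the element placement only, not the intrinsic itself:

```c
/* Scalar sketch of _mm256_load_pd element ordering (emulation only).
   e[0] is the least significant ("scalar") element of the vector. */
typedef struct { double e[4]; } vec4d;

vec4d emu_load_pd(const double *a) {
    vec4d t;
    for (int i = 0; i < 4; i++)
        t.e[i] = a[i];   /* a[0] lands in the lowest position */
    return t;
}
```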

See Also • Details of Intrinsics (general)


Intrinsics for Arithmetic Operations

_mm256_add_pd Adds float64 vectors. The corresponding Intel® AVX instruction is VADDPD.

Syntax extern __m256d _mm256_add_pd(__m256d m1, __m256d m2);

Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation

Description Performs a SIMD addition of four packed double-precision floating-point elements (float64 elements) in the first source vector, m1, with four float64 elements in the second source vector, m2.

Returns Result of the addition operation.
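The element-wise semantics can be sketched in plain scalar C — a hypothetical emu_add_pd helper for illustration, not the intrinsic itself:

```c
/* Scalar sketch of _mm256_add_pd semantics (emulation only). */
typedef struct { double e[4]; } vec4d;

vec4d emu_add_pd(vec4d m1, vec4d m2) {
    vec4d r;
    for (int i = 0; i < 4; i++)
        r.e[i] = m1.e[i] + m2.e[i];   /* element-wise addition */
    return r;
}
```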

_mm256_add_ps Adds float32 vectors. The corresponding Intel® AVX instruction is VADDPS.

Syntax extern __m256 _mm256_add_ps(__m256 m1, __m256 m2);

Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation


Description Performs a SIMD addition of eight packed single-precision floating-point elements (float32 elements) in the first source vector, m1, with eight float32 elements in the second source vector, m2.

Returns Result of the addition operation.

_mm256_addsub_pd Adds odd float64 elements and subtracts even float64 elements of vectors. The corresponding Intel® AVX instruction is VADDSUBPD.

Syntax extern __m256d _mm256_addsub_pd(__m256d m1, __m256d m2);

Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation

Description Performs a SIMD addition of the odd packed double-precision floating-point elements (float64 elements) of the first source vector, m1, to the odd float64 elements of the second source vector, m2. Simultaneously, the intrinsic performs subtraction of the even double-precision floating-point elements of the second source vector, m2, from the even float64 elements of the first source vector, m1.

Returns Result of the operation is stored in the result vector, which is returned by the intrinsic.
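The alternating add/subtract pattern can be sketched in plain scalar C — a hypothetical emu_addsub_pd helper for illustration, not the intrinsic itself:

```c
/* Scalar sketch of _mm256_addsub_pd semantics (emulation only):
   even-indexed elements are subtracted, odd-indexed elements are added. */
typedef struct { double e[4]; } vec4d;

vec4d emu_addsub_pd(vec4d m1, vec4d m2) {
    vec4d r;
    for (int i = 0; i < 4; i++)
        r.e[i] = (i % 2) ? m1.e[i] + m2.e[i]   /* odd elements: add */
                         : m1.e[i] - m2.e[i];  /* even elements: subtract */
    return r;
}
```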


_mm256_addsub_ps Adds odd float32 elements and subtracts even float32 elements of vectors. The corresponding Intel® AVX instruction is VADDSUBPS.

Syntax extern __m256 _mm256_addsub_ps(__m256 m1, __m256 m2);

Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation

Description Performs a SIMD addition of the odd single-precision floating-point elements (float32 elements) in the first source vector, m1 , with the odd float32 elements in the second source vector, m2. Simultaneously, the intrinsic performs subtraction of the even single-precision floating-point elements (float32 elements) in the second source vector, m2, from the even float32 elements in the first source vector, m1.

Returns Result of the operation stored in the result vector.

_mm256_hadd_pd Adds horizontal pairs of float64 elements of two vectors. The corresponding Intel® AVX instruction is VHADDPD.

Syntax extern __m256d _mm256_hadd_pd(__m256d m1, __m256d m2);

Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation


Description Performs a SIMD addition of adjacent (horizontal) pairs of double-precision floating-point elements (float64 elements) in the first source vector, m1, with adjacent pairs of float64 elements in the second source vector, m2.

Returns Result of the addition operation.
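The pairwise placement can be sketched in plain scalar C — a hypothetical emu_hadd_pd helper for illustration, not the intrinsic itself. Note that the 256-bit operation works within each 128-bit lane: the low lane holds the horizontal sums of the low pairs, the high lane those of the high pairs.

```c
/* Scalar sketch of _mm256_hadd_pd semantics (emulation only),
   showing the per-128-bit-lane result layout. */
typedef struct { double e[4]; } vec4d;

vec4d emu_hadd_pd(vec4d m1, vec4d m2) {
    vec4d r;
    r.e[0] = m1.e[0] + m1.e[1];  /* low 128-bit lane */
    r.e[1] = m2.e[0] + m2.e[1];
    r.e[2] = m1.e[2] + m1.e[3];  /* high 128-bit lane */
    r.e[3] = m2.e[2] + m2.e[3];
    return r;
}
```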

_mm256_hadd_ps Adds horizontal pairs of float32 elements of two vectors. The corresponding Intel® AVX instruction is VHADDPS.

Syntax extern __m256 _mm256_hadd_ps(__m256 m1, __m256 m2);

Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation

Description Performs a SIMD addition of adjacent (horizontal) pairs of single-precision floating-point elements (float32 elements) in the first source vector, m1, with adjacent pairs of float32 elements in the second source vector, m2.

Returns Returns the result of the addition operation.

_mm256_sub_pd Subtracts float64 vectors. The corresponding Intel® AVX instruction is VSUBPD.

Syntax extern __m256d _mm256_sub_pd(__m256d m1, __m256d m2);


Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation

Description Performs a SIMD subtraction of four packed double-precision floating-point elements (float64 elements) of the second source vector, m2, from the first source vector, m1.

Returns Returns the result of the subtraction operation.

_mm256_sub_ps Subtracts float32 vectors. The corresponding Intel® AVX instruction is VSUBPS.

Syntax extern __m256 _mm256_sub_ps(__m256 m1, __m256 m2);

Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation

Description Performs a SIMD subtraction of eight packed single-precision floating-point elements (float32 elements) of the second source vector, m2, from the first source vector, m1.

Returns Returns the result of the subtraction operation.


_mm256_hsub_pd Subtracts horizontal pairs of float64 elements of two vectors. The corresponding Intel® AVX instruction is VHSUBPD.

Syntax extern __m256d _mm256_hsub_pd(__m256d m1, __m256d m2);

Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation

Description Performs a SIMD subtraction of adjacent (horizontal) pairs of double-precision floating-point elements (float64 elements) in the second source vector, m2, from adjacent pairs of float64 elements in the first source vector, m1.

Returns Result of the subtraction operation.

_mm256_hsub_ps Subtracts horizontal pairs of float32 elements of two vectors. The corresponding Intel® AVX instruction is VHSUBPS.

Syntax extern __m256 _mm256_hsub_ps(__m256 m1, __m256 m2);

Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation


Description Performs a SIMD subtraction of adjacent (horizontal) pairs of single-precision floating-point elements (float32 elements) in the second source vector, m2, from adjacent pairs of float32 elements in the first source vector, m1.

Returns Returns the result of the subtraction operation.

_mm256_mul_pd Multiplies float64 vectors. The corresponding Intel® AVX instruction is VMULPD.

Syntax extern __m256d _mm256_mul_pd(__m256d m1, __m256d m2);

Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation

Description Performs a SIMD multiplication of four packed double-precision floating-point elements (float64 elements) in the first source vector, m1, with four float64 elements in the second source vector, m2.

Returns Result of the multiplication operation.

_mm256_mul_ps Multiplies float32 vectors. The corresponding Intel® AVX instruction is VMULPS.

Syntax extern __m256 _mm256_mul_ps(__m256 m1, __m256 m2);


Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation

Description Performs a SIMD multiplication of eight packed single-precision floating-point elements (float32 elements) in the first source vector, m1, with eight float32 elements in the second source vector, m2.

Returns Result of the multiplication operation.

_mm256_div_pd Divides float64 vectors. The corresponding Intel® AVX instruction is VDIVPD.

Syntax extern __m256d _mm256_div_pd(__m256d m1, __m256d m2);

Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation

Description Performs a SIMD division of four packed double-precision floating-point elements (float64 elements) in the first source vector, m1, by four float64 elements in the second source vector, m2.

Returns Result of the division operation.


_mm256_div_ps Divides float32 vectors. The corresponding Intel® AVX instruction is VDIVPS.

Syntax extern __m256 _mm256_div_ps(__m256 m1, __m256 m2);

Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation

Description Performs a SIMD division of eight packed single-precision floating-point elements (float32 elements) in the first source vector, m1, by eight float32 elements in the second source vector, m2.

Returns Result of the division operation.

_mm256_dp_ps Calculates the dot product of float32 vectors. The corresponding Intel® AVX instruction is VDPPS.

Syntax extern __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);

Arguments

m1      float32 vector used for the operation
m2      float32 vector also used for the operation
mask    a constant of integer type where the high four bits of the mask determine how the resultant elements are summed and the low four bits determine whether the summed resultant value is to be broadcast to the destination vector or not


Description First performs a SIMD multiplication of the lower four packed single-precision floating-point elements (float32 elements) from the first source vector, m1, with corresponding elements in the second source vector, m2. Each of the four resulting single-precision elements is conditionally summed depending on the high four bits in the mask parameter. The resulting summed value is broadcast to each of the lower 4 positions in the destination vector, if the corresponding lower bit of the mask is "1". If the corresponding lower bit of the mask is zero, the corresponding lower element in the destination vector is set to zero. The process is then replicated with the high elements of the source vectors.

Returns Result of the operation.
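The mask handling can be sketched in plain scalar C — a hypothetical emu_dp_ps helper for illustration, not the intrinsic itself. The operation runs independently in each 128-bit lane of the 256-bit vectors:

```c
/* Scalar sketch of _mm256_dp_ps mask semantics (emulation only).
   High four mask bits select which products enter the sum; low four
   mask bits select which destination elements receive the sum. */
typedef struct { float e[8]; } vec8f;

vec8f emu_dp_ps(vec8f m1, vec8f m2, int mask) {
    vec8f r;
    for (int lane = 0; lane < 2; lane++) {
        float sum = 0.0f;
        for (int i = 0; i < 4; i++)            /* high bits: summation */
            if (mask & (1 << (4 + i)))
                sum += m1.e[lane * 4 + i] * m2.e[lane * 4 + i];
        for (int i = 0; i < 4; i++)            /* low bits: broadcast */
            r.e[lane * 4 + i] = (mask & (1 << i)) ? sum : 0.0f;
    }
    return r;
}
```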

_mm256_sqrt_pd Computes the square root of double-precision floating point values. The corresponding Intel® AVX instruction is VSQRTPD.

Syntax extern __m256d _mm256_sqrt_pd(__m256d a);

Arguments

a float64 source vector

Description Performs a SIMD computation of the square roots of the four packed double-precision floating-point values (float64 values) in the source vector and returns the result of the square root operation.

Returns Result of the square root operation.


_mm256_sqrt_ps Computes the square root of single-precision floating point values. The corresponding Intel® AVX instruction is VSQRTPS.

Syntax extern __m256 _mm256_sqrt_ps(__m256 a);

Arguments

a    float32 source vector

Description Performs a SIMD computation of the square roots of the eight packed single-precision floating-point values (float32 values) in the source vector and returns the result of the square root operation.

Returns Result of the square root operation.

_mm256_rsqrt_ps Computes approximate reciprocals of square roots of float32 values. The corresponding Intel® AVX instruction is VRSQRTPS.

Syntax extern __m256 _mm256_rsqrt_ps(__m256 a);

Arguments

a    float32 source vector

Description Performs a computation of the reciprocal of the square roots of eight single-precision floating point elements of the source vector, a, and returns the result as a vector.


Returns Result of the reciprocal square root operation.

_mm256_rcp_ps Computes approximate reciprocals of float32 values. The corresponding Intel® AVX instruction is VRCPPS.

Syntax extern __m256 _mm256_rcp_ps(__m256 a);

Arguments

a float32 source vector

Description Performs a computation of the reciprocal of eight single-precision floating point elements of the source vector, a, and returns the result as a vector.

Returns Result of the reciprocal operation.

Intrinsics for Bitwise Operations

_mm256_and_pd Performs bitwise logical AND operation on float64 vectors. The corresponding Intel® AVX instruction is VANDPD.

Syntax extern __m256d _mm256_and_pd(__m256d m1, __m256d m2);

Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation


Description Performs a bitwise logical AND of the four packed double-precision floating-point elements (float64 elements) of the first source vector, m1, and corresponding elements in the second source vector, m2.

Returns Result of the bitwise operation.

_mm256_and_ps Performs bitwise logical AND operation on float32 vectors. The corresponding Intel® AVX instruction is VANDPS.

Syntax extern __m256 _mm256_and_ps(__m256 m1, __m256 m2);

Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation

Description Performs a bitwise logical AND of the eight packed single-precision floating-point elements (float32 elements) of the first source vector, m1, and corresponding elements in the second source vector, m2.

Returns Result of the bitwise operation.

_mm256_andnot_pd Performs bitwise logical AND NOT operation on float64 vectors. The corresponding Intel® AVX instruction is VANDNPD.

Syntax extern __m256d _mm256_andnot_pd(__m256d m1, __m256d m2);


Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation

Description Performs a bitwise logical AND NOT of the four packed double-precision floating-point elements (float64 elements) of the first source vector, m1, and corresponding elements in the second source vector, m2.

Returns Result of the bitwise operation.
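A common use of the AND NOT idiom is clearing sign bits to compute absolute values. As a scalar sketch of the per-element bit operation — a hypothetical emu_andnot_abs helper operating on one float64 value, not the intrinsic itself:

```c
#include <string.h>
#include <stdint.h>

/* Scalar sketch of the per-element operation behind
   _mm256_andnot_pd(sign_mask, x): r = (~m1) & m2 on the 64-bit
   pattern. With m1 = -0.0 (only the sign bit set), this clears
   the sign bit of x, yielding its absolute value. */
double emu_andnot_abs(double x) {
    uint64_t sign_mask, bits;
    double minus_zero = -0.0, r;
    memcpy(&sign_mask, &minus_zero, 8);  /* 0x8000000000000000 */
    memcpy(&bits, &x, 8);
    bits = ~sign_mask & bits;            /* AND NOT: clear the sign bit */
    memcpy(&r, &bits, 8);
    return r;
}
```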

_mm256_andnot_ps Performs bitwise logical AND NOT operation on float32 vectors. The corresponding Intel® AVX instruction is VANDNPS.

Syntax extern __m256 _mm256_andnot_ps(__m256 m1, __m256 m2);

Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation

Description Performs a bitwise logical AND NOT of the eight packed single-precision floating-point elements (float32 elements) of the first source vector, m1, and corresponding elements in the second source vector, m2.

Returns Result of the bitwise operation.


_mm256_or_pd Performs bitwise logical OR operation on float64 vectors. The corresponding Intel® AVX instruction is VORPD.

Syntax extern __m256d _mm256_or_pd(__m256d m1, __m256d m2);

Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation

Description Performs a bitwise logical OR of the four packed double-precision floating-point elements (float64 elements) of the first source vector, m1, and corresponding elements in the second source vector, m2.

Returns Result of the bitwise operation.

_mm256_or_ps Performs bitwise logical OR operation on float32 vectors. The corresponding Intel® AVX instruction is VORPS.

Syntax extern __m256 _mm256_or_ps(__m256 m1, __m256 m2);

Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation


Description Performs a bitwise logical OR of the eight packed single-precision floating-point elements (float32 elements) of the first source vector, m1, and corresponding elements in the second source vector, m2.

Returns Result of the bitwise operation.

_mm256_xor_pd Performs bitwise logical XOR operation on float64 vectors. The corresponding Intel® AVX instruction is VXORPD.

Syntax extern __m256d _mm256_xor_pd(__m256d m1, __m256d m2);

Arguments

m1    float64 vector used for the operation
m2    float64 vector also used for the operation

Description Performs a bitwise logical XOR of the four packed double-precision floating-point elements (float64 elements) of the first source vector, m1, and corresponding elements in the second source vector, m2.

Returns Result of the bitwise operation.

_mm256_xor_ps Performs bitwise logical XOR operation on float32 vectors. The corresponding Intel® AVX instruction is VXORPS.

Syntax extern __m256 _mm256_xor_ps(__m256 m1, __m256 m2);


Arguments

m1    float32 vector used for the operation
m2    float32 vector also used for the operation

Description Performs a bitwise logical XOR of the eight packed single-precision floating-point elements (float32 elements) of the first source vector, m1, and corresponding elements in the second source vector, m2.

Returns Result of the bitwise operation.

Intrinsics for Blend and Conditional Merge Operations

_mm256_blend_pd Performs a conditional blend/merge of float64 vectors. The corresponding Intel® AVX instruction is VBLENDPD.

Syntax extern __m256d _mm256_blend_pd(__m256d m1, __m256d m2, const int mask);

Arguments

m1      float64 vector used for the operation
m2      float64 vector also used for the operation
mask    a constant of integer type that is the mask for the operation

Description Performs a conditional merge of four packed double-precision floating point elements (float64 elements) of two vectors according to the immediate bits of the mask parameter. The mask parameter defines a constant integer. The immediate bits [3:0] in the mask determine from which source vector elements are copied into the resulting vector.


If the bits in mask are “1” then the corresponding elements of the second source vector are copied into the resulting vector. If the bits are “0” then the corresponding elements of the first source vector are copied into the resulting vector. Thus a merging/blending of the elements of the two source vectors occurs when this intrinsic is used.

Returns Result of the merge/blend operation.
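The bit-per-element selection can be sketched in plain scalar C — a hypothetical emu_blend_pd helper for illustration, not the intrinsic itself:

```c
/* Scalar sketch of _mm256_blend_pd semantics (emulation only):
   mask bit i selects element i from m2 (bit = 1) or m1 (bit = 0). */
typedef struct { double e[4]; } vec4d;

vec4d emu_blend_pd(vec4d m1, vec4d m2, int mask) {
    vec4d r;
    for (int i = 0; i < 4; i++)
        r.e[i] = ((mask >> i) & 1) ? m2.e[i] : m1.e[i];
    return r;
}
```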

_mm256_blend_ps Performs a conditional blend/merge of float32 vectors. The corresponding Intel® AVX instruction is VBLENDPS.

Syntax extern __m256 _mm256_blend_ps(__m256 m1, __m256 m2, const int mask);

Arguments

m1      float32 vector used for the operation
m2      float32 vector also used for the operation
mask    a constant of integer type that is the mask for the operation

Description Performs a conditional merge of eight packed single-precision floating point elements (float32 elements) of two vectors according to the immediate bits of the mask parameter. The mask parameter defines a constant integer. The immediate bits [7:0] in the mask determine from which source vector elements are copied into the resulting vector. If the bits in mask are “1” then the corresponding elements of the second source vector are copied into the resulting vector. If the bits are “0” then the corresponding elements of the first source vector are copied into the resulting vector. Thus a merging/blending of the elements of the two source vectors occurs when this intrinsic is used.

Returns Result of the merge/blend operation.


_mm256_blendv_pd Performs conditional blend/merge of float64 vectors. The corresponding Intel® AVX instruction is VBLENDVPD.

Syntax extern __m256d _mm256_blendv_pd(__m256d m1, __m256d m2, __m256d mask);

Arguments

m1      float64 vector used for the operation
m2      float64 vector also used for the operation
mask    float64 vector with the mask for the operation

Description Performs a conditional merge of four packed double-precision floating point elements (float64 elements) of two vectors according to the most significant bits of the mask parameter elements. The mask parameter defines a mask for the operation. The most significant bit of the corresponding double-precision floating-point elements in the mask determines whether the corresponding double-precision floating-point element in the resulting vector is copied from the second source or first source. If the bit in the mask is “1” then the corresponding element of the second source vector is copied into the resulting vector. If the bit is “0” then the corresponding element of the first source vector is copied into the resulting vector. Thus a merging/blending of the elements of the two source vectors occurs when this intrinsic is used.

Returns Result of the blend operation.
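The sign-bit selection can be sketched in plain scalar C — a hypothetical emu_blendv_pd helper for illustration, not the intrinsic itself. Note that only the sign bit of each mask element matters, so -0.0 selects the second source even though it compares equal to zero:

```c
#include <string.h>
#include <stdint.h>

/* Scalar sketch of _mm256_blendv_pd semantics (emulation only):
   the most significant (sign) bit of each mask element selects
   m2 (bit = 1) or m1 (bit = 0). */
typedef struct { double e[4]; } vec4d;

vec4d emu_blendv_pd(vec4d m1, vec4d m2, vec4d mask) {
    vec4d r;
    for (int i = 0; i < 4; i++) {
        uint64_t bits;
        memcpy(&bits, &mask.e[i], 8);
        r.e[i] = (bits >> 63) ? m2.e[i] : m1.e[i];  /* sign bit selects */
    }
    return r;
}
```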

_mm256_blendv_ps Performs conditional blend/merge of float32 vectors. The corresponding Intel® AVX instruction is VBLENDVPS.

Syntax extern __m256 _mm256_blendv_ps(__m256 m1, __m256 m2, __m256 mask);


Arguments

m1      float32 vector used for the operation
m2      float32 vector also used for the operation
mask    float32 vector with the mask for the operation; defined such that a “1” in the most significant bit of an element indicates that the corresponding element of the second source vector is copied into the result, while a “0” indicates that the corresponding element of the first source vector is copied into the result

Description Performs a conditional merge of eight packed single-precision floating point elements (float32 elements) of two vectors according to the most significant bits of the mask parameter. The mask parameter defines a mask for the operation. The most significant bit of the corresponding single-precision floating-point elements in the mask determines whether the corresponding single-precision floating-point element in the resulting vector is copied from the second source or first source. If the bit in the mask is “1” then the corresponding element of the second source vector is copied into the resulting vector. If the bit is “0” then the corresponding element of the first source vector is copied into the resulting vector. Thus a merging/blending of the elements of the two source vectors occurs when this intrinsic is used.

Returns Result of the blend operation.

Intrinsics for Compare Operations

_mm_cmp_pd, _mm256_cmp_pd Compares packed 128-bit and 256-bit float64 vector elements. The corresponding Intel® AVX instruction is VCMPPD.

Syntax extern __m128d _mm_cmp_pd(__m128d m1, __m128d m2, const int predicate);

extern __m256d _mm256_cmp_pd(__m256d m1, __m256d m2, const int predicate);

Arguments

m1           float64 vector used for the operation
m2           float64 vector also used for the operation
predicate    an immediate operand that specifies the type of comparison to be performed on the packed values; see the immintrin.h file for the values that specify the type of comparison

Description Performs a SIMD compare of the packed double-precision floating-point (float64) values in the first source operand, m1, and the second source operand, m2, and returns the results of the comparison. The _mm_cmp_pd intrinsic is used for comparing 128-bit float64 vectors (two elements) while the _mm256_cmp_pd intrinsic is used for comparing 256-bit float64 vectors (four elements). The comparison predicate parameter (immediate) specifies the type of comparison performed on each of the pairs of packed values.

Returns Result of the compare operation.

_mm_cmp_ps, _mm256_cmp_ps Compares packed float32 elements of two vectors. The corresponding Intel® AVX instruction is VCMPPS.

Syntax extern __m128 _mm_cmp_ps(__m128 m1, __m128 m2, const int predicate);
extern __m256 _mm256_cmp_ps(__m256 m1, __m256 m2, const int predicate);

Arguments

m1           float32 vector used for the operation
m2           float32 vector also used for the operation
predicate    an immediate operand that specifies the type of comparison to be performed on the packed values; see the immintrin.h file for the values that specify the type of comparison


Description Performs a SIMD compare of the packed single-precision floating-point (float32) values in the first source operand, m1, and the second source operand, m2, and returns the results of the comparison. The _mm_cmp_ps intrinsic is used for comparing 128-bit float32 vectors (four elements) while the _mm256_cmp_ps intrinsic is used for comparing 256-bit float32 vectors (eight elements). The comparison predicate parameter (immediate) specifies the type of comparison performed on each of the pairs of packed values.

Returns Result of the compare operation.
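The result format can be sketched for a single element in plain scalar C — a hypothetical emu_cmp_lt_element helper illustrating one "less-than" comparison, not the intrinsic itself; the actual predicate encodings are defined in immintrin.h:

```c
#include <stdint.h>

/* Scalar sketch of one result element of a packed compare such as
   _mm256_cmp_ps with a less-than predicate (emulation only): each
   result element is an all-ones bit mask when the predicate holds,
   and all zeros otherwise. */
uint32_t emu_cmp_lt_element(float a, float b) {
    return (a < b) ? 0xFFFFFFFFu : 0x00000000u;
}
```

The all-ones/all-zeros mask format is what makes the compare results directly usable with the bitwise and blend intrinsics described earlier.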

_mm_cmp_sd Compares scalar float64 vectors. The corresponding Intel® AVX instruction is VCMPSD.

Syntax extern __m128d _mm_cmp_sd(__m128d m1, __m128d m2, const int predicate);

Arguments

m1           float64 vector used for the operation
m2           float64 vector also used for the operation
predicate    an immediate operand that specifies the type of comparison to be performed on the packed values; see the immintrin.h file for the values that specify the type of comparison

Description Performs a compare operation of the low double-precision floating-point values (float64 values) in the first source operand, m1, and the second source operand, m2, and returns the result. The comparison predicate parameter (immediate operand) specifies the type of comparison performed on each of the pairs of values.

Returns Result of the compare operation.


_mm_cmp_ss Compares scalar float32 values. The corresponding Intel® AVX instruction is VCMPSS.

Syntax extern __m128 _mm_cmp_ss(__m128 m1, __m128 m2, const int predicate);

Arguments

m1           float32 vector used for the operation
m2           float32 vector also used for the operation
predicate    an immediate operand that specifies the type of comparison to be performed on the packed values; see the immintrin.h file for the values that specify the type of comparison

Description Performs a compare operation of the low single-precision floating-point values (float32 values) in the first source operand, m1, and the second source operand, m2, and returns the results. The comparison predicate parameter (immediate) specifies the type of comparison performed.

Returns Result of the compare operation.

Intrinsics for Conversion Operations

_mm256_cvtepi32_pd Converts extended packed 32-bit integer values to packed double-precision floating point values. The corresponding Intel® AVX instruction is VCVTDQ2PD.

Syntax extern __m256d _mm256_cvtepi32_pd(__m128i m1);


Arguments

m1 integer source vector/operand

Description Converts four packed signed doubleword integers in the source vector, m1, into four packed double-precision floating-point values.

Returns Result of the conversion operation.

_mm256_cvtepi32_ps Converts extended packed 32-bit integer values to packed single-precision floating point values. The corresponding Intel® AVX instruction is VCVTDQ2PS.

Syntax extern __m256 _mm256_cvtepi32_ps(__m256i m1);

Arguments

m1 integer source vector /operand

Description Converts eight packed signed doubleword integers in the source vector, m1, to eight packed single-precision floating-point values.

Returns Result of the conversion operation.

_mm256_cvtpd_epi32 Converts packed double-precision float values to extended 32-bit integer values. The corresponding Intel® AVX instruction is VCVTPD2DQ.

Syntax extern __m128i _mm256_cvtpd_epi32(__m256d m1);


Arguments

m1    float64 source vector

Description Converts four packed double-precision floating-point values in the source vector, m1, to four packed signed doubleword integer (extended 32-bit integer) values in the destination.

Returns Result of the conversion operation.

_mm256_cvtps_epi32 Converts packed single-precision float values to extended 32-bit integer values. The corresponding Intel® AVX instruction is VCVTPS2DQ.

Syntax extern __m256i _mm256_cvtps_epi32(__m256 m1);

Arguments

m1    float32 source vector

Description Converts eight packed single-precision floating point values in the source vector, m1, to eight signed doubleword integer (extended 32-bit integer) values.

Returns Result of the conversion operation.

_mm256_cvtpd_ps Converts packed float64 values to packed float32 values. The corresponding Intel® AVX instruction is VCVTPD2PS.

Syntax extern __m128 _mm256_cvtpd_ps(__m256d m1);

Arguments

m1 float64 source vector

Description Converts four packed double-precision floating point values (float64 values) in the source vector, m1, to four packed single-precision floating-point values (float32 values).

Returns Result of the conversion operation.

_mm256_cvtps_pd Converts packed float32 values to packed float64 values. The corresponding Intel® AVX instruction is VCVTPS2PD.

Syntax extern __m256d _mm256_cvtps_pd(__m128 m1);

Arguments

m1 128-bit float32 source vector

Description Converts four packed single-precision floating point values (float32 values) in the source vector, m1, to four packed double-precision floating point values (float64 values).

Returns Result of the conversion operation.

_mm256_cvttpd_epi32 Converts packed float64 values to truncated extended 32-bit integer values. The corresponding Intel® AVX instruction is VCVTTPD2DQ.

Syntax extern __m128i _mm256_cvttpd_epi32(__m256d m1);

Arguments m1 float64 source vector

Description Converts four packed double-precision floating-point values (float64 values) in the source vector to four packed signed doubleword integer (extended 32-bit integer) values in the destination. When a conversion is inexact, a truncated (round towards zero) value is returned. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised. If this exception is masked, the indefinite integer value (80000000H) is returned.

Returns Result of the conversion operation.

_mm256_cvttps_epi32 Converts packed float32 values to truncated extended 32-bit integer values. The corresponding Intel® AVX instruction is VCVTTPS2DQ.

Syntax extern __m256i _mm256_cvttps_epi32(__m256 m1);

Arguments m1 float32 source vector

Description Converts eight packed single-precision floating-point values (float32 values) in the source vector to eight packed signed doubleword integer (extended 32-bit integer) values in the destination. When a conversion is inexact, a truncated (round towards zero) value is returned. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised; if this exception is masked, the indefinite integer value (80000000H) is returned.

Returns Result of the conversion operation.

Intrinsics to Determine Maximum and Minimum Values

_mm256_max_pd Determines the maximum of float64 vectors. The corresponding Intel® AVX instruction is VMAXPD.

Syntax extern __m256d _mm256_max_pd(__m256d m1, __m256d m2);

Arguments

m1 float64 vector used for the operation m2 float64 vector also used for the operation

Description Performs a SIMD compare of the packed double-precision floating-point (float64) elements in the first source vector, m1, and the second source vector, m2, and returns the maximum value for each pair.

Returns Maximum value of the compare operation.

_mm256_max_ps Determines the maximum of float32 vectors. The corresponding Intel® AVX instruction is VMAXPS.

Syntax extern __m256 _mm256_max_ps(__m256 m1, __m256 m2);

Arguments

m1 float32 vector used for the operation m2 float32 vector also used for the operation

Description Performs a SIMD compare of the packed single-precision floating-point (float32) elements in the first source vector, m1, and the second source vector, m2, and returns the maximum value for each pair.

Returns Maximum value of the compare operation.

_mm256_min_pd Determines the minimum of float64 vectors. The corresponding Intel® AVX instruction is VMINPD.

Syntax extern __m256d _mm256_min_pd(__m256d m1, __m256d m2);

Arguments m1 float64 vector used for the operation m2 float64 vector also used for the operation

Description Performs a SIMD compare of the packed double-precision floating-point (float64) elements in the first source vector, m1, and the second source vector, m2, and returns the minimum value for each pair.

Returns Minimum value of the compare operation.

_mm256_min_ps Determines the minimum of float32 vectors. The corresponding Intel® AVX instruction is VMINPS.

Syntax extern __m256 _mm256_min_ps(__m256 m1, __m256 m2);

Arguments

m1 float32 vector used for the operation m2 float32 vector also used for the operation

Description Performs a SIMD compare of the packed single-precision floating-point (float32) elements in the first source vector, m1, and the second source vector, m2, and returns the minimum value for each pair.

Returns Minimum value of the compare operation.

Intrinsics for Load Operations

_mm256_broadcast_pd Loads and broadcasts packed double-precision floating point values. The corresponding Intel® AVX instruction is VBROADCASTF128.

Syntax extern __m256d _mm256_broadcast_pd(__m128d const *a);

Arguments

*a pointer to a memory location that can hold constant float64 values

Description Loads two packed float64 values from the 128-bit memory location pointed to by a, and broadcasts them to both 128-bit lanes of the destination 256-bit vector.

Returns Result of the load and broadcast operation.

_mm256_broadcast_ps Loads and broadcasts packed single-precision floating point values. The corresponding Intel® AVX instruction is VBROADCASTF128.

Syntax extern __m256 _mm256_broadcast_ps(__m128 const *a);

Arguments

*a pointer to a memory location that can hold constant 128-bit float32 values

Description Loads four packed float32 values from the 128-bit memory location pointed to by a, and broadcasts them to both 128-bit lanes of the destination 256-bit vector.

Returns Result of the load and broadcast operation.

_mm256_broadcast_sd Loads and broadcasts a scalar double-precision floating point value to a 256-bit destination operand. The corresponding Intel® AVX instruction is VBROADCASTSD.

Syntax extern __m256d _mm256_broadcast_sd(double const *a);

Arguments

*a pointer to a memory location that can hold constant scalar float64 values

Description Loads a scalar double-precision floating-point value from the specified address, a, and broadcasts it to all four elements in the destination vector.

Returns Result of the load and broadcast operation.

_mm256_broadcast_ss, _mm_broadcast_ss Loads and broadcasts a scalar single-precision floating point value to a 256/128-bit destination operand. The corresponding Intel® AVX instruction is VBROADCASTSS.

Syntax extern __m256 _mm256_broadcast_ss(float const *a); extern __m128 _mm_broadcast_ss(float const *a);

Arguments

*a pointer to a memory location that holds a constant float32 value

Description Loads a scalar single-precision floating-point value from the address pointed to by a, and broadcasts it to the elements of the destination vector. The _mm256_broadcast_ss intrinsic broadcasts the loaded value to all eight elements of the 256-bit destination vector. The _mm_broadcast_ss intrinsic broadcasts the loaded value to all four elements of the 128-bit destination vector.

Returns Result of the load and broadcast operation.

_mm256_load_pd Moves packed double-precision floating point values from aligned memory location to a destination vector. The corresponding Intel® AVX instruction is VMOVAPD.

Syntax extern __m256d _mm256_load_pd(double const *a);

Arguments

*a pointer to a memory location that can hold constant float64 values; the address must be 32-byte aligned

Description Loads packed double-precision floating point values (float64 values) from the 256-bit aligned memory location pointed to by a, into a destination float64 vector, which is returned by the intrinsic.

Returns A 256-bit vector with float64 values.

_mm256_load_ps Moves packed single-precision floating point values from aligned memory location to a destination vector. The corresponding Intel® AVX instruction is VMOVAPS.

Syntax extern __m256 _mm256_load_ps(float const *a);

Arguments

*a pointer to a memory location that can hold constant float32 values; the address must be 32-byte aligned

Description Loads packed single-precision floating point values (float32 values) from the 256-bit aligned memory location pointed to by a, into a destination float32 vector, which is returned by the intrinsic.

Returns A 256-bit vector with float32 values.

_mm256_load_si256 Moves integer values from aligned memory location to a destination vector. The corresponding Intel® AVX instruction is VMOVDQA.

Syntax extern __m256i _mm256_load_si256(__m256i const *a);

Arguments

*a pointer to a memory location that can hold constant integer values; the address must be 32-byte aligned

Description Loads integer values from the 256-bit aligned memory location pointed to by a, into a destination integer vector, which is returned by the intrinsic.

Returns A 256-bit vector with integer values.

_mm256_loadu_pd Moves packed double-precision floating point values from unaligned memory location to a destination vector. The corresponding Intel® AVX instruction is VMOVUPD.

Syntax extern __m256d _mm256_loadu_pd(double const *a);

Arguments

*a pointer to a memory location that can hold constant float64 values

Description Loads packed double-precision floating point values (float64 values) from the 256-bit unaligned memory location pointed to by a, into a destination float64 vector, which is returned by the intrinsic.

Returns A 256-bit vector with float64 values.

_mm256_loadu_ps Moves packed single-precision floating point values from unaligned memory location to a destination vector. The corresponding Intel® AVX instruction is VMOVUPS.

Syntax extern __m256 _mm256_loadu_ps(float const *a);

Arguments

*a pointer to a memory location that can hold constant float32 values

Description Loads packed single-precision floating point values (float32 values) from the 256-bit unaligned memory location pointed to by a, into a destination float32 vector, which is returned by the intrinsic.

Returns A 256-bit vector with float32 values.

_mm256_loadu_si256 Moves integer values from unaligned memory location to a destination vector. The corresponding Intel® AVX instruction is VMOVDQU.

Syntax extern __m256i _mm256_loadu_si256(__m256i const *a);

Arguments

*a pointer to a memory location that can hold constant integer values

Description Loads integer values from the 256-bit unaligned memory location pointed to by a, into a destination integer vector, which is returned by the intrinsic.

Returns A 256-bit vector with integer values.

_mm256_maskload_pd, _mm_maskload_pd Loads packed double-precision floating point values according to mask values. The corresponding Intel® AVX instruction is VMASKMOVPD.

Syntax extern __m256d _mm256_maskload_pd(double const *a, __m256i mask); extern __m128d _mm_maskload_pd(double const *a, __m128i mask);

Arguments

*a pointer to a 256/128-bit memory location that can hold constant float64 values mask integer value calculated based on the most-significant-bit of each quadword of a mask register

Description Loads packed double-precision floating point (float64) values from the 256/128-bit memory location pointed to by a, into a destination register using the mask value. The mask is calculated from the most significant bit of each qword of the mask register. If a bit of the mask is set to zero, the corresponding value from the memory location is not loaded, and the corresponding element of the destination vector is set to zero.

Returns A 256/128-bit register with float64 values.

_mm256_maskload_ps, _mm_maskload_ps Loads packed single-precision floating point values according to mask values. The corresponding Intel® AVX instruction is VMASKMOVPS.

Syntax extern __m256 _mm256_maskload_ps(float const *a, __m256i mask); extern __m128 _mm_maskload_ps(float const *a, __m128i mask);

Arguments

*a pointer to a 256/128-bit memory location that can hold constant float32 values mask integer value calculated based on the most-significant-bit of each doubleword of a mask register

Description Loads packed single-precision floating point (float32) values from the 256/128-bit memory location pointed to by a, into a destination register using the mask value. The mask is calculated from the most significant bit of each dword of the mask register. If a bit of the mask is set to zero, the corresponding value from the memory location is not loaded, and the corresponding element of the destination vector is set to zero.

Returns A 256/128-bit register with float32 values.

_mm256_store_pd Moves packed double-precision floating point values from a float64 vector to an aligned memory location. The corresponding Intel® AVX instruction is VMOVAPD.

Syntax extern void _mm256_store_pd(double *a, __m256d b);

Arguments

*a pointer to a memory location that can hold double-precision floating point (float64) values; the address must be 32-byte aligned b float64 vector

Description Performs a store operation by moving packed double-precision floating point values (float64 values) from a float64 vector, b, to a 256-bit aligned memory location, pointed to by a.

Returns Nothing

_mm256_store_ps Moves packed single-precision floating point values from a float32 vector to an aligned memory location. The corresponding Intel® AVX instruction is VMOVAPS.

Syntax extern void _mm256_store_ps(float *a, __m256 b);

Arguments

*a pointer to a memory location that can hold single-precision floating point (float32) values; the address must be 32-byte aligned b float32 vector

Description Performs a store operation by moving packed single-precision floating point values (float32 values) from a float32 vector, b, to a 256-bit aligned memory location, pointed to by a.

Returns Nothing.

_mm256_store_si256 Moves values from an integer vector to an aligned memory location. The corresponding Intel® AVX instruction is VMOVDQA.

Syntax extern void _mm256_store_si256(__m256i *a, __m256i b);

Arguments

*a pointer to a memory location that can hold scalar integer values; the address must be 32-byte aligned. b integer vector

Description Performs a store operation by moving integer values from a 256-bit integer vector, b, to a 256-bit aligned memory location, pointed to by a.

Returns Nothing.

_mm256_storeu_pd Moves packed double-precision floating point values from a float64 vector to an unaligned memory location. The corresponding Intel® AVX instruction is VMOVUPD.

Syntax extern void _mm256_storeu_pd(double *a, __m256d b);

Arguments

*a pointer to a memory location that can hold double-precision floating point (float64) values b float64 vector

Description Performs a store operation by moving packed double-precision floating point values (float64 values) from a float64 vector, b, to a 256-bit unaligned memory location, pointed to by a.

Returns Nothing.

_mm256_storeu_ps Moves packed single-precision floating point values from a float32 vector to an unaligned memory location. The corresponding Intel® AVX instruction is VMOVUPS.

Syntax extern void _mm256_storeu_ps(float *a, __m256 b);

Arguments

*a pointer to a memory location that can hold single-precision floating point (float32) values b float32 vector

Description Performs a store operation by moving packed single-precision floating point values (float32 values) from a float32 vector, b, to a 256-bit unaligned memory location, pointed to by a.

Returns Nothing.

_mm256_storeu_si256 Moves values from an integer vector to an unaligned memory location. The corresponding Intel® AVX instruction is VMOVDQU.

Syntax extern void _mm256_storeu_si256(__m256i *a, __m256i b);

Arguments

*a pointer to a memory location that can hold scalar integer values b integer vector

Description Performs a store operation by moving integer values from a 256-bit integer vector, b, to a 256-bit unaligned memory location, pointed to by a.

Returns Nothing.

_mm256_stream_pd Moves packed double-precision floating-point values using non-temporal hint. The corresponding Intel® AVX instruction is VMOVNTPD.

Syntax extern void _mm256_stream_pd(double *p, __m256d a);

Arguments

*p pointer to a memory location that can hold double-precision floating point (float64) values; the address must be 32-byte aligned a float64 vector

Description Performs a store operation by moving packed double-precision floating point values (float64 values) from a float64 vector, a, to a 256-bit aligned memory location, pointed to by p, using a non-temporal hint to prevent caching of the data during the write to memory.

Returns Nothing.

_mm256_stream_ps Moves packed single-precision floating-point values using non-temporal hint. The corresponding Intel® AVX instruction is VMOVNTPS.

Syntax extern void _mm256_stream_ps(float *p, __m256 a);

Arguments

*p pointer to a memory location that can hold single-precision floating point (float32) values; the address must be 32-byte aligned a float32 vector

Description Performs a store operation by moving packed single-precision floating point values (float32 values) from a float32 vector, a, to a 256-bit aligned memory location, pointed to by p, using a non-temporal hint to prevent caching of the data during the write to memory.

Returns Nothing.

_mm256_stream_si256 Moves packed integer values using non-temporal hint. The corresponding Intel® AVX instruction is VMOVNTDQ.

Syntax extern void _mm256_stream_si256(__m256i *p, __m256i a);

Arguments

*p pointer to a memory location that can hold scalar integer values; the address must be 32-byte aligned a integer vector

Description Performs a store operation by moving scalar integer values from an integer vector, a, to a 256-bit aligned memory location, pointed to by p, using a non-temporal hint to prevent caching of the data during the write to memory.

Returns Nothing.

_mm256_maskstore_pd, _mm_maskstore_pd Stores packed double-precision floating point values according to mask values. The corresponding Intel® AVX instruction is VMASKMOVPD.

Syntax extern void _mm256_maskstore_pd(double *a, __m256i mask, __m256d b); extern void _mm_maskstore_pd(double *a, __m128i mask, __m128d b);

Arguments

*a pointer to a 256/128-bit memory location that can hold constant double-precision floating point (float64) values

mask integer value calculated based on the most-significant-bit of each quadword of a mask register b a 256/128-bit float64 vector

Description Performs a store operation by moving packed double-precision floating point (float64) values from a vector, b, to a 256/128-bit memory location, pointed to by a, using a mask. The mask is calculated from the most significant bit of each qword of the mask register. If a bit of the mask is set to zero, the corresponding value from the float64 vector is not stored, and the corresponding field of the destination memory location is left unchanged.

NOTE. Stores are atomic. Faults do not occur for memory locations for which all corresponding mask bits are set to zero.

Returns Nothing.

_mm256_maskstore_ps, _mm_maskstore_ps Stores packed single-precision floating point values according to mask values. The corresponding Intel® AVX instruction is VMASKMOVPS.

Syntax extern void _mm256_maskstore_ps(float *a, __m256i mask, __m256 b); extern void _mm_maskstore_ps(float *a, __m128i mask, __m128 b);

Arguments

*a pointer to a 256/128-bit memory location that can hold constant single-precision floating point (float32) values mask integer value calculated based on the most-significant-bit of each doubleword of a mask register b a 256/128-bit float32 vector

Description Performs a store operation by moving packed single-precision floating point (float32) values from a vector, b, to a 256/128-bit memory location, pointed to by a, using a mask. The mask is calculated from the most significant bit of each dword of the mask register. If a bit of the mask is set to zero, the corresponding value from the float32 vector is not stored, and the corresponding field of the destination memory location is left unchanged.

NOTE. Stores are atomic. Faults do not occur for memory locations for which all corresponding mask bits are set to zero.

Returns Nothing.

Intrinsics for Miscellaneous Operations

_mm256_extractf128_pd Extracts 128-bit packed float64 values. The corresponding Intel® AVX instruction is VEXTRACTF128.

Syntax extern __m128d _mm256_extractf128_pd(__m256d m1, const int offset);

Arguments

m1 float64 source vector offset a constant integer value that represents the 128-bit offset from where extraction must start

Description Extracts 128-bit packed double-precision floating point values (float64 values) from the source vector, m1, starting from the location specified by the value in the offset parameter.

Returns Result of the extraction operation.

_mm256_extractf128_ps Extracts 128-bit float32 values. The corresponding Intel® AVX instruction is VEXTRACTF128.

Syntax extern __m128 _mm256_extractf128_ps(__m256 m1, const int offset);

Arguments

m1 float32 source vector offset a constant integer value that represents the 128-bit offset from where extraction must start

Description Extracts 128-bit packed single-precision floating point values (float32 values) from the source vector, m1, starting from the location specified by the value in the offset parameter.

Returns Result of the extraction operation.

_mm256_extractf128_si256 Extracts 128-bit scalar integer values. The corresponding Intel® AVX instruction is VEXTRACTF128.

Syntax extern __m128i _mm256_extractf128_si256(__m256i m1, const int offset);

Arguments

m1 integer source vector offset a constant integer value that represents the offset from where extraction must start

Description Extracts 128-bit scalar integer values from the source vector, m1, starting from the location specified by the value in the offset parameter.

Returns Result of the extraction operation.

_mm256_insertf128_pd Inserts 128 bits of packed float64 values. The corresponding Intel® AVX instruction is VINSERTF128.

Syntax extern __m256d _mm256_insertf128_pd(__m256d a, __m128d b, int offset);

Arguments a 256-bit float64 source vector b 128-bit float64 source vector offset an integer value that represents the 128-bit offset where the insertion must start

Description Performs an insertion of 128 bits of packed double-precision floating-point values (float64 values) from the second source vector, b, into a destination at a 128-bit offset specified by the offset parameter. The remaining portions of the destination are written by the corresponding elements of the first source vector, a.

Returns Result of the insertion operation.

_mm256_insertf128_ps Inserts 128 bits of packed float32 values. The corresponding Intel® AVX instruction is VINSERTF128.

Syntax extern __m256 _mm256_insertf128_ps(__m256 a, __m128 b, int offset);

Arguments

a 256-bit float32 source vector b 128-bit float32 source vector offset an integer value that represents the 128-bit offset where the insertion must start

Description Performs an insertion of 128 bits of packed single-precision floating-point values (float32 values) from the second source vector, b, into a destination at a 128-bit offset specified by the offset parameter. The remaining portions of the destination are written by the corresponding elements of the first source vector, a.

Returns Result of the insertion operation.

_mm256_insertf128_si256 Inserts 128 bits of packed scalar integer values. The corresponding Intel® AVX instruction is VINSERTF128.

Syntax extern __m256i _mm256_insertf128_si256(__m256i a, __m128i b, int offset);

Arguments

a 256-bit integer source vector b 128-bit integer source vector

offset an integer value that represents the 128-bit offset where the insertion must start

Description Performs an insertion of 128 bits of packed scalar integer values from the second source vector, b, into a destination at a 128-bit offset specified by the offset parameter. The remaining portions of the destination are written by the corresponding elements of the first source vector, a.

Returns Result of the insertion operation.

_mm256_lddqu_si256 Moves unaligned integer values from memory. The corresponding Intel® AVX instruction is VLDDQU.

Syntax extern __m256i _mm256_lddqu_si256(__m256i const *a);

Arguments

*a points to a memory location from where unaligned integer value must be moved

Description Fetches 32 bytes of data, starting at a memory address specified by the a parameter, and places them in a destination. This intrinsic calls the corresponding instruction VLDDQU, which performs an operation functionally similar to the VMOVDQU instruction.

Returns Result of the move operation.

_mm256_movedup_pd Duplicates even-indexed double-precision floating point values. The corresponding Intel® AVX instruction is VMOVDDUP.

Syntax extern __m256d _mm256_movedup_pd(__m256d a);

Arguments

a float64 source vector

Description Performs a duplication of even-indexed double-precision floating-point values (float64 values) in the source vector, a, and returns the result.

Returns Result of the duplication operation.

_mm256_movehdup_ps Duplicates odd-indexed single-precision floating point values. The corresponding Intel® AVX instruction is VMOVSHDUP.

Syntax extern __m256 _mm256_movehdup_ps(__m256 a);

Arguments

a float32 source vector

Description Performs a duplication of odd-indexed single-precision floating-point values (float32 values) in the source vector, a, and returns the result.

Returns Result of the duplication operation.

_mm256_moveldup_ps Duplicates even-indexed single-precision floating point values. The corresponding Intel® AVX instruction is VMOVSLDUP.

Syntax extern __m256 _mm256_moveldup_ps(__m256 a);

Arguments a float32 source vector

Description Performs a duplication of even-indexed single-precision floating-point values (float32 values) in the source vector, a, and returns the result.

Returns Result of the duplication operation.

_mm256_movemask_pd Extracts float64 sign mask. The corresponding Intel® AVX instruction is VMOVMSKPD.

Syntax extern int _mm256_movemask_pd(__m256d a);

Arguments a float64 source vector

Description Performs an extract operation of sign bits from four double-precision floating point elements (float64 elements) of the source vector, a, and composes them into a bitmask.

Returns An integer bitmask of four meaningful bits.

_mm256_movemask_ps Extracts float32 sign mask. The corresponding Intel® AVX instruction is VMOVMSKPS.

Syntax extern int _mm256_movemask_ps(__m256 a);

Arguments

a float32 source vector

Description Performs an extract operation of sign bits from eight single-precision floating point elements (float32 elements) of the source vector, a, and composes them into a bitmask.

Returns An integer bitmask of eight meaningful bits.

_mm256_round_pd Rounds off double-precision floating point values to nearest upper/lower integer depending on rounding mode. The corresponding Intel® AVX instruction is VROUNDPD.

Syntax extern __m256d _mm256_round_pd(__m256d a, int iRoundMode);

Arguments

a float64 vector iRoundMode A hexadecimal value dependent on rounding mode:

• For rounding off to upper integer, the value is 0x0A • For rounding off to lower integer, the value is 0x09

Description Rounds off the elements of a float64 vector, a, to the nearest upper/lower integer value. Two shortcuts, in the form of #defines, are used to achieve these two separate operations: #define _mm256_ceil_pd(a) _mm256_round_pd((a), 0x0A) #define _mm256_floor_pd(a) _mm256_round_pd((a), 0x09)

These #defines tell the preprocessor to replace all instances of _mm256_ceil_pd(a) with _mm256_round_pd((a), 0x0A) and all instances of _mm256_floor_pd(a) with _mm256_round_pd((a), 0x09). For example, if you write the following: __m256d a, b; a = _mm256_ceil_pd(b); the preprocessor will modify it to: a = _mm256_round_pd(b, 0x0A);

The Precision Floating Point Exception is signaled according to the (immediate operand) iRoundMode value.

Returns Result of the rounding off operation as a vector with double-precision floating point values.

_mm256_round_ps Rounds off single-precision floating point values to nearest upper/lower integer depending on rounding mode. The corresponding Intel® AVX instruction is VROUNDPS.

Syntax extern __m256 _mm256_round_ps(__m256 a, int iRoundMode );

Arguments a float32 vector iRoundMode A hexadecimal value dependent on rounding mode:

• For rounding off to upper integer, the value is 0x0A • For rounding off to lower integer, the value is 0x09

Description Rounds off the elements of a float32 vector, a, to the nearest upper/lower integer value. Two shortcuts, in the form of #defines, are used to achieve these two separate operations: #define _mm256_ceil_ps(a) _mm256_round_ps((a), 0x0A) #define _mm256_floor_ps(a) _mm256_round_ps((a), 0x09)

These #defines tell the preprocessor to replace all instances of _mm256_ceil_ps(a) with _mm256_round_ps((a), 0x0A) and all instances of _mm256_floor_ps(a) with _mm256_round_ps((a), 0x09). For example, if you write the following: __m256 a, b; a = _mm256_ceil_ps(b);

the preprocessor will modify it to: a = _mm256_round_ps(b, 0x0a);

The Precision Floating Point Exception is signaled according to the (immediate operand) iRoundMode value.

Returns Result of the rounding off operation as a vector with single-precision floating point values.

_mm256_set_pd Initializes 256-bit vector with float64 values. No corresponding Intel® AVX instruction.

Syntax extern __m256d _mm256_set_pd(double, double, double, double);

Arguments

double a float64 value to be initialized into the 256-bit vector; there are 4 double parameters, one for each float64 vector element

Description Initializes a 256-bit vector with double-precision floating point values (float64 values) as specified by the double parameter.

Returns A float64 vector initialized with specified double-precision floating point values.

_mm256_set_ps Initializes 256-bit vector with float32 values. No corresponding Intel® AVX instruction.

Syntax extern __m256 _mm256_set_ps(float, float, float, float, float, float, float, float);

Arguments float a float32 value to be initialized into the 256-bit vector; there are 8 float parameters, one for each float32 vector element

Description Initializes a 256-bit vector with single-precision floating point values (float32 values) as specified by the float parameter.

Returns A float32 vector initialized with specified single-precision floating point values.

_mm256_set_epi32 Initializes 256-bit vector with integer values. No corresponding Intel® AVX instruction.

Syntax extern __m256i _mm256_set_epi32(int, int, int, int, int, int, int, int);

Arguments int an integer32 value to be initialized into the 256-bit vector; there are 8 int parameters, one for each integer32 vector element

Description Initializes a 256-bit vector with extended packed integer values (integer32 values) as specified by the int parameter.

Returns An integer32 vector initialized with specified extended packed integer values.

_mm256_set1_pd Initializes all elements of a 256-bit vector with a scalar double-precision floating point value. No corresponding Intel® AVX instruction.

Syntax extern __m256d _mm256_set1_pd(double);

Arguments

double a float64 value to be broadcast to all four elements of the 256-bit vector

Description Initializes all four elements of a 256-bit vector with the scalar double-precision floating point value (float64 value) specified by the double parameter.

Returns A float64 vector with all elements set to the specified scalar double-precision floating point value.

_mm256_set1_ps Initializes all elements of a 256-bit vector with a scalar single-precision floating point value. No corresponding Intel® AVX instruction.

Syntax extern __m256 _mm256_set1_ps(float);


Arguments float a float32 value to be broadcast to all eight elements of the 256-bit vector

Description Initializes all eight elements of a 256-bit vector with the scalar single-precision floating point value (float32 value) specified by the float parameter.

Returns A float32 vector with all elements set to the specified scalar single-precision floating point value.

_mm256_set1_epi32 Initializes all elements of a 256-bit vector with a scalar integer value. No corresponding Intel® AVX instruction.

Syntax extern __m256i _mm256_set1_epi32(int);

Arguments int an integer32 value to be broadcast to all eight elements of the 256-bit vector

Description Initializes all eight elements of a 256-bit vector with the scalar integer value (integer32 value) specified by the int parameter.

Returns An integer32 vector with all elements set to the specified scalar integer value.
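The _mm256_set1_* intrinsics above all follow the same broadcast pattern; a scalar sketch of the semantics (an illustrative model, not the intrinsics themselves):

```python
# Scalar model of the _mm256_set1_* broadcast pattern (illustrative only):
# one scalar is replicated into every element of the vector.
def model_set1(value, num_elements):
    return [value] * num_elements

# _mm256_set1_pd broadcasts to 4 float64 elements, _mm256_set1_ps to
# 8 float32 elements, and _mm256_set1_epi32 to 8 integer32 elements.
```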


_mm256_setzero_pd Sets float64 YMM registers to zero. No corresponding Intel® AVX instruction.

Syntax extern __m256d _mm256_setzero_pd(void);

Arguments None

Description Sets all the elements of a float64 vector to zero and returns the float64 vector. This is a utility intrinsic that helps during programming.

Returns A float64 vector with all elements set to zero.

_mm256_setzero_ps Sets float32 YMM registers to zero. No corresponding Intel® AVX instruction.

Syntax extern __m256 _mm256_setzero_ps(void);

Arguments None

Description Sets all the elements of a float32 vector to zero and returns the float32 vector. This is a utility intrinsic that helps during programming.

Returns A float32 vector with all elements set to zero.


_mm256_setzero_si256 Sets integer YMM registers to zero. No corresponding Intel® AVX instruction.

Syntax extern __m256i _mm256_setzero_si256(void);

Arguments None

Description Sets all the elements of an integer vector to zero and returns the integer vector. This is a utility intrinsic that helps during programming.

Returns An integer vector with all elements set to zero.

_mm256_zeroall Zeroes all YMM registers. The corresponding Intel® AVX instruction is VZEROALL.

Syntax extern void _mm256_zeroall(void);

Arguments None

Description Zeroes all YMM registers. This intrinsic is useful to clear all the YMM registers when transitioning between Intel® AVX and legacy SSE instructions. There is no transition penalty if an application clears the bits of all YMM registers (sets to ‘0’) via VZEROALL, the corresponding instruction for this intrinsic, before transitioning between Intel® AVX instructions and legacy SSE instructions.

Returns Nothing


_mm256_zeroupper Zeroes the upper bits of the YMM registers. The corresponding Intel® AVX instruction is VZEROUPPER.

Syntax extern void _mm256_zeroupper(void);

Arguments None

Description Zeroes the upper 128 bits of all YMM registers. The lower 128 bits that correspond to the XMM registers are left unmodified. This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel® AVX and legacy SSE instructions. There is no transition penalty if an application clears the upper bits of all YMM registers (sets to ‘0’) via VZEROUPPER, the corresponding instruction for this intrinsic, before transitioning between Intel® AVX instructions and legacy SSE instructions.

Returns Nothing

Intrinsics for Packed Test Operations

_mm256_testz_si256 Performs a packed bit test of two integer vectors to set the ZF flag. The corresponding Intel® AVX instruction is VPTEST.

Syntax extern int _mm256_testz_si256(__m256i s1, __m256i s2);

Arguments

s1 first source integer vector

s2 second source integer vector

Description Allows setting of the ZF flag. The ZF flag is set based on the result of a bitwise AND operation between the first and second source vectors. The corresponding instruction, VPTEST, sets the ZF flag if all the resulting bits are 0. If any resulting bit is non-zero, the instruction clears the ZF flag.

Returns Non-zero if ZF flag is set Zero if the ZF flag is not set

_mm256_testc_si256 Performs a packed bit test of two integer vectors to set the CF flag. The corresponding Intel® AVX instruction is VPTEST.

Syntax extern int _mm256_testc_si256(__m256i s1, __m256i s2);

Arguments s1 first source integer vector s2 second source integer vector

Description Allows setting of the CF flag. The CF flag is set based on the result of a bitwise AND and logical NOT operation between the first and second source vectors. The corresponding instruction, VPTEST, sets the CF flag if all the resulting bits are 0. If any resulting bit is non-zero, the instruction clears the CF flag.

Returns Non-zero if CF flag is set Zero if the CF flag is not set


_mm256_testnzc_si256 Performs a packed bit test of two integer vectors to set ZF and CF flags. The corresponding Intel® AVX instruction is VPTEST.

Syntax extern int _mm256_testnzc_si256(__m256i s1, __m256i s2);

Arguments

s1 first source integer vector s2 second source integer vector

Description Performs a packed bit test of the s1 and s2 vectors using the VPTEST s1, s2 instruction and checks the status of the ZF and CF flags. The intrinsic returns 1 if both the ZF and CF flags are clear (that is, neither flag is set), otherwise it returns 0 (that is, one of the flags is set). The VPTEST instruction performs a bitwise comparison of all the bits in the two source operands. If the AND of the first source operand with the second source operand produces all zeros, the ZF flag is set; otherwise the ZF flag is clear. If the AND of the inverted first source operand with the second source operand produces all zeros, the CF flag is set; otherwise the CF flag is clear.

Returns 1: indicates that both ZF and CF flags are clear 0: indicates that either ZF or CF flag is set
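The three packed bit-test intrinsics above report different flags computed from the same two AND operations. A scalar sketch of the VPTEST flag logic, using Python integers to stand in for 256-bit vectors (an illustrative model, not the intrinsics themselves):

```python
# Scalar model of the VPTEST flag logic (illustrative only).
# a and b stand in for the 256-bit source vectors, taken as wide integers.
MASK = (1 << 256) - 1

def model_vptest(a, b):
    zf = 1 if (a & b) & MASK == 0 else 0    # ZF: a AND b is all zeros
    cf = 1 if (~a & b) & MASK == 0 else 0   # CF: (NOT a) AND b is all zeros
    return zf, cf

def model_testz(a, b):    # _mm256_testz_si256: returns ZF
    return model_vptest(a, b)[0]

def model_testc(a, b):    # _mm256_testc_si256: returns CF
    return model_vptest(a, b)[1]

def model_testnzc(a, b):  # _mm256_testnzc_si256: 1 only if ZF == CF == 0
    zf, cf = model_vptest(a, b)
    return 1 if zf == 0 and cf == 0 else 0
```

For example, disjoint bit patterns make testz return 1, while vectors that overlap partially (neither disjoint nor contained) make testnzc return 1.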

_mm256_testz_pd, _mm_testz_pd Performs a packed bit test of two float64 256-bit or 128-bit vectors to set the ZF flag. The corresponding Intel® AVX instruction is VTESTPD.

Syntax extern int _mm256_testz_pd(__m256d s1, __m256d s2); extern int _mm_testz_pd(__m128d s1, __m128d s2);


Arguments s1 first float64 source vector s2 second float64 source vector

Description Allows setting of the ZF flag. The ZF flag is set based on the result of a bitwise AND operation between the sign bits of the first and second source vectors. The corresponding instruction, VTESTPD, sets the ZF flag if all the resulting bits are 0. If any resulting bit is non-zero, the instruction clears the ZF flag. The _mm_testz_pd intrinsic sets the ZF flag according to the results of the 128-bit float64 source vectors. The _mm256_testz_pd intrinsic sets the ZF flag according to the results of the 256-bit float64 source vectors.

NOTE. Intel® AVX instructions include a full complement of 128-bit SIMD instructions. Such Intel® AVX instructions, with a vector length of 128 bits, zero the upper 128 bits of the YMM register. The lower 128 bits of the YMM register are aliased to the corresponding SIMD XMM register.

Returns Non-zero if ZF flag is set Zero if the ZF flag is not set

_mm256_testz_ps, _mm_testz_ps Performs a packed bit test of two float32 256-bit or 128-bit vectors to set the ZF flag. The corresponding Intel® AVX instruction is VTESTPS.

Syntax extern int _mm256_testz_ps(__m256 s1, __m256 s2); extern int _mm_testz_ps(__m128 s1, __m128 s2);

Arguments s1 first source float32 vector


s2 second source float32 vector

Description Allows setting of the ZF flag. The ZF flag is set based on the result of a bitwise AND operation between the sign bits of the first and second source vectors. The corresponding instruction, VTESTPS, sets the ZF flag if all the resulting bits are 0. If any resulting bit is non-zero, the instruction clears the ZF flag. The _mm_testz_ps intrinsic sets the ZF flag according to the results of the 128-bit float32 source vectors. The _mm256_testz_ps intrinsic sets the ZF flag according to the results of the 256-bit float32 source vectors.

NOTE. Intel® AVX instructions include a full complement of 128-bit SIMD instructions. Such Intel® AVX instructions, with a vector length of 128 bits, zero the upper 128 bits of the YMM register. The lower 128 bits of the YMM register are aliased to the corresponding SIMD XMM register.

Returns Non-zero if ZF flag is set Zero if the ZF flag is not set

_mm256_testc_pd, _mm_testc_pd Performs a packed bit test of two 256-bit or 128-bit float64 vectors to set the CF flag. The corresponding Intel® AVX instruction is VTESTPD.

Syntax extern int _mm256_testc_pd(__m256d s1, __m256d s2); extern int _mm_testc_pd(__m128d s1, __m128d s2);

Arguments

s1 first source float64 vector s2 second source float64 vector


Description Allows setting of the CF flag. The CF flag is set based on the result of a bitwise AND and logical NOT operation between the sign bits of the first and second source vectors. The corresponding instruction, VTESTPD, sets the CF flag if all the resulting bits are 0. If any resulting bit is non-zero, the instruction clears the CF flag. The _mm_testc_pd intrinsic sets the CF flag according to the results of the 128-bit float64 source vectors. The _mm256_testc_pd intrinsic sets the CF flag according to the results of the 256-bit float64 source vectors.

NOTE. Intel® AVX instructions include a full complement of 128-bit SIMD instructions. Such Intel® AVX instructions, with a vector length of 128 bits, zero the upper 128 bits of the YMM register. The lower 128 bits of the YMM register are aliased to the corresponding SIMD XMM register.

Returns Non-zero if CF flag is set Zero if the CF flag is not set

_mm256_testc_ps, _mm_testc_ps Performs a packed bit test of two 256-bit or 128-bit float32 vectors to set the CF flag. The corresponding Intel® AVX instruction is VTESTPS.

Syntax extern int _mm256_testc_ps(__m256 s1, __m256 s2); extern int _mm_testc_ps(__m128 s1, __m128 s2);

Arguments s1 first source float32 vector s2 second source float32 vector


Description Allows setting of the CF flag. The CF flag is set based on the result of a bitwise AND and logical NOT operation between the sign bits of the first and second source vectors. The corresponding instruction, VTESTPS, sets the CF flag if all the resulting bits are 0. If any resulting bit is non-zero, the instruction clears the CF flag. The _mm_testc_ps intrinsic sets the CF flag according to the results of the 128-bit float32 source vectors. The _mm256_testc_ps intrinsic sets the CF flag according to the results of the 256-bit float32 source vectors.

NOTE. Intel® AVX instructions include a full complement of 128-bit SIMD instructions. Such Intel® AVX instructions, with a vector length of 128 bits, zero the upper 128 bits of the YMM register. The lower 128 bits of the YMM register are aliased to the corresponding SIMD XMM register.

Returns Non-zero if CF flag is set Zero if the CF flag is not set

_mm256_testnzc_pd, _mm_testnzc_pd Performs a packed bit test of two 256-bit float64 or 128-bit float64 vectors to check ZF and CF flag settings. The corresponding Intel® AVX instruction is VTESTPD.

Syntax extern int _mm256_testnzc_pd(__m256d s1, __m256d s2); extern int _mm_testnzc_pd(__m128d s1, __m128d s2);

Arguments

s1 first source float64 vector s2 second source float64 vector


Description Performs a packed bit test of the s1 and s2 vectors using the VTESTPD s1, s2 instruction and checks the status of the ZF and CF flags. The intrinsic returns 1 if both the ZF and CF flags are clear (that is, neither flag is set), otherwise it returns 0 (that is, one of the flags is set). The VTESTPD instruction performs a bitwise comparison of all the sign bits of the double-precision elements in the first source operand and the corresponding sign bits in the second source operand. If the AND of the first source operand sign bits with the second source operand sign bits produces all zeros, the ZF flag is set; otherwise the ZF flag is clear. If the AND of the inverted first source operand sign bits with the second source operand sign bits produces all zeros, the CF flag is set; otherwise the CF flag is clear. The _mm_testnzc_pd intrinsic checks the ZF and CF flags according to the results of the 128-bit float64 source vectors. The _mm256_testnzc_pd intrinsic checks the ZF and CF flags according to the results of the 256-bit float64 source vectors.

NOTE. Intel® AVX instructions include a full complement of 128-bit SIMD instructions. Such Intel® AVX instructions, with a vector length of 128 bits, zero the upper 128 bits of the YMM register. The lower 128 bits of the YMM register are aliased to the corresponding SIMD XMM register.

Returns 1: indicates that both ZF and CF flags are clear 0: indicates that either ZF or CF flag is set

_mm256_testnzc_ps, _mm_testnzc_ps Performs a packed bit test of two 256-bit or 128-bit float32 vectors to check ZF and CF flag settings. The corresponding Intel® AVX instruction is VTESTPS.

Syntax extern int _mm256_testnzc_ps(__m256 s1, __m256 s2); extern int _mm_testnzc_ps(__m128 s1, __m128 s2);


Arguments

s1 first source float32 vector s2 second source float32 vector

Description Performs a packed bit test of the s1 and s2 vectors using the VTESTPS s1, s2 instruction and checks the status of the ZF and CF flags. The intrinsic returns 1 if both the ZF and CF flags are clear (that is, neither flag is set), otherwise it returns 0 (that is, one of the flags is set). The VTESTPS instruction performs a bitwise comparison of all the sign bits of the single-precision elements in the first source operand and the corresponding sign bits in the second source operand. If the AND of the first source operand sign bits with the second source operand sign bits produces all zeros, the ZF flag is set; otherwise the ZF flag is clear. If the AND of the inverted first source operand sign bits with the second source operand sign bits produces all zeros, the CF flag is set; otherwise the CF flag is clear. The _mm_testnzc_ps intrinsic checks the ZF and CF flags according to the results of the 128-bit float32 source vectors. The _mm256_testnzc_ps intrinsic checks the ZF and CF flags according to the results of the 256-bit float32 source vectors.

NOTE. Intel® AVX instructions include a full complement of 128-bit SIMD instructions. Such Intel® AVX instructions, with a vector length of 128 bits, zero the upper 128 bits of the YMM register. The lower 128 bits of the YMM register are aliased to the corresponding SIMD XMM register.

Returns 1: indicates that both ZF and CF flags are clear 0: indicates that either ZF or CF flag is set
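Unlike VPTEST, which tests every bit, the VTESTPD and VTESTPS forms examine only the per-element sign bits. A scalar sketch of that flag logic, with the sign bits given as lists of 0/1 values (an illustrative model, not the intrinsics themselves):

```python
# Scalar model of the VTESTPD/VTESTPS flag logic (illustrative only).
# s1_signs and s2_signs are lists of per-element sign bits (0 or 1).
def model_vtest_signs(s1_signs, s2_signs):
    # ZF: AND of corresponding sign bits is all zeros.
    zf = 1 if all(a & b == 0 for a, b in zip(s1_signs, s2_signs)) else 0
    # CF: AND of inverted s1 sign bits with s2 sign bits is all zeros.
    cf = 1 if all((a ^ 1) & b == 0 for a, b in zip(s1_signs, s2_signs)) else 0
    return zf, cf
```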


Intrinsics for Permute Operations

_mm256_permute_pd, _mm_permute_pd Permutes 256-bit or 128-bit float64 values into a 256-bit or 128-bit destination vector. The corresponding Intel® AVX instruction is VPERMILPD.

Syntax extern __m256d _mm256_permute_pd(__m256d m1, int control); extern __m128d _mm_permute_pd(__m128d m1, int control);

Arguments

m1 a 256-bit or 128-bit float64 vector control an integer specified as an 8-bit immediate;

• for the 256-bit m1 vector this integer contains four 1-bit control fields in the low 4 bits of the immediate • for the 128-bit m1 vector this integer contains two 1-bit control fields in the low 2 bits of the immediate

Description The _mm256_permute_pd intrinsic permutes double-precision floating point elements (float64 elements) in the 256-bit source vector, m1, according to the 1-bit control fields in the immediate, control, and stores the result in a destination vector. The _mm_permute_pd intrinsic permutes double-precision floating point elements (float64 elements) in the 128-bit source vector, m1, according to the 1-bit control fields in the immediate, control, and stores the result in a destination vector.

Returns A 256-bit or 128-bit float64 vector with permuted values.
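A scalar sketch of the 256-bit selection rule, in which each 1-bit field picks an element from within its own 128-bit lane (an illustrative model, not the intrinsic itself):

```python
# Scalar model of _mm256_permute_pd element selection (illustrative only).
# Each 1-bit field of the immediate picks the low or high element of the
# 128-bit lane that the corresponding result element belongs to.
def model_permute_pd_256(m1, control):
    out = []
    for i in range(4):                 # result element i
        lane_base = (i // 2) * 2       # elements 0-1 are lane 0, 2-3 lane 1
        select = (control >> i) & 1    # 1-bit field i of the immediate
        out.append(m1[lane_base + select])
    return out
```

For example, control 0b0101 swaps the two elements within each 128-bit lane.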


_mm256_permute_ps, _mm_permute_ps Permutes 256-bit or 128-bit float32 values into a 256-bit or 128-bit destination vector. The corresponding Intel® AVX instruction is VPERMILPS.

Syntax extern __m256 _mm256_permute_ps(__m256 m1, int control); extern __m128 _mm_permute_ps(__m128 m1, int control);

Arguments

m1 a 256-bit or 128-bit float32 vector control an integer specified as an 8-bit immediate;

• for the 256-bit m1 vector this integer contains four 2-bit control fields in the low 8 bits of the immediate, applied to each 128-bit lane • for the 128-bit m1 vector this integer contains four 2-bit control fields in the low 8 bits of the immediate

Description The _mm256_permute_ps intrinsic permutes single-precision floating point elements (float32 elements) in the 256-bit source vector, m1, according to the 2-bit control fields in the immediate, control, and stores the result in a destination vector. The _mm_permute_ps intrinsic permutes single-precision floating point elements (float32 elements) in the 128-bit source vector, m1, according to the 2-bit control fields in the immediate, control, and stores the result in a destination vector.

Returns A 256-bit or 128-bit float32 vector with permuted values.


_mm256_permutevar_pd, _mm_permutevar_pd Permutes float64 values into a 256-bit or 128-bit destination vector. The corresponding Intel® AVX instruction is VPERMILPD.

Syntax extern __m256d _mm256_permutevar_pd(__m256d m1, __m256i control); extern __m128d _mm_permutevar_pd(__m128d m1, __m128i control);

Arguments m1 a 256-bit or 128-bit float64 vector control a vector with 1-bit control fields, one for each corresponding element of the source vector

• for the 256-bit m1 source vector this control vector contains four 1-bit control fields, one in each corresponding 64-bit element • for the 128-bit m1 source vector this control vector contains two 1-bit control fields, one in each corresponding 64-bit element

Description Permutes double-precision floating-point values in the source vector, m1, according to the 1-bit control fields in the low bytes of the corresponding elements of the shuffle control vector, control. The result is stored in a destination vector.

Returns A 256-bit or 128-bit float64 vector with permuted values.

_mm256_permutevar_ps, _mm_permutevar_ps Permutes float32 values into a 256-bit or 128-bit destination vector. The corresponding Intel® AVX instruction is VPERMILPS.

Syntax extern __m256 _mm256_permutevar_ps(__m256 m1, __m256i control);


extern __m128 _mm_permutevar_ps(__m128 m1, __m128i control);

Arguments

m1 a 256-bit or 128-bit float32 vector control a vector with 2-bit control fields, one for each corresponding element of the source vector

• for the 256-bit m1 source vector this control vector contains eight 2-bit control fields • for the 128-bit m1 source vector this control vector contains four 2-bit control fields

Description Permutes single-precision floating-point values in the source vector, m1, according to the 2-bit control fields in the low bytes of the corresponding elements of the shuffle control vector, control. The result is stored in a destination vector.

Returns A 256-bit or 128-bit float32 vector with permuted values.

_mm256_permute2f128_pd Permutes 128-bit double-precision floating point containing fields into a 256-bit destination vector. The corresponding Intel® AVX instruction is VPERM2F128.

Syntax extern __m256d _mm256_permute2f128_pd(__m256d m1, __m256d m2, int control);

Arguments

m1 a 256-bit float64 source vector m2 a 256-bit float64 source vector control an immediate byte that specifies two 2-bit control fields and two additional bits which specify zeroing behavior.


Description Permutes 128-bit floating-point-containing fields from the first source vector, m1, and second source vector, m2, by using bits in the 8-bit control argument.

Returns A 256-bit float64 vector with permuted values.

_mm256_permute2f128_ps Permutes 128-bit single-precision floating point containing fields into a 256-bit destination vector. The corresponding Intel® AVX instruction is VPERM2F128.

Syntax extern __m256 _mm256_permute2f128_ps(__m256 m1, __m256 m2, int control);

Arguments m1 a 256-bit float32 source vector m2 a 256-bit float32 source vector control an immediate byte that specifies two 2-bit control fields and two additional bits which specify zeroing behavior.

Description Permutes 128-bit floating-point-containing fields from the first source vector, m1, and second source vector, m2, by using bits in the 8-bit control argument.

Returns A 256-bit float32 vector with permuted values.

_mm256_permute2f128_si256 Permutes 128-bit integer containing fields into a 256-bit destination vector. The corresponding Intel® AVX instruction is VPERM2F128.

Syntax extern __m256i _mm256_permute2f128_si256(__m256i m1, __m256i m2, int control);


Arguments

m1 a 256-bit integer source vector m2 a 256-bit integer source vector control an immediate byte that specifies two 2-bit control fields and two additional bits which specify zeroing behavior.

Description Permutes 128-bit integer-containing fields from the first source vector, m1, and second source vector, m2, by using bits in the 8-bit control argument.

Returns A 256-bit integer vector with permuted values.
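The VPERM2F128 control byte is shared by all three intrinsics above: bits 1:0 select the source for the low 128-bit lane of the result (0 = low lane of m1, 1 = high lane of m1, 2 = low lane of m2, 3 = high lane of m2), bits 5:4 do the same for the high lane, and bits 3 and 7 zero the corresponding result lane. A scalar sketch of this decoding (an illustrative model, with each vector given as a pair of 128-bit halves):

```python
# Scalar model of the VPERM2F128 control byte (illustrative only).
# m1 and m2 are given as [low_lane, high_lane] pairs of 128-bit halves.
def model_permute2f128(m1, m2, control):
    lanes = [m1[0], m1[1], m2[0], m2[1]]

    def pick(field, zero_bit):
        return 0 if zero_bit else lanes[field]   # 0 stands for a zeroed lane

    low = pick(control & 0x3, (control >> 3) & 1)
    high = pick((control >> 4) & 0x3, (control >> 7) & 1)
    return [low, high]
```

The familiar combine patterns fall out of this decoding: control 0x20 gathers the two low lanes, and 0x31 gathers the two high lanes.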

Intrinsics for Shuffle Operations

_mm256_shuffle_pd Shuffles float64 vectors. The corresponding Intel® AVX instruction is VSHUFPD.

Syntax extern __m256d _mm256_shuffle_pd(__m256d m1, __m256d m2, const int select);

Arguments

m1 float64 vector used for the operation m2 float64 vector also used for the operation select a constant of integer type that determines which elements of the source vectors are moved to the result

Description Moves or shuffles two of the packed double-precision floating-point elements (float64 elements) from the double quadword in the source vectors to the low and high quadwords of the double quadword of the result.


The elements of the first source vector are moved to the low quadword while the elements of the second source vector are moved to the high quadword of the result. The constant defined by the select parameter determines which of the two elements of the source vectors are moved to the result.

Returns Result of the shuffle operation.
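For the 256-bit form, the selection happens independently within each 128-bit lane: the even result element of a lane comes from m1 and the odd one from m2, with one select bit choosing the low or high element. A scalar sketch (an illustrative model, not the intrinsic itself):

```python
# Scalar model of _mm256_shuffle_pd selection (illustrative only).
# Within each 128-bit lane, the even result element comes from m1 and the
# odd result element from m2; each select bit picks the low or high element.
def model_shuffle_pd_256(m1, m2, select):
    return [
        m1[0 + ((select >> 0) & 1)],
        m2[0 + ((select >> 1) & 1)],
        m1[2 + ((select >> 2) & 1)],
        m2[2 + ((select >> 3) & 1)],
    ]
```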

_mm256_shuffle_ps Shuffles float32 vectors. The corresponding Intel® AVX instruction is VSHUFPS.

Syntax extern __m256 _mm256_shuffle_ps(__m256 m1, __m256 m2, const int select);

Arguments m1 float32 vector used for the operation m2 float32 vector also used for the operation select a constant of integer type which determines which elements of the source vectors move to the result

Description Moves or shuffles two of the packed single-precision floating-point elements (float32 elements) from the double quadword in the source vectors to the low and high quadwords of the double quadword of the result. The elements of the first source vector are moved to the low quadword while the elements of the second source vector are moved to the high quadword of the result. The constant defined by the select parameter determines which of the two elements of the source vectors are moved to the result.

Returns Result of the shuffle operation.


Intrinsics for Unpack and Interleave Operations

_mm256_unpackhi_pd Unpacks and interleaves high packed double-precision floating point values. The corresponding Intel® AVX instruction is VUNPCKHPD.

Syntax extern __m256d _mm256_unpackhi_pd(__m256d m1, __m256d m2);

Arguments

m1 float64 source vector m2 float64 source vector

Description Performs an interleaved unpack operation of high double-precision floating point values of the two source vectors, m1 and m2, and returns the result of the operation.

Returns A vector with unpacked interleaved double-precision floating point values.

_mm256_unpackhi_ps Unpacks and interleaves high packed single-precision floating point values. The corresponding Intel® AVX instruction is VUNPCKHPS.

Syntax extern __m256 _mm256_unpackhi_ps(__m256 m1, __m256 m2);

Arguments

m1 float32 source vector m2 float32 source vector


Description Performs an interleaved unpack operation of high single-precision floating point values of the two source vectors, m1 and m2, and returns the result of the operation.

Returns A vector with unpacked interleaved single-precision floating point values.

_mm256_unpacklo_pd Unpacks and interleaves low packed double-precision floating point values. The corresponding Intel® AVX instruction is VUNPCKLPD.

Syntax extern __m256d _mm256_unpacklo_pd(__m256d m1, __m256d m2);

Arguments m1 float64 source vector m2 float64 source vector

Description Performs an interleaved unpack operation of low double-precision floating point values of the two source vectors, m1 and m2, and returns the result of the operation.

Returns A vector with unpacked interleaved double-precision floating point values.

_mm256_unpacklo_ps Unpacks and interleaves low packed single-precision floating point values. The corresponding Intel® AVX instruction is VUNPCKLPS.

Syntax extern __m256 _mm256_unpacklo_ps(__m256 m1, __m256 m2);


Arguments

m1 float32 source vector m2 float32 source vector

Description Performs an interleaved unpack operation of low single-precision floating point values of the two source vectors, m1 and m2, and returns the result of the operation.

Returns A vector with unpacked interleaved single-precision floating point values.
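The unpack intrinsics above share one lane-wise interleave pattern; for the float64 forms it can be sketched as follows (an illustrative scalar model, not the intrinsics themselves):

```python
# Scalar model of the 256-bit unpack/interleave pattern for float64
# (illustrative only). Each 128-bit lane interleaves independently:
# unpackhi takes the high element of each lane, unpacklo the low element.
def model_unpack_pd_256(m1, m2, high):
    k = 1 if high else 0
    return [m1[0 + k], m2[0 + k], m1[2 + k], m2[2 + k]]
```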

Support Intrinsics for Vector Typecasting Operations

_mm256_castpd_ps Typecasts double-precision floating point values to single-precision floating point values. No corresponding Intel® AVX instruction.

Syntax extern __m256 _mm256_castpd_ps(__m256d a);

Arguments

a float64 source vector

Description Performs a typecast operation from double-precision floating point values (float64 values) to single-precision floating point values (float32 values). This intrinsic does not introduce extra moves to the generated code. Source operand bits are passed unchanged to the result.

Returns A vector with single-precision floating point values.


_mm256_castps_pd Typecasts single-precision floating point values to double-precision floating point values. No corresponding Intel® AVX instruction.

Syntax extern __m256d _mm256_castps_pd(__m256 a);

Arguments a float32 source vector

Description Performs a typecast operation from single-precision floating point values (float32 values) to double-precision floating point values (float64 values). This intrinsic does not introduce extra moves to the generated code. Source operand bits are passed unchanged to the result.

Returns A vector with double-precision floating point values.
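These cast intrinsics generate no instructions; the same bits are simply viewed under a different element type. The effect can be sketched on a single float64 element reinterpreted as two float32 values (an illustrative model using Python's struct module; element order assumes the little-endian in-memory layout):

```python
import struct

# Reinterpret the 8 bytes of one float64 as two float32 values: the same
# bit-preserving view that _mm256_castpd_ps applies to a whole vector.
def reinterpret_f64_as_2xf32(x):
    return list(struct.unpack('<2f', struct.pack('<d', x)))

# The inverse view, as applied by _mm256_castps_pd.
def reinterpret_2xf32_as_f64(lo, hi):
    return struct.unpack('<d', struct.pack('<2f', lo, hi))[0]
```

Because no bits change, casting one way and back recovers the original value exactly.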

_mm256_castpd_si256 Typecasts double-precision floating point values to integer values. No corresponding Intel® AVX instruction.

Syntax extern __m256i _mm256_castpd_si256(__m256d a);

Arguments a float64 source vector

Description Performs a typecast operation from double-precision floating point values (float64 values) to integer values. This intrinsic does not introduce extra moves to the generated code. Source operand bits are passed unchanged to the result.


Returns A vector with 256-bit integer values.

_mm256_castps_si256 Typecasts single-precision floating point values to integer values. No corresponding Intel® AVX instruction.

Syntax extern __m256i _mm256_castps_si256(__m256 a);

Arguments

a float32 source vector

Description Performs a typecast operation from single-precision floating point values (float32 values) to integer values. This intrinsic does not introduce extra moves to the generated code. Source operand bits are passed unchanged to the result.

Returns A vector with 256-bit integer values.

_mm256_castsi256_pd Typecasts 256-bit integer values to double-precision floating point values. No corresponding Intel® AVX instruction.

Syntax extern __m256d _mm256_castsi256_pd(__m256i a);

Arguments

a 256-bit integer vector


Description Performs a typecast operation from 256-bit integer values to double-precision floating point values (float64 values). This intrinsic does not introduce extra moves to the generated code. Source operand bits are passed unchanged to the result.

Returns A vector with double-precision floating point values.

_mm256_castsi256_ps Typecasts 256-bit integer values to single-precision floating point values. No corresponding Intel® AVX instruction.

Syntax extern __m256 _mm256_castsi256_ps(__m256i a);

Arguments a 256-bit integer source vector

Description Performs a typecast operation from 256-bit integer values to single-precision floating point values (float32 values). This intrinsic does not introduce extra moves to the generated code. Source operand bits are passed unchanged to the result.

Returns A vector with single-precision floating point values.

_mm256_castpd128_pd256 Typecasts 128-bit double-precision floating point values to 256-bit double-precision floating point values. No corresponding Intel® AVX instruction.

Syntax extern __m256d _mm256_castpd128_pd256(__m128d a);


Arguments

a 128-bit float64 vector

Description Performs a typecast operation from 128-bit double-precision floating point values to 256-bit double-precision floating point values. The lower 128 bits of the 256-bit resulting vector contains the source vector values; the upper 128 bits of the resulting vector are undefined. This intrinsic does not introduce extra moves to the generated code.

Returns A vector with 256-bit double-precision floating point values. The upper bits of the resulting vector are undefined.

_mm256_castpd256_pd128 Typecasts 256-bit double-precision floating point values to 128-bit double-precision floating point values. No corresponding Intel® AVX instruction.

Syntax extern __m128d _mm256_castpd256_pd128(__m256d a);

Arguments

a 256-bit float64 source vector

Description Performs a typecast operation from 256-bit double-precision floating point values to 128-bit double-precision floating point values. The lower 128 bits of the source vector are passed unchanged to the result. This intrinsic does not introduce extra moves to the generated code.

Returns A vector with 128-bit double-precision floating point values.


_mm256_castps128_ps256 Typecasts 128-bit single-precision floating point values to 256-bit single-precision floating point values. No corresponding Intel® AVX instruction.

Syntax extern __m256 _mm256_castps128_ps256(__m128 a);

Arguments a 128-bit float32 source vector

Description Performs a typecast operation from 128-bit single-precision floating point values to 256-bit single-precision floating point values. The lower 128 bits of the 256-bit resulting vector contain the source vector values; the upper 128 bits of the resulting vector are undefined. This intrinsic does not introduce extra moves to the generated code.

Returns A vector with 256-bit single-precision floating point values. The upper bits of the resulting vector are undefined.

_mm256_castps256_ps128 Typecasts 256-bit single-precision floating point values to 128-bit single-precision floating point values. No corresponding Intel® AVX instruction.

Syntax extern __m128 _mm256_castps256_ps128(__m256 a);

Arguments a 256-bit float32 source vector


Description Performs a typecast operation from 256-bit single-precision floating point values to 128-bit single-precision floating point values. The lower 128 bits of the source vector are passed unchanged to the result. This intrinsic does not introduce extra moves to the generated code.

Returns A vector with 128-bit single-precision floating point values.

_mm256_castsi128_si256 Typecasts 128-bit integer values to 256-bit integer values. No corresponding Intel® AVX instruction.

Syntax extern __m256i _mm256_castsi128_si256(__m128i a);

Arguments

a 128-bit integer source vector

Description Performs a typecast operation from 128-bit integer values to 256-bit integer values. The lower 128 bits of the 256-bit resulting vector contain the source vector values; the upper 128 bits of the resulting vector are undefined. This intrinsic does not introduce extra moves to the generated code.

Returns A vector with 256-bit integer values. The upper bits of the resulting vector are undefined.

_mm256_castsi256_si128 Typecasts 256-bit integer values to 128-bit integer values. No corresponding Intel® AVX instruction.

Syntax extern __m128i _mm256_castsi256_si128(__m256i a);


Arguments

a 256-bit integer source vector

Description Performs a typecast operation from 256-bit integer values to 128-bit integer values. The lower 128 bits of the source vector are passed unchanged to the result. This intrinsic does not introduce extra moves to the generated code.

Returns A vector with 128-bit integer values.

Intrinsics for Fused Multiply Add Operations

_mm_fmadd_pd, _mm256_fmadd_pd Multiply-adds packed double-precision floating-point values using three float64 vectors. The corresponding FMA instruction is VFMADDxxxPD, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128d _mm_fmadd_pd(__m128d a, __m128d b, __m128d c);

For 256-bit vector extern __m256d _mm256_fmadd_pd(__m256d a, __m256d b, __m256d c);

Arguments

a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation


Description Performs a set of SIMD multiply-add computation on packed double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the infinite precision intermediate results are added to corresponding values in the third operand, after which the final results are rounded to the nearest float64 values. The compiler defaults to using the VFMADD213PD instruction and uses the other forms VFMADD132PD or VFMADD231PD only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-add operation.

_mm_fmadd_ps, _mm256_fmadd_ps Multiply-adds packed single-precision floating-point values using three float32 vectors. The corresponding FMA instruction is VFMADDxxxPS, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128 _mm_fmadd_ps(__m128 a, __m128 b, __m128 c);

For 256-bit vector extern __m256 _mm256_fmadd_ps(__m256 a, __m256 b, __m256 c);

Arguments

a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation


Description Performs a set of SIMD multiply-add computation on packed single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the infinite precision intermediate results are added to corresponding values in the third operand, after which the final results are rounded to the nearest float32 values. The compiler defaults to using the VFMADD213PS instruction and uses the other forms VFMADD132PS or VFMADD231PS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-add operation.

_mm_fmadd_sd Multiply-adds scalar double-precision floating-point values using three float64 vectors. The corresponding FMA instruction is VFMADDxxxSD, where xxx could be 132, 213, or 231.

Syntax extern __m128d _mm_fmadd_sd(__m128d a, __m128d b, __m128d c);

Arguments
a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation

Description Performs a set of SIMD multiply-add computation on scalar double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the infinite precision intermediate results are added to corresponding values in the third operand, after which the final results are rounded to the nearest float64 values.


The compiler defaults to using the VFMADD213SD instruction and uses the other forms VFMADD132SD or VFMADD231SD only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-add operation.

_mm_fmadd_ss Multiply-adds scalar single-precision floating-point values using three float32 vectors. The corresponding FMA instruction is VFMADDxxxSS, where xxx could be 132, 213, or 231.

Syntax extern __m128 _mm_fmadd_ss(__m128 a, __m128 b, __m128 c);

Arguments

a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation

Description Performs a set of SIMD multiply-add computation on scalar single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the infinite precision intermediate results are added to corresponding values in the third operand, after which the final results are rounded to the nearest float32 values. The compiler defaults to using the VFMADD213SS instruction and uses the other forms VFMADD132SS or VFMADD231SS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-add operation.


_mm_fmaddsub_pd, _mm256_fmaddsub_pd Multiply-adds and subtracts packed double-precision floating-point values using three float64 vectors. The corresponding FMA instruction is VFMADDSUBxxxPD, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128d _mm_fmaddsub_pd(__m128d a, __m128d b, __m128d c);

For 256-bit vector extern __m256d _mm256_fmaddsub_pd(__m256d a, __m256d b, __m256d c);

Arguments
a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation

Description Performs a set of SIMD multiply-add-subtract computation on packed double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and infinite precision intermediate results are obtained. The odd values in the third operand, c, are added to the intermediate results while the even values are subtracted from them. The final results are rounded to the nearest float64 values. The compiler defaults to using the VFMADDSUB213PD instruction and uses the other forms VFMADDSUB132PD or VFMADDSUB231PD only if a low-level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-add-subtract operation.


_mm_fmaddsub_ps, _mm256_fmaddsub_ps Multiply-adds and subtracts packed single-precision floating-point values using three float32 vectors. The corresponding FMA instruction is VFMADDSUBxxxPS, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128 _mm_fmaddsub_ps(__m128 a, __m128 b, __m128 c);

For 256-bit vector extern __m256 _mm256_fmaddsub_ps(__m256 a, __m256 b, __m256 c);

Arguments

a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation

Description Performs a set of SIMD multiply-add-subtract computation on packed single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and infinite precision intermediate results are obtained. The odd values in the third operand, c, are added to the intermediate results while the even values are subtracted from them. The final results are rounded to the nearest float32 values. The compiler defaults to using the VFMADDSUB213PS instruction and uses the other forms VFMADDSUB132PS or VFMADDSUB231PS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-add-subtract operation.


_mm_fmsubadd_pd, _mm256_fmsubadd_pd Multiply-subtracts and adds packed double-precision floating-point values using three float64 vectors. The corresponding FMA instruction is VFMSUBADDxxxPD, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128d _mm_fmsubadd_pd(__m128d a, __m128d b, __m128d c);

For 256-bit vector extern __m256d _mm256_fmsubadd_pd(__m256d a, __m256d b, __m256d c);

Arguments
a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation

Description Performs a set of SIMD multiply-subtract-add computation on packed double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and infinite precision intermediate results are obtained. The odd values in the third operand, c, are subtracted from the intermediate results while the even values are added to them. The final results are rounded to the nearest float64 values. The compiler defaults to using the VFMSUBADD213PD instruction and uses the other forms VFMSUBADD132PD or VFMSUBADD231PD only if a low-level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-subtract-add operation.


_mm_fmsubadd_ps, _mm256_fmsubadd_ps Multiply-subtracts and adds packed single-precision floating-point values using three float32 vectors. The corresponding FMA instruction is VFMSUBADDxxxPS, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128 _mm_fmsubadd_ps(__m128 a, __m128 b, __m128 c);

For 256-bit vector extern __m256 _mm256_fmsubadd_ps(__m256 a, __m256 b, __m256 c);

Arguments

a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation

Description Performs a set of SIMD multiply-subtract-add computation on packed single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and infinite precision intermediate results are obtained. The odd values in the third operand, c, are subtracted from the intermediate results while the even values are added to them. The final results are rounded to the nearest float32 values. The compiler defaults to using the VFMSUBADD213PS instruction and uses the other forms VFMSUBADD132PS or VFMSUBADD231PS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-subtract-add operation.


_mm_fmsub_pd, _mm256_fmsub_pd Multiply-subtracts packed double-precision floating-point values using three float64 vectors. The corresponding FMA instruction is VFMSUBxxxPD, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128d _mm_fmsub_pd(__m128d a, __m128d b, __m128d c);

For 256-bit vector extern __m256d _mm256_fmsub_pd(__m256d a, __m256d b, __m256d c);

Arguments
a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation

Description Performs a set of SIMD multiply-subtract computation on packed double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the infinite precision intermediate results are obtained. From the infinite precision intermediate results, the values in the third operand, c, are subtracted. The final results are rounded to the nearest float64 values. The compiler defaults to using the VFMSUB213PD instruction and uses the other forms VFMSUB132PD or VFMSUB231PD only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-subtract operation.


_mm_fmsub_ps, _mm256_fmsub_ps Multiply-subtracts packed single-precision floating-point values using three float32 vectors. The corresponding FMA instruction is VFMSUBxxxPS, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128 _mm_fmsub_ps(__m128 a, __m128 b, __m128 c);

For 256-bit vector extern __m256 _mm256_fmsub_ps(__m256 a, __m256 b, __m256 c);

Arguments

a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation

Description Performs a set of SIMD multiply-subtract computation on packed single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the infinite precision intermediate results are obtained. From the infinite precision intermediate results, the values in the third operand, c, are subtracted. The final results are rounded to the nearest float32 values. The compiler defaults to using the VFMSUB213PS instruction and uses the other forms VFMSUB132PS or VFMSUB231PS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-subtract operation.


_mm_fmsub_sd Multiply-subtracts scalar double-precision floating-point values using three float64 vectors. The corresponding FMA instruction is VFMSUBxxxSD, where xxx could be 132, 213, or 231.

Syntax extern __m128d _mm_fmsub_sd(__m128d a, __m128d b, __m128d c);

Arguments
a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation

Description Performs a set of SIMD multiply-subtract computation on scalar double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the infinite precision intermediate results are obtained. From the infinite precision intermediate results, the values in the third operand, c, are subtracted. The final results are rounded to the nearest float64 values. The compiler defaults to using the VFMSUB213SD instruction and uses the other forms VFMSUB132SD or VFMSUB231SD only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-subtract operation.


_mm_fmsub_ss Multiply-subtracts scalar single-precision floating-point values using three float32 vectors. The corresponding FMA instruction is VFMSUBxxxSS, where xxx could be 132, 213, or 231.

Syntax extern __m128 _mm_fmsub_ss(__m128 a, __m128 b, __m128 c);

Arguments

a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation

Description Performs a set of SIMD multiply-subtract computation on scalar single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the infinite precision intermediate results are obtained. From the infinite precision intermediate results, the values in the third operand, c, are subtracted. The final results are rounded to the nearest float32 values. The compiler defaults to using the VFMSUB213SS instruction and uses the other forms VFMSUB132SS or VFMSUB231SS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the multiply-subtract operation.


_mm_fnmadd_pd, _mm256_fnmadd_pd Multiply-adds negated packed double-precision floating-point values of three float64 vectors. The corresponding FMA instruction is VFNMADDxxxPD, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128d _mm_fnmadd_pd(__m128d a, __m128d b, __m128d c);

For 256-bit vector extern __m256d _mm256_fnmadd_pd(__m256d a, __m256d b, __m256d c);

Arguments
a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation

Description Performs a set of SIMD negated multiply-add computation on packed double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the negated infinite precision intermediate results are added to the values in the third operand, c, after which the final results are rounded to the nearest float64 values. The compiler defaults to using the VFNMADD213PD instruction and uses the other forms VFNMADD132PD or VFNMADD231PD only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the negated multiply-add operation.


_mm_fnmadd_ps, _mm256_fnmadd_ps Multiply-adds negated packed single-precision floating-point values of three float32 vectors. The corresponding FMA instruction is VFNMADDxxxPS, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128 _mm_fnmadd_ps(__m128 a, __m128 b, __m128 c);

For 256-bit vector extern __m256 _mm256_fnmadd_ps(__m256 a, __m256 b, __m256 c);

Arguments

a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation

Description Performs a set of SIMD negated multiply-add computation on packed single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the negated infinite precision intermediate results are added to the values in the third operand, c, after which the final results are rounded to the nearest float32 values. The compiler defaults to using the VFNMADD213PS instruction and uses the other forms VFNMADD132PS or VFNMADD231PS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the negated multiply-add operation.


_mm_fnmadd_sd Multiply-adds negated scalar double-precision floating-point values of three float64 vectors. The corresponding FMA instruction is VFNMADDxxxSD, where xxx could be 132, 213, or 231.

Syntax extern __m128d _mm_fnmadd_sd(__m128d a, __m128d b, __m128d c);

Arguments
a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation

Description Performs a set of SIMD negated multiply-add computation on scalar double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the negated infinite precision intermediate results are added to the values in the third operand, c, after which the final results are rounded to the nearest float64 values. The compiler defaults to using the VFNMADD213SD instruction and uses the other forms VFNMADD132SD or VFNMADD231SD only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the negated multiply-add operation.

_mm_fnmadd_ss Multiply-adds negated scalar single-precision floating-point values of three float32 vectors. The corresponding FMA instruction is VFNMADDxxxSS, where xxx could be 132, 213, or 231.

Syntax extern __m128 _mm_fnmadd_ss(__m128 a, __m128 b, __m128 c);


Arguments

a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation

Description Performs a set of SIMD negated multiply-add computation on scalar single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the negated infinite precision intermediate results are added to the values in the third operand, c, after which the final results are rounded to the nearest float32 values. The compiler defaults to using the VFNMADD213SS instruction and uses the other forms VFNMADD132SS or VFNMADD231SS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the negated multiply-add operation.

_mm_fnmsub_pd, _mm256_fnmsub_pd Multiply-subtracts negated packed double-precision floating-point values of three float64 vectors. The corresponding FMA instruction is VFNMSUBxxxPD, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128d _mm_fnmsub_pd(__m128d a, __m128d b, __m128d c);

For 256-bit vector extern __m256d _mm256_fnmsub_pd(__m256d a, __m256d b, __m256d c);

Arguments

a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation

Description Performs a set of SIMD negated multiply-subtract computation on packed double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the negated infinite precision intermediate result is obtained. From this intermediate result the value in the third operand, c, is subtracted, after which the final results are rounded to the nearest float64 values. The compiler defaults to using the VFNMSUB213PD instruction and uses the other forms VFNMSUB132PD or VFNMSUB231PD only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the negated multiply-subtract operation.

_mm_fnmsub_ps, _mm256_fnmsub_ps Multiply-subtracts negated packed single-precision floating-point values of three float32 vectors. The corresponding FMA instruction is VFNMSUBxxxPS, where xxx could be 132, 213, or 231.

Syntax For 128-bit vector extern __m128 _mm_fnmsub_ps(__m128 a, __m128 b, __m128 c);

For 256-bit vector extern __m256 _mm256_fnmsub_ps(__m256 a, __m256 b, __m256 c);

Arguments
a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation


Description Performs a set of SIMD negated multiply-subtract computation on packed single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the negated infinite precision intermediate result is obtained. From this intermediate result the value in the third operand, c, is subtracted, after which the final results are rounded to the nearest float32 values. The compiler defaults to using the VFNMSUB213PS instruction and uses the other forms VFNMSUB132PS or VFNMSUB231PS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the negated multiply-subtract operation.

_mm_fnmsub_sd Multiply-subtracts negated scalar double-precision floating-point values of three float64 vectors. The corresponding FMA instruction is VFNMSUBxxxSD, where xxx could be 132, 213, or 231.

Syntax extern __m128d _mm_fnmsub_sd(__m128d a, __m128d b, __m128d c);

Arguments

a float64 vector used for the operation
b float64 vector also used for the operation
c float64 vector also used for the operation

Description Performs a set of SIMD negated multiply-subtract computation on scalar double-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the negated infinite precision intermediate result is obtained. From this intermediate result the value in the third operand, c, is subtracted, after which the final results are rounded to the nearest float64 values.


The compiler defaults to using the VFNMSUB213SD instruction and uses the other forms VFNMSUB132SD or VFNMSUB231SD only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the negated multiply-subtract operation.

_mm_fnmsub_ss Multiply-subtracts negated scalar single-precision floating-point values of three float32 vectors. The corresponding FMA instruction is VFNMSUBxxxSS, where xxx could be 132, 213, or 231.

Syntax extern __m128 _mm_fnmsub_ss(__m128 a, __m128 b, __m128 c);

Arguments
a float32 vector used for the operation
b float32 vector also used for the operation
c float32 vector also used for the operation

Description Performs a set of SIMD negated multiply-subtract computation on scalar single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the negated infinite precision intermediate result is obtained. From this intermediate result the value in the third operand, c, is subtracted, after which the final results are rounded to the nearest float32 values. The compiler defaults to using the VFNMSUB213SS instruction and uses the other forms VFNMSUB132SS or VFNMSUB231SS only if a low level optimization decides it is useful or necessary. For example, the compiler could change the default if it finds that another instruction form saves a register or eliminates a move.

Returns Result of the negated multiply-subtract operation.


Intrinsics Generating Vectors of Undefined Values

_mm256_undefined_ps() Returns a vector of eight single precision floating point elements.

Syntax extern __m256 _mm256_undefined_ps(void);

Description This intrinsic returns a vector of eight single precision floating point elements. The content of the vector is not specified. The prototype of this intrinsic is in the immintrin header file.

Returns A vector of eight single precision floating point elements.

See Also • Intrinsics Returning Vectors of Undefined Values

_mm256_undefined_pd() Returns a vector of four double precision floating point elements.

Syntax extern __m256d _mm256_undefined_pd(void);

Description This intrinsic returns a vector of four double precision floating point elements. The content of the vector is not specified. The prototype of this intrinsic is in the immintrin header file.

Returns A vector of four double precision floating point elements.

See Also • Intrinsics Returning Vectors of Undefined Values


_mm256_undefined_si256() Returns a vector of eight packed doubleword integer elements.

Syntax extern __m256i _mm256_undefined_si256(void);

Description This intrinsic returns a vector of eight packed doubleword integer elements. The content of the vector is not specified. The prototype of this intrinsic is in the immintrin header file.

Returns A vector of eight packed doubleword integer elements.

See Also • Intrinsics Returning Vectors of Undefined Values


Intrinsics for Intel(R) Post-32nm Processor Instruction Extensions

Overview: Intrinsics for Intel(R) Post 32nm Processor Instruction Extensions

The Intel® Post 32nm Processor Instruction Extension intrinsics are assembly-coded functions that call on Intel® Post 32nm Processor Instructions that include new vector SIMD and scalar instructions targeted for Intel® 64 architecture processors in process technology smaller than 32 nm. The prototypes for the Intel® Post 32nm Processor Instructions Intrinsics are available in the immintrin.h file. These intrinsics map directly to the instructions defined in the "CHAPTER 7. POST-32NM PROCESSOR INSTRUCTIONS" section of "Intel® Advanced Vector Extensions Programming Reference" (http://software.intel.com/en-us/avx/).

Functional Overview The Intel® Post 32nm Processor Instruction Extensions include:

• Four intrinsics that map to two hardware instructions, VCVTPS2PH and VCVTPH2PS, performing 16-bit floating-point data type conversion to and from the single-precision floating-point data type. The intrinsics for conversion to packed 16-bit floating-point values from packed single-precision floating-point values also provide rounding control using an immediate byte.
• Three intrinsics that map to the hardware instruction RDRAND. The intrinsics generate 16-, 32-, or 64-bit wide random integers.
• Eight intrinsics that map to the hardware instructions RDFSBASE, RDGSBASE, WRFSBASE, and WRGSBASE. The intrinsics allow software running in a 64-bit environment to read and write the FS base and GS base registers at all privilege levels.

Intrinsics for Converting Half Floats that Map to Intel(R) Post-32nm Processor Instructions

There are four intrinsics for converting half-float values. Two of them (_mm_cvtph_ps() and _mm_cvtps_ph()) are declared in the emmintrin.h file, and the other two (_mm256_cvtps_ph() and _mm256_cvtph_ps()) are declared in the immintrin.h file. These intrinsics are supported starting from the first CPUs with Intel® AVX support, which do not have dedicated instructions for FP16 conversion. Therefore, the intrinsics are lowered to runtime library function calls, and map to Intel® Post 32nm Processor instructions only when such a processor is specified as the target CPU using the -Qx<code>/-x<code> option, where <code> is the name of a CPU that supports the Post 32nm Instruction Extensions.

See Also • Intrinsics for Converting Half Floats

_mm_cvtph_ps() Converts four half-precision (16-bit) floating point values to single-precision floating point values.

Syntax extern __m128 _mm_cvtph_ps(__m128i x, int imm);

Arguments

x a vector containing four 16-bit FP values

imm a conversion control constant

Description This intrinsic converts four half-precision (16-bit) floating point values to single-precision floating point values. The corresponding Intel® Post 32nm processor extension instruction is VCVTPH2PS.

Returns A vector containing four single-precision floating point elements.

_mm256_cvtph_ps() Converts eight half-precision (16-bit) floating point values to single-precision floating point values.

Syntax extern __m256 _mm256_cvtph_ps(__m128i x, int imm);

Arguments

x a vector containing eight 16-bit FP values

imm a conversion control constant


Description This intrinsic converts eight half-precision (16-bit) floating point values to single-precision floating point values. The corresponding Intel® Post 32nm processor extension instruction is VCVTPH2PS.

Returns A vector containing eight single-precision floating point elements.

_mm_cvtps_ph() Converts four single-precision floating point values to half-precision (16-bit) floating point values.

Syntax extern __m128i _mm_cvtps_ph(__m128 x, int imm);

Arguments

x a vector containing four single-precision FP values

imm a conversion control constant

Description This intrinsic converts four single-precision floating point values to half-precision (16-bit) floating point values. The corresponding Intel® Post 32nm processor extension instruction is VCVTPS2PH.

Returns A vector containing four half-precision (16-bit) floating point elements in the lower 64 bits; the upper 64 bits of the result are zeros.

_mm256_cvtps_ph() Converts eight single-precision floating point values to half-precision (16-bit) floating point values.

Syntax extern __m128i _mm256_cvtps_ph(__m256 x, int imm);

Arguments

x a vector containing eight single-precision FP values


imm a conversion control constant

Description This intrinsic converts eight single-precision floating point values to half-precision (16-bit) floating point values. The corresponding Intel® Post 32nm processor extension instruction is VCVTPS2PH.

Returns A vector containing eight half-precision (16-bit) floating point elements.

Intrinsics that Generate 16/32/64 Bit Wide Random Integers

There are three intrinsics returning a hardware-generated random value. These intrinsics are declared in the immintrin.h header file. All of these intrinsics are mapped to a retry loop around the RDRAND instruction:

loop:
    RDRAND AX     // return 16-bit random value
    // RDRAND EAX // return 32-bit random value
    // RDRAND RAX // return 64-bit random value
    jnc loop

_rdrand_u16(), _rdrand_u32(), _rdrand_u64() Generate 16/32/64-bit wide random integers.

Syntax extern unsigned short _rdrand_u16(); extern unsigned int _rdrand_u32(); extern unsigned __int64 _rdrand_u64();

Description These intrinsics generate 16/32/64-bit wide random integers. They are mapped to the hardware instruction RDRAND.

Returns A hardware-generated 16/32/64-bit random value.

Constraints The _rdrand_u64() intrinsic can be used only on systems that support 64-bit registers (Intel® 64 architecture).


Intrinsics that Allow Reading from and Writing to the FS Base and GS Base Registers

There are eight intrinsics that allow reading and writing the value of the FS base and GS base registers at all privilege levels and in 64-bit mode only. These intrinsics are mapped to corresponding Post 32nm processor extension instructions.

_readfsbase_u32(), _readfsbase_u64() Read the value of the FS base register.

Syntax extern unsigned int _readfsbase_u32(); extern unsigned __int64 _readfsbase_u64();

Description These intrinsics read the value of the FS base register. Both intrinsics are mapped to RDFSBASE instruction.

Returns The value of the FS base segment register.

_readgsbase_u32(), _readgsbase_u64() Read the value of the GS base register.

Syntax extern unsigned int _readgsbase_u32(); extern unsigned __int64 _readgsbase_u64();

Description These intrinsics read the value of the GS base register. Both intrinsics are mapped to RDGSBASE instruction.

Returns The value of the GS base segment register.


_writefsbase_u32(), _writefsbase_u64() Write the value to the FS base register.

Syntax extern void _writefsbase_u32(unsigned int val); extern void _writefsbase_u64(unsigned __int64 val);

Arguments

val the value to be written to the FS base register

Description These intrinsics write the given value to the FS segment base register. Both intrinsics are mapped to WRFSBASE instruction.

_writegsbase_u32(), _writegsbase_u64() Write the value to the GS base register.

Syntax extern void _writegsbase_u32(unsigned int val); extern void _writegsbase_u64(unsigned __int64 val);

Arguments

val the value to be written to the GS base register

Description These intrinsics write the given value to the GS segment base register. Both intrinsics are mapped to WRGSBASE instruction.


Overview: Intrinsics for Managing Extended Processor States and Registers

The Intel® C++ Compiler provides 12 intrinsics for managing the extended processor states and extended registers. These intrinsics are available for the IA-32 and Intel® 64 architectures running on Windows*, Linux*, and VxWorks* operating systems. The prototypes for these intrinsics are available in the immintrin.h file. The intrinsics map directly to the hardware system instructions described in the "Intel® 64 and IA-32 Architectures Software Developer's Manual, volumes 1, 2a, and 2b" and the "Intel® Advanced Vector Extensions Programming Reference" (http://software.intel.com/en-us/avx/).

Functional Overview The intrinsics for managing the extended processor states and extended registers include:

• Two intrinsics to read from and write to a specified extended control register. These intrinsics map to XGETBV and XSETBV instructions.
• Four intrinsics to save and restore the current state of the x87 FPU, MMX, XMM, and MXCSR registers. These intrinsics map to FXSAVE, FXSAVE64, FXRSTOR, and FXRSTOR64 instructions.
• Six intrinsics to save and restore the current state of the x87 FPU, MMX, XMM, YMM, and MXCSR registers. These intrinsics map to XSAVE, XSAVE64, XSAVEOPT, XSAVEOPT64, XRSTOR, and XRSTOR64 instructions.

Intrinsics for Reading and Writing the Content of Extended Control Registers

This group of intrinsics includes two intrinsics to read from and write to extended control registers (XCRs). Currently, the only such register defined is XCR0, the XFEATURE_ENABLED_MASK register. This register specifies the set of processor states that the operating system enables on that processor, e.g. the x87 FPU state, the SSE state, and other processor extended states that the Intel® 64 architecture may introduce in the future. These intrinsics are declared in the immintrin.h file.


_xgetbv() Reads the content of an extended control register.

Syntax extern unsigned __int64 _xgetbv(unsigned int xcr);

Arguments

xcr An extended control register to be read. Currently, only the value 0 is allowed.

Description This intrinsic reads from the extended control registers. Currently, the only control register defined is XCR0, the XFEATURE_ENABLED_MASK register. The following constant is defined in the immintrin.h file to refer to this register:

#define _XCR_XFEATURE_ENABLED_MASK 0

This intrinsic maps to XGETBV instruction.


Returns The content of the specified extended control register.

_xsetbv() Writes the given value to a specified extended control register.

Syntax extern void _xsetbv(unsigned int xcr, unsigned __int64 val);

Arguments

xcr An extended control register to be written. Currently, only the value 0 is allowed.

val The value to be written to the specified extended control register.

Description This intrinsic writes the given value to the specified extended control register. Currently, the only control register defined is XCR0, the XFEATURE_ENABLED_MASK register. The following constant is defined in the immintrin.h file to refer to this register:

#define _XCR_XFEATURE_ENABLED_MASK 0

This intrinsic maps to XSETBV instruction.

Intrinsics for Saving and Restoring the Extended Processor States

1. Intrinsics that map to FXSAVE[64] and FXRSTOR[64] instructions

This group of intrinsics includes four intrinsics to save and restore the current state of the x87 FPU, MMX, XMM, and MXCSR registers. These intrinsics accept a memory reference to a 16-byte-aligned, 512-byte memory area. The layout of the memory area is shown in Table 1. Table 1 - FXSAVE save area layout.


These intrinsics are declared in the immintrin.h file.

2. Intrinsics that map to XSAVE[64], XSAVEOPT[64], and XRSTOR[64] instructions

This group of intrinsics includes six intrinsics to fully or partially save and restore the current state of the x87 FPU, MMX, XMM, YMM, and MXCSR registers. These intrinsics accept a memory reference to a 64-byte-aligned memory area. The layout of the register fields for the first 512 bytes is the same as the FXSAVE save area layout. The intrinsics saving the states do not write to bytes 464:511. The save area layout is shown in Table 2a and Table 2b below. The second operand of the intrinsics is a save/restore mask specifying the extended states to be saved or restored. The value of the mask is ANDed with XFEATURE_ENABLED_MASK (XCR0), so a particular extended state is saved or restored only if the corresponding bit is set to 1 in both the save/restore mask and XFEATURE_ENABLED_MASK. Table 2a - XSAVE save area layout (first 512 bytes)


Table 2b - XSAVE save area layout for YMM registers


These intrinsics are declared in the immintrin.h file.

_fxsave() Saves the states of x87 FPU, MMX, XMM, and MXCSR registers to memory.

Syntax extern void _fxsave(void *mem);

Arguments

mem A memory reference to the FXSAVE area. The 512-byte memory area addressed by the reference must be 16-byte aligned.

Description Saves the states of x87 FPU, MMX, XMM, and MXCSR registers to memory. This intrinsic maps to FXSAVE instruction.

_fxsave64() Saves the states of x87 FPU, MMX, XMM, and MXCSR registers to memory.

Syntax extern void _fxsave64(void *mem);


Arguments mem A memory reference to the FXSAVE area. The 512-byte memory area addressed by the reference must be 16-byte aligned.

Description Saves the states of x87 FPU, MMX, XMM, and MXCSR registers to memory. This intrinsic maps to FXSAVE64 instruction.

_fxrstor() Restores the states of x87 FPU, MMX, XMM, and MXCSR registers from memory.

Syntax extern void _fxrstor(void *mem);

Arguments mem A memory reference to the FXSAVE area. The 512-byte memory area addressed by the reference must be 16-byte aligned.

Description Restores the states of x87 FPU, MMX, XMM, and MXCSR registers from memory. This intrinsic maps to FXRSTOR instruction.

_fxrstor64() Restores the states of x87 FPU, MMX, XMM, and MXCSR registers from memory.

Syntax extern void _fxrstor64(void *mem);

Arguments mem A memory reference to the FXSAVE area. The 512-byte memory area addressed by the reference must be 16-byte aligned.


Description Restores the states of x87 FPU, MMX, XMM, and MXCSR registers from memory. This intrinsic maps to FXRSTOR64 instruction.

_xsave() Saves the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers to memory.

Syntax extern void _xsave(void *mem, unsigned __int64 save_mask);

Arguments

mem A memory reference to the XSAVE area. The memory area addressed by the reference must be 64-byte aligned.

save_mask A bit mask specifying the extended states to be saved.

Description Saves the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers to memory. This intrinsic maps to XSAVE instruction.

_xsave64() Saves the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers to memory.

Syntax extern void _xsave64(void *mem, unsigned __int64 save_mask);

Arguments

mem A memory reference to the XSAVE area. The memory area addressed by the reference must be 64-byte aligned.

save_mask A bit mask specifying the extended states to be saved.

Description Saves the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers to memory. This intrinsic maps to XSAVE64 instruction.


_xsaveopt() Saves the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers to memory, optimizing the save operation if possible.

Syntax extern void _xsaveopt(void *mem, unsigned __int64 save_mask);

Arguments mem A memory reference to the XSAVE area. The memory area addressed by the reference must be 64-byte aligned. save_mask A bit mask specifying the extended states to be saved.

Description Saves the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers to memory, optimizing the save operation if possible. This intrinsic maps to XSAVEOPT instruction.

_xsaveopt64() Saves the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers to memory, optimizing the save operation if possible.

Syntax extern void _xsaveopt64(void *mem, unsigned __int64 save_mask);

Arguments mem A memory reference to the XSAVE area. The memory area addressed by the reference must be 64-byte aligned. save_mask A bit mask specifying the extended states to be saved.

Description Saves the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers to memory, optimizing the save operation if possible. This intrinsic maps to XSAVEOPT64 instruction.


_xrstor() Restores the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers from memory.

Syntax extern void _xrstor(void *mem, unsigned __int64 rstor_mask);

Arguments

mem A memory reference to the XSAVE area. The memory area addressed by the reference must be 64-byte aligned.

rstor_mask A bit mask specifying the extended states to be restored.

Description Restores the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers from memory. This intrinsic maps to XRSTOR instruction.

_xrstor64() Restores the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers from memory.

Syntax extern void _xrstor64(void *mem, unsigned __int64 rstor_mask);

Arguments

mem A memory reference to the XSAVE area. The memory area addressed by the reference must be 64-byte aligned.

rstor_mask A bit mask specifying the extended states to be restored.

Description Restores the states of x87 FPU, MMX, XMM, YMM, and MXCSR registers from memory. This intrinsic maps to XRSTOR64 instruction.


Overview: Intrinsics to Convert Half Float Types

The half-float, or 16-bit float, is a popular type in some application domains. The half-float type is regarded as a storage type: although data is often stored as half-floats, computation is never done on values of this type. Usually the values are converted to regular 32-bit floats before any computation. Support for the half-float type is restricted to conversions to and from 32-bit floats. The main benefits of using the half-float type are:

• reduced storage requirements
• less consumption of memory bandwidth and cache
• accuracy and precision adequate for many applications

Half Float Intrinsics

The half-float intrinsics are provided to convert half-float values to 32-bit floats for computation purposes and, conversely, 32-bit float values to half-float values for data storage purposes. The intrinsics are translated into library calls that do the actual conversions. The half-float intrinsics are supported on IA-32 and Intel® 64 architectures running on Windows*, Linux*, and VxWorks* operating systems. The minimum processor requirement is an Intel® Pentium® 4 processor and an operating system supporting Streaming SIMD Extensions 2 (SSE2) instructions.

Role of the Immediate Byte in Half Float Intrinsic Operations

For all half-float intrinsics, an immediate byte controls the rounding mode, flush to zero, and other override values. The format of the imm8 byte is shown in the diagram below. The imm8 value is used for special MXCSR overrides.


In the diagram,

• MBZ: the Most significant Bit must be Zero; used for error checking
• MS1 = 1: use MXCSR RC; otherwise use imm8.RC
• SAE = 1: all exceptions are suppressed
• MS2 = 1: use MXCSR FTZ/DAZ control; otherwise use imm8.FTZ/DAZ

The compiler passes the bits to the library function with error checking: the most significant bit must be zero.

Intrinsics for Converting Half Floats

There are four intrinsics for converting half-floats to 32-bit floats and 32-bit floats to half-floats. The prototypes for these half-float conversion intrinsics are in the emmintrin.h file.

float _cvtsh_ss(unsigned short x, int imm);

This intrinsic takes a half-float value, x, and converts it to a 32-bit float value, which is returned.

unsigned short _cvtss_sh(float x, int imm);

This intrinsic takes a 32-bit float value, x, and converts it to a half-float value, which is returned.

__m128 _mm_cvtph_ps(__m128i x, int imm);


This intrinsic takes four packed half-float values and converts them to four 32-bit float values, which are returned. The upper 64 bits of x are ignored; the lower 64 bits are taken as four 16-bit float values for conversion.

__m128i _mm_cvtps_ph(__m128 x, int imm);

This intrinsic takes four packed 32-bit float values and converts them to four half-float values, which are returned. The upper 64 bits of the returned result are all zeros; the lower 64 bits contain the four packed 16-bit float values.

See Also • Overview: Intrinsics to convert half-float types



Overview: Intrinsics for Short Vector Math Library Functions

The Intel® C++ Compiler provides short vector math library (SVML) intrinsics to compute vector math functions. These intrinsics are available for the IA-32 and Intel® 64 architectures running on Windows*, Linux*, and VxWorks* operating systems.

The SVML intrinsics are vector variants of corresponding scalar math operations using the __m128, __m128d, __m256, __m256d, and __m256i data types. They take packed vector arguments, perform the operation on each element of the packed vector argument, and return a packed vector result. For example, the argument to the _mm_sin_ps intrinsic is a packed 128-bit vector of four single-precision floating point numbers. The intrinsic computes the sine of each of these four numbers and returns the four results in a packed 128-bit vector.

Using SVML intrinsics is faster than repeatedly calling the scalar math functions. However, the intrinsics differ from the scalar functions in accuracy. The SVML intrinsics do not have any corresponding instructions. The prototypes for the SVML intrinsics are available in the ia32intrin.h file.

Many routines in the SVML library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Intrinsics for Error Function Operations

Many routines in the SVML library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_erf_pd, _mm256_erf_pd Calculates error function. Vector variant of erf(x) function for a 128-bit vector or 256-bit vector argument of float64 values.

Syntax extern __m128d _mm_erf_pd(__m128d v1); extern __m256d _mm256_erf_pd(__m256d v1);


Arguments

v1 float64 vector used for the operation

Description Calculates the error function of the v1 elements, which is defined as: erf(x) = 2/sqrt(pi) * integral from 0 to x of exp(-t*t) dt

Returns 128-bit or 256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_erf_ps, _mm256_erf_ps Calculates error function. Vector variant of erf(x) function for a 128-bit vector or 256-bit vector argument of float32 values.

Syntax extern __m128 _mm_erf_ps(__m128 v1); extern __m256 _mm256_erf_ps(__m256 v1);

Arguments

v1 vector with float32 values

Description Calculates the error function of the v1 elements, which is defined as: erf(x) = 2/sqrt(pi) * integral from 0 to x of exp(-t*t) dt

Returns 128-bit vector or 256-bit vector with the result of the operation.


NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_erfc_pd, _mm256_erfc_pd Calculates complementary error function. Vector variant of erfc(x) function for a 128-bit or 256-bit vector argument of float64 values.

Syntax extern __m128d _mm_erfc_pd(__m128d v1); extern __m256d _mm256_erfc_pd(__m256d v1);

Arguments v1 vector with float64 values

Description Calculates the complementary error function of vector v1 elements, which is 1.0 - erf(x)

Returns 128-bit vector or 256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_erfc_ps, _mm256_erfc_ps Calculates complementary error function. Vector variant of erfc(x) function for a 128-bit vector or 256-bit vector argument of float32 values.

Syntax extern __m128 _mm_erfc_ps(__m128 v1);


extern __m256 _mm256_erfc_ps(__m256 v1);

Arguments

v1 vector with float32 values

Description Calculates the complementary error function of vector v1 elements, which is 1.0 - erf(x)

Returns 128-bit vector or 256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Intrinsics for Exponential Operations

Many routines in the SVML library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_exp2_pd, _mm256_exp2_pd Calculates exponential value of 2. Vector variant of exp2(x) function for a 128-bit vector or 256-bit vector argument of float64 values.

Syntax extern __m128d _mm_exp2_pd(__m128d v1); extern __m256d _mm256_exp2_pd(__m256d v1);

Arguments

v1 vector with float64 values

Description Calculates the exponential value of 2 raised to the power of vector v1 elements.


Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_exp2_ps, _mm256_exp2_ps Calculates exponential value of 2. Vector variant of exp2(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_exp2_ps(__m128 v1); extern __m256 _mm256_exp2_ps(__m256 v1);

Arguments v1 vector with float32 values

Description Calculates the exponential value of 2 raised to the power of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_exp_pd, _mm256_exp_pd Calculates exponential value of e (base of natural logarithms). Vector variant of exp(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_exp_pd(__m128d v1); extern __m256d _mm256_exp_pd(__m256d v1);

Arguments

v1 vector with float64 values

Description Calculates the exponential value of e (base of natural logarithms) raised to the power of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_exp_ps, _mm256_exp_ps Calculates exponential value of e (base of natural logarithms). Vector variant of exp(x) function for a 128-bit /256-bit vector argument of float32 values.

Syntax extern __m128 _mm_exp_ps(__m128 v1); extern __m256 _mm256_exp_ps(__m256 v1);


Arguments v1 vector with float32 values

Description Calculates the exponential value of e (base of natural logarithms) raised to the power of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_cexp_ps, _mm256_cexp_ps Calculates complex exponential value of e (base of natural logarithms). Vector variant of exp(x) function for a 128-bit/256-bit vector argument of _Complex float32 values.

Syntax extern __m128 _mm_cexp_ps(__m128 v1); extern __m256 _mm256_cexp_ps(__m256 v1);

Arguments v1 vector with _Complex float32 values

Description Calculates the complex exponential value of e (base of natural logarithms) raised to the power of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.


NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_pow_pd, _mm256_pow_pd Calculates exponential value of one argument raised to the other argument. Vector variant of pow(x, y) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_pow_pd(__m128d v1, __m128d v2); extern __m256d _mm256_pow_pd(__m256d v1, __m256d v2);

Arguments

v1 vector with float64 values v2 vector with float64 values

Description Calculates the exponential value of each vector v1 element raised to the power of the corresponding vector v2 element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_pow_ps, _mm256_pow_ps Calculates exponential value of one argument raised to the other argument. Vector variant of pow(x, y) function for a 128-bit /256-bit vector argument of float32 values.

Syntax extern __m128 _mm_pow_ps(__m128 v1, __m128 v2); extern __m256 _mm256_pow_ps(__m256 v1, __m256 v2);

Arguments v1 vector with float32 values v2 vector with float32 values

Description Calculates the exponential value of each vector v1 element raised to the power of the corresponding vector v2 element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_cpow_ps, _mm256_cpow_ps Calculates complex exponential value of one argument raised to the other argument. Vector variant of cpow(x, y) function for a 128-bit /256-bit vector argument of float32 values.

Syntax extern __m128 _mm_cpow_ps(__m128 v1, __m128 v2); extern __m256 _mm256_cpow_ps(__m256 v1, __m256 v2);


Arguments

v1 vector with _Complex float32 values v2 vector with _Complex float32 values

Description Calculates the complex exponential value of each vector v1 element raised to the power of the corresponding vector v2 element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Intrinsics for Logarithmic Operations

Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_log2_pd, _mm256_log2_pd Calculates base-2 logarithm. Vector variant of log2(x) function for a 128-bit /256-bit vector argument of float64 values.

Syntax extern __m128d _mm_log2_pd(__m128d v1); extern __m256d _mm256_log2_pd(__m256d v1);

Arguments

v1 vector with float64 values

Description Calculates the base-2 logarithm of vector v1 elements.


Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_log2_ps, _mm256_log2_ps Calculates base-2 logarithm. Vector variant of log2(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_log2_ps(__m128 v1); extern __m256 _mm256_log2_ps(__m256 v1);

Arguments v1 vector with float32 values

Description Calculates the base-2 logarithm of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_log10_pd, _mm256_log10_pd Calculates base-10 logarithm. Vector variant of log10(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_log10_pd(__m128d v1); extern __m256d _mm256_log10_pd(__m256d v1);

Arguments

v1 vector with float64 values

Description Calculates the base-10 logarithm of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_log10_ps, _mm256_log10_ps Calculates base-10 logarithm. Vector variant of log10(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_log10_ps(__m128 v1); extern __m256 _mm256_log10_ps(__m256 v1);

Arguments

v1 vector with float32 values


Description Calculates the base-10 logarithm of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_log_pd, _mm256_log_pd Calculates natural logarithm. Vector variant of log(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_log_pd(__m128d v1); extern __m256d _mm256_log_pd(__m256d v1);

Arguments v1 vector with float64 values

Description Calculates the natural logarithm of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_log_ps, _mm256_log_ps Calculates natural logarithm. Vector variant of log(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_log_ps(__m128 v1); extern __m256 _mm256_log_ps(__m256 v1);

Arguments

v1 vector with float32 values

Description Calculates the natural logarithm of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_clog2_ps, _mm256_clog2_ps Calculates complex base-2 logarithm. Vector variant of clog2(x) function for a 128-bit/256-bit vector argument of _Complex float32 values.

Syntax extern __m128 _mm_clog2_ps(__m128 v1); extern __m256 _mm256_clog2_ps(__m256 v1);

Arguments

v1 vector with _Complex float32 values


Description Calculates the complex base-2 logarithm of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_clog10_ps, _mm256_clog10_ps Calculates complex base-10 logarithm. Vector variant of clog10(x) function for a 128-bit/256-bit vector argument of _Complex float32 values.

Syntax extern __m128 _mm_clog10_ps(__m128 v1); extern __m256 _mm256_clog10_ps(__m256 v1);

Arguments v1 vector with _Complex float32 values

Description Calculates the complex base-10 logarithm of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_clog_ps, _mm256_clog_ps Calculates complex natural logarithm. Vector variant of clog(x) function for a 128-bit/256-bit vector argument of _Complex float32 values.

Syntax extern __m128 _mm_clog_ps(__m128 v1); extern __m256 _mm256_clog_ps(__m256 v1);

Arguments

v1 vector with _Complex float32 values

Description Calculates the complex natural logarithm of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Intrinsics for Square Root and Cube Root Operations

Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_sqrt_pd, _mm256_sqrt_pd Calculates square root value. Vector variant of sqrt(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_sqrt_pd(__m128d v1); extern __m256d _mm256_sqrt_pd(__m256d v1);

Arguments v1 vector with float64 values

Description Calculates the square root of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_sqrt_ps, _mm256_sqrt_ps Calculates square root value. Vector variant of sqrt(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_sqrt_ps(__m128 v1); extern __m256 _mm256_sqrt_ps(__m256 v1);

Arguments v1 vector with float32 values

Description Calculates the square root of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_invsqrt_pd, _mm256_invsqrt_pd Calculates inverse square root value. Vector variant of invsqrt(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_invsqrt_pd(__m128d v1); extern __m256d _mm256_invsqrt_pd(__m256d v1);

Arguments

v1 vector with float64 values

Description Calculates the inverse square root of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_invsqrt_ps, _mm256_invsqrt_ps Calculates inverse square root value. Vector variant of invsqrt(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_invsqrt_ps(__m128 v1); extern __m256 _mm256_invsqrt_ps(__m256 v1);

Arguments

v1 vector with float32 values


Description Calculates the inverse square root of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_cbrt_pd, _mm256_cbrt_pd Calculates cube root value. Vector variant of cbrt(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_cbrt_pd(__m128d v1); extern __m256d _mm256_cbrt_pd(__m256d v1);

Arguments v1 vector with float64 values

Description Calculates the cube root of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_cbrt_ps, _mm256_cbrt_ps Calculates cube root value. Vector variant of cbrt(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_cbrt_ps(__m128 v1); extern __m256 _mm256_cbrt_ps(__m256 v1);

Arguments

v1 vector with float32 values

Description Calculates the cube root of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_invcbrt_pd, _mm256_invcbrt_pd Calculates inverse cube root value. Vector variant of invcbrt(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_invcbrt_pd(__m128d v1); extern __m256d _mm256_invcbrt_pd(__m256d v1);

Arguments

v1 vector with float64 values


Description Calculates the inverse cube root of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_invcbrt_ps, _mm256_invcbrt_ps Calculates inverse cube root value. Vector variant of invcbrt(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_invcbrt_ps(__m128 v1); extern __m256 _mm256_invcbrt_ps(__m256 v1);

Arguments v1 vector with float32 values

Description Calculates the inverse cube root of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_csqrt_ps, _mm256_csqrt_ps Calculates complex square root value. Vector variant of csqrt(x) function for a 128-bit/256-bit vector argument of _Complex float32 values.

Syntax extern __m128 _mm_csqrt_ps(__m128 v1); extern __m256 _mm256_csqrt_ps(__m256 v1);

Arguments

v1 vector with _Complex float32 values

Description Calculates the complex square root of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Intrinsics for Trigonometric Operations

Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_acos_pd, _mm256_acos_pd Calculates arc cosine value. Vector variant of acos(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_acos_pd(__m128d v1); extern __m256d _mm256_acos_pd(__m256d v1);

Arguments v1 vector with float64 values

Description Calculates the arc cosine of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_acos_ps, _mm256_acos_ps Calculates arc cosine value. Vector variant of acos(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_acos_ps(__m128 v1); extern __m256 _mm256_acos_ps(__m256 v1);

Arguments v1 vector with float32 values

Description Calculates the arc cosine of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_acosh_pd, _mm256_acosh_pd Calculates the inverse hyperbolic cosine value. Vector variant of acosh(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_acosh_pd(__m128d v1); extern __m256d _mm256_acosh_pd(__m256d v1);

Arguments

v1 vector with float64 values

Description Calculates the inverse hyperbolic cosine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_acosh_ps, _mm256_acosh_ps Calculates the inverse hyperbolic cosine value. Vector variant of acosh(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_acosh_ps(__m128 v1); extern __m256 _mm256_acosh_ps(__m256 v1);

Arguments

v1 vector with float32 values


Description Calculates the inverse hyperbolic cosine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_asin_pd, _mm256_asin_pd Calculates arc sine value. Vector variant of asin(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_asin_pd(__m128d v1); extern __m256d _mm256_asin_pd(__m256d v1);

Arguments v1 vector with float64 values

Description Calculates the arc sine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_asin_ps, _mm256_asin_ps Calculates arc sine value. Vector variant of asin(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_asin_ps(__m128 v1); extern __m256 _mm256_asin_ps(__m256 v1);

Arguments

v1 vector with float32 values

Description Calculates the arc sine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_asinh_pd, _mm256_asinh_pd Calculates the inverse hyperbolic sine value. Vector variant of asinh(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_asinh_pd(__m128d v1); extern __m256d _mm256_asinh_pd(__m256d v1);

Arguments

v1 vector with float64 values


Description Calculates the inverse hyperbolic sine of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_asinh_ps, _mm256_asinh_ps Calculates the inverse hyperbolic sine value. Vector variant of asinh(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_asinh_ps(__m128 v1); extern __m256 _mm256_asinh_ps(__m256 v1);

Arguments v1 vector with float32 values

Description Calculates the inverse hyperbolic sine of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_atan_pd, _mm256_atan_pd Calculates arc tangent value. Vector variant of atan(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_atan_pd(__m128d v1); extern __m256d _mm256_atan_pd(__m256d v1);

Arguments

v1 vector with float64 values

Description Calculates the arc tangent of vector v1 elements. In effect, the tangent of each resulting element value is the value of the corresponding v1 vector element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_atan_ps, _mm256_atan_ps Calculates arc tangent value. Vector variant of atan(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_atan_ps(__m128 v1); extern __m256 _mm256_atan_ps(__m256 v1);

Arguments

v1 vector with float32 values


Description Calculates the arc tangent of vector v1 elements. In effect, the tangent of each resulting element value is the value of the corresponding v1 vector element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_atan2_pd, _mm256_atan2_pd Calculates the arc tangent of float64 variables x and y. Vector variant of atan2(x, y) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_atan2_pd(__m128d v1, __m128d v2); extern __m256d _mm256_atan2_pd(__m256d v1, __m256d v2);

Arguments v1 vector with float64 values v2 vector with float64 values

Description Calculates the arc tangent of corresponding float64 elements of vectors v1 and v2. The following is an illustration of the atan2 operation: Res[0] = atan2(v1[0], v2[0]) Res[1] = atan2(v1[1], v2[1]) Res[2] = atan2(v1[2], v2[2]) ...

NOTE. This calculation is similar to calculating the arc tangent of y / x, except that the signs of both arguments are used to determine the quadrant of the result.


Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_atan2_ps, _mm256_atan2_ps Calculates the arc tangent of float32 variables x and y. Vector variant of atan2(x, y) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_atan2_ps(__m128 v1, __m128 v2); extern __m256 _mm256_atan2_ps(__m256 v1, __m256 v2);

Arguments

v1 vector with float32 values v2 vector with float32 values

Description Calculates the arc tangent of corresponding float32 elements of vectors v1 and v2. The following is an illustration of the atan2 operation: Res[0] = atan2(v1[0], v2[0]) Res[1] = atan2(v1[1], v2[1]) Res[2] = atan2(v1[2], v2[2]) ...

NOTE. This calculation is similar to calculating the arc tangent of y / x, except that the signs of both arguments are used to determine the quadrant of the result.

Returns 128-bit/256-bit vector with the result of the operation.


NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_atanh_pd, _mm256_atanh_pd Calculates inverse hyperbolic tangent value. Vector variant of atanh(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_atanh_pd(__m128d v1); extern __m256d _mm256_atanh_pd(__m256d v1);

Arguments v1 vector with float64 values

Description Calculates the inverse hyperbolic tangent of vector v1 elements. In effect, the hyperbolic tangent of each resulting element value is the value of the corresponding v1 vector element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_atanh_ps, _mm256_atanh_ps Calculates inverse hyperbolic tangent value. Vector variant of atanh(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_atanh_ps(__m128 v1); extern __m256 _mm256_atanh_ps(__m256 v1);


Arguments

v1 vector with float32 values

Description Calculates the inverse hyperbolic tangent of vector v1 elements. In effect, the hyperbolic tangent of each resulting element value is the value of the corresponding v1 vector element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_cos_pd, _mm256_cos_pd Calculates cosine value. Vector variant of cos(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_cos_pd(__m128d v1); extern __m256d _mm256_cos_pd(__m256d v1);

Arguments

v1 vector with float64 values

Description Calculates the cosine of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_cos_ps, _mm256_cos_ps Calculates cosine value. Vector variant of cos(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_cos_ps(__m128 v1); extern __m256 _mm256_cos_ps(__m256 v1);

Arguments v1 vector with float32 values

Description Calculates the cosine of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_cosh_pd, _mm256_cosh_pd Calculates the hyperbolic cosine value. Vector variant of cosh(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_cosh_pd(__m128d v1); extern __m256d _mm256_cosh_pd(__m256d v1);

Arguments v1 vector with float64 values


Description Calculates the hyperbolic cosine of vector v1 elements, which is defined mathematically as (exp(x) + exp(-x)) / 2.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_cosh_ps, _mm256_cosh_ps Calculates the hyperbolic cosine value. Vector variant of cosh(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_cosh_ps(__m128 v1); extern __m256 _mm256_cosh_ps(__m256 v1);

Arguments

v1 vector with float32 values

Description Calculates the hyperbolic cosine of vector v1 elements, which is defined mathematically as (exp(x) + exp(-x)) / 2.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_sin_pd, _mm256_sin_pd Calculates sine value. Vector variant of sin(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_sin_pd(__m128d v1); extern __m256d _mm256_sin_pd(__m256d v1);

Arguments v1 vector with float64 values

Description Calculates the sine of vector v1 elements.

Returns 128-bit/256-bit vector with the sine values.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_sin_ps, _mm256_sin_ps Calculates sine value. Vector variant of sin(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_sin_ps(__m128 v1); extern __m256 _mm256_sin_ps(__m256 v1);

Arguments v1 vector with float32 values


Description Calculates the sine of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_sinh_pd, _mm256_sinh_pd Calculates the hyperbolic sine value. Vector variant of sinh(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_sinh_pd(__m128d v1); extern __m256d _mm256_sinh_pd(__m256d v1);

Arguments

v1 vector with float64 values

Description Calculates the hyperbolic sine of vector v1 elements, which is defined mathematically as (exp(x) - exp(-x)) / 2.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_sinh_ps, _mm256_sinh_ps Calculates the hyperbolic sine value. Vector variant of sinh(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_sinh_ps(__m128 v1); extern __m256 _mm256_sinh_ps(__m256 v1);

Arguments v1 vector with float32 values

Description Calculates the hyperbolic sine of vector v1 elements, which is defined mathematically as (exp(x) - exp(-x)) / 2.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_tan_pd, _mm256_tan_pd Calculates tangent value. Vector variant of tan(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_tan_pd(__m128d v1); extern __m256d _mm256_tan_pd(__m256d v1);

Arguments v1 vector with float64 values


Description Calculates the tangent of vector v1 elements. In effect, the arc tangent of each resulting element value is the value of the corresponding v1 vector element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_tan_ps, _mm256_tan_ps Calculates tangent value. Vector variant of tan(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_tan_ps(__m128 v1); extern __m256 _mm256_tan_ps(__m256 v1);

Arguments

v1 vector with float32 values

Description Calculates the tangent of vector v1 elements. In effect, the arc tangent of each resulting element value is the value of the corresponding v1 vector element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_tanh_pd, _mm256_tanh_pd Calculates hyperbolic tangent value. Vector variant of tanh(x) function for a 128-bit/256-bit vector argument of float64 values.

Syntax extern __m128d _mm_tanh_pd(__m128d v1); extern __m256d _mm256_tanh_pd(__m256d v1);

Arguments v1 vector with float64 values

Description Calculates the hyperbolic tangent of vector v1 elements. In effect, the inverse hyperbolic tangent of each resulting element value is the value of the corresponding v1 vector element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_tanh_ps, _mm256_tanh_ps Calculates hyperbolic tangent value. Vector variant of tanh(x) function for a 128-bit/256-bit vector argument of float32 values.

Syntax extern __m128 _mm_tanh_ps(__m128 v1); extern __m256 _mm256_tanh_ps(__m256 v1);

Arguments v1 vector with float32 values


Description Calculates the hyperbolic tangent of vector v1 elements. In effect, the inverse hyperbolic tangent of each resulting element value is the value of the corresponding v1 vector element.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_sincos_pd, _mm256_sincos_pd Calculates the sine and cosine values. Vector variant of sincos(x, &sin_x, &cos_x) function for a 128-bit/256-bit vector with float64 values.

Syntax extern __m128d _mm_sincos_pd(__m128d *p_cos, __m128d v1); extern __m256d _mm256_sincos_pd(__m256d *p_cos, __m256d v1);

Arguments

*p_cos points to the vector where the cosine results are stored (the pointer must be suitably aligned for the vector type) v1 vector with float64 values

Description Calculates the sine and cosine of vector v1 elements. Both sets of values cannot be returned in a single result vector, so the intrinsic stores the cosine values at the location pointed to by p_cos and returns the sine values in the result vector.

Returns 128-bit/256-bit vector with the sine results.


NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_sincos_ps, _mm256_sincos_ps Calculates the sine and cosine values. Vector variant of sincos(x, &sin_x, &cos_x) function for a 128-bit/256-bit vector with float32 values.

Syntax extern __m128 _mm_sincos_ps(__m128 *p_cos, __m128 v1); extern __m256 _mm256_sincos_ps(__m256 *p_cos, __m256 v1);

Arguments

*p_cos points to the vector where the cosine results are stored (the pointer must be suitably aligned for the vector type) v1 vector with float32 values

Description Calculates the sine and cosine of vector v1 elements. Both sets of values cannot be returned in a single result vector, so the intrinsic stores the cosine values at the location pointed to by p_cos and returns the sine values in the result vector.

Returns 128-bit/256-bit vector with the sine results.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Intrinsics for Trigonometric Operations with Complex Numbers

Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_cacos_ps, _mm256_cacos_ps Calculates complex arc cosine value. Vector variant of cacos(x) function for a 128-bit/256-bit vector with _Complex float32 values.

Syntax extern __m128 _mm_cacos_ps(__m128 v1); extern __m256 _mm256_cacos_ps(__m256 v1);

Arguments

v1 vector with _Complex float32 values

Description Calculates complex arc cosine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_cacosh_ps, _mm256_cacosh_ps Calculates complex inverse hyperbolic cosine value. Vector variant of cacosh(x) function for a 128-bit/256-bit vector with _Complex float32 values.

Syntax extern __m128 _mm_cacosh_ps(__m128 v1); extern __m256 _mm256_cacosh_ps(__m256 v1);

Arguments

v1 vector with _Complex float32 values


Description Calculates complex inverse hyperbolic cosine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_casin_ps, _mm256_casin_ps Calculates complex arc sine value. Vector variant of casin(x) function for a 128-bit/256-bit vector argument of _Complex float32 values.

Syntax extern __m128 _mm_casin_ps(__m128 v1); extern __m256 _mm256_casin_ps(__m256 v1);

Arguments v1 vector with _Complex float values

Description Calculates complex arc sine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_casinh_ps, _mm256_casinh_ps Calculates complex inverse hyperbolic sine value. Vector variant of casinh(x) function for a 128-bit/256-bit vector with _Complex float values.

Syntax extern __m128 _mm_casinh_ps(__m128 v1); extern __m256 _mm256_casinh_ps(__m256 v1);

Arguments

v1 vector with _Complex float values

Description Calculates complex inverse hyperbolic sine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_catan_ps, _mm256_catan_ps Calculates complex arc tangent value. Vector variant of catan(x) function for a 128-bit/256-bit vector with _Complex float values.

Syntax extern __m128 _mm_catan_ps(__m128 v1); extern __m256 _mm256_catan_ps(__m256 v1);

Arguments

v1 vector with _Complex float values

Description Calculates complex arc tangent of vector v1 values.


Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_catanh_ps, _mm256_catanh_ps Calculates complex inverse hyperbolic tangent value. Vector variant of catanh(x) function for a 128-bit/256-bit vector with _Complex float values.

Syntax extern __m128 _mm_catanh_ps(__m128 v1); extern __m256 _mm256_catanh_ps(__m256 v1);

Arguments v1 vector with _Complex float values

Description Calculates complex inverse hyperbolic tangent of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_csin_ps, _mm256_csin_ps Calculates complex sine value. Vector variant of csin(x) function for a 128-bit/256-bit vector with _Complex float values.

Syntax extern __m128 _mm_csin_ps(__m128 v1); extern __m256 _mm256_csin_ps(__m256 v1);

Arguments

v1 vector with _Complex float values

Description Calculates complex sine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_csinh_ps, _mm256_csinh_ps Calculates complex hyperbolic sine value. Vector variant of csinh(x) function for a 128-bit/256-bit vector with _Complex float values.

Syntax extern __m128 _mm_csinh_ps(__m128 v1); extern __m256 _mm256_csinh_ps(__m256 v1);

Arguments

v1 vector with _Complex float values

Description Calculates complex hyperbolic sine of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


_mm_ccos_ps, _mm256_ccos_ps Calculates complex cosine value. Vector variant of ccos(x) function for a 128-bit/256-bit vector argument of _Complex float32 values.

Syntax extern __m128 _mm_ccos_ps(__m128 v1); extern __m256 _mm256_ccos_ps(__m256 v1);

Arguments v1 vector with _Complex float32 values

Description Calculates the complex cosine of vector v1 elements.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_ccosh_ps, _mm256_ccosh_ps Calculates complex hyperbolic cosine value. Vector variant of ccosh(x) function for a 128-bit/256-bit vector with _Complex float values.

Syntax
extern __m128 _mm_ccosh_ps(__m128 v1);
extern __m256 _mm256_ccosh_ps(__m256 v1);

Arguments v1 vector with _Complex float values

Description Calculates complex hyperbolic cosine of vector v1 elements.


Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_ctan_ps, _mm256_ctan_ps Calculates complex tangent value. Vector variant of ctan(x) function for a 128-bit/256-bit vector with _Complex float values.

Syntax
extern __m128 _mm_ctan_ps(__m128 v1);
extern __m256 _mm256_ctan_ps(__m256 v1);

Arguments

v1 vector with _Complex float values

Description Calculates complex tangent of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

_mm_ctanh_ps, _mm256_ctanh_ps Calculates complex hyperbolic tangent value. Vector variant of ctanh(x) function for a 128-bit/256-bit vector with _Complex float values.

Syntax
extern __m128 _mm_ctanh_ps(__m128 v1);
extern __m256 _mm256_ctanh_ps(__m256 v1);

Arguments v1 vector with _Complex float values

Description Calculates complex hyperbolic tangent of vector v1 values.

Returns 128-bit/256-bit vector with the result of the operation.

NOTE. Many routines in the svml library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.


Part VII: Compiler Reference

Topics:

• Intel C++ Compiler Pragmas
• Intel Math Library
• Intel C++ Class Libraries
• Intel's C/C++ Language Extensions


Chapter 38: Intel C++ Compiler Pragmas

Overview: Intel® C++ Compiler Pragmas

Pragmas are directives that provide instructions to the compiler for use in specific cases. For example, you can use the novector pragma to specify that a loop should never be vectorized. The keyword #pragma is standard in the C++ language, but individual pragmas are machine-specific or operating system-specific, and vary by compiler.

Some pragmas provide the same functionality as compiler options. Pragmas override behavior specified by compiler options. Some pragmas are available for both Intel® and non-Intel microprocessors but they may perform additional optimizations for Intel® microprocessors than they perform for non-Intel microprocessors. Refer to the individual pragma page for a detailed description.

The Intel® C++ Compiler pragmas are categorized as follows:

• Intel-Specific Pragmas - pragmas developed or modified by Intel to work specifically with the Intel® C++ Compiler
• Intel Supported Pragmas - pragmas developed by external sources that are supported by the Intel® C++ Compiler for compatibility reasons

Using Pragmas
You enter pragmas into your C++ source code using the following syntax:
#pragma <pragma name>
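As a minimal illustration (the function and variable names below are hypothetical, not taken from this guide), a pragma is placed immediately before the statement or loop it affects; compilers that do not recognize a given pragma simply warn and ignore it:

```c
#include <stddef.h>

/* Hypothetical example: the novector pragma annotates the loop that
 * immediately follows it. A compiler without this pragma ignores it
 * (typically with an unknown-pragma warning) and still compiles the loop. */
void copy_ints(int *dst, const int *src, size_t n) {
#pragma novector
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```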

Individual Pragma Descriptions Each pragma description has the following details:

Section Description

Short Description Contains a brief description of what the pragma does

Syntax Contains the pragma syntax

Arguments Contains a list of the arguments with descriptions



Description Contains a detailed description of what the pragma does

Example Contains typical usage example/s

See Also Contains links or paths to other pragmas or related topics

Intel-Specific Pragmas

The Intel-specific C++ compiler pragmas described in the Intel-Specific Pragma reference are listed below. Some pragmas are available for both Intel® and non-Intel microprocessors but they may perform additional optimizations for Intel® microprocessors than they perform for non-Intel microprocessors. Click on the pragmas for a more detailed description.

Pragma Description

alloc_section allocates variable in specified section

distribute_point instructs the compiler to prefer loop distribution at the location indicated

inline instructs the compiler that the user prefers that the calls in question be inlined

ivdep instructs the compiler to ignore assumed vector dependencies

loop_count specifies the number of iterations for a for loop

memref_control provides a method for controlling load latency at the variable level

novector specifies that the loop should never be vectorized

optimize enables or disables optimizations for specific functions; provides some degree of compatibility with Microsoft's implementation of optimize pragma

optimization_level enables control of optimization for a specific function



prefetch/noprefetch asserts that the data prefetches are generated or not generated for some memory references

simd enforces vectorization of innermost loops

swp/noswp swp indicates a preference for loops to be software pipelined; noswp indicates that loops should not be software pipelined

unroll/nounroll tells the compiler how many times to unroll a loop (unroll), or prevents unrolling of the loop (nounroll)

unroll_and_jam/nounroll_and_jam instructs the compiler to partially unroll higher loops and jam the resulting loops back together. Specifying the nounroll_and_jam pragma prevents unrolling and jamming of loops.

unused describes variables that are unused (warnings not generated)

vector indicates to the compiler that the loop should be vectorized according to the arguments: always/aligned/unaligned/nontemporal/temporal

Intel-specific Pragma Reference

alloc_section Allocates variable in specified section. Controls section attribute specification for variables.

Syntax
#pragma alloc_section(var1,var2,..., "attribute-list")

Arguments

var variable that can be used to define a symbol in the section
"attribute-list" a comma-separated list of attributes; defined values are: 'short' and 'long'


Description
The alloc_section pragma places the listed variables, var1, var2, etc., in the specified section. This pragma controls section attribute specification for variables. The compiler decides whether the variable, as defined by var1, var2, etc., should go to a "data", "bss", or "rdata" section. The section name is enclosed in double quotation marks and must previously have been introduced into the program using #pragma section. The list of comma-separated variable names follows the section name after a separating comma. All listed variables must be defined before this pragma, in the same translation unit and in the same scope. The variables must have static storage; their linkage does not matter in C modules, but in C++ modules they must be declared with the extern "C" linkage specification.

Example
The following example illustrates how to use the pragma.

#pragma alloc_section(var1, "short")
int var1 = 20;
#pragma alloc_section(var2, "short")
extern int var2;

distribute_point Instructs the compiler to prefer loop distribution at the location indicated.

Syntax #pragma distribute_point

Arguments None

Description The distribute_point pragma is used to suggest to the compiler to split large loops into smaller ones; this is particularly useful in cases where optimizations like software-pipelining (SWP) or vectorization cannot take place due to excessive register usage.


Using distribute_point pragma for a loop distribution strategy enables software pipelining for the new, smaller loops in the IA-64 architecture. By splitting a loop into smaller segments, it is possible to get each smaller loop or at least one of the smaller loops to SWP or vectorize.

• When the pragma is placed inside a loop, the compiler distributes the loop at that point. All loop-carried dependencies are ignored.
• When inside the loop, pragmas cannot be placed within an if statement.
• When the pragma is placed outside the loop, the compiler distributes the loop based on an internal heuristic. The compiler determines where to distribute the loops and observes data dependency.
• If the pragmas are placed inside the loop, the compiler supports multiple instances of the pragma.

Example
Example 1: Using the distribute_point pragma outside the loop
The following example uses the distribute_point pragma outside the loop.

#define NUM 1024
void loop_distribution_pragma1(
        double a[NUM], double b[NUM], double c[NUM],
        double x[NUM], double y[NUM], double z[NUM]) {
  int i;
  // Before distribution or splitting the loop
  #pragma distribute_point
  for (i = 0; i < NUM; i++) {
    a[i] = a[i] + i;
    b[i] = b[i] + i;
    c[i] = c[i] + i;
    x[i] = x[i] + i;
    y[i] = y[i] + i;
    z[i] = z[i] + i;
  }
}


Example 2: Using the distribute_point pragma inside the loop
The following example uses the distribute_point pragma inside the loop.

#define NUM 1024
void loop_distribution_pragma2(
        double a[NUM], double b[NUM], double c[NUM],
        double x[NUM], double y[NUM], double z[NUM]) {
  int i;
  // After distribution or splitting the loop.
  for (i = 0; i < NUM; i++) {
    a[i] = a[i] + i;
    b[i] = b[i] + i;
    c[i] = c[i] + i;
    #pragma distribute_point
    x[i] = x[i] + i;
    y[i] = y[i] + i;
    z[i] = z[i] + i;
  }
}

Example 3: Using distribute_point pragma inside and outside the loop


The following example shows how to use the distribute_point pragma, first outside the loop and then inside the loop.

void dist1(int a[], int b[], int c[], int d[]) {
  #pragma distribute_point
  // Compiler will automatically decide where to
  // distribute. Data dependency is observed.
  for (int i = 1; i < 1000; i++) {
    b[i] = a[i] + 1;
    c[i] = a[i] + b[i];
    d[i] = c[i] + 1;
  }
}

void dist2(int a[], int b[], int c[], int d[]) {
  for (int i = 1; i < 1000; i++) {
    b[i] = a[i] + 1;
    #pragma distribute_point
    // Distribution will start here,
    // ignoring all loop-carried dependency.
    c[i] = a[i] + b[i];
    d[i] = c[i] + 1;
  }
}


inline, noinline, and forceinline Specify inlining of all calls in a statement.

Syntax
#pragma inline [recursive]
#pragma forceinline [recursive]
#pragma noinline

Arguments

recursive indicates that the pragma applies to all of the calls that are called by these calls, and so forth, recursively, down the call chain

Description
These are statement-specific inlining pragmas. Each can be placed before a C/C++ statement, and then applies to all of the calls within that statement and all calls within statements nested within it. The forceinline pragma indicates that the calls in question should be inlined whenever the compiler is capable of doing so. The inline pragma is a hint to the compiler that the user prefers that the calls in question be inlined, but expects the compiler not to inline them if its heuristics determine that the inlining would be overly aggressive and might slow down the compilation of the source code excessively, create too large an executable, or degrade performance. The noinline pragma indicates that the calls in question should not be inlined. These statement-specific pragmas take precedence over the corresponding function-specific pragmas.


Example
Using #pragma forceinline recursive. The #pragma forceinline recursive applies to the call 'sun(a,b)' as well as the call 'fun(a,b)'.

#include <stdio.h>
static void fun(float a[100][100], float b[100][100]) {
  int i, j;
  for (i = 0; i < 100; i++) {
    for (j = 0; j < 100; j++) {
      a[i][j] = 2 * i;
      b[i][j] = 4 * j;
    }
  }
}
static void sun(float a[100][100], float b[100][100]) {
  int i, j;
  for (i = 0; i < 100; i++) {
    for (j = 0; j < 100; j++) {
      a[i][j] = 2 * i;
      b[i][j] = 4 * j;
    }
    fun(a, b);
  }
}
static float a[100][100];
static float b[100][100];
extern int main() {


  int i, j;
  for (i = 0; i < 100; i++) {
    for (j = 0; j < 100; j++) {
      a[i][j] = i + j;
      b[i][j] = i - j;
    }
  }
  for (i = 0; i < 99; i++) {
    fun(a, b);
    #pragma forceinline recursive
    for (j = 0; j < 99; j++) {
      sun(a, b);
    }
  }
  fprintf(stderr, "%f %f\n", a[99][99], b[99][99]);
}


ivdep Instructs the compiler to ignore assumed vector dependencies.

Syntax #pragma ivdep

Arguments None


Description The ivdep pragma instructs the compiler to ignore assumed vector dependencies. To ensure correct code, the compiler treats an assumed dependence as a proven dependence, which prevents vectorization. This pragma overrides that decision. Use this pragma only when you know that the assumed loop dependencies are safe to ignore.

NOTE. The proven dependencies that prevent vectorization are not ignored, only assumed dependencies are ignored.

Example
Example 1
The loop in this example will not vectorize without the ivdep pragma, since the value of k is not known; vectorization would be illegal if k < 0.

void ignore_vec_dep(int *a, int k, int c, int m) {
  #pragma ivdep
  for (int i = 0; i < m; i++)
    a[i] = a[i + k] * c;
}

The pragma binds only the for loop contained in the current function. This includes a for loop contained in a sub-function called by the current function.
Example 2
The following loop requires the parallel option in addition to the ivdep pragma to indicate there are no loop-carried dependencies:
#pragma ivdep
for (i=1; i


Example 3 The following loop requires the parallel option in addition to the ivdep pragma to ensure there is no loop-carried dependency for the store into a(). #pragma ivdep for (j=0; j
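Because Examples 2 and 3 above are incomplete, the following self-contained sketch (the array names, the index-array contents, and the loop bound are assumptions introduced for illustration, not taken from the original examples) shows the same idea: the programmer knows the index array holds distinct values, so the assumed dependence on the scatter store is safe to ignore:

```c
#include <assert.h>

/* Sketch with assumed names: idx is known by the programmer to contain
 * distinct indices, so the scatter carries no real loop dependence;
 * #pragma ivdep tells the vectorizer to ignore the assumed one.
 * Compilers without the pragma warn and ignore it. */
void scaled_scatter(double *a, const int *idx, double s, int n) {
#pragma ivdep
    for (int i = 0; i < n; i++)
        a[idx[i]] = a[idx[i]] * s + 1.0;
}
```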

See Also • vector • novector In addition to the ivdep pragma, the vector pragma can be used to override the efficiency heuristics of the vectorizer.

loop_count Specifies the iterations for the for loop.

Syntax #pragma loop_count(n) #pragma loop_count=n

or #pragma loop_count(n1[, n2]...) #pragma loop_count=n1[, n2]...

or #pragma loop_count min(n),max(n),avg(n) #pragma loop_count min=n, max=n, avg=n

Arguments

(n) or =n
Non-negative integer value. The compiler will attempt to iterate the next loop the number of times specified in n; however, the number of iterations is not guaranteed.

(n1[, n2]...) or =n1[, n2]...
Non-negative integer values. The compiler will attempt to iterate the next loop the number of times specified by n1 or n2, or some other unspecified number of times. This behavior allows the compiler some flexibility in attempting to unroll the loop. The number of iterations is not guaranteed.

min(n), max(n), avg(n) or min=n, max=n, avg=n
Non-negative integer values. Specify one or more in any order without duplication. The compiler ensures the next loop iterates for the specified maximum, minimum, or average number of times. The specified number of iterations is guaranteed for min and max.

Description The loop_count pragma affects heuristics in vectorization, loop-transformations, and software pipelining (IA-64 architecture). The pragma specifies the minimum, maximum, or average number of iterations for a for loop. In addition, a list of commonly occurring values can be specified to help the compiler generate multiple versions and perform complete unrolling. You can specify more than one pragma for a single loop; however, do not duplicate the pragma.

Example
Example 1: Using #pragma loop_count (n)
The following example illustrates how to use the pragma to iterate through the loop and enable software pipelining on IA-64 architecture.

void loop_count(int a[], int b[]) {
  #pragma loop_count (10000)
  for (int i = 0; i < 1000; i++)
    a[i] = b[i] + 1.2;
}

Example 2: Using #pragma loop_count min(n), max(n), avg(n)


The following example illustrates how to use the pragma to iterate through the loop a minimum of three, a maximum of ten, and an average of five times.

#include <stdio.h>
int i;
int main() {
  #pragma loop_count min(3), max(10), avg(5)
  for (i = 1; i <= 15; i++) {
    printf("i=%d\n", i);
  }
}

memref_control Provides a method to control load latency and temporal locality at the variable level.

Syntax
#pragma memref_control [name1[:locality[:latency]], [name2...]]

Arguments

name1, name2 Specifies the name of an array or pointer. You must specify at least one name; you can also specify names with associated locality and latency values.
locality An optional integer value that indicates the desired cache level to store data for future access. This determines the load/store hint (or prefetch hint) to be used for this reference. The value can be one of the following:
• l1 = 0
• l2 = 1
• l3 = 2
• mem = 3

To use this argument, you must also specify name.

latency An optional integer value that indicates the load latency (or the latency that has to be overlapped if a prefetch is issued for this address). The value can be one of the following:
• l1_latency = 0
• l2_latency = 1
• l3_latency = 2
• mem_latency = 3

To use this argument, you must also specify name and locality.

Description The memref_control pragma is supported on Itanium® processors only. This pragma provides a method for controlling load latency and temporal locality at the variable level. The memref_control pragma allows you to specify locality and latency at the array level. For example, using this pragma allows you to control the following:

• The location (cache level) to store data for future access. • The most appropriate latency value to be used for a load, or the latency that has to be overlapped if a prefetch is issued for this reference.

When you specify latency and data locality information at a high level for a particular data access, the compiler decides how best to use this information. If the compiler can prefetch profitably for the reference, then it issues a prefetch with a distance that covers the specified latency and then schedules the corresponding load with a smaller latency. It also uses the hints on the prefetch and load appropriately to keep the data in the specified cache level. If the compiler cannot compute the address in advance, or decides that the overheads for prefetching are too high, it uses the specified latency to separate the load and its use (in a pipelined loop or a Global Code Scheduler loop). The hint on the load/store corresponds to the cache level passed with the locality argument. You can use this pragma with prefetch and noprefetch to further tune the hints and prefetch strategies. When using memref_control with noprefetch, keep the following guidelines in mind:

• Specifying noprefetch along with memref_control causes the compiler not to issue prefetches; instead, the latency values specified in memref_control are used to schedule the load.


• There are no ordering requirements for using the two pragmas together. Specify the two pragmas in either order, as long as both appear consecutively just before the loop to which they apply. Issuing a prefetch with one hint and loading the data later using a different hint can provide greater control over the hints used for specific architectures.
• memref_control is handled differently from prefetch or noprefetch. Even if the load cannot be prefetched, the reference can still be loaded using a non-default load latency passed to the latency argument.

Example
Example 1: Using #pragma memref_control when prefetching is not possible
The following example illustrates a case where the address is not known in advance, so prefetching is not possible. The compiler, in this case, schedules the loads of the tab array with an L3 load latency of 15 cycles (inside a software-pipelined loop or GCS loop).

#pragma memref_control tab : l2 : l3_latency
for (i=0; i
  dum += tab[x&mask]; x>>=6;
  dum += tab[x&mask]; x>>=6;
  dum += tab[x&mask]; x>>=6;
}
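A complete, runnable version of the truncated loop above might look as follows. The loop bound, table size, and mask value are assumptions introduced for illustration; the pragma itself is honored only on Itanium® targets and is otherwise ignored:

```c
/* Sketch with assumed bounds: hint that tab should stay in L2 and that
 * its loads be scheduled with L3 latency. The table index depends on x,
 * so the address is not known far enough in advance to prefetch.
 * Non-Itanium compilers warn about and ignore the pragma. */
long lookup_sum(const long *tab, unsigned long x, int n) {
    const unsigned long mask = 0x3f;  /* assumed 64-entry table */
    long dum = 0;
#pragma memref_control tab : l2 : l3_latency
    for (int i = 0; i < n; i++) {
        dum += tab[x & mask]; x >>= 6;
        dum += tab[x & mask]; x >>= 6;
        dum += tab[x & mask]; x >>= 6;
    }
    return dum;
}
```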

Example 2: Using #pragma memref_control with prefetch and noprefetch pragmas [sparse matrix]


The following example illustrates one way of using memref_control, prefetch, and noprefetch together.

if( size <= 1000 ) {
  #pragma noprefetch cp, vp
  #pragma memref_control x:l2:l3_latency
  #pragma noprefetch yp, bp, rp
  #pragma noprefetch xp
  for (iii=0; iii
    if( ip < rag2 ) {
      sum -= vp[ip]*x[cp[ip]];
      ip++;
    } else {
      xp[i] = sum*yp[i];
      i++;
      sum = bp[i];
      rag2 = rp[i+1];
    }
  }
  xp[i] = sum*yp[i];
} else {
  #pragma prefetch cp, vp
  #pragma memref_control x:l2:mem_latency
  #pragma prefetch yp, bp, rp
  #pragma noprefetch xp
  for (iii=0; iii
    if( ip < rag2 ) {
      sum -= vp[ip]*x[cp[ip]];
      ip++;
    } else {
      xp[i] = sum*yp[i];
      i++;
      sum = bp[i];
      rag2 = rp[i+1];
    }
  }
  xp[i] = sum*yp[i];
}

novector Specifies that the loop should never be vectorized.

Syntax #pragma novector

Arguments None

Description The novector pragma specifies that a particular loop should never be vectorized, even if it is legal to do so. When avoiding vectorization of a loop is desirable (when vectorization results in a performance regression rather than improvement), the novector pragma can be used in the source text to disable vectorization of a loop. This behavior is in contrast to the vector always pragma.

Example Example: Using the novector pragma


When you know the trip count (ub - lb) is too low to make vectorization worthwhile, you can use novector to tell the compiler not to vectorize, even if the loop is considered vectorizable.

void foo(int lb, int ub) {
  #pragma novector
  for (j = lb; j < ub; j++) {
    a[j] = a[j] + b[j];
  }
}

See Also
• vector pragma

optimize
Enables or disables optimizations for specific functions.

Syntax #pragma optimize("", on|off)

Arguments
The compiler ignores first argument values. Valid second arguments for optimize are given below:
off disables optimization
on enables optimization

Description The optimize pragma is used to enable or disable optimizations for specific functions. Specifying #pragma optimize("", off) disables optimization until either the compiler finds a matching #pragma optimize("", on) statement or until the compiler reaches the end of the translation unit.


Example
Example 1: Disabling optimization for a single function using #pragma optimize
In the following example, optimizations are disabled for the alpha() function but not for omega().

#pragma optimize("", off)
alpha() {
...
}
#pragma optimize("", on)
omega() {
...
}

Example 2: Disabling optimization for all functions using #pragma optimize
In the following example, optimizations are disabled for both the alpha() and omega() functions.

#pragma optimize("", off)
alpha() {
...
}
omega() {
...
}

optimization_level Controls optimization for one function or all functions after its first occurrence.

Syntax #pragma [intel|GCC] optimization_level n


Arguments intel|GCC indicates the interpretation to use n an integer value specifying an optimization level; valid values are:

• 0 same optimizations as -O0 • 1 same optimizations as -O1 • 2 same optimizations as -O2 • 3 same optimizations as -O3

Description The optimization_level pragma is used to restrict optimization for a specific function while optimizing the remaining application using a different, higher optimization level. For example, if you specify -O3 (VxWorks* systems) for the application and specify #pragma optimization_level 1, the marked function will be optimized at the -O1 option level, while the remaining application will be optimized at the higher level. In general, the pragma optimizes the function at the level specified as n; however, certain compiler optimizations, like Inter-procedural Optimization (IPO), are not enabled or disabled during translation unit compilation. For example, if you enable IPO and a specific optimization level, IPO is enabled even for the function targeted by this pragma; however, IPO might not be fully implemented regardless of the optimization level specified at the command line. The reverse is also true. Scope of optimization restriction On VxWorks* systems, the scope of the optimization restriction can be affected by arguments passed to the -pragma-optimization-level compiler option as explained in the following table.

Syntax Behavior

#pragma intel optimization_level n
Applies the pragma only to the next function, using the specified optimization level, regardless of the argument passed to the -pragma-optimization-level option.

#pragma GCC optimization_level n
Applies the pragma to all subsequent functions, using the specified optimization level, regardless of the argument passed to the -pragma-optimization-level option.

#pragma GCC optimization_level reset
Specifying reset reverses the effect of the most recent #pragma GCC optimization_level statement, returning to the optimization level previously specified.

#pragma optimization_level n
Applies either the intel or GCC interpretation. The interpretation depends on the argument passed to the -pragma-optimization-level option.

NOTE. On Windows* systems, the pragma always uses the intel interpretation; the pragma is applied only to the next function.

Using the intel interpretation of #pragma optimization_level
Place the pragma immediately before the function being affected. Example:

#pragma intel optimization_level 1
gamma() {
...
}

Using the GCC* interpretation of #pragma optimization_level Place the pragma in any location prior to the functions being affected.
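Under the GCC interpretation, a sketch might look like this (the function names are hypothetical): every function defined after the pragma is compiled at the given level until a reset:

```c
/* Hypothetical sketch: under the GCC interpretation, both functions
 * below would be compiled at the -O1 level, and the reset restores the
 * previous level. Compilers without this pragma warn and ignore it;
 * the functions still compile and run normally. */
#pragma GCC optimization_level 1
int twice(int x) { return 2 * x; }
int thrice(int x) { return 3 * x; }
#pragma GCC optimization_level reset
```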

prefetch/noprefetch Invites the compiler to issue data prefetches from memory (prefetch) or disables data prefetching (noprefetch).

Syntax
#pragma prefetch
#pragma prefetch [var1 [: hint1 [: distance1]] [, var2 [: hint2 [: distance2]]]...]
#pragma noprefetch [var1 [, var2]...]


Arguments
var Optional memory reference (data to be prefetched).
hint Optional hint to the compiler specifying the type of prefetch. Possible values are the constants defined in the header xmmintrin.h:
• _MM_HINT_T0 - for integer data that will be reused
• _MM_HINT_NT1 - for integer and floating point data that will be reused from L2 cache
• _MM_HINT_NT2 - for data that will be reused from L3 cache
• _MM_HINT_NTA - for data that will not be reused
To use this argument, you must also specify var.
distance Optional integer argument with a value greater than 0. It indicates the number of loop iterations ahead of which a prefetch is issued, before the corresponding load or store instruction. To use this argument, you must also specify var and hint.

Description The prefetch pragma is supported by Intel® Itanium® processors only. The prefetch pragma hints to the compiler to generate data prefetches for some memory references. This affects the heuristics used in the compiler. Prefetching data can minimize the effects of memory latency. If you specify #pragma prefetch with no arguments, all arrays accessed in the immediately following loop are prefetched. If the loop includes the expression A(j), placing #pragma prefetch A in front of the loop asks the compiler to insert prefetches for A(j + d) within the loop. Here, d is the number of iterations ahead of which to prefetch the data, and is determined by the compiler. The prefetch pragma affects the immediately following loop provided that the compiler general optimization level is -O1 (Linux* operating systems) or /O1 (Windows* operating systems) or higher. Remember that -O2 or /O2 is the default optimization level. The noprefetch pragma is also supported by Intel® Itanium® processors only. This pragma hints to the compiler not to generate data prefetches for some memory references. This affects the heuristics used in the compiler.


Example
Example 1: Using noprefetch and prefetch pragmas
The following example demonstrates how to use the noprefetch and prefetch pragmas together:

#pragma noprefetch b
#pragma prefetch a
for(i=0; i
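The truncated loop above can be completed as in the following sketch (the loop bound and loop body are assumptions, not taken from the original): prefetches are requested for a but suppressed for b:

```c
/* Sketch with assumed body: ask the compiler to prefetch a but not b.
 * On targets without these pragmas the loop compiles unchanged,
 * typically with an unknown-pragma warning. */
void scale_into(double *a, const double *b, double c, int n) {
#pragma noprefetch b
#pragma prefetch a
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c;
}
```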

Example 2: Using noprefetch and prefetch pragmas
The following is another example of how to use the noprefetch and prefetch pragmas:

for (i=i0; i!=i1; i+=is) {
  float sum = b[i];
  int ip = srow[i];
  int c = col[ip];
  #pragma noprefetch col
  #pragma prefetch value:1:80
  #pragma prefetch x:1:40
  for(; ip

Example 3: Using noprefetch, prefetch, memref_control pragmas


The following example, which is for IA-64 architecture only, demonstrates how to use the prefetch, noprefetch, and memref_control pragmas together:

#define SIZE 10000
int prefetch(int *a, int *b) {
  int i, sum = 0;
  #pragma memref_control a:l2
  #pragma noprefetch a
  #pragma prefetch b
  for (i = 0; i

int main() {
  int i, arr1[SIZE], arr2[SIZE];
  for (i = 0; i

See Also • memref_control pragma


simd The simd pragma enforces vectorization of innermost loops.

Syntax
#pragma simd [clause[ [,] clause]...]

where clause can be any of the following:

• vectorlength(vl_list)
• private(expr_list)
• linear(var_step_list)
• reduction(oper_var_list)
• [no]assert

Arguments

vectorlength(n1[, n2]...)
n is a vector length (VL). It must be an integer that is a power of 2; the value must be 2, 4, 8, or 16. If you specify more than one n, the vectorizer will choose the VL from the values specified. Causes each iteration in the vector loop to execute the computation equivalent to n iterations of scalar loop execution. Multiple vectorlength clauses are merged as a union.

private(var1[, var2]...)
var is a scalar variable. Causes each variable to be private to each iteration of a loop. Unless the compiler can prove that initial and last values are unused, the initial value is broadcast to all private instances, and the last value is copied out from the last iteration instance. Multiple private clauses are merged as a union.

linear(var1:step1[, var2:step2]...)
var is a scalar variable. step is a compile-time positive, integer constant expression. For each iteration of a scalar loop, var1 is incremented by step1, var2 is incremented by step2, and so on. Therefore, every iteration of the vector loop increments the variables by VL*step1, VL*step2, ..., VL*stepN, respectively. If more than one step is specified for a var, a compile-time error occurs. Multiple linear clauses are merged as a union.

reduction(oper:var1[, var2]...)
oper is a reduction operator. var is a scalar variable. Applies the vector reduction indicated by oper to var1, var2, ..., varN. The simd pragma may have multiple reduction clauses with the same or different operators. If more than one reduction operator is associated with a var, a compile-time error occurs.

[no]assert
Directs the compiler to assert or not to assert when the vectorization fails. The default is noassert. If this clause is specified more than once, a compile-time error occurs.

Description
The simd pragma is used to guide the compiler to vectorize more loops. Vectorization using the simd pragma complements (but does not replace) the fully automatic approach. If you specify the simd pragma with no clause, default rules are in effect for variable attributes, vector length, and so forth. You can only specify a particular variable in at most one instance of a private, linear, or reduction clause. If the compiler is unable to vectorize a loop, a warning is emitted (use the assert clause to make it an error). If the vectorizer has to stop vectorizing a loop for some reason, the fast floating-point model is used for the SIMD loop. Note that the simd pragma may not affect all auto-vectorizable loops. Some of these loops do not have a way to describe the SIMD vector semantics. The following restrictions apply to the simd pragma:

• The vector values must be signed 8-, 16-, 32-, or 64-bit integers, single or double-precision floating point numbers, or single or double-precision complex numbers.
• A SIMD loop cannot contain another loop in it; no loop nesting is allowed. Note that inlining can create such an inner loop and cause an error, which may not be obvious at the source level.


• A SIMD loop performs memory references unconditionally. Therefore, all address computations must result in valid memory addresses, even though such locations may not be accessed if the loop is executed sequentially.

To disable transformations that enable more vectorization, specify the options -no-vec -no-simd (VxWorks*) or /Qvec- /Qsimd- (Windows*).

Example
Using #pragma simd. In this example, the function add_floats() uses too many unknown pointers for the compiler's automatic runtime independence check optimization to kick in. The programmer can enforce the vectorization of this loop by using #pragma simd and avoid the overhead of the runtime check:

void add_floats(float *a, float *b, float *c, float *d, float *e, int n){
  int i;
#pragma simd
  for (i=0; i<n; i++){
    a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
  }
}

See Also

swp/noswp Indicates a preference for loops to be software pipelined or not pipelined.

Syntax
#pragma swp
#pragma noswp

Arguments None


Description
The swp pragma indicates a preference for loops to be software pipelined. The pragma cannot remove data dependences; it only overrides heuristics based on profile counts or unequal control flow. The software pipelining optimization triggered by the swp pragma applies instruction scheduling to certain innermost loops, splitting the instructions within a loop into different stages to increase instruction-level parallelism. This strategy can reduce the impact of long-latency operations, resulting in faster loop execution. Loops chosen for software pipelining are always innermost loops that do not contain procedure calls that are not inlined. Because the optimizer no longer considers fully unrolled loops as innermost loops, fully unrolling loops can allow an additional loop to become the innermost loop. The noswp pragma instructs the compiler not to software pipeline a loop. This may be advantageous for loops that iterate few times, because pipelining introduces overhead in executing the prolog and epilog code sequences.

Example
Example: Using the swp pragma
The following example demonstrates one way of using the pragma to instruct the compiler to attempt software pipelining.

void swp(int a[], int b[])
{
  #pragma swp
  for (int i = 0; i < 100; i++)
    if (a[i] == 0)

      b[i] = a[i] + 1;
    else
      b[i] = a[i] * 2;
}


unroll/nounroll Indicates to the compiler to unroll or not to unroll a counted loop.

Syntax
#pragma unroll
#pragma unroll(n)
#pragma nounroll

Arguments

n is the unrolling factor representing the number of times to unroll a loop; it is an integer constant from 0 through 255

Description
The unroll(n) pragma tells the compiler how many times to unroll a counted loop. The unroll pragma must precede the FOR statement of each FOR loop it affects. If n is specified, the optimizer unrolls the loop n times. If n is omitted or if it is outside the allowed range, the optimizer assigns the number of times to unroll the loop. This pragma is supported only when option O3 is set. The unroll pragma overrides any setting of loop unrolling from the command line. The pragma can be applied to the innermost loop nest as well as to an outer loop nest. If applied to outer loop nests, the current implementation supports complete outer loop unrolling; the loops inside the loop nest are either not unrolled at all or completely unrolled. The compiler generates correct code by comparing n and the loop count. When unrolling a loop increases register pressure and code size, it may be necessary to prevent unrolling of a loop. In such cases, use the nounroll pragma. The nounroll pragma instructs the compiler not to unroll a specified loop.


Example
Example 1: Using the unroll pragma for innermost loop unrolling

void unroll(int a[], int b[], int c[], int d[])
{
  #pragma unroll(4)
  for (int i = 1; i < 100; i++) {
    b[i] = a[i] + 1;
    d[i] = c[i] + 1;
  }
}

Example 2: Using the unroll pragma for outer loop unrolling
In Example 2, placing #pragma unroll before the first for loop causes the compiler to unroll the outer loop completely. If a #pragma unroll is placed before the inner for loop as well as before the outer for loop, the compiler ignores the inner unroll pragma. If the #pragma unroll is placed only on the innermost loop, the compiler unrolls the innermost loop according to some factor.

int m = 0;
int dir[4]= {1,2,3,4};
int data[10];
#pragma unroll (4) // outer loop unrolling
for (int i = 0; i < 4; i++)
{
  for (int j = dir[i]; data[j]==N ; j+=dir[i])
    m++;
}


unroll_and_jam/nounroll_and_jam Hints to the compiler to enable or disable loop unrolling and jamming. These pragmas can only be applied to iterative FOR loops.

Syntax
#pragma unroll_and_jam
#pragma unroll_and_jam (n)
#pragma nounroll_and_jam

Arguments

n is the unrolling factor representing the number of times to unroll a loop; it is an integer constant from 0 through 255

Description
The unroll_and_jam pragma partially unrolls one or more loops higher in the nest than the innermost loop and fuses/jams the resulting loops back together. This transformation allows more data reuse in the loop. This pragma is not effective on innermost loops. Ensure that the immediately following loop is not the innermost loop after compiler-initiated interchanges are completed. Specifying this pragma is a hint to the compiler that the unroll and jam sequence is legal and profitable; the compiler enables this transformation whenever possible. The unroll_and_jam pragma must precede the FOR statement of each FOR loop it affects. If n is specified, the optimizer unrolls the loop n times. If n is omitted or if it is outside the allowed range, the optimizer assigns the number of times to unroll the loop. The compiler generates correct code by comparing n and the loop count. This pragma is supported only when compiler option O3 is set. The unroll_and_jam pragma overrides any setting of loop unrolling from the command line. When unrolling a loop increases register pressure and code size, it may be necessary to prevent unrolling of a nested or imperfectly nested loop. In such cases, use the nounroll_and_jam pragma. The nounroll_and_jam pragma hints to the compiler not to unroll a specified loop.


Example
Example: Using the unroll_and_jam pragma

int a[10][10];
int b[10][10];
int c[10][10];
int d[10][10];
void unroll(int n)
{
  int i,j,k;
  #pragma unroll_and_jam (6)
  for (i = 1; i < n; i++) {
    #pragma unroll_and_jam (6)
    for (j = 1; j < n; j++) {
      for (k = 1; k < n; k++){
        a[i][j] += b[i][k]*c[k][j];
      }
    }
  }
}

unused
Describes variables that are unused (warnings not generated).

Syntax #pragma unused

Arguments None


Description
The unused pragma is implemented for compatibility with the Apple* implementation of GCC.

vector Indicates to the compiler that the loop should be vectorized according to the argument keywords always/aligned/assert/unaligned/nontemporal/temporal.

Syntax
#pragma vector {always[assert]|aligned|unaligned|nontemporal}
#pragma vector nontemporal[(var1[, var2, ...])]

Arguments

always instructs the compiler to override any efficiency heuristic during the decision to vectorize or not, and to vectorize non-unit strides or very unaligned memory accesses; controls the vectorization of the subsequent loop in the program; optionally takes the keyword assert

aligned instructs the compiler to use aligned data movement instructions for all array references when vectorizing

unaligned instructs the compiler to use unaligned data movement instructions for all array references when vectorizing

nontemporal directs the compiler to use non-temporal (that is, streaming) stores on systems based on IA-32 and Intel(R) 64 architectures; optionally takes a comma-separated list of variables

temporal directs the compiler to use temporal (that is, non-streaming) stores on systems based on IA-32 and Intel(R) 64 architectures

Description The vector pragma indicates that the loop should be vectorized, if it is legal to do so, ignoring normal heuristic decisions about profitability. The vector pragma takes several argument keywords to specify the kind of loop vectorization required. These keywords are aligned,

unaligned, always, temporal, and nontemporal. The compiler does not apply the vector pragma to nested loops; each nested loop needs its own preceding pragma statement. Place the pragma before the loop control statement.

Using the aligned/unaligned keywords
When the aligned/unaligned argument keyword is used with this pragma, it indicates that the loop should be vectorized using aligned/unaligned data movement instructions for all array references. Specify only one argument keyword: aligned or unaligned.

CAUTION. If you specify aligned as an argument, you must be sure that the loop is vectorizable using this pragma. Otherwise, the compiler generates incorrect code.

Using always keyword When the always argument keyword is used, the pragma controls the vectorization of the subsequent loop in the program. If assert is added, the compiler will generate an error-level assertion test to display a message saying that the compiler efficiency heuristics indicate that the loop cannot be vectorized.

NOTE. The pragma vector{always|aligned|unaligned} should be used with care. Overriding the efficiency heuristics of the compiler should only be done if the programmer is absolutely sure that vectorization will improve performance. Furthermore, instructing the compiler to implement all array references with aligned data movement instructions will cause a run-time exception in case some of the access patterns are actually unaligned.

Using nontemporal/temporal keywords The nontemporal and temporal argument keywords are used to control how the "stores" of register contents to storage are performed (streaming versus non-streaming) on systems based on IA-32 and Intel® 64 architectures. By default, the compiler automatically determines whether a streaming store should be used for each variable. Streaming stores may cause significant performance improvements over non-streaming stores for large numbers on certain processors. However, the misuse of streaming stores can significantly degrade performance.


Example
Example 1: Using pragma vector aligned
The loop in the following example uses the aligned argument keyword to request that the loop be vectorized with aligned instructions, because the arrays are declared in such a way that the compiler could not normally prove this would be safe.

void vec_aligned(float *a, int m, int c)
{
  int i;
  // Instruct compiler to use aligned data movement instructions.
  #pragma vector aligned
  for (i = 0; i < m; i++)
    a[i] = a[i] * c;
  // Alignment unknown but compiler can still align.
  for (i = 0; i < 100; i++)
    a[i] = a[i] + 1.0f;
}

Example 2: Using pragma vector always
The following example illustrates how to use the vector always pragma.

void vec_always(int *a, int *b, int m)
{
  #pragma vector always
  for(int i = 0; i <= m; i++)
    a[32*i] = b[99*i];
}

Example 3a: Using pragma vector nontemporal


A float-type loop together with the generated assembly is shown in the following example. For large N, significant performance improvements result on Pentium 4 systems over a non-streaming implementation.

float a[1000];
void foo(int N){
  int i;
  #pragma vector nontemporal
  for (i = 0; i < N; i++) {
    a[i] = 1;
  }
}

Example 3b: ASM code for the loop body

.B1.2:
  movntps XMMWORD PTR _a[eax], xmm0
  movntps XMMWORD PTR _a[eax+16], xmm0
  add eax, 32
  cmp eax, 4096
  jl .B1.2

Example 4: Using pragma vector nontemporal with variables


The following example illustrates how to use #pragma vector nontemporal with variables for implementing streaming stores.

double A[1000];
double B[1000];
void foo(int n){
  int i;
  #pragma vector nontemporal (A, B)
  for (i = 0; i < n; i++) {
    A[i] = 0;
    B[i] = i;
  }
}

Intel Supported Pragmas

The Intel® C++ Compiler supports the following pragmas:

Pragma Description

alloc_text names the code section where the specified function definitions are to reside

auto_inline excludes any function defined within the range where off is specified from being considered as candidates for automatic inline expansion

bss_seg indicates to the compiler the segment where uninitialized variables are stored in the .obj file

check_stack the on argument enables stack checking for functions that follow; the off argument disables stack checking for functions that follow

code_seg specifies a code section where functions are to be allocated

comment places a comment record into an object file or executable file


component controls collecting of browse information or dependency information from within source files

conform specifies the run-time behavior of the /Zc:forScope compiler option

const_seg specifies the segment where functions are stored in the .obj file

data_seg specifies the default section for initialized data

deprecated indicates that a function, type, or any other identifier may not be supported in a future release, or that it should no longer be used

poison labels the identifiers you want removed from your program; an error results when compiling a "poisoned" identifier; #pragma POISON is also supported

float_control specifies floating-point behavior for a function

fp_contract allows or disallows the implementation to contract expressions

include_directory appends the string argument to the list of places to search for #include files; HP-compatible pragma

init_seg specifies the section to contain C++ initialization code for the translation unit

message displays the specified string literal to the standard output device

optimize specifies optimizations to be performed on a function-by-function basis; implemented to partly support Microsoft's implementation of the same pragma

pointers_to_members specifies whether a pointer to a class member can be declared before its associated class definition; used to control the pointer size and the code required to interpret the pointer

pop_macro sets the value of the macro_name macro to the value on the top of the stack for this macro



push_macro saves the value of the macro_name macro on the top of the stack for this macro

section creates a section in an .obj file. Once a section is defined, it remains valid for the remainder of the compilation

start_map_region used in conjunction with the stop_map_region pragma

stop_map_region used in conjunction with the start_map_region pragma

fenv_access informs an implementation that a program may test status flags or run under a non-default control mode

vtordisp on argument enables the generation of hidden vtordisp members and off disables them

warning allows selective modification of the behavior of compiler warning messages

weak declares symbol you enter to be weak

See Also • Intel-specific Pragmas

Intel Math Library

Overview: Intel® Math Library

The Intel® C++ Compiler includes a mathematical software library containing highly optimized and very accurate mathematical functions. These functions are commonly used in scientific or graphic applications, as well as other programs that rely heavily on floating-point computations. To include support for C99 _Complex data types, use the -std=c99 compiler option. The mathimf.h header file includes prototypes for the library functions. For a complete list of the functions available, refer to the Function List. Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.

Math Libraries
The math library linked to an application depends on the compilation or linkage options specified.

Library Description

libimf.a Default static math library.

See Also • Using Intel® Math Library

Using the Intel Math Library

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. To use the Intel math library, include the header file, mathimf.h, in your program. Here are two example programs that illustrate the use of the math library on Linux* and VxWorks* operating systems.


Example Using Real Functions

// real_math.c
#include <mathimf.h>
#include <stdio.h>

int main() {
  float fp32bits;
  double fp64bits;
  long double fp80bits;
  long double pi_by_four = 3.141592653589793238/4.0;
  // pi/4 radians is about 45 degrees
  fp32bits = (float) pi_by_four;  // float approximation to pi/4
  fp64bits = (double) pi_by_four; // double approximation to pi/4
  fp80bits = pi_by_four;          // long double (extended) approximation to pi/4
  // The sin(pi/4) is known to be 1/sqrt(2) or approximately .7071067

  printf("When x = %8.8f, sinf(x) = %8.8f \n", fp32bits, sinf(fp32bits));
  printf("When x = %16.16f, sin(x) = %16.16f \n", fp64bits, sin(fp64bits));
  printf("When x = %20.20Lf, sinl(x) = %20.20Lf \n", fp80bits, sinl(fp80bits));
  return 0;
}

The command for compiling real_math.c is: icc real_math.c

The output of a.out will look like this: When x = 0.78539816, sinf(x) = 0.70710678 When x = 0.7853981633974483, sin(x) = 0.7071067811865475 When x = 0.78539816339744827900, sinl(x) = 0.70710678118654750275


Example Using Complex Functions

// complex_math.c
#include <mathimf.h>
#include <stdio.h>

int main() {
  float _Complex c32in,c32out;
  double _Complex c64in,c64out;
  double pi_by_four = 3.141592653589793238/4.0;
  c64in = 1.0 + I* pi_by_four;
  // Create the double precision complex number 1 + (pi/4) * i

  // where I is the imaginary unit.
  c32in = (float _Complex) c64in;
  // Create the float complex value from the double complex value.
  c64out = cexp(c64in);
  c32out = cexpf(c32in);
  // Call the complex exponential,
  // cexp(z) = cexp(x+iy) = e^(x + i y) = e^x * (cos(y) + i sin(y))
  printf("When z = %7.7f + %7.7f i, cexpf(z) = %7.7f + %7.7f i \n",
         crealf(c32in),cimagf(c32in),crealf(c32out),cimagf(c32out));
  printf("When z = %12.12f + %12.12f i, cexp(z) = %12.12f + %12.12f i \n",
         creal(c64in),cimag(c64in),creal(c64out),cimag(c64out));
  return 0;
}

The command to compile complex_math.c is: icc -std=c99 complex_math.c

The output of a.out will look like this:


When z = 1.0000000 + 0.7853982 i, cexpf(z) = 1.9221154 + 1.9221156 i When z = 1.000000000000 + 0.785398163397 i, cexp(z) = 1.922115514080 + 1.922115514080 i

NOTE. _Complex data types are supported in C but not in C++ programs. It is necessary to include the -std=c99 compiler option when compiling programs that require support for _Complex data types.

Exception Conditions
If you call a math function using argument(s) that may produce undefined results, an error number is assigned to the system variable errno. Math function errors are usually domain errors or range errors. Domain errors result from arguments that are outside the domain of the function. For example, acos is defined only for arguments between -1 and +1 inclusive. Attempting to evaluate acos(-2) or acos(3) results in a domain error, where the return value is QNaN. Range errors occur when a mathematically valid argument results in a function value that exceeds the range of representable values for the floating-point data type. Attempting to evaluate exp(1000) results in a range error, where the return value is INF. When a domain or range error occurs, the following values are assigned to errno:

• domain error (EDOM): errno = 33 • range error (ERANGE): errno = 34

The following example shows how to read the errno value for an EDOM and ERANGE error.

// errno.c
#include <errno.h>
#include <mathimf.h>
#include <stdio.h>

int main(void) {


double neg_one=-1.0; double zero=0.0;

// The natural log of a negative number is considered a domain error - EDOM
printf("log(%e) = %e and errno(EDOM) = %d \n",neg_one,log(neg_one),errno);

// The natural log of zero is considered a range error - ERANGE
printf("log(%e) = %e and errno(ERANGE) = %d \n",zero,log(zero),errno);
}

The output of errno.c will look like this: log(-1.000000e+00) = nan and errno(EDOM) = 33 log(0.000000e+00) = -inf and errno(ERANGE) = 34

For the math functions in this section, a corresponding value for errno is listed when applicable.

Other Considerations
Some math functions are inlined automatically by the compiler. The functions actually inlined may vary and may depend on any vectorization or processor-specific compilation options used. A change of the default precision control or rounding mode may affect the results returned by some of the mathematical functions. See Overview: Tuning Performance in Floating-point Operations > Tuning Performance.

Math Functions

Function List

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library functions are listed here by function type.

Function Type Name

Trigonometric Functions acos

acosd

asin



asind

atan

atan2

atand

atan2d

cos

cosd

cot

cotd

sin

sincos

sincosd

sind

tan

tand

Hyperbolic Functions acosh

asinh

atanh

cosh

sinh

sinhcosh



tanh

Exponential Functions cbrt

exp

exp10

exp2

expm1

frexp

hypot

invsqrt

ilogb

ldexp

log

log10

log1p

log2

logb

pow

scalb

scalbln

scalbn

sqrt



Special Functions annuity

compound

erf

erfc

erfinv

gamma

gamma_r

j0

j1

jn

lgamma

lgamma_r

tgamma

y0

y1

yn

Nearest Integer Functions ceil

floor

llrint

llround

lrint



lround

modf

nearbyint

rint

round

trunc

Remainder Functions fmod

remainder

remquo

Miscellaneous Functions copysign

fabs

fdim

finite

fma

fmax

fmin

fpclassify

isfinite

isgreater

isgreaterequal

isinf



isless

islessequal

islessgreater

isnan

isnormal

isunordered

nextafter

nexttoward

signbit

significand

Complex Functions cabs

cacos

cacosh

carg

casin

casinh

catan

catanh

ccos

cexp

cexp2



cimag

cis

clog

clog10

conj

ccosh

cpow

cproj

creal

csin

csinh

csqrt

ctan

ctanh

Trigonometric Functions

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library supports the following trigonometric functions:

acos
Description: The acos function returns the principal value of the inverse cosine of x in the range [0, pi] radians for x in the interval [-1,1].
errno: EDOM, for |x| > 1


Calling interface:
double acos(double x);
long double acosl(long double x);
float acosf(float x);

acosd Description: The acosd function returns the principal value of the inverse cosine of x in the range [0,180] degrees for x in the interval [-1,1]. errno: EDOM, for |x| > 1 Calling interface: double acosd(double x); long double acosdl(long double x); float acosdf(float x);

asin Description: The asin function returns the principal value of the inverse sine of x in the range [-pi/2, +pi/2] radians for x in the interval [-1,1]. errno: EDOM, for |x| > 1 Calling interface: double asin(double x); long double asinl(long double x); float asinf(float x);

asind Description: The asind function returns the principal value of the inverse sine of x in the range [-90,90] degrees for x in the interval [-1,1]. errno: EDOM, for |x| > 1 Calling interface: double asind(double x); long double asindl(long double x); float asindf(float x);

atan
Description: The atan function returns the principal value of the inverse tangent of x in the range [-pi/2, +pi/2] radians.
Calling interface:
double atan(double x);
long double atanl(long double x);
float atanf(float x);

atan2
Description: The atan2 function returns the principal value of the inverse tangent of y/x in the range [-pi, +pi] radians.
errno: EDOM, for x = 0 and y = 0
Calling interface:
double atan2(double y, double x);
long double atan2l(long double y, long double x);
float atan2f(float y, float x);

atand
Description: The atand function returns the principal value of the inverse tangent of x in the range [-90,90] degrees.
Calling interface:
double atand(double x);
long double atandl(long double x);
float atandf(float x);

atan2d
Description: The atan2d function returns the principal value of the inverse tangent of y/x in the range [-180, +180] degrees.
errno: EDOM, for x = 0 and y = 0
Calling interface:
double atan2d(double x, double y);
long double atan2dl(long double x, long double y);
float atan2df(float x, float y);


cos Description: The cos function returns the cosine of x measured in radians. This function may be inlined by the Itanium® compiler. Calling interface: double cos(double x); long double cosl(long double x); float cosf(float x);

cosd Description: The cosd function returns the cosine of x measured in degrees. Calling interface: double cosd(double x); long double cosdl(long double x); float cosdf(float x);

cot Description: The cot function returns the cotangent of x measured in radians. errno: ERANGE, for overflow conditions at x = 0. Calling interface: double cot(double x); long double cotl(long double x); float cotf(float x);

cotd Description: The cotd function returns the cotangent of x measured in degrees. errno: ERANGE, for overflow conditions at x = 0. Calling interface: double cotd(double x); long double cotdl(long double x); float cotdf(float x);

sin
Description: The sin function returns the sine of x measured in radians. This function may be inlined by the Itanium® compiler.
Calling interface:
double sin(double x);
long double sinl(long double x);
float sinf(float x);

sincos
Description: The sincos function returns both the sine and cosine of x measured in radians. This function may be inlined by the Itanium® compiler.
Calling interface:
void sincos(double x, double *sinval, double *cosval);
void sincosl(long double x, long double *sinval, long double *cosval);
void sincosf(float x, float *sinval, float *cosval);

sincosd
Description: The sincosd function returns both the sine and cosine of x measured in degrees.
Calling interface:
void sincosd(double x, double *sinval, double *cosval);
void sincosdl(long double x, long double *sinval, long double *cosval);
void sincosdf(float x, float *sinval, float *cosval);

sind
Description: The sind function computes the sine of x measured in degrees.
Calling interface:
double sind(double x);
long double sindl(long double x);
float sindf(float x);

tan
Description: The tan function returns the tangent of x measured in radians.
Calling interface:


double tan(double x); long double tanl(long double x); float tanf(float x);

tand Description: The tand function returns the tangent of x measured in degrees. errno: ERANGE, for overflow conditions Calling interface: double tand(double x); long double tandl(long double x); float tandf(float x);

Hyperbolic Functions

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library supports the following hyperbolic functions:

acosh Description: The acosh function returns the inverse hyperbolic cosine of x. errno: EDOM, for x < 1 Calling interface: double acosh(double x); long double acoshl(long double x); float acoshf(float x);

asinh Description: The asinh function returns the inverse hyperbolic sine of x. Calling interface: double asinh(double x); long double asinhl(long double x); float asinhf(float x);

atanh
Description: The atanh function returns the inverse hyperbolic tangent of x.
errno: EDOM, for |x| > 1
errno: ERANGE, for |x| = 1
Calling interface:
double atanh(double x);
long double atanhl(long double x);
float atanhf(float x);

cosh
Description: The cosh function returns the hyperbolic cosine of x, (e^x + e^-x)/2.
errno: ERANGE, for overflow conditions
Calling interface:
double cosh(double x);
long double coshl(long double x);
float coshf(float x);

sinh
Description: The sinh function returns the hyperbolic sine of x, (e^x - e^-x)/2.
errno: ERANGE, for overflow conditions
Calling interface:
double sinh(double x);
long double sinhl(long double x);
float sinhf(float x);

sinhcosh
Description: The sinhcosh function returns both the hyperbolic sine and hyperbolic cosine of x.
errno: ERANGE, for overflow conditions
Calling interface:
void sinhcosh(double x, double *sinval, double *cosval);
void sinhcoshl(long double x, long double *sinval, long double *cosval);


void sinhcoshf(float x, float *sinval, float *cosval);

tanh
Description: The tanh function returns the hyperbolic tangent of x, (e^x - e^-x) / (e^x + e^-x).

Calling interface: double tanh(double x); long double tanhl(long double x); float tanhf(float x);

Exponential Functions

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library supports the following exponential functions:

cbrt Description: The cbrt function returns the cube root of x. Calling interface: double cbrt(double x); long double cbrtl(long double x); float cbrtf(float x);

exp
Description: The exp function returns e raised to the x power, e^x. This function may be inlined by the Itanium® compiler.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double exp(double x);
long double expl(long double x);
float expf(float x);

exp10
Description: The exp10 function returns 10 raised to the x power, 10^x.

errno: ERANGE, for underflow and overflow conditions
Calling interface:
double exp10(double x);
long double exp10l(long double x);
float exp10f(float x);

exp2
Description: The exp2 function returns 2 raised to the x power, 2^x.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double exp2(double x);
long double exp2l(long double x);
float exp2f(float x);

expm1
Description: The expm1 function returns e raised to the x power minus 1, e^x - 1.
errno: ERANGE, for overflow conditions
Calling interface:
double expm1(double x);
long double expm1l(long double x);
float expm1f(float x);

frexp
Description: The frexp function converts a floating-point number x into a signed normalized fraction in [1/2, 1) multiplied by an integral power of two. The signed normalized fraction is returned, and the integer exponent is stored at location exp.
Calling interface:
double frexp(double x, int *exp);
long double frexpl(long double x, int *exp);
float frexpf(float x, int *exp);

hypot
Description: The hypot function returns the square root of (x^2 + y^2).


errno: ERANGE, for overflow conditions
Calling interface:
double hypot(double x, double y);
long double hypotl(long double x, long double y);
float hypotf(float x, float y);

ilogb Description: The ilogb function returns the exponent of x base two as a signed int value. errno: ERANGE, for x = 0 Calling interface: int ilogb(double x); int ilogbl(long double x); int ilogbf(float x);

invsqrt Description: The invsqrt function returns the inverse square root. This function may be inlined by the Itanium® compiler. Calling interface: double invsqrt(double x); long double invsqrtl(long double x); float invsqrtf(float x);

ldexp
Description: The ldexp function returns x*2^exp, where exp is an integer value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double ldexp(double x, int exp);
long double ldexpl(long double x, int exp);
float ldexpf(float x, int exp);

log
Description: The log function returns the natural log of x, ln(x). This function may be inlined by the Itanium® compiler.
errno: EDOM, for x < 0
errno: ERANGE, for x = 0
Calling interface:
double log(double x);
long double logl(long double x);
float logf(float x);

log10
Description: The log10 function returns the base-10 log of x, log10(x). This function may be inlined by the Itanium® compiler.
errno: EDOM, for x < 0
errno: ERANGE, for x = 0
Calling interface:
double log10(double x);
long double log10l(long double x);
float log10f(float x);

log1p
Description: The log1p function returns the natural log of (x+1), ln(x + 1).
errno: EDOM, for x < -1
errno: ERANGE, for x = -1
Calling interface:
double log1p(double x);
long double log1pl(long double x);
float log1pf(float x);

log2
Description: The log2 function returns the base-2 log of x, log2(x).
errno: EDOM, for x < 0
errno: ERANGE, for x = 0
Calling interface:
double log2(double x);
long double log2l(long double x);
float log2f(float x);

logb
Description: The logb function returns the signed exponent of x.
errno: EDOM, for x = 0
Calling interface:
double logb(double x);
long double logbl(long double x);
float logbf(float x);

pow
Description: The pow function returns x raised to the power of y, x^y. This function may be inlined by the Itanium® compiler.
errno: EDOM, for x = 0 and y < 0
errno: EDOM, for x < 0 and y is a non-integer
errno: ERANGE, for overflow and underflow conditions
Calling interface:
double pow(double x, double y);
long double powl(long double x, long double y);
float powf(float x, float y);

scalb
Description: The scalb function returns x*2^y, where y is a floating-point value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double scalb(double x, double y);
long double scalbl(long double x, long double y);
float scalbf(float x, float y);

scalbn
Description: The scalbn function returns x*2^n, where n is an integer value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double scalbn(double x, int n);
long double scalbnl(long double x, int n);
float scalbnf(float x, int n);

scalbln
Description: The scalbln function returns x*2^n, where n is a long integer value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double scalbln(double x, long int n);
long double scalblnl(long double x, long int n);
float scalblnf(float x, long int n);

sqrt
Description: The sqrt function returns the correctly rounded square root.
errno: EDOM, for x < 0
Calling interface:
double sqrt(double x);
long double sqrtl(long double x);
float sqrtf(float x);

Special Functions

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library supports the following special functions:

annuity
Description: The annuity function computes the present value factor for an annuity, (1 - (1+x)^(-y)) / x, where x is a rate and y is a period.
errno: ERANGE, for underflow and overflow conditions
Calling interface:


double annuity(double x, double y);
long double annuityl(long double x, long double y);
float annuityf(float x, float y);

compound
Description: The compound function computes the compound interest factor, (1+x)^y, where x is a rate and y is a period.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double compound(double x, double y);
long double compoundl(long double x, long double y);
float compoundf(float x, float y);

erf
Description: The erf function returns the error function value.
Calling interface:
double erf(double x);
long double erfl(long double x);
float erff(float x);

erfc
Description: The erfc function returns the complementary error function value.
errno: EDOM, for finite or infinite |x| > 1
Calling interface:
double erfc(double x);
long double erfcl(long double x);
float erfcf(float x);

erfinv
Description: The erfinv function returns the value of the inverse error function of x.
errno: EDOM, for finite or infinite |x| > 1
Calling interface:
double erfinv(double x);
long double erfinvl(long double x);
float erfinvf(float x);

gamma
Description: The gamma function returns the value of the logarithm of the absolute value of gamma.
errno: ERANGE, for overflow conditions when x is a negative integer.
Calling interface:
double gamma(double x);
long double gammal(long double x);
float gammaf(float x);

gamma_r
Description: The gamma_r function returns the value of the logarithm of the absolute value of gamma. The sign of the gamma function is returned in the integer signgam.
Calling interface:
double gamma_r(double x, int *signgam);
long double gammal_r(long double x, int *signgam);
float gammaf_r(float x, int *signgam);

j0
Description: Computes the Bessel function (of the first kind) of x with order 0.
Calling interface:
double j0(double x);
long double j0l(long double x);
float j0f(float x);

j1
Description: Computes the Bessel function (of the first kind) of x with order 1.
Calling interface:
double j1(double x);
long double j1l(long double x);


float j1f(float x);

jn
Description: Computes the Bessel function (of the first kind) of x with order n.
Calling interface:
double jn(int n, double x);
long double jnl(int n, long double x);
float jnf(int n, float x);

lgamma
Description: The lgamma function returns the value of the logarithm of the absolute value of gamma.
errno: ERANGE, for overflow conditions, x=0 or negative integers.
Calling interface:
double lgamma(double x);
long double lgammal(long double x);
float lgammaf(float x);

lgamma_r
Description: The lgamma_r function returns the value of the logarithm of the absolute value of gamma. The sign of the gamma function is returned in the integer signgam.
errno: ERANGE, for overflow conditions, x=0 or negative integers.
Calling interface:
double lgamma_r(double x, int *signgam);
long double lgammal_r(long double x, int *signgam);
float lgammaf_r(float x, int *signgam);

tgamma
Description: The tgamma function computes the gamma function of x.
errno: EDOM, for x=0 or negative integers.
Calling interface:
double tgamma(double x);
long double tgammal(long double x);
float tgammaf(float x);

y0
Description: Computes the Bessel function (of the second kind) of x with order 0.
errno: EDOM, for x <= 0
Calling interface:
double y0(double x);
long double y0l(long double x);
float y0f(float x);

y1
Description: Computes the Bessel function (of the second kind) of x with order 1.
errno: EDOM, for x <= 0
Calling interface:
double y1(double x);
long double y1l(long double x);
float y1f(float x);

yn
Description: Computes the Bessel function (of the second kind) of x with order n.
errno: EDOM, for x <= 0
Calling interface:
double yn(int n, double x);
long double ynl(int n, long double x);
float ynf(int n, float x);

Nearest Integer Functions

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library supports the following nearest integer functions:


ceil
Description: The ceil function returns the smallest integral value not less than x as a floating-point number. This function may be inlined by the Itanium® compiler.
Calling interface:
double ceil(double x);
long double ceill(long double x);
float ceilf(float x);

floor
Description: The floor function returns the largest integral value not greater than x as a floating-point value. This function may be inlined by the Itanium® compiler.
Calling interface:
double floor(double x);
long double floorl(long double x);
float floorf(float x);

llrint
Description: The llrint function returns the rounded integer value (according to the current rounding direction) as a long long int.
errno: ERANGE, for values too large
Calling interface:
long long int llrint(double x);
long long int llrintl(long double x);
long long int llrintf(float x);

llround
Description: The llround function returns the rounded integer value as a long long int.
errno: ERANGE, for values too large
Calling interface:
long long int llround(double x);
long long int llroundl(long double x);
long long int llroundf(float x);

lrint
Description: The lrint function returns the rounded integer value (according to the current rounding direction) as a long int.
errno: ERANGE, for values too large
Calling interface:
long int lrint(double x);
long int lrintl(long double x);
long int lrintf(float x);

lround
Description: The lround function returns the rounded integer value as a long int. Halfway cases are rounded away from zero.
errno: ERANGE, for values too large
Calling interface:
long int lround(double x);
long int lroundl(long double x);
long int lroundf(float x);

modf
Description: The modf function returns the value of the signed fractional part of x and stores the integral part at *iptr as a floating-point number.
Calling interface:
double modf(double x, double *iptr);
long double modfl(long double x, long double *iptr);
float modff(float x, float *iptr);

nearbyint
Description: The nearbyint function returns the rounded integral value as a floating-point number, using the current rounding direction.
Calling interface:
double nearbyint(double x);
long double nearbyintl(long double x);
float nearbyintf(float x);


rint
Description: The rint function returns the rounded integral value as a floating-point number, using the current rounding direction.
Calling interface:
double rint(double x);
long double rintl(long double x);
float rintf(float x);

round
Description: The round function returns the nearest integral value as a floating-point number. Halfway cases are rounded away from zero.
Calling interface:
double round(double x);
long double roundl(long double x);
float roundf(float x);

trunc
Description: The trunc function returns the truncated integral value as a floating-point number.
Calling interface:
double trunc(double x);
long double truncl(long double x);
float truncf(float x);

Remainder Functions

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library supports the following remainder functions:

fmod
Description: The fmod function returns the value x - n*y for integer n, such that if y is nonzero, the result has the same sign as x and magnitude less than the magnitude of y.
errno: EDOM, for y = 0
Calling interface:

double fmod(double x, double y);
long double fmodl(long double x, long double y);
float fmodf(float x, float y);

remainder
Description: The remainder function returns the value of x REM y as required by the IEEE standard.
Calling interface:
double remainder(double x, double y);
long double remainderl(long double x, long double y);
float remainderf(float x, float y);

remquo
Description: The remquo function returns the value of x REM y. In the object pointed to by quo, the function stores a value whose sign is the sign of x/y and whose magnitude is congruent modulo 2^N to the integral quotient of x/y, where N is an implementation-defined integer. For systems based on IA-64 architecture, N is equal to 24. For all other systems, N is equal to 31.
Calling interface:
double remquo(double x, double y, int *quo);
long double remquol(long double x, long double y, int *quo);
float remquof(float x, float y, int *quo);

Miscellaneous Functions

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library supports the following miscellaneous functions:

copysign
Description: The copysign function returns the value with the magnitude of x and the sign of y.
Calling interface:
double copysign(double x, double y);
long double copysignl(long double x, long double y);
float copysignf(float x, float y);


fabs
Description: The fabs function returns the absolute value of x.
Calling interface:
double fabs(double x);
long double fabsl(long double x);
float fabsf(float x);

fdim
Description: The fdim function returns the positive difference value, x-y (for x > y) or +0 (for x <= y).
errno: ERANGE, for values too large
Calling interface:
double fdim(double x, double y);
long double fdiml(long double x, long double y);
float fdimf(float x, float y);

finite
Description: The finite function returns 1 if x is not a NaN or +/- infinity. Otherwise 0 is returned.
Calling interface:
int finite(double x);
int finitel(long double x);
int finitef(float x);

fma
Description: The fma functions return (x*y)+z.
Calling interface:
double fma(double x, double y, double z);
long double fmal(long double x, long double y, long double z);
float fmaf(float x, float y, float z);

fmax
Description: The fmax function returns the maximum numeric value of its arguments.
Calling interface:
double fmax(double x, double y);
long double fmaxl(long double x, long double y);
float fmaxf(float x, float y);

fmin
Description: The fmin function returns the minimum numeric value of its arguments.
Calling interface:
double fmin(double x, double y);
long double fminl(long double x, long double y);
float fminf(float x, float y);

fpclassify
Description: The fpclassify function returns the value of the number classification macro appropriate to the value of its argument.

Return Value

0 (NaN)

1 (Infinity)

2 (Zero)

3 (Subnormal)

4 (Finite)

Calling interface:
int fpclassify(double x);
int fpclassifyl(long double x);
int fpclassifyf(float x);

isfinite
Description: The isfinite function returns 1 if x is not a NaN or +/- infinity. Otherwise 0 is returned.
Calling interface:


int isfinite(double x);
int isfinitel(long double x);
int isfinitef(float x);

isgreater
Description: The isgreater function returns 1 if x is greater than y. This function does not raise the invalid floating-point exception.
Calling interface:
int isgreater(double x, double y);
int isgreaterl(long double x, long double y);
int isgreaterf(float x, float y);

isgreaterequal
Description: The isgreaterequal function returns 1 if x is greater than or equal to y. This function does not raise the invalid floating-point exception.
Calling interface:
int isgreaterequal(double x, double y);
int isgreaterequall(long double x, long double y);
int isgreaterequalf(float x, float y);

isinf
Description: The isinf function returns a non-zero value if and only if its argument has an infinite value.
Calling interface:
int isinf(double x);
int isinfl(long double x);
int isinff(float x);

isless
Description: The isless function returns 1 if x is less than y. This function does not raise the invalid floating-point exception.
Calling interface:
int isless(double x, double y);
int islessl(long double x, long double y);
int islessf(float x, float y);

islessequal
Description: The islessequal function returns 1 if x is less than or equal to y. This function does not raise the invalid floating-point exception.
Calling interface:
int islessequal(double x, double y);
int islessequall(long double x, long double y);
int islessequalf(float x, float y);

islessgreater
Description: The islessgreater function returns 1 if x is less than or greater than y. This function does not raise the invalid floating-point exception.
Calling interface:
int islessgreater(double x, double y);
int islessgreaterl(long double x, long double y);
int islessgreaterf(float x, float y);

isnan
Description: The isnan function returns a non-zero value if and only if x has a NaN value.
Calling interface:
int isnan(double x);
int isnanl(long double x);
int isnanf(float x);

isnormal
Description: The isnormal function returns a non-zero value if and only if x is normal.
Calling interface:
int isnormal(double x);
int isnormall(long double x);
int isnormalf(float x);


isunordered
Description: The isunordered function returns 1 if either x or y is a NaN. This function does not raise the invalid floating-point exception.
Calling interface:
int isunordered(double x, double y);
int isunorderedl(long double x, long double y);
int isunorderedf(float x, float y);

nextafter
Description: The nextafter function returns the next representable value in the specified format after x in the direction of y.
errno: ERANGE, for overflow and underflow conditions
Calling interface:
double nextafter(double x, double y);
long double nextafterl(long double x, long double y);
float nextafterf(float x, float y);

nexttoward
Description: The nexttoward function returns the next representable value in the specified format after x in the direction of y. If x equals y, then the function returns y converted to the type of the function. Use the /Qlong-double option on Windows* operating systems for accurate results.
errno: ERANGE, for overflow and underflow conditions
Calling interface:
double nexttoward(double x, long double y);
long double nexttowardl(long double x, long double y);
float nexttowardf(float x, long double y);

signbit
Description: The signbit function returns a non-zero value if and only if the sign of x is negative.
Calling interface:
int signbit(double x);
int signbitl(long double x);
int signbitf(float x);

significand
Description: The significand function returns the significand of x in the interval [1, 2). For x equal to zero, NaN, or +/- infinity, the original x is returned.
Calling interface:
double significand(double x);
long double significandl(long double x);
float significandf(float x);

Complex Functions

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library supports the following complex functions:

cabs
Description: The cabs function returns the complex absolute value of z.
Calling interface:
double cabs(double _Complex z);
long double cabsl(long double _Complex z);
float cabsf(float _Complex z);

cacos
Description: The cacos function returns the complex inverse cosine of z.
Calling interface:
double _Complex cacos(double _Complex z);
long double _Complex cacosl(long double _Complex z);
float _Complex cacosf(float _Complex z);

cacosh
Description: The cacosh function returns the complex inverse hyperbolic cosine of z.
Calling interface:


double _Complex cacosh(double _Complex z);
long double _Complex cacoshl(long double _Complex z);
float _Complex cacoshf(float _Complex z);

carg
Description: The carg function returns the value of the argument in the interval [-pi, +pi].
Calling interface:
double carg(double _Complex z);
long double cargl(long double _Complex z);
float cargf(float _Complex z);

casin
Description: The casin function returns the complex inverse sine of z.
Calling interface:
double _Complex casin(double _Complex z);
long double _Complex casinl(long double _Complex z);
float _Complex casinf(float _Complex z);

casinh
Description: The casinh function returns the complex inverse hyperbolic sine of z.
Calling interface:
double _Complex casinh(double _Complex z);
long double _Complex casinhl(long double _Complex z);
float _Complex casinhf(float _Complex z);

catan
Description: The catan function returns the complex inverse tangent of z.
Calling interface:
double _Complex catan(double _Complex z);
long double _Complex catanl(long double _Complex z);
float _Complex catanf(float _Complex z);

catanh
Description: The catanh function returns the complex inverse hyperbolic tangent of z.
Calling interface:
double _Complex catanh(double _Complex z);
long double _Complex catanhl(long double _Complex z);
float _Complex catanhf(float _Complex z);

ccos
Description: The ccos function returns the complex cosine of z.
Calling interface:
double _Complex ccos(double _Complex z);
long double _Complex ccosl(long double _Complex z);
float _Complex ccosf(float _Complex z);

ccosh
Description: The ccosh function returns the complex hyperbolic cosine of z.
Calling interface:
double _Complex ccosh(double _Complex z);
long double _Complex ccoshl(long double _Complex z);
float _Complex ccoshf(float _Complex z);

cexp
Description: The cexp function returns e^z (e raised to the power z).
Calling interface:
double _Complex cexp(double _Complex z);
long double _Complex cexpl(long double _Complex z);
float _Complex cexpf(float _Complex z);

cexp2
Description: The cexp2 function returns 2^z (2 raised to the power z).
Calling interface:
double _Complex cexp2(double _Complex z);
long double _Complex cexp2l(long double _Complex z);
float _Complex cexp2f(float _Complex z);

cexp10
Description: The cexp10 function returns 10^z (10 raised to the power z).
Calling interface:
double _Complex cexp10(double _Complex z);
long double _Complex cexp10l(long double _Complex z);
float _Complex cexp10f(float _Complex z);

cimag
Description: The cimag function returns the imaginary part value of z.
Calling interface:
double cimag(double _Complex z);
long double cimagl(long double _Complex z);
float cimagf(float _Complex z);

cis
Description: The cis function returns the cosine and sine (as a complex value) of x measured in radians.
Calling interface:
double _Complex cis(double x);
long double _Complex cisl(long double x);
float _Complex cisf(float x);

cisd
Description: The cisd function returns the cosine and sine (as a complex value) of x measured in degrees.
Calling interface:
double _Complex cisd(double x);
long double _Complex cisdl(long double x);
float _Complex cisdf(float x);

clog
Description: The clog function returns the complex natural logarithm of z.
Calling interface:
double _Complex clog(double _Complex z);
long double _Complex clogl(long double _Complex z);
float _Complex clogf(float _Complex z);

clog2
Description: The clog2 function returns the complex logarithm base 2 of z.
Calling interface:
double _Complex clog2(double _Complex z);
long double _Complex clog2l(long double _Complex z);
float _Complex clog2f(float _Complex z);

clog10
Description: The clog10 function returns the complex logarithm base 10 of z.
Calling interface:
double _Complex clog10(double _Complex z);
long double _Complex clog10l(long double _Complex z);
float _Complex clog10f(float _Complex z);

conj
Description: The conj function returns the complex conjugate of z by reversing the sign of its imaginary part.
Calling interface:
double _Complex conj(double _Complex z);
long double _Complex conjl(long double _Complex z);
float _Complex conjf(float _Complex z);

cpow
Description: The cpow function returns the complex power function, x^y.

Calling interface:


double _Complex cpow(double _Complex x, double _Complex y);
long double _Complex cpowl(long double _Complex x, long double _Complex y);
float _Complex cpowf(float _Complex x, float _Complex y);

cproj
Description: The cproj function returns a projection of z onto the Riemann sphere.
Calling interface:
double _Complex cproj(double _Complex z);
long double _Complex cprojl(long double _Complex z);
float _Complex cprojf(float _Complex z);

creal
Description: The creal function returns the real part of z.
Calling interface:
double creal(double _Complex z);
long double creall(long double _Complex z);
float crealf(float _Complex z);

csin
Description: The csin function returns the complex sine of z.
Calling interface:
double _Complex csin(double _Complex z);
long double _Complex csinl(long double _Complex z);
float _Complex csinf(float _Complex z);

csinh
Description: The csinh function returns the complex hyperbolic sine of z.
Calling interface:
double _Complex csinh(double _Complex z);
long double _Complex csinhl(long double _Complex z);
float _Complex csinhf(float _Complex z);

csqrt
Description: The csqrt function returns the complex square root of z.
Calling interface:
double _Complex csqrt(double _Complex z);
long double _Complex csqrtl(long double _Complex z);
float _Complex csqrtf(float _Complex z);

ctan
Description: The ctan function returns the complex tangent of z.
Calling interface:
double _Complex ctan(double _Complex z);
long double _Complex ctanl(long double _Complex z);
float _Complex ctanf(float _Complex z);

ctanh
Description: The ctanh function returns the complex hyperbolic tangent of z.
Calling interface:
double _Complex ctanh(double _Complex z);
long double _Complex ctanhl(long double _Complex z);
float _Complex ctanhf(float _Complex z);

C99 Macros

Many routines in the libm library are more highly optimized for Intel® microprocessors than for non-Intel microprocessors. The Intel Math Library and mathimf.h header file support the following C99 macros:
int fpclassify(x);
int isfinite(x);
int isgreater(x, y);
int isgreaterequal(x, y);
int isinf(x);
int isless(x, y);
int islessequal(x, y);


int islessgreater(x, y);
int isnan(x);
int isnormal(x);
int isunordered(x, y);
int signbit(x);

See Also • Miscellaneous Functions

Intel C++ Class Libraries

Introduction to the Class Libraries

Overview: Intel C++ Class Libraries

The Intel C++ Class Libraries enable Single-Instruction, Multiple-Data (SIMD) operations. The principle of SIMD operations is to exploit microprocessor architecture through parallel processing. The effect of parallel processing is increased data throughput using fewer clock cycles. The objective is to improve application performance of complex and computation-intensive audio, video, and graphical data bit streams.

Hardware and Software Requirements

The Intel C++ Class Libraries are functions abstracted from the instruction extensions available on Intel processors, as specified in the table that follows.

Processor Requirements for Use of Class Libraries

Header File   Extension Set                 Available on These Processors

ivec.h        MMX™ technology               Intel® Pentium® processor with MMX™ technology, Intel® Pentium® II processor, Intel® Pentium® III processor, Intel® Pentium® 4 processor, Intel® Xeon® processor, and Intel® Itanium® processor

fvec.h        Streaming SIMD Extensions     Intel® Pentium® III processor, Intel® Pentium® 4 processor, Intel® Xeon® processor, and Intel® Itanium® processor

dvec.h        Streaming SIMD Extensions 2   Intel® Pentium® 4 processor and Intel® Xeon® processor

About the Classes

The Intel C++ Class Libraries for SIMD Operations include: • Integer vector (Ivec) classes • Floating-point vector (Fvec) classes


You can find the definitions for these operations in three header files: ivec.h, fvec.h, and dvec.h. The classes themselves are not partitioned like this. The classes are named according to the underlying type of operation. The header files are partitioned according to architecture: • ivec.h is specific to architectures with MMX™ technology • fvec.h is specific to architectures with Intel® Streaming SIMD Extensions • dvec.h is specific to architectures with Intel® Streaming SIMD Extensions 2

This documentation is intended for programmers writing code for the Intel architecture, particularly code that would benefit from the use of SIMD instructions. You should be familiar with C++ and the use of C++ classes.

Details About the Libraries

The Intel C++ Class Libraries for SIMD Operations provide a convenient interface to access the underlying instructions for processors as specified in Processor Requirements for Use of Class Libraries. These processor-instruction extensions enable parallel processing using the single instruction-multiple data (SIMD) technique as illustrated in the following figure.

SIMD Data Flow

Performing four operations with a single instruction improves efficiency by a factor of four for that particular instruction. These new processor instructions can be implemented using assembly inlining, intrinsics, or the C++ SIMD classes. Compare the coding required to add four 32-bit floating-point values, using each of the available interfaces:


Comparison Between Inlining, Intrinsics and Class Libraries

Assembly Inlining:

... __m128 a,b,c;
__asm{
  movaps xmm0,b
  movaps xmm1,c
  addps xmm0,xmm1
  movaps a, xmm0
} ...

Intrinsics (the xmmintrin.h header shown here is the standard SSE intrinsics header):

#include <xmmintrin.h>
... __m128 a,b,c;
a = _mm_add_ps(b,c); ...

SIMD Class Libraries:

#include <fvec.h>
... F32vec4 a,b,c;
a = b + c; ...

This table shows the addition of two vectors of four single-precision floating-point values using assembly inlining, intrinsics, and the libraries. You can see how much easier it is to code with the Intel C++ SIMD Class Libraries. Besides requiring fewer keystrokes and fewer lines of code, the notation resembles standard C++ notation, making it easier to use than the other methods.

C++ Classes and SIMD Operations

The use of C++ classes for SIMD operations is based on the concept of operating on arrays, or vectors of data, in parallel. Consider the addition of two vectors, A and B, where each vector contains four elements. Using the integer vector (Ivec) class, the elements A[i] and B[i] from each array are summed as shown in the following example.

Typical Method of Adding Elements Using a Loop

short a[4], b[4], c[4];
for (int i = 0; i < 4; i++)   /* needs four iterations */
    c[i] = a[i] + b[i];       /* returns c[0], c[1], c[2], c[3] */

The following example shows the same results using one operation with Ivec Classes.


SIMD Method of Adding Elements Using Ivec Classes

Is16vec4 ivecA, ivecB, ivecC;   /* needs one iteration */
ivecC = ivecA + ivecB;          /* returns ivecC[0], ivecC[1], ivecC[2], ivecC[3] */

Available Classes

The Intel C++ SIMD classes provide parallelism, which is not easily implemented using typical mechanisms of C++. The following table lists the SIMD vector classes, the instruction set extension each requires, and the header file that defines each.

SIMD Vector Classes

Instruction Set                Class      Signedness   Data Type   Size   Elements   Header File

MMX™ technology                I64vec1    unspecified  __m64       64     1          ivec.h
                               I32vec2    unspecified  int         32     2          ivec.h
                               Is32vec2   signed       int         32     2          ivec.h
                               Iu32vec2   unsigned     int         32     2          ivec.h
                               I16vec4    unspecified  short       16     4          ivec.h
                               Is16vec4   signed       short       16     4          ivec.h
                               Iu16vec4   unsigned     short       16     4          ivec.h
                               I8vec8     unspecified  char        8      8          ivec.h
                               Is8vec8    signed       char        8      8          ivec.h
                               Iu8vec8    unsigned     char        8      8          ivec.h

Streaming SIMD Extensions      F32vec4    signed       float       32     4          fvec.h
                               F32vec1    signed       float       32     1          fvec.h

Streaming SIMD Extensions 2    F64vec2    signed       double      64     2          dvec.h
                               I128vec1   unspecified  __m128i     128    1          dvec.h
                               I64vec2    unspecified  long int    64     2          dvec.h
                               Is64vec2   signed       long int    64     2          dvec.h
                               Iu64vec2   unsigned     long int    64     2          dvec.h
                               I32vec4    unspecified  int         32     4          dvec.h
                               Is32vec4   signed       int         32     4          dvec.h
                               Iu32vec4   unsigned     int         32     4          dvec.h
                               I16vec8    unspecified  short       16     8          dvec.h
                               Is16vec8   signed       short       16     8          dvec.h
                               Iu16vec8   unsigned     short       16     8          dvec.h
                               I8vec16    unspecified  char        8      16         dvec.h
                               Is8vec16   signed       char        8      16         dvec.h
                               Iu8vec16   unsigned     char        8      16         dvec.h

Most classes contain similar functionality for all data types and are represented by all available intrinsics. However, some capabilities do not translate from one data type to another without suffering from poor performance, and are therefore excluded from individual classes.

NOTE. Intrinsics that take immediate values and cannot be expressed easily in classes are not implemented (for example, _mm_shuffle_ps, _mm_shuffle_pi16, _mm_extract_pi16, and _mm_insert_pi16).


Access to Classes Using Header Files

The required class header files are installed in the include directory with the Intel® C++ Compiler. To enable the classes, use the #include directive in your program file as shown in the table that follows.

Include Directives for Enabling Classes

Instruction Set Extension        Include Directive

MMX Technology                   #include <ivec.h>
Streaming SIMD Extensions        #include <fvec.h>
Streaming SIMD Extensions 2      #include <dvec.h>

Each succeeding header file includes the preceding one, so you need only include fvec.h to use both the Ivec and Fvec classes. Similarly, to use all the classes, including those for the Streaming SIMD Extensions 2, you need only include the dvec.h file.

Usage Precautions When using the C++ classes, you should follow some general guidelines. More detailed usage rules for each class are listed in Integer Vector Classes, and Floating-point Vector Classes. Clear MMX Registers If you use both the Ivec and Fvec classes at the same time, your program could mix MMX instructions, called by Ivec classes, with Intel x87 architecture floating-point instructions, called by Fvec classes. Floating-point instructions exist in the following Fvec functions:

• fvec constructors • debug functions (cout and element access) • rsqrt_nr

NOTE. MMX registers are aliased on the floating-point registers, so you should clear the MMX state with the EMMS instruction intrinsic before issuing an x87 floating-point instruction, as in the following example.


ivecA = ivecA & ivecB;  /* Ivec logical operation that uses MMX instructions */
empty();                /* clear state */
cout << f32vec4a;       /* F32vec4 operation that uses x87 floating-point instructions */

CAUTION. Failure to clear the MMX registers can result in incorrect execution or poor performance due to an incorrect register state.

Follow EMMS Instruction Guidelines Intel strongly recommends that you follow the guidelines for using the EMMS instruction. Refer to this topic before coding with the Ivec classes.

Capabilities of C++ SIMD Classes

The fundamental capabilities of each C++ SIMD class include: • computation • horizontal data motion • branch compression/elimination • caching hints

Understanding each of these capabilities and how they interact is crucial to achieving desired results.

Computation

The SIMD C++ classes contain vertical operator support for most arithmetic operations, including shifting and saturation. Computation operations include +, -, *, /, reciprocal (rcp and rcp_nr), square root (sqrt), and reciprocal square root (rsqrt and rsqrt_nr). The rcp and rsqrt operations are approximating instructions with very short latencies that produce results with at least 12 bits of accuracy; you may get different answers if they are used on non-Intel processors. The rcp_nr and rsqrt_nr operations use software refining techniques to enhance the accuracy of the approximations, with a minimal impact on performance. (The "nr" stands for Newton-Raphson, a mathematical technique for improving the accuracy of an approximate result.)
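To make the Newton-Raphson refinement concrete, here is a minimal scalar sketch of one iteration for a reciprocal approximation. This illustrates the technique only, not the fvec.h implementation; the function name is hypothetical.

```cpp
#include <cmath>

// One Newton-Raphson refinement step for a reciprocal approximation,
// the technique the "nr" suffix refers to. x0 is a rough estimate of
// 1/a (as rcp produces); each iteration roughly doubles the number of
// correct bits. The function name is illustrative, not part of fvec.h.
double refine_reciprocal(double a, double x0) {
    return x0 * (2.0 - a * x0);   // x1 = x0 * (2 - a*x0)
}
```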

Horizontal Data Support

The C++ SIMD classes provide horizontal support for some arithmetic operations. The term "horizontal" indicates computation across the elements of one vector, as opposed to the vertical, element-by-element operations on two different vectors. The add_horizontal, unpack_low, and pack_sat functions are examples of horizontal data support. This support enables certain algorithms that cannot otherwise exploit the full potential of SIMD instructions. Shuffle intrinsics are another example of horizontal data flow. Shuffle intrinsics are not expressed in the C++ classes because of their immediate arguments; however, the C++ class implementation lets you mix shuffle intrinsics with the other C++ functions. For example:

F32vec4 fveca, fvecb, fvecd;
fveca += fvecb;
fvecd = _mm_shuffle_ps(fveca, fvecb, 0);
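The vertical/horizontal distinction above can be modeled in scalar code. The sketch below shows what a horizontal reduction such as add_horizontal computes; the name is illustrative and a plain array stands in for an Is16vec4.

```cpp
#include <array>

// Scalar model of a "horizontal" operation: it reduces across the
// elements of one vector, in contrast to vertical element-by-element
// operations on two vectors. This is an illustration only, not the
// ivec.h implementation.
int add_horizontal_model(const std::array<short, 4>& a) {
    return a[0] + a[1] + a[2] + a[3];
}
```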

Branch Compression/Elimination

Branching in SIMD architectures can be complicated and expensive. The SIMD C++ classes provide functions to eliminate branches, using logical operations, max and min functions, conditional selects, and compares. Consider the following example:

short a[4], b[4], c[4];
for (i=0; i<4; i++)
  c[i] = a[i] > b[i] ? a[i] : b[i];

The operation performed is the same for every i; for each i, the result is either a[i] or b[i], depending on the actual values. A simple way of removing the branch altogether is to use the select_gt function, as follows:

Is16vec4 a, b, c;
c = select_gt(a, b, a, b);
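Under the hood, a SIMD conditional select turns each compare result into an all-ones or all-zero mask and blends with it, which is what removes the branch. The following scalar sketch models one element of select_gt under that assumption; the function name is illustrative, not part of ivec.h.

```cpp
#include <cstdint>

// Mask-based model of per-element branch elimination: the compare
// produces an all-ones or all-zero mask, which then blends the two
// source values without a branch.
int16_t select_gt_element(int16_t a, int16_t b, int16_t c, int16_t d) {
    uint16_t mask = (a > b) ? 0xFFFFu : 0x0000u;           // compare result as a mask
    return (int16_t)((c & mask) | (d & (uint16_t)~mask));  // branch-free blend
}
```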


Caching Hints Streaming SIMD Extensions provide prefetching and streaming hints. Prefetching data can minimize the effects of memory latency. Streaming hints allow you to indicate that certain data should not be cached.

Integer Vector Classes

Overview: Integer Vector Classes

The Ivec classes provide an interface to SIMD processing using integer vectors of various sizes. The class hierarchy is represented in the following figure. Ivec Class Hierarchy

The M64 and M128 classes define the __m64 and __m128i data types from which the rest of the Ivec classes are derived. The first generation of child classes is derived based solely on bit sizes of 128, 64, 32, 16, and 8 for the I128vec1, I64vec1, I64vec2, I32vec2, I32vec4, I16vec4, I16vec8, I8vec16, and I8vec8 classes, respectively. The latter seven of these classes require specification of signedness and saturation.

CAUTION. Do not intermix the M64 and M128 data types. You will get unexpected behavior if you do.

The signedness is indicated by the s and u in the class names:


Is64vec2 Iu64vec2 Is32vec4 Iu32vec4 Is16vec8 Iu16vec8 Is8vec16 Iu8vec16 Is32vec2 Iu32vec2 Is16vec4 Iu16vec4 Is8vec8 Iu8vec8

Terms, Conventions, and Syntax Defined

The following are special terms and syntax used in this chapter to describe functionality of the classes with respect to their associated operations.

Ivec Class Syntax Conventions

The name of each class denotes the data type, signedness, bit size, and number of elements, using the following generic format:

{ F | I } { s | u } { 64 | 32 | 16 | 8 } vec { 8 | 4 | 2 | 1 }

where

type indicates floating point ( F ) or integer ( I )

signedness indicates signed ( s ) or unsigned ( u ). For the Ivec class, leaving this field blank indicates an intermediate class. There are no unsigned Fvec classes, therefore for the Fvec classes, this field is blank.

bits specifies the number of bits per element

elements specifies the number of elements


Special Terms and Conventions The following terms are used to define the functionality and characteristics of the classes and operations defined in this manual.

• Nearest Common Ancestor -- This is the intermediate or parent class of two classes of the same size. For example, the nearest common ancestor of Iu8vec8 and Is8vec8 is I8vec8. Also, the nearest common ancestor of Iu8vec8 and I16vec4 is M64.

• Casting -- Changes the data type from one class to another. When an operation uses different data types as operands, the return value of the operation must be assigned to a single data type; therefore, one or more of the data types must be converted to a required data type. This conversion is known as a typecast. Sometimes typecasting is automatic; other times you must use special syntax to typecast explicitly.

• Operator Overloading -- This is the ability to use various operators on the same user-defined data type of a given class. Once you declare a variable, you can add, subtract, multiply, and perform a range of operations. Each family of classes accepts a specified range of operators, and must comply with rules and restrictions regarding typecasting and operator overloading as defined in the header files. The following table shows the notation used in this documentation to address typecasting, operator overloading, and other rules.

Class Syntax Notation Conventions

Class Name Description

I[s|u][N]vec[N] Any value except I128vec1 or I64vec1

I64vec1 __m64 data type

I[s|u]64vec2 two 64-bit values of any signedness

I[s|u]32vec4 four 32-bit values of any signedness

I[s|u]8vec16 sixteen 8-bit values of any signedness

I[s|u]16vec8 eight 16-bit values of any signedness

I[s|u]32vec2 two 32-bit values of any signedness

I[s|u]16vec4 four 16-bit values of any signedness

I[s|u]8vec8 eight 8-bit values of any signedness


Rules for Operators

To use operators with the Ivec classes you must use one of the following three syntax conventions:

[ Ivec_Class ] R = [ Ivec_Class ] A [ operator ] [ Ivec_Class ] B
Example 1: I64vec1 R = I64vec1 A & I64vec1 B;

[ Ivec_Class ] R = [ operator ]([ Ivec_Class ] A, [ Ivec_Class ] B)
Example 2: I64vec1 R = andnot(I64vec1 A, I64vec1 B);

[ Ivec_Class ] R [ operator ]= [ Ivec_Class ] A
Example 3: I64vec1 R &= I64vec1 A;

where

[ operator ] is an operator (for example, &, |, or ^)
[ Ivec_Class ] is an Ivec class
R, A, B are variables declared using the pertinent Ivec classes

The table that follows shows automatic and explicit sign and size typecasting. "Explicit" means that it is illegal to mix different types without an explicit typecast. "Automatic" means that you can mix types freely and the compiler does the typecasting for you.

Summary of Rules for Major Operators

Operators                Sign Typecasting   Size Typecasting   Other Typecasting Requirements

Assignment               N/A                N/A                N/A

Logical                  Automatic          Automatic          Explicit typecasting is required for
                                            (to left)          different types used in non-logical
                                                               expressions on the right side of the
                                                               assignment.

Addition and             Automatic          Explicit           N/A
Subtraction

Multiplication           Automatic          Explicit           N/A

Shift                    Automatic          Explicit           Casting is required to ensure
                                                               arithmetic shift.

Compare                  Automatic          Explicit           Explicit casting is required for signed
                                                               classes for the less-than or
                                                               greater-than operations.

Conditional Select       Automatic          Explicit           Explicit casting is required for signed
                                                               classes for less-than or greater-than
                                                               operations.

Data Declaration and Initialization

The following table shows literal examples of constructor declarations and data type initialization for all class sizes. All values are initialized with the most significant element on the left and the least significant on the right.

Declaration and Initialization Data Types for Ivec Classes

Operation                 Class      Syntax

Declaration               M128       I128vec1 A; Iu8vec16 A;

Declaration               M64        I64vec1 A; Iu8vec8 A;

__m128i Initialization    M128       I128vec1 A(__m128i m); Iu16vec8 A(__m128i m);

__m64 Initialization      M64        I64vec1 A(__m64 m); Iu8vec8 A(__m64 m);

__int64 Initialization    M64        I64vec1 A = __int64 m; Iu8vec8 A = __int64 m;

int i Initialization      M64        I64vec1 A = int i; Iu8vec8 A = int i;

int Initialization        I32vec2    I32vec2 A(int A1, int A0);
                                     Is32vec2 A(signed int A1, signed int A0);
                                     Iu32vec2 A(unsigned int A1, unsigned int A0);

int Initialization        I32vec4    I32vec4 A(int A3, int A2, int A1, int A0);
                                     Is32vec4 A(signed int A3, ..., signed int A0);
                                     Iu32vec4 A(unsigned int A3, ..., unsigned int A0);

short int Initialization  I16vec4    I16vec4 A(short A3, short A2, short A1, short A0);
                                     Is16vec4 A(signed short A3, ..., signed short A0);
                                     Iu16vec4 A(unsigned short A3, ..., unsigned short A0);

short int Initialization  I16vec8    I16vec8 A(short A7, short A6, ..., short A1, short A0);
                                     Is16vec8 A(signed short A7, ..., signed short A0);
                                     Iu16vec8 A(unsigned short A7, ..., unsigned short A0);

char Initialization       I8vec8     I8vec8 A(char A7, char A6, ..., char A1, char A0);
                                     Is8vec8 A(signed char A7, ..., signed char A0);
                                     Iu8vec8 A(unsigned char A7, ..., unsigned char A0);

char Initialization       I8vec16    I8vec16 A(char A15, ..., char A0);
                                     Is8vec16 A(signed char A15, ..., signed char A0);
                                     Iu8vec16 A(unsigned char A15, ..., unsigned char A0);

Assignment Operator

Any Ivec object can be assigned to any other Ivec object; conversion on assignment from one Ivec object to another is automatic.

Assignment Operator Examples

Is16vec4 A;
Is8vec8 B;
I64vec1 C;
A = B;     /* assign Is8vec8 to Is16vec4 */
B = C;     /* assign I64vec1 to Is8vec8 */
B = A & C; /* assign M64 result of '&' to Is8vec8 */

Logical Operators

The logical operators use the symbols and intrinsics listed in the following table.

Operator Symbols and Syntax Usage

Bitwise     Standard   w/assign   Standard          w/assign   Corresponding
Operation   Symbol     Symbol     Usage             Usage      Intrinsics

AND         &          &=         R = A & B         R &= A     _mm_and_si64, _mm_and_si128

OR          |          |=         R = A | B         R |= A     _mm_or_si64, _mm_or_si128

XOR         ^          ^=         R = A ^ B         R ^= A     _mm_xor_si64, _mm_xor_si128

ANDNOT      andnot     N/A        R = andnot(A, B)  N/A        _mm_andnot_si64, _mm_andnot_si128

Logical Operators and Miscellaneous Exceptions

A and B are converted to M64, and the result is assigned to Iu8vec8:

I64vec1 A;
Is8vec8 B;
Iu8vec8 C;
C = A & B;

Operands of the same size return the nearest common ancestor:

I32vec2 R = Is32vec2 A ^ Iu32vec2 B;

A & B returns M64, which is cast to Iu8vec8:

C = Iu8vec8(A&B) + C;

When A and B are of the same class, they return the same type. When A and B are of different classes, the return value is the return type of the nearest common ancestor.
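As a point of reference, the andnot operator maps to _mm_andnot_si64, which complements its first operand before the AND. Below is a scalar sketch of that semantic; the name is illustrative and a plain 64-bit integer stands in for I64vec1.

```cpp
#include <cstdint>

// Scalar model of andnot: the underlying _mm_andnot_si64 intrinsic
// computes (~a) & b, i.e. it complements its FIRST operand before
// the AND. Illustration only, not the ivec.h implementation.
uint64_t andnot_model(uint64_t a, uint64_t b) {
    return ~a & b;
}
```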


The logical operator return values for the combinations of classes listed in the following table apply when A and B are of different classes.

Ivec Logical Operator Overloading

Return (R)    AND   OR   XOR   NAND     A Operand        B Operand

I64vec1 R     &     |    ^     andnot   I[s|u]64vec2 A   I[s|u]64vec2 B
I64vec2 R     &     |    ^     andnot   I[s|u]64vec2 A   I[s|u]64vec2 B
I32vec2 R     &     |    ^     andnot   I[s|u]32vec2 A   I[s|u]32vec2 B
I32vec4 R     &     |    ^     andnot   I[s|u]32vec4 A   I[s|u]32vec4 B
I16vec4 R     &     |    ^     andnot   I[s|u]16vec4 A   I[s|u]16vec4 B
I16vec8 R     &     |    ^     andnot   I[s|u]16vec8 A   I[s|u]16vec8 B
I8vec8 R      &     |    ^     andnot   I[s|u]8vec8 A    I[s|u]8vec8 B
I8vec16 R     &     |    ^     andnot   I[s|u]8vec16 A   I[s|u]8vec16 B

For logical operators with assignment, the return value of R is always the same data type as the pre-declared value of R, as listed in the table that follows.

Ivec Logical Operator Overloading with Assignment

Return Type (R)   Left Side (R)    AND   OR   XOR   Right Side (Any Ivec Type)

I128vec1          I128vec1 R       &=    |=   ^=    I[s|u][N]vec[N] A;
I64vec1           I64vec1 R        &=    |=   ^=    I[s|u][N]vec[N] A;
I64vec2           I64vec2 R        &=    |=   ^=    I[s|u][N]vec[N] A;
I[x]32vec4        I[x]32vec4 R     &=    |=   ^=    I[s|u][N]vec[N] A;
I[x]32vec2        I[x]32vec2 R     &=    |=   ^=    I[s|u][N]vec[N] A;
I[x]16vec8        I[x]16vec8 R     &=    |=   ^=    I[s|u][N]vec[N] A;
I[x]16vec4        I[x]16vec4 R     &=    |=   ^=    I[s|u][N]vec[N] A;
I[x]8vec16        I[x]8vec16 R     &=    |=   ^=    I[s|u][N]vec[N] A;
I[x]8vec8         I[x]8vec8 R      &=    |=   ^=    I[s|u][N]vec[N] A;

Addition and Subtraction Operators

The addition and subtraction operators return the class of the nearest common ancestor when the right-side operands are of different signedness. The following code provides examples of usage and miscellaneous exceptions.

Syntax Usage for Addition and Subtraction Operators

Return the nearest common ancestor type, I16vec4:

Is16vec4 A;
Iu16vec4 B;
I16vec4 C;
C = A + B;

Return the type of the left-hand operand:

Is16vec4 A;
Iu16vec4 B;
A += B;
B -= A;

Explicitly convert B to Is16vec4:

Is16vec4 A,C;
Iu32vec2 B;
C = A + C;
C = A + (Is16vec4)B;

Addition and Subtraction Operators with Corresponding Intrinsics

Operation     Symbols   Syntax Usage   Corresponding Intrinsics

Addition      +         R = A + B      _mm_add_epi64, _mm_add_epi32,
              +=        R += A         _mm_add_epi16, _mm_add_epi8,
                                       _mm_add_pi32, _mm_add_pi16, _mm_add_pi8

Subtraction   -         R = A - B      _mm_sub_epi64, _mm_sub_epi32,
              -=        R -= A         _mm_sub_epi16, _mm_sub_epi8,
                                       _mm_sub_pi32, _mm_sub_pi16, _mm_sub_pi8

The following table lists addition and subtraction return values for combinations of classes when the right-side operands are of different signedness. The two operands must be the same size; otherwise, you must explicitly indicate the typecast.

Addition and Subtraction Operator Overloading

Return Value (R)   Add   Sub   A Operand         B Operand

I64vec2 R          +     -     I[s|u]64vec2 A    I[s|u]64vec2 B
I32vec4 R          +     -     I[s|u]32vec4 A    I[s|u]32vec4 B
I32vec2 R          +     -     I[s|u]32vec2 A    I[s|u]32vec2 B
I16vec8 R          +     -     I[s|u]16vec8 A    I[s|u]16vec8 B
I16vec4 R          +     -     I[s|u]16vec4 A    I[s|u]16vec4 B
I8vec8 R           +     -     I[s|u]8vec8 A     I[s|u]8vec8 B
I8vec16 R          +     -     I[s|u]8vec16 A    I[s|u]8vec16 B

The following table shows the return data type values for operands of the addition and subtraction operators with assignment. The left-side operand determines the size and signedness of the return value. The right-side operand must be the same size as the left operand; otherwise, you must use an explicit typecast.

Addition and Subtraction with Assignment

Return Value (R)   Left Side (R)    Add   Sub   Right Side (A)

I[x]32vec4         I[x]32vec4 R     +=    -=    I[s|u]32vec4 A;
I[x]32vec2         I[x]32vec2 R     +=    -=    I[s|u]32vec2 A;
I[x]16vec8         I[x]16vec8 R     +=    -=    I[s|u]16vec8 A;
I[x]16vec4         I[x]16vec4 R     +=    -=    I[s|u]16vec4 A;
I[x]8vec16         I[x]8vec16 R     +=    -=    I[s|u]8vec16 A;
I[x]8vec8          I[x]8vec8 R      +=    -=    I[s|u]8vec8 A;

Multiplication Operators

The multiplication operators can only accept and return data types from the I[s|u]16vec4 or I[s|u]16vec8 classes, as shown in the following examples.

Syntax Usage for Multiplication Operators

Explicitly convert B to Is16vec4:

Is16vec4 A,C;
Iu32vec2 B;
C = A * C;
C = A * (Is16vec4)B;

Return the nearest common ancestor type, I16vec4:

Is16vec4 A;
Iu16vec4 B;
I16vec4 C;
C = A * B;

The mul_high and mul_add functions take Is16vec4 data only:

Is16vec4 A,B,C,D;
C = mul_high(A,B);
D = mul_add(A,B);

Multiplication Operators with Corresponding Intrinsics


Symbols     Syntax Usage         Corresponding Intrinsics

* *=        R = A * B            _mm_mullo_pi16, _mm_mullo_epi16
            R *= A

mul_high    R = mul_high(A, B)   _mm_mulhi_pi16, _mm_mulhi_epi16

mul_add     R = mul_add(A, B)    _mm_madd_pi16, _mm_madd_epi16

The multiplication operators always return the nearest common ancestor, as listed in the table that follows. The two operands must be 16 bits in size; otherwise, you must explicitly indicate the typecast.

Multiplication Operator Overloading

R             Mul        A                B

I16vec4 R     *          I[s|u]16vec4 A   I[s|u]16vec4 B
I16vec8 R     *          I[s|u]16vec8 A   I[s|u]16vec8 B
Is16vec4 R    mul_high   Is16vec4 A       Is16vec4 B
Is16vec8 R    mul_high   Is16vec8 A       Is16vec8 B
Is32vec2 R    mul_add    Is16vec4 A       Is16vec4 B
Is32vec4 R    mul_add    Is16vec8 A       Is16vec8 B

The following table shows the return values and data type assignments for operands of the multiplication operators with assignment. All operands must be 16 bits in size. If the operands are not the right size, you must use an explicit typecast.


Multiplication with Assignment

Return Value (R) Left Side (R) Mul Right Side (A)

I[x]16vec8 I[x]16vec8 *= I[s|u]16vec8 A;

I[x]16vec4 I[x]16vec4 *= I[s|u]16vec4 A;
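The distinction between mul_high and mul_add can be modeled element-wise in scalar code. The sketch below mirrors the semantics of _mm_mulhi_pi16 (keep the high 16 bits of each 32-bit product) and _mm_madd_pi16 (sum adjacent 32-bit products); plain arrays stand in for the Ivec classes, and the function names are illustrative.

```cpp
#include <array>
#include <cstdint>

// mul_high model: each result element is the high 16 bits of the
// corresponding 32-bit product (_mm_mulhi_pi16 semantics).
std::array<int16_t, 4> mul_high_model(std::array<int16_t, 4> a,
                                      std::array<int16_t, 4> b) {
    std::array<int16_t, 4> r{};
    for (int i = 0; i < 4; ++i)
        r[i] = (int16_t)(((int32_t)a[i] * b[i]) >> 16);
    return r;
}

// mul_add model: adjacent 32-bit products are summed pairwise into
// two 32-bit results (_mm_madd_pi16 semantics).
std::array<int32_t, 2> mul_add_model(std::array<int16_t, 4> a,
                                     std::array<int16_t, 4> b) {
    return { (int32_t)a[0]*b[0] + (int32_t)a[1]*b[1],
             (int32_t)a[2]*b[2] + (int32_t)a[3]*b[3] };
}
```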

Shift Operators

The right shift argument can be any integer or Ivec value, and is implicitly converted to an M64 data type. The first or left operand of a << can be of any type except I[s|u]8vec[8|16].

Example Syntax Usage for Shift Operators

Automatic size and sign conversion:

Is16vec4 A,C;
Iu32vec2 B;
C = A << B;

A & B returns I16vec4, which must be cast to Iu16vec4 to ensure logical shift, not arithmetic shift:

Is16vec4 A, C;
Iu16vec4 B, R;
R = (Iu16vec4)(A & B) << C;

A & B returns I16vec4, which must be cast to Is16vec4 to ensure arithmetic shift, not logical shift:

R = (Is16vec4)(A & B) << C;

Shift Operators with Corresponding Intrinsics


Operation     Symbols   Syntax Usage   Corresponding Intrinsics

Shift Left    <<        R = A << B     _mm_sll_si64, _mm_slli_si64,
              <<=       R <<= A        _mm_sll_pi32, _mm_slli_pi32,
                                       _mm_sll_pi16, _mm_slli_pi16

Shift Right   >>        R = A >> B     _mm_srl_si64, _mm_srli_si64,
              >>=       R >>= A        _mm_srl_pi32, _mm_srli_pi32,
                                       _mm_srl_pi16, _mm_srli_pi16,
                                       _mm_sra_pi32, _mm_srai_pi32,
                                       _mm_sra_pi16, _mm_srai_pi16

Right shift operations with signed data types use arithmetic shifts. All unsigned and intermediate classes use logical shifts. The following table shows how the return type is determined by the first argument type.

Shift Operator Overloading

Shift Type    R            Right Shift   Left Shift   A             B

Logical       I64vec1 R    >>  >>=       <<  <<=      I64vec1 A     I64vec1 B;
Logical       I32vec2 R    >>  >>=       <<  <<=      I32vec2 A     I32vec2 B;
Arithmetic    Is32vec2 R   >>  >>=       <<  <<=      Is32vec2 A    I[s|u][N]vec[N] B;
Logical       Iu32vec2 R   >>  >>=       <<  <<=      Iu32vec2 A    I[s|u][N]vec[N] B;
Logical       I16vec4 R    >>  >>=       <<  <<=      I16vec4 A     I16vec4 B;
Arithmetic    Is16vec4 R   >>  >>=       <<  <<=      Is16vec4 A    I[s|u][N]vec[N] B;
Logical       Iu16vec4 R   >>  >>=       <<  <<=      Iu16vec4 A    I[s|u][N]vec[N] B;
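The arithmetic/logical distinction above matches ordinary C++ shifts, which is why the casts in the earlier examples select the shift kind. A minimal sketch, assuming the common implementation-defined behavior that signed right shift is arithmetic:

```cpp
#include <cstdint>

// Right-shifting a signed value shifts in sign bits (arithmetic, on
// common implementations), while right-shifting an unsigned value
// shifts in zeros (logical). Casting the operand therefore selects
// the shift kind, just as with the Ivec classes.
int16_t  arithmetic_shift(int16_t v, int n)  { return (int16_t)(v >> n); }
uint16_t logical_shift(uint16_t v, int n)    { return (uint16_t)(v >> n); }
```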

Comparison Operators

The equality and inequality comparison operands can have mixed signedness, but they must be of the same size. The comparison operators for less-than and greater-than must be of the same sign and size.

Example Syntax Usage for Comparison Operators

The nearest common ancestor is returned for equal/not-equal comparisons:

Iu8vec8 A;
Is8vec8 B;
I8vec8 C;
C = cmpneq(A,B);

A typecast is needed for different-sized elements in equal/not-equal comparisons:

Iu8vec8 A, C;
Is16vec4 B;
C = cmpeq(A,(Iu8vec8)B);

A typecast is needed for sign or size differences in less-than and greater-than comparisons:

Iu16vec4 A;
Is16vec4 B, C;
C = cmpge((Is16vec4)A,B);
C = cmpgt(B,C);

Inequality Comparison Symbols and Corresponding Intrinsics


Compare For:               Operators   Syntax             Corresponding Intrinsics

Equality                   cmpeq       R = cmpeq(A, B)    _mm_cmpeq_pi32, _mm_cmpeq_pi16,
                                                          _mm_cmpeq_pi8

Inequality                 cmpneq      R = cmpneq(A, B)   _mm_cmpeq_pi32, _mm_cmpeq_pi16,
                                                          _mm_cmpeq_pi8, _mm_andnot_si64

Greater Than               cmpgt       R = cmpgt(A, B)    _mm_cmpgt_pi32, _mm_cmpgt_pi16,
                                                          _mm_cmpgt_pi8

Greater Than or Equal To   cmpge       R = cmpge(A, B)    _mm_cmpgt_pi32, _mm_cmpgt_pi16,
                                                          _mm_cmpgt_pi8, _mm_andnot_si64

Less Than                  cmplt       R = cmplt(A, B)    _mm_cmpgt_pi32, _mm_cmpgt_pi16,
                                                          _mm_cmpgt_pi8

Less Than or Equal To      cmple       R = cmple(A, B)    _mm_cmpgt_pi32, _mm_cmpgt_pi16,
                                                          _mm_cmpgt_pi8, _mm_andnot_si64

Comparison operators have the restriction that the operands must be of the same size and sign, as listed in the Compare Operator Overloading table.

Compare Operator Overloading

R            Comparison                    A                B

I32vec2 R    cmpeq, cmpneq                 I[s|u]32vec2 A   I[s|u]32vec2 B
I16vec4 R                                  I[s|u]16vec4 A   I[s|u]16vec4 B
I8vec8 R                                   I[s|u]8vec8 A    I[s|u]8vec8 B

I32vec2 R    cmpgt, cmpge, cmplt, cmple    Is32vec2 A       Is32vec2 B
I16vec4 R                                  Is16vec4 A       Is16vec4 B
I8vec8 R                                   Is8vec8 A        Is8vec8 B
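SIMD compares return per-element masks of all ones or all zeros rather than a bool, which is what allows cmpneq to be synthesized from cmpeq plus a complementing AND, as the intrinsic lists suggest. A scalar model of one element, with illustrative names:

```cpp
#include <cstdint>

// Per-element compare model: "true" is an all-ones element, "false"
// all zeros, so the result can feed logical and select operations.
uint16_t cmpeq_element(int16_t a, int16_t b)  { return a == b ? 0xFFFFu : 0x0000u; }

// cmpneq built by complementing the cmpeq mask.
uint16_t cmpneq_element(int16_t a, int16_t b) { return (uint16_t)~cmpeq_element(a, b); }
```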

Conditional Select Operators

For conditional select operands, the third and fourth operands determine the type returned. Third and fourth operands of the same size, but different signedness, return the nearest common ancestor data type.

Conditional Select Syntax Usage

Return the nearest common ancestor data type if the third and fourth operands are of the same size, but different signedness:

I16vec4 R = select_neq(Is16vec4, Is16vec4, Is16vec4, Iu16vec4);

Conditional Select for Equality:

R0 := (A0 == B0) ? C0 : D0;
R1 := (A1 == B1) ? C1 : D1;
R2 := (A2 == B2) ? C2 : D2;
R3 := (A3 == B3) ? C3 : D3;

Conditional Select for Inequality:

R0 := (A0 != B0) ? C0 : D0;
R1 := (A1 != B1) ? C1 : D1;
R2 := (A2 != B2) ? C2 : D2;
R3 := (A3 != B3) ? C3 : D3;

Conditional Select Symbols and Corresponding Intrinsics


Conditional                Operators    Syntax                       Corresponding              Additional Intrinsics
Select For:                                                          Intrinsics                 (Applies to All)

Equality                   select_eq    R = select_eq(A, B, C, D)    _mm_cmpeq_pi32,            _mm_and_si64,
                                                                     _mm_cmpeq_pi16,            _mm_or_si64,
                                                                     _mm_cmpeq_pi8              _mm_andnot_si64

Inequality                 select_neq   R = select_neq(A, B, C, D)   _mm_cmpeq_pi32,
                                                                     _mm_cmpeq_pi16,
                                                                     _mm_cmpeq_pi8

Greater Than               select_gt    R = select_gt(A, B, C, D)    _mm_cmpgt_pi32,
                                                                     _mm_cmpgt_pi16,
                                                                     _mm_cmpgt_pi8

Greater Than or Equal To   select_ge    R = select_ge(A, B, C, D)    _mm_cmpge_pi32,
                                                                     _mm_cmpge_pi16,
                                                                     _mm_cmpge_pi8

Less Than                  select_lt    R = select_lt(A, B, C, D)    _mm_cmplt_pi32,
                                                                     _mm_cmplt_pi16,
                                                                     _mm_cmplt_pi8

Less Than or Equal To      select_le    R = select_le(A, B, C, D)    _mm_cmple_pi32,
                                                                     _mm_cmple_pi16,
                                                                     _mm_cmple_pi8

All conditional select operands must be of the same size. The return data type is the nearest common ancestor of operands C and D. For conditional select operations using greater-than or less-than operations, the first and second operands must be signed as listed in the table that follows. Conditional Select Operator Overloading


R            Comparison     A and B          C                D

I32vec2 R    select_eq,     I[s|u]32vec2     I[s|u]32vec2     I[s|u]32vec2
             select_neq
I16vec4 R                   I[s|u]16vec4     I[s|u]16vec4     I[s|u]16vec4
I8vec8 R                    I[s|u]8vec8      I[s|u]8vec8      I[s|u]8vec8

I32vec2 R    select_gt,     Is32vec2         Is32vec2         Is32vec2
             select_ge,
I16vec4 R    select_lt,     Is16vec4         Is16vec4         Is16vec4
             select_le
I8vec8 R                    Is8vec8          Is8vec8          Is8vec8

The following table shows the mapping of return values R0 through R7 for any number of elements. The same return value mappings also apply when there are fewer than eight return values.

Conditional Select Operator Return Value Mapping

Return Value   A Operand   Available Operators     B Operand   C and D Operands

R0 :=          A0          ==  !=  >  >=  <  <=    B0          C0 : D0;
R1 :=          A1          ==  !=  >  >=  <  <=    B1          C1 : D1;
R2 :=          A2          ==  !=  >  >=  <  <=    B2          C2 : D2;
R3 :=          A3          ==  !=  >  >=  <  <=    B3          C3 : D3;
R4 :=          A4          ==  !=  >  >=  <  <=    B4          C4 : D4;
R5 :=          A5          ==  !=  >  >=  <  <=    B5          C5 : D5;
R6 :=          A6          ==  !=  >  >=  <  <=    B6          C6 : D6;
R7 :=          A7          ==  !=  >  >=  <  <=    B7          C7 : D7;

Debug Operations

The debug operations do not map to any compiler intrinsics for MMX™ instructions. They are provided for debugging programs only. Use of these operations may result in loss of performance, so you should not use them outside of debugging.

Output

The four 32-bit values of A are placed in the output buffer and printed in the following format (default in decimal):

cout << Is32vec4 A;
cout << Iu32vec4 A;
cout << hex << Iu32vec4 A; /* print in hex format */
"[3]:A3 [2]:A2 [1]:A1 [0]:A0"

Corresponding Intrinsics: none

The two 32-bit values of A are placed in the output buffer and printed in the following format (default in decimal):

cout << Is32vec2 A;
cout << Iu32vec2 A;
cout << hex << Iu32vec2 A; /* print in hex format */
"[1]:A1 [0]:A0"

Corresponding Intrinsics: none

The eight 16-bit values of A are placed in the output buffer and printed in the following format (default in decimal):

cout << Is16vec8 A;
cout << Iu16vec8 A;
cout << hex << Iu16vec8 A; /* print in hex format */
"[7]:A7 [6]:A6 [5]:A5 [4]:A4 [3]:A3 [2]:A2 [1]:A1 [0]:A0"

Corresponding Intrinsics: none

The four 16-bit values of A are placed in the output buffer and printed in the following format (default in decimal):

cout << Is16vec4 A;
cout << Iu16vec4 A;
cout << hex << Iu16vec4 A; /* print in hex format */
"[3]:A3 [2]:A2 [1]:A1 [0]:A0"

Corresponding Intrinsics: none

The sixteen 8-bit values of A are placed in the output buffer and printed in the following format (default is decimal):

cout << Is8vec16 A;
cout << Iu8vec16 A;
cout << hex << Iu8vec16 A; /* print in hex format instead of decimal */
"[15]:A15 [14]:A14 [13]:A13 [12]:A12 [11]:A11 [10]:A10 [9]:A9 [8]:A8 [7]:A7 [6]:A6 [5]:A5 [4]:A4 [3]:A3 [2]:A2 [1]:A1 [0]:A0"

Corresponding Intrinsics: none

The eight 8-bit values of A are placed in the output buffer and printed in the following format (default is decimal):

cout << Is8vec8 A;
cout << Iu8vec8 A;
cout << hex << Iu8vec8 A; /* print in hex format instead of decimal */
"[7]:A7 [6]:A6 [5]:A5 [4]:A4 [3]:A3 [2]:A2 [1]:A1 [0]:A0"

Corresponding Intrinsics: none
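The "[i]:Ai" layout described above can be sketched with a hypothetical formatter for a 4-element vector. The real stream operators live in the class headers; this only models the printed layout, with the most significant element first.

```cpp
#include <sstream>
#include <string>

// Model of the debug-output layout "[3]:A3 [2]:A2 [1]:A1 [0]:A0".
// The function name and the plain array are illustrative only.
std::string format_vec4(const int (&a)[4]) {
    std::ostringstream os;
    for (int i = 3; i >= 0; --i) {   // most significant element on the left
        os << "[" << i << "]:" << a[i];
        if (i) os << " ";
    }
    return os.str();
}
```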

Element Access Operators

int R = Is64vec2 A[i];
unsigned int R = Iu64vec2 A[i];
int R = Is32vec4 A[i];
unsigned int R = Iu32vec4 A[i];
int R = Is32vec2 A[i];
unsigned int R = Iu32vec2 A[i];
short R = Is16vec8 A[i];
unsigned short R = Iu16vec8 A[i];
short R = Is16vec4 A[i];
unsigned short R = Iu16vec4 A[i];
signed char R = Is8vec16 A[i];
unsigned char R = Iu8vec16 A[i];
signed char R = Is8vec8 A[i];
unsigned char R = Iu8vec8 A[i];

Access and read element i of A. If DEBUG is enabled and the user tries to access an element outside of A, a diagnostic message is printed and the program aborts.

Corresponding Intrinsics: none

Element Assignment Operators

Is64vec2 A[i] = int R;
Is32vec4 A[i] = int R;
Iu32vec4 A[i] = unsigned int R;
Is32vec2 A[i] = int R;
Iu32vec2 A[i] = unsigned int R;
Is16vec8 A[i] = short R;
Iu16vec8 A[i] = unsigned short R;
Is16vec4 A[i] = short R;
Iu16vec4 A[i] = unsigned short R;
Is8vec16 A[i] = signed char R;
Iu8vec16 A[i] = unsigned char R;
Is8vec8 A[i] = signed char R;
Iu8vec8 A[i] = unsigned char R;

Assign R to element i of A. If DEBUG is enabled and the user tries to assign a value to an element outside of A, a diagnostic message is printed and the program aborts.

Corresponding Intrinsics: none
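The DEBUG-mode range check described above can be sketched as follows. The Vec4Model class is purely illustrative (standing in for a 4-element Ivec class) and models only the diagnose-and-abort behavior, not the actual ivec.h source.

```cpp
#include <cstdio>
#include <cstdlib>

// Sketch of DEBUG-style bounds checking on element access/assignment:
// an out-of-range index prints a diagnostic and aborts the program.
struct Vec4Model {
    short e[4];
    short& operator[](int i) {
        if (i < 0 || i > 3) {                           // DEBUG-style range check
            std::fprintf(stderr, "index %d out of range\n", i);
            std::abort();
        }
        return e[i];                                    // reference allows read and assignment
    }
};
```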

Unpack Operators

Interleave the 64-bit value from the high half of A with the 64-bit value from the high half of B.

I64vec2 unpack_high(I64vec2 A, I64vec2 B);
Is64vec2 unpack_high(Is64vec2 A, Is64vec2 B);
Iu64vec2 unpack_high(Iu64vec2 A, Iu64vec2 B);
R0 = A1;
R1 = B1;
Corresponding intrinsic: _mm_unpackhi_epi64

Interleave the two 32-bit values from the high half of A with the two 32-bit values from the high half of B.

I32vec4 unpack_high(I32vec4 A, I32vec4 B);
Is32vec4 unpack_high(Is32vec4 A, Is32vec4 B);
Iu32vec4 unpack_high(Iu32vec4 A, Iu32vec4 B);
R0 = A2;
R1 = B2;
R2 = A3;
R3 = B3;
Corresponding intrinsic: _mm_unpackhi_epi32

Interleave the 32-bit value from the high half of A with the 32-bit value from the high half of B.

I32vec2 unpack_high(I32vec2 A, I32vec2 B);
Is32vec2 unpack_high(Is32vec2 A, Is32vec2 B);
Iu32vec2 unpack_high(Iu32vec2 A, Iu32vec2 B);
R0 = A1;
R1 = B1;
Corresponding intrinsic: _mm_unpackhi_pi32
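The interleave pattern can be modeled in scalar code. The sketch below mirrors _mm_unpackhi_epi32 on four 32-bit elements: the high halves of A and B are interleaved. A plain array stands in for I32vec4, and the function name is illustrative.

```cpp
#include <array>

// Model of unpack_high on 4 x 32-bit elements (_mm_unpackhi_epi32):
// R = { A2, B2, A3, B3 } -- the high halves of A and B, interleaved.
std::array<int, 4> unpack_high_model(std::array<int, 4> a,
                                     std::array<int, 4> b) {
    return { a[2], b[2], a[3], b[3] };
}
```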


Interleave the four 16-bit values from the high half of A with the four 16-bit values from the high half of B.

I16vec8 unpack_high(I16vec8 A, I16vec8 B);
Is16vec8 unpack_high(Is16vec8 A, Is16vec8 B);
Iu16vec8 unpack_high(Iu16vec8 A, Iu16vec8 B);
R0 = A4;
R1 = B4;
R2 = A5;
R3 = B5;
R4 = A6;
R5 = B6;
R6 = A7;
R7 = B7;
Corresponding intrinsic: _mm_unpackhi_epi16

Interleave the two 16-bit values from the high half of A with the two 16-bit values from the high half of B.

I16vec4 unpack_high(I16vec4 A, I16vec4 B);
Is16vec4 unpack_high(Is16vec4 A, Is16vec4 B);
Iu16vec4 unpack_high(Iu16vec4 A, Iu16vec4 B);
R0 = A2;
R1 = B2;
R2 = A3;
R3 = B3;
Corresponding intrinsic: _mm_unpackhi_pi16

Interleave the four 8-bit values from the high half of A with the four 8-bit values from the high half of B.

I8vec8 unpack_high(I8vec8 A, I8vec8 B);
Is8vec8 unpack_high(Is8vec8 A, Is8vec8 B);
Iu8vec8 unpack_high(Iu8vec8 A, Iu8vec8 B);
R0 = A4;
R1 = B4;
R2 = A5;
R3 = B5;
R4 = A6;
R5 = B6;
R6 = A7;
R7 = B7;
Corresponding intrinsic: _mm_unpackhi_pi8

Interleave the eight 8-bit values from the high half of A with the eight 8-bit values from the high half of B.


I8vec16 unpack_high(I8vec16 A, I8vec16 B);
Is8vec16 unpack_high(Is8vec16 A, Is8vec16 B);
Iu8vec16 unpack_high(Iu8vec16 A, Iu8vec16 B);
R0 = A8;
R1 = B8;
R2 = A9;
R3 = B9;
R4 = A10;
R5 = B10;
R6 = A11;
R7 = B11;
R8 = A12;
R9 = B12;
R10 = A13;
R11 = B13;
R12 = A14;
R13 = B14;
R14 = A15;
R15 = B15;
Corresponding intrinsic: _mm_unpackhi_epi8

Interleave the 64-bit value from the low half of A with the 64-bit value from the low half of B.

I64vec2 unpack_low(I64vec2 A, I64vec2 B);
Is64vec2 unpack_low(Is64vec2 A, Is64vec2 B);
Iu64vec2 unpack_low(Iu64vec2 A, Iu64vec2 B);
R0 = A0;
R1 = B0;
Corresponding intrinsic: _mm_unpacklo_epi64

Interleave the two 32-bit values from the low half of A with the two 32-bit values from the low half of B.


I32vec4 unpack_low(I32vec4 A, I32vec4 B); Is32vec4 unpack_low(Is32vec4 A, Is32vec4 B); Iu32vec4 unpack_low(Iu32vec4 A, Iu32vec4 B); R0 = A0; R1 = B0; R2 = A1; R3 = B1; Corresponding intrinsic: _mm_unpacklo_epi32 Interleave the 32-bit value from the low half of A with the 32-bit value from the low half of B. I32vec2 unpack_low(I32vec2 A, I32vec2 B); Is32vec2 unpack_low(Is32vec2 A, Is32vec2 B); Iu32vec2 unpack_low(Iu32vec2 A, Iu32vec2 B); R0 = A0; R1 = B0; Corresponding intrinsic: _mm_unpacklo_pi32 Interleave the four 16-bit values from the low half of A with the four 16-bit values from the low half of B. I16vec8 unpack_low(I16vec8 A, I16vec8 B); Is16vec8 unpack_low(Is16vec8 A, Is16vec8 B); Iu16vec8 unpack_low(Iu16vec8 A, Iu16vec8 B); R0 = A0; R1 = B0; R2 = A1; R3 = B1; R4 = A2; R5 = B2; R6 = A3; R7 = B3; Corresponding intrinsic: _mm_unpacklo_epi16 Interleave the two 16-bit values from the low half of A with the two 16-bit values from the low half of B. I16vec4 unpack_low(I16vec4 A, I16vec4 B); Is16vec4 unpack_low(Is16vec4 A, Is16vec4 B);


Iu16vec4 unpack_low(Iu16vec4 A, Iu16vec4 B); R0 = A0; R1 = B0; R2 = A1; R3 = B1; Corresponding intrinsic: _mm_unpacklo_pi16 Interleave the eight 8-bit values from the low half of A with the eight 8-bit values from the low half of B. I8vec16 unpack_low(I8vec16 A, I8vec16 B); Is8vec16 unpack_low(Is8vec16 A, Is8vec16 B); Iu8vec16 unpack_low(Iu8vec16 A, Iu8vec16 B); R0 = A0; R1 = B0; R2 = A1; R3 = B1; R4 = A2; R5 = B2; R6 = A3; R7 = B3; R8 = A4; R9 = B4; R10 = A5; R11 = B5; R12 = A6; R13 = B6; R14 = A7; R15 = B7; Corresponding intrinsic: _mm_unpacklo_epi8 Interleave the four 8-bit values from the low half of A with the four 8-bit values from the low half of B. I8vec8 unpack_low(I8vec8 A, I8vec8 B); Is8vec8 unpack_low(Is8vec8 A, Is8vec8 B); Iu8vec8 unpack_low(Iu8vec8 A, Iu8vec8 B); R0 = A0; R1 = B0; R2 = A1;


R3 = B1; R4 = A2; R5 = B2; R6 = A3; R7 = B3; Corresponding intrinsic: _mm_unpacklo_pi8
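The element shuffling performed by the unpack operators can be modeled in scalar C++. This is a sketch of the interleaving pattern only, not the Intel class library implementation; the names unpack_low4 and unpack_high4 are illustrative.

```cpp
#include <array>

// Scalar model of the 4-element unpack operators: unpack_low interleaves
// elements from the low halves of A and B, unpack_high from the high halves.
std::array<int, 4> unpack_low4(const std::array<int, 4>& A,
                               const std::array<int, 4>& B) {
    return {A[0], B[0], A[1], B[1]};
}

std::array<int, 4> unpack_high4(const std::array<int, 4>& A,
                                const std::array<int, 4>& B) {
    return {A[2], B[2], A[3], B[3]};
}
```

The same pattern extends to the 8- and 16-element classes: element i of the chosen half of A lands in result position 2i, and the matching element of B in position 2i+1.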

Pack Operators

Pack the eight 32-bit values found in A and B into eight 16-bit values with signed saturation. Is16vec8 pack_sat(Is32vec4 A, Is32vec4 B); Corresponding intrinsic: _mm_packs_epi32 Pack the four 32-bit values found in A and B into four 16-bit values with signed saturation. Is16vec4 pack_sat(Is32vec2 A, Is32vec2 B); Corresponding intrinsic: _mm_packs_pi32 Pack the sixteen 16-bit values found in A and B into sixteen 8-bit values with signed saturation. Is8vec16 pack_sat(Is16vec8 A, Is16vec8 B); Corresponding intrinsic: _mm_packs_epi16 Pack the eight 16-bit values found in A and B into eight 8-bit values with signed saturation. Is8vec8 pack_sat(Is16vec4 A, Is16vec4 B); Corresponding intrinsic: _mm_packs_pi16 Pack the sixteen 16-bit values found in A and B into sixteen 8-bit values with unsigned saturation. Iu8vec16 packu_sat(Is16vec8 A, Is16vec8 B); Corresponding intrinsic: _mm_packus_epi16 Pack the eight 16-bit values found in A and B into eight 8-bit values with unsigned saturation. Iu8vec8 packu_sat(Is16vec4 A, Is16vec4 B); Corresponding intrinsic: _mm_packs_pu16
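Saturation, as used by pack_sat and packu_sat, clamps each source value to the target type's range rather than wrapping on overflow. A scalar sketch of the per-element behavior (the helper names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>

// Clamp a 32-bit value into the signed 16-bit range, as _mm_packs_epi32
// does for each element.
int16_t saturate_i16(int32_t v) {
    return static_cast<int16_t>(std::min(32767, std::max(-32768, v)));
}

// Clamp a signed 16-bit value into the unsigned 8-bit range, as
// _mm_packus_epi16 does for each element.
uint8_t saturate_u8(int16_t v) {
    int32_t w = v;
    return static_cast<uint8_t>(std::min(255, std::max(0, w)));
}
```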

Clear MMX™ State Operator

Empty the MMX™ registers and clear the MMX state. Read the guidelines for using the EMMS instruction intrinsic. void empty(void); Corresponding intrinsic: _mm_empty


Integer Functions for Streaming SIMD Extensions

NOTE. You must include the fvec.h header file to use the following functionality.

Compute the element-wise maximum of the respective signed integer words in A and B. Is16vec4 simd_max(Is16vec4 A, Is16vec4 B); Corresponding intrinsic: _mm_max_pi16 Compute the element-wise minimum of the respective signed integer words in A and B. Is16vec4 simd_min(Is16vec4 A, Is16vec4 B); Corresponding intrinsic: _mm_min_pi16 Compute the element-wise maximum of the respective unsigned bytes in A and B. Iu8vec8 simd_max(Iu8vec8 A, Iu8vec8 B); Corresponding intrinsic: _mm_max_pu8 Compute the element-wise minimum of the respective unsigned bytes in A and B. Iu8vec8 simd_min(Iu8vec8 A, Iu8vec8 B); Corresponding intrinsic: _mm_min_pu8 Create an 8-bit mask from the most significant bits of the bytes in A. int move_mask(I8vec8 A); Corresponding intrinsic: _mm_movemask_pi8 Conditionally store byte elements of A to address p. The high bit of each byte in the selector B determines whether the corresponding byte in A will be stored. void mask_move(I8vec8 A, I8vec8 B, signed char *p); Corresponding intrinsic: _mm_maskmove_si64 Store the data in A to the address p without polluting the caches. A can be any Ivec type. void store_nta(__m64 *p, M64 A); Corresponding intrinsic: _mm_stream_pi Compute the element-wise average of the respective unsigned 8-bit integers in A and B. Iu8vec8 simd_avg(Iu8vec8 A, Iu8vec8 B); Corresponding intrinsic: _mm_avg_pu8 Compute the element-wise average of the respective unsigned 16-bit integers in A and B. Iu16vec4 simd_avg(Iu16vec4 A, Iu16vec4 B);


Corresponding intrinsic: _mm_avg_pu16
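The mask produced by move_mask gathers one sign bit per byte into an integer; a scalar model of that behavior (the function name is illustrative, not part of the library):

```cpp
#include <cstdint>

// Scalar model of move_mask(I8vec8) / _mm_movemask_pi8: bit i of the result
// is the most significant bit of byte i of the input.
int move_mask8(const int8_t bytes[8]) {
    int mask = 0;
    for (int i = 0; i < 8; ++i)
        if (bytes[i] < 0)   // MSB set <=> byte is negative when read as signed
            mask |= 1 << i;
    return mask;
}
```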

Conversions between Fvec and Ivec

Convert the lower double-precision floating-point value of A to a 32-bit integer with truncation. int F64vec2ToInt(F64vec2 A); r := (int)A0; Convert the two least significant single-precision floating-point values of A to two double-precision floating-point values. F64vec2 F32vec4ToF64vec2(F32vec4 A); r0 := (double)A0; r1 := (double)A1; Convert the two double-precision floating-point values of A to two single-precision floating-point values. F32vec4 F64vec2ToF32vec4(F64vec2 A); r0 := (float)A0; r1 := (float)A1; Convert the signed int in B to a double-precision floating-point value and pass the upper double-precision value from A through to the result. F64vec2 InttoF64vec2(F64vec2 A, int B); r0 := (double)B; r1 := A1; Convert the lower floating-point value of A to a 32-bit integer with truncation. int F32vec4ToInt(F32vec4 A); r := (int)A0; Convert the two lower floating-point values of A to two 32-bit integers with truncation, returning the integers in packed form. Is32vec2 F32vec4ToIs32vec2(F32vec4 A); r0 := (int)A0; r1 := (int)A1; Convert the 32-bit integer value B to a floating-point value; the upper three floating-point values are passed through from A. F32vec4 IntToF32vec4(F32vec4 A, int B); r0 := (float)B; r1 := A1; r2 := A2;


r3 := A3; Convert the two 32-bit integer values in packed form in B to two floating-point values; the upper two floating-point values are passed through from A. F32vec4 Is32vec2ToF32vec4(F32vec4 A, Is32vec2 B); r0 := (float)B0; r1 := (float)B1; r2 := A2; r3 := A3;
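All of the float-to-int conversions above truncate toward zero, the same behavior as a C-style cast; a minimal scalar illustration:

```cpp
// F64vec2ToInt / F32vec4ToInt convert the lowest element with truncation
// toward zero, i.e. the fractional part is simply discarded.
int truncate_to_int(double a0) {
    return static_cast<int>(a0);   // 2.9 -> 2, -2.9 -> -2
}
```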

Floating-point Vector Classes

Overview: Floating-point Vector Classes

The floating-point vector classes, F64vec2, F32vec4, and F32vec1, provide an interface to SIMD operations. The class specifications are as follows: F64vec2 A(double x, double y); F32vec4 A(float z, float y, float x, float w); F32vec1 B(float w); The packed floating-point input values are represented with the right-most value lowest as shown in the following table. Single-Precision Floating-point Elements


Fvec Notation Conventions

This reference uses the following conventions for syntax and return values.

Fvec Classes Syntax Notation Fvec classes use the syntax conventions shown in the following examples: [Fvec_Class] R = [Fvec_Class] A [operator] [Fvec_Class] B;

Example 1: F64vec2 R = F64vec2 A & F64vec2 B; [Fvec_Class] R = [operator]([Fvec_Class] A,[Fvec_Class] B);

Example 2: F64vec2 R = andnot(F64vec2 A, F64vec2 B); [Fvec_Class] R [operator]= [Fvec_Class] A;

Example 3: F64vec2 R &= F64vec2 A; where [operator] is an operator (for example, &, |, or ^) and [Fvec_Class] is any Fvec class (F64vec2, F32vec4, or F32vec1)


R, A, B are declared Fvec variables of the type indicated.

Return Value Notation Because the Fvec classes have packed elements, the return values typically follow the conventions presented in the Return Value Convention Notation Mappings table. F32vec4 returns four single-precision, floating-point values (R0, R1, R2, and R3); F64vec2 returns two double-precision, floating-point values, and F32vec1 returns the lowest single-precision floating-point value (R0). Return Value Convention Notation Mappings

Example 1:            Example 2:                 Example 3:    F32vec4  F64vec2  F32vec1

R0 := A0 & B0;        R0 := A0 andnot B0;        R0 &= A0;     x        x        x
R1 := A1 & B1;        R1 := A1 andnot B1;        R1 &= A1;     x        x        N/A
R2 := A2 & B2;        R2 := A2 andnot B2;        R2 &= A2;     x        N/A      N/A
R3 := A3 & B3;        R3 := A3 andnot B3;        R3 &= A3;     x        N/A      N/A

Data Alignment

Memory operations using the Streaming SIMD Extensions should be performed on 16-byte-aligned data whenever possible. F32vec4 and F64vec2 object variables are properly aligned by default. Note that floating-point arrays are not automatically aligned. To get 16-byte alignment, you can use the alignment __declspec: __declspec( align(16) ) float A[4];
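__declspec(align(16)) is the Intel/Microsoft spelling; standard C++11 offers the same effect with alignas. The sketch below uses alignas together with a simple alignment check (the helper name is illustrative):

```cpp
#include <cstdint>

// 16-byte-aligned float array, the portable C++11 equivalent of
// __declspec( align(16) ) float A[4];
alignas(16) static float A[4];

// True when p sits on a 16-byte boundary.
bool is_16_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```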

Conversions

All Fvec object variables can be implicitly converted to __m128 data types. For example, the results of computations performed on F32vec4 or F32vec1 object variables can be assigned to __m128 data types.


__m128d mm = A & B; /* where A,B are F64vec2 object variables */ __m128 mm = A & B; /* where A,B are F32vec4 object variables */ __m128 mm = A & B; /* where A,B are F32vec1 object variables */

Constructors and Initialization

The following table shows how to create and initialize F32vec objects with the Fvec classes.

Constructors and Initialization for Fvec Classes

Example Intrinsic Returns

Constructor Declaration

F64vec2 A; N/A N/A F32vec4 B; F32vec1 C;

__m128 Object Initialization

F64vec2 A(__m128d mm); N/A N/A F32vec4 B(__m128 mm); F32vec1 C(__m128 mm);

Double Initialization

F64vec2 A(double d0, double d1);             _mm_set_pd      A0 := d0;
F64vec2 A = F64vec2(double d0, double d1);                   A1 := d1;
/* Initializes two doubles. */

F64vec2 A(double d0);                        _mm_set1_pd     A0 := d0;
/* Initializes both return values                            A1 := d0;
with the same double-precision value. */


Float Initialization

F32vec4 A(float f3, float f2,                _mm_set_ps      A0 := f0;
float f1, float f0);                                         A1 := f1;
F32vec4 A = F32vec4(float f3, float f2,                      A2 := f2;
float f1, float f0);                                         A3 := f3;

F32vec4 A(float f0);                         _mm_set1_ps     A0 := f0;
/* Initializes all return values                             A1 := f0;
with the same floating-point value. */                       A2 := f0;
                                                             A3 := f0;

F32vec4 A(double d0);                        _mm_set1_ps(d)  A0 := d0;
/* Initializes all return values                             A1 := d0;
with the same double-precision value. */                     A2 := d0;
                                                             A3 := d0;

F32vec1 A(double d0);                        _mm_set_ss(d)   A0 := d0;
/* Initializes the lowest value of A                         A1 := 0;
with d0 and the other values with 0. */                      A2 := 0;
                                                             A3 := 0;

F32vec1 B(float f0);                         _mm_set_ss      B0 := f0;
/* Initializes the lowest value of B                         B1 := 0;
with f0 and the other values with 0. */                      B2 := 0;
                                                             B3 := 0;

F32vec1 B(int I);                            _mm_cvtsi32_ss  B0 := I;
/* Initializes the lowest value of B                         B1, B2, B3
with I; the other values are undefined. */                   undefined


Arithmetic Operators

The following table lists the arithmetic operators of the Fvec classes and generic syntax. The operators have been divided into standard and advanced operations, which are described in more detail later in this section.

Fvec Arithmetic Operators

Category Operation Operators Generic Syntax

Standard Addition + R = A + B; += R += A;

Subtraction - R = A - B; -= R -= A;

Multiplication * R = A * B; *= R *= A;

Division / R = A / B; /= R /= A;

Advanced Square Root sqrt R = sqrt(A);

Reciprocal rcp R = rcp(A); (Newton-Raphson) rcp_nr R = rcp_nr(A);

Reciprocal Square rsqrt R = rsqrt(A); Root rsqrt_nr R = rsqrt_nr(A); (Newton-Raphson)

Standard Arithmetic Operator Usage The following two tables show the return values for each class of the standard arithmetic operators, which use the syntax styles described earlier in the Return Value Notation section. Standard Arithmetic Return Value Mapping

R A Operators B F32vec4 F64vec2 F32vec1

R0:= A0 + - * / B0


R1:= A1 + - * / B1 N/A

R2:= A2 + - * / B2 N/A N/A

R3:= A3 + - * / B3 N/A N/A

Arithmetic with Assignment Return Value Mapping

R Operators A F32vec4 F64vec2 F32vec1

R0:= += -= *= /= A0

R1:= += -= *= /= A1 N/A

R2:= += -= *= /= A2 N/A N/A

R3:= += -= *= /= A3 N/A N/A

This table lists standard arithmetic operator syntax and intrinsics. Standard Arithmetic Operations for Fvec Classes

Operation Returns Example Syntax Intrinsic Usage

Addition
4 floats   F32vec4 R = F32vec4 A + F32vec4 B; F32vec4 R += F32vec4 A;   _mm_add_ps
2 doubles  F64vec2 R = F64vec2 A + F64vec2 B; F64vec2 R += F64vec2 A;   _mm_add_pd
1 float    F32vec1 R = F32vec1 A + F32vec1 B; F32vec1 R += F32vec1 A;   _mm_add_ss

Subtraction
4 floats   F32vec4 R = F32vec4 A - F32vec4 B; F32vec4 R -= F32vec4 A;   _mm_sub_ps
2 doubles  F64vec2 R = F64vec2 A - F64vec2 B; F64vec2 R -= F64vec2 A;   _mm_sub_pd
1 float    F32vec1 R = F32vec1 A - F32vec1 B; F32vec1 R -= F32vec1 A;   _mm_sub_ss

Multiplication
4 floats   F32vec4 R = F32vec4 A * F32vec4 B; F32vec4 R *= F32vec4 A;   _mm_mul_ps
2 doubles  F64vec2 R = F64vec2 A * F64vec2 B; F64vec2 R *= F64vec2 A;   _mm_mul_pd
1 float    F32vec1 R = F32vec1 A * F32vec1 B; F32vec1 R *= F32vec1 A;   _mm_mul_ss

Division
4 floats   F32vec4 R = F32vec4 A / F32vec4 B; F32vec4 R /= F32vec4 A;   _mm_div_ps
2 doubles  F64vec2 R = F64vec2 A / F64vec2 B; F64vec2 R /= F64vec2 A;   _mm_div_pd
1 float    F32vec1 R = F32vec1 A / F32vec1 B; F32vec1 R /= F32vec1 A;   _mm_div_ss

Advanced Arithmetic Operator Usage The following table shows the return values for each class of the advanced arithmetic operators, which use the syntax styles described earlier in the Return Value Notation section. Advanced Arithmetic Return Value Mapping

R Operators A F32vec4 F64vec2 F32vec1

R0:= sqrt rcp rsqrt rcp_nr rsqrt_nr A0

R1:= sqrt rcp rsqrt rcp_nr rsqrt_nr A1 N/A

R2:= sqrt rcp rsqrt rcp_nr rsqrt_nr A2 N/A N/A


R3:= sqrt rcp rsqrt rcp_nr rsqrt_nr A3 N/A N/A

f := add_horizontal(A0 + A1 + A2 + A3)   x   N/A   N/A

d := add_horizontal(A0 + A1)   N/A   x   N/A

This table shows examples for advanced arithmetic operators. Advanced Arithmetic Operations for Fvec Classes

Returns Example Syntax Usage Intrinsic

Square Root

4 floats F32vec4 R = _mm_sqrt_ps sqrt(F32vec4 A);

2 doubles F64vec2 R = _mm_sqrt_pd sqrt(F64vec2 A);

1 float F32vec1 R = _mm_sqrt_ss sqrt(F32vec1 A);

Reciprocal

4 floats F32vec4 R = rcp(F32vec4 _mm_rcp_ps A);

2 doubles F64vec2 R = rcp(F64vec2 _mm_rcp_pd A);

1 float F32vec1 R = rcp(F32vec1 _mm_rcp_ss A);


Reciprocal Square Root

4 floats F32vec4 R = _mm_rsqrt_ps rsqrt(F32vec4 A);

2 doubles F64vec2 R = _mm_rsqrt_pd rsqrt(F64vec2 A);

1 float F32vec1 R = _mm_rsqrt_ss rsqrt(F32vec1 A);

Reciprocal Newton Raphson

4 floats F32vec4 R = _mm_sub_ps rcp_nr(F32vec4 A); _mm_add_ps _mm_mul_ps _mm_rcp_ps

2 doubles F64vec2 R = _mm_sub_pd rcp_nr(F64vec2 A); _mm_add_pd _mm_mul_pd _mm_rcp_pd

1 float F32vec1 R = _mm_sub_ss rcp_nr(F32vec1 A); _mm_add_ss _mm_mul_ss _mm_rcp_ss

Reciprocal Square Root Newton Raphson

4 floats F32vec4 R = _mm_sub_ps rsqrt_nr(F32vec4 A); _mm_mul_ps _mm_rsqrt_ps

2 doubles F64vec2 R = _mm_sub_pd rsqrt_nr(F64vec2 A); _mm_mul_pd _mm_rsqrt_pd

1 float F32vec1 R = _mm_sub_ss rsqrt_nr(F32vec1 A); _mm_mul_ss _mm_rsqrt_ss


Horizontal Add

1 float float f = _mm_add_ss add_horizontal(F32vec4 _mm_shuffle_ss A);

1 double double d = _mm_add_sd add_horizontal(F64vec2 _mm_shuffle_sd A);
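The _nr variants refine the low-precision hardware estimate with one Newton-Raphson iteration. For the reciprocal, the update is x1 = x0 * (2 - a*x0); a scalar sketch (the 0.24 seed in the usage below stands in for the roughly 12-bit estimate _mm_rcp_ps would produce for 1/4):

```cpp
// One Newton-Raphson refinement step for the reciprocal 1/a, as performed
// by rcp_nr after the initial _mm_rcp_* estimate. Each step roughly doubles
// the number of correct bits in the estimate.
double rcp_nr_step(double a, double x0) {
    return x0 * (2.0 - a * x0);
}
```

For example, rcp_nr_step(4.0, 0.24) yields 0.2496, noticeably closer to the true reciprocal 0.25 than the seed.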

Minimum and Maximum Operators

Compute the minimums of the two double-precision floating-point values of A and B. F64vec2 R = simd_min(F64vec2 A, F64vec2 B) R0 := min(A0,B0); R1 := min(A1,B1); Corresponding intrinsic: _mm_min_pd Compute the minimums of the four single-precision floating-point values of A and B. F32vec4 R = simd_min(F32vec4 A, F32vec4 B) R0 := min(A0,B0); R1 := min(A1,B1); R2 := min(A2,B2); R3 := min(A3,B3); Corresponding intrinsic: _mm_min_ps Compute the minimum of the lowest single-precision floating-point values of A and B. F32vec1 R = simd_min(F32vec1 A, F32vec1 B) R0 := min(A0,B0); Corresponding intrinsic: _mm_min_ss Compute the maximums of the two double-precision floating-point values of A and B. F64vec2 R = simd_max(F64vec2 A, F64vec2 B) R0 := max(A0,B0); R1 := max(A1,B1); Corresponding intrinsic: _mm_max_pd Compute the maximums of the four single-precision floating-point values of A and B. F32vec4 R = simd_max(F32vec4 A, F32vec4 B) R0 := max(A0,B0); R1 := max(A1,B1);


R2 := max(A2,B2); R3 := max(A3,B3); Corresponding intrinsic: _mm_max_ps Compute the maximum of the lowest single-precision floating-point values of A and B. F32vec1 R = simd_max(F32vec1 A, F32vec1 B) R0 := max(A0,B0); Corresponding intrinsic: _mm_max_ss
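simd_min and simd_max operate lane by lane; a scalar model for the four-float case (the function name is illustrative, not part of the library):

```cpp
#include <algorithm>
#include <array>

// Scalar model of simd_min on F32vec4 / _mm_min_ps: an independent minimum
// in each element position.
std::array<float, 4> simd_min4(std::array<float, 4> A,
                               const std::array<float, 4>& B) {
    for (int i = 0; i < 4; ++i) A[i] = std::min(A[i], B[i]);
    return A;
}
```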

Logical Operators

The following table lists the logical operators of the Fvec classes and generic syntax. The logical operators for F32vec1 classes use only the lower 32 bits. Fvec Logical Operators Return Value Mapping

Bitwise Operation Operators Generic Syntax

AND & R = A & B; &= R &= A;

OR | R = A | B; |= R |= A;

XOR ^ R = A ^ B; ^= R ^= A;

andnot andnot R = andnot(A, B);

The following table lists standard logical operator syntax and corresponding intrinsics. Note that there is no corresponding scalar intrinsic for the F32vec1 class, which uses the packed vector intrinsics and accesses only the lower 32 bits. Logical Operations for Fvec Classes

Operation Returns Example Syntax Intrinsic Usage

AND
4 floats   F32vec4 R = F32vec4 A & F32vec4 B; F32vec4 R &= F32vec4 A;   _mm_and_ps
2 doubles  F64vec2 R = F64vec2 A & F64vec2 B; F64vec2 R &= F64vec2 A;   _mm_and_pd
1 float    F32vec1 R = F32vec1 A & F32vec1 B; F32vec1 R &= F32vec1 A;   _mm_and_ps

OR
4 floats   F32vec4 R = F32vec4 A | F32vec4 B; F32vec4 R |= F32vec4 A;   _mm_or_ps
2 doubles  F64vec2 R = F64vec2 A | F64vec2 B; F64vec2 R |= F64vec2 A;   _mm_or_pd
1 float    F32vec1 R = F32vec1 A | F32vec1 B; F32vec1 R |= F32vec1 A;   _mm_or_ps

XOR
4 floats   F32vec4 R = F32vec4 A ^ F32vec4 B; F32vec4 R ^= F32vec4 A;   _mm_xor_ps
2 doubles  F64vec2 R = F64vec2 A ^ F64vec2 B; F64vec2 R ^= F64vec2 A;   _mm_xor_pd
1 float    F32vec1 R = F32vec1 A ^ F32vec1 B; F32vec1 R ^= F32vec1 A;   _mm_xor_ps

ANDNOT
2 doubles  F64vec2 R = andnot(F64vec2 A, F64vec2 B);                    _mm_andnot_pd

Compare Operators

The operators described in this section compare the single-precision floating-point values of A and B. Comparison between objects of any Fvec class returns the same class as the operands. The following table lists the compare operators for the Fvec classes.

Compare Operators and Corresponding Intrinsics

Compare For: Operators Syntax

Equality cmpeq R = cmpeq(A, B)

Inequality cmpneq R = cmpneq(A, B)

Greater Than cmpgt R = cmpgt(A, B)

Greater Than or Equal To cmpge R = cmpge(A, B)

Not Greater Than cmpngt R = cmpngt(A, B)

Not Greater Than or Equal To cmpnge R = cmpnge(A, B)


Less Than cmplt R = cmplt(A, B)

Less Than or Equal To cmple R = cmple(A, B)

Not Less Than cmpnlt R = cmpnlt(A, B)

Not Less Than or Equal To cmpnle R = cmpnle(A, B)

Compare Operators The mask is set to 0xffffffff for each floating-point value where the comparison is true and 0x00000000 where the comparison is false. The following table shows the return values for each class of the compare operators, which use the syntax described earlier in the Return Value Notation section.

Compare Operator Return Value Mapping

R                                                   If True      If False     F32vec4  F64vec2  F32vec1

R0 := (A0 cmp[eq | lt | le | gt | ge] B0)           0xffffffff   0x00000000   X        X        X
      !(A0 cmp[ne | nlt | nle | ngt | nge] B0)

R1 := (A1 cmp[eq | lt | le | gt | ge] B1)           0xffffffff   0x00000000   X        X        N/A
      !(A1 cmp[ne | nlt | nle | ngt | nge] B1)

R2 := (A2 cmp[eq | lt | le | gt | ge] B2)           0xffffffff   0x00000000   X        N/A      N/A
      !(A2 cmp[ne | nlt | nle | ngt | nge] B2)

R3 := (A3 cmp[eq | lt | le | gt | ge] B3)           0xffffffff   0x00000000   X        N/A      N/A
      !(A3 cmp[ne | nlt | nle | ngt | nge] B3)

The following table shows examples for compare operators and intrinsics.


Compare Operations for Fvec Classes

Returns Example Syntax Usage Intrinsic

Compare for Equality

4 floats   F32vec4 R = cmpeq(F32vec4 A, F32vec4 B);    _mm_cmpeq_ps
2 doubles  F64vec2 R = cmpeq(F64vec2 A, F64vec2 B);    _mm_cmpeq_pd
1 float    F32vec1 R = cmpeq(F32vec1 A, F32vec1 B);    _mm_cmpeq_ss

Compare for Inequality

4 floats   F32vec4 R = cmpneq(F32vec4 A, F32vec4 B);   _mm_cmpneq_ps
2 doubles  F64vec2 R = cmpneq(F64vec2 A, F64vec2 B);   _mm_cmpneq_pd
1 float    F32vec1 R = cmpneq(F32vec1 A, F32vec1 B);   _mm_cmpneq_ss

Compare for Less Than

4 floats   F32vec4 R = cmplt(F32vec4 A, F32vec4 B);    _mm_cmplt_ps
2 doubles  F64vec2 R = cmplt(F64vec2 A, F64vec2 B);    _mm_cmplt_pd
1 float    F32vec1 R = cmplt(F32vec1 A, F32vec1 B);    _mm_cmplt_ss

Compare for Less Than or Equal

4 floats   F32vec4 R = cmple(F32vec4 A, F32vec4 B);    _mm_cmple_ps
2 doubles  F64vec2 R = cmple(F64vec2 A, F64vec2 B);    _mm_cmple_pd
1 float    F32vec1 R = cmple(F32vec1 A, F32vec1 B);    _mm_cmple_ss

Compare for Greater Than

4 floats   F32vec4 R = cmpgt(F32vec4 A, F32vec4 B);    _mm_cmpgt_ps
2 doubles  F64vec2 R = cmpgt(F64vec2 A, F64vec2 B);    _mm_cmpgt_pd
1 float    F32vec1 R = cmpgt(F32vec1 A, F32vec1 B);    _mm_cmpgt_ss

Compare for Greater Than or Equal To

4 floats   F32vec4 R = cmpge(F32vec4 A, F32vec4 B);    _mm_cmpge_ps
2 doubles  F64vec2 R = cmpge(F64vec2 A, F64vec2 B);    _mm_cmpge_pd
1 float    F32vec1 R = cmpge(F32vec1 A, F32vec1 B);    _mm_cmpge_ss

Compare for Not Less Than

4 floats   F32vec4 R = cmpnlt(F32vec4 A, F32vec4 B);   _mm_cmpnlt_ps
2 doubles  F64vec2 R = cmpnlt(F64vec2 A, F64vec2 B);   _mm_cmpnlt_pd
1 float    F32vec1 R = cmpnlt(F32vec1 A, F32vec1 B);   _mm_cmpnlt_ss

Compare for Not Less Than or Equal

4 floats   F32vec4 R = cmpnle(F32vec4 A, F32vec4 B);   _mm_cmpnle_ps
2 doubles  F64vec2 R = cmpnle(F64vec2 A, F64vec2 B);   _mm_cmpnle_pd
1 float    F32vec1 R = cmpnle(F32vec1 A, F32vec1 B);   _mm_cmpnle_ss

Compare for Not Greater Than

4 floats   F32vec4 R = cmpngt(F32vec4 A, F32vec4 B);   _mm_cmpngt_ps
2 doubles  F64vec2 R = cmpngt(F64vec2 A, F64vec2 B);   _mm_cmpngt_pd
1 float    F32vec1 R = cmpngt(F32vec1 A, F32vec1 B);   _mm_cmpngt_ss

Compare for Not Greater Than or Equal

4 floats   F32vec4 R = cmpnge(F32vec4 A, F32vec4 B);   _mm_cmpnge_ps
2 doubles  F64vec2 R = cmpnge(F64vec2 A, F64vec2 B);   _mm_cmpnge_pd
1 float    F32vec1 R = cmpnge(F32vec1 A, F32vec1 B);   _mm_cmpnge_ss

Conditional Select Operators for Fvec Classes

Each conditional function compares the single-precision floating-point values of A and B. The C and D parameters supply the return values. Comparison between objects of any Fvec class returns the same class.


Conditional Select Operators for Fvec Classes

Conditional Select for: Operators Syntax

Equality select_eq R = select_eq(A, B, C, D)

Inequality select_neq R = select_neq(A, B, C, D)

Greater Than select_gt R = select_gt(A, B, C, D)

Greater Than or Equal To select_ge R = select_ge(A, B, C, D)

Not Greater Than select_ngt R = select_ngt(A, B, C, D)

Not Greater Than or Equal To select_nge R = select_nge(A, B, C, D)

Less Than select_lt R = select_lt(A, B, C, D)


Less Than or Equal To select_le R = select_le(A, B, C, D)

Not Less Than select_nlt R = select_nlt(A, B, C, D)

Not Less Than or Equal To select_nle R = select_nle(A, B, C, D)

Conditional Select Operator Usage For conditional select operators, each element of the return value is taken from C if the comparison is true and from D if it is false. The following table shows the return values for each class of the conditional select operators, using the Return Value Notation described earlier. Conditional Select Operator Return Value Mapping

R                                                               F32vec4  F64vec2  F32vec1

R0 := (A0 select_[eq | lt | le | gt | ge] B0) ? C0 : D0         X        X        X
      !(A0 select_[ne | nlt | nle | ngt | nge] B0) ? C0 : D0

R1 := (A1 select_[eq | lt | le | gt | ge] B1) ? C1 : D1         X        X        N/A
      !(A1 select_[ne | nlt | nle | ngt | nge] B1) ? C1 : D1

R2 := (A2 select_[eq | lt | le | gt | ge] B2) ? C2 : D2         X        N/A      N/A
      !(A2 select_[ne | nlt | nle | ngt | nge] B2) ? C2 : D2

R3 := (A3 select_[eq | lt | le | gt | ge] B3) ? C3 : D3         X        N/A      N/A
      !(A3 select_[ne | nlt | nle | ngt | nge] B3) ? C3 : D3

The following table shows examples for conditional select operations and corresponding intrinsics. Conditional Select Operations for Fvec Classes

Returns Example Syntax Usage Intrinsic

Compare for Equality

4 floats   F32vec4 R = select_eq(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);    _mm_cmpeq_ps
2 doubles  F64vec2 R = select_eq(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);    _mm_cmpeq_pd
1 float    F32vec1 R = select_eq(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);    _mm_cmpeq_ss

Compare for Inequality

4 floats   F32vec4 R = select_neq(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);   _mm_cmpneq_ps
2 doubles  F64vec2 R = select_neq(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);   _mm_cmpneq_pd
1 float    F32vec1 R = select_neq(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);   _mm_cmpneq_ss

Compare for Less Than

4 floats   F32vec4 R = select_lt(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);    _mm_cmplt_ps
2 doubles  F64vec2 R = select_lt(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);    _mm_cmplt_pd
1 float    F32vec1 R = select_lt(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);    _mm_cmplt_ss

Compare for Less Than or Equal

4 floats   F32vec4 R = select_le(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);    _mm_cmple_ps
2 doubles  F64vec2 R = select_le(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);    _mm_cmple_pd
1 float    F32vec1 R = select_le(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);    _mm_cmple_ss

Compare for Greater Than

4 floats   F32vec4 R = select_gt(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);    _mm_cmpgt_ps
2 doubles  F64vec2 R = select_gt(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);    _mm_cmpgt_pd
1 float    F32vec1 R = select_gt(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);    _mm_cmpgt_ss

Compare for Greater Than or Equal To

4 floats   F32vec4 R = select_ge(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);    _mm_cmpge_ps
2 doubles  F64vec2 R = select_ge(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);    _mm_cmpge_pd
1 float    F32vec1 R = select_ge(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);    _mm_cmpge_ss

Compare for Not Less Than

4 floats   F32vec4 R = select_nlt(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);   _mm_cmpnlt_ps
2 doubles  F64vec2 R = select_nlt(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);   _mm_cmpnlt_pd
1 float    F32vec1 R = select_nlt(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);   _mm_cmpnlt_ss

Compare for Not Less Than or Equal

4 floats   F32vec4 R = select_nle(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);   _mm_cmpnle_ps
2 doubles  F64vec2 R = select_nle(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);   _mm_cmpnle_pd
1 float    F32vec1 R = select_nle(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);   _mm_cmpnle_ss

Compare for Not Greater Than

4 floats   F32vec4 R = select_ngt(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);   _mm_cmpngt_ps
2 doubles  F64vec2 R = select_ngt(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);   _mm_cmpngt_pd
1 float    F32vec1 R = select_ngt(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);   _mm_cmpngt_ss

Compare for Not Greater Than or Equal

4 floats   F32vec4 R = select_nge(F32vec4 A, F32vec4 B, F32vec4 C, F32vec4 D);   _mm_cmpnge_ps
2 doubles  F64vec2 R = select_nge(F64vec2 A, F64vec2 B, F64vec2 C, F64vec2 D);   _mm_cmpnge_pd
1 float    F32vec1 R = select_nge(F32vec1 A, F32vec1 B, F32vec1 C, F32vec1 D);   _mm_cmpnge_ss

Cacheability Support Operators

Stores (non-temporal) the two double-precision, floating-point values of A. Requires a 16-byte aligned address. void store_nta(double *p, F64vec2 A); Corresponding intrinsic: _mm_stream_pd Stores (non-temporal) the four single-precision, floating-point values of A. Requires a 16-byte aligned address. void store_nta(float *p, F32vec4 A); Corresponding intrinsic: _mm_stream_ps

Debug Operations

The debug operations do not map to any compiler intrinsics for MMX™ technology or Streaming SIMD Extensions. They are provided for debugging programs only. Use of these operations may result in loss of performance, so you should not use them outside of debugging.

Output Operations The two double-precision floating-point values of A are placed in the output buffer and printed in decimal format as follows: cout << F64vec2 A; "[1]:A1 [0]:A0" Corresponding intrinsics: none The four single-precision floating-point values of A are placed in the output buffer and printed in decimal format as follows: cout << F32vec4 A; "[3]:A3 [2]:A2 [1]:A1 [0]:A0" Corresponding intrinsics: none The lowest single-precision floating-point value of A is placed in the output buffer and printed.


cout << F32vec1 A; Corresponding intrinsics: none

Element Access Operations double d = F64vec2 A[int i]

Read one of the two double-precision floating-point values of A without modifying the corresponding floating-point value. Permitted values of i are 0 and 1. If DEBUG is enabled and i is not one of the permitted values (0 or 1), a diagnostic message is printed and the program aborts. For example: double d = F64vec2 A[1]; Corresponding intrinsics: none Read one of the four single-precision floating-point values of A without modifying the corresponding floating-point value. Permitted values of i are 0, 1, 2, and 3. For example: float f = F32vec4 A[int i]

If DEBUG is enabled and i is not one of the permitted values (0-3), a diagnostic message is printed and the program aborts. float f = F32vec4 A[2]; Corresponding intrinsics: none

Element Assignment Operations F64vec2 A[int i] = double d;

Modify one of the two double-precision floating-point values of A. Permitted values of int i are 0 and 1. For example: F64vec2 A[1] = double d; F32vec4 A[int i] = float f;

Modify one of the four single-precision floating-point values of A. Permitted values of int i are 0, 1, 2, and 3. If DEBUG is enabled and int i is not one of the permitted values (0-3), a diagnostic message is printed and the program aborts. For example: F32vec4 A[3] = float f; Corresponding intrinsics: none.


Load and Store Operators

Loads two double-precision floating-point values, copying them into the two floating-point values of A. No assumption is made for alignment. void loadu(F64vec2 A, double *p) Corresponding intrinsic: _mm_loadu_pd Stores the two double-precision floating-point values of A. No assumption is made for alignment. void storeu(double *p, F64vec2 A); Corresponding intrinsic: _mm_storeu_pd Loads four single-precision floating-point values, copying them into the four floating-point values of A. No assumption is made for alignment. void loadu(F32vec4 A, float *p) Corresponding intrinsic: _mm_loadu_ps Stores the four single-precision floating-point values of A. No assumption is made for alignment. void storeu(float *p, F32vec4 A); Corresponding intrinsic: _mm_storeu_ps

Unpack Operators

Selects and interleaves the lower, double-precision floating-point values from A and B. F64vec2 R = unpack_low(F64vec2 A, F64vec2 B); Corresponding intrinsic: _mm_unpacklo_pd(a, b) Selects and interleaves the higher, double-precision floating-point values from A and B. F64vec2 R = unpack_high(F64vec2 A, F64vec2 B); Corresponding intrinsic: _mm_unpackhi_pd(a, b) Selects and interleaves the lower two, single-precision floating-point values from A and B. F32vec4 R = unpack_low(F32vec4 A, F32vec4 B); Corresponding intrinsic: _mm_unpacklo_ps(a, b) Selects and interleaves the higher two, single-precision floating-point values from A and B. F32vec4 R = unpack_high(F32vec4 A, F32vec4 B); Corresponding intrinsic: _mm_unpackhi_ps(a, b)


Move Mask Operators

Creates a 2-bit mask from the most significant bits of the two, double-precision floating-point values of A, as follows: int i = move_mask(F64vec2 A) i := sign(a1)<<1 | sign(a0)<<0 Corresponding intrinsic: _mm_movemask_pd Creates a 4-bit mask from the most significant bits of the four, single-precision floating-point values of A, as follows: int i = move_mask(F32vec4 A) i := sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)<<0 Corresponding intrinsic: _mm_movemask_ps
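The resulting mask packs one sign bit per element into the low bits of an int; a scalar model of the F64vec2 case (the function name is illustrative):

```cpp
#include <cmath>

// Scalar model of move_mask(F64vec2) / _mm_movemask_pd: bit 0 holds the
// sign bit of a0, bit 1 the sign bit of a1.
int move_mask_pd2(double a0, double a1) {
    return (std::signbit(a1) ? 2 : 0) | (std::signbit(a0) ? 1 : 0);
}
```

Note that the sign bit, not the ordering, is what is captured: negative zero sets its mask bit just as any negative value does.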

Classes Quick Reference

This appendix contains tables listing operators to perform various SIMD operations, corresponding intrinsics to perform those operations, and the classes that implement those operations. The classes listed here belong to the Intel C++ Class Libraries for SIMD Operations. In the following tables, • N/A indicates that the operator is not implemented in that particular class. Thus, in the Logical Operations table, the Andnot operator is not implemented in the F32vec4 and F32vec1 classes. • All other entries under Classes indicate that those operators are implemented in those particular classes, and the entries under the Classes columns provide the suffix for the corresponding intrinsic. For example, consider the Arithmetic Operations Part 1 table, where the corresponding intrinsic is _mm_add_[x] and the entry epi16 is under the I16vec8 column. It means that the I16vec8 class implements the addition operators and the corresponding intrinsic is _mm_add_epi16.

Logical Operations:


Operators   Corresponding    I128vec1, I64vec2,   I64vec1, I32vec2,   F64vec2   F32vec4   F32vec1
            Intrinsic        I32vec4, I16vec8,    I16vec4, I8vec8
                             I8vec16

&, &=       _mm_and_[x]      si128                si64                pd        ps        ps
|, |=       _mm_or_[x]       si128                si64                pd        ps        ps
^, ^=       _mm_xor_[x]      si128                si64                pd        ps        ps
Andnot      _mm_andnot_[x]   si128                si64                pd        N/A       N/A

Arithmetic Operations: Part 1

Operators   Corresponding Intrinsic        I64vec2   I32vec4   I16vec8   I8vec16

+, +=       _mm_add_[x]                    epi64     epi32     epi16     epi8
-, -=       _mm_sub_[x]                    epi64     epi32     epi16     epi8
*, *=       _mm_mullo_[x]                  N/A       N/A       epi16     N/A
/, /=       _mm_div_[x]                    N/A       N/A       N/A       N/A
mul_high    _mm_mulhi_[x]                  N/A       N/A       epi16     N/A
mul_add     _mm_madd_[x]                   N/A       N/A       epi16     N/A
sqrt        _mm_sqrt_[x]                   N/A       N/A       N/A       N/A
rcp         _mm_rcp_[x]                    N/A       N/A       N/A       N/A
rcp_nr      _mm_rcp_[x], _mm_add_[x],      N/A       N/A       N/A       N/A
            _mm_sub_[x], _mm_mul_[x]
rsqrt       _mm_rsqrt_[x]                  N/A       N/A       N/A       N/A
rsqrt_nr    _mm_rsqrt_[x], _mm_sub_[x],    N/A       N/A       N/A       N/A
            _mm_mul_[x]

Arithmetic Operations: Part 2

Operators   Corresponding Intrinsic        I32vec2   I16vec4   I8vec8   F64vec2   F32vec4   F32vec1

+, +=       _mm_add_[x]                    pi32      pi16      pi8      pd        ps        ss
-, -=       _mm_sub_[x]                    pi32      pi16      pi8      pd        ps        ss
*, *=       _mm_mullo_[x]                  N/A       pi16      N/A      pd        ps        ss
/, /=       _mm_div_[x]                    N/A       N/A       N/A      pd        ps        ss
mul_high    _mm_mulhi_[x]                  N/A       pi16      N/A      N/A       N/A       N/A
mul_add     _mm_madd_[x]                   N/A       pi16      N/A      N/A       N/A       N/A
sqrt        _mm_sqrt_[x]                   N/A       N/A       N/A      pd        ps        ss
rcp         _mm_rcp_[x]                    N/A       N/A       N/A      pd        ps        ss
rcp_nr      _mm_rcp_[x], _mm_add_[x],      N/A       N/A       N/A      pd        ps        ss
            _mm_sub_[x], _mm_mul_[x]
rsqrt       _mm_rsqrt_[x]                  N/A       N/A       N/A      pd        ps        ss
rsqrt_nr    _mm_rsqrt_[x], _mm_sub_[x],    N/A       N/A       N/A      pd        ps        ss
            _mm_mul_[x]


Shift Operations: Part 1

Operators   Corresponding Intrinsic   I128vec1   I64vec2   I32vec4   I16vec8   I8vec16

>>, >>=     _mm_srl_[x]               N/A        epi64     epi32     epi16     N/A
            _mm_srli_[x]              N/A        epi64     epi32     epi16     N/A
            _mm_sra_[x]               N/A        N/A       epi32     epi16     N/A
            _mm_srai_[x]              N/A        N/A       epi32     epi16     N/A
<<, <<=     _mm_sll_[x]               N/A        epi64     epi32     epi16     N/A
            _mm_slli_[x]              N/A        epi64     epi32     epi16     N/A

Shift Operations: Part 2

Operators   Corresponding Intrinsic   I64vec1   I32vec2   I16vec4   I8vec8

>>, >>=     _mm_srl_[x]               si64      pi32      pi16      N/A
            _mm_srli_[x]              si64      pi32      pi16      N/A
            _mm_sra_[x]               N/A       pi32      pi16      N/A
            _mm_srai_[x]              N/A       pi32      pi16      N/A
<<, <<=     _mm_sll_[x]               si64      pi32      pi16      N/A
            _mm_slli_[x]              si64      pi32      pi16      N/A

Comparison Operations: Part 1

Operators   Corresponding Intrinsic   I32vec4   I16vec8   I8vec16   I32vec2   I16vec4   I8vec8

cmpeq       _mm_cmpeq_[x]             epi32     epi16     epi8      pi32      pi16      pi8
cmpneq      _mm_cmpeq_[x],            epi32     epi16     epi8      pi32      pi16      pi8
            _mm_andnot_[y]*           si128     si128     si128     si64      si64      si64
cmpgt       _mm_cmpgt_[x]             epi32     epi16     epi8      pi32      pi16      pi8
cmpge       _mm_cmpge_[x],            epi32     epi16     epi8      pi32      pi16      pi8
            _mm_andnot_[y]*           si128     si128     si128     si64      si64      si64
cmplt       _mm_cmplt_[x]             epi32     epi16     epi8      pi32      pi16      pi8
cmple       _mm_cmple_[x],            epi32     epi16     epi8      pi32      pi16      pi8
            _mm_andnot_[y]*           si128     si128     si128     si64      si64      si64
cmpngt      _mm_cmpngt_[x]            epi32     epi16     epi8      pi32      pi16      pi8
cmpnge      _mm_cmpnge_[x]            N/A       N/A       N/A       N/A       N/A       N/A
cmpnlt      _mm_cmpnlt_[x]            N/A       N/A       N/A       N/A       N/A       N/A
cmpnle      _mm_cmpnle_[x]            N/A       N/A       N/A       N/A       N/A       N/A

* Note that _mm_andnot_[y] intrinsics do not apply to the fvec classes.

Comparison Operations: Part 2

Operators   Corresponding Intrinsic   F64vec2   F32vec4   F32vec1

cmpeq       _mm_cmpeq_[x]             pd        ps        ss
cmpneq      _mm_cmpeq_[x],            pd        ps        ss
            _mm_andnot_[y]*
cmpgt       _mm_cmpgt_[x]             pd        ps        ss
cmpge       _mm_cmpge_[x],            pd        ps        ss
            _mm_andnot_[y]*
cmplt       _mm_cmplt_[x]             pd        ps        ss
cmple       _mm_cmple_[x],            pd        ps        ss
            _mm_andnot_[y]*
cmpngt      _mm_cmpngt_[x]            pd        ps        ss
cmpnge      _mm_cmpnge_[x]            pd        ps        ss
cmpnlt      _mm_cmpnlt_[x]            pd        ps        ss
cmpnle      _mm_cmpnle_[x]            pd        ps        ss

Conditional Select Operations: Part 1

Operators    Corresponding Intrinsic   I32vec4   I16vec8   I8vec16   I32vec2   I16vec4   I8vec8

select_eq    _mm_cmpeq_[x]             epi32     epi16     epi8      pi32      pi16      pi8
             _mm_and_[y]               si128     si128     si128     si64      si64      si64
             _mm_andnot_[y]*           si128     si128     si128     si64      si64      si64
             _mm_or_[y]                si128     si128     si128     si64      si64      si64
select_neq   _mm_cmpeq_[x]             epi32     epi16     epi8      pi32      pi16      pi8
             _mm_and_[y]               si128     si128     si128     si64      si64      si64
             _mm_andnot_[y]*           si128     si128     si128     si64      si64      si64
             _mm_or_[y]                si128     si128     si128     si64      si64      si64
select_gt    _mm_cmpgt_[x]             epi32     epi16     epi8      pi32      pi16      pi8
             _mm_and_[y]               si128     si128     si128     si64      si64      si64
             _mm_andnot_[y]*           si128     si128     si128     si64      si64      si64
             _mm_or_[y]                si128     si128     si128     si64      si64      si64
select_ge    _mm_cmpge_[x]             epi32     epi16     epi8      pi32      pi16      pi8
             _mm_and_[y]               si128     si128     si128     si64      si64      si64
             _mm_andnot_[y]*           si128     si128     si128     si64      si64      si64
             _mm_or_[y]                si128     si128     si128     si64      si64      si64
select_lt    _mm_cmplt_[x]             epi32     epi16     epi8      pi32      pi16      pi8
             _mm_and_[y]               si128     si128     si128     si64      si64      si64
             _mm_andnot_[y]*           si128     si128     si128     si64      si64      si64
             _mm_or_[y]                si128     si128     si128     si64      si64      si64
select_le    _mm_cmple_[x]             epi32     epi16     epi8      pi32      pi16      pi8
             _mm_and_[y]               si128     si128     si128     si64      si64      si64
             _mm_andnot_[y]*           si128     si128     si128     si64      si64      si64
             _mm_or_[y]                si128     si128     si128     si64      si64      si64
select_ngt   _mm_cmpgt_[x]             N/A       N/A       N/A       N/A       N/A       N/A
select_nge   _mm_cmpge_[x]             N/A       N/A       N/A       N/A       N/A       N/A
select_nlt   _mm_cmplt_[x]             N/A       N/A       N/A       N/A       N/A       N/A
select_nle   _mm_cmple_[x]             N/A       N/A       N/A       N/A       N/A       N/A

* Note that _mm_andnot_[y] intrinsics do not apply to the fvec classes.

Conditional Select Operations: Part 2

Operators    Corresponding Intrinsic   F64vec2   F32vec4   F32vec1

select_eq    _mm_cmpeq_[x],            pd        ps        ss
             _mm_and_[y],
             _mm_andnot_[y]*,
             _mm_or_[y]
select_neq   _mm_cmpeq_[x],            pd        ps        ss
             _mm_and_[y],
             _mm_andnot_[y]*,
             _mm_or_[y]
select_gt    _mm_cmpgt_[x],            pd        ps        ss
             _mm_and_[y],
             _mm_andnot_[y]*,
             _mm_or_[y]
select_ge    _mm_cmpge_[x],            pd        ps        ss
             _mm_and_[y],
             _mm_andnot_[y]*,
             _mm_or_[y]
select_lt    _mm_cmplt_[x],            pd        ps        ss
             _mm_and_[y],
             _mm_andnot_[y]*,
             _mm_or_[y]
select_le    _mm_cmple_[x],            pd        ps        ss
             _mm_and_[y],
             _mm_andnot_[y]*,
             _mm_or_[y]
select_ngt   _mm_cmpgt_[x]             pd        ps        ss
select_nge   _mm_cmpge_[x]             pd        ps        ss
select_nlt   _mm_cmplt_[x]             pd        ps        ss
select_nle   _mm_cmple_[x]             pd        ps        ss

Packing and Unpacking Operations: Part 1


Operators     Corresponding Intrinsic   I64vec2   I32vec4   I16vec8   I8vec16   I32vec2

unpack_high   _mm_unpackhi_[x]          epi64     epi32     epi16     epi8      pi32
unpack_low    _mm_unpacklo_[x]          epi64     epi32     epi16     epi8      pi32
pack_sat      _mm_packs_[x]             N/A       epi32     epi16     N/A       pi32
packu_sat     _mm_packus_[x]            N/A       N/A       epi16     N/A       N/A
sat_add       _mm_adds_[x]              N/A       N/A       epi16     epi8      N/A
sat_sub       _mm_subs_[x]              N/A       N/A       epi16     epi8      N/A

Packing and Unpacking Operations: Part 2

Operators     Corresponding Intrinsic   I16vec4   I8vec8   F64vec2   F32vec4   F32vec1

unpack_high   _mm_unpackhi_[x]          pi16      pi8      pd        ps        N/A
unpack_low    _mm_unpacklo_[x]          pi16      pi8      pd        ps        N/A
pack_sat      _mm_packs_[x]             pi16      N/A      N/A       N/A      N/A
packu_sat     _mm_packus_[x]            pu16      N/A      N/A       N/A      N/A
sat_add       _mm_adds_[x]              pi16      pi8      pd        ps        ss
sat_sub       _mm_subs_[x]              pi16      pi8      pd        ps        ss

Conversion Operations

Conversion operations can be performed using intrinsics only. There are no classes implemented to correspond to these intrinsics.

Operators           Corresponding Intrinsic

F64vec2ToInt        _mm_cvttsd_si32
F32vec4ToF64vec2    _mm_cvtps_pd
F64vec2ToF32vec4    _mm_cvtpd_ps
IntToF64vec2        _mm_cvtsi32_sd
F32vec4ToInt        _mm_cvtt_ss2si
F32vec4ToIs32vec2   _mm_cvttps_pi32
IntToF32vec4        _mm_cvtsi32_ss
Is32vec2ToF32vec4   _mm_cvtpi32_ps


Programming Example

This sample program uses the F32vec4 class to average the elements of a 20-element floating-point array.

// Include Streaming SIMD Extension class definitions
#include <fvec.h>

// Shuffle any 2 single-precision floating-point values from a
// into the low 2 SP FP values, and shuffle any 2 SP FP values
// from b into the high 2 SP FP values of the destination
#define SHUFFLE(a,b,i) (F32vec4)_mm_shuffle_ps(a,b,i)

#include <stdio.h>
#define SIZE 20

// Global variables
float result;
_MM_ALIGN16 float array[SIZE];

//*****************************************************
// Function: Add20ArrayElements
// Add all the elements of a 20-element array
//*****************************************************
void Add20ArrayElements (F32vec4 *array, float *result)
{
    F32vec4 vec0, vec1;
    vec0 = _mm_load_ps((float *) array); // Load the array's first 4 floats

    //*****************************************************
    // Add all elements of the array, 4 elements at a time
    //*****************************************************
    vec0 += array[1]; // Add elements 5-8
    vec0 += array[2]; // Add elements 9-12
    vec0 += array[3]; // Add elements 13-16
    vec0 += array[4]; // Add elements 17-20

    //*****************************************************
    // There are now 4 partial sums.
    // Add the 2 lower elements to the 2 higher elements,
    // then add those 2 results together
    //*****************************************************
    vec1 = SHUFFLE(vec1, vec0, 0x40);
    vec0 += vec1;
    vec1 = SHUFFLE(vec1, vec0, 0x30);
    vec0 += vec1;
    vec0 = SHUFFLE(vec0, vec0, 2);
    _mm_store_ss(result, vec0); // Store the final sum
}

int main(int argc, char *argv[])
{
    int i;

    // Initialize the array
    for (i = 0; i < SIZE; i++) {
        array[i] = (float) i;
    }

    // Call function to add all array elements
    Add20ArrayElements((F32vec4 *) array, &result);

    // Print the average array element value
    printf("Average of all array values = %f\n", result/20.);
    printf("The correct answer is %f\n\n\n", 9.5);
    return 0;
}

C++ Library Extensions

Introduction

This section describes Intel's C++ library extensions that assist users in parallel programming. The following C++ library specialization is included:
• C++ valarray implementation: enables users to leverage a custom valarray header file that uses the Intel® Integrated Performance Primitives (Intel® IPP) for performance benefit.

Intel's valarray Implementation

Introduction to Intel's valarray Implementation

The Intel® Compiler provides a high-performance implementation of specialized one-dimensional valarray operations for the C++ standard STL valarray container. The standard C++ valarray template consists of array/vector operations designed to exploit high-performance hardware features such as parallelism. Intel's valarray implementation uses the Intel® Integrated Performance Primitives (Intel® IPP), which is part of the product; make sure you select Intel® IPP when you install the product. The valarray implementation consists of a replacement valarray header that provides a specialized, high-performance implementation for the following operators and types:

Operator                                          Valarrays of Type

abs, acos, asin, atan, atan2, cos, cosh, exp,     float, double
log, log10, pow, sin, sinh, sqrt, tan, tanh
addition, subtraction, division, multiplication   float, double
bitwise or, and, xor                              (all unsigned) char, short, int
min, max, sum                                     signed short/signed int, float, double

Optimization Notice

The Intel® Integrated Performance Primitives (Intel® IPP) library contains functions that are more highly optimized for Intel microprocessors than for other microprocessors. While the functions in the Intel® IPP library offer optimizations for both Intel and Intel-compatible microprocessors, depending on your code and other factors, you will likely get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for the Intel® IPP library as a whole, the library may or may not be optimized to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

Using Intel's valarray Implementation

Intel's valarray implementation allows you to declare huge arrays for parallel processing. The improved valarray implementation works by calling the libraries of Intel® Integrated Performance Primitives (Intel® IPP). Intel® IPP is part of the product.

Using valarray in Source Code

To use valarrays in your source code, include the valarray header file. The header file is located in the path /perf_header.

1233 40 Intel(R) C++ Compiler for VxWorks* User and Reference Guides

The example code below shows a valarray addition operation (+) specialized through use of Intel's implementation of valarray:

#include <valarray>

void test()
{
    std::valarray<float> vi(N), va(N);
    …
    vi = vi + va; // array addition
    …
}

NOTE. To use the static merged library containing all CPU-specific optimized versions of the library code, you need to call the ippStaticInit function before any other Intel® IPP calls. This ensures automatic dispatch to the appropriate version of the library code for Intel® processors and to the generic version of the library code for non-Intel processors at runtime. If you do not call ippStaticInit first, the merged library uses the generic instance of the code. If you are using the dynamic version of the libraries, you do not need to call ippStaticInit.

Compiling valarray Source Code

To compile your valarray source code, use the compiler option /Quse-intel-optimized-headers (for Windows* OS) or -use-intel-optimized-headers (for VxWorks* OS) to include the required valarray header file and all the necessary Intel® IPP library files. The following examples illustrate two instances of how to compile and link a program to include the Intel valarray replacement header file and link with the Intel® Integrated Performance Primitives (Intel® IPP) libraries. Refer to the Intel® IPP documentation for details. In the following examples, "merged" libraries means using a static library that contains all the CPU-specific variants of the library code.

Windows* OS examples:

The following command line performs a one-step compilation for a system based on IA-32 architecture, running Windows OS:

icl /Quse-intel-optimized-headers source.cpp

The following command lines perform separate compile and link steps for a system based on IA-32 architecture, running Windows OS:

DLL (dynamic):
icl /Quse-intel-optimized-headers /c source.cpp
icl source.obj /Quse-intel-optimized-headers

Merged (static):
icl /Quse-intel-optimized-headers /c source.cpp
icl source.obj /Quse-intel-optimized-headers

Linux* OS and VxWorks* OS examples:

The following command line performs a one-step compilation for a system based on Intel® 64 architecture, running Linux OS or VxWorks OS:

icpc -use-intel-optimized-headers source.cpp

The following command lines perform separate compile and link steps for a system based on Intel® 64 architecture, running Linux OS or VxWorks OS:

so (dynamic):
icpc -use-intel-optimized-headers -c source.cpp
icpc source.o -use-intel-optimized-headers

Merged (static):
icpc -use-intel-optimized-headers -c source.cpp
icpc source.o -use-intel-optimized-headers


Intel's C/C++ Language Extensions

Introduction

This section contains descriptions of Intel's C/C++ language extensions that assist users in parallel programming. The following C/C++ language extension is included:
• C++ lambda expressions or functions: enable easy usage of libraries.

Intel's C++ lambda extensions

Introduction

The Intel® C++ Compiler implements lambda expressions as specified in the ISO C++ document N2550.pdf, available at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/. C++ lambda expressions are primary expressions that define function objects. Such expressions can be used wherever a function object is expected; for example, as arguments to Standard Template Library (STL) algorithms. This section contains the following topics:
• Details on Using C++ Lambda Expressions
• Understanding Lambda-Capture
• Lambda Function Object

Details on Using Lambda Expressions in C++

This topic explains, in some detail, how to use Intel's implementation of C++ lambda expressions. To use lambda expressions, you need to request the c++0x standard with the command-line option -std or /Qstd:

On Windows* systems: /Qstd=c++0x
On Linux* and VxWorks* systems: -std=c++0x


Introducing Lambda Expressions in Code

Lambda expressions are introduced in the code by [lambda-captureopt]. If the expression does not capture any local variables or references with automatic storage duration, lambda-captureopt can be empty, leaving [] as the introducer. For example, the lambda expression

[](int x) {return x%3==0;}

is equivalent to the expression unique(), where unique is a secret identifier generated and defined by the compiler as

class unique {
public:
    bool operator()(int x) const {return x%3==0;}
};

Parameter List in Lambda Expressions

The lambda-parameter-declaration follows the lambda-introducer. If the parameter list is empty, the parentheses can be omitted. For example, the following lambda expressions are equivalent:

[] {return rand();}
[]() {return rand();}

Body and Return Type of Lambda Expressions

The body of a lambda expression is a compound statement. A return type T for a lambda expression can be specified by writing ->T after the parameter list. For example, the following lambda expressions are equivalent:

[](int x) {return x%3==0;}
[](int x) -> bool {return x%3==0;}
[](int x) -> bool {if( x%3==0 ) return true; else return false;}

If the return type is not explicitly specified, the return type is void unless the body has the form {return expression;}. In such a case the return type is the type of the expression. Consider the following example:

[](int x) { if (x % 3 == 0) return true; else return false; }


This expression is an error because you can't return a boolean value from a function returning void.

Exception Specification

A lambda expression may include an exception specification after the parameter list and before the return type specification. The parameter list is necessary if there is an exception specification. The following lambda expressions specify that they do not throw exceptions:

[](int x) throw() -> bool {return x%3==0;}
[]() throw() {return rand();}

See Also • Understanding [lambda-capture] • Formal Grammar for Lambda Expressions in "Lambda Expressions and Closures: Wording for Monomorphic Lambdas (Revision 4)"

Understanding Lambda-Capture

A lambda expression can refer to identifiers declared outside the lambda expression. If the identifier is a local variable or a reference with automatic storage duration, it is an up-level reference and must be "captured" by the lambda expression. Such a lambda expression must be introduced by [lambda-captureopt], where lambda-captureopt specifies whether identifiers are captured by reference or by copy. The table below summarizes forms of lambda-captureopt.

Symbol Indicates

[] Nothing to capture: an up-level reference is an error

[&x, y, ...] Capture as specified: identifiers prefixed by & are captured by reference; other identifiers are captured by copy. An up-level reference to any variable not explicitly listed is an error

[&] Capture by reference: an up-level reference implicitly captures the variable by reference

[=] Capture by copy: an up-level reference implicitly captures the variable by copy

[&, x, y, ...] Capture by reference with exceptions: listed variables are captured by copy (no listed variable may be prefixed by &)

[=, &x, &y, ...] Capture by copy with exceptions: listed variables are captured by reference (every listed variable must be prefixed by &)

No identifier may appear twice in the list. In the following code, which sets area to the sum of the areas of four circles, the notation [&area,pi] specifies that area is captured by reference and pi by copy:

float radius[] = {2,3,5,7};
float area = 0;
float pi = 3.14f;
for_each(radius, radius+4, [&area,pi](float r) {return area += pi*r*r;});

Specifying Default Capture

When a default capture is specified, the list must specify only the other kind of capture. In other words, if you specify that the default capture is by reference, then you must list (and list only) the variables that should be captured by copy. For example, if your intent is to capture x by reference and y by copy, and both x and y appear in the function body, the following code illustrates correct and incorrect capture lists:

[&,&x,y] //ERROR - default is capture-by-reference; list only the capture-by-copy variable, y

[&,y] //CORRECT - default is capture-by-reference; listed variable, y, should not have & prefix

[=,&x,y] //ERROR - default is capture-by-copy; listed variable, x, must be prefixed with &


[=,&x] //CORRECT - default is capture by copy; listed variable must be prefixed with &

[&x] //ERROR - no default capture; you must list y separately to be captured by copy

[y] //ERROR - again no default capture is specified, so you must list &x separately to be captured by reference

[&x,y] //CORRECT - since no default is specified, every variable is listed with its capture mode.

Default Binding Modes

The following lambda expressions demonstrate default binding modes. All three expressions are semantically equivalent. Each captures x and y by reference, and captures a and b by copy:

[&x,&y,a,b](float r) {x=a; y=b;}

[&,a,b](float r) {x=a; y=b;}

[=,&x,&y](float r) {x=a; y=b;}

Referring to this

If a lambda expression occurs inside a member function, it can refer to this. Because this is not a variable, it cannot be captured by reference. Even when it is captured implicitly in a lambda expression introduced by [&], it is captured by copy.

Lambda Function Object

The compiler creates an anonymous function object upon evaluating a lambda expression.


This function object, created by a lambda expression, may live longer than the block in which it is created. You must ensure that it does not use variables that were destroyed before the function object is used. The following example shows how a function object created by a lambda expression can outlive the function that creates it.

struct Base {
    virtual bool test(int x) = 0;
};

template<typename F> struct Derived: Base {
    F f;
    bool test(int x) {return f(x);}
    Derived(F f_) : f(f_) {}
};

template<typename F> Base* MakeDerived(F f) {
    return new Derived<F>(f);
}

Base* Foo(int k) {
    return MakeDerived( [k](int x) {return x%k==3;} );
}

bool Bar() {
    Base* b = Foo(3);
    return b->test(6);
}

In the above example, Bar invokes Foo, which copies the function object generated by a lambda expression into an instance of template class Derived. The lambda expression refers to a local variable k. Although the code destroys k before using the copied function object, the code is safe because k was captured by copy.

Index

__regcall 123 -diag-error-limit compiler option 170 --version compiler option 437 -diag-file compiler option 171 -A compiler option 139 -diag-file-append compiler option 172 -A- compiler option 140 -diag-id-numbers compiler option 174 -alias-const compiler option 141 -diag-once compiler option 175 -align compiler option 142 -diag-remark compiler option 166 -ansi compiler option 143 -diag-warning compiler option 166 -ansi-alias compiler option 143 -dM compiler option 175 -ansi-alias-check compiler option 144 -dN compiler option 176 -auto-ilp32 compiler option 146 -dryrun compiler option 177 -auto-p32 compiler option 147 -dumpmachine compiler option 178 -ax compiler option 148 -dumpversion compiler option 178 -B compiler option 151 -dynamic-linker compiler option 179 -Bdynamic compiler option 152 -dynamiclib compiler option 84 -Bstatic compiler option 153 -E compiler option 180 -Bsymbolic compiler option 154 -early-template-check compiler option 181 -Bsymbolic-functions compiler option 155 -EP compiler option 182 -c compiler option 84, 157 -export compiler option 182 -C compiler option 156 -export-dir compiler option 183 -c99 compiler option 158 -fabi-version compiler option 184 -check-uninit compiler option 159 -falias compiler option 185 -complex-limited-range compiler option 159 -falign-functions compiler option 186 -cxxlib compiler option 160 -falign-stack compiler option 187 -D compiler option 161 -fargument-alias compiler option 188 -dD compiler option 162 -fargument-noalias-global compiler option 189 -debug compiler option 78, 163 -fasm-blocks compiler option 190 -diag compiler option 166 -fast compiler option 190 -diag-disable compiler option 166 -fast-transcendentals compiler option 192 -diag-disable sc compiler option 166 -fbuiltin compiler option 194, 349 -diag-dump compiler option 169 -fcode-asm compiler option 195 -diag-enable compiler option 166 -fcommon compiler option 196 -diag-enable port-win compiler option 116 -fdata-sections compiler option 200 
-diag-enable sc compiler option 166 -fexceptions compiler option 197 -diag-error compiler option 166 -ffnalias compiler option 197


-ffreestanding compiler option 198 -ftrapuv compiler option 249 -ffriend-injection compiler option 199 -ftz compiler option 250, 674 -ffunction-sections compiler option 200 -funroll-all-loops compiler option 251 -fimf-absolute-error compiler option 201 -funroll-loops compiler option 426 -fimf-accuracy-bits compiler option 202 -funsigned-bitfields compiler option 252 -fimf-arch-consistency compiler option 204 -funsigned-char compiler option 253 -fimf-max-error compiler option 205 -fvar-tracking compiler option 163 -fimf-precision compiler option 207 -fvar-tracking-assignments compiler option 163 -finline compiler option 208 -fverbose-asm compiler option 254 -finline-functions compiler option 209 -fvisibility compiler option (Linux* only) 255 -finline-limit compiler option 210 -fvisibility-inlines-hidden compiler option 257 -finstrument-functions compiler option 211 -fzero-initialized-in-bss compiler option 258 -fjump-tables compiler option 213 -g compiler option 259, 488, 490 -fkeep-static-consts compiler option 214 -g0 compiler option 260 -fmath-errno compiler option 215 -gcc compiler option 261, 263 -fminshared compiler option 216 -gcc-name compiler option 262 -fmudflap compiler option 217 -gcc-sys compiler option 261, 263 -fno-defer-pop compiler option 217 -gcc-version compiler option 264 -fno-gnu-keywords compiler option 218 -gdwarf-2 compiler option 265 -fno-implicit-fp-sse compiler option 219 -global-hoist compiler option 266 -fno-implicit-inline-templates compiler option 219 -guide compiler option 267 -fno-implicit-templates compiler option 220 -guide-data-trans compiler option 269 -fno-operator-names compiler option 221 -guide-opts compiler option 270 -fno-rtti compiler option 222 -guide-vec compiler option 273 -fnon-call-exceptions compiler option 222 -gxx-name compiler option 274 -fnon-lvalue-assign compiler option 223 -H compiler option 275 -fomit-frame-pointer compiler option 224, 369 -help compiler option 275 -fp compiler option 224, 369 -help-pragma 
compiler option 277 -fp-model compiler option 226, 659, 668 -I compiler option 278 how to use 668 -icc compiler option 279 -fp-port compiler option 232 -idirafter compiler option 280 -fp-speculation compiler option 233 -imacros compiler option 280 -fp-stack-check compiler option 234 -inline-calloc compiler option 281 -fp-trap compiler option 235 -inline-debug-info compiler option 283 -fp-trap-all compiler option 237 -inline-factor compiler option 284 -fpack-struct compiler option 239 -inline-forceinline compiler option 285 -fpascal-strings compiler option 240 -inline-level compiler option 286, 348 -fpermissive compiler option 240 -inline-max-per-compile compiler option 288 -fpic compiler option 84, 241 -inline-max-per-routine compiler option 289 -fpie compiler option 242 -inline-max-size compiler option 290 -freg-struct-return compiler option 243 -inline-max-total-size compiler option 291 -fshort-enums compiler option 244 -inline-min-size compiler option 293 -fsource-asm compiler option 244 -intel-extensions compiler option 294 -fstack-protector compiler option 245, 267 -ip compiler option 295 -fstack-security-check compiler option 245, 267 -ip-no-inlining compiler option 296 -fstrict-aliasing compiler option 143 -ip-no-pinlining compiler option 297 -fsyntax-only compiler option 246 -ipo compiler option 298, 511 -ftemplate-depth compiler option 247 -ipo-c compiler option 299 -ftls-model compiler option 248 -ipo-jobs compiler option 300


-ipo-S compiler option 301
-ipo-separate compiler option 302
-ipp compiler option 303
-iprefix compiler option 304
-iquote compiler option 305
-isystem compiler option 306
-iwithprefix compiler option 306
-iwithprefixbefore compiler option 307
-Kc++ compiler option 308, 423
-l compiler option 309
-L compiler option 310
-m compiler option 311
-M compiler option 312
-m32 compiler option 313
-m64 compiler option 313
-malign-double compiler option 314
-map-opts compiler option 315
-march compiler option 316
-mc++-abr compiler option 320
-mcpu compiler option 333
-MD compiler option 321
-membedded compiler option 321
-MF compiler option 322
-MG compiler option 323
-mieee-fp compiler option 326
-minstruction compiler option 324
-MM compiler option 325
-MMD compiler option 326
-mp compiler option 326
-MP compiler option 328
-mp1 compiler option 329
-MQ compiler option 330
-mregparm compiler option 331
-mrtp compiler option 332
-MT compiler option 332
-mtune compiler option 333
-multibyte-chars compiler option 336
-multiple-processes compiler option 328, 336
-no-libgcc compiler option 338
-nobss-init compiler option 337
-nodefaultlibs compiler option 339
-nolib-inline compiler option 340
-non-static compiler option 340
-nostartfiles compiler option 341
-nostdinc++ compiler option 342
-nostdlib compiler option 342
-o compiler option 343
-O compiler option 344
-Ob compiler option 286, 348
-opt-args-in-regs compiler option 350
-opt-block-factor compiler option 352
-opt-calloc compiler option (Linux only) 352
-opt-class-analysis compiler option 353
-opt-jump-tables compiler option 354
-opt-malloc-options compiler option 355
-opt-matmul compiler option 357
-opt-multi-version-aggressive compiler option 358
-opt-prefetch compiler option 359
-opt-ra-region-strategy compiler option 360
-opt-report compiler option 361
-opt-report-file compiler option 362
-opt-report-help compiler option 363
-opt-report-phase compiler option 364
-opt-report-routine compiler option 365
-opt-streaming-stores compiler option 366
-opt-subscript-in-range compiler option 367
-Os compiler option 368
-P compiler option 371
-pc compiler option 372
-pch compiler option 60, 373
-pch-create compiler option 60, 374
-pch-dir compiler option 60, 375
-pch-use compiler option 60, 377
-pie compiler option 378
-pragma-optimization-level compiler option 379
-prec-div compiler option 380
-prec-sqrt compiler option 381
-print-multi-lib compiler option 382
-prof-data-order compiler option 382
-prof-dir compiler option 383
-prof-file compiler option 384
-prof-func-groups compiler option 385
-prof-func-order compiler option 386
-prof-gen
  srcpos compiler option 538, 550, 570
-prof-gen compiler option 388, 530, 538
  related options 530
-prof-gen:srcpos compiler option
  code coverage tool 550
  test prioritization tool 570
-prof-genx compiler option 388
-prof-hotness-threshold compiler option 390
-prof-src-dir compiler option 391
-prof-src-root compiler option 392
-prof-src-root-cwd compiler option 394
-prof-use compiler option 395, 530, 550, 579
  code coverage tool 550
  profmerge utility 579


-prof-use compiler option (continued)
  related options 530
-prof-value-profiling compiler option 397
-profile-functions compiler option 398
-profile-loops compiler option 399
-profile-loops-report compiler option 401
-pthread compiler option 402
-qdiag-disable linking option 493
-qdiag-enable linking option 493
-qhelp linking option 493
-Qinstall compiler option 403
-qipo-fa linking option 493
-qipo-fac linking option 493
-qipo-facs linking option 493
-qipo-fas linking option 493
-qipo-fo linking option 493
-Qlocation compiler option 404
-qnoipo linking option 493
-qomit-il linking option 493
-Qoption compiler option 405
-quseenv linking option 493
-qv linking option 493
-rcd compiler option 407
-regcall compiler option 408
-restrict compiler option 409
-S compiler option 410
-save-temps compiler option 411
-scalar-rep compiler option 412
-shared compiler option 84, 413
-shared-intel compiler option 414
-shared-libgcc compiler option 415
-simd compiler option 415
-sox compiler option 417
-static compiler option 418
-static-intel compiler option 418
-static-libgcc compiler option 419
-std compiler option 420
-strict-ansi compiler option 422
-T compiler option 423
-u (L*) compiler option 424
-U compiler option 425
-undef compiler option 426
-unroll compiler option 426
-unroll-aggressive compiler option 427
-use-asm compiler option 428
-use-intel-optimized-headers compiler option 429
-use-msasm compiler option 430
-V (L*) compiler option 432
-v compiler option 431
-vec compiler option 433
-vec-guard-write compiler option 434
-vec-report compiler option 435
-vec-threshold compiler option 436
-w compiler option 438, 439
-Wa compiler option 440
-Wabi compiler option 441
-Wall compiler option 442
-Wbrief compiler option 443
-Wcheck compiler option 443
-Wcomment compiler option 444
-Wcontext-limit compiler option 445
-wd compiler option 446
-Wdeprecated compiler option 446
-we compiler option 447
-Weffc++ compiler option 448
-Werror compiler option 449, 475
-Werror-all compiler option 450
-Wextra-tokens compiler option 451
-Wformat compiler option 452
-Wformat-security compiler option 452
-Winline compiler option 453
-Wl compiler option 454
-Wmain compiler option 456
-Wmissing-declarations compiler option 456
-Wmissing-prototypes compiler option 457
-wn compiler option 458
-Wnon-virtual-dtor compiler option 459
-wo compiler option 459
-Wp compiler option 460
-Wp64 compiler option 461
-Wpointer-arith compiler option 461
-Wpragma-once compiler option 462
-wr compiler option 463
-Wremarks compiler option 464
-Wreorder compiler option 464
-Wreturn-type compiler option 465
-Wshadow compiler option 466
-Wsign-compare compiler option 467
-Wstrict-aliasing compiler option 468
-Wstrict-prototypes compiler option 469
-Wtrigraphs compiler option 469
-Wuninitialized compiler option 470
-Wunknown-pragmas compiler option 471
-Wunused-function compiler option 472
-Wunused-variable compiler option 472
-ww compiler option 473
-Wwrite-strings compiler option 474
-x (type) compiler option 475


-x compiler option 477
-X compiler option 481
-Xbind-lazy compiler option 482
-Xbind-now compiler option 483
-xHost compiler option 484
-Xlinker compiler option 487
-Zp compiler option 492
.dpi file 550, 570, 579
.dyn file 550, 570, 579
.dyn files 530, 538
.spi file 550, 570
/c compiler option 157
/C compiler option 156
/D compiler option 161
/E compiler option 180
/EP compiler option 182
/fast compiler option 190
/fp compiler option 226, 659, 668
  how to use 668
/GS compiler option 245, 267
/help compiler option 275
/I compiler option 278
/MP compiler option 328, 336
/O compiler option 344
/Oa compiler option 185
/Ob compiler option 286, 348
/Oi compiler option 194, 349
/Os compiler option 368
/Ow compiler option 197
/Oy compiler option 224, 369
/P compiler option 371
/QA compiler option 139
/QA- compiler option 140
/Qalias-const compiler option 141
/Qansi-alias compiler option 143
/Qansi-alias-check compiler option 144
/Qauto-ilp32 compiler option 146
/Qax compiler option 148
/Qc99 compiler option 158
/Qcomplex-limited-range compiler option 159
/Qcontext-limit compiler option 445
/Qcov-gen compiler option 538, 550
  code coverage tool 550
/QdD compiler option 162
/Qdiag compiler option 166
/Qdiag-disable
  sc compiler option 166
/Qdiag-disable compiler option 166
/Qdiag-dump compiler option 169
/Qdiag-enable
  sc compiler option 166
/Qdiag-enable compiler option 166
/Qdiag-error compiler option 166
/Qdiag-error-limit compiler option 170
/Qdiag-file compiler option 171
/Qdiag-file-append compiler option 172
/Qdiag-id-numbers compiler option 174
/Qdiag-once compiler option 175
/Qdiag-remark compiler option 166
/Qdiag-warning compiler option 166
/QdM compiler option 175
/QdN compiler option 176
/Qeffc++ compiler option 448
/Qfast-transcendentals compiler option 192
/Qfnalign compiler option 186
/Qfp-port compiler option 232
/Qfp-speculation compiler option 233
/Qfp-stack-check compiler option 234
/Qfp-trap compiler option 235
/Qfp-trap-all compiler option 237
/Qfreestanding compiler option 198
/Qftz compiler option 250, 674
/Qglobal-hoist compiler option 266
/Qguide compiler option 267
/Qguide-data-trans compiler option 269
/Qguide-opts compiler option 270
/Qguide-vec compiler option 273
/QH compiler option 275
/Qimf-absolute-error compiler option 201
/Qimf-accuracy-bits compiler option 202
/Qimf-arch-consistency compiler option 204
/Qimf-max-error compiler option 205
/Qimf-precision compiler option 207
/Qinline-calloc compiler option 281
/Qinline-debug-info compiler option 283
/Qinline-factor compiler option 284
/Qinline-forceinline compiler option 285
/Qinline-max-per-compile compiler option 288
/Qinline-max-per-routine compiler option 289
/Qinline-max-size compiler option 290
/Qinline-max-total-size compiler option 291
/Qinline-min-size compiler option 293
/Qinstruction compiler option 324
/Qinstrument-functions compiler option 211
/Qintel-extensions compiler option 294
/Qip compiler option 295
/Qip-no-inlining compiler option 296
/Qip-no-pinlining compiler option 297


/Qipo compiler option 298, 511
/Qipo-c compiler option 299
/Qipo-jobs compiler option 300
/Qipo-S compiler option 301
/Qipo-separate compiler option 302
/Qipp compiler option 303
/Qkeep-static-consts compiler option 214
/Qlocation compiler option 404
/QM compiler option 312
/Qmap-opts compiler option 315
/Qmcmodel compiler option 318
/QMD compiler option 321
/QMF compiler option 322
/QMG compiler option 323
/QMM compiler option 325
/QMMD compiler option 326
/QMT compiler option 332
/Qmultibyte-chars compiler option 336
/Qnobss-init compiler option 337
/Qopt-args-in-regs compiler option 350
/Qopt-block-factor compiler option 352
/Qopt-class-analysis compiler option 353
/Qopt-jump-tables compiler option 354
/Qopt-matmul compiler option 357
/Qopt-multi-version-aggressive compiler option 358
/Qopt-prefetch compiler option 359
/Qopt-ra-region-strategy compiler option 360
/Qopt-report compiler option 361
/Qopt-report-file compiler option 362
/Qopt-report-help compiler option 363
/Qopt-report-phase compiler option 364
/Qopt-report-routine compiler option 365
/Qopt-streaming-stores compiler option 366
/Qopt-subscript-in-range compiler option 367
/Qoption compiler option 405
/Qpc compiler option 372
/Qprec compiler option 329
/Qprec_div compiler option 380
/Qprec-sqrt compiler option 381
/Qprof-data-order compiler option 382
/Qprof-dir compiler option 383
/Qprof-file compiler option 384
/Qprof-func-order compiler option 386
/Qprof-gen
  srcpos compiler option 538, 550, 570
/Qprof-gen compiler option 388, 538
/Qprof-gen:srcpos compiler option
  code coverage tool 550
  test prioritization tool 570
/Qprof-genx compiler option 388
/Qprof-hotness-threshold compiler option 390
/Qprof-src-dir compiler option 391
/Qprof-src-root compiler option 392
/Qprof-src-root-cwd compiler option 394
/Qprof-use compiler option 395, 550, 579
  code coverage tool 550
  profmerge utility 579
/Qprof-value-profiling compiler option 397
/Qprofile-functions compiler option 398
/Qprofile-loops compiler option 399
/Qprofile-loops-report compiler option 401
/Qrcd compiler option 407
/Qregcall compiler option 408
/Qrestrict compiler option 409
/Qsave-temps compiler option 411
/Qscalar-rep compiler option 412
/Qsimd compiler option 415
/Qsox compiler option 417
/Qstd compiler option 420
/Qtemplate-depth compiler option 247
/Qtrapuv compiler option 249
/Qunroll compiler option 426
/Qunroll-aggressive compiler option 427
/Quse-asm compiler option 428
/Quse-intel-optimized-headers compiler option 429
/QV compiler option 432
/Qvec compiler option 433
/Qvec-guard-write compiler option 434
/Qvec-report compiler option 435
/Qvec-threshold compiler option 436
/Qwd compiler option 446
/Qwe compiler option 447
/Qwn compiler option 458
/Qwo compiler option 459
/Qwr compiler option 463
/Qww compiler option 473
/Qx compiler option 477
/QxHost compiler option 484
/Qzero-initialized-in-bss compiler option 258
/S compiler option 410
/TP compiler option 308, 423
/U compiler option 425
/w compiler option 438
/W compiler option 439
/Wall compiler option 442
/Wcheck compiler option 443
/Werror-all compiler option 450
/WL compiler option 455


/Wp64 compiler option 461 autovectorization 679 /WX compiler option 449, 475 autovectorization of innermost loops 679 /X compiler option 481 avoid /Z7 compiler option 259, 488, 490 inefficient data types 679 /Zi compiler option 259, 488, 490 mixed arithmetic expressions 679 /ZI compiler option 491 /Zp compiler option 492 C A calling conventions 123 capturing IPO output 511 absolute error Carry-less intrinsic option defining for math library function results _mm_clmulepi64_si128 882 201 Carry-less Multiplication Instruction (PCLMULQDQ) Advanced Encryption Standard (AES) Intrinsics 881 Intrinsic 881 advanced PGO options 538 checking AES intrinsics floating-point stacks 676 _mm_aesdec_si128 882 stacks 676 _mm_aesdeclast_si128 882 Checking the Floating-point Stack State 676 _mm_aesenc_si128 882 Class Libraries _mm_aesenclast_si128 882 About the Classes 1153 _mm_aesimc_si128 882 C++ classes and SIMD operations 1155 _mm_aeskeygenassist_si128 882 capabilities of C++ SIMD classes 1159 aliasing details 1154 option specifying assumption in functions 197 floating-point vector classes option specifying assumption in programs 185 arithmetic operators 1198 ALLOCATABLE cacheability support operators 1217 basic block 550 compare operators 1207 code coverage 550 conditional select operators 1211 visual presentation 550 constructors and initialization 1196 alternate tools and locations 59 conversions 1195 ANSI/ISO standard 101 data alignment 1195 application debug operations 1217 deploying 54 load and store operators 1219 application tests 570 logical operators 1205 applications minimum and maximum operators 1204 option specifying code optimization for 344 move mask operators 1220 ar tool 84 overview 1193 assembler unpack operators 1219 option passing options to 440 Hardware and Software Requirements 1153 option producing objects through 428 integer vector classes assembly files addition, subtraction operation 1170 naming 59 assignment operator 1167 auto-parallelism clear MMX(TM) 
state operators 1190 option setting guidance for 267 comparison operators 1177 auto-vectorization conditional select operators 1179 option setting guidance for 267, 273 conversions between fvec and ivec 1192 auto-vectorization hints 654 debug operations auto-vectorizer 605, 615 element access operator 1182


Class Libraries (continued) compiler options (continued) integer vector classes (continued) linker-related 56 debug operations (continued) option categories 50 element assignment operators 1182 overview of descriptions of 137 functions for SSE 1191 preprocessor options 67, 68, 69 ivec classes 1161 using 50 logical operators 1168 compiler reports multiplication operators 1173 requesting with xi* tools 519 pack operators 1190 compilervars.sh environment script 55 rules for operators 1164 compiling shift operators 1175 gcc* code with Intel(R) C++ 112 unpack operators 1185 compiling large programs 514 overview 1153 compiling with IPO 511 Quick reference 1220 complex operations Classes option enabling algebraic expansion of 159 programming example 1230 conditional check code option performing in a vectorized loop 434 option generating for specified CPU 316 conditional parallel region execution option generating processor-specific 148, 311 inline expansion 522 option generating specialized 484 configuration files 74 option generating specialized and optimized 477 conventions code coverage tool 550 in the documentation 42 color scheme 550 correct usage of countable loop 633 dynamic counters in 550 COS exporting data 550 correct usage of 633 syntax of 550 counters for dynamic profile 598 code layout 516 CPU compilation units 216, 525 option generating code for specified 316 option to prevent linking as shareable object 216 option performing optimizations for specified 333 compiler CPU time overview 41, 44 DPI lists 570 compiler information for inline function expansion 520 saving in your executable 53 create libraries using IPO 518 compiler installation option specifying root directory for 403 compiler operation D compilation phases 49 input files 57 data format invoking from the command line 55 prefetching 601 output files 49, 58 type 605, 615 compiler option mapping tool 63 data ordering optimization 585 compiler options data transformation command-line syntax 50 option 
setting guidance for 267, 269 for interoperability 95 data types for optimization 112 efficiency 682 for portability 498 DAZ flag 674 for templates 104 debug information for visibility 88, 89 option generating full 259, 488, 490 general rules for 137

1250 Index debugging 77, 78, 163 dynamic-information files 530 compiler options for 77, 78 dynamic-linking of libraries option affecting information generated 163 option enabling 152 option specifying settings to enhance 163 symbolic 77 denormal exceptions 678 E denormal numbers 672 denormal results ebp register option flushing to zero 250 option determining use in optimizations 224, 369 denormalized numbers (IEEE*) 683 efficiency 679 NaN values 683 efficient denormals 672 inlining 522 deploying applications 54 PGO options 530 diagnostic messages efficient data types 682 option affecting which are issued 166 EMMS Instruction option controlling auto-parallelizer 166 about 716 option controlling display of 166 using 718 option controlling OpenMP 166 endian data option controlling static analysis 166 dumping profile information 594 option controlling vectorizer 166 for profile-guided optimization 593 option enabling or disabling 166 loop constructs 633 option issuing only once 175 PROF_DIR 593 option sending to file 171 PROF_DUMP_INTERVAL 596 diagnostics 119, 607 using profile-guided optimization 538 differential coverage 550 environment variables directory for Linux* OS, VxWorks* OS and Mac OS* X 71 option adding to start of include path 306 LD_LIBRARY_PATH 83 option specifying for executables 151 modifying 71 option specifying for includes and libraries 151 exception handling disabling option generating table of 197 function splitting 530 exceptions inlining 522 option allowing trapping instructions to throw 222 distributing applications 54 exclude code 550 DO constructs 633 code coverage tool 550 Documentation executable files 59 conventions for 42 explicit-shape arrays driver tool commands .dpi 530, 550, 570, 579 option specifying to show and execute 431 .dyn 530, 538, 550, 570, 579, 593, 594, 599 option specifying to show but not execute 177 .spi 550, 570 dumping profile information 594, 596 pgopti.dpi 550 dyn files 530, 538, 593, 594, 599 pgopti.spi 550 dynamic 
information 529, 538, 593, 594, 598 source 538 dumping profile information 594 export keyword 101 files 538 exported templates 101 resetting profile counters 598 dynamic linker option specifying an alternate 179 dynamic shared object option producing a 413


F FMA Intrinsic (continued) _mm_fnmadd_sd 989 float-to-integer conversion _mm_fnmadd_ss 989 option enabling fast 407 _mm_fnmsub_pd, _mm256_fnmsub_pd 990 Floating-point array _mm_fnmsub_ps, _mm256_fnmsub_ps 991 Handling 677 _mm_fnmsub_sd 992 floating-point array operation 677 _mm_fnmsub_ss 993 floating-point calculations FMA intrinsics 886 option controlling semantics of 226 format function security problems Floating-point Compiler Options 659 option issuing warning for 452 Floating-point environment FTZ flag 674 -fp-model compiler option 673 Function annotations /fp compiler option 673 declspec(align) 654 pragma fenv_access 673 declspec(vector) 654 floating-point exceptions function entry and exit points denormal exceptions 678 option determining instrumentation of 211 floating-point numbers function expansion 525 formats for 683 function grouping overview 659 option enabling or disabling 385 special values 683 function grouping optimization 585 floating-point operations function order list 530, 591 option controlling semantics of 226 enabling or disabling 530 option rounding results of 232 function order lists 585 Floating-point Operations function ordering optimization 585 programming tradeoffs 663 function preemption 520 Floating-point Optimizations functions -fp-model compiler option 665 option aligning on byte boundary 186 floating-point precision Fused Multiply Add (FMA) Intrinsics 886 option controlling for significand 372 option improving for divides 380 option improving for square root 381 G option improving general 329 g++* language extensions 91 floating-point stack 234, 683 gcc C++ run-time libraries option checking 234 include file path 280 FMA Intrinsic option adding a directory to second 280 _mm_fmadd_pd, _mm256_fmadd_pd 975 option removing standard directories from 481 _mm_fmadd_ps, _mm256_fmadd_ps 976 option specifying to link to 160 _mm_fmadd_sd 977 gcc* _mm_fmadd_ss 978 porting from 116 _mm_fmaddsub_pd, _mm256_fmaddsub_pd 979 gcc* built-in 
functions 99 _mm_fmaddsub_ps, _mm256_fmaddsub_ps 980 gcc* compatibility 91 _mm_fmsub_pd, _mm256_fmsub_pd 983 gcc* considerations 112 _mm_fmsub_ps, _mm256_fmsub_ps 984 gcc* interoperability 95 _mm_fmsub_sd 985 gcc* language extensions 91 _mm_fmsub_ss 986 general compiler directives _mm_fmsubadd_pd, _mm256_fmsubadd_pd 981 for inlining functions 520 _mm_fmsubadd_ps, _mm256_fmsubadd_ps 982 for profile-guided optimization 529 _mm_fnmadd_pd, _mm256_fnmadd_pd 987 for vectorization 605, 607, 615 _mm_fnmadd_ps, _mm256_fnmadd_ps 988

1252 Index general compiler directives (continued) inlining (continued) instrumented code 530 option disabling full and partial 296 profile-optimized executable 530 option disabling partial 297 profiling information 592 option forcing 285 global function symbols option specifying lower limit for large routines 290 option binding references to shared library option specifying maximum size of function for 210 definitions 155 option specifying maximum times for a routine 289 global symbols 86, 154 option specifying maximum times for compilation option binding references to shared library unit 288 definitions 154 option specifying total size routine can grow 291 GNU C++ compatibility 91 option specifying upper limit for small routine 293 preemption 520 inlining options H option specifying percentage multiplier for 284 input files 57 Half-float intrinsics instrumentation 529, 530, 538, 594 _cvtsh_ss 1014 compilation 538 _cvtss_sh 1014 execution 538 _mm_cvtph_ps 1014 feedback compilation 538 _mm_cvtps_ph 1014 generating 530 Half-float Intrinsics program 529 overview 1013 instrumentation calls help option generating report for loops 401 using in Microsoft Visual Studio* 39 option inserting for function 398 high performance programming 529, 601 option inserting for loops 399 applications for 601 INTEL_PROF_DUMP_CUMULATIVE environment variable high-level optimizer 601 593 HLO 601 INTEL_PROF_DUMP_INTERVAL environment variable hotness threshold 593 option setting 390 Intel-provided libraries option linking dynamically 414 option linking statically 418 I Intel(R) 64 architecture based applications HLO 601 IA-32 architecture based applications Intel(R) Advanced Vector Extensions (Intel(R) AVX) HLO 601 Intrinsics 885, 886 IEEE Floating-point Standard 683 Data types 886 IEEE* Registers 886 floating-point values 683 Intel(R) IPP libraries include files 75 option letting you link to 303 inline function expansion Intel(R) linking tools 507 option specifying level of 286, 348 
Intel(R)AVX Intrinsics 886 inlined code Intel® AVX Intrinsic option producing source position information for _mm_cmp_pd 910 283 _mm_cmp_ps 911 inlining 210, 285, 288, 289, 290, 291, 293, 296, 297, _mm_cmp_ss 913 520, 522, 525, 529 _mm256_add_pd 891, 936 compiler directed 522 _mm256_add_ps 891, 912 developer directed 525 _mm256_castps_pd 969


Intel® AVX Intrinsic (continued) Intel® AVX Intrinsic (continued) _mm256_castps_si256 970 _mm256_broadcast_pd 920 _mm256_castps128_ps256 973 _mm256_broadcast_ps 921 _mm256_castps256_ps128 973 _mm256_broadcast_sd 921 _mm256_castsi128_si256 974 _mm256_broadcast_ss 922 _mm256_castsi256_pd 970 _mm256_castpd_ps 968 _mm256_castsi256_ps 971 _mm256_castpd_si256 969 _mm256_castsi256_si128 974 _mm256_castpd128_pd256 971 _mm256_cvtepi32_pd 913 _mm256_castpd256_pd128 972 _mm256_cvtepi32_ps 914 _mm256_hadd_ps 894 _mm256_cvtpd_epi32 914 _mm256_hsub_pd 896 _mm256_cvtpd_ps 915 _mm256_hsub_ps 896 _mm256_cvttpd_epi32 916 _mm256_insertf128_pd 937 _mm256_cvttps_epi32 917 _mm256_insertf128_ps 938 _mm256_div_pd 898 _mm256_insertf128_si256 938 _mm256_div_ps 899 _mm256_lddqu_si256 939 _mm256_dp_ps 899 _mm256_load_pd 923 _mm256_extractf128_pd 935 _mm256_load_ps 923 _mm256_extractf128_si256 936 _mm256_load_si256 924 _mm256_hadd_pd 893 _mm256_loadu_pd 924 _mm256_loadu_ps 925 _mm256_loadu_si256 926 _m256_cmp_pd 910 _mm256_maskload_pd, _mm_maskload_pd 926 _m256_cmp_ps 911 _mm256_maskstore_pd, _mm_maskstore_pd 933 _m256_cvtps_epi32 915 _mm256_maskstore_ps, _mm_maskstore_ps 934 _m256_cvtps_pd 916 _mm256_max_pd 918 _mm_broadcast_ss 922 _mm256_max_ps 918 _mm_permute_pd 959 _mm256_min_pd 919 _mm_permute_ps 960 _mm256_min_ps 919 _mm_permutevar_pd 961 _mm256_movedup_pd 940 _mm_permutevar_ps 961 _mm256_movehdup_ps 940 _mm_testc_pd 954 _mm256_moveldup_ps 941 _mm_testc_ps 955 _mm256_movemask_pd 941 _mm_testnzc_pd 956 _mm256_movemask_ps 942 _mm_testnzc_ps 957 _mm256_mul_pd 897 _mm_testz_pd 952 _mm256_mul_ps 897 _mm_testz_ps 953 _mm256_or_pd 905 _mm256_add_pd 891, 936 _mm256_or_ps 905 _mm256_add_ps 891, 912 _mm256_permute_pd 959 _mm256_add_ps, _mm_maskload_ps 927 _mm256_permute_ps 960 _mm256_addsub_pd 892 _mm256_permute2f128_pd 962 _mm256_addsub_ps 893 _mm256_permute2f128_ps 963 _mm256_and_pd 902 _mm256_permute2f128_si256 963 _mm256_and_ps 903 _mm256_permutevar_pd 961 _mm256_andnot_pd 903 
_mm256_permutevar_ps 961 _mm256_andnot_ps 904 _mm256_rcp_pd 902 _mm256_blend_pd 907 _mm256_round_pd 942 _mm256_blend_ps 908 _mm256_round_ps 943 _mm256_blendv_pd 909 _mm256_rsqrt_ps 901 _mm256_blendv_ps 909 _mm256_set_epi32 945

1254 Index

Intel® AVX Intrinsic (continued) interoperability (continued) _mm256_set_pd 944 with gcc* 95 _mm256_set_ps 945 interoperability options 95 _mm256_set1_epi32 947 interprocedural optimizations 209, 295, 298, 507, _mm256_set1_pd 946 510, 511, 513, 514, 516, 518, 522, 529, 596 _mm256_set1_ps 946 capturing intermediate output 511 _mm256_setzero_pd 948 code layout 516 _mm256_setzero_ps 948 compilation 507 _mm256_setzero_si256 949 compiling 511 _mm256_shuffle_pd 964 considerations 514 _mm256_shuffle_ps 965 creating libraries 518 _mm256_sqrt_pd 900 initiating 596 _mm256_sqrt_ps 901 issues 513 _mm256_store_pd 928 large programs 514 _mm256_store_ps 928 linking 507, 511 _mm256_store_si256 929 option enabling additional 295 _mm256_storeu_pd 929 option enabling between files 298 _mm256_storeu_ps 930 option enabling for single file compilation 209 _mm256_storeu_si256 931 options 510 _mm256_stream_pd 931 overview 507 _mm256_stream_ps 932 performance 513 _mm256_stream_si256 933 using 511 _mm256_sub_pd 894 whole program analysis 507 _mm256_sub_ps 895 xiar 518 _mm256_testc_pd 954 xild 518 _mm256_testc_ps 955 xilibtool 518 _mm256_testc_si256 951 intrinsics _mm256_testnzc_pd 956, 957 about 687 _mm256_testnzc_si256 952 data alignment 705 _mm256_testz_pd 952 data types 688 _mm256_testz_ps 953 inline assembly 705, 707 _mm256_testz_si256 950 memory allocation 705, 706 _mm256_unpackhi_pd 966 naming and syntax 691 _mm256_unpackhi_ps 966 references 693 _mm256_unpacklo_pd 967 registers 688 _mm256_unpacklo_ps 967 Intrinsics for all IA 695, 696, 700, 701 _mm256_xor_pd 906 arithmetic intrinsics 695 _mm256_xor_ps 906 floating point intrinsics 696 _mm256_zeroall 949 miscellaneous intrinsics 701 _mm256_zeroupper 950 string and block copy intrinsics 700 Intel® SVML Intrinsic Intrinsics for Intel(R) Post-32nm Processor Instruction _mm_asinh_pd, _mm256_asinh_pd 1042 Extensions 997, 998, 999, 1000, 1001, 1002 _mm_asinh_ps, _mm256_asinh_ps 1043 _mm_cvtph_ps() 998 _mm_cos_pd, _mm256_cos_pd 1048 
_mm_cvtps_ph() 999 intermediate files _mm256_cvtph_ps() 998 option saving during compilation 411 _mm256_cvtps_ph() 999 intermediate representation (IR) 507, 511 _rdrand_u16() 1000 interoperability _rdrand_u32() 1000 with g++* 95 _rdrand_u64() 1000


Intrinsics for Intel(R) Post-32nm Processor Instruction Lambda Expressions Extensions (continued) details on using in C++ 1237 _readfsbase_u32() 1001 function object 1241 _readfsbase_u64() 1001 lambda-capture 1239 _readgsbase_u32() 1001 language extensions _readgsbase_u64() 1001 g++* 91 _writefsbase_u32() 1002 gcc* 91 _writefsbase_u64() 1002 LD_LIBRARY_PATH 83 _writegsbase_u32() 1002 libgcc library _writegsbase_u64() 1002 option linking dynamically 415 Intrinsics for Managing Extended Processor States and option linking statically 419 Registers 1003, 1004, 1005, 1008, 1009, libraries 1010, 1011, 1012 -c compiler option 84 _fxrstor() 1009 -fPIC compiler option 84 _fxrstor64() 1009 -shared compiler option 84 _fxsave() 1008 creating your own 84 _fxsave64() 1008 default 81 _xgetbv() 1004 Intel(R) Math Library 81 _xrstor() 1012 LD_LIBRARY_PATH 83 _xrstor64() 1012 managing 83 _xsave() 1010 option enabling dynamic linking of 152 _xsave64() 1010 option enabling static linking of 153 _xsaveopt() 1011 option preventing linking with shared 418 _xsetbv() 1005 option preventing use of standard 339 Intrinsics Returning Vectors of Undefined Values option printing location of system 382 _mm_undefined_pd() 840 redistributing 54 _mm_undefined_ps() 840 shared 84 _mm_undefined_si128() 840 static 84 _mm256_undefined_pd() 994 library _mm256_undefined_ps() 994 option searching in specified directory for 310 _mm256_undefined_si128() 995 option to search for 309 introduction to the compiler 49 Library extensions 1232 IPO valarray implementation 1232 option specifying jobs during the link phase of 300 library functions 520 options 493 library math functions IR 511 option testing errno after calls to 215 IVDEP linker effect when tuning applications 601 option passing linker option to 487 linking option preventing use of startup files and libraries J when 342 option preventing use of startup files when 341 jump tables option suppressing 157 option enabling generation of 354 linking options 
493 linking tools 493, 507, 513, 518 xild 507, 513, 518 L xilibtool 518 xilink 507, 513 Lambda expressions 1237 linking tools IR 507

1256 Index linking with IPO 511 Linux* compiler options (continued) Linux* compiler options -fargument-alias 188 --version 437 -fargument-noalias-global 189 -A 139 -fasm-blocks 190 -A- 140 -fast 190 -alias-const 141 -fast-transcendentals 192 -align 142 -fbuiltin 194, 349 -ansi 143 -fcode-asm 195 -ansi-alias 143 -fcommon 196 -ansi-alias-check 144 -fdata-sections 200 -auto-ilp32 146 -fexceptions 197 -auto-p32 147 -ffnalias 197 -ax 148 -ffreestanding 198 -B 151 -ffriend-injection 199 -Bdynamic 152 -ffunction-sections 200 -Bstatic 153 -fimf-absolute-error 201 -Bsymbolic 154 -fimf-accuracy-bits 202 -Bsymbolic-functions 155 -fimf-arch-consistency 204 -c 59, 157 -fimf-max-error 205 -C 67, 68, 156 -fimf-precision 207 -c99 158 -finline 208 -check-uninit 159 -finline-functions 209 -complex-limited-range 159 -finline-limit 210 -cxxlib 160 -finstrument-functions 211 -D 67, 69, 161 -fjump-tables 213 -dD 162 -fkeep-static-consts 214 -debug 163 -fmath-errno 215 -diag 166 -fminshared 216 -diag-dump 169 -fmudflap 217 -diag-error-limit 170 -fno-gnu-keywords 218 -diag-file 171 -fno-implicit-inline-templates 219 -diag-file-append 172 -fno-implicit-templates 220 -diag-id-numbers 174 -fno-operator-names 221 -diag-once 175 -fno-rtti 222 -dM 175 -fnon-call-exceptions 222 -dN 176 -fnon-lvalue-assign 223 -dryrun 177 -fomit-frame-pointer 224, 369 -dumpmachine 178 -fp 224, 369 -dumpversion 178 -fp-model 226 -dynamic-linker 179 -fp-port 232 -E 67, 68, 180 -fp-speculation 233 -early-template-check 181 -fp-stack-check 234 -EP 67, 68, 182 -fp-trap 235 -export 182 -fp-trap-all 237 -export-dir 183 -fpack-struct 239 -fabi-version 184 -fpermissive 240 -falias 185 -fpic 241 -falign-functions 186 -fpie 242 -falign-stack 187 -freg-struct-return 243

1257 Intel(R) C++ Compiler for VxWorks* User and Reference Guides

Linux* compiler options (continued) Linux* compiler options (continued) -fshort-enums 244 -inline-max-total-size 291 -fsource-asm 244 -inline-min-size 293 -fstack-protector 245, 267 -intel-extensions 294 -fstack-security-check 245, 267 -ip 295 -fstrict-aliasing 143 -ip-no-inlining 296 -fsyntax-only 246 -ip-no-pinlining 297 -ftemplate-depth 247 -ipo 298 -ftls-model 248 -ipo-c 299 -ftrapuv 249 -ipo-jobs 300 -ftz 250 -ipo-S 301 -funroll-all-loops 251 -ipo-separate 302 -funroll-loops 426 -ipp 303 -funsigned-bitfields 252 -iprefix 304 -funsigned-char 253 -iquote 305 -fvar-tracking 163 -isystem 306 -fvar-tracking-assignments 163 -iwithprefix 306 -fverbose-asm 254 -iwithprefixbefore 307 -fvisibility 255 -Kc++ 308, 423 -fvisibility-inlines-hidden 257 -l 309 -fzero-initialized-in-bss 258 -L 310 -g 77, 259, 488, 490 -m 311 -g0 260 -M 312 -gcc 261, 263 -m32 313 -gcc-name 262 -m64 313 -gcc-sys 261, 263 -malign-double 314 -gcc-version 264 -map-opts 315 -gdwarf-2 265 -march 316 -global-hoist 266 -mcmodel 318 -guide 267 -mcpu 333 -guide-data-trans 269 -MD 321 -guide-opts 270 -MF 322 -guide-vec 273 -MG 323 -gxx-name 274 -mieee-fp 326 -H 275 -minstruction 324 -help 275 -MM 325 -help-pragma 277 -MMD 326 -I 75, 278 -mp 326 -icc 279 -MP 328 -idirafter 280 -mp1 329 -imacros 280 -MQ 330 -inline-calloc 281 -mregparm 331 -inline-debug-info 283 -MT 332 -inline-factor 284 -mtune 333 -inline-forceinline 285 -multibyte-chars 336 -inline-level 286, 348 -multiple-processes 328, 336 -inline-max-per-compile 288 -no-libgcc 338 -inline-max-per-routine 289 -nobss-init 337 -inline-max-size 290 -nodefaultlibs 339


Linux* compiler options (continued) Linux* compiler options (continued) -nolib-inline 340 -prof-value-profiling 397 -nostartfiles 341 -profile-functions 398 -nostdinc++ 342 -profile-loops 399 -nostdlib 342 -profile-loops-report 401 -o 59, 343 -pthread 402 -O 344 -Qinstall 403 -Ob 286, 348 -Qlocation 59, 404 -opt-args-in-regs 350 -Qoption 59, 405 -opt-block-factor 352 -rcd 407 -opt-calloc 352 -regcall 408 -opt-class-analysis 353 -restrict 409 -opt-jump-tables 354 -S 59, 410 -opt-malloc-options 355 -save-temps 411 -opt-matmul 357 -scalar-rep 412 -opt-multi-version-aggressive 358 -shared 413 -opt-prefetch 359 -shared-intel 414 -opt-ra-region-strategy 360 -shared-libgcc 415 -opt-report 361 -simd 415 -opt-report-file 362 -sox 417 -opt-report-help 363 -static 418 -opt-report-phase 364 -static-intel 418 -opt-report-routine 365 -static-libgcc 419 -opt-streaming-stores 366 -std 420 -opt-subscript-in-range 367 -strict-ansi 422 -Os 368 -T 423 -P 67, 68, 371 -U 67, 69, 425 -pc 372 -u (L*) 424 -pch 373 -undef 426 -pch-create 374 -unroll 426 -pch-dir 375 -unroll-aggressive 427 -pch-use 377 -use-asm 428 -pie 378 -use-intel-optimized-headers 429 -pragma-optimization-level 379 -use-msasm 430 -prec-div 380 -v 431 -prec-sqrt 381 -V (L*) 432 -print-multi-lib 382 -vec 433 -prof-data-order 382 -vec-guard-write 434 -prof-dir 383 -vec-report 435 -prof-file 384 -vec-threshold 436 -prof-func-groups 385 -w 438, 439 -prof-func-order 386 -Wa 440 -prof-gen 388 -Wabi 441 -prof-genx 388 -Wall 442 -prof-hotness-threshold 390 -Wbrief 443 -prof-src-dir 391 -Wcheck 443 -prof-src-root 392 -Wcomment 444 -prof-src-root-cwd 394 -Wcontext-limit 445 -prof-use 395 -wd 446


Linux* compiler options (continued) loops (continued) -Wdeprecated 446 option specifying blocking factor for 352 -we 447 option specifying maximum times to unroll 426 -Weffc++ 448 option using aggressive unrolling for 427 -Werror 449, 475 parallelization 624 -Werror-all 450 transformations 601 -Wextra-tokens 451 vectorization 624 -Wformat 452 -Wformat-security 452 -Winline 453 M -Wl 454 -Wmain 456 Mac OS* X compiler options -Wmissing-declarations 456 --version 437 -Wmissing-prototypes 457 -A 139 -wn 458 -A- 140 -Wnon-virtual-dtor 459 -alias-const 141 -wo 459 -align 142 -Wp 460 -ansi 143 -Wp64 461 -ansi-alias 143 -Wpointer-arith 461 -ansi-alias-check 144 -Wpragma-once 462 -ax 148 -wr 463 -B 151 -Wremarks 464 -c 157 -Wreorder 464 -C 156 -Wreturn-type 465 -c99 158 -Wshadow 466 -check-uninit 159 -Wsign-compare 467 -complex-limited-range 159 -Wstrict-aliasing 468 -cxxlib 160 -Wstrict-prototypes 469 -D 161 -Wtrigraphs 469 -dD 162 -Wuninitialized 470 -debug 163 -Wunknown-pragmas 471 -diag 166 -Wunused-function 472 -diag-dump 169 -Wunused-variable 472 -diag-error-limit 170 -ww 473 -diag-id-numbers 174 -x 477 -diag-once 175 -X 75, 481 -dM 175 -x (type) 475 -dN 176 -xHost 484 -dryrun 177 -Xlinker 487 -dumpmachine 178 -Zp 492 -E 180 loop blocking factor -early-template-check 181 option specifying 352 -EP 182 loop unrolling 601, 607 -export 182 using the HLO optimizer 601 -export-dir 183 loops 352, 426, 427, 601, 624, 633 -fabi-version 184 constructs 633 -falias 185 distribution 601 -falign-functions 186 interchange 601 -falign-stack 187


Mac OS* X compiler options (continued) Mac OS* X compiler options (continued) -fargument-alias 188 -fstack-protector 245, 267 -fargument-noalias-global 189 -fstack-security-check 245, 267 -fasm-blocks 190 -fstrict-aliasing 143 -fast 190 -fsyntax-only 246 -fast-transcendentals 192 -ftemplate-depth 247 -fbuiltin 194, 349 -ftls-model 248 -fcode-asm 195 -ftrapuv 249 -fcommon 196 -ftz 250 -fexceptions 197 -funroll-all-loops 251 -ffnalias 197 -funroll-loops 426 -ffreestanding 198 -funsigned-bitfields 252 -ffriend-injection 199 -funsigned-char 253 -ffunction-sections 200 -fvar-tracking 163 -fimf-absolute-error 201 -fvar-tracking-assignments 163 -fimf-accuracy-bits 202 -fverbose-asm 254 -fimf-arch-consistency 204 -fvisibility-inlines-hidden 257 -fimf-max-error 205 -fzero-initialized-in-bss 258 -fimf-precision 207 -g 259, 488, 490 -finline 208 -g0 260 -finline-functions 209 -gcc 261, 263 -finline-limit 210 -gcc-name 262 -finstrument-functions 211 -gcc-sys 261, 263 -fjump-tables 213 -gcc-version 264 -fkeep-static-consts 214 -gdwarf-2 265 -fmath-errno 215 -global-hoist 266 -fminshared 216 -gxx-name 274 -fno-gnu-keywords 218 -H 275 -fno-implicit-inline-templates 219 -help 275 -fno-implicit-templates 220 -help-pragma 277 -fno-operator-names 221 -I 278 -fno-rtti 222 -icc 279 -fnon-call-exceptions 222 -idirafter 280 -fnon-lvalue-assign 223 -imacros 280 -fomit-frame-pointer 224, 369 -inline-calloc 281 -fp 224, 369 -inline-factor 284 -fp-model 226 -inline-forceinline 285 -fp-port 232 -inline-level 286, 348 -fp-speculation 233 -inline-max-per-compile 288 -fp-stack-check 234 -inline-max-per-routine 289 -fp-trap 235 -inline-max-size 290 -fp-trap-all 237 -inline-max-total-size 291 -fpack-struct 239 -inline-min-size 293 -fpascal-strings 240 -ip 209 -fpermissive 240 -ip-no-inlining 296 -fpic 241 -ip-no-pinlining 297 -freg-struct-return 243 -ipo 298 -fshort-enums 244 -ipo-c 299 -fsource-asm 244 -ipo-jobs 300


Mac OS* X compiler options (continued) Mac OS* X compiler options (continued) -ipo-S 301 -opt-jump-tables 354 -ipo-separate 302 -opt-malloc-options 355 -ipp 303 -opt-multi-version-aggressive 358 -iprefix 304 -opt-prefetch 359 -iquote 305 -opt-ra-region-strategy 360 -isystem 306 -opt-report 361 -iwithprefix 306 -opt-report-file 362 -iwithprefixbefore 307 -opt-report-help 363 -Kc++ 308, 423 -opt-report-phase 364 -L 310 -opt-report-routine 365 -m 311 -opt-streaming-stores 366 -M 312 -opt-subscript-in-range 367 -m32 313 -pc 372 -m64 313 -pch 373 -malign-double 314 -pch-create 374 -mcmodel 318 -pch-dir 375 -mcpu 333 -pch-use 377 -MD 321 -pragma-optimization-level 379 -MF 322 -prec-div 380 -MG 323 -prec-sqrt 381 -minstruction 324 -print-multi-lib 382 -MM 325 -prof-data-order 382 -MMD 326 -prof-dir 383 -MP 328 -prof-file 384 -mp1 329 -prof-func-groups 385 -MQ 330 -prof-func-order 386 -mregparm 331 -prof-gen 388 -MT 332 -prof-genx 388 -mtune 333 -prof-src-dir 391 -multibyte-chars 336 -prof-src-root 392 -multiple-processes 328, 336 -prof-src-root-cwd 394 -nobss-init 337 -prof-use 395 -nodefaultlibs 339 -prof-value-profiling 397 -nolib-inline 340 -profile-functions 398 -nostartfiles 341 -profile-loops 399 -nostdinc 481 -profile-loops-report 401 -nostdinc++ 342 -pthread 402 -nostdlib 342 -Qinstall 403 -o 343 -Qlocation 404 -O 344 -Qoption 405 -O0 344 -rcd 407 -O1 344 -regcall 408 -O2 344 -restrict 409 -O3 344 -S 410 -Ob 286, 348 -save-temps 411 -opt-args-in-regs 350 -scalar-rep 412 -opt-block-factor 352 -shared-intel 414 -opt-class-analysis 353 -simd 415


Mac OS* X compiler options (continued) Mac OS* X compiler options (continued) -static-intel 418 -Wshadow 466 -std 420 -Wsign-compare 467 -strict-ansi 422 -Wstrict-aliasing 468 -u (L*) 424 -Wstrict-prototypes 469 -undef 426 -Wtrigraphs 469 -unroll 426 -Wuninitialized 470 -unroll-aggressive 427 -Wunknown-pragmas 471 -use-asm 428 -Wunused-function 472 -use-intel-optimized-headers 429 -Wunused-variable 472 -use-msasm 430 -ww 473 -v 431 -x 477 -V 432 -x (type) 475 -vec 433 -xHost 484 -vec-guard-write 434 -Xlinker 487 -vec-report 435 macro names -vec-threshold 436 option associating with an optional value 161 -w 439 macros 98, 109, 128, 129 -Wa 440 maintainability 679 -Wabi 441 make, using 56 -Wall 442 makefiles -Wbrief 443 modifying 107 -Wcheck 443 Math library -Wcontext-limit 445 Complex Functions -wd 446 cabs library function 1145 -Wdeprecated 446 cacos library function 1145 -we 447 cacosh library function 1145 -Weffc++ 448 carg library function 1145 -Werror 449, 475 casin library function 1145 -Werror-all 450 casinh library function 1145 -Wextra-tokens 451 catan library function 1145 -Wformat 452 catanh library function 1145 -Wformat-security 452 ccos library function 1145 -Winline 453 ccosh library function 1145 -Wl 454 cexp library function 1145 -Wmain 456 cexp10 library function 1145 -Wmissing-declarations 456 cimag library function 1145 -Wmissing-prototypes 457 cis library function 1145 -wn 458 clog library function 1145 -Wnon-virtual-dtor 459 clog2 library function 1145 -wo 459 conj library function 1145 -Wp 460 cpow library function 1145 -Wp64 461 cproj library function 1145 -Wpointer-arith 461 creal library function 1145 -Wpragma-once 462 csin library function 1145 -wr 463 csinh library function 1145 -Wremarks 464 csqrt library function 1145 -Wreorder 464 ctan library function 1145 -Wreturn-type 465 ctanh library function 1145


Math library (continued) Math library (continued) Exponential Functions Nearest Integer Functions (continued) cbrt library function 1126 trunc library function 1135 exp library function 1126 Remainder Functions exp10 library function 1126 fmod library function 1138 exp2 library function 1126 remainder library function 1138 expm1 library function 1126 remquo library function 1138 frexp library function 1126 Special Functions hypot library function 1126 annuity library function 1131 ilogb library function 1126 compound library function 1131 ldexp library function 1126 erf library function 1131 log library function 1126 erfc library function 1131 log10 library function 1126 gamma library function 1131 log1p library function 1126 gamma_r library function 1131 log2 library function 1126 j0 library function 1131 logb library function 1126 j1 library function 1131 pow library function 1126 jn library function 1131 scalb library function 1126 lgamma library function 1131 scalbn library function 1126 lgamma_r library function 1131 sqrt library function 1126 tgamma library function 1131 Hyperbolic Functions y0 library function 1131 acosh library function 1124 y1 library function 1131 asinh library function 1124 yn library function 1131 atanh library function 1124 Trigonometric Functions cosh library function 1124 acos library function 1119 sinh library function 1124 acosd library function 1119 sinhcosh library function 1124 asin library function 1119 tanh library function 1124 asind library function 1119 Miscellaneous Functions atan library function 1119 copysign library function 1139 atan2 library function 1119 fabs library function 1139 atand library function 1119 fdim library function 1139 atand2 library function 1119 finite library function 1139 cos library function 1119 fma library function 1139 cosd library function 1119 fmax library function 1139 cot library function 1119 fmin library function 1139 cotd library function 1119 Miscellaneous Functions 1139 sin library 
function 1119 nextafter library function 1139 sincos library function 1119 Nearest Integer Functions sincosd library function 1119 ceil library function 1135 sind library function 1119 floor library function 1135 tan library function 1119 llrint library function 1135 tand library function 1119 llround library function 1135 Math Library lrint library function 1135 C99 macros 1151 lround library function 1135 for Linux* OS 1109 modf library function 1135 function list 1113 nearbyint library function 1135 using on Linux* OS 1109 rint library function 1135 math library functions round library function 1135 option defining accuracy for 207

1264 Index math library functions (continued) MMX intrinsics (continued) option producing consistent results 204 set_pi8 728 matmul library call set1_pi16 728 option replacing matrix multiplication loop nests set1_pi32 728 with 357 set1_pi8 728 matrix multiplication loop nests setr_pi16 728 option identifying and replacing 357 setr_pi32 728 memory loads setr_pi8 728 option enabling optimizations to move 266 setzero_si64 728 memory model shift intrinsics 723 option specifying large 318 sll_pi16 723 option specifying small or medium 318 sll_pi32 723 option to use specific 318 sll_pi64 723 mixing vectorizable types in a loop 607 slli_pi16 723 MMX intrinsics slli_pi32 723 add_pi16 721 slli_pi64 723 add_pi32 721 sra_pi16 723 add_pi8 721 sra_pi32 723 adds_pi16 721 srai_pi16 723 adds_pi8 721 srai_pi32 723 adds_pu16 721 srl_pi16 723 adds_pu8 721 srl_pi32 723 and_si64 726 srl_pi64 723 andnot_si64 726 srli_pi16 723 arithmetic intrinsics 721 srli_pi32 723 cmpeq_pi16 726 srli_pi64 723 cmpeq_pi32 726 sub_pi16 721 cmpeq_pi8 726 sub_pi32 721 cmpgt_pi16 726 sub_pi8 721 cmpgt_pi32 726 subs_pi16 721 cmpgt_pi8 726 subs_pi8 721 compare intrinsics 726 subs_pu16 721 cvtm64_si64 718 subs_pu8 721 cvtsi32_si64 718 unpackhi_pi16 718 cvtsi64_m64 718 unpackhi_pi32 718 cvtsi64_si32 718 unpackhi_pi8 718 empty 718 unpacklo_pi16 718 general support intrinsics 718 unpacklo_pi32 718 logical intrinsics 726 unpacklo_pi8 718 madd_pi16 721 xor_si64 726 mulhi_pi16 721 MMX technology intrinsics mullo_pi16 721 data types 715 or_si64 726 registers 715 packs_pi16 718 MMX(TM) 605, 615 packs_pi32 718 mock object files 511 packs_pu16 718 modifying the compilation environment 71 set intrinsics 728 MOVBE instructions set_pi16 728 option generating 324 set_pi32 728

multiple processes
  option creating 328, 336
MXCSR register 674

N

normalized floating-point number 683
Not-a-Number (NaN) 683

O

object file
  option generating one per source file 302
object files
  specifying 59
Open Source tools 65
optimization 112, 299, 301, 344, 359
  option enabling prefetch insertion 359
  option generating single assembly file from multiple files 301
  option generating single object file from multiple files 299
  option specifying code 344
optimization report
  option displaying phases for 363
  option generating for routines with specified text 365
  option generating to stderr 361
  option specifying name for 362
  option specifying phase to use for 364
optimizations
  high-level language 601
  option enabling many speed 368
  option generating recommendations to improve 270
  overview of 529
  PGO methodology 530
  profile-guided 529
  profile-guided optimization 530
option mapping tool 63
options used for IPO 493, 510
output files 58, 343
  option specifying name for 343

P

parallel invocations with makefile 530
parallelization 624
performance 679
performance issues with IPO 513
PGO 529
PGO API
  _PGOPTI_Prof_Dump_All 594
  _PGOPTI_Prof_Dump_And_Reset 599
  _PGOPTI_Prof_Reset 598
  _PGOPTI_Set_Interval_Prof_Dump 596
  enable 592
PGO tools 550, 570, 579
  code coverage tool 550
  profmerge 579
  proforder 579
  test prioritization tool 570
pgopti.dpi file 530, 593
pgopti.spi file 570
pgouser.h header file 592
pointer aliasing
  option using aggressive multi-versioning check for 358
porting
  from GNU gcc* to Microsoft Visual C++* 116
porting applications 105, 116
position-independent code
  option generating 241, 242
position-independent executable
  option producing 378
pragma alloc_section 1071
pragma distribute_point 1072
pragma forceinline 1076
pragma inline 1076
pragma ivdep 1078
pragma loop_count 1080
pragma memref_control 1082
pragma noinline 1076
pragma noprefetch 1090
pragma noswp 1096
pragma nounroll 1098
pragma nounroll_and_jam 1100
pragma novector 1086
pragma optimization_level 1088
pragma optimize 1087
pragma prefetch 1090
pragma simd 644, 1094
pragma swp 1096
pragma unroll 1098
pragma unroll_and_jam 1100
pragma unused 1101

pragma vector
  aligned 1102
  always 1102
  assert 1102
  nontemporal 1102
  temporal 1102
  unaligned 1102
pragmas
  option displaying 277
Pragmas
  Intel-specific 1070
  Intel-supported 1106
  overview 1069
precompiled header files 60
predefined macros 98, 128, 129
preempting functions 520
prefetch insertion
  option enabling 359
preprocessor compiler options 67
prioritizing application tests 570
processor
  option optimizing for specific 333
processor-specific code
  option generating 148
  option generating and optimizing 477
PROF_DIR environment variable 593
PROF_DUMP_INTERVAL environment variable (deprecated) 593
PROF_NO_CLOBBER environment variable 593
profile data records
  option affecting search for 391
  option letting you use relative paths when searching for 392, 394
profile-guided optimization 529, 530, 538, 540, 585, 592, 593, 596, 598, 599
  API support 592
  data ordering optimization 585
  dumping profile information 599
  environment variables 593
  example of 538
  function grouping optimization 585
  function order lists optimization 585
  function ordering optimization 585
  function/loop execution time 540
  interval profile dumping 596
  options 530
  overview 529
  phases 538
  resetting dynamic profile counters 598
  resetting profile information 599
  support 592
  usage model 529
profile-optimized code 530, 592, 594, 596, 598
  dumping 594, 596
  generating information 592
  resetting dynamic counters for 598
profiling
  option enabling use of information from 395
  option instrumenting a program for 388
  option specifying directory for output files 383
  option specifying name for summary 384
profiling information
  option enabling function ordering 386
  option using to order static data items 382
profmerge 579
programs
  option maximizing speed in 190
  option specifying aliasing should be assumed in 185

Q

quick reference 510, 530
  IPO options 510
  PGO options 530

R

redistributable package 54
redistributing libraries 54
references to global function symbols
  option binding to shared library definitions 155
references to global symbols
  option binding to shared library definitions 154
register allocator
  option selecting method for partitioning 360
relative error
  option defining for math library function results 202
  option defining maximum for math library function results 205
remarks
  option changing to errors 450
report generation
  dynamic profile counters 598

1267 Intel(R) C++ Compiler for VxWorks* User and Reference Guides report generation (continued) SSE intrinsics (continued) profile information 599 cmpgt_ps 740 using xi* tools 519 cmpgt_ss 740 response files 75 cmple_ps 740 routine entry cmple_ss 740 option specifying the stack alignment to use on cmplt_ps 740 187 cmplt_ss 740 routines cmpneq_ps 740 option passing parameters in registers 350 cmpneq_ss 740 run-time performance cmpnge_ps 740 improving 677 cmpnge_ss 740 cmpngt_ps 740 cmpngt_ss 740 S cmpnle_ps 740 cmpnle_ss 740 scalar replacement cmpnlt_ps 740 option enabling during loop transformation 412 cmpnlt_ss 740 option using aggressive multi-versioning check for cmpord_ps 740 358 cmpord_ss 740 shareable objects 87 cmpunord_ps 740 shared libraries 84 cmpunord_ss 740 shared object comieq_ss 740 option producing a dynamic 413 comige_ss 740 Short Vector Math Library (SVML) Intrinsics 1017 comigt_ss 740 signed infinity 683 comile_ss 740 signed zero 683 comilt_ss 740 simd comineq_ss 740 vectorization compare operations 740 function annotations 1094 conversion operations 750 SIMD vectorization cvtpi16_ps 750 option disabling 415 cvtpi32_ps 750 specifying file names cvtpi32x2_ps 750 for assembly files 59 cvtpi8_ps 750 for executable files 59 cvtps_pi16 750 for object files 59 cvtps_pi32 750 SSE 605, 615 cvtps_pi8 750 SSE intrinsics cvtpu16_ps 750 add_ps 734 cvtpu8_ps 750 add_ss 734 cvtsi32_ss 750 and_ps 739 cvtsi64_ss 750 andnot_ps 739 cvtss_f32 750 arithmetic operations 734 cvtss_si32 750 avg_pu16 761 cvtss_si64 750 avg_pu8 761 cvttps_pi32 750 cacheability support operations 760 cvttss_si32 750 cmpeq_ps 740 cvttss_si64 750 cmpeq_ss 740 data types 731 cmpge_ps 740 div_ps 734 cmpge_ss 740 div_ss 734


SSE intrinsics (continued) SSE intrinsics (continued) extract_pi16 761 setr_ps 757 getcsr 765 setzero_ps 757 insert_pi16 761 sfence 760 integer operations 761 shuffle function macro 768 load operations 755 shuffle_pi16 761 load_ps 755 shuffle_ps 766 load_ps1 755 sqrt_ps 734 load_ss 755 sqrt_ss 734 loadh_pi 755 store operations 758 loadl_pi 755 store_ps 758 loadr_ps( 755 store_ps1 758 loadu_ps 755 store_ss 758 logical operations 739 storeh_pi 758 maskmove_si641 761 storel_pi 758 matrix transposition macro 772 storer_ps 758 max_pi16 761 storeu_ps 758 max_ps 734 stream_pi 760 max_pu8 761 stream_ps 760 max_ss 734 sub_ps 734 min_pi16 761 sub_ss 734 min_ps 734 ucomieq_ss 740 min_pu8 761 ucomige_ss 740 min_ss 734 ucomigt_ss 740 miscellaneous operations 766 ucomile_ss 740 move_ss 766 ucomilt_ss 740 movehl_ps 766 ucomineq_ss 740 movelh_ps 766 unpackhi_ps 766 movemask_pi8 761 unpacklo_ps 766 movemask_ps 766 xor_ps 739 mul_ps 734 SSE2 605, 615 mul_ss 734 SSE2 intrinsics mulhi_pu16 761 add_epi16 801 or_ps 739 add_epi32 801 prefetch 760 add_epi64 801 programming with SSE intrinsics 734 add_epi8 801 rcp_ps 734 add_pd 774 rcp_ss 734 add_sd 774 read/write control register macros 769 add_si64 801 read/write register intrinsics 765 adds_epi16 801 registers 731 adds_epi8 801 rsqrt_ps 734 adds_epu16 801 rsqrt_ss 734 adds_epu8 801 sad_pu8 761 and_pd 778 set operations 757 and_si128 810 set_ps 757 andnot_pd 778 set_ps1 757 andnot_si128 810 set_ss 757 arithmetic operations intrinsics 801 setcsr 765 avg_epu16 801


SSE2 intrinsics (continued) SSE2 intrinsics (continued) avg_epu8 801 cvtpd_pi32 790 cacheability support intrinsics 829 cvtpd_ps 790 casting support intrinsics 837 cvtpi32_pd 790 clflush 829 cvtps_epi32 819 cmpeq_epi16 816 cvtps_pd 790 cmpeq_epi32 816 cvtsd_f64 790 cmpeq_epi8 816 cvtsd_si32 790 cmpeq_pd 779 cvtsd_si64 819 cmpeq_sd 779 cvtsd_ss 790 cmpge_pd 779 cvtsi128_si32 821 cmpge_sd 779 cvtsi128_si64 821 cmpgt_epi16 816 cvtsi32_sd 790 cmpgt_epi32 816 cvtsi32_si128 821 cmpgt_epi8 816 cvtsi64_sd 819 cmpgt_pd 779 cvtsi64_si128 821 cmpgt_sd 779 cvtss_sd 790 cmple_pd 779 cvttpd_epi32 790 cmple_sd 779 cvttpd_pi32 790 cmplt_epi16 816 cvttps_epi32 819 cmplt_epi32 816 cvttsd_si32 790 cmplt_epi8 816 cvttsd_si64 819 cmplt_pd 779 div_pd 774 cmplt_sd 779 div_sd 774 cmpneq_pd 779 extract_epi16 831 cmpneq_sd 779 FP arithmetic intrinsics 774 cmpnge_pd 779 FP compare intrinsics 779 cmpnge_sd 779 FP conversion intrinsics 790 cmpngt_pd 779 FP load intrinsics 794 cmpngt_sd 779 FP logical operations intrinsics 778 cmpnle_pd 779 FP set intrinsics 796 cmpnle_sd 779 FP store intrinsics 798 cmpnlt_pd 779 insert_epi16 831 cmpnlt_sd 779 lfence 829 cmpord_pd 779 load operation intrinsics 822 cmpord_sd 779 load_pd 794 cmpunord_pd 779 load_sd 794 cmpunord_sd 779 load_si128 822 comieq_sd 779 load1_pd 794 comige_sd 779 loadh_pd 794 comigt_sd 779 loadl_epi64 822 comile_sd 779 loadl_pd 794 comilt_sd 779 loadr_pd 794 comineq_sd 779 loadu_pd 794 compare operation intrinsics 816 loadu_si128 822 conversion operation intrinsics 819 logical operations intrinsics 810 cvtepi32_pd 790 madd_epi16 801 cvtepi32_ps 819 maskmoveu_si128 827 cvtpd_epi32 790 max_epi16 801


SSE2 intrinsics (continued) SSE2 intrinsics (continued) max_epu8 801 setzero_si128 823 max_pd 774 shift operation intrinsics 811 max_sd 774 shuffle macro 839 mfence 829 shuffle_epi32 831 min_epi16 801 shuffle_pd 831 min_epu8 801 shufflehi_epi16 831 min_pd 774 shufflelo_epi16 831 min_sd 774 sll_epi16 811 miscellaneous intrinsics 831 sll_epi32 811 move operation intrinsics 821 sll_epi64 811 move_epi64 831 slli_epi16 811 move_sd 796 slli_epi32 811 movemask_epi8 831 slli_epi64 811 movemask_pd 831 slli_si128 811 movepi64_pi64 831 sqrt_pd 774 movpi64_pi64 831 sqrt_sd 774 mul_epu32 801 sra_epi16 811 mul_pd 774 sra_epi32 811 mul_sd 774 srai_epi16 811 mul_su32 801 srai_epi32 811 mulhi_epi16 801 srl_epi16 811 mulhi_epu16 801 srl_epi32 811 mullo_epi16 801 srl_epi64 811 or_pd 778 srli_epi16 811 or_si128 810 srli_epi32 811 packs_epi16 831 srli_epi64 811 packs_epi32 831 srli_si128 811 packus_epi16 831 store operation intrinsics 827 pause intrinsic 838 store_pd 798 sad_epu8 801 store_sd 798 set operation intrinsic 823 store_si128 827 set_epi16 823 store1_pd 798 set_epi32 823 storeh_pd 798 set_epi64 823 storel_epi64 827 set_epi8 823 storel_pd 798 set_pd 796 storer_pd 798 set_sd 796 storeu_pd 798 set1_epi16 823 storeu_si128 827 set1_epi32 823 stream_pd 829 set1_epi64 823 stream_si128 829 set1_epi8 823 stream_si32 829 set1_pd 796 sub_epi16 801 setr_epi16 823 sub_epi32 801 setr_epi32 823 sub_epi64 801 setr_epi64 823 sub_epi8 801 setr_epi8 823 sub_pd 774 setr_pd 796 sub_sd 774 setzero_pd 796 sub_si64 801


SSE2 intrinsics (continued) SSE4 intrinsics (continued) subs_epi16 801 packed compare intrinsics 875 subs_epi8 801 packed format conversion intrinsics 867 subs_epu16 801 packed integer min/max intrinsics 868 subs_epu8 801 register insertion/extraction intrinsics 871 ucomieq_sd 779 test intrinsics 873 ucomige_sd 779 SSSE3 intrinsics ucomigt_sd 779 absolute value intrinsics 854 ucomile_sd 779 addition intrinsics 849 ucomilt_sd 779 concatenate intrinsics 856 ucomineq_sd 779 multiplication intrinsics 853 unpackhi_epi16 831 negation intrinsics 857 unpackhi_epi32 831 shuffle intrinsics 855 unpackhi_epi64 831 subtraction intrinsics 851 unpackhi_epi8 831 stack variables unpackhi_pd 831 option initializing to NaN 249 unpacklo_epi16 831 standard directories unpacklo_epi32 831 option removing from include search path 481 unpacklo_epi64 831 standards conformance 101 unpacklo_epi8 831 static libraries 84 unpacklo_pd 831 Streaming SIMD Extensions 607, 731 xor_pd 778 Streaming SIMD Extensions 2 773 xor_si128 810 Streaming SIMD Extensions 3 843 SSE3 intrinsics Streaming SIMD Extensions 4 865 addsub_pd 845 streaming stores addsub_ps 844 option generating for optimization 366 float32 vector intrinsics 844 subnormal numbers 672 float64 vector intrinsics 845 Supplemental Streaming SIMD Extensions 3 849 hadd_pd 845 supported tools 65 hadd_ps 844 SVML Intrinsic hsub_pd 845 _mm_ctanh_ps, _mm256_ctanh_ps 1064 hsub_ps 844 _mm_acos_pd, _mm256_acos_pd 1038 integer vector intrinsic 843 _mm_acos_ps, _mm256_acos_ps 1039 loaddup_pd 845 _mm_acosh_pd, _mm256_acosh_pd 1040 macro functions 847 _mm_acosh_ps, _mm256_acosh_ps 1040 miscellaneous intrinsics 847 _mm_asin_pd, _mm256_asin_pd 1041 movedup_pd 845 _mm_asin_ps, _mm256_asin_ps 1042 movehdup_ps 844 _mm_atan_pd, _mm256_atan_pd 1044 moveldup_ps 844 _mm_atan_ps, _mm256_atan_ps 1044 SSE4 intrinsic _mm_atan2_pd, _mm256_atan2_pd 1045 cacheability support intrinsic 874 _mm_atan2_ps, _mm256_atan2_ps 1046 DWORD multiply intrinsics 871 _mm_atanh_pd, 
_mm256_atanh_pd 1047 packed compare for equal intrinsic 874 _mm_atanh_ps, _mm256_atanh_ps 1047 packed DWORD to unsigned WORD intrinsic 874 _mm_cacos_ps, _mm256_cacos_ps 1058 SSE4 intrinsics _mm_cacosh_ps, _mm256_cacosh_ps 1058 application targeted accelerator intrinsics 878 _mm_casin_ps, _mm256_casin_ps 1059 floating-point rounding intrinsics 870 _mm_casinh_ps, _mm256_casinh_ps 1060 FP dot product intrinsics 867 _mm_catan_ps, _mm256_catan_ps 1060 packed blending intrinsics 866 _mm_catanh_ps, _mm256_catanh_ps 1061


SVML Intrinsic (continued) symbol preemption 87 _mm_cbrt_pd, _mm256_cbrt_pd 1035 symbol visibility _mm_cbrt_ps, _mm256_cbrt_ps 1036 option specifying 255 _mm_ccos_ps, _mm256_ccos_ps 1063 _mm_ccosh_ps, _mm256_ccosh_ps 1063 _mm_cexp_ps, _mm256_cexp_ps 1023 T _mm_clog_ps, _mm256_clog_ps 1032 _mm_clog10_ps, _mm256_clog10_ps 1031 template instantiation 104 _mm_clog2_ps, _mm256_clog2_ps 1030 templates _mm_cos_ps, _mm256_cos_ps 1049 exported 101 _mm_cosh_pd, _mm256_cosh_pd 1049 test prioritization tool 570 _mm_cosh_ps, _mm256_cosh_ps 1050 examples 570 _mm_cpow_ps, _mm256_cpow_ps 1025 options 570 _mm_csin_ps, _mm256_csin_ps 1061 requirements 570 _mm_csinh_ps, _mm256_csinh_ps 1062 thread-local storage 100 _mm_csqrt_ps, _mm256_csqrt_ps 1038 threshold control for auto-parallelization _mm_ctan_ps, _mm256_ctan_ps 1064 reordering 607 _mm_erf_pd, _mm256_erf_pd 1017 tool options 550, 570, 579 _mm_erf_ps, _mm256_erf_ps 1018 code coverage tool 550 _mm_erfc_pd, _mm256_erfc_pd 1019 profmerge 579 _mm_erfc_ps, _mm256_erfc_ps 1019 proforder 579 _mm_exp_pd, _mm256_exp_pd 1022 test prioritization 570 _mm_exp_ps, _mm256_exp_ps 1022 tools 404, 405, 550 _mm_exp2_pd, _mm256_exp2_pd 1020 option passing options to 405 _mm_exp2_ps, _mm256_exp2_ps 1021 option specifying directory for supporting 404 _mm_invcbrt_pd, _mm256_invcbrt_pd 1036 transcendental functions _mm_invcbrt_ps, _mm256_invcbrt_ps 1037 option replacing calls to 192 _mm_invsqrt_pd, _mm256_invsqrt_pd 1034 _mm_invsqrt_ps, _mm256_invsqrt_ps 1034 _mm_log_pd, _mm256_log_pd 1029 U _mm_log_ps, _mm256_log_ps 1030 unvectorizable copy 607 _mm_log10_pd, _mm256_log10_pd 1028 user functions 522, 525, 538, 540, 593 _mm_log10_ps, _mm256_log10_ps 1028 PGO environment 593 _mm_log2_pd, _mm256_log2_pd 1026 profile-guided optimization 538, 540 _mm_log2_ps, _mm256_log2_ps 1027 utilities 550, 579 _mm_pow_pd, _mm256_pow_pd 1024 profmerge 579 _mm_pow_ps, _mm256_pow_ps 1025 proforder 579 _mm_sin_pd, _mm256_sin_pd 1051 _mm_sin_ps, _mm256_sin_ps 1051 
_mm_sincos_pd, _mm256_sincos_pd 1056 V _mm_sincos_ps, _mm256_sincos_ps 1057 _mm_sinh_pd, _mm256_sinh_pd 1052 valarray implementation _mm_sinh_ps, _mm256_sinh_ps 1053 compiling code 1233 _mm_sqrt_pd, _mm256_sqrt_pd 1032 using in code 1233 _mm_sqrt_ps, _mm256_sqrt_ps 1033 value-profiling 598 _mm_tan_pd, _mm256_tan_pd 1053 variables _mm_tan_ps, _mm256_tan_ps 1054 option placing explicitly zero-initialized in DATA _mm_tanh_pd, _mm256_tanh_pd 1055 section 258, 337 _mm_tanh_ps, _mm256_tanh_ps 1055 option placing uninitialized in DATA section 337

1273 Intel(R) C++ Compiler for VxWorks* User and Reference Guides variables (continued) VxWorks* compiler options (continued) option saving always 214 -diag-error-limit 170 vector copy -diag-file 171 options 606 -diag-file-append 172 overview 605, 615 -diag-id-numbers 174 programming guidelines 605, 607, 615 -diag-once 175 vectorization -dM 175 option disabling 433 -dN 176 option setting threshold for loops 436 -dryrun 177 Vectorization -dumpmachine 178 directive 654 -dumpversion 178 language support 654 -dynamic-linker 179 pragma 654 -E 180 pragma simd 1094 -early-template-check 181 SIMD 644 -EP 182 user-mandated 644 -export 182 vectorizer -export-dir 183 option controlling diagnostics reported by 435 -fabi-version 184 vectorizing 529, 607, 633 -falias 185 loops 529, 633 -falign-functions 186 version -falign-stack 187 option saving in executable or object file 417 -fargument-alias 188 visibility declaration attribute 88 -fargument-noalias-global 189 VxWorks* compiler options -fasm-blocks 190 --version 437 -fast 190 -A 139 -fast-transcendentals 192 -A- 140 -fbuiltin 194, 349 -alias-const 141 -fcode-asm 195 -align 142 -fcommon 196 -ansi 143 -fdata-sections 200 -ansi-alias 143 -fexceptions 197 -ansi-alias-check 144 -ffnalias 197 -auto-ilp32 146 -ffreestanding 198 -auto-p32 147 -ffriend-injection 199 -ax 148 -ffunction-sections 200 -B 151 -fimf-absolute-error 201 -Bdynamic 152 -fimf-accuracy-bits 202 -Bstatic 153 -fimf-arch-consistency 204 -c 157 -fimf-max-error 205 -C 156 -fimf-precision 207 -c99 158 -finline 208 -check-uninit 159 -finline-functions 209 -complex-limited-range 159 -finline-limit 210 -cxxlib 160 -finstrument-functions 211 -D 161 -fjump-tables 213 -dD 162 -fkeep-static-consts 214 -debug 163 -fmath-errno 215 -diag 166 -fminshared 216 -diag-dump 169 -fmudflap 217


VxWorks* compiler options (continued) VxWorks* compiler options (continued) -fno-defer-pop 217 -guide-data-trans 269 -fno-gnu-keywords 218 -guide-opts 270 -fno-implicit-fp-sse 219 -guide-vec 273 -fno-implicit-inline-templates 219 -gxx-name 274 -fno-implicit-templates 220 -H 275 -fno-operator-names 221 -help 275 -fno-rtti 222 -help-pragma 277 -fnon-call-exceptions 222 -I 278 -fnon-lvalue-assign 223 -icc 279 -fomit-frame-pointer 224, 369 -idirafter 280 -fp-model 226 -imacros 280 -fp-port 232 -inline-calloc 281 -fp-speculation 233 -inline-debug-info 283 -fp-stack-check 234 -inline-factor 284 -fp-trap 235 -inline-forceinline 285 -fp-trap-all 237 -inline-level 286, 348 -fpack-struct 239 -inline-max-per-compile 288 -fpermissive 240 -inline-max-per-routine 289 -fpic 241 -inline-max-size 290 -fpie 242 -inline-max-total-size 291 -freg-struct-return 243 -inline-min-size 293 -fshort-enums 244 -intel-extensions 294 -fsource-asm 244 -ip 295 -fstack-protector 245, 267 -ip-no-inlining 296 -fstack-security-check 245, 267 -ip-no-pinlining 297 -fstrict-aliasing 143 -ipo 298 -fsyntax-only 246 -ipo-c 299 -ftemplate-depth 247 -ipo-jobs 300 -ftls-model 248 -ipo-S 301 -ftrapuv 249 -ipo-separate 302 -ftz 250 -ipp 303 -funroll-all-loops 251 -iprefix 304 -funroll-loops 426 -iquote 305 -funsigned-bitfields 252 -isystem 306 -funsigned-char 253 -iwithprefix 306 -fvar-tracking 163 -iwithprefixbefore 307 -fvar-tracking-assignments 163 -Kc++ 308, 423 -fverbose-asm 254 -l 309 -fvisibility 255 -L 310 -fvisibility-inlines-hidden 257 -m 311 -fzero-initialized-in-bss 258 -M 312 -g 259, 488, 490 -m32 313 -g0 260 -m64 313 -gcc-name 262 -malign-double 314 -gcc-version 264 -map-opts 315 -gdwarf-2 265 -march 316 -global-hoist 266 -mc++-abr 320 -guide 267 -mcmodel 318


VxWorks* compiler options (continued)
    -mcpu 333
    -MD 321
    -membedded 321
    -MF 322
    -MG 323
    -mieee-fp 326
    -minstruction 324
    -MM 325
    -MMD 326
    -mp 326
    -MP 328
    -mp1 329
    -MQ 330
    -mregparm 331
    -mrtp 332
    -MT 332
    -mtune 333
    -multibyte-chars 336
    -multiple-processes 328, 336
    -no-libgcc 338
    -nobss-init 337
    -nodefaultlibs 339
    -nolib-inline 340
    -non-static 340
    -nostartfiles 341
    -nostdinc++ 342
    -nostdlib 342
    -o 343
    -O 344
    -Ob 286, 348
    -opt-args-in-regs 350
    -opt-block-factor 352
    -opt-calloc 352
    -opt-class-analysis 353
    -opt-jump-tables 354
    -opt-malloc-options 355
    -opt-matmul 357
    -opt-multi-version-aggressive 358
    -opt-prefetch 359
    -opt-ra-region-strategy 360
    -opt-report 361
    -opt-report-file 362
    -opt-report-help 363
    -opt-report-phase 364
    -opt-report-routine 365
    -opt-streaming-stores 366
    -opt-subscript-in-range 367
    -Os 368
    -P 371
    -pc 372
    -pch 373
    -pch-create 374
    -pch-dir 375
    -pch-use 377
    -pie 378
    -pragma-optimization-level 379
    -prec-div 380
    -prec-sqrt 381
    -print-multi-lib 382
    -prof-data-order 382
    -prof-dir 383
    -prof-file 384
    -prof-func-groups 385
    -prof-func-order 386
    -prof-gen 388
    -prof-genx 388
    -prof-hotness-threshold 390
    -prof-src-dir 391
    -prof-src-root 392
    -prof-src-root-cwd 394
    -prof-use 395
    -prof-value-profiling 397
    -profile-functions 398
    -profile-loops 399
    -profile-loops-report 401
    -pthread 402
    -Qinstall 403
    -Qlocation 404
    -Qoption 405
    -rcd 407
    -regcall 408
    -restrict 409
    -S 410
    -save-temps 411
    -scalar-rep 412
    -shared 413
    -shared-intel 414
    -shared-libgcc 415
    -simd 415
    -sox 417
    -static 418
    -static-intel 418
    -static-libgcc 419
    -std 420
    -strict-ansi 422
    -T 423


VxWorks* compiler options (continued)
    -U 425
    -u (L*) 424
    -undef 426
    -unroll 426
    -unroll-aggressive 427
    -use-asm 428
    -use-intel-optimized-headers 429
    -use-msasm 430
    -v 431
    -V (L*) 432
    -vec 433
    -vec-guard-write 434
    -vec-report 435
    -vec-threshold 436
    -w 438, 439
    -Wa 440
    -Wabi 441
    -Wall 442
    -Wbrief 443
    -Wcheck 443
    -Wcomment 444
    -Wcontext-limit 445
    -wd 446
    -Wdeprecated 446
    -we 447
    -Weffc++ 448
    -Werror 449, 475
    -Werror-all 450
    -Wextra-tokens 451
    -Wformat 452
    -Wformat-security 452
    -Winline 453
    -Wl 454
    -Wmain 456
    -Wmissing-declarations 456
    -Wmissing-prototypes 457
    -wn 458
    -Wnon-virtual-dtor 459
    -wo 459
    -Wp 460
    -Wp64 461
    -Wpointer-arith 461
    -Wpragma-once 462
    -wr 463
    -Wremarks 464
    -Wreorder 464
    -Wreturn-type 465
    -Wshadow 466
    -Wsign-compare 467
    -Wstrict-aliasing 468
    -Wstrict-prototypes 469
    -Wtrigraphs 469
    -Wuninitialized 470
    -Wunknown-pragmas 471
    -Wunused-function 472
    -Wunused-variable 472
    -ww 473
    -x 477
    -X 481
    -x (type) 475
    -Xbind-lazy 482
    -Xbind-now 483
    -Xlinker 487
    -Zp 492

W

warnings
    option changing to errors 449, 450, 475
warnings and errors 119
whole program analysis 507
Windows* compiler options
    -Werror-all 450
    -WX 449, 475
    /c 157
    /C 156
    /D 161
    /E 180
    /EP 182
    /fast 190
    /fp 226
    /GS 245, 267
    /help 275
    /I 278
    /MP 328, 336
    /O 344
    /Ob 286, 348
    /Oi 194, 349
    /Os 368
    /Oy 224, 369
    /P 371
    /prof-func-order 386
    /QA 139
    /QA- 140
    /Qalias-const 141


Windows* compiler options (continued)
    /Qansi-alias 143
    /Qansi-alias-check 144
    /Qargument-alias 188
    /Qauto-ilp32 146
    /Qax 148
    /Qc99 158
    /Qcomplex-limited-range 159
    /Qcontext-limit 445
    /QdD 162
    /Qdiag 166
    /Qdiag-dump 169
    /Qdiag-error-limit 170
    /Qdiag-file 171
    /Qdiag-file-append 172
    /Qdiag-id-numbers 174
    /Qdiag-once 175
    /QdM 175
    /QdN 176
    /Qeffc++ 448
    /Qfast-transcendentals 192
    /Qfnalign 186
    /Qfp-port 232
    /Qfp-speculation 233
    /Qfp-stack-check 234
    /Qfp-trap 235
    /Qfp-trap-all 237
    /Qfreestanding 198
    /Qftz 250
    /Qglobal-hoist 266
    /Qguide 267
    /Qguide-data-trans 269
    /Qguide-opts 270
    /Qguide-vec 273
    /QH 275
    /Qhelp-pragma 277
    /Qimf-absolute-error 201
    /Qimf-accuracy-bits 202
    /Qimf-arch-consistency 204
    /Qimf-max-error 205
    /Qimf-precision 207
    /Qinline-calloc 281
    /Qinline-debug-info 283
    /Qinline-factor 284
    /Qinline-forceinline 285
    /Qinline-max-per-compile 288
    /Qinline-max-per-routine 289
    /Qinline-max-size 290
    /Qinline-max-total-size 291
    /Qinline-min-size 293
    /Qinstruction 324
    /Qinstrument-functions 211
    /Qintel-extensions 294
    /Qip 295
    /Qip-no-inlining 296
    /Qip-no-pinlining 297
    /Qipo 298
    /Qipo-c 299
    /Qipo-jobs 300
    /Qipo-S 301
    /Qipo-separate 302
    /Qipp 303
    /Qkeep-static-consts 214
    /Qlocation 404
    /QM 312
    /Qmap-opts 315
    /Qmcmodel 318
    /QMD 321
    /QMF 322
    /QMG 323
    /QMM 325
    /QMMD 326
    /QMT 332
    /Qmultibyte-chars 336
    /Qnobss-init 337
    /Qopt-args-in-regs 350
    /Qopt-block-factor 352
    /Qopt-class-analysis 353
    /Qopt-jump-tables 354
    /Qopt-matmul 357
    /Qopt-multi-version-aggressive 358
    /Qopt-prefetch 359
    /Qopt-ra-region-strategy 360
    /Qopt-report 361
    /Qopt-report-file 362
    /Qopt-report-help 363
    /Qopt-report-phase 364
    /Qopt-report-routine 365
    /Qopt-streaming-stores 366
    /Qopt-subscript-in-range 367
    /Qoption 405
    /Qpc 372
    /Qprec 329
    /Qprec_div 380
    /Qprec-sqrt 381
    /Qprof-data-order 382
    /Qprof-dir 383


Windows* compiler options (continued)
    /Qprof-file 384
    /Qprof-gen 388
    /Qprof-genx 388
    /Qprof-hotness-threshold 390
    /Qprof-src-dir 391
    /Qprof-src-root 392
    /Qprof-src-root-cwd 394
    /Qprof-use 395
    /Qprof-value-profiling 397
    /Qprofile-functions 398
    /Qprofile-loops 399
    /Qprofile-loops-report 401
    /Qrcd 407
    /Qregcall 408
    /Qrestrict 409
    /Qsave-temps 411
    /Qscalar-rep 412
    /Qsimd 415
    /Qsox 417
    /Qstd 420
    /Qtemplate-depth 247
    /Qtrapuv 249
    /Qunroll 426
    /Qunroll-aggressive 427
    /Quse-asm 428
    /Quse-intel-optimized-headers 429
    /QV 432
    /Qvec 433
    /Qvec-guard-write 434
    /Qvec-report 435
    /Qvec-threshold 436
    /Qwd 446
    /Qwe 447
    /Qwn 458
    /Qwo 459
    /Qwr 463
    /Qww 473
    /Qx 477
    /QxHost 484
    /Qzero-initialized-in-bss 258
    /S 410
    /TP 308, 423
    /U 425
    /w 438
    /W 439
    /Wall 442
    /Wcheck 443
    /WL 455
    /Wp64 461
    /X 481
    /Z7 259, 488, 490
    /Zi 259, 488, 490
    /ZI 491
    /Zp 492

X

xiar 513, 518
xild 507, 513, 518
xilib 518
xilibtool 518
xilink 507, 513, 518

