Dissertation

Performance Portable Short Vector Transforms

carried out for the purpose of obtaining the academic degree of Doctor of Technical Sciences under the supervision of

Ao. Univ.-Prof. Dipl.-Ing. Dr. techn. Christoph W. Überhuber
E115 – Institut für Angewandte und Numerische Mathematik (Institute for Applied and Numerical Mathematics)

submitted to the Vienna University of Technology (Technische Universität Wien), Faculty of Technical Natural Sciences and Informatics

by

Dipl.-Ing. Franz Franchetti
Matriculation number 9525993
Hartiggasse 3/602
2700 Wiener Neustadt

Vienna, January 7, 2003

Kurzfassung

This dissertation develops a mathematical method that enables automatic performance tuning of programs for the computation of discrete linear transforms on processors with multimedia vector extensions (short vector SIMD extensions), with particular emphasis on the discrete Fourier transform (DFT). The newly developed method is based on the Kronecker product formalism, which has been extended to capture the specific properties of multimedia vector extensions. In addition, a specially adapted Cooley-Tukey FFT variant¹ has been developed that is applicable both to vector lengths of the form N = 2^k and to more general problem sizes.

The new method has been tested as an extension to Spiral² and Fftw³, the current top systems in the field of automatic performance tuning for discrete linear transforms. It makes it possible to generate extremely fast programs for the computation of the DFT, which are currently the fastest DFT programs on Intel processors with the multimedia vector extensions "Streaming SIMD Extensions" (SSE and SSE 2). They are faster than the corresponding programs of the hand-optimized Intel software library MKL (Math Kernel Library). In addition, the first and so far only automatically performance-tuned programs for the computation of the Walsh-Hadamard transform and of two-dimensional cosine transforms have been generated.

Important results of this dissertation are: (i) The vectorization of discrete linear transforms for multimedia vector extensions requires nontrivial structural changes if automatic performance tuning is to be carried out. (ii) Performance portability across different platforms and processor generations is particularly difficult to achieve on processors with multimedia vector extensions. (iii) Due to the complexity of the algorithms for discrete linear transforms, vectorizing compilers cannot generate programs with satisfactory floating-point performance. (iv) Due to differing design goals, software libraries for classical vector computers cannot be used efficiently on processors with multimedia vector extensions.

The method developed in this dissertation is based on the Kronecker product formalism applied to discrete linear transforms (Van Loan [90], Moura et al. [72]). The numerical experiments were carried out with newly developed extensions to Fftw (Frigo and Johnson [33]) and Spiral (Moura et al. [72]). Results of this dissertation have been published in Franchetti et al. [24, 25, 27, 28, 29, 30].

¹ FFT is the abbreviation of "fast Fourier transform".
² Spiral is the abbreviation of "signal processing algorithms implementation research for adaptive libraries".
³ Fftw is the abbreviation of "Fastest Fourier Transform in the West".

Summary

This thesis provides a mathematical framework for automatic performance tuning of discrete linear transform codes targeted at computers with processors featuring short vector SIMD extensions. Strong emphasis is placed on the discrete Fourier transform (DFT). The mathematical framework of this thesis is based on the Kronecker product approach to describing linear transforms. This approach has been extended to express the specific features of all current short vector SIMD extensions. A special short vector Cooley-Tukey FFT⁴ is introduced.

The methods developed in this thesis are applicable as extensions to Spiral⁵ and Fftw⁶, the current state-of-the-art systems using automatic performance tuning in the field of discrete linear transforms. Application of the new method leads to extremely fast implementations of DFTs on most current short vector SIMD architectures for both non-powers and powers of two. The automatically generated and optimized codes are currently the fastest codes for Intel machines featuring streaming SIMD extensions (SSE and SSE 2). These codes are faster than Intel's hand-tuned math library (MKL) and, of course, faster than all other implementations freely available for these machines. The codes of this thesis are the first n-way short vector FFTs for both non-powers and powers of two. Moreover, the thesis presents the first automatically performance tuned short vector SIMD implementations of other transforms like Walsh-Hadamard transforms or two-dimensional cosine transforms.

This thesis points out the following major issues: (i) SIMD vectorization cannot be achieved easily. Nontrivial mathematical transformation rules are required to obtain automatic performance tuning and thus satisfactory performance for processors featuring SIMD extensions. (ii) Performance portability across platforms and processor generations is not a straightforward matter, especially in the case of short vector SIMD extensions. Even the members of a family of binary compatible processors featuring the same short vector SIMD extension are different, and adaptation is required to utilize them satisfactorily. (iii) Vectorizing compilers are not able to deliver competitive performance due to the structural complexity of discrete linear transform algorithms. (iv) Conventional vector computer libraries optimized for dealing with long vector lengths do not achieve satisfactory performance on short vector SIMD extensions.

The framework introduced in this thesis is based on the Kronecker product approach as used in Van Loan [90] and Moura et al. [72]. The experimental results were obtained by extending Fftw (Frigo and Johnson [33]) and Spiral (Moura et al. [72]). Some results of this thesis are presented in Franchetti et al. [24, 25, 27, 28, 29, 30].

⁴ FFT is the abbreviation of "fast Fourier transform".
⁵ Spiral is the abbreviation of "signal processing algorithms implementation research for adaptive libraries".
⁶ Fftw is the abbreviation of "Fastest Fourier Transform in the West".

Contents

Kurzfassung / Summary

Introduction

1 Hardware vs. Algorithms
  1.1 Discrete Linear Transforms
  1.2 Current Hardware Trends
  1.3 Performance Implications
  1.4 Automatic Performance Tuning

2 Standard Hardware
  2.1 Processors
  2.2 Advanced Architectural Features
  2.3 The Memory Hierarchy

3 Short Vector Hardware
  3.1 Short Vector Extensions
  3.2 Intel's Streaming SIMD Extensions
  3.3 Motorola's AltiVec Technology
  3.4 AMD's 3DNow!
  3.5 Vector Computers vs. Short Vector Hardware

4 The Mathematical Framework
  4.1 Notation
  4.2 Extended Subvector Operations
  4.3 Kronecker Products
  4.4 Stride Permutations
  4.5 Twiddle Factors and Diagonal Matrices
  4.6 Complex Arithmetic as Real Arithmetic
  4.7 Kronecker Product Code Generation

5 Fast Algorithms for Linear Transforms
  5.1 Discrete Linear Transforms
  5.2 The Fast Fourier Transform

6 A Portable SIMD API
  6.1 Definition of the Portable SIMD API


7 Transform Algorithms on Short Vector Hardware
  7.1 Formal Vectorization of Linear Transforms
  7.2 FFTs on Short Vector Hardware
  7.3 A Short Vector Cooley-Tukey Recursion
  7.4 Short Vector Specific Search

8 Experimental Results
  8.1 A Short Vector Extension for FFTW
  8.2 A Short Vector Extension for SPIRAL

Conclusion and Outlook

A Performance Assessment
  A.1 Short Vector Performance Measures
  A.2 Empirical Performance Assessment

B Short Vector Instruction Sets
  B.1 The Intel Streaming SIMD Extensions
  B.2 The Intel Streaming SIMD Extensions 2
  B.3 The Motorola AltiVec Extensions

C The Portable SIMD API
  C.1 Intel Streaming SIMD Extensions
  C.2 Intel Streaming SIMD Extensions 2
  C.3 Motorola AltiVec Extensions

D SPIRAL Example Code
  D.1 Scalar C Code
  D.2 Short Vector Code

E FFTW Example
  E.1 The FFTW Framework
  E.2 Scalar C Code
  E.3 Short Vector Code

Table of Abbreviations

Bibliography

Curriculum Vitae

Acknowledgements

I want to thank all persons who supported me in making this work possible. In particular, I want to thank my Ph.D. advisor Christoph Ueberhuber for many years of a unique working environment, for fruitful cooperation and challenging research, and for all the things he taught me. I owe my academic achievements to him. I especially want to express my gratitude for opening the international scientific community to me and for teaching me high scientific standards. He gave me the possibility to work at the forefront of scientific research.

I want to thank Herbert Karner for his help, support, and the years of fruitful collaboration. I will always remember him.

Additionally, I want to thank Markus Pueschel for his essential input and for giving me the opportunity to spend two summers in Pittsburgh to work with the Spiral team. Important parts of the work presented in this thesis were developed in collaboration with him.

I want to thank Stefan Kral for valuable discussions and for extending the Fftw codelet generator. Without his help, parts of this work could not have been done.

I want to thank Matteo Frigo and Steven Johnson for their valuable discussions, for giving me access to the newest internal versions of Fftw, and for pointing me to the Fftw vector recursion. I want to thank both the Fftw team and the Spiral team for their warm welcome, openness, and cooperation.

Working within Aurora I learned from all the other members, especially from the members of Project 5. I want to thank them for their help and many fruitful discussions. Parts of this work were done in cooperation with the members of Project 5 of Aurora, the Spiral team, and the Fftw team.

I want to thank all members of the Institute for Applied and Numerical Mathematics at the Vienna University of Technology for their support during all the years I have been working there. In addition, I would like to acknowledge the financial support of the Austrian Science Fund FWF.

I want to thank all my friends, too many to list all of them, here in Austria and in the US, for their friendship and support. I want to thank my family for always believing in me, and I especially want to thank my parents for giving me the opportunity to study and for their support in all the years.

Franz Franchetti

Introduction

The discrete Fourier transform (DFT) plays a central role in the field of scientific computing because DFT methods are an extremely powerful tool for solving various scientific and engineering problems. For example, the DFT is essential for the digital processing of signals and for solving partial differential equations. Other practical uses of DFT methods include geophysical research, vibration analysis, radar and sonar signal processing, speech recognition and synthesis, as well as image processing.

Discrete linear transforms, including the discrete Fourier transform (DFT), the Walsh-Hadamard transform (WHT), and the family of discrete sine and cosine transforms (DSTs, DCTs), are—besides numerical linear algebra algorithms—at the core of most computationally intensive algorithms. Thus, discrete linear transforms are at the center of small scale applications with stringent time constraints (for instance, real time) as well as of large scale simulations running on the world's largest supercomputers.

All these transforms are structurally complex and thus lead to complicated algorithms, which makes it a challenging task to map them to standard hardware efficiently and an even harder problem to exploit special processor features satisfactorily. The unprecedented complexity of today's computer systems implies that performance portable software—software that performs satisfactorily across platforms and hardware generations—can only be achieved by means of automatic empirical performance tuning. It is necessary to apply search techniques to find the best implementation for a given target machine. These search techniques have to use actual run time as the cost function, since modelling the machine's behavior accurately enough is impossible on today's computers.

A few years ago major vendors of general purpose processors started to include short vector SIMD (single instruction, multiple data) extensions into their instruction set architecture (ISA), primarily to improve the performance of multimedia applications. Examples of SIMD extensions supporting both integer and floating-point operations include Intel's streaming SIMD extensions (SSE and SSE 2), AMD's 3DNow! as well as its extensions "enhanced 3DNow!" and "3DNow! professional", Motorola's AltiVec extension, and last but not least IBM's Double Hummer floating-point unit for BG/L machines. Each of these ISA extensions is based on packing large registers (64 bits or 128 bits) with smaller data types and providing instructions for the parallel operation on these subwords within one register. SIMD extensions have the potential to speed up implementations in all areas where (i) performance is crucial and (ii) the relevant algorithms exhibit fine grain parallelism.

Processors featuring short vector SIMD instructions are completely different

from conventional vector computers. Thus, solutions developed for traditional vector computers are not directly applicable to today's processors featuring short vector SIMD extensions.

With the introduction of double-precision short vector SIMD extensions, this technology became a major determinant in scientific computing. Conventional scalar code becomes more and more obsolete on machines featuring these extensions, as such codes utilize only a fraction of the potential performance. For instance, Intel's Pentium 4 processor featuring the two-way double-precision short vector SIMD extension SSE 2 is currently the processor with the highest peak performance (over 6 Gflop/s for double precision and over 12 Gflop/s for single precision) and has become the standard processor in commodity clusters. On the other end of the spectrum, the processors of IBM's BG/L machine—a candidate for the leading position in the Top 500 list—also feature a two-way double-precision short vector SIMD extension.

This thesis provides a framework for automatic performance tuning of discrete linear transform codes targeted at computers with processors featuring short vector SIMD extensions, with strong emphasis placed on the discrete Fourier transform.

Synopsis

Chapter 1 introduces discrete linear transforms and discusses the reasons why it is hard to achieve high performance implementations of such algorithms on current computer architectures. The two major automatic performance tuning systems for discrete linear transforms—Spiral and Fftw—are introduced.

In Chapter 2 current hardware trends and advanced hardware features are discussed. The main focus is on CPUs and memory hierarchies.

Chapter 3 discusses current short vector SIMD extensions and available programming interfaces.

Chapter 4 summarizes the mathematical framework required to express the results presented in this thesis. The Kronecker product formalism and its connection to programs for discrete linear transforms is discussed. The translation of complex arithmetic into real arithmetic within this framework is introduced.

Chapter 5 introduces fast algorithms for discrete linear transforms, i. e., matrix-vector products with structured matrices. The special case of the discrete Fourier transform is discussed in detail. Classical iterative and modern recursive algorithms are summarized. The mathematical approach of Spiral and Fftw is presented.

In Chapter 6 a portable SIMD API is introduced as a prerequisite for the implementation of the short vector algorithms presented in this thesis.

In Chapter 7 a method for formula-based vectorization of discrete linear transforms is introduced. A large class of discrete linear transforms can be fully vectorized; the remaining transforms can be vectorized at least partially. Various methods to obtain short vector SIMD implementations of DFTs are discussed, and the short vector Cooley-Tukey rule set is developed, which enables high performance short vector SIMD implementations of DFTs.

Chapter 8 presents a number of experimental results. The newly developed formal methods are included into Spiral and Fftw using the portable SIMD API and are tested on various short vector SIMD extensions and architectures. Experimental evidence for the superior performance achievable by using the newly introduced methods is given.

Appendix A discusses the performance assessment of scientific software. Appendix B summarizes the relevant parts of short vector SIMD instruction sets, and Appendix C shows the implementation of the portable SIMD API required for the numerical experiments presented in this thesis. Appendix D displays example code obtained using the newly developed short vector SIMD extension for Spiral, and Appendix E displays codelet examples taken from the short vector SIMD version of Fftw.

Chapter 1

Hardware vs. Algorithms

The fast evolving technology, following Moore's law, has turned standard, single processor off-the-shelf desktop computers into powerful computing devices with peak performances of, at present, several gigaflop/s. Thus, scientific problems that a decade ago required powerful parallel supercomputers are now solvable on a PC. On a smaller scale, many applications can now be performed under more stringent performance constraints, e. g., in real time.

Unfortunately, there are several problems inherent in this development on the hardware side that make the development of top performance software an increasingly difficult task, feasible only for expert programmers. (i) Due to the memory-processor bottleneck the performance of applications depends more on the pattern of data access, e. g., its locality, than on the mere number of arithmetic operations. (ii) Complex architectures make performance prediction of algorithms a difficult, if not impossible, task. (iii) Most modern microprocessors introduce special instructions like FMA (fused multiply-add) or short vector SIMD instructions (like SSE on Pentium processors). These instructions provide superior potential speed-up but are difficult to utilize. (iv) High-performance code, hand-tuned to a given platform, becomes obsolete as the next generation of microprocessors (in cycles of typically about two years) comes onto the market.

As a consequence, the development of top performance software, portable across architectures and time, has become one of the key challenges associated with Moore's law. As a result there have been a number of recent efforts, collectively referred to as automatic performance tuning, to automate the process of implementation and optimization for given computing platforms. Important examples include Fftw by Frigo and Johnson [32], Atlas by Whaley et al. [94], and Spiral by Püschel et al. [80].

1.1 Discrete Linear Transforms

The discrete Fourier transform (DFT) is one of the principal algorithmic tools in the field of scientific computing. For example, the DFT is essential in the digital processing of analog signals. DFT methods are also used in many fields of computational mathematics, for instance, to solve partial differential equations.


DFT methods are important mathematical tools in geophysical research, vibration analysis, radar and sonar signal processing, speech recognition and synthesis, as well as image processing. The arithmetic complexity of the DFT was long thought to be 2N^2 since the DFT boils down to the evaluation of a special matrix-vector product. In 1965 Cooley and Tukey [13] published the fast Fourier transform (FFT) algorithm, which reduced the computational complexity of the DFT to 5N log N. The discrete Fourier transform is the most important discrete linear transform. Other examples of discrete linear transforms are the discrete sine and cosine transforms, the wavelet transforms, and the Walsh-Hadamard transform.
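In matrix terms (a brief restatement using standard definitions, consistent with the Kronecker product notation introduced in Chapter 4, with the usual sign convention for the root of unity), the DFT of a vector x of length N is the matrix-vector product y = DFT_N x with

    \[
      \mathrm{DFT}_N \;=\; \bigl[\,\omega_N^{k\ell}\,\bigr]_{k,\ell=0,\dots,N-1},
      \qquad \omega_N = e^{-2\pi i/N},
    \]

so that direct evaluation requires O(N^2) arithmetic operations, whereas the Cooley-Tukey FFT computes the same product with O(N log N) operations—about 5N log_2 N floating-point operations in the radix-2 case.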

1.1.1 Applications

The field of applications for the FFT and the other linear transforms is vast. Several areas where methods based on these transforms play a central role are outlined in this section.

Seismic Observation

The oil and gas industry uses the FFT as a fundamental exploration tool. Using the so-called "seismic reflection method" it is possible to map sub-surface sedimentary rock layers. A by now historic application of the discrete Fourier transform was to find out whether a seismic activity was caused by an earthquake or by a nuclear bomb. These events can be distinguished in the frequency domain, because the respective spectra have strongly differing characteristics.

Filtering

Filters are used to attenuate or enhance certain frequency components of a given signal. The most important types are low pass, high pass, and band pass filters. But there are even more filtering problems that can be addressed with FFTs, for example, reducing the cost of applying finite impulse response filters by carrying out the filter convolution in the frequency domain.

Image Processing

Filters can also be applied to two- and higher-dimensional digital signals like pictures or movies. They can be smoothed, sharp edges can be enhanced, and disturbing background noise can be reduced. Especially for highly perturbed medical images (like some X-ray pictures) the FFT can be used to enhance the visibility of certain objects. Furthermore, the spectral representation of a digital picture can be used to gain valuable information for pattern recognition purposes.

The discrete cosine transform (DCT) is a very important transform in image processing. DCTs are used for image compression in the MPEG standards, for motion detection, and in virtually any advanced image processing algorithm. Newer types of filters are wavelet based, and new compression techniques, as introduced in the JPEG2000 standard, utilize the wavelet transform.

Data Communication

The FFT is usually associated with low level aspects of communication, for instance, to understand how a signal behaves when sent through communication channels, amplifiers, etc. Especially the degree of distortion can be modelled easily, knowing the bandwidth of the channel and the spectrum of the signal. For example, if a signal is composed of certain frequency components but only part of them passes through the channel, adding up only those components passing results in a good approximation of the distorted signal.

Astronomy

Sometimes it is not possible to get all the required information from a "normal" telescope (based on visible light). Radio waves or radar have to be used instead of light. Radio and radar signals are treated just like any other time varying signal and can be processed digitally. For example, the Magellan spacecraft, launched in 1989 and sent to Earth's closest planet, Venus, was equipped with modern radar and digital signal processing capabilities and provided excellent data that made possible a computer-generated virtual flight above Venus.

Optics

In optical theory the signal is the oscillatory electric field at a point in space where light passes by. The Fourier transform of the signal is equivalent to breaking up the light into its components by using a prism. The Fourier transform is also used to calculate the diffracted intensity in experiments where light passes through narrow slits. Even Young's famous double slit experiment made use of the Fourier transform. These ideas can be applied to all kinds of wave analysis applications like acoustic, X-ray, and microwave diffraction.

Speech Recognition

The field of digital speech processing and recognition is a multi-million Euro business by now. Advances in computational speed and new language models have made speech recognition possible on average PCs. Speech recognition is a very wide field, encompassing isolated word recognition for specialized fields such as medicine as well as connected word and even conversational speech recognition.

Speech recognition is investigated for many reasons, such as the development of automatic typewriters. The simplicity of use of a speech recognition system and its possible speed compared with other means of information input are worth mentioning. Speech is rather independent of other activities involving hands or legs. Many tasks could be performed by computers if the recognition process were reliable and cheap. For achieving both goals the FFT is an important tool. FFTs are used to transform the signal, usually after a first filtering process, into the frequency domain to obtain its spectrum. Critical features for perceiving speech by the human ear are mainly contained in the spectral information, while the phase information does not play a crucial role. Not only does the FFT bring the signal into an easier-to-handle form, it is also cheap in terms of run time and thus makes real-time recognition economically affordable.

Further Topics

The discrete Fourier transform is also used for solving partial differential equations and is therefore essential, for example, in computational fluid dynamics. Besides numerical linear algebra, the FFT accounts for most processor cycles. Many large scale applications use the FFT. For instance, climate simulations as conducted on the currently fastest supercomputer—Japan's Earth Simulator—are FFT based.

1.2 Current Hardware Trends

The gap between processor performance, memory bandwidth, and network link bandwidth is constantly widening. Processor power grows by approximately 60 % per year while memory bandwidth grows by a relatively modest 6 % per year. Although the overall sum of the available network bandwidth is doubling every year, the sustained bandwidth per link is growing by less than 6 % per year. Thus, it is getting more and more complicated to build algorithms that are able to utilize modern (serial or parallel) computer systems to a satisfactory degree. Only the use of sophisticated techniques both in hardware architecture and software development makes it possible to overcome these difficulties. Algorithms which were optimized for a specific architecture several years ago fail to perform well on current and emerging architectures. Due to the fast product cycles in hardware development and the complexity of today's execution environments, it is of utmost importance to provide users with easy-to-use self-adapting numerical software.

The development of algorithms for modern high-performance computers is getting more and more complicated due to the following facts. (i) The performance gap between CPUs, memories, and networks is widening. (ii) Hardware tricks partially hide this performance gap. (iii) Performance modelling of programs running on current and future hardware is getting more and more difficult. (iv) Non-standard processor extensions complicate the development of programs with satisfactory performance characteristics. In the remainder of this section, these difficulties are outlined in detail.

Computing Cores

Computing power increases at an undiminished rate according to Moore's law. This permanent performance increase is primarily due to the fact that more and more non-standard computing units are incorporated into microprocessors. For instance, the introduction of fused multiply-add (FMA) operations doubled the floating-point peak performance. The introduction of short vector SIMD extensions (e. g., Intel's SSE or Motorola's AltiVec) increased the peak performance by another factor of 2 or 4. Using standard algorithms and general purpose compiler technology, it is not possible to utilize these recently introduced hardware extensions to a satisfactory degree. Special algorithms have to be developed for high-performance numerical software to achieve an efficient utilization of modern processors.
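To illustrate the kind of instruction such an extension provides, the following minimal sketch uses Intel's SSE intrinsics (declared in xmmintrin.h; see Chapter 3 and Appendix B for the instruction sets themselves) to add two blocks of four single-precision numbers with a single vector instruction:

    #include <xmmintrin.h>

    /* c[0..3] = a[0..3] + b[0..3] using one 4-way SSE addition.
       All three pointers are assumed to be 16-byte aligned. */
    void vadd4(const float *a, const float *b, float *c)
    {
        __m128 va = _mm_load_ps(a);          /* load 4 packed floats */
        __m128 vb = _mm_load_ps(b);
        _mm_store_ps(c, _mm_add_ps(va, vb)); /* 4 additions at once */
    }

A scalar loop would issue four separate additions; here the hardware performs them in parallel, which is exactly the factor of 4 mentioned above.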

Memory Subsystems

Memory access is getting more and more expensive relative to computation speed. Caching techniques try to hide latency and the lack of satisfactory memory bandwidth but require locality in the algorithm's memory access patterns. Deep memory hierarchies, cache associativity and size, translation lookaside buffers (TLBs), and automatic prefetching introduce another level of complexity. The parameters of these facilities even vary within a given computer architecture, leading to an intrinsic problem for algorithm developers who try to optimize floating-point performance for a set of architectures. Symmetric multiprocessing introduces the problem of cache sharing as well as cache coherency, and the limited memory bandwidth becomes an even more limiting factor. Non-uniform memory access on some architectures hides the complexity of distributed memory at the cost of higher latencies for some memory blocks.
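A small illustration of why the access pattern, not the operation count, dominates (a sketch only; both routines perform exactly n additions, but for large strides the second one touches a new cache line on almost every access):

    /* Sum n floats in consecutive order: one cache miss per line. */
    float sum_unit(const float *x, long n)
    {
        float acc = 0.0f;
        for (long i = 0; i < n; i++)
            acc += x[i];
        return acc;
    }

    /* Sum the same n floats in stride-s order: identical arithmetic,
       far worse locality once s exceeds the cache line length. */
    float sum_strided(const float *x, long n, long s)
    {
        float acc = 0.0f;
        for (long j = 0; j < s; j++)
            for (long i = j; i < n; i += s)
                acc += x[i];
        return acc;
    }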

1.2.1 Performance Modelling

For modern computer architectures, modelling of system characteristics and performance characterization of numerical algorithms is extremely complicated. The number of floating-point operations is no longer an adequate measure for predicting the required run time. The following features of current hardware prevent accurate modelling and invalidate conventional performance measures for a modern processor: (i) pipelining and multiple functional units, (ii) superscalar processors and VLIW processors, (iii) fused multiply-add (FMA) instructions, (iv) short vector SIMD extensions, (v) branch prediction, (vi) virtual registers, (vii) multi-level set-associative caches as well as shared caches, and (viii) translation lookaside buffers (TLBs).

As modelling of algorithms with respect to their actual run time is not possible to a satisfactory degree, the only reasonable performance assessment is an empirical run time study carried out for given problems.

Chapter 2 explains performance relevant processor features and the respective techniques in detail. Appendix A explains the methodology of run time measurement and performance assessment.
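The following sketch shows the kind of empirical measurement meant here, reporting the "pseudo Gflop/s" measure 5N log2(N) / run time used for FFTs throughout this thesis (Frigo and Johnson [33]); fft() stands for an arbitrary routine under test and is hypothetical:

    #include <math.h>
    #include <time.h>

    extern void fft(float *data, long n);  /* hypothetical routine under test */

    /* Time reps calls and return pseudo Gflop/s = 5 n log2(n) / time. */
    double pseudo_gflops(float *data, long n, int reps)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            fft(data, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (double)(t1.tv_sec - t0.tv_sec)
                   + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
        return 5.0 * (double)n * log2((double)n) / (sec / reps) / 1e9;
    }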

1.3 Performance Implications

This section exemplifies the drawbacks of the standard approach of optimizing software for a given platform and shows that neither the asymptotic complexity nor the actual number of operations is an adequate performance measure. The standard approach to obtaining an optimized implementation of a given algorithm is summarized as follows.

• The algorithm is adapted to the hardware characteristics by hand, focussing, e. g., on the memory hierarchy and/or processor features.

• The adapted algorithm is coded using a high-level language to achieve portability and make the programming manageable.

• Key portions of the code may be coded by hand in assembly language to improve performance.

The complexity of current hardware and the pace of development make it impossible to produce optimized implementations which are available at or shortly after a processor's release date. This section shows the run time differences resulting from these intrinsic problems.

1.3.1 Run Time vs. Complexity

For all Cooley-Tukey FFT algorithms the asymptotic complexity is O(N log N), with N being the length of the vector to be transformed. Even the constant is nearly the same for all algorithms (see the normalized complexity as a function of the problem size in Figure 5.3). However, Table 1.1 shows that the run times of the corresponding programs vary tremendously. The table summarizes experiments described in Auer et al. [9], where the performance of many FFT routines was measured on various computer systems. For instance, on one processor of an SGI Power Challenge XL, for a transform length N = 2^5 the function c60fcf of the NAG library is 11.6 times slower than the fastest implementation, Fftw.

FFT Program                 Vector Length N
                            2^5     2^10    2^15    2^20
---------------------------------------------------------
NAG/c60fcf                  11.6    6.0     3.3     2.6
Imsl/dfftcf                  2.0    1.7     2.7     3.9
Numerical Recipes/four1      2.6    2.1     2.2     3.9
Fftpack/cfftf                1.4    1.0     2.1     4.0
Green's FFT                  1.6    1.1     1.0     –
Fftw 2.1.3                   1.0    1.1     1.1     1.0

Table 1.1: Slow-down factors of various FFT routines relative to the run time of the best performing routine. Performance data were obtained on one processor of an SGI Power Challenge XL (Auer et al. [9]).

For N = 2^10, cfftf of Fftpack is the fastest program; c60fcf is six times slower while Fftw is a moderate 10 % slower. For N = 2^20, Fftw is again the fastest program; c60fcf is 2.6 times slower and cfftf of Fftpack is four times slower than Fftw. This assessment study shows that (i) the performance behavior of FFT programs depends strongly on the problem size, and (ii) architecture adaptive FFTs are within 10 % of the best performance for all problem sizes.

The arguments in favor of architecture adaptive software become even more striking when this study is extended to machines featuring short vector SIMD extensions. Figure 1.1 shows the performance of single-precision FFT routines on an Intel Pentium 4 running at 2.53 GHz. The Pentium 4 features a four-way short vector SIMD extension called streaming SIMD extension (SSE).

SPIRAL-SIMD is an extension of Spiral, an architecture adaptive library generator for discrete linear transforms, that generates and optimizes short vector SIMD code. Spiral-SIMD is a result of this thesis. The automatically optimized programs generated by Spiral-SIMD are currently the fastest FFT codes on an Intel Pentium 4 (apart from N = 2^11).

Intel MKL (Math Kernel Library) is the hand-optimized vendor library for Pentium 4 processors. It utilizes prefetching and SSE.

SIMD-FFT is a short vector SIMD implementation by Rodriguez [82]. Performance has been reported on a Pentium 4 running at 1.4 GHz. As the source code is not publicly available, the performance for 2.53 GHz was estimated by scaling up the reported results. The scaled performance data are an estimate for an upper bound.

SPIRAL with Vectorizing Compiler. The vectorization mode of the Intel C++ Compiler was activated and applied to the Spiral system. See Section 8.2.6 for details.

[Figure 1.1: floating-point performance in Gflop/s (0 to 10) over vector lengths 2^4 to 2^14 for Spiral-SIMD, Intel MKL 5.1, SIMD-FFT, Spiral with vectorizing compiler, Fftw 2.1.3, and Fftpack.]

Figure 1.1: Performance of single-precision FFT programs on an Intel Pentium 4 running at 2.53 GHz. Performance is given in pseudo Gflop/s (Frigo and Johnson [33]), expressing a weighted inverse of run time that preserves run time relations.

FFTW 2.1.3 is the current standard distribution of Fftw. It does not feature any support for short vector SIMD extensions. The scalar Spiral system delivers the same performance as Fftw 2.1.3 on the Pentium 4. For scalar implementations, Fftw and Spiral are currently the fastest publicly available FFT programs. Both of them are as fast as FFT programs specifically optimized for a given architecture. Thus, Fftw may serve as the baseline in Figure 1.1.

FFTPACK. The function cfftf of the heavily used Fftpack served in this experiment as the non-adaptive standard.

The numerical experiment summarized in Figure 1.1 shows that the architecture adaptive system Spiral-SIMD with support for SSE is most of the time faster than the vendor library. Compiler vectorization accelerates scalar code generated by Spiral significantly but does not deliver top performance. Hand-tuning for SSE as carried out in Simd-Fft leads to good performance behavior but is far from optimal. The industry standard routine cfftf of Fftpack is about six times slower than the currently fastest program. The experiments summarized in Auer et al. [9] show that Fftpack is among the fastest scalar programs in the group of publicly available FFT implementations. This gives an impression of how much performance can be gained by using automatic performance tuning and utilizing special processor features such as short vector SIMD extensions.

1.4 Automatic Performance Tuning

Automatic performance tuning is a step beyond standard compiler optimization. It is required to overcome the problem that today's compilers cannot produce high performance code on current machines any more, as outlined in the previous section. Automatic performance tuning is a problem specific approach and thus is able to achieve much more than general purpose compilers are capable of. For instance, Atlas' search for the correct loop tiling for carrying out a matrix-matrix product is a loop transformation a compiler could in principle do (and some compilers try to) if it had an accurate machine model to deduce the correct tiling; a sketch of such a tiling is shown below. But compilers do not reach Atlas' performance. The same phenomenon occurs with the source code scheduling done by Spiral and Fftw for straight line code, which should be done satisfactorily by the target compiler.
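For illustration, the loop tiling mentioned above looks roughly as follows for C = C + A·B with row-major n×n matrices; the tile size NB is exactly the parameter Atlas determines empirically rather than by a machine model (the value 64 below is an arbitrary placeholder):

    #define NB 64   /* tile size; the good value is machine dependent */

    void matmul_tiled(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
        for (int jj = 0; jj < n; jj += NB)
            /* operate on one cache-resident tile at a time */
            for (int i = ii; i < ii + NB && i < n; i++)
            for (int k = kk; k < kk + NB && k < n; k++) {
                double a = A[i * n + k];
                for (int j = jj; j < jj + NB && j < n; j++)
                    C[i * n + j] += a * B[k * n + j];
            }
    }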

1.4.1 Compiler Optimization

Modern compilers make extensive use of optimization techniques to improve a program's performance. The application of a particular optimization technique largely depends on a static program analysis based on simplified machine models. Optimization techniques include high level loop transformations, such as loop unrolling and tiling. These techniques have been studied extensively for over 30 years and have produced, in many cases, good results. However, the machine models used are inherently inaccurate, and transformations are not independent in their effect on performance, making the compiler's task of deciding on the best sequence of transformations difficult (Aho et al. [1]).

Typically, compilers use heuristics that are based on averaging observed behavior for a small set of benchmarks. Furthermore, while the processor and memory hierarchy is typically modelled by static analysis, this does not account for the behavior of the entire system. For instance, the register allocation policy and the strategy for introducing spill code in the backend of the compiler may have a significant impact on performance. Thus static analysis can improve program performance but is limited by compile-time decidability.

1.4.2 The Program Generator Approach

A method of source code adaptation at compile-time is code generation. In code generation, a code generator (i. e., a program that produces other programs) is used. The code generator takes as parameters the various source code adaptations to be made, e. g., instruction cache size, choice of combined or separate multiply and add instructions, length of floating-point and fetch pipelines, and so on. Depending on the parameters, the code generator produces source code having the desired characteristics.

Example 1.1 (Parameters for Code Generators) In genfft, the codelet generator of Fftw, the generation of FMA specific code can be activated using the -magic-enable-fma switch. Calling genfft using

    genfft 4 -notwiddle -magic-enable-fma

results in the generation of a no-twiddle codelet of size four which is optimized for FMA architectures.
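To see why FMA-specific code generation matters, consider a radix-2 butterfly with a real twiddle factor w, written so that each output is a single fused multiply-add (an illustrative fragment, not genfft output; fma() is the C99 function from math.h):

    #include <math.h>

    /* (a, b) -> (a + w*b, a - w*b); on FMA hardware each line
       maps to one fused multiply-add with a single rounding. */
    void butterfly(double *a, double *b, double w)
    {
        double t = *a;
        *a = fma(w, *b, t);
        *b = fma(-w, *b, t);
    }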

1.4.3 Compile-Time Adaptive Algorithms Using Feedback Information

Not all important architectural variables can be handled by parameterized compile-time adaptation, since varying them actually requires changing the underlying source code. This brings in the need for the second method of software adaptation, compile-time adaptation by feedback directed code generation, which involves actually generating different implementations of the same operation and selecting the best performing one.

There are at least two different ways to proceed: (i) The simplest approach is to get the programmer to supply various hand-tuned implementations and then to choose a suitable one. (ii) The second method is based on automatic code generation. In this approach, parameterized code generators are used. Performance optimization with respect to a particular hardware platform is achieved by searching, i. e., varying the generator's parameters, benchmarking the resulting routines, and selecting the fastest implementation. This approach is also known as automated empirical optimization of software (AEOS) (Whaley et al. [94]). In the remainder of this section the existing approaches are introduced briefly.
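The search loop common to all these systems can be sketched as follows (a schematic only; time_variant() stands for an assumed benchmarking harness that runs and times one candidate implementation):

    typedef void (*kernel_t)(float *, long);

    extern double time_variant(kernel_t k, long n);  /* assumed timer */

    /* AEOS in a nutshell: actual run time is the cost function. */
    kernel_t select_best(kernel_t candidates[], int count, long n)
    {
        kernel_t best = candidates[0];
        double best_time = time_variant(best, n);
        for (int i = 1; i < count; i++) {
            double t = time_variant(candidates[i], n);
            if (t < best_time) {
                best_time = t;
                best = candidates[i];
            }
        }
        return best;
    }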

PHiPAC

Portable high-performance ANSI C (PHiPAC) was the first system which implemented the "generate and search" methodology (Bilmes et al. [12]). Its code generator produces matrix multiply implementations with various loop unrolling depths, varying register and L1- and L2-cache tile sizes, different software pipelining strategies, and other options. The output of the generator is C code, both to make the system portable and to allow the compiler to perform the final register allocation and instruction scheduling. The search phase benchmarks code produced by the generator under various options to select the best performing implementation.

ATLAS

The automatically tuned linear algebra software (Atlas) project is an ongoing research effort (at the University of Tennessee, Knoxville) focusing on empirical techniques to produce software having portable performance. Initially, the goal of the Atlas project was to provide a portably efficient implementation of the Blas. Now Atlas provides at least some level of support for all of the Blas, and first tentative extensions beyond this level have been made.

While originally the Atlas project's principal objective was to develop an efficient library, the field of investigation has since been extended. Within a couple of years, new methodologies for developing self-adapting programs have become established; this AEOS approach forms a new sector in software evolution. Atlas' adaptation approaches are typical AEOS methods; even the term "AEOS" was coined by Atlas' developers (Whaley et al. [94]). Accordingly, the second main goal of the Atlas project is the general investigation of program adaptation using the AEOS methodology.

Atlas uses automatic code generators in order to provide different implementations of a given operation, and involves sophisticated search scripts and robust timing mechanisms in order to find the best way of performing this operation on a given architecture.

The remainder of this chapter introduces the two leading projects dealing with architecture adaptive implementations of discrete linear transforms, Spiral and Fftw. One result of this thesis is the extension of these systems with the newly developed short vector SIMD vectorization. In cooperation with these two projects, the world's fastest FFT implementations for Intel processors and very fast implementations for the other short vector SIMD extensions were achieved.

FFTW

Fftw (fastest Fourier transform in the west) was the first effort to automatically generate FFT code using a special purpose compiler and to use the actual run time as optimization criterion (Frigo [31], Frigo and Johnson [32]). Typically, Fftw performs faster than other publicly available FFT codes and is equal to or faster than hand-optimized vendor-supplied libraries across different machines. It provides performance comparable to Spiral. Several extensions to Fftw exist, including the AC Fftw package and the UHFFT library. Currently, Fftw is the most popular portable high performance FFT library that is publicly available.

Fftw provides a recursive implementation of the Cooley-Tukey FFT algorithm. The actual computation is done by automatically generated routines called codelets which restrict the computation to specially structured algorithms called right expanded trees (see Section 5.1 and Haentjens [40]). The recursion stops when the remaining right subproblem is solved using a codelet. For a given problem size there are many different ways of solving the problem with potentially very different run times. Fftw uses dynamic programming with the actual run time of problems as cost function to find a fast implementation for a given DFT_N on a given machine. Fftw consists of the following fundamental parts. Details about Fftw can be found in Frigo and Johnson [33].

The Planner. At run time, but as a one time operation during the initialization phase, the planner uses dynamic programming to find a good decomposition of the problem size into a tree of computations according to the Cooley-Tukey recursion, called a plan.

The Executor. When solving a problem, the executor interprets the plan as generated by the planner and calls the appropriate codelets with the respective parameters as required by the plan. This leads to data access patterns which respect memory access locality.

The Codelets. The actual computation of the FFT subproblems is done within the codelets. These small routines come in two flavors: (i) twiddle codelets, which are used for the left subproblems and additionally handle the twiddle matrix, and (ii) no-twiddle codelets, which are used in the leaves of the recursion and which additionally handle the stride permutations. Within the codelets a larger variety of FFT algorithms is used, including the Cooley-Tukey algorithm, the split-radix algorithm, the prime factor algorithm, and the Rader algorithm (Van Loan [90]).

The Codelet Generator genfft. At install time, all codelets are generated by a special purpose compiler called the codelet generator genfft. As an alternative, the pregenerated codelet library can be downloaded as well. In the standard distribution, codelets of sizes up to 64 (not restricted to powers of two) are included. If special transform sizes are required, the respective codelets can be generated.
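From the user's perspective, planner and executor appear as follows in the Fftw 2.x interface (a minimal usage sketch based on the documented API; error handling omitted):

    #include <fftw.h>

    void transform(int n)
    {
        fftw_complex *in  = (fftw_complex *) fftw_malloc(n * sizeof(fftw_complex));
        fftw_complex *out = (fftw_complex *) fftw_malloc(n * sizeof(fftw_complex));

        /* Planner: one-time search; FFTW_MEASURE times actual
           codelet combinations on this machine. */
        fftw_plan p = fftw_create_plan(n, FFTW_FORWARD, FFTW_MEASURE);

        /* ... fill in[] with data ... */

        fftw_one(p, in, out);   /* executor interprets the plan */

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
    }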

SPIRAL

Spiral (signal processing algorithms implementation research for adaptive libraries) is a generator of high performance code for discrete linear transforms like the DFT, the discrete cosine transforms (DCTs), and many others (Moura et al. [72]). Spiral uses a mathematical approach that translates the implementation problem of discrete linear transforms into a search in the space of structurally different algorithms and their possible implementations to generate code that is adapted to the given computing platform. Spiral's approach is to represent the many different algorithms for a transform as formulas in a concise mathematical language. These formulas are automatically generated and automatically translated into code, thus enabling an automated search. Chapter 5 summarizes the discrete linear transforms and Chapter 4 summarizes the mathematical framework. More specifically, Spiral is based on the following observations.

• For every discrete linear transform there exists a very large number of different fast algorithms. These algorithms differ in dataflow but are essentially equal in the number of arithmetic operations.

• A fast algorithm for a discrete linear transform can be represented as a formula in a concise mathematical notation using a small number of mathematical constructs and primitives (see Chapter 4).

• It is possible to automatically generate the alternative formulas, i. e., algorithms, for a given discrete linear transform.

• A formula representing a fast discrete linear transform algorithm can be mapped automatically into a program in a high-level language like C or Fortran (see Section 4.7).

[Figure 1.2: block diagram of Spiral. A user-specified DSP transform enters the Formula Generator, which produces a fast algorithm represented as an SPL formula; the SPL Compiler translates the formula into C/Fortran/SIMD code; the run time measured on the given platform is fed back to a Search Engine, which controls the algorithm generation and the implementation options, finally yielding a platform-adapted implementation.]

Figure 1.2: Spiral's architecture.

The architecture of Spiral is shown in Figure 1.2. The user specifies a transform to be implemented, e. g., a DFT of size 1024. The formula generator expands the transform into one (or several) formulas, i. e., algorithms, represented in the Spiral proprietary language SPL (signal processing language). The formula translator (also called SPL compiler) translates the formula into a C or Fortran program. The run time of the generated program is fed back into a search engine that controls the generation of the next formula and possible implementation choices, such as the degree of loop unrolling. Iteration of this process yields a platform-adapted implementation. Search methods in Spiral include dynamic programming and evolutionary algorithms. By including the mathematics in the system, Spiral can optimize, akin to a human expert programmer, on the implementation level and the algorithmic level to find the best match to the given platform. Further information on Spiral can be found in Püschel et al. [79], Singer and Veloso [84], and Xiong et al. [95].

GAP and AREP. Spiral's formula generator uses Arep, a package by Egner and Püschel [22] implemented in the language Gap [88], a computer algebra system for computational group theory. The goal of Arep was to create a package for computing with group representations up to equality, not up to equivalence; hence, Arep provides the data types and the infrastructure to do efficient symbolic computation with representations and structured matrices which arise from the decomposition of representations. Algorithms represented as formulas are written in mathematical terms of matrices and vectors which are specified and composed symbolically in the Arep notation. Various standard matrices and matrix types are supported (such as DFT and diagonal matrices), as well as many algebraic operations, including the Kronecker product formalism.
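As a small concrete example (notation from Chapter 4; the SPL rendering follows the published SPL papers, so the exact keywords may differ in detail from Spiral's current syntax), the Cooley-Tukey factorization of a DFT of size 4,

    \[
      \mathrm{DFT}_4 \;=\; (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\,
                           (I_2 \otimes \mathrm{DFT}_2)\, L^4_2,
    \]

would be written in SPL roughly as

    (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))

which the SPL compiler then translates into C or Fortran code.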

One result of the work presented in this thesis consists of extensions to both Fftw and Spiral that support all current short vector SIMD extensions in their automatic performance tuning process.

Chapter 2

Standard Hardware

This chapter gives an overview of standard features of single processor systems relevant for the computation of discrete linear transforms, i. e., microprocessors and the associated memory subsystem. Sections 2.1 and 2.2 discuss features on the processor level while Section 2.3 focusses on the memory hierarchy. Further details can be found, for example, in Gansterer and Ueberhuber [36] or Hlavacs and Ueberhuber [41].

2.1 Processors

Due to packaging with increased density and architectural concepts like RISC, the peak performance of processors has increased by about 60 percent each year (see Figure 2.1). This annual growth rate is likely to continue for at least another decade. Then physical limitations like Heisenberg's uncertainty principle will prevent packaging density from growing further.

[Figure 2.1: log-scale plot of clock rates in MHz (1 to 10,000) of off-the-shelf CPUs over the years 1975 to 2005.]

Figure 2.1: Clock rate trend for off-the-shelf CPUs.


2.1.1 CISC Processors

The term CISC is an acronym for complex instruction set computer, whereas RISC is an acronym for reduced instruction set computer. Until the 1980s, practically all processors were of CISC type. Today, the CISC concept is quite out-of-date, though most existing processors are designed to understand CISC instruction sets (like the Intel compatibles). Sixth and seventh generation x86 compatible processors (Intel's Pentium II, III, and 4, and AMD's Athlon line) internally use advanced RISC techniques to achieve performance gains. As these processors still must understand the x86 instruction set, they cut complex instructions into simpler ones and feed them into their multiple pipelines and other functional units.

Microcode

When designing new processors, it is not always possible to implement instructions by means of transistors and resistors only. Instructions that are executed directly by electrical components are called hardwired. Complex instructions, however, often require too much effort to be hardwired. Instead, they are emulated by simpler instructions inside the processor. The "program" of hardwired instructions emulating complex instructions is called microcode. Microcode makes it possible to emulate instruction sets of different architectures just by adding or changing ROMs containing microcode information. Compatibility with older computers also forces vendors to supply the same set of instructions for decades, forcing modern processors to deal with old fashioned, complex instructions instead of creating new, streamlined instruction sets to fit onto RISC architectures.

Example 2.1 (Microcode) Intel's Pentium 4 and AMD's Athlon XP dynamically translate complex CISC instructions into one or more equivalent RISC instructions. Each CISC instruction thus is represented by a microprogram containing optimized RISC instructions.

2.1.2 RISC Processors

Two major developments paved the road to RISC processors.

High Level Languages. For reasons of portability and faster, more affordable software development, high level languages are used instead of native assembly. Thus, optimizing compilers are needed that can create executables having an efficiency comparable to programs written in assembly language. Compilers prefer small and simple instructions which can be moved around more easily than complex instructions with more dependencies.

Performance. Highest performance has to be delivered at any cost. This goal is achieved either by increasing the packaging density, by increasing the clock rate, or by reducing the cycles per instruction (CPI) count. The latter is impossible when using ordinary CISC designs. On RISC machines, instructions are supposed to take only one cycle, yet several stages are needed for their execution. The answer is a special kind of parallelism: pipelining. Due to the availability of cheap memory it is possible to design fixed length instruction sets, the most important precondition for smooth pipelining. Also, cheaper SRAMs are available as caches, thus providing enough memory-processor bandwidth for feeding data to faster CPUs.

RISC processors are characterized by the following features: (i) pipelining, (ii) uniform instruction length, (iii) simple addressing modes, (iv) load/store architecture, and (v) more registers. Additionally, modern RISC implementations use special techniques to improve the instruction throughput and to avoid pipeline stalls (see Section 2.2): (i) low grain functional parallelism, (ii) register bypassing, and (iii) optimized branches. As current processors understanding the x86 CISC instruction set feature internal RISC cores, these advanced technologies are used in x86 compatible processors as well. In the remainder of this section these features are discussed in more detail.

2.1.3 Pipelines

Pipelines consist of several stages which carry out a small part of the whole operation. The more complex the function that a pipeline stage has to perform, the more time it needs and the slower the clock has to tick in order to guarantee the completion of one operation each cycle. Thus, designers face a trade-off between the complexity of pipeline stages and the smallest possible clock cycle. As pipelines can be made arbitrarily long, one can break complex stages into two or more separate, simpler ones that operate faster. Resulting pipelines can consist of ten or more stages, enabling higher clock rates. Longer pipelines, however, need more cycles to be refilled after a pipeline hazard or a context switch. Smaller clock cycles, on the other hand, reduce this additional overhead significantly. Processors containing pipelines of ten or more stages are called superpipelined.

Example 2.2 (Superpipeline) The Intel Pentium 4 processor core contains pipelines of up to 20 stages. As each stage needs only simple circuitry, processors containing this core are able to run at more than 3 GHz.

2.1.4 VLIW Processors

When trying to issue more than one instruction per clock cycle, processors have to contain several pipelines that can operate independently. In very long instruction word (VLIW) processors, instruction words consist of several different operations without interdependence. At run time, these basic instructions can be brought to different units where they are executed in parallel. The task of scheduling independent instructions to different functional units is done by the compiler at compile time. Typically, such compilers try to find a good mixture of both integer and floating-point instructions to form long instruction words.

Example 2.3 (VLIW Processors) The digital signal processors (DSPs) VelociTI from TI, Trimedia-1000 from Philips, and FR500 from Fujitsu are able to execute long instruction words. The 64 bit Itanium processor family (IPF, formerly called IA-64), developed in a cooperation between Intel and HP, follows the VLIW paradigm. This architecture is also called EPIC (explicit parallel instruction computing).

It is very difficult to build compilers capable of finding independent integer, memory, and floating-point operations for each instruction. If no floating-point operations are needed, or if there are many more integer operations than floating-point operations, much of this kind of parallelism is wasted. Only programs consisting of about the same number of integer and floating-point operations can exploit VLIW processors efficiently.

2.1.5 Superscalar Processors

Like long instruction word processors, superscalar processors contain several independent functional units and pipelines. Superscalar processors, however, are capable of scheduling independent instructions to different units dynamically at run time. Therefore, such processors must be able to detect instruction dependencies. They have to ensure that dependent instructions are executed in the same order in which they appeared in the instruction stream.

Modern superscalar RISC processors belong to the most complex processors ever built. To feed their multiple pipelines, several instructions must be fetched from memory at once, thus making fast and large caches indispensable. Also, sophisticated compilers are needed to provide a well balanced mix of independent integer and floating-point instructions to ensure that all pipelines are kept busy during execution. Because of the complexity of superscalar processors their clock cycles cannot be shortened to the same extent as in simpler processors.

Example 2.4 (Superscalar Processors) The IBM Power 4 is capable of issuing up to 8 instructions per cycle, with a sustained completion rate of five instructions. As its stages are very complex, it runs at only 1.30 GHz. The AMD Athlon XP 2100+ processor running at 1733 MHz features a floating-point adder and a floating-point multiplier, both capable of issuing one two-way vector operation per cycle.

2.2 Advanced Architectural Features

Superscalar processors dynamically schedule instructions to multiple pipelines and other functional units. As performance is the topmost goal, all pipelines and execution units must be kept busy in order to achieve maximum performance. Since dependencies among instructions can hinder the pipeline from running smoothly, advanced features have to be designed to detect and resolve dependencies. Other design features have to make sure that the results of instructions become visible in the same order in which the instructions entered the processor.

2.2.1 Functional Parallelism

As was said before, superscalar RISC processors contain several execution units and pipelines that can execute instructions in parallel. To keep all units busy as long as possible, it must be ensured that there are always instructions waiting in buffers, ready to execute. Such buffers are called reservation stations or queues. Every execution unit can have a reservation station of its own or get its instructions from one global queue. Also, for each instruction leaving such a station, another one should be brought in from memory. Thus, memory and caches have to deliver several instructions each cycle.

Depending on the depths of the pipelines used, some operations may take longer than others. It is therefore possible that instruction i + 1 is finished while instruction i is still being processed in a pipeline. Also, an integer pipeline may become idle while the floating-point unit is still busy. Thus, if instruction i + 1 is an integer operation while instruction i is a floating-point operation, i + 1 might be put into the integer pipeline before i can enter the floating-point unit. This is called out-of-order execution. The instruction stream leaving the execution units will often differ from the original instruction stream. Thus, instructions that finish early must wait in a reorder buffer for all prior instructions to finish before their results are written back.

2.2.2 Registers

Registers become a bottleneck if too many values have to be stored in registers within a short piece of code. The number of available registers depends on the design of the underlying architecture. The set of registers known to compilers and programs is called the logical register file. To guarantee software compatibility with predecessors, the number of logical registers cannot be increased within the same processor family: programs compiled for new processors having more registers could not run on older versions with a smaller number of registers. However, it is possible to increase the number of physical registers existing within the processor and to use them to store intermediate values.

Register Renaming

One way to increase the number of registers within a processor family is to increase the number of physical registers while keeping the old number of logical ones. This means that programs still use, for instance, only 32 integer registers, while internally 64 integer registers are available. If several instructions use the same logical register, special circuitry assigns different physical registers to be used instead of the logical one. This technique is called register renaming.

Example 2.5 (Register Renaming) The MIPS R10000 is an upgrade of MIPS R6000 processors that had 32 integer and 32 floating-point registers, each 64 bits wide. The R10000 contains 64 integer and 64 floating-point registers that can dynamically be renamed to any of the 32 logical ones.

To do this, the processor internally must maintain a list containing already renamed registers and a list of free registers that can be mapped to any logical register.

Register Bypassing

Competition for registers can stall pipelines and execution units. A technique called register bypassing or register forwarding enables the forwarding of recently calculated results directly to other execution units without temporarily storing them in the register file.

Register Windows

A context switch is an expensive operation, as the current state has to be stored and another state has to be loaded. As the contents of the registers also represent the current program state, they are stored in memory as well. Constantly storing and loading the contents of a large number of registers takes time; this led to the idea of register windows. Here, additional register files are used to hold the information of several processes. Instead of loading the whole register file from memory in case of a context switch, the CPU selects the corresponding register window to be the current file and keeps the old one as a backup of the last process.

Example 2.6 (Register Windows) The SPARC architecture is a RISC architecture using register windows. Instead of 32 basic registers, this architecture offers eight overlapping windows of 24 registers each.

Register windows have severe drawbacks. On modern computers, dozens or even hundreds of processes are likely to be executed. Implementing hundreds of register windows would cost too much, while a few windows will not improve the system's performance significantly. Thus, the possible speed-up does not pay off the additional effort of designing register windows.

2.2.3 Branch Optimization

Conditional branches are responsible for most pipeline hazards. Whenever a conditional branch enters an instruction pipeline, there is a chance of jumping to a new memory location, yet this cannot be known until the branch instruction is executed. Thus, if the branch is taken, all succeeding instructions that have entered the pipeline are obsolete and must be thrown away. Restarting the pipeline at every conditional branch reduces the system's performance significantly. Therefore, mechanisms for reducing branch penalties had to be found.

Branch Delay Slots

Early pipelined processors used instructions that were inserted after the branch by the compiler and that were executed in any case, whether the branch was taken or not. Such instruction positions are called branch delay slots. However, it is very difficult to find even one useful instruction that can be executed in either branch; to find more than one is almost impossible. As modern superscalar processors issue four or more instructions per cycle, finding four branch delay slots is not practicable.

Speculative Execution

Another way of dealing with conditional branches is to guess whether the branch will be taken or not. In either case, new instructions can be brought from cache or main memory early enough to prevent execution units from stalling. If the guess is correct for most branches, the penalty for wrong guesses can be neglected. This idea is extended to the predication used in the Itanium processor family. To implement this idea, a small cache called the branch target buffer is used, storing pairs of previously taken branches and the instructions found at their targets. This way, instructions can be brought quickly to the execution units. Branches that are scarcely taken do not reside in this buffer, while branches that are often taken, and thus have a high probability of being taken again, are stored there. Instructions that have entered execution units based on branch guesses, of course, must not be allowed to complete until the branch has been resolved. Instead, they must wait in the reorder buffer.

Static Branch Prediction

Another approach uses static information known at compile time to make a guess. Conditional branches often occur at the end of loops, where it is tested whether the loop is to be executed once more. As loops are often executed thousands or even millions of times, it is highly probable that the branch is taken. Monitoring shows that 90 % of all backward branches and 50 % of all forward branches are taken. Thus, branches that jump back are predicted to be taken while forward branches are not. The information whether the branch jumps backward or forward is often provided by the compiler by setting or clearing a hint bit.

Additionally, in some cases, conditional branches can be resolved by early pipeline stages such as the prefetch logic itself. If, for example, the branch depends on a register being set to zero, the branch can be resolved earlier.

Example 2.7 (Static Branch Prediction) The Intel Itanium processors use static branch prediction, thus depending on optimized compilers to predict which branch is likely to be taken.

Dynamic Branch Prediction

More sophisticated designs change their predictions dynamically at run time. The Power PC 604, for example, uses a 512-entry, direct mapped branch history table (BHT) mapping four branch-prediction states: strong taken, weak taken, weak not-taken and strong not-taken. The predicted state of a particular branch instruction is set and modified based on the history of that instruction.

The BHT feeds the branch target address cache (BTAC) with both the address of a branch instruction and the target address of the branch. The BTAC—a fully associative, 64-entry cache—stores both the address and the target of previously executed branch instructions. During the fetch stage, this cache is accessed by the fetch logic. If the current fetch address—the address used to get the next instruction from the cache—matches an address in the BTAC, then the branch target address associated with the fetch address is used instead of the fetch address to fetch instructions from the cache (Ryan [83]).

Example 2.8 (Dynamic Branch Prediction) The Cyrix 6x86 (M1), Sun UltraSPARC and MIPS R10000 processors use two history bits for branch prediction. Here, the dominant direction is stored, resulting in only one misprediction per branch. Monitoring shows an 86 % hit ratio for this prediction type. The Intel Pentium III uses the more sophisticated Yeh algorithm requiring 4 bits per entry. This is due to the fact that a mispredicted branch results in a 13 cycle penalty (in contrast to 3 cycles in the M1). This algorithm achieves a hit ratio between 90 % and 95 %.
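A two-bit prediction scheme of the kind used in the M1 or the R10000 can be described as a small saturating state machine. The following C fragment is a minimal simulation of a single branch-history entry; the state encoding and function names are illustrative and not taken from any processor manual.

    #include <stdio.h>

    /* Two-bit saturating counter: 0 = strong not-taken, 1 = weak not-taken,
       2 = weak taken, 3 = strong taken. */
    typedef struct { unsigned state; } bht_entry;

    static int predict(const bht_entry *e) {
        return e->state >= 2;                     /* taken in the two upper states */
    }

    static void update(bht_entry *e, int taken) {
        if (taken  && e->state < 3) e->state++;   /* saturate at strong taken     */
        if (!taken && e->state > 0) e->state--;   /* saturate at strong not-taken */
    }

    int main(void) {
        bht_entry e = { 2 };                      /* start in weak taken          */
        int history[] = { 1, 1, 1, 0, 1, 1 };     /* a loop branch: mostly taken  */
        int miss = 0;
        for (int i = 0; i < 6; i++) {
            if (predict(&e) != history[i]) miss++;
            update(&e, history[i]);
        }
        printf("mispredictions: %d of 6\n", miss); /* one miss: the loop exit */
        return 0;
    }

Because the counter saturates, a single not-taken loop exit causes only one misprediction and does not flip the stored dominant direction.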

2.2.4 Fused Multiply-Add Instructions

In current microprocessors equipped with fused multiply-add (FMA) instructions, the floating-point hardware is designed to accept up to three operands for executing FMA operations, while other floating-point instructions requiring fewer operands may utilize the same hardware by forcing constants into the unused operands. In general, FPUs with FMA instructions use a multiply unit to compute a × b, followed by an adder to compute a × b + c.

FMA operations have been implemented in the floating-point units, e. g., of the HP PA-8700+, IBM Power 4, Intel IA-64 and Motorola Power PC microprocessors. In the Motorola Power PC, FMA instructions have been implemented by chaining the multiplier output into the adder input, requiring rounding between them. In contrast, processors like the IBM Power 4 implement the FMA instruction by integrating the multiplier and the adder into one multiply-add FPU. Therefore, in the Power 4 processor the FMA operation has the same latency (two cycles) as an individual multiplication or addition operation.

The FMA instruction has one other interesting property: it is performed with one round-off error. In other words, in a = b × c + d, b × c is first computed to quadruple (128 bit) precision if b and c are double (64 bit) precision, then d is added, and the sum is rounded to a. This very high intermediate precision is used by IBM's RS/6000 to implement division, which still takes about 19 times longer than either multiply or add. The FMA instruction may also be used to simulate higher precision cheaply.
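The single-rounding property can be used to recover the exact round-off error of a product, which is the basic building block of such simulated higher precision. The following C99 sketch uses the standard fma() function from math.h; it illustrates the principle only and is not the RS/6000 division sequence itself.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double b = 1.0 + 0x1.0p-30, c = 1.0 - 0x1.0p-30;
        double p = b * c;             /* rounded product: here exactly 1.0       */
        double e = fma(b, c, -p);     /* exact error b*c - p, one rounding only  */
        /* the pair (p, e) represents b*c to roughly twice working precision */
        printf("p = %.17g, e = %.17g\n", p, e);   /* e = -2^-60, not zero */
        return 0;
    }

Without FMA, computing b*c - p in two rounded operations would return zero, losing the error term entirely.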

2.2.5 Short Vector SIMD Extensions

A recent trend in general purpose microprocessors is to include short vector SIMD extensions. Although initially developed for the acceleration of multi-media applications, these extensions have the potential to speed up digital signal processing kernels, especially discrete linear transforms. The range of general purpose processors featuring short vector SIMD extensions starts with the Motorola MPC G4 (featuring the AltiVec extension) [70], used in embedded computing and by Apple. It continues with Intel processors featuring SSE and SSE 2 (Pentium III and 4, Itanium and Itanium 2) [52] and AMD processors featuring different 3DNow! versions (Athlon and successors) [2]. These processors are used in desktop machines and commodity clusters. Short vector SIMD extensions are even included in the next generation of supercomputers like the IBM BG/L machine currently in development.

All these processors feature two-way or four-way floating-point short vector SIMD extensions. These extensions operate on a vector of ν floating-point numbers in parallel (where ν denotes the extension's vector length) and feature constrained memory access: only naturally aligned vectors of ν floating-point numbers can be loaded and stored efficiently. These extensions offer a high potential speed-up (factors of up to two or four) but are difficult to use: (i) vectorizing compilers cannot generate satisfactory code for problems with more advanced structure (as discrete linear transforms are), (ii) direct use is beyond standard programming, and (iii) programs are not portable across the different extensions. The efficient utilization of short vector SIMD extensions for discrete linear transforms in a performance portable way is the core of this thesis. Details about short vector SIMD extensions are given in Chapter 3.

2.3 The Memory Hierarchy

Processor technology has improved dramatically over the last years. Empirical observation shows that processor performance increases by 60 % annually. RISC design goals will dominate microprocessor development in the future, allowing pipelining and the out-of-order execution of up to 6 instructions per clock cycle. Exponential performance improvements are here to stay for at least 10 years. Unfortunately, other important computer components like main memory chips could not keep pace and introduce severe bottlenecks that prevent modern processors from fully exploiting their power.

Memory chips have developed only slowly. Though fast static RAM (SRAM) chips are available, they are much more expensive than their slower dynamic RAM (DRAM) counterparts. One reason for the slow increase in DRAM speed is the fact that during the last decade, the main focus in memory chip design was primarily to increase the number of transistors per chip and therefore the number of bits that can be stored on one single chip. When cutting the size of transistors in half, the number of transistors per chip is quadrupled. In the past few years, this quadrupling was observed within a period of three years, thus increasing the capacity of memory chips at an annual rate of 60 %, which corresponds exactly to the growth rate of processor performance. Yet, due to the increasing address space, address decoding is becoming more complicated and will finally nullify any speed-up achieved with smaller transistors. Thus, memory latency can be reduced only at a rate of 6 percent per year. The divergence between processor performance and DRAM performance currently doubles every six years.

In modern computers memory is divided into several stages, yielding a memory hierarchy. The higher a particular memory level is placed within this hierarchy, the faster and more expensive (and thus smaller) it is. Figure 2.2 shows a typical memory hierarchy. The fastest parts belong to the processor itself. The register file contains several processor registers that are used for arithmetic tasks. The next stages are the primary or L1-cache (built into the processor) and the secondary or L2-cache (on extra chips near the processor). Primary caches are usually fast but small. They directly access the secondary cache, which is usually larger but slower. The secondary cache accesses main memory which—on architectures with virtual memory—exchanges data with disk storage. Some microprocessors of the seventh generation hold both L1- and L2-cache on chip and have an L3-cache near the processor.

Example 2.9 (L2-Cache on Chip) Intel's Itanium processors hold both L1- and L2-cache on chip. The 2 MB L3-cache is put into the processor cartridge near the processor.

Figure 2.2: The memory hierarchy of a modern computer: CPU register file (2 KB); primary cache (512 KB, <2 ns); secondary cache (1 MB, 5-10 ns); main memory (2 GB, 10-50 ns); disk (several hundred GB, 3-7 ms).

2.3.1 Cache Memory

To prevent the processor from waiting most of the time for data from main memory, caches were introduced. A cache is a fast but small memory chip placed between the processor and main memory. Typical cache sizes vary between 128 KB and 8 MB. Data can be moved to and from the processor within a few clock cycles. If the processor needs data that is not currently in the cache, the main memory has to send it, thus decreasing the processor's performance.

The question arises whether caches can effectively reduce main memory traffic. Two principles of locality that are observed in most computer programs support the usage of caches:

Temporal Locality: If a program has accessed a certain data element in memory, it is likely to access this element again within a short period of time.

Spatial Locality: If a program has accessed a data element, it is likely to access other elements located close to the first one.

Program monitoring has shown that 90 % of a program's work is done by only 10 % of the code. Thus data and instructions can effectively be buffered within small, fast caches, as they are likely to be accessed again and again. Modern RISC processors would not work without effective caches, as main memory could not deliver data to them in time. Therefore, RISC processors have built-in caches, so-called primary or on-chip caches. Many RISC processors also provide the possibility to connect extra cache chips to them, forming secondary caches. The current generation of processors even contains this secondary cache on chip, and some processors are connected to an external tertiary cache. Typical caches show latencies of only a few clock cycles. State-of-the-art superscalar RISC processors like IBM's Power 4 architecture have caches that can deliver several different data items per clock cycle. Moreover, the on-chip cache is often split into a data cache and an instruction cache, yielding the so-called Harvard architecture.
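The two principles of locality can be observed directly from high-level code. In the C sketch below, the first loop nest walks a row-major array with unit stride and thus uses every element of each cache line it fetches, while the second walks the same array with a stride of N doubles and, once the array exceeds the cache size, fetches a whole line for every element it touches. The array size and the use of clock() are arbitrary illustrative choices; the measured ratio is machine dependent.

    #include <stdio.h>
    #include <time.h>

    #define N 1024
    static double a[N][N];                 /* 8 MB, larger than most L2 caches */

    int main(void) {
        double s = 0.0;
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)        /* row-major order: unit stride, */
            for (int j = 0; j < N; j++)    /* good spatial locality         */
                s += a[i][j];
        clock_t t1 = clock();
        for (int j = 0; j < N; j++)        /* column order: stride of N doubles, */
            for (int i = 0; i < N; i++)    /* about one cache line per element   */
                s += a[i][j];
        clock_t t2 = clock();
        printf("row-major: %ld, column-major: %ld clock ticks (s = %g)\n",
               (long)(t1 - t0), (long)(t2 - t1), s);
        return 0;
    }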

Cache Misses

If an instruction cache miss is detected, the whole processor pipeline has to wait until the requested instruction is supplied. This is called a stall. Modern processors can handle more than one outstanding load, i. e., they can continue executing other instructions while waiting for data items to be brought in from cache or memory. But cache misses are very expensive events and can cost several tens of cycles.

Prefetching

One aspect of cache design is to try to keep miss rates low by providing data items when they are needed. One way to achieve this goal is to guess which items are likely to be needed soon and to fetch them from main memory in advance. This technique is called prefetching, whereas standard fetching on demand of the processor is called demand fetching. As the flow of machine instructions from memory to the processor is mostly sequential, and as usually regular data access patterns occur, prefetching can improve the performance of fast RISC processors impressively. Several multi-media extensions of current processors feature streaming memory access and support for software initiated prefetching.

Translation Lookaside Buffer

On all current computer systems in the scope of this thesis, virtual memory is used. Translating virtual addresses by using tables results in two to three table lookups per translation. Recalling the principles of locality, another feature makes translations much faster: the so-called translation lookaside buffer (TLB), which resembles a translation cache. TLBs contain a number of recently used translations from virtual to physical memory. Thus, the memory management will first try to find a specified virtual address in the TLB. On a TLB miss, the management has to consult the virtual page table which is stored in memory. This memory access may lead to cache misses and—in the worst case—page faults.

Chapter 3

Short Vector Hardware

Major vendors of general purpose microprocessors have included single instruction, multiple data (SIMD) extensions to their instruction set architectures (ISA) to improve the performance of multi-media applications by exploiting the subword level parallelism available in most multi-media kernels.

All current SIMD extensions are based on the packing of large registers with smaller datatypes (usually of 8, 16, 32, or 64 bits). Once packed into the larger register, operations are performed in parallel on the separate data items within the vector register. Although initially the new data types did not include floating-point numbers, more recently new instructions have been added to deal with floating-point SIMD parallelism. For example, Motorola's AltiVec and Intel's streaming SIMD extensions (SSE) operate on four single-precision floating-point numbers in parallel. IBM's Double Hummer extension and Intel's SSE 2 can operate on two double-precision numbers in parallel. The Double Hummer extension, which is part of IBM's Blue Gene initiative and will be implemented in BG/L processors, is still classified and will therefore be excluded from the following discussion. However, this particular SIMD extension will be a major target for the technology presented in this thesis.

By introducing double-precision short vector SIMD extensions, this technology entered scientific computing. Conventional scalar codes become obsolete on machines featuring these extensions, as such codes utilize only a fraction of the potential performance. However, SIMD extensions have strong implications on algorithm development, as their efficient utilization is not straightforward. The most important restriction of all SIMD extensions is the fact that only naturally aligned vectors can be accessed efficiently. Although loading subvectors or accessing unaligned vectors is supported by some extensions, these operations are more costly than aligned vector access. On some SIMD extensions these operations feature prohibitive performance characteristics. This negative effect has been the major driving force behind the work presented in this thesis.

The intra-vector parallelism of SIMD extensions is contrary to the inter-vector parallelism of processors in vector supercomputers like those of Cray Research, Inc., Fujitsu or NEC. Vector sizes in such machines range up to hundreds of elements. For example, Cray SV1 vector registers contain 64 elements, and Cray T90 vector registers hold 128 elements. The most recent members of this type of vector machines are the NEC SX-6 and the Earth Simulator.

3.1 Short Vector Extensions

The various short vector SIMD extensions have many similarities, with some notable differences. The basic similarity is that all these instructions operate in parallel on lower precision data packed into higher precision words. The operations are performed on multiple data elements by single instructions. Accordingly, this approach is often referred to as short vector SIMD parallel processing. This technique also differs from the parallelism achieved through multiple pipelined parallel execution units in superscalar RISC processors in that the programmer explicitly specifies parallel operations using special instructions.

Two classes of processors supporting SIMD extensions can be distinguished: (i) processors supporting only integer SIMD instructions, and (ii) processors supporting both integer and floating-point SIMD instructions. The vector length of a short vector SIMD architecture is denoted by ν.
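Because only naturally aligned ν-vectors can be loaded and stored efficiently, data intended for SIMD processing is usually placed on 16 byte boundaries (for the 128 bit extensions). A minimal C sketch using the POSIX function posix_memalign(); the buffer size is an arbitrary example:

    #include <stdlib.h>
    #include <stdio.h>

    int main(void) {
        float *x;
        /* request a 16 byte aligned buffer for 1024 floats (256 four-way vectors) */
        if (posix_memalign((void **)&x, 16, 1024 * sizeof(float)) != 0)
            return 1;
        printf("x = %p, 16 byte aligned: %s\n", (void *)x,
               ((size_t)x % 16 == 0) ? "yes" : "no");
        free(x);
        return 0;
    }

Static arrays can alternatively be aligned with compiler-specific attributes such as the Gnu C __attribute__((aligned(16))).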

3.1.1 Integer SIMD Extensions

MAX-1. With the PA-7100LC, Hewlett-Packard introduced a small set of multi-media acceleration extensions, MAX-1, which performed parallel subword arithmetic. Though the design goal was to support all forms of multi-media applications, the single application that best illustrated its performance was real-time MPEG-1, which was achieved with C codes using macros to directly invoke MAX-1 instructions.

VIS. Next, Sun introduced VIS, a large set of multi-media extensions for UltraSparc processors. In addition to parallel arithmetic instructions, VIS provides novel instructions specifically designed to achieve memory latency reductions for algorithms that manipulate visual data. In addition, it includes a special-purpose instruction that computes the sum of absolute differences of eight pairs of pixels, similar to that found in media coprocessors such as Philips' Trimedia.

MAX-2. Then, Hewlett-Packard introduced MAX-2 with its 64 bit PA-RISC 2.0 microprocessors. MAX-2 added a few new instructions to MAX-1 for subword data alignment and rearrangement to further support subword parallelism.

MMX. Intel's MMX technology is a set of multi-media extensions for the x86 family of processors. It lies between MAX-2 and VIS in terms of both the number and complexity of new instructions. MMX integrates a useful set of multi-media instructions into the somewhat constrained register structure of the x86 architecture. MMX shares some characteristics of both MAX-2 and VIS, and also includes new instructions like a parallel 16 bit multiply-accumulate instruction.

VIS, MAX-2, and MMX all have the same basic goal. They provide high-performance multi-media processing on general-purpose microprocessors. All three of them support a full set of subword parallel instructions on 16 bit subwords. Four subwords per 64 bit register word are dealt with in parallel. Differences exist in the type and amount of support they provide, driven by the needs of the target markets. For example, support is provided for 8 bit subwords when target markets include lower end multi-media applications (like games), whereas high quality multi-media applications (like medical imaging) require the processing of larger subwords.

3.1.2 Floating-Point SIMD Extensions

Floating-point computation is the heart of every numerical algorithm. Thus, speeding up floating-point computation is essential to overall performance.

AltiVec. Motorola's AltiVec SIMD architecture extends the recent MPC74xx G4 generation of the Motorola Power PC microprocessor line—starting with the MPC7400—through the addition of a 128 bit vector execution unit. This short vector SIMD unit operates concurrently with the existing integer and floating-point units. This new execution unit provides for highly parallel operations, allowing for the simultaneous execution of four arithmetic operations in a single clock cycle for single-precision floating-point data. Technical details are given in the Motorola AltiVec manuals [70, 71]. The features relevant for this thesis are summarized in Appendix B.3.

SSE. With the Pentium III streaming SIMD extensions (SSE), Intel added 70 new instructions to the IA-32 architecture. The SSE instructions of the Pentium III processor introduced new general purpose floating-point instructions, which operate on a new set of eight 128 bit SSE registers. In addition to the new floating-point instructions, SSE technology also provides new instructions to control the cacheability of all data types. SSE includes the ability to stream data into the processor while minimizing pollution of the caches and the ability to prefetch data before it is actually used. Both 64 bit integer and packed floating-point data can be streamed to memory. Technical details are given in Intel's architecture manuals [49, 50, 51] and the C++ compiler manual [52]. The features relevant for this thesis are summarized in Appendix B.1.

SSE 2. Intel's Pentium 4 processor is the first member of a new family of processors that are the successors to the Intel P6 family of processors, which includes the Intel Pentium Pro, Pentium II, and Pentium III processors. New SIMD instructions (SSE 2) are introduced in the Pentium 4 processor architecture and include floating-point SIMD instructions, integer SIMD instructions, as well as conversion of packed data between XMM registers and MMX registers. The newly added floating-point SIMD instructions allow computations to be performed on packed double-precision floating-point values (two double-precision values per 128 bit XMM register). Both the single-precision and double-precision

floating-point formats and the instructions that operate on them are fully compatible with the IEEE Standard 754 for binary floating-point arithmetic. Technical details are given in Intel's architecture manuals [49, 50, 51] and the C++ compiler manual [52]. The features relevant for this thesis are summarized in Appendix B.2.

IPF. Support for Intel's SSE is maintained and extended in Intel's and HP's new generation of Itanium processor family (IPF) processors when run in the 32 bit legacy mode. Native 64 bit instructions exist which split the double-precision registers into a pair of single-precision registers with support for two-way SIMD operations. In the software layer provided by Intel's compilers, these native instructions are accessed through the SSE intrinsics. Technical details are given in Intel's architecture manuals [54, 55, 56] and the C++ compiler manual [52].

3DNow! Since AMD requires Intel x86 compatibility for business reasons, it implemented the MMX extensions in its processors too. However, AMD specific instructions were added, known as "3DNow!". AMD's Athlon has instructions similar to Intel's SSE instructions, designed for purposes such as digital signal processing. One important difference between the Athlon extensions (Enhanced 3DNow!) and those on the Pentium III is that no extra registers have been added in the Athlon design. The AMD Athlon XP features the new 3DNow! professional extension which is compatible with both Enhanced 3DNow! and SSE. AMD's new 64 bit architecture x86-64 and the first processor of this new line, called Hammer, support a superset of all current x86 SIMD extensions including SSE 2. Technical details can be found in the AMD 3DNow! manual [2] and in the x86-64 manuals [6, 7].

Overview. Table 3.1 gives an overview of the SIMD floating-point capabilities found in current microprocessors.

3.1.3 Data Streaming

One of the key features needed in fast multi-media applications is the efficient streaming of data into and out of the processor. Multi-media programs such as video decompression codes stress the data memory system in ways that the multilevel cache hierarchies of many general-purpose processors cannot handle efficiently. These programs are data intensive, with working sets bigger than many first-level caches. Streaming memory systems and compiler optimizations aimed at reducing memory latency (for example, via prefetching) have the potential to improve these applications' performance. Current research in data and computational transforms for parallel machines may provide for further gains in this area.
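On processors with SSE, the streaming and prefetching support described here is exposed through the intrinsics declared in xmmintrin.h. The following C sketch copies an array using a software prefetch and non-temporal (cache-bypassing) stores; the function name and the prefetch distance of 64 floats are illustrative choices, and src and dst are assumed 16 byte aligned with n a multiple of 4.

    #include <xmmintrin.h>

    void stream_copy(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i += 4) {
            /* hint: bring a line that will be needed soon into the cache */
            _mm_prefetch((const char *)(src + i + 64), _MM_HINT_T0);
            __m128 v = _mm_load_ps(src + i);   /* aligned four-way load         */
            _mm_stream_ps(dst + i, v);         /* store without cache pollution */
        }
        _mm_sfence();                          /* order the streaming stores    */
    }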

Vendor     Name                  n-way   Prec.    Processor                Compiler
Intel      SSE                   4-way   single   Pentium III, Pentium 4   MS Visual C++, Intel C++ Compiler, Gnu C Compiler 3.0
Intel      SSE 2                 2-way   double   Pentium 4                MS Visual C++, Intel C++ Compiler, Gnu C Compiler 3.0
Intel      IPF                   2-way   single   Itanium, Itanium 2       Intel C++ Compiler
AMD        3DNow!                2-way   single   K6, K6-II                MS Visual C++, Gnu C Compiler 3.0
AMD        Enhanced 3DNow!       2-way   single   Athlon (K7)              MS Visual C++, Gnu C Compiler 3.0
AMD        3DNow! Professional   4-way   single   Athlon XP, Athlon MP     MS Visual C++, Intel C++ Compiler, Gnu C Compiler 3.0
Motorola   AltiVec               4-way   single   MPC74xx G4               Gnu C Compiler 3.0, Apple C Compiler 2.96
IBM        Hummer2               2-way   double   BG/L processor           IBM XLCentury

Table 3.1: Short vector SIMD extensions providing floating-point arithmetic found in general purpose microprocessors.

3.1.4 Software Support

Currently, application developers have three common methods for accessing multi-media hardware within a general-purpose microprocessor: (i) they can invoke vendor-supplied libraries that utilize the new instructions, (ii) rewrite key portions of the application in assembly language using the multi-media instructions, or (iii) code in a high-level language and use vendor-supplied macros that make the extended functionality available through a simple function-call-like interface.

System Libraries. The simplest approach to improving application performance is to rewrite the system libraries to employ the multi-media hardware. The clear advantage of this approach is that existing applications can immediately take advantage of the new hardware without recompilation. However, the restriction of multi-media hardware to the system libraries also limits potential performance benefits. An application's performance will not improve unless it invokes the appropriate system libraries, and the overheads inherent in the general interfaces associated with system functions will limit application performance improvements. Even so, this is the easiest approach for a system vendor, and vendors have announced or plan to provide such enhanced libraries.

Assembly Language. At the other end of the programming spectrum, an application developer can benefit from multi-media hardware by rewriting key portions of an application in assembly language. Though this approach gives a developer great flexibility, it is generally tedious and error prone. In addition, it does not guarantee a performance improvement (over code produced by an optimizing compiler), given the complexity of today's microprocessors.

Programming Language Abstractions. Recognizing the tedious and difficult nature of assembly coding, most hardware vendors that have introduced multi-media extensions have developed programming-language abstractions. These give an application developer access to the newly introduced hardware without having to actually write assembly language code. Typically, this approach results in a function-call-like abstraction that represents a one-to-one mapping between a function call and a multi-media instruction.

There are several benefits of this approach. First, the compiler (not the developer) performs machine-specific optimizations such as register allocation and instruction scheduling. Second, this method integrates multi-media operations directly into the surrounding high-level code without an expensive procedure call to a separate assembly language routine. Third, it provides a high degree of portability by isolating the source code from the specifics of the underlying hardware implementation. If the multi-media primitives do not exist in hardware on the particular target machine, the compiler can replace the multi-media macro by a set of equivalent operations.

The most common language extension supplying such primitives is to provide within the C programming language function-call-like intrinsic (or built-in) functions and new data types that mirror the instructions and vector registers. For most SIMD extensions, at least one compiler featuring these language extensions exists. Examples include C compilers for HP's MAX-2, Intel's MMX, SSE, and SSE 2, Motorola's AltiVec, and Sun's VIS architecture, as well as the Gnu C compiler, which supports a broad range of short vector SIMD extensions. Each intrinsic directly translates to a single multi-media instruction, and the compiler allocates registers and schedules instructions. This approach would be even more attractive to application developers if the industry agreed upon a common set of macros, rather than having a different set from each vendor. For the AltiVec architecture, Motorola has defined such an interface. Under Windows, both the Intel C++ compiler and Microsoft's Visual Studio compiler use the same macros to access SSE and SSE 2, and the Intel C++ compiler for Linux uses these macros as well.
These two C extensions provide de facto standards on the respective architectures.
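As an illustration of this function-call-like abstraction, the following C fragment adds two four-way single-precision vectors with the SSE intrinsics accepted by the Intel, Microsoft, and Gnu compilers; the function name is hypothetical, and all three arrays are assumed to be 16 byte aligned. Each intrinsic maps one-to-one to an SSE instruction.

    #include <xmmintrin.h>

    void vec4_add(float *z, const float *x, const float *y)
    {
        __m128 a = _mm_load_ps(x);      /* MOVAPS: aligned 4-way load    */
        __m128 b = _mm_load_ps(y);
        __m128 c = _mm_add_ps(a, b);    /* ADDPS: four additions at once */
        _mm_store_ps(z, c);             /* MOVAPS: aligned 4-way store   */
    }

The compiler chooses the XMM registers and schedules the generated instructions; the source contains no assembly language.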

Vectorizing Compilers. While macros may be an acceptably efficient solution for invoking multi-media instructions within a high-level language, subword parallelism could be exploited further with automatic compilation from high-level languages to these instructions. Some vectorizing compilers for short vector SIMD extensions exist, including the Intel C++ compiler, the PGI Fortran compiler and the Vector C compiler. Vectorizing compilers are analyzed in Section 8.2.6. In the remainder of this chapter the short vector SIMD extensions are discussed in detail.

3.2 Intel’s Streaming SIMD Extensions

The Pentium III processor was Intel's first processor featuring the streaming SIMD extensions (SSE). SSE instructions are Intel's floating-point and integer SIMD extensions to the P6 core. They also support the integer SIMD operations (MMX) introduced by its predecessor, the Pentium II processor. Appendix B.1 lists all SSE instructions relevant in the context of this thesis and Appendix B.2 lists all relevant SSE 2 instructions.

SSE offers general purpose floating-point instructions that operate on a set of eight 128 bit SIMD floating-point registers. Each register is considered to be a vector of four single-precision floating-point numbers. The SSE registers are not aliased onto the floating-point registers as the MMX registers are. This feature enables the programmer to develop algorithms that can utilize both SSE and MMX instructions without penalty.

SSE also provides new instructions to control the cacheability of MMX technology and IA-32 data types. These instructions include the ability to load data from memory and store data to memory without polluting the caches, and the ability to prefetch data before it is actually used. These features are called data streaming.

SSE provides the following extensions to the IA-32 programming environment: (i) one new 128 bit packed floating-point data type, (ii) 8 new 128 bit registers, and (iii) 70 new instructions. The new data type is a vector of single-precision floating-point numbers, capable of holding exactly four single-precision floating-point numbers. The new SIMD floating-point unit (FPU) can be used as a replacement for the standard non-SIMD FPU. Unlike the MMX extensions, the new floating-point SIMD unit can be used in parallel with the standard FPU.

The 8 new registers are each capable of holding exactly one 128 bit SSE data type. Unlike the standard Intel FPU, the SSE FPU registers are not viewed as a register stack, but rather are directly accessible by the names XMM0 through XMM7. Unlike the general purpose registers, the new registers operate only on data and cannot be used to address memory (which is sensible since memory locations are 32 bit addressable). The SSE control status register MXCSR provides the usual information such as rounding modes and exception handling for a vector as a whole, but not for individual elements of a vector.

Figure 3.1: Packed SSE operations (the operation is applied to all four single-precision elements of the two operands in parallel).

Thus, if a floating-point exception is raised after performing some operation, one may be aware of the exception, but cannot tell to which element of the vector the exception applies.

Through the introduction of new registers, the Pentium III processor has operating system visible state and thus requires operating system support. The integer SIMD (MMX) registers are aliased to the standard FPU's registers and thus do not require operating system support. Operating system support is needed if, on a context switch, the contents of the new registers are to be stored and loaded properly.

3.2.1 The SSE Instructions

The 70 SSE instructions are mostly SIMD floating-point related; however, some of them extend the integer SIMD extension MMX, and others relate to cache control. There are: (i) data movement instructions, (ii) arithmetic instructions, (iii) comparison instructions, (iv) conversion instructions, (v) logical instructions, (vi) shuffle instructions, (vii) state management instructions, (viii) cacheability control instructions, and (ix) additional MMX SIMD integer instructions. The latter operate on the MMX registers, and not on the SSE registers.

The SSE instructions operate on either all (see Figure 3.1) or the least significant (see Figure 3.2) pairs of packed data operands in parallel. In general, the address of a memory operand has to be aligned on a 16 byte boundary for all instructions.

The data movement instructions include pack/unpack instructions and data shuffle instructions that make it possible to "mix" the indices in vector operations. The instruction SHUFPS (shuffle packed, single-precision, floating-point) is able to shuffle any of the four packed single-precision floating-point numbers from one source operand to the lower two destination fields; the upper two destination fields are generated from a shuffle of any of the four floating-point numbers of the second source operand (Figure 3.3). By using the same register for both sources, SHUFPS can return any combination of the four floating-point numbers from this register.

Figure 3.2: Scalar SSE operations (the operation is applied only to the least significant elements; the upper three elements are passed through from the first operand).

Figure 3.3: Packed shuffle SSE operations (the two lower destination fields select from the first operand, the two upper fields from the second operand).

When stored in memory, the floating-point numbers occupy consecutive memory addresses. Instructions exist which allow data to be loaded from and stored to memory in 128 bit, 64 bit, or 32 bit blocks, that is: (i) instructions for moving all 4 elements to and from memory, (ii) instructions for moving the upper two elements to and from memory, (iii) instructions for moving the lower two elements to and from memory, and (iv) instructions for moving the lowest element to and from memory.

Some important remarks about the SSE instruction set have to be made.

• The SSE instruction set offers no means for moving data between the standard FPU registers and the new SSE registers, and no provision for moving data between the general purpose registers and the new registers (without converting types).

• Memory access instructions, as well as instructions which use a memory address as an operand, like the arithmetic instruction MULPS (which can use a memory address or a register as one of its operands), distinguish between

16 byte aligned data and data not aligned on a 16 byte boundary. Instructions exist for moving aligned and unaligned data; however, instructions which move unaligned data suffer a performance penalty of 9 to 12 extra clock cycles. Instructions which are able to use a memory location as an operand (such as MULPS) assume 16 byte alignment of the data. If unaligned data is accessed when aligned data is expected, a general protection error is raised.

• The Pentium III SIMD FPU is a true 32 bit floating-point unit. It does all computations using 32 bit floating-point numbers. The standard FPU of the Intel IA-32 architecture defaults all internal computations to 80 bits (IEEE 754 extended) and truncates the result if fewer than 80 bits are needed. Thus, noticeable differences can be observed when comparing single-precision output from the two units.

Documentation. The SSE instruction set is described in the IA-32 manuals [49, 50]. Further information on programming Intel's SSE can be found in the related application notes [44, 46] and the IA-32 optimization manual [47]. Further information on data alignment issues is given in [45].
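At the source level, the alignment distinction and the SHUFPS shuffle described above appear as follows. This is a sketch using the SSE intrinsics from xmmintrin.h; the function name is hypothetical, p_aligned and out are assumed 16 byte aligned, and p_any may be unaligned.

    #include <xmmintrin.h>

    void shuffle_demo(float *out, const float *p_aligned, const float *p_any)
    {
        __m128 a = _mm_load_ps(p_aligned);   /* MOVAPS: requires 16 byte alignment */
        __m128 b = _mm_loadu_ps(p_any);      /* MOVUPS: works anywhere, but slower */

        /* SHUFPS: the two lower result fields come from a, the two upper from b.
           _MM_SHUFFLE(3,2,1,0) selects the elements b[3], b[2], a[1], a[0]. */
        __m128 c = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));

        _mm_store_ps(out, c);                /* MOVAPS: out must be aligned */
    }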

3.2.2 The SSE 2 Instructions

The streaming SIMD extensions 2 (SSE 2) add 144 instructions to the IA-32 architecture and allow the Pentium 4 to process double-precision data using short vector SIMD instructions. In addition, extra long integers are supported. SSE 2 is based on the infrastructural changes already introduced with SSE. In particular, the SSE registers are used for SSE 2, and all instructions appear as two-way versions of the respective four-way SSE instructions. Thus, most restrictions of the SSE instructions are mirrored by the SSE 2 instructions. Most importantly, the memory access restriction is the same as for SSE: only naturally (16 byte) aligned vectors can be accessed efficiently. Even the shuffle SSE operations are implemented as two-way SSE 2 versions.

SSE 2 arithmetic offers full IEEE 754 double-precision arithmetic and thus can be used in science and engineering applications. SSE 2 is designed to replace the standard FPU. This can be achieved by utilizing scalar SSE 2 arithmetic operating on the lower word of the two-way vector. The main impact is that floating-point code not utilizing the SSE 2 extension becomes obsolete, and again the complexity of high-performance programs is raised. SSE 2 introduces the same data alignment issues as SSE: efficient memory access requires 16 byte aligned vector memory access.

Documentation. The SSE 2 instruction set is described in the IA-32 manuals [49, 50]. Further information on programming Intel's SSE 2 can be found in the IA-32 optimization manuals [47, 48]. Further information on data alignment issues is given in [45].
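Using SSE 2 as a replacement for the standard FPU amounts to keeping values in the lower word of an XMM register and using the scalar instruction forms. A sketch with the SSE 2 intrinsics from emmintrin.h (the function name is illustrative):

    #include <emmintrin.h>

    double axpy_scalar(double a, double x, double y)
    {
        __m128d va = _mm_set_sd(a);            /* operand in the lower word */
        __m128d vx = _mm_set_sd(x);
        __m128d vy = _mm_set_sd(y);
        /* MULSD/ADDSD: scalar double-precision operations on the lower word */
        __m128d r = _mm_add_sd(_mm_mul_sd(va, vx), vy);
        return _mm_cvtsd_f64(r);               /* extract the lower word    */
    }

Unlike the 80 bit x87 FPU, these scalar operations round to 64 bits at every step, so the results match IEEE 754 double precision exactly.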

3.2.3 The Itanium Processor Family

Intel's Itanium processor family (IPF, formerly called IA-64) is a VLIW processor family (introducing the VLIW architecture called EPIC) developed in cooperation with HP. It is targeted at workstations and servers, not necessarily desktop PCs. The first processors of this line are the Itanium and the Itanium 2 processors. IPF processors have many advanced processor features, including predication, register windows, explicit parallelism, and a large register file. Details of these features can be found in Section 2.2.

In the context of this thesis, the short vector SIMD operations provided by the IPF are the most interesting ones. The IPF features two types of short vector SIMD operations: (i) legacy SSE support, and (ii) two-way single-precision SIMD support. In the processors' hardware, the 64 bit double-precision floating-point registers are split into two-way single-precision vector registers. Any supported double-precision floating-point operation can be executed as a two-way single-precision vector operation. Additionally, IPF hardware features exchange operations similar to the shuffle operations provided by SSE 2.

As IPF processors have to be able to run legacy IA-32 codes which may contain SSE instructions, SSE is supported, but any SSE instruction is emulated by IPF two-way operations. The only language extension to utilize the IPF two-way SIMD operations is the Intel C++ SSE intrinsics, which are implemented using the IPF SIMD operations. Intel does not see the SSE support as a high-performance API for the IPF SIMD instructions; thus, currently no native API exists to utilize these instructions from the source code level.

Documentation. IPF is described in the Itanium architecture manuals [54, 55, 56] and the model specific optimization manuals [53, 57].

3.3 Motorola’s AltiVec Technology

AltiVec is Motorola’s vector processing support that has been added to the Power PC architecture. The first Power PC chip that included AltiVec is the MPC 7400 G4. Motorola’s AltiVec technology expands the Power PC archi- tecture through the addition of a 128 bit vector execution unit, which operates concurrently with the existing integer and floating-point units. This new engine provides for highly parallel operations, allowing the simultaneous execution of up to 16 operations in a single clock cycle. The currently newest member of Motorola’s Power PC series featuring AltiVec is the MPC 7455 G4. AltiVec technology offers support for: • 16-way parallelism for 8 bit signed and unsigned integers and characters,

• 8-way parallelism for 16 bit signed and unsigned integers, and

• 4-way parallelism for 32 bit signed and unsigned integers and IEEE floating-point numbers.

AltiVec technology also includes a separate register file containing 32 entries, each 128 bits wide. These 128 bit registers hold the data sources for the AltiVec technology execution units. The registers are loaded and unloaded through vector store and vector load instructions that transfer the contents of a single 128 bit register to and from memory.

AltiVec defines over 160 new instructions. Each AltiVec instruction specifies up to three source operands and a single destination operand. All operands are vector registers, with the exception of the load and store instructions and a few instruction types that provide operands from immediate fields within the instruction.

In the G4 processor, data cannot be moved directly between the vector registers and the integer or floating-point registers. Instructions are dispatched to the vector unit in the same fashion as to the integer and floating-point units. Since the vector unit is internally broken down into two separate execution units, two AltiVec instructions can be dispatched in the same clock cycle if one is an arithmetic instruction and the other one is a permute instruction.

On AltiVec, efficient memory access has to be vector memory access of 16 byte aligned data. All other (unaligned) memory access operations result in high penalties. Such operations have to be built from multiple vector access operations and permutations, or by using multiple expensive single element access operations.

Appendix B.3 describes the AltiVec instructions relevant in the context of this thesis. Technical details are given in the Motorola AltiVec manuals [70, 71].
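With the AltiVec language extension supported by the Gnu and Apple C compilers, the same intrinsic style as on SSE is available through altivec.h. A minimal sketch (the function name is hypothetical; vec_ld and vec_st require, and here assume, 16 byte aligned addresses):

    #include <altivec.h>

    void vec4_add_altivec(float *z, const float *x, const float *y)
    {
        vector float a = vec_ld(0, x);   /* lvx: aligned 16 byte load      */
        vector float b = vec_ld(0, y);
        vector float c = vec_add(a, b);  /* vaddfp: four additions at once */
        vec_st(c, 0, z);                 /* stvx: aligned 16 byte store    */
    }

An unaligned access would have to be assembled from two aligned loads and a vec_perm, which is exactly the penalty discussed above.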

3.4 AMD’s 3DNow!

AMD is Intel’s competitor in the field of x86 compatible processors. In response to Intel’s MMX technology, AMD released the 3DNow! technology line which is MMX compatible and additionally features two-way floating-point SIMD opera- tion. In the first step, 21 instructions were included defining the original 3DNow! extension. The original 3DNow! was released with the AMD K6-II processor. Up to two 3DNow! instructions could be executed per clock cycle, including one two-way addition and one two-way multiplication leading to a peak performance of four floating-point operations per cycle. With the introduction of the AMD Athlon processor, AMD has taken 3DNow! technology to the next level of performance and functionality. The AMD Athlon processor features an enhanced version of 3DNow! that adds 24 instructions to the existing 21 original 3DNow! instructions. These 24 additional instructions include: (i) 12 standard SIMD instructions, (ii) 7 streaming memory access in- structions, and (iii) 5 special DSP instructions. 48 3. Short Vector Hardware

AMD’s Athlon XP and Athlon MP processor line introduces SSE compatibility by the introduction of 3DNow! professional. Thus, Athlon XP and Athlon MP processors are both enhanced 3DNow! and SSE compatible. The 3DNow! extension shares the FPU registers and features a very fast switching between MMX and the FPU. Thus, no additional operating system support is required. Information about the AMD 3DNow! extension family can be found in the related manuals [3, 2, 5, 4] With AMD’s next generation processor (codename Hammer [6, 7]), a new 64 bit architecture called x86-64 will be introduced. This architecture features 128 bit media instructions and 64 bit media programming. The new 64 bit instruc- tion set will support a superset of all IA-32 SIMD extensions, thus supporting MMX, all 3DNow! versions, SSE, and SSE 2.

3.5 Vector Computers vs. Short Vector Hardware

Vector computers are supercomputers used for large scientific and engineering problems, as many numerical algorithms allow those parts which consume the majority of computation time to be expressed as vector operations. This holds especially for almost all linear algebra algorithms (Golub and Van Loan [38], Dongarra et al. [17]). It is therefore a straightforward strategy to improve the performance of processors used for numerical data processing by providing an instruction set tailor-made for vector operations as well as suitable hardware.

This idea materialized in vector architectures comprising specific vector instructions, which allow for the componentwise addition, multiplication and/or division of vectors as well as the multiplication of vector components by a scalar. Moreover, there are specific load and store instructions enabling the processor to fetch all components of a vector from main memory or to move them there. The hardware counterparts of vector instructions are the matching vector registers and vector units. Vector registers are memory elements which can contain vectors of a given maximum length. Vector units performing vector operations, as mentioned above, usually require the operands to be stored in vector registers.

These systems are specialized machines not comparable to general purpose processors featuring short vector SIMD extensions. The most obvious differences on the vector extension level are the much larger machine vector length, the support for vectors shorter than the machine vector length, and non-unit stride memory access. In vector computers multiple processing elements actually process the vector data, while short vector SIMD extensions support only a very short, fixed vector length.

Example 3.1 (Vector Computers) The Cray T90 multiprocessor uses Cray Research Inc. custom silicon CPUs with a clock speed of 440 MHz; each processor has a peak performance of 1.7 Gflop/s. Each CPU has 8 vector registers with 128 words (vector elements) of eight bytes (64 bits) each.

Current vector computers provided by NEC range from deskside systems (the NEC SX-6i featuring one CPU and a peak performance of 8 Gflop/s) up to the currently most powerful computer in the world: the Earth Simulator featuring 5120 vector CPUs running at 500 MHz leading to a peak performance of 41 Tflop/s.

The high performance of floating-point operations in vector units is mainly due to the concurrent execution of operations (as in a very deep pipeline). There are further advantages of vector processors as compared with other processors capable of executing overlapped floating-point operations.

• As vector components are usually stored contiguously in memory, the access pattern to the data storage is known to be linear. Vector processors exploit this fact using a very fast vector data fetch from a massively interleaved main memory.

• There are no memory delays for a vector operand which fits completely into a vector register.

• There are no delays due to branch conditions as they might occur if the vector operation were implemented as a loop.

In addition, vector processors may utilize the superscalar principle by executing several vector operations per time unit (Dongarra et al. [18]).

Parallel Vector Computers

Most vector supercomputer manufacturers produce multiprocessor systems based on their vector processors. Since a single node is so expensive and so finely tuned to memory bandwidth and other architectural parameters, the multiprocessor configurations have only a few vector processing nodes.

Example 3.2 (Parallel Vector Computers) A NEC SX-5 multi node configuration can in- clude up to 32 SX-5 single node systems for the SX-6A configuration. However, the latest vector processors fit onto single chips. For instance, NEC SX-6 nodes can be combined to form much lager systems in multiframe configuration (up to 1024 CPUs are combined) or even the earth simulator with its 5120 CPUs.

3.5.1 Vectorizing Compilers Vectorizing compilers were developed for the vector computers described above. Using vectorizing compilers to produce short vector SIMD code for discrete lin- ear transforms in the context of adaptive algorithms is not straightforward. As the vectorizing compiler technology originates from completely different machines and in the short vector SIMD extensions other and new restrictions are found, the capabilities of these compilers are limited. Especially automatic performance tuning poses additional challenges to vectorizing compilers as the codes are gen- erated automatically and intelligent search is used which conflicts with some 50 3. Short Vector Hardware compiler optimization. Thus compiler vectorization and automatic performance tuning cannot be combined easily. The two leading adaptive software systems for discrete linear transforms cannot directly use compiler vectorization in their code generation and adaptation process. FFTW. Due to the recursive structure of Fftw and the fact that memory access patterns are not known in advance, vectorizing compilers cannot prove alignment and unit stride properties required for vectorization. Thus Fftw cannot be vectorized automatically using compiler vectorization. SPIRAL. The structure of code generated by Spiral implies that such code cannot be vectorized directly by using vectorizing compilers without some hints and changes in the generated code. A further difficulty is introduced by opti- mizations carried out by Spiral. Vectorizing compilers only vectorize rather large loops, as in the general case the additional cost for prologue and epilogue has to be amortized by the vectorized loop. Vectorizing compilers require hints about which loop to vectorize and to prove loop carried data dependencies. It is required to guarantee the proper alignment. The requirement of a large num- ber of loop iterations conflicts with the optimal code structure, as in discrete linear transforms a small number (sometimes as small as the extension’s vector length) turns out to be most efficient. In addition, straight line codes cannot be vectorized. In numerical experiments summarized in Section 8.2.6, a vectorizing compiler was plugged into the Spiral system and the required changes were made. The Spiral/vect system was able to speed up the generated code significantly. For simpler codes, the performance achieved by the vectorizing compiler is close to the results obtained using formula based vectorization as developed in the next chapter (although still inferior). However, it is not possible to achieve the same performance level as reached by the presented approach for more complicated algorithms like FFTs.

3.5.2 Vector Computer Libraries Traditional vector processors have typically vector lengths of 64 and more ele- ments. They are able to load vectors at non-unit stride but feature a rather high startup cost for vector operations (Johnson et al. [60]). Codes developed for such machines do not match the requirements of modern short vector SIMD exten- sions. Highly efficient implementations for DFT computation that are portable across different conventional vector computers are not available. For instance, high-performance implementations for Cray machines were optimized using as- sembly language (Johnson et al. [59]). An example for such an library is Cray’s proprietary Scilib which is also available as the Fortran version Sciport which can be obtained via Netlib (Lamson [65]). Chapter 4

The Mathematical Framework

This chapter introduces the formalisms of Kronecker products (tensor products) and stride permutations, which are the foundations of most algorithms for discrete linear transforms. This includes various FFT algorithms, the Walsh-Hadamard transform, different sine and cosine transforms, wavelet transforms as well as all multidimensional linear transform. Kronecker products allow to derive and modify algorithms on the structural level instead of using properties of index values in the derivation process. The Kronecker product framework provides a rich algebraic structure which captures most known algorithms for discrete linear transforms. Both iterative as well as recursive algorithms are captured. Most proofs in this section are omitted. They can be found in Van Loan [90]. The Kronecker product formalism has a long and well established history in mathematics and physics, but until recently it has gone virtually unnoticed by computer scientists. This is changing because of the strong connection be- tween certain Kronecker product constructs and advanced computer architec- tures (Johnson et al. [61]). Through this identification, the Kronecker product formalism has emerged as a powerful tool for designing parallel algorithms. In this chapter, Kronecker products and their algebraic properties are intro- duced from a point of view well suited to algorithmic and programming needs. It will be shown that mathematical formulas involving Kronecker product oper- ations are easily translated into various programming constructs and how they can be implemented on vector machines. The unifying approach is required to allow automatic performance tuning for all discrete linear transforms. In 1968, Pease [76] was the first who utilized Kronecker products for describing FFT algorithms. So it was possible to express all required operations on the matrix level and to obtain considerably clearer structures. Van Loan [90] used this technique for a state-of-the-art presentation of FFT algorithms. In the twenty- five years between the publications of Pease and Van Loan, only a few authors used this powerful technique: Temperton [87] and Johnson et al. [60] for FFT implementations on classic vector computers and Norton and Silberger [75] on parallel computers with MIMD architecture. Gupta et al. [39] and Pitsianis [77] used the Kronecker product formalism to synthesize FFT programs. The Kronecker product approach to FFT algorithm design antiquates more conventional techniques like signal flow graphs. Signal flow graphs rely on the spatial symmetry of a graph representation of FFT algorithms, whereas the Kro- necker product exploits matrix algebra. Following the idea of Johnson et al. [60],

51 52 4. The Mathematical Framework the Spiral project (Moura et al. [72] and P¨uschel et al. [79]) provides the first automatic performance tuning system for the field of discrete linear transforms. One foundation of Spiral is the work of Johnson et al. [60] which is extended to cover general discrete linear transforms. The Kronecker product approach makes it easy to modify a linear transform algorithm by exploiting the underlying algebraic structure of its matrix represen- tation. This is in contrast to the usual signal flow approach where no well defined methodology for modifying linear transform algorithms is available.

4.1 Notation

The notational conventions introduced in the following are used throughout this thesis. Integers denoting problem sizes are referred to by capital letters M, N, etc. Loop indices and counters are denoted by lowercase letters i, j, etc. General integers are denoted by k, m, n,etc.aswellasr, s, t,etc.

4.1.1 Vector and Matrix Notation In this thesis, vectors of real or complex numbers will be referred to by lowercase letters x, y, z, etc., while matrices appear as capital letters A, B, C,etc. Parameterized matrices (where the size and/or the entries depend on the actual parameters) are denoted by upright capital letters and their parameters.

64 Example 4.1 (Parameterized Matrices) L8 is a stride permutation matrix of size 64 × 64 8 with stride 8 (see Section 4.4), T2 is a complex diagonal matrix of size 8 × 8 whose entries are given by the parameter “2” (see Section 4.5), and I4 is an identity matrix of size 4 × 4.

Discrete linear transform matrices are denoted by an abbreviation in upright capital letters and a parameter that denotes the problem size.

Example 4.2 (Discrete Linear Transforms) WHTN denotes a Walsh-Hadamard trans- form matrix of size N × N and DFTN denotes a discrete Fourier transform matrix of size N × N (see Section 5.1).

Row and column indices of vectors and matrices start from zero unless otherwise stated. The vector space of complex n-vectors is denoted by Cn. Complex m-by-n matrices are denoted by Cm×n.

Example 4.3 (Complex Matrix) The 2-by-3 complex matrix A ∈ C2×3 is expressed as a00 a01 a02 A = ,a00,...,a12 ∈ C. a10 a11 a12

Rows and columns are indexed from zero. 4.1 Notation 53

4.1.2 Submatrix Specification Submatrices of A ∈ Cm×n are denoted by A(u, v), where u and v are index vectors that define the rows and columns of A used to construct the respective submatrix.

Index vectors can be specified using the colon notation: u = j : k ⇔ u =(j, j +1,...,k),j≤ k. Example 4.4 (Submatrix) A(2:4, 3:7)∈ C3×5 is the 3-by-5 submatrix of A ∈ Cm×n (with m ≥ 4andn ≥ 7) defined by the rows 2, 3, and 4 and the columns 3, 4, 5, 6, and 7. There are special notational conventions when all rows or columns are extracted from their parent matrix. In particular, if A ∈ Cm×n,then A(u, :) ⇔ A(u, 0:n − 1), A(:,v) ⇔ A(0 : m − 1,v). Vectors with non-unit increments are specified by the notation u = i : j : k ⇔ u =(i, i + k,...,j), where k ∈ Z \{0} denotes the increments. The number of elements specified by this notation is j − i k + , . max k 0

Example 4.5 (Non-unit Increments) Let A ∈ Cm×n, then

m+1 ×n A(0 : m − 1:2, :) ∈ C 2 is the submatrix with the even-indexed rows of A, whereas A(:,n− 1:0:−1) ∈ Cm×n is A with its columns in reversed order.

4.1.3 Diagonal Matrices n n×n If d ∈ C ,thenD =diag(d) = diag(d0,...,dn−1) ∈ C is the diagonal matrix ⎛ ⎞ d0 0 ⎜ ⎟ ⎜ d1 ⎟ D = ⎜ ⎟ . ⎝ ... ⎠ 0 dn−1

Example 4.6 (Identity Matrix) The n × n identity matrix In is a parameterized matrix where the parameter n defines the size of the square matrix and is given by ⎛ ⎞ 1 0 ⎜ ⎟ ⎜ 1 ⎟ In = ⎜ . ⎟ . ⎝ .. ⎠ 0 1 54 4. The Mathematical Framework

4.1.4 Conjugation If A ∈ Cn×n is an arbitrary matrix and P ∈ Cn×n is an invertible matrix then the conjugation of A by P is defined as

AP = P −1AP.

In this thesis P is a permutation matrix in most cases.

Example 4.7 (Conjugation of a Matrix) The 2 × 2 diagonal matrix a0 0 A = 0 a1 is conjugated by the 2 × 2 anti-diagonal 01 J2 = 10 leading to a J2 −1 1 0 A =J2 A J2 = . 0 a0

Property 4.1 (Conjugation) For any A ∈ Cn×n and P ∈ Cn×n being an in- vertible matrix it holds that PAP = AP.

Property 4.2 (Conjugation) For any A ∈ Cn×n and P ∈ Cn×n being an in- vertible matrix it holds that

AP P −1 = P −1A.

4.1.5 Direct Sum of Matrices Definition 4.1 (Direct Sum of Matrices) The direct sum of two matrices A and B is given by A 0 A ⊕ B = , 0 B where the 0’s denote blocks of zeros of appropriate size.

Given n matrices A0, A1,..., An−1 being not necessarily of the same dimension, their direct sum is defined as the block diagonal matrix ⎛ ⎞ A0 0 n−1 ⎜ ⎟ ⎜ A1 ⎟ A A ⊕ A ⊕···⊕A ⎜ ⎟ . i = 0 1 n−1 = ⎝ .. ⎠ i=0 . 0 An−1 4.2 Extended Subvector Operations 55

4.1.6 Direct Sum of Vectors Vectors are usually regarded as elements of the vector space CN and not as matrices in CN×1 or C1×N . Thus the direct sum of vectors is a vector. The direct sum of vectors can be used to decompose a vector into subvectors as required in various algorithms.

Definition 4.2 (Direct Sum of Vectors) Let y be a vector of length N and xi be n vectors of lengths mi:

⎛ ⎞ ⎛ u ⎞ u0 u0 um0 mn−2 u u u u 1 1 ⎝ m0+1 ⎠ ⎝ mn−2+1 ⎠ y = . ,x0 = . ,x1 = . ,..., xn−1 = ...... u um −1 . N−1 0 um1−1 uN−1

Then the direct sum of x0, x1,..., xn−1 is defined by n−1 u0 u1 N y = xi = x0 ⊕ x1 ⊕ ...⊕ xn−1 = . ∈ C . . i=0 uN−1

4.2 Extended Subvector Operations

Most identities introduced in this chapter can be formulated and proved easily using the standard basis.

N N N Definition 4.3 (Standard Basis) Let e0 , e1 ,..., eN−1 denote the vectors in CN with a 1 in the component given by the subscript and 0 elsewhere. The set

N {ei : i =0, 1,...,N − 1} (4.1) is the standard basis of CN .

4.2.1 The Read/Write Notation The read/write notation (RW notation) is used in mathematical formulas to rep- resent operations like writes to or reads from a vector at a certain position. Using RW notation it is possible to describe pseudo code on the basis of mathematical formula interpretation without dealing with implementation details. The prerequisite is the distributive law for matrix-vector products.

Property 4.3 (Distributivity)

k−1 k−1 Aix = Ai x. i=0 i=0 56 4. The Mathematical Framework

Definition 4.4 (Basic Read Operation) A basic read operation applied to a vector of size N reads out a subvector of size n with stride s at base address b. ⎛ ⎞ eN T ⎜ b ⎟ ⎜ ⎟ ⎜ eN T ⎟ N,n ⎜ b+s ⎟ Rb,s := ⎜ ⎟ , ⎜ . ⎟ ⎝ . ⎠ N T eb+(n−1)s

N N×1 where ei ∈ C is a vector of the standard basis (4.1).

8 4 8,4 Example 4.8 (Basic Read Operation) For x ∈ C , y ∈ C , y := R0,2 x is given by ⎛ ⎞ x0 ⎜ ⎟ ⎛ ⎞ ⎜ x1 ⎟ ⎛ ⎞ ⎜ ⎟ 1 ...... ⎜ x2 ⎟ x0 ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ 8,4 ⎜ ..1 .....⎟ ⎜ x3 ⎟ ⎜ x2 ⎟ y := R0,2 x = ⎝ ⎠ ⎜ ⎟ = ⎝ ⎠ ....1 ... ⎜ x4 ⎟ x4 ⎜ ⎟ ...... 1 . ⎜ x5 ⎟ x6 ⎝ x6 ⎠ x7 with the zeros represented by dots.

Definition 4.5 (Basic Write Operation) A basic write operation applied to a vector of size N writes a subvector of size n with stride s to base address b: N,n N N N Wb,s := eb | eb+s |···|eb+(n−1)s ,

N N×1 where ei ∈ C is a vector of the standard basis (4.1).

8 4 8,4 Example 4.9 (Basic Write Operation) For x ∈ C , y ∈ C , y := W0,1 x is given by ⎛ ⎞ ⎛ ⎞ 1 ... x0 ⎜ ⎟ ⎜ ⎟ ⎜ . 1 ..⎟ ⎛ ⎞ ⎜ x2 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ..1 . ⎟ x0 ⎜ x4 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ 8,4 ⎜ ...1 ⎟ ⎜ x2 ⎟ ⎜ x6 ⎟ y := W0,1 x = ⎜ ⎟ ⎝ ⎠ = ⎜ ⎟ ⎜ ....⎟ x4 ⎜ . ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ....⎟ x6 ⎜ . ⎟ ⎝ ....⎠ ⎝ . ⎠ .... . with the zeros represented by dots.

4.2.2 Algebraic Properties of Read and Write Operations Read and write operations have the following algebraic properties. When using a basic read operation to obtain an intermediate subvector and then again using a read operation to obtain the final subvector, this operation can be expressed by a single basic read operation. 4.3 Kronecker Products 57

Property 4.4 (Read Multiplicativity)

N1N2n,n N2n,n N1N2n,N2n . Rb1+b2s1,s1s2 =Rb2,s2 Rb1,s1

The same applies to two consecutive basic write operations.

Property 4.5 (Write Multiplicativity)

N1N2n,n N1N2n,N2n N2n,n . Wb1+b2s1,s1s2 =Wb1,s1 Wb2,s2

Multiplication of read and write matrices yields an identity matrix.

Property 4.6 (Read Write Identity)

mn,n mn,n In =Rb,s Wb,s .

Property 4.7 (Read Write Identity)

mn,n mn,n Imn =Wb,s Rb,s .

Transposition of read matrices yields write matrices.

Property 4.8 (Read Write Transposition)

mn,n mn,n (Rb,s ) =Rb,s ,

mn,n mn,n (Wb,s ) =Rb,s .

4.3 Kronecker Products

Definition 4.6 (Kronecker or Tensor Product) The Kronecker product (tensor product) of the matrices A ∈ CM1×N1 and B ∈ CM2×N2 is the block structured matrix ⎛ ⎞ a0,0B ... a0,N −1B ⎜ 1 ⎟ . . . M1M2×N1N2 A ⊗ B := ⎝ . .. . ⎠ ∈ C .

aM1−1,0B ... aM1−1,N1−1B

Definition 4.7 (Tensor Basis) Set N = N1N2 and form the set of tensor prod- ucts N1 N2 ei ⊗ ej ,i=0, 1,...,N1 − 1,j=0, 1,...,N2 − 1. (4.2) This set is called tensor basis. 58 4. The Mathematical Framework

N Since any element ek of the standard basis (4.1) can be expressed as eN eN1 ⊗ eN2 ,i , ,...,N − ,j , ,...,N − , j+iN2 = i j =0 1 1 1 =0 1 2 1 the tensor basis of Definition 4.7 ordered by choosing j to be the fastest running parameter is the standard basis of CN . In particular, the set of tensor products of the form xN1 ⊗ yN2 N spans C , N = N1N2. The following two special cases of Kronecker products involving identity matrices are of high importance. Definition 4.8 (Parallel Kronecker Products) Let A ∈ Cm×n be an arbi- k×k trary matrix and let Ik ∈ C be the identity matrix. The expression ⎛ ⎞ A 0 ⎜ ⎟ ⎜ A ⎟ Ik ⊗A = ⎜ ⎟ ⎝ ... ⎠ 0 A is called parallel Kronecker product.

A parallel Kronecker product can be viewed as a parallel operation. Its action on a vector x = x0 ⊕ x1 ⊕···⊕xk−1 can be performed by computing the action of A on each of the k consecutive segments xi of x independently. 2×2 Example 4.10 (Parallel Kronecker Product) Let A2 ∈ C be an arbitrary matrix and 3×3 let I3 ∈ C be the identity matrix. Then

y := (I3 ⊗A2)x is given by ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ y0 a0,0 a0,1 x0 ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ y1 ⎟ ⎜ a1,0 a1,1 ⎟ ⎜ x1 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ y2 ⎟ ⎜ a0,0 a0,1 ⎟ ⎜ x2 ⎟ ⎜ ⎟ := ⎜ ⎟ ⎜ ⎟ . ⎜ y3 ⎟ ⎜ a1,0 a1,1 ⎟ ⎜ x3 ⎟ ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ y4 a0,0 a0,1 x4 y5 a1,0 a1,1 x5

This matrix-vector product can be realized by splitting up the input vector x ∈ C6 into three subvectors of length 2 and performing the respective matrix-vector products y0 a0,0 a0,1 x0 := y1 a1,0 a1,1 x1 y2 a0,0 a0,1 x2 := y3 a1,0 a1,1 x3 y4 a0,0 a0,1 x4 := y5 a1,0 a1,1 x5 independently. 4.3 Kronecker Products 59

Definition 4.9 (Vector Kronecker Products) Let A ∈ Cm×n be an arbi- k×k trary matrix and let Ik ∈ C be the identity matrix. The expression ⎛ ⎞ a0,0 Ik ··· a0,n−1 Ik ⎜ . . . ⎟ A ⊗ Ik = ⎝ . .. . ⎠ am−1,0 Ik ··· am−1,n−1 Ik is called vector Kronecker product.

A vector Kronecker product can be viewed as a vector operation. To compute its action on a vector x = x0 ⊕ x1 ⊕···⊕xn−1,then vector operations

ar,0x0 + ar,1x1 + ···+ ar,n−1xn−1,r=0, 1,...,m− 1 are performed. Expressions of the form A ⊗ Ik are called vector operations as the operate on vectors of size k.

2×2 Example 4.11 (Vector Kronecker Product) Let A2 ∈ C be an arbitrary matrix and 3×3 let I3 ∈ C be the identity matrix. Then

y := (A2 ⊗ I3)x is given by ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ y0 a0,0 a0,1 x0 ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ y1 ⎟ ⎜ a0,0 a0,1 ⎟ ⎜ x1 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ y2 ⎟ ⎜ a0,0 a0,1 ⎟ ⎜ x2 ⎟ ⎜ ⎟ := ⎜ ⎟ ⎜ ⎟ . ⎜ y3 ⎟ ⎜ a1,0 a1,1 ⎟ ⎜ x3 ⎟ ⎝ y4 ⎠ ⎝ a1,0 a1,1 ⎠ ⎝ x4 ⎠ y5 a1,0 a1,1 x5

This matrix-vector product can be computed by splitting up the vector x ∈ C6 into two sub- vectors of length 3 and performing single scalar multiplications with these subvectors: ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ y0 x0 x3 ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ y1 := a0,0 x1 + a0,1 x4 y2 x2 x5 ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ y3 x0 x3 ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ y4 := a1,0 x1 + a1,1 x4 . y5 x2 x5

4.3.1 Algebraic Properties of Kronecker Products Most of the following Kronecker product identities can be demonstrated to hold by computing the action of both sides on the tensor basis given by Definition 4.7.

Property 4.9 (Identity) If Im and In are identity matrices, then

Im ⊗ In =Imn . 60 4. The Mathematical Framework

Property 4.10 (Identity) If Im and In are identity matrices, then

Im ⊕ In =Im+n .

Property 4.11 (Associativity) If A, B, C are arbitrary matrices, then

(A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).

Thus, the expression A ⊗ B ⊗ C is unambiguous. Property 4.12 (Transposition) If A, B are arbitrary matrices, then

(A ⊗ B) = A ⊗ B.

Property 4.13 (Inversion) If A and B are regular matrices, then

(A ⊗ B)−1 = A−1 ⊗ B−1.

Property 4.14 (Mixed-Product Property) If A, B, C and D are arbitrary matrices, then (A ⊗ B)(C ⊗ D)=AC⊗ BD, provided the products AC and BD are defined.

A consequence of this property is the following factorization.

Corollary 4.1 (Decomposition) If A ∈ Cm1×n1 and B ∈ Cm2×n2 ,then

A ⊗ B = A In1 ⊗ Im2 B =(A ⊗ Im2 )(In1 ⊗B),

A ⊗ B =Im1 A ⊗ B In2 =(Im1 ⊗B)(A ⊗ In2 ).

The mixed-product property can be generalized in two different ways. Corollary 4.2 (Generalized Mixed-Product Property) For k matrices of appropriate sizes it holds that

(A1 ⊗ A2 ⊗···⊗Ak)(B1 ⊗ B2 ⊗···⊗Bk)=A1B1 ⊗ A2B2 ···⊗AkBk, and

(A1 ⊗ B1)(A2 ⊗ B2) ···(Ak ⊗ Bk)=(A1A2 ···Ak) ⊗ (B1B2 ···Bk).

Property 4.15 (Distributive Law) If A, B,andC are arbitrary matrices, then

(A + B) ⊗ C =(A ⊗ C)+(B ⊗ C), A ⊗ (B + C)=(A ⊗ B)+(A ⊗ C). 4.4 Stride Permutations 61

The Kronecker product is not commutative. This non-commutativity is mainly responsible for the richness of the Kronecker product algebra, and naturally leads to a distinguished class of permutations, the stride permutations. An important consequence of this lack of commutativity can be seen in the relationship between Kronecker products and direct sums of matrices. Property 4.16 (Left Distributive Law) It holds that

(A ⊕ B) ⊗ C =(A ⊗ C) ⊕ (B ⊗ C).

The right distributive law does not hold.

4.4 Stride Permutations

Definition 4.10 (Stride Permutation) For a vector x ∈ Cmn with mn−1 mn mn n m x = xkek with ek = ei ⊗ ej , and xk ∈ C, k=0 mn the stride permutation Ln is defined by its action on the tensor basis (4.2) of Cmn. mn n m m n Ln (ei ⊗ ej )=ej ⊗ ei .

mn The permutation operator Ln sorts the components of x according to their index modulo n. Thus, components with indices equal to 0 mod n come first, followed by the components with indices equal to 1 mod n, and so on. Corollary 4.3 (Stride Permutation) For a vector x ∈ Cmn the application of mn the stride permutation Ln results in ⎛ ⎞ x(0 : (m − 1)n : n) ⎜ x m − n n ⎟ mn ⎜ (1 : ( 1) +1: ) ⎟ Ln x := ⎜ . ⎟ . ⎝ . ⎠ x(n − 1:mn − 1:n)

n Definition 4.11 (Even-Odd Sort Permutation) The permutation L2 , n be- ing even, is called an even-odd sort permutation, because it groups the even- indexed and odd-indexed components together.

n Definition 4.12 (Perfect Shuffle Permutation) The permutation Ln/2, n being even, is called a perfect shuffle permutation, since its action on a deck of cards could be the shuffling of two equal piles of cards so that the cards are interleaved one from each pile.

n The perfect shuffle permutation Ln/2 is denoted in short by Πn. 62 4. The Mathematical Framework

Mixed Kronecker Products In this thesis combinations of tensor products and stride permutations are very important. These constructs have both vector and parallel characteristics like stride permutations and additionally feature arithmetic operations like parallel and vector Kronecker products. The factorization of these constructs is a major issue in this thesis and leads to the short vector Cooley-Tukey FFT.

Definition 4.13 (Right Mixed Kronecker Product) Let A ∈ Cm×n be an k×k km arbitrary matrix, Ik ∈ C be the identity matrix, and Lk be a stride permu- tation. An expression of the form

mk (Ik ⊗A)Lk is called right mixed Kronecker product.

Definition 4.14 (Left Mixed Kronecker Product) Let A ∈ Cm×n be an ar- k×k mk bitrary matrix, Ik ∈ C be the identity matrix, and Lk be a stride permutation. An expression of the form mk Lk (A ⊗ Ik) is called left mixed Kronecker product.

4.4.1 Algebraic Properties of Stride Permutations Property 4.17 (Identity)

n n L1 =Ln =In

Property 4.18 (Inversion/Transposition) If N = mn the

mn −1 mn mn (Lm ) =(Lm ) =Ln .

Property 4.19 (Multiplication) If N = kmn then

kmn kmn kmn kmn kmn Lk Lm =Lm Lk =Lkm .

Example 4.12 (Inversion of the Perfect Shuffle Permutation) The inverse matrix of 2i L2 is given by the perfect shuffle permutation:

2i −1 2i (L2 ) =L2i−1 =Π2i .

As already mentioned, the Kronecker product is not commutative. However, with the aid of stride permutations, the order of factors can be reverted. 4.4 Stride Permutations 63

Theorem 4.1 (Commutation) If A ∈ Cm1×n1 and B ∈ Cm2×n2 then

m1m2 A ⊗ B B ⊗ A n1n2 . Lm1 ( )=( )Ln2 Proof : Johnson et al. [60].

Several special cases are worth noting. Corollary 4.4 If A ∈ Cm×m and B ∈ Cn×n then

mn mn mn Lm A ⊗ B =Lm (B ⊗ A)Ln =(B ⊗ A) .

Application of this relation leads to

mn mn mn Ln Im ⊗B =Lm (B ⊗ Im)Ln =(B ⊗ Im) , mn mn mn Ln A ⊗ In =Lm (In ⊗A)Ln =(In ⊗A) .

Stride permutations interchange parallel and vector Kronecker factors. The read- mn mn dressing prescribed by Ln on input and Lm on output turns the vector Kro- necker factor A ⊗ In into the parallel Kronecker factor In ⊗A and the parallel Kronecker factor Im ⊗B into the vector Kronecker factor B ⊗Im. Continuing this way,itispossibletowrite

A ⊗ B =(A ⊗ In)(Im ⊗B) mn mn =Lm (In ⊗A)Ln (Im ⊗B), (4.3) which can be used to compute the action of A ⊗ B as a sequence of two parallel Kronecker factors. It also holds that

mn mn A ⊗ B =(A ⊗ In)Lm (B ⊗ Im)Ln , (4.4) which can be used to compute the action of A ⊗ B as a sequence of two vector Kronecker factors. The stride permutations intervene between computational stages, providing a mathematical language for describing the readdressing. Occasionally it will be necessary to permute the factors in a tensor product of more than two factors. Frequently used properties which can be traced back to those before are stated in the following. Property 4.20 If A ∈ Cm×m and B ∈ Cn×n then

nm mn A ⊗ B =Lm (In ⊗A)Ln (Im ⊗B).

Property 4.21 If N = kmn then

kmn kn mn Ln =(Ln ⊗ Im)(Ik ⊗ Ln ). 64 4. The Mathematical Framework

Property 4.22 If N = kmn then

kmn mn kn Lkm =(Ik ⊗ Lm )(Lk ⊗ Im).

Property 4.23 If N = kmn then

km kn kmn (Lm ⊗ In)=(Im ⊗ Lk )Lmn .

Proof: Using Properties 4.18 and 4.22 lead to

km km km km kn kmn (Lm ⊗ In)=(Im ⊗ Lk )(Im ⊗ Lm )(Lm ⊗ In)=(Im ⊗ Lk )Lmn . 

Property 4.24 If N = kmn then

km mn kmn kn mn (Lm ⊗ In)(Ik ⊗ Lm )Lk =(Im ⊗ Lk )(Lm ⊗ Ik) .

Proof: Using Property 4.23 leads to

kn kmn mn kmn kn mn (Im ⊗ Lk )Lmn (Ik ⊗ Lm )Lk =(Im ⊗ Lk )(Lm ⊗ Ik).  The following two properties show, how the mixed Kronecker product can be decomposed. Property 4.25 shows the more general case and Property 4.26 shows the full factorization.

Property 4.25 skn ms2 kn (Isk ⊗Ams×n)Lsk = Ik ⊗ Ls (Ams×n ⊗ Is) Lk ⊗ Is

Property 4.26 skn ms s2 kn (Isk ⊗Ams×n)Lsk = Ik ⊗ (Ls ⊗ Is) Im ⊗ Ls (Ams×n ⊗ Is) Lk ⊗ Is

4.4.2 Digit Permutations The following permutation generalizes the stride permutation.

Definition 4.15 (Digit Permutation) Let N = N1N2 ···Nk and let σ be a permutation of the numbers 1, 2,...,k. Then the digit permutation is defined by

N N (N1,...,Nk) eN1 ⊗···⊗eNk e σ(1) ⊗···⊗e σ(k) . Lσ ( i1 ik )=( iσ(1) iσ(k) )

Theorem 4.2 (Permutation) Let A0, A1,...,Ak be Ni × Ni matrices and let σ be a permutation of the numbers 1, 2,...,k,then

(N1,...,Nk) −1 (N1,...,Nk) A1 ⊗···⊗Ak =(Lσ ) (Aσ(1) ⊗···⊗Aσ(k))Lσ . 4.5 Twiddle Factors and Diagonal Matrices 65

Proof : Johnson et al. [60].

Digit reversal is a special permutation arising in FFT algorithms.

Definition 4.16 (Digit Reversal Matrix) The k-digit digit reversal permu- tation matrix R(N1,N2,...,Nk) of size N = N1N2 ···Nk is defined by

(N1,...,Nk) eN1 ⊗···⊗eNk eNk ⊗···⊗eN1 . R ( i1 ik )= ik i1

The special case when N1 = N2 = ···= Nk = p is denoted by Rpk .

Theorem 4.3 The digit reversal matrix Rpk satisfies recursion

k pi Rpk = (Ipk−i ⊗ Lp ). i=2 Proof : Johnson et al. [60].

4.5 Twiddle Factors and Diagonal Matrices

An important class of matrices arising in FFT factorizations are diagonal matrices whose diagonal elements are roots of unity. Such matrices are called twiddle factor matrices. This section collects useful properties of diagonal matrices, especially twiddle factor matrices.

2πi/N Definition 4.17 (Twiddle Factor Matrix) Let ωN = e denote the Nth mn root of unity. The twiddle factor matrix, denoted by Tm , is a diagonal matrix defined by

mn m n ij m n Tm (ei ⊗ ej )=ωmn(ei ⊗ ej ),i=0, 1,...,m− 1,j=0, 1,...,n− 1,

m−1 n−1 m−1 mn ij Tm = ωmn = Ωn,i(ωmn), i=0 j=0 i=0

n−1 k where Ωn,k(α)=diag(1,α,...,α ) .

The following corollary shows how to conjugate diagonal matrices with a permu- tation matrix. It holds for all diagonal matrices, but is particularly useful when calculating twiddle factors in FFT algorithms. 66 4. The Mathematical Framework

Corollary 4.5 (Conjugating Diagonal Matrices) Let

D =diag(d0,d1,...,dN−1) be an arbitrary N ×N diagonal matrix and Pσ the permutation matrix according to the permutation σ of (0, 1,...,N − 1). Conjugating D by Pσ results in a new diagonal matrix whose diagonal elements are permuted by σ,i.e.,

N−1 Pσ −1 D = Pσ DPσ =diag(dσ(0),dσ(1),...,dσ(N−1))= dσ(i). i=0

mn mn Corollary 4.6 (Conjugating Twiddle Factors) Conjugating Tm by Lm re- mn sults in Tn ,i.e., mn mn Lm mn (Tm ) =Tn .

Tensor bases are a useful tool to compute the actual entries of conjugated twiddle factor matrices.

mn Property 4.27 (Twiddle Factor Ir ⊗ Tm )

mn r m n jk r m n (Ir ⊗ Tm )(ei ⊗ ej ⊗ ek )=ωmn(ei ⊗ ej ⊗ ek ),

r−1 m−1 n−1 r−1 m−1 mn jk Ir ⊗ Tm = ωmn = Ωn,j(ωmn) i=0 j=0 k=0 i=0 j=0

mn P (Ir ⊗ Tm ) is the form of twiddle factor matrices as found in FFT algorithms. The following example shows how to compute with twiddle factors in this form.

Example 4.13 (Conjugation of Twiddle Factors) Consider the construct

32 8 L8 32 8 32 (I4 ⊗ T4) =L4 (I4 ⊗ T4)L8 .

8 32 Thus, I4 ⊗ T4 is conjugated by L8 . Computation of the result yields

32 8 32 4 4 2 32 8 4 2 4 (L4 (I4 ⊗ T4)L8 )(ei ⊗ ej ⊗ ek)=(L4 (I4 ⊗ T4))(ej ⊗ ek ⊗ ei ) 32 ki 4 2 4 =L4 ω8 (ej ⊗ ek ⊗ ei ) ki 4 4 2 = ω8 (ei ⊗ ej ⊗ ek)

3 3 1 3 3 32 8 32 ik L4 (I4 ⊗ T4)L8 = ω8 = Ω2,j(ω8) i=0 j=0 k=0 i=0 j=0

2 3 2 3 = diag(1, 1, 1,ω8, 1,ω8, 1,ω8, 1, 1, 1,ω8, 1,ω8, 1,ω8, 2 3 2 3 1, 1, 1,ω8, 1,ω8, 1,ω8, 1, 1, 1,ω8, 1,ω8, 1,ω8). 4.6 Complex Arithmetic as Real Arithmetic 67 4.6 Complex Arithmetic as Real Arithmetic

The previous sections dealt with matrices featuring complex entries. In this thesis, most operations are considered to be real, as conventional computer hardware can only handle real numbers directly. Thus A ∈ Cm×n has to be transformed into A ∈ R2m×2n. This is especially required for vector instructions, as there alignment and stride issues are crucial. Vector instructions provide only real arithmetic of vectors with their elements stored contiguously in memory. Thus, to map formulas to vector code, complex formulas and matrices are to be translated into real ones. The most commonly used data format is the interleaved complex format (alternately real and imaginary part). It will be expressed formally as a mapping of formulas. The fact that the complex multiplication (u + iv) × (y + iz) is equivalent to the real multiplication u −v y vu z will be used. Thus, the complex matrix-vector multiplication Mx∈ Cn corre- sponds to the real operation M x˜ ∈ R2n,whereM arises from M by replacing every entry u + iv by the corresponding (2 × 2)-matrix above, andx ˜ is in inter- leaved complex format. To evaluate the bar operator ( ) of a formula, a set of identities is needed which will be described in the next section. Other complex data formats can also be expressed in this way. For instance, the split complex format (where a complex vector is stored as a vector of real parts followed by a vector of imaginary parts) can be expressed by an operator ()with

L2N A = A N .

4.6.1 Algebraic Properties of the Bar Operator In this thesis the machine parameter ν denotes the vector length of a short vector SIMD architecture (see Chapter 3). Property 4.28 (Distribution) For a matrix product the bar operator dis- tributes over the factors, i. e., AB = A B.

Property 4.29 (Real Matrix Property) If A is a real matrix, then

A = A ⊗ I2 .

Property 4.30 (Parallel Kronecker Product) In a parallel Kronecker prod- uct, the bar operator can be removed from the identity matrix:

In ⊗Am =In ⊗Am. 68 4. The Mathematical Framework

The distribution of the bar operator over vector Kronecker products leaves a degree of freedom, which is crucial for the generation of fast vector code. The next three properties show this degree of freedom. Property 4.31 (General Vector Kronecker Product) In a vector Kro- necker product, the bar operator can be removed from the identity matrix:

2n Im ⊗ L Am ⊗ In =(Am ⊗ In)( 2 ).

Property 4.32 (Vector Kronecker Product) The bar operator can be re- moved from the identity matrix:

2ν Im ⊗ L Am ⊗ Iν =(Am ⊗ Iν)( 2 ).

Property 4.33 (Vector Kronecker Product) For n being a multiple of ν, the bar operator can be partly removed from the identity matrix: 2ν I mn ⊗ L2 Am ⊗ n Am ⊗ n ⊗ ν ν . I =( I ν I )

Vector Diagonals The application of the bar operator to complex diagonals does not feature the same structure as vector Kronecker products. However, this vector Kronecker structure is required by the target hardware. Applying the correct conjugation, the vector Kronecker structure can be obtained. Definition 4.18 (The Bar-Prime Operator) For a complex matrix

D =diag(c0,...,cν−1)

 with ci = ai + ibi, the bar-prime operator () is defined by ⎛ ⎞ a0 −b0 ⎜ ⎟ ⎜ ...... ⎟ ⎜ ⎟  L2ν ⎜ a −b ⎟ D D ν ⎜ ν−1 ν−1 ⎟ . := = ⎜ b a ⎟ ⎜ 0 0 ⎟ ⎝ ...... ⎠ bν−1 aν−1

 The construct D has the same structure as

A ⊗ Iν with A ∈ R2×2—only the actual numbers vary—and thus is a vector operation with vector length ν. 4.7 Kronecker Product Code Generation 69

Definition 4.19 (Vector Diagonals) For n being a multiple of ν,thecomplex diagonal D =diag(c0,...,cn−1) ν×ν with ci = ai + ibi is divided into n/ν diagonal matrices Di ∈ C :

n −1 ν D = Di. i=0

 The application of the the bar-prime operator () is defined by

n −1 ν 2ν   I n ⊗ Lν D := Di = D ν . (4.5) i=0

A construct matching (4.5) is called vector diagonal.

 The construct D has the same structure as

n ⊗A ⊗ ν (I ν ) I with A ∈ R2×2—(only the actual numbers vary), and thus is again a vector operation with vector length ν.

Example 4.14 (Twiddle Matrix) To implement the complex twiddle diagonal

mn Tn it is divided into subdiagonals Di of length ν and Definition 4.19 is applied:

mn −1 2ν ν mn mn I mn ⊗ L ν ν T n := Tn = Di,ν| mn. (4.6) i=0

4.7 Kronecker Product Code Generation

Kronecker products and direct sums have a natural interpretation as programs. A Kronecker product formula that constructs a matrix M can be interpreted as the operation y := Mx with suitable vectors x and y. The formula accounts for the fact that M is a structured matrix whose structure is used to compute the matrix-vector product efficiently. In the following, algorithms are given for important constructs occur- ring in formulas for discrete linear transforms. Details can be found in Johnson et al. [60] and Moura et al. [72]. 70 4. The Mathematical Framework

The RW Notation To enable a direct translation of the RW notation into programs, a special sub- program for implementing the formula n N,k y := Wb(i),s x, (4.7) i=0 whose base address b(i) is a function of the loop iteration, is required. Whenever this construct is used in this thesis, b(i) has the following property. Property 4.34 (Loop Independence) A sum of type (4.7) can be translated into an loop with independent iterations, if for any r ∈{0, 1,...,N − 1} exactly one iteration i ∈{0, 1,...,n} and a unique offset j ∈ N exists such that r = b(i)+js.

Thus, the sum in (4.7) acts as loop.Thekth component of y is computed by a sum where exactly one summand is not zero, which follows from Property 4.34. By storing the respective kth component of the intermediate result N,k Wb(i),s x (where the component is known to be non-zero) in the respective loop iteration i into the kth component of y these additions can be omitted. A single iteration i of (4.7) can be implemented using the update function. N,k For simplicity set b := b(i). All nonzero entries of the vector Wb,s x are stored into the respective entries of y while all other entries of y remain unchanged, i. e., only entries with indices b + js are copied. N,k Algorithm 4.1 (update(y, Wb,s x)) do i =0,N − 1 do j =0,k − 1 N,k if Wb,s (i, j)=1then y(i):=x(j) end do end do

Using the update function given by Algorithm 4.1, equation (4.7) can be imple- mented as a loop using the following algorithm. n N,k Algorithm 4.2 (y := Wb(i),s x) i=0 do i =0,n N,k update(y, Wb(i),s x) end do

The implementation of sums as loops by Algorithm 4.2 is used throughout this section to obtain implementations of tensor products and stride permutations. 4.7 Kronecker Product Code Generation 71

Direct Sums of Matrices

N M M×N n0 n1 If x ∈ C , y ∈ C , A ∈ C with x = x0 ⊕ x1 with x0 ∈ C , x1 ∈ C , m0 m1 m0×n0 y = y0 ⊕ y1 with y0 ∈ C , y1 ∈ C ,andA = A0 ⊕ A1 with A0 ∈ C , m1×n1 A1 ∈ C then the following algorithm can be used for computing y := Ax.

Algorithm 4.3 (y := (A0 ⊕ A1)x) y0 := A0 x0 y1 := A1 x1

Corollary 4.7 (Direct Sum) A direct sum

m×n A0 ⊕ A1 ⊕···⊕Ak−1 with Ai ∈ C

N,n N,m can be written as a sum using the read and write operators Rb,s and Wb,s :

k−1 mk,m nk,n (A0 ⊕ A1 ⊕···⊕Ak−1) x = Wim,1 Ai Rin,1 x. i=0

Applying Algorithm 4.2 leads to the following algorithm. k−1 Algorithm 4.4 (y := ( Ai)x) i=0 do i =0,k − 1 nk,n t := Rin,1 x  t := Ai t mk,m  update(y, Wim,1 ,t) end do

Kronecker products with identity matrices represent loops in a natural way. Ac- cording to Corollary 4.1 (on page 60) a general Kronecker product can be ex- pressed as two Kronecker products with identity matrices. Thus, any Kronecker product can be factored into a product of a parallel Kronecker factor and a vec- tor Kronecker factor. The implementation of these special Kronecker products is summarized in the following two sections.

Parallel Kronecker Products Parallel Kronecker products as given by Definition 4.8 can be translated into a loop of independent operations on blocks. The iterations of these loops are independent and can be computed in parallel. Corollary 4.8 (Parallel Kronecker Products) A parallel Kronecker product

m×n Ik ⊗A with A ∈ C 72 4. The Mathematical Framework

N,n N,n can be written as a sum using the read and write operators Rb,s and Wb,s : k−1 mk,m nk,n (Ik ⊗A) x = Wim,1 A Rin,1 x. i=0 Applying Algorithm 4.2 leads to the following algorithm.

Algorithm 4.5 (y := (Ik ⊗ A) x) do i =0,k − 1 nk,n t := Rin,1 x t := At mk,m  update(y, Wim,1 ,t) end do

Vector Kronecker Products Vector Kronecker products as given by Definition 4.9 can be translated into oper- ations on vectors. These operations on vectors can be implemented using a loop where the respective elements in consecutive loop iterations are accessed at unit stride. This is achieved by b(i)=i. Such loops are called vectorizable and can be implemented efficiently using vector hardware. Corollary 4.9 (Vector Kronecker Product) A vector Kronecker product m×n A ⊗ Ik with A ∈ C N,n N,n can be written as sum using the read and write operators Rb,s and Wb,s : k−1 mk,m nk,n (A ⊗ Ik) x = Wi,k A Ri,k x. i=0 Applying Algorithm 4.2 leads to the following algorithm.

Algorithm 4.6 (y := (A ⊗ Ik)x) do i =0,k − 1 nk,n t := Ri,k x t := At mk,m  update(y, Wi,k ,t) end do

Stride Permutations Stride permutations can be translated into two different algorithms. Either (i) vector reads and parallel writes are used, or (ii) parallel reads and vector writes. Thus, a stride permutation features both vector and parallel character- istics. This leads to difficulties for implementations on both vector and parallel computers. 4.7 Kronecker Product Code Generation 73

mn Corollary 4.10 (Stride Permutation) A stride permutation Lm can be N,n N,n written as a sum using the read and write operators Rb,s and Wb,s :

m−1 mn mn,n mn,n Lm x = Win,1 Ri,m x. i=0

Applying Algorithm 4.2 leads to the following algorithm featuring vector reads and parallel writes.

mn Algorithm 4.7 (y := Lm x) do i =0,m − 1 mn,n t := Ri,m x mn,n update(y, Win,1 ,t) end do

An alternative implementation is given by the following corollary and the respec- tive algorithm.

mn Corollary 4.11 (Stride Permutation) A stride permutation Lm can be writ- N,n N,n ten as a sum using the read and write operators Rb,s and Wb,s :

n−1 mn mn,m mn,m Lm x = Wi,n Rim,1 x. i=0

Applying Algorithm 4.2 leads to the following algorithm featuring parallel reads and vector writes.

mn Algorithm 4.8 (y := Lm x) do i =0,n − 1 mn,m t := Rim,1 x mn,m update(y, Wi,n ,t) end do

A comparison between Algorithms 4.7 and 4.8 and Algorithms 4.5 and 4.6 shows the connection between parallel and vector Kronecker products and stride per- mutations.

Mixed Kronecker Products Both left and right mixed Kronecker products can be translated into sums utiliz- ing the methods developed for parallel and vector Kronecker products as well as for stride permutations. These sums can subsequently translated into algorithms for computing the application of mixed tensor products. 74 4. The Mathematical Framework

Left mixed Kronecker products and right mixed Kronecker products are equiv- alent, as the application of Corollary 4.4 (on page 63) shows that mk mk (Ik ⊗A)Lk =Lk (A ⊗ Ik). The following corollary expresses mixed Kronecker products (both left and the respective right mixed Kronecker products) as sums using the RW notation. Corollary 4.12 (Mixed Kronecker Product) Left and right mixed Kro- necker products can can be written as sums using the read and write operators N,n N,n Rb,s and Wb,s : k−1 mk mk mk,m mk,m (Ik ⊗A)Lk x =Lk (A ⊗ Ik) x = Wim,1 A Ri,k x. i=0 Applying Algorithm 4.2 leads to the following algorithm. mk Algorithm 4.9 (y := (Ik ⊗ A)Lk x) do i =0,k − 1 mk,m t := Ri,k x t := At mk,m update(y, Wim,1 ,t) end do

Algorithm 4.12 has the same access pattern as Algorithm 4.7, i. e., it reads with vector characteristics and writes with parallel characteristics.

4.7.1 The SPL Compiler The Spiral system includes a formula translator called SPL compiler which translates Spiral’s proprietary signal processing language (SPL) into C or For- tran code. Spiral uses the convention introduced at the beginning of this sec- tion and interprets formulas as matrix-vector products with structured matrices. Thus, all SPL programs can be interpreted as programs computing the corre- sponding matrix-vector products x → Mx. Figure 4.1 shows an example SPL program and Appendix D.1 contains both an SPL program and the respective C program generated by the SPL compiler.

SPL Constructs SPL is used for a computer representation of discrete linear transform algorithms given as formulas. The following are the most important SPL constructs. General Matrices. Three types of general matrices, supported by SPL, are most important in the context of this thesis: (i) generic matrices, (ii) generic diagonal matrices, and (iii) generic permutations. The respective SPL constructs are the following: 4.7 Kronecker Product Code Generation 75

(compose (tensor (F 2) (I 2) ) (T 4 2) (tensor (I 2) (F 2) ) (L 4 2) ) 4 4 Figure 4.1: An algorithm for DFT4 that computes y = (DFT2 ⊗ I2)T2(I2 ⊗ DFT2)L2 x writ- ten as an SPL program.

(matrix ((a11 ... a1n) ... (am1 ... amn)) ; generic matrix, (diagonal (a11 ... ann)) ; generic diagonal matrix, (permutation (k1 ... kn)) ; generic permutation matrix.

Parameterized matrices. SPL supports all parameterized matrices that are required in the context of this thesis. Examples include (i) identity matrices In,(ii) DFT matrices DFTn which are no further decomposed and denoted by mn mn Fn,(iii) stride permutations Ln ,and(iv) twiddle factor matrices Tn .The respective SPL constructs are:

(I n) ; identity matrix, (F n) ; discrete Fourier transform matrix, (L mn n) ; stride permutation matrix, (T mn n) ; twiddle factor matrix.

Matrix operations. Various matrix operations are supported by SPL, including (i) the matrix product, (ii) the Kronecker product, and (iii) the direct sum. The respective SPL constructs are:

(compose A1 ... An) ; matrix product, (tensor A1 ... An) ; Kronecker product, (direct_sum A1 ... An) ; direct sum.

In addition to matrix constructs, SPL provides tags to control the SPL com- pilation process. For example, there are tags available to control the unrolling strategy or the datatype (real versus complex) to be used (Xiong et al. [95]). For example, the DFT4 algorithm given in equation (5.1) on page 78 can be written in SPL as represented in Figure 4.1. 76 4. The Mathematical Framework

SPL Compilation The SPL compiler translates SPL formulas into optimized C or Fortran code. The code is produced using standard optimization techniques and domain spe- cific optimizations. Some optimizations like loop unrolling can be parameterized to allow Spiral’s search module (see Figure 1.2 on page 22) to try different implementations of the same formula. Stage 1. In the first stage, the SPL code is parsed and translated into a bi- nary abstract syntax tree. The internal nodes represent operators like tensor, compose,ordirect sum. The leave nodes represent SPL primitives like (I n), (T mn n),and(L mn n). Within this stage, not only the SPL program is parsed, but also the definitions of the supported operators, primitives, symbols, and optimization techniques are loaded. These definitions are part of SPL and can be extended by the user. All constructs used in leaves of abstract syntax trees are symbols.Asymbolis a named abstract syntax tree, which is translated into a function call to compute this part of the formula. Stage 2. In the next stage the abstract syntax tree is translated into an internal serial code representation (called i-code) using pattern matching against built-in templates. For example, any SPL formula matching the template

(tensor (I any) ANY) is translated into a loop of the second argument of tensor. any is a wildcard for an integer (and the number of loop iterations), while ANY matches any SPL sub- formula. The template mechanism is also used to apply important optimizations for constructs like

(compose (tensor (I any) ANY) (T any any)).

In the optimization stage techniques like (partial) loop unrolling, dead code elim- ination, constant folding, and constant propagation are applied to improve the i-code. Special attention is paid to the use of temporary variables within the gen- erated code. Optimizations are applied to minimize the dependencies between variables and, if possible, scalars are used instead of arrays. Stage 3. In the last stage, the optimized i-code is unparsed to the target lan- guage C or Fortran. Various methods to handle constants and intrinsic functions are available. Chapter 5 Fast Algorithms for Linear Transforms

This chapter defines discrete linear transforms, fast algorithms for discrete linear transforms, and summarizes the approach used in this thesis. The approach is based on Kronecker product factorizations of transform matrices and on recursive factorization rules. It mainly follows the methodology introduced by the Spiral team, but extends the way complex transforms are described and introduces a new way of describing Fftw’s recursion in this context.

5.1 Discrete Linear Transforms

This section defines discrete linear transforms as a foundation for the specific discussion of fast Fourier transform algorithms in the next section. In this thesis, all discrete linear transforms are considered. The general method is demonstrated using the two-dimensional cosine transform and the Walsh-Hadamard transform as examples. However, the main focus is on the discrete Fourier transform and its fast algorithms based on the Cooley-Tukey recursion. Discrete linear transforms are represented by real or complex valued matrices and their application means to calculate a matrix-vector product. Thus, they express a base change in the vector space of sampled data.

Definition 5.1 (Real Discrete Linear Transform) Let M ∈ Rm×n, x ∈ Rn, and y ∈ Rm. The real linear transform M of x is the matrix-vector multiplication

y = Mx.

Examples include the Walsh-Hadamard transform and the sine and cosine trans- forms.

Definition 5.2 (Complex Discrete Linear Transform) Let M ∈ Cm×n, x ∈ Cn,andy ∈ Cm. The complex linear transform M of x is again given by the matrix-vector multiplication

y = Mx.

A particularly important example and the main focus in this thesis is the discrete Fourier transform (DFT), which, for size N, is given by the following definition.

77 78 5. Fast Algorithms for Linear Transforms

Definition 5.3 (Discrete Fourier√ Transform Matrix) The matrix DFTN is defined for any N ∈ N with i = −1by 2πik/N DFTN = e | k,  =0, 1,...,N − 1 . k 2πik/N The values ωN = e are called twiddle factors. Example 5.1 (DFT Matrix) The first five DFT matrices are ⎛ ⎞ 111 11 ⎝ −2πi/3 −4πi/3 ⎠ DFT1 = (1), DFT2 = , DFT3 = 1 e e 1 −1 1 e−4πi/3 e−2πi/3 ⎛ ⎞ ⎛ ⎞ 11111 1111 ⎜ −2πi/5 −4πi/5 −6πi/5 −8πi/5 ⎟ ⎜ ⎟ ⎜ 1 e e e e ⎟ ⎜ 1 −i −1 i ⎟ ⎜ −4πi/5 −8πi/5 −2πi/5 −6πi/5 ⎟ DFT4 = , DFT5 = 1 e e e e . ⎝ 1 −11−1 ⎠ ⎜ ⎟ ⎝ 1 e−6πi/5 e−2πi/5 e−8πi/5 e−4πi/5 ⎠ 1 i −1 −i 1 e−8πi/5 e−6πi/5 e−4πi/5 e−2πi/5

DFT4 is the largest DFT matrix having only trivial twiddle-factors, i. e., 1,i,−1, −i. Definition 5.4 (Discrete Fourier Transform) The discrete Fourier trans- form y ∈ CN of a data vector x ∈ CN is given by the matrix-vector product

y =DFTN x.

Fast Algorithms An important property of discrete linear transforms is the existence of fast algo- rithms. Typically, these algorithms reduce the complexity from O(N 2) arithmetic operations, as required by direct evaluation via matrix-vector multiplication, to O(N log N) operations. This complexity reduction guarantees their very efficient applicability for large N. Mathematically, any fast algorithm can be viewed as a factorization of the transform matrix into a product of sparse matrices. It is a specific property of discrete linear transforms that these factorizations are highly structured and can be written in a very concise way using a small number of the mathematical operators introduced in Chapter 4.

Example 5.2 (DFT4) Consider a factorization, i. e., a fast algorithm, for DFT4. Using the mathematical notation from Chapter 4 it follows that ⎛ ⎞ 1111 ⎜ ⎟ ⎜ 1 i −1 −i ⎟ DFT4 = ⎝ 1 −11−1 ⎠ 1 −i −1 i ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ 10 10 1000 1100 1000 (5.1) ⎜ 01 01⎟ ⎜ 0100⎟ ⎜ 1 −1 00⎟ ⎜ 0010⎟ = ⎜ ⎟ · ⎜ ⎟ · ⎜ ⎟ · ⎜ ⎟ ⎝ 10−10⎠ ⎝ 0010⎠ ⎝ 0011⎠ ⎝ 0100⎠ 01 0 −1 000 i 001 −1 0001

4 4 = (DFT2 ⊗ I2) · T2 · (I2 ⊗ DFT2) · L2 . 5.1 Discrete Linear Transforms 79

4 T2 denotes a twiddle matrix as defined in Section 4.5, i. e., 4 T2 = diag(1, 1, 1,i). 4 L2 denotes a stride permutation as defined in Section 4.4 which swaps the two middle elements x1 and x2 in a four-dimensional vector, i. e., ⎛ ⎞ ⎛ ⎞ x0 x0 ⎜ ⎟ ⎜ ⎟ ⎜ x2 ⎟ 4 ⎜ x1 ⎟ ⎝ ⎠ =L2 ⎝ ⎠ . x1 x3 x3 x3

Automatic Derivation of Fast Algorithms In Egner and P¨uschel [23] a method is introduced that automatically derives fast algorithms for a given transform and size. This method is based on algebraic symmetries of the transformation matrices utilized by the software package Arep (Egner and P¨ueschel [22]), a library for the computer algebra system Gap [88] used in Spiral. Arep is able to factorize transform matrices and to find fast algorithms automatically. In P¨uschel [78] an algebraic derivation of fast sine and cosine transform algorithms is described.

Recursive Rules One key element in factorizing a discrete linear transform matrix into sparse factor matrices is the application of breakdown rules or simply rules. A rule is a sparse factorization of the transform matrix and breaks down the computation of the given transform into transforms of smaller size. These smaller transforms, which can be of a different type, can be further expanded using the same or other rules. Thus rules can be applied recursively to reduce a large linear transform to a number of smaller discrete linear transforms. In breakdown rules, the transform sizes have to satisfy certain conditions which are implicitly given by the rule. Here the transform sizes are functions of some parameters which are denoted by lowercase letters. For instance, a break- down rule for DFTN , whose size N has to be a product of at least two factors (say m and n), is given by an equation for DFTmn.Insucharule,m and n are subsequently used as parameters in the right-hand side of the rule. Examples of discrete linear transforms featuring rules include the Walsh- Hadamard transform (WHT), the discrete cosine transform (DCT) used, for in- stance, in the JPEG standard (Rao and Hwang [81]), as well as the fast Fourier transform (Cooley and Tukey [13]).   In the following examples Pn, Pn,andPn denote permutation matrices, Sn denotes a bidiagonal and Dn a diagonal matrix (Wang [92]). k Example 5.3 (Walsh-Hadamard Transforms) The WHTN for N =2 is given by

k times WHTN = DFT2 ⊗ ...⊗ DFT2 . 80 5. Fast Algorithms for Linear Transforms

A particular example of a rule for this transform is

k WHT2k = I2k1+···+ki−1 ⊗ WHT2ki ⊗ I2ki+1+···+kt ,k= k1 + ···+ kt. (5.2) i=1

Example 5.4 (Discrete Cosine Transforms) The DCTN for arbitrary N is given by DCTN = cos (( +1/2)kπ/N) | k,  =0, 1,...,N − 1 .

A corresponding rule is

DCT2n = P2n (DCTn ⊕S2n DCTn D2n) P2n (In ⊗ DFT2) P2n.

Example 5.5 (Discrete Fourier Transforms) A rule for the DFTN matrix is given by

mn mn DFTmn = (DFTm ⊗ In)Tn (Im ⊗ DFTn)Lm . (5.3)

(5.3) is the Cooley-Tukey FFT written in the notation of Chapter 4 (Johnson et al. [60]). It will be discussed in more detail in Section 5.2 on page 83.

Transforms of higher dimension are also captured in this framework and naturally possess rules. For example, if M is an N ×N transform, then the corresponding two-dimensional transform is given by M ⊗ M as indicated by Corollary 4.1. Using the respective property of the tensor product, the rule

M ⊗ M =(M ⊗ IN )(IN ⊗M) (5.4) is obtained. The set of rules used in Spiral is constantly growing. A set of important rules can be found in P¨uschel et al. [79] and P¨uschel et al. [80].

Formulas and Base Cases Eventually a mathematical formula is obtained when all transforms are expanded into base cases. When this framework is used to express Fftw’s Cooley-Tukey recursion (see Section 5.2.5), the base cases are defined differently from the way they are defined for use with Spiral and the newly developed short vector SIMD algorithms. For instance, in Fftw codelets which correspond to larger transforms are the base cases while in Spiral all formulas are fully expanded. Within the newly developed short vector Cooley-Tukey (see Section 7.3) a new type of base case called vector terminal is introduced. However, the general framework of having recursive rules and base cases is intrinsic to all three approaches.

Example 5.6 (Fully Expanded Formula for WHT8) According to rule (5.2), WHT8 can fully be expanded into

(DFT2 ⊗ I4)(I2 ⊗ DFT2 ⊗ I2)(I4 ⊗ DFT2) with DFT2 being the base case. 5.1 Discrete Linear Transforms 81

Trees and Recursion The recursive decomposition of a discrete linear transform into smaller ones using recursion rules can be expressed by trees. Fftw calls these trees plans while Spiral calls these trees rule trees. In these trees the essence of the recursion— the type and sizes of the child transforms—is specified. As an example, rule trees for a recursion rule that breaks down a transform of size N into two smaller transforms is discussed. Figure 5.1 shows a tree of a discrete linear transform of size N = mn that is decomposed into smaller transforms of the same type of sizes m and n.When specific rules are used, the nodes have to carry the rule name.   mn  %% ee  mn Figure 5.1: Tree representation of a discrete linear transform of size N = mn with one recursion step applied.

Thenodemarkedwithmn is the parent node of the child nodes lying directly below, which indicate here transforms of sizes m and n. Analogously, Figure 5.2 shows a tree of a discrete linear transform of size N = kmn where in a first step the transform is decomposed into discrete linear transforms of size k and mn. In a second step the transforms of size mn are further decomposed into transforms of size m and n.  kmn   %% ee  k  mn  %% ee  mn Figure 5.2: Right-expanded tree, two recursive steps.

In general, the splitting rules are not commutative with respect to m and n. Thus, the trees are generally not symmetric. Left- and right-child nodes have to be distinguished which is done simply by left and right branches. In every tree there exists a root node, i. e., the upmost node which has no “parents”. There are lowest nodes without “children” which are the leave nodes. The upmost recursive decomposition in a tree, the one of the root node, is called the top level decomposition. If its two branches are equivalent the tree is called balanced, if they are nearly equivalent it is said to be “somewhat” bal- anced. But there also exist trees that are not balanced at all. They may be even extremely unsymmetrical. A tree with just leafs as left children is formed strictly to the right. Such s tree is called right-expanded, its contrary left-expanded. 82 5. Fast Algorithms for Linear Transforms

The Search Space By selecting different breakdown rules, a given discrete linear transform expands to a large number of formulas that correspond to different fast algorithms. For k example, for N =2 ,therearek −1 ways to apply rule (5.3) to DFTN . A similar degree of freedom recursively applies to the smaller DFTs obtained, which leads k 3/2 to O(5 /k ) different formulas for DFT2k . In the case of the DFT, allowing breakdown rules other than (5.3) further extends the formula space. The problem of finding an efficient formula for a given transform translates into a search problem in the space of formulas for that specific transform. The size of the search space depends on the rules and transforms actually used. The conventional approach for solving the search problem is to make an edu- cated guess (with some machine characteristics as hints) which formula might lead to an efficient implementation and then to continue by optimizing this formula. The automatic performance tuning systems Spiral and Fftw use a differ- ent approach. Both systems find fast implementations by intelligently looking through the search space. Spiral uses various search strategies and fully ex- pands the formulas. Fftw uses dynamic programming and restricts its search to the coarse grain structure of the algorithm. The rules are hardcoded into the executor while the fine grain structure is fixed by the codelet generator at compile time.

Formula Manipulation

A given formula for a fast linear transform algorithm can be manipulated using mathematical identities. Formula manipulation is used to exhibit the required formula structure throughout this thesis. The goal is to develop a short-vector specific set of formula manipulation rules which can be used to exhibit symbolically vectorizable subexpressions. The identities defined in Chapter 4 are the foundation for such manipulations. In Chapter 7 the short-vector specific manipulation rules are derived. Chapter 8 summarizes the inclusion of the newly developed rules into Fftw and Spiral.

Complex Transforms

In the classical approach using Kronecker product factorization, complex transforms are treated as complex valued entities throughout the formula manipulation. Only at the stage of the actual translation into a program is a formula coded using real arithmetic. When using programming languages featuring complex data types, the formula is never translated into real arithmetic at the source code level. The actual hardware, however, usually features only real arithmetic, and in the case of short vector SIMD extensions, additional restrictions like data alignment and strides of real vectors are introduced.

Thus, throughout this thesis, the bar operator as defined in Section 4.6 is used to obtain the real formulation of complex discrete linear transforms.

Example 5.7 (DFT4) The real version of the complex $\mathrm{DFT}_4$ is given by
$$
\overline{\mathrm{DFT}_4} =
\begin{pmatrix}
1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 & -1 & 0 & 0 & -1 \\
0 & 1 & -1 & 0 & 0 & -1 & 1 & 0 \\
1 & 0 & -1 & 0 & 1 & 0 & -1 & 0 \\
0 & 1 & 0 & -1 & 0 & 1 & 0 & -1 \\
1 & 0 & 0 & -1 & -1 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & -1 & -1 & 0
\end{pmatrix}.
$$

This introduces a new level of detail to the Kronecker product formalism, and allows for an even closer connection between formulas and hardware features.

5.2 The Fast Fourier Transform

In the last section, discrete linear transforms and the discrete Fourier transform were defined using the mathematical notation summarized in Chapter 4. It has been demonstrated that discrete linear transforms feature an intrinsic recursive structure. In this section, different types of the Cooley-Tukey recursion are discussed and the split radix algorithm is introduced. The difference between conventional iterative algorithms and the recursive approach used by Spiral and Fftw is discussed. Conventional iterative algorithms and vector computer algorithms are summarized.

5.2.1 The Cooley-Tukey Recursion

In 1965 Cooley and Tukey [13] found the fast Fourier transform in a form restricted to problem sizes that are powers of two. Other authors extended the idea from powers of two to arbitrary composite numbers. A summary of the historic development can be found in Van Loan [90].

Historically, FFT algorithms were obtained by applying breakdown rules recursively and then manipulating the resulting formulas to obtain the respective iterative algorithms. In the context of this thesis, the rules are more important than the iterative algorithms. The respective FFT rules are named after the algorithms where they first occurred.

Due to the degree of freedom introduced by Corollary 4.4, four different versions of the Cooley-Tukey recursion rule exist. For both tensor products in Theorem 5.1, one can choose whether the identity matrix is the left or the right factor. As these recursion steps can be utilized in different ways, all Cooley-Tukey algorithms—both iterative and recursive ones—can be expressed using these four basic rules. Some of these algorithms require additional formula manipulation. Section 5.2.3 discusses the difference between the iterative and the recursive approach and gives examples of well-known iterative algorithms.

Decimation in Time FFTs

The following theorem leads to the so-called decimation in time (DIT) FFT algorithm when applied recursively. In the resulting formula the expressions in parentheses are to be regrouped to form computational stages.

Theorem 5.1 (DIT Cooley-Tukey Rule) For $mn \geq 2$

$$\mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T_n^{mn}\, (I_m \otimes \mathrm{DFT}_n)\, L_m^{mn}.$$

Decimation in Frequency FFTs

The following theorem leads to the so-called decimation in frequency (DIF) FFT algorithm when applied recursively. In the resulting formula the expressions in parentheses are to be regrouped to form computational stages.

Theorem 5.2 (DIF Cooley-Tukey Rule) For mn ≥ 2

$$\mathrm{DFT}_{mn} = L_n^{mn}\, (I_m \otimes \mathrm{DFT}_n)\, T_n^{mn}\, (\mathrm{DFT}_m \otimes I_n).$$

4-Step FFTs

The next theorem leads to the so-called four-step or vector FFT algorithm. This rule is usually applied only once, to exhibit the vector Kronecker product structure; the resulting subproblems are usually not computed using this rule.

Theorem 5.3 (Vector Cooley-Tukey Rule) For $mn \geq 2$

$$\mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T_n^{mn}\, L_m^{mn}\, (\mathrm{DFT}_n \otimes I_m).$$

Based on this theorem a cache-oblivious FFT algorithm can be obtained, as described in Frigo et al. [34].

6-Step FFTs

The following theorem leads to the so-called six-step or parallel FFT algorithm. The rule is applied only once, to exhibit the parallel Kronecker product structure; the resulting subproblems are usually not computed using this rule.

Theorem 5.4 (Parallel Cooley-Tukey Rule) For $mn \geq 2$

$$\mathrm{DFT}_{mn} = L_m^{mn}\, (I_n \otimes \mathrm{DFT}_m)\, L_n^{mn}\, T_n^{mn}\, (I_m \otimes \mathrm{DFT}_n)\, L_m^{mn}.$$

This rule was developed by Bailey [10]. A second recursive application of this rule is the prerequisite for parallel one-dimensional FFT algorithms which overlap communication and computation (Franchetti et al. [26], and Karner and Ueberhuber [63]).

5.2.2 Cooley-Tukey Vector Recursion

A recursion rule that enables operations on short vectors is given by the Cooley-Tukey vector recursion. This recursion rule is included in Fftw 2.1.3 as an experimental option. It turned out to be of vital importance for the short vector SIMD adaptation within Fftw and for short vector SIMD implementations of FFTs in general.

Theorem 5.5 (Cooley-Tukey Vector Recursion) For $N = kmn \geq 2$ the construct $(I_m \otimes \mathrm{DFT}_n)\, L_m^{mn}$ in Theorem 5.1 can alternatively be factored into

$$(I_m \otimes \mathrm{DFT}_{kn})\, L_m^{mkn} = \bigl(I_m \otimes (\mathrm{DFT}_k \otimes I_n)\, T_n^{kn}\bigr)\, (L_m^{mk} \otimes I_n)\, \Bigl(I_k \otimes \underbrace{(I_m \otimes \mathrm{DFT}_n)\, L_m^{mn}}_{(a)}\Bigr)\, (L_k^{kn} \otimes I_m).$$

Proof: Application of Theorem 5.1 to $\mathrm{DFT}_{kn}$ within the construct $(I_m \otimes \mathrm{DFT}_{kn})\, L_m^{mkn}$ yields the leading factor $I_m \otimes (\mathrm{DFT}_k \otimes I_n)\, T_n^{kn}$ followed by
$$(I_{mk} \otimes \mathrm{DFT}_n)\, (I_m \otimes L_k^{kn})\, L_m^{mkn}. \qquad (1)$$
Applying Property 4.9 to the tensor product in construct (1),
$$I_{mk} \otimes \mathrm{DFT}_n = (L_m^{mk} \otimes I_n)\, (I_k \otimes I_m \otimes \mathrm{DFT}_n)\, (L_k^{mk} \otimes I_n),$$
leads to
$$(L_m^{mk} \otimes I_n)\, (I_k \otimes I_m \otimes \mathrm{DFT}_n)\, (L_k^{mk} \otimes I_n)\, (I_m \otimes L_k^{kn})\, L_m^{mkn}.$$
Now Property 4.24, applied to the permutation factors, leads to
$$(L_m^{mk} \otimes I_n)\, (I_k \otimes I_m \otimes \mathrm{DFT}_n)\, (I_k \otimes L_m^{mn})\, (L_k^{kn} \otimes I_m),$$
which proves the theorem. □

Theorem 5.5 "pushes" the factor $I_m$ down the recursion to $\mathrm{DFT}_n$. Construct (a) is of the same type as the original construct and thus the rule can be applied recursively. All tensor products except those in construct (a) are of vector type with vector lengths $m$ and $n$, and all stride permutations except the one in construct (a) occur as factors in such vector tensor products.

5.2.3 Iterative vs. Recursive FFT Algorithms

This section outlines two basic strategies for organizing FFT programs. For the implementation of an FFT algorithm there exist two divergent strategies for successively applying one of the theorems introducing the Cooley-Tukey rules.

Recursive Methods: The multiplication by the DFT matrix is performed by calling program modules which compute the subproblems according to the chosen rule. The child problems are further decomposed by recursively calling the same program again and again until the DFTs are small enough to be executed directly by optimized leaf routines. This approach is used by Spiral and Fftw.

Iterative Methods: The FFT code contains the entire matrix decomposition explicitly and manages all tasks directly. All recursive decomposition steps are flattened and the computation is done stagewise, leading to conventional triple-loop FFT implementations (Van Loan [90]) in which each stage requires an additional pass through the data vector. The respective formulas are given by Theorems 5.6 through 5.10.

Recursive FFTs

The approach to discrete linear transforms used in this thesis follows the foundations of the Spiral system and applies the recursion idea to all discrete linear transforms, including the DFT.

While divide-and-conquer is a standard formulation for FFTs in introductory texts, almost all non-adaptive high-performance FFTs use an iterative implementation. This is due to the widespread opinion that recursive divide-and-conquer algorithms are too expensive: traditionally, the required function calls were among the computationally most expensive instructions.

However, an intriguing feature of divide-and-conquer algorithms is that they run well on computers with deep memory hierarchies without the need for blocking or tiling. Each successive "divide" step in the divide-and-conquer process generates subproblems that touch successively smaller portions of data. For any level of the memory hierarchy, there is a level of division below which all the data touched by the subproblem fits into that level. Therefore, a divide-and-conquer algorithm can be viewed as an algorithm that is blocked for all levels of the memory hierarchy. Moreover, in recent computer systems memory accesses have become more and more expensive compared to function calls. Fftw finally broke with the tradition of iterative FFT implementations for the sake of higher performance and implements a recursive, hardware adaptive FFT computation.

Spiral applies recursion to all discrete linear transforms. It generates and optimizes one library function for each transform and problem size. As outlined in this chapter, Spiral intrinsically produces recursive codes. However, for better performance the recursion is partly unrolled while the data access pattern is preserved.
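To make the recursive strategy concrete, the following C sketch computes a DFT by the DIT rule of Theorem 5.1 with the radix fixed to $m = 2$ at every level (a simplification; Spiral and Fftw choose the split adaptively). It assumes C99 complex arithmetic and a power-of-two size; it illustrates the strategy, it is not the systems' actual code.

    #include <complex.h>
    #include <math.h>

    /* y[k] = sum_j x[j*s] exp(-2 pi i jk / N); recursive DIT, radix 2.
       Reading the input at stride s absorbs the stride permutation L
       into the recursion, so each subproblem touches successively
       smaller portions of the data. */
    static void dft_rec(double complex *y, const double complex *x,
                        int N, int s)
    {
        if (N == 1) { y[0] = x[0]; return; }   /* leaf: DFT_1 = identity */
        int h = N / 2;
        dft_rec(y,     x,     h, 2 * s);       /* even-indexed samples */
        dft_rec(y + h, x + s, h, 2 * s);       /* odd-indexed samples  */
        for (int k = 0; k < h; k++) {          /* combine: twiddles + DFT_2 */
            double complex t = cexp(-2.0 * M_PI * I * k / N) * y[k + h];
            y[k + h] = y[k] - t;
            y[k]     = y[k] + t;
        }
    }

In a real adaptive system the base case would call an optimized leaf routine (a codelet) for a larger size instead of recursing down to $N = 1$.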

Iterative FFTs

All conventional (non-adaptive) FFT algorithms have an iterative structure. The first FFT algorithm published by Cooley and Tukey was a right-expanded radix-2 factorized FFT whose decomposition strategy was fixed and explicitly implemented. For decades the execution of an FFT was seen as a sequence of computational stages, each stage corresponding to one factor of the products which define the algorithms in the modern notation.

Typically, a Cooley-Tukey FFT algorithm for data vectors of length $N = 2^n$ is implemented in the form of a triple loop. While the design and programming of such implementations is rather easy, performance may not be optimal because vector lengths vary from stage to stage and therefore the cache usage is suboptimal. Yet, as long as the input vector fits entirely into the cache memory, the iterative strategy was superior because there is no overhead due to additional program organization, which was expensive on earlier computer generations. On current computer systems such a clear statement is no longer possible.

An iterative algorithm obtained by repeated application of the DIT recursion rule with equal sizes is the single-radix DIT algorithm.

Theorem 5.6 (Single-Radix DIT Factorization) For $N = p^n$
$$\mathrm{DFT}_{p^n} = \Bigl(\prod_{j=1}^{n} (I_{p^{j-1}} \otimes \mathrm{DFT}_p \otimes I_{p^{n-j}})\,(I_{p^{j-1}} \otimes T_{p^{n-j}}^{p^{n-j+1}})\Bigr)\, R_{p^n}, \qquad (5.5)$$
with
$$R_{p^n} := \prod_{j=1}^{n-1} \bigl(I_{p^{n-j-1}} \otimes L_p^{p^{j+1}}\bigr). \qquad (5.6)$$

The permutation matrix $R_{p^n}$ is called a bit-reversal matrix, as it reorders the entries of the data vector, numbered in the base of the radix (e.g., binary numbers for a radix-2 factorization), according to the numbers obtained by reversing the order of the digits of the position numbers (Van Loan [90]). This matrix is an example of the digit permutation matrices introduced in Section 4.4.

Example 5.8 (Radix-2 DIT FFT Algorithm) A commonly used FFT algorithm for vector lengths $N = 2^n$ is the radix-2 DIT algorithm given by
$$\mathrm{DFT}_{2^n} = \Bigl(\prod_{j=1}^{n} (I_{2^{j-1}} \otimes \mathrm{DFT}_2 \otimes I_{2^{n-j}})\,(I_{2^{j-1}} \otimes T_{2^{n-j}}^{2^{n-j+1}})\Bigr)\, R_{2^n}.$$
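A minimal C sketch of the corresponding triple-loop implementation follows: the order phase applies the bit-reversal permutation $R_{2^n}$, and each pass of the outer loop computes one stage $(I_{2^{j-1}} \otimes \mathrm{DFT}_2 \otimes I_{2^{n-j}})$ together with its twiddle factors. C99 complex arithmetic is assumed; this is an illustration, not code from any of the libraries discussed.

    #include <complex.h>
    #include <math.h>

    /* Iterative radix-2 DIT FFT, in-place, N = 2^n. */
    void fft_iterative(double complex *x, int N)
    {
        /* order phase: bit-reversal permutation R_N */
        for (int i = 1, j = 0; i < N; i++) {
            int bit = N >> 1;
            for (; j & bit; bit >>= 1)
                j ^= bit;
            j ^= bit;
            if (i < j) {
                double complex tmp = x[i];
                x[i] = x[j];
                x[j] = tmp;
            }
        }
        /* computation phase: n stages of DFT_2 butterflies */
        for (int len = 2; len <= N; len <<= 1) {
            double complex wlen = cexp(-2.0 * M_PI * I / len);
            for (int base = 0; base < N; base += len) {
                double complex w = 1.0;
                for (int k = 0; k < len / 2; k++) {
                    double complex u = x[base + k];
                    double complex v = w * x[base + k + len / 2];
                    x[base + k]           = u + v;
                    x[base + k + len / 2] = u - v;
                    w *= wlen;
                }
            }
        }
    }

Note how the butterfly span (len) changes from stage to stage: this is exactly the varying vector length that makes cache usage of iterative FFTs suboptimal.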

An iterative algorithm obtained by repeated application of the DIF recursion rule with equal sizes is the single-radix DIF algorithm.

Theorem 5.7 (Single-Radix DIF Factorization) For $N = p^n$
$$\mathrm{DFT}_{p^n} = R_{p^n} \prod_{j=1}^{n} (I_{p^{n-j}} \otimes T_{p^{j-1}}^{p^{j}})\,(I_{p^{n-j}} \otimes \mathrm{DFT}_p \otimes I_{p^{j-1}}). \qquad (5.7)$$

If—in contrast to the single-radix algorithms above—different sizes are used in the breakdown rule, an iterative mixed-radix DIT algorithm is obtained.

Theorem 5.8 (Mixed-Radix DIT Factorization) For $N = p_1 p_2 \cdots p_n \geq 2$ and $p_0 := 1$, $p_{n+1} := 1$
$$\mathrm{DFT}_{p_1 \cdots p_n} = \Bigl(\prod_{j=1}^{n} (I_{p_0 \cdots p_{j-1}} \otimes \mathrm{DFT}_{p_j} \otimes I_{p_{j+1} \cdots p_{n+1}})\,(I_{p_0 \cdots p_{j-1}} \otimes T_{p_{j+1} \cdots p_{n+1}}^{p_j \cdots p_n})\Bigr)\, R_{p_1 \cdots p_n}$$
with
$$R_{p_1 \cdots p_n} = \prod_{j=1}^{n-1} \bigl(I_{p_0 \cdots p_{n-j-1}} \otimes L_{p_{n-j}}^{p_{n-j} \cdots p_n}\bigr).$$

The Stockham Algorithm

Any iterative Cooley-Tukey FFT algorithm is composed of a computation phase, which is linear-algebra-like, and an order phase, which does not execute arithmetic operations but merely performs data reordering.

Several strategies have been developed to achieve structural adaptations of the FFT's computation phase to vector processors. Yet the order phase makes things difficult. The reordering due to the bit-reversal matrix can neither be applied in place nor be vectorized efficiently. Consequently, it decisively decreases the speed-up obtained by vectorizing the computation phase.

The adaptation of FFT algorithms to vector processors therefore concentrates on developing methods to avoid an explicit order phase. To achieve this goal, the data must be reordered within the computational stages step by step, in just the manner that produces an ordered output. The single-radix Stockham DIT algorithm (Swarztrauber [86], Van Loan [90]) is an example of an FFT algorithm suitable for conventional vector computers.

Theorem 5.9 (Single-Radix Stockham DIT Factorization) For $N = p^n$, the Stockham DIT factorization (also called Stockham I factorization) is given by
$$\mathrm{DFT}_N = \prod_{j=1}^{n} (\mathrm{DFT}_p \otimes I_{p^{n-1}})\,(T_{p^{n-j}}^{p^{n-j+1}} \otimes I_{p^{j-1}})\,(L_p^{p^{n-j+1}} \otimes I_{p^{j-1}}). \qquad (5.8)$$

Transposition of the Stockham DIT factorization leads to the Stockham DIF FFT variant.

Theorem 5.10 (Single-Radix Stockham DIF Factorization) For $N = p^n$, the Stockham DIF factorization (also called Stockham II factorization) is given by
$$\mathrm{DFT}_N = \prod_{j=1}^{n} (L_{p^{j-1}}^{p^{j}} \otimes I_{p^{n-j}})\,(T_{p^{j-1}}^{p^{j}} \otimes I_{p^{n-j}})\,(\mathrm{DFT}_p \otimes I_{p^{n-1}}). \qquad (5.9)$$

Vector Computer FFT Algorithms and Short Vector Extensions

The two Stockham FFT algorithms and the 4-step FFT algorithm have been designed specifically for conventional vector computers. In principle, these algorithms could also be used on processors featuring short vector SIMD extensions, but they have several drawbacks there.

Complex Arithmetic. These algorithms are formulated using complex matrices. However, as outlined in Section 5.1, it is necessary to reformulate complex transforms using real matrices and formulas to capture the level of detail required for short vector SIMD extensions.

Vector Length and Stride. All three algorithms are optimized for long vectors. The Stockham algorithms are optimized for fixed-stride but not for unit-stride memory access. Accordingly, these algorithms do not produce good performance when running on short vector SIMD extensions.

Iterative Algorithms. The very nature of the two Stockham FFT algorithms and the 4-step FFT algorithm is an iterative one, which conflicts with the requirement of Spiral and Fftw to support adaptivity.

Thus, algorithms specifically designed for conventional vector computers are not suitable for modern short vector SIMD extensions.

5.2.4 The Split-Radix Recursion

Split-radix FFT algorithms for transform lengths $N = 2^n$ were introduced in the early eighties (Duhamel [19, 20], Duhamel and Hollmann [21], Sorensen et al. [85], Vetterli and Duhamel [91]). The split-radix recursion rule differs from the Cooley-Tukey recursion in that it breaks a larger transform down into three smaller transforms, two of them being of half the size of the third. This rule leads to the lowest arithmetic complexity among all FFT algorithms; on current architectures it is used mainly for small kernel routines or for recursion steps near the end of the recursion. The split-radix FFT algorithm is recursive.

Split-radix algorithms use a particular splitting strategy which is based in principle on the Cooley-Tukey splitting. However, split-radix algorithms rest on a clever synthesis of FFT decompositions according to different radices, which makes it impossible to obtain them using the classical Cooley-Tukey factorization. The strategy rather follows the principle that independent parts of the FFT algorithm should be computed independently: each part should use the optimum computational scheme, regardless of what scheme is used for other parts of the algorithm.

The split-radix approach reduces the number of arithmetic operations required to calculate the FFT in comparison to the Cooley-Tukey algorithm. The split-radix decomposition is constructed to produce as many trivial twiddle factors as possible; thus, in particular the number of multiplication operations is greatly reduced.

As empirical performance assessment shows, however, the number of arithmetic operations is no longer an adequate performance measure on current computer architectures. The split-radix algorithm is competitive with recursive Cooley-Tukey algorithms only for very small problem sizes or in kernel routines where the whole working set easily fits into cache or even into the register file.

The following theorem formulates the split-radix 2/4 DIF rule in the mathematical notation introduced in Chapter 4.

Theorem 5.11 (Split-Radix-2/4 DIF Rule) For $N = 2^n \geq 4$

$$\mathrm{DFT}_N = P_N^{(2/4)}\, \mathrm{DFT}_N^{(2/4)}\, T_N^{(2/4)}\, A_N^{(2/4)}$$
with

$$\begin{aligned}
P_N^{(2/4)} &= L_{N/2}^{N}\,\bigl(I_{N/2} \oplus L_{N/4}^{N/2}\bigr), \\
\mathrm{DFT}_N^{(2/4)} &= \mathrm{DFT}_{N/2} \oplus (I_2 \otimes \mathrm{DFT}_{N/4}), \\
T_N^{(2/4)} &= I_{N/2} \oplus \Omega_{4,N/4} \oplus (\Omega_{4,N/4})^3, \\
A_N^{(2/4)} &= \bigl(I_{N/2} \oplus (\mathrm{DFT}_2\, \Omega_{2,2} \otimes I_{N/4})\bigr)\,(\mathrm{DFT}_2 \otimes I_{N/2}), \\
\Omega_{m,n} &= \operatorname{diag}\bigl(1, \omega_{N/m}, \omega_{N/m}^2, \ldots, \omega_{N/m}^{n-1}\bigr).
\end{aligned}$$

Only the matrix $T_N^{(2/4)}$ contains nontrivial twiddle factors, as $\Omega_{2,2} = \operatorname{diag}(1, -i)$.

5.2.5 The Cooley-Tukey Recursion in FFTW

Although the Fftw framework uses the Cooley-Tukey algorithm as specified in Theorem 5.1,

$$\mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T_n^{mn}\, (I_m \otimes \mathrm{DFT}_n)\, L_m^{mn},$$
a specific interpretation of the final formulas is required to achieve the order of computation actually used by Fftw.


Figure 5.3: Comparison of the normalized arithmetic complexity (arithmetic operation count divided by $N \log N$) of radix-2, radix-4, and split-radix-2/4 FFT algorithms over the length $N$ of the data vector (multiplications by trivial twiddle factors can be avoided and are thus not included).

Fftw is the prototypical hardware adaptive recursive implementation of the Cooley-Tukey algorithm. Theorem 5.1 can also be written as
$$\mathrm{DFT}_{mn} = \Bigl(\sum_{i=0}^{n-1} W_{i,n}^{mn,m}\, \mathrm{DFT}_m\, R_{i,n}^{mn,m}\Bigr)\, T_n^{mn}\, \Bigl(\sum_{i=0}^{m-1} W_{in,1}^{mn,n}\, \mathrm{DFT}_n\, R_{i,m}^{mn,n}\Bigr) \qquad (5.10)$$
using Corollaries 4.10, 4.8, and 4.9. Fftw does not use the Kronecker product formalism but uses the formulation as a recursive algorithm. The following algorithm describes the formulation used by Fftw within the framework of Chapter 4; thus a connection between the Kronecker product formalism and the algorithmic formulation is established.

Algorithm 5.1 ($y := \mathrm{DFT}_{mn}\, x$)

    do i = 0, m - 1
        t1 := R_{i,m}^{mn,n} x
        t1 := DFT_n t1
        update(t', W_{in,1}^{mn,n}, t1)
    end do
    t'' := T_n^{mn} t'
    do i = 0, n - 1
        t2 := R_{i,n}^{mn,m} t''
        t2 := DFT_m t2
        update(y, W_{i,n}^{mn,m}, t2)
    end do

To apply the short vector SIMD vectorization derived in Chapter 7 to Fftw, it is required to use the formal connection between the Kronecker product formalism, the RW notation, and the algorithmic description. Throughout this chapter this connection is used, always starting from Kronecker products and finally leading to the actual code.

A major property of Fftw is its restriction to rightmost trees. Practical experiments show that rightmost trees perform well as they achieve memory locality (Haentjens [40]). The restriction to rightmost trees leads to only two different types of basic blocks required by Fftw: (i) twiddle codelets and (ii) no-twiddle codelets. This section shows how the Fftw recursion, the twiddle codelets, and the no-twiddle codelets interact to compose the Cooley-Tukey FFT algorithm.

FFTW's Codelets

Definition 5.5 (No-twiddle Codelet) An Fftw no-twiddle codelet of size $m$ performs a $\mathrm{DFT}_m$ on $m$ elements of the data vector of size $N$ that are specified by the input and output bases $b_i$ and $b_o$ and the input and output access strides $s_i$ and $s_o$:
$$F^{\mathrm{NT}}_{m,N}(b_i, s_i, b_o, s_o) = W^{N,m}_{b_o,s_o}\, \mathrm{DFT}_m\, R^{N,m}_{b_i,s_i}.$$
Due to the different input and output strides, a no-twiddle codelet has to be computed out-of-place.

Definition 5.6 (Twiddle Codelet) An Fftw twiddle codelet of size $m$ is defined by
$$F^{\mathrm{TW}}_{m,N,k}(b, s, d) = \sum_{i=0}^{k-1} W^{N,m}_{b+id,s}\, \mathrm{DFT}_m\, T^{mk}_{k,i}\, R^{N,m}_{b+id,s}.$$
It performs the operation
$$(\mathrm{DFT}_m \otimes I_k)\, T_k^{mk} = \sum_{i=0}^{k-1} W^{mk,m}_{i,k}\, \mathrm{DFT}_m\, T^{mk}_{k,i}\, R^{mk,m}_{i,k} \qquad (5.11)$$
with the diagonal matrices $T^{mk}_{k,i}$ being defined by
$$L^{mk}_k\, T^{mk}_k = \Bigl(\bigoplus_{i=0}^{k-1} T^{mk}_{k,i}\Bigr)\, L^{mk}_k.$$
Corollaries 4.7 and 4.9, as well as Properties 4.4 and 4.5, provide the distributivity required for equation (5.11). Thus, a twiddle codelet operates on a subvector of size $mk$ of a vector of size $N$ that is accessed from base $b$ at stride $s$ and distance $d$, and applies the scaling by $T_k^{mk}$ combined with $k$ DFTs of size $m$. A twiddle codelet can be computed in-place.

FFTW's Cooley-Tukey Recursion

Fftw's Cooley-Tukey recursion implements Theorem 5.1, i.e., Algorithm 5.1. In each recursion step, the parameter $m$ is chosen such that $\mathrm{DFT}_m$ and the twiddle factors can be computed using twiddle codelets, and $\mathrm{DFT}_n$ is further decomposed recursively. Finally, $\mathrm{DFT}_n$ and the stride permutations are computed using no-twiddle codelets. This section describes Fftw's Cooley-Tukey recursion using the Kronecker product formalism and the RW notation.

Theorem 5.12 (FFTW's Cooley-Tukey Recursion) The recursion used in Fftw applies the Cooley-Tukey DIT rule

$$\mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T_n^{mn}\, (I_m \otimes \mathrm{DFT}_n)\, L_m^{mn}$$
to a subvector of size $mn$ of a data vector of size $N$. The input is read from base $b_i$ at stride $s_i$ and the output is written to base $b_o$ at stride $s_o$. The recursion is given by
$$W^{N,mn}_{b_o,s_o}\, \mathrm{DFT}_{mn}\, R^{N,mn}_{b_i,s_i} = F^{\mathrm{TW}}_{m,N,n}(b_o, n s_o, s_o)\, \sum_{i=0}^{m-1} \underbrace{W^{N,n}_{b_o+ins_o,\,s_o}\, \mathrm{DFT}_n\, R^{N,n}_{b_i+is_i,\,ms_i}}_{(a)}.$$

Construct (a) has the same structure as the original problem, and thus the recursion can be applied again if $n = rs$. Once $n$ becomes small enough, construct (a) can be computed using a no-twiddle codelet, i.e.,

$$F^{\mathrm{NT}}_{n,N}(b_i + is_i,\; ms_i,\; b_o + ins_o,\; s_o) = W^{N,n}_{b_o+ins_o,\,s_o}\, \mathrm{DFT}_n\, R^{N,n}_{b_i+is_i,\,ms_i}.$$

Proof: First, the Cooley-Tukey recursion (Theorem 5.1) is applied:
$$W^{N,mn}_{b_o,s_o}\, \mathrm{DFT}_{mn}\, R^{N,mn}_{b_i,s_i} = W^{N,mn}_{b_o,s_o}\, (\mathrm{DFT}_m \otimes I_n)\, T_n^{mn}\, (I_m \otimes \mathrm{DFT}_n)\, L_m^{mn}\, R^{N,mn}_{b_i,s_i}.$$
Next, the tensor products, the stride permutation, and the twiddle factor matrix are decomposed using Corollaries 4.7, 4.8, and 4.9, as well as Properties 4.4 and 4.5. This leads to
$$W^{N,mn}_{b_o,s_o}\, \Bigl(\sum_{i=0}^{n-1} W^{mn,m}_{i,n}\, \mathrm{DFT}_m\, T^{mn}_{n,i}\, R^{mn,m}_{i,n}\Bigr) \Bigl(\sum_{i=0}^{m-1} W^{mn,n}_{in,1}\, \mathrm{DFT}_n\, R^{mn,n}_{i,m}\Bigr)\, R^{N,mn}_{b_i,s_i}.$$
Now Property 4.6 is used and thus
$$I_{mn} = R^{N,mn}_{b_o,s_o}\, W^{N,mn}_{b_o,s_o}$$
is inserted between the two major factors. As both $R^{N,mn}_{b_o,s_o}$ and $W^{N,mn}_{b_o,s_o}$ are independent of $i$, these factors can be moved into the sums, leading to the first factor
$$\sum_{i=0}^{n-1} W^{N,mn}_{b_o,s_o}\, W^{mn,m}_{i,n}\, \mathrm{DFT}_m\, T^{mn}_{n,i}\, R^{mn,m}_{i,n}\, R^{N,mn}_{b_o,s_o}$$
and the second factor
$$\sum_{i=0}^{m-1} W^{N,mn}_{b_o,s_o}\, W^{mn,n}_{in,1}\, \mathrm{DFT}_n\, R^{mn,n}_{i,m}\, R^{N,mn}_{b_i,s_i}.$$
Applying Properties 4.4 and 4.5 leads to the following expression for the whole computation:
$$\Bigl(\sum_{i=0}^{n-1} W^{N,m}_{b_o+is_o,\,ns_o}\, \mathrm{DFT}_m\, T^{mn}_{n,i}\, R^{N,m}_{b_o+is_o,\,ns_o}\Bigr) \Bigl(\sum_{i=0}^{m-1} W^{N,n}_{b_o+ins_o,\,s_o}\, \mathrm{DFT}_n\, R^{N,n}_{b_i+is_i,\,ms_i}\Bigr).$$
Identifying the left sum with the definition of the twiddle codelet leads to
$$F^{\mathrm{TW}}_{m,N,n}(b_o, ns_o, s_o)\, \sum_{i=0}^{m-1} W^{N,n}_{b_o+ins_o,\,s_o}\, \mathrm{DFT}_n\, R^{N,n}_{b_i+is_i,\,ms_i},$$
which proves the theorem. □

Appendix E.1 shows the Fftw framework in pseudo code. There the executor implements the Cooley-Tukey recursion rule and the codelets do the actual computation. Note the close connection between the formal derivation presented in this section and the actual code.

Example 5.9 (FFTW's Recursion for N = rstu) Suppose a DFT computation using three twiddle codelet stages and one no-twiddle stage. Using Theorem 5.1 for $m = r$ and $n = stu$ leads to
$$\mathrm{DFT}_{rstu} = (\mathrm{DFT}_r \otimes I_{stu})\, T_{stu}^{rstu}\, (I_r \otimes \mathrm{DFT}_{stu})\, L_r^{rstu}.$$
Applying Theorem 5.1 again for $m = s$ and $n = tu$ leads to
$$\mathrm{DFT}_{rstu} = (\mathrm{DFT}_r \otimes I_{stu})\, T_{stu}^{rstu}\, \bigl(I_r \otimes (\mathrm{DFT}_s \otimes I_{tu})\, T_{tu}^{stu}\, (I_s \otimes \mathrm{DFT}_{tu})\, L_s^{stu}\bigr)\, L_r^{rstu}.$$
Applying Theorem 5.1 a last time with $m = t$ and $n = u$ leads to the final expansion:
$$\mathrm{DFT}_{rstu} = (\mathrm{DFT}_r \otimes I_{stu})\, T_{stu}^{rstu}\, \Bigl(I_r \otimes (\mathrm{DFT}_s \otimes I_{tu})\, T_{tu}^{stu}\, \bigl(I_s \otimes (\mathrm{DFT}_t \otimes I_u)\, T_u^{tu}\, (I_t \otimes \mathrm{DFT}_u)\, L_t^{tu}\bigr)\, L_s^{stu}\Bigr)\, L_r^{rstu}.$$

The same expansion is now derived using Fftw's recursion rule and the definition of the codelets. The result is an expression featuring three nested loops; the parameters of the individual recursion steps are summarized in Table 5.1.

            Twiddle m = r   Twiddle m = s   Twiddle m = t   No-twiddle m = u
    N       rstu            rstu            rstu            rstu
    mn      rstu            stu             tu              --
    m       r               s               t               u
    n       stu             tu              u               --
    b_i     0               i               i + jr          i + jr + krs
    s_i     1               r               rs              rst
    b_o     0               istu            istu + jtu      istu + jtu + ku
    s_o     1               1               1               1

Table 5.1: Recursion steps applying Fftw's Cooley-Tukey rule (Theorem 5.12) three times to $N = rstu$, with the respective loop indices $i$, $j$, and $k$.

By choosing the innermost loop index to run fastest and the outermost loop index to run slowest (the natural choice according to the loop structure), Fftw's recursion is shown unrolled:

$$\mathrm{DFT}_{rstu} = F^{\mathrm{TW}}_{r,rstu,stu}(0, stu, 1)\, \sum_{i=0}^{r-1} \biggl( F^{\mathrm{TW}}_{s,rstu,tu}(istu, tu, 1)\, \sum_{j=0}^{s-1} \Bigl( F^{\mathrm{TW}}_{t,rstu,u}(istu{+}jtu, u, 1)\, \sum_{k=0}^{t-1} F^{\mathrm{NT}}_{u,rstu}(i{+}jr{+}krs,\; rst,\; istu{+}jtu{+}ku,\; 1) \Bigr) \biggr).$$

Fftw's planner can choose among the different factorizations
$$N = n_0 n_1 \cdots n_k,$$
where both $k$ and the factors $n_i$ can be chosen. The best choice is found using dynamic programming and run time measurements.

The Cooley-Tukey Vector Recursion

Fftw 2.1.3 features an experimental implementation of the Cooley-Tukey vector recursion as given by Theorem 5.5. This section describes Fftw's Cooley-Tukey vector recursion using the Kronecker product formalism and the RW notation.

Theorem 5.13 (FFTW's Cooley-Tukey Vector Recursion) Fftw's Cooley-Tukey vector recursion computes

$$(I_m \otimes \mathrm{DFT}_{k_1 k_2 n})\, L_m^{m k_1 k_2 n} = \bigl(I_m \otimes (\mathrm{DFT}_{k_1} \otimes I_{k_2 n})\, T_{k_2 n}^{k_1 k_2 n}\bigr)\, (L_m^{m k_1} \otimes I_{k_2 n})\, \bigl(I_{k_1} \otimes (I_m \otimes \mathrm{DFT}_{k_2 n})\, L_m^{m k_2 n}\bigr)\, (L_{k_1}^{k_1 k_2 n} \otimes I_m).$$

Once $k_2 = 1$ is reached, the recursion stops and the respective leaf is

$$(I_m \otimes \mathrm{DFT}_n)\, L_m^{mn} = L_m^{mn}\, (\mathrm{DFT}_n \otimes I_m).$$

Proof: Applying Theorem 5.5 with $m = m$, $k = k_1$, and $n = k_2 n$ leads to the recursion rule, and Property 4.4 accounts for the leaf rule. □

Thus, all operations apart from the leaf operations are carried out on vectors of length $m$ and $n$. The construct $I_m$ originating from the topmost splitting is pushed down to $\mathrm{DFT}_n$. The vector recursion can be expressed in terms of twiddle codelets and no-twiddle codelets analogously to the standard Cooley-Tukey recursion.

Example 5.10 (FFTW’s Vector Recursion for N = rstu) Suppose a DFT computation using three twiddle codelet stages and one no-twiddle stage. Using Theorem 5.1 leads to

$$\mathrm{DFT}_{rstu} = (\mathrm{DFT}_r \otimes I_{stu})\, T_{stu}^{rstu}\, (I_r \otimes \mathrm{DFT}_{stu})\, L_r^{rstu}.$$

Applying Theorem 5.13 for $m = r$, $k_1 = s$, $k_2 = t$, and $n = u$ to the right construct leads to
$$\mathrm{DFT}_{rstu} = (\mathrm{DFT}_r \otimes I_{stu})\, T_{stu}^{rstu}\, \bigl(I_r \otimes (\mathrm{DFT}_s \otimes I_{tu})\, T_{tu}^{stu}\bigr)\, (L_r^{rs} \otimes I_{tu})\, \bigl(I_s \otimes (I_r \otimes \mathrm{DFT}_{tu})\, L_r^{rtu}\bigr)\, (L_s^{stu} \otimes I_r).$$

Applying Theorem 5.13 a second time with $m = r$, $k_1 = t$, $k_2 = 1$, and $n = u$ leads to
$$\mathrm{DFT}_{rstu} = (\mathrm{DFT}_r \otimes I_{stu})\, T_{stu}^{rstu}\, \bigl(I_r \otimes (\mathrm{DFT}_s \otimes I_{tu})\, T_{tu}^{stu}\bigr)\, (L_r^{rs} \otimes I_{tu})\, \Bigl(I_s \otimes \bigl(I_r \otimes (\mathrm{DFT}_t \otimes I_u)\, T_u^{tu}\bigr)\, (L_r^{rt} \otimes I_u)\, \bigl(I_t \otimes (I_r \otimes \mathrm{DFT}_u)\, L_r^{ru}\bigr)\, (L_t^{tu} \otimes I_r)\Bigr)\, (L_s^{stu} \otimes I_r).$$

Applying the leaf transformation of Theorem 5.13 leads to the final expansion:
$$\mathrm{DFT}_{rstu} = (\mathrm{DFT}_r \otimes I_{stu})\, T_{stu}^{rstu}\, \bigl(I_r \otimes (\mathrm{DFT}_s \otimes I_{tu})\, T_{tu}^{stu}\bigr)\, (L_r^{rs} \otimes I_{tu})\, \Bigl(I_s \otimes \bigl(I_r \otimes (\mathrm{DFT}_t \otimes I_u)\, T_u^{tu}\bigr)\, (L_r^{rt} \otimes I_u)\, \bigl(I_t \otimes L_r^{ru}\, (\mathrm{DFT}_u \otimes I_r)\bigr)\, (L_t^{tu} \otimes I_r)\Bigr)\, (L_s^{stu} \otimes I_r).$$

Except for $L_r^{ru}$, every expression in this formula is either (i) a twiddle factor, (ii) of the form $A \otimes I_r$, or (iii) of the form $A \otimes I_u$. Fftw computes this formula using twiddle and no-twiddle codelets, exhibiting a maximum of memory access locality.

Chapter 6 A Portable SIMD API

Short vector SIMD extensions are advanced architectural features. Utilizing the respective instructions to speed up applications introduces another level of complexity, and it is not straightforward to produce high performance codes.

The reasons why short vector SIMD instructions are hard to use are the following: (i) They are beyond standard (e.g., C) programming. (ii) They require an unusually high level of programming expertise. (iii) They are usually non-portable. (iv) Compilers in general don't (can't) use these features to a satisfactory extent. (v) They are changed/extended with every new architecture. (vi) It is not clear where and how to use them. (vii) There is a potentially high payoff (factors of 2, 4, and more) for small and intermediate problem sizes whose solution cannot be accelerated with multi-processor machines, but there is also potential speed-up for large scale problems.

As discussed in Chapter 3, a sort of common programming model was established recently. The C language has been extended by new data types according to the available registers, and the operations are mapped onto intrinsic (or built-in) functions. Using this programming model, a programmer does not have to deal with assembly language: register allocation and instruction selection are done by the compiler. However, these interfaces are standardized neither across different compiler vendors on the same architecture nor across architectures. But for any current short vector SIMD architecture at least one compiler featuring this interface is available. Note that for IPF currently only SSE is supported via compilers; thus, IPF's SIMD instructions cannot yet be utilized in native mode within C codes.

Careful analysis of the instructions required by the code generated for discrete linear transforms within the context of this thesis allowed the definition of a set of C macros—a portable SIMD API—that can be implemented efficiently on all current architectures and features all necessary operations. The main restriction turned out to be that across all short vector SIMD extensions only naturally aligned vectors can be loaded and stored highly efficiently. All other memory access operations lead to performance degradation and possibly to prohibitive performance characteristics. All codes generated in the scope of this thesis use the newly defined portable SIMD API.

This portable SIMD API serves two main purposes: (i) to abstract from hardware peculiarities, and (ii) to abstract from special compiler features.


Abstracting from Special Machine Features

In the context of this thesis all short vector SIMD extensions feature the functionality required in intermediate level building blocks. However, the implementation of such building blocks depends on special features of the target architecture. For instance, a complex reordering operation like a permutation has to be implemented using the register-register permutation instructions provided by the target architecture. In addition, restrictions like aligned memory access have to be handled. Thus, a set of intermediate building blocks has to be defined which (i) can be implemented on all current short vector SIMD architectures and (ii) enables all discrete linear transforms to be built on top of these building blocks. This set is called the portable SIMD API. Appendix B describes the relevant parts of the instruction sets provided by current short vector SIMD extensions.

Abstracting from Special Compiler Features

All compilers featuring a short vector SIMD C language extension provide the required functionality to implement the portable SIMD API. But syntax and semantics differ from platform to platform and from compiler to compiler. These specifics have to be hidden in the portable SIMD API. Table 3.1 (on page 40) shows that for any current short vector SIMD extension compilers with short vector SIMD language extensions exist.

6.1 Definition of the Portable SIMD API

The portable SIMD API includes macros of four types: (i) data types, (ii) constant handling, (iii) arithmetic operations, and (iv) extended memory operations. An overview of the provided macros is given below. Appendix C contains examples of actual implementations of the portable SIMD API on various platforms. All examples of such macros displayed in this section suppose a two-way or four-way short vector SIMD extension. The portable SIMD API can be extended to arbitrary vector length $\nu$. Thus, optimization techniques like loop interleaving (Gatlin and Carter [37]) can be implemented on top of the portable SIMD API.

Data Types

The portable SIMD API introduces three data types, which are all naturally aligned: (i) Real numbers of type float or double (depending on the extension) have type simd_real. (ii) Complex numbers of type simd_complex are pairs of simd_real elements. (iii) Vectors of type simd_vector are vectors of $\nu$ elements of type simd_real.

For two-way short vector SIMD extensions the type simd_complex is equal to simd_vector.

Nomenclature. In the remainder of this section, variables of type simd_vector are named t, t0, t1, and so forth. Memory locations of type simd_vector are named *v, *v0, *v1 and so forth. Memory locations of type simd_complex are named *c, *c0, *c1 and so forth. Memory locations of type simd_real are named *r, *r0, *r1 and so forth. Constants of type simd_real are named r, r0, r1 and so forth. Table 6.1 summarizes the data types supported by the portable SIMD API.

    API type        Elements
    simd_real       single or double precision floating-point number
    simd_complex    a pair of simd_real
    simd_vector     native vector data type, vector length ν;
                    for two-way extensions equal to simd_complex

Table 6.1: Data types provided by the portable SIMD API.
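As an illustration, these types could be instantiated for a four-way single-precision SSE target as follows (an assumption chosen for concreteness; the thesis' actual platform definitions are collected in Appendix C):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* One possible four-way SSE instantiation of the API's data types. */
    typedef float  simd_real;                          /* single precision */
    typedef struct { simd_real re, im; } simd_complex; /* interleaved pair */
    typedef __m128 simd_vector;                        /* nu = 4 reals     */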

Constant Handling

The portable SIMD API provides declaration macros for the following types of constants whose values are known at compile time: (i) the zero vector, (ii) homogeneous vector constants (all components have the same value), and (iii) inhomogeneous vector constants (the components may have different values).

Three types of constant load operations have been introduced: (i) load a constant (homogeneous or inhomogeneous) that is known at compile time, (ii) load a constant vector (homogeneous or inhomogeneous) that is precomputed at run time (but not known at compile time), and (iii) load a precomputed constant real number and build a homogeneous vector constant with that value. Table 6.2 shows the most important macros for constant handling.

    Macro                                    Type
    DECLARE_CONST(name, r)                   compile time homogeneous
    DECLARE_CONST_2(name, r0, r1)            compile time inhomogeneous
    DECLARE_CONST_4(name, r0, r1, r2, r3)    compile time inhomogeneous
    LOAD_CONST(name)                         compile time
    LOAD_CONST_SCALAR(*r)                    precomputed homogeneous real
    LOAD_CONST_VECT(*v)                      precomputed vector
    SIMD_SET_ZERO()                          compile time homogeneous

Table 6.2: Constant handling operations provided by the portable SIMD API.
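A possible four-way SSE realization of some of these macros is sketched below (assumptions: GCC-style alignment attributes and the Intel intrinsics _mm_load_ps, _mm_load1_ps, and _mm_setzero_ps; the definitive implementations are in Appendix C):

    /* Constants are kept as aligned float arrays and moved into a vector
       register with one aligned load; a scalar is replicated with load1. */
    #define DECLARE_CONST(name, r) \
        static const float name[4] __attribute__((aligned(16))) = { r, r, r, r }
    #define DECLARE_CONST_4(name, r0, r1, r2, r3) \
        static const float name[4] __attribute__((aligned(16))) = { r0, r1, r2, r3 }
    #define LOAD_CONST(name)     _mm_load_ps(name)
    #define LOAD_CONST_SCALAR(r) _mm_load1_ps(r)  /* splat *r into all lanes */
    #define SIMD_SET_ZERO()      _mm_setzero_ps()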

Arithmetic Operations

The portable SIMD API provides real vector addition, multiplication, and subtraction, the unary minus, two types of fused multiply-add operations, and a complex multiplication. For each operation there exist variants that either modify a parameter or return the result. See Table 6.3 for a summary.

    Macro                           Operation
    VEC_ADD_P(v, v0, v1)            v = v0 + v1
    VEC_SUB_P(v, v0, v1)            v = v0 − v1
    VEC_UMINUS_P(v, v0)             v = −v0
    VEC_MUL_P(v, v0, v1)            v = v0 × v1
    VEC_MADD_P(v, v0, v1, v2)       v = v0 × v1 + v2
    VEC_NMSUB_P(v, v0, v1, v2)      v = −(v0 × v1 − v2)
    COMPLEX_MULT(v0, v1, v2,        v0 = v2 × v4 − v3 × v5
                 v3, v4, v5)        v1 = v2 × v5 + v3 × v4
    VEC_ADD(v0, v1)                 return (v0 + v1)
    VEC_SUB(v0, v1)                 return (v0 − v1)
    VEC_UMINUS(v0)                  return −v0
    VEC_MUL(v0, v1)                 return (v0 × v1)
    VEC_MADD(v0, v1, v2)            return v0 × v1 + v2
    VEC_NMSUB(v0, v1, v2)           return −(v0 × v1 − v2)

Table 6.3: Arithmetic operations provided by the portable SIMD API.

Extended Memory Operations

The portable SIMD API features three types of memory operations. All vector reordering operations are part of the memory access operations, as all permutations are joined with load or store operations. The three types of memory access operations are: (i) plain vector load and store, (ii) vector memory access plus permutation, and (iii) generic memory access plus permutation. The semantics of a load operation is to load data from memory locations, which are not necessarily aligned nor vectors, into a set of vector variables. The semantics of a store operation is to store a set of vector variables to specific memory locations, which are not necessarily aligned nor vectors. Tables 6.4, 6.5, and 6.6 show the most important macros for memory access.

Vector Memory Access. The macros VEC_LOAD and VEC_STORE load or store naturally aligned vectors from or to memory, respectively. These are the most efficient operations for memory access.

Vector Memory Access plus Permutation. A basic set of permutations of $n$ vector variables is supported. For load operations, $n$ vectors are loaded from memory, permuted accordingly, and then stored into $n$ vector variables. The store operations first permute and then store the vector variables. The supported permutations are
$$I_\nu, \quad J_\nu, \quad L_2^{2\nu}, \quad L_\nu^{2\nu}, \quad L_\nu^{\nu^2}, \quad L_2^{\nu} \otimes I_2,$$
which lead to
$$I_2, \quad I_4, \quad J_2, \quad J_4, \quad L_2^4, \quad L_2^8, \quad L_4^8, \quad L_4^{16}, \quad L_2^4 \otimes I_2$$
in the case of two-way and four-way short vector SIMD extensions. The case $I_\nu$ is the standard vector memory access. The load and store macros are defined according to these permutations, leading to macros like LOAD_L_4_2, STORE_J_4, and LOAD_L_16_4.

Generic Memory Access plus Permutation. These macros are a generalization of the vector memory access macros. The implementation of general permutations requires these macros, which are not directly supported by all short vector SIMD extensions. Instead of accessing whole vectors, these macros imply memory access at the level of real or complex numbers. Depending on the underlying hardware, these operations may require scalar, subvector, or vector memory access. The performance of such permutations depends strongly on the target architecture.

For instance, on SSE, properly aligned memory access for complex numbers does not degrade performance very much, and for two-way short vector SIMD extensions complex numbers are native vectors. On AltiVec, however, these memory operations feature prohibitive performance characteristics, as a single such macro can result in many vector memory accesses and permutation operations.

These operations lead to macros like LOAD_L_4_2_R, which loads four real numbers from arbitrary locations and then performs an $L_2^4$ operation, or STORE_8_4_C, which performs an $L_4^8$ and then stores pairs of real numbers, properly aligned, at arbitrary locations.
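As an illustration, two of the combined load-and-permute macros could be realized on four-way SSE as follows (using the _mm_shuffle_ps and _mm_unpacklo_ps/_mm_unpackhi_ps intrinsics; a sketch only, the actual definitions appear in Appendix C):

    /* L_2^4 within one vector: (x0,x1,x2,x3) -> (x0,x2,x1,x3). */
    #define LOAD_L_4_2(t, v) do {                                   \
        __m128 x_ = _mm_load_ps((const float *)(v));                \
        (t) = _mm_shuffle_ps(x_, x_, _MM_SHUFFLE(3, 1, 2, 0));      \
    } while (0)

    /* L_4^8 across two vectors: (x0..x7) -> (x0,x4,x1,x5, x2,x6,x3,x7). */
    #define LOAD_L_8_4(t0, t1, v0, v1) do {                         \
        __m128 a_ = _mm_load_ps((const float *)(v0));               \
        __m128 b_ = _mm_load_ps((const float *)(v1));               \
        (t0) = _mm_unpacklo_ps(a_, b_);   /* x0,x4,x1,x5 */         \
        (t1) = _mm_unpackhi_ps(a_, b_);   /* x2,x6,x3,x7 */         \
    } while (0)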

    Macro                            Access    Permutation
    LOAD_VECT(t, *v)                 vector    I_ν
    LOAD_J(t, *v)                    vector    J_ν
    LOAD_L_4_2(t0, t1, *v0, *v1)     vector    L_2^4
    LOAD_R_2(t, *r0, *r1)            real      implicit
    STORE_VECT(*v, t)                vector    I_ν
    STORE_J(*v, t)                   vector    J_ν
    STORE_L_4_2(*v0, *v1, t0, t1)    vector    L_2^4
    STORE_R_2(*r0, *r1, t)           real      implicit

Table 6.4: Load and store operations supported by the portable SIMD API for two-way short vector SIMD extensions.

    Macro                                              Access    Permutation
    LOAD_VECT(t, *v)                                   vector    I_ν
    LOAD_J(t, *v)                                      vector    J_ν
    LOAD_L_4_2(t, *v)                                  vector    L_2^4
    LOAD_L_8_2(t0, t1, *v0, *v1)                       vector    L_2^8
    LOAD_L_8_4(t0, t1, *v0, *v1)                       vector    L_4^8
    LOAD_L_16_4(t0, t1, t2, t3, *v0, *v1, *v2, *v3)    vector    L_4^16
    LOAD_C(t, *c0, *c1)                                complex   implicit
    LOAD_J_C(t, *c0, *c1)                              complex   implicit + J_4
    LOAD_L_4_2_C(t, *c0, *c1)                          complex   implicit + L_2^4
    LOAD_L_8_2_C(t0, t1, *c0, *c1, *c2, *c3)           complex   implicit + L_2^8
    LOAD_L_8_4_C(t0, t1, *c0, *c1, *c2, *c3)           complex   implicit + L_4^8
    LOAD_R_4(t, *r0, *r1, *r2, *r3)                    real      implicit

Table 6.5: Load operations supported by the portable SIMD API for four-way short vector SIMD extensions.

    Macro                                              Access    Permutation
    STORE_VECT(*v, t)                                  vector    I_ν
    STORE_J(*v, t)                                     vector    J_ν
    STORE_L_4_2(*v, t)                                 vector    L_2^4
    STORE_L_8_2(*v0, *v1, t0, t1)                      vector    L_2^8
    STORE_L_8_4(*v0, *v1, t0, t1)                      vector    L_4^8
    STORE_L_16_4(*v0, *v1, *v2, *v3, t0, t1, t2, t3)   vector    L_4^16
    STORE_C(*c0, *c1, t)                               complex   implicit
    STORE_J_C(*c0, *c1, t)                             complex   implicit + J_4
    STORE_L_4_2_C(*c0, *c1, t)                         complex   implicit + L_2^4
    STORE_L_8_2_C(*c0, *c1, *c2, *c3, t0, t1)          complex   implicit + L_2^8
    STORE_L_8_4_C(*c0, *c1, *c2, *c3, t0, t1)          complex   implicit + L_4^8
    STORE_R_4(*r0, *r1, *r2, *r3, t)                   real      implicit

Table 6.6: Store operations supported by the portable SIMD API for four-way short vector SIMD extensions.

Chapter 7 Transform Algorithms on Short Vector Hardware

This chapter shows how discrete linear transform formulas can be transformed into other, equivalent formulas which (i) can be implemented efficiently using short vector SIMD extensions, and (ii) are structurally close to the original formulas. Specifically, constructs are identified that allow for an implementation on all current short vector SIMD extensions. In addition, a set of rules is given that allows to formally transform constructs appearing in discrete linear transforms into versions that match the requirements of short vector SIMD extensions. Such constructs are said to be vectorized.

This set of rules is designed to be included into the Spiral system to enable Spiral to generate efficient vector code. Thus, the formal approach and the automatic performance tuning provided by Spiral are used and extended by the newly developed formal rules and the associated implementation guidelines. Furthermore, the important case of DFTs is handled in more detail, leading to a set of short vector Cooley-Tukey rules that allow high-performance short vector SIMD implementations of FFTs and solve the intrinsic problems present in all other approaches to vectorizing FFTs for short vector SIMD extensions.

This set of short vector Cooley-Tukey rules accounts for high-performance implementations of DFTs within the Spiral framework. The extension of Spiral with the general formal vectorization and the DFT-specific short vector Cooley-Tukey rules, as well as the respective numerical experiments, are presented in Section 8.1. In addition, the short vector Cooley-Tukey rules are the foundation of a short vector SIMD extension to Fftw which delivers about the same performance level as the short vector Cooley-Tukey FFT implementation within Spiral. The extension of Fftw and the respective numerical experiments are presented in Section 8.2.

Related Work. Floating-point short vector SIMD extensions are relatively new features of modern microprocessors. They are the successors of integer short vector SIMD extensions. A radix-4 FFT implementation for the NEC V80R DSP processor featuring a four-way integer short vector SIMD extension was presented some years ago (Joshi and Dubey [62]). A portable Fortran implementation of the proprietary Cray Scilib library targeted at traditional vector computers is the Sciport library (Lamson [65]). Apple Computer Inc. included the vDSP vendor library in their operating system Mac OS X. This library features a DFT implementation which supports

the MPC 74xx G4 AltiVec extension [8]. This implementation, based on the Stockham algorithm, is described in Crandall and Klivington [15]. Intel's Math Kernel Library (MKL) provides support for SSE, SSE 2, and the Itanium processor family [58]. An SSE split-radix FFT implementation is given in an Intel application note [44]. An ongoing effort to develop a portable library which will utilize short vector SIMD extensions is libSIMD (Nicholson [74]). SIMD-FFT (Rodriguez [82]) is a radix-2 FFT implementation for SSE and AltiVec. Fftw-gel is a short vector SIMD extension for Fftw 2.1.3 supporting 3DNow! and SSE 2 (Kral [64]).

7.1 Formal Vectorization of Linear Transforms

This section outlines how formulas that describe algorithms for discrete linear transforms can be vectorized symbolically, using formula manipulation techniques, such that the resulting formulas can be implemented efficiently on short vector SIMD extensions. In principle, this result could be used to create short vector SIMD implementations of discrete linear transforms without advanced code generation techniques: formulas obtained using the newly developed techniques would serve as the starting point for a traditional implementation and optimization cycle in which implementations are manually derived and subsequently optimized further. However, as the main focus of this work is automatic performance tuning and code generation, both formal rules and implementation guidelines are developed. The major goal is to provide the required theoretical framework to enable automatic code generation for discrete linear transforms within Fftw and Spiral.

7.1.1 Vectorizable Formulas

The key problem to solve is to identify the formula constructs that can be vectorized and to find an efficient implementation of the required building blocks. The presented approach is based on the vectorization of the basic construct
$$A \otimes I_\nu, \qquad \nu \text{ being the vector length}, \qquad (7.1)$$
with $A$ denoting an arbitrary formula. This construct can be naturally implemented by replacing, in a scalar implementation of $A$, all scalar operations by the corresponding vector operations (a code sketch follows the list below). Extending from (7.1), the most general normalized construct that can be vectorized by the presented approach (i.e., a construct that can be implemented using exclusively macros provided by the portable SIMD API) is
$$\prod_{i=1}^{k} P_i D_i (A_i \otimes I_\nu) E_i Q_i, \qquad (7.2)$$
with arbitrary matrices $A_i$, permutation matrices $P_i$, $Q_i$, and matrices $D_i$, $E_i$ that are either real diagonals or vector diagonals matching equation (4.5). For example, all DFT and WHT algorithms arising from rules (5.3) and (5.2), respectively, two-dimensional transforms given by equation (5.4), and any multi-dimensional transform can be normalized to formulas matching equation (7.2). Thus, all these algorithms can be completely vectorized using the approach presented in this chapter.

Two prerequisites are required to obtain a vectorized implementation of a formula.

• Symbolic vectorization rules allow to normalize a formula, using manipulation rules, to exhibit maximal subformulas that match (7.2).

• Code generation guidelines describe how to translate the vectorizable subformulas matching (7.2) into efficient vectorized code built on top of the portable SIMD API; the remainder of the formula is implemented in standard C code.
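The code sketch announced above illustrates the basic construct (7.1): the scalar butterfly $\mathrm{DFT}_2$, i.e., $y_0 = x_0 + x_1$, $y_1 = x_0 - x_1$, becomes $\mathrm{DFT}_2 \otimes I_4$ by replacing each scalar operation with its four-way counterpart (illustrative SSE code, assuming aligned data):

    #include <xmmintrin.h>

    /* (DFT_2 x I_4) on eight reals: the scalar butterfly applied to whole
       four-way vectors; each scalar add/sub becomes one vector add/sub. */
    void dft2_tensor_i4(float y[8], const float x[8])
    {
        __m128 x0 = _mm_load_ps(x);        /* x[0..3] */
        __m128 x1 = _mm_load_ps(x + 4);    /* x[4..7] */
        _mm_store_ps(y,     _mm_add_ps(x0, x1));   /* y[0..3] = x0 + x1 */
        _mm_store_ps(y + 4, _mm_sub_ps(x0, x1));   /* y[4..7] = x0 - x1 */
    }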

Efficient utilization of short vector SIMD extensions requires unit-stride data access, but other access patterns predominate in discrete linear transform algorithms. An important example are subformulas of the form $I_n \otimes A$. A similar problem arises from the interleaved data format found in complex transforms. In the context of this thesis, all such problems are solved by applying identities from Chapter 4 and by using formula manipulations that formally modify these expressions to make them match equation (7.2) by introducing permutations $P_i$ and $Q_i$.

To allow high performance implementations it is necessary to introduce only such permutations $P_i$ and $Q_i$ as are supported efficiently by the portable SIMD API. The efficient support of permutations depends both on the type of permutation and on the underlying short vector SIMD architecture. A permutation can be negligible concerning run time, but it may also slow down the whole discrete linear transform algorithm dramatically. Implementing the newly derived formula manipulation rules in an automatic performance tuning system favors permutations that are cheap in terms of run time on the target platform. Additionally, a class of permutations is defined that can be realized efficiently on all short vector SIMD extensions and includes the permutations occurring in the considered transforms.

It is important to note that the approach presented in this chapter utilizes high-level structural information about discrete linear transform algorithms. This information is available in the formula representation, but not in a C code representation of the algorithm. For this reason, a general purpose vectorizing compiler fails, e.g., to vectorize the construct $I_\nu \otimes A$, even though it is completely vectorized using the methods of this chapter.

7.1.2 Symbolic Vectorization

First, all vectorizable constructs have to be extracted. The core elements of the normalized formulas are symbols.

Definition 7.1 (Symbol) A symbol $S_i$ is defined by

$$S_i = P_i D_i (A_i \otimes I_\nu) E_i Q_i$$
with permutation matrices $P_i$ and $Q_i$, real diagonals or vector diagonals $D_i$ and $E_i$, and an arbitrary real matrix $A_i$.

Symbols defined by Definition 7.1 are the basic constructs to be vectorized. Using the identities summarized in Chapter 4, an expression of the form
$$F = \prod_{i=1}^{n} (R_i \oplus S_i \oplus T_i)\, U_i, \qquad \nu \text{ divides the size of } R_i, \qquad (7.3)$$
has to be obtained. The constructs $R_i$, $T_i$, and $U_i$ are arbitrary formulas and $S_i$ is a symbol. In practice this normalization can be achieved manually or by means of symbolic manipulation utilizing computer algebra systems or functional languages. For instance, Spiral utilizes the computer algebra system Gap, and Fftw's codelet generator genfft is written in the functional language OCaml [66].

The constructs $R_i$, $T_i$, and $U_i$ of equation (7.3) have to be translated into standard C code as they do not match equation (7.2). For example, a formula that has no vectorizable parts degenerates to

$$F = U_1,$$
while in a completely vectorizable formula the $R_i$ and $T_i$ vanish, $U_i$ is in this case the identity, and thus $F$ matches equation (7.2). Note that the normalization given by equation (7.3) is not unique. When normalizing a given formula, it is important to preserve the structure of the original formula as much as possible, to keep the data access patterns as close to the original ones as possible and to minimize the interference with an automatic performance tuning system.

Important subformulas that become symbols $S_i$ when applying the transformation identities summarized in Chapter 4 include

$$A \otimes B, \quad A \otimes I_k, \quad I_k \otimes A, \quad \overline{A \otimes I_k}, \quad \text{and} \quad \overline{I_k \otimes A},$$
as can be seen from the manipulation rules summarized in Sections 4.2.2, 4.3.1, 4.4.1, and 4.6.1.

Furthermore, products $Q_i P_{i+1}$ of adjacent permutations may entirely or partially cancel out, leading to identity matrices. An important example of such a cancellation is the real version of a complex formula. In such formulas, Properties 4.18 and 4.33 can often be applied, leading to the cancellation

$$\bigl(I_k \otimes L_\nu^{2\nu}\bigr)\bigl(I_k \otimes L_2^{2\nu}\bigr) = I_{2k\nu}.$$

The constructs $I_k \otimes L_\nu^{2\nu}$ and $I_k \otimes L_2^{2\nu}$ become a product $Q_i P_{i+1}$ during the normalization due to the identities summarized in Section 4.6.1.

After the normalization, the symbols $S_i$ can be treated as independent formulas for which vector code is generated. The remaining formula, consisting of the respective $R_i$, $T_i$, and $U_i$, is translated into standard (scalar) C code in which the symbols $S_i$ become subroutines that are implemented utilizing the portable SIMD API and thus become vectorized code.

Example 7.1 (Symbolic Vectorization) This example shows the symbolic vectorization of $\mathrm{DFT}_{16}$. A normalized formula for $\mathrm{DFT}_{16}$ is given by
$$\mathrm{DFT}_{16} = (\mathrm{DFT}_4 \otimes I_4)\, T_4^{16}\, (I_4 \otimes \mathrm{DFT}_4)\, L_4^{16} \qquad (7.4)$$
with
$$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T_2^{4}\, (I_2 \otimes \mathrm{DFT}_2)\, L_2^{4}. \qquad (7.5)$$
The formula is first transformed into its real representation using the bar operator $(\,\overline{\cdot}\,)$ and the identities summarized in Section 4.6.1. Permutations are inserted to make the formula match equation (7.3), with the constraint that the constructs $R_i$, $T_i$ are as small as possible and $U_i$ is as close to $I_k$ as possible. The inserted permutations are factored, using identities from Section 4.4.1, such that they match the permutations that can be implemented efficiently, as outlined in the next section. The target architecture is assumed to feature a four-way short vector SIMD extension ($\nu = 4$). Formula manipulation leads to
$$\overline{\mathrm{DFT}_{16}} = \bigl(I_4 \otimes L_4^8\bigr)\bigl(\overline{\mathrm{DFT}_4} \otimes I_4\bigr)\,\overline{T_4^{16}}\,\bigl(I_4 \otimes L_2^8\bigr)\bigl(L_4^{16} \otimes I_2\bigr)\bigl(I_4 \otimes L_4^8\bigr)\bigl(\overline{\mathrm{DFT}_4} \otimes I_4\bigr)\bigl(I_4 \otimes L_2^8\bigr). \qquad (7.6)$$

The achieved formula is factored into two symbols S1 and S2

$$\overline{\mathrm{DFT}_{16}} = S_1\, S_2.$$

The first symbol is given by
$$S_1 = \bigl(I_4 \otimes L_4^8\bigr)\bigl(\overline{\mathrm{DFT}_4} \otimes I_4\bigr)\,\overline{T_4^{16}}$$
with
$$P_1 = I_4 \otimes L_4^8, \quad D_1 = I_{32}, \quad A_1 = \overline{\mathrm{DFT}_4}, \quad E_1 = \overline{T_4^{16}}, \quad Q_1 = I_{32}, \quad R_1 = I_0, \quad T_1 = I_0, \quad U_1 = I_{32},$$
and the second one by
$$S_2 = \bigl(I_4 \otimes L_2^8\bigr)\bigl(L_4^{16} \otimes I_2\bigr)\bigl(I_4 \otimes L_4^8\bigr)\bigl(\overline{\mathrm{DFT}_4} \otimes I_4\bigr)\bigl(I_4 \otimes L_2^8\bigr)$$
with
$$P_2 = \bigl(I_4 \otimes L_2^8\bigr)\bigl(L_4^{16} \otimes I_2\bigr)\bigl(I_4 \otimes L_4^8\bigr), \quad D_2 = I_{32}, \quad A_2 = \overline{\mathrm{DFT}_4}, \quad E_2 = I_{32}, \quad Q_2 = I_4 \otimes L_2^8, \quad R_2 = I_0, \quad T_2 = I_0, \quad U_2 = I_{32}.$$

Thus, the transform $\mathrm{DFT}_{16}$ is symbolically vectorized: two symbols $S_1$ and $S_2$ are required and all their parameters are defined.

7.1.3 Types of Permutations

The key to high performance for short vector SIMD implementations of symbolically vectorized formulas is the implementation of the permutations $P_i$ and $Q_i$ in Definition 7.1. These permutations can be classified with respect to their performance-relevant properties, leading to different classes of permutations featuring different types of factorizations. Considering the translation of formulas into code according to Section 4.7 makes the required memory access explicit.

Three classes of permutations are required to support both the general case and high-performance implementations: (i) permutations that can be implemented utilizing exclusively vector memory access, (ii) permutations requiring memory access of pairs of real numbers, and (iii) permutations that require access of single real numbers. The implementation of the permutations $P_i$ and $Q_i$ crucially depends on their factorization and thus on their membership in one of the three classes.

These classes are mirrored in the portable SIMD API by three types of combined load-and-permute or store-and-permute macros. If new permutations are required, the portable SIMD API can easily be extended, as any permutation falls into one of the three supported classes. In the remainder of this section, the three classes of macros are discussed and the respective factorizations are given.

Vector Memory Access Operations

These macros feature only vector memory access and in-register permutations. The permutations that can be handled using this class of macros are given by

$$(U \otimes I_\nu)(I_k \otimes W)(V \otimes I_\nu), \qquad \nu \text{ divides the size of } W, \qquad (7.7)$$
where $U$, $V$, and $W$ are permutation matrices. Equation (7.7) allows for loop code generation if the permutations $U$ and $V$ can be translated into loop code, as tensor products can be translated into loops. The class of permutations can be extended to support more general permutations if completely unrolled implementations can be used. For instance, the SPL compiler produces completely unrolled code for small subformulas. Then $I_k \otimes W$ is replaced by a direct sum of different permutation matrices $W_i$, as given by
$$(U \otimes I_\nu)\Bigl(\bigoplus_{i=0}^{k-1} W_i\Bigr)(V \otimes I_\nu), \qquad \nu \text{ divides the size of } W_i. \qquad (7.8)$$

Example 7.2 (Vector Memory Access Operations) Equation (7.7) includes all permutations needed for a vectorized implementation of the DFTs, WHTs, and two-dimensional transforms. For instance, in Example 7.1 all permutations can be factored according to equation (7.7), leading to
$$\overline{\mathrm{DFT}_{16}} = \bigl(I_4 \otimes L_4^8\bigr)\bigl(\overline{\mathrm{DFT}_4} \otimes I_4\bigr)\,\overline{T_4^{16}}\,\bigl(L_4^8 \otimes I_4\bigr)\bigl(I_2 \otimes L_4^{16}\bigr)\bigl(L_2^8 \otimes I_4\bigr)\bigl(\overline{\mathrm{DFT}_4} \otimes I_4\bigr)\bigl(I_4 \otimes L_2^8\bigr) \qquad (7.9)$$
with
$$P_1 = I_4 \otimes L_4^8, \qquad Q_2 = I_4 \otimes L_2^8,$$
$$P_2 = \bigl(I_4 \otimes L_2^8\bigr)\bigl(L_4^{16} \otimes I_2\bigr)\bigl(I_4 \otimes L_4^8\bigr) = \bigl(L_4^8 \otimes I_4\bigr)\bigl(I_2 \otimes L_4^{16}\bigr)\bigl(L_2^8 \otimes I_4\bigr).$$

By factoring $P_2$, the desired form is achieved. Now all permutations can be implemented using LOAD_L_8_2, STORE_L_8_4, and STORE_L_16_4.

Any permutation covered by equations (7.7) and (7.8) can be implemented using the appropriate macros provided by the portable SIMD API. For permutations $P_i$ the respective permutation $U$, and for permutations $Q_i$ the respective permutation $V$, is an index transformation for vector load and vector store operations; thus these permutations are handled by the values of the parameters *vi. For permutations $P_i$ the respective permutation $V$, and for permutations $Q_i$ the respective permutation $U$, is handled by renaming temporary variables and thus by the values of the parameters ti. Thus, the input and output parameters of the portable SIMD API macros implement the permutations $U$ and $V$; the permutation $W$ is implemented transparently within the respective macros. Table 7.1 summarizes the most important macros of the portable SIMD API that are vector memory access operations.

    Macro                                              Permutation W
    LOAD_VECT(t, *v)                                   I_ν
    LOAD_J(t, *v)                                      J_ν
    LOAD_L_4_2(t, *v)                                  L_2^4
    LOAD_L_8_2(t0, t1, *v0, *v1)                       L_2^8
    LOAD_L_8_4(t0, t1, *v0, *v1)                       L_4^8
    LOAD_L_16_4(t0, t1, t2, t3, *v0, *v1, *v2, *v3)    L_4^16
    STORE_VECT(*v, t)                                  I_ν
    STORE_J(*v, t)                                     J_ν
    STORE_L_4_2(*v, t)                                 L_2^4
    STORE_L_8_2(*v0, *v1, t0, t1)                      L_2^8
    STORE_L_8_4(*v0, *v1, t0, t1)                      L_4^8
    STORE_L_16_4(*v0, *v1, *v2, *v3, t0, t1, t2, t3)   L_4^16

Table 7.1: Combined vector memory access and permutation macros supported by the portable SIMD API for four-way extensions.

Macro                                      Permutation W
LOAD_VECT(t, *v)                           $I_\nu$
LOAD_J(t, *v)                              $J_\nu$
LOAD_L_4_2(t, *v)                          $L^4_2$
LOAD_L_8_2(t0, t1, *v0, *v1)               $L^8_2$
LOAD_L_8_4(t0, t1, *v0, *v1)               $L^8_4$
LOAD_L_16_4(t0, t1, t2, t3,
            *v0, *v1, *v2, *v3)            $L^{16}_4$
STORE_VECT(*v, t)                          $I_\nu$
STORE_J(*v, t)                             $J_\nu$
STORE_L_4_2(*v, t)                         $L^4_2$
STORE_L_8_2(*v0, *v1, t0, t1)              $L^8_2$
STORE_L_8_4(*v0, *v1, t0, t1)              $L^8_4$
STORE_L_16_4(*v0, *v1, *v2, *v3,
             t0, t1, t2, t3)               $L^{16}_4$

Table 7.1: Combined vector memory access and permutation macros supported by the portable SIMD API for four-way extensions.

$P_i = (U \otimes I_2)(I_k \otimes W)(V \otimes I_\nu)$   (7.10)

$Q_i = (U \otimes I_\nu)(I_k \otimes W)(V \otimes I_2)$   (7.11)

where $U$, $V$, and $W$ are permutation matrices. The size of $W$ has to be a multiple of $\nu$. For permutations $P_i$ the respective permutation $U$ and for permutations $Q_i$ the respective permutation $V$ is an index transformation for load and store operations of pairs, and thus these permutations are handled by the values of the parameters *ci. For permutations $P_i$ the respective permutation $V$ and for permutations $Q_i$ the respective permutation $U$ is handled by renaming temporary variables and thus by the values of the parameters ti. Thus, the input and output parameters of the portable SIMD API macros implement the permutations $U$ and $V$. The permutation $W$ is implemented transparently within the corresponding macros. Again, if unrolled code is acceptable, the class of supported permutations is extended by changing $I_k \otimes W$ into $\bigoplus_{i=0}^{k-1} W_i$. For two-way vector extensions, this class is the same as the vector memory access operations. On four-way short vector SIMD extensions, however, the performance of permutations requiring pairwise memory access crucially depends on the target architecture. Using SSE these permutations can be implemented efficiently, as SSE features memory access of naturally aligned pairs of single-precision floating-point numbers and these memory access instructions do not degrade performance. On the

AltiVec extension, load operations for such pairs have to be built from vector access operations and expensive general permutations. Storing pairs has to be done using permutations and expensive single element stores. Thus, on AltiVec, pairwise data access operations have to be avoided.
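As an illustration of why this class is cheap on SSE, the following hedged sketch assembles one vector register from two aligned pairs using the MOVLPS/MOVHPS-style intrinsics. The function name and signature follow the spirit of the macros in Table 7.2, not their actual implementation; on AltiVec the same effect would require a general vector permute plus extra loads, which is exactly the problem described above.

    #include <xmmintrin.h>

    /* Hypothetical sketch of a subvector (pairwise) load: one vector
       is built from two naturally aligned pairs of floats. Cheap on
       SSE; expensive on AltiVec, so avoided there. */
    static inline __m128 load_c_sketch(const float *c0, const float *c1)
    {
        __m128 t = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)c0);
        return _mm_loadh_pi(t, (const __m64 *)c1);
    }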

Example 7.3 (Subvector Memory Access Operations) In Example 7.1 the permutations $P_i$ and $Q_i$ can be chosen differently, leading to subvector memory access instead of vector memory access. Equation (7.6) can alternatively be transformed into

$\overline{\mathrm{DFT}}_{16} = (I_4 \otimes L^8_4)\big(\overline{\mathrm{DFT}_4} \otimes I_4\big)\,\bar{T}^{16}_4\,(I_4 \otimes L^8_2)(L^{16}_4 \otimes I_2)(I_4 \otimes L^8_4)\big(\overline{\mathrm{DFT}_4} \otimes I_4\big)(I_4 \otimes L^8_2)$

leading to

$P_1 = I_4 \otimes L^8_4$
$Q_1 = (I_4 \otimes L^8_2)(L^{16}_4 \otimes I_2)$
$P_2 = I_4 \otimes L^8_4$
$Q_2 = I_4 \otimes L^8_2.$

When applying this factorization, the intermediate permutation $(I_4 \otimes L^8_2)(L^{16}_4 \otimes I_2)(I_4 \otimes L^8_4)$ is distributed differently ($Q_1$ and $P_2$ are chosen differently) and the factorization given in Example 7.2 cannot be applied any more. The permutation $Q_1$ requires a subvector memory access operation supported by the macro LOAD_L_8_2_C, and $L^{16}_4 \otimes I_2$ becomes a readdressing of subvectors. As explained above, subvector memory access has the potential to slow down codes and thus has to be avoided. Example 7.1 and this example show the degree of freedom in symbolically vectorizing constructs. Depending on the factorization chosen for $P_i$ and $Q_i$, vector memory access can be achieved or subvector memory access is required.

Table 7.2 summarizes the most important macros of the portable SIMD API featuring subvector memory access.

General Permutations

Any permutation not covered by the two classes above has to be implemented by loading $\nu$ elements into the vector register from, or storing $\nu$ elements to, arbitrary memory locations. As direct data transfer between scalar and vector floating-point registers is not supported by most short vector SIMD extensions (see Sections 3.2 to 3.4), the vector has to be assembled or disassembled in memory (usually on the stack) using floating-point loads and stores. The assembled vector has to be transferred between memory and the vector register file using a vector memory access, i. e., data is transferred between memory and the register file twice. Thus, this type of permutation degrades performance on virtually all short vector SIMD extensions and has to be avoided at any cost.

Macro                                          Permutation W
LOAD_C(t, *c0, *c1)                            $I_\nu$
LOAD_J_C(t, *c0, *c1)                          $J_4$
LOAD_L_4_2_C(t, *c0, *c1)                      $L^4_2$
LOAD_L_8_2_C(t0, t1, *c0, *c1, *c2, *c3)       $L^8_2$
LOAD_L_8_4_C(t0, t1, *c0, *c1, *c2, *c3)       $L^8_4$
STORE_C(*c0, *c1, t)                           $I_\nu$
STORE_J_C(*c0, *c1, t)                         $J_4$
STORE_L_4_2_C(*c0, *c1, t)                     $L^4_2$
STORE_L_8_2_C(*c0, *c1, *c2, *c3, t0, t1)      $L^8_2$
STORE_L_8_4_C(*c0, *c1, *c2, *c3, t0, t1)      $L^8_4$

Table 7.2: Subvector memory access operations supported by the portable SIMD API for four-way short vector SIMD extensions.

Example 7.4 (General Permutations) The macro LOAD_R_4 loads four single-precision floating-point numbers from arbitrary memory locations into a local array of four floating-point variables using standard C statements. This array is aliased to a vector variable and thus the C compiler places the array properly aligned on the stack. After the data is stored into the array, the whole array is loaded into a vector register using one vector memory access. Thus, four FPU loads, four FPU stores, and one vector load are required to implement this macro. Consider a general permutation of 16 single-precision floating-point numbers that requires four such macros. The cost introduced by this permutation can only be amortized by a large number of vector operations. See Appendix C for the actual implementation of this macro.
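The following minimal sketch illustrates the double memory traffic described in this example. It is an assumption-level illustration in the spirit of LOAD_R_4, not the implementation from Appendix C; the alignment attribute shown is GCC-specific.

    #include <xmmintrin.h>

    /* Sketch of a gather-style load: four scalar loads into an aligned
       stack array, then one vector load, i.e., the data crosses the
       memory interface twice. */
    static inline __m128 load_r_4_sketch(const float *p0, const float *p1,
                                         const float *p2, const float *p3)
    {
        __attribute__((aligned(16))) float tmp[4];
        tmp[0] = *p0; tmp[1] = *p1;
        tmp[2] = *p2; tmp[3] = *p3;
        return _mm_load_ps(tmp);
    }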

Formula manipulation is used to produce formulas that, whenever possible, consist only of permutations allowing vector memory access. The goal is to transform formulas featuring permutations of the second and third type into formulas featuring only permutations of the first type, with the additional condition that the global structure of the formula is changed as little as possible. This transformation is the most crucial part in obtaining performance portable short vector SIMD code. The degree of freedom in normalizing a given formula adds another application for search techniques in Spiral's formula generator.

7.1.4 Code Generation for Symbols

Symbols as defined by equation (7.1) are designed to be implemented as follows. Each symbol consists of three logical stages whose implementations can be interleaved by reordering operations; however, for any element of the data vector the order of operations has to be preserved. (i) Load phase: $x' = E_i Q_i\, x$. (ii) Computation phase: $y' = (A \otimes I_\nu)\, x'$. (iii) Store phase: $y = P_i D_i\, y'$.

Definition 7.2 (Vector Terminal) A vector terminal is the subformula $A$ in $A \otimes I_\nu$.

Optimized code for a vector terminal $A \otimes I_\nu$ can be generated by producing optimized scalar code for $A$ and then replacing all scalar arithmetic operations in the generated code by their respective vector operations.
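The following sketch illustrates this replacement for the smallest case, a $\mathrm{DFT}_2$ butterfly: the scalar code for $A$ and the corresponding code for $A \otimes I_4$ on SSE have identical dataflow, only the operations are four-way. The function names are illustrative, not taken from genfft or the SPL compiler.

    #include <xmmintrin.h>

    /* scalar code for A = DFT_2 */
    void dft2_scalar(float *y, const float *x)
    {
        y[0] = x[0] + x[1];
        y[1] = x[0] - x[1];
    }

    /* same dataflow for DFT_2 (x) I_4: every scalar operation is
       replaced by its SSE counterpart */
    void dft2_vector(float *y, const float *x)
    {
        __m128 x0 = _mm_load_ps(x);
        __m128 x1 = _mm_load_ps(x + 4);
        _mm_store_ps(y,     _mm_add_ps(x0, x1));
        _mm_store_ps(y + 4, _mm_sub_ps(x0, x1));
    }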

Example 7.5 (Vector Terminal) The constructs

$\overline{\mathrm{DFT}_m} \otimes I_\nu \quad\text{and}\quad \overline{\mathrm{DFT}_m \otimes I_{n/\nu}} \otimes I_\nu$

are vector terminals that occur when applying the short vector Cooley-Tukey rules developed in Section 7.3.

Any construct that can be implemented using exclusively (i) vector arithmetic and (ii) vector memory access macros can be implemented highly efficiently. All such constructs required in the scope of this thesis are given by the following definition.

Definition 7.3 (Vector Construct) A vector construct matches one of the following constructs:

$A \otimes I_\nu, \qquad \bigoplus_{i=0}^{k} \bar{D}_i', \qquad (U \otimes I_\nu)\Big(\bigoplus_{i=0}^{k-1} W_i\Big)(V \otimes I_\nu) \ \text{ with } \nu \mid \text{size of } W_i,$

with $A$ being an arbitrary real matrix, $U$, $V$, and $W_i$ permutation matrices, and $D_i$ complex $\nu \times \nu$ matrices.

Example 7.6 (Vector Construct) Any construct $A \otimes I_\nu$ is a vector construct as it only requires vector arithmetic operations and vector loads and stores. All permutations covered by vector memory access operations are vector constructs. For instance,

$L^4_2 \otimes L^8_2 = (L^4_2 \otimes I_8)(I_4 \otimes L^8_2)$

is a vector construct as it can be implemented using LOAD_L_8_2.

The Load Phase and Store Phase

The computation $x' = E_i Q_i\, x$ is joined with an initial load operation for each data element. $E_i Q_i$ is decomposed into combined load and permutation operations provided by the portable SIMD API and scaling operations that immediately follow these load operations. For each element that is loaded from memory according to the formula, a whole vector register has to be loaded using the appropriate macro of the portable SIMD API. These combined load and permute operations decompose the construct $Q_i$. $E_i$ is implemented by the scaling operations. The operation $y = P_i D_i\, y'$ is implemented accordingly.

The Computation Phase

$A \otimes I_\nu$ is the most natural construct that can be mapped onto short vector SIMD hardware. It is obtained by replacing any scalar entry $a_{i,j}$ of $A$ by $\mathrm{diag}(a_{i,j}, \dots, a_{i,j})$ ($\nu$ times). Thus vector code is obtained by replacing any scalar arithmetic operation by the respective vector operation. This property is important as it allows utilizing existing code generators for scalar code like Fftw's genfft and Spiral's SPL compiler. These compilers are extended to support the generation of short vector SIMD code. Specifically, support for the portable SIMD API and for the different types of permutations is required in addition to infrastructural changes. The extension of genfft is described in Section 8.1.1 and the extension of the SPL compiler is described in Section 8.2.1.

Synthesis

The load, store, and computation phases have to be overlapped to gain stack access locality. In this overlapping, the order of operations has to be preserved for each data element. Immediately before the first data element loaded by a specific macro required by $E_i Q_i$ is used within the computation phase, the respective load macro and the scaling operations for this data element are issued. These macros load and scale a set of data elements including the required one; the number of additionally preloaded data elements depends on the macro's parameters. Thus, for these elements the load phase is finished and the computation phase can be started. In the computation phase, immediately after the final result is computed for all elements required to issue a specific store macro required by $P_i D_i$, this macro is issued as the next instruction. Thus, for these elements, the application of the symbol has been computed.
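The shape of the resulting code can be sketched as follows (SSE intrinsics, illustrative names; the scaling vector e0 stands in for a part of $E_i$). Loads are issued immediately before their first use, stores immediately after the last computation that contributes to them.

    #include <xmmintrin.h>

    /* Sketch of an interleaved symbol implementation. */
    void symbol_sketch(float *y, const float *x, __m128 e0)
    {
        __m128 t0 = _mm_load_ps(x);      /* load phase (part of Q_i)     */
        t0 = _mm_mul_ps(t0, e0);         /* scaling by E_i               */
        __m128 t1 = _mm_load_ps(x + 4);  /* issued just before first use */
        __m128 s0 = _mm_add_ps(t0, t1);  /* computation phase            */
        __m128 s1 = _mm_sub_ps(t0, t1);
        _mm_store_ps(y, s0);             /* store issued immediately     */
        _mm_store_ps(y + 4, s1);         /* after s1 is available        */
    }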

7.2 FFTs on Short Vector Hardware

Any Cooley-Tukey FFT algorithm features characteristics that require further special formula manipulation rules to exhibit formulas that consist of vector constructs. In addition, the global structure of the algorithm should not be altered too much in order to keep the Cooley-Tukey splitting's data access locality. The formal method introduced in this section extends the method of the last section and is based on the standard Cooley-Tukey recursion as given by Theorem 5.1, the vector Cooley-Tukey FFT (the four-step FFT algorithm) as given by Theorem 5.3, and the Cooley-Tukey vector recursion given by Theorem 5.5. It is shown that the short vector Cooley-Tukey rule set leads to formulas for FFTs which consist completely of vector constructs and thus can be mapped to efficient short vector SIMD code. The approach of this section is the foundation of the implementation of performance portable short vector FFTs as discussed in Sections 8.1 and 8.2.

The newly developed short vector Cooley-Tukey recursion rule set and the respective base cases (both developed in Section 7.3) depend merely on the vector length $\nu$ of the short vector SIMD extension. Implementing DFTs using short vector SIMD instructions poses a set of challenges that the programmer and algorithm developer have to meet. The special properties of short vector SIMD extensions and the goal of providing one general method for treating all extensions (regardless of their actual vector length $\nu$ and any other machine specifics) are the major challenges to address.

It is important to note that all operations have to be expressed in real arithmetic, as (i) only real arithmetic is supported by the hardware, and (ii) short vector SIMD extensions pose stringent requirements on the location (alignment) and stride (unit stride) of real numbers in memory to be accessed efficiently. Complex arithmetic operations (based on the interleaved complex format) implemented straightforwardly using real arithmetic violate the unit stride memory access requirement and thus lead to an enormous performance breakdown. These restrictions cannot be expressed in the standard Kronecker product notation for FFT algorithms where the entries of matrices are complex numbers.

In both automatic performance tuning systems for discrete linear transforms that are extended in this thesis, Spiral and Fftw, all Cooley-Tukey FFT based algorithms have a recursive structure (as opposed to the iterative stage-wise computation found in traditional implementations) to exhibit data locality, and are built from only a few basic constructs. It turns out that the construct imposing the most annoying difficulties is the stride permutation

$\overline{L^{mn}_m} = L^{mn}_m \otimes I_2,$   (7.12)

especially in conjunction with vector arithmetic, in particular for $\nu = 2$. The permutations $L^{2k}_2$ and $L^{2k}_k$ introduced by the interleaved complex format complicate the situation. Equation (7.12) occurs on every recursion level in all Cooley-Tukey factorizations. The combination of multiple recursion steps may produce digit permutations such as the bit-reversal permutation. The source of the problem is that for arbitrary $\nu$ one cannot find a factorization of $L^{mn}_m$ that (i) consists of vector constructs only, and (ii) does not require nontrivial transformations involving adjacent constructs. The additional condition that vector arithmetic has to treat all $\nu$ elements of a vector equally introduces a further difficulty.
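As a worked instance of (7.12), consider $m = n = 2$ in the interleaved complex format, where a complex vector $(c_0, c_1, c_2, c_3)$ is stored as $(r_0, i_0, r_1, i_1, r_2, i_2, r_3, i_3)$:

$\overline{L^4_2} = L^4_2 \otimes I_2: \quad (r_0, i_0, r_1, i_1, r_2, i_2, r_3, i_3) \;\mapsto\; (r_0, i_0, r_2, i_2, r_1, i_1, r_3, i_3).$

For $\nu = 2$ each pair $(r_k, i_k)$ is exactly one machine vector and the permutation moves whole vectors; for $\nu = 4$ the pairs have to cross vector boundaries, which is the source of the difficulties discussed here.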

In the following, possibilities are discussed how to resolve the problems introduced by the fact that the DFT is a complex-to-complex transform and features stride permutations. Both advantages and disadvantages of the various approaches are summarized. A subset of vector constructs is identified that serves as the building blocks and base cases for short vector Cooley-Tukey FFTs. Finally, the respective short vector Cooley-Tukey rule set is derived. There are three fundamentally different approaches to resolve the stride permutation issue.

• The permutation $L^{mn}_m$ is performed explicitly in a separate computation stage, requiring an iterative approach. One can use high-performance implementations of transpositions (blocked or cache oblivious) and machine specific versions that utilize special instructions.

• The permutation $L^{mn}_m$ is decomposed into subvector memory access operations where pairs of real numbers are accessed. Equation (7.12) shows that this is always possible. To reach satisfactory performance levels, it is required that either (i) $\nu = 2$, or (ii) the hardware is able to efficiently access two elements at a subvector level (e. g., as provided by SSE).

• The permutation $L^{mn}_m$ is factored into a sequence of permutations within blocks and permutations of blocks (with a block size being a multiple of $\nu$), at the cost of possibly requiring changes of adjacent constructs as well. These permutations can be handled using the portable SIMD API as described in Section 7.1.4. This approach requires only a minimum of data reorganization (all done in registers). All memory access operations are accomplished using vector operations. This method can be used for any value of $\nu$.

This thesis focuses on the third approach but briefly overviews the other two approaches as well. The remainder of this section discusses the first two approaches while the next section discusses the short vector Cooley-Tukey rule set.

7.2.1 The Cooley-Tukey Recursion

The starting point for the discussion of the standard Cooley-Tukey recursion is

$\mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m$

mapped to real arithmetic, i. e.,

$\overline{\mathrm{DFT}}_{mn} = \overline{(\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m}.$   (7.13)

It is assumed that $\nu$ divides $n$ and $m$. The standard way of translating Theorem 5.1 into real code using the interleaved complex format corresponds to a straightforward application of the identities given in Section 4.6.1. Distributing the bar operator ($\bar{\ }$) over the factors in equation (7.13) leads to

$\overline{\mathrm{DFT}}_{mn} = \overline{(\mathrm{DFT}_m \otimes I_n)}^{\,I_m \otimes L^{2n}_2}\;\bar{T}^{mn}_n\;\overline{(I_m \otimes \mathrm{DFT}_n)}\,(L^{mn}_m \otimes I_2).$   (7.14)

This formula features constructs that cannot be directly implemented using only vector memory access.

• The construct $L^{mn}_m \otimes I_2$ requires pairwise memory access. Thus, for $\nu > 2$ vector memory access cannot be used.

• The construct $\overline{I_m \otimes \mathrm{DFT}_n}$ is not a vector construct: it requires pairwise memory access at stride $m$, which can introduce cache problems for large sizes of $m$, especially for $m = 2^k$, and the pairwise memory access itself is slow.

• The construct $\bar{T}^{mn}_n$ does not match equation (4.6) and thus is not a vector diagonal. To allow for vector arithmetic, further formula manipulation is required. Alternatively, pairwise memory access macros can be used, which degrade the performance.

• The construct $\overline{\mathrm{DFT}_m \otimes I_n}$ has to be factored into $\big(\overline{\mathrm{DFT}_m \otimes I_{n/\nu}}\big) \otimes I_\nu$ to obtain a vector construct.

To obtain vector memory access for the conjugation permutations, further factorization of $I_m \otimes L^{2n}_2$ and $I_m \otimes L^{2n}_n$ is required. Using Properties 4.21 and 4.22,

$I_m \otimes L^{2n}_2 = I_m \otimes \big(L^{2n/\nu}_2 \otimes I_\nu\big)\big(I_{n/\nu} \otimes L^{2\nu}_2\big)$
$I_m \otimes L^{2n}_n = I_m \otimes \big(I_{n/\nu} \otimes L^{2\nu}_\nu\big)\big(L^{2n/\nu}_{n/\nu} \otimes I_\nu\big)$

is obtained, which allows the use of vector memory access macros. However, the factor $L^{2n/\nu}_2 \otimes I_\nu$ stores and the factor $L^{2n/\nu}_{n/\nu} \otimes I_\nu$ loads logically associated vectors (i. e., the real and imaginary parts of a complex vector) at scalar stride $n$, which can lead to cache conflicts for large values of $n$, especially for $n = 2^k$.

Cooley-Tukey FFTs with Subvector Memory Access

As outlined above, equation (7.13) cannot be mapped efficiently to short vector SIMD hardware without further manipulation. But using the identities introduced in Chapter 4 and the ideas outlined above, it is possible to modify equation (7.14) to obtain a better structure, leading to the subvector Cooley-Tukey FFT.

Theorem 7.1 (Subvector Cooley-Tukey FFT) For ν | m and ν | n

$\overline{\mathrm{DFT}}_{mn} = \big(\overline{\mathrm{DFT}_m \otimes I_{n/\nu}} \otimes I_\nu\big)^{I_{mn/\nu} \otimes L^{2\nu}_\nu}\;\bar{T}'^{mn}_n\;\Big(I_{m/\nu} \otimes \big(\overline{\mathrm{DFT}_n} \otimes I_\nu\big)^{(I_n \otimes L^{2\nu}_\nu)(L^{n\nu}_n \otimes I_2)}\Big)\big(L^{mn}_m \otimes I_2\big).$   (7.15)

Theorem 7.1 can be implemented using both subvector memory access and vector memory access and leads to moderate to good performance. The difficulty of $L^{mn}_m \otimes I_2$ remains, and the conjugation with $L^{n\nu}_n \otimes I_2$ is introduced. Both constructs require subvector memory access for $\nu > 2$. Equation (7.15) features nearly the same data access pattern as the original Cooley-Tukey FFT rule and changes the structure of the formula much less than the short vector Cooley-Tukey FFT. Thus, this formula is a trade-off between structural change and vector memory access. Depending on the target machine, this rule can be an interesting alternative, especially for $\nu = 2$. To support arbitrary vector lengths, the formal vectorization developed in Section 7.1 is extended. FFTs for problem sizes which are multiples of $\nu$ can be computed using the vector FPU exclusively. Problems of other sizes are computed partly using the scalar FPU. This is achieved by applying Property 4.10, which decomposes any tensor product

$A \otimes I_{k\nu+l}$ into a vector part $(A \otimes I_k) \otimes I_\nu$ and a scalar part $A \otimes I_l$.
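As a small illustration (a sketch; the permutations required by Property 4.10 to glue the two parts together are omitted here): for $\nu = 4$ and a tensor product $A \otimes I_{11}$, i. e., $k = 2$ and $l = 3$, the vector part $(A \otimes I_2) \otimes I_4$ covers eight of the eleven columns with four-way vector arithmetic, while the scalar part $A \otimes I_3$ covers the remaining three columns with scalar arithmetic.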

Applying this idea to (7.15) with m = m1ν + m2 and n = n1ν + n2 leads to the general subvector Cooley-Tukey FFT.

Theorem 7.2 (General Subvector Cooley-Tukey FFT) For $m = m_1\nu + m_2$ and $n = n_1\nu + n_2$

$\overline{\mathrm{DFT}}_{mn} = P_1\Big(\big(\overline{\mathrm{DFT}_m} \otimes I_{n_1}\big) \otimes I_\nu \;\oplus\; \overline{\mathrm{DFT}_m} \otimes I_{n_2}\Big)P_2\;\bar{T}^{mn}_n\;P_3\Big(\big(I_{m_1} \otimes \overline{\mathrm{DFT}_n}\big) \otimes I_\nu \;\oplus\; I_{m_2} \otimes \overline{\mathrm{DFT}_n}\Big)\big(L^{mn}_m \otimes I_2\big)$

with non-digit permutations $P_i$.

This method leads to high SIMD utilization while supporting arbitrary problem sizes. The method summarized by Theorems 7.1 and 7.2 is called Cooley-Tukey FFT for subvectors, as subvector memory access is required. It can be applied for arbitrary problem sizes. Experimental results of a short vector SIMD extension of Fftw based on Theorems 7.1 and 7.2 are summarized in Section 8.1.

Internal Vectorization of $\mathrm{DFT}_N$

By restricting the machines to two-way short vector SIMD architectures ($\nu = 2$), the real arithmetic stride permutation appears as a perfect vector operation $L^{mn}_n \otimes I_2$. By computing the subproblems $\mathrm{DFT}_m$ and $\mathrm{DFT}_n$ using vector instructions (i. e., not vectorizing the tensor product), one can get high performance implementations for the special case of $\nu = 2$. This approach depends crucially on the fact that a complex number can be expressed as a pair of real numbers; it cannot be extended to arbitrary $\nu$ and is beyond the scope of this thesis. Fftw-gel, a proprietary Fftw extension for SSE 2 and 3DNow! using this idea, can be found in Kral [64].

7.2.2 The Vector Cooley-Tukey Rule

The Stockham autosort FFT algorithm (see Section 5.2.3) was designed for classical vector processors (as described in Section 3.5) with typically $\nu \ge 64$ and the ability to load vectors at non-unit stride, but featuring a rather high startup cost for vector operations. As short vector SIMD extensions require unit-stride memory access, this algorithm cannot be used for short vector SIMD architectures. In addition, the Stockham autosort algorithm is an iterative algorithm. Alternatively, the vector Cooley-Tukey FFT (also called four-step algorithm)

$\mathrm{DFT}_{mn} = \underbrace{(\mathrm{DFT}_m \otimes I_n)}_{4}\,\underbrace{T^{mn}_n}_{3}\,\underbrace{L^{mn}_m}_{2}\,\underbrace{(\mathrm{DFT}_n \otimes I_m)}_{1}$

as described in Section 5.2.1 was also designed for classical vector processors, and $m \approx n$ is the usual choice. The vector Cooley-Tukey algorithm features unit-stride memory access (for complex numbers) and computes the stride permutation explicitly as one of its four stages. The four stages are (i) a vector FFT to compute $\mathrm{DFT}_n \otimes I_m$, (ii) interpretation of the data vector as an $m \times n$ complex matrix and computing its transposition, (iii) a pointwise (vector) multiplication of the whole data vector with $T^{mn}_n$, and (iv) a vector FFT to compute $\mathrm{DFT}_m \otimes I_n$. This corresponds to an iterative interpretation of the four constructs in Theorem 5.3.

In the case of short vector SIMD extensions, again the real arithmetic view is required. Straightforward application of the bar operator ($\bar{\ }$) to Theorem 5.3 leads to

$\overline{\mathrm{DFT}}_{mn} = \overline{\mathrm{DFT}_m \otimes I_n}^{\,I_m \otimes L^{2n}_2}\;\bar{T}^{mn}_n\;\big(L^{mn}_m \otimes I_2\big)\;\overline{\mathrm{DFT}_n \otimes I_m}^{\,I_n \otimes L^{2m}_2}.$   (7.16)

Although, using

$\overline{\mathrm{DFT}_m \otimes I_n} = \big(\overline{\mathrm{DFT}_m \otimes I_{n/\nu}}\big) \otimes I_\nu \quad\text{and}\quad \overline{\mathrm{DFT}_n \otimes I_m} = \big(\overline{\mathrm{DFT}_n \otimes I_{m/\nu}}\big) \otimes I_\nu,$

the computation phases are vector constructs, the twiddle factor matrix $\bar{T}^{mn}_n$ does not match equation (4.18), and the complex stride permutation $L^{mn}_m \otimes I_2$ is not a vector construct for $\nu \ne 2$. In addition, the conjugation by $I_m \otimes L^{2n}_2$ and $I_n \otimes L^{2m}_2$ introduces large strides when accessing corresponding vectors of real and imaginary parts, which can result in cache problems. Thus, this formula cannot be used for efficient SIMD implementations without further modification.

Using the identities summarized in Chapter 4, the vector Cooley-Tukey FFT can be modified such that it is better suited for short vector SIMD implementation. This results in a short vector SIMD version of the vector Cooley-Tukey algorithm for real arithmetic.

Theorem 7.3 (Short Vector 4-Step FFT) For $\nu \mid m$ and $\nu \mid n$

$\overline{\mathrm{DFT}}_{mn} = \big(\overline{\mathrm{DFT}_m \otimes I_{n/\nu}} \otimes I_\nu\big)^{I_{mn/\nu} \otimes L^{2\nu}_2}\;\bar{T}'^{mn}_n\;\big(I_{m/\nu} \otimes L^{n}_\nu \otimes I_2 \otimes I_\nu\big)\Big(I_{mn/\nu^2} \otimes \big(L^{\nu^2}_\nu \otimes I_2\big)^{I_\nu \otimes L^{2\nu}_2}\Big)\big(L^{mn/\nu}_{m/\nu} \otimes I_2 \otimes I_\nu\big)\;\big(\overline{\mathrm{DFT}_n \otimes I_{m/\nu}} \otimes I_\nu\big)^{I_{mn/\nu} \otimes L^{2\nu}_2}.$

However, Theorem 7.3 still features three drawbacks: (i) the computation requires at least three passes through the data, (ii) the stride permutation is done explicitly, and (iii) Theorem 7.3 exhibits many register-register permutations.

7.3 A Short Vector Cooley-Tukey Recursion

This section introduces a short vector Cooley-Tukey recursion, which (i) breaks down a DFT computation to vector constructs and vector terminals exclusively, (ii) features additional degrees of freedom to allow for searching for the best implementation on a given platform, and (iii) allows for a recursive computation and does not introduce artificial stride problems. The new recursion meets the requirements of the automatic performance tuning systems Spiral and Fftw and has been included as a short vector SIMD extension into these systems. Run time results are presented in Sections 8.1 and 8.2. The short vector Cooley-Tukey recursion is based on three classes of vector constructs: (i) vector terminals, (ii) permutations, and (iii) twiddle factors.

Vector Terminals

The only vector terminals occurring in the new Cooley-Tukey recursion are of the form

$\overline{\mathrm{DFT}_m} \otimes I_\nu \quad\text{or}\quad \overline{\mathrm{DFT}_m \otimes I_{n/\nu}} \otimes I_\nu.$

In the implementations summarized in Chapter 8 these two constructs are further optimized using the automatic performance tuning facilities provided by Spiral and Fftw.

Permutations

Only permutations are required that can be factorized according to equation (7.7). Thus, the permutations $W$ required are

$I_\nu, \quad L^{2\nu}_2, \quad L^{2\nu}_\nu, \quad\text{and}\quad L^{\nu^2}_\nu,$

which are standard permutations supported by special instructions on most short vector SIMD architectures. For $\nu = 4$ these are exactly the permutations $L^8_2$, $L^8_4$, and $L^{16}_4$ implemented by the combined load and store macros of Table 7.1.

Twiddle Factors

Twiddle factors are diagonal matrices with complex numbers as entries. According to Section 4.6.1 such diagonals can be transformed into special vector constructs called vector diagonals by the application of the bar-prime operator ($\bar{\ }'$). Thus, whenever twiddle factors are used in the context of the new short vector Cooley-Tukey rule set,

$\bar{T}'^{mn}_n = \overline{T^{mn}_n}^{\,I_{mn/\nu} \otimes L^{2\nu}_\nu}$

is used. All these constructs can be implemented efficiently on all current short vector SIMD extensions using the portable SIMD API defined in Chapter 6. As their size is always a multiple of $\nu$ and they are composed by the recursive rules in the right way, the required aligned memory access is guaranteed on a formal level.

7.3.1 Short Vector Cooley-Tukey Rules

This section introduces the set of rules needed to recursively decompose $\mathrm{DFT}_N$ into vector constructs and vector terminals. These rules are applicable for any $N = \nu^2 N_1$; thus they are not restricted to powers of two and can be used on all current short vector SIMD extensions. The rule set offers two types of degrees of freedom: (i) the way the vector terminals are obtained using the rules, and (ii) the way the vector terminals are further expanded.

Within the short vector SIMD extension developed for the Spiral system, these rules are applied in the extended version of the SPL compiler as described in Section 8.2. In the short vector SIMD extension developed for Fftw, these rules are implemented as recursive functions which manage the call sequence and parameter values for the newly developed vector codelets computing the action of vector terminals. See Section 8.1 for details. The first rule is given by

$\overline{\mathrm{DFT}}_{mn} = \underbrace{\overline{(\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n}}_{(a)}\;\underbrace{\overline{(I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m}}_{(b)}, \qquad \nu \mid m, n.$   (7.17)

This is the "entry rule", as equation (7.17) decomposes an arbitrary $\mathrm{DFT}_{mn}$ with $\nu \mid m$ and $\nu \mid n$ exclusively into subformulas that are either vector terminals, vector diagonals, or formulas that are further handled by the rule set. $\overline{\mathrm{DFT}}_{mn}$ is decomposed into two constructs, (a) and (b), which are both handled further by the short vector Cooley-Tukey rules. Note that the DFT is written as a real matrix, and Property 4.28 has been applied to split the formula into two constructs. The twiddle matrix $T^{mn}_n$ can be associated with either construct (a) or (b), which is another degree of freedom. In the following it is associated with (a) to conform to Fftw's definition of codelets.1 Both constructs are further decomposed using the short vector Cooley-Tukey rule set. Construct (a) is matched by

$\overline{(\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n}^{\,I_{mn/\nu} \otimes L^{2\nu}_2} = \Big(\underbrace{\overline{\mathrm{DFT}_m \otimes I_{n/\nu}}}_{(c)} \otimes I_\nu\Big)\;\bar{T}'^{mn}_n, \qquad \nu \mid n.$   (7.18)

This rule consists only of vector terminals and vector diagonals and thus equa- tion (7.18) is not further expanded using short vector Cooley-Tukey rules. The construct (c) in (7.18) is a vector terminal and is further handled by the target systems Spiral and Fftw. The most important degree of freedom is intro- duced by construct (b) in (7.17) which is matched by two rules in the short vector Cooley-Tukey rule set which are discussed in the remainder of this section. The terminal rule for construct (b) in (7.17) is discussed first. This rule is built from vector constructs exclusively and thus provides for efficient implementation on all current short vector SIMD extensions. The required decomposition of the stride permutation is given by 2 kn kn n ν ν k ⊗ n k ⊗ ⊗ ν n ⊗ n ⊗ ν ⊗ ν . (I DFT )Lk = I (Lν I ) I ν Lν (DFT I ) L k I (7.19) ν ν

Equation (7.19) follows from Property 4.26 and shows the decomposition of the stride permutation $L^{kn}_k$ using complex formulas. This is the first of two steps; it breaks the stride permutation into two parts and partially turns parallel

1Within the Spiral short vector SIMD implementation this degree of freedom is used.

Kronecker factors into vector Kronecker factors. This moves two factors of the stride permutation factorization to the other side. The permutations

$L^{kn/\nu}_{k/\nu} \otimes I_\nu \quad\text{and}\quad L^{n}_\nu \otimes I_\nu$

permute blocks of size $\nu$, while the permutation

$I_{n/\nu} \otimes L^{\nu^2}_\nu$

is a permutation within blocks of size $\nu^2$. In a second step, the bar operator is applied. The identities from Section 4.6 are used to transform the real arithmetic formulas into the required shape

$\overline{(I_k \otimes \mathrm{DFT}_n)\, L^{kn}_k} = \big(I_{kn/\nu} \otimes L^{2\nu}_\nu\big)\Big(I_{k/\nu} \otimes \big(L^{2n}_\nu \otimes I_\nu\big)\big(I_{2n/\nu} \otimes L^{\nu^2}_\nu\big)\big(\underbrace{\overline{\mathrm{DFT}_n}}_{(f)} \otimes I_\nu\big)\Big)\big(L^{kn/\nu}_{k/\nu} \otimes L^{2\nu}_2\big), \qquad \nu \mid k, n.$   (7.20)

Construct (f) again is a vector terminal which is handled by the target systems Spiral and Fftw. Equation (7.20) is one of the key elements in achieving high-performance short vector FFTs. When only applying equations (7.18) and (7.20) to equation (7.17), a formula is obtained that completely decomposes a DFT into vector constructs and vector terminals, leading to the short vector Cooley-Tukey FFT.
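A small sanity check of the underlying permutation factorization, under the stride-permutation convention used here and with the trivial factors dropped: for $\nu = 2$, $k = 4$, $n = 2$ the complex permutation part of (7.19) specializes to

$L^8_4 = (I_2 \otimes L^4_2)(L^4_2 \otimes I_2).$

Applied to $(x_0, \dots, x_7)$, the factor $L^4_2 \otimes I_2$ permutes blocks of two elements, giving $(x_0, x_1, x_4, x_5, x_2, x_3, x_6, x_7)$, and $I_2 \otimes L^4_2$ then permutes within each half, giving $(x_0, x_4, x_1, x_5, x_2, x_6, x_3, x_7)$, i. e., the stride-four read realized by $L^8_4$.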

Theorem 7.4 (Short Vector Cooley-Tukey FFT) Any $\mathrm{DFT}_N$ with $N = \nu^2 N_1$ can be transformed into a formula built exclusively from vector constructs by using the following equation.

$\overline{\mathrm{DFT}}_{\nu^2 mn} = \big(I_{\nu mn} \otimes L^{2\nu}_\nu\big)\big(\overline{\mathrm{DFT}_{\nu m} \otimes I_n} \otimes I_\nu\big)\;\bar{T}'^{\nu^2 mn}_{\nu n}\;\Big(I_m \otimes \big(L^{2\nu n}_\nu \otimes I_\nu\big)\big(I_{2n} \otimes L^{\nu^2}_\nu\big)\big(\overline{\mathrm{DFT}_{\nu n}} \otimes I_\nu\big)\Big)\big(L^{\nu mn}_m \otimes L^{2\nu}_2\big).$

The only degree of freedom in Theorem 7.4 is the choice of $m$ and $n$. However, construct (b) can also be expanded by

$\overline{(I_k \otimes \mathrm{DFT}_{mn})\, L^{kmn}_k} = \Big(I_k \otimes \underbrace{\overline{(\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n}}_{(d)}\Big)\big(L^{km}_k \otimes I_{2n}\big)\Big(I_m \otimes \underbrace{\overline{(I_k \otimes \mathrm{DFT}_n)\, L^{kn}_k}}_{(e)}\Big)\big(L^{mn}_m \otimes I_{2k}\big).$   (7.21)

Equation (7.21) is a complex version of Theorem 5.5 and consists of two constructs. Construct (d) again is matched by equation (7.18). Construct (e) has the same structure as construct (b) and thus again offers the freedom to choose either equation (7.20) or (7.21) for further expansion. Equation (7.21) has a strong impact on data locality. The important degree of freedom is the choice of the recursion level where (7.20) is used instead of (7.21) and, accordingly, where switching to vector terminals takes place. In addition, all intermediate permutations $L^{2\nu}_\nu$ and $L^{2\nu}_2$ introduced by the conjugations cancel (the two permutations are mutually inverse), which is very important to avoid additional data reorganization steps. Applying these properties to equation (7.20) and to equation (7.21) leads to the following rule set.

Theorem 7.5 (Short Vector Cooley-Tukey Rule Set) Any $\mathrm{DFT}_N$ with $N = \nu^2 N_1$ can be transformed into a formula built exclusively from vector constructs by using the following equations.

$\overline{\mathrm{DFT}}_{\nu^2 mn} = \overline{(\mathrm{DFT}_{\nu m} \otimes I_{\nu n})\, T^{\nu^2 mn}_{\nu n}}\;\big(I_{\nu mn} \otimes L^{2\nu}_\nu\big)\big(I_{\nu mn} \otimes L^{2\nu}_2\big)\;\overline{(I_{\nu m} \otimes \mathrm{DFT}_{\nu n})\, L^{\nu^2 mn}_{\nu m}}$   (7.22)

$\overline{(\mathrm{DFT}_m \otimes I_{\nu n})\, T^{\nu mn}_{\nu n}}\;\big(I_{mn} \otimes L^{2\nu}_\nu\big) = \big(I_{mn} \otimes L^{2\nu}_\nu\big)\big(\overline{\mathrm{DFT}_m \otimes I_n} \otimes I_\nu\big)\;\bar{T}'^{\nu mn}_{\nu n}$   (7.23)

$\overline{(\mathrm{DFT}_m \otimes I_{\nu n})\, T^{\nu mn}_{\nu n}}^{\,I_{mn} \otimes L^{2\nu}_\nu} = \big(\overline{\mathrm{DFT}_m \otimes I_n} \otimes I_\nu\big)\;\bar{T}'^{\nu mn}_{\nu n}$   (7.24)

$\big(I_{\nu mn} \otimes L^{2\nu}_2\big)\;\overline{(I_{\nu m} \otimes \mathrm{DFT}_{\nu n})\, L^{\nu^2 mn}_{\nu m}} = \Big(I_m \otimes \big(L^{2\nu n}_\nu \otimes I_\nu\big)\big(I_{2n} \otimes L^{\nu^2}_\nu\big)\big(\overline{\mathrm{DFT}_{\nu n}} \otimes I_\nu\big)\Big)\big(L^{\nu mn}_m \otimes L^{2\nu}_2\big)$   (7.25)

$\big(I_{\nu m k_1 k_2 n} \otimes L^{2\nu}_2\big)\;\overline{(I_{\nu m} \otimes \mathrm{DFT}_{\nu k_1 k_2 n})\, L^{\nu^2 m k_1 k_2 n}_{\nu m}} = \Big(I_{\nu m} \otimes \overline{(\mathrm{DFT}_{k_1} \otimes I_{\nu k_2 n})\, T^{\nu k_1 k_2 n}_{\nu k_2 n}}^{\,I_{k_1 k_2 n} \otimes L^{2\nu}_\nu}\Big)\big(L^{\nu k_1 m}_{\nu m} \otimes I_{2\nu k_2 n}\big)\Big(I_{k_1} \otimes \big(I_{\nu m k_2 n} \otimes L^{2\nu}_2\big)\overline{(I_{\nu m} \otimes \mathrm{DFT}_{\nu k_2 n})\, L^{\nu^2 m k_2 n}_{\nu m}}\Big)\big(L^{\nu k_1 k_2 n}_{k_1} \otimes I_{2\nu m}\big)$   (7.26)

Note that this set of rules also exists in transposed form, starting from the transposed version of Theorem 5.1, i. e., Theorem 5.2.

Example 7.7 (Short Vector Cooley-Tukey Rule Set for $N = \nu^2 rstu$) This example shows the application of the short vector Cooley-Tukey rules for four factors. The problem size has to be a multiple of $\nu^2$. In this example, the four factors are $r$, $s$, $t$, and $u$. This is the smallest example which enables studying all equations of the rule set. The example is analogous to Example 5.10, but the formula manipulation is done in the real arithmetic formulation, applying the short vector Cooley-Tukey rules instead of the Cooley-Tukey vector recursion.

In a first step, equation (7.22) is applied to the initial transform $\overline{\mathrm{DFT}}_{\nu^2 rstu}$ with $m = r$ and $n = stu$, leading to

$\overline{\mathrm{DFT}}_{\nu^2 rstu} = \underbrace{\overline{(\mathrm{DFT}_{r\nu} \otimes I_{\nu stu})\, T^{\nu^2 rstu}_{\nu stu}}\,\big(I_{\nu rstu} \otimes L^{2\nu}_\nu\big)}_{(a)}\;\underbrace{\big(I_{\nu rstu} \otimes L^{2\nu}_2\big)\,\overline{(I_{r\nu} \otimes \mathrm{DFT}_{\nu stu})\, L^{\nu^2 rstu}_{r\nu}}}_{(b)}.$   (7.27)

Construct (a) in (7.27) is further expanded using rule (7.23), leading to

$\big(I_{\nu rstu} \otimes L^{2\nu}_\nu\big)\big(\overline{\mathrm{DFT}_{r\nu} \otimes I_{stu}} \otimes I_\nu\big)\;\bar{T}'^{\nu^2 rstu}_{\nu stu}.$   (7.28)

Construct (b) in (7.27) is further expanded using rule (7.26) with $m = r$, $k_1 = s$, $k_2 = t$, and $n = u$, leading to

$\Big(I_{\nu r} \otimes \underbrace{\overline{(\mathrm{DFT}_s \otimes I_{\nu tu})\, T^{\nu stu}_{\nu tu}}^{\,I_{stu} \otimes L^{2\nu}_\nu}}_{(c)}\Big)\big(L^{\nu sr}_{\nu r} \otimes I_{2\nu tu}\big)\Big(I_s \otimes \underbrace{\big(I_{\nu rtu} \otimes L^{2\nu}_2\big)\,\overline{(I_{\nu r} \otimes \mathrm{DFT}_{\nu tu})\, L^{\nu^2 rtu}_{\nu r}}}_{(d)}\Big)\big(L^{\nu stu}_s \otimes I_{2\nu r}\big).$   (7.29)

Construct (c) in (7.29) is expanded using rule (7.24) with m = s and n = tu leading to

$\big(\overline{\mathrm{DFT}_s \otimes I_{tu}} \otimes I_\nu\big)\;\bar{T}'^{\nu stu}_{\nu tu}.$   (7.30)

Construct (d) in (7.29) is again expanded using rule (7.26), now with $m = r$, $k_1 = t$, $k_2 = 1$, and $n = u$, leading to

$\Big(I_{\nu r} \otimes \underbrace{\overline{(\mathrm{DFT}_t \otimes I_{\nu u})\, T^{\nu tu}_{\nu u}}^{\,I_{tu} \otimes L^{2\nu}_\nu}}_{(e)}\Big)\big(L^{\nu rt}_{\nu r} \otimes I_{2\nu u}\big)\Big(I_t \otimes \underbrace{\big(I_{\nu ru} \otimes L^{2\nu}_2\big)\,\overline{(I_{\nu r} \otimes \mathrm{DFT}_{\nu u})\, L^{\nu^2 ru}_{\nu r}}}_{(f)}\Big)\big(L^{\nu tu}_t \otimes I_{2\nu r}\big).$   (7.31)

Construct (e) in (7.31) is expanded using rule (7.24) with m = t and n = u leading to

$\big(\overline{\mathrm{DFT}_t \otimes I_u} \otimes I_\nu\big)\;\bar{T}'^{\nu tu}_{\nu u}.$   (7.32)

Construct (f) in (7.31) is expanded using rule (7.25) with $m = r$ and $n = u$, leading to

$\Big(I_r \otimes \big(L^{2\nu u}_\nu \otimes I_\nu\big)\big(I_{2u} \otimes L^{\nu^2}_\nu\big)\big(\overline{\mathrm{DFT}_{\nu u}} \otimes I_\nu\big)\Big)\big(L^{\nu ru}_r \otimes L^{2\nu}_2\big).$   (7.33)

In the final step, the equations are substituted back: equations (7.32) and (7.33) are substituted into (7.31), (7.30) and (7.31) are substituted into (7.29), and (7.28) and (7.29) are substituted into (7.27). Finally, the fully expanded equation is obtained:

$\overline{\mathrm{DFT}}_{\nu^2 rstu} = \big(I_{\nu rstu} \otimes L^{2\nu}_\nu\big)\underbrace{\big(\overline{\mathrm{DFT}_{r\nu} \otimes I_{stu}} \otimes I_\nu\big)}_{(a)}\bar{T}'^{\nu^2 rstu}_{\nu stu}$
$\quad \Big(I_{\nu r} \otimes \underbrace{\big(\overline{\mathrm{DFT}_s \otimes I_{tu}} \otimes I_\nu\big)}_{(c)}\bar{T}'^{\nu stu}_{\nu tu}\Big)\big(L^{\nu sr}_{\nu r} \otimes I_{2\nu tu}\big)$
$\quad \Big(I_s \otimes \Big[\Big(I_{\nu r} \otimes \underbrace{\big(\overline{\mathrm{DFT}_t \otimes I_u} \otimes I_\nu\big)}_{(e)}\bar{T}'^{\nu tu}_{\nu u}\Big)\big(L^{\nu rt}_{\nu r} \otimes I_{2\nu u}\big)$
$\qquad \Big(I_t \otimes \Big(I_r \otimes \big(L^{2\nu u}_\nu \otimes I_\nu\big)\big(I_{2u} \otimes L^{\nu^2}_\nu\big)\big(\underbrace{\overline{\mathrm{DFT}_{\nu u}}}_{(f)} \otimes I_\nu\big)\Big)\big(L^{\nu ru}_r \otimes L^{2\nu}_2\big)\Big)\big(L^{\nu tu}_t \otimes I_{2\nu r}\big)\Big]\Big)$
$\quad \big(L^{\nu stu}_s \otimes I_{2\nu r}\big)$   (7.34)

In the resulting equation (7.34) all basic constructs are vector constructs matching

$A \otimes I_\nu, \quad L^{2\nu}_2, \quad L^{2\nu}_\nu, \quad L^{\nu^2}_\nu, \quad\text{or}\quad \bar{T}'^{mn}_n.$

Thus, equation (7.34) can be implemented utilizing vector memory access and in-register permutations exclusively.

Example 7.7 shows how the short vector Cooley-Tukey rules are applied for a small example. In a real application, the utilization of a computer algebra system like Gap (which is used by Spiral) or direct coding of the recursion rules as done by Fftw's executor is required to exploit the following degrees of freedom: (i) how many factors are used, (ii) how the factors are chosen, (iii) which rule is applied, and (iv) how the vector terminals are further expanded. This provides a rich search space for the application of automatic empirical performance tuning. The most important property of the short vector Cooley-Tukey rule set is that the formal approach guarantees that the resulting formula can be implemented with vector arithmetic, vector memory access operations, and efficient in-register permutations exclusively. In Chapter 8, implementations of both Theorem 7.4 and Theorem 7.5 are described.

7.4 Short Vector Specific Search

In order to achieve high performance on short vector SIMD architectures, it is necessary to apply search methods to empirically find a good implementation.

This section discusses how the search methods provided by Spiral have to be adapted to support the short vector Cooley-Tukey rule set. Spiral provides different search methods to find the best algorithm for a given computing platform, including dynamic programming (Cormen et al. [14]), Steer (an evolutionary algorithm by Singer and Veloso [84]), a hill climbing search, and exhaustive search. For scalar DFT implementations, it turns out that in general dynamic programming is a good choice, since it terminates fast (using only rule (7.22), dynamic programming does on the order of $O(n^2)$ run time experiments, where $n$ is the transform size) and finds close to the best implementations (Püschel et al. [80]). In the case of short vector SIMD implementations, it turns out that dynamic programming fails to find (nearly) best algorithms, as the run time of subproblems becomes very context sensitive and dynamic programming assumes context independence (the optimal subproblem assumption is no longer valid). This is explained in the following, and two variations of dynamic programming are presented that are included in the Spiral search engine to overcome this problem. In Section 8.2.2 the various dynamic programming variants are evaluated experimentally.

7.4.1 Standard Dynamic Programming

Dynamic programming searches for the best implementation for a given transform recursively. First, the list of all possible child transforms (with respect to all applicable rules) is generated and the best implementation for each possible child transform is determined by using dynamic programming recursively. Then the best implementation for the transform is determined by applying all possible rules and plugging in the already known solutions for the children. Since the child transforms are smaller than the original construct, this process terminates. For a $\mathrm{DFT}_{2^n}$, using rule (7.22), dynamic programming can be used to find the best implementations for $\mathrm{DFT}_{2^k}$, $k = 1, 2, \dots, n$ in one search run for $\mathrm{DFT}_{2^n}$. The method works well for scalar code, but for vector code the method is flawed. At some stage, rule (7.25) has to be applied to obtain vector terminals. Using dynamic programming recursively on the obtained children would inevitably apply this rule again, even though the children are vector terminals, i. e., should rather be expanded using rule (7.22). As a result, wrong breakdown strategies are found.
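The following self-contained sketch illustrates the basic dynamic programming scheme over Cooley-Tukey splits $2^n = 2^m \cdot 2^{n-m}$. The cost function is a purely synthetic stand-in for the run time measurements an actual search would perform; all names are illustrative and do not come from the Spiral code base.

    #include <stdio.h>

    #define MAX_LOGN 13

    static double best[MAX_LOGN + 1];

    /* Synthetic cost model standing in for a real measurement: children
       are assumed to be expanded with their best known plans. */
    static double measure(int n, int m)
    {
        return best[m] + best[n - m] + (double)(1 << n);
    }

    int main(void)
    {
        int split[MAX_LOGN + 1] = { 0 };
        best[1] = 1.0;                     /* base case: DFT_2 codelet */
        for (int n = 2; n <= MAX_LOGN; n++) {
            best[n] = 1e300;
            for (int m = 1; m < n; m++) {  /* all splits 2^n = 2^m * 2^(n-m) */
                double t = measure(n, m);
                if (t < best[n]) { best[n] = t; split[n] = m; }
            }
        }
        for (int n = 2; n <= MAX_LOGN; n++)
            printf("DFT_2^%d: best split 2^%d x 2^%d\n",
                   n, split[n], n - split[n]);
        return 0;
    }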

7.4.2 Vector Dynamic Programming

The first obvious change is to disable the vector rule (7.25) for vector terminals. This already leads to reasonably structured formulas. But there is a second problem: dynamic programming optimizes all vector terminals like $\mathrm{DFT}_k$ as scalar constructs, thus not taking into account the vector tensor product, i. e., ignoring the context $\mathrm{DFT}_k \otimes I_\nu$ of $\mathrm{DFT}_k$. Thus, a second modification is made by expanding $\mathrm{DFT}_k$ using scalar rules but always measuring the run time of the vector code generated from $\mathrm{DFT}_k \otimes I_\nu$. For the other construct containing a vector terminal in (7.25), i. e., $\overline{\mathrm{DFT}_m \otimes I_{n/\nu}} \otimes I_\nu$, also $\overline{\mathrm{DFT}_m} \otimes I_\nu$ is measured, independently of $n/\nu$.

7.4.3 Stride Sensitive Vector Dynamic Programming

This variant is directly matched to rule (7.25). For a given $\mathrm{DFT}_N$, this search variant first creates all possible pairs $(m, n)$ with $N = mn$. For any pair $(m, n)$, it searches for the best implementation of the vector terminals required by rule (7.25) using vector dynamic programming. But when searching for the best $\mathrm{DFT}_m$ by vector dynamic programming, a variant is used that finally measures $\overline{\mathrm{DFT}_m \otimes I_{n/\nu}} \otimes I_\nu$ instead of $\overline{\mathrm{DFT}_m} \otimes I_\nu$, which makes the search sensitive to the stride $n/\nu$. A fast formula for $\mathrm{DFT}_n$ is found by using standard vector dynamic programming. This exactly optimizes the required vector terminals, including the stride. This dynamic programming variant requires many more run time measurements than the other two dynamic programming variants; thus saving the results for earlier measured pairs $(m, n)$ speeds up the search crucially. A second variant of the stride sensitive vector dynamic programming was developed, where the search is done without reusing intermediate results across different runs of dynamic programming. This leads to a context and stride sensitive version which is subsequently called the "nohash" variant.

Chapter 8

Experimental Results

This chapter presents the experimental evaluation of the methods developed in Chapter 7. To obtain relevant performance results, the newly developed methods have been included into the two state-of-the-art hardware adaptive systems for computing discrete linear transforms, i. e., Spiral and Fftw. Fftw has been extended to utilize short vector SIMD instructions when computing complex-to-complex DFTs supported by the original version of Fftw, independently of the data stride and the problem size, i. e., the problem size has not been restricted to powers of two. Section 8.1 describes how the Cooley-Tukey FFT for subvectors (developed in Section 7.2) and the short vector Cooley-Tukey FFT (developed in Section 7.3) have been included into Fftw and evaluated across various short vector SIMD extensions and platforms. The Spiral system has been extended to produce short vector SIMD implementations for any discrete linear transform supported by the original system. The general formula based vectorization method developed in Section 7.1 has been tested with the WHT and two-dimensional DCTs of type II, while the short vector Cooley-Tukey FFT developed in Section 7.3 is used for DFT implementations for problem sizes being powers of two. The experiments presented in Section 8.2 include a detailed run time analysis, an analysis of the short vector SIMD specific dynamic programming method introduced in Section 7.4, an analysis of the best performing codes with respect to their recursive structure, and a survey of the applicability of vectorizing compilers for short vector SIMD extensions. All performance values were obtained using the original timing routines provided by Spiral and Fftw. Third-party codes were assessed using these timing routines as well. Details on performance assessment for scientific software can be found in Appendix A and in Gansterer and Ueberhuber [36]. All codes for complex transforms utilize the interleaved complex format (described in Section 4.6) unless noted differently.

8.1 A Short Vector Extension for FFTW

In the scope of this thesis, Fftw is extended to portably support short vector SIMD extensions based on two methods developed in Chapter 7: (i) the Cooley-Tukey FFT for subvectors, and (ii) the short vector Cooley-Tukey FFT. The necessary modifications are hidden within the Fftw framework such that

the utilization of the short vector SIMD extensions is transparent for the user. The system automatically chooses the problem-dependent optimum method to utilize the available hardware. Specifically, the following support was included into Fftw.

• Applying the Cooley-Tukey rule for subvectors, support for (i) problem sizes which are multiples of $\nu^2$ based on Theorem 7.1, and (ii) arbitrary problem sizes based on Theorem 7.2 is provided. This method supports all features of Fftw for complex-to-complex DFT computations, including arbitrary problem sizes and arbitrary strides.

• Support for problem sizes which are multiples of $\nu^2$ based on the short vector Cooley-Tukey rule set (Theorem 7.5) is provided. This method achieves higher performance but can only be used with a subset of problem sizes and with unit stride only (in the case of four-way short vector SIMD extensions).

Fftw does not directly utilize the formal methods presented in Chapter 4. In the case of the short vector SIMD extension, an implementation based on Theorems 7.1 and 7.2 as well as the short vector Cooley-Tukey rule set (Theorem 7.5) was included into Fftw, which itself is based on the Cooley-Tukey recursion as summarized in Section 5.2.5. The formal methods developed in this thesis are a valuable tool to study and express the computations carried out by the Fftw short vector SIMD extension developed in this thesis.

8.1.1 Extending FFTW

In order to include the newly developed methods into the Fftw framework, several changes to the original system were required. Due to technical considerations, the experimental version nfftw2, which is not publicly available, was chosen as the version to be extended. Implementation of the short vector Cooley-Tukey rule set would not have been possible without access to nfftw2. This section only gives a high-level overview of the implementation of the Fftw short vector SIMD extension. The reader may refer to the documentation of the latest official release, i. e., Fftw 2.1.3, which provides sufficient information. Throughout this section, the nomenclature of Fftw 2.1.3 is used.

The Vector Codelets

Fftw features one twiddle codelet and one no-twiddle codelet per codelet size. For the short vector SIMD extension, new types of scalar and vector codelets are required which feature additional support for permutations. Vector codelets are specialized symbols and thus follow the approach described in Section 7.1.4. Vector twiddle codelets have the following general structure:

Load phase: $x' = E_i Q_i\, x$,
Computation phase: $y' = \big(\overline{\mathrm{DFT}_N \otimes I_{k/\nu}} \otimes I_\nu\big)\, x'$,
Store phase: $y = P_i\, y'$.

Vector no-twiddle codelets have the following structure:

Load phase: $x' = Q_i\, x$,
Computation phase: $y' = \big(\overline{\mathrm{DFT}_N} \otimes I_\nu\big)\, x'$,
Store phase: $y = P_i\, y'$.

The codelet generator genfft is extended following the ideas from Section 7.1.4; it (i) utilizes genfft's core routines to generate code for the computational cores, and (ii) uses a modified unparsing routine to enable the support for $P_i$ and $Q_i$. $P_i$ and $Q_i$ originate from the $L^{2\nu}$- and $L^{\nu^2}_\nu$-type stride permutations of the rule set, and the complex diagonal $E_i$ provides the required support for the twiddle factors. These constructs and the arithmetic vector operations required within the computational core of a codelet are implemented using the portable SIMD API. Although the computational core is the same for all twiddle vector codelets of a specific size on the one hand, as well as for all no-twiddle vector codelets of a specific size on the other hand, different versions of vector codelets are required to cope with data alignment and the permutations required by the short vector Cooley-Tukey rule set. The differences between the short vector Cooley-Tukey rule set and the Cooley-Tukey rules for subvectors are hidden in the permutation macros that support $P_i$ and $Q_i$. Appendix E.2 shows a scalar no-twiddle codelet of size four and Appendix E.3 shows one of the respective vector codelets.

The Vector Framework

The standard Fftw framework had to be extended to support the multitude of codelets. The applicability of a certain codelet had to be managed, as well as the various methods to support the short vector SIMD extensions. This is the major reason that the experimental version nfftw2 was used, as the required support was gained more easily compared to extending Fftw 2.1.3. Both the executor and the planner required changes; thus all major parts of Fftw had to be adapted.

8.1.2 Run Time Experiments

The short vector SIMD extension of Fftw, called Fftw-Simd, was compared to the standard scalar version of Fftw, which served as baseline. As nfftw2 was extended, this version was used as the scalar version. However, for the problem specifications that were analyzed, nfftw2 delivers about the same performance as the last public release, Fftw 2.1.3.

The Test Machines

Fftw-Simd has been investigated on three IA-32 compatible machines and one Motorola PowerPC machine: (i) an Intel Pentium 4 running at 1.8 GHz, (ii) an Intel Pentium III running at 650 MHz, (iii) an AMD Athlon XP 1800+ running at 1533 MHz, and (iv) a Motorola MPC7400 G4 running at 400 MHz. Details about the machines can be found in Table 8.1. The SSE version was tested on the Pentium III, the Pentium 4, and the AMD Athlon XP. The SSE 2 version was tested on the Pentium 4. The AltiVec version was tested on the MPC7400 G4. For all IA-32 compatible machines the Intel C++ 6.0 compiler with experimentally determined compiler options, optimized for the specific machine, was used. All experiments on IA-32 compatible machines were conducted using the operating system Microsoft Windows 2000 and were validated on the same machines running under RedHat Linux 7.2 and using the Linux Intel C++ 6.0 compiler. The experiments on the Motorola MPC7400 G4 were conducted under Yellowdog Linux 1.2 using the AltiVec version of the Gnu C 2.9.5 compiler. For power of two problem sizes on IA-32 compatible machines, the performance of the scalar and short vector SIMD versions of Fftw was compared to the hand-optimized vendor library Intel MKL 5.1 [58]. Non-powers of two cannot be compared to the Intel MKL due to the missing support in Intel's library. The performance is displayed in pseudo Gflop/s, i. e., $5N \log_2(N)/T$, which is a scaled inverse of the run time $T$ and thus preserves run time relations and additionally gives an indication of the absolute performance. Details on the performance unit pseudo flop/s can be found in Appendix A and in Frigo and Johnson [33]. Note that SSE has hardware support for loading subvectors of size two while that is a costly operation on the AltiVec extension. On the other hand, AltiVec internally operates on vectors of size four while on all machines featuring SSE the four-way operations are internally broken into two two-way operations, thus limiting the vectorization speed-up. However, due to other architectural implications, the speed-up achievable by vectorization (ignoring effects like smaller program size due to fewer instructions) is limited by a factor of four for SSE on the Pentium III and Pentium 4 and for AltiVec on the MPC7400 G4. For SSE on the Athlon XP and SSE 2 on the Pentium 4 the upper speed-up limit is a factor of two.
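For reference, the performance unit can be computed as in the following small helper (a sketch; the constant follows the pseudo flop count $5N \log_2 N$ stated above):

    #include <math.h>

    /* pseudo Gflop/s = 5 N log2(N) / T / 1e9, with T in seconds */
    double pseudo_gflops(double N, double T)
    {
        return 5.0 * N * (log(N) / log(2.0)) / T / 1.0e9;
    }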

The Short Vector Cooley-Tukey Rule Set, Powers of Two

This section describes the experimental results which were obtained by investigating the Fftw implementation of the short vector Cooley-Tukey rule set developed in Section 7.3.

CPU:       Intel Pentium 4, 1.8 GHz
Caches:    8 kB L1 data cache, 256 kB on-die L2 cache
Peak:      7.2 Gflop/s single-precision, 3.6 Gflop/s double-precision
SIMD:      SSE and SSE 2
RAM:       256 MB RDRAM
OS:        Microsoft Windows 2000
Compiler:  Intel C++ Compiler 6.0

CPU:       Intel Pentium III Coppermine, 650 MHz
Caches:    16 kB L1 data cache, 256 kB on-die L2 cache
Peak:      2.6 Gflop/s single-precision
SIMD:      SSE
RAM:       128 MB SDRAM
OS:        Microsoft Windows 2000
Compiler:  Intel C++ Compiler 6.0

CPU:       AMD Athlon XP 1800+, 1533 MHz
Caches:    64 kB L1 data cache, 256 kB on-die L2 cache
Peak:      6.1 Gflop/s single-precision
SIMD:      3DNow! Professional (SSE compatible)
RAM:       128 MB DDR-SDRAM
OS:        Microsoft Windows 2000
Compiler:  Intel C++ Compiler 6.0

CPU:       Motorola 7400 G4, 400 MHz
Caches:    32 kB L1 data cache, 1024 kB unified L2 cache
Peak:      3.2 Gflop/s single-precision
SIMD:      AltiVec
RAM:       128 MB SDRAM
OS:        Yellowdog Linux 1.2
Compiler:  Gnu C Compiler 2.9.5 for AltiVec

Table 8.1: The computer systems used for assessing Fftw's short vector SIMD extension.

Figures 8.1 to 8.4 show the floating-point performance of the short vector Cooley-Tukey rule set applied to power of two problem sizes, tested across all IA-32 compatible machines within the test pool. Both SSE and SSE 2 were tested. Concerning four-way short vector SIMD extensions, these experiments show that the short vector Cooley-Tukey rule set is superior whenever applicable. The data format used for these experiments is, in contrast to most other experiments, the split complex format summarized in Section 4.6.

On the Pentium 4 the highest performance is achieved for problem sizes that fit completely into L1 cache and are not too small, i. e., $N = 64$ or $128$. The performance then decreases for problems that fit into L2 cache but do not fit into L1 cache. Speed-ups of more than three are achieved for SSE and more than 1.8 for SSE 2 for data sets that fit into the L1 data cache. For data sets that fit into L2 cache but not into L1 cache, speed-ups of 2.5 for SSE and 1.7 for SSE 2 were achieved. This is due to the memory subsystem and the high computational power of the Pentium 4 in combination with the memory access patterns required by the vector recursion and Fftw's restriction to right-expanded trees. For data sets that fit into L1 cache, the Intel MKL is slower than Fftw-Simd. Intel's MKL is faster than Fftw-Simd for some problem sizes where the data set does not fit into L1 cache. This is due to (i) the usage of prefetching, and (ii) the fact that in-place computation is used, in contrast to Fftw which features out-of-place computation, resulting in higher memory requirements. See Figures 8.1 and 8.2 for details.

Figure 8.1: FFTW, Short Vector Cooley-Tukey rule set: Performance results (in Gflop/s) for $\mathrm{DFT}_N$ with $N = 2^4, 2^5, \dots, 2^{13}$, for single-precision and SSE on an Intel Pentium 4 running at 1.8 GHz.

On the Pentium III the shape of the performance graph looks different. The maximum performance is obtained for $N = 256$ and $N = 512$. The performance slowly decreases for both smaller and larger problem sizes. This is due to the differences in the cache architecture between the Pentium III and the Pentium 4. Speed-up factors of up to 2.8 have been achieved for SSE on the Pentium III. The performance of the Intel MKL is at least 20 % below the performance of Fftw-Simd. See Figure 8.3 for details. The performance graph for the Athlon XP looks quite similar to the graph obtained for the Pentium III. The notable differences are (i) the speed-up factors are lower due to the different microarchitecture, as explained at the beginning of this section, and (ii) Intel's MKL library performs relatively better than on the Pentium III, although it reaches the performance of Fftw-Simd only for $N = 1024$ and $N = 2048$. Fftw-Simd achieves speed-ups of up to 1.8 on the Athlon XP. See Figure 8.4 for details.

Figure 8.2: FFTW, Short Vector Cooley-Tukey rule set: Performance results (in Gflop/s) for $\mathrm{DFT}_N$ with $N = 2^4, 2^5, \dots, 2^{13}$, for double-precision and SSE 2 on an Intel Pentium 4 running at 1.8 GHz.

Figure 8.3: FFTW, Short Vector Cooley-Tukey rule set: Performance results (in Gflop/s) for $\mathrm{DFT}_N$ with $N = 2^4, 2^5, \dots, 2^{13}$, for single-precision and SSE on an Intel Pentium III running at 650 MHz.

Figure 8.4: FFTW, Short Vector Cooley-Tukey rule set: Performance results (in Gflop/s) for $\mathrm{DFT}_N$ with $N = 2^4, 2^5, \dots, 2^{13}$, for single-precision and SSE on an AMD Athlon XP 1800+ running at 1533 MHz.

The Short Vector Cooley-Tukey Rule Set, Non-powers of Two

This section describes the experimental results which were obtained by investigating the Fftw implementation of the short vector Cooley-Tukey rule set developed in Section 7.3. Figures 8.5 to 8.8 show the vector recursion applied to non-power of two problem sizes, tested across all IA-32 compatible machines within the test pool. Both SSE and SSE 2 were tested. Concerning four-way short vector SIMD extensions, these experiments again show that the short vector Cooley-Tukey rule set is superior whenever applicable. The problem sizes tested are multiples of 16 and can be found in Table 8.2. The data format used for these experiments is, in contrast to most other experiments, the split complex format summarized in Section 4.6. On the Pentium 4 the highest performance is achieved for problem sizes that fit completely into L1 cache and feature small prime factors. For problem sizes with large prime factors the performance gain breaks down significantly. As the problem sizes grow, the performance graph becomes smoother, as large prime factors do not occur as often in the tested problem sizes.

16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 320 480 640 800 960 1,120 1,280 1,440 1,600 3,200 4,800 6,400 8,000

Table 8.2: The problem sizes tested with the short vector Cooley-Tukey rule set extension for Fftw in the non-power of two case.

For problem sizes that fit into L2 cache but do not fit into L1 data cache, the performance graph shows about the same characteristics as in the power-of-two case. Speed-up factors of up to 3.2 for SSE and more than 1.7 for SSE 2 are achieved. See Figures 8.5 and 8.6 for details.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: nfftw2-sse, nfftw2.]

Figure 8.5: FFTW, Short Vector Cooley-Tukey rule set: Performance results for DFT_N with N = 16, ..., 8000 (see Table 8.2), for single-precision and SSE on an Intel Pentium 4 running at 1.8 GHz.

On the Pentium III the shape of the performance graph again looks different. The maximum performance is obtained between N = 160 and N = 320, and the performance slowly decreases for both smaller and larger problem sizes. Close to the highest performance, however, two large breakdowns in performance gain can be seen, due to large prime factors and relatively slow codelets. Again the differences in the cache architectures of the Pentium III and the Pentium 4 can be seen in the performance graph. Speed-up factors between 2.5 and three are achieved for SSE on the Pentium III. See Figure 8.7 for details.

Again the performance graph for the Athlon XP looks quite similar to the graph obtained for the Pentium III. The notable difference again is the lower speed-up factors compared to the Pentium III. As outlined above, this is due to the AMD microarchitecture. Fftw-Simd achieves speed-ups of up to 1.8 on the Athlon XP. See Figure 8.8 for details.

The Cooley-Tukey Rule Set for Subvectors

This section describes the experimental results obtained by investigating the Fftw-Simd implementation based on the Cooley-Tukey rules for subvectors given by Theorems 7.1 and 7.2. Figures 8.9 and 8.10 show the application of Theorem 7.1 for powers of two, while Figures 8.11 and 8.12 show the application of Theorem 7.2 for non-powers of two. Both four-way extensions were analyzed: (i) SSE on the Pentium III, and (ii) AltiVec on the Motorola MPC7400 G4. The Pentium III features relatively cheap subvector memory access (64 bit quantities can be loaded into the 128 bit vector registers), while this operation is very expensive on the MPC7400 G4. Comparing the performance shows similar speed-up factors.

For power of two problem sizes, speed-up factors of up to two are achieved on the Pentium III and speed-up factors of up to 2.5 on the MPC7400 G4. The performance achieved on the Pentium III is about 30 % lower than with the short vector Cooley-Tukey rule set. On the MPC7400 G4 the cache influence is more severe than on the Pentium III. See Figures 8.9 and 8.10 for details.

Figures 8.11 and 8.12 focus on non-power of two sizes which are not necessarily multiples of 16. Thus the method based on Theorem 7.2 has to be used. On both the Pentium III and the MPC 7400 G4 the shape of the performance graph in the non-power of two case is similar to the power of two case. However, analogously to the non-power of two case for the short vector Cooley-Tukey rule set, the performance graph is less smooth due to changing prime factors. The problem sizes tested are not necessarily multiples of 16; they can be found in Table 8.3.

10 20 30 40 50 60 70 90 100 200 300 500 600 700 900 1,000 2,000 3,000 5,000 6,000 7,000 9,000 10,000 20,000

Table 8.3: The problem sizes tested with the short vector SIMD extension for Fftw based on Theorem 7.2 in the non-power of two case.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: nfftw2-sse2, nfftw2.]

Figure 8.6: FFTW, Short Vector Cooley-Tukey rule set: Performance results for DFT_N with N = 16, ..., 8000 (see Table 8.2), for double-precision and SSE 2 on an Intel Pentium 4 running at 1.8 GHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: nfftw2-sse, nfftw2.]

Figure 8.7: FFTW, Short Vector Cooley-Tukey rule set: Performance results for DFT_N with N = 16, ..., 8000 (see Table 8.2), for single-precision and SSE on an Intel Pentium III running at 650 MHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: nfftw2-sse, nfftw2.]

Figure 8.8: FFTW, Short Vector Cooley-Tukey rule set: Performance results for DFT_N with N = 16, ..., 8000 (see Table 8.2), for single-precision and SSE on an AMD Athlon XP 1800+ running at 1533 MHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: nfftw2-sse, nfftw2.]

Figure 8.9: FFTW, Subvector Memory Access: Performance results for DFT_N with N = 2^4, 2^5, ..., 2^13, for single-precision and SSE on an Intel Pentium III running at 650 MHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: nfftw2-altivec, nfftw2.]

Figure 8.10: FFTW, Subvector Memory Access: Performance results for DFT_N with N = 2^4, 2^5, ..., 2^13, for single-precision and AltiVec on a Motorola MPC 7400 G4 running at 400 MHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: nfftw2-sse, nfftw2.]

Figure 8.11: FFTW, Subvector Memory Access: Performance results for DFT_N with N = 10, ..., 20 000 (see Table 8.3), for single-precision and SSE on an Intel Pentium III running at 650 MHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: nfftw2-altivec, nfftw2.]

Figure 8.12: FFTW, Subvector Memory Access: Performance results for DFT_N with N = 10, ..., 20 000 (see Table 8.3), for single-precision and AltiVec on a Motorola MPC 7400 G4 running at 400 MHz.

8.2 A Short Vector Extension for SPIRAL

Within the scope of this thesis, Spiral has been extended to portably support short vector SIMD extensions, based on two methods developed in Chapter 7 and on the portable SIMD API. The SPL compiler was extended to support short vector SIMD, thus preserving the interfaces within the Spiral system. The newly developed extension of the SPL compiler formally vectorizes a given SPL program and generates a program that is based on the portable SIMD API.

• General support for discrete linear transforms utilizing the method developed in Section 7.1 is included.

• DFT-specific support utilizing the short vector Cooley-Tukey FFT developed in Section 7.3 is provided. In addition, the search module is extended according to Section 7.4.

To enable the analysis of compiler vectorization for discrete linear transforms, support for the Intel C++ compiler's vectorization was included in Spiral. The structures of the best formulas found are analyzed and related to relevant hardware parameters. It is shown that adapting the codes to a specific combination of target machine, short vector SIMD extension, and set of compiler options is required to obtain top performance.

8.2.1 Extending the SPL Compiler for Vector Code

This section presents the extended version of the SPL compiler, which generates C code enhanced with macros provided by the portable SIMD API, taking an SPL formula and the SIMD vector length ν (i. e., the number of single-precision floating-point numbers contained in a vector register) as its only inputs. The modified SPL compiler supports real and complex transforms and input vectors. Moreover, the compiler produces portable code, which is achieved by restricting the generated code to instructions available on all SIMD architectures and by using the portable SIMD API as hardware abstraction layer.

The newly developed vectorization technique is based on the concepts developed in Chapter 7. For example, DFTs, WHTs arising from rule (5.2), and two-dimensional transforms arising from rule (5.4) can be completely vectorized. Implementing the method developed in Section 7.1, a general formula is normalized and vectorized code is generated for symbols matching equation (7.1), while the rest is translated to standard (scalar) C code. In addition, DFTs are handled more efficiently by the short vector Cooley-Tukey FFT provided by Theorem 7.4.

The symbolic vectorization is implemented as a manipulation of the abstract syntax tree representation (see Section 4.7.1) of the SPL program. The newly included vectorization rules are based on the identities introduced in Chapters 4 and 7. Special support for complex transforms is provided according to Section 4.6. Table 8.4 shows some examples of the required rules.

Basic Vectorization Rules:

  A ⊗ B → (A ⊗ I_m)(I_n ⊗ B)
  A ⊗ I_{νk} → (A ⊗ I_k) ⊗ I_ν
  I_{νk} ⊗ A → I_ν ⊗ (I_k ⊗ A) = P^{-1}((I_k ⊗ A) ⊗ I_ν) P
  I_ν ⊗ A → P^{-1}(A ⊗ I_ν) P
  A ⊗ I_{νk+l} → A ⊗ (I_{νk} ⊕ I_l) = Q^{-1}((I_{νk} ⊗ A) ⊕ (I_l ⊗ A)) Q
  I_{νk+l} ⊗ A → (I_{νk} ⊕ I_l) ⊗ A = (I_{νk} ⊗ A) ⊕ (I_l ⊗ A)

Extended Vectorization Rules:

  A ⊗ I_ν → S
  D S → S'
  S D → S'
  P S → S'
  S P → S'

Table 8.4: Recursion and transformation rules for SIMD vectorization. P and Q are permutation matrices, D is a diagonal matrix, and A and B are arbitrary formulas. S and S' are SIMD symbols.
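As a worked instance of the basic rules (for ν = 4; the names A' and P are illustrative), the construct I_8 ⊗ A does not match the symbol form directly, but one rule application brings it there:

\[
I_8 \otimes A \;\to\; I_4 \otimes (I_2 \otimes A) \;=\; P^{-1}\bigl((I_2 \otimes A) \otimes I_4\bigr)\,P,
\]

so that (I_2 ⊗ A) ⊗ I_4 matches the symbol form A' ⊗ I_ν with A' = I_2 ⊗ A, while the stride permutations P and P^{-1} are later joined into the symbol's load and store phases.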

In addition, the SPL compiler’s unparsing stage had to be extended to support the portable SIMD API.

The remainder of this section explains the five stages of translating an SPL program into a C program that is built on top of the portable SIMD API.

Stage 1: Basic Vectorization

In the first stage the abstract syntax tree generated from the SPL program is searched for constructs of the form

I_r ⊗ A and A ⊗ I_s.

Then, using the basic vectorization rules from Table 8.4, these constructs are transformed into products of the form Q_i (A_i ⊗ I_ν) P_i with suitable permutation matrices P_i and Q_i. Next, in the main abstract syntax tree each of these constructs is replaced by a symbol S_i; the symbols thus become leaves of this tree. In addition, an abstract syntax tree is generated for each symbol. Within the abstract syntax tree of a symbol, no further steps are needed in this stage. The result of this stage is a collection of abstract syntax trees: (i) one for the remaining part of the formula (subsequently called the main part), and (ii) one for each symbol.

Stage 2: Joining Diagonals and Permutations

In this stage real and complex diagonals as well as permutations are vectorized. After this stage all constructs covered by equation (7.2) are transformed into a product of symbols. If any symbol S_i is composed with complex or real diagonals D_i or E_i or any permutation P_i or Q_i within the abstract syntax tree, the symbol S_i is transformed into the symbol S_i' by joining the permutations and/or diagonals using the extended vectorization rules summarized in Table 8.4. The diagonals and permutations are handled in the load and store phases of the implementation of the corresponding symbol, as outlined in Section 7.1.4. The joined permutations and diagonals are subsequently removed from the remaining main abstract syntax tree.

Stage 3: Generating Code

In this stage an internal representation of real arithmetic (i-code) for the main abstract syntax tree and the abstract syntax trees of all symbols is generated by calling the standard SPL compiler's code generator (see Section 4.7.1). For each symbol and for the remaining main part of the formula, i-code is generated and optimized using the respective stages of the standard SPL compiler.

Stage 4: Memory Access Optimization

In this stage the abstract syntax trees are transformed into the corresponding real arithmetic abstract syntax trees according to Section 4.6.1, and the required permutations L^{2ν}_k are deduced. The i-code generated in Stage 3 is extended with load and store operations implementing these permutations, which are finally handled in the unparsing stage.

Stage 5: Unparsing

The i-code for the symbols and for the main program, generated in Stage 3 and extended in Stage 4, is unparsed as a C program enhanced with macros provided by the portable SIMD API. The i-code generated for symbols is unparsed as vector arithmetic, and the permutations produced by Stages 1, 2, and 4 are unparsed using memory access macros. The i-code generated for the main program is unparsed as a scalar C program, and calls to the symbols are inserted.

The short vector Cooley-Tukey rules developed in Section 7.3 are currently implemented in Stages 1, 2, and 4, as their main purpose is to handle complex arithmetic and generate suitable symbols for parts of the DFT computation. The long-term goal is to move the short vector Cooley-Tukey rules into Spiral's formula generator and enable rule (7.26).
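To illustrate the shape of the unparsed output, the following hand-written sketch shows a tiny symbol of the form F_2 ⊗ I_4 for ν = 4. The macro names (VLOAD, VSTORE, VADD, VSUB) are illustrative stand-ins for the portable SIMD API, shown here with a possible SSE expansion; the actual macro set of the API differs in detail.

#include <xmmintrin.h>   /* SSE intrinsics */

/* Illustrative expansion of portable SIMD API macros for SSE. */
typedef __m128 vec_t;
#define VLOAD(p)      _mm_load_ps(p)          /* p must be 16-byte aligned */
#define VSTORE(p, v)  _mm_store_ps((p), (v))
#define VADD(a, b)    _mm_add_ps((a), (b))
#define VSUB(a, b)    _mm_sub_ps((a), (b))

/* Symbol y = (F_2 ⊗ I_4) x: one butterfly acting on whole 4-way vectors. */
static void sym_f2_x_i4(const float *x, float *y)
{
    vec_t x0 = VLOAD(x);       /* x[0..3] */
    vec_t x1 = VLOAD(x + 4);   /* x[4..7] */
    VSTORE(y,     VADD(x0, x1));
    VSTORE(y + 4, VSUB(x0, x1));
}

Because constructs of the form A ⊗ I_ν operate on contiguous blocks of ν elements, such a symbol needs only naturally aligned full-vector loads and stores, which is exactly the property the vectorization rules of Table 8.4 establish.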

8.2.2 Run Time Experiments

This section describes experimental results for the automatically generated short vector SIMD code for the DFT, the WHT, and the two-dimensional DCT of type II. All transforms are of size N = 2^n or N × N = 2^n × 2^n, respectively. The code generated and optimized by Spiral-Simd is compared to the best available DFT implementations across different architectures. In addition, different search methods and the structure of the best algorithms found are analyzed.

To validate the approach introduced in this thesis, both the SSE and SSE 2 extensions on three binary compatible, yet architecturally different platforms were chosen: (i) an Intel Pentium III with SDRAM running at 1 GHz, (ii) an Intel Pentium 4 with RDRAM running at 2.53 GHz, and (iii) an AMD Athlon XP 2100+ with DDR-SDRAM running at 1733 MHz.

These processors are based on different cores and have different cache architectures. The machines feature different chip sets, system busses, and memory technology. Details about these machines can be found in Table 8.5. As outlined in Section 8.1.2, the theoretical speed-up limit achievable due to vectorization (thus ignoring effects like smaller program size due to a smaller number of vector instructions) is a factor of four for SSE on the Pentium III and the Pentium 4. For SSE on the Athlon XP and SSE 2 on the Pentium 4 the limit is a factor of two.

By finding differently structured algorithms on different machines, high performance across architectures and across short vector SIMD extensions is achieved, which demonstrates the success of the presented approach in providing portable performance independent of the vector length ν and other architectural details.

CPU: Intel Pentium 4, 2.53 GHz
Caches: 8 kB L1 data cache, 256 kB on-die L2 cache
Peak performance: 10.1 Gflop/s single-precision, 5.06 Gflop/s double-precision
SIMD extensions: SSE and SSE 2
RAM: 512 MB RDRAM
Operating system: Microsoft Windows 2000
Compiler: Intel C++ Compiler 6.0

CPU: Intel Pentium III (Coppermine), 1 GHz
Caches: 16 kB L1 data cache, 256 kB on-die L2 cache
Peak performance: 7.2 Gflop/s single-precision
SIMD extensions: SSE
RAM: 128 MB SDRAM
Operating system: Microsoft Windows 2000
Compiler: Intel C++ Compiler 6.0

CPU: AMD Athlon XP 2100+, 1733 MHz
Caches: 64 kB L1 data cache, 256 kB on-die L2 cache
Peak performance: 6.9 Gflop/s single-precision
SIMD extensions: 3DNow! Professional (SSE compatible)
RAM: 512 MB DDR-SDRAM
Operating system: Microsoft Windows 2000
Compiler: Intel C++ Compiler 6.0

Table 8.5: The computer systems used for benchmarking the short vector SIMD extension for Spiral.

Note that using automatic compiler vectorization in tandem with Spiral code generation provides a fair evaluation of the limits of this technique. By running a dynamic programming search, Spiral can find algorithms that are best structured for compiler vectorization. Furthermore, the code generated by Spiral is of simple structure (e. g., it contains no pointers or variable loop limits). Even though the compiler can improve on Spiral's scalar code, the performance is far from optimal.

In the case of the DFT, the vector code generated using the new approach developed in this thesis is compared to the state-of-the-art C code of Fftw 2.1.3 (Frigo and Johnson [33]), code generated by Spiral, compiler vectorized C code generated by Spiral, and short vector SIMD code (SSE and SSE 2) provided by the Intel Math Kernel Library MKL 5.1 [58]. The MKL features separate versions optimized for the Pentium III and the Pentium 4. Note that the MKL uses in-place computation and memory prefetching instructions, which gives it an advantage over the code generated by Spiral-Simd, which computes the DFT out-of-place. In all cases the Intel C++ 6.0 compiler was used.

In the case of DFTs, the performance is displayed in pseudo Gflop/s, i. e., (5N log2(N))/T, which is a scaled inverse of the run time T; it preserves run time relations and additionally gives an indication of the absolute performance. Details on the performance unit pseudo flop/s can be found in Appendix A and in Frigo and Johnson [33]. For the other transforms, the actual number of floating-point operations was counted and used for the computation of real Gflop/s performance values. These short vector SIMD implementations were compared to the scalar implementations generated and optimized by standard Spiral, because of the lack of standard libraries.

Spiral generated scalar code was found using a dynamic programming search; Spiral generated vector code (using the newly developed SIMD extensions) was found using the best result of the two vector dynamic programming variants of Section 7.4. In both cases the global limit for unrolling (the size of subblocks to be unrolled) was included in the search.

Pentium 4

DFTs. On this processor the best ratio of flops per cycle among all investigated processors and the highest speed-ups of Spiral generated vector code compared to scalar Spiral generated code were achieved. For DFTs, using SSE, up to 6.25 pseudo Gflop/s (and a speed-up of up to 3.1), and using SSE 2 up to 3.3 pseudo Gflop/s (and a speed-up of up to 1.83) were measured on a 2.53 GHz machine, as can be seen in Figures 8.13 and 8.14. Exhaustive search can further increase the performance when using SSE 2 for small problem sizes.

The performance of the code generated by Spiral-Simd is best within L1 cache and decreases only slightly outside L1. Analysis of the generated programs shows that the best found code features very small loop bodies (as opposed to the medium and large unrolled blocks that typically lead to high performance) and a very regular code structure. This is due to the Pentium 4's new features, namely (i) its new core with a very long pipeline, (ii) its new instruction cache that caches instructions after they are decoded (trace cache), and (iii) its small but very fast data caches.

The results reported by Rodriguez [82], obtained on a similar machine running at 1.4 GHz, are included. As the source code could not be obtained, the reported results are scaled up to the frequency of the test machine used in this thesis. These performance numbers are only a rough but instructive estimate. Rodriguez' method achieves only half of the performance obtained using the approach presented in this thesis.

The standard C and Fortran codes generated by scalar Spiral and Fftw 2.1.3 show about the same performance and serve as a baseline. Using compiler vectorization, the code is sped up, but the result is much slower than the code obtained using Spiral-Simd for SSE. For SSE 2, compiler vectorization does not significantly speed up the code. For SSE, apart from some problem sizes around 2^11, Intel's MKL is much slower than the code obtained using Spiral-Simd. For SSE 2, Intel's MKL is significantly slower across all evaluated problem sizes.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: Spiral SSE, Intel MKL 5.1, SIMD-FFT, Spiral C vect, Spiral C, Fftw 2.1.3.]

Figure 8.13: SPIRAL: Performance results for DFT_N with N = 2^4, 2^5, ..., 2^13, for single-precision arithmetic and SSE on an Intel Pentium 4 running at 2.53 GHz.

WHTs. Speed-up factors of more than 2.6 are achieved for WHTs when using Spiral-Simd with SSE. The WHT is a very simple transform and thus compiler vectorization can speed up the computation significantly. However, it cannot reach the performance level of the code generated using Spiral-Simd. See Figure 8.15 for details. For SSE 2, speed-up factors of up to 1.5 are achieved. In this experiment, compiler vectorization is able to reach the same performance level as code generated by Spiral-Simd for larger problem sizes. See Figure 8.16 for details.

2D-DCTs, Type II. Speed-up factors of more than 3.1 are achieved for the 2D-DCT of type II using Spiral-Simd with SSE. 2D-DCTs of type II feature a very simple macro structure, as they are higher-dimensional transforms, and thus compiler vectorization should be able to speed up the computation significantly. However, the predicted speed-up has not been observed in the SSE experiments. See Figure 8.17 for details. For SSE 2, speed-up factors of up to 1.88 are achieved. In this experiment, compiler vectorization can speed up the code across all problem sizes, yet the performance level of code generated using Spiral-Simd is not achieved. See Figure 8.18 for details.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: Spiral SSE2 exh., Spiral SSE2, Intel MKL 5.1, Spiral C vect, Spiral C, Fftw 2.1.3.]

Figure 8.14: SPIRAL: Performance results for DFT_N with N = 2^4, 2^5, ..., 2^12, for double-precision arithmetic and SSE 2 on an Intel Pentium 4 running at 2.53 GHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: Spiral SSE, Spiral C vect, Spiral C.]

Figure 8.15: SPIRAL: Performance results for WHT_N with N = 2^4, 2^5, ..., 2^14, for single-precision arithmetic and SSE on an Intel Pentium 4 running at 2.53 GHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: Spiral SSE2, Spiral C vect, Spiral C.]

Figure 8.16: SPIRAL: Performance results for WHT_N with N = 2^4, 2^5, ..., 2^14, for double-precision arithmetic and SSE 2 on an Intel Pentium 4 running at 2.53 GHz.

[Plot omitted: floating-point performance (Gflop/s) over transform size N × N; curves: Spiral SSE, Spiral C vect, Spiral C.]

Figure 8.17: SPIRAL: Performance results for the 2D DCT-II of size N × N with N = 2^2, 2^3, ..., 2^7, for single-precision arithmetic and SSE on an Intel Pentium 4 running at 2.53 GHz.

[Plot omitted: floating-point performance (Gflop/s) over transform size N × N; curves: Spiral SSE2, Spiral C vect, Spiral C.]

Figure 8.18: SPIRAL: Performance results for the 2D DCT-II of size N × N with N = 2^2, 2^3, ..., 2^7, for double-precision arithmetic and SSE 2 on an Intel Pentium 4 running at 2.53 GHz.

Pentium III

DFTs. Up to 1.7 pseudo Gflop/s (and a speed-up of up to 3.1) were achieved on a 1 GHz machine featuring a Coppermine core. The best DFT implementations obtained by Spiral-Simd featured moderately sized loop bodies. On this machine, codes generated by Spiral-Simd deliver the highest speed-ups with respect to the scalar codes for larger problem sizes. On the Pentium III, Intel's MKL performs worse relative to the codes generated by Spiral-Simd than it does on the Pentium 4; this reflects Intel's additional tuning effort for the Pentium 4 version of the MKL. Again, Fftw 2.1.3 (without support for SSE) and scalar code generated by Spiral perform at about the same level and serve as a baseline. Compiler vectorization can speed up the code significantly, but again does not reach the performance level obtained with code generated by Spiral-Simd. See Figure 8.19 for details.

WHTs. Speed-up factors of more than 2.6 are achieved when computing WHTs using Spiral-Simd with SSE. Compiler vectorization can speed up the computation significantly, but again does not reach the same performance level as code generated using Spiral-Simd. See Figure 8.20 for details.

2D-DCTs, Type II. Speed-up factors of more than 3.1 are achieved for the 2D-DCT of type II using Spiral-Simd with SSE. For smaller problem sizes compiler vectorization is able to speed up the computation significantly and even reaches the same performance level as code generated using Spiral-Simd. However, for large problem sizes the performance breaks down to half of the scalar performance. See Figure 8.21 for details.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: Spiral SSE, Intel MKL 5.1, Spiral C vect, Spiral C, Fftw 2.1.3.]

Figure 8.19: SPIRAL: Performance results for DFT_N with N = 2^4, 2^5, ..., 2^13, for single-precision arithmetic and SSE on an Intel Pentium III running at 1 GHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: Spiral SSE, Spiral C vect, Spiral C.]

Figure 8.20: SPIRAL: Performance results for WHT_N with N = 2^4, 2^5, ..., 2^13, for single-precision arithmetic and SSE on an Intel Pentium III running at 1 GHz.

[Plot omitted: floating-point performance (Gflop/s) over transform size N × N; curves: Spiral SSE, Spiral C vect, Spiral C.]

Figure 8.21: SPIRAL: Performance results for the 2D DCT-II of size N × N with N = 2^2, 2^3, ..., 2^7, for single-precision arithmetic and SSE on an Intel Pentium III running at 1 GHz.

Athlon XP

DFTs. Up to 2.8 pseudo Gflop/s (and speed-ups of up to 1.7) were achieved on a 1733 MHz machine. The best DFT implementations obtained by Spiral-Simd featured large loop bodies. On this machine, the performance of the code generated by Spiral-Simd decreases at the L1 boundary while Intel's MKL keeps its performance level. Careful analysis shows, however, that the performance level of codes generated by Spiral-Simd for size 2^{n-1} is the same as Intel's MKL achieves for size 2^n. This is partly due to the in-place computation featured by Intel's MKL, resulting in substantially lower memory requirements. Although the 3DNow! Professional extension, which is binary compatible with 3DNow! and SSE, features 4-way SIMD instructions, the maximum obtainable speed-up is a factor of two, as the Athlon XP's two floating-point units then both operate as two-way SIMD units. See Figure 8.22 for details.

WHTs. Speed-up factors of more than 1.65 are achieved for WHTs using Spiral-Simd with SSE. Compiler vectorization achieves the same performance level as code generated by Spiral-Simd. See Figure 8.23 for details.

2D-DCTs, Type II. Speed-up factors of more than 1.7 are achieved for 2D-DCTs of type II using Spiral-Simd with SSE. For small problem sizes, compiler vectorization is even inferior to scalar code. For medium problem sizes, compiler vectorization can speed up the computation significantly but is not able to reach the performance level of code generated by Spiral-Simd. For large problems, compiler vectorization produces incorrect programs! See Figure 8.24 for details.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: Spiral SSE, Intel MKL 5.1 P4, Intel MKL 5.1 P3, Spiral C vect, Spiral C, Fftw 2.1.3.]

Figure 8.22: SPIRAL: Performance results for DFT_N with N = 2^4, 2^5, ..., 2^13, for single-precision arithmetic and SSE on an AMD Athlon XP 2100+ running at 1733 MHz.

[Plot omitted: floating-point performance (Gflop/s) over vector length N; curves: Spiral SSE, Spiral C vect, Spiral C.]

Figure 8.23: SPIRAL: Performance results for WHT_N with N = 2^4, 2^5, ..., 2^14, for single-precision arithmetic and SSE on an AMD Athlon XP 2100+ running at 1733 MHz.

[Plot omitted: floating-point performance (Gflop/s) over transform size N × N; curves: Spiral SSE, Spiral C vect, Spiral C.]

Figure 8.24: SPIRAL: Performance results for the 2D DCT-II of size N × N with N = 2^2, 2^3, ..., 2^7, for single-precision arithmetic and SSE on an AMD Athlon XP 2100+ running at 1733 MHz.

8.2.3 Assessment of Search Methods

The three variants of dynamic programming introduced in Section 7.4 were analyzed: (i) standard dynamic programming, (ii) vector dynamic programming, and (iii) stride sensitive vector dynamic programming. The major observation made in the experiments was that dynamic programming works well on some machines while on others it misses the best implementation considerably, thus requiring the modified versions introduced in Section 7.4. Specifically, on the Pentium III and the Athlon XP, standard dynamic programming finds implementations very close to the optimum. However, on the Pentium 4 the vector-aware dynamic programming variants are required to get the best performance, as standard dynamic programming reaches only 75 % of the best result. A sketch of the dynamic programming search is given below.

Figure 8.25 shows that on the Pentium 4 using SSE, for N ≤ 2^7 exhaustive search leads to the best result. For N > 2^7, vector dynamic programming comes very close to the best result, which is obtained by stride sensitive vector dynamic programming. Thus, the additional search time required by stride sensitive vector dynamic programming does not pay off. A combination of exhaustive search (where possible) and vector dynamic programming (for larger sizes) is the most economical search method for short vector SIMD codes. Figure 8.26 shows for the Pentium 4 using SSE 2 that dynamic programming finds implementations more than 18 % slower than those found using the vector-aware dynamic programming versions.

On the Pentium III, dynamic programming finds implementations that are not more than 7 % slower than implementations found with the vector-aware dynamic programming versions. See Figure 8.27 for details. On the Athlon XP, implementations found with dynamic programming are not more than 6 % slower than the best implementations for N < 2^13. For N = 2^13, dynamic programming finds implementations that are more than 10 % slower than the best implementations. See Figure 8.28 for details.
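To make the search procedure concrete, the following minimal sketch (not Spiral code; all names are illustrative) shows the principle of standard dynamic programming over Cooley-Tukey breakdown trees for sizes 2^n: the best tree for each size is built from the best trees already found for its children. The function measure_split is a hypothetical stand-in; in Spiral it would generate, compile, and time the candidate implementation, while here a synthetic cost keeps the sketch self-contained.

#include <float.h>
#include <stdio.h>

#define N_MAX 13   /* search DFT_(2^1), ..., DFT_(2^13) */

static double best_time[N_MAX + 1];   /* best (synthetic) time per size */
static int    best_split[N_MAX + 1];  /* top-level split that achieved it */

/* Stand-in for timing DFT_(2^n) with top-level split 2^k x 2^(n-k);
   the children reuse the best trees found for their sizes. */
static double measure_split(int n, int k)
{
    return best_time[k] * (1 << (n - k))   /* 2^(n-k) copies of DFT_(2^k) */
         + best_time[n - k] * (1 << k)     /* 2^k copies of DFT_(2^(n-k)) */
         + 5.0 * (1 << n);                 /* twiddle and permutation work */
}

int main(void)
{
    best_time[1] = 1.0;                    /* timed base codelet DFT_2 */
    for (int n = 2; n <= N_MAX; n++) {
        best_time[n] = DBL_MAX;
        for (int k = 1; k < n; k++) {      /* try all top-level splits */
            double t = measure_split(n, k);
            if (t < best_time[n]) { best_time[n] = t; best_split[n] = k; }
        }
        printf("DFT_2^%d: best split 2^%d x 2^%d\n",
               n, best_split[n], n - best_split[n]);
    }
    return 0;
}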

[Plot omitted: slow-down factor over vector length N; curves: DP, Vec DP, Stride Vec DP, Stride Vec DP w/o hash, exhaustive.]

Figure 8.25: SPIRAL: Comparison of the best algorithms found for DFT_N with N = 2^4, 2^5, ..., 2^13 and SSE (single-precision) by different search methods on an Intel Pentium 4 running at 2.53 GHz. The slowdown factor gives the performance relation to the best algorithms found.

8.2.4 Crosstiming

The experiments in this section show how much the best formula for a machine depends on the target architecture and the SIMD extension. To make this explicit, a suite of experiments was conducted.

Pentium 4, SSE. Figure 8.29 shows the slow-down of the best found DFT formulas for SSE on the Pentium III and the Athlon XP, and of the best formulas for scalar and SSE 2 code found on the Pentium 4, all implemented using SSE vector code and run on the Pentium 4 (using the best compiler optimization). As expected, the code generated from the formula found on the Pentium 4 using SSE performs best and is the baseline. Both the code generated from the best formula found on the Pentium 4 using the FPU and the one using SSE 2 perform very badly. But interestingly, the codes generated from the formulas found on the Pentium III and the Athlon XP using SSE are up to 60 % slower than the respective Pentium 4 SSE version. This reflects the need for searching on the actual target machine. It shows that it is not sufficient to conduct the search on a binary compatible machine featuring the same vector extension.

[Plot omitted: slow-down factor over vector length N; curves: DP, Vec DP, Stride Vec DP, exhaustive.]

Figure 8.26: SPIRAL: Comparison of the best algorithms found for DFT_N with N = 2^4, 2^5, ..., 2^13 and SSE 2 (double-precision) by different search methods on an Intel Pentium 4 running at 2.53 GHz. The slowdown factor gives the performance relation to the best algorithms found.

[Plot omitted: slow-down factor over vector length N; curves: DP, Stride Vec DP.]

Figure 8.27: SPIRAL: Comparison of the best algorithms found for DFT_N with N = 2^4, 2^5, ..., 2^13 and SSE (single-precision) by different search methods on an Intel Pentium III running at 1 GHz. The slowdown factor gives the performance relation to the best algorithms found.

[Plot omitted: slow-down factor over vector length N; curves: DP, Vec DP, Stride Vec DP.]

Figure 8.28: SPIRAL: Comparison of the best algorithms found for DFT_N with N = 2^4, 2^5, ..., 2^13 and SSE (single-precision) by different search methods on an AMD Athlon XP 2100+ running at 1733 MHz. The slowdown factor gives the performance relation to the best algorithms found.

Pentium 4, SSE 2. Figure 8.30 shows the respective experiment for the Pentium 4 using double-precision and SSE 2. It shows the slow-down of the codes generated from the best DFT formulas found for SSE on the Pentium III, the Athlon XP, and the Pentium 4, and the respective slow-down for the best scalar formula found on the Pentium 4 using the FPU, all implemented using SSE 2 vector code and run on the Pentium 4 (using the best compiler optimization). As expected, the Pentium 4 SSE 2 version performs best and is the baseline. All other formulas lead to codes that run at least 50 % slower; the worst implementation is obtained from the formula found when searching for the best scalar implementation, then implemented using SSE 2. It is up to a factor of 1.8 slower than the formula found for SSE 2.

Pentium III, SSE. Figure 8.31 shows the crosstiming experiment for the Pentium III using single-precision arithmetic and SSE. It shows the slow-down of the codes generated from the best found DFT formulas for SSE and SSE 2 on the Pentium 4 and for SSE on the Athlon XP, and from the best formulas for scalar code found on the Pentium III, all implemented using SSE vector code and run on the Pentium III (using the best compiler optimization). As one can see, the codes generated from the best formulas found on the Athlon XP perform very well on the Pentium III, while the codes generated from the best formulas found for SSE on the Pentium 4 are 30 % slower. As expected, both the scalar Pentium III and the SSE 2 Pentium 4 versions are significantly slower than the Pentium III SSE version. However, for intermediate problem sizes, the best formulas found on the Pentium 4 using SSE are slower than the formulas found for scalar implementations on the Pentium III. This coincides with the fact that the best found formula for a given problem size is rather different for the Pentium III and the Pentium 4, even for SSE.

Athlon XP, SSE. Figure 8.32 shows the crosstiming experiment for the Athlon XP using single-precision arithmetic and SSE. It shows the slow-down of the codes generated from the best found DFT formulas for SSE and SSE 2 on the Pentium 4 and for SSE on the Pentium III, and from the best formulas found on the Athlon XP for scalar code, all implemented using SSE vector code and run on the Athlon XP (using the best compiler optimization). Interestingly, the codes generated from the best formulas found for SSE on the Pentium III and the Pentium 4 run within 5 % of the codes generated from the best formulas found for the Athlon XP. Even the best formula found for SSE 2 on the Pentium 4 runs within 20 %. Thus, the Athlon XP can handle codes optimized for the Pentium III very well and codes optimized for the Pentium 4 well. Exchanging codes between the Pentium III and the Pentium 4 results in much higher slow-downs relative to the native best found formula than running these formulas on the Athlon XP does. Thus, the Athlon XP is able to handle codes optimized for other processors very well. In general, however, these experiments show that for optimal performance the codes have to be adapted to the actual target machine.

8.2.5 The Best Algorithms Found

The approach presented in this thesis delivers high performance on all tested systems. The structure of the best implementations, however, depends heavily on the target machine. Table 8.6 shows the structure of the best found formulas, displayed as trees representing the breakdown strategy. Generally speaking, two completely different types of behavior were observed: (i) formulas with rather balanced trees, and (ii) formulas with unbalanced trees tending to be right-expanded. The first type occurs when the working set fits into L1 cache and for codes generated using compiler vectorization. For the second type, the structure depends on whether scalar or vector code is generated. For all rule trees, the parameters and structural details on deeper levels depend on the actual target machine.

Scalar Code

Right-expanded trees with machine-dependent sizes of the left leaves are found to be the most efficient formulas when using only the scalar FPU. These trees provide the best data locality. Due to the larger caches of the Pentium III and the Athlon XP, the problem size fits into L1 cache and balanced trees are found.

[Plot omitted: slow-down factor over vector length N; curves: Athlon XP SSE, Pentium III SSE, Pentium 4 SSE2, Pentium 4 float.]

Figure 8.29: SPIRAL: Crosstiming of the best algorithms for DFT_N with N = 2^4, 2^5, ..., 2^13, found for different architectures, all measured on the Pentium 4, implemented using SSE. The slowdown factor gives the performance relation to the best algorithms found for Pentium 4, SSE.

[Plot omitted: slow-down factor over vector length N; curves: Athlon XP SSE, Pentium III SSE, Pentium 4 SSE, Pentium 4 double.]

Figure 8.30: SPIRAL: Crosstiming of the best algorithms for DFT_N with N = 2^4, 2^5, ..., 2^13, found for different architectures, all measured on the Pentium 4, implemented using SSE 2. The slowdown factor gives the performance relation to the best algorithms found for Pentium 4, SSE 2.

[Plot omitted: slow-down factor over vector length N; curves: Athlon XP SSE, Pentium 4 SSE, Pentium 4 SSE2, Pentium III float.]

Figure 8.31: SPIRAL: Crosstiming of the best algorithms for DFT_N with N = 2^4, 2^5, ..., 2^13, found for different architectures, all measured on the Pentium III, implemented using SSE. The slowdown factor gives the performance relation to the best algorithms found for Pentium III, SSE.

[Plot omitted: slow-down factor over vector length N; curves: Pentium III SSE, Pentium 4 SSE, Pentium 4 SSE2, Athlon XP float.]

Figure 8.32: SPIRAL: Crosstiming of the best algorithms for DFT_N with N = 2^4, 2^5, ..., 2^13, found for different architectures, all measured on the Athlon XP, implemented using SSE. The slowdown factor gives the performance relation to the best algorithms found for Athlon XP, SSE.

[Breakdown-tree diagrams omitted: for each of Pentium 4 (single), Pentium 4 (double), Pentium III (single), and Athlon XP (single), the original shows the best rule tree for generated scalar code, for generated scalar code with compiler vectorization, and for generated vector code.]

Table 8.6: The best found DFT algorithms for N = 2^10, represented as breakdown trees, on a Pentium III, a Pentium 4, and an Athlon XP for generated scalar code, generated scalar code using compiler vectorization, and generated vector code. Note that single-precision vector code implies SSE, and double-precision vector code implies SSE 2.

Compiler Vectorized Scalar Code

Vectorizing compilers tend to favor large loops with many iterations. Thus, the best trees found feature a top-level split that recurses into about equally large sub-problems. On the Pentium 4 for double precision, the code was not vectorized, leading to a right-expanded tree.

Short Vector SIMD Code

Due to structural differences between the standard Cooley-Tukey rule (optimizing for locality) and the short vector Cooley-Tukey rules (trying to keep locality while supporting vector memory access), in the first recursion step the right child problem is small compared to the left child problem, and the left child problem is subsequently right-expanded. This leads to good data locality for vector memory access. Due to the cache size, on the Pentium III and the Athlon XP again balanced trees are found.

8.2.6 Vectorizing Compilers

The code generated by Spiral cannot be vectorized directly by vectorizing compilers. The code structure has to be changed to give hints to the compiler and to enable the proof that vectorization is safe. This is true for both Fortran and C compilers. Optimizations carried out by the Spiral system make it impossible for vectorizing compilers to vectorize the generated code without further hints. The alignment of external arrays has to be guaranteed, and it also has to be guaranteed that pointers are unambiguous. It is required to give hints as to which loops to vectorize and that no loop carried dependencies exist. Vectorizing compilers like the Intel C++ compiler only vectorize rather large loops, as in the general case the additional cost for prologue and epilogue has to be amortized by the vectorized loop. Straight line code cannot be vectorized. By following these guidelines, a vectorizing compiler can be plugged into the Spiral system at nearly no additional cost and is able to speed up the generated code significantly.

Figures 8.13 to 8.24 show the performance achieved by state-of-the-art scalar code, automatically vectorized code, and code generated by Spiral-Simd using the approach discussed in Chapter 7. Summarizing these results: for simpler transforms like WHTs and two-dimensional transforms, the performance achieved by a vectorizing compiler is close to that of the formula based approach, although still inferior, except for some rare cases. For DFTs, compiler vectorization cannot achieve the same performance level as the short vector Cooley-Tukey rule set. The complicated structure of the DFT requires utilizing structural knowledge which can hardly be deduced automatically from source code.

In addition, the portable SIMD API and the possibility to issue vector instructions are required to achieve this performance. Hardly any scalar program exists that leads to the same object code as obtained with the presented approach when compiled with short vector SIMD vectorizing compilers like the Intel C++ compiler. If one were to "devectorize" the code generated by Spiral-Simd (i. e., write the equivalent scalar C or Fortran code), this code would have a very unusual structure, which is explained by the formal rules given in Chapter 7. For instance, it turns out that the vectorized loops in these codes have rather small numbers of loop iterations (mainly ν iterations). Such code would not follow any of the guidelines for writing well vectorizable code given by compiler vendors. Thus, although this code would deliver the best performance, it would not be vectorized by a vectorizing compiler. One reason why the intrinsic interface was introduced is to speed up code which cannot be sped up by a vectorizing compiler.

Conclusion and Outlook

The main contribution of this thesis is a mathematical framework and its implementation that enable automatic performance optimization of discrete linear transform algorithms running on processors featuring short vector SIMD extensions. The newly developed framework covers all current short vector SIMD extensions and the respective compilers featuring language extensions to support them. The theoretical results presented in this thesis have been included in Spiral and Fftw, the two major state-of-the-art automatic performance tuning systems in the field of discrete linear transforms. A portable SIMD API was defined to abstract the details of both short vector SIMD extensions and compilers. Performance portability has been demonstrated experimentally across various short vector SIMD extensions, processor generations, and compilers. The experiments covered discrete Fourier transforms, Walsh-Hadamard transforms, as well as two-dimensional cosine transforms. For Intel processors featuring SSE or SSE 2, the codes of this thesis are currently the fastest FFTs for both non-powers and powers of two.

The experimental extensions will be included in the next releases of Spiral and will be available as the Fftw Power Pack extension. These extensions will cover all current short vector SIMD architectures and various compilers. The range of supported algorithms will be extended to other discrete linear transforms of practical importance, including wavelet transforms as well as real-to-halfcomplex DFTs and their inverse transforms. Moreover, the techniques presented in this thesis will be used to implement other performance relevant techniques, including prefetching and loop interleaving.

The results of this thesis are relevant from small scale signal processing (with possibly real-time constraints) up to large scale simulation. For instance, in embedded computing the Motorola MPC processor series featuring a short vector SIMD extension is heavily used. On the other end of the spectrum, IBM is currently developing its BG/L machines, which aim at the top rank in the Top 500 list. The current number 5 in this list is a cluster featuring Intel Pentium 4 processors. Both machines feature double-precision floating-point short vector SIMD extensions. This demonstrates the wide applicability of the techniques developed in this thesis.

Appendix A

Performance Assessment

The assessment of scientific software requires the use of numerical values, which may be determined analytically or empirically, for the quantitative description of performance. Which parameters are used and how they are interpreted depends on what is to be assessed. The techniques discussed in this appendix have a strong impact on the experimental results presented in Chapter 8. A detailed discussion of performance assessment for both serial and parallel scientific software can be found in Gansterer and Ueberhuber [36].

The user of a computer system who waits for the solution of a particular problem is mainly interested in the time it takes for the problem to be solved. This time depends on two parameters, workload and performance:

time = workload / performance_effective = workload / (performance_maximum · efficiency).

The computation time is therefore influenced by the following quantities:

1. The amount of work (workload) which has to be done. This depends on the nature and complexity of the problem as well as on the properties of the algorithm used to solve it. For a given problem complexity, the workload is a characteristic of the algorithm. The workload (and hence the required time) may be reduced by improving the algorithm.

2. Peak performance, which characterizes the computer hardware independently of particular application programs. The procurement of new hardware with high peak performance usually results in reduced time requirements for the solution of the same problem.

3. Efficiency, i. e., the percentage of peak performance achieved for a given computing task. It tells the user what share of the potential peak performance is actually exploited and thus measures the quality of the implementation of an algorithm. Efficiency may be increased by optimizing the program.

Correct and comprehensive performance assessment requires answers to a whole complex of questions: What limits are imposed, independently of specific programming techniques, by the hardware? What are the effects of the different variants of an algorithm on performance? What are the effects of specific programming techniques? What are the effects on efficiency of an optimizing compiler?


CPU Time

The fact that only a small share of computer resources is spent on a particular job in a multiuser environment is taken into account by measuring CPU time. This quantity specifies the amount of time the processor was actually engaged in solving a particular problem and neglects the time spent on other jobs or on waiting for input/output. CPU time itself is divided into user CPU time and system CPU time. User CPU time is the time spent on executing an application program and its linked routines. System CPU time is the time consumed by all system functions required for the execution of the program, such as accessing virtual memory pages in the backing store or executing I/O operations.

Peak Performance

An important hardware characteristic, the peak performance P_max of a computer, specifies the maximum number of floating-point (or other) operations which can theoretically be performed per time unit (usually per second). The peak performance of a computer can be derived from its cycle time T_c and the maximum number N_c of operations which can be executed during a clock cycle:

P_max = N_c / T_c.

If P_max refers to the number of floating-point operations executed per second, then the result states the floating-point peak performance. It is measured in flop/s (floating-point operations per second) or Mflop/s (10^6 flop/s), Gflop/s (10^9 flop/s), or Tflop/s (10^12 flop/s). Unfortunately, the fact that there are different classes of floating-point operations, which take different amounts of time to be executed, is neglected far too often (Ueberhuber [89]).
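As a concrete instance, the single-precision figure quoted for the Pentium 4 in Table 8.5 follows from N_c = 4 single-precision operations per cycle with SSE (an assumption consistent with that table) at a cycle time of T_c = 1/(2.53 GHz):

\[
P_{\max} = \frac{N_c}{T_c} = 4 \cdot 2.53 \cdot 10^{9}\ \mathrm{flop/s} \approx 10.1\ \mathrm{Gflop/s}.
\]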

Notation (flop/s) Some authors use the notation flops, Mflops etc. instead of flop/s, Mflop/s etc.

It is most obvious that no program, no matter how efficient, can perform better than peak performance on a particular computer. In fact, only specially optimized parts of a program may come close to peak performance. One of the reasons for this is that, in practice, address calculations, memory operations, and other operations which do not contribute directly to the result are left out of the operation count. Thus, peak performance may be looked upon as a kind of speed of light for a computer.

A.1 Short Vector Performance Measures

Short vector SIMD operation is small-scale parallelism. The speed-up of computations depends on the problem to be solved and on how it fits onto the target architecture. In the context of this thesis, the speed-up is the most important measure of how efficiently the available parallelism is used. The speed-up describes how many times faster a vectorized program executes on an n-way short vector SIMD processor than on the scalar FPU.

Definition A.1 (Speed-up) Suppose program A can be vectorized such that it can be computed on an n-way short vector SIMD processor. T_1 denotes the time the scalar program needs when using one processor, T_n denotes the time of the n-way vectorized program. Then the speed-up is defined to be

S_n = T_1 / T_n.

In most cases, S_n will be higher than 1. However, sometimes, when the problem cannot be vectorized efficiently, vectorization overhead will cause T_n to be higher than T_1.

Speed-up S_n := T_1/T_n. Speed-up is the ratio between the run time T_1 of the scalar algorithm (prior to vectorization) and the run time T_n of the vectorized algorithm utilizing an n-way short vector SIMD processor.

Efficiency E_n := S_n/n ≤ 1. Concurrent efficiency is a metric for the utilization of a short vector SIMD processor's capacity. The closer E_n gets to 1, the better use is made of the potentially n-fold power of an n-way short vector SIMD processor.
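As a worked example, take the SSE speed-up of up to 3.1 measured for DFTs on the Pentium 4 in Section 8.2.2; with n = 4 for single-precision SSE this yields

\[
S_4 = 3.1, \qquad E_4 = \frac{S_4}{4} \approx 0.78 .
\]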

A.2 Empirical Performance Assessment

In contrast to analytical performance assessment obtained from the technical data from the computer system, empirical performance assessment is based on experiments and surveys conducted on given computer systems or abstract models (using simulation).

Temporal Performance

In order to compare different algorithms for solving a given problem on a single computer, either the execution time

T := t_end − t_start

itself or its inverse, referred to as temporal performance P_T := T^{-1}, can be used. To that end, the execution time is measured. The workload is normalized by ΔW = 1, since only one problem is considered. This kind of assessment is useful for deciding which algorithm or which program solves a given problem fastest. The execution time of a program is the main performance criterion for the user. He only wants to know how long he has to wait for the solution of his problem. For him, the most powerful and most efficient algorithm is determined by the shortest execution time or the largest temporal performance. From the user's point of view, the workload involved and other details of the algorithm are usually irrelevant.

Empirical Floating-Point Performance

The floating-point performance characterizes the workload completed over the time span T as the number of floating-point operations executed in T:

P_F [flop/s] = W_F / T = (number of executed floating-point operations) / (time in seconds).

This empirical quantity is obtained by measuring executed programs in real life situations. The results are expressed in terms of Mflop/s, Gflop/s, or Tflop/s, as with analytical performance indices. Floating-point performance is more suitable for the comparison of different machines than instruction performance, because it is based on operations instead of instructions: the number of instructions related to a program differs from computer to computer, while the number of floating-point operations will be more or less the same.

Floating-point performance indices based simply on counting floating-point operations may be too inaccurate unless a distinction is made between the different classes of floating-point operations and their respective numbers of required clock cycles. If these differences are neglected, a program consisting only of floating-point additions will have considerably better floating-point performance than a program consisting of the same number of floating-point divisions. On the Power processor, for instance, a floating-point division takes around twenty times as long as a floating-point addition.

A.2.1 Interpretation of Empirical Performance Values

In contrast to peak performance, which is a hardware characteristic, the empirical floating-point performance analysis of computer systems can only be made with real programs, i. e., algorithm implementations. However, it would be misleading to use floating-point performance as an absolute criterion for the assessment of algorithms. A program which achieves higher floating-point performance does not necessarily achieve higher temporal performance, i. e., shorter overall execution times. In spite of a better (higher) flop/s value, a program may take longer to solve the problem if a larger workload is involved. Only for programs with equal workload can the floating-point performance indices be used as a basis for assessing the quality of different implementations. For the benchmark assessment of computer systems, empirical floating-point performance is also suitable and is frequently used (for instance in the Linpack benchmark or the SPEC1 benchmark suite).

Pseudo Flop/s

In the case of FFT algorithms, an algorithm specific performance measure is used by some authors: an arithmetic complexity of 5N log2(N) operations is assumed for an FFT of size N. This is an upper bound for the FFT computation, motivated by the fact that different FFT algorithms have slightly different operation counts, ranging between 3N log2(N) and 5N log2(N) when all trivial twiddle factors are eliminated (see Section 4.5 and Figure 5.3). As a complication, some implementations do not eliminate all trivial twiddle factors, and then the actual number has to be counted. Thus, pseudo flop/s, (5N log2(N))/T (a scaled inverse of the run time), is an easily comparable performance measure for FFTs and an upper bound for the actual performance (Frigo and Johnson [33]).
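A small helper function in C makes the definition concrete (the function name is illustrative):

#include <math.h>

/* Pseudo Gflop/s of a size-n FFT that ran in t_seconds:
   (5 n log2(n)) / t, scaled to 10^9 operations per second. */
double pseudo_gflops(double n, double t_seconds)
{
    return 5.0 * n * log2(n) / t_seconds / 1.0e9;
}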

Empirical Efficiency

Sometimes it is of interest to obtain information about the degree to which a program and its compiler exploit the potential of a computer. To do so, the ratio between the empirical floating-point performance and the peak performance of the computer is considered. This empirical efficiency is usually significantly lower than 100 %, a fact which is in part due to simplifications in the model for peak performance.

A.2.2 Run Time Measurement

The run time is the second important performance index (in addition to the workload). To determine the performance of an algorithm, its run time has to be measured. One has to deal with the resolution of the system clock. Moreover, most of the time not the whole program but just some relevant parts of it have to be measured. For example, in FFT programs the initialization step is usually not included, as it takes place only once while the transform is executed many times. The following fragment of a C program demonstrates how to determine user and system CPU time as well as the overall elapsed time using the predefined subroutine times.


#include <unistd.h>     /* sysconf */
#include <sys/times.h>  /* times, struct tms */
...
/* period is the granularity of the subroutine times */
period = (float) 1 / sysconf(_SC_CLK_TCK);
...
start_time = times(&begin_cpu_time);
/* begin of the examined section */
...
/* end of the examined section */
end_time = times(&end_cpu_time);
user_cpu = period * (end_cpu_time.tms_utime - begin_cpu_time.tms_utime);
system_cpu = period * (end_cpu_time.tms_stime - begin_cpu_time.tms_stime);
elapsed = period * (end_time - start_time);

The subroutine times provides timing results as multiples of a specific period of time. This period depends on the computer system and must be determined with the Unix standard subroutine sysconf before times is used. The subroutine times itself must be called immediately before and after the part of the program to be measured. The argument of times returns the accumulated user and system CPU times, whereas the current time is returned as the function value of times. The difference between the respective begin and end times, together with scaling by the predetermined period of time, finally yields the actual execution times.

Whenever the execution time is smaller than the resolution of the system clock, different solutions are possible: (i) performance counters can be used to determine the exact number of cycles required, and (ii) the measured part of the program can be executed many times and the overall time divided by the number of runs. The second approach has a few drawbacks. The resulting time may be too optimistic, as first-time cache misses occur only once and the pipeline might be used too efficiently when executing subsequent calls. A possible solution is to empty the cache with special instructions. Calling in-place FFT algorithms repeatedly has the effect that in subsequent calls the output of the previous call is the input of the next call. When this process is repeated very often, excessive floating-point errors up to overflow conditions can result. This would not be a drawback by itself if processors handled such exceptions as fast as normal operations; some processors, however, handle them with a great performance loss, making the timing results too pessimistic. The solution to this problem is to call the program with special vectors (zero vector, eigenvectors) or to restore the first input between every run. The second solution leads to higher run times, but measuring the restoration overhead separately and subtracting it finally yields the correct result.
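A minimal sketch of this repeat-and-divide approach with input restoration follows. The helpers restore_input and fft_inplace as well as NRUNS are placeholders for the code under test, not part of any of the discussed libraries:

#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>

#define NRUNS 1000

/* Placeholders for the code under test: */
extern void restore_input(float *data, const float *original, int n);
extern void fft_inplace(float *data, int n);

void time_fft(float *data, const float *original, int n)
{
  struct tms t0, t1;
  clock_t c0, c1;
  float period = (float) 1 / sysconf(_SC_CLK_TCK);
  float per_call;
  int i;

  c0 = times(&t0);
  for (i = 0; i < NRUNS; i++) {
    restore_input(data, original, n); /* copy pristine input back */
    fft_inplace(data, n);             /* the transform under test */
  }
  c1 = times(&t1);

  /* Elapsed time per call; the restoration overhead is measured
     in a separate loop and subtracted from this value.           */
  per_call = period * (c1 - c0) / NRUNS;
  printf("%e s per call (restoration not yet subtracted)\n", per_call);
}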

A.2.3 Workload Measurement

In order to determine the empirical efficiency of a program, it is necessary to determine its arithmetic complexity. This can be done either "analytically", by using formulas for the arithmetic complexity, or "empirically", by counting the floating-point operations executed on the computer systems used. Estimating the number of floating-point operations analytically has the disadvantage that real implementations of algorithms often do not achieve the complexity bounds given by analytical formulas.

In order to determine the number of executed floating-point operations, a special feature of modern microprocessors can be used: the performance monitor counter (PMC). PMCs are hardware counters able to count various types of events, such as cache misses, memory coherence operations, branch mispredictions, and several categories of issued and graduated instructions. In addition to characterizing the workload of an application by counting the number of floating-point operations, PMCs help application developers gain deeper insight into application performance and pinpoint performance bottlenecks.

PMCs were first used extensively on Cray vector processors and appear in some form in all modern microprocessors, such as the MIPS R10000 [35, 68, 69, 96], the Intel IA-32 processors and the Itanium processor family [43, 51, 67], the IBM Power PC family [93], the DEC Alpha [16], and the HP PA-8x00 family [42]. Most microprocessor vendors provide hardware developers and selected performance analysts with documentation on counters and counter-based performance tools.

Hardware Performance Counters

Performance monitor counters (PMCs) offer an elegant solution to the counting problem. They have many advantages:

• They can count any event of any program.

• They provide exact numbers.

• They can be used to investigate arbitrary parts of huge programs.

• They do not affect program speed or results, or the behavior of other programs.

• They can be used in multi-tasking environments to measure the influence of other programs.

• They are cheap to use in resources and time.

They have disadvantages as well:

• Only a limited number of events can be counted, typically two. When counting more events, the counts have to be multiplexed and they are not exact any more.

• Extra instructions have to be inserted; re-coding and re-compilation are necessary.

• Documentation is sometimes insufficient and difficult to obtain.

• Usage is sometimes difficult and tricky.

• The use of the counters differs from architecture to architecture.

Performance Relevant Events. All processors that provide performance counters count different types of events. There is no standard for implementing such counters, and many events are named differently. But one will find most of the important events on any implementation.

Cycles: Cycles needed by the program to complete. This event type depends heavily on the underlying architecture. It can be used, for instance, to achieve high resolution timing.

Graduated instructions, graduated loads, graduated stores: Instructions, loads, and stores completed.

Issued instructions, issued loads, issued stores: Instructions started, but not necessarily completed. The number of issued loads is usually far higher than the number of graduated ones, while issued and graduated stores are almost the same.

Primary instruction cache misses: Cache misses of the primary instruction cache. High miss counts can indicate performance deteriorating loop structures.

Secondary instruction cache misses: Cache misses of the secondary instruction cache. Usually, this count is very small and is therefore not a crucial performance indicator.

Primary data cache misses: One of the most crucial performance factors is the number of primary data cache misses. It is usually far higher than the number of instruction cache misses.

Secondary data cache misses: When program input exceeds a certain amount of memory, this event type will dominate even the primary data cache misses.

Graduated floating-point instructions: Once used as the main performance indicator, the floating-point count is, together with precise timing results, still one of the most relevant indicators of the performance of a numerical program.

Mispredicted branches: Number of mispredicted branches. Branch mispredictions affect memory and cache accesses and thus can be a reason for high memory latency.

Platform Independent Interfaces

Every processor architecture has its own set of instructions to access its performance monitor counters. But the basic PMC operations, i.e., selecting the type of event to be measured, starting and pausing the counting process, and reading the final value of the counters, are the same on every computer system.

This led to the development of application programming interfaces (APIs) that unify access to the PMCs across different operating systems. Basically, the APIs provide the same library calls on every operating system; the called functions then transparently access the underlying hardware using the processor's specific instruction set. This makes the usage of PMCs easier, because the manufacturers' interfaces to the counters were designed by engineers whose main goal was to obtain raw performance data without worrying about an easy-to-use interface. With such an API, PMCs can finally be accessed the same way on every architecture. This makes program packages which include PMC access portable, and their code becomes more readable.

PAPI. The PAPI project is part of the PTools effort of the Parallel Tools Consortium at the computer science department of the University of Tennessee [73]. PAPI provides two interfaces to PMCs, a high-level and a low-level interface. Recently, a GUI tool was introduced to measure events for running programs without recompilation.

The high-level interface meets the demands of the most common performance evaluation tasks while providing a simple and easy-to-use interface. Only a small set of operations is defined, like the ability to start, stop, and read specific events. A user with more sophisticated needs can rely on the low-level, fully programmable interface in order to access even seldom-used PMC functions.

From the software point of view, PAPI consists of two layers. The upper layer provides the machine independent entry functions—the application programming interface. The lower layer exports an independent interface to hardware dependent functions and data structures. These functions access the substrate, which can be the operating system, a kernel extension, or assembler instructions. Of course, this layer heavily relies on the underlying hardware, and some functions are not available on every architecture. In this case, PAPI tries to emulate the missing functions.

A good example of such an emulation is PAPI's capability of multiplexing several hardware events. Multiplexing is a PMC feature common to some processors, such as the MIPS R10000, which allows for counting more events than the usual two. This is done by periodically switching the events to be counted and estimating the final event counts based on the partial counts and the total time elapsed. The counts are not exact any more, but in one single program run more than the usual two events can be measured. On processors which do not provide this functionality, PAPI emulates it.

Another PAPI feature is its counter overflow control. Overflow handling is usually not provided by the hardware registers alone and can produce highly misleading data. PAPI implements 64 bit counters to provide a portable implementation of this advanced functionality. Another feature is asynchronous user notification when a counter value exceeds some user-defined threshold. This makes histogram generation easy, and even allows for advanced real-time functionality far beyond mere performance evaluation.
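As an illustration, the following fragment counts floating-point operations and cycles with the classic PAPI high-level interface. This is a sketch only; it assumes a PAPI version providing PAPI_start_counters and PAPI_stop_counters and a platform on which the preset events PAPI_FP_OPS and PAPI_TOT_CYC are available:

#include <stdio.h>
#include <papi.h>

void count_flops(void (*kernel)(void))
{
  int events[2] = { PAPI_FP_OPS, PAPI_TOT_CYC };
  long_long values[2];   /* PAPI's 64 bit counter type */

  PAPI_start_counters(events, 2);  /* program and start the PMCs */
  kernel();                        /* code section to be measured */
  PAPI_stop_counters(values, 2);   /* stop and read the counters */

  printf("%lld flops, %lld cycles\n", values[0], values[1]);
}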
PCL. The Performance Counter Library (PCL) is a common interface for accessing performance counters built into modern microprocessors in a portable way. PCL was developed at the Central Institute for Applied Mathematics (ZAM) at the Research Centre Juelich [11]. PCL supports querying for functionality, starting and stopping counters, and reading the current values of counters. Performance counting can be done in user mode, system mode, or user-or-system mode. PCL supports nested calls to PCL functions to allow hierarchical performance measurements. However, nested calls must use exactly the same list of events. PCL functions are callable from C, C++, Fortran, and Java.

Similar to PAPI, PCL defines a common set of events across platforms for accesses to the memory hierarchy, cycle and instruction counts, and the status of functional units, and then translates these into native events on a given platform where possible. PAPI additionally defines events related to SMP cache coherence protocols and to cycles stalled waiting for memory access. Unlike PAPI, PCL does not support software multiplexing or user-defined overflow handling. The PCL API is very similar to the PAPI high-level API and consists of calls to start a list of counters and to read or stop the counter most recently started.

Appendix B

Short Vector Instruction Sets

This appendix summarizes the intrinsic API provided for SSE and SSE 2 by the Intel C++ compiler and for AltiVec by the AltiVec enabled Gnu C compiler 2.96. The semantics of the intrinsics provided by the compilers are expressed using the C language, but the displayed code is pseudo code. Vector elements are denoted using braces {}.

B.1 The Intel Streaming SIMD Extensions

This section summarizes the relevant part of the SSE API provided by the Intel C++ compiler and the Microsoft Visual C compiler. The Gnu C compiler 3.x provides a built-in function interface with the same functionality. The SSE instruction set is described in the IA-32 manuals [49, 50].

Short Vector Data Types

Two new data types are introduced. The 64 bit data type __m64 maps half an XMM register, the smallest newly introduced quantity that can be accessed. Variables of type __m64 are 64 bit wide and 8 byte aligned. __m64 is a vector of two float variables. Although these components cannot be accessed directly in code, in the pseudo code the components of a variable __m64 var will be accessed by var{0} and var{1}.

The 128 bit data type __m128 maps an XMM register in four-way single-precision mode. Variables of type __m128 are 128 bit wide and 16 byte aligned. __m128 is a vector of four float variables. Although these components cannot be accessed directly in code, in the pseudo code the components of a variable __m128 var will be accessed by var{0} through var{3}.

Components of variables of type __m64 and __m128 can only be accessed by using float variables. To ensure the correct alignment of float variables, the extended attribute __declspec(align(16)) for qualifying storage-class information has to be used.

__declspec(align(16)) float var[4] = {1.0f, 2.0f, 3.0f, 4.0f};
__m128 *pvar = (__m128 *) var;

175 176 B. Short Vector Instruction Sets

Arithmetic Operations

The arithmetic operations are implemented using intrinsic functions. For each supported arithmetic instruction a corresponding function is defined. In the context of this thesis only vector addition, vector subtraction, and pointwise vector multiplication are required.

The Pointwise Addition _mm_add_ps. The intrinsic function _mm_add_ps abstracts the addition of two XMM registers in four-way single-precision mode.

__m128 _mm_add_ps(__m128 a, __m128 b)
{
  __m128 c;
  c{0} = a{0} + b{0};
  c{1} = a{1} + b{1};
  c{2} = a{2} + b{2};
  c{3} = a{3} + b{3};
  return c;
}

The Pointwise Subtraction _mm_sub_ps. The intrinsic function _mm_sub_ps abstracts the subtraction of two XMM registers in four-way single-precision mode.

__m128 _mm_sub_ps(__m128 a, __m128 b)
{
  __m128 c;
  c{0} = a{0} - b{0};
  c{1} = a{1} - b{1};
  c{2} = a{2} - b{2};
  c{3} = a{3} - b{3};
  return c;
}

The Pointwise Multiplication _mm_mul_ps. The intrinsic function _mm_mul_ps abstracts the pointwise multiplication (Hadamard product) of two XMM registers in four-way single-precision mode.

__m128 _mm_mul_ps(__m128 a, __m128 b)
{
  __m128 c;
  c{0} = a{0} * b{0};
  c{1} = a{1} * b{1};
  c{2} = a{2} * b{2};
  c{3} = a{3} * b{3};
  return c;
}

Vector Reordering Operations

SSE features three vector reordering operations: _mm_shuffle_ps, _mm_unpacklo_ps, and _mm_unpackhi_ps. They have to be used to build the required permutations. All three operations feature certain limitations, and thus no general recombination can be done utilizing only a single permutation instruction. These intrinsics recombine elements from their two arguments of type __m128 into one result of type __m128. A macro _MM_TRANSPOSE4_PS for a four-by-four matrix transposition is defined in the Intel API.

The Shuffle Operation _mm_shuffle_ps. This operation is the most general permutation supported by SSE. The first two elements of the result variable can be any element of the first parameter, and the second two elements of the result variable can be any element of the second parameter. The choice is made according to the third parameter. The SSE API provides the macro _MM_SHUFFLE to encode these choices into the integer i.

__m128 _mm_shuffle_ps(__m128 a, __m128 b, int i)
{
  __m128 c;
  c{0} = a{i & 3};
  c{1} = a{(i>>2) & 3};
  c{2} = b{(i>>4) & 3};
  c{3} = b{(i>>6) & 3};
  return c;
}

The Unpack Operation _mm_unpacklo_ps. This operation is required for easy unpacking of complex numbers and additionally appears in different contexts in SSE codes. The first two elements of the result variable are the zeroth elements of the input variables, and the second half is filled by the first elements of the input variables.

__m128 _mm_unpacklo_ps(__m128 a, __m128 b)
{
  __m128 c;
  c{0} = a{0};
  c{1} = b{0};
  c{2} = a{1};
  c{3} = b{1};
  return c;
}

The Unpack Operation _mm_unpackhi_ps. This operation is required for easy unpacking of complex numbers and additionally appears in different contexts in SSE codes. The first two elements of the result variable are the second elements of the input variables, and the second half is filled by the third elements of the input variables.

__m128 _mm_unpackhi_ps(__m128 a, __m128 b)
{
  __m128 c;
  c{0} = a{2};
  c{1} = b{2};
  c{2} = a{3};
  c{3} = b{3};
  return c;
}
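As a usage illustration, the shuffle and unpack operations together allow switching between interleaved and split complex data, a pattern that occurs frequently in the vectorized FFT codes of this thesis. The following compilable sketch (the function name split_complex is chosen for illustration; 16 byte aligned data is assumed) deinterleaves four complex numbers:

#include <xmmintrin.h>

/* Split four complex numbers stored as (re,im) pairs in c[0..7]
   into a vector of real parts and a vector of imaginary parts.  */
void split_complex(const float *c, float *re, float *im)
{
  __m128 lo = _mm_load_ps(c);      /* r0 i0 r1 i1 */
  __m128 hi = _mm_load_ps(c + 4);  /* r2 i2 r3 i3 */
  __m128 r = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)); /* r0 r1 r2 r3 */
  __m128 i = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)); /* i0 i1 i2 i3 */
  _mm_store_ps(re, r);
  _mm_store_ps(im, i);
}

The inverse direction uses the unpack operations: _mm_unpacklo_ps(r, i) yields r0 i0 r1 i1, and _mm_unpackhi_ps(r, i) yields r2 i2 r3 i3.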

Memory Access Functions

Although SSE also features unaligned memory access (introducing a penalty), in this thesis only aligned memory access is used. SSE features aligned access for 64 bit quantities (half an XMM register) and for 128 bit quantities (a full XMM register).

64 bit Memory Operations. The SSE API provides intrinsic functions for loading and storing the lower and upper half of the XMM registers. The target memory location has to be 8 byte aligned.

__m128 _mm_loadl_pi(__m128 a, __m64 *p)
{
  __m128 c;
  c{0} = *p{0};
  c{1} = *p{1};
  c{2} = a{2};
  c{3} = a{3};
  return c;
}

__m128 _mm_loadh_pi(__m128 a, __m64 *p)
{
  __m128 c;
  c{0} = a{0};
  c{1} = a{1};
  c{2} = *p{0};
  c{3} = *p{1};
  return c;
}

void _mm_storel_pi(__m64 *p, __m128 a)
{
  *p{0} = a{0};
  *p{1} = a{1};
}

void _mm_storeh_pi(__m64 *p, __m128 a)
{
  *p{0} = a{2};
  *p{1} = a{3};
}

128 bit Memory Operations. The SSE API provides intrinsic functions for loading and storing XMM registers. The target memory location has to be 16 byte aligned. XMM loads and stores are implicitly inserted when __m128 variables which do not reside in registers are used.

__m128 _mm_load_ps(__m128 *p)
{
  __m128 c;
  c{0} = *p{0};
  c{1} = *p{1};
  c{2} = *p{2};
  c{3} = *p{3};
  return c;
}

void _mm_store_ps(__m128 *p, __m128 a)
{
  *p{0} = a{0};
  *p{1} = a{1};
  *p{2} = a{2};
  *p{3} = a{3};
}

Initialization Operations. The SSE API provides intrinsic functions for initializing __m128 variables. The intrinsic _mm_setzero_ps sets all components to zero, while _mm_set1_ps sets all components to the same value and _mm_set_ps sets each component to a different value.

__m128 _mm_setzero_ps()
{
  __m128 c;
  c{0} = 0.0;
  c{1} = 0.0;
  c{2} = 0.0;
  c{3} = 0.0;
  return c;
}

__m128 _mm_set1_ps(float f)
{
  __m128 c;
  c{0} = f;
  c{1} = f;
  c{2} = f;
  c{3} = f;
  return c;
}

__m128 _mm_set_ps(float f3, float f2, float f1, float f0)
{
  __m128 c;
  c{0} = f0;
  c{1} = f1;
  c{2} = f2;
  c{3} = f3;
  return c;
}

B.2 The Intel Streaming SIMD Extensions 2

This section summarizes the relevant part of the SSE 2 API provided by the Intel C++ compiler and the Microsoft Visual C compiler. The Gnu C compiler 3.x provides a built-in function interface with the same functionality. The SSE 2 instruction set is described in the IA-32 manuals [49, 50].

Short Vector Data Types

A new data type is introduced. The 128 bit data type __m128d maps an XMM register in two-way double-precision mode. Variables of type __m128d are 128 bit wide and 16 byte aligned. __m128d is a vector of two double variables. Although these components cannot be accessed directly in code, in the pseudo code the components of a variable __m128d var will be accessed by var{0} and var{1}.

Components of variables of type __m128d can only be accessed by using double variables. To ensure the correct alignment of double variables, the extended attribute __declspec(align(16)) for qualifying storage-class information has to be used.

__declspec(align(16)) double var[2] = {1.0, 2.0};
__m128d *pvar = (__m128d *) var;

Arithmetic Operations

The arithmetic operations are implemented using intrinsic functions. For each supported arithmetic instruction a corresponding function is defined. In the context of this thesis only vector addition, vector subtraction, and pointwise vector multiplication are required.

The Pointwise Addition _mm_add_pd. The intrinsic function _mm_add_pd abstracts the addition of two XMM registers in two-way double-precision mode.

__m128d _mm_add_pd(__m128d a, __m128d b)
{
  __m128d c;
  c{0} = a{0} + b{0};
  c{1} = a{1} + b{1};
  return c;
}

The Pointwise Subtraction _mm_sub_pd. The intrinsic function _mm_sub_pd abstracts the subtraction of two XMM registers in two-way double-precision mode.

__m128d _mm_sub_pd(__m128d a, __m128d b)
{
  __m128d c;
  c{0} = a{0} - b{0};
  c{1} = a{1} - b{1};
  return c;
}

The Pointwise Multiplication _mm_mul_pd. The intrinsic function _mm_mul_pd abstracts the pointwise multiplication (Hadamard product) of two XMM registers in two-way double-precision mode.

__m128d _mm_mul_pd(__m128d a, __m128d b)
{
  __m128d c;
  c{0} = a{0} * b{0};
  c{1} = a{1} * b{1};
  return c;
}

Vector Reordering Operations

SSE 2 features three vector reordering operations: _mm_shuffle_pd, _mm_unpacklo_pd, and _mm_unpackhi_pd. They have to be used to build the required permutations. All three operations feature certain limitations, and thus no general recombination can be done utilizing only a single permutation instruction. These intrinsics recombine elements from their two arguments of type __m128d into one result of type __m128d.

The Shuffle Operation _mm_shuffle_pd. This operation is the most general permutation supported by SSE 2. The first element of the result variable can be either element of the first parameter, and the second element of the result variable can be either element of the second parameter. The choice is made according to the third parameter. The SSE 2 API provides the macro _MM_SHUFFLE2 to encode these choices into the integer i.

__m128d _mm_shuffle_pd(__m128d a, __m128d b, int i)
{
  __m128d c;
  c{0} = a{i & 1};
  c{1} = b{(i>>1) & 1};
  return c;
}

The Unpack Operation _mm_unpacklo_pd. This operation is required for easy unpacking of complex numbers and additionally appears in different contexts in SSE 2 codes. The first element of the result variable is the zeroth element of the first input variable, and the second element is the zeroth element of the second input variable.

__m128d _mm_unpacklo_pd(__m128d a, __m128d b)
{
  __m128d c;
  c{0} = a{0};
  c{1} = b{0};
  return c;
}

The Unpack Operation _mm_unpackhi_pd. This operation is required for easy unpacking of complex numbers and additionally appears in different contexts in SSE 2 codes. The first element of the result variable is element 1 of the first input variable, and the second element is element 1 of the second input variable.

__m128d _mm_unpackhi_pd(__m128d a, __m128d b)
{
  __m128d c;
  c{0} = a{1};
  c{1} = b{1};
  return c;
}

Memory Access Functions

Although SSE 2 also features unaligned memory access (introducing a penalty), in this thesis only aligned memory access is used. SSE 2 features aligned access for 128 bit quantities (a full XMM register).

128 bit Memory Operations. The SSE 2 API provides intrinsic functions for loading and storing XMM registers in two-way double-precision mode. The target memory location has to be 16 byte aligned. XMM loads and stores are implicitly inserted when __m128d variables which do not reside in registers are used.

__m128d _mm_load_pd(__m128d *p)
{
  __m128d c;
  c{0} = *p{0};
  c{1} = *p{1};
  return c;
}

void _mm_store_pd(__m128d *p, __m128d a)
{
  *p{0} = a{0};
  *p{1} = a{1};
}

Initialization Operations. The SSE 2 API provides intrinsic functions for initializing __m128d variables. The intrinsic _mm_setzero_pd sets all components to zero, while _mm_set1_pd sets all components to the same value and _mm_set_pd sets each component to a different value.

__m128d _mm_setzero_pd()
{
  __m128d c;
  c{0} = 0.0;
  c{1} = 0.0;
  return c;
}

__m128d _mm_set1_pd(double f)
{
  __m128d c;
  c{0} = f;
  c{1} = f;
  return c;
}

__m128d _mm_set_pd(double f1, double f0)
{
  __m128d c;
  c{0} = f0;
  c{1} = f1;
  return c;
}

B.3 The Motorola AltiVec Extensions

This section summarizes the relevant part of the AltiVec API provided by Motorola. This API is supported by most C compilers that feature the AltiVec C language extension. The Gnu C Compiler 3.x provides a built-in function interface with the same functionality. Technical details are given in the Motorola AltiVec manuals [70, 71].

Short Vector Data Types

A new data type modifier is introduced. The keyword vector transforms any supported scalar type into the corresponding short vector SIMD type. In the context of this thesis only the 128 bit four-way floating-point type vector float is considered. Variables of type vector float are 16 byte aligned. Although components of variables of type vector float cannot be accessed directly in code, in the pseudo code the components will be accessed by var{0} through var{3}.

vector float var = (vector float) (1.0, 2.0, 3.0, 4.0);

Arithmetic Operations

The arithmetic operations are implemented using intrinsic functions. For each supported arithmetic instruction a corresponding function is defined. In the context of this thesis only vector addition, vector subtraction, and versions of the pointwise vector fused multiply-add operation are required. AltiVec does not feature a pointwise stand-alone vector multiplication.

The Pointwise Addition vec_add. The intrinsic function vec_add abstracts the addition of two AltiVec registers in four-way single-precision mode.

vector float vec_add(vector float a, vector float b)
{
  vector float c;
  c{0} = a{0} + b{0};
  c{1} = a{1} + b{1};
  c{2} = a{2} + b{2};
  c{3} = a{3} + b{3};
  return c;
}

The Pointwise Subtraction vec_sub. The intrinsic function vec_sub abstracts the subtraction of two AltiVec registers in four-way single-precision mode.

vector float vec_sub(vector float a, vector float b)
{
  vector float c;
  c{0} = a{0} - b{0};
  c{1} = a{1} - b{1};
  c{2} = a{2} - b{2};
  c{3} = a{3} - b{3};
  return c;
}

Pointwise Fused Multiply-Add Operations. The intrinsic functions vec_madd and vec_nmsub map the fused multiply-add operations supported by AltiVec.

vector float vec_madd(vector float a, vector float b, vector float c)
{
  vector float d;
  d{0} = a{0} * b{0} + c{0};
  d{1} = a{1} * b{1} + c{1};
  d{2} = a{2} * b{2} + c{2};
  d{3} = a{3} * b{3} + c{3};
  return d;
}

vector float vec_nmsub(vector float a, vector float b, vector float c)
{
  vector float d;
  d{0} = -(a{0} * b{0} - c{0});
  d{1} = -(a{1} * b{1} - c{1});
  d{2} = -(a{2} * b{2} - c{2});
  d{3} = -(a{3} * b{3} - c{3});
  return d;
}
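Since there is no stand-alone multiplication, these two fused operations suffice to express the pointwise complex multiplication used throughout this thesis when real and imaginary parts are kept in separate vectors. A minimal sketch (split complex format assumed; the function name cmul4 is illustrative):

/* Four complex multiplications at once in split (separate re/im)
   format: re + i*im = (are + i*aim) * (bre + i*bim).              */
void cmul4(vector float are, vector float aim,
           vector float bre, vector float bim,
           vector float *re, vector float *im)
{
  vector float zero = (vector float) (0.0, 0.0, 0.0, 0.0);
  /* vec_nmsub(x, y, z) computes z - x*y, hence: */
  *re = vec_nmsub(aim, bim, vec_madd(are, bre, zero)); /* are*bre - aim*bim */
  *im = vec_madd(are, bim, vec_madd(aim, bre, zero));  /* are*bim + aim*bre */
}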

Vector Reordering Operations

AltiVec features the four fundamental vector reordering operations vec_perm, vec_splat, vec_mergel, and vec_mergeh. These intrinsics recombine elements from their one or two arguments of type vector float into one result of type vector float. The operation vec_perm can perform all permutations but requires a permutation vector. Thus, the other operations can be used for special permutations to avoid the need for an additional register.

The Shuffle Operation vec_perm. This operation is the most general permutation supported by AltiVec. Any element in the result can be any element of the two input vectors. This instruction internally operates on bytes and requires a 16 byte vector to describe the permutation.

vector float vec_perm(vector float a, vector float b, vector unsigned char c)
{
  vector unsigned char ac = (vector unsigned char) a,
                       bc = (vector unsigned char) b,
                       dc;
  for (i = 0; i < 16; i++) {
    j = c{i} & 15;
    if (c{i} & 16)
      dc{i} = bc{j};
    else
      dc{i} = ac{j};
  }
  return (vector float) dc;
}

The Splat Operation vec_splat. This operation replicates the element selected by the second argument into all four elements of the result.

vector float vec_splat(vector float a, int b)
{
  vector float c;
  c{0} = a{b};
  c{1} = a{b};
  c{2} = a{b};
  c{3} = a{b};
  return c;
}

The Unpack Operation vec_mergel. This operation is required for easy unpacking of complex numbers and additionally appears in different contexts in AltiVec codes. The first two elements of the result variable are the second elements of the input variables and the second half is filled by the third elements of the input variables.

vector float vec_mergel(vector float a, vector float b)
{
  vector float c;
  c{0} = a{2};
  c{1} = b{2};
  c{2} = a{3};
  c{3} = b{3};
  return c;
}

The Unpack Operation vec_mergeh. This operation is required for easy unpacking of complex numbers and additionally appears in different contexts in AltiVec codes. The first two elements of the result variable are the zeroth elements of the input variables and the second half is filled by the first elements of the input variables.

vector float vec_mergeh(vector float a, vector float b)
{
  vector float c;
  c{0} = a{0};
  c{1} = b{0};
  c{2} = a{1};
  c{3} = b{1};
  return c;
}

Memory Access Functions

AltiVec only supports aligned memory access. Unaligned access has to be built from two access operations plus a permutation operation. Here the permutation generator functions vec_lvsl and vec_lvsr provide the required permutation vector without a conditional statement.

Alignment Support Operations. AltiVec provides instructions to generate the permutation vectors required for loading elements from misaligned arrays. These intrinsics return a vector unsigned char that can be used with vec_perm to load and store, e. g., misaligned vector floats. These intrinsics can also be used to swap the upper and the lower two elements of a vector float if needed.

vector unsigned char vec_lvsl(int a, float *b)
{
  int sh = ((int) b + a) & 0xF;
  vector unsigned char r;

  switch (sh) {
    case 0x0: r = 0x000102030405060708090A0B0C0D0E0F; break;
    case 0x1: r = 0x0102030405060708090A0B0C0D0E0F10; break;
    case 0x2: r = 0x02030405060708090A0B0C0D0E0F1011; break;
    ...
    case 0xF: r = 0x0F101112131415161718191A1B1C1D1E; break;
  }
  return r;
}

vector unsigned char vec_lvsr(int a, float *b)
{
  int sh = ((int) b + a) & 0xF;
  vector unsigned char r;

  switch (sh) {
    case 0x0: r = 0x101112131415161718191A1B1C1D1E1F; break;
    case 0x1: r = 0x0F101112131415161718191A1B1C1D1E; break;
    case 0x2: r = 0x0E0F101112131415161718191A1B1C1D; break;
    ...
    case 0xF: r = 0x0102030405060708090A0B0C0D0E0F10; break;
  }
  return r;
}
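Put together, the canonical AltiVec idiom for loading a possibly misaligned vector combines two aligned loads with vec_lvsl and vec_perm. A sketch (the function load_unaligned is not part of the Motorola API; it is built from the intrinsics above):

/* Load four floats starting at p, where p need not be 16 byte
   aligned. The two aligned loads cover the 32 byte window that
   contains the data; vec_lvsl supplies the shift permutation.  */
vector float load_unaligned(float *p)
{
  vector float lo = vec_ld(0, p);   /* aligned block containing p */
  vector float hi = vec_ld(15, p);  /* next aligned block          */
  vector unsigned char mask = vec_lvsl(0, p);
  return vec_perm(lo, hi, mask);
}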

Memory Operations. The AltiVec API provides intrinsic functions for loading and storing AltiVec registers. The target memory location has to be 16 byte aligned. If the memory location is not properly aligned, the address is truncated to the preceding 16 byte boundary. AltiVec loads and stores are implicitly inserted when vector float variables which do not reside in registers are used. The intrinsic vec_ste stores only one element; the memory address determines which element is stored.

vector float vec_ld(int a, float *b)
{
  vector float c;
  vector float *p = (a + b) & 0xFFFFFFFFFFFFFFF0;
  c{0} = *p{0};
  c{1} = *p{1};
  c{2} = *p{2};
  c{3} = *p{3};
  return c;
}

void vec_st(vector float a, int b, float *c)
{
  vector float *p = (b + c) & 0xFFFFFFFFFFFFFFF0;
  *p{0} = a{0};
  *p{1} = a{1};
  *p{2} = a{2};
  *p{3} = a{3};
}

void vec_ste(vector float a, int b, float *c)
{
  int i = ((b + c) % 0x10) / 4;
  *c = a{i};
}

Appendix C

The Portable SIMD API

This appendix contains the definition of the portable SIMD API for SSE and SSE 2—both for the Intel C compiler and the Microsoft Visual C compiler—and for AltiVec as provided by the AltiVec enabled Gnu C compiler 2.96.

C.1 Intel Streaming SIMD Extensions

/* simd_api_sse_icl.h

SIMD API for SSE Intel C++ Compiler

*/

#ifndef __SIMD_API_SSE_H
#define __SIMD_API_SSE_H

#include "xmmintrin.h"

#define SIMD_FUNC __forceinline void

/* -- Data types ------------------------------------------------ */
typedef __m128 simd_vector;
typedef __m64 simd_complex;
typedef float simd_real;

/* -- Constant handling ------------------------------------------ */
#define DECLARE_CONST(name, r) \
  static const __declspec(align(16)) float (name)[4] = {r, r, r, r}
#define DECLARE_CONST_4(name, r0, r1, r2, r3) \
  static const __declspec(align(16)) \
  float (name)[4] = {r0, r1, r2, r3}
#define LOAD_CONST(name) (*((simd_vector *)(name)))
#define LOAD_CONST_4(name) (*((simd_vector *)(name)))
#define LOAD_CONST_VECT(ptr_v) (ptr_v)
#define LOAD_CONST_SCALAR(ptr_r) (_mm_set1_ps(ptr_r))
#define SIMD_SET_ZERO() _mm_setzero_ps()

/* -- Arithmetic operations -------------------------------------- */
#define VEC_ADD(v0, v1) _mm_add_ps(v0, v1)
#define VEC_SUB(v0, v1) _mm_sub_ps(v0, v1)
#define VEC_MUL(v0, v1) _mm_mul_ps(v0, v1)


#define VEC_UMINUS_P(v, v0) \
  (v) = _mm_sub_ps(_mm_setzero_ps(), v0);
#define COMPLEX_MULT(v0, v1, v2, v3, v4, v5) \
{ \
  (v0) = _mm_sub_ps(_mm_mul_ps(v2, v4), \
                    _mm_mul_ps(v3, v5)); \
  (v1) = _mm_add_ps(_mm_mul_ps(v2, v5), \
                    _mm_mul_ps(v3, v4)); \
}

/* -- Load operations --------------------------------------------- */
#define LOAD_VECT(t, ptr_v) \
  (t) = _mm_load_ps((float *)(ptr_v))

#define LOAD_L_8_2(t0, t1, ptr_v0, ptr_v1) \
{ \
  simd_vector tmp1, tmp2; \
  tmp1 = _mm_load_ps(ptr_v0); \
  tmp2 = _mm_load_ps(ptr_v1); \
  t0 = _mm_shuffle_ps(tmp1, tmp2, _MM_SHUFFLE(2, 0, 2, 0)); \
  t1 = _mm_shuffle_ps(tmp1, tmp2, _MM_SHUFFLE(3, 1, 3, 1)); \
}

#define LOAD_L_16_4(t0, t1, t2, t3, \
                    ptr_v0, ptr_v1, ptr_v2, ptr_v3) \
{ \
  simd_vector ti0 = _mm_load_ps(ptr_v0), \
              ti1 = _mm_load_ps(ptr_v1), \
              ti2 = _mm_load_ps(ptr_v2), \
              ti3 = _mm_load_ps(ptr_v3); \
  _MM_TRANSPOSE4_PS(ti0, ti1, ti2, ti3); \
  t0 = ti0; \
  t1 = ti1; \
  t2 = ti2; \
  t3 = ti3; \
}

#define LOAD_L_8_2_C(t0, t1, ptr_c0, ptr_c1, ptr_c2, ptr_c3) \
{ \
  simd_vector tmp1, tmp2; \
  tmp1 = _mm_loadl_pi(tmp1, ptr_c0); \
  tmp1 = _mm_loadh_pi(tmp1, ptr_c1); \
  tmp2 = _mm_loadl_pi(tmp2, ptr_c2); \
  tmp2 = _mm_loadh_pi(tmp2, ptr_c3); \
  t0 = _mm_shuffle_ps(tmp1, tmp2, _MM_SHUFFLE(2, 0, 2, 0)); \
  t1 = _mm_shuffle_ps(tmp1, tmp2, _MM_SHUFFLE(3, 1, 3, 1)); \
}

#define LOAD_R_4(t, ptr_r0, ptr_r1, ptr_r2, ptr_r3) \
{ \
  simd_vector tmp; \
  simd_real *tmpp = (simd_real *) &tmp; \
  tmpp[0] = *(ptr_r0); \
  tmpp[1] = *(ptr_r1); \
  tmpp[2] = *(ptr_r2); \
  tmpp[3] = *(ptr_r3); \
  t = _mm_load_ps((float *)(tmpp)); \
}

/* -- Store operations --------------------------------------------- */
#define STORE_VECT(ptr_v, t) \
  _mm_store_ps((float *)(ptr_v), t)

#define STORE_L_8_4(ptr_v0, ptr_v1, t0, t1) \
{ \
  simd_vector tmp1, tmp2; \
  tmp1 = _mm_unpacklo_ps(t0, t1); \
  tmp2 = _mm_unpackhi_ps(t0, t1); \
  _mm_store_ps(ptr_v0, tmp1); \
  _mm_store_ps(ptr_v1, tmp2); \
}

#define STORE_L_16_4(ptr_v0, ptr_v1, ptr_v2, ptr_v3, \
                     t0, t1, t2, t3) \
{ \
  simd_vector to0 = t0, \
              to1 = t1, \
              to2 = t2, \
              to3 = t3; \
  _MM_TRANSPOSE4_PS(to0, to1, to2, to3); \
  _mm_store_ps(ptr_v0, to0); \
  _mm_store_ps(ptr_v1, to1); \
  _mm_store_ps(ptr_v2, to2); \
  _mm_store_ps(ptr_v3, to3); \
}

#define STORE_L_8_4_C(ptr_c0, ptr_c1, ptr_c2, ptr_c3, t0, t1) \
{ \
  simd_vector tmp1, tmp2; \
  tmp1 = _mm_unpacklo_ps(t0, t1); \
  tmp2 = _mm_unpackhi_ps(t0, t1); \
  _mm_storel_pi(ptr_c0, tmp1); \
  _mm_storeh_pi(ptr_c1, tmp1); \
  _mm_storel_pi(ptr_c2, tmp2); \
  _mm_storeh_pi(ptr_c3, tmp2); \
}

#define STORE_R_4(ptr_r0, ptr_r1, ptr_r2, ptr_r3, t) \
{ \
  simd_vector tmp; \
  simd_real *tmpp = (simd_real *) &tmp; \
  _mm_store_ps((float *)tmpp, t); \
  *(ptr_r0) = tmpp[0]; \
  *(ptr_r1) = tmpp[1]; \
  *(ptr_r2) = tmpp[2]; \
  *(ptr_r3) = tmpp[3]; \
}

#endif
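As a usage illustration, the following small routine is written exclusively against the macros above and therefore compiles unchanged with each of the three header variants. It is a sketch; the function name scale4 and the constant name SCALE_HALF are chosen for this example and are not part of the API:

#include "simd_api.h"

DECLARE_CONST(SCALE_HALF, 0.5);

/* y = 0.5 * x for one short vector; x and y must be aligned
   buffers of simd_real holding one simd_vector each.        */
void scale4(simd_real *y, simd_real *x)
{
  simd_vector t;
  LOAD_VECT(t, x);
  t = VEC_MUL(t, LOAD_CONST(SCALE_HALF));
  STORE_VECT(y, t);
}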

C.2 Intel Streaming SIMD Extensions 2

/* simd_api_sse2_icl.h

SIMD API for SSE 2 Intel C++ Compiler

*/

#ifndef __SIMD_API_SSE2_H
#define __SIMD_API_SSE2_H

#include "emmintrin.h"

#define SIMD_FUNC __forceinline void

/* -- Data types ------------------------------------------------ */
typedef __m128d simd_vector;
typedef __m128d simd_complex;
typedef double simd_real;

/* -- Constant handling ------------------------------------------ */
#define DECLARE_CONST(name, r) \
  static const __declspec(align(16)) double (name)[2] = {r, r}
#define DECLARE_CONST_2(name, r0, r1) \
  static const __declspec(align(16)) \
  double (name)[2] = {r0, r1}
#define LOAD_CONST(name) (*((simd_vector *)(name)))
#define LOAD_CONST_2(name) (*((simd_vector *)(name)))
#define LOAD_CONST_VECT(ptr_v) (ptr_v)

#define LOAD_CONST_SCALAR(ptr_r) (_mm_set1_pd(ptr_r))
#define SIMD_SET_ZERO() _mm_setzero_pd()

/* -- Arithmetic operations -------------------------------------- */
#define VEC_ADD(v0, v1) _mm_add_pd(v0, v1)
#define VEC_SUB(v0, v1) _mm_sub_pd(v0, v1)
#define VEC_MUL(v0, v1) _mm_mul_pd(v0, v1)
#define VEC_UMINUS_P(v, v0) \
  (v) = _mm_sub_pd(_mm_setzero_pd(), v0);
#define COMPLEX_MULT(v0, v1, v2, v3, v4, v5) \
{ \
  (v0) = _mm_sub_pd(_mm_mul_pd(v2, v4), \
                    _mm_mul_pd(v3, v5)); \
  (v1) = _mm_add_pd(_mm_mul_pd(v2, v5), \
                    _mm_mul_pd(v3, v4)); \
}

/* -- Load operations --------------------------------------------- */
#define LOAD_VECT(t, ptr_v) \
  (t) = _mm_load_pd((double *)(ptr_v))

#define LOAD_L_4_2(t0, t1, ptr_v0, ptr_v1) \
{ \
  simd_vector tmp1, tmp2; \
  tmp1 = _mm_load_pd(ptr_v0); \
  tmp2 = _mm_load_pd(ptr_v1); \
  t0 = _mm_shuffle_pd(tmp1, tmp2, _MM_SHUFFLE2(0, 0)); \
  t1 = _mm_shuffle_pd(tmp1, tmp2, _MM_SHUFFLE2(1, 1)); \
}

#define LOAD_R_2(t, ptr_r0, ptr_r1) \
{ \
  simd_vector tmp; \
  simd_real *tmpp = (simd_real *) &tmp; \
  tmpp[0] = *(ptr_r0); \
  tmpp[1] = *(ptr_r1); \
  t = _mm_load_pd((double *)(tmpp)); \
}

/* -- Store operations --------------------------------------------- */
#define STORE_VECT(ptr_v, t) \
  _mm_store_pd((double *)(ptr_v), t)

#define STORE_L_4_2(ptr_v0, ptr_v1, t0, t1) \
{ \
  simd_vector tmp1, tmp2; \
  tmp1 = _mm_unpacklo_pd(t0, t1); \
  tmp2 = _mm_unpackhi_pd(t0, t1); \
  _mm_store_pd(ptr_v0, tmp1); \
  _mm_store_pd(ptr_v1, tmp2); \
}

#define STORE_R_4(ptr_r0, ptr_r1, t) \
{ \
  simd_vector tmp; \
  simd_real *tmpp = (simd_real *) &tmp; \
  _mm_store_pd((double *)tmpp, t); \
  *(ptr_r0) = tmpp[0]; \
  *(ptr_r1) = tmpp[1]; \
}

#endif

C.3 Motorola AltiVec Extensions

/* simd_api_altivec_gcc.h

SIMD API for AltiVec AltiVec enabled GNU C Compiler 2.96

*/

#ifndef __SIMD_API_ALTIVEC_H
#define __SIMD_API_ALTIVEC_H

#define SIMD_FUNC __inline__ void

/* -- Data types ------------------------------------------------ */
typedef vector float simd_vector;
typedef struct c {float re, im;} simd_complex;
typedef float simd_real;

/* -- Constant handling ------------------------------------------ */
#define VEC_NULL ((vector float)(0.0, 0.0, 0.0, 0.0))

#define DECLARE_CONST(name, r) \
  static const vector float (name) = \
    (vector float)(r, r, r, r)
#define DECLARE_CONST_4(name, r0, r1, r2, r3) \
  static const vector float (name) = \
    (vector float)(r0, r1, r2, r3)
#define LOAD_CONST(name) (name)
#define LOAD_CONST_4(name) (name)

#define LOAD_CONST_VECT(ptr_v) (ptr_v)
#define LOAD_CONST_SCALAR(ptr_r) vec_splat(ptr_r)
#define SIMD_SET_ZERO() VEC_NULL

/* -- Arithmetic operations -------------------------------------- */
#define VEC_ADD(a, b) vec_add((a), (b))
#define VEC_SUB(a, b) vec_sub((a), (b))
#define VEC_MUL(a, b) vec_madd((a), (b), VEC_NULL)
#define VEC_MADD(a, b, c) vec_madd((a), (b), (c))
#define VEC_NMSUB(a, b, c) vec_nmsub((a), (b), (c))
#define VEC_NMUL(a, b) vec_nmsub((a), (b), VEC_NULL)
#define SIMD_UMINUS_P(v, v0) \
  (v) = vec_sub(VEC_NULL, v0);
#define COMPLEX_MULT(v0, v1, v2, v3, v4, v5) \
{ \
  (v0) = VEC_SUB(VEC_MUL(v2, v4), VEC_MUL(v3, v5)); \
  (v1) = VEC_ADD(VEC_MUL(v2, v5), VEC_MUL(v3, v4)); \
}

/* -- Load operations --------------------------------------------- */
#define LOAD_VECT(t, ptr_v) \
  (t) = vec_ld(0, (float *)(ptr_v));

#define LOAD_L_8_2(t0, t1, ptr_v0, ptr_v1) \
{ \
  simd_vector tmp1, tmp2, tmp3, tmp4; \
  LOAD_VECT(tmp1, ptr_v0); \
  LOAD_VECT(tmp2, ptr_v1); \
  tmp3 = vec_mergeh(tmp1, tmp2); \
  tmp4 = vec_mergel(tmp1, tmp2); \
  (t0) = vec_mergeh(tmp3, tmp4); \
  (t1) = vec_mergel(tmp3, tmp4); \
}

#define LOAD_L_16_4(t0, t1, t2, t3, \
                    ptr_v0, ptr_v1, ptr_v2, ptr_v3) \
{ \
  simd_vector ti0, to0, \
              ti1, to1, \
              ti2, to2, \
              ti3, to3; \
  LOAD_VECT(ti0, ptr_v0); \
  LOAD_VECT(ti1, ptr_v1); \
  LOAD_VECT(ti2, ptr_v2); \
  LOAD_VECT(ti3, ptr_v3); \
  to0 = vec_mergeh(ti0, ti2); \
  to1 = vec_mergeh(ti1, ti3); \
  to2 = vec_mergel(ti0, ti2); \
  to3 = vec_mergel(ti1, ti3); \
  (t0) = vec_mergeh(to0, to1); \
  (t1) = vec_mergel(to0, to1); \
  (t2) = vec_mergeh(to2, to3); \
  (t3) = vec_mergel(to2, to3); \
}

#define LOAD_COMPLEX(tc, ptr_c) \
{ \
  simd_vector tmp; \
  tmp = vec_ld(0, (float *)(ptr_c)); \
  (tc) = vec_perm(tmp, tmp, vec_lvsl(0, (float *)(ptr_c))); \
}

#define LOAD_L_8_2_C(t0, t1, ptr_c0, ptr_c1, ptr_c2, ptr_c3) \
{ \
  simd_vector tmp0, tmp1, tmp2, tmp3, tmp4, tmp5; \
  LOAD_COMPLEX(tmp0, (simd_complex *)(ptr_c0)); \
  LOAD_COMPLEX(tmp1, (simd_complex *)(ptr_c1)); \
  LOAD_COMPLEX(tmp2, (simd_complex *)(ptr_c2)); \
  LOAD_COMPLEX(tmp3, (simd_complex *)(ptr_c3)); \
  tmp4 = vec_mergeh(tmp0, tmp2); \
  tmp5 = vec_mergeh(tmp1, tmp3); \
  (t0) = vec_mergeh(tmp4, tmp5); \
  (t1) = vec_mergel(tmp4, tmp5); \
}

#define LOAD_R_4(t, ptr_r0, ptr_r1, ptr_r2, ptr_r3) \
{ \
  simd_vector tmp; \
  simd_real *tmpp = (simd_real *) &tmp; \
  tmpp[0] = *(ptr_r0); \
  tmpp[1] = *(ptr_r1); \
  tmpp[2] = *(ptr_r2); \
  tmpp[3] = *(ptr_r3); \
  LOAD_VECT(t, tmpp); \
}

/* -- Store operations --------------------------------------------- */
#define STORE_VECT(ptr_v, t) \
  vec_st((t), 0, (float *)(ptr_v))

#define STORE_L_8_4(ptr_v0, ptr_v1, t0, t1) \
{ \
  simd_vector tmp1, tmp2; \
  tmp1 = vec_mergeh((t0), (t1)); \
  tmp2 = vec_mergel((t0), (t1)); \
  STORE_VECT((ptr_v0), tmp1); \
  STORE_VECT((ptr_v1), tmp2); \
}

#define STORE_L_16_4(ptr_v0, ptr_v1, ptr_v2, ptr_v3, \
                     t0, t1, t2, t3) \
{ \
  simd_vector to0 = vec_mergeh(t0, t2), \
              to1 = vec_mergeh(t1, t3), \
              to2 = vec_mergel(t0, t2), \
              to3 = vec_mergel(t1, t3); \
  STORE_VECT(ptr_v0, vec_mergeh(to0, to1)); \
  STORE_VECT(ptr_v1, vec_mergel(to0, to1)); \
  STORE_VECT(ptr_v2, vec_mergeh(to2, to3)); \
  STORE_VECT(ptr_v3, vec_mergel(to2, to3)); \
}

#define STORE_COMPLEX_1(ptr_c, tc) \
{ \
  simd_vector tmp; \
  tmp = vec_perm((tc), (tc), vec_lvsr(0, (float *)(ptr_c))); \
  vec_ste(tmp, 0, (float *)(ptr_c)); \
  vec_ste(tmp, 4, (float *)(ptr_c)); \
}

#define STORE_COMPLEX_2(ptr_c, tc) \
{ \
  simd_vector tmp; \
  tmp = vec_perm((tc), (tc), vec_lvsr(8, (float *)(ptr_c))); \
  vec_ste(tmp, 0, (float *)(ptr_c)); \
  vec_ste(tmp, 4, (float *)(ptr_c)); \
}

#define STORE_L_8_4_C(ptr_c0, ptr_c1, ptr_c2, ptr_c3, t0, t1) \
{ \
  simd_vector tmp0, tmp1; \
  tmp0 = vec_mergeh((t0), (t1)); \
  tmp1 = vec_mergel((t0), (t1)); \
  STORE_COMPLEX_1((simd_complex *)(ptr_c0), tmp0); \
  STORE_COMPLEX_2((simd_complex *)(ptr_c1), tmp0); \
  STORE_COMPLEX_1((simd_complex *)(ptr_c2), tmp1); \
  STORE_COMPLEX_2((simd_complex *)(ptr_c3), tmp1); \
}

#define STORE_R_4(ptr_r0, ptr_r1, ptr_r2, ptr_r3, t) \
{ \
  simd_vector tmp; \
  simd_real *tmpp = (simd_real *) &tmp; \
  STORE_VECT((float *)tmpp, t); \
  *(ptr_r0) = tmpp[0]; \
  *(ptr_r1) = tmpp[1]; \
  *(ptr_r2) = tmpp[2]; \
  *(ptr_r3) = tmpp[3]; \
}

#endif

Appendix D

SPIRAL Example Code

This appendix displays a scalar and a short vector SIMD code example for the DFT16 which was obtained by utilizing the scalar Spiral version and the newly developed short vector SIMD extension for Spiral. In addition, the respective SPL program is displayed.

D.1 Scalar C Code

This section shows the scalar single-precision code of a DFT16 generated and adapted on an Intel Pentium 4. The formula translated is

DFT_16 = ( ((DFT_2 ⊗ I_2) T^4_2 (I_2 ⊗ DFT_2) L^4_2) ⊗ I_4 ) T^16_4
         ( I_4 ⊗ ((DFT_2 ⊗ I_2) T^4_2 (I_2 ⊗ DFT_2) L^4_2) ) L^16_4 .

SPL Program

( compose
  ( tensor
    ( compose
      ( tensor (F2) (I2) )
      (T42)
      ( tensor (I2) (F2) )
      (L42) )
    (I4) )
  (T164)
  ( tensor (I4)
    ( compose
      ( tensor (F2) (I2) )
      (T42)
      ( tensor (I2) (F2) )
      (L42) ) )
  (L164) )

Scalar C Program

void DFT_16(float *y, float *x)
{
  float f91;
  float f92;
  float f93;
  ...
  float f230;

  f91 = x[2] - x[30];
  f92 = x[3] - x[31];
  f93 = x[2] + x[30];
  f94 = x[3] + x[31];
  f95 = x[4] - x[28];
  f96 = x[5] - x[29];
  f97 = x[4] + x[28];
  f98 = x[5] + x[29];
  f99 = x[6] - x[26];
  f100 = x[7] - x[27];
  f101 = x[6] + x[26];
  f102 = x[7] + x[27];
  f103 = x[8] - x[24];
  f104 = x[9] - x[25];
  f105 = x[8] + x[24];
  f106 = x[9] + x[25];
  f107 = x[10] - x[22];
  f108 = x[11] - x[23];
  f109 = x[10] + x[22];
  f110 = x[11] + x[23];
  f111 = x[12] - x[20];
  f112 = x[13] - x[21];
  f113 = x[12] + x[20];
  f114 = x[13] + x[21];
  f115 = x[14] - x[18];
  f116 = x[15] - x[19];
  f117 = x[14] + x[18];
  f118 = x[15] + x[19];
  f119 = x[0] - x[16];
  f120 = x[1] - x[17];

  f121 = x[0] + x[16];
  f122 = x[1] + x[17];
  f123 = f93 - f117;
  f124 = f94 - f118;
  f125 = f93 + f117;
  f126 = f94 + f118;
  f127 = f97 - f113;
  f128 = f98 - f114;
  f129 = f97 + f113;
  f130 = f98 + f114;
  f131 = f101 - f109;
  f132 = f102 - f110;
  f133 = f101 + f109;
  f134 = f102 + f110;
  f135 = f121 - f105;
  f136 = f122 - f106;
  f137 = f121 + f105;
  f138 = f122 + f106;
  f139 = f125 - f133;
  f140 = f126 - f134;
  f141 = f125 + f133;
  f142 = f126 + f134;
  f143 = f137 - f129;
  f144 = f138 - f130;
  f145 = f137 + f129;
  f146 = f138 + f130;
  y[16] = f145 - f141;
  y[17] = f146 - f142;
  y[0] = f145 + f141;
  y[1] = f146 + f142;
  f151 = 0.7071067811865476 * f139;
  f152 = 0.7071067811865476 * f140;
  f153 = f135 - f151;
  f154 = f136 - f152;
  f155 = f135 + f151;
  f156 = f136 + f152;
  f157 = 0.7071067811865476 * f127;
  f158 = 0.7071067811865476 * f128;
  f159 = f119 - f157;
  f160 = f120 - f158;
  f161 = f119 + f157;
  f162 = f120 + f158;
  f163 = f123 + f131;
  f164 = f124 + f132;
  f165 = 1.3065629648763766 * f123;
  f166 = 1.3065629648763766 * f124;
  f167 = 0.9238795325112866 * f163;
  f168 = 0.9238795325112866 * f164;
  f169 = 0.5411961001461967 * f131;
  f170 = 0.5411961001461967 * f132;
  f171 = f165 - f167;
  f172 = f166 - f168;

  f173 = f167 - f169;
  f174 = f168 - f170;
  f175 = f161 - f173;
  f176 = f162 - f174;
  f177 = f161 + f173;
  f178 = f162 + f174;
  f179 = f159 - f171;
  f180 = f160 - f172;
  f181 = f159 + f171;
  f182 = f160 + f172;
  f183 = f91 + f115;
  f184 = f92 + f116;
  f185 = f91 - f115;
  f186 = f92 - f116;
  f187 = f99 + f107;
  f188 = f100 + f108;
  f189 = f107 - f99;
  f190 = f108 - f100;
  f191 = f185 - f189;
  f192 = f186 - f190;
  f193 = f185 + f189;
  f194 = f186 + f190;
  f195 = 0.7071067811865476 * f191;
  f196 = 0.7071067811865476 * f192;
  f197 = f183 - f187;
  f198 = f184 - f188;
  f199 = 1.3065629648763766 * f183;
  f200 = 1.3065629648763766 * f184;
  f201 = 0.9238795325112866 * f197;
  f202 = 0.9238795325112866 * f198;
  f203 = 0.5411961001461967 * f187;
  f204 = 0.5411961001461967 * f188;
  f205 = f199 - f201;
  f206 = f200 - f202;
  f207 = f201 + f203;
  f208 = f202 + f204;
  f209 = f95 - f111;
  f210 = f96 - f112;
  f211 = f95 + f111;
  f212 = f96 + f112;
  f213 = 0.7071067811865476 * f211;
  f214 = 0.7071067811865476 * f212;
  f215 = f213 - f103;
  f216 = f214 - f104;
  f217 = f213 + f103;
  f218 = f214 + f104;
  f219 = f205 - f217;
  f220 = f206 - f218;
  f221 = f205 + f217;
  f222 = f206 + f218;
  f223 = f195 - f209;
  f224 = f196 - f210;

  f225 = f195 + f209;
  f226 = f196 + f210;
  f227 = f215 - f207;
  f228 = f216 - f208;
  f229 = f215 + f207;
  f230 = f216 + f208;
  y[30] = f177 + f222;
  y[31] = f178 - f221;
  y[2] = f177 - f222;
  y[3] = f178 + f221;
  y[28] = f155 + f226;
  y[29] = f156 - f225;
  y[4] = f155 - f226;
  y[5] = f156 + f225;
  y[26] = f181 + f230;
  y[27] = f182 - f229;
  y[6] = f181 - f230;
  y[7] = f182 + f229;
  y[24] = f143 + f194;
  y[25] = f144 - f193;
  y[8] = f143 - f194;
  y[9] = f144 + f193;
  y[22] = f179 - f228;
  y[23] = f180 + f227;
  y[10] = f179 + f228;
  y[11] = f180 - f227;
  y[20] = f153 + f224;
  y[21] = f154 - f223;
  y[12] = f153 - f224;
  y[13] = f154 + f223;
  y[18] = f175 + f220;
  y[19] = f176 - f219;
  y[14] = f175 - f220;
  y[15] = f176 + f219;
}

D.2 Short Vector Code

This section shows the four-way short vector SIMD code for a DFT16 generated and adapted on an Intel Pentium 4. The formula translated is

DFT_16 = (I_4 ⊗ L^8_4) ( ((DFT_2 ⊗ I_2) T^4_2 (I_2 ⊗ DFT_2) L^4_2) ⊗ I_4 ) T^16_4
         (L^8_4 ⊗ I_4) (I_2 ⊗ L^16_4)
         ( ((DFT_2 ⊗ I_2) T^4_2 (I_2 ⊗ DFT_2) L^4_2) ⊗ I_4 ) (I_4 ⊗ L^8_2) .

The respective C program using the portable SIMD API is displayed.

#include "simd_api.h"

DECLARE_CONST_4(SCONST7, 0.000000000000, 0.923879532511,
                0.707106781187, -0.382683432365);
DECLARE_CONST_4(SCONST6, 1.000000000000, 0.382683432365,
                -0.707106781187, -0.923879532511);
DECLARE_CONST_4(SCONST5, 0.000000000000, 0.707106781187,
                1.000000000000, 0.707106781187);
DECLARE_CONST_4(SCONST4, 1.000000000000, 0.707106781187,
                0.000000000000, -0.707106781187);
DECLARE_CONST_4(SCONST3, 0.000000000000, 0.382683432365,
                0.707106781187, 0.923879532511);
DECLARE_CONST_4(SCONST2, 1.000000000000, 0.923879532511,
                0.707106781187, 0.382683432365);

SIMD_FUNC DFT_16_1(simd_real *y, simd_real *x)
{
  simd_vector xl0;
  ...
  simd_vector xl7;
  simd_vector yl0;
  ...
  simd_vector yl7;
  simd_vector f9;
  ...
  simd_vector f16;
  __assume_aligned(x, 16);
  __assume_aligned(y, 16);

  LOAD_L_8_2(xl0, xl1, x + 0, x + 4);
  LOAD_L_8_2(xl4, xl5, x + 16, x + 20);
  f9 = VEC_SUB(xl0, xl4);
  f10 = VEC_SUB(xl1, xl5);
  f11 = VEC_ADD(xl0, xl4);
  f12 = VEC_ADD(xl1, xl5);
  LOAD_L_8_2(xl2, xl3, x + 8, x + 12);
  LOAD_L_8_2(xl6, xl7, x + 24, x + 28);
  f13 = VEC_SUB(xl2, xl6);
  f14 = VEC_SUB(xl3, xl7);
  f15 = VEC_ADD(xl2, xl6);
  f16 = VEC_ADD(xl3, xl7);
  yl4 = VEC_SUB(f11, f15);
  yl5 = VEC_SUB(f12, f16);
  yl0 = VEC_ADD(f11, f15);
  yl1 = VEC_ADD(f12, f16);
  yl6 = VEC_ADD(f9, f14);
  yl7 = VEC_SUB(f10, f13);
  yl2 = VEC_SUB(f9, f14);
  STORE_L_16_4(y + 0, y + 8, y + 16, y + 24, yl0, yl2, yl4, yl6);
  yl3 = VEC_ADD(f10, f13);
  STORE_L_16_4(y + 4, y + 12, y + 20, y + 28, yl1, yl3, yl5, yl7);
}

SIMD_FUNC DFT_16_0(simd_real *y, simd_real *x)
{
  simd_vector xl0;
  ...
  simd_vector xl7;
  simd_vector yl0;
  ...
  simd_vector yl7;
  simd_vector ldtmp0;
  ...
  simd_vector ldtmp7;
  simd_vector f9;
  ...
  simd_vector f16;
  __assume_aligned(x, 16);
  __assume_aligned(y, 16);

  LOAD_VECT(xl0, x + 0);
  LOAD_VECT(xl1, x + 4);
  LOAD_VECT(ldtmp4, x + 16);
  LOAD_VECT(ldtmp5, x + 20);
  COMPLEX_MULT(xl4, xl5, ldtmp4, ldtmp5,
               LOAD_CONST(SCONST4), LOAD_CONST(SCONST5));
  f9 = VEC_SUB(xl0, xl4);
  f10 = VEC_SUB(xl1, xl5);
  f11 = VEC_ADD(xl0, xl4);
  f12 = VEC_ADD(xl1, xl5);
  LOAD_VECT(ldtmp2, x + 8);
  LOAD_VECT(ldtmp3, x + 12);
  COMPLEX_MULT(xl2, xl3, ldtmp2, ldtmp3,
               LOAD_CONST(SCONST2), LOAD_CONST(SCONST3));
  LOAD_VECT(ldtmp6, x + 24);
  LOAD_VECT(ldtmp7, x + 28);
  COMPLEX_MULT(xl6, xl7, ldtmp6, ldtmp7,
               LOAD_CONST(SCONST6), LOAD_CONST(SCONST7));
  f13 = VEC_SUB(xl2, xl6);
  f14 = VEC_SUB(xl3, xl7);
  f15 = VEC_ADD(xl2, xl6);
  f16 = VEC_ADD(xl3, xl7);
  yl4 = VEC_SUB(f11, f15);
  yl5 = VEC_SUB(f12, f16);
  STORE_L_8_4(y + 16, y + 20, yl4, yl5);
  yl0 = VEC_ADD(f11, f15);
  yl1 = VEC_ADD(f12, f16);
  STORE_L_8_4(y + 0, y + 4, yl0, yl1);
  yl6 = VEC_ADD(f9, f14);
  yl7 = VEC_SUB(f10, f13);
  STORE_L_8_4(y + 24, y + 28, yl6, yl7);
  yl2 = VEC_SUB(f9, f14);
  yl3 = VEC_ADD(f10, f13);

  STORE_L_8_4(y + 8, y + 12, yl2, yl3);
}

void DFT_16(simd_real *restrict y, simd_real *restrict x)
{
  static simd_vector t53[8];
  __assume_aligned(x, 16);
  __assume_aligned(y, 16);

  DFT_16_1((simd_real *) t53 + 0, x + 0);
  DFT_16_0(y + 0, (simd_real *) t53 + 0);
}

Appendix E

FFTW Example

This appendix displays the Fftw framework in pseudo code as well as a scalar and the respective short vector SIMD no-twiddle codelet of size four. The short vector codelet implements a part of rule (7.25), which belongs to the short vector Cooley-Tukey rule set (Theorem 7.5 on page 124). This is the smallest no-twiddle codelet that can be used for implementing this rule.

E.1 The FFTW Framework

main(n, in, out)
{
  complex in[n], out[n]
  plan p

  p := planner(n)

  fftw(in, out, p);
}

fftw(input, output, plan)
{
  execute(plan.n, input, output, plan, 1, 1)
  return output
}

execute(n, input, output, plan, instride, outstride)
{
  if plan.type = NO_TWIDDLE
    plan.notwiddlecodelet(input, output, instride, outstride)
  if plan.type = TWIDDLE
    r := plan.twiddlesize
    m := n/r
    for i := 0 to r-1 do
      execute(m, input + i*instride, output + i*m*outstride,
              plan.nextstep, instride*r, outstride)
    plan.twiddlecodelet(output, plan.twiddlefactors,
                        m * outstride, m, outstride);
}


notwiddlecodelet(input, output, instride, outstride)
{
  complex in[N], out[N]

  for i := 0 to N-1 do
    in[i] := input[i*instride]

  FFT(N, in, out)

  for i := 0 to N-1 do
    output[i*outstride] := out[i]
  return output
}

twiddlecodelet(inoutput, twiddlefactors, stride, m, dist)
{
  complex in[N], out[N]

  for i := 0 to m-1
    for j := 0 to N-1
      in[j] := inoutput[i*dist + j*stride]

    apply_twiddlefactors(N, in, i, twiddlefactors)
    FFT(N, in, out)

    for j := 0 to N-1
      inoutput[i*dist + j*stride] := out[j]
}

E.2 Scalar C Code

This section shows a standard Fftw no-twiddle codelet of size 4. The computation done by this codelet is

update( y, W^{N,4}_{output, ostride}, DFT_4 R^{N,4}_{input, istride} x ).

void fftw_no_twiddle_4(const fftw_complex *input, fftw_complex *output,
                       int istride, int ostride)
{
  fftw_real tmp3;
  fftw_real tmp11;
  fftw_real tmp9;
  fftw_real tmp15;
  fftw_real tmp6;
  fftw_real tmp10;
  fftw_real tmp14;
  fftw_real tmp16;
  {
    fftw_real tmp1;
    fftw_real tmp2;
    fftw_real tmp7;
    fftw_real tmp8;
    tmp1 = c_re(input[0]);
    tmp2 = c_re(input[2 * istride]);
    tmp3 = (tmp1 + tmp2);
    tmp11 = (tmp1 - tmp2);
    tmp7 = c_im(input[0]);
    tmp8 = c_im(input[2 * istride]);
    tmp9 = (tmp7 - tmp8);
    tmp15 = (tmp7 + tmp8);
  }
  {
    fftw_real tmp4;
    fftw_real tmp5;
    fftw_real tmp12;
    fftw_real tmp13;
    tmp4 = c_re(input[istride]);
    tmp5 = c_re(input[3 * istride]);
    tmp6 = (tmp4 + tmp5);
    tmp10 = (tmp4 - tmp5);
    tmp12 = c_im(input[istride]);
    tmp13 = c_im(input[3 * istride]);
    tmp14 = (tmp12 - tmp13);
    tmp16 = (tmp12 + tmp13);
  }
  c_re(output[2 * ostride]) = (tmp3 - tmp6);
  c_re(output[0]) = (tmp3 + tmp6);
  c_im(output[ostride]) = (tmp9 - tmp10);
  c_im(output[3 * ostride]) = (tmp10 + tmp9);
  c_re(output[3 * ostride]) = (tmp11 - tmp14);
  c_re(output[ostride]) = (tmp11 + tmp14);
  c_im(output[2 * ostride]) = (tmp15 - tmp16);
  c_im(output[0]) = (tmp15 + tmp16);
}

E.3 Short Vector Code

This section shows a four-way SIMD vectorized Fftw no-twiddle codelet of size 4. The computation done by this short vector SIMD codelet is

update( y, W^{N/4,4}_{output/8, ostride/8} ⊗ I_8,
        (L^8_4 ⊗ I_4)(I_2 ⊗ L^16_4)(L^8_2 ⊗ I_4)(DFT_4 ⊗ I_4)
        (R^{N/4,4}_{input/8, istride/8} ⊗ L^8_2) x ).

#include "codelet.h" #include "simd_api.h"

void fftw_no_twiddle_4_simd(const simd_vector *in, simd_vector *out,
                            int is, int os)
{
  simd_vector tmp3;
  simd_vector tmp11;
  simd_vector tmp9;
  simd_vector tmp15;
  simd_vector tmp6;
  simd_vector tmp10;
  simd_vector tmp14;
  simd_vector tmp16;
  {
    simd_vector tmp1;
    simd_vector tmp2;
    simd_vector tmp7;
    simd_vector tmp8;
    LOAD_L_8_2(tmp1, tmp7, in + 0, in + 4);
    LOAD_L_8_2(tmp2, tmp8, in + 2 * is + 0, in + 2 * is + 4);
    tmp3 = VEC_ADD(tmp1, tmp2);
    tmp11 = VEC_SUB(tmp1, tmp2);
    tmp9 = VEC_SUB(tmp7, tmp8);
    tmp15 = VEC_ADD(tmp7, tmp8);
  }
  {
    simd_vector tmp4;
    simd_vector tmp5;
    simd_vector tmp12;
    simd_vector tmp13;
    LOAD_L_8_2(tmp4, tmp12, in + 1 * is + 0, in + 1 * is + 4);
    LOAD_L_8_2(tmp5, tmp13, in + 3 * is + 0, in + 3 * is + 4);
    tmp6 = VEC_ADD(tmp4, tmp5);
    tmp10 = VEC_SUB(tmp4, tmp5);
    tmp14 = VEC_SUB(tmp12, tmp13);
    tmp16 = VEC_ADD(tmp12, tmp13);
  }
  {
    simd_vector tmp17;
    simd_vector tmp23;
    simd_vector tmp18;
    simd_vector tmp24;
    simd_vector tmp19;
    simd_vector tmp22;
    simd_vector tmp20;
    simd_vector tmp21;
    tmp17 = VEC_SUB(tmp3, tmp6);
    tmp23 = VEC_SUB(tmp15, tmp16);
    tmp18 = VEC_ADD(tmp3, tmp6);
    tmp24 = VEC_ADD(tmp15, tmp16);
    tmp19 = VEC_SUB(tmp9, tmp10);
    tmp22 = VEC_ADD(tmp11, tmp14);
    tmp20 = VEC_ADD(tmp10, tmp9);
    STORE_L_16_4(out, out + os, out + 2 * os, out + 3 * os,
                 tmp24, tmp19, tmp23, tmp20);
    tmp21 = VEC_SUB(tmp11, tmp14);
    STORE_L_16_4(out + 4, out + 4 + os, out + 4 + 2 * os, out + 4 + 3 * os,
                 tmp18, tmp22, tmp17, tmp21);
  }
}

Table of Abbreviations

API    Application programming interface
AEOS   Automated empirical optimization of software
CISC   Complex instruction set computer
CPI    Cycles per instruction
CPU    Central processing unit
DCT    Discrete cosine transform
DFT    Discrete Fourier transform
DRAM   Dynamic random access memory
DSP    Digital signal processing, digital signal processor
FFT    Fast Fourier transform
FMA    Fused multiply-add
FPU    Floating-point unit
GUI    Graphical user interface
ISA    Instruction set architecture
PMC    Performance monitor counter
RAM    Random access memory
ROM    Read-only memory
RISC   Reduced instruction set computer
SIMD   Single instruction, multiple data
SPL    Signal processing language
SRAM   Static random access memory
SSE    Streaming SIMD Extensions
TLB    Translation lookaside buffer
VLIW   Very long instruction word
WHT    Walsh-Hadamard transform

Bibliography

[1] A. V. Aho, R. Sethi, J. D. Ullman: Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., 1986.

[2] AMD Corporation: 3DNow! Technology Manual, 2000.

[3] AMD Corporation: 3DNow! Instruction Porting Guide Application Note, 2002.

[4] AMD Corporation: AMD Athlon Processor x86 Code Optimization Guide, 2002.

[5] AMD Corporation: AMD Extensions to the 3DNow! and MMX Instruction Sets Manual, 2002.

[6] AMD Corporation: AMD x86-64 Architecture Programmer's Manual, Volume 1: Application Programming. Order Number 24592, 2002.

[7] AMD Corporation: AMD x86-64 Architecture Programmer's Manual, Volume 2: System Programming. Order Number 24593, 2002.

[8] Apple Computer: vDSP Library, 2001. http://developer.apple.com/

[9] M. Auer, R. Benedik, F. Franchetti, H. Karner, P. Kristöfel, R. Schachinger, A. Slateff, C. Ueberhuber: Performance Evaluation of FFT Routines—Machine Independent Serial Programs. Technical Report Aurora TR1999-05, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 1999.

[10] D. H. Bailey: FFTs in External or Hierarchical Memory. J. Supercomputing 4 (1990), pp. 23–35.

[11] R. Berrendorf, H. Ziegler: Pcl, 2002. http://www.fz-juelich.de/zam/PCL

[12] J. Bilmes, K. Asanović, C. Chin, J. Demmel: Optimizing Matrix Multiply Using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology. In Proceedings of the 1997 International Conference on Supercomputing, ACM Press, New York, pp. 340–347.

[13] J. W. Cooley, J. W. Tukey: An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Comp. 19 (1965), pp. 297–301.


[14] T. Cormen, C. Leiserson, R. Rivest: Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts, 1990.

[15] R. Crandall, J. Klivington: Supercomputer-Style FFT Library for the Apple G4. Advanced Computation Group, Apple Computer, 2002.

[16] Digital Equipment Corporation: Pfm—The 21064 Performance Counter Pseudo-Device. DEC OSF/1 Manual pages, 1995.

[17] J. J. Dongarra, I. S. Duff, D. C. Sorensen, H. A. van der Vorst: Numerical Linear Algebra for High-Performance Computing. SIAM Press, Philadelphia, 1998.

[18] J. J. Dongarra, F. G. Gustavson, A. Karp: Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine. SIAM Review 26 (1984), pp. 91–112.

[19] P. Duhamel: Implementation of Split-Radix FFT Algorithms for Complex, Real, and Real-Symmetric Data. IEEE Trans. Acoust. Speech Signal Processing 34 (1986), pp. 285–295.

[20] P. Duhamel: Algorithms Meeting the Lower Bounds on the Multiplicative Complexity of Length-2^n DFTs and Their Connection with Practical Algorithms. IEEE Trans. Acoust. Speech Signal Processing 38 (1990), pp. 1504–1511.

[21] P. Duhamel, H. Hollmann: Split Radix FFT Algorithms. Electron. Lett. 20 (1984), pp. 14–16.

[22] S. Egner, M. Püschel: The Arep WWW Home Page, 2000. www.ece.cmu.edu/~smart/arep/arep.html

[23] S. Egner, M. Püschel: Symmetry-Based Matrix Factorization. Journal of Symbolic Computation, to appear.

[24] F. Franchetti: A Portable Short Vector Version of Fftw. Proc. of the MATHMOD 2003, to appear.

[25] F. Franchetti, H. Karner, S. Kral, C. W. Ueberhuber: Architecture Independent Short Vector FFTs. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'01), IEEE Press, New York, Vol. 2, pp. 1109–1112.

[26] F. Franchetti, J. Lorenz, C. Ueberhuber: Low Communication FFTs. Technical Report Aurora TR2002-27, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

[27] F. Franchetti, M. Püschel: A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms. In Proc. International Parallel and Distributed Processing Symposium (IPDPS'02).

[28] F. Franchetti, M. Püschel: Short Vector Code Generation and Adaptation for DSP Algorithms. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), to appear.

[29] F. Franchetti, M. Püschel: Short Vector Code Generation for the Discrete Fourier Transform. Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS'03), to appear.

[30] F. Franchetti, M. Püschel, J. Moura, C. Ueberhuber: Short Vector SIMD Code Generation for DSP Algorithms. In Proc. of the 2002 High Performance Embedded Computing Conference, MIT Lincoln Laboratory.

[31] M. Frigo: A Fast Fourier Transform Compiler. In Proceedings of the ACM SIGPLAN '99 Conference on Programming Language Design and Implementation, ACM Press, New York, 1999, pp. 169–180.

[32] M. Frigo, S. G. Johnson: The Fastest Fourier Transform in the West. Technical Report MIT-LCS-TR-728, MIT Laboratory for Computer Science, Cambridge, 1997.

[33] M. Frigo, S. G. Johnson: Fftw: An Adaptive Software Architecture for the FFT. In Proc. ICASSP 98, Vol. 3, pp. 1381–1384.

[34] M. Frigo, C. E. Leiserson, H. Prokop, S. Ramachandran: Cache-Oblivious Algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS'99), IEEE Comp. Society Press, Los Alamitos.

[35] M. Galles, E. Williams: Performance Optimizations, Implementation, and Verification of the SGI Challenge Multiprocessor. Proceedings of the 27th Annual Hawaii International Conference on System Sciences, 1994.

[36] W. N. Gansterer, C. W. Ueberhuber: Hochleistungsrechnen mit HPF. Springer-Verlag, Berlin Heidelberg New York Tokyo, 2001.

[37] K. S. Gatlin, L. Carter: Faster FFTs via Architecture-Cognizance. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'00), IEEE Comp. Society Press, Los Alamitos, pp. 249–260.

[38] G. H. Golub, C. F. Van Loan: Matrix Computations, 2nd edn. Johns Hopkins University Press, Baltimore, 1989.

[39] S. K. S. Gupta, Z. Li, J. H. Reif: Synthesizing Efficient Out-of-Core Programs for Block Recursive Algorithms Using Block-Cyclic Data Distributions. Technical Report TR-96-04, Dept. of Computer Science, Duke University, Durham, USA, 1996.

[40] G. Haentjens: An Investigation of Cooley-Tukey Decompositions for the FFT. Master's Thesis, Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA, 2000.

[41] H. Hlavacs, C. W. Ueberhuber: High-Performance Computers – Hardware, Software and Performance Simulation. SCS-Europe, Ghent, 2002.

[42] D. Hunt: Advanced Performance Features of the 64-bit PA8000. COMPCON'95, 1995.

[43] Intel Corporation: Survey of Pentium Processor Performance Monitoring Capabilities & Tools. App. Note, 1996.

[44] Intel Corporation: AP-808 Split Radix Fast Fourier Transform Using Streaming SIMD Extensions, 1999.

[45] Intel Corporation: AP-833 Data Alignment and Programming Issues for the Streaming SIMD Extensions with the Intel C/C++ Compiler, 1999.

[46] Intel Corporation: AP-931 Streaming SIMD Extensions—LU Decomposi- tion, 1999.

[47] Intel Corporation: Intel Architecture Optimization—Reference Manual, 1999.

[48] Intel Corporation: Desktop Performance and Optimization for Intel Pentium 4 Processor, 2002.

[49] Intel Corporation: Intel Architecture Software Developer’s Manual—Volume 1: Basic Architecture, 2002.

[50] Intel Corporation: Intel Architecture Software Developer’s Manual—Volume 2: Instruction Set Reference, 2002.

[51] Intel Corporation: Intel Architecture Software Developer’s Manual—Volume 3: System Programming, 2002.

[52] Intel Corporation: Intel C/C++ Compiler User’s Guide, 2002.

[53] Intel Corporation: Intel Itanium 2 Processor Reference Manual for Software Development and Optimization, 2002.

[54] Intel Corporation: Intel Itanium Architecture Software Developer’s Manual Vol. 1 rev. 2.1: Application Architecture, 2002.

[55] Intel Corporation: Intel Itanium Architecture Software Developer's Manual Vol. 2 rev. 2.1: System Architecture, 2002.

[56] Intel Corporation: Intel Itanium Architecture Software Developer’s Manual Vol. 3 rev. 2.1: Instruction Set Reference, 2002.

[57] Intel Corporation: Intel Itanium Processor Reference Manual for Software Optimization, 2002.

[58] Intel Corporation: Math Kernel Library, 2002. http://www.intel.com/software/products/mkl

[59] J. R. Johnson, R. W. Johnson, C. P. Marshall, J. E. Mertz, D. Pryor, J. H. Weckel: Data Flow, the FFT, and the CRAY T3E. In Proc. of the Ninth SIAM Conference on Parallel Processing for Scientific Computing.

[60] J. R. Johnson, R. W. Johnson, D. Rodriguez, R. Tolimieri: A Methodology for Designing, Modifying, and Implementing FFT Algorithms on Various Architectures. Circuits Systems Signal Process. 9 (1990), pp. 449–500.

[61] R. W. Johnson, C. H. Huang, J. R. Johnson: Multilinear Algebra and Parallel Programming. J. Supercomputing 5 (1991), pp. 189–217.

[62] S. Joshi, P. Dubey: Radix-4 FFT Implementation Using SIMD Multi-Media Instructions. In Proc. ICASSP 99, pp. 2131–2135.

[63] H. Karner, C. Ueberhuber: Overlapped Four-Step FFT Computation. In Parallel Computation, Lecture Notes in Computer Science, Vol. 1557 (P. Zinterhof, M. Vajtersic, A. Uhl, eds.), Springer-Verlag, Berlin Heidelberg New York Tokyo, 1999, pp. 590–591.

[64] S. Kral: Fftw-Gel, 2002. www.fftw.org/~skral

[65] S. Lamson: Sciport, 1995. http://www.netlib.org/scilib/

[66] X. Leroy: The Objective Caml System, Release 3.06—Documentation and User Manual, 2002.

[67] T. Mathisen: Pentium Secrets. Byte 7 (1994), pp. 191–192.

[68] MIPS Technologies Inc.: R10000 Microprocessor Technical Brief, 1994.

[69] MIPS Technologies Inc.: Definition of MIPS R10000 Performance Counter, 1997.

[70] Motorola Corporation: AltiVec Technology Programming Environments Manual, 2000.

[71] Motorola Corporation: AltiVec Technology Programming Interface Manual, 2000.

[72] J. M. F. Moura, J. Johnson, R. W. Johnson, D. Padua, V. Prasanna, M. Püschel, M. M. Veloso: Spiral: Portable Library of Optimized Signal Processing Algorithms, 1998.

[73] P. Mucci, S. Browne, G. Ho, C. Deane: Papi, 2002. http://icl.cs.utk.edu/projects/papi

[74] I. Nicholson: libSIMD, 2002. http://libsimd.sourceforge.net/

[75] A. Norton, A. J. Silberger: Parallelization and Performance Analysis of the Cooley-Tukey FFT Algorithm for Shared-Memory Architectures. IEEE Trans. Comput. 36 (1987), pp. 581–591.

[76] M. C. Pease: An Adaptation of the Fast Fourier Transform for Parallel Processing. Journal of the ACM 15 (1968), pp. 252–264.

[77] N. P. Pitsianis: The Kronecker Product in Optimization and Fast Transform Generation. Ph.D. Thesis, Department of Computer Science, Cornell University, 1997.

[78] M. Püschel: Decomposing Monomial Representations of Solvable Groups. Journal of Symbolic Computation, to appear.

[79] M. Püschel, B. Singer, M. Veloso, J. M. F. Moura: Fast Automatic Generation of DSP Algorithms. In Proc. ICCS 2001, Springer, LNCS 2073, pp. 97–106.

[80] M. Püschel, B. Singer, J. Xiong, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, R. W. Johnson: Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. Journal of High Performance Computing and Applications (2001), submitted.

[81] K. R. Rao, J. J. Hwang: Techniques & Standards for Image, Video and Audio Coding. Prentice Hall PTR, 1996.

[82] P. Rodriguez: A Radix-2 FFT Algorithm for Modern Single Instruction Multiple Data (SIMD) Architectures. In Proc. ICASSP 2002.

[83] B. Ryan: PowerPC 604 Weighs In. Byte 6 (1994), pp. 265–266.

[84] B. Singer, M. Veloso: Stochastic Search for Signal Processing Algorithm Optimization. In Proc. Supercomputing '01.

[85] H. V. Sorensen, M. T. Heideman, C. S. Burrus: On Computing the Split-Radix FFT. IEEE Trans. Acoust. Speech Signal Processing 34 (1986), pp. 152–156.

[86] P. N. Swarztrauber: FFT Algorithms for Vector Computers. Parallel Comput. 1 (1984), pp. 45–63.

[87] C. Temperton: Fast Mixed-Radix Real Fourier Transforms. J. Comput. Phys. 52 (1983), pp. 340–350.

[88] The Gap Group: Gap—Groups, Algorithms, and Programming, Version 4.2, 2000. www-gap.dcs.st-and.ac.uk/~gap

[89] C. W. Ueberhuber: Numerical Computation. Springer-Verlag, Berlin Heidelberg New York Tokyo, 1997.

[90] C. F. Van Loan: Computational Frameworks for the Fast Fourier Transform. SIAM Press, Philadelphia, 1992.

[91] M. Vetterli, P. Duhamel: Split-Radix Algorithms for Length-p^m DFTs. IEEE Trans. Acoust. Speech Signal Processing 37 (1989), pp. 57–64.

[92] Z. Wang: Fast Algorithms for the Discrete W Transform and for the Discrete Fourier Transform. IEEE Trans. Acoust. Speech Signal Processing ASSP-32 (1984)(4), pp. 803–816.

[93] E. H. Welbon, C. C. Chan-Nui, D. J. Shippy, D. A. Hicks: Power 2 Performance Monitor. PowerPC and Power 2: Technical Aspects of the New IBM RISC System/6000. IBM Corporation SA23-2737 (1994), pp. 55–63.

[94] R. C. Whaley, A. Petitet, J. J. Dongarra: Automated Empirical Optimizations of Software and the Atlas Project. Parallel Comput. 27 (2001), pp. 3–35.

[95] J. Xiong, J. R. Johnson, R. W. Johnson, D. Padua: SPL: A Language and Compiler for DSP Algorithms. In Proc. PLDI 01, pp. 298–308.

[96] M. Zagha, B. Larson, S. Turner, M. Itzkowitz: Performance Analysis Using the MIPS R10000 Performance Counters. Proceedings Supercomputing'96, IEEE Computer Society Press, 1996.

Curriculum Vitae

Name: Franz R. Franchetti

Title: Dipl.-Ing.

Date and Place of Birth: 18 September 1974, Mödling, Lower Austria

Nationality: Austria

Home Address: Hartiggasse 3/602, A-2700 Wiener Neustadt, Austria

Affiliation: Institute for Applied Mathematics and Numerical Analysis
             Vienna University of Technology
             Wiedner Hauptstrasse 8-10/1152, A-1040 Vienna
Phone:  +43 1 58801 11524
Fax:    +43 1 58801 11599
E-mail: [email protected]

Education
1994         High School Diploma (Matura) with distinction
1995 – 2000  Studies in Technical Mathematics at the Vienna University of Technology
2000         Dipl.-Ing. (Technical Mathematics) with distinction at the Vienna University of Technology
2000 – 2003  Ph.D. studies

Employment
1994 – 2002  Part-time system administrator with Zentraplan HKL GesmbH in Wr. Neustadt
1998, 1999   Summer internships with SIEMENS AG (PSE) in Vienna
1997 –       Part-time system administrator at the Vienna University of Technology
2000 –       Research Assistant at the Institute for Applied Math. and Numerical Analysis, funded by the SFB Aurora Project

Experience

1997 – Participation in the SFB Aurora

Societies and Activities
Member, Association for Computing Machinery (ACM)
Member, Society for Industrial and Applied Mathematics (SIAM)

Scientific Cooperations
SFB Aurora, Austria
Carnegie Mellon University, USA
Massachusetts Institute of Technology, USA
Drexel University, USA
IBM T.J. Watson Research Center
Lawrence Livermore National Laboratory

Awards
OCG-Förderpreis 2001

Refereed Publications

1. F. Franchetti: A Portable Short Vector Version of Fftw. Proc. of the MATHMOD 2003, to appear.

2. F. Franchetti, M. Püschel: Short Vector Code Generation and Adaptation for DSP Algorithms. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), IEEE Comput. Soc. Press, Los Alamitos, USA, 2003, to appear.

3. F. Franchetti, M. Püschel: Short Vector Code Generation for the Discrete Fourier Transform. In Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS'03), IEEE Comp. Society Press, Los Alamitos, CA, 2003, to appear.

4. T. Fahringer, F. Franchetti, M. Geissler, G. Madsen, H. Moritsch, R. Prodan: On Using Zenturio for Performance and Parameter Studies on Clusters and Grids. In Proceedings of the 11th Euromicro Conference on Parallel Distributed and Network based Processing, IEEE Comput. Soc. Press, Los Alamitos, USA, 2003, to appear.

5. F. Franchetti, F. Kaltenberger, C. W. Ueberhuber: FFT Kernels with FMA Utilization. In Proceedings of the Aplimat 2002 Conference, 2002, to appear.

6. F. Franchetti, M. Püschel, J. Moura, C. W. Ueberhuber: Short Vector SIMD Code Generation for DSP Algorithms. In Proc. of the 2002 High Performance Embedded Computing Conference, MIT Lincoln Laboratory.

7. F. Franchetti, M. Püschel: A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms. In Proc. International Parallel and Distributed Processing Symposium (IPDPS'02), 2002.

8. F. Franchetti, H. Karner, S. Kral, C. W. Ueberhuber: Architecture Independent Short Vector FFTs. Proceedings of ICASSP 2001—International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 1109–1112.

Others

1. F. Franchetti, C. Ortner, C. W. Ueberhuber: Reif für die Insel?—Wie weit ist Java? ZIDline 7, Vienna University of Technology, 2002.

2. F. Franchetti, M. Püschel: Top Performance SIMD FFTs. Aurora Technical Report TR2002-31, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

3. F. Franchetti, G. Langs, C. W. Ueberhuber, W. Weidinger: Approximative 1D and 2D FFTs. Aurora Technical Report TR2002-30, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

4. F. Franchetti, J. Lorenz, C. W. Ueberhuber: Latency Hiding Parallel FFTs. Aurora Technical Report TR2002-29, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

5. F. Franchetti, F. Kaltenberger, C. W. Ueberhuber: FMA Optimized FFT Codelets. Aurora Technical Report TR2002-28, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

6. F. Franchetti, J. Lorenz, C. W. Ueberhuber: Low Communication FFTs. Tech Report Aurora TR2002-27, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

7. T. Fahringer, F. Franchetti, M. Geissler, G. Madsen, H. Moritsch, R. Prodan: On Using Zenturio for Performance and Parameter Studies on Clusters and Grids. Tech Report Aurora TR2002-20, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

8. F. Franchetti, F. Kaltenberger, C. W. Ueberhuber: Dimensionless FFT Algorithms. Tech Report Aurora TR2002-12, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

9. F. Franchetti, F. Kaltenberger, C. W. Ueberhuber: Reduced Transform Algorithms for Multidimensional Fourier Transforms. Tech Report Aurora TR2002-09, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

10. F. Franchetti, C. Ortner, C. W. Ueberhuber: Performance of Linear Algebra Routines in Java and C. Tech Report Aurora TR2002-06, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2002.

11. C. Fabianek, F. Franchetti, M. Frigo, H. Karner, L. Meirer, C. W. Ueberhuber: Survey of Self-adapting FFT Software. Tech Report Aurora TR2002-01, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2001.

12. F. Franchetti, C. Ortner, C. W. Ueberhuber: Java vs. C—A Performance Assessment Based on FFT Algorithms. Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2001.

13. F. Franchetti, M. Püschel: Automatic Generation of SIMD DSP Code. Tech Report Aurora TR2001-17, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 2001.

14. M. Auer, R. Benedik, F. Franchetti, H. Karner, P. Kristöfel, R. Schachinger, A. Slateff, C. W. Ueberhuber: Performance Evaluation of FFT Routines — Machine Independent Serial Programs. Tech. Report Aurora TR1999-05, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 1999.

15. M. Auer, F. Franchetti, H. Karner, C. W. Ueberhuber: Performance Evaluation of FFT Algorithms Using Performance Counters. Tech. Report Aurora TR1998-12, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, 1998.