
Vector Processing as a Soft-CPU Accelerator

by

Jason Kwok Kwun Yu

B.A.Sc., Simon Fraser University, 2005

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Applied Science

in

The Faculty of Graduate Studies

(Electrical and Computer Engineering)

The University of British Columbia
May, 2008
© Jason Kwok Kwun Yu, 2008

Abstract

Soft processors simplify hardware design by being able to implement complex control strategies using software. However, they are not fast enough for many intensive data-processing tasks, such as highly data-parallel embedded applications. This thesis suggests adding a vector processing core to the soft processor as a general-purpose accelerator for these types of applications. The approach has the benefits of a purely software-oriented development model, a fixed ISA allowing parallel software and hardware development, a single accelerator that can accelerate multiple functions in an application, and scalable performance with a single source code.

With no hardware design experience needed, a software programmer can make area-versus-performance tradeoffs by scaling the number of functional units and register file bandwidth with a single parameter. The soft vector processor can be further customized by a number of secondary parameters to add and remove features for the specific application to optimize resource utilization. This thesis shows that a vector processing architecture maps efficiently into an FPGA and provides a scalable amount of performance for a reasonable amount of area.

Configurations of the soft vector processor with different performance levels are estimated to achieve speedups of 2–24× for 5–26× the area of a Nios II/s processor on three benchmark kernels.


Table of Contents

Abstract ...... iii
Table of Contents ...... v
List of Tables ...... xi
List of Figures ...... xiii
Acknowledgments ...... xv

1 Introduction ...... 1
1.1 Motivation ...... 3
1.2 Contributions ...... 5
1.3 Thesis Outline ...... 6
2 Background ...... 7
2.1 Vector Processing Overview ...... 7
2.1.1 Exploiting Data-level Parallelism ...... 8
2.1.2 Vector Instruction Set Architecture ...... 10
2.1.3 Microarchitectural Advantages of Vector Processing ...... 11
2.2 Vector Processors and SIMD Extensions ...... 12
2.2.1 SIMD Extensions ...... 12
2.2.2 Single-chip Vector Processors ...... 13
2.2.3 FPGA-based Vector Processors ...... 13
2.3 Multiprocessor Systems ...... 15
2.4 Custom-designed Hardware Accelerators ...... 17
2.4.1 Decoupled Co-processor Accelerators ...... 17
2.4.2 Custom Instruction Accelerators ...... 18
2.5 Synthesized Hardware Accelerators ...... 20
2.5.1 C-based Synthesis ...... 20
2.5.2 Block-based Synthesis ...... 24
2.5.3 Application Specific Instruction Set Processor ...... 26
2.5.4 Synthesis from Binary Decompilation ...... 27
2.6 Other Soft Processor Architectures ...... 28
2.6.1 Mitrion Virtual Processor ...... 28
2.6.2 VLIW Processors ...... 29
2.6.3 Superscalar Processors ...... 30
2.7 Summary ...... 31

3 Configurable Soft Vector Processor ...... 33
3.1 Requirements ...... 33
3.2 Soft Vector Architecture ...... 34
3.2.1 System Overview ...... 34
3.2.2 Hybrid Vector-SIMD Model ...... 36
3.3 Vector Lane Datapath ...... 37
3.3.1 Vector Pipeline ...... 37
3.3.2 Datapath Functional Units ...... 39
3.3.3 Distributed Vector Register File ...... 40
3.3.4 Load Store Unit ...... 40
3.3.5 Local Memory ...... 41
3.4 Memory Unit ...... 41
3.4.1 Load Store Controller ...... 42
3.4.2 Read Interface ...... 42
3.4.3 Write Interface ...... 43
3.5 FPGA-Specific Vector Extensions ...... 45
3.6 Configurable Parameters ...... 47
3.7 Design Flow ...... 48
3.7.1 Soft Vector Processor Flow versus C-based Synthesis ...... 48
3.7.2 Comment on Vectorizing Compilers ...... 50
4 Results ...... 51
4.1 Benchmark Suite ...... 51
4.2 Benchmark Preparation ...... 53
4.2.1 General Methodology ...... 54
4.2.2 Benchmark Vectorization ...... 54
4.2.3 Benchmark Tuning for C2H Compiler ...... 59
4.2.4 Soft Vector Processor Per Benchmark Configuration ...... 60
4.3 Resource Utilization Results ...... 60
4.4 Performance Results ...... 63
4.4.1 Performance Models ...... 64
4.4.2 RTL Model Performance ...... 66
4.4.3 Ideal Vector Model ...... 68
4.4.4 C2H Accelerator Results ...... 70
4.4.5 Vector versus C2H ...... 71
5 Conclusions and Future Work ...... 73
5.1 Future Work ...... 75
Bibliography ...... 79

Appendices

A Soft Vector Processor ISA ...... 85
A.1 Introduction ...... 85
A.1.1 Configurable Architecture ...... 86
A.1.2 Memory Consistency ...... 87
A.2 Vector Register Set ...... 87
A.2.1 Vector Registers ...... 87
A.2.2 Vector Scalar Registers ...... 88
A.2.3 Vector Flag Registers ...... 88
A.2.4 Vector Control Registers ...... 89
A.2.5 Multiply-Accumulators for Vector Sum Reduction ...... 90
A.2.6 Vector Lane Local Memory ...... 90
A.3 Instruction Set ...... 92
A.3.1 Data Types ...... 92
A.3.2 Addressing Modes ...... 92
A.3.3 Flag Register Use ...... 92
A.3.4 Instructions ...... 92
A.4 Instruction Set Reference ...... 93
A.4.1 Integer Instructions ...... 93
A.4.2 Logical Instructions ...... 96
A.4.3 Fixed-Point Instructions (Future Extension) ...... 96
A.4.4 Memory Instructions ...... 99
A.4.5 Vector Processing Instructions ...... 102
A.4.6 Vector Flag Processing Instructions ...... 105
A.4.7 Miscellaneous Instructions ...... 106
A.5 Instruction Formats ...... 107
A.5.1 Vector Register and Vector Scalar Instructions ...... 107
A.5.2 Vector Memory Instructions ...... 108
A.5.3 Instruction Encoding ...... 109

List of Tables

1.1 New soft vector processor instruction extensions ...... 6
2.1 Partial list of FPGA-based multiprocessor systems in literature ...... 17
3.1 List of configurable processor parameters ...... 48
4.1 Per benchmark soft vector processor configuration parameter settings ...... 60
4.2 Resource usage of vector processor and C2H accelerator configurations ...... 61
4.3 Resource usage of vector processor when varying NLane and MemMinWidth with 128-bit MemWidth and otherwise full features ...... 62
4.4 Resource utilization from varying secondary processor parameters ...... 63
4.5 Ideal vector performance model ...... 65
4.6 Performance measurements ...... 67
A.1 List of configurable processor parameters ...... 87
A.2 List of vector flag registers ...... 88
A.3 List of control registers ...... 89
A.4 Instruction qualifiers ...... 93
A.11 Nios II Opcode Usage ...... 107
A.12 Scalar register usage as source or destination register ...... 108
A.13 Vector register instruction function field encoding (OPX=0) ...... 109
A.14 Scalar-vector instruction function field encoding (OPX=1) ...... 110
A.15 Fixed-point instruction function field encoding (OPX=0) ...... 110
A.16 Flag and miscellaneous instruction function field encoding (OPX=1) ...... 111
A.17 Memory instruction function field encoding ...... 111

List of Figures

2.1 8-tap FIR filter MIPS assembly ...... 9
2.2 8-tap FIR filter VIRAM vector assembly ...... 10
2.3 Nios II custom instruction formats ...... 19
2.4 Example Nios II system with C2H accelerators ...... 22
3.1 Scalar and vector core interaction ...... 35
3.2 Vector chaining versus hybrid vector-SIMD execution ...... 36
3.3 Vector co-processor system block diagram ...... 38
3.4 Vector Lane ALU ...... 39
3.5 Soft vector processor write interface ...... 43
3.6 Data alignment using crossbar and delay network for vector store ...... 44
3.7 8-tap FIR filter vector assembly ...... 47
3.8 Altera C2H compiler design flow versus soft vector processor design flow ...... 49
4.1 Motion estimation C code ...... 52
4.2 5 × 5 median filter C code ...... 53
4.3 Simultaneously matching two copies of the macroblock to a reference frame ...... 55
4.4 Code segment from the motion estimation vector assembly. The code to handle the first and last rows of the windows are not shown ...... 56
4.5 Vectorizing the image median filter ...... 57
4.6 Median filter inner loop vector assembly ...... 57
4.7 Vector assembly for loading AES data and performing AES encryption round ...... 58
4.8 RTL Model speedup over Ideal Nios Model ...... 68
4.9 Ideal Vector Model and C2H Model over Ideal Nios Model ...... 69
A.1 Connection between distributed MAC units and the vector register file ...... 91

Acknowledgments

Foremost I would like to thank my supervisor Dr. Guy Lemieux for teaching me the many aspects of academic research, and for his guidance and support throughout my research project. The many hours you spent proofreading my conference papers and thesis are very much appreciated.

Thanks to Blair Fort for providing the UTIIe processor used in this research, without which this project would have taken much longer.

I would like to thank my friends in the SoC lab and throughout the department, on the ECEGSA executive, and in the faculty music group for making my graduate life at UBC enjoyable, and allowing me to develop both academically and personally during my time at UBC.

Finally, I would like to thank my loving family and dear Florence for their constant support and encouragement.


Chapter 1

Introduction

Advanced signal processing and large volumes of multimedia data in many embedded applications today demand high performance embedded systems. These systems rely on programmable microprocessors, digital signal processors (DSPs), or field-programmable gate arrays (FPGAs) to do intensive computations. Designing embedded systems using these platforms requires expertise in both hardware and software, and is becoming a challenge due to the increasing complexity and performance requirements of today's applications. A high performance platform that is also easy to use and reduces time to market is needed to address these challenges, and an effective FPGA-based soft processor platform is one possible solution.

FPGAs are a specialized type of integrated circuit (IC) that can be programmed and reprogrammed by the user after fabrication to implement any digital circuit. FPGAs achieve this reconfigurability through the use of flexible lookup tables and routing connections that are configured by a user-downloaded bitstream. There are a number of advantages to using FPGAs in embedded systems. Manufacturing a set of masks to fabricate application specific integrated circuits (ASICs) such as most microprocessors and DSPs costs millions of dollars in today's advanced fabrication technologies. The design, fabrication, and validation steps needed to produce an ASIC are a lengthy process with a long turnaround time, and mistakes found after fabrication require a new set of masks to correct and are thus very costly. These high non-recurring engineering (NRE) costs make ASICs practical only for devices with a sufficiently large market and volume. FPGAs, on the other hand, can be purchased off the shelf with a low up-front cost. The devices have been pre-tested by the FPGA vendor and are guaranteed to function correctly. The reconfigurable nature allows users to quickly iterate their designs during development, speeding up time to market. Even in the final product, reconfigurability can allow the system firmware to be updated to take advantage of new firmware features or to adhere to new standards. A single FPGA can also replace multiple components in a system that would have required separate ICs, again reducing cost.

From a performance perspective, FPGAs are generally considered lower performance than ASICs due to a lower maximum clock speed. Nonetheless, FPGAs are capable of tremendous raw processing power. The reconfigurable lookup tables of an FPGA form one large computational fabric that is able to perform many operations in parallel, and the large number of embedded on-chip memories can provide tremendous on-chip memory bandwidth. Consider the largest Altera Stratix III FPGA, the EP3SL340, which has approximately 135,000 adaptive logic modules (ALMs). Each ALM is able to compute a one-bit function of up to six inputs per clock cycle or implement up to two three-bit adders [1]. Taking the adders alone, if they can be fully utilized, this is equivalent to almost 8,440 three-input 32-bit adders. In comparison, many low-cost microprocessors in embedded systems have only one two-input adder in a single arithmetic logic unit (ALU).

Using FPGAs has traditionally required specialized hardware design knowledge and familiarity with a hardware description language (HDL) such as the VHSIC (Very High Speed Integrated Circuit) Hardware Description Language (VHDL) or Verilog in order to implement the desired circuitry. The introduction of embedded processor cores into FPGAs, both as hard cores that are implemented in silicon like the rest of the FPGA circuitry, and as soft cores that are synthesized and then programmed on to the FPGA like other user designs, has simplified the design process and technical requirements of using FPGAs. According to a survey conducted in 2007 [2], 36% of respondents used a processor (either hard or soft core) inside the FPGA in their embedded designs, and this figure had increased from 33% in 2006, and 32% in 2005. Embedded software developers can now program in a high-level language such as C or C++, and compile their programs to run on the embedded processor core within the FPGA. This combination of embedded processor core and programmable logic has also opened up the possibility of implementing custom hardware circuitry in the programmable logic to interact with the processor core.

1.1 Motivation

More and more embedded systems today target multimedia, signal processing, and other computationally intensive workloads. These applications typically have a relatively small set of operations that have to be repeatedly performed over a large volume of data. This form of parallelism is referred to as data-level parallelism. Current soft processors designed for FPGAs such as Nios II [3] by Altera and MicroBlaze [4] by Xilinx frequently do not deliver enough computational power to achieve the desired performance in these workloads. These soft processors have only a single ALU and can only perform one computation per clock cycle. Although they come in a few performance levels that successively add more advanced and complex architectural features, even the highest performing core is frequently insufficient to handle the intensive processing tasks.

Many solutions have been proposed in both commercial and academic spaces to improve the performance of soft core processors on FPGAs. They can be largely grouped into four categories: multiprocessor systems, custom-designed hardware accelerators, synthesized hardware accelerators, and other soft processor architectures. Multiprocessor systems contain multiple processor cores, and rely upon shared memory or message passing to communicate and synchronize. They generally require parallel programming knowledge from the user, and as a result are more difficult to program and use. Hardware accelerators utilize the programmable logic of an FPGA to implement dedicated accelerators for processing certain functions. Custom-designed hardware accelerators have to be designed by the user in HDL, but the FPGA development tools may provide an automated way to connect the accelerator to the processor core once it has been designed. They still require the user to design hardware, and verify and test the accelerator in hardware to ensure correct operation. Synthesized hardware accelerators are automatically synthesized from a high-level language description or from a visual representation of the function to be accelerated. The major improvement is that users can use hardware accelerators without knowledge of an HDL, and with little or no knowledge of hardware design.

An improved soft processor architecture has the benefit of a purely software solution, requiring no hardware design or synthesis effort from the user. Depending on the complexity of the programming model of the architecture, it can allow users to improve performance of their applications with minimal learning curve. A soft processor also provides a specification across hardware and software (in the form of a fixed instruction set architecture) that does not change throughout the design cycle, allowing hardware and software development to proceed in parallel.

Some common processor architectures have already been implemented in FPGAs. Very Long Instruction Word (VLIW) architectures have been proposed for soft processors on FPGAs, and superscalar architectures have also been attempted, but neither of them map efficiently to the FPGA architecture, bloating resource usage and introducing unnecessary bottlenecks in performance. To improve performance of data-parallel embedded applications on soft processor systems, it is imperative to take advantage of the parallelism in the hardware. Given the pure software advantage of using an improved soft processor architecture, an ideal solution would be a processor architecture that is inherently parallel, and maps well to the FPGA architecture.

The solution proposed by this thesis is a soft processor tailored to FPGAs based on a vector processor architecture. A vector processor is capable of high performance in data-parallel embedded workloads. Kozyrakis [5] studied the vectorizability of the 10 consumer and telecommunications benchmarks in the EEMBC suite using the VIRAM [6] compiler, and found the average vector length of the benchmarks ranged from 13 to 128 (128 is the maximum vector length supported by VIRAM). The study shows that many embedded applications are vectorizable to long vector lengths, allowing significant performance improvements using vector processing. Vector processing also satisfies the requirements of a parallel architecture, and can be implemented efficiently in FPGAs, as will be shown in this thesis.

1.2 Contributions

The main contribution of this research is applying vector processing, an inherently parallel programming model, to the architecture of soft core processors to improve their performance on data-parallel embedded applications. A soft vector processor provides a scalable and user-selectable amount of acceleration and resource usage, and a configurable feature set, in a single application-independent accelerator that requires zero hardware design knowledge or effort from the user. The scalability of vector processing allows users to make large performance and resource tradeoffs in the vector processor with little or no modification to software. A soft vector processor can further exploit the configurability of FPGAs by customizing the feature set and instruction support of the processor to the target application. Customization extends even to a configurable vector memory interface that can implement a memory system tailored to the application. This makes accessible to the user a much larger design space and larger possible tradeoffs than current soft processor solutions. The application-independent architecture of the vector processor allows a single accelerator to accelerate multiple sections of an application and multiple applications.

As part of this research, a complete soft vector processor was implemented in Verilog targeting an Altera Stratix III FPGA to illustrate the feasibility of the approach and possible performance gains. The processor adopts the VIRAM instruction set architecture (ISA), but makes modifications to tailor the ISA features to FPGAs. A novel instruction execution model that is a hybrid between traditional vector and single-instruction-multiple-data (SIMD) execution is used in the processor. This work also identifies three ways traditional vector processor architectures can be adapted to better exploit features of FPGAs:

1. Use of a partitioned register file to scale bandwidth and reduce complexity,

2. Use of multiply-accumulate (MAC) units for vector reduction,

3. Use of a local memory in each vector lane for lookup-table functions.

Table 1.1 lists new instructions that are added to the soft vector processor instruction set to support new features that did not exist in VIRAM.

Table 1.1: New soft vector processor instruction extensions

Instruction   Description
vmac          Multiply-accumulate
vcczacc       Compress copy from accumulator and zero
vldl          Load from local memory
vstl          Store to local memory
veshift       Vector element shift
vabsdiff      Vector absolute difference

1.3 Thesis Outline

The remainder of the thesis is organized as follows. Chapter 2 gives an overview of vector processing and previously implemented vector processors, and describes other solutions to accelerate applications on soft processor systems, highlighting their advantages and limitations. Chapter 3 describes in detail the architecture of the soft vector processor and its extensions to traditional vector architectures. Chapter 4 provides experimental results illustrating the strengths of the soft vector processor compared to a recent commercial solution in the synthesized accelerator category. Finally, Chapter 5 summarizes the work in this thesis and provides suggestions for future work.

Chapter 2

Background

This chapter provides background on vector processing and application acceleration for soft processor-based systems in FPGAs. First, an overview of how vector processing accelerates data-parallel computations is presented, followed by a discussion of previously implemented vector processors. The remainder of the chapter surveys other current solutions for improving the performance of soft processors and accelerating FPGA-based applications. A representative set of academic and commercial tools are described according to the four categories introduced in Chapter 1: multiprocessor systems, custom-designed hardware accelerators, synthesized accelerators, and other soft processor architectures.

2.1 Vector Processing Overview

Vector processing has been in use in supercomputers for scientific tasks for over two decades.

As semiconductor technology improved, single-chip vector processors have become possible, and recent supercomputing systems like the Earth Simulator [7] and the Cray X1 are based on such single-chip vector processors. The next sections will give an overview of how vector processing can accelerate data-parallel computations, and the characteristics of a vector instruction set, with the goal of demonstrating that vector processing is a suitable model for soft processor acceleration.


2.1.1 Exploiting Data-level Parallelism

The vector processing model operates on vectors of data. Each vector operation specifies an identical operation on the individual data elements of the source vectors, producing an equal number of independent results. Being able to specify a single operation on multiple data elements makes vector processing a natural method to exploit data-level parallelism, which has the same properties. The parallelism captured by vector operations can be exploited by vector processors in the form of multiple parallel datapaths—called vector lanes—to reduce the time needed to execute each vector operation.

To illustrate the vector processing programming model, consider an 8-tap finite impulse response (FIR) filter

    y[n] = Σ_{k=0}^{7} x[n−k] h[k],

which can be implemented in MIPS assembly code as shown in Figure 2.1. The code segment contains one inner loop to perform multiply-accumulate on the data buffer and filter coefficients, and an outer loop to demonstrate processing multiple new data samples. In a real application, the outer loop will iterate as long as there are new inputs to be filtered. The inner loop iterates 8 times for the 8-tap filter. Adding the 10 instructions in the outer loop (assuming branch taken on line 17) gives a total of 74 instructions per result.
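For reference, the loop structure that the assembly in Figure 2.1 implements can be written in plain C roughly as follows. This sketch is for illustration only and is not taken from the thesis benchmarks; the function and variable names are invented, and a simple circular sample buffer is assumed.

    /* 8-tap FIR filter, scalar C sketch. Mirrors the structure of the
       assembly in Figure 2.1: an inner multiply-accumulate loop over the
       taps, and an outer loop over incoming samples. */
    #define NTAPS 8

    void fir_filter(const short *input, int num_samples,
                    const short coeff[NTAPS], int *output)
    {
        short sample_buf[NTAPS] = {0};   /* circular buffer of recent samples */
        int head = 0;

        for (int n = 0; n < num_samples; n++) {        /* outer loop */
            sample_buf[head] = input[n];               /* store new sample */
            int sum = 0;
            for (int k = 0; k < NTAPS; k++)            /* inner loop: MAC */
                sum += sample_buf[(head + NTAPS - k) % NTAPS] * coeff[k];
            output[n] = sum;                           /* store filter result */
            head = (head + 1) % NTAPS;                 /* advance buffer position */
        }
    }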

.L10:                        ; Loop while new inputs received
      mov   r6, zero         ; Zero sum
      mov   r4, sp           ; Load address of sample buffer
      movi  r5, 8            ; Set number of taps
.L8:                         ; (line 5)
      ldw   r12, 0(r4)       ; Load data from buffer
      ldw   r3, 100(r4)      ; Load coefficient
      addi  r5, r5, -1
      addi  r4, r4, 4
      mul   r2, r12, r3      ; Multiply (line 10)
      add   r6, r6, r2       ; Accumulate
      bne   r5, zero, .L8

      stw   r6, 0(r9)        ; Store filter result
      stw   r11, 0(r7)       ; Store new sample to buffer (line 15)
      addi  r7, r7, 4        ; Increment buffer position
      bgeu  r10, r7, .L4     ; Check for end of buffer
      mov   r7, sp           ; Reset new sample pointer
.L4:
      addi  r8, r8, -1       ; (line 20)
      addi  r9, r9, 4
      bne   r8, zero, .L10

Figure 2.1: 8-tap FIR filter MIPS assembly

The same FIR filter implemented in VIRAM vector code is shown in Figure 2.2. The vector processor extracts data-level parallelism in the multiplication operation by multiplying all the coefficients and data samples in parallel. One common operation in vector processing is reduction of the data elements in a vector register. In the FIR filter example, the multiplication products need to be sum-reduced to the final result. The vhalf instruction facilitates reduction of data elements by extracting the lower half of a vector register to the top of a destination vector register. A total of log(VL) vhalf and add instructions are needed to reduce the entire vector. In the vector code, the filter coefficients and previous data samples are kept in vector registers when a new data sample is read to reduce the number of memory accesses. Lines 15 to 21 of Figure 2.2 store the reduced result to memory, shift the filter data elements by one position using a vector extract operation to align them with the coefficients, and copy a new data element from the scalar core. A total of 18 instructions are needed to calculate a single result. The capitalized instructions are specific to the VIRAM implementation of the FIR filter, and can be compared to Figure 3.7 for differences between the VIRAM and soft vector processor implementations.

The vector code has a significant advantage over the MIPS code in terms of the number of instructions executed. Scaling the FIR filter example to 64 filter taps, the MIPS code would require executing 522 instructions (64 iterations of the 8-instruction inner loop plus the 10 outer-loop instructions), while the vector code would still only execute 18 instructions, but with a different value of VL.


      vmstc vbase0, sp        ; Load base address
      vmstc vbase1, r3        ; Load coefficient address
      vmstc VL, r2            ; Set VL to num taps
      vld.h v2, vbase1        ; Load filter coefficients
      vld.h v1, vbase0        ; Load input vector (line 5)
.L5:
      VMULLO.VV v3, v0, v2    ; Multiply data and coefficients
      VHALF   v4, v3          ; Extract half of the vector
      VADD.VV v3, v4, v3      ; VL is automatically halved by vhalf
      VHALF   v4, v3          ; (line 10)
      VADD.VV v3, v4, v3
      VHALF   v4, v3          ; 3 half-reductions for 8 taps
      VADD.VV v3, v4, v3

      vmstc vL, r9            ; Reset VL to num taps (vhalf changes VL) (line 15)
      vmstc vindex, r8        ; Set vindex to 0
      vext.vs r10, v3         ; Extract final summation result
      stw   r10, 0(r4)        ; Store result
      VMSTC vindex, r7        ; Set vindex to 1
      VEXT.VV v1, v1          ; Shift vector up 1 (line 20)
      vmstc vindex, r6        ; Set vindex to NTAP-1
      vins.vs v1, r5          ; Insert new sample
      addi  r3, r3, -1
      addi  r4, r4, 4
      bne   r3, zero, .L5     ; (line 25)

Figure 2.2: 8-tap FIR filter VIRAM vector assembly

2.1.2 Vector Instruction Set Architecture

Vector instructions are a compact way to encode large amounts of data-level parallelism, each specifying tens of operations and producing tens of results at a time. Modern vector processors use a register-register architecture similar to RISC processors [8]. Source operands are stored in a large vector register file that can hold a moderate number of vector registers, each containing a large number of data elements.

A vector architecture contains a vector unit and a separate scalar unit. The scalar unit is needed to execute non-vectorizable portions of the code, and most control flow instructions. In many vector instruction sets, instructions that require both vector and scalar operands, such as adding a constant to a vector, also read from the scalar unit.

Vector addressing modes can efficiently gather and scatter entire vectors to and from memory. The three primary vector addressing modes are: unit stride, constant stride, and indexed addressing. Unit stride accesses data elements in adjacent memory locations, constant stride accesses data elements in memory with a constant size separation between elements, and indexed addressing accesses data elements by adding a variable offset for each element to a common base address.
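In scalar C code, the three addressing modes correspond to the access patterns sketched below. This fragment is illustrative only (the names are invented); on a vector processor each loop would be a single vector load.

    /* Scalar C equivalents of the three primary vector addressing modes.
       On a vector processor, each loop below would be one vector load of
       vl elements; here each loop simply illustrates the access pattern. */
    void addressing_modes(const int *mem, int base, int stride,
                          const int *index, int vl, int *v)
    {
        /* Unit stride: adjacent memory locations */
        for (int i = 0; i < vl; i++)
            v[i] = mem[base + i];

        /* Constant stride: a fixed separation between consecutive elements */
        for (int i = 0; i < vl; i++)
            v[i] = mem[base + i * stride];

        /* Indexed (gather): a per-element offset added to a common base */
        for (int i = 0; i < vl; i++)
            v[i] = mem[base + index[i]];
    }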

Vector instructions are controlled by a vector length (VL) register, which specifies the number of elements within the vector to operate on. This vector length register can be modified on a per-instruction basis. A common method to implement conditional execution is using vector flag registers as an execution mask. In this scheme, a number of vector flag registers are defined in addition to the vector registers, and have the same vector length such that one bit in the vector flag register is associated with each data element. Depending on this one bit value, the operation on the data element will be conditionally executed. Some instructions use the flag register slightly differently. For example, the vector merge instruction uses the bit value in the flag register to choose between two source registers.
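The combined effect of the vector length register and a flag register used as a mask can be sketched in C. This is an illustrative model of the semantics described above, not code from the soft vector processor.

    /* Semantics of a masked vector add (vc = va + vb) under a vector
       length VL and a flag register used as an execution mask. */
    void masked_vadd(const int *va, const int *vb, int *vc,
                     const unsigned char *vflag, int VL)
    {
        for (int i = 0; i < VL; i++) {   /* only the first VL elements     */
            if (vflag[i])                /* one flag bit per data element  */
                vc[i] = va[i] + vb[i];   /* element operation is executed  */
            /* elements whose flag bit is zero are left unchanged          */
        }
    }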

Besides inter-vector operations, vector instruction sets also support intra-vector operations that manipulate data elements within a vector. Implementing these instructions is, however, tricky, as they generally require inter-lane communication due to partitioning of data elements over multiple lanes. For example, the VIRAM instruction set implements a number of instructions like vhalf, vhalfup, and vhalfdn to support element permutations.
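For illustration, the sum reduction performed with vhalf and vadd.vv in Figure 2.2 can be modelled in C as repeated halving. This sketch assumes the active vector length is a power of two and is not taken from the thesis.

    /* Sum reduction by repeated halving, modelling the vhalf + vadd.vv
       sequence: each step adds the upper half of the vector onto the
       lower half and halves the active length, so log2(vl) steps are
       needed before the final element can be extracted. */
    int halving_reduce(int *v, int vl)
    {
        while (vl > 1) {
            vl /= 2;                       /* vhalf implicitly halves VL  */
            for (int i = 0; i < vl; i++)   /* vadd.vv on the two halves   */
                v[i] += v[i + vl];
        }
        return v[0];                       /* extracted with vext.vs      */
    }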

2.1.3 Microarchitectural Advantages of Vector Processing

The previous sections have already illustrated the ease-of-use and instruction set efficiency of the vector programming model. Vector processing also has many advantages that simplify the vector processor microarchitecture. Vector instructions alleviate bottlenecks in instruction issue by specifying and scheduling tens of operations, which occupy the processor for many cycles at a time. This reduces the instruction bandwidth needed to keep the functional units busy [9]. Vector instructions also ease dependency checking between instructions, as the data elements within a vector instruction are guaranteed to be independent, so each vector instruction only needs to be checked once for dependencies before issuing. By replacing entire loops, vector instructions eliminate loop control hazards. Finally, vector memory accesses are effective against the ever-increasing latency to main memory, as they are able to amortize the penalty over the many accesses made in a single memory instruction. This makes moderate-latency, high-throughput memory technologies such as DDR-SDRAM good candidates for implementing main memory for vector processors.

2.2 Vector Processors and SIMD Extensions

Having discussed the merits of vector processing, the following sections will describe a number of single-chip and FPGA-based vector processors in the literature, and briefly explain processors based on vector-inspired SIMD processing. SIMD processing is a more limited form of the vector computing model. The most well-known usage of SIMD processing is in multimedia instruction extensions common in mainstream microprocessors. Recent microprocessors from Intel and IBM, as well as some MIPS processors, all support SIMD extensions.

2.2.1 SIMD Extensions

SIMD extensions such as Intel SSE and PowerPC AltiVec are oriented towards short vectors.

Vector registers are typically 128 bits wide for storing an entire vector. The data width is configurable from 8 to 32 bits, allowing vector lengths to range from 16 (8 bits) to 4 (32 bits).

SIMD instructions operate on short, fixed-length vectors, and each instruction typically executes in a single cycle. In contrast, vector architectures have a vector length register that can be used to modify the vector length during runtime, and one vector instruction can process a long vector over multiple cycles. In general, SIMD extensions lack support for strided memory access patterns and more complex memory manipulation instructions, hence they must devote many instructions to address transformation and data manipulation to support the few instructions that do the actual computation [10]. Full vector architecture mitigates these effects by providing a rich set of memory access and data manipulation instructions, and longer vectors to keep functional units busy and reduce overhead [11].
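The contrast can be illustrated with a simple element-wise addition in C. The SIMD-style version processes fixed four-element chunks and needs a separate remainder loop, while the vector-style version strip-mines the loop by setting the vector length each iteration. The sketch is illustrative only; vector_add_vl stands in for a single hardware vector instruction, and MVL is an assumed maximum vector length.

    #define MVL 64   /* assumed maximum vector length of the vector unit */

    /* Stand-in for one hardware vector instruction operating on vl elements. */
    static void vector_add_vl(const int *a, const int *b, int *c, int vl)
    {
        for (int i = 0; i < vl; i++)
            c[i] = a[i] + b[i];
    }

    /* SIMD style: fixed 4-element chunks plus a scalar remainder loop. */
    void add_simd_style(const int *a, const int *b, int *c, int n)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4)        /* one fixed-width SIMD operation */
            for (int j = 0; j < 4; j++)
                c[i + j] = a[i + j] + b[i + j];
        for (; i < n; i++)                /* leftover elements in scalar code */
            c[i] = a[i] + b[i];
    }

    /* Vector style: the vector length register handles any n directly. */
    void add_vector_style(const int *a, const int *b, int *c, int n)
    {
        for (int i = 0; i < n; i += MVL) {
            int vl = (n - i < MVL) ? (n - i) : MVL;   /* set VL for this strip */
            vector_add_vl(a + i, b + i, c + i, vl);
        }
    }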

2.2.2 Single-chip Vector Processors

The Torrent T0 [12] and VIRAM [6] are single-chip vector microprocessors that support a complete vector architecture and are implemented as custom ASICs. T0 is implemented in full-custom 1.0 µm CMOS technology. The processor contains a custom MIPS scalar unit, and the vector unit connects to the MIPS unit as a co-processor. The vector unit has 16 vector registers, a maximum vector length of 32, and 8 parallel vector lanes. VIRAM is implemented in 0.18 µm technology and runs at 200MHz. It has 32 registers, a 16-entry vector flag register file, and two ALUs replicated over 4 parallel vector lanes. VIRAM has been shown to outperform superscalar and VLIW architectures in the consumer and telecommunications categories of the EEMBC [13] benchmarks by a wide margin [14]. These two vector processors share the most similarity in processor architecture to this work.

RSVP [15] is a reconfigurable streaming vector coprocessor implemented in 0.18 µm technology, and was shown to achieve speedups of 2 to 20 times over its ARM9 host processor alone on a number of embedded kernel benchmarks. Some other vector architectures proposed recently for ASIC implementation are the CODE architecture [16], which is based on multiple vector cores, and the SCALE processor, based on the vector-thread architecture [17], which combines multi-threaded execution with vector execution. These two architectures were only studied in simulation.

2.2.3 FPGA-based Vector Processors

There have been a few past attempts to implement vector processors in FPGAs. However, most of them targeted only a specific application, or were only prototypes of ASIC implementations.


A vector computer is described in [18] that has pipelined 32-bit floating-point units augmented with custom instructions for solving a set of sparse linear equations. The vector computer has 8 adders and multipliers, supports a vector length of 16, and has a 128-bit data bus that can transfer four 32-bit words. Little additional information is given about the system other than that it was mapped to a Xilinx Virtex II FPGA. ProHos [19] is a vector processor for computing higher order statistics. Its core is a pipelined MAC unit with three multipliers and one accumulator. The processor is implemented on the RVC [20] platform consisting of five Xilinx 4000E devices, and runs at 7.5MHz. The SCALE vector processor [17] was prototyped in an FPGA [21], but no specific attempt was made to optimize the architecture or implementation for FPGAs.

Three vector processors that, similar to this work, were specifically designed for FPGAs are described in [22], [23], and [24]. The first vector processor [22] consists of two identical vector processors located on two Xilinx XC2V6000 FPGA chips. Each vector processor runs at 70MHz, and contains a simplified scalar processor with 16 instructions and a vector unit consisting of 8 vector registers and 8 lanes (each containing a 32-bit floating-point unit), and supports a maximum vector length (MVL) of 64. Eight vector instructions are supported: vector load/store, vector indexed load/store, vector-vector and vector-scalar multiplication/addition. However, only matrix multiplication was demonstrated on the system. Although the vector processor presented in this thesis lacks floating-point support, it presents a more complete solution consisting of a full scalar unit (Nios II) and a full vector unit (based on VIRAM instructions) that supports over 45 distinct vector instructions plus variations.

The second vector processor [23] was designed for a Xilinx Virtex-4 SX and operated at 169MHz. It contains 16 processing lanes of 16-bit width and 17 on-chip memory banks connected to a MicroBlaze processor through fast simplex links (FSL). It is not clear how many vector registers were supported. Compared to the MicroBlaze, speedups of 4–10× were demonstrated with four applications (FIR, IIR, matrix multiply, and 8×8 DCT). The processor implementation seems fairly complete.

The third vector processor [24] is a floating-point processing unit based on the T0 architecture, and operated at 189MHz on a Xilinx Virtex II Pro device. It has 16 vector registers of 32 bits and a vector length of 16. Three functional units were implemented: a floating-point adder, a floating-point multiplier, and a vector memory unit that interfaces to a 256-bit memory bus. All three functional units can operate simultaneously, which, together with the 8 parallel vector lanes in the design, can achieve 3.02 GFLOPS. No control processor was included for non-floating-point or memory instructions, and it is unclear whether addressing modes other than unit-stride access were implemented.

A Softcore Vector Processor is also presented in [25] for biosequence applications. The processor consists of an instruction controller that executes control flow instructions and broadcasts vector instructions to an array of 16-bit wide processing elements (PEs). Compared to this thesis, it is a more limited implementation with fewer features and instructions, but like this thesis it also argues for a soft vector processor core.

Many SIMD systems have also been developed for FPGAs. A SIMD system is presented in [26] that is comprised of 2 to 88 processing elements built around DSP multiplier blocks on an Altera Stratix FPGA, and controlled by a central instruction stream. The system utilizes all the DSP blocks of an Altera Stratix EP1S80 FPGA, but only 17% of the logic resources. Only matrix multiplication was illustrated on the system, but the actual implementation of the application was not described. This system demonstrates the tremendous amount of parallelism available on modern FPGAs for performing parallel computations.

2.3 Multiprocessor Systems

In contrast to vector processing systems, where everything executes in lock-step, multiprocessor systems have become popular recently as a way to obtain greater performance. In particular, FPGA-based multiprocessors can be composed of multiple soft core¹ processors, or a combination of soft and hard core² processors. The parallelism in multiprocessor systems can be described as multiple instruction multiple data (MIMD). Each processor in a MIMD multiprocessor system has its own instruction memory and executes its own instruction stream and operates on different data. In contrast, SIMD systems have a single instruction stream that is shared by all processors. For example, vector processing is a type of SIMD system. In the context of soft processor acceleration, the discussion will focus on multiprocessor systems that utilize multiple soft processors on FPGAs.

¹ Soft core processors are described fully in software, usually in HDL, and are synthesized into hardware and programmed into the device with the rest of the user design.

² Hard core processors are implemented directly in transistors in the device, separate from the user design which is programmed into the programmable logic fabric.

Most multiprocessor systems that have been implemented in FPGAs are unique as there are not yet any de facto or formal standards. They usually differ in number of processors, interconnect structure, how processors communicate, and how they access memory. However, the communication scheme in multiprocessor systems can generally be categorized into shared memory or message passing, and the memory architecture can be categorized as either centralized or distributed [27].

FPGA-based multiprocessor systems can be very flexible, allowing them to adapt to accelerate heterogeneous workloads that otherwise do not parallelize effectively. Specialized systems with unique architectures can be designed to exploit the characteristics of a particular application for better performance. For example, the multiprocessor system in [28] for LU matrix factorization has a unique architecture that consists of a system controller that connects to a number of processing elements in a star topology, and seven kinds of memory for communicating data and storing instructions. The IPv4 packet forwarding system in [29] has 12 MicroBlaze processors arranged in four parallel pipelines of processors with each stage performing a specific task. Table 2.1 lists a number of FPGA multiprocessor systems found in the literature and their features.

Table 2.1: Partial list of FPGA-based multiprocessor systems in literature

Multiprocessor   Application                 # CPUs  Speedup  Communication    Memory
CerberO [30]     Discrete wavelet transform  4       4        Shared memory    Centralized
Kulmala [31]     MPEG-4 encoding             4       1–3      Shared memory    Centralized
Martina [32]     Digital signal processing   8       7–8      Shared memory    Centralized
Ravindran [29]   IPv4 packet forwarding      16      n/a      Message-passing  Distributed
SoCrates [33]    General purpose             2       n/a      Shared memory    Distributed
Wang [28]        Matrix operations           6       4        Shared memory    Distributed

The disadvantage of MIMD multiprocessor systems is their complexity. Significant hardware knowledge is needed to design a multiprocessor system, including consideration of issues such as interconnect, memory architecture, coherence and memory consistency protocols.

Creating a custom multiprocessor system architecture tailored to the needs of a particular application can be a daunting task due to the large design space. On the software front, the user will need parallel programming knowledge to use these systems, which is not within the skillset of the average software developer. Specialized multiprocessor systems add the further difficulty of writing software that effectively takes advantage of the architecture.

2.4 Custom-designed Hardware Accelerators

Hardware accelerators are frequently used in conjunction with soft processors to accelerate certain portions of an application. Traditionally these accelerators are designed manually in HDL by a hardware designer, and connect to the processor via an interface specified by the processor vendor. The next two sections describe two different methods of interfacing custom-designed hardware accelerators to a soft processor. The fundamental disadvantage of these accelerators is that they require hardware design effort to implement, verify, and test. Each accelerator, in most cases, also performs only one function. Hence for each portion of the application to accelerate, a different hardware accelerator is needed. This adds further time and design effort.

2.4.1 Decoupled Co-processor Accelerators

The co-processor accelerator model is akin to the use of floating-point co-processors in early computer systems. These accelerators are decoupled from the main processor in operation.


The processor provides inputs to start the accelerator, then has the option of executing a different part of the program while the accelerator is running, and finally reads results when they are computed. This is the model adopted by the Xilinx MicroBlaze to interface to custom co-processor accelerators. The MicroBlaze interfaces to accelerators through 16 32-bit Fast Simplex Link (FSL) channels, which are unidirectional point-to-point first in first out (FIFO) communication buffers. The MicroBlaze CPU uses blocking or non-blocking reads and writes to access the FSL channels, and can use up to 8 FSL channels for outputs from the CPU, and 8 FSL channels for inputs to the CPU.

Co-processor accelerators have the advantage that they are decoupled from the main CPU core, allowing both the CPU and accelerator to execute concurrently. Adding co-processor accelerators is less likely to negatively impact the performance of the CPU core compared to more tightly-coupled methods due to their decoupled nature. The design tools also tend to have good support for these accelerators, allowing the programmer to interact with them as easily as calling a function, and without having to write parallel programs. As an example, the MicroBlaze FSL provides simple send and receive functions to communicate with multiple accelerators. Co-processor accelerators also have the advantage that they can act as data masters and communicate directly with memory. However, it is frequently up to the user to create the interface manually, which is added work.
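As an illustration, driving an FSL-connected accelerator from MicroBlaze C code might look like the sketch below. The putfsl and getfsl blocking macros come from Xilinx's MicroBlaze support headers; the header name, channel number, and accelerator behaviour (consuming n words and returning n results) are assumptions made for this example.

    /* Sketch: feed data to an accelerator on FSL channel 0 and read back
       the results. Assumes the accelerator consumes n words and produces
       n processed words. */
    #include <mb_interface.h>   /* putfsl()/getfsl() blocking FSL macros */

    void run_fsl_accelerator(const int *in, int *out, int n)
    {
        for (int i = 0; i < n; i++)
            putfsl(in[i], 0);    /* blocking write to FSL channel 0 */

        /* The CPU may execute unrelated code here while the accelerator runs. */

        for (int i = 0; i < n; i++)
            getfsl(out[i], 0);   /* blocking read of results from channel 0 */
    }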

2.4.2 Custom Instruction Accelerators

An alternate connection scheme interfaces an accelerator as a custom logic block within the processor's datapath, in parallel to the processor's main ALU. This scheme allows an accelerator to fetch operands directly from the processor's register file, and write results back to the processor's register file using the ALU's result bus. These accelerators effectively extend the functionality of the processor's ALU. The Nios II CPU supports this type of accelerator as custom instructions to the processor. To control the custom accelerators, new instructions are added to the processor, mapping to an unused opcode in the Nios II ISA. These instructions are then executed directly by the extended ALU of the processor.

Figure 2.3: Nios II custom instruction accelerators can be combinational, multi-cycle, or contain internal register states (parameterized) (Source: [34])

The main advantage of custom instructions and other such tightly-coupled accelerators is that they can directly access the register file, saving the data movement instructions needed by co-processor accelerators to move data between the processor and accelerator. They are well-integrated into the CPU and development tools due to their exposed nature as CPU instructions, allowing software developers to easily access them at the software level. For example, the Nios II CPU can integrate up to 256 custom instruction accelerators, and they can be accessed in software by calling special functions built into the development tools. These accelerators are also relatively simple to interface to the CPU because the types of custom instructions and their interfaces are already defined in the CPU specification. Nios II defines single-cycle combinational and multi-cycle custom instructions that stall the CPU, and accelerators with additional internal register state (internal flip-flops) programmed by CPU instructions to hold data to process. Figure 2.3 summarizes the Nios II custom instruction formats.
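For example, a two-input custom instruction can be invoked from C through the GCC built-in functions for Nios II (or through the wrapper macros generated in the BSP's system.h). In the sketch below, the custom instruction index CI_SAD_N and the function name are hypothetical.

    /* Sketch: calling a Nios II custom instruction from software.
       CI_SAD_N is the index assigned to the accelerator when it was added
       to the system (hypothetical value for illustration). */
    #define CI_SAD_N 5

    int sad_accel(int a, int b)
    {
        /* Two 32-bit register operands in, one 32-bit result out, exactly
           like other Nios II R-type instructions. */
        return __builtin_custom_inii(CI_SAD_N, a, b);
    }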

Transferring large amounts of data into and out of these accelerators is, however, a slow process. Due to their intended lightweight nature, custom instruction accelerators generally do not define their own memory interface (although it can be done manually by the designer). Instead, they rely on the processor to perform memory access. In the Nios II case, custom instructions can only read up to two 32-bit operands from the register file and write one 32-bit result per instruction, like other Nios II R-type instructions. Another drawback is that the accelerators lie on the critical path of the processor ALU. Adding one or more accelerators with complex logic can unnecessarily reduce the maximum frequency of the CPU core, affecting performance of the entire system.

2.5 Synthesized Hardware Accelerators

Synthesized hardware accelerators use behavioural synthesis techniques to automatically create hardware accelerators from software. These accelerators and processor systems take on different architectures, and several of them will be described in this section. A common drawback of these approaches is that, since the tools synthesize hardware from the user's software application, if the user changes hardware-accelerated sections of the software, the register transfer level (RTL) description of the hardware accelerators, and possibly of the entire system, will have to be regenerated and recompiled. This can make it difficult to achieve a targeted clock frequency for the whole design, for example. Synthesis tools also generally create separate hardware accelerators for each accelerated function, with no opportunity for resource sharing between accelerators.

2.5.1 C-based Synthesis

Hardware circuits synthesized from software written in the C programming language or variations of it have long been advocated due to the widespread use of C in embedded designs. Although the idea is attractive, in reality there are many difficulties such as identifying and extracting parallelism, scheduling code into properly timed hardware, and analyzing communication patterns and pointer aliasing [35]. Due to these difficulties, C-based synthesis tools usually do not support full American National Standards Institute (ANSI) C, but only a subset, or employ extensions to the language. For example, floating-point data types are often not supported due to the hardware complexity required. With current automatic synthesis methods, the user also needs to have a good understanding of how to write "C" code that synthesizes efficiently to hardware. Unfortunately, the rules and language subsets often change from one tool to the next.

Given these limitations, C-based synthesis is still a powerful method for creating hardware accelerators, which can be seen from the large number of solutions in this area. The following sections will describe three compilers that support FPGAs: Altera C2H Compiler, Xilinx CHiMPS, and Catapult C. Descriptions of some other C-based synthesis tools can be found in [36].

Altera C2H Compiler

The Altera C2H compiler is a recently released tool that synthesizes a C function into a hardware accelerator for the Nios II soft CPU. The System-on-Programmable-Chip (SOPC) Builder tool in Quartus II automatically connects these accelerators to the Nios II memory system through the Avalon system fabric [37]. The compiler synthesizes pipelined hardware from the C source code using parallel scheduling and direct memory access [38]. Initially, a master port is created in the accelerator hardware for each memory reference in the C code. In theory, this allows the maximum number of concurrent memory accesses. When several master ports connect to the same memory block, the Avalon system fabric creates an arbiter to serialize accesses. As an additional optimization, the C2H compiler will merge several master ports connected to the same memory block by combining the references and scheduling them internally. The ANSI C restrict keyword is used to indicate no aliasing can occur with a pointer, to help optimize concurrency. Figure 2.4 shows an example Nios II processor system with C2H accelerators, and the connections between the various components of the system.

Figure 2.4: Example Nios II system with C2H accelerators (Source: [38])

The C2H compiler is convenient, and the design flow is simple and well-integrated into the development tools. The pipelined, decoupled co-processors do not have the disadvantages of tightly-coupled custom instruction accelerators, and direct memory access is a powerful feature that further simplifies usage. Despite its ease of use, the performance of the automatically synthesized accelerators is not as good as one might presume. The C2H compiler generates only a single solution in the design space. It has no option to control the amount of loop unrolling to set the performance and resource level of the accelerator. As can be seen from Figure 2.4, if one of the accelerators has to access multiple data elements from a single data memory, it will be forced to serialize the memory accesses, possibly creating a bottleneck in the accelerator.

To alleviate this bottleneck, data needs to be partitioned efficiently over multiple memories to reduce overlap. However, this is a step that needs to be done manually by the user, and requires a good understanding of the hardware system and application. The hardware accelerators also do not have caches, nor do they support cache coherency with the Nios II processor's cache. The user either has to ensure that data does not overlap, or force the processor to flush its cache before transferring execution to the accelerator.
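To make the programming model concrete, the sketch below shows the kind of plain C function such a flow accepts; restrict tells the compiler the buffers cannot alias, so the read master ports need not be conservatively serialized. The function is illustrative only and is not one of the thesis benchmarks.

    /* Illustrative C2H-style accelerator function. Each pointer dereference
       becomes an Avalon master port; 'restrict' promises the buffers do not
       alias, allowing the reads of a[] and b[] to be scheduled concurrently. */
    void vec_add_accel(const int *restrict a,
                       const int *restrict b,
                       int *restrict c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }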

CHiMPS Target Language

Compiling HLL Into Massively Pipelined Systems (CHiMPS) is an experimental compiler by Xilinx Labs [39] that compiles high-level languages such as C into hardware accelerators that can be implemented on Xilinx FPGAs. The compiler uses the CHiMPS Target Language (CTL), which is a predefined instruction set, as an intermediate language for synthesis, then extracts a dataflow graph from the CTL. This dataflow graph is then implemented in hardware, and pipelined for performance. One additional feature of CHiMPS is that it automatically generates caches to cache data inside the FPGA, and supports coherency between multiple cache blocks and multiple FPGAs. Few additional details have been released about CHiMPS, and it is still unclear what further functionality it provides.

Catapult Synthesis

Catapult Synthesis [40] by Mentor Graphics synthesizes ANSI C++ into an RTL description that can then be targeted to an ASIC or FPGA. Catapult Synthesis can actually synthesize entire systems without processor cores, and it has been used to design production ASIC and FPGA designs [40]. The tool automatically generates multiple implementations for a given design, allowing the user to explore performance, area, and power tradeoffs. The user can use synthesis directives to specify high-level decisions. A number of main architectural transformations are applied to generate a hardware architecture: interface synthesis converts the way the C function communicates with the rest of the design, variable and array mapping controls how variables are mapped to registers or memory, loop pipelining allows multiple iterations of a loop to execute in parallel, loop unrolling exploits parallelism across subsequent loop iterations, and scheduling transforms the untimed C code into an architecture with a well-defined cycle-by-cycle behaviour.
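As a generic illustration of the loop-unrolling transformation (the syntax is ordinary C, not Catapult directives), unrolling a dot-product loop by four exposes four independent multiply-accumulates that a scheduler can map onto parallel hardware:

    /* Original loop: one multiply-accumulate per iteration. */
    int dot(const int *a, const int *b, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* Unrolled by 4 (n assumed to be a multiple of 4): the four partial sums
       have no dependences between them, so the multiplies can run in parallel. */
    int dot_unrolled(const int *a, const int *b, int n)
    {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }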

The Catapult Synthesis tool allows designers to automate parts of the detailed implementation of a design, while the automatic exploration process allows them to explore tradeoffs in the implementation. However, the user must still specify how data is transferred to and from the accelerator during interface synthesis, and use directives to guide the other phases of compilation. User experience reports [41] emphasize that the user must have a good understanding of the hardware implications of different C/C++ constructs to generate good hardware. As Catapult Synthesis is designed to synthesize entire systems, it does not have natural support for connecting to a processor.

2.5.2 Block-based Synthesis

Block-based synthesis describes methods that synthesize hardware from a graphical design environment that allows the user to design systems by graphically connecting functional blocks.

This synthesis flow is frequently incorporated into model-based design environments, in which the user creates a high-level executable specification using functional blocksets to define the desired functional behaviour with minimal hardware detail. This executable specification is then used as a reference model while the hardware representation is further specified. The most commonly used such environment is the Mathworks Simulink design environment. This design method allows algorithm designers and hardware designers to work in a common environment, and allows architectural exploration without specifying detailed implementation. Automated synthesis methods from these environments simplify the final implementation phase, bridging the gap between algorithm and hardware designers.

The following subsections will describe four specific tools: Xilinx System Generator for DSP, Altera DSP Builder, Simulink HDL Coder, and Starbridge Viva. Starbridge Viva is the only solution of the four that does not operate within Simulink. Another similar tool that interfaces to Simulink is DSPLogic Reconfigurable Toolbox [42].

24 Xilinx System Generator for DSP & Altera DSP Builder

The Xilinx System Generator for DSP [43] is an add-on that uses Simulink as a front-end to interface to Xilinx IP cores for DSP applications. The Xilinx Blockset includes over 90 common DSP building blocks that interface to Xilinx IP core generators. The System Generator also allows hardware co-simulation of the Simulink model using a Xilinx FPGA-based DSP platform communicating through Ethernet or JTAG to accelerate the simulation.

The Altera DSP Builder has mostly the same features as the Xilinx System Generator but is targeted to Altera FPGAs. Pre-designed DSP blocks from the Altera MegaCore functions can be imported into the Simulink design, and the system can be co-simulated using an Altera FPGA through JTAG. The DSP Builder also has blocks for communicating with the Avalon fabric so the accelerators generated by DSP Builder can be instantiated in SOPC Builder systems, and used with the Nios II soft processor.

The advantage of these vendor-specific block libraries is that they are highly optimized to the vendor's platform. However, since they are specific to a target platform, the Simulink models generated using these flows are no longer platform independent and suffer in portability.

Simulink HDL Coder

Simulink HDL Coder [44] generates synthesizable Verilog or VHDL code from Simulink models and Embedded MATLAB [45] code for datapath implementations, as well as from finite-state machines described with the Stateflow tool for control logic implementations. The generated

HDL is target independent, bit-exact and cycle-accurate to the Simulink model, and can be implemented on an FPGA or ASIC. The tool is a mix of synthesis and a set of supported Simulink models with HDL code. Over 80 Simulink models are supported in version 1.0, with multiple implementations included for some commonly used blocks. The HDL Coder can also automatically generate HDL for Embedded MATLAB function blocks and state machines.


Starbridge Viva

Starbridge Viva [46] provides its own drag-and-drop graphical and object-oriented environment for algorithmic development. A design can be modelled by connecting functional objects on a design palette using transports that represent the data flow. Objects are “polymorphic” and can be used with different data types, data rates, and data precisions, allowing reuse and experimentation with design tradeoffs. Built-in objects are, however, relatively primitive constructs like comparators and registers compared to Simulink blocks or DSP blocks in the Xilinx/Altera tools. The software supports a number of reconfigurable high-performance computing (HPC) systems with both FPGAs and CPUs, as well as the vendor's own Starbridge

Hypercomputers.

2.5.3 Application Specific Instruction Set Processor

Application-specific instruction set processors (ASIPs) have configurable instruction sets that allow the user to extend the processor by adding custom hardware accelerators and instructions.

They provide a similar level of configurability as soft processors with custom instructions on

FPGAs, but are configured only once before being inserted into an ASIC flow. Adding custom hardware and instructions to ASIPs, however, frequently requires the use of proprietary synthesis languages and compilers. They also share the common disadvantage of synthesized hardware accelerators: a separate accelerator is required for each accelerated function. The Tensilica Xtensa processor will be described in the next section. Another commercial ASIP is the

ARC configurable processor [47].

Tensilica Xtensa

The Tensilica Xtensa [48] processor is a configurable and extensible processor for ASIC and

FPGA design flows. It allows the customer to configure the processor to include or omit features, and add application-specific extensions to the base processor using the Tensilica Instruction

Extension language (TIE). These instructions expand the base instruction set of the processor.

TIE is flexible and can be used to describe accelerators that range from multi-cycle pipelined hardware to VLIW, SIMD, and vector architectures. The Tensilica XPRES (Xtensa PRocessor

Extension Synthesis) Compiler [49] furthermore creates TIE descriptions from ANSI C/C++ code, allowing direct generation of processor hardware RTL from C/C++ specification. It also allows design space exploration by generating different TIE source files that represent a range of Xtensa processors that trade off performance of the application versus area. However, accelerators generated from TIE are optimized for ASIC and not for FPGA implementation.

FPGA support is mainly provided for validation purposes.

2.5.4 Synthesis from Binary Decompilation

Another method of synthesizing hardware accelerators is binary decompilation, which decompiles the software binary and synthesizes hardware accelerators from the resulting assembly instructions or description. Two commercial products in this category are Cascade [50] by CriticalBlue and Binachip-FPGA [51]. There are not many details available on the performance of

Binachip-FPGA, but simulation results reported in [52] of hardware accelerator synthesis from binary decompilation and mapping to a microprocessor plus FPGA system-on-chip are less than impressive. CriticalBlue adopts a different architecture for its accelerators to achieve higher performance, as will be described in the following section.

CriticalBlue Cascade

Cascade by CriticalBlue is another tool that automatically synthesizes co-processors for a main processor on SoC, structured ASIC, and Xilinx FPGA platforms. Cascade differs from the synthesis-based solutions described in previous sections in that it synthesizes a co-processor by analyzing the compiled object code of an application. The Cascade co-processor has a distinct VLIW processor architecture, different from other dataflow or custom-hardware and loop-pipelining based co-processors such as CHiMPS and the Altera C2H compiler. The co-processor connects to the main CPU via the bus interface of the main processor, and can be

configured as a slave co-processor of the main CPU, or as an autonomous streaming co-processor with direct memory access. It is customizable, and the tool automatically performs architecture exploration by generating a range of co-processors with different performance and area.

Furthermore, it has options to configure the instruction format and control the amount of instruction decode logic. A single co-processor can also be reused to accelerate multiple non-overlapping portions of an application. The VLIW architecture is, however, not ideal for FPGA implementation, as will be explained in Section 2.6.2.

2.6 Other Soft Processor Architectures

Rather than offloading sections of an application from the soft processor to a hardware accelerator or co-processor to speed up execution, the soft processor itself can be improved to achieve higher performance without the addition of accelerators. More advanced architectures than standard RISC architectures have been explored for soft processors, and several of these will be described in the following sections.

2.6.1 Mitrion Virtual Processor

The Mitrion Virtual Processor by Mitrionics [53] is a fine-grained, massively parallel, configurable soft-core processor. It is based on a collection of ALU tiles that each perform a different function, and are interconnected by a network. The processor is programmed in the proprietary

Mitrion-C programming language, which has C-like syntax and allows the programmer to identify parallelism within the code. The user can partially control resource usage by selecting the level of loop unrolling in the application. The compiler and configurator then analyze the application source code, and generate a configuration of the processor that executes the application.

Currently, the processor is supported on Xilinx FPGAs.

Although the processor and tools provide a complete solution to automatically synthesize a system to execute the user application, similar to synthesized hardware accelerators, the

28 processor has to be reconfigured whenever the application changes, due to the code-specific nature of the processor configurator.

2.6.2 VLIW Processors

VLIW architectures have also been used in FPGAs for acceleration. VLIW soft processors like

[54] are designed for a specific application, while others like [55–57] are more general purpose.

The 4-way VLIW processor described in [56] has four identical, general-purpose functional units, connected to a single, multi-ported, 32 × 32-bit register file. It supports automatically generated custom hardware units that can be placed inside each of the identical ALUs. A similar

VLIW processor architecture is [57], which is configurable and is automatically generated by a configuration tool. It has multiple heterogeneous functional units, but also supports custom hardware units coded in VHDL. The register file is separated into three smaller register files for addresses, data, and data to custom hardware units. Each of the smaller register files is further separated into even and odd register banks to reduce the number of connections.

Another FPGA-based VLIW processor is the VPF [58], which is a simplified version of the

EVP [59] ASIC. Each functional unit in the VLIW architecture is itself a SIMD unit that can process a number of 16-bit operands, with the SIMD width configurable at compile time.

The processor has three functional units: MAC, shuffle, and load/store, and seems relatively complete with 52 instructions. A 200 MHz clock frequency was achieved for SIMD widths of 4 and 8, but performance degrades significantly past a SIMD width of 16, mainly due to the shuffle unit, which allows arbitrary permutation of data elements across the SIMD unit.

A model of an Intel microarchitecture supporting a subset of the Itanium ISA was also prototyped on an FPGA [60].

The drawback of VLIW architectures for FPGAs is the multiple write ports required in the register file to support multiple functional units. Since embedded memory blocks on current

FPGAs have only two ports, and one is needed for reading, a register file with multiple write ports is commonly implemented either as a collection of discrete registers as in [56], or divided

into multiple banks as in [57]. In either case, multiplexers are needed at the write side to connect multiple functional units to the register file, and this overhead increases as the processor is made wider with more functional units to exploit parallelism. For multiple-bank implementations, there are additional restrictions on which registers can be written to by each instruction.

2.6.3 Superscalar Processors

The PowerPC 440 core on Xilinx Virtex 5 devices has a superscalar pipeline, but it is a hard processor core implemented in transistors inside the FPGA device, and does not provide the level of customization available in soft-core processors. However, the Auxiliary Processor Unit

(APU) does allow the PowerPC core to connect to custom co-processors implemented in the

FPGA fabric.

FPGAs are frequently used to prototype ASIC microprocessors, and recently there has been interest in using FPGAs to more quickly and accurately explore microprocessor design space [61]. The Intel Pentium was implemented on an FPGA and connected to a complete computer system for architectural exploration in [62], and a speculative out-of-order superscalar core based on the SimpleScalar PISA was implemented in an FPGA in [63]. The lack of soft superscalar processors specifically designed for FPGAs could be due to a number of difficulties in mapping this type of architecture to FPGAs. Wide-issue superscalar processors require complex dependency checking hardware to identify dependences during instruction decode, as well as comparators to detect when operands are available so the instruction can be issued. The data forwarding multiplexers that carry results from outputs of functional units to inputs also add to complexity. And similar to VLIW processors, the register file of superscalar processors also needs many read and write ports to support multiple functional units.

2.7 Summary

Each of the four categories of soft processor accelerators described in this chapter has its advantages and drawbacks. Multiprocessor systems are very flexible with respect to the type of application that can be accelerated, but are complex both to design and use. Custom-designed hardware accelerators require significant hardware design knowledge and effort, and generally require a separate accelerator for each accelerated function. Although synthesized accelerators automatically create hardware accelerators from software, the accelerators have to be regenerated and resynthesized when the software changes. An improved soft processor can accelerate the entire application and does not require hardware design knowledge or effort to use. However, current VLIW and superscalar soft processors implemented in FPGAs do not scale well to high performance.

The next chapter will discuss in detail the design and architecture of the soft vector proces- sor, an approach that combines the best advantages of most of these accelerator techniques.


Chapter 3

Configurable Soft Vector Processor

This chapter describes the design and implementation of the configurable soft vector processor, targeted specifically to the Altera Stratix III FPGA. The chapter first outlines the requirements of an effective soft processor accelerator, and shows how the soft vector processor meets these requirements. It then gives an overview of the processor, and presents the design and implementation details. The novel features and instructions of the soft vector processor are highlighted, and the improvements over the VIRAM instruction set are demonstrated using a simple FIR filter example. Finally, the design flow of the soft vector processor is compared to a commercial C-based synthesis flow, namely the Altera C2H compiler.

3.1 Requirements

The previous chapter described currently available soft processor accelerator solutions. Given each of their advantages and drawbacks, an ideal soft processor accelerator that combines the best of different solutions should:

• have scalable performance and resource usage,

• be simple to use, ideally requiring no hardware design effort,

• separate hardware and software design flows early in the design,

• allow rapid development by minimizing the number of times the design has to be synthesized, placed and routed.

Vector processing, in particular a soft vector processor, addresses all these requirements. It provides a simple programming model that can be easily understood by software developers,

and its application-independent architecture allows hardware and software development to be separated. A soft vector processor delivers scalable performance and resource usage through soft processor configurability and a scalable architecture. Modifying the application requires only a change to the software and a software recompilation.

3.2 Soft Vector Architecture

The soft vector architecture is a configurable soft-core vector processor architecture developed specifically for FPGAs. It leverages the configurability of FPGAs to provide many parameters to configure the processor for a desired performance level, and for a specific application.

The soft vector processor instruction set borrows heavily from the instruction set of VIRAM, but makes modifications to target embedded applications and FPGAs. In particular, the soft vector processor removes support for virtual memory, floating-point data types, and certain vector and flag manipulation instructions, but adds new instructions to take advantage of DSP functionality and embedded memories in FPGAs. The following sections describe the high-level architecture of the soft vector processor, and how its instruction execution differs from traditional vector processors.

3.2.1 System Overview

The soft vector architecture specifies a family of soft vector processors with varying performance and resource utilization, and a configurable feature set to suit different applications. A software configurator uses a number of parameters to configure the highly parameterized Verilog source code and generate an application- or domain-specific instance of the processor. The configurability gives designers flexibility to trade off performance and resource utilization, and to further

fine-tune resource usage by removing unneeded processor features and instruction support.

Figure 3.1 illustrates the high-level view of the soft vector processor. The architecture consists of a scalar core, a vector processing unit, and a memory interface unit.


Figure 3.1: Scalar and vector core interaction

The scalar core is the single-threaded version of the UTIIe [64], a 32-bit Nios II-compatible soft processor with a four-stage pipeline. The scalar core and vector unit share the same instruction memory and instruction fetch logic. Vector instructions are 32 bits wide, and can be freely mixed with scalar instructions in the instruction stream. The scalar and vector units can execute different instructions concurrently, but will coordinate via the FIFO queues for instructions that require both cores, such as instructions with both scalar and vector operands.

The soft vector processor architecture is tailored to the Altera Stratix III FPGA architecture.

The sizes of embedded memory blocks, functionality of the hard-wired DSP blocks, and mix of logic and other resources in the Stratix III family drove many of the design decisions.


[Figure 3.2 diagram: C = A + B followed by the dependent E = C * D on an eight-element vector takes 9 cycles with chaining in (a), and 8 cycles with two-lane hybrid vector-SIMD execution in (b).]

Figure 3.2: Traditional vector processor execution using chaining is shown in (a), while hybrid vector-SIMD execution is shown in (b).

3.2.2 Hybrid Vector-SIMD Model

Traditional vector processors are optimized towards processing long vectors. Since they tend not to have a large number of parallel datapath or vector lanes, they rely on pipelining and instruction chaining through the vector register file to achieve high performance. Instruction chaining is illustrated in Figure 3.2(a), which refers to the passing of partial results from one functional unit to the next between two data dependent instructions before the entire result vector has been computed by the first unit. Chaining through the register file has a significant drawback: it requires one read port and one write port for each functional unit to support concurrent computations. This contributes to the complexity and size of the traditional vector register file. Since such a vector register file cannot be implemented efficiently on an FPGA, a different scheme was implemented.

The soft vector processor combines vector and SIMD processing to create a hybrid vector-SIMD model, illustrated in Figure 3.2(b). In the hybrid model, computations are performed both in parallel in SIMD fashion, and over time as in the traditional vector model. Since the number of vector lanes will generally be large in the soft vector processor to take advantage of the large programmable fabric on an FPGA, the number of clock cycles required to process each vector will generally be small. This allows chaining to be removed, simplifying the design of control logic for the register file.
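As a rough illustration (the cycle count below is simple arithmetic on the lane count, not a figure quoted from the thesis), an instruction operating on a vector of VL elements occupies the execute stage for about

    ceil(VL / NLane)

cycles. With MVL typically equal to 4 × NLane (Section 3.3.3), a full-length vector instruction streams through the lanes in roughly 4 cycles, which is why chaining can be removed with little performance cost when the lane count is large.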

3.3 Vector Lane Datapath

The vector lane datapath of the vector unit is shown in detail in Figure 3.3. The vector unit is composed of a configurable number of vector lanes, with the number specified by the NLane parameter. Each vector lane has a complete copy of the functional units, a partition of the vector register file and vector flag registers, a load-store unit, and a local memory if parameter

LMemN is greater than zero. The internal data width of the vector processing unit, and hence width of the vector lanes, is determined by the parameter VPUW. All vector lanes receive the same control signals and operate independently without communication for most vector instructions. NLane is the primary determinant of the processor’s performance (and area).

With additional vector lanes, a fixed-length vector can be processed in fewer cycles, improving performance. In the current implementation, NLane must be a power of 2.

3.3.1 Vector Pipeline

The vector co-processor of the soft vector processor has a 4-stage execution pipeline, plus the additional instruction fetch stage from the scalar core. The pipeline is intentionally kept short so vector instructions can complete in a small number of cycles to take advantage of the many parallel vector lanes, and to eliminate the need for forwarding multiplexers, which reduces area. With the shared instruction fetch stage, the entire processor can only fetch and issue a single instruction per cycle.



Figure 3.3: Vector co-processor system block diagram

As shown in Figure 3.1, the vector unit has a separate decoder for decoding vector instructions. The memory unit has an additional decoder and issue logic to allow overlapped execution of a vector instruction that uses the ALU and a vector memory instruction.

The vector unit implements read-after-write (RAW) hazard resolution through pipeline interlocking. The decode stage detects data dependencies between instructions, and stalls the later instruction until the hazard is resolved. The decode stage also detects RAW hazards between the vector processing core and the memory unit for load/store and register instructions. Indexed memory access stalls the entire vector core so the memory unit can read address offsets from the vector register file.


Figure 3.4: Vector Lane ALU

3.3.2 Datapath Functional Units

The functional units within each vector lane datapath include an ALU, a single-cycle barrel shifter, and a multiplier. The ALU takes advantage of the Stratix III ALM architecture to implement an efficient adder/subtractor. Figure 3.4 shows how the ALU is implemented using ALMs configured in arithmetic mode. The ALU supports arithmetic and logical operations, maximum/minimum, merge, absolute value, absolute difference, and comparisons. The first adder performs the bulk of the arithmetic operations, while the second adder is used for the absolute difference instruction (to compute subtraction with inputs reversed), and for logical operations (using the lookup tables at the adder inputs). The barrel shifter is implemented in log(n) levels of multiplexers, and the multiplier is implemented using embedded DSP blocks.

The multiplier takes up one quarter of a DSP block in 16-bit operation, and half a DSP block in 32-bit operation.


3.3.3 Distributed Vector Register File

The soft vector processor uses a vector register file that is distributed across vector lanes. This differs from traditional vector architectures which employ a large, centralized vector register file with many ports. The vector register file is element-partitioned — each vector lane has its own register file that contains all the vector registers, but only a few data elements of each vector register [12]. The partitioning scheme naturally divides the vector register file into parts that can be implemented using the embedded memory blocks on the FPGA. It also allows SIMD-like access to multiple data elements in the vector register file by the vector lanes. Furthermore, the distributed vector register file saves area compared to a large, multi-ported vector register

file. The collection of vertical dark-gray stripes in Figure 3.3 together represent a single vector register spanning all lanes, and containing multiple vector elements per lane. The soft vector processor ISA defines 64 vector registers³. Assigning four 32-bit elements of each register to each lane fills one M9K RAM; this is duplicated to provide two read ports. For this reason,

MVL is typically 4 times NLane for a 32-bit VPUW, and most vector instructions that use the full vector length execute in 4 clock cycles. Vector flag registers are distributed the same way.
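The sizing works out as follows (simple arithmetic from the figures above, shown for clarity rather than quoted from elsewhere in the thesis):

    64 registers × 4 elements × 32 bits = 8192 bits per lane,

which fits in one 9-Kbit M9K block, and since each lane holds 4 elements of every register, MVL = 4 × NLane (a 16-lane configuration, for example, has an MVL of 64).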

3.3.4 Load Store Unit

The load store unit within the vector lane interfaces with the memory unit for vector memory access. Two FIFO queues inside the load store unit buffer load and store data. For a vector memory store, the vector lane datapath can process a different instruction as soon as it transfers data from the vector register file to the store queue. During a vector memory load, the vector memory unit places data from memory into the load buffers without interrupting the vector lane datapath until all data has been loaded for the instruction. Pipeline interlocking allows vector loads to be statically scheduled ahead of independent instructions for increased concurrency.

³ VIRAM supports only 32 architectural registers. The large embedded memory blocks in Stratix III encouraged the extension to 64 registers in the ISA.

3.3.5 Local Memory

Each vector lane can instantiate a local memory by setting the LMemN parameter to specify the number of words in the local memory. The local memory uses register-indirect addressing, in which each vector lane supplies the address to access its own local memory. Like the distributed vector register file, it is normally split into 4 separate sections — one for each of the four data elements in a vector lane. However, if the parameter LMemShare is On, the four sections are merged, and the entire local memory becomes shared between all the elements that reside in the same lane. This mode is intended for table-lookup applications that share the same table contents between data elements.

3.4 Memory Unit

The memory unit handles memory accesses for both scalar and vector units. Scalar and vector memory instructions are processed in program order. Vector memory instructions are processed independently from vector arithmetic instructions by the memory unit, allowing their execution to be overlapped. The memory interface is intended to be connected to an external 128-bit

DDR-SDRAM module, which is suited for burst reading and writing of long vectors, or to large on-chip SRAMs. As a result, it interfaces to a single memory bank, and supports access to this single memory bank in all three vector addressing modes. To support arbitrary strides and accesses of different granularities (32-bit word, 16-bit halfword, 8-bit byte), crossbars are used to align read and write data. The width of the crossbars is equal to MemWidth (128), and the configurable parameter MemMinWidth (8) specifies the crossbar granularity, which dictates the smallest data width that can be accessed for all vector memory addressing modes. The (128) and (8) shown after the parameters indicate their default values.
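For reference, the three vector addressing modes correspond to the following access patterns, written here as equivalent scalar C loops. This is an illustrative sketch only; the identifiers are not taken from the processor's source.

#include <stdint.h>

/* Illustrative sketch of the three vector addressing modes as scalar loops.
   vl, base, stride, and index[] are hypothetical names, not RTL signals.  */
void unit_stride_load(int32_t *dst, const int32_t *mem, int base, int vl) {
    for (int i = 0; i < vl; i++)
        dst[i] = mem[base + i];              /* consecutive elements           */
}

void const_stride_load(int32_t *dst, const int32_t *mem, int base,
                       int stride, int vl) {
    for (int i = 0; i < vl; i++)
        dst[i] = mem[base + i * stride];     /* every stride-th element        */
}

void indexed_load(int32_t *dst, const int32_t *mem, int base,
                  const int32_t *index, int vl) {
    for (int i = 0; i < vl; i++)
        dst[i] = mem[base + index[i]];       /* gather via per-element offsets */
}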

The memory unit has three major components: load store controller, read interface, and write interface. The three components will be discussed in more detail in the following sections.

The memory unit is also used to implement vector insert and extract instructions: a bypass

register between the write and read interfaces allows data to be passed between the interfaces, and rearranged using the write and read crossbars.

3.4.1 Load Store Controller

The load store controller implements a state machine to control the operation of the memory unit. It handles both scalar and vector memory accesses, but only one at a time, and stalls the core that makes the later request if the controller is already busy. Memory access ordering between the two cores is preserved by using a FIFO queue to store a flag that indicates whether a scalar or vector memory instruction was fetched by the fetch stage. The load store controller reads from this FIFO queue to maintain memory access order. The controller also generates control signals to the load and store address generators depending on the memory instruction. For vector stores, the controller commences the memory store after the vector lane datapath has transferred data to the store buffers. For vector loads, the controller waits until all data has been transferred by the load address generator to the vector lane load buffers, then notifies the vector lane datapath that load data is ready. This causes the vector processing core to issue a transfer instruction to move loaded data from the load buffers to the vector register file. After the transfer, the load store controller is again free to accept the next memory instruction.

3.4.2 Read Interface

The read interface consists of the read address generator and the read data crossbar — a full crossbar MemWidth bits wide, with MinDataWidth-bit granularity. The read address generator produces select signals to control the read crossbar, and write enable signals to write the read data to the vector lane load buffers. Each cycle, the address generator calculates how many data elements can be aligned and stored into the vector lane load buffers given the stride and byte offset into the MemWidth-bit wide memory word, and asserts the write enables of the same number of vector lane load buffers. The read interface can align up to MemWidth/MemDataWidth elements per cycle, where MemDataWidth is the data width of the particular vector read instruction, for unit stride and constant stride reads.


Figure 3.5: Soft vector processor write interface. The write interface selects VPUW-bit wide data to be written from the store buffers and concatenates the elements into a MemWidth-bit wide memory word. The data compress block discards the extra bits if the memory access granularity is smaller than VPUW. This compressed memory word is then stored in an intermediate register, and passes through the delay network and crossbar before being written to memory.

Indexed vector load executes at one data element per cycle.
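With the default MemWidth of 128 bits, this works out to 128/32 = 4 word elements, 128/16 = 8 halfword elements, or 128/8 = 16 byte elements aligned per cycle for unit stride and constant stride loads (simple arithmetic on the stated parameters, not separately measured figures).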

3.4.3 Write Interface

The write interface consists of the write address generator and the write interface datapath.

The write interface datapath is shown in Figure 3.5. It is comprised of a multiplexer to select data from the vector lane store buffers to pass to the write logic, a data compress block, a selectable delay network to align data elements, and a write crossbar, MemWidth bits wide with MinDataWidth-bit granularity, that connects to main memory. The design of the selectable delay network is derived from the T0 vector processor [12].

Figure 3.6 shows how the delay network and alignment crossbars are used to handle write offsets and data misalignment for a vector core with four lanes. The write address generator can generate a single write address to write several data elements to memory each cycle.



Figure 3.6: Four cycles of a unit stride vector store with a one word offset. The delay network aligns data elements to the correct write cycle, while the crossbar moves the data elements to the correct positions (Source: [12]).

A unit stride vector store is shown in the figure, but the crossbar logic handles arbitrary strides.

If NLane × VPUW > MemWidth, a set of multiplexers is placed between the store buffers and the first set of registers to select data from the vector lanes to write. The data compress block handles writing of data widths smaller than VPUW by selecting the valid input bits to pass to the register. The write address generator can handle writing up to MemWidth/MemDataWidth elements per cycle, where MemDataWidth is the data width of the particular vector write instruction, for unit stride and constant stride writes. Indexed vector store executes at one data element per cycle.

3.5 FPGA-Specific Vector Extensions

The soft vector processor borrows heavily from VIRAM. The instruction set includes 45 vector integer arithmetic, logical, memory, and vector and flag manipulation instructions. For nearly all instructions, the instruction opcode selects one of two mask registers to provide conditional execution. Complex execution masks are formed by special instructions that manipulate several

flag registers. Some flag registers are general-purpose, while others hold condition codes from vector comparisons and arithmetic.

The soft vector processor also differs from VIRAM by extending it to take advantage of on-chip memory blocks and hardware DSP blocks common in FPGAs. These features are described below.

A new feature of this processor is the distributed multiply-accumulators and multiply-accumulate chain, implemented using the multiply-accumulate (MAC) feature of the Stratix III DSP blocks and shown in Figure 3.3. Multipliers in the vector lane ALUs are also implemented using the

Stratix III DSP blocks. The MAC feature configures the DSP block in a 4-multiply-accumulate mode, which can multiply and accumulate inputs from 4 vector lane datapaths into a single distributed accumulator. The cascade chain in the Stratix III DSP blocks allows cascade adding of these distributed accumulators, speeding up the otherwise inefficient vector reduction operation. MACL specifies the number of distributed accumulators chained together to produce one accumulation result. If MACL is smaller than NLane/4, the accumulate chain generates NLane/(4 × MACL) partial results and writes them as a contiguous vector to the beginning of the destination vector register so they can be accumulated again. The MAC units are incorporated into the soft vector processor through the vmac and vcczacc instructions. The vmac instruction multiplies two source vectors and accumulates the result into the distributed accumulators. The vcczacc instruction cascade-adds the distributed accumulator results in the entire MAC chain, copies the final result (or the partial result vector if the cascade chain does not span all MAC units) to a vector register, and zeros the distributed accumulators. The accumulators allow easy

accumulation of multi-dimensional data, and the accumulate chain allows fast sum reduction of data elements within a vector.
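As an illustrative example (derived from the formula above rather than stated elsewhere in the thesis): with NLane = 16 there are 16/4 = 4 distributed accumulators. Setting MACL = 2 chains them in pairs, so vcczacc writes 16/(4 × 2) = 2 partial results back as a contiguous vector for a second accumulation pass, whereas MACL = 4 spans all four accumulators and produces a single final sum.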

The distributed vector register file efficiently utilizes embedded memory blocks on the

FPGA, as described in Section 3.3.3. The vector lane local memory described in Section 3.3.5 is also implemented using these memory blocks. This local memory is non-coherent, and exists in a separate address space from main memory. The local memory can be read through the vldl instruction, and written using the vstl instruction. Data to be written into the local memory can be taken from a vector register, or the value from a scalar register can be broadcast to all local memories. A scalar broadcast writes a data value from a scalar register to the local memory at an address given by a vector register, which facilitates filling the local memory with values computed by the scalar unit.

The adjacent element shift chain is also shown in Figure 3.3. It allows fast single-direction shifting of data elements in a vector register by one position, avoiding a slow vector insert or extract instruction for this common operation. This feature can be accessed through the veshift instruction.

One final extension is the inclusion of an absolute difference instruction, vabsdiff, which is useful in sum of absolute difference calculations.

To demonstrate the effectiveness of the vector extensions, the same FIR filter example implemented in VIRAM vector code in Figure 2.2 can be implemented in soft vector processor assembly as shown in Figure 3.7. The capitalized instructions highlight the differences in the assembly code. The soft vector processor code sequence is significantly shorter, with 12 instructions in the loop instead of 18. The MAC units and cascade chain speed up the reduction operation, and the shift chain simplifies adjacent element shifting.

The vabsdiff instruction is used by the MPEG motion estimation benchmark discussed in the next chapter. Also, the vldl and vstl instructions are used by the AES encryption benchmark also discussed in Chapter 4.

vmstc    vbase0, sp        ; Load base address
vmstc    vbase1, r3        ; Load coefficient address
vmstc    VL, r2            ; Set VL to num taps
vld.h    v2, vbase1        ; Load filter coefficients
vld.h    v1, vbase0        ; Load input vector
.L5:
VMAC     v1, v2            ; Multiply-accumulate up to 16 values
VCCZACC  v3                ; Copy result from accumulator and zero
vmstc    VL, r9            ; Reset VL to num taps (vcczacc changes VL)
vmstc    vindex, r8        ; Set vindex to 0
vext.vs  r10, v3           ; Extract final summation result
stw      r10, 0(r4)        ; Store result
VESHIFT  v1, v1            ; Vector element shift
vmstc    vindex, r6        ; Set vindex to NTAP-1
vins.vs  v1, r5            ; Insert new sample
addi     r3, r3, -1
addi     r4, r4, 4
bne      r3, zero, .L5

Figure 3.7: 8-tap FIR filter vector assembly

3.6 Configurable Parameters

Table 3.1 lists the configurable parameters and features of the soft vector processor architecture.

NLane, MVL, VPUW, MemWidth, and MemMinWidth are the primary parameters, and have a large impact on the performance and resource utilization of the processor. Typically MVL is determined by NLane and VPUW, as the number of registers defined in the ISA, a given

VPUW, and the FPGA embedded memory block size constrain how many data elements can fit in a vector lane. The secondary parameters enable or disable optional features of the processor, such as MAC units, local memory, vector lane hardware multipliers, vector element shift chain, and logic for vector insert/extract instructions. The MACL parameter enables the accumulate chain in the Stratix III DSP blocks, and specifies how many MAC units to chain together to calculate one sum.


Table 3.1: List of configurable processor parameters

Parameter      Description                                                 Typical
NLane          Number of vector lanes                                      4–128
MVL            Maximum vector length                                       16–512
VPUW           Processor data width (bits)                                 8, 16, 32
MemWidth       Memory interface width                                      32–128
MemMinWidth    Minimum accessible data width in memory                     8, 16, 32
Secondary Parameters
MACL           MAC chain length (0 is no MAC)                              0, 1, 2, 4
LMemN          Local memory number of words                                0–1024
LMemShare      Shared local memory address space within lane               On/Off
Vmult          Vector lane hardware multiplier                             On/Off
Vupshift       Vector adjacent element up-shifting                         On/Off
Vmanip         Vector manipulation instructions (vector insert/extract)    On/Off

3.7 Design Flow

This section describes the design flow for using the soft vector processor. Section 3.7.1 describes an ideal design flow based on vectorizing compilers, and Section 3.7.2 looks at the current state of vectorizing compilers and the feasibility of adopting such compilers for this processor.

3.7.1 Soft Vector Processor Flow versus C-based Synthesis

One major advantage of using a general-purpose soft processor over automatic hardware synthesis methods is that the hardware does not have to be resynthesized or recompiled when the underlying software changes. To illustrate this difference, Figure 3.8 compares the soft vector processor design flow with the design flow of a C-based synthesis tool designed for FPGAs, in this case the Altera C2H Compiler.

The C2H design flow begins with software code development, with hardware acceleration added as a second phase of development. The user must identify and isolate the sections to be hardware-accelerated, run the C2H compiler, and analyze the compilation estimates. As the compilation estimates only give the performance estimates of the accelerators, a full synthesis, placement, and routing must be performed in order to determine the maximum frequency and performance of the entire system.


Figure 3.8: Altera C2H compiler design flow versus soft vector processor design flow

The soft vector processor design flow begins with determining the resource budget and performance requirements of the design, which dictate the performance level and resource utilization of the core that is to be selected. This initial specification phase sets the configuration and specifications of the processor, and allows software and hardware development to proceed in parallel rather than serially. Fixing the hardware netlist of the processor simplifies hardware design of other blocks, and fixing processor features and capabilities gives software developers constraints for writing code that will remain fixed throughout the design cycle. Application development is independent of the synthesis flow, which is only needed to determine the maximum frequency and end performance. No resynthesis is necessary when the software is modified.

The optimization methods of the two design flows also differ. Due to the limitation of requiring manual partitioning of memory, typical optimization steps with the C2H compiler are to create more on-chip memories, reorganize data in memory by partitioning, use programming pragmas to control connections between accelerators and memories, and manually unroll loops to increase the parallelism of the accelerator. These optimizations, especially optimizing the memory architecture, require not only a good understanding of the algorithm, but also hardware design knowledge and thinking. In contrast, optimizations in the soft vector processor design flow do not require any understanding of the hardware implementation of the processor, nor any changes to the hardware beyond setting a few high-level parameters according to the software's needs. Optimizing the algorithm at the software level to achieve better vectorization is something that an average software developer can do. Hence the soft vector processor can deliver FPGA-accelerated performance to a wider software developer user base.

3.7.2 Comment on Vectorizing Compilers

The ease of use of the vector programming model depends greatly on the availability of a vectorizing compiler. No vectorizing compiler was developed or used in this thesis due to time constraints. However, vectorizing compilers exist and have been used, for example, in past and present Cray supercomputers. The VIRAM project, on which much of this work and the soft vector processor instruction set are based, originally developed a compiler based on the Cray

PDGS system for vector supercomputers. The compiler for the Cell processor supports auto-

SIMDization to automatically extract SIMD parallelism and generate code for the Cell SIMD processors. Furthermore, GCC also has an auto-vectorization feature (partially developed by

IBM) that is partially enabled and being continually developed. Therefore it should be feasible to modify an existing vectorizing compiler for this soft vector processor.
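For reference, the kind of loop that GCC's auto-vectorizer already handles on conventional SIMD targets (enabled at -O3, or explicitly with -ftree-vectorize) is shown below; retargeting such a compiler would mean emitting this processor's vector instructions instead. The function and array names are illustrative only.

/* A trivially vectorizable loop: element-wise addition of two arrays.
   GCC's tree vectorizer can convert this loop into SIMD/vector form.  */
void vec_add(int *c, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}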

Chapter 4

Results

This chapter presents performance results from benchmarking the soft vector processor, and compares its performance to the Nios II soft processor and the Altera C2H compiler discussed in

Section 2.5.1. The chapter first describes the benchmarks and the methodology, then presents the results and comparisons between the different systems.

4.1 Benchmark Suite

Three benchmark kernels representative of data-parallel embedded applications are chosen to compare the performance of the different approaches. They are taken from algorithms in MPEG encoding, image processing, and encryption. The following sections describe the operation of the three benchmarks.

Block Matching Motion Estimation

Block matching motion estimation is used in video compression algorithms to remove temporal redundancy within frames and allow coding systems to achieve a high compression ratio. The algorithm divides each luma frame into blocks of size N × N, and matches each block in the current frame with candidate blocks of the same size within a search area in the reference frame. The best matched block has the lowest distortion among all candidate blocks, and the displacement of the block, or the motion vector, is used to encode the video sequence. The metric is typically sum of absolute differences (SAD), defined as:

for (l = 0; l < 32; l++) {                /* vertical search positions   */
    for (k = 0; k < 32; k++) {            /* horizontal search positions */
        answer = 0;
        for (i = 0; i < 16; i++) {        /* rows of the 16x16 block     */
            for (j = 0; j < 16; j++) {    /* columns of the 16x16 block  */
                temp = c[i][j] - s[i + l][j + k];
                /* absolute value taken from the sign bit */
                answer += ((temp >> 31) == 1) ? (~temp + 1) : temp;
            }
        }
        result[l][k] = answer;
    }
}

Figure 4.1: Motion estimation C code

SAD(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| c(i, j) - s(i + m, j + n) \right|

A full search block matching algorithm (FSBMA) matches the current block c to all candidate blocks in the reference frame s within a search range [−p, p−1], and finds the motion vector of the block with minimum SAD among the (2p)² search positions. Performing a full search requires huge computational power. For example, real-time motion estimation for CIF (352×288)

30 frames per second (fps) video with [−16, +15] search range requires 9.3 giga-operations per second (GOPS) [65]. Due to the large computational requirements and regularity of data flow, many FSBMA hardware architectures have been developed. Figure 4.1 shows example C code for the motion estimation kernel.

Image Median Filter

The median filter is commonly used in image processing to reduce noise in an image and is particularly effective against impulse noise. It replaces each pixel with the median value of surrounding pixels within a window. Figure 4.2 shows example C code for a simple median

filtering algorithm that processes an image using the median of a 5 × 5 image region. It operates by performing a bubble sort, stopping early when the top half is sorted, to locate the median.

for (i = 0; i <= 12; i++) {
    min = array[i];
    for (j = i; j <= 24; j++) {
        if (array[j] < min) {
            temp = min;
            min = array[j];
            array[j] = temp;
        }
    }
}

Figure 4.2: 5 × 5 median filter C code

AES Encryption

The 128-bit AES Encryption algorithm [66] is an encryption standard adopted by the U.S. government. The particular implementation used here is taken from the MiBench benchmark suite [67]. It is a block cipher, and has a fixed block size of 128 bits. Each block of data can be logically arranged into a 4 × 4 matrix of bytes, termed the AES state. A 128-bit key implementation, which consists of 10 encryption rounds, is used. Each round in the algorithm consists of four steps: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The first 3 steps can be implemented efficiently on 32-bit processors using a single 1 KB lookup table. A single round is then accomplished through four table-lookups, three byte rotations, and four EXOR operations [68].
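To make the lookup-table formulation concrete, the sketch below computes one output column of a round in the style just described: four table lookups, three byte rotations, and four XORs. This is an illustrative sketch only; the table contents, byte ordering, and rotation direction depend on how the 1 KB table is generated, and none of the identifiers are taken from the MiBench source.

#include <stdint.h>

/* Rotate a 32-bit word right by n bits (n = 8, 16, or 24 below). */
static inline uint32_t ror32(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

/* One round column: s0..s3 are the four state columns already selected
   according to ShiftRows, T is the 256-entry 32-bit lookup table, and rk
   is the round key word for this column.                                 */
uint32_t aes_round_column(uint32_t s0, uint32_t s1, uint32_t s2, uint32_t s3,
                          const uint32_t T[256], uint32_t rk)
{
    return  T[(s0 >> 24) & 0xff]
          ^ ror32(T[(s1 >> 16) & 0xff],  8)
          ^ ror32(T[(s2 >>  8) & 0xff], 16)
          ^ ror32(T[(s3      ) & 0xff], 24)
          ^ rk;
}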

4.2 Benchmark Preparation

The following sections describe how the benchmarks were prepared for performance measurements on the different systems. Specific steps taken to vectorize the benchmarks for the soft vector processor, and to tune the benchmarks for the C2H compiler, are also described.


4.2.1 General Methodology

The benchmarks include only the main loop section of the kernels. The median filter kernel calculates one output pixel from the 5 × 5 window. The motion estimation kernel calculates

SAD values for each position of a [−16, +15] search range and stores the values into an array. It makes no comparisons between the values. The AES encryption kernel computes all rounds of encryption on 128 bits of data. The instruction and cycle counts for one intermediate round of the 10-round algorithm are reported, as the final round is handled differently and is not within the main loop.

As there is no vector compiler for this project, the benchmark kernels were vectorized by embedding vector assembly instructions for the soft vector processor using gcc-style inline assembly, then compiled using the Nios II gcc (nios2-elf-gcc 3.4.1) with optimization level -O3. A modified Nios II assembler that adds vector instruction support was used to assemble the code.
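For illustration, a vector instruction can be embedded into the C source roughly as follows. The mnemonics come from the listings in this chapter, but the wrapper macro and the absence of operand constraints are assumptions about how the modified assembler was driven, not an excerpt from the benchmark sources.

/* Hypothetical helper: emit one soft vector processor instruction inline;
   the modified Nios II assembler recognizes the vector mnemonics.        */
#define VINSN(text)  __asm__ __volatile__ (text)

/* e.g. inside the motion estimation inner loop:
   VINSN("vabsdiff v4, v2, v3");
   VINSN("vadd     v5, v5, v4");                                          */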

4.2.2 Benchmark Vectorization

The three kernel benchmarks were vectorized manually using a mix of assembly and C code. The following sections describe at a high level how the kernels were vectorized, and show segments of the vector assembly codes.

Motion Estimation

In a vector processor implementation of full search motion estimation, one of the dimensions is handled by vectorizing (removing) the innermost loop. This vectorization step supports up to a vector length of 16 due to the 16 × 16 pixel size of macroblocks in MPEG. With 8 lanes and MVL of 32, two copies of the current macroblock can be matched against the search area simultaneously, reducing the number of iterations required to process the entire search range.

To keep the innermost loop fast, only unit stride vector load instructions, which execute the fastest, are used.

Figure 4.3 shows how the matching is accomplished using only unit stride memory load instructions.


Figure 4.3: Simultaneously matching two copies of the macroblock to a reference frame

The two copies of the macroblock are offset vertically by one row. For each position within the search range, 17 rows are processed, with the calculations in the first and last rows partially masked using a vector flag register. The search range also increases by one row vertically as a result of the offsetting. This implementation actually executes faster than replicating the macroblock without the row offset, which would require a slow vector insert instruction in the inner loop of the kernel.

Figure 4.4 shows the vector code for the inner loop, plus vector code in the next outer loop to extract and accumulate results after processing the entire 16 × 16 window. Code to handle the first and last rows is not shown, but the instruction sequence is similar to the inner loop. The assembly code accumulates the rows of the two windows into a single vector register, then uses a vector mask to select between the two accumulated rows to do the final sum using the accumulate chain. The MAC chain is used to reduce partial results to one final result in the vcczacc instruction. The “.1” instruction extension indicates conditional execution using vfmask1 as the mask. This implementation requires 6 instructions in the innermost loop.

To further improve performance, the number of memory accesses can be greatly reduced by unrolling the loop so that pixels in the floating blocks only need to be loaded once into vector registers. To slide the window horizontally, the rows from the reference frame can be shifted using the vector element shift instead of loading the pixels repeatedly with new offsets.


vmstc    vbase1, r5              ; Load x base
vmstc    vbase2, r6              ; Load y base
vadd     v5, v0, v0              ; Zero sum

.L15:                            ; Innermost loop over rows
vld.b    v2, vbase1, vinc1       ; load block pixels, vinc1 = 16
vld.b    v3, vbase2, vinc2       ; load frame pixels, vinc2 = IMAGEWIDTH
vabsdiff v4, v2, v3
vadd     v5, v5, v4              ; Accumulate to sum
addi     r2, r2, -1              ; j++
bne      r2, zero, .L15          ; Loop again if (j<15)

vfld     vfmask1, vbase4, vinc4  ; Load flag mask
vmac.1   v6, v5                  ; Accumulate across sum
vcczacc  v6                      ; Copy from accumulator
vmstc    VL, r9                  ; Reset VL after vcczacc
vext.vs  r3, v6                  ; Extract result to scalar core

Figure 4.4: Code segment from the motion estimation vector assembly. The code to handle the first and last rows of the windows is not shown.

Two versions of the motion estimation vector assembly are required. One is used for a 4-vector-lane configuration to handle a single block, while the other handles two offset blocks with 8 or 16 vector lanes.

Image Median Filter

The median filter kernel vectorizes very nicely by exploiting outer-loop parallelism. Figure 4.5 shows how this can be done. Each strip represents one row of MVL pixels, and each row is loaded into a separate vector register. The window of pixels that is being processed will then reside in the same data element across 25 vector registers. After initial setup, the same

filtering algorithm as the scalar code can then be used. The vector processor uses masked execution to implement conditionals, and will execute all instructions inside the conditional block every iteration.



Figure 4.5: Vectorizing the image median filter

.L8:
vmstc    vbase0, r1        ; load address of array[j]
vld.b    v1, vbase0        ; vector load array[j]
vld.b    v2, vbase1        ; vector load min
vcmpltu  vf1, v1, v2       ; compare, set vector flags
vmov.1   v3, v2            ; copy min to temp
vst.b.1  v1, vbase1        ; vector store array[j] to min
vst.b.1  v3, vbase0        ; store temp to array[j]
addi     r3, r3, 1
addi     r2, r2, 16
bge      r7, r3, .L8

Figure 4.6: Median filter inner loop vector assembly

Figure 4.6 shows the inner loop vector assembly. vbase1 is initialized in the outer loop to the address of min. This implementation of the median filter can generate as many results at a time as the MVL supported by the processor. An 8-lane vector processor will generate 32 results at once, achieving a large speedup over scalar processing. This example highlights the importance of outer-loop parallelism, which the vector architecture, with help from the programmer, can exploit. The median filter vector assembly does not require any modification for soft vector processor configurations with different numbers of vector lanes.


vlds.w   v1, vbase0, vstride4, vinc1   ; stride 4 word load one
                                       ; AES state column (4 times)

; Begin encryption round loop
vlds.w   v12, vbase1, vstride0, vinc1  ; stride 0 word load one
                                       ; round key column (4 times)
vadd     v8, v4, v0                    ; copy last column of AES state
vsrl.vs  v8, v8, r3                    ; shift right 24 bits
vldl     v8, v8                        ; S-box table lookup
vrot.vs  v8, v8, r8                    ; rotate byte
vadd     v7, v3, v0                    ; copy 3rd column of AES state
vsrl.vs  v7, v7, r2                    ; shift right 16 bits
vand.vs  v7, v7, r7                    ; mask with 0x000000ff
vldl     v7, v7                        ; S-box table lookup
vrot.vs  v7, v7, r8                    ; rotate byte
vxor     v8, v8, v7                    ; XOR 2 columns

Figure 4.7: Vector assembly for loading AES data and performing AES encryption round

AES Encryption

The AES encryption kernel can also exploit outer-loop parallelism by loading multiple blocks to be encrypted into different vector lanes, achieving speedup over scalar processing. Each column of the 128-bit AES state fits into one element of a vector register in a 32-bit (VPUW) soft vector processor. A vector processor with 8 lanes has an MVL of 32, and can encrypt 32 blocks, or 4096 bits of data, in parallel. The vector assembly code for loading data and for two rotate-lookup steps of a round transformation is shown in Figure 4.7. This implementation first initializes all local memories with the substitution lookup table through a broadcast from the scalar core (not shown). The AES state is then loaded from memory with four stride-four load word instructions, which load the four columns of multiple 128-bit AES blocks into four vector registers. Each AES block will then reside within a single vector lane, across the four vector registers. All the loaded AES blocks perform parallel table lookups using the local memories.

Assuming the round keys have already been generated, a total of 92 instructions are needed to perform the round transformation. The vector assembly for the AES encryption kernel does not require any modification for processor configurations with different numbers of vector lanes.

4.2.3 Benchmark Tuning for C2H Compiler

The C2H compiler generates only a single hardware accelerator for a given C function. The only built-in options for customization are pragmas for controlling interrupt behaviour and the generation of master ports to access memory. One set of accelerators was generated by directly compiling the kernels with as little modification to the C code as possible. These are named the “push-button” C2H accelerators. To compare the scalability of C2H-generated accelerators against the soft vector processor, the C source code of the benchmark kernels was manually modified to generate accelerators with different performance and resource tradeoffs for a second set of results. Vector processing concepts were applied to the C code by unrolling loops and declaring additional variables to calculate multiple results in parallel.

The median filter kernel did not require any additional optimization besides unrolling.

For the motion estimation kernel, the horizontal-move outer loop was moved inside and unrolled, resulting in one pixel of the moving block being compared to up to 32 possible horizontal locations in parallel, creating 32 accumulators for the SAD operation. One pixel datum is loaded from memory to a register once, then re-used 32 times by the 32 accumulators. Hardware knowledge is needed to recognize that 32 accumulator registers will be inferred to hold the intermediate results. The manual unrolling also created very hard-to-maintain C code.
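As a rough illustration of the unrolling described above, the C sketch below keeps one pixel of the moving block in a register and compares it against several horizontal offsets, each with its own accumulator. Four accumulators are shown instead of 32 to keep it short, and the names are illustrative rather than the actual benchmark source.

#include <stdlib.h>

/* Each loop iteration reuses one block pixel for several candidate
 * horizontal positions, accumulating a separate SAD per position. */
void sad_row_unrolled(const unsigned char *block, const unsigned char *window,
                      int width, int sad[4])
{
    int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (int x = 0; x < width; x++) {
        int p = block[x];                     /* pixel loaded once, reused */
        acc0 += abs(p - window[x + 0]);
        acc1 += abs(p - window[x + 1]);
        acc2 += abs(p - window[x + 2]);
        acc3 += abs(p - window[x + 3]);
    }
    sad[0] = acc0; sad[1] = acc1; sad[2] = acc2; sad[3] = acc3;
}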

The AES encryption kernel requires four table lookup operations for the four columns in the

AES state. These operations were optimized to be performed in parallel by instantiating four separate 256-entry, 32-bit memories local to the accelerator. This was not done automatically by the compiler as the lookup tables were initially declared as global variables, which the compiler automatically placed in main memory. The AES engine was further replicated to increase performance at the expense of increased resources. The effects of these optimizations on performance will be shown in Section 4.4.3.
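The effect of that optimization can be sketched as follows; the fragment is a heavily simplified, hypothetical illustration (it performs only the table lookups, not a full AES round, and the exact pragmas or declarations the real code used to create accelerator-local memories are omitted).

/* Global table: placed in main memory, so the four lookups contend
 * for one memory when compiled by C2H. */
unsigned int sbox_global[256];

void lookup_columns_global(unsigned int state[4])
{
    for (int c = 0; c < 4; c++)
        state[c] = sbox_global[state[c] & 0xff];
}

/* Accelerator-local tables: four separate 256-entry, 32-bit memories,
 * allowing the four column lookups to proceed in parallel. */
void lookup_columns_local(unsigned int state[4])
{
    static unsigned int t0[256], t1[256], t2[256], t3[256];
    state[0] = t0[state[0] & 0xff];
    state[1] = t1[state[1] & 0xff];
    state[2] = t2[state[2] & 0xff];
    state[3] = t3[state[3] & 0xff];
}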


Table 4.1: Per benchmark soft vector processor configuration parameter settings

Vector Processor Configuration Parameter   V4-Median   V4-Motion   V4-AES
NLane                                      4           4           4
MVL                                        16          16          16
VPUW                                       16          16          32
MemWidth                                   64          64          64
MemMinWidth                                8           8           32
MACL                                       0           1           0
LMemN                                      0           0           256
LMemShare                                  Off         Off         On
Vmult                                      On          On          On
Vmanip                                     Off         Off         Off
Vupshift                                   On          On          Off

4.2.4 Soft Vector Processor Per Benchmark Configuration

To illustrate the scalability of the soft vector processor, several configurations with a different number of vector lanes were generated for each benchmark. These different configurations are named Vx, with x being the number of vector lanes. Furthermore, for each of the benchmarks, the soft vector processor was customized to the minimal set of features required to execute the benchmark. Table 4.1 shows the configuration parameters for a V4 processor for each of the benchmarks. Median filtering and motion estimation do not require any 32-bit vector processing, so 16-bit ALUs are used. The AES encryption kernel only requires 32-bit word memory access, so the minimum data width for vector memory accesses is set to 32 bits to remove the byte and halfword memory crossbars.

4.3 Resource Utilization Results

The application-specific configurations of the soft vector processor, the C2H accelerators, and the Nios II/s processor were compiled in Quartus II to measure their resource usage. The Nios II/s (standard) processor core is used as the baseline for comparison.

Table 4.2: Resource usage of vector processor and C2H accelerator configurations

Stratix III (C3)      ALM       DSP Elements   M9K     Fmax (MHz)
EP3SL50 device        19,000    216            108     –
EP3SL340 device       135,000   576            1,040   –
Nios II/s             537       4              6       203
UTIIe                 376       0              3       170
UTIIe+V4F             5,825     25             21      106
UTIIe+V8F             7,501     46             33      110
UTIIe+V16F            10,955    86             56      105
UTIIe+V16Median       7,330     38             41      111
UTIIe+V16Motion       7,983     54             41      105
UTIIe+V16AES          7,381     66             56      113

Stratix II (C3)       ALM       DSP Elements   M4K     Fmax (MHz)
Nios II/s + C2H
Median Filtering      825       8              4       147
Motion Estimation     972       10             4       121
AES Encryption        2,480     8              6       119

The Nios II/s core has a 5-stage pipeline and supports static branch prediction, hardware multiply, and shift. It is further configured with a 1 KB instruction cache, 32 KB each of on-chip program and data memory, and no debug core. The Nios II/s core and the soft vector processor cores were compiled targeting a Stratix III EP3SL340 device in the C3 speed grade. The C2H accelerators were compiled targeting a Stratix II device, also in the C3 speed grade, because the C2H compiler did not yet support Stratix III devices.

Table 4.2 summarizes the resource usage of several configurations of the vector processor with different numbers of vector lanes, as well as benchmark-specific configurations of the vector processor and C2H accelerator. The VxF configurations of the soft vector processor support the full feature set of the architecture, and can perform byte-level vector memory access. The flexible memory interface, which supports bytes, halfwords, and words, is the single largest component, using over 50% of the ALMs in the V4F processor and 35% of the ALMs in the V16F processor. The Median, Motion, and AES configurations are the application-specific configurations used for performance benchmarking.


Table 4.3: Resource usage of vector processor when varying NLane and MemMinWidth with 128-bit MemWidth and otherwise full features

Configuration   NLane   MemMinWidth   ALM      DSP Elements   M9K   Fmax (MHz)
UTIIe+V16F      16      8             10,955   86             56    105
UTIIe+V16M16    16      16            9,149    82             57    110
UTIIe+V16M32    16      32            8,163    82             54    107
UTIIe+V8F       8       8             7,501    46             33    110
UTIIe+V8M16     8       16            5,845    42             33    114
UTIIe+V8M32     8       32            5,047    42             30    109
UTIIe+V4F       4       8             5,825    25             21    106
UTIIe+V4M16     4       16            4,278    21             21    113
UTIIe+V4M32     4       32            3,414    21             20    116

Table 4.3 illustrates the range and gradual change in resource utilization when varying NLane from 16 down to 4, and MemMinWidth from 8 to 32 for each number of vector lanes. MemWidth is 128 bits for all configurations in the table, and the configurations support the full feature set as in the VxF configurations. The Mx tag denotes the simpler memory crossbars that support only vector memory accesses of width x or greater. Reducing the memory interface to 16-bit halfword, or even 32-bit word access only, leads to a large savings in area. The minor fluctuations in the number of M9K memories are due to variations in the synthesis and placement tool solutions.

Table 4.4 shows the effect of successively removing processor features from a baseline V8F configuration. The three V8Cx custom configurations remove local memory, distributed accumulators and vector lane multipliers, and vector insert/extract and element shifting instructions, respectively. The V8W16 configuration further uses 16-bit vector lane datapaths. The results illustrate that resource utilization can be improved by tailoring processor features and instruction support to the application, especially by supporting only the required data and ALU width.

Table 4.4: Resource utilization from varying secondary processor parameters

                Vector Processor Configuration (UTIIe+)
Resource        V8F     V8C1    V8C2    V8C3    V8W16
ALM             7,501   7,505   6,911   6,688   5,376
DSP Elements    46      46      6       6       6
M9K             33      25      25      25      25
Fmax (MHz)      110     105     107     113     114
Parameter Values
NLane           8       8       8       8       8
MVL             32      32      32      32      32
VPUW            32      32      32      32      16
MemWidth        128     128     128     128     128
MemMinWidth     8       8       8       8       8
MACL            2       2       0       0       0
LMemN           256     0       0       0       0
LMemShare       On      Off     Off     Off     Off
Vmult           On      On      Off     Off     Off
Vmanip          On      On      On      Off     Off
Vupshift        On      On      On      Off     Off

4.4 Performance Results

The following sections present performance results of the soft vector processor, and compare them to the Nios II/s processor and the C2H compiler. Performance models were used to estimate the performance of each of the systems, with two models for the soft vector processor.

The performance models will be described in the following sections together with the results.

Execution time is used as the metric to compare the different performance models. It is estimated from instruction count, clock cycle count, and operating frequency. Fmax estimates are obtained by compiling the systems in Quartus II 7.2 targeting an EP3SL340 device, and running the TimeQuest timing analyzer in single-corner mode.


4.4.1 Performance Models

Four performance models for the Nios II, soft vector processor, and C2H compiler are compared using execution time for the three benchmark kernels. The four models are Ideal Nios Model, soft vector processor RTL Model, Ideal Vector Model, and C2H Model.

Ideal Nios Model

The Ideal Nios Model simulates ideal execution conditions for a Nios II processor, and is used as the baseline for comparison. The model executes all Nios II assembly instructions in a single cycle; this presumes perfect caching and branch prediction. Execution time is calculated by counting the number of cycles needed to execute the assembly code, and dividing by the processor maximum frequency.

RTL Model

The RTL model measures performance of the soft vector processor by performing a cycle-by-cycle functional simulation in ModelSim of the processor RTL description executing the benchmark kernels. The RTL implementation of the soft vector processor uses the UTIIe, which is a multicycle processor, as a scalar unit. Each scalar instruction takes 6 cycles, with additional cycles needed for shift instructions. The performance of a multicycle scalar processor is significantly worse than that of a fully pipelined scalar processor such as the real Nios II. As a result, the performance of the RTL model is highly limited by the use of the UTIIe core. Single-cycle memory access is also assumed for the functional simulation, which is realistic for accessing on-chip memories on the FPGA.

Ideal Vector Model

The Ideal Vector Model is used to estimate the potential performance of the soft vector processor with a fully pipelined scalar unit. The Ideal Vector Model incorporates the Ideal Nios Model for scalar instructions, so scalar instructions execute in a single cycle.

Table 4.5: Ideal vector performance model

Instruction Class       Instruction Cycles
Scalar instructions     1
Vector non-memory       ⌈VL/NLane⌉
Vector local memory     ⌈VL/NLane⌉
Vector memory store     ⌈VL/NLane⌉
Vector memory load      2 + ⌈VL/min(NLane, MaxElem)⌉ + ⌈VL/NLane⌉

MaxElem = MemWidth/MemDataWidth, where MemDataWidth is the data width of the particular vector memory access.

Vector instructions are separated into different classes, each taking a different number of cycles. Table 4.5 shows the number of cycles needed for each vector instruction class. Memory store instructions take the same number of cycles as non-memory instructions due to write buffering. The first non-constant term in the memory load cycles models the number of cycles needed to transfer data from memory to the load buffer. MaxElem is the maximum number of data elements that can be transferred per cycle through the MemWidth-bit memory interface. The second term models transferring data from the load buffer to the vector register file. Single-cycle memory access is again assumed.

Performance of the Ideal Vector Model is calculated by manually counting the number of assembly instructions executed, then multiplying the instruction count of each instruction class by its cycles per instruction to obtain the total cycle count.

Execution time is then calculated by dividing the cycle count by the maximum frequency.
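The cycle model of Table 4.5 is simple enough to express directly in code. The C sketch below computes the per-class cycle costs and an execution-time estimate for an illustrative instruction mix; the formulas follow the table, while the mix, lane count, and Fmax value are only examples.

#include <stdio.h>

/* Ceiling division helper */
static unsigned cdiv(unsigned a, unsigned b) { return (a + b - 1) / b; }

/* Non-memory, local memory, and store instructions: ceil(VL/NLane) cycles */
static unsigned vec_nonmem_cycles(unsigned vl, unsigned nlane)
{
    return cdiv(vl, nlane);
}

/* Memory load: 2 + ceil(VL/min(NLane, MaxElem)) + ceil(VL/NLane) cycles */
static unsigned vec_load_cycles(unsigned vl, unsigned nlane,
                                unsigned mem_width, unsigned data_width)
{
    unsigned max_elem = mem_width / data_width;  /* elements per transfer */
    unsigned n = nlane < max_elem ? nlane : max_elem;
    return 2 + cdiv(vl, n) + cdiv(vl, nlane);
}

int main(void)
{
    /* Example mix for a V8 configuration: 8 lanes, 128-bit memory,
       32-bit loads, VL = 32.  Instruction counts are illustrative. */
    unsigned cycles = 10 * 1                               /* 10 scalar ops   */
                    + 6 * vec_nonmem_cycles(32, 8)         /* 6 vector ALU ops */
                    + 2 * vec_load_cycles(32, 8, 128, 32); /* 2 word loads     */
    double fmax_mhz = 108.0;                               /* V8 Fmax, Table 4.6 */
    printf("cycles = %u, time = %.2f us\n", cycles, cycles / fmax_mhz);
    return 0;
}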

The Ideal Vector Model is optimistic in that pipeline execution and memory accesses are ideal. No pipeline interlocks due to dependent instructions are modeled in the scalar or vector core. However, the Ideal Vector Model is also pessimistic in one aspect because it does not model the concurrent execution of scalar, vector arithmetic, and vector memory instructions that is accounted for in the RTL model.


C2H Model

The C2H model simulates performance of the C2H accelerators in a SOPC system built with the Altera SOPC Builder. Execution times of the accelerators are calculated for the main loop only, from the loop latency and cycles per loop iteration (CPLI) given in the C2H compilation performance report, as well as the Fmax calculated by Quartus II.
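A small sketch of how such an estimate can be assembled from the report figures follows. The accounting shown (pipeline fill once, then one iteration every CPLI cycles) is a plausible reading rather than the report's exact formula, and the numbers are illustrative.

#include <stdio.h>

/* Rough C2H execution-time estimate from compilation-report figures. */
static double c2h_time_us(unsigned loop_latency, unsigned cpli,
                          unsigned iterations, double fmax_mhz)
{
    unsigned cycles = loop_latency + cpli * iterations;
    return cycles / fmax_mhz;            /* cycles / MHz = microseconds */
}

int main(void)
{
    /* Illustrative numbers only, not taken from the report */
    printf("%.1f us\n", c2h_time_us(20, 4, 1024, 121.0));
    return 0;
}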

Performance Model Usage

No board-level measurement results are presented in this thesis because a Stratix III-based development board (DE3 development board) was not available to test the processor at the time of writing. Also, two different performance models of the soft vector processor are presented because the RTL model alone is severely limited by the performance of the UTIIe and is not indicative of speedups that can be obtained with a fully pipelined scalar unit.

4.4.2 RTL Model Performance

Table 4.6 shows the simulated performance of the soft vector processor on the three benchmarks measured by instruction count and clock cycle count. The clock cycle count and Fmax are used to compute speedup over the baseline Ideal Nios Model. All vector configurations show speedup over the baseline, where greater performance is obtained when more vector lanes are used. The instruction count and clock cycles per result decrease for median filtering and AES encryption as more vector lanes are added because more results are computed in parallel. In particular, the fractional instruction counts for AES encryption result from dividing the total instructions by the number of blocks encrypted in parallel. For the motion estimation kernel, the instruction count changes from V4 to V8 because the assembly source was modified to handle two SAD calculations in parallel, but no further modification was necessary for V16. Vector length is 16 for V4 and 32 for both V8 and V16. With 16 vector lanes, V16 can process a fixed-length 32-element vector in fewer cycles, yielding a smaller cycle count.

Figure 4.8 plots speedup of the soft vector processor, comparing the RTL Model to the Ideal Nios Model results in Table 4.6.

Table 4.6: Performance measurements of the four performance models

                                            Ideal Nios   C2H       Vector Architecture Models
                                            Model        Model     V4        V8        V16
Fmax (MHz)
  Median Filtering                          203          147       117       108       111
  Motion Estimation                         203          121       119       108       105
  AES Encryption Round                      203          119       130       120       113
Dynamic Instruction Count
  Median Filtering (per pixel)              5,375        n/a       166       83        41
  Motion Estimation (entire search range)   2,481,344    n/a       113,888   64,185    64,185
  AES Encryption Round (per 128-bit block)  94           n/a       5.9       2.9       1.5
Clock Cycles (RTL Model)
  Median Filtering (per pixel)              5,375        2,101     952       484       277
  Motion Estimation (entire search range)   2,481,344    694,468   755,040   411,840   333,168
  AES Encryption Round (per 128-bit block)  94           25        29.6      14.8      7.4
Speedup (RTL Model)
  Median Filtering (per pixel)              1            2.5       3.3       5.9       10.6
  Motion Estimation (entire search range)   1            2.8       1.9       3.2       3.9
  AES Encryption Round (per 128-bit block)  1            2.9       2.1       3.8       7.1
Clock Cycles (Ideal Vector Model)
  Median Filtering (per pixel)              -            -         748       374       187
  Motion Estimation (entire search range)   -            -         608,480   359,337   229,449
  AES Encryption Round (per 128-bit block)  -            -         25        13        6
Speedup (Ideal Vector Model)
  Median Filtering (per pixel)              -            -         4.1       7.6       15.7
  Motion Estimation (entire search range)   -            -         2.4       3.7       5.6
  AES Encryption Round (per 128-bit block)  -            -         2.5       4.4       8.3

This is plotted against the number of ALMs used, normalized to the Nios II/s core. The 4-lane configurations for all three benchmarks have a 64-bit memory interface, while configurations with more vector lanes have a 128-bit memory interface. The 64-bit interface requires fewer ALMs, which explains the "kink" at the V8 configuration that is most visible in median filtering.

All three benchmarks show a relatively linear increase in performance relative to resource usage, with a dropoff at large numbers of vector lanes due to a decrease in attainable clock speed.

The three curves have noticeably different slopes. The vectorized median filtering and AES encryption kernels efficiently exploit outer-loop parallelism, increasing the actual vector length of the calculations proportionally to the number of vector lanes.



Figure 4.8: RTL Model speedup over Ideal Nios Model shown in linear scale (a), and log-log scale (b). The dashed line shows one to one speedup versus area. The data points for median filtering and AES encryption are for 4, 8, 16, and 32 vector lanes, while the data points for motion estimation are for 4, 8, 16 vector lanes.

As a result, these two benchmarks achieve good speedup and scalability. Motion estimation has a shallower slope due to the extra instruction overhead of processing two offset blocks in the V8 processor configuration, and not being able to use the full vector length of the V16 processor configuration. The shortened vector length causes data-dependent instructions to stall the pipeline due to the pipeline length and lack of forwarding multiplexers.

The V8 configuration for the AES encryption kernel is noticeably smaller due to the simpler memory interface which supports only 32-bit vector memory transfers. This illustrates the large overhead of the vector memory interface. As the number of vector lanes increases to 16 and 32,

AES quickly catches up in ALM usage because it uses a 32-bit ALU in the datapath.

4.4.3 Ideal Vector Model

The clock cycle count and speedup of the Ideal Vector Model on the three benchmarks are again shown in Table 4.6. The instruction count of the Ideal Vector Model is the same as the RTL


Figure 4.9: Ideal Vector Model and C2H Model speedup over Ideal Nios Model shown in linear scale (a), and log-log scale (b). The dashed line shows one to one speedup versus area. The solid grey points show performance of “push-button” C2H accelerators from direct compilation with no optimizations. The soft vector processor data points for median filtering and AES encryption are for 4, 8, 16, and 32 vector lanes, while the data points for motion estimation are for 4, 8, 16 vector lanes.

Model. The bold/dark lines in Figure 4.9 plot speedup of the Ideal Vector Model over the Ideal

Nios Model against FPGA resource usage in number of ALMs, normalized to the Nios II/s core.

The Ideal Vector Model achieves greater speedup than the RTL Model largely due to single-cycle scalar instructions. The AES kernel, which has only 2 scalar instructions in the inner loop, performs similarly under the Ideal Vector Model and the RTL Model. The performance of the soft vector processor in the median filter kernel actually exceeds the diagonal for configurations above 8 vector lanes, illustrating the effectiveness of vector processing over MIMD multiprocessing in this type of data-parallel application. Superlinear speedup is possible because the soft vector processor configuration for median filter uses only 16-bit ALUs, while the baseline

Nios II/s core is a 32-bit processor. This also illustrates the effectiveness of customizing the soft vector processor to the application. The “kink” in the V8 configuration for median filtering and motion estimation is again due to the 4-lane configurations having only a 64-bit memory interface, while the 8-lane configurations have a 128-bit memory interface.


4.4.4 C2H Accelerator Results

The C2H accelerator performance numbers are also summarized in Table 4.6, measured by the same clock cycle count and speedup over the Ideal Nios Model. These numbers represent results achievable by "push-button" acceleration with the compiler, with no modification to the original C code.⁴ These results are shown by the solid grey dots in Figure 4.9. Performance of the "push-button" C2H accelerators is similar to that of the V4 processor configurations, but does not match that of the larger vector processor configurations.

The thin/gray lines in Figure 4.9 show the performance improvement trend when the C2H accelerators are scaled and optimized as described in Section 4.2.3. Applying vector processing concepts and loop unrolling makes it possible to increase C2H performance at the expense of increased resources.

Although performance increases, notice that it also saturates in all C2H accelerators as memory access becomes the bottleneck. This occurs in the median filter kernel when the loop is unrolled 64 times to find 64 medians in parallel, because each iteration requires 64 new data items to be read and 64 new values to be written back. For motion estimation, comparing a single pixel to 32 copies of the window and reading from 32 possible horizontal locations creates a memory bottleneck.

For AES encryption, memory saturation was initially avoided by adding 4 memory blocks for the 256-entry, 32-bit lookup tables in the Nios II system as a hardware-level optimization.

This single change achieved a large speedup over the “push-button” result, indicated by the large jump of the second data point on the curve. Further replicating the AES engine up to four times created contention at the lookup table memories. Resolving this bottleneck would require a dedicated copy of all tables for each engine. In all three benchmarks, additional performance from C2H accelerators is possible by making changes to the Nios II memory system.

⁴ Stratix II was used for the C2H estimates because Quartus II does not yet support C2H hardware accelerators for Stratix III-based designs. We expect results to be similar to Stratix II due to the similar ALM architecture.

4.4.5 Vector versus C2H

Figure 4.9 illustrated the difference in scalability of the soft vector processor and C2H accelerators. By scaling the number of vector lanes, the soft vector processor simultaneously scales the vector register file bandwidth. Since most operations inside the inner loops of kernels read from and write to vector registers, increasing the register file bandwidth allows more data elements to be accessed and processed in parallel each cycle, increasing performance. The vector programming model abstracts the scalable architecture, providing the same unified memory system to the user across different configurations. This abstraction also allows the user to take advantage of scalable performance with zero hardware design effort.

It is also possible to scale bandwidth to internal register states of the C2H accelerators, and in fact that was already done by manually unrolling loops and declaring additional variables.

However, the C2H accelerators are not able to efficiently gather data from main memory when the bandwidth of the internal state registers is increased, leading to performance saturation. The flexible vector memory interface of the soft vector processor, by contrast, does accomplish this.

The soft vector processor is an application-independent accelerator, where a single configuration of the processor could have been used to accelerate all three benchmarks (albeit at slightly lower performance), or multiple portions of the application. Synthesized hardware accelerator tools will require separate accelerators for different sections.

As a rough comparison of effort, it took approximately 3 days to learn to use the C2H compiler, modify the three benchmark kernels so they compile, and apply the simple loop unrolling software optimization. It took another full day to apply the single hardware optimization for the AES benchmark of adding the additional memories. With the vector processor, it took 2 days to design the initial vectorized algorithms and assembly code for all three kernels. The rich instruction set of the processor allows many possible ways to vectorize an algorithm to tune performance using software skills. The vector assembly code of the AES encryption and motion estimation kernels has each been rewritten once using a different vectorization method during the course of software development. Each iteration of designing and coding the

vector assembly took less than half a day.

Chapter 5

Conclusions and Future Work

As the performance requirements and complexity of embedded systems continue to increase, designers need a high performance platform that reduces development effort and time-to-market.

This thesis applies vector processing to soft processor systems on FPGAs to improve their performance in data-parallel embedded workloads. Previous work in vector microprocessors

[6, 12] has demonstrated the feasibility and advantages of accelerating embedded applications with vector processing. The soft vector processor provides the same advantages, with the added benefit of soft processor configurability. Compared to other soft processor accelerator solutions, a soft vector processor allows application acceleration with purely software development effort and requires no hardware knowledge. The application-independent architecture helps separate hardware and software development early in the design flow.

The soft vector processor leverages the configurability of soft processors and FPGAs to provide user-selectable performance and resource tradeoff, and application-specific tuning of the instruction set and processor features. It incorporates a range of configurable parameters that can significantly impact the area and performance of the processor, the most important of which are NLane (the number of vector lanes; also determines the maximum vector length supported by the processor), and VPUW (the vector lane ALU width). The processor also introduces a flexible and configurable vector memory interface to a single memory bank that supports unit stride, constant stride, and indexed vector memory accesses. The width of the memory interface is configurable, and the interface can be configured to only support a minimum data width of 8-bit bytes, 16-bit halfwords, or 32-bit words for vector memory accesses through the MemMinWidth parameter. Setting the minimum data width to 32-bit words, for example,

removes logic for accessing smaller data widths, reducing the size of the memory crossbars and resulting in a large savings in area. This additional level of configurability enlarges the design space, allowing the designer to make larger tradeoffs than other soft processor solutions.

The soft vector processor is shown in this thesis to outperform an ideal Nios II and the

Altera C2H compiler. Functional simulation of the RTL model with a multicycle scalar unit achieved speedup of up to 11× for 16 vector lanes over an ideal Nios II processor model. The ideal vector model that incorporates a fully pipelined scalar unit estimates speedup of up to 24× for 32 vector lanes, versus up to 8× for C2H accelerators. By customizing the soft processor to the benchmarks, area savings of up to 33% was achieved for 16 vector lanes compared to a generic configuration of the processor that supported the full range of features.

The instruction set of the soft vector processor is adapted from the VIRAM instruction set. The architecture is tailored to the Stratix III FPGA, and introduces novel features to take advantage of the embedded memory blocks and hardwired DSP blocks. The datapath uses a partitioned register file to reduce complexity of the vector register file, and executes instructions in a novel hybrid vector-SIMD fashion to speed instruction execution. MAC units are added for accumulating multi-dimensional data and vector sum reduction. Together with new vector adjacent element shift and vector absolute difference instructions, they significantly reduce the execution time of the motion estimation kernel over an implementation with the VIRAM instruction set. A vector lane local memory is also introduced for table lookup operations within a vector lane, and is used by the AES encryption kernel.

A soft vector processor has advantages over many currently available soft processor acceler- ator tools, and is most suitable when rapid development time is required, or when a hardware designer is not available, or when several different applications must share a single accelerator.

5.1 Future Work

This thesis has described a prototype of a soft vector processor targeted to FPGAs, and illustrated its advantages and potential performance. The implementation details of the soft vector processor prototype can be optimized and improved in many ways. However, the next sections will focus on possible major improvements to the processor architecture that can potentially have a significant impact on performance or area usage.

Pipelined and Multiple-Issue Scalar Unit

Performance of a vector processor depends significantly on the performance of the scalar unit on non-vectorizable portions of the application due to Amdahl's Law. Implementing a fully-pipelined scalar unit in the soft vector processor will improve the performance of the current processor prototype tremendously, as shown by the results in this thesis, and would be the logical next step in improving the processor.

Even with a fully-pipelined scalar unit, the soft vector processor would still be limited by the single instruction fetch from sharing a common fetch stage between the scalar and vector units. As the scalar and vector units already have separate decode logic, a multiple instruction fetch and issue core can be considered to further improve pipeline utilization. To avoid the complexity of superscalar processors, the compiler can statically arrange the code (similar to

VLIW techniques) such that vector instructions always enter the vector issue pipeline, and scalar instructions always enter the scalar pipeline.

Addition of Caches

Modern processors rely on multiple levels of cache to bridge the gap between processor clock speed and speed of external memory. Instruction and data caches can be implemented for the scalar unit of the soft vector processor in the same manner as other RISC processor cores.

Synchronization mechanisms will be required to keep the data cache of the scalar unit in sync

with vector memory accesses. Traditional vector processors like the NEC SX-7 [69] do not employ data caches for the vector core, and instead hide memory latency by overlapping vector execution and memory fetching. The soft vector processor is similar in this respect, and the fast on-chip SRAMs of FPGAs additionally provide low latency memory access (if on-chip SRAM is used as main data memory).

Integration as Nios II Custom Instruction

Nios II custom instructions with internal register file allow a single accelerator to execute up to

256 different instructions. This type of accelerator is tightly coupled to the Nios II processor, and can read operands directly from the Nios II register file. The soft vector processor can potentially be implemented as such a custom instruction, which would allow it to integrate with a fully-pipelined Nios II processor. The 256 custom instruction limit would require the processor features and instruction set to be stripped down to reduce the number of instructions and variations. The opcode formats will also have to be significantly altered. This integration was not considered initially because it was unclear how many vector instructions the soft vector processor would implement.

Permutation & Shuffle Network

One of the underlying assumptions of the vector programming model is that different data elements within a single vector are independent. However, this does not hold true in all applications, and it is sometimes necessary to manipulate a vector to reorder data elements within a vector register. The butterfly structure in the fast Fourier transform (FFT) is one prime example. To address this need, microprocessor SIMD extensions frequently provide a shuffle instruction that allows arbitrary shuffling of subwords, which is akin to shuffling of data elements within a vector. The vector manipulation instructions in this soft vector processor provide limited support for reordering data elements. VIRAM provides vhalfup and vhalfdn instructions to support FFT butterfly transformations, but no support for general shuffling.

A general permutation or shuffle network can be implemented to support these operations, or an application-specific permutation network can be devised and configured with the rest of the soft processor. Several efficient software permutation networks and instructions for cryptography are considered in [70], and many of the ideas can be applied to the soft vector processor. Another alternative is to implement arbitrary shuffling using the memory crossbars as in [23]. The vector insert and vector extract operations in the soft vector processor are already implemented this way.
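For reference, the functional behaviour of a general shuffle is just a per-element permutation, as in the small C model below; how the permutation network itself would be built is a separate question.

/* Functional model of a general vector shuffle: dst[i] = src[perm[i]].
 * Hardware would evaluate all elements in parallel through a permutation
 * network or the memory crossbars. */
void vshuffle_model(int *dst, const int *src, const unsigned *perm, int vl)
{
    for (int i = 0; i < vl; i++)
        dst[i] = src[perm[i]];
}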

Improvements to Configurable Memory Interface

The large overhead of the vector memory interface has been noted throughout this thesis.

Configurability was introduced into the vector memory interface to reduce its resource usage by removing logic for memory access patterns not used in the benchmark kernel. Much of the complexity stemmed from supporting complex access patterns on a single bank of memory.

Implementing a multi-banked and interleaved memory system like [23] may prove to be simpler and more scalable as data alignment problems would be reduced. For example, it will no longer be necessary to align data across memory words using a selectable delay network. In the extreme, the memory interface can also be extended to multiple off-chip memory banks to increase memory bandwidth.

Stream processors like [15] can describe more complex memory access patterns using stream descriptors. The key additions are two parameters for memory access in addition to stride: span, and skip. Span describes how many elements to access before applying the second level skip offset. A streaming memory interface can describe complex data accesses with fewer instructions, and the streaming nature of data potentially increases the amount of parallelism that can be exploited for acceleration, improving performance.
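A sketch of the two-level access pattern such a descriptor would generate is shown below, assuming the reading of span and skip given above; parameter names are illustrative and are not the RSVP definitions.

#include <stdio.h>

/* Generate 'count' element addresses starting at 'base'.  Elements are
 * 'stride' apart; after every 'span' elements an extra 'skip' offset is
 * applied before continuing. */
static void stream_addresses(unsigned base, unsigned stride,
                             unsigned span, unsigned skip, unsigned count)
{
    unsigned addr = base;
    for (unsigned i = 0; i < count; i++) {
        printf("%u\n", addr);
        addr += stride;
        if ((i + 1) % span == 0)
            addr += skip;          /* second-level offset */
    }
}

int main(void)
{
    /* e.g. fetch 4-element rows out of a wider 2-D array: stride 1 within
       a row (span 4), then skip over the remaining 12 columns of the row */
    stream_addresses(0, 1, 4, 12, 16);
    return 0;
}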


Adopting Architectural Features from Other Vector Processors

More advanced vector architectures that depart from traditional vector architecture have been proposed recently, and it would be instructive to evaluate these new architectural features for their suitability on FPGAs. CODE [6] is a clustered vector processor microarchitecture that supports the VIRAM ISA. Each cluster contains a single vector functional unit supporting one class of vector instructions, and a small vector register file that stores a small number of short vectors. The processor uses register renaming to issue instructions to different clusters, and to track architectural vector registers. The clusters communicate through an on-chip network for vector operation chaining. In general, CODE outperforms VIRAM by 21% to 42%, depending on the number of lanes within each microarchitecture, but underperforms VIRAM in benchmarks with many inter-cluster transfers [6].

The vector-thread (VT) architecture [17] combines multi-threaded execution with vector execution, allowing the processor to take advantage of both data-level and thread-level parallelism, and even instruction-level parallelism within functional units. The architecture consists of a control processor, and a number of virtual processors in vector lanes. These virtual processors can execute a common code block to simulate SIMD execution as in traditional vector processors, or can each branch from the main execution path to execute its own stream of instructions. The SCALE instantiation of the VT architecture also incorporates the clustering idea of CODE to form several execution clusters with different functionality.

Both of the above architectures break up the traditionally large, centralized register file and lock-step functional units into smaller independent clusters. The smaller cluster register

files are actually more suited to the small embedded memory blocks of FPGAs. It would be interesting to investigate whether the rest of the architecture can also be mapped efficiently to an FPGA.

Bibliography

[1] Altera Corp., “Stratix III device handbook, volume 1,” Nov. 2007.

[2] R. Nass, “Annual study uncovers the embedded market,” Embedded Systems Design, vol. 20, no. 9, pp. 14–16, Sep. 2007.

[3] Nios II. [Online]. Available: http://www.altera.com/products/ip/processors/nios2/ ni2-index.html

[4] Xilinx, Inc. MicroBlaze. [Online]. Available: http://www.xilinx.com/

[5] C. Kozyrakis, "Scalable vector media processors for embedded systems," Ph.D. dissertation, University of California at Berkeley, May 2002, Technical Report UCB-CSD-02-1183.

[6] C. Kozyrakis and D. Patterson, “Scalable, vector processors for embedded systems,” IEEE Micro, vol. 23, no. 6, pp. 36–45, Nov./Dec. 2003.

[7] S. Habata, M. Yokokawa, and S. Kitawaki, “The earth simulator system,” NEC Research & Development Journal, vol. 44, no. 1, pp. 21–16, Jan. 2003.

[8] Cray Inc. Cray assembly language (CAL) for Cray X1. [Online]. Available: http://docs.cray.com/books/S-2314-51/html-S-2314-51/S-2314-51-toc.html

[9] C. G. Lee and M. G. Stoodley, “Simple vector microprocessors for multimedia applica- tions,” in Annual ACM/IEEE Int. Symp. on Microarchitecture, Nov. 1998, pp. 25–36.

[10] D. Talla and L. John, “Cost-effective hardware acceleration of multimedia applications,” in Int. Conf. on Computer Design, Sep. 2001, pp. 415–424.

[11] J. Gebis and D. Patterson, “Embracing and extending 20th-century instruction set archi- tectures,” IEEE Computer, vol. 40, no. 4, pp. 68–75, Apr. 2007.

[12] K. Asanovic, “Vector microprocessors,” Ph.D. dissertation, EECS Department, University of California, Berkeley, 1998.

[13] The embedded microprocessor benchmark consortium. [Online]. Available: http: //www.eembc.org/

[14] C. Kozyrakis and D. Patterson, "Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks," in IEEE/ACM Int. Symp. on Microarchitecture, 2002, pp. 283–293.


[15] S. Ciricescu, R. Essick, B. Lucas, P. May, K. Moat, J. Norris, M. Schuette, and A. Saidi, "The reconfigurable streaming vector processor (RSVP™)," in IEEE/ACM Int. Symp. on Microarchitecture, 2003, pp. 141–150.

[16] C. Kozyrakis and D. Patterson, "Overcoming the limitations of conventional vector processors," in Int. Symp. on Computer Architecture, Jun. 2003, pp. 399–409.

[17] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic, “The vector-thread architecture,” in Int. Symp. on Computer Architecture, Jun. 2004, pp. 52–63.

[18] M. Z. Hasan and S. G. Ziavras, “FPGA-based vector processing for solving sparse sets of equations,” in IEEE Symp. on Field-Programmable Custom Computing Machines, Apr. 2005, pp. 331–332.

[19] J. C. Alves, A. Puga, L. Corte-Real, and J. S. Matos, “A vector architecture for higher-order moments estimation,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, Munich, Apr. 1997, pp. 4145–4148.

[20] J. Alves and J. Matos, “RVC-a reconfigurable coprocessor for vector processing appli- cations,” in IEEE Symp. on FPGAs for Custom Computing Machines, Apr. 1998, pp. 258–259.

[21] J. Casper, R. Krashinsky, C. Batten, and K. Asanovic, "A parameterizable FPGA prototype of a vector-thread processor," Workshop on Architecture Research using FPGA Platforms, as part of HPCA-11, 2005. www.cag.csail.mit.edu/warfp2005/slides/casper-warfp2005.ppt

[22] H. Yang, S. Wang, S. G. Ziavras, and J. Hu, “Vector processing support for FPGA-oriented high performance applications,” in Int. Symp. on VLSI, Mar. 2007, pp. 447–448.

[23] J. Cho, H. Chang, and W. Sung, “An FPGA based SIMD processor with a vector memory unit,” in Int. Symp. on Circuits and Systems, May 2006, pp. 525–528.

[24] S. Chen, R. Venkatesan, and P. Gillard, "Implementation of vector floating-point processing unit on FPGAs for high performance computing," in IEEE Canadian Conference on Electrical and Computer Engineering, May 2008, pp. 881–885.

[25] A. C. Jacob, B. Harris, J. Buhler, R. Chamberlain, and Y. H. Cho, “Scalable softcore vector processor for biosequence applications,” in IEEE Symp. on Field-Programmable Custom Computing Machines, Apr. 2006, pp. 295–296.

[26] R. Hoare, S. Tung, and K. Werger, “An 88-way multiprocessor within an FPGA with customizable instructions,” in Int. Parallel and Distributed Processing Symp., Apr. 2004.

[27] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fourth Edition: A Quantita- tive Approach. San Francisco, CA: Morgan Kaufmann Publishers Inc., 2006.

[28] X. Wang and S. G. Ziavras, "Performance optimization of an FPGA-based configurable multiprocessor for matrix operations," in IEEE Int. Conf. on Field-Programmable Technology, Dec. 2003, pp. 303–306.

[29] K. Ravindran, N. Satish, Y. Jin, and K. Keutzer, "An FPGA-based soft multiprocessor system for IPv4 packet forwarding," in Int. Conf. on Field Programmable Logic and Applications, Aug. 2005, pp. 487–492.

[30] S. Borgio, D. Bosisio, F. Ferrandi, M. Monchiero, M. D. Santambrogio, D. Sciuto, and A. Tumeo, “Hardware DWT accelerator for multiprocessor system-on-chip on FPGA,” in Int. Conf. on Embedded Computer Systems: Architectures, Modeling and Simulation, Jul. 2006, pp. 107–114.

[31] A. Kulmala, O. Lehtoranta, T. D. Hmlinen, and M. Hnnikinen, “Scalable MPEG-4 encoder on FPGA multiprocessor SOC,” EURASIP Journal on Embedded Systems, vol. 2006, pp. 1–15.

[32] M. Martina, A. Molino, and F. Vacca, “FPGA system-on-chip soft IP design: a reconfig- urable DSP,” in Midwest Symp. on Circuits and Systems, vol. 3, Aug. 2002, pp. 196–199.

[33] M. Collin, R. Haukilahti, M. Nikitovic, and J. Adomat, “SoCrates - a multiprocessor SoC in 40 days,” in Conf. on Design, Automation & Test in Europe, Munich, Germany, Mar. 2001.

[34] Altera Corp. (2008) Nios II hardware acceleration. [Online]. Available: http://www.altera. com/products/ip/processors/nios2/benefits/performance/ni2-acceleration.html

[35] S. A. Edwards, "The challenges of synthesizing hardware from C-like languages," IEEE Design & Test of Computers, vol. 23, no. 5, pp. 375–386, May 2006.

[36] HCS Lab. (2005, Dec.) University of florida high performance computing and simulation research centre. [Online]. Available: http://docs.hcs.ufl.edu/xd1/app mappers

[37] Altera Corp. (2007, May) Avalon memory-mapped interface specification ver 3.3. [Online]. Available: http://www.altera.com/literature/manual/mnl avalon spec.pdf

[38] ——. (2007, Oct.) Nios II C2H compiler user guide ver 1.3. [Online]. Available: http://www.altera.com/literature/ug/ug nios2 c2h compiler.pdf

[39] D. Bennett, “An FPGA-oriented target language for HLL compilation,” Reconfigurable Systems Summer Institute, Jul. 2006.

[40] Mentor Graphics Corp. Catapult synthesis. [Online]. Available: http://www.mentor.com/ products/esl/high level synthesis/catapult synthesis/index.cfm

[41] F. Baray, H. Michel, P. Urard, and A. Takach, “C synthesis methodology for implementing DSP algorithms,” Global Signal Processing Conf. & Expos for the Industry, 2004.


[42] DSPLogic, Inc. Reconfigurable computing toolbox r2.0. [Online]. Available: http: //www.dsplogic.com/home/products/rctb

[43] Xilinx, Inc. System generator for DSP. [Online]. Available: http://www.xilinx.com/ise/ optional prod/system generator.htm

[44] The Mathworks, Inc. Simulink HDL coder. [Online]. Available: http://www.mathworks. com/products/slhdlcoder/

[45] ——. Embedded MATLAB. [Online]. Available: http://www.mathworks.com/products/ featured/embeddedmatlab/index.html

[46] Starbridge Systems Inc. Viva makes programming FPGA environments absurdly easy. [Online]. Available: http://www.starbridgesystems.com/

[47] ARC international. [Online]. Available: http://www.arc.com/configurablecores/index. html

[48] Tensilica Inc. Xtensa configurable processors. [Online]. Available: http://www.tensilica. com/products/xtensa overview.htm

[49] ——. Tensilica - XPRES compiler - optimized hardware directly from C. [Online]. Available: http://www.tensilica.com/products/xpres.htm

[50] Cascade, Critical Blue. [Online]. Available: http://www.criticalblue.com

[51] Binachip, Inc. BINACHIP-FPGA. [Online]. Available: http://www.binachip.com/ products.htm

[52] G. Stitt and F. Vahid, “Hardware/software partitioning of software binaries,” in IEEE/ACM Int. Conf. on Computer Aided Design, Nov. 2002, pp. 164–170.

[53] Mitrionics, Inc. The mitrion virtual processor. [Online]. Available: http://www.mitrionics. com/default.asp?pId=22

[54] C. Grabbe, M. Bednara, J. von zur Gathen, J. Shokrollahi, and J. Teich, "A high performance VLIW processor for finite field arithmetic," in Int. Parallel and Distributed Processing Symp., Apr. 2003, p. 189b.

[55] V. Brost, F. Yang, and M. Paindavoine, “A modular VLIW processor,” in IEEE Int. Symp. on Circuits and Systems, May 2007, pp. 3968–3971.

[56] A. K. Jones, R. Hoare, D. Kusic, J. Fazekas, and J. Foster, “An FPGA-based VLIW proces- sor with custom hardware execution,” in ACM/SIGDA Int. Symp. on Field-programmable gate arrays, Feb. 2005, pp. 107–117.

[57] M. A. R. Saghir, M. El-Majzoub, and P. Akl, “Datapath and ISA customization for soft VLIW processors,” in IEEE Int. Conf. on Reconfigurable Computing and FPGA’s, Sep. 2006, pp. 1–10.

[58] M. Nelissen, K. van Berkel, and S. Sawitzki, "Mapping a VLIWxSIMD processor on an FPGA: Scalability and performance," in Int. Conf. on Field Programmable Logic and Applications, Aug. 2007, pp. 521–524.

[59] K. van Berkel, F. Heinle, P. P. E. Meuwissen, K. Moerman, and M. Weiss, "Vector processing as an enabler for software-defined radio in handheld devices," EURASIP Journal on Applied Signal Processing, vol. 16, no. 16, pp. 2613–2625, 2005.

[60] R. E. Wunderlich and J. C. Hoe, “In-system FPGA prototyping of an Itanium microarchi- tecture,” in IEEE Int. Conf. on Computer Design: VLSI in Computers and Processors, Oct. 2004, pp. 288–294.

[61] Int. Symp. on High-Performance Computer Architecture, “Workshop on architecture re- search using FPGA platforms,” San Francisco, 2005.

[62] S.-L. L. Lu, P. Yiannacouras, R. Kassa, M. Konow, and T. Suh, "An FPGA-based Pentium® in a complete desktop system," in ACM/SIGDA Int. Symp. on Field programmable gate arrays, Feb. 2007, pp. 53–59.

[63] J. Ray and J. C. Hoe, “High-level modeling and fpga prototyping of microprocessors,” in ACM/SIGDA Int. Symp. on Field programmable gate arrays, Feb. 2003, pp. 100–107.

[64] B. Fort, D. Capalija, Z. G. Vranesic, and S. D. Brown, “A multithreaded soft processor for SoPC area reduction,” in IEEE Symp. on Field-Programmable Custom Computing Machines, Apr. 2006, pp. 131–142.

[65] Y.-W. Huang, C.-Y. Chen, C.-H. Tsai, C.-F. Shen, and L.-G. Chen, “Survey on block matching motion estimation algorithms and architectures with new results,” J. VLSI Signal Process. Syst., vol. 42, no. 3, pp. 297–320, 2006.

[66] “Specification for the advanced encryption standard (AES),” Federal Information Process- ing Standards Publication 197, 2001.

[67] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown, “Mibench: A free, commercially representative embedded benchmark suite,” in IEEE Int. Workshop on Workload Characterization, Dec. 2001, pp. 3–14.

[68] J. Daemen and V. Rijmen, The design of Rijndael: AES — the Advanced Encryption Standard. Springer-Verlag, 2002.

[69] K. Kitagawa, S. Tagaya, Y. Hagihara, and Y. Kanoh, “A hardware overview of SX-6 and SX-7 ,” NEC Research & Development Journal, vol. 44:1, pp. 2–7, January 2003.

[70] R. B. Lee, Z. Shi, and X. Yang, “Efficient permutation instructions for fast software cryp- tography,” IEEE Micro, vol. 21, no. 6, pp. 56–69, Nov./Dec. 2001.


Appendix A

Soft Vector Processor ISA

A.1 Introduction

A vector processor is a single-instruction-multiple-data (SIMD) array of virtual processors

(VPs). The number of VPs is the same as the vector length (VL). All VPs execute the same operation specified by a single vector instruction. Physically, the VPs are grouped in parallel datapaths called vector lanes, each containing a section of the vector register file and a complete copy of all functional units.

This vector architecture is defined as a co-processor unit to the Altera Nios II soft processor.

The ISA is designed with the Altera Stratix III family of FPGAs in mind. The architecture of the Stratix III FPGA drove many of the design decisions such as number of vector registers and the supported DSP features.

The instruction set in this ISA borrows heavily from the VIRAM instruction set, which is designed as vector extensions to the MIPS-IV instruction set. A subset of the VIRAM instruction set is adopted, complemented by several new instructions to support new features introduced in this ISA.

Differences of this ISA from the VIRAM ISA are:

• increased number of vector registers,

• different instruction encoding,

• configurable processor parameters,

• sequential memory consistency instead of VP-consistency,

• no barrier instructions to order memory accesses,


• new multiply-accumulate (MAC) units and associated instructions (vmac, vccacc, vcczacc),

• new vector lane local memory and associated instructions (vldl, vstl),

• new adjacent element shift instruction (vupshift),

• new vector absolute difference instruction (vabsdiff),

• no support for floating point arithmetic,

• fixed-point arithmetic not yet implemented, but defined as a future extension,

• no support for virtual memory or exceptions.

A.1.1 Configurable Architecture

This ISA specifies a set of features for an entire family of soft vector processors with varying performance and resource utilization. The ISA is intended to be implemented by a CPU generator, which would generate an instance of the processor based on a number of user-selectable configuration parameters. An implementation or instance of the architecture is not required to support all features of the specification. Table A.1 lists the configurable parameters and their descriptions, as well as typical values. These parameters will be referred to throughout the specification.

NLane and MVL are the primary determinants of performance of the processor. They control the number of parallel vector lanes and functional units that are available in the processor, and the maximum length of vectors that can be stored in the vector register file. MVL will generally be a multiple of NLane. The minimum vector length should be at least 16. VPUW and MemMinWidth control the width of the VPs and the minimum data width that can be accessed by vector memory instructions. These two parameters have a significant impact on the resource utilization of the processor. The remaining parameters are used to enable or disable optional features of the processor.
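The parameter set lends itself to a simple, generator-side description. The C sketch below captures the parameters of Table A.1 as a structure, populated with the V4-AES settings from Table 4.1; the structure and field names are illustrative, not the actual CPU generator interface.

#include <stdbool.h>

/* Illustrative container for the configurable parameters of Table A.1 */
typedef struct {
    unsigned nlane;          /* number of vector lanes              */
    unsigned mvl;            /* maximum vector length               */
    unsigned vpuw;           /* processor data width (bits)         */
    unsigned mem_width;      /* memory interface width (bits)       */
    unsigned mem_min_width;  /* minimum accessible data width       */
    unsigned macl;           /* MAC chain length (0 = no MAC)       */
    unsigned lmem_n;         /* local memory number of words        */
    bool     lmem_share;     /* share local memory within a lane    */
    bool     vmult;          /* vector lane hardware multiplier     */
    bool     vupshift;       /* vector adjacent element shifting    */
    bool     vmanip;         /* vector insert/extract instructions  */
} svp_config;

/* V4-AES configuration from Table 4.1 */
static const svp_config v4_aes = {
    .nlane = 4, .mvl = 16, .vpuw = 32,
    .mem_width = 64, .mem_min_width = 32,
    .macl = 0, .lmem_n = 256, .lmem_share = true,
    .vmult = true, .vupshift = false, .vmanip = false,
};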

Table A.1: List of configurable processor parameters

Parameter     Description                                                 Typical
NLane         Number of vector lanes                                      4–128
MVL           Maximum vector length                                       16–512
VPUW          Processor data width (bits)                                 8, 16, 32
MemWidth      Memory interface width (bits)                               32, 64, 128
MemMinWidth   Minimum accessible data width in memory                     8, 16, 32
MACL          MAC chain length (0 is no MAC)                              0, 1, 2, 4
LMemN         Local memory number of words                                0–1024
LMemShare     Shared local memory address space within lane               On/Off
Vmult         Vector lane hardware multiplier                             On/Off
Vupshift      Vector adjacent element shifting                            On/Off
Vmanip        Vector manipulation instructions (vector insert/extract)    On/Off

A.1.2 Memory Consistency

The memory consistency model used in this processor is sequential consistency. The order of vector and scalar memory instructions is preserved according to program order. There is no guarantee of ordering between VPs during a vector indexed store, unless an ordered indexed store instruction is used, in which case the VPs access memory in order starting from the lowest vector element.

A.2 Vector Register Set

The following sections describe the register states in the soft vector processor. Control registers and distributed accumulators will also be described.

A.2.1 Vector Registers

The architecture defines 64 vector registers directly addressable from the instruction opcode.

Vector register zero (vr0) is fixed at 0 for all elements.


Table A.2: List of vector flag registers

Hardware Name   Software Name   Contents
$vf0            vfmask0         Primary mask; set to 1 to disable VP operation
$vf1            vfmask1         Secondary mask; set to 1 to disable VP operation
$vf2            vfgr0           General purpose
...             ...             ...
$vf15           vfgr13          General purpose
$vf16                           Integer overflow
$vf17                           Fixed-point saturate
$vf18                           Unused
...                             ...
$vf29                           Unused
$vf30           vfzero          All zeros
$vf31           vfone           All ones

A.2.2 Vector Scalar Registers

Vector scalar registers are located in the scalar core of the vector processor. As this architecture targets a Nios II scalar core, the scalar registers are defined by the Nios II ISA. The ISA defines thirty-two 32-bit scalar registers. Vector-scalar instructions and certain memory operations require a vector register and a scalar register operand. Vector scalar register values can also be transferred to and from vector registers or vector control registers using the vext.vs, vins.vs, vmstc, vmcts instructions.

A.2.3 Vector Flag Registers

The architecture defines 32 vector flag registers. The flag registers are written to by comparison instructions and are operated on by flag logical instructions. Almost all instructions in the instruction set support conditional execution using one of two vector masks, specified by a mask bit in most instruction opcodes. The vector masks are stored in the first two vector flag registers. Writing a value of 1 into a VP’s mask register will cause the VP to be disabled for operations that specify the mask register. Table A.2 shows a complete list of flag registers.
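A behavioural reading of the masking rule is that an element participates only when its mask bit is 0. The C sketch below models a masked vector add under that convention; it is a software model only, with illustrative names.

/* Masked vector add: mask[i] == 1 disables virtual processor i,
 * leaving the corresponding destination element unchanged. */
void vadd_masked_model(int *dst, const int *a, const int *b,
                       const unsigned char *mask, int vl)
{
    for (int i = 0; i < vl; i++)
        if (mask[i] == 0)
            dst[i] = a[i] + b[i];
}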

Table A.3: List of control registers

Hardware Name   Software Name   Description
$vc0            VL              Vector length
$vc1            VPUW            Virtual processor width
$vc2            vindex          Element index for insert (vins) and extract (vext)
$vc3            vshamt          Fixed-point shift amount
...             ...             ...
$vc28           ACCncopy        Number of vccacc/vcczacc to sum reduce MVL vector
$vc29           NLane           Number of vector lanes
$vc30           MVL             Maximum vector length
$vc31           logMVL          Base 2 of MVL
$vc32           vstride0        Stride register 0
...             ...             ...
$vc39           vstride7        Stride register 7
$vc40           vinc0           Auto-increment register 0
...             ...             ...
$vc47           vinc7           Auto-increment register 7
$vc48           vbase0          Base register 0
...             ...             ...
$vc63           vbase15         Base register 15

A.2.4 Vector Control Registers

Table A.3 lists the vector control registers in the soft vector processor. The registers in italics hold a static value that is initialized at compile time, and is determined by the configuration parameters of the specific instance of the architecture.

The vindex control register holds the vector element index that controls the operation of vector insert and extract instructions. The register is writeable. For vector-scalar insert/extract, vindex specifies which data element within the vector register will be written to/read from by the scalar core. For vector-vector insert/extract, vindex specifies the index of the starting data element for the vector insert/extract operation.

The ACCncopy control register specifies how many times the copy-from-accumulator instructions (vccacc, vcczacc) need to be executed to sum-reduce an entire MVL vector. If the value

is not one, multiple multiply-accumulate and copy-from-accumulator instructions will be needed to reduce an MVL vector. Its usage will be discussed in more detail in Section A.2.5.

A.2.5 Multiply-Accumulators for Vector Sum Reduction

The architecture defines distributed MAC units for multiplying and sum reducing vectors. The

MAC units are distributed across the vector lanes, and the number of MAC units can vary across implementations. The vmac instruction multiplies two inputs and accumulates the result into accumulators within the MAC units. The vcczacc instruction sum reduces the MAC unit accumulator contents, copies the final result to element zero of a vector register, and zeros the accumulators. Together, the two instructions vmac and vcczacc perform a multiply and sum reduce operation. Multiple vectors can be accumulated and sum reduced by executing vmac multiple times. Since the MAC units sum multiplication products internally, they cannot be used for purposes other than multiply-accumulate-sum reduce operations.

Depending on the number of vector lanes, the vcczacc instruction may not be able to sum reduce all MAC unit accumulator contents. In such cases it will instead copy a partially sum-reduced result vector to the destination register. Figure A.1 shows how the MAC units generate a result vector and how the result vector is written to the vector register file. The MAC chain length is specified by the MACL parameter. The vcczacc instruction sets VL to the length of the partial result vector as a side effect, so the partial result vector can be again sum-reduced using the vmac, vcczacc sequence. The ACCncopy control register specifies how many times vcczacc needs to be executed (including the first) to reduce the entire MVL vector to a single result in the destination register.
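A behavioural C model of the vmac/vcczacc sequence may make the partial-reduction mechanics clearer. The lane count, MAC chain length, and the multiply-by-one feedback pass below are illustrative assumptions, not the RTL; in this model, ACCncopy for the chosen configuration works out to 3.

#include <stdio.h>

#define NLANE 8            /* illustrative lane count                 */
#define MACL  2            /* illustrative MAC chain length           */
#define MVL   32           /* maximum vector length                   */
#define ACC_NCOPY 3        /* vcczacc executions needed in this model */

static int acc[NLANE];     /* one accumulator per lane (model only)   */

/* vmac: element-wise multiply and accumulate into the lane accumulators */
static void vmac(const int *a, const int *b, int vl)
{
    for (int i = 0; i < vl; i++)
        acc[i % NLANE] += a[i] * b[i];
}

/* vcczacc: sum each chain of MACL accumulators, copy the partial result
 * vector out, zero the accumulators, and return the new vector length */
static int vcczacc(int *dst)
{
    int n = 0;
    for (int i = 0; i < NLANE; i += MACL) {
        int sum = 0;
        for (int j = 0; j < MACL; j++) { sum += acc[i + j]; acc[i + j] = 0; }
        dst[n++] = sum;
    }
    return n;
}

int main(void)
{
    int a[MVL], b[MVL], partial[NLANE], ones[NLANE];
    for (int i = 0; i < MVL; i++) { a[i] = i; b[i] = 2; }

    vmac(a, b, MVL);                       /* multiply and accumulate   */
    int vl = vcczacc(partial);             /* first partial reduction   */
    for (int pass = 1; pass < ACC_NCOPY; pass++) {
        for (int i = 0; i < vl; i++) ones[i] = 1;
        vmac(partial, ones, vl);           /* feed partial results back */
        vl = vcczacc(partial);
    }
    printf("dot product = %d\n", partial[0]);   /* expect 992 */
    return 0;
}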

A.2.6 Vector Lane Local Memory

The soft vector architecture supports a vector lane local memory. The local memory is partitioned into private sections for each VP if the LMemShare option is off. Turning the option on allows the local memory block to be shared between all VPs in a vector lane.

Figure A.1: Connection between distributed MAC units and the vector register file. (The diagram shows the MAC units grouped into chains of MACL units; each chain sum-reduces its inputs to produce one element of the partial result vector written to the vector register file.)

This mode is useful if all VPs need to access the same lookup table data, and allows for a larger table due to shared storage. With LMemShare, the VL for a local memory write must be less than or equal to NLane to ensure VPs do not overwrite each other's data.

The address and data widths of the local memory are both VPUW, and the number of words in the memory is given by LMemN. The local memory is addressed in units of VPUW-wide words.

Data to be written into the local memory can be taken from a vector register, or the value from a scalar register can be broadcast to all local memories. A scalar broadcast writes a data value from a scalar register to the VP local memory at an address given by a vector register. This facilitates filling the VP local memory with fixed lookup tables computed by the scalar unit.
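A minimal sketch of this usage, assuming the per-VP addresses have already been placed in vector registers and that register numbers are illustrative:

    vmstc    VL, r2      # with LMemShare, VL must be <= NLane for local memory writes
    vstl.vs  v1, r3      # broadcast scalar r3 into each lane's local memory,
                         # at the per-VP addresses held in v1
    vldl.vv  v6, v5      # each unmasked VP loads local memory word v5[i] into v6[i]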

A.3 Instruction Set

The following sections describe in detail the instruction set of the soft vector processor, and different variations of the vector instructions.

A.3.1 Data Types

The processor supports 32-bit word, 16-bit halfword, and 8-bit byte data widths, in both signed and unsigned forms. However, not all operations are supported for 32-bit words; most notably, 32-bit multiply-accumulate is absent.

A.3.2 Addressing Modes

The instruction set supports three vector addressing modes, illustrated by the short sketch after this list:

1. Unit stride access

2. Constant stride access

3. Indexed offsets access

The vector lane local memory uses register addressing with no offset.
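The three modes correspond to the vld, vlds, and vldx load instructions of Section A.4.4. A brief sketch (register choices are illustrative; the vbase and vstride registers are assumed to have been set up with vmstc):

    vld.w    v1, vbase0, vinc0       # unit stride: contiguous words starting at vbase0
    vlds.w   v2, vbase1, vstride0    # constant stride: every vstride0-th word from vbase1
    vldx.w   v3, v4, vbase2          # indexed: words at vbase2 plus the byte offsets in v4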

A.3.3 Flag Register Use

Almost all instructions can specify one of two vector mask registers in the opcode to use as an execution mask. By default, vfmask0 is used as the vector mask. Writing a value of 1 into the mask register will cause that VP to be disabled for operations that use the mask. Some instructions, such as flag logical operations, are not masked.

A.3.4 Instructions

The instruction set includes the following categories of instructions:

1. Vector Integer Arithmetic Instructions

2. Vector Logical Instructions

3. Vector Fixed-Point Arithmetic Instructions

4. Vector Flag Processing Instructions

5. Vector Processing Instructions

6. Memory Instructions

A.4 Instruction Set Reference

The complete instruction set is listed in the following sections, separated by instruction type.

Table A.4 describes the possible qualifiers in the assembly mnemonic of each instruction.

Table A.4: Instruction qualifiers

op.vv (vector-vector), op.vs (vector-scalar), op.sv (scalar-vector): Vector arithmetic instructions may take one source operand from a scalar register. A vector-vector operation takes two vector source operands; a vector-scalar operation takes its second operand from the scalar register file; a scalar-vector operation takes its first operand from the scalar register file. The .sv instruction type is provided to support non-commutative operations.
op.b (1B byte), op.h (2B halfword), op.w (4B word): The saturate instruction and all vector memory instructions need to specify the width of the integer data.
op.1 (use vfmask1 as the mask): By default, the vector mask is taken from vfmask0. This qualifier selects vfmask1 as the vector mask.
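For example, the two scalar-operand forms of subtraction differ only in which side the scalar appears on (illustrative register numbers):

    vsub.vs  v1, v2, r3    # v1[i] = v2[i] - r3
    vsub.sv  v1, r3, v2    # v1[i] = r3 - v2[i]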

In the following listings, each entry gives the instruction name, its assembly mnemonics in parentheses, the supported operand forms, and a summary of its operation. Instructions in italics are not yet implemented.

A.4.1 Integer Instructions

Absolute Value (vabs): .vv[.1] vD, vA. Each unmasked VP writes into vD the absolute value of vA.
Absolute Difference (vabsdiff): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP writes into vD the absolute difference of vA and vB/rS.
Add (vadd, vaddu): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP writes into vD the signed/unsigned integer sum of vA and vB/rS.
Subtract (vsub, vsubu): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS; .sv[.1] vD, rS, vB. Each unmasked VP writes into vD the signed/unsigned integer result of vA/rS minus vB/rS.
Multiply Hi (vmulhi, vmulhiu): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP multiplies vA and vB/rS and stores the upper half of the signed/unsigned product into vD.
Multiply Low (vmullo, vmullou): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP multiplies vA and vB/rS and stores the lower half of the signed/unsigned product into vD.
Integer Divide (vdiv, vdivu): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS; .sv[.1] vD, rS, vB. Each unmasked VP writes into vD the signed/unsigned result of vA/rS divided by vB/rS, where at least one source is a vector.
Shift Right Arithmetic (vsra): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS; .sv[.1] vD, rS, vB. Each unmasked VP writes into vD the result of arithmetic right shifting vB/rS by the number of bits specified in vA/rS, where at least one source is a vector.
Minimum (vmin, vminu): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP writes into vD the minimum of vA and vB/rS.
Maximum (vmax, vmaxu): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP writes into vD the maximum of vA and vB/rS.
Compare Equal, Compare Not Equal (vcmpe, vcmpne): .vv[.1] vF, vA, vB; .vs[.1] vF, vA, rS. Each unmasked VP writes into vF the boolean result of comparing vA and vB/rS.
Compare Less Than (vcmplt, vcmpltu): .vv[.1] vF, vA, vB; .vs[.1] vF, vA, rS; .sv[.1] vF, rS, vB. Each unmasked VP writes into vF the boolean result of whether vA/rS is less than vB/rS, where at least one source is a vector.
Compare Less Than or Equal (vcmple, vcmpleu): .vv[.1] vF, vA, vB; .vs[.1] vF, vA, rS; .sv[.1] vF, rS, vB. Each unmasked VP writes into vF the boolean result of whether vA/rS is less than or equal to vB/rS, where at least one source is a vector.
Multiply Accumulate (vmac, vmacu): .vv[.1] vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP calculates the product of vA and vB/rS. The products of vector elements are summed, and the summation results are added to the distributed accumulators.
Compress Copy from Accumulator (vccacc): vD. The contents of the distributed accumulators are reduced, and the result written into vD. Only the bottom VPUW bits of the result are written. If the number of accumulators is greater than MACL, multiple partial results will be generated by the accumulate chain, and they are compressed such that the partial results form a contiguous vector in vD. If the number of accumulators is less than or equal to MACL, a single result is written into element zero of vD. This instruction is not masked, and the elements of vD beyond the partial result vector length are not modified. Additionally, VL is set to the number of elements in the partial result vector as a side effect.
Compress Copy and Zero Accumulator (vcczacc): vD. The operation is identical to vccacc, except the distributed accumulators are zeroed as a side effect.


A.4.2 Logical Instructions

And (vand): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP writes into vD the logical AND of vA and vB/rS.
Or (vor): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP writes into vD the logical OR of vA and vB/rS.
Xor (vxor): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP writes into vD the logical XOR of vA and vB/rS.
Shift Left Logical (vsll): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS; .sv[.1] vD, rS, vB. Each unmasked VP writes into vD the result of logical left shifting vB/rS by the number of bits specified in vA/rS, where at least one source is a vector.
Shift Right Logical (vsrl): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS; .sv[.1] vD, rS, vB. Each unmasked VP writes into vD the result of logical right shifting vB/rS by the number of bits specified in vA/rS, where at least one source is a vector.
Rotate Right (vrot): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS; .sv[.1] vD, rS, vB. Each unmasked VP writes into vD the result of rotating vA/rS right by the number of bits specified in vB/rS, where at least one source is a vector.

A.4.3 Fixed-Point Instructions (Future Extension)

Saturate (vsat, vsatu): .b/.h/.w [.1] vD, vA. Each unmasked VP places into vD the result of saturating vA to a signed/unsigned integer narrower than the VP width. The result is sign/zero-extended to the VP width.
Saturate Signed to Unsigned (vsatsu): .b/.h/.w [.1] vD, vA. Each unmasked VP places into vD the result of saturating vA from a signed VP-width value to an unsigned value that is as wide as or narrower than the VP width. The result is zero-extended to the VP width.
Saturating Add (vsadd, vsaddu): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP writes into vD the signed/unsigned integer sum of vA and vB/rS. The sum saturates to the VP width instead of overflowing.
Saturating Subtract (vssub, vssubu): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS; .sv[.1] vD, rS, vB. Each unmasked VP writes into vD the signed/unsigned integer subtraction of vA/rS and vB/rS, where at least one source is a vector. The difference saturates to the VP width instead of overflowing.
Shift Right and Round (vsrr, vsrru): [.1] vD, vA. Each unmasked VP writes into vD the arithmetic/logical right shift of vA. The result is rounded as per the fixed-point rounding mode. The shift amount is taken from vcvshamt.
Saturating Left Shift (vsls, vslsu): [.1] vD, vA. Each unmasked VP writes into vD the signed/unsigned saturating left shift of vA. The shift amount is taken from vcvshamt.
Multiply High (vxmulhi, vxmulhiu): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP computes the signed/unsigned integer product of vA and vB/rS, and stores the upper half of the product into vD after an arithmetic right shift and fixed-point rounding. The shift amount is taken from vcvshamt.
Multiply Low (vxmullo, vxmullou): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS. Each unmasked VP computes the signed/unsigned integer product of vA and vB/rS, and stores the lower half of the product into vD after an arithmetic right shift and fixed-point rounding. The shift amount is taken from vcvshamt.
Copy from Accumulator and Saturate (vxccacc): [.1] vD. The contents of the distributed accumulators are reduced, and the result written into vD. Only the bottom VPUW bits of the result are written. If the number of accumulators is greater than MACL, multiple partial results will be generated by the accumulate chain, and they are compressed such that the partial results form a contiguous vector in vD. If the number of accumulators is less than or equal to MACL, a single result is written into element zero of vD. This instruction is not masked, and the elements of vD beyond the partial result vector length are not modified. Additionally, VL is set to the number of elements in the partial result vector as a side effect.
Compress Copy from Accumulator, Saturate and Zero (vxcczacc): [.1] vD. The operation is identical to vxccacc, except the distributed accumulators are zeroed as a side effect.

A.4.4 Memory Instructions

Unit Stride Load (vld, vldu): .b/.h/.w [.1] vD, vbase [,vinc]. The VPs perform a contiguous vector load into vD. The base address is given by the control register vbase, and must be aligned to the width of the data being accessed. The signed increment vinc (default is vinc0) is added to vbase as a side effect. The width of each element in memory is given by the opcode. The loaded value is sign/zero-extended to the VP width.
Unit Stride Store (vst): .b/.h/.w [.1] vA, vbase [,vinc]. The VPs perform a contiguous vector store of vA. The base address is given by vbase (default vbase0), and must be aligned to the width of the data being accessed. The signed increment in vinc (default is vinc0) is added to vbase as a side effect. The width of each element in memory is given by the opcode. The register value is truncated from the VP width to the memory width. The VPs access memory in order.
Constant Stride Load (vlds, vldsu): .b/.h/.w [.1] vD, vbase, vstride [,vinc]. The VPs perform a strided vector load into vD. The base address is given by vbase (default vbase0), and must be aligned to the width of the data being accessed. The signed stride is given by vstride (default is vstride0); the stride is in terms of elements, not in terms of bytes. The signed increment vinc (default is vinc0) is added to vbase as a side effect. The width of each element in memory is given by the opcode. The loaded value is sign/zero-extended to the VP width.
Constant Stride Store (vsts): .b/.h/.w [.1] vA, vbase, vstride [,vinc]. The VPs perform a strided vector store of vA. The base address is given by vbase (default vbase0), and must be aligned to the width of the data being accessed. The signed stride is given by vstride (default is vstride0); the stride is in terms of elements, not in terms of bytes. The signed increment in vinc (default is vinc0) is added to vbase as a side effect. The width of each element in memory is given by the opcode. The register value is truncated from the VP width to the memory width. The VPs access memory in order.
Indexed Load (vldx, vldxu): .b/.h/.w [.1] vD, vOff, vbase. The VPs perform an indexed-vector load into vD. The base address is given by vbase (default vbase0), and must be aligned to the width of the data being accessed. The signed offsets are given by vOff and are in units of bytes, not in units of elements. The effective addresses must be aligned to the width of the data in memory. The width of each element in memory is given by the opcode. The loaded value is sign/zero-extended to the VP width.
Unordered Indexed Store (vstxu): .b/.h/.w [.1] vA, vOff, vbase. The VPs perform an indexed-vector store of vA. The base address is given by vbase (default vbase0). The signed offsets are given by vOff; the offsets are in units of bytes, not in units of elements. The effective addresses must be aligned to the width of the data being accessed. The register value is truncated from the VP width to the memory width. The stores may be performed in any order.
Ordered Indexed Store (vstx): .b/.h/.w [.1] vA, vOff, vbase. Operation is identical to vstxu, except that the VPs access memory in order.
Local Memory Load (vldl): .vv[.1] vD, vA. Each unmasked VP performs a register-indirect load into vD from the vector lane local memory. The address is specified in vA/rS, and is in units of VPUW. The data width is the same as the VP width.
Local Memory Store (vstl): .vv[.1] vA, vB; .vs[.1] vA, rS. Each unmasked VP performs a register-indirect store of vB/rS into the local memory. The address is specified in vA, and is in units of VPUW. The data width is the same as the VP width. If the scalar operand width is larger than the local memory width, the upper bits are discarded.
Flag Load (vfld): vF, vbase [,vinc]. The VPs perform a contiguous vector flag load into vF. The base address is given by vbase, and must be aligned to VPUW. The bytes are loaded in little-endian order. This instruction is not masked.
Flag Store (vfst): vF, vbase [,vinc]. The VPs perform a contiguous vector flag store of vF. The base address is given by vbase, and must be aligned to VPUW. A multiple of VPUW bits is written regardless of vector length (more precisely, ⌈VL/VPUW⌉ × VPUW flag bits are written). The bytes are stored in little-endian order. This instruction is not masked.
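The auto-increment side effect of the load and store instructions is intended for loops that walk through memory one MVL-sized chunk at a time. The following sketch adds two word arrays whose length is a multiple of MVL; it assumes vbase0/vbase1/vbase2 hold the array addresses, vinc0 has been set up to advance a base register by one chunk, r2 holds MVL (read earlier with vmcts), r6 holds the element count, and the loop control uses ordinary Nios II scalar instructions:

    # VL is assumed to have been set to MVL with vmstc before entering the loop
loop:
    vld.w    v1, vbase0, vinc0    # load a chunk of A; vbase0 advances by vinc0
    vld.w    v2, vbase1, vinc0    # load a chunk of B; vbase1 advances by vinc0
    vadd.vv  v3, v1, v2
    vst.w    v3, vbase2, vinc0    # store a chunk of C; vbase2 advances by vinc0
    sub      r6, r6, r2           # remaining elements -= MVL
    blt      r0, r6, loop         # scalar branch: repeat while elements remain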

A.4.5 Vector Processing Instructions

Merge (vmerge): .vv[.1] vD, vA, vB; .vs[.1] vD, vA, rS; .sv[.1] vD, rS, vB. Each VP copies into vD either vA/rS if the mask is 0, or vB/rS if the mask is 1. At least one source is a vector. Scalar sources are truncated to the VP width.
Vector Insert (vins): .vv vD, vA. The leading portion of vA is inserted into vD. vD must be different from vA. Leading and trailing entries of vD are not touched. The lower vclogmvl bits of vector control register vcvindex specify the starting position in vD. The vector length specifies the number of elements to transfer. This instruction is not masked.
Vector Extract (vext): .vv vD, vA. A portion of vA is extracted to the front of vD. vD must be different from vA. Trailing entries of vD are not touched. The lower vclogmvl bits of vector control register vcvindex specify the starting position in vA. The vector length specifies the number of elements to transfer. This instruction is not masked.
Scalar Insert (vins): .vs vD, rS. The contents of rS are written into vD at position vcvindex. The lower vclogmvl bits of vcvindex are used. This instruction is not masked and does not use vector length.
Scalar Extract (vext, vextu): .vs rS, vA. Element vcvindex of vA is written into rS. The lower vclogmvl bits of vcvindex are used to determine the element in vA to be extracted. The value is sign/zero-extended. This instruction is not masked and does not use vector length.
Compress (vcomp): [.1] vD, vA. All unmasked elements of vA are concatenated to form a vector whose length is the population count of the mask (subject to vector length). The result is placed at the front of vD, leaving trailing elements untouched. vD must be different from vA.
Expand (vexpand): [.1] vD, vA. The first n elements of vA are written into the unmasked positions of vD, where n is the population count of the mask (subject to vector length). Masked positions in vD are not touched. vD must be different from vA.
Vector Element Shift (vupshift): vD, vA. The contents of vA are shifted up by one element, and the result is written to vD (vD[i] = vA[i+1]). The first element in vA is wrapped to the last element (MVL-1) in vD. This instruction is not masked and does not use vector length.
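As a short illustration of vmerge (purely for exposition, since a vmax instruction already exists), an element-wise maximum can be formed by combining a comparison into the default mask with a merge; register numbers are illustrative:

    vcmplt.vv  vfmask0, v1, v2    # mask = 1 where v1[i] < v2[i]
    vmerge.vv  v3, v1, v2         # v3[i] = v1[i] where the mask is 0, v2[i] where it is 1,
                                  # i.e. v3[i] = max(v1[i], v2[i])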

A.4.6 Vector Flag Processing Instructions

Scalar Flag Insert (vfins): .vs vF, rS. The boolean value of rS is written into vF at position vcvindex. The lower vclogmvl bits of vcvindex are used. This instruction is not masked and does not use vector length.
And (vfand): .vv vFD, vFA, vFB; .vs vFD, vFA, rS. Each VP writes into vFD the logical AND of vFA and vFB/rS. This instruction is not masked, but is subject to vector length.
Or (vfor): .vv vFD, vFA, vFB; .vs vFD, vFA, rS. Each VP writes into vFD the logical OR of vFA and vFB/rS. This instruction is not masked, but is subject to vector length.
Xor (vfxor): .vv vFD, vFA, vFB; .vs vFD, vFA, rS. Each VP writes into vFD the logical XOR of vFA and vFB/rS. This instruction is not masked, but is subject to vector length.
Nor (vfnor): .vv vFD, vFA, vFB; .vs vFD, vFA, rS. Each VP writes into vFD the logical NOR of vFA and vFB/rS. This instruction is not masked, but is subject to vector length.
Clear (vfclr): vFD. Each VP writes zero into vFD. This instruction is not masked, but is subject to vector length.
Set (vfset): vFD. Each VP writes one into vFD. This instruction is not masked, but is subject to vector length.
Population Count (vfpop): rS, vF. The population count of vF is placed in rS. This instruction is not masked.
Find First One (vfff1): rS, vF. The location of the first set bit of vF is placed in rS. This instruction is not masked. If there is no set bit in vF, then the vector length is placed in rS.
Find Last One (vffl1): rS, vF. The location of the last set bit of vF is placed in rS. This instruction is not masked. If there is no set bit in vF, then the vector length is placed in rS.
Set Before First One (vfsetbf): vFD, vFA. Register vFD is filled with ones up to and not including the first set bit in vFA. Remaining positions in vFD are cleared. If vFA contains no set bits, vFD is set to all ones. This instruction is not masked.
Set Including First One (vfsetif): vFD, vFA. Register vFD is filled with ones up to and including the first set bit in vFA. Remaining positions in vFD are cleared. If vFA contains no set bits, vFD is set to all ones. This instruction is not masked.
Set Only First One (vfsetof): vFD, vFA. Register vFD is filled with zeros except for the position of the first set bit in vFA. If vFA contains no set bits, vFD is set to all zeros. This instruction is not masked.
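A typical use of the flag query instructions is to examine the outcome of a vector comparison from the scalar core, as in the following sketch (the flag register name vf2 and the scalar register numbers are illustrative):

    vcmpe.vv  vf2, v1, v2    # vf2[i] = 1 where v1[i] == v2[i]
    vfpop     r5, vf2        # r5 = number of matching elements
    vfff1     r6, vf2        # r6 = index of the first match, or VL if there is none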

A.4.7 Miscellaneous Instructions

Move Scalar to Control (vmstc): vc, rS. Register rS is copied to vc. Writing vcvpw changes vcmvl and vclogmvl as a side effect.
Move Control to Scalar (vmcts): rS, vc. Register vc is copied to rS.

A.5 Instruction Formats

The Nios II ISA uses three instruction formats.

R-Type: A (bits 31-27), B (26-22), C (21-17), OPX (16-6), OP (5-0); field widths 5, 5, 5, 11, 6.

I-Type: A (bits 31-27), B (26-22), IMM16 (21-6), OP (5-0); field widths 5, 5, 16, 6.

J-Type: IMM26 (bits 31-6), OP (5-0); field widths 26, 6.

The defined vector extension uses up to three 6-bit opcodes from the unused/reserved Nios II opcode space. Each opcode is further divided into two vector instruction types using the OPX bit in the vector instruction opcode. Table A.11 lists the Nios II opcodes used by the soft vector processor instructions.

Table A.11: Nios II Opcode Usage

Nios II Opcode    OPX Bit    Vector Instruction Type
0x3D              0          Vector register instructions
0x3D              1          Vector scalar instructions
0x3E              0          Fixed-point instructions
0x3E              1          Vector flag, transfer, misc
0x3F              0          Vector memory instructions
0x3F              1          Unused except for vstl.vs

A.5.1 Vector Register and Vector Scalar Instructions

The vector register format (VR-type) covers most vector arithmetic, logical, and vector processing instructions. It specifies three vector registers, a 1-bit mask select, and a 7-bit vector opcode. Instructions that take only one source operand use the vA field. Two exceptions are the vector local memory load and store instructions, which also use the VR-type instruction format.


Table A.12: Scalar register usage as source or destination register

Instruction    Scalar register usage
op.vs          Source
op.sv          Source
vins.vs        Source
vext.vs        Destination
vmstc          Source
vmcts          Destination

VR-Type: vD (bits 31-26), vA (25-20), vB (19-14), MASK (13), OPX (12), FUNC (11-6), OP (5-0); field widths 6, 6, 6, 1, 1, 6, 6.

Scalar-vector instructions that take one scalar register operand have two formats, depending on whether the scalar register is the source (SS-Type) or destination (SD-Type) of the operation.

SS-Type: vD (bits 31-26), vA (25-20), rS (19-15), 0 (14), MASK (13), OPX (12), FUNC (11-6), OP (5-0); field widths 6, 6, 5, 1, 1, 1, 6, 6.

SD-Type: rS (bits 31-27), 0 (26), vA (25-20), 0 (19-14), MASK (13), OPX (12), FUNC (11-6), OP (5-0); field widths 5, 1, 6, 6, 1, 1, 6, 6.

Table A.12 lists which instructions use a scalar register as a source and which use it as a destination.

A.5.2 Vector Memory Instructions

Separate vector memory instructions exist for the different addressing modes. Each of unit stride, constant stride, and indexed memory access has its own instruction format: VM, VMS, and VMX-type, respectively.

VM-Type: vR (bits 31-26), INC (25-23), 0 (22-19), BASE (18-14), MASK (13), OPX (12), FUNC (11-6), OP (5-0); field widths 6, 3, 4, 5, 1, 1, 6, 6.

VMS-Type: vR (bits 31-26), INC (25-23), STRIDE (22-19), BASE (18-14), MASK (13), OPX (12), FUNC (11-6), OP (5-0); field widths 6, 3, 4, 5, 1, 1, 6, 6.

VMX-Type: vA/vD (bits 31-26), vOffset (25-20), 0 (19), BASE (18-14), MASK (13), OPX (12), FUNC (11-6), OP (5-0); field widths 6, 6, 1, 5, 1, 1, 6, 6.

Vector load and store to the local memory use the VR-type instruction format. Scalar store to vector lane local memory uses the SS-type instruction format with all zeros in the vD field.

A.5.3 Instruction Encoding

Arithmetic/Logic Instructions

Table A.13 lists the function field encodings for vector register instructions. Table A.14 lists the function field encodings for vector-scalar and scalar-vector (non-commutative vector-scalar) operations. These instructions use the vector-scalar instruction format.

Table A.13: Vector register instruction function field encoding (OPX=0)

[2:0] Function bit encoding for .vv
[5:3]    000 001 010 011 100 101 110 111
000      vadd vsub vmac vand vor vxor
001      vaddu vsubu vmacu vabsdiff
010      vsra vcmpeq vsll vsrl vrot vcmplt vdiv vcmple
011      vmerge vcmpneq vcmpltu vdivu vcmpleu
100      vmax vext vins vmin vmulhi vmullo
101      vmaxu vminu vmulhiu vmullou
110      vccacc vupshift vcomp vexpand vabs
111      vcczacc


Table A.14: Scalar-vector instruction function field encoding (OPX=1)

[2:0] Function bit encoding for .vs
[5:3]    000 001 010 011 100 101 110 111
000      vadd vsub vmac vand vor vxor
001      vaddu vsubu vmacu vabsdiff
010      vsra vcmpeq vsll vsrl vrot vcmplt vdiv vcmple
011      vmerge vcmpneq vcmpltu vdivu vcmpleu
100      vmax vext vins vmin vmulhi vmullo
101      vmaxu vextu vminu vmulhiu vmullou

[2:0] Function bit encoding for .sv
[5:3]    000 001 010 011 100 101 110 111
110      vsra vsub vsll vsrl vrot vcmplt vdiv vcmple
111      vmerge vsubu vcmpltu vdivu vcmpleu

Fixed-Point Instructions (Future extension)

Table A.15 lists the function field encodings for fixed-point arithmetic instructions. These instructions are provided as a specification for a future fixed-point arithmetic extension.

Table A.15: Fixed-point instruction function field encoding (OPX=0)

[2:0] Function bit encoding for fixed-point instructions
[5:3]    000 001 010 011 100 101 110 111
000      vsadd vssub vsat vsrr vsls vxmulhi vxmullo
001      vsaddu vssubu vsatu vsrru vslsu vxmulhiu vxmullou
010      vxccacc vsatsu
011      vxcczacc
100      vsadd.sv vssub.sv vxmulhi.sv vxmullo.sv
101      vsaddu.sv vssubu.sv vxmulhiu.sv vxmullou.sv
110      vssub.vs
111      vssubu.vs

Flag and Miscellaneous Instructions

Table A.16 lists the function field encoding for vector flag logic and miscellaneous instructions.

Table A.16: Flag and miscellaneous instruction function field encoding (OPX=1)

[2:0] Function bit encoding for flag/misc instructions
[5:3]    000 001 010 011 100 101 110 111
000      vfclr vfset vfand vfnor vfor vfxor
001      vfff1 vffl1
010      vfsetof vfsetbf vfsetif
011      vfins.vs vfand.vs vfnor.vs vfor.vs vfxor.vs
100
101      vmstc vmcts
110
111

Memory Instructions

Table A.17 lists the function field encoding for vector memory instructions. The vector-scalar instruction vstl.vs is the only instruction that has an opcode of 0x3F and an OPX bit of 1.

Table A.17: Memory instruction function field encoding

[2:0] Function bit encoding for memory instructions (OPX=0)
[5:3]    000 001 010 011 100 101 110 111
000      vld.b vst.b vlds.b vsts.b vldx.b vstxu.b vstx.b
001      vldu.b vldsu.b vldxu.b
010      vld.h vst.h vlds.h vsts.h vldx.h vstxu.h vstx.h
011      vldu.h vldsu.h vldxu.h
100      vld.w vst.w vlds.w vsts.w vldx.w vstxu.w vstx.w
101
110      vldl vstl vfld vfst
111

[2:0] Function bit encoding for memory instructions (OPX=1)
[5:3]    000 001 010 011 100 101 110 111
110      vstl.vs
