Design and Implementation of an FPGA-Based Scalable Pipelined

Design and Implementation of an FPGA-Based Scalable Pipelined Associative SIMD Processor Array with Specialized Variations for Sequence Comparison and MSIMD Operation A dissertation submitted to the Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy by Hong Wang November 2006 Dissertation written by Hong Wang B.A., Lanzhou University, 1990 M.A., Kent State University, 2002 Ph.D., Kent State University, 2006 Approved by ___Robert A. Walker ___________________, Chair, Doctoral Dissertation Committee ___Johnnie W. Baker ___________________, Members, Doctoral Dissertation Committee ___Kenneth E Batcher __________________ ___Eugene C. Gartland, Jr._______________ _____________________________________ Accepted by ___Robert A. Walker___________________, Chair, Department of Computer Science ___John R. D. Stalvey__________________, Dean, College of Arts and Sciences ii Table of Contents 1 INTRODUCTION........................................................................................................... 3 1.1 Overview of Associative Computing Research......................................................... 5 1.2 Associative Computing ............................................................................................... 7 1.2.1 Associative Search and Responder Processing......................................................... 7 1.2.2 Associative Reduction for Maximum / Minimum Value ......................................... 9 1.3 Applications of a Scalable Associative Processor................................................... 10 1.3.1 Associative Computing for Relational Database Processing.................................. 11 1.3.2 Associative Computing for Image Processing........................................................ 16 1.3.3 Associative Computing for String Matching .......................................................... 19 1.4 Dissertation Organization ........................................................................................ 22 1.4.1 Developing an Associative Processor..................................................................... 22 1.4.2 Addition of Pipelining to Improve Performance .................................................... 23 1.4.3 Addition of a Reconfigurable Interconnection Network ........................................ 23 1.4.4 Supporting Multiple ISs for Increased Control Parallelism.................................... 23 2 FIRST PROTOTYPES OF AN ASSOCIATIVE COMPUTING (ASC) PROCESSOR ........................................................................................................................ 25 2.1 Basic ASC Processor Architecture.......................................................................... 25 2.2 First (Incomplete) Prototype of a 4-PE ASC Processor ........................................ 27 2.3 Implementing a Scalable ASC Processor................................................................ 29 2.3.1 Implementing Associative Search and Responder Processing ............................... 32 2.3.2 Implementing Associative Reduction for Maximum / Minimum Value................ 34 2.3.3 Implementing a 1-D and 2-D PE Interconnection Network ................................... 35 2.4 Performance of the First Scalable ASC Processor................................................. 36 2.5 Conclusion ................................................................................................................. 37 3 A SCALABLE PIPELINED ASC PROCESSOR WITH RECONFIGURABLE PE INTERCONNECTION NETWORK.................................................................................. 38 3.1 Pipelined SIMD Array.............................................................................................. 39 3.1.1 Pipeline Architecture.............................................................................................. 39 iii 3.1.2 Pipeline Performance.............................................................................................. 44 3.2 Reconfigurable PE Interconnection Network ........................................................ 44 3.2.1 Need for a Reconfigurable Network in Associative Computing ............................ 45 3.2.2 Reconfigurable Network Architecture.................................................................... 46 3.2.3 Reconfigurable Network Performance ................................................................... 49 3.3 Conclusion ................................................................................................................. 49 4 A SPECIALIZED ASC PROCESSOR WITH RECONFIGURABLE 2D MESH FOR SOLVING THE LONGEST COMMON SUBSEQUENCE (LCS) PROBLEM... 50 4.1 Genome Sequence Comparison ............................................................................... 51 4.1.1 Genome Sequence Comparison By Finding the Longest Common Subsequence (LCS) 51 4.1.2 Solving the Longest Common Subsequence (LCS) Problem Using a Specialized ASC Processor with Reconfigurable 2D Mesh................................................................... 52 4.2 PE Interconnection Network ................................................................................... 53 4.2.1 Coterie Network...................................................................................................... 55 4.2.2 Modifying the ASC Processor’s Reconfigurable Network to Support the LCS Algorithm............................................................................................................................ 56 4.3 Solving the LCS Problem on the ASC Processor with Reconfigurable 2D Mesh57 4.3.1 ASC Processor Exact Match LCS Algorithm......................................................... 57 4.3.2 ASC Processor Approximate Match LCS Algorithm ............................................. 60 4.4 Conclusions................................................................................................................ 63 5 AN ASC PROCESSOR TO SUPPORT MULTIPLE INSTRUCTION STREAM ASSOCIATIVE COMPUTING (MASC) ........................................................................... 64 5.1 Multiple-Instruction-Stream MASC Processor ..................................................... 65 5.1.1 Managing Tasks in the MASC Processor ............................................................... 67 5.1.2 Task Manager (TM)................................................................................................ 68 5.1.3 Instruction Stream (IS)............................................................................................ 69 5.1.4 Selecting the TM or IS to Control Each PE............................................................ 70 5.1.5 Program Forking and Joining ................................................................................. 72 5.2 MASC Processor Performance................................................................................ 75 5.3 Conclusion ................................................................................................................. 75 iv 6 CONCLUSIONS AND FUTURE WORK.................................................................. 76 6.1 Conclusions................................................................................................................ 76 6.2 Future Work.............................................................................................................. 78 v Table of Figures Fig. 1 Scalable ASC (Associative Computing) Processor........................................................ 4 Fig. 2 Sample Student Database ............................................................................................... 9 Fig. 3 Intersection and Union.................................................................................................. 12 Fig. 4 Cartesian Product and Join ........................................................................................... 13 Fig. 5 Convolution to Detect a Vertical Edge......................................................................... 16 Fig. 6 String Matching using Associative Computing............................................................ 20 Fig. 7 A 4-PE ASC Processor Array....................................................................................... 26 Fig. 8 A Scalable ASC (Associative Computing) Processor .................................................. 30 Fig. 9 A Scalable ASC Processor .......................................................................................... 31 Fig. 10 Network Operations................................................................................................... 34 Fig. 11 1-D and 2-D PE Interconnection Network in the ASC Processor.............................. 35 Fig. 12 ASC Processor’s Pipelined Architecture.................................................................... 40 Fig. 13 Parallel PE’s Pipelined Architecture .......................................................................... 41 Fig. 14 Structure of Data Switch............................................................................................. 48 Fig. 15 (a) Coterie Network [13], and (b) Switches for PE Interconnect and Bypass............ 54 Fig. 16 Text String After Broadcast Along Column Buses .................................................... 58 Fig. 17 (a) PEs Set Bypass Switches Along Common Subsequences, and (b) PEs Identify the LCS ............................................................................................................................

Design and Implementation of an FPGA-Based Scalable Pipelined

Parallel Patterns for Adaptive Data Stream Processing

A Middleware for Efficient Stream Processing in CUDA

2.5 Classification of Parallel Computers

AMD Accelerated Parallel Processing Opencl Programming Guide

SIMD Extensions

An Introduction to Gpus, CUDA and Opencl

Cuda C Programming Guide

Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated

Lightsaber: Efficient Window Aggregation on Multi-Core Processors

Parallel Stream Processing with MPI for Video Analytics and Data Visualization

Threading SIMD and MIMD in the Multicore Context the Ultrasparc T2

Thread-Level Parallelism I