HotStream: Heterogeneous Many-Core Data Streaming Framework with Complex Pattern Support
Sérgio Micael Ferreira Paiaguá
Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering
Examination Committee
Chairperson: Doutor Nuno Cavaco Gomes Horta
Supervisor: Doutor Ricardo Jorge Fernandes Chaves
Co-supervisor: Doutor Nuno Filipe Valentim Roma
Members of the Committee: Doutor Horácio Cláudio de Campos Neto, Doutor Paulo Ferreira Godinho Flores
October 2013
Abstract
The work herein presented proposes a data streaming accelerator framework that provides efficient data management facilities that can be easily tailored to any application and data pattern. This is achieved through an innovative and fully programmable data management structure, implemented with two granularity levels and complemented by a complete software layer, ranging from a device driver to a high-level API that provides easy access to every feature of the framework. Fine-grained data movements are made possible by an innovative Data Fetch Controller, powered by a custom microcontroller, which can be programmed to generate arbitrarily complex access patterns with minimal performance overhead. The obtained results show that the proposed framework is capable of achieving virtually zero-latency address generation and data fetch, even for the most complex streaming data patterns, while significantly reducing the size occupied by the pattern description code. To validate the proposed framework, two distinct case studies were considered. The first deals with the block-based multiplication of large matrices, while the second consists of a full image-processing application in the frequency domain. The experimental results obtained for the first case study demonstrate that, by enabling data reuse, the proposed framework increases the available bandwidth by 4.2×, resulting in a speed-up of 2.1× when compared to the existing related state of the art. Furthermore, it reduces the Host memory requirements and its intervention in the acceleration by more than 40×. The signal-processing case study revealed that an accelerator based on the proposed framework can achieve a linear relationship between the execution time and the size of the input image, which highly contrasts with CPU- or GPU-based alternatives. Frame rates of 40 and 2.5 FPS were obtained for 1024 × 1024 and 4096 × 4096 images, respectively.
Keywords: Stream computing, Many-Core Heterogeneous Architectures, Programmable Data Access Patterns, Data Reuse, Reconfigurable Devices, High-Speed Interconnections.
Resumo
This work proposes a data-stream-based acceleration platform that provides efficient data management, easily adaptable to any application or data access pattern. This is achieved through an innovative, fully programmable data management structure, composed of two granularity levels and complemented by an extensive software layer, which spans from the device driver to a high-level interface that grants easy access to every element of the platform. Fine-grained data control is guaranteed by an innovative Data Fetch Controller, commanded by a purpose-designed microcontroller capable of generating arbitrarily complex access patterns. The obtained results show that the proposed platform can generate addresses and access data almost immediately, whatever the data pattern in question, while also reducing the space required to store the pattern description. To validate the proposed platform, two distinct case studies were used. The first is based on the multiplication of large matrices, while the second consists of an image-processing application in the frequency domain. The results obtained for the first case study demonstrate that, by extensively exploiting data reuse, the proposed platform increases the bandwidth delivered to the computing units by 4.2×, resulting in a performance increase of 2.1× when compared with conventional implementations. Furthermore, the memory requirements imposed on the host machine are reduced by more than 40×. The second case study reveals that an accelerator based on the proposed platform guarantees a linear relationship between the execution time and the size of the image being processed, something the state of the art does not allow.
Keywords: Data stream computing, Heterogeneous Multi-core Architectures, Programmable Access Patterns, Data Reuse, Reconfigurable Devices.
Acknowledgments
Within the next 80 pages, a lot more than a master's thesis is contained. It obviously represents my hard work, dedication and effort over the last 8 months, but it is actually much more than that. This is the final step in a journey that I started back in 2008, a journey that has only been successful due to the invaluable help and companionship of a number of people who more than deserve to be mentioned in the following paragraphs.

First of all, I would like to express my deepest gratitude to the exceptional team of advisors I had the pleasure to work with. Ricardo Chaves, Nuno Roma, Pedro Tomás and Frederico Pratas, I really couldn't have hoped for a better supervision over the last months. From the lengthy but enlightening meetings, always accompanied by good humour and plenty of laughs, to your tireless effort in reviewing all of my work, I have no doubt that the quality of this thesis is, in great part, owed to all of you.

To all the amazing friends I made during these last five years, in particular Rui Coelho, Joana Marinhas, José Santos, Filipe Morais, João Carvalho and Rita Pereira, a big thank you for all your support throughout all the (mostly) good and bad times. A special thanks to my great friend José Leitão, who had a special impact on this thesis by keeping me company during the long work nights at INESC and for always having the time to share a laugh, or to happily engage in endless technical debates.

Finally, I thank my parents and my sister for, well, everything. Not exaggerating in the slightest, without them, this moment would simply not have happened. I am very grateful for all the wonderful guidance, patience and love they have so selflessly given me over the years.
Contents
1 Introduction 2
  1.1 Motivation ...... 3
  1.2 Objectives ...... 4
  1.3 Main contributions ...... 5
  1.4 Dissertation outline ...... 6

2 Technology Overview 9
  2.1 Stream Computing Platforms and Address Generation ...... 10
  2.2 PCI Express Interfaces ...... 11
  2.3 Shared Buses and Crossbars ...... 12
    2.3.1 Shared Bus ...... 12
    2.3.2 Crossbar ...... 12
  2.4 Networks On Chip ...... 13
  2.5 NoC Survey ...... 14
  2.6 Crossbar Survey ...... 15
  2.7 Summary ...... 15

3 HotStream Framework Architecture 17
  3.1 Host Interface Bridge ...... 19
  3.2 Multi-Core Processing Engine ...... 20
  3.3 The HotStream API ...... 21
  3.4 Data Fetch Controllers, Shared Memory and Auxiliary Units ...... 22
    3.4.1 Address Generation Core (AGC) ...... 23
    3.4.2 Micro16 microcontroller ...... 24
    3.4.3 Access to the Shared Memory ...... 27
  3.5 Data Stream Switch (DSS) and Core Management Unit (CMU) ...... 28
  3.6 Summary ...... 30
4 Host Interface Bridge 31
  4.1 PCI Express Infrastructure ...... 32
  4.2 Address Spaces and DMA ...... 33
  4.3 2D DMA Transfers ...... 34
  4.4 Device Driver and User Interface ...... 37
    4.4.1 Modifications to the MPRACE device driver ...... 38
    4.4.2 Configuring a data transfer ...... 38
  4.5 Summary ...... 40
5 Framework Prototype 41
  5.1 AXI Interfaces ...... 42
  5.2 HIB Implementation and Performance ...... 43
  5.3 Backplane Implementation and Performance ...... 47
    5.3.1 Hermes NoC ...... 48
      5.3.1.A Modified packet structure ...... 48
    5.3.2 AXI Stream Interconnect ...... 49
    5.3.3 Backplane Performance Evaluation ...... 50
      5.3.3.A Core Emulator and Stream Wrapper ...... 50
      5.3.3.B Testbench and Python script ...... 51
      5.3.3.C Results ...... 51
    5.3.4 Crossbar and NoC Comparative Evaluation ...... 53
  5.4 Shared Memory Performance ...... 55
    5.4.1 Cycle-Accurate Simulator ...... 55
  5.5 Summary ...... 56

6 Framework Evaluation 58
  6.1 General Evaluation ...... 59
    6.1.1 Resources Overhead ...... 59
    6.1.2 Stream Generation Efficiency ...... 61
  6.2 Case Study 1: Matrix Multiplication ...... 63
    6.2.1 Computing Cores ...... 64
    6.2.2 Roofline Model ...... 66
    6.2.3 Performance and Memory Usage ...... 67
  6.3 Case Study 2: Image processing chain in the frequency domain ...... 69
    6.3.1 Computing Cores ...... 71
    6.3.2 Performance and Scalability ...... 72
  6.4 Summary ...... 75

7 Conclusions and Future Work 77
  7.1 Conclusions ...... 78
  7.2 Future work ...... 80
A Appendix A 85
  A.1 Micro16 Instruction Set Architecture ...... 86
  A.2 HotStream Register Interface ...... 86
  A.3 HotStream API ...... 89

B Appendix B 95
  B.1 Pattern Description Examples ...... 96
    B.1.1 Linear and Tiled access pattern ...... 96
    B.1.2 Diagonal access pattern ...... 96
    B.1.3 Cross access pattern ...... 98
List of Figures
2.1 Structure of a 2D Mesh and 2D Torus NoC ...... 14
3.1 Structure and organization overview of the HotStream framework ...... 18
3.2 AGC in a 3-level nested loop configuration ...... 23
3.3 Architecture of the Micro16 microcontroller ...... 25
3.4 Core internal structure, comprising the PE (e.g., an application specific IP Core) and the co-located BMC ...... 28
3.5 Internal structure of the BMC, consisting of Write and Read control units, a Channel Arbiter and a Synchronizer block ...... 28
4.1 Address spaces and translation mechanisms on an x86-like architecture ...... 34
4.2 Mapping of a user buffer to the bus address space and subsequent creation of the corresponding SG descriptors. The size and number of the physical data chunks can vary considerably according to the size of the original buffer and the state of the physical memory ...... 34
4.3 Application of a 2D pattern to the mapping of Fig. 4.2 ...... 35
4.4 Application of a 2D pattern to a more realistic mapping between virtual and physical address space. This example highlights the descriptor savings that are possible by utilizing a DMA with 2D capabilities ...... 36
4.5 Flowchart of the algorithm that converts a list of SG DMA descriptors into a list featuring 2D transfers ...... 36
4.6 Operation of the HotStream gather() function, which gathers the various sub-blocks defined by a 2D pattern and places them linearly in a new user-space buffer ...... 37
5.1 Basic handshake principle utilized by the AXI4-Stream and other similar stream-based protocols. Retrieved from [2] ...... 43
5.2 Aggregate throughput of various PCI Express configurations. The dashed line accounts for protocol overhead as per [17] ...... 44
5.3 Measured aggregate throughput for back-to-back transfers with varying buffer size ...... 45
5.4 Chipscope waveforms obtained during a back-to-back transfer of 4 KB ...... 46
5.5 Time elapsed during the configuration of the send and receive transactions ...... 46
5.6 Aggregate throughput for a back-to-back transfer including the time taken for the transaction set-up ...... 47
5.7 Traffic patterns for the NoC simulation ...... 52
5.8 Data delivery throughput in various traffic configurations. In every one, the input throughput is reached asymptotically ...... 52
5.9 Source to destination latency when using best case or worst case routing. Duplicating the inserted data throughput does not affect latency ...... 53
6.1 Access patterns, with varying complexity degrees, adopted for the DFC evaluation ...... 61
6.2 HotStream-based implementation of the block-based multiplication algorithm, consisting of 3 Kernels to process multiple and concurrent data streams, where double buffering is used on the shared memory to overlap communication with computation ...... 64
6.3 8:1 binary reduction tree based on Xilinx Matrix Accumulators. The structure of the 16:1 reduction core follows the same architecture but with double the number of basic accumulators ...... 65
6.4 Internal structure of the multiplication core utilized in the HotStream implementation of the matrix multiplication. Sub-blocks from matrix A are stored and re-used during the computation of a full sub-block line from matrix B ...... 65
6.5 Roofline model for the matrix multiplication example: Cx and Hx denote the actual performance for each implementation (Conventional and HotStream, respectively); while the conventional solutions C2× and C4×, with 2× and 4× parallelism, respectively, are limited by the PCIe link (communication-bounded), all other implementations are computation-bounded ...... 66
6.6 Processing time taken on each step of the matrix multiplication algorithm for the considered implementations ...... 68
6.7 Core scalability of the three matrix multiplication implementations ...... 68
6.8 Host memory requirements for matrix multiplication implementations ...... 69
6.9 Image processing chain in the frequency domain, mapped to the HotStream framework ...... 71
6.10 Execution time and bus utilization for various image sizes and read and write burst sizes. Both transient (single frame) and steady-state (streaming) operation conditions are depicted ...... 73
6.11 FFT Execution time with CUFFT, a CUDA-based FFT library, for various image sizes [33] ...... 74
6.12 Execution time of a 2D FFT on a NVIDIA QUADRO FX5600 using CUFFT and on an Intel Dual Core Processor (6600) @ 2.4 GHz using FFTW [10] ...... 75
B.1 Pattern description code of a simple linear access with 1024 positions ...... 97
B.2 Pattern description code for a tiled 128×72 access ...... 97
B.3 Pattern description code for a diagonal access on a 1024×1024 matrix ...... 98
B.4 Pattern description code for a Greek cross access pattern ...... 99
List of Tables
2.1 Examples of architecture combinations supported by the ATLAS environment . . . 15
5.1 PCI Express Gen1 and Gen2 support on the AXI Bridge for PCI Express IP Core ...... 44
5.2 Hardware utilization of the Hermes NoC configured in a 2×2 mesh and the AXI Stream Interconnect Crossbar for a varying number of independent cores ...... 53
5.3 Throughput and latency of the Hermes NoC and AXI Stream Crossbar when interconnecting 4 Cores, under different traffic conditions ...... 54
6.1 Resource usage for each component in the MCPE and HIB (hardware platform: XC7VX485T Virtex-7 FPGA) ...... 60
6.2 Individual resource usage of the DFCs and BMCs (hardware platform: XC7VX485T Virtex-7 FPGA) ...... 61
6.3 Address generation rate and descriptor size of the considered access patterns (the adopted length of each pattern results from the parameterization depicted in Fig. 6.1) ...... 62
6.4 Resource usage of the cores utilized in the various implementations of the 4096×4096 matrix multiplication (hardware platform: XC7VX485T Virtex-7 FPGA) ...... 66
6.5 Resource usage of the cores utilized in the frequency domain processing case study (hardware platform: XC7VX485T Virtex-7 FPGA) ...... 72
A.1 ALU Register-Register operations ...... 86
A.2 Constant loading operations ...... 86
A.3 Low and High constant loading and miscellaneous operations ...... 87
A.4 Flow control operations ...... 87
A.5 CMU Address Mapping ...... 87
A.6 CMU Register Details ...... 88
Acronyms
AGC Address Generation Core
API Application Programming Interface
ASIC Application-Specific Integrated Circuit
BMC Bus Master Controller
CMU Core Management Unit
DFC Data Fetch Controller
DRAM Dynamic Random Access Memory
DSS Data Stream Switch
ERF External Register File
FFT Fast Fourier Transform
FPGA Field-Programmable Gate Array
FPS Frames Per Second
GPP General Purpose Processor
HIB Host Interface Bridge
IDE Integrated Development Environment
IOMMU Input/Output Memory Management Unit
IRF Internal Register File
ISA Instruction Set Architecture
MCPE Multi-Core Processing Engine
MMU Memory Management Unit
MSI Message Signalled Interrupts
NoC Network-On-Chip
PE Processing Element
PSR Program Status Register
RAM Random Access Memory
RTL Register Transfer Level
SG Scatter Gather
TLP Transaction Layer Packet
VLSI Very Large Scale Integration
1 Introduction
Contents
  1.1 Motivation ...... 3
  1.2 Objectives ...... 4
  1.3 Main contributions ...... 5
  1.4 Dissertation outline ...... 6
1.1 Motivation
One of the most critical aspects to be considered during the development of multi-core hardware accelerators is how to efficiently handle data transfers between the various Processing Elements (PEs) of the system. The architecture of the memory subsystem and of the communication data channels has a significant impact on the effective memory bandwidth that is made available to the PEs, and therefore on the overall system performance. In fact, an efficient and coordinated management of the data transfers is important not only because the PEs have different processing characteristics and capabilities (e.g., general purpose processors, application specific processors, or custom-designed accelerating cores), but also because applications often present distinct memory footprints and bandwidth requirements.

While traditional solutions (such as cache hierarchies) try to reduce the latency of accessing the data, they do not allow exploiting all levels of available data parallelism. Therefore, recent advances have encouraged researchers to exploit other models that are able to deal with the intrinsic constraints of the underlying Very Large Scale Integration (VLSI) technology and with the inherent parallelism of emerging applications. As a result, there has been an increasing interest in data stream computation models, which focus on decoupling communication from computation by exposing an additional level of concurrency. This type of concurrency is especially important in hardware accelerators, given the slow communication channels (e.g., buses) typically used to connect with the Host device. Nevertheless, while regular streaming patterns are easy to handle, complex memory accesses require more radical strategies to avoid long memory access times and to keep a high overall system performance.
Such accesses can occur either due to the intrinsic complexity of the underlying application or because multiple kernels concurrently access different memory regions. Moreover, when data streams produced by a given kernel are consumed by several other kernels at different paces, intermediate buffering is also required, further increasing the pressure on the memory subsystem. To minimize the impact of these problems, dedicated Address Generation Units (AGUs) can be employed, which (pre-)fetch the data with the specific pattern required by the target application. Moreover, data reuse mechanisms can reduce the number of effective memory accesses, by sharing some of the streams through alternative channels or by rearranging a stream before it is consumed by the next kernel.

The work presented herein describes the HotStream framework, a platform for the development of stream-based computing tasks that provides an easy means of implementing advanced features such as data (pre-)fetching, stream sharing, and support for arbitrarily complex data access patterns, generated by fully programmable AGUs. These units, which are an integral part of the framework, differ from most comparable solutions by providing a Pattern Description Language (PDL) that enables any pattern to be described in a compact and scalable format. This greatly contrasts with the approaches of competing solutions, where periodic repetitions within the pattern cannot be easily exploited.

The aforementioned stream-related features of the framework are supported on standard communication channels with added pattern-based data addressing mechanisms. Such addressing is structured with two levels of granularity: a coarse-grained data access from the Host to the accelerator, to maximize the transmission efficiency, and a fine-grained data access within the shared memory of the device, made available to all the cores within the accelerator, to maximize data reuse. This is very important, as it allows the designer to focus on the accelerator architecture and not on the surrounding infrastructures, which have the potential to severely limit the achievable performance if not carefully designed. It is also important to note that the HotStream framework is designed to be equally efficient regardless of the final implementation technology, be it a reconfigurable device, an ASIC, or a SoC combining the two. As such, the description of each component is accompanied by a number of requirements that must be met in order to guarantee a good overall performance.

Finally, a comprehensive software API was developed, which conveniently abstracts all the low-level interactions between the Host machine and the accelerator. This greatly reduces the development time of stream accelerators based on the HotStream framework. Furthermore, the open-source nature of the code and its extensive documentation promote further adjustments and modifications towards increasing the performance for a particular platform/application pair.
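To make the role of these programmable AGUs concrete, the following C sketch expands a compact nested-loop descriptor into the address stream it denotes. It is an illustration only: the function name, the three-level limit and the parameter layout are assumptions made for this example, mirroring the nested-loop style of the AGC described in Chapter 3, not the actual DFC interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative 3-level nested-loop address generator: level i repeats
 * count[i] times and advances the address by stride[i] per iteration.
 * A tiled 2D access, for example, uses stride[0] = 1 (elements within a
 * tile row), stride[1] = the matrix row pitch (next tile row), and
 * stride[2] = the tile-to-tile hop. */
size_t agu_generate(uint32_t base,
                    const uint32_t count[3], const int32_t stride[3],
                    uint32_t *out, size_t max_out)
{
    size_t n = 0;
    for (uint32_t k = 0; k < count[2]; k++)           /* outer loop  */
        for (uint32_t j = 0; j < count[1]; j++)       /* middle loop */
            for (uint32_t i = 0; i < count[0]; i++) { /* inner loop  */
                if (n == max_out)
                    return n;
                out[n++] = base + (uint32_t)((int64_t)k * stride[2]
                                           + (int64_t)j * stride[1]
                                           + (int64_t)i * stride[0]);
            }
    return n;
}
```

With count = {4, 4, 2} and stride = {1, 1024, 4}, for instance, the generator walks two 4×4 tiles of a matrix with a 1024-element row pitch. The hardware AGC produces one such address per cycle instead of iterating in software, which is what enables the near-zero-latency fetch reported in Chapter 6.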
1.2 Objectives
This master thesis is primarily focused on the development of all the components required for the realization of the HotStream framework, namely: i) a Host Interface Bridge (HIB), which handles all the communications between the accelerator and the Host; ii) a Multi-Core Processing Engine (MCPE), capable of simultaneously hosting an arbitrary number of stream-based kernels; iii) auxiliary structures, such as a Data Stream Switch (DSS) and a Core Management Unit (CMU), to enable the cores to be interfaced on an individual basis, directly from the Host; iv) a backplane interconnection capable of providing full connectivity between all the cores in the MCPE without compromising the communication bandwidth; and v) a C-based API, providing high-level access to all the facilities offered by the framework.

Within the MCPE lies an element that warrants a separate discussion by itself: the Data Fetch Controller (DFC). The DFCs are fully programmable AGUs, associated with each core in the MCPE, that enable fine-grained access to the shared memory. These units feature a 16-bit microcontroller, tightly coupled with an address generation unit, which enables the description of arbitrarily complex access patterns through the compact (but still rich) instruction set provided by the microcontroller. The development of the Pattern Code is facilitated by a purposely designed assembler, which supports all the features commonly required by this type of program.

Furthermore, in order to properly characterize the various communication mechanisms and interfaces that can be used in the framework, several technologies are analyzed in terms of latency, bandwidth, and area occupation. In the particular case of the backplane interconnection, two very distinct implementations are explored, in order to select the one that better suits the characteristics of the target implementation: Networks-on-Chip (NoCs) or Crossbars.
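To illustrate why a microcoded approach keeps pattern descriptions compact, the toy interpreter below runs a hypothetical three-operation mini-ISA, invented purely for this sketch (the actual Micro16 instruction set is listed in Appendix A.1). A diagonal walk over an N×N matrix needs only a three-instruction loop body regardless of N, whereas an explicit address list would grow linearly with N.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mini-ISA for pattern description (NOT the real Micro16
 * ISA): EMIT outputs the current address, ADD offsets it by arg1, LOOP
 * branches back to arg1 a total of arg2 times (a single loop level only
 * in this sketch), and END stops execution. */
typedef enum { P_EMIT, P_ADD, P_LOOP, P_END } p_op;
typedef struct { p_op op; int32_t arg1, arg2; } p_insn;

size_t pattern_run(const p_insn *prog, uint32_t *out, size_t max_out)
{
    uint32_t addr = 0;
    int32_t remaining = -1;   /* iterations left in the active loop */
    size_t pc = 0, n = 0;
    for (;;) {
        p_insn in = prog[pc];
        switch (in.op) {
        case P_EMIT:
            if (n == max_out)
                return n;
            out[n++] = addr;
            pc++;
            break;
        case P_ADD:
            addr += (uint32_t)in.arg1;
            pc++;
            break;
        case P_LOOP:
            if (remaining < 0)
                remaining = in.arg2;      /* first arrival at the loop */
            if (remaining-- > 0)
                pc = (size_t)in.arg1;     /* branch back */
            else {
                remaining = -1;           /* loop finished */
                pc++;
            }
            break;
        case P_END:
            return n;
        }
    }
}
```

For the main diagonal of a 4×4 row-major matrix, the whole pattern program is {EMIT; ADD 5; LOOP back 3 times; END}: four instructions producing the addresses 0, 5, 10 and 15. The same four instructions with different operands cover any matrix size, which is the property exploited by the DFC to achieve the code-size reductions reported in Section 1.3.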
1.3 Main contributions
The evaluation of the proposed framework on a reconfigurable device showed that the embedded DFCs offer significant memory savings on the pattern description code. Compared to the existing related art, the proposed solution, based on the HotStream framework, achieves code size reductions above 1500×, with identical address generation rates. As an example, considering the block-based matrix multiplication case study, experimental results suggest that, given the extensive data reuse offered by the proposed HotStream framework, it is easy to achieve a 2× speed-up relative to the state-of-the-art implementation. Moreover, the proposed solution is able to reduce the Host intervention in the process by up to 45×, while requiring significantly less buffering from the Host. Consequently, the proposed framework allows for larger data scalability.

To properly characterize the communication channels considered in the framework, a detailed experimental analysis was conducted of the performance of PCI Express interfaces, one of the technologies that may be used to connect the HotStream-based accelerator to a Host. Moreover, a simulation-based performance assessment of the specific DDR3 memory/controller pair that was used in the prototyping phase of this work is also performed. These two subjects are notably poorly documented in the literature, despite being a fundamental part of the vast number of Field-Programmable Gate Array (FPGA)-based accelerators that have been proposed over the years. Likewise, the complete software infrastructure that accompanies the HotStream framework also deals with the issue of handling the communication between the Host and the accelerator, each with its own address space. While some of the concepts explored in this context are specific to PCI Express interfaces, most of them can be adapted to any other communication interface that makes use of a Direct Memory Access (DMA) engine to handle data transfers.
Moreover, the extensive comparison between Crossbar-based buses and Networks-on-Chip (NoCs) offers system designers an additional source of information when deciding between the two. In particular, the implementation results presented for a state-of-the-art reconfigurable device, such as the Virtex-7, provide new insights regarding the future of NoC structures on current-generation FPGAs. Comparable information is only available for older devices, which lag the capacity of modern offerings by an order of magnitude or more.

The preliminary insights of the work presented in this thesis were published in a paper presented at a national conference:
• Sérgio Paiaguá, Adrian Matoga, Ricardo Chaves, Pedro Tomás, Nuno Roma, Evaluation and integration of a DCT core with a PCI Express Interface using an Avalon interconnection, in IX Jornadas sobre Sistemas Reconfiguráveis (REC 2013), University of Coimbra, pages 93-99, February 2013.
This paper focused on an exploratory study based on a DCT accelerator communicating with a Host machine through a PCI Express interface. More recently, the HotStream framework and, in particular, its innovative data-fetching and stream management mechanisms were extensively discussed in another paper, presented at an international conference:
• Sérgio Paiaguá, Frederico Pratas, Ricardo Chaves, Pedro Tomás, Nuno Roma, HotStream: Efficient Data Streaming of Complex Patterns to Multiple Accelerating Kernels, in 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2013), October 2013.
An extended version of this paper is currently under preparation, to be submitted to the International Journal of Parallel Programming. Meanwhile, the main ideas explored in the context of this thesis also motivated the elaboration of an R&D project proposal:
• Streaming Complex Data Patterns on Heterogeneous Systems, submitted to the Portuguese Foundation for Science and Technology (FCT) in July 2013.
In the future, the ideas opened by the present thesis suggest multiple research directions, which will be further explored by a PhD candidate in the SIPS group at INESC-ID.
1.4 Dissertation outline
The work presented in this dissertation is organized in seven chapters. In Chapter 2, the related work in the area of stream computing, address generation units, and heterogeneous multi-core architectures is presented. A brief description of the PCI Express standard is provided, along with the rationale for its development. Different on-chip interconnection technologies are also described in this chapter, as they play a very important role in the internal architecture of the proposed framework. Chapter 3 provides an in-depth discussion of the various hardware and software elements that compose the framework. Naturally, the DFC is described in more detail, given its key importance within the framework. The two elements that make up the HIB are discussed in Chapter 4 for the particular case of a PCI Express interface between the Host and the accelerator. A particular emphasis is given to the PCI Express interface, since it is the most complex of the interfaces supported by the HotStream framework. This chapter also discusses the low-level components of the HotStream API, which handle the various address spaces that usually exist in a modern operating system and create the necessary descriptors for the operation of the DMA controller. Chapter 5 discusses the implementation of the evaluation prototype that was used to validate the platform, including a thorough characterization of all the communication mechanisms that were utilized. Experimental results are discussed in Chapter 6, where a detailed evaluation of the framework as a whole is provided, firstly by assessing the address generation efficiency and resource occupation of the HotStream infrastructure, and then by testing the framework with two case studies: the first based on the multiplication of very large matrices, and the second dealing with the processing of very large images in the frequency domain. Finally, Chapter 7 closes this dissertation with the main conclusions and directions for future work.
2 Technology Overview
Contents
  2.1 Stream Computing Platforms and Address Generation ...... 10
  2.2 PCI Express Interfaces ...... 11
  2.3 Shared Buses and Crossbars ...... 12
  2.4 Networks On Chip ...... 13
  2.5 NoC Survey ...... 14
  2.6 Crossbar Survey ...... 15
  2.7 Summary ...... 15
Given the broad scope of subjects covered by the HotStream framework, the comparison with the state of the art is not trivial. In fact, while a wide range of streaming architectures has been proposed over the years, most are completely self-contained, in the sense that the communication with an external general purpose processor, including all the necessary software layers from the low-level device drivers to the user-accessible APIs, is not considered. Section 2.1 discusses the related state of the art that shares the most points in common with the proposed HotStream framework, with an intentional bias towards architectures that tackle stream management and data access patterns, as these are the key elements of the proposed framework. Section 2.2, on the other hand, aims to provide the reader with the necessary background to better illustrate the choice of PCI Express as the main interface supported by the HIB.

Efficient and high-throughput communication between multiple streaming cores is a key feature of the MCPE, the part of the framework that hosts all the processing elements. Two main communication mechanisms exist within this component of the HotStream framework: i) a backplane interconnection, used to allow for stream reuse among the multiple kernels, by providing a network that is able to establish a full-duplex connection between any two cores with minimal latency and maximum throughput; and ii) a large-capacity shared memory, which enables stream buffering and rearrangement. Despite the requirement for a large storage capacity, this shared memory should still be capable of a high data access throughput, so as not to represent a significant bottleneck to the performance of the framework. Naturally, these requirements dictate that an off-chip DDR memory be used, as these are the only components able to couple a large storage capacity with high data access throughputs.
On the other hand, the backplane interconnection can be implemented with different technologies, namely shared buses, Crossbars or Networks-on-Chip. Naturally, each solution implies different area/performance trade-offs, which also depend on the target technology, i.e., Application-Specific Integrated Circuit (ASIC) or FPGA. Sections 2.3, 2.4, 2.5 and 2.6 describe the key features of these interconnection technologies.
2.1 Stream Computing Platforms and Address Generation
The increasing popularity of stream-computing models has led to the development of many specialized architectures that tackle the efficient fetching and management of data streams. Examples such as the IMAGINE stream processor [23][24] and the Merrimac stream-based supercomputer [11][14], which are based on clusters of PEs and Stream Register Files (SRF), offer simple data pre-fetch mechanisms which, in the case of the IMAGINE processor, transfer entire streams between the SRF and an off-chip SDRAM. As stated by the authors, only 50% of the optimal performance is achieved, which motivates the development of more efficient data management structures.
Given this bottleneck, other researchers have focused on improving the generation of data streams. After demonstrating that dataflow computing can lead to significant performance improvements in a wide range of applications, Pell et al. [32] developed the MaxCompiler, which maps the computing kernels to an FPGA. To generate the streams to be fed to the computing kernels, a set of commands is provided, instructing the developed tool-chain to automatically generate simple 1D, 2D or 3D data patterns. More complex patterns can only be described by using multiple commands [7], which is both time-consuming and results in a large configuration overhead. The same shortcomings are experienced by the Programmable Pattern-based Memory Controller (PPMC) [20], which eases the programming of regular 1D, 2D or 3D patterns through a set of function calls integrated in an API. Again, this solution falls short when long and/or complex patterns must be described.
The above pattern-generation solutions are actually very similar to the functionalities offered by modern DMA engines. For example, the Xilinx AXI DMA controller offers independent read and write channels, which provide high-bandwidth DMA between memory and stream-type peripherals [5]. With its scatter-gather capabilities and the support for 2D transfers, this controller can actually be used as a pattern generator. Its configuration is done by setting up a chain of descriptors that are then read by the engine, making this a rather similar solution to the one adopted by the PPMC [20]. Moreover, multichannel support is also offered through stream identifiers that accompany the data. This enables the multiplexing of the two available data channels, so that multiple masters and slaves can connect to a single DMA engine. Both the PPMC and the AXI DMA solutions are geared towards moving large and regular chunks of data, and fall short when even more complex access patterns are considered, such as the one used in the Smith-Waterman algorithm [25]. In contrast, the DFC herein proposed is capable of handling arbitrary patterns of varying complexity without significant penalties.
Furthermore, since the pattern description is not descriptor-based, there is essentially no limit to the length of the pattern to be generated.
2.2 PCI Express Interfaces
In the past, the PCI (Peripheral Component Interconnect), defined by the PCI Local Bus standard, was the most commonly used solution for connecting hardware devices in a computer system [29]. It experienced wide adoption over a long period of time, with components such as network, sound and graphics cards making extensive use of this solution, and can still be found in many modern motherboards. However, the growing requirements for communication bandwidth quickly made clear that PCI would not be a scalable solution.
To overcome the limitations of the original PCI standard, as well as of other standards such as PCI-X and AGP, the PCI-SIG (PCI Special Interest Group) jointly developed a high-speed serial alternative, PCI Express, officially abbreviated as PCIe. This new standard has become the de facto standard for high-speed interfaces with computer peripherals [22] due to its higher throughput, scalability, lower I/O pin count and native hot-plug capabilities, among other features. To date, two main revisions have been made to the PCIe specification, which have increased the
maximum transfer rates by a factor of 2 in each iteration, while maintaining backwards compatibility with previous versions. PCIe is a serial interface that achieves significant data rates by utilizing multiple lanes that operate in full-duplex. Lane widths of x2 to x16 are widely used, whereas x32 slots are very uncommon. Unlike the original PCI Local Bus standard, PCIe is a point-to-point connection. In order to make multiple interfaces available, root complex devices are used, which connect a processor and memory subsystem through a local bus to multiple ports, which can be further expanded by using special switches.
2.3 Shared Buses and Crossbars
The trend in modern digital design is to increase component reuse by resorting to verified IP Cores that implement the desired functionality behind a well-defined interface. This greatly improves time-to-market and significantly reduces design complexity. The interconnection between these blocks is, nevertheless, very important, as it very often determines the overall performance of the architecture. With the goal of further easing the interaction between these off-the-shelf components, the industry has moved to standard interconnection solutions, such as the AMBA (Advanced Microcontroller Bus Architecture) developed by ARM, of which the AXI (Advanced eXtensible Interface) family of interconnections is the most well known, or the IBM CoreConnect. These standard interconnection solutions usually implement a shared bus or Crossbar to enable the communication between the master and slave interfaces attached to the bus.
2.3.1 Shared Bus
A shared bus is, for the most part, a collection of wires interconnecting the various interfaces, while a central arbiter grants the masters exclusive access to the bus. The arbitration is done according to a statically defined rule, such as round-robin, or using a set of priorities attributed to each element. In this configuration, whenever a master takes control of the shared bus, all attached slaves have access to the information being transmitted and will act on it if their ID is specified in an appropriate control signal. Apart from the obvious observation that the peak bandwidth is limited to the maximum bandwidth achievable between any master-slave pair, it is further compromised by the difficult task of maintaining high clock frequencies as the number of interfaces, and thus the length of the interconnecting wires between them, increases [8].
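As an illustration, the work-conserving round-robin policy described above can be modeled in a few lines of C. The helper below is a hypothetical sketch, not part of any bus standard: starting from the master after the last grant, it selects the first one with a pending request.

```c
#include <assert.h>

/* Model of a work-conserving round-robin arbiter for n masters.
 * req[m] is non-zero when master m has a pending request; `last` is
 * the index granted in the previous cycle.  Returns the index of the
 * granted master, or -1 when no request is pending. */
static int rr_grant(const int *req, int n, int last)
{
    for (int i = 1; i <= n; i++) {
        int m = (last + i) % n;   /* scan in circular order after `last` */
        if (req[m])
            return m;
    }
    return -1;                    /* bus stays idle this cycle */
}
```

Because the search restarts just past the previously granted master, no requester can be starved: every master is reconsidered at most n cycles after its last grant.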
2.3.2 Crossbar
As an evolution of the standard shared-bus, Crossbar architectures greatly increase the aggregate bandwidth by allowing simultaneous transactions between independent master-slave pairs. In this topology, the number of interconnecting wires is greatly increased, as each master now has a dedicated route to each slave. Naturally, an arbiter circuit is still required to ensure that any
master can communicate with any slave within a reasonable waiting time. Although the peak aggregate bandwidth is now increased by a factor equal to the number of simultaneous connections, the hardware complexity is considerably higher than that of the shared-bus solution, meaning that interconnecting an increasing number of interfaces will require extra effort if the same clock frequency is to be maintained. As such, it is common to resort to a hierarchical use of Crossbar interconnections in order to keep the interconnection density low, while slightly sacrificing the aggregate throughput, given that the number of connections that can be established simultaneously will be lower [34].
2.4 Networks On Chip
While Crossbars have proved able to fulfil the bandwidth requirements of modern IP Cores and SoCs, their increasing number and heterogeneity is still a challenge when designing SoCs that must adhere to strict area budgets and operating frequency targets. The key to addressing these issues is to decouple the Transport Layer from the Physical Layer, by utilizing a packet-based Transport Protocol [8]. Packets are usually composed of a header, payload and trailer, and are routed across the network according to a certain routing algorithm.
Networks on Chip are inherently more scalable than their bus-based counterparts, as the specific bandwidth and interface requirements of each IP Core can be met on a per-node basis. In fact, when interconnecting multiple cores with a crossbar, the width of the internal connection buses is determined by the bandwidth requirements of the fastest master-slave combination. Thus, when connections with inferior throughputs are established, the available bandwidth is under-utilized and resources are wasted. On the other hand, NoCs provide the ability to optimize the data links between the various switches that compose the network, in order to maximize overall throughput and quality of service (QoS), while minimizing circuit area [8].
Inspired by traditional computer networks, NoCs are usually characterized by the communication mechanism, switching mode and routing algorithm. All these parameters are functions of the network topology, i.e., the way in which the switching elements are arranged. Common topologies are the 2D mesh, 2D torus, folded 2D torus and bi-directional ring, although many others exist. Figure 2.1 depicts the structure of the first two topologies.
The communication mechanism defines how messages traverse the network and usually falls within two categories: circuit switching and packet switching. In circuit switching, a connection between a source and a destination is established before any packet is sent.
During the lifetime of the connection, the links involved cannot be used by packets with any other origin, which may lead to under-utilization. On the other hand, in packet switching a connection is never established; the routing decisions are made at run time and on a per-packet basis, which leads to more efficient network usage, albeit with a slight increase in logic complexity. Packet switching requires the use of a switching mode, which defines how packets move through the switches. While many techniques exist, the most common are store-and-forward,
Figure 2.1: Structure of a 2D Mesh and a 2D Torus NoC
virtual cut-through and wormhole [30]. The first two operate on full packets. In store-and-forward mode, a switch buffers a packet completely before it is sent to the next switch, which increases transmission latency and hardware requirements. Virtual cut-through is similar, but a switch can start forwarding a packet as soon as the next switch indicates that it has a buffer that is big enough to hold the full packet, which slightly reduces communication latency. Finally, wormhole switching reduces buffering requirements by splitting a full packet into various sub-packets of fixed size, designated as flits. Only the header flit possesses routing information and, therefore, the payload flits must follow the same path reserved by the header.
The path taken by a packet from source to destination is defined by the routing algorithm. The most common algorithms are distributed and perform the routing decisions on a per-node and per-packet basis. In addition, these can be either deterministic or adaptive, depending on whether the routing decision takes into account the current network traffic [30]. Naturally, deterministic routing algorithms, such as XY routing, lead to lower resource usage.
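To make the simplicity of deterministic routing concrete, one hop of XY routing on a 2D mesh reduces to two comparisons: travel along the X dimension until the packet is in the destination column, then along Y. The direction encoding below is an illustrative assumption for this sketch, not taken from any specific NoC.

```c
#include <assert.h>

/* Possible outputs of one routing decision (illustrative encoding). */
enum dir { GO_EAST, GO_WEST, GO_NORTH, GO_SOUTH, ARRIVED };

/* One hop of deterministic XY routing on a 2D mesh: first resolve the
 * X coordinate, then the Y coordinate.  (cur_x, cur_y) is the current
 * switch, (dst_x, dst_y) the destination node. */
static enum dir xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (cur_x < dst_x) return GO_EAST;
    if (cur_x > dst_x) return GO_WEST;
    if (cur_y < dst_y) return GO_SOUTH;  /* y assumed to grow downwards */
    if (cur_y > dst_y) return GO_NORTH;
    return ARRIVED;
}
```

Since the decision depends only on the current and destination coordinates, and never on network traffic, the hardware is minimal and the algorithm is trivially deadlock-free on a mesh, which is why XY routing leads to lower resource usage than adaptive alternatives.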
2.5 NoC Survey
While much research has been done on the subject of Networks on Chip, most of this body of work focuses on the influence of the various routing and arbitration algorithms on the flow of data within the network. Software models are typically used to ease the testing of the proposed solutions under different traffic loads. Thus, there is an evident shortage of publicly available NoC implementations. The following details the NoCs available in the state of the art.
NOCem [35] is a configurable NoC architecture which implements packet switching with optional virtual channels and supports three of the most common topologies, namely Mesh, Torus and Double Torus. A more comprehensive alternative is proposed by the European Space Agency (ESA) in the form of the SOCWire [31]. This modular solution is composed of the SOCWire Switch, which provides wormhole routing and round-robin arbitration, and the SOCWire CODEC, responsible for enabling the communication between the nodes and the routing elements. To ensure
proper operation in hazardous environments, the design is fault-tolerant and includes hot-plug abilities for dynamically reconfigurable modules. The ATLAS project [19] is an ambitious Java-based environment for the generation of different NoC architectures, which are configurable according to a large set of parameters, such as the topology, the number of virtual channels and the routing algorithm. Table 2.1 presents some of the supported parameters for two of the architectures generated by the framework.
Table 2.1: Examples of architecture combinations supported by the ATLAS environment
Parameter          Hermes             Mercury
Topology           2D Mesh            2D Torus
Virtual Channels   1, 2, 4            1
Routing Alg.       XY or West-first   Adaptive
Scheduling Alg.    Round Robin        Round Robin
The support for virtual channels and the simpler routing algorithm, which reduces hardware utilization, led to the adoption of the Hermes architecture as the Network on Chip against which the Crossbar solution is compared. The SOCWire, with its large array of features, proved to utilize too many resources to be competitive with Crossbar solutions, while NOCem employs store-and-forward instead of wormhole switching, thus forgoing the reductions in latency, buffering requirements and, therefore, overall hardware resources that the latter provides.
2.6 Crossbar Survey
Unlike NoCs, Crossbars benefit from greater standardization, as they are central elements of the multiple SoC interfaces offered by the various IP Core vendors. Widely used interfaces such as Altera Avalon, AMBA AXI, AMBA AHB, IBM CoreConnect and the open-source Wishbone all make extensive use of crossbar modules in their infrastructure. Moreover, as the goal of such standards is to promote interoperability and design reuse across multiple vendors, they are all offered with no associated fees or royalties. This results in a significant availability of high-quality crossbar implementations, which are usually very close in terms of performance and area requirements. Thus, the deciding factor is usually dictated by the target technology or by the bus architecture used by the rest of the system, i.e., if the rest of the SoC is interconnected by an AXI infrastructure, it is only natural that the need for a stream-based Crossbar is fulfilled by the AXI Stream Interconnect.
2.7 Summary
Most streaming architectures described in the literature do not encompass the hardware modules and associated software that are required to communicate with an external host. Several authors have identified data management structures as being the main bottleneck
when developing such co-processors. Some of these authors report a performance degradation, attributable to these elements, of up to 50%. Thus, much research is being directed to the development of efficient stream management and, as a consequence, pattern generation mechanisms. However, most approaches focus on the creation of address generation units which are optimized to perform large data transfers with regular patterns. When finer control over the data is required, large configuration overheads are incurred by such solutions.
Like any streaming architecture, the performance of the HotStream framework is largely influenced by the communication channels it utilizes. Thus, it is important to use state-of-the-art off-chip and on-chip communication solutions. As far as off-chip interconnections are concerned, PCI Express is the most widely used interface in modern computational systems. It offers a significant aggregate bandwidth by leveraging multiple parallel lanes of full-duplex serial connections.
On-chip interconnections benefit from a broader array of solutions. While Crossbars have become the de facto standard for the interconnection of high-performance IP Cores in modern SoCs, Networks on Chip (NoCs) are becoming an important alternative. In comparison to the former, NoCs offer increased flexibility, scalability and performance. However, there is a clear shortage of publicly available NoC implementations, which further complicates their evaluation and application to real designs. In the case of the HotStream framework, the decision to employ either one of these solutions is dictated by their performance vs. area trade-offs and, most importantly, by their capability to scale.
3 HotStream Framework Architecture
Contents
3.1 Host Interface Bridge ...... 19
3.2 Multi-Core Processing Engine ...... 20
3.3 The HotStream API ...... 21
3.4 Data Fetch Controllers, Shared Memory and Auxiliary Units ...... 22
3.5 Data Stream Switch (DSS) and Core Management Unit (CMU) ...... 28
3.6 Summary ...... 30
[Figure 3.1 (block diagram): a General Purpose Processor (GPP) running the source code, the HotStream API and the device driver connects, through a high-speed interconnect (e.g., PCIe, AXI, CoreConnect), to the hardware accelerator. Within the accelerator, the Host Interface Bridge (HIB), comprising a DMA controller and a Data Stream Bridge, feeds the Multi-Core Processing Engine (MCPE), which contains the Data Stream Switch, the CMU, the backplane interconnect, a set of Cores (each a PE coupled with a DFC) and the shared memory.]
Figure 3.1: Structure and organization overview of the HotStream framework
The proposed HotStream framework, depicted in Fig. 3.1, is a comprehensive solution for the development of stream-based architectures, composed of a software layer and a hardware layer. The software layer integrates: i) a convenient API that allows the programmer to specify any arbitrarily complex streaming pattern; and ii) a Device Driver to map the user-specified memory buffers, allocated in user space, to the physical address space. This device driver also serves as the data transfer peer, on the software side, that assures appropriate integration with the hardware layer.
The hardware layer is composed of: i) the Host Interface Bridge (HIB), responsible for handling the data management between the Host processor and the accelerator; and ii) the Multi-Core Processing Engine (MCPE), responsible for managing the data streams between the PEs within the accelerator. The proposed architecture is designed to be fully scalable and adaptable (by supporting a variable number of data streams and PEs), as well as flexible enough to support applications with different and arbitrarily complex streaming patterns with minimal effort. Moreover, one of the main features that sets it apart from the current state of the art is its support for efficient data fetch and reuse within the accelerator architecture.
The above-mentioned characteristics are mainly achieved by providing two distinct levels of data access patterns, with different intrinsic granularities and complexity degrees. The first level is implemented within the HIB and supports simpler patterns of a more coarse-grained nature, as communication channels of this type typically benefit from transfers involving larger data chunks.
This first level of granularity is identical to what is provided by the PPMC [20]. The second and more fine-grained level of granularity is implemented within the MCPE, supporting more complex streaming patterns.
3.1 Host Interface Bridge
Copying data from the Host processor memory system to the MCPE is a complex procedure. It requires the intervention of: i) the device driver, on the Host side; and ii) a specific hardware structure, the HIB, that is able to autonomously issue data transfers (read/write requests) between the main memory of the Host and the MCPE. The Host Interface Bridge (HIB) mainly consists of two modules: the Data Stream Bridge (DSB) and the Direct Memory Access (DMA) controller. The DSB is responsible for interfacing the hardware accelerator with the Host General Purpose Processor (GPP). The adopted interfacing standard and corresponding communication structure should be adapted to the data transfer facilities and infrastructures that are offered by each specific Host. Some supported standards are PCIe, AMBA AXI, CoreConnect, etc. As a consequence, the implementation of this bridge must be adapted to the specific requisites of each application. The co-located DMA controller ensures the management of the coarse-grained data transfers between the Host GPP and the hardware accelerator.
Despite the implementation efficiency of the different entities involved in a single data transaction, an unavoidable amount of overhead is expected, given the various operations that are involved. Fortunately, this overhead bears a weak dependence on the size of the data to be transferred. Thus, transferring large data chunks guarantees a more efficient utilization of the available bandwidth in the data channel. However, complex streaming patterns often require accessing data that is not laid out linearly in the Host memory but, instead, spread over a regular pattern of contiguous blocks separated by non-unit strides. In such situations, transferring the smallest data chunk that encompasses each set of contiguous blocks inevitably results in a waste of bandwidth.
The solution herein proposed to tackle this issue is to implement coarse-grained patterned data transfers between the Host and the MCPE, such that the HIB transfers only useful data, but in large chunks. These coarse-grained data access patterns can be easily accomplished by setting up the DMA controller to transfer each of the contiguous data blocks that constitute the pattern from their physical locations in the Host memory, according to the regions mapped by the device driver. The set-up phase consists of creating scatter-gather descriptors that configure the DMA engine, arranged in a chain, each defining the starting position and size of a memory block to be read or written. As the number of contiguous blocks described by a given pattern increases, the number of required descriptors increases in the same proportion. Hence, in order to minimize the impact of this increase, the simple and traditional DMA engine can be replaced by a more efficient alternative capable of performing, at least, the most regular 2D memory accesses, i.e., each
memory transaction can be described by the tuple {OFFSET, HSIZE, STRIDE, VSIZE}, specifying the starting address of the first memory block, the size of each contiguous block, the starting position of the next contiguous block with relation to the previous one, and the number of repetitions of the two previous parameters, respectively. This reduces the total number of descriptors needed to describe a given pattern. Thus, the better a pattern fits within a 2D description, the greater the observed reduction.
While the size and nature of the patterns applied to the data transfers between the Host and the MCPE are technically only limited by the space available to store the descriptors, these should not be too fine-grained, in order to avoid a detrimental impact on throughput. Therefore, in the proposed HotStream framework, the API provided to the programmer includes a special call to gather an arbitrary sequence of data segments stored across the memory space into a single, larger, contiguous buffer that can then be transferred at once. This gathering operation takes a non-negligible time to complete; thus, it is only useful when the incurred penalty does not exceed the overheads of transferring the smaller non-contiguous individual data chunks.
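To make the {OFFSET, HSIZE, STRIDE, VSIZE} tuple concrete, the sketch below expands one such descriptor into the sequence of byte addresses it covers. The struct and function names are illustrative only, not part of the framework's actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative 2D transfer descriptor: VSIZE contiguous blocks of
 * HSIZE bytes, the first starting at OFFSET, with consecutive block
 * start addresses spaced STRIDE bytes apart. */
struct desc2d {
    size_t offset;  /* start address of the first block */
    size_t hsize;   /* bytes in each contiguous block   */
    size_t stride;  /* distance between block starts    */
    size_t vsize;   /* number of block repetitions      */
};

/* Write every byte address touched by the descriptor into out[];
 * out[] must hold hsize * vsize entries.  Returns the count. */
static size_t expand2d(const struct desc2d *d, size_t *out)
{
    size_t n = 0;
    for (size_t v = 0; v < d->vsize; v++)        /* VSIZE repetitions  */
        for (size_t h = 0; h < d->hsize; h++)    /* one HSIZE block    */
            out[n++] = d->offset + v * d->stride + h;
    return n;
}
```

A single descriptor thus replaces the VSIZE plain scatter-gather descriptors that a traditional DMA engine would need for the same strided pattern, which is exactly the reduction argued above.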
3.2 Multi-Core Processing Engine
The MCPE is where the actual computation takes place and is designed to support multiple independent and heterogeneous cores that collaboratively execute the multiple streaming kernels. Moreover, each kernel can span several Cores, to further exploit data parallelism. As depicted in Fig. 3.1, it consists of: i) multiple Cores, each composed of a PE, responsible for the computation, and one or more Data Fetch Controllers (DFCs), responsible for data management; ii) a high-speed Backplane Interconnection, able to dynamically route the data streams between the Cores, promoting the required data reuse schemes; iii) a shared memory, which allows rearranging the stream access patterns and reusing data; iv) a Data Stream Switch (DSS), to route the data streams coming from the Host to either the backplane or the shared memory. Accordingly, the data streams transferred from the Host via the DSS, or those produced by an individual Core, can be routed to other Cores in the MCPE via the backplane interconnection or stored in the shared memory for later reuse; and v) a Core Management Unit (CMU), a register-based interface for (re)starting the Cores, configuring their instruction memories or configuring interrupt generation on a per-Core basis.
The high-speed stream-oriented Backplane Interconnection must ensure that each Core can communicate with any other with a minimum routing delay. In addition, multiple connections may need to be active at any given time. Taking these requirements into account, a high-speed interconnection network is required. Sophisticated Network-on-Chip (NoC) solutions are likely to provide higher system scalability and better support for heterogeneity among the Cores in terms of data interfaces. However, one must also ensure that the amount of hardware resources required by the interconnection network is minimal, saving space for extra computing Cores.
The shared memory, on the other hand, is particularly important for applications that exploit
different types of access patterns or when data reuse is exploited between the PEs. Whenever a stream needs to be rearranged before it is consumed by another Core (or even streamed back to the Host machine), it can be buffered in the shared memory. As such, this element must be accessible by all the Cores, employing a simple work-conserving round-robin arbitration mechanism, which makes sure that all read and write requests are served with equal priority and no starvation occurs. In addition, since this is an address-based element in an otherwise stream-oriented architecture, reading and writing operations require the inbound or outbound data streams to be accompanied by a stream of addresses. The generation of these addresses is carried out by the DFC unit within each Core (further detailed in Section 3.4), and allows the implementation of fine-grained streaming patterns of variable complexity, ranging from simple linear accesses to more exotic and complex ones, such as diagonal or cross-shaped patterns.
Accordingly, the DFCs, which are responsible for these fine-grained data access patterns, are implemented through small programmable units directly coupled with the PEs (one for each outbound or inbound stream) that generate the addresses for each data element within the stream (or for groups of data elements, if incremental bursts are used, i.e., multiple data elements are stored or retrieved from sequential memory locations). Unlike the coarse-grained patterns supported by the DMA engine, these units are able to generate address patterns with a resolution down to the single-address level. In fact, since the DFC can be programmed to describe long-running complex patterns with common loop structures, there is effectively no limit to the type of patterns that the programmer can describe.
Additionally, the pattern specification takes virtually no space, thus avoiding the penalties resulting from the descriptor-based data-fetching mechanisms used in the state-of-the-art approaches, such as the PPMC [20].
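As a simple instance of such a fine-grained pattern, the diagonal access mentioned above reduces, for an N × N row-major matrix, to a single linear rule: each step skips one full row plus one column. The helper name and layout assumptions below are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Generate the element addresses of the main diagonal of an n x n
 * row-major matrix stored at `base`, with `esize` bytes per element.
 * Each diagonal step advances by one row (n elements) plus one column,
 * i.e. (n + 1) elements; out[] must hold n entries. */
static void diag_pattern(size_t base, size_t n, size_t esize, size_t *out)
{
    for (size_t k = 0; k < n; k++)
        out[k] = base + k * (n + 1) * esize;
}
```

A descriptor-based engine would need one descriptor per diagonal element for this pattern, whereas a programmable address generator expresses it as one short loop, which is the point made above about single-address resolution.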
3.3 The HotStream API
The seamless integration between the acceleration hardware and the Host is accomplished by the HotStream API. This software layer abstracts the low-level interactions between the device driver and the HIB and provides the user with a convenient and well-documented set of calls that give easy access to the various features of the HotStream framework. The API (see Appendix A.3), written in C, is further subdivided into four logical groups, each fulfilling a special set of tasks within the framework: i) Core Management; ii) Framework Management; iii) Data Management; and iv) Pattern Definition. The Core Management group provides calls to configure the Instruction Memory of individual Cores, as well as to issue individual or global resets and manage interrupts. These procedures can be performed on a per-Core basis or applied to multiple Cores at once by utilizing vector variants of the same calls. The Data Management and Pattern Definition groups of functions allow the creation of data streams with different coarse-grained patterns, in order to meet the requirements of particular applications. These streams can either target or be sourced from the shared memory or the high-speed backplane without additional configuration. Finally, the Framework
Management set of calls is responsible for initializing and gracefully terminating the operation of the framework.
Two pattern creation calls, HotStream_2D() and HotStream_Block(), leverage the 2D capabilities of the DMA engine and are complemented by the gather function, HotStream_gather(), introduced in Section 3.1. While the first function configures a stream with the basic parameters outlined in Section 3.1, i.e., {OFFSET, HSIZE, STRIDE, VSIZE}, providing the maximum flexibility for the definition of a stream, the second call provides an additional level of abstraction when generating tiled patterns. The recurrence of such patterns in streaming applications motivated the development of this custom call, which only requires the original matrix and tile sizes, along with the size, in bytes, of each matrix entry, to be specified in order to configure the data stream. The HotStream_Linear() call can be used when no complex patterns are required; thus, only the OFFSET within the user-provided buffer and its TOTAL_SIZE need to be specified.
3.4 Data Fetch Controllers, Shared Memory and Auxiliary Units
The DFCs are undoubtedly the central and most important elements of the MCPE. These units are responsible for single-handedly extracting the data from the (address-based) shared memory with arbitrarily complex patterns, and for forming the data streams that are presented to each Core's PE, while the latter remains completely oblivious as to the origin of the data that it is consuming. In particular, each DFC is responsible for generating the corresponding read and write data transactions, according to the defined streaming pattern. Each DFC has its own instruction memory that is programmed from the Host machine through a compact but complete ISA, by using a custom assembler with syntax validation. This instruction memory can be dynamically updated and its size represents the only limitation to the complexity of the considered pattern. However, as long-running patterns can be described by loops, the size of the instruction data is kept relatively small and independent of the extension of the pattern. In addition, the DFC is optimized to take advantage of features provided by the considered bus protocol (e.g., AMBA AXI), such as burst commands, to minimize the existing overheads.
In order to handle these tasks, the DFCs incorporate two fundamental blocks that operate together: i) the Address Generation Core (AGC); and ii) a custom small-footprint 16-bit microcontroller (Micro16). The AGC is a small specialized processor that autonomously generates addresses in a linear, 2D or 3D fashion. On the other hand, the Micro16 microcontroller is capable of generating combinations of linear, 2D and 3D patterns that are sequentially requested to the AGC in order to construct more complex stream patterns. The two units interact via a small shared register file, the External Register File (ERF). This way, while the AGC is generating a sequence of addresses, the microcontroller concurrently modifies the ERF with all the required parameters for the next regular pattern.
The combination of these two units makes it possible to describe patterns as complex as required, without ever compromising the address generation rate.
In Appendix B, examples of patterns with fundamentally different characteristics are provided, along with the Pattern Description Code that instructs the DFC to generate such access patterns.
3.4.1 Address Generation Core (AGC)
The AGC effectively emulates traditional nested loops, as found in most programming languages, by specifying, for each loop level, the number of iterations to be executed. Each level is implemented through a Loopcontrol unit that independently counts down from a starting value to zero, generating an interrupt upon completion. In addition, each of these Loopcontrol units holds the necessary parameters to determine the starting address of the AGC during the next iteration of the loop level. By combining multiple Loopcontrol units in a daisy-chain structure and by routing the interrupt signal of the innermost levels to the enable input of the outermost ones, an N-level nested loop can be designed. Figure 3.2 illustrates the required configuration to implement a 3-level loop. It should be noted that, regardless of the number of Loopcontrol units desired, the AGC is able to automatically configure all the necessary internal connections between the various elements, as its hardware description is based on generic VHDL parameters that selectively add the necessary blocks and interconnections.
Figure 3.2: AGC in a 3-level nested loop configuration.
The body of the loop is emulated by the Loopbody unit, which generates one address per clock cycle based on three basic configuration parameters: increment, multiplication, and initial value. This trio makes it possible to generate any affine linear access pattern, i.e., patterns of the type y_n = y_{n-1} × m + i, which represent the great majority of the indexing needed by most scientific applications [16]. Hence, the Loopbody address generation is controlled by the associated Loopcontrol units, which interrupt the former whenever the iteration limit in any of the nested loop levels is reached. This results in a two clock cycle delay to compute the next starting address. Considering that any delay introduced in the data fetching procedure may potentially slow down multiple PEs, it is of the utmost importance for the address generation to be essentially continuous. While this is a reasonable requirement when the supported patterns are restricted to 2D or even 3D sequential accesses, supporting arbitrarily irregular patterns with changing starting positions requires a more sophisticated approach. Therefore, for more complex patterns the configuration of the AGC relies on a double-buffered scheme, which is accomplished by duplicating the configuration registers used by the Loopbody and Loopcontrol units. With this architecture, the Micro16 is able to compute and configure the loop parameters for the following portions of the pattern, concurrently with the AGC execution, i.e., without interrupting the address generation.
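The interplay between the nested Loopcontrol levels and the affine Loopbody can be illustrated with a small software model (a sketch for a fixed 3-level configuration; the parameter names are illustrative, not the actual register names):

```python
def agc_stream(start, incr, mult, counts, steps):
    """Toy model of a 3-level AGC configuration.

    counts: iterations per level, outermost to innermost.
    steps:  address offset contributed by each of the two outer
            Loopcontrol levels when they iterate (illustrative
            restart parameters).
    The innermost level emulates the Loopbody, emitting one address
    per "cycle" according to y_n = y_{n-1} * mult + incr.
    """
    addresses = []
    for i in range(counts[0]):            # outermost Loopcontrol
        for j in range(counts[1]):        # middle Loopcontrol
            y = start + i * steps[0] + j * steps[1]
            for _ in range(counts[2]):    # Loopbody iterations
                addresses.append(y)
                y = y * mult + incr
    return addresses

# Row-major scan of a 4x4 matrix stored linearly:
row_major = agc_stream(0, 1, 1, (1, 4, 4), (0, 4))
# Column-major (transposed) scan of the same matrix:
col_major = agc_stream(0, 4, 1, (1, 4, 4), (0, 1))
```

Changing only the increment and the per-level step parameters switches between a row-major and a transposed scan, which is precisely the kind of reconfiguration the Micro16 performs on the ERF between pattern segments.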
3.4.2 Micro16 microcontroller
The Micro16 is a custom microcontroller designed to configure and control the AGC. While the overall architecture closely follows a traditional single-cycle RISC, it also comprises customized features aimed at easing the integration with the AGC. One such feature is the incorporation of an ERF, which is composed of the local register files from the Loopcontrol and Loopbody units within the AGC. Despite its external nature, any of these registers can be used as a source or destination in ALU-based operations, with no additional latency. The microcontroller encompasses a simple interrupt controller, which makes it possible to trigger and wait for events on one of the multiple interrupt lines available. Notably, one of the interrupt lines is used exclusively to communicate with the AGC. An interrupt sent from the Micro16 to the AGC lets the latter know that new parameters have been set and may be read in. Next, the microcontroller must wait on an interrupt from the AGC indicating that the parameters have been stored and the ERF may be modified again. This procedure is easily coded in assembly by first writing to the interrupt line selector register and subsequently triggering and waiting on an interrupt, through two dedicated instructions. The remaining lines are particularly useful when synchronization between multiple AGUs is needed. This is the case, for example, in applications where multiple heterogeneous cores that depend on one another are running in parallel.

The architecture of the Micro16, depicted in Fig. 3.3, was designed to be as compact as possible, since the DFC is replicated for each core in the MCPE. This promotes the scalability of the framework, while keeping its overall resource requirements low. With these constraints in mind, the datapath is 16 bits wide and the internal register file is composed of a reduced set of 4 registers, one of which is tied to zero and doubles as the interrupt line selector.
However, in order to overcome this limited number of storage elements, a stack with a depth of 32 words was incorporated, which greatly enhances the programmability of the microcontroller without compromising its resource usage. In order to ensure that the Micro16 is capable of addressing large shared memories, a base address register is included. By setting this register through a specific instruction, a 32-bit address bus is created, capable of addressing 4G words of memory. Despite its name, the Micro16 can also be configured in 32-bit mode, which effectively doubles the width of all the buses in the datapath and inevitably increases the hardware requirements. This configuration is particularly useful when a given application frequently addresses data that is scattered across the shared memory and the repeated modification of the base address register results in a decrease of the address generation rate. In this mode of operation, the instruction width is still maintained at 16 bits, in order to leverage the highly compact instruction set and to keep the instruction memory usage to a minimum.

The adoption of a 16-bit wide instruction word required the Instruction Set Architecture (ISA) to be carefully tailored in order to accommodate a set of operations large enough so as not to limit the pattern description ability of the microcontroller. In particular, all supported instructions were encoded with a minimal number of bits, although at the cost of more complex decoding logic. These optimizations resulted in an ISA containing 14 instructions, divided into four main groups:
1. Register-Register operations involving the ALU
2. Immediate constant loading operations
3. Low and High constant loading and miscellaneous operations
4. Flow control operations
ALU Operations
The ALU supports 4 distinct operations on 16-bit integers: addition, subtraction, multiplication and decrement. The operations were selected based on the particular characteristics of most pattern description codes, where arithmetic operations are more abundant than logical ones. In addition, given that the flags available for flow control are active on zero, not zero or negative, the decrement operation is more useful than an increment operation would be when describing loops. When using any of these operations, all flags are updated. It is worth noting that this group of operations permits any combination of source and destination registers, i.e., both can refer to either the Internal Register File (IRF) or the ERF. This can be easily specified in the assembly code by prepending the register number with the internal or external register file identifier (irf or erf), followed by the targeted unit within the AGC, if the external register file is selected, and the register number. For instance, register 1 of the Loopbody unit of the AGC is referred to as erf.0.r1. Conversely, when using a register from the IRF, the same reference would take the form irf.r1. The organization of the ALU control words is depicted in Tab. A.1 in Appendix A, along with the assembly syntax supported by the purposely-built assembler.

Figure 3.3: Architecture of the Micro16 microcontroller.
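This register naming convention can be captured by a small helper (a sketch of the syntax described above, not the actual assembler code; the tuple layout is illustrative):

```python
def parse_reg(ref):
    """Split a register reference into (file, unit, number).

    'irf.r1'   -> ('irf', None, 1)   register 1 of the internal file
    'erf.0.r1' -> ('erf', 0, 1)      register 1 of AGC unit 0 (Loopbody)
    """
    parts = ref.split(".")
    if parts[0] == "irf" and len(parts) == 2 and parts[1].startswith("r"):
        return ("irf", None, int(parts[1][1:]))
    if parts[0] == "erf" and len(parts) == 3 and parts[2].startswith("r"):
        return ("erf", int(parts[1]), int(parts[2][1:]))
    raise ValueError(f"malformed register reference: {ref}")
```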
Constant Loading Operations
Due to the limitations imposed by the 16-bit instruction set, the constant loading operations are divided into two instruction groups: Immediate constant loading operations and Low and High constant loading and miscellaneous operations. The existence of three constant loading instructions aims to minimize code size. In fact, the first group enables the loading of constants of up to 12 bits in a single instruction, while the two instructions of the latter group allow the loading of any 16-bit constant, albeit in two sequential steps: first the lower 8 bits are loaded into a register and then the upper 8 bits are loaded into that same register. Due to instruction word size limitations, the destination register must be part of the IRF. The two instruction word groups, along with the constant loading instructions, are presented in Tab. A.2.
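The choice between the single-instruction and two-instruction loading schemes could be sketched as follows (the mnemonics LDI, LDL and LDH are hypothetical placeholders; the actual encodings are those listed in Tab. A.2):

```python
def load_constant(value):
    """Choose the cheapest instruction sequence to load `value`.

    Constants up to 12 bits fit a single immediate load; wider 16-bit
    constants are split into a low-byte load followed by a high-byte
    load, both targeting the same IRF register.
    """
    if not 0 <= value <= 0xFFFF:
        raise ValueError("constant exceeds 16 bits")
    if value < (1 << 12):
        return [("LDI", value)]                   # one instruction
    return [("LDL", value & 0xFF),                # lower 8 bits first
            ("LDH", (value >> 8) & 0xFF)]         # then the upper 8 bits
```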
Miscellaneous operations
This group, in addition to encompassing the loading of 8-bit constants, enables the seamless integration between the Micro16 microcontroller and the AGC through two custom instructions, Wait and Done. The Done instruction must be issued whenever the registers in the ERF are modified. This indicates to the AGC that the set of constants currently stored in the ERF is ready to be read and can be copied to the internal registers of the Loopbody and Loopcontrol units. Once this copy is complete, the AGC asserts a signal indicating that the constants in the ERF have been parsed and can now be modified again. The microcontroller can wait on the assertion of this signal through the Wait instruction. Again, due to word-size constraints, this word group is further subdivided through an additional bit to define stack operations. These consist of the conventional push and pop operations, which place/retrieve any register from the IRF or ERF on/from the stack, thus greatly easing the programmability of the Micro16. This instruction word group and the associated assembly mnemonics are presented in Tab. A.3.
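The Done/Wait exchange effectively turns the ERF into a single-entry mailbox between the two units. A minimal behavioural model (a sketch only; the real blocks are hardware units synchronized by interrupt lines, not method calls):

```python
class Handshake:
    """Model of the Micro16/AGC parameter exchange through the ERF."""

    def __init__(self):
        self.erf = None        # parameters currently held in the ERF
        self.latched = []      # sets already copied into the AGC

    def done(self, params):
        """Micro16 side: write the ERF, then issue Done."""
        if self.erf is not None:
            raise RuntimeError("must Wait before rewriting the ERF")
        self.erf = params

    def agc_copy(self):
        """AGC side: copy the ERF and assert the acknowledge signal."""
        self.latched.append(self.erf)
        self.erf = None        # a pending Wait now succeeds
```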
Flow control operations
Finally, flow control is ensured by four different jump instructions, three of which are conditional. The conditional jumps are supported by three status flags, namely Zero, Negative and Not Zero. These are stored in the Program Status Register (PSR), depicted in Fig. 3.3, so that their state does not have to be polled immediately after the operation that updated the flags. Only absolute jumps are supported, through a 12-bit destination field. This means that a single jump instruction can target any of up to 4096 instructions, which is more than enough considering the size of usual pattern description codes. To facilitate the coding task, the assembler supports instruction labelling, automatically converting these labels into absolute addresses. These four operations are supported by the word group described in Tab. A.4.
Assembler
In light of the custom nature of the machine code defined by the Micro16 ISA, an assembler was developed to assist its programming. This tool offers syntax validation, label support for jump operations, and overflow checking when loading constants or specifying a jump destination. In addition, single and multi-line comments are supported, and so are decimal or hexadecimal numeric bases when specifying constants. The output of the assembler is a binary file ready to be placed in the instruction memory of the microcontroller. While the performance of the DFC is not directly influenced by these features, they greatly facilitate the pattern specification process, thereby increasing the ease-of-use of the HotStream framework, which is undoubtedly a key aspect of the proposed system. By taking advantage of such user-friendly assembly language, configuring a pattern is just a matter of populating the ERF with the relevant parameters and using the custom interface instructions to start and stop the address generation.
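Label support is typically implemented as a two-pass scheme; the following sketch assumes a 'name:' label syntax (the mnemonics shown are illustrative, not the official ones):

```python
def resolve_labels(lines):
    """Two-pass label resolution with 12-bit range checking.

    Pass 1 records the absolute address of every 'name:' definition;
    pass 2 rewrites jump operands, which must fit the 12-bit
    destination field of the jump instructions.
    """
    labels, code = {}, []
    for line in lines:                       # pass 1: collect labels
        if line.endswith(":"):
            labels[line[:-1]] = len(code)
        else:
            code.append(line)
    resolved = []
    for instr in code:                       # pass 2: substitute targets
        op, _, arg = instr.partition(" ")
        if arg in labels:
            target = labels[arg]
            if target >= (1 << 12):
                raise OverflowError("jump destination exceeds 12 bits")
            instr = f"{op} {target}"
        resolved.append(instr)
    return resolved
```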
3.4.3 Access to the Shared Memory
The buffering capabilities of the HotStream framework are ensured by a single large-capacity shared memory, usually implemented by an external Dynamic Random Access Memory (DRAM). While this may be regarded as a limitation to the system performance, as arbitrated accesses affect the effective memory bandwidth available to each core, it is one that cannot be avoided. Furthermore, DRAM devices are characterized by complex timing behaviour, as the need for periodic refreshing and the costly charging of data lines means that the access time for an arbitrary memory position is not constant. In this respect, modern Double Data Rate (DDR) memories offer special access modes to maximize the achievable data throughput. These are essentially burst-based accesses that retrieve a fixed number of sequential data beats from the DDR with one single command. Exploiting burst-based accesses to the DDR memory is, therefore, paramount to get the most out of the available memory bandwidth. Accordingly, the address stream that is generated by the developed AGC was shaped with this particular objective in mind.

As an arbitrary number of cores may be requesting data from the shared memory, a scalable and robust arbitration solution is also required. This may be achieved by adopting any of the currently available industry-standard interfaces, specifically targeted at high-performance systems. Some examples are the AXI4 specification, maintained by ARM [2], or the CoreConnect bus architecture, defined by IBM [1], which are widely supported by most FPGA makers, e.g., Xilinx, and can also be deployed in CMOS technology. To comply with the specific bus architecture adopted for each particular application, a Bus Master Controller (BMC) was developed and integrated in the streaming framework. The purpose of this unit is to perform the conversion between the stream-based interface used by the DFC and the Memory-Mapped (MM) protocol used by the bus interface (see Fig. 3.4).
To avoid placing additional pressure on the shared memory, each bus controller features a Stream-to-MM and an MM-to-Stream interface. These independent channels are arbitrated internally so that only one stream accesses the bus at a time.
Figure 3.4: Core internal structure, comprising the PE (e.g., an application specific IP Core) and the co- located BMC
Figure 3.5: Internal structure of the BMC, consisting of Write and Read control units, a Channel Arbiter and a Synchronizer block
The BMC (Fig. 3.5) comprises write and read control units, featuring appropriate internal buffers that enable the issuing of incremental burst-based transactions and, in the case of the read unit, perform data pre-fetching by requesting data that will be buffered until the Core is ready to consume it. To accomplish this, both the write and read control units consume the address stream up until the point that an increment pattern is interrupted or the burst limit, which can be selected when instantiating the component, is reached. In the particular case of the write channel, no data can be output until it is actually marked valid by the Core. Thus, a synchronizer block is also used, which forces the data and address streams to flow at the same pace.
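The burst-forming policy of the read and write masters can be sketched as follows (word-granular addresses and the burst-limit parameter are assumptions for illustration):

```python
def coalesce_bursts(addr_stream, burst_limit):
    """Group an address stream into (start, length) bursts.

    Addresses are consumed while they increment by one word; a burst
    is cut when the incremental run breaks or the configurable burst
    limit is reached, mirroring the policy described for the BMC.
    """
    bursts, start, prev, length = [], None, None, 0
    for a in addr_stream:
        if start is not None and a == prev + 1 and length < burst_limit:
            prev, length = a, length + 1
        else:
            if start is not None:
                bursts.append((start, length))
            start, prev, length = a, a, 1
    if start is not None:
        bursts.append((start, length))
    return bursts
```

A fully sequential stream thus collapses into a handful of long bursts, while an irregular stream degenerates into single-beat transactions, which is why shaping the AGC output for burst-friendliness matters.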
3.5 Data Stream Switch (DSS) and Core Management Unit (CMU)
Since all DFCs operate independently from one another, separate instruction memories are required for each. To avoid the eventual resource under-utilization that would result from using one entire Random Access Memory (RAM) for each unit, the MCPE offers the possibility of sharing a dual-port RAM, if available, between two DFCs. In addition, because the program code for each may have fundamentally different sizes, it is possible to adjust the partition of the memory space between the two. This procedure may be performed statically, i.e., included in the hardware implementation, or dynamically and in real time, directly from the Host machine. The latter functionality is supported by the combination of the Data Stream Switch and the Core Management Unit (see Fig. 3.1).

The Data Stream Switch is a simple switching element that is able to route the data stream coming from the Host to three different targets: the shared memory, the high-speed backplane and the array of instruction memories. While the first is suited for situations where a large amount of data is to be transferred to the MCPE for further processing, the second makes it possible to stream data directly to one of the kernels, thus avoiding any intermediate buffering. This mode of operation is particularly suited for applications that do not explore fine-grained access patterns and data re-use. Finally, the last target is used for programming the various instruction memories, with the help of the Core Management Unit, which defines the RAM partitions and provides feedback to the Host on the success of the programming operations.

The Core Management Unit, which is controlled by the Host through a set of hardware registers (see Tabs. A.5 and A.6), can be used to interact with the kernels on the MCPE. It is possible to define interrupt masks and acknowledge the interrupts received on an individual basis. In addition, the kernels can be reset individually and two registers with no previously defined function, UR1 and UR2, are available for satisfying the needs of particular applications.
3.6 Summary
The HotStream framework, more than a platform for the development of stream-based hardware accelerators, is a comprehensive hardware and software solution that handles all the interactions between a Host and the existing accelerators. This is achieved by a modular hardware architecture, consisting of a Host Interface Bridge (HIB) and a Multi Core Processing Engine (MCPE), supported by the combination of a device driver and the purposely designed HotStream API on the software side. The proposed organization results in a framework that is generic enough to be mapped to virtually any chip technology. In fact, the HIB can be equally implemented by a PCI Express interface, the AXI high-speed bus, or IBM's CoreConnect. The data transfers are initiated from the accelerator side by a DMA engine capable of performing 2D transactions. This is essential to equip the presented framework with one of its most distinctive features: the two-level patterned access to all the data shared between the Host and the accelerator. While coarse-grained access patterns are made possible by the DMA engine in the HIB, fine-grained pattern description of multiple parallel streams is possible within the MCPE. The Data Fetch Controllers (DFCs), which are composed of a tightly-coupled combination of a custom 16-bit microcontroller and an autonomous address generation unit, parse user-provided pattern descriptions, written in a custom-designed assembly language, and generate and manage the data streams of the associated Cores. This unit, which shares various similarities with conventional descriptor-based DMAs, such as the PPMC proposed in [20], offers significant advantages when data must be accessed with complex and long patterns.
The HotStream API complements the hardware side of the framework by providing four logical groups of methods that give access to all the features of the platform: i) Framework Management; ii) Core Management; iii) Data Management; and iv) Pattern Definition. In addition to the multiple heterogeneous Cores and associated DFCs that the MCPE supports, a high-speed backplane interconnection and a large-capacity shared memory are also available. These two units enable point-to-point, low-latency communication between the multiple cores and the reutilization and rearrangement of data streams, respectively. Finally, two auxiliary units, the Data Stream Switch (DSS) and the Core Management Unit (CMU), provide additional control over most features of the framework directly from the Host.
4 Host Interface Bridge
Contents
4.1 PCI Express Infrastructure
4.2 Address Spaces and DMA
4.3 2D DMA Transfers
4.4 Device Driver and User Interface
4.5 Summary
At the conceptual level, the HotStream framework does not depend on any particular interface technology for any of its communication links. In particular, the High Speed Interconnect depicted in Fig. 3.1 may be as easily implemented by a serial PCI Express link as by an AXI or CoreConnect bus interface, depending on the target platform. Using a parallel bus interface usually requires the accelerator and the GPP to be co-located on the same SoC, which can be fully implemented as an ASIC or as a combination of hard-wired hardware and reconfigurable fabric, as in the case of the Xilinx Zynq [12]. While this option gives access to a higher communication bandwidth and greatly eases the design, it is not always available. This is especially the case when the Host is a powerful GPP, thus requiring an off-chip interface technology, such as PCI Express, which requires special structures to be present in both communication endpoints. In addition, the logical separation between the Host and the accelerator means that (at least) two address spaces exist, thus requiring more complex software management, including the development of custom device drivers. Therefore, the HIB and accompanying software of the HotStream framework were developed with this more complex case in mind. Nevertheless, SoC-based implementations can still be easily targeted by simplifying the software management and removing the PCIe endpoint altogether.

The use of PCI Express requires a PCIe endpoint bridge to replace the Data Stream Bridge of Fig. 3.1. This bridge must be capable of mapping the address space of the accelerator, supported on a local high-speed bus interface, to the Host address space. In addition, to make a more efficient use of the available bandwidth, it supports multiple lanes operating in full duplex.
The following sections describe the elements required to support the efficient bidirectional transmission of data between the Host and the board, as well as the device driver that makes it possible to control the system at a higher abstraction level.
4.1 PCI Express Infrastructure
The PCIe endpoint on the accelerator side communicates with the root complex device on the Host. It is not necessary that both feature the same number of lanes, as the PCI Express specification supports a link negotiation stage, known as link training, which co-operatively determines the maximum width supported by the interconnected pair. In addition, for the accelerator to be properly recognized and configured during the device enumeration performed by the BIOS during the Host computer boot, a suitable set of ID values and Class codes must be set on the PCIe endpoint. These contain information regarding its function and manufacturer and are also fundamental for its identification by the device driver.

The device enumeration step is completed by reserving a block in the bus address space of the Host for the PCIe device. The size of this block is determined by the contents of the mandatory register set that the endpoint provides, which must conform with the PCIe specification. Moreover, this set of registers also indicates the available apertures. These correspond to address ranges that are later mapped to the Host memory space by the device driver. Each aperture is characterized by a starting position, indicated by a Base Address Register (BAR), and a size, and gives access to the address space of the sub-system attached to the PCIe endpoint.

Interrupts may be delivered by the accelerator to the Host by making use of Message Signalled Interrupts (MSI). In the past, PCI devices used an individual interrupt line connected directly to the Host. This out-of-band method posed several problems, the most striking of which being the possibility of the interrupt arriving at the Host before the Transaction Layer Packets (TLPs) of the corresponding transfer had been received. The MSI mechanism eliminates this synchronization problem by utilizing a conventional memory write TLP directed at a reserved address in the Host bus map. In addition, this solution also reduces pin count and greatly improves interoperability between PCIe-based devices.
4.2 Address Spaces and DMA
Gaining access to the address space of the accelerator from the Host is complicated by the multiple levels of address spaces that usually exist in the system (see Fig. 4.1). In fact, while the apertures provided by the PCIe endpoint are mapped to the bus address space, which is the same as the physical address space in architectures that do not possess an Input/Output Memory Management Unit (IOMMU) (x86 architectures fall within this category), user applications run in user space, where a contiguous and very large set of virtual addresses is available. Since this address range is not limited by the capacity of the main system memory, and to enforce address space separation between multiple concurrent processes, there is not a one-to-one correspondence between the physical memory and the virtual memory. In fact, the mapping between virtual pages and the frames in the physical memory is performed by a dedicated hardware unit known as the Memory Management Unit (MMU). Furthermore, the kernel benefits from a special region of virtual memory where the physical memory, or part of it, is linearly mapped, since no address violation prevention between processes is required. This region can be used whenever a contiguous data buffer needs to be allocated. However, the system calls available for allocating such buffers provide only a best-effort service, meaning that there is no guarantee that an arbitrarily-sized buffer can always be allocated in a contiguous region of the physical memory [9].
Therefore, two options exist when transferring a data block to or from the accelerator through the PCI Express interface: i) the user-space application requests the allocation of a kernel buffer that is large enough to contain the data to transfer and maps it to the process space, through suitable system calls; or ii) the buffer is allocated in the process space and then mapped to the physical space, where it will likely span multiple physical frames scattered across the physical memory. While the first scenario is more convenient, and greatly simplifies DMA transfers, it is not always possible to allocate buffers with the desired size, as explained above. In addition, requiring the buffer to be pre-allocated can complicate the adaptation of existing applications to the framework, thus reducing its usability. It becomes clear that the second approach is more flexible, despite requiring a more complex management of the DMA transfers, since multiple smaller blocks
Figure 4.1: Address spaces and translation mechanisms on an x86-like architecture

have to be transferred, instead of just one. In order to facilitate this management, a DMA engine with Scatter Gather (SG) capabilities can be employed. Unlike simple DMA implementations, which can only be instructed to perform one data transfer at a time, SG DMA engines read a singly linked list of descriptors, each specifying the starting address and size of the data block to be transferred. This allows the engine to perform a large sequence of data transfers from multiple locations without further CPU intervention. Figure 4.2 depicts the process of mapping a user buffer to the bus address space, through a dedicated call contained in the HotStream API, and the subsequent creation of the corresponding SG descriptors.
Figure 4.2: Mapping of a user buffer to the bus address space and subsequent creation of the corresponding SG descriptors. The size and number of the physical data chunks can vary considerably according to the size of the original buffer and the state of the physical memory
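The construction of the descriptor chain from the pinned physical chunks can be sketched as follows (the field names are illustrative; the real descriptor layout is specific to the DMA engine):

```python
def build_sg_chain(chunks):
    """Build a singly linked SG descriptor list from (bus_addr, size)
    chunks, so the DMA engine can walk the entire transfer without
    further CPU intervention."""
    descriptors = [{"addr": a, "size": s, "next": None} for a, s in chunks]
    for d, nxt in zip(descriptors, descriptors[1:]):
        d["next"] = nxt                  # link to the next descriptor
    return descriptors
```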
4.3 2D DMA Transfers
As discussed in section 3.1, the use of a DMA engine capable of performing 2D memory accesses, i.e., data transactions described by the tuple (OFFSET, HSIZE, STRIDE, VSIZE), can significantly reduce the total number of descriptors needed to implement several coarse-grained patterns. Unfortunately, this complicates the creation of the SG descriptors even further, as seen in Fig. 4.3. These complications arise because the 2D pattern must be applied to the physical data chunks that result from the mapping of the user-space buffer to the bus address space. Since these chunks can vary in size and starting position, each contiguous block defined by the HSIZE parameter may or may not fit entirely within a physical data chunk. Thus, whenever an HSIZE block exceeds the capacity of the starting physical chunk, an additional SG descriptor is required.
Figure 4.3: Application of a 2D pattern to the mapping of Fig. 4.2
Figure 4.3 represents the worst possible case, as the 2D pattern must be described by 6 SG descriptors, thus providing no descriptor savings relative to a DMA engine with no 2D support. Fortunately, in most real application scenarios, the size of the physical data chunks is such that a considerable portion of the 2D pattern is able to fit within them without further segmentation, as depicted in Fig. 4.4.
The application of the 2D pattern defined by the tuple (OFFSET, HSIZE, STRIDE, VSIZE) to the array of physical memory chunks is performed by the HotStream_2D() API call, which substitutes the previous descriptor list with a new one that exploits the specification of 2D transfers. The underlying algorithm for this method is presented in Fig. 4.5. Finally, while the HIB was designed to primarily support coarse-grained data patterns, the specification of sparse patterns based on blocks with a small HSIZE is still possible. However, the nature of the PCI Express link dictates that large data transfers make a more efficient use of the link, as the overheads introduced by the protocol are largely independent of the data transfer size and, as such, become less relevant as it increases. Thus, it is expected that the use of patterns that are too sparse or fine-grained will result in a significant penalty to the achievable throughput of the interconnection. To circumvent this problem, the HotStream API includes the gather function, already introduced in section 3.1. This call, whose operation is depicted in Fig. 4.6, gathers the various sub-blocks of size HSIZE within the provided user buffer and populates a new user-space buffer linearly with these blocks. This not only significantly reduces the number of SG descriptors
Figure 4.4: Application of a 2D pattern to a more realistic mapping between virtual and physical address space. This example highlights the descriptor savings that are possible by utilizing a DMA with 2D capabilities
pattern2d(offset, hsize, stride, vsize):
    bar = offset
    while vsize > 0:
        d = next descriptor
        if bar > d.size:                  # pattern has not reached this chunk yet
            bar -= d.size
            continue
        N = (d.size - bar) / stride       # full rows that fit in this chunk
        F = (d.size - bar) % stride       # leftover bytes after the last full row
        if N >= vsize:                    # all remaining rows fit in this chunk
            sg_push(d.addr + bar, hsize, vsize, stride)
            end
        else if F == 0:                   # chunk ends exactly on a row boundary
            sg_push(d.addr + bar, hsize, N, stride)
            vsize -= N
            bar = 0
        else if F >= hsize:               # leftover still holds one complete row
            sg_push(d.addr + bar, hsize, N + 1, stride)
            vsize -= (N + 1)
            bar = stride - F
        else:                             # last row is split across two chunks
            if N != 0:
                sg_push(d.addr + bar, hsize, N, stride)
                vsize -= N
                bar += N * stride
            sg_push(d.addr + bar, F, 1, stride)
            d = next descriptor
            sg_push(d.addr, hsize - F, 1, stride)
            vsize -= 1
            bar = stride - F

Figure 4.5: The algorithm that converts a list of SG DMA descriptors into a list featuring 2D transfers (the original flowchart, rendered here as pseudocode)
needed to complete the DMA data transfer but also potentially increases the achievable data throughput over the PCI Express link.
Figure 4.6: Operation of the HotStream gather() function, which gathers the various sub-blocks defined by a 2D pattern and places them linearly in a new user-space buffer
Naturally, this operation must be performed after the initial mapping of the user-provided buffer from the virtual to the physical address space, thus introducing an additional delay in the sequence of steps needed to complete a data transfer from the Host to the accelerator. However, this increased setup time can be effectively hidden by the increased data throughput that can be achieved over the PCIe link, which results in a shorter transfer time. Thus, for sparse/fine-grained data patterns up to a certain break-even point, utilizing the gather operation actually reduces the overall transfer time.
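The core of the gather operation amounts to a strided copy into a freshly allocated linear buffer. The following C sketch illustrates its semantics only; the function name and signature are illustrative and do not reflect the actual HotStream_gather() implementation:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch (not the actual HotStream implementation): copy the
 * VSIZE sub-blocks of HSIZE bytes each, spaced STRIDE bytes apart starting
 * at OFFSET in the user buffer, into a new buffer where they lie linearly. */
unsigned char *gather_sketch(const unsigned char *src, size_t offset,
                             size_t hsize, size_t stride, size_t vsize)
{
    unsigned char *dst = malloc(hsize * vsize);
    if (dst == NULL)
        return NULL;
    for (size_t row = 0; row < vsize; row++)
        memcpy(dst + row * hsize, src + offset + row * stride, hsize);
    return dst;
}
```

The new buffer then maps to far fewer physical chunks, which is precisely what reduces the SG descriptor count.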
4.4 Device Driver and User Interface
The HotStream API provides a high-level abstraction for the user to interface with the accelerator and, implicitly, the DMA engine and PCI Express interface. However, this is only made possible by the underlying software layers, which convert these high-level commands into simpler low-level instructions that directly control the configuration registers of both the DMA engine and PCI Express endpoint. The device driver is the key element within these layers, as it is responsible for: i) device detection and initialization; ii) creation of the corresponding device node; iii) mapping of the PCIe apertures into the bus address space and; iv) providing direct access to the multiple device registers through ioctl() calls. Developing such a driver requires a comprehensive understanding of the APIs provided by the kernel for memory allocation, PCI devices and interrupt handling, which is far beyond the scope of the present thesis. Thus, an open-source and generic driver from the MPRACE framework [28] was employed, greatly accelerating the development process. This framework is an open-source stack, primarily aimed at the development of custom FPGA boards with PCI Express interfaces. In addition to the conventional device driver that provides access to the accelerator through read(), write() and ioctl() system calls, it also features a comprehensive user-space API in C and C++, which abstracts complex procedures, such as the mapping of the PCI Express apertures to user space, the mapping of user-space buffers to the physical address space and the allocation of contiguous kernel buffers. This was used as the starting point for the development of the HotStream API, which utilizes some of the calls provided by the MPRACE API to configure the DMA transfers and to dynamically set up the address translation performed by the PCIe endpoint. In addition, some modifications were introduced to the kernel driver in order to include support for MSI and Bus Mastering. These modifications, as well as the necessary steps for configuring a data transfer by making use of the MPRACE API, are described in the following subsections.
4.4.1 Modifications to the MPRACE device driver
The two modifications made to the MPRACE device driver aim to provide support for MSI and Bus Mastering. The importance of the former has already been described in section 4.1 but ultimately depends on whether the kernel of the Host operating system was compiled with support for this type of mechanism (determined by the CONFIG_PCI_MSI config parameter in Linux kernels). If this form of interrupt handling is not available, the traditional out-of-band method must be utilized. Bus Mastering refers to the capability of the PCI Express endpoint to take ownership of the Host memory bus and issue read and write requests to the main memory system. This is indispensable for the accelerator to be capable of accessing the Host memory with minimal intervention of the latter, thus leading to higher data transfer throughputs. Since not every Host supports MSI, the PCIe endpoint must be informed of which interrupt mechanism to use. Since the out-of-band mechanism is selected by default, MSI must be enabled explicitly on the endpoint. This is done by setting a bit on a capability register defined by the PCI Express specification. Such an action must be directly performed by the kernel, as the device driver does not have permissions to do so. Instead, the latter does this indirectly by resorting to the pci_enable_msi() call of the Linux PCI API, after which the associated interrupt line is registered through the well-known request_irq() kernel call. To preserve compatibility with conventional interrupt methods, a msi_enabled flag was added to the pcidriver_privdata_t structure which overrides the legacy interrupt routines defined in the driver source code. Bus Mastering capabilities are simply activated by utilizing the pci_set_master() call of the previously mentioned API.
4.4.2 Configuring a data transfer
While the MPRACE API greatly simplifies the task of configuring data transfers to and from the accelerator, the existence of independent address spaces and a DMA engine requires a number of steps to be taken before the data begins to flow in either or both directions, even if no 2D patterns are applied. After the device file is opened, which maps the PCIe endpoint configuration registers to the device space, the pd_mapBAR() call of the MPRACE API is used to map one of the PCIe apertures to user space. This gives immediate access to the address space of the accelerator and makes it possible to access the configuration registers of the various peripherals whose address ranges are contained in this address space. After the user has provided a pointer to a data buffer, the pd_mapUserMemory() call locks it into physical memory and maps it to device space.
The resulting Scatter-Gather (SG) list indicates the location and size of each of the physical memory chunks in the Host memory, as described in section 4.2. Since the PCIe endpoint is only able to access a limited memory range, which is determined by the size of the configured aperture, it is important that this group of data chunks fits within this addressable range. Once this is verified by the software, a base address that satisfies this condition is calculated and written to a configuration register in the PCIe endpoint, which performs the address conversion from the accelerator to the Host. Thus, whenever a peripheral within the accelerator issues a write or read request to a memory position of the aperture defined by the PCIe endpoint, it is propagated to the Host with an additional offset, the aforementioned base address. Naturally, this also means that the SG list that is parsed by the DMA engine must utilize relative addresses and not absolute Host memory addresses, which is accomplished by subtracting the same base address from each entry of the list. After applying the aforementioned address conversion, the SG descriptors are written to a static memory on the HIB, according to the descriptor structure specified by the DMA engine. Once the complete list has been written, the configuration registers of the DMA engine are populated with the base addresses of the start and tail descriptors, which usually prompts the beginning of the data transfer operation. Once the interrupt that signals the end of the transfer is received by the Host, each SG descriptor on the static memory will hold the number of bytes that were successfully transferred for that particular descriptor. The resulting sum should equal the size of the specified user buffer. Finally, in the case of a data transfer from the accelerator to the Host, the explicit synchronization of the device and user buffers may be required.
This task, which is greatly simplified by the pd_syncUserMemory() call, may be necessary to ensure cache coherency when the same user buffer is used for successive data transfers. Such a step may be unnecessary if the Host supports PCI bus snooping.
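The base-address selection and SG-list rebasing described above can be sketched in a few lines of C. The structure and function names below are illustrative, not those of the actual HotStream software:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative SG entry: absolute Host bus address and chunk size in bytes. */
typedef struct { uint64_t addr; uint32_t size; } sg_entry_t;

/* Pick the lowest chunk address as the aperture base and verify that every
 * chunk falls within the addressable range of the configured aperture. */
int sg_fit_aperture(const sg_entry_t *list, size_t n,
                    uint64_t aperture_size, uint64_t *base_out)
{
    uint64_t lo = list[0].addr;
    uint64_t hi = list[0].addr + list[0].size;
    for (size_t i = 1; i < n; i++) {
        if (list[i].addr < lo)
            lo = list[i].addr;
        if (list[i].addr + list[i].size > hi)
            hi = list[i].addr + list[i].size;
    }
    *base_out = lo;
    return (hi - lo) <= aperture_size;
}

/* Convert the absolute Host addresses into aperture-relative addresses by
 * subtracting the base that was programmed into the PCIe endpoint. */
void sg_rebase(sg_entry_t *list, size_t n, uint64_t base)
{
    for (size_t i = 0; i < n; i++)
        list[i].addr -= base;
}
```

The endpoint later adds the same base back when it propagates each request to the Host, so the two translations cancel out.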
4.5 Summary
Despite the support for various Host-accelerator interfaces that the HotStream framework provides, implementing an off-chip interconnect is typically more difficult than resorting to an intra-chip solution, such as AXI or CoreConnect. Thus, the design of the HIB and accompanying device drivers was specifically targeted at a PCI Express connection, as this represents the worst case in terms of design complexity. Adapting the developed hardware/software combination to an intra-chip solution would require significantly less effort.
Independent of the chosen Host-to-accelerator communication technology is the potential involvement of multiple address spaces in all data transfers. In fact, if the Host runs a full-fledged operating system, the existence of physical and virtual address spaces greatly complicates the task of transmitting data buffers to and from the accelerator. In such a scenario, the device driver is responsible for mapping the user-provided memory buffer to the bus device space, which is then accessible from the accelerator. However, while the virtual memory space is laid out linearly, the mapping of a contiguous memory buffer to a physical space often results in a collection of non-correlated, arbitrarily located physical data chunks. Thus, for the efficient transfer of such chunks to the accelerator, a DMA engine with Scatter-Gather (SG) capabilities is required, as it greatly reduces the Host's CPU intervention in the data transfer process.
The use of a DMA engine capable of 2D transfers can greatly reduce the number of descriptors needed to complete a data transfer, in addition to the advantages outlined in Section 3.1. However, the 2D pattern must be applied to the chunks of physical memory that resulted from the initial mapping from the virtual address space to the bus device space and not to the original, user-provided, contiguous buffer. Such a task is nontrivial and can even result in no descriptor savings, as in the case depicted in Fig. 4.3.
However, most real scenarios resemble the situation depicted in Fig. 4.4, where a considerable portion of the 2D pattern fits within a single physical memory chunk, thus resulting in a significant descriptor saving. The task of applying a 2D pattern to a list of physical memory chunks is accomplished by a purposely designed algorithm that takes this initial list and produces a new set of descriptors ready to be parsed by the DMA engine.
The device driver for the PCI Express based accelerator was not developed from scratch. Instead, the open-source and generic driver from the MPRACE framework [28] was used. Modifications to the interrupt handling procedures were introduced and bus mastering capabilities were added, thus enabling the DMA engine to effectively take ownership of the Host memory bus.
5 Framework Prototype
Contents
5.1 AXI Interfaces
5.2 HIB Implementation and Performance
5.3 Backplane Implementation and Performance
5.4 Shared Memory Performance
5.5 Summary
The proposed HotStream framework represents a generic design that can be easily mapped to different targets, i.e., a reconfigurable device, an ASIC or even a combination of the two, using, for example, a Zynq SoC. However, while the conceptual design and software layers can be developed with no particular target technology in mind, it is not possible to fully characterize the platform and the performance of its various communication links without considering a particular implementation. Moreover, to properly evaluate the proposed platform, the test environment must provide a high-speed interconnection between the accelerator and the Host, such as a PCI Express interface, and a large-capacity off-chip memory providing significant bandwidth. The nature of these requirements means that the prototyping of the framework must be done on an FPGA, which greatly simplifies the process of interfacing with the off-chip memory and the PCI Express interconnection. Thus, the Xilinx VC707 development board was selected, which is powered by a state-of-the-art Virtex 7 FPGA [40], coupled with a high-performance DDR3 memory, offering 512 MB of capacity and a peak bandwidth of 12.8 GB/s. Host connectivity is ensured by a PCI Express interface with 8 lanes, capable of supporting Gen2 speeds. On the Host side, a powerful Intel Core i7 3770K processor clocked at 3.5 GHz is at the heart of a machine with 16 GB of DDR3 memory running at 1.866 GHz. These components, in addition to the VC707 development board, are hosted by an Asus P8Z77-V LX motherboard. The following sections present a thorough characterization of the communication channels that compose the HotStream framework when mapped to the aforementioned development board, as well as an evaluation in terms of performance and area of key elements of the framework, such as the DFCs.
Finally, a case study based on the multiplication of very large matrices is presented, which aims not only to highlight the capability of the framework to support multiple concurrent streams being consumed and produced by various heterogeneous kernels, but also to give an insight into the performance gains that can be expected when using the proposed HotStream framework instead of other conventional accelerator architectures.
5.1 AXI Interfaces
Given the choice of a Xilinx development board for prototyping the framework, and since most of the IP Cores available for its devices make use of the AMBA AXI family of interfaces [2], this was selected as the interconnection solution to be used across the HotStream framework. The multiple interface variants contained in this specification allow for a tight correspondence between the requirements of each interconnection and the capabilities of the corresponding interface. Thus, for high-performance links within the system, such as the interface between the HIB and MCPE and the access to the shared memory, the AXI4 protocol was used, as it provides bi-directional data transfers with burst support of up to 256 data beats. For register-mapped interfaces, on the other hand, the simpler AXI4-Lite protocol, with no support for bursting, was preferred in order to keep hardware resource usage to a minimum. Finally, in order to ease the development of accelerator solutions based on the HotStream
framework, the streaming cores must comply with the AXI4-Stream protocol. This protocol is targeted at low-resource, high-bandwidth unidirectional data transfers. Flow control is implemented through two signals, TVALID and TREADY, and support for bursts of arbitrary length is ensured by a TLAST signal. Additional control signals exist, such as null-beat indicators and routing information, but these are optional and depend on the particular requirements of the application at hand. The protocol consists of completely symmetric master and slave interfaces that can be connected directly. Moreover, no restriction to the width of the data channel exists, which further increases the flexibility of the solution. It is important to note that, while the prototype version of the proposed framework was designed to host cores with one or more AXI4-Stream interfaces, other stream-based interfaces can be easily adapted, provided they utilize the same two-way flow control mechanism, also known as handshake, depicted in Fig. 5.1. This is the case, for instance, with the Avalon Stream interface, widely adopted in Altera-based designs.
Figure 5.1: Basic handshake principle utilized by the AXI4-Stream and other similar stream-based protocols. Retrieved from [2]
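The two-way handshake of Fig. 5.1 can be summarised in a few lines of behavioural C: a data beat is transferred in a given clock cycle if and only if TVALID and TREADY are both asserted in that cycle. This is a sketch of the protocol rule, not RTL:

```c
#include <stdbool.h>

/* A beat crosses the interface only when the master drives TVALID and the
 * slave drives TREADY in the same cycle; either side may stall the other. */
bool beat_transferred(bool tvalid, bool tready)
{
    return tvalid && tready;
}

/* Count the beats transferred over a window of clock cycles. */
int count_beats(const bool *tvalid, const bool *tready, int cycles)
{
    int beats = 0;
    for (int i = 0; i < cycles; i++)
        if (beat_transferred(tvalid[i], tready[i]))
            beats++;
    return beats;
}
```

Because the rule is symmetric, any stream interface with the same two-signal handshake, such as Avalon Stream, can be adapted with a thin wrapper.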
5.2 HIB Implementation and Performance
The two main components of the HIB, the DSB and the DMA engine, were both implemented by off-the-shelf Xilinx IP Cores. The DSB was implemented by the AXI Bridge for PCI Express [39], which handles the low-level interactions with the hard silicon PCI Express endpoint available in the VC707 development board and provides a convenient AXI4 interface, allowing easy integration with the remaining components of the framework. Similarly, the DMA Engine was implemented by the AXI DMA engine [5], which meets all the requirements outlined for this unit during the framework description, such as Scatter-Gather capabilities and 2D data accesses. Even with off-the-shelf components, making efficient use of the available bandwidth on the PCI Express link is far from trivial and several factors have to be taken into account. These include the inevitable protocol overhead introduced by the communication mechanism, based on Transaction Layer Packets (TLPs), the symbol encoding used to reduce transmission errors on the physical layer, and the remaining system components [17]. However, since these factors greatly vary with the characteristics of the transmitted data, the quoted figures for PCI Express performance usually refer to the raw aggregate bandwidth that can be sustained by both directions simultaneously (Fig. 5.2). However, if the characteristics of the PCI Express protocol are taken into consideration, the actual achievable throughput can be reduced by as much as 20% in some situations, as determined by Goldhammer and Ayer in [17] (see dashed line in Fig. 5.2).
Figure 5.2: Aggregate throughput of various PCI Express configurations. The dashed line accounts for protocol overhead as per [17]
While the hard silicon PCI Express controller on the VC707 development board supports lane widths of up to 8× at Gen1 and Gen2 speeds, the AXI Bridge for PCI Express IP Core is more limited and only supports a subset of the possible configurations. In addition, the width of the AXI interface on the accelerator side is also a key parameter as it may impose a bottleneck for data transfer performance. This happens if the throughput on the accelerator side does not at least equal the aggregate throughput achieved by the PCI Express link. Table 5.1 summarizes the configurations supported by this IP Core.
Table 5.1: PCI Express Gen1 and Gen2 support on the AXI Bridge for PCI Express IP Core

No. of Lanes   AXI Data Width   Gen1   Gen2
x1             64               Yes    Yes
x2             64               Yes    Yes
x4             64               Yes    No
x4             128              Yes    Yes
x8             128              Yes    No
At this point, it is important to note that the PCI Express bridge and DMA engine cannot operate at frequencies higher than 100 MHz. This constraint arises from the need to clock the add-in card with the central motherboard clock, in order to ensure an accurate frequency lock (as discussed in Xilinx's Answer Record AR# 18329). Thus, the highest bandwidth the AXI interface can provide is 1.6 GB/s (16 B × 100 MHz) in each direction, given that the AXI4 protocol provides independent read and write channels. Again, this does not account for arbitration latencies or idle periods between successive bursts. This creates a throughput ceiling that effectively limits the achievable data transfer performance to 3.2 GB/s, meaning that both the x8 @ Gen1 and x4 @ Gen2 configurations cannot be fully exploited and will yield the same results. Given that the Gen2 implementation of the PCI Express bridge uses slightly more resources than Gen1, the x8 @ Gen1 combination was preferred.
The HIB was tested by setting up back-to-back transfers with varying buffer sizes and measuring the elapsed time between writing the address of the tail descriptor to the AXI DMA registers and the moment the interrupt signalling the end of the receive operation was received. For each buffer size, the procedure was repeated 100 times and the minimum elapsed time registered, in order to minimize as much as possible the impact of context switching and other non-deterministic behaviour on the host machine. In order to guarantee enough precision in the interval measurement, the gettimeofday() Linux call was employed, which provides resolutions as high as 1 µs. The results from this experiment are represented in Fig. 5.3.
Figure 5.3: Measured aggregate throughput for back-to-back transfers with varying buffer size
As expected, the achievable throughput rises with the increase in buffer size, since this masks the various overheads present on the system. From 256 KB onwards, the transfer rate saturates at around 2.3 GB/s, 72% of the theoretical limit of the AXI interface, which may be justified by the aforementioned arbitration phases and the latency between bursts, among other factors. Nevertheless, the reduced performance for smaller sizes can only be attributed to either the PCI Express bridge or the DMA engine. This can be confirmed by attaching a Chipscope monitor to the PCI Express bridge AXI Slave port, where the read and write requests from the DMA engine are received. Figure 5.4 depicts two waveforms obtained with the Chipscope Analyzer software, which correspond, respectively, to the AXI transaction of a 4 KB buffer from the PCI Express bridge to the DMA engine and vice-versa. By taking note of the elapsed time between the start and end of the successive bursts, a send throughput of 870 MB/s and a receive throughput of 1097.26 MB/s can be calculated, which results in an aggregate bandwidth of 1967.26 MB/s. This is significantly larger than the 53.51 MB/s obtained experimentally (Fig. 5.3) and confirms that the transfer time is being limited by one of the two structures of the HIB and not by the interconnection.
Figure 5.4: Chipscope waveforms obtained during a back-to-back transfer of 4 KB: (a) AXI transaction from the PCI Express bridge to the DMA engine; (b) AXI transaction from the DMA engine to the PCI Express bridge
It should be noted that the numbers presented above do not include the time required for the HotStream API to set up both transfers, which inevitably reduces the effective throughput. Because of this, the methods that handle this task within the API were carefully optimised so as to reduce unnecessary operations and branching. Nevertheless, this time can be significant, mainly when dealing with the transfer of small data chunks, and can be decomposed into two components: i) compulsory operations that result in a constant execution time and; ii) a variable execution time that is dependent on the number of physical chunks that resulted from mapping the user-provided buffer to the physical memory. The time elapsed during the configuration of both the send and receive transactions for a varying buffer size is depicted in Fig. 5.5. By including this initial latency in the numbers of Fig. 5.3, it can be observed that, as expected, the performance drop is more pronounced when smaller chunks are transferred (Fig. 5.6).
Figure 5.5: Time elapsed during the configuration of the send and receive transactions
Figure 5.6: Aggregate throughput for a back-to-back transfer including the time taken for the transaction set-up
Again, the results presented in this section refer to a x8 @ Gen1 PCI Express link and an AXI interface with a data width of 128 bits. Similarly, the AXI DMA engine, instantiated with 2D data access support, also utilizes 128-bit interfaces, in order to avoid bottlenecks due to bandwidth mismatches. It should be noted that, during these experiments, the support for multiple streaming destinations was not activated, as it significantly reduces the data copying performance, since it is incompatible with descriptor queueing [5]. Descriptor queueing essentially refers to a double-buffering mechanism which automatically fetches the subsequent SG descriptors in a chain, while the current ones are being processed, and stores them in a FIFO. However, in order to fully comply with the HotStream framework specification, the HIB must be capable of equally streaming data to the shared memory, any individual core or one of the associated instruction memories. This requires the support for multiple streaming destinations to be enabled and, therefore, a significant performance drop relative to the results presented in this section should be expected.
5.3 Backplane Implementation and Performance
As discussed in Sec. 2.4 and Sec. 2.3, the high-speed, high-connectivity backplane can be equally implemented by a NoC or by a more conventional Crossbar-based solution. Following the survey of publicly available NoC implementations, the Hermes network [30] was selected as the best balance between performance and resource usage. On the opposite side of the comparison, and given the adoption of the AXI family of interfaces for prototyping the HotStream framework, the AXI-Stream Interconnect [38] was naturally selected as the Crossbar against which the NoC must be compared.
The following subsections briefly discuss the main characteristics of both the Hermes NoC and the AXI Stream Interconnect Crossbar to determine which leads to the best implementation of the high-speed backplane for the specific case of an FPGA-based design (the trade-offs involved when targeting ASICs, for instance, are fundamentally different [36]).
5.3.1 Hermes NoC
The central element of the Hermes NoC is the Hermes Switch [30]. It encompasses five bi-directional ports, four of which are used to establish connections to the neighbouring switches, while the fifth ensures the communication with the local IP Core. The first four ports can be expanded through the use of virtual channels, which have the ability to reduce traffic congestion at the cost of an increased resource usage. After the round-robin based arbitration step is complete, the XY routing algorithm is used to connect the input port to the correct output port. Since wormhole switching is utilized, the routing decision is performed upon the reception of the header flit, while the second flit indicates the size of the payload. After all payload flits have been routed, the input-to-output mapping is marked as free.
The network structure is of the 2D Mesh type, meaning that the outer switches only possess three ports instead of five. The selection of this type of arrangement is justified by the easier placement and simpler routing algorithm. Thus, different switches have different peak performances, depending on their location within the mesh. In fact, while inner switches can theoretically maintain five simultaneous connections, outer switches see this number reduced to three. Since each flit takes two clock cycles to be sent, the aggregate peak bandwidth of a network with N × N nodes is given by:
$$\mathit{PeakThroughput} \;=\; 3 \times \mathit{flit}_{width} \times \sum_{i=1}^{4N-4} \frac{f}{2} \;+\; 5 \times \mathit{flit}_{width} \times \sum_{i=1}^{N(N-4)+4} \frac{f}{2} \qquad (5.1)$$
For a 3×3 network with 32-bit flits and a frequency of 100 MHz, the peak throughput is thus 46,400 Mbit/s, or 5,800 MB/s. According to the authors, the minimal latency, i.e., in the absence of network contention, to transfer a packet from a source to a target switch can be expressed in clock cycles as:
$$\mathit{MinLatency} = \left( \sum_{i=1}^{n} R_i \right) + P \times 2, \qquad (5.2)$$
where n is the number of switches in the communication path (otherwise known as hops), $R_i$ is the execution time of the routing algorithm at each switch (at least 10 clock cycles) and P is the packet size, which is multiplied by 2, since a single flit is sent in two clock cycles.
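Both expressions are easy to check numerically. The helpers below (names are illustrative) reproduce the 3×3 figure quoted above and give a worked example of Eq. (5.2):

```c
/* Peak aggregate throughput (bit/s) of an N x N Hermes mesh, per Eq. (5.1):
 * the 4N-4 border switches sustain 3 simultaneous connections, the
 * N(N-4)+4 inner switches sustain 5, and each connection delivers one flit
 * every two clock cycles. */
double hermes_peak_bps(int n, int flit_width, double freq_hz)
{
    double per_link = (double)flit_width * freq_hz / 2.0;
    return 3.0 * (4 * n - 4) * per_link + 5.0 * (n * (n - 4) + 4) * per_link;
}

/* Minimal contention-free latency in clock cycles, per Eq. (5.2): the sum
 * of the routing times along the n-switch path plus two cycles per flit. */
int hermes_min_latency(const int *routing_cycles, int n, int p_flits)
{
    int cycles = 0;
    for (int i = 0; i < n; i++)
        cycles += routing_cycles[i];
    return cycles + 2 * p_flits;
}
```

For n = 3, 32-bit flits and f = 100 MHz, hermes_peak_bps() yields 46.4 Gbit/s, matching the 46,400 Mbit/s above; a three-hop path with the minimum 10 routing cycles per switch and a 16-flit packet costs 3 × 10 + 16 × 2 = 62 cycles.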
5.3.1.A Modified packet structure
As it is proposed by the authors, the Hermes Switch requires the number of flits in a packet to be known before a transfer is initiated. While this is not a problem if the payloads are fixed in size, if the size of the messages varies over time, it is necessary to buffer an entire packet before populating the size flit. This naturally increases both the hardware resource utilization and the latency of the network. To circumvent this limitation, an end-of-packet (EOP) bit was added to each flit, so that the Switch can determine when to terminate the ongoing connection.
One drawback of this solution is that the routing-information overhead is no longer fixed but, instead, dependent on the payload size. For instance, considering 32-bit flits, the overhead of the proposed modification is smaller than that of the original implementation up to a payload size of 32 flits, after which it continues to increase while the original remains fixed.
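The break-even point follows from a simple comparison: the original format pays a fixed one-flit (here 32-bit) size field per packet, while the modified format pays one EOP bit per flit. A sketch of that comparison, under the simplification of counting only payload flits:

```c
/* Fixed overhead of the original Hermes packet: one full flit holding the
 * payload size. */
int overhead_original_bits(int flit_width)
{
    return flit_width;
}

/* Overhead of the modified packet: one EOP bit per payload flit, so it
 * grows linearly with the payload size (payload-only simplification). */
int overhead_eop_bits(int payload_flits)
{
    return payload_flits;
}
```

With 32-bit flits, the EOP scheme is cheaper for payloads below 32 flits, breaks even at 32 and becomes more expensive beyond that, matching the behaviour described above.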
5.3.2 AXI Stream Interconnect
The AXI4-Stream protocol is part of the AXI4 specification and is targeted at low-resource, high-bandwidth unidirectional data transfers. Flow control is implemented through two signals only, TVALID and TREADY, and support for bursts of undetermined length is ensured by a TLAST signal.
The AXI Stream Interconnect is an IP core developed by Xilinx to be used with its standard design tools. Its main functionality is to provide an efficient routing mechanism to allow the communication between multiple AXI4-Stream masters and slaves. In addition, it includes a collection of modules that further improve the IP functionality, such as bus width and clock conversion, pipelining and data buffering.
At the heart of the AXI Stream Interconnect is an arbitrated Crossbar, capable of interconnecting up to 16 masters and slaves with varying degrees of connectivity, i.e., a programmable connectivity map lets the user specify full or sparse Crossbar connectivity. The arbitration can either be round-robin or priority-based, with statically assigned priorities. In addition, it is possible to define when the arbitration mechanism is applied: at TLAST boundaries, after a set number of transfers and/or after a certain number of idle cycles.
Naturally, as the degree of connectivity of the Crossbar is increased, its area, throughput and latency will be affected. As such, combinations of N-master × 1-slave and 1-master × N-slave are preferable to M×N interconnects. The relevant user guide even goes so far as to suggest that, when M×N interconnects are absolutely necessary, the number of endpoints should be kept low or sparse connectivity should be specified [38].
The latency involved in each transfer depends on the particular configuration of the IP for each interface. The Crossbar switch itself inserts 2 clock cycles of latency, but the addition of a register slice or FIFO buffer adds an additional 1 and 3 clock cycles of latency, respectively.
On the other hand, the throughput of a datapath through the interconnect depends only on the data width and clock frequency used along the path. Therefore, its maximum throughput is limited by the slowest component. The peak aggregate throughput will be simply given by the peak throughput between each master-slave pair, multiplied by the number of pairs that can communicate simultaneously.
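This rule reduces to a one-line calculation. The sketch below applies it to the 32-bit datapath and the 144 MHz clock reported later in section 5.3.4, assuming 4 master-slave pairs active simultaneously:

```python
def peak_aggregate_throughput(clock_hz, data_width_bytes, active_pairs):
    """Peak aggregate throughput of the interconnect: the per-pair peak
    (one data beat per cycle) multiplied by the number of master-slave
    pairs that can communicate simultaneously."""
    return clock_hz * data_width_bytes * active_pairs

# Example: 32-bit (4-byte) datapath at 144 MHz with 4 concurrent pairs.
mb_s = peak_aggregate_throughput(144e6, 4, 4) / 1e6  # -> 2304.0 MB/s
```

The result matches the 2,304 MB/s Crossbar figure derived in section 5.3.4.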
49 5. Framework Prototype
5.3.3 Backplane Performance Evaluation
While the performance of a bus-based design can be easily evaluated by understanding its internal architecture, arbitration algorithm and overall latencies, determining the realistically achievable throughput and latency of a Network on Chip is considerably more complicated, as it greatly depends on the nature of the traffic it is subjected to, even more so than on the parameters of the network itself [13]. Moreover, even if a complete analytical description of the traffic behaviour were available, the resulting analysis would be very complicated, as the routing and arbitration decisions at each node must be taken into account, which becomes increasingly difficult as the size of the network grows. Therefore, it is common practice to evaluate the performance of such networks by resorting to software models or even to behavioural simulation of the actual Register Transfer Level (RTL) description of the circuit. In order to perform the Backplane evaluation, a behavioural description of a core emulator was developed to easily simulate the network under various real-use conditions. Performance measurements were obtained by building a VHDL testbench that logs every event across the network, along with its timestamp, to a text file. This text file is then parsed by a Python script, which calculates the latency incurred by each packet transfer and presents a final summary of the overall latency and throughput in the delivery of the injected data to the various nodes. According to [13], these are the most important performance metrics of an interconnection network. The following sections describe in greater detail the various features of the core emulator and the structure of the testbench used to monitor the network activity.
5.3.3.A Core Emulator and Stream Wrapper
In order to keep the core emulator as generic as possible, thus facilitating its reuse in other circumstances, it was designed with an interface that is fully compatible with the AXI-Stream specification. In addition, extensive use of generic parameters was made, allowing the configuration not only of basic parameters, such as the width of the data ports, but also of the emulation parameters described in greater detail below. The module supports two distinct operation modes: pipelined and non-pipelined. The first mode is considerably simpler and reduces the emulator to a one-slot FIFO, which receives one data beat and immediately tries to output it in the next clock cycle; meanwhile, no further data is accepted. This contrasts with the non-pipelined mode, which includes a non-zero latency before sending or receiving new data, effectively simulating a non-pipelined computation. Most of the remaining configuration parameters of the core emulator apply to this mode of operation and define whether the core starts in a sending or a receiving state, as well as the waiting period between these two states. It is also possible to configure the core in half or full duplex, i.e., to specify whether data is sent and received independently and simultaneously or, on the other hand, whether the two phases are serialized. Regardless of the chosen configuration, the module behaves in a cyclical manner, with a period that is also user-defined. Given that the core emulator was designed with a conventional stream interface in mind, a
wrapper is needed to make it compatible with the Hermes IP core interface. Fortunately, by selecting the credit-based flow control mechanism for the NoC, the wrapper's task is greatly simplified, as this mechanism is quite similar to the stream interface used by the emulator. Its job is thus reduced to adding the header flit, which indicates the destination of the payload that follows, and asserting the end-of-packet bit on the last flit of the payload. The number of flits per packet is determined by the burst size parameter.
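The wrapper's packetization step can be sketched as follows. The flit layout used here, a tuple of an end-of-packet bit and a 32-bit data word, is an assumption for illustration only; the actual Hermes flit format differs in its field encoding:

```python
def wrap_packet(dest, payload_flits):
    """Prepend a header flit carrying the destination address and tag the
    last payload flit with the end-of-packet (EOP) marker.
    Assumed flit layout for illustration: (eop_bit, 32-bit word)."""
    packet = [(False, dest)]  # header flit: routing information only
    for i, flit in enumerate(payload_flits):
        is_last = (i == len(payload_flits) - 1)
        packet.append((is_last, flit))
    return packet

# A 3-flit burst addressed to node 3 becomes a 4-flit packet.
pkt = wrap_packet(dest=3, payload_flits=[0x11, 0x22, 0x33])
```

This also makes the overhead discussed in section 5.3.1 explicit: one header flit per packet, regardless of payload length.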
5.3.3.B Testbench and Python script
In order to monitor the activity over the network, a small set of signals was routed from the Hermes network to the output of the unit under test (UUT), which consisted of the network with all the emulators properly attached. These signals were then evaluated by the testbench, which registered an entry in a text log whenever an event occurred, i.e., whenever a payload flit was sent or received by any of the attached cores. To allow the computation of latency data, the core emulators were modified to include, in the generated payload, a timestamp and the number of the node currently emitting the flit. For each entry in the log file, the Python script compares the current time to the timestamp carried by the flit, thus computing the average latency over all transactions. The average throughput of the data delivered to the multiple cores is obtained by dividing the number of received bytes by the time taken to complete all transactions. This is then compared to the throughput generated by all the core emulators combined.
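The post-processing step can be sketched as below. The log format (one receive event per line, with an assumed column order of receive time, node number and injection timestamp) is hypothetical; the thesis' actual log layout is not specified here:

```python
def parse_log(lines, bytes_per_flit=4):
    """Compute average latency and delivered throughput from a receive log.
    Assumed line format: '<recv_cycle> <node> <inject_cycle>'."""
    latencies, total_bytes, first, last = [], 0, None, None
    for line in lines:
        recv, _node, sent = line.split()
        recv, sent = int(recv), int(sent)
        latencies.append(recv - sent)          # per-flit transfer latency
        total_bytes += bytes_per_flit
        first = recv if first is None else min(first, recv)
        last = recv if last is None else max(last, recv)
    avg_latency = sum(latencies) / len(latencies)
    throughput = total_bytes / (last - first + 1)  # bytes per clock cycle
    return avg_latency, throughput

log = ["12 0 2", "13 1 2", "14 0 4"]
avg, thr = parse_log(log)
```

The delivered throughput can then be compared against the combined injection rate of the emulators, as described above.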
5.3.3.C Results
For simulation purposes, the Hermes NoC was configured as a 2 × 2 mesh with 32-bit-wide flits and 32-flit-long buffers at each of the two Virtual Channels available on each switch. The performance tests mainly focused on two key aspects: the effect of the payload size of each packet on the overall throughput, and the way in which traversing more or fewer routers affects the average latency. To accomplish these objectives, the test cases use worst- and best-case routing, as well as a variable burst length, from 8 to 1000 flits per packet. Worst-case routing corresponds to the situation where each node is instructed to send traffic to the node that sits diagonally across from it, as this leads to the largest number of hops until the destination is reached. On the other hand, in the best-case routing scenario, each core sends information to its neighbour, reducing the number of hops to the minimum possible value. When best-case routing is selected, nodes 0 and 1 first send 5,000 data flits to nodes 2 and 3, respectively, after which their roles are reversed. This scenario is depicted in Fig. 5.7a. When using worst-case routing, Fig. 5.7b, the procedure is analogous, but now nodes 0 and 1 start by sending packets to nodes 3 and 2, respectively, which, as a consequence of XY routing, forces the packets to traverse the longest possible path; in this simple case of a 2×2 network, this corresponds to one intermediate hop. It is worth noting that, while nodes 0 and 1 are simultaneously sending and receiving data, the performance of the network will not be affected
51 5. Framework Prototype
(a) Best-case routing (b) Worst-case routing

Figure 5.7: Traffic patterns for the NoC simulation

since each router contains independent send and receive channels on each of its virtual channels. The results presented in Fig. 5.8 show that, as the burst size increases, so does the throughput of the data delivered to the IP Cores, growing asymptotically to 8 bytes per clock cycle, which corresponds to the amount of data the two active cores inject per clock cycle. This is an expected result because, as the burst size increases, the impact of the overhead introduced by the routing decision is reduced, since wormhole switching is employed. It can be concluded that IP Cores with controller-like behaviour, occasionally issuing small data bursts, will suffer the biggest penalty. Likewise, and for the exact same reason, the latency decreases with increasing burst size, as depicted in Fig. 5.9, down to a minimum of 10 clock cycles, which corresponds to the latency required to traverse a single router, as described in section 5.3.1.
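This asymptotic behaviour follows from a simple first-order model: under wormhole switching, the per-packet routing overhead is paid once per packet, so the delivered throughput approaches the injection rate as the burst grows. The 10-cycle overhead used below is illustrative (taken from the single-router latency mentioned above), not a measured per-packet figure:

```python
def delivered_throughput(inject_rate, burst_flits, overhead_cycles):
    """First-order wormhole model: a packet of `burst_flits` flits occupies
    the path for `burst_flits + overhead_cycles` cycles, so the effective
    rate is the injection rate scaled by that efficiency."""
    return inject_rate * burst_flits / (burst_flits + overhead_cycles)

# With 8 bytes/cycle injected and an assumed 10-cycle per-packet overhead:
small = delivered_throughput(8.0, 8, 10)     # short bursts pay heavily
large = delivered_throughput(8.0, 1000, 10)  # approaches 8 bytes/cycle
```

The model reproduces the qualitative trend of Fig. 5.8: throughput rises monotonically with burst size and saturates at the injection rate.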
[Plot: delivered throughput, in bytes per clock cycle (2 to 16), versus burst size, in bytes (8 to 800), for three traffic configurations: Best-Case Routing, Worst-Case Routing, and Best-Case Routing Full Duplex]
Figure 5.8: Data delivery throughput in various traffic configurations. In every configuration, the input throughput is reached asymptotically
Finally, while maintaining the conditions of the previous experiment, the core emulators were configured to work in full-duplex mode. This results in all 4 nodes simultaneously inserting 4 bytes per cycle, leading to an aggregate injected bandwidth of 16 bytes per cycle. Again, the existence of independent send and receive interfaces in each router results in the same behaviour as in the previous two cases, with regard to both throughput and latency, as depicted by the dotted lines with triangular markers in Fig. 5.8 and Fig. 5.9.
[Plot: average latency, in clock cycles (0 to 70), versus burst size, in bytes (8 to 800), for Best-Case Routing, Worst-Case Routing, and Best-Case Routing Full Duplex]
Figure 5.9: Source-to-destination latency when using best-case or worst-case routing. Duplicating the inserted data throughput does not affect latency
5.3.4 Crossbar and NoC Comparative Evaluation
The rationale behind the adoption of Networks on Chip for very large SoCs is mostly the increasing cost, both in terms of area and communication delay, of the interconnection "wires". This is especially true for ASICs [18], but FPGA designs are also increasingly constrained by this issue, as new technology nodes make transistors smaller and faster, while "wires" become comparatively slower [27]. In this section, the NoC and Crossbar solutions discussed above are compared, to determine whether the former is advantageous for FPGA-based designs. Table 5.2 summarizes the resource utilization of the Hermes NoC in the configuration used in section 5.3.3, i.e., a 2 × 2 mesh with 4-byte-wide flits and 32-flit-long buffers with 2 Virtual Channels per port. In the same table, the hardware requirements for interconnecting a varying number of cores with the AXI Stream Interconnect Crossbar are presented. To guarantee a fair comparison between the two, the Crossbar was configured with a 32-bit datapath, round-robin arbitration and a data FIFO with a depth of 32 elements in each interface, which is analogous to the flit buffer in the Hermes NoC. Moreover, full connectivity was enabled in the switching element, so that any master interface can establish a connection with any slave port.
Table 5.2: Hardware utilization of the Hermes NoC configured in a 2×2 mesh and the AXI Stream Interconnect Crossbar for a varying number of independent cores
                Available    Hermes NoC    AXI Stream Interconnect Crossbar
    Resources                2 × 2 Mesh    4 Cores    8 Cores    16 Cores    4 Cores (@ 128 bit)
    Regs        607,200      15,935        1,773      4,584      8,656       5,229
    LUTs        303,600      7,886         1,125      2,981      8,882       2,448
    Slices      75,900       5,587         849        1,969      4,875       1,989
    Max. Freq.  -            200 MHz       144 MHz    133 MHz    133 MHz     146 MHz
The results presented in table 5.2 clearly show a significant gap between the Crossbar and
NoC solution in terms of hardware requirements. In fact, for the same number of interconnected cores, the Crossbar requires 6.6× fewer slices than its NoC counterpart. As expected, the Hermes NoC offers better performance in terms of attainable clock frequency, which confirms the reduction in the average length of the interconnecting "wires"; however, the hardware overhead introduced by its multiple switches means that its scalability is very limited in FPGA designs. Moreover, the smaller area footprint of the Crossbar solution leaves room for increasing the interconnection width, thus potentially increasing the peak throughput, while still keeping the resource utilization lower than that of the equivalent NoC configuration.
For equivalent ”wire” widths, however, the peak throughput of the NoC will be superior to that of the Crossbar due to the higher clock frequency. In fact, by replacing the variables in equation 5.1, derived in section 5.3.1, with a frequency of 200 MHz, and a 2×2 mesh, a peak throughput of 4,800 MB/s is obtained. On the other hand, at a clock frequency of 144 MHz, the crossbar yields a peak throughput of 2,304 MB/s.
Since the Crossbar is never subject to traffic contention, its bandwidth does not vary with traffic behaviour as in the NoC. Table 5.3 provides a comparison between the throughput and latency of the Crossbar and NoC for the worst-case routing scenario depicted in Fig. 5.9 and Fig. 5.8. Even for controller-like behaviour, which is characterized by small data bursts, the NoC provides superior performance due to its higher operating frequency, even though the average transmission latency is longer in every situation.
Table 5.3: Throughput and latency of the Hermes NoC and AXI Stream Crossbar when interconnecting 4 Cores, under different traffic conditions
    Burst    NoC                                                Crossbar
    Size     Latency (Avg.)  Peak Thr.   Measured Thr.          Latency (Avg.)  Peak Thr.   Measured Thr.
             [c.c.]          [MB/s]      [MB/s]                 [c.c.]          [MB/s]      [MB/s]
    8        63.3            6,400       4,800                  6               4,608       4,608
    100      40.7            6,400       6,160                  6               4,608       4,608
    1000     15              6,400       6,390                  6               4,608       4,608
In conclusion, the NoC is undeniably more scalable in terms of operating frequency, as the addition of extra nodes does not increase the average length of the interconnections between the switches. However, the gains obtained in performance are overshadowed by the considerable resource utilization. Moreover, although NoCs are usually targeted at SoCs with a large number of IP Cores, the analysis above shows that such large numbers, for which NoCs are supposedly the best solution, are simply not feasible in modern FPGA devices, given the large resource overhead introduced by the network switches. Finally, it is interesting to note that, although in ASIC designs the circuit is evaluated in terms of its total area, which comprises both transistors and wires, in FPGA designs resource usage is, generally speaking, evaluated only in terms of occupied LUTs, slices and registers, and not in terms of the number or length of the interconnection resources. This further reduces the attractiveness of FPGA-based NoC implementations.
5.4 Shared Memory Performance
The shared memory access time is another critical factor for the performance of the HotStream framework, since stream re-use within the MCPE is entirely supported by this element. Despite providing a considerable peak throughput of 12.8 GB/s, under normal operating conditions the bandwidth utilization of the off-chip DDR3 memory available on the VC707 development board may be significantly reduced. While the overall read and write latency of the DDR module depends on how the memory controller is configured, the most influential factor is the behaviour of the traffic and access patterns (Xilinx Answer Record AR# 45644). Thus, it is not possible to evaluate the performance of the shared memory from a purely theoretical standpoint. Fortunately, the Memory Interface Generator (MIG) tool, provided by Xilinx to configure and instantiate memory controllers, also generates a timing-accurate simulation model. This model eases the profiling of the memory controller and memory module combination under varying access patterns. By performing simulations with the various access patterns available in the testbench, the read access latency was determined to be between 23 and 30 clock cycles. In continuous operation, i.e., when multiple read requests are issued sequentially, the latency between bursts of 8 data beats is reduced to 5 clock cycles. This corresponds to an overall access efficiency of 8 / (8 + 5) = 0.615 and, on average, a read throughput of 7.87 GB/s. The same procedure, repeated for write accesses, produced similar results, justifying the adoption of the 7.87 GB/s mark as the overall average DDR access throughput. However, it should be noted that these values refer to the native user interface provided by the memory controller. In the context of the HotStream framework, an AXI Slave controller is added in order to facilitate multiple arbitrated accesses to the shared memory.
This naturally reduces the throughput due to the various overheads of the AXI4 protocol, as well as the latencies introduced by the AXI Slave controller attached to the native memory controller.
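The efficiency figure above follows directly from the simulated burst timing. A minimal check of the arithmetic:

```python
# Reproduces the efficiency arithmetic of section 5.4 from the simulated
# timings: bursts of 8 data beats separated by 5 idle cycles, on a memory
# interface with a 12.8 GB/s peak throughput.
burst_beats = 8
gap_cycles = 5
peak_gb_s = 12.8

efficiency = burst_beats / (burst_beats + gap_cycles)  # ~0.615
avg_throughput = peak_gb_s * efficiency                # ~7.87 GB/s
```

Any additional AXI protocol overhead would lengthen the effective gap between bursts, lowering the efficiency term accordingly.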
5.4.1 Cycle-Accurate Simulator
In a framework such as the one presented in this thesis, it is important to be able to accurately predict the expected performance of a given application without having to go through the full design, testing and integration cycle. This makes it possible to iteratively adapt a particular design to the framework and to mitigate existing bottlenecks in the early stages of the development process, resulting in a significantly decreased time-to-market and a lower engineering effort. The extensive profiling effort presented in this chapter is key to making this possible, as it provides the necessary information to simulate the behaviour of the various communication channels that compose the HotStream framework. To combine all this information into usable performance metrics, a Python-based simulator was developed that takes an arbitrary number of Core configuration files as input and performs a cycle-accurate simulation of all the transactions that take place between the multiple Cores, the backplane, and the shared memory. The simulator is highly parametrizable, as it allows the user to define
the size of the burst requests that are issued to the shared memory, the arbitration latency for both read and write accesses, and the latency incurred between successive burst accesses from the same Core. Each Core configuration file is composed of a sequence of instructions that make it easy to emulate any stream-based Core, by taking into account its data processing latency, the address generation behaviour of the associated DFC and its relationship with other Cores, expressed by issuing synchronization requests. The output of the simulator is a comprehensive set of statistics that indicates the overall utilization of the shared memory's read and write channels and the number of unused cycles. Similar information is provided for each emulated Core, complemented with the total number of reads and writes it performed, as well as its period of activity in clock cycles. As it is an event-driven simulator, the simulation halts when no more read or write requests are received. The duration of the simulation, in clock cycles, provides an estimate of the execution time of the application after it has been mapped to the HotStream framework.
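The core of such an event-driven simulator can be sketched as a priority queue of timestamped events. The event kinds and the 23-cycle read latency below are illustrative placeholders (the latency is borrowed from the profiling in section 5.4), not the actual implementation:

```python
import heapq

def simulate(events):
    """Minimal event-driven loop: pop the earliest event, process it, and
    possibly schedule follow-up events. The simulation halts when the
    queue is empty. Event = (cycle, kind); `kind` names are illustrative."""
    queue = list(events)
    heapq.heapify(queue)
    last_cycle = 0
    trace = []
    while queue:  # halts when no more requests are pending
        cycle, kind = heapq.heappop(queue)
        last_cycle = cycle
        trace.append(kind)
        if kind == "read_request":
            # assumed 23-cycle read latency, as profiled in section 5.4
            heapq.heappush(queue, (cycle + 23, "read_complete"))
    return last_cycle, trace

cycles, trace = simulate([(0, "read_request"), (5, "read_request")])
```

As in the simulator described above, the final value of the clock (here, `cycles`) estimates the total execution time of the emulated workload.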
5.5 Summary
The prototyping of the framework was done on a Xilinx VC707 development board, powered by a state-of-the-art Virtex-7 FPGA and complemented by a high-performance DDR3 memory and an 8-lane PCI Express interface capable of Gen2 speeds. The Host machine was equipped with an Intel Core i7 3770K processor at 3.5 GHz and 16 GB of DDR3 memory. Being in a Xilinx environment, the HIB was naturally implemented through the combination of the AXI DMA engine and the AXI Bridge for PCI Express. The subsequent experimental evaluation of the PCI Express connection confirmed that the achievable throughput over this type of interface greatly depends on the size of the data chunks being exchanged: small-sized chunks can lead to aggregate throughputs that are 160× lower than what is possible under ideal conditions. By taking into account the time elapsed during the configuration steps performed by the HotStream Application Programming Interface (API) when a transfer is configured, aggregate throughputs as high as 2.1 GB/s were measured for a x8 Gen1 configuration. For the implementation of the backplane interconnection, two opposing solutions were compared: the Hermes NoC [30] and the AXI Stream Interconnect [38] Crossbar. Taking data delivery throughput and latency as the two main performance metrics, the NoC solution proved to be slightly superior to its counterpart. However, such an advantage comes at a hardware-utilization cost that seriously hinders the scalability objectives defined for the HotStream framework. In fact, the obtained results can be extrapolated to most FPGA designs: the considerable amount of hardware required to implement the switches and routers that form the "wires" of a NoC implementation makes it unattractive for designs based on reconfigurable hardware. For the shared memory, an off-chip DDR3 memory was used.
To evaluate this component, a timing-accurate simulation model was used, easing the characterization of the memory access time when different read and write patterns are utilized. While the peak bandwidth of the module is
12.8 GB/s, the various dynamic mechanisms present in this type of device reduce this value to an average access throughput of 7.87 GB/s for the considered read and write patterns. By taking advantage of the results obtained during the characterization of the various communication channels, a Cycle-Accurate Simulator was developed that makes it possible to obtain performance estimates of HotStream-based accelerators early in the development cycle. An arbitrary number of Cores can be emulated, each described by its own configuration file, where its data processing latency, its synchronization with other Cores and the address generation rate of the associated DFC are specified. Accesses to the shared memory are also accurately simulated, based on the various latency figures obtained in this chapter, which may be changed by the user to reflect different configurations of the data bus or of the shared memory itself.
6 Framework Evaluation
Contents
6.1 General Evaluation ...... 59
6.2 Case Study 1: Matrix Multiplication ...... 63
6.3 Case Study 2: Image processing chain in the frequency domain ...... 69
6.4 Summary ...... 75
The HotStream framework is a comprehensive solution for the development of stream-based hardware accelerators. It aims at handling all the communications between the accelerator and the Host machine, as well as at facilitating the management of the intra-accelerator communications. This effectively allows the hardware designer to focus on the development of highly efficient computation Cores, as the powerful, yet easy-to-use, HotStream API and the programmable DFCs guarantee that the final product will provide the best possible performance. To accomplish this, the HotStream framework is essentially divided into 3 components: i) the software layer; ii) the HIB; and iii) the MCPE. The first two work in conjunction to allow the streaming of data between the accelerator and the Host, while the extensive support for coarse-grained patterns makes it possible to fully exploit the available bandwidth between the two. The third component is where the actual computation takes place, hosting an arbitrary number of computation Cores that may be interconnected with each other via the high-speed, low-latency Backplane interface. This backplane also allows the Cores to access the shared memory through individual DFCs. These DFCs are a key element of the framework, as they provide fine-grained access to the large-capacity shared memory, which confers on the HotStream framework its unique data-reuse capabilities. For this purpose, this unit was custom-designed to support arbitrary access patterns, which can be easily programmed through a custom assembly language, without compromising the address generation efficiency or the hardware resources.
6.1 General Evaluation
The following sections provide a comprehensive evaluation of the HotStream framework in two major steps. First, its fine-grained memory access capabilities are tested through a series of commonly used access patterns, where the address generation rates and the specific memory requirements for the storage of the pattern descriptions are compared with the most relevant related art, namely the PPMC [20]. However, since the PPMC implementation was not publicly available at the time of this work, the Xilinx AXI DMA engine was used for this purpose, as it features functionalities identical to those of the PPMC [20]. This DMA engine IP Core was also used in the prototyping framework, to implement the HIB module. The second step demonstrates how the HotStream framework can be used to develop real applications, and the levels of performance that can be expected. This is achieved through two distinct case-studies: i) a block-based multiplication of very large matrices; and ii) a full signal processing chain, where high-resolution images, in the range of 1024 × 1024 to 4096 × 4096 pixels, are filtered in the frequency domain using 2D FFTs.
6.1.1 Resources Overhead
Considering the strong focus on scalability of the proposed streaming framework, it is paramount that its core elements do not significantly impact the overall resource usage. This goal was achieved by designing the DFC with a low area footprint in mind, as this is bound to be the most replicated unit in this framework.
Table 6.1 summarizes the resource utilization of the key elements composing the framework. It is important to note that the DFC is fully configurable with respect to the number of Loopcontrol units that are used; thus, its resource occupation varies with the chosen configuration. Table 6.1 presents the results for a DFC configuration with 1 to 3 Loopcontrol units. All resource utilization and performance figures refer to an implementation on a XC7VX485T Virtex-7 FPGA.
Table 6.1: Resource usage for each component in the MCPE and HIB (hardware platform: XC7VX485T Virtex-7 FPGA)
                Available   DFC (1-3 Loopcontrol   Streaming   Backplane      DSB       DMA
    Resources               units) + BMC           Bus (AXI)   Interconnect
    Slices      75,900      1,014 - 1,216          3,273       4,875          5,300     1,548
    LUTs        303,600     1,743 - 2,225          5,305       8,882          12,620    3,588
    Regs        607,200     1,553 - 2,141          4,922       8,656          9,160     4,128
    DSPs        2,800       4                      0           0              0         0
    BRAM36      1,030       1                      0           0              0         6
    Max. Freq.  -           160 MHz                167 MHz     146 MHz        200 MHz   136 MHz
It is important to note that, while the resource utilization of the Backplane Interconnect (implemented in a Crossbar topology) seems rather high, it represents a worst-case scenario, configured to support full connectivity among a maximum of 16 independent nodes. On the other hand, each DFC/BMC pair accounts for only 1.6% of the total resources available in the device, which ensures the stated scalability goal. While competitive for FPGA-based designs, the maximum operating frequency of the DFC is limited by the simple pipelined nature of the adopted microcontroller. By adopting a more aggressively optimized architecture, higher processing frequencies could be achieved. This solution was not pursued because, as explained in section 5.2, for an add-in card to work in any environment, its PCI Express interface and DMA engine must be clocked at the 100 MHz clock provided by the motherboard.
In this particular embodiment of the HotStream framework, the DSB, which is implemented by the PCI Express bridge, accounts for the largest fraction of resource usage, at roughly 7% of the available resources of the target FPGA. Naturally, this balance would change if the implementation platform featured other means of communication between the Host processor and the accelerator module, such as the ones available on the Zynq All Programmable SoC.
Regarding the relationship between the DFC and the BMC (as depicted in Fig. 3.4), it should be recalled that a BMC can be shared by two independent DFCs. Incidentally, the BMC is nearly two times larger than each DFC, as depicted in Tab. 6.2. This means that the DFC + BMC column in Tab. 6.1 corresponds to a worst-case scenario, where the full-duplex capabilities of the BMC are not exploited. In the case of a Core arrangement where DFCs and BMCs are perfectly paired, the hardware cost of each DFC + 1/2 BMC combination is effectively halved. In the grand scheme of the framework, this corresponds to just 0.8% of the total resources available in the device.
Table 6.2: Individual resource usage of the DFCs and BMCs (hardware platform: XC7VX485T Virtex-7 FPGA)
              Available   DFC         BMC
    Slices    75,900      236 - 337   542
    LUTs      303,600     371 - 612   1,001
    Regs      607,200     390 - 684   773
    DSPs      2,800       2           0
    BRAM36    1,030       0           0
6.1.2 Stream Generation Efficiency
Given the relation between the complexity and nature of the considered patterns and the resulting address generation rate and size of the pattern descriptor, a proper evaluation of the proposed DFC and of the overall framework can only be achieved through a representative benchmark. Therefore, five distinct patterns of varying complexity are herein considered: Linear; Tiled; Diagonal; Zig-Zag; and Greek Cross. While the first two are usually found in a wide range of applications, the remaining three are somewhat more exotic in nature. Nevertheless, they are still of great importance in the context of stream-computing. As an example, the Diagonal access pattern is extensively used by the Smith-Waterman algorithm for DNA sequence alignment [25]; the Zig-Zag scanning is a key element in the entropy encoding of the AC coefficients in the JPEG and MPEG standards [37]; and the Greek Cross is often used by the vast class of diamond-search motion estimation algorithms adopted in video encoding [41]. Figure 6.1 depicts the access patterns being considered, including their size and evolution over time, as well as the pseudo-code for their generation using the API proposed for this framework.
(a) Linear (b) Tiled (c) Diagonal (d) Zig-Zag (e) Greek Cross
Figure 6.1: Access patterns, with varying complexity degrees, adopted for the DFC evaluation
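As a point of reference for the patterns above, the 8×8 Zig-Zag scan used by JPEG entropy coding can be expressed in a few lines of Python. This is a plain software sketch of the access order only, not the DFC assembly used by the framework:

```python
def zigzag_addresses(n=8):
    """Return the linear addresses of an n x n block in Zig-Zag scan order,
    walking each anti-diagonal in alternating directions (JPEG-style)."""
    addrs = []
    for d in range(2 * n - 1):                  # one pass per anti-diagonal
        cells = [(i, d - i) for i in range(n) if 0 <= d - i < n]
        if d % 2 == 0:
            cells.reverse()                     # even diagonals run upwards
        addrs.extend(row * n + col for row, col in cells)
    return addrs

order = zigzag_addresses(8)
# order begins 0, 1, 8, 16, 9, 2, ... as in the JPEG standard
```

The per-diagonal loop in this sketch mirrors the structure that the DFC's pattern description code must encode, which is why unrolling that loop (discussed below for Table 6.3) trades code size for generation rate.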
The metrics considered for this evaluation are the code size required to describe each pattern,
and the address generation rate, defined as the average number of addresses generated per clock cycle. As stated above, the AXI DMA engine is used as the baseline for this comparison, representing the characteristics of most descriptor-based pattern generation mechanisms that have been proposed in the related art. Table 6.3 depicts the obtained results.
Table 6.3: Address generation rate and descriptor size of the considered access patterns (the adopted length of each pattern results from the parameterization depicted in Fig. 6.1)
                              DFC                        DMA
    Pattern    Length         Code Size    Addr/cycle    Code Size    Addr/cycle
    Linear     1024           24           1             32           0.96
    Tiled      128×72 (1)     40           0.99          32           1
    Diagonal   1024×1024      44           1             65k          1
    Zig-Zag    8×8            48 (132*)    0.36 (0.71*)  480          0.63
    Cross      1024×1024      132          0.89          228k         1
    * Values obtained after loop unrolling
    (1) For a memory block of 512×512
By analyzing the pattern generation results obtained with the proposed DFC, when compared with traditional descriptor-based data-fetch DMA mechanisms, it can be concluded that the proposed controller achieves a similar address generation rate but with significantly lower code-memory requirements. Moreover, the related state of the art does not offer any form of scalability, as the size of the descriptor increases with the length of the pattern. This is particularly emphasized for the Diagonal and Cross patterns: for a 1024×1024 pattern, the descriptors for the conventional DMA occupy about 1500× more memory to store the access patterns than the proposed DFC approach. For larger matrices, this discrepancy is even greater. As a consequence of the larger code size, the conventional DMA approach would require a significantly larger internal memory and, eventually, an external processor to dynamically generate the patterns, which would further increase the required hardware resources and reduce the attainable performance. Nevertheless, certain cases still exist where the execution time of the pattern description code in the DFC cannot be entirely overlapped with the address generation, with a consequent reduction of the attained rate. This is the case of the Zig-Zag access pattern. To circumvent this problem, the loop that sets the AGC parameters for each diagonal can be unrolled, thus improving the address generation rate at the cost of a slightly larger code size. This technique effectively doubles the rate of the Zig-Zag pattern generation (values marked with an * in the table), allowing for an address generation rate above the one reported in the related state of the art, still with smaller memory requirements.
Naturally, the actual performance gains that can be achieved by accelerating a given data streaming application with the proposed framework depend not only on the amount of parallelism that can be exploited, but also on the involved computational complexity (i.e., the number of operations performed on a single data element). The latter is especially important, as it effectively defines the amount of data reuse that can take place within the framework. In order to provide an
insight into the speed-up magnitudes that can be expected from the HotStream framework, the following sections present two case studies. The first considers the block-based multiplication of very large matrices, which is able to take full advantage of the advanced data-reuse capabilities of the proposed framework, as well as its inherent support for a high degree of data parallelism. The second consists of a full image processing chain in the frequency domain, using 2D FFTs, representing a complete and self-contained final product that also highlights several features of the HotStream framework, such as its extensive support for heterogeneity among the computing Cores.
6.2 Case Study 1: Matrix Multiplication
In this section, a block-based matrix multiplication example is used to evaluate and compare the proposed framework against other usual approaches based on hardware accelerators. As a comprehensive data management solution for streaming applications, the proposed framework provides efficient data streaming mechanisms between the Host and the accelerating hardware, as well as extensive data (re-)usage and (pre-)fetching capabilities within the MCPE. The outcome is a significant increase in the attained input/output data bandwidth within each processing core, with the consequent maximization of the resulting data processing throughput. This approach contrasts sharply with traditional implementations, where the Host GPP or a conventional DMA engine centralizes the whole data management, at the cost of being subjected to the (often rather limited) data bandwidth of the underlying communication interface between the Host and the accelerating hardware. While this block-based matrix multiplication case study demonstrates the potential of the proposed framework, additional advantages are expected as the application complexity grows.

Block-based matrix multiplication is typically used to improve data locality, and allows the implementation of multiplications with matrix sizes much greater than would be possible if the multiplication were performed in one single step. The considered implementation is divided into two steps: i) the multiplication of the sub-blocks; and ii) the accumulation (reduction) step, which composes the final matrix from the computed partial sub-matrices [26]. Equation 6.1 depicts a simple partitioning example of an N×N matrix multiplication, considering sub-blocks of size N/2×N/2.

\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\cdot
\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
=
\begin{bmatrix}
A_{11} \cdot B_{11} + A_{12} \cdot B_{21} & A_{11} \cdot B_{12} + A_{12} \cdot B_{22} \\
A_{21} \cdot B_{11} + A_{22} \cdot B_{21} & A_{21} \cdot B_{12} + A_{22} \cdot B_{22}
\end{bmatrix}
\quad (6.1)
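The two-step decomposition of Equation 6.1 generalizes to any block size that divides N. As a functional reference (plain Python, not the hardware implementation; the function name and block size are illustrative), block-based multiplication can be sketched as:

```python
def block_matmul(A, B, bs):
    """Multiply square matrices A and B using bs x bs sub-blocks.

    Step (i): multiply the sub-blocks A[bi:bi+bs, bk:bk+bs] and
    B[bk:bk+bs, bj:bj+bs]; step (ii): accumulate (reduce) the partial
    products into the output block C[bi:bi+bs, bj:bj+bs].
    """
    n = len(A)
    assert n % bs == 0, "matrix size must be a multiple of the block size"
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, bs):            # row index of the output block
        for bj in range(0, n, bs):        # column index of the output block
            for bk in range(0, n, bs):    # reduction over the inner dimension
                for i in range(bi, bi + bs):
                    for j in range(bj, bj + bs):
                        acc = C[i][j]     # running partial sum for C[i][j]
                        for k in range(bk, bk + bs):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Only the loop nesting differs from a naive triple loop, but each bs×bs working set now fits in fast local storage, which is precisely the locality the accelerator exploits.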
Three different approaches were considered for this evaluation. The first, hereinafter denoted as conventional, simply streams the matrix data over the PCI Express link into the accelerator, where a single matrix multiplication core is consuming the incoming data and producing the results that are streamed back to the Host. The second approach, referred to as conventional+buffering, features an additional memory in the accelerator, which is large enough to buffer one of the input matrices, so that it can be reused over the course of the entire computation. Finally, the third
[Figure 6.2: block diagram, with PCIe streaming into the sub-block matrix multiplication core (M), followed by the 16:1 (A1) and 8:1 (A2) matrix addition cores, connected through shared memories on the backplane, and PCIe streaming the result out]
Figure 6.2: HotStream-based implementation of the block-based multiplication algorithm, consisting of 3 Kernels to process multiple and concurrent data streams, where double buffering is used on the shared memory to overlap communication with computation

approach makes use of the HotStream framework, which maximizes the data re-usage by including reduction (accumulation) modules on the MCPE that run concurrently with the multiplication cores (see Figure 6.2), as well as by overlapping the data communication with the computation through double-buffering techniques. It should be noted that, to implement a 4096×4096 matrix multiplication with 32×32 sub-blocks, a 128:1 reduction step is required, in which all the intermediate results from the sub-block multiplications are combined, through simple additions, into the final matrix. This is achieved by using two addition cores: one to perform a 16:1 reduction and the other to perform an 8:1 reduction. The result is a self-contained accelerator that only streams the final matrix back to the Host, contrasting with the former solutions, which completely rely on the Host to perform the reduction. In addition to these three basic implementations, corresponding parallel versions were also considered, by replicating the structure depicted in Fig. 6.2. The level of exploited parallelism is limited either by the available hardware resources or by the data bandwidth capacity of the communication channels.
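The 128:1 reduction performed by the two addition cores can be modeled as two successive grouping stages, a 16:1 stage followed by an 8:1 stage (16 × 8 = 128). The sketch below is an illustrative software model of that composition, not the RTL of the accumulation cores; all names are ours.

```python
def reduce_stage(blocks, factor):
    """One accumulation stage: sum every group of `factor` consecutive
    partial sub-matrices into a single sub-matrix (element-wise addition)."""
    assert len(blocks) % factor == 0
    out = []
    for g in range(0, len(blocks), factor):
        acc = [row[:] for row in blocks[g]]        # copy the first block
        for blk in blocks[g + 1:g + factor]:
            for i, row in enumerate(blk):
                for j, v in enumerate(row):
                    acc[i][j] += v
        out.append(acc)
    return out

def reduce_128_to_1(partials):
    """Compose the 16:1 and 8:1 stages of Fig. 6.2 (128 -> 8 -> 1)."""
    assert len(partials) == 128
    return reduce_stage(reduce_stage(partials, 16), 8)[0]
```

In the accelerator the two stages run concurrently with the multiplication cores, so the reduction cost is hidden behind the sub-block products rather than serialized after them.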
6.2.1 Computing Cores
It is important to recall that the main focus of this case study is not on the adopted matrix multiplication cores, but on the framework itself. Accordingly, it was decided to adopt off-the-shelf Xilinx IP Cores [4] to implement both the matrix multiplication and the accumulation cores. It is worth noting that the same multiplication units are also used in the conventional solutions. These soft cores support matrices of up to 32×32 entries of 2 bytes each. While this limits the proposed solution in terms of data width, it is very suitable for evaluating the considered framework in terms of scalability. To implement the accumulation cores A1 and A2, depicted in Fig. 6.2, the off-the-shelf Xilinx IP Cores were arranged in a binary tree, as depicted in Fig. 6.3. Since the input is in serial form, i.e., one matrix entry is received per clock cycle, a set of input buffers was added to the first level of accumulators. The reduction operation only starts when the first level is ready to be processed. As for the multiplication core, it uses a dedicated structure that offers a higher degree of data re-use. In fact, given that each 32×32 sub-block of matrix A is involved in a
[Figure 6.3: binary tree of A1 accumulator cores, with four accumulators at the input level, two at the intermediate level, and one producing the output]
Figure 6.3: 8:1 binary reduction tree based on Xilinx Matrix Accumulators. The structure of the 16:1 reduction core follows the same architecture but with double the number of basic accumulators
number of multiplications that is equal to the number of columns of matrix B, its re-utilization greatly reduces the number of accesses to the shared memory. Figure 6.4 depicts the structure of this multiplication core, where a sub-block from matrix A is stored and re-used during the multiplication with a full line of sub-blocks from matrix B. Once a line is finished, the input buffer is flushed and a new matrix A sub-block is pushed in.
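This re-use can be modeled as follows (illustrative Python; `fetch_a` stands in for a shared-memory read and is not an actual framework call): the A sub-block is fetched once per line of B sub-blocks and kept in the input buffer, so the number of shared-memory reads of A drops from one per product to one per line.

```python
def matmul(a, b):
    """Plain dense sub-block product (software stand-in for core M)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def line_products(fetch_a, b_blocks):
    """Fetch the A sub-block once, then re-use it for every B sub-block
    in the line, as the buffered core of Fig. 6.4 does."""
    a = fetch_a()                        # single shared-memory read of A
    return [matmul(a, b) for b in b_blocks]
```

For a 4096×4096 multiplication with 32×32 sub-blocks, each line holds 128 B sub-blocks, so this buffering cuts the A-traffic to the shared memory by a factor of 128.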
[Figure 6.4: multiplication core M, with an input buffer holding Sub-Block A that is re-used against the incoming stream of Sub-Blocks B before the results reach the OUTPUT]
Figure 6.4: Internal structure of the multiplication core utilized in the HotStream implementation of the matrix multiplication. Sub-blocks from matrix A are stored and re-used during the computation of a full sub-block line from matrix B
These cores run at a conservative frequency of 100 MHz and output one 2-byte matrix element per clock cycle, after a significant initial latency that is effectively hidden by the large data set that composes the stream. Therefore, a constant rate of 100 MOps (Million Operations per Second) is maintained by each core. Table 6.4 summarizes the resource utilization of each of the basic cores used in the various implementations of the matrix multiplication, as well as of the more complex structures featured in the HotStream version.
Table 6.4: Resource usage of the cores utilized in the various implementations of the 4096×4096 matrix multiplication (hardware platform: Xilinx Virtex-7 XC7VX485T FPGA)
          Available   Xilinx       Xilinx      HotStream   HotStream      HotStream
                      Mat. Mult.   Mat. Add.   Mult.       Acc. Tree 16   Acc. Tree 8
Slices    75,900      10,588       117         10,650      1,541          792
LUTs      303,600     9,238        144         12,230      2,141          1,134
DSPs      2,800       32           1           32          15             7
RAMB36    1,030       2            0           3           17             9
RAMB18    2,060       64           0           64          0              0
6.2.2 Roofline Model
To evaluate the available design space in terms of the offered performance, the Roofline model [21] was applied to the considered case study, in order to correlate the exploited processing performance with the throughput of the involved communication channels.

Figure 6.5 depicts the peak performance of the conventional implementations, using parallelism levels of 1×, 2× and 4×, which result from using 1, 2 and 4 Multiplication Cores, respectively, to implement the matrix multiplication kernel (exploiting data parallelism). For a 1× parallelism level, the conventional solution is limited by the performance of the Xilinx IP Core that performs the matrix multiplication. By increasing the number of cores to 2 or 4, 2× or 4× parallelism can be achieved, respectively. However, these implementations become limited by the PCIe link, resulting in a performance of only 190 MOps, i.e., a speed-up of only 1.9 with 4× parallelism.
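The two ceilings of the Roofline model can be summarized in one line: attainable performance is the minimum of the compute peak and the product of link bandwidth and operational intensity. The sketch below uses hypothetical numbers chosen only to reproduce the 190 MOps figure quoted above (380 MB/s at 0.5 operations per byte is one consistent pair); the actual measured values are those reported in the text and in Figure 6.5.

```python
def roofline(peak_mops, link_mb_per_s, ops_per_byte):
    """Attainable performance (MOps): the minimum of the compute ceiling
    and the bandwidth ceiling (link throughput times operational intensity)."""
    return min(peak_mops, link_mb_per_s * ops_per_byte)

# Illustrative numbers only: four 100-MOps cores behind a bandwidth-limited
# link are capped at 190 MOps, while a single core remains compute-bound.
four_cores = roofline(4 * 100, 380, 0.5)   # bandwidth-bound
one_core = roofline(1 * 100, 380, 0.5)     # compute-bound
```

This is exactly why the HotStream version gains from data re-use: buffering and on-accelerator reduction raise the operational intensity, moving the design point off the bandwidth roof.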