
Article

Hardware Considerations for Tensor Implementation and Analysis Using the Field Programmable Gate Array

Ian Grout 1,* and Lenore Mullin 2

1 Department of Electronic and Computer Engineering, University of Limerick, V94 T9PX Limerick, Ireland
2 Department of Computer Science, College of Engineering and Applied Sciences, University at Albany, State University of New York, Albany, NY 12222, USA; [email protected]
* Correspondence: [email protected]; Tel.: +353-61-202-298

Received: 18 October 2018; Accepted: 8 November 2018; Published: 13 November 2018

Abstract: In today’s complex embedded systems targeting internet of things (IoT) applications, there is a greater need for embedded digital signal processing algorithms that can effectively and efficiently process complex data sets. A typical application considered is for use in supervised and unsupervised machine learning systems. With the move towards lower power, portable, and embedded hardware-software platforms that meet the current and future needs for such applications, there is a requirement on the design and development communities to consider different approaches to design realization and implementation. Typical approaches are based on software programmed processors that run the required algorithms in software. Whilst such approaches are well supported, they can lead to solutions that are not necessarily optimized for a particular problem. A consideration of different approaches to realize a working system is therefore required, and hardware based designs rather than software based designs can provide performance benefits in terms of power consumption and processing speed. In this paper, consideration is given to utilizing the field programmable gate array (FPGA) to implement a combined inner and outer product algorithm in hardware that utilizes the available hardware resources within the FPGA. These products form the basis of tensor analysis operations that underlie the data processing algorithms in many machine learning systems.

Keywords: Inner and outer product; tensor; FPGA; hardware

1. Introduction

Internet of things (IoT) applications are today demanding greater levels of digital signal processing (DSP) capabilities whilst providing low-power operation and reduced processing times for the complex signal processing operations found typically in machine learning [1] systems. For example, facial recognition [2] for safety and security conscious applications is a noticeable everyday example, and many smartphones today incorporate facial recognition software applications for phone and software app access. Embedded environmental sensors, as an alternative application, can input multiple sensor data values over a period of time and, using DSP algorithms, can analyze the data and autonomously provide specific outcomes. Although these applications may differ, within the system hardware and software, these are simply algorithms accessing data values that need to be processed. The system does not need to know the context of the data it is obtaining. Data processing is rather concerned with how effectively and efficiently it can obtain, store, and process the data before transmitting a result to an external system. This requires not only an understanding of regular access patterns in important IoT algorithms, but also an ability to identify similarities amongst such algorithms.

Research presented herein shows how scalar operations, such as plus and times, extended to all scalar operations, can be defined in a single circuit that implements all scalar operations extended to: (i) n-dimensional tensors (arrays); (ii) the inner product (matrix multiply is a 2-d instance) and the outer product, both on n-dimensional arrays (the Kronecker Product is a 2-d instance); and (iii) compressions, or reductions, over arbitrary dimensions. However, even more relationships exist. One of the most compute intensive operations in IoT is the Khatri-Rao, or parallel Kronecker Product, which, from the perspective of this research, is an outer product projected to a matrix, enabling contiguous reads and writes of data values at machine speeds.
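To make this relationship concrete, the sketch below (a NumPy illustration with arbitrary array sizes, not the hardware design itself) forms the Kronecker Product as a 4-d outer product that is only projected to a matrix at the end:

import numpy as np

# Every pairing A[i,j]*B[k,l] is formed first as a 4-d outer product;
# ordering the axes (i,k,j,l) and reshaping then yields the Kronecker
# Product, so strided access is deferred to a single final step.
A = np.arange(9).reshape(3, 3)
B = np.arange(6).reshape(3, 2)

outer = np.einsum('ij,kl->ikjl', A, B)         # 4-d outer product
kron = outer.reshape(A.shape[0] * B.shape[0],  # project to a matrix
                     A.shape[1] * B.shape[1])

assert np.array_equal(kron, np.kron(A, B))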
In terms of the data, when this data is obtained, it must be stored in the available memory. This will be a mixture of cache memory within a suitably selected software programmed processor (microcontroller (µC), microprocessor (µP), or digital signal processor (DSP)), locally connected external volatile or non-volatile memory connected to the processor, memory connected to the processor via a local area network (LAN), or some form of Cloud based memory (Cloud storage). Identifying what to use and when is the challenge. Ideally, the data would be stored in specific memory locations so that the processor can optimally access the stored input data, process the data, and store the result (the output data) again in memory in suitable new locations, or overwrite existing data in already utilized memory. Knowing and anticipating cache memory misses, for example, enables a design that minimizes overhead(s), such as signal delays, energy, heat, and power.

In many embedded systems implemented today, the software programmed processor is the commonly used programmable device to perform complex tasks and interface to input and output systems. The software approach has been developed over the last number of years and is supported through tools (usually available via an integrated development environment (IDE)) and programming language constructs, providing the necessary syntax and semantics to perform the required complex tasks. However, increasingly, the programmable logic device (PLD) [3], which allows a hardware configuration to be downloaded into the PLD in terms of digital logic operations, is utilized. Figure 1 shows the target device choices available to the designer today. Alternatively, an application specific integrated circuit (ASIC) solution, whereby a custom integrated circuit is designed and fabricated, could be considered. Design goals include not only semantic, denotational, and functional descriptions of a circuit, but also an operational description (how to build the circuit and associated memory relative to access patterns of important algorithms).

[Figure 1 depicts the choice tree: the algorithm to be implemented maps onto a software programmed processor (microcontroller (µC), microprocessor (µP), or digital signal processor (DSP)), a hardware configured programmable logic device (field programmable gate array (FPGA), complex programmable logic device (CPLD), or simple programmable logic device (SPLD)), or an ASIC.]

Figure 1. Programmable/configurable device choices for implementing digital signal processing operations in hardware and software.


In this paper, consideration is given to a general algorithm, and the resultant circuit, for an n-dimensional inner and outer product. This algorithm (circuit) builds upon scalar operations, thus creating a single IP (intellectual property) core that utilizes an efficient memory access algorithm. The field programmable gate array (FPGA) is used as the target hardware and the Xilinx® [4] Artix-7 [5] device is utilized in this case study. The two algorithms, matrix multiplication and the Tensor Product (Kronecker Product), are foundational to essential algorithms in AI and IoT. The paper is presented in a way to discuss the necessary links between the computer science (algorithm design and development) and the engineering (circuit design, implementation, test, and verification) actions that need to be undertaken as a single, combined approach to system realization. The paper is structured as follows. Section 2 will introduce and discuss algorithms for complex data analysis with a focus on tensor [6] analysis. An approach using tensor based computations with multi-dimensional data arrays that are to be developed and processed is introduced. Section 3 will discuss memory considerations for tensor analysis operations, and Section 4 will introduce the use of the FPGA in implementing hardware and hardware/software co-design realizations of tensor computations. Section 5 will provide a case study design created using VHDL (the Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (HDL)) [7] for synthesis and implementation within the FPGA. The design architecture, simulation results, and physical prototype test results are presented, along with a discussion of implementation possibilities. Section 6 will conclude the paper.

2. Algorithms for Tensor Analysis

2.1. Introduction

In this section, data structures using tensor notation are introduced and discussed with the need to consider and implement high performance computing (HPC) applications [8], such as those required in artificial intelligence (AI), machine learning (ML), and deep learning (DL) systems [9]. The section commences with an introduction to tensors, followed by a discussion of the use of tensors in HPC applications. The algorithms foundational to IoT (Matrix Multiply, Kronecker Product, and Compressions (Reductions)) are targeted with the need for a unified n-dimensional inner and outer product circuit that can optimally identify and access suitable memories to store input and processed data.

2.2. Tensors as Algebraic Objects

As the need for IoT [10] and AI solutions grows, so does the need for High Performance Tensor (HPT) operations [11]. Tensors often provide a natural and compact representation for multidimensional data. For example, a function with five parameters can be thought of as a five-dimensional array. This is a particularly useful approach to structuring complex data sets for analysis. With the complexity of tensor analysis requirements in real-world scenarios, there is a need for suitable hardware and software platforms to effectively and efficiently perform tensor analysis operations. Although there is a plethora of tensor platforms available for use, all the platforms are built upon tensors using various software programming languages, approaches, and performances. Selecting and obtaining the right programming language and hardware platform on which to run tensor computation programs is not a trivial task. Fortunately, numerous efforts are underway to identify hot spots and build firmware and hardware. These efforts are built upon over 10 years of national and international workshops (e.g., [12,13]) uniting scientists to address these issues.

Tensors are algebraic objects that describe linear and multi-linear relationships. Tensors can be represented as multidimensional arrays. A tensor is denoted by its rank, from 0 upwards. Each rank represents an array of a particular dimension. This idea is shown in Table 1, which identifies the tensor rank, its mathematical entity, and an example realization using the Python language [14], using Python lists to hold the data (in the examples, using integer numbers). A scalar value representing a magnitude (e.g., the speed of a moving object) is a tensor of rank 0. A rank 1 tensor is a vector representing a magnitude and direction (e.g., the velocity of a moving object: speed and direction of motion). Matrices (n × m arrays) have two dimensions and are rank 2 tensors. A three-dimensional (n × m × p) array can be visualized as a cube and is a rank 3 tensor. Tensors with ranks greater than 3 can readily be created; analysis of the data they hold would be performed by accessing the appropriate element within the tensor and performing a suitable mathematical operation before storing the result in another tensor. In a physical realization of this process, the tensor data would be stored in a suitably sized memory, the data would be accessed (typically using a software programmed processor), and the computation would be undertaken using fixed- or floating-point arithmetic. This entire process should ideally stream data contiguously and anticipate where cache memory misses might occur, thus minimizing overhead up and down the memory hierarchy. For example, in an implementation using cache memory, an L1 cache memory miss could also miss in L2, L3, and page memory.

Table 1. Tensor rank (0 to n) with an example code using Python lists.

Rank | Mathematical Entity | Example Realization in Python Code
0 | Scalar (magnitude only) | A = 1
1 | Vector (magnitude and direction) | B = [0,1,2]
2 | Matrix (two dimensions) | C = [[0,1,2], [3,4,5], [6,7,8]]
3 | Cube (three dimensions) | D = [[[0,1,2], [3,4,5]], [[6,7,8], [9,10,11]]]
n | n dimensions | ...
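As a concrete complement to Table 1, the following Python lines (variable names as in the table) show that element access uses one index per rank:

# Tensors of rank 0 to 3 held as Python lists, as in Table 1.
A = 1                                    # rank 0: scalar
B = [0, 1, 2]                            # rank 1: vector
C = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]    # rank 2: matrix
D = [[[0, 1, 2], [3, 4, 5]],
     [[6, 7, 8], [9, 10, 11]]]           # rank 3: cube

# Element access uses one index per rank (dimension).
print(B[2])          # 2
print(C[1][0])       # 3
print(D[1][0][2])    # 8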

A tensor rank, or a tensor’s dimensionality, can be thought of in at least two ways. The more traditional way is that, as the number of rows and columns changes in a matrix, so does the dimension. Even with that perspective, computation methods often decompose such matrices into blocks. Conceptually, this can be thought of as “lifting” the dimension of an array. Further blocking “lifts” the dimension even more. Another way of viewing a tensor’s dimensionality is by the number of arguments in a function input over time. The most general way to view dimensionality is to combine these attributes. The idealized methods for formulating architectural components are chosen to match the arithmetic and memory access patterns of the algorithms under investigation. In this paper, the n-dimensional inner and outer products are considered. Thus, in this case, what might be thought of as a two-dimensional problem can be lifted to perhaps eight or more dimensions to reflect a physical implementation, considering the memory as registers, the levels of cache memory, RAM (random access memory), and HDD (hard disk drive). With that formulation, it is possible to create deterministic cost functions validated by experimentation, and, ideally, an idealized component construction can be realized that meets desired goals, such as heat dissipation, time, power, and hardware cost.

When an algorithm is run, the hardware implementing the algorithm will access available memory. In Figure 2, a prototypical graph of how an algorithm behaves when it does not have cache memory misses or page memory faults is presented. The shape of the graph changes as it moves through the memory hierarchy. This identifies the time requirements associated with the different memories from L1 cache memory through to disk (HDD). Note the change in the slope with memory type. The slope reflects how attributes, such as speed, cost, and power, would affect performance. Algorithm execution (memory access) time is, however, relative to the L1 cache memory chosen. For example, it could be nanoseconds, microseconds, milliseconds, seconds, minutes, or hours as the data moves further up the memory hierarchy. Often, performance is related to a decrease in arithmetic operations, i.e., a reduction of arithmetic complexity. In an ideal computing environment, where memory and computation would have the same costs, this would be the case. Unfortunately, it is also necessary to be concerned with the cost of data input/output (I/O). In parallel to this, it is a necessity to consider memory access patterns and how these relate to the levels of memory. Pre-fetching is one way to alleviate delays. However, often the algorithm developer must rely on the attributes of a compiler and hope the compiler is pre-fetching data in an optimum manner. The developer must trust that this action is performed correctly. This is becoming harder to achieve given that machines are becoming ever more complex and compiler writers are getting scarcer.

Empirical methods of experimentation reveal graphs, such as the one shown in Figure 2. Such diagnostic methods allow the algorithm developer to observe the performance of a particular algorithm running on a machine. It is then possible to look at memory speed, size, and other cost factors to put together a model of how performance might be improved through “dimension lifting”. That said, the goal is always to try to keep the slope linear, i.e., the linear part of a polynomial curve, such that the slope is minimized.
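To make “dimension lifting” concrete, the following sketch (the 8 × 8 array and 4 × 4 block size are arbitrary assumptions for illustration) lifts a 2-d array to 4-d so that each block becomes a contiguous, cache-sized unit:

import numpy as np

n, b = 8, 4                        # matrix order and block (tile) size
X = np.arange(n * n).reshape(n, n)

# "Dimension lifting": the (n x n) array becomes an
# (n/b x n/b) grid of (b x b) blocks -- a 4-d array.
lifted = X.reshape(n // b, b, n // b, b).transpose(0, 2, 1, 3)

# One (b x b) block can now be streamed contiguously into a
# cache-sized buffer; further lifting adds more hierarchy levels.
block = np.ascontiguousarray(lifted[1, 0])
print(block.shape)                 # (4, 4)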

Figure 2. Algorithm execution (memory access) time vs. memory hierarchy.

Presently, the goal is to achieve a situation where the graph is polynomial, avoiding exponential behavior, such as the one in Figure 2 using HDDs. A co-design approach, complemented with dimension lifting and analysis, as discussed above, can be used to calculate upper and lower bounds of algorithms relative to their data size, memory access patterns, and arithmetic. The goal is to ensure performance stays as linear as possible. This type of information gives the algorithm developers insight into what memories to select for use, i.e., what type and size of memory should be used to keep the slope constant. This, of course, would include pre-fetching, buffering, and timings to feed the prior levels at memory speed. If this is not possible, given the available memory choices, the slope change can be minimized.

2.3. Machine Learning, Deep Learning, and Tensors

Tensor and machine learning communities have provided a solid research infrastructure, reaching from efficient routines for tensor calculus to methods of multi-way data analysis, i.e., from tensor decompositions to methods for consistent and efficient estimation of parameters of probabilistic models. Some tensor-based models have the characteristic that if there is a good match between the model and the underlying structure in the data, the models are much more interpretable than alternative techniques. Their interpretability is an essential feature for the machine learning techniques to gain acceptance in the rather engineering intensive fields of automation and control of cyber-physical systems. Many of these systems show intrinsically multi-linear behavior, which is appropriately modeled by tensor methods, and tools for controller design can use these models. The calibration of sensors delivering data and the higher resolution of measured data will have an additional impact on the interpretability of models.

Deep learning is a subfield of machine learning that supports a set of algorithms inspired by the structure and function of the human brain. TensorFlow™ [15], PyTorch [16], Keras [17], MXNet [18], The Microsoft Cognitive Toolkit (CNTK) [19], Caffe [20], Deeplearning4j [21], and Chainer [22] are machine learning frameworks that are used to design, build, and train deep learning models. Such frameworks continue to emerge. These frameworks support numerical computations on multidimensional data arrays, or tensors, e.g., point-wise operations, such as add, sub, mul, pow, exp, sqrt, div, and mod. They also support numerous linear algebra operations, such as Matrix-Multiply, Kronecker Product, Cholesky Factorization, LU (Lower-Upper) Decomposition, singular-value decomposition (SVD), and Transpose. The programs would be written in various languages, such as Python, C, C++, and Java. These languages also include libraries/packages/modules that have been developed to support high-level tensor operations, in many cases under the umbrellas of machine learning and deep learning.
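These operations map directly onto NumPy, the numerical library named in Section 2.4; a brief illustrative sketch (values chosen arbitrarily):

import numpy as np

A = np.array([[0., 1.], [2., 3.]])
B = np.array([[4., 5.], [6., 7.]])

# Point-wise tensor operations.
pw = np.sqrt(np.exp(A) + A * B) - A / (B + 1) + A % 2

# Linear algebra operations of the kind the frameworks expose.
mm = A @ B                                    # Matrix-Multiply
kr = np.kron(A, B)                            # Kronecker Product
ch = np.linalg.cholesky(B @ B.T + np.eye(2))  # Cholesky factorization
u, s, vt = np.linalg.svd(A)                   # singular-value decomposition
tr = A.T                                      # Transpose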

2.4. Tensor Hardware

Google’s introduction of a Tensor Processing Unit (TPU) [23] that works in conjunction with TensorFlow emphasizes that there is a need for fast tensor computation. That need will only grow as the use of AI increases. Consequently, what would an idealized processor for tensors look like? What would idealized software defined hardware look like? What are important pervasive algorithms? Two workshops, one at the NSF (National Science Foundation) [12] in America, and another at Dagstuhl [13], validated and promoted how tensors are used in numerous domains, considering AI and IoT in general. Charles Van Loan, a co-organizer of the NSF Workshop, emphasized the importance of the Kronecker Product, calling it the Product of the Times. The algorithm (circuit) presented herein is foundational to this very important algorithm. The goal is to develop designs that could be used to build a Universal Algebraic Unit© (UAU©) that could support all the mathematics in numerical libraries, such as NumPy, which most, if not all, of the applications mentioned above use and rely on for performance. There are two challenges in the design and development of applications that require tensor support: Optimal software and optimal hardware, necessitating a co-design approach. Due to the ubiquitous nature of tensors, a co-design approach is used to achieve the goals of this work.

2.5. Contribution of this Paper

This paper demonstrates that the Matrix Multiplication and the Kronecker Product are both built from a common algorithm, the outer product. This design is unique in that it provides:

• A general approach to inner and outer product, n dimensional, 0 ≤ n;
• a general approach relative to scalar operations other than + and ×; and
• a demonstration of how the design enables speed-up for Kronecker Products.

The design presented in this paper is for an n-dimensional inner and outer product, e.g., for 2-d matrix multiply, which builds upon the scalar operations of + and × [24]. Some operations may be realized in hardware, firmware, or software. This generalized inner product is defined using reductions and outer products [24], and reduces to three loops independent of conformable argument dimensions and shapes; a sketch is given below. This is due to Psi Reduction, where it is possible, through linear and multilinear transformations, to reduce an array expression to a normal form. Then, through “dimension lifting” of a normal form, idealized hardware can be realized, in which the size of buffers relative to the speed and size of connecting memories, DMA (Direct Memory Access) hardware (contiguous and strided), and other memory forms can be determined when a problem size is known, or conjectured, and details of the hardware are available and known.
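The following Python sketch (hypothetical helper names, not the VHDL core’s interface) illustrates this three-loop reduction on flattened arrays, parameterized over the scalar operations so that pairs other than (+, ×) can be substituted:

from functools import reduce
from operator import mul

def inner(a, shape_a, b, shape_b,
          add=lambda x, y: x + y, times=lambda x, y: x * y, zero=0):
    # Generalized inner product on flattened n-d arrays: three loops
    # suffice regardless of rank.  rows = product of a's leading
    # dimensions, cols = product of b's trailing dimensions, k = the
    # shared (reduced) dimension.
    k = shape_a[-1]
    assert shape_b[0] == k, "arguments must be conformable"
    rows = reduce(mul, shape_a[:-1], 1)
    cols = reduce(mul, shape_b[1:], 1)
    c = [zero] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            acc = zero
            for l in range(k):      # reduction over the shared axis
                acc = add(acc, times(a[i * k + l], b[l * cols + j]))
            c[i * cols + j] = acc
    return c, shape_a[:-1] + shape_b[1:]

# 2-d instance (matrix multiply) using the Section 5 arrays.
c, shape = inner([0, 1, 2, 3, 4, 5, 6, 7, 8], (3, 3),
                 [0, 1, 2, 3, 4, 5], (3, 2))
print(shape, c)    # (3, 2) [10, 13, 28, 40, 46, 67]

Replacing add and times with, for example, min and + gives a min-plus inner product with no change to the loop structure.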

2.6. The Kronecker Family of Algorithms

With an ability to build an idealized Kronecker Product, it is possible to address multiple Kronecker Products, parallel Kronecker Products, and outer products of Kronecker Products. These algorithms are used throughout the models built by mathematicians. Moreover, they are often used many times in sequence, necessitating an optimization study. If strides are required, as in the classical approach, performance will suffer. Here, the Kronecker Product is viewed as an outer product, no matter how many there are in an expression. Consequently, it is not necessary to be concerned with strided access until the end, when the outer product result is transposed and reshaped, thus saving energy and time. It is then possible to capitalize on contiguous access streaming from component to component. The analysis may consider time, space, speed, and other parameters, such as energy and heat, to determine cost.
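As one member of this family, the Khatri-Rao (parallel Kronecker) product can be written as a single outer product with the only reshape deferred to the end; a hedged NumPy sketch:

import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product of A (I x K) and B (J x K),
    # formed as one 3-d outer product over the shared column index
    # and then projected to an (I*J x K) matrix -- contiguous reads
    # and writes until the final reshape.
    I, K = A.shape
    J, K2 = B.shape
    assert K == K2, "arguments must share the column dimension"
    return np.einsum('ik,jk->ijk', A, B).reshape(I * J, K)

A = np.arange(6).reshape(3, 2)
B = np.arange(4).reshape(2, 2)
print(khatri_rao(A, B).shape)    # (6, 2)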

3. Memory Considerations for Tensor Analysis

3.1. Introduction

In order to understand memory considerations, it is important to understand the algorithms that dominate tensor analysis: Inner Products (Matrix Multiply) and Outer Products (Kronecker or Tensor Product). Others include transformations and selections of components. Models in AI and IoT [13] are dominated by multiple Kronecker Products, parallel Kronecker Products (Khatri-Rao), and outer products of Kronecker Products (Tracy-Singh), in conjunction with compressions over dimensions. Memory access patterns are well known. Moving on from an algorithmic specification to an optimized software or hardware instantiation of that algorithm requires matching the data structures that represent the algorithm to the memory(ies) of a computer.

3.2. Computer Memory Access

From the onset of computing, computer scientists and mathematicians have discussed the complexity of an algorithm, which translates to finding the least amount of arithmetic to perform. In an ideal world, where memories had the same speed no matter where they were, the computation effort would be based on the complexity of the algorithm. In fact, in the early days of computing, that was the case: memory was only one clock cycle away from the CPU (central processing unit). This is not true now. Now, what matters is the least amount of arithmetic combined with an optimal use of memory. From an engineering point of view, this means an understanding of the algorithm's operation from a memory access pattern perspective. Moreover, through that understanding, it is possible to create optimal, predictable, and reproducible performance.

3.3. Cache Memory: Memory Types and Caches in a Typical Processor System

Over the years, memory has become faster in conjunction with memory types becoming more diverse. Architectures now support multiple, non-uniform memories, multiple processors, and multiple networks, and those architectures are combined to form complex, multiple networks. In an IoT application, there may be a case where one application requires the use of a substantial portion of the available resources, and those resources must have a reproducible capacity. Figure 3 presents a view of the different memories that may be available in an IoT application, from processor to the Cloud. This view is based on a software programmed processor approach. Different memory types (principle of operation, storage capacity, speed, cost, ability to retain the data when the device power supply is removed (volatile vs. non-volatile memory), and physical location in relation to the processor core) would be considered based on the system requirements. The fastest memories with the shortest read and write times would be closest to the processor core, and are referred to as the cache memory. Figure 3 considers the cache memory as three levels (L1, L2, and L3), where the registers are closest to the core and on the same integrated circuit (IC) die as the processor itself before the cache memory would be accessed. L1 cache memory would be SRAM (static RAM) fabricated onto the same IC die as the processor, and would be limited in the amount of data it could hold. The registers and L1 cache memory would be used to retain the data of immediate use by the processor. External to the processor would be external cache memory (L2 and L3), where this memory may be fast SRAM with limited data storage potential or slower dynamic RAM (DRAM) that would have a greater data storage potential. RAM is volatile memory, so for data retention when the power supply is removed, non-volatile memory types would be required: EEPROM (electrically erasable programmable read only memory), Flash memory based on EEPROM, and HDD would be local memory, followed by external memory connected to a local area network (a “network drive”) and Cloud memory. However, there are costs associated with each memory type that would need to be factored into a cost function for the memory.


Figure 3. Availability of memory types in IoT applications.

3.4. Cache Misses and Implications

To help understand why cache memory misses, page faults, and other memory faults cause delays in computation, reference can be made to the 1-d fast Fourier transform (FFT). Theory states that a length n FFT has an n log n complexity, and so an idealized computation time could be determined from this assumption. However, the computation could take significantly longer to complete, depending on a number of hardware related issues that include the availability of cache memory, the associativity of the cache, how many levels of memory there are, and the size of the problem. For example, if a four-way associative cache is used, a radix 4 FFT might be selected based on the size of the available associative cache. However, suppose a radix 2 is used. If the input vector for the FFT was 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15, and the cache could fit only the first eight values, that means that on the 4th cycle, there is a cache miss. Knowing that the data could be reshaped and transposed to obtain data locality, i.e., 0 8 1 9 2 10 3 11 4 12 5 13 6 14 7 15, the operation could be completed with the data locally stored. However, as the input data set size increases, this not only results in cache misses, it also results in page faults, significantly slowing down the performance of an algorithm, which can be viewed graphically as an exponential rise [25]. For signal processing applications, the implication is that there will be a computation time increase.
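The reshape-and-transpose step quoted above can be reproduced directly (a sketch of the access pattern only, not an FFT implementation):

import numpy as np

x = np.arange(16)          # FFT input 0..15

# With a cache holding only the first eight values, later accesses miss.
# Reshaping to 2 x 8 and transposing interleaves the two halves, giving
# the locality-friendly order quoted in the text.
local = x.reshape(2, 8).T.flatten()
print(local)    # [ 0  8  1  9  2 10  3 11  4 12  5 13  6 14  7 15]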
3.5. Cost Functions for Memory Access

Usually, cost functions are based on statistical methods. However, the analysis used in this work creates a Normal Form that depicts the levels of memory desired relative to the access patterns of algorithms. With this, it is possible, a priori, to define the implementation requirements, such as maximum heat dissipation, power, cost, and time. With this information, as an FPGA designer, it is possible to utilize the available hardware resources, add the right types and levels of memory, choose the number of FPGAs linked together, and use FPGAs with other forms of processing unit. Such considerations would come from knowledge of the hardware and knowledge of algorithm requirements. Through experimental methods, developed by one of the co-authors, it can be seen how each level of memory as a Normal Form moves through the memory relative to its access patterns and arithmetic. What can be seen for any algorithm is that the curves, referring to Figure 2, start out constant, but then move to become a linear curve(s) while computation is still in real memory. Then, it is noticed that for each small piece of linearity, the slope gets steeper, indicating a change in memory speed. Thus, an evolution of a polynomial curve is seen that finally goes exponential when the access is to HDD. In parallel, if the available sizes and speeds of the various architectural components, such as registers, buffers, and memories, are known, then it is possible to “dimension lift” the Normal Form to include all these attributes. Thus, performance can be predicted and verified via suitably designed experiments.
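A deterministic cost function of this kind can be sketched as a weighted sum over memory levels; every latency and access count below is a hypothetical placeholder that a real design would calibrate by experiment:

# Illustrative a-priori cost model: total cycles = sum over levels of
# (accesses served at that level) x (latency of that level).
LATENCY_CYCLES = {'L1': 1, 'L2': 4, 'L3': 12, 'RAM': 100, 'DISK': 1_000_000}

def access_cost(accesses_per_level):
    return sum(n * LATENCY_CYCLES[level]
               for level, n in accesses_per_level.items())

# Two access patterns for the same arithmetic: one blocked to stay in
# cache, one striding out to RAM.  The change of slope in Figure 2 is
# the growing share of accesses served by the slower levels.
blocked = {'L1': 9000, 'L2': 900, 'L3': 90, 'RAM': 10, 'DISK': 0}
strided = {'L1': 5000, 'L2': 2000, 'L3': 2000, 'RAM': 1000, 'DISK': 0}
print(access_cost(blocked), access_cost(strided))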

4. The Field Programmable Gate Array (FPGA)

4.1. Introduction

In this section, the FPGA is introduced as a configurable hardware device that has an internal circuit structure that can be configured to different digital circuit or system architectures. It can, for example, be configured to implement a range of digital circuits from a simple combinational logic circuit through to a complex processor architecture. With the available hardware resources and the ability to describe the circuit/system design using a hardware description language (HDL), such as VHDL or Verilog [26], the designer can implement custom design architectures that are optimized to a set of requirements. For example, it is possible to describe a processor architecture using VHDL or Verilog, and to synthesize the design description, using a set of design synthesis constraints, into a logic description that can then be targeted to a specific FPGA device (design implementation, “place and route”). This processor, which is hardware, would then be connected to a memory containing a program for the processor to run; the memory may be registers (flip-flops), available memory macros within the FPGA, or external memory devices connected to the pins of the FPGA. Therefore, it would be possible to implement a hardware only design or a hardware/software co-design. In addition, if adequate hardware resources were available, more than one processor could be configured into the FPGA and a multi-processor device therefore developed.

4.2. Programmable Logic Devices (PLDs)

The basic concept of the PLD is to provide a programmable (configurable) IC that enables the designer to configure logic cells and interconnect within the device itself to form a digital electronic circuit that is housed within a single packaged IC. In this, the hardware resources (the available hardware for use by the designer) will be configured to implement a required functionality. By changing the hardware configuration, the PLD will perform a different function. Hardware configured PLDs are becoming increasingly popular due to the potential benefits in terms of logic potential (obsolescence), rapid prototyping capabilities in digital ASIC design (early stage prototyping, design debugging, and performance evaluation), and design speed benefits, where PLD based hardware can implement the same functions as a software programmed processor, but in a reduced time. Concurrent (parallel) operations can be built into the PLD circuit configuration that would otherwise be implemented sequentially within a processor. This is particularly important for computationally expensive mathematical operations, such as the FFT, digital filtering, and other mathematical operations that require complex data sets to be analyzed in a short time. Table 2 summarizes available devices and their vendors. It is not, however, a trivial task to select the right device for a specific application or range of applications.

Table 2. PLD vendors and devices [27].

Vendor | FPGA | SPLD/CPLD | Company Homepage
Xilinx® | Virtex, Kintex, Artix and Spartan | CoolRunner-II, XA CoolRunner-II and XC9500XL | https://www.xilinx.com/
Intel® | Stratix, Arria, MAX, Cyclone and Enpirion | — | https://www.intel.com/content/www/us/en/fpga/devices.html
Atmel Corporation (Microchip) | AT40Kxx family FPGA | ATF15xx, ATF25xx, ATF75xx CPLD families and ATF16xx, ATF22xx SPLD families | https://www.microchip.com/
Lattice Semiconductor | ECP, MachX and iCE FPGA families | ispMACH CPLD family | http://www.latticesemi.com/
Microsemi | PolarFire, IGLOO, IGLOO2, ProASIC3, Fusion and Rad-Tolerant FPGA families | — | https://www.microsemi.com/product-directory/fpga-soc/1638-fpgas

4.3. Hardware Functionality within the FPGA

Each FPGA provides a set of hardware resources available to the designer, where the use of specific resources would be considered to obtain a required performance in a specific application. However, this does rely on the use of the correct FPGA for the application and the knowledge of the designer in using these available hardware resources. There are specific advantages in selecting an FPGA for use rather than an off-the-shelf processor. By selecting the appropriate hardware architecture, high speed DSP operation, such as digital filtering and FFT operations, can be achieved, which might not be possible in software. This is partly due to the ability to create a custom design architecture and partly due to concurrent operation, which means that operations in hardware can be run in parallel as well as sequentially. A typical FPGA also has a high number of digital input and output pins for connecting to peripheral devices with programmable I/O standards. This allows for flexibility in the types of peripheral devices, such as memory and communications ICs, that could be connected to the FPGA. Within the device, as well as programmable logic circuits, built-in memories for data storage are available, which have an immediate and temporary use, i.e., for cache memory scenarios. DSP operations are supported using built-in hardware multipliers, and fast fixed-point and floating-point calculations can be implemented. In some FPGAs, built-in analog-to-digital converters (ADCs) are available for analog input sampling, as well as IP blocks, such as FFT and digital filter blocks. These resources give the ability to develop a custom design architecture suited to the specific application. The FPGA is configured by downloading a design configuration as a sequence of binary logic values (a sequence of 0’s and 1’s). The configuration would be initially created as a file using the FPGA design tools and then downloaded into the device. The configuration values are stored in memory within the device, where the memory may be volatile or non-volatile:

• Volatile memory: When data are stored within the memory, the data are retained whilst the memory is connected to a power supply. Once the power supply has been removed, the contents of the memory (the data) are lost. The early FPGAs utilized volatile SRAM based memory.
• Non-volatile memory: When data are stored within the memory, the data are retained even when the power supply has been removed. Specific FPGAs available today utilize Flash memory for holding the configuration.

5. Inner and Outer Product Implementation in Hardware Using the FPGA: Case Study

5.1. Introduction

In this section, the design, simulation, and physical prototype testing of a single IP core that implements the inner and outer products are presented. The idea here is to have a hardware macro cell, or core, that can be accessed from an external digital system (e.g., a software programmed processor that can pass the computation tasks to this cell whilst it performs other operations in parallel). The input array data are stored as constants within arrays in the ipOpCore module, as shown in Figure 4, and are therefore, in this case, read-only. However, in another application, it would be necessary to allow the arrays to be read-write for entering new data to be analyzed, and the design would then be modified to allow array A and B data to be loaded into the core, either as serial or parallel data. Hence, the discussion provided in this section relates to the specific case study. In addition, a single result output could be considered, and the need for test data output might not be a requirement.

The motivation behind this work is to model tensors as multi-dimensional arrays and to analyze these using tensor analysis in hardware. This requires a suitable array access algorithm to be developed, the use of suitable memory for storing data in a specific application, and a suitable implementation strategy. In this paper, the inner and outer products are considered using only the FPGA as the target device, an efficient algorithm to implement the inner and outer products in a single circuit implemented in hardware is used, and appropriate embedded FPGA memory resources are used to enable fast memory access.

Figure 4. ipOpCore case study design.

The design shown in Figure 4 was created to allow for both product results to be independently accessed during normal runtime operation and for specific internal data to be accessed for test and debug purposes. The design description was written in VHDL as a combination of behavioral, RTL, and structural code targeting the Xilinx® Artix-7 (XC7A35TICSG324-1L) FPGA. This specific device was chosen for practical reasons as it contains hardware resources suited for this application. The design, however, is portable and is readily transferred to other FPGAs, or to be part of a larger digital ASIC design, if required. For any design implementation, the choice of hardware, and potentially software, to use would be based on a number of considerations. The FPGA was mounted on the Digilent® Arty A7-35T Development Board and was chosen for the following reasons:

The FPGA considered is used in other project work and, as such, the work described in this paper could readily be incorporated into these projects. Specifically, sensor data acquisition using the FPGA and data analysis within the FPGA projects would benefit from this work, where the algorithm and memory access operations used in this paper would provide additional value to the work undertaken.

1. The development board used provided hardware resources that were useful for project work, such as the 100 MHz clock, external memory, switches, push buttons, light emitting diodes (LEDs), expansion connectors, LAN connection, and a universal serial bus (USB) interface for FPGA configuration and runtime serial I/O.
2. The development board was physically compact and could be readily integrated into an enclosure for mobility purposes and operated from a battery rather than powered through the USB +5 V power.
3. The Artix-7 FPGA provided adequate internal resources and I/O for the work undertaken, and external resources could be readily added via the expansion connectors if required.
4. For memory implementation, the FPGA can use the internal look-up tables (LUTs) as distributed memory for small memories, can use internal BRAM (Block RAM) for larger memories, and can use external volatile/non-volatile memories connected to the I/O.
5. For computation requirements, the FPGA allows for both fixed-point and floating-point arithmetic operations to be implemented.
6. For an embedded processor based approach, the MicroBlaze CPU can be instantiated within the FPGA for software based implementations.



The I/O for this module are as follows:

• ipOp: User to select whether the inner or outer product is to be performed.
• clock: Master clock (100 MHz).
• resetN: Master, asynchronous active low reset.
• addrA: Array A address for reading array contents (input tensor A).
• addrB: Array B address for reading array contents (input tensor B).
• addrPC: Array PC address for reading array contents (product code array).
• addrResIp: Address of inner product for reading array contents (output tensor IP).
• addrResOp: Address of outer product for reading array contents (output tensor OP).
• dataA: Array A data element being accessed (for test purposes only).
• dataB: Array B data element being accessed (for test purposes only).
• productCode: Input array size and shape information for algorithm operation.
• dataResIp: Inner product result array contents (serial read-out).
• dataResOp: Outer product result array contents (serial read-out).

These I/O signals can be categorized as input control, input address, and output data.

5.2. Design Approach and Target FPGA

The operation of the combined inner and outer product is demonstrated by reference to a case study design that implements the necessary memory and algorithm functions within a single IP core. Given that these functions are to be mapped to a custom design architecture and configured within the FPGA, a range of possible solutions can be created. The starting point for the design is the computation to perform. Consider the tensor product of two arrays (A and B), where A is a 3 × 3 array and B is a 3 × 2 array. For demonstration purposes, the numbers are limited to being 8-bit signed integers rather than real numbers. The principle of evaluation is the same for both number types, but the HDL coding style to be adopted would be different. Therefore, the possible numbers considered would be integer values in the range of −128 to +127 (decimal). Internally within the VHDL code, these values were modelled as INTEGER data types that were suitable for simulation and synthesis. For synthesis, the integer numbers were translated to an 8-bit wide STD_LOGIC_VECTOR data type. This meant that the physical digital circuit utilized an 8-bit data bus, and this size bus was selected as a standard width for all array input addresses and output data. Fixed-point, 2's complement arithmetic was also implemented. Whilst the data range was limited in size, this approach was chosen as the purpose of the work was to implement and demonstrate the algorithm and memory utilization. The VHDL code was written such that the data range and array sizes were readily adjusted within the array definitions and no modification to the algorithm code was required. Floating-point arithmetic rather than fixed-point arithmetic could be used by coding a floating-point multiplier for matrix multiplication operations (e.g., [28,29]), and modelling the data as floating point numbers rather than simple fixed-point scalar numbers as used here. Considering arrays A and B, these two arrays can be operated on to form the tensor product as both the inner product and the outer product:

$$A = \begin{pmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \\ 6 & 7 & 8 \end{pmatrix} \qquad B = \begin{pmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 5 \end{pmatrix}$$

The tensor product for A and B is noted as:

$$C = A \otimes B$$

The result of the inner product, Cip, is:

$$C_{ip} = A \otimes B = \begin{pmatrix} 10 & 13 \\ 28 & 40 \\ 46 & 67 \end{pmatrix}$$

The result of the outer product, Cop, is:

$$C_{op} = A \otimes B = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 2 \\ 0 & 0 & 2 & 3 & 4 & 6 \\ 0 & 0 & 4 & 5 & 8 & 10 \\ 0 & 3 & 0 & 4 & 0 & 5 \\ 6 & 9 & 8 & 12 & 10 & 15 \\ 12 & 15 & 16 & 20 & 20 & 25 \\ 0 & 6 & 0 & 7 & 0 & 8 \\ 12 & 18 & 14 & 21 & 16 & 24 \\ 24 & 30 & 28 & 35 & 32 & 40 \end{pmatrix}$$

The above products were initially developed using C and Python coding, where the data in C were stored in arrays and in Python were stored in lists. The combined inner/outer product algorithm was verified through running the algorithm with different data sets and verifying the software simulation model results against manual hand calculation results. Once the software version of the design was verified, the Python code functionality was manually translated to a VHDL equivalent. The two key design decisions to make were:

1. How to model the arrays for early-stage evaluation work and how to map the arrays to hardware in the FPGA.
2. How to design the algorithm to meet timing constraints, such as maximum processing time, number of clock cycles required, hardware size considerations, and the potential clock frequency, with the hardware once it is configured within the FPGA.

In this design, the data set was small and so VHDL arrays were used for both the early-stage evaluation work and for synthesis purposes. In VHDL, the input and results arrays were defined and initialized as follows:

TYPE array_1by4 IS ARRAY (0 TO 3) OF INTEGER;
TYPE array_1by6 IS ARRAY (0 TO 5) OF INTEGER;
TYPE array_1by9 IS ARRAY (0 TO 8) OF INTEGER;
TYPE array_1by36 IS ARRAY (0 TO 35) OF INTEGER;
TYPE array_1by54 IS ARRAY (0 TO 53) OF INTEGER;

CONSTANT arrayA : array_1by9 := (0, 1, 2, 3, 4, 5, 6, 7, 8);
CONSTANT arrayB : array_1by6 := (0, 1, 2, 3, 4, 5);

SIGNAL arrayResultIp : array_1by6 := (0, 0, 0, 0, 0, 0);
SIGNAL arrayResultOp : array_1by54 := (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);

These are one-dimensional arrays suited for ease of memory addressing, appropriate for the algorithm operation, synthesizable into logic, and with a direct equivalence in the C and Python software models. The input arrays (arrayA and arrayB) contain the input data. The results arrays (arrayResultIp (inner product) and arrayResultOp (outer product)) were initialized with 0s. It was not necessary, in this case, to map to any embedded BRAM or external memory as the data set size was small and easily mapped by the synthesis tool to distributed RAM within the FPGA. The PC (product code) array is not shown above, but this is an array that contains the shape and size of arrays A and B. For the algorithm, with direct mapping to VHDL from the Python code, the inner product and outer product each required a set number of clock cycles. Figure 5 shows a simplified timing diagram identifying the signals required to implement the inner/outer product computation. Once the computation has been completed, the array contents could then be read out one element at a time. For evaluation purposes, all array values were made accessible concurrently, but could readily be made available serially via a multiplexor arrangement to reduce the number of output signals required in the design. A computation run would commence with the run control signal being pulsed 0-1-0 with the product selection input ipOp set to either logic 0 (inner product) or logic 1 (outer product). In this implementation, the inner product required 18 clock cycles and the outer product required 54 clock cycles to complete. The array data read-out operations are not, however, shown in Figure 5. The data values were defined using the INTEGER data type for modelling and simulation purposes, and these values were mapped to an 8-bit STD_LOGIC_VECTOR data type for synthesis into hardware. The 8-bit width data bus was sufficient to account for all data values in this study.
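To indicate how the Python loops can map directly to VHDL over these flattened arrays, the following is a minimal behavioral sketch of the two computations. It is not the published ipOpCore code (which, as described in Section 5.3, sequences the work over 18 or 54 clock cycles using a counter and state machine); the entity and process names are illustrative only.

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY ipOpSketch IS
  PORT (run : IN STD_LOGIC);
END ENTITY ipOpSketch;

ARCHITECTURE behavioral OF ipOpSketch IS
  -- Array types and data as declared earlier in this section.
  TYPE array_1by6  IS ARRAY (0 TO 5)  OF INTEGER;
  TYPE array_1by9  IS ARRAY (0 TO 8)  OF INTEGER;
  TYPE array_1by54 IS ARRAY (0 TO 53) OF INTEGER;
  CONSTANT arrayA : array_1by9 := (0, 1, 2, 3, 4, 5, 6, 7, 8);
  CONSTANT arrayB : array_1by6 := (0, 1, 2, 3, 4, 5);
  SIGNAL arrayResultIp : array_1by6  := (OTHERS => 0);
  SIGNAL arrayResultOp : array_1by54 := (OTHERS => 0);
BEGIN
  computeProducts : PROCESS (run)
    VARIABLE sum : INTEGER;
    VARIABLE ip  : array_1by6  := (OTHERS => 0);
    VARIABLE op  : array_1by54 := (OTHERS => 0);
  BEGIN
    IF rising_edge(run) THEN
      -- Inner product: C(i,j) = sum over k of A(i,k)*B(k,j); the 3x3
      -- array A is flattened as A(3*i+k), the 3x2 array B as B(2*k+j).
      FOR i IN 0 TO 2 LOOP
        FOR j IN 0 TO 1 LOOP
          sum := 0;
          FOR k IN 0 TO 2 LOOP
            sum := sum + arrayA(3*i + k) * arrayB(2*k + j);
          END LOOP;
          ip(2*i + j) := sum;
        END LOOP;
      END LOOP;
      -- Outer (Kronecker) product: C(3i+k, 2j+l) = A(i,j)*B(k,l),
      -- with the 9x6 result flattened row-major into 54 elements.
      FOR i IN 0 TO 2 LOOP
        FOR j IN 0 TO 2 LOOP
          FOR k IN 0 TO 2 LOOP
            FOR l IN 0 TO 1 LOOP
              op(6*(3*i + k) + 2*j + l) := arrayA(3*i + j) * arrayB(2*k + l);
            END LOOP;
          END LOOP;
        END LOOP;
      END LOOP;
      arrayResultIp <= ip;
      arrayResultOp <= op;
    END IF;
  END PROCESS computeProducts;
END ARCHITECTURE behavioral;

The index arithmetic is the point of the sketch: because both arrays and both results are one-dimensional, every access is a simple affine function of the loop indices, which is what makes the memory addressing regular.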


Figure 5. Computation control signal timing diagram.

5.3. System Architecture

How the memory and algorithm would generally be mapped to a hardware-only or a hardware/software co-design would be dependent on the design requirements, the specification resulting from the requirements identification, the available hardware, and the designer. Therefore, a range of possible solutions would be possible, but in this design, a hardware-only solution was a design requirement. The memory was modelled as VHDL arrays, and the algorithm was implemented using a counter and state machine arrangement. Both the inner and outer products were to be selectable for computation, which required a design decision as to whether a single memory space for both products or separate memory spaces for each product would be suitable. Given the relatively small size of the data set, and to support design evaluation, separate memory spaces for the inner and outer products were developed. However, an alternative implementation could utilize a single memory space. The system architecture is shown in Figure 6. Here, the ipOpCore module implements the memory and computation (2's complement number multiplication) whilst the control unit module implements the system control and algorithm. The control unit module input control signals are:

• ipOp: user selection of whether the inner or outer product is to be performed;
• clock: master clock (100 MHz);
• resetN: master, asynchronous active low reset; and
• run: initiates a computation run (0-1-0 pulse).

Figure 7 shows a simplified view of the elaborated VHDL code schematic that was generated by the Xilinx® Vivado v2015.3 (HL WebPACK Edition) software. This schematic shows the two modules (ipOpCore (I0) and controlUnit (I1)) that connect together to form the top-level design with 44 inputs and 40 outputs.
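The paper does not reproduce the controlUnit code. As an illustration only, a counter and state machine of the kind described might take the following form, where the entity name, the busy output, and the state names are assumptions, not the published design:

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY controlUnitSketch IS
  PORT (clock, resetN, run, ipOp : IN  STD_LOGIC;
        busy                     : OUT STD_LOGIC);
END ENTITY controlUnitSketch;

ARCHITECTURE rtl OF controlUnitSketch IS
  TYPE state_t IS (idle, calculating);
  SIGNAL state : state_t := idle;
  SIGNAL count : INTEGER RANGE 0 TO 53 := 0;
BEGIN
  PROCESS (clock, resetN)
    VARIABLE lastCycle : INTEGER RANGE 0 TO 53;
  BEGIN
    IF resetN = '0' THEN              -- asynchronous active low reset
      state <= idle;
      count <= 0;
    ELSIF rising_edge(clock) THEN
      -- Inner product: 18 clock cycles; outer product: 54 clock cycles.
      IF ipOp = '0' THEN lastCycle := 17; ELSE lastCycle := 53; END IF;
      CASE state IS
        WHEN idle =>
          IF run = '1' THEN           -- 0-1-0 pulse starts a run
            count <= 0;
            state <= calculating;
          END IF;
        WHEN calculating =>
          IF count = lastCycle THEN
            state <= idle;            -- computation complete
          ELSE
            count <= count + 1;
          END IF;
      END CASE;
    END IF;
  END PROCESS;
  busy <= '1' WHEN state = calculating ELSE '0';
END ARCHITECTURE rtl;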


The target FPGA was the Xilinx® Artix-7 mounted on the Digilent® Arty A7-35T Development Board. This board is shown in Figure 8, which identifies the key features of the board used, and it provided a convenient hardware platform to undertake the required design development and experiments. The FPGA was provided with an on-board 100 MHz clock module for the clock, and the resetN signal was provided by one of the available on-board push buttons. On the board, the array address and data signals would be available internally within the FPGA (to connect to a system that would be integrated within the FPGA alongside this design) or on the external header pins on the development board (for connecting to another system external to the FPGA).

Figure 6. Simplified system block diagram.

Figure 7. Simplified schematic view of the elaborated VHDL code.



Figure 8. Xilinx® Artix-7 FPGA on the Digilent® Arty board identifying key components used in the experimentation.

The design must eventually be implemented within the FPGA and this is a two-step process. Firstly, the VHDL code is synthesized, and then the synthesized design is implemented in the target FPGA. The synthesis and implementation operations can be run using the default settings, or the user can set constraints to direct the tools. In this case study, the default tool settings were used, and Table 3 identifies the hardware resources required after synthesis and implementation for the design.

Table 3. Artix-7 FPGA resource utilization in the case study design.

Item                      Use                                  Number Used
Package pin               Input                                44
Package pin               Output                               40

Design synthesis results:
Post-synthesis I/O        Inputs                               23 *
Post-synthesis I/O        Outputs                              40
Slice LUTs                Total used                           454
Slice LUTs                LUT as logic                         442
Slice LUTs                LUT as memory (distributed RAM)      12
Slice registers           Total used                           217
Slice registers           Register as flip-flop                217
Other logic               Clock buffer                         2

Design implementation results:
Post-implementation I/O   Inputs                               23
Post-implementation I/O   Outputs                              40
Slice LUTs                Total used                           391
Slice LUTs                LUT as logic                         379
Slice LUTs                LUT as memory (distributed RAM)      12
Slice registers           Total used                           217
Slice registers           Register as flip-flop                217
Other logic               Clock buffer                         2

* Note that the number of inputs required in the design after synthesis does not include the address input bits that were always a constant logic 0 in this case study. This was due to the standard 8-bit address bus used for all input addresses; the sizes of the arrays meant that the most significant bits (MSBs) of the array addresses were not required. Note also that, post-implementation, the number of slice LUTs required was less than that post-synthesis.

5.4. Design Simulation

Design simulation was undertaken to ensure that the correct values were stored, calculated, and accessed. The Xilinx® Vivado software tool was used for design entry, and simulation was performed using the built-in Vivado simulator.
A VHDL test bench was used to perform the computation and array data read-out operations. Figure 9 shows the complete simulation run where the clock frequency in simulation was set to 50 MHz (the master clock frequency divided by two).

This simulation clock frequency was selected to allow external control signals from a system operating at 100 MHz to be provided on the falling edge of the 50 MHz clock.
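The test bench itself is not listed in the paper. A minimal sketch of the kind of stimulus described (50 MHz clock, active low reset, 0-1-0 run pulse) might look as follows, where the ipOpTop entity name and its port list are assumptions for illustration:

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY ipOpTop_tb IS
END ENTITY ipOpTop_tb;

ARCHITECTURE testbench OF ipOpTop_tb IS
  SIGNAL clock  : STD_LOGIC := '0';
  SIGNAL resetN : STD_LOGIC := '0';
  SIGNAL run    : STD_LOGIC := '0';
  SIGNAL ipOp   : STD_LOGIC := '0';  -- '0' = inner, '1' = outer product
BEGIN
  -- Device under test (entity and port names assumed for this sketch).
  dut : ENTITY work.ipOpTop
    PORT MAP (clock => clock, resetN => resetN, run => run, ipOp => ipOp);

  clock <= NOT clock AFTER 10 ns;    -- 20 ns period = 50 MHz clock

  stimulus : PROCESS
  BEGIN
    resetN <= '0';                   -- assert the active low reset
    WAIT FOR 50 ns;
    resetN <= '1';
    WAIT FOR 20 ns;
    run <= '1';                      -- 0-1-0 pulse starts the computation
    WAIT FOR 20 ns;
    run <= '0';
    WAIT FOR 2000 ns;                -- allow the 18 or 54 cycles plus read-out
    WAIT;                            -- end of stimulus
  END PROCESS stimulus;
END ARCHITECTURE testbench;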

Figure 9. Simulation study results: Computation and results read-out.

For the inner product data read-out, Figure 10 shows the simulation results for all six product array element values (dataResIp) being read out of the arrayResultIp array. The ipOp control signal is not used (set to logic 1 in the simulation test bench) as it is only used for the computation, the clock is held at logic 0 as it is also only used for the computation, and the reset signal is not asserted (resetN = 1). The inner product array address (addrResIp) is provided to access each element in the array sequentially.

FigureFigure 10. 10. Simulation SimulationSimulation study studystudyresults: results:results: Inner InnerInner product. product.product.

This shows the specific results for the complete inner product as follows:

$$C_{ip} = A \otimes B = \begin{pmatrix} 10 & 13 \\ 28 & 40 \\ 46 & 67 \end{pmatrix}$$

For the outer product data read-out, Figure 11 shows the simulation results for the last 13 values (dataResOp) being read out of the arrayResultOp array. The ipOp control signal is not used (set to logic 1 in the simulation test bench) as it is only used for the computation, the clock is held at logic 0 as it is also only used for the computation, and the reset signal is not asserted (resetN = 1). The outer product array address (addrResOp) is provided to access each element in the array sequentially.
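As an illustration of one way such a sequential read-out can be described (a sketch only; the process body is not taken from the paper, although the conversion functions are from the standard IEEE numeric_std package), the 8-bit address can be converted to an array index, and the INTEGER element driven onto the 8-bit data bus, within the architecture where arrayResultIp, addrResIp, and dataResIp are declared:

-- Sketch: the 8-bit address selects one INTEGER element of the
-- inner product result array, which is driven onto the 8-bit data
-- bus as a 2's complement value.
readIp : PROCESS (addrResIp, arrayResultIp)
  VARIABLE index : INTEGER;
BEGIN
  index := TO_INTEGER(UNSIGNED(addrResIp));
  IF index <= 5 THEN                 -- arrayResultIp holds six elements (0 to 5)
    dataResIp <= STD_LOGIC_VECTOR(TO_SIGNED(arrayResultIp(index), 8));
  ELSE
    dataResIp <= (OTHERS => '0');    -- out-of-range addresses return zero
  END IF;
END PROCESS readIp;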

Figure 11. Simulation study results: Outer product (final set of results read-out only).


This shows the specific results for the last 13 values in the results array as follows:



 ......  Electronics 2018, 7, x FOR PEER REVIEW 18 of 24  ......    .. . . . .   ......  ⎡.. . . . .⎤   ......  ⎢.. . . . .⎥  Cop = A ⊗ B =⎢.. ...... ⎥    ⎢.. ...... ⎥  𝐶 =𝐴 ⊗ 𝐵=⎢  ⎥  .. . . . .  ⎢  . . 0 7 0⎥ 8  ⎢.. . 0708. 14 21 16⎥ 24  ⎢. . 14211624⎥  ⎣. 30. 3028 2835 35324 320⎦ 40

5.5. Hardware Test Set-Up

Design analysis was in the main performed using simulation to determine correct functionality and signal timing, considering the initial design description prior to synthesis (behavioral simulation), the synthesized design (post-synthesis simulation), and the implemented design (post-implementation simulation). This is a typical simulation approach that is supported by the FPGA simulator for verifying the design operation at different steps in the design process. Given that the design is intended to be used as a block within a larger digital system, the simulation results would give an appropriate level of estimation of the signal timing and the circuit power consumption.

In addition to the simulation study, the design was also implemented within the FPGA and its signals monitored using the development board connectors (the PmodTM (peripheral module) connectors) with a logic analyzer and oscilloscope. This test arrangement is shown in Figure 12.


Figure 12. Embedded hardware tester.

To generate the top-level design module input signals, a built-in tester circuit was developed and incorporated into the FPGA. This was a form of built-in self-test (BIST) [30] circuit that generated the control signals identified in Figure 5 and allowed the internal array address and data signals to be accessed. The tester circuit was set up to continuously repeat the sequence in Figure 5 rather than run just once, and so did not require any user-set input control signals to operate. As the number of address and data bits required for the five arrays (40 address bits and 40 data bits) exceeded the number of PmodTM connections available, these signals were multiplexed to eight address and eight data bits within the built-in tester, and the multiplexor control signals were output for identifying the array being accessed. The control signals were also accessible on the PmodTM connectors for test purposes. Figure 13 shows a simplified schematic view of the elaborated VHDL code, where I0 is the top-level design module and I1 is the built-in tester module.
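One way to describe such multiplexing is with a selected signal assignment. The following sketch uses the dataOut and muxOut names that appear in Figure 13, but the 3-bit select encoding and the muxSelect signal are assumptions for illustration, not the published tester code (the address buses would be steered in the same manner):

-- One of the five 8-bit array data buses is steered to the eight
-- PmodTM data pins under a 3-bit select; the select value is also
-- output (muxOut) so the array being accessed can be identified on
-- the logic analyzer. muxSelect : STD_LOGIC_VECTOR(2 DOWNTO 0).
WITH muxSelect SELECT
  dataOut <= dataA           WHEN "000",
             dataB           WHEN "001",
             productCode     WHEN "010",
             dataResIp       WHEN "011",
             dataResOp       WHEN "100",
             (OTHERS => '0') WHEN OTHERS;
muxOut <= muxSelect;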




Figure 13. Embedded hardware tester: Simplified schematic view of the elaborated VHDL code.

The hardware test arrangement was useful to verify that the signals were generated correctly and matched the logic levels expected during normal design operation. However, it was necessary to reduce the speed of operation to account for non-ideal electrical parasitic effects that caused ringing of the signal. In this specific set-up, the speed of operation of the circuit when monitoring the signals using the logic analyzer and oscilloscope was not deemed important, so the 100 MHz clock was internally divided within the built-in tester circuit to 2 MHz in the study. However, further analysis could determine how fast the signals could change if the PmodTM connector was required to connect external memory for larger data sets.

Figure 14 shows the logic analyzer test set-up with the Artix-7 FPGA Development Board (bottom left) and the Digilent® Analog Discovery “USB Oscilloscope and Logic Analyzer” (top right) [31]. The 16 digital inputs for the logic analyzer function were available for use, and a GND (ground, 0) connection was required to monitor the test circuit outputs. Internal control signals (ClockTop, ipOp, run, and resetN) were also available for monitoring in this arrangement.

The Digilent® Waveforms software [32] was utilized to control the test hardware and view the results. Figure 15 shows the logic analyzer output in Waveforms. Here, one complete cycle of the test process is shown, where the calculations are initially performed and the array outputs then read. The logic level values obtained in this physical prototype test agreed with the results from the simulation study. The data values are shown as a combined integer number value (data, top) and the values of the individual bits (7 down to 0).
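A divide-by-50 counter of the following general form would produce the 2 MHz test clock from the 100 MHz master clock. This is a generic sketch, not the published tester code; it assumes clock2MHz is declared as a STD_LOGIC signal in the enclosing architecture:

-- Divide the 100 MHz master clock by 50 to give a 2 MHz test clock:
-- toggle the output every 25 input clock cycles (a toggle every
-- 250 ns gives a 500 ns period, i.e., a 2 MHz square wave).
clockDivider : PROCESS (clock, resetN)
  VARIABLE count : INTEGER RANGE 0 TO 24 := 0;
BEGIN
  IF resetN = '0' THEN
    count     := 0;
    clock2MHz <= '0';
  ELSIF rising_edge(clock) THEN
    IF count = 24 THEN
      count     := 0;
      clock2MHz <= NOT clock2MHz;
    ELSE
      count := count + 1;
    END IF;
  END IF;
END PROCESS clockDivider;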


Figure 14. Logic analyzer test set-up using the Digilent® Analog Discovery.

Figure 15. Logic analyzer test results using the Digilent® Analog Discovery: Complete cycle.

The run signal (a 0-1-0 pulse) initiates the computation that is selected by the ipOp signal at the start of the cycle. The data readout on the 8-bit data bus can be seen towards the end of the cycle as both a bus value and individual bit values.

Figure 16 shows the data readout operation towards the end of the cycle. The data output identifies the values for array A (nine values), array B (six values), and the inner product result array (six values) as identified in Section 5.2.


Figure 16. Logic analyzer test results using the Digilent® Analog Discovery: Data readout.

5.6. Design Implementation Considerations

This case study design has presented one example implementation of the combined inner and outer product algorithm. The study focused on creating a custom hardware-only design implementation rather than developing the algorithm in software to run on a suitable processor architecture. The approach taken to create the hardware design was to map the algorithm operations in software to a hardware equivalence. The hardware design was created using two main modules:

1. The computation module.
2. The control module.

The control module was required to receive control signals from an external system and transform these to internal control signals for the computation module.

The computation module itself was modelled as two separate sub-modules, as this was based on the underlying structure of the problem, which was to efficiently access data from memory for running a computation on data held in specific memory locations:

1. The memory module.
2. The algorithm module.

For a specific application, the memory module would be used for storing input data, intermediate results data, and final (output) data. For this design, the physical memory used was internal to the FPGA using distributed memory within the LUTs, given the size of the data set, the availability of hardware resources within the FPGA, and the synthesis tool that automatically determined what hardware resources were to be used. The memory was modelled using VHDL arrays, where the input data arrays held constant values and the intermediate and output data arrays held variables. In a different scenario, the memory modelling in VHDL might be different, for example, explicitly targeting internal BRAM cells and external memory attached to the FPGA pins, as sketched below. Such an approach would resemble a standard processor architecture with different levels of memory. The internal latches, flip-flops, distributed RAM, and BRAM cells within the FPGA would map to cache memory internal to the processor, and external memory would map to attached memory devices, as depicted in Figure 3.
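As an illustration of such explicit targeting (a generic template, not the case study code, which used distributed RAM), a synchronous read/write memory description of the following form is commonly inferred as BRAM by FPGA synthesis tools; the entity name and the 256-location, 8-bit geometry are assumptions:

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE IEEE.NUMERIC_STD.ALL;

ENTITY bramSketch IS
  PORT (clock       : IN  STD_LOGIC;
        writeEnable : IN  STD_LOGIC;
        address     : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
        dataIn      : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
        dataOut     : OUT STD_LOGIC_VECTOR(7 DOWNTO 0));
END ENTITY bramSketch;

ARCHITECTURE rtl OF bramSketch IS
  TYPE ram_t IS ARRAY (0 TO 255) OF STD_LOGIC_VECTOR(7 DOWNTO 0);
  SIGNAL ram : ram_t := (OTHERS => (OTHERS => '0'));
BEGIN
  -- Synchronous write and registered (synchronous) read: this
  -- pattern is what typically allows synthesis tools to map the
  -- array to dedicated BRAM cells rather than distributed (LUT) RAM.
  PROCESS (clock)
  BEGIN
    IF rising_edge(clock) THEN
      IF writeEnable = '1' THEN
        ram(TO_INTEGER(UNSIGNED(address))) <= dataIn;
      END IF;
      dataOut <= ram(TO_INTEGER(UNSIGNED(address)));
    END IF;
  END PROCESS;
END ARCHITECTURE rtl;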

The designer would have design choices when considering the algorithm module. One approach would be to use a standard processor architecture that would be software programmed and mapped to hardware resources within the FPGA. Depending on the device, the processor may be an embedded core (a so-called hard core) or may be an IP block that can be instantiated in a design and synthesized into the available FPGA logic (a so-called soft core). For example, in Xilinx® FPGAs, the MicroBlaze 32-bit RISC (reduced instruction set computer) CPU can be instantiated into a custom design. It is also possible, if the hardware resources are sufficient, to instantiate multiple soft cores within the FPGA. This would allow for a multi-processor solution and on-chip processor-to-processor communications with parallel processing. A second approach would be to develop a custom architecture solution that maps the algorithm and memory modules to the user requirements, giving a choice to implement sequential or parallel (concurrent) operations. This provides a high level of flexibility for the designer, but requires a different design approach, thinking in terms of hardware rather than software operations. A third approach would be to create a hardware-software co-design incorporating custom architecture hardware and standard processor architectures working concurrently. A final consideration in implementation would be to identify example processor architectures and target hardware used in machine and deep learning applications, where their benefits and limitations for specific applications could be assessed. For example, in software processor applications, the CPU is used for tensor computations where a GPU (graphics processing unit) is not available. GPUs have architectures and software programming capabilities that are better suited than a CPU for applications, such as gaming, where high-speed data processing and parallel computing operations are required. An example is the Nvidia® Tensor Core [33].

6. Conclusions

In this paper, the design and simulation of a hardware block to implement a combined inner and outer product was introduced and elaborated. This work was considered in the context of developing embedded digital signal processing algorithms that can effectively and efficiently process complex data sets. The FPGA was used as the target hardware, and the product algorithm was developed as VHDL modules. The algorithm was initially evaluated using C and Python code before translation to a hardware description in VHDL. The paper commenced with a discussion of tensors and the need for effective and efficient memory access, to control memory access times and the cost associated with such memory access operations. To develop the ideas, the FPGA hardware implementation developed was an example design that paralleled an initial software algorithm (C and Python coding) used for algorithm development. The design was evaluated in simulation, and hardware implementation issues were discussed.

Author Contributions: The authors contributed equally to the work undertaken as a co-design effort. Co-design is not just giving tasks to scholars in separate yet complementary disciplines. It is an ongoing attempt to communicate ideas across disciplines and a desire to achieve a common goal. This type of research not only helps both disciplines but it gives definitive proof of how that collaboration achieves greater results, and mentors students and other colleagues to do the same. The case study design was a hardware implementation by Ian Grout using the FPGA of the algorithm developed by Lenore Mullin. Both authors contributed equally to the background discussions and the writing of the paper. Funding: This research received no external funding. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Earley, S. Analytics, Machine Learning, and the Internet of Things. IT Prof. 2015, 17, 10–13. [CrossRef]
2. Corneanu, C.A.; Simón, M.O.; Cohn, J.F.; Guerrero, S.E. Survey on RGB, 3D, Thermal, and Multimodal Approaches for Facial Expression Recognition: History, Trends, and Affect-Related Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1548–1568. [CrossRef] [PubMed]
3. Xilinx®. FPGA Leadership across Multiple Process Nodes. Available online: https://www.xilinx.com/products/silicon-devices/fpga.html (accessed on 9 October 2018).
4. Xilinx®. Homepage. Available online: https://www.xilinx.com/ (accessed on 9 October 2018).
5. Xilinx®. Artix-7 FPGA Family. Available online: https://www.xilinx.com/products/silicon-devices/fpga/artix-7.html (accessed on 9 October 2018).

6. Fleisch, D. A Student’s Guide to Vectors and Tensors; Cambridge University Press: Cambridge, UK, 2011; ISBN-10 0521171903, ISBN-13 978-0521171908.
7. Institute of Electrical and Electronics Engineers. IEEE Std 1076-2008—IEEE Standard VHDL Language Reference Manual; IEEE: New York, NY, USA, 2009; ISBN 978-0-7381-6853-1, ISBN 978-0-7381-6854-8.
8. Kindratenko, V.; Trancoso, P. Trends in High Performance Computing. Comput. Sci. Eng. 2011, 13, 92–95. [CrossRef]
9. Lane, N.D.; Bhattacharya, S.; Mathur, A.; Georgiev, P.; Forlivesi, C.; Kawsar, F. Squeezing Deep Learning into Mobile and Embedded Devices. IEEE Pervasive Comput. 2017, 16, 82–88. [CrossRef]
10. Mullin, L.; Raynolds, J. Scalable, Portable, Verifiable Kronecker Products on Multi-scale Computers. In Constraint Programming and Decision Making; Studies in Computational Intelligence; Ceberio, M., Kreinovich, V., Eds.; Springer: Cham, Switzerland, 2014; Volume 539.
11. Gustafson, J.; Mullin, L. Tensors Come of Age: Why the AI Revolution Will Help HPC. 2017. Available online: https://www.hpcwire.com/2017/11/13/tensors-come-age-ai-revolution-will-help-hpc/ (accessed on 9 October 2018).
12. Workshop Report: Future Directions in Tensor Based Computation and Modeling. 2009. Available online: https://www.researchgate.net/publication/270566449_Workshop_Report_Future_Directions_in_Tensor-Based_Computation_and_Modeling (accessed on 9 October 2018).
13. Tensor Computing for Internet of Things (IoT). 2016. Available online: http://drops.dagstuhl.de/opus/volltexte/2016/6691/ (accessed on 9 October 2018).
14. Python.org. Python. Available online: https://www.python.org/ (accessed on 9 October 2018).
15. TensorflowTM. Available online: https://www.tensorflow.org/ (accessed on 9 October 2018).
16. Pytorch. Available online: https://pytorch.org/ (accessed on 9 October 2018).
17. Keras. Available online: https://en.wikipedia.org/wiki/Keras (accessed on 9 October 2018).
18. Apache MxNet. Available online: https://mxnet.apache.org/ (accessed on 9 October 2018).
19. Microsoft Cognitive Toolkit (MTK). Available online: https://www.microsoft.com/en-us/cognitive-toolkit/ (accessed on 9 October 2018).
20. CAFFE: Deep Learning Framework. Available online: http://caffe.berkeleyvision.org/ (accessed on 9 October 2018).
21. DeepLearning4J. Available online: https://deeplearning4j.org/ (accessed on 9 October 2018).
22. Chainer. Available online: https://chainer.org/ (accessed on 9 October 2018).
23. Google. Cloud TPU. Available online: https://cloud.google.com/tpu/ (accessed on 9 October 2018).
24. Mullin, L.M.R. A Mathematics of Arrays. Ph.D. Thesis, Syracuse University, Syracuse, NY, USA, 1988.
25. Mullin, L.; Raynolds, J. Conformal Computing: Algebraically connecting the hardware/software boundary using a uniform approach to high-performance computation for software and hardware. arXiv 2018. Available online: https://arxiv.org/pdf/0803.2386.pdf (accessed on 1 November 2018).
26. Institute of Electrical and Electronics Engineers. IEEE Std 1364™-2005 (Revision of IEEE Std 1364-2001), IEEE Standard for Verilog® Hardware Description Language; IEEE: New York, NY, USA, 2006; ISBN 0-7381-4850-4, ISBN 0-7381-4851-2.
27. Ong, Y.S.; Grout, I.; Lewis, E.; Mohammed, W. Plastic optical fibre sensor system design using the field programmable gate array. In Selected Topics on Optical Fiber Technologies and Applications; IntechOpen: Rijeka, Croatia, 2018; pp. 125–151, ISBN 978-953-51-3813-6.
28. Dou, Y.; Vassiliadis, S.; Kuzmanov, G.K.; Gaydadjiev, G.N. 64-bit Floating-point FPGA Matrix Multiplication. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 20–22 February 2005. [CrossRef]
29. Amira, A.; Bouridane, A.; Milligan, P. Accelerating Matrix Product on Reconfigurable Hardware for Signal Processing. In Field-Programmable Logic and Applications, Proceedings of the 11th International Conference, FPL 2001, Belfast, UK, 27–29 August 2001; Lecture Notes in Computer Science; Brebner, G., Woods, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 101–111, ISBN 978-3-540-42499-4 (print), ISBN 978-3-540-44687-3 (online).
30. Hurst, S.L. VLSI Testing—Digital and Mixed Analogue/Digital Techniques; The Institution of Engineering and Technology: London, UK, 1998; pp. 241–242, ISBN 0-85296-901-5.
31. Digilent®. Analog Discovery. Available online: https://reference.digilentinc.com/reference/instrumentation/analog-discovery/start?redirect=1 (accessed on 1 November 2018).

32. Digilent®. Waveforms. Available online: https://reference.digilentinc.com/reference/software/waveforms/waveforms-3/start (accessed on 1 November 2018).
33. Nvidia. Nvidia Tensor Cores. Available online: https://www.nvidia.com/en-us/data-center/tensorcore/ (accessed on 1 November 2018).

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).