<<

Journal of Low Power Electronics and Applications

Article Accelerating Population Count with a Hardware Co- for MicroBlaze

Iouliia Skliarova

Department of Electronics, Telecommunications and Informatics, Institute of Electronics and Informatics Engineering of Aveiro (IEETA), Campus Universitário de Santiago, University of Aveiro, 3810-193 Aveiro, Portugal; [email protected]

Abstract: This paper proposes a Field-Programmable (FPGA)-based hardware accelerator for assisting the embedded MicroBlaze soft-core processor in calculating population count. The population count is frequently required to be executed in cyber-physical systems and can be applied to large data sets, such as in the case of molecular similarity search in cheminformatics, or assisting with computations performed by binarized neural networks. The MicroBlaze instruction set architecture (ISA) does not support this operation natively, so the count has to be realized as either a sequence of native instructions (in software) or in parallel in a dedicated hardware accelerator. Different hardware accelerator architectures are analyzed and compared to one another and to implementing the population count operation in MicroBlaze. The achieved experimental results with large vector lengths (up to 217) demonstrate that the best hardware accelerator with DMA (Direct Memory Access) is ~31 times faster than the best software version running on MicroBlaze. The proposed architectures are scalable and can easily be adjusted to both smaller and bigger input vector lengths. The entire system was implemented and tested on a Nexys-4 prototyping board containing a low-cost/low-   power Artix-7 FPGA.

Citation: Skliarova, I. Accelerating Keywords: cyber-physical systems; computation; embedded systems; population count; hardware ac- Population Count with a Hardware celerator Co-Processor for MicroBlaze. J. Low Power Electron. Appl. 2021, 11, 20. https://doi.org/10.3390/jlpea11020020 1. Introduction Academic Editors: Andrea Acquaviva Cyber-physical systems (CPS) are a ground for the Internet of Things, smart cities, and Francesco Barchi smart grid, and, actually, smart “anything” (cars, domestic appliances, hospitals). CPS

Received: 26 March 2021 tightly integrates computing, communication, and control technologies to achieve safety, Accepted: 22 April 2021 stability, performance, reliability, adaptability, robustness, and efficiency [1,2]. Frequently, Published: 24 April 2021 CPSs are built on reconfigurable hardware devices such as Field-Programmable Gate Arrays (FPGA) and Programmable Systems-on-Chip (PSoC). This is a logical choice since modern

Publisher’s Note: MDPI stays neutral reconfigurable platforms combine on a single chip high-capacity programmable logic with regard to jurisdictional claims in and state-of-the art general-purpose and graphic processors, achieving a tight integration published maps and institutional affil- of computational and physical elements. Moreover, inherent FPGA reconfigurability iations. provides direct support for run-time system adaptation, one of the requirements of CPS paradigm [2]. FPGA are also able to provide high energy efficiency by exploiting low-level fine-grained parallelism through customizing data paths to the requirements of a specific algorithm/application [3–7].

Copyright: © 2021 by the author. In this paper, I study one of the operations that is frequently executed by CPSs but is Licensee MDPI, Basel, Switzerland. not directly supported by MicroBlaze, which is a population count. Population count, also This article is an open access article called Hamming weight, is the number of nonzero elements in a (binary) vector. Albeit the distributed under the terms and operation itself seems very simple, it is used in several disciplines including information conditions of the Creative Commons theory, coding theory, cheminformatics, and cryptography. Modern hard-core processors Attribution (CC BY) license (https:// provide direct support for population count by including such instructions as popcnt in creativecommons.org/licenses/by/ Core [8] and vcnt in ARM [9]. 4.0/).

J. Low Power Electron. Appl. 2021, 11, 20. https://doi.org/10.3390/jlpea11020020 https://www.mdpi.com/journal/jlpea J. Low Power Electron. Appl. 2021, 11, 20 2 of 12

The main contributions of this study are the following: • Analysis and relative comparison of population count computations in software running on MicroBlaze processor; • Analysis and relative comparison of parallel dedicated accelerators for population count computation in hardware; • A hardware/software co-design technique implemented and tested in a low-cost FPGA of Artix-7 from ; • The result of experiments and comparisons demonstrating increase of throughput of hardware-accelerated computations comparing to the best software alternative. The remainder of this paper is organized as follows. The background is presented in Section2. Overview of the related work in software/hardware support for population count is reported in Section3. The detailed description of explored hardware accelerator architectures is done in Section4. The results are presented and discussed in Section5. The conclusion is given in Section6.

2. Background 2.1. FPGA and MicroBlaze FPGAs can be configured to implement an instruction-set architecture and other mi- crocontroller components augmented, if required, with new peripherals and functionality. A implemented in an FPGA at the cost of reconfigurable logic resources is known as a soft-core processor. Soft-core processors are slower than custom silicon processors (hard-core processors) with equivalent functionality, but they are nonetheless attractive because they are highly customizable. Besides, multiple soft-core processors may be instantiated in an FPGA to create a multicore processor. One of the most popular soft-core processors is Xilinx’s MicroBlaze [10]. MicroB- laze is a 32/64-bit configurable RISC processor with three preset configurations: (1) suit- able for running baremetal code; (2) providing deterministic real-time processing on a real-time operating system; and (3) supporting embedded Linux. MicroBlaze features a three/five/eight-stage pipeline (depending on the desired optimization) and represents a with instruction and data accesses done in separate address spaces. MicroBlaze ISA (Instruction Set Architecture) supports two instruction formats (register- register and register-immediate) and includes traditional RISC arithmetic, logical, branch, and load/store instructions augmented with special instructions [10].

2.2. Population Count Population count operation calculates the number of bits set to 1 in a binary vector. It can also be defined for any vector (not obligatory binary) as the number of the vector’s nonzero elements. If the source vector has N bits, the result is (blog2Nc + 1)-bit long. This operation has many practical applications; some examples are given below. • Binarized Neural Networks (BNN) are reduced precision neural networks, having weights and activations restricted to single-bit values [11–13]. One of the computations executed in BNNc is to multiply a binarized vector of input neurons against a binarized weight matrix. Such operation can be done using variant of a population count [11]. The parameter N tends to be large (64–1200) as it equals the number of input neurons for a fully connected layer and to the product of the size of the convolution filter in one dimension and the number of input channels for a convolution layer [12]. An example of using BNNs in a robot design for agricultural cyber-physical systems is reported in [14]. • Cryptographic applications—in [15] the population count operation is used to identify pairs of vectors that are likely to lie at a small angle to each other. In [16] Hamming weight is computed to describe an attack that recovers the private key from the public key for a public-key encryption scheme based on Mersenne numbers. Hamming distance (which is the population count of the number of mismatches between a pair J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 3 of 13

J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 3 of 13

J. Low Power Electron. Appl. 2021, 11, 20 3 of 12 public key for a public-key encryption scheme based on Mersenne numbers. Ham- publicming distancekey for a (which public-key is the encryption population scheme count of based the number on Mersenne of mismatches numbers. between Ham- mingaof pair vectors) distance of vectors) needs (which needs to beis the determinedto bepopulation determined to preventcount to preventof intrusion the number intrusion and of detect mismatches and anomaliesdetect anomalies between in CPSs a inpairreviewed CPSs of vectors)reviewed in [17 ].needs in [17]. to be determined to prevent intrusion and detect anomalies • • inTelecommunications—error Telecommunications—errorCPSs reviewed in [17]. detection/correct detection/correctionion in in a acommunication communication channel channel recur- recur- • Telecommunications—errorringring to to Hamming Hamming weight weight calculus calculus detection/correct is is reported reportedion in in in[18]. [18 a ].communication channel recur- • • ringCheminformatics—aCheminformatics—a to Hamming weight high-performance high-performance calculus is reported molecular molecular in [18]. similarity similarity search search system system is isde- de- • Cheminformatics—ascribedscribed in in [19] [19 executing] executing high-performance similarity similarity search search molecular of bitstring of bitstring similarity fingerprints fingerprints search and systemcombining and combining is de- fast scribedpopulationfast population in [19] count executing countevaluation similarity evaluation and pruningsearch and pruningof al bitstringgorithms. algorithms. fingerprints A fingerprint A fingerprintand for combining chemical for chem-fastsim- populationilarityical similarity is a descriptioncount isevaluation a description of a molecule and pruning of asuch molecule althatgorithms. the such similarity A that fingerprint the between similarity for two chemical betweendescriptions sim- two ilaritygivesdescriptions issome a description idea gives of the some of similaritya molecule idea of thebetween such similarity that two the moleculesbetween similarity two [19].between molecules The mosttwo descriptions [ 19widely]. The used most givesfingerprintswidely some used idea have fingerprints of thea length similarity have ranging abetween length from ranging two166 moleculesto from2048 bits. 166 [19]. toThe 2048 The most bits.most popular The widely most way used popu- to fingerprintscomparelar way totwo have compare fingerprints a length two fingerprintsrangingis to calculate from is 166 totheir calculate to Tanimoto 2048 bits. their similarity,The Tanimoto most popularwhich similarity, can way be which tore- compareducedcan be to reducedtwo one fingerprints population to one population iscount to calculate evaluation count their evaluation per Tanimoto comparison per similarity, comparison [19]. Millions which [19]. can Millionsof befinger- re- of ducedprintsfingerprints to have one to population have be processed to be processedcount for evaluationmost for corporate most per corporate comparisoncompound compound collections. [19]. Millions collections. of finger- • • printsBioinformatics—inBioinformatics—in have to be processed [20], [20], a atool for tool ismost is proposed proposed corporate to to remove compound remove duplicated duplicated collections. and and near-duplicated near-duplicated • Bioinformatics—inreadsreads from from next next generati generation [20], a ontool sequencing sequencingis proposed datasets. datasets.to remove duplicated and near-duplicated readsPopulationPopulation from next count count generati is is also alsoon used usedsequencing in in computer computer datasets. chess chess to to evaluate evaluate the the mobility mobility of of pieces pieces fromfromPopulation their their attack attack count sets sets andis and also quickly used matchmatchin computer blocksblocks of chessof text text to in in theevaluate the implementation implementation the mobility of of hashof hash pieces arrays ar- fromraysand theirand hash hashattack trees. trees. sets Execution andExecution quickly time time match for for population blockspopulation of counttext count in computations the computations implementation over over vectors of vectors hash has ar-has an raysanimpact impactand hash on on overall trees.overall performance Execution performa timence of systems forof systems population that usethat thecount use results the computations re ofsults such ofcomputations such over computations vectors [11 has–20 ]. an[11–20].Therefore, impact Therefore, on several overall researchseveral performa research effortsnce haveof efforts systems been have directed that been use directed to the efficiently results to efficiently implementof such implementcomputations this operation this [11–20].operationin both Therefore, software in both severalsoftware and hardware. research and hardware. efforts have been directed to efficiently implement this operation in both software and hardware. 3.3. Related Related Work Work 3. RelatedMostMost Work efficient efficient population population count count implementa implementationtion would would be by be a by native a native dedicated dedicated in- structioninstructionMost efficientbut but since since population MicroBlaze MicroBlaze count ISA ISA doesimplementa does not not include includetion wouldone, one, I will Ibe will bylimit limit a native my my review reviewdedicated to to listing listing in- structionknownknown software but software since and andMicroBlaze hardware hardware ISA realizations. realizations. does not include one, I will limit my review to listing known3.1. Software software Implementations and hardware ofrealizations. PopulationCount 3.1. Software Implementations of Population Count Let us consider 32-bit long vectors (N = 32). The simplest way is to compute the 3.1. SoftwareLet us considerImplementati 32-bitons long of Population vectors (N Count = 32). The simplest way is to compute the pop- population count by checking and accumulating the value of each of the 32 bits individually ulation count by checking and accumulating the value of each of the 32 bits individually (untilLet nous moreconsider bits 32-bit are set), long like vectors in the (N following = 32). The C codesimplest snippet: way is to compute the pop- ulation(until nocount more by bits checking are set), and like accumulating in the following the valueC code of snippet: each of the 32 bits individually (until unsignedno more bits int are set),popCount like in the = following0; C code snippet: unsignedwhile (n) int popCount = 0; /* n is the given 32-bit vector */ while{ popCount(n) += n & 1; /* n is the given 32-bit vector */ { popCountn >>= 1; += n & 1; } n >>= 1; }This approach is slow and highly inefficient since it requires multiple operations for This approach is slow and highly inefficient since it requires multiple operations for eachThis bit inapproach the vector. is slow Brian and Kernighan’s highly inefficient algorithm since reduces it requires the number multiple of operationscycle iterations for each bit in the vector. Brian Kernighan’s algorithm reduces the number of cycle iterations eachto the bit actual in the populationvector. Brian count Kernighan’s value. This algo isrithm achieved reduces by iterativel the numbery subtracting of cycle iterations one from to the actual population count value. This is achieved by iteratively subtracting one from tothe the given actual vector, population which count flips all value. the bitsThis to is the achieved right of by the iterativel rightmosty subtracting set bit, including one from the the given vector, which flips all the bits to the right of the rightmost set bit, including the therightmost given vector, set bit, which and calculating flips all the bitwise bits to ANDthe right with of the the previous rightmost vector’s set bit, value: including the rightmost set bit, and calculating bitwise AND with the previous vector’s value: rightmostunsigned set bit, andint calculating popCount bitwise = 0; AND with the previous vector’s value: unsignedwhile (n) int popCount =/* 0; n is the given 32-bit vector */ while{ n(n) &= (n - 1); /*/* nclear is the the given least 32-bit significant vector bit*/ set */ { npopCount++; &= (n - 1); /* clear the least significant bit set */ } popCount++; }One of the possibilities reported in [21] is to count the set bits in parallel. In this code, the Onegiven of vector the possibilities is divided reportedinto smaller in [21] chunks, is to countfor which the set the bits population in parallel. counts In this are code, com- One of the possibilities reported in [21] is to count the set bits in parallel. In this theputed given and vector then isadded: divided into smaller chunks, for which the population counts are com- code, the given vector is divided into smaller chunks, for which the population counts are puted and then added: computed and then added:

J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 4 of 13 J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 4 of 13 J. Low Power Electron. Appl. 2021, 11, 20 4 of 12

static const int B[] = static{0x55555555, const int0x33333333, B[] = 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF}; {0x55555555,/* count bits 0x33333333, of each 2-bit 0x0F0F0F0F, chunk */ 0x00FF00FF, 0x0000FFFF}; /*unsigned count bitsint popCountof each 2-bit= n - chunk((n >> */ 1) & B[0 ]); unsigned/* count intbits popCount of each =4-bit n - ((nchunk >> */1) & B[0 ]); /*popCount count bits= ((popCount of each 4-bit>> 2) chunk& B[1]) */ + (popCount & B[1]); popCount/* count =bits ((popCount of each >>8-bit 2) &chunk B[1]) */ + (popCount & B[1]); /*popCount count bits= ((popCount of each 8-bit>> 4) chunk+ popCount) */ & B[2]; popCount/* count =bits ((popCount of each >>16-bit 4) + chunkpopCount) */ & B[2]; /*popCount count bits= ((popCount of each 16-bit>> 8) +chunk popCount) */ & B[3]; popCount/* count =bits ((popCount of the whole>> 8) 32-bit+ popCount) number & */B[3]; /*popCount count bits= ((popCount of the whole >> 16) 32-bit + popCount) number */& B[4]; popCount = ((popCount >> 16) + popCount) & B[4]; TheThe fastest fastest counting counting method method would would be beto use to use a lookup a lookup table table (LUT). (LUT). This This approach approach is The fastest counting method would be to use a lookup table (LUT). This approach is notis practical not practical for big for values big values of N, of but N, it but is feasible it is feasible to have to havea relatively a relatively small small LUT; LUT;for in- for not practical for big values of N, but it is feasible to have a relatively small LUT; for in- stance,instance, LUT LUT 8:4 8:4(storing (storing population population counts counts for for all all8-bit 8-bit vectors), vectors), and and then then to tocalculate calculate the the stance, LUT 8:4 (storing population counts for all 8-bit vectors), and then to calculate the resultresult by by summing summing N/8 N/8 intermediate intermediate results. results. An An example example for for N N= 32 = 32is given is given below: below: result by summing N/8 intermediate results. An example for N = 32 is given below: static const unsigned char LUT[256] = static{ /* constmacro unsignedsuggested char by HallvardLUT[256] Furuseth= [21]*/ {# define/* macro B2(n) suggested n, n+1, n+1,by Hallvard n+2 Furuseth [21]*/ ## definedefine B2(n)B4(n) n,B2(n), n+1, B2(n+1),n+1, n+2 B2(n+1), B2(n+2) ## definedefine B4(n)B6(n) B2(n),B4(n), B2(n+1),B4(n+1), B2(n+1),B4(n+1), B2(n+2)B4(n+2) #B6(0), define B6(1), B6(n) B6(1),B4(n), B6(2)B4(n+1), B4(n+1), B4(n+2) B6(0),}; B6(1), B6(1), B6(2) };unsigned int popCount = LUT[n & 0xff] + LUT[(n >> 8) & 0xff] + unsigned int popCount LUT[(n = LUT[n >> 16) & 0xff]& 0xff] + LUT[(n+ LUT[n >> >> 8) 24]; & 0xff] + LUT[(n >> 16) & 0xff] + LUT[n >> 24]; 3.2. Hardware Implementations of Population Count 3.2. Hardware Implementations of Population Count 3.2.State-of-the-art Hardware Implementations hardware implementations of Population Count of population count computations have beenState-of-the-art exhaustivelyState-of-the-art analyzed hardware hardware in implementations[22–29]. implementations The basic of ideas ofpopulation population of these countmethods count computations computations are summarized have have beenbelowbeen exhaustively [22]: exhaustively analyzed analyzed in in[22–29]. [22–29 Th]. Thee basic basic ideas ideas of ofthese these methods methods are are summarized summarized below [22]: • belowParallel [22]: counters from [24] are tree-based circuits that are built from full-adders. • Parallel counters from [24] are tree-based circuits that are built from full-adders. • • TheParallel designs counters from [25] from are [24 based] are on tree-based sorting networks, circuits that which are built have from known full-adders. limitations; • • TheinThe particular, designs designs from when from [25] [the25 are] arenumber based based on of on sortinsource sortingg networks,data networks, items which grows, which have the have knownoccupied known limitations; limitations;resources inare inparticular, increased particular, whenconsiderably. when the the number number of of source source data itemsitems grows,grows, thethe occupied occupied resources resources are areincreased increased considerably. considerably. • Counting networks [26] eliminate propagation delays in carry chains that appear in • • Counting networks [26] eliminate propagation delays in carry chains that appear Counting[24] and give networks very good [26] eliminateresults especially propagation for pipelined delays in implementations.carry chains that appearHowever, in in [24] and give very good results especially for pipelined implementations. However, [24] and give very good results especially for pipelined implementations. However, theythey occupy occupy many many genera general-purposel-purpose logical logical slices. slices. they occupy many general-purpose logical slices. • • TheThe designs designs [28] [28 are] are based based on on embedded embedded to to FPGA FPGA digital digital signal signal processing processing (DSP) (DSP) • Theslices,slices, designs organized organized [28] asare as a based atree tree of ofon DSP DSP embedded adders. adders. to FPGA digital signal processing (DSP) slices, organized as a tree of DSP adders. • • LUT-basedLUT-based circuits circuits [29] [29] are veryvery competitivecompetitive but but they they are are hardly hardly scalable scalable and and resource re- • LUT-basedsourceconsuming consuming circuits for big for [29] values big are values ofvery N. ofcompetit N. ive but they are hardly scalable and re- • • sourceAA combination combination consuming of of forcounting counting big values networks, of N. LUT-LUT- andand DSP- DSP- based based circuits circuits is proposedis proposed in [in22 ]. • A[22]. Itcombination is It noticed is noticed that of that counting LUT-based LUT-based networks, circuits circuits andLUT- and counting and counting DSP- networks basednetworks circuits are are the theis fastest proposed fastest solutions solu- in [22].tionsfor It smallfor is noticedsmall length length that sub-vectors LUT-based sub-vectors (up circuits to(up 128 to bits). 128and bits). counting The resultThe resultnetworks is produced is produced are asthe a combinationalfastest as a combi- solu- tionsnationalsum for of smallsum the accumulated oflength the accumulatedsub-vectors population (up population to counts 128 bits). of counts the The sub-vectors resultof the issub-vectors produced that can be asthat eithera combi-can done be nationaleitherin DSP done sum slices in of DSP or the in slices aaccumulated circuit or in built a circuit frompopulation built logical from counts slices. logical of the slices. sub-vectors that can be either done in DSP slices or in a circuit built from logical slices. ArchitecturesArchitectures that that are are more more recent recent are are reported reported in in [30–33]. [30–33 ].In In [30] [30 a] ageneric generic system system architecturearchitectureArchitectures is is proposed proposed that are forfor more binarybinary recent string string are comparisons reportedcomparisons in that[30–33]. that is based is In based [30] on aa onVirtexgeneric a Virtex UltraScale+ system Ul- architecturetraScale+FPGA. The FPGA. is system proposed Theis system adaptable for is binary adaptable to different string to diff bitcomparisons widthserent bit and widths that thenumber andis based the of number parallelon a Virtex of processing parallel Ul- traScale+processingelements FPGA. elements and The exhibits system and highexhibits is adaptable throughput high thro to diff forughputerent streaming bitfor widths streaming data. and The data.the adopted number The adopted approachof parallel ap- is processingproachdefinitely is definitely elements interesting interestingand for exhibits high-end for high high-end applications throughput applications as for it requiresstreaming as it PCIe-basedrequires data. The PCIe-based connection adopted con-ap- to a host CPU but is not suitable for a low-cost CPS. proachnection is to definitely a host CPU interesting but is not for suitable high-end for a applications low-cost CPS. as it requires PCIe-based con- nectionA to LUT-efficient a host CPU but compressor is not suitable architecture for a low-cost for performing CPS. population count operation is described in [31] to be used in matrix multiplications of variable precision. The authors started with a population count unit built as a tree of 6:3 LUTs and adders requiring a large

J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 5 of 13

J. Low Power Electron. Appl. 2021, 11, 20 A LUT-efficient compressor architecture for performing population count operation5 of 12 is described in [31] to be used in matrix multiplications of variable precision. The authors started with a population count unit built as a tree of 6:3 LUTs and adders requiring a numberlarge number of LUTs of and LUTs many and stages many to stages pipeline to pi thepeline the tree.adder Therefore, tree. Therefore, a carry-free a carry-free bit heapbit compressionheap compression was applied, was applied, which which executes executes a carry-save a carry-save addition addition using using regular regular full full addersadders operating operating in parallel in parallel (except (except for the for last the addition, last addition, which which requires requires carry propagation).carry propaga- Thistion). work This resembles work resembles parallelcounters parallel counters from [24 ,from33]. [24,33]. TheThe paper paper [32 [32]] states states that that modern modern FPGA FPGA LUT-based LUT-based architectures architectures are are not not particularly particularly efficientefficient for for implementation implementation of compressor of compressor trees tree (whichs (which can be can considered be considered parallel parallel counters coun- withters explicit with explicit carry-in carry-in and carry-out and carry-out signals sign to beals connectedto be connected to adjacent to adjacent compressors compressors in thein same the stage)same stage) and suggests and suggests to augment to augmen existingt existing commercial commercial FPGA logic FPGA elements logic elements with a 6-inputwith a XOR6-input gate. XOR The gate. authors The authors demonstrate demons thattrate the that proposed the proposed modifications modifications improve im- compressorprove compressor tree synthesis tree synthesis using generalized using generalized parallel counters. parallel counters. It isIt clear is clear that that none none of the of analyzedthe analyzed related related work work targets targets specifically specifically CPSs CPSs incorporating incorporat- theing MicroBlaze the MicroBlaze processor, processor, which canwhich be essentialcan be essential for cost-sensitive for cost-sensitive applications. applications.

4. Proposed4. Proposed Hardware Hardware Accelerators Accelerators for for Population Population Count Count 4.1. Hardware Population Count Accelerators Architectures 4.1. Hardware Population Count Accelerators Architectures Six different architectures have been analyzed for the hardware accelerator module, Six different architectures have been analyzed for the hardware accelerator module, specified in VHDL. In all the cases, 32-bit sub-vectors arriving from DMA (Direct Memory specified in VHDL. In all the cases, 32-bit sub-vectors arriving from DMA (Direct Memory Access) are processed in parallel and the intermediate results are summed up at the speed Access) are processed in parallel and the intermediate results are summed up at the speed of AXI (Advanced eXtensible Interface) DMA transfers. The analyzed architectures are of AXI (Advanced eXtensible Interface) DMA transfers. The analyzed architectures are the the following: following: • • A1—trivialA1—trivial counting counting the the set set bits; bits; • • A2—usingA2—using tables tables 8:4 8:4 and and adders; adders; • • A3—usingA3—using double double layer layer of tablesof tables 8:4; 8:4; • • A4—countingA4—counting the the set set bits bits in parallelin parallel with with 16- 16- and and 32-bit 32-bit chunks chunks processed processed through through multiplication;multiplication; • A5—counting bits set in parallel without multiplication, like in Section 3.1; • A5—counting bits set in parallel without multiplication, like in Section 3.1; • A6—dedicated LUT-based circuit from [29]. • A6—dedicated LUT-based circuit from [29]. The first architecture (A1) is implemented as for loop VHDL construction and results The first architecture (A1) is implemented as for loop VHDL construction and results in a long sequence of 31 adders that does not seem particularly interesting except for the in a long sequence of 31 adders that does not seem particularly interesting except for the sake of comparison: sake of comparison: (dataIn) variable v_cnt : natural; begin v_cnt := 0; for i in dataIn'range loop if dataIn(i) = '1' then v_cnt := v_cnt + 1; end if; end loop; s_cnt <= v_cnt; end process; cntOut <= std_logic_vector(to_unsigned(s_cnt, cntOut'length)); The second architecture A2 uses four tables 8:4 to quickly process 8-bit chunks and The second architecture A2 uses four tables 8:4 to quickly process 8-bit chunks and threethree adders adders to sumto sum up up the the previous previous results. results. The The elaborated elaborated design design is presented is presented in Figure in Figure1 (albeit1 (albeit 6 bits 6 arebits sufficient are sufficient to keep to keep the result, the result, 32-bit 32-bit output output is used is withused the with most the significant most signifi- bitscant set tobits 0). set to 0). TheThe third third architecture architecture A3 isA3 similar is similar to A2, to but A2, instead but instead of adders, of adders, reutilizes reutilizes the 8:4 tablethe 8:4 as illustratedtable as illustrated in Figure in2. Figure The first 2. layerThe first of four laye tablesr of four 8:4 calculatestables 8:4 thecalculates population the population counts forcounts 8-bit chunks. for 8-bit Thechunks. second The layer second executes layer exec table-basedutes table-based “addition” “addition” operation operation over the over same-orderthe same-order bits and bits the and correct the correct carries fromcarries the from previous the previous lowest orderlowest results order as results illustrated as illus- in Figuretrated 2ina). Figure Blue 2a). squares Blue aresquares individual are individual bits of the bits four of the 4-bit four values 4-bit thatvalues have that to have be to summed up. The possible result in every column never exceeds 3 bits: red square is the most significant bit, green square is the middle bit, and yellow square is the least significant bit. Note that at position 4 the result can only be either 1 or 0, with carry always equal to 0. J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 6 of 13

J.J.J. LowLow Low PowerPower Power Electron.Electron. Electron. Appl.Appl. Appl. 202120212021,,, 1111,,, xx 20 FORFOR PEERPEER REVIEWREVIEW 666 ofof of 1212 be summed up. The possible result in every column never exceeds 3 bits: red square is the most significant bit, green square is the middle bit, and yellow square is the least signifi- cant bit. Note that at position 4 the result can only be either 1 or 0, with carry always equal to 0. This is because after processing the 8-bit chinks the maximum number of set bits that toThis 0. This isto because 0. is This because afteris because after processing processing after processing the 8-bit the 8-bi chinks thet chinks 8-bi thet maximumchinks the maximum the numbermaximum number of setnumber of bits setthat bitsof set eachthat bits that each of these may have is equal to 8. Solid rectangles represent five instances of the 8:4 eachof these ofeach these may of havemaythese ishave may equal ishave equal to 8. is Solid equalto 8. rectanglesSolid to 8. rectanglesSolid represent rectangles represent five represent instances five instances five of the instances 8:4 of tablethe of8:4 to the 8:4 table to be used at the second layer (see Figure 2b). The architecture resembles a compres- tablebe used totable be at used theto be second at used the secondat layer the (seesecond layer Figure (see layer 2Figub). (see There Figu 2b). architecture reThe 2b). architecture The resembles architecture resembles a compressor resembles a compres- a tree compres- sor tree executing a single 4-input addition operation. sorexecuting treesor executing tree a single executing a 4-input single a single4-input addition 4-input addition operation. addition operation. operation.

Figure 1. Schematic of A2 architecture: tables 8:4 and adders. Figure Figure1. SchematicSchematic 1. Schematic of of A2 A2 architecture: of A2 architecture: tables tables 8:4 tables and 8:4 adders. and adders.

“Adder” i result: “Adder”“Adder” i result: i result: a)ii ii ii b) a)a)i i i b) b) 3 2 1 3 23 12 1

3 2 1 0 0 3 23 12 01 00 0

3 2 1 0 3 23 12 01 0

3 2 1 0 3 23 12 01 0

3 2 1 0 3 23 12 01 0

3 2 1 0 3 23 12 01 0

0…0 3 4 3 2 1 0 0…0 0…03 43 43 23 12 01 0 PopulationPopulation count count Population count Figure 2. Executing the addition with 8:4 tables (a); schematic of A3 architecture (b). FigureFigure Figure2. 2. ExecutingExecuting 2. Executing the the addition addition the addition with with 8:4 8:4 with tables tables 8:4 ( (tablesaa);); schematic schematic (a); schematic of of A3 A3 architecture architectureof A3 architecture ( b). (b). The fourth architecture A4 executes counting the set bits in parallel with 16- and 32- TheThe fourth The fourth architecture architecture A4A4 executesexecutes A4 executes countingcounting counting the the set set bitsthe bits inset in parallel bitsparallel in parallel with with 16- 16- with and and 32-bit16- 32- and 32- bit chunks processed through multiplication. The following VHDL code fragment illus- bitchunks chunksbit processed chunks processed processed through through multiplication. through multiplication. multiplication. The The following following The following VHDL VHDL code VHDL code fragment fragment code illustratesfragment illus- illus- tratesthe idea: the idea: trates thetrates idea: the idea: architecturearchitecturearchitecture BehavioralBehavioral Behavioral ofof PopulationCountPopulationCount of PopulationCount isis is signal v1, v2, v3 : unsigned(31 downto 0); signalsignal v1, v2,v1, v3v2, : v3unsigned(31 : unsigned(31 downto downto 0); 0); signal v4 : unsigned(63 downto 0); signalsignal v4 : v4unsigned(63 : unsigned(63 downto downto 0); 0); begin beginbegin v1 <= unsigned(dataIn) – v1 <=v1 unsigned(dataIn) <= unsigned(dataIn) – – ('0' & unsigned(dataIn(31 downto 1)) AND x”55555555”); ('0' ('0'& unsigned(dataIn(31 & unsigned(dataIn(31 downto downto 1)) AND1)) x”55555555”);AND x”55555555”); v2 <= (v1 AND x”33333333”) + v2 <=v2 (v1 <= AND(v1 x”33333333”)AND x”33333333”) + + ((“00” & v1(31 downto 2)) AND x”33333333”); ((“00”((“00” & v1(31 & v1(31 downto downto 2)) AND2)) x”33333333”);AND x”33333333”); v3v3 <=<=v3 (v2(v2 <= ++(v2 ((“0000”((“0000” + ((“0000” && v2(31v2(31 & v2(31 downtodownto downto 4)))4))) 4)))ANDAND x”0F0F0F0F”);x”0F0F0F0F”);AND x”0F0F0F0F”); v4 <= v3 * x”01010101”; v4 <=v4 v3 <= * v3x”01010101”; * x”01010101”; cntOut <= std_logic_vector(x”000000” & v4(31 downto 24)); cntOutcntOut <= std_logic_vector(x”000000” <= std_logic_vector(x”000000” & v4(31 & v4(31 downto downto 24)); 24)); endend Behavioral;Behavioral;end Behavioral; TheThe architecturearchitecture A5A5 isis veryvery similarsimilar toto thethe previousprevious one,one, withwith thethe onlyonly differencedifference beingbeing thethe multiplicationmultiplication operationoperation substitutedsubstituted byby thethe followingfollowing combinationcombination ofof ANDsANDs andand shiftsshifts (as(as thisthis mightmight havehave influenceinfluence onon thethe criticalcritical path):path):

J. Low Power Electron. Appl. 2021, 11, 20 7 of 12 J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 7 of 13

The architectureThe architecture A5 is veryA5 is similar very similar to the to previous the previous one, with one, the with only the difference only difference being being the multiplicationthe multiplication operation operation substituted substituted by the by following the following combination combination of ANDs of ANDs and shifts and shifts (as this(as might this might have influencehave influence on the on critical the critical path): path): architecture Behavioral of PopulationCount is signal v1, v2, v3, v4, v5 : unsigned(31 downto 0); begin -- v1, v2, and v3 are assigned as in the VHDL code above v4 <= (v3 + (x”00” & v3(31 downto 8))) AND x”00FF00FF”; v5 <= (v4 + (x”0000” & v4(31 downto 16))) AND x”0000FFFF”; cntOut <= std_logic_vector(v5); end Behavioral;

Finally,Finally, the last the architecture last architecture is taken is taken straight straight from Figurefrom Figure 3.32 in 3.32 [29] in which [29] which constructs constructs a 32-bita 32-bit Hamming Hamming weight weight /comparator counter/comparator by directly by directly instantiating instantiating and configuring and configuring Artix-7Artix-70s LUTs.′s LUTs. This approach This approach is definitely is definitely the most the difficult most difficult to implement to implement and less portable,and less port- as theable, respective as the respective VHDL code VHDL uses Xilinx-specificcode uses Xilin LUTx-specific components LUT components (declared (declared in UNISIM in UNI- library)SIM and library) all the and LUTs all inthe the LUTs layers in havethe layers to be have configured to be configured with proper with constants; proper constants; this operationthis operation is obviously is ob error-prone.viously error-prone.

4.2. System4.2. System Architecture Architecture DifferentDifferent hardware hardware accelerators accelerators for MicroBlaze for MicroBlaze for population for population count count computation computation have beenhave designedbeen designed and implementedand implemented in Xilinx in Vivado 2020.2 2020.2 design design suite andsuite Vitisand Vitis 2020.22020.2 core developmentcore development kit. All kit. the All designs the design haves have been been tested tested on a low-coston a low-cost and low-and low- powerpower xc7a100tcsg324-1 xc7a100tcsg324-1 FPGA FPGA [34] from [34] Artix-7from Artix-7 family family available available on Nexys-4 on Nexys-4 prototyping prototyping boardboard [35]. The[35]. system The system architecture architecture was described was described using using the Vivado the Vivado IP integrator, IP integrator, with with variousvarious population population count count accelerators accelerators specified specified in VHDL. in VHDL. The software The software was written was written in C in C languagelanguage and built and usingbuilt using Vitis. TheVitis. system The system consists consists of the of following the following blocks: blocks: • A• 32-bitA 32-bit MicroBlaze MicroBlaze processor processo optimizedr optimized for performance for performance with instructionwith instruction and data and data cachescaches disabled, disabled, the debug the debug module module enabled, enabled, and the and peripheral the peripheral AXI data AXI interface data interface enabled.enabled. The processor The processor has two has interfaces two interfaces for memory for memory accesses: accesses: local memorylocal memory bus and AXI4and forAXI4 peripheral for peripheral access. access. • A• mixed-modeA mixed-mode clock clock manager manager (MMCM)—to (MMCM)—to generate generate the the 100 100MHz MHz clock clock for for the the de- designsign from from signal signal arriving arriving from from the the crystal crys clocktal clock oscillator oscillator available available on Nexys-4.on Nexys-4. • MicroBlaze• MicroBlaze local local memory memory configured configured to 128 to KB128 andKB and connected connected to theto the MicroBlaze MicroBlaze in- instancestance through through the localthe local memory memory bus corebus providingcore providing fast connection fast connection to on-chip to on-chip block block RAMRAM storing storing instruction instruction and data. and data. • MicroBlaze• MicroBlaze debug debug module module interfacing interfacing with thewith JTAG the JTAG port ofport the of FPGA the FPGA to provide to provide supportsupport for software for software debugging debugging tools. tools. • MicroBlaze• MicroBlaze concat concat for concatenating for concatenating bus signals bus signals to be usedto be inused the in interrupt the interrupt controller. controller. • AXI• interruptAXI interrupt controller controller supporting supporting interrupts interrupts from from the AXI the timer,AXI timer, the UARTLite the UARTLite module and DMA. It concentrates three interrupt inputs from these devices to a single module and DMA. It concentrates three interrupt inputs from these devices to a sin- interrupt output to the MicroBlaze. gle interrupt output to the MicroBlaze. • AXI timer—hardware timer for measuring execution times. The timer counter is • configured AXI timer—hardware to 32 bits. timer for measuring execution times. The timer counter is con- • Resetfigured module to that 32 bits. provides customized resets for the entire system, including the • processor, Reset the module interconnect, that provides the DMA, customized and peripherals. resets for the entire system, including the • AXI interconnectprocessor, the with interconnect, two slave the and DMA, six master and peripherals. interfaces. The interconnect core • connects AXI AXI interconnect memory-mapped with two master slave and devices six mast to oneer orinterfaces. more memory-mapped The interconnect slave core con- devices.nects The AXI two memory-mapped slave ports of the master interconnect devices are to connected one or more to the memory-mapped MicroBlaze and slave the DMAdevices. controller. The two The slave six ports master of the ports interco are linkednnect are with connected the interrupt to the controller,MicroBlaze and DMAthe controller, DMA controller. UARTLite The module, six master population ports countare linked accelerator, with the external interrupt , controller,DMA and controller, timer. UARTLite module, population count accelerator, external memory • UARTLitecontroller, module and implementing timer. AXI4-Lite slave interface for interacting with the FPGA through UART from the host PC. A serial port terminal is used to get and print the results.

J. Low Power Electron. Appl. 2021, 11, 20 8 of 12

• AXI External Memory Controller (EMC)—controller for the onboard external cellular 16 MB RAM which is used to store source data for processing. • AXI DMA controller configured to 32 bits data width with the buffer length register width of 24 bits (allowing transfers up to 224-1 bytes) and providing high-bandwidth direct memory access between the cellular 16MB memory (memory controller) and

J. Low Power Electron. Appl. 2021, 11, x FOR PEERthe REVIEW population count accelerator module. 8 of 12 • AXI4-Stream data FIFO with 4096 depth for buffering AXI4-Stream data. • The population count accelerator receiving stream data for processing through DMA • AXIfrom DMA the controller cellular configured 16MB RAM to 32 and bits providingdata width with memory-mapped the buffer length interface register to the MicroB- widthlaze of for24 bits reading (allowing the transfers result (sumup to 2 of24-1 the bytes) population and providing counts high-bandwidth of x 32-bit binary vectors, directwhich memory is equivalent access between to processing the cellular 216MBdlog2x ememory+5 bits). (memory controller) and the Thepopulation final block count diagram accelerator is module. illustrated in Figure3. The block automation and connection • AXI4-Stream data FIFO with 4096 depth for buffering AXI4-Stream data. • automationThe population features count haveaccelerator been receiving used to stream put together data for processing the basic through microprocessor DMA system, the accelerator,from the cellular the DMA/EMC 16MB RAM and controllers, providing memory-mapped and connecting interface ports to to external the Micro- I/O ports. Blaze for reading the result (sum of the population counts of x 32-bit binary vectors, ⌈⌉ which is equivalent to processing 2 bits). axis_data_fifo_0

S_AXIS s_axis_aresetn M_AXIS The final block diagram is illustrated in Figure 3. The block automation and connec-PopCountDMA_0 axi_dma_0 s_axis_aclk

S00_AXIS tion automation features have been usedS_AXI_LITE to putM_AXI_MM2S together the basic microprocessorAXI4-Stream Data FIFO system, S01_AXI s_axi_lite_aclk M_AXIS_MM2S microblaze_0_local_memory s00_axis_aclk m_axi_mm2s_aclk mm2s_prmry_reset_out_n the accelerator, the DMA/EMC controllers, and connecting ports to external I/O ports. s00_axis_aresetn axi_resetn mm2s_introut DLMB microblaze_0_axi_intc s01_axi_aclk ILMB s01_axi_aresetn AXI Direct Memory Access LMB_Clk s_axi microblaze_0 SYS_Rst s_axi_aclk PopCountDMA_v1.0 (Pre-Production) s_axi_aresetn interrupt INTERRUPT intr[2:0] DLMB microblaze_0_axi_periph DEBUG processor_clk ILMB Clk processor_rst M_AXI_DP S00_AXI Reset S01_AXI AXI Interrupt Controller ACLK MicroBlaze axi_uartlite_0 microblaze_0_xlconcat ARESETN S00_ACLK S_AXI UART usb_uart In0[0:0] mdm_1 S00_ARESETN s_axi_aclk interrupt In1[0:0] dout[2:0] M00_ACLK s_axi_aresetn In2[0:0] MBDEBUG_0 M00_ARESETN M00_AXI Debug_SYS_Rst M01_ACLK M01_AXI AXI Uartlite Concat M01_ARESETN M02_AXI axi_emc_0 MicroBlaze Debug Module (MDM) M02_ACLK M03_AXI rst_clk_wiz_1_100M M02_ARESETN M04_AXI S_AXI_MEM M03_ACLK M05_AXI s_axi_aclk EMC_INTF cellular_ram slowest_sync_clk mb_reset M03_ARESETN s_axi_aresetn reset ext_reset_in bus_struct_reset[0:0] M04_ACLK rdclk clk_wiz_1 aux_reset_in peripheral_reset[0:0] M04_ARESETN mb_debug_sys_rst interconnect_aresetn[0:0] M05_ACLK AXI EMC resetn clk_out1 dcm_locked peripheral_aresetn[0:0] M05_ARESETN sys_clock clk_in1 locked S01_ACLK Processor System Reset S01_ARESETN Clocking Wizard AXI Interconnect axi_timer_0

S_AXI capturetrig0 generateout0 capturetrig1 generateout1 freeze pwm0 s_axi_aclk interrupt s_axi_aresetn

AXI Timer

Figure 3. Block diagram of the microprocessor system equipped with the population count accelerator. Figure 3. Block diagram of the microprocessor system equipped with the population count accelerator. 4.3. Experimental Setup 4.3. Experimental Setup FigureFigure 4 illustrates4 illustrates the overall the overall experimental experimental setup. setup.

Executing a simple DMA Calculating the Nexys-4 board Pseudo-random transfer of all the population count data generation data, getting the with in software and on Microblaze result from the measuring the Artix-7 FPGA (4096 32-bit accelerator and elapsed time in values = 217 bits) measuring the microseconds elapsed time in microseconds

USB interface

Comparing the software and Visualizing the Host PC hardware results results on Vitis and the Serial Terminal processing times

FigureFigure 4. The 4. experimentalThe experimental setup. setup.

J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 9 of 13

J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 9 of 13

Executing a simple DMA Calculating the Nexys-4 board Pseudo-random transfer of Executingall the a population count simple DMA with data generation Calculating the data, getting the Nexys-4 board Pseudo-random in software and transfer of all the on Microblaze population count result from the data generation measuring the data, getting the Artix-7with FPGA (4096 32-bit in software and accelerator and on17 Microblaze elapsed time in result from the values = 2 bits) measuring the measuring the Artix-7 FPGA (4096 32-bit microseconds accelerator and elapsed time in elapsed time in values = 217 bits) measuring the microseconds microseconds elapsed time in microseconds USB interface USB interface

Comparing the software and Visualizing the Host PC hardwareComparing results the results on Vitis and thesoftware and Serial TerminalVisualizing the Host PC processinghardware times results results on Vitis and the Serial Terminal processing times J. Low Power Electron. Appl. 2021, 11, 20 9 of 12

Figure 4. The experimental setup. Figure 4. The experimental setup. TheThe MicroBlaze MicroBlaze is is executed executed in in standalone standalone mode. mode. BSS BSS (Block (Block Starting Starting Symbol Symbol segment segment keeping uninitialized data), heap and stack memory areas are mapped to external cellular keeping uninitializedThe MicroBlaze data), is heapexecuted and stackin standalone memory mode. areas areBSS mapped (Block Starting to external Symbol cellular segment RAM. Memory is initialized with N = 4096 (configurable value) 32-bit pseudo-randomly RAM.keeping Memory uninitialized is initialized data), with heap N = 4096and stack(configurable memory value)areas are 32-bit mapped pseudo-randomly to external cellular generated values, stored in the stack area. Then a C software routine is started to calculate generatedRAM. values, Memory stored is initialized in the stack with area. N Then= 4096 a (configurableC software routine value) is 32-bitstarted pseudo-randomly to calculate the total population count of all these values and the required execution time is measured the totalgenerated population values, count stored of all in these the stack values area. and Then the requireda C software execution routine time is started is measured to calculate with the aid of the hardware timer: with the aidtotal of population the hardware count timer: of all these values and the required execution time is measured RestartPerformanceTimer();with the aid of the hardware timer: // reset the timer and enable the RestartPerformanceTimer();//timer counter such that it// startsreset therunning timer and enable the unsigned //timerint SWpc counter = PopulationCountSw(srcData, such that it starts running N); // srcData is unsigned// an N-element int SWpc array = PopulationCountSw(srcData, storing 32-bit integers N); // srcData is timeElapsedSw// an =N-element StopAndGetPerformanceTimer(); array storing 32-bit integers//disable the timer timeElapsedSw//and get the =timer StopAndGetPerformanceTimer(); counter register value //disable the timer In the//and PopulationCountSw get the timer function, counter five different register methods value described in Section 3.1 haveIn been the PopulationCountSw Intested; the PopulationCountSw namely: function, function, five different five different methods methods described described in Section in Section 3.1 3.1 have been tested; namely: • trivialhave been accumulation tested; namely: of the values of each of the 217 bits in a loop; • trivial accumulation of the values of each of the 217 bits in a loop; • B.• Kernighan’strivial accumulation method; of the values of each of the 217 bits in a loop; • B. Kernighan’s method; • counting• B. Kernighan’s the set bits method;in parallel as illustrated in Section 3.1; • counting the set bits in parallel as illustrated in Section 3.1; • counting• counting the set the bits set in bits parallel in parallel with 16as -illustrated and 32-bit in chunks Section processed 3.1; through multi- • counting the set bits in parallel with 16- and 32-bit chunks processed through multipli- plication:• counting the set bits in parallel with 16- and 32-bit chunks processed through multi- cation: popCountplication: = (((popCount + (popCount >> 4)) & B[2]) * popCount 0x01010101) = (((popCount >> 24; + (popCount >> 4)) & B[2]) * • using LUTs 8:4 and0x01010101) adding up the >> intermediate 24; results. J. Low Power Electron. Appl. 2021, 11, x FOR• PEER REVIEW 10 of 13 I found using that LUTs the best 8:4 resultsand adding are achieved up the intermediate with the LUT-based results. method similar to that • reportedusing LUTsatI foundthe 8:4end that and of Sectionthe adding best 3.1: upresults the intermediate are achieved results. with the LUT-based method similar to that I foundreported that at the the best end results of Section are achieved 3.1: with the LUT-based method similar to that reported at the end of Section 3.1: unsigned int PopulationCountSw(int* pSrc, unsigned int size) { int* p; unsigned int result = 0; static const unsigned char BitsSetTable256[256] = //defined exactly as in section 3.1 for (p = pSrc; p < pSrc + size; p++) { result += BitsSetTable256[*p & 0xff] + BitsSetTable256[(*p >> 8) & 0xff] + BitsSetTable256[(*p >> 16) & 0xff] + BitsSetTable256[*p >> 24]; } return result; } After running the software routine to calculate the population count, the hardware timerAfter is running reset, a the simple software DMA routine transfer to of calculate the same the source population data to count,the hardware the hardware accelerator timeris is executed reset, a simple and, once DMA the transfer transfer of is the comple samete, source the result data tois read the hardware from the acceleratormemory-mapped is executedaccelerator’s and, once register. the transfer Finally, is complete,the timer theis stopped result is and read the from timer’s the memory-mappedcounter register value accelerator’sis used to register. measure Finally, the elapsed the timer time is in stopped microseconds. and the The timer’s following counter C code register illustrates value the is usedprocedure: to measure the elapsed time in microseconds. The following C code illustrates the procedure: RestartPerformanceTimer(); //reset the population count to 0: Xil_Out32(XPAR_POPCOUNTDMA_0_S01_AXI_BASEADDR, 0x0); //initiate one simple transfer submission: status = XAxiDma_SimpleTransfer(&dmaInstDefs, (UINTPTR) srcData, N * sizeof(int), XAXIDMA_DMA_TO_DEVICE); if (status != XST_SUCCESS) { xil_printf(“\r\nDMA transfer failed.”); return XST_FAILURE; } while (XAxiDma_Busy(&dmaInstDefs, XAXIDMA_DMA_TO_DEVICE)) { /* Wait */ } //get the calculated population count: unsigned int HWpc = Xil_In32(XPAR_POPCOUNTDMA_0_S01_AXI_BASEADDR); timeElapsedHw = StopAndGetPerformanceTimer(); All the results and eventual error messages are visualized on Vitis Serial Terminal communicating with the MicroBlaze through the UARTLite module.

5. Discussion of the Results The software execution time was measured for all the considered in Section 4.3 soft- ware implementations as the calculated timer counter register value multiplied by the known clock period of 10 ns. I found that the best results are achieved with the LUT-based method, followed by a 4-input adder. The selected embedded software module takes 23,786 µs to execute and is 19 times faster than the trivial bit count, 7 times faster than the B. Kernighan’s method and 1.2–1.6 times faster compared to counting the set bits in par- allel (depending on how to process 16- and 32-bit chunks). Therefore, the fastest C code listed in Section 4.3 for calculating the population count in software was chosen for further comparison with hardware accelerators. All the hardware architectures A1–A6 take exactly the same number of clock cycles to execute (77,042 cycles), as this is bounded by AXI DMA transactions, and, assuming 100 MHz clock frequency, this amounts to 770 µs for processing the same 217 bits of ran- domly-generated data, which is 30.8 times faster than calculating the population count in

J. Low Power Electron. Appl. 2021, 11, x FOR PEER REVIEW 10 of 13

unsigned int PopulationCountSw(int* pSrc, unsigned int size) { int* p; unsigned int result = 0; static const unsigned char BitsSetTable256[256] = //defined exactly as in section 3.1 for (p = pSrc; p < pSrc + size; p++) { result += BitsSetTable256[*p & 0xff] + BitsSetTable256[(*p >> 8) & 0xff] + BitsSetTable256[(*p >> 16) & 0xff] + BitsSetTable256[*p >> 24]; } return result; } After running the software routine to calculate the population count, the hardware timer is reset, a simple DMA transfer of the same source data to the hardware accelerator is executed and, once the transfer is complete, the result is read from the memory-mapped J. Low Power Electron. Appl. 2021, 11, 20 accelerator’s register. Finally, the timer is stopped and the timer’s counter register10 of value 12 is used to measure the elapsed time in microseconds. The following C code illustrates the procedure: RestartPerformanceTimer(); //reset the population count to 0: Xil_Out32(XPAR_POPCOUNTDMA_0_S01_AXI_BASEADDR, 0x0); //initiate one simple transfer submission: status = XAxiDma_SimpleTransfer(&dmaInstDefs, (UINTPTR) srcData, N * sizeof(int), XAXIDMA_DMA_TO_DEVICE); if (status != XST_SUCCESS) { xil_printf(“\r\nDMA transfer failed.”); return XST_FAILURE; } while (XAxiDma_Busy(&dmaInstDefs, XAXIDMA_DMA_TO_DEVICE)) { /* Wait */ } //get the calculated population count: unsigned int HWpc = Xil_In32(XPAR_POPCOUNTDMA_0_S01_AXI_BASEADDR); timeElapsedHw = StopAndGetPerformanceTimer(); All the results and eventual error messages are visualized on Vitis Serial Terminal All the results and eventual error messages are visualized on Vitis Serial Terminal communicating with the MicroBlaze through the UARTLite module. communicating with the MicroBlaze through the UARTLite module.

5.5. Discussion Discussion of of the the Results Results TheThe software software execution execution time time was was measured measured for fo allr all the the considered considered in in Section Section 4.3 4.3 soft- soft- wareware implementations implementations as as the the calculated calculated timer time counterr counter register register value value multiplied multiplied by by the the knownknown clock clock period period of of 10 10 ns. ns. I found I found that that the the best best results results are are achieved achieved with with the the LUT-based LUT-based method,method, followed followed by by a 4-inputa 4-input adder. adder. The The selected selected embedded embedded software software module module takes takes 23,78623,786µs µs to to execute execute and and is is 19 19 times times faster faster thanthan thethe trivialtrivial bit count, 7 7 times times faster faster than than the theB. B. Kernighan’s Kernighan’s method method and and 1.2–1.6 1.2–1.6 times times faster faster compared compared to tocounting counting the the set set bits bits in inpar- parallelallel (depending (depending on on how to processprocess 16-16- andand 32-bit 32-bit chunks). chunks). Therefore, Therefore, the the fastest fastest C codeC code listedlisted in in Section Section 4.3 4.3 for for calculating calculating the the population population count count in in software software was was chosen chosen for for further further comparisoncomparison with with hardware hardware accelerators. accelerators. AllAll the the hardware hardware architectures architectures A1–A6 A1–A6 take take exactly exactly the the same same number number of of clock clock cycles cycles toto execute execute (77,042 (77,042 cycles), cycles), as as this this is is bounded boundedby by AXIAXI DMADMA transactions,transactions, and, assuming assum- ing100100 MHz MHz clockclock frequency, frequency, this this amounts amounts to to 770 770 µsµ sfor for processing processing the the same same 217 2 17bitsbits of ofran- randomly-generateddomly-generated data, data, which which is is 30.8 30.8 times times faster faster than than calculating calculating the population count in in software running on MicroBlaze. This is, however, a theoretical speed-up as not all the architectures are capable of executing at 100 MHz. In particular, I found that A4 and A5 exhibit negative values of worst slack, and the greatest positive slack is demonstrated by A1 architecture. Table1 summarizes the timing performance of A1–A6 for the considered implementations and the required for the complete system resources (including all the modules listed in Section 4.2). In terms of the occupied resources, all the architectures exhibit comparable results with negligible differences, with the most resource-angry ap- proaches being A4 and A5. It is a big surprise, however, that the architecture A1 (a long sequence of 31 adders) demonstrates the best maximum operating frequency and compa- rable to counterparts resources. I believe that this is due to the very efficient carry chain optimization realized by the Vivado synthesis tool. Compressor trees employed in A3 are slower and grant a negligible resource gain. The total on-chip power calculated by Vivado power analysis tool from the imple- mented netlist for the whole system with A1–A6 accelerators is also reported in Table1. The power losses are scaled to one clock frequency (100 MHz) and the results indicate that the variation in power losses among the architectures A1–A6 is negligible. All the designs reported in this paper have been fully implemented and tested on the Nexys-4 board connected to the host PC through a USB cable. The designs are however self-contained and do not require a host PC to operate. The computer was only used to facilitate experiments and the results analysis. J. Low Power Electron. Appl. 2021, 11, 20 11 of 12

Table 1. Timing performance of A1–A6 accelerator architectures for the considered implementations, the resources occupied by the complete system and the total on-chip power.

Architecture Worst Slack (ns) Max Freq (MHz) LUT FF BRAM DSP Total on-Chip Power (W) A1 1.28 115 4663 4739 37 3 0.294 A2 0.77 108 4661 4707 38.5 3 0.294 A3 0.682 107 4658 4707 38.5 3 0.293 A4 −5.308 65 4718 4739 37 6 0.293 A5 −0.902 92 4708 4739 37 3 0.293 A6 0.415 104 4663 4739 37 3 0.294

6. Conclusions I found that a low-cost, low-power Artix-7 FPGA in which the soft-core processor offloads the computation of the population count to a hardware accelerator implemented in FPGA logic resources can perform the calculation at a speed 31× greater than what the processor achieves without . The experiments have been carried out with 217-bit randomly generated vectors stored in external cellular memory. The system architecture is scalable to support smaller (no changes are required) or longer bit vectors (it might be necessary to adjust the buffer length register width of the DMA controller which is currently configured to support 224-1 bit transactions). As the analyzed accelerators occupy negligible FPGA resources, several accelerator instances could be instantiated and run in parallel; however, as reported in [22], I believe that the bottleneck would be in communication links and protocols.

Funding: This work was supported by National Funds through the FCT—Foundation for Science and Technology, in the context of the project UIDB/00127/2020. Data Availability Statement: The data presented in this study are available in the article. Conflicts of Interest: The author declares no conflict of interest.

References 1. Kim, K.D.; Kumar, P.R. An overview and some challenges in cyber-physical systems. J. Indian Inst. Sci. 2013, 93, 341–352. 2. Mosterman, P.J.; Zander, J. Cyber-physical systems challenges: A needs analysis for collaborating embedded software systems. Softw. Syst. Model 2016, 15, 5–16. [CrossRef] 3. Rodríguez, A.; Valverde, J.; Portilla, J.; Otero, A.; Riesgo, T.; de la Torre, E. FPGA-Based High-Performance Embedded Systems for Adaptive Edge Computing in Cyber-Physical Systems: The ARTICo3 Framework. Sensors 2018, 18, 1877. [CrossRef] 4. Qasaimeh, M.; Denolf, K.; Vissers, J.L.K.; Zambreno, J.; Jones, P.H. Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels. In Proceedings of the 2019 IEEE International Conference on Embedded Software and Systems (ICESS), Las Vegas, NV, USA, 2–3 June 2019; pp. 1–8. [CrossRef] 5. Hong, T.; Kang, Y.; Chung, J. InSight: An FPGA-Based Neuromorphic Computing System for Deep Neural Networks. J. Low Power Electron. Appl. 2020, 10, 36. [CrossRef] 6. Spagnolo, F.; Perri, S.; Frustaci, F.; Corsonello, P. Energy-Efficient Architecture for CNNs Inference on Heterogeneous FPGA. J. Low Power Electron. Appl. 2020, 10, 1. [CrossRef] 7. Sarwar, I.; Turvani, G.; Casu, M.R.; Tobon, J.A.; Vipiana, F.; Scapaticci, R.; Crocco, L. Low-Cost Low-Power Acceleration of a Microwave Imaging Algorithm for Brain Stroke Monitoring. J. Low Power Electron. Appl. 2018, 8, 43. [CrossRef] 8. Intel Corp. Intel®64 and IA-32 Architectures Software Developer’s Manual, Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A–Z. 2016. Available online: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64 -ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf (accessed on 14 March 2021). 9. Arm, Lda., Arm Armv8-A A32/T32 Instruction Set Architecture. Available online: https://developer.arm.com/documentation/ ddi0597/2020-12/SIMD-FP-Instructions/VCNT---Vector-Count-Set-Bits-?lang=en (accessed on 14 March 2021). 10. Xilinx, Inc. MicroBlaze Processor Reference Guide. UG081 (v9.0). 2008. Available online: https://www.xilinx.com/support/ documentation/sw_manuals/mb_ref_guide.pdf (accessed on 14 March 2021). 11. Nurvitadhi, E.; Sheffield, D.; Sim, J.; Mishra, A.; Venkatesh, G.; Marr, D. Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 7–9 December 2016; pp. 77–84. [CrossRef] 12. Kim, J.H.; Lee, J.; Anderson, J.H. FPGA Architecture Enhancements for Efficient BNN Implementation. In Proceedings of the 2018 International Conference on Field-Programmable Technology (FPT), Naha, Japan, 10–14 December 2018; pp. 214–221. [CrossRef] J. Low Power Electron. Appl. 2021, 11, 20 12 of 12

13. Agrawal, A.; Jaiswal, A.; Roy, D.; Han, B.; Srinivasan, G.; Ankit, A.; Roy, K. Xcel-RAM: Accelerating Binary Neural Networks in High-Throughput SRAM Compute Arrays. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 66, 3064–3076. [CrossRef] 14. Huang, C.H.; Chen, P.J.; Lin, Y.J.; Chen, B.W.; Zheng, J.X. A robot-based intelligent management design for agricultural cyber- physical systems. Comput. Electron. Agric. 2021, 181.[CrossRef] 15. Schanck, J. Improving Post-Quantum Cryptography through Cryptanalysis. Ph.D. Thesis, University of Waterloo, Waterloo, ON, Canada, 2020. Available online: https://uwspace.uwaterloo.ca/bitstream/handle/10012/16060/Schanck_John.pdf?sequence= 3&isAllowed=y (accessed on 14 March 2021). 16. Coron, J.S.; Gini, A. Improved cryptanalysis of the AJPS Mersenne based cryptosystem. J. Math. Cryptol. 2020, 14, 218–223. [CrossRef] 17. Mitchell, R.; Chen, I.R. A Survey of Intrusion Detection Techniques for Cyber-Physical Systems. ACM Comput. Surv. 2014, 55. [CrossRef] 18. John, I.N.; Kamaku, P.W.; Macharia, D.K.; Mutua, N.M. Error Detection and Correction Using Hamming and Cyclic Codes in a Communication Channel. Pure Appl. Math. J. 2016, 5, 220–231. [CrossRef] 19. Dalke, A. The chemfp project. J. Cheminform. 2019, 11, 76. [CrossRef] 20. Gonzalez-Domínguez, J.; Schmidt, B. ParDRe: Faster parallel duplicated reads removal tool for sequencing studies. Bioinformatics 2016, 32, 1562–1564. [CrossRef] 21. Anderson, S.E. Bit Twiddling Hacks. Available online: http://graphics.stanford.edu/~{}seander/bithacks.html#CountBitsSetTable (accessed on 14 March 2021). 22. Sklyarov, V.; Skliarova, I.; Silva, J. On-chip reconfigurable hardware accelerators for popcount computations. Int. J. Re Config. Comput. 2016, 2016, 8972065. [CrossRef] 23. Sklyarov, V.; Skliarova, I. Hamming Weight Counters and Comparators based on Embedded DSP Blocks for Implementation in FPGA. Adv. Electr. Comput. Eng. 2014, 14, 63–68. [CrossRef] 24. Parhami, B. Efficient Hamming weight comparators for binary vectors based on accumulative and up/down parallel counters. IEEE Trans. Circuits Syst. Ii Express Briefs 2009, 56, 167–171. [CrossRef] 25. Piestrak, S.J. Efficient Hamming weight comparators of binary vectors. Electron. Lett. 2007, 43, 611–612. [CrossRef] 26. Sklyarov, V.; Skliarova, I. Design and implementation of counting networks. Computing 2015, 97, 557–577. [CrossRef] 27. El-Qawasmeh, E. Beating the Popcount. Int. J. Inf. Technol. 2003, 9, 1–18. Available online: http://intjit.org/cms/journal/volume/ 9/1/91_1.pdf (accessed on 14 March 2021). 28. Sklyarov, V.; Skliarova, I. Multi-core DSP-based vector set bits counters/comparators. J. Signal. Process. Syst. 2015, 80, 309–322. [CrossRef] 29. Sklyarov, V.; Skliarova, I.; Barkalov, A.; Titarenko, L. Synthesis and Optimization of FPGA-Based Systems; Springer: Berlin, Germany, 2014. 30. Pilz, S.; Porrmann, F.; Kaiser, M.; Hagemeyer, J.; Hogan, J.M.; Rückert, U. Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs. Algorithms 2020, 13, 47. [CrossRef] 31. Umuroglu, Y.; Conficconi, D.; Rasnayake, L.K.; Preußer, T.B.; Själander, M. Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing. ACM Trans. Reconfig. Technol. Syst. 2019, 12, 1–24. [CrossRef] 32. Rasoulinezhad, S.; Siddhartha; Zhou, H.; Wang, L.; Boland, D.; Leong, P.H.W. LUXOR: An FPGA Logic Architecture for Efficient Compressor Tree Implementations. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field- Programmable Gate Arrays, Monterey, CA, USA, 26–28 February 2020; pp. 161–171. [CrossRef] 33. Preußer, T.B. Generic and Universal Parallel Matrix Summation with a Flexible Compression Goal for Xilinx FPGAs. In Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, Belgium, 4–8 September 2017; pp. 1–7. [CrossRef] 34. Xilinx, Inc. 7 Series FPGAs Data Sheet: Overview. 2020. Available online: https://www.xilinx.com/support/documentation/ data_sheets/ds180_7Series_Overview.pdf (accessed on 21 March 2021). 35. Digilent, Nexys 4 Reference Manual. Available online: https://reference.digilentinc.com/reference/programmable-logic/nexys- 4/reference-manual (accessed on 21 March 2021).