
Article

Efficient Pipelined Broadcast with Monitoring Processing Node Status on a Multi-Core Processor

Jongsu Park School of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Korea; [email protected]

 Received: 12 November 2019; Accepted: 29 November 2019; Published: 1 December 2019 

Abstract: This paper presents an efficient pipelined broadcast algorithm with an inter-node transmission order change technique that considers the communication status of processing nodes. The proposed method changes the transmission order for the broadcast operation based on the communication status of the processing nodes. When a broadcast operation is received, a local bus checks the remaining pre-existing transmission data size of each processing node; it then transmits data according to the changed transmission order using the status information. Therefore, the synchronization time can be hidden during the remaining time until the pre-existing data transmissions finish; as a result, the overall broadcast completion time is reduced. The simulation results indicated that the speed-up ratio of the proposed algorithm was up to 1.423, compared to that of the previous algorithm. To demonstrate physical implementation feasibility, the message passing engine (MPE) with the proposed broadcast algorithm was designed using Verilog-HDL, supporting four processing nodes. The logic synthesis results with TSMC 0.18 µm process cell libraries show that the logic area of the proposed MPE is 2288.1 equivalent NAND gates, approximately 2.1% of the entire chip area. Therefore, performance improvement in multi-core processors is expected with a small hardware area overhead.

Keywords: broadcast; collective communication; pipelined broadcast; multi-core processor; message passing

1. Introduction

Multi-core and many-core processors have become the dominant processor models in many modern computer systems, including smartphones, tablet PCs, desktop computers, and even high-performance server systems [1,2]. Modern high-performance processors comprise two or more independent cores, and each core can be connected through various interconnection network topologies, such as bus, ring, mesh, and crossbar. Whereas the performance of a single core is limited by physical constraints, recent technological advances have made processors with many cores feasible [3,4]. The parallelism provided by multi- or many-core processors gives greater cost-performance benefits over single-core processors. To maximize the performance of these multi-core processors, it is important to support efficient data communication among cores. Data communications among processors are undertaken by using point-to-point and collective communications. Point-to-point communication is used when there is a single sender and a single receiver, and it is relatively easy to implement efficiently. Collective communication involves multiple senders and/or receivers, and it is difficult to implement efficiently because a topology-aware implementation is required for good performance [5]. To ensure implementation efficiency, collective communications are generally converted into groups of point-to-point communications [6]. Common collective communication patterns are broadcast, scatter, gather, all-gather, all-to-all, reduce, all-reduce, and so on [7,8]. Collective communication operations are generated through the use of programming that uses libraries, such as OpenMP, message passing interface (MPI), or compute unified

device architecture (CUDA). Generally, using collective communication operations reduces the code size and also increases performance, compared to using point-to-point communication operations. Moreover, collective communication operations account for up to approximately 80% of the total data transmission time; therefore, it is very important to improve their execution time [9–11]. Broadcasting is a frequently used form of collective communication; it is used to disseminate data messages from the root node (core) to all the other nodes (cores) that belong to the same communicator. Following the advent of the sequential tree algorithm [12], various broadcast algorithms have been proposed; these include the binary tree, binomial tree [13], minimum spanning tree, and distance minimum spanning tree algorithms, as well as Van de Gejin's hybrid broadcast [14] and modified hybrid broadcast [15] algorithms. These algorithms offer performance efficiency by tuning network topologies in collective communications. Since then, pipelined broadcast algorithms have been proposed to accelerate the processing time of broadcast communication at the operating system level [9,16], by utilizing the maximum bandwidth. There have also been a few diverse works on pipelined broadcast algorithms [17–21] that look to improve the processing time of broadcast communication. In the pipelined broadcast algorithm, data are packetized into k partitions to make k pipeline stages, so that most communication ports are activated efficiently. However, the act of packetizing incurs k−1 additional synchronization processes. As a result, the pipelined broadcast algorithm has an inherent limitation in improving the performance of broadcast communication, because it improves the data transmission time but incurs an unnecessary synchronization time-cost.

First, this paper explains an enhanced algorithm for pipelined broadcasting, called an "atomic pipelined broadcast algorithm" [22]. In this preliminary study, the enhanced algorithm accelerates the processing time of broadcast communication by drastically reducing the number of synchronization processes in the conventional pipelined broadcast algorithm. This reduction is accomplished by developing a light-weight protocol and employing simple hardware logic, so that the message broadcast can be performed in an atomic fashion with full pipeline utilization. Additionally, this paper proposes an inter-node transmission order change technique that considers the communication status of processing nodes. Reinforcement learning algorithms have already been adopted for various transmission order or selection problems in 5G networks [23,24]. However, these pre-existing algorithms cannot be applied to on-chip communications because their software-based implementation complexity is too high to be realized in hardware logic. For on-chip communications, the proposed method has appropriate complexity and efficient performance. The proposed method changes the transmission order for the broadcast operation based on the communication status of processing nodes. When a broadcast operation is received, a local bus checks the remaining pre-existing transmission data size of each processing node; it then transmits data according to the changed transmission order using the status information.
Therefore, the synchronization time can be hidden during the remaining time until the pre-existing data transmissions finish; as a result, the overall broadcast completion time is reduced. To validate this approach, the proposed broadcast algorithm was implemented in a bus functional model (BFM) with SystemC, and then evaluated in simulations using various communication data sizes and numbers of nodes. The simulation results indicate that the speed-up ratio of the atomic pipelined broadcast algorithm is up to 4.113, compared to that of the previous pipelined broadcast algorithm without any change to the transmission order, when a 64-byte data message is broadcast among 32 nodes. In order to demonstrate physical implementation feasibility, the message passing engine (MPE) supporting the proposed atomic broadcast algorithm and the enhanced transmission order change technique was designed using Verilog-HDL, comprising four processing nodes. The logic synthesis results with TSMC 0.18 µm process cell libraries show that the logic area of the proposed MPE was 2288.1 equivalent NAND gates, which represents approximately 2.1% of the entire chip area. Therefore, performance improvement is expected with a small hardware area overhead, if the proposed MPE were to be added to multi-core processors.

The remainder of this paper is organized as follows. Section 2 introduces trends in the interconnection networks of multi-core processors, as well as the previous algorithms that served as the motivation for this paper. Section 3 explains the proposed atomic pipelined broadcast algorithm with the inter-node transmission order change technique. Section 4 provides the simulation results and a discussion thereof. Section 5 details the modified MPE implementation and synthesis results, and Section 6 provides concluding remarks.

2. Background Research

2.1. Interconnection Networks on Multi-Core Processors

An interconnection network in multi-core processors transfers information from any source core to any desired destination core. This transfer should be completed with as small a latency as possible. It should allow a large number of such transfers to take place concurrently. Moreover, it should be inexpensive compared to the cost of the rest of the machine. The network consists of links and switches, which help to send the information from the source core to the destination core. The network is specified by its topology, routing
algorithm, switching strategy, and flow control mechanism. The leading companies in multi-core processors have developed their own interconnection network architectures for data communication among cores. Intel has developed two processor types: general-purpose processors such as those in the i7 series, and high-performance computing processors such as the Xeon Phi processor. In Intel's general-purpose processors, a crossbar is used for interconnection among cores, and quick-path interconnection (QPI) is used for interconnection among processors [25,26].

Figure 1 is a block diagram of a multi-core processor with external QPI. The processor has one or more cores, and it also typically comprises one or more integrated memory controllers. The QPI is a high-speed point-to-point interconnection network. Although it is sometimes classified as a type of serial bus, it is more accurately considered a point-to-point link, as data are sent in parallel across multiple lanes and its packets are broken into multiple parallel transfers. It is a contemporary design that uses some techniques similar to those seen in other point-to-point interconnections, such as peripheral component interconnect express (PCIe) and the fully-buffered dual in-line memory module (FB-DIMM). The physical connectivity of each link consists of 20 differential signal pairs and a differential forwarded clock. Each port supports a link pair consisting of two uni-directional links that connect two processors;
this supports traffic in both directions simultaneously [27]. Additionally, the QPI comprises a cache coherency protocol to maintain the distributed memory and caches coherent during system operation [28].


Figure 1. Block diagram of multi-core processor with Intel quick-path interconnection (QPI). Reproduced from [27], Intel: 2019.


The up-to-date Intel Xeon Phi processor, codenamed "Knight Landing" (KNL), comprises 36 processor tiles, where each processor tile consists of two cores, two vector processing units (VPU), and a 1-MByte L2 cache. All processor tiles are interconnected by a 2D mesh interconnection, and the interconnection among Xeon Phi processors is an Intel omni-path fabric interconnection [29–31]. Figure 2 is a block diagram of an Intel Xeon Phi processor with omni-path fabric interconnection. The omni-path fabric is connected through two x16 lanes of PCI express to the KNL die, and it provides two 100-Gbits-per-second ports out of the package [32].


Figure 2. Block diagram of Intel Xeon Phi processor with omni-path fabric interconnection. Reproduced from [32], IEEE: 2016.

AMD also uses a crossbar to interconnect cores, and the interconnection among processors is HyperTransport, which is a high-speed point-to-point interconnection network similar to Intel QPI [33,34]. Figure 3 is a block diagram of an AMD processor with HyperTransport interconnection, which provides scalability, high bandwidth, and low latency. The distributed shared-memory architecture includes four integrated memory controllers (i.e., one per chip), giving it four-fold greater memory bandwidth and capacity compared to traditional architectures, without the use of costly power-consuming memory buffers [35].


Figure 3. Block diagram of AMD processor with HyperTransport interconnection. Reproduced from [35], IEEE: 2007.


ARM is a leading embedded system-on-a-chip (SoC) technology, with its ARM multi-core processors and advanced microcontroller bus architecture (AMBA) intellectual properties (IPs). Since AMBA was first introduced in 1996, it has described a number of buses and interfaces for on-chip communications. The first version of AMBA contained only two buses: the advanced system bus (ASB) and the advanced peripheral bus (APB). However, the current version adds a wide collection of high-performance buses and interfaces, including the advanced high-performance bus (AHB), the advanced extensible interface (AXI), and AXI coherency extensions (ACE) [36]. Since AMBA bus protocols are today the de facto standard for embedded processors and can be used without royalties, the AMBA bus has been widely used in high-performance multi-core processor SoC designs [37–39]. In up-to-date multi-core processors for mobile applications, such as the Samsung Exynos Octa 9820 [39] and the Qualcomm Snapdragon 855 [40], multiple cores are interconnected via the AMBA bus. Additionally, in 2012, the chip implementation of a multi-core processor in which 32 cores are interconnected via the AMBA bus was presented [41].

2.2. Pipelined Broadcast

Various broadcast algorithms have been proposed, including the binary tree, binomial tree, minimum spanning tree, and distance minimum spanning tree algorithms, as well as Van de Gejin's hybrid broadcast and modified hybrid broadcast algorithms. These algorithms offer performance efficiency by tuning network topologies in collective communications [12]. Since then, pipelined broadcast algorithms have been proposed to accelerate the processing time of broadcast communication by utilizing the maximum bandwidth [9,16]. A few diverse works on
pipelined broadcast algorithms [17–21] have considered the means of improving the processing time of broadcast communication.

In the pipelined broadcast algorithm, data to be broadcast are partitioned into k packets, as shown in Figure 4, and the packets are transmitted through a k-step pipeline, as shown in Figure 5; this means the communication channels can be heavily used during the period from step 3 through step k, with iterative synchronization processes, which are unavoidable at the mesh network level to verify the next destination. Consequently, the algorithm shows better performance as the data size increases. Thus, the value of k is very important in balancing the trade-offs between the number of synchronization processes and the utilization of the communication channels. The value of k is known to be optimal if it is chosen according to Equation (1) [16]:

$$k_{opt} = \begin{cases} 1 & \text{if } \sqrt{(p-2)n\beta/\alpha} < 1 \\ n & \text{if } \sqrt{(p-2)n\beta/\alpha} > n \\ \sqrt{(p-2)n\beta/\alpha} & \text{otherwise,} \end{cases} \tag{1}$$

where p is the number of processing nodes, α is the number of cycles required for each synchronization, β is the number of cycles required for a single-byte data transfer, and n is the data vector length.
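To make Equation (1) concrete, the following C++ fragment evaluates k_opt for given p, α, β, and n (a minimal sketch; the function name, floating-point types, and lack of integer rounding are our own assumptions, not part of [16]):

    #include <cmath>

    // Optimal packet count k_opt from Equation (1).
    //   p: number of processing nodes
    //   alpha: cycles required for each synchronization
    //   beta: cycles required for a single-byte data transfer
    //   n: data vector length
    double k_opt(int p, double alpha, double beta, double n) {
        double k = std::sqrt((p - 2) * n * beta / alpha);
        if (k < 1.0) return 1.0; // synchronization dominates: do not packetize
        if (k > n)   return n;   // at most one data unit per packet
        return k;                // interior optimum
    }

The two boundary cases simply clamp the interior optimum to the feasible range of packet counts, which is the balance between synchronization count and channel utilization described above.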

Figure 4. Data packetizing into k partitions.

Figure 5. Pipelined broadcast architecture with iterative synchronization processes.




In the pipelined broadcast algorithm shown in Figure 5, we can see that k iterative synchronization processes are needed, one in each pipeline step. Suppose that the number of cycles required for synchronization in one pipeline step is m cycles, and that the operating frequency is f MHz; then, the total synchronization time is calculated as in Equation (2). This incurs an O(k) arithmetic cost, leading to inefficient channel throughput due to diminishing returns as k increases.

$$total\_time_{sync} = \frac{m}{f} \times k \times 10^{-6}. \tag{2}$$

In the case of broadcast communication, all data messages should be transferred to every node within the communicator as soon as possible. As it is common to simultaneously allocate all channels to broadcast data, the recommended scenario is to have the host node preempt all pipeline channels and broadcast the data in the form of nonblocking chain operations, making all pipeline channels fully utilized.
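As a quick numerical illustration of Equation (2) (the values of m, f, and k below are made up for the example, not taken from the paper):

    // Total synchronization overhead of Equation (2), in seconds.
    //   m: cycles per synchronization, f_mhz: clock frequency in MHz,
    //   k: number of packets (pipeline steps)
    double total_time_sync(double m, double f_mhz, int k) {
        return (m / f_mhz) * k * 1e-6;
    }
    // Example: m = 10 cycles, f = 100 MHz, k = 64 gives 6.4 microseconds,
    // growing linearly in k, which is the O(k) cost noted in the text.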

3. Atomic Pipelined Broadcast with Inter-Node Transmission Order Change

3.1. Atomic Pipelined Broadcast

As a preliminary study, the author presented an enhanced algorithm to overcome the disadvantage of pipelined broadcasting, which is called an atomic pipelined broadcast algorithm [22]. The term "atomic" derives from the concept of an atomic transaction, where a transaction should be performed either in its entirety or not at all [42]. This algorithm accelerates the processing time of broadcast communication by drastically reducing the number of synchronization processes involved. This reduction was accomplished by developing a light-weight protocol and employing simple hardware logic, so that the data message broadcast can be performed in an atomic fashion with full pipeline utilization. This section explains the details of the atomic pipelined broadcast and discusses the resulting performance improvement.

In the conventional pipelined broadcast algorithm described in Section 2.2, each processing node plays one of three roles: head, body, or tail. For example, as shown in Figure 5, PN0 is a head node, PN1 and PN2 are body nodes, and PN3 is a tail node. In the case of the head node, PN0 sends data toward PN1, where the broadcast communication is converted into k point-to-point send operations. In the case of PN3, which is a tail node, the broadcast communication is converted into k point-to-point receive operations. On the other hand, the body nodes, such as PN1 and PN2, receive data from one node and simultaneously send the data to another, where the broadcast communication converts into k point-to-point send operations and also k receive operations. At this point, k iterative synchronization processes occur, which we want to avoid if there is no need to synchronize between the initiation and the completion of the broadcast procedure. One way to avoid these iterative synchronization processes is to have solid network connections among all the processing nodes, which can be naturally found in a multi-core processor's bus structure such as an AMBA bus. Using this bus structure, we could preempt all the communication channels by establishing specific protocols, and use the pipeline seamlessly, as shown in Figure 6.
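The per-role decomposition of the conventional algorithm described above can be summarized as follows (a C++-style sketch; the send_packet/recv_packet primitives are hypothetical stand-ins, since the paper's actual mechanism is bus-level hardware, not software):

    enum class Role { Head, Body, Tail };

    // Hypothetical point-to-point primitives (stubs for illustration).
    inline void send_packet(int /*pkt*/) {}
    inline void recv_packet(int /*pkt*/) {}

    // Conventional pipelined broadcast: each of the k packets becomes
    // per-role point-to-point operations, with one synchronization per step.
    void pipelined_bcast(Role role, int k) {
        for (int i = 0; i < k; ++i) {
            switch (role) {
                case Role::Head: send_packet(i);                 break;
                case Role::Body: recv_packet(i); send_packet(i); break; // store and forward
                case Role::Tail: recv_packet(i);                 break;
            }
        }
    }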




Figure 6. Fully utilized pipelined broadcast architecture with only one initial synchronization process.

In order to realize the seamless atomic transfer, the compiler should not convert the broadcast routine into multiple point-to-point communications. Instead, a new method is required that is capable of storing the packetized data words from one node into buffers without synchronization, and simultaneously forwarding them to another. Thus, we proposed a new supplementary operation, which we called Fwd, so that the broadcast routine could be converted into Send, Recv, and Fwd for the head, tail, and body nodes, respectively.

To support the atomic pipelined broadcast, a communication bus conforms to a crossbar structure, through which all processing nodes can send and receive data. All processing nodes can be head, body, or tail nodes according to the broadcast routine. The role of the body nodes among them is very important, because a body node stores the packetized data words from one node into buffers without synchronization and simultaneously forwards them to another.

A Fwd operation should follow the order shown in Figure 7; this order applies to all body nodes. If the order is not followed, the process enters a deadlock state. This is treated as a non-interruptible atomic transmission, to ensure both full channel utilization and reliability. The procedure for body node i is as follows (a pseudocode sketch follows Figure 7):

1. Send request message to node i + 1.
2. Recv request message from node i − 1.
3. Recv ready message from node i + 1.
4. Send ready message to node i − 1.
5. Recv and Send data messages between adjacent nodes.
6. Recv and Send complete messages between adjacent nodes.


Figure 7. Atomic procedure among the body nodes. Reproduced from [22], IEICE: 2014.
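The six-step procedure above can be sketched in code as follows (the message and data primitives are hypothetical stubs of our own; in the MPE this sequence is realized by hardware logic):

    enum Msg { REQUEST, READY, COMPLETE };

    // Hypothetical primitives standing in for the bus-level handshake.
    inline void send_msg(int /*node*/, Msg /*m*/) {}
    inline void recv_msg(int /*node*/, Msg /*m*/) {}
    inline void send_data(int /*node*/, int /*pkt*/) {}
    inline void recv_data(int /*node*/, int /*pkt*/) {}

    // Fwd procedure of body node i for a k-packet broadcast (Figure 7).
    void fwd_body_node(int i, int k) {
        send_msg(i + 1, REQUEST);      // 1. request toward the tail side
        recv_msg(i - 1, REQUEST);      // 2. request from the head side
        recv_msg(i + 1, READY);        // 3. ready propagated back from the tail
        send_msg(i - 1, READY);        // 4. ready forwarded toward the head
        for (int p = 0; p < k; ++p) {  // 5. buffer each packet and simultaneously
            recv_data(i - 1, p);       //    forward it to the next node
            send_data(i + 1, p);
        }
        recv_msg(i + 1, COMPLETE);     // 6. completion handshake
        send_msg(i - 1, COMPLETE);
    }

Because every body node performs steps 1–4 before any data move, the whole chain synchronizes exactly once, which is what makes the transfer atomic.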

3.2. Inter-Node Transmission Order Change Technique

3.2.1. Initial Idea of Transmission Order Change Technique

To maximize broadcast communication performance, this paper proposes the concept of changing the transmission order. First, the author designed the model of the initial idea, which can begin the communication faster by reducing the amount of time that passes until the pre-existing message transmissions in all processing nodes finish. Since the barrier function for communication synchronization is always performed before the broadcast communication, none of the previous broadcast algorithms can begin the broadcast communication until all the communication ports of the processing nodes are free.

Basically, the transmission order should be pre-defined in all broadcast algorithms, regardless of whether the transmission order change technique is used; this is because the pre-defined order is used as a reference when the transmission order is determined. Before the transmission order change technique is applied, the broadcasting algorithms transmit data messages in the pre-defined order. The transmission order change technique requires hardware logic to change the transmission order, and so it references the pre-defined order to determine the order in the same processing node group


and the free and busy groups. In this paper, the pre-defined transmission order follows the numeric order of the processing node names: P0→P1→P2→P3→···→Pn.

Figure 8 shows the transmission orders in the atomic pipelined broadcast algorithm without the transmission order change technique, when P1–P4 have existing ongoing transmissions. As shown in the figure, when performing broadcast communication in the situation with the pre-existing transmissions, a waiting time of three cycles is required, because the root P0 should wait until all the existing ongoing transmissions of P1–P4 finish.

Figure 8. Transmission order in the atomic pipelined broadcast without the transmission order change technique, when P1–P4 have pre-existing data transmissions.

Figure 9 shows the transmission orders in the atomic pipelined broadcast algorithm with the initial transmission order change technique, when P1–P4 have existing ongoing transmissions. All processing nodes are sorted into the "free" or "busy" group, and the transmission order is determined in each group as per the pre-defined numeric order. Additionally, the free group is always in front of the busy group. In this way, the transmission order is changed to P0→P5→P6→P7→P1→P2→P3→P4, and the root P0 can begin the broadcast communication without the three-cycle wait that occurred in the case of Figure 8.


Figure 9. Changed transmission order in the atomic pipelined broadcast with the initial transmission order change technique, when P1–P4 have pre-existing data transmissions.
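In code, the initial grouping rule could look like this (a sketch under the assumption that a per-node busy flag is available from the status register; the function and variable names are ours, not the paper's):

    #include <vector>

    // Initial technique: root first, then the 'free' group, then the 'busy'
    // group, each group kept in the pre-defined numeric order.
    std::vector<int> initial_order(int root, const std::vector<bool>& busy) {
        std::vector<int> order{root};
        for (int i = 0; i < (int)busy.size(); ++i)
            if (i != root && !busy[i]) order.push_back(i);  // 'free' group
        for (int i = 0; i < (int)busy.size(); ++i)
            if (i != root && busy[i])  order.push_back(i);  // 'busy' group
        return order;
    }
    // Figure 9 example: root = P0, P1..P4 busy -> P0,P5,P6,P7,P1,P2,P3,P4.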




3.2.2. Enhancing Transmission Order Change Technique

This section presents the inter-node transmission order change technique, which is used to derive better performance than that seen with the initial technique. As described in Section 3.2.1, merely applying the initial transmission order change technique to the atomic pipelined broadcast algorithm will improve the broadcast performance. This is expected, as the atomic pipelined broadcast improves the execution time of broadcast communication itself by reducing the number of synchronizations, whereas the transmission order change technique reduces the waiting time that occurs before the broadcast communication begins.

To achieve a better performance improvement, one should be aware of the limitation of the initial transmission order change technique. Figure 10 shows the waiting time in the atomic pipelined broadcast without the transmission order change technique, when P1's pre-existing transmission data size increases. Since P1's pre-existing data transmission finishes at Cycle 6, the root P0 should wait until then.

Figure 10. Waiting time occurrence in the atomic pipelined broadcast without the transmission order change technique, when P1's pre-existing transmission data size becomes larger.


Figure 11 shows the waiting time in the atomic pipelined broadcast with the initial transmission order change technique, when P1's pre-existing transmission data size becomes larger. The P7→P1→P2 data transmission of P1 is delayed by two cycles, due to its pre-existing data transmission. Broadcast communication undertakes synchronization before data transmissions; synchronization is a process to check whether the communication ports of all processing nodes are available for broadcast communication. As described in Section 2.2, the pipelined broadcast algorithm is synchronized by propagating the request message from P0 (a head node) to the last processing node (a tail node), and the ready message from the tail back to the head. In the figure, P1 receives the ready message at Cycle 4, but it remains in a buffer until the earlier messages are transmitted. Finally, P1 performs the ready message at Cycle 6, which means that the effective beginning cycle of the broadcast communication is Cycle 2, even though it begins at Cycle 0.

Figure 11. Waiting time occurrence in the atomic pipelined broadcast with the initial transmission order change technique, when P1's pre-existing transmission data size becomes larger.

The inter-node transmission order change technique is proposed to overcome this limitation inherent in the initial technique. In the proposed technique, a multi-core processor system sorts each of the processing nodes into the "free" or "busy" group, just as in the initial technique. However, it determines the transmission orders only in the free group according to the pre-defined numeric order. For the busy group, it changes the transmission order according to the remaining pre-existing transmission data size. Then, it calculates the waiting time by using the processing nodes' remaining pre-existing transmission data size. Therefore, the additional hardware logic for the proposed technique should be added to the multi-core processor system, but its hardware overhead can be ignored, as the logic area of the message passing engine in multi-core processors is very much smaller than that of the processors; it was only 1.94% in the initial transmission order change technique.

Figure 12 shows the reduced waiting time in the atomic pipelined broadcast with the proposed inter-node transmission order change technique, when P1's pre-existing transmission data size becomes larger. Since the proposed technique does not change the transmission order in the free group, the transmission order there is P0→P5→P6→P7, as seen with the initial technique. In the busy group, P1, P2, and P3 and P4 have existing data transmissions for six, three, and two cycles, respectively. According to the order of the remaining pre-existing transmission data size, the transmission order is changed to P3→P4→P2→P1. Consequently, the final transmission order is P0→P5→P6→P7→P3→P4→P2→P1. As shown in the figure, the broadcast communication begins at Cycle 0; in this respect, it is identical to the case adopting the initial transmission order change technique; however, the broadcast communication's effective beginning cycle is also Cycle 0. If P1's pre-existing transmission data size becomes larger, the performance improvement will also increase.
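A sketch of the proposed ordering rule follows (assuming the remaining transmission size of each node is known from the status register, with zero meaning free; the names and types are our own illustration, not the hardware design):

    #include <algorithm>
    #include <vector>

    // Proposed technique: the free group keeps the numeric order; the busy
    // group is sorted by ascending remaining pre-existing transmission size.
    std::vector<int> internode_order(int root, const std::vector<int>& remaining) {
        std::vector<int> free_grp, busy_grp;
        for (int i = 0; i < (int)remaining.size(); ++i) {
            if (i == root) continue;
            (remaining[i] == 0 ? free_grp : busy_grp).push_back(i);
        }
        // Stable sort keeps the numeric order among equally busy nodes.
        std::stable_sort(busy_grp.begin(), busy_grp.end(),
                         [&](int a, int b) { return remaining[a] < remaining[b]; });
        std::vector<int> order{root};
        order.insert(order.end(), free_grp.begin(), free_grp.end());
        order.insert(order.end(), busy_grp.begin(), busy_grp.end());
        return order;
    }
    // Figure 12 example: remaining = {0,6,3,2,2,0,0,0} for P0..P7
    //   -> P0, P5, P6, P7, P3, P4, P2, P1.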


Figure 12. Reduced waiting time in the atomic pipelined broadcast with the proposed inter-node transmission order change technique, when P1's pre-existing transmission data size becomes larger.

3.2.3. Analysis of Minimizing Waiting Time with Inter-Node Transmission Order Change Technique

Figures 13 and 14 show the transmission orders in the atomic pipelined broadcasts without and with the inter-node transmission order change technique, respectively, when P1 has a pre-existing data transmission. In Figure 13, the root P0 should wait for eight cycles until P1's pre-existing data transmission finishes, because it cannot change the transmission order.

Waiting until P1’s existing data communication finishs. Waiting until P1’s existing data communication finishs.

P0→P1 Root P0 P0→P1 Root P0 P1 P1 P2 P2 P3 P3 P4 P4 P5 Transmissionorder P5 Transmissionorder P6 P6 P7 P7 Cycles 012345 67 8 910 Cycles 67 8 910 012345 Figure 13.13. TransmissionTransmission order order in in the the atomic atomic pipelined pipelined broadcast broadcast without without the transmission the transmission order change order Figure 13. Transmission order in the atomic pipelined broadcast without the transmission order changetechnique technique when P1 when has pre-existingP1 has pre-existing data transmission. data transmission. change technique when P1 has pre-existing data transmission.


Figure 14. Transmission order in the atomic pipelined broadcast with the proposed inter-node transmission order change technique when P1 has pre-existing data transmission.

However, as shown in Figure 14, the atomic pipelined broadcast with the proposed inter-node transmission order change technique waits for only one cycle, by virtue of the changed transmission order, because it can arrange the transmission order according to the processing nodes' remaining pre-existing transmission data size.

3.2.4. Process Sequence of Inter-Node Transmission Order Change Technique

To support these transmission order change techniques, the local bus requires a status register. In the initial transmission order change technique, this was 8 bits in total (i.e., 1 bit for each processing node). To support the proposed inter-node transmission order change technique, it was increased to 2 bits for each processing node, to identify the pre-existing transmission data size. The 2-bit value "00" means the "free" status, while "01", "10", and "11" mean "busy" with a remaining transmission data size of under 512 bytes, under 1024 bytes, and over 1024 bytes, respectively. If the size of the status register for each processing node were increased from 2 bits to 3 bits or more, it would be possible to classify the communication status of the processing nodes at a finer granularity.

Figure 15 briefly shows the process sequence of the inter-node transmission order change technique when the root node receives the broadcast operation; detailed information is provided in Section 3.3. In this figure, the message passing engine (MPE) of the root P5 orders the local bus to distribute the value of the status register to all the MPEs. The status register has 2 bits for each processing node: 16 bits in total.
In Step 3 of Figure 15, the local bus distributes the status register value to all MPEs. Each MPE arranges the processing nodes' physical numbers in order of the 2-bit value of each node. In this figure, the status register value was assumed to be "10 10 10 11 10 00 01 01"; therefore, the processing nodes' physical numbers were arranged into P5→P6→P7→P0→P1→P2→P4→P3. Then, the intermediate control message was generated by using the information in the sequential tree read-only memory (ROM) and the physical/logical (P/L) table. Finally, the intermediate control message was translated into the final control message, which was then stored in the issue queue.
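As a concrete illustration of this encoding, the sketch below maps a remaining-byte count to its 2-bit status and extracts one node's field from the packed 16-bit register. The function names and the bit packing (node 0 in the two most significant bits, matching the left-to-right order of the example value) are assumptions of this sketch rather than details given in the paper.

    #include <cstdint>

    // 2-bit communication status per node, as defined in the text:
    // 00 = free; 01, 10, 11 = busy with a remaining transmission data size
    // of under 512 bytes, under 1024 bytes, and over 1024 bytes, respectively.
    enum Status : uint8_t { FREE = 0b00, BUSY_LT_512 = 0b01,
                            BUSY_LT_1024 = 0b10, BUSY_GE_1024 = 0b11 };

    Status encode(uint32_t remainingBytes) {
        if (remainingBytes == 0)   return FREE;
        if (remainingBytes < 512)  return BUSY_LT_512;
        if (remainingBytes < 1024) return BUSY_LT_1024;
        return BUSY_GE_1024;       // boundary handling assumed in this sketch
    }

    // Extract node i's 2-bit field from the 16-bit status register.
    Status statusOf(uint16_t reg, int node) {
        return Status((reg >> (14 - 2 * node)) & 0b11);
    }

For the register value assumed above ("10 10 10 11 10 00 01 01", i.e., 0xAB85 under this packing), statusOf returns FREE only for node 5, and sorting the nodes by (status value, node number) reproduces the arrangement P5→P6→P7→P0→P1→P2→P4→P3.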



Figure 15. Process sequence of the inter-node transmission order change technique when the root node receives the broadcast operation.

3.3. Modified Message Passing Engine Architecture

Generally, multi-core processors do not have an MPE that can handle collective communication; therefore, compilers may convert collective communication operations into a group of point-to-point operations. However, multi-core processors with an MPE that can handle them can send the collective communication operations to the processing node without converting them into point-to-point operations [43–48]. They are then performed in the MPE of the processing nodes, according to the function and features of each MPE.

To support the atomic pipelined broadcast and the inter-node transmission order change technique, the previous MPE architecture also had to be modified. The modified MPE can receive three types of incoming operations: Send, Recv, and Bcast. Send and Recv messages are stored directly in the issue queue, and then the communication begins. The broadcast operation is converted into point-to-point operations by the atomic broadcast algorithm, and the converted operations are then stored in the issue queue.

Figure 16 is the flow chart for processing the incoming command messages in the modified MPE. In the case of the broadcast, the root node orders the local bus to distribute the value of the status register to all the processing nodes in the communicator. After the statuses of the processing nodes are checked, the MPE reads the pre-defined transmission order from the ROM, translates the logical node numbers into physical node numbers, and then determines the transmission order. According to the order, the MPE generates the control messages, which consist of Send, Recv, and Fwd, and stores them in the issue queue.

Figure 17 shows the process sequence for the broadcast command in the modified MPE when the communicator has eight processing nodes.
In this figure, the case of MPE #5 (left side) is the process sequence for the root node; the right side is for a leaf node. The process sequence is as follows.

Step 1: All MPEs receive the broadcast command from their own cores. The command indicates the root node; in this case, the root is P5. Therefore, all MPEs know the root node number.

Step 2: The root MPE orders the local bus to distribute the value of the status register to all MPEs. The status register has 2 bits for each MPE, 16 bits in total; in the initial transmission order change technique, it was 8 bits in total (i.e., 1 bit for each MPE), and it was enlarged to support the inter-node transmission order change technique by identifying the pre-existing transmission data size. The 2-bit value for each MPE means that "00" is the "free" status, and "01", "10", and "11" are "busy" with a remaining transmission data size of under 512 bytes, under 1024 bytes, and over 1024 bytes, respectively. In leaf nodes, this step is ignored.

Step 3: The local bus distributes the status register value to all the MPEs. Each MPE arranges the nodes' physical numbers in the order of the 2-bit value of each MPE; if any 2-bit values are identical, the corresponding nodes are arranged in numerical order. In this figure, the status register value is "10 10 10 11 10 00 01 01"; therefore, the nodes' physical numbers are arranged into P5→P6→P7→P0→P1→P2→P4→P3. Then, the MPE writes the information pertaining to the arranged nodes' physical numbers into the physical-to-logical number map table (P/L table).

Steps 4 and 5: The address decoder generates and provides the physical address corresponding to each logical address, according to the P/L table. This step is needed because the sequential tree ROM holds the pre-defined transmission order in terms of logical addresses; therefore, to change the transmission order, the corresponding physical addresses are required.

Step 6: The intermediate control message is generated. The sequential tree ROM holds the pre-defined transmission order based on the logical addresses, which is P0→P1→P2→P3→P4→P5→P6→P7. In the root MPE #5, the physical address is "5", and the corresponding logical address is "0"; therefore, MPE #5 generates the intermediate control message for P0→P1 (Send to 1). In the leaf MPE #2, the intermediate control message for P4→P5→P6 (Fwd 4 to 6) is generated, because its logical address is "5".

Step 7: The intermediate control message is translated into the final control message, which is then stored in the issue queue. In MPE #5, the intermediate control message is "Send to 1"; it is translated into "Send to 6" because the physical address for the logical address "1" is "6". In the same way, in MPE #2, the intermediate control message "Fwd 4 to 6" is translated into "Fwd 1 to 4". The translated control message is then stored in the issue queue.
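A compact way to see Steps 3–7 together is to trace them in code. The sketch below is an illustration, not the RTL: the ControlMsg encoding and the makeControlMsg helper are hypothetical, the plTable argument is the arranged order from Step 3, and the sequential tree is the logical chain P0→P1→…→P7 held in the ROM.

    #include <array>

    // Logical chain from the sequential tree ROM: logical 0 sends to 1, the
    // last logical node only receives, and every node in between forwards.
    // plTable[logical] = physical node number (the Step 3 arrangement).
    struct ControlMsg { enum Kind { SEND, RECV, FWD } kind; int from, to; };

    ControlMsg makeControlMsg(const std::array<int, 8>& plTable, int myPhysical) {
        // Steps 4/5: recover my logical address by inverting the P/L table.
        int logical = 0;
        for (int l = 0; l < 8; ++l)
            if (plTable[l] == myPhysical) logical = l;

        // Step 6: intermediate control message in logical addresses.
        ControlMsg m;
        if (logical == 0)      m = {ControlMsg::SEND, 0, 1};
        else if (logical == 7) m = {ControlMsg::RECV, 6, 7};
        else                   m = {ControlMsg::FWD, logical - 1, logical + 1};

        // Step 7: translate the logical endpoints back to physical numbers.
        m.from = plTable[m.from];
        m.to   = plTable[m.to];
        return m;  // the final control message stored in the issue queue
    }

With the P/L table {5, 6, 7, 0, 1, 2, 4, 3} of Figure 17, makeControlMsg produces "Send to 6" for MPE #5 (logical address 0) and "Fwd 1 to 4" for MPE #2 (logical address 5), matching the translations described above.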


Figure 16. Flow chart for processing incoming command messages in the modified message passing engine (MPE).
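In code form, the flow chart reduces to a small dispatch routine. The skeleton below mirrors Figure 16 for illustration only: the Engine type, its empty helper functions, and the string-based issue queue are stand-ins invented for this sketch, since the text describes these blocks in hardware rather than software.

    #include <queue>
    #include <string>

    enum class Cmd { Send, Recv, Bcast };

    struct Engine {
        bool isRoot;
        std::queue<std::string> issueQueue;

        void handle(Cmd cmd) {
            if (cmd != Cmd::Bcast) {                // Send/Recv bypass ordering
                issueQueue.push(cmd == Cmd::Send ? "Send" : "Recv");
                return;
            }
            if (isRoot) orderStatusDistribution();  // root: trigger the local bus
            else        waitForStatusValues();      // leaves: wait for the register
            buildPLTableFromStatus();               // read status, make L/P table
            readPredefinedOrderFromROM();           // logical chain P0 -> ... -> P7
            issueQueue.push(toSendRecvFwd());       // converted control message
        }

        // Placeholder bodies; the modified MPE implements these in logic.
        void orderStatusDistribution() {}
        void waitForStatusValues() {}
        void buildPLTableFromStatus() {}
        void readPredefinedOrderFromROM() {}
        std::string toSendRecvFwd() { return "Send/Recv/Fwd"; }
    };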





Figure 17. Processing sequence for the broadcast command in the modified MPE when the communicator has eight processing nodes.

4. Simulation Results and Discussion

This section discusses the performance improvements offered by the atomic pipelined broadcast algorithm with the inter-node transmission order change technique, along with the simulation results. The simulation was performed with the multi-core processor system and the bus functional models that support each broadcast algorithm. The total communication time for data message broadcasting among multiple cores in the system was also measured and compared.

As described previously, the atomic pipelined broadcast was developed to reduce the execution delay of broadcast communication, and the proposed inter-node transmission order change technique was developed to reduce the waiting time. To evaluate the performance improvement in terms of both execution time and waiting time, seven broadcast algorithms were implemented and compared in this section.

To explore the multi-core processor system architectures, the system architecture and the bus functional models (BFMs) were modeled with SystemC while considering the latencies of the functional blocks. They generated communication traffic in each specific simulation environment. To simplify the simulation, the operating clock frequency of this system was defined as 100 MHz. The AMBA bus protocol can support 32 processors without an increase in the clock cycles needed to transmit and receive data. As more processors are connected to the bus system, the hardware complexity of the bus system block also increases. The increased hardware complexity can worsen the timing closure in the register-transfer-level-to-graphic-database-system (RTL-to-GDS) application-specific integrated circuit (ASIC) design flow. However, the sign-off timing closure for the bus system block in commercial SoCs, such as the Samsung Exynos and Qualcomm Snapdragon, was achieved with a clock frequency about three times slower than that of the CPU.
A clock frequency three times slower was enough to support 32 processors on AMBA. The bus bandwidth could transmit 4 bytes per clock cycle. All processing nodes in the system were interconnected with each other via a crossbar bus.

The broadcast communication performance of the broadcast algorithms was evaluated by simulating communication traffic with their own BFMs. The execution time was measured while varying the data message size from 4 to 2048 bytes, with 4, 8, 16, or 32 processing nodes. The simulation environment is summarized in Table 1.

Table 1. Summary of simulation environment.

Simulation Environment                 Circumstances
Clock frequency                        100 MHz
Bus bandwidth                          4 bytes per clock cycle
Root node                              P0 (Always)
Number of nodes                        4, 8, 16, or 32
Data message size                      4–2048 bytes
Pre-existing transmission data size    32, 128, 512, 1536 bytes

The simulations describe the performance improvements in terms of reducing the waiting time before the broadcast communication begins. In any broadcast communication without a pre-existing data transmission, the transmission order change technique is unnecessary, because the changed transmission order would be identical to the pre-defined order; therefore, to evaluate the performance of the transmission order change technique, the simulation environment was carefully prepared. Then, the simulations were performed to compare the atomic pipelined broadcast algorithm with the proposed transmission order change technique (APOC) and the atomic pipelined broadcast algorithm without the technique (AP).

Figures 18–21 show the speed-up ratios of the APOC algorithm, compared to the AP algorithm, when 4, 8, 16, and 32 processing nodes were in the communicator with the pre-existing data transmission, respectively. In these simulations, only the P1 processing node had the pre-existing data transmission, and the pre-existing transmission data sizes were 32, 128, 512, and 1536 bytes, respectively.
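As a rough guide to the cycle accounting implied by Table 1 (and not the exact BFM, which also models synchronization and the latencies of the functional blocks), the helper below converts a message size into bus cycles and nanoseconds under the 100 MHz clock and the 4-byte-per-cycle bandwidth; the names and the ceiling rounding are choices made for this sketch.

    #include <cstdint>

    // 100 MHz clock => 10 ns per cycle; the bus moves 4 bytes per cycle.
    constexpr double   kNsPerCycle    = 10.0;
    constexpr uint32_t kBytesPerCycle = 4;

    // Bus cycles needed to move `bytes`, rounded up to whole cycles.
    uint32_t dataCycles(uint32_t bytes) {
        return (bytes + kBytesPerCycle - 1) / kBytesPerCycle;
    }

    // Raw transfer time in nanoseconds, excluding synchronization overhead.
    double transferTimeNs(uint32_t bytes) {
        return dataCycles(bytes) * kNsPerCycle;
    }

For example, a 2048-byte message occupies 512 bus cycles, i.e., 5120 ns of pure data movement, before any synchronization cost is added.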

Figure 18. The atomic pipelined broadcast algorithm without the technique (AP) vs. the atomic pipelined broadcast algorithm with the proposed transmission order change technique (APOC; four nodes, only P1 w/ the pre-existing data transmission).

Figure 19. AP vs. APOC (eight nodes, only P1 w/ the pre-existing data transmission).


Figure 20. AP vs. APOC (16 nodes, only P1 w/ the pre-existing data transmission).

Figure 21. AP vs. APOC (32 nodes, only P1 w/ the pre-existing data transmission).

In the figures, APOC shows better performance than AP, which cannot change the transmission order. Since APOC features the transmission order change technique, it did not wait (or waited only a short time) until the pre-existing data transmission finished; rather, it performed the broadcast communication immediately. Additionally, the synchronization time could be hidden within the communication time of the pre-existing data transmission.

Table 2 shows the best speed-up ratios obtained by adopting the transmission order change technique, according to the simulation results. When four processing nodes were broadcasting a 4-byte data message with a 32-byte pre-existing data transmission, the best speed-up ratio (1.105) occurred. When eight processing nodes were broadcasting a 4-byte data message with a 32-byte pre-existing data transmission, the best speed-up ratio (1.261) occurred. When 16 processing nodes were broadcasting a 4-byte data message with a 128-byte pre-existing data transmission, the best speed-up ratio (1.255) occurred.
Finally, when 32 processing nodes were broadcasting a 4-byte data message with a 128-byte pre-existing data transmission, the best speed-up ratio (1.423) occurred. The speed-up ratio is the total communication time of AP divided by that of APOC; for example, 1010 ns/710 ns ≈ 1.423.

Table 2. Best speed-up ratios by adopting the transmission order change technique, according to the simulation results.

# of Nodes   Data Msg Size   Pre-Existing Transmission   Total Communication Time (AP)   Total Communication Time (APOC)   Speed-Up Ratio
4 nodes      4 bytes         32 bytes                    210 ns                          190 ns                            1.105
8 nodes      4 bytes         32 bytes                    290 ns                          230 ns                            1.261
16 nodes     4 bytes         128 bytes                   690 ns                          550 ns                            1.255
32 nodes     4 bytes         128 bytes                   1010 ns                         710 ns                            1.423


When N is the number of processing nodes, the number of reduced waiting clock cycles is up to N-2. Even if the transmission order of P1 in the busy group is changed to the last position and the broadcast communication begins immediately, P1 performs its data transmission for the broadcast communication only after at least N-2 cycles; therefore, the performance improvement was limited. Additionally, as the data message size increased, the ratio of synchronization time to data message transmission time decreased; therefore, the performance improvement also decreased.

These simulation results indicate the levels of performance improvement that derive from the use of the transmission order change technique. Although the inter-node transmission order change technique was used in the simulation, the initial change technique would have delivered the same levels of performance here, as only one processing node (i.e., P1) belongs to the busy group. Therefore, the advantage of the inter-node transmission order change technique does not appear in these simulation results; its advantage is that the transmission order of processor nodes can be changed within the busy group as well.
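The N-2 bound can be stated as a one-line rule. The helper below is a sketch under this simplified cycle model; the function name and the min-style formulation are illustrative and are not taken from the paper's hardware.

    #include <algorithm>

    // In an N-node pipelined broadcast, the node placed last starts its own
    // transfer N-2 cycles after the broadcast begins, so at most N-2 cycles
    // of a busy node's pre-existing traffic can be hidden by reordering.
    int hiddenWaitingCycles(int numNodes, int remainingBusyCycles) {
        return std::min(remainingBusyCycles, numNodes - 2);
    }

For example, with eight nodes at most six cycles of P1's pre-existing transmission can be hidden, and with four nodes at most two.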

5. Implementation and Synthesis Results

To support the atomic pipelined broadcast algorithm with the inter-node transmission order change technique, a modified MPE was modeled by using Verilog-HDL and synthesized. Figure 22 shows the structure of the multi-core processor system that included the modified MPE. The multi-core processor system consisted of four processing nodes and the local bus. Each of the processing nodes comprised a processor core, 16-KByte instruction memory, 16-KByte data memory, and the modified MPE. The processor core was the Yonsei Multi-Processor Unit for Extension (YMPUE), which was designed by the processor laboratory at Yonsei University. It was a five-stage pipelined RISC type, based on the MIPS DLX (DeLuXe) architecture [49].


Figure 22. Block diagram of the multi-core processor system with the modified MPE.

When the instruction regarding message passing is fetched from the instruction memory, the processor core transfers it to the modified MPE after the decode stage. The modified MPE stores the Send and Recv commands directly in the issue queue. When the Bcast command is received, the root node orders the local bus to distribute the value of the status register to all the processing nodes in the communicator. After the communication statuses of the processing nodes are checked, the MPE reads the pre-defined transmission order from the ROM, translates the logical node numbers into physical node numbers, and then determines the transmission order. According to the changed order, the MPEs generate the control messages, which consist of Send (for a head node), Recv (for a tail node), and Fwd (for a body node), and then store them in the issue queue.

To obtain hardware overhead information for the modified MPE, the multi-core processor system was synthesized with the Synopsys Design Compiler, using TSMC 0.18 µm process standard and memory libraries. Table 3 shows the synthesis results. The modified MPE occupies an area much smaller than that of the total chip. The logic area of the modified MPE was 2288.1 (i.e., 9152.5 for four processing nodes) equivalent NAND gates. As described in [22], the logic area of the MPE supporting the initial transmission order change technique is 2108.4 (i.e., 8433.6 for four processing nodes) equivalent NAND gates.


Table 3. Logic synthesis results of the multi-core processor system with the modified MPE.

Block                                    Gate Count    Ratio (%)
mpi_bus                                  2526.4        0.60
Processing Node #0    ympu               28221.4       6.50
                      i_mem              38613.7       8.90
                      d_mem              38775.1       8.93
                      mpe                2281.7        0.53
Processing Node #1    ympu               28149.1       6.49
                      i_mem              38613.7       8.90
                      d_mem              38772.5       8.93
                      mpe                2281.0        0.53
Processing Node #2    ympu               28157.1       6.49
                      i_mem              38625.5       8.90
                      d_mem              38732.8       8.93
                      mpe                2296.9        0.53
Processing Node #3    ympu               28132.6       6.48
                      i_mem              38616.4       8.90
                      d_mem              38796.3       8.94
                      mpe                2292.8        0.53
Total                                    433975.06     100

Compared to the previous MPE, the logic area of the modified MPE increased by approximately 8.5%. However, it occupied only 2.1% of the entire chip area. When considering the entire chip area, the logic area of the modified MPE was negligible; therefore, if the modified MPE were used in a multi-core processor system, the total system performance could be improved with only a small increase in logic area.

Table 4 shows the results of the power measurement for the synthesized MPE with the Synopsys Power Compiler. The total dynamic power was 56.587 mW, and the total leakage power was 489.443 µW. The dynamic power of the modified MPE was 4.242 mW and its leakage power was 2.779 µW; these values show that the total power consumption of this multi-core processor system was very low.

Table 4. Results of power measurement for the synthesized MPE.

Block                                    Dynamic Power (mW)    Leakage Power (µW)
mpi_bus                                  1.340                 0.690
Processing Node #0    ympu               9.097                 9.452
                      i_mem              1.982                 56.090
                      d_mem              1.712                 56.177
                      mpe                1.059                 0.696
Processing Node #1    ympu               9.069                 9.437
                      i_mem              1.955                 56.090
                      d_mem              1.641                 56.173
                      mpe                1.058                 0.692
Processing Node #2    ympu               8.968                 9.576
                      i_mem              1.933                 56.107
                      d_mem              1.749                 56.030
                      mpe                1.064                 0.698
Processing Node #3    ympu               9.045                 9.461
                      i_mem              1.930                 56.094
                      d_mem              1.740                 56.215
                      mpe                1.061                 0.693
Total                                    56.587                489.443

(The per-row block names within each processing node follow the same order as in Table 3.)

6. Conclusions

Since the sequential tree algorithm was presented, various broadcast algorithms have been proposed; these include the binary tree, binomial tree, minimum spanning tree, and distance minimum spanning tree algorithms, as well as Van de Geijn's hybrid broadcast and modified hybrid broadcast algorithms. These algorithms offer performance efficiency by tuning network topologies in collective communications, but they cannot utilize the maximum bandwidth for message broadcast, given their restrictive structure. To utilize the maximum bandwidth, pipelined broadcast algorithms were proposed to accelerate the processing time of broadcast communication at the operating system level and in cluster systems.

This paper proposed an efficient atomic pipelined broadcast algorithm with the inter-node transmission order change technique. The atomic pipelined broadcast algorithm drastically reduces the number of synchronizations on multi-core processor systems. This reduction was accomplished by developing a light-weight protocol and employing simple hardware logic, so that the message broadcast could be performed in an atomic fashion with full pipeline utilization. The proposed inter-node transmission order change technique considered the communication status of the processing nodes. This technique sorted each of the processing nodes into the free or busy group, in a manner similar to that of the initial transmission order change technique. Then, it determined the transmission order only in the free group according to the pre-defined numeric order; for the busy group, it changed the transmission order according to the remaining pre-existing transmission data size. Therefore, the synchronization time could be hidden for the remaining time until the pre-existing data transmissions finished; as a result, the overall broadcast completion time was reduced.

To validate this approach, the proposed broadcast algorithm was implemented in a bus functional model with SystemC, and the results with various message data sizes and node numbers were evaluated. To evaluate the performance improvements of the proposed atomic pipelined broadcast algorithm and the proposed inter-node transmission order change technique, many simulations in various communication environments were performed, in which the number of processing nodes, the transmission data message size, and the pre-existing transmission data message size were varied. The simulation results showed that adopting the transmission order change technique could bring about a speed-up ratio of up to 1.423. The message passing engine (MPE) with the atomic pipelined broadcast algorithm and the inter-node transmission order change technique was designed by using Verilog-HDL, supporting four processor cores. The logic synthesis results with TSMC 0.18 µm process cell libraries showed that the logic area of the proposed MPE was 2288.1 equivalent NAND gates, which was approximately 2.1% of the entire chip area; therefore, performance improvement can be expected with a small hardware area overhead if the proposed MPE is added to multi-core processor systems.

Funding: This research received no external funding. Acknowledgments: The author would like to thank Emma for her help with the manuscript submission. Conflicts of Interest: The author declares no conflict of interest.

References

1. Lotfi-Kamran, P.; Modarressi, M.; Sarbazi-Azad, H. An efficient hybrid-switched network-on-chip for chip multiprocessors. IEEE Trans. Comput. 2016, 65, 1656–1662. [CrossRef]
2. Homayoun, H. Heterogeneous chip multiprocessor architectures for big data applications. In Proceedings of the ACM International Conference on Computing Frontiers, Como, Italy, 16–18 May 2016; pp. 400–405.
3. Wilson, S. Methods in Computational Chemistry; Springer Science & Business Media: New York, NY, USA, 2013; Volume 3.
4. Shukla, S.K.; Murthy, C.; Chande, P.K. A survey of approaches used in parallel architectures and multi-core processors for performance improvement. Prog. Syst. Eng. 2015, 366, 537–545.
5. Fernandes, K. GPU Development and Computing Experiences. Available online: http://docplayer.net/77930870-Gpu-development-and-computing-experiences.html (accessed on 10 September 2019).

6. Gong, Y.; He, B.; Zhong, J. Network performance aware MPI collective communication operations in the cloud. IEEE Trans. Parallel Distrib. Syst. 2013, 26, 3079–3089. [CrossRef]
7. Woolley, C. NCCL: Accelerated Multi-GPU Collective Communications. Available online: https://images.nvidia.com/events/sc15/pdfs/NCCL-Woolley.pdf (accessed on 10 September 2019).
8. Luehr, N. Fast Multi-GPU Collectives with NCCL. Available online: https://devblogs.nvidia.com/parallelforall/fast-multi-gpu-collectives-nccl (accessed on 10 September 2019).
9. Chiba, T.; Endo, T.; Matsuoka, S. High-performance MPI broadcast algorithm for grid environments utilizing multi-lane NICs. In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid, Rio de Janeiro, Brazil, 14–17 May 2007; pp. 487–494.
10. Kumar, S.; Sharkawi, S.S.; Nysal Jan, K.A. Optimization and analysis of MPI collective communication on fat-tree networks. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Symposium, Chicago, IL, USA, 23–27 May 2016; pp. 1031–1040.
11. Jha, S.; Gabriel, E. Impact and limitations of point-to-point performance on collective algorithms. In Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Cartagena, Colombia, 16–19 May 2016. [CrossRef]
12. Vadhiyar, S.S.; Fagg, G.E.; Dongarra, J. Automatically tuned collective communications. In Proceedings of the ACM/IEEE Conference on Supercomputing, Dallas, TX, USA, 4–10 November 2000. [CrossRef]
13. Seidel, S.R. Broadcasting on Linear Arrays and Meshes; Oak Ridge National Lab: Oak Ridge, TN, USA, 1993.
14. Barnett, M.; Gupta, S.; Payne, D.G.; Shuler, L.; van de Geijn, R. Building a High-performance Collective Communication Library. In Proceedings of the ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 14–18 November 1994; pp. 107–116.
15. Matsuda, M.; Kudoh, T.; Kodama, Y.; Takano, R.; Ishikawa, Y. Efficient MPI collective operations for clusters in long-and-fast networks. In Proceedings of the IEEE International Conference on Cluster Computing, Barcelona, Spain, 25–28 September 2006; pp. 1–9.
16. Barnett, M.; Payne, D.; Geijn, R.; Watts, J. Broadcasting on meshes with wormhole routing. J. Parallel Distrib. Comput. 1996, 35, 111–122. [CrossRef]
17. Traff, J.; Ripke, A.; Siebert, C.; Balaji, P.; Thakur, R.; Gropp, W. A simple, pipelined algorithm for large, irregular all-gather problems. Lect. Notes Comput. Sci. 2008, 5205, 84–93.
18. Patarasuk, P.; Yuan, X.; Faraj, A. Techniques for pipelined broadcast on switched clusters. J. Parallel Distrib. Comput. 2008, 68, 809–824. [CrossRef]
19. Zhang, P.; Deng, Y. Design and Analysis of Pipelined Broadcast Algorithms for the All-Port Interlaced Bypass Torus Networks. IEEE Trans. Parallel Distrib. Syst. 2012, 23, 2245–2253. [CrossRef]
20. Jia, B. Contention Free Pipelined Broadcasting Within a Constant Bisection Bandwidth Network Topology. U.S. Patent 8873559 B2, 12 August 2013.
21. Watts, J.; van de Geijn, R. A Pipelined Broadcast for Multidimensional Meshes. Parallel Process. Lett. 1995, 5, 281–292. [CrossRef]
22. Park, J.; Yun, H.; Moon, S. Enhancing Performance Using Atomic Pipelined Message Broadcast in a Distributed Memory MPSoC. IEICE Electron. Express 2014, 11, 1–7. [CrossRef]
23. Vamvakas, P.; Tsiropoulou, E.E.; Papavassiliou, S. Dynamic provider selection & power resource management in competitive wireless communication markets. Mob. Netw. Appl. 2018, 23, 86–99.
24. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.P.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
25. Chowdhury, N.; Wight, J.; Mozak, C.; Kurd, N. i5/i7 QuickPath Interconnect receiver clocking circuits and training algorithm. In Proceedings of the International Symposium on VLSI Design, Automation and Test, Hsinchu, Taiwan, 23–25 April 2012; pp. 1–4.
26. Safranek, R. Intel QuickPath Interconnect Overview. In Proceedings of the IEEE Symposium on Hot Chips, Stanford, CA, USA, 23–25 August 2009; pp. 1–27.
27. An Introduction to the Intel QuickPath Interconnect. Available online: https://www.intel.com/content/www/us/en/io/quickpath-technology/quick-path-interconnect-introduction-paper.html (accessed on 10 September 2019).
28. Oliver, N.; Sharma, R.; Chang, S.; Chitlur, B.; Garcia, E.; Grecco, J.; Grier, A.; Ijih, N.; Liu, Y.; Marolia, P.; et al. A Reconfigurable Computing System Based on a Cache-Coherent Fabric. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs, Cancun, Mexico, 30 November–2 December 2011; pp. 80–85.

29. Sodani, A. Knights Landing (KNL): 2nd Generation Intel Xeon Phi Processor. In Proceedings of the IEEE Symposium on Hot Chips, Cupertino, CA, USA, 22–25 August 2015; pp. 1–24.
30. Amanda, S. Programming and Compiling for Intel® Many Integrated Core Architecture. Available online: https://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture (accessed on 10 September 2019).
31. Intel Xeon Phi x100 Product Family. Available online: https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-phi-coprocessor-specification-update.pdf (accessed on 10 September 2019).
32. Sodani, A.; Gramunt, R.; Corbal, J.; Kim, H.-S.; Vinod, K.; Chinthamani, S.; Hutsell, S.; Agarwal, R. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro 2016, 36, 34–46. [CrossRef]
33. Pasarkar, B.V. A Review on HyperTransport™ Technology. J. Comput. Based Parallel Program. 2016, 1, 1–5.
34. Latest Technologies, Gaming, Graphics and Server, AMD. Available online: http://www.amd.com/en-us/innovations/software-technologies// (accessed on 10 September 2019).
35. Conway, P.; Hughes, B. The AMD Opteron Architecture. IEEE Micro 2007, 27, 10–21. [CrossRef]
36. Sinha, R.; Roop, P.; Basu, S. The AMBA SOC Platform. In Correct-by-Construction Approaches for SoC Design; Springer: New York, NY, USA, 2013; pp. 11–23.
37. Shrivastava, A.; Sharma, S.K. Various arbitration algorithm for on-chip (AMBA) shared bus multi-processor SoC. In Proceedings of the IEEE Conference on Electrical, Electronics and Computer Science, Bhopal, India, 5–6 March 2016; pp. 1–7.
38. Keche, K.; Mankar, K. Implementation of AMBA compliant SoC design for the application of detection of landmines on FPGA. In Proceedings of the International Conference on Advanced Computing and Communication Systems, Coimbatore, India, 22–23 January 2016; pp. 1–6.
39. Exynos 9820 Processor: Specs, Features. Available online: https://www.samsung.com/semiconductor/minisite/exynos/products/mobileprocessor/exynos-9-series-9820/ (accessed on 10 September 2019).
40. Snapdragon 855+ Mobile Platform. Available online: https://www.qualcomm.com/products/snapdragon-855-plus-mobile-platform (accessed on 10 September 2019).
41. Painkras, E. A Chip Multiprocessor for a Large-Scale Neural Simulator. Doctoral Dissertation, University of Manchester, Manchester, UK, 2012.
42. Chopra, R. Operating Systems: A Practical Approach; S.Chand Publishing: New Delhi, India, 2009.
43. Gao, S. Hardware Design of Message Passing Architecture on Heterogeneous System; University of North Carolina at Charlotte: Charlotte, NC, USA, 2013.
44. Gao, S.; Huang, B.; Sass, R. The Impact of Hardware Communication on a Heterogeneous Computing System. In Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines, Seattle, WA, USA, 28–30 April 2013; p. 234.
45. Huang, L.; Wang, Z.; Xiao, N. Accelerating NoC-Based MPI Primitives via Communication Architecture Customization. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors, Washington, DC, USA, 9–11 July 2012; pp. 141–148.
46. Ly, D.L.; Saldana, M.; Chow, P. The Challenges of Using an Embedded MPI for Hardware-based Processing Nodes. In Proceedings of the International Conference on Field-Programmable Technology, Sydney, Australia, 9–11 December 2009; pp. 120–127.
47. Mahr, P.; Lorchner, C.; Ishebabi, H.; Bobda, C. SoC-MPI: A Flexible Message Passing Library for Multiprocessor Systems-on-Chips. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs, Cancun, Mexico, 3–5 December 2008; pp. 187–192.
48. Chung, W.Y.; Jeong, H.; Ro, W.W.; Lee, Y.-S. A Low-Cost Standard Mode MPI Hardware Unit for Embedded MPSoC. IEICE Trans. Inf. Syst. 2011, E94-D, 1497–1501. [CrossRef]
49. Jeong, H.; Chung, W.Y.; Lee, Y.-S. The Design of Hardware MPI Units for MPSoC. J. Korea Inf. Commun. Soc. 2011, 36, 86–92. [CrossRef]

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).