
Article An Implementation of Real-Time Fundamental Functions on a DSP-Focused, High-Performance, Embedded Computing Platform

Xining Yu 1,*, Yan Zhang 1, Ankit Patel 1, Allen Zahrai 2 and Mark Weber 2

1 School of Electrical and Computer Engineering, University of Oklahoma, 3190 Monitor Avenue, Norman, OK 73019, USA; [email protected] (Y.Z.); [email protected] (A.P.)
2 National Severe Storms Laboratory, National Oceanic and Atmospheric Administration, Norman, OK 73072, USA; [email protected] (A.Z.); [email protected] (M.W.)
* Correspondence: [email protected]; Tel.: +1-405-325-2871

Academic Editor: Konstantinos Kontis
Received: 22 July 2016; Accepted: 2 September 2016; Published: 9 September 2016

Abstract: This paper investigates the feasibility of a backend design for a real-time, multiple-channel digital phased array system, particularly for high-performance embedded computing platforms constructed from general purpose digital signal processors. First, we obtained a lab-scale backend performance benchmark by simulating beamforming, pulse compression, and Doppler filtering on a Micro Telecom Computing Architecture (MTCA) chassis using the Serial RapidIO protocol for backplane communication. Next, a field-scale demonstrator of a multifunctional phased array radar is emulated using a similar configuration. Finally, the performance of a barebones design is compared with that of emerging tools that systematically exploit parallelism and multicore capabilities, including the Open Computing Language.

Keywords: phased array radar; embedded computing; serial RapidIO; MPAR

1. Introduction

1.1. Real-Time, Large-Scale, Phased Array Radar Systems

In [1], we introduced real-time phased array radar (PAR) processing based on the Micro Telecom Computing Architecture (MTCA) chassis. PAR, especially digital PAR, ranks among the most important sensors for aerospace surveillance [2]. A PAR system consists of three components: a phased array antenna manifold, a front-end electronics system, and a backend signal processing system. In current digital PAR systems, the backend is pushed closer to the antennas, which makes the front-end system more digitalized than its analog predecessors [3]. Accordingly, current front-end systems are mixed-signal systems responsible for radio frequency (RF) transmitting and receiving, digital in-phase and quadrature (I/Q) sampling, and channel equalization that improves the quality of signals. Meanwhile, digital PAR backend systems control the overall system, prepare transmit waveforms, transform received data for use in a digital processor, and process data for further functions, including real-time calibration, beamforming, and target detection/tracking. Many PAR systems demonstrate high throughput rates for data of significant computational complexity in both the front- and backends, especially in digital PARs. For example, [4] proposed a 400-channel PAR with a 1 ms pulse repetition interval (PRI); assuming 8192 range gates, each 8 bytes long in memory, in each PRI, the throughput in the front-end can reach up to 5.24 GB/s. As the requirements for such data throughput are extraordinarily demanding, at present, such front-end computing performance requires digital I/Q filtering to be mapped to a fixed set of gates, look-up tables, and Boolean operations on a field-programmable gate array (FPGA) or very-large-scale integration (VLSI) with full-custom design [5]. After front-end processing, data are sent to the backend system, in which more computationally intensive functions are performed. Compared with FPGA or full-custom VLSI chips, programmable processing devices, such as digital signal processors (DSPs), offer a high degree of flexibility, which allows designers to implement algorithms in a general purpose language (e.g., C) in backend systems [6]. For application in aerospace surveillance, target detection and tracking are, thus, performed in the backend. Target tracking algorithms, including the Kalman filter and its variants, predict future target speeds and positions by using Bayesian estimation [7], whose computational requirements vary according to the format and content of input data. Accordingly, detection and tracking functions require processors to be more capable of logic and data manipulation, as well as complex program flow control. Such features differ starkly from those required for baseline radar signal processors, in which the size of the data involved dominates the throughput of processing [6]. As such, for tracking algorithms, a general purpose processor or graphics processing unit (GPU)-based platform is more suitable than an FPGA or a DSP. In sum, for PAR applications, the optimal solution is a hybrid implementation: dedicated hardware for front-end processing, programmable hardware for backend processing, and a high-performance server for high-level functions.

1.2. High-Performance Embedded Computing Platforms

A high-performance embedded computing (HPEC) platform contains microprocessors, network interconnection technologies, such as those of the Peripheral Component Interconnect Industrial Computer Manufacturers Group (PICMG) and OpenVPX, and management software that allows more computing power to be packed into a system of reduced size, weight, and power consumption (SWaP) [8]. Choosing an HPEC platform as a backend system for digital PAR can meet the requirements of substantial computing power and high bandwidth throughput within a smaller SWaP budget, including in SWaP-constrained scenarios such as airborne platforms. Using an HPEC platform is, therefore, the optimal backend solution. Open standards, such as the Advanced Telecommunications Computing Architecture (ATCA) and MTCA [9,10], can be used for open architectures in multifunction PAR (MPAR) systems. Such designs achieve compatibility with industrial standards and reduce both the cost and duration of development. MTCA and ATCA contain groups of specifications that aim to provide an open, multivendor architecture that fulfills the requirements of a high-throughput interconnection network, increases the feasibility of system upgrading and upscaling, and improves system reliability. In particular, MTCA specifies the standard use of an Advanced Mezzanine Card (AMC) to provide processing and input–output (I/O) functions on a high-performance switch fabric with a small form factor. An MTCA system contains one or more chassis into which multiple AMCs can be inserted. Each AMC communicates with others via the backplane of a chassis. Among chassis, Ethernet, fiber, or Serial RapidIO (SRIO) cables can serve as data transaction media, and the numbers of MTCA chassis and AMCs are adjustable, meeting the requirements of scalable and diversified functionality for specific applications. Due to this modularity and flexibility, PAR demands can be satisfied in a physically smaller, less expensive, and more efficient way than with general purpose computing centers and server clusters. Compared with the legacy VME (Versa Module Europe) backplane architecture [11], the advantage of using MTCA is the ability to leverage various options of AMCs (i.e., processing, switching, storage, etc.) that are readily available in the marketplace. Moreover, MTCA provides more communication bandwidth and flexibility than VME [12]. For this paper, we have chosen different kinds of AMCs as I/O or processing modules in order to implement the high-performance embedded computing system for PAR based on MTCA. Each of those modules works in parallel and can be enabled and controlled by the MTCA carrier hub (MCH). According to system throughput requirements, we can apply a suitable number of processing modules to modify the processing power of the backend system.

1.3. Comparison of Different Multiprocessor Clusters

Central processing units (CPUs), FPGAs, and DSPs have long been integral to radar signal processing [13,14]. FPGAs and DSPs are traditionally used for front-of-backend processing, such as beamforming, pulse compression, and Doppler filtering [15], whereas CPUs, usually accompanied by GPUs, are used for sophisticated tracking algorithms and system monitoring. As a general purpose processor, a CPU is designed to follow general purpose instructions among different types of tasks and thus offers the advantages of programming flexibility and efficiency in flow control [6]. However, since CPUs are not optimized for a range of scientific calculations, GPUs can be used to support heavy processing loads. The combination of a CPU and GPU offers competitive levels of flow control and mathematical processing, which enable the radar backend system to perform sophisticated algorithms in real-time. The drawback of the combination, however, is its limited bandwidth for handling data flow in and out of the system [16]. CPUs and GPUs are designed for a server environment, in which Peripheral Component Interconnect Express (PCIe) can efficiently perform point-to-point on-board communication. However, PCIe is not suitable for high-throughput data communication among a large number of boards. If the throughput of processing is dominated by the size of the data involved, then the communication bottleneck degrades the computing performance of a CPU–GPU combination. Therefore, when signal processing algorithms have demanding communication bandwidth requirements, DSPs and FPGAs are better options, since both can provide significant bandwidth for in-chassis communication by using SRIO while simultaneously achieving high computing performance. An FPGA is more capable than a DSP of providing high throughput in real-time for a given device size and power. When a DSP cluster cannot achieve performance requirements, an FPGA cluster can be employed for critical-stage, real-time radar signal processing. However, such improved performance comes at the expense of limited flexibility in implementing complex algorithms [5]. In all, if an FPGA and a DSP both meet application requirements, then the DSP can be the preferred option given its reduced cost and less complicated programmability.

1.4. Paper Structure

This paper explores the application of high-performance embedded computing to digital PAR. Section 2 gives a detailed description of the backend system architectures. This is followed by a discussion of implementing fundamental PAR signal processing algorithms on these architectures and an analysis of the performance results in Section 3. Section 4 shows a complete example of the performance of a large-scale PAR. Section 5 describes the implementation of the PAR signal processing algorithms using an automatic parallelization solution, the Open Computing Language (OpenCL), and compares the performance of the barebones DSP design with that of OpenCL. Summaries are drawn in Section 6. In this paper, the work focuses on the front of the backend, for which we consider using DSP as the solution for general radar data cube processing.

2. Backend System Architecture

2.1. Overview

Typically, an HPEC platform for PAR accommodates a computing environment [6] consisting of multiple parallel processors. To facilitate system upgrades and maintenance, the multiprocessor computing and interconnection topology should be flexible and modular, meaning that each processing endpoint in the backend system needs to be identical, and its responsibility entirely assumed or shared with another endpoint without interfering with other system operations. Moreover, the connection topology among the processing and I/O modules should be flexible and capable of switching a large amount of data from other boards.

Figure 1 shows a top-level system description of a general large-scale array radar system. In receiving arrays, once data are collected from the array manifold, each transmit and receive module (TRM) downconverts the incoming I/Q streams in parallel. To support the throughput requirement, the receivers group I/Q data from each coherent pulse interval (CPI) and send the grouped data for beamforming, pulse compression, and Doppler filtering. Beamforming and pulse compression are paired into pipelines, and the pairs process the data in a round-robin fashion. At each stage, data-parallel partitioning is used to break the massive amount of computation into smaller, more manageable pieces.

Fundamental processing functions for PAR (e.g., beamforming, pulse compression, Doppler processing, and real-time calibration) require tera-levels of operations per second for large-scale PAR applications [4]. Since such processing is executed on a channel-by-channel basis, the processing flow can be parallelized naturally. A typical scheme for parallelism involves assigning computation operations to multiple parallel processing elements (PE). In that sense, from the perspective of radar applications, a data cube containing data from all range gates and pulses in a CPI is distributed across multiple PEs within at least one chassis. A good distribution strategy can ensure that systems not only achieve high computing efficiency but fulfill the requirements of modularity and flexibility, as well. In particular, modularity permits growth in computing power by adding PEs and ensures that an efficient approach to development and system integration can be adopted by replicating a single PE [6]. The granularity of each PE is defined according to the size of a processing assignment that forms part of an entire task. Although finer granularity allows designers to attune the processing assignment, it also poses the disadvantage of increased communication overhead within each PE [17]. To balance computation load and real-time communication in one PE, the ratio of the number of computation operations to communication bandwidth needs to be checked carefully. For example, as a PE in our basic system configuration, we use the 6678 Evaluation Module (Texas Instruments Inc., Dallas, TX, USA), which has eight C66xx DSP cores; an advanced configuration of that design uses a more powerful DSP module from Prodrive Technologies, Son, The Netherlands [18], which contains 24 DSP cores and four advanced reduced instruction set computing machine (ARM) cores on a single board. Texas Instruments claims that each C66xx core delivers 16 giga floating point operations per second (GFLOPS) at 1 GHz [19].

In our throughput measurement, the four-lane SRIO (Gen 2) link reaches up to 1600 MB/s in NWrite mode; since the single-precision floating point format (IEEE 754) [20] occupies four bytes in memory, the SRIO link delivers 400 million floating point values per second. The ratio of computation to bandwidth is therefore 40 [6], meaning that the core can perform up to 40 floating point operations for each piece of data that flows into the system without halting the SRIO link. As such, when the ratio reaches 40, the PE balances the computation load with real-time communication. In general, making each PE work efficiently requires optimizing algorithms, fully using computing resources, and ensuring that I/O capacity reaches its peak.
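Using the figures just quoted, the ratio follows from straightforward arithmetic (a consistency check, not an additional measurement):

\[
\frac{16\ \text{GFLOPS}}{1600\ \text{MB/s} \,/\, 4\ \text{bytes per value}}
  = \frac{16 \times 10^{9}\ \text{FLOP/s}}{400 \times 10^{6}\ \text{values/s}}
  = 40\ \text{operations per value.}
\]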

Figure 1. Top-level digital array system concept.

2.2. Scalable Backend System Architecture

As mentioned earlier, the features of a basic radar processing chain allow for independent and parallel processing task divisions. In pulse compression, for instance, the match filter operation in each channel along the range gate dimension can be performed independently; as such, a large-throughput radar processing task can be assigned to multiple processing units (PUs). Since each PU consists of identical PEs, the task undergoes further decomposition into smaller pieces for each PE, thereby allowing an adjustable level of granularity that facilitates precise radar function mapping. At the same time, a centralized control unit is used for monitoring and scheduling distributed computing resources, as well as for managing lower-level modules. PU implementations based on the MTCA open standard can balance tradeoffs among processing power, I/O functions, and system management. In our implementation, each PU contains at least one chassis, each of which includes at least one MCH that provides central control and acts as a data-switching entity for all PEs, where a PE could be an I/O module (e.g., an RF transceiver) or a processing card. The MCH of each MTCA chassis can be connected with a system manager that supports the monitoring and configuration of the system-level settings and status of each PE by way of an IP interface. Within a single MTCA chassis, PEs exchange data through the SRIO or PCIe fabric on the backplane, and the MCH is responsible for both switching and fabric management.

Figure 2 illustrates one way to use an MTCA chassis to implement the fundamental functions of radar signal processing. Depending on the nature of data parallelism within each function, the computing load is divided equally and a portion assigned to each PU. The computational capability is reconfigurable by adjusting the number of PUs, and for each processing function, a different PU can constitute at least one MTCA chassis with various types of PEs inserted into it, all according to specific needs. In the front, several PUs handle a tremendous amount of beamforming calculations, and by changing the number of PUs and PEs, the beamformer can be adjusted to accommodate different types and numbers of array channels. Since computing loads are smaller for pulse compression and Doppler filtering, assigning one PU for each function is sufficient in MPAR systems.

Figure 2. Illustration of the Micro Telecom Computing Architecture (MTCA) architecture in a backend system.

Figure 3 shows an overview of the proposed MPAR backend processing chain, which focuses only on a non-adaptive core processing chain. Adaptive beamforming, alignments, and calibrations are not included in that chain until further stable results are obtained from algorithm validations. Data from the array manifold and front-end electronics are organized into three-dimensional data cubes, and Nrg, Nch, Np, and Nb represent the total numbers of range gates, channels, pulses, and beams, respectively. When any of those four numbers is shown in red in Figure 3, the data are aligned in the corresponding dimension. Mr, Mb, Mp, and Md represent the numbers of PUs or PEs assigned to analog-to-digital conversion, beamforming, pulse compression, and Doppler filtering, respectively. Initially, the analog-to-digital converters (ADC) in one receiving PU collect data with the dimension (Nch/Mr) × Np × Nrg. In total, the Mr receiving PUs form a data cube with the dimension Nch × Np × Nrg, which is further divided into Mb portions and re-arranged, or corner turned, in the channel domain. As there are Mb beamforming PUs, each one handles a dataset with the dimension (Nch/Mb) × Np × Nrg. Since the output of beamforming is already aligned in the range gates, such an approach saves time on the data corner turn. In the pulse compression stage, each pulse compression PE takes a data cube with the dimension (Nb/Mp) × Np × Nrg. Prior to Doppler filtering, another corner turn reorganizes the data in the pulse domain. Ultimately, each PE in the Doppler filtering stage takes a data cube with the dimension (Np/Md) × Nb × Nrg.
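To make the corner-turn step concrete, the sketch below re-arranges a channel-major data cube into pulse-major order so that the next stage can read its processing dimension contiguously. The layout, argument names, and plain memcpy loop are illustrative assumptions; on the C66xx this re-arrangement would normally be carried out with EDMA transfers rather than a CPU copy loop.

#include <complex.h>
#include <stddef.h>
#include <string.h>

/* Corner turn: cube[channel][pulse][gate] -> turned[pulse][channel][gate].
 * Each range-gate vector is moved as one contiguous block. */
static void corner_turn(const float complex *cube,   /* Nch x Np x Nrg, channel-major */
                        float complex *turned,       /* Np x Nch x Nrg, pulse-major   */
                        size_t n_ch, size_t n_p, size_t n_rg)
{
    for (size_t c = 0; c < n_ch; ++c)
        for (size_t p = 0; p < n_p; ++p)
            memcpy(&turned[(p * n_ch + c) * n_rg],
                   &cube[(c * n_p + p) * n_rg],
                   n_rg * sizeof(float complex));
}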

Figure 3. Overview of the non-adaptive, "core data cube" processing chain in a general digital array radar.

2.3. Processing Unit Architecture

We currently operate an example of a receiving digital array at the University of Oklahoma (Figure 4) with two types of PUs—namely, a receiving (i.e., data acquisition) PU and a computing PU. In the receiving PU, six field-programmable RF transceiver modules [21] (e.g., VadaTech AMC518 + FMC214) sample the analog returns from the TRM and send the digitized data to a DSP module by way of an SRIO backplane. The DSP module combines and sends raw I/Q data to the computing PU through two Hyperlink ports. In the computing PU, the number of PEs is determined based on the required computational loads. Moreover, each computing PU can be connected with others by way of the Hyperlink port. With that proposed PU architecture, we test the performance of the computing PU by using a VadaTech VT813 as the MTCA chassis and the C6678 as the PE.

Figure 4. Simple example of a processing unit (PU)-based architecture.

2.4. Selecting a Backplane Data Transmission Protocol

With more powerful and efficient processors, HPEC platforms can acquire significant computing power and meet scalable system requirements. However, more often than not, HPEC performance is limited by the availability of a commensurate high-throughput interconnect network. At the same time, the communication overhead may be larger than the computing time, which makes the processors halt. Since that setback significantly impacts the efficiency of executing system functions, a proper implementation of the interconnection network among all processing nodes is critical to the performance of the parallel processing chain.

Currently, SRIO, Ethernet, and PCIe are common options for fundamental data link protocols. RapidIO is reliable, efficient, and highly scalable; compared with PCIe, which is optimized for a hierarchical bus structure, SRIO is designed to support both point-to-point and hierarchical models. It also demonstrates a better flow control mechanism than PCIe. In the physical layer, RapidIO offers a PCIe-style flow control retry mechanism based on tracking credits inserted into packet headers [22]. RapidIO also includes a virtual output queue backpressure mechanism, which allows switches and endpoints to learn whether data transfer destinations are congested [23]. Given those characteristics, SRIO allows an architecture to strike a working balance between high-performance processors and the interconnection network.

In light of those considerations, we use SRIO as our backplane transmission protocol [24], and our current testbeds are based on SRIO Gen 2 backplanes. Each PE has a four-lane port connected to an SRIO switch on the MCH. In our system, the SRIO ports on the C6678 DSP support four different lane rates: 1.25, 2.5, 3.125, and 5 Gb/s. Since the SRIO bandwidth overhead is 20% with 8-bit/10-bit encoding, the theoretical effective data bandwidths are 1, 2, 2.5, and 4 Gb/s per lane, respectively. In reality, SRIO performance can be affected by the transfer type, the length of the differential transmission lines, and the specific type of SRIO port connectors. To assess SRIO performance in our testbed, we conducted the following throughput experiments.
Figure 5 shows the performance of the SRIO link in our MTCA test environment using NWrite and NRead packets in 5 Gb/s, four-lane mode. Performance is calculated by dividing the payload size by the elapsed transaction time, measured from when the transmitter starts to program the SRIO registers until the receiver has received the entire dataset. First, the performance of the SRIO link improves with larger payload sizes. Second, the closer the destination memory is to the core, the better the performance achieved with the SRIO link. Optimally, SRIO 4× mode can reach a speed of 1640 MB/s, which is 82% of its theoretical link rate.
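For reference, the 82% figure follows from the link arithmetic already given: four lanes at 5 Gb/s with 8-bit/10-bit encoding yield

\[
4 \times 5\ \text{Gb/s} \times \frac{8}{10} = 16\ \text{Gb/s} = 2000\ \text{MB/s}, \qquad
\frac{1640\ \text{MB/s}}{2000\ \text{MB/s}} = 82\%.
\]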

Figure 5. Data throughput experiment results in our Micro Telecom Computing Architecture (MTCA)-based Serial RapidIO (SRIO) testbed.

2.5. System Calibration and Multichannel Synchronization

2.5.1. General Calibration Procedures

Calibrating a fully digital PAR system is a complex procedure involving four general stages (Figure 6). During the first stage, the transmit–receive chips in each array channel need to calibrate themselves in terms of direct current and frequency offsets, on-chip phase alignment, and local oscillator calibration. During the second stage, subarrays containing fewer channels and radiating elements are aligned precisely in the chamber environment by way of near-field measurements, plane wave spectrum analysis, and far-field active element pattern characterizations. During this stage, the focus falls upon the antenna elements, not the digital backend, and initial array weights for forming focused beams at the subarray level are estimated precisely. During the third stage, far-field full array alignment is performed in either chamber or outdoor range environments. For this stage, we use a simple unit-by-unit approach to ensure that each time a subarray is added, it maximizes the coherent construction of the wavefront at each required beam-pointing direction. Array-level weights obtained at the third stage are combined with chamber-derived initial weights from the second stage to numerically optimize array radiation patterns for all beam directions. When multiple beams are formed at once, the procedure repeats for all beamspace configurations. This stage requires a far-field probe in the loop of the alignment process and requires synchronization and alignments in the backend. Initial factory alignment is finished after this stage. During the final stage, the system is shipped for field deployment, in which a series of environment-based corrections are applied (e.g., regarding temperature, electronics drifting, and platform vibration, which is necessary for ship- or airborne radar). Based on the internal sensor (i.e., calibration network) monitoring data, algorithms in the backend perform channel equalization and pre–post distortion corrections, as well as correct system errors or deviations from the factory standard. The final step entails data quality control, which compares the obtained data product with analytical predictions to further correct biases at the data product level for the desired pointing.

Figure 6. General system calibration procedure for digital array radar and the focus of this work (in red).

2.5.2. Backend Synchronization

Our study focuses only on backend synchronization during the third stage, a step necessary before parallel, multicore processing can be activated. Additionally, a synchronized backend enables the reference clock signals in the front-end PU (and the AD9361 chips in the front-end PU) to be aligned through an FPGA Mezzanine Card (FMC) interface. For the testbed architecture in Section 2.3, the front-end PU, referred to simply as the "front-end" in this section, of the digital PAR system includes a number of array RF channels. In each channel, there is an integrated RF digital transceiver with an independent clock source in its digital section.

Synchronization in this front-end system can be categorized as either in-chassis or multichassis synchronization. In-chassis synchronization ensures that each front-end AMC in a chassis works synchronously with the other AMCs in the same chassis. Figure 7 shows the architecture of a dual-channel front-end AMC module, which is based on an existing product from VadaTech Inc., Henderson, NV, USA. The Ref Clock and Sync Pulse in Figure 7 are radially fanned out by the MCH to each slot in the chassis, and each front-end AMC uses the Sync Pulse and Ref Clock to accomplish in-chassis synchronization. As an example, Figure 8 shows the timing sequence of synchronizing two front-end AMCs. Since commands from the remote PC server or other MTCA chassis may arrive at AMC 1 and AMC 2 at different times, transmitting or receiving synchronization requires sharing the Sync Pulse between the AMCs. When the AMCs acknowledge the command and detect the Sync Pulse, the FPGA triggers the AD9361 chip on both boards at the falling edge of the next Ref Clock cycle. By using that mechanism, multichannel signal acquisition and generation can be synchronized within a chassis. The accuracy of in-chassis synchronization depends on how well the trace length is matched from the MCH to each AMC. If the trace length is fully matched, then synchronization will be tight.

Figure 7. "PU Frontend" Advanced Mezzanine Card (AMC) module architecture (based on an existing product used in the testbed).

Figure 8. Frontend in-chassis synchronization timing sequence.

For multichassis synchronization, the chief problem is so-called clock skew [25], which requires a clock synchronization mechanism to overcome. The most common clock synchronization solution is the Network Time Protocol (NTP), which synchronizes each client based on messaging with the User Datagram Protocol [26]. However, NTP accuracy ranges from 5 to 100 ms, which is not precise enough for PAR applications [27]. To get more accurate synchronization in a local area network, the IEEE 1588 Precision Time Protocol (PTP) standard [28] can provide sub-microsecond synchronization [29]. To implement PTP, however, the front-end chassis needs to be capable of packing and unpacking Ethernet packets, and additional dedicated hardware and software are required, which increases both the complexity and cost of the front-end subsystem. A better method of implementing multichassis synchronization takes advantage of the GPS pulse per second (PPS): by connecting each chassis to a GPS receiver, the MCHs can use the PPS as a reference signal to generate the Ref Clock and Sync Pulse for in-chassis synchronization. Since the PPS signal among different MCHs is synchronized, the Ref Clock and Sync Pulse in each chassis are phase matched at any given time. In the case that all of the MCHs use the PPS from the same satellites, the synchronization accuracy is between 5 and 20 ns [30]. However, when the GPS signal is inaccessible or lost, the front-end subsystem should be able to stay synchronized by sharing the Sync Pulse from a common source, which could be an external chassis clock generator or a signal from one of the chassis. In both methods, the trace length to each MCH from the common Sync Pulse source can vary, thereby making the propagation time delay of the Sync Pulse differ from chassis to chassis. To address this issue, we need to know the delay time difference of each chassis compared with the reference (i.e., master) chassis. With that knowledge, each chassis can use the time difference as an offset to adjust its trigger time.

To implement that approach, we designed a clock counter to measure the elapsed clock cycles between the Sync Pulse and the returned Sync Beacon, the latter of which is transmitted only from antennas connected to the reference chassis. Since the beacon arrives at all antennas simultaneously, each front-end subsystem stops its counter at the same time. The time differences in delay can be obtained by subtracting the counter number of each slave chassis from that of the reference chassis.

Figure 9 illustrates a model timing sequence after each chassis receives the Sync Pulse. At time T0, the reference chassis begins to transmit the Sync Beacon and starts its counter. After two and a half clock cycles of propagation delay, the slave chassis launches its counter as well. At time T3, the Sync Beacon is received by both chassis; however, since each chassis detects the signal only at its rising clock edge, the reference chassis detects the signal at time T5 with counter number 16. By contrast, in the slave chassis, the counter stops at 13. In turn, when the Sync Pulse is received the next time, the reference chassis delays by three clock cycles and triggers the AD9361 at time T6, whereas the slave chassis starts it at T7. In our example, T6 is not the same as T7. Such deviation arises because the clock phase angle between the two chassis is not identical. When this phase angle approaches 360°, it is possible for the Sync Beacon to arrive when the rising edge of one clock has just passed, while the rising edge of the next clock cycle is still approaching. In the worst-case scenario, a synchronization error of one clock cycle will occur, meaning that the accuracy of multichassis synchronization is determined by the period of the reference clock. One way to enhance its accuracy is to reduce the period of the reference clock; however, the sampling speed of the ADC confines the shortest period of the clock, because a front-end AMC cannot read new data from the ADC in every clock cycle when the AMC's reference clock frequency exceeds the ADC's sampling speed. In our example, since the maximum data rate of the AD9361 is 61.44 million samples per second, the interchassis synchronization accuracy without using the GPS signal is 16 ns.

Figure 9. Example timing sequence of multi-chassis synchronization.


2.6. Backend System Performance Metrics

To measure the benchmarks of digital PAR backend system performance, millions of instructions per second are often used as the metric. Meanwhile, to measure the floating point computational capability of a system, we use GFLOPS [31]. To evaluate the real-time benchmark performance, we simulate the complex floating point data cubes. For parallel computing systems, parallel speedup and parallel efficiency are two important metrics for evaluating the effectiveness of parallel algorithm implementations. Speedup is a metric of latency improvement for a parallel algorithm distributed over M PUs compared with a serial algorithm, defined as:

SM = TS/TP (1)

In Equation (1), TS and TP are the latency of the serial algorithm and the parallel algorithm, respectively. Ideally, we expect SM = M, or perfect speedup, although such is rarely achieved in practice. Instead, parallel efficiency is used to measure the performance of a parallel algorithm, defined as: EM = SM/M (2)

$E_M$ is usually less than 100%, since the parallel components need to spend time on data communication and synchronization [6], also known as overhead. In some cases, the overhead can be overlapped with computation time by using multiple buffering mechanisms. However, as the number of parallel computing nodes increases, the data size handled by each computing node decreases, meaning that the computing nodes need to switch between processing and communication more often, thereby inevitably incurring what is known as method call overhead. When the algorithm is distributed across more nodes, such overhead can preclude the benefit of using additional computing power. Parallel scheduling, thus, needs to minimize both communication and method call overhead.
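As a minimal illustration of Equations (1) and (2), the following C sketch (our own example, not code from the backend implementation) converts two measured latencies into the speedup and parallel efficiency discussed above; the timing values and the helper name report_scaling are hypothetical.

/* Minimal sketch of Equations (1) and (2): speedup and parallel efficiency
 * computed from measured latencies. Values below are hypothetical. */
#include <stdio.h>

static void report_scaling(double t_serial, double t_parallel, int num_pus)
{
    double speedup    = t_serial / t_parallel;      /* S_M = T_S / T_P   (1) */
    double efficiency = speedup / (double)num_pus;  /* E_M = S_M / M     (2) */
    printf("M = %d: speedup = %.2f, efficiency = %.1f%%\n",
           num_pus, speedup, 100.0 * efficiency);
}

int main(void)
{
    report_scaling(0.120, 0.005, 28);   /* e.g., 120 ms serial vs. 5 ms on 28 PUs */
    return 0;
}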

3. Real-Time Implementation of a Digital Array Radar Processing Chain

3.1. Beamforming Implementation

The procedure of beamforming is to convert the data from the channel (range gate) domain to beamspace, steer the radiating direction, and suppress the sidelobes by applying the beamformer weight, $W_i$, to the received signal, $Y_i$, as indicated in Equation (3). The parameters used in Equations (3)–(6) are shown in Table 1. In nonstationary conditions, adaptive beamforming is necessary to synthesize high gains in the beam-steering direction and reject/minimize energy from other directions. Here we assume that the interference environment, as well as the array radar system itself, is stable and does not change dramatically; hence, the beamforming weights do not need to be updated rapidly and are provided offline from the external server. Adaptive weight computation (e.g., adaptive calibration) is not included in this study. The benefit of off-line computing is the flexibility of modifying and developing in-use adaptive beamforming algorithms.

$\mathrm{Beam}^{\Theta} = \sum_{i=1}^{\Omega} W_i^{\Theta} Y_i, \quad \text{where } W_i^{\Theta} = \bigcup_{k=1}^{N} W_i^{(k-1)B+1}$ (3)

$\mathrm{Beam}^{\Theta} = \sum_{i=1}^{C} W_i^{\Theta} Y_i + \sum_{i=1}^{C} W_{C+i}^{\Theta} Y_{C+i} + \sum_{i=1}^{C} W_{2C+i}^{\Theta} Y_{2C+i} + \cdots + \sum_{i=1}^{C} W_{(M-1)C+i}^{\Theta} Y_{(M-1)C+i}$ (4)

$\mathrm{Beam}^{\Theta} = \sum_{j=0}^{M-1} \left[ \sum_{i=1}^{C} W_{jC+i}^{\Theta} Y_{jC+i} \right]$ (5)

$\sum_{i=1}^{C} W_{mC+i}^{\Theta} Y_{mC+i} = \bigcup_{k=1}^{N} \left( \sum_{i=1}^{C} W_i^{(k-1)B+1} Y_i \right)$ (6)

Typically, multiple beams pointing at different directions are formed independently in the beamforming process. A straightforward implementation is to provide a number of beamformers to form concurrent beams in parallel. Since each beamformer requires the signal from all of the antennas, the data routing between the antennas and the beamformers becomes complex when the number of channels is large. To reduce the routing complexity, as shown in Equations (4) and (5), the entire dataset is divided equally and a portion is assigned to each sub-beamformer (i.e., computing node), in which the term $\sum_{i=1}^{C} W_{jC+i}^{\Theta} Y_{jC+i}$ is calculated independently. A formed beam is generated by accumulating the results from the sub-beamformers. This method is named systolic beamforming [32].

Table 1. Equation parameters.

Parameter      Definition
C              Number of channels obtained by each PU
B              Number of beams processed by each PE
M              Number of PUs
N              Number of PEs in a PU
Ω = M × C      Total number of receiving channels
Θ = N × B      Total number of beams
W_i^Θ          The Θ weight vectors for the ith receiving channel
Y_i            The ith receiving channel
Beam^Θ         The Θ beams formed from a total of Ω channels

In our implementation, the received data from Ω channels are sent to a number of M PUs, in which, as shown in Equation (6), each PE calculates the term $\sum_{i=1}^{C} W_i^{(k-1)B+1} Y_i$ to form B partial beams in parallel. After all the PEs finish computing, one PU starts to pass the result to its lower neighbor, in which the received data are summed with its own and the results are sent downstream. In turn, after the last PU combines all of the results, the entire set of Θ beams based on Ω channels is formed. Based on the PU shown in Figure 4, a scalable beamforming system architecture is represented in Figure 10.

Figure 10. Scalable beamformer architecture.
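To make the partitioning in Equations (4)–(6) concrete, the following C sketch shows what one PU of Figure 10 computes: it forms partial beams from its own block of C channels and accumulates them onto the partial sums received from the upstream PU. The data layout, names, and use of float _Complex are our own illustrative choices, not the production DSP code.

#include <complex.h>
#include <stddef.h>

/* One PU of the systolic beamformer: partial[b][r] accumulates the sum over
 * this PU's C channels of w[ch][b] * y[ch][r] on top of the partial sums
 * handed over by the upstream PU (or zeros for the first PU). */
static void pu_partial_beams(const float _Complex *w,  /* C x B weights        */
                             const float _Complex *y,  /* C x Nrg samples      */
                             float _Complex *partial,  /* B x Nrg partial sums */
                             size_t C, size_t B, size_t Nrg)
{
    for (size_t b = 0; b < B; b++)
        for (size_t r = 0; r < Nrg; r++) {
            float _Complex acc = 0.0f;
            for (size_t ch = 0; ch < C; ch++)
                acc += w[ch * B + b] * y[ch * Nrg + r];
            partial[b * Nrg + r] += acc;   /* passed downstream after all PEs finish */
        }
}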

In Equation (3), there is one multiplication and one addition for each range gate. Since each complex multiplication and addition require six and two real floating-point operations, respectively, the computing complexity of the proposed real-time beamforming is $(6+2) \times N_c \times N_{rg} = 8 N_c N_{rg}$, where $N_c$ and $N_{rg}$ are the number of channels and range gates. For a given processing time interval T, the throughput of the beamformer is $8 N_c N_{rg}/T$ floating-point operations per second (FLOPS). Figure 11 shows the performance of beamforming using different numbers of PUs as an example, in which $N_c = 480$ and $N_{rg} = 1024$. In this figure, the speedup and efficiency are calculated according to Equations (1) and (2); the speedup grows with the number of PUs, but the efficiency is degraded due to the method call overhead. For this reason, we need to seek a balance between performance and effectiveness based on the system requirements. According to Figure 11, an optimal choice, for example, M = 28, allows the system to achieve a good speedup while maintaining a reasonable level of efficiency.


Figure 11. Speedup and efficiency of the beamforming implementation.

Capacity cache miss [33] is another issue affecting the performance of real-time beamforming. A cache miss occurs when the cache memory does not have sufficient room to store the data used by the DSP core. For example, in Figure 12, when the channel number equals 16, if there were no cache misses, the four cases should have the same number of GFLOPS. However, in the cases where the number of range gates is 128 and 256, the beamformer outperforms the cases where the range gates are 512 and 1024. This variation is caused by the capacity cache misses that happened in the last two cases, in which the DSP core needs to wait for the data to be cached. The markers in Figure 12 represent the maximum number of channels that the DSP cache memory can hold for a specific number of range gates. Before reaching each marker point, the performance improvement of each case comes from using larger sized vectors, which reduces the method call overhead. However, after reaching the marker points, the benefit of using large vector sizes is compromised by the cache misses.

Figure 12. Digital signal processor (DSP) core performance versus the number of range gates.

Fortunately, the capacity cache miss can be mitigated by splitting up datasets and processing one subset at a time, which is referred to as blocking or tiling [34]. In that sense, the data storage is handled carefully so that the weight vectors will not be evicted before the next subset reuses them.


As an example, one DSP core forms 15 beams from 24 channels, and each channel contains 1024 range gates, so $W_i$ and $Y_i$ in Equation (3) are matrices of dimensions 24 × 15 and 24 × 1024, which are 3 KB and 192 KB. As the size of the L1D cache is 32 KB, to allow the weight vectors and the input matrix to fit into the L1D cache, the data from 24 channels should be divided into 16 subsets. So, one large-size beamforming based on 1024 range gates is converted into 16 small-size beamforming operations based on 64 range gates each. The beamforming performance of this example is listed in Table 2, in which the performance of the DSP core remains the same regardless of the size of the input data. Note that the performance shown in Table 2 is based on one C66xx core in the C6678. Since we have already accounted for the I/O bandwidth limit when all eight cores work together, the performance of the C6678, or a similar multi-core DSP, can be deduced by multiplying the numbers in Table 2 by the number of DSP cores.

Table 2. DSP core performance measured in GFLOPS after mitigating cache misses.

Channel    Range Gates (Subsets)
           1024 (16)   512 (8)   256 (4)   128 (2)
4          0.67        0.67      0.67      0.67
8          1.34        1.34      1.34      1.33
12         1.97        1.96      1.96      1.95
16         2.42        2.42      2.41      2.40
20         2.81        2.81      2.80      2.78
24         3.15        3.15      3.14      3.12
28         3.45        3.44      3.43      3.40
32         3.71        3.70      3.67      3.66
36         3.92        3.91      3.87      3.82
40         4.10        4.09      4.04      3.96
44         4.25        4.24      4.19      4.08
48         4.39        4.38      4.34      4.23
52         4.19        4.48      4.46      4.35
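The tile size in the example above follows directly from the cache budget. The following C sketch (illustrative sizing only, not the TI library code) picks the largest range-gate subset such that the weight matrix plus one tile of channel data fit in the 32 KB L1D cache.

#include <stddef.h>

#define L1D_BYTES    (32u * 1024u)   /* C66xx L1D configured as cache          */
#define SAMPLE_BYTES 8u              /* one complex float sample (2 x 32 bits) */

/* Largest range-gate tile such that weights + one data tile stay in L1D. */
static size_t choose_tile(size_t channels, size_t beams, size_t range_gates)
{
    size_t weight_bytes = channels * beams * SAMPLE_BYTES;    /* 24 x 15 -> ~3 KB  */
    size_t budget       = L1D_BYTES - weight_bytes;           /* room for samples  */
    size_t tile         = budget / (channels * SAMPLE_BYTES); /* gates per subset  */
    return (tile > range_gates) ? range_gates : tile;
}
/* For 24 channels, 15 beams, and 1024 range gates this allows roughly
 * 150-gate tiles, comfortably above the 64-gate subsets (1024/16) in the text. */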

3.2. Pulse Compression Implementation

The essence of pulse compression is a matched filtering operation, in which the correlation of the return signal, s[n], and a replica of the transmitted waveform, x[n], is performed. The matched filter implementation converts the signal into the frequency domain, point-wise multiplies it with a waveform template, and then converts the result back to the time domain [6], as shown in Equations (7)–(10). Since the length of the fast Fourier transform (FFT) in Equations (7) and (8) needs to be the first power of 2 greater than N + L − 1, zero padding of x[n] and s[n] is necessary. As zero padding increases the length of the input vectors, the designer should properly select the values of N and L to avoid unnecessary computation.

$S[k] \xleftarrow{\mathrm{FFT}} s[n], \quad 0 \le n \le N$ (7)

$X[k] \xleftarrow{\mathrm{FFT}} x[n], \quad 0 \le n \le L$ (8)

$Y[k] = S[k]\,X[k], \quad 0 \le k \le (N + L - 1)$ (9)

$y[n] \xleftarrow{\mathrm{IFFT}} Y[k], \quad 0 \le n \le (N + L - 1)$ (10)

Based on Equations (7)–(10), the computing complexity of pulse compression depends on the FFT, the inverse FFT (IFFT), and the point-wise vector multiplication. In a radix-2 FFT, there are $\log_2 N$ butterfly computation stages, each of which consists of N/2 butterflies. Since each butterfly requires one complex multiplication, one complex addition, and one complex subtraction, the complexity of computing a radix-2 FFT is $C_{FFT} = (6 + 2 + 2) \times (N/2) \times \log_2 N = 5 N \log_2 N$ floating-point operations. Since the computing complexity of the IFFT is the same as that of the FFT, the throughput of pulse compression in the frequency domain is $(2 C_{FFT} + N C_{mult})/T = (10 N \log_2 N + 6N)/T$ FLOPS, in which N is the number of range gates after zero padding and $C_{mult}$ is the complexity of a point-wise complex multiplication. The computation throughput of pulse compression and the FFT measured on one C66xx core is shown in Figure 13, in which the dots represent the maximum number of range gates that the L1D cache can hold. It is evident that the calculation performance degrades dramatically when the data size is close to or over the cache size, and that the performance of pulse compression and the FFT correlate with each other. Similar to beamforming, the pulse compression performance in a multi-core or multi-module system can be calculated by multiplying the throughput shown in Figure 13 by the number of cores enabled.

Figure 13. Performance of pulse compression and fast Fourier transform (FFT) vs. different numbers of range gates.
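For reference, a straightforward C sketch of the frequency-domain matched filter of Equations (7)–(10) is given below. FFTW is used here purely as a stand-in for the DSP FFT library, and the replica spectrum is conjugated to implement the correlation (equivalent to the plain product of Equation (9) when the stored template is already the matched-filter spectrum); nfft is assumed to be the power of two not less than N + L − 1.

#include <complex.h>   /* include before fftw3.h so fftwf_complex is float _Complex */
#include <fftw3.h>

void pulse_compress(const float _Complex *s, int N,   /* received samples  */
                    const float _Complex *x, int L,   /* waveform replica  */
                    float _Complex *y, int nfft)      /* compressed output */
{
    fftwf_complex *S = fftwf_alloc_complex(nfft);
    fftwf_complex *X = fftwf_alloc_complex(nfft);

    for (int n = 0; n < nfft; n++) {                  /* zero padding      */
        S[n] = (n < N) ? s[n] : 0.0f;
        X[n] = (n < L) ? x[n] : 0.0f;
    }
    fftwf_plan ps = fftwf_plan_dft_1d(nfft, S, S, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_plan px = fftwf_plan_dft_1d(nfft, X, X, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(ps);                                /* Equation (7)      */
    fftwf_execute(px);                                /* Equation (8)      */

    for (int k = 0; k < nfft; k++)                    /* Equation (9)      */
        S[k] = S[k] * conjf(X[k]) / (float)nfft;      /* scale for the unnormalized IFFT */

    fftwf_plan pi = fftwf_plan_dft_1d(nfft, S, (fftwf_complex *)y,
                                      FFTW_BACKWARD, FFTW_ESTIMATE);
    fftwf_execute(pi);                                /* Equation (10)     */

    fftwf_destroy_plan(ps); fftwf_destroy_plan(px); fftwf_destroy_plan(pi);
    fftwf_free(S); fftwf_free(X);
}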

3.3. Doppler Processing and Corner Turn

The first objective of the Doppler processing is to extract moving targets from stationary clutter. The second objective is to measure the radial velocity of the targets by calculating the Doppler shift [35], obtained from the Fourier transform of the data cube along the CPI dimension. Hence, the throughput of the Doppler filter is $C_{FFT}/T = 5 N_p \log_2(N_p)/T$ FLOPS, where $N_p$ is the number of pulses in one CPI. In our environment, the throughput per core of the Doppler filter is shown in Table 3. Again, the hardware-verified performance in FLOPS increases linearly with the number of DSP cores. As the output of the pulse compression is arranged along the range gate dimension, the output needs to undergo a corner turn before being handled by the Doppler filtering processors [36]. This two-dimensional corner turn operation is equivalent to a matrix transpose in the memory space. Using EDMA3 [37] on the TI generic C66xx DSP, the data can be reorganized into the desired format without interfering with the real-time computations in the DSP core. Table 4 shows the performance of the data corner turn using EDMA3 under different conditions.

Table 3. Doppler filtering performance measured in GFLOPS per core.

Range Gates    Pulses
               8        16       32       64       128
1024           0.7293   1.6036   2.6852   3.8543   4.2866
2048           0.7294   1.6000   2.6841   3.8543   4.2867
4096           0.7294   1.5999   2.6842   3.8544   4.2867
8192           0.7295   1.6000   2.6842   3.8544   4.2732


Table 4. Time consumption measured in µs of corner turn for one beam.

Range Gates    Pulses
               8      16      32      64      128
1024           21     33      64      126     253
2048           40     66      130     254     502
4096           135    265     513     1011    2028
8192           528    1025    2033    4070    8107
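As a point of reference for Table 4, the corner turn itself is nothing more than the matrix transpose sketched below; in the real system this data movement is delegated to EDMA3 so that the DSP cores are not interrupted. The plain C loop is only meant to show the equivalent memory re-arrangement, not the EDMA3 transfer setup.

#include <complex.h>
#include <stddef.h>

/* Corner turn for one beam: pulse-major (Np x Nrg) input is transposed into
 * range-gate-major (Nrg x Np) output so the Doppler FFT reads contiguously
 * along the pulse (CPI) dimension. */
void corner_turn(const float _Complex *in, size_t Np, size_t Nrg,
                 float _Complex *out)
{
    for (size_t p = 0; p < Np; p++)
        for (size_t r = 0; r < Nrg; r++)
            out[r * Np + p] = in[p * Nrg + r];
}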

4. Performance Analysis of Complete Signal Processing Chain

In the previous sections, we measured the computing throughput of each basic processing stage in our backend architecture, in which the performance of both communication and computing is sensitive to the size of the data involved. Based on the previous discussions, and inspired by a future full-size MPAR-type system, we use the overall array backend system parameters in Table 5 as an example.

Table 5. Example digital array radar system parameters.

Parameters               Value     Depends on
Range gates              4096      Pulse Compression
Pulses                   128       Doppler Filtering
Channels per chassis     48        Beamforming
Beams per PE             22        Number of channels per chassis
PRI                      1 ms      Pulse compression computing time
CPI                      128 ms    PRI × number of pulses
No. of beamforming PU    16        Total number of antennas required by application
No. of PE in each PU     12        Total number of beams required by application
Total No. of Beams       264       PE × number of beams per PE
Total No. of Channels    768       PM × number of channels per chassis
Total No. of PU          18        Beamforming + Match Filter + Doppler Processing

In our study, the entire backend processing chain is treated as one pipeline. Similar to a traditional FIFO memory, the depth of the pipeline represents the number of clock cycles or steps between the first data flowing into the pipeline and the result data coming out. In Table 5, the critical parameter is the number of range gates. As shown in Figure 13, when the number of range gates is 4096, the pulse compression performance is well-balanced. Based on this range gate number, we can estimate the processing time of pulse compression. This latency confines the shortest PRI that the backend system allows for real-time processing. Based on the parameters in Table 5, the time scheduling of the radar processing chain is shown in Figure 14. This scheduling is a rigorous and realistic timeline that includes all of the impacts of SRIO communication and memory access, and it has been verified by real-time hardware running tests. The numbers of PUs and PEs are chosen as an example and can be changed based on different requirements. The overall latency, or depth of the pipeline, for the backend system is 1.5 CPI, or 187.7 ms. First, the parallel beamforming processors use 123 ms to generate 264 beams for the 128 pulses in one CPI. After all the beamforming results are combined at the last PU and sent to the pulse compression processors, the pulse compression takes another 123 ms. For the Doppler processing, in 16 ms, the first 96 beams from 128 pulses are realigned in the CPI domain and sent out to the Doppler filter. In total, there are 192, 12, and 12 DSPs involved in the beamforming, pulse compression, and Doppler filtering, respectively, and these processing functions achieve 6880 GFLOPS, 370 GFLOPS, and 140 GFLOPS of real-time performance, respectively. By using the MTCA chassis and choosing SRIO as the backplane interconnection technology, this system benefits over legacy form factors (i.e., Eurocard [11], which uses the VME bus architecture) due to its robust system management, inter-processor bandwidth with better flexibility and scalability, and built-in error detection, isolation, and protection [12].

Figure 14. Real-time system timeline forfor the example backend system.

5. Parallel Implementation Using Open Computing Language (OpenCL)/Open Multi-Processing (OpenMP)

The previous section summarizes the approach of "manual task division and parallelization." Another option is using standard and automatic parallelization solutions. For example, OpenCL is a standard for parallel computing on heterogeneous devices [38]. The standard requires a host to dispatch tasks, or kernels, to devices which perform the computation. To leverage OpenCL, the 66AK2H14 is loaded with an embedded Linux kernel that contains the OpenCL drivers. The ARM cores run the operating system and dispatch kernels to the DSP cluster. For systems with more than one DSP cluster, OpenCL can dispatch different kernels to each cluster. Kernels must be declared in the host program. Since OpenCL is designed for heterogeneous processors that do not necessarily share memory, OpenCL buffers are used to pass arrays to the device. When the kernel is dispatched, arrays must be copied from host memory to device memory. This communication adds significant overhead to the computation time, and the overhead increases linearly with the buffer size. The K2H platform does share memory between the host and the device, so the host can directly allocate memory in OpenCL memory. The DSP cluster registers only as a single OpenCL device, so once it has received the kernel, the computation must be distributed across the DSP cores. There are two options for distributing the workload. OpenCL can distribute computation among the DSP cores, but this requires all cores to execute the same program (similar to the way a GPU operates). This distribution limits flexibility and complicates algorithm development. The second option is OpenMP. OpenMP is a parallel programming standard for homogeneous processors with shared memory. The OpenMP runtime dynamically manages threads during execution. These threads are forked out from a single master thread when it encounters a parallel region in the program. These regions are denoted with OpenMP pragmas (#pragma omp parallel) in C and C++.
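The host-side OpenCL flow described above can be sketched with the standard OpenCL C API as follows. The kernel name beamform_sets, its argument list, and the use of a copy-in buffer are illustrative assumptions (on the K2H, a shared-memory allocation could avoid the copy); the program object is assumed to have been built from the kernel source elsewhere, and error checking is omitted for brevity.

#include <CL/cl.h>

void dispatch_beamform(cl_program program, const void *host_data,
                       size_t bytes, cl_uint num_sets)
{
    cl_platform_id platform;  cl_device_id dsp;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &dsp, NULL); /* DSP cluster is one device */

    cl_context       ctx = clCreateContext(NULL, 1, &dsp, NULL, NULL, NULL);
    cl_command_queue q   = clCreateCommandQueue(ctx, dsp, 0, NULL);

    /* Copy-in buffer holding the channel data for all datasets. */
    cl_mem in = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               bytes, (void *)host_data, NULL);

    cl_kernel k = clCreateKernel(program, "beamform_sets", NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem),  &in);
    clSetKernelArg(k, 1, sizeof(cl_uint), &num_sets);

    size_t global = 1;   /* single work-item; work is spread over cores with OpenMP inside */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(in);  clReleaseKernel(k);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
}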
Ideally, a coder would write code that can easily be computed in a serial manner (i.e., on a single thread), and then add in the OpenMP pragma to parallelize the task. This mechanism allows us to incorporate parallel regions into serial code very easily, and also allows easy removal of OpenMP regions. A prime use case for OpenMP is for loops. If the result of each iteration is independent of all the others, the OpenMP pragma #pragma omp parallel for can be used to dynamically distribute each iteration to its own thread. Without OpenMP, a single thread will execute each iteration one-by-one. With OpenMP, iterations are distributed among the threads at runtime and are executed in parallel. Each thread handles one iteration at a time until the OpenMP runtime detects the end of the loop. This ease of use comes with a performance penalty. OpenMP spends processor time on managing threads, so the coder must take steps to minimize OpenMP function calls. This overhead can be overcome by allocating larger workloads to each thread; for example, if doing vector multiplication, the coder should allocate blocks of the vectors to each thread instead of assigning a single multiplication to each thread, as shown in the sketch below. Another penalty to take into consideration is memory accesses. With multiple threads trying to access (in most cases) non-contiguous memory locations, the overhead can increase drastically. This is part of the reason why most multi-core systems' performance does not scale linearly with the number of processors. This effect can clearly be seen in Section 5.3 below.
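A minimal sketch of that advice is given here: the parallel loop hands each thread a whole block of the vectors rather than a single element, and the thread count is capped at the number of DSP cores. The block size and the value passed to omp_set_num_threads() are illustrative choices, not values from the actual implementation.

#include <omp.h>

void vec_mult_blocked(const float *a, const float *b, float *c,
                      int n, int block)
{
    omp_set_num_threads(8);                  /* e.g., one thread per C66xx core */
    #pragma omp parallel for
    for (int i0 = 0; i0 < n; i0 += block) {  /* one block of work per iteration */
        int end = (i0 + block < n) ? i0 + block : n;
        for (int i = i0; i < end; i++)       /* each thread owns a whole block  */
            c[i] = a[i] * b[i];
    }
}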

5.1. Beamforming Implementation with OpenCL/OpenMP

For beamforming, we design the kernel to process an arbitrary number of datasets; each dataset contains the sampled return data from 24 channels. The processing of each set is allocated to its own OpenMP thread. Figure 15 shows that as the number of datasets sent to the kernel increases, the time it takes to form a beam from each set decreases. Due to OpenMP overhead, the performance does not increase linearly with the number of datasets. When processing a small number of datasets, the overhead contributes a greater percentage to the overall execution time than when a large number of pulses are processed. Although OpenMP overhead is not present in single-threaded execution, there is still overhead from memory accesses, instruction fetching, etc. Thus, the single-threaded execution follows a similar logarithmic increase in performance as the number of datasets increases. The benefit of multithreaded execution is that the maximum performance (when the number of datasets is large) is higher than with a single thread.

Figure 15. Beamforming kernel performance using Open Computing Language (OpenCL) (8192 range gates).

Comparing the performance of the OpenCL/OpenMP implementation to the manually optimized scheme, the overhead of the standard scheme can be seen more clearly in Figure 16. In the manually optimized method, the memory access from external memory to the DSP can be done without interfering with computing by using EDMA3, and the size of the data can be fine-tuned to the cache memory; those advantages make the manually optimized method outperform the OpenCL implementation.

Figure 16. Comparing OpenCL performance to manually optimized code (8192 range gates) for beamforming.

5.2. Pulse Compression Implementation with OpenCL/OpenMP

In pulse compression, once again the kernel can receive an arbitrary number of pulses. Each beam is processed in its own OpenMP thread. However, in this case, multi-threaded execution is not favorable, as shown in Figure 17. This difference is due to the highly non-linear memory accesses required by the FFT and IFFT. This effect is more pronounced for large-sized FFTs. When multiple FFTs and IFFTs are running in parallel, the non-linear accesses are compounded, which results in severely degraded performance.

Figure 17. Pulse compression performance using OpenCL (8192 range gates).

The comparison in Figure 18 shows the performance of the OpenCL/OpenMP implementation compared with the manually optimized code. It should be noted that the L1D cache loading optimizations were not used in the kernel, as the L1D cache cannot be used as a buffer in OpenCL kernels. As discussed previously in Section 5.1, the advantage of the manually optimized method comes from eliminating the memory access time from outside memory and fine-tuning the algorithm.

Figure 18. Comparing OpenCL performance to manually optimized code for pulse compression.

5.3. Doppler Processing Implementation with OpenCL/OpenMP

The Doppler processing kernel is set up differently from the previous two steps due to the large number of small FFTs to be done. In this kernel, we configure the number of threads manually. Each thread is allocated a fixed portion of the data to process. The number of threads must be a power of 2 so that an even amount of data is sent to each one. It is possible to set any number of threads that is a power of 2, but if that number exceeds the number of physical cores, OpenMP and loop overhead begin to degrade overall performance. Figure 19 shows the performance of the kernel for different numbers of threads with varying numbers of range gates, and Figure 20 compares the performance of the manual optimization scheme with the OpenCL/OpenMP scheme.

Figure 19. Doppler processing performance using OpenCL.

OpenMP allows the programmer to change the maximum number of threads that are used at runtime. Figure 19 shows the performance of the kernel for different numbers of threads with varying numbers of range gates. In single-threaded execution, there is no OpenMP overhead; for any number of threads above 1, the overhead is present. Note that there is not a very large difference in performance between one, two, and four threads. This is because OpenMP does not distribute a thread to more than one core. To fully utilize the DSP, OpenMP must be allowed to create as many threads as there are cores in the system. Figure 20 compares the performance of manual optimization with OpenCL/OpenMP. The large difference in performance shows that OpenMP, while convenient and simple to use, tends to be inefficient. Distributing execution at runtime using software, as opposed to built-in hardware, consumes processor cycles that could otherwise be used for computation.

Figure 20. Comparing OpenCL performance to manually optimized code for Doppler processing.

6. Summary

In this study, we present a development model of an efficient and scalable backend system for digital PAR based on Field-Programmable-RF channels, DSP cores, and the SRIO backplane. The architecture of the model allows synchronized and data-parallel real-time surveillance radar signal processing. Moreover, the system is modularized for scalability and flexibility. Each PE in the system has a proper granularity to maintain a good balance between computation load and communication overheads.

Even for the basic radar processing operations studied in this work, tera-scale floating-point operations are required in an MPAR-type backend system. For such a requirement, using software-programmable DSPs that can be attuned to the processing assignment in parallel would be a good solution. The computational aspects of a 7400 GFLOPS throughput phased array backend system have been presented to illustrate the analysis of the basic radar processing tasks and the method of mapping those tasks to an MTCA chassis and DSP hardware. In our implementation of a PAR backend system, the form factor can be changed based on the requirements of various systems. By changing the number of PUs, the total capacity of the system can be easily scaled. By changing the number of inputs for each PE, we can adjust the throughput performance of a PU. A carefully customized design of the different processing stages in the DSP core also helps to achieve the optimal performance regarding latency and efficiency. When we parallelize a candidate algorithm, there are two steps in the design process. First, the algorithm is decomposed into several small components. Next, each algorithm component is assigned to different processors for parallel execution. In parallel computing, the communication overhead among parallel computing nodes is a key impact on the parallel efficiency of the system. Within each parallel processor, dividing the entire data cube into small subsets to avoid cache misses is also necessary when the size of the input data is larger than the cache size of the processors. For the data communication links, SRIO, HyperLink, and EDMA3 handle the data traffic between and/or within each DSP. By using SRIO, the data traffic among DSPs can be switched through the SRIO fabric controlled by an MCH of the MTCA chassis, which is more flexible than PCIe and more efficient than Gigabit Ethernet. A novel advantage of our proposed method is utilizing EDMA3 and the ping-pong buffer mechanism, which helps the system to overlap the communication time with the computing time and reduce the processing latency. This embedded computing platform can not only be used for the phased array radar backend system, but is also suitable for other aerospace applications that require high I/O bandwidth and computing power (i.e., situational awareness systems [39], target identification [40], etc.).

OpenCL is a framework to control the parallelism at a high level, in which the master kernel assigns the tasks to each slave kernel. Compared with the barebones method of parallelizing algorithms on the DSP, OpenCL is platform-independent and enables heterogeneous multicore software development, which leads to the drawback of being less customized and efficient for specific hardware. Future work on this research involves fulfilling the first two synchronization stages mentioned in Section 2.5.1, including the adaptive beamforming weight calculation in the real-time processing, and the buildup of an overall real-time backend system for the cylindrical polarimetric phased array radar [41] currently operated by the University of Oklahoma.

Acknowledgments: This research is supported by NOAA-NSSL through grant #NA11OAR4320072. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Oceanic and Atmospheric Administration.

Author Contributions: Xining Yu and Yan Zhang conceived and designed the experiments; Xining Yu and Ankit Patel performed the experiments. Allen Zahrai and Mark Weber provided important support on system applications.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Yu, X.; Zhang, Y.; Patel, A.; Zahrai, A.; Weber, M. An Implementation of Real-Time Phased Array Radar Fundamental Functions on DSP-Focused, High Performance Embedded Computing Platform. In Proc. SPIE 9829, Proceedings of Radar Sensor Technology XX, Baltimore, MD, USA, 17 April 2016; pp. 982913–982918.
2. Hendrix, R. Aerospace System Improvements Enabled by Modern Phased Array Radar. In Proceedings of the 2008 IEEE Radar Conference, Rome, Italy, 26–30 May 2008; pp. 1–6.
3. Tuzlukov, V. Signal Processing in Radar Systems; CRC Press: Boca Raton, FL, USA, 2013.
4. Weber, M.E.; Cho, J.; Flavin, J.; Herd, J.; Vai, M. Multi-Function Phased Array Radar for U.S. Civil-Sector Surveillance Needs. In Proceedings of the 32nd Conference on Radar, Albuquerque, NM, USA, 22–29 October 2005.
5. Martinez, D.; Moeller, T.; Teitelbaum, K. Application of reconfigurable computing to a high performance front-end radar signal processor. J. VLSI Signal Process. Syst. Signal Image Video Technol. 2001, 28, 63–83. [CrossRef]
6. Martinez, D.R.; Bond, R.A.; Vai, M.M. High Performance Embedded Computing Handbook: A Systems Perspective; CRC Press: Boca Raton, FL, USA, 2008.
7. Stone, L.D.; Streit, R.L.; Corwin, T.L.; Bell, K.L. Bayesian Multiple Target Tracking, 2nd ed.; Artech House: Norwood, MA, USA, 2013.
8. Atoche, A.C.; Castillo, J.V. Dual super-systolic core for real-time reconstructive algorithms of high-resolution radar/sar imaging systems. Sensors 2012, 12, 2539–2560. [CrossRef] [PubMed]
9. PICMG. AdvancedTCA® Base Specification; PICMG: Wakefield, MA, USA, 2008.
10. PICMG. Micro Telecommunications Computing Architecture Short Form Specification; PICMG: Wakefield, MA, USA, 2006.
11. Entwistle, P.; Lawrie, D.; Thompson, H.; Jones, D.I. A Eurocard computer using transputers for control systems applications. In Proceedings of the IEE Colloquium on Eurocard Computers—A Solution to Low Cost Control? London, UK, 29 September 1989; pp. 611–613.
12. Vollrath, D.; Körte, H.; Manus, T. Now is the time to change your VME and CPCI computing platform to MTCA! Boards & Solutions/ECE Magazine, 2 April 2015.
13. Smetana, D. Beamforming: FPGAs rise to the challenge. Military Embedded Systems, 2 February 2015. Available online: http://mil-embedded.com/articles/beamforming-fpgas-rise-the-challenge/ (accessed on 1 July 2016).
14. Altera. Radar Processing: FPGAs or GPUs? Altera: San Jose, CA, USA, May 2013. Available online: https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01197-radar-fpga-or-gpu.pdf (accessed on 1 July 2016).
15. Keller, J. Radar processing shifts to FPGAs and Altivec. Military Aerospace Electronics, 1 June 2003. Available online: http://www.militaryaerospace.com/articles/print/volume-14/issue-6/features/special-report/radar-processing-shifts-to-fpgas-and-altivec.html (accessed on 1 July 2016).
16. Fuller, S. The opportunity for sub microsecond interconnects for processor connectivity. RapidIO: Austin, TX, USA. Available online: http://www.rapidio.org/technology-comparisons/ (accessed on 1 July 2016).

17. Grama, A. Introduction to Parallel Computing, 2nd ed.; Addison-Wesley: Harlow, UK; New York, NY, USA, 2003.
18. PRODRIVE. AMC-TK2: ARM and DSP AMC. PRODRIVE Technologies: Son, The Netherlands. Available online: https://prodrive-technologies.com/products/arm-dsp-amc/ (accessed on 1 July 2016).
19. Texas Instruments. Multicore DSP+ARM Keystone II System-on-Chip (SoC); Texas Instruments Inc.: Dallas, TX, USA, 2013.
20. IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008; IEEE: Piscataway, NJ, USA, 29 August 2008; pp. 1–70.
21. AD9361 Data Sheet (Rev. E); Analog Devices, Inc.: Norwood, MA, USA, 2014.
22. Fuller, S. RapidIO: The Embedded System Interconnect; John Wiley & Sons, Ltd.: Chichester, UK; Hoboken, NJ, USA, 2005.
23. Barry Wood, T. Backplane tutorial: RapidIO, PCIe and Ethernet. EE Times, 14 January 2009.
24. Bueno, D.; Conger, C.; George, A.D.; Troxel, I.; Leko, A. RapidIO for radar processing in advanced space systems. ACM Trans. Embed. Comput. Syst. 2007, 7, 1. [CrossRef]
25. Fishburn, J.P. Clock skew optimization. IEEE Trans. Comput. 1990, 39, 945–951. [CrossRef]
26. What Is NTP? Available online: http://www.ntp.org/ntpfaq/NTP-s-def.htm (accessed on 1 July 2016).
27. How Accurate Will My Clock Be? Available online: http://www.ntp.org/ntpfaq/NTP-s-algo.htm (accessed on 1 July 2016).
28. IEEE Standard for A Precision Clock Synchronization Protocol for Networked Measurement and Control Systems (1588-2008); IEEE: Piscataway, NJ, USA, 24 July 2008; pp. 1–269.
29. Anand, D.M.; Fletcher, J.G.; Li-Baboud, Y.; Moyne, J. A practical implementation of distributed system control over an asynchronous ethernet network using time stamped data. In Proceedings of the 2010 IEEE International Conference on Automation Science and Engineering, Toronto, ON, Canada, 21–24 August 2010; pp. 515–520.
30. Lewandowski, W.; Thomas, C. GPS time transfer. Proc. IEEE 1991, 79, 991–1000. [CrossRef]
31. Reddaway, S.F.; Bruno, P.; Rogina, P.; Pancoast, R. Ultra-high performance, low-power, data parallel radar implementations. IEEE Aerosp. Electron. Syst. Mag. 2006, 21, 3–7. [CrossRef]
32. Kung, H.T.; Leiserson, C.E. Systolic arrays (for VLSI). In Sparse Matrix Proceedings 1978; SIAM: Philadelphia, PA, USA, 1979; pp. 256–282.
33. CPU Cache. Available online: https://en.wikipedia.org/wiki/CPU_cache (accessed on 1 July 2016).
34. Kowarschik, M.; Weiß, C. An overview of cache optimization techniques and cache-aware numerical algorithms. In Algorithms for Memory Hierarchies: Advanced Lectures; Meyer, U., Sanders, P., Sibeyn, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 213–232.
35. Richards, M.A.; Scheer, J.A.; Holm, W.A. Principles of Modern Radar; Scitech Publishing: Edison, NJ, USA, 2010.
36. Klilou, A.; Belkouch, S.; Elleaume, P.; Le Gall, P.; Bourzeix, F.; Hassani, M.M.R. Real-time parallel implementation of pulse-doppler radar signal processing chain on a massively parallel machine based on multi-core DSP and serial rapidio interconnect. EURASIP J. Adv. Signal Process. 2014, 2014, 161. [CrossRef]
37. Texas Instruments. Enhanced Direct Memory Access 3 (EDMA3) for KeyStone Devices User's Guide (Rev. B); Texas Instruments Inc.: Dallas, TX, USA, 2015.
38. Kaeli, D.R.; Mistry, P.; Schaa, D.; Zhang, D.P. Heterogeneous Computing with OpenCL 2.0, 3rd ed.; Morgan Kaufmann: Waltham, MA, USA, 2015.
39. Collins, R.T.; Lipton, A.J.; Fujiyoshi, H.; Kanade, T. Algorithms for cooperative multisensor surveillance. Proc. IEEE 2001, 89, 1456–1477. [CrossRef]
40. Hsueh-Jyh, L.; Rong-Yuan, L. Utilization of multiple polarization data for aerospace target identification. IEEE Trans. Antennas Propag. 1995, 43, 1436–1440. [CrossRef]
41. Karimkashi, S.; Zhang, G. Design and manufacturing of a cylindrical polarimetric phased array radar (CPPAR) antenna for weather sensing applications. In Proceedings of the 2014 IEEE Antennas and Propagation Society International Symposium (APSURSI), Memphis, TN, USA, 6–11 July 2014; pp. 1151–1152.

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).