Article

An Implementation of Real-Time Phased Array Radar Fundamental Functions on a DSP-Focused, High-Performance, Embedded Computing Platform
Xining Yu 1,*, Yan Zhang 1, Ankit Patel 1, Allen Zahrai 2 and Mark Weber 2

1 School of Electrical and Computer Engineering, University of Oklahoma, 3190 Monitor Avenue, Norman, OK 73019, USA; [email protected] (Y.Z.); [email protected] (A.P.)
2 National Severe Storms Laboratory, National Oceanic and Atmospheric Administration, Norman, OK 73072, USA; [email protected] (A.Z.); [email protected] (M.W.)
* Correspondence: [email protected]; Tel.: +1-405-325-2871
Academic Editor: Konstantinos Kontis
Received: 22 July 2016; Accepted: 2 September 2016; Published: 9 September 2016

Abstract: This paper investigates the feasibility of a backend design for a real-time, multiple-channel digital phased array system, particularly for high-performance embedded computing platforms constructed of general-purpose digital signal processors. First, we obtained a lab-scale backend performance benchmark by simulating beamforming, pulse compression, and Doppler filtering on a Micro Telecom Computing Architecture (MTCA) chassis using the Serial RapidIO protocol for backplane communication. Next, a field-scale demonstrator of a multifunctional phased array radar is emulated using a similar configuration. Finally, the performance of a barebones design is compared to that of emerging tools that systematically take advantage of parallelism and multicore capabilities, including the Open Computing Language.
Keywords: phased array radar; embedded computing; serial RapidIO; MPAR
1. Introduction
1.1. Real-Time, Large-Scale, Phased Array Radar Systems

In [1], we introduced real-time phased array radar (PAR) processing based on the Micro Telecom Computing Architecture (MTCA) chassis. PAR, especially digital PAR, ranks among the most important sensors for aerospace surveillance [2]. A PAR system consists of three components: a phased array antenna manifold, a front-end electronics system, and a backend signal processing system. In current digital PAR systems, the backend system is pushed closer to the antennas, which makes the front-end system more digitalized than its analog predecessors [3]. Accordingly, current front-end systems are mixed-signal systems responsible for radio frequency (RF) transmission and reception, digital in-phase and quadrature (I/Q) sampling, and channel equalization that improves signal quality. Meanwhile, digital PAR backend systems control the overall system, prepare transmit waveforms, transform received data for use in a digital processor, and process data for further functions, including real-time calibration, beamforming, and target detection/tracking.

Many PAR systems demand high data throughput and significant computational complexity in both the front- and backends, especially digital PARs. For example, [4] proposed a 400-channel PAR with a 1 ms pulse repetition interval (PRI); assuming 8192 range gates, each 8 bytes long in memory, in each PRI, the throughput in the front-end can reach up to 5.24 GB/s. As the requirements for such data throughput are extraordinarily demanding, at present such front-end computing requires digital I/Q filtering to be mapped to a fixed set of gates, look-up tables, and Boolean operations on a field-programmable gate array (FPGA) or very-large-scale
integration (VLSI) circuit with full-custom design [5]. After front-end processing, data are sent to the backend system, in which more computationally intensive functions are performed. Compared with FPGAs or full-custom VLSI chips, programmable processing devices, such as digital signal processors (DSPs), offer a high degree of flexibility, which allows designers to implement backend algorithms in a general-purpose language (e.g., C) [6]. For aerospace surveillance applications, target detection and tracking are thus performed in the backend. Target tracking algorithms, including the Kalman filter and its variants, predict future target speeds and positions by using Bayesian estimation [7]; their computational requirements vary according to the format and content of the input data. Accordingly, detection and tracking functions require processors that are more capable of logic and data manipulation, as well as complex program flow control. Such features differ starkly from those required of baseline radar signal processors, in which the size of the data involved dominates the processing throughput [6]. As such, for tracking algorithms, a general-purpose processor or graphics processing unit (GPU)-based platform is more suitable than an FPGA or DSP. In sum, for PAR applications, the optimal solution is a hybrid implementation: dedicated hardware for front-end processing, programmable hardware for backend processing, and a high-performance server for high-level functions.
1.2. High-Performance Embedded Computing Platforms

A high-performance embedded computing (HPEC) platform contains microprocessors, network interconnection technologies, such as those of the PCI Industrial Computer Manufacturers Group and OpenVPX, and management software that allows more computing power to be packed into a system of reduced size, weight, and power consumption (SWaP) [8]. Choosing an HPEC platform as the backend of a digital PAR can meet the requirements of substantial computing power and high-bandwidth throughput with a smaller SWaP system, including in SWaP-constrained scenarios such as airborne radars. Using an HPEC platform is, therefore, the optimal backend solution. Open standards, such as the Advanced Telecommunications Computing Architecture (ATCA) and MTCA [9,10], can be used for open architectures in multifunction PAR (MPAR) systems. Such designs achieve compatibility with industrial standards and reduce both the cost and duration of development. MTCA and ATCA contain groups of specifications that aim to provide an open, multivendor architecture that fulfills the requirements of a high-throughput interconnection network, increases the feasibility of system upgrading and upscaling, and improves system reliability. In particular, MTCA specifies the standard use of an Advanced Mezzanine Card (AMC) to provide processing and input–output (I/O) functions on a high-performance switch fabric with a small form factor. An MTCA system contains one or more chassis into which multiple AMCs can be inserted. Each AMC communicates with the others via the backplane of a chassis. Among chassis, Ethernet, fiber, or Serial RapidIO (SRIO) cables can serve as the data transaction media, and the numbers of MTCA chassis and AMCs are adjustable, meeting the requirements of scalable and diversified functionality for specific applications. Thanks to this modularity and flexibility, PAR's demands can be satisfied with such configurations in a physically smaller, less expensive, and more efficient way than with general-purpose computing centers and server clusters. Compared with legacy backplane architectures such as VME (Versa Module Europe) [11], the advantage of using MTCA is the ability to leverage the various types of AMCs (i.e., processing, switching, storage, etc.) that are readily available in the marketplace. Moreover, MTCA provides more communication bandwidth and flexibility than VME [12]. For this paper, we have chosen different kinds of AMCs as I/O or processing modules in order to implement the high-performance embedded computing system for PAR based on MTCA. Each of those modules works in parallel and can be enabled and controlled by the MTCA carrier hub (MCH). According to system throughput requirements, we can apply a suitable number of processing modules to adjust the processing power of the backend system.
1.3. Comparison of Different Multiprocessor Clusters

Central processing units (CPUs), FPGAs, and DSPs have long been integral to radar signal processing [13,14]. FPGAs and DSPs are traditionally used for front-of-backend processing, such as beamforming, pulse compression, and Doppler filtering [15], whereas CPUs, usually accompanied by GPUs, handle sophisticated tracking algorithms and system monitoring. As a general-purpose processor, a CPU is designed to follow general-purpose instructions across different types of tasks and thus offers the advantages of programming flexibility and efficient flow control [6]. However, since CPUs do not accommodate the full range of scientific calculations, GPUs can be used to support heavy processing loads. The combination of a CPU and GPU offers competitive levels of flow control and mathematical processing, which enable a radar backend system to perform sophisticated algorithms in real time. The drawback of the combination, however, is its limited bandwidth for handling data flow in and out of the system [16]. CPUs and GPUs are designed for a server environment, in which Peripheral Component Interconnect Express (PCIe) can efficiently perform point-to-point on-board communication. However, PCIe is not suitable for high-throughput data communication among a large number of boards. If the throughput of processing is dominated by the size of the data involved, then the communication bottleneck downgrades the computing performance of a CPU–GPU combination. Therefore, when signal processing algorithms have demanding communication bandwidth requirements, DSPs and FPGAs are better options, since both can provide significant bandwidth for in-chassis communication by using SRIO while simultaneously achieving high computing performance. An FPGA is more capable than a DSP of providing high throughput in real time for a given device size and power. When a DSP cluster cannot achieve the performance requirements, an FPGA cluster can be employed for critical-stage, real-time radar signal processing. However, such improved performance comes at the expense of limited flexibility in implementing complex algorithms [5]. In all, if an FPGA and a DSP both meet application requirements, then the DSP is the preferred option given its reduced cost and less complicated programmability.
1.4. Paper Structure

This paper explores the application of high-performance embedded computing to digital PAR, with a detailed description of backend system architectures in Section 2. This is followed by a discussion of implementing fundamental PAR signal processing algorithms on these architectures and an analysis of the performance results in Section 3. Section 4 presents a complete example of the performance of a large-scale PAR. Section 5 then describes the implementation of PAR signal processing algorithms using an automatic parallelization solution, the Open Computing Language (OpenCL), and compares the performance of the barebones DSP design with that of OpenCL. Conclusions are drawn in Section 6. Throughout, the work focuses on the front of the backend, for which we consider using DSPs as the solution for general radar data cube processing.
2. Backend System Architecture
2.1. Overview

Typically, an HPEC platform for PAR accommodates a computing environment [6] consisting of multiple parallel processors. To facilitate system upgrades and maintenance, the multiprocessor computing and interconnection topology should be flexible and modular, meaning that each processing endpoint in the backend system needs to be identical, and its responsibility entirely assumed or shared with another endpoint without interfering with other system operations. Moreover, the connection topology among the processing and I/O modules should be flexible and capable of switching a large amount of data from other boards. Figure 1 shows a top-level system description of a general large-scale array radar system. In receiving arrays, once data are collected from the array manifold, each transmit and receive module (TRM) downconverts the incoming I/Q streams in parallel. To support the throughput requirement, the receivers group I/Q data from each coherent pulse interval (CPI) and send the grouped data for beamforming, pulse compression, and Doppler filtering. Beamforming and pulse compression are paired into pipelines, and the pairs process the data in a round-robin fashion. At each stage, data-parallel partitioning is used to divide the massive amount of computation into smaller, more manageable pieces.

Fundamental processing functions for PAR (e.g., beamforming, pulse compression, Doppler processing, and real-time calibration) require tera-levels of operations per second for large-scale PAR applications [4]. Since such processing is executed on a channel-by-channel basis, the processing flow can be parallelized naturally. A typical scheme for parallelism involves assigning computation operations to multiple parallel processing elements (PEs). In that sense, from the perspective of radar applications, a data cube containing data from all range gates and pulses in a CPI is distributed across multiple PEs within at least one chassis. A good distribution strategy can ensure that systems not only achieve high computing efficiency but fulfill the requirements of modularity and flexibility as well. In particular, modularity permits growth in computing power by adding PEs and ensures that an efficient approach to development and system integration can be adopted by replicating a single PE [6]. The granularity of each PE is defined according to the size of the processing assignment that forms part of the entire task. Although finer granularity allows designers to attune the processing assignment, it also poses the disadvantage of increased communication overhead within each PE [17]. To balance computation load and real-time communication in one PE, the ratio of the number of computation operations to communication bandwidth needs to be checked carefully. For example, as a PE in our basic system configuration, we use the 6678 Evaluation Module (Texas Instruments Inc., Dallas, TX, USA), which has eight C66xx DSP cores; an advanced configuration of the design uses a more powerful DSP module from Prodrive Technologies, Son, The Netherlands [18], which contains 24 DSP cores and four advanced reduced instruction set computing machine (ARM) cores on a single board. Texas Instruments claims that each C66xx core delivers 16 giga floating point operations per second (GFLOPS) at 1 GHz [19].
In our throughput measurement, the four-lane SRIO (Gen 2) link reaches up to 1600 MB/s in NWrite mode; since the single-precision floating point format (IEEE 754) [20] occupies four bytes in memory, the SRIO link provides 400 million floating point data per second. The ratio of computation to bandwidth is 40 [6], meaning that the core can perform up to 40 floating point operations for each piece of data that flows into the system without halting the SRIO link. As such, when the ratio reaches 40, the PE balances the computation load with real-time communication. In general, making each PE work efficiently requires optimizing algorithms, fully using computing resources, and ensuring that I/O capacity reaches its peak.
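As a quick check, the stated ratio follows directly from the two figures above (16 GFLOPS per core; 1600 MB/s per four-lane link):

$$
\frac{16\times 10^{9}\ \text{ops/s}}{(1600\ \text{MB/s}) / (4\ \text{B/float})}
= \frac{16\times 10^{9}\ \text{ops/s}}{400\times 10^{6}\ \text{floats/s}}
= 40\ \text{operations per float}.
$$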
Figure 1. Top-level digital array system concept.
2.2. Scalable Backend System Architecture

As mentioned earlier, the features of a basic radar processing chain allow for independent and parallel processing task divisions. In pulse compression, for instance, the matched filter operation in each channel along the range gate dimension can be performed independently; as such, a large-throughput radar processing task can be assigned to multiple processing units (PUs). Since each PU consists of identical PEs, the task undergoes further decomposition into smaller pieces for each PE, thereby allowing an adjustable level of granularity that facilitates precise radar function mapping. At the same time, a centralized control unit is used for monitoring and scheduling distributed computing resources, as well as for managing lower-level modules. PU implementations based on the MTCA open standard can balance tradeoffs among processing power, I/O functions, and system management. In our implementation, each PU contains at least one chassis, each of which includes at least one MCH that provides central control and acts as a data-switching entity for all PEs; a PE could be an I/O module (e.g., an RF transceiver) or a processing card. The MCH of each MTCA chassis can be connected with a system manager that supports the monitoring and configuration of the system-level settings and the status of each PE by way of an IP interface. Within a single MTCA chassis, PEs exchange data through the SRIO or PCIe fabric on the backplane, and the MCH is responsible for both switching and fabric management.

Figure 2 illustrates one way to use an MTCA chassis to implement the fundamental functions of radar signal processing. Depending on the nature of data parallelism within each function, the computing load is divided equally and a portion assigned to each PU.
The computational capability is reconfigurable by adjusting the number of PUs, and for each processing function, a different PU can constitute at least one MTCA chassis with various types of PEs inserted into it, all according to specific needs. In the front, several PUs handle a tremendous amount of beamforming calculations, and by changing the number of PUs and PEs, the beamformer can be adjusted to accommodate different types and numbers of array channels. Since computing loads are smaller for pulse compression and Doppler filtering, assigning one PU to each function is sufficient in MPAR systems.
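To make the channel-domain partitioning concrete, the minimal C sketch below (our illustration, not the authors' code; all dimension values and names are assumed) shows the inner loop that one beamforming PE could run over its slice of channels; the partial beams from all PEs are then summed to complete each beam.

```c
#include <complex.h>

/* Assumed per-PE workload: NCH_PE = Nch/Mb channels of one CPI data cube. */
#define NCH_PE  25     /* channels owned by this PE (assumed) */
#define NB       4     /* beams to form (assumed)             */
#define NP      64     /* pulses per CPI (assumed)            */
#define NRG   1024     /* range gates (assumed)               */

/* Each PE computes a partial beam sum over its own channels only;
 * combining the Mb partial results yields the full beamformer output. */
void beamform_slice(const float complex x[NCH_PE][NP][NRG], /* I/Q input */
                    const float complex w[NB][NCH_PE],      /* weights   */
                    float complex y[NB][NP][NRG])           /* partials  */
{
    for (int b = 0; b < NB; b++)
        for (int p = 0; p < NP; p++)
            for (int r = 0; r < NRG; r++) {
                float complex acc = 0.0f;
                for (int c = 0; c < NCH_PE; c++)
                    acc += w[b][c] * x[c][p][r];  /* weighted channel sum */
                y[b][p][r] = acc;
            }
}
```

Because each PE touches only its own channel slice, adding PEs shrinks NCH_PE without changing the kernel, which is what lets the beamformer scale with the number of array channels.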
Figure 2. Illustration of the Micro Telecom Computing Architecture (MTCA) architecture in a backend system.

Figure 3 shows an overview of the proposed MPAR backend processing chain, which focuses only on a non-adaptive core processing chain. Adaptive beamforming, alignments, and calibrations are not included in that chain until further stable results are obtained from algorithm validations. Data from the array manifold and front-end electronics are organized into three-dimensional data cubes, and Nrg, Nch, Np, and Nb represent the total numbers of range gates, channels, pulses, and beams, respectively. When any of those four numbers is red in Figure 3, the data are aligned in the corresponding dimension. Mr, Mb, Mp, and Md represent the numbers of PUs or PEs devoted to analog-to-digital conversion, beamforming, pulse compression, and Doppler filtering, respectively. Initially, the analog-to-digital converters (ADCs) in one receiving PU collect data with dimension (Nch/Mr) × Np × Nrg. In total, the Mr receiving PUs form a data cube with dimension Nch × Np × Nrg, which is further divided into Mb portions and re-arranged, or corner turned, in the channel domain. As there are Mb beamforming PUs, each one handles a dataset with dimension (Nch/Mb) × Np × Nrg. Since the output of beamforming is already aligned in the range gates, such an approach can save time on the data corner turn. In the pulse compression stage, each pulse compression PE takes a data cube with dimension (Nb/Mp) × Np × Nrg. Prior to Doppler filtering, another corner turn reorganizes the data in the pulse domain. Ultimately, each PE in the Doppler filtering stage takes data with dimension (Np/Md) × Nb × Nrg.
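The corner turns above are, at bottom, transposes of the data cube that make the next stage's processing dimension contiguous in memory. Below is a minimal sketch of the pulse-domain corner turn, assuming a flat row-major layout (an assumption of ours; a production DSP version would typically block the loops and use DMA):

```c
/* Illustrative corner turn (not the paper's code): reorganize a cube
 * stored as [Nb][Np][Nrg] (range gates contiguous) into [Nb][Nrg][Np]
 * (pulses contiguous) so a Doppler filter can stream along pulses.   */
void corner_turn_pulse(const float *in, float *out,
                       int nb, int np, int nrg)
{
    for (int b = 0; b < nb; b++)
        for (int p = 0; p < np; p++)
            for (int r = 0; r < nrg; r++)
                /* out[b][r][p] = in[b][p][r] */
                out[((long)b * nrg + r) * np + p] =
                    in[((long)b * np + p) * nrg + r];
}
```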
Figure 3. Overview of the non-adaptive, "core data cube" processing chain in a general digital array radar.

2.3. Processing Unit Architecture

We currently operate an example of a receiving digital array at the University of Oklahoma (Figure 4) with two types of PUs—namely, a receiving (i.e., data acquisition) PU and a computing PU. In the receiving PU, six field-programmable RF transceiver modules [21] (e.g., VadaTech AMC518 + FMC214) sample the analog returns from the TRMs and send the digitalized data to a DSP module by way of an SRIO backplane. The DSP module combines and sends raw I/Q data to the computing PU through two Hyperlink ports. In the computing PU, the number of PEs is determined based on the required computational loads. Moreover, each computing PU can be connected with others by way of the Hyperlink port. With that proposed PU architecture, we test the performance of the computing PU by using a VadaTech VT813 as the MTCA chassis and the C6678 as the PE.
Figure 4. Simple example of a processing unit (PU)-based architecture.
2.4. Selecting a Backplane Data Transmission Protocol

With more powerful and efficient processors, HPEC platforms can acquire significant computing power and meet scalable system requirements. However, more often than not, HPEC performance is limited by the availability of a commensurate high-throughput interconnect network. At the same time, the communication overhead may be larger than the computing time, which makes the processors halt. Since that setback significantly impacts the efficiency of executing system functions, a proper implementation of the interconnection network among all processing nodes is critical to the performance of the parallel processing chain.

Currently, SRIO, Ethernet, and PCIe are common options for fundamental data link protocols. RapidIO is reliable, efficient, and highly scalable; compared with PCIe, which is optimized for a hierarchical bus structure, SRIO is designed to support both point-to-point and hierarchical models. It also demonstrates a better flow control mechanism than PCIe. In the physical layer, RapidIO offers a PCIe-style flow control retry mechanism based on tracking credits inserted into packet headers [22]. RapidIO also includes a virtual output queue backpressure mechanism, which allows switches and endpoints to learn whether data transfer destinations are congested [23]. Given those characteristics, SRIO allows an architecture to strike a working balance between high-performance processors and the interconnection network.

In light of those considerations, we use SRIO as our backplane transmission protocol [24], and our current testbeds are based on SRIO Gen 2 backplanes. Each PE has a four-lane port connected to an SRIO switch on the MCH. In our system, the SRIO ports on the C6678 DSP support four different lane rates: 1.25, 2.5, 3.125, and 5 Gb/s. Since the SRIO bandwidth overhead is 20% with 8-bit/10-bit encoding, the theoretical effective data bandwidths are 1, 2, 2.5, and 4 Gb/s, respectively. In reality, SRIO performance can be affected by the transfer type, the length of the differential transmission lines, and the specific type of SRIO port connectors. To assess SRIO performance in our testbed, we conducted the following throughput experiments.
Figure 5 shows the performance of the SRIO link in our MTCA test environment using NWrite and NRead packets in 5 Gb/s, four-lane mode. Performance is calculated by dividing the payload size by the elapsed transaction time from when the transmitter starts to program the SRIO registers to when the receiver has received the entire dataset. First, the performance of the SRIO link improves with larger payload sizes. Second, the closer the destination memory is to the core, the better the performance achieved with the SRIO link. Optimally, SRIO 4× mode can reach a speed of 1640 MB/s, which is 82% of its theoretical link rate.
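The 82% figure can be checked against the theoretical effective rate of a four-lane, 5 Gb/s port after 8-bit/10-bit encoding:

$$
4 \times 5\ \text{Gb/s} \times 0.8 = 16\ \text{Gb/s} = 2000\ \text{MB/s},
\qquad
\frac{1640\ \text{MB/s}}{2000\ \text{MB/s}} = 82\%.
$$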
Figure 5. Data throughput experiment results in our Micro Telecom Computing Architecture (MTCA)-based Serial Rapid IO (SRIO) testbed.
2.5. System Calibration and Multichannel Synchronization
2.5.1. General Calibration Procedures

Calibrating a fully digital PAR system is a complex procedure involving four general stages (Figure 6). During the first stage, transmit–receive chips in each array channel need to calibrate themselves in terms of direct current and frequency offsets, on-chip phase alignment, and local oscillator calibration. During the second stage, subarrays containing fewer channels and radiating elements are aligned precisely in the chamber environment by way of near-field measurements, plane wave spectrum analysis, and far-field active element pattern characterizations. During this stage, the focus falls upon the antenna elements, not the digital backend, and initial array weights for forming focused beams at the subarray level are estimated precisely. During the third stage, far-field full array alignment is performed in either chamber or outdoor range environments. For this stage, we use a simple unit-by-unit approach to ensure that each time a subarray is added, it maximizes the coherent construction of the wavefront at each required beam-pointing direction. Array-level weights obtained at the third stage are combined with chamber-derived initial weights from the second stage to numerically optimize array radiation patterns for all beam directions. When multiple beams are formed at once, the procedure repeats for all beamspace configurations. This stage requires a far-field probe in the loop of the alignment process and requires synchronization and alignments in the backend. Initial factory alignment is finished after this stage. During the final stage, the system is shipped for field deployment, in which a series of environment-based corrections is applied (e.g., regarding temperature, electronics drifting, and platform vibration, the last of which is necessary for ship- or airborne radar).
Based on the internal sensor (i.e., calibration network) monitoring data, algorithms in the backend perform channel equalization and pre/post-distortion correction, as well as correct system errors and deviations from the factory standard. The final step entails data quality control, which compares the obtained data product with analytical predictions to further correct biases at the data product level for desired pointing.
Figure 6. General system calibration procedure for digital array radar and the focus of this work (in red).

2.5.2. Backend Synchronization

Our study focuses only on backend synchronization during the third stage, a step necessary before parallel, multicore processing can be activated. Additionally, a synchronized backend enables the reference clock signals in the front-end PU (and the AD9361 chips in the front-end PU) to be aligned through an FPGA Mezzanine Card (FMC) interface. For the testbed architecture in Section 2.3, the front-end PU, referred to simply as the "front-end" in this section, includes a number of array RF channels. In each channel, there is an integrated RF digital transceiver with an independent clock source in its digital section.

Synchronization in this front-end system can be categorized as either in-chassis or multichassis synchronization. In-chassis synchronization ensures that each front-end AMC in a chassis works synchronously with the others in the same chassis. Figure 7 shows the architecture of a dual-channel front-end AMC module, which is based on an existing product from VadaTech Inc., Henderson, NV, USA. The Ref Clock and Sync Pulse in Figure 7 are radially fanned out by the MCH to each slot in the chassis, and each front-end AMC uses the Sync Pulse and Ref Clock to accomplish in-chassis synchronization. As an example, Figure 8 shows the timing sequence of synchronizing two front-end AMCs. Since commands from the remote PC server or other MTCA chassis may arrive at AMC 1 and AMC 2 at different times, transmitting or receiving synchronization requires sharing the Sync Pulse between the AMCs.
When the AMCs acknowledge the command and detect the Sync Pulse, the FPGA triggers the AD9361 chip on both boards at the falling edge of the next Ref Clock cycle. By using that mechanism, multichannel signal acquisition and generation can be synchronized within a chassis. The accuracy of in-chassis synchronization depends on how well the trace length is matched from an MCH to each AMC. If the trace length is fully matched, then synchronization will be tight.
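In C-style pseudologic, the trigger rule reduces to the sketch below; the helper functions are hypothetical stand-ins for FPGA-visible signals (stubbed so the example compiles), not vendor APIs.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for FPGA-visible signals (stubs for illustration). */
static bool cmd_acknowledged(void)            { return true; }
static bool sync_pulse_detected(void)         { return true; }
static void wait_ref_clock_falling_edge(void) { /* hardware wait */ }
static void trigger_ad9361(void)              { puts("AD9361 triggered"); }

/* In-chassis trigger rule: arm only after both the command acknowledgment
 * and the shared Sync Pulse are seen, then fire on the next falling edge
 * of the common Ref Clock, so every AMC triggers on the same clock edge
 * even if the commands arrived at different times.                       */
int main(void)
{
    while (!(cmd_acknowledged() && sync_pulse_detected()))
        ;                              /* wait until both conditions hold */
    wait_ref_clock_falling_edge();     /* align to the shared Ref Clock   */
    trigger_ad9361();                  /* synchronized acquisition start  */
    return 0;
}
```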
Figure 7. "PU Frontend" Advanced Mezzanine Card (AMC) module architecture (based on an existing product used in the testbed).
Figure 8. Frontend in-chassis synchronization timing sequence.

For multichassis synchronization, the chief problem is the so-called clock skew [25], which requires a clock synchronization mechanism to overcome. The most common clock synchronization solution is the Network Time Protocol (NTP), which synchronizes each client based on messaging with the User Datagram Protocol [26]. However, NTP accuracy ranges from 5 to 100 ms, which is not precise enough for PAR applications [27]. To achieve more accurate synchronization in a local area network, the IEEE 1588 Precision Time Protocol (PTP) standard [28] can provide sub-microsecond synchronization [29]. To implement PTP, however, the front-end chassis needs to be capable of packing and unpacking Ethernet packets, and additional dedicated hardware and software are required, which increase both the complexity and cost of the front-end subsystem. A better method of implementing multichassis synchronization would take advantage of GPS pulses per second (PPS) since, by connecting each chassis to a GPS receiver, the MCHs can use PPS as a reference signal to generate the Ref Clock and Sync Pulse for in-chassis synchronization. Since the PPS signal among different MCHs is synchronized, the Ref Clock and Sync Pulse in each chassis are phase matched at any given time. In the case that all of the MCHs use the PPS from the same satellites, the synchronization accuracy is between 5 and 20 ns [30].
However, when the GPS signal is inaccessible or lost, the front-end subsystem should be able to stay synchronized by sharing the Sync Pulse from a common source, which could be an external chassis clock generator or a signal from one of the chassis. In both methods, the trace length to each MCH from the common Sync Pulse source can vary, thereby making the propagation time delay of the Sync Pulse to each chassis differ. To address this issue, we need to know the delay time difference of each chassis compared with the reference (i.e., master) chassis. With that knowledge, all chassis can use the time difference as an offset to adjust the trigger time.

To implement that approach, we designed a clock counter to measure the elapsed clock cycles between the Sync Pulse and the return Sync Beacon, the latter of which is transmitted only from antennas connected to the reference chassis. Since the beacon arrives at all antennas simultaneously, each front-end subsystem stops its counter at the same time. The time differences in delay can be obtained by subtracting each slave chassis's counter number from that of the reference chassis. Figure 9 illustrates a model timing sequence after each chassis receives the Sync Pulse. At time T0, the reference chassis begins to transmit the Sync Beacon and starts the counter.
After two and a half clock cycles of propagation delay, the slave chassis launches its counter as well. At time T3, the Sync Beacon is received by both chassis; however, since a chassis detects the signal only at its rising edge, the reference chassis detects the signal at time T5 with counter number 16. By contrast, in the slave chassis, the counter stops at 13. In turn, when the Sync Pulse is received the next time, the reference chassis is delayed by three clock cycles and triggers the AD9361 at time T6, whereas the slave chassis starts it at T7. In our example, T6 is not the same as T7. Such deviation arises because the clock phase angle between the two chassis is not identical. When this phase angle approaches 360°, it is possible for the Sync Beacon to arrive when the rising edge of one clock has just passed, while the rising edge of the next clock cycle is still approaching. In the worst-case scenario, a one-clock-cycle synchronization error will occur, meaning that the accuracy of multichassis synchronization equals the period of the reference clock. One way to enhance the accuracy is to reduce the period of the reference clock; however, the sampling speed of the ADC confines the shortest period of the clock, because a front-end AMC cannot read new data from the ADC in every clock cycle when the AMC's reference clock frequency exceeds the ADC's sampling speed. In our example, since the maximum data rate of the AD9361 is 61.44 million samples per second, the interchassis synchronization accuracy without using the GPS signal is 16 ns.
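The skew correction itself is simple arithmetic. The sketch below replays the example's counter values (reference stops at 16, slave at 13) and the worst-case accuracy bound; the variable names are ours, not the authors'.

```c
#include <stdio.h>

int main(void)
{
    /* Counter values latched at the Sync Beacon's rising edge (from the
     * example in the text).                                             */
    int ref_count   = 16;   /* reference chassis counter */
    int slave_count = 13;   /* slave chassis counter     */

    /* The reference delays its next trigger by the difference so that
     * both chassis fire the AD9361 on aligned edges.                    */
    int delay_cycles = ref_count - slave_count;          /* = 3 cycles   */

    /* Worst case, the beacon straddles a rising edge, costing one full
     * reference-clock period. With the AD9361 capped at 61.44 MS/s, the
     * shortest usable period is 1 / 61.44 MHz, roughly 16 ns.           */
    double accuracy_ns = 1e9 / 61.44e6;

    printf("trigger offset = %d cycles, worst-case skew = %.1f ns\n",
           delay_cycles, accuracy_ns);
    return 0;
}
```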
Figure 9. Example timing sequence of multi-chassis synchronization.
2.6. Backend System Performance Metrics

To benchmark digital PAR backend system performance, millions of instructions per second is often used as the metric. Meanwhile, to measure the floating point computational capability of a system, we use GFLOPS [31]. To evaluate the real-time benchmark performance, we simulate complex floating point data cubes. For parallel computing systems, parallel speedup and parallel efficiency are two important metrics for evaluating the effectiveness of parallel algorithm implementations. Speedup is a metric of the latency improvement of a parallel algorithm distributed over PUs compared with a serial algorithm, defined as:
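The defining equation itself was lost in extraction; the standard definition the sentence introduces is presumably

$$
S(N_{\mathrm{PU}}) \;=\; \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}(N_{\mathrm{PU}})},
\qquad
E(N_{\mathrm{PU}}) \;=\; \frac{S(N_{\mathrm{PU}})}{N_{\mathrm{PU}}},
$$

where T_serial is the latency of the serial implementation, T_parallel(N_PU) is the latency of the parallel implementation on N_PU processing units, and E is the corresponding parallel efficiency.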