sensors

Article Real-Time Spaceborne Synthetic Aperture Radar Float-Point Imaging System Using Optimized Mapping Methodology and a Multi-Node Parallel Accelerating Technique

Bingyi Li 1, Hao Shi 1,2,* ID , Liang Chen 1,*, Wenyue Yu 1, Chen Yang 1 ID , Yizhuang Xie 1, Mingming Bian 3, Qingjun Zhang 3 and Long Pang 4

1 Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China; [email protected] (B.L.); [email protected] (W.Y.); [email protected] (C.Y.); [email protected] (Y.X.) 2 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China 3 Beijing Institute of Spacecraft System Engineering, Beijing 100094, China; [email protected] (M.B.); [email protected] (Q.Z.) 4 School of Information Engineering, Communication University of China, Beijing 100024, China; [email protected] * Correspondence: [email protected] (H.S.); [email protected] (L.C.); Tel.: +86-186-1166-1399 (H.S.)

Received: 28 December 2017; Accepted: 5 February 2018; Published: 28 February 2018

Abstract: With the development of satellite load technology and very large-scale integrated (VLSI) circuit technology, on-board real-time synthetic aperture radar (SAR) imaging systems have facilitated rapid response to disasters. A key goal of the on-board SAR imaging system design is to achieve high real-time processing performance under severe size, weight, and power consumption constraints. This paper presents a multi-node prototype system for real-time SAR imaging processing. We decompose the commonly used chirp scaling (CS) SAR imaging algorithm into two parts according to the computing features. The linearization and logic-memory optimum allocation methods are adopted to realize the nonlinear part in a reconfigurable structure, and the two-part bandwidth balance method is used to realize the linear part. Thus, float-point SAR imaging processing can be integrated into a single Field Programmable Gate Array (FPGA) chip instead of relying on distributed technologies. A single-processing node requires 10.6 s and consumes 17 W to focus on 25-km swath width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384. The design methodology of the multi-FPGA parallel accelerating system under the real-time principle is introduced. As a proof of concept, a prototype with four processing nodes and one master node is implemented using a Xilinx xc6vlx315t FPGA. The weight and volume of one single machine are 10 kg and 32 cm × 24 cm × 20 cm, respectively, and the power consumption is under 100 W. The real-time performance of the proposed design is demonstrated on Chinese Gaofen-3 stripmap continuous imaging.

Keywords: synthetic aperture radar (SAR); real-time processing; single FPGA node imaging processing; multi-nodes parallel accelerating technique

1. Introduction As an important means of space-to-earth observation, spaceborne synthetic aperture radar (SAR) has the ability to collect data continuously under all weather conditions over large areas at high resolution, making it a unique instrument [1]. SAR plays an important role in disaster emergency response, environmental monitoring, resource exploration, and geographic information access [2–4]. Recent publications have reviewed the applications of satellite remote sensing techniques for hazards

Sensors 2018, 18, 725; doi:10.3390/s18030725 www.mdpi.com/journal/sensors Sensors 2018, 18, 725 2 of 20 manifested by solid earth processes, including earthquakes, volcanoes, floods, landslides, and coastal inundation [5–7]. In 1978, NASA’s satellites demonstrated the ability of SAR to acquire high-resolution images. Since then, many countries have launched SAR satellites and conducted research on spaceborne SAR processing. For example, the Sentinel-1 mission, including both the S-1A (launched in 2014) and S-1B (launched in 2016) satellites, was specifically designed by the European Space Agency (ESA) to acquire data and information products for applications such as the observation of the marine environment, the surveillance of maritime transport zones, the mapping of land surfaces, and the offering of support during crisis situations [8]. The TanDEM-X/TerraSAR-X (TDX/TSX) constellation is a high-resolution interferometric SAR mission of the German Aerospace Center (DLR) intended to fulfill the requirements of a global homogeneous and high-resolution coverage of all land areas, providing vital information for a variety of applications [9]. ALOS-2 (launched in 2014) is the replacement of the Japan Aerospace Exploration Agency (JAXA) L-band SAR satellite mission to ALOS [10]. The overall objective of ALOS-2 is to provide data continuity for cartography, regional observation, disaster monitoring, and environmental monitoring. The Korea Multi-Purpose Satellite-5 (KOMPSAT-5) was launched in 2013 by the Korea Aerospace Research Institute (KARI), and the RADARSAT Constellation Mission (RCM) is scheduled for launch in 2018 by the Canadian Space Agency (CSA). The SAR data are expected to be used mainly for maritime surveillance/national security, disaster management, and ecosystem monitoring [11,12]. Most of the abovementioned missions impose high demands on the real-time performance of SAR data processing. On-board processing is an efficient solution which allows higher-precision SAR data to be processed, leading to better image quality and enabling an optional image compression. This technique improves the downlink bandwidth utilization and provides rapid feedback to the radar controller. With these processed data products, decision makers can quickly plan and respond. As early as 2000, the MIT Lincoln Laboratory began a study of the implementation of real-time signal processors for SAR front-end signal processing [13]. The processors were designed, based on their own VLSI bit-level systolic array technology, to have high computational throughput and low power implementations. S. Langemeyer et al. of the University of Hannover, Germany, proposed a multi-DSP system for real-time SAR processing using the highly parallel digital signal processor (HiPAR-DSP) technique in 2003 [14]. The small volume and low power consumption of their processor make it suitable for on-board usage in compact air- or spaceborne systems. The Jet Propulsion Laboratory (JPL) has also worked to develop on-board processing. An experimental SAR processing system based on VLSI/SOC hardware was proposed [15]. A fault-tolerant FPGA (Xilinx Virtex-II Pro)-based architecture was proposed and tested using the SIR-C data [16,17]. The University of Florida developed a high-performance space computing framework based on a hardware/software interface in 2006 [18]. An FPGA serves as the co-processor/accelerator of the CPU. A near-real-time SAR processor (NRTP) was developed by the Indian Space Research Organization (IRSO) based on the Analog Devices TigerSHARC TS101S/TS201S DSP multiprocessor. On-board or on-ground quick-look real-time SAR signal processing was found to be achievable for ISRO’s RISAT-1 [19]. With the rapid development of the storage capacity and computing capacity of the commercial-off-the-shelf (COTS) FPGA, the state-of-art Xilinx Virtex-6 FPGA was adopted for the entire real-time SAR imaging system in 2013 [20]. In recent years, the graphics processing unit (GPU), with its large computing power, has also been used for real-time SAR processing [21]. As indicated by the development of on-board SAR real-time processing, building a high- performance SAR real-time processing platform for space deployment is hampered by the hostile environmental conditions and power constraints in space. The FPGA, ASIC, DSP, CPU, and GPU are, to some extent, superior with respect to real-time processing. Although the GPU has a high processing power, its large power consumption makes it unsuitable for the harsh conditions of spaceborne on-board processing. The CPU and DSP take advantage of their design flexibility; however, they cannot provide enough FLOPS per watt, which leads to a bottleneck in their potential applications. Sensors 2018, 18, 725 3 of 20

Benefiting from its customized design, the ASIC can provide sufficient processing power and high computation ability; however, in implementing an ASIC for SAR imaging, the large-scale, complicated logic design requires a longer development period. The FPGA has made great progress in terms of on-chip storage resources, arithmetic logic resources, and hardware-software co-design. The FPGA can adapt to a large throughput rate and strict real-time signal processing requirements. Moreover, the structural characteristics of the FPGA make it suitable for homogeneous extension. In this paper, we propose a multi-node parallel accelerating system to realize an on-board real-time SAR processing system. The mainstream spaceborne SAR imaging algorithm, chirp scaling (CS), is implemented in this system. Previous researchers have imported CS algorithms to special FPGA processors. However, further optimizations are required for the existing methods to accelerate SAR imaging processing. For example, four times all-pixel FFTs are the most computation-hungry operations of CS implementation, and the efficiency burden mainly occurs in the data access after corner turning (matrix transposition) [22]. In a previous study [23], the window access mode was used to accelerate the matrix transposition, while in another study [24], ping pong buffers were used for Dual data rate (DDR) SDRAM to solve this problem. However, these approaches have various limitations of universality, e.g., two-dimensional rate mismatch and the method of complicated phase function generation, which can reduce the hardware resource utilization and meets the level of real-time imaging. We provide the following contributions to the existing research:

• Optimized mapping methodology for single-chip integration. The CS algorithm can be decomposed in to two parts. For the nonlinear part, nonlinear–operation linearization method, logic-memory optimum allocation, and hierarchical reconfiguration structure are proposed to reduce the complexity and time consumption. Two-dimensional bandwidth dynamic balance technology is introduced to achieve a good real-time performance between the linear part and transpose operations. • Multi-node parallel accelerating technique for strong real-time requirements. By analyzing the spaceborne SAR real-time imaging processing conditions, this paper presents a parallel accelerating architecture consisting of a master node and multiple independent processing nodes with high processing performance, high real-time performance, and linear scalability features.

The rest of the paper is organized as follows: Section2 reviews the CS algorithm and analyses the characteristic of each part of the CS algorithm. Section3 presents a single-FPGA integration design for optimizing the CS algorithm implementation with optimized mapping strategy. In Section4, the design methodology of the multi-node parallel accelerating system under the real-time principle is described. In Section5, the corresponding hardware realization details and results are discussed. A comparison with related works is conducted to show the validity of our system. Section6 concludes the paper.

2. Chirp Scaling (CS) Algorithm Review The CS algorithm is one of the most fundamental and popular algorithms for spaceborne SAR data processing. Compared to other algorithms, the superiority of the CS algorithm lies primarily in its use of the “chirp scaling” principle, in which phase multiplies are used instead of a time-domain interpolator to implement range-variant range cell migration correction (RCMC) shift [25]. As a kernel algorithm, the CS algorithm can process various modes with certain pre- or post-steps, such as the stripmap, scan SAR, spotlight, Tops, and Mosaic modes. This algorithm can also solve the problem of the dependence of the secondary range compression (SRC) on the azimuthal frequency because of the requirement for data processing in the two-dimensional frequency domain. The high-level block diagram of the CS algorithm based on the squint equivalent range model is illustrated in Figure1. The algorithm mainly includes operations of four FFTs, three phase functions, and two Doppler parameters (Doppler frequency center (DFC) and Doppler frequency rate (DFR)) Sensors 2018, 18, 725 4 of 20 estimation. Note that the proposed flow introduces the raw-data-based method to ensure the precision of Doppler parameters. The steps in the CS algorithm are as follows: First, the SAR raw data are divided into two parallel branches: one branch is transferred to the Range-Doppler domain via an FFT in the azimuthal direction, and the other branch is used to estimate DFC. DFC can be treated as a basic parameter for subsequent phase functions and DFR calculation. Second, the data are multiplied by the 1st phase function to achieve the chirp scaling, which makes all the range migration curves the same. The 1st phase function can be described as follows:

2 φ (τ, f ; r ) = exp[−jπb ( f ; r )c ( f )(τ − r (1 + c ( f )))2], (1) 1 η re f r η re f s η c re f s η where τ is the range time, fη is the azimuthal frequency, rre f is the reference distance, br( fη; rre f ) is the modulating frequency in the phase center of the range direction, and cs( fη) is the curvature factor, expressed as follows: sin ϕre f cs( fη) = q − 1, (2) λ fη 2 1 − ( 2v )

3 h λ fη 2i 2 1 − ( 2v ) × b br( fη, rre f ) = . (3) h i 3 2 λ fη 2 2 λ fη 1 − ( 2v ) + brre f sin ϕre f ( 2v ) where λ is the wave length, b is the modulation frequency of the transmitted signal, ϕre f and v represent equivalent squint angle and equivalent squint velocity, respectively. These variables can be described as follows: r λr f λ f v = r + ( d )2 (4) 2 2 where fd represents DFC and fr represents DFR. Because step 1 and step 2 consider the range dimension, the initial values obtained by the ephemeris parameter can be adopted to simplify calculation. Third, the data are transferred to the two-dimensional frequency domain via an FFT in the range direction. Next, the data are multiplied by the 2nd phase function to complete the range compression, the SRC, and the remaining RCMC. The 2nd phase function can be described as follows:

2 fτ 4π φ2( fτ, fη; rre f ) = exp[−jπ ] exp[+j fτrre f cs( fη)], (5) br( fη; rre f )[1 + cs( fη)] c where fτ is the range frequency. Next, the data are transferred to the Range-Doppler domain via an inverse FFT in the range direction. The data can be multiplied by the 3rd phase function to complete the azimuth compression and the phase correction. The DFR based on the raw data is used to refine the equivalent velocity v to ensure the precision of the 3rd phase function and is described as follows: r 2π λ fη 4π φ (τ, f ) = exp[−j cτ(1 − sin ϕ 1 − ( )2) + j b ( f ; r )(1 + c( f ))c( f )(r − r )2]. (6) 3 η λ re f 2v c2 r η re f η η re f Finally, the inverse FFT operation in the azimuthal direction is executed to complete the CS algorithm. A visualized grayscale image can be obtained after performing the 8-bit quantization operation. Sensors 2018, 18, 725 5 of 20 Sensors 2018, 18, x FOR PEER REVIEW 5 of 20

SAR Raw Data Corner turning, range to azimuth Step 1

DFC Estimation Azimuth FFT 1st Chirp Scaling Phase Function Corner turning, azimuth to range Step 2 Range FFT

2nd RCMC & SRC Phase Function Range Compression Range IFFT Corner turning, range to azimuth DFR Estimation

3rd Phase Correction Phase Function Azimuth Compression Step 3 Azimuth IFFT

Quantize

SAR Image

Figure 1.1. Flowchart of the chirp scaling (CS) algorithm.

The processing element selected for this system design is the FPGA. DDR SDRAMs are The processing element selected for this system design is the FPGA. DDR SDRAMs are introduced introduced into the system to act as the external storage medium for the SAR raw data. The corner into the system to act as the external storage medium for the SAR raw data. The corner turning turning represented in the red boxes in Figure 1 is required before each step to fit the step’s data represented in the red boxes in Figure1 is required before each step to fit the step’s data operation operation dimension. For the convenience of hardware processing, the high-robustness and low- dimension. For the convenience of hardware processing, the high-robustness and low-overhead overhead algorithms called time-domain-autocorrelation (TDA) [26] and shift-and-correlate (SAC) algorithms called time-domain-autocorrelation (TDA) [26] and shift-and-correlate (SAC) [27] are [27] are chosen for DFC and DFR estimation, respectively. Otherwise, the SAC method would require chosen for DFC and DFR estimation, respectively. Otherwise, the SAC method would require one one additional FFT operation. Thus, different mapping and implementation strategies must be additional FFT operation. Thus, different mapping and implementation strategies must be designed designed for different operation parts. for different operation parts. 3. Single-Node Field Programmable Gate Array (FPGA)-Based Architecture 3. Single-Node Field Programmable Gate Array (FPGA)-Based Architecture

3.1. Hardware Design Methodology A homogeneous design reduces the complexity of the board and might be more powerful for special algorithms and decrease the system power requirements. Depending on the chosen algorithm and chassis shape of the target system, the homogeneous system can be scalable by simplysimply adding processing nodes. In In this this section, section, we we discuss discuss the the single single-node-node FPGA FPGA design design of of the the CS CS algorithm. Preparing the strategy for development requires trade-offstrade-offs and integration across these multiple domains of of algorithm algorithm and and hardware. hardware. An An efficient efficient mapp mappinging methodology methodology is required is required to ensure to ensure that thatthe processor the processor is able is to able operate to operate without without large, power large,-hungry power-hungry computations. computations. The process The of process mapping of mappingthe architecture the architecture and algorithm and algorithm is to progress is to progress from from a correct a correct algorithm, algorithm, described described at at a a level level of abstraction that that is bothis both implementation implementation independent independent and timing and independent, timing independent, to a system to description a system ofdescription time-dependent, of time- specificdependent, allocation specific of allocation processing of resources processing and resources the sequencing and the of sequencing events within of thoseevents resources within those [28 ].resources Compared [28] to. Compared the multiprocessor to the multiprocessor method, uniprocessor method, uniprocessor integration integration can avoid thecan overheadavoid the ofoverhead introducing of introducing multiprocessor multiprocessor interconnection. interconnection. The real-time The performancereal-time performance of the signal of FPGAthe signal implementation FPGA implementation mainly depends mainl ony depends the details on ofthe the details architecture of the architecture and algorithm and partitioning. algorithm Aspartitioning. described As in described the previous in the section, previous the section, CS algorithm the CS isalgorithm divided intois divided three partsinto three for hardware parts for processing.hardware processing. The linear part, The whichlinear contains part, which a large contains number a of large rapid number but repetitive of rapid computations, but repetitive is thecomp largestutations, computational is the largest burden computational of real-time burden implementation. of real-time implementation. In addition, because In addition, the linear because part isthe a linear per-image part is pixel a per operation,-image pixel the operation, communication the communication with mass storage with resourcemass storage must resource be taken must into be taken into consideration. The transpose operations affect communication overhead, which is the other factor of real-time performance. The nonlinear part consists of slower and often more irregular computations that determine the imaging quality under limited hardware resources.

Sensors 2018, 18, 725 6 of 20 consideration. The transpose operations affect communication overhead, which is the other factor of real-time performance. The nonlinear part consists of slower and often more irregular computations Sensors 2018, 18, x FOR PEER REVIEW 6 of 20 that determine the imaging quality under limited hardware resources.

3.2.3.2. Nonlinear Nonlinear Part Part Mapping Mapping Strateg Strategyy InIn this this paper, paper, we we adopt adopt the the IEEEIEEE-754-754 standard single precision floating floating-point-point data data format format to to generategenerate the the phase phase function. function. A A detailed detailed description description of of the the proposed proposed architecture architecture of of the the nonlinear nonlinear part part isis given given in in this this subsection. subsection. The The requirement requirement of of the the real real-time-time nonlinear nonlinear part part design design is is to to complete complete the the phasephase function calculation beforebefore thethe corresponding corresponding FFT FFT operation operation of of each each step. step. The The estimations estimations of theof thephase phase functions functions and Doppler and Doppler parameters parameters require require a series aof series irregular of and irregular transcendental and transcendental operations operationsthat can be thatconsidered can be a considered resource-intensive a resource operation-intensive for FPGA. operation Adopting for FPGA. software Adopting computing software and a computingper-store strategy and a per can-store help increasestrategy thecan real-timehelp increase performance the real-time of nonlinear performance part; however,of nonlinear this part; step however,requires athis massive step requires amount a of massive storage. amount Therefore, of storage. a dedicated Therefore, mapping a dedicated method thatmappi considersng method the thattrade-off considers between the trade memory-off between and hardware memory resource and hardware must be designed.resource must be designed.

3.2.1.3.2.1. Hierarchical Hierarchical Decomposition Decomposition Mapping Mapping Strategy Strategy and and Hardware Hardware-Based-Based Optimization Methods ThisThis subsection subsection focuses focuses on on the the high high-efficiency-efficiency mapping mapping strategy strategy for for pha phasese function estimation (PFE)(PFE) implementations. implementations. Because Because the theinput input of the of three the threeCS factors CS factors is the basic is the parameter basic parameter of the satellite, of the whichsatellite, does which not require does not updating require during updating the duringsatellite the working satellite time, working the data time, acquisition the data can acquisition be used tocan improve be used betting to improve or stori bettingng operations. or storing operations. The complexity The complexity of the PFE of isthe mainly PFE is mainlyreflected reflected in the intermediatein the intermediate parameters parameters that contain that contain a large a large number number of of resource resource-hungry-hungry operations, operations, such such as as prescribingprescribing and and dividing. dividing. However, However, through through the analysis, the analysis, the procedure the procedure of thrice PFE of is thrice approximately PFE is approximathe same, andtely itthe masks same, the and possibility it masks of the parallel possibility operations. of parallel Therefore, operations. as shown Therefore, in Figure as2 ,shown a unified in Figurethree-tier 2, mappinga unified structure three-tier for mapping PFE implementation, structure for which PFE implementation, is based on the relationship which is based between on the the relationshipparameters ofbetween the calculation, the parameters is proposed. of the calculation, is proposed.

Figure 2. Hierarchical decomposition mapping flow. Figure 2. Hierarchical decomposition mapping flow.

The first level represents the procedure of fine-grained parameter acquisition. Different-colored The first level represents the procedure of fine-grained parameter acquisition. Different-colored represent different types of parameters, where the parameters in the green box are calculated only represent different types of parameters, where the parameters in the green box are calculated only once once and need not be updated during the imaging process. These parameters, such as b, sin ref , and and need not be updated during the imaging process. These parameters, such as b, sin ϕre f , and rref, rasref, wellas well asthe as the calculations calculations associated associated with with them, them, areusually are usually prepared prepared in parallel in parallel at the at beginning the beginning of the ofPFE the steps PFE and steps stored and in stored the registers in the registersfor subsequent for subsequent operations. operations. The gray boxes The grayshow boxes the construction show the construction of the range axis ()i , the Doppler frequency axis fi(), and the corresponding of the range axis τ(i), the Doppler frequency axis fη(i), and the corresponding parameter calculation. parameter calculation. is constructed by the Doppler center of reference position weighted by the ambiguity number of each azimuth line. The Taylor expansion method is supported to simplify the specific rooting operation and id described as follows:

Sensors 2018, 18, x FOR PEER REVIEW 7 of 20 Sensors 2018, 18, 725 7 of 20

ff1 1()1()22 (7) fη(i) is constructed by the Doppler center of222 referencevv position weighted by the ambiguity number of each azimuth line. The Taylor expansion method is supported to simplify the specific rooting operation and idTherefore, described resources as follows: are saved by having one shift operation replacing one rooting operation. r The red boxes show the azimuth-related parameters,λ fη which1 mustλ fη be updated only in step 3. As noted 1 − ( )2 = 1 − × ( )2 (7) above, fdc and fdr can be used as the initial2v values in2 step 12 vand step 2, and they should be refined in theTherefore, calculation resources for velocity are saved as (4), by as having described one shiftin step operation 3. Next, replacing the procedure one rooting returns operation. to the gray The boxesred boxes displayed. show theThe azimuth-relatedsecond level represents parameters, the calculation which must procedure be updated of coarse only-grained in step 3.parameters As noted and b(,) f r as indicated in (2) and (3). Because the inputs of these parameters are prepared above,cfs ()fdc andrfdr can ref be used as the initial values in step 1 and step 2, and they should be refined in inthe the calculation first level for and velocity the operations as (4), as are described independent in step of 3. each Next, other, the procedure they can be returns calculated to the in gray parallel. boxes Thedisplayed. three-phase The second function level calculation represents theof the calculation third level procedure can be of regard coarse-graineded as a parameterssimple assemblycs( fη) procedureand br( fη, rwhenre f ) as the indicated parameters in (2) of and the (3).second Because level theare inputsready. ofNote these that parameters the procedure are preparedof the 2nd in PFE the canfirst be level started and at the the operations second level are because independent the fine of-grained each other, parameters they can are be the calculated same as inthose parallel. of the The 1st PFE.three-phase function calculation of the third level can be regarded as a simple assembly procedure whenAccording the parameters to the analysis of the second of the mapping level are ready.strategy, Note certain that simple the procedure linear approximation of the 2nd PFE methods can be andstarted the atelementary the second implementation level because the scheme fine-grained can be parametersconducted. areAdvanced the same optimizations as those of the for 1st further PFE. reducingAccording the resource to the analysisusage are of as the follows: mapping strategy, certain simple linear approximation methods and the elementary implementation scheme can be conducted. Advanced optimizations for further  Reasonable storage planning is introduced before implementation of the architecture design. reducing the resource usage are as follows: The parameters maintained during one imaging process can be stored in registers, and the fi() • -Reasonable and ()i -related storage parameters planning is should introduced be stored before in implementationDPRAM and accessed of the architecture in a time multiplexing design. The mode.parameters maintained during one imaging process can be stored in registers, and the fη(i)- and  Theτ(i) -related trigonometric parameters functions, should such be stored as sin in, cos, DPRAM and arccos and accessed, are approximated in a time multiplexing by the CORDIC mode. • method,The trigonometric which is achieved functions, by suchthe addition as sin, cos, iteration. and arccos, Next, whole are approximated operations can by be the unified CORDIC in additionmethod, (subtraction), which is achieved multiplica by thetion addition and division. iteration. Assuming Next, whole the computation operations can of one be unified addition in asaddition the benchmark (subtraction), T, one multiplication multiplication and is approximately division. Assuming 4 T and the one computation division is ofapproximately one addition 10as the T. Therefore,benchmark the T, one proposed multiplication design is reduces approximately the utilization 4 T and of one logical division resources is approximately by using dividers,10 T. Therefore, rooting, the and proposed CORDIC design operational reduces cores the utilization in a time ofmultiplexing logical resources mode. by Corresponding using dividers, torooting, Figure and 2, one CORDIC divider operational and rooting cores are in instituted a time multiplexing for each level, mode. and Corresponding one CORDIC is to used Figure for2, theone first divider level. and rooting are instituted for each level, and one CORDIC is used for the first level.

3.2.2.3.2.2. Reconfigurable Reconfigurable Implementation Implementation Structure TheThe proposed proposed design design employs employs the the reconfigurable reconfigurable architecture shown in Figure 33 toto achieveachieve highhigh utilizationutilization of the computational computational resources. The FPGA FPGA-based-based structure consists of 5 modules. The The state state machinemachine unit unit and the I/O interface interface unit unit are are responsible responsible for for system system control control and and data data input input (output) (output) management,management, respectively. TheThe computercomputer core core unit unit and and the the reconfigurable reconfigurable unit unit share share responsibility responsibility for forcomputation computation tasks. tasks. The The on-chip on-chip memory memory unit unit is divided is divided into into two two groups, groups, namely, namely, the registerthe register pool pooland the and SDRAM the SDRAM array, to array, store tothe store intermediate the intermediate parameters parameters corresponding corresponding to different to parameters. different parametersAll modules. All are modules connected are via connected the 32-bit via user-defined the 32-bit user and-defined high-speed and bus high interface.-speed bus interface.

On-chip State Machine I/O Interface

Interconnect Bus

Reconfigurable Unit On-chip Memory Computer Core Cordic Div Route PE Register SDRAM Management array Pool Array Polyfit Root

FigureFigure 3 3.. FieldField Programmable Programmable Gate Gate Array Array (FPGA) (FPGA)-based-based architecture.

Sensors 2018, 18, x FOR PEER REVIEW 8 of 20 Sensors 2018, 18, x 725 FOR PEER REVIEW 8 of 20

An isomorphism coarse-grained programmable element (PE), which is based on an instruction- drivenAn and isomorphism float-point coarse-grainedarithmetic unit programmable(AU), is adopted element as shown (PE), in which Figure is 4 baseda. The onPE an has instruction- a selector fordriven four and input float-point data options: arithmetic three unit for (AU),external is adopted inputs and as shown one for in Figureinternal4a. computing The PE has feedback. a selector An for internalfour input controller data options: analyzes three the for instruction external inputs from and route one management. for internal computing The PE contains feedback. three An internalmodes, namely,controller addition, analyzes multiplication the instruction and from bypass route modes, management. which are The chosen PE contains by the instruction. three modes, Each namely, link hasaddition, the particular multiplication register, and and bypass the three modes, modes which cannot are operate chosen at bythe thesame instruction. time. Four Eachlevel tree link-like has interconnectionsthe particular register, (Figure and 4b) thecorresponding three modes to cannot the procedure operate of at phase the same estimation time. Four are utilized level tree-like for PE arrayinterconnections (PEA) design. (Figure Each4 b)parent corresponding node connects to the all procedure of the children of phase nodes. estimation Every are PE utilized can become for PE a secondaryarray (PEA) root design. node and Each output parent the node result connects independently all of the to children fit the specific nodes. scheduling Every PE cantask. become a secondary root node and output the result independently to fit the specific scheduling task.

Instrcution Input Instrcution Input PE

Controller selector PE Controller selector PE PE Four LevelFour Level + - PE PE PE PE + Add - Multi PE PE PE PE Add Multi

REG REG PE PE PE PE PE PE PE PE Output Output (a) PE architecture (b) PE array interconnection Figure 4. Programmable element (PE) and PE array architecture. FigureFigure 4 4.. ProgrammableProgrammable element element ( (PE)PE) and PE array architecture.

3.2.3. Accuracy and Time-Consumption Analysis 3.2.3. Accuracy and Time-Consumption Analysis The accuracy of nonlinear part estimation should be taken into consideration. The bit-level The accuracy of nonlinear part estimation should be taken into consideration. The bit-level accurate simulation method based on an advanced programming language (Matlab) provides a accurate simulation method based on an advanced programming language (Matlab) provides a benchmark to gauge the performance of the nonlinear part. Because the software allows a simulation benchmark to gauge the performance of the nonlinear part. Because the software allows a simulation result of each step of the algorithm to be provided, it is suited for verifying the validity of each PFE result of each step of the algorithm to be provided, it is suited for verifying the validity of each PFE result perfectly. In this paper, the 16,384 × 16,384 point target image was chosen for the experiment. result perfectly. In this paper, the 16,384 × 16,384 point target image was chosen for the experiment. We chose the Xilinx XC6VSX315T FPGA as the prototype platform and adopted ISE 14.3 as the We chose the Xilinx XC6VSX315T FPGA as the prototype platform and adopted ISE 14.3 as the development tool. Figure 5 shows the relative errors between hardware and software computation. development tool. Figure5 shows the relative errors between hardware and software computation. As described in reference [29], we extract 1 value out of every 8. Each point of Figure 5 represents the As described in reference [29], we extract 1 value out of every 8. Each point of Figure5 represents the maximum relative errors of range or azimuth lines. Each value represents the maximum error for maximum relative errors of range or azimuth lines. Each value represents the maximum error for each each factor line. Using the above optimized methods can result in a certain loss of accuracy, but the relativefactor line. error, Using which the is above between optimized 10−3 and methods 10−5, can can be result tolerated in a certainin high loss-resolution of accuracy, imaging. but the relative relative error, which is between 10−3 and 10−5, can be tolerated in high-resolution imaging. error, which is between 10−3 and 10−5, can be tolerated in high-resolution imaging.

(a) 1st PFE relative errors (b) 2nd PFE relative errors (c) 3rd PFE relative errors Figure 5. The relative errors between hardware and software computation. FigureFigure 5 5.. TheThe relative relative errors errors between between hardware and software computation.

Figure 6 shows the no-window imaging result of the proposed approach and MATLAB— visually, the imaging result shows little difference between the two figures. The proposed design is

Sensors 2018, 18, 725 9 of 20

Sensors 2018, 18, x FOR PEER REVIEW 9 of 20 Figure6 shows the no-window imaging result of the proposed approach and MATLAB—visually, thealso imagingsupported result for window shows littlefunctions difference in range between and azimuth the two compression figures. The stages. proposed Additional design memory is also supportedcan be reserved for window to build functions lookup in tables range for and window azimuth functions. compression This stages. paper Additional chooses 5 order memory Taylor can bewindow reserved with to −30 build dB lookup PLSR, tablesas well for as window standard functions. Hamming This window, paper choosesfor comp 5ar orderison. Taylor Table window1 shows withthe point−30 target dB PLSR, imaging as well quality as standard assessment Hamming results window, with different for comparison. situations. TableThe peak1 shows sidelobe the point ratio target(PSLR), imaging integrated quality sidelobe assessment ratio (ISLR), results with and spatial different resolution situations. (RES) The peak are commonl sidelobey ratio adopted (PSLR), to integratedevaluate the sidelobe imaging ratio quality (ISLR), [30] and. spatialThe degradation resolution (RES) of PSLR are commonly and ISLR between adopted to proposed evaluate and the imagingMATLAB quality is less [30than]. The 2.5 degradationdB, which mainly of PSLR occurred and ISLR by betweenfactor-extraction proposed and and linear MATLAB-approximation. is less than 2.5The dB, result which shows mainly that occurredthe proposed by factor-extraction design appears to and be linear-approximation. able to provide sufficient The accuracy result shows for point that thetarget proposed imaging. design appears to be able to provide sufficient accuracy for point target imaging.

(a) Software imaging result

(b) Hardware imaging result

Figure 6. Point target imaging result with no window. Figure 6. Point target imaging result with no window.

Table 1. Point target imaging quality assessment. Table 1. Point target imaging quality assessment. Azimuthal Direction Range Direction Window AzimuthalPSLR Direction ISLR RES PSLR Range DirectionISLR RES Window PSLR (dB) ISLR(dB) (dB) (dB) RES (m)(m PSLR) (dB)(dB) ISLR (dB)(dB) RES( (m)m) ProposedProposed −11.21 −11.21−10.00 −10.00 2.93 2.93− 10.93−10.93 −9.03−9.03 2.22.2 / / MatlabMatlab −12.52 −12.52−11.41 −11.41 2.1 2.1− 12.61−12.61− 10.05−10.05 1.91.9 ProposedProposed Taylor −27.54 −27.54−21.85 −21.85 4.10 4.10− 27.23−27.23− 22.34−22.34 2.452.45 Taylor window MatlabMatlab window −29.74 −29.74−25.93 −25.93 2.7 2.7− 28.58−28.58− 24.21−24.21 2.32.3 ProposedProposed Hamming −38.73 −−38.7337.21 −37.21 4.61 4.61− 36.85−36.85− 34.73−34.73 3.103.10 Hamming window MatlabMatlab window −40.71 −40.71−39.10 −39.10 3.16 3.16− 38.32−38.32− 36.08−36.08 2.692.69

For further analysis of the influencesinfluences of real-timereal-time performance,performance, the time consumption of three PFE procedures is given in Table2 2.. For For each each PFEPFE procedure,procedure, the the time time delay delay is is less less than than 1010 µµs.s. Because the 16,384 × 16,38416,384-point-point FFT FFT operation operation delay delay is is at at the the millisecond level, level, the PFE procedures can be considered asas real-timereal-time processingprocessing andand dodo notnot affectaffect thethe parallelismparallelism ofof thethe linearlinear part.part.

Sensors 2018, 18, 725 10 of 20

Sensors 2018, 18, x FOR PEER REVIEW 10 of 20 Table 2. Time consumption of three PFE procedures. Table 2. Time consumption of three PFE procedures. Stage Cycles (Working Frequency of 100 MHz) Stage Cycles (Working Frequency of 100 MHz) 1st PFE 512 1st2nd PFE PFE 512 512 2nd3rd PFE PFE 512 404 3rd PFE 404 3.3. Linear Part Mapping Strategy 3.3. Linear Part Mapping Strategy The linear part mainly includes FFTs, multiplications, and corner turning. Three corner turningThe (matrix linear part transposition) mainly includes operations FFTs, aremultiplications, equivalent to and the corner range orturning. azimuth Three data corner access turning modes (matrixcorresponding transposition) to each step. operations The proposed are equivalent design introduced to the range the sub-matrix or azimuth cross-mapping data access method modes proposedcorresponding in [22 to] toeach improve step. The the proposed efficiency design of corner introduced turning. the In sub this-matrix manner, cross the-mapping two-dimensional method proposeddata access in can [22] be to of improve high bandwidth the efficiency witha of limited corner on-chip turning. memory In this resource.manner, the Through two-dimensional introducing datathe timing access analysiscan be of ofhigh the bandwidth critical path with from a limited the perspective on-chip memory of the hardware resource. structureThrough introducing design, the thesingle-node timing analysis optimized of pipelinethe critical is obtainedpath from to the instruct perspective the actual of hardwarethe hardware system structure implementation. design, the single-node optimized pipeline is obtained to instruct the actual hardware system implementation. 3.3.1. Data Access Pattern 3.3.1. Data Access Pattern The principle of the cross-mapping method is to optimize and balance the range and azimuth dataThe access principle rate. Each of the SAR cross raw- datummapping is sentmethod the DDRis to three-dimensionaloptimize and balance memory the range space and according azimuth to datathe specific access mappingrate. Each rules. SAR Theraw maindatum procedure is sent the of DDR this technology three-dimensional (shown in memory Figure7 )space is given according by the tofollowing the specific two mapping steps: rules. The main procedure of this technology (shown in Figure 7) is given by the following two steps: 1. Divide the NA × NR raw-data matrix into M × N sub-matrices of Na × Nr order. To improve the A R a r 1. Dividemapping the performance N × N raw of-data the sub-matrix,matrix into theM number× N sub- ofmatrices sub-matrix of N factors × N order. should To be improve equal to the mappingcontained performance elements of eachof the DDR sub- row.matrix, The the addressing number mappingof sub-matr rulesix factors of raw-data should matrix be equalA(x, toy) theto sub-matrixcontained elementsS(a, b) can of be each described DDR row. as follows: The addressing mapping rules of raw-data matrix A(x, y) to sub-matrix S(a, b) can be described as follows: ( xmNaaN  x = mN0 a + a 0 ≤ a ≤ Na  aa (8)  y = nNr + b 0 ≤ b ≤ N r (8) ynNbbN 0 rr where x and y represent the range and azimuth original position of raw-data matrix, respectively, where x and y represent the range and azimuth original position of raw-data matrix, respectively, and m (ranged in 0 to M − 1) and n (ranged in 0 to N − 1) determine which sub-matrix the factor and m (ranged in 0 to M − 1) and n (ranged in 0 to N − 1) determine which sub-matrix the factor is is mapped to. mapped to. 2. Map each sub-matrix into the three-dimensional storage array of the DDR according to the 2. Map each sub-matrix into the three-dimensional storage array of the DDR according to the cross- cross-mapping method. The data processing is on a bank-by-bank to rank-by-rank basis. Four mapping method. The data processing is on a bank-by-bank to rank-by-rank basis. Four factors factors of two lines of sub-matrices as a group are cross-mapped to the DDR row space. of two lines of sub-matrices as a group are cross-mapped to the DDR row space.

A0,Bn-1 A0,Bn-2 NR A0,Bn-3 Nr ... A0,3 A0,0 A0,1 … A0,N-1 0,0 0,1 ... 0,Nr-1 A0,2 1,0 1,1 ... 1,Nr-1 A0,1 i=0 A0,0 ... i=1 n ... Na A0,B A1,0 A1,1 …

Na-2,0 Na-2,1 ... Na-2,Nr-1 … A0,N-Bn Na-1,0 Na-1,1 Na-1,Nr-1 ... i=N/Bn A1,Bn-1 NA …

… A1,N-Bn … …

a N AM-1,0 AM-1,N-1 i=(M-1)×N /Bn AM-1,Bn

0,0 1,0 0,1 1,1 ... 0,Nr-1 1,Nr-1 ... Na-1,Nr-1 …

… Nr i=M×N/Bn-1 AM-1,N-Bn Raw data two – dimensional Map to three – dimensional matrix partitioning Cross-mapping method storage space of DDR Figure 7. Cross-mapping method. Figure 7. Cross-mapping method.

Each factor in a sub-matrix corresponds to a position in the DDR memory space. Assume that the factors are mapped to the positions D(i, j, k) (coordinate of row, column, and bank) of DDR, the address mapping rule can be described as follows:

Sensors 2018, 18, 725 11 of 20

Each factor in a sub-matrix corresponds to a position in the DDR memory space. Assume that the factors are mapped to the positions D(i, j, k) (coordinate of row, column, and bank) of DDR, the address mapping rule can be described as follows:  i = b(mN + n)/B )c 0 ≤ i ≤ MN/B − 1  n n j = b(a/2)c × 2 + 2b + a mod 2 0 ≤ j ≤ Na Nr − 1 (9)   k = (m + n) mod Bn 0 ≤ k ≤ Bn − 1

Simultaneous (5) and (6), the mapping rules of A(x, y) to D(i, j, k) can be described as follows:  i = b(mN + n)/B ) c  n j = b((x − mNa)/2)c × 2 + 2(y − nNr) + (x − mNa) mod 2 (10)   k = (m + n) mod Bn

The data can be accessed from any location of DDR memory space, according to (7). However, considering that read and write accesses to the DDR are burst oriented, l × N samples (l means the burst length) should be continuously stored to or loaded from the DDR during N clock cycles. For example, according to the cross-mapping method, if the l value is 4, then two lines of data (range or azimuth) can be accessed simultaneously; if the l value is 8, then two lines of range data or four lines of azimuth data can be accessed simultaneously. Since each step of SAR imaging is line-based processing, it must configure a sufficient quantity of parallel caches and FFTs to fit the data operation.

3.3.2. Two-Dimensional Bandwidth Timing Analysis Theoretically, each range line or azimuth line FFT operation is independent of the others in the SAR imaging system. In other words, the higher the degree of parallelism of the FFT (the number of FFTs can be executed simultaneously), the better the real-time performance. However, the number of processors is determined based on the computational throughput requirements, which are restricted by the bandwidth of the FFT operations, PFE operations, and DDR-FPGA communication. The one-dimensional relationship can be described as follows:

tB kB  = [nB B ] DDR, inner_memory min FFT, PFE_min min. (11) where BDDR is the DDR actual effective bandwidth and Binner_memory is the equivalent bandwidth of the FPGA memory, which is used to cache the data between the DDR and FPGA. BPFE_min is the minimum bandwidth of the three phase function estimation operations and can be considered a real-time operation, as described above. k represents the amount of FPGA internal cache memory required for a specific DDR access mode. Ideally, as described above, k is determined by the burst length l of DDR and different storage methods. BFFT represents the bandwidth of the FFT operations, and n is the parallelism of FFTs. The simulation results show that the processing delay of the PFE operations is approximately dozens to hundreds of cycles. BDDR, Binner_memory and BFFT can be described as follows:

BDDR = ηρBDDR_peak = 2ηρFDDR_IOW, (12)

Binner_memory = Finner_memoryW, (13)

BFFT = FFFTW. (14) where FDDR_IO, Finner_memory, and FFFT represent the clock frequency of the DDR I/O, the FPGA internal memory, and the FFT co-processor, respectively. W is the word length of the data processed per cycle. In the all float-point system, W can be unified in 32. BDDR_peak is the peak bandwidth of the DDR, which can be represented by the product of double FDDR_IO and W. η represents the actual DDR bandwidth loss ratio, which is determined based on the data access mode (range or azimuth) of the Sensors 2018, 18, 725 12 of 20

DDR. ρ is the utilization of burst length l. Consequently, the effectiveness of the parallel processing can be described as follows:

 ηρF W kF W = nF W 2 DDR_IO , inner_memory min FFT . (15)

According to (15), the parallelism of co-processors is determined based on specific system features such as Fddr_IO, Fm, Ff f t, and W. The two-dimensional bandwidth balance is based on one-dimensional bandwidth analysis combined with the actual system design through a reasonable design method to determine the parameters η, ρ, and k, resulting in the two-dimensional maximum n value.

4. Real-Time System Establishment In practical application, because of the two-dimensional finite correlation of SAR data, the output of samples from radar accumulated during working state can be divided into several mutually independent data blocks, a process known as granularity. Adjacent granularity must have a certain overlap of minimum length in two directions, which are equal to samples of chirp pulse width in the range dimension and samples of synthetic aperture time in the azimuth dimension. The multi-node parallel method can be adopted for real-time processing. This method is based on the assumption that basic parameters such as attitude angles and velocity are well behaved and can be described by simple models before each granularity is processed and assembled into the SAR image granularity by granularity to produce a smooth strip image without any discontinuities. The system is mainly composed of two types of nodes: one master node and several processing nodes. The architecture is shown in Figure8. The entire system is mounted in a high-reliability bus, which must be supported by most applications of the complex space environment since the bus is utilized only for the passing of telemetry and tele-control signal between the master node and the processing nodes; raw-data routing is handled by a high-speed parallel point-to-point custom bus for big-data runs. Conceptually, packed, formatted, and compressed data with attached header first pass to and are stored in a master node. The synchronized header data include time, attitude, beam position, and location information derived from real-time differential GPS and inertial navigation unit on-board the satellite. The header data are required for computing initial Doppler centroid, positioning, and other basic parameter calculations in real-time. After decompressing and header stripping, raw data via quadrature channels are processed by filtering and removing direct current. The master node sends a copy of the pre-processed data to processing node. Note that data input and output of processing node are handled by two completely separate buses to ensure the real-time performance of the system. The imaging result outputs for advanced product generation via the master node. The performance optimization of the proposed design is conducted at three levels:

• High performance: The proposed design works on addressing three critical requirements of high-performance spaceborne processing for SAR imaging: high computational throughput, large amount of memory, and high-speed data interconnect throughout the communication chain. Multiple processing nodes, in which instances featuring the same pipeline and available hardware resource, with corresponding independent mass storage DDR, are introduced to satisfy the first two requirements. Since the imaging procedure of each processing node is relatively independent, data reading or writing operation cannot occur at same time. Two unidirectional buses (read only and write only, processing node view) are designed not only for the high-speed data interconnects requirement, but also to avoid the bus conflict. • Linear scalability: The standalone master node integrates the main I/O interfaces. All of the data distribution and switching control operations are conducted by the master node. The master node is responsible for optimizing I/O requests inside or outside of the system to disallow the pipeline processing nodes that lack new input data. In addition, each processing node has its own ID. In other words, the master node can distribute raw data to the processing nodes by polling the ID list. The telemetry and tele-control signal is responsible for spawning and shutting down SensorsSensors2018 2018,,18 18,, 725 x FOR PEER REVIEW 1313of of 2020

polling the ID list. The telemetry and tele-control signal is responsible for spawning and shutting eachdown processing each processing node. Modular node. Modular and ID-based and ID design-based allows design the allows system the architecture system architecture to have the to scalabilityhave the scalabilityof hardware of architecture; hardware architecture; that is, the system that is, can the add system (or reduce) can add the functionality (or reduce) the or processingfunctionality power or processing by adding power (or deleting) by adding some (or of deleting) the modules. some of the modules. • Real-timeReal-time performance: performance: the the definition definition of real-time of real- istime the requirement is the requirement of system of that system the operations, that the suchoperations, as data transmissionsuch as data andtransmission computation, and prescribed computation, to be prescribed performed mustto be beperformed completed must within be acompleted certain interval within time, a certain i.e., accumulation interval time, time i.e., of one accumulation granularity. timeThe accumulation of one granularity. time can The be describedaccumulation as follows: time can be described as follows: Tac = NA/PRF (16) Tac N A / PRF (16) where PRF is the pulse repetition frequency of the transmitter and NA represents the azimuth where PRF is the pulse repetition frequency of the transmitter and NA represents the azimuth samples samples of one granularity. The data accumulation procedure occurs in the master node, and of one granularity. The data accumulation procedure occurs in the master node, and since the pre- since the pre-processing is real-time, the accumulation time Tac also represents the operation delay processing is real-time, the accumulation time Tac also represents the operation delay of the master of the master node. Therefore, the parallelism (i.e., the real-time required number of processing node. Therefore, the parallelism (i.e., the real-time required number of processing nodes) can be given nodes) can be given as follows: as follows:   Tac + Tin + Tpro + Tout P = TTTTacinproout (17) P= Tac (17) Tac where Tin and Tout represent the time of raw-data input and the imaging result output of the where Tin and Tout represent the time of raw-data input and the imaging result output of the processing node, respectively; Tpro is the processing delay. To obtain clearer into the processing node, respectively; Tpro is the processing delay. To obtain clearer insight into the parallel parallel processing strategy, the time-sequence diagram of multi-node parallel processing is processing strategy, the time-sequence diagram of multi-node parallel processing is shown in Figure shown in Figure9: vertically, the data are assigned to the corresponding processing node in 9: vertically, the data are assigned to the corresponding processing node in groups of P granularities; groups of P granularities; horizontally, each processing node handles data in turn with an interval horizontally, each processing node handles data in turn with an interval of P granularities. of P granularities. Theoretically, the system guarantees the outputs continuously in the case of Theoretically, the system guarantees the outputs continuously in the case of continuous inputs in continuous inputs in real-time. real-time.

Telemetry and telecontrol signal

Master node Packed, formatted and Process node compressed data Radar data analysis DDR  decompressing DDR  Header stripping

address On-chip State Headers, mapping raw data via System Machine configuration files quadrature complex I/O Interface channels multiplication

Real-time data preprocessing Reconfigurable Unit FFT engine pool  Filtering FFT engine pool  Calculating basic parameter FFT engine set  Removing Direct current On-chip Memory

Quntify Computer Core Imaging result FIFO/RAM FIFO/RAM In-buffer Out-buffer

...

FigureFigure 8.8. BlockBlock diagramdiagram ofof thethe multi-nodemulti-node parallelparallel systemsystem architecture.architecture.

Sensors 2018, 18, 725 14 of 20 Sensors 2018, 18, x FOR PEER REVIEW 14 of 20

Processing Tin Tpro Tout node 1 Granularity 1 process flow Granularity P+1 process flow

Processing

node 2 Granularity 2 process flow Granularity P+2 process flow

......

Processing Granularity P process flow node P t

Figure 9.9. The time time-sequence-sequence of multi multi-node-node processing parallel processing.

5. Realization of the Multi-NodeMulti-Node Prototype Platform To fullyfully testtest andand verifyverify thethe functionalityfunctionality and performance of the proposedproposed architecture,architecture, in this section, we firstfirst analyzeanalyze and valuevalue thethe parametersparameters bothboth ofof single-processingsingle-processing nodenode andand multi-nodemulti-node systems, and then the prototype systemsystem andand continuouscontinuous imagingimaging performancesperformances areare given.given.

5.1. Single-NodeSingle-Node Parallel Processing Analysis To validatevalidate thethe effectivenesseffectiveness of the system design methodology, in this section, we establish a prototype to realizerealize 16,38416,384 ×× 16,384 stripmap SAR image processing. The The standard raw data from Chinese Gaofen-3Gaofen-3 are represented in a single precision floating-pointfloating-point complex form, which requires 64 bits (32 bits for the real part and 32 bits for the imaginary part) for each sample data point. DDR3 SDRAM is also introduced for the overall data storage. Table 33 shows shows thethe basicbasic parametersparameters ofof thethe system according toto SectionSection3 3..

Table 3.3. Basic parameters of the single-nodesingle-node imagingimaging implementation.implementation. Parameter Value Parameter Value FPGA main frequency 100 MHz FPGA main frequency 100 MHz NA 16,384 NA 16,384 NR 16,384 NR 16,384 Na Na 32 32 Nr Nr 32 32 l l 8 8 F 400 MHz ddr_IO F Fm ddrIO_ 400 MHz200 MHz F 100 MHz f f t Fm 200 MHz W 64 bit Ffft 100 MHz W 64 bit We adopt the ping/pong memory group as caches between the DDR and FPGA; each group containsWe fouradopt SDRAMs the ping/pong with 0.125 memory MB of group storage as space. caches Each between SDRAM the is DDR suitable and for FPGA; one 16 each K × group64-bit datacontains line. four The SDRAMs burst length withl of 0.125 the DDR3MB of SDRAMstorage space. chosen Each is 8. SDRAM As noted is above, suitable for for the one range 16 operation,K × 64-bit thedata 8 line.× 64-bit The burst data length belong l of to the the DDR3 two SDRAM range lines chosen and is four 8. As azimuth noted above lines., for Thus, the range to balance operation, the two-dimensionalthe 8 × 64-bit data data belong access to rate,the two the DPRAM-cachingrange lines and four method azimuth is different lines. forThus, range to balance or azimuth the lines,two- asdimensional Figure 10 shows. data access rate, the DPRAM-caching method is different for range or azimuth lines, as Figure 10 shows.

Sensors 2018,, 18,, 725x FOR PEER REVIEW 15 of 20

0,0 1,0 0,1 1,1 0,2 1,2 0,3 1,3 ... 0,0 1,0 0,1 1,1 0,2 1,2 0,3 1,3 ...

DPRAM 0 DPRAM 1 DPRAM 2 DPRAM 3 DPRAM 0 DPRAM 1 DPRAM 2 DPRAM 3 0,0 0,1 1,0 1,1 0,0 0,1 0,2 0,3 0,2 0,3 1,2 1,3 1,0 1,1 1,2 1,3 ......

Line 0 Line 1 Line 0 Line 1 Line 2 Line 2 0,0 0,1 0,2 0,3 ... 1,0 1,1 1,2 1,3 ... 0,0 1,0 ... 0,1 1,1 ... 0,2 1,2 ... 0,3 1,3 ... (a) The range mode (b) The azimuth mode

Figure 10. Two-dimensional data read mode. Figure 10. Two-dimensional data read mode.

In range mode, DPRAM0 and DPRAM1 cache the data of line 0 and DPRAM2 and DPRAM3 cacheIn the range data mode, of line DPRAM01 in cross andorder. DPRAM1 Thus, when cache the the next data operation of line 0must and DPRAM2read data from and DPRAM3DPRAM, cachethrough the combining data of line the 1 in data cross in order. the correspondi Thus, whenng the two next DPRAMs, operation a must complete read data range from line DPRAM, can be throughobtained combining in the pipeline. the data In azimuth in the corresponding mode, four DPRAMs two DPRAMs, cache a the complete data of range four linelines can and be complete obtained inazimuth the pipeline. data can In be azimuth read out mode, from four corresponding DPRAMs cache DPRAM. the data In ofthis four manner, lines andfour complete DPRAMs azimuth can be dataoperated can bein full read parallel out from and corresponding the bandwidth DPRAM. of two-dimensional In this manner, data access four DPRAMs between DDR can be and operated caches incan full be parallelbalanced. and Taking the bandwidth the internal of two-dimensionaldata-interaction between data access caches between and FFTs, DDR k and is valued caches as can 2 bein balanced.range mode Taking and 4 the in internalazimuth data-interaction mode. Since the between burst length caches is andin full FFTs, utilization,k is valued  as is 2 valued in range as mode 1. In and 4 in azimuth mode. Since the burst length is in full utilization, ρ is valued as 1. In this design, this design, according to the actual measurement,  is 0.74 in two dimensions. According to (15), according to the actual measurement, η is 0.74 in two dimensions. According to (15), the number of the number of FFTs is chosen to be 4 and 6 for range and azimuth operation theoretically. Table 4 FFTs is chosen to be 4 and 6 for range and azimuth operation theoretically. Table4 shows a summary shows a summary of the parameters related to parallel processing. The result of [22] has also been of the parameters related to parallel processing. The result of [22] has also been shown in Table4 shown in Table 4 with the condition of using same basic parameters and similar hardware structure. with the condition of using same basic parameters and similar hardware structure. By comparing the By comparing the parameters of the corresponding positions, the proposed design has the following parameters of the corresponding positions, the proposed design has the following advantages: (1) in advantages: (1) in the range direction operation, since the cache bandwidth determines the number the range direction operation, since the cache bandwidth determines the number of FFTs, higher cache of FFTs, higher cache parallelism indicates that more FFT cores can be used to improve the range parallelism indicates that more FFT cores can be used to improve the range processing efficiency; and processing efficiency; and (2) in the azimuthal direction operation, since the DDR bandwidth (2) in the azimuthal direction operation, since the DDR bandwidth determines the numbers of FFTs, determines the numbers of FFTs, full usage of the burst length also indicates that more FFT cores can full usage of the burst length also indicates that more FFT cores can be used in azimuth processing. be used in azimuth processing. Table 4. Summary of the parameters related to parallel processing. Table 4. Summary of the parameters related to parallel processing.

RaRangenge Direction Direction Operation Operation Azimuthal Azimuthal Direction Direction Operation Operation ParameterParameter ProposedProposed [22] [ 22]Proposed Proposed [22] [22] η  0.740.74 0.9375 0.9375 0.74 0.740.74 0.74 ρ  11 1 11 10.5 0.5 k k 22 1 14 44 4 Bddr_peak 6.4 GB/s 6.4 GB/s 6.4 GB/s 6.4 GB/s Bddr Bddr_ peak 4.86.4 GB/s GB/s 6.4 GB/s 6 GB/s 6.4 GB/s 4.8 GB/s6.4 GB/s 2.37 GB/s Bm 1.6 GB/s 1.6 GB/s 1.6 GB/s 1.6 GB/s Bddr 4.8 GB/s 6 GB/s 4.8 GB/s 2.37 GB/s Bf f t 0.8 GB/s 0.8 GB/s 0.8 GB/s 0.8 GB/s

n Bm 1.64 GB/s 1.6 GB/s 2 1.6 GB/s 61.6 GB/s 2

Bfft 0.8 GB/s 0.8 GB/s 0.8 GB/s 0.8 GB/s 5.2. Single-Node Imaging Result Analysis n 4 2 6 2 In this paper, we chose the Xilinx XC6VSX315T FPGA as the platform of the imaging node. Because5.2. Single of-Node the limits Imaging of theResult hardware Analysis resource, especially the storage resource, two FFTs are adopted for parallel processing. The FPGA resource occupation is shown in Table5. Figure 11 shows the final In this paper, we chose the Xilinx XC6VSX315T FPGA as the platform of the imaging node. actual scene imaging result of the proposed single-node hardware system, and the corresponding Because of the limits of the hardware resource, especially the storage resource, two FFTs are adopted imaging quality assessment is shown in Table6. for parallel processing. The FPGA resource occupation is shown in Table 5. Figure 11 shows the final actual scene imaging result of the proposed single-node hardware system, and the corresponding imaging quality assessment is shown in Table 6.

Sensors 2018, 18, x FOR PEER REVIEW 16 of 20

Table 5. FPGA resource occupation (Xilinx xc6vlx315t).

Parameter Value Number of slice registers 134,259 (34%) Number of LUTs 122,467 (62%) Sensors 2018, 18, 725 Number of block RAMs/FIFOs 499 (67%) 16 of 20 Number of DSP48s 387 (28%)

Table 5. FPGA resource occupation (Xilinx xc6vlx315t). Table 6. Actual scene imaging quality assessment.

ParameterParameter Value Number of sliceMSE registers 134,25923.3 (34%) NumberPSNR of LUTs (dB) 122,46730 (62%) Number of blockSSIM RAMs/FIFOs 4990.99 (67%) Number of DSP48s 387 (28%) γ (dB) 4.98

Figure 11. Actual scene imaging result of the system hardware. Figure 11. Actual scene imaging result of the system hardware. Through recording the numbers of clock cycles, it takes 10.6 s and 17 W for the system hardware Table 6. Actual scene imaging quality assessment. to process SAR raw data with 16,384 × 16,384 data granularity. Table 7 shows a comparison with previous works. The time and power consumption are less than that of the related design described in [22] because the proposedParameter system can achieve round-robin assignment Value in full parallel for both range direction operations. ComparedMSE with references [2,24,31,19,32], 23.3taking into account the data granularity processed, thePSNR proposed (dB) system shows advantages in both 30processing time and power consumption. Although [21]SSIM takes only 2.8 s to process SAR raw data 0.99 with 32,768 × 32,768 data granularity, the large powerγ (dB) consumption of the GPU is unacceptable 4.98 with respect to the harsh spaceborne on-board real-time processing requirements. Through recording the numbers of clock cycles, it takes 10.6 s and 17 W for the system hardware to process SAR raw data withTable 16,384 7. ×Comparison16,384 data with granularity. previous works. Table 7 shows a comparison with previous works. The time and power consumptionData are lessWorking than that of thePower related designProcessing described Works Year Schemes in [22] because the proposed system can achieveGranularity round-robin Frequency assignment Consumption in full parallel Time for both range directionProposed operations. 2017 Compared FPGA with references16,384 [2,19 × 16,384,24,31 ,32100], takingMHZ into account17 W the data10.6 granularitys [22] 2017 FPGA + ASIC 16,384 × 16,384 100 MHZ 21 W 12.1 s processed, the proposed system shows advantages in both processing time and power consumption. [2] 2016 FPGA + Microprocess 6472 × 3328 / 68 W 8 s Although[21] [21 ] takes2016 only 2.8CPU s to + GPU process SAR32,768 raw × 32,768 data with 32,768/ × 32,768>330 W data granularity,2.8 s the large power[24] consumption 2015 ofMulti the GPU-DSP is unacceptable4096 × 4096 with respect100 MHZ to the harsh/ spaceborne2.178 s on-board real-time processing[31] 2012 requirements. CPU + ASIC 1024 × 1024 100 MHz 10 W / [19] 2008 Multi-DSP 4096 × 4096 100 MHZ 35 W 13 s [32] 1998 ASICTable 7. Comparison1020 with× 200 previous10 MHz works. 2 W /

Data Working Power Processing Works Year Schemes Granularity Frequency Consumption Time Proposed 2017 FPGA 16,384 × 16,384 100 MHZ 17 W 10.6 s [22] 2017 FPGA + ASIC 16,384 × 16,384 100 MHZ 21 W 12.1 s [2] 2016 FPGA + Microprocess 6472 × 3328 / 68 W 8 s [21] 2016 CPU + GPU 32,768 × 32,768 / >330 W 2.8 s [24] 2015 Multi-DSP 4096 × 4096 100 MHZ / 2.178 s [31] 2012 CPU + ASIC 1024 × 1024 100 MHz 10 W / [19] 2008 Multi-DSP 4096 × 4096 100 MHZ 35 W 13 s [32] 1998 ASIC 1020 × 200 10 MHz 2 W / Sensors 2018, 18, 725 17 of 20

Sensors 2018, 18, x FOR PEER REVIEW 17 of 20 5.3. Multi-Node System Architecture 5.3. Multi-Node System Architecture In the proposed design, all the data transforms occur via LVDS. To approach real spaceborne In the proposed design, all the data transforms occur via LVDS. To approach real spaceborne processing, the procedure is designed as follows: processing, the procedure is designed as follows: 1. The data simulator sends the raw data of the original image to the master node via quadrature 1. The data simulator sends the raw data of the original image to the master node via quadrature channels. The transmitted raw data are pixel-by-pixel 8: 4 compressed, and the frequency of each channels. The transmitted raw data are pixel-by-pixel 8: 4 compressed, and the frequency of each channel is 100 MHz. channel is 100 MHz. 2. Through uncompressing, filter, and direct current removing operations, the 8-bit fixed data are 2. Through uncompressing, filter, and direct current removing operations, the 8-bit fixed data are sent to corresponding processing nodes. Each of the quadrature channels is 11 bit × 100 MHz, sent to corresponding processing nodes. Each of the quadrature channels is 11 bit × 100 MHz, including 8 bits for raw data transform, 1 bit for clock, and 2 bits for processing the node chosen including 8 bits for raw data transform, 1 bit for clock, and 2 bits for processing the node chosen ID. As noted above, since these procedures can be considered to occur in real-time, the data input ID. As noted above, since these procedures can be considered to occur in real-time, the data time Tac is approximately 2.7 s. input time Tac is approximately 2.7 s. 3. AfterAfter image processing, thethe resultresult is is sent sent back back to to the the master master node node and and then then transmitted transmitted to theto the PC. The time delay of 8-bit imaging result Tout is approximately 2.7 s. PC. The time delay of 8-bit imaging result Tout is approximately 2.7 s. TheThe pulse pulse repetition repetition frequency frequency (PRF) (PRF) of Gaofen-3 of Gaofen satellite’s-3 satellite’s 50-km-width, 50-km 5-m-resolution-width, 5-m-resolution stripmap modestripmap is 1926. mod Accordinge is 1926. According to (16), it takes to (16), approximately it takes approximately 7.9 s to acquire 7.9 ones to granularityacquire one of granularity the SAR raw of data.the SAR The raw real-time data. The parameters real-time of parameters the multi-node of the system multi- arenode shown system in Tableare shown8. in Table 8.

Table 8.8. The real-timereal-time parametersparameters ofof thethe multi-nodemulti-node system.system.

ParameterParameter ValueValue (s) (s) Tac 7.9 Tac 7.9 Tinin 2.72.7 Tpro 10.610.6 T 2.7 Toutout 2.7 P 4 P 4

Thus, we selected four processingprocessing nodes to achieveachieve thethe real-timereal-time requirement.requirement. The processingprocessing board is is shown shown in Figure in Figure 12a, which12a, containswhich contains two independent two independ processingent nodes processing and the nodes corresponding and the DDRcorresponding SDRAMs. DDR The otherSDRAMs. FPGA, The chosen other as FPGA, XC5VLX30T chosen inas theXC5VLX30T prototype, in is the used prototype, for debugging, is used data for flowdebugging, control, data and flow host control, command and resolution. host command This FPGAresolution. communicates This FPGA with communicates two imaging with FPGAs two throughimaging theFPGAs 2 × GTXthrough channel. the 2 The× GTX prototype channel. single The machineprototype is single mainly machine composed is mainly of one mastercomposed board of andone master two processing board and boards, two processing as shown inboards, Figure as 12 shownb. The in main Figure system 12b. indicatorsThe main aresystem also indicators shown in Tableare also9. shown in Table 9.

(a) (b)

Figure 12. (a) Photographs of the board and (b) the prototype single machine. Figure 12. (a) Photographs of the board and (b) the prototype single machine.

Table 9. The structuring indicators of the prototype single machine system.

Indicator Value Weight 10 kg Volume 32 cm × 24 cm × 20 cm Power <100 W

Sensors 2018, 18, 725 18 of 20

Table 9. The structuring indicators of the prototype single machine system.

Indicator Value Weight 10 kg Volume 32 cm × 24 cm × 20 cm Power <100 W Sensors 2018, 18, x FOR PEER REVIEW 18 of 20

FigureFigure 1313 showsshows the the final final 12 granularity imaging imaging continuous continuous results results of of the the proposed proposed prototype prototype hardwarehardware system. The The time time delay delay of of whole whole imaging imaging procedure procedure is less less than 150 150 s. The The image image quality quality is is reliablreliable,e, and and after after a a simple simple post post-splicing-splicing processing, a complete wide image can be obtained.

Figure 13. The continuous imaging results of 12 images. Figure 13. The continuous imaging results of 12 images. 6. Conclusions 6. Conclusions In this paper, to complete the on-board real-time SAR imaging processing task, a float-point In this paper, to complete the on-board real-time SAR imaging processing task, a float-point imaging system using multi-node parallel acceleration technology is proposed. With an efficient imaging system using multi-node parallel acceleration technology is proposed. With an efficient mapping methodology, the whole SAR imaging procedure can be integrated in a single FPGA. To mapping methodology, the whole SAR imaging procedure can be integrated in a single FPGA. To satisfy satisfy the requirement of real-time performance, we designed a prototype system with one master the requirement of real-time performance, we designed a prototype system with one master node and node and several processing nodes, and verified the performance via both a point target and an actual several processing nodes, and verified the performance via both a point target and an actual scene scene imaging quality evaluation. The efficient architecture achieves high real-time performance with imaging quality evaluation. The efficient architecture achieves high real-time performance with low low power consumption. A single-processing board requires 10.6 s and consumes 17 W to focus 25- power consumption. A single-processing board requires 10.6 s and consumes 17 W to focus 25-km km width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384, and the width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384, and the prototype prototype single machine is suitable for continuous imaging processing. single machine is suitable for continuous imaging processing. The indicators of the single machine, such as weight, volume, and power, can satisfy the needs The indicators of the single machine, such as weight, volume, and power, can satisfy the needs of of spaceborne SAR imaging processing. With the development of anti-radiation reinforcement spaceborne SAR imaging processing. With the development of anti-radiation reinforcement technology technology and system fault-tolerant technology, the proposed framework is expandable and feasible and system fault-tolerant technology, the proposed framework is expandable and feasible for potential for potential spaceborne real-time SAR imaging processing. spaceborne real-time SAR imaging processing. Acknowledgments: This work was supported by the National Natural Science Foundation of China under Grant Acknowledgments: This work was supported by the National Natural Science Foundation of China under Grant 91438203,91438203, the the Chang Chang Jiang Jiang Scholars Scholars Program Program under under Grant Grant T20 T201212212122 and and the the Hundred Hundred Leading Leading Talent Talent Project Project of of BeijingBeijing Science Science and and Technology Technology under Grant Z141101001514005. AuthorAuthor Contributions: Bingyi Li,Li, ChenChen YangYang and and Hao Hao Shi Shi conceived conceived and and designed designed the the framework. framework. Bingyi Bingyi Li and Li andWenyue Wenyue Yu performed Yu performed the mappingthe mapping technology technology and a analysednd analysed the data.the data. Bingyi Bingyi Li, LongLi, Long Pang Pang and and Yizhuang Yizhuang Xie debugged the system hardware. Bingyi Li, Wenyue Yu, Yizhuang Xie and Liang Chen contributed to the hardware Xie debugged the system hardware. Bingyi Li, Wenyue Yu, Yizhuang Xie and Liang Chen contributed to the platform implementation. Bingyi Li wrote the paper. Mingming Bian and Qingjun Zhang reviewed papers and hardwaresubmitted platform comments. implementation. Bingyi Li wrote the paper. Mingming Bian and Qingjun Zhang reviewed papers and submitted comments. Conflicts of Interest: The authors declare no conflict of interest. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Franceschetti, G.; Lanari, R. Synthetic Aperture Radar Processing. In Proceedings of the IGARSS ‘94 International Geoscience and Remote Sensing Symposium, Surface and Atmospheric Remote Sensing: Technologies, Data Analysis and Interpretation, Pasadena, CA, USA, 8–12 August 1994. 2. Lou, Y.; Clark, D.; Marks, P.; Muellerschoen, R.J.; Wang, C.C. Onboard Radar Processor Development for Rapid Response to Natural Hazards. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2770–2776.

Sensors 2018, 18, 725 19 of 20

References

1. Franceschetti, G.; Lanari, R. Synthetic Aperture Radar Processing. In Proceedings of the IGARSS ‘94 International Geoscience and Remote Sensing Symposium, Surface and Atmospheric Remote Sensing: Technologies, Data Analysis and Interpretation, Pasadena, CA, USA, 8–12 August 1994. 2. Lou, Y.; Clark, D.; Marks, P.; Muellerschoen, R.J.; Wang, C.C. Onboard Radar Processor Development for Rapid Response to Natural Hazards. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2770–2776. [CrossRef] 3. Hirose, A.; Rosen, P.A.; Yamada, H.; Zink, M. Foreword to the Special Issue on Advances in SAR and Radar Technology. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 3748–3750. [CrossRef] 4. Gierull, C.H.; Vachon, P.W. Foreword to the Special Issue on Multichannel Space-Based SAR. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 4995–4997. [CrossRef] 5. Tralli, D.M.; Blom, R.G.; Zlotnicki, V.; Donnellan, A.; Evans, D.L. Satellite remote sensing of earthquake, volcano, flood, landslide and coastal inundation hazards. ISPRS J. Photogramm. Remote Sens. 2005, 59, 185–198. [CrossRef] 6. Percivall, G.S.; Alameh, N.S.; Caumont, H.; Moe, K.L.; Evans, J.D. Improving Disaster Management Using Earth Observations—GEOSS and CEOS Activities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 1368–1375. [CrossRef] 7. Joyce, K.E.; Belliss, S.E.; Samsonov, S.V.; Mcneill, S.J.; Glassey, P.J. A review of the status of satellite remote sensing and image processing techniques for mapping natural hazards and disasters. Prog. Phys. Geogr. 2009, 33, 183–207. [CrossRef] 8. Copernicus: Sentinel-1—The SAR Imaging Constellation for Land and Ocean Services. Available online: https://directory.eoportal.org/web/eoportal/satellite-missions/c-missions/copernicus-sentinel-1 (accessed on 23 March 2017). 9. TDX (TanDEM-X: TerraSAR-X Add-On for Digital Elevation Measurement). Available online: https:// directory.eoportal.org/web/eoportal/satellite-missions/t/tandem-x (accessed on 23 March 2017). 10. ALOS-2 (Advanced Land Observing Satellite-2; SAR Mission)/Daichi-2. Available online: https://directory. eoportal.org/web/eoportal/satellite-missions/a/alos-2 (accessed on 23 March 2017). 11. RCM (RADARSAT Constellation Mission). Available online: https://directory.eoportal.org/web/eoportal/ satellite-missions/r/rcm (accessed on 23 March 2017). 12. KOMPSAT-5 (Korea Multi-Purpose Satellite-5)/Arirang-5. Available online: https://directory.eoportal.org/ web/eoportal/satellite-missions/k/kompsat-5 (accessed on 23 March 2017). 13. Song, W.S.; Baranoski, E.J.; Martinez, D.R. One trillion operations per second on-board VLSI signal processor for Discoverer II space based radar. In Proceedings of the Aerospace Conference Proceedings, Big Sky, MT, USA, 25 March 2000. 14. Langemeyer, S.; Kloos, H.; Simon-Klar, C.; Friebe, L.; Hinrichs, W.; Lieske, H.; Pirsch, P. A compact and flexible multi-DSP system for real-time SAR applications. In Proceedings of the 2003 IEEE International Geoscience and Remote Sensing Symposium, IGARSS ‘03, Toulouse, France, 21–25 July 2003. 15. Wai-Chi, F.; Jin, M.Y. On board processor development for NASA’s spaceborne imaging radar with VLSI system-on-chip technology. In Proceedings of the 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), Vancouver, BC, Canada, 23–26 May 2004. 16. Le, C.; Chan, S.; Cheng, F.; Fang, W.; Fischman, M.; Hensley, S.; Johnson, R.; Jourdan, M.; Marina, M.; Parham, B.; et al. Onboard FPGA-based SAR processing for future spaceborne systems. In Proceedings of the Radar Conference, Philadelphia, PA, USA, 29 April 2004. 17. Fang, W.C.; Le, C.; Taft, S. On-board fault-tolerant SAR processor for spaceborne imaging radar systems. In Proceedings of the IEEE International Symposium on Circuits and Systems, Kobe, Japan, 23–26 May 2005. 18. Greco, J.; Cieslewski, G.; Jacobs, A.; Troxel, I.A.; George, A.D. Hardware/software Interface for High-performance Space Computing with FPGA Coprocessors. In Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 4–11 March 2006. 19. Desai, N.M.; Kumar, B.S.; Sharma, R.K.; Kunal, A.; Gameti, R.B.; Gujraty, V.R. Near Real Time SAR Processors for ISRO’s Multi-Mode RISAT-I and DMSAR. In Proceedings of the European Conference on Synthetic Aperture Radar, Friedrichshafen, Germany, 2–5 June 2008. Sensors 2018, 18, 725 20 of 20

20. Pfitzner, M.; Cholewa, F.; Pirsch, P.; Blume, H. FPGA based architecture for real-time SAR processing with integrated motion compensation. In Proceedings of the Synthetic Aperture Radar, Tsukuba, Japan, 23–27 September 2013. 21. Zhang, F.; Li, G.; Li, W.; Hu, W.; Hu, Y. Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing. Sensors 2016, 16.[CrossRef][PubMed] 22. Yang, C.; Li, B.; Chen, L.; Wei, C.; Xie, Y.; Chen, H.; Yu, W. A Spaceborne Synthetic Aperture Radar Partial Fixed-Point Imaging System Using a Field-Programmable Gate Array—Application-Specific Integrated Circuit Hybrid Heterogeneous Parallel Acceleration Technique. Sensors 2017, 17.[CrossRef][PubMed] 23. Zhou, J. Fine-grained Algorithm and Architecture for Data Processing in SAR Applications. Ph.D. Thesis, National University of Defense Technology, Changsha, China, October 2010. 24. Gu, C.F.; Chang, W.G.; Li, X.Y.; Liu, Z.H. Multi-Core DSP Based Parallel Architecture for FMCW SAR Real-Time Imaging. Radioengineering 2015, 24, 1084–1090. [CrossRef] 25. Raney, R.K.; Runge, H.; Bamler, R.; Cumming, I.G.; Wong, F.H. Precision SAR processing using chirp scaling. IEEE Trans. Geosci. Remote Sens. 1994, 32, 786–799. [CrossRef] 26. Dall, J. A New Frequency Domain Autofocus. In Proceedings of the Geoscience and Remote Sensing Symposium, Espoo, Finland, 3–6 June 1991. 27. Li, B.; Xie, Y.; Liu, X.; Deng, Y. A new reconfigurable methodology to implement the shift-and-correlate (SAC) algorithm for real-time SAR autofocusing. In Proceedings of the IET International Radar Conference, Hangzhou, China, 14–16 October 2015. 28. Ackenhusen, J.G. Real Time Signal Processing: Design and Implementation of Signal Processing Systems; Prentice Hall PTR: Upper Saddle River, NJ, USA, 1999. 29. Wen, Y.; Chen, H.; Xie, Y.; Liu, X. A New Chirp Scaling Algorithm Using Region-Constant Phase Multipliers. Trans. Beijing Inst. Technol. 2014, 3, 304–309. 30. Zhang, H.; Li, Y.; Su, Y. SAR image quality assessment using coherent correlation function. In Proceedings of the 2012 5th International Congress on Image and Signal Processing, Chongqing, China, 16–18 October 2012. 31. Bierens, L.; Vollmuller, B.J. On-board Payload Data Processor (OPDP) and its application in advanced multi-mode, multi-spectral and interferometric satellite SAR instruments. In Proceedings of the EUSAR 2012; 9th European Conference on Synthetic Aperture Radar, Nuremberg, Germany, 23–26 April 2012. 32. Franceschetti, G.; Tesauro, M.; Strollo, A.G.M.; Napoli, E. A VLSI architecture for real time processing of one-bit coded SAR signals. In Proceedings of the Ursi International Symposium on Signals, Systems, and Electronics, Pisa, Italy, 2 October 1998.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).