
Article

A Residual Network and FPGA Based Real-Time Depth Map Enhancement System

Zhenni Li 1, Haoyi Sun 1, Yuliang Gao 2 and Jiao Wang 1,*

1 College of Information Science and Engineering, Northeastern University, Shenyang 110819, China; [email protected] (Z.L.); [email protected] (H.S.)
2 College of Artificial Intelligence, Nankai University, Tianjin 300071, China; [email protected]
* Correspondence: [email protected]

Abstract: Depth maps obtained through sensors are often unsatisfactory because of their low resolution and noise interference. In this paper, we propose a real-time depth map enhancement system based on a residual network which uses dual channels to process depth maps and intensity maps respectively and cancels the preprocessing step; the proposed algorithm achieves real-time processing at more than 30 fps. Furthermore, the FPGA design and implementation for depth sensing is also introduced. In this FPGA design, the intensity image and depth image are captured by a dual-camera synchronous acquisition system and serve as the input of the neural network. Experiments on various depth map restoration tasks show that our algorithm outperforms the existing LRMC, DE-CNN and DDTF algorithms on standard datasets and achieves better depth map super-resolution. System tests confirm that the data throughput of the USB 3.0 interface of the acquisition system is stable at 226 Mbps and supports the dual cameras working at full speed, that is, 54 fps@(1280 × 960 + 328 × 248 × 3).

Keywords: depth map enhancement; residual network; FPGA; ToF

1. Introduction

Recently, with the proposal of depth map acquisition methods such as the structured light method [1] and the time-of-flight (ToF) method [2], various consumer image sensors have been developed, such as the Microsoft Kinect and time-of-flight cameras. Meanwhile, depth maps have also been extensively studied. The depth map is the most effective way to express the depth information of 3D scenes, and is used to solve problems such as target tracking [3], image segmentation [4], and object detection [5]. It is widely used in emerging applications such as driving assistance and three-dimensional reconstruction. However, the depth image obtained through sensors is often unsatisfactory,

and this will adversely affect back-end applications. Therefore, low resolution and noise interference have become important issues in depth map research.

Many researchers have proposed various methods to reconstruct high-quality depth maps. Depth map enhancement algorithms focus on repair and super-resolution (SR). General depth map repair algorithms adopt joint filtering. The methods of depth map SR can be divided into three categories: (1) multiple depth map fusion; (2) image-guided depth map SR; (3) single depth map SR [6]. Xu et al. [7] fused depth maps through multi-resolution contourlet transform fusion, which not only retains image contour information but also improves the quality of the depth map. Li et al. [8] introduced synchronized RGB images to align with the depth image, and preprocessed the color image and the depth image to extract the effective supporting edge area, which ensured the edge details of the image while repairing the depth map. Zuo et al. [9] proposed a frequency-dependent depth map enhancement algorithm via iterative depth-guided affine transformation and intensity-guided refinement, and demonstrated improved performance in both qualitative and quantitative evaluations.


To approximate the real depth map, it is hard to train directly on the depth map to be repaired; however, if the skip structure of a residual network (ResNet) is adopted, the training goal becomes approximating the difference between the depth map to be repaired and the real depth map. Besides, the most widely used implementation platforms are the graphics processing unit (GPU) and the field-programmable gate array (FPGA). GPUs support the parallel computing that is essential for real-time performance, but they are power-hungry and therefore not suitable for embedded applications. In contrast, FPGAs are able to support stereo vision computations with lower power consumption and cost. Therefore, FPGA design for stereo vision systems has become an active research topic in recent years.

In this paper, we propose a depth map enhancement algorithm based on a residual network and introduce the FPGA design and implementation for depth sensing. The network is fully convolutional, which eliminates the preprocessing step of existing algorithms and deepens the network structure. To address the problems of deep network training, batch normalization and a residual structure are adopted. Our method uses peak signal-to-noise ratio (PSNR) and root mean square error (RMSE) for evaluation. Experiments show that the system reaches the predetermined performance indices.

In summary, the main contributions of this paper are as follows:
(1) We design a dual-camera synchronous acquisition system based on FPGA and collect the intensity map and depth map at the same time as the input of the neural network.
(2) We propose a depth map enhancement algorithm based on a residual network, extracting features from the acquired intensity map and depth map in two branches and then performing fusion and residual calculation. The results show that our proposed network outperforms the existing low-rank matrix completion (LRMC) [10], denoise and enhance convolutional neural network (DE-CNN) [11] and data-driven tight frame (DDTF) [12] algorithms on standard datasets.
(3) We have completed the system test to ensure that the data throughput of the USB interface of the acquisition system is stable at 226 Mbps and supports the dual cameras working at full speed, 54 fps@(1280 × 960 + 328 × 248 × 3).

However, some limitations of our proposal are as follows:
(1) The acquisition system consists of multiple development boards, which are bulky and not flexible enough, so it is necessary to design an all-in-one board that concentrates the FPGA, TC358748, CYUSB3014 and other chips of the acquisition system on one printed circuit board (PCB).
(2) The acquisition program runs on the central processing unit (CPU), but the current single-threaded sequential processing consumes a lot of time, which reduces the real-time performance of the system.

The rest of this paper is organized as follows. In Section 2, we discuss the related work, including residual network design and the FPGA design of the image acquisition system. In Section 3, the details of the proposed algorithm are presented. Details of the hardware architecture and system implementation are shown in Section 4. In Section 5, the experimental results are given in terms of the enhancement performance indicators of the depth map. Finally, we conclude this paper in Section 6.

2. Related Work

2.1. Residual Network Design

High-quality depth images are required in many fields, and many works introduce deep learning into various image processing applications, such as depth map enhancement. Ni et al. [13] proposed a color-guided convolutional neural network method for depth map super-resolution. They adopted a dual-stream convolutional neural network which integrates the color and depth information simultaneously, and the optimized edge map generated from the high-resolution color image and low-resolution depth map is used as additional information to refine the object boundary in the depth map.

The algorithm can effectively obtain depth maps, but it cannot take the low-resolution depth map and the high-resolution color image as inputs and directly output the high-resolution depth map. Zhou et al. [14] proposed a deep neural network structure that implements end-to-end mapping between low-resolution depth maps and high-resolution depth maps and proved that deep neural networks are superior to many state-of-the-art algorithms. Chen et al. [15] proposed a single depth map super-resolution method based on convolutional neural networks. Super-resolution is achieved by obtaining high-quality edge maps from low-quality depth images, and the high-quality edge maps are then used as the weight of the regularization term in the total variation (TV) model. Korinevkava and Makarov [16] proposed two deep convolutional neural networks to solve the problem of single depth map super-resolution. This method has good performance on the RMSE and PSNR indicators and can process depth map super-resolution in real time at over 25–30 frames per second. However, their method cannot be applied to other datasets because the quality of the results varies heavily. Li et al. [17] proposed an end-to-end convolutional neural network (CNN) combined with a residual network (ResNet-50) to learn the relationship between the pixel intensity of RGB images and the corresponding depth map. He et al. [18] developed a model of a fully connected convolutional auto-encoder where the middle layer information is also used to estimate the depth. Kumari et al. [19] developed a convolutional neural network encoder-decoder architecture whose model integrates residual connections within pooling and up-sampling layers, together with hourglass networks which operate on the encoded features. Siddiqui et al. [20] proposed a deep regression network using a transfer learning approach, where an encoder is used to extract dense features and a decoder is used to upsample and predict the desired depth. Schlemper et al. [21] proposed a framework for reconstructing image sequences from undersampled data using a deep cascade of convolutional neural networks to accelerate the data acquisition process. Li et al. [22] treated depth estimation as a sequence-to-sequence construction, using location information and attention, with dense pixel matching instead of a cost volume.

Compared with existing works, CNN is the basic architecture in depth map enhancement, and some algorithms add an encoder and decoder, adopt ResNet modules in the hidden layers of the network, or use a sequence-to-sequence model. This paper designs a residual network which uses dual channels to process depth maps and intensity maps respectively, then computes the residual between the depth map and the network output, and cancels the preprocessing step. For depth map restoration, this algorithm improves PSNR by 3 dB; for depth map super-resolution, it reduces RMSE by more than 10 times. In addition, the processing speed is also greatly improved: because the algorithm does not need the CPU to participate in the calculation, the GPU can be used to achieve good acceleration.
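For reference, the two metrics quoted above can be computed as follows. This is a minimal sketch assuming both depth maps are normalized to [0, 1] (so the PSNR peak value is 1) and stored as NumPy arrays.

```python
import numpy as np

def rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Root mean square error between a predicted and a ground-truth depth map."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def psnr(pred: np.ndarray, gt: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; peak is 1.0 for depth maps normalized to [0, 1]."""
    mse = np.mean((pred - gt) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))
```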

2.2. FPGA Design of Image Acquisition System

With the increasing non-recurring engineering (NRE) cost and design cycle of application-specific integrated circuits (ASICs), FPGA design for stereo vision has gradually become an active research topic in recent years [23]. FPGAs are able to find a good trade-off between performance, energy efficiency, fast development and cost; therefore, they are widely adopted in neural network applications [23–25].

With the progress of stereo correspondence algorithms [26,27], a large number of FPGA architectures have been proposed. Dong et al. [28] designed the image acquisition and processing system of a color sorter based on FPGA. They completed the construction of the hardware platform and the design of the image processing algorithms, and their experiment confirmed that the FPGA-based system had high-speed and high-precision performance. Manabe et al. [29] proposed a real-time processing system for super-resolution of moving images based on CNN.

The system can perform super-resolution between 960 × 540 and 1920 × 1080 at 60 fps. Shandilya et al. [30] applied FPGA-based image enhancement to the automatic vehicle number plate (AVNP) problem and achieved image enhancement by estimating the best values of various aspects of image quality. Prashant et al. [31] processed blurred or useless image information on an FPGA SPARTAN 3 XC3S500E board and obtained high-quality images by image fusion; however, the detailed FPGA implementation is not described. In contrast, Pfeifer et al. [32] implemented an active stereo vision system on a Xilinx Zynq-7030 SoC. The system provides a computation speed of 12.2 fps at a resolution of 1.3 megapixels, but it cannot deal with higher resolutions. Lee et al. [33] designed a stereo matching accelerator that achieved 41 fps@480 × 270 on a KU040 FPGA board, but it could not deal with high-resolution images, which needed to be converted to 480 × 270 resolution.

Compared with the above-mentioned systems, a synchronous real-time dual-camera acquisition system based on FPGA is implemented in this paper, which can accurately control the ToF camera and the visible light camera to acquire images synchronously and transmit the data to the back-end acceleration platform in real time through the USB 3.0 interface. The FPGA design in this paper improves computing speed and depth map quality. In addition, the data throughput of the USB interface in the proposed system can be stabilized at 226 Mbps, and it supports the dual cameras working at full speed.

3. Materials and Methods

In this section, we propose a CNN-based model to estimate the depth information from the single image by integrating the residual depth image.

3.1. CNN Architecture and ResNet

CNN architectures usually include a contractive part that progressively reduces the resolution of the input image through a series of convolution and pooling operations; in regression problems, however, if the output is expected to be a high-resolution image, some form of upsampling is required to obtain a larger output map. Moreover, with an increasing number of network layers, a degradation problem might occur. ResNet [34] is a deep convolutional network proposed in 2015 which addressed the degradation problem.

As shown in Figure 1, ResNet uses the shortcut connection as the basic structure of the network, and the output of network A can be defined as H(x) = F(x) + x, where x denotes the output of network B and F(x) represents the residual mapping to be learned. After adding the identity mapping, if F(x) converges to 0, then H(x) ≈ x, which means that network A achieves almost the same effect as network B.

Figure 1. Shortcut connection.

In Inception-v4 [35], the authors combined the residual and Inception structures and found that the residual connection was able to accelerate the training of the Inception network with a small improvement in accuracy. In this paper, we design a CNN combined with a residual network, and the specific improvements will be discussed in the next section.
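As an illustration of the shortcut structure in Figure 1, the following is a minimal sketch (not the network used in this paper) of a residual block in which the learned mapping F(x) is added back to its input, so the block only has to learn the difference from the identity.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Toy fully connected residual block: H(x) = F(x) + x.

    If the weights drive F(x) towards zero, the block degenerates to the
    identity mapping, which is exactly the behaviour described above.
    """
    fx = relu(x @ w1) @ w2   # F(x): two linear layers with a ReLU in between
    return fx + x            # shortcut connection adds the input back

# Tiny usage example with random weights (hypothetical shapes).
x = np.random.randn(4, 8)
w1 = np.random.randn(8, 8) * 0.01
w2 = np.random.randn(8, 8) * 0.01
print(residual_block(x, w1, w2).shape)  # (4, 8)
```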


3.2. Proposed Neural Network Design

The network structure designed in this paper is shown in Figure 2. The inputs are the intensity map and the depth map, following the fully convolutional and multi-branch design idea of GoogleNet [35]. The intensity map and depth map are fed into two convolutional branches separately to extract feature maps, followed by fusion and residual calculation. Compared with DE-CNN [11], there are four major improvements in our network structure:
1. Dividing the depth map and intensity map into two separate input branches, which is different from DE-CNN, where the depth map and intensity map are placed in the two channels of the input image;
2. Removing the pooling layer, which enables more accurate feature maps to be extracted;
3. Deepening the network structure, which provides better representation than the shallow network in DE-CNN. Meanwhile, to address the gradient vanishing and degradation problems when training deep neural networks, batch normalization and a residual structure are adopted;
4. Removing pre-processing. The CNN has a more robust performance compared to conventional algorithms.

Figure 2. The proposed network structure.

In this paper, we adopt the Caffe framework to implement the network shown in Figure 2. The basic data structure of Caffe is a four-dimensional array called a Blob, ordered from outside to inside as N (the number of input images per batch, axis = 0), C (the number of channels of the image, axis = 1), H (the height of the image, axis = 2) and W (the width of the image, axis = 3), i.e., N × C × H × W. We place the depth data in channel 0 and the corresponding intensity data in channel 1. Thus, the data are first split along the C dimension (axis = 1) using the Slice layer provided by Caffe, and after the features of the depth map and the intensity map have been extracted separately, the two sets of feature maps are concatenated along the C dimension (axis = 1) using the Concat layer.
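The Blob layout and the Slice/Concat operations just described can be mimicked with NumPy as a quick sanity check (a sketch only; the actual splitting and joining are performed by Caffe's Slice and Concat layers).

```python
import numpy as np

# A batch of 128 two-channel 50 x 50 inputs in N x C x H x W order:
# channel 0 holds the depth patch, channel 1 the corresponding intensity patch.
blob = np.random.rand(128, 2, 50, 50).astype(np.float32)

# Equivalent of the Slice layer: split along the C dimension (axis = 1).
depth_in = blob[:, 0:1, :, :]       # N x 1 x H x W
intensity_in = blob[:, 1:2, :, :]   # N x 1 x H x W

# ... two convolutional branches would produce feature maps here ...
depth_feat = np.repeat(depth_in, 16, axis=1)        # stand-in for a 16-channel feature map
intensity_feat = np.repeat(intensity_in, 16, axis=1)

# Equivalent of the Concat layer: join the two feature maps along axis 1.
fused = np.concatenate([depth_feat, intensity_feat], axis=1)
print(fused.shape)  # (128, 32, 50, 50)
```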

As shown in Figure 2, the convolutional layers can be divided into two categories. The first class of convolutional layers consists of four parts: convolution, bias, batch normalization and the ReLU activation function. Assume that in the $l$th convolutional layer, the $i$th output feature map and its corresponding bias term are $q_i^l$ and $b_i^l$, the $j$th input feature map is $p_j^l$, and the feature map pair $(q_i^l, p_j^l)$ corresponds to the convolution kernel $W_{ij}^l$. The first class of convolutional layers can be represented as:


q_i^l = f\left( \mathrm{BN}_{\gamma,\beta}\left( \sum_{j} W_{ij}^{l} * p_{j}^{l} + b_{i}^{l} \right) \right) \qquad (1)

The second class of convolutional layers consists of convolution and bias operations only, without batch normalization and the ReLU activation function. The second class of convolutional layers can be represented as:

q_i^l = \sum_{j} W_{ij}^{l} * p_{j}^{l} + b_{i}^{l} \qquad (2)

It is very difficult to directly train the depth map to be repaired to approximate the real depth map. However, if the ResNet structure is adopted, the training goal becomes approximating the difference between the depth map to be repaired and the real depth map. The images are normalized to [0, 1] before being input to the network, so this difference is relatively small, and empirically it is easier for ResNet to train on such data successfully.

Following the Inception-v4 network, some of the convolution kernels are 1 × 1 and 3 × 3, and in [36] the authors argue that two stacked 3 × 3 convolution filters are equivalent to one 5 × 5 convolution filter while using fewer parameters, making the network deeper and able to extract more complex features. Therefore, 3 × 3 convolution filters are adopted in this architecture. Besides, the image is zero-padded so that the output image has the same size as the input image. The two classes of convolutional layers share three settings: (1) the size of the convolution kernel is 3 × 3; (2) the stride is set to one and the edges are padded with zeros; (3) MSRA [31] initialization is applied.
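The two layer classes in Equations (1) and (2) can be sketched as follows. This is an illustrative NumPy/SciPy version for a single sample (not the Caffe implementation), assuming 3 × 3 kernels, stride 1, zero padding and MSRA (He) initialization as listed above.

```python
import numpy as np
from scipy.signal import correlate2d

def msra_init(c_out: int, c_in: int, k: int = 3) -> np.ndarray:
    """MSRA (He) initialization: zero-mean Gaussian with std sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / (c_in * k * k))
    return np.random.randn(c_out, c_in, k, k) * std

def conv_layer(p, w, b, batch_norm=True, relu=True, gamma=1.0, beta=0.0, eps=1e-5):
    """Equation (1) when batch_norm and relu are True, Equation (2) when both are False.

    p: input feature maps (C_in, H, W); w: kernels (C_out, C_in, 3, 3); b: biases (C_out,).
    """
    c_out = w.shape[0]
    q = np.empty((c_out, p.shape[1], p.shape[2]))
    for i in range(c_out):
        acc = sum(correlate2d(p[j], w[i, j], mode="same", boundary="fill", fillvalue=0.0)
                  for j in range(p.shape[0]))
        q[i] = acc + b[i]                      # sum_j W_ij * p_j + b_i
    if batch_norm:                             # per-map stand-in for BN_{gamma,beta}(.)
        mean = q.mean(axis=(1, 2), keepdims=True)
        var = q.var(axis=(1, 2), keepdims=True)
        q = gamma * (q - mean) / np.sqrt(var + eps) + beta
    if relu:                                   # f(.) = ReLU
        q = np.maximum(q, 0.0)
    return q

# Example: one first-class layer followed by one second-class layer.
x = np.random.rand(2, 50, 50)                  # depth + intensity maps
w1, b1 = msra_init(16, 2), np.zeros(16)
w2, b2 = msra_init(1, 16), np.zeros(1)
out = conv_layer(conv_layer(x, w1, b1), w2, b2, batch_norm=False, relu=False)
print(out.shape)  # (1, 50, 50)
```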

Assuming that the input is a feature vector $X = (x_1, x_2, \ldots, x_m)$ and the output feature vector is $Y = (y_1, y_2, \ldots, y_m)$, with the scaling factor $\gamma$ and the translation parameter $\beta$ initialized to 1 and 0 respectively, the batch normalization can be described as Algorithm 1.

Algorithm 1. Compute Batch Normalization. Default settings are $\gamma = 1$, $\beta = 0$, $\lambda = 0.99$, $\epsilon = 0.01$.

Input: the feature vector $X = (x_1, x_2, \ldots, x_m)$
Output: the feature vector $Y = (y_1, y_2, \ldots, y_m)$
1: function BATCHNORMALIZATION(X)
2:   $\mu_X \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$   (calculate the mean of the batch)
3:   $\sigma_X^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_X)^2$   (calculate the variance of the batch)
4:   $S_t \leftarrow \lambda S_{t-1} + (1 - \lambda)\, s_t$   (update the sliding mean and variance, where $s_t$ is the current batch statistic from lines 2–3)
5:   $X' \leftarrow (X - \mu_X) / \sqrt{\sigma_X^2 + \epsilon}$   (Z-score normalization)
6:   $Y \leftarrow \gamma X' + \beta$
7:   return Y
8: end function
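A minimal NumPy rendering of Algorithm 1 is given below for reference (an illustrative sketch, not Caffe's implementation); the `training` flag mirrors the difference between the training-time and test-time behaviour described next.

```python
import numpy as np

class BatchNorm1D:
    """Toy batch normalization following Algorithm 1 (one feature vector per call)."""

    def __init__(self, lam: float = 0.99, eps: float = 0.01):
        self.gamma, self.beta = 1.0, 0.0        # scaling and translation parameters
        self.lam, self.eps = lam, eps
        self.run_mean, self.run_var = 0.0, 1.0  # sliding mean and variance

    def __call__(self, x: np.ndarray, training: bool = True) -> np.ndarray:
        if training:
            mu, var = x.mean(), x.var()                                    # lines 2-3
            self.run_mean = self.lam * self.run_mean + (1 - self.lam) * mu  # line 4
            self.run_var = self.lam * self.run_var + (1 - self.lam) * var
        else:
            # At test time the stored sliding statistics are used directly.
            mu, var = self.run_mean, self.run_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)                         # line 5
        return self.gamma * x_hat + self.beta                              # line 6

bn = BatchNorm1D()
y_train = bn(np.random.randn(128))                    # training pass
y_test = bn(np.random.randn(128), training=False)     # test pass with sliding stats
```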

The batch normalization in Caffe consists of two parts, the BatchNorm layer and the Scale layer, so the first class of convolutional layers requires five Caffe-defined layers for implementation, namely Conv, Bias, BatchNorm, Scale and ReLU, while the second class of convolutional layers requires only two Caffe-defined layers, Conv and Bias.

The BatchNorm layer corresponds to lines 2–5 of Algorithm 1, and the Scale layer corresponds to line 6. During training, the BatchNorm layer runs lines 2–5 and, in addition to calculating the sliding mean and sliding variance, the scaling factor γ and the translation parameter β are updated according to the gradient descent algorithm; these four parameters are stored in the Caffemodel file after training. At test time, lines 2–4 are ignored and step 5 is calculated directly using the sliding mean and sliding variance stored in the Caffemodel file, which avoids depending on the batch size during testing and improves the generalization ability of the network.

3.3. Dataset and Network Training

The binary large object (BLOB) is the standard array structure and unified storage interface for the entire framework. Caffe uses Blobs to store, exchange and manipulate information; their data type is a four-dimensional array. Encoded image formats such as PNG cannot be used directly. Instead, Caffe supports three data structures, lightning memory-mapped database (LMDB), level database (LevelDB) and hierarchical data format (HDF-5), which convert training samples and labels, according to the image size, batch size and number of samples, into four-dimensional arrays corresponding to the Blob format and integrate them into a single file. In this paper, HDF-5 is adopted.
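As a concrete illustration of the HDF-5 route (a sketch under the assumption that Caffe's HDF5Data layer is used, which expects datasets named "data" and "label" plus a text file listing the .h5 files; the array shapes here are hypothetical):

```python
import h5py
import numpy as np

# Hypothetical training batch: inputs are N x 2 x 50 x 50 Blobs
# (channel 0 = degraded depth patch, channel 1 = intensity patch),
# labels are the corresponding ground-truth depth patches.
data = np.random.rand(128, 2, 50, 50).astype(np.float32)
label = np.random.rand(128, 1, 50, 50).astype(np.float32)

with h5py.File("train_0.h5", "w") as f:
    f.create_dataset("data", data=data)    # dataset name expected by the HDF5Data layer
    f.create_dataset("label", data=label)

# The HDF5Data layer is pointed at a text file with one .h5 path per line.
with open("train_h5_list.txt", "w") as f:
    f.write("train_0.h5\n")
```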

Two depth datasets, Middlebury and MPI Sintel [37], are used to provide the training and test sets. The Middlebury dataset contains depth maps taken from the real world using structured light, which allows for better training of the network, and it is also used as a benchmark when comparing the performance of various depth map restoration algorithms. However, as shown in Figure 3a, the depth maps in the dataset have some holes, which will affect the training of the restoration function, so we need to pre-process the images in this dataset.


Figure 3. The Aloe scene in the Middlebury data set: (a) original; (b) processed by LRMC.

LRMC is one of the better depth map restoration algorithms, and Figure 3b shows the result of the LRMC algorithm on the Aloe scene of the Middlebury dataset. Although Si Lu's team did not make the code publicly available, the Middlebury dataset processed by the LRMC algorithm, including 30 sets of RGBD images, is published on their website. In this paper, 24 of these images were selected for the training set and 2 for the test set. However, a training set produced with these images alone could only train the network up to the level of the LRMC algorithm at best. Comparing Figure 3a,b, the LRMC algorithm blurs the image edge details, causing some deviation from the correct depth values. The depth maps in the MPI Sintel dataset are obtained from the 3D animated short film Sintel using the optical flow method. Although it cannot restore real scenes like the Middlebury dataset, it has the advantage of being accurate and free of holes. In order to compensate for the degraded accuracy of the LRMC-processed depth maps, 64 sets of RGBD images from the Sintel dataset were selected to supplement the training set, and another 4 sets were selected to supplement the test set. From a data-driven perspective, such a training set allows the network to learn the features of real scenes while considering accuracy.

To be specific, the Ground Truth consists of two parts: 26 depth maps from the Middlebury dataset processed by the LRMC algorithm and 68 original depth maps from the Sintel dataset, which need to be normalized; the depth maps used in the input samples were not processed by the LRMC algorithm. The following method is adopted to add noise and holes to the original Y-D image: the selected RGB image is first converted to a greyscale map and normalized into the range [0, 1], and then additive white Gaussian noise (AWGN) is added to form the Y channel of the input sample; next, the original depth map is normalized into the range [0, 1], and 13% of the pixels of this image are randomly set to 0 to form the D channel of the input sample.

Deep learning requires training samples on the order of at least 10^4, therefore a simple but efficient data augmentation method is applied to guarantee sufficient training samples. To be specific, as shown in Figure 4, for the Ground Truth and the noise-added Y-D images, the data augmentation in the training process is implemented by a left-right flip combined with rotations to the four 90° orientations, so we obtain 8 times the original data. Then the stride is set to 20 to extract 50 × 50 patches, and patches with incomplete strides are discarded. Finally, there are 400,512 sets of Ground Truth and training samples in the training set, and 2590 sets of Ground Truth and test samples in the test set. During training, we input a batch of samples with a batch size of 128 per iteration, and iterate over all batches per epoch.

Figure 4. Data augmentation.
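The sample degradation and augmentation procedure described above can be sketched as follows (illustrative only; the RGB-to-grey weights, the AWGN standard deviation and the random seed handling are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade(rgb: np.ndarray, depth: np.ndarray, sigma: float = 0.05, hole_ratio: float = 0.13):
    """Build one Y-D input sample: grey + AWGN as the Y channel, depth with 13% random holes as D."""
    y = rgb.astype(np.float32) @ np.array([0.299, 0.587, 0.114])   # assumed RGB-to-grey weights
    y = np.clip(y / 255.0 + rng.normal(0.0, sigma, y.shape), 0.0, 1.0)
    d = depth.astype(np.float32) / depth.max()                     # normalize depth to [0, 1]
    d[rng.random(d.shape) < hole_ratio] = 0.0                      # 13% of pixels set to 0
    return np.stack([d, y], axis=0)                                # C x H x W, depth in channel 0

def augment(img: np.ndarray):
    """Left-right flip x four 90-degree rotations = 8 variants (over the H x W axes)."""
    for flipped in (img, img[..., :, ::-1]):
        for k in range(4):
            yield np.rot90(flipped, k, axes=(-2, -1))

def patches(img: np.ndarray, size: int = 50, stride: int = 20):
    """Extract size x size patches with the given stride, discarding incomplete ones."""
    _, h, w = img.shape
    for r in range(0, h - size + 1, stride):
        for c in range(0, w - size + 1, stride):
            yield img[:, r:r + size, c:c + size]
```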

The Adam gradient descent algorithm in the Caffe Solver is adopted in this paper, and the parameters $\beta_1$, $\beta_2$ and $\epsilon$ use the default values in the literature [38], i.e., $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. The base learning rate is fixed at $10^{-5}$. Different from the Caffe model, which sets the learning rate of the bias term to twice the learning rate of the weights, the additional learning rate multiplier of the weights is set to 1 and that of the bias term is set to 0.1 in this paper.

Assuming that $M$ denotes the number of samples, $D_i^{GT}$ denotes the ground truth, $W$ denotes the weights, and $f(D_i, Y_i, W)$ denotes the forward function of the whole network, we can mathematically describe the MSE loss term as follows:

L(W) = \frac{1}{M} \sum_{i=0}^{M-1} \frac{1}{2} \left\| f(D_i, Y_i, W) - \left( D_i^{GT} - D_i \right) \right\|^2 \qquad (3)

By adding the L2 regularization term, where the default weight decay is set to $5 \times 10^{-4}$, the total error can be denoted as:

J(W) = L(W) + \lambda \| W \|^2 \qquad (4)
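A small NumPy rendering of the training objective in Equations (3) and (4) is given below (a sketch only; the network outputs, i.e., the predicted residuals, are assumed to have been computed by a forward pass elsewhere):

```python
import numpy as np

def mse_loss(pred_residuals, depth_inputs, depth_gts):
    """Equation (3): mean over samples of 0.5 * || f(D_i, Y_i, W) - (D_i_GT - D_i) ||^2."""
    total = 0.0
    for f_out, d_in, d_gt in zip(pred_residuals, depth_inputs, depth_gts):
        total += 0.5 * np.sum((f_out - (d_gt - d_in)) ** 2)
    return total / len(pred_residuals)

def total_loss(pred_residuals, depth_inputs, depth_gts, weights, lam=5e-4):
    """Equation (4): MSE loss plus the L2 regularization term lam * ||W||^2."""
    l2 = sum(np.sum(w ** 2) for w in weights)
    return mse_loss(pred_residuals, depth_inputs, depth_gts) + lam * l2
```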

The server CPU is an Intel Core i7 6700K with 64 GB of random-access memory (RAM), and the GPU is an NVIDIA GTX 1080 with 8 GB of video memory. The software environment includes Ubuntu 16.04.4 LTS, Caffe 1.0, CUDA 8.0, CuDNN 6.0.21, TensorRT 4.0.0.3, Python 3.5 and MATLAB 2018a. In addition, two GPUs are required when setting the batch size to 128 for training from the Linux terminal, otherwise an insufficient video memory error is reported.

The training error and test error were recorded every 100 iterations and a Snapshot file was saved. After running 60,000 iterations, the Iteration-Loss curve shown in Figure 5 was obtained. It can be seen that as the training proceeds, the loss value decreases smoothly, and eventually the training and testing errors stabilize around 0.5. The best performance was obtained at the 56,000th iteration, and the Caffemodel file from this iteration will be used to test the performance of the network in the next section.


Figure 5. Iteration-Loss curve.

4. Full-Pipeline FPGA Implementation

4.1. Hardware Architecture Overview

The physical connection is shown in Figure 6. The USB 3.0 controller CYUSB3014 [39] is a chip that internally integrates the physical layers of USB 2.0 and USB 3.0, the 32-bit microprocessor ARM926EJ-S, and the General Programmable Interface II (GPIF II), a second-generation general-purpose programmable interface for communication with microcontrollers, FPGAs or image sensors. An ON Semiconductor AR0135CS was selected as the sensor for the visible light camera, and the OPNOUS OPN8008 was selected as the sensor for the ToF camera. The camera serial interface 2 (CSI-2) used by the OPN8008 is a high-speed differential serial interface for camera data transmission under the mobile industry processor interface (MIPI) specification.

Figure 6. Physical connection diagram of the acquisition system.

An overview of the proposed FPGA system is discussed here. As shown in Figure 7, the FPGA design mainly includes the initialization module, the clock generation module and the video capture module. The clock generation module is implemented in the FPGA, which uses the VCU118 on-board crystal as the clock source and the FPGA on-chip clock management module to generate the clocks, providing a 27 MHz reference clock for the AR0135CS, TC358748 and OPN8008, and a 60 MHz DMA clock for the GPIF II interface.


Figure 7. Top-level design of FPGA.

4.2. Initialization Module

The core of the initialization module is the MicroBlaze soft processor core, and the system connection is shown in Figure 8. A local memory with a capacity of 64 KB is used as the processor running memory, and its internal structure is shown in Figure 9.

The AXI4 bus and all intellectual property (IP) cores are driven by the 100 MHz clock generated by the clock wizard clk_wiz_1. The reset of clk_wiz_1 is configured as active high and its reset pin is tied to the CPU_RESET button on the VCU118; the clock source is the 250 MHz differential clock on the VCU118 board.

A MicroBlaze core has only one set of AXI4 master ports; as shown in Figure 9, M_AXI_DP is the data channel, while the instruction channel M_AXI_IP is not used here. Xilinx provides the AXI SmartConnect IP core for mounting multiple devices on the AXI4 bus. In this design, three AXI IIC IP cores are mounted on the AXI4 bus for inter-integrated circuit (I2C) communication, one AXI general-purpose input/output (GPIO) IP core is used to generate reset signals for external modules and chips, and one AXI Uartlite IP core is used to print the operating status to the serial port.

The XGpio_Initialize function is used to initialize the GPIO core, and XGpio_DiscreteWrite is then used to pull rstn_pll high to release the reset of the clock generation module. After a delay of 50 µs, by which time the clocks have stabilized, rstn_camera is pulled high to release the reset of the image sensors and the TC358748, finally completing the power-up process of the external chips.

The initial configuration of the three external chips, including the OPN8008, is completed by operating the corresponding AXI IIC IP cores.
In this paper, the I2C cores operate in the default 100 kHz standard mode, and we complete the I2C write operations by writing byte sequences to the three registers 0x100, 0x108 and 0x120 through the Xil_Out32 function [40]. The register addresses and values of the OPN8008 are 8 bits wide, while those of the AR0135CS and TC358748 are 16 bits wide.

All three chips have vendor software that generates the configuration byte sequence according to the selected function; the I2C core is then operated as described above to send the generated bytes out. The configuration produced by the Register Wizard software provided by ON Semiconductor shows that the image size and frame rate of the AR0135CS are 1280 × 960@54 fps. Since the AR0135CS is required to work in trigger mode in this paper, 0x19D8 also has to be written to the 0x301A register, which is not included in the byte sequence generated by the software.
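The difference between the 8-bit and 16-bit register conventions can be illustrated with a small helper that composes the payload of a register write (a sketch only; MSB-first ordering and the example OPN8008 register are assumptions, while the AR0135CS trigger-mode write 0x301A ← 0x19D8 is taken from the text above):

```python
def reg_write_payload(reg_addr: int, value: int, addr_bytes: int, value_bytes: int) -> bytes:
    """Byte sequence for an I2C register write: register address first, then the value (MSB first assumed)."""
    return reg_addr.to_bytes(addr_bytes, "big") + value.to_bytes(value_bytes, "big")

# OPN8008: 8-bit register address and 8-bit value (hypothetical register/value pair).
print(reg_write_payload(0x10, 0x01, 1, 1).hex())      # "1001"

# AR0135CS / TC358748: 16-bit register address and 16-bit value,
# e.g. enabling trigger mode by writing 0x19D8 to register 0x301A.
print(reg_write_payload(0x301A, 0x19D8, 2, 2).hex())  # "301a19d8"
```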



Figure 8. Initialization module.


Figure 9. Internal structure of local memory.

4.3. Clock Generation Module

The schematic of the clock generation module is shown in Figure 10. The clock sources of clk_wiz_0 and clk_wiz_1 both use the on-board crystals of the VCU118. The reset input resetn is active low and is driven by the initialization module. The locked output of each IP core is active high and is set to 1 when the phase-locked loop output is stable. The output of the Not And (Nand) operation on locked_0 and locked_1 is used as the active-high reset signal rst for the video capture module. clk_usb is the 60 MHz direct memory access (DMA) clock used to drive the video capture module and the GPIF II interface. clk_tof, clk_mipi and clk_color are the 27 MHz reference clocks for the OPN8008, TC358748 and AR0135CS.


FigureFigureFigure 10. 10.10. ClockClock Clock generator generatorgenerator schematic. schematic.schematic.

TheTheThe output outputoutput double doubledouble data datadata rate raterate register registerregister (ODDR) (ODDR)(ODDR) is isis located locatedlocated in inin the thethe IOB IOBIOB of ofof FPGA, FPGA, and and sending the above 4 clocks through ODDR is good for synchronization of clocks and data sendingsending the the above above 4 4 clocks clocks through through ODDR ODDR is is g goodood for for synchronization synchronization of of clocks clocks and and data data on the bus. However, for the three external chip reference clocks such as clk_tof, the use of onon the the bus. bus. However, However, for for the the three three external external chip chip reference reference clocks clocks such such as as clk_tof clk_tof, ,the the use use of of ODDR or not has almost no effect on the performance. The OSERDESE3 (Output SerDes, ODDRODDR or or not not has has almost almost no no effect effect on on the the performance. performance. The The OSERDESE3 OSERDESE3 (Output (Output SerDes, SerDes, Output Serializer) in Figure 10 is a specific implementation of ODDR in the UltraScale+ OutputOutput Serializer) Serializer) in in Figure Figure 10 10 is is a a specific specific implementation implementation of of ODDR ODDR in in the the UltraScale+ UltraScale+ family of chips [41]. The simulation result of the clock generation module shows that the familyfamily of of chips chips [41]. [41]. The The simulation simulation result result of of the the clock clock generation generation module module shows shows that that the the outputs of both PLLs are stable after about 6.6926.692μsµs. outputsoutputs of of both both PLLs PLLs are are stable stable after after about about 6.692μs. . 4.4. Video Capture Module 4.4.4.4. Video Video Capture Capture Module Module The structure diagram of the video capture module is shown in Figure 11, including TheThe structure structure diagram diagram of of the the video video capture capture module module is is shown shown in in Figure Figure 11, 11, including including threethree sub-modules, sub-modules, that that is, is, Trigger, Trigger, Pixel2RAM and usb_controller. Among Among them, them, the the Pixel2RAMthree sub-modules, module isthat the is, storage Trigger, controller, Pixel2RAM and thereand areusb_controller. two instances Among of Depth_a them, and the Pixel2RAM module is the storage controller, and there are two instances of Depth_a and Depth_bPixel2RAM in Depth module Branch, is the whichstorage is controller, responsible and for storingthere are the two depth instances data from of Depth_a the parallel and Depth_b in Depth Branch, which is responsible for storing the depth data from the parallel portDepth_b input in of Depth TC358748 Branch, into which UltraRAM. is responsible PixelClkD, for storing HsyncD the anddepth PixelD data from are pixel the parallel clock, port input of TC358748 into UltraRAM. PixelClkD, HsyncD and PixelD are pixel clock, fieldport synchronizationinput of TC358748 and into pixel UltraRAM. data respectively. PixelClkD, HsyncD and PixelD are pixel clock, fieldfield synchronization synchronization and and pixel pixel data data respectively. respectively.


Figure 11. Video capture module. . In this paper,FigureFigure ping-pong 11. 11. VideoVideo capturemethod module.module is proposed to improve the operation speed, and the chip select signal Select is generated by the Trigger module. Since the Trigger module In this paper, ping-pong method is proposed to improve the operation speed, and the is drivenIn this by paper, clk_usb ping-pong clock, the SelectD method output is prop fromosed the DFF to improve (D Flip-Flop) the isoperation in the PixelClkD speed, and chip select signal Select is generated by the Trigger module. Since the Trigger module is theclock chip domain select signal after a Select beat of is Select generated with PixelClkD by the Trigger. SelectD module. and its inverse Since thesignal Trigger control module a driven by clk_usb clock, the SelectD output from the DFF (D Flip-Flop) is in the PixelClkD is drivendata selector, by clk_usb and whenclock, theSelectD SelectD is high, output HsyncD_a from isthe driven DFF (Dby Flip-Flop)HsyncD and is HsyncD_b in the PixelClkD is clock domain after a beat of Select with PixelClkD. SelectD and its inverse signal control pulled low, so the pixels of the current frame will be written to Depth_a, while the clocka data domain selector, after and a whenbeat ofSelectD Selectis with high, PixelClkDHsyncD_a. SelectDis driven and by HsyncDits inverseand signalHsyncD_b control a usb_controller module reads the pixels of the previous frame from Depth_b and sends datais pulledselector, low, and so when the pixels SelectD of the is high, current HsyncD_a frame will is bedriven written by to HsyncD Depth_a, and while HsyncD_b the is pulledthemusb_controller low,to the so CYUSB3014 the module pixels reads chip of thethrough pixelscurrent the of GPIF theframe previous II interface;will framebe writtenconversely, from Depth_bto Depth_a,when and SelectD sendswhile is the usb_controllerlow,them Depth_a to the CYUSB3014 modulereads and reads chipDepth_b through the writes.pixels the GPIFof the II interface;previous conversely, frame from when Depth_bSelectD isand low, sends The intensity branch is responsible for receiving data from the parallel port of themDepth_a to the reads CYUSB3014 and Depth_b chip writes. through the GPIF II interface; conversely, when SelectD is AR0135CS, and its design is similar to the depth branch. The design of the three seed low, Depth_aThe intensity reads branch and Depth_b is responsible writes. for receiving data from the parallel port of AR0135CS, modules Pixel2RAM, Trigger and usb_controller will be introduced in the following sec- andThe its intensity design is similarbranch tois theresponsible depth branch. for receiving The design data of the from three the seed parallel modules port of tions.Pixel2RAM, Trigger and usb_controller will be introduced in the following sections. AR0135CS, and its design is similar to the depth branch. The design of the three seed modules4.4.1. Pixel2RAM Pixel2RAM, Sub-Module Trigger and usb_controller will be introduced in the following sec- tions. Figure 12 showsshows the the architecture architecture of of Pixel2RAM Pixel2RAM submodule. submodule. An An asynchronous asynchronous first first in firstin first out out (FIFO) (FIFO) is first is first used used to totransfer transfer PixelD/PixelI PixelD/PixelI from from PixelClkD/PixelClkI PixelClkD/PixelClkI clock clock do- 4.4.1.maindomain Pixel2RAM to toclk_usb clk_usb Sub-Moduleclock clock domain. domain. 
4.4.1. Pixel2RAM Sub-Module
Figure 12 shows the architecture of the Pixel2RAM sub-module. An asynchronous first in first out (FIFO) buffer is first used to transfer PixelD/PixelI from the PixelClkD/PixelClkI clock domain to the clk_usb clock domain. The FIFO is implemented using the Xilinx FIFO Generator IP core and the configuration information is listed in Table 1.


Figure 12. Pixel2RAM sub-module.


Table 1. Configurations of asynchronous FIFOs.

                  8-bit FIFO                  16-bit FIFO
Read bit width    32                          32
Read depth        512                         512
Write bit width   8                           16
Write depth       2048                        1024
Read mode         First-word Fall-through     First-word Fall-through
Storage medium    18 Kb BRAM × 1              18 Kb BRAM × 1

The 8-bit FIFOs are used in Intensity_a and Intensity_b in Figure 11, and the 16-bit FIFOs are used in Depth_a and Depth_b. The bit width of all FIFOs' read ports is 32 bits, which is designed to accommodate the GPIF II interface bit width. We invert the empty signal output by the IP core and use it as the read enable signal rd_en, so that data are read automatically whenever the FIFO is non-empty. The word order for a 1:4 write-read bit-width ratio is given in [42] for 2-bit data input; when the input is replaced with 8 bits, the 32-bit data read out is in big-endian order.

The UltraRAM implementation of simple dual port RAM (SDPRAM) can only be instantiated through a Xilinx Parameterized Macro (XPM). The data of SDPRAM can only be written from port A and read from port B, so there is only a write enable signal wea and no read enable signal. There are two limitations when SDPRAM is implemented with UltraRAM: (1) ports A and B have to use a common clock, so switching the clock domain with an asynchronous FIFO is necessary in this design; (2) the read delay is at least 3 clock cycles, so we have to shift the read signal readD/readI from usb_controller by 3 clk_usb cycles and use it as the valid signal vldD/vldI for doutb; in fact, Z−3 in Figure 12 is a shift register with a depth of 3.
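For concreteness, a small Python sketch (ours, not the RTL) of the two mechanisms just described: the 8-bit-write/32-bit-read width conversion that yields big-endian words, and the 3-cycle delay used to generate the read-valid flag:

```python
def pack_big_endian_words(byte_stream):
    """Four consecutive 8-bit FIFO writes appear as one 32-bit read word,
    with the first byte written in the most significant position."""
    words = []
    for i in range(0, len(byte_stream) - 3, 4):
        b0, b1, b2, b3 = byte_stream[i:i + 4]
        words.append((b0 << 24) | (b1 << 16) | (b2 << 8) | b3)
    return words


def delayed_valid(read_strobes, delay=3):
    """Model of the Z^-3 block in Figure 12: the UltraRAM output is valid
    'delay' clk_usb cycles after the corresponding read strobe."""
    shift_reg = [0] * delay
    valid = []
    for rd in read_strobes:
        valid.append(shift_reg[-1])          # oldest strobe becomes the valid flag
        shift_reg = [rd] + shift_reg[:-1]    # shift the register by one cycle
    return valid


print([hex(w) for w in pack_big_endian_words([0x11, 0x22, 0x33, 0x44])])  # ['0x11223344']
print(delayed_valid([1, 0, 0, 0, 0, 0]))                                  # [0, 0, 0, 1, 0, 0]
```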
4.4.2. Trigger Sub-Module
The Trigger sub-module is responsible for the control of the ping-pong operation and the generation of the image sensor trigger signal, and the core is a Finite State Machine (FSM). The structure of the module and the FSM state transfer diagram are shown in Figure 13a,b.


Figure 13. Trigger sub-module. (a) Structure, (b) FSM state transfer diagram.

S0 is the standby state: when the external trigger signal tri_externl is detected high, the FSM is transferred from S0 to S1, and the capture module is activated and starts to run. Note that the state values are encoded with a one-hot code; S0–S3 are 0001/0010/0100/1000 respectively. In this design, tri_externl is connected to a push button for triggering the camera for the first time after power-on. S1 is used to achieve the preset frame rate and complete the trigger. The operating frame rate of an image sensor in trigger mode must be greater than the trigger signal frequency to ensure that the sensor can process and send a frame within the trigger signal period. In this paper, the operating frame rate of AR0135CS is 54 fps and the frequency of the trigger signal is 30 Hz. S2 is used to wait for the usb_controller to become idle. Signal rdy and signal idle are a pair of handshake signals: idle indicates whether the usb_controller module has finished the current DMA transfer, while rdy is held for only one clock cycle and informs the usb_controller module that new pixel data is ready. S3 is used to assert the signal rdy and to invert the signal Select, and this state is held for only one period.
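The state behaviour above can be summarised with a short Python behavioral model (a sketch under our own simplifications: the trigger outputs, the Select inversion and the exact exit of S3 are not spelled out in the text, so the return to S1 is an assumption):

```python
# One-hot state codes as described above
S0, S1, S2, S3 = 0b0001, 0b0010, 0b0100, 0b1000

def trigger_fsm_step(state, timer, T, tri_external, idle):
    """One clk_usb step of the Trigger FSM sketch.
    Returns (next_state, next_timer, rdy)."""
    rdy = False
    if state == S0:                     # standby until the external trigger is seen high
        if tri_external:
            state = S1
    elif state == S1:                   # timer paces the preset 30 Hz trigger rate
        if timer == T - 1:
            timer = 0
            state = S2
        else:
            timer += 1
    elif state == S2:                   # wait until usb_controller reports idle
        if idle:
            state = S3
    elif state == S3:                   # rdy asserted for one cycle; Select would be inverted here
        rdy = True
        state = S1                      # assumption: loop back to pace the next frame
    return state, timer, rdy
```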

4.4.3. USB_Controller Sub-Module
The structure of the USB controller module and the finite-state machine (FSM) state transfer diagram are shown in Figure 14a,b. S0 is the idle state, used to complete the handshake with the Trigger module, and signal idle is active in this state. When the signal rdy is detected high, the FSM shifts from S0 to S1. State S1 is used to wait for the DMA buffer to be empty, and the FSM transitions from S1 to S2 when flaga is pulled high. S2 is the DMA write state. Dma_wr is valid only in S1 and is used to drive the SDPRAM read signals readI and readD. The state transfer in S2 is determined by both the word counter cnt_word and the package counter cnt_package. When the cnt_word count is full, usb_controller will have sent 4096 read signals continuously. At this point, if the cnt_package count is not yet full, the data in RAM has not all been read, so S2 turns to S3 and waits for flaga to be pulled high, and then S3 turns back to S1 to continue the cycle. According to the literature [43], flaga does not pull down immediately after the current buffer write is full; there is a delay of 4 bus clock cycles, so entering state S1 directly without waiting would lead to a misjudgment of the buffer empty-full state. If the cnt_package count is full, all data has been read and S2 goes back to S0 to wait for the next round of transfers. Besides, the write enable signal slwr of the DMA is driven by the NAND operation of the data valid signals vldI and vldD given by the Pixel2RAM module, without the involvement of the FSM.

Figure 14. USB_controller sub-module. (a) Structure, (b) FSM state transfer diagram.
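A compact Python sketch (ours) of the read scheduling implied by the two counters; the 105-package/4096-word split and the 32-bit word width are taken from the description above and tie in with the per-trigger payload of 1,720,320 bytes reported later in the experiments:

```python
def dma_read_schedule(packages=105, words_per_package=4096):
    """Yield the sequence of events in one full transfer: wait for flaga
    (S1/S3), then 4096 read strobes (S2) for each of the 105 DMA buffers."""
    for pkg in range(packages):
        yield ("wait_flaga", pkg)            # the DMA buffer must be empty before writing
        for word in range(words_per_package):
            yield ("read", pkg, word)        # drives readD/readI, data leaves on fdata


events = list(dma_read_schedule())
reads = sum(1 for e in events if e[0] == "read")
print(reads, reads * 4)   # 430080 32-bit words -> 1,720,320 bytes per trigger
```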

4.5. FPGA Resources Analysis
FPGA resource utilization is shown in Table 2. From the results, we could find that with an efficient design of the system, the resource utilization of the FPGA is acceptable. For example, the utilization of lookup tables (LUT) by the whole system is only 0.36%. Moreover, LUT and block RAMs (BRAM) are also two critical criteria to evaluate the resource utilization. From the results, we could find that BRAM is constrained for the proposed implementation.



Table 2. Resources utilization.

Resources     Utilization     Total         Percentage (%)
LUT           4214            1,182,240     0.36
LUTRAM        215             591,840       0.04
FF            4006            2,364,480     0.17
BRAM          18              2160          0.83
URAM          210             960           21.88
IO            101             832           12.14
BUFG          8               1800          0.44
MMCM          3               30            10.00

5. Experimental Results
5.1. Experimental Setup
With the FPGA architecture designed above, a real-time system for depth map restoration and super-resolution is implemented. The FPGA used is a Xilinx XCVU9P-L2FLGA2104 embedded in the VCU118 development board. Figure 15 shows the connection of the image acquisition system and the snapshot of the cameras and FMC+ adaptor. ON Semiconductor's AR0135CS is selected as the sensor for the visible camera, and OPN8008 from OPNOUS is selected as the sensor for the ToF camera. The main parameters of these two sensors are listed in Table 3.

Figure 15. Physical connection diagram of the realized system. (a) Connection of the image acquisition system; (b) Snapshot of cameras and FMC+ adaptor.

Table 3. Parameters of the image sensors in this paper.

                            AR0135CS                OPN8008
Working mode                Trigger mode            Trigger mode
IO level standard           1.8 V                   3.3 V
Pixel data dimensions       1280 × 960              328 × 248 × 3
Pixel data format           12 bit Monochrome       RAW12
Sensor configuration bus    I2C Bus                 I2C Bus
Pixel data transfer bus     Parallel Interface      MIPI CSI-2

5.2. Result Analysis and Comparison
5.2.1. Depth Map Repair Performance

PSNR is proposed as a measure of depth map restoration performance. It should be noted that the locations with voids are not included in the MSE when calculating the PSNR value. In testing the restoration function, we selected six scenes in the Middlebury dataset and cropped the image size to 320 × 240, added σ = 25 AWGN to the intensity map, and randomly added 5%, 10%, 15%, and 20% voids to the depth map. The PSNR performance before and after restoration is statistically shown in Tables 4 and 5. It can be seen that the algorithm proposed in this paper can improve the PSNR value by about two times.

Table 4. Original PSNR performance (dB).

             Bowling 1    Bowling 2    Cloth 1      Cloth 2      Cloth 3      Cloth 4
5% holes     18.598333    18.013552    20.250878    17.642017    20.801490    18.257972
10% holes    15.582313    15.031216    17.259421    14.635192    17.795489    15.259041
15% holes    13.813898    13.294900    15.499276    12.873909    16.028865    13.499192
20% holes    12.555023    12.036268    14.237644    11.618691    14.772910    12.242208

Table 5. PSNR performance (dB) after inpainting.

             Bowling 1    Bowling 2    Cloth 1      Cloth 2      Cloth 3      Cloth 4
5% holes     42.462463    42.173859    47.267376    44.200977    43.456047    44.102913
10% holes    42.732605    42.454369    48.292862    44.581429    44.111671    43.971207
15% holes    42.717278    42.260109    48.252022    44.728699    44.343636    44.061874
20% holes    42.735268    41.923935    47.750134    44.527683    44.169636    43.622093
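The test protocol and metric above can be sketched in a few lines of Python (an illustration, not the authors' evaluation code; we assume the excluded voids are the invalid pixels of the reference depth map, marked here as zeros, and an 8-bit dynamic range):

```python
import numpy as np

def degrade(intensity, depth, hole_ratio=0.10, sigma=25, seed=0):
    """AWGN on the intensity map and randomly placed voids in the depth map,
    mirroring the test conditions described above."""
    rng = np.random.default_rng(seed)
    noisy_intensity = intensity.astype(np.float64) + rng.normal(0.0, sigma, intensity.shape)
    holey_depth = depth.astype(np.float64).copy()
    holey_depth[rng.random(depth.shape) < hole_ratio] = 0.0   # voids marked as zeros
    return noisy_intensity, holey_depth


def masked_psnr(reference_depth, estimated_depth, peak=255.0):
    """PSNR with void locations (zeros in the reference) excluded from the MSE."""
    valid = reference_depth > 0
    diff = reference_depth[valid].astype(np.float64) - estimated_depth[valid].astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```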

LRMC and DE-CNN are the representatives of traditional restoration algorithms and deep learning-based restoration algorithms. A review of the literature [10–12] shows that the test conditions of these three algorithms are adding σ = 25 AWGN to the intensity map and randomly adding 13% voids to the depth map. The comparison of algorithm performance under the same conditions is shown in Table 6, which shows that DE-CNN has no PSNR performance advantage over the traditional algorithm LRMC, while the proposed algorithm in this paper outperforms LRMC, DE-CNN and DDTF in terms of PSNR performance on all six scenarios.

Table 6. Depth inpainting comparison (dB).

                   Bowling 1    Bowling 2    Cloth 1    Cloth 2    Cloth 3    Cloth 4
LRMC               38.22        39.37        46.24      42.35      42.37      37.43
DE-CNN             37.52        38.71        45.01      41.45      42.75      39.16
DDTF [12]          38.00        38.18        46.86      42.33      42.16      37.57
Method proposed    42.67        42.30        48.35      44.71      44.28      44.06

Figures 16 and 17 show the results of the depth map enhancement tests for the Bowling 1 and Cloth 1 scenes. These images are: (a) the scene; (b) the original depth map; (c) the depth map with 13% holes added; and (d) the restoration results.


Figure 16. Scene of bowling (a) RGB image; (b) Original depth image; (c) Depth image with 13% holes added; (d) Depth image restoration from our method.




Figure 17. Scene of clothes (a) RGB image; (b) Original depth image; (c) Depth image with 13% holes added; (d) Depth image restoration from our method.

5.2.2. Accuracy
RMSE is proposed as a measure of the super-resolution performance of the depth map, and locations with voids are not counted. In testing the super-resolution function, we selected six scenes in the Middlebury dataset, but without cropping, and added σ = 25 AWGN to the intensity map. The depth map is used without adding noise, and the down-sampling method in the literature [44] is used to obtain images with a constant field of view but lower resolution; the low-resolution depth map is then up-sampled to obtain the input of the network.

The RMSE values for up-sampling factors of 2 and 4 are given in Tables 7 and 8, and it can be seen that the performance of the proposed algorithm is greatly improved compared to the existing algorithms. The main reasons for the improvement are the following: (1) the input intensity map provides complementary information that guides the reconstruction; (2) the representation capability of the convolutional neural network is stronger than that of the conventional algorithms based on convex optimization.

Table 7. Depth super-resolution comparison, k = 2.

                   Aloe     Baby     Cones    Plastic    Teddy    Venus
Paper [45]         4.93     3.26     4.08     3.16       3.18     1.92
Paper [44]         2.89     1.81     2.13     1.81       1.74     0.98
Method proposed    0.13     0.065    0.17     0.071      0.16     0.047

Table 8. Depth super-resolution comparison, k = 4.

                   Aloe     Baby     Cones    Plastic    Teddy    Venus
Paper [45]         7.29     4.49     5.88     3.31       4.53     1.89
Paper [44]         5.12     2.97     3.73     2.63       2.86     1.67
Method proposed    0.17     0.12     0.23     0.14       0.21     0.11

Figures 18 and 19 show the results of the depth map super-resolution tests for the Cones scene and the Plastic scene. These images are, in order: (a) the scene; (b) the original depth map; (c) the results at k = 2; and (d) the results at k = 4.
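A corresponding sketch for the super-resolution test (again illustrative: the paper down-samples following [44], which is not reproduced here, and a simple nearest-neighbour up-sampling stands in for whatever interpolation produces the network input):

```python
import numpy as np

def upsample_nearest(lr_depth, k):
    """Bring a low-resolution depth map back to the network input size."""
    return np.repeat(np.repeat(lr_depth, k, axis=0), k, axis=1)


def masked_rmse(reference_depth, estimated_depth):
    """RMSE with void locations (assumed zero-valued in the reference) excluded."""
    valid = reference_depth > 0
    diff = reference_depth[valid].astype(np.float64) - estimated_depth[valid].astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))
```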


Figure 18. Scene of cones (a) RGB image; (b) Original depth image; (c) k = 2; (d) k = 4.

Figure 19. Scene of plastic (a) RGB image; (b) Original depth image; (c) k = 2; (d) k = 4.

5.2.3. Speed
The proposed algorithm does not need pre-processing, and all the processing is done by the convolutional neural network, so it has a much better running speed compared with LRMC and DE-CNN. LRMC needs 1 min (0.017 fps) to process a 320 × 240 image, and DE-CNN needs 0.083 s (12.048 fps) including the pre-processing part. As shown in Table 9, it takes about 0.0372 s to process a Y-D image with a resolution of 320 × 240 on a single GTX1080, which translates into a frame rate of 25 fps, a 1470-fold and 2-fold improvement compared with LRMC and DE-CNN, while a Y-D image with a resolution of 640 × 480 can reach 13 fps. The relationship between speed and resolution is not linear, mainly because the CuDNN library uses efficient algorithms such as Winograd to speed up the computation when implementing large-scale convolution operations.

Table 9. Elapsed time (s) without TensorRT framework.

                    Bowling 1    Bowling 2    Cloth 1      Cloth 2      Cloth 3    Cloth 4
Elapsed time (s)    0.0372044    0.0375206    0.0374086    0.0372178    0.03684    0.0374244

TensorRT further optimizes the computation of convolution on top of CuDNN, and can fuse some eligible layers together to reduce the number of scheduling operations, maximizing the computational power of the GPU and improving the efficiency of video memory usage. Since we do not quantize the weights (we still use 32-bit floating point numbers), the optimization strategy used by TensorRT for the network structure of this paper is mainly vertical layer fusion, which includes the following three optimizations: (1) the four operations of convolution, bias, batch normalization and ReLU in the first class of convolutional layers are fused into one CBR kernel; (2) the two operations of convolution and bias in the second class of convolutional layers are fused into one CBR kernel; (3) the Concat layer is cancelled by pre-allocating cache.

When the TensorRT framework is not adopted, the data scheduling and computational resource allocation among the Conv, Bias, BatchNorm, Scale and ReLU layers in the first class of convolutional layers take up a lot of time. This non-computational time overhead is greatly reduced with the TensorRT framework. Note that since TensorRT 4 does not support slice layers, we have to replace the slice layers with two input layers to input the depth map and the intensity map.

With the TensorRT framework deployment, a single GTX1080 processing speed can achieve 320 × 240@47fps and 640 × 480@38fps, which are 1.88 times and 2.92 times faster than the speed without the TensorRT framework. The whole processing process can be divided into two parts: the IO phase between host memory and GPU memory, and the GPU computation phase, and the time consumption statistics for these two parts are shown in Tables 10 and 11.

Table 10. Time consumption of each phase using TensorRT under 320 × 240 resolution.

                    IO Phase    Computing Phase    Total
Elapsed time (s)    0.000442    0.020668           0.021110
Percentage (%)      2.094       97.906             100

Table 11. Time consumption of each phase using TensorRT under 640 × 480 resolution.

                    IO Phase    Computing Phase    Total
Elapsed time (s)    0.002566    0.023228           0.025794
Percentage (%)      9.948       90.052             100
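The frame rates and speed-ups quoted in this subsection follow directly from the measured times; a few lines reproduce them (figures are rounded, hence the small differences from the 1.88×/2.92× quoted above):

```python
io_320, compute_320 = 0.000442, 0.020668     # Table 10
io_640, compute_640 = 0.002566, 0.023228     # Table 11

total_320 = io_320 + compute_320             # 0.021110 s
total_640 = io_640 + compute_640             # 0.025794 s

print(int(1 / total_320), int(1 / total_640))          # ~47 fps and ~38 fps
print(round(100 * io_320 / total_320, 3))              # 2.094 % IO share at 320 x 240
print(round(100 * io_640 / total_640, 3))              # 9.948 % IO share at 640 x 480
print(round((1 / total_320) / 25, 2), round((1 / total_640) / 13, 2))  # ~1.9x and ~3.0x
```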

Finally, we tested the acquisition speed using C++ Streamer, a speed measurement software provided by Cypress. The result is stable at 50,400 KBps = 1,720,320 Bytes × 30 fps, which shows that the frame rate control of this design is very accurate. Assuming full-speed operation at the 60 MHz bus clock without frame rate control, the upload speed can be stabilized at 231,500 KBps = 226 MBps. The maximum frame rate of AR0135CS is 54 fps at 1280 × 960, which means the acquisition system can fully support the camera running at full speed.
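The unit conversions behind these figures can be checked in a few lines (a sketch; KBps and MBps are treated as multiples of 1024 here, which is how the quoted numbers work out):

```python
BYTES_PER_TRANSFER = 1_720_320                 # one frame pair per 30 Hz trigger
print(BYTES_PER_TRANSFER * 30 / 1024)          # 50400.0 KBps, the C++ Streamer result
print(round(231_500 / 1024))                   # 226 MBps at the full 60 MHz bus speed
```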

5.3. Result of Practical Test
The practical test is shown in Figure 20 and we also calculated the PSNR, RMSE and SSIM, which are listed in Table 12, and these data basically match with the previous simulation results.

Figure 20. Results of practical test (a) Intensity image; (b) Depth image; (c) k = 2; (d) k = 4.

Table 12. PSNR, RMSE and SSIM of the practical test scenarios.

              PSNR (dB)     RMSE        SSIM
Scenario 1    42.449310     0.182983    0.949597
Scenario 2    41.113400     0.157987    0.954358
Scenario 3    41.554905     0.201898    0.933705


6. Conclusions
In this paper, we presented the design and implementation of a real-time depth map enhancement system based on a residual network. A depth map enhancement algorithm is proposed, extracting features from the acquired intensity map and depth map in two branches, and then fusion and residual calculation are performed. Besides, the proposed algorithm adopts a fully convolutional network and eliminates the pre-processing step. On a single GTX1080 graphics card, the processing speed can reach 320 × 240@25fps or 640 × 480@13fps without the TensorRT framework, which is 1470 times and 2 times faster than LRMC and DE-CNN respectively; the speed is further increased to 320 × 240@47fps or 640 × 480@38fps with the TensorRT framework. Moreover, an FPGA-based dual-camera synchronous real-time acquisition system is implemented, which can precisely control the ToF camera and the visible camera to acquire images synchronously and can transfer the data to the back-end acceleration platform in real time via the USB 3.0 interface. The experimental results show that the data throughput of the acquisition system is stable at 226 MBps and supports the dual cameras working at full speed.

Author Contributions: Conceptualization, Z.L. and Y.G.; methodology, Y.G.; software, Y.G. and J.W.; validation, Z.L., Y.G. and H.S.; formal analysis, Y.G.; writing—original draft preparation, Y.G.; writing—review and editing, H.S. and Z.L.; , H.S.; supervision, Z.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by Fundamental Research Funds for the Central Universities, grant number 2020GFYD011 and 2020 GFZD008. Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: Caffe: [https://caffe.berkeleyvision.org/tutorial/layers/hdf5data.html, accessed on 16 December 2020], Middlebury: [http://graphics.cs.pdx.edu/project/depth-enhance/, accessed on 16 December 2020], MPI Sintel [http://sintel.is.tue.mpg.de/, accessed on 16 December 2020]. Conflicts of Interest: The authors declare no conflict of interest.

References 1. Chen, S.Y.; Li, Y.F.; Zhang, J. Vision processing for realtime 3-D data acquisition based on coded structured light. IEEE Trans. Image Process. 2008, 17, 167–176. [CrossRef][PubMed] 2. Li, L.; Xiang, S.; Yang, Y.; Yu, L. Multi-camera interference cancellation of time-of-flight (TOF) cameras. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 556–560. [CrossRef] 3. Hou, Y.; Chiou, S.; Lin, M. Real-time detection and tracking for moving objects based on computer vision method. In Proceedings of the 2017 2nd International Conference on Control. and Robotics Engineering (ICCRE), Bangkok, Thailand, 1–3 April 2017; pp. 213–217. [CrossRef] 4. Cao, Y.; Shen, C.; Shen, H.T. Exploiting depth from single monocular images for object detection and semantic segmentation. IEEE Trans. Image Process. 2017, 26, 836–846. [CrossRef] 5. Raghunandan, A.; Mohana; Raghav, P.; Aradhya, H.V.R. Object detection algorithms for video surveillance applications. In Pro- ceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 3–5 April 2018; pp. 0563–0568. [CrossRef] 6. Yang, S.; Liu, J.; Fang, Y.; Guo, Z. Joint-feature guided depth map super-resolution with face priors. IEEE Trans. Cybern. 2018, 48, 399–411. [CrossRef][PubMed] 7. Xu, D.; Fan, X.; Zhao, D.; Gao, W. Multiresolution contourlet transform fusion based depth map super resolution. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 2187–2191. [CrossRef] 8. Li, W.; Hu, W.; Dong, T.; Qu, J. Depth image enhancement algorithm based on RGB image fusion. In Proceedings of the 2018 11th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 8–9 December 2018; pp. 111–114. [CrossRef] 9. Zuo, Y.F.; Fang, Y.M.; An, P.; Shang, X.; Yang, J. Frequency-dependent depth map enhancement via iterative depth-guided affine transformation and intensity-guided refinement. IEEE Trans. Multimed. 2021, 23, 772–783. [CrossRef] 10. Si, L.; Xiaofeng, R.; Feng, L. Depth enhancement via low-rank Matrix completion. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 3390–3397. Entropy 2021, 23, 546 22 of 23

11. Xin, Z.; Ruiyuan, W. Fast depth image denoising and enhancement using a deep convolutional network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 20–25 March 2016; pp. 2499–2503. 12. Wang, J.; Cai, J.F. Data-driven tight frame for multi-channel images and its application to joint color-depth image reconstruction. J. Oper. Res. Soc. China 2015, 3, 99–115. [CrossRef] 13. Ni, M.; Lei, J.; Cong, R.; Zheng, K.; Peng, B.; Fan, X. Color-guided depth map super resolution using convolutional neural network. IEEE Access 2017, 5, 26666–26672. [CrossRef] 14. Zhou, W.; Li, X.; Reynolds, D. Guided deep network for depth map super-resolution: How much can color help? In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 1457–1461. [CrossRef] 15. Chen, B.; Jung, C. Single depth image super-resolution using convolutional neural networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 1473–1477. [CrossRef] 16. Korinevskaya, A.; Makarov, I. Fast depth map super-resolution using deep neural network. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Munich, Germany, 16–20 October 2018; pp. 117–122. [CrossRef] 17. Li, B.; Dai, Y.; Chen, H.; He, M. Single image depth estimation by dilated deep residual convolutional neural network and soft-weight-sum inference. arXiv 2017, arXiv:1705.00534. 18. He, L.; Wang, G.; Hu, Z. Learning depth from single images with deep neural network embedding focal length. IEEE Trans. Image Process. 2018, 27, 4676–4689. [CrossRef][PubMed] 19. Kumari, S.; Jha, R.R.; Bhavsar, A.; Nigam, A. AUTODEPTH: Single image depth map estimation via residual CNN encoder- decoder and stacked hourglass. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019. 20. Siddiqui, S.A.; Vierling, A.; Berns, K. Multi-modal depth estimation using convolutional neural networks. arXiv 2020, arXiv:2012.09667. 21. Schlemper, J.; Caballero, J.; Hajnal, J.V.; Price, A.N.; Rueckert, D. A deep cascade of convolutional neural networks for dynamic MR image reconstruction. IEEE Trans. Med. Imaging 2018, 37, 491–503. [CrossRef][PubMed] 22. Li, Z.; Liu, X.; Creighton, F.X.; Taylor, R.H.; Unberath, M. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. arXiv 2020, arXiv:2011.02910. 23. Qian, T.; Chen, L.; Li, X.; Sun, H.; Ni, J. A 1.25 Gbps programmable FPGA I/O buffer with multi-standard support. In Pro- ceedings of the 2018 IEEE 3rd International Conference on Integrated Circuits and Microsystems (ICICM), Shanghai, China, 24–26 November 2018; pp. 362–365. [CrossRef] 24. Venieris, S.I.; Bouganis, C.-S. FpgaConvNet: Mapping regular and irregular convolutional neural networks on FPGAs. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 326–342. [CrossRef][PubMed] 25. Ahmad, A.; Pasha, M.A. Towards design space exploration and optimization of fast algorithms for convolutional neural networks (CNNs) on FPGAs. In Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), Florence, Italy, 25–29 March 2019; pp. 1106–1111. 26. Jiang, Y.; Luo, J. Target height measurement method based on stereo vision. 
In Proceedings of the 2020 International Conference on Information Science, Parallel and Distributed Systems (ISPDS), Xi’an, China, 14–16 August 2020; pp. 14–17. [CrossRef] 27. Deng, H.; Dong, P.; Li, Z.; Lyu, H.; Zhang, Y.; Luo, Y.; An, F. Robot navigation based on pseudo-binocular stereo vision and linear fitting. In Proceedings of the 2020 IEEE International Conference on Integrated Circuits, Technologies and Applications (ICTA), Nanjing, China, 23–25 November 2020; pp. 174–175. [CrossRef] 28. Dong, H.; Zhang, Y.; Chen, M.; Jin, W. Design of the image acquisition and processing system for color sorter based on FPGA. In Proceedings of the 2017 2nd International Conference on Cybernetics, Robotics and Control (CRC), Chengdu, China, 21–23 July 2017; pp. 184–188. [CrossRef] 29. Manabe, T.; Shibata, Y.; Oguri, K. FPGA implementation of a real-time super-resolution system with a CNN based on a residue number system. In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, VIC, Australia, 11–13 December 2017; pp. 299–300. [CrossRef] 30. Shandilya, R.; Sharma, R.K. FPGA implementation of image enhancement technique for Automatic Vehicles Number Plate detection. In Proceedings of the 2017 International Conference on Trends in Electronics and Informatics (ICEI), Tirunelveli, India, 11–12 May 2017; pp. 1010–1017. [CrossRef] 31. Prashant, G.P.; Jagdale, S.M. Information fusion for images on FPGA: Pixel level with pseudo color. In Proceedings of the 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), Aurangabad, India, 5–6 October 2017; pp. 185–188. [CrossRef] 32. Pfeifer, M.; Scholl, P.M.; Voigt, R.; Becker, B. Active stereo vision with high resolution on an FPGA. In Proceedings of the 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), San Diego, CA, USA, 28 April–1 May 2019; pp. 118–126. [CrossRef] Entropy 2021, 23, 546 23 of 23

33. Lee, Y.; Choi, S.; Lee, E.; Lee, S.; Jang, S. A real-time AD-census stereo matching based on FPGA. In Proceedings of the 2019 19th International Conference on Control, Automation and Systems (ICCAS), Jeju, Korea, 15–18 October 2019; pp. 1622–1624. [CrossRef] 34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. 35. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2016. 36. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; pp. 1–14. 37. Butler, D.J.; Wulff, J.; Stanley, G.B.; Black, M.J. A naturalistic open source movie for optical flow evaluation. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 611–625. 38. Kingma, D.P.; Ba, J.L.A. A method for Stochastic optimization. arXiv 2017, arXiv:1412.6980. 39. Cypress. EZ-USB FX3 SuperSpeed USB Controller 2012. Available online: www.cypress.com (accessed on 16 December 2020). 40. Xilinx. AXI IIC Bus Interface LogiCORE IP Product Guide (PG090). 2016. Available online: https://www.xilinx.com/support/ documentation/ip_documentation/axi_iic/v1_02_a/pg090-axi-iic.pdf (accessed on 16 December 2020). 41. Xilinx. UltraScale Architecture Libraries Guide (UG974). 2019. Available online: www.xilinx.com (accessed on 16 December 2020). 42. Xilinx. FIFO Generator LogiCORE IP Product Guide (PG057). 2017. Available online: https://www.xilinx.com/support/ documentation/ip_documentation/fifo_generator/v9_3/pg057-fifo-generator.pdf (accessed on 16 December 2020). 43. Cypress. Designing with the EZ-USB FX3 Slave FIFO Interface (AN75705). 2018. Available online: www.cypress.com (accessed on 16 December 2020). 44. Mandal, S.; Bhavsar, A.; Sao, A.K. Depth map restoration from under-sampled data. IEEE Trans. Image Process. 2017, 26, 119–134. [CrossRef][PubMed] 45. Aodha, O.M.; Campbell, N.D.F.; Nair, A.; Brostow, G.J. Patch based synthesis for single depth image super-resolution. In Proceedings of the 2012 Springer European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 71–84.