An Improved Estimation Algorithm of Space Targets Pose Based on Multi-Modal Feature Fusion
Jiang Hua 1, Tonglin Hao 2, Liangcai Zeng 1 and Gui Yu 1,3,*

1 Key Laboratory of Metallurgical Equipment and Control Technology, Ministry of Education, Wuhan University of Science and Technology, Wuhan 430081, China; [email protected] (J.H.); [email protected] (L.Z.)
2 School of Automation, Wuhan University of Technology, Wuhan 430081, China; [email protected]
3 School of Mechanical and Electrical Engineering, Huanggang Normal University, Huanggang 438000, China
* Correspondence: [email protected]

Abstract: Traditional methods for estimating the pose of space targets rely on artificial features to match the transformation relationship between the image and the object model. With the rapid development of deep learning, approaches based on deep neural networks (DNNs) have significantly improved the performance of pose estimation. However, current methods still suffer from complex calculation, low accuracy, and poor real-time performance. Therefore, a new pose estimation algorithm is proposed in this paper. First, a mask image of the target is obtained by an instance segmentation algorithm, and its point cloud is recovered from the depth map combined with the camera parameters. Then, correlations among the points are established to predict the pose based on multi-modal feature fusion. Experimental results on the YCB-Video dataset show that the proposed algorithm can recognize complex images at a speed of about 24 frames per second with an accuracy of more than 80%. In conclusion, the proposed algorithm achieves fast pose estimation for complex stacked objects and exhibits strong stability across different objects.

Keywords: space targets pose; deep neural networks; feature fusion; fast pose estimation; strong stability

Citation: Hua, J.; Hao, T.; Zeng, L.; Yu, G. An Improved Estimation Algorithm of Space Targets Pose Based on Multi-Modal Feature Fusion. Mathematics 2021, 9, 2085. https://doi.org/10.3390/math9172085

Academic Editor: Amir Mosavi
Received: 1 July 2021; Accepted: 26 August 2021; Published: 29 August 2021

1. Introduction

Target pose estimation refers to establishing the transformation relationship between the world coordinate system and the camera coordinate system, from which the 3D position and orientation of an object can be accurately estimated. In this way, more flexible space robot grasping and manipulation can be realized. Traditional pose estimation methods are mainly based on local feature operators [1]. Such methods are only suitable for objects with rich texture information and fail for weakly textured objects. To address the pose estimation of weakly textured objects, researchers have proposed two-dimensional plane detection methods based on edge extraction, but these rely heavily on key point matching, target detection, and a known three-dimensional model [2].

With the development of deep learning in the field of two-dimensional images, methods that estimate pose remotely and directly from images with CNN architectures have been proposed repeatedly [3]. For example, the BB8 algorithm [4] first segments the pixels of the object from the two-dimensional image; then, it uses another CNN to regress the eight corners of the object's three-dimensional bounding box and finally recovers the 6D pose of the object through the PnP algorithm. The SSD-6D algorithm [5] generates position and rotation parameters on top of the SSD target detection algorithm: a convolutional neural network first estimates an initial 6D pose of the target, and the ICP algorithm is then used for iterative correction. The common point of these methods is that they extend an existing target detection algorithm and estimate the 6D pose of the object in an RGB image by calculating the offset between the object and its bounding box. However, these algorithms extract information from RGB images alone; the lack of depth information seriously limits their application in complex environments [6].
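BB8's final step, recovering a 6D pose from the predicted 2D corner projections and the known 3D bounding-box corners, is a standard Perspective-n-Point problem. The following is a minimal sketch using OpenCV's solvePnP; the corner coordinates and intrinsics are illustrative placeholders, not values from the paper.

```python
import numpy as np
import cv2

# 3D corners of the object's bounding box in the object frame
# (illustrative values for a 10 cm cube centered at the origin).
object_corners = np.array([
    [-0.05, -0.05, -0.05], [0.05, -0.05, -0.05],
    [0.05,  0.05, -0.05], [-0.05,  0.05, -0.05],
    [-0.05, -0.05,  0.05], [0.05, -0.05,  0.05],
    [0.05,  0.05,  0.05], [-0.05,  0.05,  0.05],
], dtype=np.float64)

# 2D projections of those corners as a network such as BB8 would
# predict them (placeholder pixel coordinates).
image_corners = np.array([
    [320.0, 240.0], [400.0, 238.0], [402.0, 310.0], [318.0, 312.0],
    [330.0, 230.0], [410.0, 228.0], [412.0, 300.0], [328.0, 302.0],
], dtype=np.float64)

# Assumed pinhole intrinsics (fx, fy, cx, cy are not from the paper).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

# solvePnP returns a rotation vector (Rodrigues form) and a translation
# vector mapping object coordinates into the camera frame.
ok, rvec, tvec = cv2.solvePnP(object_corners, image_corners, K, None)
R, _ = cv2.Rodrigues(rvec)  # convert to a 3x3 rotation matrix
print("R =\n", R, "\nt =", tvec.ravel())
```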
With the popularity of 3D cameras, point cloud data derived from depth information can express not only the basic shape features of an object but also its texture features [7]. Therefore, pose estimation methods based on RGB-D have been proposed that combine local image features with the 3D point cloud to build a template library with 3D information, which effectively improves the robustness of the algorithm. To avoid interference from illumination, environment, and other factors, the representative RGB-D algorithm Linemod [8] extracts image boundary features and surface normal features of the depth map as templates, which greatly improves the stability of the algorithm. After binary coding, these features can be matched by a parallel template search, which improves the processing speed of the algorithm. The PoseCNN algorithm [9] is the first end-to-end 6D pose estimation network; it directly extracts features and then produces its output through three branch networks for image segmentation, position, and pose, after which the ICP algorithm iteratively optimizes the pose. The defect of this algorithm is that it requires a known 3D model of the target, and to improve the accuracy of pose estimation, many such algorithms add a post-processing step, which cannot meet the real-time requirements of space robot grasping [10].

Accurate remote estimation of the 6D pose is the key factor for successful space robot grasping: it provides the 3D translation coordinates and 3D rotation vectors of the object in the camera coordinate system. Traditional pose estimation methods suffer from complex calculation, low accuracy, insufficient robustness to occlusion, and poor generalization. Therefore, in complex scenes such as space robot grasping, methods based on deep learning have unparalleled advantages. In addition, compared with RGB-based pose estimation, the RGB-D method contains more modal information and is not limited to the target's texture information.
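The depth-to-point-cloud conversion that such RGB-D pipelines (including the one in this paper's abstract) rely on is the standard pinhole back-projection: a pixel (u, v) with depth d maps to ((u - cx)·d/fx, (v - cy)·d/fy, d). Below is a minimal NumPy sketch under assumed intrinsics; the parameter values are placeholders, not calibration data from the paper.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, mask=None):
    """Back-project a depth map (meters) into an (N, 3) point cloud
    in the camera frame, optionally restricted to a segmentation mask."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    if mask is not None:                 # e.g., an instance-segmentation mask
        valid &= mask.astype(bool)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx         # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # shape (N, 3)

# Usage with placeholder intrinsics and a synthetic depth map.
depth = np.random.uniform(0.5, 2.0, size=(480, 640))
cloud = depth_to_point_cloud(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)
```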
To solve the above problems, this paper builds on an instance segmentation algorithm to obtain three-dimensional data of the object from the image, extracts semantic features and point cloud features from these two heterogeneous kinds of data, and fuses the multi-modal features at the pixel level. In addition, a post-processing module for pose fine-tuning is integrated into the deep learning framework to further improve the accuracy of the algorithm without greatly increasing computation. Experiments on a common dataset show that the algorithm has high accuracy and significantly improves the real-time performance of pose estimation.

The main contributions of the proposed algorithm framework are as follows. First, a multi-modal feature fusion method is proposed that integrates the two-dimensional information contained in each pixel into the corresponding three-dimensional point cloud of the model, effectively preserving the spatial structure information of the model. Second, a pose refinement module is integrated into the end-to-end neural network to avoid dependence on ICP and other post-processing methods, which effectively improves the processing speed of the model.

The remainder of the study is arranged as follows. In Section 2, we introduce related work on pose estimation. Section 3 focuses on the construction of a new neural network that estimates the pose of the object based on feature fusion. In Section 4, the performance of the proposed neural network is discussed according to the experimental results. Finally, we conclude that the new method has faster processing speed and better generalizability, and we describe future research.

2. Related Work

The traditional architecture of a pose estimation network is to first extract local or template features and then classify the extracted features to recognize different targets [11]. However, the performance of such algorithms is seriously constrained by the artificially designed features and the fixed matching process. Therefore, pose estimation methods based on CNN architectures, such as PoseNet and PVNet [12], have appeared repeatedly. However, due to the lack of depth information, pose estimation remains a difficult problem, so remote pose estimation based on 3D data has become a trend.

The traditional approach extracts 3D features from RGB-D data by integrating the depth information as an additional image channel into a CNN-based architecture [13]. Recent algorithms instead transform the depth information into 3D data, such as voxels or point clouds, and then estimate the target pose from the 3D data. For example, Song et al. [14] estimated the target pose based on the voxelized input of a 3D convolution network and achieved good results. However, these methods still need expensive post-processing steps to make full use of the 3D data, resulting in low real-time performance. A two-dimensional image can only reflect the projection of the target at a certain angle in three-dimensional space, so its representation of the object's attributes is not accurate, whereas three-dimensional data can accurately represent the real shape and geometric structure of the target [15].
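To make the voxelized representation used by methods such as that of Song et al. [14] concrete, the following is a minimal sketch that bins a point cloud into a binary occupancy grid suitable for a 3D convolution network; the grid size and bounds are arbitrary choices for the example, not values from the cited work.

```python
import numpy as np

def voxelize(points, grid_size=32, bounds=(-1.0, 1.0)):
    """Convert an (N, 3) point cloud into a binary occupancy grid
    of shape (grid_size, grid_size, grid_size)."""
    lo, hi = bounds
    # Normalize coordinates into [0, 1), then scale to voxel indices.
    idx = ((points - lo) / (hi - lo) * grid_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)          # keep indices in range
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0   # mark occupied voxels
    return grid

# Usage: voxelize a random cloud; a 3D CNN would consume this grid.
cloud = np.random.uniform(-1.0, 1.0, size=(2048, 3))
grid = voxelize(cloud)
print(grid.shape, int(grid.sum()))
```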