mathematics Article

YOLOMask, an Instance Segmentation Algorithm Based on Complementary Fusion Network

Jiang Hua 1, Tonglin Hao 2, Liangcai Zeng 1 and Gui Yu 1,3,*

1 Key Laboratory of Metallurgical Equipment and Control Technology, Ministry of Education, Wuhan University of Science and Technology, Wuhan 430081, China; [email protected] (J.H.); [email protected] (L.Z.)
2 School of Automation, Wuhan University of Science and Technology, Wuhan 430081, China; [email protected]
3 School of Mechanical and Electrical Engineering, Huanggang Normal University, Huanggang 438000, China
* Correspondence: [email protected]

Abstract: Object detection and segmentation can improve the accuracy of image recognition, but traditional methods can only extract shallow information about the target, so their performance is subject to many limitations. With the development of neural network technology, semantic segmentation algorithms based on deep learning can obtain the category of each pixel. However, such algorithms cannot effectively distinguish individual objects of the same category, so this paper proposes YOLOMask, an instance segmentation algorithm based on a complementary fusion network. Experimental results on the public COCO2017 dataset show that the proposed fusion network accurately obtains the category and location of each instance and has good real-time performance.

Keywords: image segmentation; deep learning; instance segmentation; fusion network

Citation: Hua, J.; Hao, T.; Zeng, L.; Yu, G. YOLOMask, an Instance Segmentation Algorithm Based on Complementary Fusion Network. Mathematics 2021, 9, 1766. https://doi.org/10.3390/math9151766

Academic Editor: Frank Werner
Received: 24 May 2021; Accepted: 23 July 2021; Published: 27 July 2021

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Object detection and segmentation based on RGB images are the basis of 6D pose estimation and the premise of successful robot grasping. Accurate detection and segmentation can improve the accuracy of image recognition, but traditional methods can only extract shallow information about the target, so they are limited in many conditions. With the development of deep learning in image classification, object detection, semantic segmentation, and instance segmentation have all made great progress [1]. Object detection is the basis of the latter two tasks and mainly comprises one-stage and two-stage methods. Both semantic segmentation and instance segmentation classify the image at the pixel level, but an instance segmentation model must additionally obtain the category and location of each pixel [2]. Instance segmentation therefore incurs a higher computational cost. How to maintain high accuracy and real-time performance in complex robot grasping environments is the key problem addressed in this paper.

At present, there are two main frameworks for instance segmentation based on deep learning [3]. The first determines each object's category and location with an object detection algorithm and then segments each instance at the pixel level [4]. Its disadvantage is that it cuts off the connection between the pixels inside and outside the bounding box and does not fully consider the context, which lowers the accuracy of the final segmentation. The second clusters the pixels of each instance in the image directly: the offset of each instance pixel is calculated from the distance between the object center and that pixel, and the instance is then segmented without a bounding box [5]. Its disadvantage is that it cannot segment the edges of large-scale objects well. At the same time, most instance segmentation algorithms focus on improving model performance while ignoring the real-time requirements of actual robot operation [6]. The goal of this paper is therefore to design an instance segmentation model that balances accuracy and speed to fill this gap.

The essence of instance segmentation is to classify each pixel in the image according to shared instance attributes such as texture, color, brightness, or distance [7]. In the early stage of 6D pose estimation, the contour of the object to be grasped must be segmented accurately while preserving the speed required for robot grasping. This paper proposes a new instance segmentation algorithm. In the new model, an optimized YOLOv4 (you only look once) algorithm first classifies and locates the salient objects in the image; a dense UNet (U-shaped network) semantic segmentation algorithm then classifies the objects' pixels; finally, accurate contour and location results of the objects are obtained.
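The box-free framework described above groups pixels by the offsets they predict toward their object center. A minimal sketch of that grouping step (illustrative only: the function name, the candidate-center input, and the fixed distance threshold are assumptions for the example, not the paper's implementation):

```python
import numpy as np

def cluster_by_center_offset(offsets, centers, fg_mask, max_dist=2.0):
    """Assign each foreground pixel to the instance center its predicted
    offset points at (a simplified, box-free grouping step).

    offsets: (H, W, 2) predicted (dy, dx) from each pixel to its object center
    centers: (N, 2) candidate instance centers as (y, x)
    fg_mask: (H, W) boolean foreground mask
    Returns an (H, W) label map: 0 = background, 1..N = instance id.
    """
    labels = np.zeros(fg_mask.shape, dtype=int)
    ys, xs = np.nonzero(fg_mask)
    # Each pixel "votes" for the location its offset vector points to.
    voted = np.stack([ys, xs], axis=1) + offsets[ys, xs]
    # Distance from every vote to every candidate center.
    d = np.linalg.norm(voted[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    # Only keep pixels whose vote lands close enough to some center.
    ok = d[np.arange(len(ys)), nearest] <= max_dist
    labels[ys[ok], xs[ok]] = nearest[ok] + 1
    return labels
```

In a real model the offsets and the foreground mask would come from network heads; here they are given directly to isolate the clustering logic.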
The architecture adopts a dense encoder–decoder structure, which has better spatial connectivity and retains the features learned in the shallow layers of the network, so it achieves better localization and segmentation. Experiments on public datasets verify that the new algorithm retains good real-time performance while ensuring accuracy.

The remainder of the study is arranged as follows: Section 2 introduces related work on image detection and image segmentation. Section 3 focuses on the construction of a neural network that can segment each instance in an image. Section 4 discusses the performance of the proposed network according to the experimental results. Finally, we conclude that the new method has higher precision and generalizability, and describe future research.

2. Related Work

In robot grasping, target recognition and localization are very important, as they are the premise and foundation of later pose estimation. How to improve the accuracy and speed of object detection is therefore a key problem in robot grasping applications [8]. Chronologically, the development of object detection methods falls into two stages: traditional mathematical modeling and deep learning. Traditional object detection methods mainly construct image features based on mathematical modeling. Representative features include the Harris corner detector [9], ORB (oriented FAST and rotated BRIEF) [10], SIFT (scale-invariant feature transform) [11], SURF (speeded-up robust features) [12], HOG (histogram of oriented gradients), LBP (local binary patterns), and DPM (deformable parts model) [13]. However, these features mainly capture shallow information such as color, texture, and shape, so they have great limitations and are easily disturbed by the external environment. As Krizhevsky et al.
developed the AlexNet algorithm based on a deep convolutional neural network, object detection methods were able to break through these traditional limitations [14]. Deep learning methods process the data with convolutional neural networks, extract high-order semantic features of the image, and improve the robustness of object detection.

The idea of an object detection algorithm is to find the object and then identify its category and location, which can be done with two types of solutions [15]. One is the two-step approach, which recognizes objects within rectangles derived from target candidate regions. The other is the one-step approach, which obtains detection results directly from the image without searching candidate regions [16]. Early object detection algorithms mainly relied on selective search to extract target candidate boxes, which created a speed bottleneck. With the introduction of the RPN (region proposal network) in Faster R-CNN, end-to-end object detection based on deep learning became possible, and both speed and performance improved greatly [17]. The YOLO and SSD (single shot multibox detector) networks regress the target box position directly without extracting candidate boxes, so they run faster, but their accuracy is not as good as the former. With continuous upgrading and optimization of the network, there are four main versions of the YOLO algorithm [18]. YOLOv1 first divides the image into several grids and then predicts two bounding boxes in each grid cell. The final results are filtered according to the probability that a bounding box contains an object and the NMS (non-maximum suppression) method [19]. The advantages of this version are its speed and generalization ability, and it achieves object detection with an end-to-end network.
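The NMS filtering step mentioned above can be sketched as a standard greedy procedure (a generic textbook implementation, not the paper's code): keep the highest-scoring box, discard every remaining box that overlaps it beyond an IoU threshold, and repeat.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept one too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep
```

For example, two near-duplicate detections of the same object (IoU above the threshold) collapse to the single higher-scoring box, while a distant detection survives.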
However, its recall is lower, and each grid cell corresponds to only one object, so detection results for small objects are poor. To further improve accuracy and real-time performance, YOLOv2 adds a BN (batch normalization) layer after each convolutional layer, before the activation function. By normalizing the processed data, the offset can be eliminated and the training speed of the model greatly improved. The BN formulas are as follows:

\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{V[x^{(k)}]}}   (1)

y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}

where x^{(k)} is the sampled data, E[x^{(k)}] is the mean value of the batch data, V[x^{(k)}] is the variance, and \gamma^{(k)} and \beta^{(k)} are the learned scale and shift parameters. The purpose is to distribute the data processed by BN evenly over the domain of the activation function so as to maximize the effect of the activation function.
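Equation (1) can be checked numerically with a small sketch. Note the eps term is an addition to guard against division by zero (standard in practice, though omitted from the formula above); the inputs are made-up sample data.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization as in Eq. (1): standardize each feature over the
    batch dimension, then apply the scale gamma and shift beta."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta

# A batch of 3 samples with 2 features on very different scales.
x = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])
y = batch_norm(x)
# Each feature column now has (approximately) zero mean and unit variance,
# regardless of its original scale.
```

With gamma = 1 and beta = 0 this is pure standardization; during training, gamma and beta are learned so the network can recover any scale and shift the activation function needs.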