This document is downloaded from DR‑NTU (https://dr.ntu.edu.sg) Nanyang Technological University, Singapore.
Optimization of object classification and recognition for e‑commerce logistics
Ren, Meixuan
2018
Ren, M. (2018). Optimization of object classification and recognition for e‑commerce logistics. Master’s thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/75867 https://doi.org/10.32657/10356/75867
Downloaded on 02 Oct 2021 19:14:16 SGT

OPTIMIZATION OF OBJECT CLASSIFICATION AND RECOGNITION FOR E-COMMERCE LOGISTICS

REN MEIXUAN

SCHOOL OF MECHANICAL AND AEROSPACE ENGINEERING
NANYANG TECHNOLOGICAL UNIVERSITY
2018

OPTIMIZATION OF OBJECT CLASSIFICATION AND RECOGNITION FOR E-COMMERCE LOGISTICS

Submitted by
REN MEIXUAN
Robotics Research Centre
School of Mechanical and Aerospace Engineering
A thesis submitted to the Nanyang Technological University
in partial fulfillment of the requirement for the degree of
Master of Engineering (Mechanical Engineering)
2018

ABSTRACT
E-commerce, online transaction in the information-based society, draws on various technologies to automate the order picking process and raise supply-chain productivity. Robotic systems like Amazon Kiva are applied in logistics warehouses for low labor cost and high efficiency. The Amazon Robotic Challenge (ARC) in 2017 aimed to explore a solution to the bin picking problem in cluttered environments, a common situation in logistics warehouses. Since the perception strategy is a key factor in picking performance, this thesis proposes a robust vision-based approach to object recognition for the robotic system of Team Nanyang in ARC.
In this thesis, traditional methods and deep learning methods for object recognition are reviewed and verified. Five perception approaches based on GMS (Grid-based Motion Statistics), CNN (convolutional neural network) and image differencing are proposed to achieve the order picking. First, experiments with GMS + fixed sliding window, CNN + fixed sliding window and CNN + dynamic sliding window are designed and conducted. Then two hybrid methods, which combine CNN + dynamic sliding window with GMS and with image differencing respectively, are proposed and tested to obtain a more accurate suction point. Finally, after comparing all the experimental results, the conclusion is drawn that CNN + dynamic sliding window + image differencing is a robust perception method for object recognition in the unstructured workspaces of logistics warehouses.
ACKNOWLEDGEMENTS
I would like to express my sincere appreciation to Prof. Chen I-Ming for his valuable encouragement, advice and guidance.
My special thanks go to the people who helped me during the project, including my project leader Dr. Albert Causo and my research group members Ms. Pang Wee Ching, Mr. Chong Zheng Hao, Mr. Ramamoorthy Luxman, Mr. Kok Yuan Yik, Mr. Weng Ching-Yeng, Mr. Zhao Yi and Mr. Hendra Suratno Tju. Without their help and rich experience, my research could not have proceeded so smoothly.
Finally, I wish to acknowledge the support and inspiring discussions provided by my family and friends.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 Introduction
1.1 Background and Motivation
1.2 Objectives
1.3 Organization
CHAPTER 2 Literature Review
2.1 Traditional Methods
Determination of feature
Feature detector
Feature descriptor
Feature matcher
2.2 Deep Learning Methods
LeNet
AlexNet
ZFNet
VGGNet
Inception (GoogLeNet)
ResNet
Mask R-CNN
CHAPTER 3 Problem Statement
3.1 Problem Statement
3.2 Introduction of ARC
3.3 Decision of Methodology
CHAPTER 4 Recognition Approaches
4.1 Introduction of GMS and CNN
GMS
CNN
4.2 GMS + Fixed Sliding Window
4.3 CNN + Fixed Sliding Window
4.4 CNN + Dynamic Sliding Window
4.5 CNN + Dynamic Sliding Window + GMS
4.6 CNN + Dynamic Sliding Window + Image Differencing
CHAPTER 5 Experiment and Results
5.1 Experimental Setup
5.2 Data Acquirement and Training
Database setup
Training process
Training results
5.3 Experimental Results
Accuracy calculation
GMS + fixed sliding window
CNN + fixed sliding window
CNN + dynamic sliding window
CNN + dynamic sliding window + GMS
CNN + dynamic sliding window + image differencing
5.4 Discussion
Improvement process
Results comparison
CHAPTER 6 Conclusions and Perspectives
6.1 Conclusions
6.2 Perspectives
REFERENCES
LIST OF FIGURES
Figure 1 General process for the vision system
Figure 2 Neighborhood feature points for GMS matcher
Figure 3 Architecture of LeNet (CNN) [35]
Figure 4 Structure for the storage system
Figure 5 Sample image acquired by the robot system
Figure 6 Amazon Robotic Challenge dataset in 2017 [45]
Figure 7 Methodologies summary for ARC 2015
Figure 8 Robotic system of Delft in ARC 2016 [48]
Figure 9 Overview flowchart of the exploration process
Figure 10 Architecture for GoogLeNet Inception v3 [42]
Figure 11 Introduction of the layers in GoogLeNet Inception v3
Figure 12 Neuron structure of the convolution layer [49]
Figure 13 Recognition process of GMS + fixed sliding window
Figure 14 Recognition process of CNN + fixed sliding window
Figure 15 Recognition process of CNN + dynamic sliding window + GMS
Figure 16 GMS matching results of the sliding window image
Figure 17 Recognition process of CNN + dynamic sliding window + image differencing
Figure 18 Part of the database for GMS
Figure 19 Part of the database for CNN
Figure 20 Original dataset example used to train the model
Figure 21 Accuracy result of the CNN training process
Figure 22 Cross entropy result of the CNN training process
Figure 23 Other results for the CNN training process
Figure 24 Recognition and matching result by SIFT
Figure 25 Recognition and matching result by GMS
Figure 26 Recognition accuracy of GMS + fixed sliding window
Figure 27 Recognition accuracy of CNN + fixed sliding window
Figure 28 Recognition accuracy of CNN + dynamic sliding window
Figure 29 Recognition accuracy of CNN + dynamic sliding window + GMS
Figure 30 Recognition accuracy of CNN + dynamic sliding window + image differencing
Figure 31 GMS + fixed sliding window result for the item with feature
Figure 32 GMS + fixed sliding window result for the item without feature
Figure 33 Problem of the wrong suction point location
Figure 34 Result comparison of GMS + fixed sliding window, CNN + fixed sliding window and CNN + dynamic sliding window
Figure 35 Advantages and disadvantages of GMS and CNN methods
Figure 36 Result comparison of CNN + dynamic sliding window, CNN + dynamic sliding window + GMS and CNN + dynamic sliding window + image differencing
Figure 37 Comparison of the two hybrid recognition methods
LIST OF TABLES
Table 1 Accuracy of different deep learning models [41, 42]
Table 2 Dynamic sliding window sizes for different items
Table 3 Image number for the training process
Table 4 Recognition accuracy of GMS + fixed sliding window
Table 5 Tuning the parameters of window size and step size for CNN + fixed sliding window
Table 6 Recognition accuracy of CNN + fixed sliding window
Table 7 Recognition accuracy of CNN + dynamic sliding window
Table 8 Recognition accuracy of CNN + dynamic sliding window + GMS
Table 9 Recognition accuracy of CNN + dynamic sliding window + image differencing
Table 10 Result comparison of GMS + fixed sliding window, CNN + fixed sliding window and CNN + dynamic sliding window
Table 11 Result comparison of CNN + dynamic sliding window, CNN + dynamic sliding window + GMS and CNN + dynamic sliding window + image differencing
Table 12 Result comparison of all the recognition methods
CHAPTER 1 Introduction
1.1 Background and Motivation
E-commerce, the aggregation of commercial transactions conducted online, is booming with the growth of the economy and of information technology. Given its efficiency and convenience, business enterprises are now moving into an e-commerce age. Logistics activities play a significant role in online transactions by supporting the supply and sales chain. To increase efficiency and reduce labor cost, it is necessary to transform logistics activities into fully automated ones. In this context, robots have become a good tool to meet commercial requirements and achieve industrial automation [1, 2].
Logistics robots for warehouses soared in popularity after Amazon purchased Kiva Systems, a maker of Autonomous Mobile Robots (AMRs), in 2012. After rebranding Kiva as Amazon Robotics and testing AMRs in other companies, Amazon keeps trying to find an optimal way to design, manufacture and deploy them [3]. The Amazon robotic system has been shown to perform well in terms of accuracy and efficiency in logistics activities. Encouraged by such successful examples, an increasing number of e-commerce companies have started setting up automated robotic logistics systems.
Order picking, one of the logistics activities in warehouses, is the taking and collecting of products in specified quantities before shipment according to customers' orders. Because of its significance in ensuring high supply-chain productivity, realizing order picking with robots has become a popular research area. However, in some warehouses, products are often not stored orderly and tidily. Commercially viable automated picking of a specific order in an unstructured workspace is therefore a challenging research topic. The Amazon Robotic Challenge (ARC) is held to explore solutions to the item-picking problem in unstructured workspaces and to build a connection between industrial automation and academic research. Following the rules of ARC 2017, our dual-arm robotic system is set up to accomplish a completely automated order picking process.
A picking robotic system usually consists of a mechanical system, a vision system and a program manager. The program manager obtains the order information and sends it to the vision system for recognition. After the item location is computed, motion planning is conducted to move the item to the right place in the storage system. To acquire robust and accurate systematic performance, the robotic system needs a strong recognition and classification algorithm and strategy. Therefore, this thesis focuses on the perception method of the robotic system and tries to improve it to realize high-quality recognition. The research on the perception method is based on the robotic system and the order picking process of ARC 2017.
1.2 Objectives
Nowadays, intelligent automated warehouse robots are worth studying to realize high efficiency and accuracy. Accurate recognition methods and strategies are needed to fulfill orders in both cluttered and ordinary workspaces. The research objective of this thesis is to find a robust and feasible perception method to realize order picking in an unstructured environment. The problem is defined under the same conditions as the 2017 Amazon Robotic Challenge. Therefore, the main task is to propose and implement a robust perception strategy that realizes object recognition in the Amazon Robotic Challenge. After summarizing the recognition methodologies, an exploration of the methods usable for object recognition in unstructured bins is explained. CNN + dynamic sliding window + image differencing, a hybrid of a deep learning method and a traditional method, is proposed and achieves a recognition rate of about 95%.
The main objectives include:
1. Implementing the robotic system to accomplish the perception strategy of the order picking task under the settings of the Amazon Robotic Challenge.
2. Proposing possible perception methods through a literature review and setting up experiments to test the object recognition results.
3. Evaluating the recognition accuracy of the proposed strategies and analyzing the systematic performance.
4. Drawing a conclusion on a perception method for object recognition in messy bins for intelligent automated picking robots in warehouses.
1.3 Organization
The thesis is divided into six chapters, briefly summarized as follows:
Chapter 1 briefly introduces the background of picking robots for e-commerce logistics and explains the motivation for researching object recognition in cluttered workspaces. The objectives and organization of the thesis are also introduced.
Chapter 2 reviews and discusses the theories and characteristics of the traditional methods and deep learning methods for the perception strategy of the robotic system.
Chapter 3 shows some cases of actual robotic picking systems in the Amazon Robotic Challenge. After stating the rules and the problem, the methodologies of our recognition system are analyzed and decided according to the literature review results and the robot structure.
Chapter 4 introduces the recognition methods used in our robot system. This chapter contains the theory and algorithms for GMS (Grid-based Motion Statistics) and
CNN (convolutional neural network), GMS + fixed sliding window, CNN + fixed sliding window, CNN + dynamic sliding window, CNN + dynamic sliding window +
GMS and CNN + dynamic sliding window + image differencing.
Chapter 5 explains the overall experimental design and database setup. The relevant results are presented which include the devices, data collection and training, and the recognition accuracy of the five methods. This chapter also explains the complete improvement process of the recognition strategy and makes a comparison among the five approaches.
Chapter 6 draws the conclusion by comparing and summarizing the perception methods. The hybrid method of CNN with dynamic sliding window and image differencing gives the best performance, with a recognition accuracy of 95%.
CHAPTER 2 Literature Review
Considering the working process of e-commerce warehouses, bin-picking robots are required to classify the ordered objects and accomplish the picking tasks. Basically, the solution contains six steps: initial data acquisition, object detection, pose estimation, path planning, object grasping and object placement. However, industrial automation of order picking remains a research topic because of complex working conditions and accuracy requirements. Therefore, the accuracy and stability of logistics robots need to be improved [4].
The vision system is clearly a key factor for picking robots in warehouses to realize and ensure high accuracy. A robust object recognition method is needed to achieve an automated picking process in a cluttered environment. In this chapter, methods for object recognition are presented and discussed for further use and application.
The general process of object recognition is shown in Figure 1. Based on their principles, object recognition methods can be divided into traditional methods and deep learning methods. Traditional methods mostly concentrate on operations on features: they realize recognition by finding and matching features between target images and reference databases. Deep learning methods use a convolutional neural network to train the classification model and implement the recognition process. Both kinds of methods are discussed in the following literature review.
Figure 1 General process for the vision system: acquire images → pre-process images → extract the feature → segment and detect objects → post-process images → make the decision
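The six stages of Figure 1 can be sketched as a chain of functions. This is an illustrative skeleton only: every function name and the toy image are assumptions, not the thesis code, and each trivial stage stands in for a real component.

```python
# Minimal sketch of the six-stage vision pipeline in Figure 1.
# All stages are illustrative stand-ins; each consumes the previous
# stage's output, so real implementations can be swapped in per stage.
def acquire_image():
    # stand-in for a camera frame: tiny grayscale image as a 2-D list
    return [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]

def preprocess(img):
    # e.g. thresholding as a trivial noise filter
    return [[1 if p > 5 else 0 for p in row] for row in img]

def extract_features(img):
    # e.g. coordinates of foreground pixels as "features"
    return [(x, y) for y, row in enumerate(img) for x, p in enumerate(row) if p]

def detect_objects(features):
    # e.g. bounding box around the feature cluster
    xs, ys = zip(*features)
    return (min(xs), min(ys), max(xs), max(ys))

def postprocess(box):
    # e.g. non-maximum suppression would go here
    return box

def decide(box):
    # e.g. pick the box centre as a suction point
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

print(decide(postprocess(detect_objects(extract_features(preprocess(acquire_image()))))))
# → (1.5, 1.5)
```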
2.1 Traditional Methods
In traditional computer vision, a feature carries information about an image, such as key points and color regions. To achieve recognition, feature detectors are first applied to find and extract the feature information. Feature descriptors are then used to express the features mathematically. Finally, recognition is accomplished by applying a proper feature matcher to obtain correct match points and their locations. In general, recognition has four significant steps: determining the feature, choosing a feature detector, choosing a feature descriptor, and choosing a feature matcher to realize the recognition [5].
Determination of feature
In image processing, features carry information about an image such as edges, points, corners, interest regions and ridges [6]. Recognizing a target item is simply the process of finding and locating its features in an image. Image features consist of global features and local features. A global feature uses one multidimensional feature vector to describe information about the whole image, such as color, texture and shape. For local features, the detection process separates some interest regions and collects information from them. Examples of local features are multiple interest points or other features in a specific part of an image [7].
Feature detector
A feature detector is an algorithm used to find the features in an image. Some detectors can detect only one kind of feature, while others can detect more. Feature detectors need to be chosen according to the characteristics of the images and the target items. This section introduces common feature detectors for features such as edges, corners and points.
The steps of edge detection are filtering, enhancement, detection and localization. For 1-D edge detection, the Roberts, Sobel and Prewitt operators are commonly used [8]. The Roberts operator approximates the gradient magnitude of the image, applying 2×2 convolution masks over the entire image horizontally and vertically. The Sobel and Prewitt operators are discrete differentiation operators which compute the approximate derivative with two 3×3 masks. 2-D edge detectors include the Laplacian operator and the second directional derivative, which suit cases where too many edge points are detected. In this situation, a good approach is to choose points with local maxima in gradient value, i.e., points with a peak in the first derivative and a zero crossing in the second derivative [9]. Another famous edge detector is the Canny edge detector, whose target is good-quality performance by finding a single edge. John F. Canny used the calculus of variations to optimize the objective function; the resulting filter sums four exponential terms and can be approximated by the first derivative of a Gaussian [10].
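As a concrete illustration of the mask-based operators above, the following is a minimal pure-Python sketch of the Sobel operator applied at a single pixel. The helper names and the toy image are illustrative, not from the thesis.

```python
import math

# Standard 3x3 Sobel masks for horizontal (Gx) and vertical (Gy) gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def convolve3x3(img, mask, x, y):
    """Apply a 3x3 mask centred on pixel (x, y) of a 2-D list image."""
    acc = 0
    for dy in range(-1, 2):
        for dx in range(-1, 2):
            acc += mask[dy + 1][dx + 1] * img[y + dy][x + dx]
    return acc

def sobel_magnitude(img, x, y):
    gx = convolve3x3(img, SOBEL_X, x, y)
    gy = convolve3x3(img, SOBEL_Y, x, y)
    return math.hypot(gx, gy)

# A 5x5 image with a vertical step edge between columns 1 and 2.
img = [[0, 0, 255, 255, 255] for _ in range(5)]
print(sobel_magnitude(img, 2, 2))  # strong response on the edge: 1020.0
print(sobel_magnitude(img, 3, 2))  # no response in the flat region: 0.0
```

In practice the masks are slid over every pixel and a threshold on the magnitude selects edge points; the single-pixel form above just makes the arithmetic visible.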
For corners and interest points, there are many methods, such as Shi-Tomasi, Hessian feature strength measures and level curve curvature. The Harris operator, a common corner detector, considers the differential of the corner score with respect to direction. Another operator, SUSAN, detects corners by comparing brightness differences inside a defined circle, which gives robustness to noise [11-14]. Features from Accelerated Segment Test (FAST), presented by Edward Rosten and Tom Drummond, is a corner feature detector [15]. FAST detects a corner by comparing the intensities around a candidate point. The basic process is:
1. Choose a point P;
2. Set up a threshold;
3. Select a set of 16 contiguous pixels on a Bresenham circle around P;
4. Compare the pixels and regard P as a corner if several continuous points on the circle are brighter or darker than P.
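The four steps above can be sketched as follows. This is a simplified, hedged version of the segment test (plain 2-D lists, the standard 16-pixel circle offsets, and a contiguous-run check), not Rosten and Drummond's optimized implementation.

```python
# Offsets of the 16 pixels on a Bresenham circle of radius 3 around P.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, t=20, n=12):
    """Segment test: P is a corner if at least n contiguous circle pixels
    are all brighter than P + t or all darker than P - t."""
    p = img[y][x]
    # +1 brighter, -1 darker, 0 similar, for each of the 16 circle pixels
    signs = [1 if img[y + dy][x + dx] > p + t
             else (-1 if img[y + dy][x + dx] < p - t else 0)
             for dx, dy in CIRCLE]
    # doubling the list lets a contiguous run wrap around the circle
    run, prev = 0, 0
    for s in signs + signs:
        run = run + 1 if (s != 0 and s == prev) else (1 if s != 0 else 0)
        prev = s
        if run >= n:
            return True
    return False

# A single dark pixel on a bright background passes the segment test,
# while a straight edge (only ~7 contiguous darker pixels) does not.
spot = [[255] * 7 for _ in range(7)]
spot[3][3] = 0
print(is_fast_corner(spot, 3, 3))  # True
```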
Blob detection aims to obtain the feature information of a region, or global features such as color and texture. Determinant of Hessian (DoH), Difference of Gaussians (DoG), Laplacian of Gaussian (LoG), MSER, PCBR and grey-level blobs are common methodologies [16, 17]. DoH, DoG and LoG achieve the detection by convolving the image with derivatives of the Gaussian function and locating the zero crossings of the second derivatives.
There are also other famous detectors, such as the Hough transform and affine invariant feature detectors. The Hough transform is commonly used to detect the shape of the target item [18, 19]. Affine shape adaptation, Harris affine and Hessian affine are used to accomplish affine invariant feature detection [20, 21].
Feature descriptor
A feature descriptor is a mathematical description, in various formats, of the features. For global features, the descriptors for color include the Dominant Color Descriptor (DCD), Scalable Color Descriptor (SCD), Color Structure Descriptor (CSD), Color Layout Descriptor (CLD) and Group of Frames (GoF) or Group of Pictures (GoP). The Homogeneous Texture Descriptor (HTD), Texture Browsing Descriptor (TBD) and Edge Histogram Descriptor (EHD) describe textures. For shape, the Region-based Shape Descriptor (RSD), Contour-based Shape Descriptor (CSD) and 3-D Shape Descriptor (3-D SD) are commonly used. Finally, the Motion Activity Descriptor (MAD), Camera Motion Descriptor (CMD), Motion Trajectory Descriptor (MTD) and the Warping and Parametric Motion Descriptors (WMD and PMD) [22, 23] are available to describe motion.
As we can see, there are many global feature descriptors. In recognition applications, however, local features are more commonly used than global features. In this case, the orientation and the scale of the features must be considered in practice. Therefore, a good feature descriptor not only needs to encode location, scale, orientation and local image structure, but also needs discriminative matching performance and sensitivity to local features. Many local feature descriptors have been proposed and improved, such as SIFT, SURF, GLOH and LBP [24]. Generally, SIFT builds a mathematical expression of features detected with Difference of Gaussians (DoG). SURF, which increases the speed of SIFT, uses the idea of Determinant of Hessian (DoH) to realize the description. Some commonly used feature descriptors are introduced here to give an idea of their working processes.
Scale Invariant Feature Transform (SIFT)
Scale Invariant Feature Transform (SIFT), proposed by Lowe, is the most famous feature descriptor. Since scale-invariant coordinates are defined relative to local features of the image, SIFT transforms the image data into a mathematical expression through four steps: detecting scale-space extrema, localizing key points, assigning orientations and describing the key points mathematically [25].
To detect the scale-space extrema, Lowe used the LoG function to identify feature points. Points that were invariant to scale and orientation were recorded as interest points. After that, each point was compared to its neighbors to decide and localize the useful points. Once a key point was found, wrong points were excluded and accuracy was increased by applying a Taylor expansion of the scale-space function and comparing with nearby data for location, scale and ratio of principal curvatures. A consistent orientation was then assigned to each point according to the local feature information, using an orientation histogram method. With all these operations, the local feature is expressed in a mathematical format that is easier to compute with.
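The orientation-assignment step described above can be sketched with a gradient-orientation histogram. This is a deliberately simplified version (no Gaussian weighting, no peak interpolation, names illustrative), enough to show how a dominant orientation falls out of the gradients.

```python
import math

def orientation_histogram(patch, bins=36):
    """Magnitude-weighted histogram of gradient orientations over a patch
    (2-D list), as in SIFT's orientation-assignment step; simplified by
    omitting the Gaussian window."""
    hist = [0.0] * bins
    h, w = len(patch), len(patch[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 360.0
            hist[int(ang // (360.0 // bins)) % bins] += mag
    return hist

def dominant_orientation(patch):
    # 36 bins of 10 degrees each; return the start angle of the peak bin
    hist = orientation_histogram(patch)
    return max(range(len(hist)), key=hist.__getitem__) * 10

# A patch whose intensity increases left-to-right has gradients along +x,
# so the dominant orientation lands in the 0-degree bin.
patch = [[10 * x for x in range(8)] for _ in range(8)]
print(dominant_orientation(patch))  # 0
```

Note that image y grows downward, so a top-to-bottom intensity ramp reports 90 degrees in these image coordinates.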
Speeded-Up Robust Features Descriptor (SURF)
As introduced previously, SURF approximates the points detected by the Hessian blob detector using integral images. It then extracts a square region around each point and sums the Haar wavelet responses. The center and orientation of the region are decided by the selected points. Finally, the responses are weighted with a Gaussian, which contributes to the robustness [26].
Gradient Location-Orientation Histogram (GLOH) & Local Binary Pattern (LBP)
Gradient Location and Orientation Histogram (GLOH) improves the performance of SIFT by using more spatial histogram regions in the calculation [27]. Local Binary Patterns (LBP), a binary descriptor, performs particularly well at finding features for texture classification. It can also be combined with the Histogram of Oriented Gradients (HOG) descriptor to give better results [28, 29].
Oriented FAST and Rotated BRIEF (ORB)
ORB, a binary descriptor, improves on the hybrid of Features from Accelerated Segment Test (FAST) and Binary Robust Independent Elementary Features (BRIEF). Compared to SIFT and SURF, ORB is faster. FAST is a feature detector for corners and key points. BRIEF is a binary descriptor that compares the Hamming distances of chosen point pairs [30, 31].
After finding the interest points with FAST, the image is first smoothed. A patch around each feature point is then chosen and defined. The intensities of two points in the patch are compared to yield 1 or 0; several such point pairs chosen from the patch form a unique binary string. Five ways of choosing the point pairs were tested for BRIEF, and the one with the best performance samples both points of each pair from an isotropic Gaussian distribution over the patch. Finally, the Hamming distances between all features in the target and reference images are calculated, and the two features with the shortest distance are regarded as the same one.
ORB combines the two methods into a new feature descriptor by making three improvements [32].
1. Adding a component of accurate orientation to FAST.
2. Calculating oriented BRIEF features in a more efficient way.
3. Improving oriented BRIEF features through variance and correlation.
Feature matcher
After extracting the features and applying the descriptors, a common approach to image matching is to compare the descriptors of the original and target images through the distance between two descriptors. Here, rejecting wrong points to ensure accuracy and speeding up the algorithm are the two decisive factors. Currently, the Brute-Force matcher, the Fast Library for Approximate Nearest Neighbors (FLANN) and the randomized k-d forest are the most common methods [33]. The Brute-Force matcher is a basic algorithm that computes all the distances between descriptors and acquires the matching result by comparing them.
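A Brute-Force matcher of the kind just described can be sketched in a few lines. The ratio test shown is Lowe's standard refinement for rejecting ambiguous matches; it is common practice, not something the text above prescribes, and the names are illustrative.

```python
import math

def euclidean(d1, d2):
    """Euclidean distance between two float descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def match_with_ratio_test(query, reference, ratio=0.8):
    """Brute-force matching with Lowe's ratio test: accept a match only if
    the nearest reference descriptor is clearly closer than the second
    nearest, which rejects ambiguous (likely wrong) matches."""
    matches = []
    for qi, q in enumerate(query):
        order = sorted(range(len(reference)), key=lambda i: euclidean(q, reference[i]))
        best, second = order[0], order[1]
        if euclidean(q, reference[best]) < ratio * euclidean(q, reference[second]):
            matches.append((qi, best))
    return matches

ref = [[0.0, 0.0], [10.0, 0.0], [5.0, 5.0]]
print(match_with_ratio_test([[0.5, 0.0]], ref))  # [(0, 0)] – unambiguous match
print(match_with_ratio_test([[5.0, 0.0]], ref))  # [] – equidistant, rejected
```

GMS, introduced next, attacks the same wrong-match problem from a different angle: instead of a per-descriptor distance ratio, it uses the statistics of neighboring matches to decide which correspondences to keep.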
GMS is a feature matching method that uses ORB as the feature descriptor. It is a statistical methodology presented by JiaWang Bian and Wen-Yan Lin which increases the accuracy of match points by examining the points around each feature and rejecting wrong feature points [34]. Bian proposed the assumption that motion smoothness causes the neighborhood around a true match to view the same 3D location, whereas the neighborhood around a false match views geometrically different 3D locations [34]. As shown in Figure 2, the neighborhood of a match is