A Deep Transfer Learning Approach for 3D Object Recognition in Open-Ended Domains S
Total Page:16
File Type:pdf, Size:1020Kb
1 Accepted in IEEE/ASME Transactions on Mechatronics (TMECH) – DOI: 10.1109/TMECH.2020.3048433 OrthographicNet: A Deep Transfer Learning Approach for 3D Object Recognition in Open-Ended Domains S. Hamidreza Kasaei Abstract—Nowadays, service robots are appearing more and Open-Ended Learning and Recognition more in our daily life. For this type of robot, open-ended object Object category Recognition models Perceptual Memory category learning and recognition is necessary since no matter previous object new object categories categories deep object deep how extensive the training data used for batch learning, the robot representation might be faced with a new object when operating in a real- Object Conceptualization world environment. In this work, we present OrthographicNet, a deep object representation Convolutional Neural Network (CNN)-based model, for 3D object point/teach/ask/correct recognition in open-ended domains. In particular, Orthographic- new object categories Net generates a global rotation- and scale-invariant representa- tion for a given 3D object, enabling robots to recognize the same Fig. 1. Overview of the proposed open-ended object recognition pipeline: (left) the cup object and its bounding box, reference frame and three projected or similar objects seen from different perspectives. Experimental views; (center) each orthographic projection is then fed to a CNN to obtain results show that our approach yields significant improvements a view-wise deep feature; (right) to construct a global deep representation over the previous state-of-the-art approaches concerning object for the given object, view-wise features are merged using an element-wise recognition performance and scalability in open-ended scenarios. pooling function. The obtained representation is finally used for open-ended Moreover, OrthographicNet demonstrates the capability of learn- object category learning and recognition. ing new categories from very few examples on-site. Regarding real-time performance, three real-world demonstrations validate the promising performance of the proposed architecture. it is able to adjust the learned model of a certain category Index Terms—3D Object Perception, Open-Ended Learning, when a new instance is taught; (iv) opportunistic: apart from Interactive Robot Learning. learning from a batch of labeled training data at predefined times or according to a predefined training schedule, the robot I. INTRODUCTION must be prepared to accept a new example when it becomes Umans can adapt to different environments dynamically available. This way, it is expected that the competence of the H by watching and learning about new object categories. robot increases over time. In contrast, for migrating a robot to a new environment, Most of the recent approaches use Convolutional Neural one must often completely re-program the knowledge base Networks (CNN) for 3D object recognition [2], [3], [4], [5], that it is running with. Although many important problems [6], [7]. It is now clear that if an application has a pre-defined have already been understood and solved successfully, many fixed set of object categories and thousands of examples issues still remain. Open-ended object category learning is per category, an effective way to build an object recognition one of these issues waiting for many improvements. In this system is to train a deep CNN. However, there are several work, open-ended learning refers to the ability to learn new limitations to use CNN in open-ended domains. CNNs are in- object categories sequentially without forgetting the previously cremental by nature but not open-ended, since the inclusion of arXiv:1902.03057v3 [cs.RO] 31 Dec 2020 learned categories. In particular, the robot does not know new categories enforces a restructuring in the topology of the in advance which object categories it will have to learn, network. Moreover, CNNs usually need a lot of training data, which observations will be available, and when they will be and if limited data is used for training, this might lead to non- available to support the learning. In such scenarios, the robot discriminative object representations and, as a consequence, to should learn about new categories from on-site experiences, poor object recognition performance. Deep transfer learning supported in the feedback from human teachers. To address can relax these limitations and motivates us to combine deep- this problem, we believe the learning system of the robots learned features with an online classifier to handle the problem should have four characteristics [1]: (i) on-line: meaning of open-ended object category learning and recognition. To that the learning procedure takes place while the robot is the best of our knowledge, there is no other deep learning running; (ii) supervised: to include the human instructor in based approach jointly tackling 3D object pose estimation and the learning process. This is an effective way for a robot to recognition in an open-ended fashion. As depicted in Fig. 1, obtain knowledge from a human teacher. (iii) incremental: we first construct a unique reference frame for the given object. Afterwards, three principal orthographic projections including Department of Artificial Intelligence, University of Groningen, Gronin- front, top, and right–side views are computed by exploiting the gen, the Netherlands. Email: [email protected] We thank NVIDIA Corporation for their generous donation of GPUs object reference frame. Each projected view is then fed to a used for this research. CNN to obtain a view-wise deep feature. To construct a global 2 feature for the given object, the obtained view-wise features planes. In contrast, pointset-based approaches work directly are then merged using an element-wise pooling function. The on 3D point clouds and require neither voxelization nor 2D obtain feature is scale and pose invariant, informative and image projections. Among these methods, view-based methods stable, and designed with the objective of supporting accurate are shown effective in object recognition tasks and obtained the 3D object recognition. We finally conducted our experiments best recognition results so far [3], [22]. Our approach falls into with an instance-based learning. We perform a set of extensive the view-based category. Shi et al. [7] tackle the problem of 3D experiments to assess the performance of OrthographicNet and object recognition by combining a panoramic representation compare it with a set of state-of-the-art methods. of 3D objects with a CNN, i.e., named DeepPano. In this The remainder of this paper is organized as follows: Sec- approach, each object is first converted into a panoramic view, tion II reviews related work of open-ended learning and deep i.e., a cylinder projection object around its principle axis. Then, learning approaches applied to 3D object recognition. Next, a variant of CNN is used to learn the deep representations the detailed methodologies of our proposal – namely Ortho- from the panoramic views. Su et al. [19] described an object graphicNet – are explained in Section III and IV. Experimental recognition approach using multiple view-wise CNN features. result and discussion are given in Section V, followed by In another work, Sinha et al. [18] adopted an approach of conclusions in Section VI. converting the 3D object into a “geometry image” and used standard CNNs directly to learn 3D shape surfaces. These works are similar to ours in that they use multiple views of II. RELATED WORK an object. However, unlike our approach, these approaches In the last decade, various research groups have made are not suitable for real-world scenarios since objects are substantial progress towards the development of learning ap- often partially observable due to occlusions, which makes it proaches which support online and incremental object category difficult to rely on multi-view representations that are learned learning [8], [9], [10], [11], [12]. In such systems, object with the whole circumference. Several deep transfer learning representation plays a prominent role since its output uses in approaches assumed that large amounts of training data are both learning and recognition phases. In [8], an open-ended available for novel classes [23]. For such situations the strength object category learning system, based on a global 3D shape of pre-trained CNNs for extracting features is well known feature namely GOOD [13], is described. In particular, the [23], [24]. Unlike our approach, CNN-based approaches are authors proposed a cognitive robotic architecture to create a not scale and rotation invariant. Several researchers try to concurrent 3D object category learning and recognition in an solve the issue by data augmentation either using Generative interactive and open-ended manner. Kasaei et al. [9] proposed Adversarial Networks (GAN) [25] or by modifying images by a naive Bayes learning approach with a Bag-of-Words object translation, flipping, rotating and adding noise [26] i.e., CNNs representation to acquire and refine object category models are still required to learn the rotation equivariance properties in open-ended fashion. Faulhammer et al. [10] introduced a from the data [27]. Unlike the above mentioned CNN-based system which allows a mobile robot to autonomously detect, approaches, we assume that the training instances are extracted model and recognize objects in everyday environments. Skocaj from on-site experiences of a robot, and thus become gradually et al. [14] presented an integrated robotic system