Accepted in IEEE/ASME Transactions on Mechatronics (TMECH) – DOI: 10.1109/TMECH.2020.3048433 (arXiv:1902.03057v3 [cs.RO], 31 Dec 2020)

OrthographicNet: A Deep Transfer Learning Approach for 3D Object Recognition in Open-Ended Domains

S. Hamidreza Kasaei

Abstract—Nowadays, service robots are appearing more and more in our daily life. For this type of robot, open-ended object category learning and recognition is necessary since, no matter how extensive the training data used for batch learning, the robot might be faced with a new object when operating in a real-world environment. In this work, we present OrthographicNet, a Convolutional Neural Network (CNN)-based model for 3D object recognition in open-ended domains. In particular, OrthographicNet generates a global rotation- and scale-invariant representation for a given 3D object, enabling robots to recognize the same or similar objects seen from different perspectives. Experimental results show that our approach yields significant improvements over the previous state-of-the-art approaches concerning object recognition performance and scalability in open-ended scenarios. Moreover, OrthographicNet demonstrates the capability of learning new categories from very few examples on-site. Regarding real-time performance, three real-world demonstrations validate the promising performance of the proposed architecture.

Index Terms—3D Object Perception, Open-Ended Learning, Interactive Robot Learning.

Fig. 1. Overview of the proposed open-ended object recognition pipeline: (left) the cup object and its bounding box, reference frame and three projected views; (center) each orthographic projection is then fed to a CNN to obtain a view-wise deep feature; (right) to construct a global deep representation for the given object, view-wise features are merged using an element-wise pooling function. The obtained representation is finally used for open-ended object category learning and recognition.

I. INTRODUCTION

Humans can adapt to different environments dynamically by watching and learning about new object categories. In contrast, for migrating a robot to a new environment, one must often completely re-program the knowledge base that it is running with. Although many important problems have already been understood and solved successfully, many issues still remain. Open-ended object category learning is one of these issues waiting for many improvements. In this work, open-ended learning refers to the ability to learn new object categories sequentially without forgetting the previously learned categories. In particular, the robot does not know in advance which object categories it will have to learn, which observations will be available, and when they will be available to support the learning. In such scenarios, the robot should learn about new categories from on-site experiences, supported by feedback from human teachers. To address this problem, we believe the learning system of the robot should have four characteristics [1]: (i) on-line: the learning procedure takes place while the robot is running; (ii) supervised: the human instructor is included in the learning process, which is an effective way for a robot to obtain knowledge from a human teacher; (iii) incremental: the robot is able to adjust the learned model of a certain category when a new instance is taught; (iv) opportunistic: apart from learning from a batch of labeled training data at predefined times or according to a predefined training schedule, the robot must be prepared to accept a new example when it becomes available. This way, it is expected that the competence of the robot increases over time.

Most of the recent approaches use Convolutional Neural Networks (CNN) for 3D object recognition [2], [3], [4], [5], [6], [7]. It is now clear that if an application has a pre-defined fixed set of object categories and thousands of examples per category, an effective way to build an object recognition system is to train a deep CNN. However, there are several limitations to using CNNs in open-ended domains. CNNs are incremental by nature but not open-ended, since the inclusion of new categories enforces a restructuring in the topology of the network. Moreover, CNNs usually need a lot of training data, and if limited data is used for training, this might lead to non-discriminative object representations and, as a consequence, to poor object recognition performance. Deep transfer learning can relax these limitations and motivates us to combine deep-learned features with an online classifier to handle the problem of open-ended object category learning and recognition. To the best of our knowledge, there is no other deep learning based approach jointly tackling 3D object pose estimation and recognition in an open-ended fashion. As depicted in Fig. 1, we first construct a unique reference frame for the given object. Afterwards, three principal orthographic projections, including front, top, and right-side views, are computed by exploiting the object reference frame. Each projected view is then fed to a CNN to obtain a view-wise deep feature. To construct a global feature for the given object, the obtained view-wise features are then merged using an element-wise pooling function. The obtained feature is scale- and pose-invariant, informative and stable, and designed with the objective of supporting accurate 3D object recognition. We finally conducted our experiments with an instance-based learning approach. We perform a set of extensive experiments to assess the performance of OrthographicNet and compare it with a set of state-of-the-art methods.

The remainder of this paper is organized as follows: Section II reviews related work on open-ended learning and deep learning approaches applied to 3D object recognition. Next, the detailed methodologies of our proposal – namely OrthographicNet – are explained in Sections III and IV. Experimental results and discussion are given in Section V, followed by conclusions in Section VI.

Department of Artificial Intelligence, University of Groningen, Groningen, the Netherlands. Email: [email protected] We thank NVIDIA Corporation for their generous donation of GPUs used for this research.

II. RELATED WORK

In the last decade, various research groups have made substantial progress towards the development of learning approaches which support online and incremental object category learning [8], [9], [10], [11], [12]. In such systems, object representation plays a prominent role since its output is used in both the learning and recognition phases. In [8], an open-ended object category learning system based on a global 3D shape feature, namely GOOD [13], is described. In particular, the authors proposed a cognitive robotic architecture to create a concurrent 3D object category learning and recognition system in an interactive and open-ended manner. Kasaei et al. [9] proposed a naive Bayes learning approach with a Bag-of-Words object representation to acquire and refine object category models in an open-ended fashion. Faulhammer et al. [10] introduced a system which allows a mobile robot to autonomously detect, model and recognize objects in everyday environments. Skocaj et al. [14] presented an integrated robotic system capable of interactive learning in dialogue with a human. Their system learns and refines conceptual models of visual objects and their properties, either by attending to information provided by a human tutor or by taking the initiative itself. Oliveira et al. [15] tackle this problem by proposing an approach for concurrent learning of visual code-books and object categories in open-ended domains. In [11], an extension of Latent Dirichlet Allocation (LDA) [16], namely Local-LDA, is proposed that goes one step further by learning specific topics for each category independently and in an open-ended manner. All the above approaches use hand-crafted features. This in turn means that they may not generalize well across different domains.

Recently, deep learning approaches have received significant attention from the robotics, machine learning, and computer vision communities. For 3D object recognition, deep learning approaches can be categorized into three categories according to their input: (i) volume-based [6], [17], (ii) view-based [7], [18], [19], and (iii) pointset-based methods [20], [21]. Volume-based approaches first represent an object as a 3D voxel grid and then use the obtained representation as the input to a CNN with 3D filter banks. Approaches of the second category (i.e., view-based) extract 2D images from the 3D representation by projecting the object's points onto 2D planes. In contrast, pointset-based approaches work directly on 3D point clouds and require neither voxelization nor 2D image projections. Among these methods, view-based methods have been shown to be effective in object recognition tasks and have obtained the best recognition results so far [3], [22]. Our approach falls into the view-based category. Shi et al. [7] tackle the problem of 3D object recognition by combining a panoramic representation of 3D objects with a CNN, named DeepPano. In this approach, each object is first converted into a panoramic view, i.e., a cylinder projection of the object around its principal axis. Then, a variant of CNN is used to learn deep representations from the panoramic views. Su et al. [19] described an object recognition approach using multiple view-wise CNN features. In another work, Sinha et al. [18] adopted an approach of converting the 3D object into a "geometry image" and used standard CNNs directly to learn 3D shape surfaces. These works are similar to ours in that they use multiple views of an object. However, unlike our approach, these approaches are not suitable for real-world scenarios since objects are often partially observable due to occlusions, which makes it difficult to rely on multi-view representations that are learned with the whole circumference. Several deep transfer learning approaches assume that large amounts of training data are available for novel classes [23]. For such situations, the strength of pre-trained CNNs for extracting features is well known [23], [24]. Unlike our approach, CNN-based approaches are not scale- and rotation-invariant. Several researchers try to solve this issue by data augmentation, either using Generative Adversarial Networks (GAN) [25] or by modifying images by translation, flipping, rotating and adding noise [26], i.e., CNNs are still required to learn the rotation equivariance properties from the data [27]. Unlike the above-mentioned CNN-based approaches, we assume that the training instances are extracted from on-site experiences of a robot, and thus become gradually available over time, rather than being completely or partially available at the beginning of the learning process. Moreover, in our approach the set of classes is continuously growing, while in the mentioned deep learning approaches the set of classes is predefined. Catastrophic forgetting is another important limitation of CNNs for being used in open-ended scenarios [28]. In simple terms, the catastrophic forgetting problem is defined as forgetting the older categories as the robot is being trained further on newer categories. This usually happens due to modifying previously learned weights upon the start of training the network for newer categories. Several approaches have been proposed to tackle this problem; notable recent works include [29], [30], [31]. Prototypical Networks learn to classify test examples by computing distances to prototype feature vectors of the novel categories [29]. It should be noted that our approach is suitable for both instance-level (e.g., Mug1 vs. Mug2) and category-level (e.g., Mug vs. Plate) object recognition, while Prototypical Networks can only be used for category-level object recognition. Unlike our approach, Memory Replay based approaches re-train the classifier once a new category is introduced, mainly using a large set of training instances of old categories and few training instances of the new category [30]. Liu et al. [31] proposed a deep transfer learning approach for lifelong 3D object classification (L3DOC). Similar to our approach, L3DOC can consecutively learn about new object categories, mainly based on PointNet [32]. The main idea of L3DOC is to extract and store the shared information of a set of known tasks and transfer the learned knowledge to the new coming task using a memory attention mechanism. In general, such approaches are slow and not applicable in real-time robotic applications.

III. OBJECT REPRESENTATION

Fig. 3. Visualization of the sign disambiguation procedure: the red, green and blue lines represent the unambiguous X, Y, Z axes, respectively. Two orthographic projections of the object on the XoZ and YoZ planes are used to determine the sign of the X and Y axes, respectively; (left) the sign is positive; (right) the sign is negative, and therefore all projections have been mirrored. For a better representation, we highlighted the corresponding bins.

A point cloud of an object is represented as a set of points, p_i : i ∈ {1, ..., n}, where each point is described by its 3D coordinates [x, y, z] and [R, G, B] color information. As shown in Fig. 1 (left), OrthographicNet starts with constructing a global object reference frame (RF) for a given object. Towards this end, three principal axes of the object are constructed based on eigenvector analysis. In particular, we first compute the geometric center of the object as $c = \frac{1}{n}\sum_{i=1}^{n} p_i$. The normalized covariance matrix, $\Sigma$, of the object is then calculated by $\Sigma = \frac{1}{n}\sum_{i=1}^{n}(p_i - c)(p_i - c)^T$. Afterwards, eigenvalue decomposition is performed, $\Sigma V = EV$, where $V = [\vec{v}_1, \vec{v}_2, \vec{v}_3]$ contains the eigenvectors of $\Sigma$, and $E = [\lambda_1, \lambda_2, \lambda_3]$ contains the corresponding eigenvalues, with $\lambda_1 \geq \lambda_2 \geq \lambda_3$. In this decomposition, the largest eigenvector, $\vec{v}_1$, of the covariance matrix always points in the direction of the largest variance of the object's points, and the magnitude of this vector equals the corresponding eigenvalue, $\lambda_1$. The second largest eigenvector, $\vec{v}_2$, is always orthogonal to the largest eigenvector and points in the direction of the second largest spread of the data. Therefore, the first two axes, X and Y, are defined by the eigenvectors $\vec{v}_1$ and $\vec{v}_2$, respectively. However, regarding the third axis, Z, instead of defining it based on $\vec{v}_3$, we define it based on the cross product $\vec{v}_1 \times \vec{v}_2$. The object is then transformed to be placed in the reference frame.

Afterwards, we use an orthographic projection method to generate views of the object. It is worth mentioning that orthographic projection is a universal language among people in engineering professions and is used for technical drawing. In orthographic projection, up to six views of an object can be produced (called primary views). In this work, we use just three projection views, including front, top, and right-side, and do not consider the rear, bottom and left-side views since they are mirrors of the considered views. In this method, since the projection lines are parallel to each other and perpendicular to the projection plane, an accurate outline of the visible face of the object is obtained. We therefore create three square projection planes centered on the object's center. Each plane of projection is positioned between the observer and the object and is perpendicular to one axis and parallel to the other axes of the object reference frame. The side length of the projection planes, l, is determined by the largest edge length of a tight-fitting axis-aligned bounding box (AABB) of the object. This choice makes the projections scale-invariant. The dimensions of the AABB are obtained by computing the minimum and maximum coordinate values along each axis. The object is then projected onto the planes. As an example, three orthographic projections of a bathtub object are shown in Fig. 2. Afterwards, the obtained projection views are uniformly scaled to an appropriate input image size and then fed into a CNN to extract view-wise features.

It should be noted that the direction of the eigenvectors is not unique, i.e., not repeatable across different trials, and their orientation has a 180° ambiguity. Therefore, the orthographic projections can be mirrored, as shown in Fig. 3. This problem is known as the sign ambiguity, for which there is no mathematical solution [33]. To cope with this issue, it is suggested that the sign of each axis be set according to the sign of the Pearson's correlation of the scatter points. For building a scatter plot, each projected point $\rho = (\alpha, \beta) \in \mathbb{R}^2$, where $\alpha$ is the perpendicular distance to the horizontal axis and $\beta$ is the perpendicular distance to the vertical axis, is shifted to the right and top by $\frac{l}{2}$. To complete the disambiguation, Pearson's correlation, $r_x$, is computed for the XoZ projection plane to find the direction of the X axis:

$r_x = \dfrac{\sum \alpha_i \beta_i - n\bar{\alpha}\bar{\beta}}{\sqrt{\sum \alpha_i^2 - n\bar{\alpha}^2}\;\sqrt{\sum \beta_i^2 - n\bar{\beta}^2}}$,   (1)

where $\bar{\alpha}$ and $\bar{\beta}$ are the means of $\alpha$ and $\beta$, respectively. In particular, Pearson's correlation reflects the strength and direction of a linear relationship as a value between −1 and 1, where 1 indicates a strong positive relationship and −1 indicates a strong negative relationship. A similar indicator, $r_y$, is computed for the Y axis using the YoZ plane. Finally, the sign of the axes is determined as $s = r_x \cdot r_y$, where s can be either positive or negative. In the case of negative s, the three projections are mirrored; otherwise they are not. Therefore, the final RF (X, Y, Z) is defined by $(s\vec{v}_1, s\vec{v}_2, \vec{v}_1 \times \vec{v}_2)$.
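To make this construction concrete, the following is a minimal NumPy sketch of the reference-frame estimation, the sign disambiguation of Eq. 1, and the projection rendering. The function names, the binary-mask rendering, and the fallback used when a correlation is exactly zero are our own illustrative choices and should not be read as the paper's released implementation.

```python
import numpy as np

def object_reference_frame(points):
    """Estimate the global object reference frame of Section III.

    points: (n, 3) array of XYZ coordinates of the object's point cloud.
    Returns the geometric center c and a 3x3 matrix whose rows are the
    sign-disambiguated X, Y, Z axes.
    """
    c = points.mean(axis=0)                          # geometric center
    centered = points - c
    cov = centered.T @ centered / len(points)        # normalized covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # reorder so lambda1 >= lambda2 >= lambda3
    v1, v2 = eigvecs[:, order[0]], eigvecs[:, order[1]]
    z = np.cross(v1, v2)                             # Z axis from the cross product, not v3

    # Coordinates in the (still sign-ambiguous) frame.
    local = centered @ np.column_stack([v1, v2, z])

    # Sign disambiguation (Eq. 1): Pearson's correlation of the XoZ and YoZ
    # scatter plots. The l/2 shift in the paper only matters for building the
    # discretized scatter plot; correlation itself is translation-invariant.
    r_x = np.corrcoef(local[:, 0], local[:, 2])[0, 1]
    r_y = np.corrcoef(local[:, 1], local[:, 2])[0, 1]
    s = 1.0 if r_x * r_y >= 0 else -1.0
    return c, np.column_stack([s * v1, s * v2, z]).T

def orthographic_views(points, axes, center, resolution=150):
    """Render three orthographic projections as resolution x resolution masks."""
    local = (points - center) @ axes.T
    l = (local.max(axis=0) - local.min(axis=0)).max()  # largest AABB edge -> scale invariance
    views = []
    for drop in (0, 1, 2):                             # each view discards one axis
        keep = [a for a in (0, 1, 2) if a != drop]
        img = np.zeros((resolution, resolution), dtype=np.uint8)
        # shift by l/2 so coordinates are non-negative, then discretize into pixels
        uv = ((local[:, keep] + l / 2.0) / l * (resolution - 1)).astype(int)
        uv = np.clip(uv, 0, resolution - 1)
        img[uv[:, 1], uv[:, 0]] = 255
        views.append(img)
    return views
```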

Fig. 2. A sample point cloud of a bathtub and its orthographic projections.

IV. OBJECT CATEGORY LEARNING AND RECOGNITION

Object recognition using very limited training data is crucial for many computer vision and robotics applications and has recently attracted interest in both research communities and industry. This problem becomes even harder when the system must learn about new object categories from very few examples online while maintaining the recognition performance on previously learned categories. As an example, consider a robotic scenario in which a robot is faced with a new object category. It would be ideal if a user could teach the new category to the robot on-site using a few object views from the new category, without accessing the old training data used for previously learned categories. It is very difficult to use CNNs for such tasks since they are data-hungry approaches and suffer from catastrophic forgetting, as discussed in previous sections. We have tackled this problem by proposing an instance-based learning and recognition (IBL) approach which considers category learning as a process of learning about the instances of a category, i.e., a category is represented simply by a set of known instances, C ← {O_1, ..., O_n}, where O_i is the deep representation of an object view. In this work, the deep representation of an object is obtained after the element-wise pooling layer, as shown in Fig. 1. IBL is a baseline approach to evaluate object representations. An advantage of IBL approaches over other machine learning methods is the ability to quickly adapt an object category model to a previously unseen instance by storing the new instance or by throwing away an old instance. Therefore, the training phase is very fast in IBL approaches. Such learning approaches are able to recognize objects using a small number of representative instances, while storing too many redundant instances results in large memory consumption and slows down the recognition speed. Therefore, in our current setup, a new instance of a specific category is stored in the robot's memory in the following situations:

• When the teacher teaches a certain category for the first time, through a teach or a correct action, an instance-based model of this new category is created and initialized with the set of views of the target object collected since object tracking started:

$C_1 \leftarrow \{O_1, \dots, O_{k_1}\}$,   (2)

where $k_1$ is the number of stored key object views for the first teaching action.

• In later teaching actions, the target object views are added to the model of the category:

$C_n \leftarrow C_{n-1} \cup \{O_1, \dots, O_{k_n}\}$,   (3)

where $k_n$ is the number of stored key object views for the n-th teaching action.

It is worth mentioning that it is possible to let the robot learn in a self-supervised manner. For instance, if the dissimilarity of a new object to all known categories is above a threshold, the robot infers that the given object does not belong to the known categories, and thus initializes a new category labeled as "category m+1", where m is the number of categories known so far. However, such a category label is not meaningful for a human user, and the user is not able to naturally instruct the robot to perform a task, e.g., "give me a mug". In contrast, the mechanism of introducing a new object category through human-robot interaction not only establishes a mutual language between robot and user, but also prevents the linear accumulation of redundant object views and optimizes the number of instances required to represent a specific category, leading to less memory consumption¹. Whenever a new object is added to a category, the agent retrieves the current model of the category and updates it by adding the representations of the new object views to the category model. In particular, our approach can be seen as a combination of a particular object representation, similarity measure and classification rule. Therefore, the choice of the similarity metric has an impact on the recognition performance. In the case of the similarity measure, since each object is represented by a global deep feature, the dissimilarity between two objects can be computed by different distance functions.

We refer the reader to a comprehensive survey on distance/similarity measures provided by S. Cha [35]. After performing several cross-validation experiments, we concluded that two types of distance functions, Jensen-Shannon (JS) and chi-squared (χ²), are suitable to estimate the similarity between two instances. Both functions are bin-to-bin distance functions. Although the practical results of χ² and JS are almost identical, χ² is computationally more efficient. Therefore, we use the χ² function to estimate the similarity of two instances. Mathematically, let P and Q ∈ R^K be the representations of two objects:

$D(P, Q) = \dfrac{1}{2}\sum_{i=1}^{K}\dfrac{(P_i - Q_i)^2}{P_i + Q_i}$.   (4)

The dissimilarity of a target object and all stored instances of all categories, C, is computed using Eq. 4. The target object is finally classified based on the nearest neighbor distance:

$\mathrm{category}(T) = \underset{C \in \{1,\dots,m\}}{\arg\min}\; D(T, O_{C_i})$,   (5)

where m is the number of categories known so far, and $O_{C_i}$ is the i-th stored instance of category C.

¹It is worth mentioning that prototype learning approaches [29] or forgetting mechanisms [34] can be used to bound the memory usage of each category and the overall memory usage.
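The instance-based learning and recognition scheme above amounts to a small amount of bookkeeping plus a nearest-neighbor search. The sketch below is our own illustration (the class and method names are not from the paper); the small eps term only guards against division by zero for empty bins, and the features are assumed non-negative as they come from a ReLU-activated pooling layer.

```python
import numpy as np

class InstanceBasedLearner:
    """Minimal sketch of the IBL scheme of Section IV: a category is the set of
    stored deep features of its known views (Eqs. 2-3), and recognition is a
    nearest-neighbor search under the chi-squared distance (Eqs. 4-5)."""

    def __init__(self):
        self.categories = {}                      # label -> list of feature vectors

    def teach(self, label, views):
        """Create or extend a category model with the given key object views."""
        self.categories.setdefault(label, []).extend(
            np.asarray(v, dtype=float) for v in views)

    @staticmethod
    def chi2(p, q, eps=1e-12):
        """Chi-squared distance of Eq. 4."""
        return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

    def classify(self, target):
        """Return the label of the nearest stored instance (Eq. 5) and its distance."""
        target = np.asarray(target, dtype=float)
        best_label, best_dist = None, np.inf
        for label, instances in self.categories.items():
            for inst in instances:
                d = self.chi2(target, inst)
                if d < best_dist:
                    best_label, best_dist = label, d
        return best_label, best_dist
```

In this sketch, a correct action is simply another call to teach with the mistaken view, and a threshold on the returned distance could trigger the self-supervised "category m+1" behavior mentioned above.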

V. EXPERIMENTAL RESULT AND DISCUSSION

We extensively evaluate the proposed recognition system using four types of experiments. All tests were performed on a PC running Ubuntu 18.04 with a 3.20 GHz Intel Xeon(R) i7 CPU and a Quadro P5000 NVIDIA graphics card.

A. Ablation Study

As shown in Fig. 1, the proposed approach uses orthographic projections and three CNNs to generate a global deep representation for the given object. In this approach, the resolution of the orthographic projections and the CNN architecture must be well selected to provide a good balance among recognition performance, computation time and memory usage. To define the optimal system configuration, we conducted 14 sets of experiments using various CNN architectures. All CNNs have been pre-trained on the ImageNet dataset [36] using the default network configurations as suggested in their scientific papers. In this work, we assumed that the system does not have a predefined set of training data for the target task (3D object recognition) available at the beginning of the learning process. It is worth mentioning that if an application has enough training data in advance, even random initialization could produce excellent results [37]. For each CNN, 18 experiments were performed with different resolutions of the orthographic images (i.e., ranging from 25 × 25 to 225 × 225 pixels) and avg/max pooling functions.

Dataset and evaluation metrics: The offline evaluations were carried out using the Princeton ModelNet10 dataset [6]. This dataset categorizes 4899 objects into ten categories and defines training and testing data for each category (in total 3991 training samples and 908 testing samples). ModelNet10 is a relatively small dataset with significant intra-class variation; therefore, it is suitable for performing extensive sets of experiments to fine-tune the parameters of the system. The obtained results are reported using two metrics: Average Instance Accuracy (AIA) and Average Class Accuracy (ACA). The former is defined as the percentage of correctly recognized test instances. The latter, ACA, represents the average accuracy over all categories.

Results: A summary of the experiments is plotted in Fig. 4. In the case of ACA (top-row), the best result was obtained with MobileNet-v2, average pooling and an image resolution of 150 × 150 pixels. The accuracy of the proposed system with this configuration was 0.8685. It should be mentioned that while higher-resolution orthographic images may provide more details about the object, they increase computation time, memory usage and sensitivity to noise. As can be observed from Fig. 4 (top-row), the average class accuracy of the DenseNet architectures [38] (i.e., 121, 169, 201) is the worst among the evaluated CNNs, and MobileNet-v2, VGG16-fc1 [39] and ResNet50 [40] are the top-three CNNs, achieving a good average class accuracy with stable performance. An important observation is that, although the DenseNet architectures are more advanced models, their accuracies are not as good as the others. This observation can be explained by the fact that deeper networks learn a set of specific features (problem dependent), while less complex networks usually learn general features (problem independent). In other words, we observed that since orthographic images are different from the images in the ImageNet dataset (used to pre-train the networks), general deep features learned by small networks could represent orthographic images much better than specific deep features learned by deep networks. We also realized that several misclassifications mainly occurred among items that look alike. In particular, some instances in the desk category have a very similar shape to instances of the table category; similarly, there are several highly similar instances in the dresser and night stand categories.

Fig. 4. Summary of off-line evaluations using various CNN architectures and two pooling functions: (top-row) average class accuracy vs. image resolution; (bottom-row) average instance accuracy vs. image resolution; (left-column) results using max pooling; (right-column) results using average pooling.

Fig. 4 (bottom-row) compares the average instance accuracy of the evaluated system configurations, in which several observations can be made. First, similar to the previous round of experiments, our approach with MobileNet-v2, VGG16-fc1 and ResNet50 demonstrated a good performance. In contrast, the AIA of the DenseNets was lower than the others under all levels of image resolution. The other evaluated CNNs, including Xception [41], NASNet [42], Inception [43] and Inception-ResNet [44], demonstrated a medium-level performance. Overall, the best AIA was obtained with MobileNet-v2, max pooling and orthographic projections with a resolution of 150 × 150 pixels, which was 88.56 percent. These parameters were selected as the default system configuration. The AIA of the system with the default configuration and average pooling was 87.44. Note that the length of the feature vector and the size and number of parameters of the network have a direct influence on memory usage and computation time. Table I summarizes the properties of the top-three CNNs. The length of the feature extracted by MobileNet-v2 is 1280 floats, while both VGG16-fc1 and ResNet50 represent an object by a vector of 4096 floats. The size of MobileNet-v2 is 14 MB, which is around 37 and 7 times smaller than VGG16 and ResNet50, respectively.

TABLE I
PROPERTIES OF THE TOP-THREE CNNS

Model        | Feature Length | Size   | #Parameters | Depth
MobileNet-v2 | 1280 float     | 14 MB  | 3.53 M      | 88
VGG16        | 4096 float     | 528 MB | 138.35 M    | 23
ResNet50     | 4096 float     | 99 MB  | 25.63 M     | 50

We also performed an experiment to measure the computational time of the proposed approach with MobileNet-v2. Since the number of points of an object directly affects the computational time, we calculated the average time required to render three projections and process them by the network for 20 randomly selected objects from the ModelNet10 dataset. The average computation time for rendering the orthographic images was 0.003 seconds, and processing all three views by the network to generate a global feature for the given object took 0.008 seconds on average. According to these evaluations, our approach with MobileNet-v2 is competent for applications with strict limits on the memory and computation time requirements. This evaluation shows that the overall performance of the proposed approach is promising and that it is capable of providing distinctive global features for 3D objects.
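As a concrete illustration of the selected configuration (MobileNet-v2 backbone and max pooling over views), the following sketch extracts one 1280-dimensional feature per orthographic view with an ImageNet-pre-trained MobileNetV2 from tf.keras and merges the three view-wise features with an element-wise maximum. This is our own minimal reconstruction, not the released implementation; note that Keras falls back to the 224 × 224 ImageNet weights for the non-standard 150 × 150 input size used here.

```python
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Pre-trained backbone without the classification head; global average pooling
# turns each projection into a 1280-dimensional view-wise feature.
backbone = MobileNetV2(weights="imagenet", include_top=False,
                       pooling="avg", input_shape=(150, 150, 3))

def global_feature(views):
    """views: three orthographic projections, each a (150, 150) grayscale image
    in [0, 255]. Returns the global object feature obtained by element-wise
    max pooling over the three view-wise features."""
    batch = np.stack([np.repeat(v[..., None], 3, axis=-1) for v in views])
    feats = backbone.predict(preprocess_input(batch.astype("float32")),
                             verbose=0)            # shape (3, 1280)
    return feats.max(axis=0)                       # element-wise max over the views
```

Because the views are rendered in the object's own reference frame, the merged vector changes little when the same object is observed under a different pose or scale, which is what the instance-based classifier of Section IV relies on.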

B. Offline Evaluation

According to the results of our ablation study, we used MobileNet-v2 as the backbone of the OrthographicNet architecture. Furthermore, to have a fair comparison with the state-of-the-art approaches, we replaced the open-ended classifier of the proposed architecture with a small Deep Neural Network (DNN). We trained 100 networks with different architectures, various optimization algorithms, and different learning rates for 200 epochs each. The network architecture that obtained the best performance consists of four fully connected layers with 1024, 512, 256 and 128 neurons, respectively, and a softmax layer with the same number of nodes as the number of classes in the object dataset. We used a ReLU activation function for each fully connected layer. A dropout layer has been added between each pair of fully connected layers, as well as between the last fully connected layer and the softmax layer. We set the dropout rate to 30% in all layers. In this round of experiments, OrthographicNet is trained from scratch using stochastic gradient descent (SGD) with a momentum of 0.9. The learning rate, η, is initially set to 0.1 and decayed by η × (0.98 × #epochs). Note that all weights are initialized randomly and the categorical cross-entropy is used as the loss.
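A minimal Keras version of the classifier head described above (four fully connected layers, 30% dropout, softmax output, SGD with momentum) might look as follows; the layer widths, dropout rate, optimizer settings and loss come from the text, while the function name, the input feature dimension and everything else are our own illustrative choices.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD

def build_offline_classifier(feature_dim, num_classes):
    """DNN head used in the offline evaluation: 1024-512-256-128 fully connected
    layers with ReLU, 30% dropout between layers, and a softmax output layer."""
    model = Sequential()
    model.add(Dense(1024, activation="relu", input_shape=(feature_dim,)))
    for width in (512, 256, 128):
        model.add(Dropout(0.3))
        model.add(Dense(width, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation="softmax"))
    model.compile(optimizer=SGD(learning_rate=0.1, momentum=0.9),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# e.g., for the merged MobileNet-v2 feature and ModelNet40:
# model = build_offline_classifier(feature_dim=1280, num_classes=40)
```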
Dataset and evaluation metrics: We evaluated our approach using the ModelNet10 and ModelNet40 datasets. The former dataset has been discussed in the previous subsection. The latter, ModelNet40, consists of 12311 instances categorized into 40 categories, of which 9843 are training and 2468 are test objects [6]. It should be noted that we used the original train and test splits of the datasets [6], and generated three orthographic projections (top, side, and front views) of all training and testing objects. Classification accuracy is used as the evaluation metric to compare the performance of the different approaches.

Results: We compared our method with 13 state-of-the-art approaches. Table II summarizes the comparative results of classification on ModelNet10 and ModelNet40. In the case of ModelNet10, it was observed that the classification accuracy of our approach was significantly better than the other state-of-the-art approaches that use 2D projections as input. Particularly, it was 1.47 percentage points (p.p.) and 1.77 p.p. better than RotationNet with three views and Pairwise, i.e., the two best approaches among 2D-projection-based approaches, respectively. Furthermore, OrthographicNet significantly outperformed the state-of-the-art methods that use 3D volumetric data and point clouds as input. In particular, OrthographicNet clearly outperforms both the VRN [45] and LightNet [5] methods by 0.97 p.p. and 1.18 p.p., respectively. In the case of the ModelNet40 dataset, the proposed OrthographicNet method outperformed the state-of-the-art approaches that use 2D projections and most of the approaches that use 3D data as input (outperforming 12 out of 13 approaches). However, it was observed that VRN worked slightly better than our approach (i.e., 91.00 vs. 91.33). The fact that our approach is computed on a stable, unique, and unambiguous local reference frame is likely to explain the obtained results. Furthermore, our approach represents a 3D object using three orthographic projections. In particular, OrthographicNet (i) provides a compact, unique, and rich representation of objects, and (ii) is efficient to compute and less affected by noise, since in each projection one of the dimensions of the object is discarded.

TABLE II
CLASSIFICATION ACCURACY AND COMPARISON TO STATE-OF-THE-ART METHODS USING THE MODELNET DATASETS

Input data    | Algorithm                 | ModelNet40 | ModelNet10
2D projection | MVCNN [19]                | 90.10      | –
2D projection | Pairwise [46]             | 90.70      | 92.80
2D projection | PANORAMA-NN [47]          | 90.70      | 91.10
2D projection | Geometry Image [18]       | 83.90      | 88.40
2D projection | Multiple Depth Maps [48]  | 87.80      | 91.50
2D projection | DeepPano [7]              | 77.63      | 85.45
2D projection | GIFT [49]                 | 83.10      | 92.35
2D projection | RotationNet (3 Views) [3] | 89.00      | 93.10
2D projection | OrthographicNet           | 91.00      | 94.57
3D volume     | 3DShapeNets [6]           | 77.00      | 83.50
3D volume     | 3D-GAN [50]               | 83.30      | 91.00
3D volume     | LightNet [5]              | 86.90      | 93.39
3D volume     | FusionNet [51]            | 90.80      | 93.11
3D volume     | VRN [45]                  | 91.33      | 93.60
* The best result is shown in bold, and the second best is highlighted in blue.

C. Open-Ended Evaluation

Offline evaluation methodologies follow the classical Train-then-Test procedure, i.e., two separate stages. These methodologies are not well suited to evaluate open-ended learning systems, because they do not abide by the simultaneous nature of learning and recognition, and the number of categories must also be predefined in such evaluations. Therefore, another round of experiments was carried out using an open-ended teaching protocol [52], [53], [54]. The main idea is to emulate the interactions of a robot with the surrounding environment over long periods of time. The teaching protocol determines which examples are used for training the algorithm and which are used to test the algorithm. This protocol is based on a Test-then-Train scheme, which can be followed by a human user or by a simulated user. We developed a simulated teacher to perform a set of consistent and reproducible experiments in a fraction of the time a human would take to do the same task. The simulated user interacts with the agent using three basic actions:

• Teach: introduces the category of an object to the agent;
• Ask: inquires about the category of a test object view;
• Correct: if the agent classifies an object view incorrectly, the simulated teacher provides corrective feedback.

In this round of evaluation, the agent starts with no previous knowledge. The teacher randomly selects three object views of a category and teaches that category to the agent. The model the agent creates based on these instances is stored in the agent's memory. After teaching each category, the teacher tests the agent to see if it has learned the new category and has not forgotten the previously learned categories. This is done by asking the agent to recognize never-seen-before instances of all previously learned categories.
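The teach/ask/correct protocol can be summarized in a few lines of Python. The sketch below is our own schematic of the simulated teacher, assuming an agent with the teach/classify interface sketched earlier and a hypothetical dataset object that can supply labeled object views; the window size (3n over the last question/correction iterations), the accuracy threshold (0.67) and the patience (100 iterations) follow the values given in the text, while the stopping test for "all categories seen at least once" is simplified here.

```python
import random
from collections import deque

def simulated_teacher(agent, dataset, threshold=0.67, patience=100):
    """Open-ended Test-then-Train protocol: introduce categories one by one,
    question the agent on never-seen-before views, and correct its mistakes."""
    learned, results = [], deque()             # results: sliding window of 0/1 outcomes
    qci = 0                                    # question/correction iterations
    for category in dataset.categories():      # candidate categories, in shuffled order
        for view in dataset.sample_views(category, 3):
            agent.teach(category, [view])      # Teach action
        learned.append(category)
        since_new = 0
        while True:                            # Ask / Correct until threshold or give-up
            target = random.choice(learned)
            view = dataset.unseen_view(target)
            prediction, _ = agent.classify(view)        # Ask action
            if prediction != target:
                agent.teach(target, [view])             # Correct action
            results.append(prediction == target)
            if len(results) > 3 * len(learned):         # window of the last 3n iterations
                results.popleft()
            qci += 1
            since_new += 1
            protocol_accuracy = sum(results) / len(results)
            # Simplification of "all learned categories classified at least once
            # and protocol accuracy above the threshold".
            if protocol_accuracy > threshold and len(results) >= len(learned):
                break                                   # ready for a new category
            if since_new > patience:
                return learned, qci                     # agent cannot learn more categories
    return learned, qci                                 # stopped for "lack of data"
```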

In this experiment, when the agent incorrectly recognizes a test instance, the teacher provides corrective feedback. This feedback allows the agent to adjust its category model using the mistaken instance – this way, the agent incrementally learns new object categories using as few instances as possible.

The recognition performance of the agent is continuously evaluated using a sliding window over the last 3n question/correction iterations, where n is the number of learned categories. We call this measure the protocol accuracy. To compute the protocol accuracy, if the number of question/correction iterations since the last category was introduced is less than 3n, all results are taken into account. If the protocol accuracy of the agent exceeds a certain threshold (τ = 0.67, meaning accuracy is at least twice the error rate), the simulated teacher randomly selects a new category and teaches it to the agent. If the agent fails to reach the protocol accuracy threshold within a certain number of iterations (e.g., 100 iterations) since the last category was introduced, the teacher infers that the agent is not able to learn more categories at this point and stops the experiment. It is also possible that the agent learns all existing object categories and keeps the protocol accuracy above the threshold. In this case, the experiment is also stopped and the stopping condition is "lack of data". Since the order of teaching new categories and instances may have an effect on the performance of the agent, we have performed ten experiments for each approach. For example, if the simulated teacher introduces an apple right after a tomato, it would be harder to recognize this new object than if a banana were introduced after the tomato. In order to compare methods fairly, the simulated teacher shuffles the data at the beginning of each round of experiments and uses the same order of object categories and instances for teaching and testing all the methods. Furthermore, we re-implemented all of the selected baselines ourselves in order to ensure a fair comparison.

Fig. 5. Interaction between the simulated teacher and the learning agent; (left) the simulated teacher is connected to a large object dataset and interacts with the agent by teach, ask and correct actions, shown in blue; (right) in the case of an ask action, the agent is evaluated on a never-seen-before object. The agent recognizes the object and sends the result back to the simulated user (shown in red); in the cases of teach and correct actions, the agent creates a new category model or updates the model of the respective category.

Object Dataset and Evaluation Metrics: For this round of evaluations, the Washington RGB-D Object dataset [55] is used. This dataset consists of 250,000 views of 300 common household objects in 51 categories. The performance evaluation of an open-ended learning system cannot be limited to the classical evaluation metrics (accuracy, precision and recall). Therefore, we use three recently introduced sets of metrics [15] to evaluate the learning performance. They include: (i) the average number of learned categories at the end of an experiment (ALC), an indicator of how much the system is able to learn in an open-ended setting; (ii) the average number of question/correction iterations (QCI) required to learn those categories, and the average number of stored instances per category (AIC), which represent the time and memory resources required to learn those categories – therefore, a lower value is better, as it means that fewer instances and thus less computing power were needed to learn an object category; (iii) the Global Classification Accuracy (GCA), which describes the accuracy of the agent over the whole run, and the Average Protocol Accuracy (APA), which represents the average accuracy over all windows. Therefore, GCA and APA are indicators of how well the system learns.

Results: We have conducted an extensive set of open-ended experiments and comparisons to highlight the effectiveness of the proposed approach. In particular, we compared OrthographicNet with five recently proposed open-ended learning approaches, including RACE [12], BoW [9] based on an L2-based nearest neighbour classifier, an updated version of the online LDA [56] (referred to here as Open-Ended LDA), Local-LDA [11] and GOOD [8]. Furthermore, since our approach with MobileNet-v2, VGG16-fc1 and ResNet50 demonstrated good performance in the previous round of evaluation, we have included all of them in this round of evaluation.

To better understand the idea, the performance of OrthographicNet-MobileNetV2 in the initial 200 iterations of the first open-ended experiment is plotted in Fig. 6 (top). In this experiment, the simulated user progressively estimates the recognition accuracy of the system and, in case this accuracy exceeds a given threshold (marked by the horizontal green line), introduces an additional object category (marked by the vertical red lines and category names).
The simulated teacher also provides corrective feedback in case of misclassification. This way, the system is trained and, at the same time, the accuracy of the system is continuously estimated. This plot clearly represents the different steps of the simulated teaching process. In this experiment, categories that are very similar in terms of shape are harder to learn; think for example of rounded objects (lime, bell pepper, onion, tomato, apple, etc.). Fig. 6 (second) shows the global classification accuracy (i.e., the accuracy since the beginning of the experiment) as a function of the number of learned categories. In this figure, we can see that the global classification accuracy decreases as more categories are learned. This is expected since the number of categories known by the system makes the classification task more difficult. Fig. 6 (third) shows the number of learned categories as a function of the protocol iterations. This gives a measure of how fast the learning occurred in each of the experiments. This experiment concluded prematurely due to "lack of data" (i.e., no more categories available in the dataset), indicating the potential for learning many more categories. Fig. 6 (bottom) represents the number of instances stored in the models of all of the categories of the mentioned experiment. These are the three instances provided at the introduction of the category together with any instances that had to be corrected somewhere along the experiment run. The ball category apparently was the most difficult one, requiring the largest number of instances. However, note that instances introduced near the end of the experiment have been tested less, which is clearly visible in a general trend for fewer instances to be included for categories appearing later.

Fig. 6. Summary of the first open-ended experiment of OrthographicNet-MobileNetV2: (top) evolution of protocol accuracy over the first 200 iterations. In this experiment, the protocol threshold is set to 0.67, as shown by the green horizontal dashed line. Once the agent has classified all learned categories at least once, and the protocol accuracy is higher than the threshold, a new category is introduced, which is shown by a red dashed line and a category name. (second) This plot shows global classification accuracy as a function of the number of learned categories for the same experiment. It can be seen that the agent was able to stay above the protocol threshold during the entire experiment, indicating the agent can learn many more categories. (third) This graph represents how fast the agent learns by showing the number of learned categories as a function of the number of question/correction iterations. (bottom) This graph shows the number of instances stored in each category, i.e., the three instances provided at the introduction of the category together with the instances that had to be corrected somewhere along the experiment run. The ball category apparently was the most difficult one, requiring the largest number of instances.

A detailed summary of all experiments is reported in Table III and Fig. 7.

TABLE III
RESULTS OF OPEN-ENDED EVALUATIONS USING THE RGB-D DATASET [55]

Approaches           | #QCI    | ALC   | AIC   | GCA  | APA
RACE [12]            | 382.10  | 19.90 | 8.88  | 0.67 | 0.78
BoW [9]              | 411.80  | 21.80 | 8.20  | 0.71 | 0.82
Open-Ended LDA [56]  | 262.60  | 14.40 | 9.14  | 0.66 | 0.80
Local-LDA [11]       | 613.00  | 28.50 | 9.08  | 0.71 | 0.80
GOOD [8]             | 1659.20 | 39.20 | 17.28 | 0.66 | 0.74
ours-MobileNet-v2 (*)| 1342.60 | 51.00 | 8.97  | 0.77 | 0.80
ours-VGG16-fc1 (#)   | 1568.70 | 48.10 | 12.42 | 0.71 | 0.76
ours-ResNet50 (*)    | 1446.60 | 51.00 | 10.21 | 0.75 | 0.78
(*) Stopping condition was "lack of data". (#) In 6 out of 10 experiments, the stopping condition was "lack of data".

One important observation is that the scalability of OrthographicNet was much better than that of the other approaches. In particular, our approach with MobileNet and ResNet50 learned all 51 categories in all 10 experiments. All of those experiments concluded prematurely due to "lack of data", i.e., no more categories available in the dataset, indicating the potential for learning many more categories. OrthographicNet with VGG16-fc1 also obtained acceptable scalability (i.e., in six out of 10 experiments, the agent could learn all categories), while the scalability of the other approaches was much lower than ours. Our approach with MobileNet, on average, learned all the categories faster than the other CNN approaches. QCI measures how fast the learning occurred in each of the experiments. It was observed that the longest experiments were those of OrthographicNet-VGG16-fc1, and the shortest ones were those of Open-Ended LDA. In particular, on average, the agent with Open-Ended LDA was able to learn 14.40 categories in 262.60 iterations, while the agent with OrthographicNet-VGG16-fc1 continued for 1568.70 iterations to learn 48.10 categories. By comparing all approaches, it is clear that RACE, BoW, and OrthographicNet-MobileNet-v2, on average, stored fewer than nine instances per category. Although RACE and BoW stored fewer instances per category (AIC) than OrthographicNet-MobileNet-v2, the difference is minor (less than one instance per category). The discriminative power of OrthographicNet is much better than that of RACE and BoW. Specifically, our approach learned 31.10 and 29.20 categories more than the RACE and BoW approaches, respectively.

The right column of Fig. 7 correlates the global classification accuracy (GCA, top-right) and average protocol accuracy (APA, bottom-right) obtained by the evaluated approaches with the average number of learned categories (ALC). Our approach with MobileNet-v2 achieved the best GCA with stable performance. BoW achieved better performance than our approaches regarding APA. This is expected since BoW learned fewer categories, and it is easier to obtain a better APA with fewer categories. In particular, our approach with MobileNet-v2 was able to learn all 51 categories, on average, while the other approaches learned fewer than 40 categories. It can be concluded that OrthographicNet-MobileNet-v2 achieved the best performance.

Fig. 7. Summary of open-ended evaluations: (left-top) the number of question/correction iterations (QCI) required to learn a certain number of categories; (left-bottom) the average number of stored instances per category (AIC) and the average number of learned categories (ALC) at the end of an experiment, as an indicator of how much memory each approach takes; the right-top and right-bottom plots correlate the global classification accuracy (GCA) and average protocol accuracy (APA) with the ALC, as an indicator of how well each approach learns.

The protocol threshold parameter, τ, defines how well the agent should learn object categories. For example, τ = 0.67 means the recognition accuracy is at least twice better than the error. To explore the impact of protocol

TABLE IV using the representation of extracted training objects’ views RESULT OF OPEN-ENDED EVALUATIONS WITH DIFFERENT VALUES FOR (see Fig. 8). Afterwards, we test the robot by the remaining ten THEPROTOCOLTHRESHOLD [55]. scenes (shown by red cameras in Fig. 8 (left). It was observed τ #QCI ALC AIC GCA APA that the robot could recognize all objects correctly based on 0.70 1384 51 9.90 0.77 0.80 the learned knowledge from the three training scenes (Fig. 8 0.80 1943 51 10.41 0.81 0.84 right). It should be mentioned that some misclassification also 0.90 1972 34 10.82 0.87 0.93 happened during the evaluation. At some points, the object tracking module did not track the point cloud of the object precisely. Therefore, different portions of the target object threshold, an additional set of experiments with different were excluded, and misclassification happened as a result. values of τ ∈ [0.7,0.8,0.9] was evaluated using the best system We later tested the robot in two new contexts. The first configuration (OrthographicNet+MobileNet-v2). It should be context contains six instances lipton, oreo, and amita cate- noted that the order of introducing categories and instances is gories. It was seen that the robot recognized all instances by the same for all experiments. Refer to Table IV for the impact using the learned knowledge from the previous context. In that changing τ has on system outcomes. We can see that, as this context, we mainly evaluated the scale- and pose-invariant τ increases, the GCA and APA increases as well. However, properties of the OrthographicNet. The robot recognized both this comes at the cost of the amount of learned categories. amita instances correctly. In the second context, all scenes The fact that the orthographicNet with τ = 0.9 yields better contain four instances of two (lipton and softkings) categories, accuracies, can be explained by Fig. 6 (second plot from top), where the two instances of lipton occluded the softking which clearly shows that the amount of learned categories has objects. Besides, instances of these categories have a very a negative correlation with global classification accuracy. similar shape, and therefore, make the object recognition task harder for the robot. We observed that the robot correctly D. Evaluation of Real-Time Performance recognized all instances using the learned knowledge from the initial context (i.e., three consecutive scenes). This evaluation We carried out three real-word demonstrations to show the shows the robustness and descriptive power of the proposed real-time performance of the proposed approach. For these approach. A video of this evaluation is accessible online at experiments, we integrated our work into the robotic system https://youtu.be/JEUb-Q7TbJQ presented in [12], and used Kinect V1 sensor to perceive the “Serve A Drink” scenario: In this scenario, there is a environment. table in front of a Kinect sensor as shown in Fig. 9 (a). For Domestic Environment: We have used the Domestic En- object categories including juiceBox, oreo, cup and bottle are vironment Dataset [57] for this round of evaluation. This used in this round of evaluation. In contrast with the previous dataset consists of several 3D scenes containing a table and a demonstration, the robot initially knows juiceBox and oreo set of table-top objects including amita, colgate, lipton, elite, categories, which have been learned from batch data, and does oreo and softkings. As shown in Fig. 
8, each scene has been not have any knowledge about the bottle and cup categories. captured under different levels of clutter and occlusion and Throughout this evaluation, a human-user puts objects on top contains a various set of objects in different position and of the table and teaches new categories to the robot. The robot orientation. Therefore, it is a suitable dataset to evaluate the should demonstrate the capability of recognizing instances of robustness of the proposed approach to clutter and occlusion, known categories, and also the capability of learning about as well as its scale- and pose-invariant properties. new object categories in a supervised and open-ended manner. In this evaluation, the robot begins without any prior knowl- Figure 9 and the following description explains the behavior edge; therefore, it recognizes all detected objects as unknown. of the proposed approach: A human-user teaches all object categories to the robot using • The user presents a cup object to the system and provides the provided human-robot interaction interface. This way, the the respective category labels (TrackID9 as a cup). The robot conceptualizes all object categories in an online fashion system conceptualizes the cup category and TrackID9 is correctly recognized. • The instructor places a juiceBox on the table. The system has learned about juiceBox category from batch data, therefore, TrackID10 is recognized properly. An addi- tional juiceBox is placed at the left-side of the table. Tracking is initialized and the juiceBox is recognized accurately (Fig. 9 b). • The instructor moves the right juiceBox for a while to show the real-time performance of the OrthographicNet for performing object recognition and pose estimation concurrently (Fig. 9 c). Fig. 8. Domestic environment demonstration: (left) Experimental setup: we • Later, the user removes all objects from the scene; no ob- use three consecutive scenes for training (blue cameras) and use the remaining 10 scenes for testing (red cameras). (right) An example of object recognition jects are visible; an oreo and a bottle enter the scene. They results on the first test scene. are detected and assigned to TrackID13 and TrackID14 10

Fig. 9. Serve a drink scenario: (a) experimental setup; (b) this snapshot shows that the proposed approach supports both batch and open-ended learning; the system gained knowledge about juiceBox using batch learning and learns about the cup category in an open-ended manner; (c) a user then moves the juiceBox and the system can track and recognize it correctly; (d) the user removes all objects from the table and adds three new objects, including an oreo, a bottle, and a new cup. The system had prior knowledge about oreo and learned about the bottle category in an online manner. The system recognizes all objects correctly.

respectively. Because there is no prior knowledge about the bottle category, a misclassification occurs: TrackID14 is labelled as a bottle. The system first conceptualizes the bottle category and then recognizes it correctly.

• Another cup is placed on the table. This particular cup had not been seen before, but it is recognized correctly since the system learned about the cup category earlier (Fig. 9 d).

This demonstration shows that, apart from batch learning, the robot can also learn about new object categories in an open-ended fashion. Furthermore, it shows that the proposed approach is capable of recognizing objects in various positions. A video of this demonstration is available online at: https://youtu.be/sFJFc lnzHQ

Robotic application: We performed a real-robot experiment in the context of a clear-table task. Our experimental setup is shown in Fig. 10. In this experiment, the robot should detect the pose of and recognize all objects, including the Cups and the TrashBasket, and then clear all Cups from the table by putting them into the basket. At the beginning of this experiment, the set of categories known to the system is empty and the system recognizes all table-top objects as Unknown (Fig. 11 a). A user interacts with the system by teaching the Cup and TrashBasket categories. The system conceptualizes both categories and later recognizes them properly (Fig. 11 b). This illustrates the process of acquiring categories in an open-ended fashion with user mediation. Later, the user instructs the robot to perform the clear-table task. In this experiment, the robot must be able to detect the pose of and recognize all active objects. Afterwards, in each execution cycle, it has to grasp the nearest Cup, transport it on top of the TrashBasket, and release it. This sequence is shown in Fig. 11 (c)–(d). A video of this session is available online at: https://youtu.be/4n2v8LmB9D0

Fig. 10. Our experimental setup for the real-robot experiment: it consists of a Kinect camera, a UR5e robotic arm and a Qbhand as the primary sensory-motor embodiment for perceiving and acting upon the environment.

VI. CONCLUSION

In this paper, we proposed OrthographicNet, a deep transfer learning based approach for 3D object recognition in open-ended domains. The approach provides a good trade-off between descriptiveness, computation time and memory usage, allowing concurrent object recognition and pose estimation. OrthographicNet computes a unique and repeatable global object reference frame and three scale-invariant orthographic projections for a given object. The orthographic projections are then fed as input to three CNNs to obtain view-wise deep features. The obtained features are then merged using an element-wise max pooling layer to form a rotation-invariant global feature for the given object. A set of experiments was carried out to assess the performance of OrthographicNet and compare it with other state-of-the-art approaches with respect to several characteristics, including descriptiveness, scalability and memory usage. We have shown that OrthographicNet achieves better performance than the selected state-of-the-art approaches and is suited for real-time robotic applications. In the continuation of this work, we would like to consider color information [58], [59] and use orthographic projections for simultaneous object recognition and grasp affordance detection.
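For readers who prefer code, the view-pooling step summarized above can be sketched as follows. This is a minimal illustration only: the tiny convolutional encoder is a stand-in for the ImageNet pre-trained backbones evaluated in the paper, and the layer sizes, input resolution and feature dimension are illustrative assumptions rather than the actual configuration.

```python
# Minimal sketch: encode each orthographic projection with a shared CNN
# and merge the view-wise features with element-wise max pooling.
import torch
import torch.nn as nn

class OrthographicEncoder(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # Shared backbone applied to each projection (front, side, top).
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, views):
        # views: (batch, num_views, H, W) depth projections rendered
        # from the object's global reference frame.
        feats = [self.backbone(views[:, i:i + 1]) for i in range(views.shape[1])]
        # Element-wise max over views yields a single global feature that
        # does not depend on the order in which the views are presented.
        return torch.stack(feats, dim=0).max(dim=0).values

encoder = OrthographicEncoder()
projections = torch.rand(1, 3, 64, 64)   # one object, three projections
global_feature = encoder(projections)
print(global_feature.shape)              # torch.Size([1, 128])
```

Because the pooling is element-wise and the projections are computed in the object's own reference frame, the resulting feature is insensitive to how the object is rotated or scaled in the camera frame.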

Fig. 11. The clear-table scenario: (a) all table-top objects are detected and highlighted by green bounding boxes. The local reference frame of each object represents its pose; it is initially estimated by the proposed approach and then tracked by an object tracking module. In this demonstration, the robot has no prior knowledge about any object; therefore, all objects are recognized as Unknown, shown in red on top of each object. (b) A user then teaches the Cup and TrashBasket categories to the robot, and all objects are then correctly recognized. Afterwards, the user instructs the robot to perform the clear-table task. Towards this goal, in each execution cycle, the robot finds the nearest Cup and moves to its pre-grasp area (c); it then grasps the Cup, moves it on top of the TrashBasket, and releases it (d). This procedure is repeated several times to remove all Cups from the table.
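To make the interactive teach-and-recognize loop of this scenario concrete, the following is a minimal sketch of an instance-based, open-ended category memory: teaching a category stores labelled features, and recognition returns the closest known category or "Unknown" when nothing is similar enough. The class name OpenEndedMemory, the cosine similarity and the 0.75 acceptance threshold are illustrative assumptions, not the exact memory model or similarity measure used by the system.

```python
# Minimal sketch of an open-ended, instance-based category memory.
import numpy as np

class OpenEndedMemory:
    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.instances = {}            # category label -> list of feature vectors

    def teach(self, label, feature):
        # "Conceptualize": store a labelled instance; unseen labels create
        # new categories on the fly, without retraining a fixed classifier.
        self.instances.setdefault(label, []).append(np.asarray(feature, float))

    def recognize(self, feature):
        # Return the most similar known category, or "Unknown" if no stored
        # instance exceeds the acceptance threshold.
        feature = np.asarray(feature, float)
        best_label, best_sim = "Unknown", self.threshold
        for label, feats in self.instances.items():
            for f in feats:
                sim = float(f @ feature /
                            (np.linalg.norm(f) * np.linalg.norm(feature) + 1e-9))
                if sim > best_sim:
                    best_label, best_sim = label, sim
        return best_label

memory = OpenEndedMemory()
memory.teach("Cup", np.random.rand(128))        # user labels a tracked object
print(memory.recognize(np.random.rand(128)))    # "Cup" or "Unknown"
```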

REFERENCES

[1] L. S. Lopes and A. Chauhan, “How many words can my robot learn?: An approach and experiments with one-class learning,” Interaction Studies, vol. 8, no. 1, pp. 53–81, 2007.
[2] D. Tang, Q. Fang, L. Shen, and T. Hu, “Onboard detection-tracking-localization,” IEEE/ASME Transactions on Mechatronics, 2020.
[3] A. Kanezaki, Y. Matsushita, and Y. Nishida, “RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5010–5019.
[4] A. Garcia-Garcia, F. Gomez-Donoso, J. Garcia-Rodriguez, S. Orts-Escolano, M. Cazorla, and J. Azorin-Lopez, “PointNet: A 3D convolutional neural network for real-time object class recognition,” in Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 2016, pp. 1578–1584.
[5] S. Zhi, Y. Liu, X. Li, and Y. Guo, “LightNet: A lightweight 3D convolutional neural network for real-time 3D object recognition,” in 3DOR, 2017.
[6] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912–1920.
[7] B. Shi, S. Bai, Z. Zhou, and X. Bai, “DeepPano: Deep panoramic representation for 3D shape recognition,” IEEE Signal Processing Letters, vol. 22, no. 12, pp. 2339–2343, 2015.
[8] S. H. Kasaei, J. Sock, L. S. Lopes, A. M. Tomé, and T.-K. Kim, “Perceiving, learning, and recognizing 3D objects: An approach to cognitive service robots,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[9] S. H. Kasaei, M. Oliveira, G. H. Lim, L. S. Lopes, and A. M. Tomé, “Towards lifelong assistive robotics: A tight coupling between object perception and manipulation,” Neurocomputing, vol. 291, pp. 151–166, 2018.
[10] T. Fäulhammer, R. Ambrus, C. Burbridge, M. Zillich, J. Folkesson, N. Hawes, P. Jensfelt, and M. Vincze, “Autonomous learning of object models on a mobile robot,” IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 26–33, 2017.
[11] S. H. Kasaei, A. M. Tomé, and L. S. Lopes, “Hierarchical object representation for open-ended object category learning and recognition,” in Advances in Neural Information Processing Systems, 2016, pp. 1948–1956.
[12] M. Oliveira, L. S. Lopes, G. H. Lim, S. H. Kasaei, A. M. Tomé, and A. Chauhan, “3D object perception and perceptual learning in the RACE project,” Robotics and Autonomous Systems, vol. 75, pp. 614–626, 2016.
[13] S. H. Kasaei, A. M. Tomé, L. S. Lopes, and M. Oliveira, “GOOD: A global orthographic object descriptor for 3D object recognition and manipulation,” Pattern Recognition Letters, vol. 83, pp. 312–320, 2016.
[14] D. Skočaj, A. Vrečko, M. Mahnič, M. Janíček, G.-J. M. Kruijff, M. Hanheide, N. Hawes, J. L. Wyatt, T. Keller, K. Zhou et al., “An integrated system for interactive continuous learning of categorical knowledge,” Journal of Experimental & Theoretical Artificial Intelligence, vol. 28, no. 5, pp. 823–848, 2016.
[15] M. Oliveira, L. S. Lopes, G. H. Lim, S. H. Kasaei, A. D. Sappa, and A. M. Tomé, “Concurrent learning of visual codebooks and object categories in open-ended domains,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 2488–2495.
[16] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[17] D. Maturana and S. Scherer, “VoxNet: A 3D convolutional neural network for real-time object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 922–928.
[18] A. Sinha, J. Bai, and K. Ramani, “Deep learning 3D shape surfaces using geometry images,” in European Conference on Computer Vision. Springer, 2016, pp. 223–240.
[19] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3D shape recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[20] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in Neural Information Processing Systems, 2017, pp. 5099–5108.
[21] R. Klokov and V. Lempitsky, “Escape from cells: Deep kd-networks for the recognition of 3D point cloud models,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 863–872.
[22] C. R. Qi, H. Su, M. Niessner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view CNNs for object classification on 3D data,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[23] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2014, pp. 806–813.
[24] ——, “CNN features off-the-shelf: an astounding baseline for recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2014, pp. 806–813.
[25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014.
[26] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan, “Low-shot learning from imaginary data,” arXiv preprint arXiv:1801.05401, 2018.
[27] T. Cohen and M. Welling, “Group equivariant convolutional networks,” in International Conference on Machine Learning, 2016, pp. 2990–2999.
[28] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan, “Measuring catastrophic forgetting in neural networks,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[29] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
[30] C. Wu, L. Herranz, X. Liu, J. van de Weijer, B. Raducanu et al., “Memory replay GANs: Learning to generate new categories without forgetting,” in Advances in Neural Information Processing Systems, 2018, pp. 5962–5972.
[31] Y. Liu, Y. Cong, and G. Sun, “L3DOC: Lifelong 3D object classification,” arXiv preprint, 2019.
[32] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[33] R. Bro, E. Acar, and T. G. Kolda, “Resolving the sign ambiguity in the singular value decomposition,” Journal of Chemometrics: A Journal of the Chemometrics Society, vol. 22, no. 2, pp. 135–140, 2008.
[34] K. Shmelkov, C. Schmid, and K. Alahari, “Incremental learning of object detectors without catastrophic forgetting,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3400–3409.
[35] S.-H. Cha, “Comprehensive survey on distance/similarity measures between probability density functions,” International Journal of Mathematical Models and Methods in Applied Sciences, 2007.
[36] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[37] K. He, R. Girshick, and P. Dollár, “Rethinking ImageNet pre-training,” arXiv preprint arXiv:1811.08883, 2018.
[38] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[41] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” arXiv preprint arXiv:1610.02357, 2017.
[42] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
[43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[44] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in AAAI, vol. 4, 2017, p. 12.
[45] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Generative and discriminative voxel modeling with convolutional neural networks,” arXiv preprint arXiv:1608.04236, 2016.
[46] E. Johns, S. Leutenegger, and A. J. Davison, “Pairwise decomposition of image sequences for active multi-view recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3813–3822.
[47] K. Sfikas, T. Theoharis, and I. Pratikakis, “Exploiting the panorama representation for convolutional neural network classification and retrieval,” 3DOR, vol. 6, p. 7, 2017.
[48] P. Zanuttigh and L. Minto, “Deep learning for 3D shape classification from multiple depth maps,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3615–3619.
[49] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. Jan Latecki, “GIFT: A real-time and scalable 3D shape search engine,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5023–5032.
[50] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling,” in Advances in Neural Information Processing Systems, 2016, pp. 82–90.
[51] V. Hegde and R. Zadeh, “FusionNet: 3D object classification using multiple data representations,” arXiv preprint arXiv:1607.05695, 2016.
[52] S. H. Kasaei, M. Oliveira, G. H. Lim, L. Seabra Lopes, and A. M. Tomé, “Interactive open-ended learning for 3D object recognition: An approach and experiments,” Journal of Intelligent & Robotic Systems, vol. 80, no. 3-4, pp. 537–553, 2015.
[53] A. Chauhan and L. S. Lopes, “Using spoken words to guide open-ended category formation,” Cognitive Processing, vol. 12, no. 4, p. 341, 2011.
[54] S. H. Kasaei, L. S. Lopes, and A. M. Tomé, “Coping with context change in open-ended object recognition without explicit context information,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1–7.
[55] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view RGB-D object dataset,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 1817–1824.
[56] M. Hoffman, F. R. Bach, and D. M. Blei, “Online learning for latent dirichlet allocation,” in Advances in Neural Information Processing Systems, 2010, pp. 856–864.
[57] A. Doumanoglou, R. Kouskouridas, S. Malassiotis, and T.-K. Kim, “Recovering 6D object pose and predicting next-best-view in the crowd,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3583–3592.
[58] S. H. Kasaei et al., “Investigating the importance of shape features, color constancy, color spaces and similarity measures in open-ended 3D object recognition,” arXiv preprint arXiv:2002.03779, 2020.
[59] N. Keunecke and S. H. Kasaei, “Combining shape features with multiple color spaces in open-ended 3D object recognition,” arXiv preprint arXiv:2009.09235, 2020.

Hamidreza Kasaei is an Assistant Professor in the Department of Artificial Intelligence at the University of Groningen, the Netherlands. Hamidreza received a Ph.D. degree in Robotics from the University of Porto (MAP-i), Portugal, in 2018. He was a visiting researcher at Imperial College London, UK, in 2016. He has been serving as Associate Editor for the IEEE International Conference on Robotics and Automation (ICRA) since 2019. Kasaei's research interests focus on the intersection of robotics, machine learning, and machine vision. His main research goal is to achieve a breakthrough by enabling robots to incrementally learn from past experiences and safely interact with human users using open-ended machine learning techniques.