Learning from Observations

Anonymous Author(s)
Affiliation
Address
email

Abstract: In this article we present a view that robot learning should make use of observational data in addition to learning through interaction with the world. We hypothesize that the acquisition of many, if not the majority, of the features of intelligence can happen or be drastically accelerated through learning by observation. This view is in part motivated by extensive studies on observational learning in cognitive science and developmental psychology. There are, however, many aspects of observational learning. To help facilitate future progress, we construct a taxonomy of observational learning. At the highest level of the hierarchy, we categorize it into learning representations, models of the world, behaviors, and goals. Thanks to the internet and advances in augmented reality hardware, we may soon be able to observe large fractions of everyday human activities and interactions with the world, which puts us in a good position to develop robots that learn to act in the context of rich observational experience.

Keywords: Robot Learning, Observational Learning

1 Introduction

The dream of robotics has always been that of general purpose machines that can perform many diverse tasks in unstructured human environments. One way to arrive at such machines is to embrace learning from the ground up. The most popular paradigm for this is currently reinforcement learning (RL), which involves trial and error learning through interaction with the environment [1]. We have recently witnessed impressive results in RL, achieved by large scale training in simulation across an ensemble of randomized environments [2]. In current instantiations, such techniques suffer from three main limitations: large sample complexity, considerable human effort in reward function design, and limited generalization ability.

Consider a child learning to play tennis. Even though the child has never played the game of tennis, it comes equipped with rich representations and models of the world. Thus, the child is able to identify the racket and the balls, knows that a racket is held at the handle, and that if the ball collides with the net it will stop. Furthermore, the child has also likely observed others play tennis. Consequently, it has a good sense of behaviors required to hit the ball and is able to roughly reproduce them, and knows that the goal of the game involves getting the ball onto the opponent's side of the court.

While pedagogical, the tennis example hints at ways to improve our current robot learning systems. In particular, the key message is that the child has learnt a lot about the world and the game of tennis through observation before even starting to play. Thanks to this, it does not need to discover everything through interaction and is able to learn to play much more quickly. Likewise, our robots should make use of observational data in addition to learning through interaction with the world.

Thanks to the internet, we now have access to large amounts of video data from which we may hope to learn in various ways. We can observe physical phenomena such as apples falling, balls bouncing, and water flowing. We can observe humans performing cartwheels, juggling, baking cakes, playing soccer, and assembling furniture. Thus "observational learning" embraces several different capabilities: acquiring commonsense physics, rules of social behavior, and procedural knowledge such as how to change a tire.

[Figure 1: diagram. Left: observation of a large quantity of human data. Right: the robot-world interaction loop (observations, rewards, actions).]

Figure 1: We advocate for observational robot learning which includes learning by observation from a large quantity of available human data (left), as opposed to relying only on the agent’s own interactions with the world (right). Our robots should learn from both observation and interaction.

A very important class of learning, and one at which normal human children are particularly adept, is "social learning", well studied in cognitive science and developmental psychology. It begins early, e.g., facial imitation in children [3, 4, 5]. When compared with primates, human babies are comparable at motor control but excel at social learning [6]. More generally, social learning can be seen as a core component of cultural transmission of knowledge across generations [7, 8]. In a similar spirit, social learning theory [9] argues that learning is not purely behavioral but takes place in a social context. Thus, learning can occur by observing behaviors of others acting in the world.

We focus particularly on observational data that comes in the form of videos, optionally with sound, text, or metadata. Crucially, this data should be recorded in the wild and be as representative of the real world as possible. The nature of such data may vary along a number of dimensions. We highlight a difference between third-person data, recorded by an observer (top left), and first-person data, recorded by the camera wearer (bottom left). The majority of the observational data currently available on the internet is third-person. This is a side-effect of the types of camera devices we use, most commonly phones, and the "interestingness" bias that people have when uploading data to the web. With the advances in augmented reality, we may soon get access to endless streams of first-person data of the daily activities of people and their interactions with the world. These may include boring everyday things that are perhaps not of interest to the internet but may prove useful for robotics. If we make an analogy with camera images and how their wide availability has impacted computer vision, it might be that this sort of data of human activities proves particularly crucial.

The contribution of this paper is to construct a taxonomy of observational learning. At the highest level, we categorize it into learning representations, models, behaviors, and goals.

2 Observational Learning

In the previous section, we presented a case for robot learning from both observational and interaction data. But what might we be able to learn from observational data?

Here, we propose a taxonomy of observational learning and give concrete examples. At the highest level of the hierarchy, we can learn representations, models, behaviors, and goals (Figure 2).

2.1 Learning Representations

When a child is faced with learning a new task, like in the tennis example, it does not start by seeing an unstructured collection of pixel values. Rather, it is already equipped with rich representations of the world. These allow for understanding objects, parts, affordances, and many other aspects. Most importantly, these have largely been acquired through observational learning.

Likewise, our robots could leverage observational data to build rich representations. Such representations would allow them to learn to act in the context of prior experience of the world. These representations further facilitate learning models, behaviors, and goals on top, discussed below.

While starting from pixels is one extreme, operating directly on top of state representations, such as objects with known 6 DoF pose, is the other extreme. Such representations are difficult to acquire and can lead to compounding errors when used directly. There is a range of possibilities in between.

[Figure 2: taxonomy diagram. Observational learning → representations (objects, affordances, ...), models (agents, world, ...), behaviors (imitation, exploration, ...), goals (perceptual, geometric, ...).]

Figure 2: We construct a taxonomy of observational learning. At the highest level, we categorize it into learning representations, models, behaviors, and goals. There are a number of different aspects for each of these and we list only a few basic ones to help give a general sense.

The computer vision community has now developed reliable systems for detecting objects [10, 11] that could serve as useful representations, e.g., for building dynamics models [12]. More specific than objects are object part keypoints [13] and even more specialized than that are representations of humans [14]. We might also want to capture various scene properties, such as depth or surface normals, which can help when learning policies for navigation [15] or manipulation [16].
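
To make this concrete, a minimal sketch of the mid-level idea is shown below: a perception module trained from observational data is frozen, and only a small policy head on top is trained through interaction. All modules and dimensions are illustrative assumptions, not the architectures of [15, 16].

```python
import torch
import torch.nn as nn

# A frozen perception module stands in for a mid-level representation
# (e.g., depth or object features) acquired from observational data.
# Only the small policy head on top is trained through interaction.
midlevel = nn.Sequential(nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                         nn.Linear(256, 64))
for p in midlevel.parameters():
    p.requires_grad = False  # representation is fixed, not fine-tuned

policy_head = nn.Linear(64, 4)  # e.g., a 4-dimensional action space

pixels = torch.randn(8, 3 * 32 * 32)  # a batch of flattened images
actions = policy_head(midlevel(pixels))
```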

While many of the examples listed above use supervision to arrive at such representations, these could, and in the longer term should, be learnt in an unsupervised way [17, 18, 19, 20, 21]. Indeed, there is already evidence of such representations being useful for learning controllers [22, 23].
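
As an illustration of the contrastive family cited above, the sketch below implements an InfoNCE-style objective: embeddings of two augmented views of the same image are pulled together while the other images in the batch act as negatives. The encoder and dimensions are placeholder assumptions rather than any specific published method.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive loss over paired embeddings.

    z1, z2: (N, D) embeddings of two augmented views of the same N
    images. Matching rows are positives; all other rows in the batch
    serve as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # (N, N) cosine similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# In practice z1 and z2 would come from an encoder applied to two
# random augmentations of the same unlabeled video frames.
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce_loss(z1, z2)
```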

2.2 Learning Models

In his visionary work, Kenneth Craik [24] hypothesized that if an organism carries a small-scale model of the world and its own possible actions in its head, it can use it to simulate different alternatives and decide on actions to take. This view has been further developed into the mental models theory of cognition [25], theory of mind [26], and motor control [27].

Similarly, our robots could build mental models and use them to plan and accomplish tasks they encounter. Even if the task the robot faces is new, the world still works the same way and the same models apply. For example, the physics of the world do not change whether one is throwing a ball in a park or learning to play tennis. Many such models can be acquired or initialized from observations.

One line of work on learning world models, popular in computer vision and robotics, focused on prediction in pixel space [28, 29, 30, 31]. Representations obtained in this way can then be used for learning controllers [32]. While pixels are a nice and general way to obtain supervision, one may benefit from more abstract representations. Another line of work focused on learning using higher-level representations of entities and objects [33, 34, 35, 12].
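
For concreteness, a minimal latent world model in the spirit of this line of work might pair an encoder with a transition network that predicts the next latent state given an action. The sketch below is a generic illustration under assumed shapes and networks, not the model of any cited paper.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Minimal world model: encode an observation, predict the next latent."""

    def __init__(self, obs_dim=64, act_dim=4, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.transition = nn.Sequential(
            nn.Linear(latent_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))

    def forward(self, obs, act):
        z = self.encoder(obs)
        return self.transition(torch.cat([z, act], dim=-1))

# Train on (o_t, a_t, o_{t+1}) triples: match the predicted next latent
# to the (detached) encoding of the frame that actually followed.
model = LatentDynamics()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
obs, act, next_obs = torch.randn(16, 64), torch.randn(16, 4), torch.randn(16, 64)
loss = ((model(obs, act) - model.encoder(next_obs).detach()) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```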

When it comes to learning models of other agents, the robotics and computer vision communities have focused on predicting future human trajectories in social navigation settings [36, 37, 38] as well as making more fine-grained predictions at the level of individual body parts [39].

2.3 Learning Behaviors

Studies of imitation in humans and primates have now spanned more than a century [40, 41]. Using Thorndike's definition [40], imitation is "learning to do an act from seeing it done". There is now a large body of work showing that humans learn to perform various behaviors through imitation.

Naturally, the idea of having robots learn to reproduce observed behaviors is an old one and it has been studied under a variety of names in robotics, including learning by watching, programming by demonstration, learning from demonstrations, and imitation learning [42, 43].

There are now many different problem settings and the specifics vary considerably. For example, the approach most commonly used in practice is direct behavior cloning [44]. This setting makes a number of assumptions, including access to teacher actions and a matching robot morphology and dynamics.
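
As a reference point, behavior cloning reduces to supervised regression from observations to the teacher's recorded actions, which is exactly why it needs action labels and a matching embodiment. A minimal sketch (network and dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Behavior cloning: supervised regression from observations to the
# demonstrator's actions. It needs (obs, action) pairs, hence the
# assumptions above: teacher actions must be recorded and executable
# on the robot's own morphology.
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_step(obs, teacher_act):
    loss = ((policy(obs) - teacher_act) ** 2).mean()  # MSE on actions
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

obs, teacher_act = torch.randn(32, 64), torch.randn(32, 4)
bc_step(obs, teacher_act)
```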

We highlight the more general setting of learning from observations alone, and by learning behaviors we mean learning how to perform various tasks from observations. Thanks to YouTube and computer vision datasets [45, 46, 47], we now have large quantities of observational data showing humans performing a wide variety of behaviors that we can learn to perform on a robot.

The learnt behaviors could be used as a whole or in a compositional fashion. For example, the robot could replicate human behavior based on latent representations [48, 49], or more structured representations of humans [50] and hands [51]. The robot could also learn only parts of observed behaviors and recombine them in new ways to achieve its goals [52, 53, 54].

In addition to learning specific human behaviors, the robot could leverage observations of humans in order to learn behaviors by itself. For example, the robot can see that a child plays with toys and that interesting parts of the space involve interacting with objects. Thus, observations can serve as a form of exploration strategy for the robot and speed up learning and discovery of behaviors.

2.4 Learning Goals

Developmental psychologists distinguish between imitation and emulation [55]. In particular, imitation can be defined as copying the form of an action and emulation as copying the outcome of an action sequence. We capture the notion of emulation as the ability to infer and understand goals; in other words, to understand what someone is trying to do by observing them.

In reinforcement learning terminology, this can be seen as learning the reward function. This has been explored in robotics [56], inverse reinforcement learning [57, 58], and reinforcement learning [59, 60, 61]. Similarly, this can be seen as a criterion, a cost function, a discriminator, or a critic. But really we want to capture a general notion of a goal and what it means to achieve it.
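
One simple instantiation, in the spirit of perceptual reward functions [59, 60], is a classifier that scores whether an observation depicts the goal being achieved; frames from observed successful executions can serve as positives. The labeling heuristic and network below are assumptions for illustration:

```python
import torch
import torch.nn as nn

# A learnt notion of "goal achieved": a binary classifier over
# observation features. Positives could be final frames of observed
# successful executions; negatives, frames sampled earlier in the
# videos. This labeling heuristic is an assumption for illustration.
success = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                        nn.Linear(128, 1), nn.Sigmoid())
opt = torch.optim.Adam(success.parameters(), lr=1e-3)

frames = torch.randn(32, 64)                   # observation features
labels = torch.randint(0, 2, (32, 1)).float()  # 1 = goal state
loss = nn.functional.binary_cross_entropy(success(frames), labels)
opt.zero_grad(); loss.backward(); opt.step()
```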

We can acquire such notions of goals from observational data. Consequently, these can be used to guide the learning of behaviors through interaction, e.g., by serving as a reward function for RL, or they could be coupled with learnt models to evaluate actions and make decisions. For example, very much like Craik envisioned, the robot could use learnt models to determine possible outcomes of potential actions, rank them using a learnt notion of success, and pick the best action.
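
Written out, this decision loop is just random-shooting planning: sample candidate action sequences, roll each through the learnt model, score the imagined outcomes with the learnt success function, and execute the best first action. The sketch below treats the model and success function as black boxes and uses linear stand-ins for them:

```python
import torch

def plan(obs, model, success, act_dim=4, n_candidates=128, horizon=5):
    """Random-shooting planner: simulate alternatives, rank, act.

    model(z, a) -> next latent state; success(z) -> scalar goal score.
    Both are learnt components, passed in here as black boxes.
    """
    z = obs.expand(n_candidates, -1)       # start every rollout at obs
    actions = torch.randn(n_candidates, horizon, act_dim)
    for t in range(horizon):
        z = model(z, actions[:, t])        # imagined next states
    scores = success(z).squeeze(-1)        # learnt notion of success
    return actions[scores.argmax(), 0]     # execute only the first action

# Toy usage with linear stand-ins for the learnt model and reward.
latent_dim, act_dim = 32, 4
W = torch.randn(latent_dim + act_dim, latent_dim) * 0.1
model = lambda z, a: torch.cat([z, a], dim=-1) @ W
success = lambda z: z.sum(dim=-1, keepdim=True)
action = plan(torch.randn(1, latent_dim), model, success)
```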

3 Discussion

In this article, we have presented a view in favor of robot learning from observations and constructed a taxonomy of observational learning. In the following, we briefly answer some of the questions that might arise when discussing this article.

What is new in this article? People have, of course, already been working on different aspects of observational learning and we hope the text makes that clear. The goal of this article is to unify the related efforts under a common framework and bring community attention to this research direction.

Is observational data enough? No, it is not. While a lot can be learnt or accelerated by learning from observation alone, we still require learning through interaction. The two are complementary.

Why treat observational data separately? One could argue that a robot can always choose to observe if it wants to and that there is nothing special about it; it is just another action. This view would not be wrong, just not very practical. We believe that it is useful to distinguish between the two, to ensure that we develop techniques capable of leveraging both forms of data effectively.

Why does robot learning need to be inspired by humans? It does not. Here we focus on a human-inspired approach as we believe that it is a promising path for developing general purpose robots that operate in human environments and that it may facilitate our understanding of humans.

Do we need unsupervised, supervised, or reinforcement learning? Learning from observational data likely involves at least a subset of them, if not all. And maybe more. Here we focus on the problem setting and its different aspects, rather than the classes of learning methods required for it.

References

[1] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.

[2] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang. Solving Rubik's cube with a robot hand. arXiv, 2019.

[3] J. M. Baldwin. Mental development in the child and the race: Methods and processes. Macmillan, 1906.

[4] J. Piaget. Play, dreams and imitation in childhood. W. W. Norton, 1962.

[5] A. Meltzoff and A. Gopnik. The role of imitation in understanding persons and developing theory of mind. Understanding other minds: Perspectives from autism, 1993.

[6] E. Herrmann, J. Call, M. V. Hernández-Lloreda, B. Hare, and M. Tomasello. Humans have evolved specialized skills of social cognition: The cultural intelligence hypothesis. Science, 2007.

[7] J. Henrich. The secret of our success. Princeton University Press, 2015.

[8] M. Tomasello. Becoming human. Harvard University Press, 2019.

[9] A. Bandura and R. H. Walters. Social learning theory. Prentice Hall, 1977.

[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.

[12] H. Qi, X. Wang, D. Pathak, Y. Ma, and J. Malik. Learning long-term visual dynamics with region proposal interaction networks. arXiv:2008.02265, 2020.

[13] L. Manuelli, W. Gao, P. Florence, and R. Tedrake. kPAM: Keypoint affordances for category-level robotic manipulation. arXiv:1903.06684, 2019.

[14] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.

[15] A. Sax, B. Emi, A. R. Zamir, L. Guibas, S. Savarese, and J. Malik. Mid-level visual representations improve generalization and sample efficiency for learning visuomotor policies. arXiv:1812.11971, 2018.

[16] B. Chen, A. Sax, G. Lewis, I. Armeni, S. Savarese, A. Zamir, J. Malik, and L. Pinto. Robust policies via mid-level visual representations: An experimental study in manipulation and navigation. In CoRL, 2020.

[17] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.

[18] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.

[19] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.

[20] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.

[21] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.

[22] A. Srinivas, M. Laskin, and P. Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. arXiv:2004.04136, 2020.

[23] A. Zhan, P. Zhao, L. Pinto, P. Abbeel, and M. Laskin. A framework for efficient robotic manipulation. arXiv:2012.07975, 2020.

[24] K. J. W. Craik. The nature of explanation. CUP Archive, 1952.

[25] P. N. Johnson-Laird. Mental models: Towards a cognitive science of language, inference, and consciousness. Harvard University Press, 1983.

[26] R. M. Gordon. Folk psychology as simulation. Mind & Language, 1986.

[27] M. Kawato. Internal models for motor control and trajectory planning. Current Opinion in Neurobiology, 1999.

[28] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv:1511.05440, 2015.

[29] C. Finn and S. Levine. Deep visual foresight for planning robot motion. In ICRA, 2017.

[30] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In ICML, 2018.

[31] A. Clark, J. Donahue, and K. Simonyan. Adversarial video generation on complex datasets. arXiv:1907.06571, 2019.

[32] D. Ha and J. Schmidhuber. World models. arXiv:1803.10122, 2018.

[33] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik. Learning visual predictive models of physics for playing billiards. arXiv:1511.07404, 2015.

[34] P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu. Interaction networks for learning about objects, relations and physics. arXiv:1612.00222, 2016.

[35] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv:1612.00341, 2016.

[36] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard. Socially compliant mobile robot navigation via inverse reinforcement learning. IJRR, 2016.

[37] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.

[38] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.

[39] A. D. Dragan, K. C. Lee, and S. S. Srinivasa. Legibility and predictability of robot motion. In HRI, 2013.

[40] E. L. Thorndike. Animal intelligence: An experimental study of the associative processes in animals. The Psychological Review: Monograph Supplements, 1898.

[41] A. Whiten and R. Ham. On the nature and evolution of imitation in the animal kingdom: Reappraisal of a century of research. Advances in the Study of Behavior, 1992.

[42] S. Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 1999.

[43] A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Survey: Robot programming by demonstration. Technical report, Springer, 2008.

[44] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. Technical report, 1989.

[45] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.

[46] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV, 2018.

[47] D. Shan, J. Geng, M. Shu, and D. F. Fouhey. Understanding human hands in contact at internet scale. In CVPR, 2020.

[48] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018.

[49] A. Edwards, H. Sahni, Y. Schroecker, and C. Isbell. Imitating latent policies from observation. In ICML, 2019.

[50] X. B. Peng, A. Kanazawa, J. Malik, P. Abbeel, and S. Levine. SFV: Reinforcement learning of physical skills from videos. TOG, 2018.

[51] Z. Cao, I. Radosavovic, A. Kanazawa, and J. Malik. Reconstructing hand-object interactions in the wild. arXiv:2012.09856, 2020.

[52] S. Schaal. Dynamic movement primitives: a framework for motor control in humans and humanoid robotics. In Adaptive Motion of Animals and Machines, 2006.

[53] T. Shankar, S. Tulsiani, L. Pinto, and A. Gupta. Discovering motor programs by recomposing demonstrations. In ICLR, 2019.

[54] K. Pertsch, Y. Lee, and J. J. Lim. Accelerating reinforcement learning with learned skill priors. arXiv:2010.11944, 2020.

[55] A. Whiten, N. McGuigan, S. Marshall-Pescini, and L. M. Hopper. Emulation, imitation, over-imitation and the scope of culture for child and chimpanzee. Philosophical Transactions of the Royal Society B: Biological Sciences, 2009.

[56] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In ICML, 1997.

[57] A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, 2000.

[58] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

[59] A. Edwards, C. Isbell, and A. Takanishi. Perceptual reward functions. arXiv:1608.03824, 2016.

[60] P. Sermanet, K. Xu, and S. Levine. Unsupervised perceptual rewards for imitation learning. arXiv:1612.06699, 2016.

[61] L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg. Concept2Robot: Learning manipulation concepts from instructions and human demonstrations. In RSS, 2020.
