Can a Compact Neuronal Circuit Policy Be Re-Purposed to Learn Simple Robotic Control?
Ramin Hasani¹*, Mathias Lechner²*, Alexander Amini³, Daniela Rus³ and Radu Grosu¹

Abstract: We propose a neural information processing system obtained by re-purposing the function of a biological neural circuit model to govern simulated and real-world control tasks. Inspired by the structure of the nervous system of the soil worm C. elegans, we introduce Neuronal Circuit Policies (NCPs), defined as models of biological neural circuits reparameterized for the control of an alternative task. We learn instances of NCPs to control a series of robotic tasks, including the autonomous parking of a real-world rover robot. To reconfigure the purpose of the neural circuit, we adopt a search-based optimization algorithm. Neuronal circuit policies perform on par with, and in some cases surpass, contemporary deep learning models, with the advantage of leveraging significantly fewer learnable parameters and realizing interpretable dynamics at the cell level.

[Figure 1: two panels, "Network general structure" (k observations, n actions) and "Tap-withdrawal neural circuit" (4 observations, 2 actions; nodes AVM, ALM, PVD, PLM, DVA, PVC, AVD, AVB, AVA, FWD, REV). Legend: sensory neuron, upper interneuron, command neuron, motor neuron, chemical synapse, gap junction.]

Fig. 1. Left: C. elegans' general neuronal circuit structure. Right: Tap-Withdrawal (TW) neural circuit schematic. Total number of interneurons = $N_i + N_C$. We preserve the TW circuit wiring topology, model its dynamics by computational models, and deploy a search-based reinforcement learning algorithm to control robots.

I. INTRODUCTION

We wish to explore a new class of machine learning algorithms for robot control that is inspired by nature. Through natural evolution, the subnetworks within the nervous system of the nematode C. elegans structured a near-optimal wiring diagram from the perspective of the wiring economy principle¹ [2], [1].
Its stereotypic brain is composed of 302 neurons connected through approximately 8000 chemical and electrical synapses [3]. C. elegans exhibits distinct behavioral mechanisms to process complex chemical stimulations [4], avoid osmotic regions [5], sleep [6], show adaptive behavior [7], perform mechanosensation [8], and control muscles [9]. The functions of many neural circuits within its brain have been identified [10], [11], [12], [6] and simulated [13], [14], [15], [16], [17], which makes it a suitable model organism for computational investigation. The general network architecture in C. elegans establishes a hierarchical topology from sensory neurons through upper interneurons and command interneurons down to motor neurons (see Fig. 1A). In these neuronal circuits, interneurons typically establish highly recurrent wiring diagrams with each other, while sensory and command neurons mostly realize feed-forward connections to their downstream neurons. An example of such a structure is the neural circuit shown in Fig. 1B, the Tap-withdrawal (TW) circuit [18], which is responsible for inducing a forward/backward locomotion reflex when the worm is mechanically exposed to a touch stimulus on its body. The circuit has been characterized in terms of its neuronal dynamics [11]. The TW circuit comprises nine neuron classes which are wired together by means of chemical and electrical synapses. The behavior of the TW reflexive response is substantially similar to a control agent's reaction in some standard control settings, such as a controller driving an underpowered car up a steep hill, known as the Mountain Car [19], or a controller navigating a rover robot from point A to point B on a planned trajectory.

The biophysical neuronal and synaptic models express useful properties: 1) In addition to the nonlinearities exposed on the neurons' hidden state, synapses are also modeled by an additional nonlinearity. This property results in realizing complex dynamics with fewer neurons. 2) Their dynamics are set by grounded biophysical properties, which eases the interpretation of the network's hidden dynamics.

The trained networks obtained by learning the parameters of a fixed neural circuit structure are called neuronal circuit policies (NCPs). We experimentally investigate NCPs' properties in terms of their learning performance, their ability to solve tasks in different RL domains, and ways to interpret their internal dynamics. For this purpose, we preserve the wiring structure of an example NCP (the TW circuit) and adopt a search-based optimization algorithm for learning the neuronal and synaptic parameters of the network. This paper contributes the following:

• Demonstration of a compact neuronal circuit policy (NCP) inspired by the brain of the C. elegans worm, as an interpretable controller in various control settings.
• An optimization algorithm (Adaptive search-based) and experiments with NCPs in different control domains, including simulated and physical robots.
• Autonomous parking of a mobile robot by re-purposing a compact neuronal circuit policy.
• Interpretation of the internal dynamics of the learned policies. We introduce a novel computational method to understand continuous-time network dynamics. The technique (Definition 1) determines the relation between the kinetics of sensory/interneurons and a motor neuron's decision. We compute the magnitude of the contribution (positive or negative) of these hidden nodes to the output dynamics in determinable phases of activity during the simulation.

*Equal contributions
¹Technische Universität Wien (TU Wien), 1040 Vienna, Austria
²Institute of Science and Technology Austria (IST Austria)
³Massachusetts Institute of Technology (MIT), USA
¹Under proper functionality of a network, the wiring economy principle proposes that its morphology is organized such that the cost of wiring its elements is minimal [1].
arXiv:1809.04423v2 [cs.LG] 16 Nov 2019

II. PRELIMINARIES

In this section, we first briefly describe the structure and dynamics of the tap-withdrawal neural circuit. We then introduce the mathematical neuron and synapse models used to build the circuit, as an instance of neuronal circuit policies.

A. Tap-Withdrawal Neural Circuit Revisit

A mechanical stimulus (i.e., a tap) applied to the petri dish in which the worm lives results in the animal's reflexive response in the form of a forward or backward movement. This response is called the tap-withdrawal reflex, and the circuit identified to underlie this behavior is known as the tap-withdrawal (TW) neural circuit [18]. The circuit is shown in Fig. 1B. It is composed of four sensory neurons, PVD and PLM (posterior touch sensors), AVM and ALM (anterior touch sensors), four interneuron classes (AVD, PVC, AVA and AVB), and two subgroups of motor neurons which are abstracted as forward locomotory neurons, FWD, and backward locomotory neurons, REV. Interneurons recurrently synapse into each other with excitatory and inhibitory synaptic links.

B. Neural circuit modeling revisit

Here, we briefly describe the neuron and synapse model [20], [21] used to design neural circuit dynamics. Dynamics of neura…

$$\dot{V}_i(t) = \Big[ I_{i,L} + \sum_{j=1}^{n} \hat{I}_{i,j}(t) + \sum_{j=1}^{n} I_{i,j}(t) \Big] / C_{i,m}$$

$$I_{i,L}(t) = w_{i,L} \, [E_{i,L} - V_i(t)]$$

… reversal potentials of the leakage and chemical channels. $I_{i,L}$, $\hat{I}_{i,j}$, and $I_{i,j}$ represent the currents flowing through the leak channel, electric synapse, and chemical synapse, with conductances $w_{i,L}$, $\hat{w}_{i,j}$, and $w_{i,j}$, respectively. $g_{i,j}(t)$ is the dynamic conductance of the chemical synapse, and $C_{i,m}$ is the membrane capacitance. $E_{i,j}$ determines whether a synapse is inhibitory or excitatory.

For interacting with the environment, we introduce sensory and motor neuron models. A sensory component consists of two neurons $S_p$, $S_n$ and an input variable $x$. $S_p$ gets activated when $x$ has a positive value, whereas $S_n$ fires when $x$ is negative. The potentials of the neurons $S_p$ and $S_n$, as functions of $x$, are defined by an affine function that maps the region $[x_{min}, x_{max}]$ of the system variable $x$ to a membrane potential range of $[-70\,\mathrm{mV}, -20\,\mathrm{mV}]$ (see the formula in Supplementary Materials Section 2).² Similar to sensory neurons, a motor component is composed of two neurons $M_n$, $M_p$ and a controllable motor variable $y$. The value of $y$ is computed by $y := y_p + y_n$, and an affine mapping links the neuron potentials $M_n$ and $M_p$ to the range $[y_{min}, y_{max}]$ (formula in Supplementary Materials Section 2). The FWD and REV motor classes (output units) in Fig. 1B are modeled in this fashion.

For simulating neural networks composed of such dynamic models, we adopted an implicit numerical solver [22]. Formally, we realized the ODE models in a hybrid fashion which combines both implicit and explicit Euler discretization methods [21]. (See Supplementary Materials Section 3 for a concrete discussion of the model implementation and the choice of parameters.) Note that the solver additionally has to serve as a real-time control system. To reduce complexity, our method therefore realizes a fixed-timestep solver. The solver's complexity for each time step $\Delta_t$ is $O(|\#\,\text{neurons}| + |\#\,\text{synapses}|)$. In the next section, we introduce the optimization algorithm used to reparametrize the tap-withdrawal circuit.

III. SEARCH-BASED OPTIMIZATION ALGORITHM

In this section we formulate a reinforcement learning (RL) setting for tuning the parameters of a given neural circuit to control robots. The behavior of a neural circuit can be expressed as a policy $\pi_\theta(o_i, s_i) \mapsto \langle a_{i+1}, s_{i+1} \rangle$ that maps an observation $o_i$ and an internal state $s_i$ of the circuit to an action $a_{i+1}$ and a new internal state $s_{i+1}$. This policy acts upon a possibly stochastic environment $Env(a_{i+1})$ that provides an observation $o_{i+1}$ and a reward $r_{i+1}$. The stochastic return is given by $R(\theta) := \sum_{t=1}^{T} r_t$. The objective of reinforcement learning is to find a $\theta$ that maximizes $\mathbb{E}\big(R(\theta)\big)$.
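The policy and environment loop of Section III can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `ToyEnv`, `toy_policy`, `rollout`, and `expected_return` are hypothetical names, the environment is a made-up one-dimensional task, and the policy is a simple leaky integrator standing in for a neural circuit. It illustrates only the interface $\pi_\theta(o_i, s_i) \mapsto \langle a_{i+1}, s_{i+1} \rangle$, the return $R(\theta) = \sum_t r_t$, and a Monte Carlo estimate of its expectation.

```python
import random


class ToyEnv:
    """Hypothetical 1-D task: drive position x toward 0; reward is -|x|."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.x = 0.0

    def reset(self):
        self.x = self.rng.uniform(-1.0, 1.0)
        return self.x  # first observation

    def step(self, action):
        # Possibly stochastic environment: small Gaussian noise on the transition.
        self.x += 0.1 * action + self.rng.gauss(0.0, 0.01)
        return self.x, -abs(self.x)  # (o_{i+1}, r_{i+1})


def toy_policy(theta, obs, state):
    """pi_theta(o_i, s_i) -> (a_{i+1}, s_{i+1}); theta = (gain, leak) is assumed."""
    gain, leak = theta
    new_state = leak * state + (1.0 - leak) * obs       # internal state update
    action = max(-1.0, min(1.0, -gain * new_state))     # bounded action
    return action, new_state


def rollout(theta, env, horizon=50):
    """Return R(theta) = sum of rewards r_1..r_T for one episode."""
    obs, state, total = env.reset(), 0.0, 0.0
    for _ in range(horizon):
        action, state = toy_policy(theta, obs, state)
        obs, reward = env.step(action)
        total += reward
    return total


def expected_return(theta, episodes=20, horizon=50):
    """Monte Carlo estimate of E(R(theta)) over independently seeded episodes."""
    returns = [rollout(theta, ToyEnv(seed), horizon) for seed in range(episodes)]
    return sum(returns) / len(returns)
```

A search-based optimizer would call something like `expected_return` repeatedly for candidate values of `theta` and keep the better ones; averaging over several episodes is one simple way to cope with a stochastic return.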
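Returning to the fixed-timestep solver of Section II-B, the leak-current equation $I_{i,L}(t) = w_{i,L}[E_{i,L} - V_i(t)]$ can be stepped as shown below. This is a minimal sketch assuming one plausible reading of the "hybrid" scheme: the neuron's own potential is treated implicitly (backward Euler), while the summed synaptic input is held at its value from the previous step (explicit). The names `semi_implicit_step`, `simulate`, and `i_syn`, and all numeric values, are illustrative assumptions; the authors' actual scheme is the one described in their Supplementary Materials Section 3, which is not reproduced here.

```python
def semi_implicit_step(v, i_syn, w_leak, e_leak, c_m, dt):
    """Advance one membrane potential by a fixed step dt.

    Solves  c_m * (v_next - v) / dt = w_leak * (e_leak - v_next) + i_syn
    for v_next. The leak term uses the new potential (implicit), while
    i_syn is the summed synaptic current from the previous step (explicit).
    """
    return (c_m * v / dt + w_leak * e_leak + i_syn) / (c_m / dt + w_leak)


def simulate(v0, i_syn, w_leak, e_leak, c_m, dt, steps):
    """Integrate a single neuron driven by a constant synaptic current."""
    v = v0
    trace = [v]
    for _ in range(steps):
        v = semi_implicit_step(v, i_syn, w_leak, e_leak, c_m, dt)
        trace.append(v)
    return trace


# Illustrative values only: with no synaptic input the potential relaxes
# from v0 toward the leak reversal potential e_leak.
trace = simulate(v0=-20.0, i_syn=0.0, w_leak=1.0, e_leak=-70.0,
                 c_m=1.0, dt=0.1, steps=200)
```

With a constant input, the fixed point of this update is `e_leak + i_syn / w_leak`. Because the leak term is handled implicitly, the update stays bounded even for large `dt`, which is a convenient property for a fixed-timestep solver that must run in real time.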