Novel Architectures

Editors: Volodymyr Kindratenko, [email protected] Pedro Trancoso, [email protected]

Modeling Spiking Neural Networks on SpiNNaker

By Xin Jin, Mikel Luján, Luis A. Plana, Sergio Davies, Steve Temple, and Steve B. Furber

SpiNNaker is a massively parallel architecture with more than a million processing cores that can model up to 1 billion spiking neurons in biological real time.

September/October 2010. Copublished by the IEEE CS and the AIP. 1521-9615/10/$26.00 © 2010 IEEE.

Despite an increasing amount of experimental data and deeper scientific understanding, deciphering the inner workings of biological brains remains a grand challenge. Investigations into the human brain's microscopic structure have shown that neuron cells are the key components in the cortex. Each individual neuron is physically very much like other cells in our body, but it's different in that it interacts with other neurons by receiving or sending electrical pulses, or spikes.

Researchers have proposed several mathematical models to describe the biological process of neurons firing spikes. These vary in their computational complexity as well as their fidelity, while maintaining biological plausibility. The spike is a common first-class abstraction among these various mathematical models. Spike events are communicated to all connected neurons, with typical fan-outs on the order of 10³. Computational modeling of spiking neurons has abundant parallelism and no explicit requirement for cache-coherent shared memory. Thus, researchers can use large supercomputing systems and high-performance computing (HPC) clusters for this kind of simulation [1]. However, spike communication stresses standard HPC clusters and networks, making them unsuitable for real-time simulation.

In SpiNNaker, our treatment of spikes is a key innovation implemented with application-specific hardware: a multicast, packet-switched, and self-timed communication fabric with on-chip routers. To maintain flexibility and generality, the neuronal models run in software on embedded ARM968 processors. These neuronal models communicate by means of spike packets directly supported by the SpiNNaker architecture.

We taped out the SpiNNaker test chips in 2009, with the batch arriving in December. As Figure 1 shows, these test chips are fully functional SpiNNaker chips but have a highly reduced core count: only two cores per chip.

Figure 1. SpiNNaker test chip. (a) The test-board section showing a SpiNNaker chip and SDRAM. (b) A chip plot highlighting individual components. Labels in the figure: communication NoC, router, tightly coupled memory, ARM968 core, system NoC, boot ROM, SDRAM (memory), system RAM, Ethernet controller, clock generator, and phase-locked loop (PLL).

Here, we offer an overview of our research project and describe the first experiments with these test chips running spiking neurons based on Eugene Izhikevich's model [2]. Note that we're not targeting artificial neural networks (such as perceptrons or multilayer networks) that were inspired by, but don't model, biologically plausible neural systems.

SpiNNaker Overview

SpiNNaker is a multicore-based architecture built for a specific purpose. The smallest configuration is a single SpiNNaker chip, which can simulate up to 20,000 neurons [3]. In this small form, the low-power properties of the ARM cores and asynchronous interconnect make SpiNNaker a feasible option for embedded control systems such as those in robots. This is a clear advantage over large HPC systems.

Compared to dedicated hardware solutions based on field-programmable gate arrays, analog circuits, or hybrid analog-digital VLSI [4], SpiNNaker offers flexibility in choosing the neuronal dynamics, models, and learning rules. These are important features when we consider the experimental nature of state-of-the-art neural modeling.

The SpiNNaker chips are connected using a 2D toroidal triangular mesh based on a lightweight packet-switched fabric [5]. Packets represent neural spikes and travel seamlessly through the fabric, a network-on-chip (NoC) and interchip connection network that connects more than 65,000 (2¹⁶) nodes housed in the largest SpiNNaker configuration. To emulate biological systems' very high connectivity, the on-chip routers provide multicast routing.

The SpiNNaker architecture is based on three guiding principles:

• Virtualized topology. The neural model's physical organization is decoupled from the target system's physical organization. So, in principle, any neuron can be modeled on any processing core in the system. Although the machine employs a 2D topology, it can model a 3D (or higher) neural structure. This is possible because electronic communication speeds are much higher than biological communication speeds.
• Bounded asynchrony. Time models itself. Because the system operates in real time (possibly scaled, although we assume ×1 scaling), there's no requirement for explicit synchronization in the computation. Things happen when they happen. Although this leads to nondeterministic behavior, the biological system being modeled shares this property.
• Energy frugality. Processors are free; the real cost of computing is energy. This is why we use embedded processors, and why the synchronous dynamic RAM (SDRAM) is of the mobile double data rate (DDR) variety; in both cases, we sacrifice some performance for greatly enhanced power efficiency.

As Figure 2 shows, each SpiNNaker chip has 18 identical ARM9 processing cores running at 200 MHz. As Figure 3 shows, the core is directly connected to two memory blocks: instruction tightly coupled memory (ITCM) and data tightly coupled memory (DTCM). The TCMs are small (32 and 64 Kbytes, respectively) but run extremely fast at processor clock speed; this is ideal for storing frequently accessed instructions or data.

One of the cores on each chip is selected to perform system management tasks. The other processing cores run independent event-driven neural processes, each simulating a group of neurons. The events are generated by peripherals, such as the direct memory access (DMA) controller. Cores communicate with other cores and chips through the communication network. Access to all other on-chip resources on the system NoC is through the DMA controller, which is mainly used for reading (when a packet arrives) or writing (when updating synaptic information during learning) the neural state stored in the external SDRAM.

As Figure 2 shows, there's one router on each chip capable of on- and off-chip communication. It has 18 ports for internal use of the ARM cores and six ports to communicate with six adjacent chips. All ports are full duplex and implement self-timed protocols. The self-timed protocols make the SpiNNaker chip a globally asynchronous, locally synchronous design: individual processing cores on the chip act as (clocked) synchronous "islands" surrounded by a "sea" of asynchronous connectivity. This not only facilitates the VLSI design process, but also isolates faulty cores and provides timing tolerance.

The router's internal organization is hierarchical; ports are merged in three stages before using the actual routing engine. The router is designed to support point-to-point and multicast communications. The multicast engine helps reduce pressure at the injection ports and, compared with a pure point-to-point alternative, it significantly reduces the number of packets that traverse the communication fabric.

Packet routing must be done in an innovative way. Because neurons send spikes to thousands of other neurons, it's impractical to list all destinations in every packet. Therefore, routers make routing decisions based on the packet's source address (the identifier of the neuron that fired the spike). The network itself will deliver the packets to all chips containing neurons that have synaptic connections with the source neuron. These connections are embedded in the 1,024-word routing tables inside the routers, and must be preloaded using application-specific information. To minimize the space pressure on the routing tables, these offer a masked associative route look-up.

SpiNNaker chips are arranged in a 2D triangular torus topology with links to the neighbors in the north, south, east, west, southwest, and northeast. The routers perform a default routing that sends the packet following a straight line, a process that avoids using extra entries in the routing tables. For example, if the packet comes from the north, it will be sent to the south. The topology allows two-hop routes to go from a chip to each one of its neighbors. These two-hop paths between neighbor chips are known as emergency routes, and the router can invoke them automatically to bypass problematic links due to transient congestion states or link failures.

92 Computing in Science & Engineering

Figure 2 block diagram (labels recovered from the artwork): six 2-of-7 input link decoders and six 2-of-7 output link encoders around the packet router (packet decode, routing engine, output select); router control; communications NoC (input and output); processors Proc0 to ProcN, each with a communications controller and an AXI master; JTAG debug; system NoC; PL340 SDRAM interface to 1-Gbit DDR SDRAM; system RAM; system ROM; Ethernet; watchdog; system controller; and clock phase-locked loop (PLL).

Figure 2. The SpiNNaker chip organization. The top half contains the multicast router that distributes spikes to cores on the same and neighboring chips. The ARM cores—used to simulate neurons—and their peripherals occupy the bottom half. The Joint Test Action Group (JTAG) is used for debugging. AHB stands for advanced high-performance bus and AXI stands for advanced extensible interface.

Neural Modeling: Implementing Izhikevich on SpiNNaker

Neuronal activity is a result of ionic movement, caused by two electrochemical gradients (concentration and electric potential gradients) around the cell body's membrane. The two electrochemical gradient forces drive ions in opposite directions, toward either the inside or the outside of the cell.

Several different types of mathematical models have been developed to describe neuronal dynamics. Alan Lloyd Hodgkin and Andrew Huxley developed a well-known biologically plausible conductance-based model (Hodgkin-Huxley) that describes ion current dynamics by a set of nonlinear differential equations [6]. In conductance-based models, the variables and parameters have well-defined biological meanings, and can be measured experimentally. However, analyzing the conductance-based models is complicated.

In contrast, phenomenological models simply build relationships between the membrane potential (the electric potential difference between the inside and outside of the membrane) and input current. These models lack a direct biological meaning, but address most key properties of neurons and are less computationally intensive.

One important phenomenological model is the Izhikevich model, which uses bifurcation theory to reduce the high-dimensional conductance-based model to a 2D system with a fast membrane potential variable v and a slow membrane recovery variable u. The model's equations are

$$\dot{v} = 0.04v^2 + 5v + 140 - u + I,$$

$$\dot{u} = a(bv - u),$$

and

$$\text{if } v \ge 30\ \text{mV, then } v = c,\quad u = u + d,$$

where $\dot{v} = dv/dt$, t is time in milliseconds, I is the synaptic current, and v represents the membrane potential (in millivolts). The variable u represents the membrane recovery (also in mV), which reflects the negative effects on the membrane potential caused by factors such as the active K⁺ and inactive Na⁺ ionic currents. Parameters a, b, c, and d are adjustable to reproduce the whole range of biological firing patterns.

Accurate brain modeling requires not only the neuronal dynamics but also a comprehensive map of structural connection patterns in the human brain. Connectivity is closely related to the neural coding problem: information is coded and propagated within the neural network through structural links such as synapses and fiber pathways. Based on connectivity patterns discovered in laboratory experiments, researchers have built several mathematical models of connectivity [7]. The resulting connectivity knowledge was used to create large-scale neural network models, including Izhikevich's model to simulate brain activity [8].

Figure 3. The processing core subsystem. Each ARM core has fast local memory to perform its tasks. The cores operate in interrupt-driven mode to respond in real time to system events such as spike arrivals. The core is directly connected to two memory blocks: instruction tightly coupled memory (ITCM) and data tightly coupled memory (DTCM). ARM stands for advanced RISC machines and IRQ/FIQ stands for interrupt request/fast interrupt request.

We selected the Izhikevich model to demonstrate real-time simulation with 1 ms resolution on SpiNNaker. We derived a 16-bit fixed-point arithmetic implementation to save both computation and storage space, as well as to avoid having a floating-point unit in the ARM968 cores. We take advantage of knowing the membrane potential's exact range of values (−80 ≤ v ≤ 30). By using a dual-scaling factor scheme, we can reduce the precision lost during the conversion and hence maintain a good precision level [9]. We also optimize the implementation's performance and accuracy by

• expanding the width from 16 to 32 bits during the computation to achieve better precision;
• transforming the equations to allow the use of efficient ARM instructions (SMLAWB and SMLATT, which are signed multiply-accumulate operations);
• adjusting the parameters and precomputing Equation 1 as much as possible; and
• programming in ARM assembly language.

In our implementation, one iteration of the Izhikevich equations takes six fixed-point arithmetic and two shift operations. This is more efficient than the original implementation, which requires 13 floating-point operations. In a practical implementation, the complete subroutine for computing the Izhikevich equations can be performed in as few as 20 instructions if the neuron doesn't fire. If the neuron does fire, it takes 10 more instructions to reset the value and send a spike event. The detailed implementation, along with precision and performance analyses, is described elsewhere [9].

To achieve low communication overhead, we use an event-address mapping (EAM) scheme. The EAM scheme keeps synaptic weights at the post-synaptic end (neurons receiving the spike) and sets up a relationship (in a mapping table) between the spike event and the address of the synaptic weight. Hence, no synaptic weight information needs to be carried in a spike packet. When a neuron fires, the packet propagates through a series of multicast routers according to the preloaded routing table in each router, and finally arrives at the destination processing cores.

The EAM scheme employs the two memory systems, DTCM and SDRAM, to store and access synaptic connections efficiently. DMA operations transfer each weight block from the SDRAM to the local DTCM before computing the update following the Izhikevich equations. A core can easily find the synaptic weights associated with the fired neuron. It does this by matching the incoming spike packet with entries in the mapping table, which is organized as a binary tree.

Figure 4 shows a series of spike raster plots; the x-axis shows advancing biological simulation time, while the y-axis shows a simulated neuron. A point in a graph indicates that the given neuron fired at the specified simulation time. The neural simulation involves a 500-neuron network with an excitatory-inhibitory ratio of 4:1. Each neuron is randomly connected to 25 other neurons. We randomly selected 12 excitatory and three inhibitory neurons as biased neurons, each receiving a constant input stimulus of 20 mV.

The spike raster plots compare a fixed-point arithmetic simulation in Matlab (Figure 4a) with the same neural model executing on a real SpiNNaker chip (Figure 4b). The spike timings match in both scenarios and show the same rhythm of 4 Hz. Figure 4c shows the activity of an excitatory neuron (ID 0) executing on the SpiNNaker test chip.

Figure 4. Spike raster plots of a 500-neuron model. (a) A 500-neuron, fixed-point Matlab simulation, (b) 500 neurons on a SpiNNaker test chip, and (c) the states of neuron 0 (excitatory) on a SpiNNaker test chip. Panels (a) and (b) plot neuron ID (0 to 500) against time (0 to 1,000 milliseconds); panel (c) plots electrical current and membrane potential in mV (−80 to 40) against time (0 to 1,000 milliseconds).

To test the SpiNNaker chips, we ported a doughnut hunter application. As Figure 5a shows, the hunter neuron network is composed of a few fast-spiking Izhikevich neurons. The application requires a server running on a host PC and a client running on SpiNNaker. The connection between the server and client is via an Ethernet interface.

The server models the environment with a doughnut and a hunter (see Figure 5b). The doughnut appears at a random location on the screen and the hunter moves to chase the doughnut. The server tracks the locations of the doughnut and the hunter, and provides the visual stimuli to the hunter. The SpiNNaker chip models the hunter's neural network: the server sends visual inputs, which are propagated to the motor neurons through the simple neural network.

Figure 5a diagram labels: vision input; sensor (vision) neurons; locomotion outputs (search, left, forward, right).

Figure 5. The doughnut hunter application running on a SpiNNaker Test Chip. (a) The doughnut hunter neural model. (b) The doughnut hunter in action.

The motor signals are sent back to the server to control the hunter movement.

These experiments demonstrate firmly, for the first time, that we have working silicon for SpiNNaker. However, we haven't yet demonstrated a real-time simulation of a billion (10⁹) neurons, which is the objective of our research. To put that large number into perspective, a human brain contains approximately 10¹¹ neurons. Having working silicon for SpiNNaker is a significant project milestone that takes us a step closer to our 2012 plan of deploying a SpiNNaker configuration with more than one million ARM cores.

Acknowledgments

We thank the Engineering and Physical Sciences Research Council (EPSRC), Silistix, and ARM for supporting this research. Mikel Luján is supported by a Royal Society University Research Fellowship. Arash Ahmadi at the University of Southampton developed the doughnut hunter application.

References

1. H. Markram, "The Blue Brain Project," Nature Reviews Neuroscience, vol. 7, 2006, pp. 153–160.
2. E.M. Izhikevich, "Simple Model of Spiking Neurons," IEEE Trans. Neural Networks, vol. 14, no. 6, 2003, pp. 1569–1572.
3. M.M. Khan et al., "SpiNNaker: Mapping Neural Networks onto a Massively Parallel Chip Multiprocessor," Proc. Int'l J. Conf. Neural Networks, IEEE Press, 2008, pp. 2849–2856.
4. P.A. Merolla et al., "Expandable Networks for Neuromorphic Chips," IEEE Trans. Circuits and Systems, vol. 54, no. 2, 2007, pp. 301–311.
5. L.A. Plana et al., "A GALS Infrastructure for a Massively Parallel Multiprocessor," IEEE Design & Test, vol. 24, no. 5, 2007, pp. 454–463.
6. A.L. Hodgkin and A. Huxley, "A Quantitative Description of Membrane Current and Its Application to Conduction and Excitation in Nerve," J. Physiology, vol. 117, no. 4, 1952, pp. 500–544.
7. O. Sporns, G. Tononi, and G. Edelman, "Theoretical Neuroanatomy: Relating Anatomical and Functional Connectivity in Graphs and Cortical Connection Matrices," Cerebral Cortex, vol. 10, no. 2, 2000, pp. 127–141.
8. E.M. Izhikevich, J.A. Gally, and G.M. Edelman, "Spike-Timing Dynamics of Neuronal Groups," Cerebral Cortex, vol. 14, no. 8, 2004, pp. 933–944.
9. X. Jin, S. Furber, and J. Woods, "Efficient Modeling of Spiking Neural Networks on a Scalable Chip Multiprocessor," Proc. Int'l J. Conf. Neural Networks, Ablex Publishing, 2008, pp. 2812–2819.

Xin Jin is an embedded software engineer in the Media Processing Division of ARM Ltd., and was involved in the SpiNNaker group during his doctoral studies. His research interests include neural modeling algorithms and software development on parallel neuromorphic hardware. Jin has a PhD in computer science from the University of Manchester. Contact him at [email protected].

Mikel Luján is a Royal Society University Research Fellow in the School of Computer Science. His research interests include managed runtime environments and virtualization, many-core architectures, and application-specific systems and optimizations. Luján has a PhD in computer science from the University of Manchester. Contact him at mikel.lujan@manchester.ac.uk.

Luis A. Plana is a research fellow in the School of Computer Science at the University of Manchester. His research interests include the design and implementation of systems-on-chip and on-chip interconnect. Plana has a PhD in computer science from Columbia University. He is a senior member of IEEE. Contact him at luis.plana@manchester.ac.uk.

Sergio Davies is a PhD student in the School of Computer Science at the University of Manchester. His research interests include artificial neural networks, neural network simulation, and synaptic plasticity. Davies has an MSc in telecommunication engineering from the University of Naples. Contact him at [email protected].

Steve Temple is a research fellow in the School of Computer Science at the University of Manchester. His research interests include VLSI design and hardware system design. Temple has a PhD in computer science from the University of Cambridge. Contact him at [email protected].

Steve Furber is the ICL Professor of Computer Engineering in the School of Computer Science at the University of Manchester, where he leads the Advanced Processor Technologies research group. His research interests include multicore computing, energy-efficient VLSI design, and neural systems engineering. Furber has a PhD in aerodynamics. He is a fellow of the Royal Society, the Royal Academy of Engineering, and IEEE, and is a 2010 Millennium Technology Prize Laureate. Contact him at steve.furber@manchester.ac.uk.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
