ICSA Industrial Research Collaborations The ARC Video Processor Recent history of research work interacting with IP companies In 2005, ARC and ICSA collaborated on the architecture of a high performance, low power, configurable media processor targeting mobile video applications. The ARC Video and ARC Sound Advanced products are now in production and licensed to a number of fabless semiconductor companies developing portable media devices. A number of innovative architectural concepts are combined in ARC Video to sustain high performance at very low power consumption. Early Clustered VLIW Research For example, it is able to decode standard TV resolution H.264, VC-1, MPEG-2 or MPEG-4 video streams (2 Mbps, 30 fps) when running at only 166 MHz. At that frequency, a design implemented in 90nm silicon processes will consume as little as 45 mW. This project ran from 1996-99, in collaboration with UPC Barcelona, developing scalable VLIW c0 c1 c2 c3

architectures and compiler techniques. This led to the concept of Clustered VLIW, in which local local local local schedule schedule schedule schedule clusters of functional units share resources and may also share values with neighbouring MPEG-2MPEG-2 andand MPEG-4MPEG-4 decodersdecoders forfor thethe ARCARC VideoVideo architecturearchitecture werewere developeddeveloped inin thethe SchoolSchool ARC Video - Top-level Power Analysis clusters. The Distributed Modulo Scheduling algorithm was developed to produce highly- of Informatics, in collaboration with ARC. TSMC 90nm G Process, Virage libraries of Informatics, in collaboration with ARC. ARC 750 RAMs ARC 750 core SIMD RAMs SIMD core Clock tree DMA unit VLC decoder Bus interfaces efficient code schedules for compute-bound loop nests. ALU ALU ALU ALU ALU ALU ALU ALU TheThe resultingresulting MPEG-2MPEG-2 decoderdecoder cancan decodedecode aa standardstandard definitiondefinition videovideo (30(30 fps)fps) onon anan ARCARC ALU ALU ALU ALU 50 Avg. power = 45 mW @ 166 MHz ALU ALU ALU ALU Video processor running at just 100 MHz and consuming only 25 mW. This is believed to be Video processor running at just 100 MHz and consuming only 25 mW. This is believed to be 45 Local Local Local Local the lowest-power software solution for MPEG-2 decoding available today. register register register register the lowest-power software solution for MPEG-2 decoding available today. 40 file file file file 1. M. Fernandes, N.P. Topham and J. Llosa, Distributed Modulo Scheduling, in Proc. 5th 35 Annual International IEEE Conf. on High Perf. Comp. Architecture (HPCA-5), Orlando Fl, Inter-cluster communication network 30 1998. A 4-cluster VLIW Processor 25 20

ARCŖ 750D CPU ARC Video SIMD Core ARC750D Processor DMA Unit 15 Data Memory Pipeline Memory Interface 10 SCM DA DA1 DA2 DMA-assisted VLC WB Power consumption (mW @ 166 MHz) 5 decoder DMA DMA In Out 0 I0 P1 P2 P3 The Siroyan OneDSP Processor SIMD CPU Fetch Align Dec RF ALU Sel WB SCQ 8 x 16-bit Data H.264 Frames Decoded SR-8C ARCŖ 750D (pixel processing) CPU In 1999 Topham co-founded Siroyan, a UK intellectual property startup company, with the aim SIMD Extensions Power analysis of ARC Video [6] of developing a scalable DSP based on the clustered VLIW concept. The architecture was 128-bitSDM wide SDM

announced at Microprocessor Forum in 2001, and shortly afterwards a 2-cluster SDA SDM SR-4C Instruction Data SCM SIMD cache cache Code implementation was produced in 0.18um and 0.15um silicon. The company filed 13 worldwide DMAVideo Engine DMA Queue patents on innovationd in architecture, microarchitecture and compiler algorithms. Fetch SCA Dec SR-2C RF X1 X2 Acc WB In 2003 the Siroyan OneDSP design and its associated intellectual property was acquired by 200mW, 2GIPS @ 250 MHz Block diagram of ARC Record Mechanism SIMD Decode SIMD Execute pipe Accumulate result the Corporation Inc, the second largest manufacturer of FPGA devices. SIMD I-fetch Vector Reg. Read SIMD Data Access Write-back SR-1C Video architecture SIMD Code Access Dependency check

50mW, 500 MIPS 2. N.P. Topham, The Siroyan OneDSP Architecture, Microprocessor Forum, San Jose 2001. @ 250 MHz Pipeline structure of the ARC SIMD core Die plots showing four versions of OneDSP

Die plot of ARC Video core [5]

2 Architecting the ARC 600 Configurable Processor Publicly-announced ARC Video licensees • 3.25 mm in a 90nm TSMC G process • H.264 D1 30fps at 160 MHz In 2003 Topham led a team within ARC International (LSE:ARK) to develop an ultra low-power configurable • 40 mW power consumption processor, based around the ARCompact instruction set. This was announced at Microprocessor Forum in 2003. Shortly after completing the ARC 600, Topham returned to The University of Edinburgh to take up the Chair in 5. N.P. Topham, ARC’s SIMD Extensions for Multimedia Applications, Fall Processor Forum, San Jose and Tokyo 2005. Computer Systems. This began a collaboration between ARC and the University which continues to the present 6. N.P. Topham, Low Power Challenges in High Performance Media Processors, Fall Processor Forum, (invited seminar) San Jose 2006. day. The ARC 600 has been widely licensed and deployed in areas ranging from USB flash storage to mobile TV. 290 MHz worst-case 0.13um ASIC process Dimensions: 1.05mm x 1.87mm ( 2 sq.mm ) Scalable Multi-core JIT-translation for High Speed CPU Simulation 8 KB data cache Media Processing 4-way associative The ICSA micro-architecture research team have developed one of the worlds fastest CPU functional simulators. This uses a patent-

4-way associative During 2006, ARC and ICSA have been collaborating on the design 1

Selection of publicly-announced ARC 600 licensees since 2003 Cache KB Instruction 8 pending dynamic JIT translation mechanism to achieve simulation of a scalable multi-core media processor architecture. This was 2 announced in October 2006 and will form the basis of a range of speeds up to 322 million instructions per second on a 3 GHz host . platforms for very high performance media processing. For typical embedded RISC processors, with CPI of 1.3, this 3. N.P. Topham, The ARC 600 Architecture, Microprocessor Forum, SanJose 2003. Die plot showing ARC 600 floorplan [3] Each CPU can control up to 8 independent Media Processors. represents an effective simulated clock rate of over 400 MHz. A new and novel interconnection protocol supports event-driven This technology could be deployed commercially, across a wide processing directly in hardware, and achieves extremely low latency range of different embedded CPU architectures. Licensing communication – as seen by software. discussions are currently in progress.

ARM 11 The ARC 700 Configurable Processor 2 4.45 mm Core Products based on this architecture will be announced in 2007. Fast Simulator Execution Rate There has been a continuous collaboration between ARC and ICSA to optimize the ARC 700 GPP Floorplan (0.13um) JIT MBP 2.16GHz Interp MBP 2.16GHz performance of the ARC 700, a high-performance configurable core developed during 2003- Frequency optimized = 420 MHz, 140 Kgates JIT Xeon 5160 3GHz Interp Xeon 5160 3GHz 04. The ARC 700 has also been widely licensed and deployed in areas ranging from 350 MIPS 24K 300 3.0 mm2 2. Measured on a Dell PE1950, network processors, media processors, mobile handsets, and communications satellites. Core 250 Xeon 5160 CPU 2.99GHz, ARC 700 Core B 4 MB L2 cache, 4GB memory. 200 D-$ H D-$ T 150 Die area and ARC 700 tags 1.23 mm2 dirty power Core 100 BTAC 50 consumption Core Power Consumption I-$ tags 0 comparisons 0 5 10 15 20

AR M 11 0.6 with simliar Dhrystone iterations ARM and MIPS 24K 0.5 Data Cache Instruction Cache Selection of publicly-announced ARC 700 licensees since 2004 16 KBytes 16 KBytes st 0.25 1. N. Topham, UK Patent application no. 0621711.1, filed 1 November 2006. MIPS cores [4]. ARC 700 4-way 2-way ® ™ 0 0.2 0.4 0.6 0.8 7. N.P. Topham, IntroducingIntroducing thethe ARCARC VRaptorVRaptor MediaMedia mW / MHz ArchitectureArchitecture, Fall Processor Forum, San Jose and Tokyo 4. N.P. Topham, Secrets of the ARC 700 Revealed, Embedded Processor Forum, 2004. ARC 700 floorplan 2006.