7-2
IEEE Asian Solid-State Circuits Conference
November 16-18, 2009 / Taipei, Taiwan

CRISP-DS: Dual-Stream Coarse-grained Reconfigurable Image Stream Processor for HD Digital Camcorders and Digital Still Cameras

Tsung-Huang Chen, Jason C. Chen, Teng-Yuan Cheng, and Shao-Yi Chien
Media IC and System Lab
Graduate Institute of Electronics Engineering and Department of Electrical Engineering
National Taiwan University
BL-421, 1, Sec. 4, Roosevelt Rd., Taipei 106, Taiwan

Abstract: A 329mW 600M-Pixels/s dual-stream coarse-grained reconfigurable image stream processor is implemented in TSMC 0.13μm CMOS technology with a core size of 4.84mm². The reconfigurable pipelined processing element array architecture strikes a good balance between computing performance and flexibility with only 10Kb on-chip memory. Moreover, a new dual-stream architecture is proposed to improve the flexibility and hardware efficiency by processing two independent image streams with two-layer context switching, and an isolation technique is also proposed to reduce the power consumption. Implementation results show that it achieves 1.52 times the power efficiency of our previous work and can meet the requirements of high-definition video camcorders and digital still cameras.

I. INTRODUCTION

The image signal processing engine in a digital camcorder or digital still camera is critical for generating high-quality images and video frames [1]. As the resolution and frame rate of image sensors grow higher and higher, several design challenges are introduced. First, to process high-definition (HD) videos and images with more than 10M pixels, the required computation becomes enormous. It usually leads to high hardware cost and high power consumption, which is not acceptable for handheld devices. Second, to execute more advanced image processing algorithms, high flexibility is required, which means programmable processors are better solutions. Therefore, hybrid approaches with a DSP and dedicated hardware are adopted in many commercial products [2] [3]. To support more complex image processing algorithms and higher image resolution, many SIMD image processors have been proposed in recent years [4] [5] [6]. However, the power consumption is still large even with high-end technologies.

Our previous work, the coarse-grained reconfigurable image stream processor (CRISP) [7], can achieve better power efficiency. By inspecting the characteristics of image processing algorithms, several reconfigurable fabrics are designed with a reconfigurable interconnection. Flexibility is achieved by changing the contexts of this reconfigurable hardware. CRISP has been shown to provide higher power efficiency than DSP-based approaches; however, there is still considerable room for flexibility improvement.

In this paper, a dual-stream coarse-grained reconfigurable image stream processor (CRISP-DS) is designed to support HD image processing with high power efficiency. It is characterized as follows. First, the reconfigurable pipelined processing element (PE) array architecture provides a good balance between computing performance and flexibility with low on-chip memory cost. Second, the new concept of dual-stream processing is introduced to improve both the performance and flexibility. With the PE isolation technique, CRISP-DS can achieve better power efficiency than previous works. It can meet the requirements of HD digital camcorders and still cameras with only 329mW in power consumption.

This paper is organized as follows. First, the architectural concept of CRISP-DS is shown in Section II, and the proposed architecture is introduced in Section III based on this concept. After that, Section IV shows the implementation results and the comparison to previous works. Finally, Section V concludes this paper.

II. ARCHITECTURAL CONCEPT OF CRISP-DS

Fig. 1 shows the architectural comparison between SIMD image processors and CRISP. In an SIMD image processor shown in Fig. 1(a) [4] [5] [6], an SIMD array with more than one hundred processing elements (PEs) is designed to provide high computing power. However, to fill these PEs with the required pixel streams, a huge on-chip memory buffer is required as well as a high-bandwidth channel to the external memory, which greatly increases the power consumption and cost. In our coarse-grained reconfigurable image stream processor (CRISP) architecture [7], as shown in Fig. 1(b), several types of reconfigurable stage processing elements (RSPEs) are specially designed to fit the characteristics of image processing algorithms as reconfigurable fabrics. These RSPEs are connected by the reconfigurable interconnection unit. To map an image processing algorithm on CRISP, the designers only need to write the context registers (CRs) to reconfigure the RSPEs and interconnection. Then the image stream can be fed from and written back via the SoC bus, and the large on-chip memory buffer is not required.
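As a rough illustration of this mapping style, the sketch below models each RSPE as a stage whose behaviour is selected by a context value, and streams pixels through the configured stages one at a time, so that no frame-sized buffer is needed. The class `RSPE`, the function `run_stream`, the context encodings, and the per-pixel operations are all hypothetical and chosen only for illustration; they are not the actual CRISP register map or datapath.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class RSPE:
    """Toy reconfigurable stage (hypothetical model, not the real hardware)."""
    name: str
    ops: Dict[int, Callable[[int], int]]  # context value -> per-pixel operation
    context: int = 0                      # stands in for a context register (CR)

    def process(self, pixel: int) -> int:
        return self.ops[self.context](pixel)


def run_stream(stages: Dict[str, RSPE], route: List[str],
               pixels: Iterable[int]) -> Iterator[int]:
    """Stream pixels through the stages named in `route`, one pixel at a time."""
    for pixel in pixels:
        for name in route:  # `route` plays the role of the interconnection setting
            pixel = stages[name].process(pixel)
        yield pixel


# Two illustrative stages with made-up context encodings.
stages = {
    "A": RSPE("A", {0: lambda p: p, 1: lambda p: p * 2}),        # bypass / gain of 2
    "B": RSPE("B", {0: lambda p: p, 1: lambda p: min(p, 255)}),  # bypass / clip to 8 bits
}

# "Writing the context registers": select the gain in A and the clipping in B.
stages["A"].context = 1
stages["B"].context = 1

result = list(run_stream(stages, ["A", "B"], [10, 100, 200]))
print(result)  # [20, 200, 255]
```

Changing the context values or the order of names in the route selects a different algorithm without touching the stage definitions, which is the property the context registers provide in hardware.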

978-1-4244-4434-2/09/$25.00 ©2009 IEEE

[Figure: (a) SIMD image processor with off-chip memory, high-bandwidth channel, on-chip memory buffer, program memory, controller, and an SIMD array of PEs; (b) CRISP with RSPEs A to D, context registers (CRs), reconfigurable interconnection, and SoC bus.]
Fig. 1. Architectural comparison between (a) image processors with SIMD array [4] [5] [6] and (b) coarse-grained reconfigurable image stream processor (CRISP) [7].

[Figure: panel labels include Image Processing Algorithm I, Image Processing Algorithm II, Time Frame 1, Time Frame 2, input and output interfaces, RSPEs A to D, and a register file.]
Fig. 2. Architectural concepts of (a) single-stream CRISP [7] and (b) dual-stream CRISP (this work).

Although the CRISP architecture achieves high hardware efficiency, it still has several limitations. It can implement single-path image processing algorithms well, as shown in Fig. 2(a). However, for more complex algorithms where the required number and types of RSPEs do not match those in the fabricated chip, more than one time frame is required, as shown in Fig. 2(a). This leads to long execution time, since the whole image must be stored to the off-chip memory at the end of each time frame and loaded back at the beginning of the next. Moreover, it cannot support multi-path image processing algorithms such as the one shown in Fig. 2(b).

To improve the flexibility and efficiency of CRISP, a new concept of dual-stream processing is proposed in this paper, as shown in Fig. 2(b). To implement multi-path algorithms, a register file is designed as a dummy RSPE to achieve synchronization between different paths. Besides, since the working frequency of the image processor is usually higher than that of image sensors, two image streams are handled by one RSPE as two threads with context switching. With this concept, more complicated algorithms can be mapped within one time frame, as shown in Fig. 2(b), where RSPE A is a dual-stream RSPE.

III. PROPOSED ARCHITECTURE

A. System Architecture

Fig. 3 shows the system architecture of CRISP-DS. The input stream can come from a 12b interface in the preview/camcorder mode or from the SoC bus (AHB) in the picture-taking mode, and the output stream is written out via the SoC bus. There are ten types of RSPEs configured by the context registers, and they can be classified into three classes: local memory RSPEs, single-stream RSPEs (SS-RSPE), and dual-stream RSPEs (DS-RSPE). The local memory RSPEs are designed to support different kinds of data accessing patterns for image processing tasks, and the register file RSPE acts as a latency adjustment unit to synchronize different image streams; the color interpolation RSPE is designed to support various demosaicking algorithms; the multiplier-and-accumulator (MAC) RSPE is designed to support image filtering and matrix operations; the pixel-based RSPE is designed to support pixel-independent arithmetic and table-look-up operations; the accumulator (ACC) RSPE is designed to support several measurement functions for auto-white-balance, auto-exposure, and auto-focus; the downsampler RSPE is a programmable downsampler module; the ALU RSPE is designed to support general-purpose image processing operations. These RSPEs are dynamically connected with the reconfigurable interconnection unit, which is also configured by the context registers. The details of several RSPEs are given in Section III-C.

B. Circuits and Communication Protocol of a DS-RSPE

The circuits and the communication protocol for each dual-stream RSPE (DS-RSPE) are shown in Fig. 4. As shown in Fig. 4(a), the core of each RSPE is the reconfigurable datapath, which reads the input stream from the data selector and writes output data to the output registers. To communicate and synchronize between different RSPEs, a unified communication protocol is designed as shown in Fig. 4(b). With the time division multiplex (TDM) protocol, a DS-RSPE can process two independent streams with two layers of contexts switched in different time slots. In order to avoid stream conflict in the same time slot, a re-synchronization module is designed to allocate the two input streams in different time slots, as shown in Fig. 4(c). Moreover, Fig. 4(c) also shows the isolation

module, which can isolate idle RSPEs to avoid redundant power consumption. The input interface of CRISP-DS shown in Fig. 3 works as a stream formation module that translates and resamples the sensor data at a higher frequency to support the dual-stream function.

[Figure: block diagram with host processor, image sensor, display interface, external memory, AHB master/slave wrappers, input and output interfaces, main controller, context registers, statistical registers, reconfigurable interconnection, and the local memory, dual-stream, and single-stream RSPEs (color interpolation, 2x2 and 3x3 window registers, register file, ACC, downsampler, pixel-based, line buffers, ALU, MAC).]
Fig. 3. System architecture of CRISP-DS.

C. Reconfigurable Datapath

Some reconfigurable datapaths of different RSPEs are shown in Fig. 5. Fig. 5(a) shows the line buffer RSPEs and window register RSPEs. By interconnecting these RSPEs, different patterns of image windows can be formed and accessed in one cycle, which is beneficial for image processing tasks. Fig. 5(b) shows the ALU RSPE, which is a 2-thread VLIW general image processing RSPE with multi-layer context registers instead of a program memory. Fig. 5(c) shows the MAC RSPE, which can be used to execute different kinds of filtering operations or matrix multiplication operations. Three MAC RSPEs can also be connected together for more complex filters with larger windows.

Fig. 4. Circuits and communication protocol of a dual-stream RSPE. (a) The detailed architecture of a dual-stream RSPE. (b) The communication protocol of a dual-stream RSPE. (c) The resynchronization and isolation circuits.

IV. IMPLEMENTATION RESULTS AND COMPARISON

A prototype chip of CRISP-DS is fabricated with the TSMC 0.13μm 1P8M process. The measured chip specifications and micrograph are shown in Fig. 6. CRISP-DS is compared with state-of-the-art image processors in Fig. 7. The comparison shows that the memory cost of CRISP-DS is much lower than those of the SIMD array approaches [5] [6] because of the CRISP architecture, as described in Section II. The complex image signal processing (ISP) pipeline shown in Fig. 7 is implemented with different processors as a benchmark for processing speed. Because of the dual-stream concept, CRISP-DS outperforms CRISP [7] and achieves performance comparable to FIESTA [6]. When a simple 3x3 median filter is used as the benchmark, the performance of CRISP-DS is quite similar to that of XETAL-II [5]. The comparison of power efficiency with technology scaling is shown in Fig. 8, which shows that the power efficiency of CRISP-DS is 1.52 times that of our previous work, and it is much higher than that of other previous works since it achieves similar performance with much lower cost and power consumption.

V. CONCLUSION

In this paper, a dual-stream coarse-grained reconfigurable image stream processor (CRISP-DS) is designed to support the image processing pipelines of high-definition (HD) digital camcorders and digital still cameras. The reconfigurable array architecture with reconfigurable stage PEs (RSPEs) specially designed for image processing operations results in an efficient processor with a high processing capability of 600MPixels/s and a low power consumption of 329mW when the operation frequency is 200MHz. Moreover, because of the reconfigurable PE array architecture, the on-chip memory cost is only 10Kb, which is much lower than other state-of-the-art SIMD image

processors. Furthermore, a new dual-stream dataflow and the associated processing elements are designed to process two independent image streams with one RSPE, which makes the prototype chip achieve 1.52 times the power efficiency of our previous work.

ACKNOWLEDGEMENTS

The authors would like to thank Chip Implementation Center (CIC) for chip fabrication.

REFERENCES

[1] R. Ramanath, W. E. Snyder, Y. Yoo, and M. S. Drew, “Color image processing pipeline: a general survey of digital still camera processing,” IEEE Signal Processing Mag., vol. 22, no. 1, pp. 34–43, Jan. 2005.
[2] W. Rabadi, R. Talluri, K. Illgne, J. Liang, and Y. Yoo, “Programmable DSP platform for digital still cameras,” in Proc. 1999 IEEE International Conference on Acoustics and Speech and Signal Processing (ICASSP99), Mar. 1999, pp. 2235–2238.
[3] S. Agarwala, P. Koeppen, T. Anderson, A. Hill, M. Ales, R. Damodaran, L. Nardini, P. Wiley, S. Mullinnix, J. Leach, A. Lell, M. Gill, J. Golston, D. Hoyle, A. Rajagopal, A. Chachad, M. Agarwala, R. Castille, N. Common, J. Apostol, H. Mahmood, M. Krishnan, D. Bui, Q.-D. A. P. Groves, N. Luong, N. Nagaraj, and R. Simar, “A 600MHz VLIW DSP,” in Digest of Technical Papers of 2002 IEEE International Solid-State Circuits Conference (ISSCC2002), vol. 2, Feb. 2002, pp. 38–390.
[4] S. Kyo, T. Koga, S. Okazaki, and I. Kuroda, “A 51.2-GOPS scalable video recognition processor for intelligent cruise control based on a linear array of 128 four-way VLIW processing elements,” IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1992–2000, Nov. 2003.
[5] A. Abbo, R. Kleihorst, V. Choudhary, L. Sevat, P. Wielage, S. Mouy, and M. Heijligers, “XETAL-II: A 107 GOPS, 600mW massively-parallel processor for video scene analysis,” in Digest of Technical Papers of 2007 IEEE International Solid-State Circuits Conference (ISSCC2007), Feb. 2007, pp. 270–602.
[6] S. Arakawa, Y. Yamaguchi, S. Akui, Y. Fukuda, H. Sumi, H. Hayashi, M. Igarashi, K. Ito, H. Nagano, M. Imai, and N. Asari, “A 512GOPS fully-programmable digital image processor with full HD 1080p processing capabilities,” in Digest of Technical Papers of 2008 IEEE International Solid-State Circuits Conference (ISSCC2008), Feb. 2008, pp. 312–313.
[7] J. C. Chen, C.-F. Shen, and S.-Y. Chien, “Coarse-grained reconfigurable image stream processor for digital still cameras and camcorders,” in Proceedings of Custom Integrated Circuits Conference (CICC2007), Sept. 2007, pp. 81–84.

[Figure: (a) line buffers feeding arrays of window registers; (b) ALU RSPE with input selector, function units FU1 to FU4, context registers and context selector, register file (x16), CMP, Min/Max, ALU, MUL, and output MUX; (c) MAC RSPE with input selector, function units FU1 to FU9, MUL, SUB, ABS, constants, adder trees, 9-input sorting network, threshold comparator, MAC 1 to MAC 3, adder tree for 5x5 filter, and output selector.]
Fig. 5. Datapath of (a) line buffers and window register RSPEs, (b) ALU RSPE, and (c) MAC RSPE.

Technology | TSMC 0.13um CMOS 1P8M
Package | 128CQFP
Die Size | 3.18mm x 3.18mm
Core Size | 2.20mm x 2.20mm
Logic Gate Count (2-input NAND) | 330,364
Max. Working Frequency | 200MHz
Max. Data Rate | 600M Pixels/s*
Power Consumption | 329mW @ 200MHz, 1.3V
On-Chip Memory | 104,192b
Processing Capability | 48 fps @ 1920x1080 video (Dual-stream mode)
 | 96 fps @ 1920x1080 video (Single-stream mode)
 | 289 fps @ 720x480 video (Dual-stream mode)
 | 579 fps @ 720x480 video (Single-stream mode)
 | 18 fps @ image size = 4072(H) x 2720(V) = 11M-pixels (Picture-taking mode)
*: one pixel = one 8-bit data. In the picture-taking mode, it processes the image data from the system bus.
[Micrograph labels: MAC, line buffer, window reg., downsampler, reg. file, ALU, color interpolation, ACC, pixel, histogram memory.]
Fig. 6. Chip specification and micrograph.

 | ISSCC’02 [3] TMS320C64s | CICC’07 [7] CRISP | ISSCC’07 [5] XETAL-II | ISSCC’08 [6] FIESTA | This Work CRISP-DS
Architecture | VLIW DSP | CRISP | SIMD Array | SIMD Array | CRISP-DS with VLIW PE
Technology | 130nm | 180nm | 90nm | 65nm | 130nm
Die Size | 72mm² | 7.72mm² | 74mm² | 152.83mm² | 10.11mm²
On-Chip SRAM Size | 8Mb | 0.077Mb | 10Mb | 17.4Mb | 0.104Mb
Power Consumption | 718mW | 218mW | 600mW | 783mW | 329mW
Frame Rate for ISP Pipeline with a 1920x1080 sensor* | 0.18fps | 8.04fps | N/A | 60fps** | 48.23fps
Frame Rate for 3x3 Median Filter on a 10M Image | 0.41fps | 11.5fps | 23fps | N/A | 20fps
ISP pipeline stages shown in the figure: CFA input (Bayer pattern), white balance, CFA interpolation, color correction, de-noise filtering, edge enhancement, color conversion, contrast enhancement, saturation enhancement, output.
*: With the image signal processing pipeline shown above.
**: The image signal processing pipeline is different from the above one, but the complexity is similar.
Fig. 7. Specification comparison with previous works.

Power Efficiency (fps/mW) with Technology Scaling***
ISSCC’02 [3] | 7.22x10^-4 | This work is 719.06x
ISSCC’08 [6] | 0.08 | This work is 6.78x
CICC’07 [7] | 0.33 | This work is 1.52x
This Work (CRISP-DS) | 0.50 |
***: Power consumption is normalized to the 65nm process, $P_{65} = P_{ori} (C_{65}/C_{ori}) (V_{65}/V_{ori})^2$, where ori denotes the original process.
Fig. 8. Power efficiency comparison with previous works.
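The normalization in the note to Fig. 8 can be sketched in a few lines. The example below applies $P_{65} = P_{ori} (C_{65}/C_{ori}) (V_{65}/V_{ori})^2$ and then divides the ISP-pipeline frame rate from Fig. 7 by the normalized power. The paper does not list the capacitance or voltage values it used, so the scaling inputs here are assumptions: capacitance is taken as proportional to feature size, and the 65nm supply is taken as 1.0V. The function names are hypothetical, and the use of the Fig. 7 ISP frame rates as the numerator is also an assumption.

```python
def normalize_power(p_ori_mw: float, c_ratio: float, v_ratio: float) -> float:
    """Return P_65 = P_ori * (C_65 / C_ori) * (V_65 / V_ori)**2 in mW."""
    return p_ori_mw * c_ratio * v_ratio ** 2


def power_efficiency(fps: float, p_mw: float) -> float:
    """Return frames per second per milliwatt."""
    return fps / p_mw


# FIESTA [6] is already a 65nm design, so both scaling ratios are 1.
fiesta = power_efficiency(60.0, normalize_power(783.0, 1.0, 1.0))

# CRISP-DS: 130nm at 1.3V, 329mW, 48.23 fps (Fig. 6 and Fig. 7).
# ASSUMED scaling, not stated in the paper: C_65/C_ori = 65/130 and V_65 = 1.0V.
crisp_ds_p65 = normalize_power(329.0, 65.0 / 130.0, 1.0 / 1.3)
crisp_ds = power_efficiency(48.23, crisp_ds_p65)

print(f"FIESTA:   {fiesta:.3f} fps/mW")
print(f"CRISP-DS: {crisp_ds:.3f} fps/mW (normalized power {crisp_ds_p65:.1f} mW)")
```

Under these assumptions the sketch gives roughly 0.077 fps/mW for FIESTA and just under 0.50 fps/mW for CRISP-DS, in line with the 0.08 and 0.50 values in Fig. 8. The exact ratios printed in the figure depend on scaling values the paper does not state, so small differences are expected.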

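The frame rates in Fig. 6 can be cross-checked against the 200MHz clock. The sketch below assumes, as an interpretation that the paper does not state explicitly, that each stream delivers at most one pixel position per clock cycle and that the two streams in dual-stream mode share the clock equally. The function name `max_fps` is hypothetical.

```python
CLOCK_HZ = 200_000_000  # maximum working frequency from Fig. 6


def max_fps(width: int, height: int, streams: int = 1) -> float:
    """Upper bound on frames per second for each stream.

    ASSUMPTION: one pixel position per clock cycle, with the clock shared
    equally among `streams` time-multiplexed streams.
    """
    return CLOCK_HZ / (width * height * streams)


for label, w, h, s in [
    ("1920x1080 single-stream", 1920, 1080, 1),
    ("1920x1080 dual-stream", 1920, 1080, 2),
    ("720x480 single-stream", 720, 480, 1),
    ("720x480 dual-stream", 720, 480, 2),
    ("4072x2720 picture-taking", 4072, 2720, 1),
]:
    print(f"{label}: {max_fps(w, h, s):.2f} fps")
```

These bounds come out at about 96.45, 48.23, 578.70, 289.35, and 18.06 fps, which agrees with the 96, 48, 579, 289, and 18 fps listed in Fig. 6 and with the 48.23 fps entry in Fig. 7. Under the same reading, the 600M Pixels/s data rate (one pixel = one 8-bit data) would correspond to three 8-bit samples per clock cycle, although this is an inference rather than a statement from the paper.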