C-Based Hardware Design of Imdct Accelerator for Ogg Vorbis Decoder
Total Page:16
File Type:pdf, Size:1020Kb
C-BASED HARDWARE DESIGN OF IMDCT ACCELERATOR FOR OGG VORBIS DECODER Shinichi Maeta1, Atsushi Kosaka1, Akihisa Yamada1, 2, Takao Onoye1, Tohru Chiba1, 2, and Isao Shirakawa1 1Department of Information Systems Engineering, 2Sharp Corporation Graduate School of Information Science and Technology, 2613-1 Ichinomoto, Tenri, Nara, 632-8567 Japan Osaka University phone: +81 743 65 2531, fax: +81 743 65 3963, 2-1 Yamada-oka, Suita, Osaka, 565-0871 Japan email: [email protected], phone: +81 6 6879 7808, fax: +81 6 6875 5902, [email protected] email: {maeta, kosaka, onoye, sirakawa}@ist.osaka-u.ac.jp ABSTRACT ARM7TDMI is used as the embedded processor since it has This paper presents hardware design of an IMDCT accelera- come into wide use recently. tor for an Ogg Vorbis decoder using a C-based design sys- tem. Low power implementation of audio codec is important 2. OGG VORBIS CODEC in order to achieve long battery life of portable audio de- 2.1 Ogg Vorbis Overview vices. Through the computational cost analysis of the whole decoding process, it is found that Ogg Vorbis requires higher Figure 1 shows a block diagram of the Ogg Vorbis codec operation frequency of an embedded processor than MPEG processes outlined below. Audio. In order to reduce the CPU load, an accelerator is designed as specific hardware for IMDCT, which is detected MDCT Psycho Audio Remove Channel Acoustic VQ as the most computation-intensive functional block. Real- Signal Floor Coupling time decoding of Ogg Vorbis is achieved with the accelera- FFT Model Ogg Vorbis tor and an embedded processor both run at 36MHz. The Bitstream operation frequency is at the same level as that of MPEG Audio Reconstruct Channel Audio decoding process by an embedded processor. IMDCT IQ Signal Curve Decoupling 1. INTRODUCTION Figure 1: Process Flow of Ogg Vorbis. With recent advances in audio compression technology, it has seen a steady increase in the popularity of digital music • MDCT, FFT: that is delivered over the Internet. Therefore, there are strong Coefficients in the time domain are transformed to those demands for portable audio players supporting various audio in the frequency domain by MDCT and FFT, where compression formats. MDCT coefficients and FFT coefficients are used as en- Ogg Vorbis [1] is a new audio compression format, coding data and parameters in the succeeding modules, which is distinctive in terms of fully open, non-proprietary, respectively. and patent-and-royalty-free. The original Ogg Vorbis refer- • Psycho Acoustic Model: ence decoder is performed with floating-point arithmetic, Inaudible spectrums are removed on the basis of “psy- which makes it difficult to achieve realtime decoding by an cho-acoustic” masking and a silence threshold. embedded processor with no FPU (Floating Point Unit). A fixed-point arithmetic version of the Ogg Vorbis decoder, • Remove Floor: called Tremor [1], was released in September 2002, which A spectrum envelope, called a floor curve, is generated. enabled an embedded processor to decode Ogg Vorbis bit- Then, MDCT coefficients are divided by their floor stream in realtime. However, its decoding process still re- curve values to flatten the shape of spectrums. quires higher computational cost than that of MPEG Audio • Channel Coupling: [2, 3], and suffers from high power consumption. Further data compaction is attempted by correlating the Motivated by this technical issue, the present paper pro- two channels. poses hardware design for an Ogg Vorbis audio decoder with an embedded processor. In order to reduce CPU load, a func- • Vector Quantization: tional block with high computational cost in Ogg Vorbis de- A set of spectrums, which represents a particular region, coding, specifically IMDCT, is implemented by dedicated is approximated by a codevector. circuits through the use of a C-based design system. An 1361 To the contrary, the decoding algorithm is accomplished designs [4]. The user language of the Bach system, called by tracing above mentioned processes in the reverse order, as Bach C, is designed based on ANSI-C. Since basic grammar illustrated in Figure 1. of Bach C is the same as that of ANSI-C except several ex- tensions, designers can write Bach C code in the same way 2.2 Software Simulation as the conventional C code. Therefore designers can utilize In order to estimate the operation frequency required for existing software library in ANSI-C for hardware designs. realtime decoding, the execution cycle of the Tremor decod- The Bach system consists of mainly a synthesizer and a ing process was analyzed. In this analysis, the ARM7TDMI simulator. The synthesizer includes high level synthesis was used as the target processor and the decoding process technique and generates register-transfer level circuits in was executed with sample bitstreams of a violin, trumpet, VHDL. The simulator translates Bach C code into ANSI-C and flute encoded by the Ogg Vorbis Version 1 encoder at and compiles it using a conventional C compiler. Since the the sampling rate of 44.1kHz and the target bitrate of simulator runs more than 100 times as fast as RTL simula- 128kbps. To generate the profile summary of the decoding tors, design validation period can be drastically reduced. process, each sample was decoded by using ARMulator (an emulator of ARM processors) with zero memory wait. Table Bach system 1 summarizes the result of profiling. Bach Table 1: Profile Summary – Num. of Cycles Simulator cycle num. of frame cycle/frame ANSI-C Bach C Bach RTL VHDL Logic Netlist violin 4,095,509,129 2,592 1,578,839 Synthesizer Synthesizer trumpet 2,030,970,723 1,540 1,318,812 flute 3,730,821,502 3,015 1,237,420 Figure 2:Design Flow using Bach System. As can be seen from this table, the number of cycles needed Figure 2 shows the design flow. First we modify the for 1 frame is calculated as 1,578,839. Since the allowable original ANSI-C code so that it can be synthesized by the time for processing a long frame (1,024 samples) at 44.1kHz Bach synthesizer. What we need to do at this phase are: is 1,024/44,100 = 23.2[ms], the required operation fre- 1) to specify bit width for each variable if necessary, and quency of the embedded processor is determined as 2) to rewrite a dereference operator (*) of specifying an 1,578,839/23.2 = 68.1[MHz]. As reported in [2] and [3], array element to an operator []. other audio decoding processes such as MP3 and MPEG-2 We improve the Bach C code while validating it using the AAC are executed generally around 30MHz of operation Bach simulator. Once register-transfer level VHDL code is frequency by using ARM7TDMI. Contrary to this, Tremor generated, the code can be easily synthesized into netlist by decoding process requires almost twice of the clock fre- conventional logic synthesizer. quency. Thus it should be pointed out that the operation fre- quency must be reduced by employing an accelerator for a 3.2 Hardware Implementation of IMDCT functional block with high computational complexity. When we design hardware, we often need to consider differ- In order to evaluate the computational complexity in de- ent cost from that of software implementation. In software tail, computational cost for each functional block in the implementation, it is important to minimize the total number Tremor decoding is analyzed as is summarized in Table 2. of operations since the size of hardware (proces- sor/RAM/ROM) is fixed. In hardware implementation, it is Table 2: Profile Summary - Execution Time important to take account of trade-off between area and op- IMDCT IQ Others eration cycles. violin 51.12% 11.12% 37.76% Suppose the following two functions f1 and f2. trumpet 51.33% 11.21% 37.46% f 1: x, y, z → x × y + z; flute 51.12% 11.12% 37.76% f 2 : x, y → x × y; According to the result of this analysis, the IMDCT process takes more than half of the total processing time. Therefore, For software implementation, it would be better to have two our Ogg Vorbis audio decoder is designed to be equipped functions f1 and f2, although f2 can be replaced with f1 by with IMDCT accelerator. setting z to 0. For hardware implementation, it would be better to have a module for only f1 than to have two mod- 3. C-BASED DESIGN OF IMDCT ACCELERATOR ules for f1 and f2 in spite of increasing the total number of operations. 3.1 Design Method High level synthesis in the synthesizer tries to minimize In contrast with conventional RTL-based design, several C- area by sharing functional units among operations in input based design systems have been developed to make higher code, where the sharing is performed to operation by opera- level design more efficient. The Bach system is a C-based tion. Therefore the synthesizer often fails to share a module, design system which has been applied to several real LSI a set of functional units, among several routines in more 1362 complicated cases. Hence, it is important to write a common for(i=0;i<M;i++){ for(i=0;i<M+2;i++){ routine used as many times as possible in input source code mdct_butterfly_generic(); mdct_butterfly_generic(); } } without degrading the original performance. for(i=0;i<N/2;i+=32){ for(i=0;i<N/2;i+=8){ N/2-point N/4-point 32-point 16-point 8-point mdct_butterfly_32(){ mdct_butterfly_8(); butterfly butterfly butterfly butterfly butterfly … } N/2 N/2 N/2 N/2 N/2 mdct_butterfly_16(){ … mdct_butterfly_8(); } 3N/8 } } (a) Tremor (b) Proposed implementation N/4 N/4 ・・・ Figure 5: Implementation Example.