Perceptual Feature based Music Classification - A DSP Perspective for a New Type of Application

H. Blume, M. Haller — Chair for Electrical Engineering and Computer Systems, RWTH Aachen University, Schinkelstraße 2, 52062 Aachen, Germany, {blume,haller}@eecs.rwth-aachen.de
M. Botteck, W. Theimer — Research Center, Meesmannstr. 103, 44807 Bochum, Germany, {martin.botteck,wolfgang.theimer}@nokia.com

Abstract — Today, more and more computational power is available not only in desktop computers but also in portable devices such as smart phones or PDAs. At the same time, the availability of huge non-volatile storage capacities (flash memory etc.) suggests maintaining huge music databases even in mobile devices. Automated music classification promises to give the user a much better overview of such huge databases. Such a classification enables the user to sort huge music archives according to different genres, which can be either predefined or user defined. It is typically based on a set of perceptual features which are extracted from the music data. Feature extraction and subsequent music classification are very computationally intensive tasks. Today, a variety of music features and possible classification algorithms, optimized for various application scenarios and achieving different classification qualities, are under discussion. In this paper, results concerning the computational needs and the achievable classification rates on different processor architectures are presented. The inspected processors include a general purpose P IV dual core processor as well as heterogeneous digital signal processor architectures such as a Nomadik STn8810 (featuring a smart audio accelerator, SAA) and an OMAP2420. In order to increase classification performance, different forms of feature selection strategies (heuristic selection, full search and Mann-Whitney-Test) are applied. Furthermore, the potential of a hardware-based acceleration for this class of application is inspected by performing a fine grain as well as a coarse grain instruction tree analysis. Instruction trees are identified which could attractively be implemented as custom instructions, speeding up this class of applications.

Keywords — music information retrieval, music classification, feature extraction, processor performance, processor architecture optimization, ASIP

I. INTRODUCTION

Listening to music from digital storage becomes ever more popular. The storage capacity available in current mobile and portable devices has increased dramatically; future devices will provide even larger memories. Consequently, users carry around huge music collections containing thousands of tracks. Maintaining an overview of such collections becomes a difficult undertaking. Common approaches attempt to sort musical content by "genre", thus expressing commonalities between tracks. However, the definitions of "genre" as well as the boundaries between genres lack a common taxonomy. [10] and [14] describe the existing genre diversity in detail, analyzing popular music portals (AMG allmusic, amazon, mp3.com): these three portals share only 70 common genre names out of the several hundred available at each portal.

As of today, genre classification data needs to be created manually at some point in time and is subject to corruption upon transfer of the collection from one device to another. In the absence of a common genre taxonomy, an automated classification approach promises to take individual listening habits and preferences into account. Such a classification needs to rely on how the music sounds rather than on how it is named in order to provide the desired listening experience. The main process of music classification is depicted in Fig. 1.

[Figure 1: music database → feature extraction → classification → Genre 1, Genre 2, Genre 3, …, Genre N]
Figure 1. Principle of perceptual feature based classification

A variety of algorithms for this purpose has been developed, ranging from data mining and machine learning approaches [8] to techniques known from speech recognition [1]. A good overview of existing approaches can be found in [19]; the latest developments are evaluated against each other in the MIREX contest [9]. All these algorithms impose serious requirements on computation power while moving huge amounts of data. Up to now, their implementation on mobile end-user hardware has not been studied.

The work presented here is based on a study of the computational effort for simplified but representative classification use cases. Three major approaches were identified to reduce these needs:

• On the algorithmic level, an optimal selection of the set of perceptual features used for classification is investigated. Here, a parameter full search as well as a statistical approach (Mann-Whitney-Test) derives the most significant music features for classification purposes.
• On the software implementation level, all available processor specific code optimization techniques (e.g. the use of specific custom instructions) are discussed.

978-1-4244-1985-2/08/$25.00 ©2008 IEEE

Furthermore, the impact of reducing the computation word length from floating-point to fixed-point on the computational performance as well as on the classification quality is inspected.

• On the processor architecture level, the design of an application specific instruction set processor (ASIP) is explored. For this purpose, the results of a fine grain as well as a coarse grain instruction tree analysis are presented, thus identifying the optimization potential as well as the design baseline for such a processor.

The paper is organized as follows: Chapter II introduces some basics on algorithms for machine based music classification. Chapter III shortly sketches the hardware platforms deployed in the investigation. The following chapter describes the music classification experiments used for reference. Chapter V discusses the results of a computation time analysis of these experiments and elaborates possible algorithmic optimizations, e.g. optimization techniques for the identification of significant features. Chapter VI quantifies possible performance gains through the identification of instruction trees which could be implemented as custom instructions. Finally, a conclusion is given in Chapter VII.

II. BASICS OF FEATURE BASED MUSIC CLASSIFICATION

Automatically evaluating similarities in listening experience between music tracks implies (at least) two processing steps (see Fig. 1):

• compute music features based on technical definitions of music properties (feature extraction) and
• evaluate similarities according to given sets (classification).

The entire automated music classification is based on the assumption that specific listening experiences somehow translate to properties expressed in technical terms. And, indeed, there is e.g. a resemblance between musical rhythm patterns and certain properties of the autocorrelation function – however inexact this may be [15]. Characteristic sounds will somehow determine the spectral shape of the signal as well as influence the Cepstral Coefficients on a Mel frequency scale (MFCC) [18, 11]. Tab. I contains a list of such features, i.e. the ones which have been inspected in the course of the experiments discussed here. Several more complex ones [19, 1, 11] have also been proposed.

Two exemplary features shall be shortly sketched here. The zero-crossing rate (r_zero-crossing, denoted as feature F8) provides a rough estimate of the dominating frequency within a time window based on the number of signal sign changes:

    r_zero-crossing = 1 / (2(N−1)) · Σ_{n=0}^{N−2} | sgn x(n+1) − sgn x(n) |    (1)

with the discrete time signal x(n), n ∈ [0, N_total − 1], the total number of samples N_total within the music track and N, the number of samples within one analyzed window.

Whereas the zero-crossing rate is computed in the time domain, there are other important features that utilize the signal spectrum instead. For example, the power spectrum S(k) (denoted as feature F19) is defined in the frequency domain. S(k) is computed using the amplitude spectrum A(k) = |X(k)|, with X(k) being the discrete Fourier transform of x(n):

    S(k) = 10 · log10( (1/N) · A(k)^2 ) .    (2)

TABLE I. PERCEPTUAL FEATURES DEPLOYED

F1  discrete spectrum (computed by FFT)
F2  autocorrelation (performed in time domain)
F3  autocorrelation (performed in frequency domain)
F4  location of the first three maxima of the autocorrelation
F5  root mean square
F6  low energy windows
F7  sum of correlated components
F8  zero crossing rate
F9  relative periodicity amplitude peaks
F10 ratio of second and first periodicity peak
F11 fundamental frequency
F12 magnitude of spectrum
F13 MFCC
F14 spectral centroid
F15 spectral rolloff
F16 spectral extent
F17 amplitude of spectrum
F18 spectral flux
F19 power spectrum
F20 spectral kurtosis
F21 spectral bandwidth
F22 position of main peaks
F23 chroma vector
F24 amplitude of maximum in chromagram
F25 pitch interval between two maximum peaks in chromagram

Classifying the music tracks by perceptual features requires pre-determining the feature values for the target category: a subset of tracks needs to be selected and categorized manually (training phase). This "ground truth" determines the common properties of the desired category as well as which features specifically lead to a track not being part of that category. Then, the classification processing can be brought down to computing the Euclidean distance between feature vectors and selecting suitably close ones. According to the rules of a classifier algorithm, a song is assigned to the most common category amongst its neighbors. Actually, a variety of classifier algorithms has been introduced [8, 19] in order to subdivide the feature vector design space and to assign a new entry to one of the existing feature classes (genres). Some of the classifiers discussed in the literature and referenced in this work are listed in Tab. II.

In the course of this work, the performance and computational requirements of the k-Nearest Neighbor (KNN) classifier promised to be attractive. It shall be described here as an explanatory example for the classification task.
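The two exemplary features of Eqs. (1) and (2) can be sketched in a few lines. The following is a minimal NumPy illustration (window size and sample rate as in the experiments described later; the function names are ours, not from the paper):

```python
import numpy as np

def zero_crossing_rate(x):
    """F8, Eq. (1): fraction of sign changes within one analysis window."""
    s = np.sign(x)
    s[s == 0] = 1              # treat sgn(0) as +1 so silent samples add no crossings
    return np.sum(np.abs(s[1:] - s[:-1])) / (2.0 * (len(x) - 1))

def power_spectrum_db(x):
    """F19, Eq. (2): S(k) = 10*log10(A(k)^2 / N) with A(k) = |FFT(x)(k)|."""
    n = len(x)
    a = np.abs(np.fft.fft(x))
    return 10.0 * np.log10(a ** 2 / n + 1e-12)   # tiny offset avoids log(0) in empty bins

# One 512-sample window of a 1 kHz tone at the 22050 Hz sample rate of the test set
t = np.arange(512) / 22050.0
window = np.sin(2 * np.pi * 1000.0 * t)
print(round(zero_crossing_rate(window), 3))      # close to 2*1000/22050 ≈ 0.091
```

For a pure tone the zero-crossing rate approximates twice the tone frequency divided by the sample rate, which is exactly the "rough estimate of the dominating frequency" the feature is meant to provide.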

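As a preview of the KNN classification step (Euclidean distances plus a majority vote among the k nearest stored vectors), a minimal sketch, assuming the feature vectors have already been extracted; the data and genre labels below are purely illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(query, train_vectors, train_genres, k=3):
    """Assign `query` to the most common genre among its k nearest
    training vectors, using Euclidean distance (KNN-k classifier)."""
    dists = np.linalg.norm(train_vectors - query, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest tracks
    votes = Counter(train_genres[i] for i in nearest)
    return votes.most_common(1)[0][0]          # majority vote

# Toy example: two illustrative genres in a 2-D feature space
train = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.15],   # "Classical"
                  [0.80, 0.90], [0.90, 0.80], [0.85, 0.85]])  # "Rock/Pop"
genres = ["Classical"] * 3 + ["Rock/Pop"] * 3
print(knn_classify(np.array([0.12, 0.18]), train, genres, k=3))  # → Classical
```

Note that training here is nothing more than storing the labeled vectors; all effort is spent at classification time, which is why the parameter k directly trades classification quality against computational effort.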
TABLE II. CLASSIFIER ALGORITHMS USED IN THIS WORK

Support Vector Machine (SVM): subdivides the feature space using nonlinear functions.
k-Nearest-Neighbor (KNN): classification by analysis of the feature distances to the k nearest neighbors in the feature space.
Radial Basis Function Network (RBF): computes the pdf of the features using radial basis functions; classification by determination of the best matching function.
Gaussian Mixture Model (GMM): computes the pdf of the features using Gaussian functions; classification by determination of the best matching function.
Multilayer Perceptron (MLP): neural network that is trained for the separation of different classes.

The KNN classifier relies on similarities of feature vectors in the feature space. Its training phase simply implies storing the respective feature vectors of all music tracks belonging to the target categories. In the classification phase, new data sets (i.e. music tracks) are classified by computing the Euclidean distances between their feature vectors and those of the target categories. The track is assigned to the category that is most common among its k nearest neighbors (see Fig. 2). Hence, the distances need to be ranked; for k = 1 the minimum is selected, while larger values of k imply a majority vote between the k closest target tracks. Therefore, the parameter k influences the classification quality as well as the required computational effort.

Figure 2. K-nearest neighbor classifier (KNN) for k = 1 and k = 3

III. DSP HARDWARE PLATFORMS

Three different hardware platforms respectively processor architectures have been used as reference platforms in the course of these experiments:

• a general purpose P IV dual core processor featuring a clock frequency of 3.2 GHz and 1 GByte of DDR2 SDRAM, targeted for stationary desktop applications,
• a heterogeneous digital signal processor (DSP) Nomadik STn8810 [16] featuring an ARM926EJ-S core (fClk = 264 MHz) as well as a smart audio accelerator (SAA, fClk = 133 MHz), targeted for mobile applications,
• a heterogeneous OMAP2420 processor [17], featuring a general purpose ARM1136 core (fClk = 330 MHz, 32 kBytes of I/D cache) and a TMS320C55x DSP (fClk = 220 MHz).

As the focus of the optimizations discussed later on is mainly related to the Nomadik platform, some details of this platform are briefly presented here. The Nomadik STn8810 is manufactured in a 130 nm CMOS technology and features three processor cores operated at VDD = 1.2 V. The main processor, an ARM926EJ-S core, controls and configures all system components; the operating system and user programs also run on this core. It is supported by two DSP cores based on ST's MMDSP+ architecture:

• A so-called Smart Audio Accelerator (SAA) is responsible for audio processing tasks such as coding, decoding, mixing of audio streams and the application of audio effects (mainly addressed here for implementing the audio feature extraction).
• A Smart Video Accelerator (SVA) is responsible for video processing tasks. This MMDSP+ based core is enhanced by some video specific hardware and software functions (not used in the context of this work).

The hardware reference board (NDK-10B Core Board) applied here features 40 kBytes of on-chip SRAM, 32 MBytes of SDRAM, 16 MBytes of NOR flash memory and 32 kBytes of on-chip Boot-ROM. Furthermore, this platform includes several interfaces such as USB, Fast IrDA, SD-/MMC-Card interfaces and an LCD display controller. The NOR flash memory and the SDRAM are included with the processor cores in the same package as a system-in-package. Neither the ARM core nor the SAA features a floating point unit.

IV. PERFORMING MUSIC CLASSIFICATION EXPERIMENTS

There are several publications discussing reference experiments for perceptual feature based music classification. All these experiments differ in the number and type of used genres, the feature definitions, the classifiers or the size of the used test database. Tab. III lists some reference experiments including the main experimental setup parameters. Subsequently, the experiments discussed here are described in detail and the corresponding results are related to the results of the reference experiments.

Music Test Sets

A set of 729 different music tracks (sample rate 22050 Hz, mono, 16 bit per sample) from various artists has been used. These have been obtained as license-free music from the Magnatune label [6]. This test set has also been used in the course of the international MIREX contest [9]. The music tracks were sorted manually into the following six different genres:

• Classical • Electronic • Jazz / Blues
• Metal / Punk • Rock / Pop • World

From the complete music database, different test sets have been extracted:

Set 1 consists of hand-sorted songs. For each song it has been subjectively tested whether it is representative of the respective genre. Subsets containing 12 songs per genre have been built, resulting in a total of 72 songs.

Set 2 consists only of music tracks from the genres "Classical" and "Rock / Pop". It has been defined in order to test the influence of the number of genres on the classification quality. For each genre, 30 songs have been selected, resulting in a total of 60 songs.

Cross Validation Experiments

In order to derive classification rates, so-called cross validation experiments have been performed. Within these experiments, one subset of the data is used for training the classifier and another subset is used for testing it. Here, 9/10 of the data is used for classifier training and 1/10 for testing. Each cross validation experiment is repeated 10 times and the final classification rate is the average over these 10 runs.

From each test data set, all implemented features have been extracted and stored in a feature vector. Each feature is calculated on a window-by-window basis, applying a window size of 512 samples. The feature vectors are stored together with the according genres for further processing in the so-called attribute relation file format (arff).

For the first experiments, different (non-optimized) feature vectors (heuristically selected as well as chosen according to reference proposals in the literature, see Tab. III) have been applied and the resulting classification quality has been determined. These first experiments featured a spread in classification quality of more than 20 % (i.e. the classification quality varied from about 40 to 62 %). What could be noticed from these experiments is that the kind and number of features as well as the used classification algorithm have a significant impact on the classification quality as well as on the required computation time on the different hardware platforms.

Tab. IV lists the parameters of the executed experiments. In the first column, the experiments are numbered for further reference. The second column provides the used classifier algorithm (KNN, RBF); the number attached to the classifier symbol specifies the number of considered neighbors respectively the number of considered radial basis functions. The following columns provide the used features.

The achieved classification results of the experiments described in Tab. IV are depicted in Fig. 3 and Fig. 4. Depending on the features used, classification rates of up to 62 % could be achieved for the experiment with six genres. Reducing the number of genres leads to a drastic improvement of the classification quality: using only two genres leads to classification rates of up to 97 %.

Here, feature combinations taken from the literature have been used. But as there is a strong dependency on various parameters (feature definitions, classifier, test database etc.), the classification results partially differ from the results published in the literature. For example, in [18], two different experiments with six different genres (six different forms of Jazz) using 14 different features achieved classification rates of 58 % respectively 68 % (depending on the classifier which was used), while here, using a KNN 3 classifier and only five different features (see experiment No. 22 in Tab. IV), 62 % was achieved for a similar experiment.

Obviously, the classification quality highly depends on selecting appropriate feature sets. Due to the non-linear nature of the dependencies, any simplification or optimization of the computational effort will have substantial consequences on the classification behavior. Therefore, the following chapter describes inspections of these relations by several experiments in detail.

Figure 3. Exemplary classification results (six genres, data test set 1) using non-optimized reference feature sets

Figure 4. Exemplary classification results (two genres, data test set 2) using non-optimized reference feature sets

V. COMPUTATION TIME ANALYSIS OF DSP-BASED MUSIC CLASSIFICATION

First of all, reference experiments have been performed on all reference hardware platforms (P IV, Nomadik, OMAP, see Section III). Fig. 5 depicts the required computation time for extracting a typical feature set of nine different reference features (F1, F2, F4, F5, F7-F10, F13, Tab. I) from 23 seconds of music (i.e. 1000 data windows with 512 samples each). It can be seen that for some implementation forms, the computation time might be even longer than the original duration of the music track. Notably, the ARM cores (particularly when using floating-point number representation), featuring the moderate clock frequencies typical for mobile devices, are not able to solve this task in an acceptably short time.
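The ten-fold cross-validation scheme described above can be sketched as follows; this is a minimal illustration, assuming a caller-supplied `train_and_score` callback (an assumed name) that trains a classifier on the training indices and returns the classification rate on the test indices:

```python
import random

def cross_validate(n_tracks, train_and_score, folds=10, seed=0):
    """Average classification rate over `folds` rounds: in each round,
    1/folds of the tracks is held out for testing and the remaining
    tracks (here 9/10) are used for training the classifier."""
    idx = list(range(n_tracks))
    random.Random(seed).shuffle(idx)           # random assignment to folds
    rates = []
    for f in range(folds):
        test = idx[f::folds]                   # disjoint test split per round
        train = [i for i in idx if i not in set(test)]
        rates.append(train_and_score(train, test))
    return sum(rates) / folds                  # final rate = average over all rounds

# Toy run: a dummy scorer that reports a 60 % rate on every fold
print(round(cross_validate(60, lambda train, test: 0.6), 3))
```

The ten test splits are disjoint and together cover the whole data set, so every track is used for testing exactly once, as in the experiments described above.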

TABLE III. STATE-OF-THE-ART OF PERCEPTUAL MUSIC ANALYSIS EXPERIMENTS

Ref. | # Genres | Classifier | Size of music database | # Features | Typ. achieved classification quality | Properties
[2] | 15 | SVM | ~500 | 5 | 70 % | manually generated decision tree (first coarse genre classification, then succeeding refinement)
[3] | 2 | MLP / KNN 1-7 | 414 | 9 | 90-91 % / 82-90 % | classification based on sections of the music tracks, succeeding majority voting on the previous results
[12] | 7 | GMM 4-64 | 175 | 1 | 79-92 % | only the MFCC coefficients are used
[18] | 10 / 4 / 6 | KNN 1-5 | 1000 / 400 / 600 | 14 | 56-60 % / 70-78 % / 56-58 % | in the experiments with four and six genres, the genres are close to each other (e.g. four different forms of classical music, six different forms of Jazz)
[18] | 10 / 4 / 6 | GMM 2-4 | 1000 / 400 / 600 | 14 | 60-61 % / 81-88 % / 62-68 % | same test sets as above
[20] | 2 | SVM / KNN 1 | 100 | 2 | 96 % / 79 % | differentiation between vocal and non-vocal music
[21] | 4 | SVM / KNN / GMM | 100 | 5 | 93 % / 80 % / 87 % | manually generated two-stage decision tree using different features at each node
[22] | 5 | SVM / KNN 3 | 1022 | 4 | 78 % / 72 % | genres: piano, symphony, pop, Beijing opera, Chinese comic dialogue

TABLE IV. FEATURE COMBINATIONS USED IN THE COURSE OF THE PERFORMED EXPERIMENTS (TYPICAL FEATURE COMBINATIONS TAKEN FROM LITERATURE)

Feature columns: Spectral Centroid (F14), Spectral Rolloff (F15), Spectral Rolloff of approx. windowed magnitude (F15a), Spectral Flux (F18), Spectral Extent (F16), Low Energy Windows (F6), RMS (F5), Sum of Correlated Components (F7), Zero Crossing Rate (F8), Rel. Periodicity Amplitude Peaks (F9), Ratio of Second & First Periodicity Peak (F10), Position of Main Peaks (F22), Power Spectrum (F19), MFCC (F13), Fundamental Frequency (F11). An "X" marks a feature used in the respective experiment.

1  KNN 1  | X X X X X X X
2  KNN 3  | X X X X X X X
3  KNN 5  | X X X X X X X
4  KNN 7  | X X X X X X X
5  RBF 4  | X
6  RBF 8  | X
7  RBF 32 | X
8  KNN 1  | X X X X X
9  KNN 3  | X X X X X
10 RBF 3  | X X X X X
11 KNN 3  | X X X
12 KNN 1  | X X X X X X
13 KNN 3  | X X X X X X
14 KNN 5  | X X X X X X
15 RBF 2  | X X X X X X
16 RBF 3  | X X X X X X
17 RBF 4  | X X X X X X
18 KNN 3  | X X X X X X X X X X
19 KNN 5  | X X X X X X X X X X
20 RBF 3  | X X X X X X X X X X
21 KNN 1  | X X X X X
22 KNN 3  | X X X X X
23 KNN 5  | X X X X X
24 RBF 3  | X X X X X
25 KNN 1  | X X X X X X
26 KNN 3  | X X X X X X
27 KNN 5  | X X X X X X
28 RBF 2  | X X X X X X
29 RBF 3  | X X X X X X
30 RBF 4  | X X X X X X

Figure 5. Computation time for extracting nine different features from 1000 sample windows (512 samples each) on different hardware platforms: P IV (3.2 GHz, floating-point), SAA (133 MHz, fixed-point), ARM1136 on OMAP2420 (330 MHz, floating-point), ARM926 on Nomadik (264 MHz, floating-point) and ARM926 on Nomadik (264 MHz, fixed-point). "float"/"fixed" denote floating-point/fixed-point implementations respectively. The duration of the original music track (23 sec.) is marked by a dashed line.

The computation time for the subsequent classification is much shorter. For example, on the SAA a KNN 3 classifier evaluating ten features requires less than 70 ms. Therefore, the complete computation time is dominated by the feature extraction time, and it has to be analyzed how the computational effort spent for extracting the features is distributed over the individual features. Fig. 6 presents a detailed analysis of the required computation times for the different features (here for an implementation on the SAA, 133 MHz, code optimized fixed-point implementation). For identifying the features, see the feature list in Tab. I.

Figure 6. Computation time for extracting single features (F1-F25) from 1000 windows with 512 samples each; SAA, 133 MHz, code optimized fixed-point implementation.

It can be seen that the computation time of a few features significantly dominates the computational effort. For example, the features F2 (autocorrelation), F11 (fundamental frequency) and F19 (power spectrum) require more than 65 % of the complete computation time when extracting all 25 features.

Exemplarily, an optimization of the number and kind of features applied in classification has been performed here for the classification experiment using six different genres. Starting with 19 different features and four different classifiers, a "full search" over all possible feature and classifier combinations has been performed. This required a computation time of nearly 36 hours on a P IV (3.2 GHz). A feature set with only seven different features (F9, F10, F13, F14, F16, F19, F22) in combination with a KNN 5 classifier could be identified which increases the resulting classification quality by about 10 % to 72 % while reducing the computational effort by 25 % (compared to using 19 different features).

One aspect that has to be regarded is how to efficiently identify the optimum parameter combinations for different classification problems when regarding the complete diversity of today's available features. Today, more than sixty different features and several tens of classifiers have been proposed. Performing a "full search" on this huge design space would exceed all feasible computation times, even for an in-advance determination of the best parameters. One possible way out of this dilemma is choosing a statistical test in order to identify suitable parameter combinations. In order to prove that such an approach is a valuable means, a so-called Mann-Whitney-Test has been exemplarily performed for the reduced design space discussed before (19 features and four different classifiers).

The Mann-Whitney-Test is a non-parametric statistical test which is highly suitable for identifying single important parameters; for a description see e.g. [13, 5]. Performing this test identified the most important features in less than one hour on the P IV desktop PC. This means that a nearly optimum solution (achieving a classification rate of 67 % and a computation time which is only 15 % worse than that of the optimum solution found by the full search) could be found in less than 3 % of the search time of a full search.

Fig. 7 depicts the classification quality and the required computation time on the SAA platform for different combinations of feature set, classifier and code optimization level. The significant influence of the optimization of these parameter combinations on the resulting quality-effort relations becomes obvious. Furthermore, it can be seen that the results identified by the Mann-Whitney-Test are very close to the absolute optimum found by the parameter full search.

Figure 7. Quality-effort relation for different parameter combinations (19 features with code optimization on C-level; 19 DSP-suited features with code optimization; nine reference features; seven optimized features (full search); eight optimized features (Mann-Whitney-Test)). Classification with six genres, computation time for feature extraction on SAA, 133 MHz, for 10 min. of music.

VI. PERFORMANCE OPTIMIZATION THROUGH PROCESSOR ADAPTATION

Besides optimizations on the algorithmic or software level, optimization approaches on the hardware level must also be investigated. For example, the computationally intensive parts of tasks that have to be performed on a programmable processor can be analyzed, identifying those instructions respectively

97 combinations of instructions that occur most frequently. preceding instructions is performed, which compute the used Subsequently, it can be investigated whether those instructions operands. Here, for the ASL-instruction this is line n-3 for the could be accelerated by implementing them as a dedicated register R2. For the ADD-instruction the last writing access to custom instruction. This implies to adapt the instruction set register R3 can be found in line n-2. Hence, the instruction tree architecture of a processor according to the needs of specific depicted on the right side of Fig. 8 can be extracted from the applications. Those architectures are commonly known as assembler code given on the left. ASIPs (application specific instruction set processors, [4]). R3 R4 R1 R6 Generally, it has to be differentiated between fine grain and coarse grain instruction set extensions. Fine grain extensions ADD IMUL mean to accelerate combinations of only few subsequent … R2 R5 R7 R3 instructions (most common example is to accelerate the n-3 R2 = ADD R3 , R4; frequently occurring combination of multiply and add n-2 R3 = IMUL R1 , R6; ASL ADD n-1 R2 = ASL R2 , R5; R4 = ADD R7, R3; instructions by a so-called MAC-instruction). R2 R4 n R5 = SUB R2 , R4; … In contrast to this, coarse grain acceleration means to iden- SUB tify and accelerate complete functional kernels (e.g. filter kernels) by implementing them in form of a weakly program- R5 mable macro (coprocessor) attached to the programmable Figure 8. Exemplary instruction tree and analysis of data dependencies processor core. In the course of this work the most computational intensive Such a code profiling and instruction tree analysis has been part of music classification (i.e. the feature extraction) has been performed on the feature extraction process. The extraction analyzed concerning the most frequently occurring process for all the features listed in Tab. I resulted in about combinations of instructions. 
The Nomadik profiling tools provide a sequential list of the instructions executed in the feature routines, together with the total number of clock cycles required to execute each instruction. A script-based analysis of the profiling file regards each instruction as a possible starting point for a new instruction tree. For that purpose, for each operand the instruction in which this operand has been computed is sought, and the search continues recursively from that instruction.

The search process is controlled by two parameters, the maximum instruction tree size and the maximum search depth. The maximum instruction tree size specifies the number of instructions beyond which the operands of the individual branches are not searched for any further. The maximum search depth specifies how many preceding code lines are analyzed to find the computation of an operand before the search terminates. The search also terminates if the end of a linear code block or a memory access is detected.

If an instruction tree has been found, the number of cycles spent for the initial instruction is taken as the frequency of operation of this instruction tree. The newly found instruction tree is compared to all instruction trees found before. If it already exists in the instruction tree list, the occurrence counter for this instruction tree is incremented; otherwise, the instruction tree is added to the list.

In the following, this procedure is explained using a basic example, the instruction tree depicted in Fig. 8. On the left side of Fig. 8 the corresponding (pseudo) assembler code is provided. As a starting point for the search, the SUB instruction in code line n is used. The input operands of this instruction are the registers R2 and R4. Starting with line n, the preceding lines are searched for those instructions in which the contents of these registers are computed. Here, they are found in line n-1, where two instructions are executed in parallel which compute the results for these two registers. Starting from these two instructions, the recursive search for the next operands continues in the same way.

The analysis identified fifteen possible instruction trees which could be accelerated by the introduction of a corresponding custom instruction (CI). The execution time for those CIs has been estimated and the resulting acceleration factor r_res for the complete workload (extraction of all features) has been determined. The introduction of a custom instruction is attractive only where

    r_res = τ_CPU,without CI / τ_CPU,with CI ≥ 1 .    (3)

The two most attractive instruction trees which could be replaced by a CI are depicted in Fig. 9. Both instruction trees could be implemented as arithmetic CIs which execute in one clock cycle (using a 130 nm technology and targeting 133 MHz as for the SAA). The resulting performance gain on the complete workload amounts to 3 % (instruction tree a) and 8 % (instruction tree b), respectively.

As an alternative to this only moderate acceleration of the feature extraction process, a coarse grain optimization through acceleration of complete functional kernels is also considered. For this purpose, the computationally intensive kernels of the feature extraction process have been identified. The analysis reveals that computing the FFT of the sample windows consumes about 40 % of the complete computation time. An FFT kernel is therefore the most attractive candidate for a hardware-based acceleration of this workload.

The following estimation reveals the optimization potential of introducing a hardware-based FFT macro:

• The computation of a 512-sample FFT on the SAA core requires 46900 clock cycles.

• Implementing an exemplary Radix-4 FFT for N = 512 samples would result in a computation time of 2N-1 = 1023 clock cycles (for exemplary Radix-4 FFT implementations see e.g. [7]).
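The recursive operand search described above can be sketched in a few lines of Python. This is a minimal illustration only; the Instr record, its field names, and the linear-code representation are assumptions made for this sketch and do not reproduce the actual Nomadik profiling tool output.

```python
# Sketch of the recursive instruction-tree search: starting from one
# instruction, walk backwards through a linear code block to find the
# instructions that computed each source operand, bounded by a maximum
# tree size and a maximum search depth. Memory accesses and the block
# start terminate a branch.
from dataclasses import dataclass

@dataclass
class Instr:
    line: int        # position in the linear code block
    op: str          # mnemonic, e.g. "SUB"
    dst: str         # destination register, e.g. "R2"
    srcs: tuple      # source registers
    is_mem: bool = False   # loads/stores terminate the search

def build_tree(code, start_idx, max_size=5, max_depth=10):
    """Collect the instruction tree feeding the instruction at start_idx."""
    tree = [code[start_idx]]

    def search(operand, from_idx, depth):
        if len(tree) >= max_size or depth >= max_depth:
            return
        # scan the preceding lines for the producer of `operand`
        for i in range(from_idx - 1, max(from_idx - 1 - max_depth, -1), -1):
            instr = code[i]
            if instr.dst == operand:
                if instr.is_mem:          # memory access: stop this branch
                    return
                tree.append(instr)
                for src in instr.srcs:    # recurse on its operands
                    search(src, i, depth + 1)
                return
        # operand not computed inside this block: branch ends here

    for src in code[start_idx].srcs:
        search(src, start_idx, 0)
    return tree

# The Fig. 8 example: a SUB in line n whose operands R2 and R4 are
# both computed in line n-1 (two instructions executed in parallel).
code = [
    Instr(0, "MUL", "R2", ("R0", "R1")),   # line n-1 (slot 1)
    Instr(1, "ADD", "R4", ("R5", "R6")),   # line n-1 (slot 2)
    Instr(2, "SUB", "R7", ("R2", "R4")),   # line n
]
tree = build_tree(code, start_idx=2)
print([i.op for i in tree])   # → ['SUB', 'MUL', 'ADD']
```

The found tree (SUB fed by the parallel MUL and ADD) would then be compared against the list of previously found trees as described above.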

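The cycle counts quoted in the text allow a quick cross-check of the speedup figures. Note that the acceleration factor of about 1.67 stated in the text corresponds to the limiting case where the FFT cost vanishes entirely (1/(1 − 0.4) ≈ 1.67); applying Amdahl's law with the residual 1023 cycles per FFT yields a slightly lower value of about 1.64.

```python
# Cross-check of the FFT acceleration estimate using the figures from
# the text: 46900 cycles in software vs. 2N-1 = 1023 cycles for an
# exemplary hardware Radix-4 FFT macro, with the FFT accounting for
# about 40 % of the total feature extraction time.
fft_cycles_sw = 46900          # 512-sample FFT in software on the SAA
fft_cycles_hw = 2 * 512 - 1    # hardware Radix-4 FFT macro: 1023 cycles

kernel_speedup = fft_cycles_sw / fft_cycles_hw
fft_share = 0.40               # FFT share of the feature extraction time

# Amdahl's law: only the FFT fraction of the workload is accelerated
overall_speedup = 1.0 / ((1.0 - fft_share) + fft_share / kernel_speedup)

print(f"FFT kernel speedup: {kernel_speedup:.1f}x")   # prints 45.8x
print(f"Overall speedup:    {overall_speedup:.2f}x")  # prints 1.64x
```

Either way, the coarse grain FFT acceleration clearly dominates the 3–8 % gains of the fine grain custom instructions.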
According to this estimation, a hardware-based 512-sample FFT would result in an acceleration factor of about 45 compared to the software-based FFT on the SAA. As mentioned before, the share of the FFT in the computation time of all feature extraction functions amounts to about 40 %. Hence, the estimated acceleration factor of using a hardware-based FFT for the complete application is about 1.67. This coarse grain acceleration therefore offers a significant advantage compared to the fine grain acceleration through the introduction of custom instructions. For a final evaluation of these different acceleration approaches, the resulting hardware costs also have to be regarded (i.e. the silicon area and power consumption of the CIs and the FFT macro, respectively).

Figure 9. Exemplary instruction trees which could be replaced by CIs: a) specific MAC instruction; b) conditional ADD/SUB instruction

VII. CONCLUSION

Perceptual feature based music classification opens a new class of applications for managing the ever increasing size of music databases. For this type of application, the extraction of music features as well as the subsequent music classification have to be performed, and both are very computationally intensive tasks. This paper presents the computational needs and the classification rates achievable on different processor architectures. The inspected processors include a general purpose P IV dual core processor (targeted for desktop applications) and heterogeneous digital signal processor architectures such as the Nomadik STn8810 (featuring a smart audio accelerator, SAA) and the OMAP2420 (targeted for mobile applications). In order to increase the classification performance, different feature selection strategies (heuristic selection, full search and the Mann-Whitney test) have been applied. Attractive feature combinations have been identified which maximize the quality-effort relation. Furthermore, the potential of a hardware-based acceleration for this class of application is inspected by performing a fine grain as well as a coarse grain instruction analysis. Those instruction trees are identified which could be attractively implemented as custom instructions speeding up this class of applications. Attractive performance gains of about 8 % for fine grain and 60 % for coarse grain custom instructions were derived.

REFERENCES

[1] A. Berenzweig, B. Logan, D. Ellis, B. Whitman, "A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures", ISMIR 2003
[2] S. Brecheisen, H.-P. Kriegel, P. Kunath, A. Pryakhin, "Hierarchical Genre Classification for large Music Collections", Proceedings of the ICME 2006, pp. 1385-1388
[3] C. Costa, J. Valle, A. Koerich, "Automatic Classification of Audio Data", Proceedings of the IEEE International Conference on Systems, Man and Cybernetics 2004, pp. 562-567
[4] M. Gries, K. Keutzer, H. Meyr, G. Martin, Building ASIPs, Springer, 2005
[5] M. Haneda, P. Knijnenburg, H. Wijshoff, "Code Size Reduction by Compiler Tuning", Proceedings of the 6th International Workshop SAMOS 2006, LNCS 4017, pp. 186-195
[6] Magnatune reference music database - http://magnatune.com/ - defined in context with the MIREX contest at http://www.music-ir.org/mirex/2005/index.php/Audio_Genre_Classification
[7] U. Meyer-Baese, Digital Signal Processing with Field Programmable Gate Arrays, Springer, 2007
[8] I. Mierswa, K. Morik, "Automatic Feature Extraction for Classifying Audio Data", Machine Learning Journal, Vol. 58, Feb. 2005, pp. 127-149
[9] Music Information Retrieval Evaluation eXchange (MIREX 2008), http://www.music-ir.org/mirex/2008/index.php/Main_Page
[10] F. Pachet, D. Cazaly, "A taxonomy of musical genres", Proceedings of Content-Based Multimedia Information Access (RIAO), Paris, France, 2000
[11] G. Peeters, "A Large Set of Audio Features for Sound Description in the CUIDADO Project", IRCAM, France, 2004
[12] D. Pye, "Content Based Methods for the Management of Digital Music", IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 6, 2000, pp. 2437-2440
[13] L. Sachs, Applied Statistics: A Handbook of Techniques, Springer, 1984
[14] N. Scaringella, G. Zoia, D. Mlynek, "Automatic Genre Classification of Music Content", IEEE Signal Processing Magazine, 2/2006, pp. 133-141
[15] J. Seppänen, A. Eronen, J. Hiipakka, "Joint Beat & Tatum Tracking from Music Signals", ISMIR 2006, pp. 23-28
[16] STMicroelectronics, "Nomadik STn8810 Advanced Datasheet", 2006
[17] http://focus.ti.com/pdfs/wtbu/TI_omap2420.pdf
[18] G. Tzanetakis, P. Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, July 2002, pp. 293-302
[19] I. Vatolkin, W. Theimer, "Introduction of Methods for Automatic Classification of Music Data", Nokia Research Center Tech. Report NRC-TR-2007-012, http://research.nokia.com/files/NRC-TR-2007-012.pdf
[20] C. Xu, N. Maddage, X. Shao, "Automatic Music Classification and Summarization", IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, May 2005, pp. 441-450
[21] C. Xu, N. Maddage, X. Shao, F. Cao, Q. Tian, "Musical Genre Classification using Support Vector Machines", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), 2003, pp. 429-432
[22] Y. Zhang, J. Zhou, "A Study on Content Based Music Classification", Proceedings of the Seventh International Symposium on Signal Processing and Its Applications, 2003, pp. 113-116
