DOI 10.7603/s40601-014-0015-7 GSTF Journal on Computing (JoC) Vol.4 No.3, October 2015

Audio Music Monitoring: Analyzing Current Techniques for Song Recognition and Identification

E.D. Nishan W. Senevirathna and Lakshman Jayaratne

Received 20 Jul 2015; Accepted 13 Aug 2015

Abstract—When people are attached to or interested in something, they usually try to interact with it frequently. Music has been attached to people since the day they were born. As music repositories grow, people face many challenges, such as finding a song quickly, categorizing and organizing collections, and listening to a song again on demand. Because of this, people tend to look for electronic solutions. To index music, most researchers use content-based information retrieval mechanisms, since content-based classification needs no additional information beyond the audio features embedded in the signal itself. It is also the most suitable way to search music when the user does not know the meta-data attached to it, such as the author of the song. The most valuable application of this kind of audio recognition is copyright infringement detection. Throughout this survey we present approaches proposed by various researchers to detect and recognize music using content-based mechanisms, and we conclude by analyzing the current status of this field.

Keywords—audio fingerprint; feature extraction; wavelets; broadcast monitoring; audio classification; audio identification.

I. INTRODUCTION

Music repositories in the world are increasing exponentially, and new technologies allow new artists to enter the field easily. Once we listen to a new song, we cannot easily find it again if we do not know its meta-data, such as the author or singer. The most common method of accessing music is still through textual meta-data, but this no longer functions properly against huge music collections. In the audio music recognition field, the key considerations are the following:

• Can we find an unknown song using a small part of it, or by humming the melody?
• Can we organize and index songs without meta-data such as the singer of the song?
• Can we detect copyright infringement, for example after a song has been broadcast on a radio channel?
• Can we identify a cover song when multiple versions exist?
• Can we obtain a statistical report about the songs broadcast on a radio channel without a manual monitoring process?

These considerations motivate researchers to find proper solutions to these challenges. Many ideas have been proposed and some have been implemented; Shazam is one example. However, this is still a challenging research area, since there is no optimal solution. The problem becomes even more complex when:

• the audio signal is altered by noise;
• the audio signal is polluted by unnecessary audio objects, such as advertisements in radio broadcasting;
• multiple versions of the same song exist;
• only a small part of a song is available.

In any of the above situations the human auditory system can still recognize the music, but providing an automated electronic solution is a very challenging task: the similarity between the original music and the query may be very small, and the similar features may not be possible to model mathematically. This means researchers also need to consider perceptual features in order to provide a proper solution. Feature extraction can be considered the heart of any of these approaches, since accuracy depends on how the features are extracted.

The rest of this survey provides a broad overview and comparison of the proposed feature extraction methods, searching algorithms, and overall solution architectures.

DOI: 10.5176/2251-3043_4.3.328


II. CLASSIFICATIONS (RECOGNITION) VS. IDENTIFICATIONS

What is the difference between audio recognition (classification) and identification? In audio classification, an audio object is classified into pre-defined sets such as song, advertisement, vocals, etc., but it is not identified further. Ultimately we know that an object is a song or an advertisement, but we do not know which song it is! Audio classification is therefore less complex than identification. Most of the time the two are combined in order to get better results. For example, in an audio song recognition system, we can first extract only the songs from a collection of other audio objects using an audio classifier, and feed the output into the audio recognition system. With that kind of approach we get better results by narrowing down the search space. Several proposed audio classification approaches are discussed in the next sub-section.

A. Audio classifications

1) Overview

There is a considerable number of real-world applications for audio classification. For example, it is very helpful to be able to search sound effects automatically in a very large audio database during film post-processing, which contains sounds of explosions, windstorms, earthquakes, animals, and so on [1]. Audio content analysis and classification is also useful for audio-assisted video classification. For example, all videos of gun-fight scenes should include the sound of shooting and/or explosions, while the image content may vary significantly from one scene to another.

When classifying audio content into different sets, different classes have to be considered. Most researchers have started by classifying speech and music; however, the classes depend on the situation. For example, "music", "speech" and "others" can be considered for the parsing of news stories, whereas audio recordings can be classified into "speech", "laughter", "silence" and "non-speech" for the purpose of segmenting discussion recordings from meetings [1]. In any of the cases above, we have to extract some sort of audio features. This is the challenging part, and it is the point where past research differs. We can consider "feature extraction for audio classification" and "feature extraction for audio identification" separately, since most of the time these two cases use disjoint feature sets [7].

2) Feature extraction of audio classification

Most of the time, the output of audio classification is the input of audio identification. This reduces the search space, speeds up the process, and helps to retrieve better results. Most researchers break audio classification down into further steps. In [1], two steps were used: in the first stage, the audio signal is segmented and classified into basic types, including speech, music, several types of environmental sounds, and silence. They called this the coarse-level classification. In the second stage, further classification is conducted within each basic type. For speech, they differentiated the voices of men, women and children, as well as speech with a music background, and so on. For music, the signal is classified according to the instruments or types (for example classics, blues, jazz, rock and roll, music with singing, and plain song). Environmental sounds are classified into finer classes such as applause, bell ring, footstep, windstorm, laughter, birds' cry, and so on. They called this the fine-level classification. The overall idea is to reduce the search space step by step in order to get better results. We can also use a proper feature extraction mechanism for each fine-level class based on its basic type; for example, due to differences in the origination of the three basic types of audio (speech, music and environmental sounds), different approaches can be taken in their fine classification. Most researchers have used low-level (physical, acoustic) features such as the spectral centroid or Mel-frequency coefficients, but end users may prefer to interact with a higher semantic level [2]; for example, they may want to find a dog-barking sound rather than "environmental sounds" in general. However, low-level features can be extracted more easily with signal processing than high-level (perceptual) features.

Most researchers have used the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM) as the pattern recognition tools (a small GMM-based sketch is given after Figure 1); these are widely used, very powerful statistical tools in pattern recognition. To use them we have to extract distinctive features. Any audio feature can be grouped into two or more sets. Most researchers group all audio features into two groups: physical (or mathematical) features and perceptual features. Physical features are directly extracted from the audio wave, such as the energy of the wave, frequency, peaks, average zero crossings, and so on. These features cannot be identified by the human auditory system. Perceptual features, on the other hand, are features humans can understand, such as loudness, pitch, timbre and rhythm. Perceptual features cannot easily be modeled by mathematical functions, but they are very important, since humans use them to differentiate audio.

Sometimes, however, audio features are classified into hierarchical groups with similar characteristics [12], dividing all audio features into six main categories; refer to Figure 1.


Figure 1. High-level audio feature classification [12].
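To make the coarse-to-fine idea of [1] concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): one GMM is trained per class on per-frame feature vectors (stand-in random matrices here, where a real system would use MFCCs or similar), and an unknown clip is assigned to the class whose GMM gives the highest mean log-likelihood, first over coarse classes and then over the fine classes of the winner.

```python
# Sketch: two-stage (coarse-to-fine) audio classification with per-class GMMs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def train_gmms(class_features, n_components=4):
    """Fit one GMM per class; class_features maps name -> (n_frames, n_dims) array."""
    return {name: GaussianMixture(n_components, random_state=0).fit(f)
            for name, f in class_features.items()}

def classify(frames, models):
    """Pick the class whose GMM gives the highest mean log-likelihood."""
    return max(models, key=lambda name: models[name].score(frames))

# Stand-in feature matrices (rows = frames); real systems would use MFCCs etc.
coarse_data = {"speech": rng.normal(0, 1, (500, 13)),
               "music":  rng.normal(3, 1, (500, 13))}
fine_data = {"music": {"jazz": rng.normal(2.5, 1, (300, 13)),
                       "rock": rng.normal(3.5, 1, (300, 13))}}

coarse_models = train_gmms(coarse_data)
query = rng.normal(3.4, 1, (80, 13))           # an unknown clip's frames
coarse = classify(query, coarse_models)        # -> "music"
fine = classify(query, train_gmms(fine_data[coarse]))
print(coarse, fine)
```

The second stage only ever sees the fine classes of the winning coarse class, which is exactly the search-space reduction described above.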

However, no one can define an audio feature and its category exactly, since there is no broad consensus on the allocation of features to particular groups. The same feature may be classified into two different groups by two different researchers, depending on the viewpoints of the authors. The features defined in Figure 1 can be further classified into several groups considering the structure of each feature.

Considering the structure of temporal domain features, [12] classifies them into three sub-groups: amplitude-based, power-based, and zero-crossing-based features. Each of these features is related to one or more physical properties of the wave; refer to Figure 2.

Figure 2. The organization of features in the Temporal Domain [12].

Here, some researchers have defined the zero crossing rate (ZCR) as a physical feature. Frequency domain features are very important, and much past research considers only them; the frequency-domain feature classification of [12] is shown in Figure 3.

Figure 3. The organization of features in the Frequency Domain [12].

Some researchers have further classified the other four main feature categories as well, but those classifications are less important. Next we look at the main characteristics of the major features.

a) Temporal (raw) domain features

Most of the time, we cannot extract features without altering the native audio signal. However, several features can be extracted directly from the native signal, and these are known as temporal features. Since we do not alter the native signal, this is a very low-cost feature extraction methodology; but using these features alone we cannot uniquely identify audio music.

The zero crossing rate is a main temporal domain feature. It is a very helpful, low-cost feature that is often used in audio classification. Usually we define it as the number of zero crossings in the time domain within one second. It is a rough estimate of the dominant frequency and the spectral centroid [12]. Sometimes we obtain the ZCR after altering the audio signal a bit; in this case we extract frequency information and corresponding intensity-scaled sub-bands from the time-domain zero crossings. This gives a more stable measurement and is very helpful in noisy environments: noise is always spread around the zero axis but does not create a considerable number of peaks, so a peak-related zero crossing rate remains largely unchanged.

Amplitude-based features are another example of temporal domain features, obtained directly from the amplitude of the audio signal. They are again a good measurement, but they change even when the audio signal is altered a little by noise-like unwanted effects.

Power measurement is also a raw-domain feature that is almost the same as the amplitude-based features. The power (or energy) of a signal is the square of the amplitude represented by the waveform. Volume is a well-known power measurement feature; it is widely used in silence detection and speech/music segmentation. Both the ZCR and volume are illustrated in the sketch below.
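A small sketch of these two temporal features, assuming a plain NumPy waveform; the 8 kHz sample rate and 256-sample frame size are example values, not fixed by the definitions.

```python
# Sketch: two temporal-domain features computed directly on the raw waveform.
import numpy as np

def zero_crossing_rate(signal, sample_rate):
    """Number of zero crossings per second over the whole signal."""
    crossings = np.count_nonzero(np.signbit(signal[:-1]) != np.signbit(signal[1:]))
    return crossings * sample_rate / len(signal)

def volume(signal, frame_size=256):
    """Per-frame power: mean of squared amplitudes (a common 'volume' measure)."""
    n_frames = len(signal) // frame_size
    frames = signal[:n_frames * frame_size].reshape(n_frames, frame_size)
    return np.mean(frames ** 2, axis=1)

sr = 8000
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440 * t)          # a 440 Hz test tone
print(zero_crossing_rate(x, sr))               # ~880: two crossings per cycle
print(volume(x)[:3])                           # ~0.125 = (0.5**2)/2 per frame
```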


b) Physical features

Most audio features are obtained from the frequency domain, since almost all features live in this domain. Before extracting frequency domain features, we have to transform the base signal into another representation. The most popular methods for this are the Fourier transform and the autocorrelation; other popular methods are the cosine transform, the wavelet transform, and the constant-Q transform [12]. Frequency domain features can be categorized into two major classes, physical features and perceptual features. Physical features are defined using physical characteristics of the audio signal and have no semantic meaning. Next we discuss the most commonly used physical features, and then the perceptual features.

Auto-regression-based features: In statistics and signal processing, an autoregressive (AR) model is a representation of a type of random process; as such, it describes certain time-varying processes in nature, economics, etc. [18]. It is a widely used standard technique for speech/music discrimination, and can be used to extract basic parameters of a speech signal, such as formant frequencies and the vocal tract transfer function [18]. Sometimes this feature group is divided further into two groups, linear predictive coding (LPC) and line spectral frequencies (LSF), but we will not discuss these sub-groups in detail here.

Short-Time Fourier Transform (STFT)-based features: these are another widely used family of audio features based on the audio spectrum. The STFT can be used to obtain characteristics of both the frequency component and the phase component. There are several features under the STFT, such as Shannon entropy, Renyi entropy, spectral centroid, spectral bandwidth, spectral flatness measure, spectral crest factor, and Mel-frequency cepstral coefficients [15].

Short-time energy function: the energy of an audio signal is measured by its amplitude; representing the amplitude variation over time gives the energy function of the signal. For speech signals, it is a basis for distinguishing voiced speech components from unvoiced speech components, as the energy function values for unvoiced components are significantly smaller than those of the voiced components [1].

Short-time average zero-crossing rate (ZCR): this feature is another measurement used to separate voiced from unvoiced speech components. Usually voiced components have a much smaller ZCR than unvoiced components [1].

Short-time fundamental frequency (FuF): using this feature we can find harmonic properties. Most musical instrument sounds are harmonic, and some sounds can be a mixture of harmonic and non-harmonic components. This feature can also be used to classify audio objects [1].

Spectral Flatness Measure (SFM): an estimation of the tone-like or noise-like quality of a band in the spectrum [1], widely used for audio classification.

There are some other widely used physical features, such as Mel-Frequency Cepstrum Coefficients (MFCC). Papaodysseus et al. (2001) presented the "band representative vectors", an ordered list of indexes of bands with prominent tones (i.e. with peaks of significant amplitude). The energy of each band is used by Kimura et al. (2001). Normalized spectral sub-band centroids were proposed by Seo et al. (2005). Haitsma et al. use the energies of 33 Bark-scaled bands to obtain their "hash string", which is the sign of the energy band differences (both along the time and the frequency axis), and so on.

Most of the time, silent audio frames are identified early and are not passed on for further processing. There are several approaches to identify or define a silent frame; some researchers have used the ZCR property. In [4], silent frames are defined as follows (a code sketch of this rule is given at the end of this sub-section). Before feature extraction, the audio signal (8-bit ISDN μ-law encoding) is pre-emphasized with parameter 0.96 and then divided into frames. Given the sampling frequency of 8000 Hz, the frames are 256 samples (32 ms) each, with 25% (64 samples, or 8 ms) overlap between adjacent frames. A frame is Hamming-windowed by $w_i = 0.54 - 0.46\cos(2\pi i/256)$, and it is marked as a silent frame if

$$\sum_{i=1}^{256} (w_i s_i)^2 < 400^2,$$

where $s_i$ is the pre-emphasized signal magnitude at $i$ and $400^2$ is an empirical threshold. Even though most researchers have used physical features, to get better results we also have to consider perceptual features, since those are the features recognized by the human auditory system.

Spectral peaks: this is a very important feature, since it is a noise-robust representation of the audio wave. Noise is spread around the zero axis and therefore does not affect the peaks. This feature is mainly used to create a unique fingerprint from a small segment of an audio clip captured by a mobile phone or some other device. The strength of the technique is that it relies solely on the salient frequencies (peaks) and rejects all other spectral content [12].
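The silent-frame rule quoted above from [4] translates almost directly into code. A minimal sketch, assuming the signal has already been µ-law decoded to sample magnitudes:

```python
# Sketch of the silent-frame detector described in [4]: 8 kHz signal,
# pre-emphasis 0.96, 256-sample frames with 64-sample (25%) overlap,
# Hamming window w_i = 0.54 - 0.46*cos(2*pi*i/256), threshold 400^2.
import numpy as np

FRAME, HOP = 256, 192                     # hop = 256 - 64 samples of overlap
window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(FRAME) / FRAME)

def silent_frames(signal, alpha=0.96, threshold=400.0 ** 2):
    s = np.append(signal[0], signal[1:] - alpha * signal[:-1])   # pre-emphasis
    flags = []
    for start in range(0, len(s) - FRAME + 1, HOP):
        frame = s[start:start + FRAME] * window
        flags.append(np.sum(frame ** 2) < threshold)             # sum (w_i s_i)^2
    return np.array(flags)

x = np.random.randn(8000) * 100            # one second of loud noise at 8 kHz
print(silent_frames(x).mean())             # fraction of frames marked silent
```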


c) Perceptual features

How can human beings recognize audio? Through audio features to which the human auditory system is sensitive; such features are known as perceptual features. Usually these features cannot be extracted easily, since they cannot easily be modeled mathematically. However, there have been a number of research attempts in this area, whose final goal is to model perceptual features mathematically. In [5], statistical values (including means, variances, and autocorrelations) of several time- and frequency-domain measurements were used to represent perceptual features such as loudness, brightness, bandwidth, and pitch; that method is only suitable for sounds with a single timbre. Apart from that, the following are the most commonly used perceptual features. According to past research, perceptual features can be grouped into six groups; refer to Figure 4.

Figure 4. The organization of perceptual features [12].

Brightness: the word "brightness" is more familiar in the context of illumination, where we measure the brightness of illuminated surfaces such as LCD monitors: if the illumination is very high we speak of high brightness, otherwise of low brightness. Likewise, we can define audio brightness using frequency instead of illumination: a sound becomes brighter as the high-frequency content becomes more dominant and the low-frequency content becomes less dominant [12].

Most of the time we measure brightness as the spectral centroid (SC). It indicates where the "center of mass" of the spectrum is. Perceptually, it has a robust connection with the impression of the "brightness" of a sound. It is calculated as the weighted mean of the frequencies present in the signal, determined using a Fourier transform, with their magnitudes as the weights [19].

Tonality: this is an important audio feature for distinguishing noise-like sounds from other sounds. Tonal sounds typically have line spectra, whereas noise-like sounds have continuous spectra. Usually tonality is measured by bandwidth and/or flatness.

Bandwidth is usually defined as the magnitude-weighted average of the differences between the spectral components and the spectral centroid. As we already know, tonal sounds typically have line spectra, so the component variation around the SC is low; this means tonal sounds have a lower bandwidth than noise-like sounds. There are several other feature classes that measure the tonality of an audio signal, such as spectral dispersion, spectral roll-off point, spectral crest factor, sub-band spectral flux (SSF), and entropy. More details on each feature can be found in [12].

Loudness: we use this feature in our day-to-day life; it is the characteristic of a sound that is primarily a psychological correlate of physical strength (amplitude). More formally, it is defined as "that attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud" [20], or, in other words, "that attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from soft to loud" [12]. This is a widely used perceptual feature, and it can be extracted more easily than other perceptual features.

Pitch: again, this is an audio feature very close to the human auditory system, like loudness. Pitch is a basic dimension of audio and is defined together with loudness, duration, and timbre. In past research this feature was widely used for genre classification and audio identification. Pitches are compared as "higher" and "lower" in the sense associated with musical melodies, which requires sound whose frequency is clear and stable enough to distinguish from noise [21].

Chroma: this feature is an interesting and powerful representation of audio. Any tone belongs to one of the musical octaves; to classify a tone within the octave we can use tone height, i.e. a low tone height may map to class "C" and a very high one to class "B". But there is another measurement of a tone's class, which is called "chroma": the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave. Since, in music, notes exactly one octave apart are perceived as particularly similar, knowing the distribution of chroma, even without the absolute frequency (i.e. the original octave), can give useful musical information about the audio [14]. Several of these measures are illustrated in the sketch below.
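Several of these perceptual correlates reduce to simple spectral statistics. A small sketch of the spectral centroid (brightness) and magnitude-weighted bandwidth as defined above, plus a simplified fold of a single frequency onto the 12 chroma classes (a full chroma vector would project the whole spectrum, not one frequency):

```python
# Sketch: spectral centroid ("brightness"), bandwidth, and a chroma class.
import numpy as np

def centroid_and_bandwidth(frame, sample_rate):
    mags = np.abs(np.fft.rfft(frame))                  # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), 1 / sample_rate)
    sc = np.sum(freqs * mags) / np.sum(mags)           # weighted mean frequency
    bw = np.sum(np.abs(freqs - sc) * mags) / np.sum(mags)
    return sc, bw

def chroma_class(freq, ref=440.0):
    """Fold a frequency onto one of the 12 semitone classes (A = 0)."""
    return int(round(12 * np.log2(freq / ref))) % 12

sr = 8000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 880 * t)                     # a "bright", tonal frame
sc, bw = centroid_and_bandwidth(tone, sr)
print(sc, bw)             # centroid near 880 Hz, small bandwidth (line spectrum)
print(chroma_class(880))  # 0: the same pitch class as A440, one octave up
```

Note how the tonal test frame yields a small bandwidth, in line with the tonality discussion above: its energy is concentrated near a single spectral line.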


Harmonicity: a property that distinguishes periodic signals (harmonic sounds) from non-periodic signals (inharmonic and noise-like sounds). Harmonics are frequencies at integer multiples of the fundamental frequency [12], i.e. if the fundamental frequency is f, the harmonics have frequencies 2f, 3f, 4f, etc. Harmonic frequencies are equally spaced by the width of the fundamental frequency and can be found by repeatedly adding that frequency. As a practical example, the nodes of a vibrating string are harmonics; refer to Figure 5.

Figure 5. Nodes of a vibrating string are harmonics [11].

According to the past literature, there are several other, different audio feature classifications as well. For completeness, we next look briefly at other definitions.

Acoustical speech features reported in the literature can be grouped as shown in Figure 6 [13]. Existing systems use a number of integrated continuous, qualitative, and spectral features, as well as Teager energy operator (TEO)-based features. Most of the time spectral features are used; however, depending on the targeted research or system, features are extracted from one or many categories. For example, the continuous feature category is the most suitable one for emotion detection [13]. Most of the time, before perceptual feature extraction we have to do some preprocessing in order to extract perceptual features more accurately. Even though there are several classifications, the basic features remain unchanged (the class they fall into may vary).

3) Applications of audio classification

As already mentioned, the major application of audio classification is audio identification: audio identification systems use the output of audio classification systems as their input, or the pre-processing part of an audio identification system is done by an audio classification system. This reduces the search space, and through this approach we can provide efficient and accurate audio identification systems. The following are some other applications of audio classification.

1. Genre classification: a music genre is a conventional category that identifies pieces of music. There are several well-known categories, such as Pop, Rock, Jazz, and Hip hop. The audio classification methodologies discussed here are heavily used in genre classification.

2. Automatic emotion recognition: it is well known that human speech contains not only the linguistic content but also the emotion of the speaker. The emotion may play a key role in many applications: in entertainment electronics to gather emotional user behavior, in automatic speech recognition to resolve "how it was said" rather than just "what was said", and in text-to-speech systems to synthesize emotionally more natural speech [13]. Audio classification approaches are widely used in such systems.

3. Indexing video content: most researchers now use the audio channel of video files to index or classify video objects. For example, if a video object contains frequent gun shots or exploding sounds, it can be classified as a war scene.

Figure 6. Examples of acoustical features reported in the literature, grouped into four categories [13].


Figure 7. General flow of the audio identification process.

B. Audio Identification

1) Overview

Audio identification is a very challenging task compared to audio classification, since we have to specifically match an unknown audio object against thousands of pre-installed audio objects, whereas in audio classification we classify any audio object into a small number of pre-defined classes. As discussed earlier, most researchers have joined the two together in order to get better results: first we classify the unknown audio object and identify its class, then we match it among the other pre-installed objects in the same class. By doing this we speed up the process by omitting unrelated classes of audio objects, and also obtain better results.

In this section we focus only on the identification part. According to the past literature, we can give a high-level overview of the overall process used by most researchers; see Figure 7. Here the feature extraction part is exactly the same as the feature extraction for audio classification, which we have already discussed. The key things to discuss are how to create the audio archives and the searching mechanisms; these we discuss in detail later. Apart from that, almost all researchers frame the audio object into sets of overlapping frames, and the reason for doing so is important: usually we have to identify an audio object such as a song when only a small part of it is presented. That small part can come from anywhere in the original track, and we do not know the offset of that part. To address this problem we can use framing; see Figure 8.

Figure 8. Dividing an audio object into a set of overlapping frames. This approach makes it possible to identify any small part of an unknown audio object, extracted from anywhere in the original source.

Referring to Figure 8, we can see sets of frames and overlapping areas. The parameters for the sampling frequency and frame size differ from one study to another, but most have used the following values. Given a sampling frequency of 8000 Hz, the frames are 256 samples (32 ms) each, with 25% (64 samples, or 8 ms) overlap between adjacent frames. A frame is Hamming-windowed by $w_i = 0.54 - 0.46\cos(2\pi i/256)$ [9].
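A sketch of the framing scheme in Figure 8, using the parameter values quoted above (8 kHz, 256-sample frames, 64-sample overlap). Because frame starts are only hop = 192 samples (24 ms) apart, any excerpt taken from anywhere in the track lines up closely with some stored frame:

```python
# Sketch: dividing a signal into overlapping, Hamming-windowed frames.
import numpy as np

def overlapping_frames(signal, size=256, overlap=64):
    hop = size - overlap                       # 192 samples at the cited settings
    window = np.hamming(size)
    starts = range(0, len(signal) - size + 1, hop)
    return np.stack([signal[s:s + size] * window for s in starts])

x = np.random.randn(8000)                      # one second at 8 kHz
frames = overlapping_frames(x)
print(frames.shape)                            # (41, 256): 41 frames, 24 ms apart
```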


2) Audio identification methodologies

This is the most important part of audio identification. Here we discuss the proposed mechanisms used to create audio archives. In other words, we now have various audio features that should be stored in a database: what are the proposed approaches for converting a set of audio features into an audio fingerprint?

a) Audio fingerprint

Audio fingerprinting is best known for its ability to link unlabeled audio to the corresponding meta-data (e.g. artist and song name), regardless of the audio format. It is a very powerful and widely used method, and its main advantage is precisely this independence from the audio format. What is meant by a fingerprint? It is a unique representation of an object: just as a human fingerprint can be used to identify a person, an audio fingerprint can be used to identify an audio object uniquely. An audio file is just a binary file, nothing more; if we used this digital file itself as the fingerprint, we would face several problems. Usually the unknown audio object is only a part of the original, or it may be partially corrupted; in such cases we cannot get a unique fingerprint. Therefore, direct comparison of the digitized waveform is neither efficient nor effective. To alleviate these issues, we usually split the audio object into sets of overlapping frames, as discussed earlier, and generate a set of fingerprints, one per frame. But again, we cannot use the digitized waveform of a frame as the fingerprint. First we extract one or more desired audio features, as discussed in Section II; then we join those feature values in some specific manner, which changes from one study to another. At the end of this process we obtain some sort of string representation of several features. A more efficient implementation of this approach could use a hash method, such as MD5 (Message Digest 5) or CRC (Cyclic Redundancy Check), to obtain a compact representation of the combined features [2]. The downside of this representation is that a hash value is fragile: changing even a single bit is enough to produce a completely different hash value. Therefore we cannot use the fingerprint method as-is for a robust implementation. As a whole, the fingerprinting model can be represented as in Figure 9.

Figure 9. The flow of the audio fingerprint model.

An ideal fingerprint should have the following properties:
1. It should accurately identify an audio object.
2. It should be robust against distortion or interference in the transmission channel.
3. It should be possible to generate a powerful fingerprint using only a few seconds of the audio object.
4. It should be computationally efficient.
5. The size of the fingerprints should be small.
6. The complexity of fingerprint extraction should be low.

This method is less vulnerable to attack, since changing the fingerprint means altering the quality of the sound. Usually the fingerprint database is very large, since we have to extract many fingerprints from each audio object; therefore we cannot use a traditional brute-force-like searching mechanism. Instead, many researchers have used indexed look-up tables, which return results very fast, as sketched below.
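A minimal sketch of this fingerprint-and-look-up idea, in the spirit of the Haitsma-style sign-of-energy-differences scheme mentioned earlier (not a faithful reimplementation; the band split below is a crude stand-in for Bark-scaled bands): each frame's band energies are reduced to a small bit string, and the bit strings are kept in an indexed (hash) table instead of being searched by brute force.

```python
# Sketch: frame -> band energies -> sign-of-differences bit string -> hash table.
import numpy as np

def sub_fingerprint(frame, n_bands=33):
    """Derive bits from the sign of energy differences between adjacent bands."""
    mags = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(mags, n_bands)          # crude stand-in for Bark bands
    energy = np.array([b.sum() for b in bands])
    bits = (np.diff(energy) > 0).astype(int)       # 32 bits from 33 bands
    return int(bits.dot(1 << np.arange(bits.size)))  # pack bits into one integer

def build_index(tracks):
    """Map sub-fingerprint -> list of (track_id, frame_no) for fast look-up."""
    index = {}
    for tid, frames in tracks.items():
        for i, frame in enumerate(frames):
            index.setdefault(sub_fingerprint(frame), []).append((tid, i))
    return index

# Querying: compute the same bits for each query frame and probe the table.
# (database_frames is a hypothetical dict: track id -> iterable of frames.)
# index = build_index(database_frames)
# hits = index.get(sub_fingerprint(query_frame), [])
```

The dictionary look-up replaces the fragile exact MD5/CRC match over whole feature strings with per-frame probes, which is why indexed tables return results so much faster than linear scanning.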
b) Audio watermarking

An audio watermark is a message embedded into the audio object when it is recorded. According to [16], watermarking is the addition of some form of identifying mark that can be used to prove the authenticity or ownership of a candidate item. Embedding a watermark does not alter the perception of the song, and identification of a song title is possible by extracting the message embedded in the audio. This is not a content-based audio identification mechanism, since we do not care about the audio properties themselves; because of this, the audio watermarking identification mechanism is sometimes known as a "blind detection" method. Dual Tone Multi-Frequency (DTMF) signaling, used in touch-tone and mobile telephony, is the origin of this watermarking approach: DTMF defines two tones, built from pairs of frequencies, for bits 1 and 0 [16].


DTMF 1 tone: 697 Hz and 1209 Hz combined
DTMF 0 tone: 941 Hz and 1336 Hz combined

To reduce the amount of data to be watermarked, we can use a series of bit representations of the message's ASCII codes. Every character has a unique ASCII code, so we can represent any character as a pattern of pure sine waves using the combined DTMF frequencies for 1 and 0. This approach is represented in Figure 10.

Figure 10. The flow of the initial watermarking system.

However, an audio watermark can be tampered with, since it is not an audio property itself. Moreover, there is no option for already-released legacy audio objects such as songs. Another issue is that with this method we cannot distinguish two songs or audio objects with the same perception where one lacks the watermark.
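A sketch of the DTMF-style bit encoding just described, using the two tone pairs quoted above: each character is expanded into its 8-bit ASCII code, and each bit becomes a short segment containing the corresponding pair of sine waves. (The 8 kHz rate and 50 ms bit duration are example values; mixing the tones inaudibly into a host song is a separate problem, not shown.)

```python
# Sketch: encode a character's ASCII bits as DTMF tone pairs.
# Bit 1 -> 697 Hz + 1209 Hz, bit 0 -> 941 Hz + 1336 Hz, as cited from [16].
import numpy as np

def dtmf_bits(char, sample_rate=8000, bit_dur=0.05):
    t = np.arange(int(sample_rate * bit_dur)) / sample_rate
    pairs = {"1": (697, 1209), "0": (941, 1336)}
    segments = []
    for bit in format(ord(char), "08b"):           # e.g. 'A' -> "01000001"
        f1, f2 = pairs[bit]
        segments.append(np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t))
    return np.concatenate(segments)

mark = dtmf_bits("A")                              # 8 bits * 50 ms = 0.4 s of tones
print(mark.shape)                                  # (3200,)
```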
c) Using neural networks / SVM

The Support Vector Machine (SVM) is also a widely used approach; SVM is actually used more for audio classification than identification, but for completeness we discuss it here. It is a statistical learning algorithm for classifiers, used to solve many practical problems such as face detection, three-dimensional (3-D) object recognition, and so on.

Again, features are extracted using the methods discussed earlier, and those features are used to train the classifier. Most of the time, perceptual features such as total power, sub-band powers, brightness, bandwidth and pitch, together with Mel-frequency cepstral coefficients (MFCCs), are used. The means and standard deviations of the feature trajectories over all frames are then computed, and these statistics are taken as the feature set for the audio sound. After that we create the training vectors and train the SVM classifier. We do not discuss SVMs in detail here; more information can be found in [4][8][9][13]. There are some other widely used neural-network-based methods, such as Nearest Neighbor (NN), Nearest Feature Line (NFL) [5], and so on.
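The statistics-plus-SVM recipe above can be sketched in a few lines, assuming per-clip feature trajectories (e.g. MFCCs over frames) are already computed; the per-clip vector is the concatenation of the means and standard deviations of each trajectory, as described, and the stand-in random matrices below take the place of real extracted features.

```python
# Sketch: clip-level statistics of frame features -> SVM classifier.
import numpy as np
from sklearn.svm import SVC

def clip_vector(trajectories):
    """Concatenate means and std deviations of per-frame feature trajectories."""
    return np.concatenate([trajectories.mean(axis=0), trajectories.std(axis=0)])

rng = np.random.default_rng(1)
# Stand-in data: 40 clips of 100 frames x 13 features each, two classes.
clips = [rng.normal(c, 1, (100, 13)) for c in (0, 2) for _ in range(20)]
X = np.stack([clip_vector(c) for c in clips])      # (40, 26) training matrix
y = np.array([0] * 20 + [1] * 20)

model = SVC(kernel="rbf").fit(X, y)
print(model.predict(clip_vector(rng.normal(2, 1, (100, 13)))[None]))  # -> [1]
```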

d) Auditory Zernike moment

All of the methods discussed so far share a major drawback: they work on raw (uncompressed) audio formats such as WAV. But nowadays compressed audio formats such as MP3 have grown into the dominant way to store music on personal computers and to transmit it over the Internet [17]. It would therefore be very useful to recognize compressed audio directly, without decompressing it; this would definitely be more efficient and more accurate. Very few attempts work in the compressed audio domain, and this method is one of them. Like most identification methods, this approach also creates a fingerprint at the end, but the way it is constructed is considerably different from the others.

The "Zernike moment" feature is used in image processing techniques such as image recognition, image watermarking, human face recognition and image analysis, due to its prominent properties of strong robustness and rotation, scale, and translation (RST) invariance. Because of this, researchers have been motivated to use Zernike moments for audio information retrieval as well.

According to past research, there are four kinds of compressed-domain features: modified discrete cosine transform (MDCT) spectral coefficients, MFCC, MPEG-7 features, and chroma vectors from the compressed MP3 bit stream. The Zernike moment is defined using a rather complex set of polynomials; we will not discuss it in great detail (more information can be found in [17]), but for completeness we show how to obtain the Zernike moments of an image. The following is extracted from [17].

Zernike moments are defined using a set of polynomials that form a complete orthogonal basis on the unit disk $x^2 + y^2 \le 1$. These polynomials have the form

$$V_{n,m}(\rho, \theta) = R_{n,m}(\rho)\, e^{jm\theta},$$


where $n$ is a non-negative integer and $m$ is a non-zero integer subject to the constraint that $(n - |m|)$ is non-negative and even; $\rho$ is the length of the vector from the origin to the pixel $(x, y)$, and $\theta$ is the angle between that vector and the x-axis in the counter-clockwise direction. $R_{n,m}(\rho)$ is the Zernike radial polynomial in $(\rho, \theta)$ polar coordinates, defined as

$$R_{n,m}(\rho) = \sum_{s=0}^{(n-|m|)/2} \frac{(-1)^s (n-s)!}{s!\,\left(\frac{n+|m|}{2}-s\right)!\,\left(\frac{n-|m|}{2}-s\right)!}\,\rho^{n-2s}.$$

Note that $R_{n,-m}(\rho) = R_{n,m}(\rho)$, so $V_{n,-m}(\rho, \theta) = V^*_{n,m}(\rho, \theta)$.

Zernike moments are the projection of a function onto these orthogonal basis functions. The Zernike moment of order $n$ with repetition $m$ for a continuous two-dimensional (2D) function $f(x, y)$ that vanishes outside the unit disk is defined as

$$Z_{n,m} = \frac{n+1}{\pi} \iint_{x^2+y^2 \le 1} f(x, y)\, V^*_{n,m}(\rho, \theta)\, dx\, dy.$$

For a 2D signal such as a digital image, the integrals are replaced by summations:

$$Z_{n,m} = \frac{n+1}{\pi} \sum_{x} \sum_{y} f(x, y)\, V^*_{n,m}(\rho, \theta), \qquad x^2 + y^2 \le 1.$$
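The definitions above translate directly into a short sketch: a straightforward, unoptimized evaluation of $R_{n,m}$ and of the discrete moment over a small image. This is only an illustration of the formulas, not the compressed-domain pipeline of [17]; the 64x64 random array stands in for a real 2D representation.

```python
# Sketch: Zernike radial polynomial and the discrete Zernike moment Z_{n,m}.
import numpy as np
from math import factorial

def radial(n, m, rho):
    m = abs(m)
    total = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s) /
             (factorial(s) * factorial((n + m) // 2 - s)
              * factorial((n - m) // 2 - s)))
        total += c * rho ** (n - 2 * s)
    return total

def zernike_moment(image, n, m):
    """(n+1)/pi * sum over the unit disk of f(x, y) * conj(V_{n,m})."""
    size = image.shape[0]
    coords = (np.arange(size) - (size - 1) / 2) / ((size - 1) / 2)
    x, y = np.meshgrid(coords, coords)
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    mask = rho <= 1.0                                    # f vanishes outside disk
    v_conj = radial(n, m, rho) * np.exp(-1j * m * theta)
    return (n + 1) / np.pi * np.sum(image * v_conj * mask)

img = np.random.default_rng(2).random((64, 64))          # stand-in 2-D "image"
print(zernike_moment(img, n=4, m=2))
```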

"Zernike moment" features can only be extracted from a 2D space, but audio data is a time-variant 1D signal. Therefore we have to map the 1D audio data into a 2D space somehow, and there are several ways to do this in past research; for example, we can construct a series of consecutive granule-MDCT 2D images [17].

3) Applications of Audio Identification

Audio identification is a very important real-world problem, so we can find many applications in this area. In this section we discuss several important real-world applications.

a) Copyright infringement detection

Music copyright enforcement is a major problem when dealing with digital audio files, which can easily be copied and distributed. Audio watermarking, discussed earlier, is one solution to that problem: before releasing a song, we can embed watermarks that do not affect the audio quality, and afterwards we can identify the audio object by extracting the watermark. This works fine for new releases, but there is no option for already-released audio.

Another approach to the copyright-protection problem is the audio fingerprint. In this method, as discussed earlier, we construct a fingerprint, uniquely associated with the audio signal, by analyzing that signal. We can then identify a song by searching for its fingerprint in a previously constructed database. This kind of solution can be used to monitor radio broadcasting, audio file sharing systems, and so on.

b) Searching audio objects effectively

Sometimes we need to download or find a song but do not know the lyrics exactly. In this case we can query an audio database by humming the melody or providing a part of the song. As an example, an automated system could organize a user's music collection by properly naming each file according to artist and song title. Another application could attempt to retrieve the artist and title of a song given a short clip recorded from a radio broadcast, or perhaps even hummed into a microphone [10]. In such cases we can use content-based audio identification methods to query the database. Audible Magic and Shazam are examples of such systems that already use audio fingerprinting [6].

Sometimes we may want to search, index and organize the songs on a personal computer. We may have the same song under different names and in different locations; content-based audio identification methodologies can be used for these tasks as well.

c) Analyzing audio objects for video indexing

Usually we identify videos using image processing techniques, but this is inefficient and not very accurate. Instead, we can analyze the audio attached to the video file to index it [1]. This is well suited for commercial advertisement tracking systems.

d) Speech recognition

Speech recognition (SR) is the translation of spoken words into text. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or simply "speech to text" (STT). Additionally, research addresses the recognition of the spoken language, the speaker, and the extraction of emotions [13]. This is another major application of audio identification.

III. OPEN ISSUES

As we discussed earlier, this is still a growing research area.


The reason is that several major challenges and issues have not been addressed properly so far. In this section we discuss those open issues.

Most of the time, we cannot perform major audio analysis tasks in a controlled environment; this is the main issue faced by researchers. There are thousands of interruptions and interferences, such as unwanted noise effects, audio alterations, playback speed, tempo- and beat-like variations of audio characteristics, variations of the signal source, and so on. We can divide these issues into two major groups: psychoacoustic and technical.

Psychoacoustics focuses on the mechanisms that process an audio signal in such a way that sensations are caused in our brain. Even though the human auditory system has been extensively investigated in recent years, we still do not fully understand all aspects of auditory perception [13]. Therefore, modeling psychological features in order to simulate human perception is not a trivial task, but it is really important; this is one of the major overheads in this research area.

Normally, humans recognize unknown audio using their historical knowledge. This is very important for identifying a new version or cover copy of an original audio object, but we cannot easily model this historical knowledge mathematically. One example is audio object masking. Masking is "the process by which the threshold of hearing for one sound is raised by the presence of another (masking) sound". The human auditory system has a special capability to distinguish between simultaneous masking and temporal masking using the frequency selectivity of the human ear. This has been modeled mathematically using the loudness of audio objects, but it does not provide 100% accuracy compared to the native auditory system.

Beyond that, there are several technical difficulties as well. An audio signal is usually exposed to distortions, such as interfering noise and channel distortions, so modeling a technically robust solution is a very challenging task. Noise, sound pressure level, tempo variations, the concurrent presence of several audio objects, and so on all badly affect any audio recognition algorithm. These are the major issues and challenges in this area, and they must be considered whenever a new feature is introduced.

IV. CONCLUSIONS AND FUTURE DIRECTIONS

Throughout this review, we discussed digital audio classification and identification techniques developed by various researchers. In conclusion, we can summarize our findings as follows.

This is still a young research area, so there is a lot of room for improvement. Finding, searching and indexing audio files using the attached meta-data no longer functions properly: audio repositories are increasing rapidly and new songs are introduced frequently, so we have to move to content-based audio identification methodologies. According to past work, most researchers have used the audio fingerprinting concept to do that. The most important part of any of these methods is feature extraction, since it is the heart of the system. We still do not have rich, robust features against every kind of signal distortion and alteration, and most of the solutions cannot scale to fit current audio repositories. Therefore, we now have to think about robust and scalable solutions.

Cover song identification, i.e. dealing with several versions of the same song, is a very important research area within audio identification, and it is even more important when we think about the intellectual property of artists. There are several attempts in this area, such as [3], but they should be improved in the future.

ACKNOWLEDGEMENTS

I offer my sincerest gratitude to my supervisor, Dr. K.L. Jayaratne, who has supported me throughout my research. I would also like to show my gratitude to Mr. Brian for supporting me. Finally, I thank everybody who contributed to the successful realization of my project.

REFERENCES

[1] T. Zhang and C.-C. J. Kuo, "Hierarchical system for content-based audio classification and retrieval," in Photonics East (ISAM, VVDC, IEMB), pp. 398-409, International Society for Optics and Photonics, 1998.
[2] P. Cano, "Content-Based Audio Search from Fingerprinting to Semantic Audio Retrieval," Ph.D. dissertation, UPF, 2007.
[3] J. Serrà, E. Gómez, and P. Herrera, "Audio cover song identification and similarity: background, approaches, evaluation, and beyond," in Advances in Music Information Retrieval, vol. 274, Z. Ras and A. A. Wieczorkowska, Eds. Springer-Verlag Berlin / Heidelberg, 2010, pp. 307-332.
[4] S. Z. Li and G.-dong Guo, "Content-based audio classification and retrieval using SVM learning," Invited Talk, PCM, 2000.
[5] S. Z. Li, "Content-based audio classification and retrieval using the nearest feature line method," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 5, pp. 619-625, 2000.
[6] T. Huang, Y. Tian, W. Gao, and J. Lu, "Mediaprinting: Identifying multimedia content for digital rights management," 2010.
[7] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 2, no. 1, pp. 1-19, 2006.


[8] J. T. Foote, "Content-based retrieval of music and audio," in Multimedia Storage and Archiving Systems II, Proc. of SPIE, 1997, pp. 138-147.
[9] G. Guo and S. Z. Li, "Content-based audio classification and retrieval by support vector machines," IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 209-215, 2003.
[10] M. Riley, E. Heinen, and J. Ghosh, "A text retrieval approach to content-based audio retrieval," in Int. Symp. on Music Information Retrieval (ISMIR), 2008, pp. 295-300.
[11] Wikipedia, "Harmonic --- Wikipedia, The Free Encyclopedia." http://en.wikipedia.org/w/index.php?title=Harmonic&oldid=657491925, 2015. [Online; accessed 6-May-2015].
[12] D. Mitrović, M. Zeppelzauer, and C. Breiteneder, "Features for content-based audio retrieval," Advances in Computers, vol. 78, pp. 71-150, 2010.
[13] M. C. Sezgin, B. Gunsel, and G. K. Kurt, "Perceptual audio features for emotion detection," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2012, no. 1, pp. 1-21, 2012.
[14] M. A. Bartsch and G. H. Wakefield, "Audio thumbnailing of popular music using chroma-based representations," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 96-104, Feb. 2005.
[15] A. Ramalingam and S. Krishnan, "Gaussian mixture modeling using short time fourier transform features for audio fingerprinting," in Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, 2005, pp. 1146-1149.

[16] R. Healy and J. Timoney, "Digital Audio Watermarking with Semi-Blind Detection for In-Car and Domestic Music Content Identification," in Audio Engineering Society Conference: 36th International Conference: Automotive Audio, 2009.
[17] W. Li, C. Xiao, and Y. Liu, "Low-order auditory Zernike moment: a novel approach for robust music identification in the compressed domain," EURASIP Journal on Advances in Signal Processing, vol. 2013, no. 1, 2013.
[18] D. Mitrović, M. Zeppelzauer, and C. Breiteneder, "Chapter 3 - Features for Content-Based Audio Retrieval," in Advances in Computers: Improving the Web, vol. 78, Elsevier, 2010, pp. 71-150.
[19] B. Gajic and K. K. Paliwal, "Robust feature extraction using subband spectral centroid histograms," in Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP '01), 2001, vol. 1, pp. 85-88.
[20] B. R. Glasberg and B. C. J. Moore, "A Model of Loudness Applicable to Time-Varying Sounds," J. Audio Eng. Soc., vol. 50, no. 5, pp. 331-342, 2002.
[21] K. Kondo, "Method of changing tempo and pitch of audio by digital signal processing." Google Patents, 1999.

AUTHORS' PROFILE

Nishan Senevirathna (B.Sc. in Computer Science (SL)) obtained his B.Sc. (Hons) in Computer Science from the University of Colombo School of Computing (UCSC), Sri Lanka, in 2013. He is currently working as a Senior Software Engineer at CodeGen International (Pvt) Ltd and following an M.Phil. degree program at UCSC. His research interests include Multimedia Computing, Image Processing, High Performance Computing, and Human-Computer Interaction.

Dr. Lakshman Jayaratne (Ph.D. (UWS), B.Sc. (SL), MACS, MCS (SL), and MIEEE) obtained his B.Sc. (Hons) in Computer Science from the University of Colombo (UCSC), Sri Lanka, in 1992, and his Ph.D. in Information Technology in 2006 from the University of Western Sydney, Sydney, Australia. He is working as a Senior Lecturer at the UCSC, University of Colombo. He was the President of the IEEE Sri Lanka Chapter in 2012. He has wide experience in IT consultancies for public and private sector organizations in Sri Lanka, and worked as a Research Advisor to the Ministry of Defence, Sri Lanka. He was awarded in Recognition of Excellence in Research in 2013 at the Postgraduate Convocation of the University of Colombo, Sri Lanka. His research interests include Multimedia Information Management, Multimedia Databases, Intelligent Human-Web Interaction, Web Information Management and Retrieval, and Web Search Optimization, as well as Audio Music Monitoring for Radio Broadcasting and a Computational Approach to Train on Music Notations for the Visually Impaired in Sri Lanka.

©The Author(s) 2015. This article is published with open access by the GSTF. It is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
