DOI 10.7603/s40601-014-0015-7
GSTF Journal on Computing (JoC) Vol.4 No.3, October 2015

Audio Music Monitoring: Analyzing Current Techniques for Song Recognition and Identification

E.D. Nishan W. Senevirathna and Lakshman Jayaratne
Received 20 Jul 2015; Accepted 13 Aug 2015

Abstract—When people are attached to or interested in something, they usually try to interact with it frequently. Music has been attached to people since the day they were born. As music repositories grow, people face many challenges, such as finding a song quickly, categorizing and organizing collections, and listening to a song again whenever they want. Because of this, people look for electronic solutions. To index music, most researchers use content-based information retrieval mechanisms, since content-based classification needs no information other than the audio features embedded in the signal. It is also the most suitable way to search for music when the user does not know the metadata attached to it, such as the author of the song. The most valuable application of this audio recognition is copyright infringement detection. Throughout this survey we present approaches proposed by various researchers to detect and recognize music using content-based mechanisms, and we conclude by analyzing the current status of this area.

Keywords—Audio fingerprint; feature extraction; wavelets; broadcast monitoring; audio classification; audio identification.

I. INTRODUCTION

Music repositories in the world are growing exponentially. New artists can enter the field easily with new technologies. Once we listen to a new song, we cannot find it again easily if we do not know its metadata, such as the author or singer. The most common method of accessing music is through textual metadata, but this no longer works well against huge music collections. In the area of audio music recognition, the key considerations are the following:

- Can we find an unknown song using a small part of it, or by humming the melody?
- Can we organize and index songs without metadata such as the singer of the song?
- Can we detect copyright infringement, for example after a song has been broadcast on a radio channel?
- Can we identify a cover song when multiple versions exist?
- Can we obtain a statistical report about the songs broadcast on a radio channel without a manual monitoring process?

The above considerations motivate researchers to find proper solutions for these challenges. Many ideas have been proposed and some of them have been implemented; Shazam is one example. However, this is still a challenging research area since there is no optimal solution. The problem becomes even more complex when:

- the audio signal is altered by noise;
- the audio signal is polluted by unnecessary audio objects, such as advertisements in radio broadcasting;
- multiple versions of a song exist;
- only a small part of a song is available.

In any of these situations the human auditory system can still recognize the music, but providing an automated electronic solution is a very challenging task, since the similarity between the original music and the query may be very small, or the similar features may not be possible to model mathematically. This means researchers also need to consider perceptual features in order to provide a proper solution. Feature extraction can be considered the heart of any of these approaches, since accuracy ultimately depends on how features are extracted.
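To make the content-based (metadata-free) idea above concrete, the toy sketch below indexes each reference track by a summary vector computed only from the audio itself and identifies a query clip by nearest-neighbour matching. It is only an illustration of the principle: the file names are placeholders, the MFCC-statistics "fingerprint" is an assumption made for brevity, and real systems such as those surveyed below use far more robust fingerprints and search structures.

```python
# Toy content-based lookup: index tracks by an audio-only summary vector and match a
# query clip by nearest distance. Illustrative only; file names are placeholders and
# real fingerprinting systems use far more robust representations and search structures.
import numpy as np
import librosa

def clip_vector(path):
    """Summarize a clip by the mean and standard deviation of its MFCCs (no metadata)."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical reference collection: any set of known tracks would do.
reference = {name: clip_vector(name + ".wav") for name in ("song_a", "song_b", "song_c")}

def identify(query_path):
    """Return the reference track whose summary vector is closest to the query's."""
    q = clip_vector(query_path)
    return min(reference, key=lambda name: np.linalg.norm(reference[name] - q))

print(identify("unknown_clip.wav"))
```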
The rest of this survey provides a broader overview and comparison of the proposed feature extraction methods, searching algorithms and overall solution architectures.

II. CLASSIFICATIONS (RECOGNITION) VS. IDENTIFICATIONS

What is the difference between audio recognition (classification) and identification? In audio classification, an audio object is classified into pre-defined sets such as song, advertisement or vocals, but it is not identified further. Ultimately we know that this is a song or an advertisement, but we do not know which song it is! Audio classification is less complex than recognition. Most of the time the two are combined in order to get better results. For example, in an audio song recognition system we can first extract only the songs from a collection of other audio objects using an audio classifier, and feed the output into the audio recognition system. With that kind of approach we get better results by narrowing down the search space. There are more proposed audio classification approaches; some of them are discussed in the next subsection.

A. Audio classifications

1) Overview

There is a considerable number of real-world applications for audio classification. For example, it is very helpful to be able to search sound effects automatically in a very large audio database during film post-processing, which contains sounds of explosions, windstorms, earthquakes, animals and so on [1]. Audio content analysis and classification is also useful for audio-assisted video classification: for example, all videos of gun-fight scenes should include the sound of shooting and/or explosions, while the image content may vary significantly from one scene to another.

When classifying audio content into different sets, different classes have to be considered. Most researchers have started by classifying speech and music, but the classes depend on the situation. For example, "music", "speech" and "others" can be considered for the parsing of news stories, whereas an audio recording can be classified into "speech", "laughter", "silence" and "non-speech" for the purpose of segmenting discussion recordings in meetings [1]. In any of these cases we have to extract some sort of audio features. This is the challenging part, and it is where past research efforts differ. We can, however, consider "feature extraction of audio classification" and "feature extraction of audio identification" separately, since most of the time these two cases use disjoint feature sets [7].

2) Feature extraction of audio classification

Most of the time the output of the audio classification is the input of the audio identification. This reduces the search space, speeds up the process and helps retrieve better results. For most researchers, audio classification is broken down into further steps. In [1], two steps are used: in the first stage, the audio signal is segmented and classified into basic types, including speech, music, several types of environmental sounds, and silence. They call this the coarse-level classification. In the second stage, further classification is conducted within each basic type. Speech is differentiated into the voices of a man, a woman or a child, as well as speech with a music background and so on. Music is classified according to the instruments or types (for example, classical, blues, jazz, rock and roll, music with singing and plain song). Environmental sounds are classified into finer classes such as applause, bell ring, footstep, windstorm, laughter, birds' cry, and so on. They call this the fine-level classification. The overall idea is to reduce the search space step by step in order to get better results. We can also use a feature extraction mechanism suited to each finer-level class based on its basic type. For example, due to differences in the origin of the three basic types of audio, i.e. speech, music and environmental sounds, different approaches can be taken in their fine classification. Most researchers have used low-level (physical, acoustic) features such as the spectral centroid or Mel-frequency cepstral coefficients, but end users may prefer to interact at a higher semantic level [2]; for example, they may need to find a dog-barking sound rather than "environmental sounds" in general. However, low-level features are much easier to extract with signal processing than high-level (perceptual) features.
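As a minimal illustration of the low-level features just mentioned, the sketch below computes the frame-wise spectral centroid and Mel-frequency cepstral coefficients of a clip with a standard signal-processing library; the file name, frame length and hop size are placeholder assumptions, and this is not the exact feature set of [1] or [2].

```python
# Minimal low-level feature extraction sketch; the file name, frame length and hop
# size are placeholder assumptions, not the exact configuration of [1] or [2].
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=22050, mono=True)   # decode to a mono waveform

# Frame-wise spectral centroid: the "center of mass" of each frame's magnitude spectrum.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=2048, hop_length=512)

# 13 Mel-frequency cepstral coefficients per frame, a compact timbre descriptor.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)

# Classifiers typically consume per-frame vectors or clip-level statistics of them.
frame_features = np.vstack([centroid, mfcc])             # shape: (14, n_frames)
clip_summary = np.hstack([frame_features.mean(axis=1),
                          frame_features.std(axis=1)])   # one fixed-length vector per clip
print(frame_features.shape, clip_summary.shape)
```

A coarse-level classifier, such as the GMM-based approaches discussed next, could then be trained on these per-frame vectors or on the clip-level statistics.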
Most researchers have used the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM) as the pattern recognition tool; these are widely used, very powerful statistical tools in pattern recognition. To use them, we have to extract distinctive features. Audio features can be grouped into two or more sets. Most researchers group all audio features into two groups: physical (or mathematical) features and perceptual features. Physical features are extracted directly from the audio wave, such as the energy of the wave, frequency, peaks, average zero crossings and so on. These features cannot be identified directly by the human auditory system. Perceptual features, on the other hand, are features humans can understand, such as loudness, pitch, timbre and rhythm. Perceptual features cannot easily be modeled by mathematical functions, but they are very important audio features since humans use them to differentiate audio. However, audio features are sometimes classified into hierarchical groups with similar characteristics [12]; there, all audio features are divided into six main categories, as shown in Figure 1.

Figure 1. High-level audio feature classification [12].

However, no one can define an audio feature and its category exactly, since there is no broad consensus on the allocation of features to particular groups. The same feature may be classified into two different groups by two different researchers; it depends on the viewpoints of the authors. Features defined in Figure 1 can …

Figure 3. The organization of features in the frequency domain [12].

a) Temporal (raw) domain features

Most of the time, we cannot extract features without altering the native audio signal.
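As a sketch of what temporal (raw) domain features look like in practice, the code below computes two commonly cited examples, short-time energy and zero-crossing rate, directly from the sample values; even here the signal is "altered" in the minimal sense of being cut into fixed-length frames. The frame and hop sizes, and the synthetic test signals, are illustrative assumptions rather than values from the survey.

```python
# Minimal sketch of two temporal (raw) domain features: short-time energy and
# zero-crossing rate. Frame and hop sizes are illustrative, not values from the survey.
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Cut a 1-D signal into overlapping frames (the only 'alteration' applied here)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr                        # one second of samples
    tone = 0.5 * np.sin(2 * np.pi * 440 * t)      # tonal signal: few zero crossings
    noise = 0.5 * np.random.randn(sr)             # noise-like signal: many zero crossings
    for name, x in (("tone", tone), ("noise", noise)):
        frames = frame_signal(x)
        print(name,
              "energy=%.4f" % short_time_energy(frames).mean(),
              "zcr=%.4f" % zero_crossing_rate(frames).mean())
```

Tonal music tends to produce a low zero-crossing rate while noise-like content produces a high one, which is one reason such simple temporal features already help in coarse speech/music/environmental-sound classification.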