An Audio Fingerprinting System for Live Version Identification Using Image Processing Techniques
Zafar Rafii
Northwestern University
Evanston, IL, USA

Bob Coover, Jinyu Han
Gracenote, Inc.
Emeryville, CA, USA

ABSTRACT

Suppose that you are at a music festival checking on an artist, and you would like to quickly know about the song that is being played (e.g., title, lyrics, album, etc.). If you have a smartphone, you could record a sample of the live performance and compare it against a database of existing recordings from the artist. Services such as Shazam or SoundHound will not work here, as this is not the typical framework for audio fingerprinting or query-by-humming systems: a live performance is neither identical to its studio version (e.g., variations in instrumentation, key, tempo, etc.) nor is it a hummed or sung melody. We propose an audio fingerprinting system that can deal with live version identification by using image processing techniques. Compact fingerprints are derived using a log-frequency spectrogram and an adaptive thresholding method, and template matching is performed using the Hamming similarity and the Hough Transform.

Index Terms— Adaptive thresholding, audio fingerprinting, Constant Q Transform, cover identification

1. INTRODUCTION

Audio fingerprinting systems typically aim at identifying an audio recording given a sample of it (e.g., returning the title of a song), by comparing the sample against a database for a match. Such systems generally first transform the audio signal into a compact representation (e.g., a binary image) so that the comparison can be performed efficiently (e.g., via hash functions) [1]. In [2], the sign of energy differences along time and frequency is computed in log-spaced bands selected from the spectrogram. In [3], a two-level principal component analysis is computed from the spectrogram. In [4], pairs of time-frequency peaks are chosen from the spectrogram. In [5], the sign of wavelets computed from the spectrogram is used.

Audio fingerprinting systems are designed to be robust to audio degradations (e.g., encoding, equalization, noise, etc.) [1]. Some systems are also designed to handle pitch or tempo deviations [6, 7, 8]. Yet, all those systems aim at identifying the same rendition of a song, and will consider cover versions (e.g., a live performance) to be different songs. For a review on audio fingerprinting, the reader is referred to [1].

Cover identification systems aim precisely at identifying a song given an alternate rendition of it (e.g., live, remaster, remix, etc.). A cover version essentially retains the same melody, but differs from the original song in other musical aspects (e.g., instrumentation, key, tempo, etc.) [9]. In [10], beat tracking and chroma features are used to deal with variations in tempo and instrumentation, and cross-correlation is computed between all key transpositions. In [11], chord sequences are extracted using chroma vectors, and a sequence alignment algorithm based on Dynamic Programming (DP) is used. In [12], chroma vectors are concatenated into high-dimensional vectors and nearest neighbor search is used. In [13], an enhanced chroma feature is computed, and a sequence alignment algorithm based on DP is used.

Cover identification systems are designed to capture the melodic similarity while being robust to the other musical aspects [9]. Some systems also propose to use short queries [14, 15, 16] or hash functions [12, 17, 18], in a formalism similar to audio fingerprinting. Yet, all those systems aim at identifying a cover song given a full and/or clean recording, and will not apply in the case of short and noisy excerpts, such as those that can be recorded from a smartphone at a concert. For a review on cover identification, and more generally on audio matching, the reader is referred to [9] and [19].

We propose an audio fingerprinting system that can deal with live version identification by using image processing techniques.
The system is specially intended for applications where a smartphone user is attending a live performance from a known artist and would like to quickly know about the song that is being played (e.g., title, lyrics, album, etc.). As computer vision has been shown to be practical for music identification [20, 5], image processing techniques are used to derive novel fingerprints that are robust to both audio degradations and audio variations, while remaining compact for efficient matching.

In Section 2, we describe our system. In Section 3, we evaluate our system using live queries against a database of studio references. In Section 4, we conclude this article. (This work was done while the first author was an intern at Gracenote, Inc.)

2. SYSTEM

2.1. Fingerprinting

In the first stage, compact fingerprints are derived from the audio signal, by first using a log-frequency spectrogram to capture the melodic similarity and handle key variations, and then an adaptive thresholding method to reduce the feature size and handle noise degradations and local variations.

2.1.1. Constant Q Transform

First, we transform the audio signal into a time-frequency representation. We propose to use a log-frequency spectrogram based on the Constant Q Transform (CQT) [21]. The CQT is a transform with a logarithmic frequency resolution, mirroring the human auditory system and matching the notes of the Western music scale, which makes it well adapted to music analysis. The CQT can handle key variations relatively easily, as pitch deviations correspond to frequency translations in the transform.

We compute the CQT by using a fast algorithm based on the Fast Fourier Transform (FFT) in conjunction with the use of a kernel [22]. We derive a CQT-based spectrogram by using a time resolution of around 0.13 second per time frame and a frequency resolution of one quarter tone per frequency channel, with a frequency range spanning from C3 (130.81 Hz) to C8 (4186.01 Hz), leading to 120 frequency channels.

Fig. 1. Overview of the fingerprinting stage. The audio signal is first transformed into a log-frequency spectrogram by using the CQT. The CQT-based spectrogram is then transformed into a binary image by using an adaptive thresholding method.

2.1.2. Adaptive Thresholding

Then, we transform the CQT-based spectrogram into a binary image. We propose to use an adaptive thresholding method based on two-dimensional median filtering. Thresholding is a method of image segmentation that uses a threshold value to turn a grayscale image into a binary image. Adaptive thresholding methods adapt the threshold value on each pixel of the image by using some local statistics of the neighborhood [23].

For each time-frequency bin in the CQT-based spectrogram, we first compute the median of its neighborhood given a window size. We then compare the value of the bin with the value of its median, and assign a 1 if the former is higher than the latter, and a 0 otherwise, as shown in Equation 1. We use a window size of 35 frequency channels by 15 time frames.

∀(i, j): M(i, j) = median{X(I, J) : i − Δi ≤ I ≤ i + Δi, j − Δj ≤ J ≤ j + Δj}
∀(i, j): B(i, j) = 1 if X(i, j) > M(i, j), and 0 otherwise    (1)

The idea here is to cluster the CQT-based spectrogram into foreground (1), where the energy is locally high, and background (0), where the energy is locally low, as shown in Figure 1. This method leads to a compact fingerprint that can handle noise degradations, while allowing local variations. It can be thought of as a relaxation of the peak finder used in [4].
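As a minimal illustration of this fingerprinting stage, the following Python sketch derives a binary fingerprint from an audio file. It assumes librosa's CQT implementation in place of the kernel-based fast algorithm of [22], and a sample rate of 22,050 Hz with a hop of 2,880 samples to approximate the 0.13-second frame resolution; the frequency range and median window size follow the text, while the boundary handling of the median filter is an assumption.

import librosa
import numpy as np
from scipy.ndimage import median_filter

def fingerprint(audio_file, sr=22050, hop=2880):
    # 2880 samples at 22,050 Hz gives about 0.13 second per time frame (assumed values)
    y, _ = librosa.load(audio_file, sr=sr)
    # CQT-based spectrogram: quarter-tone resolution (24 bins per octave)
    # from C3 to C8, i.e., 5 octaves and 120 frequency channels
    X = np.abs(librosa.cqt(y, sr=sr, hop_length=hop,
                           fmin=librosa.note_to_hz('C3'),
                           n_bins=120, bins_per_octave=24))
    # Adaptive thresholding, as in Equation (1): compare each bin to the
    # median of its 35 (frequency) by 15 (time) neighborhood
    M = median_filter(X, size=(35, 15), mode='nearest')
    # Binary image: 1 = foreground (locally high energy), 0 = background
    return (X > M).astype(np.uint8)

A query fingerprint computed this way would then be matched against reference fingerprints computed identically from the studio tracks, as described next.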
2.2. Matching

In the second stage, template matching is performed between query and reference fingerprints, by first using the Hamming similarity to compare all pairs of time frames at different pitch shifts and handle key variations, and then the Hough Transform to find the best alignment and handle tempo variations.

2.2.1. Hamming Similarity

First, we compute a similarity matrix between the query and all the references. We propose to use the Hamming similarity between all pairs of time frames in the query and reference fingerprints. The Hamming similarity is the percentage of bins that match between two binary arrays of 1's and 0's [24].

We first compute the matrix product of the query and reference fingerprints, after converting the fingerprints via the function f(x) = 2x − 1. We then convert the matrix product back via the function f⁻¹(x) = (x + 1)/2, and normalize each value by the number of frequency channels in one fingerprint. Each bin in the resulting matrix then measures the Hamming similarity between one time frame of the query and one time frame of the reference. We compute the similarity matrix for different pitch shifts in the query. We use ±10 pitch shifts (in quarter-tone channels), assuming a maximum key variation of ±5 semitones between a live performance and its studio version.

The idea here is to measure the similarity for both the foreground and the background between fingerprints, as we believe that both components matter when identifying audio.

2.2.2. Hough Transform

Then, we identify the best alignment between the query and the references, which would correspond to a line at an angle of around 45° in the similarity matrix that intersects the bins with the largest cumulated Hamming similarity. We propose to use the Hough Transform, based on the parametric representation of a line as ρ = x cos θ + y sin θ.

3. EVALUATION

3.1. Dataset

We first build, for different artists of varied genres, a set of studio references, by extracting full tracks from studio albums, and two sets of live queries, by extracting short excerpts.
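To make the matching stage of Section 2.2 concrete, the sketch below computes the Hamming similarity with the matrix-product trick and scores candidate alignments. The zero-padded pitch shifting, the slope grid, and the averaged line score are illustrative assumptions: the simplified slope-and-offset search merely stands in for the full Hough Transform voting scheme.

import numpy as np

def hamming_similarity(Q, R):
    # Q: (n_channels, n_query_frames), R: (n_channels, n_ref_frames), values in {0, 1}.
    # f(x) = 2x - 1 maps {0, 1} to {-1, +1}, so each entry of the matrix product
    # counts matches minus mismatches; normalizing by the number of frequency
    # channels n and applying f^-1(x) = (x + 1) / 2 yields the fraction of
    # matching bins between a query frame and a reference frame.
    n = Q.shape[0]
    P = (2.0 * Q - 1).T @ (2.0 * R - 1)
    return (P / n + 1) / 2

def shift_channels(Q, k):
    # Shift a fingerprint by k frequency channels (quarter tones), zero-padding.
    S = np.zeros_like(Q)
    if k >= 0:
        S[k:] = Q[:Q.shape[0] - k]
    else:
        S[:k] = Q[-k:]
    return S

def best_alignment_score(Q, R, max_shift=10, slopes=np.linspace(0.8, 1.25, 10)):
    # Best score over +/-10 quarter-tone shifts (+/-5 semitones) and lines
    # j = slope * i + offset at angles around 45 degrees, a simplified
    # stand-in for the Hough Transform with rho = x cos(theta) + y sin(theta).
    best = 0.0
    for k in range(-max_shift, max_shift + 1):
        S = hamming_similarity(shift_channels(Q, k), R)
        n_q, n_r = S.shape
        i = np.arange(n_q)
        for slope in slopes:
            for offset in range(n_r):
                j = np.round(slope * i + offset).astype(int)
                valid = j < n_r
                if valid.any():
                    best = max(best, S[i[valid], j[valid]].mean())
    return best

The slopes around 1 correspond to moderate tempo deviations between the live query and the studio reference; an actual implementation would use proper Hough voting and a far more efficient search over the reference database.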