Audio-Based Fingerprinting for Radio Station Recommendations
Stefan Langer, Markus Friedrich, Liza Obermeier, Emma Munisamy, André Ebert, Claudia Linnhoff-Popien∗
inovex GmbH, Data Management and Analytics
LMU Munich, Mobile and Distributed Systems Group
andre.ebert@ifi.lmu.de, [first name].[last name]@ifi.lmu.de

∗The HRADIO project and thus this work were funded by H2020, the EU Framework Programme for Research and Innovation.

ABSTRACT

The world of linear radio broadcasting is characterized by a wide variety of stations and played content. Finding stations that play the preferred content is therefore a tough task for a potential listener, especially given the overwhelming number of choices on offer. Recommender systems usually step in here, but existing content-based approaches rely on metadata and are thus constrained by the quality of the available data. We therefore propose a new pipeline for generating audio-based radio station fingerprints, relying on audio stream crawling and a deep autoencoder. We show that the proposed fingerprints are especially useful for characterizing radio stations by their audio content and are thus an excellent representation for meaningful and reliable radio station recommendations.

Index Terms: Hybrid Radio, Multimedia Services, Recommender Systems, Audio Analysis, Fingerprinting

1. INTRODUCTION

Despite emerging competition from on-demand content services, linear radio broadcasting remains one of the most popular entertainment and information media in Europe. Its advantage lies in its technical simplicity, its topicality, and its personal approach conveyed by professional moderators. However, strong competitors such as Spotify, Deezer, and Google Play have recently appeared. Their advantage is a personalized listening experience and a wide range of content combined with precise recommendation systems.

The challenge for radio broadcasters is now to enrich their classic, linear radio programme with online-based personalized technologies in order to improve the listening experience and to bridge the gap between linear and on-demand content providers. These so-called hybrid radio technologies comprise techniques for privacy-preserving user data collection (feedback channel), on-the-fly content substitution (e.g., replacing ads with songs from a pre-selected music playlist), and precise station recommendations.

This paper focuses on the latter and thus on the question of how to generate meaningful recommendations for radio stations. Two main types of approaches can be distinguished: Collaborative Filtering (CF) and Content-based Filtering (CBF). CF focuses on user opinions and behavior, which can lead to privacy issues as well as the Cold Start Problem (see Section 2). A CBF-based recommender system working with characteristics derived from available station data is therefore preferred in the context of this work. In [1], it was shown that station recommendation based on available metadata is possible, but only if the metadata reaches a certain quality level, which is often not the case. Other meaningful characteristics can be derived directly from the station's audio signal, which significantly reduces the requirements on metadata quality.

We propose a deep-learning-based audio crawling and fingerprint extraction pipeline for the characterization of radio stations and show visual results for numerous stations. Furthermore, we detail how a recommender system based on the developed fingerprint can be implemented. The paper is structured as follows: Section 2 provides a brief overview of recommendation concepts and related work. Section 3 explains the basic concept of the proposed fingerprinting pipeline and its implementation. Its recommendation capabilities are evaluated in Section 4, while Section 5 summarizes this work.

2. RELATED WORK

Next to recommendation approaches such as You May Like (YML), Knowledge-based Filtering (KF), and Demographic Filtering (DF), there are two widely recognized concepts: Content-based Filtering and Collaborative Filtering [2, 3, 4]. CF-, DF-, and KF-based systems perform well if enough user data is available. An open issue, however, is the so-called Cold Start Problem, which occurs in the initial phase of a system, when not enough user data is available to create meaningful recommendations [5, 6, 7]. Another issue for CF and DF systems is so-called Filter Bubbles: closed-off, synthetic environments in which the same items are always recommended, disregarding the existence of contrary or different items beyond the bubble [8, 9]. In contrast, an initial issue of CBF-based systems is the need for high-quality metadata that precisely describes the items to recommend [1].

The concept presented in this paper is part of the HRADIO platform, which provides a vast amount of metadata and audio information within a hybrid radio context [1]. For this reason, and to avoid issues like Filter Bubbles or the Cold Start Problem, it takes a CBF-based approach.

To compare radio services on the basis of their audio features, existing approaches that use deep learning for music genre recognition (MGR) can be applied [10, 11]. Gwardys et al. propose a concept using transfer learning in combination with a convolutional neural network for MGR [12]. Logan et al. and Siddiquee et al. present methods for measuring the similarity of music on the basis of audio signals [13, 14]. After clustering raw audio features, Logan et al. compare entities using the Earth Mover's Distance (EMD) [15]. Çataltepe et al. use adaptive features and user grouping to account for the fact that different traits within music are of different importance to each user, by including historical information about the users' listening behaviour [16]. Together with others, these works provide valuable input for the proposed concept.

3. CONCEPT

This section details the steps of the proposed pipeline as depicted in Figure 1. The Data Collector requests a list of radio services (including their HTTP stream bearers) from the HRADIO Metadata Platform [1]. Then, a deep neural autoencoder provided by the auDeep toolkit is trained to reduce the dimensions of the raw audio input [17]. For this purpose, a mel-scaled spectrogram of each audio snippet is extracted beforehand. These spectrograms serve as training input for the autoencoder and provide a human-perceptible image representation of each snippet.

An autoencoder consists of an encoder and a decoder component. The encoder maps the input to a smaller space, the so-called latent space. Based on this, the decoder tries to restore the original input. Ideally, the latent space comprises only the dimensions that the decoder needs for the complete reconstruction of the original input. The encoder can therefore be used to create a compressed representation of the input data. By using the encoder component in our concept, the input data is compressed and each sample is reduced to a vector with 1024 dimensions.

These vectors are the input of the Fingerprinter, which trains a K-Means clustering model [18]. The distribution of a radio station's samples across the clusters is regarded as the station's fingerprint. The last component shown in Figure 1 is the Recommender, which recommends radio stations similar to a particular input station on the basis of the Euclidean distance between their fingerprints. A small distance implies a comparably high similarity.

Fig. 1. The proposed pipeline consists of four modules: the Data Collector, the auDeep autoencoder, the Fingerprinter, and the Recommender.

4. EVALUATION

This section describes the data acquisition and training process. Then, the recommender system is evaluated and an additional statistical analysis of the fingerprint archetypes is conducted. Finally, we discuss the differences in recommendations between different times of day.

4.1. Data Acquisition and Training

The DataCollector component (see Figure 1) records radio stations by requesting a list of radio services and their HTTP stream addresses from the HRADIO Metadata Platform [1]. Currently, a list of 461 valid and unique streams is received. Over 24 hours, each station is recorded for 5 seconds every two minutes. To reduce the amount of news included in the audio snippets, we do not record during the 5 minutes before and after each full hour. This leads to a total of 576 samples per radio station. In this way, a full day of samples could be recorded for 431 radio stations, while 30 stations could not be recorded entirely due to server-side connection problems. In total, we collected 266,239 audio snippets, of which 17,983 belong to incomplete radio station recordings.

Subsequently, the auDeep autoencoder was trained on all 266,239 audio files, whereas the fingerprinting is applied only to complete sets of audio snippets (for 431 radio stations). We use the hyperparameters suggested by auDeep (a window width of 0.08 with an overlap of 0.04, a fixed length of 5 seconds per input snippet, and clipping of values below −60) [17]. The network is trained for 64 epochs with a batch size of 64 on 2 layers, each with 256 gated recurrent units (GRU), a learning rate of 0.001, and a dropout of 0.2. Training the network for 7 days resulted in a loss of 0.237.

Subsequently, the Fingerprinter creates fingerprints of all complete stations, using all 266,239 vectors generated by the auDeep component. The K-Means clustering algorithm divides all data points into n clusters. The parameter n is determined using the silhouette coefficient [19], resulting in a range from 9 to 16 with a peak value at 11, as can be seen in Figure 2. Each data point is assigned to exactly one cluster and [...]

[...] third radio station, LBC UK, is assigned the genres Non-fiction, Local/Regional, and News. The closest 3 non-variant stations are Bayern 5 Plus with the genre Information and a Euclidean distance of 117.97, WDR 3 with the genres Classical Music and Cultural and a Euclidean distance of 118.22, and SWR 2 Archiv Radio with the genre Documentary and a Euclidean distance of 118.25.
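
The fingerprinting and recommendation steps described in Section 3 can be illustrated with a minimal Python sketch. It assumes that every snippet has already been assigned a cluster index by the K-Means model. The helper names `fingerprint` and `recommend`, the station names, and the toy label lists are hypothetical and not taken from the paper. Using raw per-cluster counts rather than normalised frequencies is also an assumption, since the text only speaks of the "distribution" of samples.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8

N_CLUSTERS = 11  # peak of the silhouette coefficient reported in Section 4.1


def fingerprint(cluster_labels, n_clusters=N_CLUSTERS):
    """Count how many of a station's snippets fall into each cluster.

    Assumption: raw counts are used; the paper does not state whether
    the distribution is normalised.
    """
    counts = Counter(cluster_labels)
    return [counts.get(c, 0) for c in range(n_clusters)]


def recommend(query, fingerprints, k=3):
    """Return the k stations closest to `query` by Euclidean distance."""
    ranked = sorted(
        (dist(fingerprints[query], fp), name)
        for name, fp in fingerprints.items()
        if name != query
    )
    return [(name, round(d, 2)) for d, name in ranked[:k]]


# Hypothetical cluster assignments for three made-up stations,
# each with 576 snippets as in the recording setup of Section 4.1.
labels = {
    "news_a": [0] * 500 + [1] * 76,
    "news_b": [0] * 480 + [1] * 96,
    "pop_c": [5] * 400 + [6] * 176,
}
fingerprints = {name: fingerprint(lbls) for name, lbls in labels.items()}

print(recommend("news_a", fingerprints, k=2))
```

In this toy data the two news-like stations share the same dominant clusters, so `news_b` is ranked ahead of `pop_c` for the query `news_a`. In a real setup, the label lists would come from the trained K-Means model applied to the 1024-dimensional encoder outputs.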