AUDIO-BASED FINGERPRINTING FOR RADIO STATION RECOMMENDATIONS

Stefan Langer, Markus Friedrich, Liza Obermeier, Emma Munisamy, André Ebert, Claudia Linnhoff-Popien∗
inovex GmbH, LMU Munich, Mobile and Distributed Data Management and Analytics Systems Group
andre.ebert@ifi.lmu.de, [first name].[last name]@ifi.lmu.de

∗The HRADIO project and thus this work was funded by H2020, the EU Framework Programme for Research and Innovation.

ABSTRACT

The world of linear radio broadcasting is characterized by a wide variety of stations and played content. Finding stations that play the preferred content is therefore a tough task for a potential listener, especially given the overwhelming number of offered choices. Recommender systems usually step in here, but existing content-based approaches rely on metadata and are thus constrained by the available data quality. Therefore, we propose a new pipeline for the generation of audio-based radio station fingerprints relying on audio stream crawling and a deep autoencoder. We show that the proposed fingerprints are especially useful for characterizing radio stations by their audio content and thus are an excellent representation for meaningful and reliable radio station recommendations.

Index Terms: Hybrid Radio, Multimedia Services, Recommender Systems, Audio Analysis, Fingerprinting

1. INTRODUCTION

Despite emerging competition from on-demand content services, linear radio broadcasting remains one of the most popular entertainment and information media in Europe. Its advantage lies in its technical simplicity, its topicality, and its personal approach conveyed by professional moderators. However, with services like Spotify, Deezer, or Google Play, strong competitors have recently appeared, which have an advantage by providing a personalized listening experience and a wide range of content combined with precise recommendation systems.

The challenge for radio broadcasters is now to enrich their classic, linear radio programme with online-based personalized technologies in order to improve the listening experience and to bridge the gap between linear and on-demand content providers. These so-called hybrid radio technologies comprise techniques for privacy-preserving user data collection (feedback channel), on-the-fly content substitution (e.g., replacing ads with songs from a pre-selected music playlist), and precise station recommendations. This paper focuses on the latter and thus on the question of how to generate meaningful recommendations for radio stations. Two main types of approaches can be distinguished: recommendations based on Collaborative Filtering (CF) and Content-based Filtering (CBF) approaches. CF focuses on user opinions and behavior, which can lead to privacy issues as well as the Cold Start Problem (see Section 2). Thus, a CBF-based recommender system working with characteristics derived from available station data is preferred in the context of this work. In [1], it was shown that station recommendation based on available metadata is possible, but only if the metadata reaches a certain quality level. This is demonstrably often not the case. Other meaningful characteristics can be derived directly from the station's audio signal, which significantly reduces metadata quality requirements.

We propose a Deep Learning-based audio crawling and fingerprint extraction pipeline for the characterization of radio stations and show visual results for numerous stations. Furthermore, we detail how a recommender system based on the developed fingerprint can be implemented. The paper is structured as follows: Section 2 provides a brief overview of recommendation concepts and related work. Section 3 explains the basic concept of the proposed fingerprinting pipeline and its implementation. Its recommendation capabilities are evaluated in Section 4, while Section 5 summarizes this work.

2. RELATED WORK

Next to recommendation approaches such as You May Like (YML), Knowledge-based Filtering (KF), and Demographic Filtering (DF), there are two mainly recognized concepts: Content-based Filtering and Collaborative Filtering [2, 3, 4]. CF-, DF-, and KF-based systems show good performance if enough user data is available. An open issue, however, is the so-called Cold Start Problem, which occurs in the initial phase of the system, when not enough user data is available to create meaningful recommendations [5, 6, 7]. Another issue for CF and DF systems are so-called Filter Bubbles, which describe the creation of closed-off, synthetic environments in which always the same items are recommended, disregarding the existence of contrary or different items beyond the bubble [8, 9]. In contrast, an initial issue of CBF-based systems is the need for high-quality metadata precisely describing the items to recommend [1]. The concept presented in this paper is part of the HRADIO platform, which provides a vast amount of metadata and audio information within a hybrid radio context [1]. For this reason, and to avoid issues like Filter Bubbles or the Cold Start Problem, it takes a CBF-based approach.

In order to compare radio services on the basis of their audio features, existing approaches which utilize deep learning for music genre recognition (MGR) can be utilized [10, 11]. Gwardys et al. propose a concept using transfer learning in combination with a convolutional neural network for MGR [12]. Logan et al. and Siddiquee et al. present methods for measuring the similarity of music on the basis of audio signals [13, 14]. After clustering raw audio features, Logan et al. compare entities using the Earth Mover's Distance (EMD) [15]. Çataltepe et al. use adaptive features and user grouping to account for the fact that different traits within music are of different importance for each user, by including historical information about the users' listening behaviour [16]. Together with others, these works provide valuable input for the proposed concept.

3. CONCEPT

This chapter details the steps of the proposed pipeline as depicted in Figure 1. The Data Collector requests a list of radio services (with included HTTP stream bearers) from the HRADIO Metadata Platform [1]. Then, a deep neural autoencoder provided by the auDeep toolkit is trained to reduce the dimensions of the raw audio input [17]. For this, a mel-scaled spectrogram of each audio snippet is extracted beforehand. These spectrograms serve as training input for the autoencoder and provide a human-perceptible image representation of each snippet.

An autoencoder consists of an encoder and a decoder component. The encoder maps the input to a smaller space, the so-called latent space. Based on this, the decoder tries to restore the original input. Ideally, the latent space comprises only the dimensions that the decoder needs for the complete reconstruction of the original input. The encoder can therefore be used to create a compressed representation of the input data. By using the encoder component in our concept, the input data is compressed and the samples are reduced to a vector with 1024 dimensions.

These vectors are the input of the Fingerprinter, which trains a K-Means clustering model [18]. The distribution of samples in each cluster per radio station is regarded as the station's fingerprint. The last component shown in Figure 1 is the Recommender, which recommends radio stations similar to a particular input station on the basis of the Euclidean distance of their fingerprints. A small distance implies a comparably high similarity.

Fig. 1. The proposed pipeline consists of four modules: the Data Collector, the auDeep autoencoder, the Fingerprinter, and the Recommender.

4. EVALUATION

This section describes the data acquisition and training process. Then, the recommender system is evaluated and an additional statistical analysis of the fingerprint archetypes is conducted. Finally, we discuss the differences in recommendations between different times of day.

4.1. Data Acquisition and Training

The Data Collector component (see Figure 1) records radio stations by requesting a list of radio services and their HTTP stream addresses from the HRADIO Metadata Platform [1]. Currently, a list of 461 valid and unique streams is received. Over 24 hours, each station is recorded for 5 seconds at intervals of two minutes. To reduce the amount of news included in the audio snippets, we do not record during the 5 minutes before and after full hours. This leads to a total number of 576 samples per radio station. A full day of samples could be recorded for 431 radio stations, while 30 stations could not be recorded entirely due to server-side connection problems. In total, we collected 266,239 audio snippets, of which 17,983 belong to incomplete radio station recordings, leaving 248,256 snippets (431 × 576) from complete recordings. Subsequently, the auDeep autoencoder was trained on all 266,239 audio files, whereas the fingerprinting is only applied to complete sets of audio snippets (for 431 radio stations).

We use the hyperparameters suggested by auDeep (a window width of 0.08 with an overlap of 0.04, a fixed length of 5 seconds per input snippet, and clipping of values below −60) [17]. The network is trained for 64 epochs with a batch size of 64 on 2 layers, each having 256 gated recurrent units (GRU), with a learning rate of 0.001 and a dropout of 0.2. Training the network for 7 days resulted in a loss of 0.237.

Subsequently, the Fingerprinter creates fingerprints of all complete stations, using all 266,239 vectors generated by the auDeep component. The K-Means clustering algorithm divides all data points into n clusters. The parameter n is determined by using the silhouette coefficient [19], resulting in a range from 9 to 16 and showing a peak value at 11, as can be seen in Figure 2.
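The fingerprint and the distance-based lookup described in Section 3 can be illustrated with a minimal sketch. It assumes that every audio snippet has already been assigned a cluster label (for example by a previously fitted K-Means model). The station names, the toy labels, and the helper names `fingerprint`, `k_nearest`, and `within_radius` are hypothetical and are not taken from the authors' implementation; using raw counts for the histogram is likewise an assumption.

```python
import math
from collections import Counter


def fingerprint(labels, n_clusters, normalize=False):
    """Return the per-cluster histogram of one station's snippet labels."""
    counts = Counter(labels)
    hist = [float(counts[c]) for c in range(n_clusters)]
    if normalize and sum(hist) > 0:
        total = sum(hist)
        hist = [h / total for h in hist]
    return hist


def k_nearest(query, fingerprints, k=3):
    """Rank all other stations by Euclidean distance and keep the k closest."""
    distances = [
        (name, math.dist(fingerprints[query], fp))
        for name, fp in fingerprints.items()
        if name != query
    ]
    return sorted(distances, key=lambda pair: pair[1])[:k]


def within_radius(query, fingerprints, radius):
    """Keep only stations within the given Euclidean distance (may be empty)."""
    ranked = k_nearest(query, fingerprints, k=len(fingerprints))
    return [(name, d) for name, d in ranked if d <= radius]


# Hypothetical cluster labels per station (toy data, 4 clusters).
labels_per_station = {
    "station_a": [0, 0, 0, 1, 2, 2],
    "station_b": [0, 0, 1, 1, 2, 2],
    "station_c": [3, 3, 3, 3, 3, 2],
}
fingerprints = {
    name: fingerprint(labels, n_clusters=4)
    for name, labels in labels_per_station.items()
}

print(k_nearest("station_a", fingerprints, k=2))
print(within_radius("station_c", fingerprints, radius=2.0))
```

In this toy data, `station_a` and `station_b` have similar histograms and therefore a small distance, while a radius query around `station_c` returns no station at all, which is the kind of empty result a fixed distance threshold can produce.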
Each data point is assigned to exactly one cluster, and the fingerprint is derived from a histogram across all clusters for each station. This fingerprint vector now serves as input for the Recommender.

Fig. 2. Silhouette coefficients for different numbers of clusters.

4.2. Recommender System

The fingerprints are the basis for station recommendations. A small Euclidean distance between two station fingerprints implies high similarity between the two stations. There are two possibilities for giving recommendations: 1) the k-nearest radio stations are suggested to the user as similar, or 2) only radio stations within a certain Euclidean distance are listed. Option 1) carries the risk that the returned station list may also contain distant radio stations. In option 2), the distances of the provided radio stations are small in any case, but the result set may be empty. Moreover, because user ratings are subjective, providing high-quality recommendations is not a trivial task.

4.2.1. Evaluation of Recommendation Results

Table 1 shows, for three requested services, the 3 closest radio stations with their metadata genres and their Euclidean distance to the requested service. The station BR Klassik is assigned to the genres Classical Music and Special Music. The closest 3 stations are NDR Kultur with the genres Classical Music and Cultural and a Euclidean distance of 59.58, HR 2 with the genres Classical Music, Cultural, and Special Music and a Euclidean distance of 60.81, and Classic FM with the genres Classical Music and News and a Euclidean distance of 60.93. All 3 stations near BR Klassik play similar content within the genres Classical Music and Cultural.

The second radio station, Heart UK, is assigned to the genres Classic/Dance/Pop-rock, Disco, Local/Regional, Dance/Dance-pop, and Showbiz. The closest 3 stations are XTRA Reloaded with the genre Rap/Hip Hop/Reggae and a Euclidean distance of 41.09, 3FM Isle of Man with the genre Hit-Chart and a Euclidean distance of 86.34, and FFH Rock with the genres Rock, Soft Rock, Grunge, Heavy Rock, and Rock & Roll and a Euclidean distance of 88.29. The only station close to Heart UK by genre is 3FM Isle of Man. The other two stations can be considered not similar, being assigned to Rap/Hip Hop/Reggae and to Rock, Soft Rock, Grunge, Heavy Rock, and Rock & Roll, respectively.

The third radio station, LBC UK, is assigned the genres Non-fiction, Local/Regional, and News. The closest 3 non-variant stations are Bayern 5 Plus with the genre Information and a Euclidean distance of 117.97, WDR 3 with the genres Classical Music and Cultural and a Euclidean distance of 118.22, and SWR 2 Archiv Radio with the genre Documentary and a Euclidean distance of 118.25. All 3 stations near LBC UK publish a lot of spoken content, as suggested by their genres Local/Regional, News, or Documentary.

Requested station | 1st closest station | 2nd closest station | 3rd closest station
BR Klassik (Classical Music, Special Music) | NDR Kultur (Classical Music, Cultural), distance: 59.58 | HR2 (Classical Music, Cultural), distance: 60.81 | Classic FM (Classical Music, News), distance: 60.93
Heart UK (Classic/Dance/Pop-rock, Disco) | XTRA Reloaded (Rap/Hip Hop/Reggae), distance: 41.09 | 3FM Isle of Man (Hit-Chart), distance: 86.34 | FFH Rock (Rock, Soft Rock), distance: 88.29
LBC UK (Local/Regional, News) | Bayern 5 Plus (Information), distance: 117.97 | WDR 3 (Classical Music, Cultural), distance: 118.22 | SWR 2 Archiv Radio (Documentary), distance: 118.25

Table 1. Three examples of recommendation requests and the three closest results including their Euclidean distances.

4.2.2. Archetypal Analysis

Archetypal analysis extracts representative individuals in a data set. A data point is defined by its affiliation to k archetypes [20]. In soccer, for example, a player could be described as 10% defender, 50% midfielder, and 40% striker. We use this concept to find representative stations on the basis of their fingerprints. The optimal number of 4 archetypes was determined using the so-called elbow criterion [21]. For this, the residual sum of squares (RSS) for different numbers of archetypes is visualized in a scree plot (see Figure 4). The optimal number of archetypes is the one where the curve has its strongest bend. Figure 3 shows a plot of all fingerprints, reduced to 2 dimensions by using a Principal Component Analysis (PCA) [22], where stations are points coloured according to their genre. Archetypes are represented as black triangles.

The first archetype is at [115.77, −164.82]. Antenne P, defined as genre Oldies, is the closest station to it with a distance of 19.86. Within a Euclidean distance of 150, 71 stations could be found in total. SWR1 RP is the next nearest station with a distance of 51.90, encompassing the genres Pop Music and Regional.

The second archetype is at [292.96, 385.21]. Noods Radio, which does not list any genres, is the closest station with a distance of 19.23. Within a Euclidean distance of 150, 7 stations could be found in total. 95.3 KGY Olympia is the next nearest station with a distance of 41.56, classified as AOR / Slow Rock / Soft Rock.

The third archetype is located at [−144.58, 50.05]. Crawly has a distance of 8.76 to it, encompassing the genres Classic/Dance/Pop-rock, Disco, Local/Regional, Dance/Dance-pop, and Showbiz. Within a Euclidean distance of 150, 175 stations could be found in total. The next station providing genre metadata is Heart Dorset with a distance of 11.32, classified by the same genres.

The last archetype is at [138.10, 60.20]. Bayern 2 Nord is the closest station with a distance of 88.83 and the genre Cultural. Within a Euclidean distance of 150, 48 stations could be found in total. The next station offering genre metadata is Classic FM with a distance of 99.47 to the archetype, playing Classical Music and News.

Fig. 3. Visualization of fingerprint vectors, using PCA. The dots mark the radio services, colored by genre. Black triangles represent archetypes.

Fig. 4. Number of archetypes and corresponding RSS values. This so-called scree plot is used for the selection of the best number of archetypes.

4.2.3. Comparing Recommendations by Day Times

In addition to recommendations on whole-day recordings, we created 3 additional, normalized fingerprints per time of day, namely night, morning, and day. Night fingerprints consider audio samples from 09:00 pm to 05:00 am, morning fingerprints consider samples from 05:00 am to 09:00 am, and day fingerprints consider samples from 09:00 am to 09:00 pm. Comparing those 3 fingerprints to each other and to the whole-day fingerprints provides insights into how radio stations change throughout the day. For example, large changes appear on WDR Event, which had comparably large distances between daytime-specific and whole-day fingerprints (in sum a deviation of 0.91), leading to the assumption that the style of content changes heavily throughout the day. Heart Beds - Luton shows only minor changes with a comparably small change of summed-up 1.13.

5. CONCLUSION

In this paper, we presented a holistic pipeline for radio station fingerprinting and recommendation. The first component utilized throughout the process is the HRADIO Metadata Platform, which delivers metadata and bearer addresses of radio stations. The next component, the Data Collector, recorded 461 radio stations over 24 hours, generating a total of 266,239 audio samples. Following this step, we trained a deep neural autoencoder with auDeep in order to reduce each sample's dimensions. The Fingerprinter then clustered the samples and created a fingerprint for each service. The last component, the Recommender, is able to give recommendations depending on the Euclidean distance between fingerprints.

We evaluated the Recommender by analyzing the recommendation results and noticed similar genres for close radio stations. Additionally, we performed an archetypal analysis on the fingerprints, leading to 4 archetypes. By analyzing their closest radio stations, we noticed dissimilar genres between the archetypes and similar ones between the closest stations surrounding them. Finally, we compared whole-day fingerprints to a night-morning-day approach. The time of day influences recommendations for some services heavily, while others stay constant throughout 24 hours. In summary, this approach provides valuable recommendations solely on the basis of audio signals and without the need for additional metadata. For further evaluation, a comprehensive user study should follow to substantiate the high quality of recommendations.

6. REFERENCES

[1] Markus Friedrich, André Ebert, Carsten Hahn, Georg Schneider, Liza Obermeier, Alexander Erk, and Iris Jennes, “A distributed metadata platform for hybrid radio services,” in International Conference on Innovations for Community Services. Springer, 2019, pp. 166–183.

[2] James Bennett, Stan Lanning, et al., “The netflix prize,” in Proceedings of KDD cup and workshop. New York, NY, USA, 2007, vol. 2007, p. 35.

[3] Robin Burke, “Hybrid recommender systems: Survey and experiments,” User modeling and user-adapted interaction, vol. 12, no. 4, pp. 331–370, 2002.

[4] Francesco Ricci, Lior Rokach, and Bracha Shapira, “Introduction to recommender systems handbook,” in Recommender systems handbook, pp. 1–35. Springer, 2011.

[5] Shiyu Chang, Jiayu Zhou, Pirooz Chubak, Junling Hu, and Thomas Huang, “A space alignment method for cold-start tv show recommendations,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[6] Xuan Nhat Lam, Thuc Vu, Trong Duc Le, and Anh Duc Duong, “Addressing cold-start problem in recommendation systems,” in Proceedings of the 2nd international conference on Ubiquitous information management and communication. ACM, 2008, pp. 208–211.

[7] Lesly Alejandra Gonzalez Camacho and Solange Nice Alves-Souza, “Social network data to alleviate cold-start in recommender system: A systematic review,” Information Processing & Management, vol. 54, no. 4, pp. 529–544, 2018.

[8] Eli Pariser, The filter bubble: What the Internet is hiding from you, Penguin UK, 2011.

[9] Daniel Geschke, Jan Lorenz, and Peter Holtz, “The triple-filter bubble: Using agent-based modelling to test a meta-theoretical framework for the emergence of filter bubbles and echo chambers,” British Journal of Social Psychology, vol. 58, no. 1, pp. 129–149, 2019.

[10] Sander Dieleman, Philémon Brakel, and Benjamin Schrauwen, “Audio-based music classification with a pretrained convolutional network,” in 12th International Society for Music Information Retrieval Conference (ISMIR-2011). University of Miami, 2011, pp. 669–674.

[11] Sergio Oramas, Francesco Barbieri, Oriol Nieto, and Xavier Serra, “Multimodal deep learning for music genre classification,” Transactions of the International Society for Music Information Retrieval, vol. 1, no. 1, pp. 4–21, 2018.

[12] Grzegorz Gwardys and Daniel Grzywczak, “Deep image features in music information retrieval,” International Journal of Electronics and Telecommunications, vol. 60, no. 4, pp. 321–326, 2014.

[13] Beth Logan and Ariel Salomon, “A music similarity function based on signal analysis,” in ICME, 2001, pp. 22–25.

[14] Md Mahfuzur Rahman Siddiquee, Md Saifur Rahman, Shahnewaz Ul Islam Chowdhury, and Rashedur M Rahman, “Association rule mining and audio signal processing for music discovery and recommendation,” International Journal of Software Innovation (IJSI), vol. 4, no. 2, pp. 71–87, 2016.

[15] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas, “The earth mover's distance as a metric for image retrieval,” International journal of computer vision, vol. 40, no. 2, pp. 99–121, 2000.

[16] Zehra Çataltepe and Berna Altinel, “Music recommendation based on adaptive feature and user grouping,” in 2007 22nd international symposium on computer and information sciences. IEEE, 2007, pp. 1–6.

[17] Michael Freitag, Shahin Amiriparian, Sergey Pugachevskiy, Nicholas Cummins, and Björn Schuller, “audeep: Unsupervised learning of representations from audio with deep recurrent neural networks,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6340–6344, 2017.

[18] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al., “Scikit-learn: Machine learning in python,” Journal of machine learning research, vol. 12, no. Oct, pp. 2825–2830, 2011.

[19] S Aranganayagi and K Thangavel, “Clustering categorical data using silhouette coefficient as a relocating measure,” in International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007). IEEE, 2007, vol. 2, pp. 13–17.

[20] Adele Cutler and Leo Breiman, “Archetypal analysis,” Technometrics, vol. 36, no. 4, pp. 338–347, 1994.

[21] Manuel Eugster and Friedrich Leisch, “From spider-man to hero - archetypal analysis in r,” 2009.

[22] Ian Jolliffe, Principal component analysis, Springer, 2011.