UNSUPERVISED LEARNING OF DEEP FEATURES FOR MUSIC SEGMENTATION

Matthew C. McCallum

Gracenote Inc.

ABSTRACT performed on the basis of timbral and harmonic features in the con- text of spectral clustering [9], or in the context of time-translation Music segmentation refers to the dual problem of identifying bound- invariant features such as HMM state histograms [8]. In addition, a aries between, and labeling, distinct music segments, e.g., the cho- number of feature transformations have been considered with respect rus, verse, bridge etc. in popular music. The performance of a to clustering approaches, including non-negative matrix factoriza- range of music segmentation algorithms has been shown to be de- tions (NMFs) of audio features [11] or self-similarity matrices [12], pendent on the audio features chosen to represent the audio. Some as well as learned features from labelled segmentation data [5]. approaches have proposed learning feature transformations from In the context of supervised deep learning some work has ad- music segment annotation data, although, such data is time con- dressed the problem of music segmentation [13–15], where perfor- suming or expensive to create and as such these approaches are mance is likey bound by the limited annotated music segmentation likely limited by the size of their datasets. While annotated mu- data available. The largest effort to collect labeled music segmenta- sic segmentation data is a scarce resource, the amount of available tion data is in the SALAMI dataset [16], providing 2246 annotations music audio is much greater. In the neighboring field of semantic of music structure in 1360 audio files. This amount of data is small audio unsupervised deep learning has shown promise in improving in the context of deep learning, but promising results were obtained. the performance of solutions to the query-by-example and sound Recently, in the neighboring field of semantic audio, unsuper- classification tasks. In this work, unsupervised training of deep vised approaches towards embedding audio features have attracted feature embeddings using convolutional neural networks (CNNs) is attention in the literature [17–19]. In particular, the benefits of learn- explored for music segmentation. The proposed techniques exploit ing audio feature embeddings in an unsupervised manner have be- only the time proximity of audio features that is implicit in any come apparent in [17]. Because there is no requirement for labeled audio timeline. Employing these embeddings in a classic music data in training such embeddings, they can be trained on much larger segmentation algorithm is shown not only to significantly improve datasets and in turn, this has achieved impressive results when such the performance of this algorithm, but obtain state of the art perfor- pre-trained embeddings are employed in sound classification and mance in unsupervised music segmentation. query by example tasks [17]. Despite the promising results obtained Index Terms— Music information retrieval, Acoustic signal for tasks related to the identification of events in audio, little to no processing, Convolutional neural network, Deep learning investigation has been made into the use of such unsupervised audio feature embeddings for music content specifically. 1. INTRODUCTION 1.2. Contributions Music segmentation refers to the task of labeling distinct segments of music in a way that is similar to a human annotation. For exam- This work focuses on the unsupervised training of CNNs to obtain ple the chorus, verse, intro, outro and bridge in popular music.. The meaningful features for music segmentation methods. This is a natu- boundary between such segments may be due to a number of factors, ral progression of the application of modern machine learning meth- for example, a change in melody or chord progression, a change in ods to the problem of music segmentation for three reasons. rhythm, changes in instrumentation, dynamics, key or tempo. This Firstly, previous literature pays careful attention to obtaining task is generally evaluated with two classes of metrics. The first features that are representative of musical characteristics that are arXiv:2108.12955v1 [cs.SD] 30 Aug 2021 class, boundary detection, refers to the ability of the algorithm to perceptually important in identifying segment boundaries such as locate the locations of such boundaries in time. The second class, timbre or harmonic repetition [4–6, 9, 10, 20–24]. Whether or not segment labelling, refers to the labelling of segments where two seg- a given feature is important in identifying a segment boundary or la- ments that are disconnected in time are labelled as same or different bel may be genre or song specific and so a data based approach may based on their perceptual similarity [1–3]. be promising in either producing a representation that better general- izes across genres or in the least may be arbitrarily learned for each genre specifically. Several data based approaches have been inves- 1.1. Prior Work tigated for the music segmentation problem [5, 10, 13], with little to A number of techniques exist in the literature that address either the no work in the context of unsupervised deep learning. boundary detection [4–6], segment labelling [7] problems, or both Secondly, musical content has a structure that may be exploited simultaneously [8–11]. Methods addressing solely the former met- to further improve the machine learning methodologies employed in ric typically involve the generation of a novelty function via a self works such as [17–19]. The characteristics that define music such as similarity matrix (SSM) representation. Other approaches address- rhythm and harmonicity have not been exploited in these previous ing both of the aforementioned metrics focus on clustering audio works. warranting some investigation in the context of music. features based on characteristics that are expected to remain homo- Finally, labeled data for the problem of music segmentation is geneous within a given musical segment. Such clustering has been notoriously time consuming and/or expensive to produce [25], as such an unsupervised machine learning approach that can exploit R CQT windows are analyzed centered at times bi +r(bi+1 −bi)/R large amounts of unlabeled music data is highly desirable. The ap- for integers, r ∈ {0..R − 1}. These beat synchronized CQT repre- proach in this paper investigates deep learning in an unsupervised sentations are then aggregated across B beats providing Q = BR framework, where only the time locality and a comparative analy- time indices in the representation x[q, k] at the input to the CNN. sis of time local data is exploited. Such an approach overcomes the necessity to hand annotate data, and hence may be scaled to the full 3. SAMPLING extent of the available music data. No longer limited by the size of the training dataset, it is expected to generalize better than hand Motivated by the unreasonable effectiveness of data in deep learn- crafted features that rely on aspects such as timbre or repetition that ing [31], methods are proposed here that create noisy positive / neg- may be specific to certain music genres. ative examples exploiting the facts that a) musical segments form contiguous regions in a song’s timeline, and b) each distinct musical 2. AUDIO FEATURE EMBEDDING segment label typically occurs for the minority of a song’s timeline. Specifically, it is proposed to use the time proximity information The approach of here is to create an embedding for audio features by implicit in a song’s features - sampling features that occur close transforming an audio representation via a CNN into a domain that together or at a minimum distance apart for positive and negative ex- is representative of musical structure. These embeddings may be amples respectively. That is, an anchor beat index, ia, is selected via trained by employing a loss function, such as contrastive loss [26], a uniform sampling of the beat indices in a given song, {0..L − 1}. or triplet loss [27], that observes positive (similar) or negative (dis- Thereafter, a positive beat index, ip, is chosen uniformly sampled similar) pairs or triplets of examples, where a triplet represents an from beat indices {max(ia − δp, 0).. min(ia + δp,L − 1)}, and anchor and both a positive example and a negative example. The a negative beat index in is chosen as uniformly sampled across triplet loss function is often shown to result in superior performance two regions, {max(ia − δn,max, 0).. max(ia − δn,min, 0)}, and to contrastive loss, which has been argued to be due to the its relative {min(ia + δn,min,L − 1).. min(ia + δn,max,L − 1)}. An exam- nature [28]. That is, the triplet loss function is designed to provide ple of the positive and negative sampling distributions is shown in a positive gradient with respect to increasing distance between posi- Fig. 1. Intuitively, this results in an embedding in which clusters tive pairs or the anchor and positive example of a triplet, relative to represent features that frequently occur close together. This is useful the distance between negative pairs, up to a given margin. As such, for the problem of the structural segmentation of music as segment it is employed in this research. boundaries are typically infrequent enough to be described as rare If a transformation from an input feature, x[q, k] (or equivalently events, at least in comparison to lengths of the segments themselves. x for notational simplicity) to an embedding space, e.g., via a CNN, It is interesting to consider the rates of false positives and nega- K×Q D is described as f : R → R , then a Euclidean triplet loss with tives that result from the aforementioned sampling paradigm. Upon a given margin, α, and a mini-batch of training data consisting of the selection of ia it may be denoted to fall somewhere in the nth musi- T = xc , xc , xc c=C−1 set a p n c=0 may be described as, cal segment of class s and length ls,n. If δp < lsep, where lsep is the minimum number of indices between s and any identically labeled C−1 X h c c  2 c c 2 i segment, then the probability of the positive example being selected L(T ) = f (xa) − f xp − kf (xa) − f (xn)k2 + α , 2 + c=0 from a distinct segment, i.e., a false positive, is, (1) where subscript a, p and n represent an anchor, positive and negative  2δp−ls,n ls,n ≤ δp input example respectively.  ls,n  2 Ideally, in the context of music segmentation positive examples δp 3δp 1 P (FP|ls,n; δp) = 2 − 4l + 2 δp < ls,n < 2δp . will be composed of audio feature representations from the same 2ls,n s,n  δp  2δp ≤ ls,n music segment as the anchor, while negative examples will represent 4ls,n audio features from distinct music segments. Thus, once trained, Similarly, if δn,min < lsep and δn,max ≥ L, then the probabillity input features could be transformed via that CNN into a space where P m ls,m distinct clusters with respect to a Euclidean distance metric would of a false negative is simply L − (1 − P (FP|ls,n; δn,min)). represent distinct music segments. In an unsupervised context with In practice the aforementioned assumptions are realistic in many no prior information about the music segmentation, exact selection scenarios, but do not hold under all conditions. An empirical analy- of such examples is not possible, however strategies are described in sis of the rate of false positives is shown in Fig. 2. It is clear that with Section 3 that may improve this selection. an increasing δp, an increasing rate of false positives is observed, al- The properties that are significant in forming clusters represent- though it is important to note that smaller δp restricts the maximum ing music segments in the embedded space are learned from the data observed time separation between the anchor and positive example. at the CNN input. These embedded features may be representative With musical phrases often lasting 16 beats, it is reasonable to set of any of the musical qualities mentioned in Section 1, provided δp ≥ 16 to discourage distinct clustering of features within a sin- that these qualities are observable from the input features. Here, gle phrase. An empirical analysis false negatives is shown in Fig. 3. a constant-Q transform (CQT) [29] is used due to its ability to accu- There it can be seen that for the datasets shown δn,min > 28 and rately represent transient and harmonic audio qualities, its translation δn,max > 116 result in relatively low false negative rates. invariant representation of harmonic structures, and the promising While structural annotations are not available under the scope results observed for this feature specifically in [30]. of unsupervised learning, it is interesting to ask whether analysis of A time window of CQT data providing K frequency bins across signal features may be employed to decrease the false-negative or Q time windows forms a time frequency representation that may false-positive rate. It was shown in [7] that 2D Fourier magnitude c c c compose any of xa, xp or xn. In this work it is found significantly coefficients of HPCPs can be a useful feature in segment labeling. advantageous to synchronize the CQT analysis windows with the It is proposed here to use a similar feature to inform the sampling beat of the music. That is, if a beat at index i occurs at time bi, then of positive and negative examples. That is, for every selected ia, P (i ) i n a BeatlesTUT SALAMI a) %FN 0 L in δn,max δn,min P (ip)

δn,min 0 i δp p

Fig. 1. Sampling distributions for in and ip, for a given choice of ia. Δ%FN b)

%FP δn,min

δn,max δn,max δp

Fig. 2. Observed false positive (FP) rates in the BeatlesTUT (), and Fig. 3. Observed false negative (FN) rates for various δn,min and SALAMI (), datasets for varying δp. Dashed lines show results δn,max. Row (a) shows FN raters resulting from unbiased sampling, with unbiased sampling, and solid lines employ the comparitive 2D while row (b) represents the change in FN rates when employing the FFT technique described here. comparitive 2D FFT technique described in this paper. a comparison of the log-amplitude of the 2D Fourier transform of −(i−j)2 the log-amplitude of 8 beat long CQT segments is considered. The g[i, j] = sgn(i)sgn(j)e σ (3) Euclidean distances between these CQT segments centered at two −κ ≤ i, j ≤ κ times before ia, i.e., ia − 4 and ia − 16, and two times after ia, for is convolved along the diagonal of the SSM, i.e., ia + 4 and ia + 16 are considered. The side (“before” where κ κ X X i < ia or “after” where i > ia) with minimum Euclidean distance η[ν] = S¯[ν + i, ν + j]g[ν + i, ν + j] (4) is then assumed to be more likely in the same musical segment as ia i=−κ j=−κ and is chosen from which to sample ip while the opposite side may producing the novelty function η[ν]. Note that it was found advanta- be chosen from which to sample in, in addition to the constraints, geous to set g[i, j] to 0 where |i| ≤ 1 or |j| ≤ 1. Finally, boundaries δp, δn,min and δn,max above. The changes in false negative and false positive rates employing this sampling paradigm can be seen in are detected as peaks in the novelty function. Specifically, peaks at Figures 2 and 3, respectively, where some improvement is observed. which the peak-to-mean ratio exceeds a given threshold, τ, are se- lected as segment boundaries, i.e., where,

4. BOUNDARY DETECTION (2T + 1)η[ν] > τ. (5) PT η[ν + t] To evaluate the effectiveness of the music embeddings described in t=−T this work, the problem of music segment boundary detection is con- sidered. Perhaps the simplest and most well known method for mu- 5. RESULTS sic boundary detection is that of Foote [4]. While this has been sur- passed in performance by several methods, e.g., [6, 9], its simplicity For evaluation, two datasets are considered: the BeatlesTUT dataset makes it effective in demonstrating the utility of the audio embed- consisting of 174 hand annotated tracks from The Beatles cata- dings proposed here, as will be seen in the results in Section 5. logue [20], and the internet archive portion of the SALAMI dataset The SSM of the proposed features is computed as, (SALAMI-IA) consisting of 375 hand annotated recordings [16]. The former dataset is perhaps the most widely evaluated in the music segmentation literature. The latter is employed here for two S[i, j] = kf (x [q, k]) − f (x [q, k])k2 (2) i j 2 reasons, firstly the complete SALAMI dataset audio is not avail- where xi[q, k] and xj [q, k] correspond to beat synchronous CQT able to the author due to copyright restrictions, and secondly, the segments centered at beats i and j, respectively. SALAMI-IA dataset is particularly interesting as it consists pri- Note that this work endeavors to create features that are close marily of live recordings with many imperfections. It provides a with respect to Euclidean distance at any point within a music seg- dataset that is indicative of segmentation performance when there ment. If successful, the SSM of embedded features typically con- are mistakes either by musicians or recording engineers, resulting in tains block structures as opposed to the path structures typically rep- imperfect repetitions and distorted or noisy audio in many cases. resentative of repetition. These structures are evident in Fig. 4. It was For comparison, two baseline algorithms are included in the re- found beneficial in practice to perform median filtering on S[i, j] to sults as specified and measured in [30], note that these algorithms produce S¯[i, j] which reduces noise in the distances between em- too included beat-synchronized features. Firstly, the method of [4] bedded features while maintaining the aforementioned structures. is included as it most closely mirrors the algorithm of Section 4. To detect segment boundaries in S¯[i, j], a checkerboard kernel, Secondly, the algorithm of [6] is widely evaluated as having the best n eaie r qa.I rciei a on that found was it positives practice false In all not δ equal. so are and separation of negatives important, difficulty the and is is it that that examples argue between [27,28] 3, The and 3 hours. Figures 8 in to approximately set taking was train- epochs margin and 240 triplet epoch, over 1 place SALAMI-IA form took the min-batches ing in 256 songs dataset. all BeatlesTUT se- excluding and randomly songs, by 28,345 real-time from in lecting formed were tracks selected training, randomly During be to here. size experiments found of the was mini-batches in 5 employed Fig. is in perfor- and described high effective modern architecture most The on trained GPUs. be mance may it that so enough octaves. simple 6 and octave per bins 12 Hz, 40 of features frequency CQT minimum The a use 3. Section Trans- in Fourier described 2D sampling comparison the as form using features algo- but same an approach, the ”Beat-Synchronized” employing by ”Biased”, the inference and [32]); and to training similar rithm during successive (estimated between markers interpolated beat linearly suc- times CQT between at 128 centered ms employing windows 3.6 ”Beat-Synchronized”, of windows; size ”Unsyn- CQT hop algo- investigated, cessive constant all are a For to methods employing input here. three chronized”, presented the algorithms, results at proposed the used in the benchmark, are and and su- proposed provide [30], rithms, to in evaluated been performance have perior features CQT mu- segmentation. in detection sic boundary unsupervised to respect with performance employ layers ReLU convolutional a All employ layer activation. dense and convolutional Each chan- respectively. and nel frequency time, represent dimensions layer, convolutional 5 Fig. marked are boundaries lines. detected yellow both, with On Muse. by ”Starlight” song 4 Fig.

n,min

72 x 512 x 64 x 512 x 72 Beats 100 200 400 300 h tutr fCN mlydi hsppri hsnt be to chosen is paper this in employed CNNs of structure The Convolutional o onaydtcin meddfaue eeosre tev- at observed were features embedded detection, boundary For xml fmda lee S n oet ucinfrthe for function novelty and SSM filtered median of Example . rhtcueo h N mlydi xeiet.Freach For experiments. in employed CNN the of Architecture . 0

1 = Max Pooling 2x4 Pooling Max

and

δ 128 x 128 x 36

n,max Convolutional

C 96 = 3x4 Pooling Max 96 = α 0 = ossigo 6tilt rmec f6 of each from triplets 16 of consisting

rvd pia results. optimal provide

. 256 x 32 x 12

1

ept h bevderrrates error observed the Despite . Convolutional Max Pooling 2x4 Pooling Max

6 × 128 Layer Dense 4

kernels.

Dense Layer 128 Layer Dense δ 128 Layer Linear p

16 = L2 Normalization L2 , lee sn n88wno.Tecekror enlwsconfig- was kernel checkerboard with The ured window. median 8x8 were any an representations by using SSM disadvantaged filtered not resolution). is time method in this shortcoming ensure a at to beat BPM per 120 typical approximately is (this sec- ”Unsynchronized” 0.2484 for every once onds or approaches, beat-synchronized for beat ery rnfr oprtv apigo eto )t ute improve further to 3) Fourier Section 2D performance. of may the rhythm, in sampling feature, (and comparitive musical sampling Transform beat-synchronized common perfor- in a art exploited that the be found of state was obtain It can mance. algorithm algorithm, this segmentation employ- of music by performance traditional task that the a shown the in was in embeddings it such em- utility particular, ing In music their of to segmentation. music training respect of unsupervised with investigated the were for beddings methods work, this In afore- the of robustness. many some contains providing which imperfections data, mentioned features training of of proximity the clustering time from the embedding, observed on performance proposed based simply the a performed In is noise, features expected. by be corrupted might or these drop when music imperfect, patterns, algorithm repetition / the become in audio Because changes patterns detect of dataset. to quality designed SALAMI is poor the [6] of of the result portion to This this due in be data to adjustment. postulated parameter ob- be additional is might art any a the of without dataset, state a served SALAMI-IA the over such the improvement [4], on performance Foote significant Furthermore, unsupervised of in art that segmentation. the of music to state similar the with competitive algorithm becomes method an in features deep data. unseen to generalize algorithm’s to so detection ability and boundary that per- only, the dataset displays was here dataset BeatlesTUT SALAMI-IA parameters noted the the on of be results selection should observing any It by formed algorithms, respectively. shown proposed al- 2, are the reference Table dataset for and and SALAMI-IA 1 proposed are and Table the level BeatlesTUT in tolerance of the second evaluation for 3 The gorithms the at [1]. rate employed hit detection boundary the size window factor [4] Algorithm [4] Algorithm isdSampling Biased Beat-Synchronized Unsynchronized [6] Sampling Biased Beat-Synchronized Unsynchronized [6] ti neetn oseta ipyb mlyn h proposed the employing by simply that see to interesting is It of recall and precision F-measure, trimmed the evaluation, For al 2 Table al 1 Table σ efrac erc o h AAIAdataset SALAMI-A the for metrics Performance . efrac erc o h eteTTdataset BeatlesTUT the for metrics Performance . 18 = . T 5 10 = and .CONCLUSION 6. 0.533 0.535 0.497 0.493 0.446 F-Measure 0.662 0.648 0.597 0.651 0.503 F-Measure κ n threshold and 40 = ± ± ± ± ± ± ± ± ± ± 0.16 0.15 0.16 0.17 0.17 0.17 0.17 0.17 0.17 0.18 n o ekpcig h crest the picking, peak for and 0.491 0.491 0.429 0.454 0.457 Precision 0.663 0.647 0.589 0.622 0.579 Precision τ 1 = ± ± ± ± ± ± ± ± ± ± . 35 0.21 0.20 0.18 0.20 0.21 0.20 0.20 0.19 0.19 0.21 a employed. was 0.656 0.660 0.653 0.595 0.483 Recall 0.691 0.677 0.625 0.708 0.461 Recall ± ± ± ± ± ± ± ± ± ± 0.16 0.16 0.15 0.19 0.19 0.19 0.18 0.17 0.19 0.17 7. REFERENCES [17] Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel PW Ellis, Shawn Hershey, Jiayang Liu, R Channing Moore, and Rif A [1] Colin Raffel, Brian McFee, Eric J Humphrey, Justin Salamon, Saurous, “Unsupervised learning of semantic audio represen- Oriol Nieto, Dawen Liang, Daniel PW Ellis, and C Colin Raf- tations,” in Acoustics, Speech and Signal Processing (ICASSP), fel, “mir eval: A transparent implementation of common MIR IEEE International Conference on, 2018, pp. 126–130. metrics,” in ISMIR, 2014, pp. 367–372. [18] Justin Salamon and Juan Pablo Bello, “Unsupervised feature [2] Oriol Nieto, Morwaread M Farbood, Tristan Jehan, and learning for urban sound classification,” in Acoustics, Speech Juan Pablo Bello, “Perceptual analysis of the F-measure for and Signal Processing (ICASSP), IEEE International Confer- evaluating section boundaries in music,” in ISMIR, 2014, pp. ence on, 2015, pp. 171–175. 265–270. [19] Justin Salamon and Juan Pablo Bello, “Feature learning with [3] Hanna M Lukashevich, “Towards quantitative measures of deep scattering for urban sound analysis,” in EUSIPCO, 2015, evaluating song segmentation,” in ISMIR, 2008, pp. 375–380. pp. 724–728. [4] Jonathan Foote, “Automatic audio segmentation using a mea- [20] Jouni Paulus and Anssi Klapuri, “Music structure analysis by sure of audio novelty,” in Multimedia and Expo (ICME). IEEE finding repeated parts,” in Proceedings of the 1st ACM work- International Conference on, 2000, pp. 452–455. shop on Audio and music computing multimedia, 2006, pp. 59– [5] Brian McFee and Daniel PW Ellis, “Learning to segment songs 68. with ordinal linear discriminant analysis,” in Acoustics, Speech [21] Wei Chai, “Semantic segmentation and summarization of mu- and Signal Processing (ICASSP), IEEE International Confer- sic: methods based on tonality and recurrent structure,” IEEE ence on, 2014, pp. 5197–5201. Signal Processing Magazine, vol. 23, no. 2, pp. 124–132, 2006. [6] Joan Serra, Meinard Muller,¨ Peter Grosche, and Josep Ll Ar- [22] Matija Marolt, “A mid-level melody-based representation for cos, “Unsupervised music structure annotation by time series calculating audio similarity,” in ISMIR, 2006, pp. 280–285. structure features and segment similarity,” IEEE Transactions on Multimedia, vol. 16, no. 5, pp. 1229–1240, 2014. [23] Kristoffer Jensen, “Multiple scale music segmentation using rhythm, timbre, and harmony,” EURASIP Journal on Applied [7] Oriol Nieto and Juan Pablo Bello, “Music segment similarity Signal Processing, vol. 2007, no. 1, pp. 159–159, 2007. using 2D-Fourier magnitude coefficients,” in Acoustics, Speech and Signal Processing (ICASSP), IEEE International Confer- [24] Johan Pauwels, Florian Kaiser, and Geoffroy Peeters, “Com- ence on, 2014, pp. 664–668. bining harmony-based and novelty-based approaches for struc- tural segmentation,” in ISMIR, 2013, pp. 601–606. [8] Mark Levy and Mark Sandler, “Structural segmentation of mu- sical audio by constrained clustering,” IEEE transactions on [25] Cheng-i Wang, Gautham J Mysore, and Shlomo Dubnov, “Re- audio, speech, and language processing, vol. 16, no. 2, pp. visiting the music segmentation problem with crowdsourcing,” 318–326, 2008. in ISMIR, 2017, pp. 738–744. [9] Brian McFee and Dan Ellis, “Analyzing song structure with [26] Raia Hadsell, Sumit Chopra, and Yann LeCun, “Dimension- spectral clustering,” in ISMIR, 2014, pp. 405–410. ality reduction by learning an invariant mapping,” in In Proc. Computer Vision and Pattern Recognition Conference (CVPR), [10] Ron J Weiss and Juan Pablo Bello, “Unsupervised discovery of 2006, pp. 1735–1742. temporal structure in music,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1240–1251, 2011. [27] Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clus- [11] Oriol Nieto and Tristan Jehan, “Convex non-negative matrix tering,” in Proceedings of the IEEE conference on computer factorization for automatic music structure identification,” in vision and pattern recognition, 2015, pp. 815–823. Acoustics, Speech and Signal Processing (ICASSP), IEEE In- ternational Conference on, 2013, pp. 236–240. [28] Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenb¨ uhl,¨ “Sampling matters in deep embedding learning,” [12] Florian Kaiser and Thomas Sikora, “Music structure discovery in Proc. IEEE International Conference on Computer Vision in popular music using non-negative matrix factorization,” in (ICCV), 2017. ISMIR, 2010, pp. 429–434. [29] Christian Schorkhuber¨ and Anssi Klapuri, “Constant-Q trans- [13] Karen Ullrich, Jan Schluter,¨ and Thomas Grill, “Boundary de- form toolbox for music processing,” in 7th Sound and Music tection in music structure analysis using convolutional neural Computing Conference, Barcelona, Spain, 2010, pp. 3–64. networks,” in ISMIR, 2014, pp. 417–422. [30] Oriol Nieto and Juan Pablo Bello, “Systematic exploration of [14] Jan Schluter¨ and Sebastian Bock,¨ “Improved musical onset computational music structure research.,” in ISMIR, 2016, pp. detection with convolutional neural networks,” in Acoustics, 547–553. Speech and Signal Processing (ICASSP), IEEE International Conference on, 2014, pp. 6979–6983. [31] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta, “Revisiting unreasonable effectiveness of data in deep [15] Thomas Grill and Jan Schluter, “Music boundary detection learning era,” in Proc. IEEE International Conference on Com- using neural networks on spectrograms and self-similarity lag puter Vision (ICCV), 2017, pp. 843–852. matrices,” in EUSIPCO, 2015, pp. 1296–1300. [32] Anssi P Klapuri, Antti J Eronen, and Jaakko T Astola, “Anal- [16] Jordan Bennett Louis Smith, John Ashley Burgoyne, Ichiro Fu- ysis of the meter of acoustic musical signals,” IEEE Transac- jinaga, David De Roure, and J Stephen Downie, “Design and tions on Audio, Speech, and Language Processing, vol. 14, no. creation of a large-scale database of structural annotations,” in 1, pp. 342–355, 2006. ISMIR, 2011, pp. 555–560.