
Modeling the Music Genre Perception across Language-Bound Cultures

Elena V. Epure¹, Guillaume Salha¹,², Manuel Moussallam¹, Romain Hennequin¹
¹ Deezer Research, Paris, France
² LIX, École Polytechnique, Palaiseau, France
[email protected]

Abstract

The music genre perception expressed through human annotations of artists or albums varies significantly across language-bound cultures. These variations cannot be modeled as mere translations since we also need to account for cultural differences in the music genre perception. In this work, we study the feasibility of obtaining relevant cross-lingual, culture-specific music genre annotations based only on language-specific semantic representations, namely distributed concept embeddings and ontologies. Our study, focused on six languages, shows that unsupervised cross-lingual music genre annotation is feasible with high accuracy, especially when combining both types of representations. This approach of studying music genres is the most extensive to date and has many implications in musicology and music information retrieval. Besides, we introduce a new, domain-dependent cross-lingual corpus to benchmark state-of-the-art multilingual pre-trained embedding models.

1 Introduction

A prevalent approach to culturally study music genres starts with a common set of music items, e.g. artists, albums, tracks, and assumes that the same music genres would be associated with the items in all cultures (Ferwerda and Schedl, 2016; Skowron et al., 2017). However, music genres are subjective. Cultures themselves and individual musicological backgrounds influence the music genre perception, which can differ among individuals (Sordo et al., 2008; Lee et al., 2013). For instance, a Westerner may relate funk to soul and disco, while a Brazilian to baile funk, which is a type of rap (Hennequin et al., 2018). Thus, accounting for cultural differences in music genres' perception could give a more grounded basis for such cultural studies. However, ensuring both a common set of music items and culture-sensitive annotations with broad coverage of music genres is strenuous (Bogdanov et al., 2019).

To address this challenge, we study the feasibility of cross-culturally annotating music items with music genres, without relying on a parallel corpus. In this work, culture is related to a community speaking the same language (Kramsch and Widdowson, 1998). The specific research question we build upon is: assuming consistent patterns of music genre association with music items within cultures, can a mapping between these patterns be learned by relying on language-specific semantic representations? It is worth noting that, since music genres fall within the class of Culture-Specific Items (Aixelá, 1996; Newmark, 1988), cross-lingual annotation, in this case, cannot be framed as standard translation, as one also needs to model the dissimilar perception of music genres across cultures.

Our work focuses on four language families, Germanic (English-en and Dutch-nl), Romance (Spanish-es and French-fr), Japonic (Japanese-ja) and Slavic (Czech-cs), and on two types of language-specific semantic representations, ontologies and multi-word expression embeddings.

First, ontologies are often used to represent music genres, showing how they relate conceptually (Schreiber, 2016). We identify Wikipedia [1], the online multilingual encyclopedia, to be particularly relevant to our study. It extensively documents worldwide music genres, relating them through a coherent set of relation types across languages (e.g. derivative genre, sub-genre). Though the relation types are the same per language, the actual music genres and the way they are related can differ. Indeed, Pfeil et al. (2006) have shown that Wikipedia contributions expose cultural differences aligned with the ones in the physical world.

Second, music genres can be represented from a distributional semantics perspective.

[1] https://en.wikipedia.org

Word vector spaces are generated from large corpora following the distributional hypothesis, i.e. words with similar contexts have akin meanings. As languages are passed on culturally, we assume that the language-specific corpora used to create these spaces are sufficient to convey concepts' cultural specificity into their vector representations. In our study, we focus on multiple recent multilingual pre-trained models to generate word or sentence [2] embeddings (Arora et al., 2017; Grave et al., 2018; Artetxe and Schwenk, 2019; Devlin et al., 2019; Lample and Conneau, 2019), to account for variances in the used corpora or model designs.

Lastly, we combine the semantic representations by retrofitting distributed music genre embeddings to music genre ontologies. Retrofitting (Faruqui et al., 2015) modifies each concept embedding such that the representation is still close to the distributed one, but also encodes ontology information. Initially, we retrofit music genres per language, using monolingual ontologies. Then, by partially aligning these ontologies, we apply retrofitting to learn multilingual embeddings from scratch.

The results show that we can model the cross-lingual music genre annotation with high accuracy by combining both types of language-specific semantic representations. When comparing the representations derived from multilingual pre-trained models, the smooth inverse frequency averaging (Arora et al., 2017) of aligned word embeddings outperforms the state-of-the-art approaches. To our knowledge, this simple method has rarely been used to embed multilingual sentences (Vargas et al., 2019), and we hypothesize its potential as a strong baseline on other cross-lingual datasets and tasks too. Finally, embedding learning based on retrofitting leads to better multilingual music genre representations than when inferred with pre-trained embedding models. This opens the possibility to learn embeddings for rare music genres or languages when aligned music genre ontologies are available.

Summing up, our contributions are: 1) a study on how effective language-specific semantic representations of music genres are for modeling cross-lingual annotation, without relying on a parallel music item corpus; 2) an extensive evaluation of multilingual pre-trained embedding models to derive representations for multi-word concepts in the music domain. Our study can enable complete musicological research, but also localized music information retrieval. This latter application is crucial for online music streaming platforms that leverage music genre annotations to provide worldwide, user-personalized music recommendations.

Our domain-specific study complements other works benchmarking general-language sentence representations (Conneau et al., 2018). Finally, we provide an in-depth formal analysis of the retrofitting part of our method. We prove the strict convexity of retrofitting and show that the ontology concepts' final embeddings converge to the same values despite the order in which concepts are iteratively updated, on condition that we know a single initial node embedding in each connected component of the ontology.

[2] Music genres can be multi-word expressions. Previous work (Shwartz and Dagan, 2019a) successfully embeds phrases with sentence embedding models. In particular, contextualized language models result in more meaningful representations for diverse composition tasks.

2 Related Work

Music genres are conceptual representations encompassing a set of conventions between the music industry, artists, and listeners about individual music styles (Lena, 2012). From a cultural perspective, it has been shown that there are differences in how people listen to music genres (Ferwerda and Schedl, 2016; Skowron et al., 2017). Average listening habits in some countries span across many music genres and are less diverse in other countries (Ferwerda and Schedl, 2016). Also, cultural dimensions proved strong predictors for the popularity of specific music genres (Skowron et al., 2017).

Despite the apparent agreement on the music style for which the music genres stand, conveyed in the earlier definition and implied in the related works too, music genres are subjective concepts (Sordo et al., 2008; Lee et al., 2013). To address this subjectivity, Bogdanov et al. (2019) proposed a dataset of music items annotated with English music genres by different sources. In this line of work, we address the divergent perception of music genres. Still, we focus on multilingual, unsupervised music genre annotation without relying on content features, i.e. audio or lyrics. We also complement similar studies in other domains (art: Eleta and Golbeck, 2012) with another research method.

Then, in the literature, there are other works that benchmark pre-trained word and sentence embedding models. van der Heijden et al. (2019) compare multilingual contextual language models for named-entity recognition and part-of-speech tagging.

Shwartz and Dagan (2019b) use multiple static and contextual word embeddings to represent multi-word expressions and assess their capacity to capture meaning shift and implicit meaning in compositionality. Conneau et al. (2018) formulate a new task to evaluate cross-lingual sentence representations centered on natural language inference.

Compared to these works, our benchmark is aimed at cross-lingual annotation; we target a specific domain, music, for which we try concept embedding adaptation with retrofitting; and we also test a multilingual sentence representation obtained with smooth inverse frequency averaging of multilingual word embeddings. As discussed in a recent survey on cross-lingual word embedding models (Ruder et al., 2019), there is a need to unlock domain-specific data to assess if general-language sentence representations are also accurate across domains. Our work builds towards this goal.

Language   nl      fr      es      cs     ja
en         12604   28252   32891   4772   14752
nl                 7139    7689    1885   3426
fr                         15616   3046   8622
es                                 3245   7644
cs                                        2065

Table 1: Number of music items per language pair.

Language   Corpus (Avg. per item)   Ontology
en         558 (2.12 ± 1.34)        10748
nl         204 (1.71 ± 1.06)        1529
fr         364 (1.75 ± 1.06)        2905
es         525 (2.11 ± 1.34)        3988
cs         133 (2.23 ± 1.34)        1418
ja         192 (1.51 ± 1.11)        1609

Table 2: Number of unique music genres in the corpus (Section 3.2) and in the ontology (Section 4.1).

3 Cross-lingual Music Genre Annotation

Further, we formalize the cross-lingual annotation task and the strategy to evaluate it in Section 3.1. We describe the test corpus used in this work, together with its collection procedure, in Section 3.2.

3.1 Problem Formalization

The cross-lingual music genre annotation consists of inferring, for music items, tags in a target language Lt, knowing tags in a source language Ls. For instance, knowing the English music genres of a music item (among them, alternative rock), the goal is to predict its Spanish tags (among them, rock alternativo). As shown in the example, but also in Section 1, the problem goes beyond translation and instead targets a model able to map concepts, potentially dissimilar, across languages and cultures.

Formally, given S a set of tags in language Ls, P(S) the partitions of S and T a set of tags in language Lt, a mapping scoring function f : P(S) → IR^{|T|} can attribute a prediction score to each target tag, relying on subsets of source tags drawn from S (Hennequin et al., 2018; Epure et al., 2019, 2020). The produced score incorporates the degree of relatedness of each particular input source tag to the target tag. A common approach to compute relatedness in distributional semantics relies on cosine similarity. Thus, for {s_1, ..., s_K} source tags and any target tag t, f can be defined as:

    f_t(\{s_1, s_2, \dots, s_K\}) = (1/K) \sum_{k=1}^{K} s_k^T t / (\|s_k\|_2 \|t\|_2),    (1)

where \|·\|_2 is the Euclidean norm.

3.2 Test Corpus

Wikipedia records worldwide music artists and their discographies, with a frequent mentioning of their music genres. By manually checking the Wikipedia pages of miscellaneous music items, we observed that their music genres vary significantly across languages. For instance, Knights of Cydonia, a single by Muse, was annotated in Spanish as progressive rock, while in Dutch as progressive metal, among other genres. In Figure 1, we show another example of different annotations in English, Spanish, and Japanese from Wikipedia infoboxes. As Wikipedia writing is localized, contributors' culture can lead to differences in the multilingual content on the same topic (Pfeil et al., 2006), particularly for subjective matters. Thus, Wikipedia was a suitable source for assembling the test corpus.

Using DBpedia (Auer et al., 2007) as a proxy to Wikipedia, we collected music items such as artists and albums, annotated with music genres in at least two of the six languages (en, nl, fr, es, cs and ja). We targeted the MusicalWork, MusicalArtist and Band DBpedia resource types, and we only kept music items that were annotated with music genres which appeared at least 15 times in the corpus. Our final corpus includes 63246 music items. The number of annotations for each language pair is presented in Table 1. We also show in Table 2 the number of unique music genres per language in the corpus and the average number of tags per music item.
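For concreteness, the mapping of Equation 1 reduces to a normalized matrix product over pre-computed tag embeddings. The following is a minimal sketch; the function and variable names are ours, not from the paper's released code:

```python
import numpy as np

def score_targets(source_vecs, target_vecs):
    """Equation 1: for K source-tag embeddings (rows of source_vecs,
    shape (K, d)) and |T| target-tag embeddings (rows of target_vecs,
    shape (|T|, d)), return the average cosine similarity between the
    source tags and each target tag, i.e. one score per target tag."""
    s = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    return (s @ t.T).mean(axis=0)
```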

Figure 1: Wikipedia infoboxes of Puerto Rican artist Milton Cardona, in English (en), Spanish (es) and Japanese (ja). Some music genre annotations are culture-specific, such as World/ワールドミュージック, which is present for en and ja but not for es, or 実験音楽 (Experimental Music) for ja only.

The en and es languages use the most diverse tags. This can be because more annotations exist in these languages, in comparison to cs, which has the least annotations and the least diverse tags. However, the mean number of tags per item appears relatively high for cs, while ja has the smallest mean number of tags per item.

4 Language-specific Semantic Representations for Music Genres

This work aims to assess the possibility of obtaining relevant cross-lingual music genre annotations, able to capture cultural differences too, by relying on language-specific semantic representations. Two types of semantic representations are investigated given their popularity: ontologies to represent music genre relations (presented in Section 4.1) and distributed embeddings to represent multi-word expressions in general (presented in Section 4.2). In contrast to this unsupervised approach, mapping patterns of associating music genres with music items across cultures could have also been enabled with a parallel corpus. However, gathering a corpus that includes all music genres for each pair of languages is challenging.

4.1 Music Genre Ontology

Conceptually, music genres are interconnected entities. For example, rap west coast is a sub-genre of hiphop, or música electrónica is the origin of synthpunk. Academic and practitioner communities often use ontologies or knowledge graphs to represent music genre relations and enrich the music genre definitions (Schreiber, 2016; Lisena et al., 2018). As mentioned in Section 1, we use in this study Wikipedia-based music genre ontologies, because the multilingual Wikipedia contributions on the same topic can differ and these differences have been proven aligned with the ones in the physical world (Pfeil et al., 2006).

We further describe how we crawl the Wikipedia-based music genre ontologies for the six languages by relying on DBpedia. For each language, first, we constitute the seed list using two sources: the DBpedia resources of type MusicGenre and their aliases linked through the wikiPageRedirects relation; and the music genres discovered when collecting the test corpus (introduced in Section 3.2) together with their aliases. Then, music genres are fetched by visiting the DBpedia resources linked to the seeds through the relations wikiPageRedirects, musicSubgenre, stylisticOrigin, musicFusionGenre and derivative [3]. The seed list is updated each time, allowing the crawling to continue until no new resource is found.

[3] We present these relations by their English names, which may be translated in the DBpedia versions in other languages.
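The crawl described above is a breadth-first expansion of the seed list. A compact sketch follows; `fetch_related` is a hypothetical helper standing in for the actual DBpedia lookups (e.g. SPARQL queries), which the paper does not detail:

```python
from collections import deque

def crawl_genre_ontology(seeds, fetch_related):
    """Breadth-first expansion of the seed list: fetch_related(g) must
    return (relation, neighbor) pairs for genre g. Crawling stops when
    no new music genre resource is found."""
    discovered, queue = set(seeds), deque(seeds)
    edges = []
    while queue:
        genre = queue.popleft()
        for relation, neighbor in fetch_related(genre):
            edges.append((genre, relation, neighbor))
            if neighbor not in discovered:
                discovered.add(neighbor)
                queue.append(neighbor)
    return discovered, edges

# Toy illustration with an in-memory graph instead of live DBpedia calls.
toy = {"Funk": [("stylisticOrigin", "Soul"), ("derivative", "Baile_funk")],
       "Soul": [], "Baile_funk": []}
genres, edges = crawl_genre_ontology(["Funk"], lambda g: toy.get(g, []))
```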

In DBpedia, resources are sometimes linked to their equivalents in other languages through the relation sameAs. For most experiments, we rely on monolingual music genre ontologies. However, we also collect the cross-lingual links between music genres to include a translation baseline for cross-lingual annotation, i.e. for each music genre in a source language, we predict its equivalent in a target language using DBpedia. Besides, we try to learn aligned embeddings from scratch by relying on these partially aligned music genre ontologies, as will be discussed in Section 4.3.

The number of unique Wikipedia music genres discovered in each language is presented in Table 2. Let us note that the graph numbers are much larger than the test corpus numbers, emphasizing the challenge to constitute a parallel corpus that covers all language-specific music genres.

4.2 Music Genre Distributed Representations

As music genres are multi-word expressions, we make use of existing sentence representation models. We also inquire into word vector spaces and obtain sentence embeddings by hypothesizing that music genres are generally compositional, i.e. the sense of a multi-word expression is conveyed by the sense of each composing word (e.g. West Coast rap, jazz blues; there are also non-compositional examples like hard rock). We set our investigation scope to multilingual pre-trained embedding models, and we consider both static and contextual word/sentence representations, as described next.

Multilingual Static Word Embeddings. The classical word embeddings we study are the multilingual fastText word vectors trained on Wikipedia and Common Crawl (Grave et al., 2018). The model is an extension of the Continuous Bag of Words model (CBOW, Mikolov et al., 2013), which includes subword and word position information. The fastText word vectors are trained in distinct languages.
Thus, we must ensure that the monolingual word vectors are projected in the same space for cross-lingual annotation. We perform the alignment with the method proposed by Joulin et al. (2018), which treats word translation as a retrieval task and introduces a new loss relying on a relaxed cross-domain similarity local scaling criterion.

Multilingual Contextual Word Embeddings. Contextual word embeddings (Peters et al., 2017; Devlin et al., 2019), in contrast to the classical ones, are dynamically inferred based on the given context sentence. This type of embedding can address polysemy, as the word sense is disambiguated through the surrounding text. In our work, we include two recent contextualized language models compatible with the multilingual scope: multilingual Bidirectional Encoder Representations from Transformers (BERT, Devlin et al., 2019) and the Cross-lingual Language Model (XLM, Lample and Conneau, 2019).

BERT (Devlin et al., 2019) is trained to jointly predict a masked word in a sentence and whether two sentences are successive text segments. Similar to fastText (Grave et al., 2018), subword and word position information is also used. An input sentence is tokenized against a limited token vocabulary with a modified version of the byte pair encoding algorithm (BPE, Sennrich et al., 2016). Multilingual BERT is trained as a single-language model, fed with 104 concatenated monolingual Wikipedias (Pires et al., 2019; Wu and Dredze, 2019).

XLM (Lample and Conneau, 2019) has a similar architecture to BERT. It also shares with BERT one training objective, the masked word prediction, and the tokenization using BPE, but applied to sentences differently sampled from each monolingual Common Crawl corpus. Compared to BERT, two other objectives are introduced: to predict a word from previous words, and to predict a masked word by leveraging two parallel sentences. Thus, to train XLM, several multilingual aligned corpora are used (Lample and Conneau, 2019).

Multilingual Sentence Embeddings. Contextualized language models can be exploited in multiple ways. First, as Lample and Conneau (2019) show, by training the transformers on multilingual data, cross-lingual word vectors are obtained in an unsupervised way. The word vectors can be accessed through the model lookup table. These embeddings are merely aligned but not contextual, thus directly comparable to fastText. For these three types of cross-lingual non-contextual word embeddings, fastText (FT), the multilingual BERT lookup table (mBERT) and the XLM lookup table (XLM), we compute the sentence embedding using the standard average (avg) or the smooth inverse frequency averaging (sif) introduced by Arora et al. (2017).

Formally, let c denote a music genre composed of multiple tokens {t_1, t_2, ..., t_M}, t_m the embedding of each token t_m initialized from a given pre-trained embedding model or to the d-dimensional [4] null vector 0_d if t_m is absent from the model vocabulary, and \hat{q}_i ∈ IR^d the representation of c which we want to infer. The avg strategy computes \hat{q}_i as (1/M) \sum_{m=1}^{M} t_m. The sif strategy computes \hat{q}_i as:

    q_i = (1/M) \sum_{m=1}^{M} (a / (a + f_{t_m})) t_m    (2)

    \hat{q}_i = q_i - u u^T q_i    (3)

where f_{t_m} is the frequency of t_m, a is a hyperparameter usually fixed to 10^{-3} (Arora et al., 2017), and u is the first singular vector obtained through the singular value decomposition (Golub and Reinsch, 1971) of Q, the embedding matrix computed with Equation 2 for all music genres. Vocabulary tokens of pre-trained embedding models are usually sorted by decreasing frequency in the training corpus, i.e. the lower the rank, the more frequent the token. Thus, based on Zipf's law (Zipf, 1949), f_{t_m} can be approximated by 1/z_{t_m}, z_{t_m} being the rank of t_m. The intuition of this simple sentence embedding method is that uncommon words are semantically more informative.

Second, contextualized language models can be used as feature extractors representing sentences from the contextual embeddings of the associated tokens. Multiple strategies exist to retrieve contextual token embeddings: to use the embedding layer or the last hidden layer, or to apply min or max pooling over time (Devlin et al., 2019). To infer a fixed-length representation of a multi-word music genre, we try max and mean pooling over token embeddings (Lample and Conneau, 2019; Reimers and Gurevych, 2019), obtained with the diverse strategies mentioned before. We denote these sentence embeddings XLMCtxt and mBERTCtxt.

The contextualized language models can be further fine-tuned for particular downstream tasks, yielding better sentence representations (Eisenschlos et al., 2019; Lample and Conneau, 2019). Existing evaluations of cross-lingual sentence representations are centered on natural language inference (XNLI, Conneau et al., 2018) or classification (Eisenschlos et al., 2019). The cross-lingual music genre annotation would be closer to the XNLI task; hence we could fine-tune the pre-trained models on a parallel corpus of music genre translations or music genre annotations. However, our research investigates language-specific semantic representations. Also, using translated music genres would not model their different perception across cultures, while obtaining an exhaustive corpus of cross-lingual annotations is challenging.

Last, we explore LASER, a universal language-agnostic sentence embedding model (Artetxe and Schwenk, 2019). The model is based on a BiLSTM encoder trained on corpora in 93 languages to learn multilingual fixed-length sentence embeddings. As in other models, sentences are tokenized against a fixed vocabulary, obtained with BPE from the concatenated multilingual corpora. LASER appears highly effective without requiring task-specific fine-tuning (Artetxe and Schwenk, 2019).

[4] In Appendix B, we report d for each embedding type.

Pair    GTrans       DBpSameAs    mBERTavg     FTsif        XLMCtxt      LASER        RfituΩ FTsif
en-nl   59.9 ± 0.3   72.2 ± 0.2   86.2 ± 0.2   86.5 ± 0.1   85.4 ± 0.3   80.9 ± 0.1   90.0 ± 0.1
en-fr   58.4 ± 0.1   70.0 ± 0.2   85.2 ± 0.3   87.4 ± 0.3   86.6 ± 0.2   82.0 ± 0.2   90.8 ± 0.2
en-es   56.9 ± 0.0   65.4 ± 0.2   83.8 ± 0.1   86.9 ± 0.2   85.4 ± 0.3   81.8 ± 0.1   89.9 ± 0.1
en-cs   60.6 ± 0.6   78.4 ± 0.6   88.2 ± 0.4   88.6 ± 0.4   89.0 ± 0.5   85.4 ± 0.3   90.4 ± 0.3
en-ja   60.9 ± 0.1   70.4 ± 0.2   72.4 ± 0.2   80.8 ± 0.3   74.2 ± 0.3   73.0 ± 0.1   86.7 ± 0.3
nl-en   53.5 ± 0.1   56.7 ± 0.3   74.6 ± 0.3   79.8 ± 0.4   74.2 ± 0.7   70.5 ± 0.1   84.3 ± 0.1
nl-fr   54.4 ± 0.2   60.0 ± 0.4   76.1 ± 0.3   79.3 ± 0.8   74.9 ± 1.0   72.6 ± 0.5   81.5 ± 0.7
nl-es   53.1 ± 0.2   56.8 ± 0.2   75.0 ± 0.4   77.7 ± 0.5   73.5 ± 0.3   71.0 ± 0.3   80.5 ± 0.4
nl-cs   57.8 ± 0.1   50.0 ± 0.0   80.8 ± 0.4   80.6 ± 0.2   79.0 ± 0.4   76.8 ± 0.2   83.4 ± 0.5
nl-ja   57.5 ± 0.4   62.7 ± 0.4   65.8 ± 2.5   74.9 ± 1.0   66.0 ± 0.6   68.5 ± 0.2   80.0 ± 0.7
fr-nl   58.6 ± 0.1   65.3 ± 0.4   79.4 ± 0.1   81.9 ± 0.4   78.6 ± 0.1   74.3 ± 0.4   84.7 ± 0.3
fr-en   55.3 ± 0.0   59.7 ± 0.2   77.2 ± 0.5   83.0 ± 0.2   78.8 ± 0.5   74.2 ± 0.4   87.7 ± 0.1
fr-es   54.1 ± 0.1   59.0 ± 0.1   77.5 ± 0.4   81.8 ± 0.3   78.7 ± 0.5   75.4 ± 0.6   85.3 ± 0.2
fr-cs   59.1 ± 0.3   70.0 ± 0.6   82.7 ± 0.6   83.9 ± 0.4   83.1 ± 0.2   80.2 ± 0.5   87.2 ± 0.3
fr-ja   59.1 ± 0.2   64.7 ± 0.5   61.5 ± 0.2   77.9 ± 0.1   69.5 ± 0.3   70.5 ± 0.3   81.4 ± 0.3
es-nl   59.8 ± 0.3   67.2 ± 0.2   82.3 ± 0.3   82.8 ± 0.9   81.3 ± 0.8   76.7 ± 0.5   85.9 ± 0.6
es-fr   57.4 ± 0.2   64.8 ± 0.3   81.0 ± 0.3   85.0 ± 0.3   82.2 ± 0.5   78.1 ± 0.8   87.5 ± 0.3
es-en   57.0 ± 0.1   61.7 ± 0.0   78.4 ± 0.3   84.7 ± 0.2   79.5 ± 0.2   74.8 ± 0.4   88.8 ± 0.3
es-cs   60.3 ± 0.2   72.2 ± 0.4   85.5 ± 0.5   85.6 ± 0.6   85.9 ± 0.5   82.7 ± 0.5   88.0 ± 0.4
es-ja   60.9 ± 0.1   67.0 ± 0.5   67.7 ± 0.1   78.3 ± 0.6   72.8 ± 0.8   71.2 ± 0.4   83.1 ± 0.6
cs-nl   57.6 ± 0.6   50.0 ± 0.0   78.5 ± 0.7   78.3 ± 0.9   78.1 ± 0.4   73.1 ± 0.4   81.1 ± 1.2
cs-fr   54.2 ± 0.2   60.0 ± 0.3   77.1 ± 1.0   78.5 ± 0.2   79.9 ± 0.7   75.0 ± 0.9   81.4 ± 0.3
cs-es   53.7 ± 0.4   56.9 ± 0.3   75.8 ± 0.3   77.7 ± 0.8   78.8 ± 0.4   73.7 ± 0.3   81.6 ± 0.9
cs-en   54.2 ± 0.2   57.1 ± 0.1   74.2 ± 0.2   78.9 ± 0.1   78.3 ± 0.4   73.4 ± 0.4   84.5 ± 0.4
cs-ja   58.6 ± 0.2   64.0 ± 0.3   65.8 ± 1.4   76.9 ± 0.1   72.2 ± 1.1   72.2 ± 1.1   80.5 ± 0.5
ja-nl   54.8 ± 0.4   61.6 ± 1.1   62.1 ± 0.4   72.8 ± 1.0   63.6 ± 0.9   68.3 ± 1.2   76.9 ± 0.3
ja-fr   53.3 ± 0.2   58.4 ± 0.2   50.6 ± 0.2   73.7 ± 0.6   58.4 ± 0.4   66.7 ± 0.3   77.8 ± 0.1
ja-es   52.7 ± 0.1   55.9 ± 0.4   56.1 ± 0.4   73.9 ± 0.4   60.0 ± 0.2   67.4 ± 0.4   78.8 ± 0.5
ja-cs   56.1 ± 0.5   65.7 ± 0.7   64.7 ± 0.6   77.5 ± 0.2   64.5 ± 0.5   73.3 ± 0.6   80.7 ± 0.4
ja-en   52.5 ± 0.1   55.8 ± 0.1   49.0 ± 0.1   75.6 ± 0.3   56.8 ± 1.0   64.2 ± 0.4   81.6 ± 0.8

Table 3: Macro-AUC scores (in %, best overall in bold, best locally underlined). The first part corresponds to the translation baselines; the second to the best distributed representations; the last to the retrofitted FTsif vectors.

4.3 Retrofitting Music Genre Distributed Representations to Ontologies

Retrofitting (Faruqui et al., 2015) is a method to refine vector space word representations by considering the relations between words as defined in semantic lexicons such as WordNet (Miller, 1995). The intuition is to modify the distributed embeddings to become closer to the representations of the concepts to which they are related. Ever since the original work, many uses of retrofitting have been explored to semantically specialize word embeddings in relations such as synonyms or antonyms (Kiela et al., 2015; Kim et al., 2016), in other languages than a source one (Ponti et al., 2019) or in specific domains (Hangya et al., 2018).

Enhanced extensions of retrofitting exist, but they require supervision (Lengerich et al., 2018). The original method (Faruqui et al., 2015) is unsupervised and can simply yet effectively leverage distributed embeddings and ontologies for improved representations. Thus, we mainly rely on it, but we apply some changes, as further described. Let Ω = (C, E) be an ontology including the concepts C and the semantic relations between these concepts, E ⊆ C × C. The retrofitting goal is to learn new concept embeddings Q ∈ IR^{n×d}, with n = |C| and d the embedding dimension. The learning starts by initializing each q_i ∈ IR^d, the new embedding for concept i ∈ C, to \hat{q}_i, the initial distributed embedding, and then iteratively updates q_i until convergence as follows:

    q_i ← (\sum_{j:(i,j) ∈ E} (β_{ij} + β_{ji}) q_j + α_i \hat{q}_i) / (\sum_{j:(i,j) ∈ E} (β_{ij} + β_{ji}) + α_i)    (4)

The α and β values are positive scalars weighting the importance of the initial, respectively the related, concept embeddings in the computation. The formula was reached through the optimization of the retrofitting objective using the Jacobi method (Saad, 2003). Equation 4 is a corrected version of the original work: for a concept i, not only β_{ij} appears in it, but also β_{ji}. That is to say that, when computing the partial derivative of the retrofitting objective with respect to concept i, two non-zero terms correspond to the related concept j: when i is the source and j is the target, and vice versa (Bengio et al., 2006; Saha et al., 2016).

The further modifications that we make regard the parameters α and β. For each i ∈ C, Faruqui et al. (2015) fix α_i to 1, and β_{ij} to 1/degree(i) for (i, j) ∈ E or 0 otherwise; degree(i) is the number of related concepts i has in Ω.

While many embedding models can handle unknown words nowadays, concepts may still have unknown initial distributed vectors, depending on the model's choice. For this case, expanded retrofitting (Speer and Chin, 2016) has been proposed, considering α_i = 0 for each concept i with an unknown initial distributed vector, and α_i = 1 for the rest. Thus, q_i is initialized to 0_d and updated by averaging the embeddings of its related concepts at each iteration. Let us notice that, through retrofitting, representations are not only modified but also learned from scratch for some concepts. We adopt the same approach to initialize α.

Moreover, we also adjust the parameters β to weight the importance of each related concept embedding depending on the relation semantics in our music genre ontology (Epure et al., 2020). Specifically, we distinguish between equivalence and relatedness as follows:

    β_{ij} = 1               if (i, j) ∈ Ē ⊂ E
    β_{ij} = 1/degree(i)     if (i, j) ∈ E − Ē
    β_{ij} = 0               if (i, j) ∉ E

where Ē contains the equivalence relation types (wikiPageRedirects, sameAs), and E − Ē contains the relatedness relation types (stylisticOrigin, musicSubgenre, derivative, musicFusionGenre). We label this modified version of retrofitting Rfit.

Finally, we want to highlight a crucial aspect of retrofitting. Previous works (Speer and Chin, 2016; Hayes, 2019; Fang et al., 2019) claim that, while the retrofitting updating procedure converges, the results depend on the order in which the updates are made. We prove in Appendix A that the retrofitting objective is strictly convex when at least one initial concept vector is known in each connected component. Hence, with this condition satisfied, retrofitting always converges to the same solution, independently of the updates' order.
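A direct implementation of the update in Equation 4, with the expanded-retrofitting initialization of α and the relation-dependent β above, can be sketched as follows. This is a Jacobi-style loop with a fixed iteration budget; the names and the input format are our own simplification, and we assume the β weights are pre-symmetrized:

```python
import numpy as np

def rfit(q_hat, neighbors, alpha, n_iters=100):
    """Equation 4. q_hat: (n, d) initial embeddings (rows of zeros for
    unknown concepts); alpha: (n,) with alpha_i = 1 if concept i has a
    known initial vector, else 0; neighbors: dict mapping i to a list
    of (j, w) pairs, where w = beta_ij + beta_ji already encodes the
    equivalence/relatedness weighting."""
    q = q_hat.copy()
    for _ in range(n_iters):
        q_new = q.copy()
        for i, nbrs in neighbors.items():
            num = alpha[i] * q_hat[i]
            den = alpha[i]
            for j, w in nbrs:
                num = num + w * q[j]
                den += w
            if den > 0:
                q_new[i] = num / den
        q = q_new  # Jacobi: every concept is updated from the previous iterate
    return q
```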

5 Experiments

Cross-lingual music genre annotation, as formalized in Section 3, is a typical multi-label prediction task. For evaluation, we use the Area Under the receiver operating characteristic Curve (AUC, Bradley, 1997), macro-averaged. We report the means and standard deviations of the macro-AUC scores using 3-fold cross-validation. For each language, we apply an iterative split (Sechidis et al., 2011) of the test corpus that balances the number of samples and the tag distributions across the folds. We pre-process the music genres by either replacing special characters with a space ( -/,) or removing them (()':.!$). For Japanese, we introduce spaces between tokens obtained with Mecab [5]. Embeddings are then computed from the pre-processed tags.

We test two translation baselines, one based on Google Translate [6] (GTrans) and one on the DBpedia sameAs relation (DBpSameAs). In this case, a source music genre is mapped on a single or no target music genre, its embedding being in the form {0,1}^{|T|}. For XLMCtxt, we compute the sentence embedding by averaging the token embeddings obtained with mean pooling across all layers. For mBERTCtxt, we apply the same strategy, but by max pooling the token embeddings instead. We chose these representations as they showed the best performance experimentally, compared to the other strategies described in Section 4.2.

When retrofitting language-specific music genre embeddings, we use the corresponding monolingual ontology (RfituΩ). When we learn multilingual embeddings from scratch with retrofitting, by knowing only music genre embeddings in one language (la), we use the partially aligned DBpedia ontologies which contain the sameAs relations (Rfit^la_aΩ). For this case, we also propose a baseline representing a source concept embedding as a vector of geodesic distances in the partially aligned ontologies to each target concept (DBpaΩNNDist).

Results. Table 3 shows the cross-lingual annotation results. The standard translation, GTrans, leads to the lowest results, being over-performed by a knowledge-based translation more adapted to this domain (DBpSameAs). Also, these results show that translation methods fail to capture the dissimilar cross-cultural music genre perception.

The second part of Table 3 contains only the best [7] music genre embeddings computed with each word/sentence pre-trained model or method. When averaging static multilingual word embeddings, those from mBERT often yield the most relevant cross-lingual annotations, while when applying the sif averaging, the aligned FT word vectors are the best choice. Between the two contextual word embedding models, XLMCtxt significantly outperforms mBERTCtxt, thus we report only the former. We can notice that all distributed representations of music genres can model quite well the varying music genre annotation across languages. FTsif results in the most relevant cross-lingual annotations consistently for 5 out of 6 languages as a source. For cs though, the embeddings from XLMCtxt are sometimes slightly better. LASER under-performs for most languages but ja, for which the vectors obtained with mBERTavg are less suitable.

The last column of Table 3 shows the results of cross-lingual annotation when using the FTsif vectors retrofitted to monolingual music genre ontologies. The domain adaptation of concept embeddings inferred with general-language pre-trained models significantly improves music genre annotation modeling across all pairs of languages.

Pair    DBpaΩNNDist   Rfit^en_aΩ FTsif
en-nl   83.7 ± 0.1    90.7 ± 0.0
en-fr   82.7 ± 0.3    91.7 ± 0.1
en-es   81.1 ± 0.3    91.4 ± 0.2
en-cs   86.6 ± 0.3    91.6 ± 0.4
en-ja   81.3 ± 0.1    89.1 ± 0.2

Pair    DBpaΩNNDist   Rfit^ja_aΩ FTsif
ja-nl   68.5 ± 0.7    75.7 ± 0.3
ja-fr   71.9 ± 0.1    76.7 ± 0.5
ja-es   68.9 ± 0.4    76.1 ± 0.5
ja-cs   77.9 ± 0.8    82.2 ± 0.8
ja-en   70.4 ± 0.4    82.3 ± 0.5

Table 4: Macro-AUC scores (in %; those larger than RfituΩ FTsif in Table 3 in bold) with vectors learned by retrofitting to aligned monolingual ontologies.

Table 4 shows the results when using retrofitting to learn music genre embeddings from scratch. Here, distributed vectors are known for one language (en, respectively ja [8]) and the monolingual ontologies are partially aligned. Even though not necessarily all music genres are linked to their equivalents in the other language, the concept representations learned in this way are more relevant for cross-lingual annotation, for all pairs involving en as the source and for ja-cs and ja-en.

5https://taku910.github.io/mecab/ 7The complete results are presented in AppendixB. 6https://translate.google.com 8The results for the other languages are in AppendixB.

4772 baseline (DBpaΩNNDist) reveals that the aligned ways better than the results from L2 to L1. This ontologies stand-alone can model the cross-lingual observation explains two trends in Table3. First, annotation quite well, in particular for en. the scores achieved for en or es as the source, the languages with the largest number of music genres, Discussion. The results show that using transla- are the best. Second, the results for the same pair tion to produce cross-lingual annotations is limited of languages could vary a lot, depending on the as it does not consider the culturally divergent per- role each language plays, source, or target. ception of music genres. Instead, monolingual se- One possible explanation is that the prediction mantic representations can model this phenomenon from languages with fewer music genre tags such as rather well. For instance, from Milton Cardona’s L2 towards languages with more music genre tags music genres in es, salsa and jazz, it correctly pre- such as L1 is more challenging because the target dicts the Japanese equivalent of fusion (フュー language contains more specific or rare annotations. ジョン) in ja. Yet, while a thorough qualitative For instance, when checking the results per tag analysis requires more work, preliminary explo- from cs to en we observed that among the tags with ration suggests that larger gaps in perception might the lowest scores, we found , zeuhl, or still be inadequately modeled. For instance, for . However, other common music genres, Santana’s album Welcome tagged with jazz in es, it such as or hard rock, were also poorly does not predict pop in fr. predicted, showing that other causes exist too. Is When comparing the distributed embeddings, a the unbalanced number of music genres used in simple method that relies on a weighted average annotations a cultural consequence? Related work of multilingual aligned word vectors significantly (Ferwerda and Schedl, 2016) seems to support this outperforms the others. Although rarely used be- hypothesis. Then could we design a better mapping fore, we question if we can notice such high per- function that leverages the unbalanced numbers of formance with other multilingual data-sets. The music genres in cross-cultural annotations? We will cross-lingual annotations are further improved by dedicate a thorough investigation of these questions retrofitting the distributed embeddings to monolin- as future work. gual ontologies. Interestingly, the vector alignment does not appear degraded by retrofitting to disjoint 6 Conclusion graphs. Or, the negative impact is limited and ex- ceeded by introducing domain knowledge in rep- We have presented an extensive investigation on resentations. Further, as shown in Table4, joining cross-lingual modeling of music genre annotation, semantic representations in this way proves very focused on six languages, and two common ap- suitable to learn music genre vectors from scratch. proaches to semantically represent concepts: on- Regarding the scores per language, we obtained tologies and distributed embeddings9. the lowest ones for ja as the source. We could ex- Our work provides a methodological framework plain this by either a more challenging test corpus to study the annotation behavior across language- or still incompatible embeddings in ja, possibly bound cultures in other domains too. 
Hence, the because of the quality of the individual embedding effectiveness of language-specific concept repre- models for this language and the completeness of sentations to model the culturally diverse percep- the Japanese music genre ontology. Also, we did tion could be further probed. Then, we combined not notice any particular improvement for pairs of the semantic representations only with retrofitting. languages from the same language family, e.g. fr However, inspired by paraphrastic sentence em- and es. However, we would need a sufficiently size- bedding learning, one can also consider the music able parallel corpus exhaustively annotated in all genre relations as paraphrasing forms with different languages to reliably compare the performance for strengths (Wieting et al., 2016). Finally, the mod- pairs of languages from the same language family els to generate cross-lingual annotations should or different ones. be thoroughly evaluated in downstream music re- Finally, by closely analysing the results in Table trieval and recommendation tasks. 3, we noticed that given two languages L1 and L2, with more music genre embeddings in L1 than in L2 (from both ontology and corpus), the results 9https://github.com/deezer/ of mapping annotations from L1 to L1 seems al- CrossCulturalMusicGenrePerception

References

Javier Franco Aixelá. 1996. Culture-specific items in translation. Translation, Power, Subversion, 8:52–78.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, Berlin, Heidelberg. Springer.

Parsiad Azimzadeh and Peter A. Forsyth. 2016. Weakly chained matrices, policy iteration, and impulse control. SIAM Journal on Numerical Analysis, 54(3):1341–1364.

Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. 2006. Label Propagation and Quadratic Criterion, semi-supervised learning edition, pages 193–216. MIT Press.

Dmitry Bogdanov, Alastair Porter, Hendrik Schreiber, Julián Urbano, and Sergio Oramas. 2019. The AcousticBrainz genre dataset: Multi-source, multi-level, multi-label, and large-scale. In Proceedings of the 20th Conference of the International Society for Music Information Retrieval (ISMIR 2019), pages 360–367, Delft, The Netherlands. International Society for Music Information Retrieval (ISMIR).

Andrew P. Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Jon Dattorro. 2005. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin Kadras, Sylvain Gugger, and Jeremy Howard. 2019. MultiFiT: Efficient multi-lingual language model fine-tuning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5702–5707, Hong Kong, China. Association for Computational Linguistics.

Irene Eleta and Jennifer Golbeck. 2012. A study of multilingual social tagging of art images: Cultural bridges and diversity. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW '12, pages 695–704, New York, NY, USA. Association for Computing Machinery.

Elena V. Epure, Anis Khlif, and Romain Hennequin. 2019. Leveraging knowledge bases and parallel annotations for music genre translation. In Conference of the International Society of Music Information Retrieval, ISMIR 2019.

Elena V. Epure, Guillaume Salha, and Romain Hennequin. 2020. Multilingual music genre embeddings for effective cross-lingual music item annotation. In Conference of the International Society of Music Information Retrieval, ISMIR 2020.

Lanting Fang, Yong Luo, Kaiyu Feng, Kaiqi Zhao, and Aiqun Hu. 2019. Knowledge-enhanced ensemble learning for word embeddings. In The World Wide Web Conference, WWW '19, pages 427–437, New York, NY, USA. Association for Computing Machinery.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606–1615, Denver, Colorado. Association for Computational Linguistics.

Bruce Ferwerda and Markus Schedl. 2016. Investigating the relationship between diversity in music consumption behavior and cultural dimensions: A cross-country analysis. In 24th Conference on User Modeling, Adaptation, and Personalization (UMAP) Extended Proceedings: 1st Workshop on Surprise, Opposition, and Obstruction in Adaptive and Personalized Systems (SOAP), Halifax, NS, Canada.

Gene H. Golub and Christian Reinsch. 1971. Singular value decomposition and least squares solutions. In Handbook for Automatic Computation: Volume II: Linear Algebra, pages 134–151. Springer Berlin Heidelberg, Berlin, Heidelberg.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Viktor Hangya, Fabienne Braune, Alexander Fraser, and Hinrich Schütze. 2018. Two methods for domain adaptation of bilingual tasks: Delightfully simple and broadly applicable. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 810–820, Melbourne, Australia. Association for Computational Linguistics.

Dmetri Hayes. 2019. What just happened? Evaluating retrofitted distributional word vectors. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1062–1072, Minneapolis, Minnesota. Association for Computational Linguistics.

Niels van der Heijden, Samira Abnar, and Ekaterina Shutova. 2019. A comparison of architectures and pretraining methods for contextualized multilingual word embeddings. Computing Research Repository, arXiv:1912.10169.

Romain Hennequin, Jimena Royo-Letelier, and Manuel Moussallam. 2018. Audio based disambiguation of music genre tags. In Conference of the International Society of Music Information Retrieval, ISMIR 2018.

Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2979–2984, Brussels, Belgium. Association for Computational Linguistics.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2044–2048, Lisbon, Portugal. Association for Computational Linguistics.

Joo-Kyung Kim, Marie-Catherine de Marneffe, and Eric Fosler-Lussier. 2016. Adjusting word embeddings with semantic intensity orders. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 62–69, Berlin, Germany. Association for Computational Linguistics.

Claire Kramsch and H. G. Widdowson. 1998. Language and Culture. Oxford University Press.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Jin Ha Lee, Kahyun Choi, Xiao Hu, and J. Stephen Downie. 2013. K-pop genres: A cross-cultural exploration. In Conference of the International Society on Music Information Retrieval, ISMIR 2013.

Jennifer C. Lena. 2012. Banding Together: How Communities Create Genres in Popular Music. Princeton University Press.

Ben Lengerich, Andrew Maas, and Christopher Potts. 2018. Retrofitting distributional embeddings to knowledge graphs with functional relations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2423–2436, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Pasquale Lisena, Konstantin Todorov, Cécile Cecconi, Françoise Leresche, Isabelle Canno, Frédéric Puyrenier, Martine Voisin, Thierry Le Meur, and Raphaël Troncy. 2018. Controlled vocabularies for music metadata. In Conference of the International Society on Music Information Retrieval, Paris, France.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, Red Hook, NY, USA. Curran Associates Inc.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Peter Newmark. 1988. A Textbook of Translation, volume 66. Prentice Hall, New York.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756–1765, Vancouver, Canada. Association for Computational Linguistics.

Ulrike Pfeil, Panayiotis Zaphiris, and Chee Siang Ang. 2006. Cultural differences in collaborative authoring of Wikipedia. Journal of Computer-Mediated Communication, 12(1):88–113.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? Computing Research Repository, arXiv:1906.01502.

Edoardo Maria Ponti, Ivan Vulić, Goran Glavaš, Roi Reichart, and Anna Korhonen. 2019. Cross-lingual semantic specialization via lexical relation induction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2206–2217, Hong Kong, China. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65(1):569–630.

Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems, 2nd edition. Society for Industrial and Applied Mathematics, USA.

Tanay Kumar Saha, Shafiq Joty, Naeemul Hassan, and Mohammad Al Hasan. 2016. Dis-S2V: Discourse informed sen2vec. Computing Research Repository, arXiv:1610.08078.

Hendrik Schreiber. 2016. Genre ontology learning: Comparing curated with crowd-sourced ontologies. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 400–406, New York, USA.

Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas. 2011. On the stratification of multi-label data. In Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part III, ECML PKDD'11, pages 145–158, Berlin, Heidelberg. Springer-Verlag.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Vered Shwartz and Ido Dagan. 2019a. Still a pain in the neck: Evaluating text representations on lexical composition. Transactions of the Association for Computational Linguistics, 7:403–419.

Vered Shwartz and Ido Dagan. 2019b. Still a pain in the neck: Evaluating text representations on lexical composition. Transactions of the Association for Computational Linguistics, 7:403–419.

Marcin Skowron, Florian Lemmerich, Bruce Ferwerda, and Markus Schedl. 2017. Predicting genre preferences from cultural and socio-economic factors for music retrieval. In Advances in Information Retrieval, pages 561–567, Cham. Springer International Publishing.

Mohamed Sordo, Oscar Celma, Martin Blech, and Enric Guaus. 2008. The quest for musical genres: Do the experts and the wisdom of crowds agree? In Conference of the International Society on Music Information Retrieval.

Robyn Speer and Joshua Chin. 2016. An ensemble method to produce high-quality word embeddings. Computing Research Repository, arXiv:1604.01692.

Francisco Vargas, Kamen Brestnichki, Alex Papadopoulos Korfiatis, and Nils Hammerla. 2019. Multilingual factor analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1738–1750, Florence, Italy. Association for Computational Linguistics.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.

George Kingsley Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Cambridge.

A Strict Convexity of Retrofitting

Theorem. Let V be a finite vocabulary with |V| = n. Let Ω = (V, E) be an ontology represented as a directed graph which encodes semantic relationships between vocabulary words. Further, let V̂ ⊆ V be the subset of words which have non-zero initial distributed representations \hat{q}_i. The goal of retrofitting is to learn the matrix Q ∈ IR^{n×d}, stacking up the new embeddings q_i ∈ IR^d for each i ∈ V. The objective function to be minimized is:

    Φ(Q) = \sum_{i ∈ V̂} α_i \|q_i − \hat{q}_i\|_2^2 + \sum_{i=1}^{n} \sum_{(i,j) ∈ E} β_{ij} \|q_i − q_j\|_2^2,

where the α_i and β_{ij} are positive scalars. Assuming that each connected component of Ω includes at least one word from V̂, the objective function Φ is strictly convex w.r.t. Q.

Proof. First of all, let Q̂ denote the n × d matrix whose i-th row corresponds to \hat{q}_i if i ∈ V̂, and to the d-dimensional null vector 0_d otherwise. Let A denote the n × n diagonal matrix verifying A_{ii} = α_i if i ∈ V̂ and A_{ii} = 0 otherwise. Let B denote the n × n symmetric matrix such that, for all i, j ∈ {1, ..., n} with i ≠ j, B_{ij} = B_{ji} = −(1/2)(β_{ij} + β_{ji}) and B_{ii} = \sum_{j=1, j≠i}^{n} |B_{ij}|. With these notations, and with Tr(·) the trace operator for square matrices, we have:

    \sum_{i ∈ V̂} α_i \|q_i − \hat{q}_i\|_2^2 = Tr((Q − Q̂)^T A (Q − Q̂)) = Tr(Q^T A Q − Q̂^T A Q − Q^T A Q̂ + Q̂^T A Q̂).

Also:

    \sum_{i=1}^{n} \sum_{(i,j) ∈ E} β_{ij} \|q_i − q_j\|_2^2 = Tr(Q^T B Q).

Therefore, as the trace is a linear mapping, we have:

    Φ(Q) = Tr(Q^T (A + B) Q) + Tr(Q̂^T A Q̂ − Q̂^T A Q − Q^T A Q̂).

Then, we note that A + B is a weakly diagonally dominant (WDD) matrix as, by construction, ∀i ∈ {1, ..., n}, |(A + B)_{ii}| ≥ \sum_{j≠i} |(A + B)_{ij}|. Also, for all i ∈ V̂, the inequality is strict, as |(A + B)_{ii}| = α_i + \sum_{j≠i} |B_{ij}| > \sum_{j≠i} |(A + B)_{ij}| = \sum_{j≠i} |B_{ij}|, which means that, for all i ∈ V̂, row i of A + B is strictly diagonally dominant (SSD). Assuming that each connected component of the graph includes at least one node from V̂, we conclude that A + B is a weakly chained diagonally dominant matrix (Azimzadeh and Forsyth, 2016), i.e. that:

• A + B is WDD;
• for each i ∈ V such that row i is not SSD, there exists a walk in the graph whose adjacency matrix is A + B (two nodes i and j are connected if (A + B)_{ij} = (A + B)_{ji} ≠ 0), starting from i and ending at a node associated with an SSD row.

Such matrices are nonsingular (Azimzadeh and Forsyth, 2016), which implies that Q ↦ Tr(Q^T (A + B) Q) is a positive-definite quadratic form. As A + B is a symmetric positive-definite matrix, there exists a matrix M such that A + B = M^T M. Therefore, denoting \|·\|_F^2 the squared Frobenius matrix norm:

    Tr(Q^T (A + B) Q) = Tr(Q^T M^T M Q) = \|M Q\|_F^2,

which is strictly convex w.r.t. Q due to the strict convexity of the squared Frobenius norm (see e.g. 3.1 in Dattorro (2005)). Since the sum of strictly convex functions of Q (first trace in Φ(Q)) and linear functions of Q (second trace in Φ(Q)) is still strictly convex w.r.t. Q, we conclude that the objective function Φ is strictly convex w.r.t. Q.

Corollary. The retrofitting update procedure is insensitive to the order in which nodes are updated.

The aforementioned updating procedure for Q (Faruqui et al., 2015) is derived from the Jacobi iteration procedure (Saad, 2003; Bengio et al., 2006) and converges for any initialization. Such a convergence result is discussed in Bengio et al. (2006). It can also be directly verified in our specific setting by checking that each irreducible element of A + B, i.e. each connected component of the underlying graph constructed from this matrix, is irreducibly diagonally dominant (see 4.2.3 in Saad (2003)), and then by applying Theorem 4.9 from Saad (2003) on each of these components. Besides, due to its strict convexity w.r.t. Q, the objective function Φ admits a unique global minimum. Consequently, the retrofitting update procedure will converge to the same embedding matrix regardless of the order in which nodes are updated.
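To illustrate the corollary numerically, the following toy check (ours, not from the paper) runs sequential updates in two different node orders on a chain graph where only one node has a known initial vector, and verifies that both reach the same fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
q_hat = np.zeros((n, d))
q_hat[0] = rng.normal(size=d)             # only node 0 has a known vector
alpha = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]  # one connected component

def neighbors(i):
    return [b if a == i else a for a, b in edges if i in (a, b)]

def solve(order, n_epochs=5000):
    # In-place (sequential) updates with beta_ij + beta_ji = 1.
    q = q_hat.copy()
    for _ in range(n_epochs):
        for i in order:
            nbrs = neighbors(i)
            q[i] = (alpha[i] * q_hat[i] + q[nbrs].sum(axis=0)) / (alpha[i] + len(nbrs))
    return q

q_fwd = solve(list(range(n)))
q_bwd = solve(list(reversed(range(n))))
assert np.allclose(q_fwd, q_bwd, atol=1e-8)  # same solution, either order
```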

B Extended Results

The following Tables 5 and 6 provide more complete results from our experiments.

Pair    FTavg        XLMavg       mBERTavg     FTsif        XLMsif       mBERTsif     XLMctxt      mBERTctxt
en-nl   75.2 ± 0.2   83.7 ± 0.1   86.2 ± 0.2   86.5 ± 0.1   86.6 ± 0.2   88.3 ± 0.2   85.4 ± 0.3   85.1 ± 0.2
en-fr   78.6 ± 0.2   84.4 ± 0.2   85.2 ± 0.3   87.4 ± 0.3   87.7 ± 0.3   87.6 ± 0.2   86.6 ± 0.2   84.9 ± 0.3
en-es   77.5 ± 0.1   82.5 ± 0.1   83.8 ± 0.1   86.9 ± 0.2   86.7 ± 0.2   86.2 ± 0.2   85.4 ± 0.3   84.7 ± 0.2
en-cs   74.5 ± 0.6   87.2 ± 0.6   88.2 ± 0.4   88.6 ± 0.4   90.6 ± 0.5   90.6 ± 0.3   89.0 ± 0.5   87.2 ± 0.1
en-ja   69.6 ± 0.5   70.9 ± 0.1   72.4 ± 0.2   80.8 ± 0.3   76.1 ± 0.3   70.5 ± 0.3   74.2 ± 0.3   68.9 ± 0.2
nl-en   73.8 ± 0.5   71.9 ± 0.4   74.6 ± 0.3   79.8 ± 0.4   77.7 ± 0.4   78.2 ± 0.3   74.2 ± 0.7   73.5 ± 0.5
nl-fr   63.9 ± 0.5   72.7 ± 0.9   76.1 ± 0.3   79.3 ± 0.8   78.4 ± 1.0   79.5 ± 0.3   74.9 ± 1.0   75.1 ± 0.7
nl-es   63.7 ± 0.4   71.6 ± 0.6   75.0 ± 0.4   77.7 ± 0.5   76.8 ± 0.3   77.6 ± 0.6   73.5 ± 0.3   75.0 ± 0.3
nl-cs   65.1 ± 0.3   77.4 ± 0.4   80.8 ± 0.4   80.6 ± 0.2   82.0 ± 0.4   83.2 ± 0.4   79.0 ± 0.4   79.6 ± 0.2
nl-ja   64.8 ± 0.2   65.2 ± 0.8   65.8 ± 2.5   74.9 ± 1.0   70.1 ± 0.4   67.6 ± 0.2   66.0 ± 0.6   65.0 ± 0.2
fr-nl   67.7 ± 1.0   77.5 ± 0.4   79.4 ± 0.1   81.9 ± 0.4   81.6 ± 0.3   82.0 ± 0.4   78.6 ± 0.1   77.7 ± 0.2
fr-en   76.2 ± 0.2   74.7 ± 0.6   77.2 ± 0.5   83.0 ± 0.2   81.5 ± 0.3   80.8 ± 0.1   78.8 ± 0.5   77.2 ± 0.5
fr-es   71.0 ± 0.2   75.6 ± 0.3   77.5 ± 0.4   81.8 ± 0.3   81.0 ± 0.5   80.0 ± 0.4   78.7 ± 0.5   78.2 ± 0.3
fr-cs   70.4 ± 0.8   80.1 ± 0.4   82.7 ± 0.6   83.9 ± 0.4   85.3 ± 0.3   85.5 ± 0.5   83.1 ± 0.2   82.0 ± 0.4
fr-ja   71.1 ± 0.3   67.4 ± 0.5   61.5 ± 0.2   77.9 ± 0.1   73.1 ± 0.1   66.9 ± 0.2   69.5 ± 0.3   64.8 ± 0.5
es-nl   68.5 ± 0.5   80.3 ± 0.6   82.3 ± 0.3   82.8 ± 0.9   83.8 ± 0.7   84.8 ± 0.2   81.3 ± 0.8   80.7 ± 0.4
es-fr   70.8 ± 0.4   79.9 ± 0.5   81.0 ± 0.3   85.0 ± 0.3   84.6 ± 0.4   84.5 ± 0.4   82.2 ± 0.5   80.3 ± 0.4
es-en   75.3 ± 0.1   76.4 ± 0.2   78.4 ± 0.3   84.7 ± 0.2   83.2 ± 0.4   82.9 ± 0.3   79.5 ± 0.2   78.1 ± 0.4
es-cs   68.9 ± 0.8   83.2 ± 0.7   85.5 ± 0.5   85.6 ± 0.6   88.2 ± 0.7   88.6 ± 0.6   85.9 ± 0.5   84.2 ± 0.4
es-ja   65.2 ± 0.4   70.4 ± 0.2   67.7 ± 0.1   78.3 ± 0.6   74.4 ± 0.2   68.6 ± 0.5   72.8 ± 0.8   65.1 ± 0.6
cs-nl   68.5 ± 1.0   75.7 ± 0.9   78.5 ± 0.7   78.3 ± 0.9   80.7 ± 0.9   80.5 ± 1.0   78.1 ± 0.4   76.8 ± 0.3
cs-fr   64.5 ± 0.9   74.7 ± 0.9   77.1 ± 1.0   78.5 ± 0.2   80.7 ± 0.3   80.7 ± 0.9   79.9 ± 0.7   76.2 ± 1.2
cs-es   65.3 ± 0.9   73.8 ± 0.7   75.8 ± 0.3   77.7 ± 0.8   79.8 ± 0.2   78.7 ± 0.3   78.8 ± 0.4   75.9 ± 0.8
cs-en   70.3 ± 0.5   70.9 ± 0.0   74.2 ± 0.2   78.9 ± 0.1   78.9 ± 0.5   79.2 ± 0.5   78.3 ± 0.4   74.0 ± 0.3
cs-ja   67.1 ± 1.1   70.8 ± 0.2   65.8 ± 1.4   76.9 ± 0.1   72.7 ± 0.3   67.3 ± 0.7   72.2 ± 1.1   68.3 ± 0.1
ja-nl   62.0 ± 0.5   59.5 ± 1.0   62.1 ± 0.4   72.8 ± 1.0   68.0 ± 0.6   67.1 ± 0.2   63.6 ± 0.9   52.4 ± 0.3
ja-fr   66.0 ± 1.1   47.6 ± 0.2   50.6 ± 0.2   73.7 ± 0.6   69.4 ± 0.7   65.6 ± 0.3   58.4 ± 0.4   43.9 ± 0.7
ja-es   63.2 ± 0.3   53.6 ± 0.3   56.1 ± 0.4   73.9 ± 0.4   67.5 ± 1.0   63.6 ± 0.6   60.0 ± 0.2   48.7 ± 0.6
ja-cs   61.7 ± 1.3   64.8 ± 0.8   64.7 ± 0.6   77.5 ± 0.2   73.3 ± 0.6   69.2 ± 0.4   64.5 ± 0.5   56.8 ± 0.9
ja-en   72.1 ± 1.0   46.3 ± 0.4   49.0 ± 0.1   75.6 ± 0.3   66.8 ± 0.6   64.1 ± 0.7   56.8 ± 1.0   44.0 ± 0.2

Table 5: Macro-AUC scores (in %, best locally underlined). The first two parts correspond to averaging or applying sif averaging to static multilingual word embeddings; the third part corresponds to the contextual sentence embeddings.


Table 6: Macro-AUC scores (in %) with vectors learned by leveraging aligned monolingual ontologies. The first column shows the results by relying on the aligned ontologies only. The second and third columns show the results obtained by retrofitting FTsif embed- dings to monolingual ontologies, respectively, aligned monolingual ontologies.

Embedding Model Dimension FT 300 mBERT 768 XLM 2048 LASER 1024

Table 7: Embedding dimensions for pre-trained models used in our study, corresponding to the values provided and optimized by each model’s authors.
