Going Beyond T-SNE: Exposing whatlies in Text Embeddings

Vincent D. Warmerdam, Thomas Kober, Rachael Tatman
Rasa
Schönhauser Allee 175, 10119 Berlin
[email protected], [email protected], [email protected]

arXiv:2009.02113v1 [cs.CL] 4 Sep 2020

Abstract

We introduce whatlies, an open source toolkit for visually inspecting word and sentence embeddings. The project offers a unified and extensible API with current support for a range of popular embedding backends including spaCy, tfhub, huggingface transformers, gensim, fastText and BytePair embeddings. The package combines a domain specific language for vector arithmetic with visualisation tools that make exploring word embeddings more intuitive and concise. It offers support for many popular dimensionality reduction techniques as well as many interactive visualisations that can either be statically exported or shared via Jupyter notebooks. The project documentation is available from https://rasahq.github.io/whatlies/.

Figure 1: Projections of $w_{king}$, $w_{queen}$, $w_{man}$, $w_{queen} - w_{king}$ and $w_{man}$ projected away from $w_{queen} - w_{king}$. Both the vector arithmetic and the visualisation were done using whatlies. The support for arithmetic expressions is integral in whatlies because it leads to more meaningful visualisations and concise code.

1 Introduction

The use of pre-trained word embeddings (Mikolov et al., 2013a; Pennington et al., 2014) or language model based sentence encoders (Peters et al., 2018; Devlin et al., 2019) has become a ubiquitous part of NLP pipelines and end-user applications in both industry and academia. At the same time, a growing body of work has established that pre-trained embeddings codify the underlying biases of the text corpora they were trained on (Bolukbasi et al., 2016; Garg et al., 2018; Brunet et al., 2019). Hence, practitioners need tools to help select which set of embeddings to use for a particular project, detect a potential need for debiasing and evaluate the debiased embeddings. Simplified visualisations of the latent semantic space provide an accessible way to achieve this.

Therefore we created whatlies, a toolkit offering a programmatic interface that supports vector arithmetic on a set of embeddings and visualising the space after any operations have been carried out. For example, Figure 1 shows how representations for queen, king, man, and woman can be projected along the axes $v_{queen-king}$ and $v_{man \mid queen-king}$ in order to derive a visualisation of the space along the projections.

Perhaps the most widely known tool for visualising embeddings is the tensorflow projector [1], which offers 3D visualisations of any input embeddings. The visualisations are useful for understanding the emergence of clusters, the neighbourhood of certain words and the overall space. However, the projector is limited to dimensionality reduction as the sole preprocessing method. More recently, Molino et al. (2019) have introduced parallax, which allows explicit selection of the axes on which to project a representation. This creates an additional layer of flexibility as these axes can also be derived from arithmetic operations on the embeddings.

[1] https://projector.tensorflow.org/

The major difference between the tensorflow projector, parallax and whatlies is that the first two provide a non-extensible browser-based interface, whereas whatlies provides a programmatic one. Therefore whatlies can be more easily extended to any specific practical need and cover individual use-cases. The goal of whatlies is to offer a set of tools that can be used from a Jupyter notebook with a range of visualisation capabilities that goes beyond the commonly used static T-SNE (van der Maaten and Hinton, 2008) plots. whatlies can be installed via pip, the code is available from https://github.com/RasaHQ/whatlies [2] and the documentation is hosted at https://rasahq.github.io/whatlies/.

[2] Community PRs are greatly appreciated.

2 What lies in whatlies: Usage and Examples

Embedding backends. The current version of whatlies supports word-level as well as sentence-level embeddings in any human language that is supported by the following libraries:

• BytePair embeddings (Sennrich et al., 2016) via the BPemb project (Heinzerling and Strube, 2018)
• fastText (Bojanowski et al., 2017)
• gensim (Řehůřek and Sojka, 2010)
• huggingface (Wolf et al., 2019)
• sense2vec (Trask et al., 2015); via spaCy
• spaCy [3]
• tfhub [4]

[3] https://spacy.io/
[4] https://www.tensorflow.org/hub

Embeddings are loaded via a unified API:

    from whatlies.language import \
        SpacyLanguage, FasttextLanguage, \
        TFHubLanguage, HFTransformersLanguage

    # spaCy
    lang_sp = SpacyLanguage('en_core_web_md')
    emb_king = lang_sp["king"]
    emb_queen = lang_sp["queen"]

    # fastText
    ft = 'cc.en.300.bin'
    lang_ft = FasttextLanguage(ft)
    emb_ft = lang_ft['pizza']

    # TF-Hub
    tf_hub = 'https://tfhub.dev/google/'
    model = tf_hub + 'nnlm-en-dim50/2'
    lang_tf = TFHubLanguage(model)
    emb_tf = lang_tf['whatlies is awesome']

    # Huggingface
    bert = 'bert-base-cased'
    lang_hf = HFTransformersLanguage(bert)
    emb_hf = lang_hf['whatlies rocks']

In order to retrieve a sentence representation for word-level embeddings such as fastText, whatlies returns the summed representation of the individual word vectors. For pre-trained encoders such as BERT (Devlin et al., 2019) or ConveRT (Henderson et al., 2019), whatlies uses the encoder's internal [CLS] token for representing a sentence.

Similarity Retrieval. The library also supports retrieving similar items on the basis of a number of commonly used distance/similarity metrics such as cosine or Euclidean distance:

    from whatlies.language import \
        SpacyLanguage

    lang = SpacyLanguage('en_core_web_md')
    lang.score_similar("man", n=5,
                       metric='cosine')

    [(Emb[man], 0.0),
     (Emb[woman], 0.2598254680633545),
     (Emb[guy], 0.29321062564849854),
     (Emb[boy], 0.2954298257827759),
     (Emb[he], 0.3168887495994568)]
    # NB: Results are cosine _distances_

Vector Arithmetic. Support for arithmetic expressions on embeddings is integral to all whatlies functions. For example, the code for creating Figure 1 from the Introduction highlights that it does not make a difference whether the plotting functionality is invoked on an embedding itself or on a representation derived from an arithmetic operation:

    import matplotlib.pylab as plt
    from whatlies import Embedding

    man = Embedding("man", [0.5, 0.1])
    woman = Embedding("woman", [0.5, 0.6])
    king = Embedding("king", [0.7, 0.33])
    queen = Embedding("queen", [0.7, 0.9])

    man.plot(kind="arrow", color="blue")
    woman.plot(kind="arrow", color="red")
    king.plot(kind="arrow", color="blue")
    queen.plot(kind="arrow", color="red")

    diff = (queen - king)
    orth = (man | (queen - king))

    diff.plot(color="pink", show_ops=True)
    orth.plot(color="pink",
              show_ops=True)
    # See Figure 1 for the result :)

This feature allows users to construct custom queries and use them, e.g., in combination with the similarity retrieval functionality. For example, we can validate the widely circulated analogy of Mikolov et al. (2013b) on spaCy's medium English model in only 4 lines of code (including imports):

$w_{queen} \approx w_{king} - w_{man} + w_{woman}$

    from whatlies.language import \
        SpacyLanguage

    lang = SpacyLanguage('en_core_web_md')

    e = lang["king"] - lang["man"] + \
        lang["woman"]
    lang.score_similar(e, n=5, metric='cosine')

    [(Emb[king], 0.19757413864135742),
     (Emb[queen], 0.2119154930114746),
     (Emb[prince], 0.35989218950271606),
     (Emb[princes], 0.37914562225341797),
     (Emb[kings], 0.37914562225341797)]

Excluding the query word king [5], the analogy returns the anticipated result: queen.

Multilingual Support. whatlies supports any human language that is available from its current list of supported embedding backends. This allows us to check the royal analogy from above in languages other than English. The code snippet below shows the results for Spanish and Dutch, using pre-trained fastText embeddings [6].

    from whatlies.language import \
        FasttextLanguage

    es = FasttextLanguage("cc.es.300.bin")
    nl = FasttextLanguage("cc.nl.300.bin")

    emb_es = es["rey"] - es["hombre"] + \
        es["mujer"]
    emb_nl = nl["koning"] - nl["man"] + \
        nl["vrouw"]

    es.score_similar(emb_es, n=5,

While for Spanish the correct answer reina is only at rank 3 (excluding rey from the list), the second ranked monarca (female form of monarch) is getting close. For Dutch, the correct answer koningin is at rank 2, surpassed only by koningen (plural of king). Another interesting observation is that the cosine distances, even of the query words, vary wildly in the embeddings for the two languages.

Sets of Embeddings. In the previous examples we have typically only retrieved single embeddings. However, whatlies also supports the notion of an "Embedding Set", which can hold any number of embeddings:

    from whatlies.language import \
        SpacyLanguage

    lang = SpacyLanguage("en_core_web_lg")

    words = ["prince", "princess", "nurse",
             "doctor", "man", "woman",
             "sentences also embed"]
    # NB: 'sentences also embed' will be
    # represented as the sum of the
    # 3 individual words.

    emb = lang[words]

It is often more useful to analyse a set of embeddings at once, rather than many individual ones. Therefore, any arithmetic operations that can be applied to single embeddings can also be applied to all of the embeddings in a given set.

The emb variable in the previous code example represents an EmbeddingSet. These are collections of embeddings which can be simpler to analyse than many individual variables. Users can, for example, apply vector arithmetic to the entire EmbeddingSet.

    new_emb=
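As a rough illustration of what applying one arithmetic operation to a whole set of embeddings amounts to, the sketch below projects every vector in a small set away from the queen − king axis using plain NumPy. It is not the whatlies implementation: the helper name `project_away` and the row-per-embedding array layout are assumptions made here for illustration, and the 2-d vectors are the toy values from the Figure 1 listing.

```python
import numpy as np


def project_away(vectors: np.ndarray, axis: np.ndarray) -> np.ndarray:
    """Remove from each row of `vectors` its component along `axis`.

    Hypothetical helper for illustration; not part of the whatlies API.
    """
    axis = np.asarray(axis, dtype=float)
    # Projection coefficient of every row onto the axis: (v . a) / (a . a).
    coeffs = vectors @ axis / (axis @ axis)
    # Subtract the component parallel to the axis from each row.
    return vectors - np.outer(coeffs, axis)


# Toy 2-d vectors taken from the Figure 1 listing, one embedding per row.
names = ["man", "woman", "king", "queen"]
vectors = np.array([
    [0.5, 0.1],
    [0.5, 0.6],
    [0.7, 0.33],
    [0.7, 0.9],
])

# The axis is itself the result of vector arithmetic: queen - king.
axis = vectors[3] - vectors[2]

# One call handles the whole set; no per-embedding loop is needed.
residuals = project_away(vectors, axis)

for name, vec in zip(names, residuals):
    # The dot product with the axis should be (numerically) zero.
    print(name, vec, float(vec @ axis))
```

Because the operation is expressed on the full matrix, the same function works unchanged for a single embedding (a one-row array) or for thousands of rows, which is the convenience a set-level interface is meant to provide.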
