
Embedding Comparator: Visualizing Differences in Global Structure and Local Neighborhoods via Small Multiples

Angie Boggust†1, Brandon Carter†1, and Arvind Satyanarayan1
1CSAIL, Massachusetts Institute of Technology, Cambridge, MA, USA

Figure 1: The Embedding Comparator (left) facilitates comparisons of embedding spaces via local neighborhood dominoes: small multiple visualizations depicting local substructures (right).

Abstract

Embeddings mapping high-dimensional discrete input to lower-dimensional continuous vector spaces have been widely adopted in applications as a way to capture domain semantics. Interviewing 13 embedding users across disciplines, we find comparing embeddings is a key task for deployment or downstream analysis but unfolds in a tedious fashion that poorly supports systematic exploration. In response, we present the Embedding Comparator, an interactive system that presents a global comparison of embedding spaces alongside fine-grained inspection of local neighborhoods. It systematically surfaces points of comparison by computing the similarity of the k-nearest neighbors of every embedded object between a pair of spaces. Through case studies, we demonstrate our system rapidly reveals insights, such as semantic changes following fine-tuning, language changes over time, and differences between seemingly similar models. In evaluations with 15 participants, we find our system accelerates comparisons by shifting from laborious manual specification to browsing and manipulating visualizations.

arXiv:1912.04853v2 [cs.HC] 6 Mar 2021

† Both authors contributed equally to this research.

1. Introduction

Embedding models map high-dimensional discrete objects into lower-dimensional continuous vector spaces such that the vectors of related objects are located close together. Although the individual dimensions and structure of embedding spaces can be difficult to interpret, embeddings have become widely used in machine learning (ML) applications because their structure usefully captures domain-specific semantics. For example, in natural language processing (NLP), embeddings map words into real-valued vectors in a way that co-locates semantically similar words [MSC∗13].

A key task when working with embedding models is evaluating the representations they learn. For instance, users may wish to determine whether embeddings can be transferred between tasks in a domain with limited training data (e.g., applying an embedding of general English to legal or medical text [HR18]). In speech recognition [BH14, BZK∗17], recommendation systems [KBV09], computational biology [YWBA18, BB19, RBT∗19, BBB∗19], and computational art [ERR∗17, HE18], evaluating embeddings has helped inform future training procedures and reveal the impact of different training datasets, model architectures, hyperparameters, or even random weight initializations.

To understand how users evaluate and compare embeddings, we conducted a series of semi-structured interviews with users across disciplines who frequently use embedding models as part of their research or in application domains. Users balance between examining global semantic structure via dimensionality reduction plots and inspecting local neighborhoods of specific embedded objects. Our conversations reveal shortcomings of these approaches, including unprincipled object selection strategies that rely heavily on domain knowledge or repetitive ad hoc analysis, and siloed tools that focus on either one model at a time or depict only the global structure of the embedding space. As a result, users feel concerned they may miss unexpected insights or lack a comprehensive understanding of the embedding space. Moreover, users are unable to develop tight feedback loops or rapidly iterate between generating and answering hypotheses, as their current processes include limited interactive capabilities and, thus, require tedious manual specification.

In response, we present the Embedding Comparator, an interactive system for analyzing a pair of embedding models. Drawing on the insights from our formative interviews, the Embedding Comparator balances visualizing the models' global structures with comparing their local neighborhoods. To simplify identifying the similarities and differences between the two models, the system calculates a similarity score for every embedded object based on its reciprocal local neighborhood (i.e., how many of an object's nearest neighbors are shared between the two models, and how many are unique to each model). These scores are visualized in several ways, including through a histogram of scores, through color-encoding the global geometry plots, and, critically, through local neighborhood dominoes: small multiple visualizations that facilitate rapid comparisons of local substructures. A variety of interactive mechanics help facilitate a tight iterative loop between analyzing these global and local views — for instance, by interactively selecting points in the global plots, or by searching for specific objects, users can filter dominoes, and hovering over dominoes highlights their points in the global views to provide additional context.

Through case studies and first-use studies, we demonstrate how the Embedding Comparator helps scaffold and accelerate real-world exploration and analysis of embedding spaces. Using tasks based on our formative interviews, we show how our system supports use cases such as understanding the effects of fine-tuning and conducting linguistic analysis. Our system design enables replication of previously published results using only a handful of interactions, without the need for task-specific metrics. As we demonstrate in case studies (Section 5) and validate in first-use studies (Section 6), the Embedding Comparator shifts the process of analyzing embeddings from tedious and error-prone manual specification to browsing and manipulating a series of visualizations.

The Embedding Comparator is freely available as open-source software, with source code at https://github.com/mitvis/embedding-comparator and a live demo at http://vis.mit.edu/embedding-comparator.

2. Related work

2.1. ML model interpretability

ML models are widely regarded as being "black boxes" as it is difficult for humans to reason about how models arrive at their decisions [Lip18]. Numerous tools help users understand model behavior [HKPC18], including visualizations for specific architectures [SGPR18, SGB∗19, BZS∗19, LLL∗18]. More general techniques involve evaluating input feature importance [RSG16, STY17, SGK17, CMJG19], saliency [OSJ∗18], or neuron activations [KAKC17]. Boxer [GBYH20] compares discrete-choice classifiers, but treats them as black boxes without considering their internals. In contrast to these methods, our focus is on comparing the representations learned by different models, as internal representations may differ even while input saliency or input-output behavior remains the same. In our formative interviews (Section 3), we found users often compare these internal representations (e.g., to compare semantic differences between hidden layers of a particular model).

2.2. Visual embedding techniques and tools

Interpreting the representations learned at the embedding layers of ML models is challenging as embedding spaces are generally high-dimensional and latent. To reason about these spaces, researchers project the high-dimensional vectors down to two or three dimensions using techniques such as principal component analysis (PCA) [Jol86], t-SNE [MH08], and UMAP [MHM18]. Visualizing these projections reveals the global geometry of these spaces as well as potential substructures such as clusters, but effectively doing so may require careful tuning of hyperparameters [WVJ16] — a process that can require non-trivial ML expertise. The Embedding Comparator provides a modular system design such that users can use a dimensionality reduction technique of their choice. By default, however, we use PCA as it highlights, rather than distorts, the global structure of the embedding space [WVJ16], and is deterministic, a desire of our formative interviewees (Section 3).

By default, many projection packages generate visualizations that are static and thus do not facilitate a tight question-answering feedback loop: users need to repeatedly regenerate visualizations, slowing down the exploration process. Recently, researchers have begun to explore interactive systems for exploring embeddings, including via direct manipulation of the projection [STN∗16, PLvdM∗16, HPvU∗17], interactively filtering and reconfiguring visual forms [HG18, TKBH16], and defining attribute vectors and analogies [LJLH19]. While our approach draws inspiration from these prior systems, and similarly provides facilities for exploring local neighborhoods, the Embedding Comparator primarily focuses on identifying and highlighting the similarities and differences between different representations of embedded objects. To do so, we compute a similarity metric for every embedded object and use this metric to drive several interactive visualizations (Section 4).

2.3. Techniques for comparing embedding spaces

To compare spaces, some techniques align embeddings through linear transformation [TZCS15, HLJ16b, HLJ16a, MLS13, CTL18] or alignment of neurons or the subspaces they span [LYC∗16, WHG∗18]. In contrast, the Embedding Comparator does not align the embeddings and can be used in cases where a linear mapping between the spaces does not exist, which may occur if they have different structures [MLS13]. Our system exposes the objects that are most and least similar between two vector spaces via a reciprocal local neighborhood similarity metric, and local neighborhood based metrics have been shown to usefully capture differences in embedding spaces [HLJ16a]. As desired by our interviewees, our system is deterministic and faithful to the reciprocal local neighborhoods of the models (unlike transformations that rely on linear maps or weights learned by stochastic gradient optimization).

Other techniques require users to first identify and query particular objects of interest or are restricted to particular types of data or models. Unlike parallel embeddings [ANH∗20], our system does not rely on clustering or assume embeddings form clusters corresponding to semantically meaningful concepts. EmbeddingVis [LNH∗18] compares network (graph) embeddings through relationships between node metrics and embeddings. Liu et al. [LBT∗17] identify differences in how word2vec and GloVe embeddings capture syntactic relationships, but do so by visualizing semantic and syntactic analogies specifically in the context of neural word embeddings. Similarly, Liu et al. [LJLH19] evaluate consistency of attribute and analogy vectors across latent embedding spaces. In contrast, our system does not rely on analogies, which may only exist for certain classes of embedding models [EDH19, AH19]. Heimerl et al. [HG18] present visualizations of nearest neighbors and co-occurrences for word embeddings to show how the meaning of a word changes over time, but do not automatically identify the objects whose representations differ most between models. Wang et al. [WLA∗18] use cosine distance to find nearest neighbors between embeddings and then arbitrarily sample words to determine if the semantic meanings differ between the models. Other work has demonstrated differences in representations learned by convolutional neural networks (CNNs) by deconvolution of representations of sampled images [YYB∗14], alignment of hidden units with interpretable concepts [BZK∗17], or projections of representations at different layers or epochs [RFFT16]. The Python software package repcomp [Shi18] quantifies the difference between two embedding spaces through local neighborhood similarity, but only outputs a single global similarity value between the spaces and does not permit fine-grained inspection of object-level differences. In contrast, the Embedding Comparator computes a similarity metric for every embedded object and initializes its view to begin with the objects that are most and least similar between the two models. As we find in our formative interviews (Section 3), users would benefit from tools that systematically identify objects of interest, alleviating the need to sample them in a random or biased way. Moreover, we present these objects in an interactive graphical system that facilitates rapid exploration of differences between the embedding spaces.

Figure 2: A matrix summary of our formative interviews grouped by user class. Users compare embedding spaces for a variety of tasks and goals, but find current processes unstructured and inconsistent due to their reliance on ad hoc exploration. [The matrix characterizes users U1–U13 by domain (machine learning, computational biology, digital humanities), affiliation (academia or industry), and title (graduate student, engineer, researcher, professor); by their project types, use cases, and comparison tasks; and by the tools, comparison strategies, and challenges they reported.]

3. Formative interviews

To understand how embedding spaces are analyzed and compared, and to identify process pitfalls and limitations of existing tools, we conducted a series of semi-structured interviews with 13 embedding users from both academia and industry. As listed in Fig. 2, our users work in a diverse range of domains, hold a variety of titles, and use embeddings in a wide range of projects. We recruited embedding users from within our organization and through an open call on Twitter. Since our goal was to learn how users leveraged embedding spaces, we filtered out users who reported having no experience with embeddings.

We conducted semi-structured interviews with one interviewee at a time. Each interview lasted 30–60 minutes. We initially conducted interviews with users in person but switched to video chat due to the COVID-19 pandemic. In each interview, we began by describing our study objective: to understand the landscape of embedding use. We then asked users open-ended questions and encouraged users to describe their specific experience with embeddings, such as "describe how you used embedding spaces in a recent project." Our questions aimed to answer the following:
• What types of projects are embeddings used in?
• How do embeddings help users achieve their goals?
• When do users compare embedding spaces?
• What tools and strategies are used to compare embeddings?
• What challenges do users face when comparing embeddings?

3.1. User tasks and goals

Through our interviews, we find users compare embeddings in consistent ways, but their tasks and goals for comparing embeddings differ significantly (see Fig. 2). Specifically, we find users fall into two categories: model-driven users, who compare embedding spaces as a way to understand model performance and behavior; and data-driven users, who study embedded representations to uncover properties of the underlying data.

Model-driven users. Model-driven users analyze and compare embeddings as a way to develop a deep knowledge of how their models work and why. These users are interested in questions such as: what is responsible for the performance improvement between two models? (U1–U5), what embeddings should I use for initialization? (U1, U4, U6), what model will perform the best on my downstream task? (U1, U2, U4, U5, U8), and how do differences in training data affect model behavior? (U3, U7). To answer these questions, model-driven users perform embedding comparison tasks such as: comparing learned embeddings from different layers of the same model, comparing embeddings from different models trained on the same data, comparing learned embeddings to the ground truth, and comparing generic embeddings to those that have been fine-tuned for a particular task.

Data-driven users. Data-driven users utilize embeddings as a way to represent and understand relationships in the data they study. These users typically work on projects in applied machine learning domains like computational biology or historical linguistics and use embedding spaces "to investigate questions that have been made through non-computational methods" (U13). Example questions from data-driven users we interviewed include: what are the structural relationships between protein sequences? (U12), what is the relationship between x-ray images and corresponding radiology reports? (U9), and how has the plot speed in fiction novels shifted over time? (U10). Use cases include selecting embeddings for downstream tasks (e.g., comparing pre-trained word embeddings to embeddings fine-tuned on task-specific literature for document classification [U11]) and using embeddings to analyze differences in the data (e.g., comparing word embeddings trained on literature from different centuries [U13] or comparing embedded protein relationships to ground truth data [U12]).

3.2. Current processes and challenges

Through our interviews, we find model-driven and data-driven users use similar tools, apply similar strategies, and run into common problems when comparing embedding spaces. A common workflow, referred to as "global check and local check" by U5 and their colleagues, involves comparing the global structures of embedding spaces via dimensionality reduction and then selecting objects within the space and comparing their k-nearest neighbors. Analyzing the global structure of embedding spaces enables users to gain insight into the semantic concepts represented by each embedding space, while comparing local neighborhoods of select objects can unearth unexpected relationships between objects or confirm user hypotheses. For example, when studying how literary texts personify non-human characters, U11 analyzed the global structure of their embedding spaces using dimensionality reduction techniques, finding the spaces had separated into discrete clusters representing personhood identifiers: professions, education types, etc. This finding propelled U11 to probe the local neighborhoods of known character types within each cluster, like "pilot", to discover new associated personifying words, like "train conductor".

Comparing local neighborhoods of specific objects was a critical task for the majority of our users (11/13), especially in cases when global projections were unable to meaningfully segment the space; however, many users expressed concerns that the way they selected objects to analyze was unprincipled and reduced their confidence in their results. Data-driven users selected objects using their domain expertise (e.g., words known to change over time [U11, U13] or proteins known to interact with SARS-CoV-2 [U12]), but expressed concern that this approach would not uncover unexpected results. Model-driven users selected objects they expected would be challenging for the model (U6), objects they hypothesized would display insightful differences (U1, U4, U5), or objects selected at random (U6); however, they were often concerned that this unsystematic approach prevented them from understanding the entire space and could be viewed as cherry-picking by the research community. To mitigate these concerns, many users additionally compared embeddings on quantitative downstream tasks (e.g., comparing word embeddings on genre classification [U10]). While doing so added some sense of rigor to their process, users found this procedure to be "embarrassingly empirical" (U10) because it did not provide the same rich insights users got from local neighborhood exploration. As U1 emphasized, "Qualitative [analysis] is often more powerful than just a single quantitative number."

Throughout our interviews, users expressed concerns about the unreliability of common analysis techniques such as t-SNE and embedding alignment algorithms (U4, U7, U9, U13). We found users distrust t-SNE due to its sensitivity to hyperparameters and stochasticity, which can lead to wildly different projections and oftentimes misleading results [WVJ16]. Comparing embeddings using t-SNE caused users to distrust their conclusions as they were unable to draw a distinction between "real" findings and t-SNE artifacts. This led U4 to stop using t-SNE altogether, noting, "You can tease apart whatever you want from t-SNE. Sometimes it shows you what you want and if it doesn't then you can spend a bit of time until you see what you want to see." Users expressed similar concerns about embedding space alignment algorithms, such as orthogonal Procrustes, because the algorithms can produce unreliable and meaningless alignments (see Section 2.3). While U13 previously applied Procrustes alignment, they are hesitant to use it again in the future because they could "only find the story [they] wanted to tell sometimes" and struggled to determine whether their results were representative of the embedding spaces or resulted from the unpredictability of the method.

3.3. Embedding Comparator design goals

To inform the design of the Embedding Comparator, we distill our formative interviews into the following design goals:

1. Surface similarities and differences for systematic comparison. Across all users, the central goal of comparing embedding models is to understand the degree to which they are similar, and where the differences lie. A recurrent breakdown is how unprincipled and time-consuming identifying this information is with current approaches (e.g., performing a dimensionality reduction that does not result in informative separation, or manually generating objects to analyze via k-nearest neighbors). Moreover, our interviews suggest that users are most interested in seeing the objects that are most similar or different — as U2 noted, "It is most interesting what happens at the extremes."

2. Display projection plots to surface global semantic differences between embedding spaces. The majority of our users (9/13) use dimensionality-reduced projection plots as a primary mechanism for evaluating embedding spaces. Besides users' familiarity with these views, our interviews highlighted that visualizing global structure provides necessary context to meet our first design goal, as the overall shape, density, and clustering of a projection can reveal similarities and differences at a glance. For example, U1 described how viewing the global projection of a particular embedding model caused them to stumble upon an unexpected structure they later wrote a paper about.

3. Compare local neighborhoods to identify object-level similarities and differences. While projection plots provide useful global context, the substance of our users' analysis occurs by drilling down into and comparing the local neighborhoods of embedded objects. For example, U2 noted that even in their papers they "often report a few nearest neighbors of their models to show that they capture [meaningful] properties". The Embedding Comparator must provide interface elements that facilitate rapid comparisons of reciprocal local neighborhoods for each embedded object.

4. Interactively link global and local views for rapid analysis. Users typically explained their global and local analysis as a single technique; however, current tools force users to complete each task independently by first projecting the high-dimensional space and independently querying for objects of interest. The lack of interactivity slows down their analysis processes and makes it difficult to deeply explore and identify differences that are not obvious. Thus, the Embedding Comparator should use interactive techniques to link global and local views together, including allowing users to select local views from global views, highlight local neighborhoods within the global projections, and perform targeted searches to surface the local neighborhoods of specific embedded objects.

5. Yield deterministic and reproducible results that inspire user confidence. Our formative interviews revealed many users avoided unreliable analysis techniques such as t-SNE and Procrustes alignment. Users want to be confident the patterns they uncover are representative of true patterns in the embedding space and are not artifacts of non-deterministic tools. As such, the Embedding Comparator should leverage techniques that produce reproducible results users can trust.

4. System design

Informed by our formative interviews, the Embedding Comparator computes a similarity score for every embedded object based on its reciprocal local neighborhoods. Critically, this similarity metric does not require the two models to have the same dimensionality, nor do they need to be aligned in any way. As a result, the Embedding Comparator is capable of supporting a wide range of embedding models. Furthermore, the simplicity of computing overlap between reciprocal local neighborhoods makes this metric intuitive to a broad range of users, which we validate in our first-use studies (Section 6). To allow users to rapidly identify similarities and differences between the models, this similarity metric is visualized via a number of global views (including projection plots and a histogram distribution) as well as through local neighborhood dominoes: small multiple views of local substructures.

4.1. Computing local neighborhood similarity

An embedding space is a function E : V → R^d that maps objects in vocabulary V into a d-dimensional real-valued vector space. For example, in NLP, V may be a vocabulary of English words, and E maps each word into a 200-dimensional vector.

Here, we consider two embedding spaces E_1 and E_2 over the same set of objects V. The embedding spaces may have different bases and may even have a different number of dimensions d. For this reason, we compute the similarity for each object in the embedding space through similarity of the reciprocal local neighborhood around the object in each of the embedding spaces. Precisely, for each object w ∈ V, we compute the local neighborhood similarity (LNS) of w between the two embedding spaces as:

LNS(w) = S(k-NN_1(w), k-NN_2(w))    (1)

where k-NN_i(w) returns the k-nearest neighbors of w in embedding space E_i and S is a similarity metric between the two lists of nearest neighbors. Here, we compute k-nearest neighbors with cosine or Euclidean distance between the embeddings [TP10] and take S as the Jaccard (intersection over union) similarity between the sets of neighbors. The Jaccard similarity between two sets C_1 and C_2 is defined as J(C_1, C_2) = |C_1 ∩ C_2| / |C_1 ∪ C_2| and scales between 0 (the sets are disjoint) and 1 (the sets are identical). Users can choose between these different metrics in the interface, or introduce additional functions via a JavaScript API call. This deterministic metric enables the Embedding Comparator to systematically sort and identify objects of interest, alleviating the need for users to sample them in a random or biased way or to formulate hypotheses a priori about which objects to investigate (Design Goal 1, Design Goal 5).

4.2. Global views

The Embedding Comparator's left-hand sidebar (shown in Fig. 3a–e) provides configuration options and interactive global views of the embedding spaces. A user begins by selecting a dataset, which specifies the embedding vocabulary, followed by two models, each of which defines the embedding space for that vocabulary (Fig. 3a). Users can load arbitrarily many models, and by decoupling datasets from models, the Embedding Comparator makes it easy to compare several different models with the same vocabulary.

Beneath each model, the Embedding Comparator shows a global projection (Fig. 3b): a scatter plot that depicts the geometric structure of the model's embedding space, with object names shown in a tooltip on hover. As per Design Goal 2, these projections provide valuable context during exploration and can help reveal interesting global properties such as the presence of distinct clusters. We use PCA to perform dimensionality reduction because, as compared to alternatives like t-SNE, it is deterministic (Design Goal 5) and highlights rather than distorts the global structure of the embedding space [Jol86, WVJ16]. However, we provide a modular system design, and users can choose an alternate technique.

The similarity distribution (Fig. 3c) displays the distribution of LNS similarity values (Eq. 1) over all objects in the embedding space. Bars are colored using a diverging red-yellow-blue color scheme to draw attention to the most extreme values (objects that are the most and least similar between the two selected models) in accordance with Design Goal 1. We reapply this color encoding in the global projections to help users draw connections between the two visualizations and to help reveal global patterns or clusters of objects with related similarity values (such as those in the case studies of Fig. 4 and Fig. 5). Both the similarity distribution and the global projections can be used to interactively filter the dominoes (Design Goal 4; Section 4.3), and the search bar (Fig. 3e) can be used to populate the dominoes of specific object(s) of interest.
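Concretely, Equation 1 amounts to a k-NN query in each space followed by a Jaccard overlap. The following pure-Python sketch illustrates the metric end to end; the toy vocabulary, helper names, and the choice of k = 2 are our own illustration, not the system's actual (web-based) implementation:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def k_nearest(word, embedding, k):
    """The set of the k nearest neighbors of `word` (excluding itself)."""
    dists = sorted((cosine_distance(embedding[word], vec), other)
                   for other, vec in embedding.items() if other != word)
    return {other for _, other in dists[:k]}

def lns(word, emb1, emb2, k=50):
    """Local neighborhood similarity (Eq. 1): Jaccard overlap of k-NN sets."""
    n1, n2 = k_nearest(word, emb1, k), k_nearest(word, emb2, k)
    return len(n1 & n2) / len(n1 | n2)

# Toy vocabulary embedded in two spaces of different dimensionality:
# the metric never compares vectors across spaces, so d may differ.
emb_a = {"mean": [1.0, 0.0], "average": [0.9, 0.1],
         "convey": [0.8, 0.3], "median": [0.0, 1.0]}
emb_b = {"mean": [1.0, 0.0, 0.0], "average": [0.9, 0.1, 0.0],
         "convey": [0.0, 1.0, 0.0], "median": [0.8, 0.2, 0.0]}

scores = {w: lns(w, emb_a, emb_b, k=2) for w in emb_a}
# Sorting by score surfaces the least and most similar objects first,
# mirroring the default two-column domino view described in Section 4.3.
ordered = sorted(scores, key=scores.get)
```

Because both the k-NN queries and the Jaccard overlap are deterministic, rerunning this computation always yields the same scores, which is the property the system relies on for Design Goal 5.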

Figure 3: Users populate the Embedding Comparator by selecting a pair of embedding models (a). Global projections (b) visualize the geometry of the embeddings, and local neighborhood dominoes (f) visualize an object’s local neighborhood. Users explore by brushing over the similarity distribution (c), tuning the neighborhood parameters (d), and searching for objects (e). Hovering on a domino highlights its neighborhood in the global projection (h), and hovering on an object in a domino highlights that object in the local neighborhood plots (g).
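The global projections in panel (b) use PCA by default. A real deployment would call a standard library implementation, but the computation can be sketched deterministically in pure Python via power iteration on the covariance matrix; the function names and toy data below are our own illustration, not the system's code:

```python
import math
import random

def top_pcs(vectors, n_components=2, iters=200, seed=0):
    """Top principal components of a list of d-dimensional vectors,
    via power iteration with deflation on the covariance matrix."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(row[i] * row[j] for row in centered) / n
            for j in range(d)] for i in range(d)]
    rng = random.Random(seed)  # fixed seed keeps the projection reproducible
    components = []
    for _ in range(n_components):
        w = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            w = [x / norm for x in w]
        components.append(w)
        # Deflate: subtract the found component's variance so the next
        # power iteration converges to the following eigenvector.
        lam = sum(w[i] * sum(cov[i][j] * w[j] for j in range(d)) for i in range(d))
        cov = [[cov[i][j] - lam * w[i] * w[j] for j in range(d)] for i in range(d)]
    return mean, components

def project(vec, mean, components):
    """Coordinates of a vector in the plane spanned by the components."""
    c = [v - m for v, m in zip(vec, mean)]
    return [sum(ci * wi for ci, wi in zip(c, w)) for w in components]

# Toy data: variance is concentrated along the first axis, then the second,
# so the recovered axes approximate the coordinate axes.
vectors = [[3, 0, 0], [-3, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 0.1], [0, 0, -0.1]]
mean, comps = top_pcs(vectors)
points_2d = [project(v, mean, comps) for v in vectors]
```

Unlike t-SNE, the same input always produces the same 2-D scatter plot (up to the sign of each axis), which is why the paper prefers PCA for its default global views.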

The parameter controls (Fig. 3d) enable the user to interactively change the value of k used to define the size of local neighborhoods for computing similarity, and select the distance metric used for computing distance between vectors. The default value of k is 50, which we found to provide insightful results across our various experiments and case studies. Changes in either of these controls immediately update the rest of the interface (Design Goal 4).

4.3. Local neighborhood dominoes

To meet Design Goal 3, the Embedding Comparator introduces local neighborhood dominoes: a small multiples visualization to surface local substructures and facilitate rapid comparisons. Each domino consists of a set of interactive neighborhood plots and common and unique neighbor lists.

Neighborhood plots — side-by-side scatter plots that show the k nearest neighbors of the domino object in each model — graphically display the relationships between the object and its neighbors. These plots use the same projections as the global projection views (Section 4.2) to ensure all geometries are visualized consistently. Color encodes whether each neighbor is common to both models or unique to a single model. These neighbors are also displayed as separate scrollable lists above and below the plots and are sorted by the distance to the main object. For example, Fig. 3f shows the domino for the word "mean" from an embedding trained on text from the early 1800s (model A, left) to the 1990s (model B, right). To facilitate cross-model comparisons, dominoes are interactive: hovering over a neighbor in the plots or lists highlights it across the entire domino (Fig. 3g). Motivated by Design Goal 1 and our users' feedback that the most interesting insights often lie at the extremes, the default view of the Embedding Comparator lists two columns of dominoes (shown in Fig. 4): the first displays dominoes of the least similar objects (in increasing order) and the second displays those of the most similar objects (in decreasing order). To increase information scent [Pir03], we adapt the Scented Widgets technique [WHA07] and augment the scroll bars with a list of domino objects and sparklines of their similarity scores (Fig. 4).

The dominoes' information-dense display is designed to facilitate rapid acquisition of neighborhood-level insights. By scanning down the dominoes, users see not only the geometries involved but also specific common and unique neighbors to trigger hypothesis generation. Previous iterations of the Embedding Comparator used separate tabular lists to display the most and least similar words across models (and common and unique neighbors for individual objects) but, in early user tests, we found that such a presentation produced a high cognitive load as users tried to map back and forth between the various lists. Thus, with the domino design, we encapsulate all local neighborhood information associated with a given embedded object into a single interface element while still supporting cross-model comparisons. For instance, in the "mean" domino in Fig. 3f, the neighbor plots reveal substructures within the local neighborhoods — in model B, the bottom appears to relate to the mathematical notion of "mean", while the top is more synonymous with "convey" and shares neighbors with model A. We confirm this hypothesis by scanning the common and unique lists, where we observe more mathematical words under model B than A.

4.4. Linking global and local views

Linking the global and local views (Design Goal 4) enables users to rapidly iterate between comparing the overall embedding spaces and inspecting individual objects. When hovering over a domino, the object and its local neighborhoods are highlighted in each of the respective global projections with the purple/green color encoding preserved (Fig. 3h). This interaction allows users to contextualize local neighborhoods within the overall embedding space. Similarly, interactive selections (e.g., lassoing or brushing) in the global projections or similarity distribution filter the dominoes, allowing users to drill down and investigate objects of interest.

5. Evaluation: case studies

We illustrate how the Embedding Comparator accelerates real-world workflows through case studies drawn from our formative interviews (see supplement for additional applications).
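The per-object similarity score that drives these comparisons can be sketched compactly. The following is an illustrative reconstruction, not the authors' implementation: the paper computes the similarity of the k-nearest neighbors of every embedded object between the two spaces, and here we assume cosine similarity and measure neighborhood overlap as the fraction of shared neighbors (both assumptions):

```python
import numpy as np

def knn_indices(emb, k):
    """Row-wise k nearest neighbors (excluding self) under cosine similarity."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # an object is never its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def neighborhood_similarity(emb_a, emb_b, k=50):
    """Per-object similarity between two embedding spaces: the fraction of an
    object's k nearest neighbors that are shared across models A and B.
    Rows of emb_a and emb_b must index the same vocabulary."""
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)])
```

Objects scoring near 0 would populate the least-similar domino column and objects near 1 the most-similar column; swapping in a different distance metric or value of k mirrors the parameter controls described above.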

Figure 4: The Embedding Comparator applied to case study Transfer learning for fine-tuned word embeddings compares a model before and after fine-tuning for sentiment analysis. The Embedding Comparator reveals the effect of transfer learning on the embedding space via the global projections, as well as neutral sentiment words that have taken on sentiment meanings (e.g., "avoid") via the dominoes.

5.1. Transfer learning for fine-tuned word embeddings

Our formative interviews reveal that users employ transfer learning to improve performance on downstream tasks [LSD15, HR18], but existing tools make it difficult to compare the trade-off between pre-trained embeddings' generalizability and fine-tuned embeddings' domain-specific expressiveness. To evaluate the Embedding Comparator on this task, we develop a transfer learning case study in collaboration with an ML researcher (a model-driven user). Here, the researcher trains an LSTM network [HS97] to predict whether movie reviews express positive or negative sentiment. The researcher initializes the network using pre-trained fastText English word embeddings [MGB∗18] and refines them using limited movie review data [MDP∗11]. Once complete, the researcher wishes to investigate the effect of the refinement process: for example, identifying words that are ordinarily synonyms but not in the context of sentiment prediction, or words that are not ordinarily synonyms but become interchangeable in the context of sentiment prediction.

We use the Embedding Comparator to compare the pre-trained fastText word embeddings with those fine-tuned for the sentiment analysis task. Fig. 4 shows the initial view in the Embedding Comparator after loading both models. Our system immediately surfaces a number of insights about how the embedding space has changed as a result of fine-tuning. The color encoding shared between the similarity distribution and global projections reveals that the words that changed the most as a result of fine-tuning have moved toward the outer regions of the embedding space. By hovering over the fine-tuned embedding space's global projection, the researcher finds positive sentiment words moved toward the left side of the projection, while negative sentiment words moved toward the right. These interactive global views enable the researcher to identify how fine-tuning the embeddings for this classification task has affected the global arrangement of objects in the vector space.

By inspecting local neighborhood dominoes at the extremes of the similarity distribution, the researcher can understand how fine-tuning has affected the semantic meaning of specific objects. For example, in the original space "bore" is most closely related to "carrying" or "containing", but after fine-tuning for sentiment analysis it is most closely related to "boredom", "boring", or "dull" — hence, fine-tuning has had the intended effect, where "bore" now takes on its sentiment-related meaning. Similarly, "redeeming" has changed in sentiment from the positive sentiment definition "compensate for faults" to a negative sentiment definition likely related to the reviewer idiom "no redeeming qualities". Besides words, the dominoes reveal that numbers less than 10 have also been affected by the sentiment analysis task — for example, "7" has become more closely related to positive adjectives as a result of the numeric scales used to rank movies within the reviews. Words that remained unchanged seem uncorrelated with movie sentiment, such as proper nouns like "vietnam" and "europe" as well as "cinematographer".

The researcher we worked with on this case study was excited about the results the Embedding Comparator helped reveal. Generating these types of insights with prior tools would require significant tedious effort — besides manually constructing the necessary views (e.g., within a Jupyter notebook), the researcher would have needed to formulate hypotheses a priori about which specific words to investigate. In contrast, by calculating a similarity score for every word, and by visualizing local neighborhoods as dominoes, the Embedding Comparator surfaces this information more directly. As a result, the system transforms the process of comparing embedding models from requiring explicit and manual steering by the researcher, towards more of a browsing experience. This shift frees the researcher to focus on generating and answering hypotheses about their models, and allows for more serendipitous discovery. For instance, while using the Embedding Comparator, the researcher discovered an unexpected result: the word "generous" adopted a more negative meaning after fine-tuning. Explaining this finding required digging into the training data to uncover phrases such as "I'm being generous giving this movie 2 stars" and "2 stars out of a possible 10 and that is being overly generous."

Figure 5: The Embedding Comparator applied to case study Language evolution via diachronic word embeddings compares word embedding models trained on literature from different decades. The Embedding Comparator demonstrates how the English language has changed over time, e.g., the word "gay" has changed in meaning from "happy" to "homosexual" while the meanings of numbers have not changed.

5.2. Language evolution via diachronic word embeddings

Our second case study follows how a linguist (a data-driven user) employs embedding models to study the evolution of languages over time. Word embeddings have been shown to capture diachronic changes in language (i.e., changes over time) [HLJ16b]. Here, we use diachronic word embeddings from HistWords [HLJ16b], trained on English books written from 1800–2000 grouped by decade. We select embeddings from five different decades spanning this time period (1800–1810, 1850–1860, 1900–1910, 1950–1960, and 1990–2000) and evaluate how the Embedding Comparator surfaces words whose semantics have evolved.

Fig. 5 shows the Embedding Comparator comparing embeddings of text written in 1900–1910 to text written in 1990–2000. The linguist can immediately replicate known insights [HLJ16b], such as the change in meaning of "gay" (from "happy" to "homosexual") and "major" (from "military" to "important") over time. It also reveals words such as "aids", whose meaning changed from "assists" to the disease HIV/AIDS, which was not named until the early 1980s [fDCC∗82], along with many other words that are similarly ripe for further linguistic analysis. In contrast to the original analysis [HLJ16b], with the Embedding Comparator, there is no need to manually align the various embedding spaces, nor do users need to define and compute a task-specific semantic displacement metric to uncover these findings. Our method for comparing embedded objects through local neighborhoods is agnostic to application domains and tasks. As a result, the Embedding Comparator scaffolds and accelerates the analysis process for users regardless of their ML expertise — novice users need minimal technical knowledge to replicate established linguistic analysis, while experts can devote their effort to designing task-specific metrics only when necessary.

To further compare the two models, the linguist can explore the effect of varying the number of neighbors k using the nearest neighbors slider. Smaller values of k enable comparisons of how immediate neighbors of a word have changed over time. By dragging the slider to decrease k, the linguist discovers words that have high similarity at small values of k, such as "les", which has the same small set of French words as its nearest neighbors in both models. This finding may be surprising to the linguist given the models were trained on English text. Larger values of k enable comparisons of larger neighborhoods. Words with high similarity at larger k have large neighborhoods that have remained consistent over time. Increasing k reveals words such as days and months whose meanings have remained consistent between the time periods.

Using this case study, we can also see the benefits of being able to easily switch between alternative models trained on the same dataset. For example, using the corresponding drop-down menus, the linguist could switch to comparing 1800–1810 vs. 1900–1910 to find that "nice" moves away from meaning "refined" and "subtle" and moves toward "pleasant", in line with previous findings [HLJ16b]. Similarly, by fixing one model to 1990–2000 and varying the other, the similarity distribution reveals the pairwise diachronic changes: bars would shift rightward as models trained on more recent text have more similar words.
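The common and unique neighbor lists at the heart of these analyses, and the effect of re-running a comparison at several values of k to mimic the nearest neighbors slider, can be sketched as follows. This is an illustrative reconstruction rather than the system's actual code, and it assumes cosine similarity:

```python
import numpy as np

def nearest(vocab, emb, word, k):
    """The k nearest neighbors of `word` by cosine similarity, nearest first."""
    idx = vocab.index(word)
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    sims[idx] = -np.inf  # exclude the word itself
    return [vocab[i] for i in np.argsort(-sims)[:k]]

def domino_lists(vocab, emb_a, emb_b, word, k):
    """Common and unique neighbor lists for one word across models A and B,
    mirroring the lists displayed on a domino."""
    na = set(nearest(vocab, emb_a, word, k))
    nb = set(nearest(vocab, emb_b, word, k))
    return sorted(na & nb), sorted(na - nb), sorted(nb - na)
```

Calling `domino_lists` with increasing k corresponds to dragging the slider: words whose unique lists stay empty even at large k have neighborhoods that remained consistent between the two models.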

6. Evaluation: first-use studies

The Embedding Comparator is designed to help users efficiently analyze the similarities and differences between embedding spaces. To evaluate its effectiveness, we performed first-use studies in which participants compared two embedding spaces using their normal process in a Jupyter Notebook [KRKP∗16] and with the Embedding Comparator. In particular, users compared HistWords [HLJ16b] embeddings from 1800–1810 to embeddings from 1990–2000 (Section 5.2). We chose HistWords embeddings because they model English language over time, so no task-specific knowledge was required. We recruited participants via an open call within our organization. We performed first-use studies via video chat with 15 representative users: six computer science graduate students, one psychology graduate student, one computational biology graduate student, four ML engineers, and three post-docs, who all self-reported having researched or worked with embeddings.

Each first-use study lasted 60–90 minutes, and participants were compensated with $20 Amazon gift cards. Users spent half of the study using the Embedding Comparator and half using the Jupyter Notebook. We randomized the starting interface: seven users started with the Embedding Comparator and eight users started with the Jupyter Notebook. Each condition began with an introduction to how the embeddings were loaded into the interface and a prompt to determine the similarities and differences of the two embedding spaces. In the Jupyter Notebook, we provided users with code to plot dimensionality reduction projections of the embeddings and print the nearest neighbors of an object, and we encouraged users to use any tools — including their own code — that would help them analyze the embeddings. Participants were asked to think aloud throughout the study and, at the end of each condition, were asked to enumerate the similarities and differences between the two embedding spaces and describe what they would do if they had access to the original models and data.

Figure 6: Using the Embedding Comparator, participants gained more insights and theories and reduced their time to first insight. [Three scatter panels (Number of Insights, Number of Theories, Seconds to First Insight) plot each participant's Embedding Comparator result (y-axis) against their Jupyter Notebook result (x-axis).]

6.1. Quantitative results

We reviewed recordings of each study session and measured the insights and theories our users identified using each interface. We define an insight as the user indicating that an embedded object, or type of embedded object, is represented similarly or differently in the two embedding spaces in a meaningful way (e.g., "gay" moving in meaning from "happy" to "homosexual" or an embedding space projection showing meaningful clusters). We say the user had a theory if they generated a reason for an insight or expressed a desire to look at the original data or models to better understand an insight.

We measure the number of insights, number of theories, and time to first insight (measured from the end of our introduction) for both conditions and show participant-level results in Fig. 6. We compute statistical significance using a single-tailed paired sample t-test. Users developed their first insight faster (p = 0.003) using the Embedding Comparator (µ = 1:16 min, σ = 1:20 min) as compared to the Jupyter Notebook (µ = 8:13 min, σ = 7:54 min). Users also developed more insights (p = 0.004) using the Embedding Comparator (µ = 10.7, σ = 6.6) as compared to the Jupyter Notebook (µ = 4.1, σ = 3.3), and users generated more theories (p = 0.001) using the Embedding Comparator (µ = 1.7, σ = 1.6) as compared to the Jupyter Notebook (µ = 0.3, σ = 0.5).

6.2. Qualitative results

Despite having access to starter code in the Jupyter Notebook and being encouraged to use their own code or outside resources, participants experienced a barrier to entry in coding and found it challenging to write code that aided them in their analysis of the embeddings. Three participants did not write any of their own code during the 45-minute Jupyter Notebook section, and only four participants felt the code they had written allowed them to understand the similarities and differences between the two embedding spaces. A common workflow used by nine participants involved generating words they expected to differ between the two spaces and comparing their nearest neighbors. This process caused a user's biases to drive their exploration (e.g., "I am looking for a word with two meanings where the meanings are very different" [P6]; or, "because I am Arab, I want to look at [the word] Arab" [P1]). As a result, users rarely uncovered unexpected results, and they were often frustrated because they had to rely on their intuition — as P1 said, "I don't have a sophisticated way of looking through this. Some 1800s and 1900s history knowledge would be useful." Interestingly, after using the Jupyter Notebook, several participants did not feel that they had developed sufficient insights to understand how the spaces differed. For instance, when asked to enumerate the similarities and differences they had uncovered, P9 said, "I don't have a good understanding of these embeddings," and P15 answered, "I don't think I gained enough insights." Further, P14, who had been exposed to the Embedding Comparator first, felt the Embedding Comparator's "quick and helpful analysis" had enabled them to "know right away what embeddings to use [in a downstream task]", so they "would not have written any code."

In comparison, when using the Embedding Comparator, users generated insights quickly by looking at the most and least similar lists. Common insights included the meaning of numbers and religious words (e.g., "catholic" and "god") staying constant over time (P1, P2, P5, P7–P13); "aids" changing in meaning from "help" to "HIV" (P1, P3, P6, P8, P11–P13); and "logic", "calculated", and "volume" shifting towards mathematical denotations (P1, P3, P5, P9, P10, P12, P13). Using the Embedding Comparator, users were able to gain insights into global semantic differences between the embedding spaces. For instance, P9 found "words that are commonly used without significant changes to their meaning over time are similar between the two embeddings, whereas other words that have changed their usage over time or correspond to a new invention, like the car, are of course dissimilar in the two embeddings. I did not find this in my previous analysis."

Users found the dominoes invaluable in their exploration, saying

"it's hard to compress all this information in a Jupyter Notebook, so it's nice to have a tool to be able to browse a lot of words at once" (P5), and "I am able to see [objects] in a graphical view which helps me see [similarities and differences]" (P3). Using the dominoes, users were able to analyze many objects at a time and speed up their analysis. For instance, P13 found dominoes allowed them to "get a quick sense of what words distinguish one [embedding space] from the other" and P8 articulated, "The differences of the words across time immediately popped out here. This was precisely what I was trying to articulate with the [Jupyter Notebook]."

Our first-use studies also validate our local neighborhood similarity metric and the use of this metric to drive our system by sorting dominoes. Interestingly, without prior knowledge of the Embedding Comparator, one participant implemented a metric similar to ours for computing the local neighborhood intersection during their Jupyter exploration, suggesting local neighborhood similarity is intuitive to users. However, their implementation was not optimized and did not scale to compute similarity for a sample of more than 40 words, leading to time spent waiting for code to run and unease while debugging. Since the Embedding Comparator calculates similarity for all embedded objects automatically when embeddings are loaded, it shifts users to immediately focus on meaningful parts of the analysis process as opposed to tedious implementation.

Laying out the dominoes ordered by similarity aided users in understanding the similarities and differences between the embedding spaces. P1 attributed seeing "a lot more words and what is most similar and least similar" as the reason why they generated insights faster using the Embedding Comparator and compared the two interfaces by saying "in the Notebook words were randomly chosen, but here they are sorted, which lets you see things in a more organized way and lets words stick out." P1 found it to be invaluable when generating insights, saying the "main thing that was useful was having [the Embedding Comparator] order everything, being able to look at the different parts of the distribution, and seeing [objects] in order - the least similar and the most similar." Similarly, P5 reported "seeing the differences in sets in the nearest neighborhoods is a much easier way to compare the sets of embeddings" and P2 summarized their experience by saying, "[the Embedding Comparator] presented the information in a better way and just by doing that we have some sort of a breakthrough."

After analyzing the dominoes in the default view of the Embedding Comparator, users varied the parameter controls to further compare models and test hypotheses. Users often brushed over the similarity distribution to analyze words at a specific similarity value (P1, P5, P7–P15), used the search bar to analyze words that had come to mind during their analysis (P2, P3, P5–P15), and varied k to explore the hierarchy of a word's local neighborhood (P1, P2, P6, P7, P9–P15). For example, after discovering the neighbors of "content" were related to "satisfied" in one model and "substances" in the other, P10 wanted to understand "how these models represent [words] that have two meanings." They hypothesized that both models may have learned both definitions and k was simply too small to see the full neighborhood. After increasing k, P10 found the spaces were still very dissimilar, leading them to theorize one dataset may have contained more scientific text than the other.

When using the Embedding Comparator, users were also able to generate numerous theories including ideas for improving the data representation (e.g., preprocessing numbers into a learned token [P5]); new experimental procedures (e.g., training embeddings on native English books vs. books translated from Chinese as a way to compare culture [P7]); reasons for representation failures (e.g., word frequency affects semantic stability over time [P1, P9, P10, P14], or words used in many contexts are susceptible to being represented differently between the two embedding spaces [P2, P10, P12, P13]); and questions about the original data (what is the distribution of genres included in the datasets? [P5–P7, P9, P10, P12] and what was the data preprocessing pipeline? [P3, P12]). After using both interfaces, P5 noted the Embedding Comparator "made it very clear that there were differences in the embedding spaces" as compared to the Jupyter Notebook where "it is a lot harder to find the differences." As P4 said, using the Embedding Comparator "would be very helpful to my research."

7. Discussion and future work

We present the Embedding Comparator, a novel interactive system for comparing embedding spaces. Informed by formative interviews conducted with embedding users across disciplines, our design balances between visualizing information about the overall embedding spaces and enabling exploration of local neighborhoods. To directly surface similarities and differences, a similarity score is computed for every embedded object and encoded across global and local views. To facilitate rapid comparisons, we introduce local neighborhood dominoes: small multiple visualizations of local geometries and lists of common and unique objects. Through case studies of tasks described by our formative interviewees, we demonstrate how the Embedding Comparator transforms the analysis process from requiring tedious and error-prone manual specification to browsing and interacting with graphical displays. Moreover, by computing a similarity score for each object, and using it to drive the various views, the Embedding Comparator immediately surfaces interesting insights, and published domain-specific results can be replicated with only a handful of interactions.

Further study of our design goals suggests several avenues for future work. One extension, to provide richer linking between global and local views, is to consider how metadata tags (e.g., part-of-speech or sentiment) can be visualized and used to filter dominoes. Dominoes were designed around concise visual representations of objects that correspond to semantically meaningful differences between the objects, and one direction for future work is how to visualize objects when such representations do not implicitly exist. For instance, for sentence or document embeddings [LM14], it would be interesting to explore techniques for identifying specific components of each object (e.g., a subset of words in a document) responsible for the differences in the embeddings and visually displaying these via dominoes. Our similarity metric usefully surfaces differences between pairs of models, and one next step would generalize this metric to consider n-wise similarities. Finally, how might interface elements in the Embedding Comparator allow users to link to specific examples in the training data? Such interaction would help further tighten the question-answering loop by providing fine-grained context for understanding why particular similarities or differences exist between the same object across multiple models.
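The n-wise generalization mentioned above could take several forms; one simple possibility (our assumption, not a design the paper commits to) scores each object by the fraction of its k nearest neighbors that are shared by all n models:

```python
def nwise_similarity(neighbor_sets, k):
    """One candidate n-wise metric: the fraction of an object's k nearest
    neighbors that appear in its neighborhood under every one of n models.
    `neighbor_sets` holds the object's k-NN collection from each model."""
    shared = set.intersection(*[set(s) for s in neighbor_sets])
    return len(shared) / k
```

For n = 2 this reduces to the pairwise neighborhood overlap used throughout the system; softer variants, such as averaging the pairwise overlaps across all model pairs, are equally plausible.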

Acknowledgments

This work is supported by a grant from the MIT-IBM Watson AI Lab. We also thank Jonas Mueller for helpful feedback.

References

[AH19] ALLEN C., HOSPEDALES T.: Analogies explained: Towards understanding word embeddings. In International Conference on Machine Learning (2019), pp. 223–231.

[ANH∗20] ARENDT D. L., NUR N., HUANG Z., FAIR G., DOU W.: Parallel embeddings: a visualization technique for contrasting learned representations. In Proceedings of the 25th International Conference on Intelligent User Interfaces (2020), pp. 259–274.

[BB19] BEPLER T., BERGER B.: Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations (2019).

[BBB∗19] BILESCHI M. L., BELANGER D., BRYANT D. H., SANDERSON T., CARTER B., SCULLEY D., DEPRISTO M. A., COLWELL L. J.: Using deep learning to annotate the protein universe. bioRxiv (2019), 626507.

[BH14] BENGIO S., HEIGOLD G.: Word embeddings for speech recognition. In Fifteenth Annual Conference of the International Speech Communication Association (2014).

[BZK∗17] BAU D., ZHOU B., KHOSLA A., OLIVA A., TORRALBA A.: Network Dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition (2017).

[BZS∗19] BAU D., ZHU J.-Y., STROBELT H., BOLEI Z., TENENBAUM J. B., FREEMAN W. T., TORRALBA A.: GAN Dissection: Visualizing and understanding generative adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR) (2019).

[C∗15] CHOLLET F., ET AL.: Keras. https://keras.io, 2015.

[CMJG19] CARTER B., MUELLER J., JAIN S., GIFFORD D.: What made you do this? Understanding black-box decisions with sufficient input subsets. In Artificial Intelligence and Statistics (2019).

[CTL18] CHEN J., TAO Y., LIN H.: Visual exploration and comparison of word embeddings. Journal of Visual Languages & Computing 48 (2018), 178–186.

[EDH19] ETHAYARAJH K., DUVENAUD D., HIRST G.: Towards understanding linear word analogies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019), pp. 3253–3262.

[ERA∗16] EISNER B., ROCKTÄSCHEL T., AUGENSTEIN I., BOSNJAK M., RIEDEL S.: emoji2vec: Learning emoji representations from their description. In Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media (2016), pp. 48–54.

[ERR∗17] ENGEL J., RESNICK C., ROBERTS A., DIELEMAN S., NOROUZI M., ECK D., SIMONYAN K.: Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (2017), JMLR.org, pp. 1068–1077.

[fDCC∗82] CENTERS FOR DISEASE CONTROL (CDC), ET AL.: Update on acquired immune deficiency syndrome (AIDS) - United States. MMWR. Morbidity and Mortality Weekly Report 31, 37 (1982), 507.

[GBYH20] GLEICHER M., BARVE A., YU X., HEIMERL F.: Boxer: Interactive comparison of classifier results. In Computer Graphics Forum (2020), vol. 39.

[HE18] HA D., ECK D.: A neural representation of sketch drawings. In International Conference on Learning Representations (2018).

[HG18] HEIMERL F., GLEICHER M.: Interactive analysis of word vector embeddings. Computer Graphics Forum 37, 3 (2018), 253–265.

[HKPC18] HOHMAN F., KAHNG M., PIENTA R., CHAU D. H.: Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics 25, 8 (2018), 2674–2693.

[HLJ16a] HAMILTON W. L., LESKOVEC J., JURAFSKY D.: Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (2016), vol. 2016, NIH Public Access, p. 2116.

[HLJ16b] HAMILTON W. L., LESKOVEC J., JURAFSKY D.: Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2016), pp. 1489–1501.

[HPvU∗17] HÖLLT T., PEZZOTTI N., VAN UNEN V., KONING F., LELIEVELDT B. P., VILANOVA A.: CyteGuide: Visual guidance for hierarchical single-cell analysis. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2017), 739–748.

[HR18] HOWARD J., RUDER S.: Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018), pp. 328–339.

[HS97] HOCHREITER S., SCHMIDHUBER J.: Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.

[JBJ18] JIN W., BARZILAY R., JAAKKOLA T.: Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning (2018), pp. 2323–2332.

[Jol86] JOLLIFFE I. T.: Principal components in regression analysis. In Principal Component Analysis. Springer, 1986, pp. 129–155.

[KAKC17] KAHNG M., ANDREWS P. Y., KALRO A., CHAU D. H. P.: ActiVis: Visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2017), 88–97.

[KB15] KINGMA D. P., BA J.: Adam: A method for stochastic optimization. In International Conference on Learning Representations (2015).

[KBV09] KOREN Y., BELL R., VOLINSKY C.: Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.

[KPHL17] KUSNER M. J., PAIGE B., HERNÁNDEZ-LOBATO J. M.: Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (2017), JMLR.org, pp. 1945–1954.

[KRKP∗16] KLUYVER T., RAGAN-KELLEY B., PÉREZ F., GRANGER B. E., BUSSONNIER M., FREDERIC J., KELLEY K., HAMRICK J. B., GROUT J., CORLAY S., ET AL.: Jupyter Notebooks - a publishing format for reproducible computational workflows. In ELPUB (2016), pp. 87–90.

[LBT∗17] LIU S., BREMER P.-T., THIAGARAJAN J. J., SRIKUMAR V., WANG B., LIVNAT Y., PASCUCCI V.: Visual exploration of semantic relationships in neural word embeddings. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2017), 553–562.

[Lip18] LIPTON Z. C.: The mythos of model interpretability. Queue 16, 3 (2018), 31–57.

[LJLH19] LIU Y., JUN E., LI Q., HEER J.: Latent space cartography: Visual analysis of vector space embeddings. In Computer Graphics Forum (2019), vol. 38, Wiley Online Library, pp. 67–78.

[LLL∗18] LIU S., LI Z., LI T., SRIKUMAR V., PASCUCCI V., BREMER P.-T.: NLIZE: A perturbation-driven visual interrogation tool for analyzing and interpreting natural language inference models. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2018), 651–660.

[LM14]L E Q.,MIKOLOV T.: Distributed representations of sentences [SGB∗19]S TROBELT H.,GEHRMANN S.,BEHRISCH M.,PERER A., and documents. In International conference on machine learning (2014), PFISTER H.,RUSH A. M.: Seq2seq-vis: A visual debugging tool for pp. 1188–1196. 10 sequence-to-sequence models. IEEE transactions on visualization and computer graphics 25, 1 (2019), 353–363.2 [LNH∗18]L I Q.,NJOTOPRAWIRO K.S.,HALEEM H.,CHEN Q.,YI C.,MA X.: EmbeddingVis: A visual analytics approach to compara- [SGK17]S HRIKUMAR A.,GREENSIDE P., KUNDAJE A.: Learning im- tive network embedding inspection. In 2018 IEEE Conference on Visual portant features through propagating activation differences. In Proceed- Analytics Science and Technology (VAST) (2018), IEEE, pp. 48–59.3 ings of the 34th International Conference on Machine Learning-Volume 70 (2017), JMLR.org, pp. 3145–3153.2 [LSD15]L ONG J.,SHELHAMER E.,DARRELL T.: Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE confer- [SGPR18]S TROBELT H.,GEHRMANN S.,PFISTER H.,RUSH A.M.: ence on computer vision and pattern recognition (2015), pp. 3431–3440. LSTMVis: A tool for visual analysis of hidden state dynamics in recur- 7 rent neural networks. IEEE transactions on visualization and computer graphics 24 [LYC∗16]L I Y., YOSINSKI J.,CLUNE J.,LIPSON H.,HOPCROFT J.E.: , 1 (2018), 667–676.2 Convergent learning: Do different neural networks learn the same repre- [Shi18]S HIEBLER D.: repcomp, 2018. [Online; accessed ]. sentations? In Proceedings of the International Conference on Learning URL: https://pypi.org/project/repcomp/.3 Representations (ICLR) (2016).2 [SI15]S TERLING T., IRWIN J. J.: Zinc 15–ligand discovery for every- [MDP∗11]M AAS A.L.,DALY R.E.,PHAM P. T., HUANG D.,NG one. Journal of chemical information and modeling 55, 11 (2015), 2324– A. Y., POTTS C.: Learning word vectors for sentiment analysis. 
In 2337.3 Proceedings of the 49th Annual Meeting of the Association for Compu- [STN∗16]S MILKOV D.,THORAT N.,NICHOLSON C.,REIF E., tational Linguistics: Human Language Technologies (Portland, Oregon, USA, June 2011), Association for Computational Linguistics, pp. 142– VIÉGAS F. B., WATTENBERG M.: Embedding Projector: Interac- arXiv preprint 150. URL: http://www.aclweb.org/anthology/P11-1015. tive visualization and interpretation of embeddings. arXiv:1611.05469 7,4 (2016).2 [STY17]S UNDARARAJAN M.,TALY A.,YAN Q.: Axiomatic attribution [MGB∗18]M IKOLOV T., GRAVE E.,BOJANOWSKI P., PUHRSCH C., for deep networks. In Proceedings of the 34th International Conference JOULIN A.: Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources on Machine Learning-Volume 70 (2017), JMLR. org, pp. 3319–3328.2 and Evaluation (LREC 2018) (2018).7,4 [TKBH16]T URKAY C.,KAYA E.,BALCISOY S.,HAUSER H.: Design- ing progressive and interactive analytics processes for high-dimensional [MH08]M AATEN L. V. D.,HINTON G.: Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.2 data analysis. IEEE transactions on visualization and computer graphics 23, 1 (2016), 131–140.2 [MHM18]M CINNES L.,HEALY J.,MELVILLE J.: UMAP: Uniform manifold approximation and projection for reduction. arXiv [TP10]T URNEY P. D., PANTEL P.: From frequency to meaning: Vector preprint arXiv:1802.03426 (2018).2 space models of semantics. Journal of artificial intelligence research 37 (2010), 141–188.5 [MLS13]M IKOLOV T., LE Q. V., SUTSKEVER I.: Exploiting sim- ilarities among languages for machine translation. arXiv preprint [TZCS15]T AN L.,ZHANG H.,CLARKE C.,SMUCKER M.: Lexical arXiv:1309.4168 (2013).2 comparison between wikipedia and twitter corpora by using word em- beddings. 
In Proceedings of the 53rd Annual Meeting of the Association ∗ [MSC 13]M IKOLOV T., SUTSKEVER I.,CHEN K.,CORRADO G.S., for Computational Linguistics and the 7th International Joint Confer- DEAN J.: Distributed representations of words and phrases and their ence on Natural Language Processing (Volume 2: Short Papers) (2015), compositionality. In Advances in neural information processing systems pp. 657–661.2,1 (2013), pp. 3111–3119.1,4 [Wei88]W EININGER D.: SMILES, a chemical language and information ∗ [OSJ 18]O LAH C.,SATYANARAYAN A.,JOHNSON I.,CARTER S., system. 1. introduction to methodology and encoding rules. Journal of SCHUBERT L.,YE K.,MORDVINTSEV A.: The building blocks of in- chemical information and computer sciences 28, 1 (1988), 31–36.3 terpretability. Distill 3, 3 (2018), e10.2 [WHA07]W ILLETT W., HEER J.,AGRAWALA M.: Scented widgets: [Pir03]P IROLLI P.: Exploring and finding information. HCI models, the- Improving navigation cues with embedded visualizations. IEEE Trans- ories and frameworks: Toward a multidisciplinary science (2003), 157– actions on Visualization and Computer Graphics 13, 6 (2007), 1129– 191.6 1136.6 ∗ [PLvdM 16]P EZZOTTI N.,LELIEVELDT B. P., VAN DER MAATEN L., [WHG∗18]W ANG L.,HU L.,GU J.,HU Z.,WU Y., HE K.,HOPCROFT HÖLLT T., EISEMANN E.,VILANOVA A.: Approximated and user steer- J.: Towards understanding learning representations: To what extent do able tSNE for progressive visual analytics. IEEE transactions on visual- different neural networks learn the same representation. In Advances in ization and computer graphics 23, 7 (2016), 1739–1752.2 Neural Information Processing Systems (2018), pp. 9584–9593.2 ENNINGTON OCHER ANNING [PSM14]P J.,S R.,M C.: GloVe: Global [WLA∗18]W ANG Y., LIU S.,AFZAL N.,RASTEGAR-MOJARAD M., Proceedings of the 2014 conference vectors for word representation. 
In WANG L.,SHEN F., KINGSBURY P., LIU H.: A comparison of word on empirical methods in natural language processing (EMNLP) (2014), embeddings for the biomedical natural language processing. Journal of pp. 1532–1543.1,4 biomedical informatics 87 (2018), 12–20.3 [RBT∗19]R AO R.,BHATTACHARYA N.,THOMAS N.,DUAN Y., CHEN [WVJ16]W ATTENBERG M.,VIÉGAS F., JOHNSON I.: How to use t- X.,CANNY J.,ABBEEL P., SONG Y. S.: Evaluating protein transfer SNE effectively. Distill 1, 10 (2016), e2.2,4,5 learning with tape. In Advances in Neural Information Processing Sys- tems (2019).1 [YWBA18]Y ANG K.K.,WU Z.,BEDBROOK C.N.,ARNOLD F. H.: Learned protein embeddings for machine learning. 34, [RFFT16]R AUBER P. E., FADEL S.G.,FALCAO A.X.,TELEA A.C.: 15 (2018), 2642–2648.1 Visualizing the hidden activity of artificial neural networks. IEEE trans- actions on visualization and computer graphics 23, 1 (2016), 101–110. [YYB∗14]Y U W., YANG K.,BAI Y., YAO H.,RUI Y.: Visual- 3 izing and comparing convolutional neural networks. arXiv preprint arXiv:1412.6631 (2014).3 [RSG16]R IBEIRO M. T., SINGH S.,GUESTRIN C.: “why should I trust you?” explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and (2016), pp. 1135–1144.2 Boggust et al. / Embedding Comparator: Visualizing Differences in Global Structure and Local Neighborhoods via Small Multiples 1 Supplementary Information

S1. Additional case studies

S1.1. Word embeddings pre-trained on different corpora

Our third case study reflects a common use case raised by users in our formative interviews: choosing between models that appear equally viable for use in a downstream application (e.g., predicting topics based on customer review text). GloVe [PSM14] is a popular model that offers several variants of embeddings trained on different datasets. Here, we demonstrate how the Embedding Comparator can be used to understand the impact of training GloVe with data from either Wikipedia & Newswire or Twitter, and we replicate known lexical differences between the models.

As Fig. S1 shows, the Embedding Comparator immediately displays a number of differences between these two pre-trained embedding models that arise from the datasets on which they were trained. Among the words that differ most are shorthand or slang expressions such as “bc”, “bout”, and “def”. The local neighborhood dominoes for these words show the ways in which their semantic meanings differ between the two models. For example, in the Wikipedia & Newswire model, “def” is associated with the notion of being defeated and with “beats” (as in sporting results), and hence with country names (e.g., “canada” and “usa”), while in the Twitter model it is most similar to conversational words such as “definitely” and “probably”. Another insight revealed by our system is the difference in languages present in the training corpora. Words such as “era”, “dale”, and “solo” take on their English meanings in the Wikipedia & Newswire model but are related to Spanish words in the Twitter variant. This finding suggests that the Twitter model was trained on multi-lingual data, while the Wikipedia & Newswire model may be limited to English. Finally, the Embedding Comparator reveals how words such as “swift” and “galaxy” may be used very differently in different media. In the Wikipedia & Newswire model, “swift” refers to the adjective swift (i.e., quick), whereas in the Twitter model, “swift” is related to the musical artist Taylor Swift. Likewise, “galaxy” refers either to space or to Samsung Galaxy electronics depending on whether the embeddings were trained on Wikipedia & Newswire or on Twitter, respectively.

Lexical comparison between Wikipedia and Twitter via embeddings has been explored in previous work [TZCS15]. We find that the Embedding Comparator surfaces characteristic words previously reported to differ most between the two corpora, including “bc” and “ill” (Fig. S1). Moreover, our system directly identifies such words via the similarity of reciprocal local neighborhoods, without requiring alignment of the two embedding spaces or introducing variance by learning a linear transformation via a stochastic procedure.

Using these insights from the Embedding Comparator, a user can make a more informed decision about which set of embeddings may be more appropriate to adopt for their system. For example, if classifying longer or more formal customer reviews, the model trained on Wikipedia would likely perform better, but if classifying casually written reviews that contain slang or multi-lingual text, the Twitter model may generalize better to real-world data.
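The reciprocal-neighborhood comparison can be sketched as follows. This is a minimal illustration of the idea, not the system's exact implementation: we assume cosine similarity for the k-NN computation and Jaccard overlap as the set-similarity measure, and the function names are our own.

```python
import numpy as np

def knn_indices(emb, k):
    """Indices of the k nearest neighbors (by cosine similarity) of every
    row of the embedding matrix, excluding the row itself."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # never count an object as its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def neighborhood_similarity(emb_a, emb_b, k=5):
    """Jaccard overlap of each object's k-NN sets across two spaces.
    Rows of emb_a and emb_b must correspond to the same objects
    (i.e., a shared vocabulary)."""
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    scores = []
    for a, b in zip(nn_a, nn_b):
        sa, sb = set(a.tolist()), set(b.tolist())
        scores.append(len(sa & sb) / len(sa | sb))
    return np.array(scores)
```

Objects with the lowest scores are those whose local neighborhoods differ most between the two spaces (e.g., “bc” or “def” above); note that no alignment or learned transformation between the spaces is needed, since only neighbor identities are compared.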

Figure S1: The Embedding Comparator, applied to case study: Word embeddings pre-trained on different corpora, compares a word embedding model trained on Wikipedia and Newswire text to a model trained on Twitter text. Despite using the same architecture, the similarity distribution shows the models represent words differently. Looking at the dominoes suggests the Wikipedia/Newswire model was trained on standard English text, while the Twitter model contains Spanish words (e.g., “solo”) and emphasizes popular culture references (e.g., “swift”).

S1.2. Emoji representations

Our final case study demonstrates how the Embedding Comparator can be used in a non-NLP setting. We compare two representations of Unicode emoji images: emoji embeddings learned from their textual descriptions in the Unicode emoji standard (emoji2vec) [ERA∗16] and raw RGBA pixel values of the emoji images. Fig. S2 shows the Embedding Comparator loaded with these two models. In general, we find that the nearest neighbors of emojis in the emoji2vec model are emojis whose semantic meaning is similar, while the nearest neighbors in the pixel model are emojis whose images are similar. For instance, the domino for the baseball emoji (Fig. S2) shows that the nearest neighbors in the emoji2vec model are other sports-related emojis (e.g., soccer ball, football), while those in the pixel model are emojis with similar color/shape (e.g., white clocks). Likewise, the running shoe and balloon emojis in the pixel model share no apparent semantic relationship with their nearest neighbors, but do share object shape and color, respectively. These insights enable us to understand how the two models differ: the emoji2vec embeddings capture semantics learned from the emoji textual descriptions, while the pixel representations only reflect similarity of the emoji images, as one would expect. Such insights are also reflected in the emoji whose local neighborhoods are most similar across the two models (“Grinning Squinting Face”), where the unique objects lists reveal that nearest neighbors in emoji2vec are generally positive-sentiment emojis, while those in the pixel model are circular faces spanning more diverse (positive and negative) sentiment.

Figure S2: The Embedding Comparator, applied to case study: Emoji representations, compares embeddings learned from textual descriptions (emoji2vec) of emojis to embeddings learned from the emojis’ raw pixel values. The emoji2vec model captures semantic similarity of emojis while the pixel model captures image-level similarities. For example, in the emoji2vec embeddings the balloon emoji is surrounded by other party supplies, whereas in the pixel model it is surrounded by other red objects.

S1.3. Additional example on chemical molecules data

We applied the Embedding Comparator to compare latent embeddings of chemical molecules [SI15] learned by two different variational autoencoder (VAE) architectures, Grammar VAE [KPHL17] and junction tree VAE [JBJ18] (Fig. S3). This application demonstrates a potential limitation of our system with long object labels. In Fig. S3A, the molecules are represented as textual SMILES strings [Wei88], which are often used to describe chemical structures. We find that these string representations are long, hindering the ability to rapidly scroll through dominoes. Fig. S3B shows the same models, but with the 2-D structure of each molecule shown instead, rendering the dominoes more readable.

Figure S3: View of the Embedding Comparator applied to two embeddings of chemical molecules learned by different variational autoencoder architectures. (A) Molecules are represented as SMILES strings. (B) Molecules are represented as 2-D structures.

S2. Case study details Here, we detail the datasets and preprocessing steps used in our case studies.

S2.1. Transfer learning for fine-tuned word embeddings

We downloaded pre-trained fastText [MGB∗18] word embeddings, which are available online at https://fasttext.cc/docs/en/english-vectors.html. We use the 300-dimensional wiki-news-300d-1M embeddings consisting of 1 million word vectors trained on the Wikipedia 2017, UMBC webbase corpus, and statmt.org news datasets.

We train an LSTM [HS97] to classify binary sentiment in movie reviews from the Large Movie Review dataset [MDP∗11], containing 25000 training reviews and 25000 test reviews from the Internet Movie Database (IMDb). We use default tokenization settings for this dataset as provided in Keras [C∗15]. We define our vocabulary as the top 5000 most frequent words in the movie review dataset and truncate reviews to a maximum length of 500 words (with pre-padding). Our architecture is defined as follows:

1. Input/Embeddings: Sequence of 500 words. The word at each timestep is represented by a 300-dimensional embedding.
2. LSTM: Recurrent layer with a 100-unit LSTM (forward direction only, dropout = 0.2, recurrent dropout = 0.2).
3. Dense: 1 neuron (sentiment output), sigmoid activation.

Prior to training, the embeddings are initialized using the pre-trained fastText embeddings. Of our vocabulary of size 5000, 4891 tokens were present in the fastText embeddings. For tokens not present in fastText, we initialize embeddings as all-zero vectors. We train our model for 3 epochs (batch size = 64) with the Adam optimizer [KB15] using default parameters in Keras [C∗15] to minimize binary cross-entropy on the training set. The final model achieves 84.7% test set accuracy (85.4% training set accuracy). We did not further tune the architecture or hyperparameters. For analysis in the Embedding Comparator, we output the initial fastText embeddings and the fine-tuned embeddings for the 4891 words whose embeddings were initialized from fastText.
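The embedding-initialization step described above can be sketched as follows. This is a minimal illustration under stated assumptions: `build_embedding_matrix` is our own name, and `pretrained` stands in for a token-to-vector lookup loaded from the fastText file.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim=300):
    """Initialize a |vocab| x dim embedding matrix from pretrained vectors,
    falling back to an all-zero vector for out-of-vocabulary tokens."""
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)  # zeros = OOV default
    hits = 0
    for i, token in enumerate(vocab):
        vec = pretrained.get(token)
        if vec is not None:
            matrix[i] = vec
            hits += 1
    return matrix, hits
```

With the 5000-word IMDb vocabulary, `hits` would come out to 4891 per the counts reported above; the resulting matrix would then be passed as the initial weights of the Keras Embedding layer before fine-tuning.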

S2.2. Language evolution via diachronic word embeddings

We use the HistWords dataset [HLJ16b] in our case study of diachronic word embeddings, which have been shown to exhibit changes in the semantic meaning of words over time. We use pre-trained word embeddings from [HLJ16b], accessed at https://nlp.stanford.edu/projects/histwords/. We use the All English (1800s–1990s) set of embeddings, which are 300-dimensional word2vec embeddings [MSC∗13]. This dataset provides word embeddings trained on English books from each decade from 1800 to 2000. For exploration in the Embedding Comparator, we select embeddings taken from five different decades spanning this time period: 1800–1810, 1850–1860, 1900–1910, 1950–1960, and 1990–2000. We filter each embedding space to the top 10000 most frequent words from its decade and compute the intersection of these sets over the five decades we selected, producing a vocabulary containing 6121 words from each model for comparison in the Embedding Comparator.
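The vocabulary-intersection step can be sketched as follows; this is a minimal illustration with names of our own choosing, where `rankings_by_decade` stands in for each decade's frequency-sorted word list.

```python
def shared_vocabulary(rankings_by_decade, top_n=10000):
    """Intersect the top_n most frequent words from each decade's corpus.
    rankings_by_decade: dict mapping decade label -> list of words sorted
    by descending frequency."""
    vocabs = [set(words[:top_n]) for words in rankings_by_decade.values()]
    return set.intersection(*vocabs)
```

Applied to the five selected decades with `top_n=10000`, this procedure yields the 6121-word shared vocabulary reported above; the same pattern applies to the GloVe corpus comparison in the next case study.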

S2.3. Word embeddings pre-trained on different corpora

In this case study, we use pre-trained embeddings from GloVe [PSM14], available online at https://nlp.stanford.edu/projects/glove/. The Wikipedia/newswire embeddings were trained on the Wikipedia 2014 and Gigaword 5 (newswire text) datasets containing 6 billion tokens (GloVe 6B), while the Twitter word embeddings were trained on text from Twitter tweets containing 27 billion tokens (GloVe 27B). We use the 100-dimensional embeddings trained on each of these corpora. We filter each of the embedding models to the top 10K most frequent words from its respective corpus and then intersect the resulting vocabularies, giving a shared vocabulary containing 3303 words. We use the Embedding Comparator to compare embeddings from each model for words in this shared vocabulary.

S2.4. Emoji representations

We use pre-trained 300-dimensional emoji embeddings from emoji2vec [ERA∗16], accessed from https://github.com/uclnlp/emoji2vec. Our image (pixel) embedding model uses emoji images obtained from https://github.com/iamcal/emoji-data/blob/master/sheet_apple_16.png. The raw 18 × 18 × 4 RGBA images are flattened into a 1296-dimensional vector, which is the pixel model embedding. We use 670 total emojis that are common across these datasets.
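The flattening step can be sketched as follows; the function name is our own, and the batch is assumed to be a stack of same-sized RGBA images.

```python
import numpy as np

def pixel_embeddings(images):
    """Flatten a batch of H x W x 4 RGBA images into per-image vectors:
    an 18 x 18 x 4 emoji image becomes a 1296-dimensional embedding."""
    images = np.asarray(images, dtype=np.float32)
    return images.reshape(images.shape[0], -1)
```

Unlike emoji2vec, this representation encodes no semantics; two emojis are close only if their rendered pixels are similar, which is exactly the contrast the case study exploits.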

S3. Additional dominoes

Figure S4: Additional dominoes from case study: Transfer learning for fine-tuned word embeddings. The word “bore” has changed in meaning from a general definition: “carried”, to a more sentiment-rich definition: “dull”. “redeeming” has changed from the positive sentiment definition: “compensate for faults”, to a negative sentiment definition likely related to the reviewer idiom “no redeeming qualities”. The number “7” has changed from its definition as a numeric symbol to a number indicative of score (e.g., 7 out of 10).

Figure S5: Additional dominoes from case study: Language evolution via diachronic word embeddings. The domino for “nice” shows its definition change from “fine” in 1800–1810 to “pleasant” in 1900–1910. The word “aids” was synonymous with “assists” in 1900–1910 but later became associated with HIV/AIDS in 1990–2000. Over the course of the 20th century, “score” changed from a measure of time (e.g., four score) to a measure of rank.

Figure S6: Additional dominoes from case study: Word embeddings pre-trained on different corpora. Using the model trained on news text, “def” is short for “defeated”, whereas using the model trained on Twitter data, “def” is slang for “definitely”. The word “dale” is an English name in the news model, but is represented by its Spanish meaning in the Twitter model. “galaxy” in the news model is related to space, but in the Twitter model is related to the Samsung Galaxy line of phones.

Figure S7: Additional dominoes where the neighborhood of the word has not changed. From the Transfer learning for fine-tuned word embeddings case study, “cinematographer” does not change when fine-tuning a general English model on a movie review dataset because “cinematographer” already refers to the film industry. From the Language evolution via diachronic word embeddings case study, “25” has not changed from 1800 to 2000, suggesting that numbers are less susceptible to chronological changes in meaning. “fernando”, from the Word embeddings pre-trained on different corpora case study, does not change in meaning whether the model was trained on news data or Twitter data, likely because it is a proper noun with no other meanings.