Transformer Visualization Via Dictionary Learning: Contextualized Embedding As a Linear Superposition of Transformer Factors
Zeyu Yun∗² Yubei Chen∗¹,² Bruno A. Olshausen²,⁴ Yann LeCun¹,³
¹ Facebook AI Research  ² Berkeley AI Research (BAIR), UC Berkeley  ³ New York University  ⁴ Redwood Center for Theoretical Neuroscience, UC Berkeley

arXiv:2103.15949v1 [cs.CL] 29 Mar 2021

∗ Equal contribution. Correspondence to: Zeyu Yun <[email protected]>, Yubei Chen <yubeic@{fb.com, berkeley.edu}>

Abstract

Transformer networks have revolutionized NLP representation learning since they were introduced. Though great effort has been made to explain the representations in transformers, it is widely recognized that our understanding is not sufficient. One important reason is the lack of visualization tools for detailed analysis. In this paper, we propose to use dictionary learning to open up these "black boxes" as linear superpositions of transformer factors. Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors, e.g., word-level polysemy disambiguation, sentence-level pattern formation, and long-range dependency. While some of these patterns confirm conventional prior linguistic knowledge, the rest are relatively unexpected and may provide new insights. We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work.

1 Introduction

Though transformer networks (Vaswani et al., 2017; Devlin et al., 2018) have achieved great success, our understanding of how they work is still fairly limited. This has triggered increasing efforts to visualize and analyze these "black boxes". Besides direct visualization of the attention weights, most current efforts to interpret transformer models involve "probing tasks": a lightweight auxiliary classifier is attached to the output of the target transformer layer, and only this classifier is trained on well-known NLP tasks such as part-of-speech (POS) tagging, named-entity recognition (NER) tagging, syntactic dependency parsing, etc. (Tenney et al., 2019; Liu et al., 2019) show that transformer models have excellent performance on these probing tasks, which indicates that they have learned language representations related to the tasks. Though probing tasks are useful tools for interpreting language models, their limitations are discussed in (Rogers et al., 2020). We summarize them in three major points:

• Most probing tasks, like POS and NER tagging, are too simple. A model performing well on such probing tasks does not necessarily reflect its true capacity.

• Probing tasks can only verify whether a certain prior structure is learned by a language model. They cannot reveal structures beyond our prior knowledge.

• It is hard to locate where exactly the relevant linguistic representation is learned in the transformer.

Efforts have been made to remove these limitations and make probing tasks more diverse. For instance, (Hewitt and Manning, 2019) propose the "structural probe", a much more intricate probing task, and (Jiang et al., 2020) propose to generate certain probing tasks automatically. Non-probing methods have also been explored to relieve the last two limitations. For example, (Reif et al., 2019) visualize embeddings from BERT using UMAP and show that embeddings of the same word under different contexts separate into different clusters, and (Ethayarajh, 2019) analyzes the similarity between embeddings of the same word in different contexts. Both works show that transformers provide context-specific representations.

(Faruqui et al., 2015; Arora et al., 2018; Zhang et al., 2019) demonstrate how to use dictionary learning to explain, improve, and visualize uncontextualized word embedding representations. In this work, we propose to use dictionary learning to alleviate the limitations of the other transformer interpretation techniques. Our results show that dictionary learning provides a powerful visualization tool, which even leads to some surprising new knowledge.

2 Method

Hypothesis: contextualized word embedding as a sparse linear superposition of transformer factors. It has been shown that word embedding vectors can be factorized into a sparse linear combination of word factors (Arora et al., 2018; Zhang et al., 2019), which correspond to elementary semantic meanings. An example is:

apple = 0.09 "dessert" + 0.11 "organism" + 0.16 "fruit" + 0.22 "mobile&IT" + 0.42 "other".

We view the latent representations of words in a transformer as contextualized word embeddings. Similarly, our hypothesis is that a contextualized word embedding vector can also be factorized as a sparse linear superposition of a set of elementary elements, which we call transformer factors.

Figure 1: Building block (layer) of a transformer.

Due to the skip connections in each of the transformer blocks, we hypothesize that the representation in any layer is a superposition of the hierarchical representations in all of the lower layers. As a result, the output of a particular transformer block is the sum of all of the modifications made along the way. Indeed, we verify this intuition in our experiments. Based on this observation, we propose to learn a single dictionary for the contextualized word vectors from the outputs of all layers.

Given a transformer model with L layers and a tokenized sequence s, we denote the contextualized word embedding¹ of word w = s[i] as {x^(1)(s, i), x^(2)(s, i), ..., x^(L)(s, i)}, where x^(l)(s, i) ∈ ℝ^d. Each x^(l)(s, i) represents the hidden output of the transformer at layer l, given the input sequence s and index i; d is the dimension of the embedding vectors.

¹ In the following, we use either "word vector" or "word embedding" to refer to the latent output (at a particular layer of BERT) of any word w in a context s.

To Learn a Dictionary of Transformer Factors with Non-Negative Sparse Coding. Let S be the set of all sequences, X^(l) = {x^(l)(s, i) | s ∈ S, i ∈ [0, len(s)]}, and X = X^(1) ∪ X^(2) ∪ ··· ∪ X^(L). For every x ∈ X, we assume x is a sparse linear superposition of transformer factors:

x = Φα + ε,  s.t.  α ≥ 0,    (1)

where Φ ∈ ℝ^(d×m) is a dictionary matrix with m columns Φ_{:,c}, α ∈ ℝ^m is a sparse vector of coefficients to be inferred, and ε is a vector of independent Gaussian noise samples, assumed to be small relative to x. Typically m > d, so that the representation is overcomplete. This inverse problem can be solved efficiently with the FISTA algorithm (Beck and Teboulle, 2009). The dictionary matrix Φ can be learned in an iterative fashion using non-negative sparse coding, which we leave to Appendix C. Each column Φ_{:,c} of Φ is a transformer factor, and its corresponding sparse coefficient α_c is its activation level.
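The full training procedure is deferred to Appendix C. As a rough illustration of the recipe above, the following NumPy sketch alternates FISTA inference of the non-negative sparse codes α (a gradient step followed by a non-negative soft threshold) with a gradient update and column normalization of the dictionary Φ. It is a minimal sketch under our own assumptions; the function names, step sizes, and toy dimensions are ours, not the authors'.

import numpy as np

def infer_codes(X, Phi, lam=0.5, n_iter=200):
    """Infer non-negative sparse codes A (one column per embedding in X) by
    minimizing 0.5 * ||X - Phi @ A||^2 + lam * ||A||_1 subject to A >= 0,
    using FISTA (Beck and Teboulle, 2009)."""
    m, n = Phi.shape[1], X.shape[1]
    step = 1.0 / (np.linalg.norm(Phi, ord=2) ** 2)    # 1 / Lipschitz constant of the gradient
    A = np.zeros((m, n))
    Y = np.zeros((m, n))                              # FISTA momentum point
    t = 1.0
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ Y - X)                  # gradient of the quadratic term at Y
        A_next = np.maximum(Y - step * (grad + lam), 0.0)   # non-negative soft threshold
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = A_next + ((t - 1.0) / t_next) * (A_next - A)
        A, t = A_next, t_next
    return A

def update_dictionary(X, Phi, A, lr=0.05):
    """One gradient step on the reconstruction error w.r.t. Phi, followed by
    normalizing each column (transformer factor) to unit length."""
    Phi = Phi + lr * (X - Phi @ A) @ A.T / X.shape[1]
    return Phi / np.maximum(np.linalg.norm(Phi, axis=0, keepdims=True), 1e-8)

# Toy usage. In the paper, X would pool hidden states from every layer of BERT
# (d = 768); here we use small random data just to exercise the loop.
rng = np.random.default_rng(0)
d, m, n = 64, 128, 1024                               # m > d: overcomplete dictionary
X = rng.standard_normal((d, n))
Phi = rng.standard_normal((d, m))
Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)
for _ in range(20):                                   # alternate code inference and dictionary updates
    A = infer_codes(X, Phi)
    Phi = update_dictionary(X, Phi, A)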
Visualization by Top Activation and LIME Interpretation. An important empirical method to visualize a feature in deep learning is to use the input samples that trigger its top activations (Zeiler and Fergus, 2014). We adopt this convention. As a starting point, we try to visualize each of the dimensions of a particular layer's output X^(l). Unfortunately, the hidden dimensions of transformers are not semantically meaningful, similar to the dimensions of uncontextualized word embeddings (Zhang et al., 2019). Instead, we visualize the transformer factors.

For a transformer factor Φ_{:,c} and a layer l, we denote the 1000 contextualized word vectors with the largest sparse coefficients α_c^(l) as X_c^(l) ⊂ X^(l), which correspond to 1000 different sequences S_c^(l) ⊂ S. For example, Figure 3 shows the top 5 words that activate transformer factor-17, Φ_{:,17}, at layer-0, layer-2, and layer-6, respectively. Since a contextualized word vector is generally affected by many tokens in the sequence, we use LIME (Ribeiro et al., 2016) to assign a weight to each token in the sequence, identifying its relative importance to α_c. The detailed method is left to Section 3.

Figure 2: Importance score (IS) across all layers for two different transformer factors. (a) A typical IS curve of a transformer factor corresponding to low-level information. (b) A typical IS curve of a transformer factor corresponding to mid-level information.

To Determine Low-, Mid-, and High-Level Transformer Factors with Importance Score. As we build a single dictionary for all of the transformer layers, the transformer factors capture semantic meaning at different levels. While some factors appear in lower layers and continue to be used in later stages, other factors may only be activated in the higher layers of the transformer network. A central question in representation learning is: "where does the network learn certain information?" To answer this question, we compute an "importance score" I_c^(l) for each transformer factor Φ_{:,c} at layer l. I_c^(l) is the average of the 1000 largest sparse coefficients α_c^(l), which correspond to X_c^(l). We plot the importance score of each transformer factor across layers as a curve, as shown in Figure 2, and use these importance score (IS) curves to identify the layer at which a transformer factor emerges. Figure 2a shows an IS curve that peaks at an early layer.
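The importance score reduces to a simple computation over the inferred coefficients. The sketch below collects each factor's top-activating examples and computes I_c^(l) as the mean of its 1000 largest coefficients at layer l; the function names, data layout (one coefficient matrix per layer), and toy sizes are our own assumptions for illustration.

import numpy as np

def top_activating_examples(A, k=5):
    """Indices of the k word vectors with the largest coefficient for each factor,
    i.e. the examples one would display (and weight with LIME) to visualize it.
    A has shape (m, n_l): one row of coefficients per transformer factor."""
    return np.argsort(A, axis=1)[:, -k:][:, ::-1]     # shape (m, k), most active first

def importance_scores(codes_per_layer, k=1000):
    """Importance score I_c^(l): the mean of the k largest coefficients of factor c
    at layer l. codes_per_layer is a list of (m, n_l) coefficient matrices, one per
    layer. Returns an (L, m) array; reading down column c gives factor c's IS curve."""
    scores = [np.sort(A, axis=1)[:, -k:].mean(axis=1) for A in codes_per_layer]
    return np.stack(scores)

# Toy usage: 12 layers, m = 128 factors, 5000 word vectors per layer.
rng = np.random.default_rng(0)
codes_per_layer = [np.abs(rng.standard_normal((128, 5000))) for _ in range(12)]
IS = importance_scores(codes_per_layer, k=1000)       # IS[:, c] is the curve shown in Figure 2
top5 = top_activating_examples(codes_per_layer[2], k=5)    # top examples for each factor at layer 2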
These three levels of transformer factors correspond to word-level disambiguation, sentence-level pattern formation, and long-range dependency, respectively. In the following, we provide detailed visualizations for each semantic category. We also build a website for interested readers to explore these results.

Low-Level: Word-Level Polysemy Disambiguation. While the input embedding of a token con-