Supplementary Materials
A Importance Score (IS) Curves

The shape of an importance score curve corresponds strongly to a transformer factor's categorization. Based on the location of the peak of an IS curve, we can classify a transformer factor as low-level, mid-level, or high-level. The importance scores of low-level transformer factors peak in early layers and slowly decrease across the remaining layers. On the other hand, the importance scores of mid-level and high-level transformer factors slowly increase and peak at higher layers. In Figure 9, we show two sets of examples to demonstrate the clear distinction between these two types of IS curves.

Figure 9: (a) Importance scores of 16 transformer factors corresponding to low-level information. (b) Importance scores of 16 transformer factors corresponding to mid-level information.

Taking a step back, we can also plot the IS curve for each dimension of the word vectors (without sparse coding) at different layers. These curves do not show any specific pattern, as shown in Figure 10. This makes intuitive sense since, as mentioned, individual entries of a contextualized word embedding do not correspond to any clear semantic meaning.

Figure 10: (a) Importance scores calculated using a given dimension of the word vectors without sparse coding. (b) Importance scores calculated using the sparse coding of the word vectors.

B LIME: Local Interpretable Model-Agnostic Explanations

After we have trained the dictionary Φ through non-negative sparse coding, the sparse code of a given input x is inferred as

$$\alpha(x) = \operatorname*{argmin}_{\alpha \succeq 0} \; \|x - \Phi\alpha\|_2^2 + \lambda \|\alpha\|_1.$$

For a given sentence and index pair (s, i), the embedding of the word w = s[i] produced by layer l of the transformer is x^(l)(s, i). We can then abstract the inference of a specific entry of the sparse code of this word vector as a black-box scalar-valued function f:

$$f((s, i)) = \alpha\big(x^{(l)}(s, i)\big).$$

Let RandomMask denote the operation that generates a perturbed version of a sentence s by masking words at random locations with "[UNK]" (unknown) tokens. For example, a masked sentence could be

[Today is a [UNK] day]

Let h denote an encoder that compares a perturbed sentence to the unperturbed sentence s, such that

$$h(s)_t = \begin{cases} 0 & \text{if } s[t] = \text{[UNK]} \\ 1 & \text{otherwise.} \end{cases}$$

The LIME algorithm we use to generate a saliency map for each sentence is the following:

Algorithm 1 Explaining Sparse Coding Activation using LIME
1: S ← {h(s)}
2: Y ← {f(s)}
3: for each i in {1, 2, ..., N} do
4:   s_i ← RandomMask(s)
5:   S ← S ∪ {h(s_i)}
6:   Y ← Y ∪ {f(s_i)}
7: end for
8: w ← Ridge_w(S, Y)

where Ridge_w is a weighted ridge regression defined as

$$w = \operatorname*{argmin}_{w \in \mathbb{R}^t} \; \|\Omega\,(Sw - Y)\|_2^2 + \lambda \|w\|_2^2, \qquad \Omega = \mathrm{diag}\big(d(h(s_1), 1),\, d(h(s_2), 1),\, \dots,\, d(h(s_N), 1)\big).$$

Here d(·, ·) can be any measure of how close a perturbed sentence is to the original sentence, and 1 denotes the encoding of the unperturbed sentence. If a sentence is perturbed such that every token is masked, then d(h(s_i), 1) should be 0; if a sentence is not perturbed at all, then d(h(s_i), 1) should be 1. We choose d(·, ·) to be cosine similarity in our implementation.

In practice, we also use feature selection, which is done by running LIME twice. After we obtain the regression weights w_1 from the first run, we use them to find the k indices corresponding to the entries of w_1 with the highest absolute values. We then use those k indices as locations in the sentence and apply LIME a second time, using only the indices selected in the first run.

Overall, the regression weight w can be regarded as a saliency map: the higher the weight w_k is, the more important the word s[k] is in the sentence, since it contributes more to the activation of the specific transformer factor.

We can also have negative weights in w. In general, negative weights are hard to interpret in the context of a transformer factor: the activation increases if the words corresponding to negative weights are removed. Since a transformer factor corresponds to a specific pattern, the words with negative weights are those that, in their context, behave "opposite" to this pattern.
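As a concrete illustration of Algorithm 1, the sketch below implements the perturb-and-regress procedure in Python with scikit-learn. It is a minimal sketch rather than the authors' implementation: f is assumed to be the black-box function f((s, i)) defined above (taking a tokenized sentence and returning the activation of one sparse-code entry), and the names random_mask, lime_saliency, mask_prob, and n_samples, as well as the 30% masking rate and the toy f at the end, are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

def random_mask(tokens, rng, mask_prob=0.3, unk="[UNK]"):
    """RandomMask: replace tokens at random positions with [UNK].
    Returns the perturbed sentence and its binary encoding h (1 = kept, 0 = masked)."""
    keep = rng.random(len(tokens)) >= mask_prob
    perturbed = [t if k else unk for t, k in zip(tokens, keep)]
    return perturbed, keep.astype(float)

def lime_saliency(tokens, f, rng, n_samples=1000, lam=1.0):
    """Algorithm 1: explain the activation f(s) of one sparse-code entry
    with a proximity-weighted ridge regression over random maskings."""
    S, Y = [np.ones(len(tokens))], [f(tokens)]   # h(s) and f(s) for the unperturbed sentence
    for _ in range(n_samples):
        s_i, h_i = random_mask(tokens, rng)
        S.append(h_i)
        Y.append(f(s_i))
    S, Y = np.stack(S), np.array(Y)
    ones = np.ones(len(tokens))
    # d(h(s_i), 1): cosine similarity between each mask encoding and the all-ones encoding
    d = S @ ones / (np.linalg.norm(S, axis=1) * np.linalg.norm(ones) + 1e-12)
    ridge = Ridge(alpha=lam).fit(S, Y, sample_weight=d)
    return ridge.coef_                           # one saliency weight per token position

# toy usage: a dummy black-box that fires when "day" survives the masking
rng = np.random.default_rng(0)
toy_f = lambda toks: float(sum(t == "day" for t in toks))
print(lime_saliency(["Today", "is", "a", "nice", "day"], toy_f, rng, n_samples=200))
```

The two-pass feature selection described above would simply call lime_saliency once, keep the k positions with the largest absolute weights, and rerun the regression restricted to those positions.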
C The Details of the Non-negative Sparse Coding Optimization

Let S be the set of all sequences, and recall how we define word embeddings using the hidden states of the transformer in the main section: X^(l) = {x^(l)(s, i) | s ∈ S, i ∈ [0, len(s)]} is the set of all word embeddings at layer l. The set of word embeddings across all layers is then defined as

$$X = X^{(1)} \cup X^{(2)} \cup \cdots \cup X^{(L)}.$$

In practice, we use the BERT base model as our transformer model; each word embedding vector (hidden state of BERT) has dimension 768. To learn the transformer factors, we concatenate all word vectors x ∈ X into a data matrix, which we also denote X. We also define f(x) to be the frequency of the token embedded in the word vector x. For example, if x is an embedding of the word "the", it has a much larger frequency, i.e., f(x) is high. Using f(x), we define the Inverse Frequency Matrix Ω: a diagonal matrix whose diagonal entries are the squared inverse frequencies of the corresponding words, i.e.,

$$\Omega = \mathrm{diag}\!\left(\frac{1}{f(x_1)^2}, \frac{1}{f(x_2)^2}, \dots\right).$$

Then we use a typical iterative optimization procedure to learn the dictionary Φ described in the main section:

$$\min_{A} \; \frac{1}{2}\|X - \Phi A\|_F^2 + \lambda \sum_i \|\alpha_i\|_1, \quad \text{s.t. } \alpha_i \succeq 0, \tag{2}$$

$$\min_{\Phi} \; \frac{1}{2}\|(X - \Phi A)\,\Omega\|_F^2, \quad \text{s.t. } \|\Phi_{:,j}\|_2^2 \le 1. \tag{3}$$

These two optimizations are both convex, and we solve them alternately to learn the transformer factors. In practice, we use minibatches containing 200 word vectors as X. The motivation for applying the Inverse Frequency Matrix Ω is that we want all words in our vocabulary to contribute equally: when we sample a minibatch from X, frequent words like "the" and "a" are much more likely to appear, and should therefore receive a lower weight during the update.

Optimization 2 can converge within 1000 steps using the FISTA algorithm². We experimented with λ values from 0.03 to 3 and chose λ = 0.27 for the results presented in this paper. Once the sparse coefficients have been inferred, we update the dictionary Φ based on Optimization 3 by one step of an approximate second-order method, where the Hessian is approximated by its diagonal to achieve an efficient inverse (Duchi et al., 2011). This second-order parameter update usually leads to much faster convergence. Empirically, we train for 200k steps, which takes roughly 2 days on an Nvidia 1080 Ti GPU.

² The FISTA algorithm can usually converge within 300 steps; we use 1000 steps nevertheless to avoid any potential numerical issues.
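For concreteness, the alternating optimization of (2) and (3) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's released code: it uses plain (unaccelerated) ISTA where the paper uses FISTA, weights each column's squared reconstruction error by the corresponding diagonal entry of Ω in the dictionary step, scales that step by a diagonal Hessian approximation, and runs on synthetic data in place of BERT hidden states; the dictionary size K, step counts, and helper names are illustrative.

```python
import numpy as np

def infer_codes(X, Phi, lam, n_steps=100):
    """Optimization (2): non-negative sparse coding of the columns of X via projected ISTA.
    (The paper runs 1000 FISTA steps; 100 plain ISTA steps are used here for brevity.)"""
    A = np.zeros((Phi.shape[1], X.shape[1]))
    L = np.linalg.norm(Phi, 2) ** 2                  # Lipschitz constant of the smooth term
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ A - X)
        A = np.maximum(A - grad / L - lam / L, 0.0)  # soft-threshold, then project onto A >= 0
    return A

def update_dictionary(X, Phi, A, omega, eps=1e-8):
    """One step of Optimization (3): each column's squared error is weighted by Omega's
    diagonal, the step is scaled by a diagonal Hessian approximation, and the columns
    of Phi are projected back to norm at most 1."""
    R = (Phi @ A - X) * omega                        # gradient of the weighted reconstruction loss
    grad = R @ A.T
    hess_diag = (A ** 2 * omega).sum(axis=1)         # per-column diagonal of the Hessian
    Phi = Phi - grad / (hess_diag[None, :] + eps)
    norms = np.maximum(np.linalg.norm(Phi, axis=0), 1.0)
    return Phi / norms                               # enforce ||Phi[:, j]||_2 <= 1

# toy training loop with synthetic stand-ins for minibatches of 200 BERT word vectors
rng = np.random.default_rng(0)
d, K, lam = 768, 1536, 0.27                          # K (dictionary size) is an illustrative choice
Phi = rng.standard_normal((d, K)) / np.sqrt(d)
for step in range(3):                                # the paper trains for 200k steps
    X_batch = rng.standard_normal((d, 200))          # stand-in for word vectors x in X
    freq_batch = rng.integers(1, 1000, size=200).astype(float)  # stand-in for token frequencies f(x)
    omega = 1.0 / freq_batch ** 2                    # diagonal of the Inverse Frequency Matrix
    A = infer_codes(X_batch, Phi, lam)
    Phi = update_dictionary(X_batch, Phi, A, omega)
```

In the full pipeline, X_batch and freq_batch would come from BERT hidden states over a corpus and the corresponding token frequency counts, with the loop run for the 200k steps reported above.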
D Hyperlinks for More Transformer Visualization

In the following three sections, we provide visualizations of more example transformer factors at the low, mid, and high levels. Here is a table of contents with hyperlinks that direct to each level:

• Low-Level: Appendix E
• Mid-Level: Appendix F
• High-Level: Appendix G

E Low-Level Transformer Factors

Transformer factor 2 in layer 4
Explanation: Mind: noun, the element of a person that enables them to be aware of the world and their experiences.
• that snare shot sounded like somebody' d kicked open the door to your mind".
• i became very frustrated with that and finally made up my mind to start getting back into things."
• when evita asked for more time so she could make up her mind, the crowd demanded," ¡ ahora, evita,<
• song and watch it evolve in front of us... almost as a memory in your head.
• was to be objective and to let the viewer make up his or her own mind."
• managed to give me goosebumps, and those moments have remained on my mind for weeks afterward."
• rests the tir' d mind, and waking loves to dream
•, tracks like' halftime' and the laid back' one time 4 your mind' demonstrated a[ high] level of technical precision and rhetorical dexter
• so i went to bed with that on my mind".
•ment to a seed of doubt that had been playing on

Example sentences for another low-level transformer factor, each containing the word "park":
• price family even sat away from park' s supporters during the trial itself.
• on 25 january 2010, the morning of park' s 66th birthday, he was found hanged and unconscious in his
• was her, and knew who had done it", expressing his conviction of park' s guilt.
• jeremy park wrote to the north@-@ west evening mail to confirm that he
• vanessa fisher, park' s adoptive daughter, appeared as a witness for the prosecution at the
• they played at< unk> for years before joining oldham athletic at boundary park until 2010 when they moved to oldham borough' s previous ground,<
• theme park guests may use the hogwarts express to travel between hogsmead
• s strength in both singing and rapping while comparing the sound to linkin park.
• in a statement shortly after park' s guilty verdict, he said he had" no doubt" that
• june 2013, which saw the band travel to rock am ring and rock im park as headline act, the song was moved to the middle of the set
• after spending the first decade of her life at the central park zoo, pattycake moved permanently to the bronx zoo in 1982.