Representing Unordered Data Using Complex-Weighted Multiset Automata
Justin DeBenedetto¹  David Chiang¹

¹Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA. Correspondence to: Justin DeBenedetto <[email protected]>, David Chiang <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Unordered, variable-sized inputs arise in many settings across multiple fields. The ability of set- and multiset-oriented neural networks to handle this type of input has been the focus of much work in recent years. We propose to represent multisets using complex-weighted multiset automata and show how the multiset representations of certain existing neural architectures can be viewed as special cases of ours. Namely, (1) we provide a new theoretical and intuitive justification for the Transformer model's representation of positions using sinusoidal functions, and (2) we extend the DeepSets model to use complex numbers, enabling it to outperform the existing model on an extension of one of their tasks.

1. Introduction

Neural networks which operate on unordered, variable-sized input (Vinyals et al., 2016; Wagstaff et al., 2019) have been gaining interest for various tasks, such as processing graph nodes (Murphy et al., 2019), hypergraphs (Maron et al., 2019), 3D image reconstruction (Yang et al., 2020), and point cloud classification and image tagging (Zaheer et al., 2017). Similar networks have been applied to multiple instance learning (Pevný & Somol, 2016).

One such model, DeepSets (Zaheer et al., 2017), computes a representation of each element of the set, then combines the representations using a commutative function (e.g., addition) to form a representation of the set that discards ordering information. Zaheer et al. (2017) provide a proof that any function on sets can be modeled in this way, by encoding sets as base-4 fractions and using the universal function approximation theorem, but their actual proposed model is far simpler than the model constructed by the theorem.

In this paper, we propose to compute representations of multisets using weighted multiset automata, a variant of weighted finite-state (string) automata in which the order of the input symbols does not affect the output. This representation can be directly implemented inside a neural network. We show how to train these automata efficiently by approximating them with string automata whose weights form complex, diagonal matrices.

Our representation is a generalization of DeepSets', and it also turns out to be a generalization of the Transformer's position encodings (Vaswani et al., 2017). In Sections 4 and 5, we discuss the application of our representation in both cases.

• Transformers (Vaswani et al., 2017) encode the position of a word as a vector of sinusoidal functions that turns out to be a special case of our representation of multisets. So weighted multiset automata provide a new theoretical and intuitive justification for sinusoidal position encodings (see the sketch after this list). We experiment with several variations on position encodings inspired by this justification, and although they do not yield an improvement, we do find that learned position encodings in our representation do better than learning a different vector for each absolute position.

• We extend the DeepSets model to use our representation, which amounts to upgrading it from real to complex numbers. On an extension of one of their tasks (adding a sequence of one-digit numbers and predicting the units digit), our model is able to reach perfect performance, whereas the original DeepSets model does no better than chance.
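To preview the connection developed in Section 4, here is a minimal sketch of our own (not code from the paper; the dimension, the base 10000, the ordering of coordinates, and all names are illustrative choices): a WFA over a single symbol with a diagonal complex transition matrix, whose forward weights after reading n symbols reproduce a sinusoidal encoding of position n.

```python
import numpy as np

# A minimal sketch: sinusoidal position encodings as the forward weights
# of a one-symbol WFA whose transition matrix is diagonal and complex.
d = 8                                                # encoding dimension (illustrative)
freqs = 1.0 / 10000 ** (np.arange(d // 2) * 2 / d)   # Transformer-style frequencies

lam = np.ones(d // 2, dtype=complex)                 # initial weights
mu_diag = np.exp(1j * freqs)                         # diagonal of mu(a)

def forward_weights(n):
    """fw(a^n) = lam @ mu(a)^n; for a diagonal matrix, an elementwise power."""
    return lam * mu_diag ** n

def position_encoding(n):
    """Split each complex coordinate into (sin, cos) to get a real vector."""
    fw = forward_weights(n)
    return np.stack([fw.imag, fw.real], axis=1).reshape(-1)

# Compare against sinusoidal encodings computed directly.
n = 5
direct = np.stack([np.sin(n * freqs), np.cos(n * freqs)], axis=1).reshape(-1)
assert np.allclose(position_encoding(n), direct)
```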
2. Definitions

We first review weighted string automata, then modify the definition to weighted multiset automata.

2.1. String automata

Weighted finite automata are commonly pictured as directed graphs whose edges are labeled by symbols and weights, but we use an alternative representation as collections of matrices: for each symbol a, this representation takes all the transitions labeled a and arranges their weights into an adjacency matrix, called µ(a). This makes it easier to use the notation and techniques of linear algebra, and it also makes clearer how to implement weighted automata inside neural networks.

Let K be a commutative semiring; in this paper, K is either ℝ or ℂ.

Definition 1. A K-weighted finite automaton (WFA) over Σ is a tuple M = (Q, Σ, λ, µ, ρ), where:

• Q = {1, ..., d} is a finite set of states,
• Σ is a finite alphabet,
• λ ∈ K^{1×d} is a row vector of initial weights,
• µ : Σ → K^{d×d} assigns a transition matrix to every symbol, and
• ρ ∈ K^{d×1} is a column vector of final weights.

The weight λ_q is the weight of starting in state q; the weight [µ(a)]_{qr} is the weight of transitioning from state q to state r on input a; and the weight ρ_q is the weight of accepting in state q. We extend the mapping µ to strings: if w = w1 ··· wn ∈ Σ*, then µ(w) = µ(w1) ··· µ(wn). Then the weight of w is λ µ(w) ρ. In this paper, we are more interested in representing w as a vector rather than as a single weight, so we define the vector of forward weights of w to be fw_M(w) = λ µ(w). (We include the final weights ρ in our examples, but we do not actually use them in any of our experiments.)

Note that, unlike many definitions of weighted automata, this definition does not allow ε-transitions, and there may be more than one initial state. (Hereafter, we use ε to stand for a small real number.)

Example 1. Below, M1 is a WFA that accepts strings where the number of a's is a multiple of three, and (λ1, µ1, ρ1) is its matrix representation. (Diagram: states q1 → q2 → q3 → q1 form a cycle on input a; q1 is both initial and accepting.)

λ1 = [1 0 0]    ρ1 = [1 0 0]^T

µ1(a) = [0 1 0]
        [0 0 1]
        [1 0 0]

And M2 is a WFA that accepts {b}; (λ2, µ2, ρ2) is its matrix representation. (Diagram: q1 → q2 on input b; q1 is initial, q2 is accepting.)

λ2 = [1 0]    ρ2 = [0 1]^T

µ2(b) = [0 1]
        [0 0]

2.2. Multiset automata

The analogue of finite automata for multisets is the special case of the above definition where multiplication of the transition matrices µ(a) does not depend on their order.

Definition 2. A K-weighted multiset finite automaton is one whose transition matrices commute pairwise. That is, for all a, b ∈ Σ, we have µ(a) µ(b) = µ(b) µ(a).

Example 2. Automata M1 and M2 are multiset automata, because each has only one transition matrix, which trivially commutes with itself. Below is a multiset automaton M3 that accepts multisets over {a, b} where the number of a's is a multiple of three and the number of b's is exactly one. (Diagram: states q1 → q2 → q3 → q1 and q4 → q5 → q6 → q4 form two cycles on input a, with edges q1 → q4, q2 → q5, q3 → q6 on input b; q1 is initial and q4 is accepting.)

λ3 = [1 0 0 0 0 0]    ρ3 = [0 0 0 1 0 0]^T

µ3(a) = [0 1 0 0 0 0]    µ3(b) = [0 0 0 1 0 0]
        [0 0 1 0 0 0]            [0 0 0 0 1 0]
        [1 0 0 0 0 0]            [0 0 0 0 0 1]
        [0 0 0 0 1 0]            [0 0 0 0 0 0]
        [0 0 0 0 0 1]            [0 0 0 0 0 0]
        [0 0 0 1 0 0]            [0 0 0 0 0 0]
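As a concrete check on Definitions 1 and 2, here is a minimal NumPy sketch of our own (not code from the paper; all names are illustrative). It builds M3's matrices, verifies that they commute, and confirms that every ordering of the multiset {a, a, a, b} receives the same weight, 1.

```python
import numpy as np
from itertools import permutations

# Transition matrices of the multiset automaton M3 from Example 2.
mu = {
    'a': np.array([[0, 1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0, 0],
                   [1, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 1, 0, 0]], dtype=float),
    'b': np.array([[0, 0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 0]], dtype=float),
}
lam = np.array([[1, 0, 0, 0, 0, 0]], dtype=float)    # initial weights (row)
rho = np.array([[0, 0, 0, 1, 0, 0]], dtype=float).T  # final weights (column)

# Definition 2: the transition matrices commute pairwise.
assert np.allclose(mu['a'] @ mu['b'], mu['b'] @ mu['a'])

def weight(word):
    """lam @ mu(w1) @ ... @ mu(wn) @ rho; lam @ mu(w) is fw_M(w)."""
    fw = lam
    for sym in word:
        fw = fw @ mu[sym]
    return (fw @ rho).item()

# Commutativity means the weight is the same for every ordering.
assert all(weight(order) == 1.0 for order in set(permutations('aaab')))
assert weight('aab') == 0.0   # two a's is not a multiple of three
```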
Our multiset automata (Definition 2) describe the recognizable distributions over multisets (Sakarovitch, 2009, Section 4.1). A recognizable distribution is intuitively one that we can compute by processing one element at a time using only a finite amount of memory.

This is one of two natural ways to generalize string automata to multiset automata. The other way, which describes all the rational distributions of multisets (Sakarovitch, 2009, Section 3.1), places no restriction on the transition matrices and computes the weight of a multiset w by summing over all paths labeled by permutations of w. But this is NP-hard to compute.

Our proposal, then, is to represent a multiset w by the vector of forward weights, fw_M(w), with respect to some weighted multiset automaton M. In the context of a neural network, the transition weights µ(a) can be computed by any function as long as it does not depend on the ordering of symbols, and the forward weights can be used by the network in any way whatsoever.

3. Training

Definition 2 does not lend itself well to training, because parameter optimization needs to be done subject to the commutativity constraint. Previous work (DeBenedetto & Chiang, 2018) suggested approximating the training of a multiset automaton by training a string automaton while using a regularizer to encourage the weight matrices to be close to commuting. Here, we instead approximate multiset automata by string automata whose weight matrices are complex diagonal matrices, which commute automatically.

A transition matrix need not be diagonalizable, but it is always within ε of a matrix that is. Let A = PJP⁻¹, where J is the Jordan form of A, and let D be a diagonal matrix with ‖D‖ ≤ ε/κ(P), where κ(P) = ‖P‖‖P⁻¹‖, whose entries are chosen to make the diagonal entries of J + D all different. Then J + D is diagonalizable. Let E = PDP⁻¹; then ‖E‖ ≤ ‖P‖‖D‖‖P⁻¹‖ = κ(P)‖D‖ ≤ ε, and A + E = P(J + D)P⁻¹ is also diagonalizable.

So M is close to an automaton (λ, µ + E, ρ) where µ + E is diagonalizable. Furthermore, by a change of basis, we can make µ + E diagonal without changing the automaton's behavior:

Proposition 2. If M = (λ, µ, ρ) is a multiset automaton and P is an invertible matrix, the multiset automaton (λ′, µ′, ρ′), where

λ′ = λP⁻¹
µ′(a) = Pµ(a)P⁻¹
ρ′ = Pρ,

assigns the same weight as M to every multiset.
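To make the last two steps concrete, here is a minimal NumPy sketch of our own (not code from the paper; the automaton, eps, and all names are illustrative). It starts from a non-diagonalizable transition matrix (a Jordan block, so the diagonal perturbation D can be applied directly), perturbs it so that it becomes diagonalizable, diagonalizes it by the change of basis of Proposition 2, and checks that the change of basis leaves the automaton's weights unchanged while the perturbation moves them only slightly.

```python
import numpy as np

# A one-symbol automaton whose transition matrix is a Jordan block,
# hence not diagonalizable (repeated eigenvalue 0.5, one eigenvector).
lam = np.array([[1.0, 0.0]])
rho = np.array([[0.0], [1.0]])
mu = np.array([[0.5, 1.0],
               [0.0, 0.5]])

# Perturb the diagonal so the eigenvalues become distinct; an upper
# triangular matrix with distinct diagonal entries is diagonalizable.
eps = 1e-3
E = np.diag([0.0, eps])

# Change of basis (Proposition 2): take P^{-1} to be the eigenvector
# matrix of mu + E, so that mu' = P (mu + E) P^{-1} is diagonal.
_, P_inv = np.linalg.eig(mu + E)
P = np.linalg.inv(P_inv)
lam2 = lam @ P_inv
mu2 = P @ (mu + E) @ P_inv      # (approximately) diagonal
rho2 = P @ rho

def weight(l, m, r, n):
    """Weight of the multiset with the symbol repeated n times: l @ m^n @ r."""
    return (l @ np.linalg.matrix_power(m, n) @ r).item()

for n in range(6):
    w_pert = weight(lam, mu + E, rho, n)                    # perturbed automaton
    assert np.isclose(w_pert, weight(lam2, mu2, rho2, n))   # Prop. 2: unchanged
    assert abs(w_pert - weight(lam, mu, rho, n)) < 0.01     # perturbation is small
```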
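For comparison, here is a sketch of the alternative strategy mentioned at the start of this section, the commutativity regularizer of DeBenedetto & Chiang (2018), as we understand it from the one-sentence description above: train unconstrained transition matrices while penalizing the commutator µ(a)µ(b) − µ(b)µ(a). The task objective, dimensions, hyperparameters, and all names are stand-ins of our own.

```python
import torch

d = 4
mu = {s: torch.nn.Parameter(0.1 * torch.randn(d, d)) for s in 'ab'}
lam = torch.nn.Parameter(torch.randn(1, d))
rho = torch.nn.Parameter(torch.randn(d, 1))
opt = torch.optim.Adam(list(mu.values()) + [lam, rho], lr=1e-2)

def weight(word):
    """lam @ mu(w1) @ ... @ mu(wn) @ rho, as in Definition 1."""
    fw = lam
    for sym in word:
        fw = fw @ mu[sym]
    return (fw @ rho).squeeze()

def commutativity_penalty():
    """Squared Frobenius norm of the commutator of the transition matrices."""
    c = mu['a'] @ mu['b'] - mu['b'] @ mu['a']
    return (c ** 2).sum()

reg = 1.0   # regularizer strength (a stand-in hyperparameter)
for step in range(500):
    opt.zero_grad()
    # Stand-in task: weight 1 for the multiset {a, b}, weight 0 for {a, a}.
    loss = (weight('ab') - 1.0) ** 2 + weight('aa') ** 2
    loss = loss + reg * commutativity_penalty()
    loss.backward()
    opt.step()
```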