Supporting Information

Supporting Information 1 S1 Details on Syllable Clustering by VAE 2 This section is a detailed description of our syllable clustering based on VAE (§4.2). S1.1 describes the 3 seq2seq backbone of the VAE, and S1.2 explains the ABCD-VAE together with the optimization objective. 4 S1.3 defines the parameter settings and the training procedure. Finally, S1.4 discusses problems with the 5 standard Gaussian VAE and the motivation behind our discrete VAE. 6 S1.1 Seq2Seq Autoencoder 7 Figure S1.1 shows the global architecture of our seq2seq VAE, which consists of three modules: the encoder, 8 ABCD-VAE, and the decoder. The entire network receives a time series of syllable spectra, y := (y1;:::; yT ), 9 as its input (in the encoder module) and reconstructs the input data. The reconstruction includes the 10 prediction of each spectrum, ^y := (^y1;:::; ^yT ), as well as that of the offset T of the time series, implemented 11 by binary judgments of whether each time step t is the offset (ht = 1) or not (ht = 0). During the 12 reconstruction process, a fixed-dimensional representation of the entire syllable was obtained between the 13 encoder and decoder and was classified into a discrete category by the ABCD-VAE module. 14 The backbone of the encoder module is the bidirectional LSTM (Hochreiter and Schmidhuber, 1997; 15 Schuster and Paliwal, 1997). This RNN processes the input spectra forward and backward. The last hidden 16 and cell states in the two directions are concatenated and transformed by a multi-layer perceptron (MLP). 17 The MLP output is fed to the ABCD-VAE module (see S1.2) and classified into a discrete category that 18 has a corresponding real-value vector representation. The ABCD-VAE outputs the vector representation 19 of the assigned category, which is concatenated with the embedding of the speaker s of the input syllable. 20 Accordingly, the discrete syllable categories in the ABCD-VAE need not encode speaker characteristics, 21 resulting in speaker normalization (van den Oord et al., 2017; Chorowski et al., 2019; Dunbar et al., 2019; 22 Tjandra et al., 2019). See S1.4 for the motivation behind using this speaker normalization for the Bengalese 23 finch data. The concatenation of the output from the ABCD-VAE and the speaker embedding is transformed 24 by another MLP and fed to the decoder LSTM, which is unidirectional. For each time, step t 2 f1;:::;T g, 25 the output from the LSTM is sent to two distinct MLPs. One of them computes the logits for the offset 26 predictions (P(ht)). The other MLP parameterizes the isotropic Gaussian probability density function of the 27 spectrum reconstruction (cf. Kingma and Welling, 2014). We sampled ^yt using this Gaussian, which is used 1 28 as the input to the LSTM at the next time step t + 1 (the initial input is ^y0 = 0). 29 S1.2 ABCD-VAE 30 This section provides details on the ABCD-VAE. We start with a mathematical description of the model 31 and then move to an explanation of the network implementation. Just like other VAEs, we assumed a prior (i) (i) 32 distribution of the latent feature z of each time-series data i. z is discrete in this study and its prior is the 33 Dirichlet-Categorical distribution. Eq. 1 and 2 below define this prior as the two-step generative procedure. (i) (i) 34 The time-series data—represented by the spectra y and offset judgments h —are generated conditioned (i) 35 on z , whose probability function is implemented by the decoder (Eq. 3). π ∼Dirichlet(α) (1) z(i) j π ∼Categorical(π) (2) (y(i); h(i)) j z(i) ∼p(· j z(i); s(i)) = Decoder(z(i); s(i)) (3) K−1 36 Where α := (α1; : : : ; αK ) are positive real numbers and π := (π1; : : : ; πK ) 2 ∆ is a probability vector PK 37 (i.e., 8k 2 f1;:::;Kg; πk ≥ 1 ^ k=1 πk = 1). The Dirichlet-Categorical prior causes a rich-gets-richer bias, 1 Note that the mathematically correct input to the decoder LSTM is the ground truth spectra, yt, rather than the reconstruction, ^yt, because the objective function is the joint probability of y and h (see Eq. 4). However, the ground truth input to the decoder caused a uninformative latent variable problem. The decoder LSTM is considerably powerful and easily trained to fit to the overall distribution of the time-series data while ignoring information from the encoder (Bowman et al., 2016; Zhao et al., 2017; Liu et al., 2019). We found that our seq2seq VAE did not suffer from this issue when the noisy, reconstructed values ^yt were used instead of the ground truth (cf. also adopted by Chorowski et al., 2019). 1 Decoder s Embed P(h1 = 0) P(h2 = 0) P(hT = 1) ABCD-VAE Sigmoid (Figure S1.2b) Concat MLP Encoder ^y ^y ^y MLP 1 2 T Concat LSTM LSTM LSTM MLP MLP LSTM LSTM LSTM y1 y2 yT ^y0 = 0 Figure S1.1: The architecture of the RNN-VAE, consisting of the Encoder module and the RNN Decoder module. The wavy arrows represent isotropic Gaussian sampling parameterizedby the output of the previous MLP. 38 preferring a smaller number of categories to be used repeatedly (Bishop, 2006; O’Donnell, 2015; Little, 2019), 39 while the (uniform) categorical prior—standard in the categorical VAE (Jang et al., 2017)—eats up all the 40 categories available. Because of this Occam’s razor effect, the Dirichlet-Categorical prior (and its extension 41 to unbounded choices, the Dirichlet process) is popular in Bayesian learning when the model needs to detect 42 the appropriate number of categories in the posterior (see Anderson, 1990; Kurihara and Sato, 2004, 2006; 43 Teh et al., 2006; Kemp et al., 2007; Goldwater et al., 2009; Feldman et al., 2013; Kamper et al., 2017; Morita 44 and O’Donnell, To appear, for examples in computational linguistics and cognitive science). The role of the 45 encoder is to approximate the posterior p(π; z j y; h) of the model in Eq. 1-3. We make several assumptions 2 46 on the approximated posterior, denoted by q(π; z j y). 47 1. π and z are independent given y in q: i.e., q(π; z j y) = q(π j y)q(z j y). (i) (j) 48 2. Each z is the categorical distribution and is independent of the other z (i 6= j) given the corre- (i) 49 sponding data y . 50 3. π is independent of y in q: i.e., q(π j y) = q(π). 51 4. q(π) is the Dirichlet distribution whose parameters are in the form of ! := Nθ + α, where N is 52 the data size (the total number of time-series data) and θ is a trainable vector in the simplex (i.e., P 53 8k 2 f1;:::;Kg; θk ≥ 0 ^ k θk = 1). 2Note that the time series input into the encoder contains implicit information about its length/offset, h(h). 2 54 The assumptions in 1 and 3 are imported from the mean-field variational inference, and 3 provides the optimal 3 55 form of q(π) under the assumptions (Bishop, 2006). The optimization objective of the entire VAE is the 56 maximization of the evidential lower bound (ELBO) of the log marginal likelihood log p(y; h) (Kingma and 57 Welling, 2014; Bowman et al., 2016). log p(y; h) ≥ log p(y; h) − DKL [q(π; z j y) k p(π; z j y; h)] = − DKL [q(π; z j y) k p(π; z)] + Eq [log p(y; h j z)] (4) =:ELBO 58 Based on the assumptions of the approximated posterior q, the first term in Eq. 4 is rewritten as follows: N X h (i) (i) i h (i) i DKL [q(π; z j y) k p(π; z)] =Eq [log q(π)] − Eq [log p(π)] + Eq log q(z j y ) − Eq log p(z j π) i=1 59 Where each term has a closed form. During the mini-batch learning, the first two terms outside the summation 60 operation, the index i of which denotes data, are multiplied by B=N, where B is the batch size. For the second PN (i) (i) (i) 61 term in Eq. 4, Eq [log p(y; h j z)] = i=1 Eq log p(y ; h j z ) , we approximate the computation of the 62 expectation using the Monte Carlo method (Kingma and Welling, 2014). We adopted the Gumbel-Softmax 63 approximation proposed by Jang et al. (2017) because the exact sampling from the categorical distribution (i) i 64 q(z j y ) is incompatible with gradient-based training. h (i) (i) (i) i (i) (i) (i) (i) K−1 Eq log p(y ; h j z ) ≈ log p(y ; h j ~z )(~z 2 ∆ : Sample from Gumbel-Softmax) 65 To simplify the learning process, the linear transformation before and after the Gumbel-Softmax sampling 66 share the same weight matrix M. The linear transformation before the Gumbel-Softmax computes the 67 logits by multiplying the output of the encoder with M, while the transformation after the Gumbel-Softmax (i) T 4 68 computes ~z M (Figure S1.2a). In other words, M is the “codebook” whose column vectors are the 69 real-value representation of the corresponding discrete categories. The first linear transformation computes 70 the similarity between the encoder output and each column vector, and the second transformation picks up (i) 71 the column vector of the sampled category (assuming that the Gumbel-Softmax sample ~z is close to a 72 one-hot vector).

Load more