Supporting Information

S1 Details on Syllable Clustering by VAE

This section is a detailed description of our syllable clustering based on VAE (§4.2). S1.1 describes the seq2seq backbone of the VAE, and S1.2 explains the ABCD-VAE together with the optimization objective. S1.3 defines the parameter settings and the training procedure. Finally, S1.4 discusses problems with the standard Gaussian VAE and the motivation behind our discrete VAE.

S1.1 Seq2Seq

Figure S1.1 shows the global architecture of our seq2seq VAE, which consists of three modules: the encoder, the ABCD-VAE, and the decoder. The entire network receives a time series of syllable spectra, y := (y1, ..., yT), as its input (in the encoder module) and reconstructs the input data. The reconstruction includes the prediction of each spectrum, ŷ := (ŷ1, ..., ŷT), as well as that of the offset T of the time series, implemented by binary judgments of whether each time step t is the offset (ht = 1) or not (ht = 0). During the reconstruction process, a fixed-dimensional representation of the entire syllable is obtained between the encoder and the decoder and is classified into a discrete category by the ABCD-VAE module.

The backbone of the encoder module is the bidirectional LSTM (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997). This RNN processes the input spectra forward and backward. The last hidden and cell states in the two directions are concatenated and transformed by a multi-layer perceptron (MLP). The MLP output is fed to the ABCD-VAE module (see S1.2) and classified into a discrete category that has a corresponding real-valued vector representation. The ABCD-VAE outputs the vector representation of the assigned category, which is concatenated with the embedding of the speaker s of the input syllable. Accordingly, the discrete syllable categories in the ABCD-VAE need not encode speaker characteristics, resulting in speaker normalization (van den Oord et al., 2017; Chorowski et al., 2019; Dunbar et al., 2019; Tjandra et al., 2019). See S1.4 for the motivation behind using this speaker normalization for the Bengalese finch data. The concatenation of the output from the ABCD-VAE and the speaker embedding is transformed by another MLP and fed to the decoder LSTM, which is unidirectional. For each time step t ∈ {1, ..., T}, the output from the LSTM is sent to two distinct MLPs. One of them computes the logits for the offset predictions (P(ht)). The other MLP parameterizes the isotropic Gaussian probability density function of the spectrum reconstruction (cf. Kingma and Welling, 2014). We sampled ŷt from this Gaussian and used it¹ as the input to the LSTM at the next time step t + 1 (the initial input is ŷ0 = 0).
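For concreteness, the following is a minimal PyTorch sketch of this seq2seq backbone. It is a sketch under our assumptions rather than the authors' exact implementation: layer sizes and names are illustrative, the ABCD-VAE is passed in as an opaque quantizer module (see S1.2), and feeding the latent representation to the decoder as its initial hidden state is just one plausible reading of "fed to the decoder LSTM".

import torch
import torch.nn as nn

class Seq2SeqVAE(nn.Module):
    # Sketch of the S1.1 backbone; `quantizer` stands in for the ABCD-VAE of S1.2.
    def __init__(self, spec_dim, quantizer, hidden=256, num_speakers=18):
        super().__init__()
        self.encoder = nn.LSTM(spec_dim, hidden, batch_first=True, bidirectional=True)
        self.enc_mlp = nn.Sequential(nn.Linear(4 * hidden, hidden), nn.Tanh(),
                                     nn.Linear(hidden, hidden))
        self.quantizer = quantizer                       # returns a category embedding
        self.speaker_embed = nn.Embedding(num_speakers, hidden)
        self.dec_mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                     nn.Linear(hidden, hidden))
        self.decoder = nn.LSTM(spec_dim, hidden, batch_first=True)
        self.offset_mlp = nn.Linear(hidden, 1)           # logits for P(h_t = 1)
        self.spec_mlp = nn.Linear(hidden, 2 * spec_dim)  # Gaussian mean and log-variance

    def forward(self, y, speaker):
        batch, T, spec_dim = y.shape
        _, (h_n, c_n) = self.encoder(y)                  # last states of both directions
        summary = torch.cat([h_n[0], h_n[1], c_n[0], c_n[1]], dim=-1)
        z = self.quantizer(self.enc_mlp(summary))        # discrete-category embedding
        dec_in = self.dec_mlp(torch.cat([z, self.speaker_embed(speaker)], dim=-1))
        state = (dec_in.unsqueeze(0), torch.zeros_like(dec_in).unsqueeze(0))
        y_prev = y.new_zeros(batch, 1, spec_dim)         # initial input: y_hat_0 = 0
        offset_logits, means, logvars = [], [], []
        for _ in range(T):                               # autoregressive reconstruction
            out, state = self.decoder(y_prev, state)
            offset_logits.append(self.offset_mlp(out))
            mu, logvar = self.spec_mlp(out).chunk(2, dim=-1)
            means.append(mu)
            logvars.append(logvar)
            # The sampled (noisy) reconstruction is fed to the next time step.
            y_prev = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return (torch.cat(offset_logits, dim=1),
                torch.cat(means, dim=1),
                torch.cat(logvars, dim=1))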

S1.2 ABCD-VAE

This section provides details on the ABCD-VAE. We start with a mathematical description of the model and then move to an explanation of the network implementation. Just like other VAEs, we assumed a prior distribution of the latent feature z^(i) of each time-series datum i. z^(i) is discrete in this study, and its prior is the Dirichlet-Categorical distribution. Eqs. 1 and 2 below define this prior as a two-step generative procedure. The time-series data—represented by the spectra y^(i) and offset judgments h^(i)—are generated conditioned on z^(i), whose probability function is implemented by the decoder (Eq. 3).

\begin{align}
\pi &\sim \mathrm{Dirichlet}(\alpha) \tag{1} \\
z^{(i)} \mid \pi &\sim \mathrm{Categorical}(\pi) \tag{2} \\
(y^{(i)}, h^{(i)}) \mid z^{(i)} &\sim p(\,\cdot \mid z^{(i)}, s^{(i)}) = \mathrm{Decoder}(z^{(i)}, s^{(i)}) \tag{3}
\end{align}

where α := (α1, ..., αK) are positive real numbers and π := (π1, ..., πK) ∈ ∆^(K−1) is a probability vector (i.e., πk ≥ 0 for all k ∈ {1, ..., K} and ∑_{k=1}^{K} πk = 1).

¹ Note that the mathematically correct input to the decoder LSTM is the ground-truth spectrum yt rather than the reconstruction ŷt, because the objective function is the joint probability of y and h (see Eq. 4). However, feeding the ground truth to the decoder made the latent code uninformative: the decoder LSTM is considerably powerful and is easily trained to fit the overall distribution of the time-series data while ignoring the information from the encoder (Bowman et al., 2016; Zhao et al., 2017; Liu et al., 2019). We found that our seq2seq VAE did not suffer from this issue when the noisy, reconstructed values ŷt were used instead of the ground truth (an approach also adopted by Chorowski et al., 2019).
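As a concrete illustration of this prior, here is a small NumPy sketch of the generative procedure in Eqs. 1 and 2 (Eq. 3, the decoder, is left out). The concentration value and the data size are illustrative assumptions, chosen only to show the sparsity bias discussed below.

import numpy as np

rng = np.random.default_rng(0)
K, N = 128, 1000                  # category upper bound and number of syllables
alpha = np.full(K, 0.5)           # Dirichlet concentration (illustrative value)

pi = rng.dirichlet(alpha)         # Eq. 1: category probabilities
z = rng.choice(K, size=N, p=pi)   # Eq. 2: one discrete category per syllable

# Eq. 3 would map each (z_i, s_i) to spectra and offset judgments via the decoder.
print("categories actually used:", np.unique(z).size, "out of", K)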


Figure S1.1: The architecture of the RNN-VAE, consisting of the Encoder module and the RNN Decoder module. The wavy arrows represent isotropic Gaussian sampling parameterized by the output of the previous MLP.

The Dirichlet-Categorical prior causes a rich-gets-richer bias, preferring a smaller number of categories to be used repeatedly (Bishop, 2006; O’Donnell, 2015; Little, 2019), whereas the (uniform) categorical prior—standard in the categorical VAE (Jang et al., 2017)—tends to use up all the available categories. Because of this Occam’s razor effect, the Dirichlet-Categorical prior (and its extension to unbounded choices, the Dirichlet process) is popular in Bayesian learning when the model needs to detect the appropriate number of categories in the posterior (see Anderson, 1990; Kurihara and Sato, 2004, 2006; Teh et al., 2006; Kemp et al., 2007; Goldwater et al., 2009; Feldman et al., 2013; Kamper et al., 2017; Morita and O’Donnell, To appear, for examples in computational linguistics and cognitive science). The role of the encoder is to approximate the posterior p(π, z | y, h) of the model in Eqs. 1–3. We make several assumptions² on the approximated posterior, denoted by q(π, z | y).

1. π and z are independent given y in q: i.e., q(π, z | y) = q(π | y) q(z | y).

2. Each z^(i) follows a categorical distribution in q and is independent of the other z^(j) (i ≠ j) given the corresponding data y^(i).

3. π is independent of y in q: i.e., q(π | y) = q(π).

4. q(π) is the Dirichlet distribution whose parameters are of the form ω := Nθ + α, where N is the data size (the total number of time-series data) and θ is a trainable vector in the simplex (i.e., θk ≥ 0 for all k ∈ {1, ..., K} and ∑_k θk = 1).

² Note that the time series input into the encoder contains implicit information about its length/offset h^(i).

The assumptions in 1 and 3 are imported from the mean-field variational inference, and 3 provides the optimal form of q(π)³ under the assumptions (Bishop, 2006). The optimization objective of the entire VAE is the maximization of the evidence lower bound (ELBO) of the log marginal likelihood log p(y, h) (Kingma and Welling, 2014; Bowman et al., 2016).

\begin{align}
\log p(y, h) &\geq \log p(y, h) - D_{\mathrm{KL}}\left[ q(\pi, z \mid y) \,\|\, p(\pi, z \mid y, h) \right] \nonumber \\
&= \underbrace{- D_{\mathrm{KL}}\left[ q(\pi, z \mid y) \,\|\, p(\pi, z) \right] + \mathbb{E}_{q}\left[ \log p(y, h \mid z) \right]}_{=:\,\mathrm{ELBO}} \tag{4}
\end{align}

Based on the assumptions of the approximated posterior q, the first term in Eq. 4 is rewritten as follows:

\[
D_{\mathrm{KL}}\left[ q(\pi, z \mid y) \,\|\, p(\pi, z) \right]
= \mathbb{E}_{q}[\log q(\pi)] - \mathbb{E}_{q}[\log p(\pi)]
+ \sum_{i=1}^{N} \left( \mathbb{E}_{q}\left[ \log q(z^{(i)} \mid y^{(i)}) \right] - \mathbb{E}_{q}\left[ \log p(z^{(i)} \mid \pi) \right] \right)
\]

where each term has a closed form. During mini-batch learning, the first two terms, which lie outside the summation over the data index i, are multiplied by B/N, where B is the batch size. For the second term in Eq. 4, E_q[log p(y, h | z)] = ∑_{i=1}^{N} E_q[log p(y^(i), h^(i) | z^(i))], we approximate the computation of the expectation by the Monte Carlo method (Kingma and Welling, 2014). We adopted the Gumbel-Softmax approximation proposed by Jang et al. (2017) because exact sampling from the categorical distribution q(z^(i) | y^(i)) is incompatible with gradient-based training.

\[
\mathbb{E}_{q}\left[ \log p(y^{(i)}, h^{(i)} \mid z^{(i)}) \right]
\approx \log p(y^{(i)}, h^{(i)} \mid \tilde{z}^{(i)})
\qquad \left(\tilde{z}^{(i)} \in \Delta^{K-1}:\ \text{sample from the Gumbel-Softmax}\right)
\]
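To make these terms concrete, the following NumPy/SciPy sketch evaluates the two components of the KL term using the standard closed forms implied by the mean-field assumptions above, and draws one Gumbel-Softmax sample for the Monte Carlo reconstruction term. The concentration values, the posterior probabilities standing in for the encoder output, and the temperature are placeholder assumptions.

import numpy as np
from scipy.special import digamma, gammaln, softmax

rng = np.random.default_rng(0)
K, N, B = 128, 10000, 64                  # categories, data size, batch size
alpha = np.full(K, 0.5)                   # prior concentration (illustrative)
theta = softmax(rng.normal(size=K))       # trainable simplex vector (assumption 4)
omega = N * theta + alpha                 # parameters of q(pi)

# KL[q(pi) || p(pi)]: closed-form KL divergence between two Dirichlet distributions.
kl_dirichlet = (gammaln(omega.sum()) - gammaln(omega).sum()
                - gammaln(alpha.sum()) + gammaln(alpha).sum()
                + ((omega - alpha) * (digamma(omega) - digamma(omega.sum()))).sum())

# Per-datum term E_q[log q(z|y)] - E_q[log p(z|pi)], with q_z standing in for the
# posterior probabilities produced by the encoder for one syllable.
q_z = softmax(rng.normal(size=K))
kl_categorical = (q_z * (np.log(q_z) - (digamma(omega) - digamma(omega.sum())))).sum()

# Mini-batch estimate of the first term of Eq. 4: the Dirichlet part is rescaled by
# B/N, and the categorical part is summed over the B items of the batch (one shown).
kl_minibatch = (B / N) * kl_dirichlet + kl_categorical

# Gumbel-Softmax sample used to Monte-Carlo the reconstruction term.
tau = 1.0
gumbel = -np.log(-np.log(rng.uniform(size=K)))
z_tilde = softmax((np.log(q_z) + gumbel) / tau)   # near-one-hot for small tau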

To simplify the learning process, the linear transformations before and after the Gumbel-Softmax sampling share the same weight matrix M. The linear transformation before the Gumbel-Softmax computes the logits by multiplying the output of the encoder with M, while the transformation after the Gumbel-Softmax computes z̃^(i)⊤M⁴ (Figure S1.2a). In other words, M is the “codebook” whose column vectors are the real-valued representations of the corresponding discrete categories. The first linear transformation computes the similarity between the encoder output and each column vector, and the second transformation picks up the column vector of the sampled category (assuming that the Gumbel-Softmax sample z̃^(i) is close to a one-hot vector). Thus, our VAE is similar to the vector-quantized VAE (van den Oord et al., 2017; Chorowski et al., 2019; Tjandra et al., 2019), which uses the L2 similarity instead of our unnormalized cosine similarity (and without the sampling randomness).

The exact implementation of the ABCD-VAE is based on the scaled dot-product attention used in the Transformer (single-head attention; Vaswani et al., 2017; Devlin et al., 2018) and is depicted in Figure S1.2b. The attention mechanism first computes the dot product of the encoder’s output (query) and the codebook M (memory; i.e., both key and value), yielding the similarity between the two. This similarity is scaled by √Dh, where Dh is the dimensionality of the encoder output and of the column vectors of the codebook (i.e., the number of rows in M). The scaled similarity is transformed by the softmax into the posterior probability vector q(z^(i) | y^(i)), and a Gumbel-Softmax sample z̃^(i) is drawn from this probability distribution. Finally, z̃^(i) is multiplied by M (used as the value) and sent to the decoder.
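The quantization step just described can be sketched in a few lines of NumPy. The codebook shape (Dh rows, K columns) and the one-dimensional treatment of a single syllable are our assumptions; the scaling, softmax, Gumbel-Softmax sample, and value lookup follow the description above.

import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
D_h, K, tau = 256, 128, 1.0
M = rng.normal(size=(D_h, K))        # codebook: one column per discrete category
h_enc = rng.normal(size=D_h)         # encoder output for one syllable (query)

logits = h_enc @ M / np.sqrt(D_h)    # scaled dot-product similarity (M as the key)
q_z = softmax(logits)                # posterior q(z | y) over the K categories

gumbel = -np.log(-np.log(rng.uniform(size=K)))
z_tilde = softmax((np.log(q_z) + gumbel) / tau)   # relaxed, near-one-hot sample

to_decoder = z_tilde @ M.T           # weighted combination of codebook columns (M as the value)

During the pretraining epochs described in S1.3, z_tilde would simply be replaced by q_z before the codebook multiplication.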

S1.3 Parameter Settings and Training Procedure

The input time series y1, ..., yT were obtained as follows. We first applied the short-time Fourier transform to the recordings with an 8 msec Hanning window and a 4 msec stride (i.e., 256 samples for the window length and 128 samples for the step size, because the sampling rate was 32 kHz).

³ The optimal update of θ is proportional to the expected number of data classified into each category (Bishop, 2006). However, the exact computation of these expected counts requires iterations over all the data and is inefficient. Therefore, we train θ by gradient ascent.
⁴ For simplicity, the linear transformations do not have bias terms.


Figure S1.2: ABCD-VAE. (a) Shared weight matrix. (b) Attention-based implementation.

The spectral amplitude was then log-transformed (after 2^(−15) was added to avoid underflow) and rescaled by 11^(−1).
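A sketch of this preprocessing is given below, assuming scipy.signal.stft as the STFT implementation and the natural logarithm (the text does not state either choice); the window, stride, underflow offset, and rescaling constant are taken from the text.

import numpy as np
from scipy.signal import stft

def syllable_to_spectra(waveform, fs=32000):
    # 8 msec Hanning window (256 samples) with a 4 msec stride (128-sample step).
    freqs, times, Z = stft(waveform, fs=fs, window='hann', nperseg=256, noverlap=128)
    amplitude = np.abs(Z)
    # Log-transform after adding 2**-15 to avoid underflow, then rescale by 1/11.
    return (np.log(amplitude + 2.0 ** -15) / 11.0).T   # (time steps, frequency bins)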

All the hidden states in the VAE had the dimensionality Dh = 256, except in the discrete feature space. The number of possible discrete categories (i.e., the upper bound) was K = 128, and the non-linearity of the MLPs was tanh. The neural network was trained by stochastic gradient ascent for 20 epochs. The first five epochs were used for “pretraining”, during which the posterior probability q(z^(i) | y^(i)) was multiplied with the codebook M without sampling from the Gumbel-Softmax distribution. After this process, the temperature τ of the Gumbel-Softmax was annealed every 1000 iterations according to the schedule τ = exp(−10^(−5) m), where m is the number of iterations after the initial five epochs (cf. Jang et al., 2017). The learning rate was initially set to 1.0 and multiplied by 0.1 after every epoch in which the validation loss did not improve. The gradient norms were clipped at 1.0 to avoid explosion. No dropout or momentum was introduced.
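The schedule above can be sketched as the following PyTorch training loop. The optimizer choice (plain SGD, matching the stated absence of momentum), the model.elbo method, and the names model, train_loader, and validation_loss are our assumptions; the temperature annealing, learning-rate decay, and gradient clipping follow the text.

import math
import torch

# model, train_loader, and validation_loss() are assumed to exist elsewhere.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=0)

iteration, tau = 0, 1.0
for epoch in range(20):
    for batch in train_loader:
        pretraining = epoch < 5                  # multiply q(z|y) with M, no sampling
        loss = -model.elbo(batch, tau=tau, pretraining=pretraining)   # maximize ELBO
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        if not pretraining:
            iteration += 1
            if iteration % 1000 == 0:            # anneal every 1000 iterations
                tau = math.exp(-1e-5 * iteration)
    scheduler.step(validation_loss(model))       # lr *= 0.1 when the loss stalls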

S1.4 Problems with the Gaussian VAE Analysis of the Bengalese Finch Song

This section discusses the real-valued features of Bengalese finch syllables obtained by the standard Gaussian VAE and the problems with them. The Gaussian VAE is a popular way of obtaining real-valued features of data in an arbitrary-dimensional space (Kingma and Welling, 2014) and has been used for analyses of animal vocalization (Coffey et al., 2019; Goffinet et al., 2019; Sainburg et al., 2019b). We also tested it on our Bengalese finch data in an early stage of this study, and the obtained syllable features exhibited some clear clusters when we looked at each individual bird separately. Figure S1.3a shows the syllable features of an individual (the same syllables as the ones reported in Figure 3c). The original features had 16 dimensions, which we embedded into a two-dimensional space by tSNE for visualization (van der Maaten and Hinton, 2008).
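This visualization step amounts to a single scikit-learn call; the placeholder array and the default tSNE settings here are assumptions.

import numpy as np
from sklearn.manifold import TSNE

# Placeholder for the (num_syllables, 16) matrix of Gaussian-VAE syllable encodings.
features = np.random.default_rng(0).normal(size=(500, 16))
embedded = TSNE(n_components=2, random_state=0).fit_transform(features)  # (num_syllables, 2)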

When we represent syllables of multiple individuals, however, we no longer see such clear clusters. Figure S1.3b shows the syllable features of all the 18 individuals used in this study (distinguished by the colors and shapes of the markers).⁵ Many syllables from different individuals are mixed together around the center, and the tiny clusters in the periphery are individual-specific. We also tried the speaker-normalization technique used in discrete VAEs (van den Oord et al., 2017; Chorowski et al., 2019; Tjandra et al., 2019), but this did not solve the problem (Figure S1.3c). Given this inappropriate distribution of the syllable features encoded by the Gaussian VAE, we adopted end-to-end clustering with the ABCD-VAE.

S2 Details on the Transformer Language Model

Our analysis of context dependency was based on Transformer language models of Bengalese finch songs and English sentences (Vaswani et al., 2017; Devlin et al., 2018; Dai et al., 2019). This section describes the model parameters and training procedure we used. The Transformer consisted of six layers with eight attention heads per layer. The dimensionality of the hidden states, including the middle layer of the MLPs, was 512. We adopted the relative position encoding proposed by Dai et al. (2019). The input embeddings of the Bengalese finch syllables were additively combined with the embeddings of the speaker identity. Dropout was applied at a rate of 0.1 to the input embeddings (plus the speaker embeddings for the Bengalese finch data), to the output of each Transformer sublayer before the residual connection, and to the attention weights. We trained the Transformer for 20,000 iterations using the Adam optimizer with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, and a weight decay of 0.01 (Kingma and Ba, 2015). The learning rate was updated according to the schedule used by Vaswani et al. (2017) and Devlin et al. (2018), with 1,000 warmup iterations. The batch size was 128.
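The optimizer and schedule described above can be sketched in PyTorch as follows. The encoder stack here is only a stand-in (it lacks the Transformer-XL relative position encoding and the embedding/output layers of the actual language model), and the warmup-then-inverse-square-root shape is one plausible reading of the Vaswani et al. schedule.

import torch

# Stand-in for the actual model: a standard 6-layer, 8-head encoder stack with
# hidden size 512, including the middle layer of the MLPs.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=512, dropout=0.1),
    num_layers=6)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), weight_decay=0.01)

warmup = 1000
def lr_scale(step):
    # Linear warmup followed by inverse-square-root decay, peaking at lr = 0.001.
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)

# In the training loop (20,000 iterations, batch size 128), each optimizer.step()
# is followed by scheduler.step() so that the learning rate tracks the schedule.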

S3 Concrete Examples Problematic to the Mutual Information Analysis

This section discusses more concrete examples of sequential data for which the mutual information metric diverges from the intuitive concept of “context dependency,” defined by or related to the memory burden on the animal agents that produce/recognize the sequential data. We first provide a mathematical model of the problem with individual-specific tokens discussed in §1 (S3.1). We then introduce simulation results that demonstrate the difficulty of interpreting mutual information as an index of context dependency (S3.2).

⁵ Owing to the computational cost of tSNE, we only visualized the human-annotated portion of the data.


Figure S1.3: Continuous-valued encodings of Bengalese finch syllables by Gaussian VAE. The original dimensionality of the features was 16, and they were embedded into the two-dimensional space by tSNE. (a) Syllables from a single individual (b03, the same data as those reported in Figure 3c). (b) Syllables from all the individuals were plotted together, using colors and shapes of the markers to distinguish the individuals. (c) Syllables from all the individuals encoded with the speaker normalization technique used in discrete VAEs.

                 Follower (X+d)
Predecessor (X)  a       b       s1      s2
a                1/16    1/16    1/16    1/16
b                1/16    1/16    1/16    1/16
s1               1/16    1/16    1/8     0
s2               1/16    1/16    0       1/8

Table S3.1: Probability of each pair of tokens in the same sequence.

It should be noted that recent studies on human language (Lin and Tegmark, 2017) and birdsong (Sainburg et al., 2019a) did not use mutual information to assess agent-based context dependency. Instead, they analyzed the decay in mutual information to diagnose the generative model behind the data. The purpose of this section is not to criticize these studies but to note that mutual information cannot replace our model-based analysis.

S3.1 Problem with Individuality

The mutual information I measures the expected divergence between the joint distribution of two tokens, X and X+d, at a certain distance d and the product of their marginal probabilities.

\begin{align}
I(X, X_{+d}) &:= \mathbb{E}\left[ \log_2 \frac{P(X, X_{+d})}{P(X)\, P(X_{+d})} \right] \tag{5} \\
&= \sum_{x} \sum_{x_{+d}} P(X = x, X_{+d} = x_{+d}) \log_2 \frac{P(X = x, X_{+d} = x_{+d})}{P(X = x)\, P(X_{+d} = x_{+d})} \nonumber
\end{align}

Mutual information is zero if X and X+d are independent. It should be noted that this is a pairwise metric, and the other tokens appearing between the two are ignored. For example, X+d can be conditionally independent of X given the tokens between them, which provide the same information as X+d for the prediction of X⁶; such relations are not detected by the metric.

As briefly stated in §1, mutual information diverges from the natural concept of context dependency when some tokens encode individual information. Suppose that sequences of tokens are generated by iterating the following procedure (a simulation sketch is given at the end of this subsection):

1. One of two individuals is chosen at random to generate a sequence (i.e., uniformly sample the individual-specific token s ∈ {s1, s2}).

2. Uniformly randomly choose whether a shared or an individual-specific token is sampled.

3. Sample one of the two shared tokens (x ∈ {a, b}) or emit the individual-specific token (= s).

Then, the probability of each pair of predecessor and follower tokens is as shown in Table S3.1, regardless of their distance; every possible pair is encountered at random, except that only one of the two individual-specific tokens is included in a single sequence, and thus the heterogeneous pairs (s1, s2) and (s2, s1) are never observed. Accordingly, the mutual information, if measured globally across sequences, is constant at 0.25. This contradicts the agent-oriented concept of context dependency: the individual agents generate the sequences without referring to the past tokens, and those who read and/or hear the sequences can predict which one of s1 and s2 will come next based on its latest occurrence, not anything further in the past.
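The following NumPy simulation of the three-step procedure estimates the global pairwise mutual information; under the stated process it should come out near the analytical value of 0.25 bits at any distance d. The sequence length, number of sequences, and distance are arbitrary choices of ours.

import numpy as np

rng = np.random.default_rng(0)
num_seqs, seq_len, d = 20000, 20, 5          # d: distance between the paired tokens

def generate_sequence():
    s = rng.choice(['s1', 's2'])                              # step 1: pick the individual
    shared = rng.random(seq_len) < 0.5                        # step 2: shared vs. specific
    return np.where(shared, rng.choice(['a', 'b'], size=seq_len), s)   # step 3

tokens = ['a', 'b', 's1', 's2']
counts = np.zeros((4, 4))
for _ in range(num_seqs):
    seq = generate_sequence()
    for x, x_d in zip(seq[:-d], seq[d:]):                     # all pairs at distance d
        counts[tokens.index(x), tokens.index(x_d)] += 1

joint = counts / counts.sum()
p_x = joint.sum(axis=1, keepdims=True)
p_xd = joint.sum(axis=0, keepdims=True)
with np.errstate(divide='ignore', invalid='ignore'):
    terms = joint * np.log2(joint / (p_x * p_xd))
print(f"estimated I(X, X_+d) at d={d}: {np.nansum(terms):.3f} bits (analytical: 0.25)")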

S3.2 Algebraic Complexity of the Mutual Information Analysis

The particular problem with individual-specific tokens discussed above is not difficult to solve, because we can simply condition the relevant probabilities (Eq. 5) on the individual that generates each sequence.

⁶ It is not impossible to assess such conditional effects with mutual information in principle: we may replace all the probabilities in Eq. 5 with conditional ones. Such an index, however, would be difficult to estimate in practice because of the exponentially many possible sequences of conditioning tokens.

However, the mismatch between mutual information and the agent-oriented concept of context dependency is not limited to that specific case. In this section, we show that data generated by finite-state automata (FSA), which have been a popular model of Bengalese finch song (Hosino and Okanoya, 2000; Okanoya, 2004; Kakishita et al., 2007), can yield different mutual information scores despite having identical context dependency from a generative perspective. The simulations in this section are complex, and it was not easy to obtain the mutual information from the formal definition in Eq. 5. Thus, we used the version of mutual information that Sainburg et al. (2019a) proposed for the analysis of real birdsong data (termed “SMI” below; cf. Futrell et al., 2019, who applied the same metric to hierarchical dependencies in human language syntax), which computes an estimated mutual information Î of the data (Grassberger, 2003; Lin and Tegmark, 2017) and corrects it with the shuffled data X_sh and X_{sh,+d}.⁷

\begin{align*}
\hat{I}(X, X_{+d}) &:= \hat{S}(X) + \hat{S}(X_{+d}) - \hat{S}(X, X_{+d}) \\
\hat{S} &:= \log_2(N) - \frac{1}{N} \sum_{x} N_x\, \psi(N_x) \\
\mathrm{SMI} &:= \hat{I}(X, X_{+d}) - \hat{I}(X_{\mathrm{sh}}, X_{\mathrm{sh},+d})
\end{align*}
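Under our reading of these definitions—N the number of tokens, N_x the count of token x, ψ the digamma function, and the shuffling applied within each sequence (footnote 7)—the SMI can be sketched as follows. The function names are ours, and we compute the entropies in nats, which matches the ≈0.693 values reported below.

import numpy as np
from scipy.special import digamma

def entropy_hat(symbols):
    # Grassberger-style entropy estimate (computed in nats here).
    _, counts = np.unique(symbols, return_counts=True)
    n = counts.sum()
    return np.log(n) - (counts * digamma(counts)).sum() / n

def mi_hat(x, y):
    pairs = [f"{a}\t{b}" for a, b in zip(x, y)]
    return entropy_hat(x) + entropy_hat(y) - entropy_hat(pairs)

def smi(sequences, d, rng=np.random.default_rng(0)):
    x, y, x_sh, y_sh = [], [], [], []
    for seq in sequences:
        seq = np.asarray(seq)
        shuffled = rng.permutation(seq)          # shuffle each sequence separately
        x.extend(seq[:-d]); y.extend(seq[d:])
        x_sh.extend(shuffled[:-d]); y_sh.extend(shuffled[d:])
    return mi_hat(x, y) - mi_hat(x_sh, y_sh)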

SMI captures the concept of context dependency better than the raw mutual information, owing to the shuffled baseline used in its computation. For example, SMI will be zero for the example with individual-specific tokens discussed in the previous section: the mutual information does not change under the shuffling operation, so the score of the original data is canceled out by that of the shuffled data. However, there are other cases where SMI diverges from the agent-based concept of context dependency. In the remainder of this section, we discuss periodic Markov processes (Lin and Tegmark, 2017).

A popular model of birdsong (and of the voice sequences of other animals) is the finite-state automaton (FSA; Hosino and Okanoya, 2000; Okanoya, 2004; Kakishita et al., 2007; but see Kershenbaum et al., 2014 and Morita and Koda, 2019 for the possible effectiveness of language models beyond the capacity of FSA). An FSA transitions among a finite number of states and emits/processes a token associated with each transition. Figure S3.1a shows an FSA model of Bengalese finch song that was proposed by Okanoya (2004) and is commonly cited in studies of the song syntax (e.g., Berwick et al., 2011; Miyagawa et al., 2013). We generated a sequence of 100,000 tokens from this FSA (with a uniformly random choice between a and c at the state q2), and the SMI estimated from these data is shown in Figure S3.2. The SMI dropped fast and went below 0.01 when the inter-token distance was 7 or greater. This does not significantly contradict the agent-based concept of context dependency: an agent only needs to remember the last emitted token to correctly simulate the FSA (i.e., the FSA can be simulated by a bigram model; speaking in the language of formal language theory, the generated patterns are 2-strictly locally testable⁸).

186 the last emitted token to correctly simulate the FSA (i.e., the FSA can be simulated by a bigram model; 8 187 speaking in the language of formal language theory, the generated patterns are 2-strictly locally testable).

188 However, small modifications to the FSA in Figure S3.1a can result in completely different SMI scores. 189 Figure S3.1b removes the branch at q2 and makes the loop back to q1 obligatory, while there are still two 190 possible followers of b (= a and c) chosen at random. While this change does not require any extra effort for 191 the data production/processing, and could even simplify the process owing to the reduced number of states,

192 the SMI is now constant around 0.693. Likewise, the small extension shown in Figure S3.1c, delaying the loop 193 back to q1 after the choice of c, puts the SMI converge around 0.693. This extension does not change the 194 agent-based context dependency, either as agents can still simulate the extended FSA if they remember the

195 last emitted token and nothing further from the past, preserving the 2-strictly local testability. The non-zero 9 196 SMI of the non-branching and extended FSAs roots in their periodicity. Taking the non-branching FSA 197 (Figure S3.1b) for example, we only observe the predecessor-follower pairs, (a, c), (c, a), and (b, b), when they 198 are 2m tokens apart (m ∈ Z+), and the other patterns occur elsewhere (2m + 1). Thus, the joint probability 199 of the pairs is different from the product of each member’s marginal probability, pulling I (and Iˆ) above ˆ 200 zero. On the other hand, this periodicity is broken by the shuffling, creating a difference between I(X,X+d) ˆ 201 and I(Xsh,Xsh,+d) and keeping the SMI non-zero. Similarly, the extended FSA in Figure S3.1c goes back to
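The contrast just described can be checked by combining simple generators for the FSAs with the smi() sketch above. The transition tables below are transcribed from our reading of Figure S3.1 and its caption, so they are assumptions rather than the authors' exact automata.

import numpy as np

rng = np.random.default_rng(0)

def generate(transitions, length, state='q0'):
    # transitions[state] is a list of (token, next_state) options, chosen uniformly.
    out = []
    while len(out) < length:
        token, state = transitions[state][rng.integers(len(transitions[state]))]
        out.append(token)
    return out

# (a) Okanoya (2004): branching at q2; choosing c delays the return to q1 by one step.
fsa_a = {'q0': [('a', 'q1')], 'q1': [('b', 'q2')],
         'q2': [('a', 'q1'), ('c', 'q3')], 'q3': [('a', 'q1')]}
# (b) Non-branching: both emissions at q2 loop straight back to q1 (period 2).
fsa_b = {'q0': [('a', 'q1')], 'q1': [('b', 'q2')],
         'q2': [('a', 'q1'), ('c', 'q1')]}

seq_a = generate(fsa_a, 100000)
seq_b = generate(fsa_b, 100000)
# With the smi() function sketched in S3.2, smi([seq_a], d) decays with d,
# whereas smi([seq_b], d) stays near log(2) ≈ 0.693 for every distance d.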

⁷ The shuffling operates on each sequence; therefore, tokens in different sequences are not mixed.
⁸ Berwick et al. (2011) argue that the formal language characterized by the FSA in Figure S3.1a is not strictly locally testable. This is incorrect, because we can test whether each string is a possible outcome of the FSA simply by matching each of its substrings of length 2 to the repertoire {ab, ba, bc, ca}.
⁹ Lin and Tegmark (2017) prove that mutual information in periodic Markovian processes is characterized by an exponential decay plus a constant bottom line.


Figure S3.1: FSA models of Bengalese finch song. (a) FSA proposed by Okanoya (2004). (b) A simplified version of (a), removing the transitional branch at q2 while keeping the two possible emissions, a and c. (c) An extended version of (a), delaying the loop back to q1 after the choice of c.

Figure S3.2: SMI of the FSAs in Figure S3.1.

Similarly, the extended FSA in Figure S3.1c goes back to each state two to four steps after it leaves that state; hence, homogeneous pairs like (b, b) are never observed when their distance is an odd integer. While it is not clear how much periodicity exists in real birdsong and other sequential data in biology, such problems with the mutual information analysis should be recognized. Given the complexity of such restrictions, we conclude that mutual information cannot replace the model-based analysis for the assessment of agent-oriented context dependency.

References

Anderson, J. R. (1990). The adaptive character of thought. Studies in cognition. L. Erlbaum Associates, Hillsdale, NJ.
Berwick, R. C., Okanoya, K., Beckers, G. J., and Bolhuis, J. J. (2011). Songs to syntax: the linguistics of birdsong. Trends in Cognitive Science, 15(3):113–121.
Bishop, C. M. (2006). Pattern recognition and machine learning. Information science and statistics. Springer, New York.
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. (2016). Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning.
Chorowski, J., Weiss, R. J., Bengio, S., and van den Oord, A. (2019). Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):2041–2053.
Coffey, K. R., Marx, R. G., and Neumaier, J. F. (2019). DeepSqueak: a deep learning-based system for detection and analysis of ultrasonic vocalizations. Neuropsychopharmacology, 44(5):859–868.
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, J., Cao, X.-N., Miskic, L., Dugrain, C., Ondel, L., Black, A. W., Besacier, L., Sakti, S., and Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. In Proceedings of Interspeech 2019, pages 1088–1092.
Feldman, N. H., Goldwater, S., Griffiths, T. L., and Morgan, J. L. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4):751–778.
Futrell, R., Wilcox, E., Morita, T., Qian, P., Ballesteros, M., and Levy, R. (2019). Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42, Minneapolis, Minnesota. Association for Computational Linguistics.
Goffinet, J., Mooney, R., and Pearson, J. (2019). Inferring low-dimensional latent descriptions of animal vocalizations. bioRxiv.
Goldwater, S., Griffiths, T. L., and Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112:21–54.
Grassberger, P. (2003). Entropy estimates from insufficient samplings.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Hosino, T. and Okanoya, K. (2000). Lesion of a higher-order song nucleus disrupts phrase level complexity in Bengalese finches. Neuroreport, 11(10):2091–2095.
Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with Gumbel-Softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
Kakishita, Y., Sasahara, K., Nishino, T., Takahasi, M., and Okanoya, K. (2007). Pattern extraction improves automata-based syntax analysis in songbirds. Lecture Notes in Artificial Intelligence, 4828:320–332.
Kamper, H., Jansen, A., and Goldwater, S. (2017). A segmental framework for fully-unsupervised large-vocabulary speech recognition. Computer Speech & Language, 46:154–174.
Kemp, C., Perfors, A., and Tenenbaum, J. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3):307–321.
Kershenbaum, A., Bowles, A. E., Freeberg, T. M., Jin, D. Z., Lameira, A. R., and Bohn, K. (2014). Animal vocal sequences: not the Markov chains we thought they were. Proceedings of the Royal Society of London B: Biological Sciences, 281(1792).
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. The International Conference on Learning Representations (ICLR) 2014.
Kurihara, K. and Sato, T. (2004). An application of the variational Bayesian approach to probabilistic context-free grammars. In International Joint Conference on Natural Language Processing Workshop Beyond Shallow Analyses.
Kurihara, K. and Sato, T. (2006). Variational Bayesian grammar induction for natural language. In Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., and Tomita, E., editors, Grammatical Inference: Algorithms and Applications: 8th International Colloquium, ICGI 2006, Tokyo, Japan, September 20-22, 2006. Proceedings, pages 84–96. Springer Berlin Heidelberg, Berlin, Heidelberg.
Lin, H. W. and Tegmark, M. (2017). Critical behavior in physics and probabilistic formal languages. Entropy, 19(7):299.
Little, M. A. (2019). Machine Learning for Signal Processing: Data Science, Algorithms, and Computational Statistics. Oxford University Press.
Liu, D., Xue, Y., He, F., Chen, Y., and Lv, J. (2019). µ-forcing: Training variational recurrent autoencoders for text generation. ACM Transactions on Asian and Low-Resource Language Information Processing, 19(1).
Miyagawa, S., Berwick, R., and Okanoya, K. (2013). The emergence of hierarchical structure in human language. Frontiers in Psychology, 4:71.
Morita, T. and Koda, H. (2019). Superregular grammars do not provide additional explanatory power but allow for a compact analysis of animal song. Royal Society Open Science, 6(7):190139. Preprinted in arXiv:1811.02507.
Morita, T. and O’Donnell, T. J. (To appear). Statistical evidence for learnable lexical subclasses in Japanese. Linguistic Inquiry. Accepted with major revisions.
O’Donnell, T. J. (2015). Productivity and reuse in language: a theory of linguistic computation and storage. MIT Press, Cambridge, MA; London, England.
Okanoya, K. (2004). Song syntax in Bengalese finches: proximate and ultimate analyses. Advances in the Study of Behavior, 34:297–345.
Sainburg, T., Theilman, B., Thielk, M., and Gentner, T. Q. (2019a). Parallels in the sequential organization of birdsong and human speech. Nature Communications, 10(3636).
Sainburg, T., Thielk, M., and Gentner, T. Q. (2019b). Latent space visualization, characterization, and generation of diverse vocal communication signals. bioRxiv.
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
Tjandra, A., Sisman, B., Zhang, M., Sakti, S., Li, H., and Nakamura, S. (2019). VQVAE unsupervised unit discovery and multi-scale Code2Spec inverter for Zerospeech Challenge 2019.
van den Oord, A., Vinyals, O., and Kavukcuoglu, K. (2017). Neural discrete representation learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 6306–6315. Curran Associates, Inc.
van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Zhao, T., Zhao, R., and Eskenazi, M. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664. Association for Computational Linguistics.
