The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs

Wen-Yi Hsiao,1 Jen-Yu Liu,1 Yin-Cheng Yeh,1 Yi-Hsuan Yang2
1Yating Team, Taiwan AI Labs, Taiwan
2Academia Sinica, Taiwan
{wayne391, jyliu, yyeh}, [email protected]

Abstract

To apply neural sequence models such as Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite, pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset time) along the time grid. While different types of tokens may possess different properties, existing models usually treat them equally, in the same way as modeling words in natural languages. In this paper, we present a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types. Further, we propose a new Transformer decoder architecture that uses different feed-forward heads to model tokens of different types. With an expansion-compression trick, we convert a piece of music to a sequence of compound words by grouping neighboring tokens, greatly reducing the length

Figure 1: Illustration of the main ideas of the proposed compound word Transformer: (left) compound word modeling that combines the embeddings (colored gray) of multiple tokens {w_{t-1,k}}_{k=1}^{K}, one for each token type k, at each time step t-1 to form the input x_{t-1} to the self-attention layers, and (right) token type-specific feed-forward heads that predict the list of tokens for the next time step t at once at the output.
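The two ideas in Figure 1 can be sketched in a few lines of code: per-type embeddings are combined into a single input vector, and per-type feed-forward heads produce one set of logits per token type at the output. This is a minimal numpy sketch, not the paper's implementation; the vocabulary names and sizes, the dimensions, and the linear-projection combination are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token-type vocabularies (names and sizes are illustrative).
vocab_sizes = {"type": 4, "pitch": 86, "duration": 17, "velocity": 24}
d_emb = 32    # per-type embedding size (assumed)
d_model = 64  # model (self-attention) dimension (assumed)

# One embedding table per token type k.
emb = {k: rng.normal(size=(v, d_emb)) for k, v in vocab_sizes.items()}
# Projection merging the K concatenated type embeddings into one vector x_{t-1}.
W_in = rng.normal(size=(len(vocab_sizes) * d_emb, d_model))

def compound_input(tokens):
    """Combine the K type-specific token embeddings at one time step."""
    concat = np.concatenate([emb[k][tokens[k]] for k in vocab_sizes])
    return concat @ W_in  # shape: (d_model,)

# One feed-forward head per token type at the output.
W_out = {k: rng.normal(size=(d_model, v)) for k, v in vocab_sizes.items()}

def predict_heads(h):
    """Given a hidden state h for step t-1, return per-type logits for step t."""
    return {k: h @ W_out[k] for k in vocab_sizes}

x = compound_input({"type": 1, "pitch": 60, "duration": 4, "velocity": 10})
logits = predict_heads(x)  # x stands in for the self-attention output here
```

In a full model, `compound_input` would feed a stack of self-attention layers whose output goes to `predict_heads`; the sketch omits the attention stack to isolate the input-combination and multi-head-output structure.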