Journal of Machine Learning Research 22 (2021) 1-35 Submitted 3/20; Revised 10/20; Published 3/21 Attention is Turing Complete Jorge P´erez
[email protected] Department of Computer Science Universidad de Chile IMFD Chile Pablo Barcel´o
[email protected] Institute for Mathematical and Computational Engineering School of Engineering, Faculty of Mathematics Universidad Cat´olica de Chile IMFD Chile Javier Marinkovic
[email protected] Department of Computer Science Universidad de Chile IMFD Chile Editor: Luc de Raedt Abstract Alternatives to recurrent neural networks, in particular, architectures based on self-attention, are gaining momentum for processing input sequences. In spite of their relevance, the com- putational properties of such networks have not yet been fully explored. We study the computational power of the Transformer, one of the most paradigmatic architectures ex- emplifying self-attention. We show that the Transformer with hard-attention is Turing complete exclusively based on their capacity to compute and access internal dense repre- sentations of the data. Our study also reveals some minimal sets of elements needed to obtain this completeness result. Keywords: Transformers, Turing completeness, self-Attention, neural networks, arbi- trary precision 1. Introduction There is an increasing interest in designing neural network architectures capable of learning algorithms from examples (Graves et al., 2014; Grefenstette et al., 2015; Joulin and Mikolov, 2015; Kaiser and Sutskever, 2016; Kurach et al., 2016; Dehghani et al., 2018). A key requirement for any such an architecture is thus to have the capacity of implementing arbitrary algorithms, that is, to be Turing complete. Most of the networks proposed for learning algorithms are Turing complete simply by definition, as they can be seen as a control unit with access to an unbounded memory; as such, they are capable of simulating any Turing machine.