arXiv:2103.03292v1 [q-bio.BM] 4 Mar 2021

Efficient generative modeling of protein sequences using simple autoregressive models

Jeanne Trinquier,1,2 Guido Uguzzoni,3,4 Andrea Pagnani,3,4,5 Francesco Zamponi,2 and Martin Weigt1,6

1 Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005 Paris, France
2 Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
3 Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
4 Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy
5 INFN Sezione di Torino, Via P. Giuria 1, I-10125 Torino, Italy
6 Email: [email protected]

ABSTRACT

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between $10^2$ and $10^3$). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model’s entropy, the size of the functional sequence space related to a specific protein family. In the case of response regulators, we find a huge number of ca. $10^{68}$ possible sequences, which nevertheless constitute only the astronomically small fraction $10^{-80}$ of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.

arXiv:2103.03292v2 [q-bio.BM] 13 Sep 2021

INTRODUCTION

The impressive growth of sequence databases is prompted by increasingly powerful techniques in data-driven modeling, helping to extract the rich information hidden in raw data. In the context of protein sequences, unsupervised learning techniques are of particular interest: only about 0.25% of the more than 200 million amino-acid sequences currently available in the UniProt database [56] have manual annotations, which can be used for supervised methods.

Unsupervised methods may benefit from evolutionary relationships between proteins: while mutations modify amino-acid sequences, selection keeps their biological functions and their three-dimensional structures remarkably conserved. The Pfam protein family database [20], e.g., lists more than 19,000 families of homologous proteins, offering rich datasets of sequence-diversified but functionally conserved proteins.

In this context, generative statistical models are rapidly gaining interest. The natural sequence variability across a protein family is captured via a probability $P(a_1, \dots, a_L)$ defined for all amino-acid sequences $(a_1, \dots, a_L)$. Sampling from $P(a_1, \dots, a_L)$ can be used to generate new, non-natural amino-acid sequences, which in an ideal case should be statistically indistinguishable from the natural sequences. However, the task of learning $P(a_1, \dots, a_L)$ is highly non-trivial: the model has to assign probabilities to all $20^L$ possible amino-acid sequences. For typical proteins of lengths $L = 50$ to $500$, this amounts to $10^{65}$ to $10^{650}$ values, to be learned from the $M = 10^3$ to $10^6$ sequences contained in most protein families. Selecting adequate generative model architectures is thus of outstanding importance.

The currently best explored generative models for proteins are so-called coevolutionary models [15], such as those constructed by the Direct Coupling Analysis (DCA) [13, 22, 39] (a more detailed review of the state of the art is provided below). They explicitly model the usage of amino acids in single positions (i.e. residue conservation) and correlations between pairs of positions (i.e. residue coevolution). The resulting models are mathematically equivalent to Potts models [36] in statistical physics, or to Boltzmann machines in statistical learning [1]. They have found numerous applications in protein biology.

The effect of amino-acid mutations is predicted via the log-ratio $\log\{P(\mathrm{mutant})/P(\mathrm{wildtype})\}$ between mutant and wildtype probabilities. Strong correlations to mutational effects determined experimentally via deep mutational scanning have been reported [23, 30]. Promising applications are the data-driven design of mutant libraries for protein optimization [11, 12, 41], and the use of Potts models as sequence landscapes in quantitative models of protein evolution [9, 16].

Contacts between residues in the protein fold are extracted from the strongest epistatic couplings between double mutations, i.e. from the direct couplings giving the name to DCA [39]. These couplings are essential input features in the wave of deep-learning (DL) methods, which currently revolutionize the field of protein-structure prediction [25, 46, 58, 61].

The generative implementation bmDCA [22] is able to generate artificial but functional amino-acid sequences [45, 54]. Such observations suggest novel but almost unexplored approaches towards data-driven protein design, which complement current approaches based mostly on large-scale experimental screening of randomized sequence libraries or time-intensive bio-molecular simulation, typically followed by sequence optimization using directed evolution, cf. [31, 33] for reviews.

Here we propose a simple model architecture called arDCA, based on a shallow (one-layer) autoregressive model paired with generalized logistic regression. Such models are computationally very efficient: they can be learned in a few minutes, as compared to days for bmDCA and more involved architectures. Nevertheless, we demonstrate that arDCA provides highly accurate generative models, comparable to the state of the art in mutational-effect and residue-contact prediction. Their simple structure makes them more robust in the case of limited data. Furthermore, and this may have important applications in homology detection [59], our autoregressive models are the only generative models we know about which allow for calculating exact sequence probabilities, and not only non-normalized sequence weights. Thereby arDCA enables the comparison of the same sequence in different models for different protein families. Last but not least, the entropy of arDCA models, which is related to the size of the functional sequence space associated with a given protein family, can be computed much more efficiently than in bmDCA.

Before proceeding, we provide here a short review of the state of the art in generative protein modeling. The literature is extensive and rapidly growing, so we will concentrate on the methods most directly relevant to the scope of our work.

We focus on generative models purely based on sequence data. The sequences belong to homologous protein families, and are given in the form of multiple sequence alignments (MSA), i.e. as a rectangular matrix $\mathcal{D} = (a_i^m \mid i = 1, \dots, L;\ m = 1, \dots, M)$ containing $M$ aligned proteins of length $L$. The entries $a_i^m$ equal either one of the standard 20 amino acids, or the alignment gap “–”. In total, we have $q = 21$ possible different symbols in $\mathcal{D}$. The aim of unsupervised generative modeling is to learn a statistical model $P(a_1, \dots, a_L)$ of (aligned) full-length sequences, which faithfully reflects the variability found in $\mathcal{D}$: sequences belonging to the protein family of interest should have comparably high probabilities, unrelated sequences very small probabilities. Furthermore, a new artificial MSA $\mathcal{D}'$ sampled sequence by sequence from model $P(a_1, \dots, a_L)$ should be statistically and functionally indistinguishable from the natural aligned MSA $\mathcal{D}$ given as input.

A way to achieve this goal is the above-mentioned use of Boltzmann-machine learning based on conservation and coevolution, which leads to pairwise-interacting Potts models, i.e. bmDCA [22], and related methods [8, 52, 57]. An alternative implementation of bmDCA, including a decimation of statistically irrelevant couplings, has been presented in [6] and is the one used as a benchmark in this work; the Mi3 package [27] also provides a GPU-based accelerated implementation.

However, Potts models or Boltzmann machines are not the only generative-model architectures explored for protein sequences. Latent-variable models like Restricted Boltzmann machines [55] or Hopfield-Potts models [48] learn dimensionally reduced representations of proteins; using sequence motifs, they are able to capture groups of collectively evolving residues [44] better than DCA models, but are less accurate in extracting structural information from the learning MSA [48].

An important class of generative models based on latent variables are variational autoencoders (VAE), which achieve dimensional reduction, but in the flexible and powerful setting of deep learning. The DeepSequence implementation [43] was originally designed and tested for predicting the effects of mutations around a given wild type. It currently provides one of the best mutational-effect predictors, and we will show below that arDCA provides comparable quality of prediction for this specific task. The DeepSequence code has been modified in [38] to explore its capacities in generating artificial sequences that are statistically indistinguishable from the natural MSA; it was shown that its performance was substantially less accurate than that of bmDCA. Another implementation of a VAE was reported in [29]; also in this case the generative performances are inferior to bmDCA, but the organization of latent variables was shown to carry significant information on functionality. Furthermore, some generated mutant sequences were successfully tested experimentally. Interestingly, it was also shown that learning VAE on unaligned sequences decreases the performance as compared to pre-aligned MSA as used by all before-mentioned models. This observation was complemented by Ref. [14], which reported a VAE implementation trained on non-aligned sequences from UniProt, with length $10 < L < 1000$. The VAE had good reconstruction accuracy for small $L < 200$, which however dropped significantly for larger $L$. The latent space also in this case shows an interesting organization in terms of function, which was used to generate in silico proteins with desired properties, but no experimental test was provided. The paper does not report any statistical test of the generative properties (such as a Pearson correlation of two-point correlations), and the code is not yet publicly available, which makes a quantitative comparison to our results currently impossible.

Another interesting DL architecture is that of a Generative Adversarial Network (GAN), which was explored in [42] on a single family of aligned homologous sequences. While the model has a very large number of trainable parameters ($\sim 60$M), it seems to reproduce well the statistics of the training MSA, and most importantly, the authors could generate an enzyme with only 66%

[Figure 1, schematic. Data: MSA of homologous sequences $\{(a_1^m, \dots, a_L^m)\}_{m=1,\dots,M}$. Maximum likelihood yields the autoregressive model $P(a_1, \dots, a_L) = \prod_i P(a_i | a_{i-1}, \dots, a_1)$. Three applications: mutational prediction (single-site mutation $a_i \to b_i$, scored by $\Delta E(a_i \to b_i)$); contact prediction (double mutation $a_i \to b_i$, $a_j \to b_j$, scored by $\Delta E(a_i \to b_i, a_j \to b_j)$); sequence generation (sampling from $P(a_1, \dots, a_L)$, giving $\{(b_1^m, \dots, b_L^m)\}_{m=1,\dots,M'}$).]

FIG. 1. Schematic representation of the arDCA approach: Starting from an MSA of homologous sequences, we use maximum-likelihood inference to learn an autoregressive model, which factorizes the joint sequence probability $P(a_1, \dots, a_L)$ into conditional single-residue probabilities $P(a_i | a_{i-1}, \dots, a_1)$. Defining the statistical energy $E(a_1, \dots, a_L) = -\log P(a_1, \dots, a_L)$ of a sequence, we consequently predict mutational effects and contacts as statistical energy changes when substituting residues individually or in pairs, and we design new sequences by sampling from $P(a_1, \dots, a_L)$.

identity to the closest natural one, which was still found to be functional in vitro. An alternative implementation of the same architecture was presented in [2], and applied to the design of antibodies; also in this case the resulting sequences were validated experimentally.

Not all generative models for proteins are based on sequence ensembles. Several research groups explored the possibility of generating sequences with given three-dimensional structure [3, 32, 34], e.g. via a VAE [26] or a Graph Neural Network [51], or by inverting structural prediction models [4, 21, 37, 40]. It is important to stress that this is a very different task from ours (our work does not use structure), so it is difficult to perform a direct comparison between our work and these approaches. It would be interesting to explore, in future work, the possibility of unifying the different approaches and using sequence and structure jointly for constructing improved generative models.

In summary, for the specific task of interest here, namely generating an artificial MSA statistically indistinguishable from the natural one, one can take as reference models bmDCA [6, 22] in the context of Potts-model-like architectures, and DeepSequence [43] in the context of deep networks. We will show in the following that arDCA performs comparably to bmDCA, and better than DeepSequence, at strongly reduced computational cost. From anecdotal evidence in the works mentioned above, and in agreement with general observations in machine learning, it appears that deep architectures may be more powerful than shallow architectures, provided that very large datasets and computational resources are available [43]. Indeed, we will show that for the related task of single-mutation predictions around a wild type, DeepSequence outperforms arDCA on rich datasets, while the inverse is true on small datasets.

RESULTS

Autoregressive models for protein families

Here we propose a computationally efficient approach based on autoregressive models. We start from the exact decomposition

$$P(a_1, \dots, a_L) = P(a_1) \cdot P(a_2 | a_1) \cdots P(a_L | a_{L-1}, \dots, a_1) \tag{1}$$

of the joint probability of a full-length sequence into a product of more and more involved conditional probabilities $P(a_i | a_{i-1}, \dots, a_1)$ of the amino acids $a_i$ in single positions, conditioned on all previously seen positions $a_{i-1}, \dots, a_1$. While this decomposition is a direct consequence of Bayes’ theorem, it suggests an important change in viewpoint on generative models: while learning the full $P(a_1, \dots, a_L)$ from the input MSA $\mathcal{D}$ is a task of unsupervised learning (sequences are not labeled), learning the factors $P(a_i | a_{i-1}, \dots, a_1)$ becomes a task of supervised learning, with $(a_{i-1}, \dots, a_1)$ being the input (feature) vector, and $a_i$ the output label (in our case a categorical $q$-state label). We can thus build on the full power of supervised learning, which is methodologically more explored than unsupervised learning [10, 24, 28].

In this work, we choose the following parameterization, previously used in the context of statistical mechanics of classical [60] and quantum [47] systems:

$$P(a_i | a_{i-1}, \dots, a_1) = \frac{\exp\left\{ h_i(a_i) + \sum_{j=1}^{i-1} J_{ij}(a_i, a_j) \right\}}{z_i(a_{i-1}, \dots, a_1)} \tag{2}$$

with $z_i(a_{i-1}, \dots, a_1) = \sum_{a_i} \exp\left\{ h_i(a_i) + \sum_{j=1}^{i-1} J_{ij}(a_i, a_j) \right\}$ being a normalization factor. In machine learning, this parameterization is known as soft-max regression, the generalization of logistic regression to multi-class labels [28].
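As an illustration of Eqs. (1) and (2), the following is a minimal sketch in plain Python, not the authors' implementation. It assumes fields stored as `h[i][a]` and directed couplings as `J[i][j][a][b]` for `j < i`; the function names and the `random_model` helper, which fills the parameters with random numbers instead of learned values, are hypothetical and serve only to make the example runnable.

```python
import math
import random

def conditional(i, prefix, h, J):
    """Soft-max conditional P(a_i = a | prefix) for all q states a, as in Eq. (2).

    h[i][a] is a field, J[i][j][a][b] a directed coupling (defined for j < i only),
    prefix is the list of already fixed states a_1, ..., a_{i-1} (0-based).
    """
    q = len(h[i])
    logits = [h[i][a] + sum(J[i][j][a][prefix[j]] for j in range(i))
              for a in range(q)]
    m = max(logits)                          # shift by the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)                         # local normalization z_i (up to the shift)
    return [w / z for w in weights]

def log_prob(seq, h, J):
    """Exact log P(a_1, ..., a_L): a sum of L conditional log-probabilities, Eq. (1)."""
    return sum(math.log(conditional(i, seq[:i], h, J)[seq[i]])
               for i in range(len(seq)))

def sample(h, J, rng):
    """Draw one sequence position by position, following the factorization in Eq. (1)."""
    seq = []
    for i in range(len(h)):
        p = conditional(i, seq, h, J)
        seq.append(rng.choices(range(len(p)), weights=p)[0])
    return seq

def random_model(L, q, rng):
    """Hypothetical helper: random parameters standing in for learned ones."""
    h = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(L)]
    J = [[[[rng.gauss(0.0, 0.5) for _ in range(q)] for _ in range(q)]
          for _ in range(i)] for i in range(L)]
    return h, J

if __name__ == "__main__":
    rng = random.Random(0)
    h, J = random_model(L=4, q=3, rng=rng)
    s = sample(h, J, rng)
    print(s, log_prob(s, h, J))
```

Because every conditional is normalized over only $q$ states, the probabilities of all $q^L$ sequences sum to one by construction, which can be checked by brute force for small $L$ and $q$.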
This choice, as detailed in the section Methods, enables a particularly efficient parameter learning by likelihood maximization, and leads to a speedup of 2-3 orders of magnitude over bmDCA, as is reported in Table I. Because the resulting model is parameterized by a set of fields $h_i(a)$ and couplings $J_{ij}(a, b)$ as in DCA, we dub our method arDCA.

Besides comparing the performance of this model to bmDCA and DeepSequence, we will also use simple “fields-only” models, also known as profile models or independent-site models. In these models, the joint probability of all positions in a sequence factorizes over all positions, $P(a_1, \dots, a_L) = \prod_{i=1,\dots,L} f_i(a_i)$, without any conditioning on the sequence context. Using maximum-likelihood inference, each factor $f_i(a_i)$ equals the empirical frequency of amino acid $a_i$ in column $i$ of the input MSA $\mathcal{D}$.

A few remarks are needed.

Eq. (2) has striking similarities to standard DCA [13], but also important differences. The two have exactly the same number of parameters, but their meaning is quite different. While DCA has symmetric couplings $J_{ij}(a, b) = J_{ji}(b, a)$, the parameters in Eq. (2) are directed and describe the influence of site $j$ on site $i$ for $j < i$ only, i.e. only one triangular part of the $J$-matrix is filled.

The inference in arDCA is very similar to plmDCA [19], i.e. to DCA based on pseudo-likelihood maximization [5]. In particular, both in arDCA and plmDCA the gradient of the likelihood can be computed exactly from the data, while in bmDCA it has to be estimated via Markov chain Monte Carlo (MCMC), which requires the introduction of additional hyperparameters (such as the number of chains, the mixing time, etc.) that can have an important impact on the quality of the inference, see [17] for a recent detailed study.

In plmDCA each $a_i$ is, however, conditioned on all other $a_j$ in the sequence, and not only on partial sequences. The resulting directed couplings are usually symmetrized akin to standard Potts models. On the contrary, the $J_{ij}(a, b)$ that appear in arDCA cannot be interpreted as “direct couplings” in the DCA sense, cf. below for details on arDCA-based contact prediction. However, plmDCA has limited capacities as a generative model [22]: symmetrization moves parameters away from their maximum-likelihood value, probably causing a loss in model accuracy. No such symmetrization is needed for arDCA.

arDCA, contrary to all other DCA methods, allows for calculating the probabilities of single sequences. In bmDCA, we can only determine sequence weights, but the normalizing factor, i.e. the partition function, remains inaccessible for exact calculations; expensive thermodynamic integration via MCMC sampling is needed to estimate it. The conditional probabilities in arDCA are individually normalized; instead of summing over $q^L$ sequences we need to sum $L$ times over the $q$ states of individual amino acids. This may turn out to be a major advantage when the same sequence is to be compared in different models, as in homology detection and protein family assignment [18, 49], cf. the example given below.

The ansatz in Eq. (2) can be generalized to more complicated relations. We have tested a two-layer architecture, but did not observe advantages over the simple soft-max regression, as will be discussed at the end of the paper.

Thanks, in particular, to the possibility of calculating the gradient exactly, arDCA models can be inferred much more efficiently than bmDCA models. Typical inference times are given in Table I for five representative families, and show a speedup of about 2-3 orders of magnitude with respect to the bmDCA implementation of [6], both running on a single Intel Xeon E5-2620 v4 2.10GHz CPU. We also tested the Mi3 package [27], which is able to learn similar bmDCA models in a time of about 60 minutes for the PF00014 family and 900 minutes for the PF00595 family, while running on two TITAN RTX GPUs, thus remaining much more computationally demanding than arDCA.

The positional order matters

Eq. (1) is valid for any order of the positions, i.e. for any permutation of the natural positional order in the amino-acid sequences. This is no longer true when we parameterize the $P(a_i | a_{i-1}, \dots, a_1)$ according to Eq. (2). Different orders may give different results. In the supplementary Section S1 we show that the likelihood depends on the order, and that we can optimize over orders. We also find that the best orders are correlated to the entropic order, where we select first the least entropic, i.e. most conserved, variables, progressing successively towards the most variable positions of highest entropy. The site entropy $s_i = -\sum_a f_i(a) \log f_i(a)$ can be directly calculated from the empirical amino-acid frequencies $f_i(a)$ of all amino acids $a$ in site $i$.

Because the optimization over the possible $L!$ site orderings is very time consuming, we use the entropic order as a practical heuristic choice. In all our tests, described in the next sections, the entropic order does not perform significantly worse than the best optimized order we found.

A close-to-entropic order is also attractive from the point of view of interpretation. The most conserved sites come first. If the amino acid on those sites is the most frequent one, basically no information is transmitted further. If, however, a sub-optimal amino acid is found in a conserved position, this has to be compensated by other mutations, i.e. necessarily by more variable (more entropic) positions. Also the fact that variable positions come last, and are modeled as depending on all other amino acids, is well interpretable: these positions, even if highly variable, are not necessarily unconstrained, but they can be used to finely tune the sequence to any sub-optimal choices made in earlier positions.

                         Cij                                Cijk                               entropy          t/min
Family     L    M        arDCA   arDCA   bmDCA   DeepSeq    arDCA   arDCA   bmDCA   DeepSeq    arDCA   bmDCA    arDCA   bmDCA
                         (ent.)  (dir.)                     (ent.)  (dir.)
PF00014    53   13600    0.97    0.96    0.95    0.81       0.84    0.82    0.83    0.80       1.2     1.5      1       204
PF00076    70   137605   0.97    0.97    0.97    0.84       0.78    0.76    0.85    0.77       1.6     1.7      19      2088
PF00595    80   36690    0.96    0.95    0.97    0.93       0.87    0.87    0.92    0.65       1.2     1.5      8       4003
PF00072    112  823798   0.96    0.96    0.93    0.95       0.89    0.88    0.88    0.92       1.4     1.8      9       1489
PF13354    202  7515     0.97    0.96    0.95    0.95       0.93    0.91    0.92    0.92       0.9     1.2      10      3905

TABLE I. The table summarizes the data used (protein families, sequence lengths L and numbers M of sequences), together with the Pearson correlations between empirical and model-generated connected correlations Cij and Cijk, for bmDCA, for arDCA using entropic (ent.) or direct (dir.) positional orders, and for DeepSequence. The entropies/site and computational running times in minutes for model learning (on a single Intel Xeon E5-2620 v4 2.10GHz CPU) are also provided for arDCA and bmDCA. Best values for each measure are highlighted. Similar results for the 32 protein families with deep-mutational scanning data are given in the Supplementary Table S1.
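The Cij columns of Table I report a Pearson correlation between connected two-point correlations, $C_{ij}(a, b) = f_{ij}(a, b) - f_i(a) f_j(b)$, estimated once from the natural MSA and once from a model-generated sample. A minimal sketch of such a comparison in plain Python is given below; it assumes unweighted frequencies, and the function names and toy alignments are hypothetical, so it illustrates the quantity rather than the pipeline used to produce the table.

```python
import math
from collections import Counter

def connected_correlations(msa, alphabet):
    """C_ij(a, b) = f_ij(a, b) - f_i(a) f_j(b) for all i < j and all a, b in alphabet.

    Values are returned in a fixed order so that two alignments can be compared.
    """
    M, L = len(msa), len(msa[0])
    f1 = [Counter(seq[i] for seq in msa) for i in range(L)]     # single-site counts
    values = []
    for i in range(L):
        for j in range(i + 1, L):
            f2 = Counter((seq[i], seq[j]) for seq in msa)       # pair counts
            for a in alphabet:
                for b in alphabet:
                    values.append(f2[(a, b)] / M - (f1[i][a] / M) * (f1[j][b] / M))
    return values

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    vx = sum((u - mx) ** 2 for u in x)
    vy = sum((v - my) ** 2 for v in y)
    return cov / math.sqrt(vx * vy)

if __name__ == "__main__":
    natural = ["AA", "CC", "AA", "CC"]     # toy data: perfectly correlated columns
    generated = ["AC", "CA", "AC", "CA"]   # toy data: perfectly anti-correlated columns
    c_nat = connected_correlations(natural, "AC")
    c_gen = connected_correlations(generated, "AC")
    print(pearson(c_nat, c_gen))
```

In this toy case both alignments have identical single-site frequencies but opposite pair statistics, so the correlation is strongly negative; a good generative model would instead give values close to 1, as in Table I.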

For this reason, all coming tests are done using increasing entropic order, i.e. with sites ordered before model learning by increasing empirical $s_i$ values. Supplementary Figs. S1-3 show a comparison with alternative orderings, such as the direct one (from 1 to $L$), several random ones, and the optimized one, cf. also Table I for some results.

arDCA provides accurate generative models

To check the generative property of arDCA, we compare it with bmDCA [22], i.e. the most accurate generative version of DCA obtained via Boltzmann machine learning. bmDCA was previously shown to be generative not only in a statistical sense, but also in a biological one: sequences generated by bmDCA were shown to be statistically indistinguishable from natural ones, and most importantly, functional in vivo for the case of chorismate mutase enzymes [45]. We also compare the generative property of arDCA with DeepSequence [38, 43] as a prominent representative of deep generative models.

To this aim, we compare the statistical properties of natural sequences with those of independently and identically distributed (i.i.d.) samples drawn from the different generative models $P(a_1, \dots, a_L)$. At this point, another important advantage of arDCA comes into play: while generating i.i.d. samples from, e.g., a Potts model requires MCMC simulations, which in some cases may have very long decorrelation times and thus become tricky and computationally expensive [6, 17] (cf. also supplementary Section S2 and Fig. S4), drawing a sequence from the arDCA model $P(a_1, \dots, a_L)$ is very simple and does not require any additional parameter. The factorized expression Eq. (1) allows for sampling amino acids position by position, following the chosen positional order, cf. the detailed description in supplementary Section S2.

Figs. 2a-c show the comparison of the one-point amino-acid frequencies $f_i(a)$, and the connected two- and three-point correlations

$$C_{ij}(a, b) = f_{ij}(a, b) - f_i(a) f_j(b), \tag{3}$$
$$C_{ijk}(a, b, c) = f_{ijk}(a, b, c) - f_{ij}(a, b) f_k(c) - f_{ik}(a, c) f_j(b) - f_{jk}(b, c) f_i(a) + 2 f_i(a) f_j(b) f_k(c),$$

of the data with those estimated from a sample of the arDCA model. Results are shown for the response-regulator Pfam family PF00072 [20]. Other proteins are shown in Table I and supplementary Section S3, Figs. S5-6. We find that, for these observables, the empirical and model averages coincide very well, equally well or even slightly better than for the bmDCA case. In particular for the one- and two-point quantities this is quite surprising: while bmDCA fits them explicitly, i.e. any deviation is due to imperfect fitting of the model, arDCA does not fit them explicitly, and nevertheless obtains higher precision.

In Table I, we also report the results for sequences sampled from DeepSequence [43]. While its original implementation aims at scoring individual mutations, cf. Section Predicting mutational effects via in-silico deep mutational scanning, we apply the modification of Ref. [38] allowing for sequence sampling. We observe that for most families, the two- and three-point correlations of the natural data are significantly less well reproduced by DeepSequence than by both DCA implementations, confirming the original findings of [38]. Only in the largest family, PF00072 with more than 800,000 sequences, does DeepSequence reach comparable or, in the case of the three-point correlations, even superior performance.

A second test of the generative property of arDCA is given by Figs. 2d-g. Panel d shows the natural sequences projected onto their first two principal components (PC). The other three panels show generated data projected onto the same two PCs of the natural data. We see that both arDCA and bmDCA reproduce quite well the clustered structure of the response-regulator sequences (both show a slightly broader distribution than the natural data, probably due to the regularized inference of the statistical models). On the contrary, sequences generated by a profile model $P_{\mathrm{prof}}(a_1, \dots, a_L) = \prod_i f_i(a_i)$ assuming independent sites do not show any clustered structure: the projections are concentrated around the origin in PC space. This indicates that their variability is almost unrelated to the first two principal components of the natural sequences.

FIG. 2. Generative properties of arDCA for PF00072: Panels a-c compare the single-site frequencies $f_i(a)$ and two- and three-site connected correlations $C_{ij}(a, b)$ and $C_{ijk}(a, b, c)$ found in the sequence data and samples from models, for arDCA (blue) and bmDCA (red). Panels d-g show different samples projected onto the first two principal components of the natural data. Datasets are the natural MSA (d) and samples from arDCA (e), bmDCA (f) and a profile model (g). Results for other protein families are shown in the Supplementary Figs. S5-S6.

From these observations, we conclude that arDCA provides excellent generative models, of at least the same accuracy as bmDCA. This suggests fascinating perspectives in terms of data-guided statistical sequence design: if sequences generated from bmDCA models are functional, arDCA-sampled sequences should also be functional. But this is obtained at much lower computational cost, cf. Table I, and without the need to check for convergence of MCMC, which makes the method scalable to much bigger proteins.

Predicting mutational effects via in-silico deep mutational scanning

The probability of a sequence is a measure of its goodness. For high-dimensional probability distributions, it is generally convenient to work with log-probabilities. Using inspiration from statistical physics, we introduce a statistical energy

$$E(a_1, \dots, a_L) = -\log P(a_1, \dots, a_L) \tag{4}$$

as the negative log-probability. We thus expect functional sequences to have very low statistical energies, while unrelated sequences show high energies. In this sense, statistical energy can be seen as a proxy of (negative) fitness. Note that in the case of arDCA, the statistical energy is not a simple sum over the model parameters as in DCA, but contains also the logarithms of the local partition functions $z_i(a_{i-1}, \dots, a_1)$, cf. Eq. (2).

Now, we can easily compare two sequences differing by one or a few mutations. For a single mutation $a_i \to b_i$, where amino acid $a_i$ in position $i$ is substituted with amino acid $b_i$, we can determine the statistical-energy difference

$$\Delta E(a_i \to b_i) = -\log \frac{P(a_1, \dots, a_{i-1}, b_i, a_{i+1}, \dots, a_L)}{P(a_1, \dots, a_{i-1}, a_i, a_{i+1}, \dots, a_L)}. \tag{5}$$

If negative, the mutant sequence has lower statistical energy; the mutation $a_i \to b_i$ is thus predicted to be beneficial. On the contrary, a positive $\Delta E$ predicts a deleterious mutation. Note that, even if not explicitly stated on the left-hand side of Eq. (5), the mutational score $\Delta E(a_i \to b_i)$ depends on the whole sequence background $(a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_L)$ it appears in, i.e. on all other amino acids $a_j$ in all positions $j \neq i$.

It is now easy to perform an in-silico deep mutational scan, i.e. to determine all mutational scores $\Delta E(a_i \to b_i)$ for all positions $i = 1, \dots, L$ and all target amino acids $b_i$ relative to some reference sequence. In Fig. 3a, we compare our predictions with experimental data over more than 30 distinct experiments and wildtype proteins, and with state-of-the-art mutational-effect predictors. These contain in particular the predictions using plmDCA (aka evMutation [30]), variational autoencoders (DeepSequence [43]), evolutionary distances between wildtype and the closest homologs showing the

considered mutation (GEMME [35]); all of these methods take, in technically different ways, the context dependence of mutations into account. We also compare it to the context-independent prediction using the above-mentioned profile models.

FIG. 3. Prediction of mutational effects by arDCA: Panel a shows the Spearman rank correlation between results of 32 deep-mutational scanning experiments and various computational predictions. We compare arDCA with profile models, plmDCA (aka evMutation [30]), DeepSequence [43], and GEMME [35], which currently are considered the state of the art. Detailed information about the datasets and the generative properties of arDCA on these datasets are provided in the Supplementary Sec. S4. Panel b shows a more detailed comparison between arDCA and DeepSequence; the symbol size is proportional to the sequence number in the training MSA for prokaryotic and eukaryotic datasets (blue dots). Viral datasets are indicated by red squares.

It can be seen that the context-dependent predictors systematically outperform the context-independent predictor, in particular for large MSA in prokaryotic and eukaryotic proteins. The four context-dependent models perform in a very similar way. There is a small but systematic disadvantage for plmDCA, which was the first published predictor of the ones considered here.

The situation is different in the typically smaller and less diverged viral protein families. In this case, DeepSequence, which relies on data-intensive deep learning, becomes unstable. It also becomes harder to outperform profile models, e.g. plmDCA does not achieve this. arDCA performs similarly or, in one out of four cases, substantially better than the profile model.

To go into more detail, we have compared more quantitatively the predictions of arDCA and DeepSequence, currently considered as the state-of-the-art mutational predictor. In Fig. 3b, we plot the performance of the two predictors against each other, with the symbol size being proportional to the number of sequences in the training MSA of natural homologs. Almost all dots are close to the diagonal (apart from a few viral datasets), with 15/32 datasets having a better arDCA prediction, and 17/32 giving an advantage to DeepSequence. The figure also shows that arDCA tends to perform better on smaller datasets, while DeepSequence takes over on larger datasets. In supplementary Fig. S7, we have also measured the correlations between the two predictors. Across all prokaryotic and eukaryotic datasets, the two show high correlations in the range of 82%–95%. These values are larger than the correlations between predictions and experimental results, which are in the range of 50%–60% for most families. This observation illustrates that both predictors extract a highly similar signal from the original MSA, but this signal may be quite different from the experimentally measured phenotype. Many experiments actually provide only rough proxies for protein fitness, e.g. protein stability or ligand-binding affinity. To what extent such variable underlying phenotypes can be predicted by unsupervised learning based on homologous MSA thus remains an open question.

We thus conclude that arDCA permits a fast and accurate prediction of mutational effects, in line with some of the state-of-the-art predictors. It systematically outperforms profile models and plmDCA, and is more stable than DeepSequence in the case of limited datasets. This observation, together with the better computational efficiency of arDCA, suggests that DeepSequence should be used for predicting mutational effects for individual proteins represented by very large homologous MSA, while arDCA is the method of choice for large-scale studies (many proteins) or small families. GEMME, based on phylogenetic information, astonishingly performs very similarly to arDCA, even if the information taken into account seems different.

Extracting epistatic couplings and predicting residue-residue contacts

The best-known application of DCA is the prediction of residue-residue contacts via the strongest direct couplings [39]. As argued before, the arDCA parameters are not directly interpretable in terms of direct couplings. To predict contacts using arDCA, we need to go back to the biological interpretation of DCA couplings: they represent epistatic couplings between pairs of mutations [50]. For a double mutation $a_i \to b_i$, $a_j \to b_j$, epistasis is defined by comparing the effect of the double mutation with the sum of the effects of the single mutations, when introduced individually into the wildtype background:

$$
\Delta\Delta E(b_i, b_j) = \Delta E(a_i \to b_i, a_j \to b_j) - \Delta E(a_i \to b_i) - \Delta E(a_j \to b_j) , \tag{6}
$$

where the $\Delta E$ in arDCA are defined in analogy to Eq. (5). The epistatic effect $\Delta\Delta E(b_i, b_j)$ provides an effective direct coupling between amino acids $b_i$, $b_j$ in sites $i$, $j$. In standard DCA, $\Delta\Delta E(b_i, b_j)$ is actually given by the direct coupling $J_{ij}(b_i, b_j) - J_{ij}(b_i, a_j) - J_{ij}(a_i, b_j) + J_{ij}(a_i, a_j)$ between sites $i$ and $j$.

For contact prediction, we can treat these effective couplings in the standard way (compute the Frobenius norm in zero-sum gauge, apply the average product correction, cf. supplementary Sec. S5 for details). The results are represented in Fig. 4 (cf. also supplementary Figs. S8-10). The contact maps predicted by arDCA and bmDCA are very similar, and both capture very well the topological structure of the native contact map. The arDCA method gives in this case a few more false positives, resulting in a slightly lower positive predictive value (panel c). However, note that the majority of the false positives for both predictors are concentrated in the upper right corner of the contact maps, in a region where the largest subfamily of response-regulator domains, characterized by the coexistence with a Trans_reg_C DNA-binding domain (PF00486) in the same protein, has a homo-dimerization interface.

One difference should be noted: for arDCA, the definition of effective couplings via epistatic effects depends on the reference sequence $(a_1, \dots, a_L)$, in which the mutations are introduced; this is not the case in DCA. So, in principle, each sequence might give a different contact prediction, and accurate contact prediction in arDCA might require a computationally heavy averaging over a large ensemble of background sequences. Fortunately, as we have checked, the predicted contacts hardly depend on the reference sequence chosen. It is therefore possible to take any arbitrary reference sequence belonging to the homologous family, and determine epistatic couplings relative to this single sequence. This observation yields an enormous speedup by a factor $M$, with $M$ being the depth of the MSA of natural homologs.

The aim of this section was to compare the performance of arDCA in contact prediction with established methods using exactly the same data, i.e. a single MSA of the considered protein family. We have chosen bmDCA for consistency with the rest of the paper, but apart from small quantitative differences, the conclusions remain unchanged when looking at DCA variants based on mean-field or pseudo-likelihood approximations, cf. supplementary Fig. S9. The recent success of deep-learning-based contact prediction has shown that the performance can be substantially improved if coevolution-based contact prediction for thousands of families is combined with supervised learning based on known protein structures, as done by popular methods like RaptorX, DeepMetaPSICOV, AlphaFold or trRosetta [25, 46, 58, 61]. We expect that the performance of arDCA could equally be boosted by supervised learning, but this goes clearly beyond the scope of our work, which concentrates on generative modeling.
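To make the quantities of Eqs. (4)-(6) concrete, the following is a minimal sketch on a toy autoregressive model, not the authors' implementation. It assumes softmax conditionals built from fields and couplings to earlier positions, as in the normalization of Eq. (9); the alphabet size `Q = 3`, the length `L = 4`, the random parameters and all function names are hypothetical choices made only for illustration (the paper uses 21 states per site and parameters inferred from an MSA).

```python
import itertools
import math
import random

Q = 3  # toy alphabet size (assumption; the paper uses 21 states per site)
L = 4  # toy sequence length (assumption)


def random_model(seed=0):
    """Draw hypothetical fields h[i][a] and couplings J[i][j][a][b] for j < i."""
    rng = random.Random(seed)
    h = [[rng.gauss(0.0, 1.0) for _ in range(Q)] for _ in range(L)]
    J = [[[[rng.gauss(0.0, 1.0) for _ in range(Q)] for _ in range(Q)]
          for _ in range(i)] for i in range(L)]
    return h, J


def log_conditionals(h, J, prefix):
    """Return log P(a_i = c | prefix) for every state c, where i = len(prefix)."""
    i = len(prefix)
    logits = [h[i][c] + sum(J[i][j][c][prefix[j]] for j in range(i))
              for c in range(Q)]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))  # log z_i
    return [x - log_z for x in logits]


def energy(h, J, seq):
    """Statistical energy E = -log P(seq), cf. Eq. (4)."""
    return -sum(log_conditionals(h, J, seq[:i])[seq[i]] for i in range(len(seq)))


def delta_e(h, J, seq, i, b):
    """Single-mutation score of Eq. (5): E(mutant) - E(reference)."""
    mutant = list(seq)
    mutant[i] = b
    return energy(h, J, mutant) - energy(h, J, seq)


def epistasis(h, J, seq, i, bi, j, bj):
    """Double-mutant epistasis of Eq. (6) for two distinct sites i and j."""
    double = list(seq)
    double[i], double[j] = bi, bj
    dd = energy(h, J, double) - energy(h, J, seq)
    return dd - delta_e(h, J, seq, i, bi) - delta_e(h, J, seq, j, bj)


def sample(h, J, rng):
    """Draw one sequence position by position from the conditionals."""
    seq = []
    for _ in range(L):
        probs = [math.exp(lp) for lp in log_conditionals(h, J, seq)]
        seq.append(rng.choices(range(Q), weights=probs)[0])
    return seq


def total_probability(h, J):
    """Sum exp(-E) over all Q**L sequences; equals 1 up to rounding."""
    return sum(math.exp(-energy(h, J, list(s)))
               for s in itertools.product(range(Q), repeat=L))
```

Because the scores depend on the whole reference sequence, `delta_e` and `epistasis` take the reference `seq` explicitly; an in-silico deep mutational scan would simply loop `delta_e` over all positions and target states of one reference.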

FIG. 4. Prediction of residue-residue contacts by arDCA as compared to bmDCA. Panels a and b show the true (black, upper triangle) and predicted (lower triangle) contact maps for PF00072, with blue (red) dots indicating true (false) positive predictions. Panel c shows the positive predictive values (PPV, fraction of true positives in the first predictions) as a function of the number of predictions.
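The PPV curve described for panel c can be sketched as follows. This is an illustrative helper under stated assumptions, not code from the paper: `ppv_curve`, the dictionary of pair scores and the set of true contacts are hypothetical names and inputs, and any filtering of residue pairs is assumed to have been done by the caller.

```python
def ppv_curve(scores, true_contacts):
    """Return PPV after the first n predictions, for n = 1 .. number of pairs.

    scores: dict mapping a residue pair (i, j) to its predicted contact score.
    true_contacts: set of residue pairs (i, j) that are contacts in the structure.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # strongest score first
    curve, hits = [], 0
    for n, pair in enumerate(ranked, start=1):
        hits += pair in true_contacts  # counts 1 for a true positive
        curve.append(hits / n)
    return curve


# Toy example with four scored pairs, two of which are true contacts.
toy_scores = {(0, 5): 0.9, (1, 7): 0.8, (2, 9): 0.3, (3, 8): 0.1}
toy_truth = {(0, 5), (2, 9)}
toy_curve = ppv_curve(toy_scores, toy_truth)
```

Entry `n - 1` of the returned list is the fraction of true positives among the `n` highest-scoring pairs, which is the quantity plotted against the number of predictions.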

Estimating the size of a family's sequence space

The MSA of natural sequences contains only a tiny fraction of all sequences that would have the functional properties characterizing a protein family under consideration, i.e. which might be found in newly sequenced species or be reached by natural evolution. Estimating this number $\mathcal{N}$ of possible sequences, or their entropy $S = \log \mathcal{N}$, is quite complicated in the context of DCA-type pairwise Potts models. It requires advanced sampling techniques [7, 53].

In arDCA, we can explicitly calculate the sequence probability $P(a_1, \dots, a_L)$. We can therefore estimate the entropy of the corresponding protein family via

$$
\begin{aligned}
S &= -\sum_{a_1, \dots, a_L} P(a_1, \dots, a_L) \log P(a_1, \dots, a_L) \\
&= \langle E(a_1, \dots, a_L) \rangle_P ,
\end{aligned}
\tag{7}
$$

where the second line uses Eq. (4). The ensemble average $\langle \cdot \rangle_P$ can be estimated via the empirical average over a large sequence sample drawn from $P$. As discussed before, extracting i.i.d. samples from arDCA is particularly simple due to their particular factorized form.

Results for the protein families studied here are given in Table I. As an example, the entropy density equals $S/L = 1.4$ for PF00072. This corresponds to $\mathcal{N} \sim 1.25 \cdot 10^{68}$ sequences. While being an enormous number, it constitutes only a tiny fraction of all $q^L \sim 1.23 \cdot 10^{148}$ possible sequences of length $L = 112$. Interestingly, the entropies estimated using bmDCA are systematically higher than those of arDCA. On the one hand, this is no surprise: both reproduce accurately the empirical one- and two-residue statistics, but bmDCA is a maximum entropy model, which maximizes the entropy given these statistics [13]. On the other hand, our observation implies that the effective multi-site couplings in $E(a_1, \dots, a_L)$ resulting from the local partition functions $z_i(a_{i-1}, \dots, a_1)$ lead to a non-trivial entropy reduction.

DISCUSSION

We have presented a class of simple autoregressive models, which provide highly accurate and computationally very efficient generative models for protein-sequence families. While being of comparable or even superior performance to bmDCA across a number of tests including the sequence statistics, the sequence distribution in dimensionally reduced principal-component space, the prediction of mutational effects and residue-residue contacts, arDCA is computationally much more efficient than bmDCA. The particular factorized form of autoregressive models allows for exact likelihood maximization.

It also allows for the calculation of exact sequence probabilities (instead of sequence weights for Potts models). This fact is of great potential interest in homology detection using coevolutionary models, which requires comparing probabilities of the same sequence in distinct models corresponding to distinct protein families. To illustrate this idea in a simple, but instructive case, we have identified two subfamilies of the PF00072 protein family of response regulators. The first subfamily is characterized by the existence of a DNA-binding domain of the Trans_reg_C protein family (PF00486), the second by a DNA-binding domain of the GerE protein family (PF00196). For each of the two subfamilies, we have randomly extracted 6,000 sequences used to train sub-family specific profile and arDCA models, with $P_1$ being the model for the Trans_reg_C and $P_2$ for the GerE sub-family. Using the log-odds ratio $\log\{P_1(\mathrm{seq})/P_2(\mathrm{seq})\}$ to score all remaining sequences from the two subfamilies, the profile model was able to assign 98.6% of all sequences to the correct sub-family, and 1.4% to the wrong one. arDCA has improved this to 99.7% of correct, and only 0.3% of incorrect assignments, reducing the grey zone in sub-family assignment by a factor 3-4. Furthermore, some of the false assignments of the profile model had quite large scores, cf. the histograms in supplementary Fig. S11, while the false annotations of the arDCA model had scores closer to zero. Therefore, if we consider that a prediction is reliable only if there are no wrong predictions for a larger log-odds ratio score, then the score of arDCA is 97.5% while the one of the profile model is only 63.7%.

The importance of accurate generative models also becomes visible via our results on the size of sequence space (or sequence entropy). For the response regulators used as example throughout the paper (and similar observations are true for all other protein families we analyzed), we find that "only" about $10^{68}$ out of all possible $10^{148}$ amino-acid sequences of the desired length are compatible with the arDCA model, and thus suspected to have the same functionality and the same 3D structure as the proteins collected in the Pfam MSA. This means that a random amino-acid sequence has a probability of about $10^{-80}$ to be actually a valid response-regulator sequence. This number is literally astronomically small, corresponding to the probability of hitting one particular atom when selecting randomly among all atoms in our universe. The importance of a good coevolutionary modeling becomes even more evident when considering all proteins being compatible with the amino-acid conservation patterns in the MSA: the corresponding profile model still results in an effective sequence number of $10^{94}$, i.e. a factor of $10^{26}$ larger than the sequence space respecting also coevolutionary constraints. As was verified in experiments, conservation provides insufficient information for generating functional proteins, while taking coevolution into account leads to finite success probabilities.

Reproducing the statistical features of natural sequences does not necessarily guarantee the sampled sequences to be fully functional protein sequences. To enhance our confidence in these sequences, we have performed two tests.

First we have reanalyzed the bmDCA-generated sequences of [45], which were experimentally tested for their in-vivo chorismate-mutase activity. Starting from the same MSA of natural sequences, we have trained an arDCA model and calculated the statistical energies of all non-natural and experimentally tested sequences. As is shown in supplementary Fig. S12, the statistical energies have a Pearson correlation of 97% with the bmDCA energies reported in [45]. In both cases functional sequences are restricted to the region of low statistical energies.

Furthermore, we have used small samples of 10 artificial or natural response-regulator sequences as inputs for trRosetta [61], in a setting which allows for protein-structure prediction based only on the user-provided MSA, i.e. no homologous sequences are added by trRosetta, and no structural templates are used. As is shown in supplementary Fig. S13, the predicted structures are very similar to each other, and within a root mean-square deviation of less than 2 Å from an exemplary PDB structure. The contact maps extracted from the trRosetta predictions are close to identical.

While these observations do not prove that arDCA-generated sequences are functional or fold into the correct tertiary structure, they are coherent with this conjecture.

Autoregressive models can be easily extended by adding hidden layers in the ansatz for the conditional probabilities $P(a_i \mid a_{i-1}, \dots, a_1)$, with the aim to increase the expressive power of the overall model. For the families explored here, we found that the one-layer model Eq. (2) is already so accurate that adding more layers only results in similar, but not superior performance, cf. supplementary Sec. S6. However, in longer or more complicated protein families, the larger expressive power of deeper autoregressive models could be helpful. Ultimately, the generative performance of such extended models should be assessed by testing the functionality of the generated sequences in experiments similar to [45].

METHODS

Inference of the parameters

We first describe the inference of the parameters via likelihood maximization. In a Bayesian setting, with uniform prior (we discuss regularization below), the optimal parameters are those that maximize the probability of the data, given as an MSA $\mathcal{D} = (a_i^m \mid i = 1, \dots, L;\ m = 1, \dots, M)$ of $M$ sequences of aligned length $L$:

$$
\begin{aligned}
\{J^*, h^*\} &= \arg\max_{\{J,h\}} P(\mathcal{D} \mid \{J,h\}) \\
&= \arg\max_{\{J,h\}} \log P(\mathcal{D} \mid \{J,h\}) \\
&= \arg\max_{\{J,h\}} \sum_{m=1}^{M} \log \prod_{i=1}^{L} P(a_i^m \mid a_{i-1}^m, \dots, a_1^m) \\
&= \arg\max_{\{J,h\}} \sum_{m=1}^{M} \sum_{i=1}^{L} \log P(a_i^m \mid a_{i-1}^m, \dots, a_1^m) .
\end{aligned}
\tag{8}
$$

Each parameter $h_i(a)$ or $J_{ij}(a,b)$ appears in only one conditional probability $P(a_i \mid a_{i-1}, \dots, a_1)$, and we can thus maximize independently each conditional probability in Eq. (8):

$$
\begin{aligned}
\{J_{ij}^*, h_i^*\} &= \arg\max_{\{J_{ij}, h_i\}} \sum_{m=1}^{M} \log P(a_i^m \mid a_{i-1}^m, \dots, a_1^m) \\
&= \arg\max_{\{J_{ij}, h_i\}} \sum_{m=1}^{M} \left[ h_i(a_i^m) + \sum_{j=1}^{i-1} J_{ij}(a_i^m, a_j^m) - \log z_i(a_{i-1}^m, \dots, a_1^m) \right]
\end{aligned}
$$

where

$$
z_i(a_{i-1}, \dots, a_1) = \sum_{a_i} \exp\left\{ h_i(a_i) + \sum_{j=1}^{i-1} J_{ij}(a_i, a_j) \right\} \tag{9}
$$

is the normalization factor of the conditional probability of variable $a_i$.

Differentiating with respect to $h_i(a)$ or to $J_{ij}(a,b)$, with $j = 1, \dots, i-1$, we get the set of equations:

$$
\begin{aligned}
0 &= \frac{1}{M} \sum_{m=1}^{M} \left[ \delta_{a, a_i^m} - \frac{\partial \log z_i(a_{i-1}^m, \dots, a_1^m)}{\partial h_i(a)} \right] , \\
0 &= \frac{1}{M} \sum_{m=1}^{M} \left[ \delta_{a, a_i^m} \delta_{b, a_j^m} - \frac{\partial \log z_i(a_{i-1}^m, \dots, a_1^m)}{\partial J_{ij}(a,b)} \right] ,
\end{aligned}
\tag{10}
$$

where $\delta_{a,b}$ is the Kronecker symbol. Using Eq. (9) we find

$$
\begin{aligned}
\frac{\partial \log z_i(a_{i-1}^m, \dots, a_1^m)}{\partial h_i(a)} &= P(a_i = a \mid a_{i-1}^m, \dots, a_1^m) , \\
\frac{\partial \log z_i(a_{i-1}^m, \dots, a_1^m)}{\partial J_{ij}(a,b)} &= P(a_i = a \mid a_{i-1}^m, \dots, a_1^m)\, \delta_{a_j^m, b} .
\end{aligned}
\tag{11}
$$

The set of equations thus reduces to a very simple form:

$$
\begin{aligned}
f_i(a) &= \left\langle P(a_i = a \mid a_{i-1}^m, \dots, a_1^m) \right\rangle_{\mathcal{D}} , \\
f_{ij}(a,b) &= \left\langle P(a_i = a \mid a_{i-1}^m, \dots, a_1^m)\, \delta_{a_j^m, b} \right\rangle_{\mathcal{D}} ,
\end{aligned}
\tag{12}
$$

where $\langle \bullet \rangle_{\mathcal{D}} = \frac{1}{M} \sum_{m=1}^{M} \bullet^m$ denotes the empirical data average, and $f_i(a)$, $f_{ij}(a,b)$ are the empirical one- and two-point amino-acid frequencies. Note that for the first variable ($i = 1$), which is unconditioned, there is no equation for the couplings, and the equation for the field takes the simple form $f_1(a) = P(a_1 = a)$, which is solved by $h_1(a) = \log f_1(a) + \mathrm{const}$.

Unlike the corresponding equations for the Boltzmann learning of a Potts model [22], there is a mix between probabilities and empirical averages in Eq. (12), and there is no explicit equality between one- and two-point marginals and empirical one- and two-point frequencies. This means that the ability to reproduce the empirical one- and two-point frequencies is already a statistical test for the generative properties of the model, and not only for the fitting quality of the current parameter values.

The inference can be done very easily with any algorithm using gradient descent, which updates the fields and couplings proportionally to the difference of the two sides of Eq. (12). We used the Low Storage BFGS method to do the inference. We also add an L2 regularization, with regularization strength of 0.0001 for the generative tests and 0.01 for mutational effects and contact prediction. A small regularization leads to better results on generative tests, but a larger regularization is needed for contact prediction or mutational effects. Contact prediction can indeed suffer from too large parameters, and therefore a larger regularization was chosen, coherently with the one used in plmDCA. Note that the gradients are computed exactly at each iteration, as an explicit average over the data, and hence without the need of MCMC sampling. This provides an important advantage over Boltzmann-machine learning.

Finally, in order to partially compensate for the phylogenetic structure of the MSA, which induces correlations among sequences, each sequence is reweighted by a coefficient $w_m$ [13]:

$$
\{J_{ij}^*, h_i^*\} = \arg\max_{\{J_{ij}, h_i\}} \frac{1}{M_{\mathrm{eff}}} \sum_{m=1}^{M} w_m \log P(a_i^m \mid a_{i-1}^m, \dots, a_1^m) , \tag{13}
$$

which leads to the same equations as above with the only modification of the empirical average as $\langle \bullet \rangle_{\mathcal{D}} = \frac{1}{M_{\mathrm{eff}}} \sum_{m=1}^{M} w_m \bullet^m$. Typically, $w_m$ is given by the inverse of the number of sequences having at least 80% sequence identity with sequence $m$, and $M_{\mathrm{eff}} = \sum_m w_m$ denotes the effective number of independent sequences. The goal is to remove the influence of very closely related sequences. Note however that such reweighting cannot fully capture the hierarchical structure of phylogenetic relations between proteins.

Sampling from the model

Once the model parameters are inferred, a sequence can be iteratively generated by the following procedure:

1. Sample the first residue from $P(a_1)$
2. Sample the second residue from $P(a_2 \mid a_1)$ where $a_1$ is sampled in the previous step.
...
L. Sample the last residue from $P(a_L \mid a_{L-1}, a_{L-2}, \dots, a_2, a_1)$

Each step is very fast because there are only 21 possible values for each probability. Both training and sampling are therefore extremely simple and computationally efficient in arDCA.

ACKNOWLEDGMENTS

We thank Indaco Biazzo, Matteo Bisardi, Elodie Laine, Anna-Paola Muntoni, Edoardo Sarti and Kai Shimagaki for helpful discussions and assistance with the data. We especially thank Francisco McGee and Vincenzo Carnevale for providing generated samples from DeepSequence as in Ref. [38]. Our work was partially funded by the EU H2020 Research and Innovation Programme MSCA-RISE-2016 under Grant Agreement No. 734439 InferNet. J.T. is supported by a PhD Fellowship of the i-Bio Initiative from the Idex Sorbonne University Alliance.

Author contributions: A.P., F.Z. and M.W. designed research; J.T., G.U. and A.P. performed research; J.T., G.U., A.P., F.Z. and M.W. analyzed the data; J.T., F.Z. and M.W. wrote the paper.

Competing interests: The authors declare no competing interests.

Code availability: Codes in Python and Julia are available at https://github.com/pagnani/ArDCA.git.

Data availability: Data is available at https://github.com/pagnani/ArDCAData.

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive Science, 9(1):147–169 (1985).
[2] T. Amimeur, J. M. Shaver, R. R. Ketchem, J. A. Taylor, R. H. Clark, J. Smith, D. Van Citters, C. C. Siska, P. Smidt, M. Sprague, et al. Designing feature-controlled humanoid antibody discovery libraries using generative adversarial networks. bioRxiv (2020).
[3] N. Anand-Achim, R. R. Eguchi, A. Derry, R. B. Altman, and P. Huang. Protein sequence design with a learned potential. bioRxiv (2020).
[4] I. Anishchenko, T. M. Chidyausiku, S. Ovchinnikov, S. J. Pellock, and D. Baker. De novo protein design by deep network hallucination. bioRxiv (2020).
[5] S. Balakrishnan, H. Kamisetty, J. G. Carbonell, S.-I. Lee, and C. J. Langmead. Learning generative models for protein fold families. Proteins: Structure, Function, and Bioinformatics, 79(4):1061–1078 (2011).
[6] P. Barrat-Charlaix, A. P. Muntoni, K. Shimagaki, M. Weigt, and F. Zamponi. Sparse generative modeling via parameter reduction of Boltzmann machines: Application to protein-sequence families. Physical Review E, 104:024407 (2021).
[7] J. P. Barton, A. K. Chakraborty, S. Cocco, H. Jacquin, and R. Monasson. On the entropy of protein families. Journal of Statistical Physics, 162(5):1267–1293 (2016).
[8] J. P. Barton, E. De Leonardis, A. Coucke, and S. Cocco. Ace: adaptive cluster expansion for maximum entropy graphical model inference. Bioinformatics, 32(20):3089–3097 (2016).
[9] M. Bisardi, J. Rodriguez-Rivas, F. Zamponi, and M. Weigt. Modeling sequence-space exploration and emergence of epistatic signals in protein evolution. arXiv preprint arXiv:2106.02441 (2021).
[10] C. M. Bishop. Pattern recognition and machine learning. Springer (2006).
[11] R. R. Cheng, F. Morcos, H. Levine, and J. N. Onuchic. Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proceedings of the National Academy of Sciences, 111(5):E563–E571 (2014).
[12] R. R. Cheng, O. Nordesjö, R. L. Hayes, H. Levine, S. C. Flores, J. N. Onuchic, and F. Morcos. Connecting the sequence-space of bacterial signaling proteins to phenotypes using coevolutionary landscapes. Molecular Biology and Evolution, 33(12):3054–3064 (2016).
[13] S. Cocco, C. Feinauer, M. Figliuzzi, R. Monasson, and M. Weigt. Inverse statistical physics of protein sequences: a key issues review. Reports on Progress in Physics, 81(3):032601 (2018).
[14] Z. Costello and H. G. Martin. How to hallucinate functional proteins. arXiv:1903.00458 (2019).
[15] D. De Juan, F. Pazos, and A. Valencia. Emerging methods in protein co-evolution. Nature Reviews Genetics, 14(4):249–261 (2013).
[16] J. A. de la Paz, C. M. Nartey, M. Yuvaraj, and F. Morcos. Epistatic contributions promote the unification of incompatible models of neutral molecular evolution. Proceedings of the National Academy of Sciences, 117(11):5873–5882 (2020).
[17] A. Decelle, C. Furtlehner, and B. Seoane. Equilibrium and non-equilibrium regimes in the learning of restricted boltzmann machines. arXiv preprint arXiv:2105.13889 (2021).
[18] S. R. Eddy. A new generation of homology search tools based on probabilistic inference. In Genome Informatics 2009: Genome Informatics Series Vol. 23, pages 205–211. World Scientific (2009).
[19] M. Ekeberg, C. Lövkvist, Y. Lan, M. Weigt, and E. Aurell. Improved contact prediction in proteins: using pseudolikelihoods to infer potts models. Physical Review E, 87(1):012707 (2013).
[20] S. El-Gebali, J. Mistry, A. Bateman, S. R. Eddy, A. Luciani, S. C. Potter, M. Qureshi, L. J. Richardson, G. A. Salazar, A. Smart, et al. The pfam protein families database in 2019. Nucleic Acids Research, 47(D1):D427–D432 (2019).
[21] C. Fannjiang and J. Listgarten. Autofocused oracles for model-based design. arXiv preprint arXiv:2006.08052 (2020).
[22] M. Figliuzzi, P. Barrat-Charlaix, and M. Weigt. How pairwise coevolutionary models capture the collective residue variability in proteins? Molecular Biology and Evolution, 35(4):1018–1027 (2018).
[23] M. Figliuzzi, H. Jacquier, A. Schug, O. Tenaillon, and M. Weigt. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase tem-1. Molecular Biology and Evolution, 33(1):268–280 (2016).
[24] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cambridge (2016).
[25] J. G. Greener, S. M. Kandathil, and D. T. Jones. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nature Communications, 10(1):1–13 (2019).
[26] J. G. Greener, L. Moffat, and D. T. Jones. Design of metalloproteins and novel protein folds using variational autoencoders. Scientific Reports, 8(1):1–12 (2018).
[27] A. Haldane and R. M. Levy. Mi3-gpu: Mcmc-based inverse ising inference on gpus for protein covariation analysis. Computer Physics Communications, 260:107312 (2021).
[28] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media (2009).
[29] A. Hawkins-Hooker, F. Depardieu, S. Baur, G. Couairon, A. Chen, and D. Bikard. Generating functional protein variants with variational autoencoders. PLoS Computational Biology, 17(2):e1008736 (2021).
[30] T. A. Hopf, J. B. Ingraham, F. J. Poelwijk, C. P. Schärfe, M. Springer, C. Sander, and D. S. Marks. Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35(2):128–135 (2017).
[31] P.-S. Huang, S. E. Boyken, and D. Baker. The coming of age of de novo protein design. Nature, 537(7620):320–327 (2016).
[32] J. Ingraham, V. K. Garg, R. Barzilay, and T. S. Jaakkola. Generative models for graph-based protein design. (2021).
[33] C. Jäckel, P. Kast, and D. Hilvert. Protein design by directed evolution. Annu. Rev. Biophys., 37:153–173 (2008).
[34] B. Jing, S. Eismann, P. Suriana, R. J. Townshend, and R. Dror. Learning from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411 (2020).
[35] E. Laine, Y. Karami, and A. Carbone. Gemme: a simple and fast global epistatic model predicting mutational effects. Molecular Biology and Evolution, 36(11):2604–2619 (2019).
[36] R. M. Levy, A. Haldane, and W. F. Flynn. Potts hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness. Current Opinion in Structural Biology, 43:55–62 (2017).
[37] J. Linder and G. Seelig. Fast differentiable dna and protein sequence optimization for molecular design. arXiv preprint arXiv:2005.11275 (2020).
[38] F. McGee, Q. Novinger, R. M. Levy, V. Carnevale, and A. Haldane. Generative capacity of probabilistic protein sequence models. arXiv preprint arXiv:2012.02296 (2020).
[39] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences, 108(49):E1293–E1301 (2011).
[40] C. Norn, B. I. Wicky, D. Juergens, S. Liu, D. Kim, D. Tischer, B. Koepnick, I. Anishchenko, D. Baker, and S. Ovchinnikov. Protein sequence design by conformational landscape optimization. Proceedings of the National Academy of Sciences, 118(11):e2017228118 (2021).
[41] J. M. Reimer, M. Eivaskhani, I. Harb, A. Guarné, M. Weigt, and T. M. Schmeing. Structures of a dimodular nonribosomal peptide synthetase reveal conformational flexibility. Science, 366(6466) (2019).
[42] D. Repecka, V. Jauniskis, L. Karpus, E. Rembeza, I. Rokaitis, J. Zrimec, S. Poviloniene, A. Laurynenas, S. Viknander, W. Abuajwa, et al. Expanding functional protein sequence spaces using generative adversarial networks. Nature Machine Intelligence, 3(4):324–333 (2021).
[43] A. J. Riesselman, J. B. Ingraham, and D. S. Marks. Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 15(10):816–822 (2018).
[44] O. Rivoire, K. A. Reynolds, and R. Ranganathan. Evolution-based functional decomposition of proteins. PLoS Computational Biology, 12(6):e1004817 (2016).
[45] W. P. Russ, M. Figliuzzi, C. Stocker, P. Barrat-Charlaix, M. Socolich, P. Kast, D. Hilvert, R. Monasson, S. Cocco, M. Weigt, et al. An evolution-based model for designing chorismate mutase enzymes. Science, 369(6502):440–445 (2020).
[46] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. Nelson, A. Bridgland, et al. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710 (2020).
[47] O. Sharir, Y. Levine, N. Wies, G. Carleo, and A. Shashua. Deep autoregressive models for the efficient variational simulation of many-body quantum systems. Physical Review Letters, 124(2):020503 (2020).
[48] K. Shimagaki and M. Weigt. Selection of sequence motifs and generative hopfield-potts models for protein families. Physical Review E, 100(3):032128 (2019).
[49] J. Söding. Protein homology detection by hmm–hmm comparison. Bioinformatics, 21(7):951–960 (2005).
[50] T. N. Starr and J. W. Thornton. Epistasis in protein evolution. Protein Science, 25(7):1204–1218 (2016).
[51] A. Strokach, D. Becerra, C. Corbi-Verge, A. Perez-Riba, and P. M. Kim. Fast and flexible protein design using deep graph neural networks. Cell Systems, 11(4):402–411 (2020).
[52] L. Sutto, S. Marsili, A. Valencia, and F. L. Gervasio. From residue coevolution to protein conformational ensembles and functional dynamics. Proceedings of the National Academy of Sciences, 112(44):13567–13572 (2015).
[53] P. Tian and R. B. Best. How many protein sequences fold to a given structure? a coevolutionary analysis. Biophysical Journal, 113(8):1719–1730 (2017).
[54] P. Tian, J. M. Louis, J. L. Baber, A. Aniana, and R. B. Best. Co-evolutionary fitness landscapes for sequence design. Angewandte Chemie International Edition, 57(20):5674–5678 (2018).
[55] J. Tubiana, S. Cocco, and R. Monasson. Learning protein constitutive motifs from sequence data. Elife, 8:e39397 (2019).
[56] UniProt Consortium. Uniprot: a worldwide hub of protein knowledge. Nucleic Acids Research, 47(D1):D506–D515 (2019).
[57] S. Vorberg, S. Seemayer, and J. Söding. Synthetic protein alignments by ccmgen quantify noise in residue-residue contact prediction. PLoS Computational Biology, 14(11):e1006526 (2018).
[58] S. Wang, S. Sun, Z. Li, R. Zhang, and J. Xu. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Computational Biology, 13(1):e1005324 (2017).
[59] G. W. Wilburn and S. R. Eddy. Remote homology search with hidden potts models. PLoS Computational Biology, 16(11):e1008085 (2020).
[60] D. Wu, L. Wang, and P. Zhang. Solving statistical mechanics using variational autoregressive networks. Physical Review Letters, 122(8):080602 (2019).
[61] J. Yang, I. Anishchenko, H. Park, Z. Peng, S. Ovchinnikov, and D. Baker. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences, 117(3):1496–1503 (2020).
Barrat-Charlaix,