bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029264; this version posted April 7, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Generating functional protein variants with variational autoencoders

Alex Hawkins-Hooker, Florence Depardieu, Sebastien Baur, Guillaume Couairon, Arthur Chen, and David Bikard

Synthetic Biology Group, Microbiology Department, Institut Pasteur, Paris, France

Abstract

The design of novel proteins with specified function and controllable biochemical properties is a longstanding goal in bio-engineering with potential applications across medicine and nanotechnology. The vast expansion of protein sequence databases over the last decades provides an opportunity for new approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Advances in deep generative models have led to the successful modelling of diverse kinds of high-dimensional data, from images to molecules, allowing the generation of novel, realistic samples. While deep models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, their potential for direct use in protein engineering remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To validate the practical utility of the models, we used them to generate variants of luxA whose function was tested experimentally. As further evidence of the practicality of these methods for protein design, we showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 18/24 of the variants generated using the AR-VAE and 21/23 variants generated using the MSA VAE retained some luminescence activity, despite containing as many as 35 differences relative to any training set sequence. These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.

Recombinant proteins have found uses in many medical and industrial applications where it is frequently desirable to identify protein variants with modified properties such as improved stability, catalytic activity, and modified substrate preferences. The systematic exploration of protein variants is made extremely challenging by the enormous space of possible sequences and the difficulty of accurately predicting protein fold and function. Directed evolution approaches enable a more or less random local search of sequence space but are typically limited to the exploration of sequences differing by only a few mutations from a given natural sequence [1, 2]. When knowledge of the protein structure is available, computer-aided rational design can help identify interesting modifications [3]. Beyond the identification of sequence variants, computational approaches have enabled the generation of small synthetic protein domains that mimic natural folds while using sequences that are distant from what is seen in nature [4, 5, 6]. These techniques take advantage of structural information and physical modeling, as well as statistical analysis of amino-acid conservation and co-evolution. Recent progress has also been made in the rational design of proteins with artificial folds from scratch [7, 8, 9]. All these computational design approaches nonetheless remain for now limited in their success and in the types of protein they can model.

Machine learning methods provide an alternative and potentially complementary approach capable of exploiting the information available in protein sequence and structure databases. Natural sequence variation provides a rich source of information about the structural and biophysical constraints on amino acid sequence in functional proteins, but the unlabelled nature of much of the available data provides a challenge for straightforward supervised learning methods.

The framework of generative modelling shows promise for exploiting this information in an unsupervised manner. Generative models are machine learning methods which seek to model the distribution underlying the data, allowing for the generation of novel samples with similar properties to those on which the model was trained [10]. In recent years, deep neural network based generative models such as Variational Autoencoders [11], Generative Adversarial Networks [12], and deep autoregressive models [13, 14, 15] trained on large databases of images [12], audio [14], text [16, 17], and even chemicals [18], have been shown to be capable of generating novel, realistic samples. Generative models can also easily be adapted to include auxiliary information to guide the generative process, by modelling the distribution of the data conditioned on the auxiliary variables. Such conditional generation is of particular interest for protein design where it is frequently desirable to maintain a particular function while modifying a property such as stability or solubility.

While there have recently been several successes in applying deep learning techniques to modelling protein sequences in tasks including contact prediction [19], secondary structure prediction [20, 21], and prediction of the fitness effects of mutations [22], the possibility of applying generative modelling methods in the design of new sequences has only very recently begun to be explored [23, 24, 25, 26, 27], and experimental evidence for the viability of these techniques is scarce. To realise the promise of generative models in protein engineering, work remains to be done in understanding the consequences of various design choices, the strengths and limitations of different types of model and the possibilities for integration into existing engineering workflows.

One particularly important consideration is the nature of the input representation to the model. Many traditional successes in protein sequence analysis have relied on features derived from multiple sequence alignments of related proteins, which simplify the inference of structural and functional constraints from sequence data [28]. But alignments become large and unreliable as more distant proteins are added [29], placing an effective limit on the diversity of sequences that can be related in this way. For this reason, several recent works have explored deep learning methods which are capable of fully exploiting the data in sequence databases by working with raw sequence inputs. Deep sequence models such as LSTMs and transformers trained on datasets spanning the entire range of known sequences have been shown to learn representations which distill structural and functional information from the sequence [30, 31]. Despite these promising results, it remains unclear whether the representations learned are more informative than simple features computed from local alignments [32], and the generative capacity of these models, though acknowledged, is almost entirely unexplored.

Here, as a practical illustration of the application of deep generative design to protein engineering, we developed variational autoencoder (VAE) models capable of generating novel variants of bacterial luciferase, an enzyme which emits light through the oxidation of flavin mononucleotide (FMNH2). We proposed separate architectures to work with raw and aligned sequence input which, when trained on a family of almost 70000 luciferase-like protein sequences, learned representations of sequence identity which captured functional information and were able to generate new sequences displaying patterns of amino acid usage characteristic of the family. Moreover, conditional versions of the models trained with auxiliary solubility information enabled control of the predicted solubility level of generated sequence variants. In order to confirm the generative capacity of the models, they were used to generate variants of the luxA subunit of the luciferase from Photorhabdus luminescens. A number of the variants generated by each model were selected for synthesis and assessed for function when expressed as recombinant proteins in E. coli.

1 Results

Generative VAE models for protein families Variational autoencoders (VAEs) are deep latent variable models consisting of two subnetworks in an autoencoder structure [11]. The encoder network learns to map data points to low-dimensional ‘latent vectors’, while the decoder network learns to reconstruct data points from their low-dimensional latent encodings. Either raw or aligned protein sequences can be passed as input to a VAE model by representing them as fixed-size matrices whose columns contain ‘one-hot encoded’ representations of the identity of the amino acid at each position in the sequence (Methods). When trained on a training set of sequence inputs of the same kind, a VAE thus learns a ‘latent representation’ of the content of each sequence. A prior enforcing smoothness on the representations output by the encoder ensures that novel sequences can be generated either by varying the latent representation around that of existing sequences, or by sampling from the prior distribution over the latent vectors, and then feeding the resulting vectors through the decoder.

This latent variable-governed generative process is particularly attractive for design applications because it can straightforwardly be used to bias generation towards particular regions of sequence space, either by sampling from the vicinity of the latent representations of target sequences, or by facilitating optimization based strategies which search the latent space for novel sequences with desirable properties [33, 18].

As a practical testbed for deep generative protein design, we chose to work with a dataset of 69,130 homologues of bacterial luciferase obtained from InterPro [34] (IPR011251). We worked with two versions of the dataset: one containing raw unaligned sequences, and one constructed from a multiple sequence alignment (MSA) of the dataset built using Clustal Omega (Methods). Alignments of large protein families can be very wide, presenting a challenge for methods seeking to model variation at all positions. We chose instead to build models capable of generating variants of a single target protein, the luciferase luxA subunit from P. luminescens. We therefore dropped all columns of the MSA which were unoccupied in the luxA target. We split the dataset into a training set and a holdout validation set, using the same split for both aligned and raw sequences. In order to avoid highly similar sequences occurring in the training and validation sets, we first clustered all the sequences using mmseqs2 [35], and then added clusters chosen at random to the validation set until the total number of sequences in the validation clusters reached 20% of the total. In order to assess generalisation to a range of distances from the training set, three train-validation splits were created using sequence identity thresholds of 30%, 50% and 70% in the clustering. Since our ultimate goal was the generation of variants with reasonably close similarity to the target protein, we mainly used the split at a clustering threshold of 70% sequence identity for the development of models, but report amino acid reconstruction accuracies on all three splits in Supplementary Table 1.

Following models previously developed to model the fitness consequences of mutations [22], we used a standard design of fully connected feed-forward encoder and decoder networks for the models taking aligned input (MSA VAE, Methods). Preliminary experiments with a similar architecture on unaligned sequence data yielded poor results, with the generated sequences often failing to register as hits when scored with the family profile HMM from [36]. In a VAE with a feed-forward decoder, the output variables are conditionally independent given the latent variables, meaning that all information about local conditional dependencies must be stored implicitly in the latent variables. The importance of capturing such local dependencies in unaligned sequence data makes autoregressive models such as recurrent neural networks (RNNs), which can be trained to explicitly model the relevant conditional distributions, a natural choice. VAEs can be enhanced with autoregressive decoders to reduce the burden on the latent space to capture local information, and architectures based on this principle have been used to model images, text and molecules [15, 37, 16, 18]. To handle raw sequences we therefore designed a model incorporating a convolutional encoder as well as a hybrid decoder [38] containing feed-forward and autoregressive components (AR-VAE, Methods). We found that this hybrid structure was crucial in allowing the model to fit sequences containing hundreds of amino acids, and helped ensure that the latent space was used, partially circumventing the well-documented optimization difficulties that arise when training VAE models with autoregressive decoders [16, 38]. As an initial confirmation of the advantages of the chosen architecture, we scored a set of 3000 sequences generated by sampling from the prior of the AR-VAE model with the family's profile HMM. As baselines we also computed HMM scores for sets of sequences generated by the MSA VAE model, and by a model having the same architecture as MSA VAE trained on raw sequence data. The vast majority of sequences generated by both MSA VAE and AR-VAE were scored as hits by the HMM (96.8% and 99.7% respectively, at an E-value threshold of 0.001), whereas the sequences generated by the baseline model trained on raw inputs only scored as hits just over half the time (57.3%).

Models learn representations encoding features relevant to biological function To model the distribution of sequences within a protein family, VAEs develop internal representations of the content of sequences at multiple resolutions. To explore the biological significance of these representations we first examined the weights in the output layer of the decoder. At each point in the sequence this layer is parameterised by a weight matrix whose columns represent learned ‘embeddings’ of amino acid identity, which combine with the network's hidden representation via a softmax transformation to output the probabilities of observing each amino acid at that point. If the weights of this layer are tied across all positions, as was the case for AR-VAE models, a single set of embeddings is obtained.

Figure 1: Schematic representation of the input representation and VAE models used in the study. Models take as input either raw or aligned sequences. In the latter case, the inputs correspond to the rows of an MSA of the luciferase family. Only columns of the MSA corresponding to positions (highlighted in red) present in the target protein (marked with *) are retained. In both cases, the sequences are one-hot encoded before being fed into the model. Different architectures were used depending on the type of sequence input. The model developed to work with aligned sequences (MSA VAE) used fully-connected feed-forward networks in both the encoder and the decoder. The model developed to work with raw sequences (AR-VAE) comprised a CNN encoder and a decoder which combined upsampling with autoregression. The decoder sequentially outputs predictions for the identity of the amino acid at each point in the sequence, conditioned on the upsampled latent representation together with the previous amino acids in either the input sequence (during training, blue arrow) or the generated sequence (when being used generatively, red arrow).

Visual inspection of a two dimensional projection of these embeddings obtained using PCA indicates that they reflect the biochemical properties of the various amino acids: for example the negatively charged amino acids (D and E) and the positively charged amino acids (K, R and H) cluster tightly together (Figure 2), while the second principal component seems to separate the hydrophobic amino acids (both the hydrophobic and aromatic groups in the legend) from the polar amino acids, recapitulating the major groupings in traditional classification schemes [39]. As further validation of the biological relevance of these embeddings we found that the cosine similarities between embeddings for pairs of amino acids correlated well with the entries in the BLOSUM 62 substitution matrix. Finally, to understand the models' representations at a more global level, we examined the distribution of latent vectors associated with sequences coming from distinct sub-families within the set of luciferase-like proteins. The InterPro sub-families form visually distinct clusters in the space of the first two principal components, especially for the model trained on the MSA (Figure 2), indicating that global information about functional and evolutionary relationships between sequences is captured in the latent variables.

Models capture patterns of amino acid usage characteristic of members of the family Protein families are characterised by statistical features that reflect the shared evolutionary history and related structure and function of members of the family. Patterns of amino acid conservation at individual positions reflect the presence of functionally important sites and are used by profile HMMs to identify family members [40], while correlations in amino acid usage at pairs of positions are signatures of structurally constrained evolutionary covariation which can be used to infer contacts between residues [41, 42, 43]. Models such as VAEs which seek to learn the distribution of sequences in the family can be evaluated for their capacity to reproduce these characteristic statistical features.

[Figure 2 panels, left to right: BLOSUM 62 score vs. output embedding cosine similarity; PCA projection (PC1/PC2) of amino acid output embeddings, coloured by biochemical group (aromatic, negative, positive, hydrophobic, polar, unique); PCA projection (PC1/PC2) of sequence latent vectors, coloured by InterPro sub-family.]

Figure 2: Visualisation of biologically relevant features of protein sequences learnt by VAE models. Left: pairwise cosine similarities between amino-acid output embeddings from an AR-VAE model trained on unaligned sequences correlate with amino acid substitution scores in BLOSUM62 substitution matrix (Spearman ρ = 0.423, n = 190); centre: projection of AR-VAE output embedding weights onto first two principal components groups embeddings corresponding to biochemically related amino acids; right: sub-families cluster within the latent vector representation of sequences from MSA VAE (projected onto first two principal components for ease of visualisation).

To further probe the ability of the AR-VAE and MSA VAE models to generate realistic sequences, we therefore calculated first- and second-order amino acid statistics from the sets of 3000 sequences previously generated by sampling from the prior of each model and compared them to corresponding statistics calculated from the sequences in the training set. Making a comparison of these statistics requires an alignment of the generated sequences to the training sequences. Such an alignment is automatically available for MSA VAE; for AR-VAE we used Clustal Omega to jointly align all training and generated sequences, and again filtered columns based on the alignment of the target luxA sequence. Given these alignments, we computed single-site amino acid frequencies at all positions and pairwise amino acid frequencies and covariances [44, 28] at all pairs of positions for both the subset of the alignment corresponding to generated sequences and the subset corresponding to training sequences (Figure 3). Both VAE models were able to reproduce the statistics observed in the natural sequences reasonably well, with the MSA VAE sequences showing especially good agreement. As a simple baseline, we also sampled a set of 3000 sequences from the PFAM profile HMM for the family, and compared statistics at match states to statistics at positions assigned to match states when aligning the training set to the model. By construction, HMM models ignore interactions between residues, and therefore generate sequences which show similarity to natural sequences in first order statistics (patterns of conservation) but whose covariances only differ from zero due to the effects of finite sampling. We note that detailed direct comparison of the results between models is challenging due to the statistics being computed from different model-specific alignments, and in the case of the HMM model, due to the fact that it was not trained on the same data. Nonetheless, the analysis serves to illustrate the fact that the VAE models, in contrast to simpler profile models, are able to reproduce second order statistics without being fit to them directly, and, in the case of AR-VAE, without requiring aligned input data.

To obtain a more qualitative understanding of the kinds of dependencies that the models were able to capture, we used the direct coupling analysis software CCMPred [43] to identify the most strongly ‘coupled’ pairs of positions in the generated sequences. Direct coupling analysis seeks to explain the observed (first and second order) amino acid statistics in terms of couplings between positions in a statistical model [28]. The most strongly coupled positions in natural family alignments are good predictors of contacts in protein 3D structure. We therefore compared the couplings inferred from generated sequences to the contacts in the 3D structure of luxA by visualising the resulting predicted contact maps (Figure 4). Whereas the MSA VAE generated sequences which exhibited strong dependencies between positions at a range of distances, yielding an inferred contact map bearing a reasonable resemblance to the ground truth, the sequences generated by the AR-VAE showed a bias towards local couplings.
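
For reference, the covariances referred to here are, in the convention of the direct coupling analysis literature cited above [44, 28], the connected correlations between amino acid identities at pairs of alignment columns, Cij(a, b) = fij(a, b) − fi(a) fj(b), where fi(a) denotes the frequency of amino acid a at column i and fij(a, b) the joint frequency of amino acids a and b at columns i and j.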

1.1 Experimental validation of generated sequences

Novel luminescent variants generated from latent vicinity of a target luminescent protein Bacterial luciferase is a heterodimeric enzyme which catalyzes the light-emitting reaction responsible for the bioluminescence of a number of bacterial species.

[Figure 3 panels: scatter plots of statistics computed from generated sequences (y-axis) against statistics computed from training set sequences (x-axis). Columns: single-site frequencies, pairwise frequencies, covariances. Rows: MSA VAE (Pearson's r = 0.988, 0.960, 0.892), AR-VAE (Pearson's r = 0.959, 0.898, 0.766), profile HMM (Pearson's r = 0.916, 0.814, 0.287).]

Figure 3: Statistics computed from alignments of generated sequences to natural sequences from the training set. Similarity of statistics between generated and natural sequences reflects the ability of models to capture important types of sequence variation. Single-site amino acid frequencies (left) capture patterns of residue conservation at each position in the alignment, while co-occurrence frequencies (centre) and covariances (right) between amino acid identities at different pairs of positions reflect patterns of evolutionary covariation which may indicate structural or functional constraints. Sequences were generated by sampling from the prior of the VAE models. For MSA VAE the resulting sequences were already aligned; for the raw sequences generated by AR-VAE, a new MSA was first constructed by running Clustal Omega on the set of sequences sampled from the model together with the natural sequences in the training set, using the bacterial luciferase family PFAM profile HMM as an External Profile Alignment, following which statistics for generated and natural sequences were computed from the corresponding subsets of the alignment. As a baseline we also report results for statistics generated by the profile HMM from PFAM. In this case the training set statistics were computed from the alignment of the training sequences to the profile HMM.

The two homologous subunits, encoded by the genes luxA and luxB, have different roles: the luxA subunit contains the active site, while the luxB subunit is thought to provide conformational stability [45]. Since the luciferase activity is primarily encoded by the luxA gene, we sought to generate novel variants of the luxA subunit, taking as our seed sequence the luxA protein from the species P. luminescens (UniProt id: P19839). For both AR-VAE and MSA VAE models we generated a set of candidate variants by sampling latent vectors from the neighbourhood of the latent space encoding of P19839 (Methods) and passing them through the decoder. To validate these candidates, 12 sequences from each model were selected for synthesis (supplementary dataset 1), spanning a range of distances (17-48 total differences including substitutions and deletions) to P19839.

[Figure 4 panels: three contact maps over sequence positions (axes 0-360); colour scale indicates inter-residue distance (Å).]

Figure 4: Comparison of inter-residue couplings inferred from generated sequences to contacts in the 3D structure of a luxA protein. Left: contact map of luxA, showing 1000 closest contacts separated by at least 4 sequence positions. Centre and right: top 1000 couplings inferred from sequences generated by MSA VAE and AR-VAE respectively, coloured by distance between residues in luxA 3D structure. Couplings were predicted using CCMPred on samples of 3000 sequences, and only couplings between residues separated by at least 4 sequence positions were shown. The patterns of inferred couplings reflect the dependencies captured by the models: while MSA VAE captures realistic dependencies between positions at a range of distances, the sequences generated by AR-VAE exhibit a bias towards local dependencies.

To assess the activity of the generated variants, the sequences were synthesised and expressed from a plasmid in an E. coli strain carrying the luxCDBE genes on a second plasmid. 9 of the 11 successfully synthesised sequences generated by the MSA VAE showed measurable luminescent activity, compared to 6 of 12 generated by the model trained on unaligned sequences (Figure 5). Furthermore, the MSA VAE sequences showed a level of luminescence comparable to that of the wild-type protein, while the AR-VAE sequences tended to have reduced luminescence. Remarkably, there was no evidence of a drop-off in luminescence as the distance from the wild type increased for the sequences generated by the model trained on the MSA. This was not true for sequences generated by the model trained on the unaligned family members. Comparison of the generated sequences to other luxA sequences from the training set revealed that several of the more distant variants from both models were closer to other training set sequences than they were to the seed P. luminescens luxA. This was especially true for the MSA VAE, indicating that this model's latent space is organised in such a way as to encourage the exploration of functional regions of sequence space lying between existing sequences. Even taking this into account, the luminescent MSA VAE variants were all between 18-35 substitutions (including deletions) from any training set sequence.

Conditional VAEs enable enhancement of solubility of a luxA protein In order to assess the ability of our models to generate novel functional sequences with specified biophysical properties we further sought to use conditional variants of the VAE models to increase the solubility of the P19839 luxA sequence. Proteins frequently aggregate and precipitate when expressed at high concentrations [46]. This phenomenon is a challenge in a wide range of applications, from the production of protein therapeutics to the study of protein function and structure, leading to interest in engineering of increased-solubility variants [47]. We considered P19839 to be a suitable target to test the use of conditional VAE models for solubility engineering as it was predicted to be insoluble by a recent sequence-based computational solubility prediction method, protein-sol [48], with subsequent experimentation confirming that it was indeed recovered in the insoluble fraction when expressed in E. coli (see Supplementary Figure 1).

Training sequences were grouped into three equally-sized bins by predicted solubility value calculated using protein-sol and the bin label was used as the conditioning variable when training conditional versions of both AR-VAE and MSA VAE models, corresponding to a specification of either low, medium or high solubility for the sequence. P19839 was assigned to the low solubility bin. In order to generate variants with increased solubility, we sampled latent vectors from the neighbourhood of the encoding of P19839 and passed them through the decoder together with the conditioning variable, which was fixed to a value corresponding to either medium or high solubility.

[Figure 5 panels: left, luminescence (log10) vs. distance to luxA WT for unconditionally generated MSA VAE and AR-VAE variants, with the wild-type level marked; centre and right, soluble fraction and luminescence for variants generated by the conditional MSA VAE and AR-VAE models at mid and high target solubility levels.]

Figure 5: Luminescence measurements for synthesised protein sequences generated from latent vectors sampled from the neighbourhood of the encoding of the P. luminescens luxA sequence. Left: luminescence of sequences generated by VAE models trained on raw (AR-VAE) or aligned (MSA VAE) sequences from the family of luciferase-like proteins (mean across three biological replicates, error bars represent standard deviation). Wild-type sequence luminescence is displayed as a dashed green line. Distance is computed as number of substitutions and indels relative to wild type. The MSA VAE model was able to generate functional sequences with large numbers of differences to wild type, whereas the AR-VAE model seemed to introduce deleterious mutations more rapidly. Center and right: measurements of both solubility and luminescence for sequences generated by VAE models conditioned on predicted solubility level show that conditional models can be used to engineer increased-solubility variants of a luxA sequence while preserving function. Solubility is reported as the ratio of the amount of protein present in the supernatant to the total amount in both supernatant and pellet of lysed E. coli cells over-expressing the protein, as measured by a dot blot assay (mean of four technical replicates, error bars represent standard deviation).

To check that conditioning was successful, we generated 100 sequences at each solubility level from the conditional AR-VAE and MSA VAE models, and calculated predicted solubility values for the new sequences (Figure 6). Both models were able to control the predicted solubility level fairly successfully while introducing only relatively few additional mutations compared to the original decoding. To understand the changes being made by the conditional VAE models to increase protein-sol's predicted solubility values, we computed several of the features used as inputs by protein-sol for both the generated sequences and training sequences. The features for generated sequences tended to have values which were shifted from P19839's values towards those exhibited on average by soluble sequences from the training set (Figure 6). For example, when asked to produce sequences at the highest solubility level, the models produced sequences with more negative charge than P19839, by favouring substitutions of neutral or positively charged residues with negatively charged aspartic and glutamic acids (D and E, Figure 6).

Finally, we randomly selected 6 sequences from each of the two increased solubility levels for each model for synthesis, and measured luminescence as well as solubility. To measure solubility, the protein variants were cloned on a pET28/16 plasmid with a His tag and expressed in Rosetta E. coli cells for 3 h, followed by mechanical lysis and centrifugation. The amounts of luxA protein present in the supernatant and in the pellet were measured by performing a dot blot assay using an anti-HisTag antibody, and the average fraction in the supernatant across four replicates was taken as a measure of solubility. 5 out of 12 sequences generated by the MSA VAE model showed clearly improved solubility and 3 out of these maintained a high luminescence level (Figure 5). Almost all sequences generated by AR-VAE showed signs of improved solubility, and out of the 4 sequences with a considerable fraction (>10%) in the supernatant, 3 maintained a high luminescence level. In both cases these improvements were achieved while introducing only a relatively small number of mutations to the wild type (12-26), highlighting that conditioning separated information about solubility from information about general protein content successfully enough to be useful for fairly sensitive engineering.

2 Discussion

We have developed variational autoencoder models capable of generating novel functional variants of a luminescent protein when trained on a set of homologues of the target.

[Figure 6 panels: left, distribution of predicted solubility for sequences generated at low, mid and high solubility conditioning levels; centre, sequence composition difference to WT (%) for each amino acid and for the combined features K-R, D+E and F+W+Y; right, charge distributions of high- and mid-solubility variants relative to the luxA WT.]

Figure 6: Computational analysis of variants generated by conditional VAE models conditioned on predicted solubility level. Left: distribution of predicted solubilities of sequences generated when conditioning on each of three solubility levels (median and upper and lower quartiles indicated with horizontal lines); centre: difference in amino acid composition percentages between generated variants at highest solubility level and original P19839 luxA sequence, including values for combined amino acid features used in protein-sol prediction algorithm; right: distribution of charge of luxA variants generated by conditioning on high (top) and medium (bottom) solubility levels. For comparison, the charge of the original P19839 luxA sequence is shown as a dashed line, the average charge for high solubility sequences in the training set is shown as a solid green line, and the average charge for medium solubility sequences in the training set is shown as a solid red line.

Computational analysis of separate VAE models developed for raw sequences and aligned sequences suggested that the version trained on MSA data more plausibly reproduced the statistical features characteristic of the structural and functional constraints on members of the family arrived at and maintained over the course of evolution. Experimental validation confirmed that a significant fraction of variants of a target luxA protein generated by both models were functional, while confirming the strengths of the MSA model, which generated a set of variants which almost without exception retained high levels of luminescence despite diverging by as many as 35 amino acid differences from any protein in the dataset.

The application of generative models to multiple alignments of protein families is not new. Markov random field models with pairwise couplings between residues are the basis of the most successful unsupervised contact prediction methods [42, 43]. Recent advances in efficient inference methods for these models have been shown to permit the generation of sequences which accurately reproduce the low-order statistics from natural MSAs [44, 49]. While their potential as the basis of protein design approaches has been noted [28], to our knowledge there has been no attempt to validate this experimentally or to explore the ways in which design might be achieved. Besides this, a number of considerations motivate the exploration of deep generative models like VAEs. VAEs are able to capture higher-order dependencies, can be adapted to straightforwardly and flexibly incorporate conditioning information, and learn continuous latent spaces which offer various possibilities for controlled generation, including the local sampling strategy employed here, or, given appropriate auxiliary data, the conversion of sequence optimisation problems to continuous optimisation problems in latent space [18]. Finally, VAEs provide a flexible modelling framework which can be extended to handle raw sequence data, permitting the generation of full-length sequences and offering the possibility of training on sequence data from multiple families.

Previously, it was shown that VAE models trained on multiple sequence alignments of single protein families could be used to predict the fitness consequences of mutations, by comparing the approximate likelihoods of mutant and wild-type sequences under the model [22]. Unlike [22] we focus on the generative capabilities of VAE models, and, when working with aligned sequence input, filter the alignment in such a way as to allow the generation of full variant sequences of a single target protein. While we found VAEs trained on aligned sequence data to be effective at reliably generating functional variants at a range of distances to a target protein, there are nonetheless shortcomings to this approach. Building a training dataset for this model requires the construction of a large multiple sequence alignment. Even where sufficient related sequences are available this poses challenges. Such alignments will often have a very large number of columns, and while a relevant subset of columns can be retained, as done here, this restricts the sequence variety that can be generated, since only one or a handful of sequences will be fully represented in the retained columns.

Moreover, the construction of large alignments remains a difficult problem, with a trade-off in this context between the decline in alignment accuracy associated with the arbitrary addition of extra homologous sequences [29] and the desire to use large numbers of sequences to exploit the capacity of the model and fit it reliably.

The avoidance of these issues is a key advantage of models trained on raw sequence data. In principle such models would make it possible to leverage the full variety of protein sequence information by, for example, pretraining models on entire protein sequence databases [30, 31]. Conditional VAEs with feed-forward decoders have previously been used to generate sequences with specified metal binding sites or structural topologies when trained on raw sequences spanning multiple families [25]. Other recent work has used deep autoregressive models without latent variables to handle raw sequence data, inspired by similar approaches in natural language processing [30, 27]. Here, seeking to combine the advantages of latent variable models and autoregressive models, we showed that a VAE with an autoregressive decoder could be used to generate realistic sequences when trained on the members of a single family, and provided experimental validation of the function of a number of generated luxA variants. However, we also found that this model seemed to be less effective at capturing the long-range dependencies between amino acids at different positions than the MSA-based model. Closing this performance gap is an important challenge for future work if the potential advantages of training models on raw sequence data are to be fully realised. Other sequence models such as transformers might be better suited to capturing long-range interactions and have already shown promise in modelling proteins [31, 32]. More fundamentally, evaluating the capability of alternative kinds of model in silico requires quantitative measures of the quality of generated sequences, and this, too, remains a difficult problem.

The successful generation of full, functional variants at a range of distances to a given engineering target opens the door to a multitude of applications in the field of protein design. Here we showed that using conditional variants of the models it was possible to generate new variants with modified biophysical properties in a controlled way, by generating variants of a luxA protein with increased solubility relative to wild type. More generally, a model capable of guiding exploration of distant regions of functional sequence space could be used to significantly improve the efficiency of existing design approaches [50], or to constrain the search of sequence space for proteins with desirable properties [33, 18]. The ability to generate novel sequences with a desired function is an important desideratum in protein engineering approaches, and while here we have shown that straightforward conditioned generation is sufficient to generate novel, functional sequences satisfying basic biochemical criteria, we expect that combining these kinds of methods with existing engineering techniques will result in even more powerful and widely applicable methods for protein sequence design.

3 Methods

3.1 Dataset Construction

Selection of sequences All sequences containing a luciferase-like domain (IPR011251) were downloaded from InterPro [34]. Sequences longer than 504 amino acids were discarded. The remaining 69130 sequences were clustered using mmseqs2 [35] with a sequence identity threshold of 70%. To create a validation set, clusters were randomly removed from the training set until the number of sequences in all of the removed clusters was 20% of the total. The same train/validation split was used for models irrespective of whether they took as input aligned or unaligned versions of the sequences.

Multiple sequence alignment To create a multiple sequence alignment from the dataset, the full set of training and validation sequences was aligned using Clustal Omega [29], using the profile HMM of the bacterial luciferase family from PFAM [36] as an external profile alignment to guide the creation of the MSA. The resulting MSA was very wide, presenting potential modelling challenges. To circumvent these, only a subset of columns were retained on the basis of the target protein (details below).

Input representation All sequences are represented as fixed size matrices by one-hot encoding the amino acids, such that a sequence of length L is represented by an L×21 matrix (a gap/padding character is used together with the 20 standard amino acids). When raw sequences are used as input, a fixed input size is ensured by right padding sequences up to a length of 504 (and dropping sequences exceeding this length). When aligned sequences are used as input, all columns in the MSA which are assigned gaps in the alignment of the target luxA protein P19839 are dropped, leaving 360 columns.
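
As a concrete illustration of this encoding, the sketch below (plain Python/numpy; not the authors' code, and the symbol ordering of the alphabet and the use of the gap character for padding are assumptions) builds the one-hot matrices and performs the MSA column filtering described above:

    import numpy as np

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 standard amino acids plus a gap/padding symbol (ordering assumed)
    AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
    MAX_LEN = 504  # raw-sequence models; aligned-sequence models instead use the 360 retained MSA columns

    def one_hot_encode(seq, length=MAX_LEN):
        """Encode a protein sequence as a (length x 21) one-hot matrix, right-padded."""
        x = np.zeros((length, len(ALPHABET)), dtype=np.float32)
        for i, aa in enumerate(seq[:length]):
            x[i, AA_INDEX.get(aa, AA_INDEX["-"])] = 1.0
        x[len(seq):, AA_INDEX["-"]] = 1.0  # mark padding positions with the gap/padding symbol
        return x

    def filter_msa_columns(msa_rows, target_row):
        """Keep only the MSA columns that are not gaps in the target (luxA P19839) row."""
        keep = [j for j, c in enumerate(target_row) if c != "-"]
        return ["".join(row[j] for j in keep) for row in msa_rows]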

3.2 Variational Autoencoders

VAEs [11] posit a set of latent variables z associated with each input x and model the joint distribution p(x, z) = pθ(x|z)p(z) of the latents and the observations. The distribution pθ(x|z) over the values of the observed variables given the latents is parametrised by a neural network (the ‘decoder’) with weights θ, and p(z) is a prior over the latents, typically chosen to be a factorised Gaussian distribution. An inference model qφ(z|x) parametrised by a second neural network (the ‘encoder’) is introduced to approximate the intractable posterior pθ(z|x) = pθ(x|z)p(z) / ∫ pθ(x|z)p(z) dz, allowing the construction of a training objective representing a lower bound on the log-likelihood (the Evidence Lower Bound or ELBO):

L(φ, θ; x) = Eqφ(z|x)[log pθ(x|z)] − DKL(qφ(z|x) || p(z)) ≤ log pθ(x)    (1)

Jointly maximising this objective over a set of training examples with respect to the weights of the two networks enables the generative model and the inference model to be learned simultaneously.

The VAE framework offers flexibility in the architectures of the encoder and decoder networks. In a standard setup in which feed-forward networks are used for both encoder and decoder, the observed variables are conditionally independent given the latents. A more flexible output distribution can be obtained by instead decoding autoregressively [16, 37]. That is, given a latent vector z, the output sequence x = (x1, ..., xL) is generated one position at a time, with the decoder modelling the conditional distributions pθ(xi|x1, ..., xi−1, z) of each output xi given the values of its predecessors and the latents. This corresponds to modelling the distribution of x given z in the factorised form pθ(x|z) = ∏i pθ(xi|x1, ..., xi−1, z).

A VAE can straightforwardly be adapted to model the distribution of the data conditioned on auxiliary variables c by conditioning the encoder and decoder networks on these variables [51]. The objective then becomes

L(φ, θ; x) = Eqφ(z|x,c)[log pθ(x|z, c)] − DKL(qφ(z|x, c) || p(z|c))    (2)
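
To make the objective concrete, the following is a minimal sketch (PyTorch, for illustration only; not the authors' implementation) of the negative ELBO of Equation 1 for one-hot encoded sequences, assuming a factorised Gaussian posterior parametrised by mu and logvar; the conditional objective of Equation 2 is obtained simply by passing c as an additional input to both networks:

    import torch
    import torch.nn.functional as F

    def negative_elbo(logits, x_onehot, mu, logvar):
        """Negative ELBO for a VAE with q(z|x) = N(mu, diag(exp(logvar))) and a standard normal prior.
        logits: decoder outputs of shape (batch, L, 21); x_onehot: inputs of the same shape."""
        # Reconstruction term: categorical cross-entropy summed over sequence positions
        recon = -(x_onehot * F.log_softmax(logits, dim=-1)).sum(dim=(-2, -1))
        # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
        kl = -0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
        return (recon + kl).mean()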

3.3 Model architecture

MSA VAE Both the encoder and the decoder are fully connected neural networks with two hidden layers. We experimented with a range of layer sizes and latent dimensions, settling on 256 units per hidden layer and a latent dimensionality of 10 unless otherwise specified. ReLU activations were used for hidden units, and softmax activations for the output units.
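
A minimal sketch of this architecture (written here in PyTorch purely for illustration; details not stated in the text, such as the exact output parametrisation, are assumptions):

    import torch
    import torch.nn as nn

    SEQ_LEN, N_SYMBOLS, HIDDEN, LATENT_DIM = 360, 21, 256, 10  # sizes taken from the text

    class MSAVAE(nn.Module):
        """Fully connected encoder/decoder with two hidden layers each, as described above."""
        def __init__(self):
            super().__init__()
            d_in = SEQ_LEN * N_SYMBOLS
            self.encoder = nn.Sequential(nn.Linear(d_in, HIDDEN), nn.ReLU(),
                                         nn.Linear(HIDDEN, HIDDEN), nn.ReLU())
            self.to_mu = nn.Linear(HIDDEN, LATENT_DIM)
            self.to_logvar = nn.Linear(HIDDEN, LATENT_DIM)
            self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, HIDDEN), nn.ReLU(),
                                         nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
                                         nn.Linear(HIDDEN, d_in))

        def forward(self, x_onehot):                      # x_onehot: (batch, 360, 21)
            h = self.encoder(x_onehot.flatten(1))
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
            logits = self.decoder(z).view(-1, SEQ_LEN, N_SYMBOLS)  # softmax is applied in the loss
            return logits, mu, logvar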

AR-VAE We used a convolutional neural network for the encoder, consisting of 5 layers of 1D convolutions of width 2. Apart from the first layer, the convolutions were applied with a stride of 2. The first layer used 21 filters; this number was doubled in each successive layer. PReLU activations were used, and batch normalization was applied in each layer. The size of the latent dimension was 50.

The decoder consisted of two components, similar to the decoder in a ‘hybrid’ autoregressive VAE model developed for text [38]: an ‘upsampling’ component, which contained 3 layers of transposed convolutions to ‘upsample’ the latent vector to a sequence of the same length as the output sequence; and an autoregressive component, which was a GRU with 512 units which took as input at each timestep the full sequence of previous amino-acids and upsampled latent information, and output the predicted identity of the amino acid at the next timestep.

Optimization difficulties have been reported when training VAEs with powerful autoregressive decoders [16, 38]. To address these we followed [16] in applying dropout to the amino acid context supplied as input to the GRU during training, such that 45% of the amino acid context was masked out, forcing the network to rely on the information transmitted via the upsampled latent code together with the conditional information in the masked amino acid sequence to make its prediction.
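
The following sketch illustrates the overall structure (again in PyTorch for illustration only; the encoder pooling, the replacement of the transposed-convolution stack by a single linear upsampling layer, and the padding choices are simplifying assumptions, not the authors' implementation):

    import torch
    import torch.nn as nn

    MAX_LEN, N_SYMBOLS, LATENT_DIM = 504, 21, 50

    class ARVAEEncoder(nn.Module):
        """Five width-2 1D convolutions, stride 2 after the first, 21 filters doubling each layer."""
        def __init__(self):
            super().__init__()
            layers, c_in = [], N_SYMBOLS
            for i, c_out in enumerate([21, 42, 84, 168, 336]):
                layers += [nn.Conv1d(c_in, c_out, kernel_size=2, stride=1 if i == 0 else 2),
                           nn.BatchNorm1d(c_out), nn.PReLU()]
                c_in = c_out
            self.conv = nn.Sequential(*layers)
            self.to_mu = nn.Linear(c_in, LATENT_DIM)
            self.to_logvar = nn.Linear(c_in, LATENT_DIM)

        def forward(self, x_onehot):                             # x_onehot: (batch, MAX_LEN, 21)
            h = self.conv(x_onehot.transpose(1, 2)).mean(dim=2)  # pool to a fixed-size vector (assumed)
            return self.to_mu(h), self.to_logvar(h)

    class ARVAEDecoder(nn.Module):
        """Hybrid decoder: upsample z to one vector per position, then a 512-unit GRU predicts each
        amino acid from the upsampled code plus the dropout-masked previous residues (teacher forcing)."""
        def __init__(self, up_channels=32, context_dropout=0.45):
            super().__init__()
            self.upsample = nn.Linear(LATENT_DIM, MAX_LEN * up_channels)  # stand-in for the transposed convolutions
            self.up_channels = up_channels
            self.context_dropout = context_dropout
            self.gru = nn.GRU(up_channels + N_SYMBOLS, 512, batch_first=True)
            self.out = nn.Linear(512, N_SYMBOLS)

        def forward(self, z, x_prev_onehot):                     # x_prev_onehot: inputs shifted right by one
            up = self.upsample(z).view(-1, MAX_LEN, self.up_channels)
            if self.training:                                    # mask whole context positions ('word dropout')
                keep = (torch.rand(x_prev_onehot.shape[:2], device=x_prev_onehot.device)
                        > self.context_dropout).float().unsqueeze(-1)
                x_prev_onehot = x_prev_onehot * keep
            h, _ = self.gru(torch.cat([up, x_prev_onehot], dim=-1))
            return self.out(h)                                   # per-position logits over the 21 symbols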

Conditioning on predicted solubility Solubility predictions were made for all proteins by running the protein-sol [48] software on the full sequences. The resulting solubility predictions were continuous values ranging between 0 and 1. We discretized this information by binning the sequences into 3 equally-sized bins corresponding to low, mid and high solubility. Bin membership was fed as a one-hot encoded categorical variable as additional input to both encoder and decoder in conditional versions of the models.
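
A minimal sketch of this binning step, assuming terciles of the protein-sol scores are used as bin boundaries:

    import numpy as np

    def solubility_bins(scores):
        """Assign each sequence to one of three equally sized bins (0 = low, 1 = mid, 2 = high) based on
        its protein-sol predicted solubility, and one-hot encode the bin label for conditioning."""
        scores = np.asarray(scores, dtype=np.float64)
        cutoffs = np.quantile(scores, [1 / 3, 2 / 3])   # tercile boundaries
        labels = np.digitize(scores, cutoffs)
        onehot = np.eye(3, dtype=np.float32)[labels]
        return labels, onehot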

3.4 Details of Model Training and Selection Procedures

Models were trained using SGD with a batch size of 32 and the validation set was used to monitor performance at the end of each epoch as measured by ELBO loss and amino-acid reconstruction accuracy. Unless otherwise specified, the Adam optimizer was used with a learning rate of 0.001. We also monitored the reconstruction accuracy of P19839, the luxA protein which had been selected for synthesis. This single-datapoint reconstruction accuracy was considered when choosing model hyperparameters together with the other two metrics and the evaluations described above. In particular, weights from the epoch which showed the best reconstruction accuracy of P19839 without evidence of overfitting (i.e. increase in validation loss) were saved and used to generate the variants of P19839 that were tested experimentally.

3.5 Computational analysis of generated sequences

For each model, a set of 3000 sequences was generated by sampling latent vectors from the prior and decoding greedily. Before further analysis, we constructed an alignment of the sequences generated by AR-VAE to the training sequences, following the procedure used to create the training alignment for MSA VAE. In detail, we ran Clustal Omega on the sequences from the training set together with the generated sequences, using the profile HMM to guide the alignment, and subsequently dropped all columns unoccupied in the row corresponding to the P19839 luxA sequence. We used the EVCouplings python package [52] to compute both single site amino acid frequencies and pairwise co-occurrence frequencies separately for the aligned generated sequences and the aligned training sequences. Gap frequencies were not included in the comparison between generated and training statistics. To infer couplings we ran CCMPred on the aligned generated sequences. We used EVCouplings to compare the resulting couplings to contacts in the 3D structure of the luxA protein.
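
For illustration, the statistics compared in Figure 3 can be computed from a one-hot encoded alignment with plain numpy as sketched below (the analysis above used the EVCouplings package; this sketch ignores gap handling and sequence weighting):

    import numpy as np

    def alignment_statistics(X):
        """X: one-hot encoded alignment of shape (n_sequences, n_columns, n_symbols).
        Returns single-site frequencies f_i(a), pairwise co-occurrence frequencies f_ij(a, b)
        and covariances C_ij(a, b) = f_ij(a, b) - f_i(a) f_j(b)."""
        n, L, q = X.shape
        f_i = X.mean(axis=0)                                  # (L, q)
        flat = X.reshape(n, L * q).astype(np.float32)
        f_ij = (flat.T @ flat / n).reshape(L, q, L, q)        # (L, q, L, q)
        c_ij = f_ij - np.einsum('ia,jb->iajb', f_i, f_i)
        return f_i, f_ij, c_ij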

3.6 Selection of variants for synthesis

In total 48 sequences were synthesised and tested for function as luciferase luxA subunits. These corresponded to 12 sequences for each of four models: unconditional and conditional versions of the MSA VAE trained on aligned sequences, and unconditional and conditional versions of the AR-VAE model trained on raw sequences.

Unconditional generation Unconditional models trained on aligned and unaligned sequence data were used to generate sets of candidate luminescent luxA proteins. 12 sequences were selected for synthesis from each model. To generate the sequences, first the encoded P19839 luxA sequence was passed through the encoder of the VAE model to obtain the mean and covariance for the diagonal Gaussian posterior distribution over the latent variables. To enhance diversity, each dimension of the posterior variance was scaled up by a fixed factor (of 4 for MSA VAE, 1 for the AR-VAE model). 500 latent vectors were sampled from the resulting scaled posterior distribution for each model. In the case of AR-VAE a further source of randomness was added to the autoregressive decoding process through temperature sampling (T=0.5). The resulting sequences were separated into 6 approximately equally sized bins based on the number of differences from the input protein. 2 variants were chosen at random from each of these bins for synthesis, allowing variants having a range of distances to the original P19839 sequence to be tested.
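
A sketch of the latent-neighbourhood sampling step (numpy, for illustration; mu and logvar stand for the encoder outputs for the one-hot encoded P19839 sequence):

    import numpy as np

    def sample_latent_neighbourhood(mu, logvar, n_samples=500, variance_scale=4.0, seed=0):
        """Sample latent vectors around the target encoding by scaling the diagonal posterior variance
        (a scale of 4 was used for MSA VAE and 1 for AR-VAE); decoding these yields candidate variants."""
        rng = np.random.default_rng(seed)
        mu, logvar = np.asarray(mu), np.asarray(logvar)
        std = np.exp(0.5 * logvar) * np.sqrt(variance_scale)
        return mu + rng.normal(size=(n_samples, mu.shape[-1])) * std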

Conditional generation Mean and variance parameters for the posterior distribution over the latent variables were obtained by passing the encoded P19839 luxA sequence together with its one-hot encoded original solubility level (low) through the encoder. Sequence diversity was generated differently for the two types of model. For the MSA VAE, 100 latent vectors were sampled from the posterior for each of the two increased solubility levels (mid and high), and sequences were generated by passing the vectors together with the desired conditioning values through the decoder. For the AR-VAE models, the mean latent vector was used to generate all variants, with diversity amongst the 100 candidates generated for each conditioning level coming from temperature sampling (T=0.3). 6 sequences were selected at random from each desired solubility level for each model. To prevent over-similarity amongst the synthesised sequences, members of pairs of selected sequences with fewer than 3 differences between them were replaced at random until all pairs satisfied this diversity criterion.
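A minimal sketch of this diversity filter is shown below. It assumes the candidate pool is large and diverse enough for the replacement loop to terminate, and the function name is illustrative rather than taken from the released code.

import numpy as np

def enforce_pairwise_diversity(selected, pool, min_diff=3, rng=None):
    # Replace members of too-similar pairs until every pair of selected
    # sequences differs in at least `min_diff` positions.
    rng = rng or np.random.default_rng()

    def n_diff(a, b):
        return sum(x != y for x, y in zip(a, b))

    selected = list(selected)
    changed = True
    while changed:
        changed = False
        for i in range(len(selected)):
            for j in range(i + 1, len(selected)):
                if n_diff(selected[i], selected[j]) < min_diff:
                    # Swap one member of the offending pair for a random candidate.
                    selected[j] = pool[rng.integers(len(pool))]
                    changed = True
    return selected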

3.7 Synthesis and cloning of sequence variants

The genes corresponding to the different variants of luxA were synthesized by Twist Bioscience (protein sequences are provided as supplementary dataset 1 and DNA sequences as supplementary dataset 2). The variants were amplified from the synthesized DNA fragments using primers F342/F343 and cloned under the control of the T7 promoter and upstream of the C-terminal His-tag sequence in plasmid pET28/16 [53], amplified with primers F340/F341, through Gibson assembly [54]. All plasmids were verified by Sanger sequencing. To study the function of the variants, the resulting plasmids were introduced into E. coli DH5 alpha carrying all the other genes of the P. luminescens lux operon on plasmid pDB283. This plasmid was obtained by deletion of luxA from plasmid pCM17 [55] by amplification using primers B731/LC545 and B732/LC327, followed by Gibson assembly of the two PCR fragments. Transformants were selected on LB agar containing kanamycin (50 µg/ml) and ampicillin (100 µg/ml). The plasmids described here are readily available from the authors upon request.

3.8 Bioluminescence measurements

Strains were grown in triplicate overnight at 37°C and diluted 1:100 in 1 ml of LB broth containing kanamycin (50 µg/ml) and ampicillin (100 µg/ml) in a 96-well microplate that was incubated at 37°C with shaking in a ThermoMixer (Eppendorf). After 7 h of growth, 100 µl of culture was transferred to a white 96-well microplate (Greiner, Kremsmünster) and luminescence was measured for 10 s on a Centro XS LB 960 microplate luminometer (Berthold Technologies, Thoiry).


Table 1: Oligonucleotide sequences.

Name	Sequence 5’-3’
B731	TAATATAATAGCGAACGTTGAGTACTAAAGTTTCCAAATTTCATAGAGAGTCC
B732	CTATGAAATTTGGAAACTTTAGTACTCAACGTTCGCTATTATATTAGCTAAGG
LC545	TCATTTCGAACCCCAGAGTC
LC327	CGCCTTCTTGACGAGTTCTT
F340	CTTCTTAAAGTTAAACAAAATTATTTCTAGAGG
F341	GAGATCCGGCTGCTAACAAAG
F342	GCTTCCTTTCGGGCTTTGTTAG
F343	GATAACAATTCCCCTCTAGAAATAATTTTG

3.9 Protein solubility measurements

The recombinant plasmids were transformed into E. coli Rosetta cells. The transformants were grown in liquid Luria-Bertani medium containing chloramphenicol (20 µg/ml) and ampicillin (100 µg/ml) at 30°C until mid-exponential phase (OD = 0.8-0.9). Isopropyl-β-D-thiogalactopyranoside (IPTG, 1 mM) was added to induce recombinant protein production and incubation was continued for 3 h. Cells were resuspended in 1 ml of lysis buffer (HEPES 50 mM pH 7.5, NaCl 0.4 mM, EDTA 1 mM, DTT 1 mM, Triton X-100 0.5%, glycerol 10%) and then lysed on ice with a Precellys homogenizer (Bertin Technologies) using the VK01 micro-organism lysing kit, with 5 cycles of 30 s at 7800 rpm separated by 30 s pauses. Soluble proteins were separated from aggregated proteins and cellular debris by centrifugation at 5000 g and 4°C for 20 min. Pellets containing protein aggregates were resuspended in 1 ml of lysis buffer. For Western blot analysis, the samples were prepared in Laemmli buffer with 10% beta-mercaptoethanol and denatured at 95°C for 5 min. Soluble and insoluble fractions were run on a 4-12% Bis-Tris gel by sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE). Proteins were transferred to a nitrocellulose membrane (Invitrogen), which was blocked in 3% skim milk in PBS for 30 min and successively incubated with a primary antibody (anti-His-tag diluted 1:500 or anti-GroEL diluted 1:1000 in blocking buffer) and a secondary antibody (diluted 1:10000) conjugated to DyLight 800 (Tebu), then detected with a chemiluminescent imaging system (LI-COR Odyssey). The His tag was detected using the mouse monoclonal anti-His-tag antibody (Abcam), and the GroEL control was detected with the mouse anti-GroEL monoclonal antibody (Abcam). Three 5-min washes in PBS were performed after each incubation step. For dot blot analysis, 2% SDS and 10% β-mercaptoethanol were added to the samples before denaturing for 10 min at 95°C. 5 µl of each denatured sample was spotted directly onto the nitrocellulose membrane and antibody hybridization was performed as for the Western blot. Protein levels were calculated using the Image Studio software package.
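The quantification itself was performed in Image Studio. Purely for illustration, a solubility percentage could be computed from exported soluble (S) and pellet (P) intensities as sketched below; the S/(S+P) definition is our assumption for this sketch rather than a formula stated in the text.

import numpy as np

def percent_soluble(soluble_intensities, pellet_intensities):
    # Fraction of total signal found in the soluble fraction, averaged over
    # technical replicates (assumed definition, for illustration only).
    s = np.asarray(soluble_intensities, dtype=float)
    p = np.asarray(pellet_intensities, dtype=float)
    return float(np.mean(100.0 * s / (s + p)))

# Example with four hypothetical technical replicates of S and P dot intensities:
print(percent_soluble([820, 790, 860, 810], [410, 450, 430, 400]))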

3.10 Code availability

Python implementations of the models and training procedures are available at https://github.com/alex-hh/deep-protein-generation.

3.11 Conflicts of interest

The authors declare that they have no conflict of interest.


References

[1] Michael S. Packer and David R. Liu. Methods for the directed evolution of proteins. Nature Reviews Genetics, 16(7):379–394, July 2015.

[2] Frances H. Arnold. Directed Evolution: Bringing New Chemistry to Life. Angewandte Chemie International Edition, 57(16):4143–4148, April 2018.

[3] Gabriel J. Rocklin, Tamuka M. Chidyausiku, Inna Goreshnik, Alex Ford, Scott Houliston, Alexander Lemak, Lauren Carter, Rashmi Ravichandran, Vikram K. Mulligan, Aaron Chevalier, Cheryl H. Arrowsmith, and David Baker. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357(6347):168, 2017.

[4] Bassil I. Dahiyat and Stephen L. Mayo. De Novo Protein Design: Fully Automated Sequence Selection. Science, 278(5335):82–87, October 1997.

[5] Christina M. Kraemer-Pecore, Juliette T. J. Lecomte, and John R. Desjarlais. A de novo redesign of the WW domain. Protein Science, 12(10):2194–2205, October 2003.

[6] William P. Russ, Drew M. Lowery, Prashant Mishra, Michael B. Yaffe, and Rama Ranganathan. Natural-like function in artificial WW domains. Nature, 437(7058):579–583, September 2005.

[7] Pehr B. Harbury, Joseph J. Plecs, Bruce Tidor, Tom Alber, and Peter S. Kim. High-Resolution Protein Design with Backbone Freedom. Science, 282(5393):1462–1467, November 1998.

[8] Brian Kuhlman, Gautam Dantas, Gregory C. Ireton, Gabriele Varani, Barry L. Stoddard, and David Baker. Design of a Novel Globular Protein Fold with Atomic-Level Accuracy. Science, 302(5649):1364–1368, November 2003.

[9] Po-Ssu Huang, Scott E. Boyken, and David Baker. The coming of age of de novo protein design. Nature, 537(7620):320–327, September 2016.

[10] Ian Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv:1701.00160 [cs], December 2016.

[11] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

[13] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[14] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499 [cs], September 2016.

[15] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel Recurrent Neural Networks. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1747–1756. JMLR.org, 2016.

[16] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany, August 2016. Association for Computational Linguistics.

[17] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. 2019.


[18] Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science, 4(2):268–276, February 2018.

[19] Sheng Wang, Siqi Sun, Zhen Li, Renyu Zhang, and Jinbo Xu. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS Computational Biology, 13(1), January 2017.

[20] M. Spencer, J. Eickholt, and J. Cheng. A Deep Learning Network Approach to ab initio Protein Secondary Structure Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(1):103–112, January 2015.

[21] Sheng Wang, Jian Peng, Jianzhu Ma, and Jinbo Xu. Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields. Scientific Reports, 6:18962, January 2016.

[22] Adam J. Riesselman, John B. Ingraham, and Debora S. Marks. Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 15(10):816–822, October 2018.

[23] Jérôme Tubiana, Simona Cocco, and Rémi Monasson. Learning protein constitutive motifs from sequence data. eLife, 8:e39397, March 2019.

[24] Payel Das, Kahini Wadhawan, Oscar Chang, Tom Sercu, Cicero Dos Santos, Matthew Riemer, Vijil Chenthamarakshan, Inkit Padhi, and Aleksandra Mojsilovic. PepCVAE: Semi-Supervised Targeted Design of Antimicrobial Peptide Sequences. arXiv:1810.07743 [cs, q-bio, stat], October 2018.

[25] Joe G. Greener, Lewis Moffat, and David T. Jones. Design of metalloproteins and novel protein folds using variational autoencoders. Scientific Reports, 8(1):16189, November 2018.

[26] Donatas Repecka, Vykintas Jauniskis, Laurynas Karpus, Elzbieta Rembeza, Jan Zrimec, Simona Poviloniene, Irmantas Rokaitis, Audrius Laurynenas, Wissam Abuajwa, Otto Savolainen, Rolandas Meskys, Martin K. M. Engqvist, and Aleksej Zelezniak. Expanding functional protein sequence space using generative adversarial networks. bioRxiv, page 789719, January 2019.

[27] Adam Riesselman, Jung-Eun Shin, Aaron Kollasch, Conor McMahon, Elana Simon, Chris Sander, Aashish Manglik, Andrew Kruse, and Debora Marks. Accelerating Protein Design Using Autoregressive Generative Models. bioRxiv, page 757252, September 2019.

[28] Simona Cocco, Christoph Feinauer, Matteo Figliuzzi, Remi Monasson, and Martin Weigt. Inverse Statistical Physics of Protein Sequences: A Key Issues Review. Reports on Progress in Physics, 81(3):032601, March 2018.

[29] Fabian Sievers and Desmond G. Higgins. Clustal Omega for making accurate alignments of many protein sequences. Protein Science, 27(1):135–145, 2018.

[30] Ethan C. Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M. Church. Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 16(12):1315–1322, 2019.

[31] Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. bioRxiv, page 622803, April 2019.

[32] Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John F. Canny, Pieter Abbeel, and Yun S. Song. Evaluating Protein Transfer Learning with TAPE. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 9686–9698, 2019.


[33] David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In ICML, pages 773–782, May 2019.

[34] Sarah Hunter, Rolf Apweiler, Teresa K. Attwood, Amos Bairoch, Alex Bateman, David Binns, Peer Bork, Ujjwal Das, Louise Daugherty, Lauranne Duquenne, Robert D. Finn, Julian Gough, Daniel Haft, Nicolas Hulo, Daniel Kahn, Elizabeth Kelly, Aurélie Laugraud, Ivica Letunic, David Lonsdale, Rodrigo Lopez, Martin Madera, John Maslen, Craig McAnulla, Jennifer McDowall, Jaina Mistry, Alex Mitchell, Nicola Mulder, Darren Natale, Christine Orengo, Antony F. Quinn, Jeremy D. Selengut, Christian J. A. Sigrist, Manjula Thimma, Paul D. Thomas, Franck Valentin, Derek Wilson, Cathy H. Wu, and Corin Yeats. InterPro: the integrative protein signature database. Nucleic Acids Research, 37(Database issue):D211–215, January 2009.

[35] Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11):1026–1028, 2017.

[36] Robert D. Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y. Eberhardt, Sean R. Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, Erik L. L. Sonnhammer, John Tate, and Marco Punta. Pfam: the protein families database. Nucleic Acids Research, 42(D1):D222–D230, January 2014.

[37] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A Latent Variable Model for Natural Images. November 2016.

[38] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A Hybrid Convolutional Variational Autoencoder for Text Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 627–637, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[39] William Ramsay Taylor. The classification of amino acid conservation. Journal of Theoretical Biology, 119(2):205–218, March 1986.

[40] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

[41] David T. Jones, Daniel W. A. Buchan, Domenico Cozzetto, and Massimiliano Pontil. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics, 28(2):184–190, January 2012.

[42] Debora S. Marks, Lucy J. Colwell, Robert Sheridan, Thomas A. Hopf, Andrea Pagnani, Riccardo Zecchina, and Chris Sander. Protein 3D Structure Computed from Evolutionary Sequence Variation. PLOS ONE, 6(12):e28766, December 2011.

[43] Stefan Seemayer, Markus Gruber, and Johannes Söding. CCMpred – fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics, 30(21):3128–3130, November 2014.

[44] Matteo Figliuzzi, Pierre Barrat-Charlaix, and Martin Weigt. How Pairwise Coevolutionary Models Capture the Collective Residue Variability in Proteins? Molecular Biology and Evolution, 35(4):1018–1027, 2018.

[45] Zachary T. Campbell, Andrzej Weichsel, William R. Montfort, and Thomas O. Baldwin. Crystal structure of the bacterial luciferase/flavin complex provides insight into the function of the beta subunit. Biochemistry, 48(26):6085–6094, July 2009.

[46] Wei Wang. Instability, stabilization, and formulation of liquid protein pharmaceuticals. International Journal of Pharmaceutics, 185(2):129–188, August 1999.

[47] Jean-Denis Pédelacq, Emily Piltch, Elaine C. Liong, Joel Berendzen, Chang-Yub Kim, Beom-Seop Rho, Min S. Park, Thomas C. Terwilliger, and Geoffrey S. Waldo. Engineering soluble proteins for structural genomics. Nature Biotechnology, 20(9):927–932, September 2002.


[48] Max Hebditch, M. Alejandro Carballo-Amador, Spyros Charonis, Robin Curtis, and Jim Warwicker. Protein–Sol: a web tool for predicting protein solubility from sequence. Bioinformatics, 33(19):3098–3100, October 2017.

[49] Susann Vorberg, Stefan Seemayer, and Johannes Söding. Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction. PLOS Computational Biology, 14(11):e1006526, November 2018.

[50] Kevin K. Yang, Zachary Wu, and Frances H. Arnold. Machine-learning-guided directed evolution for protein engineering. Nature Methods, 16(8):687–694, August 2019.

[51] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised Learning with Deep Generative Models. In NIPS, 2014.

[52] Thomas A. Hopf, Anna G. Green, Benjamin Schubert, Sophia Mersmann, Charlotta P. I. Schärfe, John B. Ingraham, Agnes Toth-Petroczy, Kelly Brock, Adam J. Riesselman, Perry Palmedo, Chan Kang, Robert Sheridan, Eli J. Draizen, Christian Dallago, Chris Sander, and Debora S. Marks. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics, 35(9):1582–1584, May 2019.

[53] Arnaud Chastanet, Juliette Fert, and Tarek Msadek. Comparative genomics reveal novel heat shock regulatory mechanisms in Staphylococcus aureus and other Gram-positive bacteria. Molecular Microbiology, 47(4):1061–1073, 2003.

[54] Daniel G. Gibson, Lei Young, Ray-Yuan Chuang, J. Craig Venter, Clyde A. Hutchison, and Hamilton O. Smith. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nature Methods, 6(5):343–345, May 2009.

[55] Ki-Jong Rhee, Hao Cheng, Antoneicka Harris, Cara Morin, James B. Kaper, and Gail Hecht. Determination of spatial and temporal colonization of enteropathogenic E. coli and enterohemorrhagic E. coli in mice using bioluminescent in vivo imaging. Gut Microbes, 2(1):34–41, January 2011.


4 Supplementary Information

Figure 1: Quantitative analysis of solubility for each variant of LuxA in comparison with wild-type LuxA. A. His-tagged LuxA was quantified by Western blot in the supernatant (S, soluble fraction) and in the pellet (P, insoluble fraction), compared to the empty vector pET28. GroEL was used as a loading control. The His tag was detected using the mouse monoclonal anti-His-tag antibody, whereas the GroEL control was detected with the mouse anti-GroEL monoclonal antibody. Molecular masses are given in kilodaltons and indicated to the right of the membrane. Arrowheads indicate the position of the recombinant proteins. The solubility levels of each variant of LuxA generated from aligned (B) or raw (C) sequences were analysed by dot blotting. Aliquots of 5 µl of soluble (S) and insoluble (P) fractions from IPTG-induced Rosetta cells overexpressing LuxA variants were spotted on a nitrocellulose membrane and their intensities were quantified using the Image Studio software package. The dot blots of the 4 technical replicates used to compute solubility are shown.


Table 1: Amino acid reconstruction accuracies on held-out clusters, by the cluster sequence identity threshold used to construct the train/test split. Baseline VAE is a baseline model with the same architecture as MSA VAE but the same latent dimension as AR-VAE, trained on raw sequence data.

Clustering threshold	30%	50%	70%
AR-VAE	33.4%	47.5%	54.6%
Baseline VAE	16.1%	28.4%	39.5%
MSA VAE	37.2%	54.7%	64.0%
