A tractable genotype-phenotype map for the self-assembly of protein quaternary structure

Sam F. Greenbury,1 Iain G. Johnston,2 Ard A. Louis,3 and Sebastian E. Ahnert1 1Theory of Condensed Matter Group, Cavendish Laboratory, University of Cambridge, UK 2Department of Mathematics, Imperial College London, London, UK 3Rudolf Peierls Centre for Theoretical Physics, University of Oxford, UK (Dated: October 15, 2018) The mapping between biological genotypes and phenotypes is central to the study of biological evolution. Here we introduce a rich, intuitive, and biologically realistic genotype-phenotype (GP) map, that serves as a model of self-assembling biological structures, such as protein complexes, and remains computationally and analytically tractable. Our GP map arises naturally from the self-assembly of polyomino structures on a 2D lattice and exhibits a number of properties: redundancy (genotypes vastly outnumber phenotypes), phenotype bias (genotypic redundancy varies greatly between phenotypes), genotype component disconnectivity (phenotypes consist of disconnected mutational networks) and shape space covering (most phenotypes can be reached in a small number of mutations). We also show that the mutational robustness of phenotypes scales very roughly logarithmically with phenotype redundancy and is positively correlated with phenotypic evolvability. Although our GP map describes the assembly of disconnected objects, it shares many properties with other popular GP maps for connected units, such as models for RNA secondary structure or the HP lattice model for protein tertiary structure. The remarkable fact that these important properties similarly emerge from such different models suggests the possibility that universal features underlie a much wider class of biologically realistic GP maps.

Keywords: genotype-phenotype (GP) map, self- modelled with notable success. Firstly, genetic regula- assembly, robustness, evolvability, polyomino, protein tory networks have been approximated using a variety quaternary structure of abstract models, including Boolean networks [6, 7]. Despite their simplicity, Boolean networks have demon- strated a remarkable ability to produce biologically re- I. INTRODUCTION alistic results. For example, they have been shown to reproduce key aspects of the yeast cell cycle [7]. Sec- ondly, RNA secondary structure can be routinely and Evolution is one of the most fundamental principles accurately predicted via a host of different methods [8], in biology. While organismal genotypes are becoming as a result of which it has become one of the best-known accessible due to rapid advances in sequencing technol- GP maps [4, 9], particularly for the study of evolutionary ogy, further understanding of the complicated mapping dynamics [9, 10]. Finally, the complex problem of a pro- from sequence to phenotype is necessary for a richer un- tein folding into well defined tertiary structure has been derstanding of evolutionary dynamics [1–4]. While the investigated using various models including the highly terms genotype and phenotype can be flexibly assigned simplified HP lattice model, where folds are represented in a biological system, genotypes are generally defined as self-avoiding walks on a lattice, and the full sequence as the genetic material upon which mutations act and is reduced to binary alphabet (H stands for hydrophobic phenotypes capture the properties of the organism on and P for polar amino acids) [11]. Despite its heavily which selection can differentiate. As such, the mapping coarse-grained nature, this model has produced impor- from genotype to phenotype – the GP map – links muta- arXiv:1311.0399v1 [q-bio.PE] 2 Nov 2013 tant biological insights [12] and has been shown to accu- tions to potentially selectable variation, and is therefore rately model known features of protein tertiary structure of critical importance in understanding evolutionary sys- [13]. While these models have been successful for specific tems. GP maps also provide a basis for understanding biological examples, another very important advantage important biological concepts such as mutational robust- of their tractability is the potential for extracting more ness and evolvability, which may profoundly affect evo- general underlying principles of GP maps, building on lutionary dynamics, and help determine the fundamen- our understanding of how evolution shapes the natural tal topologies of the landscapes upon which evolutionary world [2, 4]. processes occur [5]. In general it is intractable to directly model the de- In this paper we extend this family of coarse-grained tails of even small parts of a whole organismal GP map, models for biological structure, exploring a very recently due to both the very large numbers of genotypes, and a introduced model for tile self-assembly [18, 19] which can lack of knowledge of all possible phenotypes. Advances be viewed as a highly coarse-grained GP map for protein have been made in recent years, however, with the use quaternary structure (protein complexes). Understand- of simplified models. Three particular systems have been ing the formation of protein complexes is important as 2

Biological Model

RNA secondary structure

Protein tertiary

structure

A B

Protein quaternary B structure A A B B A

FIG. 1. A comparison of how three different biological structures may be represented in a corresponding model system. The top row compares a version of the iconic hammerhead (PDB reference 1RMN [14]) with its secondary structure representation (showing the bonding pattern of nucleotides) produced by the Vienna RNA package [15] shown alongside. The orange, blue and green colours in each part represent the bonded stems in the structure. The middle row depicts a cartoon of the tertiary structure of a single chain of length 21 (chain A) from an insulin protein (PDB reference 1APH [16]) which is compared to a schematic HP lattice protein interpretation on the right. The orange and blue colours are used to demonstrate the structural feature of alpha helices in the pictures. Finally in the bottom row, we show a protein complex (PDB reference 1BKD [17]) alongside a polyomino representation. The orange and blue colours represent the different subunits involved in the protein. The ability of the polyomino representation to capture the C4 symmetry of 1BKD is apparent from the rotation of the labelling on the subunits. demonstrated by the fact that most proteins form com- Invaluable resources for the study of protein complexes plexes in the cell (around 80% in yeast [20]) and the func- are the Protein Data Bank, providing a database contain- tion of these complexes is often strongly linked to their ing experimentally known protein quaternary structures physical form. Protein complexes are formed by the in- [22], and the 3D complex database [23] categorising PDB teraction of multiple individual protein chains to form structures with a graph representation of the interactions larger structures. The interaction between two chains between different subunits and a characterisation of the is predominantly mediated by hydrophobic forces acting symmetry in a complex. to pack together non-polar amino acids to provide fewer energetically unfavourable interactions with water [21]. The relationship between the topology of a protein complex and the individual amino acids that make up 3 its protein chains is highly complex due to the multiple ture. But, in principle, the Polyomino model can also functionality encoded in the protein sequence. For ex- capture the more general phenomenology of unbounded ample, correct tertiary structure, folding pathways and and non-deterministic assembly [18, 19]. Moreover, tiling other inter-protein interactions are all potential require- models have been used to study synthetic self-assembling ments for a single protein chain. Given these complexi- systems [26], and a better understanding of the design ties, a direct and complete GP map of protein quaternary space of polyominoes may aid in the design of these ar- structure is intractable as including all required function- tificial systems. ality would be unfeasible. Instead, in the spirit of the This article proceeds as follows. We first describe in highly simplified HP model for protein tertiary struc- detail the Polyomino model and some of its fundamental ture, we represent the proteins as square tiles on a lattice properties (Section II). We then analyse a wide range of [18, 19]. Interactions between tiles model the protein- properties of the resultant GP map and compare these protein interactions that lead to self-assembly. In the properties to those described, for example in [27], for the model, a genotype is a sequence of characters describ- HP model and the RNA secondary structure map. In ing interactions on the edges of the tiles, which, when particular, we show in Sections III A-III C that the map- combined with a self-assembly process on a 2D square ping from sequences to phenotypes for all three models lattice, leads to the formation of phenotypes comprising share the following general properties: redundancy (there of different square tile building blocks conjoined along in- are many more genotypes than phenotypes) leading to teracting edges. These square tile structures are known large neutral sets (the collection of all genotypes that as polyominoes and thus, we refer to the model as the map to a given phenotype) and phenotype bias (some Polyomino model and the resulting GP map as the Poly- phenotypes are associated with many more genotypes omino GP map. They are closely related to a wider class than others). A more fine-grained analysis shows that of lattice tiling models that have a long and important the neutral sets also exhibit component disconnectivity history in mathematics and computer science (which we (not all genomes in a neutral set can be linked with sin- discuss further in Section II). gle mutational steps). We proceed with a more detailed Despite these great simplifications, and in analogy with comparison of the Polyomino and RNA systems, through the highly schematic HP model, RNA secondary struc- considering shape space covering (most phenotypes can ture models or Boolean network models of GRNs, we be reached from any other phenotype with just a small expect the Polyomino model to provide insight into the number of mutations), before showing the mean muta- general structure of the full GP map for the formation tional robustness of a phenotype (the phenotypic robust- of protein complexes from folded proteins. In fact, our ness) scales very roughly logarithmically with the redun- earlier work [19] has already discussed the evolutionary dancy of a phenotype, and finally that it is positively cor- dynamics of the Polyomino model, demonstrating that related with the evolvability (defined here as the number it can be used to rationalise the preference of dihedral of other phenotypes potentially accessible from a pheno- over cyclic symmetries in homomeric protein tetramers type), as postulated to hold more generally by [28]. Fi- [24, 25]. nally, in Section IV we discuss some implications of the In Fig. 1, we compare coarse-grained representations remarkable agreement we find between the structure of for RNA secondary structure, protein tertiary structure our Polyomino GP map and those of the better studied and protein complexes. On the left hand side, the bio- RNA secondary structure and HP maps. logical representation is shown and on the right, a given model representation. Through this figure we wish to highlight two points. Firstly, that each of the model II. THE POLYOMINO SELF-ASSEMBLY systems is a dramatically simplified version of the cor- MODEL AND ITS ASSOCIATED GP MAP responding biological system and that the Polyomino model is of a similar order of coarse-graining to previ- The process of tiling and its connection with computer ous models. And secondly, that the Polyomino model science was first developed by Wang [29]. Since then, can provide a concise representation of real protein com- tiling models have been shown to be capable of compu- plexes by capturing features such as the symmetry of the tation and, in particular, Turing-universal computation subunit arrangements (C4 in this case). under the condition that cooperative binding is allowed In contrast to RNA secondary structure or protein between tiles, demonstrating the ability of 2D tiling sys- tertiary structure, where structure forms through the tems to model computational as well as structure-forming folding of a connected string of individual entities (nu- processes [30]. Rothemund and Winfree [31] studied the cleotides or amino acids), protein complexes are built program size complexity necessary to build a structure of by joining separate individual entities (protein chains). a given size. More general considerations of the complex- Furthermore, while for string-like self-assembly the fi- ity of tiled structures have since been discussed in [32], nal structure size is constrained, proteins can form un- with a more biological slant given by Ahnert et al. [18] bounded structures that can be ordered or disordered. In and applications to artificial biological systems discussed our work here, we focus on the subset of deterministic, by Rothemund et al. [26]. finite sized structures to model protein quaternary struc- Here, rather than focussing on these tiles as potential 4

a) Genotype Assembly Kit Phenotype

2 0 0 203201000420 2 0 0 1 0 4 faces encoded 3 0 2 assembly length L=12 clockwise algorithm base K=8 Interface Interaction Matrix

b) Example assembly

iteratively if no 0 more sites 0 0 available 1 2 2 2 2 0 0 2 2 0 2 0 3 0 1 2 0 3 3 0 3 4 0 2 4 0 0 0 2 1 0 0 0

seed tile placed identify available choose a random termination sites available site for next tile

FIG. 2. The Polyomino GP map. (a): The translation from an example genotype to phenotype is depicted, with the intermediate conversion into an assembly kit shown. In the genotype, the characters are underlined with the colour of the block they are assigned to. The interface interaction matrix is not shown explicitly, but throughout our work we assume the convention of interface types interacting in pairs (1 ↔ 2, 3 ↔ 4...), with 0 and c − 1 being neutral. The interaction matrix together with the assembly kit are passed to the assembly algorithm, which is used to produce the phenotype. (b): An example of the assembly process is shown, which proceeds in discrete time steps. The first tile in the assembly kit is used to seed the assembly. Further time steps follow, with the identification of available sites (depicted in light green and blue in the second picture from left), followed by the random choice of an available site and placement of the corresponding tile (a blue tile is placed on the blue available site in this example). Assembly is terminated when there are either no more available sites (as above), or if the structure is larger than Dmax (number of blocks) in width or height after which we assume the structure will no longer terminate but grow indefinitely. computing devices or as models for complexity, we ex- types and interface types interact in unique pairs (1 ↔ 2, plore how they can be used to understand the GP map 3 ↔ 4, ...), with the only fully neutral types being 0 and of a specific biological system, namely the self-assembly c − 1. Defining interface interactions in this way allows of finite-sized protein structures. Nevertheless, we are the single parameter c to control the number of potential aware that some of our conclusions may have applica- unique bond types. As such, the two parameters that tions for a wider class of systems. describe a given parameterisation of the Polyomino GP We now proceed with a more detailed description of map are the tile number Nt and the total number of in- the Polyomino model as a GP map. terface types c, allowing a particular Polyomino GP map

to be labelled as SNt,c. Below we discuss the genotype, phenotype and map used in the Polyomino GP map. A. Summary of the Polyomino GP map

The genotype is modelled as a character string repre- B. Genotype sentation of a set of Nt tiles which make up an assembly kit. The edges of each tile in the assembly kit is given a number which represents the interface type. Interac- The genotype for the set of building blocks in the as- tions between interface types are defined via an interface sembly kit is written as a bit-string of length L in base interaction matrix Aij. In our work here, we only con- K. The base we employ in this paper is the total num- sider one such interaction matrix type, with a total of ber of interface types c used in a given GP map. This c interface types. There are no self-interacting interface allows each base to mutate to any other base at each site. 5

The procedure of converting a genotype string into the 1. Available sites on the lattice are identified. These assembly kit is part of the map. are places on the lattice where a tile may be placed in a particular orientation such that it will form a bond to an adjacent tile that has already been C. Phenotype placed. In the assembly algorithm, a list is kept of the position, tile type and orientation for each possible tile placement. In the context of the self-assembly mapping, there are several ways of classifying a polyomino structure. These 2. A random available site on the lattice is chosen. may include criteria based on its overall shape and the individual tile types making up the final polyomino struc- 3. The chosen site is filled with the associated tile and ture, as well as individual tile orientations. These differ- with the corresponding orientation. ent possibilities are discussed in [19]. These steps are repeated until either: Here, we will classify the phenotype according to the overall shape of the polyomino independent of origin • There are no available sites for bonding. translation and C4 (90 degree) rotations. Note that chi- • The structure grows beyond a certain width or ral counterparts of polyominoes represent distinct phe- height Dmax, which is taken as a proxy for un- notypes. bounded assembly, so that the resulting phenotype is described as unbound. We set Dmax = 16 here, but our results are not sensitive to this cutoff as, D. Map for the polyomino systems we study here, there are no bounded structures larger than this. The map has two stages: conversion of the genotype At this point the assembly process is terminated and into the assembly kit, followed by assembly of a poly- the structure produced is recorded. To test whether the omino from an assembly process involving the assembly structure is deterministic, the assembly process is re- kit and the interface interaction matrix. A diagram rep- peated k times, with each C4 rotation of the final struc- resenting this process is shown in Fig. 2. ture checked against the recorded structure. If there are any differences between any of the k assemblies, the phenotype is classified as non-deterministic. Phenotypes 1. Genotype to Assembly Kit that are classified as unbound or non-deterministic struc- tures are represented by a single phenotype, which we The characters of the genotype are read from left to refer to as the undetermined (UND) phenotype and is right along the string. Characters are assigned to the assumed not to be biologically relevant in this context. next blank edge of a square tile in the assembly kit. The A more detailed discussion of the classification of poly- edges are taken in clockwise order (from the top side) omino structures is given in [19]. with all edges being written before moving on to a new block. The total number of tiles, in terms of the genotype III. PROPERTIES OF THE POLYOMINO GP string length, can be expressed as Nt = L/4. MAP

In this section, we analyse the Polyomino GP map 2. Assembly Kit to Phenotype by making measurements of redundancy, phenotype bias and component numbers, before moving on to analyse the The assembly of the 2D polyomino takes place on a properties of shape space covering, robustness and the re- square lattice where individual tiles from the assembly lationship between robustness and evolvability. For each kit are placed. The interface types on the edges of tiles measurement, comparisons are made with the RNA sec- can form an attractive interaction as determined by the ondary structure model and, for the first three of these binary interface interaction matrix Aij, with 1 denoting properties, an HP lattice protein model. The software attraction and 0 neutrality. A bond may form between used to model RNA secondary structure was the well- tiles if two adjacent (interacting) edges have interface known Vienna package [15], and for the HP lattice model types which attract, as defined by the interaction matrix. we used the data from the spaces enumerated by Irb¨ack The assembly process is initialised by placing (seeding) and Troein [33]. a single tile on the lattice. We will consider only GP maps in which the seed tile corresponds to the first tile described in the genotype. A different protocol where any A. Redundancy tile may be used to seed the assembly is also possible and does not significantly effect the results presented here. It is now well established that many different sequences The assembly then proceeds as follows: can generate similar protein or RNA phenotypes [36]. 6

GP Map NG NP Ncov/NP NC 7 0 Polyomino S2,8 1.7 × 10 13 54% 1,347 10 -1 Polyomino S3,8 6.9 × 10 147 16% – p 19 ∗ f -2 Polyomino S4,16 1.8 × 10 2,237 3% – 10 RNA, L = 12 1.7 × 107 57 47% 645 -3 RNA, L = 15 1.1 × 109 431 23% 12,526 -4 RNA, L = 20 1.1 × 1012 112,118 10% – -5 HP, L = 25 3.4 × 107 107,335 68% 148,253 -6 frequency, log

10 -7 TABLE I. Redundancy, phenotype bias and components in Polyomino, RNA and HP GP maps. Comparing the num- log -8 ber of phenotypes (NP ) to the number of genotypes (NG) for -9 each GP map highlights large-scale redundancy present. Phe- -10 notype bias is demonstrated in each map with the measure of 0 1 2 3 the fraction of phenotypes that covers 95% of the genotypes log10 rank (Ncov is the number of phenotypes that covers the 95% of genotypes). In all cases the fraction of phenotypes is signifi- cantly smaller than the fraction of genotypes being covered, FIG. 3. Phenotype bias in GP maps. The plot displays indicating the presence of a strong phenotype bias. The final three Polyomino GP maps with increasing tile numbers S2,8 column is the total number of genotype components (NC ) in (green), S3,8 (blue) and S4,8 (red) alongside the RNA L = 12 each GP map. In all cases (non-computable values left out) GP map (yellow). Note the log-log scale. The frequency of the number of components is larger than the number of phe- a given phenotype (structure) is plotted against the rank of notypes, indicating phenotypes tend to be spread out over its frequency within the map. In all three Polyomino maps, multiple disconnected components. RNA data for L = 12 we see an approximately exponential bias (linear trend in a was computed from the Vienna package [15] and taken from log-log plot). The errorbars for the Polyomino S4,16 map are [34] for L = 15, 20. The starred Polyomino S4,16 value for the associated standard error on the mean for the estimates NP is estimated from large-scale sampling of the GP map of frequencies calculated using the estimation algorithm for over multiple runs of the algorithm presented in [35]. The large maps developed by [35]. HP results were calculated from the data made available by [33]. Non-deterministic phenotypes and the trivial structure in RNA are excluded from the statistics.

B. Phenotype Bias

On top of the measure of redundancy in GP maps, it is also possible to consider phenotype bias, the idea that This large-scale redundancy is the basis, for example, of some phenotypes are much more redundant than others. the molecular clock hypothesis, which is widely used for Examples of bias can again be seen in RNA [9, 40] and inferring phylogenetic relationships [37]. It is therefore the HP model [12, 13, 27]. We demonstrate bias in the not surprising that such redundancy (also known in the GP maps in two ways. In Table I we present the fraction literature on GP maps as degeneracy/designability) has of phenotypes (Ncov/NP , where Ncov is the number of been observed in model RNA [9], HP lattice protein [13] phenotypes) that cover 95% of the genotype space. This and genetic regulatory [38] GP maps, as well as in more statistic shows that in all three GP maps (RNA, HP general model systems such as signalling networks [39]. and Polyomino), of the deterministic/non-trivial space, As such, redundancy is expected to be a typical feature a smaller fraction of phenotypes is required to cover a of GP maps. given amount of the genotype space, indicating a bias in the number of genotypes assigned to each phenotype. In Fig. 3, we plot the frequency (fractional coverage of In Table I, we show the number of genotypes (NG) genotype space) of each phenotype against its frequency- and phenotypes (NP ) for different polyomino, RNA and rank in three different Polyomino GP maps, as well as HP GP maps. As expected all three models show signif- for the RNA L = 12 GP map. In each of the Polyomino icant redundancy. The Polyomino model displays large- GP maps we see a large, approximately exponential bias scale redundancy shown by the vastly fewer phenotypes in phenotype frequencies (linear trend in a log-log plot). in comparison to genotypes for each of the GP maps pre- The data from the RNA GP map also shows a distinct sented. As can be seen in Table I, this is of a similar order bias, although the relationship is more complicated than to the RNA GP maps. For example, for RNA L = 12 and the more apparent exponential form seen in the Poly- 7 Polyomino S2,8 (which both have 1.7 × 10 genotypes), omino GP maps. Taken together these statistics show a there are 57 phenotypes for the former and 13 for the significant amount of phenotype bias in the Polyomino latter. GP map along with the RNA and HP GP maps. 7

C. Component disconnectivity 1.0 Further to the consideration of redundancy and bias of phenotypes in genotype space, we can ask how many 0.8 components a given phenotype forms in a particular GP map. In graph theory, a component is a set of nodes that 0.6 are connected, and thus when applied to genotype spaces, is a set of connected genotypes with the same phenotype. 0.4 Here we consider only point mutations. In RNA sim- ple biophysical considerations show that two point mu- 0.2 tations are needed to turn a purine-pyramidine bond into a pyramidine-purine bond. This neutral reciprocal sign fraction of phenotypes covered 0.0 epistasis leads to many individual components [41, 42]. We expect similar behaviour in polyominoes, where there 0 2 4 6 8 10 12 may be several ways of encoding a given phenotype that radius, n are not connected through neutral mutations due to there being multiple interface types. In the final column of Table I, we show the number of FIG. 4. Shape space covering in Polyomino S2,8 (green), S3,8 genotype components (NC ) for four different GP maps: (blue) and RNA L = 12 (yellow) GP maps. A random sam- Polyomino S2,8, RNA L = 12, RNA L = 15 and HP ple of 100 genotypes for each phenotype are taken, and the fraction of phenotypes discovered in a ball of radius n is mea- L = 25 (Polyomino S3,8 and S4,16 and RNA L = 20 GP maps were unobtainable due to computational ex- sured. This is then averaged over all sample phenotypes to give an approximate phenotype space covering for a given GP pense). In all four we see a substantially larger number map. Both Polyomino and RNA GP maps exhibit similar be- of components than phenotypes, indicating that pheno- haviour – shape space is almost completely covered in only a types tend to form disconnected components. Although few mutations highlighting that in highly redundant spaces, many components may be small, it is still the case that most phenotypes can be reached in only a few mutations. Er- whole-phenotype properties need to be considered care- ror bars are the standard error on the mean. The expected fully in the context of evolutionary dynamics, as it is asymptotic value of unity is used for the Polyomino S3,8 GP not typically possible to access every genotype of a given map for n > 4 due to computational unfeasibility. phenotype through neutral mutation.

found in the ball of genotypes n-mutations around each D. Shape space covering of them. Ferrada and Wagner [27] found that in both the RNA secondary structure and HP lattice protein GP GP maps typically exist in high-dimensional genotype maps, the fraction of phenotypes discovered increases in (sequence) spaces, a property with counterintuitive con- a roughly sigmoidal fashion with the increasing number sequences and a topic of current biological research for of independent mutations applied to the source genotype GP maps [43, 44]. For example, for an alphabet of K and at a greater rate for RNA than the HP GP map. letters (e.g. K = 4 for RNA or K = 20 for proteins) the In our work we measure shape space covering in a sim- number of genotypes scales exponentially as KL, while ilar manner: we picked a sample of 1,000 genotypes uni- the number of mutations needed to reach any two geno- formly across all phenotypes and measured the pheno- types is at most L, and so scales linearly. In practice, type of genotypes at each given radius (n). In Fig. 4 considerably fewer than L mutations are needed to con- we plot the average fraction of phenotypes found for the nect any two phenotypes. This property has been studied Polyomino S2,8, S3,8 and RNA L = 12 GP maps. The most extensively for RNA GP maps, and in that context general behaviour of both the Polyomino and RNA GP Schuster et al. [9], borrowed the term shape space cov- maps is similar. Phenotypes are almost completely cov- ering from its original use in immunology [45]. The HP ered within half the sequence length, and at a slightly model has also been considered with respect to shape higher rate in the Polyomino GP maps. This provides space covering [13, 27] with seemingly less rapid space clear evidence to suggest that the Polyomino GP map covering. has shape space covering – most phenotypes can be found We explore shape space covering in a similar way to within a ball containing many fewer genotypes than the Ferrada and Wagner [27] who compared RNA and HP entire space. models, building on work of earlier investigators. Shape space covering is measured by counting the average frac- tion of phenotypes found when applying a given number E. Robustness of independent mutations (the radius, n) to a genotype in the GP map. This process involves picking a sample of Biological systems need to be robust and consistently genotypes and then measuring the number of phenotypes produce the same phenotype in response to environmen- 8 tal perturbations or to genetic mutations [2]. Here we consider only the phenotypic effects of mutations to the 1.0 genotype. Mutational robustness describes the invari- p ance of a phenotype as a result of mutations to the geno- ρ 0.8 type. We focus here on single point mutations such that the fraction of 1-mutants that have the same phenotype 0.6 as the original genotype under examination is defined as the genotypic robustness. For a genotype , with pheno- g 0.4 type p, the genotype robustness can thus be expressed as follows 0.2 phenotype robustness,

np,g 0.0 ρg = (1) (K − 1)L -6 -5 -4 -3 -2 -1 0 where ρg is the genotypic robustness of g, np,g is the num- log10 frequency, log10 fp ber of 1-mutant neighbours of g with phenotype p, K is the base and L is the sequence length (there are a total of (K − 1)L 1-mutants for any genotype). The robustness FIG. 5. Robustness in GP maps. Phenotypic robustness is of the phenotype can then be considered as the average plotted against frequency (log-scale) for the Polyomino S2,8 of this quantity over all genotypes with phenotype p, re- (green), S3,8 (blue) and RNA L = 12 (yellow) GP maps. The sulting in top three points in the right hand side of the plot are the UND/trivial structure in the respective Polyomino and RNA 1 X systems. In the RNA case, the robustness scales roughly lin- ρ = ρ (2) p |P| g early with the logarithm of phenotype frequency. Whilst the g∈P trend is very roughly linear for the Polyomino GP maps, there is still a strong positive scaling relationship, demonstrating where ρp is the phenotypic robustness and P is the set of that polyomino phenotypes exhibit phenotypic robustness in genotypes with phenotype p. a similar manner to the RNA GP map. In Fig. 5 we plot the phenotypic robustness ρp against the frequency of each phenotype fp for the two Poly- omino GP maps S2,8 and S3,8 and the RNA L = 12 GP at the level of phenotypes. A possible geometric expla- map. Corroborating previous results [34], it can be seen nation is that phenotypes can become more mutationally that the RNA phenotypic robustness scales roughly loga- robust as a result of increased redundancy (more neutral rithmically with phenotype frequency. This linear trend neighbours). This in turn leads to connected components with log-frequency is very approximate for the Polyomino spreading out further across genotype space and possess- GP maps, although the robustness can still be seen to ing a greater ‘surface’, thus allowing a greater number of strongly scale with phenotype frequency. Both Poly- phenotypes to be accessible through neutral mutations. omino GP maps exhibit a slightly smaller phenotypic Wagner’s argument is really about the static structure robustness at a given frequency when compared to the of genotype space and does not account for any dynam- RNA GP map. Nonetheless, despite these two obser- ical evolutionary effects. Other authors have considered vations, the clear indication here is that Polyomino GP further static properties [50], including a different notion maps have a phenotype robustness that scales similarly of phenotypic evolvability (diversity evolvability) defined to phenotypes in the RNA GP map. as being the probability that two non-neutral mutations lead to different phenotypes. Such a definition of evolv- ability was found to not be correlated with robustness F. Robustness versus Evolvability in the RNA GP map. A dynamical study by Draghi et al. [51] demonstrated that the relationship between Robustness and evolvability are two evolutionary prop- robustness and evolvability could also depend on popu- erties that have received much recent attention in the lit- lation parameters and mutation rates. In our work here, erature [2, 28, 46–49]. As discussed in the previous sec- we simply consider whether the static properties of the tion, robustness is the ability of an organism to maintain Polyomino GP map behave in a similar manner to those its phenotype if its genotype is mutated. Evolvability on of the RNA GP map as demonstrated by Wagner [28] the other hand is considered as the capacity for producing without here commenting on the wider debate of how phenotypic variation [1, 28]. One might expect that for robustness and evolvability relate. an individual to be robust it would have to compromise In the left hand plots of Fig. 6, we show fractional its ability to produce variation (to be evolvable). Using evolvability versus fractional robustness, for genotypes the RNA GP map, Wagner [28] demonstrated that this and phenotypes in both the RNA L = 12 and Polyomino expected trade-off exists for individual genotypes, but S3,8 GP maps. The genotype plot is a binned version that the two properties are in fact positively correlated of 100 randomly sampled genotypes for each phenotype 9

1.0 1.0 0.14 1.0 0.12 0.8 0.10 00.08.8 0.86

0.06 0.4 0.04 p g ε

ε 0.2 0.02 0.6 0.6 0.00 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

0.14 1.0 0.4 0.4 0.12 . genotype evolvability, 0 8 0.10 phenotype evolvability, 0.08 0.6

00.06.2 0.42 0.04 0.2 0.02

0.00 0.0 0.0 0.0 0.0 00..22 00.4.4 0.06.6 0.08.8 1.01.0 0.0 00..22 00.4.4 0.06.6 0.08.8 1.01.0 genotype robustness, ρg phenotype robustness, ρp

FIG. 6. Robustness and Evolvability in the RNA and Polyomino GP map. In the LEFT columns we show a significant 2 negative correlation between genotypic robustness and genotypic evolvability, g (RNA (yellow, top): p-value = 4.9e-07, r = 0.89 and Polyomino (blue, bottom): p-value = 2.4e-09, r2 = 0.91). The error bars are the standard error on the mean genotypic evolvability for each genotypic robustness bin. The more robust a genotype is, the fewer phenotypes lie in its 1- mutant neighbourhood. In the RIGHT column, there is a significant positive correlation between phenotypic robustness and 2 2 phenotypic evolvability, p (RNA: p-value = 1.5e-09, r = 0.48 and Polyomino: p-value < 2.2e-16, r = 0.74 ). The far right hand phenotypes are the trivial structure and UND phenotype in RNA and Polyomino systems respectively. Both these results are in correspondence with [28].

(apart from the non-deterministic/trivial structure) in that could be reached. Whether they can be reached de- each GP map. The fraction of all phenotypes that can pends, of course, on details of the evolutionary dynamics. be produced in any 1-mutation to each sampled genotype For example, if only single mutations are available then (genotypic evolvability, g) is plotted against the fraction p should really be defined with respect to the relevant of those 1-mutations that produce the same phenotype as component [42]. Phenotypic robustness is defined in the that of the genotype being tested (genotypic robustness, same way as in Section III E, as the average genotypic Eq.1). For both polyominoes and RNA we see a signifi- robustness over all genotypes with phenotype p (Eq. 2). cant negative correlation. This expected negative corre- In the plots, we see strong positive correlations for both lation simply represents the trade-off between genotypic the RNA L = 12 and the Polyomino S3,8 GP maps, as evolvability and robustness at the level of the individual expected. In other words, for both maps, phenotypes genotype with a greater redundancy can be generated by a larger number of genotypes and are therefore, on average, more In the right hand plots of Fig. 6, we plot phenotypic likely to be mutationally robust. Furthermore, such phe- evolvability against phenotypic robustness. The pheno- notypes are also likely to have a larger set of genotypes, typic evolvability p of phenotype p is defined here as and therefore a greater diversity of other phenotypes po- the fraction of all phenotypes that can be reached in a tentially accessible from this set. Thus, phenotypic ro- single mutation from any genotype with phenotype p. bustness and potential evolvability are positively corre- We also refer to this as the potential evolvability be- lated. cause it represents the potential number of phenotypes 10

IV. DISCUSSION share these properties. This question can be sharpened by looking at the different properties separately. Redun- In this paper, we have explored the properties of a dancy should be widely shared across GP maps. Pheno- GP map for biological self-assembly of the kind exhib- type bias has been observed in models for gene-regulatory ited by protein quaternary structure based on a recently networks [52] and developmental networks [53]. Could introduced Polyomino model for tile assembly [18, 19]. it be a more general property of GP maps? Discon- We compared its properties to models of RNA secondary nected components have also been observed in a Boolean structure and the HP model for protein tertiary struc- threshold model for gene regulation [54]. To our knowl- ture. As is the case for these two well studied GP maps, edge shape space covering has not been studied for other we argue that even though our Polyomino model is highly GP maps, but general considerations based on the high schematic and thus misses many details of protein qua- dimensionality of genetic space suggest that something ternary structure, it may nevertheless provide important like this may be more widely relevant [55]. Finally, the biological insight into the structure of the design space correlation between phenotypic robustness and potential for protein complexes. evolvability is deeply connected to the geometry of neu- Despite the great complexity in potential phenotypes, tral sets and so is likely to be a much more general prop- the polyomino model remains tractable as demonstrated erty of GP maps. How this correlation plays out for re- in this paper by the ability to perform a wide variety of alistic evolutionary dynamics is, of course, a much more useful measurements on the GP map. Our main results complicated question. are: Firstly, the Polyomino model exhibits large-scale re- We are hopeful that more complete answers will be dundancy, a strong phenotype bias and the presence of derived through further analysis and comparisons of dif- disconnected genotype components across several param- ferent model GP maps in a similar manner to the work eterisations of the Polyomino GP map (Section III A- here and in [27] or [2]. An important related question III C). Secondly, that shape space may be covered in is whether model GP maps for parts of systems can be only a fraction of mutations – that is, all phenotypes are combined to achieve a more complete understanding of a significantly smaller number of mutations away from the evolution of phenotypic traits in the full organismal each other than the total sequence length (Section III D). GP maps [36]. Thirdly, phenotypic robustness scales very roughly with The Polyomino model can easily be adapted to study the logarithm of the phenotype frequency (Section III E). unbounded assembly or the assembly of synthetically pro- And finally, genotypic robustness and genotypic evolv- duced objects like DNA tiles or patchy colloids [26]. Thus ability are negatively correlated, whilst phenotypic ro- the perspective gained from viewing polyominoes as a GP bustness and phenotypic (potential) evolvability are pos- map may also shed light on the artificial design process itively correlated (Section III F). for these systems. The Polyomino model describes the self-assembly of Finally, although we introduce the Polyomino model disconnected units (proteins) into finite sized structures as a coarse-grained model for protein quaternary struc- (protein clusters) that can vary in size. By contrast, for ture, it is clear that the model is not capable of modelling RNA secondary structure and the HP model for pro- particular intricacies of some individual proteins. For ex- tein tertiary structure, strings of connected units (nu- ample, F1-ATPase is a protein whose subunit structure cleotides or amino acids) assemble into shapes of a fixed could not be accurately represented with a square tile size. Given the substantially different class of the pheno- model. We argue that the Polyomino model nevertheless types in our model, it is remarkable that the measured provides biological insight into questions about the global properties of these GP maps turn out to be so similar. nature of the GP map that would not be computationally This begs the question of whether what we observe is in accessible using more complex models with greater bio- fact a more general property of self-assembling systems, logical detail. However, further work is needed to assess or even broader, whether a wider class of GP maps will how well this new GP map performs in this respect.

[1] P. Alberch. From genes to phenotype: dynamical systems 10.1098/rstb.2009.0241. and evolvability. Genetica, 84(1):5–11, 1991. ISSN 0016- [5] Erik Svensson and Ryan Calsbeek. The adaptive land- 6707. doi:10.1007/BF00123979. scape in evolutionary biology. Oxford University Press, [2] Andreas Wagner. Robustness and evolvability in living 2012. systems. Princeton University Press Princeton, 2005. [6] S. A. Kauffman. Metabolic stability and epigenesis in [3] Denis Noble. The Music of Life: Biology Beyond Genes. randomly constructed genetic nets. Journal of Theoreti- OUP Oxford, February 2006. ISBN 0199228361. cal Biology, 22(3):437 – 467, 1969. ISSN 0022-5193. doi: [4] Massimo Pigliucci. Genotype–phenotype mapping 10.1016/0022-5193(69)90015-0. and the end of the ‘genes as blueprint’ metaphor. [7] Fangting Li, Tao Long, Ying Lu, Qi Ouyang, and Chao Philosophical Transactions of the Royal Society B: Tang. The yeast cell-cycle network is robustly designed. Biological Sciences, 365(1540):557–566, 2010. doi: Proc Natl Acad Sci U S A, 101(14):4781–4786, 2004. 11

ISSN 0027-8424 (Print). doi:10.1073/pnas.0305937101. 10.1093/nar/28.1.235. [8] David H. Mathews, Walter N. Moss, and Douglas H. [23] Emmanuel D Levy, Jose B Pereira-Leal, Cyrus Chothia, Turner. Folding and finding secondary structure. and Sarah A Teichmann. 3d complex: A structural classi- Cold Spring Harbor Perspectives in Biology, 2(12), 2010. fication of protein complexes. PLoS Comput Biol, 2(11): doi:10.1101/cshperspect.a003665. e155, 11 2006. doi:10.1371/journal.pcbi.0020155. [9] Peter Schuster, Walter Fontana, Peter F. Stadler, and [24] Emmanuel D Levy, Elisabetta Boeri Erba, Carol V Ivo L. Hofacker. From sequences to shapes and back: A Robinson, and Sarah A Teichmann. Assembly reflects case study in rna secondary structures. Proceedings of the evolution of protein complexes. Nature, 453(7199):1262– Royal Society of London. Series B: Biological Sciences, 1265, 2008. doi:10.1038/nature06942. 255(1344):279–284, 1994. doi:10.1098/rspb.1994.0040. [25] Gabriel Villar, Alex W. Wilber, Alex J. Williamson, [10] Matthew C. Cowperthwaite and Lauren Ancel Mey- Parvinder Thiara, Jonathan P. K. Doye, Ard A. Louis, ers. How mutational networks shape evolution: Lessons Mara N. Jochum, Anna C. F. Lewis, and Emmanuel D. from rna models. Annual Review of Ecology, Evo- Levy. Self-assembly and evolution of homomeric protein lution, and Systematics, 38(1):203–230, 2007. doi: complexes. Phys. Rev. Lett., 102:118106, Mar 2009. doi: 10.1146/annurev.ecolsys.38.091206.095507. 10.1103/PhysRevLett.102.118106. [11] Ken A. Dill. Theory for the folding and stability of glob- [26] Paul W. K Rothemund, Nick Papadakis, and Erik ular proteins. Biochemistry, 24(6):1501–1509, 1985. doi: Winfree. Algorithmic self-assembly of dna sierpin- 10.1021/bi00327a032. ski triangles. PLoS Biol, 2(12):e424, 12 2004. doi: [12] Hao Li, Robert Helling, Chao Tang, and Ned Wingreen. 10.1371/journal.pbio.0020424. Emergence of preferred structures in a simple model of [27] Evandro Ferrada and Andreas Wagner. A comparison of protein folding. Science, 273(5275):666–669, 1996. doi: genotype-phenotype maps for rna and proteins. Biophys- 10.1126/science.273.5275.666. ical Journal, 102(8):1916 – 1925, 2012. ISSN 0006-3495. [13] E. Bornberg-Bauer. How are model protein struc- doi:10.1016/j.bpj.2012.01.047. tures distributed in sequence space? Biophysical Jour- [28] Andreas Wagner. Robustness and evolvability: a nal, 73(5):2393 – 2403, 1997. ISSN 0006-3495. doi: paradox resolved. Proceedings of the Royal Society 10.1016/S0006-3495(97)78268-7. B: Biological Sciences, 275(1630):91–100, 2008. doi: [14] , Christoph Gohlke, Thomas M Jovin, 10.1098/rspb.2007.1137. Eric Westhof, and Fritz Eckstein. A three-dimensional [29] Hao Wang. Proving theorems by pattern recognition – model for the hammerhead ribozyme based on fluores- ii. Bell System Tech. Journal, 40(1):1–41, 1961. doi: cence measurements. Science, 266(5186):785–789, 1994. 10.1002/j.1538-7305.1961.tb03975.x. doi:10.1126/science.7973630. [30] Erik Winfree, Xiaoping Yang, and Nadrian C Seeman. [15] Ivo L. Hofacker. Vienna rna secondary structure server. Universal computation via self-assembly of dna: Some Nucleic Acids Research, 31(13):3429–3431, 2003. doi: theory and experiments. DNA Based Computers Two, 10.1093/nar/gkg599. 44:191–213, 1996. [16] O. Gursky, J. Badger, Y. Li, and D.L. Caspar. Confor- [31] PWK Rothemund and E Winfree. The program-size mational changes in cubic insulin crystals in the ph range complexity of self-assembled squares (extended abstract). 7–11. Biophysical Journal, 63(5):1210 – 1220, 1992. ISSN Proceedings of the thirty-second annual ACM symposium 0006-3495. doi:10.1016/S0006-3495(92)81697-1. on Theory of computing, pages 459–468, 2000. doi: [17] P Ann Boriack-Sjodin, S Mariana Margarit, Dafna Bar- 10.1145/335305.335358. Sagi, and John Kuriyan. The structural basis of the ac- [32] David Soloveichik and Erik Winfree. Complexity of self- tivation of ras by sos. Nature, 394(6691):337–343, 1998. assembled shapes. SIAM Journal on Computing, 36(6): doi:10.1038/28548. 1544–1569, 2007. doi:10.1137/S0097539704446712. [18] S. E. Ahnert, I. G. Johnston, T. M. A. Fink, J. P. K. [33] Anders Irb¨ack and Carl Troein. Enumerating designing Doye, and A. A. Louis. Self-assembly, modularity, and sequences in the hp model. Journal of Biological Physics, physical complexity. Phys. Rev. E, 82:026117, Aug 2010. 28(1):1–15, 2002. doi:10.1023/A:1016225010659. doi:10.1103/PhysRevE.82.026117. [34] Steffen Schaper. On the significance of neutral spaces [19] Iain G. Johnston, Sebastian E. Ahnert, Jonathan P. K. in adaptive evolution. PhD thesis, University of Oxford, Doye, and Ard A. Louis. Evolutionary dynamics in a 2012. simple model of self-assembly. Phys. Rev. E, 83:066105, [35] Thomas J¨org,Olivier Martin, and Andreas Wagner. Neu- Jun 2011. doi:10.1103/PhysRevE.83.066105. tral network sizes of biological rna molecules can be com- [20] Emmanuel D Levy, Jose B Pereira-Leal, Cyrus Chothia, puted and are not atypically small. BMC , and Sarah A Teichmann. 3d complex: A structural classi- 9(1):464, 2008. ISSN 1471-2105. doi:10.1186/1471-2105- fication of protein complexes. PLoS Comput Biol, 2(11): 9-464. e155, 11 2006. doi:10.1371/journal.pcbi.0020155. [36] Andreas Wagner. The origins of evolutionary innova- [21] Tina Perica, JosephA Marsh, FilipaL Sousa, Eviatar tions: a theory of transformative change in living sys- Natan, LucyJ Colwell, SebastianE Ahnert, and SarahA tems. Oxford University Press, 2011. Teichmann. The emergence of protein complexes: [37] John P. Thorpe. The molecular clock hypothesis: Bio- quaternary structure, dynamics and allostery. Bio- chemical evolution, genetic differentiation and systemat- chemical Society transactions, 40(3):475, 2012. doi: ics. Annual Review of Ecology and Systematics, 13:pp. 10.1042/BST20120056. 139–168, 1982. ISSN 00664162. [22] Helen M. Berman, John Westbrook, Zukang Feng, [38] S. Ciliberti, O. C Martin, and A. Wagner. Robustness Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. can evolve gradually in complex regulatory gene networks Shindyalov, and Philip E. Bourne. The protein data with varying topology. PLoS Comput Biol, 3(2):e15, 02 bank. Nucleic Acids Research, 28(1):235–242, 2000. doi: 2007. doi:10.1371/journal.pcbi.0030015. 12

[39] Pau Fern´andez and Ricard V Sol´e.Neutral fitness land- Natl Acad Sci USA, 95(15):8420–8427, July 1998. doi: scapes in signalling networks. Journal of the Royal Soci- 10.1073/pnas.95.15.8420. ety, Interface / the Royal Society, 4(12):41–47, February [47] Jesse Bloom, Zhongyi Lu, David Chen, Alpan Raval, 2007. doi:10.1098/rsif.2006.0152. Ophelia Venturelli, and Frances Arnold. Evolution favors [40] Michael Stich, Carlos Briones, and Susanna C. Man- protein mutational robustness in sufficiently large popu- rubia. On the structural repertoire of pools of short, lations. BMC Biology, 5(1):29, 2007. ISSN 1741-7007. random {RNA} sequences. Journal of Theoretical Bi- doi:10.1186/1741-7007-5-29. ology, 252(4):750 – 763, 2008. ISSN 0022-5193. doi: [48] Alexis Gallagher. Evolvability: A Formal Approach. PhD 10.1016/j.jtbi.2008.02.018. thesis, University of Oxford, 2009. [41] Jacobo Aguirre, Javier M. Buld´u,Michael Stich, and [49] Joanna Masel and Meredith V. Trotter. Robustness and Susanna C. Manrubia. Topological structure of the evolvability. Trends in Genetics, 26(9):406 – 414, 2010. space of phenotypes: The case of rna neutral net- ISSN 0168-9525. doi:10.1016/j.tig.2010.06.002. works. PLoS ONE, 6(10):e26324, 10 2011. doi: [50] Matthew C. Cowperthwaite, Evan P. Economo, 10.1371/journal.pone.0026324. William R. Harcombe, Eric L. Miller, and Lauren Ancel [42] Steffen Schaper, Iain G. Johnston, and Ard A. Louis. Meyers. The ascent of the abundant: How mutational Epistasis can lead to fragmented neutral spaces and con- networks constrain evolution. PLoS Comput Biol, 4(7): tingency in evolution. Proceedings of the Royal Society e1000110, 07 2008. doi:10.1371/journal.pcbi.1000110. B: Biological Sciences, 279(1734):1777–1783, 2012. doi: [51] Jeremy A Draghi, Todd L Parsons, G¨unter P Wagner, 10.1098/rspb.2011.2183. and Joshua B Plotkin. Mutational robustness can facil- [43] Stuart Kauffman and Simon Levin. Towards a general itate adaptation. Nature, 463(7279):353–355, 2010. doi: theory of adaptive walks on rugged landscapes. Journal 10.1038/nature08694. of Theoretical Biology, 128(1):11 – 45, 1987. ISSN 0022- [52] Karthik Raman and Andreas Wagner. The evolv- 5193. doi:10.1016/S0022-5193(87)80029-2. ability of programmable hardware. Journal of The [44] Janko Gravner, Damien Pitman, and Sergey Gavrilets. Royal Society Interface, 8(55):269–281, 2011. doi: Percolation on fitness landscapes: Effects of correla- 10.1098/rsif.2010.0212. tion, phenotype, and incompatibilities. Journal of The- [53] Elhanan Borenstein and David C. Krakauer. An end oretical Biology, 248(4):627–645, October 2007. doi: to endless forms: Epistasis, phenotype distribution bias, 10.1016/j.jtbi.2007.07.009. and nonuniform evolution. PLoS Comput Biol, 4(10): [45] Alan S Perelson and George F Oster. Theoretical studies e1000202, 10 2008. doi:10.1371/journal.pcbi.1000202. of clonal selection: minimal antibody repertoire size and [54] G. Boldhaus and K. Klemm. Regulatory networks and reliability of self-non-self discrimination. Journal of the- connected components of the neutral space. The Eu- oretical biology, 81(4):645–670, 1979. doi:10.1016/0022- ropean Physical Journal B, 77(2):233–237, 2010. ISSN 5193(79)90275-3. 1434-6028. doi:10.1140/epjb/e2010-00176-4. [46] M. Kirschner and J. Gerhart. Evolvability. Proc [55] Sergey Gavrilets. Fitness landscapes and the origin of species (MPB-41). Princeton University Press, 2004.