
Learning the Molecular Grammar of Protein Condensates from Sequence Determinants and Embeddings

Kadi L. Saar, Alexey S. Morgunov, Runzhang Qi, William E. Arter, Georg Krainer, Alpha A. Lee, and Tuomas P. J. Knowles

Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom; Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, United Kingdom

Edited by William A. Eaton, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD, and approved February 17, 2021 (received for review September 13, 2020)

Intracellular phase separation of proteins into biomolecular condensates is increasingly recognized as a process with a key role in cellular compartmentalization and regulation. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations and to establish a strategy for understanding on a global level the associations between protein sequence and phase behavior, we constructed in silico machine-learning models for predicting protein liquid–liquid phase separation (LLPS). Our analysis highlighted that LLPS-prone sequences are more disordered, less hydrophobic, and of lower Shannon entropy than the sequences in the Protein Data Bank or the Swiss-Prot database and that they show a fine balance in their relative content of polar and hydrophobic residues. To further learn in a hypothesis-free manner the sequence features underpinning LLPS, we trained a neural network-based language model and found that a classifier constructed on such embeddings learned the underlying principles of phase behavior at a comparable accuracy to a classifier that used knowledge-based features. By combining knowledge-based features with unsupervised embeddings, we generated an integrated model that distinguished LLPS-prone sequences both from structured proteins and from unstructured proteins with a lower LLPS propensity and further identified such sequences from the human proteome with a high accuracy. These results provide a platform rooted in molecular principles for understanding protein phase behavior. The accessible predictor, termed DeePhase, is available at https://deephase.ch.cam.ac.uk/.

machine learning | language models | biomolecular condensates | protein

Significance

The tendency of many cellular proteins to form protein-rich biomolecular condensates underlies the formation of subcellular compartments and has been linked to various physiological functions. Understanding the molecular basis of this fundamental process and predicting protein phase behavior have therefore become important objectives. To develop a global understanding of how protein sequence determines its phase behavior, we constructed bespoke datasets of proteins of varying phase separation propensity and identified explicit biophysical and sequence-specific features common to phase-separating proteins. Moreover, by combining this insight with neural network-based sequence embeddings, we trained machine-learning classifiers that identified phase-separating sequences with high accuracy, including from independent external test data.

Liquid–liquid phase separation (LLPS) is a widely occurring biomolecular process that underpins the formation of membraneless organelles within living cells (1–4). This phenomenon and the resulting condensate bodies are increasingly recognized to play important roles in cellular processes, including the onset and development of metabolic diseases and cancer (5–11). Understanding the wide range of factors that drive the formation of protein-rich biomolecular condensates has thus become an important biological objective and has been the focus of a number of studies, which have collectively yielded valuable information about the factors that govern protein phase behavior (3, 4, 12, 13).

While changes in extrinsic conditions, such as the temperature, ionic strength, or the level of molecular crowding, can modulate the LLPS behavior of a protein (14–17), condensate formation is strongly governed by the protein's linear primary amino acid structure, its sequence. A range of sequence-specific factors governing the formation of protein condensates have been postulated; in particular, electrostatic interactions, hydrophobic interactions, π–π contacts, and cation–π interactions as well as the valency and patterning of low-complexity regions (LCRs) have been brought forward as central features of these hypotheses (12, 13, 18–22). The predictive power of some of these hypotheses has been recently reviewed (23). In parallel, studies examining the relationship between protein sequence and its phase behavior through site-specific sequence alterations, such as deletion, mutation, and truncation events, have determined various sequence-specific features to be important in modulating the phase separation of specific proteins, such as arginine and tyrosine residues in the context of the fused in sarcoma (FUS)-family proteins (22), the high abundance and positioning of aromatic amino acid residues in the TAR DNA-binding protein 43 (TDP-43) (24), arginine- and glycine-rich domains in LAF-1 (25), and multivalent protein interactions in the disordered protein UBQLN2 (26).

Author contributions: K.L.S., A.S.M., and T.P.J.K. designed research; K.L.S. and A.S.M. performed research; K.L.S., A.S.M., and R.Q. contributed new reagents/analytic tools; K.L.S. and A.S.M. analyzed data; and K.L.S., A.S.M., A.A.L., G.K., W.E.A., and T.P.J.K. discussed the results, reviewed the manuscript, and wrote the paper.

Competing interest statement: Parts of this work have been the subject of a patent application filed by Cambridge Enterprise Limited, a fully owned subsidiary of the University of Cambridge.

This article is a PNAS Direct Submission.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

1 K.L.S. and A.S.M. contributed equally to this work.
2 Present address: Fluidic Analytics, Unit A, The Paddocks Business Centre, Cambridge CB1 8DH, United Kingdom.
3 To whom correspondence may be addressed. Email: [email protected].

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2019053118/-/DCSupplemental.

Published April 7, 2021.

PNAS 2021 Vol. 118 No. 15 e2019053118 | https://doi.org/10.1073/pnas.2019053118

To broaden the scope of these observations and understand on a global level the associations between the primary structure of a protein and its tendency to form condensates, here, we developed an in silico strategy for analyzing the associations between the LLPS propensity of a protein and its amino acid sequence and used this information to construct machine-learning classifiers for predicting LLPS propensity from the sequence (Fig. 1). Specifically, by starting with a previously published LLPSDB database collating information on protein phase behavior under different environmental conditions (27) and by analyzing the concentration under which LLPS had been observed to take place in these experiments, we constructed two datasets including sequences of different LLPS propensity and compared them to fully ordered structures from the Protein Data Bank (PDB) (29) as well as the Swiss-Prot (30) database. We observed phase-separating proteins to be less hydrophobic, more disordered, and of lower Shannon entropy and to have their low-complexity regions enriched in polar residues. Moreover, high LLPS propensity correlated with a high abundance of polar residues, yet the lowest saturation concentrations were reached when their abundance was balanced with a sufficiently high hydrophobic content.

Moreover, we used the outlined sequence-specific features as well as implicit protein sequence embeddings generated using a neural network-derived word2vec model and trained classifiers for predicting the propensity of unseen proteins to phase separate. We showed that even though the latter strategy required no specific feature engineering, it allowed constructing classifiers that were comparably effective at identifying LLPS-prone sequences as the model that used knowledge-based features, demonstrating that language models can learn the molecular grammar of phase separation. Our final model, combining knowledge-based features with unsupervised embeddings, showed a high performance both when distinguishing LLPS-prone proteins from structured ones and when identifying them within the human proteome. Overall, our results shed light onto the physicochemical factors modulating protein condensate formation and provide a platform rooted in molecular principles for the prediction of protein phase behavior.

Results and Discussion

Construction of Datasets and Their Global Sequence Comparison. To link the amino acid sequence of a protein to its tendency to form biomolecular condensates, we collated data from two publicly available datasets, the LLPSDB (27) and the PDB (29), and constructed three bespoke datasets—LLPS+ and LLPS− comprising intrinsically disordered proteins with high and low LLPS propensity, respectively, and PDB∗ consisting of sequences that were very unlikely to phase separate (see below). To create the first two datasets, the LLPSDB dataset was reduced to naturally occurring sequence constructs with no posttranslational modifications and further filtered for sequences that were observed to undergo phase separation on their own (i.e., homotypically), in isolation from other components, such as DNA, RNA, or an additional protein (Materials and Methods). In the resulting dataset, the mean concentration at which each construct had been observed to phase separate was estimated and a threshold of 100 µM was used to divide the sequences according to their high propensity (LLPS+; 137 constructs from 77 unique UniProt IDs; Dataset S1) or low propensity to phase separate (Fig. 1B). The latter set (25 constructs) was then combined with the sequences from the LLPSDB that had not been observed to phase separate homotypically (72 constructs) and this created the LLPS− dataset (84 constructs from 52 unique UniProt IDs; Dataset S2).
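The filtering and thresholding steps described above can be sketched as follows. The record fields (`ptm`, `components`, `conc_uM`, `seq`) are hypothetical stand-ins for the actual LLPSDB schema, not its real column names; only the 100 µM decision rule is taken from the text.

```python
# Sketch of the LLPS+/LLPS- split described above (hypothetical field names).
# Homotypically phase-separating constructs with a mean saturation
# concentration below 100 uM are labeled LLPS+; the remaining homotypic
# constructs and the non-homotypic ones form LLPS-.
THRESHOLD_UM = 100.0

def split_llpsdb(records):
    llps_pos, llps_neg = [], []
    for rec in records:
        if rec["ptm"]:                        # drop posttranslationally modified constructs
            continue
        if rec["components"] != ["protein"]:  # not observed to phase separate on its own
            llps_neg.append(rec["seq"])
            continue
        concs = rec["conc_uM"]                # concentrations at which LLPS was observed
        mean_conc = sum(concs) / len(concs)
        (llps_pos if mean_conc < THRESHOLD_UM else llps_neg).append(rec["seq"])
    return llps_pos, llps_neg
```

The same rule generalizes to any per-construct concentration summary; the mean is used here because that is the statistic named in the text.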


Fig. 1. (A) DeePhase predicts the propensity of proteins to undergo phase separation by combining engineered features computed directly from protein sequences with protein sequence embedding vectors generated using a pretrained language model. The DeePhase model was trained using three datasets, namely two classes of intrinsically disordered proteins with a different LLPS propensity (LLPS+ and LLPS−) and a set of structured sequences (PDB∗). (B) To generate the LLPS+ and LLPS− datasets, the entries in the LLPSDB database (27) were filtered for single-protein systems. The constructs that phase separated at an average concentration below c = 100 µM were classified as having a high LLPS propensity (LLPS+; 137 constructs from 77 UniProt IDs) with the remaining 25 constructs together with constructs that had not been observed to phase separate homotypically classified as low-propensity dataset (LLPS−; 84 constructs from 52 UniProt IDs). (C) The 221 sequences clustered into 123 different clusters [Left, CD-hit clustering algorithm (28) with the lowest threshold of 0.4]. (Right) The 110 parent sequences showed high diversity by forming 94 distinct clusters. (D) The PDB∗ dataset (1,563 constructs) was constructed by filtering the entries in the PDB (29) to fully structured full-protein single chains and clustering for sequence similarity with a single entry selected from each cluster.
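The clustering step named in the caption above is performed with the dedicated CD-hit tool (28). A toy greedy scheme in the same spirit can be sketched as follows; the crude positional identity measure below is a simplification for illustration, not a reimplementation of CD-hit's word-based heuristics.

```python
# Toy greedy clustering in the spirit of CD-hit: sequences are sorted by
# length, and each one joins the first cluster whose representative it
# matches above the identity threshold; otherwise it seeds a new cluster.
def identity(a, b):
    """Crude positional identity over the shorter of the two sequences."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_cluster(seqs, threshold=0.4):
    clusters = []  # each cluster: [representative, member, ...]
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters
```

Keeping one entry per cluster, as done for PDB∗, then amounts to taking each cluster's representative.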

We note that the classification does not imply that the proteins in the LLPS− dataset cannot phase separate under any conditions, but rather that it correlates enhanced LLPS propensity with sequence-specific features. To further characterize the diversity of the datasets, we clustered the sequences (Materials and Methods) and found the 221 constructs across the LLPS+ and LLPS− datasets to form 123 unique clusters (Fig. 1C, Left). Moreover, when reducing the combined dataset of 221 sequences down to a single sequence per UniProt ID by keeping the longest sequence for the cases where more than a single construct had been derived from each parent sequence, we identified 94 different clusters (Fig. 1C, Right). Overall, these results indicate a noticeable amount of sequence diversity, which is essential for building models that generalize well.

An additional dataset, PDB∗, consisting of entries sampled from the PDB, was constructed to serve as an alternative training set comprising sequences that are highly unlikely to phase separate. This dataset was constructed by filtering the PDB for fully structured chains (112,572 chains) that did not include any disordered residues (verified via mapping the selected sequences to their UniProt IDs) and for sequence lengths above 50 amino acids (Materials and Methods). The remaining sequences (13,325) were clustered for sequence identity to retain no more than a single sequence from each cluster (Materials and Methods). This process reduced the original set of fully structured sequences to a more diverse dataset of 1,563 sequences (PDB∗; Fig. 1D).

We compared the generated datasets across a range of global sequence-specific features with the aim to understand the factors that are linked with enhanced condensate formation propensity (Fig. 2), using the Swiss-Prot database (30) as a reference control. From the length analysis, we first concluded that the average construct in the LLPS+ dataset was longer than the average construct in the LLPS− and the PDB∗ datasets or in Swiss-Prot (Fig. 2A), which is consistent with theoretical expectations, as the entropic cost per amino acid of confining a protein into a condensate decreases with increasing protein length. We next estimated the hydrophobicity of all of the constructs in the four datasets using the Kyte and Doolittle hydropathy scale (31) and concluded the constructs in the LLPS-prone datasets to be less hydrophobic than the sequences in the other datasets (Fig. 2B). Finally, we noted that the LLPS-prone sequences exhibited a lower Shannon entropy than the sequences in any of the three other datasets (Fig. 2C).

Fig. 2. Comparison of (A) the sequence length (in amino acids, a.a.), (B) hydrophobicity, (C) Shannon entropy, and the fraction of the sequence that is part of (D) low-complexity regions (LCRs) and (E) intrinsically disordered regions (IDRs) for the three training datasets and Swiss-Prot. The comparative analysis highlighted that the average construct in the LLPS+ dataset (cyan) was longer and less hydrophobic than the average construct in the LLPS− (orange) and the PDB∗ (magenta) datasets or the Swiss-Prot (gray) dataset. It also had a lower Shannon entropy and a higher LCR fraction than sequences in the LLPS−, the PDB∗, or the Swiss-Prot datasets and a higher IDR fraction than sequences in the PDB∗ or the Swiss-Prot datasets. The dashed line in C corresponds to the case when all amino acids are present at equal frequencies. The boxes bound the data between the upper and the lower quartile, and the center lines indicate the mean value. The ends of the whiskers correspond to the most extreme values within 1.5 times the interquartile range. Significance was tested with a Mann–Whitney test; **** denotes a P value below 10^-4 and ns denotes no significance. (A–E) Full distributions are shown in SI Appendix, Fig. S1.
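The two per-sequence metrics compared in Fig. 2 B and C can be computed in a few lines. The hydropathy values below are the published Kyte–Doolittle scale (31); whether the paper reports a per-sequence sum or mean is an assumption here (a sum is used), and the ranking between datasets is the point in either case.

```python
import math

# Kyte-Doolittle hydropathy values (31) for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def hydrophobicity(seq):
    """Summed Kyte-Doolittle hydropathy of a sequence."""
    return sum(KD[aa] for aa in seq)

def shannon_entropy(seq):
    """Shannon entropy (bits) of the amino acid composition. The maximum,
    log2(20) ~ 4.32, is reached when all 20 residues are equally frequent
    (the dashed line in Fig. 2C)."""
    n = len(seq)
    counts = {aa: seq.count(aa) for aa in set(seq)}
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

A homopolymeric stretch scores zero entropy, which is why low-complexity, repeat-rich LLPS-prone sequences sit below the reference datasets on this metric.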

This lower entropy can be linked to their extended prion-like domains (21, 22).

To understand how sequence complexity and the extent of disorder were linked to the tendency of proteins to undergo phase separation, we employed the SEG algorithm (32) to extract LCRs for all of the sequences in the four datasets and the IUPred2 algorithm (33) to identify their disordered regions (Materials and Methods). This analysis revealed that constructs in the LLPS+ dataset had a larger fraction of the sequences that was part of the LCRs than the sequences in any of the other three datasets (Fig. 2D) and a higher degree of disorder than sequences in the PDB∗ or the Swiss-Prot dataset (Fig. 2E).

Amino Acid Composition of the Constructs Undergoing LLPS. Having ascertained the length of the low-complexity and intrinsically disordered regions as basic parameters that set the constructed datasets apart (Fig. 2), we next set out to analyze the amino acid composition of these regions. By classifying the amino acid residues into polar, hydrophobic, aromatic, cationic, and anionic categories (Materials and Methods), we observed that the propensity of proteins to undergo LLPS was associated with a higher relative content of polar (blue) and a reduced relative content of hydrophobic (orange) and anionic (purple) residues across the full amino acid sequence (Fig. 3A). The increased abundance of polar residues was particularly pronounced within the LCRs, with aromatic (green) and cationic (red) residues also being overrepresented within these regions compared to the sequences in the Swiss-Prot database (Fig. 3B), consistent with previous observations and findings (18, 21).
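The categorization step above can be sketched as follows. The exact residue-to-category assignment lives in the paper's Materials and Methods; the partition below is one common convention and should be treated as an assumption, not the paper's definition.

```python
# One common partition of the 20 residues into the five categories named
# above (an assumed grouping; the paper's own assignment may differ, e.g.,
# for G, C, or P).
CATEGORIES = {
    "polar":       set("STNQGC"),
    "hydrophobic": set("AVLIMP"),
    "aromatic":    set("FWY"),
    "cationic":    set("KRH"),
    "anionic":     set("DE"),
}

def composition(seq):
    """Fraction of the sequence falling in each residue category."""
    n = len(seq)
    return {name: sum(aa in members for aa in seq) / n
            for name, members in CATEGORIES.items()}
```

Applying `composition` to a full sequence gives the Fig. 3A-style profile; restricting `seq` to the SEG-extracted LCRs gives the Fig. 3B-style profile.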

[Fig. 3, panels A–E. A and B: residue-category fractions (polar, hydrophobic, aromatic, cationic, anionic) over the full sequence and over the LCRs for the LLPS+, LLPS−, PDB∗, and Swiss-Prot datasets. C: fraction of hydrophobic residues against fraction of polar residues, with Inset R2 values of 0.43 (LLPS+), 0.64 (LLPS−), 0.92 (PDB∗), and 0.89 (Swiss-Prot). D: distributions of the difference between polar and hydrophobic fractions. E: composition of the homotypically phase-separating constructs, with marker sizes indicating saturation concentrations of 0.1 µM, 10 µM, and 1 mM.]

Fig. 3. Comparison of the amino acid composition of the sequences within the LLPS+, LLPS−, PDB∗, and Swiss-Prot datasets. (A) LLPS-prone sequences were enriched in polar amino acid residues (blue) and depleted in hydrophobic (orange) and anionic (purple) residues. (B) The elevated abundance of polar residues was particularly pronounced within the LCRs, with the sequences in the LLPS+ dataset also having their LCRs enriched in aromatic (green) and cationic (red) residues compared to the Swiss-Prot database. (C) The relationship between polar and hydrophobic residues was less tightly conserved among LLPS-prone sequences, as indicated by smaller R2 values (Inset). (D) Compared to the Swiss-Prot and the PDB∗ datasets, LLPS-prone proteins showed an overabundance of polar residues relative to hydrophobic ones. (E) When further examining the relationship between the saturation concentration of a construct (indicated by the marker size; a saturation concentration of 10 mM was assumed for PDB∗) and its amino acid composition, a strong overabundance of polar residues was seen to lead to an increase in saturation concentration, with most LLPS-prone sequences showing a tight balance between the abundances of polar and hydrophobic residues.

Moreover, we observed that not only are sequences with a high LLPS propensity enriched in polar residues; they also showed a much less tightly conserved relationship between polar and hydrophobic residues (Fig. 3C). The narrowly defined hydrophobic fraction and the high relative abundance of hydrophobic residues in the more structured proteins are likely linked to the requirement of a hydrophobic core underpinning their structured nature.

Since the high abundance of polar residues relative to hydrophobic ones clearly correlated with elevated LLPS propensity (Fig. 3D), we next aimed to explore whether a very high content of polar residues affects protein phase behavior. This analysis was motivated by associative polymer theory and the "stickers and spacers" framework (21), whereby the intermolecular interactions and the onset of phase separation are facilitated by protein "sticker" regions and an interplay between the "sticker" and the "spacer" regions. To this effect, we first evaluated the saturation concentrations of all of the 149 constructs that had been seen to undergo homotypic phase separation (Fig. 1B) as the lowest concentration at which the particular construct had been seen to phase separate. We then used these estimates to examine how the saturation concentration varied with the amino acid composition of the protein (Fig. 3E). We also included in the examination the structured proteins, with their marker size set to correspond to a saturation concentration of 10 mM. This analysis suggested that a low abundance of polar residues leads to proteins establishing a structured conformation, which in turn lowers their propensity to undergo phase separation. By contrast, after the relative abundance of polar residues over hydrophobic ones reached a critical value of about 2, an increase in saturation concentration was again observed. The latter effect likely originated from insufficient sticker regions that would facilitate an effective phase-separation process. Taken together, these results illustrate the importance of both hydrophobic sticker and disordered spacer regions for a high phase separation propensity (21).

A Model for Classifying the Propensity of Unseen Sequences to Phase Separate. We next developed machine-learning classifiers that could predict the propensity of proteins to undergo phase separation using the constructed datasets (LLPS+, LLPS−, and PDB∗) for training the models. To convert the sequences into feature vectors, we used the aforementioned engineered features (EFs) in combination with distributional semantics-based sequence embeddings. Specifically, we pretrained a word2vec model using the Swiss-Prot database as a corpus and overlapping 3-grams as words. This pretrained model was then used to generate 200-dimensional embedding vectors to convert protein sequences into feature vectors (Fig. 4A and Materials and Methods). Such an approach—exploiting the availability of large unlabeled sequence datasets to learn meaningful low-dimensional representations—has previously been shown to serve as an effective transfer learning strategy for predicting the properties of proteins (34). Crucially, when we mapped the embedding vectors for each 3-gram on a two-dimensional (2D) plane [Multicore-TSNE library (35); the axes have no particular meaning], we saw that the 3-grams clustered according to their biophysical properties, such as their estimated hydrophobicity (Fig. 4B) and isoelectric point (Fig. 4C), illustrating the capability of the pretrained language model (LM) to capture biophysically relevant information.

We used a dimensionality reduction approach (35) to visualize the feature vectors of all of the data points in the training data on a 2D plane, both for the case when the EFs (sequence length, hydrophobicity, Shannon entropy, the fraction of the sequence that is part of the LCRs and the IDRs [Fig. 2], and the fraction of polar, aromatic, and cationic amino acid residues within the LCRs [Fig. 3]) were used and for the word2vec-based embeddings (Fig. 4 D and E). This process revealed a notable degree of separation between the LLPS+ (cyan) and the PDB∗ (magenta) datasets, with the sequences in the LLPS− dataset spread between them. The clustering between the LLPS+ and the PDB∗ datasets was clearer for the engineered features, probably because some of the dimensions, such as the degree of disorder, set the two classes very distinctly apart (Fig. 2E). This is in contrast to the LM-based embedding vectors, where such differences are not confined to a single dimension. This mapping of multidimensional vectors into a lower-dimensional manifold served a visual purpose and did not aim to give direct quantitative insight into how effectively classifiers could use these featurization strategies.

We next set out to gain an insight into how well these two feature types could distinguish between the three classes of proteins. Specifically, we trained random forest classifiers for each of the three pairs of datasets using 25-fold cross-validation and estimated their performance with 20% of the data left out for validation each time (Materials and Methods). For the cases when the PDB∗ dataset was used as a training set, a random subset sized equally to the LLPS+ dataset was sampled to ensure that the model encountered a comparable number of sequences from each class during the training process. Moreover, to ensure generalizability, the data were split into the training and the validation sets in a stratified manner, so that sequences that belong to the same UniProt ID would be under the same split. Using performance metrics such as accuracy, precision, and the area under the receiver-operator-characteristic (ROC) curve (AUROC), we saw that both EF- and LM-based features allowed an efficient distinction between the LLPS+ and the PDB∗ as well as between the LLPS− and the PDB∗ datasets, with the distinction between the LLPS+ and the LLPS− datasets being the most challenging. These performance metrics were robust to the test–train split that was used (SI Appendix, Fig. S2). For the proceeding analysis we used the following four models: EF-1 and LM-1, which had been trained on the LLPS+ and the PDB∗ datasets, and, additionally, models EF-multi and LM-multi, which were constructed as multiclass classifiers to distinguish simultaneously between all of the three datasets (Materials and Methods).

Performance of the Models on an External Dataset. Having established high cross-validation performance within our generated datasets, we next set out to test the performance of the models on external data. Specifically, we evaluated the models on two tasks: 1) distinguishing sequences with a high LLPS propensity from sequences very unlikely to undergo phase separation and 2) identifying LLPS-prone sequences from within the human proteome.

First, to construct an external set of LLPS-prone proteins, we used the human phase separation database PhaSepDB (36). After removing any entries with UniProt IDs that overlapped with the LLPSDB database and hence with any of our training data, we obtained a set of 196 LLPS-prone human sequences (Dataset S5). A further examination of these 196 sequences highlighted that 35 of them included no intrinsically disordered regions. In general, while it is known that fully structured proteins can in principle undergo phase separation, they usually do so at high concentrations and hence would normally not be regarded as phase separation prone (37). To validate this trend, we examined the phase behavior of these fully structured sequences together with the LLPSDB database (used for constructing the training datasets), where the LLPS behavior of all of the protein constructs was listed together with the environmental conditions. This analysis revealed that all of the experiments where fully structured proteins had been observed to phase separate were performed in an environment with an extensive amount of molecular crowding (e.g., ficoll, dextran, polyethylene glycol) or in the presence of nonhomotypical components (e.g., lipids). It is thus likely that the fully ordered sequences identified as phase separating within the PhaSepDB by a keyword-based literature search similarly phase separated under high concentrations, under nonhomotypical conditions, or under notable molecular crowding, meaning that their phase transition cannot be directly linked to the level of homotypic interactions between protein molecules.

[Fig. 4, panels A–H. A: schematic of the single-hidden-layer word2vec model (one-hot encoded input layer, 200-dimensional hidden layer, output predictions) with training settings: corpus, SwissProt; context window, 25 a.a.; training, 3-gram with hierarchical softmax. B and C: 2D projections of the 3-gram embedding vectors colored by the sum of hydrophobicities and the sum of pIs. D and E: 2D projections of the engineered features (EF) and the language-model (LM) embeddings for the LLPS+, LLPS−, and PDB∗ datasets. F–H: accuracy, precision, and AUROC of the two featurization strategies for the three pairwise classification tasks.]
Fig. 4. (A) A single-hidden-layer language model (LM) was pretrained to learn embedding vectors for each amino acid 3-gram (Materials and Methods). The generated embedding vectors clustered 3-grams according to their (B) hydrophobicity (evaluated as the sum of the Kyte and Doolittle hydrophobicity values of the individual amino acids in the 3-gram) and (C) isoelectric point (pI; sum of the pI values of the individual amino acids). Dimensionality reduction from the 200-dimensional vectors to the 2D plane was performed with the Multicore-TSNE library (35). Visualizing the similarity of the sequences in the LLPS+ (cyan), the LLPS− (orange), and the PDB∗ (magenta) datasets in a 2D plane revealed noticeable clustering between the LLPS+ and the PDB∗ datasets both for (D) EF vectors and (E) embedding vectors generated by the LM. (F and G) Accuracy (F) and precision (G) of the featurization strategies when distinguishing between the three pairs of the protein classes using a random forest classifier (performance shown for 25-fold cross-validation). (H) Both approaches were highly effective at distinguishing between the LLPS+ and the PDB∗ datasets (AUROC of 0.96 ± 0.01 for both). When discriminating between the more similar LLPS+ and LLPS− datasets, the AUROC scores were lower, with the LM-based features potentially reaching a slightly higher score than EFs (AUROCs of 0.71 ± 0.01 and 0.67 ± 0.01, respectively).
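The UniProt-stratified splitting used for the cross-validation described above (all constructs derived from one parent protein kept on the same side of the split) can be sketched without any machine-learning library; `group_split` is a hypothetical helper written for illustration, not code from the paper.

```python
# Group-aware hold-out split: constructs sharing a UniProt ID never straddle
# the train/validation boundary, which keeps the cross-validation estimate
# honest when a parent protein contributes several near-duplicate constructs.
import random

def group_split(records, holdout=0.2, seed=0):
    """records: list of (uniprot_id, features, label) tuples."""
    ids = sorted({r[0] for r in records})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_val = max(1, int(len(ids) * holdout))
    val_ids = set(ids[:n_val])
    train = [r for r in records if r[0] not in val_ids]
    val = [r for r in records if r[0] in val_ids]
    return train, val
```

Repeating the split with different seeds gives the repeated hold-out estimates; scikit-learn's `GroupShuffleSplit` provides the same behavior off the shelf.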

Saar et al., "Learning the molecular grammar of protein condensates from sequence determinants and embeddings," PNAS, https://doi.org/10.1073/pnas.2019053118

Fig. 5. Identifying LLPS-prone proteins from the human proteome. (A) The prediction profiles of model EF-1 (trained on the LLPS+ and PDB∗ classes) and model EF-multi (trained on all three protein classes) on external test data comprising 161 LLPS-prone sequences (pos.; colored circles) and 161 sequences highly unlikely to undergo phase separation (neg.; colored triangles) and on the human proteome (colored region; 20,291 proteins). The positive part of the external test data was constructed from the PhaSepDB database, with all proteins that had their UniProt IDs overlapping with the training data excluded. (C) Same data for models LM-1 and LM-multi, where the 200-dimensional representations learned from a pretrained word2vec model were used for featurization. (B and D) ROC curves of the models 1) on the external test data (dashed line) and 2) when identifying LLPS-prone sequences from the human proteome, where sequences for which phase separation had not been reported were regarded as nonphase separating (lower bound for the false positive rate; solid line). (E) Comparison of the AUROC values for the ROC curves shown in B and D for the two tasks.

Additionally, we set out to gain an insight into the capability of the models to identify LLPS-prone sequences from the human proteome. The LLPS-prone part of the external test data was constructed from the PhaSepDB database (Dataset S3) after having removed from it sequences that were used in the training process (Dataset S4). Moreover, for some of the entries the phase transition cannot be directly linked to the protein sequence, as it was not exclusively triggered by homotypic interactions between protein molecules but by nonhomotypical conditions or crowding. Motivated by this argument, we eliminated such sequences, which yielded a list of 161 sequences (Dataset S6). The external test data (Dataset S7) were completed by an equal number (161) of fully structured sequences highly unlikely to undergo phase separation, serving as our negative set. The predictions made by all of the four models on the external test data (circles, LLPS-prone sequences; triangles, nonphase-separating sequences) are highlighted in Fig. 5, with the shaded regions corresponding to the predictions the models made on the human Swiss-Prot (20,365 entries). Using these data, we constructed the ROC curves for this task (Fig. 5 B and D, dashed lines) and concluded that all of the four models were able to effectively distinguish between the two types of proteins, with their AUROCs ranging between 0.98 and 0.99 (Fig. 5E).

As the phase behavior of many nonstructured proteins remains unstudied, it is challenging to generate a dataset of proteins that do not undergo phase separation. To obtain an estimate of the rate of false positive predictions of the models when identifying LLPS-prone sequences, we relied on the exhaustive keyword-based literature search performed as part of the construction of the PhaSepDB database (36). This database was curated by extracting all publications from the NCBI PubMed database that included phase separation-related keywords in their abstracts. The resulting publications were manually rechecked to obtain the 2,763 publications that described membraneless organelles and related proteins, and these were further filtered manually to proteins that had been experimentally observed either in vitro or in vivo to undergo phase separation. Making a conservative assumption that all of the proteins that were not identified as LLPS positive through this PubMed search are nonphase separating, we could estimate an approximated upper bound for the false positive rate of each of the models when identifying LLPS-prone sequences. The ROC curves with respect to (w.r.t.) the proteome are shown in Fig. 5 B and D, solid lines. All of the four models showed a notable predictive performance, with AUROCs w.r.t. the proteome varying between 0.74 and 0.81 (Fig. 5E). This performance is in contrast to control
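The AUROC values quoted here have a simple probabilistic reading: the chance that a randomly chosen LLPS-prone sequence outscores a randomly chosen negative one (ties counting half). A minimal pairwise sketch (function name mine; quadratic in dataset size, which is fine at these scales):

```python
from itertools import product

def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative
    (ties count 0.5) - equivalent to the area under the ROC curve."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(scores_pos, scores_neg)
    )
    return wins / (len(scores_pos) * len(scores_neg))

# e.g. model scores on LLPS-prone (positive) vs. structured (negative) sequences
print(auroc([0.9, 0.7], [0.8, 0.6]))  # 0.75
```

A perfect separation of the two classes gives 1.0, random scoring gives 0.5.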

experiments where, prior to training the models, the labels of the sequences were randomly reshuffled (SI Appendix, Fig. S3) and the actual sequence compositions were replaced by randomly sampling amino acids from the Swiss-Prot database (SI Appendix, Fig. S4). All in all, the results indicate that our models can distinguish between LLPS-prone proteins and structured ones and they can also identify LLPS-prone proteins from within the human proteome. These results also highlight that w2v-based featurization creates meaningful low-dimensional representations that can be used for building classifiers for downstream tasks, in this case, for the prediction of protein phase behavior, without requiring prior insight into the features that govern the process.

Comparison of Explicitly Engineered Features and Learned Embeddings. The use of two distinct featurization approaches—one that used only knowledge-based features and another one that relied on hypothesis-free embedding vectors—provided us with the opportunity to investigate whether, in addition to being able to predict LLPS propensity, language models can also learn the underlying features of protein condensate formation. The prediction profiles of the different models shown in Fig. 5 suggest that the difference between the two featurization approaches was most pronounced when only LLPS+ and PDB∗ datasets were used for training (models EF-1 and LM-1). We thus focused the exploratory analysis on these two models involving the same training data but a different featurization strategy.

First, we noticed that when the predictions of the models were binned by intrinsic disorder, the predictions correlated with the degree of disorder for both models (Fig. 6 A and B; data across the full human proteome). While this correlation was not unexpected for model EF-1 for which disorder was an explicit input feature, the presence of such a correlation in the case of LM-1 suggested that not only can language models predict LLPS propensity, they also can capture information about the biophysical features underpinning this process. Second, we hypothesized that a key difference between models EF-1 and LM-1—whether or not disorder was used as an explicit input feature—may equip LM-1 with an enhanced capability to discriminate between disordered sequences of varying LLPS propensity. We tested this hypothesis by examining the predictions of the two models on highly disordered sequences (IDR fraction above 0.5) from the low LLPS-propensity dataset, LLPS−. As the models were trained on the LLPS+ and PDB∗ datasets, none of the sequences from LLPS− were part of the training data for these two models. Our analysis (Fig. 6C) revealed that while both models could distinguish the low LLPS-propensity datasets from structured proteins, LM-1 was also able to discriminate between two classes of disordered proteins of varying LLPS propensity (orange and red lines) in contrast to EF-1 for which the predictions for these two classes of disordered proteins with varying LLPS propensity were nearly identical (green and blue lines).

DeePhase Model. Finally, with models EF-multi and LM-multi using different input features but still demonstrating comparably good performance, we created our final model, termed DeePhase, where the prediction on every sequence was set to be the average prediction made by the two models. As expected, the models could effectively distinguish between the LLPS-prone and the structured proteins in the external test dataset (Fig. 7A; cyan and pink regions; AUROC of 0.99). When identifying LLPS-prone sequences from the human proteome (cyan and gray regions) as outlined earlier, AUROC of 0.84 was reached, which was comparable to or slightly exceeded what models EF-multi (0.83) and LM-multi (0.81) achieved on their own.
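The DeePhase combination rule is a plain average of the two constituent models' outputs; sketched below (function and argument names are mine):

```python
def deephase_score(ef_multi_pred: float, lm_multi_pred: float) -> float:
    """DeePhase prediction as the average of the EF-multi and LM-multi scores."""
    return 0.5 * (ef_multi_pred + lm_multi_pred)

# a sequence scored 0.9 by EF-multi and 0.7 by LM-multi averages to 0.8
print(deephase_score(0.9, 0.7))
```

Averaging two well-calibrated but differently featurized predictors is a standard ensembling choice: errors that are uncorrelated between the two models tend to partially cancel.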

[Fig. 6, panels A–C (image): (A and B) predicted LLPS propensity binned by disorder fraction (bin centers 0.05–0.95) for the EF-based and LM-based models; (C) prediction-density profiles of models EF-1 and LM-1 on the positive test set, the negative test set, and disordered yet less LLPS-prone sequences.]

Fig. 6. Comparison of the models constructed using EFs and embedding vectors extracted from a LM. (A and B) The predicted LLPS-propensity score correlated with the disorder content when both EF- and LM-based embeddings were used. This trend indicated the language model was able to learn a key underlying feature associated with a high LLPS propensity. (C) The prediction profiles of models EF-1 and LM-1 on LLPS-prone sequences (external positive test set), structured sequences (external negative test set), and disordered sequences with a low LLPS propensity (sequences with IDR fraction above 0.5 that were part of the LLPS− dataset and underwent homotypic phase separation). As these models were trained on the LLPS+ and PDB∗ datasets, they did not use LLPS− as training data. Model LM-1 stood out for its capability to distinguish between disordered proteins with different phase-separation propensities (orange and red lines) in contrast to EF-1 that made nearly identical predictions on these two classes of proteins (blue and green lines).
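The disorder binning in A and B amounts to grouping (disorder fraction, prediction) pairs into fixed-width bins and averaging the predictions per bin; a sketch with names and a 0.1 bin width of my choosing, matching the panel ticks:

```python
from collections import defaultdict

def bin_by_disorder(records, width=0.1):
    """Group (disorder_fraction, prediction) pairs into fixed-width bins
    and return the mean prediction per bin, keyed by the bin center."""
    bins = defaultdict(list)
    for disorder, pred in records:
        center = (int(disorder / width) + 0.5) * width
        bins[round(center, 3)].append(pred)
    return {c: sum(v) / len(v) for c, v in sorted(bins.items())}

# toy (disorder, prediction) pairs; bin centers come out as 0.05 and 0.95
data = [(0.02, 0.1), (0.08, 0.3), (0.93, 0.8), (0.97, 1.0)]
print(bin_by_disorder(data))
```

Run over the full proteome with each model's scores, this produces the per-bin means plotted in panels A and B.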

To further investigate the generalizability of DeePhase, we evaluated its performance on sequences that do not share high sequence similarity with the training set. Specifically, by clustering the external test data together with the training data [CD-hit algorithm (28), with the lowest threshold of 0.4] and retaining only the sequences that did not cocluster with any of the training sequences, the external test dataset was reduced from 161 to 109 sequences. With this reduction, the AUROC w.r.t. the proteome dropped from 0.84 down to only 0.83, illustrating that the performance of DeePhase generalizes to sequences that share low similarity with the training data. To evaluate the limits of the DeePhase model further still, we also examined the LLPS-propensity score of a test set of 73 artificial proteins (Dataset S8) that had been experimentally observed to phase separate in an earlier study (38) yet were evolutionarily not related to the sequences in the training set. These constructs had a high LLPS propensity, were not in our training set, and DeePhase allocated to them a high LLPS-propensity score (Fig. 7A, yellow).

To conclude, we compared the performance of DeePhase to two previously developed algorithms, PScore and CatGranule, algorithms that had recently been identified as the best-performing algorithms for evaluating the LLPS propensity of proteins in a comparative study (23). As the use of the PScore algorithm is limited to sequences that are longer than 140 residues, sequences shorter than this threshold value were removed, which reduced the size of the proteome down to 18,473 sequences. On this dataset, the AUROC of DeePhase w.r.t. the proteome was 0.83, exceeding by over 10% what was achieved by the PScore and the CatGRANULE models (Fig. 7B). We note that the comparison was constructed in a manner that would not favor DeePhase, as it was ensured that any LLPS-prone sequences that DeePhase had encountered during the training process were excluded.

Fig. 7. Generalizability of DeePhase to evolutionarily nonrelated sequences and comparison to previously developed LLPS predictors. (A) DeePhase prediction profile on the human proteome (gray), on the external test data that have been experimentally validated (161 LLPS-prone [cyan] and 161 structured proteins [pink]), and on a set of 73 artificial sequences from an evolutionarily nonrelated dataset (38) recently found to have a high LLPS propensity (yellow), indicating that the capability of DeePhase to evaluate phase behavior extends to evolutionarily nonrelated sequences. (B) Comparison of DeePhase to CatGRANULE and PScore (23) when identifying LLPS-prone sequences from the human proteome; the fraction of the external test set identified correctly was 0.77 for CatGranule, 0.78 for PScore, and 0.83 for DeePhase. For a reliable comparison, sequences were filtered to a length of 140 residues or above, as this is the lowest threshold at which the PScore can be evaluated.

Conclusion

To understand how protein sequence governs its phase behavior and build an algorithm for predicting LLPS-prone sequences, we constructed datasets of proteins of varying LLPS propensity. The analysis of the curated datasets highlighted that LLPS-prone sequences were less hydrophobic and had a higher degree of disorder and a lower Shannon entropy than an average protein in the Swiss-Prot database. Furthermore, our analysis of the amino acid compositions indicated that while LLPS-prone sequences were enriched in polar residues, the lowest saturation concentrations were reached when their abundance was balanced by hydrophobic residues. Relying on the identified features as well as on the hypothesis-free embedding vectors generated by a language model, we constructed machine-learning classifiers for predicting protein phase behavior. We observed that the model built on unsupervised embedding vectors was able to predict protein phase behavior at a comparable accuracy to a model that relied on knowledge-based features, demonstrating the capability of language models to learn the molecular grammar of protein phase behavior. Our final model, DeePhase, that combined the engineered features with unsupervised embeddings, showed a high performance both when distinguishing LLPS-prone proteins from structured ones and when identifying them within the human proteome, establishing a framework rooted in molecular principles for predicting protein phase behavior.

Materials and Methods

Construction of the LLPS+ and LLPS− Datasets. The LLPS+ and LLPS− datasets were constructed using the previously published LLPSDB repository (accessed on 20 May 2020) (27). Specifically, a total of 2,143 entries documented as “natural protein” was used, and the entries were examined for the occurrence of the “LLPS protein name” of their proteins. The 2,143 entries were filtered down to constructs of naturally occurring proteins with no posttranslational modifications and to single protein systems where the protein sequence was longer than 50 amino acids and included only a single repeat or single-site mutations, examined under various experimental conditions. This procedure resulted in a dataset with 769 experimental entries from 120 different UniProt IDs, including a total of 231 unique constructs. For each of the constructs, the experiments where the construct had been observed to phase separate were combined and the average concentration at which these positive experiments were performed was evaluated. When the latter concentration was below 100 µM, the construct was regarded as LLPS prone and it was included in the LLPS+ dataset (Dataset S1). This process resulted in a total of 137 constructs from 77 unique UniProt IDs. To generate the LLPS− dataset, the constructs that had not been observed to phase separate were combined with the constructs that had not been observed to phase separate at a concentration higher than 100 µM; the latter dataset included a total of 94 entries from 61 unique UniProt IDs. The clustering was performed using CD-hit (28) using its lowest similarity threshold, 0.4.

Construction of the PDB∗ Dataset. Entries in the PDB (29) were used to generate a diverse set of proteins highly unlikely to undergo LLPS. Specifically,
BIOPHYSICS AND COMPUTATIONAL BIOLOGY first, amino acid chains that were fully structured (i.e., did not include any 3-grams as words and a context window size of 25—parameters that have disordered residues) were extracted, which resulted in a total of 112,572 been previously shown to work effectively when predicting protein proper- chains. PDB chains were matched to their corresponding UniProt IDs using ties via transfer learning (34). The skip-gram pretraining procedure with nega- Structure Integration with Function, Taxonomy and Sequence service by tive sampling was used and implemented using the Python gensim library (39) the European Institute, and entries where sequence length with its default settings. This pretraining process created 200-dimensional did not match were discarded. Duplicate entries were removed and the embedding vectors for every 3-gram. To evaluate the embedding vectors for remaining 13,325 chains were clustered for their sequence identity using the protein, each protein sequence was broken into 3-grams using all three a conservative cutoff of 30%. One sequence from each cluster was selected, possible reading frames and the final 200-dimensional protein embeddings resulting in the final dataset of 1,563 sequences. were obtained by summing all of the constituent 3-gram embeddings.

Estimation of Physical Features from the Sequences. A range of explicit Machine-Learning Classifier Training and Performance Estimation. All classi- physicochemical features was extracted for all of the sequences in the four fiers were built using the Python scikit-learn package (40) with default datasets from their amino acid sequences (LLPS+, LLPS−, PDB∗, and Swiss- parameters. No hyperparameter tuning was performed for any of the Prot). Specifically, the molecular weight of each sequence and its amino models—while such a tuning step may have given an improvement in accu- acid composition were calculated using the Python package BioPython. The racy, it can also lead to an overfitted model that does not generalize well hydrophobicity of each sequence was evaluated by summing the individual to unseen data. For performing the training, the dataset was split into a hydrophobicity values of the amino acids in the sequences using the Kyte train and a validation test in a 1 : 4 ratio and 25-fold cross-validation was and Doolittle hydropathy scale (31). The Shannon entropy of each sequence used to estimate the performance of the model. For each fold, the split was was estimated from the formula performed in a stratified manner, such that entries from the same UniProt ID were always grouped together either to the training set or to the vali- N=20 X dation set. A random seed of 42 was used throughout. Multiclass classifiers ( ) = − log , H X pi 2 pi [1] were trained using the multiclass classification module of the Python scikit- i=1 learn package (40). The final prediction score was calculated as the sum of the probability of the sequence belonging to the LLPS+ and half of the where P corresponds to the frequency of each of the naturally occurring − 20 amino acids in the sequence. The LCRs for each of the sequences were probability of it belonging to the LLPS dataset. estimated using the SEG algorithm (32) with standard parameters. 
The disordered region was predicted with IUPred2a (33) that estimated the Data Availability. The data and code are available from GitHub, probability of disorder for each of the individual amino acid residues in the https://github.com/kadiliissaar/deephase. All the data is also included in sequence. The disorder fraction of a sequence was calculated as the fraction Datasets S1–S8 of residues in the total sequence that were considered disordered where a ACKNOWLEDGMENTS. The research leading to these results has received specific residue was classified as disordered when the disorder probability funding from the Schmidt Science Fellows program in partnership with stayed above 0.5 for at least 20 consecutive residues. the Rhodes Trust (K.L.S.), a St John’s College Junior Research Fellowship Finally, the amino acid sequence and the LCRs were described for their (to K.L.S.), a Trinity College Krishnan-Ang Studentship (to R.Q.) and the amino acid content by allocating the residues to the following groups: Honorary Trinity-Henry Barlow Scholarship (to R.Q.), the Engineering and amino acids with polar residues (serine, glutamine, asparagine, glycine, cys- Physical Sciences Research Council (EPSRC) Centre for Doctoral Training in teine, threonine, proline), with hydrophobic residues (alanine, isoleucine, NanoScience and Nanotechnology (NanoDTC) (EP/L015978/1 to W.E.A.), the leucine, methionine, phenylalanine, valine), with aromatic residues (tryp- EPSRC Impact Acceleration Program (W.E.A., G.K., T.P.J.K.), the European tophan, tyrosine, phenylalanine), with cationic residues (lysine, arginine, Research Council under the European Union’s Horizon 2020 Framework Pro- gram through the Marie Sklodowska-Curie Grant MicroSPARK (Agreement histidine), and with anionic residues (aspartic acid, glutamic acid). 
841466 to G.K.), the Herchel Smith Fund of the University of Cambridge (to G.K.), a Wolfson College Junior Research Fellowship (to G.K.), the European Protein Sequence Embeddings. Protein sequence embeddings were evalu- Research Council under the European Union’s Seventh Framework Program ated using a pretrained word2vec model. Specifically, the pretraining was (FP7/2007–2013) through the European Research Council Grant PhysProt performed on the full Swiss-Prot database (accessed on 26 Jun 2020) using (Agreement 337969 to T.P.J.K.), and the Newman Foundation (T.P.J.K.).
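The 3-gram decomposition and summation described above can be sketched as follows; the toy two-dimensional vectors stand in for the 200-dimensional word2vec embeddings (in practice these would come from the trained gensim model's vocabulary), and all helper names are mine:

```python
def three_grams(seq: str):
    """All overlapping 3-grams of a sequence; enumerating every start
    position is equivalent to combining the three nonoverlapping
    reading frames (offsets 0, 1, and 2)."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def protein_embedding(seq: str, gram_vectors: dict, dim: int = 200):
    """Protein embedding as the sum of its constituent 3-gram embeddings
    (3-grams absent from the toy vocabulary are skipped here)."""
    total = [0.0] * dim
    for gram in three_grams(seq):
        vec = gram_vectors.get(gram)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total

# toy 2-dimensional "embeddings", purely for illustration
toy = {"MKV": [1.0, 0.0], "KVL": [0.0, 1.0], "VLS": [1.0, 1.0]}
print(three_grams("MKVLS"))                    # ['MKV', 'KVL', 'VLS']
print(protein_embedding("MKVLS", toy, dim=2))  # [2.0, 2.0]
```

Summation (rather than averaging) makes the embedding magnitude grow with sequence length, which is one of the reasons molecular weight is also carried as an explicit feature elsewhere in the pipeline.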

1. C. P. Brangwynne et al., Germline P granules are liquid droplets that localize by controlled dissolution/condensation. Science 324, 1729–1732 (2009).
2. A. A. Hyman, C. P. Brangwynne, Beyond stereospecificity: Liquids and mesoscale organization of cytoplasm. Dev. Cell 21, 14–16 (2011).
3. A. A. Hyman, C. A. Weber, F. Jülicher, Liquid-liquid phase separation in biology. Annu. Rev. Cell Dev. Biol. 30, 39–58 (2014).
4. S. Boeynaems et al., Protein phase separation: A new phase in cell biology. Trends Cell Biol. 28, 420–435 (2018).
5. S. Alberti, A. Gladfelter, T. Mittag, Considerations and challenges in studying liquid-liquid phase separation and biomolecular condensates. Cell 176, 419–434 (2019).
6. A. Aguzzi, M. Altmeyer, Phase separation: Linking cellular compartmentalization to disease. Trends Cell Biol. 26, 547–558 (2016).
7. P. Li et al., Phase transitions in the assembly of multivalent signaling proteins. Nature 483, 336–340 (2012).
8. E. Sokolova et al., Enhanced transcription rates in membrane-free protocells formed by coacervation of cell lysate. Proc. Natl. Acad. Sci. U.S.A. 110, 11692–11697 (2013).
9. J. Berry, S. C. Weber, N. Vaidya, M. Haataja, C. P. Brangwynne, RNA transcription modulates phase transition-driven nuclear body assembly. Proc. Natl. Acad. Sci. U.S.A. 112, E5237–E5245 (2015).
10. Y. Shin, C. P. Brangwynne, Liquid phase condensation in cell physiology and disease. Science 357, eaaf4382 (2017).
11. J. A. Riback et al., Stress-triggered phase separation is an adaptive, evolutionarily tuned response. Cell 168, 1028–1040 (2017).
12. C. P. Brangwynne, P. Tompa, R. V. Pappu, Polymer physics of intracellular phase transitions. Nat. Phys. 11, 899–904 (2015).
13. E. Gomes, J. Shorter, The molecular language of membraneless organelles. J. Biol. Chem. 294, 7115–7127 (2019).
14. A. S. Raut, D. S. Kalonia, Effect of excipients on liquid–liquid phase separation and aggregation in dual variable domain immunoglobulin protein solutions. Mol. Pharm. 13, 774–783 (2016).
15. K. Julius et al., Impact of macromolecular crowding and compression on protein–protein interactions and liquid–liquid phase separation phenomena. Macromolecules 52, 1772–1784 (2019).
16. L. Lemetti et al., Molecular crowding facilitates assembly of spidroin-like proteins through phase separation. Eur. Polym. J. 112, 539–546 (2019).
17. G. Krainer et al., Reentrant liquid condensate phase of proteins is stabilized by hydrophobic and non-ionic interactions. Nat. Commun. 12, 1085 (2021).
18. E. W. Martin, T. Mittag, Relationship of sequence and phase separation in protein low-complexity regions. Biochemistry 57, 2478–2487 (2018).
19. R. M. Vernon et al., Pi-pi contacts are an overlooked protein feature relevant to phase separation. Elife 7, e31486 (2018).
20. G. L. Dignon, R. B. Best, J. Mittal, Biomolecular phase separation: From molecular driving forces to macroscopic properties. Annu. Rev. Phys. Chem. 71, 53–75 (2020).
21. E. W. Martin et al., Valence and patterning of aromatic residues determine the phase behavior of prion-like domains. Science 367, 694–699 (2020).
22. J. Wang et al., A molecular grammar governing the driving forces for phase separation of prion-like RNA binding proteins. Cell 174, 688–699 (2018).
23. R. M. Vernon, J. D. Forman-Kay, First-generation predictors of biological protein phase separation. Curr. Opin. Struct. Biol. 58, 88–96 (2019).
24. H. R. Li, W. C. Chiang, P. C. Chou, W. J. Wang, J. R. Huang, TAR DNA-binding protein 43 (TDP-43) liquid–liquid phase separation is mediated by just a few aromatic residues. J. Biol. Chem. 293, 6090–6098 (2018).
25. S. Elbaum-Garfinkle et al., The disordered P granule protein LAF-1 drives phase separation into droplets with tunable viscosity and dynamics. Proc. Natl. Acad. Sci. U.S.A. 112, 7189–7194 (2015).
26. T. P. Dao et al., Ubiquitin modulates liquid-liquid phase separation of UBQLN2 via disruption of multivalent interactions. Mol. Cell 69, 965–978 (2018).
27. Q. Li et al., LLPSDB: A database of proteins undergoing liquid–liquid phase separation in vitro. Nucleic Acids Res. 48, D320–D327 (2020).
28. W. Li, A. Godzik, CD-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
29. H. M. Berman et al., The Protein Data Bank (accessed on 26 June 2020). Nucleic Acids Res. 28, 235–242 (2000).

30. UniProt Consortium, UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
31. J. Kyte, R. F. Doolittle, A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132 (1982).
32. J. C. Wootton, S. Federhen, Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17, 149–163 (1993).
33. Z. Dosztányi, V. Csizmok, P. Tompa, I. Simon, The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J. Mol. Biol. 347, 827–839 (2005).
34. E. Asgari, M. R. Mofrad, Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS One 10, e0141287 (2015).
35. D. Ulyanov, Multicore-TSNE (2016). https://github.com/DmitryUlyanov/Multicore-TSNE. Accessed 15 January 2021.
36. K. You et al., PhaSepDB: A database of liquid–liquid phase separation related proteins. Nucleic Acids Res. 48, D354–D359 (2020).
37. H. X. Zhou, V. Nguemaha, K. Mazarakos, S. Qin, Why do disordered and structured proteins behave differently in phase separation? Trends Biochem. Sci. 43, 499–516 (2018).
38. M. Dzuricky, B. A. Rogers, A. Shahid, P. S. Cremer, A. Chilkoti, De novo engineering of intracellular condensates using artificial disordered proteins. Nat. Chem. 12, 814–825 (2020).
39. R. Řehůřek, P. Sojka, “Software framework for topic modelling with large corpora” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (ELRA, Valletta, Malta, 2010), pp. 45–50.
40. F. Pedregosa et al., Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

BIOPHYSICS AND COMPUTATIONAL BIOLOGY