Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences
bioRxiv preprint doi: https://doi.org/10.1101/622803; this version posted August 31, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Alexander Rives *1,2, Joshua Meier *1, Tom Sercu *1, Siddharth Goyal *1, Zeming Lin 2, Demi Guo 3†, Myle Ott 1, C. Lawrence Zitnick 1, Jerry Ma 4†, Rob Fergus 2

*Equal contribution. †Work performed while at Facebook AI Research. 1Facebook AI Research. 2Dept. of Computer Science, New York University. 3Harvard University. 4Booth School of Business, University of Chicago & Yale Law School. Correspondence to: Alexander Rives <[email protected]>. Pre-trained models available at: https://github.com/facebookresearch/esm.

Abstract

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Evolutionary-scale language modeling is a logical step toward predictive and generative artificial intelligence for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone; no information about the properties of the sequences is given through supervision or domain-specific features. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.

1. Introduction

Growth in the number of protein sequences in public databases has followed an exponential trend over decades, creating a deep view into the breadth and diversity of protein sequences across life. This data is a promising ground for studying predictive and generative models for biology using artificial intelligence. Our focus here will be to fit a single model to many diverse sequences from across evolution. Accordingly we study high-capacity neural networks, investigating what can be learned about the biology of proteins from modeling evolutionary data at scale.

The idea that biological function and structure are recorded in the statistics of protein sequences selected through evolution has a long history (Yanofsky et al., 1964; Altschuh et al., 1987; 1988). Out of the possible random perturbations to a sequence, evolution is biased toward selecting those that are consistent with fitness (Göbel et al., 1994). The unobserved variables that determine a protein’s fitness, such as structure, function, and stability, leave a record in the distribution of observed natural sequences (Göbel et al., 1994).

Unlocking the information encoded in protein sequence variation is a longstanding problem in biology. An analogous problem in the field of artificial intelligence is natural language understanding, where the distributional hypothesis posits that a word’s semantics can be derived from the contexts in which it appears (Harris, 1954).

Recently, techniques based on self-supervision, a form of unsupervised learning in which context within the text is used to predict missing words, have been shown to materialize representations of word meaning that can generalize across natural language tasks (Collobert & Weston, 2008; Dai & Le, 2015; Peters et al., 2018; Devlin et al., 2018). The ability to learn such representations improves significantly with larger training datasets (Baevski et al., 2019; Radford et al., 2019).
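To make the missing-word form of self-supervision described above concrete for protein data, the sketch below hides a random subset of positions in an amino-acid sequence and trains a small network to recover them from the surrounding context. This is a minimal illustration only, assuming a toy encoder, a 15% masking rate, and illustrative names (AA_VOCAB, mask_tokens); it is not taken from the paper's training code.

# Minimal sketch of a masked-token objective over amino-acid sequences.
# Assumptions (not from the paper): a toy encoder, a 15% masking rate,
# and illustrative names throughout.
import torch
import torch.nn as nn

AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"          # 20 canonical amino acids
stoi = {aa: i for i, aa in enumerate(AA_VOCAB)}
MASK_ID = len(AA_VOCAB)                     # extra index for the mask token
VOCAB_SIZE = len(AA_VOCAB) + 1

def encode(seq: str) -> torch.Tensor:
    """Map an amino-acid string to a tensor of token indices."""
    return torch.tensor([stoi[aa] for aa in seq], dtype=torch.long)

def mask_tokens(tokens: torch.Tensor, p: float = 0.15):
    """Hide a random fraction of positions; return (inputs, targets).

    Targets are -100 (ignored by the loss) except at masked positions,
    where they hold the original token to be recovered from context.
    """
    mask = torch.rand(tokens.shape) < p
    if not mask.any():                      # ensure at least one position is hidden
        mask[torch.randint(len(tokens), (1,))] = True
    inputs = tokens.clone()
    inputs[mask] = MASK_ID
    targets = torch.full_like(tokens, -100)
    targets[mask] = tokens[mask]
    return inputs, targets

# A deliberately tiny stand-in for a contextual language model.
model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, 32),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
        num_layers=2,
    ),
    nn.Linear(32, VOCAB_SIZE),              # per-position token logits
)

seq = encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")   # example sequence
inputs, targets = mask_tokens(seq)
logits = model(inputs.unsqueeze(0))          # shape (1, length, vocab)
loss = nn.functional.cross_entropy(
    logits.view(-1, VOCAB_SIZE), targets.view(-1), ignore_index=-100
)
loss.backward()                              # gradients for one training step

In the experiments described below, the toy encoder above would correspond to the much larger Transformer networks introduced in Section 3, trained over the full evolutionary-scale dataset.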
Protein sequences result from a process greatly dissimilar to natural language. It is uncertain whether the models and objective functions effective for natural language transfer across differences between the domains. We explore this question by training high-capacity Transformer language models on evolutionary data. We investigate the resulting unsupervised representations for the presence of biological organizing principles and information about intrinsic biological properties. We find metric structure in the representation space that accords with organizing principles at scales from physicochemical to remote homology. We also find that secondary and tertiary protein structure can be identified in representations. The properties captured by the representations generalize across proteins. We apply the representations to a range of prediction tasks and find that they improve state-of-the-art features across the applications.

2. Background

Sequence alignment and search is a longstanding basis for comparative and statistical analysis of biological sequence data (Altschul et al., 1990; Altschul & Koonin, 1998; Eddy, 1998; Remmert et al., 2011). Search across large databases containing evolutionary diversity assembles related sequences into a multiple sequence alignment (MSA). Within sequence families, mutational patterns convey information about functional sites, stability, tertiary contacts, binding, and other properties (Altschuh et al., 1987; 1988; Göbel et al., 1994). Conserved sites correlate with functional and structural importance (Altschuh et al., 1987). Local biochemical and structural contexts are reflected in preferences for distinct classes of amino acids (Levitt, 1978). Covarying mutations have been associated with function, tertiary contacts, and binding (Göbel et al., 1994).

The prospect of inferring biological structure and function from evolutionary statistics has motivated development of machine learning on individual sequence families. Covariance, correlation, and mutual information have confounding effects from indirect couplings (Weigt et al., 2009). Maximum entropy methods disentangle direct interactions from indirect interactions by inferring parameters of a posited generating distribution for the sequence family (Weigt et al., 2009; Marks et al., 2011; Morcos et al., 2011; Jones et al., 2011; Balakrishnan et al., 2011; Ekeberg et al., 2013b). The generative picture can be extended to include latent variables (Riesselman et al., 2018).
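For orientation, the pairwise maximum-entropy models referenced above (Weigt et al., 2009; Morcos et al., 2011; Ekeberg et al., 2013b) are commonly written as a Potts model over an aligned sequence x = (x_1, ..., x_L) of the family; the parameterization below is the standard form from that literature, reproduced here as background rather than quoted from this paper:

    P(x_1, \dots, x_L) = \frac{1}{Z} \exp\Bigg( \sum_{i=1}^{L} h_i(x_i) + \sum_{1 \le i < j \le L} J_{ij}(x_i, x_j) \Bigg),
    \qquad Z = \sum_{x'} \exp\Bigg( \sum_{i=1}^{L} h_i(x'_i) + \sum_{i < j} J_{ij}(x'_i, x'_j) \Bigg)

Here the single-site fields h_i capture column-wise amino-acid preferences and the couplings J_{ij} model direct interactions between positions i and j; fitting h and J to the observed single- and pairwise frequencies of the MSA is what separates direct couplings from the indirect correlations that confound covariance- and mutual-information-based statistics.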
Recently, self-supervision has emerged as a core direction in artificial intelligence research. Unlike supervised learning, which requires manual annotation of each datapoint, self-supervision uses unlabeled data, and scaling up data and model capacity has shown improvements in the learned representations. In recent work, self-supervision methods used in conjunction with large data and high-capacity models produced new state-of-the-art results approaching human performance on various question answering and semantic reasoning benchmarks (Devlin et al., 2018), and coherent natural text generation (Radford et al., 2019).

This paper explores self-supervised language modeling approaches that have demonstrated state-of-the-art performance on a range of natural language processing tasks, applying them to protein data in the form of unlabeled amino acid sequences. Since protein sequences use a small vocabulary of twenty canonical elements, the modeling problem is more similar to character-level language models (Mikolov et al., 2012; Kim et al., 2016) than word-level models. Like natural language, protein sequences also contain long-range dependencies, motivating use of architectures that detect and model distant context (Vaswani et al., 2017).

3. Scaling language models to 250 million diverse protein sequences

Large protein sequence databases contain diverse sequences sampled across life. In our experiments we explore datasets with up to 250 million sequences of the UniParc database (The UniProt Consortium, 2007), which has 86 billion amino acids. This data is comparable in size to the large text datasets that are being used to train high-capacity neural network architectures on natural language (Devlin et al., 2018; Radford et al., 2019). To model the data of evolution with fidelity, neural network architectures must have capacity and inductive biases to represent its breadth and diversity.

We investigate the Transformer (Vaswani et al., 2017), which has emerged as a powerful general-purpose model architecture for representation learning and generative modeling, outperforming recurrent and convolutional architectures in natural language settings. We use a deep Transformer (Devlin et al., 2018) with character sequences of amino acids from the proteins as input.

The Transformer processes inputs through a series of blocks that alternate self-attention with feed-forward connections. Self-attention allows the network