
Gene Expression and Protein Function: A Survey of Deep Learning Methods

Saket Sathe (IBM T. J. Watson Research Center, Yorktown Heights, NY), Sayani Aggarwal (Lakeland Senior High School, Shrub Oak, NY), Jiliang Tang (Michigan State University, East Lansing, MI)

ABSTRACT

Deep learning methods have found increasing interest in recent years because of their wide applicability for prediction and inference in numerous disciplines such as image recognition, natural language processing, and speech recognition. Computational biology is a data-intensive field in which the types of data can be very diverse. These different types of structured data require different neural architectures. The problems of gene expression and protein function prediction are related areas in computational biology (since genes control the production of proteins). This survey provides an overview of the various types of problems in this domain and the neural architectures that work for these data sets. Since deep learning is a new field compared to traditional machine learning, much of the work in this area corresponds to traditional machine learning rather than deep learning. However, as the sizes of protein and gene expression data sets continue to grow, the possibility of using data-hungry deep learning methods continues to increase. Indeed, the previous five years have seen a sudden increase in deep learning models, although some areas of protein analytics and gene expression still remain relatively unexplored. Therefore, aside from the survey of the deep learning work directly related to these problems, we also point out existing deep learning work from other domains that has the potential to be applied to these domains.

1. INTRODUCTION

Deep learning has become increasingly popular in recent years because of its uses in predictive applications, especially in the image and sequential domains [Goodfellow et al. 2016]. Deep learning models are generalizations of traditional machine learning models like linear regression and logistic regression. In particular, deep learning methods work well when the data are richly structured in terms of the relationships among different attributes. A classical example of this type of data is the image domain. In such cases, deep learning methods can leverage the network depth in order to engineer features from the underlying data.

Biological data represent a natural candidate for deep learning algorithms because of the highly structured nature of such data. Some of the biological data occur in the form of sequences, others as complex folded compounds, and yet others as microarrays. All these different types of data require different types of deep learning algorithms. The diversity of deep learning methods in computational biology is quite vast, which is inherited from the rich diversity of different types of biological data. In this survey, we examine two related areas of computational biology, which are gene expression and protein function prediction. These areas are related because proteins are created using gene templates, and the specific sequence in a gene defines the eventual function of the protein. There is a rich diversity of data and problem types that are frequently studied in this domain.

This survey assumes basic knowledge of neural networks and deep learning as a prerequisite. Although a brief overview of neural networks is provided, it is not the focus of the survey. The primary focus is on how these models can be used in the context of protein and gene expression data. Where needed, some background is also provided on how biological concepts relate to the underlying machine learning problems. We refer the reader to [Bishop 1995; Goodfellow et al. 2016] for overviews of neural networks.

1.1 Types of Data Addressed in this Survey

Although diverse types of biological data exist, the focus of this survey is primarily on protein data and genomic data. As discussed in section 3.2, proteins in the human body are manufactured from a base template inside the DNA. Therefore, there is a natural connection between proteomics and genomics. Indeed, genes are almost always expressed in order to create proteins with specific functions. Indeed, the primary way in which DNA works is via the manufacturing of different types of proteins.
In particular, the following aspects are studied in this survey:

• Gene expression: Genomic data are among the most popular types of biological data. Typically, genes occur as sequences, which can be analyzed for structure and prediction. Genomic data also occur as microarrays, which have a somewhat different type of 2-dimensional structure. A multitude of applications exist for such data, such as the discovery of various types of diseases. This survey studies how specific aspects of the structure of genes are related to expressed characteristics, such as the occurrence of diseases. For example, gene microarrays are often used to perform studies of the gene mutations that lead to various types of cancers, tumors, and other diseases.

• Deep learning of proteins: The chemical properties of biological compounds are closely related to their structures. For example, the function of a protein is closely related to its structure. Proteins occur as complex 3-dimensional shapes that are leveraged for analysis and prediction. The properties of proteins are often associated with their structures and with functions. Deep learning methods can help in relating biological compounds to their structure and function. This survey presents a discussion of deep learning methods that predict the structures, functions, and shapes of proteins together with the interrelationships between these aspects.

• Connections to other types of data: In many cases, protein data and gene data do not occur in isolation, but they may be annotated with various types of text and may co-occur with other data types. Much of the scientific literature of biomedical discoveries is available in the form of text. In many cases, useful results are obscured by the sheer volume of the publications available in the literature.
Therefore, combining the analyses of these volumes of text with the insights obtained directly from protein and gene data requires a great deal of work. Furthermore, some types of genetic data and protein data occur in combination with various types of annotations that can be exploited for analytics. Some of these methods are discussed in this survey.

An important common property of most types of biological data (such as protein and genetic data) is that they have a rich amount of structure. Proteins and genes are often modeled as sequences or graphs, which are highly structured data sets. Such data domains are naturally suited to deep learning algorithms. This survey provides a discussion of these different types of applications along with the various types of neural networks that support these applications.

1.2 Previous Surveys

Numerous surveys exist on machine learning methods in computational biology [Caragea and Honavar 2009; Wang et al. 2005; Schölkopf et al. 2004; Noble et al. 2004], although many of these surveys were written before deep learning methods became popular. Another point that we mention is that computational biology is a diverse field, and a survey of all the areas of computational biology would be worthy of a book rather than a survey. It is often hard to provide a focused overview of such a broad area (and also provide specific insight and positioning of the works already done) without focusing on specific domains. This survey is, therefore, focused on the subject area of protein data analytics and gene expression. The following is an overview of related surveys in the literature:

1. A survey on deep learning for biological data may be found in [Angermueller et al. 2016]. This survey focuses on the basics of deep learning methods and also the broad classes of biological problems that can be addressed by deep learning. In contrast, our survey takes basic deep learning knowledge as a prerequisite and builds on it in the context of proteomics and genomics. Therefore, the article is arranged around protein analytics and gene expression applications; of course, the broader principles of deep learning methods used for various applications are provided to give better insights.

2. A review of deep learning methods in health informatics is provided in [Ravì et al. 2017]. However, this article is restricted in its scope to health care-related topics. Computational biology is a distinct field in its own right, although it has significant overlaps with health care. Health care is generally much broader, and it encompasses areas such as patient diagnosis prediction. This is not the focus of this article.

3. An overview article [Webb 2018] in Nature provides some interesting perspectives on deep learning methods for biology. However, this article is intended to be an overview article in computational biology (which provides an excellent bird's-eye view), but it does not provide details at the survey level. An important reason for this is that it focuses rather broadly on computational biology, which makes it difficult to provide details in specific areas. Furthermore, the article in [Webb 2018] is not specifically focused on genomic or protein data.

4. An overview of deep learning methods for genomic data may be found in [Yue and Wang 2018]. The focus on genomic data intersects with some aspects of this review, although many aspects of [Yue and Wang 2018] are not directly related to this survey. Furthermore, [Yue and Wang 2018] do not discuss deep learning methods for proteomics.

It is noteworthy that deep learning is a relatively young field, and many obvious avenues for using deep learning in computational biology have not been explored. Therefore, where possible, the obvious avenues and directions for research in using deep learning for protein analytics and gene expression are also pointed out in this survey. For this reason, this survey should also be considered a position paper, which provides numerous connections between known techniques and explores obvious avenues for research.

1.3 Organization of Survey

This survey is organized as follows. The next section provides an overview of the key classes of deep learning methods. Although it is impossible to provide a comprehensive overview of the different types of deep learning methods, we provide an overview of how important classes of neural models relate to biological data. Section 3 offers a discussion of deep learning methods for protein data. The connections between protein data and gene data are discussed in this section, as genes provide the blueprints for generating proteins. Numerous algorithms for protein interaction, function, and structure are discussed in this section. Deep learning models for genetic data are discussed in section 4. The underlying deep learning methods include techniques for predicting gene expression and clustering.
Furthermore, gene regulatory networks are also discussed in this section. A summary and discussion is given in section 5.

2. AN OVERVIEW OF DEEP LEARNING METHODS

In this section, we provide an overview of the basics of neural networks and deep learning. The field of neural networks is an extension of the broader field of machine learning. An overview of the broader field of machine learning may be found in [Bishop 2006].

All neural networks are essentially computational graphs containing nodes that can perform computations. Such computational graphs are usually directed, acyclic graphs, and they are often organized in a layer-wise fashion. All nodes are either input nodes, hidden nodes, or output nodes. The input nodes simply accept the input to the machine learning problem, whereas the output nodes output the final predictions. The number of input nodes is equal to the number of attributes d in the data set, whereas the number of output nodes is equal to the number of attributes to be predicted. For example, for a regression or classification problem, there might be only one output. The intermediate nodes accept inputs from other nodes and propagate them to the next layer of nodes after performing computations on them. Such nodes are considered hidden because their computations are not part of the input or output (although they can be explored if needed). The input nodes do not perform any computations, but they simply transmit their inputs to the next layer. The nodes are connected to one another with weights on the edges, and the learning of the weights in a data-driven manner provides the primary mechanism with which a neural network is able to model prediction functions. The weights on edges are modified whenever there are errors in the predictions of outputs, and the weights are modified in order to reduce this error.

[Figure 1: A single-layer network with one computational node for linear regression]
Each node in a computational graph typically performs two operations in succession. The first is a simple linear computation on its d inputs, which is followed by the application of an activation function. In other words, the output of a node in a computational graph is as follows:

    y = Φ( Σ_{i=1}^{d} w_i x_i )    (1)

The function Φ(·) is typically the sigmoid, tanh, or ReLU function. These functions are defined as follows:

    Φ(z) = 1/(1 + e^{−z})              (sigmoid function)
    Φ(z) = (e^{2z} − 1)/(e^{2z} + 1)   (tanh function)
    Φ(z) = max{z, 0}                   (Rectified Linear Unit)

Another useful function is the softmax activation function, which is a generalization of the sigmoid function to multiple outputs. The softmax activation function has k inputs z_1 ... z_k and k outputs o_1 ... o_k, which can be interpreted as probabilities:

    o_i = exp(z_i) / Σ_{j=1}^{k} exp(z_j),  ∀i ∈ {1 ... k}    (2)

The softmax activation function is used when the prediction is a set of k probabilities corresponding to k possible outcomes. These types of situations are common in multiway prediction problems.

It is also possible to have no activation function, which corresponds to a linear layer. Linear activation functions are often used in the output nodes of a neural network when the final output is a numerical value. In fact, it is possible to simulate the linear regression model with such a node. In this model, we have a single-layer network with a single output node, where the output y is obtained by applying a combination of a linear function and an activation function to the inputs. An example of a neural network with a single output node is illustrated in Figure 1. Note that the input nodes do not perform any computations beyond transmitting the data, and it is only the output node that performs the computation. The output is then compared to an observed output, and a loss function is used to quantify the error of that example. The weights of the neural network are modified to minimize the error. The gradient of the loss function is used to update the weights. With the trained weights, one can then predict outputs for unseen examples. For example, when we use no activation, a single numerical output, and a squared loss, the resulting model is referred to as least-squares regression. In this case, for inputs x_1, ..., x_d, the single output is given by ŷ = Σ_{i=1}^{d} w_i x_i, where x_i is the ith input and w_i is the weight of the edge joining the ith input to the output. Then, the loss function of least-squares regression for the input-output pair (x, y) is as follows:

    L(x, y) = (y − ŷ)²    (3)

The losses over all input-output pairs in a data set S are aggregated to yield the final result. Therefore, the overall loss is computed as L = Σ_{(x,y)∈S} L(x, y).

When the outputs have 0/1 values, the sigmoid activation function can be used to predict a probability of the binary prediction. Depending on whether the output is 0 or 1, the negative logarithm of the probability of 0 or 1 is used as the loss. This model is exactly the same as that of logistic regression in machine learning. Therefore, simple cases of neural networks correspond to well-known classification and regression models in machine learning. The softmax activation is used with a logarithmic (cross-entropy) loss in the case of multi-way classification.
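As a concrete illustration of Equations (1)-(3), the following minimal NumPy sketch trains the single-layer, linear-activation network of Figure 1 (i.e., least-squares regression) with gradient descent. The synthetic data, learning rate, and iteration count are illustrative assumptions, not values prescribed by any of the surveyed work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n examples with d features, generated by a noisy linear model.
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

# Single-layer network with a linear (identity) activation: y_hat = sum_i w_i * x_i.
w = np.zeros(d)
learning_rate = 0.05
for _ in range(500):
    y_hat = X @ w                        # forward pass (Equation 1 with identity activation)
    grad = -2.0 * X.T @ (y - y_hat) / n  # gradient of the mean squared loss (Equation 3)
    w -= learning_rate * grad            # gradient-descent update of the edge weights

print(np.round(w - w_true, 3))           # near-zero: the learned weights recover the generator
```

Replacing the identity activation with a sigmoid and the squared loss with the negative log-likelihood would turn the same sketch into the logistic regression model described above.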

Neural networks become much more powerful when the nodes are arranged in multiple layers. In this case, the outputs of some nodes feed into other nodes, and the overall model becomes extremely powerful. Complicated functions of the input can be learned with this approach. Most of this power is gained as a result of the nonlinear activation functions in the intermediate layers. An example of a multi-layer network is illustrated in Figure 2. As in the case of the single-layer network, the inputs are x_1 ... x_5, and the single output is denoted by ŷ. Depending on the nature of the application, the network might have multiple outputs. This model is referred to as a feed-forward network. The depth of a network increases its power, and the success of deep networks for modeling has led to the term "deep learning." Most of the time, one is using the features of a specific datum (e.g., a protein molecule, a gene sequence, or clinical readings) to make a prediction of a specific property (e.g., a molecular property, a genetic property, or a disease). This is the classical approach used in supervised learning.

[Figure 2: A multi-layer network with nodes structured in layers]

In unsupervised learning, it is also possible to have outputs that are identical to the inputs. The loss function is designed by summing up the squared error between the predicted outputs and observed outputs (which are the same as the inputs). This approach corresponds to the reconstruction of a data point, which is the case in an autoencoder. These types of unsupervised learning methods are useful for applications such as clustering. In many applications, such as gene expression data, clustering methods are useful for grouping genes with similar properties together.

2.1 Importance of Structured Architectures

Many types of data, such as biological sequences and graphs, are inherently structured. For example, proteins can be represented as sequence data, and in some cases, they can also be conceptually represented as interaction networks. In these cases, the use of a straightforward feed-forward network does not yield optimal results. In such cases, it is helpful to design architectures that are specifically tailored to these data. There are two primary types of architectures that are suited to the kinds of structured data that occur often in biology. The two most common architectures used for processing biological data correspond to recurrent neural networks and convolutional neural networks.

2.1.1 Recurrent Architectures for Sequences

Recurrent neural networks are useful for biological data that occur as sequences. Many of the types of data that occur often in computational biology are sequential data with variable lengths. Examples include gene sequences and protein sequences. Biological compounds, which are graphs, can also be flattened into sequences. These types of data often have repeating (or recurrent) patterns in them, and the goal of the architecture is to learn these recurrent patterns. Therefore, recurrent architectures [Elman 1990; Hochreiter and Schmidhuber 1997] use the notion of a time-layered network, in which each position in the sequence is associated with a layer of the architecture. For example, an amino-acid sequence with length k will have k time layers in which each layer receives an input from one position. All the layers in the architecture are identical, and they receive feedback from earlier layers. Each time layer in the network has an input, a set of hidden nodes, and an output. The number of time layers in the network depends on the number of elements in the input sequence, and therefore the architecture of the neural network depends on the length of the input sequence. Such networks have the following properties:

1. The different temporal layers share parameters because of the fact that the model at each time-stamp is identical. Note that it is important for the time layers to share parameters to ensure a fixed number of parameters because the number of time layers is input-specific.

2. The recurrent neural network can accept variable-length inputs. This is because each time layer allows a fixed number of inputs, and the number of time layers depends on the length of the sequence. The requirement of variable-length inputs is quite common in the case of sequence data.

An example of a recurrent architecture is shown in Figure 3. Note that the loop in Figure 3(a) is of a conceptual nature, as it is unrolled into multiple temporal layers. The unrolled version of the network is shown in Figure 3(b).

[Figure 3: An illustration of a recurrent neural network (RNN) with both rolled and unrolled representations. Panel (a) shows the rolled network; panel (b) shows the unrolled time layers, with one amino acid as the input to each time layer and position-specific predictions as the outputs.]
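The two properties above can be made concrete with a short forward-pass sketch. The following NumPy fragment is a minimal (untrained) recurrent network over amino-acid sequences: one shared set of weight matrices is reused at every time layer, so the same parameters handle inputs of any length. The dimensionalities and the tanh activation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"       # the 20 standard amino-acid symbols
d, p = len(AMINO_ACIDS), 16                # input and hidden dimensionalities (assumed)

# One parameter set shared by every time layer: the key property of an RNN.
W_xh = rng.normal(scale=0.1, size=(p, d))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(p, p))  # hidden-to-hidden (recurrent) weights
W_hy = rng.normal(scale=0.1, size=(1, p))  # hidden-to-output weights

def one_hot(symbol):
    v = np.zeros(d)
    v[AMINO_ACIDS.index(symbol)] = 1.0
    return v

def rnn_forward(sequence):
    """Forward pass over a variable-length protein sequence: one prediction per position."""
    h = np.zeros(p)
    outputs = []
    for symbol in sequence:                # one time layer per sequence position
        h = np.tanh(W_xh @ one_hot(symbol) + W_hh @ h)
        outputs.append((W_hy @ h).item())
    return outputs

# The same network (and the same parameters) handles sequences of different lengths.
print(len(rnn_forward("MKV")), len(rnn_forward("MKVLAAGIS")))
```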
2.1.2 Convolutional Neural Networks in Biology

The main application of convolutional neural networks in computational biology occurs for various types of proteins that are naturally expressed as graphs. Convolutional neural networks [Krizhevsky et al. 2012; LeCun et al. 1998] use a 3-dimensional spatial arrangement of the units, and sparse activations, referred to as convolutions, are used to propagate activations from layer to layer. Each layer in a convolutional neural network has spatially arranged activations, and these activations are individual pixels at the input layer. Furthermore, each layer of the network has multiple activation maps, and these different maps are viewed as features. In the input layer, these different maps correspond to channels such as red, green, and blue (RGB). A typical input image might be of size 32 × 32 × 3, where the first two dimensions correspond to the spatial size, and the third dimension corresponds to the depth. Convolutions are done with the use of filters that have smaller spatial dimensions, but the same number of maps as the layer. For example, a filter in the input layer might be of size 5 × 5 × 3. Multiple filters might be associated with a layer, which create different features in the next layer. A convolution operation is a dot product between all the elements of a filter and a particular position in the image. Therefore, the number of "pixels" (or features) in the spatial representation of the next layer is equal to the number of valid positions at which a filter can be convolved with the spatially arranged features in a particular layer. A particular filter is associated with a single feature map in the next layer. Therefore, the number of filters in a layer determines the number of feature maps in the next layer. The convolution operations are often paired with a ReLU activation function, which has become the most common approach since the work in [Krizhevsky et al. 2012]. In some convolutional networks, pooling operations are also used, although this practice has become increasingly rare in recent years. A pooling operation outputs the maximum value over a spatial region of features. The pooling operation reduces the size of the spatial footprint of a layer because a spatial region is replaced with a single feature. In cases where pooling is not used, it is important to use strided convolutions, where the convolution is not done at each position in the spatial footprint, but the spatial positions are separated by an integer value. This integer value is referred to as the stride parameter. The final set of layers are not spatially arranged and are referred to as fully connected layers. These layers are similar to the ones used in a conventional feedforward network. An overview of the basic structure of a convolutional neural network is provided in [Krizhevsky et al. 2012]. An example of a convolutional neural network that uses only convolutions and no pooling is illustrated in Figure 4. Note that a ReLU activation is placed at the end of each spatial layer.

[Figure 4: A convolutional neural network. The input layer is followed by spatially arranged hidden layers and fully connected layers leading to a single output node y.]
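The following NumPy sketch makes the convolution arithmetic explicit for the 32 × 32 × 3 input and 5 × 5 × 3 filter mentioned above, including the ReLU pairing and the stride parameter. The random inputs and the stride value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy input "image" of size 32 x 32 x 3 and a single 5 x 5 x 3 filter.
image = rng.normal(size=(32, 32, 3))
filt = rng.normal(size=(5, 5, 3))
stride = 2

# Each output value is the dot product between the filter and one valid position.
out_size = (32 - 5) // stride + 1
feature_map = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        patch = image[i * stride : i * stride + 5, j * stride : j * stride + 5, :]
        feature_map[i, j] = np.maximum(np.sum(patch * filt), 0.0)  # convolution + ReLU

print(feature_map.shape)  # (14, 14): one feature map produced by this one filter
```

A layer with, say, 8 such filters would produce 8 feature maps, which matches the statement above that the number of filters determines the number of feature maps in the next layer.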

Convolutional neural networks are used widely for image data, and their use in the biological domain has been limited. Nevertheless, some obvious avenues where convolutional neural networks can be used in computational biology are noted in this survey. An important point is that many biological compounds can be represented as graphs, and the analyses of these graphs have tremendous applications. In the graph representation of a chemical compound, each node is a simpler unit of the compound, and a bond is a connection between the nodes. In recent years, numerous methods have been proposed for extending the use of convolutional neural networks to graphs [Duvenaud et al. 2015; Kipf and Welling 2016a; Kipf and Welling 2016b; Henaff et al. 2015], even though they were originally proposed for image data. At least some of these works [Duvenaud et al. 2015; Henaff et al. 2015] have been shown to have applications in molecular biology.

3. PROTEIN DATA

In this section, we discuss methods that are used for deep learning of non-genomic biological compounds. These include areas of computational biology such as proteomics, in which proteins are analyzed for their structure and function. However, genes are not completely independent of proteins. Indeed, the building of proteins is deeply controlled by genes. Therefore, the study of proteins is deeply connected to that of genes in many ways. Since the research in protein analytics is closely related to that of genomics, it is important to understand these different areas and their relationships with one another.

3.1 What is Proteomics?

Proteins form the bedrock of most structures and functions in living organisms. The word "protein" comes from the Greek word "proteos," which means "first place." Proteins take on many complex functions in the human body, including acting as enzymes, hormones, and catalysts, performing various forms of signalling, and even providing the physical structure of muscles. Functions such as the transportation of oxygen in the body are performed by proteins such as hemoglobin. Defects in the structure of proteins can lead to corresponding problems in their functioning, and a variety of diseases are known to be caused by defects in proteins. Clearly, the structure of proteins plays a critical role in their functioning, and therefore, a natural avenue for the use of deep learning is to predict the function of proteins from their structure.

3.2 Connections of Proteomics and Genomics

Genetic data correspond to deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), which are constructed from nucleotides. Nucleotides are monomers made of three components, which correspond to a 5-carbon sugar, a phosphate group, and a nitrogenous base. Therefore, DNA and RNA are chemically quite different from proteins. DNA is a double-stranded, stable sequence of nucleic acids in which the main sugar is deoxyribose, whereas RNA is a single-stranded, unstable sequence of nucleic acids in which the main sugar is ribose. For example, instead of amino acids, DNA is composed of the nucleic acid bases adenine (A), guanine (G), thymine (T), and cytosine (C).
Similarly, RNA uses roughly similar bases, except that thymine is replaced with uracil (U). The DNA occurs in the nucleus in 23 pairs of chromosomes, containing a total of 70,000 genes. Each gene comprises a long sequence of the aforementioned four bases. In spite of the significant chemical differences between genomic data and proteins, genes contain the instructions required for building proteins. For example, a protein is created by combining amino acids, and the ordering and choice of the amino acids is controlled by the gene for the protein. RNA transfers these instructions from the nucleus to the ribosomes to make proteins. In addition, proteins are created from amino acids in the ribosomes. RNA is less stable than DNA precisely because of the types of functions it has to perform, which require frequent reorganization.

What is the connection between DNA, RNA, and proteins? The main point is that DNA lies inside the nucleus of a cell, whereas proteins and amino acids are synthesized outside the nucleus (in the ribosomes present in the cytoplasm). RNA serves as the carrier of this information from within the nucleus to the cytoplasm. Messenger RNA plays the role of transcription, where it copies the DNA code from the nucleus, using the same sequence while replacing thymine (T) with uracil (U), and brings it to the ribosomes, where the proteins are manufactured. Each subsequence of three positions in an RNA strand brought to the ribosome is referred to as a codon, and it codes for an amino acid. Therefore, there are 4³ = 64 possible codons, although multiple codons might correspond to the same amino acid. Furthermore, some codons are designated to mark the beginning or end of a protein sequence, and they do not specifically code for amino acids. As a result, there are only 20 distinct amino acids (instead of 64). All proteins represent a sequence of these 20 amino acids. An overview of the entire process of transcription and translation from genes to proteins is illustrated in Figure 5.

[Figure 5: The process of transcription and translation. The RNA sequence GUG AAG CAU UAC GAG CCU GAG CUG is translated codon-by-codon into the protein sequence V K H Y E P E L.]
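The codon logic described above is easy to state in code. The following sketch translates the RNA strand of Figure 5 codon-by-codon; only the handful of codons needed for this example (plus one stop codon) are included here, whereas a complete table would contain all 64 entries.

```python
# Partial codon table: just the codons appearing in Figure 5, plus one stop codon.
CODON_TABLE = {
    "GUG": "V", "AAG": "K", "CAU": "H", "UAC": "Y",
    "GAG": "E", "CCU": "P", "CUG": "L",
    "UAA": None,  # a stop codon; it marks the end and does not code for an amino acid
}

def translate(rna):
    """Translate an RNA string codon-by-codon into a protein sequence."""
    protein = []
    for i in range(0, len(rna) - len(rna) % 3, 3):
        amino_acid = CODON_TABLE[rna[i : i + 3]]  # each 3-base codon maps to one residue
        if amino_acid is None:                    # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GUGAAGCAUUACGAGCCUGAGCUG"))  # VKHYEPEL, as in Figure 5
```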
Proteins are, therefore, represented as long sequences of 20 (possibly and likely repeating) symbols in the FASTA format,¹ and DNA/RNA are represented as repeating sequences of four symbols. Each monomer (amino acid) in the protein sequence is also referred² to as a residue. The process of creating a protein by genetic transcription is referred to as gene expression, and the entire set of proteins contained in an organism is referred to as the complete proteome. The human proteome is known to contain about 20,000 proteins, each of which is created by a different gene. The above number does not include the splicing variants of a protein produced by the same gene, with which the number rises to more than 90,000 proteins. It is also noteworthy that many genes do not produce proteins, but they produce RNA for other tasks as the final end product. Proteins have different functions, depending on the sequence of amino acids. Furthermore, proteins with similar functions often interact with one another at specific contact points in a network of protein-protein interactions.

¹https://en.wikipedia.org/wiki/FASTA_format
²In general, a monomer within a chain of a polysaccharide, protein, or nucleic acid is referred to as a residue. Therefore, an instance of adenine (A) might be a residue in an RNA chain, an instance of glucose might be a residue in a complex carbohydrate like glycogen, and an amino acid is a residue in a complex protein.

In reality, the structure of proteins is much more complicated than sequences, as there is a 3-dimensional structure to these molecules. Proteins often show a folding behavior, which is critical to their structures and functions. For example, enzymes fit into substrates using a "lock-and-key mechanism," which is heavily dependent on the shape of the underlying proteins. Therefore, the problem of inferring the structures of proteins has been of significant interest in the broader literature. The primary structure of a protein corresponds to its sequence information, which comprises the identity and order of the amino acid residues. The secondary structures are caused by folding patterns stabilized by hydrogen bonds within the molecule; these correspond to α-helices and β-sheets. The tertiary structure corresponds to how the proteins react to the aqueous environment surrounding them. In some cases, particular portions of the protein (the hydrophilic side) prefer to face toward the aqueous environment, whereas other portions (the hydrophobic side) prefer to face away from the aqueous environment. Other proteins are apathetic to the aqueous environment surrounding them. The connections between DNA and proteins lead to interesting lines of research because many diseases are caused by defects in proteins, which in turn arise because of unusual variants in DNA [Alipanahi et al. 2015].

3.3 Protein-Protein Interaction Networks

An important point here is that most proteins do not act in isolation in carrying out their functions. Rather, most proteins act in concert via interacting with one another in the form of protein-protein interaction networks (PPI networks) [Szklarczyk et al. 2014; Von Mering et al. 2002].
At their core, these are networks in which the nodes correspond to proteins, and the edges correspond to interactions among them. Protein-protein interactions correspond to physical contacts between two or more protein molecules, and they enable the creation of large interaction networks.

An important problem in computational biology is that of protein-protein interaction and function prediction. Numerous conventional machine learning methods have been used for protein-protein interaction prediction [Qi et al. 2006]. Three typical types of problems are observed in this domain:

1. In the prediction of protein functions from interactions, we are given a network of protein-protein interactions and information about the functions of some subsets of the proteins. Using this information, we would like to predict the unknown functions of proteins.

2. In the prediction of molecular modules [Chen et al. 2013], one is trying to isolate parts of the protein-protein interaction network that often function as a single unit. Often, such portions have large levels of interconnectivity. This problem is closely related to that of clustering the protein-protein interaction network.

3. In protein-protein interaction prediction, we are already given some feature representations of proteins (which might include their function or sequence information) and interactions between some subsets of proteins. Given this information, one would like to predict interactions among them.

The last of these problems is quite fundamental, and it serves as the basis of both building PPI networks and predicting the structure of proteins. In the following sections, we will discuss each of the above problems.
Protein Function Prediction from Protein-Protein Interaction Networks

[Figure 6: A portion of a protein-protein interaction network [Vazquez et al. 2003]]

To explain the relationship between protein-protein interaction and protein function, we provide a portion of an interaction network from [Vazquez et al. 2003] in Figure 6. Here, the nodes correspond to the proteins, and the edges correspond to the interactions among proteins. The functions of some subsets of the proteins are known, whereas others are not (and are shaded gray in Figure 6). An important point is that protein-protein networks show the property of homophily, wherein proteins with similar function are often connected. Using this fact, one can infer that the protein YNL127W has the function of budding, cell polarity, and filament formation in Figure 6. Note that this type of problem can be reduced to that of collective classification in machine learning, wherein some subsets of the nodes in a network are labeled and other labels are inferred from them [London and Getoor 2014]. The basic idea in these methods is that the connected nodes in the network have either similar or related functions, which can be inferred in a data-driven manner by propagating (function) labels from specified nodes to unspecified nodes. Although collective classification methods can be used for problems beyond protein-protein interaction networks, some of the proposed networks have been explicitly designed to work with PPI data [Bogdanov and Singh 2010; Bilgic and Getoor 2008; Desrosiers and Karypis 2009; Wu et al. 2014]. Most of these methods either extract neighborhood features for modeling a classification problem [Bogdanov and Singh 2010; London and Getoor 2014; Neville and Jensen 2003], or they use label propagation methods [Bilgic and Getoor 2008; London and Getoor 2014; Zhu et al. 2003]. Some of the methods even use explicit and implicit edges for prediction [Xiong et al. 2013]. [Deng et al. 2003] use Markov random fields, which can be viewed as probabilistic variants of neural networks. An overview of many of the classification methods used for protein function prediction with PPI network data is provided in [Sharan et al. 2007]. The similarity of functions between connected proteins has frequently been used to predict function [Schwikowski et al. 2000; Vazquez et al. 2003]. [Vazquez et al. 2003] design an energy function based on the similarities between proteins to make predictions.

Although the vast majority of techniques for function prediction use traditional machine learning techniques [Bogdanov and Singh 2010; Desrosiers and Karypis 2009; Qi et al. 2006; Sharan et al. 2007; Vazquez et al. 2003], recent progress has been made on the use of deep learning methods, especially when the function is related to a 3-dimensional structure [Wang et al. 2017]. A natural approach to modeling protein-protein interaction networks with deep learning is to create node embeddings of the interaction network with the use of the structure of both the network and the protein itself [Grover and Leskovec 2016; Kipf and Welling 2016a; Hamilton et al. 2017a; Huang et al. 2017]. Indeed, some of these embedding methods [Kipf and Welling 2016a; Hamilton et al. 2017a] have been shown to perform very well for various collective classification settings, and some of them even take attributes into account [Huang et al. 2017] (as is common in PPI networks). Furthermore, a graph convolutional network model has been proposed [Kipf and Welling 2016b] for predicting the link interactions among the nodes in a network. However, the methods in [Kipf and Welling 2016a; Kipf and Welling 2016b; Hamilton et al. 2017a] have mostly been designed in the context of social networks; they have not yet been tested in computational biology (or PPI interaction/function prediction) settings. Most likely, their use in the PPI prediction setting would require further modifications. As such, these methods represent a natural avenue for further research in the area. An overview of various types of node-embedding techniques may be found in [Hamilton et al. 2017b].
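As a minimal illustration of the label-propagation idea used by several of the above methods [Zhu et al. 2003; London and Getoor 2014], the following NumPy sketch spreads a binary function label from annotated proteins to unannotated ones over a toy PPI network. The adjacency matrix, the clamped labels, and the iteration count are illustrative assumptions.

```python
import numpy as np

# Toy PPI network: adjacency matrix over 5 proteins; proteins 0 and 1 have known
# binary function labels, and the labels of the rest are inferred by propagation.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
labels = np.array([1.0, 0.0, 0.5, 0.5, 0.5])   # unknown nodes start at 0.5
known = np.array([True, True, False, False, False])

P = A / A.sum(axis=1, keepdims=True)           # row-normalized transition matrix
for _ in range(50):
    labels = P @ labels                        # each node averages its neighbors' scores
    labels[known] = [1.0, 0.0]                 # clamp the known labels after every step

print(np.round(labels, 2))  # scores above 0.5 suggest the function is present
```

This exploits exactly the homophily property noted above: a node's score is pulled toward the scores of its interaction partners.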
Integrating Sequence Information for Protein Function Inference

One can predict protein function from sequence and structure [Lee et al. 2007]. The main difference is that the network of interactions is not used in this case, although some of the methods extract the protein-protein interaction information and use it to develop features. A key point is that a substantial similarity in the underlying sequences is often used directly or indirectly in order to discover interactions [Liao and Noble 2003]. Often, there are subtle similarities in the sequence of amino acids when two proteins interact. Consequently, sequence-to-sequence alignment and similarity algorithms, such as BLAST and FASTA [Li and Homer 2010], have remained the methods of choice. An overview of machine learning methods for predicting protein functions from sequence and structure may be found in [Watson et al. 2005].

Numerous methods have also been proposed for predicting protein function from sequences [Cao et al. 2017; Kulmanov et al. 2017; Liu 2017; Murvai et al. 2001; Tang et al. 2018]. Some of this work uses traditional machine learning techniques, such as support vector machines [Tang et al. 2018], to perform the learning. This approach cannot directly use sequential information, and therefore it only uses dipeptide composition for creating features. This type of composition information is, nevertheless, sufficient to predict specific types of functions, such as finding whether a protein is a growth hormone.

An early method [Murvai et al. 2001] on the use of neural networks extracts features with the use of BLAST-based similarity [Li and Homer 2010] on the sequences and then applies a feedforward network on the extracted features. This type of approach of hand-crafted feature extraction is not in line with what one normally expects from neural networks. In most types of neural networks, one normally expects feature engineering to be done in an automated way.
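As an example of such hand-crafted sequence features, the following sketch computes the dipeptide-composition vector mentioned above: the normalized frequency of each of the 20 × 20 = 400 possible adjacent amino-acid pairs. The example sequence is an arbitrary assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 possible pairs

def dipeptide_composition(sequence):
    """Return the 400-dimensional normalized dipeptide-composition feature vector."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(sequence) - 1):
        counts[sequence[i : i + 2]] += 1        # count each adjacent amino-acid pair
    total = max(len(sequence) - 1, 1)
    return [counts[dp] / total for dp in DIPEPTIDES]

features = dipeptide_composition("MKVLAAGIVK")
print(round(sum(features), 6))  # 1.0: the composition is normalized over all pairs
```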
The work in [Cao et al. 2017] is particularly interesting because it treats both the protein sequence and the functions of a protein as a "language." The protein sequence is referred to as "ProLan", and the function of the protein is referred to as "GOLan." The ProLan language simply segments the sequence of amino acids into a set of "words" to create a sentence, whereas "GOLan" creates a sentence of ordered identifiers based on protein functions. Given the sentences in the two languages, it is a relatively simple matter to perform the translation using a neural machine translation model. This model is not conceptually too different from the machine translation models used in systems³ such as Google Translate. At the most basic level, this model hooks up two recurrent networks, one for each of the two languages. The first network encodes the protein language into an internal representation of the neural network, whereas the second network converts this internal representation into a sentence of the second language. The work in [Kulmanov et al. 2017] takes a somewhat different approach wherein it treats the problem as that of multilabel classification, where each possible function is a binary label of the classification problem. In this particular case, protein-protein interaction data were also used for classification purposes. The work in [Liu 2017] also takes the approach of using the protein sequence as the input and the label as the output of the recurrent network.

³http://translate.google.com

There are several possible lines of research in function prediction. One of the most interesting avenues of research, which was briefly suggested in [Liu 2017], is the possibility of generating new proteins with specific functions with the use of hooked recurrent encoder-decoder architectures (just like a machine translation system). The training pairs could be proteins with the same function. Then, proteins with similar functions to a given protein P could be generated by inputting the protein P to the encoder and generating the symbols of the new protein with the decoder, just like a neural machine translation model. In a sense, the input protein provides the context to the generator.

Another direction of research has to do with how sequence information and protein-protein interaction information can be integrated to infer protein function. It is noteworthy that most of the techniques for sequence-based function classification either exclusively use sequence information, or protein-protein interaction features are extracted by using methods such as BLAST [Murvai et al. 2001]. Using hand-crafted features is not a natural approach from the perspective of neural network design. In practice, one can use the recently proposed network-embedding techniques [Grover and Leskovec 2016] (applying them to the PPI network) and combine them with the recurrent neural network techniques [Cao et al. 2017; Kulmanov et al. 2017; Liu 2017] to create an integrated and end-to-end framework for combining the sequence and the PPI information. The main challenge would be in combining the features of the embedding technique with the recurrent neural network for prediction. In this context, neural networks are ideal because they allow the fusion of the features from different inputs in a seamless way.

Molecular Modules in Protein-Protein Networks

An important class of tasks in protein-protein interaction analytics tries to find molecular modules from protein-protein interaction networks [Bader and Hogue 2003; Nepusz et al. 2012; Spirin and Mirny 2003]. These methods perform clustering on the nodes of the protein-protein interaction network to identify closely related sets of nodes. Such sets are referred to as molecular modules, and they are densely connected among themselves but sparsely connected with the remainder of the network. Such modules are typically of two types, as pointed out in [Spirin and Mirny 2003]. The first type comprises protein complexes, such as splicing machinery and transcription factors. The second type comprises dynamic functional units, such as signaling cascades and cell-cycle regulation. At the most basic level, one can use off-the-shelf community detection algorithms [Fortunato 2010] in order to identify clusters of nodes. One challenge is that the underlying protein complexes (clusters) are often overlapping [Nepusz et al. 2012]. Therefore, it is important to use clustering techniques that are robust to noise and overlap among clusters. This can be achieved to a large extent by using ensemble techniques [Asur et al. 2007]. Most of these techniques use conventional machine learning methods rather than deep learning. However, deep learning is a natural approach to use for this task because one can embed the nodes of a PPI network in multidimensional space. Once the embedding has been done, it can be used in conjunction with any off-the-shelf clustering algorithm. As a specific example, the original node2vec work proposed for embedding nodes uses PPI networks as one of its test data sets [Grover and Leskovec 2016]. Another approach [Tian et al. 2014] uses a sparse autoencoder in order to discover the embeddings of the nodes of a PPI network.
This approach works with the n × n similarity matrix S of an n-node graph. In this case, the idea is to treat S as a data set containing n instances, where each instance corresponds to the similarity of a node with the other nodes. The similarity matrix is obtained by dropping the low-weight edges of the adjacency matrix and then normalizing the edges. The edges are normalized so that each edge (i, j) is normalized by the geometric mean of the weighted node degrees of nodes i and j. The sparse autoencoder is implemented by using a sparsity penalty on the hidden nodes. This encourages the hidden nodes to take on zero activation values. The resulting embeddings can then be clustered with any off-the-shelf algorithm. Considerable scope exists in integrating other types of data into the analysis, and an overview of such methods is found in [Chen et al. 2013]. Such integration methods are easily done in deep learning because inputs from multiple sources can be fused in a neural architecture.
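The similarity-matrix construction used as the autoencoder input in [Tian et al. 2014] can be sketched directly. The following NumPy fragment drops low-weight edges and normalizes each remaining edge by the geometric mean of its endpoint degrees, as described above; the toy adjacency matrix and the edge-weight cutoff are illustrative assumptions, and the sparse autoencoder itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weighted PPI adjacency matrix over n proteins (symmetric, zero diagonal).
n = 6
A = rng.random((n, n))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)

A[A < 0.3] = 0.0                        # drop the low-weight edges (assumed cutoff)

deg = A.sum(axis=1)                     # weighted node degrees
denom = np.sqrt(np.outer(deg, deg))     # geometric mean of the two endpoint degrees
S = np.divide(A, denom, out=np.zeros_like(A), where=denom > 0)

# Each row of S is one training instance for the sparse autoencoder: the
# similarity of one protein to every other protein in the network.
print(S.shape)  # (6, 6)
```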
3.4 Inferring Protein-Protein Interaction and Folding Structure

An important problem in the deep learning of proteins is that of predicting the 3-dimensional folding structure of proteins. The amino acid sequence comprising a protein regulates its 3-dimensional structure, depending on where the residues make contacts with one another. An important intermediate step in the prediction of the structure is the prediction of both intra-protein residue-residue interactions as well as inter-protein residue-residue contacts between a pair of interacting proteins (i.e., inter-protein contact prediction) [Zeng et al. 2018; Zhou et al. 2017]. Inferring these contact points can help in inferring the 3-dimensional structure of the protein.

An important point is that the functioning of a protein is heavily dependent on the shape taken on by the protein. For example, the specificity of enzymes to substrates is heavily dependent on the shape of the protein because the 3-dimensional shape of the enzyme decides which substrates it binds to. The basics of the protein folding problem are discussed in [Dill et al. 2008]. Because of the importance of the problem, a biennial global competition was established in 1994 for measuring and encouraging progress in the field. This competition is referred to as the Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP), and an overview of some of the early work may be found in [Moult 2005]. This competition has catalyzed some of the well-known techniques in the field.

The precise shape of a protein is hard to (fully) observe at the molecular level, and one usually only has information about the sequence of amino acids comprising the protein. Although the shape can be partially observed using techniques such as X-ray crystallography and electron microscopy, the reality is that these methods are extremely expensive and time-consuming. The precise shape of the protein depends on how this sequence of amino acids interacts with one another. Given the importance of shape in predicting the function of proteins, it is not particularly surprising that the problem of protein shape prediction has taken on a significant level of importance. In each case, one assumes that the input is the sequence corresponding to the protein, and the output is the folding structure of the protein. Given that this approach is based on sequences, it is not particularly surprising that some of the prominent methods tackle this problem with the use of recurrent neural networks [Baldi et al. 1999]. [Baldi et al. 1999] use a recurrent neural network to predict the secondary structure associated with each position in the sequence. The problem is therefore reduced to a classification problem at each position, where one of three possibilities corresponding to α-helix, β-sheet, or coil is predicted. A bi-directional recurrent neural network was used for prediction in this work. The work in [Spencer et al. 2015] proposes methods for ab initio secondary structure prediction. Such methods are useful for predicting the tertiary structure of proteins. This is because the predictions about the secondary structure also feed into the predictions about the tertiary structure.

A problem closely related to the prediction of the structure of a protein is that of contact map prediction. In other words, the goal is to predict which residues interact with one another. The shape of a protein is decided by the distances between all possible amino-acid residue pairs of the 3-dimensional structure. This distance structure is a simplified variant of the actual 3-dimensional structure. This is precisely the information captured in the protein contact map, which is a 2-dimensional matrix of distances whose size depends on the number of amino acid residues. An advantage of this type of approach is that it is more amenable to machine learning techniques. A common simplification is that the matrix is assumed to be binary. In other words, the value of an entry in the matrix is assumed to be 1 if the distance is below a particular threshold, and 0 otherwise. This assumption turns the distance matrix into a similarity matrix, and the problem can also be modeled with link prediction techniques [Lü and Zhou 2011; Martin et al. 2004] in machine learning if a subset of similarities is already known. A neural network for predicting interaction sites from sequences is proposed in [Fariselli et al. 2002]. This approach uses a straightforward neural network (i.e., a feedforward network), in which contiguous windows of 11 amino acids (residues) are used as the input to determine whether or not the central unit (among these 11 residues) is in contact with another protein. Each of the 11 inputs is one-hot encoded as a vector of size 20 to account for the 20 possibilities of amino acids at each position.

The contact maps between proteins provide useful information in order to infer the shape. Many of the methods for 3-dimensional structure prediction break up the problem into two parts. First, the distances between all the pairs of residues are predicted. Subsequently, these distances are used to predict the 3-dimensional structure with a different model. Some techniques also compute the angles between the bonds to better capture the 3-dimensional structure. [Di Lena et al. 2012] use a 2-dimensional bidirectional recurrent neural network for contact map prediction. In this approach, coarse contact maps are predicted between secondary structure elements. The 2-dimensional nature of the structure helps in integrating spatial structure into the prediction process. The recurrent network is used to process the input sequence corresponding to the residues in the protein. However, instead of raw residues, various features are extracted, such as residue features, coarse features, and alignment features. The coarse features are themselves outputs of a coarse predictive phase.
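A binary contact map of the kind described above is simple to derive once residue coordinates are available. The following NumPy sketch thresholds the pairwise residue-distance matrix; the random coordinates are an illustrative assumption, and the 8-angstrom cutoff is a commonly used convention rather than a value fixed by the surveyed work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-dimensional coordinates for the residues of a protein (one point per residue).
coords = rng.normal(scale=5.0, size=(50, 3))
threshold = 8.0  # assumed cutoff in angstroms

# Pairwise Euclidean distances between all residue pairs (50 x 50 matrix).
diff = coords[:, None, :] - coords[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Binary contact map: 1 if a residue pair is closer than the threshold, else 0.
contact_map = (distances < threshold).astype(int)

print(contact_map.shape, contact_map[0, :10])
```

In the prediction setting, of course, the coordinates are unknown, and the learning task runs in the opposite direction: a model predicts this matrix (or the underlying distances) from sequence-derived features.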
[Wang et al. 2017] use convolutional neural networks for predicting the structure of proteins in terms of the bidirectional contact maps. The recent idea of deep residual networks is used for this purpose [He et al. 2016]. Although it is more common to use recurrent neural networks rather than convolutional neural networks for protein structure prediction, the work in [Wang et al. 2017] shows that it is possible to use 1-dimensional convolutions for protein structure prediction.

Recently, a deep learning method referred to as AlphaFold was proposed in [Evans et al. 2018]. The work in [Evans et al. 2018] predicts both the distances between pairs of residues as well as the angles between pairs of residues. A neural network was trained to predict the distances between the protein residues. These distances could be used to estimate the closeness of a proposed protein structure to the correct answer. This was achieved with a separate neural network. These probabilities were then combined into a score that estimates how accurate a proposed protein structure is. The AlphaFold method provided performance competitive with the state of the art in the CASP competition.

A related problem is to examine proteins in a pairwise fashion and predict whether two proteins are from the same fold. The model uses the pairwise protein features as input, including information on sequence, family, and structural features. This problem is referred to as the fold recognition problem. The work in [Jo et al. 2015], referred to as DN-Fold, uses an ensemble of neural networks and conventional machine learning methods to achieve state-of-the-art performance. The underlying models were feed-forward networks containing between three and five layers.

Integrating Text with PPI Prediction

One problem in the computational biology domain is that much of the work on protein interaction and function prediction precedes the widespread use of machine learning and deep learning techniques.
Complicating this fact is the issue that there are literally thousands of proteins, and the pairwise possibilities for interactions might range in the millions; simply speaking, knowledge is distributed across a vast amount of literature, and much of what we know is incomplete. As a result, the known interactions and functions of proteins are often not organized in a systematic way. One approach for finding protein-protein interactions is to integrate text mining with the previous tasks [Cohen and Hersh 2005; Donaldson et al. 2003; Simpson and Demner-Fushman 2012]. For example, the basic idea for finding protein-protein interactions is to tag pairs of proteins in a sentence together when they are known to have an interaction. Subsequently, information extraction methods can be applied to untagged text in order to discover the relationships among them. Most of the current techniques focus on using vanilla machine learning techniques, although the approach is very much suited to deep learning. In particular, the work in [Hsieh et al. 2017] shows how one can combine deep learning techniques, such as recurrent neural networks, with these models in order to discover relevant interactions. Most of the analysis is at the sentence level. As a specific example, given in [Hsieh et al. 2017], consider the following sentence: "STAT3 selectively interacts with Smad3 to antagonize TGF-β signaling." Given this type of sentence (in which the protein tokens have already been isolated), a recurrent neural network should be able to infer that the three proteins (STAT3, Smad3, and TGF-β) interact with one another. This can be achieved by using a recurrent neural network in which the input corresponds to the sequence of words in the sentence, and two of the proteins in the sentence are marked. The output is a binary result, depending on whether or not this pair of proteins interacts. In this particular case, the classification methodology would need to be applied three times to identify whether or not each of the three pairs of proteins interact with one another. Convolutional neural networks have also been used for this purpose [Peng and Lu 2017], although they have not achieved state-of-the-art performance on the interaction prediction task.

4. GENE EXPRESSION DATA

As discussed earlier in this survey, genes are frequently expressed in the form of the types of proteins they create. The sequence of any protein that is created is extracted from the underlying DNA through the process of RNA transcription. Therefore, there is a natural connection between genomics and proteomics. An overview of machine learning methods for genomics may be found in [Libbrecht and Noble 2015].

A technology that allows biologists and machine learning scientists to analyze gene expression data is that of microarrays. So what is a DNA microarray? The way in which microarray chips work is that single strands of portions of synthetic DNA are put on the chip, and when single strands of denatured DNA from human subjects (or normal control DNA) are added in, these extracted strands bind to those synthetic DNA strands containing complementary base pairs. Typically, mixtures of different types of DNA are added at the same time, and they are dyed in order to be able to detect how much they bind to each position on the chip (containing a complementary strand). For example, the DNA microarray may contain "features" for both mutated DNA and normal DNA, and the control DNA would bind to the normal positions, whereas the DNA from the subject with a disease might bind to the mutated DNA. Each chip might contain thousands of short, single-strand DNA sequences for both normal DNA and mutated DNA. This type of approach can also be used to do population studies, where one determines the fraction of the population with a medical condition that also has a specific type of mutation. Because of the ability to put a large number of features on the microarray chips (comprising large portions of the human genome), it has become possible to do large-scale studies with microarray data.

[Figure 7: Capturing differential gene expression via microarrays. Dyed cDNA derived from the mRNA of normal and cancerous cells binds differentially to the DNA microarray chip.]

It is noteworthy that DNA microarrays have diverse applications, and one of these many applications is gene expression analysis (which is also a focus of this survey). A DNA microarray is able to perform an experiment in which each point represents the ratio of expression levels under two different experimental conditions.
4. GENE EXPRESSION DATA
As discussed earlier in this survey, genes are frequently expressed in the form of the types of proteins they create. The sequence of any protein that is created is extracted from the underlying DNA through the process of RNA transcription. Therefore, there is a natural connection between genomics and proteomics. An overview of machine learning methods for genomics may be found in [Libbrecht and Noble 2015].
A technology that allows biologists and machine learning scientists to analyze gene expression data is that of microarrays. So what is a DNA microarray? The way in which microarray chips work is that single strands of portions of synthetic DNA are put on the chip, and when single strands of denatured DNA from human subjects (or normal control DNA) are added, these extracted strands bind to those synthetic DNA strands containing complementary base pairs. Typically, mixtures of different types of DNA are added at the same time, and they are dyed in order to be able to detect how much they bind to each position on the chip (containing a complementary strand). For example, the DNA microarray may contain "features" for both mutated DNA and normal DNA; the control DNA would bind to the normal positions, whereas the DNA from a subject with a disease might bind to the mutated DNA. Each chip might contain thousands of short, single-strand DNA features for both normal DNA and mutated DNA. This type of approach can also be used to do population studies, where one determines the fraction of the population with a medical condition that also has a specific type of mutation. Because of the ability to put a large number of features on the microarray chips (comprising large portions of the human genome), it has become possible to do large-scale studies with microarray data.
It is noteworthy that DNA microarrays have diverse applications, and one of these many applications is gene expression analysis (which is also a focus of this survey). A DNA microarray is able to perform an experiment in which each point represents the ratio of expression levels under two different experimental conditions. For example, some genes in a cell could be expressed more than others, as a result of which they will produce single-strand messenger RNA in order to create the proteins that perform the functions corresponding to those genes. The other genes will be switched off and will not produce messenger RNA to create functional proteins. At any given moment in time, only a subset of the genes will be producing messenger RNA, which can be detected using microarray technology. In a gene profiling experiment, the activity levels of thousands of genes are simultaneously monitored. RNA from cancerous cells and normal cells might also be simultaneously introduced, and the differential expression with respect to the different genes will tell us which genes are active in the disease. The messenger RNA strands are reacted with an enzyme (such as reverse transcriptase) to produce the (single-strand) complementary DNA, referred to as cDNA. These cDNA strands are dyed and reacted with the chip. The key point is that the (dyed) cDNA strands from the cancerous and normal cells will bind in a differential way to the single strands of different genes on the chip. Therefore, the color of a particular position on the DNA microarray will tell us the expression level of that gene for the two conditions (e.g., normal and cancerous). This overall process is illustrated in Figure 7.
[Figure 7: Capturing differential gene expression via microarrays. The mRNA from normal cells and from cancerous cells is converted to dyed cDNA, which binds differentially to the DNA microarray chip.]
Similarly, one can compute expression levels over different conditions, such as different choices of drug treatments, by using independent experiments with different microarray chips. Another possibility is that each experiment could correspond to the gene expression of a particular patient, and the control DNA could correspond to (known) normal DNA. Therefore, by using n patients, one would have n sets of expression levels. One repeats this process for n different experiments, which correspond to various patients or conditions for the same set of d genes. Therefore, each experiment corresponds to a data point, and the genes correspond to the features. As a result, the data for gene expression typically correspond to an n × d matrix, where each of the n rows corresponds to a single experiment (using a different microarray chip), and the d elements of that row contain the ratios of the expression levels under two conditions for all the d genes being tested. The value of d can be extremely large, and therefore the problem is high-dimensional. In many cases, the number of dimensions is greater than the number of records, which leads to problems associated with overfitting. The features are often transformed by applying a logarithm function to each feature and then normalizing the data so that the Euclidean norm of each of the d columns is 1. In some settings, the normalization is instead done so that the Euclidean norm of each of the n rows is 1. The specific choice of normalization depends on the application at hand.
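These preprocessing steps are straightforward to express; the following NumPy sketch assumes a matrix of strictly positive expression ratios and uses a base-2 logarithm, which is one common choice:

```python
import numpy as np

def preprocess_expression(X, normalize="columns"):
    """Log-transform an n x d matrix of expression ratios and normalize.

    X[i, j] is the expression ratio of gene j in experiment i (assumed > 0).
    """
    X = np.log2(X)                      # log of the two-condition ratio
    if normalize == "columns":          # unit Euclidean norm per gene
        X = X / np.linalg.norm(X, axis=0, keepdims=True)
    else:                               # unit Euclidean norm per experiment
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X

# Example: 50 experiments (rows) over 2000 genes (columns).
X = preprocess_expression(np.random.rand(50, 2000) + 0.1)
```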
Several machine learning applications can be associated with this type of data set. For example, clustering can be used to group similar genes [Gupta et al. 2015]. However, most of the known clustering methods for gene expression data do not use deep learning methods, and a survey on this topic can be found in [Jiang et al. 2004]. From an application-centric perspective, the classification problem is more interesting, where one tries to classify genes based on an unknown property using the gene expression levels, given labeled training data. The pioneering work in this area was proposed in [Brown et al. 2000], who used a support-vector machine in order to perform classification on a yeast gene expression data set. Later work showed how this type of approach could be used for various diagnostic purposes, such as cancer classification [Khan et al. 2001; Guyon et al. 2002] or the discovery of pathogenic genetic variants [Quang et al. 2014]. The basic principle in [Guyon et al. 2002] was to repeat an experiment for each patient in order to compute their expression levels for the different genes. In the following, we provide an overview of deep learning techniques for gene expression data.
4.1 Unsupervised Learning with Gene Expression Data
In the problem of clustering gene expression data, one interesting characteristic is that it is possible to pose the problem in a number of different ways depending on the application-specific scenario. For example, for an n × d gene expression matrix, one should cluster the columns (dimensions) when one wants to find similar genes, and one should cluster the rows when one wants to find similar experimental conditions or patients in terms of gene expression.
In the deep learning domain, the popular approach is to use an autoencoder [Aljalbout et al. 2018; Goodfellow et al. 2016] to embed the individual rows (or columns) into a feature-engineered space. [Gupta et al. 2015] proposed to use denoising autoencoders to perform this feature engineering and to cluster similar genes. A broader overview of deep learning methods for clustering is provided in [Ching et al. 2018]. A denoising autoencoder is like a normal autoencoder, except that additional noise is added to the input of the autoencoder in order to train it in the presence of corruption [Vincent et al. 2008]. By teaching the autoencoder how to remove the effects of possible data corruption, the results are often of higher quality. The general idea of using autoencoders is helpful in extracting latent features that can be used for a variety of tasks [Way and Greene 2017; Titus et al. 2018a]. In particular, [Titus et al. 2018a; Titus et al. 2018b] apply the approach to DNA methylation data to extract the latent features of particular methylated genes that are relevant to breast cancer. The work uses a variational autoencoder [Kingma and Welling 2013], which can also be used for unsupervised clustering.
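A minimal sketch of a denoising autoencoder of this type is shown below; the layer sizes and the additive Gaussian corruption model are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, in_dim=2000, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x, noise_std=0.1):
        # Corrupt the input, but reconstruct the *clean* input from the code.
        x_noisy = x + noise_std * torch.randn_like(x)
        z = self.encoder(x_noisy)
        return self.decoder(z), z

model = DenoisingAutoencoder()
x = torch.randn(32, 2000)                  # 32 expression profiles
x_hat, z = model(x)
loss = nn.MSELoss()(x_hat, x)              # reconstruct the uncorrupted input
# After training, the codes z can be clustered (e.g., with k-means) to group
# similar genes or similar experiments.
```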
4.2 Regression with Gene Expression Data
A recent line of work predicts gene expression levels with the use of regression methods [Chen et al. 2016]. The idea is to reduce the cost of gene expression profiling by measuring only a subset of the expression levels and inferring the remaining ones.
One can technically view this problem as that of having an n × d data matrix in which only a portion of the matrix is fully specified. For simplicity, consider the case in which the first d_1 < d genes are landmark genes for which all n expression levels are fully specified, whereas the remaining (d − d_1) expression levels are fully specified only in the training data, which corresponds to the first n_1 experiments/trials. These (d − d_1) target gene expression levels are missing in the test data. It is often the case that the value of d might be tens of times greater than that of d_1. Consider the case in which the gene expression matrix has entries denoted by x_{ij}, where i is the index of the experiment (which was collected using a DNA microarray chip, such as the expression ratio between a particular patient and a control) and j is the index of a particular gene feature. Therefore, the idea is to predict the target features x_{ij} (for j > d_1) from the landmark features x_{ij} (for j ≤ d_1):

x_{ij} = f_j(x_{i1}, \ldots, x_{i,d_1}) \quad \forall j \in \{d_1 + 1, \ldots, d\} \qquad (4)

Here, f_j(·) is the jth function being modeled to predict the jth target gene. Note that we do not need models for j ∈ {1, ..., d_1}, because these are landmark genes for which expression levels are already available.
In the simplest case, the function f_j(·) could be a simple linear function that uses a parameter vector w_j and a bias b_j for modeling:

x_{ij} = w_j \cdot [x_{i1}, \ldots, x_{i,d_1}] + b_j \quad \forall j > d_1 \qquad (5)

The weights can be learned using the training data, and predictions can be performed on the test data.
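Under the linear model of Equation 5, each target gene can be fit by ordinary least squares over the landmark expressions; the following is a short sketch with illustrative sizes:

```python
import numpy as np

n, d1, d = 200, 50, 500                     # experiments, landmark genes, all genes
X = np.random.randn(n, d)                   # toy (already normalized) expression matrix
landmarks, targets = X[:, :d1], X[:, d1:]

# Append a constant column so the bias b_j is learned along with w_j.
A = np.hstack([landmarks, np.ones((n, 1))])
# One least-squares fit per target gene j > d1 (all solved jointly here).
W, *_ = np.linalg.lstsq(A, targets, rcond=None)   # shape (d1 + 1, d - d1)

predicted = A @ W                            # predictions for all target genes
```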
However, using a linear regression model is too simplistic in most cases. [Chen et al. 2016] use a feed-forward neural network for this type of profiling. In this approach, the expression levels of only about 1,000 landmark genes were profiled, and those of the remaining target genes, which corresponded to nearly 21,000 genes, were inferred. Here, the key point is that since this is a multi-task regression problem, an output needed to be included for each of the 21,000 output possibilities. Therefore, the hidden layer encoded latent features that were relevant to all of the predictions.
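In the same multi-task spirit (with much smaller, illustrative dimensions than the roughly 1,000 landmark and 21,000 target genes discussed above), a shared hidden layer can feed one regression output per target gene, as the following sketch shows:

```python
import torch
import torch.nn as nn

d1, d_target = 50, 450          # landmark inputs and target outputs (toy sizes)

# One shared hidden layer; one output unit per target gene (multi-task regression).
model = nn.Sequential(
    nn.Linear(d1, 128), nn.ReLU(),
    nn.Linear(128, d_target),
)

x = torch.randn(64, d1)         # landmark expressions for 64 experiments
y = torch.randn(64, d_target)   # fully specified targets (training data only)
loss = nn.MSELoss()(model(x), y)
loss.backward()                 # the hidden layer learns features shared by all targets
```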
The effect of various other factors, such as histone modification, on gene expression has also been studied [Singh et al. 2016]. Histone modification refers to the modifications occurring to histone proteins via processes like methylation and acetylation. [Singh et al. 2016] propose a deep learning model for this task, and it is shown to outperform competing methods like support vector machines.
4.3 Classification with Gene Expression Data
It is noteworthy that even though machine learning and neural networks were invented in the eighties and nineties, the use of gene expression data for machine learning picked up only in the late nineties and at the turn of the century. Like neural networks, gene expression profiling was also a relatively new technology, and it took a while for the two fields to come together. In the early years, the classification methods were not necessarily based on neural networks but on simpler techniques, such as support vector machines [Brown et al. 2000; Guyon et al. 2002; Dudoit et al. 2002]. As pointed out in [Dudoit et al. 2002], a lot of the early work did not use neural network methods. This is not particularly surprising, because the success of deep learning is a more recent phenomenon, and the larger successes in the area occurred after 2010. One of the earliest works that used artificial neural networks in the context of gene expression data was proposed in [Khan et al. 2001], and this work showed how to separate tumors into different diagnostic categories. However, most of these earlier works did not yield results that could not be matched by traditional machine learning methods.
Deep learning methods really score over traditional machine learning techniques in cases where good feature engineering is a possibility. In this sense, the work in [Fakoor et al. 2013] uses sparse autoencoders [Bengio et al. 2007; Coates et al. 2011] to extract features from gene expression data. Note that this is an unsupervised feature extraction approach, and therefore it can be used in cases where the amount of unlabeled data is significant but there are few labeled data points. The use of the autoencoder was preceded by a phase of principal component analysis for better results. For the actual classification, straightforward softmax regression was used in [Fakoor et al. 2013]. This general idea of using an autoencoder and then following it up with a classifier on the extracted features has been repeated in a few places. For example, [Danaee et al. 2017] also extract features from microarray data for classification; however, they used a denoising autoencoder instead of a sparse autoencoder.
A number of recent researchers have also focused on supervised learning. For example, [Chen et al. 2015] demonstrate how supervised learning can be used in these models. This approach is more like a conventional neural network classifier rather than an unsupervised feature engineering method that uses an autoencoder. The technique in [Yousefi et al. 2017] shows how deep learning models can be used in order to predict clinical outcomes from gene expression profiles.
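This two-stage pattern, unsupervised feature extraction followed by softmax regression, can be sketched as follows. The dimensions, the single gradient steps, and the four diagnostic categories are illustrative assumptions rather than the settings of [Fakoor et al. 2013]:

```python
import torch
import torch.nn as nn

# Stage 1: unsupervised -- fit an autoencoder on the (plentiful) unlabeled profiles.
encoder = nn.Sequential(nn.Linear(2000, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 2000))
X_unlabeled = torch.randn(500, 2000)
recon_loss = nn.MSELoss()(decoder(encoder(X_unlabeled)), X_unlabeled)
recon_loss.backward()           # one illustrative step; real training loops over epochs

# Stage 2: supervised -- softmax regression on the frozen extracted features.
classifier = nn.Linear(64, 4)   # e.g., 4 diagnostic categories
X_labeled, y = torch.randn(40, 2000), torch.randint(0, 4, (40,))
with torch.no_grad():
    features = encoder(X_labeled)
clf_loss = nn.CrossEntropyLoss()(classifier(features), y)
clf_loss.backward()
```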
4.4 Inferring Gene Networks and Their Behavior
In the beginning of this survey, we discussed the connections between genes and proteins, as well as the capturing of interactions among proteins with protein-protein interaction networks. In this section, we discuss work in the field of gene regulatory networks, which is a more general concept. A gene regulatory network contains a set of DNA, RNA, proteins, or their complexes as nodes, and the interactions among them as edges. Recall from our earlier discussion that proteins are constructed from DNA/RNA via the processes of transcription and translation. In addition, proteins also serve the function of "turning on" genes, which results in the creation of more proteins. These created proteins might result in further interactions, and so on. Therefore, the edges in the network could represent direct chemical interactions among genes, or they could correspond to processes by which genes affect each other. Clearly, gene networks could potentially be very complex and could involve loopy, nonlinear interactions over time. As a result, the temporal dynamics of such networks are sometimes modeled with differential equations [Chen et al. 1999]. However, in practice, simplified models, such as Boolean models, are often used [Akutsu et al. 1999; Shmulevich et al. 2002].
Most of the time, the gene regulatory network is not directly available, and one has to convert gene expression data into regulatory networks. This process of inferring gene regulatory networks from gene expression data and other types of data is also referred to as reverse engineering. [Rubiolo et al. 2015] discover a gene regulatory network from temporal expression profiles by using a pool of multiple neural networks with temporal delays at the inputs; each neural network discovers the potential regulator of a target gene profile at the output. [Rubiolo et al. 2017] discuss how extreme learning machines can be used to infer gene regulatory networks. A popular approach to the reconstruction of gene regulatory networks is the use of bidirectional [Biswas and Acharyya 2018] or hierarchical [Kordmahalleh et al. 2017] recurrent neural networks. Both of these methods use time-delayed temporal dynamics, which is a natural candidate for neural network modeling. Broader reviews of neural models for gene regulatory network reconstruction and analysis are provided in [Biswas and Acharyya 2016; Delgado and Gómez-Vela 2018].
The inference of gene regulatory networks is closely related to the dynamics of the interactions between genes. Recurrent network approaches to model the dynamics of gene regulatory networks are discussed in [Hu et al. 2005; Maraziotis et al. 2007]. Note that recurrent networks present a natural approach for modeling temporal dynamics because of their ability to capture interactions over time. However, more recent studies [Smith et al. 2010] have suggested that recurrent neural networks do not necessarily outperform carefully designed multilayer neural networks. Therefore, it is still an open question as to whether recurrent neural networks are the tool of choice in this setting.
Another early line of work was to use Bayesian networks [Friedman et al. 2000] to analyze gene expression data. A Bayesian network can be viewed as a probabilistic graphical model, in which probabilistic computations are performed across the different nodes of the network. A discussion of the modeling of gene expression networks with probabilistic graphical models is provided in [Friedman 2004]. This approach shares some similarities with the principle of probabilistic Boolean networks [Shmulevich et al. 2002].
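A minimal sketch of this recurrent formulation is shown below: a network is trained to predict the expression vector at the next time step from the preceding time steps. All sizes are illustrative, and this is the general scheme rather than any one of the cited architectures:

```python
import torch
import torch.nn as nn

num_genes, hidden = 20, 32

class GeneDynamicsRNN(nn.Module):
    """Predict the expression of all genes at time t+1 from times 1..t."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(num_genes, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_genes)

    def forward(self, x):          # x: (batch, time, num_genes)
        h, _ = self.rnn(x)
        return self.out(h)         # prediction for each next time step

model = GeneDynamicsRNN()
series = torch.randn(4, 30, num_genes)      # 4 time courses, 30 time points
pred = model(series[:, :-1, :])             # predict steps 2..30
loss = nn.MSELoss()(pred, series[:, 1:, :])
```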
5. SUMMARY
This paper provided a survey of algorithms for protein and genomic analyses with the use of deep learning methods. Proteomics and genomics are closely related fields, given that the blueprints for proteins are contained in genomic sequences. DNA serves as the blueprint for genes, which are transcribed into messenger RNA; the messenger RNA then provides the information needed for the creation of proteins. Consequently, many of the problems that arise in the two fields are similar. For example, both genes and proteins can be arranged into network structures based on the interactions between individual components. In the case of proteins, these networks are referred to as protein-protein interaction networks, and analyzing them provides insights about protein function.
In the case of genes, a number of models have been proposed for both supervised and unsupervised learning. In supervised learning, the key models relate to the prediction of gene expression levels. In addition, gene expression data are used in the context of classification and clustering. Finally, the inference of gene networks and their behavior provides insights into the important properties of genes and their functions.

6. REFERENCES

Akutsu, T., Miyano, S., and Kuhara, S. 1999. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. In Biocomputing'99. World Scientific, 17–28.

Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33, 8, 831.

Aljalbout, E., Golkov, V., Siddiqui, Y., and Cremers, D. 2018. Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648.

Angermueller, C., Pärnamaa, T., Parts, L., and Stegle, O. 2016. Deep learning for computational biology. Molecular Systems Biology 12, 7, 878.

Asur, S., Ucar, D., and Parthasarathy, S. 2007. An ensemble framework for clustering protein–protein interaction networks. Bioinformatics 23, 13, i29–i40.

Bader, G. D. and Hogue, C. W. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 1, 2.

Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. 1999. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 15, 11, 937–946.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. 2007. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems. 153–160.

Bilgic, M. and Getoor, L. 2008. Effective label acquisition for collective classification. In ACM KDD Conference. ACM, 43–51.

Bishop, C. 1995. Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.

Biswas, S. and Acharyya, S. 2016. Neural model of gene regulatory network: a survey on supportive meta-heuristics. Theory in Biosciences 135, 1-2, 1–19.

Biswas, S. and Acharyya, S. 2018. A bi-objective RNN model to reconstruct gene regulatory network: a modified multi-objective simulated annealing approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 15, 6, 2053–2059.

Bogdanov, P. and Singh, A. K. 2010. Molecular function prediction using neighborhood features. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 7, 2, 208–217.

Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M., and Haussler, D. 2000. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences 97, 1, 262–267.
Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., and Chen, Z. 2017. ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules 22, 10, 1732.

Caragea, C. and Honavar, V. 2009. Machine learning in computational biology. In Encyclopedia of Database Systems. Springer, 1663–1667.

Chen, B., Fan, W., Liu, J., and Wu, F.-X. 2013. Identifying protein complexes and functional modules: from static PPI networks to dynamic PPI networks. Briefings in Bioinformatics 15, 2, 177–194.

Chen, H., Zhao, H., Shen, J., Zhou, R., and Zhou, Q. 2015. Supervised machine learning model for high dimensional gene data in colon cancer detection. In 2015 IEEE International Congress on Big Data. IEEE, 134–141.

Chen, T., He, H. L., and Church, G. M. 1999. Modeling gene expression with differential equations. In Biocomputing'99. World Scientific, 29–40.

Chen, Y., Li, Y., Narayan, R., Subramanian, A., and Xie, X. 2016. Gene expression inference with deep learning. Bioinformatics 32, 12, 1832–1839.

Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., Ferrero, E., Agapow, P.-M., Zietz, M., Hoffman, M. M., et al. 2018. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface 15, 141, 20170387.

Coates, A., Ng, A., and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 215–223.

Cohen, A. M. and Hersh, W. R. 2005. A survey of current work in biomedical text mining. Briefings in Bioinformatics 6, 1, 57–71.

Danaee, P., Ghaeini, R., and Hendrix, D. A. 2017. A deep learning approach for cancer detection and relevant gene identification. In Pacific Symposium on Biocomputing 2017. World Scientific, 219–229.

Delgado, F. M. and Gómez-Vela, F. 2018. Computational methods for gene regulatory networks reconstruction and analysis: A review. Artificial Intelligence in Medicine.

Deng, M., Zhang, K., Mehta, S., Chen, T., and Sun, F. 2003. Prediction of protein function using protein–protein interaction data. Journal of Computational Biology 10, 6, 947–960.

Desrosiers, C. and Karypis, G. 2009. Within-network classification using local structure similarity. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 260–275.

Di Lena, P., Nagata, K., and Baldi, P. 2012. Deep architectures for protein contact map prediction. Bioinformatics 28, 19, 2449–2457.

Dill, K. A., Ozkan, S. B., Shell, M. S., and Weikl, T. R. 2008. The protein folding problem. Annual Review of Biophysics 37, 289–316.

Donaldson, I., Martin, J., De Bruijn, B., Wolting, C., Lay, V., Tuekam, B., Zhang, S., Baskin, B., Bader, G. D., Michalickova, K., et al. 2003. PreBIND and Textomy: mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4, 1, 11.

Dudoit, S., Fridlyand, J., and Speed, T. P. 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97, 457, 77–87.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems. 2224–2232.

Elman, J. L. 1990. Finding structure in time. Cognitive Science 14, 2, 179–211.

Evans, R., et al. 2018. De novo structure prediction with deep-learning based scoring. In Thirteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstracts).

Fakoor, R., Ladhak, F., Nazi, A., and Huber, M. 2013. Using deep learning to enhance cancer diagnosis and classification. In Proceedings of the International Conference on Machine Learning. Vol. 28. ACM, New York, USA.

Fariselli, P., Pazos, F., Valencia, A., and Casadio, R. 2002. Prediction of protein–protein interaction sites in heterocomplexes with neural networks. European Journal of Biochemistry 269, 5, 1356–1361.

Fortunato, S. 2010. Community detection in graphs. Physics Reports 486, 3-5, 75–174.

Friedman, N. 2004. Inferring cellular networks using probabilistic graphical models. Science 303, 5659, 799–805.

Friedman, N., Linial, M., Nachman, I., and Pe'er, D. 2000. Using Bayesian networks to analyze expression data. Journal of Computational Biology 7, 3-4, 601–620.
Goodfellow, I., Bengio, Y., and Courville, A. 2016. Deep Learning. MIT Press.

Grover, A. and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In ACM KDD Conference. ACM, 855–864.

Gupta, A., Wang, H., and Ganapathiraju, M. 2015. Learning structure in gene expression data using deep architectures, with an application to gene clustering. In 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 1328–1335.

Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. 2002. Gene selection for cancer classification using support vector machines. Machine Learning 46, 1-3, 389–422.

Hamilton, W., Ying, Z., and Leskovec, J. 2017a. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.

Hamilton, W. L., Ying, R., and Leskovec, J. 2017b. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.

He, K., Zhang, X., Ren, S., and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.

Henaff, M., Bruna, J., and LeCun, Y. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.

Hochreiter, S. and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9, 8, 1735–1780.

Hsieh, Y.-L., Chang, Y.-C., Chang, N.-W., and Hsu, W.-L. 2017. Identifying protein-protein interactions in biomedical literature using recurrent neural networks with long short-term memory. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Vol. 2. 240–245.

Hu, X., Maglia, A. M., and Wunsch, D. C. 2005. A general recurrent neural network approach to model genetic regulatory networks.
Huang, X., Li, J., and Hu, X. 2017. Label informed attributed network embedding. In WSDM. 731–739.

Jiang, D., Tang, C., and Zhang, A. 2004. Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering 16, 11, 1370–1386.

Jo, T., Hou, J., Eickholt, J., and Cheng, J. 2015. Improving protein fold recognition by deep learning networks. Scientific Reports 5, 17573.

Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., et al. 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 6, 673.

Kingma, D. P. and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kipf, T. N. and Welling, M. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Kipf, T. N. and Welling, M. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

Kordmahalleh, M. M., Sefidmazgi, M. G., Harrison, S. H., and Homaifar, A. 2017. Identifying time-delayed gene regulatory networks via an evolvable hierarchical recurrent neural network. BioData Mining 10, 1, 29.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS. 1097–1105.

Kulmanov, M., Khan, M. A., and Hoehndorf, R. 2017. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 34, 4, 660–668.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11, 2278–2324.
Lee, D., Redfern, O., and Orengo, C. 2007. Predicting protein function from sequence and structure. Nature Reviews Molecular Cell Biology 8, 12, 995.

Li, H. and Homer, N. 2010. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11, 5, 473–483.

Liao, L. and Noble, W. S. 2003. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Journal of Computational Biology 10, 6, 857–868.

Libbrecht, M. W. and Noble, W. S. 2015. Machine learning applications in genetics and genomics. Nature Reviews Genetics 16, 6, 321.

Liu, X. 2017. Deep recurrent neural network for protein function prediction from sequence. arXiv preprint arXiv:1701.08318.

London, B. and Getoor, L. 2014. Collective classification of network data. Data Classification: Algorithms and Applications 399.

Lü, L. and Zhou, T. 2011. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications 390, 6, 1150–1170.

Maraziotis, I., Dragomir, A., and Bezerianos, A. 2007. Gene networks reconstruction and time-series prediction from microarray data using recurrent neural fuzzy networks. IET Systems Biology 1, 1, 41–50.

Martin, S., Roe, D., and Faulon, J.-L. 2004. Predicting protein–protein interactions using signature products. Bioinformatics 21, 2, 218–226.

Moult, J. 2005. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Current Opinion in Structural Biology 15, 3, 285–289.

Murvai, J., Vlahoviček, K., Szepesvári, C., and Pongor, S. 2001. Prediction of protein functional domains from sequences using artificial neural networks. Genome Research 11, 8, 1410–1417.

Nepusz, T., Yu, H., and Paccanaro, A. 2012. Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods 9, 5, 471.

Neville, J. and Jensen, D. 2003. Collective classification with relational dependency networks. In Second International Workshop on Multi-Relational Data Mining. 77–91.

Noble, W. S. et al. 2004. Support vector machine applications in computational biology. Kernel Methods in Computational Biology, 71–92.

Peng, Y. and Lu, Z. 2017. Deep learning for extracting protein-protein interactions from biomedical literature. arXiv preprint arXiv:1706.01556.

Qi, Y., Bar-Joseph, Z., and Klein-Seetharaman, J. 2006. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics 63, 3, 490–500.

Quang, D., Chen, Y., and Xie, X. 2014. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31, 5, 761–763.

Ravì, D., Wong, C., Deligianni, F., Berthelot, M., Andreu-Perez, J., Lo, B., and Yang, G.-Z. 2017. Deep learning for health informatics. IEEE Journal of Biomedical and Health Informatics 21, 1, 4–21.

Rubiolo, M., Milone, D. H., and Stegmayer, G. 2015. Mining gene regulatory networks by neural modeling of expression time-series. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 12, 6, 1365–1373.

Rubiolo, M., Milone, D. H., and Stegmayer, G. 2017. Extreme learning machines for reverse engineering of gene regulatory networks from expression time series. Bioinformatics 34, 7, 1253–1260.

Schölkopf, B., Tsuda, K., and Vert, J.-P. 2004. Kernel Methods in Computational Biology. MIT Press.

Schwikowski, B., Uetz, P., and Fields, S. 2000. A network of protein–protein interactions in yeast. Nature Biotechnology 18, 12, 1257.

Sharan, R., Ulitsky, I., and Shamir, R. 2007. Network-based prediction of protein function. Molecular Systems Biology 3, 1, 88.
Shmulevich, I., Dougherty, E. R., and Zhang, W. 2002. From Boolean to probabilistic Boolean networks as models of genetic regulatory networks. Proceedings of the IEEE 90, 11, 1778–1792.

Simpson, M. S. and Demner-Fushman, D. 2012. Biomedical text mining: a survey of recent progress. In Mining Text Data. Springer, 465–517.

Singh, R., Lanchantin, J., Robins, G., and Qi, Y. 2016. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics 32, 17, i639–i648.

Smith, M. R., Clement, M., Martinez, T., and Snell, Q. 2010. Time series gene expression prediction using neural networks with hidden layers.

Spencer, M., Eickholt, J., and Cheng, J. 2015. A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 12, 1, 103–112.

Spirin, V. and Mirny, L. A. 2003. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences 100, 21, 12123–12128.

Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K. P., et al. 2014. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Research 43, D1, D447–D452.
Tang, H., Zhao, Y.-W., Zou, P., Zhang, C.-M., Chen, R., Huang, P., and Lin, H. 2018. HBPred: a tool to identify growth hormone-binding proteins. International Journal of Biological Sciences 14, 8, 957–964.

Tian, F., Gao, B., Cui, Q., Chen, E., and Liu, T.-Y. 2014. Learning deep representations for graph clustering. In AAAI. 1293–1299.

Titus, A. J., Bobak, C. A., and Christensen, B. C. 2018a. A new dimension of breast cancer epigenetics. In Joint Conference on Biomedical Engineering Systems and Technologies.

Titus, A. J., Wilkins, O. M., Bobak, C. A., and Christensen, B. C. 2018b. An unsupervised deep learning framework with variational autoencoders for genome-wide DNA methylation analysis and biologic feature extraction applied to breast cancer. bioRxiv, 433763.

Vazquez, A., Flammini, A., Maritan, A., and Vespignani, A. 2003. Global protein function prediction from protein-protein interaction networks. Nature Biotechnology 21, 6, 697.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning. ACM, 1096–1103.

Von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S. G., Fields, S., and Bork, P. 2002. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 6887, 399.

Wang, J. T., Zaki, M. J., Toivonen, H. T., and Shasha, D. 2005. Introduction to data mining in bioinformatics. In Data Mining in Bioinformatics. Springer, 3–8.

Wang, S., Sun, S., Li, Z., Zhang, R., and Xu, J. 2017. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Computational Biology 13, 1, e1005324.

Watson, J. D., Laskowski, R. A., and Thornton, J. M. 2005. Predicting protein function from sequence and structural data. Current Opinion in Structural Biology 15, 3, 275–284.

Way, G. P. and Greene, C. S. 2017. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. bioRxiv, 174474.

Webb, S. 2018. Deep learning for biology. Nature 554, 7693, 555–557.

Wu, Q., Ye, Y., Ho, S.-S., and Zhou, S. 2014. Semi-supervised multi-label collective classification ensemble for functional genomics. BMC Genomics 15, 9, S17.

Xiong, W., Liu, H., Guan, J., and Zhou, S. 2013. Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks. BMC Bioinformatics 14, 12, S4.

Yousefi, S., Amrollahi, F., Amgad, M., Dong, C., Lewis, J. E., Song, C., Gutman, D. A., Halani, S. H., Vega, J. E. V., Brat, D. J., et al. 2017. Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. Scientific Reports 7, 1, 11707.

Yue, T. and Wang, H. 2018. Deep learning for genomics: A concise overview. arXiv preprint arXiv:1802.00810.

Zeng, H., Wang, S., Zhou, T., Zhao, F., Li, X., Wu, Q., and Xu, J. 2018. ComplexContact: a web server for inter-protein contact prediction using deep learning. Nucleic Acids Research.

Zhou, T., Wang, S., and Xu, J. 2017. Deep learning reveals many more inter-protein residue-residue contacts than direct coupling analysis. bioRxiv, 240754.

Zhu, X., Ghahramani, Z., and Lafferty, J. D. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML Conference. 912–919.