Statistical Methods for Gene and Brain Networks
Network Modeling in Biology: Statistical Methods for Gene and Brain Networks

Y. X. Rachel Wang, Lexin Li, Jingyi Jessica Li and Haiyan Huang

Statistical Science 2021, Vol. 36, No. 1, 89–108
https://doi.org/10.1214/20-STS792
© Institute of Mathematical Statistics, 2021
ISSN 0883-4237. Publication date: 2021-02-01. Peer reviewed.
Permalink (eScholarship, UCLA Previously Published Works): https://escholarship.org/uc/item/1vj3r20c

Abstract. The rise of network data in many different domains has offered researchers new insights into the problem of modeling complex systems and propelled the development of numerous innovative statistical methodologies and computational tools. In this paper, we primarily focus on two types of biological networks, gene networks and brain networks, where statistical network modeling has found both fruitful and challenging applications. Unlike other network examples such as social networks where network edges can be directly observed, both gene and brain networks require careful estimation of edges using measured data as a first step. We provide a discussion on existing statistical and computational methods for edge estimation and subsequent statistical inference problems in these two types of biological networks.

Key words and phrases: Gene regulatory networks, brain connectivity networks, network reconstruction, network inference.

Y. X. Rachel Wang is a Senior Lecturer in the School of Mathematics and Statistics, University of Sydney, Sydney, New South Wales, Australia. Lexin Li is a Professor in the Department of Biostatistics and Epidemiology, and School of Public Health, University of California, Berkeley, California, USA. Jingyi Jessica Li is an Associate Professor, Department of Statistics, University of California, Los Angeles, California, USA. Haiyan Huang is a Professor in the Department of Statistics, University of California, Berkeley, California, USA.

1. INTRODUCTION

Network structures exist everywhere in biology as many biological systems function via complex interactions among their individual components. In ecosystems, species interact in a number of different forms which are central to maintaining biodiversity, the most common being predator-prey relationships. In the human brain, neurons communicate by passing electric and chemical signals through synapses. At the cellular level, DNA, RNA, proteins and other molecules participate in a variety of biochemical reactions that determine the inner workings of a cell. Networks offer a succinct mathematical representation of these systems, with “sets of items, which we will call vertices or sometimes nodes, with connections between them, called edges” [151].

Network modeling has been successfully applied in many settings where the biological questions of interest have their counterparts in graph theory. For example, many biochemical networks have a scale-free topology with a few highly connected nodes [14], known as hubs in network analysis, which may correspond to key enzymes in biochemical processes. Another key goal in network analysis is to detect communities, which are groups of tightly connected nodes. These could be genes with related functionalities, or regions of the brain with coordinated actions.

From the statistical point of view, another reason why biological systems are particularly amenable to network analysis lies in the richness of data made available by various technologies, especially for gene network and brain network modeling. In these areas, measurements of variables are not limited to observational settings, and extensive experimental studies can be performed to examine how variables respond under different conditions. One prominent example can be found in genomics studies, where numerous high-throughput, deep sequencing technologies have generated a staggering amount of data measuring gene expression levels and epigenetic interactions. One particularly relevant technology is RNA-seq, routinely used nowadays to characterize the transcriptome. In addition to observational data, gene intervention data can be obtained by performing gene knockout or knockdown experiments to study the effect of perturbations. Another example is the study of the brain, where numerous imaging technologies, such as fMRI, have collected a wide variety of brain images measuring distinct brain characteristics, ranging from brain structure and function to numerous chemical constituents. Such data can be collected under resting state or when the subjects are required to perform cognitive tasks (e.g., task-based fMRI). For this reason, here we choose to discuss statistical methods for gene and brain networks, with more focus on the former.

The enormous wealth of data provides both opportunities and challenges for the analysis of the above two classes of networks. Unlike physical or social networks, interactions among genes are much harder to observe. Although experiments can be performed to search for and verify each gene–gene interaction, it is much more cost-effective to infer these interactions and reconstruct network edges using statistical and computational tools on high-throughput gene expression data (more recently, single-cell expression data). The computational results can help narrow down possible candidates for further experimental validation. The computationally inferred networks may contain up to tens of thousands of nodes, requiring efficient methods for network inference. In this article, we focus on a few specific problems involved in gene network analysis and mention other relevant applications in genomics beyond gene networks when appropriate. As another example in biology, we will also review statistical methods for constructing and analyzing brain connectivity networks.

Without claiming to be exhaustive, we will discuss the challenges in these networks considering the type and quality of data available, relevant biological questions to be addressed, and statistical and computational concerns. For gene networks, we will primarily focus on the use of gene expression data measured by RNA-seq or more traditional microarrays. We will also discuss RNA sequencing data obtained at the single-cell level, known as single-cell RNA-seq (scRNA-seq). For both gene and brain networks, we highlight the success and limitations of current network modeling paradigms and statistical methodologies, and propose possible directions for future development.

2. GENE NETWORKS

Gene regulatory networks play a fundamental role in defining cell structure and function. In such a network, transcription factors (TFs), RNA and other small molecules act as regulators to activate or repress the expression levels of genes,

Despite all being the focus of studies in the network literature, biological networks such as gene networks differ from social networks in a few important aspects, which give rise to challenging situations for statistical modeling. Compared to relationship networks obtained from popular social media, gene networks are typically smaller in size. The former can additionally grow in size by including more users, whereas the size of gene networks is limited by the number of genes that exist in an organism and can be measured in an experiment. As will be explained in detail in Section 2.1, edges in gene networks need to be estimated from covariates. Since the measurements of these covariates rely on specific technologies, the number of samples one can take is often restricted by cost considerations and other practical constraints. Finally, since edges in these networks represent interactions between nodes, they are directly affected by the underlying dynamics in gene regulation. These biological processes are complicated in nature; gene regulatory mechanisms depend on tissue type and cellular environment, and their activities can be changed by disease state. All of these factors can give rise to challenging situations for estimating and interpreting network structures. In the following sections, we will review existing approaches in the relevant literature with these limitations on biological data in mind.

2.1 Inferring Gene–Gene Relationships Using Expression Data

In the past two decades, estimating gene–gene interactions has primarily relied on gene expression data, which have been made readily available in the form of microarray or RNA-seq data. Coexpression is one of the earliest concepts proposed to infer edges in a gene network and is based on the concept of “guilt by association”: genes that have similar expression profiles under different experimental conditions are likely to be coregulated, and hence functionally related. However, despite the extensive literature, many open questions remain due to the complex nature of gene interactions: in a broader sense, these coexpression relationships can be nonlinear, transient and subject to changes depending on the cellular environment. In this section, without claiming to be exhaustive, we discuss a few main approaches for inferring gene networks that