Downloaded from Rnaseq CEU60/ and All the Source Codes for the Current Study Are Available from the Corresponding Author

Choi et al. BMC Genetics (2017) 18:93
DOI 10.1186/s12863-017-0561-z

METHODOLOGY ARTICLE, Open Access

Network analysis for count data with excess zeros

Hosik Choi1, Jungsoo Gim2, Sungho Won3, You Jin Kim4, Sunghoon Kwon5 and Changyi Park6*

Abstract

Background: Undirected graphical models or Markov random fields have been a popular class of models for representing conditional dependence relationships between nodes. In particular, Markov networks help us to understand complex interactions between genes in biological processes of a cell. Local Poisson models seem to be promising in modeling positive as well as negative dependencies for count data. Furthermore, when zero counts are more frequent than expected, excess zeros should be considered in the model.

Methods: We present a penalized Poisson graphical model for zero inflated count data and derive an expectation-maximization (EM) algorithm built on coordinate descent. Our method is shown to be effective through simulated and real data analysis.

Results: Results from the simulated data indicate that our method outperforms the local Poisson graphical model in the presence of excess zeros. In an application to RNA sequencing data, we also investigate the gender effect by comparing the estimated networks according to different genders. Our method may help us in identifying biological pathways linked to sex hormone regulation and thus understanding underlying mechanisms of the gender differences.

Conclusions: We have presented a penalized version of zero inflated spatial Poisson regression and derived an efficient EM algorithm built on coordinate descent. We discuss possible improvements of our method as well as potential research directions associated with our findings from the RNA sequencing data.

Keywords: Count data, EM algorithm, Network, Zero inflation

*Correspondence: [email protected]
6Department of Statistics, University of Seoul, 02504, Seoul, Korea
Full list of author information is available at the end of the article

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Graphical models help us to explore relationships between nodes in graphs. Undirected graphical models or Markov random fields have been a popular class of models for representing conditional dependence relationships between nodes. Examples include Gaussian graphical models for continuous data, the Ising model for binary data, and multinomial graphical models. These Markov networks help us to understand complex interactions between genes in biological processes of a cell and have been well studied in bioinformatics. Examples of Markov networks in learning the network structure from microarray and next generation sequencing data include [1–4]. For more details on Markov network inference, see those and the references therein.

The main focus of this study is to infer the network structure for count data. The auto-Poisson model in [5] is a natural extension of the univariate Poisson distribution. However, it can model only negative dependencies, so that the conditional distributions define a unique joint distribution consistently. Yang et al. [6] propose variants of the auto-Poisson model such as truncated, quadratic, and sublinear Poisson graphical models (PGM). However, none of them provides a satisfactory answer to the question of how to specify a consistent joint graphical model for count data capturing both positive and negative dependencies. Allen and Liu [4] consider a local PGM (LPGM). The LPGM does not have a consistent joint graphical model, but it has the local Markov property, and thus a zero coefficient for the edge weight between two nodes implies the conditional independence of the two nodes given the others. Žitnik and Zupan [7] consider a latent factor Poisson model, and [8] propose to learn conditional dependence structures for binary and Poisson data via marginal loss functions. Also, a semiparametric Gaussian copula, called the nonparanormal graphical model (NPGM), has been proposed [9].

In practice, zero counts are sometimes more frequent than expected under a univariate Poisson distribution. In such cases, a zero-inflated Poisson (ZIP) distribution is often adopted. Applications of ZIP models include modeling of defects in quality control [10] and alcoholism and substance abuse in medicine [11]. Extensions of the ZIP model in different frameworks are well studied in the literature. Dobbie and Welsh [12] extend the two component approach in [13] to serially correlated count data exhibiting extra zeros. Monod [14] develops a zero-inflated spatial Poisson (ZISP) model. Buu et al. [11] study variable selection methods such as LASSO and one-step SCAD for ZIP regression models. For computation, a local linear approximation (LLA) is adopted. The LLA algorithm fails to converge, particularly with small sample sizes, because it requires fitting unpenalized ZIP regression models. Wang et al. [15] propose an expectation maximization (EM) algorithm [16] for a penalized ZIP regression model built on coordinate descent algorithms. The EM algorithm seems to have some advantages over the LLA algorithm in numerical convergence and tuning.

In this paper, we are interested in the construction of graphical models for count data, particularly with excessive zeros. To this end, we propose a penalized version of the ZISP model in [14], called the zero inflated local Poisson graphical model (ZILPGM), and derive an EM algorithm built on coordinate descent as in [15]. We show the effectiveness of our method on simulated and real data. In an application to RNA sequencing data, we investigate the gender effect by comparing the estimated networks according to different genders. It has been well noted that gender is one of the major contributors to the differentiation of gene expression profiles [17, 18] and to various sexually dimorphic phenotypes, most of which result from hormonal differences [19]. It was reported that transcriptome studies could represent a promising approach for the identification of biological pathways linked to sex hormone regulation and the analysis of associated gene regulatory networks [20]. However, the elucidation of the underlying mechanisms of the gender differences is still an area of interest and intense investigation.

The paper is organized as follows. In the "Methods" section, we propose a new graph learning method based on ZISP and provide an efficient EM type numerical algorithm. In the "Results" section, we compare the performance of our method with that of LPGM on simulated and real data sets. Some discussions and concluding remarks are given in the "Conclusions" section.

Methods

In this section, we present our graph learning method based on a penalization of the ZISP in [14] and derive an efficient EM algorithm for its computation.

Zero inflated local Poisson graphical model

Let $N$ denote the number of observations and $p$ denote the number of variables or nodes. Denote $G = (V, E)$, where $V = \{1, \ldots, p\}$ is the set of vertices or nodes and $E$ is the set of edges. We use uppercase letters such as $X$ and $Z$ when we refer to random variables. Observations are written in lowercase. For example, $x_i$ denotes the $i$th observation of $X$. Vectors and matrices are represented by boldface and blackboard boldface letters, respectively. Define $\mathbb{X} = (x_{ij})_{N \times p}$, where $x_{ij}$ is generated from two latent components with zero and Poisson states. Let $z_{ij}$ be a latent variable such that $z_{ij} = 1$ if $x_{ij}$ is from the zero state and $z_{ij} = 0$ if $x_{ij}$ is from the Poisson state. $z_{ij}$ follows a Bernoulli distribution with parameter $\pi_j$. Let $I(\cdot)$ denote an indicator function. Then the ZISP model in [14] is defined by

$$P(X_j = x_j \mid X_k = x_k,\ k \neq j) = \pi_j I(x_j = 0) + (1 - \pi_j)\,\frac{e^{-\mu_j}\mu_j^{x_j}}{x_j!}, \tag{1}$$

where $\mu_j = \exp\bigl(\beta_j + \sum_{k \neq j} \beta_{jk} x_k\bigr)$, $\beta_j$ is an intercept adjusting for $X_j$, and $\beta_{jk}$ is the parameter accounting for the conditional relation between $X_j$ and $X_k$.

Due to the zero inflation term in the conditional probability, the situation becomes more complicated in our case than in LPGM. Because the important part is the pairwise interaction term in the pairwise-only dependency models, the situation is basically similar. In order to have a valid joint distribution, the coefficient for the interaction term $\beta_{jk}$ should be non-positive. As in the LPGM, we do not solve the issue of negative parameters in the Poisson graphical model. Note that none of the existing approaches (e.g. in [6]) succeeds in giving a satisfactory answer to the consistency issue. Rather, we focus not on the consistency issue but on the practical issue of estimating positive as well as negative dependencies, as in LPGM.

In order to learn graph structures, we consider the minimization of the penalized pseudo log-likelihood of (1) in the general weighted LASSO form:

$$-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{p} \log\left[\pi_j I(x_{ij} = 0) + (1 - \pi_j)\,\frac{e^{-\mu_{ij}}\mu_{ij}^{x_{ij}}}{x_{ij}!}\right] + \lambda \sum_{j=1}^{p}\sum_{k \neq j} w_{jk}\,|\beta_{jk}|, \tag{2}$$

where $\mu_{ij} = \exp\bigl(\beta_j + \sum_{k \neq j} \beta_{jk} x_{ik}\bigr)$, $\lambda \geq 0$ is the penalty parameter, and $w_{jk} \geq 0$ is an appropriate weight. As in [4], we can select the tuning parameter using the stability selection criterion in [21]. More specifically, the optimal $\lambda$ is selected from 30 equally spaced grid points in log scale on $[\lambda_{\min}, \lambda_{\max}]$, where $\lambda_{\max} = \frac{1}{N}\max_{j \in \{1, \ldots, p\}}\max_{k \neq j}\sum_{i=1}^{N} x_{ik}x_{ij}$ and $\lambda_{\min} = \lambda_{\max} \times 10^{-4}$. For each $j$, we fit Poisson regression using glmnet.

… distribution by introducing a latent variable and derive an EM algorithm.

Define $\boldsymbol{\beta}_{-j} = \bigl(\beta_0, (\beta_k)_{k \neq j}\bigr)^T$ and $\mathbf{x}_{i,-j} = \bigl(1, (x_{ik})_{k \neq j}\bigr)^T$. The log-likelihood function with respect to the complete data can be written as …
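As an illustration of the quantities defined above, the following is a minimal NumPy sketch of the node-conditional likelihood in (1), the penalized objective in (2), and the 30-point log-scale grid for the penalty parameter. It only evaluates these quantities; it does not implement the EM or coordinate descent steps. The function names (`zip_node_nll`, `penalized_objective`, `lambda_grid`) and the array layout (`B[j, k]` holding the coefficient between nodes j and k) are assumptions made for this example and do not come from the article.

```python
import math
import numpy as np


def zip_node_nll(X, j, beta0, beta, pi):
    """Average negative log of the conditional in (1) for node j.

    beta is a length-p coefficient vector; its j-th entry is ignored so that
    the linear predictor only uses the other nodes (k != j).
    """
    X = np.asarray(X, dtype=float)
    xj = X[:, j]
    b = np.array(beta, dtype=float)
    b[j] = 0.0                                  # exclude the node itself
    eta = beta0 + X @ b                         # log(mu_ij)
    log_fact = np.array([math.lgamma(v + 1.0) for v in xj])
    log_pois = -np.exp(eta) + xj * eta - log_fact
    lik = pi * (xj == 0) + (1.0 - pi) * np.exp(log_pois)
    return -np.mean(np.log(lik))


def penalized_objective(X, B0, B, Pi, lam, W=None):
    """Objective (2): summed node-wise losses plus a weighted L1 penalty."""
    X = np.asarray(X, dtype=float)
    B = np.asarray(B, dtype=float)
    p = X.shape[1]
    W = np.ones((p, p)) if W is None else np.asarray(W, dtype=float)
    off_diag = ~np.eye(p, dtype=bool)           # penalize only k != j
    loss = sum(zip_node_nll(X, j, B0[j], B[j], Pi[j]) for j in range(p))
    return loss + lam * np.sum(W[off_diag] * np.abs(B[off_diag]))


def lambda_grid(X, n_grid=30, ratio=1e-4):
    """Log-spaced grid from lambda_max down to lambda_max * ratio.

    Assumes at least one pair of distinct nodes has a positive cross-product,
    so that lambda_max > 0.
    """
    X = np.asarray(X, dtype=float)
    G = (X.T @ X) / X.shape[0]                  # (1/N) sum_i x_ik x_ij
    np.fill_diagonal(G, -np.inf)                # drop the k == j terms
    lam_max = G.max()
    return np.geomspace(lam_max, lam_max * ratio, n_grid)


# Toy usage with a 3 x 2 count matrix (illustrative values only).
X_toy = np.array([[0, 1], [2, 0], [1, 3]])
grid = lambda_grid(X_toy)
obj = penalized_objective(X_toy, B0=[0.0, 0.0], B=np.zeros((2, 2)),
                          Pi=[0.2, 0.2], lam=grid[0])
```

Working with the linear predictor `eta` directly, rather than taking the logarithm of `mu`, avoids an unnecessary exp/log round trip. Setting `Pi` to zero reduces the loss to the ordinary node-wise Poisson regression loss.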
