Inferring Microbial Interaction Networks Based on Consensus Similarity Network Fusion
Total Page:16
File Type:pdf, Size:1020Kb
SCIENCE CHINA Life Sciences THEMATIC ISSUE: Computational life sciences November 2014 Vol.57 No.11: 1115–1120 • RESEARCH PAPER • doi: 10.1007/s11427-014-4735-x Inferring microbial interaction networks based on consensus similarity network fusion JIANG XingPeng1,2 & HU XiaoHua2* 1College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, USA; 2School of Computer, Central China Normal University, Wuhan 430079, China Received May 15, 2014; accepted July 21, 2014; published online October 17, 2014 With the rapid accumulation of high-throughput metagenomic sequencing data, it is possible to infer microbial species rela- tions in a microbial community systematically. In recent years, some approaches have been proposed for identifying microbial interaction network. These methods often focus on one dataset without considering the advantage of data integration. In this study, we propose to use a similarity network fusion (SNF) method to infer microbial relations. The SNF efficiently integrates the similarities of species derived from different datasets by a cross-network diffusion process. We also introduce consensus k-nearest neighborhood (Ck-NN) method instead of k-NN in the original SNF (we call the approach CSNF). The final network represents the augmented species relationships with aggregated evidence from various datasets, taking advantage of comple- mentarity in the data. We apply the method on genus profiles derived from three microbiome datasets and we find that CSNF can discover the modular structure of microbial interaction network which cannot be identified by analyzing a single dataset. species interaction, metagenome, diffusion process, biological network, modularity Citation: Jiang XP, Hu XH. Inferring microbial interaction networks based on consensus similarity network fusion. Sci China Life Sci, 2014, 57: 1115–1120, doi: 10.1007/s11427-014-4735-x The interaction among species is important to understand bility [5]. Chaffron et al. [6] used Fisher’s exact test with ecological system function. In macroscopic ecology, the false discovery rate (FDR) correction of P-value to build a structure of food web has been shown to play a pivotal role large-scale network in a scale of microbial community in the evolution of the environment [1]. However, the based on co-occurrence patterns of microorganism. Many structure of inter-species network in microscopic ecology is kinds of microorganisms associated in network are also hard to decipher due to the complexity of microbial com- evolutionarily similar. They also found a large number of munity [2]. Until the last decade, the development of evolutionarily distant species which are connected in the high-throughput metagenomic sequencing [3] and 16S se- network. Zupancic et al. [7] used Spearman correlation co- quencing technologies [4] allows the inference of large- efficient to construct a network of relationships among in- scale interactions among microbial species [5]. testinal bacteria from Amish people and used regression Computational methods were based on the co-occurrence analysis to investigate the relationships between three bac- pattern or correlations of microbial classifications (taxa terial networks (corresponding to the three enterotypes) and group in different taxonomic levels, such as species, genus metabolic phenotype, they found 22 bacterial species and and phylum). Similarity or regression based methods are the four operational taxonomic units (OUTs) related to meta- most commonly used because of their simplicity and feasi- bolic syndrome. Faust et al. [8] analyzed the data from the first stage of Human Microbiome Project—16S rRNA se- *Corresponding author (email: [email protected]) © The Author(s) 2014. This article is published with open access at link.springer.com life.scichina.com link.springer.com 1116 Jiang XP, et al. Sci China Life Sci November (2014) Vol.57 No.11 quence data of 239 individuals with 18 body positions (to- loaded from NCBI database (http://www-ab2.informatik. tally 4302 samples) to build a global interaction network of uni-tuebingen.de/megan/taxonomy/microbialattributes.zip) microorganisms. Friedman et al. [9] systematically investi- and used for network analysis: one categorizes whether a gated the species (or functional) diversity effect on the bacteria likes oxygen or not (aerobism), and the other cate- compositional data and found that the composition effect gorizes if the bacteria has motility or not (motility). Because can be ignored, and vice versa when the data has a very low a genus contains many species, we use the most representa- diversity of species. They proposed a new computational tive microbial attributes for each genus. Actually the micro- framework (sparse correlation for compositional data, bial attributes data contains other microbial features, we SparCC) to overcome the compositional bias in correlation tried some of other attributes in our experiments and we analysis of microbial co-occurrence data. Tong et al. [10] found that motility and aerobism are two features which investigated 179 endoscopic lavage samples from different have close connection to network modularity (see Results). intestinal regions in 64 subjects analysis of weighted co-occurrence network reveal microbial modules. 1.2 Similarity network fusion (SNF) These studies indicate that the inference of microbial in- teraction network is important to understand the structure of Firstly we construct networks of 69 genera for each of the a microbial community and hence, its functions and princi- three available data types and then we efficiently fuse these ples adapting its complex inhabit environments. Some other into one network that represents the full spectrum of under- researchers started to infer species pairwise interactions lying data [16,18]. such as competitive and cooperative interactions leveraging Suppose we have n genera and m measurements (in this to a large-scale microbiome data including high-throughput paper, they are Sanger sequencing and two high-throughput metagenomes and microbial genomes data [11,12]. These sequencing techniques—pyrosequencing and Illumina- computational efforts facilitated the discovery of previously based sequencing). A genera similarity network is repre- unknown principles of species interaction network, and ver- sented by a graph G=(V, E). The vertices V correspond to ified the consistency and resolved the contradiction of the the microbial genera, and the edges E are weighted by how application of macroscopically ecological theory in micro- similar the genera are. A weight matrix W is used to repre- scopical ecology [13]. sent all edges, with W indicating the similarity between However, different high-throughput measurements can ij only provide part of information of a microbial community genus i and j. W is often derived by a Gaussian heat kernel [14], thus the microbial networks inferred from different function [23] microbiome datasets are far from complete [3]. Furthermore, d 2 the integration of microbiome datasets is difficult due to the W expij , (1) ij noisy, heterogeneous, distributed and dynamic properties of ij microbiome data sources [15]. In this study, we focus on the co-occurrence data and we propose to use a similarity net- where is a hyperparameter that can be empirically set, dij work fusion (SNF) [16] approach to integrate the similari- is the Euclidean distance between i and j, and ij is used to ties among microbes which are from different datasets. We eliminate the scaling problem by the following definition: modify the SNF approach with a consensus k-nearest mean d i,, Nij mean d j N dij neighborhoods (k-NNs) [17] method instead of using k-NN , (2) [16,18] directly and we find that the modification improves ij 3 the performance of SNF. We validate the method on simu- where mean(d(i, N )) is the average value of the distances lation data and on the integration of three human gut micro- i biome datasets. between i and its neighbors. After defining W, a normalized weight matrix P could be obtained as follows: 1 Methods Wij , ji , 1.1 Dataset 2 W ki ik Pij (3) We use three datasets from the enterotype study by Arumu- 1 , ji . gam et al. [19,20]. The three datasets are from 33 Sanger 2 sequenced gut samples, 16S pyrosequencing data from 154 American individuals [21] and Illumina-based meta- The normalization is free of the scale of self-similarity in genomics data from 85 Danish individuals [22], respectively. the diagonal entries and avoids numerical instabilities [16]. The sequences from the three datasets are summarized at To define a kernel matrix which could be used to meas- the genus level and we use 69 genera which are present ure local affinity, Wang et al. [16] used k nearest neighbors across three datasets. Two microbial attributes are down- (k-NN) method: Jiang XP, et al. Sci China Life Sci November (2014) Vol.57 No.11 1117 1.3 Modularity analysis Wij , jN i , W Sij kN ik (4) We cluster the nodes in the microbial network by spectral i clustering to see if they tend to organize in modular [24]. 0, otherwise. Let L be the normalized Laplacian matrix of a weight graph The k-NN method filters out those low-similarity edges and P and L=ID1/2PD1/2, the spectral clustering aims to min- only keeps those k-nearest neighbors for vertices. imize the objective function as follows, Let P() and S() be the input similarity matrices from the min nk Trace Z LZ , s.t. ZZ I. dataset The SNF process is to iteratively update similarity QR matrix corresponding to each data type as follows: We compute the modularity Q which was defined by New- man et al. [25] following the spectral clustering procedure. Pk PS()v vv kv () S T, v 1,2,, m. (5) For a given partition of a network, Q is defined as m 1 1 kkij QPij c c , (8) ()v ij This procedure updates the status matrices P each time 22LLij and generates m parallel interchanging diffusion processes where k is the sum of all edge weights of one vertex i and L on m networks.