<<

Global transcriptional regulatory network for robustly connects expression to factor activities

Xin Fanga,1, Anand Sastrya,1, Nathan Miha,b, Donghyuk Kimc, Justin Tana, James T. Yurkovicha,b, Colton J. Lloyda, Ye Gaod, Laurence Yanga,2, and Bernhard O. Palssona,b,e,f,2

aDepartment of Bioengineering, University of California at San Diego, La Jolla, CA 92093; bBioinformatics and Program, University of California at San Diego, La Jolla, CA 92093; cDepartment of Genetic Engineering, Kyung Hee University, Yongin 17104, South Korea; dDivision of Biological Sciences, University of California at San Diego, La Jolla, CA 92093; eNovo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2970 Horsholm, Denmark; and fDepartment of Pediatrics, University of California at San Diego, La Jolla, CA 92093

Edited by Arnold L. Demain, Drew University, Madison, NJ, and approved August 10, 2017 (received for review February 14, 2017) Transcriptional regulatory networks (TRNs) have been studied mulated for over a dozen major TFs, with each study increasing intensely for >25 y. Yet, even for the Escherichia coli TRN— the number of known binding sites of a TF by 74–400% (10–13). probably the best characterized TRN—several questions remain. Here, we used this critical mass of data to perform a rig- Here, we address three questions: (i) How complete is our knowl- orous assessment of the latest high-confidence TRN (hiTRN) edge of the E. coli TRN; (ii) how well can we predict gene expres- of E. coli that is devoid of uncertain regulatory interactions. sion using this TRN; and (iii) how robust is our understand- We further examined this TRN’s ability to explain differential ing of the TRN? First, we reconstructed a high-confidence TRN in response to genetic and environmental per- (hiTRN) consisting of 147 transcription factors (TFs) regulating turbations. The hiTRN was reconstructed by using published 1,538 transcription units (TUs) encoding 1,764 . The 3,797 ChIP data for 15 TFs added to only high-confidence regula- high-confidence regulatory interactions were collected from pub- tory interactions from RegulonDB (7). We assessed our hiTRN lished, validated chromatin immunoprecipitation (ChIP) data and using transcriptomics compendia (14–16) and multiple computa- RegulonDB. For 21 different TF knockouts, up to 63% of the differ- tional approaches: unsupervised and supervised machine learn- entially expressed genes in the hiTRN were traced to the knocked- ing, mutual information (MI) analysis, network topology analy- out TF through regulatory cascades. Second, we trained super- sis, integer programing, and community detection (Fig. 1). vised algorithms to predict the expression of 1,364 TUs given TF activities using 441 samples. The algorithms Results accurately predicted condition-specific expression for 86% (1,174 The Coverage of the TRN Has Expanded, but Remains Incomplete. of 1,364) of the TUs, while 193 TUs (14%) were predicted bet- Our reconstructed hiTRN consisted of 147 TFs, 1,538 tran- ter than random TRNs. Third, we identified 10 regulatory mod- scription units (TUs), and 1,764 genes regulated by 3,797 high- ules whose definitions were robust against changes to the TRN confidence regulatory interactions (Dataset S1). We assessed the or expression compendium. Using surrogate variable analysis, we coverage of the hiTRN for explaining differential gene expres- also identified three unmodeled factors that systematically influ- sion in three ways. enced gene expression. Our computational workflow comprehen- Completeness. To assess hiTRN’s completeness, we computed sively characterizes the predictive capabilities and systems-level what fraction of the 1,764 genes whose expression changed across functions of an ’s TRN from disparate data types. 154 experimental conditions could be directly explained with the hiTRN (see Materials and Methods and SI Appendix, Fig. S1 for transcriptional regulation | transcriptomics | matrix factorization | regression Significance

While the transcriptional regulatory network (TRN) of transcriptional regulatory network (TRN) plays a major Escherichia coli has expanded considerably in recent years Arole in enabling an organism to modulate expression of through new chromatin immunoprecipitation (ChIP) data, an thousands of genes in response to environmental and genetic open question remains: Does the global TRN, reconstructed perturbations (1). Escherichia coli’s TRN is probably the most by combining ChIP data for individual transcriptions factors, extensively studied in any organism. However, the structure of consistently explain observed differential gene expression? even this TRN is still subject to considerable uncertainty, seri- We have reconstructed a high-confidence TRN, determined ously limiting its utility for predicting gene expression or for interpreting disparate datasets. Indeed, over a decade ago, a its consistency with transcriptomics and predictive capabili- combined metabolic and regulatory network model of E. coli ties across multiple conditions, extracted 10 functional regula- could explain only 15% of differential gene expression in tory modules, and characterized this network at the sequence response to the major environmental change of oxygen depri- and structural levels. Our multiomics algorithmic pipeline is vation (2). While much progress has been made since then (3– expected to facilitate rigorous validation and prioritization of 5), predicting global gene expression remains a fundamental experiments to elucidate TRNs in other . challenge (6). A global TRN can consist of regulatory interactions deter- Author contributions: X.F., A.S., L.Y., and B.O.P. designed research; X.F., A.S., N.M., D.K., mined from a variety of data sources. These include direct and L.Y. performed research; X.F., A.S., N.M., D.K., J.T., C.J.L., Y.G., and L.Y. analyzed data; and X.F., A.S., J.T.Y., L.Y., and B.O.P. wrote the paper. and indirect experimental evidence or computational predictions (7). For the latter, reducing false-positive interactions remains The authors declare no conflict of interest. challenging—the state of the art achieves 60% precision (8, 9). In This article is a PNAS Direct Submission. recent years, improved chromatin immunoprecipitation (ChIP) 1X.F. and A.S. contributed equally to this work. methods have enabled precise characterization of transcription 2To whom correspondence may be addressed. Email: [email protected] or lyang@ factor (TF) binding sites. Combining ChIP with transcriptomics eng.ucsd.edu. for TF KO strains has yielded high-confidence regulatory interac- This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. tions in the conditions studied. Such ChIP studies have now accu- 1073/pnas.1702581114/-/DCSupplemental.

10286–10291 | PNAS | September 19, 2017 | vol. 114 | no. 38 www.pnas.org/cgi/doi/10.1073/pnas.1702581114 Downloaded by guest on September 25, 2021 Downloaded by guest on September 25, 2021 necal Esmyidct isn euaoyinterac- regulatory missing indicate may DEGs Unreachable the For overall S4). the Fig. (59–99%). dataset, similar Appendix, COLOMBOS was the (SI consistency from consistency experiments and KO TF TF a path of regulatory longest length the between correlation consistency negative nificant sign (Pearson the data and experimental P TF and regulated a TRN genes by the of indirectly) between number and the directly between (both correlation negative a for low was was consistency as sistency highest such The TFs local 2A). 51–99% for (Fig. observed was TRN genes the expressed with for (+) consistent nondifferentially accounting activation consistency and sign DEGs reflecting total consis- both signs an Overall, “sign (−). as edge repression hiTRN This with or the KO. (17), formulating TF graph by a influence conducted was given analysis DEGs tency” with interaction regulatory consistent each (SI was to assigned TF inhibition) out or (activation knocked the Bias. to Regulatory traced S2). Fig. successfully Appendix, be could that that experiments found TRN four we the identified, (16) For DEGs dataset analysis. have COLOMBOS same the the addi- performed from nine and experiments extracted KO also TF we tional comparison, For poorly. explained ments ol etepanDG nteTN(06% o experi- for (50–63%) We TRN 2B). the (Fig. paths in 0–63% regulatory DEGs that knocked- more the ments explain found or to We best traced one KOs. successfully could through were TF differ- TF TRN double 21 the out investigated or in We DEGs single of gene. of TF sets our knocked-out in ent a paths regulatory to through traced hiTRN be could conditions. the expression of gene 57% in Perturbations. one DEGs Genetic least of more at or for 20% enriched and regulon, were one least considered at genes for 1,764 enriched of were set the in (DEGs) genes differen- expressed of tially 27% average, On analyzed). conditions of explanation gene predict quantitatively to regression. hiTRN and our MI of using ability expression con- the mod- network assessed regulatory We of stable (D) basis of ules. the detection on and hiTRN reduction, our dimensionality sistency, of our knowledge Using our of (B) completeness [EcoMAC hiTRN. compendia the expression in reconstruct (14), shifts to analyzed combined we were hiTRN, data ChIP lished 1. Fig. age al. et Fang D A RegulonDB 2 = o l Escudb rcdt h eee F(i.2B). (Fig. TF deleted the to traced be could DEGs all Not Mutual Information H(X) Machine Learning .coli E. H(X|Y)

.10 Predicted ∆ ∆ euoD 7 n diinlpub- additional and (7) RegulonDB (A) workflow. our of Overview Supervised Expressison True Expression narP Analysis arcA H(X,Y) × I(X;Y) xrsin2(5,adCLMO 1).(C (16)]. COLOMBOS and (15), 2 Expression ChIP Binding 10 H(Y|X) ∆oxyR ∆fnr −7 Data ete vlae hte h euaoybias regulatory the whether evaluated then We .W lofudasig- a found also We S3). Fig. Appendix , (SI ) H(Y) and arcA/fnr and ete sesdwehrdifferential whether assessed then We C B Gene Expression ∆ ∆ hiTRN purR Consistency Condition soxS Consistency Differential Network ne ohcniin.W found We conditions. both under Conditions narL, Gene ∆purR wt dnn) hl experi- while adenine), (with compendium Expression narP, 6 fteDG nthe in DEGs the of ∼56% wtotaeie were adenine) (without and dnaA, eeautdthe evaluated We ) Core regulatory Dimensionality Enrichment Metagene Variance Modules Reduction Regulon # ofDimensions r Explained Con- purR. = −0.875, a infiat(emtto test, (permutation which significant 26%, was of coverage average an with conditions, available). experimental were metadata these 2 where from wild-type, Additionally, ranged the rates of rate- (11)]. 37% (growth growth to 18) binding to (5, due regulation PurR been global modeled have dependent on explicitly may expression fac- adenine be differential to additional some of need that effect bias suggests [e.g., regulatory consistency influencing sign tors low while tions, i.S5). Fig. Appendix, (SI data (88%) expression explained the sufficiently in variance dimensions of 40 that (PCA) analysis genes genes 4,189 from of correlated dimensionality data are the expression reduced changes the we expression NMF, Using whose com- conditions. linear genes across a the is metagene of A bination (19–21). modes by major which set metagenes, the dozen represent data several into expression genes of complex thousands reducing from subsystems iden- cohesive NMF tifies conditions. across changes important transcriptome (i.e., of they modes features) major because identify to part (19) matrix (NMF) nonnegative in factorization used thus interpret, We noisy. to and high-dimensional difficult are are genes data Entire 4,189 scriptomics the of in consisting Changes data of scriptomics Modes Major with Transcriptome. Consistent Is hiTRN The bias. regulatory including and topology , TRN individual likely beyond mechanisms will of coverage reconstruction DEG of precise nature 100% require interwoven achieving individ- highly Thus, for the coverage regulons. reflects DEG many experiments low KO The 63%. TF to ual up and Methods) SI ec fcniuu euaoyptsi h R)fo eee F oDEGs to TFs deleted from TRN) TRN. the Consistency the (exis- in in (A) paths Reachability regulatory (B) gene. contiguous consistency). expression TF of sign gene tence (i.e., deleted nondifferential bias a and regulatory differential for with observed accounting strains with TRN in our and of cells wild-type in 2. Fig. Number of DEGs B A Consistency (%) across varied DEGs for hiTRN our of coverage the Overall, 250 300 narL+ Nitrate 150 200 100 narL+ Nitrate 100 50

narL/narP+Nitrate 20 40 60 80 0 narL/narP+Nitrate 0 × DEG unreachable from TF ossec fhTNwt bevddfeeta eeexpression gene differential observed with hiTRN of Consistency

narP+ Nitrate narP+ Nitrate component principal using confirmed We samples. 441

narL+ NO narL/narP+ NO narL+ NO PNAS narL/narP+ NO eeautdorhTNi h otx ftran- of context the in hiTRN our evaluated We

narP+ NO narP+ NO | Anaerobic etme 9 2017 19, September arcA arcA OxyR+Fumarate fnr OxyR+FumaratearcA /fnrfnr OxyR+Nitrate OxyR+Nitrate

arcA P

arcA 3 = ×

fnr Aerobic DEG reachable from TF | arcA /fnr . 4 ape o4 meta- 40 to samples 441 arcA /fnrfnr 91 o.114 vol.

OxyR × OxyR × 4 ape.Tran- samples. 441 SoxS 10 SoxS

crp+nor −4

| crp+nor dnaA+nor ;

o 38 no. dnaA+nor IAppendix, SI fis+nor fis+nor purR+adenine

| purR purR+adeninepurR 10287

SYSTEMS BIOLOGY We then characterized the relationship between regulons and also relatively consistent when we used a different expression metagenes. To do so, we identified regulons that were enriched compendium, COLOMBOS (16). The normalized VI for the for genes that were determined to be major contributors within modules identified between the two compendia was 0.17, which each metagene (i.e., genes that had large coefficients within a was significant (permutation test, P < 10−4). metagene). We found that all metagenes were enriched for at least one regulon (Fig. 3). Regulatory Modules Identified Have Broad Implications. Next, we Furthermore, metagenes tended to be enriched for hiTRN- examined evolutionary characteristics of the regulatory modules regulons that shared related functions (Fig. 3). For example, at the DNA sequence and structure levels. some stress response TFs (rcsB, gadE, gadX, and gadW) were Evolutionary Conservation. We hypothesized that modules asso- enriched simultaneously for several metagenes. This result is ciated with vital functions would be conserved across species, consistent with the hiTRN structure, which causally links these while organism-specific responses would not. We thus computed TFs: rcsB → gadX → gadW → gadE (7, 22). Overlapping reg- conservation of the 147 TFs in our hiTRN across Enterobac- ulons were also coenriched in the same metagene including narL teriaceae and γ-, β-, α-, and δ-proteobacteria (SI Appendix, SI and narP, or nrdR and dnaA. Methods). We found two conserved modules (modules 7 and These results demonstrate coverage and coherency in the 10) involved with motility, metal ion uptake, and DNA damage hiTRN and consistency with high-dimensional transcriptomics response (SI Appendix, Fig. S8). We also found one significantly data, albeit at a more coarse-grained level than the reachability less-conserved module (module 1), primarily involved with vari- and regulatory bias described above. ous stress responses, including acid stress. This result was consis- tent with availability of alternative pH stress response systems or Robust Regulatory Modules Were Identified. Next, we evaluated alternative regulators of conserved proton consumption or gen- whether the organization of TFs in our hiTRN was consistent eration genes (22). with the functional organization of regulons as derived from the Binding Motifs. We investigated the sequence of DNA transcriptome. We thus developed a computational pipeline to binding motifs of the 147 TFs. We found that the TFs in modules identify clusters of TFs (or modules) that were significantly and 1, 2, 3, 8, and 9 shared more similar binding motifs within the strongly coenriched across conditions (SI Appendix, SI Methods). module compared with those in other modules (Mann–Whitney– We identified 10 coenriched modules (Fig. 4 and Dataset S2). Six Wilcoxon test P < 0.05). modules in particular represented core biological functions. For Protein Structure. We explored the structural similarity between example, module 6 included TFs and toxin–antitoxin pairs associ- TFs in each module to identify whether a structure–function ated with multiple stress responses. Since the E. coli TRN is still relationship existed within the modules. We aligned annotated expanding, we evaluated whether these modules would remain DNA-binding domains and the full structures of TF pairs using stable as new regulatory interactions are incorporated into the the FATCAT structural alignment tool (24). The TFs in mod- TRN. We randomly added up to 60 regulons and used our ules 4, 5, and 8 showed significantly higher structural similarity in pipeline to identify new modules, which were compared against both cases to TFs within the module compared with TFs in other the original modules. The average Jaccard index (SI Appendix, modules. Specifically, module 8 was enriched for the periplasmic SI Methods) between modules ranged from 0.34 to 0.81, and nor- binding protein-like I domain (25). malized variation of information (VI) (23) ranged from 0.025 to 0.34 (SI Appendix, Fig. S7). Module stability depended on which The Expression of Most TUs Can Be Predicted Quantitatively from regulons were added, since even when 34 new TFs and 1,290 TF Expression. We next asked whether gene expression could regulatory interactions were added, the clusters could be stable be quantitatively predicted as a function of TF expression lev- (Jaccard index = 0.55, normalized VI = 0.14). The modules were els across varying conditions and genetic perturbations. Eight

Transcription factors

M

e

t

a

g

e

n

e

s

Fig. 3. Regulon enrichment and functions of metagenes. Colors indicate functions of metagenes 1. Acid and drug resistance 3. Energy metabolism 5. Inorganic ion utilization based on enriched regulons. Functionally related Oxidative stress response Anaerobic growth Anaerobic growth regulons are enriched in the same metagenes. structure Amino acid metabolism Note that only TFs in modules 1, 3, and 5 are shown Biofilm formation Iron transport here. A full heatmap can be found in SI Appendix, Motility system Fig. S6.

10288 | www.pnas.org/cgi/doi/10.1073/pnas.1702581114 Fang et al. Downloaded by guest on September 25, 2021 Downloaded by guest on September 25, 2021 age al. et Fang evdtehTNsdsrbto ftenme fTsregulating TFs of number the of for distribution hiTRN’s TRNs the random served than (FDR-adjusted SVR better best significantly the predicted our (14%) were TUs using 1,364 that of accurately 193 identified more We TRNs. random predicted than Topology. hiTRN were TRN TUs to certain Expression whether TU of Sensitivity SVR the for regression original the (FDR-adjusted and expres- regression shuffled of profile the 1,364) between sion of TF differences (1,174 significant the 86% yielded of of pro- TUs total order captured A expression the (features). predictions maintaining profiles TU expression while our the outputs) whether (predicted shuffling files tested by effects also pro- Appendix, condition-specific we S11), (SI and model, results S10 this Figs. similar interac- Appendix , highly strong (SI viding only RegulonDB using on from by and tions EcoMAC S9) on Fig. and performed Appendix, COLOMBOS was (SI both hiTRN regression the The using increase. by power COLOMBOS should predictive the TRN TRN, the the of of structure the about more cover h riigdt eecreae ihtenme fknown of number the with correlated (Pearson were TFs data training the (R model determination (P the of data ficient of testing the power to predictive applied the when improved data (P significantly training model did SVR the linear only fit best-performing factors Not the sigma than Gaus- factors. better with and sigma SVR without linear kernel Gaussian and with the with SVRs both trained kernels, next sian we interactions, tory of influence Interactions. strong Nonlinear expression. the TU TUs highlighting on of factors factors, 1,093) sigma of sigma (999 91% known of with (FDR) fit the improved rate significantly under alone discovery models intercept-only false with of of compared a test models (1,045 the F 77% of improved an significantly fit in TFs the Using models, that bilinear 5A). determined TU-specific of (Fig. we 1,364) and case (31), interac- (30) each significance bilinear regressors for overall for as TFs allowed factors between that EcoMAC. tions term sigma in cooperativity and measured a of TFs were included total both TUs A tested identified Methods). 1,538 We by and the grouped (Materials of genes (7) 1,364 of RegulonDB expression from average TUs the predict to 26–29). structures gression (14, Regression. literature model the Linear in The Multivariate described (SVRs). those to best regressors similar the were vector and identify models support regression to four linear multiple explored four were procedure: modeling structures model potential in defined fully are Modules genes). shown. regulated not of S2 is number Datasets regulons the (i.e., between regulons overlap of The size the to proportional 4. Fig. ihfA;ihfB lrp arcA fnr e ucinlrgltr oue o 4 F.Sz frcage are rectangles of Size TFs. 147 for modules regulatory functional Ten and S3 fur R . P fis 0 = < crp 0.05). .41, cra obte con o olna regula- nonlinear for account better To P 2 .Tu,a edis- we as Thus, 5B). (Fig. .001) < hns argR fteSRwt im atr on factors sigma with SVR the of ) P < < eue utvraelna re- linear multivariate used We 0. pRsoxS cpxR 0. oee,sgafactors sigma However, 05. .Terno Rspre- TRNs random The 05). .Tecoef- The 5A). (Fig. .001) < .Using Methods). SI Nucleotidemetabolism Amino acidmetabolism 8. Carbonmetabolism 7. Motilitysystem Nucleotidemetabolism Toxin-antitoxin system 6. Superoxideresponse Anaerobic growth 5. Inorganicionutilization Irontransport Amino acidmetabolism Anaerobic growth 3. Energymetabolism Motilitysystem Biofilmformation Chromosomestructure Oxidativestressresponse 1. Acid anddrugresistance Not co-enriched enx evaluated next We ,btthe but .001), < 2,4,9,10: Unassigned .As,Tssaighg Iwith MI high sharing TFs Also, S12). Fig. Appendix , (SI TU a fteestsi otx-pcfi anr hs designing Thus, manner. context-specific a in occupancy the sites knowing high-resolution these and sites having of binding from for TF resulted TRN for progress well-defined measurements a This having TUs. of these signifi- importance predicted critical the was (FDR-adjusted cating expression TRNs random whose than better (14%) better cantly TUs significantly 193 TUs (FDR-adjusted of found profiles 1,364) Are expression TUs of shuffled of (1,174 than 14% 86% only for but expression Regulators. TUs, Direct of Their 86% to Linked for Unambiguously Expression Predict Can We biological other (34). data in rate variation growth single including systematic variables a processes, by Unmodeled explained to condition. be media also corresponded may distinct which sur- a such having of three source one identified systemati- variables, We that expression. rogate variables gene unmodeled influenced to cally nondifferentially related and potentially expression differential DEGs was unexplained the of of Some (17), 51–99% genes. activation) expressed explained or inhibition hiTRN (i.e., our (2), network bias for regulatory to 2004 accounting 70 and when in from topology that ranging found assessment coverage we knocked-out in Additionally, earlier result- increase 320%. the an an the with represents to in result Compared of back our hiTRN. reported average) traced our coverage in on were 15% cascades TRN (26% regulatory through the 63% TF in to DEGs up ing KOs, Differential TF Activation. Between ferent TF Connections and Causal Expression Many Gene Explains hiTRN The expression. gene predict n asi u nweg fthe of scope the knowledge concerning our questions in three gaps answered and we study, this In Discussion S19) and S18 Figs. the analysis. Appendix, in three PCA (SI data in of components of Presence principal clustering two glycerol). with first consistent not plus substrate was (LB a variables sources of surrogate use other the in to related present potentially from source, stemmed data variation ). S17 this unaccounted Fig. Appendix , that (SI imply laboratory were variables single These which a from variables, expression data model surrogate primary in enriched the best three predict the identified expres- to We using from factors) signatures. EcoMAC, sigma directly to with SVA SVR variables (i.e., applied unmodeled We such data. determines sion of differential (33) (SVA) effect observed analysis reconciling the variable for Surrogate difficulties expression. pro- hiTRN pose the cellular in may represented unmodeled not pertain- and to are effects variables procedure systemic batch Such experimental cesses. from the expression, to gene ing Expression. affect Gene can Influence factors Processes Cellular Other reflects regulatory organ- high-confidence other for result expres- need in interactions. the This encountered from reinforces also and binding. TRN is (26) issue isms the TF This of (32). of profiles even could nonidentifiability sion evidence TFs, with TUs alternate no issues most many known was of of there expression expression the Figs. if the Appendix, from (SI Therefore, predicted TFs be other S16). with and MI high S15 shared mod- TFs regulatory many a of share not , Appendix the did (SI genes that than ule genes higher between between significantly MI MI of was average average module 137) regulatory the of each (FDR However, within regulons (39 exact S14). inside their 28% Fig. genes the outside only Appendix, with to genes MI that with insensitive higher found compared significantly apparently we shared was TFs MI, measured TUs on of Based TRN. 86% of S13). sion Fig. Appendix , (SI Analysis. excluded were MI TU a of regulators known enx netgtdwypeitn h expres- the predicting why investigated next We PNAS .Frhroe xrsinprofiles expression Furthermore, Methods). SI | etme 9 2017 19, September .coli E. efudta cos2 dif- 21 across that found We | R n u blt to ability our and TRN o.114 vol. ecudpredict could We P | P o 38 no. < Unmeasured < < ,indi- 0.05), .We 0.05). (SI 0.05) | 10289

SYSTEMS BIOLOGY Training Accuracy Testing Accuracy A 1.0 0.8 0.6 0.4 0.2 0.0 -0.2 -0.4

B 0.7 0.6 0.5 0.4 0.3 Fig. 5. Accuracy of expression predictions on training and held-out testing transcription units. 0.2 (A) R2 (coefficient of determination) of pre- 0.1 dicted expression profile vs. true expression pro- file using various regression models. (B) R2 value 0.0 of the testing dataset predicted by a Gaussian kernel SVR, grouped by number of known TFs. -0.1 0246810Error bars indicate SD for groups with >3 observations.

experiments to define high-confidence regulatory interactions (38), with TF KO-validated ChIP-based interactions for 15 regulons from liter- should be prioritized when characterizing the TRN of an organ- ature: arcA and fnr (10, 39, 40), argR (41, 42), trpR, lrp (42), fur (13), gadEWX ism for which data are scarce. With the advances in ChIP-exo and (22), oxyR, soxRS (43), purR (11), crp (44), and cra (45). The regulatory direc- transcriptomics methods, a much more comprehensive under- tion (+ or −) was preserved from the original study. Both directions were standing of the TRN could be achieved in the near term. Further- added if the direction was uncertain. more, given recent progress in understanding dormant TF–DNA binding events (35, 36), it will be important to investigate diverse Expression Compendium Preparation. Experimental conditions from EcoMAC conditions for expanding the repertoire of high-confidence inter- (14) were filtered to exclude nonrelevant conditions as described in Yang et actions, which involves observing a proximal effect of binding on al. (46), resulting in expression profiles for 4,189 genes × 444 samples. Three gene expression. of these samples (wild-type E. coli MG1655 grown aerobically in M9 medium with glucose) were used as a reference.

We Robustly Understand Global TRN Function Within a Limited Scope. Nonnegative Matrix Factorization. We performed NMF using sklearn with We identified 10 regulatory modules representing core biologi- “nnsvd” initialization (47). The top genes accounting for 15% of each meta- cal functions from two expression compendia using the hiTRN. gene’s weight were used for regulon enrichment. We compared NMF with These modules showed evolutionary conservation at the DNA singular value decomposition to support our choice of 40 metagenes (20) sequence and protein structure levels and overlapped with previ- (SI Appendix, Fig. S20). We also used nonsmooth NMF (nsNMF) to identify ously identified clusters (37) (SI Appendix, SI Methods). Further- sparse metagenes (48) and removed genes from each metagene having coef- more, the modules were consistent when the hiTRN was per- ficients <0.001 (attributable to numerical error). Since NMF solves a noncon- turbed by adding random regulatory interactions from up to 60 vex optimization problem and requires multiple runs to ensure global opti- regulons. mality, we used two methods, by Kim and Tidor (20) and Wu et al. (21), to Together, these results indicate that core TRN functions are confirm that our NMF decomposition was stable (SI Appendix, SI Methods). understood robustly, and gene expression can be predicted. To Metagenes are defined in Dataset S4. grow the scope of the hiTRN, new high-precision ChIP experi- ments en masse directed at unconfirmed TFs or TFs regulating Regulatory Module Identification. We compiled a network of TFs that were uncharacterized genes are expected to greatly enhance the scope coenriched in a metagene, from 100 runs each of NMF and nsNMF (48). We of understanding of E. coli’s TRN and can do so in the near term. kept only 522 TF pairs that were strongly coenriched (Jaccard index > 0.18) Analyzing disparate data by using in silico models is likely to be and significant (permutation test, P < 0.05, from 100,000 random networks sampled from the observed frequency of coenriched TFs). We then identi- important for guiding us through the selection and execution of fied modules using multilevel modularity optimization (49). The modularity the most informative experiments to fill gaps in our understand- coefficient of 0.483 was above the recommended cutoff of 0.3 to indicate ing and to design experiments to test its robustness. community structure by Clauset et al. (50). The functional labels of the mod- ules were assigned by using DAVID (51) functional annotation, followed by Materials and Methods manual curation. High-Confidence Regulatory Network Reconstruction. To reconstruct the high-confidence TRN (hiTRN), we combined strong evidence interactions DEG Identification. DEGs were identified by using the R package limma

from RegulonDB 9.4 (7) according to the RegulonDB Evidence Classification in Bioconductor (52), with thresholds of | log2(Fold change| > 1 and

10290 | www.pnas.org/cgi/doi/10.1073/pnas.1702581114 Fang et al. Downloaded by guest on September 25, 2021 Downloaded by guest on September 25, 2021 0 eeoizS ta.(04 eemnn h oto icir frdxmtbls at metabolism redox of circuitry control the Determining (2014) al. et S, Federowicz 10. 3 e W ta.(04 eihrn u rncitoa euaoyntokhglgt its highlights network regulatory recon- transcriptional fur -scale Deciphering (2014) (2008) al. et BØ SW, Seo Palsson 13. YS, Park EM, Knight CL, Barrett BK, Cho in 12. regulon PurR The (2011) al. et BK, Cho 11. 6 aaa E ta.(03 h yoatru ueclssrgltr ewr and network regulatory tuberculosis mycobacterium The (2013) al. et JE, Galagan 26. factors transcription of determinants Functional (2003) SA Teichmann M, Babu Madan pairs fragment aligned 25. chaining by alignment structure Flexible (2003) A Godzik Y, Ye 24. genome-wide Decoding (2015) BO Meil Palsson 23. R, Szubin EJ, O’Brien D, Kim SW, Seo 22. spa- interpret to factorization matrix nonnegative Stability-driven (2016) al. et S, Wu of reduction 21. dimensionality through identification Subsystem pattern (2003) B molecular Tidor and PM, Metagenes Kim (2004) 20. JP Mesirov tran- TR, by Golub bacteria P, in Tamayo JP, expression Brunet gene of 19. control Shared (2013) al. et S, Berthoumieux 18. inconsis- removing and Detecting (2013) S Klamt LG, for Alexopoulos R, compendia Samaga expression IN, Melas gene Leveraging 17. v3.0: Colombos (2016) al. et the M, and Moretto profiling expression 16. Gene (2009) BO Palsson EM, the Knight reveals BK, model Cho genome-wide NE, Lewis multi-scale, integrative, 15. An (2014) al. et J, Carrera 14. age al. et Fang .W determined cross- We 10-fold Methods). stratified and a (Materials using overfitting known by and reduce TFs, evaluated to of were validation pairs Models all factors. for terms sigma cooperation/competition regulators We known TU, regulator. including each log-fold known features of one predict with least structures at to model having eight regression) TUs compared 1,364 vector of support expression in and change regression linear tiple using Regression. by Profile consistency Expression sign and (49) was reachability Python (17). Network in Matlab experiments. in igraph SigNetTrainer KO using TF 21 by for performed hiTRN the with DEGs Analysis. Consistency Network-Expression DEGs. for at enriched conditions, was 166 regulon these one of least 162 In 166 refer- expression. conditions, differential the these significant to Of condi- showed conditions. relative aerobic experimental profiles 174 under expression to source of corresponded ence samples carbon 441 as resulting glucose The with tions. M9 in grown MG1655 FDR-adjusted .FihJ,e l 20)Lresaempigadvldto of validation and mapping Large-scale inference. (2007) network al. gene et robust JJ, Faith for crowds 9. of Wisdom (2012) al. et D, Marbach 8. .Gm-atoS ta.(05 euod eso .:Hg-ee nerto fgene of integration High-level 9.0: version Regulondb (2015) al. initiation et transcription S, of Gama-Castro regulation global 7. and Local of (2016) expression SJ coordinate Busby DF, metabolites Browning regulatory 6. Few (2017) al. et K, Kochanowski tuberculo- Mycobacterium 5. the manipulating and Mapping (2014) al. et TR, Rustad 4. .Cadaeaa ,PieN 21)Poaiitcitgaiemdln fgenome- of modeling integrative Probabilistic (2010) ND Price S, high- Integrating Chandrasekaran (2004) 3. BO Palsson MJ, Herrgard JL, Reed EM, Knight MW, Covert 2. .Mart 1. h genome-scale. the profiles. expression of compendium a from regulation tional Methods D143. ope oebyn rnmtbls in metabolism iron beyond role complex in network regulatory 105:19462–19467. lrp the of struction Res beyond. and clustering motif coexpression, regulation, bacteria. in in genes metabolic central net- regulatory overexpression-derived factor work. transcription a using transcriptome sis hypoxia. in twists. allowing Machines Kernel and ory in responses stress acid cellular to multifaceted reveals networks regulatory GadEWX-transcriptional networks. gene 4295. local build and expression gene tial data. expression gene large-scale factorization. matrix using discovery cell. the of physiology global and factors scription integer using graphs. topologies interaction network on signaling programming and linear data experimental between tencies analyses. cross-species content. of for models silico in genome-scale of use of landscape phenotypic tuberculosis. cl eaoi n euaoyntok in networks regulatory and metabolic scale networks. bacterial elucidates data 92–96. computational and throughput shrci oitasrpinlrgltr network. regulatory transcriptional coli Escherichia shrci coli Escherichia 39:6456–6464. 20)Cmaigcutrnsb h aito finformation. of variation the by clusterings Comparing (2003) M a ˘ nzAtnoA ag C hefyD(08 ucinlognsto of organisation Functional (2008) D Thieffry SC, Janga A, ´ ınez-Antonio eoeBiol Genome Nature 9:796–804. a e Microbiol Rev Nat Bacteriol J P rcNt cdSiUSA Sci Acad Natl Proc < shrci coli Escherichia Bioinformatics 499:178–183. 5 he ape eeue sterfrne wild-type reference: the as used were samples Three 0.05. rti aiisadbnigsites. binding and families Protein : LSGenet PLoS 15:502. uli cd Res Acids Nucleic 191:3437–3444. shrci coli Escherichia Srne,NwYr) p173–187. pp York), New (Springer, shrci coli Escherichia 14:638–650. 10:e1004264. . 19:ii246–ii255. a Commun Nat eue uevsdmcielann (mul- learning machine supervised used We eoeRes Genome 107:17845–17850. rcNt cdSiUSA Sci Acad Natl Proc shrci coli Escherichia 44:D620–D623. . shrci coli Escherichia shrci coli Escherichia o ytBiol Syst Mol . o ytBiol Syst Mol LSCmu Biol Comput PLoS shrci coli Escherichia edtrie h ossec of consistency the determined We 6:7970. 13:1706–1718. shrci coli Escherichia o ytBiol Syst Mol o Biol Mol J rcNt cdSiUSA Sci Acad Natl Proc rnsGenet Trends o nlss rvdn context Providing analysis: for 10:735. . 13:903. -2MG1655. K-12 uli cd Res Acids Nucleic a Commun Nat . shrci coli Escherichia 101:4164–4169. rcNt cdSiUSA Sci Acad Natl Proc LSBiol PLoS 9:e1003204. 381:238–247. 9:634. n Mycobacterium and 19:75–79. 5:e8. erigThe- Learning 5:4910. uli Acids Nucleic Nature 113:4290– transcrip- 44:D133– 429: Nat 2 ici E ta.(05 im oesdfeeta xrsinaaye o RNA- for analyses expression St differential A, powers Kraskov Limma 53. (2015) al. et ME, integrated Ritchie and visualization, 52. annotation, for Database DAVID: (2003) al. et G, Dennis 51. in genome-scale the at complexes ArgR-DNA of architecture The response (2015) bacterial al. The et S, (2013) Cho PJ 41. Kiley R, Landick AZ, Ansari MS, Akhtar DM, Park 40. of analysis Genome-scale (2013) al. et confi- KS, and Myers protocols high-throughput 39. of classification Evidence genome. (2013) regulatory al. microbial et the V, for Weiss model 38. system-level A (2014) al. et AN, Brooks 37. of network DNA-binding The (2015) al. paradigm. et KJ, revolutionary Minch A 36. regulation: genome Prokaryotic (2012) A Ishihama 35. accurately integration Multi- (2016) I Tagkopoulos V, Zorraquino N, Rai M, Kim 34. (2012) AJ Lee GAF, Seber 31. reconstruction Genome-scale (2014) BO Palsson K, Zengler EM, Knight and D, networks Kim genetic BK, Cho Inferring (2003) JJ 30. Collins D, Lorenz D, Bernardo di TS, Gardner 29. H M, transcription Gustafsson global of 28. redesign Model-based (2009) A Jaramillo G, Rodrigo J, Carrera 27. 0 lue ,Nwa E or 20)Fnigcmuiysrcuei eylarge very in structure community Finding (2004) C Moore ME, Newman A, Clauset research. network complex 50. for package software (2006) igraph The (2006) RD T Nepusz Pascual-Marqui G, Csardi D, 49. Lehmann K, Kochi JM, Carazo A, Pascual-Montano python. in 48. learning Machine Scikit-learn: (2011) al. et F, Pedregosa metabolism of 47. core the of definition biology Systems (2015) al. et L, Yang carbon central on 46. regulation transcriptional of assessment Systems (2016) al. et inter- D, holoenzyme Kim RNAP and 45. DNA, Crp, of interrogation Chip-exo (2016) OxyR al. et of H, Latif reconstruction Genome-wide 44. (2015) BO Palsson tran- R, the Szubin Deciphering D, (2012) Kim BO SW, Palsson Seo K, Zengler 43. YS, Park S, Federowicz BK, Cho 42. sur- by studies expression gene in heterogeneity the Capturing of (2007) JD model Storey supported JT, Leek experimentally An 33. (2015) al. et ML, Arrieta-Ortiz 32. ne fteNtoa ntttso elhAad 0G129 and U01GM102098 Novo Sci- Awards NNF10CC1016517. and Grant DGE-1144086; Medical Foundation Health Fellowship Nordisk General Research of National Graduate DE-SC0008701; of Foundation Grant Institutes Energy Science Institute of Department National US National the the R01GM057089; by of supported ences was work This rank- Wilcoxon ACKNOWLEDGMENTS. the using scores MI of (α compared distribution test we sum (9), background al. a et to Faith in MI described this As (53). package Python NPEET the Analysis. Information assigned randomly not were regula- TFs of known TU. with distribution the MI the to high preserving having TFs TU, TU. each hav- per TRNs to tors random assigned 1,000 TFs models on trained random comparing those ing by against TRN expression determined known predicting the further for on We trained TRN the profiles. of expression significance regulator main- the while of profiles, order TU shuffled the randomly taining 1,000 them on comparing trained by models effects against condition-specific captured models our whether e E Rev studies. microarray and sequencing discovery. coli Escherichia oxidation carbon regulate to architecture site binding globally. diverse a uses ArcA regulator binding. factor transcription of features . in integration dence Biol Syst Mol Commun B Ser Acad Jpn for conditions unexplored in 7:13090. state cellular predicts in Biol network factor sigma the of profiling. expression via action of mode compound identifying challenge. expression gene dream3 the of One performance net—best elastic the regulation. networks. Syst Complex J Inter Intell (nsNMF). factorization matrix nonnegative Nonsmooth 12:2825–2830. data. high-throughput with consistent 112:10810–10815. is expression and doi:10.1101/080929. bioRxiv CRP. and Cra by metabolism doi:10.1101/069021. bioRxiv actions. in stress oxidative under networks coli regulatory transcriptional SoxRS and metabolism. acid amino of logic regulatory scriptional analysis. variable rogate subtilis York). New (Wiley, Statistics -2MG1655. K-12 12:4. 5:e9134. 28:403–415. 69:066138. lbltasrpinlrgltr network. regulatory transcriptional global LSGenet PLoS 6:5829. eoeBiol Genome hsRvE Rev Phys = uli cd Res Acids Nucleic 10:740. gae ,GasegrP(04 siaigmta information. mutual Estimating (2004) P Grassberger H, ogbauer 0.05). ¨ . 88:485–508. PNAS rqitM(00 eeepeso rdcinb otitgainand integration soft by prediction expression Gene (2010) M ornquist uli cd Res Acids Nucleic ¨ elRep Cell 1695:1–9. 9:e1003839. 70:066111. ecmue IbtenTsadtre ee using genes target and TFs between MI computed We 4:R60. | etakDne ilnk o aubediscussions. valuable for Zielinski Daniel thank We LSGenet PLoS ie eisi rbblt and Probability in Series Wiley Analysis, Regression Linear etme 9 2017 19, September 12:1289–1299. 37:e38. shrci coli Escherichia aaae(Oxford) Database 43:3079–3088. uli cd Res Acids Nucleic 3:1724–1735. LSGenet PLoS oooyadfntoa states. functional and Topology : shrci coli Escherichia | o ytBiol Syst Mol 2013:bas059. yoatru tuberculosis. Mycobacterium 9:e1003565. o.114 vol. 43:e47. shrci coli Escherichia a hmBiol Chem Nat EETasPtenAa Mach Anal Pattern Trans IEEE rcNt cdSiUSA Sci Acad Natl Proc Science | 11:839. N eel complex reveals FNR o 38 no. ahLanRes Learn Mach J . 8:65–71. a Commun Nat 301:102–105. Escherichia | Bacillus 10291 PLoS BMC Phys Proc Nat

SYSTEMS BIOLOGY