Bioinformatics

Metabolic pathway analysis

Jacques van Helden

Graph-based analysis of biochemical networks

Examples of metabolic pathways

Methionine biosynthesis in S. cerevisiae

[Pathway diagram; each step is listed as substrate → product: EC number, enzyme (gene); side reactants]
• L-aspartate (from aspartate biosynthesis) → L-aspartyl-4-P: EC 2.7.2.4, aspartate kinase (HOM3); ATP → ADP
• L-aspartyl-4-P → L-aspartic semialdehyde: EC 1.2.1.11, aspartate semialdehyde dehydrogenase (HOM2); NADPH → NADP+; Pi
• L-aspartic semialdehyde → L-homoserine: EC 1.1.1.3, homoserine dehydrogenase (HOM6); NADPH → NADP+ (branch point towards threonine biosynthesis)
• L-homoserine → O-acetyl-homoserine: EC 2.3.1.31, homoserine O-acetyltransferase (MET2); acetyl-CoA → CoA
• O-acetyl-homoserine + sulfide (from sulfur assimilation) → homocysteine: EC 4.2.99.10, O-acetylhomoserine (thiol)-lyase (MET17) (branch point towards cysteine biosynthesis)
• Homocysteine → L-methionine: EC 2.1.1.14, methionine synthase, vitamin B12-independent (MET6); 5-methyltetrahydropteroyltri-L-glutamate → 5-tetrahydropteroyltri-L-glutamate
• L-methionine → S-adenosyl-L-methionine: EC 2.5.1.6, S-adenosyl-methionine synthetase I (SAM1) and II (SAM2); H2O; ATP → Pi, PPi
• Regulators shown: Met31p (MET31), Met32p (MET32), Cbf1p/Met4p/Met28p complex (CBF1, MET4, MET28), Gcn4p (GCN4), Met30p (MET30)

Methionine biosynthesis in E. coli

[Pathway diagram; each step is listed as substrate → product: EC number, enzyme (gene); side reactants]
• L-aspartate (from aspartate biosynthesis) → L-aspartyl-4-P: EC 2.7.2.4, aspartate kinase II / homoserine dehydrogenase II (metL); ATP → ADP
• L-aspartyl-4-P → L-aspartic semialdehyde: EC 1.2.1.11, aspartate semialdehyde dehydrogenase (asd); NADPH → NADP+; Pi
• L-aspartic semialdehyde → L-homoserine: EC 1.1.1.3; NADPH → NADP+ (branch point towards threonine biosynthesis)
• L-homoserine → alpha-succinyl-L-homoserine: EC 2.3.1.46, homoserine O-succinyltransferase (metA); succinyl-SCoA → HSCoA
• Alpha-succinyl-L-homoserine + L-cysteine (from cysteine biosynthesis) → cystathionine: EC 4.2.99.9, cystathionine-gamma-synthase (metB); releases succinate
• Cystathionine → homocysteine: EC 4.4.1.8, cystathionine-beta-lyase (metC); H2O → pyruvate; NH4+
• Homocysteine → L-methionine: EC 2.1.1.14, cobalamin-independent homocysteine transmethylase (metE), or EC 2.1.1.13, cobalamin-dependent homocysteine transmethylase (metH); 5-methylTHF → THF
• L-methionine → S-adenosyl-L-methionine: EC 2.5.1.6; ATP; H2O → Pi; PPi
• Regulators shown: methionine repressor (metJ), metR

Alternative methionine pathways

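[Pathway diagram comparing the S. cerevisiae and E. coli routes]
• Common steps: L-aspartate → (2.7.2.4) → L-aspartyl-4-P → (1.2.1.11) → L-aspartic semialdehyde → (1.1.1.3) → L-homoserine
• S. cerevisiae branch: L-homoserine → (2.3.1.31) → O-acetyl-homoserine → (4.2.99.10) → homocysteine
• E. coli branch: L-homoserine → (2.3.1.46) → alpha-succinyl-L-homoserine → (4.2.99.9) → cystathionine → (4.4.1.8) → homocysteine
• Common steps: homocysteine → (2.1.1.14) → L-methionine → (2.5.1.6) → S-adenosyl-L-methionine

KEGG "consensus pathway" for methionine metabolism

Lysine biosynthesis in Escherichia coli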

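[Pathway diagram; each step is listed as substrate → product: EC number, enzyme (gene); side reactants]
• L-aspartate (from aspartate biosynthesis) → L-aspartyl-4-P: EC 2.7.2.4, aspartate kinase III (lysC); ATP → ADP
• L-aspartyl-4-P → L-aspartic semialdehyde: EC 1.2.1.11, aspartate semialdehyde dehydrogenase (asd); NADPH; H+ → NADP+; Pi (branch point towards methionine and threonine biosynthesis)
• L-aspartic semialdehyde + pyruvate → dihydrodipicolinic acid: EC 4.2.1.52, dihydrodipicolinate synthase (dapA); releases 2 H2O
• Dihydrodipicolinic acid → tetrahydrodipicolinate: EC 1.3.1.26, dihydrodipicolinate reductase (dapB); NADPH or NADH; H+ → NADP+ or NAD+
• Tetrahydrodipicolinate → N-succinyl-epsilon-keto-L-alpha-aminopimelic acid: EC 2.3.1.117, tetrahydrodipicolinate N-succinyltransferase (dapD); succinyl-CoA → CoA
• N-succinyl-epsilon-keto-L-alpha-aminopimelic acid → succinyl diaminopimelate: EC 2.6.1.17, succinyl diaminopimelate aminotransferase (dapC); glutamate → alpha-ketoglutarate
• Succinyl diaminopimelate → LL-diaminopimelic acid: EC 3.5.1.18, N-succinyldiaminopimelate desuccinylase (dapE); H2O → succinate
• LL-diaminopimelic acid → meso-diaminopimelic acid: EC 5.1.1.7, diaminopimelate epimerase (dapF)
• Meso-diaminopimelic acid → L-lysine: EC 4.1.1.20, diaminopimelate decarboxylase (lysA); releases CO2
• Regulator shown: lysR protein (lysR), acting on lysA

Lysine biosynthesis in Saccharomyces cerevisiae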

[Pathway diagram; each step is listed as substrate → product: EC number, enzyme (gene); side reactants]
• 2-Oxoglutarate + acetyl-CoA → homocitrate: EC 4.1.3.21, homocitrate synthase (LYS20); releases CoA
• Homocitrate → but-1-ene-1,2,4-tricarboxylate: (LYS7); releases H2O
• But-1-ene-1,2,4-tricarboxylate → homoisocitrate: EC 4.2.1.36, homoaconitate hydratase (LYS4)
• Homoisocitrate → oxaloglutarate: EC 1.1.1.87, homoisocitrate dehydrogenase; NAD+ → H+; NADH
• Oxaloglutarate → 2-oxoadipate: EC 1.1.1.87, homoisocitrate dehydrogenase; releases CO2
• 2-Oxoadipate → L-2-aminoadipate: EC 2.6.1.39, aminoadipate aminotransferase; L-glutamate → 2-oxoglutarate
• L-2-Aminoadipate → L-2-aminoadipate 6-semialdehyde: EC 1.2.1.31, aminoadipate semialdehyde dehydrogenase (LYS2, LYS5); H+; NADH (or NADPH) → NAD+ (or NADP+); H2O
• L-2-Aminoadipate 6-semialdehyde → N6-(L-1,3-dicarboxypropyl)-L-lysine: EC 1.5.1.10, saccharopine dehydrogenase, glutamate forming (LYS9); L-glutamate; NADPH (or NADH); H+ → NADP+ (or NAD+); H2O
• N6-(L-1,3-Dicarboxypropyl)-L-lysine → L-lysine: EC 1.5.1.7, saccharopine dehydrogenase, lysine forming (LYS1); NADP+ (or NAD+); H2O → 2-oxoglutarate; NADPH (or NADH); H+

Lysine biosynthesis in KEGG (yeast in green)

EcoCyc example - proline utilization

EcoCyc example - proline biosynthesis

EcoCyc - metabolic overview

KEGG example: proline and arginine metabolism (E. coli)

• Where is proline?
• How is proline synthesized in E. coli?
• How is proline catabolized in E. coli?
• Is it obvious that reactions 1.5.99.8 and 1.5.1.2 have distinct side reactants?

Graph-based analysis of biochemical networks

Pathway reconstruction by reaction clustering

A graph of compounds and reactions

Reactions from KEGG
• Compound nodes: 10,166 compounds (only 4,302 used by one reaction)
• Reaction nodes: 5,283 reactions
• Arcs
  • 10,685 compound → reaction (7,297 non-trivial)
  • 10,621 reaction → compound (6,828 non-trivial)

Metabolic pathways as subgraphs
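The compound/reaction graph described above can be sketched as a directed bipartite graph: one node per compound, one node per reaction, an arc from each substrate to its reaction and from each reaction to its products. The two reactions below are copied from the aspartate steps shown in the pathway slides. The function name `build_graph` and the dictionary layout are assumptions made for this illustration, not the format used by KEGG or by the original program.

```python
from collections import defaultdict

# Toy input (two reactions only): EC number -> (substrates, products)
REACTIONS = {
    "2.7.2.4": (["L-aspartate", "ATP"], ["L-aspartyl-4-P", "ADP"]),
    "1.2.1.11": (["L-aspartyl-4-P", "NADPH"],
                 ["L-aspartic semialdehyde", "NADP+", "Pi"]),
}

def build_graph(reactions):
    """Return (arcs, compounds); arcs maps each node to the set of its successors."""
    arcs = defaultdict(set)
    compounds = set()
    for reaction_id, (substrates, products) in reactions.items():
        for substrate in substrates:
            arcs[substrate].add(reaction_id)   # arc compound -> reaction
            compounds.add(substrate)
        for product in products:
            arcs[reaction_id].add(product)     # arc reaction -> compound
            compounds.add(product)
    return arcs, compounds

arcs, compounds = build_graph(REACTIONS)
n_arcs = sum(len(successors) for successors in arcs.values())
print(len(compounds), "compound nodes,", len(REACTIONS), "reaction nodes,", n_arcs, "arcs")
```

In this representation a pathway is a subgraph: a subset of the reaction nodes together with the compound nodes they touch.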

Escherichia coli
• 4219 genes (Blattner)
• 967 enzymes (Swissprot)
• 159 pathways (EcoCyc)

Reconstructing a pathway from a subset of reactions

• Input:
  • a set of reactions (the seed reactions)
• Output:
  • a metabolic pathway including
    • the seed reactions, together with their substrates and products
    • optionally, some additional reactions, intercalated to improve the pathway connectivity
  • the pathway can either be connected, or contain several unconnected components

Seed nodes

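[Figure: graph with the seed reactions highlighted. Legend: compound, reaction, seed reaction]

Linking seed nodes

[Figure: the same graph with direct links between seed reactions. Legend: compound, reaction, seed reaction, direct link]

Enhance linking by intercalating reactions

[Figure: the same graph with intercalated reactions added. Legend: compound, reaction, seed reaction, direct link, intercalated reaction]

Subgraph extraction

Validation of the method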
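The seed-linking and intercalation steps illustrated in the preceding figures can be sketched as a bounded breadth-first search between seed reactions. This is a minimal sketch under stated assumptions, not the original program. Two reactions are considered linked when a product of the first is a substrate of the second, reactions are taken in the direction written (the original method also infers orientation), and `max_gap` is a hypothetical parameter name for the maximum number of intercalated reactions allowed between two seeds. The network is a made-up toy example.

```python
from collections import deque

# Hypothetical toy network: reaction -> (substrates, products)
REACTIONS = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"C"}),
    "R3": ({"C"}, {"D"}),
    "R4": ({"D"}, {"E"}),
    "R5": ({"X"}, {"Y"}),   # unrelated reaction
}

def successors(reactions):
    """Map each reaction to the reactions consuming at least one of its products."""
    return {
        r: {q for q, (subs, _) in reactions.items() if q != r and prods & subs}
        for r, (_, prods) in reactions.items()
    }

def shortest_path(succ, start, goal, max_gap):
    """Shortest reaction path with at most max_gap intermediates, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) > max_gap + 1:
            continue   # one more step would exceed the allowed gap
        for nxt in succ[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def extract_subgraph(reactions, seeds, max_gap):
    """Seed reactions plus the reactions intercalated to link them."""
    succ = successors(reactions)
    kept = set(seeds)
    for a in seeds:
        for b in seeds:
            if a != b:
                path = shortest_path(succ, a, b, max_gap)
                if path:
                    kept.update(path)
    return kept

print(sorted(extract_subgraph(REACTIONS, ["R1", "R3"], max_gap=1)))
```

With `max_gap=0` only direct links are accepted, so seeds R1 and R3 stay unconnected. With `max_gap=1` reaction R2 is intercalated between them.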

• Take a set of experimentally characterized pathways, and for each one
  • Select a subset of enzymes
  • Use the reactions catalysed by these enzymes as seed nodes
  • Extract the subgraph
  • Compare with the known pathway

Lysine biosynthesis in E. coli

[Pathway diagram: L-aspartate → L-aspartyl-4-P → L-aspartic semialdehyde → dihydrodipicolinic acid → tetrahydrodipicolinate → N-succinyl-epsilon-keto-L-alpha-aminopimelic acid → succinyl diaminopimelate → LL-diaminopimelic acid → meso-diaminopimelic acid → L-lysine, catalysed in this order by EC 2.7.2.4 (lysC), 1.2.1.11 (asd), 4.2.1.52 (dapA), 1.3.1.26 (dapB), 2.3.1.117 (dapD), 2.6.1.17 (dapC), 3.5.1.18 (dapE), 5.1.1.7 (dapF) and 4.1.1.20 (lysA, regulated by lysR)]

Example: reconstitution of lysine pathway
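The comparison step of the validation can be sketched as a set comparison between the reactions of the known pathway and those of the extracted subgraph. The known EC numbers are those of the E. coli lysine pathway listed in this lecture. The inferred set, the function name `compare` and the choice of sensitivity and precision as scores are assumptions made for illustration, not results or metrics from the original study.

```python
# EC numbers of the known E. coli lysine pathway, as listed on the slides
KNOWN = {"2.7.2.4", "1.2.1.11", "4.2.1.52", "1.3.1.26", "2.3.1.117",
         "2.6.1.17", "3.5.1.18", "5.1.1.7", "4.1.1.20"}

# Hypothetical output of a reconstruction run (made up for illustration):
# one known reaction is missed and one unrelated reaction is added.
INFERRED = (KNOWN - {"2.6.1.17"}) | {"hypothetical-shortcut"}

def compare(known, inferred):
    """Summarize the overlap between a known and an inferred reaction set."""
    shared = known & inferred
    return {
        "recovered": sorted(shared),
        "missed": sorted(known - inferred),
        "extra": sorted(inferred - known),
        "sensitivity": len(shared) / len(known),
        "precision": len(shared) / len(inferred),
    }

report = compare(KNOWN, INFERRED)
print("missed:", report["missed"], "extra:", report["extra"])
```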

• Gap size: 0
  • all ECs from the original pathway are provided as seeds
• Seeds: 1.2.1.11, 1.3.1.26, 2.3.1.117, 2.6.1.17, 2.7.2.4, 3.5.1.18, 4.1.1.20, 4.2.1.52, 5.1.1.7
• Result:
  • Inferring reaction orientation (reverse or forward)
  • Ordering

Example: reconstitution of lysine pathway

• Gap size: 1
  • 5 seed reactions
• Result
  • Inferring missing steps
  • Inferring reaction orientation
  • Ordering

Example: reconstitution of lysine pathway

• Gap size: 2
  • 4 seed reactions
• Result
  • E. coli pathway found
  • Alternative pathways also returned

Example: reconstitution of lysine pathway

• Gap size: 3
  • 3 seed reactions
• Result
  • E. coli pathway is not found, because the program finds shortcuts between the seed reactions

Applications of pathway reconstruction

• We have the complete genome for dozens of bacteria, for which there is almost no experimental characterization of metabolism
• For these genomes, enzymes have been predicted by sequence similarity
• In some cases, one expects to find the same pathways as in model organisms, but in other cases, variants or completely distinct pathways
• For each known pathway from model organisms
  • Select the subset of reactions for which an enzyme exists in the target organism
  • If a reasonable number of reactions are present
    • Using these as seeds, reconstruct a pathway
    • Preferentially (but not exclusively) intercalate reactions for which an enzyme has been detected in the target organism

Graph-based analysis of biochemical networks
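The application outlined above (choose seeds among the reactions with a predicted enzyme, then preferentially intercalate reactions whose enzyme has been detected) can be sketched with a weighted shortest-path search. This is a hedged illustration, not the original algorithm. The toy network, the `min_fraction` cut-off standing in for "a reasonable number" and the `penalty` applied to reactions without a detected enzyme are all hypothetical.

```python
import heapq

# Hypothetical toy data: reaction -> (substrates, products)
REACTIONS = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"D"}),   # direct route, but no enzyme detected
    "R6": ({"B"}, {"C"}),
    "R7": ({"C"}, {"D"}),
    "R3": ({"D"}, {"E"}),
}
DETECTED = {"R1", "R3", "R6", "R7"}   # reactions with a predicted enzyme in the target organism
KNOWN_PATHWAY = ["R1", "R2", "R3"]    # pathway known from a model organism

def choose_seeds(pathway, detected, min_fraction=0.5):
    """Keep the detected reactions if they cover enough of the known pathway."""
    seeds = [r for r in pathway if r in detected]
    return seeds if len(seeds) >= min_fraction * len(pathway) else []

def cheapest_link(reactions, detected, start, goal, penalty=3.0):
    """Dijkstra search: an intercalated reaction costs 1 if detected, else `penalty`."""
    heap = [(0.0, [start])]
    best = {start: 0.0}
    while heap:
        cost, path = heapq.heappop(heap)
        last = path[-1]
        if last == goal:
            return path
        for nxt, (substrates, _) in reactions.items():
            if nxt in path or not (reactions[last][1] & substrates):
                continue
            step = 0.0 if nxt == goal else (1.0 if nxt in detected else penalty)
            if cost + step < best.get(nxt, float("inf")):
                best[nxt] = cost + step
                heapq.heappush(heap, (cost + step, path + [nxt]))
    return None

seeds = choose_seeds(KNOWN_PATHWAY, DETECTED)
print(seeds, cheapest_link(REACTIONS, DETECTED, seeds[0], seeds[-1]))
```

With the default penalty the longer route through the two detected reactions (R6, R7) is preferred over the undetected R2. Lowering `penalty` to 1.5 makes the undetected shortcut win, which is how a preference that is not exclusive can be expressed.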

From gene expression data to pathways

Reaction clustering and gene expression data

• Many biochemical pathways are co-regulated at the transcriptional level.
• Starting from the observation that a group of genes is co-regulated, try to find out whether they could be involved in a common pathway.

Gene expression data: cell cycle

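[Figure: clustered gene expression profiles for the cell-cycle experiments Alpha, cdc15, cdc28 and Elu, with gene clusters labelled MCM, CLB2, SIC1, MAT, CLN2, Y' and MET]

Spellman et al. (1998). Mol Biol Cell 9(12), 3273-97.
Gilbert et al. (2000). Trends Biotech. 18(Dec), 487-495.

Study case: cluster of co-regulated genes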

ID        Name    Description
YKR069W   met1    siroheme synthase
YFR030W   met10   subunit of assimilatory sulfite reductase
YGL125W   met13   putative methylenetetrahydrofolate reductase (mthfr)
YKL001C   met14   adenylylsulfate kinase
YPR167C   MET16   3'-phosphoadenylylsulfate reductase
YLR303W   MET17   O-acetylhomoserine-O-acetylserine sulfhydrylase
YJR010W   met3    ATP sulfurylase
YER091C   met6    vitamin B12-(cobalamin)-independent isozyme of methionine synthase (also called N5-methyltetrahydrofolate homocysteine methyltransferase or 5-methyltetrahydropteroyl triglutamate homocysteine methyltransferase)
YIR017C   MET28   transcriptional activator of sulfur amino acid metabolism
YGR055W   MUP1    high affinity methionine permease
YJR137C   ECM17   ExtraCellular Mutant

ORFs without name or description: YER042W, YIL074C, YLL061W, YLL062C, YLR302C, YNL276C, YPL250C, YPL274W

KEGG - gene search in pathway maps

KEGG - reaction coloring in pathway maps

Building pathways from gene clusters

[Figure: a table of gene expression profiles (genes in rows, experiments/chips in columns) is classified to obtain a cluster of co-regulated genes. Each gene expresses a protein, and proteins catalyse reactions. Pathway reconstruction then connects these reactions into a putative pathway.]

Pathway found in Spellman's "MET" cluster
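The first step of this analysis, going from a cluster of co-regulated genes to a set of seed reactions, can be sketched as a simple lookup. The gene-to-EC assignments below are the ones shown on the pathway slides of this lecture and cover only part of the cluster. The function name `seeds_from_cluster` and the dictionary are assumptions for illustration, where a real analysis would query an enzyme database.

```python
# Gene -> EC number, as shown on the sulfur assimilation and methionine slides
GENE_TO_EC = {
    "MET3": "2.7.7.4",
    "MET14": "2.7.1.25",
    "MET16": "1.8.99.4",
    "MET10": "1.8.1.2",
    "MET17": "4.2.99.10",
    "MET6": "2.1.1.14",
}

# Named genes of the cluster (the unnamed ORFs are omitted here)
CLUSTER = ["met1", "met10", "met13", "met14", "MET16", "MET17",
           "met3", "met6", "MET28", "MUP1", "ECM17"]

def seeds_from_cluster(cluster, gene_to_ec):
    """Split a gene cluster into seed reactions (EC numbers) and unassigned genes."""
    seeds, unassigned = set(), []
    for gene in cluster:
        ec = gene_to_ec.get(gene.upper())
        if ec is None:
            unassigned.append(gene)   # regulator, transporter, or enzyme missing from the lookup
        else:
            seeds.add(ec)
    return sorted(seeds), unassigned

seeds, unassigned = seeds_from_cluster(CLUSTER, GENE_TO_EC)
print(len(seeds), "seed reactions;", "no reaction assigned to:", unassigned)
```

The seed reactions obtained this way are the input of the pathway reconstruction. Genes left unassigned (here, for example, the permease MUP1 and the regulator MET28) do not contribute seeds.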

[Pathway diagram; each step is listed as substrate → product: EC number, enzyme (gene); side reactants]
• Sulfate → adenylyl sulfate (APS): EC 2.7.7.4, sulfate adenylyltransferase (MET3); ATP → PPi
• APS → 3'-phosphoadenylylsulfate (PAPS): EC 2.7.1.25, adenylyl sulfate kinase (MET14); ATP → ADP
• PAPS → sulfite: EC 1.8.99.4, 3'-phosphoadenylylsulfate reductase (MET16); NADPH → NADP+; AMP; 3'-phosphate (PAP); H+
• Sulfite → sulfide: EC 1.8.1.2, sulfite reductase (NADPH) (MET10) and putative sulfite reductase (MET5); 3 NADPH; 5 H+ → 3 NADP+; 3 H2O
• O-acetyl-homoserine + sulfide → homocysteine: EC 4.2.99.10, O-acetylhomoserine (thiol)-lyase (MET17)
• Homocysteine → L-methionine: EC 2.1.1.14, methionine synthase, vitamin B12-independent (MET6); 5-methyltetrahydropteroyltri-L-glutamate → 5-tetrahydropteroyltri-L-glutamate

Analysis of gene expression data

[Flowchart]
• Gene cluster: 20 genes
• Identify genes coding for enzymes: 7 enzymes
• Identify catalyzed reactions: subset of 8 reactions
• Interconnect these reactions to find all possible pathways
• Automatic graph layout: pathway diagram
• Compare with classical (known) pathways: 2 matching pathways

Comparison with sulfur assimilation

[Pathway diagram; each step is listed as substrate → product: EC number, enzyme (gene); side reactants]
• Sulfate (extracellular) → sulfate (intracellular): sulfate transport, sulfate transporters (SUL1, SUL2)
• Sulfate → adenylyl sulfate (APS): EC 2.7.7.4, sulfate adenylyltransferase (MET3); ATP → PPi
• APS → 3'-phosphoadenylylsulfate (PAPS): EC 2.7.1.25, adenylyl sulfate kinase (MET14); ATP → ADP
• PAPS → sulfite: EC 1.8.99.4, 3'-phosphoadenylylsulfate reductase (MET16); NADPH → NADP+; AMP; H+; 3'-phosphate (PAP)
• Sulfite → sulfide: EC 1.8.1.2, sulfite reductase (NADPH) (MET10) and putative sulfite reductase (MET5); 3 NADPH; 5 H+ → 3 NADP+; 3 H2O
• Sulfide feeds into methionine biosynthesis
• Regulators shown: Met31p (MET31), Met32p (MET32), Cbf1p/Met4p/Met28p complex (CBF1, MET4, MET28), Gcn4p (GCN4), Met30p (MET30)

Comparison with methionine biosynthesis

[Pathway diagram: methionine biosynthesis in S. cerevisiae, L-aspartate → L-aspartyl-4-P → L-aspartic semialdehyde → L-homoserine → O-acetyl-homoserine → homocysteine → L-methionine → S-adenosyl-L-methionine, catalysed in this order by EC 2.7.2.4 (HOM3), 1.2.1.11 (HOM2), 1.1.1.3 (HOM6), 2.3.1.31 (MET2), 4.2.99.10 (MET17), 2.1.1.14 (MET6) and 2.5.1.6 (SAM1, SAM2). Sulfide enters from sulfur assimilation at the MET17 step. Regulators shown: Met31p, Met32p, Cbf1p/Met4p/Met28p complex, Gcn4p, Met30p]

Summary

• Starting from an unordered cluster of genes, one gets an ordered set of reactions, connected to form a pathway
• Should permit discovery of novel pathways that are not yet stored in any pathway database
• Interpretation of intercalated reactions
  • enzyme is not regulated
  • DNA chip defect for that gene
  • gene was not on the DNA chip
  • enzyme remains to be identified in that organism

Analysis of data from Gasch et al.

• Gasch et al. (2000). Molecular Biology of the Cell, 11:4241-4257
• 6152 yeast genes
• Various conditions of stress (heat shock, osmotic shock, peroxide, amino acid starvation, nitrogen depletion)
• Steady-state growth on alternative carbon sources
• Overexpression studies

Selected experiments

[Figure: histograms of log(expression ratio) (x axis, from -6 to 4) against number of genes (y axis), one panel per experiment: MSN2 overexpression (repeat), MSN4 overexpression, YAP1 overexpression; ethanol, galactose, glucose, mannose, raffinose and sucrose (car.1); YP ethanol, fructose, galactose, glucose, mannose, raffinose and sucrose vs reference pool (car.2)]

Repressed by mannose (at least 3-fold)
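Selecting the genes repressed or induced by at least a given fold amounts to thresholding the log ratios. The sketch below assumes base-2 logarithms (the slides only say "log(expression ratio)") and uses made-up gene names and values, not data from Gasch et al.

```python
import math

# Hypothetical log2 expression ratios (condition vs reference); values are made up
LOG2_RATIOS = {"geneA": -2.3, "geneB": -1.7, "geneC": -1.2, "geneD": 0.1, "geneE": 1.9}

def repressed(ratios, fold):
    """Genes whose expression decreased at least `fold` times."""
    threshold = -math.log2(fold)
    return sorted(gene for gene, ratio in ratios.items() if ratio <= threshold)

def induced(ratios, fold):
    """Genes whose expression increased at least `fold` times."""
    threshold = math.log2(fold)
    return sorted(gene for gene, ratio in ratios.items() if ratio >= threshold)

print("repressed 3-fold:", repressed(LOG2_RATIOS, 3))
print("repressed 2-fold:", repressed(LOG2_RATIOS, 2))
print("induced 2-fold:", induced(LOG2_RATIOS, 2))
```

Relaxing the threshold from 3-fold to 2-fold enlarges the gene set, and therefore the set of seed reactions passed to the pathway reconstruction.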

[Pathway diagram. Labels on the figure: galactose utilization (redundancy in the database?); inferred; citrate cycle with shunt; gluconeogenesis]

Remark: arrows should be displayed as bi-directional

Repressed by mannose (at least 2-fold)

[Pathway diagram. Labels on the figure: (redundancy in the database?); gluconeogenesis; citrate cycle with shunt; galactose utilization; gluconeogenesis]

Remark: arrows should be displayed as bi-directional

Induced by galactose (at least 2-fold)

[Pathway diagram. Label on the figure: galactose utilization]

Remark: arrows should be displayed as bi-directional

Repressed by glucose (at least 2-fold)

[Pathway diagram. Labels on the figure: (redundancy in the database?); gluconeogenesis; galactose utilization; gluconeogenesis]