<<

NRC as a formal model for expressing bioinformatics workflows

A. Gambin 1, J. Hidders 2, N. Kwasnikowska 3, S. Lasota 1, J. Sroka 1, J. Tyszkiewicz 1, J. Van den Bussche 3 1 Warsaw University, 2 University of Antwerp, 3 Hasselt University

Context Problems Our contribution

● bioinformatics workflows ● workflows execute as a mix of automated ● using Nested Relational Calculus [1] for scripts and manual intervention modeling data oriented workflows – network of data centered processing steps – difficult to maintain ● many bioinformatics workflows can be modeled ● processing steps involve ● results are stored in ad-hoc ways, e.g. files, in NRC – large amounts of complex data Excel sheets ● advantages of using NRC ● sequence files, BLAST reports – difficult to manage – puts data oriented workflows on a firm ● XML data foundation – a variety of tools Existing solutions – formalism is already well understood ● EMBOSS suite, BioPerl scripts ● workflow execution engines ● natural approach ● webservices, Mascot searches – Kepler [2], Taverna [3] – BioKleisli [4] is also based on NRC ● not based on a formal data model, or too complicated and not data oriented

dataflow Nested Relational Calculus NRC core operations {PatSample}

● established formalism for querying over ● constant value of a base type — “John”, true, 89 ∗

complex objects [1] ● variable of any type, either base type or complex — $patient PatSample ● complex objects are arbitrarily nested ● construction — 〈 name: “John”, condition: true, age: 89 〉 collections and processPat ● tuple projection — $patient.name ● collections can be sets, multi-sets and lists PatData ● empty construction — ∅ ● set-based model: sets {} and tuples 〈〉 -1 ● set construction — {$patient} ∗ ● typed query language ● set — $patientList = $healthy ∪ $diseased {PatData} – extensible repertoire of base types ● ● Boolean, String, Number flattening of a nested set clusterPatData

● FASTA sequence file – $patientList = flatten( {$healthy, $diseased}) PepClusters ● XML, based on a DTD or XML Schema ● iteration over a set – complex types: nested sets and tuples kSubsets kSubsets – for $patient in $patientList return $patient.name

● named program definition — pBLAST: FASTA → {AccessionNr}

Workflow example – description – external programs, used as a “black box” selectSets selectSets ● 3D signal maps from LC-MS analysis of blood – internal programs, help with top-down design {TestTrain} samples ● equality test for base types — $patient.name = “John”

● two groups: diseased and normal ● emptiness test for sets — $patientList = ∅ tstatistic correlation

● extracting clusters corresponding to peptides ● conditional {CAlg} tstat corr – if $p.condition then $diseased ∪ {$p} else $healthy ∪ {$p} 〈⋅,⋅〉 ● core operations can be combined into programs FSelCAlg

tstatistic Workflow example – NRC programs {TestTrain} ● top-down design of the workflow ∗ – after processing and clustering of raw patient data, k-fold cross validation is performed TestTrain define dataflow($s: {PatSample}): FSelCAlg as [train] [test] [train] [test] [train] [test] kfoldCrossValidation( clusterPatData( PepClusters for $i in $s return processPat($i))) train test train test train test tstatisticDT tstatisticRF tstatisticSVM ● some internal programs ● choosing classifiers, defined by a feature – the set of peptide clusters is divided k times into a PerformanceStats selection method training set and a test set by internal program dt rf svm – t-statistic, correlation selectSets and external program kSubsets

, , ● and a classification algorithm these pairs are passed to internal programs tstatistic 〈⋅ ⋅ ⋅〉 and correlation, then a tuple is constructed from – Decision Trees (DT), Random Forest (RF), Support their results CAlg Vector Machine (SVM) define kfoldCrossValidation($pc: PepClusters): FSelCAlg as -1 ● k-fold cross validation to obtain the following ∗ performance statistics for each classifier 〈 tstat: tstatictic(selectSets(kSubsets($pc))), {CAlg} corr: correlation(selectSets(kSubsets($pc))) 〉 – sensitivity, specificity – for each pair of training set and test set, external Acknowledments Workflow example – data types programs corresponding to chosen classifiers with t-statistic as feature selection method are invoked, Special thanks to J. Dutkowski, A. Gambin, B. Kluge, K. ● base types Kowalczyk, J.Tiuryn from Institute of Informatics, Warsaw then a tuple is constructed from received results University, and M. Dadlez from Institute of Biochemistry and String, Number, Boolean, Sample Biophysics, Polish Academy of Science, for providing their define tstatistic($tt: {TestTrain}): {Calg} as workflow. ● complex types for $i in $tt return References – input type 〈 dt: tstatisticDT($i.train, $i.test), [1] P. Buneman et al., “Principles of programming with complex objects PatSample = 〈 id: String, sample: Sample, diseased: Boolean 〉 rf: tstatisticRF($i.train, $i.test), and collection types”, Theoretical Computer Science, 1995, 149:3–48. – output type svm: tstatisticSVM($i.train, $i.test) 〉 [2] I. Attinas et al., “Kepler: An Extensible System for Design and FSelCAlg = 〈 tstat: {CAlg}, corr: {Calg} 〉 Execution of Scientific Workflows”, in proc. SSDBM 2004. [3] T. Oinn et al., “Taverna: a tool for the composition and enactment of with CAlg = 〈 dt: PStats, rf: PStats, svm: PStats 〉 Graphical representation bioinformatics workflows”, Bioinformatics, 2004, 20(17):3045–3054. and PStats = 〈 sensitivity: Number, specificity: Number 〉 ● dataflow language based on Petri nets and [4] S. Davidson et al., “BioKleisli: A Digital Library for Biomedical – auxiliary types Nested Relational Calculus [6] Researchers”, J. Digital Libraries, 1997, 1(1):36–53. TestTrain = 〈 test: PepClusters, train: PepClusters 〉 [5] A. Tröger et al., “A language for comprehensively supporting the in ● typing system and core operations from NRC vitro experimental process in silico”, in proc. BIBE 2004. PepClusters = { 〈 clustermass: Number, patlist: PatList 〉 } [6] J. Hidders et al., “Petri nets + nested relational calculus = dataflow PatList = { 〈 patid: String, diseased: Boolean, intensity: Number 〉 } ● top-down, hierarchical design of the data flow language”, submitted to CoopIS 2005. NRC as a formal model for expressing bioinformatics workflows

A. Gambin 1, J. Hidders 2, N. Kwasnikowska 3, S. Lasota 1, J. Sroka 1, J. Tyszkiewicz 1, J. Van den Bussche 3 1 Warsaw University, 2 University of Antwerp, 3 Hasselt University

Context Problems Our contribution

● bioinformatics workflows ● workflows execute as a mix of automated ● using Nested Relational Calculus [1] for scripts and manual intervention modeling data oriented workflows – network of data centered processing steps – difficult to maintain ● many bioinformatics workflows can be modeled ● processing steps involve ● results are stored in ad-hoc ways, e.g. files, in NRC – large amounts of complex data Excel sheets ● advantages of using NRC ● sequence files, BLAST reports – difficult to manage – puts data oriented workflows on a firm ● XML data foundation – a variety of tools Existing solutions – formalism is already well understood ● EMBOSS suite, BioPerl scripts ● workflow execution engines ● natural approach ● webservices, Mascot searches – Kepler [2], Taverna [3] – BioKleisli [4] is also based on NRC ● not based on a formal data model, or too complicated and not data oriented

dataflow Nested Relational Calculus NRC core operations {PatSample}

● established formalism for querying over ● constant value of a base type — “John”, true, 89 ∗

complex objects [1] ● variable of any type, either base type or complex — $patient PatSample ● complex objects are arbitrarily nested ● tuple construction — 〈 name: “John”, condition: true, age: 89 〉 collections and tuples processPat ● tuple projection — $patient.name ● collections can be sets, multi-sets and lists PatData ● construction — ∅ ● set-based model: sets {} and tuples 〈〉 -1 ● singleton set construction — {$patient} ∗ ● typed query language ● set union — $patientList = $healthy ∪ $diseased {PatData} – extensible repertoire of base types ● ● Boolean, String, Number flattening of a nested set clusterPatData

● FASTA sequence file – $patientList = flatten( {$healthy, $diseased}) PepClusters ● XML, based on a DTD or XML Schema ● iteration over a set – complex types: nested sets and tuples kSubsets kSubsets – for $patient in $patientList return $patient.name

● named program definition — pBLAST: FASTA → {AccessionNr} SubSets

Workflow example – description – external programs, used as a “black box” selectSets selectSets ● 3D signal maps from LC-MS analysis of blood – internal programs, help with top-down design {TestTrain} samples ● equality test for base types — $patient.name = “John”

● two groups: diseased and normal ● emptiness test for sets — $patientList = ∅ tstatistic correlation

● extracting clusters corresponding to peptides ● conditional {CAlg} tstat corr – if $p.condition then $diseased ∪ {$p} else $healthy ∪ {$p} 〈⋅,⋅〉 ● core operations can be combined into programs FSelCAlg

tstatistic Workflow example – NRC programs {TestTrain} ● top-down design of the workflow ∗ – after processing and clustering of raw patient data, k-fold cross validation is performed TestTrain define dataflow($s: {PatSample}): FSelCAlg as [train] [test] [train] [test] [train] [test] kfoldCrossValidation( clusterPatData( PepClusters for $i in $s return processPat($i))) train test train test train test tstatisticDT tstatisticRF tstatisticSVM ● some internal programs ● choosing classifiers, defined by a feature – the set of peptide clusters is divided k times into a PerformanceStats selection method training set and a test set by internal program dt rf svm – t-statistic, correlation selectSets and external program kSubsets

, , ● and a classification algorithm these pairs are passed to internal programs tstatistic 〈⋅ ⋅ ⋅〉 and correlation, then a tuple is constructed from – Decision Trees (DT), Random Forest (RF), Support their results CAlg Vector Machine (SVM) define kfoldCrossValidation($pc: PepClusters): FSelCAlg as -1 ● k-fold cross validation to obtain the following ∗ performance statistics for each classifier 〈 tstat: tstatictic(selectSets(kSubsets($pc))), {CAlg} corr: correlation(selectSets(kSubsets($pc))) 〉 – sensitivity, specificity – for each pair of training set and test set, external Acknowledments Workflow example – data types programs corresponding to chosen classifiers with t-statistic as feature selection method are invoked, Special thanks to J. Dutkowski, A. Gambin, B. Kluge, K. ● base types Kowalczyk, J.Tiuryn from Institute of Informatics, Warsaw then a tuple is constructed from received results University, and M. Dadlez from Institute of Biochemistry and String, Number, Boolean, Sample Biophysics, Polish Academy of Science, for providing their define tstatistic($tt: {TestTrain}): {Calg} as workflow. ● complex types for $i in $tt return References – input type 〈 dt: tstatisticDT($i.train, $i.test), [1] P. Buneman et al., “Principles of programming with complex objects PatSample = 〈 id: String, sample: Sample, diseased: Boolean 〉 rf: tstatisticRF($i.train, $i.test), and collection types”, Theoretical Computer Science, 1995, 149:3–48. – output type svm: tstatisticSVM($i.train, $i.test) 〉 [2] I. Attinas et al., “Kepler: An Extensible System for Design and FSelCAlg = 〈 tstat: {CAlg}, corr: {Calg} 〉 Execution of Scientific Workflows”, in proc. SSDBM 2004. [3] T. Oinn et al., “Taverna: a tool for the composition and enactment of with CAlg = 〈 dt: PStats, rf: PStats, svm: PStats 〉 Graphical representation bioinformatics workflows”, Bioinformatics, 2004, 20(17):3045–3054. and PStats = 〈 sensitivity: Number, specificity: Number 〉 ● dataflow language based on Petri nets and [4] S. Davidson et al., “BioKleisli: A Digital Library for Biomedical – auxiliary types Nested Relational Calculus [6] Researchers”, J. Digital Libraries, 1997, 1(1):36–53. TestTrain = 〈 test: PepClusters, train: PepClusters 〉 [5] A. Tröger et al., “A language for comprehensively supporting the in ● typing system and core operations from NRC vitro experimental process in silico”, in proc. BIBE 2004. PepClusters = { 〈 clustermass: Number, patlist: PatList 〉 } [6] J. Hidders et al., “Petri nets + nested relational calculus = dataflow PatList = { 〈 patid: String, diseased: Boolean, intensity: Number 〉 } ● top-down, hierarchical design of the data flow language”, submitted to CoopIS 2005. NRC as a formal model for expressing bioinformatics workflows

A. Gambin 1, J. Hidders 2, N. Kwasnikowska 3, S. Lasota 1, J. Sroka 1, J. Tyszkiewicz 1, J. Van den Bussche 3 1 Warsaw University, 2 University of Antwerp, 3 Hasselt University

Context Problems Our contribution

● bioinformatics workflows ● workflows execute as a mix of automated ● using Nested Relational Calculus [1] for scripts and manual intervention modeling data oriented workflows – network of data centered processing steps – difficult to maintain ● many bioinformatics workflows can be modeled ● processing steps involve ● results are stored in ad-hoc ways, e.g. files, in NRC – large amounts of complex data Excel sheets ● advantages of using NRC ● sequence files, BLAST reports – difficult to manage – puts data oriented workflows on a firm ● XML data foundation – a variety of tools Existing solutions – formalism is already well understood ● EMBOSS suite, BioPerl scripts ● workflow execution engines ● natural approach ● webservices, Mascot searches – Kepler [2], Taverna [3] – BioKleisli [4] is also based on NRC ● not based on a formal data model, or too complicated and not data oriented

dataflow Nested Relational Calculus NRC core operations {PatSample}

● established formalism for querying over ● constant value of a base type — “John”, true, 89 ∗

complex objects [1] ● variable of any type, either base type or complex — $patient PatSample ● complex objects are arbitrarily nested ● tuple construction — 〈 name: “John”, condition: true, age: 89 〉 collections and tuples processPat ● tuple projection — $patient.name ● collections can be sets, multi-sets and lists PatData ● empty set construction — ∅ ● set-based model: sets {} and tuples 〈〉 -1 ● singleton set construction — {$patient} ∗ ● typed query language ● set union — $patientList = $healthy ∪ $diseased {PatData} – extensible repertoire of base types ● ● Boolean, String, Number flattening of a nested set clusterPatData

● FASTA sequence file – $patientList = flatten( {$healthy, $diseased}) PepClusters ● XML, based on a DTD or XML Schema ● iteration over a set – complex types: nested sets and tuples kSubsets kSubsets – for $patient in $patientList return $patient.name

● named program definition — pBLAST: FASTA → {AccessionNr} SubSets

Workflow example – description – external programs, used as a “black box” selectSets selectSets ● 3D signal maps from LC-MS analysis of blood – internal programs, help with top-down design {TestTrain} samples ● equality test for base types — $patient.name = “John”

● two groups: diseased and normal ● emptiness test for sets — $patientList = ∅ tstatistic correlation

● extracting clusters corresponding to peptides ● conditional {CAlg} tstat corr – if $p.condition then $diseased ∪ {$p} else $healthy ∪ {$p} 〈⋅,⋅〉 ● core operations can be combined into programs FSelCAlg

tstatistic Workflow example – NRC programs {TestTrain} ● top-down design of the workflow ∗ – after processing and clustering of raw patient data, k-fold cross validation is performed TestTrain define dataflow($s: {PatSample}): FSelCAlg as [train] [test] [train] [test] [train] [test] kfoldCrossValidation( clusterPatData( PepClusters for $i in $s return processPat($i))) train test train test train test tstatisticDT tstatisticRF tstatisticSVM ● some internal programs ● choosing classifiers, defined by a feature – the set of peptide clusters is divided k times into a PerformanceStats selection method training set and a test set by internal program dt rf svm – t-statistic, correlation selectSets and external program kSubsets

, , ● and a classification algorithm these pairs are passed to internal programs tstatistic 〈⋅ ⋅ ⋅〉 and correlation, then a tuple is constructed from – Decision Trees (DT), Random Forest (RF), Support their results CAlg Vector Machine (SVM) define kfoldCrossValidation($pc: PepClusters): FSelCAlg as -1 ● k-fold cross validation to obtain the following ∗ performance statistics for each classifier 〈 tstat: tstatictic(selectSets(kSubsets($pc))), {CAlg} corr: correlation(selectSets(kSubsets($pc))) 〉 – sensitivity, specificity – for each pair of training set and test set, external Acknowledments Workflow example – data types programs corresponding to chosen classifiers with t-statistic as feature selection method are invoked, Special thanks to J. Dutkowski, A. Gambin, B. Kluge, K. ● base types Kowalczyk, J.Tiuryn from Institute of Informatics, Warsaw then a tuple is constructed from received results University, and M. Dadlez from Institute of Biochemistry and String, Number, Boolean, Sample Biophysics, Polish Academy of Science, for providing their define tstatistic($tt: {TestTrain}): {Calg} as workflow. ● complex types for $i in $tt return References – input type 〈 dt: tstatisticDT($i.train, $i.test), [1] P. Buneman et al., “Principles of programming with complex objects PatSample = 〈 id: String, sample: Sample, diseased: Boolean 〉 rf: tstatisticRF($i.train, $i.test), and collection types”, Theoretical Computer Science, 1995, 149:3–48. – output type svm: tstatisticSVM($i.train, $i.test) 〉 [2] I. Attinas et al., “Kepler: An Extensible System for Design and FSelCAlg = 〈 tstat: {CAlg}, corr: {Calg} 〉 Execution of Scientific Workflows”, in proc. SSDBM 2004. [3] T. Oinn et al., “Taverna: a tool for the composition and enactment of with CAlg = 〈 dt: PStats, rf: PStats, svm: PStats 〉 Graphical representation bioinformatics workflows”, Bioinformatics, 2004, 20(17):3045–3054. and PStats = 〈 sensitivity: Number, specificity: Number 〉 ● dataflow language based on Petri nets and [4] S. Davidson et al., “BioKleisli: A Digital Library for Biomedical – auxiliary types Nested Relational Calculus [6] Researchers”, J. Digital Libraries, 1997, 1(1):36–53. TestTrain = 〈 test: PepClusters, train: PepClusters 〉 [5] A. Tröger et al., “A language for comprehensively supporting the in ● typing system and core operations from NRC vitro experimental process in silico”, in proc. BIBE 2004. PepClusters = { 〈 clustermass: Number, patlist: PatList 〉 } [6] J. Hidders et al., “Petri nets + nested relational calculus = dataflow PatList = { 〈 patid: String, diseased: Boolean, intensity: Number 〉 } ● top-down, hierarchical design of the data flow language”, submitted to CoopIS 2005.