Nested Relational Calculus NRC Core Operations Workflow Example

NRC as a formal model for expressing bioinformatics workflows A. Gambin 1, J. Hidders 2, N. Kwasnikowska 3, S. Lasota 1, J. Sroka 1, J. Tyszkiewicz 1, J. Van den Bussche 3 1 Warsaw University, 2 University of Antwerp, 3 Hasselt University Context Problems Our contribution ● bioinformatics workflows ● workflows execute as a mix of automated ● using Nested Relational Calculus [1] for scripts and manual intervention modeling data oriented workflows – network of data centered processing steps – difficult to maintain ● many bioinformatics workflows can be modeled ● processing steps involve ● results are stored in ad-hoc ways, e.g. files, in NRC – large amounts of complex data Excel sheets ● advantages of using NRC ● sequence files, BLAST reports – difficult to manage – puts data oriented workflows on a firm ● XML data foundation – a variety of tools Existing solutions – formalism is already well understood ● EMBOSS suite, BioPerl scripts ● workflow execution engines ● natural approach ● webservices, Mascot searches – Kepler [2], Taverna [3] – BioKleisli [4] is also based on NRC ● not based on a formal data model, or too complicated and not data oriented dataflow Nested Relational Calculus NRC core operations {PatSample} ● established formalism for querying over ● constant value of a base type — “John”, true, 89 ∗ complex objects [1] ● variable of any type, either base type or complex — $patient PatSample ● complex objects are arbitrarily nested ● tuple construction — 〈 name: “John”, condition: true, age: 89 〉 collections and tuples processPat ● tuple projection — $patient.name ● collections can be sets, multi-sets and lists PatData ● empty set construction — ∅ ● set-based model: sets {} and tuples 〈〉 -1 ● singleton set construction — {$patient} ∗ ● typed query language ● set union — $patientList = $healthy ∪ $diseased {PatData} – extensible repertoire of base types ● ● Boolean, String, Number flattening of a nested set clusterPatData ● FASTA sequence file – $patientList = flatten( {$healthy, $diseased}) PepClusters ● XML, based on a DTD or XML Schema ● iteration over a set – complex types: nested sets and tuples kSubsets kSubsets – for $patient in $patientList return $patient.name ● named program definition — pBLAST: FASTA → {AccessionNr} SubSets Workflow example – description – external programs, used as a “black box” selectSets selectSets ● 3D signal maps from LC-MS analysis of blood – internal programs, help with top-down design {TestTrain} samples ● equality test for base types — $patient.name = “John” ● two groups: diseased and normal ● emptiness test for sets — $patientList = ∅ tstatistic correlation ● extracting clusters corresponding to peptides ● conditional {CAlg} tstat corr – if $p.condition then $diseased ∪ {$p} else $healthy ∪ {$p} 〈⋅,⋅〉 ● core operations can be combined into programs FSelCAlg tstatistic Workflow example – NRC programs {TestTrain} ● top-down design of the workflow ∗ – after processing and clustering of raw patient data, k-fold cross validation is performed TestTrain define dataflow($s: {PatSample}): FSelCAlg as [train] [test] [train] [test] [train] [test] kfoldCrossValidation( clusterPatData( PepClusters for $i in $s return processPat($i))) train test train test train test tstatisticDT tstatisticRF tstatisticSVM ● some internal programs ● choosing classifiers, defined by a feature – the set of peptide clusters is divided k times into a PerformanceStats selection method training set and a test set by internal program dt rf svm – t-statistic, correlation selectSets and external program kSubsets , , ● and a classification algorithm these pairs are passed to internal programs tstatistic 〈⋅ ⋅ ⋅〉 and correlation, then a tuple is constructed from – Decision Trees (DT), Random Forest (RF), Support their results CAlg Vector Machine (SVM) define kfoldCrossValidation($pc: PepClusters): FSelCAlg as -1 ● k-fold cross validation to obtain the following ∗ performance statistics for each classifier 〈 tstat: tstatictic(selectSets(kSubsets($pc))), {CAlg} corr: correlation(selectSets(kSubsets($pc))) 〉 – sensitivity, specificity – for each pair of training set and test set, external Acknowledments Workflow example – data types programs corresponding to chosen classifiers with t-statistic as feature selection method are invoked, Special thanks to J. Dutkowski, A. Gambin, B. Kluge, K. ● base types Kowalczyk, J.Tiuryn from Institute of Informatics, Warsaw then a tuple is constructed from received results University, and M. Dadlez from Institute of Biochemistry and String, Number, Boolean, Sample Biophysics, Polish Academy of Science, for providing their define tstatistic($tt: {TestTrain}): {Calg} as workflow. ● complex types for $i in $tt return References – input type 〈 dt: tstatisticDT($i.train, $i.test), [1] P. Buneman et al., “Principles of programming with complex objects PatSample = 〈 id: String, sample: Sample, diseased: Boolean 〉 rf: tstatisticRF($i.train, $i.test), and collection types”, Theoretical Computer Science, 1995, 149:3–48. – output type svm: tstatisticSVM($i.train, $i.test) 〉 [2] I. Attinas et al., “Kepler: An Extensible System for Design and FSelCAlg = 〈 tstat: {CAlg}, corr: {Calg} 〉 Execution of Scientific Workflows”, in proc. SSDBM 2004. [3] T. Oinn et al., “Taverna: a tool for the composition and enactment of with CAlg = 〈 dt: PStats, rf: PStats, svm: PStats 〉 Graphical representation bioinformatics workflows”, Bioinformatics, 2004, 20(17):3045–3054. and PStats = 〈 sensitivity: Number, specificity: Number 〉 ● dataflow language based on Petri nets and [4] S. Davidson et al., “BioKleisli: A Digital Library for Biomedical – auxiliary types Nested Relational Calculus [6] Researchers”, J. Digital Libraries, 1997, 1(1):36–53. TestTrain = 〈 test: PepClusters, train: PepClusters 〉 [5] A. Tröger et al., “A language for comprehensively supporting the in ● typing system and core operations from NRC vitro experimental process in silico”, in proc. BIBE 2004. PepClusters = { 〈 clustermass: Number, patlist: PatList 〉 } [6] J. Hidders et al., “Petri nets + nested relational calculus = dataflow PatList = { 〈 patid: String, diseased: Boolean, intensity: Number 〉 } ● top-down, hierarchical design of the data flow language”, submitted to CoopIS 2005. NRC as a formal model for expressing bioinformatics workflows A. Gambin 1, J. Hidders 2, N. Kwasnikowska 3, S. Lasota 1, J. Sroka 1, J. Tyszkiewicz 1, J. Van den Bussche 3 1 Warsaw University, 2 University of Antwerp, 3 Hasselt University Context Problems Our contribution ● bioinformatics workflows ● workflows execute as a mix of automated ● using Nested Relational Calculus [1] for scripts and manual intervention modeling data oriented workflows – network of data centered processing steps – difficult to maintain ● many bioinformatics workflows can be modeled ● processing steps involve ● results are stored in ad-hoc ways, e.g. files, in NRC – large amounts of complex data Excel sheets ● advantages of using NRC ● sequence files, BLAST reports – difficult to manage – puts data oriented workflows on a firm ● XML data foundation – a variety of tools Existing solutions – formalism is already well understood ● EMBOSS suite, BioPerl scripts ● workflow execution engines ● natural approach ● webservices, Mascot searches – Kepler [2], Taverna [3] – BioKleisli [4] is also based on NRC ● not based on a formal data model, or too complicated and not data oriented dataflow Nested Relational Calculus NRC core operations {PatSample} ● established formalism for querying over ● constant value of a base type — “John”, true, 89 ∗ complex objects [1] ● variable of any type, either base type or complex — $patient PatSample ● complex objects are arbitrarily nested ● tuple construction — 〈 name: “John”, condition: true, age: 89 〉 collections and tuples processPat ● tuple projection — $patient.name ● collections can be sets, multi-sets and lists PatData ● empty set construction — ∅ ● set-based model: sets {} and tuples 〈〉 -1 ● singleton set construction — {$patient} ∗ ● typed query language ● set union — $patientList = $healthy ∪ $diseased {PatData} – extensible repertoire of base types ● ● Boolean, String, Number flattening of a nested set clusterPatData ● FASTA sequence file – $patientList = flatten( {$healthy, $diseased}) PepClusters ● XML, based on a DTD or XML Schema ● iteration over a set – complex types: nested sets and tuples kSubsets kSubsets – for $patient in $patientList return $patient.name ● named program definition — pBLAST: FASTA → {AccessionNr} SubSets Workflow example – description – external programs, used as a “black box” selectSets selectSets ● 3D signal maps from LC-MS analysis of blood – internal programs, help with top-down design {TestTrain} samples ● equality test for base types — $patient.name = “John” ● two groups: diseased and normal ● emptiness test for sets — $patientList = ∅ tstatistic correlation ● extracting clusters corresponding to peptides ● conditional {CAlg} tstat corr – if $p.condition then $diseased ∪ {$p} else $healthy ∪ {$p} 〈⋅,⋅〉 ● core operations can be combined into programs FSelCAlg tstatistic Workflow example – NRC programs {TestTrain} ● top-down design of the workflow ∗ – after processing and clustering of raw patient data, k-fold cross validation is performed TestTrain define dataflow($s: {PatSample}): FSelCAlg as [train] [test] [train] [test] [train] [test] kfoldCrossValidation( clusterPatData( PepClusters for $i in $s return processPat($i))) train test train test train test tstatisticDT tstatisticRF tstatisticSVM ● some internal programs ● choosing

Nested Relational Calculus NRC Core Operations Workflow Example

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support