Nested Relational Calculus NRC Core Operations Workflow Example

NRC as a formal model for expressing bioinformatics workflows A. Gambin 1, J. Hidders 2, N. Kwasnikowska 3, S. Lasota 1, J. Sroka 1, J. Tyszkiewicz 1, J. Van den Bussche 3 1 Warsaw University, 2 University of Antwerp, 3 Hasselt University Context Problems Our contribution ● bioinformatics workflows ● workflows execute as a mix of automated ● using Nested Relational Calculus [1] for scripts and manual intervention modeling data oriented workflows – network of data centered processing steps – difficult to maintain ● many bioinformatics workflows can be modeled ● processing steps involve ● results are stored in ad-hoc ways, e.g. files, in NRC – large amounts of complex data Excel sheets ● advantages of using NRC ● sequence files, BLAST reports – difficult to manage – puts data oriented workflows on a firm ● XML data foundation – a variety of tools Existing solutions – formalism is already well understood ● EMBOSS suite, BioPerl scripts ● workflow execution engines ● natural approach ● webservices, Mascot searches – Kepler [2], Taverna [3] – BioKleisli [4] is also based on NRC ● not based on a formal data model, or too complicated and not data oriented dataflow Nested Relational Calculus NRC core operations {PatSample} ● established formalism for querying over ● constant value of a base type — “John”, true, 89 ∗ complex objects [1] ● variable of any type, either base type or complex — $patient PatSample ● complex objects are arbitrarily nested ● tuple construction — 〈 name: “John”, condition: true, age: 89 〉 collections and tuples processPat ● tuple projection — $patient.name ● collections can be sets, multi-sets and lists PatData ● empty set construction — ∅ ● set-based model: sets {} and tuples 〈〉 -1 ● singleton set construction — {$patient} ∗ ● typed query language ● set union — $patientList = $healthy ∪ $diseased {PatData} – extensible repertoire of base types ● ● Boolean, String, Number flattening of a nested set clusterPatData ● FASTA sequence file – $patientList = flatten( {$healthy, $diseased}) PepClusters ● XML, based on a DTD or XML Schema ● iteration over a set – complex types: nested sets and tuples kSubsets kSubsets – for $patient in $patientList return $patient.name ● named program definition — pBLAST: FASTA → {AccessionNr} SubSets Workflow example – description – external programs, used as a “black box” selectSets selectSets ● 3D signal maps from LC-MS analysis of blood – internal programs, help with top-down design {TestTrain} samples ● equality test for base types — $patient.name = “John” ● two groups: diseased and normal ● emptiness test for sets — $patientList = ∅ tstatistic correlation ● extracting clusters corresponding to peptides ● conditional {CAlg} tstat corr – if $p.condition then $diseased ∪ {$p} else $healthy ∪ {$p} 〈⋅,⋅〉 ● core operations can be combined into programs FSelCAlg tstatistic Workflow example – NRC programs {TestTrain} ● top-down design of the workflow ∗ – after processing and clustering of raw patient data, k-fold cross validation is performed TestTrain define dataflow($s: {PatSample}): FSelCAlg as [train] [test] [train] [test] [train] [test] kfoldCrossValidation( clusterPatData( PepClusters for $i in $s return processPat($i))) train test train test train test tstatisticDT tstatisticRF tstatisticSVM ● some internal programs ● choosing classifiers, defined by a feature – the set of peptide clusters is divided k times into a PerformanceStats selection method training set and a test set by internal program dt rf svm – t-statistic, correlation selectSets and external program kSubsets , , ● and a classification algorithm these pairs are passed to internal programs tstatistic 〈⋅ ⋅ ⋅〉 and correlation, then a tuple is constructed from – Decision Trees (DT), Random Forest (RF), Support their results CAlg Vector Machine (SVM) define kfoldCrossValidation($pc: PepClusters): FSelCAlg as -1 ● k-fold cross validation to obtain the following ∗ performance statistics for each classifier 〈 tstat: tstatictic(selectSets(kSubsets($pc))), {CAlg} corr: correlation(selectSets(kSubsets($pc))) 〉 – sensitivity, specificity – for each pair of training set and test set, external Acknowledments Workflow example – data types programs corresponding to chosen classifiers with t-statistic as feature selection method are invoked, Special thanks to J. Dutkowski, A. Gambin, B. Kluge, K. ● base types Kowalczyk, J.Tiuryn from Institute of Informatics, Warsaw then a tuple is constructed from received results University, and M. Dadlez from Institute of Biochemistry and String, Number, Boolean, Sample Biophysics, Polish Academy of Science, for providing their define tstatistic($tt: {TestTrain}): {Calg} as workflow. ● complex types for $i in $tt return References – input type 〈 dt: tstatisticDT($i.train, $i.test), [1] P. Buneman et al., “Principles of programming with complex objects PatSample = 〈 id: String, sample: Sample, diseased: Boolean 〉 rf: tstatisticRF($i.train, $i.test), and collection types”, Theoretical Computer Science, 1995, 149:3–48. – output type svm: tstatisticSVM($i.train, $i.test) 〉 [2] I. Attinas et al., “Kepler: An Extensible System for Design and FSelCAlg = 〈 tstat: {CAlg}, corr: {Calg} 〉 Execution of Scientific Workflows”, in proc. SSDBM 2004. [3] T. Oinn et al., “Taverna: a tool for the composition and enactment of with CAlg = 〈 dt: PStats, rf: PStats, svm: PStats 〉 Graphical representation bioinformatics workflows”, Bioinformatics, 2004, 20(17):3045–3054. and PStats = 〈 sensitivity: Number, specificity: Number 〉 ● dataflow language based on Petri nets and [4] S. Davidson et al., “BioKleisli: A Digital Library for Biomedical – auxiliary types Nested Relational Calculus [6] Researchers”, J. Digital Libraries, 1997, 1(1):36–53. TestTrain = 〈 test: PepClusters, train: PepClusters 〉 [5] A. Tröger et al., “A language for comprehensively supporting the in ● typing system and core operations from NRC vitro experimental process in silico”, in proc. BIBE 2004. PepClusters = { 〈 clustermass: Number, patlist: PatList 〉 } [6] J. Hidders et al., “Petri nets + nested relational calculus = dataflow PatList = { 〈 patid: String, diseased: Boolean, intensity: Number 〉 } ● top-down, hierarchical design of the data flow language”, submitted to CoopIS 2005. NRC as a formal model for expressing bioinformatics workflows A. Gambin 1, J. Hidders 2, N. Kwasnikowska 3, S. Lasota 1, J. Sroka 1, J. Tyszkiewicz 1, J. Van den Bussche 3 1 Warsaw University, 2 University of Antwerp, 3 Hasselt University Context Problems Our contribution ● bioinformatics workflows ● workflows execute as a mix of automated ● using Nested Relational Calculus [1] for scripts and manual intervention modeling data oriented workflows – network of data centered processing steps – difficult to maintain ● many bioinformatics workflows can be modeled ● processing steps involve ● results are stored in ad-hoc ways, e.g. files, in NRC – large amounts of complex data Excel sheets ● advantages of using NRC ● sequence files, BLAST reports – difficult to manage – puts data oriented workflows on a firm ● XML data foundation – a variety of tools Existing solutions – formalism is already well understood ● EMBOSS suite, BioPerl scripts ● workflow execution engines ● natural approach ● webservices, Mascot searches – Kepler [2], Taverna [3] – BioKleisli [4] is also based on NRC ● not based on a formal data model, or too complicated and not data oriented dataflow Nested Relational Calculus NRC core operations {PatSample} ● established formalism for querying over ● constant value of a base type — “John”, true, 89 ∗ complex objects [1] ● variable of any type, either base type or complex — $patient PatSample ● complex objects are arbitrarily nested ● tuple construction — 〈 name: “John”, condition: true, age: 89 〉 collections and tuples processPat ● tuple projection — $patient.name ● collections can be sets, multi-sets and lists PatData ● empty set construction — ∅ ● set-based model: sets {} and tuples 〈〉 -1 ● singleton set construction — {$patient} ∗ ● typed query language ● set union — $patientList = $healthy ∪ $diseased {PatData} – extensible repertoire of base types ● ● Boolean, String, Number flattening of a nested set clusterPatData ● FASTA sequence file – $patientList = flatten( {$healthy, $diseased}) PepClusters ● XML, based on a DTD or XML Schema ● iteration over a set – complex types: nested sets and tuples kSubsets kSubsets – for $patient in $patientList return $patient.name ● named program definition — pBLAST: FASTA → {AccessionNr} SubSets Workflow example – description – external programs, used as a “black box” selectSets selectSets ● 3D signal maps from LC-MS analysis of blood – internal programs, help with top-down design {TestTrain} samples ● equality test for base types — $patient.name = “John” ● two groups: diseased and normal ● emptiness test for sets — $patientList = ∅ tstatistic correlation ● extracting clusters corresponding to peptides ● conditional {CAlg} tstat corr – if $p.condition then $diseased ∪ {$p} else $healthy ∪ {$p} 〈⋅,⋅〉 ● core operations can be combined into programs FSelCAlg tstatistic Workflow example – NRC programs {TestTrain} ● top-down design of the workflow ∗ – after processing and clustering of raw patient data, k-fold cross validation is performed TestTrain define dataflow($s: {PatSample}): FSelCAlg as [train] [test] [train] [test] [train] [test] kfoldCrossValidation( clusterPatData( PepClusters for $i in $s return processPat($i))) train test train test train test tstatisticDT tstatisticRF tstatisticSVM ● some internal programs ● choosing

Nested Relational Calculus NRC Core Operations Workflow Example

Nested Complexes and Their Polyhedral Realizations Andrei Zelevinsky

COMPLEXES of TREES and NESTED SET COMPLEXES 11 K+2 and Congruent to 2 Mod K, and at Least One Internal Edge

3.2 Supplement

Complexes of Trees and Nested Set Complexes

Submodularity Helps in Nash and Nonsymmetric Bargaining Games∗

Generalized Inductive Definitions in Constructive Set Theory

Fast Exact Inference for Recursive Cardinality Models

A Graph-Theoretic Method for Organizing Overlapping Clusters Into Trees, Multiple Trees, Or Extended Trees

When Is a Right Orderable Group Locally Indicable?

Real Analysis (4Th Edition)

CS 173 Lecture 8: Set Theory (I)

Trees and Other Hierarchies in Mysql