Algorithmic Challenges in the CPAMMS Project
Wilfried Gansterer, Andreas Janecek
Research Lab Computational Applications and Technologies, University of Vienna
January 31, 2008
Wilfried Gansterer, Andreas Janecek
Workshop on Campus Grids and Scientific Applications
Outline
1 CPAMMS: The CPAMMS Project; CPAMMS - CHARMM
2 KD and DM
3 Feature Selection
4 Machine Learning
5 Efforts and Tools
The CPAMMS Project Aims
Computing Paradigms and Algorithms for Molecular Modeling and Simulation
Applications in Chemistry Molecular Biology Pharmacy
Focus on methodological questions in CS and SC: development of innovative methods and computational technologies, and their application
The CPAMMS Project Participating Institutes
Faculty of Computer Science Research Lab Computational Technologies and Applications Institute of Scientific Computing
Faculty of Chemistry Institute for Theoretical Chemistry Department of Biomolecular Structural Chemistry
Faculty of Life Sciences Department of Medicinal Chemistry
The CPAMMS Project Research Efforts
Middleware ⇒ Talk Sigi Benkner
Applications:
⇒ Quantum Chemistry on Grids ⇒ Talk Mathias Ruckenbauer
⇒ In-silico Screening ⇒ Talk Gerhard F. Ecker
⇒ Distributed Molecular Dynamics Simulation ⇒ ...
CPAMMS - CHARMM CHARMM (Chemistry at HARvard Macromolecular Mechanics)
General:
Scalable molecular dynamics simulation and analysis (evolved over 20+ years)
Free energy calculation
Replicated data model
F77/F95 code + MPI
Simulation studies of ionic liquids:
Long-term equilibrium simulation (> 100 ns)
Force calculation via Particle Mesh Ewald
System size: > 10^4 atoms
CPAMMS - CHARMM CHARMM - Platform & Activities
Computing platform: Sun Fire cluster (72 Sun Fire X4100 = 288 cores, InfiniBand interconnect)
Ongoing work:
Improving scalability
Optimizing calculation of atomic interactions
Optimal usage of MPI collective communication on InfiniBand
Comparison with other codes (NAMD, GROMACS)
The CPAMMS Project Research Efforts (cont.)
Middleware ⇒ Talk Sigi Benkner
Applications:
⇒ Quantum Chemistry on Grids ⇒ Talk Mathias Ruckenbauer
⇒ In-silico Screening ⇒ Talk Gerhard F. Ecker
⇒ Distributed Molecular Dynamics Simulation ⇒ ...
Algorithms ⇒ this talk
Outline
1 CPAMMS
2 KD and DM: Predictive QSAR Modeling; Steps in the Knowledge Discovery Process
3 Feature Selection
4 Machine Learning
5 Efforts and Tools
Predictive QSAR Modeling QSAR
Quantitative Structure Activity Relationship
Quantitative representations of molecular structures
...encoded in terms of information-preserving descriptor values
Pharmacological or biological activity ⇒ Expression describing the beneficial or adverse effects of a drug in an organism
Very generally:
Activity = f (physicochemical and/or structural properties)
Predictive QSAR Modeling QSAR Workflow
Chemical structure ⇒ descriptor calculation ⇒ descriptor values ⇒ classification or regression
Predictive QSAR Modeling Chemical Descriptors
“A chemical descriptor is the final result of a logical and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.” 1
Physical-chemical properties
Similarity Principle: compounds with similar chemical structures (i.e., descriptor similarity) usually possess similar physicochemical properties and biological activities
1 http://www.qsarworld.com/insilico-chemistry-chemical-descriptors.php
Predictive QSAR Modeling Chemical Descriptors
Huge number of obtainable chemical descriptors (> 3 000) ⇒ 1D, 2D, 3D, 4D; molecular weight, volume, solubility, lipophilicity, ...
Descriptors are computed from structural codes ⇒ Example: SMILES code (Simplified Molecular Input Line Entry Specification)
⇒ N(CCC=C1c2c(CCc3c1cccc3)cccc2)(C)C
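As an illustration of how even a trivial 0D descriptor follows from such a code, the sketch below counts heavy (non-hydrogen) atoms in a plain SMILES string with a pattern match. This is a toy, not one of the descriptor packages discussed later; a real toolkit parses the full SMILES grammar.

```python
import re

def heavy_atom_count(smiles):
    """Toy 0D descriptor: count non-hydrogen atoms in a simple SMILES
    string. Matches two-letter organic-subset symbols first (Cl, Br),
    then single-letter aliphatic/aromatic symbols. Not a full SMILES
    parser (no bracket atoms, charges, or isotopes)."""
    return len(re.findall(r"Cl|Br|[BCNOPSFIbcnops]", smiles))

# The compound from the slide: 20 carbons + 1 nitrogen
print(heavy_atom_count("N(CCC=C1c2c(CCc3c1cccc3)cccc2)(C)C"))  # → 21
```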
Steps in the Knowledge Discovery Process Overview
Figure: the KD process pipeline.
Data ⇒ Feature Extraction (feature collection, feature computation, normalization, discretization) ⇒ Feature Selection (feature subset selection, dimensionality reduction) ⇒ Data Mining (supervised and unsupervised machine learning algorithms) ⇒ Post-processing (filtering, visualization, pattern interpretation) ⇒ Information / Patterns
Steps in the Knowledge Discovery Process Feature Extraction
Steps in the Knowledge Discovery Process Feature Extraction
1. Data collection Collection of pre-classified data ⇒ Literature, NCI (National Cancer Institute), . . .
Collection of unclassified data ⇒ Compound libraries e.g., SPECS, ChemDiv, . . .
2. Extract structural code ⇒ e.g., SMILES, .sdf, .mol, . . .
3. Input for software packages that compute descriptors ⇒ Commercial examples: MOE, Adriana, Dragon ⇒ Non-commercial example: JOELib ⇒ Self-developed descriptors
Steps in the Knowledge Discovery Process Feature Extraction
Normalization: descriptor values may have different scales ⇒ mean shifting + scaling ⇒ mean = 0, standard deviation = 1
Discretization: process of transferring continuous values into discrete counterparts
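Both preprocessing steps fit in a few lines. A minimal sketch in plain Python (equal-width bins are an assumption for the discretization; other binning schemes are common):

```python
def zscore(values):
    """Mean-shift and scale a descriptor column to mean 0, std-dev 1."""
    n = len(values)
    mean = sum(values) / n
    # population standard deviation
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def discretize(values, n_bins):
    """Equal-width binning: map each continuous value to a bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

col = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, std 2
print(zscore(col))       # → [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(discretize(col, 3))  # → [0, 0, 0, 0, 1, 1, 2, 2]
```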
Steps in the Knowledge Discovery Process Feature Selection
Steps in the Knowledge Discovery Process Feature Selection
Automatic Feature Selection (FS) and Dimensionality Reduction (DR) ⇒ See later
Intuitive FS methods 2
Deletion of descriptors with low “information” content
⇒ E.g., descriptors that show more than 80% zero values
Deletion of descriptors with low variance
⇒ variance ≤ 0.5
2 Huang, J., et al., Identifying P-Glycoprotein Substrates Using a Support Vector Machine Optimized by a Particle Swarm. J. Chem. Inf. Model., 2007. 47(4): p. 1638-1647.
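The two intuitive filters above can be sketched directly; the 80% and 0.5 cutoffs are the ones quoted on the slide, and the example columns are invented:

```python
def keep_descriptor(column, zero_cutoff=0.8, var_cutoff=0.5):
    """Apply the two intuitive filters from the slide: drop a descriptor
    column if more than 80% of its values are zero, or if its variance
    is at most 0.5."""
    n = len(column)
    zero_fraction = sum(1 for v in column if v == 0) / n
    mean = sum(column) / n
    variance = sum((v - mean) ** 2 for v in column) / n
    return zero_fraction <= zero_cutoff and variance > var_cutoff

descriptors = {
    "mostly_zero": [0, 0, 0, 0, 0, 0, 0, 0, 0, 3],   # 90% zeros -> dropped
    "near_constant": [1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.0],
    "informative": [1, 4, 2, 8, 5, 7, 3, 9, 6, 2],
}
kept = [name for name, col in descriptors.items() if keep_descriptor(col)]
print(kept)  # → ['informative']
```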
Steps in the Knowledge Discovery Process Data Mining
Steps in the Knowledge Discovery Process Data Mining
Supervised Learning
Unsupervised Learning
Steps in the Knowledge Discovery Process Post-processing
Steps in the Knowledge Discovery Process Performance Measures
True/false positives/negatives (TP, FP, TN, FN) For QSAR modeling: “positives” are compounds that have a particular pharmaceutical activity
(Overall) Accuracy
Sensitivity: Pp = TP/(TP + FN) (identical to Recall)
Specificity: Pn = TN/(TN + FP)
Precision: Precision = TP/(TP + FP)
Matthews Correlation Coefficient: C = (TP×TN - FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
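All of these measures follow directly from the four confusion-matrix counts; the example numbers below are made up:

```python
from math import sqrt

def classification_measures(tp, fp, tn, fn):
    """Compute the slide's measures from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)          # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, sensitivity, specificity, precision, mcc

# e.g., 40 actives and 60 inactives; 30 actives found correctly
print(classification_measures(tp=30, fp=5, tn=55, fn=10))
```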
Outline
1 CPAMMS
2 KD and DM
3 Feature Selection: Data Dimensionality; Feature Subset Selection; Distributed Feature Selection
4 Machine Learning
5 Efforts and Tools
Data Dimensionality Benefits from Reducing Number of Features
1. Simplification More understandable model Simplifies the usage of different visualization techniques
2. Computational cost Significantly reduce the computational cost and memory requirements of the classification algorithm used
3. Classification accuracy Irrelevant and/or redundant data may “confuse” the machine learning system
Data Dimensionality Dimensionality Reduction vs. Feature Selection
1. Dimensionality Reduction Create new attributes which are a combination of the old attributes Use linear algebra to project data from high-dimensional space to lower-dimensional space ⇒ PCA, SVD, FA, . . .
2. Feature (Subset) Selection Select attributes that are a subset of original attributes Remove redundant or irrelevant features from the data set ⇒ Filters ⇒ Wrappers ⇒ Embedded approaches
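The dimensionality-reduction branch can be illustrated with a minimal PCA via the SVD of the mean-centered data matrix (synthetic data, assuming NumPy is available; not one of the parallel variants discussed later):

```python
import numpy as np

def pca_project(X, k):
    """Project an n x d descriptor matrix onto its top-k principal
    components, computed from the SVD of the mean-centered data."""
    Xc = X - X.mean(axis=0)                       # center each descriptor
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # n x k component scores

rng = np.random.default_rng(0)
# 100 synthetic compounds, 5 descriptors, only 2 underlying factors
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + rng.normal(size=5)
Z = pca_project(X, 2)
print(Z.shape)  # → (100, 2)
```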
Feature Subset Selection Feature Subset Selection Approaches
Figure: feature subset selection approaches.
A: Filters: all features ⇒ filter approach ⇒ feature subset ⇒ classification method
B: Wrappers: all features ⇒ multiple feature subsets ⇔ classification method (wrapper approach)
C: Embedded: all features ⇒ feature selection embedded in the classification method
Feature Subset Selection Filters
Classifier-agnostic, no-feedback pre-selection methods
Independent of the machine learning algorithm
Standard filters rank features according to their individual predictive power Important Methods Information gain, gain ratio, Pearson correlation Correlation and (conditional) mutual information
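A minimal ranking filter, here using the absolute Pearson correlation of each feature with the class label (invented toy data; a constant feature would additionally need a zero-variance guard):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.
    Assumes neither sequence is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank_features(features, labels):
    """Rank each feature by |correlation| with the class label."""
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "logP":   [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],  # separates classes
    "weight": [5, 1, 4, 2, 5, 1, 4, 2],                   # uninformative
}
print(rank_features(features, labels))  # → ['logP', 'weight']
```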
Feature Subset Selection Wrappers
Feedback methods that rely on the performance of a specific classifier
Classifier is used as a black box ⇒ To evaluate the quality of a set of features
Feature Subset Selection Wrappers
Forward selection vs. backward elimination ⇒ Forward selection: start with an initially empty set and add one feature in each iteration ⇒ Backward elimination: start with the full set and delete one feature in each iteration
Exhaustive search too expensive ⇒ Heuristics ⇒ Distributed processing
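A sketch of greedy forward selection, using leave-one-out 1-nearest-neighbour accuracy as the wrapped black-box classifier (an arbitrary choice for illustration; any classifier could be plugged in, and the data are invented):

```python
def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the feature indices in `subset` (the black-box
    evaluation step of the wrapper)."""
    if not subset:
        return 0.0
    correct = 0
    for i in range(len(X)):
        dist = lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in subset)
        nearest = min((j for j in range(len(X)) if j != i), key=dist)
        correct += y[nearest] == y[i]
    return correct / len(X)

def forward_selection(X, y, n_features):
    """Greedy forward selection: start empty, add the single feature
    that most improves the wrapped classifier, stop when none helps."""
    selected, best = [], 0.0
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(loo_1nn_accuracy(X, y, selected + [f]), f)
                  for f in candidates]
        if not scored:
            break
        acc, f = max(scored)
        if acc <= best:
            break
        selected.append(f)
        best = acc
    return selected, best

# feature 0 separates the classes, feature 1 is noise
X = [[0.0, 5], [0.2, 1], [0.1, 9], [1.0, 5], [1.2, 1], [1.1, 9]]
y = [0, 0, 0, 1, 1, 1]
print(forward_selection(X, y, 2))  # → ([0], 1.0)
```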
Feature Subset Selection Wrappers
Example: Genetic Algorithms
Figure: GA wrapper loop: initial population ⇒ evaluation (ML algorithm) ⇒ continue? yes: reproduction (crossover, mutation, selection) ⇒ new population; no: final population
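This genetic-algorithm wrapper can be sketched over feature bitmasks. The fitness function here is a toy stand-in for the evaluation-by-ML-algorithm box, and the "informative" indices are invented for the example:

```python
import random

random.seed(42)

N_FEATURES = 10
GOOD = {1, 4, 7}   # hypothetical informative descriptor indices

def fitness(mask):
    """Toy stand-in for the evaluation step: reward selecting the
    (here known) informative features, with a small penalty on subset
    size. A real wrapper would train and score a classifier."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & GOOD) - 0.1 * len(chosen)

def evolve(pop_size=20, generations=40, p_mut=0.05):
    """Evolve bitmasks over the feature set: selection keeps the better
    half, crossover and mutation refill the population."""
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]                   # elitist selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_FEATURES)       # one-point x-over
            child = [bit ^ (random.random() < p_mut)    # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print([i for i, bit in enumerate(best) if bit])
```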
Feature Subset Selection Filters vs. Wrappers
Filters + Computationally more efficient + Easier to apply - No feedback, no learning
Wrappers + Optimized for a specific classification method - Computationally more demanding
Feature Subset Selection Embedded Approaches
Act as an integral part of the machine learning algorithm
During the operation of the machine learning algorithm, the algorithm itself decides which attributes to use and which to ignore
Feature Subset Selection Embedded Approaches
Example 1: Decision Tree
Example 2: Random Forest
Ensembles of unpruned classification trees
Each tree is grown on a bootstrap sample of the data, with random feature selection
Each tree “votes” for a class; choose the class having the most votes over all trees in the forest
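A miniature of the voting scheme, with one-node threshold "trees" standing in for unpruned CART trees (invented toy data; the stump rule is a deliberately crude simplification, not the real CART split criterion):

```python
import random
from collections import Counter

random.seed(1)

def train_stump(sample, feature):
    """One-node 'tree': threshold on a single feature at the midpoint
    between the two class means."""
    mean = lambda cls: (sum(x[feature] for x, y in sample if y == cls)
                        / max(1, sum(1 for _, y in sample if y == cls)))
    thr = (mean(0) + mean(1)) / 2
    flip = mean(1) < mean(0)
    return lambda x: int((x[feature] > thr) ^ flip)

def random_forest(data, n_trees=25, n_features=2):
    """Random forest in miniature: each tree sees a bootstrap sample of
    the data and a randomly selected feature; prediction is the
    majority vote over all trees."""
    trees = []
    for _ in range(n_trees):
        sample = [random.choice(data) for _ in data]   # bootstrap sample
        feature = random.randrange(n_features)         # feature subsampling
        trees.append(train_stump(sample, feature))
    return lambda x: Counter(t(x) for t in trees).most_common(1)[0][0]

data = [((0.1, 1.2), 0), ((0.3, 0.8), 0), ((0.2, 1.0), 0),
        ((0.9, 3.1), 1), ((1.1, 2.9), 1), ((1.0, 3.0), 1)]
predict = random_forest(data)
print(predict((0.15, 1.1)), predict((1.05, 3.0)))
```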
Distributed Feature Selection Motivation
Data is too large to handle centrally: ⇒ Divide and conquer ⇒ Split-up data and computational power
Data is already distributed: ⇒ Security risks
Again: ⇒ Dimensionality reduction vs. feature subset selection
Distributed Feature Selection Distributed Dimensionality Reduction
Mostly parallel implementations or approximations of linear algebra methods Examples:
PCA: Collective PCA from distributed, heterogeneous data [H. Kargupta et al. 2000]
SVD: Computing the generalized Singular Value Decomposition via parallel algorithms [Z. Bai et al. 1994]
Based on parallel matrix computation: ⇒ High dimensional correlation matrices ⇒ High dimensional covariance matrices [A. Wagner et al. 2004]
Distributed Feature Selection Distributed Feature Subset Selection
Rather few attempts to parallelize feature subset selection
Filter Approaches
Feature selection in images [B. M. Miller et al. 1995]
Wrapper approaches - genetic algorithms
Parallel variant of GA [W. F. Punch et al. 1993] Individuals (subsets) passed out to individual processors Evaluation using kNN
Parallel GA-based Wrapper [N. Melab et al. 2002] Parallel multi-threaded evaluation of population
Outline
1 CPAMMS
2 KD and DM
3 Feature Selection
4 Machine Learning: Learning Methods; Distributed Machine Learning
5 Efforts and Tools
Learning Methods Unsupervised Learning
Set of unlabeled training examples (x_i); work without the class information of a sample
I.e., no a priori known output at the time of classification
No distinction between explanatory and dependent variables
Examples:
⇒ Clustering
⇒ Self-Organizing Maps: non-linear mapping of a high-dimensional input space to a low-dimensional space
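As a concrete unsupervised example, a minimal k-means (Lloyd's algorithm) on invented 2-D points:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points: assign each point to its
    nearest centroid, then move each centroid to the mean of its
    cluster. No class labels are used anywhere."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                  + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]        # keep empty clusters in place
            for i, cl in enumerate(clusters)]
    return centroids

pts = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # cluster near (0.1, 0.1)
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]   # cluster near (5.0, 5.0)
print(sorted(kmeans(pts, 2)))
```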
Learning Methods Supervised Learning
Set of labeled training examples (x_i, y_i); task: produce a classifier that maps an object to its class
Distinction between explanatory and dependent variables
Examples:
⇒ Decision Trees
⇒ k-nearest-neighbor
⇒ Bayesian classifiers
⇒ Artificial Neural Networks
⇒ Support Vector Machines
⇒ Ensemble techniques: Bagging, Boosting, etc.
Distributed Machine Learning Motivation
Choice between models: Classification performance
But what if performance is similar? Look at model complexity:
⇒ Some algorithms are inherently complex (non-polynomial)
⇒ Number of descriptors (used in the model)
⇒ Computation time
⇒ Model interpretability
⇒ Example: ordering by algorithmic complexity: DT < NB < SVM < NN
! Trade-off between classification performance and runtime
Distributed Machine Learning Difficulties
Distributed or parallel building of a classifier model is a complex, massively parallel task Data mining often requires huge amounts of resources in storage space and computation time ⇒ Scalability ⇒ Distribution of workload
Data may need to be distributed over several databases ⇒ Physically distributed data ⇒ Prone to security risks
Distributed Machine Learning Examples
Decision Trees: Decision tree induction in peer-to-peer systems [K. Bhaduri et al. 2007]
Hierarchical decision tree induction in distributed genomic databases [A. Bar-Or et al. 2005]
Support Vector Machines:
Multi-way distributed SVM algorithms [F. Poulet et al. 2003]
Parallel cascade SVMs [V. Vapnik et al. 2004]
Distributed Machine Learning Distributed Machine Learning Methods
Regression:
Distributed multivariate regression using wavelet-based collective data mining [D. Hershberger et al. 2002]
Distributed multivariate regression in peer-to-peer networks [K. Bhaduri et al. 2008]
K-means clustering:
K-means clustering over peer-to-peer networks [S. Datta et al. 2005]
Distributed data mining in peer-to-peer networks using K-means clustering [H. Kargupta et al. 2007]
Outline
1 CPAMMS
2 KD and DM
3 Feature Selection
4 Machine Learning
5 Efforts and Tools: Parallelization Concepts; Distributed Life Science Projects; Starting Point
Parallelization Concepts Easy
Building a model on one machine and distributing the model
⇒ Assumption: Model already available ⇒ Labeling of test data using a previously built classifier ⇒ Compute test statistics in parallel on several machines for different subsets
Run cross-validation on various machines ⇒ Building a model on training set ⇒ Testing model on test set
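On a single host, the "one fold per machine" idea can be sketched with a thread pool; the majority-class "model" below is a trivial stand-in for an arbitrary expensive learner, and the data are invented:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_fold(train, test):
    """Build a model on the training split and score it on the test
    split. The 'model' is a majority-class classifier, standing in
    for an arbitrary expensive learner."""
    majority = max(set(y for _, y in train), key=[y for _, y in train].count)
    return sum(1 for _, y in test if y == majority) / len(test)

def parallel_cross_validation(data, k=5):
    """k-fold cross-validation with each fold evaluated concurrently,
    mirroring 'one fold per machine' with threads on one host."""
    folds = [data[i::k] for i in range(k)]
    jobs = [(sum((folds[j] for j in range(k) if j != i), []), folds[i])
            for i in range(k)]
    with ThreadPoolExecutor(max_workers=k) as pool:
        scores = list(pool.map(lambda job: evaluate_fold(*job), jobs))
    return sum(scores) / k

data = [(x, 0) for x in range(7)] + [(x, 1) for x in range(3)]
print(parallel_cross_validation(data))  # → 0.7
```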
Parallelization Concepts Easy/Hard
Building a model Easy: Building various classifiers on various machines ⇒ One model is built on exactly one machine ⇒ Train several classifiers in parallel with different parameters
Harder: Building a classifier on distributed data Data cannot be moved because of: ⇒ Security problems ⇒ Size
Very hard: Building one classifier across various machines ⇒ Massive parallelism ⇒ Trade-off between communication and time/resource savings
Distributed Life Science Projects Distributed Computing Projects
Distributed tasks with low inter-communication
Solve a large problem by:
Giving small parts of the problem to many computers
Combining the solutions for the parts into a solution for the whole problem
Projects are too large to be solved in a reasonable amount of time without distributing the problem
Topics: Look for extra-terrestrial radio signals Look for large prime numbers (> ten million digits) Find more effective drugs to fight cancer or AIDS
Distributed Life Science Projects Distributed Computing Platforms - BOINC
Berkeley Open Infrastructure for Network Computing
http://boinc.ssl.berkeley.edu/
Software platform for volunteer desktop grid computing
Uses millions of volunteer computers as a parallel supercomputer
Famous examples: distributed.net, SETI@home
Rough estimates of needed resources:
⇒ Precondition: existing application
⇒ Three person-months: 1 system administrator, 1 programmer, 1 web developer
⇒ About $5 000 for hardware
Distributed Life Science Projects Distributed Computing Platforms - Others
World Community Grid ⇒ Uses BOINC client software
Darwin@Home ⇒ Observe lifelike evolutionary processes in virtual or robotic space
OpenMacGrid ⇒ Computing grid built up entirely of Macs
CPUShare ⇒ Low cost peer-to-peer virtual supercomputer available to everybody
Distributed Life Science Projects Some Projects Based on BOINC
Predictor@home ⇒ Goal: predict protein structures from protein sequences
CHRONOS (Chromosomal Nostalgia) ⇒ Goal: discover the relationships between the 24 chromosomes of the human genome
Rosetta@home ⇒ Goal: predict and design protein structures, and protein-protein and protein-ligand interactions
SIMAP (Similarity Matrix of Proteins) ⇒ Goal: compute similarities between all known protein sequences
Distributed Life Science Projects Some Projects Based on BOINC cont.
Docking@home ⇒ Goal: enable adaptive multi-scale modeling of docking applications
MalariaControl.net (part of Africa@home) ⇒ Goal: simulate the ways that the malaria parasite spreads in Africa, and its effects on human health
Proteins@home ⇒ Goal: inverse protein folding problem
POEM@home (Protein Optimization w. Energy Methods) ⇒ Goal: predict the biologically active structure of proteins
Distributed Life Science Projects Other Projects (Not Based on BOINC)
Folding@home ⇒ Goal: simulation of protein folding ⇒ Based on Mithral Client-Server Software Development Kit
Parabon Computation ⇒ Goal: support several areas of Life Sciences research (e.g., microarray gene expression pattern, protein folding. . . ) ⇒ Java client
fightAIDS@home ⇒ Goal: discovering new drugs, building on our growing knowledge of the structural biology of AIDS ⇒ BOINC client for Linux
Distributed Life Science Projects Our Objectives
“DrugScreening@univie”
Cooperation between
⇒ Faculty of Computer Science and
⇒ Faculty of Life Sciences
First prototype based on CONDOR under development
Long term goal:
⇒ Auto-optimization of features and classifiers
Starting Point
Grid Weka http://cssa.ucd.ie/xin/weka/Grid Weka.htm
Pure Java implementation for Windows and Linux systems
Custom communication interface (Java object serialization)
Server and client programs
+ Pros: easy to implement, low software requirements
⇒ JRE (Java Runtime Environment)
⇒ .jar files for server and client
⇒ config file
Interesting starting point for a small grid
- Cons:
No GUI support (server and client side)
Explicit information about the servers must be available on the client side (and manually included in the config file)
No grid middleware support
Starting Point
Weka4WS http://grid.deis.unical.it/weka4ws
Algorithms provided by the WEKA library are exposed as Web Services and can be deployed on available grid nodes
User nodes (clients) and computing nodes (servers)
+ Pros: integration/interoperability with standard grid environments
⇒ Uses security and data management provided by the grid middleware
GUI extension
- Cons:
Higher software requirements
Requires a good understanding of the underlying grid software
Works only on Unix/Linux platforms (client AND server)
Starting Point
Distributed Data Mining Simulator http://rapid-i.com/
Plug-in for RapidMiner (DM environment)
Simulation! Prior to real distribution
Experiments are not actually executed on distributed network nodes
Perform experiments with diverse network structures and communication patterns
+ Pros: identification of optimal methods and parameters; efficient in the development stage
- Cons: cannot replace testing on a real system; simulation may be much easier than real distribution
Acknowledgments
Michael Demel
Gerhard F. Ecker
Manfred Muecke
Hannes Schabauer