
Algorithmic Challenges in the CPAMMS Project

Wilfried Gansterer, Andreas Janecek

Research Lab Computational Technologies and Applications, University of Vienna

January 31, 2008


Outline

1 CPAMMS
  The CPAMMS Project
  CPAMMS - CHARMM

2 KD and DM

3 Feature Selection

4 Machine Learning

5 Efforts and Tools


The CPAMMS Project: Aims

Computing Paradigms and Algorithms for Molecular Modeling and Simulation

Applications in Molecular Biology and Pharmacy

Focus on methodological questions in CS and SC, on the development of innovative methods and computational technologies, and on their application


The CPAMMS Project: Participating Institutes

Faculty of Computer Science
  Research Lab Computational Technologies and Applications
  Institute of Scientific Computing

Faculty of Chemistry
  Institute for Theoretical Chemistry
  Department of Biomolecular Structural Chemistry

Faculty of Life Sciences
  Department of Medicinal Chemistry


The CPAMMS Project: Research Efforts

Middleware ⇒ Talk Sigi Benkner

Applications
  Quantum Chemistry on Grids ⇒ Talk Mathias Ruckenbauer
  In-silico Screening ⇒ Talk Gerhard F. Ecker
  Distributed Simulation ⇒ ...


CPAMMS - CHARMM: CHARMM (Chemistry at HARvard Macromolecular Mechanics)

General
  Scalable molecular dynamics simulation and analysis (evolved over 20+ years)
  Free energy calculation
  Replicated data model
  F77/F95 code + MPI

Simulation studies of ionic liquids
  Long-term equilibrium simulation (> 100 ns)
  Force calculation via Particle Mesh Ewald
  System size: > 10^4 atoms


CPAMMS - CHARMM: Platform & Activities

Computing platform
  Sun Fire cluster (72 Sun Fire X4100 = 288 cores, InfiniBand interconnect)

Ongoing work:
  Improving scalability
  Optimizing calculation of atomic interactions
  Optimal usage of MPI collective communication on InfiniBand
  Comparison with other codes (NAMD, GROMACS)


CPAMMS - CHARMM: Research Efforts

Middleware ⇒ Talk Sigi Benkner

Applications
  Quantum Chemistry on Grids ⇒ Talk Mathias Ruckenbauer
  In-silico Screening ⇒ Talk Gerhard F. Ecker
  Distributed Molecular Dynamics Simulation ⇒ ...

Algorithms


Outline

1 CPAMMS

2 KD and DM
  Predictive QSAR Modeling
  Steps in the Knowledge Discovery Process

3 Feature Selection

4 Machine Learning

5 Efforts and Tools


Predictive QSAR Modeling: QSAR

Quantitative Structure Activity Relationship

Quantitative representations of molecular structures

...encoded in terms of information-preserving descriptor values

Pharmacological or biological activity ⇒ Expression describing the beneficial or adverse effects of a drug in an organism

Very general:

Activity = f (physicochemical and/or structural properties)


Predictive QSAR Modeling: QSAR Workflow

[Workflow] Chemical structure ⇒ descriptor calculation ⇒ descriptor numbers ⇒ classification or regression


Predictive QSAR Modeling: Chemical Descriptors

“A chemical descriptor is the final result of a logical and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.” 1

Physical-chemical properties

Similarity Principle Compounds with similar chemical structures (i.e., descriptor similarity) usually possess similar physicochemical properties and biological activities

1 http://www.qsarworld.com/insilico-chemistry-chemical-descriptors.php


Predictive QSAR Modeling: Chemical Descriptors

Huge number of obtainable chemical descriptors (> 3,000) ⇒ 1D, 2D, 3D, 4D; molecular weight, volume, solubility, lipophilicity, ...

Descriptors are computed using structural codes
⇒ Example: SMILES code (Simplified Molecular Input Line Entry Specification)

⇒ N(CCC=C1c2c(CCc3c1cccc3)cccc2)(C)C
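As an illustration, a minimal sketch (toy Python, not one of the descriptor packages named later in this deck) of how even trivial descriptors can be read straight off a SMILES string:

```python
import re

def toy_descriptors(smiles: str) -> dict:
    """Toy sketch: two trivial descriptors read directly from a SMILES
    string; it ignores bracket atoms, charges, isotopes, etc."""
    # Match two-letter symbols first so "Cl" is not counted as carbon
    # plus "l"; lowercase letters denote aromatic atoms.
    atoms = re.findall(r"Cl|Br|[BCNOPSFI]|[bcnops]", smiles)
    # Each ring-closure digit appears twice, so #rings ~ #digits / 2.
    rings = len(re.findall(r"\d", smiles)) // 2
    return {"heavy_atoms": len(atoms), "rings": rings}

print(toy_descriptors("N(CCC=C1c2c(CCc3c1cccc3)cccc2)(C)C"))
# {'heavy_atoms': 21, 'rings': 3}
```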


Steps in the Knowledge Discovery Process: Overview

[Flowchart]
Data ⇒ Feature Extraction (feature collection, feature computation)
⇒ Pre-processing (normalization, discretization)
⇒ Feature Selection (feature subset selection, dimensionality reduction)
⇒ Data Mining (supervised and unsupervised machine learning algorithms)
⇒ Post-processing (filtering patterns, visualization, pattern interpretation, ...)
⇒ Information



Steps in the Knowledge Discovery Process: Feature Extraction

1. Data collection
   Collection of pre-classified data ⇒ literature, NCI (National Cancer Institute), ...
   Collection of unclassified data ⇒ compound libraries, e.g., SPECS, ChemDiv, ...

2. Extract structural code
   e.g., SMILES, .sdf, .mol, ...

3. Input for packages to compute descriptors
   Commercial examples: MOE, Adriana, Dragon
   Non-commercial examples: JOELib
   Self-developed descriptors


Steps in the Knowledge Discovery Process: Feature Extraction

Normalization
  Descriptor values may have different scales
  Mean shifting + scaling ⇒ mean = 0, standard deviation = 1

Discretization
  Process of transforming continuous values into discrete counterparts
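A minimal sketch of both pre-processing steps (illustrative NumPy code; the function names and the bin count are our choices, not part of the method description):

```python
import numpy as np

def normalize(X):
    """Mean shifting + scaling: each descriptor column gets mean 0 and
    standard deviation 1."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0                  # leave constant columns alone
    return (X - mu) / sigma

def discretize(X, bins=5):
    """Equal-width discretization: map continuous values to integer levels."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / bins
    width[width == 0] = 1.0                  # guard constant columns
    return np.minimum((X - lo) // width, bins - 1).astype(int)

X = np.random.rand(100, 4) * [1, 10, 100, 1000]   # descriptors on different scales
Xn = normalize(X)
print(Xn.mean(axis=0).round(6), Xn.std(axis=0).round(6))  # ~0 and ~1 per column
print(discretize(X)[:3])                                  # integer levels 0..4
```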



Steps in the Knowledge Discovery Process: Feature Selection

Automatic Feature Selection (FS) and Dimensionality Reduction (DR) ⇒ See later

Intuitive FS methods 2

Deletion of descriptors with low “information” content
⇒ E.g., descriptors that show more than 80% zero values

Deletion of descriptors with low variance
⇒ E.g., variance ≤ 0.5 (both filters are sketched in code below)

2 Huang, J., et al., Identifying P-Glycoprotein Substrates Using a Support Vector Machine Optimized by a Particle Swarm. J. Chem. Inf. Model., 2007. 47(4): p. 1638-1647.
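A minimal sketch of these two intuitive filters (illustrative NumPy code; the thresholds are the ones quoted above):

```python
import numpy as np

def intuitive_filter(X, max_zero_frac=0.8, min_var=0.5):
    """Drop descriptors with more than 80% zero values or with
    variance <= 0.5 (thresholds from the slide); returns the reduced
    matrix and the indices of the surviving descriptors."""
    zero_frac = (X == 0).mean(axis=0)   # fraction of zeros per descriptor
    var = X.var(axis=0)                 # variance per descriptor
    keep = (zero_frac <= max_zero_frac) & (var > min_var)
    return X[:, keep], np.flatnonzero(keep)
```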



Steps in the Knowledge Discovery Process: Data Mining

Supervised Learning

Unsupervised Learning


Steps in the Knowledge Discovery Process: Post-processing

[Flowchart as above, with the Post-processing stage highlighted]


Steps in the Knowledge Discovery Process: Performance Measures

True/false positives/negatives (TP, FP, TN, FN) For QSAR modeling: “positives” are compounds that have a particular pharmaceutical activity

(Overall) Accuracy: Acc = (TP + TN)/(TP + FP + TN + FN)

Sensitivity: Pp = TP/(TP + FN) (identical to Recall)

Specificity: Pn = TN/(TN + FP)

Precision: P = TP/(TP + FP)

Matthews Correlation Coefficient:
C = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
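A minimal sketch computing all of these measures from the four raw counts (illustrative Python; the example counts are made up):

```python
import math

def qsar_metrics(tp, fp, tn, fn):
    """All measures from this slide, from raw counts (assumes no empty
    margins; a zero MCC denominator is mapped to 0)."""
    acc  = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)               # sensitivity / recall (Pp)
    spec = tn / (tn + fp)               # specificity (Pn)
    prec = tp / (tp + fp)               # precision
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc  = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": acc, "sensitivity": sens,
            "specificity": spec, "precision": prec, "mcc": mcc}

print(qsar_metrics(tp=40, fp=10, tn=45, fn=5))
```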


Outline

1 CPAMMS

2 KD and DM

3 Feature Selection
  Data Dimensionality
  Feature Subset Selection
  Distributed Feature Selection

4 Machine Learning

5 Efforts and Tools


Data Dimensionality: Benefits from Reducing the Number of Features

1. Simplification
   More understandable model
   Simplifies the use of different visualization techniques

2. Computational cost
   Significantly reduces the computational cost and memory requirements of the classification algorithm used

3. Classification accuracy
   Irrelevant and/or redundant data may “confuse” the machine learning system


Data Dimensionality: Dimensionality Reduction vs. Feature Selection

1. Dimensionality Reduction
   Create new attributes that are combinations of the old attributes
   Use linear algebra to project the data from a high-dimensional space to a lower-dimensional space
   ⇒ PCA, SVD, FA, ...

2. Feature (Subset) Selection
   Select attributes that are a subset of the original attributes
   Remove redundant or irrelevant features from the data set
   ⇒ Filters
   ⇒ Wrappers
   ⇒ Embedded approaches
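A minimal sketch contrasting the two ideas on random data (illustrative NumPy; PCA is done via SVD, and the variance criterion is only a stand-in for the filters and wrappers discussed next):

```python
import numpy as np

X = np.random.rand(200, 50)          # 200 compounds x 50 descriptors
Xc = X - X.mean(axis=0)              # center the data

# 1. Dimensionality reduction: 5 NEW attributes (linear combinations)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_dr = Xc @ Vt[:5].T                 # project onto the top 5 components

# 2. Feature selection: 5 ORIGINAL attributes (here: highest variance)
keep = np.argsort(X.var(axis=0))[-5:]
X_fs = X[:, keep]

print(X_dr.shape, X_fs.shape)        # (200, 5) (200, 5)
```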


Feature Subset Selection: Approaches

[Diagram]
A (Filters): all features ⇒ filter approach ⇒ feature subset ⇒ classification method
B (Wrappers): all features ⇒ multiple feature subsets ⇔ classification method (wrapper approach)
C (Embedded approaches): all features ⇒ classification method with embedded feature selection


Feature Subset Selection: Filters

Classifier-agnostic, no-feedback pre-selection methods


Independent of the machine learning algorithm

Standard filters rank features according to their individual predictive power

Important methods:
  Information gain, gain ratio
  Pearson correlation
  Correlation and (conditional) mutual information
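A minimal sketch of one such filter, information gain, for discretized features (illustrative NumPy code; the ranking helper at the end assumes a feature matrix X and label vector y):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a discrete label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(x, y):
    """Information gain of one discrete feature x w.r.t. labels y:
    H(y) - sum_v P(x = v) * H(y | x = v)."""
    h_cond = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return entropy(y) - h_cond

# Rank all (discretized) descriptors by individual predictive power:
# ranking = sorted(range(X.shape[1]),
#                  key=lambda j: information_gain(X[:, j], y), reverse=True)
```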


Feature Subset Selection: Wrappers

Feedback methods that rely on the performance of a specific classifier


Classifier is used as a black box ⇒ To evaluate the quality of a set of features


Feature Subset Selection: Wrappers

Forward selection vs. backward elimination
⇒ Forward selection: start with an initially empty set and add one additional feature in each iteration (sketched below)
⇒ Backward elimination: start with the full set and delete one feature in each iteration

Exhaustive search too expensive ⇒ Heuristics ⇒ Distributed processing
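A minimal sketch of the forward-selection wrapper (the classifier is an exchangeable black box; here scikit-learn's kNN with cross-validated accuracy as the quality measure, both our choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, clf=None, max_features=10):
    """Greedily add the feature that most improves the black-box
    classifier; stop when no candidate helps (a heuristic, not an
    exhaustive search)."""
    clf = clf or KNeighborsClassifier()
    selected, best = [], -np.inf
    while len(selected) < max_features:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            break
        scores = {j: cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
                  for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:      # no candidate improves the score
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected, best
```

Each iteration evaluates all remaining candidates independently, which is exactly the part that distributed processing can parallelize.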


Feature Subset Selection: Wrappers

Example: Genetic Algorithms

[Diagram] Initial population ⇒ evaluation (ML algorithm) ⇒ continue? yes: reproduction via crossover, mutation, and selection yields a new population, which is evaluated again; no: final population
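A minimal sketch of this GA loop for feature selection (illustrative; the population size, rates, and the kNN-based fitness via scikit-learn are our choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Evaluate one individual (bit mask over features) with the ML
    algorithm as a black box; an empty mask gets the worst score."""
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

def ga_feature_selection(X, y, pop_size=20, generations=30, p_mut=0.05):
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5            # initial population
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]   # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                 # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n) < p_mut           # mutation: flip bits
            children.append(child)
        pop = np.vstack([parents] + children)        # new population
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()]                      # best final individual
```

All individuals of a generation can be evaluated independently, which is what the parallel GA variants on a later slide exploit.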


Feature Subset Selection: Filters vs. Wrappers

Filters
  + Computationally more efficient
  + Easier to apply
  - No feedback, no learning

Wrappers
  + Optimized for a specific classification method
  - Computationally more demanding


Feature Subset Selection: Embedded Approaches

Act as an integral part of the machine learning algorithm


During the operation of the machine learning algorithm, the algorithm itself decides which attributes to use and which to ignore


Feature Subset Selection: Embedded Approaches

Example 1: Decision Tree

Example 2: Random Forest
  Ensembles of unpruned classification trees
  Bootstrap samples plus random feature selection
  Each tree “votes” for a class; choose the class having the most votes over all trees in the forest
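A minimal sketch of the voting idea (illustrative only: it draws one random feature subset per tree, whereas real random forests re-sample features at every split, and it borrows scikit-learn's decision tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def train_forest(X, y, n_trees=50):
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))                 # features given to each tree
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(n, size=n)          # bootstrap sample
        cols = rng.choice(d, size=k, replace=False)
        tree = DecisionTreeClassifier()         # unpruned by default
        tree.fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict(forest, X):
    """Each tree votes; the class with the most votes over all trees
    wins (assumes integer class labels >= 0)."""
    votes = np.array([t.predict(X[:, cols]) for t, cols in forest])
    return np.array([np.bincount(v.astype(int)).argmax() for v in votes.T])
```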


Distributed Feature Selection: Motivation

Data is too large to handle centrally: ⇒ Divide and conquer ⇒ Split-up data and computational power

Data is already distributed: ⇒ Security risks

Again: ⇒ Dimensionality reduction vs. feature subset selection


Distributed Feature Selection: Distributed Dimensionality Reduction

Mostly parallel implementations or approximations of linear algebra methods

Examples:

PCA: Collective PCA from distributed, heterogeneous data [H. Kargupta et al. 2000]

SVD: Computing the generalized singular value decomposition via parallel algorithms [Z. J. Bai 1994]

Based on parallel matrix computation: ⇒ High dimensional correlation matrices ⇒ High dimensional covariance matrices [A. Wagner et al. 2004]


Distributed Feature Selection: Distributed Feature Subset Selection

Rather few attempts to parallelize feature subset selection

Filter Approaches

Feature selection in images [B. M. Miller et al. 1995]

Wrapper approaches - genetic algorithms

Parallel variant of GA [W. F. Punch et al. 1993]
  Individuals (subsets) passed out to individual processors
  Evaluation using kNN

Parallel GA-based wrapper [N. Melab et al. 2002]
  Parallel multi-threaded evaluation of the population


Outline

1 CPAMMS

2 KD and DM

3 Feature Selection

4 Machine Learning
  Learning Methods
  Distributed Machine Learning

5 Efforts and Tools


Learning Methods: Unsupervised Learning

Set of unlabeled training examples (x_i)
Works without the class information of a sample

I.e., no a priori known output at the time of classification

No distinction between explanatory and dependent variables
Examples:
⇒ Clustering
⇒ Self-Organizing Maps
   Non-linear mapping of a high-dimensional input space to a low-dimensional space


Learning Methods: Supervised Learning

Set of labeled training examples (x_i, y_i)
Task: produce a classifier that maps an object to its class

Distinction between explanatory and dependent variables
Examples:
⇒ Decision Trees
⇒ k-nearest-neighbor
⇒ Bayesian classifiers
⇒ Artificial Neural Networks
⇒ Support Vector Machines
⇒ Ensemble techniques: bagging, boosting, etc.
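A minimal sketch of the supervised setting with one of the examples above, k-nearest-neighbor (illustrative, on synthetic data, using scikit-learn):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(100, 5)              # labeled examples (x_i, y_i)
y_train = (X_train[:, 0] > 0.5).astype(int)   # toy "active / inactive" label

# Produce a classifier that maps an object to its class:
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(clf.predict(np.random.rand(4, 5)))      # classes for 4 new objects
```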


Distributed Machine Learning: Motivation

Choice between models: Classification performance

But for similar performance? Look at model complexity:
⇒ Some algorithms are inherently complex (non-polynomial)
⇒ Number of descriptors (used in the model)
⇒ Computation time
⇒ Model interpretability
⇒ Example: on the basis of algorithmic complexity, DT < NB < SVM < NN
! Trade-off between classification performance and runtime


Distributed Machine Learning: Difficulties

Distributed or parallel building of a classifier model is a complex and massively parallel task
Data mining often requires huge amounts of resources in storage space and computation time
⇒ Scalability
⇒ Distribution of workload

Data may need to be distributed over several databases ⇒ Physically distributed data ⇒ Prone to security risks


Distributed Machine Learning: Examples

Decision Trees: Decision tree induction in peer-to-peer systems [K. Bhaduri et al. 2007]

Hierarchical decision tree induction in distributed genomic databases [A. Bar-Or et al. 2005]

Support Vector Machines:

Multi-way distributed SVM algorithms [F. Poulet et al. 2003]

Parallel cascade SVMs [V. Vapnik et al. 2004]


Distributed Machine Learning: Methods

Regression:
  Distributed multivariate regression using wavelet-based collective data mining [D. Hershberger et al. 2002]
  Distributed multivariate regression in peer-to-peer networks [K. Bhaduri et al. 2008]

K-means clustering:
  K-means clustering over peer-to-peer networks [S. Datta et al. 2005]
  Distributed data mining in peer-to-peer networks using K-means clustering [H. Kargupta et al. 2007]


Outline

1 CPAMMS

2 KD and DM

3 Feature Selection

4 Machine Learning

5 Efforts and Tools
  Parallelization Concepts
  Distributed Life Science Projects
  Starting Point


Parallelization Concepts: Easy

Building a model on one machine and distributing the model

⇒ Assumption: model already available
⇒ Labeling of test data using a previously built classifier
⇒ Compute test statistics in parallel on several machines for different subsets

Run cross-validation on various machines
⇒ Building a model on the training set
⇒ Testing the model on the test set
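A minimal sketch of this cross-validation case (illustrative; local worker processes stand in for the various machines, and scikit-learn provides the folds and the classifier):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate_fold(args):
    """One independent unit of work: build the model on the training
    part, test it on the test part."""
    X, y, train_idx, test_idx = args
    clf = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    return clf.score(X[test_idx], y[test_idx])

def parallel_cv(X, y, k=10):
    folds = [(X, y, tr, te) for tr, te in KFold(n_splits=k).split(X)]
    with ProcessPoolExecutor() as pool:       # one worker per "machine"
        return list(pool.map(evaluate_fold, folds))

if __name__ == "__main__":
    X = np.random.rand(200, 10)
    y = (X[:, 0] > 0.5).astype(int)           # toy labels
    print(np.mean(parallel_cv(X, y)))
```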


Parallelization Concepts: Easy/Hard

Building a model

Easy: building various classifiers on various machines
⇒ One model is built on exactly one machine
⇒ Train several classifiers in parallel with different parameters

Harder: building a classifier on distributed data
Data cannot be moved because of:
⇒ Security problems
⇒ Size

Very hard: building one classifier across various machines
⇒ Massive parallelism
⇒ Trade-off between communication and time/resource savings


Distributed Life Science Projects: Distributed Computing Projects

Distributed tasks with low inter-communication
Solve a large problem by:
  Giving small parts of the problem to many computers
  Combining the solutions for the parts into a solution for the whole problem
Projects are too large to be solved in a reasonable amount of time without distributing the problem

Topics:
  Look for extra-terrestrial radio signals
  Look for large prime numbers (> ten million digits)
  Find more effective drugs to fight cancer or AIDS


Distributed Life Science Projects: Distributed Computing Platforms - BOINC

Berkeley Open Infrastructure for Network Computing

http://boinc.ssl.berkeley.edu/
Software platform for volunteer desktop grid computing
Uses millions of volunteer computers as a parallel supercomputer
Famous examples: distributed.net; SETI@home
Rough estimates of needed resources:
⇒ Precondition: existing application
⇒ Three man-months
  1 system administrator
  1 programmer
  1 web developer
⇒ About $5,000 for hardware


Distributed Life Science Projects: Distributed Computing Platforms - Others

World Community Grid ⇒ Uses BOINC client software

Darwin@Home ⇒ Observe lifelike evolutionary processes in virtual or robotic space

OpenMacGrid ⇒ Computing grid built up entirely of Macs

CPUShare ⇒ Low cost peer-to-peer virtual supercomputer available to everybody


Distributed Life Science Projects: Some Projects Based on BOINC

Predictor@home ⇒ Goal: predict protein structures from protein sequences

CHRONOS (Chromosomal Nostalgia) ⇒ Goal: discover the relationships between the 24 chromosomes of the human genome

Rosetta@home ⇒ Goal: predict and design protein structures, and protein-protein and protein-ligand interactions

SIMAP (Similarity Matrix of Proteins) ⇒ Goal: compute similarities between all known protein sequences


Distributed Life Science Projects: Some Projects Based on BOINC (cont.)

Docking@home ⇒ Goal: enable adaptive multi-scale modeling of docking applications

MalariaControl.net (part of Africa@home) ⇒ Goal: simulate the ways that the malaria parasite spreads in Africa, and its effects on human health

Proteins@home ⇒ Goal: inverse protein folding problem

POEM@home (Protein Optimization w. Energy Methods) ⇒ Goal: predict the biologically active structure of proteins


Distributed Life Science Projects: Other Projects (Not Based on BOINC)

Folding@home ⇒ Goal: simulation of protein folding ⇒ Based on Mithral Client-Server Software Development Kit

Parabon Computation ⇒ Goal: support several areas of Life Sciences research (e.g., microarray gene expression patterns, protein folding, ...) ⇒ Java client

fightAIDS@home ⇒ Goal: discovering new drugs, building on our growing knowledge of the structural biology of AIDS ⇒ BOINC client for ...


Distributed Life Science Projects: Our Objectives

"DrugScreening@univie"

Cooperation between

⇒ Faculty of Computer Science and

⇒ Faculty of Life Sciences

First prototype based on CONDOR under development

Long term goal:

⇒ Auto-optimization of features and classifiers


Starting Point

Grid Weka http://cssa.ucd.ie/xin/weka/Grid Weka.htm

Pure Java implementation for Windows and Linux systems
Custom communication interface (Java object serialization)
Server and client programs
+ Pros:
  Easy to implement, low software requirements
  ⇒ JRE (Java Runtime Environment)
  ⇒ .jar files for server and client
  ⇒ config file
  Interesting starting point for a small grid
- Cons:
  No GUI support (server and client side)
  Explicit information about the servers must be available on the client side (and manually included in the config file)
  No grid middleware support!


Starting Point

Weka4WS http://grid.deis.unical.it/weka4ws

Algorithms provided by the WEKA library are exposed as Web Services and can be deployed on available grid nodes
User nodes (clients) and computing nodes (servers)
+ Pros:
  Integration/interoperability with standard grid environments
  ⇒ Uses security and data management provided by the grid middleware
  GUI extension

- Cons:
  Higher software requirements
  Requires a good understanding of the underlying grid software
  Works only on Linux platforms (client AND server)!


Starting Point

Distributed Data Mining Simulator http://rapid-i.com/

Plug-in for RapidMiner (DM environment)
Simulation! Prior to real distribution
  Experiments are not actually executed on distributed network nodes
  Perform experiments with diverse network structures and communication patterns
+ Pros:
  Identification of optimal methods and parameters
  Efficient in the development stage
- Cons:
  Cannot replace testing on a real system
  Simulation may be much easier than real distribution

Acknowledgments

Michael Demel

Gerhard F. Ecker

Manfred Muecke

Hannes Schabauer