
High Performance Computing in Bioinformatics

Tutorial Outline

PART I: High Performance Computing
Thomas Ludwig ([email protected])
Ruprecht-Karls-Universität Heidelberg

PART II: HPC in Bioinformatics
- Grand Challenges for HPC Bioinformatics
- HPC Bioinformatics by Example of Phylogenetic Inference
Alexandros Stamatakis ([email protected])
Technische Universität München, Computer Science Department

© Thomas Ludwig, Alexandros Stamatakis, GCB'04

PART I: High Performance Computing

Outline

- Introduction
- Architecture
- Top Systems
- Programming
- Problems
- Own Research
- The Future

Introduction: Why High Performance Computing?

- Situation in science and engineering
  - Replace complicated physical experiments by computer simulations
  - Evaluate more fine-grained models
- User requirements
  - Compute masses of individual tasks
  - Compute complicated single tasks
- Available computational power: a single workstation is not sufficient

What Is It?

- High Performance Computing (HPC), Networking and Storage
- Deals with high and highest performance computers, with high speed networks, and with powerful disk and tape storage systems
- Performance improvement compared to personal computers and small workstations: factor 100...10,000

What Do I Need?

- Small scale high performance computing
- Cheapest version: use what you have (workstations with disks and network)
- A bit more expensive: buy PCs, e.g. 16 personal computers with disks and gigabit ethernet
- Software comes for free
- It's mainly a human resources problem: a network of workstations is time consuming to maintain

What Do I Need...?

- Large scale high performance computing
- Buy 10,000 PCs or a dedicated supercomputer
- Buy special hardware for networking and storage
- Add a special building
- Add an electric power station

How Much Do I Have to Pay?

- Small scale (<64 nodes): 1,000 €/node
- Medium scale (64-1024 nodes): 2,000 €/node (multiprocessor, 64-bit), plus 1,500 €/node for a high speed network and 500 €/node for high performance I/O
- Large scale (>1024 nodes): money for a building, money for a power plant
- Current costs range between 20...400 million Euros

Application Fields

- Numerical calculations and simulations: particle physics, computational fluid dynamics, car crash simulations, weather forecast, ...
- Non-numerical computations: chess playing, theorem proving
- Commercial applications

Application Fields...

- All fields of Bioinformatics: computational genomics, computational proteomics, computational evolutionary biology, ...
- In general: everything that runs between 1 and 10,000 days; everything that uses high volumes of data

Measures

- Mega (2^20 ≅ 10^6), Giga (2^30 ≅ 10^9), Tera (2^40 ≅ 10^12), Peta (2^50 ≅ 10^15), Exa (2^60 ≅ 10^18)
- Computational performance (Flop/s): Flop/s = floating point operations per second
  - Modern processor: 3 GFlop/s
  - Nr. 1 supercomputer: 35 TFlop/s (factor 10,000)
- Network performance (Byte/s)
  - Personal computer: 10/100 MByte/s
  - Supercomputer networks: gigabytes/s
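The binary/decimal prefix approximations above are easy to verify; a quick check of how close 2^20, 2^30, ... come to their decimal counterparts:

```python
# Sanity-check the slide's prefix approximations 2^(10k) ≅ 10^(3k).
for exp2, exp10, name in [(20, 6, "Mega"), (30, 9, "Giga"), (40, 12, "Tera"),
                          (50, 15, "Peta"), (60, 18, "Exa")]:
    ratio = 2**exp2 / 10**exp10
    print(f"{name}: 2^{exp2} / 10^{exp10} = {ratio:.4f}")
```

The ratio drifts from about 1.05 (Mega) to about 1.15 (Exa), so the approximation is good to within roughly 15 percent across the whole range.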

Measures...

- Main memory (Byte)
  - Personal computer: 1 GByte
  - Nr. 1 supercomputer: 10 TByte (factor 10,000)
- Disk space (Byte)
  - Single disk 2004: 200 GByte
  - Nr. 1 supercomputer: 700 TByte (factor 3,500)
- Tape storage (Byte)
  - Personal computer: 200 GByte
  - Nr. 1 supercomputer: 1.6 PByte (factor 8,000)

Architecture

- Basic classification concept: how is the main memory organized?
  - Distributed memory architecture
  - Shared memory architecture
- Available systems: dedicated computers, cluster systems

Distributed Memory Architecture

- Autonomous computers connected via a network
- Processes on each compute node have access to local memory only
- A parallel program spawns processes over a set of processors
- Communication between processes via message passing: send()/recv() over the interconnection network
- Called: multi computer system

Distributed Memory Architecture...

- Advantages
  - Good scalability: just buy new nodes; the concept scales up to 10,000+ nodes
  - You can use what you already have
  - Extend the system when you have money and need for more power
- Disadvantages
  - Complicated programming: parallelization of formerly sequential programs (including complicated debugging, performance tuning, load balancing, etc.)

Shared Memory Architecture

- Several processors in one box (e.g. multiprocessor motherboard)
- Each process on a processor sees the complete address space
- Communication between processes via shared variables
- Called: multiprocessor system, symmetric multiprocessing system (SMP)

Shared Memory Architecture...

- Advantages: much easier programming
- Disadvantages
  - Limited scalability: up to 64 processors; reason: the interconnection network becomes a bottleneck
  - Limited extensibility
  - Very expensive due to the high performance interconnection network

Hybrid Architectures

- Use several SMP systems: a combination of shared memory systems and distributed memory systems
- The good thing: performance scales according to your financial budget
- The bad thing: programming gets even more complicated (hybrid programming)
- The reality: vendors like to sell these systems, because they are easier to build

Supercomputers vs. Clusters

- Supercomputers (distributed/shared memory)
  - Constructed by a major vendor (IBM, HP, ...)
  - Use custom components (processor, network, ...)
  - Custom (Unix-like) operating systems
- Clusters (networks of workstations, NOWs)
  - Assembled by vendor or users
  - Commodity-off-the-shelf components (COTS)
  - Linux

Supercomputers vs. Clusters...

- Supercomputers
  - Very expensive to buy
  - Usually high availability and scalability
- Clusters
  - Factor 10 cheaper to buy, but very expensive to own
  - Lower overall availability and scalability

TOP500 List

- Lists the world's 500 most powerful systems
- www.top500.org
- Updated in June and November
- Ranking based on a numerical benchmark
- In 6 months almost half of the systems fall off the list
- The majority of systems now are clusters

TOP500 Rank 1-10 / TOP500 Rank 11-20

(Table slides: the top 20 systems of the current TOP500 list)

TOP500 Performance / TOP500 Performance Trends

(Chart slides: aggregate performance development, a factor of 1000 in 11 years)

EarthSimulator / NEC

(Photo slide: compute node cabinets (320), networking cabinets (65), disks, data archive / tape robots, air cooling, cables, power supply, earthquake protection, "my notebook")

EarthSimulator / NEC

- 640 nodes x 8 processors = 5,120 processors
- 10 TByte main memory, 700 TByte disk, 1.6 PByte tapes
- 200 Mio USD for the computer, 200 Mio USD for the building and electric power station
- 83,000 copper cables; 2,800 km / 220 t of cabling
- Building: 3,250 m², earthquake protected
- Power: 7 MW
- Application field: climate simulations

BlueGene / IBM

- IBM intends to break the 1 PFlop/s barrier by the end of 2006
- Power consumption and floor space problems solved by new packing technology (less than 30 EarthSimulators! ☺)
- Application fields: bioinformatics
  - Ab initio protein folding
  - Molecular dynamics on a millisecond to second time scale
  - Protein structure prediction

Programming: Parallelization Paradigms

- What do we have? Many processors, one program, much data
- How do we proceed?
  - Start one instance of the program on each processor (called a process)
  - Give it one part of the data
  - Collect the result
- Is this all? Well, sometimes yes ☺, sometimes no

Parallelization Paradigms...

- Remember the two categories
  1. Compute masses of individual tasks
  2. Compute complicated single tasks
- Category 1: embarrassingly parallel
- Master/worker concept:
  - Start the master on one processor
  - Start workers on all other processors
  - The master sends data to the workers
  - The master collects results from the workers
- Example: analysis of different molecules
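The master/worker concept above can be sketched in a few lines. This toy version uses Python's multiprocessing pool as a stand-in for MPI processes, and the "analysis" function is a made-up placeholder, not anything from the slides:

```python
from multiprocessing import Pool

def analyze(molecule):
    # Stand-in for an expensive per-task computation. Category 1 is
    # embarrassingly parallel: the tasks do not depend on each other.
    return molecule, sum(ord(c) for c in molecule)

if __name__ == "__main__":
    molecules = ["H2O", "CO2", "CH4", "NH3"]   # the individual tasks
    with Pool(processes=4) as pool:            # the "workers"
        # The master distributes tasks and collects the results.
        results = pool.map(analyze, molecules)
    print(dict(results))
```

Because the tasks are independent, the speedup is essentially linear in the number of workers, which is exactly why this category is the easiest case for HPC.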

Parallelization Paradigms...

- Category 2: non-trivial parallelism
- Usually: data partitioning
  - Data is partitioned amongst the processes
  - Each process computes results
  - During the computation the processes have to coordinate with each other: done by communication
  - A selected process organizes input/output
- Example: molecular dynamics simulations

Parallelization Support

- Parallelization via automatically parallelizing compilers: complicated, seldom used, inefficient, not scalable
- Parallelization for shared memory architectures
  - Language support via OpenMP
  - Only for small node numbers
- Parallelization for distributed memory architectures
  - Done manually: decorate the program with message passing calls
  - Only for experienced programmers

Message Passing

- Basic principle: use send and receive calls to transfer data

  process proc1:           process proc2:
  start:                   start:
    computeData();           computeData();
    send(proc2,data);        send(proc1,data);
    recv(proc2,data);        recv(proc1,data);
    goto start;              goto start;

Message Passing...

- Message passing standard: MPI (Message Passing Interface)
  - Conceived in the mid 90s
  - Language bindings: C, C++, Fortran (others available)
- Divided into two parts
  - MPI: just message passing
  - MPI-2: process management, input/output, ...
- Available implementations
  - Vendor versions on all major supercomputers
  - Open source versions MPICH and LAM for workstation clusters
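The send/recv idea from the pseudocode above can be demonstrated without an MPI installation. This sketch uses Python's multiprocessing pipes, which are not MPI, but follow the same principle: two processes with no shared memory exchanging explicit messages:

```python
from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()        # blocking receive, analogous to recv(...)
    conn.send(data * 2)       # send a result back, analogous to send(...)
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send(21)           # no shared variables: only messages
    print(parent.recv())      # -> 42
    p.join()
```

Note that in the slide's pseudocode both processes send before they receive; with blocking, unbuffered sends that ordering can deadlock, which is one reason MPI offers non-blocking variants.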

Message Passing...

- How big is MPI/MPI-2? Several hundred library calls!
- Don't worry: half a dozen are enough to start with
- What makes MPI so copious? All sorts of library calls that make programming more comfortable:
  - Collective calls: e.g. compute a global sum over all processes and distribute it to all of them
  - Special communication calls: e.g. non-blocking calls provide means to overlap computation and communication

Message Passing... The Dark Side

- Parallel programming in general introduces a new error category: time dependent errors
- The reason: processes run in parallel but are not synchronized; timing depends on the concrete activities on the nodes
- The consequences:
  - An erroneous program crashes only sometimes
  - When you slow it down to observe it, it does not crash
  - The precondition for the crash cannot be reproduced (nondeterminism)
- The solution? Well, next question

Performance Limitations: Amdahl's Law

- How to evaluate parallel performance: run with 1 and with n processors
- Speedup: S(n) = t(1) / t(n)
- Efficiency: E(n) = S(n) / n
- Theoretical maximum: S_max = n
- Practical maximum given by Amdahl's Law

Performance Limitations: Amdahl's Law...

- Every parallel program has a non-parallel part (input/output, initialization, etc.)
- Let f be the fraction of the non-parallel part
- Maximum speedup with Amdahl on p processors: S_Amdahl = 1 / (f + (1-f)/p); as p grows, S_Amdahl approaches 1/f
- Examples:
  - f = 0.01 ⇒ S_Amdahl ≤ 100
  - f = 0.001 ⇒ S_Amdahl ≤ 1000
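The bound is easy to check numerically; a small sketch, with f and p as defined above:

```python
def speedup(f, p):
    """Amdahl's Law: maximum speedup on p processors when a
    fraction f of the program cannot be parallelized."""
    return 1.0 / (f + (1.0 - f) / p)

# The limit 1/f is approached as p grows:
for f in (0.01, 0.001):
    print(f"f={f}: p=1000 -> {speedup(f, 1000):.1f}, limit -> {1/f:.0f}")
```

For f = 0.01 even 1000 processors yield a speedup of only about 91, which illustrates why shaving down the sequential part matters more than adding nodes.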

Problems: Fault Tolerance

- What happens when a node crashes during runtime? (The probability increases with the number of components and the execution time)
- Today
  - The program just crashes
  - The program crashes but can be restarted from a checkpoint
- In the future
  - The program continues its execution and uses a different set of nodes

High Performance Input/Output

- How can we handle terabytes and petabytes of input/output data?
- Today
  - Not yet too efficiently; often a master process handles all I/O
- In the future
  - All processes perform I/O calls (parallel I/O)
  - All nodes have efficient access to high performance I/O hardware

Own Research: High Performance Parallel I/O

- High performance I/O is a major challenge today
- Research activities
  - Investigate load balancing mechanisms for parallel I/O systems (i.e. distribute data to disks according to load)
  - Provide performance measurement tools to see the influence of I/O library calls in the source code
  - Adapt parallel programs to parallel I/O concepts

The Future: Petaflops and Petabytes

- The new frontiers of supercomputing
  - Petaflops: maybe by the end of 2006?
  - Petabytes: next year
- Number of processors: 10,000 and more
- However: do we have the right programs for that?

Grid Computing

- The new hype of supercomputing
- Idea: as with the power grid
  - Computational performance should be available everywhere
  - Computational performance can be produced at various places
- Concept
  - Join parallel computers, clusters, compute centers
  - Offer their aggregate compute performance
- Problems: programming, management, security, ...

Limitations of High Performance Computing

- Reconsider: "everything that runs between 1 and 10,000 days"
- Current supercomputers can reduce program runtime by a maximum factor of 5 orders of magnitude
- What to do if you want to compute e.g. billions of molecular variations?
- There is only one answer:
  - First: improve the algorithm!
  - Second: use supercomputers

PART II: HPC in Bioinformatics

- Grand Challenges in HPC Bioinformatics
- HPC Bioinformatics by Example of Phylogenetic Inference
  - phylogenetic analysis
  - maximum likelihood
  - bayesian inference
  - sequential codes
  - parallel & distributed RAxML
  - future developments

The Classic Slide: GenBank Data Growth

(Chart: number of base pairs, growing over the years 1980-2004)

Increase in Nucleotide Substitution Model Complexity

(Chart: amount of arithmetic operations, increasing from JC69 via F81 and HKY85 to GTR and GTR + Γ, over the years 1969-2004)

Grand Challenges

- Protein folding & structure prediction
- Homology search
- Multiple alignment
- Genomic sequence analysis
- Gene finding
- Gene expression data analysis
- Drug discovery
- Phylogenetic inference

Grand Challenges

- Protein folding & structure prediction
- Homology search
- Multiple alignment
- Genomic sequence analysis
- Gene finding
- Gene expression data analysis
- Drug discovery
- Phylogeny construction ← main focus of this talk!

Multiple Alignment

- The prerequisite for phylogenetic analysis
- Computational effort increases exponentially in time and space with the standard dynamic programming approach and the Sum-of-Pairs score
- Which is the adequate score function?

Multiple Alignment

- The prerequisite for phylogenetic analysis
- Computational effort increases exponentially in time and space with the standard dynamic programming approach and the Sum-of-Pairs score →
  - Good heuristics
  - Parallel algorithms
    - Fine-grained, e.g. on the alignment matrix level
    - FPGA implementations for pairwise alignment
  - Heuristics & parallel algorithm: coarse-grained divide-and-conquer
- A type of reverse engineering: how can I change the algorithm to be able to parallelize it with MPI?
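For concreteness, the Sum-of-Pairs score mentioned above is simple to compute once an alignment is given; a toy sketch, where the match/mismatch/gap values are arbitrary illustrative choices, not taken from the slides:

```python
from itertools import combinations

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-Pairs score: sum the pairwise column scores over
    all sequence pairs of a multiple alignment (rows of equal length)."""
    score = 0
    for a, b in combinations(alignment, 2):
        for x, y in zip(a, b):
            if x == '-' and y == '-':
                continue                 # gap-gap columns are ignored
            elif x == '-' or y == '-':
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score

print(sp_score(["ACG-T", "AC-GT", "ACGGT"]))
```

Scoring a given alignment is cheap; the exponential cost arises from searching for the alignment that maximizes this score over all possible gap placements.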

An Example: DCA

- Stoye et al (1997): Divide-and-Conquer Alignment algorithm (DCA)
  1. Divide the sequences into smaller subsequence-sets
  2. If the length of a subsequence-set is below a predefined threshold value L, compute the optimal subalignments (in parallel)
  3. Concatenate the subalignments to the whole alignment
- The art consists in the design of intelligent decomposition heuristics to obtain a near-optimal concatenated alignment → a non-trivial problem

(Diagram: sequences S1, S2, S3 are divided, the parts are aligned, and the subalignments are concatenated)

Phylogenetic Analysis

- Motivation
  - Tree-of-life
  - New insights in medical & biological research
  - CIPRES: NSF-funded 11.6 million $ tree-of-life project (www.phylo.org)
- Applications of phylogenetic trees
  - Bader et al (2001) Industrial applications of high-performance computing for phylogeny reconstruction.
  - Baker et al (1994) Which whales are hunted? A molecular genetic approach to whaling.

Phylogenetic Methods

- Input: a "good" multiple alignment
- Output: an unrooted binary tree
- Various models for phylogenetic inference; the models differ in computational complexity & accuracy of the final trees
  - Fast & simple models: Neighbor Joining, Parsimony (MP)
  - Slow & complex models: Maximum Likelihood (ML), Bayesian Methods ← we focus on these

Remember!

- The input need not be DNA or protein sequence data → gene order data
  - Moret et al (2001) GRAPPA: a high performance computational tool for phylogeny reconstruction from gene-order data.
- The model need not be a tree → networks
  - Gusfield et al (2003) Efficient reconstruction of phylogenetic networks with constrained recombination.
- The output need not be a strictly bifurcating tree → multifurcating tree

Remember!

- We focus on the computation of strictly bifurcating phylogenetic trees with maximum likelihood for DNA and protein sequence data!

Example: Phylogeny of the Great Apes

(Tree: a common ancestor, with time running downward, branching into Orangutan, Gorilla, Chimpanzee, and Human)

The Algorithmic Problem: The Number of Trees Explodes!

- The number of potential trees grows exponentially with the number of taxa:

  # Taxa   # Trees
       5   15
      10   2,027,025
      15   7,905,853,580,625
      50   2.84 * 10^76   (BANG! ≈ the number of atoms in the universe, 10^80)
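The table follows from the closed form for the number of distinct unrooted binary trees on n taxa, (2n-5)!! (the double factorial 3 * 5 * ... * (2n-5)); a quick check:

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted binary trees on n taxa: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # product 3 * 5 * ... * (2n-5)
        count *= k
    return count

for n in (5, 10, 15):
    print(n, num_unrooted_trees(n))
```

Each added taxon multiplies the count by the next odd number (the number of branches it can attach to), which is exactly why exhaustive search over topologies becomes hopeless so quickly.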

Maximum Likelihood

- Maximum Likelihood calculates:
  1. Topologies
  2. Branch lengths v[i]
  3. The likelihood of the tree
- Goal: obtain the topology with the maximum likelihood value
- Problem I: the number of possible topologies is exponential in n
- Problem II: the computation of the likelihood function is expensive
- Solution: algorithmic optimizations + new heuristics + HPC

(Example tree: tips S1...S5 with branch lengths v1...v7)
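To make "computing the likelihood" concrete, here is a minimal sketch of the per-site likelihood of two aligned sequences separated by a single branch under the Jukes-Cantor (JC69) model mentioned earlier. This is my illustration, not code from the slides; real programs evaluate this over whole trees with Felsenstein's pruning algorithm:

```python
import math

def jc69_p_same(t):
    """JC69: probability a site is unchanged after branch length t
    (t in expected substitutions per site)."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)

def jc69_p_diff(t):
    """Probability of observing one specific different base."""
    return 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)

def log_likelihood(seq1, seq2, t):
    """Log likelihood of two sequences on a single branch of
    length t, with uniform base frequencies (0.25 each)."""
    ll = 0.0
    for a, b in zip(seq1, seq2):
        p = jc69_p_same(t) if a == b else jc69_p_diff(t)
        ll += math.log(0.25 * p)
    return ll

print(log_likelihood("ACGTACGT", "ACGTACGA", 0.1))
```

Even in this two-taxon toy, the branch length t must be optimized numerically to maximize the likelihood; on a full tree that optimization runs over all 2n-3 branches for every candidate topology, which is where the cost of Problem II comes from.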

Bayesian Inference

- Uses Bayesian statistics:

  P(t + m | d) = P(d | t + m) x P(t + m) / P(d)

  - P(d | t + m): the likelihood of the tree
  - P(t + m): the prior probability of the tree; must be assumed; usually all possible trees are considered to be equally probable
  - P(d): the problematic term; it equals the sum of likelihood x prior probability of the tree + model over all possible trees → impossible to calculate!

Bayesian Inference

- Solution: Metropolis-Coupled Markov Chain Monte Carlo simulation (MC³)
- Advantages compared to ML:
  - Straightforward statistical measure of the phylogeny
  - Avoids bootstrapping & provides straightforward support values
- Disadvantages compared to ML:
  - Requires priors for tree & model
  - MC³ convergence problem
  - More difficult to parallelize: Feng et al (2003) Parallel algorithms for bayesian phylogenetic inference.
- The art in the design of a Bayesian phylogenetic analysis lies in the tree proposal mechanism.

MC³ Algorithm

- Start from a random starting tree
- Run a cold chain and a heated chain in parallel (states t_1_1, t_2_1, ...)
- A tree proposal mechanism generates new candidate trees

MC³ Algorithm...

- From the current trees t_1_1 (cold chain) and t_2_1 (heated chain), propose new trees and compute the acceptance ratio, e.g. R = L(t_1_2) / L(t_1_1):
  - R > 1 → accept the tree
  - R < 1 → accept the tree if random(0,1) < R
- If L(t_1_2) > L(t_2_3) → swap the chain states
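The accept/reject step above can be sketched as follows, working with log likelihoods as real programs do; a toy illustration, not MrBayes code:

```python
import math, random

def metropolis_accept(log_l_new, log_l_old, rng=random):
    """Metropolis rule: accept if the proposal is better (R >= 1),
    otherwise accept with probability R = L(new) / L(old)."""
    log_r = log_l_new - log_l_old
    if log_r >= 0:
        return True
    return rng.random() < math.exp(log_r)

# A better tree is always accepted; a worse one only sometimes,
# which lets the chain escape local optima:
print(metropolis_accept(-100.0, -105.0))   # True
```

Occasionally accepting worse trees is the whole point: a pure hill-climber would get stuck, while the heated chain, whose acceptance probabilities are flattened further, explores even more aggressively and swaps good states back to the cold chain.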

MC³ Algorithm...

(Diagram: both chains advance through states t_1_1 → t_1_2 → t_1_3 and t_2_1 → ... → t_2_4, occasionally swapping states)

ML vs. Bayes

(Chart: likelihood over a model parameter x, e.g. the transition/transversion ratio; the ML value marks the maximum of the curve, with the Bayesian view also indicated)

MC³ Convergence Problem

(Charts: log likelihood over time; one run shows an area of apparent stationarity before improving further)

Real-World Example

(Chart: log likelihood over time for runs started from an ML starting tree and from a random starting tree, compared against an ML analysis reference value)

ML Phylogeny Program Development

- Develop a fast sequential algorithm with new heuristics & optimizations
- Phylogenetics is an algorithmic discipline

ML Phylogeny Program Development

- Develop a fast sequential algorithm with new heuristics & optimizations, then iterate towards a parallel program

Basic Algorithms

- Two basic classes of algorithms
  - I. Progressive algorithms: progressive insertion of sequences into the tree, e.g. stepwise addition
  - II. Global algorithms:
    - Use an NJ or parsimony starting tree
    - Optimize the tree by applying standard topological alterations:
      - NNI: Nearest Neighbor Interchange
      - TBR: Tree Bisection Reconnection
      - Subtree rearrangements

NNI

(Figure slides: a sequence of Nearest Neighbor Interchange moves on an example tree)

TBR

(Figure slides: a sequence of Tree Bisection Reconnection moves on an example tree)

Subtree Rearrangements

(Figure slides: subtree ST6 is pruned from a tree with subtrees ST1...ST6 and re-inserted into neighboring branches at increasing distances (+1, +2) from its original position)

State-of-the-Art Sequential Phylogeny Programs I

- PHYML: fast & accurate on simulated data
  - Guindon et al (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.
- RAxML-III: fast & accurate on real data
  - Stamatakis et al (2004) RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees.
- MetaPigA: fastest genetic search algorithm
  - Lemmon et al (2002) The metapopulation genetic algorithm: An efficient solution for the problem of large phylogeny estimation.

State-of-the-Art Sequential Phylogeny Programs II

- IQPNNI: accurate on real & simulated data; slower than PHYML/RAxML
  - Vinh et al (2004) IQPNNI: Moving fast through tree space and stopping in time.
- PAUP*: many options for MP & ML searches; very slow on ML; not available free of charge
  - Swofford (1998) PAUP* 4.0 - Phylogenetic Analysis Using Parsimony (*and Other Methods).
- MrBayes: fast bayesian inference
  - Huelsenbeck (2001) MrBayes: Bayesian inference of phylogenetic trees.

Performance of Phylogeny Programs

- Quantitative measures
  - Accuracy
  - Time consumption
  - Memory requirements (memory consumption is an important, often underestimated, problem!)
- Qualitative measures
  - Number of implemented evolutionary models
  - Ability to optimize evolutionary model parameters
  - Availability (negative examples: TNT, DCM)
  - Code for various platforms

Performance of Phylogeny Programs: Simulated Data

- Generate a simulated "true" tree (standard program: r8s)
- Generate a simulated alignment for the tree (standard program: Seq-Gen)
- Compute a tree with the phylogeny program
- Measure the topological distance to the true tree (standard measure: Robinson-Foulds distance)
- Problems
  - Perfect world: no gaps, no sequencing errors
  - The evolutionary model is known a priori

Performance of Phylogeny Programs: Real Data

- Compute trees with the phylogeny programs on real data alignments
- Compare the final tree scores
  - Significance of a small δ between final ML scores: apply likelihood ratio tests
  - Remember that the programs return log-likelihood values
  - High score-accuracy required: 99.99%
- Problems
  - The real tree is not known
  - The evolutionary model is not known
  - Applies to one class of model (ML, MP) only

Performance of Phylogeny Programs...

- Current research issue: when to stop the analysis?
- The optimization of trees for real data is generally significantly harder than for simulated data!

Performance of Phylogeny Programs...

- Current issue: a standard real-data benchmark set is required!

When to Stop the Analysis?

(Chart: log likelihood over time)

(Chart annotation: Is this improvement worth the extra time?)

Survey of Sequential Programs

Comparison on small simulated alignments

Williams et al. (2003): An investigation of phylogenetic likelihood methods.

Evaluated RAxML, PHYML, MrBayes on 50 simulated alignments with 100 taxa

Used 9 real-world alignments with 101–1000 taxa

Results:

RAxML best & fastest on real data

MrBayes best on simulated data

MrBayes significantly slower than PHYML, RAxML


Sequential Results: Simulated Data

PHYML:   0.0796 / 35.21 secs
RAxML:   0.0808 / 131.05 secs
MrBayes: 0.0741 / 945.32 secs
RAxML:   0.0818 / 29.27 secs

Sequential Results: Real Data

data      | PHYML     | secs  | MrBayes   | secs   | RAxML     | secs  | R > PHY secs | PAxML     | secs
101_SC    | -74097.6  | 153   | -77191.5  | 40527  | -73919.3  | 617   | 31           | -73975.9  | 47
150_SC    | -44298.1  | 158   | -52028.4  | 49427  | -44142.6  | 390   | 33           | -44146.9  | 164
150_ARB   | -77219.7  | 313   | -77196.7  | 29383  | -77189.7  | 178   | 67           | -77189.8  | 300
200_ARB   | -104826.5 | 477   | -104856.4 | 156419 | -104742.6 | 272   | 99           | -104743.3 | 775
250_ARB   | -131560.3 | 787   | -133238.3 | 158418 | -131468.0 | 1067  | 249          | -131469.0 | 1947
500_ARB   | -253354.2 | 2235  | -263217.8 | 366496 | -252499.4 | 26124 | 493          | -252588.1 | 7372
1000_ARB  | -402215.0 | 16594 | -459392.4 | 509148 | -400925.3 | 50729 | 1893         | -402282.1 | 9898
218_RDPII | -157923.1 | 403   | -158911.6 | 138453 | -157526.0 | 6774  | 244          | n/a       | n/a


Sequential Results: Real Data

(Chart annotation: state-of-the-art parallel program in 2002!)

Memory Requirements

        | 1000 taxa | 10000 taxa
RAxML   | 200 MB    | 750 MB
PHYML   | 900 MB    | 8.8 GB
MrBayes | 1150 MB   | not available

Sequential RAxML

Compute randomized parsimony starting tree with dnapars from PHYLIP

Advantage of RAxML: the search starts from distinct points in search space

Apply exhaustive subtree rearrangements; RAxML performs fast lazy rearrangements

Iterate while the tree improves

RAxML uses Subtree Equality Vectors: Stamatakis et al. (2002): Accelerating Parallel Maximum Likelihood-based Phylogenetic Tree Calculations using Subtree Equality Vectors.
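The "iterate while the tree improves" search can be sketched as a generic hill-climbing loop (this is a schematic, not RAxML's actual code; the toy score and neighbour functions below are hypothetical stand-ins for log-likelihood evaluation and subtree rearrangements):

```python
def hill_climb(tree, score, neighbours):
    """Iterate while the tree improves: per round, evaluate every candidate
    rearrangement of the current best tree and keep the best-scoring one."""
    best, best_score = tree, score(tree)
    improved = True
    while improved:
        improved = False
        for cand in neighbours(best):
            s = cand_score = score(cand)
            if cand_score > best_score:   # log-likelihoods: larger is better
                best, best_score = cand, cand_score
                improved = True
    return best, best_score

# Toy stand-in: 'trees' are integers, the score peaks at 17,
# and 'rearrangements' are +/-1 moves.
tree, lnl = hill_climb(0, lambda t: -abs(t - 17), lambda t: [t - 1, t + 1])
print(tree, lnl)
```

The randomized parsimony starting trees matter because a hill climber like this stops at the first local optimum; distinct starting points explore distinct basins of the search space.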

Outline

Grand Challenges in HPC Bioinformatics

HPC Bioinformatics by Example of Phylogenetic Inference

phylogenetic analysis
maximum likelihood
bayesian inference
sequential codes
parallel & distributed RAxML
future developments

Parallel & Distributed RAxML

Design goals:
- minimize communication overhead
- attain good speedup

Master-worker architecture with 2 computational phases:
I. Computation of #workers parsimony trees
II. Rearrangement of subtrees at each worker

The program is non-deterministic → every run yields a distinct result, even with a fixed starting tree

Parallel RAxML: Phase I

Distribute alignment file & compute parsimony trees

Receive parsimony trees & select best as starting tree

(Diagram: master process with workers)

Parallel RAxML: Phase II

Distribute currently best tree

Workers issue work requests

Distribute subtree IDs: only one integer must be sent to each node!
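The Phase II pattern can be sketched as follows; a thread pool stands in for the MPI workers, and `rearrange()` is a hypothetical stand-in for a worker that rearranges the named subtree on its local copy of the tree (in the real program, only the integer subtree ID crosses the network):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker: given a subtree ID (one integer!), rearrange that
# subtree on the locally stored tree and return (ID, resulting log-likelihood).
def rearrange(subtree_id: int) -> tuple[int, float]:
    # stand-in computation; a real worker evaluates lazy rearrangements here
    return subtree_id, -74000.0 - (subtree_id % 7)

def master(num_subtrees: int, num_workers: int = 4) -> tuple[int, float]:
    """Phase II sketch: distribute subtree IDs to the workers, collect the
    resulting scores, and keep the best rearrangement."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(rearrange, range(num_subtrees)))
    return max(results, key=lambda r: r[1])   # best (ID, score) pair

best_id, best_lnl = master(20)
print(best_id, best_lnl)
```

Because each worker already holds the alignment and the current best tree (distributed at the start of the phase), the per-request message shrinks to a single integer, which keeps communication overhead minimal.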


(Diagram: workers rearrange subtrees ST1, ST2, ST3, ST4 in parallel)

Parallel RAxML: Phase II

Receive result trees and continue with best tree

Speedup

(Chart: speedup measurements)

Slightly superlinear speedup due to non-determinism!

RAxML: Biological Research

1. Parallel inference of five 10,000-taxon trees containing Bacteria, Eukarya, Archaea on a Linux PC cluster
   Accumulated CPU hours per tree ≈ 3200
   Largest ML analysis to date
   Major clades correctly identified

2. Sequential analysis of 2415 mammals (cytochrome-b sequences)
   "Traditional" reference tree available
   Error about 10–13%


Outline

Grand Challenges in HPC Bioinformatics

HPC Bioinformatics by Example of Phylogenetic Inference

phylogenetic analysis

maximum likelihood

bayesian inference

sequential codes

parallel & distributed RAxML

future developments


Future Developments

Visualization

Divide-and-conquer algorithms
I. Divide alignment into sub-alignments
II. Infer subtrees for sub-alignments
III. Merge subtrees into comprehensive tree by application of supertree methods
IV. Resolve multifurcations & optimize supertree
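The four steps can be sketched schematically; `infer()` and `merge()` below are toy stand-ins (a caterpillar tree and a plain root multifurcation), not the ML inference and supertree methods a real pipeline would use:

```python
# Schematic of the four divide-and-conquer steps on a toy taxon set.
def divide(taxa, size):
    """I. Divide the taxon set into overlapping subsets (overlap of one taxon,
    so the subtrees share anchor taxa for the later merge)."""
    subsets = [taxa[i:i + size] for i in range(0, len(taxa) - 1, size - 1)]
    return [s for s in subsets if len(s) > 1]

def infer(subset):
    """II. Infer a subtree per sub-alignment (here: a trivial nested-tuple
    caterpillar instead of a real ML search)."""
    tree = subset[0]
    for taxon in subset[1:]:
        tree = (tree, taxon)
    return tree

def merge(subtrees):
    """III./IV. Merge subtrees into one comprehensive tree; the unresolved
    root multifurcation would then be resolved and re-optimized."""
    return tuple(subtrees)

taxa = ["t%d" % i for i in range(7)]
supertree = merge([infer(s) for s in divide(taxa, 3)])
print(supertree)
```

The overlap between subsets is the crucial design choice: without shared taxa, the supertree step has no information about how the subtrees fit together.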


Problem 1: Currently only 2 methods are available for alignment division: RAxML & DCM

Problem 2: Resolving multifurcations is hard; optimization of the entire tree is required

Distributed D & C

Construct guide tree & perform tree-based alignment division

Compute subtrees

Merge subtrees into guide tree & re-divide alignment

Compute subtrees

Recent Work on D & C

Rec-I-DCM3: very fast on parsimony
Roshan (2004): Rec-I-DCM3: A fast algorithmic technique for reconstructing large phylogenetic trees.

PhyNav: zoom-in/zoom-out technique
Vinh et al. (2004): PhyNav: A novel approach to reconstruct large phylogenies.

BWD: new supertree reconstruction method using distances
Stephen J. Willson (2004): Constructing rooted supertrees using distances.

Future Developments

Visualization
Divide-and-conquer algorithms
Shared memory or vector processors

Shared Memory Parallelism

(Diagram: tree with virtual root and inner nodes P, Q, R)

P[i] = f( g(Q[i]) , g(R[i]) )

This operation accounts for ≈ 90% of total execution time! → simple fine-grained parallelisation
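Since every P[i] depends only on Q[i] and R[i], the index range can be split into independent chunks, which is exactly the simple fine-grained parallelisation the slide refers to. A sketch of the decomposition (f and g are toy scalar stand-ins for the per-site likelihood functions; real shared-memory implementations use OpenMP, while Python threads here only illustrate the chunking, since the GIL prevents actual speedup):

```python
from concurrent.futures import ThreadPoolExecutor
import math

# Toy stand-ins for the per-site likelihood computations at nodes Q, R, P.
def g(x: float) -> float:
    return math.log1p(x * x)

def f(a: float, b: float) -> float:
    return a + b

def combine(Q, R, workers=4):
    """Compute P[i] = f(g(Q[i]), g(R[i])) for all i, splitting the index
    range into one contiguous chunk per worker."""
    n = len(Q)
    P = [0.0] * n

    def chunk(lo: int, hi: int) -> None:
        for i in range(lo, hi):
            P[i] = f(g(Q[i]), g(R[i]))

    bounds = [(k * n // workers, (k + 1) * n // workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for lo, hi in bounds:
            pool.submit(chunk, lo, hi)   # the with-block joins all chunks
    return P

P = combine([0.1] * 8, [0.2] * 8)
```

No synchronisation is needed inside the loop because each chunk writes to a disjoint slice of P; the only barrier is the implicit join at the end.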



Future Developments

Combination of ML hill-climbing and MC³-like sampling algorithms

Potential advantages:
1. Large number of sampled trees to build a consensus tree
2. Faster convergence to the best-known likelihood?

Combined Method

(Chart annotation: Start sampling?)

Recent HW-specific Optimization

Like every ML program, RAxML makes many calls to exp() and log() on vectors → expensive

Utilisation of the Intel™ MKL (Math Kernel Library) on a Xeon 2.4 GHz CPU

Performance boost of over 30% …

… in half a day of work
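The MKL gain comes from restructuring the code so that exp() is evaluated over a whole array in one library call instead of once per element. A stdlib-only sketch of that interface change (`branch_probs_*` are hypothetical names, and `vd_exp` only mimics the shape of MKL's vdExp; there is no actual speedup without the real vector library):

```python
import math

# Per-element style: one exp() call per rate category.
def branch_probs_scalar(rates, t):
    return [math.exp(-r * t) for r in rates]

# Vector style: collect all arguments first, then hand the whole array to a
# single vectorized call (a stdlib stand-in for a routine like MKL's vdExp).
def vd_exp(xs):
    return list(map(math.exp, xs))

def branch_probs_vector(rates, t):
    return vd_exp([-r * t for r in rates])

rates = [0.5, 1.0, 2.0]
assert branch_probs_scalar(rates, 0.1) == branch_probs_vector(rates, 0.1)
```

The numerical results are identical; the restructuring only matters because a vector math library can amortise call overhead and use SIMD across the whole array, which is why the reported 30% boost took only half a day of work.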


On-Line Resources

HPC Resources

TOP500 List: www.top500.org
BlueGene: www.research.ibm.com/bluegene/
EarthSimulator: www.es.jamstec.go.jp
MPI-Forum: www.mpi-forum.org
Global Grid Forum: www.gridforum.org
MPICH: www-unix.mcs.anl.gov/mpi/mpich/
OpenMP: www.openmp.org

Bioinformatics Resources

CIPRES project: www.phylo.org
Felsenstein's phylogeny page: evolution.genetics.washington.edu/phylip/software.html
PHYML download: www.lirmm.fr/~guindon/phyml.html
RAxML download: wwwbode.cs.tum.edu/~stamatak/research.html
IQPNNI & PhyNav download: www.bi.uni-duesseldorf.de/software

