Quantian As an Environment for Distributed Statistical Computing
Dirk Eddelbuettel
Debian Project
[email protected]
DSC 2005 – Directions in Statistical Computing 2005
University of Washington, Seattle, August 13-14, 2005

Outline
1 Background: Introduction, Timeline
2 Quantian: Motivation, Content
3 Distributed Computing: Overview, Preparation, Beowulf, openMosix
4 R Examples: Snow, SnowFT, papply, Others
5 Summary

What is Quantian?
A live DVD for numbers geeks
• Quantian is a directly bootable and self-configuring Linux system that runs from a compressed DVD image.
• Quantian offers zero-configuration cluster computing using openMosix, including 'openMosix Terminalserver' via PXE.
• Quantian contains over 6 GB of software, including an additional 4 GB of 'quantitative' software: scientific, numerical, statistical, engineering, ...
• Quantian also contains editors, programming languages, complete LaTeX support, two 'office' suites, networking tools and more.

Quantian lineage
Knoppix, clusterKnoppix, Debian
• Quantian is based on clusterKnoppix, which extends Knoppix with an openMosix-enabled kernel and applications, kernel modules and security patches.
• clusterKnoppix extends Knoppix, an impressive 'Linux on a CD-ROM' system which puts 2.1 GB of software onto a CD-ROM along with auto-detection and configuration (though Knoppix has followed Quantian and switched to 4 GB DVDs).
• Knoppix is based on Debian, a Linux distribution containing over 8000 source packages available for 12 architectures (such as i386, alpha, ia64, amd64, sparc or s390) produced by hundreds of individuals.

Timeline
As provided by the releases
• 0.1 (March 2003): Initial version at DSC 2003.
• 0.2 (May 2003): Now based on Knoppix 3.2.
• 0.3 (June 2003): Switched to clusterKnoppix, which added openMosix clustering support.
• 0.3.9.* (Sep. 2003): Updated clusterKnoppix.
• 0.4.9.* (Oct. 2003 to Mar. 2004): Based on Knoppix 3.3.
• 0.5.9.* (June to Sep. 2004): Based on Knoppix 3.4, first 'kitchen sink' versions > 1 GB for bootable DVDs.
• 0.6.9.* (Oct. to Dec. 2004): Based on Knoppix 3.6, size increased to 2.0 GB.

Motivation
Major modes of use
• Computing clusters to speed up embarrassingly parallel tasks.
• Computer labs, by enabling temporary use of a computing environment booted off a DVD and/or netbooting.
• Students / co-workers, as distributing DVDs enables work in identical environments with minimal administration.
• Convenience of not having to chase down new software releases, and to configure and install them.
• Easier installation of a 'normal' workstation by booting off Quantian and/or installing to hard disk, getting a head start with 6 GB of configured software.

What is included?
Broken down by field
• Statistics: GNU R (plus essentially all of CRAN and BioConductor; Ggobi, ESS), Xlispstat, Gretl, PSPP.
• Bioinformatics: BioConductor, BioPython, BioPerl and tools like emboss and blast2.
• Mathematics: Six computer algebra systems, matrix languages Octave (with add-on packages), Yorick and Scilab, TeXmacs front-end.
• Physics: CERN tools (Cernlib, Geant, PAW/PAW++), Scientific / Numeric Python, GNU GSL libraries.
• Visualization and graphics: OpenDX, Mayavi, Ggobi, Gnuplot, Grace, Gri, plotutils, xfig.

What is included?
Broken down by application area
• Programming languages: C, C++, Fortran, Java, Perl, Python, PHP, Ruby, Lua, Tcl, Awk, A+.
• Editors: XEmacs, Vim, jed, joe, kate, nedit, zile.
• Scientific Publishing: Extended LaTeX support with several frontends (xemacs, kile, lyx) and extensions.
• Office software: OpenOffice.org, KOffice, Gnumeric, and tools like the Gimp.
• Finance: Software from the Rmetrics project and the QuantLib libraries.
• Networking: ethereal, portmap, netcat, ettercap, bittorrent, nmap, squid plus wireless tools and drivers.
• General tools: Apache, MySQL, PHP, and more.

How to use many computers
Conceptual overview
• 'sneaker net': physically (or virtually via ssh) running from machine to machine, launching jobs and collecting results.
• 'Beowulf' clusters using 'MPI/PVM/...' require explicitly parallel code (though there are some R wrappers, more below).
• openMosix forms a 'single system image' computer and does not require explicitly parallel code.
• Other approaches such as Condor or OSCAR, which we won't cover here.

Setup for PVM and MPI
Should go into next Quantian revision
• PVM and MPI 'do not know' they are running inside.
• They want to talk to other hosts by ssh.
• PVM/MPI require distinct hostnames for all machines.
• Setup for ssh, LAM and PVM:
    $ cp -ax /root/.ssh ~knoppix                 # reuse root's ssh keys for user knoppix
    $ chown -R knoppix:knoppix ~knoppix/.ssh
    $ ifconfig                                   # note $IP
    $ hostname Quantian$IP                       # give this node a distinct hostname
    $ vi /etc/hosts                              # define local hosts
    $ scp /etc/hosts to_all_local_hosts
    $ vi /tmp/clusterhosts                       # add them
    $ lamboot /tmp/clusterhosts
    $ echo "conf" | pvm /tmp/clusterhosts

Distributed computing: Beowulf
• Beowulf clusters use message-passing interfaces such as LAM/MPI or PVM to communicate across nodes.
• This may require a sizable amount of new programming and explicitly parallel coding.
  'Hard'
• Quantian includes several Beowulf tools and libraries:
  • LAM MPI libraries and run-time;
  • Mpich MPI libraries and run-time;
  • Pvm (Parallel Virtual Machine) libraries and run-time;
  • Sprng (Scalable Parallel Random Number Generator);
  • as well as documentation and examples for these.
• Contrast: openMosix takes existing programs and moves them around nodes in the cluster to achieve optimal load across all nodes, with no alteration to algorithms and no new programming.

Cluster computing: openMosix
• Easiest way to distribute computing load, especially for 'embarrassingly parallel' tasks, as the kernel schedules tasks across the cluster.
• Since release 0.3, Quantian contains a kernel with the openMosix patch as well as a set of openMosix utilities.
• As a result, "instant cluster computing" is possible based on a single DVD or ISO image:
  1 boot one master instance from the DVD or hard disk,
  2 enable 'openmosixterminalserver' from the menu,
  3 boot 1, 2, ... 'slave' nodes via the PXE protocol (available in most recent computers) from the master, and
  4 enjoy openMosix on the cluster.
• Big advantage: Identical software configuration, library versions, ... throughout the cluster.

Cluster computing: openMosix (cont.)
• clusterKnoppix autoconfigures and autodiscovers the nodes, and enables (root) ssh access between them.
• openMosix is ideal for stand-alone programs such as 'old fashioned' C++ or Fortran apps that 'just run'.
• In general, any program without shared memory or threads will migrate, though I/O may bring jobs back to the front node.
• Ian Latter's CHAOS project addresses some of the security aspects by overlaying a VPN allowing for private