The R and Omegahat Projects in Statistical Computing
Brian D. Ripley
http://www.stats.ox.ac.uk/~ripley

Outline

• Statistical Computing
  – History
  – S
  – R
• Applications and Comparisons
  – Web servers
  – Embedding: Medieval Chant
  – R vs S-PLUS
• Omegahat and Component Systems
  – The Omegahat project
  – Components: GGobi
  – The future?

Statistical Computing and S

Scene-setting: Statistical Computing

1980

Mainly Fortran programming, or PL/I (SAS).

Batch computing (SAS, BMDP, SPSS, Genstat) with a restricted range of platforms. Some small interactive systems (GLIM 3.77, Minitab).

Very poor interactive graphics (2400 baud to a Tektronix storage tube if you were lucky). Flatbed and drum plotters, microfilm for publication-quality output off-line.

Mainly home-brew solutions in research. (GLIM macros?)

1990

PCs become widespread, but FPUs still uncommon.

Sun etc workstations available for researchers, and for teaching in a few places.

Graphics could be pretty good (PostScript printers, ca 1000 × 1000 pixel screens), but often was not, and mono text terminals were still widespread.

C was beginning to be used, as more portable than Fortran. (Few PC Fortran compilers then and now.)

Still SAS, SPSS etc as batch programs. S beginning to make an impact on research and teaching.

2001

Little spread in machine speed (min 500MHz, max 1.5GHz), fast FPUs are universal.

Colour everywhere, usually 24-bit colour. The video-games generation is now at university.

Few people would dream of writing a complete program for a research idea: prototype and distribute in a higher-level language such as S or Matlab or Gauss or Ox or ....

Fortran is still used in scientific computing, but C or C++ is preferred, and Java has its advocates.

SAS lives on as a pseudo-batch program.

Lots of specialized tools are widespread, such as Perl, Python, Web browsers. XML (eXtensible Markup Language) is the flavour of the year.

Scene-setting: The ‘S’ Language

Largely the work of one person, Dr John M.
Chambers of Bell Laboratories (formerly AT&T, now Lucent Technologies).

Awarded the prestigious 1998 Association for Computing Machinery Award for Software Systems for, in the words of the citation,

    the S system, which has forever altered how people analyze, visualize, and manipulate data.

For the last decade it has been the major vehicle for the delivery of new statistical methodology to end users.

S has a long history: the GR-Z graphics system goes back to 1976.

JMC is now a Bell Labs Fellow, and is working on Omegahat, so that can be considered the successor to S.

S History

The names have changed (‘New S’ and ‘QPE’ came and went) but the flavours of S are now known mainly by the colours of the covers of the books co-authored by Chambers.

    S1  1984  brown  macro-based extension language
    S2  1988  blue   user-written extensions as first-class objects
    S3  1991  white  classes, some statistical functionality
    S4  1998  green  more rigorous class system

All were Unix programs written in C and Fortran.

S-PLUS was first produced in 1988 by a start-up in Seattle called Statistical Sciences which in 1993 acquired exclusive marketing rights to S and merged with MathSoft. In 2001 they demerged and became Insightful.

S is not thought of by its developers as a statistical system, rather as an interactive environment for data analysis and graphics, a system within which to do statistics.

S-PLUS has been available for a limited range of Unix platforms and DOS and then Windows. It was not available for Linux until 1998, and never for Macintoshes.

The Unix versions have been based on S4 since 1998: all future Windows versions will be (now due August 2001).

S-PLUS is very widely used for teaching statistics at graduate level. Some of the early enthusiasts were earth scientists, and it has been used for service teaching.

It has had less impact for mainstream undergraduate teaching, despite radical approaches like Nolan & Speed (2000) Stat Labs: Mathematical Statistics through Applications.
Academic licences for S-PLUS remain fairly expensive (although there is a CHEST deal in the UK).

It is now pretty successful in several commercial sectors (finance, pharmaceuticals, manufacturing).

What is R?

R History

R is a system originally written by Ross Ihaka and Robert Gentleman of the University of Auckland (so the naming is clear) in the early 1990s.

To the user it looks like a dialect of the S language, but the internal implementation is based on ideas from Scheme (a member of the LISP family). It is ‘not unlike’ S3.

Probably this started as a research project, but versions were used at Auckland for elementary classes, on Macintoshes with 2Mb of memory.

By 1997 other people had become involved, and a core team had been set up with write access to the source code. (No one kept records of who joined when.) There was a Windows version, and Linux users pushed development forward, there being no S-PLUS version available for Linux.

I became involved in 1998, and a member of the core team in Jan 1999.

The first non-beta version of R, 1.0.0, was released on 29 Feb 2000. The latest, 1.2.3, was released on April 26th.

Where is R now?

It is a system available as source code (at www.r-project.org) that compiles on almost all current Unix and Linux systems, and has binary versions for the major Linux distributions (Red Hat, SuSE, Debian, Mandrake), FreeBSD, 32-bit Windows and classic Macintosh (which also runs on MacOS X, on which the Unix port also builds).

It is distributed under GPL2 (version 2 of the GNU General Public Licence).

The core system is fairly small but can be extended by packages, 10 of which ship with R and over 100 of which are available from CRAN (cran.r-project.org and mirrors), 13 of them ‘recommended’. Collectively these cover a wide range of statistical functionality, mainstream and oddball.

Most things one can do with S-PLUS can be done with R and its packages.

Applications and Comparisons

What is R being used for?

With a freely-distributable product, it is hard to know!
However, users tend to ask for help, and a few contribute.

One of my main motivations for being involved is a (perhaps the) major use: to provide a first-class statistical system to students and researchers in the third world.

There are now many examples of R being used for large-scale data analysis. It was used for election forecasting in Austria and will be used (by David Firth) in the UK. My group use it to analyse 100Mb brain images.

There are several applications in gene expression arrays, at least two of which are commercial systems built on R, and one, sma, is available from CRAN.

It is clear that researchers in many commercial companies are building systems around R.

Web-based Statistical Teaching

There are two harnesses, Rcgi (Mike Ray, UEA) and Rweb (Jeff Banfield, Montana State), for running R sessions from Web browsers. Both provide a simplified teaching interface.

Rweb provides ‘a set of point and click modules that are useful for introductory statistics courses and require no knowledge of the R language’.

Rcgi Example

Rweb Example Module

Embedding

Embedding can be taken much, much further.

It is most advanced on Windows, where Thomas Baier’s DCOM interface allows R to be called from Excel, Visual Basic, ..., but there is also a Unix/Linux version of R as a shared library.

These enable R to do what it does best, statistical analysis and presentation graphics.

Medieval Chant

Musicologists undertaking detailed analyses of manuscripts of Western Christian liturgical chant dating back to the ninth century CE would welcome computer assistance. (Emma Hornby & John Caldwell, Faculty of Music in Oxford, statistics by Ruth Ripley.)

The early manuscripts employ several different notations, using neumes rather than notes. There are about twenty-five neumes, plus markings.

There are a few thousand known chants with further variations between manuscripts.
Ideally one would use optical character recognition to read them in, but exploring the feasibility of that is a project for a Master’s student this summer. At present chants are entered by a point-and-click data entry system written in Visual Basic.

Medieval Chant: Design Issues

• The system has to be usable on fairly minimal Windows PCs by users whose experience stretches to Word and Internet Explorer.
• Need to build a database of chants.
• Non-trivial display issues: involved designing a TrueType font.
• The matching algorithms to be used are fairly complicated and subject to tweaking, and will result in a similarity matrix S.
• Given S, use standard multivariate techniques to compare chants (or verses or phrases of chants).

The solution has been to use a Visual Basic front-end driving a database interface and also a connection to an R server via DCOM.

Medieval Chant: Sample Results

Similarity matrix for chants:

                         Deus deus meus   Domine exaudi   Audi filia   Domine audivi a
    Deus deus meus            00.0            26.0           28.0           29.0
    Domine exaudi             00.0            00.0           25.0           41.5
    Audi filia                00.0            00.0           00.0           21.0
    Domine audivi audit       00.0            00.0           00.0           00.0

Deus deus meus compared with Domine exaudi: (similarity 26.00)

[Figure: neume-by-neume alignment of matched phrases from the two chants. The neume glyphs did not survive text extraction; the text underlay of the three matched phrase pairs reads:]

    e- -ri- -pi- -at e- -um sal- -vum fa- -ci- -at e- -um
    et a- -rutt cor me- -um qui- -a o- -bli- -tus sum

    -cit do- -mi- -nus
    con- -fri- -xa sunt

    -um u- -ni- -ver- -sum se- -men ia-
    me In- -qua- -cum--que di- -e tri-

Sample Results: Analysis

[Figure: dendrogram of phrases within the four verses of a chant, with groups highlighted.]
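
The sample similarity matrix for the four chants is given as an upper triangle only, whereas standard multivariate techniques generally expect the full symmetric matrix. As a minimal sketch of that preparatory step, the Python below mirrors the upper triangle and ranks the chant pairs by score. It is an illustration only: the project itself used R behind a Visual Basic front-end, the helper names `symmetrize` and `ranked_pairs` are hypothetical, the fourth chant's label is shortened to "Domine audivi", and the slides do not say whether a larger score means the chants are more or less alike, so the code ranks the scores without interpreting them.

```python
from itertools import combinations

# Chant labels and upper-triangle scores copied from the sample matrix.
# The fourth label is shortened; the slides show it truncated.
CHANTS = ["Deus deus meus", "Domine exaudi", "Audi filia", "Domine audivi"]
UPPER = [
    [0.0, 26.0, 28.0, 29.0],
    [0.0, 0.0, 25.0, 41.5],
    [0.0, 0.0, 0.0, 21.0],
    [0.0, 0.0, 0.0, 0.0],
]


def symmetrize(upper):
    """Mirror the strict upper triangle into the lower one (hypothetical helper)."""
    n = len(upper)
    full = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        full[i][j] = full[j][i] = upper[i][j]
    return full


def ranked_pairs(names, full):
    """Return (score, name_i, name_j) for every pair, highest score first."""
    pairs = [
        (full[i][j], names[i], names[j])
        for i, j in combinations(range(len(names)), 2)
    ]
    return sorted(pairs, reverse=True)


S = symmetrize(UPPER)
for score, a, b in ranked_pairs(CHANTS, S):
    print(f"{score:5.1f}  {a} / {b}")
```

The symmetric matrix `S` is the kind of input a clustering or scaling routine would take, after converting scores to dissimilarities if the routine requires them.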