<<

A joint newsletter of the Statistical Computing & Statistical Graphics Sections of the American Statistical Association.

April 97 Vol.8 No.1

A WORD FROM OUR CHAIRS TOPICS IN MEDICAL IMAGING Statistical Computing Functional Magnetic Resonance Imaging is a Daryl Pregibon is the 1997 Chair Team Sport of the Statistical Computing Sec- tion. While this is his ®rst column By William F. Eddy for the Newsletter, he is familiar It began like this. It was a cold, dreary day in Novem- with the role of Section Chair. ber 1993. I was sitting in my of®ce studying some box- plots when someone knocked. I looked up. It was a This is my ®rst column as Chair of the Statistical Com- rather wild-lookingman, long gray hair ¯ying about. He puting Section, but alas, it is not my ®rst tour of duty. was very animated, his intensity almost scary. ªI'm a Imagine if you will, a time when a well-groomed statis- psychologist; I need some help,º he said. I smiled; it tician could be sighted wearing bell-bottom trousers and was usually the other way around. He showed me a pic- a zipper-front shirt, perched atop 3 inch platform shoes. ture, a slice of a human brain, he said. I commented that These fashion statements are so old that they are now it didn't seem like very good resolution; ªI thought X- back ± so it goes with my tenure as Chair. And as be- rays were a lot clearer than this fuzzy thing.º He said fore, I look forward to the responsibilities that the posi- it was done with magnetic resonance not x-rays and he tion brings ± including the opportunity to write this col- was studying brain function not brain structure. I tried umn! CONTINUED ON PAGE 2 to absorb the idea. He said the picture was not actually a picture of a brain but rather a visual display of a col-

Statistical Graphics lection of t-tests, one for each pixel in the slice. I smiled

again thinking how silly it was to do thousands of t-tests. He noticed my smile and said ªI don't know any statis- Sally Morton is the 1997 Chair tics but someone told me about `multiple comparisons.' of the Statistical Graphics Section. I've read up on it and I did the Bonferroni corrections.º She welcomes your feedback and My smile turned into outright laughter. ªNothing is sig- comments on any issue of interest ni®cant,º he said; ªplease help me.º By now I was ®gu- to your section. ratively rolling on the ¯oor. We talked a bit more; I de- cided this was not a problem I wanted to work on. I gave Let me begin my ®rst column by thanking Bill him the names of a few colleagues who, I thought, might DuMouchel for his leadership as our Statistical Graph- straighten him out; I imagined a few Bayesian thoughts ics Chair during the past year. I'd also like to thank might thoroughly discombobulate him. I sent him away Deborah Donnell and Jane Gentleman for their service and promptly forgot all about the incident. as Publications Of®cer and Council of Sections Repre- Aside: Just so you have some idea what this is about, I sentative respectively over the past three years. have included a couple of ®gures (page 17). The ®rst is In 1997, Michael Minnotte will be our new Publications a functional image of a brain; it takes about 32 millisec- Of®cer and Bob Newcomb continues as our Secretary/ onds to acquire the data for it (although the duty cycle of Treasurer. Lorraine Denby and Colin Goodall continue the machine and FDA regulations limit the acquisition CONTINUED ON PAGE 3 CONTINUED ON PAGE 17 EDITORIAL Data visualization in the context of large-scale gene expression analysis is the topic of a paper by Daniel Carr, Roland Somogyi, and George Michaels. The tools On-Line PDF Version they propose for looking at gene expression data include This Newsletter marks our ®rst full year as editors. As stereo plots, parallel coordinate (time series) plots and we commented in our maiden issue, we inherited a qual- conditioned parallel coordinate plots. ity publication from Mike Meyer and Jim Rosenberger. Also, make sure you don't miss the article by Clive During the past year, one of our goals has been to main- Loader describing Loc®t, a software package designed tain the standards they set during their three year tenure to be used in S/Splus that performs local regression, as editors. So, you are probably asking yourself, why is likelihood and related smoothing procedures. the quality of the paper not as nice this time around? The Among the several important notices and announce- answer is simple: because we wish to keep offering you ments that we hope you will ®nd useful, let us point out a ®rst-rate product. For many of the articles we publish, the report on the results of the Computing Section's Stu- color ®gures constitute an essential ingredient, but, alas, dent Paper Competition and the announcements for next they are expensive to print. With the money we save on year's competition and it's companion to be sponsored the paper we can offset the increasing cost of color print- for the ®rst time by the Graphics Section. ing and still balance the budget. This issue should reach your mailboxes before the Joint There are, in fact, other avenues that could be ex- Statistical Meetings in Anaheim and contains a timely plored in order to cut printing costs. One possi- article by our Program Chairs, James Rosenberger and bility would be to stop printing the newsletter al- Dianne Cook, outlining our sections' activities at the together. This might be a drastic measure, but we Meetings. The plans are very exciting and we hope to are now offering a much improved on-line version see all of you there. at http://cm.bell-labs.com/who/cocteau/ newsletter/. You can conveniently browse through Mark Hansen the new PDF (Portable Document Format) version us- Editor, Statistical Computing Section ing the Acrobat reader available free from Adobe. The Bell Laboratories old PostScript version is still available, but we strongly [email protected] encourage you to rely on the new PDF version, a much more interactive document due to the addition of arti- cle listings and various links. As we are contemplat- Mario Peruggia ing polling our readers on the issue of phasing out the Editor, Statistical Graphics Section printed version, we would greatly appreciate receiving The Ohio State University

your suggestions on this matter. Please, let us know [email protected]

what you think and what questions we should ask. In this issue, we feature an article by William Eddy on functional MRI. Bill tells us, in a lively and informal style, how he became interested in the study of brain FROM OUR CHAIRS (Cont.). . . function, what statistical issues are involved, and why this scienti®c work would be impossible without the co- Statistical Computing operation of researchers from ®elds as varied as psy- chology, radiology and (of course!) . CONTINUED FROM PAGE 1 Since my previous term, important changes have oc- In another interesting article, John Salch and David curred on the statistical computing landscape and I'll try Scott describe a novel exploratory tool that they devel- to highlight some of these. Perhaps the most important oped to visualize high dimensional data. The method change to the Section is this Newsletter. While all of us they propose, called the Density Grand Tour, is based now take it for granted as one of the bene®ts as Section on the combination of data exploration along continu- members, it is a very recent innovation. In my opinion, ous paths through lower dimensional projections and ef- it has evolved into the premier newsletter in the Associa- ®cient density estimation techniques. John and David tion, combining interesting (and readable!) technical ar- present examples of univariate and bivariate tours and ticles with timely announcements concerning activities discuss the dif®culties associated with the implementa- relevant to the Section membership. tion of trivariate tours.

2 Statistical Computing & Statistical Graphics Newsletter April 97 Many of us remember the good old days of the sity; our Program Chair-Elect, Mark Hansen from Bell ARPANET, a world-wide web free of banners and Labs; and our new Secretary/Treasurer, Merlise Clyde junk email. But with all the bad comes a lot from Duke University. Complete election results will be of good, namely the Section's web presence at available shortly from the ASA home page. I especially http://www-stat.montclair.edu/asascs.This want to thank all of you who were candidates and all of is the best place to look for information on Section ac- you who took the time to send in your ballots. tivities, including online versions of the Newsletter. On behalf of all the current and newly elected Section The ®nal bit of nostalgia concerns a subject near and Of®cers, we look forward to seeing you in Anaheim in dear to many of our hearts Ð namely data analysis. This August ± have a great summer! subject gained respect in statistical circles in the early 70's. But this pales in comparison to the visibility it Daryl Pregibon is now receiving in computer science circles under the Statistics Research AT&T Labs name ªdata mining.º Is it good or bad? In my view it [email protected]

is very good as it provides a tremendous opportunity for statistics to illustrate its central role in science and tech- nology. While the end product is shared between sta- tistical and CS researchers, the means to the end is of- ten quite different. In order for both disciplines to better appreciate what each other has to offer, the ASA is co- Statistical Graphics operating with AAAI (American Association for Arti®- cial Intelligence) by co-locating the premier data mining CONTINUED FROM PAGE 1 conference, KDD-97 (Knowledge Discovery and Data continue as Council of Sections representatives, and are Mining) with the JSM. The two conferences will be joined by Roy Welsch this year. Mike Meyer is our in- back-to-back in southern California during the week of coming Chair-Elect. Both the Graphics and Comput- August 10-17. Depending on demand, there will be bus ing Sections are ably served by our Newsletters Editors shuttle service on Thursday August 14 to transport JSM Mario Perruggia (Graphics) and Mark Hansen (Comput- participants from Anaheim to nearby Newport Beach. ing) who continue to maintain our high quality and in- I encourage anyone interested in attending this confer- formative section Newsletter. Section members are en- ence to register early since the already steep registration couraged to discuss any issues or concerns regarding the fees increase through time. (Please note that the regis- section or ASA matters with any section of®cer. Contact tration includes admission to all tutorial sessions.) information is provided on the inside back page of this issue. I saw many of you at the very successful Interface '97 Conference held last month in Houston. In addition to Later in this issue, our ASA Program Chair Dianne the technical presentations and the timely conference Cook describes in detail the Graphics Program for theme, ªMining and Modeling Massive Data Setsº, In- the August Joint Statistical Meetings in Anaheim. Di terface '97 also provided an opportunity for the Execu- and her session organizers are to be congratulated tive Committee to meet. Three items that I would like for putting together an innovative and exciting pro- to highlight are featured as articles elsewhere in this is- gram for us. I hope to see you at the Graphic ses- sue of the Newsletter: the successful student paper com- sions, including the Data Exposition poster session petition organized by the Section, the request for partic- which involves analyzing a hospital report card dataset. ipation on the Executive Committee in the Continuing Though it's too late to submit a poster, you can access Education area, and the Section's Program at the JSM the data for this and past years' Expositions via the in Anaheim. The other relevant item of business is that web at http://lib.stat.cmu.edu/data-expo/. the Executive Committee approved limited funds to as- Above all, be sure not to miss the Computing and sist Section members who want to attend KDD-97. If Graphics mixer Monday night at the Meetings. We'll you are in this category, send me an email message out- have door prizes and good food in addition to the lining your ®nancial constraints and the willingness of latest section news. If you can stay on in South- your institution to provide ªmatching fundsº. ern California, consider attending the Third Inter- national Conference on Knowledge Discovery and Finally, I have just received the ASA election results. Data Mining (KDD-97) to be held in Newport Please join me in congratulating our new Chair-elect, Beach immediately following the ASA Meetings James L. Rosenberger from Pennsylvania State Univer- (http://www.aig.jpl.nasa.gov/kdd97).

April 97 Statistical Computing & Statistical Graphics Newsletter 3 Even though this August 1997 is still a few months NEWS CLIPPINGS AND SECTION NOTICES away, it's not too early to start thinking about the August 1998 Joint Meetings in Dallas whose theme is ªStatis- tics: A Guide to Policy.º Ed Wegman is our Program Results of the Student Chair-Elect. Please contact Ed if you have ideas regard- Paper Competition ing the Graphics Program. By Daryl Pregibon and Trevor Hastie Your section dues have been put to good use in the past The results are in! We have selected the four winners of year. In particular, our section contributed to the Eliza- the Statistical Computing Section's 1997 Student Paper beth Scott and George Snedecor Awards initiated by the Competition. In alphabetical order they are Committee of Presidents of Statistical Societies. These awards are given in alternate years. The 1996 Eliza-  Wenjiang J. Fu, University of Toronto beth Scott award was given to Grace Wahba for fostering Penalized Regressions: the Bridge versus the Lasso

the role of women in statistics, and the 1995 Snedecor  Alan Gous, Stanford University Award for an exceptional published paper in Biometry Adaptive Estimation of Distributions using was given to Norman Breslow and David Clayton. Exponential Sub-Families

In terms of new initiatives, in the coming year our sec-  Gareth James, Stanford University tion will fund a student paper competition in parallel The Error Coding Method and PaCT's

with the Computing Section's competition. Please see  Ramani S. Pilla, Pennsylvania State University the article later in this issue describing the competi- New Cyclic Data Augmentation Approaches for tion rules and deadlines, and encourage your students Accelerating EM in Mixture Problems to enter. At the Interface `97 Meetings in May, we Before we tell you more about each of the winners, a bit discussed our sponsorship of this competition and the more about the competition itself. This is our third such Poster Competition for students in kindergarten through competition, and this year we had 16 entries. There were high school. In addition, we considered how our section many good submissions, but these four prevailed. If you might support the Journal of Computational and Graph- plan to attend the ASA meetings this year, come and see ical Statistics (JCGS), including the possibility of offer- for yourself! ing a prize for the best graphics paper. We especially need suggestions from you regarding Continuing Edu- As part of their prize, each of the four winners will cation Courses you'd like to take or teach, a dataset for present their papers in a special session at the ASA. The the next Data Exposition, and how the section might more tangible part of their prize is that their entire con- serve the membership better. ference trip will be covered by the section ± airfare, hotel and registration! The approximate value is $1000 each. We welcome ideas for other initiatives, and we can al- They also get their names and faces in this newsletter as ways use volunteers. If you'd like to get involved with well as the Amstat news ± publicity gives a great kick the section or have comments, please contact me via start to a career! email. We would like to thank two colleagues who helped in the I look forward to seeing all of you at conferences and review process this year: Mark Hansen at Bell Labs and symposia throughout the year, Bill DuMouchel at AT&T Labs. Mark, as you know, is the current Co-Editor of this Newsletter, and Bill is Past- Sally Morton Chair of the Section. RAND We intend to hold a studentpaper competition every year Sally [email protected] and in the upcoming year there is a strong possibility

of joining forces with the Graphical Statistics Section.

This would allow for even more winners and hopefully even more submissions. Registered students are eligible to submit papers. See the announcement on page 34 of this newsletter for more details of the competition and conditions for entry. Professors, please make a note in your calendar to get your students geared up in time for next year's submissions. The deadline for submissions for the 1998 competition is early January 1998.

4 Statistical Computing & Statistical Graphics Newsletter April 97 Details on the Winners Gareth James Wenjiang J. Fu Gareth is a third year Ph.D. student in Wenjiang is a fourth year Ph.D. student in the Statistics Department at Stanford Uni- at the University of Toronto versity. His advisor is Associate Profes- under the supervision of Professor Robert sor Trevor Hastie. Gareth plans to pur- Tibshirani. He is graduating this year. sue an academic career because he enjoys His thesis is titled ªA Statistical Shrinkage both teaching and research. His research Model and Its Applicationsº. Wenjiang is interests include classi®cation problems pursuing a career in applied statistical re- with large numbers of classes, multivariate search. His research interest includes bio- statistics and generalizations to bias and statistics and statistical computing, partic- variance concepts. ularly in longitudinal models, survival models, cancer research, and AIDS studies. He received a B.Sc. in Mathematics from The Error Coding Method and PaCT's Peking University, an M.Sc. in Mathematics from Tsinghua Uni- versity, and an M.Sc. in Statistics from the University of Toronto. A new class of classi®cation techniques have recently been developed in the statistics literature. A ªplug inº Penalized Regressions: the Bridge versus the Lasso classi®cation technique (PaCT) is a method that takes a

We consider Bridge regression, a special family of pe- standard classi®er (such as LDA or nearest neighbors) P

and plugs it into an algorithm to produce a new classi®er.

j j

nalized regressions with penalty function i .A general approach to solving for the regression coef®- The standard classi®er is known as the Plug in Classi- cients is developed. A simple and ef®cient algorithm for ®er (PiC). These methods often produce large improve-

ments over using a single classi®er. In this paper we in- =1 the Lasso ( ) is obtained by studying the structure

vestigate one of these methods developed by Dietterich

of Bridge solutions. The shrinkage parameters,  and , are optimized via generalized cross-validation (GCV). and Bakiri (1995). They use Error Correcting Coding Comparison among several models are made through a Theory ideas to motivate their method but provide little simulation study. The method is demonstrated through hard theory. We give some further insight into the prob- an application to prostate cancer data. Some computa- lem. tional advantages and applications are discussed. Ramani S. Pilla Alan Gous Ramani is a ®nal year Ph.D. student in Alan is a third year Ph.D. student in the the Department of Statistics at the Penn- Statistics Department at Stanford Univer- sylvania State University. Her thesis is sity. His advisor is Brad Efron. Alan re- entitled ªEM-Based Methods in High- ceived a B.Sc.(hons) from the University Dimensional Finite Mixtures: Theory & of Natal, Pietermaritzburg, South Africa. Applications,º and her advisor is Profes- His research interests include exponential sor Bruce G. Lindsay. Ramani's career families, splines, numerical optimization plans include research, teaching, and con- methods in statistics, and bioinformatics. sulting. Her statistical interests include likelihood methods in mixtures, missing data analysis, modern statistical computation, and statistical mod- eling with applications to mixture models, image analysis, genet- Adaptive Estimation of Distributions using ics, and neural networks. Exponential Sub-Families New cyclic data augmentation approaches for accel- erating the EM algorithm in mixture problems. Suppose we are given a number of groups of data points on the real line. The observations are i.i.d. within each Consider a general class of maximum likelihood (ML) group, but the groups are drawn from different distribu- problems in which one is interested in maximizing a tions, not well modeled as members of one of the stan- mixture likelihood with ®nitely many known compo- dard distribution families. An algorithm is presented nent densities over the set of unknown weight param- which ®nds an appropriate low-dimensional exponential eters. In this mixture setting, convergence of the EM family to model these data from among sub-families of algorithm will be extremely slow when the component an exponential family of much larger dimension. Model densities are poorly separated and when the ML solu- selection between sub-families of different dimension is tion requires some of the weight parameters to be zero. discussed. The algorithm is implemented for the spe- To address this problem, two basic devices are used to cial case in which the large family is a logspline family construct new EM type algorithms. First, the ªcomplete of distributions. An example data set is analyzed using dataº is constructed in two new ways, each designed to these methods. increase the convergence rate in the determination of the

April 97 Statistical Computing & Statistical Graphics Newsletter 5 relative weights of pairs of neighboring mixture com- Statistical Graphics ponents. These representations correspond to two dif- Chair-Elect: ferent data augmentations; neither greatly increases the Dianne Cook, Iowa State University numerical complexity of the EM algorithm. Secondly, cyclic versions of these two basic methods are created Program Chair-Elect: by changing the missing data formulation between EM Deborah Swayne, Statistics Research AT&T Labs steps. Results on high dimensional mixtures of bino- Council of Sections Representative (1998±2000): mial, Poisson, and normal show that the new methods David W. Scott, Rice University converge many (100 to 300) times faster than the con- ventional EM across these three cases. Newly elected Section Of®cers should attend our Busi- ness meeting/mixer in Anaheim. Congratulations! We especially want to thank those of you who were can- Congratulations to our four winners, and to everyone didates, as well as those of you who took the time to who participated in this year's competition. vote. Your continuing interest in and support for our

Sections' activities are greatly appreciated.

Trevor Hastie Stanford University [email protected]

Daryl Pregibon Buja Named JCGS Editor Statistics Research AT&T Labs The Journal of Computational and Graphical Statistics [email protected] management committee is pleased to announce the ap-

pointment of Andreas Buja, as the new editor of the jour-

nal 1998-2000. He will succeed William J. Kennedy, who has held the editorship since 1995. For 1997 Buja will serve as Editor-Elect, and then he will serve as Editor for a three year term, from 1998 through 2000. ASA Election Results The appointment was recommended by the manage- ment committee and approved by the boards of the three Recently you were asked to vote for a number of ASA sponsoring organizations: the American Statistical As- Of®cers, including ASA President and Vice-President. sociation, the Institute of Mathematical Statistics, and In a memorandum from Ray A. Waller, Executive Di- the Interface Foundation of North America. rector, the ASA has just released the election results:

President-Elect, 1998 (President 1999) Manuscripts and correspondence should continue to be Jonas H. Ellenberg, Westat, Inc. sent to the current editor: Vice President, 1998±2000 William J. Kennedy, Editor JCGS Mary Ellen Bock, Purdue University Department of Statistics In addition, several of®cers of the Statistical Computing 117 Snedecor Building and Graphics Sections were elected during this round of Iowa State University balloting. Ames, IA 50011 Email: [email protected] Statistical Computing until August 31, 1997. From September 1 on, new sub- Chair-Elect missions should be sent to the new JCGS editor: James L. Rosenberger, Andreas Buja, Editor JCGS The Pennsylvania State University AT&T Laboratories Email: [email protected] Program Chair-Elect: http://www.research.att.com/ andreas Mark H. Hansen, Bell Laboratories Secretary/Treasurer (1998±99): The mailing address and phone numbers are available Merlise Clyde, Duke University from the URL above.

6 Statistical Computing & Statistical Graphics Newsletter April 97 Andreas Buja's current mailing address is:  Review and selection of proposals;

AT&T Laboratories  Management of the CE student scholarship program. 180 Park Ave Continuing education is a very important service that the P.O. Box 971 Section provides to its members and to the profession. Florham Park, NJ 07932-0971 I have truly enjoyed participating in the development of The management committtee extends our thanks to the program and think that it offers a great opportunityto Stephen Fienberg, outgoing chair of the JCGS manage- help promote statistical education and thinking. If you ment committee, for conducting a vigorous search re- have an interest in serving on the Statistical Computing sulting in this important appointment. Section CE committee, please contact me via email.

James L. Rosenberger Tom Devlin Chair, JCGS Management Committee Montclair State University The Pennsylvania State University [email protected]

[email protected]

Invitation to Membership TOOLS FOR DATA ANALYSIS on SCS Continuing Data Exploration with the Education Committee Density Grand Tour By Tom Devlin By John D. Salch and David W. Scott The Statistical Computing Section has had a very suc- Exploratory analysis can often lead to effective model- cessful continuing education program over the past ®ve ing and powerful inference. Interactive tools, such as years. We have offered exciting courses at the Joint Sta- XGobi (Swayne, Cook, and Buja, 1990), provide a uni- tistical Meetings (JSM) and have co-sponsored courses ®ed environment to work with techniques available for

with other organizations such as NIBS and the Interface this purpose. However, these methods become infeasi-

 10; 000 Foundation. These offerings have included: ble when applied to moderately large (n )data sets. It is pondering this de®ciency that has driven the

 ªExtending the Cox Modelº by Terry Therneau work discussed herein.

 ªIntroduction to Complex Bayesian Modeling using The reliance on scatterplots, as a means of visualizing BUGSº by and Nicole Best data, represents one computational stumbling block for

 ªMultivariate Density Estimation and Visual Cluster- standard exploratory methods. With the resolution of ingº by David Scott modern CRT's it seems very unlikely that visualizing data points themselves will remain feasible without sub-  ªModern Nonparametric Regression and Classi®ca- tionº by Trevor Hastie and Rob Tibshirani setting. Other graphical methods, such as binning (Carr 1991), glyphs (Chambers et. al. 1983), and high resolu-

 ªResampling-Based Multiple Testingº by Peter West- tion density plots (Scott 1996), provide practical alter- fall and Stanley Young natives. I have been appointed the Section's Electronic Commu- One commonly used method for exploring high- nication Liaison and am stepping down as chair of the dimensional data is the Grand Tour (Asimov 1985). Continuing Education (CE) Committee. The Section in- The basic premise is to explore data of high-dimension

vites people with strong interests in education and con-

2 3 by continuous paths through projections in 1, ,or - tinuing education to membership on the CE Committee dimensions. This method involves viewing a sequence and to help coordinate the CE program. Responsibilities of scatterplots in time, allowing the user to ªtourº the of the CE Committee include: data in search of interesting structure. This method

 Solicitation of proposals for presentation at the JSM works well if the data set is reasonably small and if its and at Section co-sponsored events, e.g. the Interface; dimension is not too large.

April 97 Statistical Computing & Statistical Graphics Newsletter 7 We propose combining ef®cient density estimation, via the averaged-shifted histogram (ASH) (Scott 1985), with the Grand Tour to create the Density Grand Tour, or DGT. Here we describe software that we are making available for the SGI and demonstrate it with real data. Univariate Tour: The Iris Data Revisited The 1-D DGT can be described as an attempt to view all possible one-dimensional projections of the data in a semi-orderly fashion. In the case of Fisher's Iris Data, there are 4 dimensions and 150 data points, which rep- resent measurements on 3 different species of irises (se- tosa, virginica, versicolor), 50 points per species. To start, the user ®rst invokes Geomview (Phillips 1993), an interactive environment for display- ing geometric entities, produced and distributed by researchers at the NSF Geometry Center (http://www.geom.umn.edu). This environment in- cludes many modules providing such features as zoom, rotation, and rearranging of light sources as well as PostScript snapshots. The ability to easily add new modules is very powerful and facilitated the coding of the DGT. Once Geomview is started, the user can then select the ªDGT 1Dº from the ªmodulesº menu. The Density Grand Tour operates as a module of Geomview, written entirely in C, thus eliminating the need to write our own display code. After invoking the DGT (1-D in this case) a Graphical User Interface (GUI) appears allowing interactive con- trol of the tour. Features include control over the ASH through the number of bins and the smoothing parame- ter. Besides ¯owing forward in time, the user can stop the tour, rewind it, and step through at at leisure. Fig- ure 1 demonstrates the GUIs for both the 1-D and 2-D tours. Three frames from the tour for the Iris data are displayed in Figure 2. These frames represent a small but inter- esting part of a complete tour and hint at the dynamic nature of the tour. As one can notice, in the ®rst frame there appears to be only one mode present, but by the Figure 1. Interactive GUIs for the DGT 1-D (top panel) third frame the multimodality of the data becomes clear. and the DGT 2-D (lower panel). In this example, the frames are 5 time units, or ªsteps,º apart, which represents a reasonably smooth transition. drawn on the same plot with different colors as shown However, the smoothness, or ªstep size,º of the anima- in Figure 3. In the case of the 1-D DGT, the differ- tion can be adjusted as the user desires. ent group-speci®c densities are plotted with an offset so Other features of the DGT include ªbrushingº and the as to prevent overlap. The user can also select ªDraw visual display of the data points and projection (weight) Pointsº to have the projected data points drawn adja- vector, extensions of XGobi capabilities. Brushing is cent to the 1-D ASH, while the tour is running. ªAdd implemented by computing a separate density estimate Weightsº can be selected to display the projection vec- for each unique group. Then all of these densities are tor currently being used.

8 Statistical Computing & Statistical Graphics Newsletter April 97 Figure 2. Three frames from the univariate tour of the Iris Data.

Figure 3. The effect of brushing on the univariate tour. Bivariate Tour: Landsat Image Data A demonstration of the 2-D DGT is given using a sub-

set of a Landsat satellite image. The data represent

; 034 8 pixels from three groups of farm crops (sun- ¯ower, spring wheat, and spring barley). Subsetting was performed only to reduce the data down to three distinct subgroups (from 32), so as to reduce confusion in the ®gures below. Within each of the three subgroups the data are otherwise complete. The 2-D DGT is invoked similarly to the 1-D version, but the GUI is slightly different (see Figure 1). Since the ASH is based on two dimensional data, there are con- trols for the number of bins and smoothing parameter in Figure 4. Five frames from the bivariate tour of the both the ªXº and ªYº directions. Landsat Image Data. Time progresses from left to right Figure 4 illustrates 5 frames (sequential in time) from a and from top to bottom. tour of the Landsat data. After looking at these frames one might suspect that there is at least one crop con- The user can also select ªDraw Pointsº to superimpose tributing to the multimodality. Using ground truth, a scatterplot of the data. The points are plotted with a which was available for these data, brushing was acti- ®xed Z-coordinate so as to appear above the density plot vated giving a clear picture of how the individual groups in the graphics window. ªAdd Weightsº displays both are contributing to the whole. projection (weight) vectors used to obtain the projected As before, brushing is implemented by plotting a sepa- data at the current time. rate density estimate for each brushed group on the same Trivariate Tour graph (see Figure 5). However, for the 2-D DGT, the plots are allowed to overlap thus obscuring some of the A working module of a 3-D DGT was written using density for each. Despite the overlap, we can still ob- Splus and the Animate module of Geomview. We dis- serve which group has the highest density at each set of covered that the visualization overhead was too high for coordinates. We have explored ideas such as utilizing all but a few of the fastest machines. While conceptually hardware transparency to avoid the visual confusion, but more powerful than the bivariate tour, the trivariate tour these have proven unsatisfactory. is much more dif®cult to follow from frame to frame.

April 97 Statistical Computing & Statistical Graphics Newsletter 9 Figure 5. The effect of brushing on the bivariate tour. Figure 6. Contours from a trivariate ASH. This would represent a single frame from a trivariate tour. This problem is magni®ed if the data have any non- trivial structure. Thus, we think that this option will not forms that run X-windows including PC's and Sun be available for a few more years (see Figure 6). workstations. The DGT code (written in standard C) has been compiled and tested on several platforms as well. Future Work Thus, it is possible to try the code on a variety of ma- There are several areas we are currently exploring as to chines. However, due to the dynamic graphical nature improve this software. Extending the idea of the Projec- of the product, we recommend using high-end SGI ma- tion Pursuit Guided Tour (Cook et. al. 1991) could be chines. very valuable as the dimensionality of the data become References more massive. The time cost of exploring all possible

Asimov, D., (1985), ªThe grand tour: A tool for view- 7 2-dimensional projections of, say, -dimensional data could be greatly reduced by locally optimizing a PP- ing multidimensional data.º SIAM Journal on Scienti®c index. However, the indices we have reviewed tend to and Statistical Computing, 6, 128-143. be too computationally expensive for use in a dynamic Carr, D.B., (1991), ªLooking at large data sets using setting such as this one. Thus, research into an ef®cient binned data plots.º Computing and Graphics in Statis- index is underway. tics, 7-39. We are also currently working on a new method to obtain Chambers, J.M., Cleveland, W.S., Kleiner, B., Tukey, better ASH estimates while reducing the amount of data P.A., (1983), Graphical Methods for Data Analysis. storage required. This could be very helpful in increas- Wadsworth, Belmont, CA. ing the feasibility of the DGT for larger data sets. We Cook, D., Buja, A., Cabrera, J., (1991), ªDirection and also think that high-dimensional binning could be done motion control in the grand tour.º In Computing Science ef®ciently and current results look promising. and Statistics. Proceedings of the 23rd Symposium on We are also hoping that the DGT could possibly be in- the Interface, 180-183. cluded in a future version of XGobi. This could combine Phillips, M., (1993), The Geomview Manual both the power of the XGobi interface and its computa- ftp://geom.umn.edu. tional ef®ciency with our method of exploration. Scott, D.W., (1985), ªAveraged shifted histograms: Ef- Software Availability fective nonparametric density estimators in several di- Software can be obtained via anonymous ftp mensions.º The Annals of Statisics, 13, 1024-1040. from ftp.stat.rice.edu in the subdirectory Scott, D.W., (1995), ªIncorporating density estimation /pub/scottdw/DGT.code. The software is written into other exploratory tools.º ASA Proceedings of Sta- entirely in C with the interface to Geomview. Note that tistical Graphics Section. versions of Geomview do exist for other computing plat-

10 Statistical Computing & Statistical Graphics Newsletter April 97

Support squares criterion:

n

X

x x i

This research was supported in part by the National Sci- 2

W Y a + a x x

0 1 i i (1)

ence Foundation (DMS-9626187) and the National Se- h =1 curity Agency (MDA 904-95-C-2203). i

By default, Loc®t uses the weight function



3 3

jvj  jvj < 1

John D. Salch 1

W v = :

Rice University 0 otherwise

[email protected] The bandwidth h controls the smoothness of the ®t. A

large h may result in oversmoothing, or miss important

David W. Scott features in the data, while a small h may result in a ®t

Rice University that is too noisy. The simplest choice is to take h con-

[email protected] stant; often it may be desirable to vary h with the ®tting

point x.

The local least squares criterion (1) is easily minimized

^a ^a 1

to produce estimates 0 and . The local linear estimate

x

of  is

^ x= ^a :

0

 x Loc®t: An Introduction Note that each least squares problem produces ^ for a

By Clive R. Loader single point x; to estimate at additional points, the local weights change and a new least squares problem must What is Loc®t? be solved. Loc®t is a software package performing local regres- Our example uses the ethanol dataset, studied exten- sion, likelihood and related smoothing procedures. It is sively in Cleveland (1993), and ®ts a local quadratic designed to be used in S/Splus, making extensive use model: of the data management and graphical facilities in that > fit.et <- locfit(NOxÄE, data=ethanol, language. The interface uses the modeling language + alpha=0.5) and classes and methods of S, so users familiar with the > plot(fit.et, get.data = T) existing modeling software in S (Chambers and Hastie 1992) should ®nd Loc®t easy to use. Most of the numer- Three arguments are given to the locfit() func- ical routines of Loc®t are written in C, and it is also pos- tion. The model formula, NOxE, speci®es a response sible to use Loc®t as a stand-alone C program. variable NOx and predictor E.Thedata=ethanol ar- gument provides a data frame, where the variables Loc®t can be obtained via the WWW at the URL may be found. The smoothing parameter is given by http://cm.bell-labs.com/stat/project/ alpha=0.5; this gives a nearest-neighbor based band- locfit; substantial online documentation can also be width covering 50% of the data. Figure 1 shows the plot found at this address. of the ®t.

Local Regression Bivariate local regression arises when there are two pre-

x = x ;x 

i;1 i;2 dictor variables: i . The localization Local regression was applied in a variety of ®elds in late weights then become

19th and early 20th centuries; see for example Hender-

kx xk

son (1916). The current popularity of local regression as i

w x= W ; i a statistical procedure is largely due to the Lowess pro- h cedure (Cleveland 1979) and Loess (Cleveland and De- where kkdenotes the Euclidean norm. The local

vlin 1988).

=u; v  quadratic model around a point x is

The underlying model for local regression is x   a + a x u+a x v 

i 0 1 i;1 2 i;2

2

+a x u + a x ux v 

Y = x +  ; 3 i;1 4 i;1 i;2

i i i

2

+a x v  :

5 i;2

x the function  is assumed to be smooth and is esti- mated by ®tting a polynomial model (most commonly, These are substituted into the local least squares crite-

linear or quadratic) within a sliding window. That is, for rion (1). Again, we minimize over the coef®cients, and

x ^x=a ^

each ®tting point , we consider a locally weighted least take 0.

April 97 Statistical Computing & Statistical Graphics Newsletter 11 ooo oooo ooo oooo 4 ooo

o 3 ooo o 2 oo Z o 1 oo o 0 o ooo -1 oo NOx o 18 oo o oo o 16 oo o 14 oo o 1.2 ooo o 12 1 1.1 o 10 0.9 oo o C 0.8 1234 ooo 8 oo ooo 0.7 E ooooo 0.6 o oo oooo ooo oo o

0.6 0.8 1.0 1.2 Figure 2. Bivariate local quadratic ®t for ethanol E dataset. Figure 1. Local quadratic ®t for the ethanol dataset. By changing the likelihood and weighting scheme, local likelihood estimates can be obtained in numerous dif- In Loc®t, multivariate local regression is requested by ferent settings. Those supported in Loc®t include likeli- adding additional terms on the right hand side of the for- hood regression models; density estimation; conditional mula. For example, the ethanol dataset contains a sec- hazard rate estimation and censored likelihood models. ond predictor variable C: As an example, we consider a mortality dataset from > fit.et2<-locfit(NOxÄE+C, data=ethanol, Henderson and Sheppard (1919). This consists of three + alpha=0.5, scale=0) variables: age; number of patients of each age, and num- > plot(fit.et2, type="persp") ber of deaths for each age. Local logistic regression is The scale argument allows the user to specify differ- requested by the family="binomial" argument, and ent scales for each variable, used in computing neigh- the number of patients is passed as the weights argu- borhood weights. When scale=0 is given, each vari- ment: able is scaled by its standard deviation. The two dimen- > fit.mo <- locfit(deathsÄage,weights=n, sional ®t can be displayed in several formats: contour + family="binomial",data=morths, plots (the default); perspective plots (Figure 2) and cross + alpha=0.5) sections, using trellis displays (Cleveland 1993). > plot(fit.mo, get.data=T) Local Likelihood Figure 3 displays the result. Note that while estimation Local likelihood ®tting was developed by Tibshirani is performed on the logistic scale, the result is automat- (1984) and Tibshirani and Hastie (1987). The proce- ically back-transformed to the probability scale. dure is applicable in situations such as binary data, when an additive Gaussian model is inappropriate as an er- For density estimation, the appropriate local likelihood

ror structure. In local likelihood, we simply replace criterion is

Z n

the local least squares criterion by an appropriate local X u x

w x logf x  n W f udu; i

log-likelihood criterion. For binary data, the local log- i

h =1

likelihood is i n

X (see Loader 1996). By default, we use the log-link; that

w xY log p +1 Y  log1 p 

i i i i i

f x

is, log  is modeled by local polynomials.

i=1

p = px  px i where i . We could model directly using In Loc®t, density estimation is requested with local polynomials; however, it is usually preferable (and family="density"; this becomes the default if no re-

the Loc®t default) to model via the logisticlink function, sponseis given in the formula. When link="ident" is

x = log px=1 px

. As in the local regression given, a local polynomial model for the density is used.

x case, we approximate locally by a polynomial, then Local quadratic ®tting with the identity link is one con- choose the polynomial coef®cients to maximize the like- struction of the fourth order kernel estimate discussed lihood. in section 6.2.3.1 of Scott (1992).

12 Statistical Computing & Statistical Graphics Newsletter April 97 0 o

o

o density deaths o o o oo oo ooo o o oo o oo o o o o o oo 0.0 0.2 0.4 0.6 o o o o o o o oooooo oo ooo 0.0 0.2 0.4 0.6 0.8 1. 123456 geyser 60 70 80 90 100 age Figure 3. Mortality data of Henderson and Shepherd: Local logistic ®t.

Our example estimates the density of the durations of 107 eruptions of the Old Faithful geyser: density > fit.of <- locfit(Ägeyser,flim=c(1,6),

+ alpha=c(0.15,0.9)) 0.0 0.2 0.4 > plot(fit.of,get.data=T,mpv=200) > fit.og <- locfit(Ägeyser,flim=c(1,6), 123456 + alpha=c(0.15,0.7),link="ident") geyser > plot(fit.og,get.data=T,mpv=200) Figure 4. Local quadratic density estimates for Old Note the two components to the smoothing parameter Faithful data: log link (top) and identity link (4th order alpha: the ®rst is a nearest neighbor component, and kernel) (bottom). the second a ®xed component. At each ®tting point, both components are evaluated, and the larger bandwidth is estimates the average squared estimation error. Other used in the local likelihood. tools, such as residual plots and con®dence bands, at- From Figure 4 the log-link provides a visually more ap- tempt to assess the importance of individualfeatures and pealing estimate; Loader (1995) provides substantial ev- check other modeling assumptions. idence that it is also a better estimate. While asymptotic For example, in Figure 3, the smooth seems to ®t the theory suggests the choice of link should have little ef- data nicely up to age 90, while the data becomes wild fect on the estimate, it is often advantageous in prac- for larger ages. But the number of patients is very small

tice to choose a link mapping the parameter space to for these larger ages, making it impossible to tell from

1; 1  . The Loc®t default satis®es this for all mod- Figure 3 whether there is any problem. Instead, we ex- els. amine residuals: Model Assessment > plot(morths$age, residuals(fit.mo), + xlab="age", ylab="residual") It is well known that smoothing parameters have a crit- ical in¯uence on the smooth curve: A large bandwidth Several possible de®nitions of residuals for general- leads to an oversmoothed curve that may inadequately ized linear models are given by McCullagh and Nelder model or completely miss important features, while a (1989), section 2.4. By default, Loc®t uses deviance

small bandwidth may undersmooth the curve, resulting residuals; when the smooth is adequate, the distribution

0; 1 in a ®t that is visually too noisy. is very approximately N . Figure 5 shows there is

A number of tools are available to help assess the perfor- no problem at the right end; if anything, a small amount

< < 80 60 age mance of smooths. Global criteria such as cross valida- of overdispersion in the range is possi- tion and generalized cross validation Craven and Wahba ble. (1979) estimate the average squared prediction error, Global goodness of ®t criteria are readily computed. For while the M statistic of Cleveland and Devlin (1988) example, the generalized cross validation statistic is

April 97 Statistical Computing & Statistical Graphics Newsletter 13 o o o o o o o o o o o o o ooo o o o o o o o oo o oo o oo o o o o residual o o o o o o o o o o o o o oo

o GCV o o o o o o -2 -1 0 1 2 o

60 70 80 90 100 age 0.0 0.05 0.10 0.15 0.20

Figure 5. Deviance Residuals for the mortality dataset. 4 6 8 10 12 14 16

Fitted DF

P

n

2

Y ^ x 

i i =1

i Figure 6. Generalized cross validation plot for the n

GCV =

2

n H  tr ethanol dataset.

where H is the hat matrix. The "locfit" object con- tains all the necessary components to compute this; the a symptom of how dif®cult data-based model selec- gcv() function provided with Loc®t makes a call to tion is; for example, re¯ecting the uncertainty as to the locfit() and extracts the necessary components: correct amount of smoothing of the peak in Figure 1. Plug-in selectors, often claimed to be less variable, have > gcv(NOxÄE,data=ethanol,alpha=0.5) not magically answered this uncertainty, but effectively lik infl vari gcv make strong prior assumptions as to the correct amount -4.53376 7.013307 6.487449 0.1216589 of smoothing. This point is discussed further in Loader (1995), where it is shown overreliance on bandwidth The lik component is -0.5 times the residual sum of

T selectors has led to questionable conclusions on some

H H H squares; infl is tr ; vari is tr and gcv is the GCV score. We can easily loop through gcv() for standard examples. several smoothing parameters: Locally Adaptive Fitting

> alpha <- seq(0.2,0.8,by=0.05) Sometimes, we may be blessed with large datasets with > g <- matrix(nrow=length(alpha), low noise. In such cases, we can try to choose a sep- + ncol=4) arate bandwidth for each smoothing point. Loc®t pro- > for(i in 1:length(alpha)) vides one such method for doing so, based on a localized + g[i, ] <- gcv(NOxÄE, data=ethanol, version of AIC. + alpha=alpha[i]) Figure 7 shows an example, using one of the four exam- > plot(g[,3], g[,4], ylim=c(0,0.2), ples popularized by Donoho and Johnstone (1994). The + xlab="Fitted DF", ylab="GCV") S commands producing this example are The plot is displayed in Figure 6. Note the plot is fairly > x <- seq(0, 1, length.out=2048) ¯at from about 6 to 16 degrees of freedom. This sit- > y <- 20*sqrt(x*(1-x))* uation is not uncommon, and re¯ects the dif®culty of + sin((2*pi*1.05)/(x+0.05))+ purely data-based bandwidth selection. Looking at Fig- + rnorm(2048) ure 1, one might argue that the peak should be much ¯at- > plot(yÄx) > fit.ad <- locfit(yÄx, maxk=500, ter than the smooth displays, or possibly even bimodal. + alpha=c(0,0,log(2048))) The ¯atness of GCV simply re¯ects this uncertainty. > plot(fit.ad, mpv = 2048) Recent literature on bandwidth selection (e.g. Ruppert, > plot(predict(fit.ad, what="band"), Sheather, and Wand 1995) has strongly criticized cross + type="p") validation and related procedures as being too variable The locally adaptive ®t is requested by providing a third and unreliable. In fact, a careful analysis shows the vari- component to the smoothing parameter alpha;this ability of cross validation is not the problem, but rather speci®es a penalty for the `number of parameters' in the

14 Statistical Computing & Statistical Graphics Newsletter April 97 oooooo o o oooooo oooooooooooo o oooooooooooooo o o oo oooo oooooooo ooooooooooooooo ooooooooooooooooooooooooooooooooooooooo oooo ooo ooo ooooo oooooo oooo oooooo ooooooooooooooooooooooooooooooooooooooooooooo ooooooooooo oo ooo oooooo oooo ooooo ooooooooooooooo o ooooooooooooooooooooooo ooooooooooooooooo ooo o oo oo ooo oooo oooooo oooooooooooo o oooooooooooooooooooooo oooooooooooooooo oooo oo o o ooo oo oooo ooooooooo ooooooooooooooooooooo y oooooooooooo oooo o o oo oo ooo ooo ooo oooooooo oooooooooooooooo oooooooooooooooooooo o o o oo oo oooo ooooo ooooooo oooooooo oooooooooooo ooo oo o oo ooo oooo oooooo oooooo o oooooooooooo ooo oooo ooo oooo oooooo ooooooooo ooo oo oooo ooooo ooooooooooo ooooooooo oooooooooo o oo oooo oooooooooo oooooooooooooooooooooooooooo ooo ooooooo ooooooooooooooooooo -10 0 5 o oo o oooooooo

0.0 0.2 0.4 0.6 0.8 1.0 x y -10 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 x 5 o

o o o o oo o o o

Bandwidth o o o oo o o o o o o o oooooo ooooo o oooooooooooooooooooooooooooooooooooooooooooooo 0.0 0.10 0.2 0.0 0.2 0.4 0.6 0.8 1.0 x Figure 7. Locally adaptive ®t. Dataset (top); Smooth ®t (middle) and bandwidths used (bottom).

local AIC criterion. Usually, a value slightly larger than petal measurements. Local logistic regression (the de-

2  2 produces the most satisfactory results. fault for a T/F response) is used: While examples like Figure 7 are challenging from an > fit.ir<-locfit( algorithmic viewpoint (it is hard to make a computer + I(species=="virginica")Äpetal.wid+ draw a smooth curve through the data), they are quite + petal.len,scale=0,data=iris) simple from a statistical viewpoint (the structure is quite > plot(fit.ir, v = 0.5) obvious, and a perfectly adequate smooth would be ob- > plotbyfactor(petal.wid,petal.len, tained with a pen). For the types of data statisticians of- + species,data=iris,pch=c("O","+"), ten face Ð more noise, and less obvious structure Ð lo- + col=c(1,1),add=T,lg=c(1,7)) cally adaptive smoothing is likely to be less satisfactory. Figure 8 shows the resulting ®t, with the single contour The model selection uncertainty identi®ed in the pre- plotting the classi®cation boundary. vious section applies equally to locally adaptive selec- tion, and there can be no guarantee that the smooth pro- We can also estimate the error rate, using cross valida- tion: duced by Loc®t (or any other locally adaptive smoother) matches what the user desires. > table(fitted(fit.ir,cv=T)>0.5, Classi®cation + iris$species) versicolor virginica Classi®cation problems, where one attempts to classify FALSE 47 2 observations into two or more classes, can be addressed TRUE 3 48 using either logisticregression or density estimation. As

an example, we consider classifying the versicolor and Here, we estimate the cross validated ®t, and hence the

=100 virginica species from Fisher's Iris data set, based on the misclassi®cation rate is estimated as 5 .

April 97 Statistical Computing & Statistical Graphics Newsletter 15 Acknowldegments O versicolor + + + + I thank Mark Hansen for inviting this article, and for + virginica + + helpful comments on presentation. + + + + + + + References + + + + + + + + + + + + + + Chambers, J. M. and Hastie, T. J., (1992), Statistical + + + + + + +OO+ + Models in S, Wadsworth & Brooks-Cole, Paci®c Grove, O+ + OOOOO California. petal.len OOOOO+ OOO Cleveland, W. S. (1979), ªRobust locally weighted re- OOOO O OOOOO gression and smoothing scatterplots,º Journal of the OO American Statistical Association, 74, 829±836. OO O Cleveland, W. S. (1993), Visualizing Data, Hobart 34567 O0.5 Press, Summit, New Jersey. 1.0 1.5 2.0 2.5 Cleveland, W. S. and Devlin, S. J. (1988), ªLocally petal.wid weighted regression: An approach to regression analy- sis by local ®tting,º Journal of the American Statistical Figure 8. Classi®cation boundary for Fisher's Iris data. Association, 83, 596±610. Computational Algorithms Cleveland, W. S. and Grosse, E. H. (1991), ªComputa- tional methods for local regression,º Statistics and Com- Modern work stations and PC's are suf®ciently fast that puting, 1, 47±62. direct implementation of local regression for datasets with a few hundred points is not a problem. For larger Craven, P. and Wahba, G. (1979), ªSmoothing noisy datasets, or iterative procedures such as local likelihood, data with spline functions,º Numerische Mathematik, computational approximations become useful. 31, 377±403. Donoho, D. L. and Johnstone, I. M. (1994), ªIdeal spa- The computational methods used in Loc®t develop tial adaptation by wavelet shrinkage,º Biometrika, 81, the ideas used in Loess (Cleveland and Grosse 1991). 425±455. Roughly, the local regression/likelihood is performed directly at a small number of points, and the ®t at these Henderson, R. (1916), ªNote on graduation by adjusted points is smoothly interpolated to obtain the ®t at re- average,º Transactionsof the Actuarial Society of Amer- maining points. Loc®t differs from Loess in the way ica, 17, 43±48. points are selected for direct ®tting. While Loess bases Henderson, R. and Sheppard, H. N. (1919), Gradua- its choice on the density of data points, Loc®t uses tion of Mortality and other Tables, Actuarial Society of the bandwidths at ®tting points. This allows Loc®t to America, New York. adapt to different bandwidth schemes: ®xed, nearest- neighbor, and locally adaptive. The power of this ap- Loader, C. R. (1995), ªOld Faithful erupts: Bandwidth proach becomes apparent in the third panel of Figure 7: selection reviewed,º Technical Report, Bell Laborato- the direct ®t is performed at just 108 points; far less than ries, Murray Hill, New Jersey.

the 2048 data points, and most of the ®tting points are in http://cm.bell-labs.com/stat/doc/95.9.ps

 x  0:2 the interval 0 , where the smallest bandwidths Loader, C. R. (1996), ªLocal likelihood density estima- are used and the locally adaptive procedure is relatively tion,º The Annals of Statistics, 24, 1602±1618. cheap. McCullagh, P. and Nelder, J. A. (1989), Generalized Conclusions Linear Models, Chapman and Hall, London. Ruppert, D., Sheather, S. J., and Wand, M. P. (1995), This article has outlined the main ideas underlying Loc- ªAn effective bandwidth selector for local least squares ®t, and presented examples showing some of the main regression,º Journal of the American Statistical Associ- capabilities. The web pages referred to in Section 1 con- ation, 90, 1257±1270. tain a number of other applications, including several models with censored data, and details of many more Scott, D. W. (1992), Multivariate Density Estimation: options. Of course, the best way for readers to decide Theory, Practice and Visualization, John Wiley & Sons, whether Loc®t is useful is to download and try it! New York.

16 Statistical Computing & Statistical Graphics Newsletter April 97 Tibshirani, R. J. (1984), Local Likelihood Estimation, animated. He said he had talked to the colleagues I had Ph.D. Thesis, Department of Statistics, Stanford Univer- suggested and they said that I (Me; Bill Eddy) was the sity. person who could help him. I laughed yet again and said ªNo; multiple comparisons is not my kind of problem. Tibshirani, R. J. and Hastie, T. J. (1987), ªLocal like- Go away.º I gave him the names of more colleagues lihood estimation,º Journal of the American Statistical ...on the other sideof town. Association, 82, 559±567. Another week and there he was again. Persistent son-of- a-gun, I thought. Maybe I should listen more carefully. Clive R. Loader ªTell me more about this magnetic resonance machine.º Lucent Technologies [email protected] He did. I was pretty impressed with his explanation; it

ranged over physics, electrical engineering, biophysics,

neurobiology, neurology, anatomy, cognitive psychol- ogy, computer programming, and, of course, statistics. Then he took me to see the MR machine over in the hos- TOPICS IN MEDICAL IMAGING (Cont.) pital. It was pretty impressive. A huge hollow cylinder in a quite large room. (I later found out the reason the CONTINUED FROM PAGE 1 machine was so large was it contained an internal mag- rate to about 4 such images per second). The second im- netic shield made of steel about one foot thick. I also

age is a ªt-mapº in which those voxels in the functional found out that the room was large so the uncontained map which have either increased (black) or decreased ®eld would be small outside the room.) Next door was (white) by more than some ®xed amount between two the control room. Actually much less impressive, it only experimental conditions are highlighted. These are the had a fancy touch screen console and a few computer ªsigni®cantº voxels the psychologist is seeking. You monitors. And next door to that was the radio equip- have probably seen such maps in various scienti®c and ment that made it all work. (It was extremely cold in popular publications. The main difference here is that I there. I later found out that was to reduce the noise from have not ªcleaned it upº by removing the stray voxels the electronics.) And then the whole setup was repeated. which exceed the threshold by chance. A week went by They had two of these multi-million dollar machines. and there he was again, even wilder looking and more Just for research! Into the functioning of the brain???

Sentence Comprehension--Axials Sentence Comprehension--Axials 09134_002_raw0.1.Mean 09134_002_raw0.2.5.SPM -4,4 Slice 4 Slice 4 870

654

439

223

8

SC SC IA O IA O F 3.0 F 3.0

Figure 1. A functional image of a brain, and the associated t-map in which voxels in the functional map have either increased (black) or decreased (white) by more than some ®xed amount between two experimental conditions.

April 97 Statistical Computing & Statistical Graphics Newsletter 17 I decided this deserved some more careful study. I got ing determined by the atomic species. For hydrogen, the a couple of my students and we started reading about one that is currently the basis of fMRI, the constant is magnetic resonance, magnetic resonance imaging, and around 42MHz per Tesla. A Tesla is a measure of the functional magnetic resonance imaging. The latter was strength of the magnetic ®eld; one Tesla is 10,000 times pretty easy; at the time there were only four papers on the strength of the earth's magnetic ®eld. A standard the subject. I was very quickly hooked. I am, by na- ªhigh-®eldº MR scanner used for fMRI has a main ®eld ture, a sort of data junkie, the larger the data set the of 1.5 Tesla. more interesting it is a priori. It turns out that a mag- netic resonance scanner is a multi-million dollar ma- Alright, so the nuclei precess. So what? Well, the dis- chine which can crank out data at the rate of 500MBph tribution of the directions the axes of all the nuclei are (that's megabytes per hour, folks). pointing is determined by the strength of the ®eld. The stronger the ®eld the more concentrated the distribution I guess I should say that the animated psychologist was pointing in the direction of the ®eld (or in the opposite Jon Cohen from CMU. He and I have now, together with direction to the ®eld; there is a local minimum in the en- others, had a couple of grants proposals funded and have ergy function in exactly the opposite direction). This, of written several papers. He is one of the very smart peo- course, depends on the energy state of the system; the ple I have met in this ®eld, which makes the research a more energy the nuclei have the less they have to pay lot of fun. attention to the magnetic ®eld and the less concentrated is the distribution. And this is the secret to magnetic res- My students and I spent the next year bothering every onance; inject a little energy into the system. At the fre- one in sight asking question after question about this quency of precession, it is absorbed by the atoms. (They mysterious process. (I want to especially thank Doug resonate.) However, they prefer to return to their ground Noll of the Department of Radiology at the University of state, which they do by emitting the absorbed energy. Pittsburgh for his clear and patient explanations; with- out his efforts we would be doing something else now.) If you remember ªthe right hand ruleº from elementary My personal favorite question was to the Chief of MR physics, you probably remember that associated with Research (Keith Thulborn); these are his machines. One every electric ®eld there is a magnetic ®eld and vice day (wondering if the noise was truly Gaussian, constant versa. That means that by transmitting pulses of elec- variance, and isotropic, as I had been told repeatedly) tric current through wires carefully wrapped around the I asked him if we could get some images while there hollow tube of the magnet one can cause spatially vary- was nothing in the scanner. He laughed at me: ªYou ing ¯uctuations in the strength of the magnetic ®eld and don't even understand that it's resonance; there has to hence spatially varying ¯uctuations in the frequency of be something in the scanner to resonate or you'll get no precession. Thus when the atoms emit the absorbed image.º The following week I asked him again with the energy the frequency of the emitted energy tells where same result. And again thefollowingweek ...Thenone they are and the strength of the emitted energy tells how week he said ªThe real reason I won't let you do that is many there are. Of course, it is a bit trickier than this de- that it would damage the electronics.º Now I had him; scription would indicate. The one important point is that he couldn't keep his story straight. I pressed and even- because of the physics, the data that are acquired from tually he allowed me to take almost 15,000 images with the emitted signal are actually sampled at locations in k- the scanner empty; we did have to turn the gain down space (Fourier space). Thus the images must be recon- on the receiver. A few simple statistical summaries later structed by Fourier transform (the Faster the better) from and they called the repairman. No, we didn't damage the the data. scanner; we found a minor ¯aw in the hardware. And forget Gaussian or constant variance or isotropic; none That really is it, although a lot of important detail has seem to be true. been left out. So let's turn now to the ªfunctionalº part of fMRI. It had been known since the 1930's that ªblueº So how does it all work? At the lowest level are the blood (deoxygenated hemoglobin) was more paramag- atomic nuclei. Each nucleus which is affected by mag- netic than ªredº blood. In the early 1980's it was shown netism (it has to have an odd number of protons or neu- that the MR signal from blue blood was stronger. In the trons or both) spins like a gyroscope when in a magnetic late 1980's (using positron emission tomography) it was ®eld. Remember how the axis of rotation rotates itself shown that the active brain required an increased blood (precession); same thing happens to the nuclei. The fre- supply but did not utilize the increase in available oxy- quency of precession is proportional to the strength of gen. This set the stage for the discovery in the early the magnetic ®eld, the constant of proportionality be- 1990's that the MR signal in the vicinity of active brain

18 Statistical Computing & Statistical Graphics Newsletter April 97 tissue showed an apparent increase (because the blood Fast Fourier Transform) and remove the (unexplained) leaving the region was more oxygenated and hence less voxel-wise trend over time. Finally we are at the point paramagnetic and hence interfered less with the local that statistics is traditionallyinvoked: it is time to decide

magnetic ®eld). Whew! on the effect of the experimental paradigm. A t-test is often powerful enough. As MR techniques and the cog- Consider an experiment, carefully designed by a cogni- nitive questions evolve there is greater demand for sub- tive psychologist to cause activation of a certain brain tler methods. function in one condition and deactivation in another condition. Images gathered in the ®rst condition will There is much work to do. Every question we ask leads show a stronger signal at locations associated with the to something new. The question that got me into this: function that images gather in the second condition. ªHow can I retain some power after allowing for multi- Thus we are naturally led to compare the average of the ple comparisons?º led to Forman et al. (1995). It is not images in the ®rst condition with the average of the im- the ªsolutionº but it is a step in the right direction. The ages in the second condition. question ªWhat happens if you run the same experiment tomorrow?º led to Genovese et al. (1997) and some So where is the statistics in all this? It's simple really: related papers. The question ªWhy not correct the raw the data are very, very noisy. There are many sources of data (instead of the reconstructed images) for head mo- variation we think we understand and, of course, many tion?º led to Eddy et al. (1996). And, of course, there is we don't understand. At the lowest level there is the the ªscience.º For example, ªDoes reading more com- ªthermalº vibration of the atoms which can only be con- plex sentences cause more brain activity?º led to Just et trolled by cooling the material toward absolute zero. al. (1996). There are two main sources of variation over which we have some control: the hardware and the subject. In If you are at all interested in fMRI, I suggest you walk the hardware one source of variation is the lack of uni- over to your local MR center and begin to get to know

the people and what they are doing. If you don't have the B

formity of the main magnetic ®eld, 0 ; another source

energy for that you might want to read Lange (1996); it B

is lack of linearity in the gradient ®eld, 1 .Thereis miscalibration of the analog-to-digital converter; there is a good introduction to fMRI for statisticians although is drift in the receiver electronics; mistimings of reso- it is already a bit dated. Change in this ®eld is extremely nant gradients cause ªN/2 Nyquist ghosts;º the list goes rapid; e.g., we are already using 8-fold higher spatial on and on. resolution than Lange describes. One of the main sources of variation cause by the sub- Addendum ject is movement of the brain. Movement of the brain On rereading this I see that I have failed to capture the is itself the result of many sources. There is the rigid collaborative nature of fMRI work. I regularly inter- movement of the whole head; there is the almost peri- act with a large team of physicists, electrical engineers, odic compression and vertical movement of the brain psychologists, neurologists, computer scientists, and, of caused by the diastole of the cardiac cycle; and there is course, fellow statisticians. And, this doesn't count the the complex distortion of the brain caused by respira- workhorses: the MR technologists, computer program- tion. mers, registered nurses, etc. fMRI is truly a team sport. And what can we do about the noise? There are two general approaches: engineering (remove the noise at its source) and statistical (model the data and remove Annotated References the variation it accounts for). We believe very strongly Eddy, W.F., Fitzgerald, M., Genovese, C.R., Mockus, in both approaches. Keeping with statistical tradition, A., and Noll, D.C. (1996), ªFunctional Imaging Anal- we'd really like to make a model which relates the data ysis Software - Computational Olio,º Proceedings in to the parameters of interest. That seems very dif®- Computational Statistics (A. Prat, Ed.), Physica-Verlag, cult and the computations required by any plausible es- Heidelberg, 39-49. (A slightly more technical descrip- timation technique are daunting. As a stop gap measure tion of the data processing and the software we have de- we have chosen to estimate (and remove from the data) veloped.) each effect successively. So, in short summary, we cur- Eddy, W.F., Fitzgerald, M., and Noll, D.C. (1996), ªIm- rently correct for (see Eddy et al. 1996 for details): a) proved Image Registration By Using Fourier Interpola- analog-to-digital converter miscalibration; b) gradient tion,º Magnetic Resonance in Medicine, 36, 6, 923-931. mis-timings; c) receiver drift; d) subject head motion; (Our method for two dimensional motion correction.) e) shot noise; and then we reconstruct the images (by a

April 97 Statistical Computing & Statistical Graphics Newsletter 19 Forman, S.D., Cohen, J.D., Fitzgerald, M., Eddy, W.F., iceberg. Some background information is appropriate to Mintun, M.A., and Noll, D.C. (1995), ªImproved As- place this study in a larger context. sessment of Signi®cant Change in Functional Mag- We are in a new era in which biologists can selectively netic Resonance Imaging (fMRI): Use of a Cluster Size disrupt genes and design genes to perform speci®c tasks. Threshold,º Magnetic Resonance in Medicine, 33, 636- However, genes do not function in isolation. In gene

647. (Our ®rst attempt to increase the power of the t- knock-out experiments, the deletion of a gene can have a testing procedures.) disastrous effect in some cases, while in others the work- Genovese, C.R., Noll, D.C., and Eddy, W.F.(1997), ªEs- ing constellation of genes compensates quite well for the timating Test-Retest Reliability in Functional MR Imag- gene's absence (see Galli-Taliadoros et al. 1995). This ing I: Statistical Methodology,º Magnetic Resonance in suggests redundancy of gene function and combinatorial Medicine (in press). (What happens when you repeat an regulation of genes: a complex genetic network. fMRI experiment?) The study of genetic networks is one of the topics in Just, M.A., Carpenter, P.A., Keller, T.A., Eddy, W.F., recent books and journals addressing complexity (see and Thulborn, K.R. (1996), ªBrain Activation Modu- Kauffman 1993, Somogyi and Sniegoski 1996, and lated by Sentence Comprehension,º Science, 274, Octo- http://rsb.info.nih.gov/mol-physiol/ ber 4, 1996, 114-116. (An example applicationof fMRI.) homepage.html). The introduction of networks into Lange, N. (1996), ªStatistical Approaches to Human this area has effectively split the ranks of biologists. The Brain Mapping by Functional Magnetic Resonance old guard continues to focus on the study of individual Imaging,º Statistics in Medicine, 15, 389-324. (A very genes using an evolving but relatively mature method- long introduction to fMRI.) ology. Those taking on the challenge of genetic net- work studies are pioneers who must modify and create conceptual models and methodologies to view this new Willian F. Eddy landscape of molecular interactions. This work is much Carnegie Mellon University different than studying electronic communications net- [email protected]

works where one can at least consult with the engineers

who designed the network! Initial work in studying genetic networks conceptual- TOPICS IN INFORMATION VISUALIZATION izes two types of communication paths (Somogyi and Sniegoski 1996). The ®rst consists of proximal paths operating through ªcis regionsº and ªtrans elements.º Templates for Looking Cis are control regions of DNA proximal to gene coding sequences and trans elements are gene products that reg- at Gene Expression ulate by interacting with cis regions. The second type of communication pathway is composed of extended paths Clustering involving protein-protein and protein-signaling factor By Daniel B. Carr, Roland Somogyi and interactions governing intra- and extra-cellular commu- George Michaels nication. Genes encoding the participating proteins con- 1. Introduction trol this communication. Our starting point is very simple; we will look for clus- In this paper, we describe the design of graphical dis- ters in gene expression time series data. When the out- plays for investigatingclustering evident in gene expres- put patterns of different genes are very similar, there is sion data. The displays include stereo plots, parallel hope that they are a part of a constellation of communi- coordinate (time series) plots and conditioned parallel cating genes, receiving similar control signals. Again, coordinate plots. These basic templates are subject to this is only a starting point. Since some genes turn other numerous variations and are potentially useful in many genes off, things get complicated quickly. Cluster mea- other cluster analysis settings. sures such as mutual information provide clues about As an application of our approach, we will consider gene additional members of the constellation. (Mutual infor- expression mapping of the developing spinal cord in mation is also referred to as the rate of transmission, and rats, focusing on only 112 genes. Because tens of thou- is related to conditional entropy.) As this is work in pro- sands of interacting genes control spinal cord develop- cess, we welcome insights into how to proceed in the ment, this study is really addressing just the tip of the face of our limited understanding and the apparent com-

20 Statistical Computing & Statistical Graphics Newsletter April 97 plexity of genetic networks. In the next section we pro- the current example emphasizes reading labels in order vide more details about the nature of our gene expres- and sorting would scramble the order. sion data. In Figure 1 the plotting order for color is consistent in 2. The Observations and Clustering Methods all panels: cyan, green, orange, and red. We use graph- ical ordering and plot from the bottom up within each As mentioned above, we identi®ed 112 genes involved functional group. This might be argued since there is a in the development of a rat's spinal cord. This collection clash of conventions. Graph reading is bottom up while includes genes deemed important for the development table reading is top down. Figure 1 is much like a table of the central nervous system: neurotransmitter recep- so there is ambiguity about which convention to apply. tors and metabolizing enzymes, intracellular signaling Note that with four genes in a panel, the red line, which proteins, peptide factors and their receptors and growth should appear closest based on wavelength considera- factors. Genes for marker proteins were also selected so tions, plots on top. The color selection also makes red that we can associate gene patterns with cell differenti- the darkest color on a lightness scale and hence it con- ation. trasts the most against the light background. The graph The observations on each gene constitute a time series convention is slightly advantageous because red appears of length nine. The nine developmental times studied on top in the panel and at the top of the color key. were gestation days 11, 13, 15, 18, 21; and after birth The left to right sequence of gene functional group days 0, 7, 14, and 90 (adult). Each point in the series is name, panels, and then gene names can also be argued. a value between zero and one. When a gene is not func- The task can motivate a different order. If communicat- tioning the value is zero, and when it is maximally func- ing membership in functional groups were much more tioning the value is one. The observations themselves important than looking at the time series within func- are obtained through a process known as RT-PCR, re- tion groups, then putting the functional group names verse transcription-polymerase chain reaction (see So- and gene names together would be the logical design. mogyi et al1995).Ç The values are determined from digi- Putting the text in one place has merit in its own right. tal records of gel images and are actually means of trip- However, putting the gene names on the right panels al- licate observations. The variability of the triplicates is lows the names to be left aligned and to be uniformly very small and not shown in the graphs below. Scaling close to the key and panel. Since ®nding the names forces the means to range from zero to one. within the function groups is still easy, we show this Figure 1 (page 27) shows the gene expression patterns variation. by functional groups and gene sequence families. The The selection of four genes per panel follows Kosslyn's group names appear to the left of the clusters of panels. (1994) advice for creating small perceptual groups. The The groups are members of four general functional cat- apparent simplicity of the panels deteriorates quickly as egories described later in Figure 4. Since our focus here number of time series in each panel increases. Figure 1 is on graphics, we will not describe the gene families. appears simple while showing the thousand means in The design of Figure 1 makes heavy use of perceptual this data set. The four time series per panel design has grouping and warrants some comment. The scale for the many applications. In landscape orientation and with- axes appears in the top left panel. Genes in the same out the two column format, the design readily accom- functional group appear in a consecutive grouping of modates the much longer time series that occur in man- panels. Each panel within a gene function group shows ufacturing and other applications. the times series for four or fewer genes. The represen- Sorting the time series can make the plot appear sim- tation is a parallel coordinates plot (see Inselberg 1985 pler (see Carr and Olsen 1996). However, data order- and Wegman 1990) with omitted axes. ing can serve other purposes. In Figure 1 we ordered Each times series in a panel has its own color. While the functional groups based on page layout considera- there is overplotting, the reader can quickly infer values tions. We kept the provided label ordering within func- for overplotted points. The color key and corresponding tional groups, because that simpli®ed ®nding a speci®c gene label appears at the right of the panel. The color has gene within a function group. A time-series sorted ver- no meaning other than to serve as a link (see also Carr sion would be interesting. In an interactive setting (see and Pierson 1996). One can additionally sort the rows Carr, Valliant and Rope 1996) one might try visual clus- of the key by values for the last time period. This posi- tering by dragging and dropping time series into differ- tional linkingmakes more lines run into the rectangles of ent panels. An automated approach can use clustering their own color and linking becomes trivial. However, as a nominal basis for sorting.

April 97 Statistical Computing & Statistical Graphics Newsletter 21 In this case my co-authors came to me (Dan) with tween encoded multivariate points into interpoint dis- FITCH (Felsenstein 1993) clustering results. FITCH is tances. The merits of a higher-dimensional represen- an n4 algorithm and produces graphics like that in Fig- tations can be more than counterbalanced by the dif®- ure 2. They had decided that their Euclidean distance culty and inaccuracy of the decoding process. In fact a clustering was better if they included the differences be- de®nitive test for multivariate representations should be tween consecutive observations in the time series. In how well the user can assess the distance between two other words the input vectors were of length 17, 9 time points and the ratio of two such distances. The position series values plus 8 differences. This is equivalent to us- of Carr et al 1986 is that 3-D stereo plots (and possibly ing a weighted distance that emphasizes the seven inter- 4-D stereo ray glyph plots) allow quick distance judge- nal points of the time series. The co- authors note that ments that are good enough to be worthwhile. In the another option is to add slopes based on the actual spac- principal components context, if a third or fourth com- ing in days. They also used mutual information cluster- ponent adds little to the percent of variability explained, ing. then one might get by focusing on a 2-D plot. However, a 3-D or 4-D plot is often a better starting point. Those Like many statisticians, I am aware of hierarchical clus- busy interpreting a 2-D plot can seem naive when im- tering algorithms, maximum likelihood clustering, and portant structure is obvious in a 3-D view. Of course re®nement of clusters using the K-means algorithm. naivetÂe is relative. Those that can incorporate and un- However, I am far from being an expert. I had never derstand even more information in the graphics have an heard of FITCH nor mutual information clustering. My advantage. co-authors gave me every opportunity to recommend a clustering algorithm that would provide the truth, or one Figure 3 (page 28) is a side-by side stereo plot that that all scientists would recognize the best of the avail- distinguishes the six groups using color and symbol. able choices, but I declined. My early participation was Carr (1990) discusses stereo projections. The slightly simply to help them look at the data and the results of rotated view in Figure 3 seems to help image fusion clustering. over a directly facing view. Many people can learn to fuse side-by-side images without the aid of a view de- 3. Cluster Plots and Stereo Plot Construction vice. This learned skill involves the decoupling of eye- Figure 2 shows a cluster tree produced by FITCH. The convergence and lens focusing that normally work to- follow-the-line distance between points approximates gether in a process called accommodation. Proper fu- the multivariate interpoint distance. FITCH minimizes sion results in the square dot appearing in the back left a measure of stress that differs somewhat from the mea- corner of the plot frame. sure minimized in traditional 2-D nonmetric multidi- The plot axes in Figure 3 are the ®rst three principal mensional scaling (MDS). Figure 2 shows the average components of the 17 variables. The three coordinates time series pro®le for each of the six resulting clusters. capture 65 percent of the variability. The ®gure uses The labels for the clusters derive from the pro®les and do global scaling for the three coordinates and the plot not necessarily have any deep meaning. Figure 2 is good frame re¯ects the range of the principal components. in that it gets all the labels into the plot and provides a That is, the x-axis represents the ®rst principal compo- feel for clusters and subclusters. The extra freedom pro- nent and the frame is largest in the x direction. The y- vided by using connecting-line length rather than direct axis represents the second principal component. Repre- interpoint distance should allow signi®cant reduction in senting the third principal component with stereo depth any measure of stress. In this sense Fitch cluster tree reduces overplotting and the complications of looking views should be better than MDS or ®rst-two principal through many layers of data. The analyst should view component views. However, tracking lines for each pair the stereo plot from the correct distance to perceive in- in the cluster to assess interpoint distance is a compli- terpoint distances properly. The assignment of variables cated visual operation. How can we judge the cluster- to the axes gives an important clue. If the frame appears ing if we can not easily judge interpoint distances? A deeper than the frame is tall, then the analyst is too far ®rst reaction is to stick with views that represent inter- away. point point distances directly even though the distance approximations are not as good. The cluster and color pairing are: Constant = red, Wave± 1 = orange, Wave±2 = green, Wave±3 = cyan, Wave±4 Conceptually, higher-dimensional plots reduce the mea- = magenta, and Other = black. A minimal spacing tree sure of stress and hence provide a better representation based on the three axes connects the points in each clus- of interpoint distances than low dimensional plots. In ter. This helps to constrain visual traversal paths in practice the analyst must translate the differences be-

22 Statistical Computing & Statistical Graphics Newsletter April 97 nAChRd Gene Expression Clusters in the Developing Spinal Cord SC6

wave 2 trkC GAT1 mGluR8

nAChRa4 GRa3pre-GAD67 MAP2 NMDA2B ACHE PDGFb mGluR3 GRb2 nAChRa6 GRa5 GRa2 synaptophy GRg2 wave 1 statin EGF other neno GRb1 NMDA1 NOS EGFR GAD65 mGluR5 trkB GRb3 GDNF mGluR2

NT3 NMDA2C PDGFR GAP43 cyclin_B mGluR6 GRa1 TH 5HT3 SC2 mAChR3 IGF_II Ins1 Brm mGluR4 cellubrevi MK2 nAChRa2 L1 keratin GRa4 nAChRe IGFR2 5HT1b GAD67 trk 5HT2 wave 3 nestin GRg3

G67I80/86 G67I86 SC7 IGFR1 NFL

IP3R3 IP3R2 mGluR7 mAChR2 5HT1c nAChRa7 nAChRa3 nAChRa5

NMDA2D CNTF GRg1 mAChR4 CNTFR S100_beta IGF_I DD63.2 cjun H2AZ bFGF

TCP ODC InsR CCO2 cyclin_A actin CCO1 TGFR IP3R1 PTN PDGFa SOD ChAT wave 4 aFGF FGFR NFH CRAFBDNFSC1 cfos Ins2

GFAP NFM constant mGluR1 MOG NGF

NMDA2A repeated viewing, and the perceptual grouping makes more conventional software. Production of Figure 3 is the plot look simpler (Carr et al 1986). Given fusion, straightforward using high resolution black lines. How- we see plausible clusters (red triangles and green oc- ever, color inconsistencies arise using conventional vec- tagons) at a glance. The scale for the red and green clus- tor graphics. The wrong color overplots when drawing ters raises serious doubts about some of the other clus- a distant line of one color after drawing a close line of a ters such as the orange squares and magenta x's. different color. Another way to look at cluster results is to use local The partial solution used to construct Figure 3, broke averages of the times series rather than principal com- the line segments into a sequence of short line segments ponents. The averaging of adjacent times series values based on a large number of depth planes. The algorithm is a standard dimension reduction technique. The ad- used the closest of the short segment endpoints as the vantage is that when patterns appear, interpretation of- measure of the segment's depth. The algorithm then ten less complicated than for a principal components sorted both points and lines back to front before plot- view. (A disadvantageis that researchers don't like loos- ting. This procedure, while computationally tedious, ing temporal resolutionespecially when the experiments corrects the problem except for overplottingof segments were laborious.) After grouping the nine values into and points at almost identical depth. sets of three and averaging, we used the shuttering glass 5. Cluster Interpretation and Assessment stereo in ExplorN (Carr, Wegman, and Luo 1997) to look at the resulting three coordinates. (ExplorN also In an unsupervised clustering problem, one hopes to use supports touring in parallel coordinate and scatterplot corroborating scienti®c information as well as cluster matrix views of up to 20 dimensions.) Color and ray tightness to assess the clustering. Here the genes tyro- angle represented the cluster membership. The clus- sine hydroxylase (Th), insulin 1 (Ins1) and insulin-like ters were plausibly coherent just as they are in Figure 3. growth factor II (IFGII), appear as a tight subgroup in However there was one bad exception in our initial look the Wave±1 cluster. They turn out to be located on the at the clustering. The data for that gene had a transcrip- same human cytogenetic band (11p15.5) and are close tion error. The principal component view in Figure 3 together on mouse chromosome 7 (see Mouse Genome shows the corrected data. Before assessing clusters fur- Database). This suggests the genes are regulated in par- ther, we pause for more comments on stereo views and allel due to their close proximity on the chromosomes. plot construction. Figure 4 shows the residuals from the cluster means 4. Stereo Viewing and Plot Production by gene functional group and cluster. (Using a differ- ent scale, one can also show the panel means in row Small side-by-side stereo plots are less than optimal. A and column margins.) Both Waves±2 and 3 are no- good stereo viewer with appropriate mirrors and lenses tably con®ned to neurotransmitter signaling. Wave± allows use of much larger left-eye and right-eye plots. 4 cluster genes primarily belong to several functional Stereo workstations using shuttering eye-glasses work families. Genes showing largely constant expression quite well, although they sacri®ce a bit in terms of spa- (the Constant group) originate from diverse families but tial and brightness resolution. Rotation of points pro- strictly exclude the neurotransmitter signaling and neu- vides a good depth cue, motion parallax. However, roglial markers. The variability in Figure 4 raises con- rapidly rotating plots are hard to study. In our use of Ex- cern about the adequacy of the clustering and the sta- plorN we found very slowly rotating stereo views to be bility of the clusters variations when using more data or a desirable compromise. other algorithms. It seems doubtful that all of Wave±1 is just one constellation of genes. With more data, Wave±1 The move from the workstation stereo graphics to may break into several defensible clusters. Currently the printed side-by-side views raises the issue of color subcluster indicated above could begin to de®ne a con- overplotting inconsistencies. In non-translucent stereo stellation. The co-location of genes on a chromosome is mode, our SGI graphics workstation uses a z-buffer reasonable con®rmation. methods to make sure that whatever is closest to the viewer plots on top. One could utilize the worksta- 6. Cluster Comparison tion graphics by accessing the separate eye views and As indicated above my co-authors also brought results copying the low resolution bit maps to a high resolution from a mutual information clustering algorithm. Again printer. For small side-by-side views the size reduction they selected six classes. A natural step is to compare ameliorates the problem of limited resolution in work- the clustering. station views. For the graphics here we sought to use

24 Statistical Computing & Statistical Graphics Newsletter April 97 Residuals From Cluster Means By Functional Groups and Clusters Residual Scale = [-.75, .75]

Constant Wave 1 Wave 2 Wave 3 Wave 4 Other

Other

Novel/EST

Diverse Cell Cycle

Trans Factor

ICS

Insulin & IGF

Peptide Neurotrophin Signaling

Growth Factor

Neuroglial Marker

NME

AChR

Neurotransmitter GABA-R Signaling

GluR

5HTR

E13 P0 E13 P0 E13 P0 E13 P0 E13 P0 E13 P0 E18 P14 E18 P14 E18 P14 E18 P14 E18 P14 E18 P14

Figure 4. Multivariate residuals in a two-way layout.

April 97 Statistical Computing & Statistical Graphics Newsletter 25 Comparison graphics can take many forms. One tem- crossings correspond to high correlations. Kendall and plate is a two-way panel layout. The rows are class Stuart (1979) describe an eigenvector scoring approach membership for one algorithm and the columns are the for categorical data that will maximize the correlation class membership for other. Each panel contains the for two classi®cation. This provides a basis for order- corresponding times series (if any). After suitable rear- ing the classes within each classi®cation. Unfortunately, rangement of rows and columns, a strong diagonal pat- we have not been able to generalize this approach to tern would indicate effective similarity of the cluster- three variables and fear that the general case may be np- ing algorithms. The shape of the series in off diago- complete. nal panels provides insight into algorithm differences. Some ®nd the string art in Figure 5 appealing but for One can augment such graphics. In the current cases, others the plot is still too complicated. The advantage there would be a 6 x 6 panel layout and color could in- of the small panel approach described previous is that it dicate membership in one of the four major functional shows the times series in addition to the classi®cation. groups diverse, peptide signaling, neuroglial and neu- We present Figure 5 because it illustrates one way of rotransmitter signaling. Another variation incorporates converting classi®cation tables into graphs and because the information using a (4 x 6) x 6 layout. The 24 rows of it raises a sorting challenge. panels result from crossing the major functional groups with one of the classi®cations. The potential variations 7. Closing Remarks using conditioning and color to incorporate additional Alternatives and extensions to the above templates for information are numerous. viewing clustering results are numerous. Hierarchical Figure 5 (page 28) shows a cluster comparison approach cluster tree views are common. Visually connecting the based on parallel coordinates. Here the time series are cluster trees to the multivariate data can help provide omitted and the plot emphasizes three classi®cations: insights about the clustering. Buja, Cook, and Swayne gene function, Euclidean distance clusters and mutual (1996) used color brushing in XGobi to link cluster information clusters. The gene function axis appears at tree branches to other scatterplot views of data. Such both the top and bottom of the plot. The regular spacing interactivity facilitates following the visual clues pro- between a small number of crossing points distinguishes vided by graphics. The idea of joining multiple win- the classi®cation axes. dow brushing capability with adaptable thoughtfully- designed multiple panel plots has occurred to many but The Figure 5 design introduces two unique-case axes be- implementations have been slow to appear. tween each classi®cation axis. Every case (gene) has its own unique plotting position on these axes. With Many multiple panel-designs scale to much large sam- only a hundred or so lines, all lines are visibly distinct. ple sizes. Density methods apply to parallel coordinate George's idea behind using two such axes was to con®ne plots (Wegman and Luo 1997). The data need not nec- messy line crossing to the region between the unique-id essarily be time series. Plots like Figure 4 have many axes. This creates regular patterns of lines reaching the extensions. classi®cation axes. As usual, S-PlusT functions and scripts for pro- The choice of color in the ®gure serves two purposes. ducing the graphics are available via anonymous First it collapses the 13 gene function groups in the ftp to galaxy.gmu.edu. Change directory to four functional families. Second the color selection pur- pub/dcarr/newsletter/gene. In contrast to the posely calls attention to the peptide signaling (high con- past, the data provided is arti®cial. The real data will trast yellow) and down plays the distinction between be substituted when my co-authors have published in a neuroglial and neurotransmitter signaling. refereed journal. The precursors to Figure 5 raised an interesting sorting Splus users may ®nd the matrix layout functions of par- issue. The crossing lines made the plots look compli- ticular interest. Currently there is a Bureau of Labor cated. The challenge then is to order classi®cations, or- Statistics technical report describing the functions and der classes within each classi®cation, subclasses within selected connections to TrellisT graphics. Contact Dan nested classi®cations, and genes within (sub)classes to for a copy. minimize line crossings. An all permutations approach Those with Silicon Graphics workstations may be inter- works for small problems. Unfortunately the combina- ested in ExplorN. A tar ®le containing an executable and torics become overwhelming in a general table setting. sample data sets is available. Conversion to OpenGL For two classi®cations there is a convenient approx- and available on other OpenGl compatible platforms imate approach. Wegman (1990) observes that few may happen later in the year.

26 Statistical Computing & Statistical Graphics Newsletter April 97 Gene Expression Patterns By Functional Groups

1.0

Scales 0.5 MOG 0.0 GFAP S100_beta E13 E18 P0 P14 neno E11 E15 E21 P7 A synaptophysin NFH Marker NFM CCO2 NFL CCO1 L1 Other SOD GAP43 actin MAP2 nestin cellubrevin keratin DD63.2 Novel/EST SC7 NOS SC6 TH SC2 ODC SC1 ACHE ChAT statin NME GAT1 H2AZ G67I86 Cell_Cycle cyclin_B cyclin_A G67I80/86 GAD67 pre-GAD67 TCP GAD65 Brm Trans_Factor cfos cjun mAChR4 mAChR3 IP3R3 mAChR2 IP3R2 ICS IP3R1 nAChRe CRAF nAChRd AChR nAChRa7 nAChRa6 IGFR2 nAChRa5 IGFR1 nAChRa4 InsR nAChRa3 Insulin_&_IGF IGF_II nAChRa2 IGF_I Ins2 Ins1 GRg3 GRg2 CNTFR GRg1 trkC GRb3 trkB GABA-R GRb2 trk GRb1 Neurotrophin CNTF GRa5 BDNF GRa4 NT3 GRa3 NGF GRa2 GRa1 TGFR PDGFR FGFR EGFR PDGFb NMDA2D PDGFa NMDA2C Growth_Factor aFGF NMDA2B bFGF NMDA2A NMDA1 EGF GluR GDNF mGluR8 PTN mGluR7 MK2 mGluR6 mGluR5 5HT3 mGluR4 5HT2 mGluR3 5HTR 5HT1c mGluR2 5HT1b mGluR1

E13 E18 P0 P14 E13 E18 P0 P14

Figure 1. Labeled time series with controlled overplotting. Here, the colors have no meaning, but serve only as a linking device within each panel.

April 97 Statistical Computing & Statistical Graphics Newsletter 27 Figure 3. Stereo pairs with careful color overplotting. In this case, colors correspond to clusters (see the text for a detailed description).

Gene Families and Expression Clusters

OtherNovel/ESTCellTransFactor CycleICS Insulin Neurotrophin& IGF Growth FactorMarker NME AChR GABA-R GluR 5HTR

Mutual Information 123456 Clusters

Euclidean Distance C Wave 1 Wave 2 Wave 3 Wave 4 ~ Clusters

Peptide Neuro Neurotransmitter Diverse Signaling glial Signaling

Figure 5. A sorted parallel coordinates view of multiway classi®ed cases.

28 Statistical Computing & Statistical Graphics Newsletter April 97 As always, the authors are open to gentle constructive Inselberg, A. (1985), ªThe Plane with Parallel Coordi- suggestions. Comments on graphics are best addressed nates,º The Visual Computer, 1, 69±96. to Dan and comments on the study or on genetic net- Kauffman S. A. (1993), The Origins of Order, Self- works are best addressed to Roland and George. Organization and Selection in Evolution, Oxford Uni- Acknowledgements versity Press, London. The authors thank Drs. Xiling Wen and Stefanie Kendall,M. and Stuart, A. (1979), The Advanced Theory Fuhrman, Laboratory of Neurophysiology, NINDS, for of Statistics, Vol. 2, Inference and Relationship,Charles their outstanding work in the experimental data acquisi- Grif®n & Company, London. tion and analysis. Thanks also go to Andreas Buja for a Kosslyn, S. M. (1994), Elements of Graph Design,W. discussion on ordering classes. H. Freeman and Company, New York. S-Plus is a register trademark of MathSoft, Inc. Trellis Mouse Genome Database, Mouse Genome Informatics, is a register trademark of Lucent Technologies, Inc. The Jackson Laboratory, Bar Harbor, Maine. References http://www.informatics.jax.org/ Somogyi R. and C. A. Sniegoski (1996), ªModeling the Buja, A., Cook, D. and Swayne, D. F. (1996), ªInterac- complexity of genetic networks: understanding multi- tive High-Dimensional Data Visualization,º Journal of genic and pleiotropic regulation,º Complexity 1(6), 45± Computational and Graphical Statistics, 5(1), 78±99. 63. Carr, D. B. (1993), ªProduction of Stereoscopic Dis- Somogyi R, Wen, X., Ma, W., and Barker, J. L. (1995), plays for Data Analysis,º Statistical Computing & ªDevelopmental kinetics of GAD family mRNAs paral- Graphics Newsletter, 4(1), 2±7. lel neurogenesis in the rat spinal cord,º J. Neurosci, 15, Carr, D. B. and Olsen, A. R. (1996), ªSimplifying Vi- 2575±2591. sual Appearance By Sorting: An Example Using 159 Wegman, E. J. (1990), ªHyperdimensional Analysis Us- AVHRR Classes,º Statistical Computing & Graphics ing Parallel Coordinates,º Journal of the American Sta- Newsletter, 7(1), 10±16. tistical Association, 85, 664±675. Carr, D. B., Nicholson, W. L., Little®eld, R. J. and D. Wegman, E. J. and Luo, Q. (1997), ªHigh-Dimensional L. Hall (1986), ªInteractive Color Display Methods for Clustering Using Parallel Coordinates and the Grand Multivariate Data,º Statistical Image Processing and Tour,º Computing Science and Statistics, 28, 361±368. Graphics,eds.E.J.WegmanandD.J.DePriest,Mar- cel Dekker, New York, pp. 215±250. Dan Carr Carr, D. B. and Pierson, S. M. (1996), ªEmphasiz- Institute for Computational Sciences ing Statistical Summaries and Showing Spatial Context and Informatics with Micromaps,º Statistical Computing & Graphics George Mason University Newsletter, 7(3), 16±23. [email protected] Carr, D. B., Valliant, R. and Rope, D. (1996), ªPlot In- terpretation and Information Webs: A Time-Series Ex- Roland Somogyi ample From the Bureau of Labor Statistics,º Statistical National Institute of Neurological Computing & Graphics Newsletter, 7(2), 19±26. Disorders and Stroke National Institutes of Health Carr, D. B., Wegman, E. J., and Luo, Q. (1997), ªEx- [email protected] plorN: Design Considerations Past and Present,º Center for Computation Statistics Technical Report No. 137, George Mason University, Fairfax, Va. 22030. George Michaels Institute for Computational Sciences Felsenstein, J. (1993), PHYLIP (Phylogeny Inference and Informatics Package), version 3.5c, distributed by the author, De- George Mason University partment of Genetics, University of Washington, Seat- [email protected]

tle.

Galli-Taliadoros, L.A., Sedgwick, J. D., Wood, S. A., and Korner, H. (1995), J. Immunol. Methods, 181, 1± 15.

April 97 Statistical Computing & Statistical Graphics Newsletter 29 GRAPHICS AND COMPUTING JSM PLANS ing and ANOVA, Robustness and Nonparametrics, Cat- egorical and Latent Variable Methods, Density Estima- tion, Markov chain Monte Carlo, Algorithms, Compu- The Joint Statistical tational Innovations, and Outlier Detection. Meetings, Anaheim 1997 Everyone should browse the poster presentations on Tuesday at noon. For those of you who would like to By Dianne Cook and James L. Rosenberger interact with others in greater depth, we will also spon- sor four roundtable lunches on Tuesday on the topics Statistical Computing Activities at the JSM of ªMultiple Comparisonsº, with Peter Westfall, ªData Miningº with Daryl Pregibon, ªComputational Science The Statistical Computing Section has organized a busy and the Internetº with Mark Hansen, and ªMarkov chain schedule of invited and contributed sessions, posters and Monti Carlo' with Peter Mueller. Sign up and meet other tutorials, for the members attending the Joint Statistical section members with common interests. Meetings, August 10-14, 1997, in Anaheim, California. We're hoping to see a good turnout in Anaheim. Be sure The theme of ªdata mining and massive data setsº pro- to attend the mixer on Monday evening following a short vides the focus for a number of sessions, and ®ts nicely business meeting at 7:30PM. Good food, door prizes, with the overall conference theme of Shaping Statis- and a chance to meet newcomers and oldtimers. tics for Success in the 21st Century. The invited pro- gram begins with a ªData Mining Tutorialº on Sunday Statistical Graphics at the JSM at 2:00PM by Usama Fayyad of Microsoft Research. The Statistical Graphics program at this year's Joint Sta- Monday at 2:00PM a co-sponsored session with Statis- tistical Meetings has a wide array of talks and posters. tics in Sports discusses Data Mining and its Application Late on Sunday afternoon there are several invited to the NBA, with Daryl Pregibon of AT&T Labs giv- posters with graphical content: immersive virtual reality ing the statisticians perspective and Ed Colet and I.S. environments, the production graphics library, and spa- Bhandari of IBM Research Center demonstratingthe ad- tial/environmental visualization. On Tuesday early af- vanced scout software. On Tuesday at 10:30AM Peter ternoon the contributed poster session has many posters J. Huber presents a Program Chair's special lecture on with graphical content, and in particular several on the Statistics and Massive Data Sets with discussion by Ed topic of graphics for biometrics. This is a year that we Wegman and others. also have a statistical graphics data exposition: there are Other futuristiclooking invited sessions include: ªCom- over 15 posters that will display graphical methods ap- putational Science and the Internet: A Seamless Webº plied to examining costs and outcomes of hospital visits on Monday at 10:30AM organized by Mark Hansen and (Wednesday 3:00-5:00pm). showcasing S and JAVA for data analysis, use of the There are 4 invited, 2 special contributed, and 2 con- graphics production library, and dynamic electronic re- tributed paper sessions. Bright and early at 8:30am search publications; and invited sessions on Tuesday at Monday is a panel discussion on publishing in the elec- 8:30AM on ªQuantile Regressionº, on Wednesday at tronic age, and then there are 2 sessions on spatial 2:00PM on ªThe Bootstrap and Empirical Likelihoodº, data analysis tools at 10:30am and 2:00pm. Tuesday and on Thursday at 8:30AM on ªStatistical Software at 8:30am there is a session on multivariate data and at Engineeringº, which presents a statistical approach to 10:30am there is a session, with a discussant, on inter- software development and maintenance. A special con- active graphics. In the early afternoon at 2:00pm there tributed session of interest to program developers is ti- is the invited session ªNatural Selection: Advances in tled: ªStatistical Reference Datasets for Assessing the Brushingº. Wednesday morning features two invited Numerical Accuracy of Statistical Software.º sessions: ªRecent Developments in Projection Pursuitº The winners of our section's student paper competiton at 8:30am and ªLost in Space: Assessing Multivariate present their talks on Tuesday at 2:00PM. Please see Missing Dataº at 10:30am. Wenjiang Fu, Toronto; Ramani Pilla, Penn State; Alan There are also many sessions outside of the main spon- Gous, Stanford; and Gareth James, Stanford, in this spe- sorship of the Stat Graphics section that have graphics cial contributed paper session. content. For example, on Monday at 10:30am there is In addition to the seven invited sessions and special ses- a session on graphics and publishing interactive mate- sions mentioned above, the section has organized nine rial: ªComputational Science and the Internetº. There is contributed paper sessions on topics including Model- a session on graphics and teaching at 10:30am Tuesday.

30 Statistical Computing & Statistical Graphics Newsletter April 97 On Wednesday there is a session with several papers on informal presentation. graphics for outlier detection at 10:30am and another at Limited travel support is available to workshop partic- the same time on graphics in teaching. On Thursday at ipants and the organizers especially want to encourage 10:30am look for the session on using the WWW for young researchers, women, underrepresented minori- classroom instruction; it has a decidedly graphical ¯a- ties, and the handicapped to attend and apply for sup- vor. port. Finally don't forget the roundtable luncheons tentatively The Fourth Morris H. DeGroot scheduled for Wednesday. This year there are discus- Memorial Lecture sions on visualizing spatial data, graphics with JAVA, virtual reality, projection pursuit/neural networks and The Fourth Morris H. DeGroot Memorial Lecture and visualizing massive data. Oh, and one last major event Reception will be held on the ®rst day of the Workshop, is that the popular Stat Computing and Graphics mixer Friday, September 26. The guest lecturer is Brad Efron, is planned for Monday night. Professor of Statistics at Stanford University. Past lec- turers have been Adrian F.M. Smith, A. , For more details point your web browser to: and James O. Berger.

http://www.public.iastate.edu/ dicook/ graphicsprog97.html Invited Case Studies Four invited papers will be presented at the workshop, Dianne Cook organized into two sessions per day. Each presentation Statistical Graphics 1997 Program Chair is scheduled to last three hours, which includes time for Iowa State University several invited discussants. Below is a list of authors [email protected] and titles. Full abstracts are available from the Bayes '97 home page given at the end of this announcement. James L. Rosenberger  ªModeling Risk of Breast Cancer and Decisions about Statistical Computing 1997 Program Chair Genetic Testing,º by Giovanni Parmigiani, ISDS, Duke The Pennsylvania State University University; Joellen Schildkraut, Cancer Control Unit,

[email protected] Duke University Medical Center;EricWiner,Depart-

ment of Medicine, Duke University Medical Center;and Don Berry, Ed Iversen, and Peter Mueller, ISDS, Duke University.

CONFERENCE ANNOUNCEMENTS  ªFunctional Connectivity in the Cortical Circuits Sub- serving Eye Movements,º by Christopher R. Genovese of the Department of Statistics, Carnegie Mellon;and Case Studies in John A. Sweeney of the Western Psychiatric Institute Bayesian Statistics and Clinic, University of Pittsburgh.

 ªPopulation Pharmacokinetic Modeling in Drug De- Workshop 4 ± 1997 velopment,º by Jon Wake®eld, Department of Epidemi- ology and Public Health, Imperial College School of September 26-27, 1997 Medicine at St. Mary's, London, UK; Leon Aarons, De- Carnegie Mellon University partment of Pharmacy, University of Manchester, UK; The Fourth Workshop on Case Studies in Bayesian and Amy Racine-Poon, Biometrics Group, Ciba-Geigy, Statistics will take place September 26-27, 1997, on the Basel, Switzerland.

campus of Carnegie Mellon University in Pittsburgh,  ªModeling Customer Survey Data,º by Linda Clark, Pennsylvania. Bill Cleveland, Lorraine Denby, and Chuanhai Liu of Contributed paper abstracts will be accepted through Bell Labs August 1, 1997 for presentation in a poster session on Invited papers from the Workshops I and II, together the ®rst evening of the Workshop. The Morris H. DeG- with a selected subset of contributed papers, were pub- root Lecture and Reception will precede the poster ses- lished by Springer-Verlag in refereed volumes of the sion and a sit down dinner will follow. The organizers mini-series ªCase Studies In Bayesian Statistics.º A plan to accept 25 to 30 papers. The posters will be dis- third volume will appear soon, covering Workshop III, played and discussed with the individual authors in an and a similar volume will be published for Workshop IV.

April 97 Statistical Computing & Statistical Graphics Newsletter 31 Organizers titative framework for accounting for the uncertainty in the model selection process. Imaginative solutions are being de- Bradley P. Carlin University of Minnesota veloped in speci®c application areas and there is wide poten- tial for application of many of these solutions in other areas or Alicia L. Carriquiry Iowa State University disciplines. Similarly, a high level of activity and innovation Constantine Gatsonis Brown University is registered in methodological developments. Andrew Gelman Columbia University Robert E. Kass Carnegie Mellon University The goal of the proposed workshop is to promote interaction Isabella Verdinelli Carnegie Mellon University between researchers involved in diverse aspects of this ®eld. Mike West Duke University We believe that interaction will: a) help elicit and provide fo- cus on current issues and methods from a host of different Contact Bayes 97 via email at [email protected] application areas and disciplines; b) promote wider utiliza- or through the URL http://www.stat.cmu.edu/ tion of worthy practical approaches and solutions developed meetings/Bayes97.html in speci®c ®elds; c) advance understanding of relative mer-

its of existing tools and approaches, and d) point to directions

for future methodological developments. Dissemination of stochastic model search techniques to practitioners and devel- opment of practical communication strategies for addressing model uncertainty in statistical consulting and teaching are Statistics Week at Duke also among the objectives of the meeting. October 9-13, 1997 The 1997 NBER/NSF Time Series Seminar Under sponsorshipfrom various organisations,the Insti- tute for Decision Sciences (ISDS) is hosting three linked October 10-11, 1997 research workshops during the ®rst couple of weeks of October this year. Contact Organiser: Mike West, ISDS

Stochastic Model Building and Wavelets and Statistics Variable Selection October 9-10, 1997 October 12-13, 1997 Contact Organiser: Giovanni Parmigiani, ISDS Contact Organiser: Brani Vidakovic, ISDS

The workshop will bring together in an informal and pro- The workshop is planned to focus on the developing inter- ductive way researchers active in the area of stochastic face between wavelets and statistical science. It will bring model building and variable selection. Advances in statis- together a small group of leading researchers, applied work- tical methodology and computing are opening opportunities ers, and new investigators, from various disciplinary back- for statistical analysis and modeling of larger and more com- grounds, to explore and summarize the current status of re- plex data sets. In this endeavor, it is becoming increasingly search on wavelets in statistics, to explore the applied im- important to use computer-based tools for guiding the initial pact that statistical wavelet modeling and wavelet methods in speci®cation of the main features of statistical models. Re- statistics are having, and to suggest and stimulate novel the- cently, a new generation of stochastic algorithms has been oretical, methodological and computational research direc- emerging as an important augmentation to traditional deter- tions. It is an opportune time for the wavelet/statistics com- ministic strategies. Examples include stochastic search meth- munities to come together to assess and address the dual ques- ods for variable selection, graphical models, selection of vari- tions about the status and future prospects: for the nature and able transformations and interactions, wavelet thresholding, potential contributions of wavelet ideas to statistical science ARMA modeling, CART, MARS, neural networks, data min- and application, and for furthering wavelet based technology ing and more. through the use of statistical concepts and models. Among the motivations for the increasing usage of stochas- tic methods are (1) searching large spaces of model speci- The workshop follows two recent meetings on wavelet meth- ®cations more thoroughly than ªgreedyº deterministic algo- ods in statistics: (i) Wavelets and Statistics, Institute IMAG- rithms, (2) implementing practical generalizations of stan- LMC, Grenoble, France, November 16-18, 1994, and (ii) the dard model selection strategies, such as adaptive thresholding ANU Wavelet Workshop, Canberra, Australia, June 26-30, of wavelets, (3) incorporating subject matter knowledge, ex- 1995.

pert judgment, and existing evidence in a non-binding way, via prior distributions and utilities, and (4) providing a quan-

32 Statistical Computing & Statistical Graphics Newsletter April 97 CALLS FOR PARTICIPATION Call for Participation in the Center for Imag- ing Science The Center for Imaging The Center for Imaging Science is amassing data-bases of research articles, simulators and remote sensor data Science associated with image understanding, computer-vision and automated target recognition. We invite the statis- The Army Center for Imaging Science was established tics and graphics community to participate in the Cen- in 1995 with funding from the Army Research Of®ce. ter for Imaging Science. Registering through the mail- It is a consortium which includes Washington Univer- ing list as a participant provides updates on activities and sity, Brown, MIT, the Universities of Texas at Austin current research and availability of new information in and El Paso, Yale and industrial partners Lincoln Lab- the Center's data base. oratory and Smith Kettlewell Institute. The activity is centered at Washington University under the Direction Come register at http://cis.wustl.edu/ of Michael Miller, Professor of Electrical Engineering. Director: The research of the Center is devoted to the develop- ment of representation and understanding of complex Michael I. Miller remotely sensed scenes, and re¯ects the broad multi- Department of Electrical Engineering disciplinary nature of imaging science, encompassing Electronic Signals and Systems Research Laboratory physics, mathematics, electrical engineering, computer Washington University, Campus Box 1127 vision, computer science and cognitive science. St. Louis, Missouri 63130 (314) 935-6195 (tel), (314) 935-7500 (fax) Mission for the Center for Imaging Science [email protected]

The overall goal of the Center is to develop the fun- Contact Point: damental foundations of image understanding for rec- Dr. David Skatrud ognizing and describing complex objects contained in U.S. Army Research Of®ce natural and cluttered scenes. The thrust of the Cen- Attn: AMXRO-PR/SFIA ter is towards the development of fundamental limits of P.O. Box 12211 performance of automated target recognition systems. Research Triangle Park, NC 27709-2211 This includes the development of detection and recogni- (919) 549-4315 (tel), (919) 549-4310 (fax)

tion bounds, as well as Cramer-Rao bounds on the pose of objects, and on complexity and information capacity bounds describing model databases and sensor/channel fusion characterizations. Theoretical Foundation of the Center for Call for JSM 1998 Imaging Science Proposals The scienti®c and intellectual foundations of the Center build on the Pattern Theories which have emerged over The Statistical Graphics Section is seeking proposals for the past several decades from the mathematics, compu- short courses for presentation at the 1998 Joint Statisti- tational vision, engineering and statistics communities. cal Meetings next August in Dallas, Texas. We are in- The mathematical methodology being pursued by Cen- terested in courses that involve innovative uses of statis- ter members is organized into the three principal com- tics in new application areas. We are especially inter- ponents of image understanding: the representation of ested in courses describing innovative uses of graphics complex scenes, image formation and sensor modeling, in the design of user interfaces, statistics on the world and characterization of recognition and decoding strate- wide web (WWW), visualization in conjunction with gies for image understanding. database mining, bioinformatics, and drug discovery. Proposals in other areas and proposals that would be co- A central theoretical theme involves the use of ge- sponsored with other sections are encouraged. For more ometry for the representation of natural and cluttered information or to informally present ideas, please con- scenes, focusing on both low-dimensional Lie groups tact Russ Lenth ([email protected]) or Perry for rigid bodies, as well as high-dimensional groups for Haaland ([email protected]). All proposals received deformable objects in clutter. before October, 1997 will be considered.

April 97 Statistical Computing & Statistical Graphics Newsletter 33 ASA abstract submission instructions described in AM- The Third Annual Student STAT News. Students selected for inclusion in the ses- sion will receive further information about abstract sub- Paper Competition mission and fee waivers from the award committee. The Computing Section announces its Inquiries and materials should be emailed or mailed to: Student Paper Competition for 1998 Terry M. Therneau Student Paper Selection Committee The Statistical Computing Section of the ASA will again Statistical Computing Section sponsor a Student Paper Session at the Joint Statistical Section of Biostatistics Meetings in 1998. The winners present their papers in Mayo Clinic a special session at the annual ASA meetings. See the Rochester, Minn 55905 article on page 4 concerning the winners of this year's [email protected] competition. The topic of the session is Statistical Computing. Four All electronic submissions of papers should be in

students will be selected to participate in this session. postscript.

All fees associated with registration, accommodation, and travel to the conference will be awarded to the par- ticipants in this Session. Students at all levels (undergraduate, Masters, and Ph.D.) are encouraged to participate. To be eligible, an ATTENTION, STUDENTS! applicant must be a registered student in the fall of 1997. The applicant must be the ®rst author of the paper. Statistical Graphics Section Student Paper To be considered for selection in the session, students Competition must submit an abstract, a six page manuscript, a re- New for 1998! sume, and a letter of recommendation from a mentor fa- miliar with their work. The manuscript should be single- spaced in a 10 point font with one inch margins (this is Are you interested in attending the 1998 annual ASA consistent with ASA's Proceedings guidelines.) All ®g- meeting? Are you looking for sponsorship to attend this ures, tables and references should be included in the six- meeting? Well, submit an entry to the Statistical Graph- page limit. In the case of joint authorship, the mentor ics Section Student Paper Competition! should indicate what fraction of the contribution is at- The emphasis of the competition is statistical graphics. tributable to the applicant. This can mean original methodological research, some All application materials MUST BE RECEIVED by novel application, or any other suitable contribution (for January 9, 1998. They will be reviewed by the Sta- example, a data analysis project which employs an ex- tistical Computing Section Student Paper Competition tensive use of graphics). The judging committee will Award committee, which is made up of the Section's select up to 3 winning papers to be presented by their representatives to the ASA's Council of Sections. The authors at a discussion session at the 1998 annual ASA topic of the paper should be in the area of statistical com- meeting. The winners will receive up to a $1000 award puting, and might be original methodological research, for expenses involved in attending this meeting. some novel application, or any other suitable contribu- The deadline for submission is January 9, 1998, but tion (for example, a software related project). Selection it is not too early to start thinking of an appropri- will be based on a variety of criteria at the discretion of ate project. For more information, check the Graph- the selection committee, and will include novelty and ics Section web page that can be reached through signi®cance of contribution, amongst others. Award an- http://www.amstat.org/ or contact Lorraine nouncements will be made in late January, 1998. The Denby ([email protected]) or Mike Minnotte selection committee's decision will be ®nal. ([email protected]).

Students not selected for inclusion in the Session may submit their abstract and a registration fee to ASA by February 1, 1998 if they plan to attend the Joint Meet- ings. Those abstracts must be submittedaccording to the

34 Statistical Computing & Statistical Graphics Newsletter April 97 SECTION OFFICERS Statistical Computing Section - 1997 Daryl Pregibon, Chair Statistical Graphics Section - 1997 908-582-3193 Sally C. Morton, Chair AT&T Laboratories 310-393-0411 ext 7360 [email protected] The Rand Corporation Karen Kafadar, Chair±Elect Sally [email protected] 303-556-2547 Michael M. Meyer, Chair±Elect University of Colorado-Denver 412-268-3108 [email protected] Carnegie Mellon University Sallie Keller-McNulty, Past±Chair [email protected] 913-532-6883 William DuMouchel, Past±Chair Kansas State University 908-582-7180 [email protected] AT&T Laboratories James L. Rosenberger, Program Chair [email protected] 814-865-1348 Dianne H. Cook, Program Chair The Pennsylvania State University 515-294-8865 [email protected] Iowa State University Russel D. Wol®nger, Program Chair±Elect [email protected] 919-677-8000 SAS [email protected] Edward J. Wegman, Program Chair±Elect 703-993-1680 Mark Hansen, Newsletter Editor (96-98) George Mason University [email protected] 908-582-3869 Bell Laboratories Mario Peruggia, Newsletter Editor (96-97) [email protected] 614-292-0963 Evelyn M. Crowley, Secretary/Treasurer (97-98) Ohio State University [email protected] 317-494-6030 Purdue University Robert L. Newcomb, Secretary/Treasurer (97-98) [email protected] 714-824-5366 James S. Marron, Publications Liaison Of®cer University of California, Irvine 919-962-5604 [email protected] University of North Carolina, Chapel Hill Michael C. Minnotte, Publications Liaison Of®cer [email protected] 801-797-1844 MaryAnn H. Hill, Rep.(95-97) to Council of Sections Utah State University 312-329-2400 SPSS [email protected] [email protected] Lorraine Denby, Rep.(96-98) to Council of Sections Janis P. Hardwick, Rep.(96-98) Council of Sections 908-582-3292 313-769-3211 Bell Laboratories University of Michigan [email protected] [email protected] Colin R. Goodall, Rep.(95-97) to Council of Sections Terry M. Therneau, Rep.(97-99) Council of Sections Health Process Management 507-284-1817 [email protected] Mayo Clinic Roy E. Welsch, Rep.(97-99) to Council of Sections [email protected] 617-253-6601 Naomi S. Altman, Rep.(97-99) to Council of Sections MIT, Sloan School of Management 607-255-1638 [email protected] Cornell University

naomi [email protected]

April 97 Statistical Computing & Statistical Graphics Newsletter 35 INSIDE

A WORD FROM OUR CHAIRS

Statistical Computing ::: :::: ::: :::: : 1

Statistical Graphics :: ::: :::: ::: :::: : 1

EDITORIAL ::: :::: ::: :::: ::: :::: : 2The Statistical Computing & Statistical Graphics TOPICS IN MEDICAL IMAGING Newsletter is a publication of the Statistical Comput- Functional Magnetic Resonance Imaging ing and Statistical Graphics Sections of the ASA. All communications regarding this publication should be is a Team Sport : ::: :::: ::: :::: : 1 addressed to: NEWS CLIPPINGS AND SECTION NOTICES Mark Hansen

Results of the Student Paper Competition :::: : 4 Editor, Statistical Computing Section ASA Election Results ::: :::: ::: :::: : 6Statistics Research

Buja Named JCGS Editor : :::: ::: :::: : 6Bell Laboratories

SCS Continuing Education Committee :: :::: : 7Murray Hill, NJ 07974

TOOLS FOR DATA ANALYSIS (908) 582-3869  FAX: 582-3340 [email protected]

Data Exploration with the Density Grand Tour :: : 7 http://cm.bell-labs.com/who/cocteau

Loc®t: An Introduction :: :::: ::: :::: : 11 TOPICS IN INFORMATION VISUALIZATION Mario Peruggia Templates for Looking at Editor, Statistical Graphics Section Department of Statistics Gene Expression Clustering : ::: :::: : 20 The Ohio State University GRAPHICS AND COMPUTING JSM PLANS Columbus, OH 43210-1247

The Joint Statistical Meetings Anaheim 1997 :: : 30 (614) 292-0963  FAX: 292-2096 CONFERENCE NOTICES [email protected]

http://stat.ohio-state.edu/ peruggia Case Studies in Bayesian Statistics Workshop 4 :: 31

Statistics Week at Duke :: :::: ::: :::: : 32 All communications regarding membership in the ASA CALLS FOR PARTICIPATION and the Statistical Computing or Statistical Graphics

The Center for Imaging Science :: ::: :::: : 33 Sections, including change of address, should be sent to:

JSM 1998 Proposals : ::: :::: ::: :::: : 33 American Statistical Association rd

The Computing Section's 3 Annual 1429 Duke Street

Student Paper Competition :: ::: :::: : 34 Alexandria, VA 22314-3402 USA

The Graphics Section's (703) 684-1221  FAX (703) 684-2036

Student Paper Competition :: ::: :::: : 34 [email protected]

Nonpro®t Organization U. S. POSTAGE PAID Permit No. 50 Summit, NJ 07901

American Statistical Association 1429 Duke Street Alexandria, VA 22314-3402 USA

This publication is available in alternative media on request.