Comparative and Functional Genomics
Comp Funct Genom 2002; 3: 264–269.
Published online 2 May 2002 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.170

Feature

Meeting Review: 2002 O’Reilly Technology Conference
Westin La Paloma Resort, Tucson, Arizona, USA. January 28–31 2002.

Damian Counsell* MRC UK Human Genome Mapping Project Resource Centre, Hinxton, Cambridge, CB10 1SB, UK

*Correspondence to: MRC UK Human Genome Mapping Project Resource Centre, Hinxton, Cambridge, CB10 1SB, UK. E-mail: [email protected]

Received: 2 April 2002
Accepted: 5 April 2002

Abstract
At the end of January I travelled to the States to speak at and attend the first O’Reilly Bioinformatics Technology Conference [14]. It was a large, well-organized and diverse meeting with an interesting history. Although the meeting was not a typical academic conference, its style will, I am sure, become more typical of meetings in both biological and computational sciences. Speakers at the event included prominent bioinformatics researchers, including Lincoln Stein; authors and leaders in the open source programming community, including Nat Torkington; and representatives from several publishing companies including the Nature Publishing Group, Current Science Group and the President of O’Reilly himself, Tim O’Reilly. There were presentations, tutorials, debates, quizzes and even a ‘jam session’ for musical bioinformaticists. Copyright © 2002 John Wiley & Sons, Ltd.

Keywords: bioinformatics; open source; BLAST; visualization; programming; O’Reilly

Introduction

Over 700 biologists, computer scientists, bioinformaticists, hackers, publishers and journalists came (some at great personal expense) to Tucson, Arizona to listen, argue, share and to write computer code. In his introduction to one of the keynotes, Tim O’Reilly explained why a computer book company and documentation consultancy had organized a biotechnology conference. Last year O’Reilly published its first bioinformatics text [10]. It is not the best introductory bioinformatics text, but the imprint’s reputation with the so-called ‘open source’ community was enough to make it immediately (and temporarily) Amazon’s best selling computer book.

Many of O’Reilly’s loyal readers would happily refer to themselves as ‘hackers’, meaning ‘independent voluntary programmers’ rather than ‘computer criminals’. (The cognoscenti use the term ‘crackers’ to refer to the sort of people who deface Web sites or steal passwords.) Hackers lie at the heart of the enormously successful and growing open source software movement.

Open source hackers write computer code and documentation to solve problems, to learn, and to impress others. They make all of their original human-readable program code available (under copyright) for improvement or modification. If others distribute programs based on the original source they are often required by the copyright-holders to make the source code of these changes available to others in turn, otherwise they are free to do what they will with it. Hackers make a distinction between this meaning of ‘free’, ‘free as in speech’, and ‘free as in beer’.

This ethos and several open source licences have been adopted by most of the major bioinformatics programming projects including EMBOSS [7], Ensembl [8], BLAST [6] and those of the Open Bioinformatics Foundation [13]: BioPerl, BioJava, BioPython and so on.

Curious hackers

Many non-scientists were present at this meeting. Part of the success of O’Reilly’s foray into bioinformatics publishing derives from the strong and well-intentioned curiosity of this hacker fraternity


about the maturing field of bioinformatics. Hackers are particularly annoyed, if not always so well informed, by any restriction being placed on the availability of sequence data. They see genomic data as software and are disturbed by any attempt to ‘close its source’. They see parallels between software patents (common in the States, but not possible in Europe) and ‘gene patents’.

That actual computer programming was a part of the proceedings is testimony to the meeting’s unusual background. The ‘Bio Hackathon’ at the conference was a successful effort in ‘live’ software development by various programming groups from the Open Bioinformatics Foundation. It began the weekend preceding the week of the meeting and continued in Cape Town, South Africa. I must confess that I turned down the chance to participate in this (in place of Alan Bleasby on behalf of the EMBOSS team). Sadly I’m not the sort of programmer who can write substantial programs in days, but I was happy to accept an invitation to speak on the Bioinformatics.Org [3] track.

Bioinformatics.Org

Bioinformatics.Org is a non-profit academic organization, established in 1998 at the University of Massachusetts Lowell. As well as providing a virtual home for a large number of open source bioinformatics projects it campaigns for freedom of information in the biosciences. It is also interested in bioinformatics education. It was true to the spirit of the event that Bioinformatics.Org took over a lecture theatre for its own one-day track.

Bioinformatics.Org’s Executive Director, Jeff Bizarro, took the opportunity of the O’Reilly meeting – the Second Annual Meeting of Bioinformatics.Org – to present the organization’s 2002 Benjamin Franklin Award to Michael B. Eisen of the Lawrence Berkeley National Laboratory and the University of California at Berkeley. This award was doubly appropriate, as Eisen wrote ScanAlyze, Cluster and TreeView, immensely popular software for cluster analysis of microarray data, and is also one of the principal moving forces behind the Public Library of Science [16] movement.

The organization of the conference

It would be impossible to cover all of the talks in detail. I followed my own personal path through the programme and here I report on some of the presentations that made the biggest impression on me.

Presumably to cater for such a broad range of attendees, the conference was sensibly organized along three concurrent technical/academic tracks and one commercial track. I did, however, hear complaints (compliments?) from delegates that too many interesting events coincided.

The ‘Fundamentals’ track was aimed at both biological and computational beginners. Though I cannot comment on the biological talks on the ‘Analysis’ track, it certainly carried most of the more heavily computational talks I heard. Most practically oriented was the ‘Discovery’ track which also carried a couple of presentations about ethical issues in bioinformatics, issues discussed with vigour both inside and outside sessions.

The presentations were further subdivided by style into keynotes, tutorials (with accompanying bound texts) and other, more informal, gatherings (birds-of-a-feather sessions and community meetings).

‘BLAST programming’ – Thomas Madden

Even the least computer-literate biologist will, knowingly or not, have used a BLAST search tool to find identical or similar sequences in the ever-growing genomic databases. Two of the best presentations I attended centred on this marvellous tool. Thomas Madden is one of the core BLAST developers at the NCBI. For me his talk justified my journey on its own. I am currently planning a collaborative bioinformatics system in which a standalone BLAST server will play a central part.

This tutorial made clear – albeit in fairly technical terms – that BLAST is more than just the most commonly used bioinformatics program. It has developed into a comprehensive and customisable platform with its own storage and output formats, programming interfaces and architecture. If the audience was anything to go by, there is also an active community of programmers and administrators who not only use BLAST, but also have created new variants of the original code.

‘Linux clusters for bioinformatics’ – Glen Otero

Unlike other, proprietary computer operating systems, running Linux on one hundred PCs costs no


more than running Linux on one PC. Even more surprising, one copy of Linux might well cost you nothing, if you download a so-called ‘distribution’ – a pre-packaged collection of kernel (the core operating system), installation tools, documentation and software from the Net. It’s usually easier, however, to buy a boxed Linux distribution with CDs and manuals. Even this is cheaper than a typical shrinkwrapped edition of Windows.

Linux was not only built across the Net by remotely collaborating programmers, it is also a network-centric operating system. These two attributes are probably connected. Because Linux combines sensible licensing with fast, refined and integrated network capabilities it is perfect for combining computers locally in parallel via Ethernet. Such ‘clusters’ of cheap, generic machines can pool their resources to solve computational problems that could not be tackled by standalones. The power of desktop processors of the Pentium and Athlon lines improves so rapidly and the machines based on these chips are such commodity items that the price/performance characteristics of these systems can overtake those of slower-developing supercomputer architectures. Linux clusters of these and other processors have become popular with many academics both as tools and as objects of study in themselves. Glen Otero of Linux Prophet [15] described the application of this technology to bioinformatics.

This was the second tutorial I attended on the first day and any learning-fatigue I might have been suffering had dispersed at the start of Otero’s enthusiastic and jokey presentation. Offering free software CDs for anyone who could use one of a collection of improbable words in a question, juggling with illuminated balls and mocking both himself and his audience, Dr. Otero (‘don’t be put off by the ‘Doctor’’) gave an extraordinarily comprehensive introduction to the practicalities of choosing and running a Linux cluster in a biological environment. Otero is a consultant to biological researchers thinking of setting up such ‘discount supercomputers’.

‘Open source bioinformatics’ – Ewan Birney

Ewan Birney is the young and charismatic leader of the Ensembl group at the European Bioinformatics Institute. He has always been an enthusiastic advocate of Open Source, both sharing genomic data and sharing bioinformatics source code. His funny and relaxed presentation chimed well with the feelings of the attendees. He joked about the misunderstandings between biologists and computational scientists (over the meanings of words like ‘vector’, for example). He talked about his own work on the Ensembl project and with the various Open Bioinformatics Foundation Open Source projects: BioPerl, BioJava, BioPython, BioCorba, BioDAS.

He also made ‘the case for bioinformatics’. While bioinformaticians are often frustrated by the reluctance of the biological establishment to embrace computational tools for biology, his belief that ‘bioinformatics still hasn’t been hyped enough’ caused me a certain amount of discomfort. Ten years ago, however, people were equally skeptical about the Human Genome project and the prospects for the cloning of large mammals; I hope to be proved a pessimist about the potential of the field.

‘Interactive data visualization’ – John Hotchkiss

The original speaker planned for this presentation was the founder of AnVil Informatics, Inc. (AVI), Georges Grinstein of the University of Massachusetts at Lowell. Unfortunately he lost his voice before he could talk, but John Hotchkiss, Chief Technology Officer of AnVil Informatics, bravely stepped forward to give us a brisk tour of a whole range of technologies – doubly brave since he was presenting someone else’s slides. He did an excellent job.

Genomics produces vast quantities of data. Humans are notoriously poor at absorbing and retaining individual items of information, but famously good at identifying patterns. Hotchkiss began with John Snow’s simple street map plotting of cholera deaths around a water pump in London. The Broad Street pump cholera outbreak of 1854 is now an exemplar in epidemiology. Hotchkiss pointed out that, contrary to the popular image, this map was more a tool of argument than one of investigation. It still makes a convincing case today. This important distinction led to the argument that visualization approaches could usefully be divided according to the kind of interaction users made with the data being processed. Some applications of visualization were for exploratory, some for confirmatory and some for production purposes.


As he proceeded, Hotchkiss introduced some of the interesting techniques developed inside and outside his company to handle and clarify multidimensional data of the kind commonly encountered in genomics. Some of these techniques were strikingly simple and effective. He showed, for example, that substituting complex icons for coloured dots could separate different populations of data points in plots even more clearly. One of AVI’s own proprietary approaches, RadViz, simply maps values onto radii of a circle, but, more cleverly, maximizes the usefulness of this approach by optimising the arrangement of those spokes for revealing patterns.

‘Data visualization for genomics’ – Timothy M. Kunau

Timothy Kunau of the University of Minnesota pursued some of these themes further, although his focus initially was more on tools for development such as ‘Integrated SYStem’ (ISYS), a set of components developed by the National Center for Genome Resources (NCGR) for the exploration of genomic data. ISYS is based on the established Java/Swing programming toolkit. This offered, for example, a metabolic pathway viewer. ISYS seemed to be a perfect example of a tool designed with sharing in mind. Different developers in different labs could produce components independently, but those components had a similar look and feel.

Kunau outlined the ‘programming by cartoon’ approach of the University of Pennsylvania’s bioWidgets [5] toolkit. Users can build their own bioinformatics pipeline by assembling visual representations of various modules. For example, multiple simultaneous comparisons could be performed by connecting a single sequence input to various processing modules.

He also discussed the visualization techniques used in MetaFam [12], an interfamily protein browser, to represent not just the existence of connections between various proteins, but the strengths and nature of those relations.

Finally Kunau addressed the simple practical issues of screen dimensions. You can see more if your screen is bigger. You can see more if you can see in three dimensions. He showed (two-dimensional) slides of a system called ‘GeoWall’ in action. Based on technology originating at the Electronic Visualization Laboratory of the University of Illinois at Chicago, the GeoWall [9] consortium, including members from the Universities of Minnesota and Michigan, have developed a means of using PCs or Macs to put three dimensions of data onto lecture theatre viewing screens. The system was built with visualization of geological data in mind, but applications to bioinformatics were discussed.

‘Speedup at what cost?: Heuristic vs. complete algorithms in homology search’ – Christopher Dwan

This was one of those rare computational talks aimed at the biological contingent and an outstanding example of the art of explanation. It inspired me to revise my own approach to teaching bioinformatics. Christopher Dwan of the University of Minnesota’s Center for Computational Genomics and Bioinformatics gave the clearest explanation of distinctions and choices to be made in practical bioinformatics (and applied computing in general) that I have ever seen.

He did two difficult things well: explained the specific functional differences between two of the most important sequence alignment algorithms and their consequences for practising researchers, and explained the general distinctions computer scientists make between different classes of problem-solving algorithm.

In this case the two algorithms compared were the complete Smith-Waterman and heuristic BLAST sequence search (alignment) methods. The simple conclusion of this talk was that BLAST is quicker/cheaper and misses some matches while Smith-Waterman is slower/more expensive and finds all matches (at least all those possible given a specific scoring system and cut-off). Not only is this a gross over-simplification of Dwan’s presentation, but it does no justice to the elegant way Dwan combined empirical demonstration – using data obtained during the conference itself with an exhibitor’s system – with good, old-fashioned exposition.

‘Computing Strategies for the interpretation of mass spectral data for proteomics applications’ – William Gleason

William Gleason [11] (University of Minnesota) gave a witty and insightful talk about some of the most


advanced proteomics methods. As biologists drown in nucleotide sequences he emphasised the importance of proteomics to biology and quoted Greg Petsko in describing proteomics as ‘going from sequence to consequence’.

With the advent of what he referred to as ‘soft’ fragmentation techniques such as MALDI (Matrix Assisted Laser Desorption Ionization) and ESI (ElectroSpray Ionization) it has become possible to break proteins up into analysable ions with much less damage. If a technique such as ESI is coupled to capillary electrophoresis or high-pressure liquid chromatography (HPLC) then sequencing of these fragments can be done very rapidly on tiny (femtomolar) quantities of material.

Gleason described how a cluster of Linux PCs (the University of Minnesota Supercomputing Institute Netfinity Linux Cluster) running the Lutefisk1900 and CIDentify software packages could be used to rapidly analyse the fragmentary sequence output from experiments such as these. His goal was to obtain answers sufficiently quickly for the settings of the mass spectrometer to be adjusted in order to maximise the usefulness of its output.

‘Project management at Bioinformatics.Org’ – Gary van Domselaar

As well as being on the Executive Committee of Bioinformatics.Org, Gary van Domselaar administers Bioinformatics.Org’s computer systems. He is a bioinformatics scientist at the Genetics Institute and a PhD candidate in David Wishart’s research group in the Faculty of Pharmacy and Pharmaceutical Sciences in Edmonton. He gave a revealing talk, both about the logistics of hosting a wide range of biological computing projects distributed around the globe, and about the nature of some of the projects. Even as a regular visitor and contributor to the Bioinformatics.Org site I learned about several features which were new to me.

‘DHTML and scalar vector graphics in bioinformatics’ – Malay Kumar Basu

One of these revelations was a gem of a project created by Malay Kumar Basu in ‘downtime’ from his full-time study. Basu is a graduate student in Molecular Biology in the Centre for Cellular and Molecular Biology, India. SeWeR (Sequence analysis Web Resources) [18] is his ingenious and simple Web interface to an array of server-based bioinformatics programs. It allows a naive user to apply database search, sequence analysis and even visualization programs to his or her data in a simple and customisable way.

Basu’s striking talk described the philosophy and technology behind the system. SeWeR is a perfect example of some of the extraordinary individual efforts taking place in the open source software development community. He also spoke about some of his more recent projects, including a library for the generation of scalable vector graphics (SVG) and derived modules for visualizing biological data.

‘Using the NCBI C++ toolkit in the development of the BIND database’ – Doron Betel

This talk was oriented more strongly towards programmers than any other I attended, with detailed source code examples displayed throughout. The NCBI toolkit is an extraordinarily wide-ranging collection of C++ code for bioinformatics software development. In its functionality it rivals the libraries underlying the European Molecular Biology Open Software Suite (EMBOSS). It is a testimony to its accessibility that Doron Betel produced his BIND database for the manipulation of biochemical pathway data completely independently of the toolkit’s authors. He is a graduate student in Chris Hogue’s Bioinformatics Lab at the Mt. Sinai Hospital Research Institute in Toronto; as he put it: ‘I’m not at the NCBI and I’ve never even been there’.

While I was impressed by the vast range of functions available in the kit, its code-generation utility (programs which do their own programming will always be popular with programmers), its documentation and its cross-platform nature, I began to be a little daunted by the level of abstraction at which it operates. For example, an elaborate HTML page with multiple hierarchical components can be generated in the toolkit by a single ‘print’ command, but a firm grasp of such structures is necessary to use this kind of power sensibly.
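To make that last point concrete, here is a minimal Python sketch of the general idea of building a page as a tree of nested components and then emitting the whole thing with one print call. It is not the NCBI toolkit, whose C++ classes are not shown in this review and may work quite differently. The `Node` class, the `build_page` function and the placeholder protein names are all invented for illustration.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List, Union


@dataclass
class Node:
    """Hypothetical HTML element: a tag name plus child nodes or text."""
    tag: str
    children: List[Union["Node", str]] = field(default_factory=list)

    def add(self, *items):
        """Append child nodes or text; return self so calls can be chained."""
        self.children.extend(items)
        return self

    def render(self, depth=0):
        """Recursively turn this node and all of its descendants into text."""
        pad = "  " * depth
        lines = [f"{pad}<{self.tag}>"]
        for child in self.children:
            if isinstance(child, Node):
                lines.append(child.render(depth + 1))
            else:
                # Plain text is escaped so it cannot break the markup.
                lines.append(f"{pad}  {escape(str(child))}")
        lines.append(f"{pad}</{self.tag}>")
        return "\n".join(lines)

    def __str__(self):
        return self.render()


def build_page():
    """Assemble a small page; the protein names are made-up placeholders."""
    table = Node("table")
    for name, partner in [("proteinA", "proteinB"), ("proteinC", "proteinD")]:
        table.add(Node("tr").add(Node("td").add(name), Node("td").add(partner)))
    body = Node("body").add(Node("h1").add("Interactions"), table)
    return Node("html").add(Node("head").add(Node("title").add("Demo")), body)


page = build_page()
print(page)  # one call walks the whole hierarchy
```

The convenience and the danger are the same thing: the single `print(page)` hides a recursive walk over every component, so anyone using such an interface needs a clear picture of the tree being built.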


‘BioMail: as an example of push technology in bioinformatics’ – Dmitry Mozzherin

BioMail [4] is a classic example of a bioinformatics resource which has become a raging success because it is useful, simple and free. Dmitry Mozzherin described how his easy-to-use email publication alert system had caught on with individual biologists and superseded expensive commercial services in several university libraries because of its convenience and reliability.

The system is hosted at Bioinformatics.Org.

Conclusion

The O’Reilly Bioinformatics Conference managed to be both scholarly and lively. I learned a lot; I discovered bioinformatics projects I had not previously encountered and understood more about familiar projects. While the meeting dealt with science and technology, it also had a ‘philosophy’.

Computer programming was originally considered something of an academic pursuit. As it became a big business, the previous science-like ethic of sharing code disappeared. The rise of the Net has revitalized this ‘collegiate’ approach and inspired the open source movement and its fraternity of self-proclaimed hackers. It would be easy to dismiss them as a semi-anarchic rabble if not for the dazzling successes of the products of their collaborations – including the infrastructure of the Net itself. Programs such as Linux, the free operating system, Apache [1], the most popular Web serving software, BIND [2], the code which assigns identities to almost every internet-connected device, and sendmail [17], which handles the vast majority of email, are all open source creations. In bioinformatics, of course, a great deal of open source code was used to map, sequence and assemble the human genome.

Now that biology and computing are converging, it is naturally members of this open source community who are most eager to bring the philosophy of shared enterprise back to the scientific world whence some feel it came. O’Reilly, as court publishers to the hacker nation, may have become accidental pioneers of a new kind of scientific gathering. It is likely that future biotechnological meetings will also be more open to intelligent ‘outsiders’, more concerned with explanation and more fun.

The Meeting Reviews of Comparative and Functional Genomics aim to present a commentary on the topical issues in genomics studies presented at a conference. The Meeting Reviews are invited; they represent personal critical analyses of the current reports and aim at providing implications for future genomics studies.

References

1. Apache Software Foundation: http://www.apache.org/
2. BIND: http://www.isc.org/products/BIND/
3. Bioinformatics.Org: http://bioinformatics.org/
4. BioMail: http://bioinformatics.org/biomail/
5. BioWidgets: http://www.cbil.upenn.edu/bioWidgets/
6. BLAST: http://www.ncbi.nih.gov/BLAST/
7. European Molecular Biology Open Software Suite ‘EMBOSS’: http://www.emboss.org/
8. Ensembl: http://www.ensembl.org/
9. GeoWall: http://www.geowall.org/
10. Gibas C, Jambeck P. 2001. Developing Bioinformatics Computer Skills. O’Reilly, Sebastopol, California.
11. Gleason W. homepage: http://www.cbc.umn.edu/~bgleason/
12. MetaFam: http://metafam.ahc.umn.edu/
13. Open Bioinformatics Foundation: http://open-bio.org/
14. O’Reilly Bioinformatics Conference: http://conferences.oreilly.com/biocon/
15. Linux Prophet: http://www.linuxprophet.com/
16. Public Library of Science: http://publiclibraryofscience.org/
17. sendmail.org: http://www.sendmail.org/
18. SeWeR: http://bioinformatics.org/sewer/
