bioinformatics technology feature

Programmed for success

Advanced software and services aimed at the non-bioinformatician are making it easier to ride the tidal wave of genomics and proteomics data. Steve Buckingham reports.

Feel that you’re missing out on genomics? Even if you don’t think of yourself as ‘bioinformatics-enabled’, your desktop PC, coupled to the Internet with its ever-expanding bioinformatics resources, will let you tap in to the data deluge. All you need to join the bioinformatics revolution is software that doesn’t require you to have a deep working knowledge of bioinformatics and genetics in order to get results. But is it out there? According to software companies and service providers, the answer is a resounding ‘yes’.

Finding genes lurking in raw sequence data, determining the proteins that they encode, and managing the flow of data have powered a large bioinformatics industry. Indeed, the software and services offered by companies such as LION Bioscience in Heidelberg, Germany, coupled with specialist hardware solutions from IBM, Microsoft and Sun, have been indispensable in handling the flow of data coming out of genome-sequencing projects. But it is not just a problem of scale. As well as transforming biological knowledge, the rapid emergence of such a volume of raw data is also changing the way in which researchers frame the questions that shape research projects and the design of experiments.

But is this empowering knowledge getting to all of the researchers who would benefit from it? Biologists want data in a form that is meaningful to their own research, but they don’t always have the time or resources to master the art of bioinformatics or to hire someone who has. So how can bench scientists in a small or medium-sized company or university department cash in on all of this raw information?

Help where it’s needed

The need now is for software and services that can broker this information and pass it on to the non-expert in an intuitive way. “The challenge is that in-depth expertise has to be packaged into an environment that appeals to the average biologist, is easy to understand and easy to use,” says Klaus May, director of sales and marketing for Genomatix Software in Munich, Germany.

A growing bioinformatics market is trying to bridge this gap. There are encouraging signs of a trend towards software with intuitive interfaces coupled with ever-increasing analytical power. David Speechly, vice-president of the life-sciences company Applera in Norwalk, Connecticut, compares the situation to the early days of motoring. “The first cars were driven by engineers and mechanics,” he says. “Later, everyone was able to drive them.” Already, Speechly notes, new tools are opening up the human genome across the life sciences.

Take, for example, someone working on cardiovascular disease. They can now run a computer search of the more than 5 million known human single-nucleotide polymorphisms (SNPs, pronounced ‘snips’, see Nature 422, 917–923; 2003) for those associated with the condition. They can then test their control groups for these SNPs to screen out those people who might have unidentified disease, thus removing one big source of uncertainty from their experiment. Applera and some other companies allow academic researchers free access to their SNP database for this kind of search, and offer a catalogue of SNP-testing kits.

The bioinformatics industry itself has had to meet some tough challenges. Ron Ranauro, general manager of the Paris, France-based company Gene-IT, sees a pressing need for software firms to produce

FINDING YOUR WAY

Most genome databases now feature more than just basic sequence information; they encompass a wide range of annotated information on factors such as disease states, protein structure and polymorphisms. Yet many researchers remain unaware of the potential of these resources. A much-cited survey commissioned two years ago by the London-based Wellcome Trust revealed that only about half of the biomedical researchers using genome databases were familiar with all the tools on offer to make full use of the data. A significant number didn’t even know of the existence of the European Bioinformatics Institute’s public Ensembl database and website, which collates vast amounts of diverse genomic information. The findings prompted the trust to launch an educational initiative in 2001 to draw attention to these databases.

So where do you turn to make sense of the number of databases available and the array of tools they offer? A good starting point is the annual database special issue of Nucleic Acids Research (31; 1–516; 2003). The online version of this issue contains an index of major databases along with a brief summary of what each offers, information that is compiled by Andreas Baxevanis of the genome-technology branch at the US National Human Genome Research Institute.

Newcomers are often overwhelmed by the tools on offer and the knowledge of genetics that is presupposed on the part of the user. The tools tend to assume an understanding of terms and principles that are unfamiliar to, say, physiologists and pharmacologists, and without this knowledge the users cannot feel confidence in the results that they get. A useful guide for the perplexed is A User’s Guide to the Human Genome, a web special published by Nature Genetics (www.nature.com/ng). This is designed for the genome neophyte, and guides the user through worked examples to give them confidence in understanding concepts and strategies that can then be used to empower their own research. S.B.

Image caption: Genome gateways: but how to get into the garden?

209 NATURE | VOL 425 | 11 SEPTEMBER 2003 | www.nature.com/nature © 2003 Nature Publishing Group

programs that are more adaptable to different users. “It is definitely not a case of one-size-fits-all,” he says. “The way applications have to be developed now is to work from the consumers’ needs backward.” Gene-IT’s new product, GenomeQuest, to be released this month, aims to provide an intranet-based sequence-search solution that makes it easy for scientists to get a complete picture of functional annotation from the entire sequence world and coordinate sequence information across diverse research teams.

There are other challenges. “The main problem, as I see it, is the pace of discovery,” says Bill Ladd, senior director of analytic applications for software developer Spotfire in Somerville, Massachusetts. “Bioinformatics exists to support very dynamic, and therefore very different, claims on software development. The technology is improving all the time, and so the analysis changes. In the past, when something new came along you would have probably a few months to gather the requirements and another few months or so to roll out the new software. Now that cycle has sped up to a matter of weeks, if not days.”

And the pace is only going to accelerate further. The way in which software interacts with the user has undergone a sea-change in the past few years, largely in response to the need to deal with large data sets. Ladd believes that there has been a migration towards the use of web interfaces because they are easy to manage and develop. But this has a cost. “It means that we are doing more browsing than analysis,” he says. “There are databases like Ensembl, for example, that effectively collate data from different sources, but difficulties arise when you try to integrate even this collated database data with experimental data. Once I have a list of genes out of Ensembl, what happens when I ask whether other genes in my data have the same characteristics?”

Image caption: Three-way synteny from Softberry.

New lamps for old

Fortunately help is at hand for the non-expert. Many of the new bioinformatics products aim to do essential tasks, such as BLAST queries, which find matching sequences in databases, and protein alignments, more efficiently and in a more user-friendly way than previous systems. New algorithms allow searches of large genomes to be done at unprecedented speeds without the loss in sensitivity that results in missed alignments. This means that it is possible to do real-time interactive searches against whole genomes and genomic data.

One such package is PatternHunter from Bioinformatics Solutions in Waterloo, Canada, which introduces the concept of ‘spaced seed’ to accelerate homology searches, and claims to be some 100 times faster than traditional BLAST. Whereas BLAST looks for regions where 11 consecutive residues match, PatternHunter looks for any 11 matches over, for example, an 18-residue segment, making the search more sensitive and, surprisingly, faster. Using a multiple-seed approach gives PatternHunter the sensitivity of the Smith–Waterman algorithm, but up to thousands of times faster. It runs on Sun Microsystems’ Java Virtual Machine and also boasts conservative memory usage and a guarantee not to miss any alignment. The program was used by the Mouse Genome Sequencing Consortium to compare the mouse and human genomes (Nature 420, 520–562; 2002), and is also used by companies such as Celera Genomics in Rockville, Maryland, and deCODE Genetics in Reykjavik.

Fast searching is also a feature of Genome Explorer from Softberry in Mount Kisco, New York, which uses the FMAP algorithm. This is a very fast algorithm developed by Softberry to map query sequence to large genomes. It keeps the oligonucleotide vocabulary of the entire genome in computer memory to speed up the searches. Softberry claims that the program can search the entire human genome for a sequence of interest in under two seconds. As well as offering simple pattern searches, retrieval of nucleotide and amino-acid sequences, Genome Explorer includes access to expression data on genes and the annotation of the draft of the human genome.

A number of desktop programs make it

SEEING IS BELIEVING

Science often works by data mining: finding correlations within and between sets of data. Dividing these sets into subsets and testing out various scenarios are key steps in this process. But many data sets generated today are extremely large and complex, sometimes involving several dimensions and varying levels of subdivision. And this is not only a concern for large pharmaceutical companies; anyone using microarrays has the same problem.

Many new bioinformatics products address this problem by organizing data in a dynamic visual context. “People are visually oriented. They are more productive when data are presented visually rather than textually,” says Ron Ranauro, general manager at Paris-based Gene-IT. Data visualization aims to allow scientists to get an intuitive grasp of data structure and to spot potentially interesting trends. For example, the well known SigmaPlot package made by SPSS in Chicago, Illinois, is probably used as much for trends analysis and scenario testing as it is for the preparation of graphs for publication.

Spotfire in Somerville, Massachusetts, prides itself on the ease of use of its data visualization and decision-making software, such as DecisionSite. The functional genomics version of this program allows users to visualize genomics data and spot trends and correlations. It accepts data from a wide variety of different databases, addressing the old problem in bioinformatics that relevant data are dispersed across different locations and are often in widely divergent, and potentially incompatible, formats.

Data visualization tries to shorten the path to the ‘eureka!’ moment, where the researcher has intuitively grasped what the data are saying. But intuition must be backed by rigorous analysis. Programs are often packaged with a number of powerful analytical tools including similarity searches, replicate summarization and coincidence testing. DecisionSite, for example, comes with preconfigured guides to assist in common genomic analyses such as gene finding and generating “heat maps”, a type of visualization in which data are colour-coded to give an overview of large amounts of data at once. S.B.

Image caption: Ron Ranauro: spotting trends.
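To make the spaced-seed idea described in the main text for PatternHunter more concrete, here is a minimal sketch in Python. It is not PatternHunter's code, and the function names (`seed_key`, `index_target`, `seed_hits`) are hypothetical. The 18-position pattern is the weight-11 seed usually quoted for PatternHunter and is used here purely as an illustration. The sketch keeps a dictionary of every seed-shaped word in the target, then looks up each query word, once with 11 consecutive positions and once with 11 positions spread over 18.

```python
from collections import defaultdict

# '1' = position must match, '0' = position is ignored.
CONSECUTIVE_SEED = "1" * 11            # 11 consecutive residues, as described for BLAST
SPACED_SEED = "111010010100110111"     # 11 required matches over an 18-residue window (illustrative)


def seed_key(seq, start, seed):
    """Return the residues of seq that fall under the '1' positions of the seed."""
    return "".join(seq[start + i] for i, flag in enumerate(seed) if flag == "1")


def index_target(target, seed):
    """Map every seed-shaped word in the target to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(target) - len(seed) + 1):
        index[seed_key(target, start, seed)].append(start)
    return index


def seed_hits(query, target, seed):
    """Return (query_start, target_start) pairs whose seed-shaped words agree."""
    index = index_target(target, seed)
    hits = []
    for q_start in range(len(query) - len(seed) + 1):
        for t_start in index.get(seed_key(query, q_start, seed), []):
            hits.append((q_start, t_start))
    return hits


if __name__ == "__main__":
    # Toy sequences (made up): identical except for one substitution at position 8.
    target = "ACGTTGCAAGCTTAGGCT"
    query = "ACGTTGCACGCTTAGGCT"
    print(seed_hits(query, target, CONSECUTIVE_SEED))  # [] : every 11-mer spans the mismatch
    print(seed_hits(query, target, SPACED_SEED))       # [(0, 0)] : the mismatch falls on a '0'
```

In this toy case the single substitution breaks every run of 11 consecutive matches, so the consecutive seed finds nothing, whereas the spaced seed still reports a hit because the substitution lands on an ignored position. A real search tool would go on to extend and score each hit; that step is omitted here.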


easy to do protein and nucleic-acid alignments, as well as to design primers, to search motifs and to perform multiple sequence analysis. One example is MacVector, for the Apple Macintosh and the Windows version DS Gene, produced by Accelrys in San Diego, California. MacVector has been available for many years and is continually being improved. The reasonably priced Jellyfish from LabVelocity in San Francisco, California, is also available for both Mac and PC. Like MacVector, Jellyfish will generate primers, oligonucleotides, cloning constructs and restriction maps, as well as perform BLAST searches and sequence alignments. When sequence data are downloaded, so are the annotations.

Another option is the Visual Cloning 3 package from Redasoft in Bradford, Ontario, which the company claims simplifies sequence analysis, sequence editing and presentation, as well as offering access to online tools through its integrated web-browser interface and an array of plain-language ‘wizards’ to guide the user through the process.

Proteomics has benefited from software that not only allows rapid analysis of two-dimensional electrophoresis gels, but also draws together other functions, such as data mining, into one easy-to-use package. Progenesis from Nonlinear Dynamics in Newcastle upon Tyne, UK, comes in three modules. The first allows data annotation, the second does the image analysis and data logging, and the third is the data-mining component in which, for example, the data from different gels can be compared. Users can generate pick-lists of interesting spots from the data-mining results and send them back to the image-analysis module to drive spot-picking from the original gel.

Paddy Lavery, Nonlinear’s bioinformatics marketing manager, says that a future version of the software will allow users to import raw mass-spectrometry traces or peptide map lists into the program and to search against internal or external peptide sequence databases. The user will be able to store the search results and link them back to the gel and the sample ID of any spot in the pick list.

Image caption: Making sense of 2D gels. (Credit: Nonlinear Dynamics)

Making predictions

As the number of sequences of known function increases, predictions of genes, protein structure and protein function from sequence information are getting more accurate. There is a trend towards collections of integrated, coordinated suites of gene-prediction programs, many of which can be tried out on the web. Softberry, for example, offers a number of gene-prediction programs that can be accessed over the web, including FGENESH, which the manufacturers claim is fast, sensitive and accurate. The suite also includes FGENESH_C, which searches for similar cDNAs, and FGENESH+, which will find similar proteins. The fully automated FGENESH++C will automatically annotate any genome (other than human) to a standard claimed to be indistinguishable from manual annotation, using a battery of complementary techniques.

The accuracy of programs that predict protein structure from sequence has improved over the past few years and these are gradually becoming more widely used. A number of easy-to-use academic and commercial protein-structure prediction programs are available. PROSPECT Pro from Bioinformatics Solutions uses the ‘threading’ method, which threads the query sequence onto all known protein folds from the Protein Data Bank to find the one that gives the best-fitting structure. The program builds on this established strategy by allowing the user to feed in experimental data, such as constraints on the threading, and it checks its own results using a neural network.

Predicting the localization of a protein within the cell is also a help in identifying the function of a gene product. Softberry’s ProtComp illustrates the trend in bioinformatics software to integrate diverse computational approaches. The program uses clues in a protein’s sequence to guess at where it is localized within a cell, mostly by using neural networks to check sequence elements for tell-tale localization-specific motifs.

Another important issue is monitoring and controlling the flow of data as they are

GENOMIC MERGERS

According to Celera Genomics in Rockville, Maryland, the next step after the sequencing of the human genome will be ‘merging technologies’. Celera recently forged agreements with Applied Biosystems in Foster City, California, under the umbrella of the Applera Corporation to work towards integrating all aspects of genomics. The companies are now developing an array of predesigned and prevalidated assays for genes identified on completion of the sequencing of the human genome. There are already assays for more than 18,500 human genes known to be expressed, as well as kits for over 125,000 single-nucleotide polymorphisms (SNPs); their goal is to have 200,000 SNP assays by the end of the year.

“The most common complaint we hear from scientists is: ‘but we’re not bioinformaticians’,” says Ramin Cyrus, a senior director for Celera. Anthony Kerlevage, a senior director at Celera, likens the situation to modern word processors. “They are packed with features, but most of us just use them to write letters,” he smiles.

The solution? This month, Celera and Applied Biosystems launched the ‘myScience’ portal. This website allows users to upload data from instruments, databases or laboratory information management systems (LIMS), and store them in personalized workspaces. Users can then analyse their data with an array of tools, guided by predesigned workflows. These workflows act like ‘wizards’, guiding the user through the options of which web-based tools are available. The site itself is free (Applied Biosystems hopes that it will tempt users to purchase the company’s assays), and is designed to complement the subscription-based Celera Discovery System, which offers deeper analyses. S.B.

Image caption: Which gene? Assays for human genetics are big business. (Credit: Applera Corporation)

generated and passed through the laboratory, especially as individual experiments can now produce large, complex data sets that need to be interpreted and passed on to the next set of experiments. This is where LIMS come in: laboratory information management systems that not only store data, but will track what reagents are used and where they are bought and stored, as well as following the project’s progress. LIMS have been described as electronic lab notebooks; a good LIMS package will catalogue experimental data, which it can capture directly from laboratory instruments, scientists’ reports and even purchasing. Another essential feature is traceability: original data must be open to tracking and retrieval throughout the project.

John Helfrich, programme manager for drug discovery and development at NuGenesis Technologies in Westborough, Massachusetts, has been keeping track of how LIMS are evolving. “I am seeing a trend towards the development of ‘purpose-built’ LIMS that perform a specific subset of data management within specific departments in the biopharmaceutical industry,” he says. He points to the Watson LIMS from InnaPhase in Philadelphia, Pennsylvania, which was specifically designed for preclinical bioanalysis in drug discovery.

Another likely trend will see LIMS packages become easier to configure. Traditionally a LIMS has been designed in consultation with the client to meet a specific need, but Helfrich thinks that future LIMS vendors will aim at more customizable off-the-shelf products. The latest version of NuGenesis’ interoperable SDMS (Scientific Data Management System) was launched this June, with enhanced support for the Macintosh OS X, UNIX and Windows platforms. The system catalogues and captures all of a project’s data from its source, thus avoiding the problem of traditional LIMS designed to capture only a narrowly limited data stream or restricted to using treated data. It can be integrated with an existing LIMS or any high-order IT system, or implemented as a stand-alone system for the small or medium-sized lab.

Spending less time in the library

The explosive increase in sequence data is almost matched by the increase in text publications, and keeping up with the published literature can be a full-time task. Some companies are now producing software that allows the user to explore textual data at a level beyond a simple literature search. PubGene, based in Oslo, Norway, offers a program at both free and proprietary levels that identifies potential associations of genes and proteins by finding their co-occurrence in abstracts of published papers or in gene-expression experiments. The publicly available version allows searches for co-occurrences of genes, although the database behind the commercial version is claimed to be more up-to-date and also offers protein searches.

A suite of programs from Genomatix Software takes a similar approach. One powerful member of this suite, which illustrates the emerging emphasis on the visual presentation of complex data, is BiblioSphere. This package brings together genome analysis and the US National Center for Biotechnology Information’s PubMed database. Like PubGene, BiblioSphere is a data-mining tool that looks for co-citations of two or more genes of interest in published abstracts. The lowest-level search tags the citation of two genes in the same abstract. The most discerning search looks for the co-citation of the two genes in the same sentence, coordinated by a key word such as ‘regulates’. The findings are presented graphically: the gene you searched for is presented visually at the centre of a sphere of related genes. Clicking on the line joining two genes leads you to the citation mentioning the two genes. Clicking on any gene leads you to further data on that gene.

Image caption: Joining the dots: BiblioSphere from Genomatix searches the literature for co-citations of genes.

Some programs even claim to be able to track that most complex domain of biological information, the research paper, to find exactly the papers you need. Despite the obvious difficulties faced by a machine attempting to ‘understand’ natural, human language, some packages claim to do just that. The LexiQuest Knowledge Management Suite from SPSS in Chicago, Illinois, is based on ‘real’ linguistics: it can race through unstructured text and produce a graphical map of the main concepts in that text. SPSS claims that the software understands natural language queries and can respond to them intelligently. “Everyone is calling text ‘unstructured data’, but that’s not quite true,” says Catherine DeSesa, senior analyst at SPSS. “Text does have structure because language has structure, a very complex structure, and it is the incorporation of the knowledge of this structure that enables LexiQuest Mine to accurately extract and organize multi-word concepts without prior knowledge of the exact terms themselves.” Another example is KDE TextSense from InforSense in London, UK, which contains at its core a free-text mining toolbox and can be used to turn the information into structured data tables.

New information is driving software development, but changes are also needed to the way in which data are presented. Take the database problem: a huge number of databases now exist, all of varying quality and featuring different, often incompatible, formats. Matthew Day, databases editor for London-based online publishers BioMed Central, sees the need for change. “There aren’t yet public repositories for all the different sorts of information that biologists are producing. I believe that all data sets should be published as user-friendly online databases that are closely associated with journal articles and are amenable to data mining. Thus the boundaries between journals and databases becomes blurred, and a sea of data sets is created under the umbrella of the peer-review system.”

In the genome age, it is easy for small laboratories to feel that they are being left behind. Large biotech companies can now achieve in an afternoon what used to take an average-sized lab months, if not years, to accomplish. But software technologies are getting better at encapsulating expertise into compact programs. These are becoming easier to use and more reliable, and the results are easier to interpret. There is a growing emphasis on automatically drawing on diverse data sources from widely divergent locations. New services and software products are drawing us closer to the day when the expertise of a professional bioinformatician can be downloaded into a desktop computer. ■

Steve Buckingham is a neurophysiologist at the Department of Molecular Biophysics, University of Oxford. He is also a freelance writer.
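The tiered co-citation search described above for BiblioSphere can be illustrated with a minimal Python sketch. It is not the Genomatix or PubGene implementation. The function names, the intermediate ‘sentence’ tier and the naive sentence splitter are assumptions made for illustration, and the abstracts are invented toy text. Only the two end points (same abstract, and same sentence coordinated by a key word such as ‘regulates’) come from the description in the article.

```python
import re


def split_sentences(text):
    """Very naive sentence splitter (assumption): break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(text, term):
    """True if term occurs in text as a whole word, ignoring case."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE) is not None


def cocitation_level(abstract, gene_a, gene_b, keywords=("regulates",)):
    """Classify how closely two genes are co-cited in one abstract.

    Returns 'none', 'abstract', 'sentence' or 'sentence+keyword'.
    """
    if not (mentions(abstract, gene_a) and mentions(abstract, gene_b)):
        return "none"
    level = "abstract"  # lowest tier: both genes appear somewhere in the abstract
    for sentence in split_sentences(abstract):
        if mentions(sentence, gene_a) and mentions(sentence, gene_b):
            level = "sentence"  # hypothetical middle tier, not described in the article
            if any(mentions(sentence, word) for word in keywords):
                return "sentence+keyword"  # most discerning tier
    return level


if __name__ == "__main__":
    # Invented toy abstracts, for illustration only.
    abstracts = [
        "TP53 was measured in tumour samples. MDM2 expression was also recorded.",
        "TP53 and MDM2 were both detected in the samples.",
        "We show that MDM2 regulates TP53 stability. Other genes were unchanged.",
    ]
    for text in abstracts:
        print(cocitation_level(text, "TP53", "MDM2"))
```

Running the classifier over every abstract returned by a literature search, and counting the results for each gene pair, would give the kind of weighted gene-to-gene links that a graphical display could then draw. Real tools also have to cope with gene synonyms and ambiguous symbols, which this sketch ignores.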
