BY RANDY BARRETT
ILLUSTRATION BY MIKE PERRY

THE NEW JANELIA COMPUTING CLUSTER PUTS A PREMIUM ON EXPANDABILITY AND SPEED.

HHMI BULLETIN | May 2010

Computational biologists have a need for speed. The computing cluster at HHMI’s Janelia Farm Research Campus delivers the performance they require—at a mind-boggling 36 trillion operations per second.

In the course of their work, Janelia researchers generate millions of digitized images and gigabytes of data files, and they run algorithms daily that demand robust computational horsepower. Geneticists, molecular biologists, biophysicists, physiologists, and even electrical engineers pursue some of the most challenging problems in neuroscience, chief among them how individual neuronal circuits process information. Their discoveries depend, now more than ever, on the seamless interplay of scientists and computers.

Humming nonstop in Janelia’s compact computing center are 4,000 processors, 500 servers, and storage machines holding half a petabyte of data—about 50 Libraries of Congress worth of information.

Though there are many larger clusters around the world, this particular one is just right for Janelia Farm. “Beautifully conceived, ruthlessly efficient, and extraordinarily well run by the high-performance computing team,” according to Janelia researcher Sean Eddy, the system is designed to make digital images available lightning fast while muscling through the monster calculations required to help investigators conduct genome searches and catalog the inner workings and structures of the brain.

FASTER ANSWERS

A group leader at Janelia Farm, Eddy deals in the realm of millions of computations daily as he compares sequences of DNA. He is a rare breed, both biologist and code jockey. “I’m asking biological questions, and designing technologies for other people to ask biological questions,” he says.

Eddy writes algorithms to help researchers extract information from DNA sequences. It’s a gargantuan matching game where a biological sequence—DNA, RNA, or protein—is treated as a string of letters and compared with other sequences. “From a computer science standpoint, it’s similar to voice recognition and data mining,” he says. “You’re comparing one piece against another. We look for a signal in what looks like random noise.”

Eddy looks for the hand of evolution in DNA by comparing different organisms’ genomes. He’s searching for strings of DNA sequences that match—more than random chance would dictate.

“It’s a lot like recognizing words from different languages that have a common ancestry, thus probably the same meaning,” he explains. “In two closely related languages—Italian and Spanish, for example—it’s pretty obvious to anyone which words are basically the same. That would be like two genes from humans and apes.”

But in organisms that are more divergent, Eddy needs to understand how DNA sequences tend to change over time. “And it becomes a difficult specialty, with serious statistical analysis,” he says.

From a computational standpoint, that means churning through a lot of operations. Comparing two typical-sized protein sequences, to take a simple example, would require a whopping 10^200 operations. Classic algorithms, available since the 1960s, can trim that search to 160,000 computations—a task that would take only a millisecond or so on any modern processor. But in the genome business, people routinely do enormous numbers of these sequence comparisons—trillions and trillions of them. These “routine” calculations could take years if they had to be done on a single computer.

That’s where the Janelia cluster comes in. Because a different part of the workload can easily be doled out to each of its 4,000 processors, researchers can get their answers 4,000 times faster—in hours instead of years. The solutions don’t tend to lead to eureka moments; rather, they provide reference data for genome researchers as they delve into the complexities of different organisms. “These computational tools are infrastructural, a foundation for many things,” Eddy says.

While that may not sound dramatic, Eddy’s protein-matching algorithms are an industry standard, used by researchers as the search tool for a reference library called the Protein Families database, or Pfam. There are roughly 10 million proteins in the database. Luckily about 80 percent of those sequences fall into a much smaller set of families and Eddy has designed the analysis software to query for matches in this data set. “When a new sequence comes in, [Pfam] is like a dictionary—it’s always being added to,” he says. The database currently identifies 12,000 protein families.

There is also an RNA database called Rfam for which Eddy and his Janelia team have software design and upkeep responsibilities. Eddy has to keep one step ahead of his users, which means stressing his analysis tools to the failure point so he can improve them. “We set up experiments and try to break the software and push the envelope,” he says.

A MOSAIC OF FLY NEURONS

The Janelia computing system is referred to as a “cluster” of processors by both its overseers and its users. The cluster serves 350 researchers and support staff and can scale up to serve many more if requirements demand it. Its design puts a premium on expandability, flexibility, and fast response, particularly since the scientific needs may change and evolve rapidly.

The computer cluster was recently upgraded as part of a regular four-year technology refresh. Made up of commercially available hardware components built by Intel, Dell, and Arista Networks, Eddy calls the system a “working class supercomputer.” Janelia is the first customer for this particular design—in fact, some of the components have serial number 1 or 2 and are signed by the engineers who built them. The new system is up to 10 times faster than the old one and has six times more memory.

“Janelia’s new computing cluster provides a platform that is an order of magnitude more responsive than the previous system and can be grown easily to accommodate changing requirements,” says Vijay Samalam, Janelia’s director of information technology and scientific computing.

[Photos. TOP: Roian Egnor; MIDDLE: Sean Eddy; BOTTOM: Elena Rivas, Lou Scheffer. Paul Fetters]

That expanded capacity is a big help to Janelia fellow Louis Scheffer, an electrical engineer and chip designer by training. He uses the cluster to help researchers map the brain wiring of the common fruit fly Drosophila melanogaster. Essentially, it’s a massive three-dimensional image-manipulation challenge. First, slices of brain 1/1,000th the thickness of a human hair are digitally photographed with an electron microscope and stored. In each layer, the computer assigns colors to the neurons so researchers can trace their path. As an example, the medulla of the fly, part of the brain responsible for vision, requires more than 150,000 individual images to create the full mosaic, which is 1,700 layers (slices) deep.

But all these pictures must be knitted together so scientists can follow neural paths and see where they lead. Think Google Earth. As you pan across the globe, data are fed onto the screen so you can “fly” from one location to another, and more images are required as you drill down to examine surface topography. Making the transitions smooth in between [...]

But Scheffer’s matching is just the first step in the image-manipulation process. Janelia software engineer Philip Winston takes the processed pictures and does the unthinkable—he chops them up again. He creates smaller “tiles” of the photos, which can be more easily added and subtracted from a computer screen as a researcher pans across an image. “To open a single image would take five minutes if you didn’t tile them,” says Winston. Only 20 tiles are required on the screen at any one time. Currently, Winston is working with four million tiles as part of the Janelia Fly Electron Microscope project to map the entire brain of the fruit fly.

Humans proofread the final fly-brain image for accuracy, to trace the neural paths and make sure the computer has identified structures correctly. “[People] are an important step,” says Winston. “Without them, the computer segmentation would be 95 percent right and we wouldn’t know about the other 5 percent.”

Scheffer and Winston’s ultimate goal is to completely automate the mapping process and to teach the computer to identify the inner structures of the fruit fly brain, in particular the different types of neurons, and the axons and dendrites [...]

[...] processors. Those familiar with office networks know it well: Ethernet, the popular standard for moving electronic data from point A to B. While it has been a long-standing protocol for slower connections, 10-gigabit Ethernet has not traditionally been the choice of makers and engineers of top supercomputers who until recently, when best performance was a must, used specialized networking technology called InfiniBand.

“Now Ethernet switches are as efficient as, or very close to, InfiniBand and you don’t need a different [networking] skill set,” says Spartaco Cicerchia, manager of network infrastructure at Janelia Farm. The bottom-line advantage is that Ethernet is easier to work with, familiar to more networking engineers, and tends to be cheaper to use.

Lower latency—the time it takes to move data across a network connection—is now possible via Ethernet due to a relatively new networking standard called iWarp. Traditionally, computers’ processors must manage the flow of information packets as they pass between them. In the new systems, those packets are handled by a separate piece of hardware made by the chip manufacturer Intel.
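
For readers curious about the sequence-comparison figures quoted earlier, here is a minimal sketch of how a table-filling ("dynamic programming") alignment keeps the work small. It is not Eddy's software, and the article does not name which classic algorithm it means; this is a textbook global alignment in the style of Needleman-Wunsch. The function name, the scoring values (match, mismatch, gap), and the 400-letter sequence length are all assumptions chosen for illustration. With two sequences of 400 letters each, the table has 400 x 400 = 160,000 cells, which is one plausible reading of the "160,000 computations" figure.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score the best global alignment of strings a and b.

    Returns (best_score, cells_filled). The scoring values are
    illustrative assumptions, not parameters from the article.
    """
    # Row 0 of the table: aligning an empty prefix of `a` against
    # each prefix of `b` costs one gap per letter.
    prev = [j * gap for j in range(len(b) + 1)]
    cells = 0
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix of `b`.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Three ways to reach this cell: pair the two letters,
            # or put a gap in one sequence or the other.
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + pair
            up = prev[j] + gap
            left = curr[j - 1] + gap
            curr.append(max(diag, up, left))
            cells += 1
        prev = curr
    return prev[-1], cells


if __name__ == "__main__":
    # Hypothetical 400-letter sequences; only the length matters here.
    score, cells = global_alignment_score("A" * 400, "A" * 400)
    print(cells)  # 160000
```

Each cell depends only on its three neighbors, so the work grows with the product of the two lengths rather than with the number of possible alignments. Separate comparisons do not depend on one another, which is why a workload of many comparisons can be split across the cluster's processors.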
