NCBI News

National Center for Biotechnology Information

National Library of Medicine National Institutes of Health Spring 2000

Dazzling Graphics with Cn3D 3.0

The newly released Cn3D 3.0 provides dazzling graphics through the use of the OpenGL graphics library. Coupled with the graphics enhancements are several new rendering options and color schemes. The Cn3D sequence window has also been refined to allow easier selection of residues, and better depiction of secondary structural elements, protein domains, and hetero-atoms alongside the sequence.

Greater Graphics

Cn3D 3.0 depicts solid objects more realistically, taking advantage of highlighting and shadowing. New rendering options include an alpha-carbon backbone “worm”, or “coil” representation at three thickness levels. Alpha-helices may now be represented either as blunt-ended cylinders or cylinders with carboxy-terminal “caps” to indicate the direction of the peptide chain. In addition, ions are depicted as translucent spheres. A new color scheme called Sequence Conservation offers several coloring options when multiple sequences are displayed in the sequence window.

This graphical power can be adjusted to allow for differences in computer speeds, video resolutions, or molecule size using a new Cn3D Quality control tab found under View/Drawing settings. Lower resolutions can be used to facilitate faster rotation speeds for larger molecules. Higher resolutions can be chosen when publication-quality figures are desired.

Enhanced Functionality in the Sequence Window

The Cn3D Sequence window operates in either a single or multiple sequence mode. If a single sequence is displayed, then the window is entitled OneD-Viewer and uses a set of options suitable for operations on a single sequence. If more than one sequence is displayed, the window is entitled DDV (DeuxD-Viewer) and a set of options suitable for multiple sequences is active. Columns of a multiple alignment may be colored according to sequence conservation, for example, with these colors mapped to both the sequence and structure windows.

OneD-Viewer Operation

The Sequence window has been given several new abilities in version 3.0. Secondary structural elements are now represented as 3-D-like cartoon images beneath the sequence. NCBI-assigned protein domains are
continued on page 3

BLAST Now Offers Taxonomic Views of the Output

NCBI’s Basic and Advanced BLAST services now offer the option of returning a taxonomically organized report. Clicking on the Taxonomy Reports link on the BLAST results page will generate taxonomy reports in three formats: a Lineage Report, an Organism Report, and a Taxonomy Report. Together, these three reports provide a broad overview of the taxonomic relationships among the records returned from a BLAST search.

The Lineage Report

The Lineage Report gives a simplified view of the relationships between the organisms generating database hits to the query sequence
continued on page 8

In this issue

1 Dazzling Graphics with Cn3D 3.0
1 BLAST Offers Taxonomic Views
2 HomoloGene: Clusters of Clusters
4 News Briefs
4 Fly Genome Deposited in GenBank
5 Drosophila Finds New Home Page
5 Human Genome Map Viewer
6 Frequently Asked Questions
7 BLAST Lab

NCBI News

NCBI News is distributed four times a year. We welcome communication from users of NCBI databases and software and invite suggestions for articles in future issues. Send correspondence to NCBI News at the address below.

NCBI News
National Library of Medicine
Bldg. 38A, Room 8N-803
8600 Rockville Pike
Bethesda, MD 20894
Phone: (301) 496-2475
Fax: (301) 480-9241
E-mail: [email protected]

Editors: Dennis Benson, Barbara Rapp
Contributors: Scott Federhen, Donna Maglott, Jim Ostell, Lukas Wagner
Writer: David Wheeler
Editing, Graphics, and Production: Marla Fogelman, Veronica Johnson, Jennifer Vyskocil
Design Consultant: Gary Mosteller

In 1988, Congress established the National Center for Biotechnology Information as part of the National Library of Medicine; its charge is to create information systems for molecular biology and genetics data and perform research in computational molecular biology.

The contents of this newsletter may be reprinted without permission. The mention of trade names, commercial products, or organizations does not imply endorsement by NCBI, NIH, or the U.S. Government.

NIH Publication No. 00-3272
ISSN 1060-8788
ISSN 1098-8408 (Online Version)

HomoloGene: Clusters of Clusters

HomoloGene is a new NCBI database of both curated and calculated orthologs and homologs for the human, mouse, rat, and zebrafish genes represented in UniGene and LocusLink.

Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information (ZFIN) database at the University of Oregon, and from published reports. Computed orthologs and homologs are identified from nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. Calculated orthologs and homologs may be considered putative because they are based only on sequence comparison.

Computed similarities are detected using BLAST to compare nucleotide sequences for each pair of organisms, and to identify those sequence pairs that share the greatest degree of nucleotide sequence similarity. The best match for a sequence in one organism to a sequence in a second organism is based on the percentage of identical sequence, called the %ID in the HomoloGene report, for an alignment of a minimum of 100 base pairs. When sequences from two UniGene clusters are reciprocal best matches, the UniGene clusters corresponding to the pair of sequences are considered to represent a putative ortholog pair.

HomoloGene also contains a set of triplet ortholog clusters in which orthologous clusters in two organisms are also orthologous to the same cluster in a third organism. For the organisms human, mouse, and rat, there are currently over 7,000 of these triplets. For the organisms zebrafish, human, and rodent (mouse or rat), there are currently just over 200 triplets.

HomoloGene Search

To obtain a HomoloGene report for a gene, use the Query box at the top of the HomoloGene page to search using a UniGene ClusterID, LocusLink LocusID, gene symbol, gene name, nucleotide accession number, or any free text appearing in UniGene cluster titles. The HomoloGene report consists of a header section, followed by reports falling within any of three sections entitled Curated Orthologs, Calculated Orthologs, and Mutually Orthologous Pairs.

The header section gives the title of the HomoloGene cluster, followed by a listing of all the possible orthologs contained within it. For each entry, the UniGene cluster ID and the LocusLink ID are given.

The Curated Orthologs section gives pairs of orthologs and, for each pair, a link to the source in which the orthologous relationship is claimed. If the source is a research paper, the link is to a PubMed abstract. In other cases, the source may be MGD or ZFIN, and links are provided to these resources.

The Calculated Orthologs section gives a listing of putative ortholog pairs identified on the basis of sequence similarity. For each pair
continued on page 8
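
The reciprocal-best-match rule described in the HomoloGene article can be illustrated with a short Python sketch. This is not NCBI's implementation: the `Hit` record, the function names, and the cluster IDs in the usage note are hypothetical. For brevity the sketch compares clusters directly, whereas the article describes matching individual sequences and then mapping them to their UniGene clusters. The only values taken from the article are the %ID criterion and the 100 base pair minimum alignment length.

```python
from collections import namedtuple

# Hypothetical record for one pairwise alignment between two organisms.
Hit = namedtuple("Hit", "query subject pct_id aln_len")

MIN_ALN_LEN = 100  # minimum alignment length stated in the article


def best_matches(hits):
    """Map each query to the subject with the highest %ID.

    Alignments shorter than MIN_ALN_LEN are ignored.
    """
    best = {}
    for h in hits:
        if h.aln_len < MIN_ALN_LEN:
            continue  # too short to count as a match
        if h.query not in best or h.pct_id > best[h.query].pct_id:
            best[h.query] = h
    return {query: h.subject for query, h in best.items()}


def reciprocal_best(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best match."""
    ab = best_matches(a_vs_b)
    ba = best_matches(b_vs_a)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)
```

As a usage note, given hits for a made-up human cluster `Hs.1` against mouse clusters `Mm.7` (92% over 250 bp) and `Mm.9` (99% over only 60 bp), the short alignment is discarded, so `Hs.1` and `Mm.7` are reported as a putative ortholog pair only if `Mm.7` also has `Hs.1` as its best match in the reverse comparison.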

Spring 2000 NCBI News 2

Dazzling Graphics with Cn3D
continued from page 1

indicated using solid 2-D arrows. Hetero-atoms in the features list are indicated in the sequence display as small triangles positioned near the residues that are closest to them in the 3-D structure. As in earlier versions of Cn3D, additional sequences may be aligned to a structurally anchored sequence by importing FASTA-formatted files or by downloading directly from Entrez over the network.

DDV Viewer Operation

When a multiple alignment is shown in the Sequence window, a different set of options is available. The mouse mode may be changed, using the Options menu, from the default of Select to Query in order to use the mouse to identify residue numbers. A Styles panel, also invoked from the Options menu, allows several sequence-alignment display parameters to be adjusted. These include the use of color and the placement of ruler lines.

The enhanced graphical capabilities of Cn3D 3.0 are illustrated in the views of Human Phenylalanine Hydroxylase (PDB code 1PAH) in Figures 1 and 2. The structure window shows an image of the protein in which the NCBI-defined secondary structural elements, helices and strands, are represented as solid cylinders and planks, respectively, with the pointed ends indicating the amino- to carboxy-terminal direction. An iron atom is seen in the center of the structure, represented as a translucent sphere. Using the Annotation panel, the sites of documented mutations, as taken from Online Mendelian Inheritance in Man (OMIM), are marked by showing these amino acid side-chains in a “fat-tube” representation.

The accompanying Sequence window shows these mutable residues in a dark shade (red). Secondary structural elements are shown below the sequence. The single NCBI-derived protein domain for this structure is indicated by the arrow that runs from residue 59 to residue 291. Residues proximal to the iron atom in the structure are marked with an H under the sequence.

For more information, navigate to www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.html. —DW

Figure 1: Cn3D Structure window view of human phenylalanine hydroxylase (PDB code 1PAH).

Figure 2: Cn3D Sequence window view of human phenylalanine hydroxylase (PDB code 1PAH).

3 Spring 2000 NCBI News

News Briefs

LocusLink Adds Organisms and Search Fields

Two new organisms have been added to LocusLink: the zebrafish, Danio rerio, and the fly, Drosophila melanogaster. The data for Drosophila include all genes thus far identified in the recent deposition of the Drosophila genome. Links to the FlyBase database of Drosophila genome annotations are provided for these genes.

Two new Boolean qualifiers have also been added to LocusLink. The first is termed “disease_known”, whereas the second is termed “has_seq”. These qualifiers provide the ability to search for records with associated diseases or sequence data. Together, these qualifiers can be used, for example, to determine how many disease genes have been sequenced using the search:

    disease_known AND has_seq

This query returns more than one thousand records.

GenBank Gets Billion Base Boost With Version 118.0

GenBank release 118.0 is now available via FTP from NCBI. Uncompressed, the release 118.0 flat files require roughly 29 gigabytes (GB) of disk space. The ASN.1 version of the GenBank sequence data requires roughly 24 GB. Release 118.0 comprises 8,604,221,980 base pairs, 7,077,491 sequences, and represents a growth of 1.23 billion base pairs over release 117.0. Pick up the latest GenBank at ftp://ncbi.nlm.nih.gov/genbank/.

1.5 Million Bases of Bad Bug Sequenced

The complete 1.5 million base pair genomic sequence of the motile, gram-negative bacterium, Campylobacter jejuni (Feb 10, Nature), has been deposited in GenBank and can be viewed in Entrez Genomes. C. jejuni has earned a place in the FDA's “Bad Bug Book” by being the bacterium most frequently associated with diarrhea in the United States. The Bad Bug is linked to 45% of diarrhea cases, leading Salmonella (30%), Shigella (17%), and E. coli (5%). Unchlorinated water and uncooked meat are the most common sources of contamination.

Take a look at the genome on the NCBI Bacterial Genomics page at www.ncbi.nlm.nih.gov/PMGifs/Genomes/eub.html.

BLAST Version 2.0.13 is Released

The latest version of the BLAST suite of programs, version 2.0.13, is now available on the NCBI FTP site at ftp://ncbi.nlm.nih.gov/blast/. Archives of executables of standalone BLAST are found at this site within the “executable” directory. The BLAST network client is found within the “network” directory.

The major enhancement of this release is that Bl2seq, the standalone version of the Web-based BLAST2Sequences, can now perform nucleotide-protein (blastx style) comparisons. In this new version of Bl2seq, the “-p” option accepts the arguments “blastn”, “blastp”, or “blastx” rather than the Boolean “T” or “F” of prior versions.

Fly Genome Deposited in GenBank

The Drosophila genome sequence described in Science (Mar 2000; 287, 2185–95) is publicly available in GenBank. The genome, sequenced jointly by scientists at Celera Genomics and the Berkeley Drosophila Genome Project (BDGP), was deposited in the form of more than 700 “scaffolds” consisting of annotated, ordered sets of contigs having gaps of known length. Approximately 134 scaffolds, representing 98% of the genome, have been mapped to a chromosome and these mappings appear in the GenBank record with a “/chromosome” qualifier in the source feature field.

Many scaffolds contain unsequenced gaps of known length, indicated by the appropriate number of Ns. The sequence spanning the gaps will be filled, primarily by the BDGP, from one end of the genome to the other, in the coming months. As each chromosome is completed, it will be completely reannotated, reviewed, and updated with a complete, consistent set of annotated records. In the meantime, the BDGP plans to provide regular unannotated updates to NCBI of the most recent versions of the scaffold sequences. These updates will be available from the NCBI FTP site. NCBI will also align the annotated GenBank records to the most recent unannotated versions of the scaffolds in order to combine the annotation available from the initial release with the most recent sequence data. —JO, DW
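
Because the scaffold gaps are written as runs of Ns whose length equals the gap length, they can be located with a simple scan. The Python sketch below is only an illustration under that assumption; the function name `find_gaps` and the example sequence are hypothetical and do not come from any NCBI tool.

```python
import re


def find_gaps(sequence):
    """Return (start, length) for every run of N in a nucleotide sequence.

    start is a 0-based offset; lower-case n is treated the same as N.
    """
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"N+", sequence.upper())]
```

For example, `find_gaps("ACGTNNNNACGTnnAC")` reports one gap of length 4 starting at offset 4 and one of length 2 starting at offset 12.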

Spring 2000 NCBI News 4

Drosophila Finds a Home Page at NCBI

NCBI has combined a gateway to the complete genomic sequence of the fruit fly, an array of links to related Drosophila resources such as FlyBase and GadFly, and a mix of pre-computed analyses conducted at NCBI, to produce a special Drosophila Web page to serve as a springboard for the analysis of the fly genome sequenced jointly by Celera Genomics and the Berkeley Drosophila Genome Project.

The Genome Map Viewer

The genome of the fruit fly is accessible through five chromosome links, each of which invokes NCBI’s new Genome Map Viewer to provide access to the sequence data. Users can select from five different maps, which run the gamut from physical maps showing chromosomal banding patterns, to sequence maps with links to GenBank records. Using a query box at the top of the main Drosophila page, it is possible to perform text searches followed by protein neighbor searches in which the results are graphically mapped onto the Drosophila chromosomes.

NCBI Analysis

Pre-computed BLAST comparisons between protein sequences derived from the Drosophila sequence and all protein sequences in the BLAST non-redundant database can be accessed through a link on the Drosophila page. In addition, one can generate a 2-dimensional dot plot showing the similarity of the set of Drosophila proteins to the protein sets of any two organisms or taxa. A Related Structures link on the Drosophila page leads to a listing of Drosophila protein sequences with significant similarity to the sequences of proteins with known structure. Using NCBI’s Cn3D macromolecular viewer it is possible to view 3-D images of the Drosophila protein sequences superimposed upon protein structures to which they bear sequence similarity.

The Drosophila Web page can be reached through a link on NCBI’s Genomic Biology Web page at www.ncbi.nlm.nih.gov/Genomes/index.html. —DW

A Map Viewer for the Human Genome

NCBI’s Human Genome Map Viewer can display up to seven parallel chromosomal maps simultaneously. The maps displayed can be selected from a set of 19, and include cytogenetic maps, such as G-banding chromosomal ideograms, sequence-based maps, such as those showing contig and clone information, and radiation hybrid maps, such as the G3 and GB4 maps used to construct GeneMap '99. To take a broad view of the human genome click on the Map Viewer link on NCBI’s Human Genome Resources web page at www.ncbi.nlm.nih.gov/genome/guide/.

Selected Recent Publications by NCBI Staff

Galperin, MY, L Aravind, and EV Koonin. Aldolases of the DhnA family: a possible solution to the problem of pentose and hexose biosynthesis in archaea. FEMS Microbiol Lett 183(2):259–64, 2000.

Kirsch, IR, ED Green, R Yonescu, R Strausberg, N Carter, D Bentley, MA Leversha, I Dunham, VV Braden, E Hilgenfeld, GD Schuler, AE Lash, GL Shen, M Martelli, WM Kuehl, RD Klausner, and T Ried. A systematic, high-resolution linkage of the cytogenetic and physical maps of the human genome. Nat Genet 24(4):339–40, 2000.

Makarova, KS, YI Wolf, O White, K Minton, and MJ Daly. Short repeats and IS elements in the extremely radiation-resistant bacterium Deinococcus radiodurans and comparison to other bacterial species. Res Microbiol 150(9-10):711–24, 1999.

Panchenko, AR, A Marchler-Bauer, and SH Bryant. Combination of threading potentials and sequence profiles improves fold recognition. J Mol Biol 296(5):1319–31, 2000.

Pickeral, OK, W Makalowski, MS Boguski, and JD Boeke. Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res 10(4):411–5, 2000.

Pruitt, KD, KS Katz, H Sicotte, and DR Maglott. Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet 16(1):44–7, 2000.

Schaffer, AA, YI Wolf, CP Ponting, EV Koonin, L Aravind, and SF Altschul. IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15(12):1000–11, 1999.

Sherry, ST, M Ward, and K Sirotkin. Use of molecular variation in the NCBI dbSNP database. Hum Mutat 15(1):68–75, 2000.

5 Spring 2000 NCBI News

Q&A: Frequently Asked Questions

Q. In PubMed, how can I restrict my search to an author name?
A. Click on the Preview/Index link in the grey area beneath the query box. Choose Author Name from the Index list box and type the author name into the text box to the right. Click on the AND button to add the author name to your query.

Q. Why do I get the message “ERROR: Blast: No valid letters to be indexed”?
A. You may have accidentally entered an accession number in the search box without changing the input selection from Sequence in FASTA format to Accession or gi. You will also see this error message if your query contains invalid characters or a large number of ambiguity codes (R, Y, K, W, N, etc. for nucleotide). Although BLAST allows ambiguity codes, be aware that these will always contribute a negative value to nucleic acid alignment scores.

Q. How are symbols and gene names selected for LocusLink and RefSeq records?
A. Both LocusLink and RefSeq use gene symbols and gene names established by the nomenclature committee appropriate for the genome.

Q. How can I see the sequence alignments in GenBank?
A. From the Entrez Nucleotides search screen, click on the Limits button, and then select Properties from the search field list box. Type in “seqalign present” in the search box and press Go. To view an alignment, follow the Popset link to the right of the Entrez summary.

Q. Is there a way to download the proteome of an organism from GenBank?
A. Yes, if the organism is represented in Entrez Genomes. Navigate to the Entrez Genomes summary page for the organism in question and click on the Protein Coding Genes link in the Feature table. A table will be displayed giving the protein coding regions of the genome and links to corresponding protein sequences. This table, as well as the corresponding protein sequences, can be saved to a local file. Entrez Genomes is found at www.ncbi.nlm.nih.gov/Entrez/Genome/main_genomes.html.

Q. Where can I find tutorials on the use of the NCBI Web services?
A. A set of NCBI tutorials as well as other educational material can be found under the Education link on the NCBI home page.

Spring 2000 NCBI News 6

BLAST Lab

How to BLAST Using Very Large Nucleotide Queries

As an era of prolific genome sequencing begins, the need arises to run BLAST searches using very large query sequences that often reach hundreds of kilobases in length. Because of its speed and flexibility, the BLAST algorithm can rise to the occasion and perform these searches quickly.

One of the contributors to the speed and flexibility of BLAST is the fact that BLAST matches multi-character “words” between the query and database sequences rather than single characters. By default, these matches need not be perfect, although a scoring threshold for reporting matches can be adjusted. This search strategy offers a tradeoff between speed and sensitivity; smaller word-sizes result in greater sensitivity at the expense of speed while larger word-sizes optimize BLAST for speed.

The default word-size for BLASTn searches is 11, which allows untranslated nucleotide searches to proceed at a rapid pace with a degree of sensitivity appropriate for queries in the range of 1 to 50 kb. A word-size of 11, however, is too small to facilitate rapid searches with queries of 100 kb or more. Fortunately BLAST is flexible enough to allow the word-size to be adjusted upward indefinitely to accommodate ever larger query lengths. For a query of 100 kb, a word-size of 30 to 50 will allow a Web-BLAST search to run to completion and return informative database matches.

Filtering parameters may also be adjusted to facilitate BLAST searches with large queries. Because repetitive sequence within the query leads to a repetitive and relatively uninformative series of database hits, BLAST masks simple, repetitive sequence by default. Repetitious hits waste computational time, and the larger the query sequence, the larger the potential problem, so repeat filtering should always be used with large query sequences. In the case of human sequences, the human repeat filtering option of Advanced BLAST should be used to mask more complex varieties of repeat found in human sequences. Both simple repeat and human repeat filtering are activated using check boxes on the Advanced BLAST Web page. To run a BLASTn search using a word-size of 50 and filtering the query for human repeats, type “-W 50” into the Advanced BLAST Advanced Options box and check the appropriate filtering options as shown in Figure 1.

Figure 1: Portion of the Advanced BLAST page demonstrating the use of the “Advanced Options” box and filtering options.

As an example, a BLASTn search of the default nr nucleotide database with a 400 kb contig from human chromosome 22 will time-out using Advanced BLAST if the default parameters are used. However, if the word-size is changed to 200, and human repeat filtering activated, the search is completed within 15 minutes!

Searches with very large sequences may also be performed using BLAST2Sequences if the word-size is increased sufficiently. Using the default word-size of 11, alignments with query sequences on the order of 150 kb in length will not generally be completed. However, by increasing the word-size to 50, two sequences as large as 250 kb apiece may be aligned. The word-size may be changed on the BLAST2Sequences page using the input box provided.

The BLAST Lab feature is intended to provide detailed technical information on some of the more specialized uses of the BLAST family of programs. Topics are selected from the range of questions received by the BLAST Help Group.
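
The speed side of the word-size tradeoff can be seen in a toy Python sketch that counts how many word matches (seeds) a search would have to examine at different word-sizes. This is not the BLAST algorithm: it assumes exact word matches only, ignores scoring thresholds and extension, and the function name `word_hits` and the sequences are hypothetical.

```python
def word_hits(query, subject, word_size):
    """Count (query offset, subject offset) pairs sharing an exact word.

    Each shared word is a seed that a search would have to follow up,
    so the count is a rough proxy for the work done at a given word-size.
    """
    # Index every word of the subject by its starting offsets.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    # Look up every word of the query in the index.
    hits = 0
    for i in range(len(query) - word_size + 1):
        hits += len(index.get(query[i:i + word_size], ()))
    return hits
```

Comparing the repetitive sequence `ACGTACGTAC` with itself gives 13 seeds at a word-size of 4 but only 3 at a word-size of 8, which illustrates both why larger words speed up a search and why repeats in the query inflate the number of hits.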

7 Spring 2000 NCBI News

HomoloGene
continued from page 2

listed, a %ID score is given as a measure of reliability. If the gene in question is a member of a triplet ortholog cluster, the Mutually Orthologous Pairs section shows the triplet.

FTP Access to Data

The current datasets for the calculated orthologs and homologs and the mutually orthologous pairs are available via FTP at ftp://ftp.ncbi.nlm.nih.gov/pub/HomoloGene/. Link to the HomoloGene Web page from the UniGene page at www.ncbi.nlm.nih.gov/UniGene/. —LW, DW

BLAST
continued from page 1

by showing how closely these organisms are related to a “focus organism”, according to the taxonomy database. This focus organism is the organism giving the strongest BLAST hit and this will often be the source organism of the query sequence.

The Organism Report

In the Organism Report, the BLAST results are grouped into blocks by species. Within each species block, the records are sorted by BLAST score. The order of species blocks themselves is based on the BLAST score of the best hit within the block.

The Taxonomy Report

The Taxonomy Report summarizes the relationships among all of the organisms found in the BLAST results. Using this report, it is easy to see how many records are found within broad taxonomic groups such as the mammalia, or the archaea.

For detailed information on the interpretation of the BLAST taxonomy reports, see the Help document at www.ncbi.nlm.nih.gov/ /taxblasthelp.html. The new taxonomic output format is available via Basic or Advanced BLAST at www.ncbi.nlm.nih.gov/BLAST/. —SF, DW
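
The ordering used by the Organism Report (group by species, sort each block by score, then order the blocks by their best hit) can be sketched in a few lines of Python. The input format of `(species, accession, score)` tuples, the function name `organism_report`, and the accessions in the usage note are assumptions made for illustration; they are not the format of actual BLAST output.

```python
from collections import defaultdict


def organism_report(hits):
    """Arrange hits in the order described for the Organism Report.

    hits: iterable of (species, accession, score) tuples.
    Returns a list of (species, [(accession, score), ...]) blocks.
    """
    blocks = defaultdict(list)
    for species, accession, score in hits:
        blocks[species].append((accession, score))

    # Within each species block, sort records by descending score.
    for records in blocks.values():
        records.sort(key=lambda record: record[1], reverse=True)

    # Order the blocks by the score of their best (first) record.
    return sorted(blocks.items(),
                  key=lambda block: block[1][0][1],
                  reverse=True)
```

With made-up hits scoring 300 for a mouse record, 250 and 500 for two human records, and 400 for a rat record, the human block comes first (best score 500), followed by rat and then mouse, and the 500-score record leads the human block.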
