NCBI News

National Center for Biotechnology Information

National Library of Medicine National Institutes of Health Winter 2000

New Genomes Views: A Fresh Look at Human

A set of new Entrez Genomes display formats provides an improved vantage point on the human chromosomal sequences in GenBank, facilitating convenient access to finished contig data and more supplementary information than ever before. Figure 1 shows a partial Entrez Genomes view of human chromosome 22, illustrating many of the features highlighted here.

For each chromosome, the new Entrez Genomes views now provide:

A graphic depicting sequencing progress with finished regions shown in red: Clicking on a finished region leads to the corresponding contig data.
Sequencing progress statistics: These include euchromatic size, amount of finished sequence given in kilobases or as a percentage of the chromosome, and the number of contigs available for the chromosome.
Links to contig lists sorted by size and position.
Links to various STS maps.
Links to the coordinating chromosome sequencing center: Other sequencing centers that are working on the chromosome are listed as well.
Links to the Mitelman summary of aberrations associated with cancers.
A selection of diseases that map to the chromosome: Links are also given to the Genes and Disease Web site, OMIM, and OMIM Morbid Map.
Links to references giving information on the chromosomal sequence: For example, the Entrez Genomes view for chromosome 22 gives a link to the full text of the paper by Dunham et al. (Nature (1999) 402, 489-95) that reports the completion of chromosome 22 sequencing.

To see the human chromosome views, go to www.ncbi.nlm.nih.gov/genome/guide.
-- RM, DW

Figure 1: Entrez Genomes View of Human Chromosome 22. [Reprinted by permission from Nature 402(6761):489-95, copyright 1999 Macmillan Magazines Ltd.]

In this issue

Entrez Genomes
IgBLAST
BLAST 1.4
PubMed Central
News Briefs
Mitochondria Energize RefSeq
PSI-BLAST Profiles
Frequently Asked Questions
Textbooks Linked to PubMed
BankIt 3.0
Mouse and Rat in LocusLink
BLAST Lab
Malaria Menace Mapped

IgBLAST: An Immunoglobulin-Specific Search Tool

Immunoglobulin BLAST (IgBLAST), a specialized variant of BLAST that is designed for the analysis of human or mouse immunoglobulin sequence data, is now available online from NCBI. IgBLAST performs blastn or blastp searches of a non-redundant (nr) database of germline V genes and returns the best three V-gene matches, and the best two D-gene and J-gene matches. IgBLAST also annotates the query sequence, based on the domain information drawn from an alignment between the top-matched germline V gene and the query sequence, using the nomenclature of the Kabat Database of Sequences of Proteins of Immunological Interest. Such features as framework regions (FWR1, for instance) or complementarity-determining regions (CDR1, for instance) are delineated.

IgBLAST may also be used to search the standard BLAST nr database. In this case, the best matches to the germline gene database are aligned with the query as before, followed by the best hits to the nr database. One additional feature of IgBLAST is its ability to flag the returned matches from the nr database according to their germline V-origins. This allows one to distinguish easily between the results that use the same germline V-gene as the query and those that do not.

Although IgBLAST also supports searches using blastp, only V-gene matches are reported for such protein searches. D-gene and J-gene matches are reported only for DNA searches with blastn.

The germline V-gene database currently contains Igh, Ig kappa, Ig lambda, and D and J genes from both human and mouse. The database can be viewed in the form of an annotated multiple sequence alignment by clicking on the appropriate database link in the sidebar of the IgBLAST page.

Try IgBLAST at www.ncbi.nlm.nih.gov/igblast.
-- DW, JY

Old BLAST 1.4 Network Server Retired

The old BLAST 1.4 network server, which accounted for less than 3% of all BLAST jobs performed at NCBI, was retired on March 1, 2000. It was replaced by BLAST 2.0, which was designed to handle the increasing size and complexity of the sequence databases.

The following NCBI programs used BLAST 1.4 and will no longer function: blastcl2, blastcli, and PowerBlast outside of Sequin. Blastcl3, available at ftp://ncbi.nlm.nih.gov/blast/network/netblast, should be used instead. MacVector versions prior to 6.5.3 are also affected by this change.

This change will not affect the NCBI BLAST Web pages, e-mail server, GCG connection, blastcl3 client, PowerBlast within Sequin, or MacVector at release 6.5.3 and higher.

NCBI News

NCBI News is distributed four times a year. We welcome communication from users of NCBI databases and software and invite suggestions for articles in future issues. Send correspondence to NCBI News at the address below.

NCBI News
National Library of Medicine
Bldg. 38A, Room 8N-803
8600 Rockville Pike
Bethesda, MD 20894
Phone: (301) 496-2475
Fax: (301) 480-9241
E-mail: [email protected]

Editors: Dennis Benson, Barbara Rapp
Contributors: Renata McCarthy, Jo McEntyre, Margaret McGhee, Liz Pope, Jian Ye
Writer: David Wheeler
Editing, Graphics, and Production: Marla Fogelman, Veronica Johnson, Jennifer Vyskocil
Design Consultant: Gary Mosteller

In 1988, Congress established the National Center for Biotechnology Information as part of the National Library of Medicine; its charge is to create information systems for molecular biology and genetics data and perform research in computational molecular biology.

The contents of this newsletter may be reprinted without permission. The mention of trade names, commercial products, or organizations does not imply endorsement by NCBI, NIH, or the U.S. Government.

NIH Publication No. 00-3272
ISSN 1060-8788
ISSN 1098-8408 (Online Version)

PubMed Central Archive Developed at NCBI

PubMed Central is a Web-based repository established at the National Institutes of Health (NIH) to provide barrier-free access to primary research reports in the life sciences. So named because of its natural integration with the existing PubMed retrieval system, PubMed Central will serve as a host for scientific publishers, professional societies, and other groups to archive, organize, and distribute their research articles at no cost to the user.

Content

The scope of PubMed Central includes the broad life sciences, encompassing plant and agricultural research as well as biology and medicine. Participating journals and editorial groups may submit peer-reviewed reports from journals, as well as screened, although not formally peer-reviewed, reports from recognized editorial boards.

Participants’ Roles

Contributing publishers, societies, and other editorial groups independent of the NIH have complete responsibility for their input. An international PubMed Central advisory committee establishes criteria for certifying groups that may submit material.

NIH’s responsibilities, to be carried out by NCBI, pertain to developing, maintaining, and providing access to the repository. This includes facilitating the submission of SGML-tagged content; developing technology for enhanced retrieval, presentation, and navigation; and archiving the content to guarantee accessibility in the future.

How to Access

PubMed Central began accepting research reports this year and is now accessible on the Web at www.pubmedcentral.nih.gov.

From the home page, select a title to see the table of contents of the most recent issue available. From there, select specific articles or click on the Archive box for back issues. Enhanced search capabilities are currently under development. Links to publishers’ sites are also included.

Two journals, Molecular Biology of the Cell and PNAS: The Proceedings of the National Academy of Sciences, are currently available. Journals currently in process include Biochemical Journal, Canadian Medical Association Journal, Frontiers in Bioscience, and five journals from BioMed Central. Many additional journals have expressed interest and are preparing content for submission.

How to Participate

Guidelines covering acceptance criteria, media formats, copyright, and data formats are available from the PubMed Central home page under the About PubMed Central link.

Organizations interested in depositing content are urged to contact us at [email protected].
-- MM

Selected Recent Publications by NCBI Staff

Boguski, MS. Biosequence exegesis. Science 286(5439):453–5, 1999.

Gerlach, VI, L Aravind, G Gotway, RA Schultz, EV Koonin, and EC Friedberg. Human and mouse homologs of Escherichia coli DinB (DNA polymerase IV), members of the UmuC/DinB superfamily. Proc Natl Acad Sci USA 96(21):11922–7, 1999.

Grishin, NV. Phosphatidylinositol phosphate kinase: a link between protein kinase and glutathione synthase folds. J Mol Biol 291(2):239–47, 1999.

Koonin, EV, L Aravind, K Hofmann, J Tschopp, and VM Dixit. Apoptosis. Searching for FLASH domains. Nature 401(6754):662–3, 1999.

Panchenko, A, A Marchler-Bauer, and SH Bryant. Threading with explicit models for evolutionary conservation of structure and sequence. Proteins Suppl 3:133–40, 1999.

Spouge, JL, A Marchler-Bauer, and S Bryant. The combinatorics and extreme value statistics of protein threading. Ann Combinatorics 3:81–93, 1999.

Su, X, MT Ferdig, Y Huang, CQ Huynh, A Liu, J You, JC Wootton, and TE Wellems. A genetic map and recombination parameters of the human malaria parasite Plasmodium falciparum. Science 286(5443):1351–3, 1999.

Tatusova, TA, I Karsch-Mizrachi, and JA Ostell. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics 15(7):536–43, 1999.

Wheelan, SJ, MS Boguski, L Duret, and W Makalowski. Human and nematode orthologs--lessons from the analysis of 1,800 human genes and the proteome of the Caenorhabditis elegans. Gene 238(1):163–70, 1999.

Wolfsberg, T, and I Makolowska. Web alert. Pattern formation and developmental mechanisms. Curr Opin Genet Dev 9(4):385–6, 1999.

Mitochondrial Genomes Energize RefSeq

NCBI now offers a collection of 145 eukaryotic organelle genome sequences as part of RefSeq. These include 123 complete mitochondrial sequences as well as 16 complete plastid sequences. The animal (metazoan) mitochondrial records are considered Reviewed; that is, they have been manually curated by the NCBI staff and include standardized gene, protein, and RNA names. Other mitochondrial and chloroplast genome records are still Provisional RefSeq entries and are therefore presented as found in the source GenBank records used to create them. Visit www.ncbi.nlm.nih.gov/PMGifs/Genomes/organelles.html for more information.

Search Database of PSI-BLAST Profiles with IMPALA

The NCBI BLAST group has developed IMPALA, software to match a protein sequence against a library of score matrices stored from PSI-BLAST. A standalone version of the IMPALA suite of programs is included within the standalone BLAST distributions found at ftp://ncbi.nlm.nih.gov/blast/executables/.

A Web-based IMPALA search is available on the BLOCKS server at the Fred Hutchinson Cancer Center (blocks.fhcrc.org/blocks/impala.html). Any protein can be searched against a library of score matrices derived from the BLOCKS database.

NewsBriefs

SNP Consortium Makes First dbSNP Contribution

The SNP Consortium, including the Wellcome Trust and 11 other pharmaceutical companies, has recently made its first contributions totaling 7,396 SNPs to NCBI’s database of single nucleotide polymorphisms (dbSNP). The SNP Consortium was formed to create a public repository of SNP information useful in the study of disease. For more information or to search dbSNP, see www.ncbi.nlm.nih.gov/SNP/.

GenBank Release 116 Posts Largest Increase

With 5.8 billion base pairs from more than 5.6 million sequences, the recent GenBank Release 116.0 outweighs version 115.0 by 1,151 megabase pairs (Mbp), posting the largest single-release increase ever by GenBank and retiring the previous record of 812 Mbp. Uncompressed, the Release 116.0 flatfiles require roughly 21,350 MB (sequence files only) or 23,300 MB (including the “index” files). Release 116 is now available for downloading via ftp at ftp://ncbi.nlm.nih.gov/genbank.

UniVec Vector Screening Database Available by FTP

The NCBI UniVec database, used by the VecScreen Web service for identifying DNA sequence segments that may be of vector origin, is now available for downloading at ftp://ncbi.nlm.nih.gov/pub/UniVec.

A second database, UniVec_Core, is also available. UniVec_Core is a subset of sequences from the full UniVec database, chosen to minimize the number of false positive hits.

Both UniVec and UniVec_Core sequences are in the FASTA format. They are suitable for processing by the formatdb program of the Standalone BLAST package (available at ftp://ncbi.nlm.nih.gov/blast/executables/), so that each database can then be searched locally by the blastall program.

VecScreen and further information regarding UniVec can be found at www.ncbi.nlm.nih.gov/VecScreen/.

NCBI on Exhibit

NCBI will be exhibiting at the meetings listed below. The conference list is also available from NCBI’s home page under About NCBI. For further information, contact NCBI at [email protected].

American Association for Cancer Research (AACR), San Francisco, California, April 1-5
HUGO Meeting, Vancouver, British Columbia, April 9-12
Immunology 2000, Seattle, Washington, May 12-16
American Society for Microbiology (ASM), Los Angeles, California, May 22-24
American Society for Biochemistry and Molecular Biology (ASBMB), Boston, Massachusetts, June 4-9
Special Libraries Association (SLA), Philadelphia, Pennsylvania, June 10-15
Endocrine Society (ENDO 2000), Toronto, Canada, June 21-24
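The UniVec brief says the FASTA files can be processed by formatdb and then searched locally with blastall. The following is a minimal Python sketch of how those two command lines might be assembled. It assumes that the downloaded file is saved locally as "UniVec", that the query is a nucleotide sequence, and that the options printed elsewhere in this issue (-i, -p F, -o T, -d, -p blastn, -o) apply here as well. The function names and the query and output file names are hypothetical.

```python
from typing import List


def formatdb_args(fasta_path: str) -> List[str]:
    """Build a formatdb command for a nucleotide FASTA file.

    "-p F" marks the input as nucleotide rather than protein; "-o T"
    simply mirrors the formatdb examples printed in this newsletter.
    """
    return ["formatdb", "-i", fasta_path, "-p", "F", "-o", "T"]


def blastall_args(query_path: str, database: str, output_path: str) -> List[str]:
    """Build a blastall command for a blastn search of one local database."""
    return [
        "blastall",
        "-i", query_path,
        "-d", database,
        "-p", "blastn",
        "-o", output_path,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; the tools may not be installed.
    print(" ".join(formatdb_args("UniVec")))
    print(" ".join(blastall_args("query.fa", "UniVec", "vector_hits.txt")))
```

If the standalone BLAST programs are installed, each argument list could be passed to subprocess.run; the sketch only prints them so that it stays runnable without the tools.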

Frequently Asked Questions

Q. How can I create a Definition line for my sequence submission automatically from within Sequin?
A. An easy way is to use the Annotate>Generate Definition Line menu item. This option creates Definition lines based on the source and feature annotations you have made to the record, and conforms to GenBank style guidelines. An existing sequence title (i.e., DEFINITION line) can be edited by double-clicking on it and making changes in the editing window that pops up.

Q. What is the new Entrez Genomes equivalent of the alphabetical list containing chromosome-specific markers that I used to see in the old Entrez Genomes views of human chromosomes?
A. Although the old alphabetical marker list is no longer available, any marker that used to be in an Entrez Genomes human chromosome view can now be found using the new STS searching page (www.ncbi.nlm.nih.gov/genome/sts/query.cgi?). The STS search page includes all markers from dbSTS and all the human maps used in Entrez Genomes and GeneMap ’99.

Q. There does not seem to be a “print” function in Cn3D. How can I print Cn3D images?
A. You can print a Cn3D image by first exporting the image as a GIF file using Cn3D’s File/Export/GIF function. You can then print this GIF image using most image-viewing programs.

Q. Where can I get a very basic summary of my GenBank data access options?
A. For a basic summary of data access routes, take a look at www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html.

Q. Is NCBI News available in PDF format?
A. Yes, in addition to HTML access to NCBI News, you can print a PDF version of the current and back issues at www.ncbi.nlm.nih.gov/About/newsletter.html.

Q. If I run a BLAST search against only the nr database, am I likely to miss anything important?
A. Yes. The BLAST htgs (High Throughput Genomic Sequence) database is excluded from nr and must be searched separately; the same is true of the BLAST EST and GSS databases. The Microbial Genomes: Finished and Unfinished link on the main BLAST page provides access to data on 68 finished and unfinished microbial genomes, which are also not contained in the nr database. Researchers interested in BLASTing against human contig data should access these data by using the Human Genome BLAST link from the main BLAST page, rather than searching nr.

PubMed Abstracts Linked to Textbook Information

In collaboration with book publishers, the National Center for Biotechnology Information (NCBI) is adapting textbooks for the Web and linking them to the PubMed literature database. These textbooks will provide background information to PubMed users for exploring concepts they encounter in PubMed citations.

The first textbook to be included online is Molecular Biology of the Cell, 3rd ed., 1994, by Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, and James D. Watson, published by Garland Publishing, Inc. Molecular Biology of the Cell is one of the most widely used undergraduate textbooks in molecular and cell biology. This textbook provides background information on central topics of molecular and cell biology. Books covering other topics and approaches to biology and medicine will appear in the future.

Linking the Textbooks with PubMed

Textbook information is accessed from the Books link shown on the abstract display of most PubMed records. Selecting this link causes the abstract to be redisplayed, with some words and phrases highlighted as links to corresponding textbook sections. Users may navigate within these sections, accessing text, figures, tables, and references. Rather than mirroring the print version of the textbooks, we provide logical units of content based on the book’s native organization of chapters, sections, and subsections. The size of a unit and its interconnection with other parts of the book depend on both the native organization of the book and the intention of the publisher.

From the textbook sections, references are also linked back to their PubMed abstracts. This gives the textbook reader a starting point to further explore the literature using PubMed’s Related Articles function. Because books usually contain established knowledge, their reference citations are often several years old; using the Related Articles link, readers can move forward in time to find articles that are similar to, but more recent than, those cited in the book.

For More Information

Authors, editors, and publishers interested in linking a book to PubMed should contact [email protected].

For more information on the textbook project and how to access book information, select the Books link from the NCBI Literature Databases home page at www.ncbi.nlm.nih.gov/Literature.
-- MM, BR, DW

BankIt 3.0 Offers New Validation Features

A new version of NCBI’s popular online sequence submission tool, BankIt, is now available. BankIt 3.0 is designed to allow for the validation and annotation of sequence data before submission to GenBank. BankIt 3.0 automatically checks sequences and their features to confirm that they are complete, biologically correct, and free of contaminating sequence, such as vector or linker sequence. The contamination check is made using VecScreen, available at www.ncbi.nlm.nih.gov/VecScreen/.

BankIt 3.0 also provides an expanded source organism modifier list and automatically translates all coding regions (CDS features) when either the CDS nucleotide interval or the resulting protein sequence is provided by the submitter. Try the new BankIt 3.0 at www.ncbi.nlm.nih.gov/BankIt/.

Mouse and Rat Now in LocusLink

LocusLink has now been expanded to include mouse and rat genes, in addition to human, containing records for more than 13,946 human loci, 13,014 mouse loci, and 2,268 rat loci. LocusLink provides a gateway of access points to information on gene maps, phenotypes, nomenclature, reference sequences, and related resources. Genome locus seekers should visit www.ncbi.nlm.nih.gov/LocusLink.

How to Search Huge Local Databases

The amount of public sequence data is growing at an exponential rate and is likely to continue to do so for the foreseeable future. With this data growth comes the problem of transferring, formatting, and searching gigabase-scale databases. Given current data transfer rates and computational resources, including CPU speeds and memory configurations, a “divide and conquer” approach has been implemented by NCBI in the current version of standalone BLAST. Features of standalone BLAST (blastall) and formatdb allow one to create and search arrays of smaller databases rather than having to search a single huge database. This allows efficient searches of databases with effective sizes far in excess of the RAM available on most small computer systems.

Standalone BLAST is able to search several databases sequentially with a single query using a syntax such as

    blastall -i infile -d "part1 part2 part3" -p blastn -o out

In this case, the databases “part1”, “part2”, and “part3” have been created in the usual manner using formatdb with a syntax such as

    formatdb -i part1 -o T -p F

The ability to name multiple databases in the blastall command line gives the user the flexibility to search an arbitrary group of databases that may be derived either from the division of a single huge source database or from several separate source databases. However, since each database must be formatted in a separate step, this process may become cumbersome if many databases are to be created.

A recent feature of formatdb streamlines the formatting process by creating several smaller database “volumes” automatically from a single huge source file. Furthermore, searches of these volumes are performed without explicitly naming each volume on the blastall command line. To create a set of database volumes from a single source file, with a filename of “huge”, use formatdb with a syntax such as

    formatdb -i huge -o T -p F -v 1000000000

This command line will create a number of database “volumes,” each containing one billion base pairs or fewer, as specified by the “-v” option, from the source database file. The volumes will have names consisting of the root database followed by a two-digit volume extension, followed by the usual BLAST database extensions. These smaller databases can be searched as if they were a single entity using

    blastall -i infile -d huge -p blastn -o out

In this case, BLAST recognizes that the database “huge” has been partitioned into several volumes because it detects a file with the name of the root database followed by an extension of “nal” (for protein databases, the extension is “pal”). This file specifies a database list to be searched when the root database name is specified to BLAST. BLAST sequentially searches each database listed in this “nal” file and generates output that is indistinguishable from that of a single database search. A sample “nal” file, resulting from formatting the datafile “huge” into three volumes, is given below.

    #
    # Alias file created Tue Jan 18 13:12:24 2000
    #
    #
    TITLE huge
    #
    DBLIST huge.00 huge.01 huge.02
    #
    #GILIST
    #
    #OIDLIST
    #

The “DBLIST” line can also be edited to specify additional databases to be searched.

The “nal” and “pal” files can also be used to simplify searches of multiple databases created separately as in the first example. For instance, a file called “multi.nal” containing the following lines could be created from scratch using a text editor.

    #
    # Alias file created Tue Jan 18 13:12:24 2000
    #
    #
    TITLE multi
    #
    DBLIST part1 part2 part3
    #
    #GILIST
    #
    #OIDLIST
    #

The “multi.nal” file would allow the three databases, “part1”, “part2”, and “part3”, to be searched by specifying a single database name, “multi”, on the blastall command line as follows:

    blastall -i infile -d multi -p blastn -o out

The BLAST Lab feature is intended to provide detailed technical information on some of the more specialized uses of the BLAST family of programs. Topics are selected from the range of questions received by the BLAST Help Group.
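The alias files described in this article are plain text, so they can be generated by a script as easily as typed in a text editor. The following is a minimal Python sketch of that idea. It writes only the TITLE and DBLIST lines shown in the sample files, and it assumes that the commented-out GILIST and OIDLIST entries can be omitted. All function names are hypothetical and are not part of the BLAST package.

```python
from pathlib import Path
from typing import List


def volume_names(root: str, count: int) -> List[str]:
    """Return volume names such as huge.00, huge.01 for `count` volumes."""
    return [f"{root}.{index:02d}" for index in range(count)]


def alias_file_name(name: str, protein: bool = False) -> str:
    """Alias files end in "nal" for nucleotide and "pal" for protein databases."""
    return f"{name}.{'pal' if protein else 'nal'}"


def alias_file_text(title: str, databases: List[str]) -> str:
    """Build alias-file text holding a TITLE line and a DBLIST line."""
    if not databases:
        raise ValueError("at least one database name is required")
    lines = [
        "#",
        f"TITLE {title}",
        "#",
        "DBLIST " + " ".join(databases),
        "#",
    ]
    return "\n".join(lines) + "\n"


def write_alias_file(directory: Path, name: str, databases: List[str],
                     protein: bool = False) -> Path:
    """Write the alias file into `directory` and return its path."""
    path = directory / alias_file_name(name, protein)
    path.write_text(alias_file_text(name, databases))
    return path


if __name__ == "__main__":
    # Show the text that would be written for the "multi" example.
    print(alias_file_text("multi", ["part1", "part2", "part3"]), end="")
```

Calling write_alias_file(Path("."), "multi", ["part1", "part2", "part3"]) would create "multi.nal" in the current directory, after which the database name "multi" could be given to blastall as described above.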

Malaria Menace Mapped

Biologists at the National Institute of Allergy and Infectious Diseases, in collaboration with researchers at NCBI, have produced a genetic map of the human malaria parasite, Plasmodium falciparum. Maps, markers, and recombination data for the linkage groups corresponding to the 14 P. falciparum nuclear chromosomes are available at NCBI’s Malaria Genetics & Genomics page. Figure 1 illustrates the map for P. falciparum linkage group 1, which shows the relative position of several STS markers. Clicking on one of the STS markers leads to the appropriate GenBank sequence record. Crossover counts, crossover locations, and genotype segregation proportions for each of the linkage groups are also available.

To access these mapping data or download sequence data for each of P. falciparum’s 14 chromosomes, visit www.ncbi.nlm.nih.gov/Malaria/.
-- DW

Figure 1: Map for P. falciparum linkage group 1.
