
NCBI News

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services. Spring 2004

Transitioning from LocusLink to Entrez Gene

A gene-based view of annotated genomes is essential to capitalize on the increase in the sequencing and analysis of model genomes. The Entrez Gene database has been developed to supply key connections between maps, sequences, expression profiles, structure, function, homology data, and the scientific literature. Unique identifiers are assigned to genes with defining sequence, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are tracked, and functional information is added when available. Access Entrez Gene from the Entrez Home Page or directly at:

www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene

The Entrez Gene help document provides tips to ease the transition for LocusLink users to the current Entrez Gene database.

The default display format for Entrez Gene is the graphics display shown in Figure 1 for BMP7, which resembles the traditional view of a LocusLink record. The array of colored boxes at the head of LocusLink reports that provide links to gene-related resources is replaced by the “Links” menu in Gene, which includes additional links, such as those to Books, GEO, and UniSTS.

The Gene Transcripts and Products section is provided when a gene has been annotated on a genomic Reference Sequence

continued on page 6

Figure 1. Entrez Gene display for human BMP7 showing links to over 20 related resources in the “Links” pulldown menu.

Cancer Chromosomes: a New Entrez Database

Three databases, the NCI/NCBI SKY (Spectral Karyotyping)/M-FISH (Multiplex-FISH) and CGH (Comparative Genomic Hybridization) Database, the NCI Mitelman Database of Chromosome Aberrations in Cancer, and the NCI Recurrent Chromosome Aberrations in Cancer databases are now integrated into NCBI’s Entrez system as the “Cancer Chromosomes” database. Cancer Chromosomes supports searches for cytogenetic, clinical, or reference information using the flexible Entrez search and retrieval sys-

continued on page 3

In this issue

1 Transitioning from LocusLink to Entrez Gene
1 Cancer Chromosomes: a New Entrez Database
2 HomoloGene
4 BLAST Link (BLink)
5 Debut of HCT Database
7 350Kb Sequence Length Limit Removed
7 New Eukaryotic Genomes in Map Viewer
8 Environmental Samples from the Sargasso Sea
8 HIV Interaction Database
9 Perform Reverse ePCR
9 New Organisms in UniGene
9 Rat Gets NP_999999
10 RefSeq Release 6
10 Entrez Tools new “Hotspot”
11 BLAST Lab
12 Entrez Quiz

HomoloGene: An Entrez Database with a New Look

HomoloGene is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes. The genomes represented in the recent Build 36 of HomoloGene include H. sapiens, M. musculus, R. norvegicus, D. melanogaster, A. gambiae, C. elegans, S. pombe, S. cerevisiae, N. crassa, M. grisea, A. thaliana, and P. falciparum.

NCBI has adopted a new HomoloGene build procedure which is guided by the taxonomic tree, relies on conserved gene order and measures of DNA similarity among closely related species, while making use of protein similarity for more distantly related organisms. The new computational procedure greatly increases the reliability of the computed homologous gene sets, and the resulting HomoloGene entries now include paralogs in addition to orthologs. For more details or to search the database, see the HomoloGene home page at:

www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=

New Search Strategies Supported

Because HomoloGene is now an Entrez database, it can be queried using an assortment of fielded terms combined with boolean operators. Among the fields unique to HomoloGene is the “Ancestor” field, which refers to the taxonomic group of the last common ancestor of the species represented in a HomoloGene entry. Using the “Ancestor” field it is possible to limit a search to genes conserved in one of 9 ancestral groups: Sordariomycetes (147,550 entries), Eukaryota (2,759 entries), Fungi/Metazoa (33,154 entries), Bilateria (33,213 entries), Coelomata (33,316 entries), Mammalia (9,172 entries), [...] (1,083 entries), Insecta (1,689 entries), Rodentia (1,587 entries).

New Views of the Data

HomoloGene reports include phenotype information drawn from Online Mendelian Inheritance in Man (OMIM), Mouse Genome Informatics (MGI), Zebrafish Information Network (ZFIN), Saccharomyces Genome Database (SGD), Clusters of Orthologous Groups (COG), and FlyBase. A “Pairwise Scores” display gives a table of pairwise statistics for members of a HomoloGene group that includes percent amino acid and nucleotide identities, the Jukes-Cantor genetic distance parameter, D, the ratio of non-synonymous to synonymous amino acid substitutions (Ka/Ks), and the ratio of nucleotide identities within non-coding regions of the transcript to those within coding regions (Knr/Knc).

—DW

NCBI News

NCBI News is distributed four times a year. We welcome communication from users of NCBI databases and software and invite suggestions for articles in future issues. Send correspondence to NCBI News at the address below. To subscribe to NCBI News, send your name and address to either the street or E-mail address below.

NCBI News
National Library of Medicine
Bldg. 38A, Room 3S-308
8600 Rockville Pike
Bethesda, MD 20894
Phone: (301) 496-2475
Fax: (301) 480-9241
E-mail: [email protected]

Editors: Dennis Benson, David Wheeler
Contributors: Susan Dombrowski, Scott McGinnis, Tao Tao
Writers: Vyvy Pham, David Wheeler
Editing and Production: Robert Yates
Graphic Design: Robert Yates

In 1988, Congress established the National Center for Biotechnology Information as part of the National Library of Medicine; its charge is to create information systems for molecular biology and genetics data and perform research in computational molecular biology.

The contents of this newsletter may be reprinted without permission. The mention of trade names, commercial products, or organizations does not imply endorsement by NCBI, NIH, or the U.S. Government.

NIH Publication No. 04-3272
ISSN 1060-8788
ISSN 1098-8408 (Online Version)

New HomoloGene FTP File Formats

The HomoloGene data is available by FTP, where the data for each build is contained in two files: "homologene.data" and "homologene.xml.gz". Follow the "FTP site" link in the sidebar on the HomoloGene home page to download the files.

homologene.data

homologene.data is a tab delimited file containing, from left to right:
•HomoloGene group id
•Taxonomy ID
•gene ID
•gene symbol
•geninfo identifier (gi) of the protein product of the gene
•accession number of the protein product of the gene

homologene.xml.gz

homologene.xml.gz is a compressed file that contains a complete XML version of the HomoloGene build and includes the information available on the public webpage. The HomoloGene XML DTD is available in the archive "homologene.dtd.tar" at the top level of the ftp site.

The old HomoloGene FTP files of the formats used in "hmlg.ftp" and "hmlg.trip.ftp" will be discontinued after a transition period. During the transition, a new set of codes, reflecting the new build procedure, will be used in these files to indicate the nature of the evidence for homology:
•b - reciprocal best
•B - reciprocal best in a self-consistent triplet
•m - similarity between sequences that do not give reciprocal best hits
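The six-column layout of homologene.data listed above lends itself to a short reader. The following is a minimal sketch in Python, assuming one record per line with exactly six tab-separated fields in the order given; the class name, function name, and the sample rows are hypothetical placeholders (only the taxonomy IDs 9606 and 10090, for human and mouse, are real), and the real file may differ in details such as header lines.

```python
import csv
import io
from collections import defaultdict
from typing import NamedTuple


class HomoloGeneRow(NamedTuple):
    """One line of homologene.data, fields in left-to-right order."""
    group_id: int
    tax_id: int
    gene_id: int
    symbol: str
    protein_gi: int
    protein_acc: str


def parse_homologene(handle):
    """Group rows of a homologene.data-style file by HomoloGene group id."""
    groups = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 6:
            continue  # assumption: skip anything that is not a six-field record
        hid, tax, gene, sym, gi, acc = fields
        row = HomoloGeneRow(int(hid), int(tax), int(gene), sym, int(gi), acc)
        groups[row.group_id].append(row)
    return dict(groups)


# Placeholder rows for illustration only; these are not real HomoloGene entries.
sample = (
    "1\t9606\t101\tGENEA\t1001\tNP_000001.1\n"
    "1\t10090\t202\tGenea\t1002\tNP_000002.1\n"
    "2\t9606\t303\tGENEB\t1003\tNP_000003.1\n"
)
groups = parse_homologene(io.StringIO(sample))
for group_id, rows in sorted(groups.items()):
    print(group_id, [(r.tax_id, r.symbol) for r in rows])
```

Grouping by the first column reflects the fact that each HomoloGene group spans several organisms, so a per-group list is usually the more convenient shape than a flat table.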

Cancer Chromosomes
continued from page 1

tem. Search tips are provided in the Help document at:

www.ncbi.nlm.nih.gov/entrez/query/SkyCgh/help.html

Search “Cancer Chromosomes” from the database pulldown menu on the NCBI home page or navigate to the “Cancer Chromosomes” page for advanced searches via the link on the Entrez home page at:

www.ncbi.nlm.nih.gov/Entrez

Three search formats are offered on the Entrez Cancer Chromosomes home page: a conventional Entrez Query, a Quick/Simple Search, and an Advanced Search. The Entrez Query is performed using the search box at the top of the page, and, as with other Entrez databases, searches may be combined using term limits and Boolean expressions. The Simple Search, available via a link in the sidebar on the Cancer Chromosomes page, offers a set of menus from which one may select search terms to indicate a disease site or diagnosis. These terms can be combined with specifications for a particular chromosomal location and anomaly. The Advanced Search form, also linked from the sidebar, is arranged similarly. This form contains three main sections, labeled Cytogenetic, Clinical, and Reference, which offer a combination of forms and menus of search terms to help in the construction of complex queries. Diagnostic terms vary among databases; the SKY/M-FISH database uses ICD-O-3 terms, whereas the Mitelman and Recurrent databases use a different system. The menus include all ICD-O-3 terms entered into the database to date and all terms used in the Mitelman and Recurrent databases. Descriptions of the sections and terms indexed are given in the Help document.

Searches based on case information, such as diagnosis and disease site, return a “case-based report” that lists all cases matching the query terms. Searches based on underlying cytogenetic features are displayed as a “clone/cell report” in which each clone or cell-line is listed separately.

The number of cases or cells/clones found for each search is displayed at the top of the results page, broken down into three totals: the total matches found in the SKY/M-FISH & CGH Database, the total matches found in the Mitelman database, and the total matches from the Mitelman Recurrent Database.

From the results list, users can access the pull-down menu and display a variety of features, including the corresponding literature from PubMed, the results as a list of UI (unique identifier) numbers, or view related reports based on common cytogenetic or diagnostic features. Users can also view Similarity reports, which show terms common to a group of records within several term categories such as diagnosis/site and cytogenetic abnormalities (including CGH) among the selected cases or clones/cells. Term co-occurrences are listed at several levels: common to all cases, common to 50%-90% of cases, and common to less than 50% of cases. The common term or abnormality is shown in the left column and the number of affected cases is shown in the right column. The cytogenetic abnormalities are shown at all levels of resolution. Select ‘Similarity Report (High Resolution Only)’ to see similarities at a high level of resolution such as chromosome band.

The results of a sample query for records dealing with breakpoints of the chromosomal band 8q are shown in Figure 1.

Links in the search results summary lead to full reports such as the Case report #1437 shown in Figure 2. Display buttons provide access to additional views of the data such as chromosomal diagrams or a table view. Users can also access the case details or link to related resources from the original search summary by way of the Entrez Links menu.

—VP

Figure 1. Results of an Entrez Cancer Chromosomes search for records using the query "8q". The number of cases from each database is given at the top of the result summary. Clicking on the Details link shows how the query was interpreted by Entrez.

Figure 2. Detailed display of Case report #1437, with written karyotype, case summary, graphic display of colored ideogram, and chromosome abnormalities in tabular format.
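The tiered term listing of a Similarity report, described above, can be illustrated with a small counting exercise. This is a sketch only, not NCBI's implementation: the function name, the case labels, and the terms are hypothetical, and because the article does not say how terms shared by more than 90% but fewer than all cases are reported, the middle tier here is assumed to cover everything from 50% up to (but not including) all cases.

```python
from collections import Counter


def similarity_levels(cases):
    """Bin each term by the share of selected cases in which it occurs.

    `cases` maps a case label to the set of terms indexed for that case.
    Returns a dict of tier name -> list of (term, number of cases).
    """
    n = len(cases)
    counts = Counter(term for terms in cases.values() for term in set(terms))
    levels = {"all": [], "half_or_more": [], "under_half": []}
    for term, k in sorted(counts.items()):
        if k == n:
            tier = "all"
        elif k / n >= 0.5:
            tier = "half_or_more"  # assumed boundary, see the note above
        else:
            tier = "under_half"
        levels[tier].append((term, k))
    return levels


# Hypothetical cases and terms, for illustration only.
cases = {
    "case_A": {"8q gain", "breast", "1p loss"},
    "case_B": {"8q gain", "breast"},
    "case_C": {"8q gain", "colon"},
    "case_D": {"8q gain", "breast", "17p loss"},
}
levels = similarity_levels(cases)
for tier, rows in levels.items():
    print(tier, rows)
```

As in the report itself, each row pairs the shared term with the number of cases that carry it.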

BLAST Link (BLink) to Protein Alignments and Structures

Pre-computed sequence alignments, generated from routine all-against-all BLAST comparisons performed at NCBI, are available for each protein record in Entrez. The best 200 of these alignments can be displayed by clicking on the “BLink” link in the upper right-hand corner of Entrez protein reports. The BLink report for human MLH1 is shown in Figure 1. The report begins with a description of the query sequence, the sequence IDs of other entries in Entrez with identical sequences, and a set of controls, described below, used to customize the display. The alignments presented in the lower section of the report are depicted graphically and color-coded on the basis of the taxonomic origin of the aligned sequence. Each alignment is followed by its BLAST score, linked to a detailed alignment view; the accession number of the aligned sequence, linked to Entrez; and the GI number of the aligned sequence, linked to its own BLink report.

Customizing the Display

BLink reports can be customized using a combination of format buttons, taxonomic restriction controls, and a source database pulldown menu. Taxonomic and source database restrictions take effect when the “Select” button is pressed. Alignments may be sorted on the basis of sequence similarity to the query, the default, or by the taxonomic proximity of the source organisms using a link in the formatting area.

Six format buttons are used to select the display mode of the BLink report. The ‘Best Hits’ button displays a single line for each organism represented in the results, showing the alignment of the best hit in the organism group and a link, in the “N” column, to a BLink report limited to the group. A portion of a “Best Hits” display is shown in Figure 2.

The “Common Tree” button allows for the selective display of alignments to sequences arising from specific branches of the taxonomic tree. The related “Taxonomy Report” button lists the BLink results as a BLAST Taxonomy Report.

To limit the view to those sequences derived from structure records, press the “3-D Structures” button. In the “3-D Structures” display, shown in Figure 3, the dots are links to Conserved Domain and Cn3D structure displays.

The “CDD Search” button does not format the BLink report, but instead links to a precomputed conserved domain display for the query sequence.

The “GI” button links to the Entrez display of the protein sequences whose alignments are shown in the BLink report.

Taxonomic and Database Restrictions

Sequences derived from a taxonomic group may be selectively removed from the display by clicking on any of the color-coded taxon links; an “X” across the sequence count for the group indicates that the group will be removed from the display when the “Select” button is pressed. The BLink report can also be customized using the “Keep only” pulldown menu to limit the display to entries included in databases such as RefSeq, Protein Data Bank, SwissProt, COGs, and NCBI Complete Genomes.

Use BLink for Quick Insights into Protein Function

Because BLink reports are pre-computed, it is possible to rapidly view a BLAST alignment without having to generate it. The graphical display of the aligned sequences provides a clear view of the distribution of conserved sequence blocks across taxonomic groups as an aid to understanding evolutionary and functional relationships. Added insight into protein function is provided by the CDD display of multiple sequence alignments for functional domains, allowing one to evaluate position-specific sequence conservation in the context of the biological function of the query protein. The “3-D Structures” display provides a quick way to determine the availability of 3D structures to serve as modeling templates for further study.

—TT

Figure 1. BLink report for the human MLH1 protein. Format controls are located at the top of the report with alignments, colored by the taxonomic origin of the sequence match, given in the lower section.

Figure 2. "Best Hits" BLink report format indicating two alignments to a sequence from a synthetic construct and six alignments to sequences from Homo sapiens. A graphical depiction of the best alignment in each organism group is shown on the left.

Figure 3. BLink report limited to sequences derived from structures using the "3-D Structures" button.
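The "Best Hits" view described above reduces a list of alignments to one line per organism. A minimal sketch of that reduction follows, assuming each alignment is available as a record with an organism, an accession, and a BLAST score; the record layout, the function name, and the sample accessions and scores are hypothetical and are not drawn from an actual BLink report.

```python
def best_hits(alignments):
    """Collapse alignments to one row per organism, keeping the top score.

    Each row also carries the number of alignments in that organism group,
    loosely mirroring the per-group count shown in the "N" column.
    """
    best = {}
    counts = {}
    for aln in alignments:
        group = aln["organism"]
        counts[group] = counts.get(group, 0) + 1
        if group not in best or aln["score"] > best[group]["score"]:
            best[group] = aln
    rows = [
        {
            "organism": group,
            "n": counts[group],
            "best_accession": hit["accession"],
            "best_score": hit["score"],
        }
        for group, hit in best.items()
    ]
    # Highest-scoring group first, as in a similarity-sorted report.
    rows.sort(key=lambda row: row["best_score"], reverse=True)
    return rows


# Placeholder alignments for illustration only.
alignments = [
    {"organism": "Homo sapiens", "accession": "ACC_1", "score": 3900},
    {"organism": "Homo sapiens", "accession": "ACC_2", "score": 3100},
    {"organism": "Mus musculus", "accession": "ACC_3", "score": 3400},
    {"organism": "synthetic construct", "accession": "ACC_4", "score": 3950},
    {"organism": "synthetic construct", "accession": "ACC_5", "score": 2000},
]
rows = best_hits(alignments)
for row in rows:
    print(row)
```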

Debut of the HCT Database and Anthropology/Allele Frequencies in dbMHC

The International Histocompatibility Working Group (IHWG) in Hematopoietic Cell Transplantation (HCT) is a worldwide scientific collaborative effort to support the use of information on the properties of the HLA barrier to allogeneic transplantation to improve the safety, efficacy and availability of HCT. The IHWG HCT studies are designed to determine whether complete allele matching for HLA-A, B, C, DRB1, DQB1 and DPB1 is necessary for successful transplantation. Data generated from the study are anticipated to offer new approaches to the selection of suitable donors for HCT.

The IHWG and the NCBI at the National Institutes of Health have collaborated to create a public database, dbMHC, to store genotype and clinical data, including up-to-date information on matching and transplant outcomes, and to provide online tools for data analysis. The new database contains anonymous data for selected unrelated donor transplants performed worldwide for the treatment of both malignant and non-malignant blood disorders. More information and a link to a list of contributors to dbMHC are found on the home page at:

www.ncbi.nih.gov/mhc/ihwg.fcgi?cmd=page&page=HCTintro

A powerful way to view the data in dbMHC is to compute Kaplan-Meier survival plots using the HCT online plotting tool. Many default parameters for the plots can be adjusted in order to customize the data displayed. Help and FAQ documents are available for most plots. As an example, consider the Kaplan-Meier survival plot shown in Figure 1. The plot uses tick marks for censored data and shading to indicate the confidence intervals. To produce the plot, the Advanced Query Form is initialized with default parameters and one curve is plotted which allows no mismatches at HLA A, B, C, DRB1 and DQB1 loci and ignores the HLA DPB1 locus. The parameters which affect the entire plot can be specified and a summary is displayed for each curve. A user may add a curve, modify a curve, delete a curve, view the plot, view plot data, view individual data, save the curve parameters, or restore saved parameters. When adding or modifying a curve, another form is displayed for entering the parameters for the curve, and upon completion the user will return to the original page to view the updated selections.

Figure 1. Kaplan-Meier survival plot generated on the web using analysis tools available from dbMHC. A summary of the parameters used to create the plot is displayed to the right. The plot reveals a significant effect on transplant survival times of a one allele mismatch.

Another important resource stemming from the efforts of the IHWG is the Anthropology/Allele Frequencies databank, created in an effort to determine HLA class I and class II allele and haplotype frequencies in various human populations. Studies of allelic diversity in different populations can shed light on HLA polymorphism as well as on the evolution and migration of human populations. In a clinical context, knowledge of the allele frequency distributions in various populations is critical to the strategy of establishing and searching bone marrow donor registries as well as for studies of HLA-associated disease susceptibility.

Users of the resource can choose to view allelic frequencies found in individuals from certain regions of the world, or view data submitted by a particular group. Users can also specify the loci to be displayed in the output table. Additional information about the project, with links to the project overview and data contributors, is found at:

www.ncbi.nlm.nih.gov/mhc/ihwg.fcgi?ID=9&cmd=PRJOV

The HCT and Anthropology/Allele Frequencies resources are available from the dbMHC home page via links from the blue sidebar menu. Questions and comments can be addressed to the NCBI Service Desk at:

[email protected]

—VP
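For readers unfamiliar with the Kaplan-Meier plots produced by the HCT plotting tool, the underlying estimate multiplies, at each time with at least one event, the running survival by (1 - events / number still at risk); censored observations leave the curve unchanged but reduce the number at risk afterwards. The sketch below implements that textbook product-limit estimate in Python. It is not the dbMHC code, the function name and sample data are hypothetical, and the confidence intervals shown as shading in the online plot are omitted.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with an event.

    `times` are follow-up times; `events` are 1 for an observed event
    and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every observation that shares this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up data: five subjects, two of them censored.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

With these five subjects the estimate steps down at times 2, 3 and 5; the censored subject at time 8 produces no step, which is why plots mark censored data with ticks rather than drops.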

LocusLink to Entrez Gene
continued from page 1

(RefSeq) and intron, exon, and coding region information is available with genomic coordinates. Each accession given in this section is a link to a menu allowing the display of the sequence in several formats. Protein accessions provide menu options to navigate to BLink, CDD, or COG displays. This section is equivalent to the RNA-Genomic alignment available from the graphic at the top of a LocusLink entry. In the case of the Gene record for BMP7, NC_000020 is the accession number of the genomic contig that contains the gene. Clicking on the “NC_000020” link brings up a menu used to select one of several displays of the contig within the genomic range of the gene BMP7.

The Entrez Gene record’s “General Gene Information” section summarizes information contained in LocusLink’s “Function”, “Relationships” and “Map Information” sections. This section includes several categories of information, such as Gene Ontology (GO), Homology, Phenotypes, Markers, Pathways and Relationships.

The remaining sections of an Entrez Gene record, “NCBI Reference Sequences”, “Related Sequences”, and “Additional Links”, are equivalent to the corresponding entries in the LocusLink report. The first section lists gene-specific NCBI RefSeqs, provides links to the appropriate Entrez sequence database, and gives descriptions of each transcript variant, the accession numbers of sequences used to support the RefSeqs, and a listing of conserved domains found in the encoded proteins. The “Related Sequences” section lists the nucleotide and protein accessions of sequences that are related to the gene, and provides links to the sequence records in Entrez. The “Additional Links” section provides a printable view of a subset of links to information both within and external to NCBI. Some of these links overlap those included in the Links menu. The intent of this section is to provide a printable report of, for example, MIM numbers, UniGene cluster numbers, and family-specific Web sites.

Table 1. LocusLink to Gene feature transition chart. The help documentation covers the conversion of the master LocusLink FTP file, “LL_tmpl”, to the Entrezgene.asn format. The Entrezgene.asn data will be available on the Gene FTP site in the near future.

Entrez Gene can be considered as the successor to LocusLink, but Gene improves on LocusLink by providing coverage of more NCBI reference genomes, by providing additional display formats, and by its integration with other databases within NCBI’s Entrez system. Users can query Gene via the powerful query features of Entrez, using Boolean operators, filters, and field limiters, such as accession number, gene name, protein name, disease/phenotype, and map location. Users can search for records in Entrez Gene using any of the search strategies in the shaded box.

Shaded box: sample search strategies

(human [organism] OR mouse [organism] OR rat [organism]) AND bmp7
human [organism] AND (bmp7 OR bmp3)
human [orgn] AND (bmp7 [title] OR bmp3 [title])

Like other Entrez databases, Gene offers a number of display formats beyond the default “Graphical” format. Additional formats include an XML format and a “Gene Table” view, providing access to the sequences of each of the gene’s exons and introns.

Entrez Gene is also accessible using the Entrez Programming Utilities (E-utilities), which provide access to Entrez from application programs and scripts.

Users interested in subscribing to email announcements of new Entrez Gene features are welcome to join the Gene-announce mailing list at:

www.ncbi.nlm.nih.gov/mailman/listinfo/gene-announce

—VP
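As a rough illustration of script access through the E-utilities, the sample search strategies from the shaded box can be turned into ESearch request URLs. The sketch below only builds the URLs and sends nothing. The endpoint address and the `db`, `term`, and `retmax` parameters are those of the present-day ESearch service and are an assumption here, since the article itself gives no E-utilities address; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: present-day ESearch endpoint, not taken from this newsletter.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(term, retmax=20):
    """Build an ESearch URL that would query the Gene database for `term`."""
    params = {"db": "gene", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# The three sample strategies from the shaded box.
queries = [
    "(human [organism] OR mouse [organism] OR rat [organism]) AND bmp7",
    "human [organism] AND (bmp7 OR bmp3)",
    "human [orgn] AND (bmp7 [title] OR bmp3 [title])",
]
urls = [gene_search_url(q) for q in queries]
for url in urls:
    print(url)
```

Percent-encoding the term matters because the brackets, parentheses and spaces used by Entrez field limiters are not safe to place in a URL as typed.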

350 kb Sequence Length Limit Removed by Sequence Database Collaboration

In 1995, the International Nucleotide Sequence Database Collaborators (GenBank, DDBJ, and EMBL) agreed to a 350 kb limit on the size of database sequence records in order to maintain compatibility with existing molecular biology software that was not able to work with large sequences.

At this time, a new GenBank division was created called "CON" for contig. The records in the CON division contain the instructions for the assembly of full-length contigs from the sequence data of multiple GenBank records. Although CON division records contain no sequence data, the assembly information they provide makes it possible for NCBI's Entrez search and retrieval system to show complete genomic sequences by dynamically assembling the data for display. Using the information in CON division records, NCBI also regularly creates FTP files for download that contain megabase-scale genomic sequences as single FASTA files.

By 1998, GenBank, DDBJ, and EMBL were routinely accepting submissions from large scale sequencing projects of draft sequences, such as phase 1 and phase 2 high-throughput genomic sequences (HTGS), that were longer than 350 kb. To avoid breaking a huge amount of draft sequence into 350 kb chunks, the database collaborators agreed to relax the 350 kb limit in these cases. The 350 kb limit was also relaxed for assemblies of Whole Genome Shotgun (WGS) project data and for large eukaryotic genes.

Removal of 350 kb Limit

In 2003, the Database Collaborators agreed to remove the 350 kb limit for all sequences as of June 2004, since the increased ability of molecular biology software to analyze long sequences quickly has rendered the limit on sequence length unnecessary. To help software developers prepare for the change, some sample records with large sequences have been made available for testing:

ftp.ncbi.nih.gov/genbank/LargeSeqs

An example of the effect of the removal of the 350 kb limit on GenBank records may be seen in the case of accession U00096, the Escherichia coli K-12 MG1655 complete genome sequence. Under the 350 kb limit, this accession number referred to a contig record giving a list of short sequences that can be assembled to create the complete genome. With the removal of the 350 kb limit, the accession now refers to the complete contiguous sequence for Escherichia coli K-12. The accessions for all 400 parts will appear as secondary accessions. The CON division will remain as a GenBank division to represent sequences which by their nature are assemblies, for example genome scaffold records.

The effect of the changes on the NCBI GenBank FTP files and the BLAST database files available for download is expected to be minimal. As sequences become secondary to primary records, the overall size of the databases should not change drastically. However, the number of megabase-sized records will increase; NCBI therefore recommends that software be tested with the example large sequence records mentioned above.

—SM
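Developers checking whether their own tools cope with records beyond the former limit may find it useful to first identify which records in a file exceed it. The following is a minimal sketch, assuming FASTA input; the function names and the two sample records are hypothetical, and the sample data is generated in memory rather than taken from the LargeSeqs directory.

```python
import io

LIMIT = 350_000  # the former per-record limit, in bases


def fasta_lengths(handle):
    """Yield (record_id, sequence_length) for each record in a FASTA stream."""
    name, length = None, 0
    for raw in handle:
        line = raw.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # The record id is the first whitespace-delimited token.
            name = (line[1:].split() or [""])[0]
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def wrap(seq, width=70):
    """Break a sequence into fixed-width lines, as FASTA files usually do."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


# Hypothetical records: one tiny, one a single base over the old limit.
sample = (
    ">short_rec hypothetical small record\n" + wrap("ACGT" * 3) + "\n"
    ">long_rec hypothetical large record\n" + wrap("ACGT" * 87_500 + "A") + "\n"
)
over_limit = [(n, l) for n, l in fasta_lengths(io.StringIO(sample)) if l > LIMIT]
print(over_limit)
```

Reading line by line and keeping only a running length means the check itself never holds a megabase-scale sequence in memory.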

New Eukaryotic Genomes at NCBI

NCBI has created new Map Viewer displays for several organisms, including the honey bee, cat, and the fungi Eremothecium gossypii and Encephalitozoon cuniculi. In addition, new genome guides are available for the honey bee, cat, chicken, frog and sea urchin.

The Map Viewer display for Apis mellifera, the honey bee, includes a variety of sequence-based maps, such as maps for contigs, UniGene clusters, genes, and transcripts, as well as the Solignac microsatellite-based linkage map. The current honey bee genome build, Amel_1.1, is a composite of whole genome shotgun sequence and BAC sequence from clones isolated by a clone-array pooled strategy.

The Apis mellifera whole genome shotgun (WGS) project has been assigned the project accession AADG00000000, and the Amel_1.1 assembly displayed and annotated in Map Viewer comprises accessions AADG02000001-AADG02030074.

The NCBI Map Viewer for Felis catus presents an RH map of the cat genome. The map contains 1,126 markers, including microsatellite-based markers and coding loci. An integrated genetic map is also included (1,2).

The sequencing, annotation, and analysis of the Eremothecium gossypii genome is described in Dietrich et al. (3). Its chromosomes have been assigned RefSeq accessions NC_005782 to NC_005789. The sequence-based maps (contig, gene, and transcript maps) are provided in the Map Viewer for this filamentous fungus.

The sequencing, annotation, and analysis of the Encephalitozoon cuniculi genome is described in Katinka et al. (4). Its chromosomes have been assigned RefSeq accessions NC_003242, and NC_003229 to NC_003238. The contig and gene maps are available in the Map Viewer for this microsporidium.

New Genome Guide pages, created by NCBI in cooperation with the genomic research communities to provide links to an array of genome-

continued on page 11

Environmental Samples Make Big Splash

The technology of Whole Genome Shotgun (WGS) sequencing is now being applied to quickly assemble large sets of genomic sequences taken from organisms inhabiting a particular ecological niche. Sequence data collected in this manner provides a snapshot of the genetic diversity existing at a particular locale and is especially important in providing data for organisms which are difficult or impossible to culture in the laboratory. Recently, The Institute for Biological Energy Alternatives sampled water from the Sargasso Sea, one of the most well-characterized regions of the world's oceans (1). The larger of two sets of samples collected produced over 1.3 gigabases of sequence in the form of 1.66 million WGS reads. These reads were assembled into contigs containing about 1 gigabase of non-redundant sequence. In addition, over 1 million protein sequences were derived from the annotation of open reading frames on the genomic sequences. The contigs constructed from the WGS reads and the remaining single reads have been deposited in the WGS division of GenBank, under the project accession number AACY01000000. Scaffolds assembled from these contigs are available within the accession ranges CH004436-CH004736 and CH004737-CH236877. The raw sequencing data is available in the Trace Archive.

The Sargasso Sea dataset, along with other environmental sample datasets, such as sequences from an acid mine drainage biofilm submitted by the DOE Joint Genome Institute (2), can be queried using the new “Environmental Samples” BLAST page at:

www.ncbi.nlm.nih.gov/BLAST/Genome/EnvirSamplesBlast.html

continued on page 12

New HIV Protein-Interaction Database

Documenting the interaction of human immunodeficiency virus type 1 (HIV-1) proteins with those of the host cell is crucial to our understanding of the processes of HIV-1 replication and pathogenesis. To meet this need, the Division of Acquired Immunodeficiency Syndrome (DAIDS) of the National Institute of Allergy and Infectious Diseases (NIAID), in collaboration with the Southern Research Institute and NCBI, has begun compiling a comprehensive “HIV Protein-Interaction Database” to provide a concise summary of documented interactions between HIV-1 proteins and host cell proteins, other HIV-1 proteins, or proteins from disease organisms associated with HIV or AIDS. For each documented protein-protein interaction the following information is collected, if available:
•Protein Reference Sequence accession numbers.
•Entrez Gene ID numbers.
•Amino acids from each protein that are known to be involved in the interaction.
•A brief description of the protein-protein interaction.
•Keywords to support searching for interactions.
•PubMed identification numbers for all journal articles describing the interaction.

The HIV Protein-Interaction Database may be searched through an online interface at:

www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/index.html

Clicking on a link for an HIV-1 protein in the “Reports and downloads” section of the page displays its interaction report.

Interaction reports for HIV-1 proteins are displayed in a 4-column format beginning from the left with the HIV-1 protein name, linked to the LocusLink report for the gene, and continuing with phrases taken from the literature indicating, in the second column, a type of interaction, and, in the third column, a description of the interaction partner. The fourth column displays links to LocusLink and Entrez Gene reports for the interaction partner.

Interaction reports can be filtered in a number of ways by phrases appearing in the reports using pull down phrase lists. Full or filtered reports may be downloaded as text in a tab-delimited format that includes fields for the name of the subject HIV-1 protein, its RefSeq accession number, the interaction phrase, the description of the interaction partner, the partner’s RefSeq accession number, the title of the publication reporting the interaction, and its PubMed identifier. For example, the first line in the tab-delimited file produced by filtering the interaction report for the “tat” protein by the interaction “upregulates” is shown in Figure 1.

Interaction reports are currently available for 7 of the 9 proteins produced by HIV-1: gag, pol, rev, tat, vif, vpr, and vpu. Reports for the remaining proteins, nef and env, will be completed soon. All protein-protein interactions documented in the HIV Protein-Interaction Database are listed in Entrez Gene reports in the “HIV-1 protein interactions” section.

—DW

Tat, p14 | NP_057853 | upregulates | B-cell lymphoma 6 protein | NP_001697 | HIV-1 Tat upregulates the expression of BCL-6 in Kaposi's sarcoma cells | 11994280

Figure 1. First line in the tab-delimited file produced by filtering the interaction report for the “tat” protein by the interaction “upregulates”.

e-PCR and Reverse e-PCR: Greater Sensitivity, More Options

Electronic PCR (e-PCR) is used to identify sequence landmarks called Sequence Tagged Sites (STSs) within a nucleotide sequence. Electronic PCR works by looking for matches to STS primer pairs with the orientation and spacing required to produce an amplicon of the expected size.

Two types of e-PCR can now be performed from the e-PCR home page: the original, Forward e-PCR, and a new application, Reverse e-PCR. Forward e-PCR is used to determine if a user-supplied nucleotide sequence contains any known STS. Queries are made against the markers in NCBI's UniSTS database, a public collection of PCR primer pairs used in mapping and other types of genome analysis for a wide range of organisms. UniSTS contains over 270,000 markers and includes data from the STS division of GenBank, The Radiation Hybrid Database, The Genome Database, Mouse Genome Informatics, the Rat Genome Database, Zebrafish Information Network and PubMed Central. If an STS is found using the online version of Forward e-PCR, the chromosomal location and a link to UniSTS are provided. Mapped markers can be displayed from UniSTS reports using the MapViewer and viewed in the context of the other available genomic data.

To improve the sensitivity of a Forward e-PCR search, there is now an option to search using discontiguous, or imperfect, matches between the query sequence and the STSs in UniSTS. To increase the probability of finding STSs that may have been missed with the contiguous word option, the size of the segment to be matched (called the word size), the number of allowable mismatches, and the number of permissible gaps can be adjusted. The size of the STS can also be adjusted in order to allow for deviations which may arise from the amplification of a region in the genome that shows length polymorphism.

Reverse e-PCR can be used to estimate the genomic binding site, amplicon size and specificity for user-supplied primer pairs. The Reverse e-PCR query page can accept up to 20 primer pairs or STS identifiers as input for searches against organism-specific databases. Primer pairs or STS identifiers can be searched against the genomic and transcript databases of Anopheles gambiae, , Caenorhabditis elegans, , Homo sapiens, Mus musculus and Rattus norvegicus.

Interested in seeing how the forward and reverse e-PCR tools work? Try your hand at some of the e-PCR examples that are provided on the two search pages found at:

www.ncbi.nlm.nih.gov/sutils/e-pcr

For those who need to perform large batches of searches or who need to search a custom database, a stand-alone version of e-PCR has been developed for the Windows, Linux and Unix operating systems. These binaries, along with the source code for compiling e-PCR on other operating systems, are available via FTP at:

ftp.ncbi.nlm.nih.gov/pub/schuler/e-PCR

—SD
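The idea behind e-PCR, matching a primer pair in the right orientation and at the right spacing, can be sketched in a few lines of Python. This is a simplified illustration, not the NCBI implementation: it assumes exact primer matches, searches only one strand of the query, and uses made-up function names, parameter names, and test sequences.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_all(text, pattern):
    """Return every start index of pattern in text, overlapping hits included."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def forward_epcr(sequence, left_primer, right_primer, expected_size, margin=0):
    """Report (start, end, size) for each placement of the primer pair that
    would give an amplicon within `margin` bases of `expected_size`.

    The left primer must match the sequence as written; the right primer
    must match as its reverse complement, downstream of the left primer.
    """
    sequence = sequence.upper()
    left = left_primer.upper()
    right_site = reverse_complement(right_primer)
    hits = []
    for i in find_all(sequence, left):
        for j in find_all(sequence, right_site):
            end = j + len(right_site)
            size = end - i
            # Require the right primer site to lie after the left primer.
            if j >= i + len(left) and abs(size - expected_size) <= margin:
                hits.append((i, end, size))
    return hits


# Made-up 27-base query containing one placement of a made-up primer pair.
QUERY = "GGGACGTACTTTTTTTTTTCCGGAAGG"
print(forward_epcr(QUERY, "ACGTAC", "TTCCGG", expected_size=22))
```

The `margin` parameter plays the role of the adjustable STS size described above: widening it tolerates amplicons that differ from the expected length, for example because of length polymorphism. Mismatches and gaps in the primer matches themselves, which the discontiguous option allows, are not modeled here.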

New Organisms in UniGene

UniGene now covers 45 and can be searched using the Entrez search system, where it is linked to nucleotide records. Recent additions to UniGene include:

Canis familiaris (dog) with 15,281 transcript sequences in 4,544 clusters
Helianthus annuus (sunflower) with 21,155 transcript sequences in 1,904 clusters
Salmo salar (Atlantic salmon) with 23,111 sequences in 1,050 clusters
Bombyx mori (domestic silkworm) with 60,481 sequences in 2,050 clusters
Apis mellifera (honey bee) with 14,386 transcripts in 5,190 clusters
Lotus corniculatus (birdsfoot trefoil) with 59,678 transcripts in 8,158 clusters
Physcomitrella patens (physcomitrella moss) with 83,948 transcripts in 6,950 clusters
Lactuca sativa (garden lettuce) with 53,347 transcripts in 9,656 clusters
Malus x domestica (apple) with 34,733 transcripts in 3,996 clusters
Hydra magnipapillata with 69,049 transcripts in 5,362 clusters
Populus tremula x Populus tremuloides (aspen) with 20,820 transcripts in 2,707 clusters
Ovis aries (sheep) with 2,852 transcripts in 1,164 clusters

RefSeq Accession Numbers Get Longer as Rat Gets Last 6-digit Accession

Rattus norvegicus has scurried to the end of the possibilities offered by the six-digit RefSeq accession format by making off with NCBI protein RefSeq accession NP_999999, required for its olfactory receptor Olr1386. To compensate, RefSeq accessions have now been extended to a length of 9 digits, e.g., NP_123456789. Preexisting accessions will not be changed, and accession numbers in both the old and new extended formats, such as NM_000100 and NM_01000000, will coexist.

Are you wondering which organism got the first 9-digit RefSeq protein accession? That would be the energetic and inquisitive Rattus norvegicus again, for another of its very important olfactory receptors!

Slots available for FieldGuidePlus Training Course Onsite at NCBI

The popularity of the free NCBI training course for life scientists, A Field Guide to NCBI Resources, continues to grow, with over 40 courses presented throughout the United States in the first half of 2004. Portions of the Field Guide have also been presented as more detailed modules on specific tools and databases, including a 3D structures course, a gene expression course, and a BLAST course. In addition to the 2-day Field Guide, NCBI offers several 2-hour problem-centered Mini-Courses. This Spring, the NCBI conducted the first enhanced Field Guide, the FGPlus, at the National Library of Medicine. The FGPlus combines the best of the standard Field Guide with the modular Field Guide courses and the Mini-Courses.

The FGPlus provides more detailed information, practical tips, and more extensive hands-on practice than the standard course. In addition, the FGPlus offers complete BLAST and structure modules, as well as a Mini-Course on disease genes. Following the success of the Spring edition, NCBI will again offer the FGPlus course on August 24 and 25, 2004 at the National Library of Medicine's Lister Hill auditorium. For more details and registration information please visit the FGPlus page:

www.ncbi.nlm.nih.gov/Class/FieldGuide/FGPlus

or write to Peter Cooper ([email protected])

RefSeq Release 6 on FTP Site

RefSeq Release 6 is now available by anonymous FTP at:

ftp.ncbi.nih.gov/refseq/release

Release 6 includes genomic, transcript, and protein sequences available as of July 5, 2004 from 2,467 organisms. The numbers of RefSeq accessions in Release 6 and their combined lengths are given in the box below.

            # Accessions    # Basepairs/Residues
Genomic           68,592           8,263,102,565
RNA              247,639             433,269,151
Protein        1,050,975             365,446,682

RefSeq releases are posted bimonthly and the next release is scheduled for September. Release notes documenting the scope and content of the release are provided at:

ftp.ncbi.nih.gov/refseq/release/release-notes

For more information, visit the NCBI RefSeq Web Site at:

www.ncbi.nih.gov/RefSeq

Exponential Growth of GenBank Continues with Release 142

Over the past decade, the growth of GenBank has followed an exponential curve with a doubling time of between 12 and 15 months. As shown in Figure 1, the trend continues with release 142 for which close-of-data was June 16. In the eight week period between the close dates for GenBank releases 141.0 and 142.0, the non-WGS portion of GenBank grew by 1,335,978,783 base pairs and by 1,855,785 sequence records. The number of base pairs of sequence in release 142 for several organisms of interest is shown in Figure 2.

Figure 1. Growth of GenBank in billions of base pairs from release 3 in April of 1994 to the current release, 142.

Figure 2. Billions of base pairs of sequence in GenBank release 142 for selected organisms.

GenBank release 142 is available on the NCBI FTP site and at two mirror sites.

Primary FTP site at NCBI:
ftp.ncbi.nih.gov/genbank

Indiana University mirror:
bio-mirror.net/biomirror/genbank

San Diego SuperComputer Center mirror:
genbank.sdsc.edu/pub

Uncompressed, the Release 142.0 flatfiles require approximately 136 gigabytes while the more compact ASN.1 version requires 119 gigabytes.

Entrez Tools is a ‘Hot Spot’

Look for a new “Hot Spot” on the NCBI homepage called “Entrez Tools”. Entrez Tools provides single-page access to a group of specialized Entrez resources. These resources include Batch Entrez, the Entrez Cubby, help with advanced Entrez searches, documentation for the Entrez utilities, and a guide to creating links to the interactive Entrez pages.

Using BLASTClust to Make Non-redundant Sequence Sets

BLASTClust is a program within the standalone BLAST package used to cluster either protein or nucleotide sequences. The program begins with pairwise matches and places a sequence in a cluster if the sequence matches at least one sequence already in the cluster. In the case of proteins, the blastp algorithm is used to compute the pairwise matches; in the case of nucleotide sequences, the Megablast algorithm is used.

In the simplest case, BLASTClust takes as input a file containing catenated FASTA-format sequences, each with a unique identifier at the start of the definition line. BLASTClust formats the input sequence to produce a temporary BLAST database, performs the clustering, and removes the database at completion. Hence, there is no need to run formatdb in advance to use BLASTClust. The output of BLASTClust consists of a file, one cluster to a line, of sequence identifiers separated by spaces. The clusters are sorted from the largest cluster to the smallest.

BLASTClust accepts a number of parameters that can be used to control the stringency of clustering including thresholds for score density, percent identity, and alignment length. The BLASTClust program has a number of applications, the simplest of which is to create a non-redundant set of sequences from a source database. As an example, one might have a library of a few thousand short nucleotide sequence reads and wish to replace these with a non-redundant set. To produce the non-redundant set, one might use:

    blastclust -i infile -o outfile -p F -L .9 -b T -S 95

The sequences in "infile" will be clustered and the results will be written to "outfile". The input sequences are identified as nucleotide (-p F); "-p T", or protein, is the default. To register a pairwise match two sequences will need to be 95% identical (-S 95) over an area covering 90% of the length (-L .9) of each sequence (-b T). Using "-b F" instead of "-b T" would enforce the alignment length threshold on only one member of a sequence pair. The parameter "S", used here to specify the percent identity, can also be used to specify, instead, a "score density." The latter is equivalent to the BLAST score divided by the alignment length. If "S" is given as a number between 0 and 3, it is interpreted as a score density threshold; otherwise it is interpreted as a percent identity threshold.

To create a stringent non-redundant protein sequence set, use the following command line:

    blastclust -i infile -o outfile -p T -L 1 -b T -S 100

In this case, only sequences which are identical will be clustered together. The “blastclust.txt” file in the standalone BLAST package details the full range of BLASTClust parameters.
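Once BLASTClust has written its output, turning the clusters into a non-redundant set means keeping one identifier per line. The Python sketch below shows one way this might be done, together with the rule the article gives for interpreting "-S". The function names and example identifiers are made up, keeping the first identifier of each cluster is an arbitrary choice, and treating values below 3 as score densities is an assumption about how the boundary is handled.

```python
def read_clusters(text):
    """Parse cluster output: one cluster per line, identifiers separated by spaces."""
    return [line.split() for line in text.splitlines() if line.strip()]


def representatives(clusters):
    """Keep one identifier per cluster.

    Taking the first identifier listed is an arbitrary choice made for
    this sketch, not a rule stated in the article.
    """
    return [cluster[0] for cluster in clusters]


def interpret_similarity_threshold(value):
    """Classify an -S value as the article describes.

    Assumption: values from 0 up to (but not including) 3 count as a
    score density; everything else counts as a percent identity.
    """
    if 0 <= value < 3:
        return ("score_density", value)
    return ("percent_identity", value)


# Hypothetical contents of "outfile" for seven reads (identifiers are made up).
EXAMPLE_OUTPUT = "read1 read4 read7\nread2 read5\nread3\nread6\n"

clusters = read_clusters(EXAMPLE_OUTPUT)
print(representatives(clusters))
print(interpret_similarity_threshold(95))
print(interpret_similarity_threshold(1.75))
```

The identifiers returned by `representatives` could then be used to pull the matching records out of the original FASTA file.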

—DW

New Eukaryotic Genomes
continued from page 7

specific resources, are available for honey bee, cat and chicken. These pages can be found at:

www.ncbi.nlm.nih.gov/genome/guide/bee
www.ncbi.nlm.nih.gov/genome/guide/cat
www.ncbi.nlm.nih.gov/genome/guide/chicken

A Map Viewer display for the chicken genome, Gallus gallus, will be available soon, but the sequences deposited into GenBank under the whole genome shotgun (WGS) sequencing project accession AADN00000000 (accessions AADN01000001-AADN01111864) are available now in Entrez and on the GenBank FTP site within the “WGS” directory.

—VP

1 Menotti-Raymond et al., 2003a, PMID 14970716
2 Menotti-Raymond et al., 2003b, PMID 12692169
3 Dietrich et al., Science 304:304-307, 2004, PMID 15001715
4 Katinka et al., Nature 414:450-453, 2001, PMID 11719806

New Microbial Genomes in GenBank

Organism | GenBank Accession Number | RefSeq Accession Number
Mycobacterium avium subsp. paratuberculosis str. k10 | AE016958 | NC_002944
Bdellovibrio bacteriovorus | BX842601 | NC_005363
Treponema denticola ATCC 35405 | AE017226 | NC_002967
Lactobacillus johnsonii NCC 533 | AE017098 | NC_005362
Wolbachia endosymbiont of Drosophila melanogaster | AE017196 | NC_002978
Wolbachia endosymbiont of Drosophila melanogaster | pending | pending
Mycoplasma mobile 163K | AE017308 | NC_006908
Bacillus anthracis strain Ames 0581 | |
Picrophilus torridus | AE017261 | NC_005877

For more detailed information, see the online version of the Spring 2004 NCBI News, or use the GenBank or RefSeq Accession Number to query the Entrez “Genome” database using the query box on the NCBI Home Page.

Environmental Samples
continued from page 8

Environmental sample data can also be searched using two newly-created standard BLAST databases, “env_nt” and “env_nr”, for nucleotide and protein sequences respectively. The environmental sample data contained within these two new databases is no longer contained within the “nt” or “nr” BLAST databases.

—VP

1 Venter, J.C., et al., Environmental Genome Shotgun Sequencing of the Sargasso Sea, Science, 2004 Apr 2;304(5667):66-74.
2 Tyson, G.W., et al., Community structure and metabolism through reconstruction of microbial genomes from the environment, Nature, 2004 Mar 4;428(6978):37-43.

Entrez Quiz

What is the total number of records in each of the 23 Entrez databases? If you've ever asked yourself this question, try the following query in the Entrez global search:

All[filter]

Each of the Entrez databases supports the “all” filter term, which returns the total number of records indexed in a database. You can also collect this information using an E-utilities URL, e.g.:

eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=all[filter]

The results of such a query as of mid-June are plotted in Figure 1.

Figure 1. Number of records in the 23 Entrez databases as of mid-June 2004. Numbers are plotted on a logarithmic scale.
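For readers who want to script the Entrez Quiz query, the E-utilities URL given in the quiz can be assembled programmatically. The Python sketch below only builds the URL and does not contact the server. The "https://" scheme and the function name are assumptions added here, and whether the endpoint still responds as it did in 2004 is not something this sketch checks.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Host and path as printed in the article; the https scheme is an assumption.
EGQUERY_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi"


def build_egquery_url(term="all[filter]"):
    """Return the global-query URL for a search term, with the term URL-encoded."""
    return f"{EGQUERY_BASE}?{urlencode({'term': term})}"


url = build_egquery_url()
print(url)

# Decoding the query string recovers the original search term.
print(parse_qs(urlparse(url).query)["term"][0])
```

Encoding the term through `urlencode` rather than pasting it into the string keeps the square brackets of `all[filter]` from being misread by tools that treat them as special characters.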

Department of Health and Human Services
Public Health Service, National Institutes of Health
National Library of Medicine
National Center for Biotechnology Information
Bldg. 38A, Room 3S308
8600 Rockville Pike
Bethesda, Maryland 20894
