Tutorial Section Entrez: Making Use of Its Power

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 4. NO 2. 179–184. JUNE 2003

Entrez¹ is a data retrieval system developed by the National Center for Biotechnology Information (NCBI) that provides integrated access to a wide range of data domains, including literature, nucleotide and protein sequences, complete genomes, three-dimensional structures, and more. Entrez includes powerful search features that retrieve not only the exact search results but also related records within a data domain that might not be retrieved otherwise and the associated records across data domains. These features enable us to gather previously disparate pieces of an information puzzle for a topic of interest.

Effective and powerful use of Entrez requires an understanding of the available data domains, the variety of data sources and types within each domain, and Entrez’s advanced search features.

This tutorial uses the human MLH1 gene, implicated in colon cancer, to demonstrate the wide variety of information that we can rapidly gather for a single gene. The numbers noted in the search results will of course change over time as the databases grow. The same techniques shown here can be used for any topic of interest.

The search goals are to:

• separate the wheat from the chaff – identifying a representative, well-annotated mRNA sequence record;
• retrieve associated literature and protein records;
• identify conserved domains within the protein;
• identify similar proteins;
• identify known mutations within the gene or protein;
• find a resolved three-dimensional structure for the protein or, in its absence, identify structures with homologous sequence;
• view genomic context and download the sequence region.

SEPARATE THE WHEAT FROM THE CHAFF

An Entrez data domain usually encompasses data from several different source databases. The goal is to identify a representative, well-annotated mRNA sequence record among the many available in the Entrez Nucleotide data domain.

The Entrez Nucleotide domain includes sequence records from the archival GenBank database, the curated RefSeq² database, nucleotide sequences extracted from Protein Data Bank (PDB)³ records, and a new Third-Party Annotation (TPA) database. As a result, an unrefined search can retrieve records of varying quality (in both sequence and annotation), and there can be a high degree of redundancy in search results, depending upon how many labs have submitted sequence data for a gene or its fragments.

For example, an unqualified search of Entrez Nucleotide for colon cancer currently retrieves >10,000 hits. The results include archival and curated records, characterised sequences and lower-quality sequences such as expressed sequence tags (ESTs), contigs from the genome project and more.

A ‘Limits’ option allows us to restrict our search, if desired, to a specific data subset, such as the curated, non-redundant RefSeq database. It also allows us to limit searches to specific data fields, retrieve records with certain attributes, such as molecule type, and exclude sometimes unwanted records such as ESTs, which are typically numerous and of lower sequence and annotation quality than characterised genes.

In this case, if we use the Limits page to restrict our ‘colon cancer’ search to the Title field and then only to records from RefSeq, our retrieval narrows to 31 hits. If we then do a new search for human in the Organism field and use the ‘History’ option to combine the two searches with a Boolean AND, we retrieve 13 hits – far fewer and far more specific results than our original >10,000.

In addition, because each RefSeq record presents an encapsulation of the knowledge about a single gene or splice variant, rather than the work of an individual laboratory, each hit is similar to a review article.

For this example, we will more closely examine NM_000249: Homo sapiens mutL homologue 1 (MLH1), and the additional information we can retrieve for that gene in Entrez. Of course, a search for MLH1, rather than colon cancer, would have worked as well, and the same techniques could have been used to narrow the search results. Gene symbol searching, however, can sometimes be less reliable if a gene has been known by numerous aliases. Although curated RefSeq records include the official gene symbol as well as the aliases, archival records, such as those in GenBank, include only the gene symbol that the authors used at the time of submission or last update.

TRAVERSE THE DATA DOMAINS

The Links menu for each record allows us to retrieve directly associated records from the other Entrez data domains. For example, the PubMed⁴ link for NM_000249 retrieves the 12 references cited in the RefSeq record. They represent a set of articles selected by curators (if a record is in a ‘Reviewed’ rather than ‘Provisional’ state) that discuss salient research on the gene, such as mapping, characterisation and phenotype.

Returning to the nucleotide record for NM_000249, we can just as easily traverse from the nucleotide data domain to protein. By simply selecting ‘Protein’ from the ‘Links’ menu box, we can view the corresponding amino acid sequence record. We will only see the record for NP_000240, which contains the sequence that was extracted from the Features/CDS translation field of NM_000249. Additional, similar protein sequences that were identified by the BLAST⁵ algorithm can be retrieved by following the ‘Related Sequences’ link.

Similarly, the Links menu for NM_000249 lists all other Entrez data domains that contain associated information and can be used to easily access that additional data.

RETRIEVE RELATED RECORDS

Links from one Entrez data domain to another provide access only to data that are directly related to our original record of interest. However, the ‘Related Records’ option within most Entrez data domains allows us to instantly broaden the retrieval to other relevant records in that domain that would not otherwise have been retrieved by the original query. For example, when viewing the 12 PubMed records above, the ‘Display: Related Articles’ option instantly retrieves hundreds of other PubMed records that were identified using a word weight algorithm, which finds records with similar words in their titles, abstracts, and Medical Subject Headings (MeSH).

Similarly, the display for protein record NP_000240 includes a link to ‘Related Sequences’ that were identified using the BLAST algorithm. The related sequences are shown in decreasing order of similarity to the original sequence and can provide valuable insights into the possible function of the original sequence if it has not yet been characterised.

In the Entrez Protein domain, the BLink (short for BLAST Link) option provides a graphical overview of the top 200 similar sequences, showing the regions of similarity to the original sequence of interest. BLink also provides great flexibility in filtering and customising the view of the complete set of similar sequences (not just the top 200). It allows us, for example, to see the best hit from each organism, only the hits that have associated 3D structure records, a phylogenetic tree of our hits (in which we can choose to exclude organisms or organism groups), and more.

IDENTIFY CONSERVED DOMAINS

Conserved domains, like similar sequences, can shed light on a protein’s function as well as its organisation. Each protein sequence in Entrez has been compared against NCBI’s Conserved Domain Database (CDD).⁶

Returning to the original protein record for NP_000240, we can follow the ‘Domains’ link to view the conserved domains that have been identified in the sequence. This traverses to the CDD and, if ‘Details’ are viewed, shows the presence of the HATPase and DNA mismatch repair domains. In addition, the grey ‘MUTL’ bar represents the protein family with which NP_000240 is associated. Clicking on the graphic for any domain or protein family leads to more detailed information. The ‘Show Domain Relatives’ option retrieves protein sequences with similar domain architectures identified by the Conserved Domain Architecture Retrieval Tool (CDART).⁷

IDENTIFY KNOWN MUTATIONS

Variations within the human MLH1 gene can be identified through the ‘SNP’ link or through the ‘Allelic Variants’ section of the corresponding Online Mendelian Inheritance in Man (OMIM)⁸ record.

The SNP link will retrieve records for variations submitted by individual labs to dbSNP⁹ and aligned to the corresponding mRNA using the BLAST algorithm. A graphic summary for each SNP indicates whether it is in a locus region, transcript or coding region and gives additional information about mapping consistency, heterozygosity, validation status and more.

An OMIM record, on the other hand, describes (if available) allelic variants that have been reported in the literature and summarised by the OMIM editorial staff. For example, one interesting mutation reported in MIM entry number 120436 is allelic variant .0011 (Gly67Trp), in which the smallest amino acid has been substituted by the largest amino acid. A corresponding structure record, as described in the next section, can shed light on the possible significance of that substitution.

FIND THREE-DIMENSIONAL STRUCTURES

As noted by Mullan,¹⁰ finding a resolved structure for a protein is the exception rather than the rule. This is true because the currently available >2.7 million protein sequence records far exceed the available number of individual structure records, currently 20,250 in Entrez’s Molecular Modeling Database (MMDB).¹¹ However, the presence of a homologous structure can assist in the analysis of protein function.

The ‘Links’ menu for NP_000240 does not include ‘Structure’, indicating that this sequence record is not directly associated with a 3D protein structure record. Several options exist to find possible homologous structures:

• retrieve the approximately 600 related sequences for NP_000240 and then display the ‘Structure Links’ for the complete set;
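
For readers who would rather script the search than click through the web pages, the Limits and History steps from the first section (Title field, RefSeq only, human, combined with a Boolean AND) and the nucleotide-to-protein traversal can be approximated as request URLs. This is only a sketch under stated assumptions: the tutorial describes the Entrez web interface, whereas the code below targets NCBI's E-utilities HTTP interface, which the tutorial does not mention. The helper names (build_term, esearch_url, elink_url) are hypothetical, the `srcdb_refseq[PROP]` clause is assumed to be an adequate stand-in for the RefSeq limit, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities. Using E-utilities at all is an assumption
# of this sketch; the tutorial itself works through the Entrez web pages.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_term(*clauses):
    """Join (text, field) pairs into one field-qualified query with Boolean AND."""
    return " AND ".join(f"{text}[{field}]" for text, field in clauses)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request for one Entrez data domain."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def elink_url(dbfrom, db, uid):
    """Build (but do not send) an ELink request from one data domain to another."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS}/elink.fcgi?{query}"


# 'colon cancer' in the Title field, RefSeq records only, human only.
# srcdb_refseq[PROP] is assumed to correspond to the RefSeq limit.
term = build_term(
    ("colon cancer", "Title"),
    ("srcdb_refseq", "PROP"),
    ("human", "Organism"),
)
search = esearch_url("nuccore", term)

# Nucleotide-to-protein traversal for the record examined in the tutorial.
# The accession is passed as-is; whether ELink accepts an unversioned
# accession has not been verified here.
to_protein = elink_url("nuccore", "protein", "NM_000249")

print(term)
print(search)
print(to_protein)
```

Building the whole query as one field-qualified term replaces the two-step History combination described in the tutorial; the hit counts returned today will differ from the 31 and 13 reported in the text, since the databases have grown.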
