Finding Protein and Molecular Structures

Part of the Training Guide from the MSOE Center for BioMolecular Modeling. Interactive version available at http://cbm.msoe.edu/teachingResources/jmol/jmolTraining/structures.html

Introduction

In order to view a protein or molecule using Jmol, or any molecular program, you need to have a 3-dimensional structure file. These files contain the (X, Y, Z) coordinates for the atoms that make up a structure, along with information about each atom.

These files can vary dramatically in both size and internal format, depending on how large the structure is and how the structure file was created. The most common molecular structure file formats that you will be using with Jmol are Protein Databank (.pdb) files and MDL Molfile (.mol) files.

Types of Structure Files

Protein Databank (.pdb) Files

The Protein Databank (.pdb) file format is curated and annotated by the RCSB Protein Databank (www.pdb.org). The RCSB PDB is an international database that contains archived information about the 3D shapes of proteins, nucleic acids, and complex assemblies, helping students and researchers understand all aspects of biology. The RCSB Protein Databank has also created tools and resources for research and education in molecular biology, structural biology, computational biology, and beyond. The RCSB Protein Databank is the primary source for large protein structure files and is discussed in more detail below.

MDL Molfile (.mol) Files

The MDL Molfile (.mol) file format was originally developed by MDL Information Systems and was later assigned a chemical MIME type as part of the Chemical MIME Project by Henry Rzepa. It is similar to .pdb files in that it contains the 3-dimensional locations of atoms in a molecular structure. However, unlike .pdb files, .mol files are often used for smaller structures such as ligands, drugs and sugars.

There are a large number of .mol file sources, including ChemSpider, DrugBank and the NIH Cactus Server. Many chemical drawing programs, such as ChemDraw and ChemDoodle, export .mol files so that created structures can be viewed in 3-dimensional visualization programs.

Inside a Structure File

Once a structure has been determined, each atom in the structure is assigned an (X, Y, Z) coordinate to mark its location in 3-dimensional space. Additional information complements these basic coordinates, including the type of atom at each location and the chain and residue the atom is part of. Some structure files contain further information such as resolution data, temperature factors, electrostatic potential data and more.

The image to the right shows a short bit of code from inside of a structure file.
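As a rough illustration of what such a file holds, the sketch below pulls the fields described above out of a single atom line. It assumes the fixed-column layout used for ATOM records in the .pdb format; the function name and the sample line are hypothetical and are not taken from a real entry.

```python
def parse_atom_record(line):
    """Split one ATOM/HETATM line into named fields using fixed columns."""
    return {
        "record": line[0:6].strip(),         # columns 1-6: record type
        "serial": int(line[6:11]),           # columns 7-11: atom serial number
        "atom_name": line[12:16].strip(),    # columns 13-16: atom name
        "residue": line[17:20].strip(),      # columns 18-20: residue name
        "chain": line[21],                   # column 22: chain identifier
        "residue_number": int(line[22:26]),  # columns 23-26: residue number
        "x": float(line[30:38]),             # columns 31-38: X coordinate
        "y": float(line[38:46]),             # columns 39-46: Y coordinate
        "z": float(line[46:54]),             # columns 47-54: Z coordinate
        "temperature_factor": float(line[60:66]),  # columns 61-66
    }


# Hypothetical sample line, split into pieces so the columns are easy to count.
SAMPLE_LINE = (
    "ATOM      1  N   VAL A   1    "  # columns 1-30
    "  10.000  20.500  -3.250"        # columns 31-54: X, Y, Z
    "  1.00 25.00"                    # columns 55-66: occupancy, temperature factor
)

atom = parse_atom_record(SAMPLE_LINE)
print(atom["atom_name"], atom["residue"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

Slicing by column position rather than splitting on spaces is a deliberate choice here, because neighbouring fields in a .pdb line can run together without a space between them.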

For more information on structure files and how they are determined, visit these RCSB Protein Databank resources:

• Understanding PDB Data http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/intro.html

• Methods for Determining Atomic Structures http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/methods.html

The RCSB Protein Databank

The RCSB Protein Databank (http://www.pdb.org) is the largest worldwide repository for the processing and distribution of .pdb structure data for large molecules such as proteins and nucleic acids. There are now well over 100,000 structure files available on the www.pdb.org website!

Finding Structures on the Protein Databank

Each structure hosted on the Protein Databank has a unique four-character alphanumeric identifier, referred to as the structure's PDB ID.

Often more than one .pdb file will exist for a specific type of protein. For example, there are hundreds of .pdb file entries for the relatively common protein hemoglobin. It is often a good idea to use specific information about each structure, such as the answers to the questions listed below, to help determine whether you have found the best possible file.

• Who are the authors of the PDB file?

• In which journal was the primary citation published?

• On what date was the file deposited into the PDB?

• How many chains are in this file?

• Are there any hetero groups (non-protein components such as ligands or ions) within this PDB file? If so, which ones?

• From what source was this molecule isolated?
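Two of these questions, the number of chains and the presence of non-protein groups, can also be answered directly from the atom lines of a downloaded file. The sketch below is one minimal way to do that, assuming the fixed-column .pdb layout in which ATOM lines describe the chains and HETATM lines describe other groups; the function name and the truncated sample lines are hypothetical.

```python
def summarize_structure(pdb_text):
    """Collect chain IDs and hetero group names from the text of a .pdb file."""
    chains = set()
    hetero_groups = set()
    for line in pdb_text.splitlines():
        if len(line) < 22:
            continue  # too short to hold a residue name and chain ID
        if line.startswith("ATOM"):
            chains.add(line[21])                    # column 22: chain identifier
        elif line.startswith("HETATM"):
            hetero_groups.add(line[17:20].strip())  # columns 18-20: group name
    return {"chains": sorted(chains), "hetero_groups": sorted(hetero_groups)}


# Hypothetical lines, truncated after the residue number (no coordinates).
SAMPLE_TEXT = "\n".join([
    "ATOM      1  N   VAL A   1",
    "ATOM      2  CA  VAL A   1",
    "ATOM      3  N   LEU B   1",
    "HETATM    4  C1  NAG B 201",
])

summary = summarize_structure(SAMPLE_TEXT)
print(len(summary["chains"]), "chains:", summary["chains"])
print("hetero groups:", summary["hetero_groups"])
```

The remaining questions (authors, journal, deposition date, source) are easier to answer from the Structure Summary page than from the atom lines.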

The Structure Summary Page

When you click on a specific PDB ID, you will initially see the Structure Summary page for the structure. This page includes a variety of useful information about the structure.

• Structure Preview Image - Provides a quick overview of what the molecule or protein looks like.

• Structure ID Number - This four-character letter/number ID is a unique identifier that is assigned to the data file upon deposition into the database.

• Source of the Molecule - The species from which the molecule was isolated, such as human, bacterium, virus, or mouse.

• Title - Title of the .pdb file.

• Authors - These are the researchers who were involved in determining the structure of the molecule. The senior author or principal investigator is usually the last author in science publications.

• Primary Citation - The journal article that accompanies the .pdb file. This is usually an excellent research resource for understanding the function of the molecule.

• Molecular Description - The abstract associated with the primary citation.

• Chemical Component - This will tell you the number of chains within the molecule and the chain identity. For example, in the hemoglobin file 1a3n.pdb, chains A and C are the alpha-globin molecules and chains B and D are the beta-globin molecules.

This section also tells you if there are any hetero groups (such as bound ligands) that were included in the structure along with the molecule. Not all .pdb files will have this section.

o The two- or three-letter identifiers used to designate the chemical components listed in the file are recognized by Jmol and can be used to select these molecules with the Jmol Console.

o For example, if this section stated that NAG (N-acetyl-glucosamine) was contained within the molecule, Jmol would recognize "NAG", so you could enter "select NAG" and Jmol would select the atoms within that chemical component of the PDB file.

• Method of Structure Determination - The method that was used to obtain the structural data (for example, NMR or X-ray diffraction).

• Resolution - A measure of the level of detail in the experimental data, reported in angstroms; the smaller the number, the better the data.

The View in 3D Window

The View in 3D Window will also let you preview the structure using a web-embedded online Jmol. To view this preview, simply click the "View in 3D: JSmol" button that is located directly below the molecule image on each Structure Summary Page.

The Sequence Page

Just above the .pdb file Title should be a series of tabs, the fourth of which is the Sequence tab. This section of the .pdb file page provides specific sequence information as well as secondary structure information about the molecule. You can identify the alpha helices or beta sheets as well as the amino/carboxyl termini, which are the first and last amino acids of the protein.

The Two Ways to Obtain a .pdb Structure

One of the key features of the Protein Data Bank is the ability to search the database for files. You can search for a unique structure if you know its PDB ID, or search by keywords and authors. To submit a search query, enter these terms in the search box located near the top center of every www.pdb.org page. After you have entered the search terms in the field, press Enter or click the "Go" button to the right of the search field. There are two ways to obtain a .pdb file:

1. Download the File from the RCSB Protein Databank website.

a. Go to the website http://www.pdb.org

b. In the top right corner of the website is a search bar similar to the image below. Type in the four-character PDB ID (in this case, "1qys") and click the "Search" button.

c. This should bring you to the page for "1qys.pdb – Top 7". Just below the search box on the right should be a list of four options. Click "Download Files" and you will see an expanded menu similar to the image shown below.

d. Click "PDB Format" to begin the download of the .pdb file containing the coordinates for Top 7. This file, named "1qys.pdb", can be saved to the location of your choosing on your computer.

Note that it is a good idea to create a new folder for each molecule you work on to organize all of your .pdb files, images, and other related work.

2. Dynamically Load the File from the RCSB Protein Databank Server.

As long as you have an Internet connection, Jmol allows you to dynamically connect to the RCSB Protein Databank and load a structure without downloading it permanently to your computer. You will, however, need to know the four character alpha-numeric PDB ID for the structure you are looking for.

To load the structure file 1qys.pdb: load =1qys

Note that you do not need to add the file extension (.pdb) when entering this command; just the four character alpha-numeric PDB ID is needed. You do, however, need to include the equal sign "=" with no spaces between it and the name of the .pdb file. This equal sign tells Jmol that you want to access the RCSB Protein Databank servers to find the structure, rather than finding a file locally on your computer.
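Outside of Jmol, the same idea of fetching a structure by its PDB ID can be sketched in a few lines of Python. This is only an illustration: the download address pattern below is an assumption about the RCSB file server and is not taken from this guide, the function names are hypothetical, and Jmol's own "=" mechanism may use a different address.

```python
import re
import urllib.request

# Assumed address pattern for .pdb downloads (not stated in this guide).
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id):
    """Build a download address from a four-character alphanumeric PDB ID."""
    pdb_id = pdb_id.strip()
    if not re.fullmatch(r"[A-Za-z0-9]{4}", pdb_id):
        raise ValueError(f"not a four-character alphanumeric PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD_URL.format(pdb_id=pdb_id.upper())


def fetch_pdb(pdb_id):
    """Download the .pdb text for one entry (requires an Internet connection)."""
    with urllib.request.urlopen(pdb_download_url(pdb_id)) as response:
        return response.read().decode("utf-8")


print(pdb_download_url("1qys"))
```

As with the Jmol command, only the four-character ID is passed in; the check rejects input such as "1qys.pdb" that includes the file extension.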

Additional Resources from the RCSB Protein Databank

The RCSB Protein Databank has several regularly updated features as well as some interesting interviews and newsletters that may be useful for any Jmol designer.

• The Molecule of the Month by David S. Goodsell provides an introduction to the structure and function of a molecule, a discussion of its relevance to health and disease, interactive views, discussion topics, and links to related entries. This monthly feature has been around for a while, so the collection of proteins covered is quite extensive! This is also an excellent source for good .pdb file suggestions. http://www.rcsb.org/pdb/motm.do

• The PDB Newsletter is a quarterly publication that highlights new features and programs supported by the RCSB Protein Databank. http://www.rcsb.org/pdb/static.do?p=general_information/news_publications/newsletters/newsletter.html

• PDB-101 is an excellent source for various educational resources produced by the RCSB Protein Databank, including animations, videos, posters, and other useful teaching tools. http://www.rcsb.org/pdb/101/structural_view_of_biology.do

The NIH Cactus Database

The NIH (National Institutes of Health) Cactus (CADD Group Chemoinformatics Tools and User Services) Database is a public website with several powerful chemoinformatics tools that provide structures and data to help explore molecular structures. Most of the tools on the NIH Cactus Database focus on small molecules and use the MDL Molfile (.mol) format.

• You can access the NIH Cactus home page at http://cactus.nci.nih.gov/index.html

• You can search for MDL Molfile (.mol) structures at http://cactus.nci.nih.gov/ncidb2.2/

• You can draw custom chemical structures and export them as MDL Molfile (.mol) structures at http://cactus.nci.nih.gov/cgi-bin/lookup/search

Dynamically Connecting to the NIH Cactus Server

Like .pdb files, small molecule structures from the NIH Cactus Server can be loaded into Jmol dynamically without downloading them permanently to your computer. As long as you have an Internet connection, you can load a specific small molecule directly from Jmol.

To load the small molecule aspirin: load $aspirin

Note that you need to include the dollar sign "$" with no spaces between it and the name of the small molecule. This dollar sign tells Jmol that you want to access the NIH Cactus servers to find the structure, rather than finding a file locally on your computer.

SMILES Sequences

While almost every molecular structure you can think of will be identifiable by name when loading a structure dynamically from the NIH Cactus database, you may occasionally come across a structure that the database does not know. For these situations, we suggest you try to find a SMILES (Simplified Molecular-Input Line-Entry System) sequence.

SMILES sequences are a line notation for molecules that records the connectivity between the specific atoms in a structure but does not include 2D or 3D coordinates. Atoms are represented by their element symbols (C, N, O, P, Cl, Br, etc.), with lowercase letters (such as c) marking aromatic atoms. The equals sign "=" represents double bonds and the pound sign "#" represents triple bonds. Branching is indicated by parentheses "()" and ring closures are indicated by matching pairs of digits. A few examples are shown below.

• Aspirin - O=C(Oc1ccccc1C(=O)O)C

• Glucose - OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O

• Dopamine - c1cc(c(cc1CCN)O)O
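To make the notation more concrete, the sketch below splits a SMILES sequence into its pieces and counts the features described above: atoms, double and triple bonds, branches, and rings. It is a simplified reader written for these three examples, not a full SMILES parser. It only recognizes the symbols listed in its pattern and silently skips anything else, and the function name is hypothetical.

```python
import re

# Simplified token pattern; order matters so that "Cl" is not read as "C".
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"    # bracketed atom such as [C@H]
    r"|Cl|Br"        # two-letter element symbols
    r"|[BCNOPSFI]"   # one-letter element symbols
    r"|[bcnops]"     # lowercase symbols for aromatic atoms
    r"|[=#]"         # double and triple bonds
    r"|[()]"         # branch open and close
    r"|\d"           # ring closure digits
)


def summarize_smiles(smiles):
    """Count the atoms, bonds, branches and rings written in a SMILES sequence."""
    tokens = SMILES_TOKEN.findall(smiles)
    atoms = [t for t in tokens if t[0] == "[" or t.isalpha()]
    ring_digits = [t for t in tokens if t.isdigit()]
    return {
        "atoms": len(atoms),               # hydrogens are usually left implicit
        "double_bonds": tokens.count("="),
        "triple_bonds": tokens.count("#"),
        "branches": tokens.count("("),
        "rings": len(ring_digits) // 2,    # each ring uses a matching pair of digits
    }


print(summarize_smiles("O=C(Oc1ccccc1C(=O)O)C"))  # aspirin: 13 atoms, 1 ring
```

The atom count covers only the atoms written in the sequence, which is why aspirin (C9H8O4) gives 13 rather than 21. Alternating bonds in an aromatic ring are implied by the lowercase letters and are not counted as "=" double bonds.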

Jmol can use a SMILES sequence and connect to the NIH Cactus database to turn it into a 3-dimensional structure.

To load the SMILES sequence for glucose: load $OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O

Note that, as when loading a small molecule by name, you need to include the dollar sign "$" with no spaces between it and the SMILES sequence. This dollar sign tells Jmol that you want to access the NIH Cactus servers to convert the structure from a SMILES sequence to a 3-dimensional structure.
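Jmol handles the server request itself, so nothing more is needed inside Jmol. For readers curious about what a lookup by name or SMILES sequence might look like from a script, the sketch below builds a request address for the NIH Cactus server. The address pattern is an assumption and is not taken from this guide, and the function name is hypothetical.

```python
from urllib.parse import quote

# Assumed address pattern for requesting a 3D structure file (not from this guide).
CACTUS_URL = (
    "https://cactus.nci.nih.gov/chemical/structure/"
    "{identifier}/file?format=sdf&get3d=true"
)


def cactus_structure_url(identifier):
    """Build a request address from a molecule name or a SMILES sequence."""
    # Percent-encode every special character: a raw "#" (triple bond) would
    # otherwise be read as the start of a URL fragment and cut the SMILES short.
    return CACTUS_URL.format(identifier=quote(identifier, safe=""))


print(cactus_structure_url("aspirin"))
print(cactus_structure_url("OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O"))
```

Encoding matters more for SMILES sequences than for names, because characters such as "#", "[", "]" and "@" have their own meaning inside a web address.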

SMILES Sequences can be found from a variety of online drug and small molecule databases, including the following websites.

• Wikipedia includes a SMILES sequence in the right-hand column of almost all small molecule entries. https://www.wikipedia.org/

• DrugBank has a huge variety of resources for drugs of all kinds, including SMILES sequences for each entry. http://www.drugbank.ca/

• ChemSpider is a free chemical structure database providing fast text and structure search access to over 34 million structures from hundreds of data sources. http://www.chemspider.com/