An Introduction to Bioinformatics

Advanced Genetics, Fall 2017

Names:

Instructions. Work in pairs to complete this guided worksheet. Make sure you follow the instructions exactly, as failure to do so will result in a failed analysis. The goals for today are for you to:

1) Become familiar with some of the capabilities of the NCBI database and BLAST (the Basic Local Alignment Search Tool).

2) Apply basic knowledge regarding gene structure in eukaryotes to select appropriate DNA sequences for phylogenetic analysis.

3) Learn how to import and align BLAST sequences using MEGA7.

4) Learn how to create a phylogenetic tree in MEGA7 using your aligned sequences.

You will need these skills to complete your human mitochondrial DNA lab report, so paying close attention and taking good notes today will be critical to your success later in the unit.

Part 1: Introduction to the NCBI Database

1) Go to the NCBI database at https://www.ncbi.nlm.nih.gov/ . The current iteration of the home page has a search window at the top, a list of resource categories on the left, a list of popular resources on the right and several very useful links to NCBI data and tools in the middle. These pages are constantly (and annoyingly) under revision, so do not panic if they do not look like you expect them to look from this handout! If you use a bit of critical thought you will find what you need.

2) Click on “Analyze” in the middle of the page and answer the following questions:

a. Name the tool you would use if you wanted to find regions of local similarity between any two biological sequences in the NCBI database (Hint: this is the tool we will be using…)

b. Name the tool you would use to identify conserved domains present in a protein sequence.

c. Name the tool you would use to find regions of local similarity between your sequence and whole genome sequences.

3) Go back to the NCBI home page, type the phrase ‘cystic fibrosis’ into the top search window and click on ‘Search’. This will look for that term in all of the NCBI databases.

4) Obviously, this disease has been well studied! If you look under ‘Literature’ you will see that over 61,000 full-text articles have been published on this disease alone. There are also abundant links to information about the protein, the gene that codes for the protein, its homologs, etc.

5) Click on “OMIM” under “Health”. The Online Mendelian Inheritance in Man database should be familiar to all of you who took Biology 102, since it is the same database you used to try to identify your unknown genetic disorder in that class. It is a great resource for disease-related genes. Click on record #602421 – Cystic Fibrosis Transmembrane Conductance Regulator; CFTR and answer the following questions:

a. What is the chromosomal location of the CFTR gene?

b. What type of protein is coded for by CFTR?

6) Now let’s get started with our phylogenetic analysis. The first step is to search for CFTR homologs* using BLAST.

(*Homologs are genes/proteins that have originated from a common ancestral sequence and still retain significant and recognizable amounts of structural similarity. There are two types of homologs. Orthologs evolved from a common ancestral gene during speciation, and share a common function despite their presence in different species. An example of orthologs is DNA polymerase in humans and bacteria. While these genes do have some DNA and protein sequence differences, their protein products have the same function: they make copies of cellular DNA during DNA replication. Paralogs, by contrast, arise via gene duplication within a species, and then evolve new functions over time. An example of paralogs in humans is hemoglobin (an iron and oxygen-binding protein found in our red blood cells) and myoglobin (an iron and oxygen-binding protein found in our muscles). When you do a BLAST search you may find both types of homologs, so it is important to choose only genes that reflect the type of relationship you want to study.)

7) Go back to the NCBI home page, type ‘cystic fibrosis’ in the search window and choose ‘nucleotide’ from the left-hand drop-down menu before clicking on ‘search’. You should see that there are over 200,000 sequences related in some way to cystic fibrosis. We are NOT going to try to align all of these! Choose the second record – it should be Human cystic fibrosis mRNA, encoding a presumed transmembrane conductance regulator (CFTR), mRNA – and click on it. This will take you to the Genbank record for this sequence. The record contains a ton of information about the gene, including references, its function, the locations of introns and the exons, and the cDNA sequence.

a. What is cDNA? What does it lack that genomic DNA has?

b. Do you expect introns or exons to be more highly conserved during evolution… Why?

8) In this analysis, I want you to analyze closely related homologs, so we will be using the human cDNA sequence as our ‘query’. Go to the top of the Genbank record and look for the sequence accession number. Write it down: M28668.1. You will use that number in Part 2.
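For those curious how the same record could be requested without clicking through the web pages, here is a minimal sketch. It assumes NCBI's public E-utilities "efetch" service and its db, id, rettype and retmode parameters; the function names efetch_url and fetch_record are made up for this handout. You do not need this for today's worksheet.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's E-utilities efetch service (assumed to be reachable).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta"):
    """Build a URL asking for one nucleotide record as plain text."""
    params = {
        "db": "nuccore",      # the nucleotide database
        "id": accession,      # e.g. the accession number you wrote down
        "rettype": rettype,   # "fasta" for sequence only, "gb" for the full record
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_record(accession, rettype="fasta"):
    """Download the record as text (needs an internet connection)."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


# Only builds and prints the URL; nothing is downloaded here.
print(efetch_url("M28668.1"))
```

Calling fetch_record("M28668.1", rettype="gb") would be expected to return the same kind of text you see on the Genbank page, but treat that as something to try rather than a guarantee.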

Part 2: Using MEGA7 to Align Homologous Sequences

Now that you have identified a sequence of interest using the NCBI database, it is time to use MEGA7 to find homologs and to generate a phylogenetic tree.

1) Open MEGA 7. When it has finished loading you will see its Main window. In the upper left-hand corner, choose Align, then Do BLAST Search.

2) Make sure you have selected the blastn (nucleotide BLAST) tab from the BLAST window (this is the default in MEGA 7 and should already be selected). Below that tab you will see a large search box. In a BLAST analysis you can type accession numbers or entire sequences into the search box. You can also upload sequence files if you have them in the correct format. Since our sequence is over 6,000 bases long, typing in the accession number is much simpler. Do that now. It should look like this (but with a different accession number):

3) Under ‘Choose Search Set’, make sure the ‘Others’ button is selected. We do not want to limit ourselves simply to human and mouse genes.

4) Under Program Selection, choose ‘Somewhat similar sequences (blastn)’.

5) Click on ‘BLAST’. Do NOT check the ‘Show results in a new window’ box.

6) A results window will now appear, but will probably not show your matches, yet. It will take the BLAST tool a little while to compare your 6000+ base sequence to all of the millions of others in the NCBI database. When your matches are ready, they will be shown as a graphic summary (long, red lines mean high similarity) followed by a ‘Descriptions’ box. That box will show a table that has the following information:

· Description. The name of the aligned sequence with a link to the alignment in the box at the bottom of the web page. Note that the first record should be the actual human sequence for which you entered the accession number. If this is not the case recheck your accession number and try again!
· Max score and Total score. Both are statistical representations of the strength of the match. High is good.
· Query cover. This tells you what percentage of your human (query) sequence is represented in the match. For the first sequence, you can see that 100% of the bases in your input human sequence were used in the match. This is expected because that top match actually is the gene you entered. If you look a few genes down, only 97% of your gene was used. In this analysis, we are not looking for perfect matches, but rather a variety of closely related homologs, so we will use a cutoff of 75% Query cover.
· E value. This estimates how many matches at least this good you would expect to get purely by chance (rather than because the sequences actually are similar) in a database of this size. This should be very low if you want a significant match. We will not be using anything higher than 0 in this analysis (BLAST displays extremely small E values as 0), but if we were looking at genes with less similarity we could.
· Ident. This is the percent identity between your sequence (Query) and the matched sequence (Subject, or Sbjct in the BLAST alignments). Note that high is good, but not always best here, as it is only calculated using the bases in the alignment. So, an alignment that has 100% identity but only has a Query cover of 5% of your gene is probably not as strong as an alignment with 95% identity and 95% Query cover. We will not use a cutoff for Ident.
· Accession. A link to the Genbank record for the sequence.

1) Next you must choose and download sequences for your alignment. Let’s start with your human sequence, since that will be a part of everyone’s analyses. Click on the sequence name in the Description column.
This should take you to a graphical view of the alignment of your sequence (Query) with the sequence from the database (Sbjct). The first part of the alignment will look like the sequence shown below. You can see a few things in this record.

· The gene name itself says that it is an mRNA (more accurately a cDNA) sequence, and that the total length of the sequence in the Genbank record is 6129 bases. We will limit this analysis to other cDNA sequences, as aligning cDNA and genomic sequences (that still contain the introns) requires a level of analysis a bit too advanced for the first time through this!
· The ‘Strand’ match is Plus/Plus. That means the coding strand of your sequence matched the coding strand of the database sequence. If you had seen Plus/Minus, that would mean your sequence’s complement matched the database sequence, and you would need to reverse and complement the database sequence before adding it to the alignment. Again, in today’s assignment we will not be doing that (I think all of the matching sequences are cDNA), although your assigned reading goes through that process should you ever need it.
· Identities are shown as lines between the top (query) and bottom (subject) sequences. Note this first alignment is 100% identical so there are vertical lines between every base.
· A link to the Genbank record.

1) Once you have made sure that your sequence is cDNA and has a Plus/Plus alignment, click on the Genbank link and scroll down the page until you see Features on the left-hand side. Find “CDS” (short for coding sequence). This tells you where the start and stop codons are in your gene sequence. Remember that mRNA has untranslated sequences at both the 5’ and the 3’ ends, and we do not want to analyze them, just the actual sequence that codes for the protein. In this case our start codon begins at base 133 and the coding region ends at base 4575.

On the right-hand side of the Genbank window, you will see a box that says, “Change Region Shown”. Click on the arrow to open the box, then click on “Selected Region”, write in the first and last base of the coding region (in this case 133-4575), and then click on ‘Update View’. Finally, click the “Add to Alignment” (red plus sign) button at the top of the page to add the coding sequence to the alignment.
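To make the "Selected Region" idea concrete, here is a small sketch of what trimming a record to its coding region amounts to. It assumes Genbank-style coordinates (counting starts at 1 and both end bases are included); the function names and the short toy sequence are invented for illustration. The reverse_complement helper shows what "reverse and complement" means for the Plus/Minus case described earlier, which we are not doing today.

```python
def extract_region(sequence, first, last):
    """Return bases first..last, counting from 1 and including both ends."""
    if not (1 <= first <= last <= len(sequence)):
        raise ValueError("region falls outside the sequence")
    # Python counts from 0 and excludes the end index, hence first - 1.
    return sequence[first - 1:last]


def region_length(first, last):
    """Number of bases in an inclusive 1-based region."""
    return last - first + 1


# Lookup table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then read the result backwards."""
    return sequence.translate(_COMPLEMENT)[::-1]


# Toy example (made-up sequence): the coding region occupies bases 3 to 11.
toy_mrna = "GGATGAAATAGCC"
print(extract_region(toy_mrna, 3, 11))   # ATGAAATAG

# The human CFTR coding region from this step: bases 133 to 4575.
print(region_length(133, 4575))          # 4443
print(region_length(133, 4575) % 3)      # 0, a whole number of codons
```

Note that 4443 bases is exactly 1481 codons, which is a quick sanity check that the first and last bases were entered correctly.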

7) MEGA 7’s Input Sequence Label box will open. For the first word, choose H. sapiens (genus and species); for the second word, enter or choose CFTR (the gene name). Use the same convention for all of the other sequences you choose. Click OK.

8) The human CFTR coding sequence will now show up in a new Alignment Explorer window.

9) Go back to the MEGA web browser window and click on the left arrow at the upper right of the window to get back to the tab for the original alignment.

10) Click on the NCBI BLAST tab to reopen the alignment window and then repeat steps 7-9 for 5 more genes. Make sure they have at least a 75% Query cover and represent 5 different species. Also make sure to enter the correct first and last bases in the coding sequence for each species, based on the Genbank record. Are these sequences orthologs or paralogs? Explain in the space below.
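The selection rules used in this worksheet (Query cover of at least 75%, E value no higher than 0, one sequence per species) can be written out as a short filter. This is only a sketch: every hit in the table below is invented for illustration and is not real BLAST output, and the function names are made up. The last helper multiplies Ident by Query cover as a rough way to see why 100% identity over 5% of the gene is weaker than 95% identity over 95% of it.

```python
# Hypothetical BLAST hits. All names and numbers are invented for illustration.
HITS = [
    {"species": "Species A", "query_cover": 100, "e_value": 0.0, "ident": 100.0},
    {"species": "Species B", "query_cover": 97, "e_value": 0.0, "ident": 91.0},
    {"species": "Species B", "query_cover": 80, "e_value": 0.0, "ident": 90.0},
    {"species": "Species C", "query_cover": 60, "e_value": 0.0, "ident": 88.0},
    {"species": "Species D", "query_cover": 5, "e_value": 2e-30, "ident": 100.0},
]


def passes_cutoffs(hit, min_cover=75, max_e_value=0.0):
    """Apply the worksheet's Query cover and E value cutoffs to one hit."""
    return hit["query_cover"] >= min_cover and hit["e_value"] <= max_e_value


def pick_one_per_species(hits):
    """Keep the first passing hit for each species, in table order."""
    chosen = {}
    for hit in hits:
        if passes_cutoffs(hit) and hit["species"] not in chosen:
            chosen[hit["species"]] = hit
    return list(chosen.values())


def identical_share_of_query(hit):
    """Rough percent of the whole query that is matched identically."""
    return hit["ident"] * hit["query_cover"] / 100


for hit in pick_one_per_species(HITS):
    print(hit["species"], hit["query_cover"], identical_share_of_query(hit))
```

With these made-up rows, Species C fails the Query cover cutoff and Species D fails both cutoffs, so only Species A and the first Species B row are kept.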

11) You will now have 6 sequences in your Alignment Explorer Window. Make sure they all start with ATG (can you tell me why?), then close the MEGA web browser window.
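If you ever have many sequences to check, the same inspection can be sketched in a few lines. The function name and the toy sequences are invented for this handout. The extra checks (length divisible by 3, ends with a stop codon) assume that the region you entered runs through the stop codon, as the CDS coordinates in a Genbank record normally do.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_sequence(sequence):
    """Return a list of problems found; an empty list means it looks fine."""
    seq = sequence.upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Assumption: the selected region includes the stop codon.
    if seq[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems


# Toy sequences (made up): the first is fine, the second has two extra
# bases in front, as if the wrong first base had been typed in.
print(check_coding_sequence("ATGAAATAG"))
print(check_coding_sequence("GGATGAAATAG"))
```

A sequence that fails these checks usually means the wrong first or last base was typed into "Selected Region" for that species.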

12) It is a good idea to save your data at this point. In the Alignment editor, under Data, choose Save Session and save your data as CFTR_unaligned, since you have not aligned the data yet. The file will save in .masx format.

13) It is finally alignment time! MEGA provides two alignment methods: ClustalW and MUSCLE. We will use the latter, as it is generally more reliable. In the Alignment Explorer, under Alignment, choose Align by MUSCLE (Codons) since our sequences are coding sequences. This tells the program to align the sequences codon by codon, so that gaps are inserted in multiples of three bases and do not disrupt the reading frame.
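The general idea behind a codon-aware alignment can be sketched as follows: align the translated protein sequences, then copy each gap back into the DNA as three gap characters so codons stay intact. This is an illustration of the concept under that assumption, not MEGA's actual code; the function name and toy sequences are invented, and the protein alignment is simply given rather than computed.

```python
def thread_codons(coding_sequence, aligned_protein):
    """Copy gaps from an aligned protein back onto its coding sequence.

    Each amino acid in aligned_protein takes the next codon; each '-'
    becomes '---', so gaps never split a codon.
    """
    codons = [coding_sequence[i:i + 3] for i in range(0, len(coding_sequence), 3)]
    residues = sum(1 for symbol in aligned_protein if symbol != "-")
    if len(coding_sequence) % 3 != 0 or residues != len(codons):
        raise ValueError("protein and coding sequence do not match in length")
    aligned = []
    next_codon = 0
    for symbol in aligned_protein:
        if symbol == "-":
            aligned.append("---")
        else:
            aligned.append(codons[next_codon])
            next_codon += 1
    return "".join(aligned)


# Toy example (made up): two short coding sequences whose proteins
# M-KF and MGKF have already been aligned.
print(thread_codons("ATGAAATTT", "M-KF"))      # ATG---AAATTT
print(thread_codons("ATGGGGAAATTT", "MGKF"))   # ATGGGGAAATTT
```

Both outputs are 12 characters long, and the gap sits between whole codons rather than inside one.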

14) A settings window will now open up. For now, just accept the default settings by clicking on “Compute”. Click on ‘yes’ when the program asks if you want to remove gaps prior to the alignment. Depending on the number and length of your sequences, this can take from seconds to hours. In our case, it will only take seconds.

15) Now it is time to save your alignment. In the Alignment Explorer window, under ‘Data’ choose ‘Export Alignment’. Choose MEGA format and name your file CFTR_aligned. Name your data CFTR and click ‘yes’ to confirm that you have a protein-coding nucleotide sequence. Now you can close your Alignment Explorer window.

16) You should be back at your main MEGA window. To generate a phylogenetic tree, under Phylogeny choose Construct/Test Maximum Likelihood Tree. There are several different algorithms that can be used to build a tree at this point, and each has its pros and cons, so we will just pick this one because it is at the top. Choose your CFTR_aligned.meg file and click on ‘open’. Again, simply use the default values for now.

17) A new window will open that not only shows your phylogenetic tree, but also provides you with a complete figure legend! Under “Image” save your tree as an enhanced metafile (.emf), and also save the figure legend. Insert the tree in the space below, and then copy and paste the figure legend below it. The next page shows what mine looked like.

Figure 1. Molecular Phylogenetic analysis by Maximum Likelihood method.

The evolutionary history was inferred by using the Maximum Likelihood method based on the Tamura-Nei model [1]. The tree with the highest log likelihood (-9656.12) is shown. Initial tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with superior log likelihood value. The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involved 6 nucleotide sequences. Codon positions included were 1st+2nd+3rd+Noncoding. All positions containing gaps and missing data were eliminated. There were a total of 4038 positions in the final dataset. Evolutionary analyses were conducted in MEGA7 [2].

1. Tamura K. and Nei M. (1993). Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10:512-526.

2. Kumar S., Stecher G., and Tamura K. (2016). MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Molecular Biology and Evolution 33:1870-1874.
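The figure legend says that all positions containing gaps and missing data were eliminated, which is why the final dataset has fewer positions than the coding sequence you started with. A minimal sketch of that filtering step is shown here. It assumes gaps are written as "-" and missing data as "?"; the function name and the toy alignment are invented, and this is not MEGA's own code.

```python
def remove_incomplete_columns(aligned_sequences, missing_symbols="-?"):
    """Drop every alignment column in which any sequence has a gap or missing base."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must all be the same length")
    # Keep a column only if no sequence has a gap or missing symbol there.
    kept_columns = [
        i for i in range(length)
        if all(seq[i] not in missing_symbols for seq in aligned_sequences)
    ]
    return ["".join(seq[i] for i in kept_columns) for seq in aligned_sequences]


# Toy alignment (made up): column 3 has a '?' and column 4 has a '-'.
toy_alignment = ["ATG-CA", "ATGACA", "AT?ACA"]
print(remove_incomplete_columns(toy_alignment))   # ['ATCA', 'ATCA', 'ATCA']
```

In the toy example six columns shrink to four, in the same way that the real alignment shrank to the 4038 positions reported in the legend.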