Tutorial 1: Exploring the UCSC Genome Browser

Last updated: June 25, 2012

Open the homepage of the UCSC Genome Browser at http://genome.ucsc.edu/. In the blue bar at the top, click the Genomes link. Set the drop-down menus so that clade is Vertebrate, genome is Human, and assembly is Feb. 2009 (GRCh37/hg19). Then type “adam2” into the position or search term box and click submit. On the subsequent page, click the top link under the UCSC Genes section. This should bring up a window similar to the one shown in Figure 1:

Figure 1: ADAM2 as displayed in the UCSC genome browser with default tracks.

The UCSC Genome Browser is highly configurable. If your view looks very different from Fig. 1, try clicking the default tracks button below the graphic window.

This is a complicated window, so we will go through it from top to bottom:

1. The row labeled move contains buttons that let you move left and right along the chromosome and zoom in and out.
2. The position window shows the genomic coordinates of the chromosome sequence displayed. You can edit these manually or use the move controls to zoom in and out or change position. An entire chromosome can be viewed at once, though this takes a very long time to load. To the right of the jump and clear buttons, the size of the genomic region being viewed is listed. Here the view is taken up by the ADAM2 gene, about 94 kb in length, which is not atypical for a mammalian gene.

BCH-M628 2012 UCSC Genome Viewer tutorials Page 1 of 10 Last updated: June 25, 2012

3. Use the configure button to change the image width, which is set to 800 by default. Try changing it from 800 to 1000 and then click the submit button. You can also adjust the placement of the labels on the graphic window, as well as which tracks are displayed and how those tracks are configured.
4. The next row shows a picture of the chromosome with the chromosome bands in lighter and darker shades of grey. The red vertical bar shows where on the chromosome the current view is located. In this case, chr8 (p11.22) tells you that this gene is located in band 11.22 on the short (p, for petit) arm of chromosome 8. The designation q is for the long arm of the chromosome. Cytogenetic mapping has not been done for all organisms, so this view may not always be available.

5. The main window displays a set of annotation tracks. Right-clicking on any track should give you display configuration options. If you scroll down below the main window, the tracks that are shown on the display have their menu boxes in white, whereas the hidden ones are in grey. Clicking on the link above a menu will give you more information about that track as well as configuration options. The tracks shown in this graphic represent different sequence information that has been placed on the human genome by coordinates relative to the beginning and end of each chromosome. For the tracks representing genes or ESTs, the longer vertical bars in the gene structure represent exons, while the horizontal lines between the vertical bars represent introns. The > and < symbols tell you the orientation of the gene relative to the chromosome. ADAM2 is listed four times because it has 4 alternatively spliced transcripts. Note the difference in exon number between the 3 isoforms in the area surrounded by the red boxes. ADAM2 is coded on the negative strand (<).

Figure 2: Expanded view shown in 1st red box

Below the gene and EST tracks are 3 tracks from the ENCODE project that provide information relevant to the regulation of transcription. The layered H3K27Ac track represents histone acetylation marks. You can configure the track height, among other parameters, to see these marks more clearly. The DNase Clusters track shows regions where the chromatin is hypersensitive to cutting by the DNase enzyme, as assayed in a large number of cell types. Regulatory regions in general tend to be DNase sensitive, and promoters are particularly so. The Txn Factor ChIP track shows DNA regions bound by transcription factors (proteins responsible for modulating gene transcription), as assayed by chromatin immunoprecipitation with antibodies specific to each transcription factor followed by sequencing of the precipitated DNA (ChIP-seq).


Below the regulation tracks is a series of sequence conservation tracks. The darker the color, the greater the conservation between the sequence in the window and the species listed. Change the view to full and then click the refresh button; the track should be a little easier to decipher. It is a bar graph, and the higher the bar, the more conserved the sequence. Not surprisingly, rhesus macaques have the highest degree of conservation with humans, and conservation falls off as you move to species more distantly related to humans.
6. Below the image window are the fine-level controls for changing the start and end of the displayed sequence data.
7. The next row provides default tracks, hide all, configure and refresh buttons for rapidly resetting the tracks display. To the right of those buttons are check boxes for toggling the chromosome display and label display on and off. If your display does not look like the one above, click the default tracks button to reset it.
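
The region size listed next to the position box (item 2 above) follows directly from the coordinates. The following is a minimal Python sketch of that arithmetic, assuming the position is written as chrom:start-end with 1-based, inclusive coordinates. The function names and the example coordinates are hypothetical; they are not the real ADAM2 coordinates.

```python
import re


def parse_position(position):
    """Split a position such as 'chr8:1,000-2,000' into (chrom, start, end)."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {position!r}")
    chrom = match.group(1)
    # The browser accepts commas as thousands separators; drop them.
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def region_size(position):
    """Number of bases in a 1-based, inclusive position."""
    _, start, end = parse_position(position)
    return end - start + 1


# Hypothetical coordinates, chosen only so the arithmetic is easy to check.
print(region_size("chr8:1,000,001-1,094,000"))
```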

Tracks: You will notice that there are many different tracks, representing the different information sources included in the output. There are many tracks to view (> 100 for this human build) and different display modes for them. Scroll down to the track window below the graphic. Only a few have menus displayed; most are in “hide” mode. Change UCSC Genes from “pack” to “dense” and click the Refresh button. What happened to the display? Change RefSeq from “dense” to “pack” and click the Refresh button. What happened? If you want to know what a track represents, click on the link above each track setting. The UCSC browser has many genomes, most of which do not have nearly as many annotation tracks as the human browser.

You can export the graphic window as either a PDF or PS (PostScript) file. Click on the PDF/PS link on the blue navigation bar at the top and a window will open allowing you to export either format. If you are familiar with Adobe Illustrator, it can open a PostScript file so that the image can be edited and printed at high resolution.

If you click on a gene name listed above or to the left of the track, a window will open with a description, a summary and links to other databases that have information about that gene. From the gene summary page, there is a link for obtaining the genomic sequence from the region around the gene. Another way to obtain the genomic sequence is to use the DNA link located in the top blue bar. It will automatically retrieve all of the genomic sequence represented in the current view. You can use the subsequent dialog box to add more DNA to one or both ends. If you want the coding sequence, you have to know the orientation of your gene. If it is on the negative/reverse (<) strand, check the Reverse complement box before using the get DNA button in the dialog box.

Figure 3: Get DNA dialog box for UCSC genome browser.
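
To make concrete what the Reverse complement box does, here is a minimal Python sketch: each base is swapped for its complement and the result is read in the opposite direction. The function name is hypothetical, and the handling of N and of lower-case bases is an assumption made for this illustration.

```python
# Map each base to its complement; N stays N and letter case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


# A short made-up sequence, not taken from any gene.
print(reverse_complement("ATGGCCtaa"))
```

For a gene on the negative strand, applying this to the sequence as displayed gives the sequence in the gene's own 5' to 3' orientation.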


Tutorial 2: Mapping sequences using BLAT

One of the most useful features of the UCSC browser is the “BLAT” search. It accepts either a nucleotide or a protein sequence and conducts a similarity search related to BLAST. The results give the location of your sequence on the genome itself. This is quite useful if you want to map an unknown clone or find a homolog. For this example, we will use an EST fragment and determine where the EST maps in the human genome. Once we find that, we will look for alternatively spliced transcripts.

First, retrieve the EST sequence from NCBI’s Entrez interface by typing “BG334944” into the text box at the top of the page. The resulting page should show one entry in the EST database. Click on the EST link and then click on the FASTA link to bring up the sequence in FASTA format. Copy the sequence from the definition line (identified by the “>”) to the end of the sequence.

Go to http://genome.ucsc.edu. From the main page, choose human in the pull-down menu, then click the “Blat” link from the list on the left-hand side or from the top menu. In the BLAT search window on the next page, paste in the sequence you copied from NCBI. Keep the Assembly selection at Human Feb. 2009. You can change Query type to DNA. Press Submit.

Figure 4: BLAT search window
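
If you want to check what you copied before pasting it into BLAT, the FASTA layout described above (a definition line starting with “>” followed by sequence lines) can be read with a few lines of Python. This is only a sketch; the function name is hypothetical and the record shown is a placeholder, not the real BG334944 sequence.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before any '>' definition line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record for illustration only: NOT the real BG334944 sequence.
example = """>example_est hypothetical EST used for illustration
ACGTACGTAC
GGTTAACC
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```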

The server will return the search results very quickly. In this case, the EST produces several matches on several different chromosomes. This is not unusual, but if you look at the list returned, the first hit has a significantly better score over a longer stretch of the sequence than the other hits.

Figure 5: BLAT Search Results for BG334944.

Click on the details link for the first match to obtain more information about the query sequence and its alignment to the genomic sequence. This will bring up a long web page with 3 major sections: the mRNA sequence, the genomic sequence, and an alignment of the mRNA sequence to the genomic sequence. In the alignment, matching bases in the cDNA and genomic sequence are colored in darker blue and capitalized. Gaps are indicated in lower-case black type. Light blue upper-case bases mark the boundaries of aligned regions on either side of a gap and are often splice sites. Links on the left-hand side allow you to jump from one alignment block to another.

Figure 6: Details of EST BG334944 genomic alignment.
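
The alignment blocks and the gaps between them can also be worked out numerically. The sketch below assumes you have the block sizes and genomic block starts for one hit, as listed in the blockSizes and tStarts columns of BLAT's PSL text output (0-based starts); the function name and the numbers are hypothetical. Gaps in the genomic sequence between consecutive blocks of a spliced EST are candidate introns.

```python
def target_gaps(block_sizes, t_starts):
    """Return (gap_start, gap_end, length) for each gap between aligned blocks.

    block_sizes and t_starts hold one entry per aligned block, with 0-based
    start coordinates on the genomic (target) sequence.
    """
    if len(block_sizes) != len(t_starts):
        raise ValueError("need exactly one start per block")
    gaps = []
    for i in range(len(block_sizes) - 1):
        gap_start = t_starts[i] + block_sizes[i]  # first base after block i
        gap_end = t_starts[i + 1]                 # first base of block i + 1
        if gap_end > gap_start:
            gaps.append((gap_start, gap_end, gap_end - gap_start))
    return gaps


# Hypothetical three-block alignment, not the real BG334944 result.
for gap in target_gaps([100, 150, 80], [1000, 1600, 2500]):
    print(gap)
```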

Use the browser back button to return to the window shown in Figure 5. Click on the browser link for the first match to bring up a graphic of the alignment of BG334944 to chromosome 9, as shown in Figure 7.

Figure 7: Browser view of the alignment of EST BG334944 to chromosome 9.

Notice the high level of histone acetylation at the start of this EST, suggesting active transcription. I had changed the track size to 100, so the track may not appear as large in your view. Approximately halfway down the graphic is a track labeled Human ESTs That Have Been Spliced. This track is first shown in dense mode, with all the ESTs condensed onto a single line. To see the individual ESTs, you can either click on the track’s label or scroll down the page to the track controls, find the one labeled Spliced ESTs and change the drop-down menu to full. If you make a change via the track controls, click the refresh button to update the graphic. NOTE: there are a LOT of ESTs. ESTs are the largest segment of GenBank. Change the view back to dense and click the refresh button.


You can color the ESTs by tissue source. Although not all ESTs have a tissue source and there is no standard nomenclature for naming tissues, this can be a useful feature. Scroll down to the tracks section and click on the Spliced ESTs link. This will open a settings window as shown in Figure 8.

Figure 8: EST settings window

Type brain in the tissue box and select a color next to Filter. Change the Display mode to Squish, Pack or Full and then click the Submit button. If you selected green, the first 2 ESTs from brain should be colored green, as shown in Figure 9.

Figure 9: UCSC browser with ESTs filtered by color.


Tutorial 3: Advanced exercises using UCSC genome browser

For this section of the tutorial, you will do the following:
1. Determine whether the mouse BRCA1 gene has non-synonymous SNPs, color them and get external data about a codon-changing SNP.
2. Find the protein sequence for the rat leptin gene. Use BLAT to find the human homolog. Find SNPs in the human gene and obtain the genomic sequence with color-coded features.
3. Perform an in silico PCR.

3-1: Mouse BRCA1 gene
Skills: basic text search; genome view pull-down menus; filters; links to external resources.
Step by step:
1. Open the UCSC genome browser and select the latest Mouse assembly (July 2007).
2. Enter the text BRCA1 in the text box and click Submit.
3. From the results list, click on the BRCA1 link under RefSeq: NM_009764.
4. Hide all tracks except Base Position, UCSC Genes, RefSeq Genes and SNPs (128) to simplify the browser display.
5. Scroll down to the SNP pull-down menu (it is located under the blue bar titled “Variation and Repeats”). Click the SNPs (128) link above the menu.
6. This opens a configuration page for this track. You will see many options for changing the appearance and features of the SNPs to be displayed. Expand the menu titled Coloring Options to bring up the “SNP Feature for Color Specification” menu. Select “Function” as the setting to change. Change the menus for all types to “black” except Coding Non-Synonymous, which should be set to “green”. Set the “Display mode” to “pack” and click Submit to apply the changes to the browser.
7. Examine the SNP track; the display should show all the SNPs (in pack mode). You can quickly identify the SNPs that are in a coding region and non-synonymous because they are shown in green.
8. Select a green SNP from the display by clicking on it. This should open a SNP details page. From there you can follow the link to dbSNP, where a new window will open with that SNP's entry in the dbSNP database. From here you can learn additional details about the SNP.

NOTE: Filters remain in effect until you reset them or reset everything back to default.
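
As a reminder of what “coding non-synonymous” means, the following Python sketch checks whether a single-base change alters the amino acid encoded by a codon. It assumes the standard genetic code and a codon written in the coding-strand orientation; the function name and example codons are hypothetical, and it is not how the browser or dbSNP assigns SNP function. Note that in this simple version a change to a stop codon also counts as non-synonymous.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, offset, alt_base):
    """Classify a substitution at position offset (0, 1 or 2) within a codon."""
    codon = codon.upper()
    alt_base = alt_base.upper()
    if len(codon) != 3 or offset not in (0, 1, 2) or alt_base not in BASES:
        raise ValueError("expected a 3-base codon, an offset of 0-2 and a DNA base")
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


# Made-up examples: GAA -> GAG keeps glutamate, GAA -> AAA changes it to lysine.
print(classify_coding_snp("GAA", 2, "G"))
print(classify_coding_snp("GAA", 0, "A"))
```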

3-2: Leptin gene
Skills: obtaining protein sequence; BLAT; finding SNPs in exons; “get DNA” sequence with extended case/color options.

1. From the UCSC Genome browser gateway page, select the most current Rat assembly (Nov. 2004). Search for leptin. From the results page, click on the Lep gene (leptin precursor [obesity factor]) link. It should be located on chromosome 4. It does not matter whether you choose Lep from Known Genes or RefSeq genes.
2. From the rat leptin Genome browser page, click on the UCSC Lep gene link on the left side of the window to open the UCSC gene details page.


3. Under the section Sequence and Links to Tools and Databases, there is a link labeled Protein (167 aa). Click on it and a window should open with the sequence in FASTA format. Copy the rat leptin protein sequence and return to the previous page. Click on the BLAT link at the top. Select the latest human assembly and paste the rat leptin sequence into the query box. Click on the Submit button.
4. From the BLAT results page, click on the details link for the top hit. Examine the details page to see the quality of the match. NOTE: you are querying a DNA database with a protein sequence; the match will not be contiguous because of introns.
5. Return to the BLAT results page and click the browser link to see the Genome Viewer location of this match.
6. Now you will download the DNA sequence for this region and find SNPs in exons.
7. Click the hide all button below the browser graphic.
8. Set UCSC Genes to full mode and SNPs (129) to pack mode, then click the refresh button.
9. There should now be only two tracks in the browser. Look at the SNPs in the context of the genomic sequence. How many appear to be in an exon? How many exons are in this gene?
10. To get the genomic DNA, click the DNA link on the blue navigation bar at the top.
11. On the next page, click the extended case/color options button. Choose bold for the UCSC genes and underline for the SNPs. Delete the 255 from the color boxes for the genes and put 255 in one of the color boxes for the SNPs. Click the Submit button and it should return a page with your sequence. The exon sequences should be in bold and all the SNPs should be colored red, green or blue, depending on which box you put the 255 into. You can also experiment with toggling the case so that the genomic sequence is lower case but the exons are upper case.

NOTE: Extended case/color options list only those tracks that are currently shown in the Genome Viewer window.
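
The case toggling mentioned in step 11 can be pictured with a short Python sketch: the whole sequence is written in lower case and the exon intervals are then switched to upper case. The function name, the interval convention (0-based, end-exclusive, relative to the start of the sequence) and the example are assumptions for illustration, not a description of how the browser implements it.

```python
def mark_exons(sequence, exons):
    """Lower-case the sequence, then upper-case each (start, end) exon interval."""
    chars = list(sequence.lower())
    for start, end in exons:
        if not 0 <= start <= end <= len(chars):
            raise ValueError(f"exon {(start, end)} lies outside the sequence")
        chars[start:end] = [base.upper() for base in chars[start:end]]
    return "".join(chars)


# Made-up 12-base sequence with two hypothetical exons.
print(mark_exons("acgtacgtacgt", [(0, 3), (8, 12)]))
```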

3-3: in silico PCR
Skills: in silico PCR of genomic sequences; finding product sizes and Tm

1. Go to the UCSC genome browser homepage and open the PCR tool by clicking either the PCR or the In silico PCR link on the homepage.
2. Select the latest human assembly.
3. Enter this as the forward primer: TTC AAG GAG GCC TTC TCC CT
4. Enter this as the reverse primer: CTG GGG GAG AAG CTG A
5. Click the “flip reverse primer” checkbox if it isn’t already selected.
6. Click the Submit button.
7. The results page should show that these particular primers would amplify 2 different genomic regions, one on chr10 and the other on chr19.
8. The two product sizes differ, and the difference would be detectable.
9. What are the two different product sizes?
10. What is the melting temperature for each primer?
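
The idea behind in silico PCR can be sketched in Python: find the forward primer on a template, find the reverse complement of the reverse primer downstream of it, and report the distance between them as the product size. The sketch below makes several simplifying assumptions that the real tool does not: it requires exact matches, searches only one strand of a template you supply, and estimates Tm with the simple Wallace rule (2 degrees per A/T, 4 degrees per G/C), so its Tm values will not match those reported on the results page. All function names and the max_size parameter are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def clean_primer(primer):
    """Remove the spaces used to group bases and upper-case the primer."""
    return "".join(primer.split()).upper()


def wallace_tm(primer):
    """Rough Tm estimate in degrees C (Wallace rule), not the tool's method."""
    p = clean_primer(primer)
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def find_products(template, forward, reverse, max_size=4000):
    """Return (start, end, size) for exact-match products on this strand."""
    template = template.upper()
    fwd = clean_primer(forward)
    # The reverse primer anneals to the opposite strand, so on this strand
    # we look for its reverse complement.
    rev_site = reverse_complement(clean_primer(reverse))
    products = []
    start = template.find(fwd)
    while start != -1:
        site = template.find(rev_site, start + len(fwd))
        while site != -1:
            end = site + len(rev_site)
            if end - start > max_size:
                break
            products.append((start, end, end - start))
            site = template.find(rev_site, site + 1)
        start = template.find(fwd, start + 1)
    return products


forward = "TTC AAG GAG GCC TTC TCC CT"
reverse = "CTG GGG GAG AAG CTG A"
# Artificial template built around the primers; it is not genomic sequence.
template = ("GG" + clean_primer(forward) + "A" * 50
            + reverse_complement(clean_primer(reverse)) + "CC")
print(find_products(template, forward, reverse))
print(wallace_tm(forward), wallace_tm(reverse))
```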


3-4: Queries of UCSC browser using tables
Skills: SQL-type queries of the track tables

1. Return to the human genome and the leptin gene. It does not really matter where you are, but before you do the next steps, turn off (hide) the following tracks so that pages refresh faster: SNPs, ESTs, Conservation, MGC Genes.
2. Zoom out or move until you have 2-4 genes in the view.
3. Click on Tables in the blue bar at the top. This should open a view that looks something like:

4. The Table Browser allows you to select which information you want to download. Under group, change the menu to the track group that contains the track data you would like to export. For example, if I wanted to export all the STS markers for a particular region, I would change the group to Mapping and Sequencing Tracks and then change the track menu to STS markers.
5. Spend a bit of time playing with the menu options for group and track.
6. The region section is where you define the region of the genome from which to download the information. The default is genome (i.e. the entire genome). You do not want to download all of the ESTs, or any other track, from the entire genome unless you have a lot of time on your hands. Select position; by default, the position listed there is the same as what was displayed in the browser window when you clicked on the Tables link.
7. After you select position, click on the lookup button. Nothing will appear to happen, but behind the scenes it did a query of that region. If you then click on the get output button at the bottom, the data will be displayed in the browser.


8. If you type a name into the output file text box and then click the get output button, a dialog box will appear asking you where to save the file. The file will be a tab-delimited text file containing all of the fields in the table for the records that fall within the positions given.
9. Try exporting all of the spliced ESTs from a selected region of the genome. Export them as a file and import it into Excel.
10. I did the following region: chr7:127709447-127954726 and exported 754 spliced ESTs. When you view the exported data in Excel, you will get a feel for how the tables are set up in the database that drives the UCSC genome browser. The first row contains the names of the columns. A genome browser locates almost all features by their physical location relative to the beginning of a chromosome, so all of my features were on chromosome 7 between positions 127709447 and 127954726.
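
The exported file can also be read without Excel. The Python sketch below parses tab-delimited text whose first row holds the column names and keeps the rows that lie within a region. The column names (qName, tName, tStart, tEnd), the leading # on the header row and the two data rows are assumptions made for illustration; check them against the first row of your own export. The sketch also assumes the table stores 0-based start coordinates, whereas the position box is 1-based.

```python
def read_table(text):
    """Parse tab-delimited text whose first row holds the column names."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # Assumption: the header row may begin with '#', which is stripped here.
    header = lines[0].lstrip("#").split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


def rows_within(rows, chrom, start, end,
                chrom_col="tName", start_col="tStart", end_col="tEnd"):
    """Keep rows lying entirely inside a 1-based, inclusive region."""
    return [
        row for row in rows
        if row[chrom_col] == chrom
        # Assumed 0-based table starts: shift the 1-based region start by one.
        and int(row[start_col]) >= start - 1
        and int(row[end_col]) <= end
    ]


# Hypothetical export with two made-up ESTs; real exports have more columns.
export = "\n".join([
    "#qName\ttName\ttStart\ttEnd",
    "EST_A\tchr7\t127710000\t127712000",
    "EST_B\tchr7\t127960000\t127961000",
])
rows = read_table(export)
inside = rows_within(rows, "chr7", 127709447, 127954726)
print([row["qName"] for row in inside])
```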

Think about how you might use this feature. If you were a geneticist who had identified a marker that segregates with a phenotype of interest (e.g. a disease) and you could only place the marker within 500 kb on a chromosome, you could download all the genes from that region to see what is there. Or you might have done a ChIP-on-chip (chromatin immunoprecipitation on a chip) experiment and identified several regions. You could easily retrieve the genomic sequence corresponding to those regions for further analyses.
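
For more than a handful of regions, retrieving sequence by hand becomes tedious. UCSC also offers a REST API (added after this tutorial was written) with a getData/sequence endpoint. The sketch below only builds the request URL and does not contact the server; the function name is hypothetical, and the parameter layout and the 0-based start convention are stated here as assumptions that should be checked against the current API documentation before use.

```python
def sequence_url(genome, chrom, start, end):
    """Build a getData/sequence URL for a 1-based, inclusive region."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Assumption: the API expects a 0-based start, so shift it by one.
    return (
        "https://api.genome.ucsc.edu/getData/sequence"
        f"?genome={genome};chrom={chrom};start={start - 1};end={end}"
    )


# The region used in step 10 above, on the hg19 assembly.
print(sequence_url("hg19", "chr7", 127709447, 127954726))
```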
