March 2016 ChIP-Seq Exercise

Dena Leshkowitz, Bioinformatics and Biological Computing Unit

Introduction

In this workshop we will learn how to analyse ChIP-Seq data. The data is taken from the article Dicken et al., Transcriptional reprogramming of CD11b+Esam(hi) dendritic cell identity and function by loss of Runx3. PLoS One. 2013 Oct 15;8(10). We will use two biological replicate ChIP-Seq experiments that were conducted to detect Runx3-bound genomic regions using an in-house anti-Runx3 Ab and 30x10^6 CD11c-positive, MACS-isolated (Miltenyi Biotec) classical dendritic cells (DC). For further details please see the article. The BAM files we will be using were generated from only a subset of the reads present in the study.

General remark: the commands you need to type on the command line are in italic format.

Instructions

1. Accessing the Wexac server

Open MobaXterm as in previous exercises and click the terminal icon (red arrow).

You can now type commands that run on the WEXAC cluster.

2. Learning basic Linux commands

In order to list the files and folders in your home directory (your current folder location), type:

ls

To get more information on this command, type:

man ls

The man command gives you information on the command of interest. To exit the manual, type q.

For more commands, look at the supplementary section (#2).

To get more information on the files type:

ls -l

To move to the output directory type:

cd output

3. Loading modules to your environment

With the help of the Environment Modules package it is possible to dynamically modify your environment on the WEXAC cluster in order to run particular software packages. We will now load the modules we need for this exercise. Please type all the commands:

module load MACS/2.0.10
module load bedtools/2.25.0
module load python/2.7.11

In order to see all the modules that were loaded into your environment, type:

module list

You should see three modules. If you don't, please load the missing module or ask an instructor for help. You will not be able to perform the tasks listed below without these modules.

4. Run MACS command

For more details on the tool, see https://github.com/taoliu/MACS

Typing the macs2 command with the -h option prints an explanation of how to run the tool:

macs2 callpeak -h

The following command runs macs2 on the first replicate (the sign ~ is a shortcut to your home directory):

macs2 callpeak -t ~/results/IP1.bam -c ~/results/input1.bam -n macs2_input1_IP1

Question 1: How many files were created by macs2?

Question 2: From the information written to the screen during the run, what is the predicted fragment size (d)?

One of the output files lists all the detected peaks as text. To see the top lines of the file macs2_input1_IP1_peaks.narrowPeak, type:

head macs2_input1_IP1_peaks.narrowPeak

To find the number of peaks, count the number of lines in the macs2_input1_IP1_peaks.narrowPeak file using the command wc -l:

wc -l macs2_input1_IP1_peaks.narrowPeak

Question 3: How many peaks were detected in replicate 1?
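The narrowPeak file has one peak per line, which is why counting lines counts peaks. As a minimal sketch of what each line contains, the Python snippet below splits narrowPeak lines into the ten columns of the ENCODE narrowPeak format and counts the records. The names NarrowPeak and parse_narrowpeak and the example lines are made up for illustration and are not part of the workshop data or of MACS.

```python
from collections import namedtuple

# The ten tab-separated columns of the ENCODE narrowPeak format, in order.
NarrowPeak = namedtuple(
    "NarrowPeak",
    "chrom start end name score strand signal_value p_value q_value summit_offset",
)


def parse_narrowpeak(lines):
    """Turn tab-separated narrowPeak lines into NarrowPeak records."""
    peaks = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        f = line.split("\t")
        peaks.append(
            NarrowPeak(
                f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                float(f[6]), float(f[7]), float(f[8]), int(f[9]),
            )
        )
    return peaks


# Made-up example lines, NOT taken from the workshop BAM/peak files.
example_lines = [
    "chr1\t3000\t3400\tdemo_peak_1\t85\t.\t6.2\t11.5\t8.5\t210\n",
    "chr1\t9100\t9350\tdemo_peak_2\t40\t.\t3.9\t6.1\t4.0\t120\n",
    "chr2\t500\t1000\tdemo_peak_3\t120\t.\t8.8\t15.0\t12.0\t250\n",
]

peaks = parse_narrowpeak(example_lines)
print(len(peaks))                        # number of peaks (one per line)
print(peaks[0].end - peaks[0].start)     # width of the first peak in bp
```

To try this on a real file, pass an open file handle instead of the example list, for instance parse_narrowpeak(open("macs2_input1_IP1_peaks.narrowPeak")).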

Run the macs2 analysis on the second replicate:

macs2 callpeak -t ~/results/IP2.bam -c ~/results/input2.bam -n macs2_input2_IP2

Question 4: How many peaks were detected in the second replicate experiment?

To find the amount of overlap between the two replicate peak files we will use the program intersectBed, which is part of the BEDTools suite.

For an explanation of the program, first type:

intersectBed -h

Question 5: What does the option -wa do?

We will run intersectBed to find the peaks that overlap between the two files and pipe its output to wc -l in order to count the lines of overlapping peaks.

intersectBed -wa -a macs2_input1_IP1_peaks.narrowPeak -b macs2_input2_IP2_peaks.narrowPeak | wc -l

Question 6: What is the amount of overlap between the replicates?

Question 7: What is the amount of overlap if we switch the order of the files (or the -a and -b)?

For good reproducibility we expect to have greater than 50% overlap.
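To make the overlap percentage concrete, here is a minimal sketch in Python of how overlapping peaks can be counted and expressed as a percentage of one replicate. It is an illustration only, not how BEDTools is implemented: the function names and the peak coordinates are invented, peaks are simple (chrom, start, end) tuples, and two peaks are taken to overlap when they share at least one base on the same chromosome.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def wa_lines(a_peaks, b_peaks):
    """One entry of A per overlapping (A, B) pair, similar in spirit to
    what an intersect with -wa reports (an A peak hitting two B peaks
    appears twice)."""
    return [a for a in a_peaks for b in b_peaks if overlaps(a, b)]


def percent_overlapping(a_peaks, b_peaks):
    """Percentage of peaks in A that overlap at least one peak in B."""
    hit = sum(1 for a in a_peaks if any(overlaps(a, b) for b in b_peaks))
    return 100.0 * hit / len(a_peaks)


# Invented coordinates, NOT the workshop results.
rep1 = [("chr1", 100, 200), ("chr1", 500, 900), ("chr2", 100, 300)]
rep2 = [("chr1", 150, 250), ("chr1", 480, 600),
        ("chr1", 850, 950), ("chr2", 400, 500)]

print(len(wa_lines(rep1, rep2)))          # reported lines with rep1 as A
print(percent_overlapping(rep1, rep2))    # % of rep1 peaks found in rep2
print(percent_overlapping(rep2, rep1))    # % of rep2 peaks found in rep1
```

The percentage depends on which replicate is used as the denominator, so it is worth computing it in both directions before comparing it with the 50% guideline. The nested loops compare every pair of peaks, which is fine for a toy example but slow for real peak files.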

5. Analyse peaks using CEAS: Enrichment of Genome Features

CEAS provides statistics on ChIP enrichment at important genome features such as specific chromosomes, promoters, gene bodies, or exons, and infers the genes most likely to be regulated by the binding.

The narrowPeak outputs we have do not comply with the BED format needed to run CEAS. Therefore, we will first make a new file in BED format by cutting the first five columns of the narrowPeak file using the following command:

cut -f1-5 macs2_input1_IP1_peaks.narrowPeak > macs2_input1_IP1_peaks.bed

ceas -b macs2_input1_IP1_peaks.bed --name=replicate1_ceas \
-g /shareDB/CEAS-DB/mm9.refGene
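The conversion from narrowPeak to BED described above simply keeps the first five tab-separated columns (chromosome, start, end, name, score) of every line. A minimal Python sketch of the same idea follows; the function name narrowpeak_to_bed5 and the example line are hypothetical and only illustrate the column selection.

```python
def narrowpeak_to_bed5(lines):
    """Keep only the first five tab-separated columns of each line."""
    bed_lines = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        bed_lines.append("\t".join(fields[:5]))
    return bed_lines


# Made-up narrowPeak line, NOT taken from the workshop data.
example = ["chr1\t3000\t3400\tdemo_peak_1\t85\t.\t6.2\t11.5\t8.5\t210\n"]

for bed_line in narrowpeak_to_bed5(example):
    print(bed_line)
```

The remaining narrowPeak columns (strand, signal value, p-value, q-value and summit offset) are dropped, which leaves a plain five-column BED record.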

In order to view the CEAS pdf output file, look at the sftp pane in MobaXterm (see picture below). You might need to tick the “Follow terminal folder” box (red arrow), and if you still do not see the output folder files, change the path (green arrow). Once you are in the right folder and can see the CEAS pdf file, double click on it in order to open it.

Question 8: Does our transcription factor, Runx3, preferentially bind to promoters?

6. Analyse Peaks using GREAT

Download the peak BED file using the download feature of the sftp pane (see black arrow). Create a folder on disk D and save the file to this folder. Open GREAT and upload the BED file. You should select the mm9 mouse build and click submit.

Without going too deep into the biological question being researched, note that DC are involved in immune reactions.

Question 9: How many times is the word immune found in the enriched terms? How many peaks and how many genes are associated with the first immune term in the output?

Question 10: Looking at the Region-Gene Association Graphs, what is the most frequent binned distance from the peak to the TSS?

We can view the genes associated with the peaks if we expand the Job Description (top of the report; press on the +) and select "View all genomic region-gene associations".

The following window will open:

For the next assignment select a gene which has a peak next to it.

7. Browsing the peaks with a genome browser

We will use the Integrative Genomics Viewer (IGV) to view the mapped reads and the peaks. Open the VNC session and start IGV by double clicking on the IGV icon on the VNC monitor (see red arrow below).

Once the application has opened, load the mm9 genome.

Select File -> “Load from File” (see red arrow below), then select the BAM files from the results directory (ending with bam) and the BED files that you saved from the output directory to your folder on D, and load them into IGV as described below.

After loading the BAM files, type the name of the gene you selected from GREAT in the box marked by the red arrow in the picture below.

Question 11: Where is the peak relative to the gene (near the TSS, upstream…)? Are you convinced that there is enrichment in binding at this location when considering the reads in both IP experiments as well as in the controls? Are there other peaks near this gene? You can zoom out by clicking on the minus sign in the left corner (see blue arrow).

Supplementary

1. Basic Linux commands:

man (command) ...... shows help on a specific command
ls ...... show directory, in alphabetical order
logout ...... logs off system
mkdir ...... make a directory
rmdir ...... remove an empty directory (use rm -r to delete folders with files)
rm ...... remove files
cd ...... change current directory
more ...... views a file, pausing every screenful
grep ...... search for a string in a file
head ...... show the first few lines of a file
tail ...... show the last few lines of a file
df ...... shows disk space available on the system
du ...... shows how much disk space is being used up by folders
chmod ...... changes permissions on a file
cut ...... print selected parts of lines
cp ...... copy file
mv ...... move file
wc -l ...... print the number of lines
sort ...... sort lines of text files
