DNA Microarrays Data Analysis



Matej Stano, UMB SAV, 10.11.2015

Overview

1) Molecular biology background

2) How microarrays technology works

3) Basic statistics

4) Analysis workflow

5) Working with Chipster

6) Hands-on: practical exercise

Molecular biology background

All living things are composed of one or more cells

Human body ≈ 10^13 cells, ≈ 200 cell types

The cell

DNA – deoxyribonucleic acid: A, C, G, T

Human genome
• ≈ 3.2 × 10^9 base pairs
• 23 chromosomes, 2 copies of each
• ≈ 1.5% has protein-coding function … ≈ 20 000 – 25 000 genes

Flow of genetic information

DNA → (transcription) → RNA → (translation) → protein

Protein functions: structural, regulation, signalling, transportation, catalytic, kinetic

              DNA               RNA
Strands       Double-stranded   Single-stranded
Sugar         Deoxyribose       Ribose
Bases         A C G T           A C G U

RNA: polyA tail

Differential gene expression

Compared conditions:
• tissue A / tissue B
• healthy / sick
• before / after medication

DNA → (transcription) → RNA → (translation) → protein

Figure: Src gene → RNA → protein, compared between tumor tissue and healthy tissue

How microarrays technology works

Microarrays technology

• microscope slides that contain an ordered series of samples (DNA, RNA, protein, tissue)
• high-throughput method

ssDNA TARGET hybridizes to the complementary ssDNA PROBE

Probe fabrication
• deposition of DNA fragments
• in-situ synthesis

How spotted microarrays technology works

Cy3, Cy5

How Affymetrix technology works

Affymetrix GeneChip – oligonucleotide microarray

• array: 1.28 cm
• probe cell: 5 μm
• 6.5 million probe cells on 1 array
• each cell = millions of copies of a specific probe
• each probe = 25 nucleotides (25 nt)
• oligonucleotides in one probe cell are specific for a single gene

How are target sequences and probes chosen?

• Target sequences are selected from the 3’ end of the transcript
• Probes should be unique in the genome (unless probe sets are intended to cross-hybridize)
• Probes should not hybridize to other sequences in fragmented cDNA
• Thermodynamic properties of probes

Each gene is represented with a probe set

Figure: Gene A (5’ → 3’) and its probe set. Each probe pair consists of a perfect match (PM) probe cell and a mismatch (MM) probe cell; all probe pairs together form the probe set for one gene.

PM = perfect match = specific hybridization
MM = mismatch = non-specific hybridization

Example: 1415771_at
– Description: Mus musculus nucleolin mRNA, complete cds
– LocusLink: AF318184.1 (NT sequence is 2412 bp long)
– Target sequence is 129 bp long
– 11 probe pairs tiling the target sequence:

gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt

Probe pairs from one probe set are dispersed on the array

Signal calculation

Each gene has 2 measures, calculated over the N probe pairs of its probe set:

1. Average difference (AD): quantitative, numeric value

   $AD = \frac{\sum_{i=1}^{N} (PM_i - MM_i)}{N}$

2. Detection call (DC): qualitative, flags (P, A, M)

   P (present) if PM > MM
   A (absent) if PM …
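The average difference formula can be sketched in a few lines of Python. This is only an illustration of the formula on the slide, not the Affymetrix implementation; the function name and the intensity values are hypothetical.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of (PM_i - MM_i) over the N probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for a probe set with 4 probe pairs (illustrative only)
pm_signal = [1200.0, 950.0, 1100.0, 1010.0]
mm_signal = [300.0, 410.0, 250.0, 380.0]

ad = average_difference(pm_signal, mm_signal)
print(ad)  # (900 + 540 + 850 + 630) / 4 = 730.0
```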

GeneChip construction – photolithography

Tissue RNA sample preparation

1) RNA isolation: tissue → RNA
2) cDNA synthesis, first strand: reverse transcriptase, primer with T7 promoter and TTTTTTTTTT (pairs with the AAAAAAAAAA tail of the mRNA); RNase H
3) cDNA synthesis, second strand: DNA polymerase I
4) cRNA synthesis: T7 RNA polymerase + biotinylated UTP / CTP → biotinylated RNA

Sample hybridization

Figure: biotinylated RNA hybridized to the complementary probes on the array

Staining

streptavidin-phycoerythrin

Affymetrix microarrays principle

Stronger expression of the gene in the cell

More RNAs of the gene in the cell

More RNAs of the gene in the sample

More RNAs of the gene hybridize to specific probes on the microarray

Stronger light signal from the specific microarray probe cell

Affymetrix GeneChip scan

Bioinformatics

How Agilent technology works

SurePrint
• Ink-jet printer technology
• 1 000 000 features / array
• 1 oligo = 60-mer
• 1 gene = 1 probe

Basic statistics

Population, sample, variable

Population – set of individuals or objects of interest
• All obese men in Slovakia

Sample – a subset of the population
• 530 randomly selected obese men from Slovakia

Variable – any measured characteristic that differs between members of the sample/population
• Quantitative:
  • Continuous: weight, level of gene expression
  • Discrete: number of patients, number of over-expressed genes
• Qualitative: healthy/disease, male/female

Statistical characteristics of the sample

Number of subjects: n

n x i i Mean: x  n

Median: value situated in the middle of ordered measurements

96, 78, 90, 62, 73, 89, 92, 84, 76, 86

62, 73, 76, 78, 84, 86, 89, 90, 92, 95

Median = (84+86)/2 = 85 Basic statistics

Mean / median

In general, the median is more robust to extreme values than the mean

Data: 96, 78, 90, 62, 73, 89, 92, 84, 76, 86
• Mean = 82.6
• Median = 85

Data: 96, 78, 90, 12, 73, 89, 92, 84, 76, 86 (62 replaced by 12)
• Mean = 77.6
• Median = 85
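The two data sets above can be checked with Python's standard `statistics` module. A minimal sketch; the variable names are illustrative.

```python
from statistics import mean, median

original = [96, 78, 90, 62, 73, 89, 92, 84, 76, 86]
with_outlier = [96, 78, 90, 12, 73, 89, 92, 84, 76, 86]  # 62 replaced by 12

for label, data in (("original", original), ("with outlier", with_outlier)):
    # The mean shifts with the single extreme value, the median does not
    print(label, "mean =", mean(data), "median =", median(data))
```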

Measure of variability

Data                                      Mean   Range   Variance   Standard deviation
77, 79, 81, 79                            79     4       2.67       1.63
96, 78, 90, 12, 73, 89, 92, 84, 90, 86    79     84      600        24.5

Range: $r = x_{max} - x_{min}$

Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
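As a quick check of the sample formulas (with n − 1 in the denominator), the same two data sets can be run through the standard `statistics` module. This is a sketch; the variable names are illustrative.

```python
from statistics import mean, stdev, variance

narrow = [77, 79, 81, 79]
wide = [96, 78, 90, 12, 73, 89, 92, 84, 90, 86]

for data in (narrow, wide):
    data_range = max(data) - min(data)
    # variance() and stdev() use the sample formulas with n - 1
    print(mean(data), data_range, round(variance(data), 2), round(stdev(data), 2))
```

Both sets share the mean 79, but their spread differs by orders of magnitude, which is what the variability measures capture.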

Measure of correlation

Characterizes the relationship (degree of linear dependence) between two variables
• Height and weight of a man
• Level of expression of genes X and Y

Correlation coefficient: $CC = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$

CC ranges from -1 through 0 to +1

Figure: two scatter plots, one with CC = 0.87 and one with CC = 0.06
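The correlation coefficient can be computed directly from its definition. A minimal sketch in plain Python; the function name and the height/weight values are hypothetical, and the function assumes neither variable is constant (otherwise the denominator is zero).

```python
from math import sqrt
from statistics import mean


def correlation_coefficient(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    covariation = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_x = sum((a - x_bar) ** 2 for a in x)
    ss_y = sum((b - y_bar) ** 2 for b in y)
    return covariation / sqrt(ss_x * ss_y)


# Hypothetical height (cm) and weight (kg) measurements, for illustration only
height = [168, 172, 175, 180, 185, 190]
weight = [65, 70, 68, 80, 84, 95]
print(round(correlation_coefficient(height, weight), 2))
```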

Basic plot types

• Scatter plot
• Histogram
• Box plot (signal intensity): the box between Q1 and Q3 holds 50% of the values with the median inside, 25% of the values lie below Q1 and 25% above Q3; outliers are drawn as separate points

Outliers

• atypical, infrequent data points which do not appear to follow the characteristic distribution of the rest of the data (random errors)
• the definition of an outlier is subjective
• outliers are excluded from the analysis

Probability distribution

Statistical function that assigns a probability to every possible value of a random variable.
• discrete
• continuous

Binomial distribution (discrete), e.g. coin toss:
$f(x) = \binom{n}{x} p^x (1-p)^{n-x}$

Normal distribution (continuous), e.g. probe signal intensity:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Normal distribution / Gaussian distribution

• most important continuous distribution
• many natural phenomena follow it:
  • anatomical parameters
  • physiological parameters
  • gene expression level
• many statistical tests are derived from the normal distribution

Standard normal distribution: μ = 0, σ = 1

Standardization

Normal distribution → standard normal distribution:
$z = \frac{x - \mu}{\sigma}$
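Standardization can be sketched as follows. Note the assumption: μ and σ are estimated here from the sample itself (sample mean and sample standard deviation), which is a common practical substitute for the population parameters in the formula. The function name and the values are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores (x - mu) / sigma, with mu and sigma estimated from the data."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1), an assumption
    return [(v - mu) / sigma for v in values]


# Hypothetical signal intensities, for illustration only
z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print([round(z, 2) for z in z_scores])
```

After standardization the values have mean 0 and standard deviation 1, so variables measured on different scales become comparable.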

Normalization

• process of removing non-biological influences on biological data
• enables array-to-array comparison

Figure: signal intensities of the arrays before and after normalization

Log2 transformation

• a transformation is used when the original data do not fulfill the distributional assumptions
• after log2 transformation, the distribution becomes more normal-like

RMA preprocessing for Affymetrix data

• standard procedure for Affymetrix data normalization
• 4 steps:
  1) Background correction
  2) Quantile normalization
  3) Log2 transformation
  4) Median polish summarization
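To make the quantile normalization step more concrete, here is a minimal sketch in plain Python. It is not the RMA implementation used by Bioconductor or Chipster; the function name and the intensities are hypothetical, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    arrays: list of equally long lists, one list of probe intensities per chip.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest values across chips
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices, smallest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference at its rank
        normalized.append(out)
    return normalized


# Three hypothetical chips with four probes each (illustrative values)
chips = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
for chip in quantile_normalize(chips):
    print([round(v, 2) for v in chip])
```

After this step every chip contains exactly the same set of values, only assigned to different probes, which is what makes the arrays comparable.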

Statistical testing / Hypothesis testing

Statistical testing tries to answer the following questions:
• Can the difference between these observations be explained by chance alone?
• How significant is this difference?

Example: given the heights of children in two samples (Class A and Class B), is there a significant difference in children’s height between Class A and Class B?

We accept that there is a significant difference between two samples if:
• The difference is big
• The difference is small, but regularly present

Significance of the difference is supported by:
• Big difference between sample means
• Small variation in samples (small standard deviation)
• Large sample sizes

Which of these genes is significantly differentially expressed in normal and tumor tissue?

GENE      NORMAL TISSUE            TUMOR TISSUE
          chip1   chip2   chip3    chip4   chip5   chip6
PDCD6     8.08    8.5     8.46     8.34    8.53    8.22
BCL2L10   4.88    5.14    4.88     5.09    4.84    4.92
TOMM6     10.16   9.73    10.7     10.31   10.44   10.55
BCL2L11   4.19    4.44    4.05     4.01    4.16    4.36
SH2B3     7.21    6.22    7.08     7.55    7.18    6.81
CDH3      6.09    9.44    6.57     6.02    7.14    7.66
GNE       7.13    7.44    7.34     8.22    8.73    8.48
HCN4      4.58    4.59    4.96     4.66    4.63    4.55
INSL5     4.14    4.27    4.41     4.34    4.31    4.19
FRAT1     6.57    5.7     6.31     6.46    6.27    6.43
TROAP     6.12    6.65    6.16     6.73    6.62    6.35
MED16     6.78    6.65    6.31     6.75    6.48    6.79
PIGK      6.87    6.97    6.91     6.2     7.12    6.6

Formulate a pair of mutually exclusive hypotheses:

• H0 – null hypothesis: there IS NO difference in means between compared groups

• H1 – alternative hypothesis: there IS a difference in means between compared groups

A priori, H0 is assumed to be true.

To reject H0 (and accept H1), the result of the statistical test must be significant.

Result of a statistical test is a p-value

To reject H0 (and accept H1), the p-value must be below the chosen significance level (p < α), typically α = 0.05 or 0.01

If α = 0.05, there is a 5% risk that we reject H0 although it is actually true. In other words:
• There is a 5% risk that we accept a significant difference between groups although there is none
• When H0 is true, in 1 of 20 cases we conclude there is a significant difference between groups although there is none

Choosing a statistical test

When choosing a test, two essential questions need to be answered:
• are there more than 2 groups to compare?
• should we assume that the data are normally distributed?

                              Are the data normally distributed?
                              YES        NO
2 groups to compare           t-test     Mann-Whitney U test
More than 2 groups            ANOVA      Kruskal-Wallis test

Student’s t-test

Is used to determine whether the difference between two group means is significant.

• H0 – group means are equal
• H1 – group means are unequal

1) One sample t-test:
• compares the sample mean with a certain hypothetical mean
• answers question: does the treatment cause a significant change in expression of genes?

GENE                Expression change
PDCD6               0.08
BCL2L10             -1.88
TOMM6               1.16
BCL2L11             2.19
SH2B3               2.21
CDH3                -1.09
GNE                 -2.13
HCN4                2.66
INSL5               0.04
Mean                0.36
Hypothetical mean   0

2) Independent two sample t-test
• sample sizes are equal
• compares means of two independent samples
• answers question: is there a significant difference in gene expression in normal and tumor cells?
• equal / unequal variance

GENE   NORMAL TISSUE            TUMOR TISSUE
       chip1   chip2   chip3    chip4   chip5   chip6
CAIX   7.13    7.44    7.34     8.22    8.73    8.48
Mean   7.3                      8.48
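For the CAIX example, the test statistic of the equal-variance (pooled) two sample t-test can be computed by hand. This is a sketch in plain Python with a hypothetical function name; it returns only the t statistic and the degrees of freedom, because converting them into a p-value needs the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """t statistic and degrees of freedom, assuming equal variances in both groups."""
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / dof
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_b) - mean(group_a)) / standard_error, dof


normal = [7.13, 7.44, 7.34]  # chip1-chip3
tumor = [8.22, 8.73, 8.48]   # chip4-chip6

t_stat, dof = pooled_t_statistic(normal, tumor)
print(round(t_stat, 2), dof)
```

A large absolute t (big difference in means relative to the variation within the groups) corresponds to a small p-value.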

3) Dependent t-test for paired samples (multiple-measure t-test)
• sample sizes are equal
• compares means of two samples, where individual measurements are paired
• answers question: is there a significant change in expression before and after treatment?

GENE      Before treatment   After treatment
PDCD6     8.08               8.5
BCL2L10   4.88               5.14
TOMM6     10.16              9.73
BCL2L11   4.19               4.44
SH2B3     7.21               8.26
CDH3      6.09               9.44
GNE       7.13               7.44
HCN4      4.58               4.59
INSL5     4.14               4.27
Mean      6.27               6.87

ANOVA – analysis of variance

Is used to determine whether the difference between more than two groups is significant.

• H0 – all group means are equal
• H1 – at least one pair of means differs

Assumptions:
• samples are independent, random samples from k populations
• all k populations have equal variances
• all k populations are normal

The goal of ANOVA is to evaluate the relationship between within-group and inter-group variances

Gene expression

Measurement    Brain   Lungs   Liver   Kidney
Patient 1      12      14      17      10
Patient 2      15      20      31      15
Patient 3      17      23      19      20
Patient 4      20      21      14      25
Patient 5      12      19      26      30
Means          15.2    19.4    21.4    20
Overall mean   19

• within-group variance: spread of the patients' values around the mean of their tissue (within a column)
• inter-group variance: spread of the tissue means around the overall mean

One-way ANOVA

Only one factor is influencing the data – tissue type

Data: the gene expression table above (5 patients × 4 tissue types)

Answers question: is there a significant difference in gene expression between tissue types?
But does not answer: which tissue type has significantly different gene expression?
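The ratio of inter-group to within-group variance (the F statistic of one-way ANOVA) can be computed for the tissue table in plain Python. A minimal sketch with a hypothetical function name; it stops at the F statistic, since the p-value requires the F distribution, which the standard library does not provide.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F = (inter-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)  # k - 1 degrees of freedom
    ms_within = ss_within / (n - k)    # n - k degrees of freedom
    return ms_between / ms_within


tissues = {
    "Brain": [12, 15, 17, 20, 12],
    "Lungs": [14, 20, 23, 21, 19],
    "Liver": [17, 31, 19, 14, 26],
    "Kidney": [10, 15, 20, 25, 30],
}
f_stat = one_way_anova_f(list(tissues.values()))
print(round(f_stat, 3))
```

An F close to 1 means the tissue means differ about as much as expected from the variation among patients within one tissue.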

Two-way ANOVA – randomized complete block design

Two factors are influencing the data – tissue / patient

Data: the gene expression table above; each patient (row) is a block and each tissue type (column) is a group. Inter-block variance is measured between patients, inter-group variance between tissue types.

Answers questions:
• does the level of gene expression depend on tissue type?
• does the level of gene expression depend on patient sample?
But does not answer: which tissue type or patient has significantly different gene expression?

Two-way ANOVA – randomized complete block design

Two factors are influencing the data – tissue / treatment

Gene expression (C and T: the two levels of the drug treatment factor)

Measurement    Brain         Lungs         Liver         Kidney
               C      T      C      T      C      T      C      T
Patient 1      12     11     14     19     17     15     10     12
Patient 2      15     11     20     18     31     28     15     19
Patient 3      17     15     23     20     19     25     20     12
Patient 4      20     16     21     15     14     11     25     22
Patient 5      12     10     19     11     26     22     30     26
Means          15.2   12.6   19.4   16.6   21.4   20.2   20     18.2
Overall mean   17.95

Tissue type:
• H0: Expression does not depend on tissue type
• H1: Expression depends on tissue type
• p-value = 0.23

Drug treatment:
• H0: Expression does not depend on drug treatment
• H1: Expression depends on drug treatment
• p-value = 0.04

Multiple testing

The more analyses are performed on a data set, the more results will meet the significance level by chance

If α = 0.05, there is a 5% risk that we accept a significant difference between groups although there is none

20000 genes on a chip = 20000 t-tests in parallel ≈ 1000 genes (20000 × 0.05) will meet the significance level by chance

p-value corrections (e.g. α = 0.05 → α = 0.005):
• Bonferroni correction: very stringent; of little practical use for more than 1000 genes
• False discovery rate (FDR) – Benjamini/Hochberg: each gene has its own significance level α
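Both corrections can be sketched in plain Python as adjusted p-values, which are then compared with the unchanged α. The function names and the p-values are hypothetical; statistical packages offer tested implementations that should be preferred in a real analysis.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR-adjusted p-values: p * m / rank, made monotone from the largest rank down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values of five genes (illustrative only)
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
print([round(p, 5) for p in bonferroni(raw)])
print([round(p, 5) for p in benjamini_hochberg(raw)])
```

With α = 0.05 the Bonferroni correction keeps two of these genes and so does Benjamini/Hochberg, but the FDR-adjusted values stay much closer to the raw ones, which matters when thousands of genes are tested.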

Replicates

Replicates are a very powerful way to reduce random errors of the data

Types of replicates:
• biological
• technical

3 replicates is the minimum!

Clustering

• Group together similarly expressed genes
• Infer biological consequences

Methods:
• Hierarchical clustering
• K-means
• K-nearest neighbor (KNN)
• Principal component analysis (PCA)
• Self-organizing maps (SOMs)
• Support vector machines (SVMs)

Figures: K-means clustering (expression profiles), hierarchical clustering (heatmap), principal component analysis (PCA)

Pathway analysis – Hypergeometric test

Identify pathways where differentially expressed genes are over-represented

Example: a population with R = 6 and B = 4; a draw of 4 contains r = 1 and b = 3

Probability of drawing exactly r = 1 and b = 3:

$P = \frac{\binom{R}{r}\binom{B}{b}}{\binom{R+B}{r+b}} = \frac{\binom{6}{1}\binom{4}{3}}{\binom{10}{4}} \approx 11\%$

Probability of drawing at most b = 3:

$P = \sum_{i=0}^{b} \frac{\binom{R}{r+b-i}\binom{B}{i}}{\binom{R+B}{r+b}} \approx 99\%$
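The two probabilities of the small example can be reproduced with `math.comb`. A minimal sketch; the function names are hypothetical, with `population` = R + B, `successes` = B and `draws` = r + b.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(X = k): exactly k of the drawn items belong to the 'successes' group."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def hypergeom_tail(k, population, successes, draws):
    """P(X >= k): k or more of the drawn items belong to the 'successes' group."""
    upper = min(successes, draws)
    return sum(hypergeom_pmf(i, population, successes, draws)
               for i in range(k, upper + 1))


# R = 6, B = 4, draw of 4 items
p_exactly_3 = hypergeom_pmf(3, 10, 4, 4)       # 24/210, about 11%
p_at_most_3 = 1 - hypergeom_tail(4, 10, 4, 4)  # 209/210, about 99%
print(round(p_exactly_3, 3), round(p_at_most_3, 3))
```

For pathway analysis the same tail function applies with the chip as the population, the pathway genes as the successes and the DE genes as the draw; whether the bound is counted inclusively changes the result, so it is worth checking which convention a given tool uses.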

KEGG PATHWAY – collection of pathway maps representing our knowledge of the molecular interaction and reaction networks

Example: focal adhesion (FA) pathway
• All genes on the chip: 15000
• Genes in focal adhesion: 207
• DE (differentially expressed) genes: 310
• DE genes in focal adhesion: 10

B = 207
R = 15000 - 207 = 14793
b = 10
r = 310 - 10 = 300

$P(x < 10) = \sum_{i=0}^{b-1} \frac{\binom{R}{r+b-i}\binom{B}{i}}{\binom{R+B}{r+b}} = 0.996$

$P(x \ge 10) = 1 - 0.996 = 0.004$

0.4% probability that 10 or more DE genes will belong to the FA pathway by chance

Analysis workflow

1) Import .CEL files

2) Quality control
• RNA degradation plot
• QC stats
• Relative log expression (RLE)
• Normalized unscaled standard error (NUSE)

3) Normalization
• method (RMA, GCRMA, MAS5)
• chip type (hg133a, hg133plus, hg133plus2)

4) Edit phenodata
• Define which chip belongs to which group of samples

5) Preprocessing
• Filter non-changing genes

6) Analysis
• Statistical testing
• Clustering
• Pathway analysis
• Promoter analysis

7) Visualization

Working with Chipster

Chipster – open-source platform for data analysis
• NGS / microarrays / proteomics / sequence analysis
• 300 analysis tools
• client: graphical Java desktop application
• server: R / Bioconductor

Links:
• http://chipster.csc.fi/
• http://chipster.csc.fi/manual/
• http://chipster.csc.fi/manual/tutorials.html

Getting access to Chipster:
• Server in Finland – 500 € / year
• Server at UMB SAV – ??? / year
• Your own server – for free

Please cite:

M Aleksi Kallio, Jarno T Tuimala, Taavi Hupponen, Petri Klemelä, Massimiliano Gentile, Ilari Scheinin, Mikko Koski, Janne Käki, Eija I Korpelainen: Chipster: user-friendly analysis software for microarray and other high-throughput data (2011) BMC Genomics 12: 507

Microarrays data resources:
• Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress: https://www.ebi.ac.uk/arrayexpress/
• Oncomine: https://www.oncomine.org/

Hands-on: practical exercise

Chipster: http://chipster.embnet.sk:8081/
Username: chipster
Password: chipster

Exercises and data: http://www.embnet.sk/workshop2015/microarrays