DNA microarray data analysis
Matej Stano, UMB SAV, 10.11.2015
Overview
1) Molecular biology background
2) How microarray technology works
3) Basic statistics
4) Analysis workflow
5) Working with Chipster
6) Hands-on: practical exercise

Molecular biology background
All living things are composed of one or more cells
Human body ≈ 10^13 cells, ≈ 200 cell types
The cell
DNA – deoxyribonucleic acid
• Bases: A, C, T, G
Human genome
• ≈ 3.2 × 10^9 base pairs / 23 chromosomes / 2x
• ≈ 1.5% has protein coding function
• ≈ 20 000 – 25 000 genes
Flow of genetic information
• DNA → protein
• Protein functions: structural, regulation, signalling, transportation, catalytic, kinetic
• DNA → (transcription) → RNA → (translation) → protein

DNA vs. RNA
             DNA               RNA
Strands      Double-stranded   Single-stranded
Sugar        Deoxyribose       Ribose
Bases        A C G T           A C G U
• RNA: polyA tail
Differential gene expression
• tissue A / tissue B
• healthy / sick
• before / after medication
• DNA → (transcription) → RNA → (translation) → protein
[Figure: Src gene → RNA → protein in tumor tissue and in healthy tissue; microarray technology is marked in the diagram]

How microarray technology works
Microarray technology
• microscope slides that contain an ordered series of samples (DNA, RNA, protein, tissue)
• high-throughput method
• ssDNA TARGET hybridizes to ssDNA PROBE
Probe fabrication
• deposition of DNA fragments
• in-situ synthesis

How spotted microarray technology works
[Figure: samples labelled with Cy3 and Cy5]

How Affymetrix technology works
Affymetrix GeneChip – oligonucleotide microarray
• Probe cell: 5 μm
• Array: 1.28 cm
• 6.5 million probe cells on 1 array
• Each cell = millions of copies of a specific probe
• Each probe = 25 nucleotides
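The probe cell count quoted above can be checked with simple arithmetic. A minimal sketch, assuming that 5 μm is the edge length of one square probe cell and 1.28 cm is the edge length of the square array (the function and constant names are illustrative only):

```python
# Assumption: a square array with a 1.28 cm edge, tiled edge to edge
# by square probe cells with a 5 micrometre edge.
ARRAY_EDGE_UM = 12_800   # 1.28 cm expressed in micrometres
CELL_EDGE_UM = 5         # edge of one probe cell in micrometres


def probe_cells_per_array(array_edge_um: int, cell_edge_um: int) -> int:
    """Number of square cells that tile a square array."""
    cells_per_edge = array_edge_um // cell_edge_um
    return cells_per_edge ** 2


if __name__ == "__main__":
    total = probe_cells_per_array(ARRAY_EDGE_UM, CELL_EDGE_UM)
    print(f"{total:,} probe cells")  # 6,553,600, i.e. about 6.5 million
```

Under these assumptions 2560 cells fit along one edge, which is consistent with the 6.5 million probe cells stated for one array.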
Oligonucleotides in one probe cell are specific for a single gene (25 nt)
How are target sequences and probes chosen?
• Target sequences are selected from the 3' end of the transcript
• Probes should be unique in the genome (unless probe sets are intended to cross-hybridize)
• Probes should not hybridize to other sequences in fragmented cDNA
• Thermodynamic properties of probes
Each gene is represented with a probe set
[Figure: probe set for Gene A (5' → 3'); each probe pair consists of a perfect match (PM) probe cell and a mismatch (MM) probe cell]
• PM = perfect match = specific hybridization
• MM = mismatch = non-specific hybridization
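A minimal sketch of how the probe pairs of one probe set might be combined into a single number per gene: MM is subtracted from PM pair by pair and the differences are averaged, which is the average difference defined under Signal calculation. The intensities are made-up values, and `average_difference` and `pairs_with_specific_signal` are illustrative names, not part of any Affymetrix software.

```python
from typing import Sequence, Tuple

ProbePair = Tuple[float, float]  # (PM intensity, MM intensity)


def average_difference(probe_set: Sequence[ProbePair]) -> float:
    """Mean of PM - MM over all probe pairs of one probe set."""
    if not probe_set:
        raise ValueError("probe set must contain at least one probe pair")
    return sum(pm - mm for pm, mm in probe_set) / len(probe_set)


def pairs_with_specific_signal(probe_set: Sequence[ProbePair]) -> int:
    """Count the probe pairs in which PM exceeds MM."""
    return sum(1 for pm, mm in probe_set if pm > mm)


if __name__ == "__main__":
    # Hypothetical intensities for a probe set with 4 probe pairs.
    example = [(1200.0, 300.0), (950.0, 410.0), (400.0, 380.0), (150.0, 220.0)]
    print(average_difference(example))          # 347.5
    print(pairs_with_specific_signal(example))  # 3
```

The last pair in the example has MM above PM, so it lowers the average and is not counted as specific signal.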
Example:
• 1415771_at:
  – Description: Mus musculus nucleolin mRNA, complete cds
  – LocusLink: AF318184.1 (NT sequence is 2412 bp long)
  – Target sequence is 129 bp long
  – 11 probe pairs tiling the target sequence:
    gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt
Probe pairs from one probe set are dispersed on the array
Signal calculation
$AD = \frac{1}{N}\sum_{i=1}^{N}(PM_i - MM_i)$, where N is the number of probe pairs in the probe set for one gene
DC = P (present) if PM > MM; A (absent) if PM …
Each gene has 2 measures:
1. Average difference: quantitative, numeric value
2. Detection call: qualitative, flags (P, A, M)

GeneChip construction – photolithography

RNA sample preparation
• Tissue → RNA isolation → RNA
• cDNA synthesis, step 1: reverse transcriptase, primer carrying the T7 promoter and an oligo(dT) stretch (TTTTTTTTTT) that pairs with the polyA tail (AAAAAAAAAA) of the mRNA; RNase H
• cDNA synthesis, step 2: DNA polymerase I
• cRNA synthesis: T7 RNA polymerase with biotinylated UTP / CTP
• Result: biotinylated RNA

Sample hybridization
[Figure: biotinylated RNA hybridizing to complementary probes on the array]

Staining
• streptavidin-phycoerythrin

Affymetrix microarrays principle
Stronger expression of the gene in the cell
→ More RNAs of the gene in the cell
→ More RNAs of the gene in the sample
→ More RNAs of the gene hybridize to specific probes on the microarray
→ Stronger light signal from the specific microarray probe cell

Affymetrix GeneChip scan → Bioinformatics

How Agilent technology works
SurePrint
• Ink-jet printer technology
• 1 000 000 features / array
• 1 oligo = 60-mer
• 1 gene = 1 probe

Basic statistics
Population, sample, variable
Population – set of individuals or objects of interest
• All obese men in Slovakia
Sample – a subset of the population
• 530 randomly selected obese men from Slovakia
Variable – any measured characteristic that differs between members of the sample/population
• Quantitative:
  • Continuous: weight, level of gene expression
  • Discrete: number of patients, number of over-expressed genes
• Qualitative: healthy/disease, male/female

Statistical characteristics of the sample
Number of subjects: n
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: value situated in the middle of the ordered measurements
• Measurements: 96, 78, 90, 62, 73, 89, 92, 84, 76, 86
• Ordered: 62, 73, 76, 78, 84, 86, 89, 90, 92, 96
• Median = (84+86)/2 = 85

Mean / median
In general, the median is more robust and statistically more reliable than the mean
• 96, 78, 90, 62, 73, 89, 92, 84, 76, 86: Mean = 82.6, Median = 85
• 96, 78, 90, 12, 73, 89, 92, 84, 76, 86: Mean = 77.6, Median = 85

Measure of variability
• 77, 79, 81, 79: Mean = 79, Range = 4, Variance = 2.67, Standard deviation = 1.63
• 96, 78, 90, 12, 73, 89, 92, 84, 90, 86: Mean = 79, Range = 84, Variance = 600, Standard deviation = 24.5
Range: $r = x_{max} - x_{min}$
Variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$

Measure of correlation
Characterizes the relationship (degree of linear dependence) between two variables
• Height and weight of a man
• Level of expression of genes X and Y
Correlation coefficient: $CC = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$
CC ranges from -1 through 0 to +1
[Figure: scatter plots with CC = 0.87 and CC = 0.06]

Basic plot types
• Scatter plot
• Histogram
• Box plot
[Figure: box plot of signal intensity showing outliers, Q1, median and Q3]

Outliers
• atypical, infrequent data points which do not appear to follow the characteristic distribution of the rest of the data (random errors)
• the definition of an outlier is subjective
• outliers are excluded from the analysis

Probability distribution
Statistical function that assigns a probability to every possible value of a random variable.
• discrete, e.g. binomial distribution (coin toss): $f(x) = \binom{n}{x} p^x (1-p)^{n-x}$
• continuous, e.g. normal distribution (probe signal intensity): $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Normal distribution / Gaussian distribution
• most important continuous distribution
• many natural phenomena follow it
  • anatomical parameters
  • physiological parameters
  • gene expression level
• many statistical tests are derived from the normal distribution
Standard normal distribution: μ = 0, σ = 1

Standardization
Normal distribution → standard normal distribution: $z = \frac{x - \mu}{\sigma}$

Normalization
• process of removing non-biological influences on biological data
• enables array to array comparison
[Figure: signal distributions before and after normalization]

Log2 transformation
• a transformation is used when the original data do not fulfill distributional assumptions
• after log2 transformation, the distribution becomes more normal-like

RMA preprocessing for Affymetrix data
• standard procedure for Affymetrix data normalization
• 4 steps:
  1) Background correction
  2) Quantile normalization
  3) Median polish summarization
  4) Log2 transformation

Statistical testing / Hypothesis testing
Statistical testing tries to answer the following questions:
• Can the difference between these observations be explained by chance alone?
• How significant is this difference?
Example: according to the difference in children's height in these two samples, is there a significant difference in children's height in Class A and Class B?
[Figure: children's heights in Class A and Class B]
We accept there is a significant difference between two samples if:
• The difference is big
• The difference is small, but regularly present
Significance of the difference is supported by:
• Big difference between sample means
• Small variation in samples (small standard deviation)
• Large sample sizes

Statistical testing / Hypothesis testing
             NORMAL TISSUE            TUMOR TISSUE
GENE         chip1   chip2   chip3    chip4   chip5   chip6
PDCD6         8.08    8.5     8.46     8.34    8.53    8.22
BCL2L10       4.88    5.14    4.88     5.09    4.84    4.92
TOMM6        10.16    9.73   10.7     10.31   10.44   10.55
BCL2L11       4.19    4.44    4.05     4.01    4.16    4.36
SH2B3         7.21    6.22    7.08     7.55    7.18    6.81
CDH3          6.09    9.44    6.57     6.02    7.14    7.66
GNE           7.13    7.44    7.34     8.22    8.73    8.48
HCN4          4.58    4.59    4.96     4.66    4.63    4.55
INSL5         4.14    4.27    4.41     4.34    4.31    4.19
FRAT1         6.57    5.7     6.31     6.46    6.27    6.43
TROAP         6.12    6.65    6.16     6.73    6.62    6.35
MED16         6.78    6.65    6.31     6.75    6.48    6.79
PIGK          6.87    6.97    6.91     6.2     7.12    6.6
Which of these genes is significantly differentially expressed in normal and tumor tissue?

Statistical testing / Hypothesis testing
Formulate a pair of mutually exclusive hypotheses:
• H0 - null hypothesis: There IS NO difference in means between compared groups
• H1 - alternative hypothesis: There IS a difference in means between compared groups
A priori, H0 is accepted as true. To reject H0 (and accept H1), H1 must pass the statistical test.

Statistical testing / Hypothesis testing
The result of a statistical test is a p-value
To reject H0 (and accept H1), the p-value must be below the chosen significance level (p < α)
α = 0.05 or 0.01
If α = 0.05, there is a 5% risk that we reject H0 although it is actually true. In other words:
• There is a 5% risk that we accept that there is a significant difference between groups although there is none!
• In 1 of 20 cases we conclude that there is a significant difference between groups although there is none!
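The 5% risk described above can be checked by simulation. The sketch below draws two groups of three values from the same population, so H0 is true by construction, and counts how often the p-value still falls below α. As a simplifying assumption the population standard deviation is treated as known, so a z-test from the Python standard library stands in for the t-test; all names are illustrative.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def null_p_value(rng: random.Random, n_per_group: int = 3) -> float:
    """Two-sided p-value for two groups drawn from the SAME N(0, 1) population.

    Assumption: the population standard deviation (1.0) is known, so a
    z-test is used in place of the t-test.
    """
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
    standard_error = (2.0 / n_per_group) ** 0.5
    z = diff / standard_error
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def false_positive_rate(n_tests: int = 20_000, alpha: float = 0.05,
                        seed: int = 1) -> float:
    """Fraction of comparisons with a true H0 that are rejected at level alpha."""
    rng = random.Random(seed)
    rejected = sum(null_p_value(rng, 3) < alpha for _ in range(n_tests))
    return rejected / n_tests


if __name__ == "__main__":
    rate = false_positive_rate()
    print(f"fraction of true H0 rejected at alpha = 0.05: {rate:.3f}")
```

The printed fraction should land close to 0.05, i.e. roughly 1 rejection in 20 comparisons even though no real difference exists.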
Choosing a statistical test
When choosing a test, two essential questions need to be answered:
• are there more than 2 groups to compare?
• should we assume that the data are normally distributed?

                                      Are the data normally distributed?
                                      YES        NO
Are there more than 2 groups?   NO    t-test     Mann-Whitney U test
                                YES   ANOVA      Kruskal-Wallis test

Student's t-test
Is used to determine whether the difference between two group means is significant.
• H0 – group means are equal
• H1 – group means are unequal

1) One sample t-test:
• compares the sample mean with a certain hypothetical mean
• answers the question: does the treatment cause a significant change in expression of genes?

GENE                 Expression change
PDCD6                 0.08
BCL2L10              -1.88
TOMM6                 1.16
BCL2L11               2.19
SH2B3                 2.21
CDH3                 -1.09
GNE                  -2.13
HCN4                  2.66
INSL5                 0.04
Mean                  0.36
Hypothetical mean     0

2) Independent two sample t-test
• sample sizes are equal
• compares means of two independent samples
• answers the question: is there a significant difference in gene expression in normal and tumor cells?
• equal / unequal variance

          NORMAL TISSUE            TUMOR TISSUE
GENE      chip1   chip2   chip3    chip4   chip5   chip6
CAIX       7.13    7.44    7.34     8.22    8.73    8.48
Mean       7.3                      8.48

3) Dependent t-test for paired samples (multiple-measure t-test)
• sample sizes are equal
• compares means of two samples, where individual measurements are paired
• answers the question: is there a significant change in expression before and after treatment?

GENE       Before treatment   After treatment
PDCD6       8.08               8.5
BCL2L10     4.88               5.14
TOMM6      10.16               9.73
BCL2L11     4.19               4.44
SH2B3       7.21               8.26
CDH3        6.09               9.44
GNE         7.13               7.44
HCN4        4.58               4.59
INSL5       4.14               4.27
Mean        6.27               6.87

ANOVA – analysis of variance
Is used to determine whether the difference between more than two groups is significant.
• H0 – all group means are equal
• H1 – at least one pair of means differs
Assumptions:
• samples are independent, random samples from k populations
• all k populations have equal variances
• all k populations are normal
The goal of ANOVA is to evaluate the relationship between within-group and inter-group variances

Gene expression
Measurement     Brain   Lungs   Liver   Kidney
Patient 1       12      14      17      10
Patient 2       15      20      31      15
Patient 3       17      23      19      20
Patient 4       20      21      14      25
Patient 5       12      19      26      30
Means           15.2    19.4    21.4    20
Overall mean    19
• within-group variance: spread of the patient values inside each tissue column
• inter-group variance: spread of the tissue means around the overall mean

One-way ANOVA
Only one factor is influencing the data – tissue type (same table as above)
Answers the question: is there a significant difference in gene expression between tissue types?
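The comparison of within-group and inter-group variance can be sketched for the tissue table above. This is a minimal standard-library illustration, not the procedure Chipster runs: it computes the F statistic and compares it with a critical value of 3.24, taken here from a standard F table for α = 0.05 and (3, 16) degrees of freedom. The function name is illustrative.

```python
from statistics import mean

# Expression values from the tissue table (5 patients per tissue).
EXPRESSION = {
    "Brain":  [12, 15, 17, 20, 12],
    "Lungs":  [14, 20, 23, 21, 19],
    "Liver":  [17, 31, 19, 14, 26],
    "Kidney": [10, 15, 20, 25, 30],
}

# Assumption: tabulated critical F value for alpha = 0.05 and df = (3, 16).
F_CRITICAL_3_16 = 3.24


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]
    # Inter-group sum of squares: group means around the overall mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: values around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    f_stat, df_b, df_w = one_way_anova_f(list(EXPRESSION.values()))
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")  # F(3, 16) = 1.06
    print("significant" if f_stat > F_CRITICAL_3_16 else "not significant")
```

For this table the inter-group variance is about as large as the within-group variance, so the F statistic stays well below the critical value.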
But does not answer: which tissue type has significantly different gene expression

Two-way ANOVA – randomized complete block design
Two factors are influencing the data – tissue / patient

Gene expression
Measurement          Brain   Lungs   Liver   Kidney
Patient 1 (block)    12      14      17      10
Patient 2 (block)    15      20      31      15
Patient 3 (block)    17      23      19      20
Patient 4 (block)    20      21      14      25
Patient 5 (block)    12      19      26      30
Means                15.2    19.4    21.4    20
Overall mean         19
• within-group variance: inside each tissue column
• inter-group variance: between the tissue means
• inter-block variance: between the patients (rows)

Answers the questions:
• does the level of gene expression depend on tissue type?
• does the level of gene expression depend on patient sample?
But does not answer: which tissue type or patient has significantly different gene expression

Two-way ANOVA – randomized complete block design
Two factors are influencing the data – tissue / treatment

Gene expression
                Brain         Lungs         Liver         Kidney
Measurement     C      T      C      T      C      T      C      T
Patient 1       12     11     14     19     17     15     10     12
Patient 2       15     11     20     18     31     28     15     19
Patient 3       17     15     23     20     19     25     20     12
Patient 4       20     16     21     15     14     11     25     22
Patient 5       12     10     19     11     26     22     30     26
Means           15.2   12.6   19.4   16.6   21.4   20.2   20     18.2
Overall mean    17.95

• H0: Expression does not depend on tissue type
• H1: Expression depends on tissue type
• p-value = 0.23
• H0: Expression does not depend on drug treatment
• H1: Expression depends on drug treatment
• p-value = 0.04

Multiple testing
The more analyses are performed on a data set, the more results will meet the significance level by chance
If α = 0.05, there is a 5% risk that we accept that there is a significant difference between groups although there is none!
20000 genes on a chip = 20000 t-tests in parallel = 1000 genes will meet the significance level by chance
p-value corrections (α = 0.05 → α = 0.005):
• Bonferroni correction: very stringent, useless for more than 1000 genes
• False discovery rate (FDR) – Benjamini/Hochberg: each gene has its own significance level α

Replicates
Replicates are a very powerful way to reduce random errors in the data
Types of replicates:
• biological
• technical
3 replicates is the minimum!

Clustering
• Group together similarly expressed genes
• Infer biological consequences
Methods:
• Hierarchical clustering
• K-means
• K-nearest neighbor (KNN)
• Principal component analysis (PCA)
• Self-organizing maps (SOMs)
• Support vector machines (SVMs)
[Figure: K-means clustering, expression profiles]
[Figure: hierarchical clustering, heatmap]
[Figure: principal component analysis (PCA)]

Pathway analysis – Hypergeometric test
Identify pathways where differentially expressed genes are over-represented
Urn example: R = 6 red balls and B = 4 blue balls; r + b = 4 balls are drawn
• Probability of drawing exactly r = 1 red and b = 3 blue:
  $P = \frac{\binom{R}{r}\binom{B}{b}}{\binom{R+B}{r+b}} = \frac{\binom{6}{1}\binom{4}{3}}{\binom{10}{4}} \approx 11\%$
• Probability of drawing at most b = 3 blue:
  $P = \sum_{i=0}^{b} \frac{\binom{B}{i}\binom{R}{r+b-i}}{\binom{R+B}{r+b}} \approx 99\%$

KEGG PATHWAY - collection of pathway maps representing our knowledge on the molecular interaction and reaction networks
Focal adhesion: 207 genes

• All genes on the chip: 15000
• Genes in focal adhesion: 207
• DE genes: 310
• DE genes in focal adhesion: 10
B = 207
R = 15000 - 207 = 14793
b = 10
r = 310 - 10 = 300
$P(x < 10) = \sum_{i=0}^{b-1} \frac{\binom{B}{i}\binom{R}{r+b-i}}{\binom{R+B}{r+b}} = 0.996$
$P(x \ge 10) = 1 - 0.996 = 0.004$
0.4% probability that 10 or more DE genes will belong to the FA pathway by chance

Analysis workflow
1) Import .CEL files
2) Quality control
   • RNA degradation plot
   • QC stats
   • Relative log expression (RLE)
   • Normalized unscaled standard error (NUSE)
   [Figure: example QC plots rated good, be aware, bad]
3) Normalization
   • method (RMA, GCRMA, MAS5)
   • chip type (hg133a, hg133plus, hg133plus2)
4) Edit phenodata
   • Define which chip belongs to which group of samples
5) Preprocessing
   • Filter non-changing genes
6) Analysis
   • Statistical testing
   • Clustering
   • Pathway analysis
   • Promoter analysis
7) Visualization

Working with Chipster
Chipster – open-source platform for data analysis
• NGS / microarrays / proteomics / sequence analysis
• 300 analysis tools
• http://chipster.csc.fi/
• http://chipster.csc.fi/manual/
• http://chipster.csc.fi/manual/tutorials.html
Getting access to Chipster:
• Server in Finland - 500 € / year
• Server at UMB SAV - ??? / year
• Your own server – for free
  • client: graphical Java desktop application
  • server: R / Bioconductor
Please cite: M Aleksi Kallio, Jarno T Tuimala, Taavi Hupponen, Petri Klemelä, Massimiliano Gentile, Ilari Scheinin, Mikko Koski, Janne Käki, Eija I Korpelainen: Chipster: user-friendly analysis software for microarray and other high-throughput data (2011) BMC Genomics 12: 507

Microarray data resources:
• Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress: https://www.ebi.ac.uk/arrayexpress/
• Oncomine: https://www.oncomine.org/

Hands-on: practical exercise
Chipster: http://chipster.embnet.sk:8081/
Username: chipster
Password: chipster
Exercises and data: http://www.embnet.sk/workshop2015/microarrays
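As an optional companion to the exercises, the hypergeometric calculation from the pathway analysis part can be reproduced with the Python standard library. This is a minimal sketch with illustrative function names; it uses the urn example (4 blue and 6 red balls) and the focal adhesion numbers (207 pathway genes, 15000 genes on the chip, 310 DE genes). The result is sensitive to whether the term for exactly 10 genes is included in the sum, so the printed tail probability may differ from the rounded 0.004 quoted above.

```python
from math import comb


def hypergeom_pmf(k: int, pathway: int, total: int, drawn: int) -> float:
    """P(exactly k pathway genes among `drawn` genes picked without replacement)."""
    return comb(pathway, k) * comb(total - pathway, drawn - k) / comb(total, drawn)


def prob_at_least(k: int, pathway: int, total: int, drawn: int) -> float:
    """P(k or more pathway genes among the drawn genes)."""
    below_k = sum(hypergeom_pmf(i, pathway, total, drawn) for i in range(k))
    return 1.0 - below_k


if __name__ == "__main__":
    # Urn example: 4 blue and 6 red balls, 4 drawn, exactly 3 of them blue.
    print(f"{hypergeom_pmf(3, pathway=4, total=10, drawn=4):.3f}")  # 0.114

    # Focal adhesion: 207 pathway genes, 15000 genes on the chip, 310 DE genes.
    p = prob_at_least(10, pathway=207, total=15000, drawn=310)
    print(f"P(x >= 10) = {p:.4f}")
```

A small tail probability means that finding 10 or more pathway genes among the DE genes would be unlikely by chance alone.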