Sequence Analysis

• Some algorithms analyze biological sequences for patterns
– RNA splice sites
– Open reading frames (ORFs): stretches of codons
– Propensities in a sequence
– Conserved regions in
• AA sequences (possible active site)
• DNA/RNA (possible protein binding site)

• Others make predictions based on sequence – Protein/RNA secondary structure folding

1 Bioinformatics Sequence Driven Problems
• Genomics
– Fragment assembly of the DNA sequence
• Not possible to read the entire sequence at once.
• The sequence is cut up into small fragments using restriction enzymes.
• Then fragment assembly is needed: fragments are matched by their overlapping similarities.
• An NP-complete problem.
– Finding genes
• Identify open reading frames
– Introns are spliced out.
– Junk in between genes

• Proteomics
– Identification of functional domains in a protein's sequence
• Determining the functional pieces of the sequence
– Protein folding
• 1D sequence → 3D structure
• What drives this process?

2 Genome is Sequenced, What's Next?
• Tracing phylogeny
– Finding family relationships by tracking similarities between species.
• Gene annotation (cooperative genomics)
– Comparison of similar species.
• Determining regulatory networks
– The variables that determine how the body reacts to certain stimuli.
• Proteomics
– From DNA sequence to a folded protein.

3 Modeling
• Modeling biological processes tells us whether we understand a given process.
• Because of the large number of variables in biological problems, powerful computers are needed to analyze certain questions.
• Protein modeling:
– Quantum chemistry imaging algorithms of active sites allow us to view possible bonding and reaction mechanisms.
– Homologous protein modeling is a comparative proteomic approach to determining an unknown protein's tertiary structure.
• Regulatory network modeling:
– Microarray experiments allow us to compare differences in expression between two different states.
– Algorithms for clustering groups of gene expression help point out possible regulatory networks (e.g. WGCNA).
– Other algorithms perform statistical analysis to improve signal-to-noise contrast.
• Systems biology modeling:
– Predictions of whole-cell interactions (organelle processes, expression modeling).
– Currently feasible for specific processes (e.g. Flux Balance Analysis).

4 Back to the problem

5 Pairwise Sequence Alignment Methods

• Visual
• Brute Force
• Dynamic Programming
• Word-Based (k-tuple)

6 Visual Alignments (Dot Plots)

• Matrix – Rows: Characters in one sequence – Columns: Characters in second sequence

• Filling
– Loop through each row; if the characters of the row and column match, fill in the cell
– Continue until all cells have been examined

7 The Dot Matrix

• Established in 1970 by A.J. Gibbs and G.A. McIntyre
• A method for comparing two amino acid or nucleotide sequences

[Dot plot grid: the sequence AGCTAGGA along the horizontal axis and GACTAGGC down the vertical axis]

• each sequence builds one axis of the grid
• one puts a dot at the intersection of the same letters appearing in both sequences
• scan the graph for a series of dots: it reveals similarity, or a string of the same characters
• longer sequences can also be compared on a single page, by using smaller dots

8 Example Dot Plot

9 An entire software module of a telecommunications switch; about two million lines of C

Darker areas indicate regions with many matches (a high degree of similarity). Lighter areas indicate regions with few matches (a low degree of similarity). Dark areas along the main diagonal indicate sub-modules. Dark areas off the main diagonal indicate a degree of similarity between sub-modules. The largest dark squares are formed by redundancies in initializations of signal tables and finite-state machines.

10 The Dot Matrix

The very stringent, self-dotplot: The non-stringent self-dotplot:

11 Noise in Dot Plots
• Nucleic acids (DNA, RNA)
– 1 out of 4 bases matches at random
• Stringency (the condition under which a DNA sequence can bind to related or non-specific sequences; for example, high temperature and lower salt increase stringency, so that non-specific binding, or binding with a low melting temperature, dissociates)
– A window size is chosen
– The percentage of bases matching within the window is set as a threshold
• To filter out random matches, one uses sliding windows
• A dot is printed only if a minimal number of matches occurs
• Rule of thumb:
– larger windows for nucleic acids (only 4 bases, so more random matches)
– a typical window size is 15 with a stringency of 10

12 Reduction of Dot Plot Noise

Self alignment of ACCTGAGCTCACCTGAGTTA

13 The Dot Matrix Two similar, but not identical, sequences An indel (insertion or deletion):

14 The Dot Matrix Self-dotplot of a tandem duplication: A tandem duplication:

15 An inversion: The Dot Matrix Joining sequences:

16 The Dot Matrix

Self dotplot with repeats:

17 The Dot Matrix • the dot matrix method reveals the presence of insertions or deletions • comparing a single sequence to itself can reveal the presence of a repeat of a subsequence – Inverted repeats = reverse complement • Used to determine folding of RNA molecules

• self comparison can reveal several features: – similarity between chromosomes – tandem genes – repeated domains in a protein sequence – regions of low sequence complexity (same characters are often repeated)

18 Insertions/Deletions

19 Repeats/Inverted Repeats

20 Comparing Genome Assemblies

21 Chromosome Y self comparison

22 Available Dot Plot Programs

• Vector NTI software package (under AlignX)
• GCG software package:
– Compare: http://www.hku.hk/bruhk/gcgdoc/compare.html
– DotPlot+: http://www.hku.hk/bruhk/gcgdoc/dotplot.html
• Dotlet (Java applet): http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• Dottup: http://bioweb.pasteur.fr/cgi-bin/seqanal/dottup.pl
• Dotter: http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
• EMBOSS: DotMatcher, DotPath, DotUp

27 The Dot Matrix

• When to use the dot matrix method?
– As a first exploratory comparison, unless the sequences are already known to be very much alike
• Limits of the dot matrix
– it does not readily resolve similarity that is interrupted by insertions or deletions
– it is difficult to find the best possible (optimal) alignment
– most computer programs do not show an actual alignment

28 Dot Plot References

Gibbs, A. J. & McIntyre, G. A. (1970). The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1-11.

Staden, R. (1982). An interactive graphics program for comparing and aligning nucleic-acid and amino-acid sequences. Nucleic Acids Res. 10(9), 2951-2961.

29 Next step
We must define quantitative measures of sequence similarity and difference!
• Hamming distance:
– # of positions with mismatching characters

A G T C
C G T A        Hamming distance = 2

• Levenshtein (or edit) distance:
– # of operations required to change one string into the other (deletion, insertion, substitution)

A G - T C C
C G C T C A    Levenshtein distance = 3
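Both measures are easy to compute directly; a minimal R sketch (the function name hamming is mine, not from the slides; adist() is base R's edit-distance helper):

  hamming <- function(x, y) {
    # Hamming distance: count positions that differ between equal-length strings
    stopifnot(nchar(x) == nchar(y))
    sum(strsplit(x, "")[[1]] != strsplit(y, "")[[1]])
  }
  hamming("AGTC", "CGTA")   # 2
  adist("AGTCC", "CGCTCA")  # 3 (Levenshtein: substitutions, insertions, deletions)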

30 Scoring

• +1 for a match, -1 for a mismatch?
• Should gaps be allowed?
– If yes, how should they be scored?
• What is the best algorithm for finding the optimal alignment of two sequences?
• Is the produced alignment significant?

31 Scoring Matrices

• match/mismatch score – Not bad for similar sequences – Does not show distantly related sequences

• Likelihood matrix – Scores residues dependent upon likelihood substitution is found in nature – More applicable for amino acid sequences

32 Parameters of Sequence Alignment

Scoring Systems: • Each symbol pairing is assigned a numerical value, based on a symbol comparison table.

Gap Penalties: • Opening: The cost to introduce a gap • Extension: The cost to elongate a gap

33 DNA Scoring Systems -very simple

Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

    A  G  C  T
A   1  0  0  0        Match: 1
G   0  1  0  0        Mismatch: 0
C   0  0  1  0        Score = 5
T   0  0  0  1

34 Protein Scoring Systems

Sequence 1: PTHPLASKTQILPEDLASEDLTI        T:G = -2
Sequence 2: PTHPLAGERAIGLARLAEEDFGM        T:T = 5
                                           Score = 48

A scoring matrix is a table of values that describe the probability of a residue pair occurring in an alignment.

Scoring matrix:
    C   S   T   P   A   G   N   D  ..
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   5
D  -3   0  -1  -1  -2  -1   1   6

35 Amino acid exchange matrices

Amino acids are not equal:
1. Some are easily substituted because they have similar:
• physico-chemical properties
• structure
2. Some mutations between amino acids occur more often due to similar codons
The two observations above give us ways to define substitution matrices

36 Properties of Amino Acids • Substitutions with similar chemical properties • Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

37 Scoring Matrices

• table of values that describe the probability of a residue pair occurring in an alignment

• the values are logarithms of ratios of two probabilities

1. probability of random occurrence of an amino acid (diagonal) 2. probability of meaningful occurrence of a pair of residues

38 Protein Scoring Systems

• Scoring matrices reflect:
– the # of mutations required to convert one residue to another
– chemical similarity
– observed mutation frequencies
• Log-odds matrices:
– the values are logarithms of the ratio of the probability of an aligned pair to the probability of that pair under a random alignment
• Widely used scoring matrices:
– PAM
– BLOSUM

39 Scoring Matrices
Widely used matrices
• PAM (Percent Accepted Mutation) / MDM (Mutation Data Matrix) / Dayhoff
– Derived from global alignments of closely related sequences.
– Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
– The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
– PAM-1 corresponds to about 1 million years of evolution.
– For distant (global) alignments, use BLOSUM50, Gonnet, or (still) PAM250.

• BLOSUM (Blocks Substitution Matrix)
– Derived from local, ungapped alignments of distantly related sequences.
– All matrices are directly calculated; no extrapolations are used.
– The number after the matrix (BLOSUM62) refers to the percent identity at which the blocks' sequences were clustered when building the matrix; greater numbers are lesser distances.
– The BLOSUM series of matrices generally performs better than PAM matrices for local similarity searches.
– For local alignment, BLOSUM62 is often superior.

• Structure-based matrices

• Specialized Matrices

40 Scoring Matrices: the relationship between BLOSUM and PAM substitution matrices

• BLOSUM matrices with higher numbers and PAM matrices with low numbers are designed for comparisons of closely related sequences. • BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html

41 Percent Accepted Mutation (PAM or Dayhoff) Matrices
• Studied by Margaret Dayhoff (Dayhoff et al., 1978)
• Amino acid substitutions
– Alignment of common protein sequences
– 1,572 amino acid substitutions
– 71 groups of proteins, 85% similar
• Derived from global alignments of protein families; family members share at least 85% identity
• "Accepted" mutations: those that do not negatively affect a protein's fitness
• Similar sequences organized into phylogenetic trees
• Number of amino acid changes counted
• Relative mutabilities evaluated
• 20 × 20 amino acid substitution matrix calculated

42 Percent Accepted Mutation (PAM or Dayhoff) Matrices

• PAM 1: 1 accepted mutation event per 100 amino acids; PAM 250: 250 mutation events per 100 amino acids

• PAM 1 matrix can be multiplied by itself N times to give transition matrices for sequences that have undergone N mutations

• PAM 250: 20% similar; PAM 120: 40%; PAM 80: 50%; PAM 60: 60%

43 PAM1 matrix: normalized probabilities multiplied by 10,000

    Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
      A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A  9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R     1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N     4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D     6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C     1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q     3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E    10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G    21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H     1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I     2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L     3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K     2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M     1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F     1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P    13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S    28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T    22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W     0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y     1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V    13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

44 Log Odds Matrices

• PAM matrices converted to log-odds matrix – Calculate odds ratio for each substitution • Taking scores in previous matrix • Divide by frequency of amino acid – Convert ratio to log10 and multiply by 10 – Take average of log odds ratio for converting A to B and converting B to A – Result: Symmetric matrix

45 PAM 250

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A    2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R   -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N    0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D    0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C   -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q    0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E    0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G    1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H   -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L   -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F   -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P    1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B    2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z    1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6

A value of 0 indicates the frequency of alignment is random

46 PAM250 log-odds matrix: log(freq(observed)/freq(expected))

47 Blocks Amino Acid Substitution Matrices (BLOSUM)

• Larger set of sequences considered • Sequences organized into signature blocks • Consensus sequence formed – 60% identical: BLOSUM 60 – 80% identical: BLOSUM 80

48 BLOSUM (Blocks Substitution Matrix)

• Derived from alignments of domains of distantly related proteins (Henikoff & Henikoff,1992).

Example block column (five sequences): A, A, C, E, C

• Occurrences of each amino acid pair in each column of each block alignment are counted:
A - C = 4
A - E = 2
C - E = 2
A - A = 1
C - C = 1
• The numbers derived from all blocks were used to compute the BLOSUM matrices.

49 BLOSUM (Blocks Substitution Matrix)

• Sequences within blocks are clustered according to their level of identity.

• Clusters are counted as a single sequence.

• Different BLOSUM matrices differ in the percentage of sequence identity used in clustering.

• The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix.

• Greater numbers mean smaller evolutionary distance.

50 AMINO ACID FREQUENCY

• Genetic information contained in mRNA is in the form of codons, which are translated into amino acids, which then combine to form proteins.
• At certain sites in a protein's structure, amino acid composition is not critical.
• Yet certain amino acids occur at such sites up to six times more often than other amino acids.
• Are the frequencies of particular amino acids simply a consequence of random permutations of the bases, or instead a product of natural selection?

51 AMINO ACID FREQUENCY

• If a particular amino acid is in some way adaptive, then it should occur more frequently than expected by chance. • This can easily be tested by calculating the expected frequencies of amino acids and comparing to observed. • The codons and observed frequencies of particular amino acids are given in the next slide.

52 The codons and observed frequencies of amino acids

53 The codons and observed frequencies of amino acids

54 AMINO ACID FREQUENCY

• The frequencies of the (mRNA) bases in nature are:
– 22.0% uracil
– 30.3% adenine
– 21.7% cytosine
– 26.1% guanine
• The expected frequency of a particular codon can then be calculated by multiplying the frequencies of each base comprising the codon.

55 AMINO ACID FREQUENCY

• The expected frequency of the amino acid can then be calculated by adding the frequencies of each codon that codes for that amino acid. • As an example, – the RNA codons for tyrosine are UAU and UAC, so the random expectation for its frequency is • (0.220)(0.303)(0.220) + (0.220)(0.303)(0.217) = 0.0292. – Since 3 of the 64 codons are nonsense or stop codons, this frequency for each amino acid is multiplied by a correction factor of 1.057.
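A quick sketch of this calculation in R, using the base frequencies from the previous slide (the vector name f is mine):

  f <- c(U = 0.220, A = 0.303, C = 0.217, G = 0.261)  # mRNA base frequencies
  # Tyrosine is encoded by UAU and UAC:
  tyr <- f["U"] * f["A"] * f["U"] + f["U"] * f["A"] * f["C"]  # ~0.0292
  tyr * 1.057  # correcting for the 3 stop codons, ~0.0308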

56 TIPS on choosing a scoring matrix

• Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).

• When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices,

• For distantly related proteins higher PAM or lower BLOSUM matrices.

• For database searching the commonly used matrix is BLOSUM62.

57 Use Different PAM’s for Different Evolutionary Distances

58 (Adapted from D. Brutlag, Stanford)
Scoring Scheme
• Transition mutations (more common)
– purine ↔ purine: A ↔ G
– pyrimidine ↔ pyrimidine: T ↔ C

• Transversion mutations
– purine ↔ pyrimidine: A, G ↔ T, C

    A   G   T   C
A  20  10   5   5
G  10  20   5   5
T   5   5  20  10
C   5   5  10  20

59 DNA Mutations
In addition to using a match/mismatch scoring scheme for DNA sequences, nucleotide mutation matrices can be constructed as well. These matrices are based upon two different models of nucleotide evolution. The first, the Jukes-Cantor model, assumes uniform mutation rates among nucleotides; the second, the Kimura model, assumes two separate mutation rates: one for transitions (where the purine/pyrimidine character stays the same) and one for transversions. Generally, the rate of transitions is thought to be higher than the rate of transversions.

[Diagram: purines A and G, pyrimidines C and T, with arrows marking transitions and transversions]

Transitions: A↔G; C↔T
Transversions: A↔C, A↔T, C↔G, G↔T

60 Nucleic Acid Scoring Matrices
• Two mutation models:
– Jukes-Cantor model of evolution: α = common rate of base substitution
– Kimura model of evolution: α = rate of transitions; β = rate of transversions

61 Nucleotide substitution matrices with the equivalent distance of 1 PAM

A. Model of uniform mutation rates among nucleotides.
      A        G        T        C
A   0.99
G   0.00333  0.99
T   0.00333  0.00333  0.99
C   0.00333  0.00333  0.00333  0.99

B. Model of 3-fold higher transitions than transversions.
      A      G      T      C
A   0.99
G   0.006  0.99
T   0.002  0.002  0.99
C   0.002  0.002  0.006  0.99

62 Nucleotide substitution matrices with the equivalent distance of 1 PAM

E.g. take ~ln (logarithm base e) of the entries of the matrix in the previous slide (for non-diagonal entries):

A. Model of uniform mutation rates among nucleotides.
     A   G   T   C
A    2
G   -6   2
T   -6  -6   2
C   -6  -6  -6   2

B. Model of 3-fold higher transitions than transversions.
     A   G   T   C
A    2
G   -5   2
T   -7  -7   2
C   -7  -7  -5   2

63 Determining Optimal Alignment

• Two sequences: X and Y
– |X| = m; |Y| = n
– Allowing gaps, an alignment of X and Y can be up to m+n columns long

• Brute Force • Dynamic Programming

64 Brute Force

• Determine all possible subsequences for X and Y
– 2^(m+n) subsequences for X, 2^(m+n) for Y!

• Alignment comparisons
– 2^(m+n) × 2^(m+n) = 2^(2(m+n)) = 4^(m+n) comparisons

• Quickly becomes impractical

65 Dynamic Programming

• Used in Computer Science

• Solve optimization problems by dividing the problem into independent subproblems

• Sequence alignment has optimal substructure property – Subproblem: alignment of prefixes of two sequences

– Each subproblem is computed once and stored in a matrix

66 Dynamic Programming

• Optimal score: built upon optimal alignment computed to that point

• Aligns two sequences beginning at ends, attempting to align all possible pairs of characters

67 Dynamic Programming

• Scoring scheme for matches, mismatches, gaps • Highest set of scores defines optimal alignment between sequences • Match score: DNA – exact match; Amino Acids – mutation probabilities

• Guaranteed to provide optimal alignment given: – Two sequences – Scoring scheme

68 Steps in Dynamic Programming • Initialization • Matrix Fill (scoring) • Traceback (alignment)

DP Example: Sequence #1: GAATTCAGTTA; M = 11 Sequence #2: GGATCGA; N = 7

• s(ai, bj) = +5 if ai = bj (match score)
• s(ai, bj) = -3 if ai ≠ bj (mismatch score)
• w = -4 (gap penalty)

69 View of the DP Matrix • M+1 rows, N+1 columns

70 Global Alignment (Needleman-Wunsch)

• Attempts to align all residues of two sequences

• INITIALIZATION: First row and first column set

• Si,0 = w * i

• S0,j = w * j

71 Initialized Matrix(Needleman-Wunsch)

72 Matrix Fill (Global Alignment)

Si,j = MAXIMUM[

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2) ]

73 Matrix Fill (Global Alignment)

• S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 - 4] = MAX[5, -8, -8] = 5

74 Matrix Fill (Global Alignment)

• S1,2 = MAX[S0,1 - 3, S1,1 - 4, S0,2 - 4] = MAX[-4 - 3, 5 - 4, -8 - 4] = MAX[-7, 1, -12] = 1

75 Matrix Fill (Global Alignment)

76 Filled Matrix (Global Alignment)

77 Trace Back (Global Alignment)

• maximum global alignment score = 11 (value in the lower right hand cell).

• Traceback begins in position SM,N; i.e. the position where both sequences are globally aligned.

• At each cell, we look to see where we move next according to the pointers.

78 Trace Back (Global Alignment)

79 Global Trace Back

G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A

80 Checking Alignment Score

G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A

+ - + - + + - + - - +
5 3 5 4 5 5 4 5 4 4 5

5 – 3 + 5 – 4 + 5 + 5 – 4 + 5 – 4 – 4 + 5 = 11!
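The whole global procedure fits in a few lines of R; a minimal sketch of the Needleman-Wunsch fill step under the slide's scoring scheme (score only, no traceback; the function name is mine), which reproduces the score of 11:

  nw_score <- function(s1, s2, match = 5, mismatch = -3, gap = -4) {
    a <- strsplit(s1, "")[[1]]; b <- strsplit(s2, "")[[1]]
    S <- matrix(0, length(a) + 1, length(b) + 1)
    S[, 1] <- gap * (0:length(a))          # initialization: first column
    S[1, ] <- gap * (0:length(b))          # initialization: first row
    for (i in seq_along(a) + 1) for (j in seq_along(b) + 1) {
      s <- if (a[i - 1] == b[j - 1]) match else mismatch
      S[i, j] <- max(S[i - 1, j - 1] + s,  # match/mismatch on the diagonal
                     S[i, j - 1] + gap,    # gap in sequence #1
                     S[i - 1, j] + gap)    # gap in sequence #2
    }
    S[length(a) + 1, length(b) + 1]        # global score in the corner cell
  }
  nw_score("GAATTCAGTTA", "GGATCGA")  # 11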

81 Local Alignment

82 Local Alignment

• Smith-Waterman: obtain highest scoring local match between two sequences

• Requires a modification:
– When a value in the score matrix becomes negative, reset it to zero (the beginning of a new alignment)

83 Local Alignment Initialization

• Values in row 0 and column 0 set to 0.

84 Matrix Fill (Local Alignment)

Si,j = MAXIMUM [

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2), 0 ]

85 Matrix Fill (Local Alignment)

S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 – 4,0] = MAX[5, -4, -4, 0] = 5

86 Matrix Fill (Local Alignment)

S1,2 = MAX[S0,1 -3, S1,1 - 4, S0,2 – 4, 0] = MAX[0 - 3, 5 – 4, 0 – 4, 0] = MAX[-3, 1, - 4, 0] = 1

87 Matrix Fill (Local Alignment)

S1,3 = MAX[S0,2 -3, S1,2 - 4, S0,3 – 4, 0] = MAX[0 - 3, 1 – 4, 0 – 4, 0] =

MAX[-3, -3, -4, 0] = 0

88 Filled Matrix (Local Alignment)

89 Trace Back (Local Alignment)

• maximum local alignment score for the two sequences is 14

• found by locating the highest values in the score matrix

• 14 is found in two separate cells, indicating multiple alignments producing the maximal alignment score

90 Trace Back (Local Alignment)

• Traceback begins in the position with the highest value.

• At each cell, we look to see where we move next according to the pointers

• When a cell is reached where there is not a pointer to a previous cell, we have reached the beginning of the alignment

91 Trace Back (Local Alignment)

92 Trace Back (Local Alignment)

93 Trace Back (Local Alignment)

94 Maximum Local Alignment

G A A T T C - A        G A A T T C - A
|   | |   |   |        |   |   | |   |
G G A T - C G A        G G A - T C G A

+ - + + - + - +        + - + - + + - +
5 3 5 5 4 5 4 5        5 3 5 4 5 5 4 5
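Changing two lines turns the global fill into Smith-Waterman; a sketch (score only, function name mine), which reproduces the maximum local score of 14:

  sw_score <- function(s1, s2, match = 5, mismatch = -3, gap = -4) {
    a <- strsplit(s1, "")[[1]]; b <- strsplit(s2, "")[[1]]
    S <- matrix(0, length(a) + 1, length(b) + 1)  # row 0 / column 0 stay 0
    for (i in seq_along(a) + 1) for (j in seq_along(b) + 1) {
      s <- if (a[i - 1] == b[j - 1]) match else mismatch
      S[i, j] <- max(S[i - 1, j - 1] + s, S[i, j - 1] + gap,
                     S[i - 1, j] + gap, 0)        # reset negatives to zero
    }
    max(S)  # the best local score can be anywhere in the matrix
  }
  sw_score("GAATTCAGTTA", "GGATCGA")  # 14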

95 Overlap Alignment

96 Overlap Alignment
Consider the following problem: find the most significant overlap between two sequences.
Possible overlap relations:
[Figure: possible overlap relations a and b]

Difference from local alignment: here we require the alignment to reach the endpoints of the two sequences.

Overlap Alignment

Initialization: Si,0 = 0 , S0,j = 0 Recurrence: as in global alignment

Si,j = MAXIMUM [

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2) ]

Score: the maximum value in the bottom row or the rightmost column

[Figure: matrix paths for global, local, and overlap alignment]

Overlap Alignment - example

P A W H E A E
H E A G A W G H E E

Scoring scheme: Match: +4; Mismatch: -1; Gap penalty: -5

Overlap Alignment
The best overlap is:
P A W H E A E - - - - - -
- - - H E A G A W G H E E

Pay attention! A different scoring scheme could yield a different result, such as:

Scoring scheme: Match: +4; Mismatch: -1; Gap penalty: -2

- - - P A W - H E A E
H E A G A W G H E E -

Sequence Alignment Variants
• Global alignment (the Needleman-Wunsch algorithm)

– Initialization: Si,0 = i*w, S0,j = j*w
– Score: Si,j = MAX[Si-1,j-1 + s(ai,bj), Si,j-1 + w, Si-1,j + w]
• Local alignment (the Smith-Waterman algorithm)
– Initialization: Si,0 = 0, S0,j = 0
– Score: Si,j = MAX[Si-1,j-1 + s(ai,bj), Si,j-1 + w, Si-1,j + w, 0]
• Overlap alignment
– Initialization: Si,0 = 0, S0,j = 0
– Score: Si,j = MAX[Si-1,j-1 + s(ai,bj), Si,j-1 + w, Si-1,j + w]

104 Linear vs. Affine Gaps
• The scoring matrices used to this point assume a linear gap penalty, where each gap position is given the same penalty score.
• However, over evolutionary time it is more likely that a contiguous block of residues has become inserted or deleted in a region (for example, it is more likely to have 1 gap of length k than k gaps of length 1).
• Therefore, a better scoring scheme uses a higher initial penalty for opening a gap and a smaller penalty for extending it.

105 Linear vs. Affine Gaps

• Gaps have been modeled as linear

• More likely contiguous block of residues inserted or deleted – 1 gap of length k rather than k gaps of length 1

• Scoring scheme should penalize new gaps more

106 Affine Gap Penalty

• wx = g + r(x-1)

• wx: total gap penalty; g: gap-open penalty; r: gap-extension penalty; x: gap length

• The gap penalty is chosen relative to the scoring matrix
– gaps are not excluded
– gaps are not over-included
– Typical values: g = -12; r = -4
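For example, with g = -12 and r = -4, a single gap of length 3 costs w3 = -12 + (-4)(3 - 1) = -20, whereas three separate gaps of length 1 would cost 3 × (-12) = -36; the affine scheme therefore favors one contiguous indel.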

107 Affine Gap Penalty and Dynamic Programming

Mi,j = max { Di-1,j-1 + subst(Ai, Bj), Mi-1,j-1 + subst(Ai, Bj), Ii-1,j-1 + subst(Ai, Bj) }

Di,j = max { Di,j-1 - extend, Mi,j-1 - open }

Ii,j = max { Mi-1,j - open, Ii-1,j - extend }

where M is the match matrix, D is the delete matrix, and I is the insert matrix

108 Drawbacks to DP Approaches

• Dynamic programming approaches are guaranteed to give the optimal alignment between two sequences given a scoring scheme.
• However, the two main drawbacks of DP approaches are that they are compute- and memory-intensive, in the cases discussed to this point taking at least O(n^2) space and between O(n^2) and O(n^3) time.
• Linear-space algorithms have been used to deal with one of these drawbacks. The basic idea is to concentrate only on those areas of the matrix most likely to contain the maximum alignment.
• The most well-known of these linear-space algorithms is the Myers-Miller algorithm.

109 Alternative DP approaches

• Linear space algorithms Myers-Miller

• Bounded Dynamic Programming

• Ewan Birney’s Dynamite Package – Automatic generation of DP code

110 Assessing Significance of Alignment

• When two sequences of length m and n are not obviously similar but show an alignment, it becomes necessary to assess the significance of the alignment.

• The alignment scores of random sequences have been shown to follow a Gumbel extreme value distribution.

111 Gumbel Extreme Value Distribution

• http://roso.epfl.ch/mbi/papers/discretechoice/node11.html • http://mathworld.wolfram.com/GumbelDistribution.html • http://en.wikipedia.org/wiki/Generalized_extreme_value_distribution

112 Probability of Alignment Score

• Using a Gumbel extreme value distribution, the expected number of alignments (E-value) with a score at least S is: E = K·m·n·e^(-λS)

– m, n: lengths of the sequences
– K, λ: statistical parameters dependent upon the scoring system and background residue frequencies

113 • Recall that the log-odds scoring schemes examined to this point normally use an S = 10·log10(x) scoring system.

• We can normalize the raw scores obtained using these non-gapped scoring systems to obtain the amount of bits of information contained in a score (or the amount of nats of information contained within a score).

114 Converting to Bit Scores

A raw score S can be normalized to a bit score using the formula (the standard Karlin-Altschul form): S' = (λS - ln K) / ln 2

• The E-value corresponding to a given bit score can then be calculated as: E = m·n·2^(-S')

115 • Converting to nats is similar.

• We just substitute e for 2 in the above equations.

• Converting scores to either bits or nats gives a standardized unit by which the scores can be compared.

116 P-Value

• P values can be calculated as the probability of obtaining a given score at random.

• P-values can be estimated as: P = 1 - e^(-E)

117 A quick determination of significance
• If a scoring matrix has been scaled to bit scores, then it can quickly be determined whether or not an alignment is significant.
• For a typical amino acid scoring matrix, K = 0.1, and λ depends on the values of the scoring matrix.
• If a PAM or BLOSUM matrix is used, then λ is precomputed.
• For instance, if the log-odds matrix is in units of bits, then λ = ln 2, and the significance cutoff can be calculated as log2(mn).

118 Significance of Ungapped Alignments

• PAM matrices are 10 * log10x • Converting to log2x gives bits of information

• Converting to logex gives nats of information

Quick Calculation: • If bit scoring system is used, significance cutoff is:

log2(mn)

119 Example

• Suppose we have two sequences, each approximately 250 amino acids long that are aligned using a Smith-Waterman approach. • Significance cutoff is:

log2(250 * 250) = 16 bits

120 Example

• Using PAM250, the following alignment is found:

• F W L E V E G N S M T A P T G • F W L D V Q G D S M T A P A G

121 PAM250 matrix

122 Example

• Using PAM250, the score is calculated:

• F W L E V E G N S M T A P T G • F W L D V Q G D S M T A P A G

• S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73

123 Significance Example

• S is in 10 * log10x, so this should be converted to a bit score:

• S = 10 log10(x)
• S/10 = log10(x)
• log2(x) = log10(x) × log2(10)
• log2(x) = (S/10) × log2(10) ≈ (S/10) × 3.32
• So S' = log2(x) ≈ (1/3)S

• S' ~ (1/3)S

124 Significance Example

• S’ = 1/3S = 1/3 * 73 = 24.333 bits

• The significance cutoff is: log2(mn) = log2(250 * 250) = 16 bits

• Since the alignment score is above the significance cutoff, this is a significant local alignment.

125 Estimation of E and P
• When a PAM250 scoring matrix is being used, K is estimated to be 0.09, while λ is estimated to be 0.229.
• For PAM250: K = 0.09; λ = 0.229

• We can convert the raw score to a normalized score (in nats) as follows:
• S' = λS - ln(Kmn)
• S' = 0.229 × 73 - ln(0.09 × 250 × 250)
• S' = 16.72 - 8.63 = 8.09
• P(S' ≥ x) = 1 - exp(-e^(-x))
• P(S' ≥ 8.09) = 1 - exp(-e^(-8.09)) = 3.1 × 10^(-4)

• Therefore, we see that the probability of observing an alignment with a normalized score greater than 8.09 is about 3 in 10,000.
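These numbers are easy to check in R (a sketch of the slide's arithmetic):

  K <- 0.09; lambda <- 0.229; m <- 250; n <- 250; S <- 73
  E <- K * m * n * exp(-lambda * S)      # expected number of chance alignments
  P <- 1 - exp(-E)                       # P-value; both come out ~3.1e-04
  Sprime <- lambda * S - log(K * m * n)  # normalized score, ~8.09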

126 Significance of Gapped Alignments

• Gapped alignments make use of the same statistics as ungapped alignments in determining the statistical significance.

• However, in gapped alignments, the values for λ and K cannot be easily estimated.

• Empirical estimates for gapped scoring systems have been determined by examining the alignments of randomized sequences.

127 Biological Network Analysis and Gene-Gene Interaction Networks

Content
• Rationale of "biological network analysis"

• Inferring gene-gene interaction networks

• Co-regulation (co-expression) networks

• Weighted correlation network analysis of genes (WGCNA)

• Bayesian networks

• Real life examples of gene networks usage

129 Hypothesis

130 What can gene networks provide us? To study common human diseases such as NAFLD and CAD..

131 132 Weighted correlation network analysis (WGCNA)*

*Citation for WGCNA summary: SIB course, Nov. 15-17, 2016, "Introduction to Biological Network Analysis" by Leonore Wigger, with Frédéric Burdet and Mark Ibberson

133 Overview of WGCNA

• Theory 1: Weighted correlation network, split into modules • Theory 2: Identify modules and genes of interest • Input data: – Gene expression data (microarray or RNA-Seq) Recommendation: at least 20 individuals – Clinical/phenotypical traits from the same individuals (optional) e.g. weight, insulin level, glucose level

134 Aims of WGCNA: Inferring a gene-gene similarity network

135 Aims: Correlate phenotypic traits to gene modules* (optional aim)

* “Modules” found in WGCNA: Groups or clusters of co-expressed genes with similar expression profiles over a large group of individuals 136 Aims: Identify “key driver” genes in modules

137 Rationale

• Genes with similar expression patterns are of interest because they may be – tightly co-regulated – functionally related – members of the same pathway

• WGCNA encourages hypotheses about genes based on their close network neighbors.

138 Workflow

139 But first: preprocess the gene expression data

- Remove outlier samples e.g. by creating dendrogram for samples (take transpose of the expression matrix for this) and identifying the outliers

- Remove the lowly expressed genes by eliminating the genes that do not have expression levels > 0 at least for 50% of all samples

- Normalize gene expression by log2(expr+1). We need to transform expression values with logarithm base 2, since the correlation measures (e.g. Pearson coefficient) assume that the expression values of each gene are normally distributed (approximately) across the samples.

- Extract expression table with the most variable probes/genes: e.g. find the top ~15,000 highly variable genes based on the “coefficient of variation” (cv) measure.

140 But first: preprocess the gene expression data
• Remove the lowly expressed genes:
  Min_Exp_level <- 0
  Gene_present <- logical()
  for (i in 1:nrow(Expr_Mat))
    Gene_present[i] <- (sum(Expr_Mat[i, ] > Min_Exp_level) / ncol(Expr_Mat)) >= 0.5
  Expr_Mat <- Expr_Mat[Gene_present, ]
• Transform expression values with logarithm base 2 (normalization):
  Expr_Mat <- log2(Expr_Mat + 1)
• Extract the expression table with the most variable probes/genes:
  cv <- function(x) sd(x) / mean(x)  # coefficient of variation (assumed definition; not base R)
  cvs <- NULL
  a <- Expr_Mat
  for (m in 1:nrow(a)) { cv_individual <- cv(a[m, ]); cvs <- rbind(cvs, cv_individual) }
  a_sort <- data.frame(a)[order(-cvs), ]
  # Keep the top e.g. 90th percentile, which leaves us with ~15,000 genes:
  perc <- 0.90
  a_keep <- a_sort[1:(perc * nrow(a_sort)), ]
  Expr_Mat <- a_keep

141 Construct weighted correlation network

• Correlation: A statistical measure for the extent to which two variables fluctuate together. • Positive correlation: variables increase/decrease together • Negative correlation: variables increase/decrease in opposing direction • Caution: There is no causality here!!

142 Correlation examples (Pearson correlation, R^2)

143 http://www.dummies.com/how-to/content/how-to-interpret-a-correlation-coefficient-r.html Correlation measures implemented in WGCNA

• Pearson • Spearman • Kendall's tau • Biweight midcorrelation (bicor)

144 Choosing a correlation method

• Fastest, but sensitive to outliers: Pearson correlation, cor(x), the "standard" measure of linear correlation
• Less sensitive to outliers but much slower: biweight midcorrelation, bicor(x), robust, recommended by the authors for most situations [needs modification for correlations involving binary/categorical variables]
• Spearman correlation, cor(x, method="spearman"), rank-based, works even if the relationship is not linear (but the relationship is expected to be monotonic), less sensitive to gene expression differences [can be used as-is for correlations involving binary/categorical variables]

• Default correlation method in WGCNA: cor(Pearson). • Caveat: use it only if there are no outliers, or for exercises/tutorials.

145 Adjacency matrix calculation
• Compute a correlation raised to a power β between every pair of genes (i, j): a(i,j) = |cor(x_i, x_j)|^β
• Effect of raising the correlation to a power:
– Amplifies the disparity between strong and weak correlations
– Example: power term β = 4

146 Adjacency matrix calculation – cont’d

147 Adjacency matrix calculation – cont’d

148 Connectivity (degree) in a weighted network

• Connectivity of a gene: Sum of the weights of all edges connecting to this gene • Example: Connectivity of gene 1 is: 0.55 + 0.39 + 0.09 = 1.03
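In the WGCNA R package these two steps are one-liners; a minimal sketch, assuming an expression matrix datExpr (samples in rows, genes in columns):

  library(WGCNA)
  adj <- adjacency(datExpr, power = 4, type = "unsigned")  # a(i,j) = |cor(x_i, x_j)|^4
  k <- rowSums(adj) - 1  # connectivity: sum of edge weights, minus the self-adjacency of 1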

149 Picking a power term

• Selection criterion: pick the lowest possible β that leads to an approximately scale-free network topology:
– few nodes with many connections ("hubs")
– many nodes with few connections

• Degree distribution follows a power law:
– the probability for a node to have k connections is proportional to k^(-γ)

150 Source: Carlos Castillo: Effective Web Crawling, PhD Thesis, University of Chile, 2004

151 Why scale-free network topology?

• Barabási et al.* found many types of networks in many domains to be approximately scale-free, including metabolic and protein interaction networks

• So, aim in WGCNA is: Building a biologically “realistic” network.

* Barabási, Albert-László; Bonabeau, Eric (May 2003). "Scale-Free Networks"(PDF). Scientific American. 288(5): 50–9.

152 Pick a power term: Visual Aid in WGCNA

• Left plot: choose power 6, the lowest power term for which the topology approximately fits a scale-free network (on or above the red horizontal line).
• Right plot: mean connectivity drops as the power goes up; it must not drop too low.
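The plots on this slide come from the package's soft-threshold helper; a sketch, again assuming datExpr exists:

  sft <- pickSoftThreshold(datExpr, powerVector = c(1:10, seq(12, 20, 2)),
                           networkType = "unsigned")
  sft$powerEstimate  # lowest power reaching the scale-free fit threshold (here, 6)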

153 Next Step: Detect modules of co-expressed genes

154 4 steps to get from network to modules 1. Compute dissimilarity between genes: “topological overlap measure dissimilarity”

2. Perform hierarchical clustering of genes: obtain tree structure

3. Divide clustered genes into modules: cut tree branches

4. Optional: Merge very similar modules: use module “eigengenes”

155 Step 1: Compute dissimilarity between genes
• Why do we use the Topological Overlap Measure (TOM)?
– TOM is a pairwise similarity measure between network nodes (genes)
– TOM(i,j) is high if genes i and j have many shared neighbors, i.e. the overlap of their network neighborhoods is large
• So a high TOM(i,j) implies that the genes have similar expression patterns

• How to calculate TOM similarity between two nodes:
1. Count the number of shared neighbors ("agreement" of the sets of neighboring nodes)
2. Normalize to [0,1]:
– TOM(i,j) = 0 means no overlap of network neighbors
– TOM(i,j) = 1 means identical sets of network neighbors

• NOTE: TOM was generalized to the case of weighted networks in Zhang and Horvath (2005), the first WGCNA paper
– In a weighted network, all nodes are neighbors, so counting them is not informative.
– Instead, compute the agreement of the sets of neighboring nodes based on edge strengths.

156 • But we need a dissimilarity measure for clustering!!

• TOM as a similarity measure can be transformed into a dissimilarity measure: distTOM = 1 - TOM.
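In R this step is, as a sketch (using adj from the earlier adjacency sketch):

  TOM <- TOMsimilarity(adj)  # shared-neighbor agreement of the adjacency matrix, in [0,1]
  dissTOM <- 1 - TOM         # the dissimilarity used for clustering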

157 Step 2: Perform hierarchical clustering of genes • Compute gene dendrogram:

159 Step 3: Divide clustered genes into modules
• Gene dendrogram and detected modules

Dynamic tree cut algorithm groups genes into modules (corFnc = "pearson"; power = 6; min.modulesize = 30)

160 Step 4 (Optional): Merge very similar modules
• A module eigengene is a 1-dimensional data vector summarizing the expression data of the genes that form a module
• How it is computed: the 1st principal component of the module's expression data
• What it is used for: to represent the module in mathematical operations:
– modules can be correlated with one another
– modules can be clustered together (so we can combine them)
– modules can be correlated with external traits

161 Clustering of module eigengenes

[Figure: dendrogram of module eigengenes; dissimilarity measure: 1 - cor(MEigengenes); a horizontal line marks the cutoff for merging]

Merge modules whose dissimilarity is below the merging cutoff.

162 Gene dendrogram and detected modules, before and after merging
[Figure: modules before merging vs. modules after merging; corFnc = "pearson", power = 6, min.modulesize = 30]
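Steps 2-4 as a minimal R sketch (parameter values follow the slides; a merging cutoff of 0.25 corresponds to an eigengene correlation of 0.75):

  geneTree <- hclust(as.dist(dissTOM), method = "average")   # step 2: gene dendrogram
  library(dynamicTreeCut)
  modules <- cutreeDynamic(dendro = geneTree, distM = dissTOM,
                           minClusterSize = 30)              # step 3: cut tree branches
  moduleColors <- labels2colors(modules)                     # WGCNA labels modules by color
  merged <- mergeCloseModules(datExpr, moduleColors,
                              cutHeight = 0.25)              # step 4: merge similar modules
  mergedColors <- merged$colors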

Dynamic Tree cut procedure • Minimal module size: typically 20 or 30 • CutHeight: try different heights such as 0.95 or 0.9995, etc.

Module Merging procedure • Cutoff for module eigengene dendrogram: typically between 0.15 and 0.25 – check if clusters look ok on dendrogram • Merge once or several times? Usually once, but merge step can be repeated: – if some modules are very similar – if we want larger modules

165 Correlate modules to external traits
Examples: trait variables obtained from some samples/individuals (preferably the same samples from which we generated the expression data set):
• weight_g • FFA • Aneurysm
• length_cm • Glucose • Aortic_cal_M
• ab_fat • LDL_plus_VLDL • Aortic_cal_L
• other_fat • MCP_1_phys • CoronaryArtery_Cal
• total_fat • Insulin_ug_l • Myocardial_cal
• Trigly • Glucose_Insulin • BMD_all_limbs
• Total_Chol • Leptin_pg_ml • BMD_femurs_only
• HDL_Chol • Adiponectin • UC
• Aortic lesions

How to compute correlations: each module eigengene to each trait variable: cor(MEs, traitDat) 166 Example
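A sketch of this computation, assuming a trait table traitDat with one row per sample:

  MEs <- moduleEigengenes(datExpr, moduleColors)$eigengenes  # 1st PC of each module
  moduleTraitCor <- cor(MEs, traitDat, use = "p")            # eigengene x trait correlations
  moduleTraitP <- corPvalueStudent(moduleTraitCor, nSamples = nrow(datExpr))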

167 Module preservation analysis in WGCNA

Is my network module preserved and reproducible?*
*Langfelder et al. PLoS Comput Biol 7(1): e1001057.

168 Network-based module preservation statistics

" Input: module assignment in reference data.

ref test " Adjacency matrices in reference A and test data A

" Network preservation statistics assess preservation of

- 1. network density: Does the module remain densely connected in the test network?

- 2. connectivity: Is hub gene status preserved between reference and test networks?

- 3. separability of modules: Does the module remain distinct in the test data?

169 Several connectivity preservation statistics
For general networks, i.e. the inputs are adjacency matrices:
– cor.kIM = cor(kIM^ref, kIM^test): correlation of intramodular connectivity across module nodes
– cor.ADJ = cor(A^ref, A^test): correlation of adjacency across module nodes
For correlation networks, i.e. the inputs are sets of variable measurements:
– cor.Cor = cor(cor^ref, cor^test)
– cor.kME = cor(kME^ref, kME^test)
One can derive relationships among these statistics in the case of weighted correlation networks.

170 Choosing thresholds for preservation statistics based on permutation test

" For correlation networks, we study 4 density and 3 connectivity preservation statistics that take on values <= 1

" Challenge: Thresholds could depend on many factors (number of genes, number of samples, biology, expression platform, etc.)

" Solution: Permutation test. Repeatedly permute the gene labels in the test network to estimate the mean and standard deviation under the null hypothesis of no preservation.

" Next we calculate a Z statistic: Z=(observed-mean)/sd

" We have had 4 density and 3 connectivity preservation statistics:

171 Permutation test for estimating Z scores (gene modules in adipose)

" For each preservation measure we report the observed value and the permutation Z score to measure significance ( Z=(observed-mean)/sd).

" Each Z score provides answer to “Is the module significantly better than a random sample of genes?”

" Summarize the individual Z scores into a composite measure called Z.summary

" Zsummary < 2 indicates no preservation,

" 2

" Zsummary>10 strong evidence

172 Summary preservation • Standard cross-tabulation based statistics are intuitive – Disadvantages: i) only applicable for modules defined via a module detection procedure, ii) ill suited for ruling out module preservation • Network based preservation statistics measure different aspects of module preservation – Density-, connectivity-, separability preservation • Two types of composite statistics: Zsummary and medianRank. • Composite statistic Zsummary based on a permutation test – Advantages: thresholds can be defined, R function also calculates corresponding permutation test p-values – Example: Zsummary<2 indicates that the module is *not* preserved – Disadvantages: i) Zsummary is computationally intensive since it is based on a permutation test, ii) often depends on module size • Composite statistic medianRank – Advantages: i) fast computation (no need for permutations), ii) no dependence on module size. – Disadvantage: only applicable for ranking modules (i.e. relative preservation)

173 Application: studying the preservation of human brain co-expression modules in chimpanzee brain expression data.

Modules defined as clusters (branches of a cluster tree)

Data from Oldham et al. 2006
Preservation of modules between human and chimpanzee brain networks

175 2 composite preservation statistics

• Zsummary is above the threshold of 10 (green dashed line), i.e. all modules are preserved. • Zsummary often shows a dependence on module size which may or may not be attractive • In contrast, the median rank statistic is not dependent on module size. • It indicates that the yellow module is the most preserved one

176 Implementation and R software tutorials, WGCNA R library

• General information on weighted correlation networks • Google search – “WGCNA” – “weighted gene co-expression network”

" R function modulePreservation is part of WGCNA package

" Tutorials: preservation between human and chimp brains www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/ModulePreservatio n

177 Pseudocode for modulePreservation()

setLabels = c("Control", "Disease")
multiExpr = list(Control = list(data = dataExp_Control), Disease = list(data = dataExp_Disease))
multiLabel = list(Control = moduleLabelsControl, Disease = moduleLabelsDisease)
mp = modulePreservation(multiExpr, multiLabel, referenceNetworks = c(1:2),
                        nPermutations = 100, randomSeed = 1, quickCor = 0, verbose = 3)
save(multiExpr, multiLabel, mp, file = "modulePreservationLabel_Alldata.RData")
ref = 1; test = 2
Zsummary1 = mp$preservation$Z[[ref]][[test]][, 2]
names(Zsummary1) = rownames(mp$preservation$Z[[ref]][[test]])
low.preserved1 = Zsummary1[which(Zsummary1 < 2)]
ref = 2; test = 1
Zsummary2 = mp$preservation$Z[[ref]][[test]][, 2]
names(Zsummary2) = rownames(mp$preservation$Z[[ref]][[test]])

low.preserved2 = Zsummary2[which(Zsummary2 < 2)]

178 Introduction to Bayesian Networks

Content

• Definition of Bayesian networks (BN)

• An example, Rimbanet…

• Mandatory and optional input files

• Comparison of BN tools

180 Gene-gene interaction networks

Co-expression networks

181 What is a DAG

• A directed acyclic graph (DAG) is a finite directed graph with no directed cycles.

• There is no way to start at any vertex v and follow a consistently-directed sequence of edges that eventually loops back to v again.

182 Bayes’ theorem

• Bayes’ theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

• Bayes’ theorem is stated as:

P(A | B) = P(B | A) · P(A) / P(B)

183 Bayes’ theorem - an example

• Suppose a test to detect a disease in humans is 99% sensitive and 95% specific. This means:
– For patients, the test gives a positive result 99% of the time (TP) and a negative result 1% of the time (FN): P(+ | C) = 0.99 and P(- | C) = 0.01
– For healthy samples, the test gives a negative result 95% of the time (TN) and a positive result 5% of the time (FP): P(- | H) = 0.95 and P(+ | H) = 0.05

• Q: If a randomly selected individual’s test result is positive, what is the probability that he/she carries the disease? Hence, P(C | +) = ?
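A sketch of the answer: Bayes' theorem also needs the prevalence of the disease, which the slide leaves unspecified; assuming (hypothetically) P(C) = 0.001 and P(H) = 0.999:

P(C | +) = P(+ | C) P(C) / [ P(+ | C) P(C) + P(+ | H) P(H) ]
         = (0.99)(0.001) / [ (0.99)(0.001) + (0.05)(0.999) ] ≈ 0.019

So even after a positive result, the probability of disease is only about 2% under this assumed prevalence, because false positives from the much larger healthy population dominate.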

184 Bayesian networks (BNs)

• BNs are directed acyclic graphs (DAGs) in which the edges of the graph are defined by conditional probabilities that characterize the distribution of the states of each node given the states of its parents. The model is scored by P(M | D), where M: network model, D: observation data.

• We can define a partitioned joint probability distribution over all nodes: P(X) = ∏_i P(X_i | Pa(X_i)), where Pa(X_i) is the parent set of X_i.

185 Application of Bayes' theorem in gene regulatory network inference
• The number of possible network structures (M) increases with the number of nodes.

• Search of all possible structures to find the best supported one by the data is not feasible.

• Solution: We can use Markov chain Monte Carlo (MCMC) simulation to identify a particular amount (e.g. 1,000) of plausible networks.

• A consensus network is obtained by combing these plausible networks.

186 Markov chain

• State of the system at time “t+1” is predicted by solely using the state at time “t”, i.e. previous states (t-1, t-2,..) will not be used for predicting t+1.

[Diagram: a three-state Markov chain over {grapes, lettuce, cheese}, with transition probabilities (0.4, 0.5, 0.6, 0.1, 0.4, 0.5, 0.5) labeling the edges]
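A minimal R sketch of simulating such a chain; the transition matrix below is a hypothetical stand-in for the diagram's edge probabilities:

  states <- c("grapes", "lettuce", "cheese")
  P <- matrix(c(0.4, 0.5, 0.1,
                0.4, 0.1, 0.5,
                0.5, 0.0, 0.5),
              nrow = 3, byrow = TRUE, dimnames = list(states, states))  # rows sum to 1
  x <- character(10); x[1] <- "grapes"
  for (t in 2:10)
    x[t] <- sample(states, 1, prob = P[x[t - 1], ])  # state t depends only on state t-1
  x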

187 Markov chain Monte Carlo (MCMC)

• MCMC methods are a class of sampling algorithms.

• It is used for sampling from a probability distribution, based on a Markov chain model that has the desired distribution as its equilibrium distribution.

• The state of the chain after a number of steps is used as a sample of the desired distribution.

• The quality of the sample ↑ as the number of steps ↑.

• MCMC methods are mainly used for numerical approximation of integrals

188 MCMC in the BN problem
• MCMC algorithm to identify 1,000 networks:
– Start with a null network (for each of the 1,000 runs) including prior edges.
– Make random changes on each network, such as flipping, adding, and deleting individual edges.
– Accept the random changes that lead to an improvement in the fit of the network to the data: P(M | D) ≈ P(D | M) × P(M)
– The fit is assessed by the Bayesian information criterion (BIC), which also penalizes the network as the number of parameters (complexity) increases.
– Note: BIC = k·ln(n) - 2·ln(L), where L is the model likelihood, n is the number of data points in D (sample count), and k is the number of parameters to be estimated (the parent count of node i).
• Create a consensus network by combining the above 1,000 networks, and check that the final network retains the DAG property.

189 Comparison of BN tools

190 Software packages

Tool      Platform  Scoring method(s)  Data type              Multiprocessing?  Prior info support?
RIMBANet  Perl+C    BIC                Continuous + Discrete  No                Yes
Bnfinder  Python    BIC, BDE, MIT      Continuous + Discrete  Yes               Yes
sparsebn  R         BIC+CV             Continuous + Discrete  No                No
bnlearn   R         BIC, AIC, BDE      Continuous + Discrete  Yes               Yes
catnet    R         BIC, AIC           Continuous + Discrete  Yes               Yes

• BIC: Bayesian information criterion
• AIC: Akaike information criterion
• BDE: Bayesian Dirichlet equivalence
• CV: cross validation
• MIT: χ2-distance-based mutual information test

191 Example: 500 genes, 100 samples from an aorta tissue expression dataset

Tool      Scoring method  Runtime                                Edge #
RIMBANet  BIC             5,586 sec (~1.30 hour) on 1 node       502
Bnfinder  BIC             12,710 sec (~3.30 hours) on 8 nodes    479
                          (could not finish in 24h on 1 node)
sparsebn  BIC             2,409 sec on 1 node                    5,019
                          (prior info integration is not available*)
bnlearn   BIC             could not finish in 24h (on 8 nodes)   --
catnet    BIC             could not finish in 24h (on 8 nodes)   --

193 Summary

• Gene networks allow us to:
– Identify biological mechanisms and molecular subnetworks underlying common human diseases.
– Integrate diverse types of multi-dimensional biological datasets.
– Predict the key driver genes in disease-related subnets.

• After presenting our data-driven findings, experimentalists can conduct in vivo and/or in vitro experiments to test our candidate genes on a given tissue, for certain hallmarks of a disease/disorder.

194