Signal Processing meets Immunology: Towards a Hepatitis C Vaccine via High-Dimensional Correlation Estimation

Matthew McKay

ECE Department Hong Kong University of Science and Technology

Centrale-Supelec, February 3, 2015

Other Team Members

I-Ming Hsing: Professor, CBME; Head and Professor, BME
Raymond H. Y. Louie: Visiting Assistant Professor, ECE
Ahmed Abdul Quadeer: PhD student, ECE
Karthik Shekhar: Post-doc, Broad Institute
Arup K. Chakraborty: Robert T. Haslam Professor of Chemical Engineering; Professor of Chemistry, Physics, and Biological Engineering

Outline

• Immunology Background
• Vaccine Design – Challenges, Conventional Strategy, and Proposed Idea
• Correlation Matrix Estimation using RMT
• Vaccine Design – Details and Validation
• Conclusions

Virus

• An invading microbe that replicates inside living cells
• Viruses cause infectious diseases, for example:
  • Human Immunodeficiency Virus (HIV), which leads to AIDS
  • Hepatitis (A, B, C viruses)
  • Influenza (H1N1, H3N2, H7N9)

Hepatitis C virus (HCV)

• HCV causes an infectious disease that mainly affects the liver
• More than 170 million people affected globally
• Treatment available: pegylated interferon and ribavirin
  • Expensive
  • Prolonged
  • Extensive side effects
  • Frequently fails
• No vaccine available!

Vexing problem: Virus’s extreme mutability

Virus consists of proteins

[Figure: HCV viral genome]

Proteins consist of sequences of amino acids

No.  Amino Acid      Letter
 1   Alanine         A
 2   Arginine        R
 3   Asparagine      N
 4   Aspartic acid   D
 5   Cysteine        C
 6   Glutamic acid   E
 7   Glutamine       Q
 8   Glycine         G
 9   Histidine       H
10   Isoleucine      I
11   Leucine         L
12   Lysine          K
13   Methionine      M
14   Phenylalanine   F
15   Proline         P
16   Serine          S
17   Threonine       T
18   Tryptophan      W
19   Tyrosine        Y
20   Valine          V

Protein properties

Different proteins have different amino acid sequences and lengths

Position     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Protein 1    V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K

Position     1  2  3  4  5  6  7  8
Protein 2    M  Q  S  A  A  K  L  R

Variants of the same protein have similar lengths and amino acid sequences

Position     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Protein 1    V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K
Protein 1    V  A  S  K  T  K  R  S  K  G  L  R  R  K  K

• Same function but different effectiveness

Multiple sequence alignment (MSA)

Position     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 …
Sequence 1   V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K …
Sequence 2   V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K …
Sequence 3   V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K …
    :        :  :  :  :  :  :  :  :  :  :  :  :  :  :  : …
Sequence n   V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …

Peptide

All observed viral sequences are considered fit

Pathogen specific adaptive immune system

[Figure: host cell, virus, B cell with BCR and antibodies, T cell (CTL) with TCR, infected cell presenting peptide-MHC]

• Memory of past infections
• Basis for vaccination
• Goal: Find specific peptides whose recognition leads to the killing of a large number of infected cells

Single mutation in peptide can abrogate T cell recognition

[Figure: an infected cell presenting peptide-MHC. Epitope with no mutation: TCR recognition and T cell activation. Epitope with one mutation: the T cell cannot recognize it.]


Vaccine Design Challenges

1. Which type of immune response should the vaccine induce?
2. Which proteins to target?
3. Which peptides of the protein to target?

1. B cell or T cell vaccine?

• A B cell (antibody) based vaccine that targets the external proteins?
• A T cell based vaccine that targets the internal proteins?
• Experimental and clinical studies reveal that HCV controllers use a broadly directed T cell response to clear the virus

A T cell based immune response is important in the case of HCV

2. Which proteins to target?

[Figure: HCV proteins labeled by function: helicase/protease, membrane binding, polymerase]

• Why NS3?
  • The immune systems of HCV controllers target peptides of NS3
  • Comparatively large number of sequences

3. Which peptides of the protein to target?

• Major challenge
• Difficult to address experimentally

Use statistical and computational methods to help find a solution, based on the large amount of sequence data now available

Human Genome Project

• Modern advances in biotechnology are revolutionizing biomedical research
• Landmark: Human Genome Project
  • Time period: 1990 to 2003
  • Cost: 3 billion US dollars
• Advances in genomics paved the way for advanced medical research into treatments for cancer and other diseases

Increase in data

Open databases

Explosive growth in submissions! Lots of databases!

and many more (e.g. UniProt, ProDom, VectorBase, …)

Large number of sequences for many infectious diseases!

3. Which peptides of the protein to target?

• Large number of sequences (observations): 2800+ in NS3
• Large number of amino acids in the protein (variables): 631 in NS3

The most difficult challenge, to be addressed using high-dimensional correlation matrix estimation

Conventional vaccine design strategy

A TOY EXAMPLE:

Position     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 …
Sequence 1   V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K …
Sequence 2   V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K …
Sequence 3   V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K …
    :        :  :  :  :  :  :  :  :  :  :  :  :  :  :  : …
Sequence n   V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …

Consensus    V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …

• No mutation at all: 100% conserved
• Conventional approach: Design a vaccine that elicits a T cell response targeting highly conserved peptides
• Basis of a recently proposed HCV vaccine, IC-41
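The consensus sequence and the per-position conservation used by the conventional strategy can be computed directly from an alignment. The following is a minimal sketch, assuming the toy alignment consists of just the four rows shown on the slide (in the slide, "Sequence n" stands for many more rows); the function name `consensus_and_conservation` is illustrative only.

```python
from collections import Counter

# Toy alignment from the slide: first 15 positions, treating the four
# rows shown (Sequence 1, 2, 3 and n) as the whole data set.
msa = [
    "VYATTSASAGLRQVK",
    "VYSTTKRSKGLRQKK",
    "VYSTTSRSKGLRQKK",
    "VYATTSRSAGLRQKK",
]

def consensus_and_conservation(sequences):
    """Return the consensus string and the per-position conservation (0..1)."""
    consensus = []
    conservation = []
    for column in zip(*sequences):
        # Most frequent residue in this column; ties go to the residue seen first.
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        conservation.append(count / len(column))
    return "".join(consensus), conservation

consensus, conservation = consensus_and_conservation(msa)

# 1-based positions with no mutation at all (100% conserved).
fully_conserved = [i + 1 for i, c in enumerate(conservation) if c == 1.0]
```

With these four rows the consensus agrees with the one on the slide, and the 100% conserved positions are the candidates that a purely conservation-based design would target.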

Problem: The high mutability of the virus may result in escape mutations

Proposed vaccine design approach

Position     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 …
Sequence 1   V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K …
Sequence 2   V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K …
Sequence 3   V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K …
    :        :  :  :  :  :  :  :  :  :  :  :  :  :  :  : …
Sequence n   V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …

Consensus    V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …

• Positively correlated pairs of locations → Beneficial mutations
• Negatively correlated pairs of locations → Harmful mutations

Target the negatively correlated pairs of locations along with the 100% conserved ones and avoid the positively correlated pairs of locations
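To make "positively" and "negatively correlated pairs of locations" concrete, one common choice is to binarize the alignment (1 where a residue differs from the consensus, 0 otherwise) and take the Pearson correlation between columns. The sketch below assumes that encoding and again treats the four toy rows as the full data set; returning 0.0 for fully conserved positions (zero variance) is a convention chosen here, not something stated in the slides.

```python
import math

msa = [
    "VYATTSASAGLRQVK",
    "VYSTTKRSKGLRQKK",
    "VYSTTSRSKGLRQKK",
    "VYATTSRSAGLRQKK",
]
consensus = "VYATTSRSAGLRQKK"  # consensus row from the slide

def mutation_matrix(sequences, reference):
    """Binary matrix: 1 where a residue differs from the reference, else 0."""
    return [[int(res != ref) for res, ref in zip(seq, reference)]
            for seq in sequences]

def pearson(x, y):
    """Pearson correlation; 0.0 if either variable has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def sample_correlation(binary):
    """Sample correlation between all pairs of positions (columns)."""
    columns = list(zip(*binary))
    m = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(m)]
            for i in range(m)]

corr = sample_correlation(mutation_matrix(msa, consensus))

# 0-based indices: positions 3 and 9 mutate together, 3 and 7 do not.
corr_3_9 = corr[2][8]
corr_3_7 = corr[2][6]
```

In this toy data positions 3 and 9 always mutate together (correlation +1), while positions 3 and 7 never mutate in the same sequence (negative correlation). With only four sequences such values are of course dominated by noise, which is the motivation for the cleaning step discussed later.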


Technical problem …

• Large number of sequences (observations): 2800+ in NS3
• Large number of amino acids in the protein (variables): 631 in NS3
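A rough way to see why these dimensions are problematic is the Marchenko-Pastur law from random matrix theory: for M variables with independent, unit-variance entries and N observations, the eigenvalues of the sample covariance matrix spread over approximately [(1 - sqrt(M/N))^2, (1 + sqrt(M/N))^2] even though every true eigenvalue equals 1. The sketch below evaluates these edges for the NS3 dimensions, taking N = 2800 as a stand-in for "2800+"; whether the authors use exactly this null model is an assumption.

```python
import math

def marchenko_pastur_edges(n_variables, n_observations):
    """Lower and upper edge of the Marchenko-Pastur bulk for unit-variance noise.

    Valid as written for n_variables <= n_observations.
    """
    root = math.sqrt(n_variables / n_observations)
    return (1 - root) ** 2, (1 + root) ** 2

# NS3 dimensions quoted above: 631 sites, roughly 2800 sequences.
lower, upper = marchenko_pastur_edges(631, 2800)
```

Under these assumptions pure noise alone would produce sample eigenvalues between roughly 0.28 and 2.17, so an eigenvalue of the sample correlation matrix has to lie clearly above the upper edge before it can be read as genuine correlation structure.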

Challenge: Accurate high-dimensional correlation estimation

Correlation matrix estimation

• Examples
  • Portfolio management and risk assessment
  • Array processing
  • Designing wireless communication receivers
• Number of observations ≈ number of variables
• The sample correlation matrix is known to perform poorly in this regime [Johnstone, 2001]

Basis - RMT application in finance

• Random Matrix Theory (RMT) for noise-cleaning in finance [Photos: Bouchaud, Stanley]
• RMT is also instrumental in modern communication system design, such as WiFi and cellular phones
• HIV work by Arup Chakraborty (MIT) [PNAS, 2011] [Photo: Arup K. Chakraborty]
  • Finding HIV sectors (groups of amino acids)
  • Designing a vaccine to attack such sectors
  • Vaccine trials in progress

In the news…

Method

1. Obtain the Multiple Sequence Alignment (MSA)
2. Construct the sample correlation matrix from the MSA
3. Clean the correlation matrix using RMT
4. Design an immunogen targeting the highly conserved and negatively correlated pairs of sites

• Advantages:
  • The results can potentially yield significant improvements over IC-41
  • Such vaccine strategies can be explored with computational methods
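The "clean the correlation matrix using RMT" step can be illustrated with eigenvalue clipping, a standard technique from the finance literature: keep only the eigenmodes whose eigenvalues exceed the Marchenko-Pastur upper edge and discard the rest. This is a simplified sketch on synthetic data, assuming NumPy is available; the threshold, the unit-diagonal convention and the function name `clean_correlation` are assumptions, and the authors' actual procedure (including any treatment of phylogenetic effects) may differ.

```python
import numpy as np

def clean_correlation(sample_corr, n_observations):
    """Keep only eigenmodes above the Marchenko-Pastur upper edge."""
    n_variables = sample_corr.shape[0]
    upper_edge = (1 + np.sqrt(n_variables / n_observations)) ** 2
    eigvals, eigvecs = np.linalg.eigh(sample_corr)
    keep = eigvals > upper_edge
    # Rebuild the matrix from the retained eigenvalue/eigenvector pairs.
    cleaned = (eigvecs[:, keep] * eigvals[keep]) @ eigvecs[:, keep].T
    # Restore the unit diagonal (a convention assumed for this sketch).
    np.fill_diagonal(cleaned, 1.0)
    return cleaned, int(keep.sum())

# Synthetic data: sites 0-4 share one hidden factor, all others are noise.
rng = np.random.default_rng(0)
n_obs, n_var = 2000, 50
factor = rng.standard_normal((n_obs, 1))
data = rng.standard_normal((n_obs, n_var))
data[:, :5] += 2.0 * factor

sample_corr = np.corrcoef(data, rowvar=False)
cleaned, n_kept = clean_correlation(sample_corr, n_obs)
```

In the cleaned matrix the correlations among the five coupled sites survive, while the small spurious correlations between unrelated sites are suppressed towards zero.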

Sample correlation matrix

[Figure: sample correlation matrix]

Cleaned correlation matrix

[Figure: cleaned correlation matrix; labels: Statistical Noise, Phylogenetic Noise]

Alternate covariance matrix estimation methods

• Regularized (shrinkage) methods [Ledoit et al., 2004; Ledoit et al., 2012]
• Sparse covariance matrix estimation [Bickel et al., 2008; Cai et al., 2012]
• Sparse PCA [Johnstone et al., 2009; Paul et al., 2012; Ma, 2013; Vu, 2013; Liu et al., 2014]
• Robust estimation [Maronna, 1976; Couillet et al., 2013; Zheng et al., 2014]
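As a point of comparison with the RMT approach, the simplest of the alternatives listed above is linear shrinkage, which pulls the sample matrix towards a structured target. The sketch below shrinks a correlation matrix towards the identity with a fixed, user-chosen intensity; the cited shrinkage estimators instead estimate the intensity from the data, so this is only an illustration of the idea, with a hypothetical 3 x 3 input.

```python
def shrink_toward_identity(sample_corr, intensity):
    """Return (1 - intensity) * sample_corr + intensity * identity."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    m = len(sample_corr)
    return [
        [(1 - intensity) * sample_corr[i][j] + (intensity if i == j else 0.0)
         for j in range(m)]
        for i in range(m)
    ]

# Hypothetical 3 x 3 sample correlation matrix, for illustration only.
sample_corr = [
    [1.0, 0.6, -0.2],
    [0.6, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]
shrunk = shrink_toward_identity(sample_corr, 0.25)
```

Every off-diagonal entry is scaled by the same factor (here 0.75) while the diagonal stays at 1, in contrast with eigenvalue-based cleaning, which treats different eigenmodes differently.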


Important factors in the proposed vaccine design

1. Metric L, calculated from the correlations
2. Population coverage

[Figure: host cell, MHC, peptide, T cell]

1. Metric L - calculated based on correlations

[Figure: Peptide 1 and Peptide 1 with a single mutation; Peptide 2 and Peptide 2 with a single mutation]

Vaccine Design Objective: Maximize L = PCP + PNCP - PPCP - PUCP

• PCP = Percentage of 100% conserved pairs
• PNCP = Percentage of negatively correlated pairs
• PPCP = Percentage of positively correlated pairs
• PUCP = Percentage of uncorrelated pairs
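The metric is simple arithmetic once every pair of sites in a candidate design has been assigned to one of the four categories. The sketch below takes the four pair counts as input; how a pair is assigned to a category is not specified here, and the counts in the example are hypothetical. If the four categories partition all pairs, the percentages sum to 100, so L = 100 - 2 (PPCP + PUCP) and L lies between -100 and 100.

```python
def l_score(n_conserved, n_negative, n_positive, n_uncorrelated):
    """L = PCP + PNCP - PPCP - PUCP, with each term a percentage of all pairs."""
    total = n_conserved + n_negative + n_positive + n_uncorrelated
    if total == 0:
        raise ValueError("at least one pair is required")
    pcp = 100.0 * n_conserved / total
    pncp = 100.0 * n_negative / total
    ppcp = 100.0 * n_positive / total
    pucp = 100.0 * n_uncorrelated / total
    return pcp + pncp - ppcp - pucp

# Hypothetical design with 10 sites, i.e. 45 pairs of sites.
score = l_score(n_conserved=20, n_negative=15, n_positive=3, n_uncorrelated=7)
```

For these counts the score is 25/45 of 100, about 55.6; a design made only of conserved and negatively correlated pairs would reach the maximum of 100.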

2. Population Coverage

Differences in MHC molecules lead to the presentation of different peptides across populations

[Figure: cells of Person 1 to Person 5, each with different MHC molecules]

• Different people have different types of MHC molecules
• Different MHC molecules may present different peptides
• Thus different people may present different peptides

2. Population Coverage

Position   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 …
Person 1   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S …
Person 2   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S …
Person 3   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S …
Person 4   V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S …

• Challenge: Designing a vaccine that covers a large proportion of the population
• Information required:
  • Detailed statistics of the distribution of MHCs in a given population
  • Data on NS3 peptides presented by particular MHCs (IEDB database)

Statistics of haplotypes in US Caucasian population [Maiers et al., 2007]
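Given haplotype frequencies and a map from MHC alleles to the peptides they present, population coverage reduces to summing the frequencies of the haplotypes that present enough vaccine peptides. The sketch below is a deliberately simplified model: each person is represented by a single haplotype, "double coverage" is taken to mean that at least two vaccine peptides are presented, and all allele names, peptide names and frequencies are made up for illustration. None of these choices is confirmed by the slides.

```python
def coverage(haplotypes, presents, vaccine, min_peptides=1):
    """Fraction of the population presenting at least `min_peptides` vaccine peptides."""
    total = 0.0
    for alleles, frequency in haplotypes:
        presented = set()
        for allele in alleles:
            # Vaccine peptides that this allele is known to present.
            presented |= presents.get(allele, set()) & vaccine
        if len(presented) >= min_peptides:
            total += frequency
    return total

# Hypothetical haplotypes: (alleles carried, population frequency).
haplotypes = [
    (("A1", "B1"), 0.4),
    (("A2", "B1"), 0.3),
    (("A3", "B2"), 0.2),
    (("A4", "B3"), 0.1),
]
# Hypothetical presentation data: allele -> peptides it presents.
presents = {"A1": {"p1"}, "A2": {"p2"}, "A3": {"p1"}, "B1": {"p3"}}
vaccine = {"p1", "p3"}

single_coverage = coverage(haplotypes, presents, vaccine, min_peptides=1)
double_coverage = coverage(haplotypes, presents, vaccine, min_peptides=2)
```

In this toy population 90% of people present at least one of the two vaccine peptides, but only 40% present both.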

Proposed T cell vaccine design

• A list of 32 peptides, recognized by T cells in a large proportion of the US Caucasian population, was compiled

APITAYAQQTRGLLGCIITSLTGRDKNQVEGEVQIVSTAAQTFLATCINGVCWTVYHGAGTRTIASPKGPVIQMYTNVDQDLV GWPAPQGARSLTPCTCGSSDLYLVTRHADVIPVRRRGDSRGSLLSPRPISYLKGSSGGPLLCPAGHAVGIFRAAVCTRGVAKAV DFIPVENLETTMRSPVFTDNSSPPAVPQSFQVAHLHAPTGSGKSTKVPAAYAAQGYKVLVLNPSVAATLGFGAYMSKAHGI DPNIRTGVRTITTGSPITYSTYGKFLADGGCSGGAYDIIICDECHSTDATSILGIGTVLDQAETAGARLVVLATATPPGSVTVPHP NIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMT GFTGDFDSVIDCNTCVTQTVDFSLDPTFTIETTTLPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPSGMFDSSVLCECYDAGCA WYELTPAETTVRLRAYMNTPGLPVCQDHLEFWEGVFTGLTHIDAHFLSQTKQSGENLPYLVAYQATVCARAQAPPPSW DQMWKCLIRLKPTLHGPTPLLYRLGAVQNEVTLTHPITKYIMTCMSADLEVVT

• As an example, we consider a 5-peptide vaccine design for this population

Proposed T cell vaccine design

• Obtain the 10 combinations with maximum L (effectiveness of the combination at killing viruses)
• Order them with respect to Dcov (double coverage)

Combination  Peptide 1  Peptide 2  Peptide 3  Peptide 4  Peptide 5  L      Dcov
 1           1251-1259  1292-1300  1436-1444  1585-1594  1585-1595  63.58  0.50
 2           1123-1131  1169-1177  1251-1259  1292-1300  1436-1444  61.62  0.44
 3           1123-1131  1175-1183  1251-1259  1292-1300  1436-1444  65.45  0.37
 4           1123-1131  1175-1183  1251-1259  1359-1367  1436-1444  61.62  0.37
 5           1169-1177  1175-1183  1251-1259  1292-1300  1436-1444  64.46  0.34
 6           1123-1131  1251-1259  1292-1300  1359-1367  1436-1444  65.45  0.30
 7           1251-1259  1292-1300  1436-1444  1540-1550  1541-1550  61.31  0.18
 8           1169-1177  1251-1259  1292-1300  1359-1367  1436-1444  61.62  0.14
 9           1175-1183  1251-1259  1292-1300  1359-1367  1436-1444  65.45  0.07
10           1123-1131  1175-1183  1251-1259  1292-1300  1359-1367  61.62  0.07
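The two-stage selection described above (take the combinations with the highest L, then order them by Dcov) can be sketched as an exhaustive search, which is feasible because 32 candidate peptides give only C(32, 5) = 201,376 five-peptide combinations and C(32, 2) = 496 two-peptide combinations. In the sketch below the scoring functions are passed in as parameters; `toy_l` and `toy_dcov` are hypothetical stand-ins used only to exercise the code, not the real metrics.

```python
from itertools import combinations
from math import comb

def rank_combinations(peptides, k, l_score, dcov, top=10):
    """Keep the `top` k-peptide combinations by L, then order them by Dcov."""
    scored = [(combo, l_score(combo)) for combo in combinations(peptides, k)]
    best = sorted(scored, key=lambda item: item[1], reverse=True)[:top]
    return sorted(best, key=lambda item: dcov(item[0]), reverse=True)

# Size of the search space for 32 candidate peptides.
n_five_peptide_designs = comb(32, 5)
n_two_peptide_designs = comb(32, 2)

# Hypothetical peptides and scores, for illustration only.
peptides = ["p0", "p1", "p2", "p3", "p4", "p5"]

def toy_l(combo):
    return sum(int(name[1:]) for name in combo)

def toy_dcov(combo):
    indices = [int(name[1:]) for name in combo]
    return max(indices) - min(indices)

ranked = rank_combinations(peptides, 2, toy_l, toy_dcov, top=2)
```

Ties in L at the cut-off are broken by enumeration order in this sketch; a real implementation would need an explicit rule for them.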

Analysis of NS3 peptides of IC-41

• Plus point: no positively correlated pairs of sites!
• Rank in the 2-peptide vaccine design: 71/496

[Figure: two bar charts over combinations of 2 NS3 peptides (1, IC41, 2, 3, 4, 5): double coverage, and mean conservation across all genotypes. Bars are annotated with L-scores (order not recoverable): 67.03, 38.34, 75.44, 72.55, 80.39, 86.93]

Validation

• Experiments
• Existing clinical and experimental data
  • Cannot directly validate the proposed peptides
• Validation strategy:
  1. Identify a group (sector) of potentially vulnerable sites (negatively correlated) that are collectively coupled
  2. Validate this sector by comparing with structural and clinical data
  3. Check whether our vaccine targets the sites in this sector

1. Identify sectors of potentially vulnerable sites

• Use a clustering algorithm based on the eigenvectors of C_cleaned
• Finance → Economic sectors
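One simple way to turn eigenvectors into sectors is to assign each site to the leading eigenvector on which it has the largest absolute loading, leaving sites with negligible loadings unassigned. The sketch below assumes NumPy and uses a small hypothetical block-structured matrix; the loading cutoff, the function name and the assignment rule are assumptions for illustration, and the clustering algorithm actually used by the authors is not specified in the slides.

```python
import numpy as np

def sectors_from_eigenvectors(cleaned_corr, n_sectors, min_loading=0.1):
    """Label each site with the leading eigenvector it loads on most (-1 if none)."""
    eigvals, eigvecs = np.linalg.eigh(cleaned_corr)
    order = np.argsort(eigvals)[::-1][:n_sectors]  # largest eigenvalues first
    loadings = np.abs(eigvecs[:, order])
    labels = loadings.argmax(axis=1)
    labels[loadings.max(axis=1) < min_loading] = -1  # weakly coupled sites
    return labels

# Hypothetical cleaned correlation matrix: sites 0-2 form one coupled
# group, sites 3-4 form another, and site 5 is coupled to nothing.
corr = np.eye(6)
corr[:3, :3] = 0.6
corr[3:5, 3:5] = 0.5
np.fill_diagonal(corr, 1.0)

labels = sectors_from_eigenvectors(corr, n_sectors=2)
```

Here the two leading eigenvectors each pick out one coupled group, in the same spirit as eigenvectors of financial correlation matrices picking out economic sectors.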

Three sectors of co-evolving sites in NS3

[Figure: 3-D scatter plot of eigenvectors, and bar charts for sectors 1 to 3 of mean conservation, % positive correlations, % negative correlations, and neg/pos correlations]

Sector 1 consists of the most immunologically vulnerable sites

2. Structural significance of sector 1

[Figure: NS3 crystal structure; red = sector 1 sites]

Sector 1 sites are dominant in the critical interface of the NS3 crystal structure (p-value < 0.01)

2. Significance of sector 1 based on previously published experimental and clinical results

[Figure: bar chart of % sector 1 sites for allele-independent epitopes (1 to 3) and allele-restricted epitopes (1 to 13), with a level marked at >30%]

The majority of peptides targeted by “HCV Controllers” consist predominantly of sector 1 sites (p-value < 0.05).

3. Sector 1 sites in proposed vaccine design


A large proportion (~60%) of sites in the proposed vaccine design belong to sector 1 (p-value < 0.01)

Conclusions

• The majority of the sites in the proposed design belong to sector 1, which appears to be significant based on experimental and clinical data available in the literature
• Numerical validation of the currently proposed vaccine design, IC-41
• Proposal of new vaccine design strategies which can:
  • Potentially improve upon IC-41 by inducing an immune response against more vulnerable parts of the HCV genome
  • Cover a large portion of the population (currently, for the US)
• Similar analysis of the NS4B and NS5B proteins also reveals potential sites for vaccine design

Next step: Experimental trials!

Conclusions

• There is much similarity between high-dimensional statistical problems in immunology and those in signal processing
• Many methods common in SP find direct application (though currently not well explored):
  • Maximum entropy modeling
  • Sampling methods (e.g., MCMC)
  • Sparsity
  • Subspace estimation
  • Robust estimation
  • Machine learning
  • …

Related Publications

• A. A. Quadeer, R. H. Y. Louie, K. Shekhar, A. K. Chakraborty, I. Hsing, and M. R. McKay, “Discovering statistical vulnerabilities in highly mutable viruses: a random matrix approach,” in Proc. of the IEEE Workshop on Statistical Signal Processing (SSP), Gold Coast, Australia, July 2014.
• A. A. Quadeer, R. H. Y. Louie, K. Shekhar, A. K. Chakraborty, I. Hsing, and M. R. McKay, “Statistical linkage of substitutions in patient-derived sequences of genotype 1a hepatitis C virus non-structural protein 3 exposes targets for immunogen design,” Journal of Virology, 88 (13), pp. 7628-7644, July 2014.

Join us in Brisbane, 19 – 24 April 2015: www.icassp2015.org