Signal Processing meets Immunology: Towards a Hepatitis C Vaccine via High-Dimensional Correlation Estimation
Matthew McKay
ECE Department, Hong Kong University of Science and Technology
Centrale-Supelec, February 3, 2015
Other Team Members
I-Ming Hsing: Professor, CBME; Head and Professor, BME
Raymond H. Y. Louie: Visiting Assistant Professor, ECE
Ahmed Abdul Quadeer: PhD student, ECE
Karthik Shekhar: Post-doc, Broad Institute
Arup K. Chakraborty: Robert T. Haslam Professor of Chemical Engineering, Professor of Chemistry, Physics, and Biological Engineering
2 Outline
Immunology Background
Vaccine Design – Challenges, Conventional Strategy, and Proposed Idea
Correlation Matrix Estimation using RMT
Vaccine Design – Details and Validation
Conclusions
3 Virus
An invading microbe that replicates inside living host cells
Viruses cause infectious diseases, for example:
  AIDS, caused by the human immunodeficiency virus (HIV)
  Hepatitis (hepatitis A, B, and C viruses)
  Influenza (H1N1, H3N2, H7N9)
4 Hepatitis C virus (HCV)
HCV causes an infectious disease that mainly affects the liver
More than 170 million people affected globally
Treatment available: pegylated interferon and ribavirin
  Expensive
  Prolonged
  Extensive side-effects
  Frequently fails
No vaccine available!
Vexing problem: Virus’s extreme mutability
5 Virus consists of proteins
HCV Viral Genome
6 Proteins consist of sequence of amino acids
No.  Amino Acid      Letter
1    Alanine         A
2    Arginine        R
3    Asparagine      N
4    Aspartic acid   D
5    Cysteine        C
6    Glutamic acid   E
7    Glutamine       Q
8    Glycine         G
9    Histidine       H
10   Isoleucine      I
11   Leucine         L
12   Lysine          K
13   Methionine      M
14   Phenylalanine   F
15   Proline         P
16   Serine          S
17   Threonine       T
18   Tryptophan      W
19   Tyrosine        Y
20   Valine          V
7 Protein properties
Different proteins have different amino acid sequences and lengths
Position      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Protein 1     V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K
Protein 2     M  Q  S  A  A  K  L  R
Variants of the same protein have similar length and amino acid sequence
Position      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Protein 1     V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K
Protein 1     V  A  S  K  T  K  R  S  K  G  L  R  R  K  K
Same function but different effectiveness
8 Multiple sequence alignment (MSA)
Position      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 …
Sequence 1    V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K …
Sequence 2    V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K …
Sequence 3    V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K …
:             :  :  :  :  :  :  :  :  :  :  :  :  :  :  : …
Sequence n    V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …
Peptide
All observed viral sequences are considered fit
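A minimal sketch of how an alignment like this can be held in code and summarized position by position. It uses only the four rows shown on the slide; the helper names `consensus` and `conservation` are illustrative, and breaking ties alphabetically is an assumption, not something the slides specify.

```python
from collections import Counter

# Toy MSA: the four rows shown on the slide (a real alignment has n rows).
msa = [
    "VYATTSASAGLRQVK",
    "VYSTTKRSKGLRQKK",
    "VYSTTSRSKGLRQKK",
    "VYATTSRSAGLRQKK",
]

def consensus(alignment):
    """Most frequent amino acid at each position (ties broken alphabetically)."""
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        result.append(max(sorted(counts), key=counts.get))
    return "".join(result)

def conservation(alignment):
    """Fraction of sequences that match the consensus at each position."""
    cons = consensus(alignment)
    n = len(alignment)
    return [sum(seq[i] == aa for seq in alignment) / n for i, aa in enumerate(cons)]

cons_seq = consensus(msa)
cons_frac = conservation(msa)
variable_positions = [i + 1 for i, f in enumerate(cons_frac) if f < 1.0]  # 1-based

print("consensus:", cons_seq)
print("variable positions:", variable_positions)
```

With only four sequences the per-position fractions are very coarse; the point is the data layout (rows are sequences, columns are positions), not the numbers.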
9 Pathogen specific adaptive immune system
[Figure: host cell, virus, B cell with BCR and antibodies, T cell with TCR, infected cell presenting peptide-MHC]
10 Pathogen specific adaptive immune system
[Figure: host cell, virus, B cells with BCR and antibodies, T cell and CTL with TCR, infected cell presenting peptide-MHC]
Memory of past infections
Basis for vaccination
Goal: Find specific peptides whose recognition by T cells leads to the killing of a large number of infected cells
11 Single mutation in peptide can abrogate T cell recognition
[Figure: an infected cell presents peptide-MHC to T cells. Epitope with no mutation: TCR recognition and activation. Epitope with one mutation: the TCR cannot recognize it]
13 Vaccine Design Challenges
1. Which type of immune response should the vaccine induce?
2. Which proteins to target?
3. Which peptides of the protein to target?
14 1. B cell or T cell vaccine?
A B cell (antibody) based vaccine that targets the external proteins?
A T cell based vaccine that targets the internal proteins?
Experimental and clinical studies reveal that HCV controllers use a broadly directed T cell response to clear the virus
A T cell based immune response is important in the case of HCV
15 2. Which proteins to target?
[Figure: HCV proteins labeled by function: helicase/protease function, membrane binding function, polymerase function]
Why NS3?
The immune systems of HCV controllers target peptides of NS3
Comparatively large number of sequences available
16 3. Which peptides of the protein to target?
Major challenge
Difficult to address experimentally
Use statistical and computational methods to help find a solution, based on the large amount of sequence data now available
17 Human Genome Project
Modern advances in bio-technology are revolutionizing the field of biomedical research
Landmark: Human Genome Project
Time period: 1990-2003
Cost: 3 billion US dollars
Advances in genomics paved the way for advanced medical research into treatments for cancer and other diseases
18 Increase in data
19 Open databases
Explosive growth in submissions! Lots of databases!
and many more (e.g., UniProt, ProDom, VectorBase, …)
20 Large number of sequences for many infectious diseases!
21 3. Which peptides of the protein to target?
Large number of sequences (observations): 2800+ in NS3
Large number of amino acids in the protein (variables): 631 in NS3
The most difficult challenge, to be addressed using high-dimensional correlation matrix estimation
22 Conventional vaccine design strategy
A TOY EXAMPLE:
Position      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 …
Sequence 1    V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K …
Sequence 2    V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K …
Sequence 3    V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K …
:             :  :  :  :  :  :  :  :  :  :  :  :  :  :  : …
Sequence n    V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …
Consensus     V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …
No mutation at all: 100% conserved
Conventional approach: Design a vaccine that elicits a T cell response targeting highly conserved peptides
Basis of a recently proposed HCV vaccine, IC-41
Problem: High mutability of the virus may result in escape mutations
23 Proposed vaccine design approach
Position      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 …
Sequence 1    V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K …
Sequence 2    V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K …
Sequence 3    V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K …
:             :  :  :  :  :  :  :  :  :  :  :  :  :  :  : …
Sequence n    V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …
Consensus     V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …
Positively correlated pairs of locations: beneficial mutations
24 Proposed vaccine design approach
Position      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 …
Sequence 1    V  Y  A  T  T  S  A  S  A  G  L  R  Q  V  K …
Sequence 2    V  Y  S  T  T  K  R  S  K  G  L  R  Q  K  K …
Sequence 3    V  Y  S  T  T  S  R  S  K  G  L  R  Q  K  K …
:             :  :  :  :  :  :  :  :  :  :  :  :  :  :  : …
Sequence n    V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …
Consensus     V  Y  A  T  T  S  R  S  A  G  L  R  Q  K  K …
Positively correlated pairs of locations: beneficial mutations
Negatively correlated pairs of locations: harmful mutations
Target the negatively correlated pairs of locations along with the 100% conserved ones and avoid the positively correlated pairs of locations
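A rough sketch of how positively and negatively correlated pairs of locations could be read off an alignment. It assumes a binary encoding (1 where a sequence differs from the consensus, 0 otherwise), which the slides do not state, and it reuses the four toy rows shown above, far too few for meaningful estimates.

```python
import numpy as np

# Toy alignment: the four rows shown on the slide.
msa = np.array([list(s) for s in [
    "VYATTSASAGLRQVK",
    "VYSTTKRSKGLRQKK",
    "VYSTTSRSKGLRQKK",
    "VYATTSRSAGLRQKK",
]])
consensus = np.array(list("VYATTSRSAGLRQKK"))

# Assumption: binary encoding, 1 = mutation relative to the consensus.
mutant = (msa != consensus).astype(float)

# 100% conserved positions have zero variance, so correlation is undefined there.
conserved = mutant.std(axis=0) == 0
variable = np.flatnonzero(~conserved)  # 0-based indices of variable positions

# Sample correlation between the variable positions only.
corr = np.corrcoef(mutant[:, variable], rowvar=False)

# Collect pairwise correlations keyed by 1-based positions.
pairs = {}
for a in range(len(variable)):
    for b in range(a + 1, len(variable)):
        key = (int(variable[a]) + 1, int(variable[b]) + 1)
        pairs[key] = float(corr[a, b])

for (i, j), r in sorted(pairs.items()):
    kind = "positive" if r > 0 else "negative"
    print(f"positions {i:2d} and {j:2d}: r = {r:+.2f} ({kind})")
```

In this toy, positions 3 and 9 always mutate together (r = +1), while positions 3 and 7 never do (r < 0); the proposed design would avoid the former kind of pair and target the latter.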
26 Technical problem …
Large number of sequences (observations): 2800+ in NS3
Large number of amino acids in the protein (variables): 631 in NS3
Challenge: Accurate high-dimensional correlation estimation
27 Correlation matrix estimation
Examples
  Portfolio management and risk assessment
  Array processing
  Designing wireless communication receivers
Number of observations ≈ number of variables
The sample correlation matrix is known to perform poorly in this regime [Johnstone, 2001]
Basis - RMT application in finance
Random Matrix Theory (RMT) for noise-cleaning in finance (Bouchaud, Stanley)
RMT is also instrumental in the design of modern communication systems such as WiFi and cellular networks
HIV work by Arup K. Chakraborty (MIT) [PNAS, 2011]
  Finding HIV sectors (groups of amino acids)
  Designing a vaccine to attack such sectors
  Vaccine trials in progress
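To see why the sample correlation matrix needs cleaning, the sketch below estimates it from pure noise with dimensions similar to the NS3 data (2800 observations and 631 variables, standing in for "2800+"). The true correlation matrix is the identity, so every eigenvalue should equal 1, yet the sample eigenvalues spread over roughly the interval predicted by the Marchenko-Pastur law of RMT. The Gaussian data and the function name `noise_eigenvalues` are illustrative assumptions.

```python
import numpy as np

def noise_eigenvalues(n_obs, n_vars, seed=0):
    """Eigenvalues of the sample correlation matrix of i.i.d. Gaussian noise."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n_obs, n_vars))
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    return np.linalg.eigvalsh(corr)

n_obs, n_vars = 2800, 631
eigs = noise_eigenvalues(n_obs, n_vars)

# Marchenko-Pastur support edges for ratio c = variables / observations.
ratio = n_vars / n_obs
mp_lower = (1 - np.sqrt(ratio)) ** 2
mp_upper = (1 + np.sqrt(ratio)) ** 2

print(f"sample eigenvalue range: [{eigs.min():.2f}, {eigs.max():.2f}]")
print(f"Marchenko-Pastur edges:  [{mp_lower:.2f}, {mp_upper:.2f}]")
```

Eigenvalues of a real data set that fall inside these edges are hard to distinguish from noise, which is the motivation for the cleaning step.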
29 In the news…
30 Method
1. Obtain the Multiple Sequence Alignment (MSA)
2. Construct the sample correlation matrix from the MSA
3. Clean the correlation matrix using RMT
4. Design an immunogen targeting the highly conserved and negatively correlated pairs of sites
Advantages:
The results can potentially yield significant improvements over IC-41
Such vaccine strategies can be explored with computational methods
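The cleaning step can be sketched with eigenvalue clipping, a common RMT-based recipe from the finance literature: eigenvalues above the Marchenko-Pastur upper edge are kept as signal, and the remaining ones are replaced by their average. This is an assumed stand-in; the slides do not spell out the exact procedure, and this sketch ignores phylogenetic noise. The function name `clean_correlation` and the synthetic data are hypothetical.

```python
import numpy as np

def clean_correlation(corr, n_obs):
    """Eigenvalue clipping: keep eigenvalues above the Marchenko-Pastur edge."""
    n_vars = corr.shape[0]
    mp_upper = (1 + np.sqrt(n_vars / n_obs)) ** 2
    vals, vecs = np.linalg.eigh(corr)
    signal = vals > mp_upper
    cleaned_vals = vals.copy()
    if (~signal).any():
        # Flatten the noise eigenvalues to their mean (this preserves the trace).
        cleaned_vals[~signal] = vals[~signal].mean()
    cleaned = (vecs * cleaned_vals) @ vecs.T
    # Rescale so the result is again a correlation matrix (unit diagonal).
    d = np.sqrt(np.diag(cleaned))
    return cleaned / np.outer(d, d), int(signal.sum())

# Synthetic data: 50 variables, the first 10 share one hidden factor.
rng = np.random.default_rng(1)
n_obs, n_vars = 500, 50
data = rng.standard_normal((n_obs, n_vars))
factor = rng.standard_normal(n_obs)
data[:, :10] += 1.5 * factor[:, None]

sample_corr = np.corrcoef(data, rowvar=False)
cleaned_corr, n_signal = clean_correlation(sample_corr, n_obs)
print("eigenvalues kept as signal:", n_signal)
```

Replacing the noise eigenvalues by their mean, rather than zeroing them, keeps the cleaned matrix positive definite, so the final rescaling to unit diagonal is always well defined.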
31 Sample correlation matrix
32 Cleaned correlation matrix
[Figure labels: statistical noise, phylogenetic noise]
Alternate covariance matrix estimation methods
Regularized (shrinkage) methods [Ledoit et al., 2004, Ledoit et al., 2012]
Sparse covariance matrix estimation [Bickel et al., 2008, Cai et al., 2012]
Sparse PCA [Johnstone et al., 2009, Paul et al., 2012, Ma, 2013, Vu, 2013, Liu et al., 2014]
Robust estimation [Maronna, 1976, Couillet et al., 2013, Zheng et al., 2014]
35 Important factors in the proposed vaccine design
1. Metric L - calculated based on correlations
2. Population coverage
[Figure: host cell presenting a peptide on MHC to a T cell]
36 1. Metric L - calculated based on correlations
[Figure: Peptide 1 and Peptide 2, each shown with and without a single mutation]
Vaccine Design Objective: Maximize L = PCP + PNCP - PPCP - PUCP
PCP = Percentage of 100% conserved pairs
PNCP = Percentage of negatively correlated pairs
PPCP = Percentage of positively correlated pairs
PUCP = Percentage of uncorrelated pairs
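A minimal sketch of the metric L for a set of sites. Several details are assumptions rather than the authors' definitions: pairs are taken over all sites covered by the peptides, a pair counts as conserved only if both sites are 100% conserved, and the threshold `eps` separating correlated from uncorrelated pairs is hypothetical.

```python
from itertools import combinations
import numpy as np

def l_metric(sites, corr, conserved, eps=0.1):
    """Return (L, percentages) for the given site indices."""
    counts = {"conserved": 0, "negative": 0, "positive": 0, "uncorrelated": 0}
    for i, j in combinations(sites, 2):
        if conserved[i] and conserved[j]:
            counts["conserved"] += 1
        elif corr[i, j] < -eps:
            counts["negative"] += 1
        elif corr[i, j] > eps:
            counts["positive"] += 1
        else:
            counts["uncorrelated"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least two sites")
    pct = {name: 100.0 * c / total for name, c in counts.items()}
    score = pct["conserved"] + pct["negative"] - pct["positive"] - pct["uncorrelated"]
    return score, pct

# Toy input: sites 0 and 1 are fully conserved, sites 2 and 3 are negatively correlated.
corr = np.zeros((4, 4))
corr[2, 3] = corr[3, 2] = -0.5
conserved = [True, True, False, False]

score, pct = l_metric([0, 1, 2, 3], corr, conserved)
print(f"L = {score:.2f}", pct)
```

Because the four percentages sum to 100 under these assumptions, L equals 100 minus twice the share of positively correlated and uncorrelated pairs.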
37 2. Population Coverage
Difference in MHC molecules leads to presentation of different peptides across populations
[Figure: cells of Person 1 to Person 5, each with different MHC molecules]
Different people have different types of MHC molecules
Different MHC molecules may present different peptides
Thus different people may present different peptides
38 2. Population Coverage
Position    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 …
Person 1    V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S …
Person 2    V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S …
Person 3    V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S …
Person 4    V  Y  A  T  T  S  A  S  A  G  L  R  Q  K  K  R  E  D  K  M  V  L  K  F  G  S …
Challenge: Designing a vaccine that covers a large proportion of the population
Information required:
  Detailed statistics of the distribution of MHCs in a given population
  Data on NS3 peptides presented by particular MHCs (IEDB database)
39 Statistics of haplotypes in US Caucasian population [Maiers et al., 2007]
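Given haplotype statistics and peptide-MHC presentation data, population coverage can be sketched as the total frequency of haplotypes that present at least a given number of the vaccine peptides. Everything below is hypothetical: the MHC names, frequencies, and peptide labels are placeholders, and reading "double coverage" as "at least two peptides presented" is an assumption. The sketch also treats each person as carrying a single haplotype, which is a simplification.

```python
def coverage(peptide_mhcs, haplotype_freq, min_peptides=1):
    """Fraction of the population whose haplotype presents >= min_peptides peptides."""
    covered = 0.0
    for mhcs, freq in haplotype_freq.items():
        presented = sum(
            1 for restricting in peptide_mhcs.values() if restricting & set(mhcs)
        )
        if presented >= min_peptides:
            covered += freq
    return covered

# Hypothetical haplotypes (tuples of MHC molecules) with made-up frequencies.
haplotype_freq = {
    ("MHC-a", "MHC-b"): 0.40,
    ("MHC-c", "MHC-d"): 0.35,
    ("MHC-e", "MHC-f"): 0.25,
}

# Hypothetical map from each vaccine peptide to the MHC molecules that present it.
peptide_mhcs = {
    "peptide-1": {"MHC-a"},
    "peptide-2": {"MHC-a", "MHC-d"},
    "peptide-3": {"MHC-f"},
}

single = coverage(peptide_mhcs, haplotype_freq, min_peptides=1)
double = coverage(peptide_mhcs, haplotype_freq, min_peptides=2)
print(f"coverage: {single:.2f}, double coverage: {double:.2f}")
```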
40 Proposed T cell vaccine design
A list of 32 peptides recognized by T cells in individuals in a large proportion of the US Caucasian population was compiled
APITAYAQQTRGLLGCIITSLTGRDKNQVEGEVQIVSTAAQTFLATCINGVCWTVYHGAGTRTIASPKGPVIQMYTNVDQDLV GWPAPQGARSLTPCTCGSSDLYLVTRHADVIPVRRRGDSRGSLLSPRPISYLKGSSGGPLLCPAGHAVGIFRAAVCTRGVAKAV DFIPVENLETTMRSPVFTDNSSPPAVPQSFQVAHLHAPTGSGKSTKVPAAYAAQGYKVLVLNPSVAATLGFGAYMSKAHGI DPNIRTGVRTITTGSPITYSTYGKFLADGGCSGGAYDIIICDECHSTDATSILGIGTVLDQAETAGARLVVLATATPPGSVTVPHP NIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMT GFTGDFDSVIDCNTCVTQTVDFSLDPTFTIETTTLPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPSGMFDSSVLCECYDAGCA WYELTPAETTVRLRAYMNTPGLPVCQDHLEFWEGVFTGLTHIDAHFLSQTKQSGENLPYLVAYQATVCARAQAPPPSW DQMWKCLIRLKPTLHGPTPLLYRLGAVQNEVTLTHPITKYIMTCMSADLEVVT
We consider a 5-peptide vaccine design for this population as an example
41 Proposed T cell vaccine design
Obtain the 10 combinations with maximum L (effectiveness of the combination in killing the virus)
Order them with respect to Dcov (double coverage)
Combination Peptide 1 Peptide 2 Peptide 3 Peptide 4 Peptide 5 L Dcov
1 1251-1259 1292-1300 1436-1444 1585-1594 1585-1595 63.58 0.50
2 1123-1131 1169-1177 1251-1259 1292-1300 1436-1444 61.62 0.44
3 1123-1131 1175-1183 1251-1259 1292-1300 1436-1444 65.45 0.37
4 1123-1131 1175-1183 1251-1259 1359-1367 1436-1444 61.62 0.37
5 1169-1177 1175-1183 1251-1259 1292-1300 1436-1444 64.46 0.34
6 1123-1131 1251-1259 1292-1300 1359-1367 1436-1444 65.45 0.30
7 1251-1259 1292-1300 1436-1444 1540-1550 1541-1550 61.31 0.18
8 1169-1177 1251-1259 1292-1300 1359-1367 1436-1444 61.62 0.14
9 1175-1183 1251-1259 1292-1300 1359-1367 1436-1444 65.45 0.07
10 1123-1131 1175-1183 1251-1259 1292-1300 1359-1367 61.62 0.07
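The two-stage selection behind this table (keep the combinations with the highest L, then order them by Dcov) can be sketched as a brute-force search. The scoring functions passed in below are toy stand-ins, not the real L and Dcov, and the peptide names and values are hypothetical.

```python
from itertools import combinations
import heapq

def top_combinations(peptides, l_score, dcov, size=5, top=10):
    """Keep the `top` combinations by L, then order them by Dcov (descending)."""
    best = heapq.nlargest(top, combinations(peptides, size), key=l_score)
    return sorted(best, key=dcov, reverse=True)

# Hypothetical per-peptide values used to build toy scoring functions.
quality = {"p1": 5.0, "p2": 4.0, "p3": 2.5, "p4": 1.2, "p5": 0.5, "p6": 0.0}
reach = {"p1": 0.1, "p2": 0.3, "p3": 0.2, "p4": 0.4, "p5": 0.05, "p6": 0.5}

def toy_l(combo):
    # Stand-in for L: sum of per-peptide quality.
    return sum(quality[p] for p in combo)

def toy_dcov(combo):
    # Stand-in for Dcov: sum of per-peptide reach.
    return sum(reach[p] for p in combo)

ranked = top_combinations(list(quality), toy_l, toy_dcov, size=2, top=3)
for combo in ranked:
    print(combo, f"L={toy_l(combo):.1f}", f"Dcov={toy_dcov(combo):.2f}")
```

For 32 candidate peptides and combinations of 5 there are 201,376 candidates, so exhaustive enumeration of this kind is feasible.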
42 Analysis of NS3 peptides of IC41
Plus point: No positively correlated pairs of sites!
Rank in 2-peptide vaccine design: 71 out of 496
[Figure: two bar charts over combinations of 2 NS3 peptides (1, IC41, 2, 3, 4, 5), showing double coverage and mean conservation across all genotypes; bars are annotated with L-scores (67.03, 38.34, 75.44, 72.55, 80.39, 86.93)]
43 Validation
Experiments
Existing clinical and experimental data: cannot directly validate the proposed peptides
Validation Strategy:
1. Identify a group/sector of potentially vulnerable sites (negatively correlated) that are collectively coupled
2. Validate this sector by comparing with structural and clinical data
3. Check if our vaccine targets the sites in this sector
44 1. Identify sectors of potentially vulnerable sites
Use a clustering algorithm based on the eigenvectors of the cleaned correlation matrix, C_cleaned
Analogy in finance: economic sectors
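One simple way to turn eigenvectors into sectors is to assign each site to the leading eigenvector on which it has the largest absolute loading, leaving weakly loaded sites unassigned. The slides do not specify the clustering algorithm, so this rule, the function name, and the threshold are assumptions for illustration only.

```python
import numpy as np

def sectors_from_eigenvectors(corr, n_sectors, threshold=0.1):
    """Label each site with the index of its dominant leading eigenvector, or -1."""
    vals, vecs = np.linalg.eigh(corr)          # eigenvalues in ascending order
    leading = vecs[:, -n_sectors:][:, ::-1]    # largest eigenvalue first
    loadings = np.abs(leading)
    labels = loadings.argmax(axis=1)
    labels[loadings.max(axis=1) < threshold] = -1  # weakly loaded: no sector
    return labels

# Toy correlation matrix: sites 0-3 form one group, sites 4-6 another,
# and sites 7-9 are uncorrelated with everything.
corr = np.eye(10)
corr[:4, :4] = 0.6
corr[4:7, 4:7] = 0.5
np.fill_diagonal(corr, 1.0)

labels = sectors_from_eigenvectors(corr, n_sectors=2)
print(labels.tolist())
```

In the toy matrix the two groups correspond to the two largest eigenvalues (2.8 and 2.0), so each group is recovered as its own sector.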
45 Three sectors of co-evolving sites in NS3
[Figure: 3-D scatter plot of eigenvectors, and bar charts per sector (1, 2, 3) of mean conservation, % positive correlations, % negative correlations, and neg/pos correlations]
Sector 1 consists of the most immunologically vulnerable sites
46 2. Structural significance of sector 1
[Figure: NS3 crystal structure, with sector 1 sites shown in red]
Sector 1 sites are dominant in the critical interface of the NS3 crystal structure (p-value < 0.01)
47 2. Significance of sector 1 based on previously published experimental and clinical results
[Figure: bar chart of % sector 1 sites per epitope, for allele-independent epitopes (1 to 3) and allele-restricted epitopes (1 to 13), with the >30% level marked]
The majority of peptides targeted by “HCV controllers” consist predominantly of sector 1 sites (p-value < 0.05)
48 3. Sector 1 sites in proposed vaccine design
Combination Peptide1 Peptide2 Peptide3 Peptide4 Peptide5 L Dcov
1 1251-1259 1292-1300 1436-1444 1585-1594 1585-1595 63.58 0.50
2 1123-1131 1169-1177 1251-1259 1292-1300 1436-1444 61.62 0.44
3 1123-1131 1175-1183 1251-1259 1292-1300 1436-1444 65.45 0.37
4 1123-1131 1175-1183 1251-1259 1359-1367 1436-1444 61.62 0.37
5 1169-1177 1175-1183 1251-1259 1292-1300 1436-1444 64.46 0.34
6 1123-1131 1251-1259 1292-1300 1359-1367 1436-1444 65.45 0.30
7 1251-1259 1292-1300 1436-1444 1540-1550 1541-1550 61.31 0.18
8 1169-1177 1251-1259 1292-1300 1359-1367 1436-1444 61.62 0.14
9 1175-1183 1251-1259 1292-1300 1359-1367 1436-1444 65.45 0.07
10 1123-1131 1175-1183 1251-1259 1292-1300 1359-1367 61.62 0.07
A large proportion (~60%) of sites in the proposed vaccine design belong to sector 1 (p-value < 0.01)
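A p-value for this kind of enrichment statement can be obtained, for instance, from a one-sided hypergeometric test: if the vaccine sites were drawn at random from the protein, how likely is it to see at least this many sector 1 sites? The slides do not say which test was used, so this choice is an assumption, and every input below except the NS3 length (631) is hypothetical.

```python
from math import comb

def enrichment_p_value(n_total, n_sector, n_drawn, n_hits):
    """P(at least n_hits sector sites among n_drawn sites chosen at random)."""
    upper = min(n_sector, n_drawn)
    tail = sum(
        comb(n_sector, k) * comb(n_total - n_sector, n_drawn - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / comb(n_total, n_drawn)

# Hypothetical numbers: 631 sites in NS3 (from the slides); a sector of 100 sites,
# 45 vaccine sites, and 27 of them (60%) in the sector are made up for illustration.
p = enrichment_p_value(n_total=631, n_sector=100, n_drawn=45, n_hits=27)
print(f"p-value = {p:.3g}")
```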
49 Conclusions
The majority of sites in the proposed design belong to sector 1, which appears to be significant based on experimental and clinical data available in the literature
Numerical validation of the currently proposed vaccine design, IC-41
Proposal of new vaccine design strategies which can:
  Potentially improve upon IC-41 by inducing an immune response against more vulnerable parts of the HCV genome
  Cover a large portion of the population (currently, for the US)
Similar analysis for the NS4B and NS5B proteins also reveals potential sites for vaccine design
Next step: Experimental trials!
50 Conclusions
There is much similarity between high-dimensional statistical problems in immunology and those in signal processing
Many methods common in SP find direct application (though currently not well explored):
  Maximum entropy modeling
  Sampling methods (e.g., MCMC)
  Sparsity
  Subspace estimation
  Robust estimation
  Machine learning
  …
51 Related Publications
A. A. Quadeer, R. H. Y. Louie, K. Shekhar, A. K. Chakraborty, I. Hsing, and M. R. McKay, “Discovering statistical vulnerabilities in highly mutable viruses: a random matrix approach,” in Proc. of the IEEE Workshop on Statistical Signal Processing (SSP), Gold Coast, Australia, July 2014.
A. A. Quadeer, R. H. Y. Louie, K. Shekhar, A. K. Chakraborty, I. Hsing, and M. R. McKay, “Statistical linkage of substitutions in patient-derived sequences of genotype 1a hepatitis C virus non-structural protein 3 exposes targets for immunogen design,” Journal of Virology, 88 (13), pp. 7628-7644, July 2014.