EVOLUTION OF REGIONS IN HIV GENOME: DELINEATING SELECTIVE FORCES ACTING ON CONFORMATIONAL AND LINEAR EPITOPES

A thesis submitted to Kent State University in partial fulfillment of the requirements for the degree of Master of Sciences

By

Satish Kumar Perikala

May 2010

Thesis Written by

Satish Kumar Perikala

B.V.Sc & A.H, Acharya N.G Ranga Agricultural University, India 2004

M.S Kent State University, 2010

Approved by

______, Advisor Helen Piontkivska

______, Member, Masters Thesis Committee Gail Fraizer

______, Member, Masters Thesis Committee Michael Tubergen

Accepted by

______, Chair, Department of Biomedical Sciences Robert V Dorman

______, Dean, College of Arts and Sciences John R.D. Stalvey

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

LIST OF ABBREVIATIONS

ACKNOWLEDGEMENTS

CHAPTER 1: INTRODUCTION

HIV Genome

Geographical Distribution of HIV Worldwide

Different Types of Immune Epitopes in HIV

Thesis Overview

CHAPTER 2: MATERIALS AND METHODS

HIV genomic sequence data and multiple sequence alignment

HIV Epitopes

Nucleotide Substitution Patterns

Reconstruction of Ancestral Sequence

Radical and Conservative Amino Acid Substitutions

Phylogenetic Analysis

CHAPTER 3: Patterns of synonymous and nonsynonymous substitutions at different epitope regions among B-subtype and CRF subtype HIV-1 sequences

Results

Discussion

CHAPTER 4: Patterns of radical and conservative amino acid changes at different epitope regions among selected B-subtype and CRF HIV-1 sequences

Results

Discussion

SUMMARY

LITERATURE CITED

APPENDIX

1. Box plots showing the dS - dN values in different epitope regions of Env gene

2. Box plots showing the dS - dN values in different epitope regions of Gag gene

3. Box plots showing the dS - dN values in different epitope regions of Pol gene

4. List of CTL epitopes used

5. List of T-Helper epitopes used

6. List of epitopes used

7. List of conformational epitopes used

LIST OF FIGURES

Fig 1.1 Overview of the life cycle of HIV (From: Sewell 2000)

Fig 1.2 Structure of HIV genome (Modified from Los Alamos HIV database)

Fig 1.3 Global distribution of HIV subtypes (From: International AIDS Vaccine Initiative Report, 2003)

Fig 2.1 Example showing the synonymous and nonsynonymous substitutions

Fig 2.2 Phylogenetic tree showing the ancestral sequence of Env gene of B-subtype sequences

Fig 2.3 Phylogenetic tree showing the ancestral sequence of Gag gene of B-subtype sequences

Fig 2.4 Phylogenetic tree showing the ancestral sequence of Pol gene of B-subtype sequences

Fig 2.5 Phylogenetic tree showing the ancestral sequence of Env gene of CRF-subtype sequences

Fig 2.6 Phylogenetic tree showing the ancestral sequence of Gag gene of CRF-subtype sequences

Fig 2.7 Phylogenetic tree showing the ancestral sequence of Pol gene of CRF-subtype sequences

Fig 3.1 Bar graph showing the mean dN and dS values of different epitope regions of Env gene for B and CRF sequences

Fig 3.2 Bar graph showing the mean dN and dS values of different epitope regions of Gag gene for B and CRF sequences

Fig 3.3 Bar graph showing the mean dN and dS values of different epitope regions of Pol gene for B and CRF sequences

Fig 4.1 Bar graph showing number of radical and conservative amino acid changes at different epitope regions of Env, Gag and Pol genes from B and CRF sequences based on the classification of amino acids by polarity

Fig 4.2 Bar graph showing number of radical and conservative amino acid changes at different epitope regions of Env, Gag and Pol genes from B and CRF sequences based on the classification of amino acids by charge

Fig 4.3 Bar graph showing number of radical and conservative amino acid changes at different epitope regions of Env, Gag and Pol genes from B and CRF sequences based on the classification of amino acids by polarity and volume

Fig 4.4 Bar graphs showing the proportion of radical changes relative to the total number of changes in Env gene

Fig 4.5 Bar graphs showing the proportion of radical changes relative to the total number of changes in Gag gene

Fig 4.6 Bar graphs showing the proportion of radical changes relative to the total number of changes in Pol gene

LIST OF TABLES

Table 2.1 GenBank accession numbers of 67 B subtype sequences

Table 2.2 GenBank accession numbers of 135 CRF subtype sequences

Table 2.3 Distribution of different types of epitope regions in each gene and the number of codons in each epitope category for the 3 longest genes, Gag, Pol and Env

Table 2.4 Average nucleotide p-distance values for different epitope regions of Env, Gag and Pol genes from HIV-1 sequences of B-subtype and CRFs

Table 2.5 Three types of amino acid classifications based on physicochemical properties (Modified from Zhang 2000)

Table 3.1 Average estimates of synonymous (dS) and nonsynonymous (dN) nucleotide substitutions in different epitope regions of Env gene among B subtype and CRF sequences

Table 3.2 Average estimates of synonymous (dS) and nonsynonymous (dN) nucleotide substitutions in different epitope regions of Gag gene among B subtype and CRF sequences

Table 3.3 Average estimates of synonymous (dS) and nonsynonymous (dN) nucleotide substitutions in different epitope regions of Pol gene among B subtype and CRF sequences

Table 4.1 Number of radical and conservative amino acid substitutions at Env gene from pairwise comparisons of B and CRF sequences along with the number of no-change sites

Table 4.2 Number of radical and conservative amino acid substitutions at Gag gene from pairwise comparisons of B and CRF sequences along with the number of no-change sites

Table 4.3 Number of radical and conservative amino acid substitutions at Pol gene from pairwise comparisons of B and CRF sequences along with the number of no-change sites

Table 4.4 Number of radical and conservative amino acid substitutions at Env gene from pairwise comparisons of B-subtype sequences estimated with CRF ancestral sequence

Table 4.5 Number of radical and conservative amino acid substitutions at Gag gene from pairwise comparisons of B-subtype sequences estimated with CRF ancestral sequence

Table 4.6 Number of radical and conservative amino acid substitutions at Pol gene from pairwise comparisons of B-subtype sequences estimated with CRF ancestral sequence

LIST OF ABBREVIATIONS

Ab - Antibody (Neutralizing antibody)

AIDS - Acquired Immunodeficiency Syndrome

AOCE – All Overlapping Conformational Epitopes

AOLE – All Overlapping Linear Epitopes

CE – Conformational Epitope

CRF – Circulating Recombinant Forms

CTL – Cytotoxic T Lymphocyte

HIV – Human Immunodeficiency virus

MEGA - Molecular Evolutionary Genetics Analysis

SIV – Simian Immunodeficiency Virus

Th – T-Helper

Acknowledgements

It is a pleasure to thank the many people who made this thesis possible.

It is difficult to overstate my gratitude to my Master's principal advisor, Dr. Helen Piontkivska. With her enthusiasm, her inspiration, and her great efforts to explain things clearly and simply, she helped make the subject fun for me. Throughout my thesis-writing period, she provided encouragement, sound advice, good teaching, and lots of good ideas. I would have been lost without her.

Besides my advisor, I would like to thank the rest of my thesis committee, Dr. Gail Fraizer and Dr. Michael Tubergen, for their encouragement and insightful comments.

My sincere thanks to all my Professors who taught me in the Graduate school and also to all my teachers from India.

I sincerely thank my labmate and friend, Sinu Paul, for his support throughout.

I wish to thank the rest of my friends, Pawan Puri, Teja, Sourabh, Sudhakar, Hari and many more, for helping me get through the difficult times, and for all the support, entertainment, and caring they provided.

Lastly, and most importantly, I wish to thank my parents, Showraiah and Vimala, my sister Soujanya and my brother Avinash for their endless love and support. To them I dedicate this thesis.


Chapter – 1

General Introduction


INTRODUCTION

Human immunodeficiency virus 1 (HIV-1) is the virus that causes acquired immunodeficiency syndrome (AIDS) (Gallo et al. 1984; 2003). It is estimated that over 40 million people worldwide are infected with HIV-1, and the rate of infection is increasing worldwide (Papathanasopoulos 2002). Of these 40 million, over 1 million people live in the US, with a new infection occurring every 9 ½ minutes (CDC 2009, http://www.nineandahalfminutes.org ). Notably, only a minority of these patients exhibit any symptoms at the early stages of infection, and as many as 20% are unaware of their infection (CDC 2009).

Vertebrates, including mammals, have developed a sophisticated molecular system, the immune system, to protect themselves against various pathogens such as viruses. The immune system consists of two major arms, (a) innate immunity and (b) adaptive immunity; the latter consists of cell-mediated and humoral responses (Klein and Horejsi 1997; Abbas and Lichtman 2005). Innate immunity involves pre-existing defenses of the body, such as barriers formed by skin and a broad variety of anti-microbial enzymes (Ganz and RI 2002; Boman 2003). It is a quick-response but non-pathogen-specific system that protects against many pathogens by relying on the conserved features of a wide range of pathogens. Adaptive immunity, on the other hand, is a much more complex defense system capable of specific recognition of pathogens. It is also capable of “remembering” a previous encounter with a particular pathogen and developing a more efficient response (Klein and Horejsi 1997). Major histocompatibility complex molecules (MHC, referred to as the HLA (human leukocyte antigen) system in humans) play a very important role in the adaptive immune response by presenting foreign peptides on the cell surface (Klein and Horejsi 1997). Over the course of a typical viral infection, the expressed viral proteins are digested into small peptides inside the cell and loaded onto class I MHC molecules, which are then transported to the cell surface and presented to white blood cells, namely cytotoxic T lymphocytes (CTLs, or CD8+ T cells). This recognition step between the MHC/viral peptide complex and CTLs thus plays a critical role in the elimination of many viral infections (Bjorkman et al. 1987; Klein and Horejsi 1997), including HIV and HCV (Borrow et al. 1997; Goulder et al. 2001; Klenerman 2002; Moore et al. 2002; Yusim et al. 2002; Leslie et al. 2004; Allen et al. 2005; Bowen and Walker 2005; Bowen and Walker 2005; Poon et al. 2007).

Over the course of an HIV infection, both humoral and cell-mediated immune responses have been shown to contribute to the control of infection (Cao et al. 2003, Fauci et al. 1996). An overview of the HIV life cycle and viral processing by the immune system is illustrated in Figure 1.1 (Sewell et al. 2000). After entering the host body, virus particles bind to the surface of target cells through association of the viral envelope protein with the host cell receptor (CD4). Following binding, the virus enters the host cell, the viral core is uncoated, and the RNA genome is reverse transcribed into DNA by reverse transcriptase. This proviral DNA is integrated into the host genome and may either remain dormant or enter a replication phase using the host RNA polymerase. Once the viral genes are transcribed and translated, the long chains of HIV proteins are cut into several smaller proteins by the viral protease, and these smaller proteins, along with the HIV RNA genome, are assembled into new virus particles. In productively infected cells, the newly assembled viral particles leave the infected host cell by budding (Sewell et al. 2000). In the case of latent infection, the virus hides in the host's resting CD4+ cells (cells that have not received activating stimuli and have not entered the cell cycle) (Swiggard et al. 2005).

As with the majority of viral infections, the CTL-mediated immune response plays a critical role in controlling HIV infection (Musey et al. 1997; Ogg et al. 1998; Goulder 2004; Borrow 1994; Koup 1994). A strong antibody-driven response usually occurs as well (Goulder 2004; Prince 1991; Levy 1993; Wei et al. 2003; Richman et al. 2003; Frost et al. 2005). On the other hand, loss of CTL activity is correlated with progression to acquired immunodeficiency syndrome (AIDS) (Goulder 1997, 2004). This loss of CTL activity is driven by so-called “escape”, the appearance of novel viral sequence variants that greatly diminish CTL recognition capacity (Phillips 1991; Koenig 1995) and that mostly occur during the chronic stage of infection (Goulder 2004; Evans 1999, 2000; Edwards et al. 2006). Such escape is mediated by the ongoing CTL-driven pressure to control HIV infection. Notably, because of differences in CTL recognition specificities and/or functional and structural constraints on the viral sequences, epitopes differ in their capacity to undergo escape mutations. Indeed, some epitopes have been shown to evolve under strong positive selection that favors escape mutations, while others are instead subject to strong purifying selection (Piontkivska and Hughes 2004, 2006; Paul and Piontkivska 2009).


HIV GENOME

Human immunodeficiency virus (HIV) is an RNA virus that belongs to the retrovirus family, genus Lentivirus. The virion consists of a dense nucleocapsid surrounded by a lipid bilayer embedded with viral envelope proteins (Hockley et al. 1988). The entire genome is about 9700 base pairs long, and includes a cap at the 5’ end and a poly-A tail at the 3’ end. Lentivirus genomes are also A-rich (on average 38-39% of A residues), and as a result codon usage differs between viral and cellular genes (Myers 1992). The HIV genome contains 9 protein-coding genes, whose products are in turn processed to produce an array of viral proteins. Among these 9 genes, the group-specific antigen gene (Gag), the polymerase polyprotein gene (Pol) and the envelope protein gene (Env) are the largest, and are responsible for a variety of functions, including those critical for viral replication. The genome is packaged in virions as two copies of single-stranded RNA (Frankel et al. 1998), thus providing ongoing opportunities for recombination (Levy et al. 2004; Rhodes et al. 2005). Accessory genes (such as Nef, Vif, Vpu and Vpr), while not considered essential for replication, nonetheless play an important role in host-virus interactions (Clementi 2001), although many of their functions remain to be elucidated. Figure 1.2 depicts the HIV genome with all nine genes.

GEOGRAPHIC DISTRIBUTION OF HIV-1 WORLD-WIDE

Based on the sequence similarity of multiple genomic regions, HIV-1 genomes can be divided into three phylogenetic groups: Group M, Group N and Group O. Group M is the most widely distributed worldwide and is the largest group; it is classified into 9 subtypes or clades (A, B, C, D, F, G, H, J, and K) based on the extent of sequence similarity within and between clades (Gurtler 1994, Seaman 2005). In addition to the subtypes of the M group, there exist various recombinant forms, referred to as circulating recombinant forms (CRFs), that are thought to be derived from two or more subtypes of HIV-1 via recombination (Peeters et al. 2001, Robertson et al. 2000, Robertson 2000, Peters M. 2001). To date, a total of 45 CRFs have been identified (http://www.hiv.lanl.gov/content/sequence/HIV/CRFs/CRFs.html ). The global distribution of the different subtypes of HIV-1 is shown in Figure 1.3.

Of all subtypes, the most prevalent are A, B and C, with subtype C accounting for almost 50% of all HIV-1 infections worldwide (Buonaguro et al. 2007). Subtype A is the most prevalent in African and European countries, while subtype B is distributed mostly in North America, South America, Australia and Western Europe. Subtype C is widely distributed in southern Africa and India. The recombinant forms of HIV-1 account for almost 18% (Hemelaar 2004) of all HIV-1 infections; the most predominant CRFs are CRF01-AE, which circulates mostly in Southeast Asia (Motomura 2000), and CRF02-AG, which is distributed mostly in West-Central Africa (McCutchan 1999).

DIFFERENT TYPES OF IMMUNE EPITOPES IN HIV-1

HIV-1 is considered to be one of the most variable viruses, and is characterized by a high mutation rate of 10⁻³ to 10⁻⁴ per generation (Seo et al. 2002; Achaz et al. 2004) due to its error-prone reverse transcriptase. Thus, even within a single patient, HIV shows a rather high level of genetic polymorphism (Drake et al. 2004, Mansky and Temin 1995). This high mutation rate, coupled with strong immune system-driven selection pressure, leads to rapid changes in protein-coding sequences (Briones et al. 2004), particularly within regions responsible for interactions with the host immune system, such as epitopes (antigenic determinants).

Epitopes can be broadly classified into two categories, linear epitopes and conformational epitopes. Linear epitopes are segments composed of a continuous amino acid sequence of a protein (Barlow et al. 1986; Walter 1986). Based on the arm of the immune system they interact with, they include three types of epitopes: neutralizing antibody epitope regions (Ab), CTL epitope regions and T-helper epitope regions (Th) (Langeveld et al. 2001). Some epitope regions harbor multiple epitopes, either of the same or of different epitope types (e.g., CTL+Ab, CTL+Th, and CTL+Ab+Th). Conformational epitopes (CE) are instead formed by several amino acid segments that are discontinuous in the primary structure (Benjamin 1995; Barlow et al. 1986) but are brought together by protein folding to form a 3D structure (Huang et al. 2006). Because of the structural and functional constraints likely imposed on the 3D structure of CE, conformational epitopes are expected to evolve more slowly than linear epitopes, much as structural elements of protein folds and domains evolve more slowly than the underlying primary amino acid sequences (Friedberg 2002; Reddy 2001; Li 2002). However, the extent of amino acid sequence conservation at CE remains unknown, and one of the aims of this work is to fill this important gap.

Generally, natural selection is thought of as the differential reproduction of genetically distinct individuals, which leads to the preservation of advantageous traits (Eugene et al. 2009). However, at the molecular sequence level natural selection can act either as positive (also referred to as Darwinian) selection that promotes amino acid sequence diversification (such as in the case of peptide-binding regions of class I and class II HLA molecules (Hughes 1988; Hughes 1989)), or alternatively as purifying (or negative) selection that eliminates disadvantageous mutations (Nei 2000; Hughes 2000). The latter is manifested as a much higher rate of synonymous than nonsynonymous (amino acid altering) nucleotide changes within a protein-coding gene (i.e., dS >> dN) (Nei 1986). Another way to assess selective pressure is via the ratio of radical to conservative amino acid substitutions (Hughes et al. 1990), based on a classification of amino acids by their physicochemical properties. Purifying selection is expected to lead to a much higher proportion of conservative than radical amino acid changes. Overall, such molecular sequence-based estimates of selection pressure provide a powerful tool to better understand the molecular dynamics of rapidly evolving viruses, such as HIV. Sequence changes in fast-evolving genes may provide important clues regarding the evolutionary mechanisms responsible for promoting changes in the HIV genome in response to host-pathogen interactions (Hurst 2009; Hughes 2000).


THESIS OVERVIEW:

In the present study, we examined the pattern of nucleotide substitutions at four types of epitopes in human immunodeficiency virus HIV-1, namely CTL (cytotoxic T lymphocyte), Th (T helper), Ab (neutralizing antibody) and CE (conformational) epitopes, along with overlapping epitope combinations, among publicly available full-length genomic sequences of the B subtype and CRFs. The estimates of the substitution rates were contrasted among different types of epitope regions to determine whether selective pressure acts uniformly across different epitope types or whether some types of epitopes are subject to stronger selective forces than others (Piontkivska and Hughes 2004; Hughes et al. 2005). While some studies of linear epitopes (Rambaut 2004; Piontkivska and Hughes 2004, 2006) have addressed the extent of sequence variability and the nature of the selection pressure acting at these regions, little is known about the pattern of sequence changes at conformational epitope regions. Furthermore, while positive selection has been shown to play an important role in the evolution of some linear epitopes (Rambaut 2004; Hughes 2005; Piontkivska and Hughes 2004), other epitopes have been shown to evolve under strong purifying selection (Piontkivska and Hughes 2004, 2006; Paul and Piontkivska 2009). Very little is currently known about sequence changes at conformational epitopes despite their importance: they represent the majority of all B-cell epitopes (Huang and Honda 2006; Saxena et al. 2006). Steimer et al. (1991) have shown that antibodies directed against conformational epitopes neutralized HIV isolates better than antibodies directed against linear epitopes, indicating that conformational epitopes can be better targets for vaccine preparation. Hence there is a need to better understand the mechanisms of conformational epitope evolution.

Because secondary and 3D protein structures tend to evolve more slowly than primary amino acid sequences, we hypothesize that conformational epitopes in HIV are more conserved than linear epitopes due to stronger purifying selection. To test this hypothesis, we evaluated the patterns of nucleotide substitutions in the major protein-coding genes of HIV to determine the dN and dS values. We also studied the distribution of amino acid substitutions to determine the numbers of radical and conservative changes. As evidence of strong purifying selection acting to preserve the overall 3D structure of conformational epitopes, we expected to observe dN << dS and many more conservative than radical changes at the conformational epitope regions.

Thus, the primary objective of our study is to estimate the extent of sequence conservation of conformational epitope (CE) regions and to contrast the degree of CE sequence conservation with that of linear epitopes to gain insights into the evolutionary mechanisms that affect these epitopes. Our second objective is to elucidate the distribution of selective pressures acting on nucleotide and amino acid epitope sequences in the three largest genes of HIV-1, namely the Gag, Pol and Env genes.


Fig 1.1: Overview of the Life cycle of HIV (From: Sewell 2000)

Fig 1.2: HIV genome with nine protein-coding genes. The genes highlighted in color are the Gag, Pol and Env genes, examined in this study. Nucleotide coordinates correspond to HXB2 reference sequence (Acc No. K03455). Figure source: HIV Sequence database by Los Alamos National Laboratory (http://www.hiv.lanl.gov/content/sequence/HIV/MAP/landmark.html).

Fig 1.3: Global distribution of different subtypes of HIV-1. Source: IAVI (International AIDS Vaccine Initiative) Report, 2003.

Chapter – 2

Material and Methods


MATERIALS AND METHODS

HIV genomic sequence data and multiple sequence alignment:

Genomic sequences of HIV-1 were collected from the HIV Sequence Database at the Los Alamos National Laboratory (http://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html ). We selected 67 sequences of the B subtype and 135 sequences of circulating recombinant forms (CRFs) that have full-length nucleotide sequences of the Gag, Pol and Env genes. These sequences were isolated in all major geographical regions and include sequences of the B subtype of the M group, which is prevalent in North America, as well as recombinant forms collected worldwide. Multiple sequence alignments of nucleotide sequences were constructed using ClustalW (Thompson et al. 1994) as implemented in the program MEGA 4 (Molecular Evolutionary Genetics Analysis, software version 4.0; Kumar et al. 2007), following the corresponding amino acid sequence alignment and thus taking into account the codon structure of these sequences. Accession numbers of the B subtype and CRF sequences used for this analysis are given in Tables 2.1 and 2.2, respectively.
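Aligning nucleotides according to the amino acid alignment amounts to threading each codon onto its aligned residue, so that gaps always fall between codons. The following is a minimal Python sketch of that idea, not the MEGA 4 implementation; the function name `thread_codons` and the toy sequences are hypothetical.

```python
def thread_codons(aligned_protein: str, nucleotides: str) -> str:
    """Build a codon-aware nucleotide alignment row from an aligned protein row.

    aligned_protein: one row of the amino acid alignment, with '-' for gaps.
    nucleotides:     the ungapped coding sequence that encodes that protein.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(nucleotides) != 3 * n_residues:
        raise ValueError("coding sequence must contain 3 nucleotides per residue")

    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")  # a protein gap becomes a whole-codon gap
        else:
            codons.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example (hypothetical sequences): the gap after M becomes '---'.
print(thread_codons("M-KV", "ATGAAAGTT"))
```

Because gaps are inserted only in multiples of three, the reading frame is preserved in every row of the resulting alignment.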

Different types of epitopes were mapped onto the amino acid sequence alignment, including regions that harbor overlapping epitopes (of the same or different type) and non-epitope regions. The epitope regions were divided into categories based on the types of epitopes present, whether individual or overlapping. A total of 13 different types of epitope regions were considered. Notably, not all of the 13 region types are present in every gene (Table 2.3). For example, conformational epitope regions and non-overlapping antibody epitope regions were found only in the Env gene; likewise, Th-Ab epitopes are present in the Env and Pol genes but are absent from the Gag gene.
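The assignment of codons to region categories can be illustrated with a short Python sketch. It assumes each epitope is given as a type plus 1-based inclusive codon coordinates; the function name `label_codons`, the `+`-joined label format and the coordinates in the example are illustrative assumptions and do not reproduce the 13 categories of Table 2.3.

```python
# Order in which epitope types appear in a combined label (an assumption).
TYPE_ORDER = ["CTL", "Ab", "Th", "CE"]


def label_codons(n_codons, epitopes):
    """Label every codon with the epitope types that cover it.

    epitopes: iterable of (epitope_type, start, end), using 1-based,
              inclusive codon coordinates within one gene.
    Returns a list of labels such as 'CTL', 'CTL+Ab', or 'NE' (non-epitope).
    """
    covered = [set() for _ in range(n_codons)]
    for kind, start, end in epitopes:
        if not (1 <= start <= end <= n_codons):
            raise ValueError(f"epitope {kind} {start}-{end} is out of range")
        for i in range(start - 1, end):
            covered[i].add(kind)

    labels = []
    for kinds in covered:
        if kinds:
            labels.append("+".join(k for k in TYPE_ORDER if k in kinds))
        else:
            labels.append("NE")
    return labels


# Hypothetical coordinates: a CTL epitope at codons 1-3 and an Ab epitope at 3-4.
print(label_codons(6, [("CTL", 1, 3), ("Ab", 3, 4)]))
```

Codons sharing a label can then be pooled across a gene to give the region-specific data sets on which substitution estimates are computed.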

HIV-1 Epitopes:

Lists of antibody (Ab) and T-helper (Th) epitopes found to be immunogenic in humans were collected from the HIV Database at the Los Alamos National Laboratory (http://www.hiv.lanl.gov/content/immunology/index ). Lists of all the epitopes used in this study are given in Appendices 4, 5, 6 and 7. Cytotoxic T lymphocyte (CTL) epitopes were taken from the “best defined” CTL epitope list (Frahm et al. 2006). The conformational epitopes were collected from the Conformational Epitope Database (http://web.kuicr.kyoto-u.ac.jp/~ced/ ). We focused on the epitope regions of the Gag, Pol and Env genes because these are the largest among the nine protein-coding HIV genes and form the majority of the virion (Arnold and Arnold 1991), with the remaining genes forming less than 18% of the protein-coding sequences. The other protein-coding genes were not included because of their relatively small length and the lack of sufficient numbers of codons in some epitope regions to obtain reliable estimates of the substitution patterns. Table 2.3 shows the different types of epitope regions in each gene and the number of codons in each epitope category for the three genes, Gag, Pol and Env.

Nucleotide substitution patterns:

Estimates of nucleotide substitution patterns provide valuable insights into the nature of selection acting on the epitope regions. Under purifying selection, the majority of observed nucleotide substitutions are expected to be synonymous, with only a few nonsynonymous (amino acid altering) changes (Miyata and Yasunaga 1980; Messier and Stewart 1997; Hughes 2000). On the other hand, an increased ratio of nonsynonymous to synonymous changes can be taken as evidence of positive Darwinian selection (Hughes and Nei 1988; Hughes 2000). Thus, here we use estimates of the number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN) to evaluate the strength of the selection pressures influencing each gene and type of epitope region. An example of synonymous and nonsynonymous changes at the nucleotide level is shown in Figure 2.1.
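The distinction between the two kinds of change can be made concrete by translating a codon before and after a substitution. The Python sketch below builds the standard genetic code and classifies a codon change; the specific codons (GTT, GTA, TTT) are chosen here for illustration and are assumed rather than taken from Figure 2.1.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code, in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_change(codon_before: str, codon_after: str) -> str:
    """Classify a codon change as synonymous, nonsynonymous, or no change."""
    if codon_before == codon_after:
        return "no change"
    same_amino_acid = CODON_TABLE[codon_before] == CODON_TABLE[codon_after]
    return "synonymous" if same_amino_acid else "nonsynonymous"


# Valine (GTT) to valine (GTA): third-position T-to-A change.
print(classify_change("GTT", "GTA"))
# Phenylalanine (TTT) to valine (GTA): T-to-G and T-to-A changes.
print(classify_change("TTT", "GTA"))
```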

For all three genes, pairwise estimates of the number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN) were computed using the Nei-Gojobori method with the Jukes-Cantor correction (Nei and Gojobori 1986) as implemented in MEGA4 (Kumar et al. 2007). Standard errors were computed using the bootstrap method, with 100 bootstrap pseudo-replications. dS and dN values were estimated for all types of epitope regions, including overlapping epitope regions, and for the non-epitope regions (NE). The relatively simple Nei-Gojobori method was used because it relies on fewer assumptions than more complicated methods, and thus is expected to have smaller standard errors (Nei and Kumar 2000). Sites with gaps were excluded from consideration because no comprehensive evolutionary model exists to describe indel patterns in the HIV genome. Further, only relatively few sites fall into this category (Table 2.3). We also calculated the nucleotide p-distance for each category of epitope regions to evaluate the overall extent of sequence divergence. The p-distance values for the Gag, Pol and Env genes are shown in Table 2.4.
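For readers unfamiliar with the calculation, the Nei-Gojobori procedure can be sketched in a few dozen lines of Python. This is an illustrative sketch rather than the MEGA4 code: it assumes uppercase in-frame DNA without internal stop codons, counts mutations to stop codons as nonsynonymous when tallying sites, skips pathways through stop codons, and drops codons with gaps pair by pair. All function names are hypothetical, and the bootstrap step is not shown.

```python
import math
from itertools import permutations

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def syn_sites(codon):
    """Number of synonymous sites in one codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        mutants = [codon[:pos] + b + codon[pos + 1:] for b in BASES if b != codon[pos]]
        # Assumption: a mutation to a stop codon counts as nonsynonymous.
        total += sum(CODE[m] == CODE[codon] for m in mutants) / 3
    return total


def codon_diffs(c1, c2):
    """(synonymous, nonsynonymous) differences averaged over mutational pathways."""
    positions = [i for i in range(3) if c1[i] != c2[i]]
    if not positions:
        return 0.0, 0.0
    paths = []
    for order in permutations(positions):
        cur, sd, nd = c1, 0, 0
        for pos in order:
            nxt = cur[:pos] + c2[pos] + cur[pos + 1:]
            if CODE[nxt] == "*":
                break  # skip pathways that pass through a stop codon
            if CODE[nxt] == CODE[cur]:
                sd += 1
            else:
                nd += 1
            cur = nxt
        else:
            paths.append((sd, nd))
    if not paths:
        return 0.0, float(len(positions))
    return (sum(p[0] for p in paths) / len(paths),
            sum(p[1] for p in paths) / len(paths))


def jukes_cantor(p):
    """Jukes-Cantor correction for multiple hits; defined only for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def nei_gojobori(seq1, seq2):
    """Return (dS, dN) for two aligned, in-frame coding sequences."""
    s_sites = s_diff = n_diff = 0.0
    n_codons = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 not in CODE or c2 not in CODE:
            continue  # drop codons with gaps or ambiguous bases
        s_sites += (syn_sites(c1) + syn_sites(c2)) / 2
        sd, nd = codon_diffs(c1, c2)
        s_diff += sd
        n_diff += nd
        n_codons += 1
    n_sites = 3 * n_codons - s_sites
    return jukes_cantor(s_diff / s_sites), jukes_cantor(n_diff / n_sites)


def p_distance(seq1, seq2):
    """Proportion of compared nucleotide sites that differ."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a in BASES and b in BASES]
    return sum(a != b for a, b in pairs) / len(pairs)


# Toy pair (hypothetical): one synonymous difference, GTT (Val) vs GTA (Val).
d_s, d_n = nei_gojobori("GTTGCT", "GTAGCT")
print(d_s, d_n, p_distance("GTTGCT", "GTAGCT"))
```

In this toy pair the only difference is synonymous, so dN is zero while dS is positive, which is the qualitative pattern (dS >> dN) expected under purifying selection.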

Reconstruction of ancestral sequence :

To analyze the patterns of radical and conservative amino acid changes, ancestral sequences were reconstructed using the reference sequences of the M group of HIV-1. Reference sequences for each subtype available from the Los Alamos National Laboratory HIV Database were used for this purpose. Phylogenetic trees were constructed separately for each gene, using de-gapped alignments. Ancestral sequences were likewise reconstructed separately for each gene using the program ANCESTOR (Zhang and Nei 1997). The empirical model of amino acid substitution of Jones et al. (1992) (JTT model) was used in this program to model the evolutionary changes of amino acid sequences. Figures 2.2, 2.3 and 2.4 depict phylogenetic trees of HIV-1 sequences that include the inferred ancestral sequences of the Env, Gag and Pol genes of B subtype sequences, and Figures 2.5, 2.6 and 2.7 depict the phylogenetic trees with the ancestral sequences of the Env, Gag and Pol genes of CRF sequences. To reconstruct the putative ancestral sequences of the Gag, Pol and Env genes of the B subtype, four reference sequences of the B subtype (Leitner et al. 2005) were used along with the phylogenetically similar sequences of the D, C and K subtypes for all three genes. To reconstruct the CRF ancestral sequences, all reference sequences of the 9 subtypes (Leitner et al. 2005) were used. It should be noted that, reconstructed in this fashion, the putative B-ancestor sequence represents a somewhat younger sequence than the CRF-ancestor, reflecting the more recent divergence of the B, D and K sequences compared with that of all subtypes within the M group.

Radical and conservative amino acid substitutions:

Comparison of the relative numbers of radical and conservative amino acid changes, as defined by physicochemical properties, is another approach to gauging selection pressure, in addition to dN and dS estimates. Under positive selection, the ratio of radical to conservative changes is expected to be high, while under purifying selection it is expected to be much lower, with many more conservative than radical amino acid changes. The amino acids were classified according to three criteria: (1) charge (Alff-Steinberger 1969), (2) polarity (Woese et al. 1966) and (3) polarity and volume (Grantham 1974), following Zhang (2000). An amino acid substitution is considered radical if it changes the category of the amino acid, e.g., from polar to non-polar, or from positively charged to negatively charged or neutral, and vice versa. If the physicochemical property of the residues involved does not change, i.e., if the respective amino acid site retains its property, the substitution is considered conservative. Table 2.5 shows the physicochemical classification of amino acids.
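The classification rule can be sketched in a few lines of code. The groupings below are transcribed from Table 2.5; the function names, the handling of gaps and ambiguous residues (skipped), and the toy peptide pair are illustrative assumptions rather than the procedure actually used to produce the results.

```python
# Amino acid groupings transcribed from Table 2.5 (after Zhang 2000).
CHARGE = {"positive": "RHK", "negative": "DE", "neutral": "ANCQGILMFPSTWYV"}
POLARITY = {"polar": "RNDCQEGHKSTY", "nonpolar": "AILMFPWV"}
POLARITY_VOLUME = {
    "special": "C",
    "neutral_small": "AGPST",
    "polar_small": "NDQE",
    "polar_large": "RHK",
    "nonpolar_small": "ILMV",
    "nonpolar_large": "FWY",
}


def build_lookup(scheme):
    """Map each one-letter amino acid code to its group name."""
    return {aa: group for group, residues in scheme.items() for aa in residues}


def count_changes(ancestor, descendant, scheme):
    """Return (radical, conservative) counts between two aligned peptides.

    Identical sites and sites with gaps or unknown residues are skipped
    (an assumption made for this sketch).
    """
    lookup = build_lookup(scheme)
    radical = conservative = 0
    for anc, des in zip(ancestor, descendant):
        if anc == des or anc not in lookup or des not in lookup:
            continue
        if lookup[anc] == lookup[des]:
            conservative += 1
        else:
            radical += 1
    return radical, conservative


if __name__ == "__main__":
    # Hypothetical ancestor/descendant pair: K->E, D unchanged, L->I, S->A.
    for name, scheme in [("charge", CHARGE), ("polarity", POLARITY),
                         ("polarity and volume", POLARITY_VOLUME)]:
        print(name, count_changes("KDLS", "EDIA", scheme))
```

The same substitution can be radical under one criterion and conservative under another (S to A is conservative by charge but radical by polarity), which is why the three classifications are reported separately.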

Phylogenetic analysis:

Phylogenetic trees were reconstructed using the neighbor-joining method (Saitou and Nei 1987) and the Kimura 2-parameter nucleotide substitution model in MEGA4 (Kumar et al. 2007). The complete-deletion option was used. Concatenated sequences of all three genes, Gag, Pol and Env, were used, combining epitope regions only and non-epitope regions only from the three genes separately. Phylogenetic trees of epitope regions and non-epitope regions are shown in Figures 2.9 and 2.10, respectively.
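For reference, the Kimura 2-parameter distance between two sequences is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. The sketch below implements this standard formula for a single pair of aligned sequences; it is not MEGA4's implementation, the function name is hypothetical, and skipping any site that is not A, C, G or T in both sequences is an assumption standing in for the complete-deletion option.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura 2-parameter distance between two aligned nucleotide sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gap or ambiguity code: site excluded
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


if __name__ == "__main__":
    # Hypothetical pair with one transition in ten sites (P = 0.1, Q = 0).
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

A matrix of such pairwise distances is the input that the neighbor-joining algorithm clusters into a tree.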

Fig 2.1: Example of synonymous and nonsynonymous substitutions. A T-to-A substitution within a valine codon represents a synonymous change, whereas T-to-G and T-to-A substitutions that change a phenylalanine codon into a valine codon represent a nonsynonymous change.

[Tree image not reproduced. Tips: four B-subtype and four D-subtype reference sequences plus the inferred ANCESTOR sequence; scale bar 0.02.]

Fig 2.2: Phylogenetic tree showing the placement of the B-subtype Env ancestral sequence (referred to as ANCESTOR). The tree was reconstructed using the neighbor-joining method with 500 bootstrap replicates. Here and in subsequent figures, numbers above internal branches represent bootstrap values.

[Tree image not reproduced. Tips: four B-subtype, four D-subtype and two C-subtype reference sequences plus the inferred ANCESTOR sequence; scale bar 0.01.]

Fig 2.3: Phylogenetic tree showing the placement of the B-subtype Gag ancestral sequence (referred to as ANCESTOR). The tree was reconstructed using the neighbor-joining method with 500 bootstrap replicates.

[Tree image not reproduced. Tips: two K-subtype, four D-subtype and four B-subtype reference sequences plus the inferred ANCESTOR sequence; scale bar 0.01.]

Fig 2.4: Phylogenetic tree showing the placement of the B-subtype Pol ancestral sequence (referred to as ANCESTOR). The tree was reconstructed using the neighbor-joining method with 500 bootstrap replicates.

[Tree image not reproduced. Tips: reference sequences of subtypes A1, A2, B, C, D, F1, F2, G, H, J and K plus the inferred ANCESTOR sequence; scale bar 0.01.]

Fig 2.5: Phylogenetic tree showing the placement of the CRF Env ancestral sequence (referred to as ANCESTOR). The tree was reconstructed using the neighbor-joining method with 500 bootstrap replicates.

[Tree image not reproduced. Tips: reference sequences of subtypes A1, A2, B, C, D, F1, F2, G, H, J and K plus the inferred ANCESTOR sequence; scale bar 0.02.]

Fig 2.6: Phylogenetic tree showing the placement of the CRF Gag ancestral sequence (referred to as ANCESTOR). The tree was reconstructed using the neighbor-joining method with 500 bootstrap replicates.

[Tree image not reproduced. Tips: reference sequences of subtypes A1, A2, B, C, D, F1, F2, G, H and K plus the inferred ANCESTOR sequence; scale bar 0.02.]

Fig 2.7: Phylogenetic tree showing the placement of the CRF Pol ancestral sequence (referred to as ANCESTOR). The tree was reconstructed using the neighbor-joining method with 500 bootstrap replicates.

No.  Subtype/Common Name   Acc. No     No.  Subtype/Common Name        Acc. No
1    B.AR.00.ARMS008       AY037269    35   B.CO.01.PCM013             AY561237
2    B.AR.02.02AR114146    DQ383746    36   B.CO.01.PCM034             AY561238
3    B.AR.03.03AR137681    DQ383748    37   B.DE.86D31                 U43096
4    B.AR.03.03AR138910    DQ383749    38   B.FR.83.HXB2-LAI-111BBRU   K03455
5    B.AR.04.04AR143170    DQ383750    39   B.GB.83.CAM1               D10112
6    B.AR.04.04AR151263    DQ383751    40   B.GB.x.MANC                U23487
7    B.AR.04.04AR151516    DQ383752    41   B.MM.99.mSTD101            AB097870
8    B.AR.98.ARCH054       AY037268    42   B.RU.04.04RU128005         AY682547
9    B.AR.99.ARMA132       AY037282    43   B.RU.04.04RU129005         AY751406
10   B.AU.86.MBC200        AF042100    44   B.RU.04.04RU139089         AY751407
11   B.AU.87.MBC925        AF042101    45   B.RU.04.04RU139095         AY819715
12   B.AU.x.1181           AF538302    46   B.TH.00.00TH_C3198         AY945710
13   B.AU.x.1182           AF538303    47   B.TH.96.96TH_NP1538        AY713408
14   B.AU.x.8634991        AY857144    48   B.TH.96.M041               DQ354114
15   B.AU.x.C24            AF538304    49   B.TH.96.M081               DQ354116
16   B.AU.x.C42            AF538305    50   B.TH.96.M140               DQ354112
17   B.AU.x.C76            AF538306    51   B.TH.96.M145               DQ354118
18   B.AU.x.C92            AF538307    52   B.TH.96.M149               DQ354119
19   B.BO.99.BOL0122       AY037270    53   B.TW.94.TWCYS              AF086817
20   B.BR.02.02BR002       DQ358805    54   B.US.84.SF33               AY352275
21   B.BR.02.02BR008       DQ358808    55   B.US.85.5077_85            AY835769
22   B.BR.02.02BR011       DQ358809    56   B.US.87.BC                 L02317
23   B.BR.02.02BR013       DQ358810    57   B.US.89.P896               U39362
24   B.BR.03.BREPM1023     EF637057    58   B.US.90.90US_873           AY713412
25   B.BR.03.BREPM1024     EF637056    59   B.US.94.94US_33931N        AY713410
26   B.BR.03.BREPM1027     EF637054    60   B.US.96.5155_96            AY835753
27   B.BR.03.BREPM1028     EF637053    61   B.US.98.98USHVTN1925c1     AY560107
28   B.BR.03.BREPM1032     EF637051    62   B.US.98.98USHVTN3605c9     AY560108
29   B.BR.03.BREPM1033     EF637050    63   B.US.98.98USHVTN8229c6     AY560109
30   B.BR.03.BREPM1035     EF637049    64   B.US.98.98USHVTN941c1      AY560110
31   B.BR.03.BREPM1038     EF637048    65   B.US.x.NC7                 AF049495
32   B.BR.03.BREPM1040     EF637047    66   B.UY.01.01UYTRA1092        AY781126
33   B.BR.03.BREPM2012     EF637046    67   B.UY.01.01UYTRA1179        AY781127
34   B.CN.x.RL42           U71182

Table 2.1: HIV-1 B-subtype sequences used in the analyses. The names of the HIV-1 B-subtype sequences used in the analysis are shown along with their GenBank accession numbers. A total of 67 B-subtype sequences were used (alignment downloaded from the HIV Sequence Database of the Los Alamos National Laboratory, http://hiv.lanl.gov).

No.  Subtype/Common Name      Acc. No    No.  Subtype/Common Name      Acc. No
1    2_BG.ES.99.R77           AY586544   35   1_AE.TH.99.OUR22I        AY35847
2    2_BG.CU.3.CB134          DQ2274     36   1_AE.TH.99.OUR23I        AY35848
3    3_AB.RU.98.RU981         AF193277   37   1_AE.TH.99.OUR258I       AY35849
4    12_BF.UY.99.URTR23       AF385934   38   1_AE.TH.1.OUR414I        AY3585
5    12_BF.UY.99.URTR35       AF385935   39   1_AE.TH.99.OUR422I       AY35851
6    12_BF.AR.99.ARMA159      AF385936   40   1_AE.TH..OUR595I         AY35852
7    14_BG.ES.99.X397         AF423756   41   1_AE.TH.1.OUR647I        AY35856
8    14_BG.ES.99.X421         AF423757   42   1_AE.TH..OUR661I         AY35857
9    14_BG.ES..X475           AF423758   43   1_AE.TH.1.OUR72I         AY35859
10   14_BG.ES..X477           AF423759   44   1_AE.TH..OUR724I         AY3586
11   14_BG.ES..X65            AF4596     45   1_AE.TH..OUR746I         AY35861
12   14_BG.ES..X623           AF4597     46   1_AE.TH.2.OUR769I        AY35862
13   3301B.MY.05.05MYKL0071   DQ366659   47   1_AE.TH..OUR81I          AY35863
14   3301B.MY.05.05MYKL0152   DQ366660   48   1_AE.TH.1.OUR83I         AY35864
15   3301B.MY.05.05MYKL0311   DQ366661   49   1_AE.TH..OUR2I           AY35866
16   3301B.MY.05.05MYKL0451   DQ366662   50   1_AE.TH..OUR721I         AY35867
17   12_BF.UY.1.1UYTRA12      AY781128   51   1_AE.TH.1.OUR788I        AY35868
18   24_BG.CU.3.CB619         AY9576     52   2_AG.CM.2.2CM_2162SA     AY371129
19   12_BF.UY.99.URTR17       AY37272    53   2_AG.CM.1.1CM_74NY       AY371131
20   29_BF.BR.99.BREPM11948   DQ85871    54   2_AG.CM.1.1CM_158ND      AY371132
21   28_BF.BR.99.BREPM12313   DQ85872    55   2_AG.CM.1.1CM_131NY      AY371137
22   28_BF.BR.99.BREPM1269    DQ85873    56   11_cpx.CM.1.1CM_186ND    AY371149
23   28_BF.BR.99.BREPM12817   DQ85874    57   11_cpx.CM.1.1CM_441HAN   AY37115
24   29_BF.BR.1.BREPM1674     DQ85876    58   11_cpx.CM.2.2CM_219SA    AY371151
25   3401B.TH.99.OUR1969P     EF165539   59   11_cpx.CM.2.2CM_4118STN  AY371153
26   3401B.TH.99.OUR2275P     EF165540   60   13_cpx.CM.2.2CM_3226MN   AY371154
27   3401B.TH.99.OUR2478P     EF165541   61   22_1A1.CM.1.1CM_1BBY     AY371159
28   07BC.CN.05.XJDC6441      EF368370   62   25_cpx.CM.2.1918LE       AY371169
29   07BC.CN.05.XJN0084       EF368371   63   1_AE.US.98.98US_MSC112   AY44483
30   07BC.CN.05.XJDC64312     EF368372   64   1_AE.US..US_MSC1164      AY44484
31   29_BF.BR.99.99UFRJ_1     AY455778   65   1_AE.US.98.98US_MSC28    AY44485
32   1_AE.TH.99.OUR66I        AY35843    66   1_AE.US.98.98US_MSC312   AY44486
33   1_AE.TH.99.OUR98I        AY35844    67   2_AG.US.99.99US_MSC1134  AY44489
34   1_AE.TH.99.OUR164I       AY35845    68   2_AG.US..US_MSC383       AY444811
69   1_AE.TH.99.OUR44I        AY35842    103  1_AE.TH.1.OUR786I        AY35836
70   19_cpx.CU.99.CU29        AY588971   104  1_AE.TH.2.OUR737I        AY35837
71   1_AE.TH.97.97TH_NP1695   AY713419   105  1_AE.TH.1.OUR674I        AY35838
72   1_AE.TH.97.97TH_NP1525   AY71342    106  1_AE.TH.99.OUR199I       AY35839
73   1_AE.TH.96.96TH_NP146    AY713421   107  06cpx.RU.05.04RU001      DQ400856
74   1_AE.TH.98.98TH_NP1251   AY713422   108  01AE.CN.05.FJ051         DQ859178
75   1_AE.TH.99.99TH_NI152    AY713423   109  01AE.CN.05.FJ053         DQ859179
76   1_AE.TH.96.96TH_M2138    AY713424   110  01AE.CN.06.FJ054         DQ859180
77   1_AE.TH.1.OUR69I         AY3584     111  01AE.CN.05.Fj055         EF036527
78   1_AE.TH.1.OUR642I        AY35841    112  01AE.CN.05.Fj052         EF036528
79   1_AE.TH.99.99TH_C18      AY945712   113  01AE.CN.05.Fj056         EF036529
80   1_AE.TH.1.1TH_C1436      AY945713   114  01AE.CN.05.Fj057         EF036530
81   1_AE.TH.TH_C211          AY945716   115  01AE.CN.06.Fj062         EF036531
82   1_AE.TH..TH_C2257        AY945717   116  01AE.CN.06.Fj063         EF036532
83   1_AE.TH.99.99TH_C245     AY945718   117  01AE.CN.06.Fj064         EF036533
84   1_AE.TH.1.1TH_C257       AY945719   118  01AE.CN.05.Fj065         EF036534
85   1_AE.TH.1.1TH_C3256      AY94572    119  01AE.CN.05.Fj066         EF036535
86   1_AE.TH..TH_C3347        AY945721   120  01AE.CN.06.Fj061         EF036536
87   1_AE.TH..TH_C4118        AY945722   121  35AD.AF.05.05AF094       EF158040
88   1_AE.TH..TH_C4151        AY945724   122  35AD.AF.05.05AF095       EF158041
89   1_AE.TH..TH_C4382        AY945725   123  35AD.AF.05.05AF104       EF158042
90   1_AE.TH.99.99TH_C446     AY945726   124  35AD.AF.05.05AF026       EF158043
91   1_AE.TH.99.99TH_R1149    AY945727   125  2_AG.GH.97.97GHAG1       AB49811
92   1_AE.TH.98.98TH_R1166    AY945728   126  4_cpx.GR.97.97PVMY       AF119819
93   1_AE.TH.1.1TH_R2184      AY94573    127  21_A2D.KE.99.KER23       AF45751
94   1_AE.TH.99.99TH_R36      AY945731   128  21_A2D.KE.99.KSM41       AF45772
95   1_AE.TH.99.99TH_R3265    AY945732   129  2_AG.SN.98.MP1211        AJ25156
96   27_cpx.FR.4.4CD_FR_KZS   AM85191    130  5_DF.ES.99.X492          AY22717
97   1_AE.CN.97.97CNGX_11F    AY8718     131  01AE.CF.90.90CF402       U51188
98   9_cpx.SN.95.95SN1795     AY9363     132  01AE.TH.93.93TH253       U51189
99   9_cpx.SN.95.95SN788      AY9364     133  01AE.TH.00.00THC4151     AY945724
100  9_cpx.GH.96.96GH2911     AY9365     134  1_AE.TH..OUR21I          AY35846
101  9_cpx.US.99.99DE457      AY9367     135  2_AG.SN.98.MP1213        AJ25157
102  1_AE.TH.96.M114          DQ354117

Table 2.2: HIV-1 CRF sequences used in the analyses. The names of the HIV-1 CRF sequences used in the analysis are shown along with their GenBank accession numbers. A total of 135 CRF sequences were used (alignment downloaded from the HIV Sequence Database of the Los Alamos National Laboratory, http://hiv.lanl.gov).

Number of codons in each gene and epitope region

                                  Env               Gag               Pol
Region  Type of epitope     No. of  Shared    No. of  Shared    No. of  Shared    If combined
number  region              sites   w/ gaps   sites   w/ gaps   sites   w/ gaps   epitope region
1       NE only             63      20        62      25        416     14        NE only
2       CTL only            140     17        47      6         264     5         CTL only
3       Th only             294     35        165     18        172     24        Th only
4       Ab only             110     46        -       -         -       -         Ab only
5       CE only             5       0         -       -         -       -         AOCE (all combinations with CE)
6       Th+CE               38      0         -       -         -       -         AOCE
7       CTL+Th+CE           23      0         -       -         -       -         AOCE
8       Th+Ab+CE            31      0         -       -         -       -         AOCE
9       CTL+Th+Ab+CE        9       0         -       -         -       -         AOCE
10      CTL+Th              70      33        301     33        174     14        AOLE (all combinations of linear epitopes)
11      CTL+Ab              -       -         4       0         8       0         AOLE
12      Th+Ab               101     12        -       -         13      2         AOLE
13      CTL+Th+Ab           43      4         8       0         7       2         AOLE
14      AOCE                94      0         -       -         -       -
15      AOLE                272     24        313     33        189     18

Table 2.3: Distribution of different types of epitope regions in each gene and the number of codons with deleted gaps in each epitope category for the three major genes Gag, Pol and Env.

                      B                     CRF
ENV             P-distance  SE        P-distance  SE
Ab only         0.1763      0.0170    0.1699      0.0162
CTL only        0.0818      0.0120    0.1146      0.0120
Th only         0.1280      0.0065    0.1459      0.0700
CE only         0.0448      0.0186    0.1010      0.0464
NE only         0.0900      0.0100    0.1252      0.0182
CTL+Th          0.0994      0.0097    0.1210      0.0106
Th+Ab           0.0894      0.0085    0.1352      0.0098
CTL+Th+Ab       0.0959      0.0105    0.0338      0.0127
Th+CE           0.0703      0.0108    0.1117      0.0149
CTL+Th+CE       0.0802      0.0175    0.1232      0.0027
Th+Ab+CE        0.0500      0.0113    0.0450      0.0022
CTL+Th+Ab+CE    0.0927      0.0236    0.0971      0.0153
AOLE            0.1014      0.0055    0.1268      0.0062
AOCE            0.0663      0.0060    0.1028      0.0090

                      B                     CRF
GAG             P-distance  SE        P-distance  SE
NE              0.0787      0.0134    0.1408      0.0202
CTL             0.0582      0.0103    0.0909      0.0151
Th              0.0577      0.0048    0.0821      0.0062
CTL+Th          0.0576      0.0038    0.0855      0.0052
CTL+Ab          0.0214      0.0107    0.0624      0.0249
CTL+Th+Ab       0.0481      0.0219    0.0678      0.0320
AOLE            0.0577      0.0040    0.0925      0.0062

                      B                     CRF
POL             P-distance  SE        P-distance  SE
NE              0.0448      0.0027    0.0676      0.0035
CTL             0.0561      0.0037    0.0839      0.0052
Th              0.0504      0.0045    0.0717      0.0060
CTL+Th          0.0454      0.0040    0.0695      0.0061
Th+Ab           0.0470      0.0104    0.0899      0.0184
CTL+Ab          0.0338      0.0106    0.0990      0.0336
CTL+Th+Ab       0.0466      0.0176    0.0650      0.0227
AOLE            0.0466      0.0037    0.0679      0.0052

Table 2.4: Average nucleotide p-distance values (and standard errors, SE) for different epitope regions of Env, Gag and Pol genes from HIV-1 sequences of B-subtype and CRFs.

Classification criterion   Property                         Amino acid
Charge                     Positive                         R, H, K
                           Negative                         D, E
                           Neutral                          A, N, C, Q, G, I, L, M, F, P, S, T, W, Y, V
Polarity                   Polar                            R, N, D, C, Q, E, G, H, K, S, T, Y
                           Non-polar                        A, I, L, M, F, P, W, V
Polarity & Volume          Special                          C
                           Neutral & small                  A, G, P, S, T
                           Polar & relatively small         N, D, Q, E
                           Polar & relatively large         R, H, K
                           Non-polar & relatively small     I, L, M, V
                           Non-polar & relatively large     F, W, Y

Table 2.5: Three types of amino acid classifications based on physicochemical properties (modified from Zhang 2000).

Chapter – 3

Patterns of synonymous and nonsynonymous substitutions at different epitope regions among B-subtype and CRF subtype HIV-1 sequences.

RESULTS

Patterns of synonymous and nonsynonymous substitutions at different epitope regions among B-subtype and CRF HIV-1 sequences.

As described in the Materials and Methods section (Chapter 2), we classified each codon into one of 13 categories of sites, depending on the presence of different epitope types (i.e., CTL epitopes only, CTL and antibody epitopes, etc.). The numbers of synonymous substitutions per synonymous site (dS) and nonsynonymous substitutions per nonsynonymous site (dN) were then estimated in pairwise comparisons of 67 sequences from the B subtype and 135 sequences from the recombinant forms (CRFs) of the M group. Average dS and dN values for each gene and each epitope category for the B subtype and CRF sequences are shown in Tables 3.1, 3.2 and 3.3.

As our results show, the majority of genomic regions in the HIV genome, including non-epitopes, exhibit a strong signature of purifying selection, with dS significantly exceeding dN in the majority of pairwise sequence comparisons (paired t-tests, p < 0.001), except in Ab epitope regions among B subtype sequences (Figure 3.1). Our results are consistent with previous studies showing that, in general, purifying selection plays a major role in the evolution of immunodeficiency viruses, including HIV-1 and SIV (Piontkivska and Hughes 2004, 2006; Nobubelo and Konrad 2008; Soren Banke and Marie 2009; Maria 2009; Jonathan 2006; McCauley et al. 2007; Gerald 2001).

Notably, there was substantial heterogeneity in the absolute values of dN and dS among different categories of epitope regions, with regions that harbor conformational epitopes (CE) alone or in combination with other epitope types exhibiting generally lower dN and dS values than regions without CEs (Table 3.1). In the Env gene, the smallest dN value, 0.0189, corresponds to regions that harbor three epitope types, namely T-helper, antibody and CE, followed by regions that harbor T-helper and CE epitopes. Interestingly, the third smallest dN value corresponds to different types of epitope regions in the two sequence sets: among CRF sequences it corresponds to regions that harbor CE epitopes only, while among B subtype sequences it corresponds to regions that harbor CTL, T-helper and CE epitopes. This difference can be attributed to the rather small number of codons within CE-only epitope regions (only 5 codons) and the overall higher degree of sequence diversity among CRF sequences. On the other hand, the two epitope categories with the highest dN values are regions that harbor antibody epitopes (dN = 0.13391 and 0.21545 in the CRF and B sequence sets, respectively), followed by regions that harbor T-helper epitopes (0.11758 and 0.11578, respectively). Interestingly, the antibody-harboring epitope category among B subtype sequences is also the only category in which, on average, dN exceeds dS (paired t-test, p < 0.001), reflecting the signature of positive selection acting on these epitopes. As with the highly conserved epitope categories, the reversal of this pattern among CRF sequences can be attributed to the higher sequence and antibody epitope diversity among CRFs.

In the Pol and Gag genes, the overall number of epitope categories is smaller, due to the absence of CE sites (in any combination) as well as of regions that harbor only antibody epitopes (Figures 3.2 and 3.3). Instead, these two genes have a new category of regions that harbor both CTL and antibody epitopes. It should be noted, however, that some epitope categories, such as Th+Ab (T-helper + antibody), CTL+Ab and CTL+Th+Ab, are composed of only a few codons; thus, the substitution pattern estimates obtained from these regions should be treated with caution. When these epitope categories were excluded from consideration, non-epitope regions within the Pol gene surprisingly exhibited rather low dN values, in fact the smallest (dN = 0.02743 and 0.02335 in the CRF and B sequence sets, respectively). This pattern may reflect the overall high degree of conservation of this gene due to strong structural and functional constraints that are often difficult to override even under drug pressure (Ceccherini-Silberstein 2005; Chen 2004). However, it should be noted that despite statistically significant differences between dN values at non-epitope sites and CTL epitope regions (the latter being significantly larger; paired t-tests, p < 0.001), the absolute dN values at both types of regions are still smaller than the dN values from the respective regions of the Env gene (0.023 and 0.033 for non-epitopes and CTL epitopes among B subtype Pol sequences versus 0.07008 and 0.07581 for non-epitopes and CTL epitopes among Env sequences, respectively). On the other hand, in the Gag gene, non-epitope regions have by far the highest dN values among the epitope categories, although these are smaller than those from the Env gene. The smallest dN values (among the considered epitope categories with over 10 codons) were found at CTL and CTL+Th epitope regions for the B and CRF sequence sets, respectively.

We also considered the relative strength of purifying selection as reflected by the (dS – dN) difference, with larger values expected to correspond to relatively stronger purifying selection than smaller (dS – dN) values. Notably, when the various epitope categories were compared between CRF and B subtype sequences, the (dS – dN) difference values among CRF sequences tended to have a broader range than the respective values among B subtype sequences (boxplots in Appendices 1, 2 and 3). Likewise, (dS – dN) difference values were generally higher among CRF sequences, in 18 out of 23 comparisons between different epitope categories.
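As a small worked illustration of the (dS – dN) comparison, the sketch below computes the difference from the average Gag values listed in Table 3.2 for three categories. It works on category averages only, whereas the comparisons described above rest on pairwise values, so it is meant as an arithmetic illustration and not as a reproduction of the analysis; the function and variable names are hypothetical.

```python
# Average (dN, dS) values for Gag, copied from Table 3.2.
GAG_AVERAGES = {
    "B":   {"CTL": (0.0323, 0.1697), "Th": (0.0351, 0.1661), "NE": (0.0623, 0.1479)},
    "CRF": {"CTL": (0.0476, 0.1863), "Th": (0.0495, 0.2296), "NE": (0.0669, 0.3506)},
}


def ds_minus_dn(dn: float, ds: float) -> float:
    """Difference used here as a rough gauge of purifying selection strength."""
    return ds - dn


def compare_sets(averages):
    """Return {region: (difference in B, difference in CRF)}."""
    return {
        region: (ds_minus_dn(*averages["B"][region]),
                 ds_minus_dn(*averages["CRF"][region]))
        for region in averages["B"]
    }


if __name__ == "__main__":
    for region, (b_diff, crf_diff) in compare_sets(GAG_AVERAGES).items():
        print(f"{region}: B = {b_diff:.4f}, CRF = {crf_diff:.4f}")
```

For these three Gag categories the difference computed from the averages is larger in the CRF set than in the B set.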

We also compared dN values among different epitope categories in different genes, considering only categories with more than 10 codons in all three genes, to see whether the gene has an effect on the absolute dN value. The following categories were considered: CTL epitopes, Th epitopes, CTL+Th epitopes, and non-epitope regions. Our results showed that dN values vary significantly between genes in each epitope category, with the Pol gene having significantly lower dN values in all four site categories and Env having significantly higher dN values in all site categories (one-way ANOVA, Tukey’s pairwise comparisons, p < 0.001).

DISCUSSION:

Comparison of patterns of synonymous and nonsynonymous substitutions across different epitope regions

The ratio of synonymous and nonsynonymous substitutions can provide important insights into the relative strength of selection acting at genomic regions. Generally, a ratio of nonsynonymous to synonymous substitutions (dN/dS) greater than 1 is taken as evidence of positive (diversifying) selection, whereas dN/dS < 1 serves as evidence of purifying selection (Messier and Stewart 1997; Hughes 2000).
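The interpretation rule can be written as a short helper. The sketch below is illustrative only: the function name is hypothetical, the label is purely descriptive and involves no significance test, and treating a ratio of exactly 1 as neutral follows common usage and is not stated in the text above. The example values are the average Env estimates for B subtype sequences taken from Table 3.1.

```python
def selection_signal(dn: float, ds: float):
    """Return (dN/dS, descriptive label) for a pair of average estimates."""
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    ratio = dn / ds
    if ratio > 1:
        label = "positive (diversifying) selection"
    elif ratio < 1:
        label = "purifying selection"
    else:
        label = "neutral"  # assumption: common convention for dN/dS = 1
    return ratio, label


if __name__ == "__main__":
    # (dN, dS) averages for Env, B subtype, from Table 3.1.
    env_b = {"Ab": (0.2158, 0.1923), "CTL": (0.0701, 0.1491), "AOCE": (0.0346, 0.1816)}
    for region, (dn, ds) in env_b.items():
        ratio, label = selection_signal(dn, ds)
        print(f"{region}: dN/dS = {ratio:.3f} ({label})")
```

Ratios formed from category averages in this way may differ slightly in the last decimal places from the tabulated dN/dS values.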

Our results of dN/dS << 1 and overall low absolute dN values revealed a pattern of strong purifying selection acting at the majority of epitope sites in all three major HIV genes. These results are consistent with previous results (Seibert et al. 1995; Nobubelo and Konrad 2008; Soren Banke and Marie 2009; Maria 2009; Piontkivska and Hughes 2004, 2006; Paul and Piontkivska 2009). Among B subtype sequences, the different epitope regions, including individual and combination epitopes, had dS >> dN, with the exception of antibody epitope regions of Env, indicating that, overall, purifying selection plays a major role in the evolution of the HIV genome. On the other hand, antibody epitope regions in the Env gene showed a higher number of nonsynonymous (amino acid altering) than synonymous changes, indicating that positive selection plays a role in the evolution of these epitope regions. This is in agreement with other studies (e.g., Wei et al. 2003; Seibert et al. 1995) showing that antibody escape mutations in the Env gene may reach high frequencies before their selective advantage is lost. This may be explained by the glycan shield model of Wei et al. (2003), which states that selected amino acid changes in the N-linked glycans prevent neutralizing antibody binding and that a constant number of such sites is maintained by strong selective pressure.

Notably, the evolutionary patterns of sequence changes at conformational and linear epitopes differ significantly in terms of sequence conservation. All conformational epitopes, including the regions overlapping with linear epitopes, showed a high extent of amino acid sequence conservation, while among linear epitope regions a broad range of sequence divergence values was observed. Overall, conformational epitopes, particularly overlapping regions of conformational and linear epitopes, showed a significantly higher degree of conservation than linear epitopes and non-epitope regions. When non-overlapping regions of the Env gene were considered, the patterns of sequence conservation were similar among all non-overlapping regions, including non-epitope regions, except the antibody epitopes. Among overlapping segments, the regions harboring conformational epitopes were more conserved than the regions without conformational epitopes.

From published studies it is evident that conformational epitopes present in the Env gene are considered promising targets for neutralizing antibodies because they are highly conserved in many viruses, including hepatitis C virus (Hadlock et al. 2000). A study by Moore and Ho (1993) showed that the majority of cross-reactive antibodies target discontinuous regions (conformational epitopes) rather than linear epitope regions. Similar results were observed in HIV-1 (Haigwood et al. 1990), in simian immunodeficiency virus (SIV) in monkeys (Cole et al. 1998), and in equine infectious anemia virus (Hammond et al. 1997).

[Bar chart not reproduced. Series: ENV_dN-B, ENV_dN-CRF, ENV_dS-B, ENV_dS-CRF; categories: Ab, CTL, Th, linear-Eps, with-CE, NE; vertical axis from 0 to 0.4.]

Fig 3.1: Number of synonymous substitutions per synonymous site (dS) and number of nonsynonymous substitutions per nonsynonymous site (dN) estimated at different epitope regions of the Env gene from HIV-1 sequences of B-subtype and CRFs. Ab, CTL and Th designate regions that harbor antibody-only, CTL-only and Th epitopes, respectively. Linear Eps and With CE designate epitope regions that include combinations of linear epitopes only or linear and CE epitopes, respectively. NE designates non-epitope regions. Except in the Ab category, dS significantly exceeded dN values in all comparisons (pairwise t-tests, p < 0.001).

Fig 3.2: Number of synonymous substitutions per synonymous site (dS) and number of nonsynonymous substitutions per nonsynonymous site (dN) estimated at different epitope regions of the Gag gene from HIV-1 sequences of B-subtype and CRFs. Ab, CTL and Th designate regions that harbor antibody-only, CTL-only and Th epitopes, respectively. Linear Eps and With CE designate epitope regions that include combinations of linear epitopes only or linear and CE epitopes, respectively. NE designates non-epitope regions. Ab and with CE site categories are missing from Gag and Pol genes (marked by X).

Fig 3.3: Number of synonymous substitutions per synonymous site (dS) and number of nonsynonymous substitutions per nonsynonymous site (dN) estimated at different epitope regions of the Pol gene from HIV-1 sequences of B-subtype and CRFs. Ab, CTL and Th designate regions that harbor antibody-only, CTL-only and Th epitopes, respectively. Linear Eps and With CE designate epitope regions that include combinations of linear epitopes only or linear and CE epitopes, respectively. NE designates non-epitope regions. Ab and with CE site categories are missing from Gag and Pol genes (marked by X).

Subtype  Epitope region   dN      SE      dS      SE      dN/dS
B        Ab               0.2158  0.0011  0.1923  0.0017  1.1225
B        CTL              0.0701  0.0006  0.1491  0.0013  0.4704
B        Th               0.1159  0.0004  0.2126  0.0012  0.5449
B        CE               0.0445  0.0015  0.0870  0.0043  0.5114
B        NE               0.0760  0.0007  0.2212  0.0018  0.3434
B        CTL+Th           0.0809  0.0022  0.2042  0.0747  0.3960
B        Th+Ab            0.0951  0.0023  0.1904  0.0669  0.4990
B        CTL+Th+Ab        0.0889  0.0034  0.1662  0.0882  0.5340
B        Th+CE            0.0330  0.0022  0.2657  0.1418  0.1240
B        CTL+Th+CE        0.0390  0.0033  0.2310  0.0117  0.1680
B        Th+Ab+CE         0.0188  0.0254  0.0996  0.0882  0.1880
B        CTL+Th+Ab+CE     0.0920  0.0012  0.3025  0.0048  0.3040
B        AOLE             0.0846  0.0003  0.1655  0.0008  0.5113
B        AOCE             0.0346  0.0004  0.1816  0.0013  0.1905

CRF      Ab               0.1386  0.0005  0.2562  0.0015  0.5409
CRF      CTL              0.0979  0.0005  0.2234  0.0013  0.4380
CRF      Th               0.1237  0.0005  0.3335  0.0018  0.3710
CRF      CE               0.0535  0.0012  0.5402  0.0075  0.0990
CRF      NE               0.1052  0.0005  0.3248  0.0021  0.3240
CRF      CTL+Th           0.0920  0.0074  0.2580  0.0143  0.3566
CRF      Th+Ab            0.1150  0.0066  0.2690  0.0151  0.4275
CRF      CTL+Th+Ab        0.1090  0.0088  0.2540  0.0149  0.4291
CRF      Th+CE            0.0550  0.0141  0.5250  0.0231  0.1048
CRF      CTL+Th+CE        0.1020  0.0051  0.2560  0.0142  0.3984
CRF      Th+Ab+CE         0.0490  0.0088  0.1250  0.0011  0.3920
CRF      CTL+Th+Ab+CE     0.0930  0.0022  0.1460  0.0184  0.6370
CRF      AOLE             0.0981  0.0004  0.2107  0.0010  0.4656
CRF      AOCE             0.0624  0.0003  0.2494  0.0012  0.2500

Table 3.1: Average estimates of synonymous substitutions per synonymous site (dS) and nonsynonymous substitutions per nonsynonymous site (dN) in different epitope regions of the Env gene from HIV-1 sequences of B-subtype and CRFs.

S.No  Subtype  Epitope      dN      SE      dS      SE      dN/dS
1     B        CTL          0.0323  0.0004  0.1697  0.0014  0.1905
2     B        Th           0.0351  0.0002  0.1661  0.0011  0.2113
3     B        NE           0.0623  0.0007  0.1479  0.0017  0.4215
4     B        CTL+Th       0.0343  0.0102  0.1760  0.0541  0.1948
5     B        CTL+Ab       0.0089  0.0003  0.0691  0.2791  0.1284
6     B        CTL+Th+Ab    0.0120  0.0031  0.2810  0.3189  0.0427
7     B        AOLE         0.0315  0.0002  0.1537  0.0008  0.2049

1     CRF      CTL          0.0476  0.0002  0.1863  0.0011  0.2552
2     CRF      Th           0.0495  0.0002  0.2296  0.0011  0.2156
3     CRF      NE           0.0669  0.0004  0.3506  0.0021  0.1908
4     CRF      CTL+Th       0.0450  0.0002  0.3080  0.0015  0.1461
5     CRF      CTL+Ab       0.0060  0.0002  0.1260  0.0037  0.0476
6     CRF      CTL+Th+Ab    0.0290  0.0003  0.3480  0.0035  0.0833
7     CRF      AOLE         0.0429  0.0002  0.2454  0.0011  0.1746

Table 3.2: Average estimates of synonymous substitutions per synonymous site (dS) and nonsynonymous substitutions per nonsynonymous site (dN) in different epitope regions of the Gag gene from HIV-1 sequences of B-subtype and CRFs.

S.No Epitope Subtype dN SE dS SE dN/d S 1 CTL 0.0314 0.0002 0.1507 0.0009 0.2081 2 Th 0.0263 0.0002 0.1515 0.0012 0.1737 3 NE 0.0234 0.0001 0.1559 0.0008 0.1498 4 CTL+Th 0.0250 0.0092 0.1420 0.0046 0.1760 B 5 CTL+Ab 0.0240 0.0337 0.1300 0.0022 0.1846 6 Th+Ab 0.0230 0.0251 0.3110 0.0030 0.0739 7 CTL+Th+Ab 0.0070 0.0257 0.3220 0.0371 0.0217 8 AOLE 0.0242 0.0002 0.1388 0.0008 0.1745

1 CTL 0.0446 0.0002 0.2933 0.0015 0.1522 2 Th 0.0298 0.0002 0.2595 0.0013 0.1148 3 NE 0.0274 0.0001 0.3010 0.0016 0.0911 4 CTL+Th 0.0360 0.0160 0.2230 0.1170 0.1614 CRF 5 CTL+Ab 0.0710 0.0320 0.3640 0.3850 0.1951 6 Th+Ab 0.0420 0.0320 0.4230 0.6060 0.0993 7 CTL+Th+Ab 0.0310 0.0310 0.2970 0.3450 0.1044 8 AOLE 0.0366 0.0002 0.1979 0.0010 0.1850

Table 3.3: Average estimates of synonymous substitutions per synonymous site (dS) and nonsynonymous substitutions per nonsynonymous site (dN) in different epitope regions of the Pol gene from HIV-1 sequences of the B subtype and CRFs.

Chapter – 4

Patterns of radical and conservative amino acid changes at different epitope regions among selected B-subtype and CRF HIV-1 sequences


RESULTS

Patterns of radical and conservative amino acid changes at different epitope regions among B-subtype and CRF HIV-1 sequences.

In the process of molecular evolution, amino acid substitutions generally occur more often between residues with similar physicochemical properties than between those with dissimilar properties, which helps preserve overall protein structure (Zuckerkandl and Pauling 1965; Kimura 1983). The physicochemical properties of amino acids therefore provide important insights into amino acid similarities and dissimilarities, and estimating the numbers of radical and conservative changes can help determine the rate and pattern of evolution.

Here we used three different classifications of amino acids based on their physicochemical properties. In particular, the 20 amino acids were classified based on (1) Polarity, (2) Charge and (3) Polarity and Volume (per Zhang 2000). Table 2.4 shows which amino acids belong to which category. In addition to the absolute counts of observed amino acid changes, we also considered the ratio of radical to conservative substitutions. A high ratio of radical to conservative substitutions is expected to reflect a strong influence of positive selection, as is the case for changes observed within the peptide-binding region of class II MHC molecules (Hughes et al. 1990). To compute these values, we first reconstructed putative ancestral amino acid sequences representing a common ancestor of the various B and CRF sequences (see Chapter 2 for details). The numbers of radical and conservative changes were then computed in pairwise comparisons between each sequence and its putative ancestor, separately for each epitope region and amino acid classification (Tables 4.1, 4.2, and 4.3). We also compared B subtype sequences with the ‘older’ ancestor represented by the CRF ancestor (Tables 4.4, 4.5, and 4.6), to account for the overall lower extent of sequence divergence when B sequences are compared with the B ancestor. However, the results obtained were similar, and thus below we focus on comparisons of B subtype sequences with the younger B ancestor and of CRF sequences with the older CRF ancestor (which can also be considered an ancestor of the entire M group) (see also Figures 2.5, 2.6 and 2.7).
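The counting step described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. The amino acid groupings are written out as we understand the Zhang (2000) scheme, and Table 2.4 should be treated as authoritative. The function name `count_changes` and the two example peptides are hypothetical.

```python
# Amino acid groupings as we understand Zhang (2000); Table 2.4 is authoritative.
POLARITY = {"polar": "RNDCQEGHKSTY", "nonpolar": "AILMFPWV"}
CHARGE = {"positive": "RHK", "negative": "DE", "neutral": "ANCQGILMFPSTWYV"}
POLARITY_VOLUME = {
    "special": "C",
    "neutral_small": "AGPST",
    "polar_small": "NDQE",
    "polar_large": "RHK",
    "nonpolar_small": "ILMV",
    "nonpolar_large": "FWY",
}


def count_changes(ancestor, descendant, classification):
    """Count radical, conservative and unchanged sites between two aligned peptides."""
    # Map each residue to the name of its group under the chosen classification.
    lookup = {aa: group for group, members in classification.items() for aa in members}
    counts = {"radical": 0, "conservative": 0, "no_change": 0}
    for anc, desc in zip(ancestor, descendant):
        if anc not in lookup or desc not in lookup:
            continue  # skip gaps and ambiguous residues
        if anc == desc:
            counts["no_change"] += 1
        elif lookup[anc] == lookup[desc]:
            counts["conservative"] += 1  # change within the same group
        else:
            counts["radical"] += 1  # change between different groups
    return counts


# Hypothetical ancestor/descendant pair with three changes: Y->F, V->I, K->E.
ancestor = "SLYNTVATLK"
descendant = "SLFNTIATLE"
for name, scheme in [("Polarity", POLARITY), ("Charge", CHARGE),
                     ("Polarity-Volume", POLARITY_VOLUME)]:
    print(name, count_changes(ancestor, descendant, scheme))
```

The same substitution can be radical under one classification and conservative under another (here Y to F is radical under Polarity but conservative under Charge), which is why the three classifications are tabulated separately.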

Figures 4.1, 4.2 and 4.3 show the distribution of radical and conservative changes among different epitope categories and amino acid classifications in the Env, Gag and Pol genes, respectively. Similar to the results obtained using dN and dS comparisons, the Env gene appears to have the largest number of radical substitutions overall, reflecting its high degree of sequence variability. Notably, epitope regions that harbor conformational epitopes were found to be highly conserved, with the smallest number of both radical and conservative changes overall, in agreement with our earlier finding of small dN values at such regions (Table 3.1). Interestingly, regions that harbor CTL epitopes in both B subtype and CRF sequences were also found to harbor rather high numbers of conservative amino acid substitutions under all three amino acid classifications. This pattern may reflect a complex mixture of selective forces acting on these epitopes, namely, positive selection driven by the host immune system to promote viral escape, likely acting episodically, and purifying selection due to structural and functional constraints acting on these epitope regions (Piontkivska and Hughes 2004). A similar pattern was also observed at the CTL epitopes in the Gag and Pol genes. It should be noted that the Pol gene, in agreement with the lower dN values that reflect its highly conserved nature, had a smaller number of amino acid changes overall, particularly at regions harboring T-helper epitopes and combinations of linear epitopes.

When we considered the ratio of radical to conservative changes (R/C) in the Polarity-Volume classification, the majority of epitope categories, including non-epitope regions, had an R/C ratio smaller than 1, indicating an excess of conservative changes, as would be expected under purifying selection. The absolute values of the ratio nevertheless varied widely, from 0.099 at T-helper epitope regions in the Pol gene of B subtype sequences to 1.98 at antibody epitope regions in the Env gene of CRF sequences. In addition to the latter category, only 5 other categories (out of 28 total comparisons) had R/C greater than 1: regions that harbor CTL epitopes, combinations of linear epitopes, and conformational epitopes in the Env gene of CRF sequences, and regions of linear epitopes in Pol and Gag among B subtype sequences. Interestingly, regions that harbor T-helper epitopes, as well as non-epitope regions, did not show an excess of radical amino acid changes relative to conservative changes.

When the Polarity and Charge classifications were considered, neither had epitope categories with a strong signature of positive selection (i.e., R/C > 1), although several categories had a substantial presence of radical changes, as approximated by an R/C value > 0.6 (CTL epitope regions in the Gag gene of B subtype and CRF sequences under Polarity, and antibody epitopes in CRF sequences under Charge). Notably, with the exception of the two categories described above, the other epitope categories had significantly more conservative changes under the Charge classification, which is not surprising considering the importance of charge as an amino acid property and the likelihood that the overall protein structure would change if many charge-altering substitutions occurred. Thus, these results indicate that purifying selection plays an important role in the evolution of different epitope categories, although to a somewhat different extent across genes and epitope regions.

In addition to the ratio of radical to conservative amino acid changes, we also considered the number of sites in pairwise sequence comparisons where no amino acid change had occurred. Not surprisingly, in the majority of epitope categories there were significantly more sites with no change than with either a radical or a conservative change (Tables 4.1, 4.2 and 4.3, last columns). We also computed the proportion of radical changes relative to the total number of possible changes, the R-ratio, i.e., R-ratio = R / (R + C + No change) (see Figures 4.4, 4.5, and 4.6). Overall, this ratio differed significantly only between the amino acid classifications (the Polarity-Volume classification had a significantly higher R-ratio than either the Polarity or the Charge classification; one-way ANOVA, p < 0.05); average R-ratios did not differ significantly among the three genes or between HIV-1 subtypes.
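As a worked illustration of the two ratios, the following Python sketch computes R/C and the R-ratio from counts of radical (R), conservative (C) and unchanged sites. The function names are hypothetical, and the counts are taken from Table 4.1 (Env gene, B subtype, Polarity-Volume classification). The ANOVA itself is not reproduced here.

```python
def rc_ratio(radical, conservative):
    """Ratio of radical to conservative changes (R/C)."""
    return radical / conservative if conservative else float("nan")


def r_ratio(radical, conservative, no_change):
    """Proportion of radical changes among all compared sites: R / (R + C + No change)."""
    total = radical + conservative + no_change
    return radical / total if total else float("nan")


# (Radical, Conserved, No change) from Table 4.1, Env gene, B subtype, Polarity-Volume.
env_b_polarity_volume = {
    "Ab": (558, 996, 6285),
    "NE": (1575, 2418, 10479),
}

for category, (r, c, n) in env_b_polarity_volume.items():
    print(category, round(rc_ratio(r, c), 4), round(r_ratio(r, c, n), 4))
```

For the antibody-only regions this gives R/C = 558/996, or about 0.56, and an R-ratio of 558/7839, or about 7%. The two measures answer different questions: R/C compares radical with conservative changes only, whereas the R-ratio also counts the unchanged sites in the denominator.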

As the results show, in the Env gene, which harbors all six possible epitope categories, the proportion of radical changes relative to the total number of possible changes was generally smaller under the Polarity and Charge classifications for all epitope categories (smaller than 10% among B sequences, and smaller than 15% in all but one epitope category among CRF sequences). Notably, when both Polarity and Volume were considered, the R-ratio was larger in all epitope categories but one, T-helper epitopes among B subtype sequences. Further, epitope regions that harbor conformational epitopes had a significantly higher R-ratio than other epitope categories or even non-epitope regions, although these regions also had the smallest absolute number of both radical and conservative changes. Interestingly, when the proportion of radical changes relative to all other sites (conservative changes and no changes, considered jointly) was compared, regions harboring conformational epitopes had a significantly higher proportion of radical changes than all other types of regions among CRF sequences (2x2 contingency table, Fisher exact test, p < 0.01), whereas among B subtype sequences the difference was significant only between regions with CE and regions with CTL epitopes (Fisher exact test, p < 0.05). These results suggest that when amino acid changes do occur at epitope regions that harbor conformational epitopes (either by themselves or in combination with other epitope types), they are likely to be radical changes in which both the polarity and the size of the affected residue are altered. This may reflect the complex interplay between strong functional and structural constraints operating on conformational epitopes and the selective pressure from the host immune system that favors escape. However, further studies are needed to delineate the nature of substitutions that occur at conformational epitopes.
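The 2x2 comparison can be sketched as follows. This is a minimal pure-Python implementation of the two-sided Fisher exact test, not the software used in this study. The cell layout (radical sites versus conservative plus unchanged sites, for CE regions versus CTL regions, Polarity-Volume counts from Table 4.1) is our assumption about how such a table could be built, since the exact cells are not listed in the text.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x):
        # Hypergeometric log-probability of x in the top-left cell, margins fixed.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    observed = log_p(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(exp(log_p(x)) for x in range(lowest, highest + 1)
            if log_p(x) <= observed + 1e-7)
    return min(1.0, p)


# Assumed layout, Env gene, B subtype, Polarity-Volume (Table 4.1):
# rows = CE regions and CTL regions; columns = radical, conservative + no change.
ce_row = (265, 261 + 531)
ctl_row = (1650, 2517 + 24140)
p_ce_vs_ctl = fisher_exact_two_sided(ce_row[0], ce_row[1], ctl_row[0], ctl_row[1])
print(p_ce_vs_ctl < 0.05)
```

Working in log space with `lgamma` avoids overflow for the large site counts involved. The small tolerance in the comparison guards against floating-point ties between equally likely tables.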

Likewise, in the Pol and Gag genes the majority of epitope categories had a relatively small R-ratio, with a significantly higher R-ratio at the non-epitope regions. The latter values ranged from over 30% in Pol to slightly over 40% among CRF Gag sequences. When different epitope categories were compared, non-epitope regions tended to have a significantly higher R-ratio than epitope regions (Fisher exact test, p < 0.01). However, when different epitope regions were compared with each other, regions harboring CTL epitopes were found to have a slightly higher R-ratio, although the difference was not statistically significant, in all comparisons except that with linear epitopes in Gag sequences of the B subtype.


DISCUSSION

Comparison between radical and conservative amino acid changes at different epitope regions among B-subtype and CRF HIV-1 sequences.

Functional and structural constraints acting at the level of protein secondary and tertiary structure often result in clusters of functionally critical residues being organized as adjacent amino acid regions that are conserved across homologous sequences (Goldenberg 2009; Madabushi 2002; Del Sol 2003). Thus, to preserve the overall features of protein structure, the majority of amino acid changes in the course of protein evolution tend to occur between amino acids that share similar physicochemical properties rather than between dissimilar residues (Zuckerkandl and Pauling 1965; Kimura 1983; Zhang 2000; McClellan and McCracken 2001; Hughes et al. 1990).

Therefore, examining the physicochemical properties of amino acids and the patterns of their replacement within certain regions, such as epitope regions, can provide important insights into the nature of the selective forces operating on such regions. In particular, if positive (diversifying) selection is a major influence, the number of radical amino acid substitutions (i.e., those involving amino acid residues that belong to different physicochemical categories) is expected to be rather large. Under purifying selection, on the other hand, only a few radical amino acid substitutions are expected, and the majority of changes, if any, will be conservative (i.e., between amino acid residues that belong to the same physicochemical category).


Differences in the physicochemical properties of amino acids play an important role in the evolution of protein sequences by influencing the range of amino acid changes that can occur (McClellan and McCracken 2001). Over 130 different physicochemical properties have been described for the 20 amino acids (Sneath 1966), of which polarity, hydropathy, charge, volume, aromaticity, aliphaticity and hydrogenation were shown to play a role in determining the rate and pattern of protein evolution (Xia and Li 1998). Here we used three different classifications of amino acids (Zhang 2000), namely, charge, polarity, and polarity and volume combined. If an amino acid substitution leads to a substantial change in the conformation of the protein domain or genomic region involved, this would result in an evolutionary change (Zhang 2000).

The results obtained in this study are consistent with the expectation that protein sequences will harbor a larger number of conservative than radical amino acid changes (Zuckerkandl and Pauling 1965; Epstein 1967; Dayhoff et al. 1972). This may be because the genetic code has evolved to minimize the change in polarity when a nonsynonymous mutation involving a single nucleotide change occurs (Sonneborn 1965; Alff-Steinberger 1969). As Xia and Li (1998) showed, primitive amino acids differed mostly in polarity, but as new amino acids appeared, the genetic code evolved to minimize the polarity differences between amino acids. In contrast, charge and volume did not differ much, the variation in them was not significant, and hence they did not play a significant role in the evolution of early amino acids.


Our results showed that, based on polarity, all epitope regions, including linear and conformational epitopes, had a higher number of conservative than radical changes. In the Env gene, when we compared conformational epitope regions with linear ones, the majority of evolutionarily conserved regions belonged to the conformational epitopes, with an R/C ratio of 0.1. In other words, the number of radical changes was one tenth of the number of conservative changes, or less. In the Gag and Pol genes, T-helper epitope regions, including the overlapping regions, harbored a larger number of conservative than radical changes. On the other hand, overlapping CTL+Ab regions in the Pol gene showed a signature of positive selection, with a larger number of radical changes. Results based on polarity and volume showed an almost equal number of radical and conservative changes at the conformational epitopes and a significantly higher number of radical changes at the overlapping conformational epitopes, owing to the change in volume, whereas the other classifications showed a significantly higher number of conservative than radical changes. Likewise, among linear epitopes some regions also showed a rather high number of radical changes.

This pattern may reflect a complex mixture of selective forces acting on these epitopes, namely, positive selection driven by the host immune system to promote viral escape, likely acting episodically (due to differences in HLA haplotypes between patients), and purifying selection due to structural and functional constraints acting on these epitope regions (Piontkivska and Hughes 2004). A similar pattern was also observed at the CTL epitopes in the Gag and Pol genes. It should be noted that the Pol gene, in agreement with the lower dN values that reflect its highly conserved nature, had a smaller number of amino acid changes overall, particularly at regions harboring T-helper epitopes and combinations of linear epitopes.

In summary, our study shows that although conformational epitope regions evolve predominantly through purifying selection, some regions are likely influenced by (episodic) positive selection, indicating that, similar to linear epitopes, conformational epitopes are also subject to conflicting selective pressures due to structural and functional constraints and escape pressure driven by the host immune system.


Fig 4.1: Ratios of Radical to Conservative amino acid substitutions in Env, Gag and Pol genes from B subtype and CRF sequences, based on Polarity. Ab, CTL and Th designate regions that harbor antibody-only, CTL-only and Th epitopes, respectively. Linear Eps and With CE designate epitope regions that include combinations of linear epitopes only or linear and CE epitopes, respectively. NE designates non-epitope regions. Ab and with CE site categories are missing from Gag and Pol genes (marked by X).


Fig 4.2: Ratios of Radical to Conservative amino acid substitutions in Env, Gag and Pol genes from B subtype and CRF sequences, based on Charge. Ab, CTL and Th designate regions that harbor antibody-only, CTL-only and Th epitopes, respectively. Linear Eps and With CE designate epitope regions that include combinations of linear epitopes only or linear and CE epitopes, respectively. NE designates non-epitope regions. Ab and with CE site categories are missing from Gag and Pol genes (marked by X).


Fig 4.3: Ratios of Radical to Conservative amino acid substitutions in Env, Gag and Pol genes from B subtype and CRF sequences, based on Polarity and Volume. Ab, CTL and Th designate regions that harbor antibody-only, CTL-only and Th epitopes, respectively. Linear Eps and With CE designate epitope regions that include combinations of linear epitopes only or linear and CE epitopes, respectively. NE designates non-epitope regions. Ab and with CE site categories are missing from Gag and Pol genes (marked by X).


[Figure 4.4 a: stacked bar chart. Vertical axis: relative percent of radical changes versus conservative or no changes (0% to 100%); legend: Radical %, Conservative or No change %. Horizontal axis: epitope categories of Env gene (B). Bar labels: Ab 1554/7839, CTL 4167/28307, Th 3587/16034, Linear Ep 2862/14690, Ep with CE 1076/3566, NE 3993/14472.]

Fig 4.4 a) : Relative percentage of sites with radical amino acid changes to sites with conservative or no changes in the Env gene of B-subtype sequences.

[Figure 4.4 b: stacked bar chart. Vertical axis: relative percent of radical changes versus conservative or no changes (0% to 100%); legend: Radical %, Conservative or No change %. Horizontal axis: epitope categories of Env gene (CRF). Bar labels: Ab 2597/13960, CTL 9286/46845, Th 5053/22410, Linear Ep 4646/14505, Ep with CE 2178/3843, NE 1936/3026.]

Fig 4.4 b) : Relative percentage of sites with radical amino acid changes to sites with conservative or no changes in the Env gene of CRF subtype sequences.


[Figure 4.5 a: stacked bar chart. Vertical axis: relative percent of radical changes versus conservative or no changes (0% to 100%); legend: Radical %, Conservative or No change %. Horizontal axis: epitope categories of Gag gene (B). Bar labels: Ab absent, CTL 3360/29647, Th 2091/26130, Linear Ep 2533/13076, Ep with CE absent, NE 2086/4065.]

Fig 4.5 a) : Relative percentage of sites with radical amino acid changes to sites with conservative or no changes in the Gag gene of B subtype sequences.

[Figure 4.5 b: stacked bar chart. Vertical axis: relative percent of radical changes versus conservative or no changes (0% to 100%); legend: Radical %, Conservative or No change %. Horizontal axis: epitope categories of Gag gene (CRF). Bar labels: Ab absent, CTL 5977/54000, Th 5283/47790, Linear Ep 2872/30635, Ep with CE absent, NE 2770/3105.]

Fig 4.5 b) : Relative percentage of sites with radical amino acid changes to sites with conservative or no changes in the Gag gene of CRF subtype sequences.


[Figure 4.6 a: stacked bar chart. Vertical axis: relative percent of radical changes versus conservative or no changes (0% to 100%); legend: Radical %, Conservative or No change %. Horizontal axis: epitope categories of Pol gene (B). Bar labels: Ab absent, CTL 10978/50282, Th 530/7973, Linear Ep 479/7762, Ep with CE absent, NE 11457/13333.]

Fig 4.6 a) : Relative percentage of sites with radical amino acid changes to sites with conservative or no changes in the Pol gene of B subtype sequences.

[Figure 4.6 b: stacked bar chart. Vertical axis: relative percent of radical changes versus conservative or no changes (0% to 100%); legend: Radical %, Conservative or No change %. Horizontal axis: epitope categories of Pol gene (CRF). Bar labels: Ab absent, CTL 8973/76680, Th 474/16065, Linear Ep 595/4619, Ep with CE absent, NE 4586/6142.]

Fig 4.6 b) : Relative percentage of sites with radical amino acid changes to sites with conservative or no changes in the Pol gene of CRF subtype sequences.

(No. of changes based on each amino acid classification)
                                 Polarity                       Charge                   Polarity-Volume
Subtype Epitopes         Radical Conserved       R/C   Radical Conserved       R/C   Radical Conserved       R/C No change
B       Ab                   269      1285    0.2093       307      1271    0.2415       558       996    0.5602      6285
B       CTL                  862      3406    0.2531       942      3326    0.2832      1650      2517    0.6555     24140
B       Th                  1483      2484    0.5970       940      3027    0.3105      1436      2151    0.6676     12447
B       CE                    60       466    0.1290       193       333    0.5800       265       261    1.0150       531
B       NE                   891      3102    0.2872       821      3172    0.2588      1575      2418    0.6514     10479
B       CTL+Th               364      1174    0.3100       245      1293    0.1890       525      1013    0.5180      7574
B       Th+Ab                225       571    0.3940       120       626    0.1920       290       456    0.6360      2956
B       CTL+Th+Ab            108       470    0.2300        44       534    0.0820       140       438    0.3200      1298
B       Th+CE                 56       122    0.4590        13       165    0.0790       102        76    1.3420      1028
B       CTL+Th+CE             56       122    0.4590        13       165    0.0790       102        76    1.3420      1028
B       Th+Ab+CE              38        59    0.6440         7        90    0.0780        14        83    0.1690       588
B       CTL+Th+Ab+CE          40        57    0.7020         4        93    0.0430        28        69    0.4060       372
B       AOLE                 697      2215    0.3147       409      2453    0.1667       955      1907    0.5008     11828
B       AOCE                 250       826    0.3027       230       846    0.2719       511       565    0.9044      2490

CRF     Ab                   685      1912    0.3583      1225      1372    0.8929      1726       871    1.9816     11363
CRF     CTL                 3215      6071    0.5296      2912      6374    0.4569      5731      3555    1.6121     37559
CRF     Th                  1380      3673    0.3757      1607      3446    0.4663      2397      2656    0.9025     17357
CRF     CE                   272       891    0.3050       438       725    0.6040       685       478    1.4330      7207
CRF     NE                   430      1506    0.2855       687      1249    0.5500       814      1122    0.7255      3026
CRF     CTL+Th              1143      2009    0.5690       646      2506    0.2580      1537      1615    0.9520      7403
CRF     Th+Ab                255       806    0.3160       505       556    0.9080       659       402    1.6390      1934
CRF     CTL+Th+Ab            143       290    0.4930       161       272    0.5920       304       129    2.3570       522
CRF     Th+CE                142       268    0.5300       108       302    0.3580       288       122    2.3610       600
CRF     CTL+Th+CE            137       252    0.5440       107       282    0.3790       284       105    2.7050       961
CRF     Th+Ab+CE               0       102    0.0000         2       100    0.0200         2       100    0.0200       438
CRF     CTL+Th+Ab+CE           4       110    0.0360         3       111    0.0270         6       108    0.0560       831
CRF     AOLE                1541      3105    0.4963      1312      3334    0.3935      2500      2146    1.1650      9859
CRF     AOCE                 555      1623    0.3420       658      1520    0.4329      1265       913    1.3855      1667

Table 4.1: Numbers of radical and conservative amino acid substitutions in the Env gene of B-subtype and CRF sequences, along with the number of sites with no change.


(No. of changes based on each amino acid classification)
                                 Polarity                       Charge                   Polarity-Volume
Subtype Epitopes         Radical Conserved       R/C   Radical Conserved       R/C   Radical Conserved       R/C No change
B       CTL                  729      2531    0.2880       564      2796    0.2017      1250      2110    0.5924     26287
B       Th                   315      1776    0.1774       495      1596    0.3102       677      1414    0.4788     24039
B       NE                   662      1424    0.4649       405      1681    0.2409       977      1109    0.8810      1979
B       CTL+Th               267      1341    0.1990       393      1215    0.3230       647       961    0.6730      9648
B       CTL+Ab               367       462    0.7940       206       639    0.3220       688       157    4.3820       439
B       CTL+Th+Ab              0        80    0.0000         2        78    0.0260         2        78    0.0260       456
B       AOLE                 634      1883    0.3367       601      1932    0.3111      1337      1196    1.1179     10543

CRF     CTL                 1717      4260    0.4031       932      5045    0.1847      2175      3802    0.5721     48023
CRF     Th                  1680      3603    0.4663       711      4572    0.1555      1952      3331    0.5860     42507
CRF     NE                   686      2084    0.3292       722      2048    0.3525      1346      1424    0.9450       335
CRF     CTL+Th              1040      1748    0.5950       391      2397    0.1630       851      1937    0.4390     19892
CRF     CTL+Ab                 0         4    0.0000         0         4    0.0000         0         4    0.0000      1076
CRF     CTL+Th+Ab              0        80    0.0000         2        78    0.0260         2        78    0.0260       460
CRF     AOLE                1040      1832    0.5677       393      2479    0.1585       853      2019    0.4225     27763

Table 4.2: Numbers of radical and conservative amino acid substitutions in the Gag gene of B-subtype and CRF sequences, along with the number of sites with no change.


(No. of changes based on each amino acid classification)
                                 Polarity                       Charge                   Polarity-Volume
Subtype Epitopes         Radical Conserved       R/C   Radical Conserved       R/C   Radical Conserved       R/C No change
B       CTL                 1592      2118    0.7517      1791      1918    0.9338      6457      4521    1.4282     39304
B       Th                    23       507    0.0454        64       466    0.1373        48       482    0.0996      7443
B       NE                  4308      7149    0.6026      3268      9189    0.3556      4864      6593    0.7378      1876
B       CTL+Th                63       154    0.4090        16       201    0.0800        23       194    0.1190      5411
B       CTL+Ab                60        19    3.1580        54        26    2.0770        66        13    5.0770      1328
B       Th+Ab                  2       139    0.0140         4       137    0.0290        17       124    0.1370       395
B       CTL+Th+Ab              0        42    0.0000         0        42    0.0000         0        42    0.0000       226
B       AOLE                 125       354    0.3531        74       406    0.1823       106       373    0.2842      7283

CRF     CTL                 2079      6894    0.3016      2340      6633    0.3528      4409      4564    0.9660     67707
CRF     Th                    75       399    0.1880        32       442    0.0724       175       299    0.5853     15591
CRF     NE                  1164      3422    0.3402      1341      3245    0.4133      1956      2630    0.7437      1556
CRF     CTL+Th                14        75    0.1870         6        93    0.0650        36        53    0.6790      1314
CRF     CTL+Ab               153       265    0.5770       142       276    0.5140       159       259    0.6140      2094
CRF     Th+Ab                  3        41    0.0730         8        36    0.2220        21        23    0.9130       308
CRF     CTL+Th+Ab              3        41    0.0730         8        36    0.2220        21        23    0.9130       308
CRF     AOLE                 173       422    0.4100       164       441    0.3719       237       358    0.6620      4024

Table 4.3: Numbers of radical and conservative amino acid substitutions in the Pol gene of B-subtype and CRF sequences, along with the number of sites with no change.


(No. of changes based on each amino acid classification)
                                 Polarity                       Charge                   Polarity-Volume
Subtype Epitopes         Radical Conserved       R/C   Radical Conserved       R/C   Radical Conserved       R/C No change
B       Ab                   476       642    0.7414       254       864     0.294       616       502    1.2271      5381
B       CTL                  897      2361    0.3799       858      2400    0.3575      1693      1565    1.0818     19924
B       Th                   922      1296    0.7114       647      1571    0.4118      1451       767    1.8918      8904
B       CE                    44       315    0.1397       136       223    0.6099       208       151    1.3775      3862
B       NE                   586      1254    0.4673       611      1229    0.4972       856       984    0.8699      1376
B       CTL+Th               382       898    0.4254       534       746    0.7158       938       342    2.7427      4951
B       Th+Ab                242       306    0.7908        80       468    0.1709       382       166    2.3012      1931
B       CTL+Th+Ab            106       169    0.6272        31       244     0.127       181        94    1.9255       596
B       Th+CE                  9        81    0.1111        18        72      0.25        18        72      0.25       915
B       CTL+Th+CE              9        78    0.1154         8        79    0.1013        17        70    0.2429       583
B       Th+Ab+CE               1        73    0.0137         2        74     0.027         9        65    0.1385       261
B       CTL+Th+Ab+CE           1        77     0.013         4        74    0.0541        10        68    0.1471       391
B       AOLE                 730      1373    0.5317       645      1458    0.4424      1501       602    2.4934      7478
B       AOCE                  20       309    0.0647        32       299     0.107        54       275    0.1964      2150

Table 4.4: Numbers of radical and conservative amino acid substitutions in the Env gene of B-subtype sequences, estimated against the CRF ancestral sequence. Sites with no change are given in the last column.


(No. of changes based on each amino acid classification)
                                 Polarity                       Charge                   Polarity-Volume
Subtype Epitopes         Radical Conserved       R/C   Radical Conserved       R/C   Radical Conserved       R/C No change
B       CTL                  658      1956    0.3364       598      2016    0.2966      1186      1428    0.8305     24587
B       Th                   302      1426    0.2118       512      1216    0.4211       584      1144    0.5105     22369
B       NE                   489      1581    0.3093       416      1654    0.2515       912      1158    0.7876      1812
B       CTL+Th               267      1341    0.1991       393      1215    0.3235       576      1032    0.5581      8547
B       CTL+Ab               342       483    0.7081       189       636    0.2972       615       210    2.9286       390
B       CTL+Th+Ab              0        80         0         2        78    0.0256         2        78    0.0256       426
B       AOLE                 609      1904    0.3199       584      1929    0.3027      1193      1320    0.9038      9784

Table 4.5: Numbers of radical and conservative amino acid substitutions in the Gag gene of B-subtype sequences, estimated against the CRF ancestral sequence. Sites with no change are given in the last column.

(No. of changes based on each amino acid classification)
                                 Polarity                       Charge                   Polarity-Volume
Subtype Epitopes         Radical Conserved       R/C   Radical Conserved       R/C   Radical Conserved       R/C No change
B       CTL                 1456      1958    0.7436      1625      1789    0.9083      2311      1103    2.0952     36251
B       Th                    23       507    0.0454        64       466    0.1373        48       482    0.0996      7053
B       NE                  3952      6399    0.6176      2956      7395    0.3997      4431      5919    0.7486      1625
B       CTL+Th                63       154    0.4091        16       201    0.0796        23       194    0.1186      5411
B       CTL+Ab                60        19    3.1579        54        26    2.0769        66        13    5.0769      1328
B       Th+Ab                  2       139    0.0144         4       137    0.0292        17       124    0.1371       395
B       CTL+Th+Ab              0        42    0.0000         0        42    0.0000         0        42    0.0000       226
B       AOLE                 125       354    0.3531        74       406    0.1823       106       373    0.2842      7283

Table 4.6: Numbers of radical and conservative amino acid substitutions in the Pol gene of B-subtype sequences, estimated against the CRF ancestral sequence. Sites with no change are given in the last column.

SUMMARY

Viral epitopes play a critical role in the interaction between a virus and the host immune system. Although a few studies have shown that antibodies directed against conformational epitopes neutralize HIV isolates better than those directed against linear epitopes, very little is currently known about the evolutionary pattern of conformational epitopes, particularly the extent of evolutionary sequence conservation and the relative strength of the selection pressure acting on such epitope regions. This study therefore focused on the mechanisms of molecular evolution of different epitope regions in the HIV-1 genome, in particular on assessing the extent of nucleotide and amino acid sequence conservation and delineating the selective forces acting on conformational and linear epitopes.

The patterns of nucleotide and amino acid substitutions were first estimated at conformational epitopes and three types of linear epitopes (CTL, T-helper and neutralizing antibody) in the Gag, Pol and Env genes of HIV-1, and were then contrasted among the different types of epitope regions to determine the pattern of selective pressure acting across them. The results showed a pattern of strong purifying selection acting at the majority of epitope regions in all three genes, Gag, Pol and Env, which is consistent with the results of many previous studies. With respect to amino acid substitutions, while conservative changes outnumbered radical ones in the majority of the epitope regions (indicating that purifying selection is the dominant selective force, removing the deleterious effects of drastic amino acid changes), some HIV-1 genomic regions showed a trend toward an increased number of radical amino acid changes.


Altogether, this study showed that conformational epitopes are much more conserved than linear epitopes, and that although conformational epitope regions evolve predominantly through purifying selection, some sites within these regions may also be subject to the influence of positive selection. This indicates that, similar to linear epitopes, conformational epitopes are subject to conflicting (and potentially episodic) selective pressures: positive selection that favors mutations facilitating escape from the host immune system, and purifying selection due to functional and structural constraints acting at the protein level.

From published studies it is evident that conformational epitopes are promising targets for vaccine development in viruses like HIV-1. The results of our study, which indicate that conformational epitopes are more conserved than linear epitopes, support this idea. However, one of the major current limitations on the use of conformational epitopes is that very few have been identified in HIV-1 so far. A better understanding of conformational epitopes can provide useful information for new vaccine design and new diagnostic reagent development, and will be of great value for the development of treatments against HIV-1 infection. There is an urgent need to focus more attention on studies of conformational epitopes, including the identification of other such epitopes as part of HIV-1 vaccine research. Further computational and experimental analyses are needed to identify which conformational epitope(s) are the best candidates for an epitope vaccine, as well as to determine whether a combination of conformational and linear epitopes can lead to better treatment efficiency.

APPENDIX – 1

Box plots showing the dS - dN values of different epitope regions in the Env gene. (1 represents B subtypes and 0 represents CRF subtypes)


APPENDIX – 2

Box plots showing the dS - dN values of different epitope regions in the Gag gene. (1 represents B subtypes and 0 represents CRF subtypes)

APPENDIX – 3

Box plots showing the dS - dN values of different epitope regions in the Pol gene. (1 represents B subtypes and 0 represents CRF subtypes)


APPENDIX – 4

CTL Epitopes

Epitope       Protein  Start #  End #    Epitope       Protein  Start #  End #
GELDRWEKI     Gag      11       19       YTAFTIPSV     Pol      127      135
KIRLRPGGK     Gag      18       26       TAFTIPSI      Pol      128      135
IRLRPGGKK     Gag      19       27       NETPGIRYQY    Pol      137      146
RLRPGGKKK     Gag      20       28       IRYQYNVL      Pol      142      149
RLRPGGKKKY    Gag      20       29       SPAIFQSSM     Pol      156      164
GGKKKYKLK     Gag      24       32       AIFQSSMTK     Pol      158      166
KYKLKHIVW     Gag      28       36       KQNPDIVIY     Pol      173      181
HLVWASREL     Gag      33       41       HPDIVIYQY     Pol      175      183
LVWASRELERF   Gag      34       44       NPEIVIYQY     Pol      175      183
WASRELERF     Gag      36       44       VIYQYMDDL     Pol      179      187
ELRSLYNTV     Gag      74       82       IEELRQHLL     Pol      202      210
RSLYNTVATLY   Gag      76       86       IVLPEKDSW     Pol      244      252
SLYNTVATL     Gag      77       85       LVGKLNWASQIY  Pol      260      271
LYNTVATL      Gag      78       85       KLNWASQIY     Pol      263      271
LYNTVATLY     Gag      78       86       QIYPGIKVR     Pol      269      277
TLYCVHQK      Gag      84       91       YPGIKVRQL     Pol      271      279
IEIKDTKEAL    Gag      92       101      ILKEPVHGV     Pol      309      317
NSSKVSQNY     Gag      124      132      ILKEPVHGVY    Pol      309      318
VQNLQGQMV     Gag      3        11       GQGQWTYQI     Pol      333      341
HQAISPRTL     Gag      12       20       IYQEPFKNLK    Pol      341      350
QAISPRTLNAW   Gag      13       23       RMRGAHTNDV    Pol      356      365
ISPRTLNAW     Gag      15       23       RMRGAHTNDVK   Pol      356      366
SPRTLNAWV     Gag      16       24       IAMESIVIW     Pol      375      383
VKVIEEKAF     Gag      24       32       PIQKETWETW    Pol      392      401
EEKAFSPEV     Gag      28       36       GAETFYVDGA    Pol      436      445
KAFSPEVI      Gag      30       37       ETFYVDGAANR   Pol      438      448
KAFSPEVIPMF   Gag      30       40       ETKLGKAGY     Pol      449      457
FSPEVIPMF     Gag      32       40       IVTDSQYAL     Pol      495      503
EVIPMFSAL     Gag      35       43       VTDSQYALGI    Pol      496      505
VIPMFSAL      Gag      36       43       QIIEQLIKK     Pol      520      528
SEGATPQDL     Gag      44       52       LFLDGIDKA     Pol      560      8
TPQDLNTML     Gag      48       56       LPPIVAKEI     Pol      28       36
TPYDINQML     Gag      48       56       THLEGKIIL     Pol      66       74
GHQAAMQML     Gag      61       69       STTVKAACWW    Pol      123      132
KETINEEAA     Gag      70       78       IQQEFGIPY     Pol      135      143
ETINEEAAEW    Gag      71       80       VRDQAEHL      Pol      165      172
AEWDRVHPV     Gag      78       86       KTAVQMAVF     Pol      173      181
HPVHAGPIA     Gag      84       92       AVFIHNFKRK    Pol      179      188
GQMREPRGSDI   Gag      94       104      FKRKGGIGGY    Pol      185      194
TSTLQEQIGW    Gag      108      117      IIATDIQTK     Pol      203      211
PPIPVGDIY     Gag      122      130      KIQNFRVYY     Pol      219      227
EIYKRWII      Gag      128      135      VPRRKAKII     Pol      260      268
RRWIQLGLQK    Gag      131      140      RKAKIIRDY     Pol      263      271
KRWIILGLNK    Gag      131      140      RVKEKYQHL     Env      2        10
GLNKIVRMY     Gag      137      145      AENLWVTVY     Env      31       39
VRMYSPVSI     Gag      142      150      AENLWVTVYY    Env      31       40
RMYSPTSI      Gag      143      150      TVYYGVPVWK    Env      37       46
FRDYVDRFF     Gag      161      169      VPVWKEATTT    Env      42       51
FRDYVDRFYK    Gag      161      170      VPVWKEATTTL   Env      42       52
RDYVDRFFKTL   Gag      162      172      LFCASDAKAY    Env      52       61
RDYVDRFYKTL   Gag      162      172      KAYETEVHNVW   Env      59       69
YVDRFYKTL     Gag      164      172      YETEVHNVW     Env      61       69
YVDRFFKTL     Gag      164      172      DPNPQEVVL     Env      78       86
DRFYKTLRA     Gag      166      174      MHEDIISLW     Env      104      112
AEQASQDVKNW   Gag      174      184      SVITQACPK     Env      199      207
AEQASQEVKNWM  Gag      174      185      SFEPIPIHY     Env      209      217
QASQEVKNW     Gag      176      184      RPNNNTRKSI    Env      298      307
DCKTILKAL     Gag      197      205      HIGPGRAFY     Env      310      318
ACQGVGGPGHK   Gag      217      227      RGPGRAFVTI    Env      311      320
GPGHKARVL     Gag      223      231      SFNCGGEFF     Env      375      383
AEAMSQVTNS    Gag      1        10       LPCRIKQII     Env      416      424
CRAPRKKGC     Gag      42       50       RIKQIINMW     Env      419      427
TERQANFL      Gag      64       71       YRLGVGALI     Env      511      519
RQANFLGKI     Gag      66       74       RAIEAQQHL     Env      557      565
FLGKIWPSYK    Gag      70       79       RAIEAQQHM     Env      557      565
KELYPLTSL     Gag      118      126      ERYLKDQQL     Env      584      592
NSPTRREL      Gag      24       31       RYLKDQQLL     Env      585      593
ITLWQRPLV     Gag-Pol  3        11       YLKDQQLL      Env      586      593
DTVLEEWNL     Pol      30       38       TAVPWNASW     Env      606      614
EEMNLPGRW     Pol      34       42       VFAVLSIVNR    Env      698      707
RQYDQILIEI    Pol      57       66       EIIFDIRQAY    Env      703      712
GKKAIGTVL     Pol      68       76       IVNRNRQGY     Env      704      712
KAIGTVLV      Pol      70       77       RLRDLLLIVTR   Env      770      780
LVGPTPVNI     Pol      76       84       IVTRIVELL     Env      777      785
TPVNIIGRNML   Pol      80       90       GRRGWEALKY    Env      786      795
IETVPVKL      Pol      5        12       RRGWEVLKY     Env      787      795
GPKVKQWPL     Pol      18       26       KYCWNLLQY     Env      794      802
ALVEICTEM     Pol      33       41       QELKNSAVSL    Env      805      814
ALVEICTEMEK   Pol      33       43       SLLNATDIAV    Env      813      822
EKEGKISKI     Pol      42       50       EVAQRAYR      Env      831      838
KLVDFRELNK    Pol      73       82       IPRRIRQGL     Env      843      851
GIPHPAGLK     Pol      93       101      RIRQGLERA     Env      846      854
TVLDVGDAY     Pol      107      115      RQGLERALL     Env      848      856
VPLDEDFRKY    Pol      118      127

75

# Here and in subsequent tables Start and End refer to the amino acid positions within a peptide from the reference HIV-1 genome, HXB2
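
The coordinate convention in the note above can be checked mechanically against the table rows. The following Python sketch is illustrative only and is not part of the thesis methods. The names EpitopeRow, parse_row and span_matches_length are hypothetical, and the sketch assumes that Start and End are inclusive, 1-based positions, so that End - Start + 1 should equal the peptide length.

```python
from typing import NamedTuple


class EpitopeRow(NamedTuple):
    """One table row: peptide sequence, protein name, Start and End."""
    sequence: str
    protein: str
    start: int
    end: int


def parse_row(line: str) -> EpitopeRow:
    # A row is four whitespace-separated fields, e.g. "GELDRWEKI Gag 11 19".
    sequence, protein, start, end = line.split()
    return EpitopeRow(sequence, protein, int(start), int(end))


def span_matches_length(row: EpitopeRow) -> bool:
    # Assumption: Start and End are inclusive positions in the same protein.
    return row.end - row.start + 1 == len(row.sequence)


# Three rows copied from Appendix 4.
rows = [
    parse_row("GELDRWEKI Gag 11 19"),
    parse_row("KIRLRPGGK Gag 18 26"),
    parse_row("LFLDGIDKA Pol 560 8"),
]

# Rows that fail the check are only flagged for manual review.
flagged = [row for row in rows if not span_matches_length(row)]
```

Rows in which End is smaller than Start (for example LFLDGIDKA, Pol 560 8) fail this check. The tables do not explain such rows, so the sketch only flags them rather than correcting them.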

Appendix - 5
T-Helper Epitopes

Epitope Protein Start End    Epitope Protein Start End
MGARASVLSGGELDRWEK Gag 1 18    KEGHQMKDCTERQAN Gag 55 69
ASILRGGKLDKW Gag 5 16    MKDCTERQANFLGKI Gag 60 74
SGGELDRWEKIRLRPGGK Gag 9 26    DCTERQANFLG Gag 62 72
GGKLDAWEKIRLRPG Gag 10 24    RQANFLGKIWPSHKGR Gag 66 81
EKIRLRPGGKKKYKL Gag 17 31    GKIWPSHKGRPGNFLQSR Gag 72 89
EKIRLRPGGKKKYKLKHI Gag 17 34    PSYKGRPG Gag 76 83
EKIRLRPGGKKKYKLHKI Gag 17 34    REETTTPS Gag 89 96
RLRPGGKKHYM Gag 20 30    ESFRSGVETTTPPQK Gag 98 112
LRPGGKKKYKLKHIV Gag 21 35    GEETTTPSQKQEPIDKEL Gag 103 120
RPGGKKKY? Gag 22 29    FEETTPAPPKQ Gag 104 113
HYMLKHLVWAS Gag 28 38    QKQEPIDKELYPLASLR Gag 111 127
YKLKHIVWASRELER Gag 29 43    KDREPLTSLKS Gag 118 128
KHIVWASRELERFAV Gag 32 46    EICTEMEKEGKISKIGP Pol 36 52
HIVWASRELERFAVN? Gag 33 47    TEMEKEGKISKIGPE Pol 39 53
HIVWASRELERFAVN Gag 33 47    FRKYTAFTIPSINNE Pol 124 138
ASRELERFAVNPGLL Gag 37 51    SPAIFQSSMTKILEP Pol 156 170
SRELERFALNPSLLEE Gag 38 53    IGQHRTKIEELRQHL Pol 195 209
RELERFAVN Gag 39 47    KDSWTVNDIQKLVGK Pol 249 263
LERFAVNPGLL Gag 41 51    KDSSTVNDIQKLVGK Pol 249 263
LERFAVNPGLLETSE Gag 41 55    KDSWTWNDIQKLVGK Pol 249 263
ERFAVNPGLL Gag 42 51    SSTVNDIQKLV Pol 251 261
ERFALNPSLLETAEG Gag 42 56    QKLWGKLNWASQIYP Pol 258 272
ERFAVNPGLLETSEGCR Gag 42 58    WRQLCKLLRGTKALT Pol 276 290
PGLLETSEGCK Gag 48 58    GTKALTEVIPLTEEA Pol 285 299
TGSEELRSLNTVALY Gag 70 86    PLTEEAELELAENRE Pol 294 308
TSEELKSLFVTVATL Gag 71 85    LAENREILKEPVHGV Pol 303 317
LKSLFNTVATLYCVH Gag 75 89    TYQIYQEPFKNLKTG Pol 338 352
SLNTVATLYCVHQR Gag 77 91    GKTPKFKLPIQKETW Pol 384 398
SLYNTVATLYCVHQRIEV Gag 77 94    WEFVNTPPLVKLWYQ Pol 414 428
VATLYCVHAGI Gag 82 92    LEKEPIVGAETFYVD Pol 429 443
EIKDTKEALDKIEEE Gag 93 107    EKVYLAWVPAHKGIG Pol 529 543
AAADTGHSSQVSQNY Gag 118 132    KVYLAWVPAHKGIGG Pol 530 544
PIVQNIQGQ Gag 1 9    SAGIRKVLFLD Pol 553 3
PIVQNLQGQMV Gag 1 11    HSNWRAMASDFNLPP Pol 16 30
PIVQNIQGQMVHQAI Gag 1 15    LKTAVQMAVFIHNFK Pol 172 186
QGQMVHQAISPRTLN Gag 7 21    KTAVQMAVFFIHNFKR Pol 173 187
QMVHQAISPRTLNAWVKV Gag 9 26    KTAVQMAVFIHNFKR Pol 173 187
VHQAISPRTLNAWVKC Gag 11 26    RKGGIGGYSAGERIVDII Pol 187 204
NAWVKVVEEKAFSPEC Gag 21 36    SAGERIVDIIATDIQTK Pol 195 211
AWVKVIEEKAFSPEV Gag 22 36    AGERIVDIIATDIQT Pol 196 210
WVKVVEEKAFSPEVIPMF Gag 23 40    QKQITKIQNFRVYYR Pol 214 228
WKVVEEKAFSPEVIPMF Gag 23 40    KQITKIQNFRVYY Pol 215 227
KVVEEKAFSPEVIPM Gag 25 39    LWKGEGAVVIQDNSDIKV Pol 242 259
EEKAFSPEV Gag 28 36    VIQDNSDIKVVPRRKAKI Pol 250 267
EEKAFSPEVIP Gag 28 38    TEKLWVTVYYGVPVW Env 31 45
AFSPEVIPMFT Gag 31 41    VYYGVPVWKEA Env 38 48
AFSPEVIPMFSALSEC Gag 31 46    CVPTDPNPQEVV Env 74 85
AFSPEVIPMFSALSEGA Gag 31 47    YFNMWKNNMV Env 92 101
AFSPEVIPMFSALSEGAT Gag 31 48    HEDIISLWDQSLK Env 105 117
PEVIPMFSALSEGATP Gag 34 49    IISLWDQSLKPC Env 108 119
EVIPMFSALS Gag 35 44    SLWDQSLKPCVKLTPL Env 110 125
PMFTALSEGAT Gag 38 48    SLKPCVKLTPLC Env 115 126
SALSEGATPQDLNTMC Gag 41 56    SLKPCVKLTPLCVSL Env 115 129
TPQDLNTMLNTVGGH Gag 48 62    KNCSFNISTSIRGKV Env 155 169
PQDLNTMLNTVGGHQ Gag 49 63    SVITQACSKVSFE Env 199 211
PQDLNMMLNIVGGHQA Gag 49 64    VITQACPKVSFEPIP Env 200 214
DLNTMLNTYGGHQAAC Gag 51 66    SFEPIPIHYCAP Env 209 220
NTMLNTVGGHQAAM Gag 53 66    PAGFAILKCNNKTFN Env 220 234
ETINEEAAEWDRVHPC Gag 71 86    PAGFAILKCNNKTFNY Env 220 235
ETINEEAAEWDRVHPVHA Gag 71 88    FAILKCNNK Env 223 231
INEEAAEWDRL Gag 73 83    NKTFNGKGPCTNVSTY Env 230 245
EAAEWDRVHP Gag 76 85    TNVSTVQCTHGRPIY Env 240 255
EAAEWDRVHPVHAGP Gag 76 90    GIRPIVSTQLLLNGSC Env 250 265
EWDRVHPVHA Gag 79 88    EVVIRSANFTDNAKT Env 269 283
RLHPVHAGPIA Gag 82 92    VVIRSDNFTNNAKTIC Env 270 285
PVHGPIAPGQMREP Gag 85 99    SANFTDNAKTIIVQL Env 274 288
VHAGPIAPG Gag 86 94    NAKTIIVQLNESVAIC Env 280 296
PGQMREPRGSDIAGT Gag 93 107    NESVAINCT Env 289 297
GQMREPRGSDI Gag 94 104    SVVEINCTRPNNNTRKS Env 290 306
MREPRGSD Gag 96 103    RIQRGPGRAFVTIGK Env 308 322
MREPRGSKIAGTTST Gag 96 110    RIHIGPGRAFYTTKN Env 308 322
EPRGSDIAGT Gag 98 107    EQRGPGRAFVTIGKI Env 309 323
GSDIAGTTSTQEQI Gag 101 115    IQRGPGRAFVTIGKIGN Env 309 325
GSDIAGTTSTLQEQIC Gag 101 116    GRAFVTIGKIGNMRQ Env 314 328
GTTSTLQEQIA Gag 106 116    RIIGDIRKAHCNISRY Env 321 336
STLQEQIGWMTNNPP Gag 109 123    CNISRAQWNNTLEQI Env 331 345
EQIAWMTSNPPVPVG Gag 113 127    TLEQIVKKLREQFGNC Env 341 356
WMTSNPPVPVG Gag 117 127    QIVKKLREQFGNNK Env 344 357
TNNPPIPBGEIYKRW Gag 119 133    QSSGGDPEIV Env 363 372
NPPIPVGEIYKRWIIC Gag 121 136    SSGGKPEIVTHSFNC Env 364 378
PVGEIYKRWIILGLN Gag 125 139    PEIVTHSFNCGGEFF Env 369 383
GEIYKRWIILGLNKI Gag 127 141    GEFFYCNSTQLFNS? Env 380 393
EIYKRWIILG Gag 128 137    EFFYCNTTQLFNNTW Env 381 395
IYKRWIILGLNKIVR Gag 129 143    TQLFNSTWFNSTWST Env 388 402
KRWIILGLNKIVRMY Gag 131 145    FNNTWRLNHTEGTKGC Env 391 405
WIILGLNKIVRM Gag 133 144    TWFNSTWSTKGSNNT Env 394 408
WIILGLNKIVRMYSP Gag 133 147    TWSTKGSNNTEGSDT Env 399 413
WIILGLNKIVRMYSPTSI Gag 133 150    LPCRIKQIINMWQEVY Env 416 431
ILGLNKIVRMY Gag 135 145    KQIINMWQEVGKAMYA Env 421 436
GLNKIVRMYSPTSIL Gag 137 151    KQFINMWQEWGKAMYA Env 421 436
LNKIVRMYSPVSILD Gag 138 152    FINMWQEVGKAMYAPPIS Env 423 440
KIVRMYSPT Gag 140 148    INMWQEVGKAMYAPP Env 424 438
KIVRMYSPTS Gag 140 149    MWQEVGKAMYAPPIGC Env 426 441
IVRMYSPTSILDIRQC Gag 141 156    APPIGGQISCSSNITY Env 436 451
IVRMYSPTSILDIRQGPK Gag 141 158    IGGQIRCSSN Env 439 448
SPTSILDIRQGPKEP Gag 146 160    SSNITGLLLTRDGGTC Env 446 461
LDIRQGPKEPFRDYVC Gag 151 166    RDGGTNVTNDTEVFRC Env 456 470
IRQGPKEPFRDYVDR Gag 153 167    GNSNNESEIFRPGGG Env 459 473
GPKEPFRDYVDRFYK Gag 156 170    FRPGGGDMRDNWRSEL Env 468 483
GPKEPFRDYVDRFYKTLR Gag 156 173    DMRDNWRSELYKYKV Env 474 488
PKEPFRDYV Gag 157 165    YKYKVVKIEPLGVAP Env 484 498
FRDYVDRFFKT Gag 161 171    KYKVIKIEPLGIAPTC Env 485 500
DYVDRFYKTLRAE Gag 163 175    TKAKRRVVEREKR Env 499 511
DYVDRFYKTLRAEQA Gag 163 177    GIVQQQNNLLRAIEA Env 547 561
YVDRFYKTLRAEQASQEV Gag 164 181    QQHLLQLTVWGIKQL Env 562 576
VDRFYKTLRAEQASQ Gag 165 179    YLRDQQLLGIWG Env 586 597
DRFFKTLRAEQAT Gag 166 178    LGIWGCSGKLIC Env 593 604
DRFFKTLRAEQATQE Gag 166 180    GIWGCSGKLI Env 594 603
RFYKTLRAEQAS Gag 167 178    GIWGCSGKLIC Env 594 604
FYKTLRAEQASQ Gag 168 179    CSGKLICTTAVP Env 598 609
FYKTLRAEQASQE Gag 168 180    CTTAVPWNASWS Env 604 615
FFKTLRAEQATQE Gag 168 180    PWNASWSN Env 609 616
YKTLRAEQA Gag 169 177    WSNKSLEDIWDNMTWC Env 614 629
YKTLRAEQASQEVKN Gag 169 183    EQIWNHTTWMEWDRE Env 620 634
RAEQASQEVKNWMTE Gag 173 187    EIDNYTNTIYTLLEEC Env 634 649
VKNWMTETLLVQNANC Gag 181 198    EESQNQQEKNEQELL Env 647 661
MTETLLVQNANPDCKTIL Gag 185 202    QNQQEKNEQELLE Env 650 662
LLVQNANPDCKTILR Gag 189 203    ASLWNWFNITNWLWY Env 667 681
ILKALGPAATLEEMM Gag 201 215    IKLFIMIVGGLVGLR Env 682 696
LGPAATLEEMMTACQ Gag 205 219    GIEEEGGERDRDR Env 732 744
TNSATIMMQRGNFRNQRK Pol 8 25    WLNATAIAVTEGTDRC Env 814 829
QRGNFRNQRKTVKCF Pol 16 30    YVAEGTDRVIEVVQGACR Env 821 838
VKCFNCGKGEH Pol 27 37    DRVIEVVQGAYRAIR Env 827 841
FNCGKEGHTARNCRA Pol 30 44    DRVIEVVGQAYRAIR Env 827 841
HIAKNCRAPRKKGCWK Pol 37 52    AIRHIPRRIRQGLER Env 839 853
RAPRKKGCWKCGKEGHQM Pol 43 60    HIPRRIRQGLERILL Env 842 856

Appendix - 6
Antibody Epitopes

Epitope Protein Start End    Epitope Protein Start End
AAMQMLKETINE Gag 64 75    GRAF Env 314 317
KDSWTVNDIQKLVGK Pol 249 263    IKQI Env 420 423
LTEEAELELA Pol 295 304    KQIINMWQEVGKAMYA Env 421 436
IIEQLIKKEKV Pol 521 531    EVGKAMYAPPISGQI Env 429 443
CTDLKNDTNTNSSSGRMMMEK Env 131 151    LTRDGGNNNNESEIFRPGGGD Env 454 474
ISTSIRGKVQKEYAFFYKLD Env 161 180    NNNNGSEI Env 460 467
FYKLDIVPIDNTTTSYRLISC Env 176 196    IEPLGVAPTK Env 491 500
PIPIHYCAPA Env 212 221    PTKAKRR Env 498 504
KGSCKNVSTV Env 236 245    RRVVQRE Env 503 509
PNNNTRKSIR Env 299 308    DWVVQREKR Env 503 511
NYNKRKRIHIGPGRAFYTTK Env 300 321    RRVVQREKR Env 503 511
NTRKSIHIGPGRAF Env 302 317    VVQREKR Env 505 511
NTRKSIHIGPGRAFY Env 302 318    AAGSTMGAASMTLTVQARQ Env 525 543
TRTSV Env 303 307    AQQHLLQLTVWGIKQLQARIL Env 561 581
TRTSVR Env 303 308    GIKQLQARILAVERYLKDQQ Env 572 591
RKSIR Env 304 308    LQARILAVERYLKDQQL Env 576 592
RKSIRIQRGPGRAFV Env 304 318    QARILAV Env 577 583
RKRIHIGPGRAFYTT Env 304 320    RILAVERYLKDQQLLGIWGCS Env 579 599
RKRIRIQRGPGRAFVTIGK? Env 304 322    ILAVERYLKDQQLLGIWG Env 580 597
RKRIHIGPGRAFYTTKN Env 304 322    AVERYLKD Env 582 589
KRIHI Env 305 309    ERYLKDQLLGIWGCSGKLIC Env 584 604
KRIHIGP Env 305 313    DQQLLGIWGCSG Env 589 600
KRIHIGPGRAFY Env 305 318    QQLLGIWG Env 590 597
KRIHIGPGRAFYTT Env 305 320    QQLLGIWGCSGKLICTTA Env 590 607
RIRPGRAFVTIGK Env 306 322    QLLGIWG Env 591 597
KSITK Env 307 311    LLGIWGCSG Env 592 600
KSITKG Env 307 312    LLGIWGCSGKLICTT Env 592 606
KSITKGP Env 307 313    GIWGCSGK Env 594 601
IRIQRGPGRAFVTI Env 307 320    GIWGCSGKLIC Env 594 604
RIHIGPGRAFYT Env 308 319    GIWGCSGKLICTTAVP Env 594 609
ELDKWA Env 308 320    IWGCSGKLICTTA Env 595 607
SISGPGRAFYTG Env 308 321    GCSGKLICTT Env 597 606
RIHIGPGRAFYTTKN Env 308 322    CSGKLIC Env 598 604
IXIGPGR Env 309 315    SGKLICTTAVPWNAS Env 599 613
IHIGPGR Env 309 315    CTTAVPWNASWS Env 604 615
HIGPGR Env 310 315    SLIEESQNQQEKNEQELLEL Env 644 663
HIGPGRA Env 310 316    ELDKWA Env 662 667
IGPGR Env 311 315    WASLWNWFDITN Env 666 677
GPXR Env 312 315    NWFDIT Env 671 676
PGRAFY Env 313 318

Appendix - 7
Conformational Epitopes

Epitope Start End
103 N 103 104
117 E 117 118
128 D 128 129
132 K 132 132
134 CVKL 134 138
217 F 217 218
221 D 221 222
225 I 225 226
250 T 250 251
252 A 252 253
254 TQAC 254 258
259 K 259 260
306 VVSTQLLLNGSLA 306 320
366 P 366 367
372 TT 372 374
381 R 381 382
425 DPEITTHSF 425 434
437 GEFFY 437 442
443 N 443 444
447 LFN 447 450
495 RIKQII 495 501
506 V 506 507
508 K 508 509
510 MYAP 510 514
533 D 533 534
552 RP 552 554
555 G 555 556
557 DMRD 557 561
562 WR 562 564
574 I 574 575
588 VVQREKR 588 594
