Artificial Intelligence, Bioinformatic and Systems Biology Approaches to Understanding Evolution and Viral Control

Hamid Alinejad Rokny (Hamid Alinejad-Rokny) (MS.c Computer Science-Artificial Intelligence)

A thesis in fulfilment of the degree of Doctor of Philosophy

(Bioinformatics and Systems Biology)

Supervisors

Miles Davenport, Diako Ebrahimi and Vanessa Venturi

School of Medical Sciences

Faculty of Medicine

The Kirby Institute

November 2017

Originality statement

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed:

Date: 20 October 2017

Copyright Statement

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’

Signed:

Date: 20 October 2017

Abstract

Human immunodeficiency virus (HIV) continues to be a major global health problem. Decades of research have still not produced a successful vaccine, and understanding many aspects of this virus and of HIV infection continues to challenge researchers. In this thesis I have addressed several key questions about HIV biology and infection using a computational biology approach.

Macaque models of HIV infection play an important role in HIV research and yet the viral peptides presented by MHC Class-I and recognised by cytotoxic T-lymphocytes (CTL) in macaques are not well characterised. I developed an in-house bioinformatics pipeline to investigate novel CTL epitopes and their associated patterns of escape mutations in pigtailed macaques. I identified new potential CTL epitopes and numerous novel non-synonymous point mutations and regions of non-synonymous mutation associated with specific MHC-I haplotypes.

I also investigated the nature and distribution of APOBEC3-induced hypermutation signatures and whether this information provides clues about HIV inhibition by different APOBEC3 enzymes such as APOBEC3G and APOBEC3F. I developed a new method for hypermutation detection. I also used a novel approach to identify preferential patterns of G-to-A mutation for both APOBEC3G and APOBEC3F.

The source of CpG depletion in the HIV genome is another aspect of HIV biology that is not well understood. My bioinformatics analyses suggest that the methylation mechanism may be responsible for depletion of CpG dinucleotides in the HIV genome. Importantly, the results showed that viral genome adaptation to the host CpG machinery is a highly specific pattern that is only observed in HIV and its simian counterpart, SIV.

Finally, the thesis reports a meta-analysis approach to investigate the effect of gene expression level on the clonal expansion of latently-infected cells during HIV treatment. The results of this analysis show that among the HIV proviruses that integrated into genes, those integrated into poorly expressed genes are more likely to become clonally expanded.

In conclusion, my findings shed light on several important components of the host immune system and their roles in viral control and viral evolution. These findings have implications for the future design of immunotherapies and vaccines against HIV.

Acknowledgements

Firstly, I would like to thank my beautiful wife Hoda, for her love and generous support in all my endeavours, and particularly, her support throughout my studies.

I would like to thank my supervisors, Professor Miles Davenport, Assistant Professor Diako Ebrahimi and Associate Professor Vanessa Venturi for their complete support and invaluable guidance throughout my candidature. Over the past three and a half years they went above and beyond to ensure I was equipped for the research I was conducting. I am truly thankful for the advice and general counsel they have provided throughout my studies, and for the opportunities they provided me to be directly involved in interdisciplinary collaboration.

I would also like to thank all of my experimental collaborators. In particular, Professor Stephen Kent and Associate Professor Sarah Palmer and the researchers in their laboratories. They provided the experimental data necessary for several projects during my PhD candidature.

I need to acknowledge the other group members, along-side whom I have worked these past three years, for all their support and advice. In particular, Janka Petravic, Deborah Cromer, Mykola Pinkevych, Andrew Grimm, Alexey Martyushev, Adeshina Adekunle, Arnold Reynaldi and David Khoury.

Finally, I thank the Lord my God who has carried me every step of the way.

Table of Contents

Originality statement ...... v

Copyright statement ...... vii

Abstract ...... ix

Acknowledgements...... xiii

List of figures, tables and abbreviations ...... xx

Figures ...... xx

Tables ...... xxv

Abbreviations ...... xxvii

Publications during candidature ...... xxx

Chapter 1: General introduction and scope of thesis ...... 33

Author contributions to thesis Chapter 1 ...... 33

The virus ...... 34

HIV types and subtypes ...... 36

The viral life cycle ...... 37

Binding to host cells ...... 37

Reverse transcription ...... 37

Integration into host DNA ...... 37

Transcription and translation ...... 38

Viral assembly and budding ...... 38

SIV, an animal model of HIV infection ...... 39

HIV disease progression ...... 41

Immune response activation ...... 42

HIV immune escape ...... 44

Viral latency ...... 45

Vaccine and antiretroviral therapy ...... 46

APOBEC protein ...... 47

Host restriction factor APOBEC3 versus HIV Vif protein ...... 50

HIV hypermutation ...... 51

Sequencing ...... 52

Sanger sequencing ...... 52

454 sequencing ...... 53

Illumina sequencing ...... 54

Pacific bioscience ...... 54

Nanopore sequencing techniques ...... 55

NGS data analyses ...... 56

Read quality control ...... 56

Mapping and consensus sequence generation ...... 57

Variant calling ...... 58

Geneious software ...... 59

Aim of this thesis ...... 61

References ...... 64

Chapter 2: A linkage analysis between MHC-I alleles and SIV sequences in macaques infected with SIV using an in-house bioinformatics pipeline to identify new CTL escape mutations ...... 79

Publication details ...... 79

Author contributions to thesis Chapter 2 ...... 79

Abstract ...... 80

Introduction ...... 81

Materials and methods ...... 82

Samples ...... 83

Whole genome amplification ...... 83

MHC Class-I typing ...... 83

Data analysis ...... 87

Method validation ...... 97

Association detection approach ...... 102

A permutation analysis to determine statistically significant associations ...... 105

Results ...... 110

Result validation ...... 110

New MHC-linked epitope mutations ...... 112

New MHC-linked mutations using sliding window analysis ...... 117

Comparison with previously reported epitopes ...... 121

Association between viral load and ART/vaccination, MHC haplotype and mutation ...... 126

Validation of detected associations ...... 129

Discussion ...... 129

References ...... 134

Chapter 3: G2A3: a method to avoid errors associated with the analysis of hypermutated viral sequences by alignment-based methods ...... 141

Publication details ...... 141

Author contributions to thesis Chapter 3 ...... 141

Abstract ...... 142

Introduction ...... 143

Methods ...... 145

Pace method ...... 145

Ulenga method ...... 145

Errors in Pace and Ulenga methods ...... 145

Hypermut program ...... 146

Type of errors in the Hypermut program ...... 147

Error type 1 (site is ignored) ...... 148

Error type 2 (site is misidentified) ...... 148

G2A3: My proposed method ...... 148

Results ...... 152

Discussion...... 157

References ...... 158

Chapter 4: Insights into the motif preference of APOBEC3 enzymes using multivariate analysis of full genome HIV-1 ...... 162

Publication details ...... 162

Author contributions to thesis Chapter 4 ...... 162

Author contributions to publication ...... 162

Abstract ...... 163

Introduction ...... 164

Materials and methods ...... 166

How PCs can be obtained ...... 169

Results ...... 170

Identification of different HIV-1 subtypes ...... 171

Identification of hypermutation by APOBEC3G ...... 175

Identification of hypermutation by APOBEC3F ...... 179

Discussion ...... 184

References ...... 188

Chapter 5: Source of CpG depletion in the HIV-1 genome ...... 194

Publication details ...... 194

Author contributions to thesis Chapter 5 ...... 194

Abstract ...... 195

Introduction ...... 196

Materials and methods ...... 198

Data acquisition ...... 198

Data analysis ...... 203

Analysis of CpG representation ...... 203

Analysis of % methylation of CpG ...... 204

Results ...... 204

Discussion ...... 234

References ...... 236

Chapter 6: Relationship between cell division, gene expression and the patterns of HIV integration into the human genome ...... 240

Author contributions to thesis Chapter 6: ...... 240

Abstract ...... 241

Introduction ...... 242

Materials and methods ...... 244

Genome sequence and gene expression data ...... 244

Identification of sites by hierarchical clustering ...... 246

Results ...... 256

Gene and non-gene regions ...... 256

Gene expression analysis ...... 258

Comparison of gene expression level between different kinds of T-cells ...... 261

Cancer genes and non-cancer genes ...... 262

Expansion size ratio ...... 263

Confirmation of the results using an alternative published data set ...... 266

Discussion ...... 267

References ...... 271

Chapter 7: Summary and future work ...... 276

Author contributions to thesis Chapter 7: ...... 276

Synopsis ...... 277

Summary of findings ...... 279

CTL epitopes undergo patterns of escape mutations ...... 279

APOBEC proteins as a host defence mechanism against HIV-1 infection ...... 284

Methylation may be a potential source of CpG depletion in the HIV-1 genome ...... 291

HIV-1 integration into lowly expressed genes is associated with clonal expansion ...... 295

Conclusion ...... 301

References ...... 302

List of Figures, Tables and abbreviations

Figures

Chapter 1 ...... 33

Figure 1.1 The HIV-1 viral genome carries two copies of single-stranded RNA that code for the virus's nine proteins ...... 35

Figure 1.2 Viral HIV-1 replication cycle ...... 38

Figure 1.3 Evolutionary history of HIV ...... 40

Figure 1.4 The course of HIV infection ...... 42

Figure 1.5 T-cell receptor binding to MHC-antigen complex ...... 44

Figure 1.6 Model of HIV latent infection and active infection ...... 46

Figure 1.7 Members of the APOBEC3 enzymes family of cytidine deaminase...... 48

Figure 1.8 Antiviral activity of APOBEC3G in the presence, or absence, of Vif protein ...... 51

Chapter 2 ...... 79

Figure 2.1 Flowchart of the bioinformatics pipeline to identify associations between MHC-I haplotypes and mutations ...... 88

Figure 2.2 Quality scores across all bases for animal 1335. x-axis shows position in read; y-axis shows quality of the position...... 91

Figure 2.3 Quality score distribution over all sequences in animal 1335; x-axis shows mean sequence quality. y-axis shows number of bases with that quality ...... 92

Figure 2.4 An overview of the trimming process for animal 1335. Blue colours indicate good quality base pairs and gold colours show trimmed regions ...... 93

Figure 2.5 An overview of mapped reads for animal 1335 ...... 94

Figure 2.6 A snapshot of called SNPs within a selected region of animal 1335 ...... 96

Figure 2.7 Quality scores across all bases for simulated data. x-axis shows position in read; y- axis shows quality of the position...... 98

Figure 2.8 Quality score distribution over all sequences in simulated data; x-axis shows mean sequence quality; y-axis shows number of bases with that quality ...... 99

Figure 2.9 An overview of the trimming process for simulated independent sequences. Blue colours indicate good quality base pairs and gold colours show trimmed regions ...... 100

Figure 2.10 An overview of mapping of the simulated reads...... 101

Figure 2.11 Flowchart for point mutation algorithm to identify the correlations between CTL escape mutation and MHC type...... 103

Figure 2.12 Flowchart for window analysis algorithm to identify the correlations between CTL escape mutation and MHC type...... 104

Figure 2.13 Flowchart of permutation method ...... 106

Figure 2.14 Matrix of animals vs. mutated positions in permutation process ...... 107

Figure 2.15 Compression of best P values in the 1000 permutations and real data ...... 109

Figure 2.16 An example of number and type of mutations within all macaques, with and without Mane-B028 at position 2928 in protein ...... 116

Figure 2.17 An example of number and type of mutations based on window analyses within all macaques, with and without Mane-A084, in protein Gag between positions 2284 and 2313 .. 120

Figure 2.18 Viral load comparison in different categories of ART and vaccination ...... 127

Figure 2.19 An example of number and type of mutations based on my single polymorphism analyses within all macaques, with and without Mane-A032 in protein at positions 7483 and 7493 ...... 132

Chapter 3 ...... 149

Figure 3.1 Presence of indels in aligned sequences; sequences were aligned by MUSCLE (www.ebi.ac.uk/Tools/msa/muscle) and checked manually ...... 144

Figure 3.2 Schematics of the context enforcement in the Hypermut program; By default the Hypermut program enforces context on the query sequence ...... 147

Figure 3.3 Hypermut 2-by-2 table as the input of Fisher’s exact test to calculate a probability associated with hypermutation for each sequence ...... 147

Figure 3.4 Flowchart of proposed method when enforce context on the reference sequence is selected ...... 151

Figure 3.5 A snapshot of the SIV sequences and examples of the errors resolved by G2A3 (insertion and deletion) ...... 153

Chapter 4 ...... 162

Figure 4.1 A schematic of principal component analysis applied to the motif representation data of HIV sequences ...... 168

Figure 4.2 Principal components with their eigenvalue ...... 171

Figure 4.3 Principal component analysis of the motif representation data of HIV-1 sequences (PC1 vs. PC2) ...... 173

Figure 4.4 Multiple alignments of randomly selected HIV-1 sequences (Gag region, positions 300-650) from 4 different groups ...... 174

Figure 4.5 Principal component analysis of the motif representation data of HIV-1 sequences (PC3 vs. PC4) ...... 176

Figure 4.6 Conserved polypurine tracts shown in a representative alignment of 54 hypermutated HIV-1 sequences used in this study ...... 178

Figure 4.7 Percentage change of motifs within which G-to-A change(s) in the context of GG has occurred...... 179

Figure 4.8 Principal component analysis of the motif representation data of HIV-1 sequences (PC11 vs. PC13) ...... 181

Figure 4.9 Average of D-representation of motifs TGA, TAA, TGC, TAC, GAT, AAT, GAC, and AAC in APOBEC3F-mutated, APOBEC3G-mutated and non-hypermutated HIV-1 sequences ...... 183

Chapter 5 ...... 194

Figure 5.1 Comparison of the representation of 2-mers motifs of average of 2000 HIV-1 sequences ...... 196

Figure 5.2 Comparison of the representation of CpG motifs flanked by different nucleotides within HIV-1 and Human ...... 205

Figure 5.3 Comparison of the representation of GpC motifs flanked by different nucleotides within HIV-1 sequences ...... 207

Figure 5.4 Comparison of the representation of CpG motifs flanked by different nucleotides within non-hypermutated and hypermutated sequences, separately ...... 208

Figure 5.5 Comparison of the representation of CpG motifs flanked by different nucleotides within HIV-1 sequences with and without LTR region, separately ...... 209

Figure 5.6 Comparison of the codon usage of Arginine in HIV-1 genes and in human...... 210

Figure 5.7 Comparison of the representation of CpG motifs flanked by T/A versus C/G in viruses selected from all four groups of ssRNA, dsRNA, ssDNA, and dsDNA...... 212

Figure 5.8 Comparison of the representation of CpG motifs flanked by T/A versus those flanked by C/G in 103 viruses...... 222

Figure 5.9 Positive correlation between the representation of CpG reverse complementary tri-and tetra-nucleotide motifs ...... 224

Figure 5.10 Comparison of the % methylation of CpG motifs flanked by different nucleotides in the human genome ...... 227

Figure 5.11 Inverse correlation between % methylation and representation of CpG tri- and tetra- nucleotide motifs ...... 229

Figure 5.12 Correlation between % methylation and representation of CpG tetra-nucleotide motifs within NCGN, NNCG and CGNN groups, separately ...... 231

Figure 5.13 Comparison of the representation of CpG motifs flanked by different nucleotides within NCGN, NNCG and CGNN groups, separately ...... 233

Chapter 6 ...... 240

Figure 6.1 The method divides genes into two different groups: a group of genes that include only non-expanded sites and a group of genes that include at least one expanded site...... 248

Figure 6.2 Expanded sites and non-expanded sites of the HIV genome within the MKL2 gene on chromosome 16 using the published data from Ref [25]. Figure drawn using the Geneious software...... 250

Figure 6.3 Determine expression level for each gene ...... 252

Figure 6.4 A schematic of the pipeline, which has been used in this study ...... 253

Figure 6.5 Percentage of integration sites across the whole genome in gene and non-gene regions and fraction of expanded and non-expanded sites in gene and non-gene regions ...... 256

Figure 6.6 Gene expression comparison between genes that include at least one HIV integration site and genes that do not have any site ...... 259

Figure 6.7 Gene expression comparison between genes that include at least one expanded site and genes that include only non-expanded sites at different thresholds levels ...... 260

Figure 6.8 Red line: Gene expression comparison of different T-cells within genes that include only non-expanded sites. Green line: Gene expression comparison of different T-cells within genes that include at least one expanded site ...... 262

Figure 6.9 Gene expression comparison between genes that include at least one expanded site and genes that include only non-expanded sites within cancer and non-cancer genes ...... 263

Figure 6.10 Spearman correlation between gene expression and Expansion size in different patients ...... 265

Chapter 7 ...... 277

Figure 7.1 A summary of the key questions addressed in this thesis and the key findings for each question ...... 278

Figure 7.2 A schematic of the HIV life cycle and the anti-viral role of main host restriction factors (red circles)...... 285

Figure 7.3 Distribution of CpG motifs in the HIV-1 subtype B reference genome (HXB2) .... 295

Figure 7.4 Positive correlation between gene expressions of memory CD4+ and CD8 T-cells in FANTOM5 (Functional annotation of the mammalian genome 5) dataset...... 300

Tables

Chapter 2 ...... 79

Table 2.1 Details of infected pig-tailed macaques ...... 84

Table 2.2 The MHC-I genotype of the macaques that were used in this study ...... 85

Table 2.3 Comparison and validation of results with known CTL responses ...... 111

Table 2.4 New SIVmac251 mutations associated with MHC Class-I haplotypes detected by proposed approach in this study ...... 114

Table 2.5 New SIVmac251 mutations associated with MHC Class-I haplotypes detected by proposed sliding windows approach in this study...... 118

Table 2.6 Comparison of discovered epitopes in this study with known human epitopes via an Immune Epitope Database (www.iedb.org) ...... 122

Table 2.7 SIV viral load for different MHC Class-I haplotypes (mean viral loads after week 6) ...... 128

Chapter 3 ...... 149

Table 3.1 Type of errors in alignment-based methods ...... 148

Table 3.2 Details of the sequences used in this study ...... 152

Table 3.3 Comparison of Hypermut and G2A3 methods for the analysis of single genome HIV-1 sequences. Sequences identified incorrectly are in grey ...... 154

Table 3.4 Comparison of Hypermut and G2A3 methods for analysis of 454 SIV sequences. Sequences identified incorrectly are in grey ...... 155

Table 3.5 Comparison of the Pace and Ulenga methods with G2A3 for the analysis of single genome HIV-1 sequences ...... 156

Chapter 4 ...... 162

Table 4.1 Input and output parameters of PCA analysis in Matlab 2015b ...... 169

Table 4.2 Accession number of hypermutated sequences used in this study ...... 170

Chapter 5 ...... 194

Table 5.1 List of representative viruses investigated in this study (a full list of viruses studied is given in Table 5.2)...... 199

Table 5.2 List of additional viruses used in this study...... 200

Chapter 6 ...... 240

Table 6.1 Details of the data obtained from references 24 and 25 that were used in this study ...... 245

Table 6.2 An example of the output from the first step of the method that identifies the integration sites using hierarchical clustering ...... 247

Table 6.3 An example of the final output of the method that specifies, for various genes, the average gene expression level across 3 donors for various types of T-cells and the identified sites of integration ...... 254

Table 6.4 A summarized list of genes with expanded sites in Maldarelli study...... 257

Abbreviations

AAV: Adeno-associated virus

AIDS: Acquired immune deficiency syndrome

APOBEC: Apolipoprotein B mRNA-editing enzyme-catalytic polypeptide-like

ART: Antiretroviral therapy

B-cell: B lymphocyte

BoV: Bocavirus

cDNA: Complementary DNA

cPPTs: Central polypurine tracts

CTFV: Colorado tick fever virus

CTL: Cytotoxic T lymphocyte

DNA: Deoxyribonucleic acid

dsDNA: Double strand DNA

dsRNA: Double strand RNA

ENV: Envelope

ES-Gene: Gene that includes sites in which multiple HIV sequences have been integrated

GAG: Group-specific antigen

gp120: glycoprotein 120

gp41: glycoprotein 41

APOBEC3F: Apolipoprotein B mRNA-editing enzyme-catalytic polypeptide-like 3F

APOBEC3G: Apolipoprotein B mRNA-editing enzyme-catalytic polypeptide-like 3G

HBV: Hepatitis B virus

HCV: Hepatitis C virus

HERVK: Human endogenous retrovirus type K

HIV: Human immunodeficiency virus

MM: Markov model

HTLV: Human T-lymphotropic virus

ICS: Intracellular cytokine staining

Indels: Insertions/Deletions

IRD: Influenza research database

IS: Integration site

LANL: Los Alamos National Laboratory

LTR: Long terminal repeat

MHC: Major histocompatibility complex

mRNA: Messenger RNA

NEF: Negative regulatory factor

NES-Gene: Gene that includes regions with only one integration site

NGS: Next generation sequencing

ORF: Open reading frame

PBMC: Peripheral blood mononuclear cell

PCA (PC): Principal component analysis

Pexp: Probability of expected frequency

Pobs: Probability of observed frequency

POL: DNA polymerase

PPTs: Polypurine tracts

RNA: Ribonucleic acid

RF: Restriction factor

RT: Reverse transcriptase

RV:

SFV: Simian foamy virus

SINV: Sindbis virus

SIV: Simian immunodeficiency virus

SNP: Single-nucleotide polymorphism

ssDNA: Single strand DNA

ssRNA: Single strand RNA

TAT: Trans-Activator of transcription

T-cell: T lymphocyte

TCM: Memory T-cell

TCR: T-cell receptor

TEM: Effector memory T-cell

TN: Naive T-cell

TSCM: Stem memory T-cell

VIF: Viral infectivity factor

VL: Viral load

VPR: Viral protein R

VPU: Viral protein U

WT: Wild-type

Publications during candidature

1. H. Alinejad-Rokny, F. Anwar, S. Waters, D. Ebrahimi and M.P. Davenport. (2016). "Source of CpG depletion in the HIV genome", Molecular Biology and Evolution, 33(12): 3205-3212.

2. D. Ebrahimi, H. Alinejad-Rokny and M.P. Davenport. (2014). "Insights into the Motif Preference of APOBEC3 Enzymes", PLOS ONE, 9(1), pp. e87679.

3. S.L. Gooneratne(&), H. Alinejad-Rokny(& joint first author), D. Ebrahimi, P.S. Bohn, R.W. Wiseman, D.H. O'Connor, M.P. Davenport and S.J. Kent. (2014). "Linking Pig-Tailed macaque major histocompatibility complex class I haplotypes and cytotoxic T lymphocyte escape mutations in SIV infection", Journal of Virology, 88(24): 14310-14325.

4. H. Alinejad-Rokny, D. Ebrahimi. (2015). "A method to avoid errors associated with the analysis of hypermutated viral sequences by alignment-based methods", Journal of Biomedical Informatics, 58(2012): 220-225.

5. A. Martyushev, A. Grimm, J. Petravic, H. Alinejad-Rokny, S.L. Gooneratne, J.C. Reece, D. Cromer, S.J. Kent, M.P. Davenport. (2014). "CD8+ T cell response kinetics rather than viral variability determines the timing of immune escape in SIV infection", The Journal of Immunology, 194(9): 4112-4121.

6. S.L. Gooneratne, H. Alinejad-Rokny, D. Ebrahimi, P.S. Bohn, R.W. Wiseman, D.H. O'Connor, M.P. Davenport and S.J. Kent. (2015). "Linking pigtail macaque MHC I haplotypes and CTL escape mutations in SIV infection", Journal of Medical Primatology Processing, 44(5): 335-335.

7. S.B. Lloyd, M. Kramski, T.H. Amarasena, S. Alcantara, R. De Rose, G. Tachedjian, H. Alinejad-Rokny, V. Venturi, M.P. Davenport, W.R. Winnall, S.J. Kent. (2016). "High fidelity Simian Immunodeficiency virus reverse transcriptase mutants have impaired replication in vitro and in vivo", Virology, 492(2016): 1-10.

8. H. Alinejad-Rokny and M. Masoud. (2016). "A method for hypermutated viral sequences detection in fastq and bam format files", Journal of Medical Imaging and Health Informatics, 6(5): 1202-1210.

9. D. Ebrahimi, H. Alinejad-Rokny, G.S. Starrett, N.A. Temiz and R.S. Harris. (2017). "Enrichment of Mutations within ERVK Elements in Breast Cancer", Retrovirology, revising for re-submission.

10. H. Alinejad-Rokny, M.P. Davenport, and D. Ebrahimi. (2018) "Humans use primarily APOBEC3G or APOBEC3H to hypermutate HIV", preparing for submission in PLOS Pathogens.

11. H. Alinejad-Rokny, A. Reynaldi, D. Ebrahimi, V. Venturi and M.P. Davenport. (2018). "Relationship between cell division, gene expression and the patterns of HIV insertion into the human genome", preparing for submission in PLOS ONE.

Conference presentation during candidature

1. H. Alinejad-Rokny, D. Ebrahimi and M. Davenport. (2014). "Errors associated with the analysis of hypermutated viral sequences", 15th International Conference on Bioinformatics, Aug 2014, Oral Presentation, Sydney, Australia.

2. S. Gooneratne, H. Alinejad-Rokny, P. Bohn, R. Wiseman, D. Ebrahimi, M. Scarlotta, W. Winnall, D. O’Connor, M. Davenport and S. Kent. (2013). "To escape, or not to escape: pig-tailed macaque MHC class I haplotypes that drive CTL escape in SIV infection", ASMR Vic Student Symposium, Sydney, Australia

Chapter 1:

General introduction and scope of thesis

Author contributions to thesis Chapter 1:

HA-R wrote the chapter. DE, MPD and VV read and revised the chapter.

The virus

Human immunodeficiency virus (HIV) is a ribonucleic acid (RNA) virus that assaults the immune system and causes HIV infection and acquired immune deficiency syndrome (AIDS). The first known infection of HIV in a human was seen in 1981 [1, 2] but it was clinically recognised to be the principal agent of AIDS in 1983 [3, 4].

HIV is a lentivirus, a member of the retrovirus family, that mainly infects immune cells bearing CD4 receptors on their cell membrane. HIV carries its genome in the form of RNA, which is reverse transcribed into DNA by the viral reverse transcriptase enzyme during the HIV life cycle. Each virion contains two copies of the single-stranded RNA genome and multiple proteins including reverse transcriptase, which is essential for HIV replication (Fig. 1.1A).

Figure 1.1: A) The HIV-1 virion carries two copies of single-stranded RNA that code for nine viral proteins. Each virion contains the reverse transcriptase (RT), protease and integrase (IN) enzymes. The viral particle is enclosed in a conical capsid, which is made of the viral protein p24. The single-stranded RNA is bound to the nucleocapsid proteins p6 and p7 and to the viral enzymes RT and IN. The matrix, formed by the GAG protein p17, encloses the capsid on all sides. The matrix is further covered by a viral envelope that contains the glycoproteins gp120 and gp41; B) Structure of the HIV-1 genome and the genes. Figure sourced from Shum et al., 2013 [5].

The genome of HIV-1 is approximately 9 kilobases (kb) in length and consists of nine genes in different reading frames (Fig. 1.1B). The HIV-1 proteins are classified into three groups [6, 7]:

1. Structural proteins: Gag (Group-specific antigen), Pol (DNA polymerase), and Env (Envelope)

2. Essential regulatory elements: Tat (Trans-Activator of Transcription) and Rev (regulator of expression of virion proteins)

3. Accessory regulatory proteins: Vpu (Viral protein U), Vpr (Viral protein R), Vif (Viral infectivity factor), and Nef (Negative regulatory factor)

Both ends of the HIV genome are flanked by identical non-coding sequences known as long terminal repeats (LTRs). HIV LTRs are about 640 bp in length and are segmented into the U3, R, and U5 regions (Fig. 1.1B).

HIV types and subtypes

There are two major types of HIV: HIV type 1 (HIV-1) [8] and HIV type 2 (HIV-2) [9]. Both types are transmitted by unprotected sexual intercourse or through blood-to-blood exposure (e.g. needle stick injuries), and both can cause AIDS. However, HIV-1 appears to be far more virulent and more easily transmitted heterosexually, and it is responsible for the vast majority of global HIV infections [10, 11]. In contrast, HIV-2 is less easily transmitted and generally progresses to AIDS more slowly. HIV-2 is mainly restricted to West Africa [12].

HIV-1 can be further classified into four different groups: major group “M”, outlier group “O”, minor group “N” (non-M non-O) and a more recently identified group “P” [13].

Approximately 90% of the HIV-1 infections are group “M”, which is further classified into nine subtypes (or clades): A, B, C, D, F, G, H, J and K and at least 43 Circulating Recombinant Forms (CRF) based on their genetic differences [14, 15]. The geographic distribution of clades is listed below:

• Subtype A: Central and East Africa [16].

• Subtype B: the Americas, west and central Europe, South America, Australia, Japan, Thailand, northern Africa [17].

• Subtype C: India, Brazil and southern and eastern Africa [17].

• Subtype D: Eastern and central Africa [17].

• Subtype F: South Asia, South America and eastern Europe [18].

• Subtype G: Central and West Africa and central Europe [18].

• Subtype H: Africa [18].

• Subtype J: North, central and West Africa, the Caribbean [19].

• Subtype K: confined to the Congo and Cameroon [18].

HIV-1 life cycle

The HIV life cycle involves the following steps (see Fig. 1.2):

Binding to host cells

The gp120 glycoprotein on the surface of HIV binds to the primary receptor CD4 and then to a chemokine co-receptor, CCR5 or CXCR4, located on the surface of T-cells and macrophages. Binding of gp120 to CD4 induces a conformational change in gp120 that allows it to engage the co-receptor. Co-receptor binding in turn triggers a conformational change in gp41, which uncoils and allows fusion of the viral envelope with the host cell membrane, releasing the viral capsid into the cytoplasm.

Reverse transcription

After the virion (the infectious HIV particle) enters the cytoplasm, a transfer RNA primer binds to a specific region of the HIV genome (called the primer-binding site). The enzyme reverse transcriptase then transcribes the RNA, beginning from this site, into a single strand of complementary DNA (cDNA), which is a DNA copy of the viral RNA. Over the full course of reverse transcription, the single-stranded RNA viral genome is converted into double-stranded DNA [20, 21]. The reverse transcription process is error prone and often creates mutations in the viral genome. Recombination (the production of an offspring virus from the two RNA strands) is also a result of error-prone reverse transcription [22]. The high rate of HIV mutation and recombination creates a high degree of genomic variability but also produces many replication-defective viruses.

Integration into host DNA

The cDNA enters the nucleus and uses integrase to integrate into the host DNA. Here the virus may stay in an inactive state for several years and remain undetected by the immune system. Such infected cells are referred to as the latent HIV pool [23, 24].

Figure 1.2: HIV-1 replication cycle. Figure sourced from [25].

Transcription and translation

After viral DNA integration into the chromosome of host cells, the HIV provirus may remain inactive, but when the host cell becomes activated it treats the HIV genes much like its own genes. This leads to the transcription of HIV genes and their translation in infected cells.

Viral assembly and budding

The final step of the HIV life cycle is the assembly and budding of new virions. In this step a viral enzyme called protease cuts the long HIV protein chain into smaller individual proteins; these new HIV proteins are then used to assemble the mature viral cores of new HIV particles at the CD4 T-cell membrane. The newly produced HIV virion then leaves the host cell, and is ready to infect the next cell and start the process all over again [29, 30].

SIV, an animal model of HIV infection

The origin of HIV has proven elusive. Scientists studying the evolutionary history of HIV identified a virus similar to HIV in a type of monkey in Africa. It is currently believed that the chimpanzee/gorilla version of the immunodeficiency virus (simian immunodeficiency virus, SIVcpz/gor) is the source of HIV-1 [31, 32] and that a similar virus in sooty mangabeys (SIVsmm) is the source of HIV-2 [33]. Scientists have found that SIVsmm infection does not cause clinical disease in its natural host [34]. However, when the virus experimentally or accidentally infects an Indian rhesus macaque (RM), the animal develops simian AIDS (SAIDS) [35]. SIVmac infection in RM is used as an animal model of HIV. Moreover, for different research purposes, others have combined HIV and SIVmac genomes to make SHIV (simian-human immunodeficiency virus) [36].

Fig. 1.3 shows the complexity of the evolutionary history of HIV. SIVcpz has crossed into humans, giving rise to different forms of HIV, at least three times, and SIV has likely crossed from gorillas at least once [32, 37].

Figure 1.3: Evolutionary history of HIV; a phylogenetic tree of HIV and SIV sequences indicates multiple cross-species transfer events. SIVsm/HIV-2 dynasties are green and their most recent common ancestor is shown with an open green star. SIVcpz/HIV-1 dynasties are red and their most recent common ancestor is depicted by a closed red star. Other SIV dynasties are blue. Figure sourced from [38].

HIV disease progression

HIV disease progression can be separated into three major stages (Fig. 1.4): primary infection, HIV latency or chronic infection (the asymptomatic phase), and the AIDS phase [39].

Primary HIV infection (sometimes called acute HIV infection) is the early stage of HIV disease. In this phase, the virus enters the body and replicates at a rapid rate. The CD4 T-cells in which HIV replicates are destroyed as the infection persists, causing a reduction in the number of CD4 T-cells in peripheral blood [40-42]. This stage of infection generally develops within 4 to 6 weeks and is often accompanied by a short flu-like illness with symptoms such as fever, headache and rash [42, 43]. In the acute phase, the immune system responds to the virus by generating HIV-specific antibodies and cytotoxic T-lymphocytes [44-46]. A direct effect of this immune activation is a reduction in viral load from its peak and a gradual increase in the peripheral blood CD4 T-cell count [44-46].

The second phase of HIV infection is asymptomatic chronic HIV infection (also called clinical latency). During the chronic phase, HIV continues to replicate in the peripheral blood at very low levels and immune responses to the virus remain at a roughly constant level [45-48]. During this stage, infected individuals can still suffer large reductions in their CD4 T-cell count due to viral replication [49, 50]. The clinical latency stage lasts for an average of 8-10 years, but some patients progress more quickly [51, 52].

AIDS is the final and most serious phase of HIV infection. In this stage, because of the immune deficiency caused by chronic HIV infection, patients have a higher risk of acquiring other infections (called opportunistic infections) and infection-related cancers [46, 53, 54]. The levels of antibodies and CD4 T-cells decline and the amount of infectious HIV in the peripheral blood increases [50, 55, 56]. The number of CD4 T-cells can be a good measure of the level of HIV progression in the body. A healthy individual has between 500 and 1500 CD4 T-cells per microlitre of blood [57, 58]. However, the CD4 T-cell count can fall far below this range during the course of HIV infection [57, 58].

Figure 1.4: The course of HIV infection; HIV disease progression within an individual is typically divided into three main stages: primary infection, HIV latency and symptoms of AIDS. Figure adapted from [59].

Immune response activation

Most HIV-infected individuals mount an immune response to the virus in the first few months of infection but over time it will prove ineffective. The immune response is generally characterised by innate and adaptive immune responses (also called acquired immune responses) that partially limit the viral replication [60, 61].

The innate immune responses provide a rapid response to infections. When innate immune cells (such as dendritic cells, macrophages and neutrophils) identify a pathogen, they can respond by phagocytosing infected cells and secreting cytokines and chemokines, which signal to other arms of the immune system, in particular the adaptive response, that a pathogen has been detected; they also present antigens for adaptive immune responses to identify. Innate immune responses provide an important first level of defence against infectious HIV via effector mechanisms that engage the pathogen directly. The adaptive immune response is slower to take effect, but provides a more efficient response against infectious HIV. Adaptive immune responses are produced through clonal selection of lymphocytes, the process by which a single B- or T-cell that recognises an antigen proliferates into a large population of cells that eliminate that antigen [62, 63].

Lymphocytes are a form of small leucocyte (white blood cell) in a vertebrate's immune system that is made in the bone marrow. There are two primary types of lymphocyte in the acquired immune system: B-lymphocytes (or B-cells) and T-lymphocytes (or T-cells) [63, 64]. Both derive from lymphoid stem cells. The immune response generated by B-cells is known as the humoral immune response, and the immune response generated by T-cells is known as the cellular immune response [65]. Unlike innate immune cells, adaptive immune cells have the capacity to remember previously encountered antigens, and so provide a faster and more efficient immune response toward pathogens that reappear in a host after the first infection. This is known as immunological memory; it is specific for a particular antigen and is long-lived. T-cells play a major role in providing cellular, or cell-mediated, immunity.

There are a number of different T-cell types, including: cytotoxic T-cells, helper T-cells, memory T-cells and regulatory T-cells.

CD8 T-cells: These are cytotoxic T-cells (also referred to as killer T-cells) that play an important role in controlling HIV. They recognise and destroy infected cells presenting antigen, as well as tumour cells, by binding to the complex of MHC molecule and antigen [63, 66]. CD8 T-cells express proteins called T-cell receptors (TCRs), which recognise antigens and allow the T-cells to kill cells infected with viruses or other intracellular pathogens [63, 67].

CD4 T-cells: Also known as helper T-cells, these cells are the major target of HIV infection. They help other cells by identifying antigens and releasing cytokines that activate B-cells as well as killer T-cells [63]. However, infection of CD4 T-cells by HIV may kill them.

The CD8 and CD4 T-cell receptors identify and bind to viral epitopes (peptides) presented by MHC class I and class II molecules, respectively. MHC molecules are present on the surface of many host cells (e.g. CD4 T-cells, macrophages), and their function is to present fragments of viral proteins on the cell surface, thus informing other cells (e.g. CD8 T-cells) about the presence of a pathogen (see Fig. 1.5). If an infected cell is detected, the various T-cell responses can be employed in an attempt to kill it.

Figure 1.5: T-cell receptor binding to MHC-antigen complex.

Both innate and adaptive immune responses are produced during HIV infection, but the adaptive immune response is the most important for long-term protection against HIV because of its ability to learn, adapt and remember.

HIV immune escape

Cytotoxic T-lymphocytes (CTLs) have a major role in the immune control of HIV and SIV infections, as they identify virus-derived peptide epitopes presented by MHC molecules on the surface of infected cells. During the course of infection, virus-specific immune responses exert a selective pressure on the virus. The CTL pressure favours the growth of viral variants that are mutated in and around the epitope-coding regions (the regions encoding the parts of an antigen that are bound by MHC molecules and identified by the immune system) [68-71]. New HIV variants that are no longer recognised by the CTLs targeting a given epitope are called CTL ‘escape mutants’. The viral mutations that lead to a total or partial loss of CTL recognition can also damage the replication capacity of the virus; this is known as the fitness cost of an escape mutation [68, 69, 72]. Viral escape from host CTL responses via mutation of epitope-coding regions is a major challenge to natural or vaccine-induced immune control of HIV infection.

Within a single infected individual there are usually a large number of genetically distinct viral variants, due to the highly error-prone reverse transcription step in the virus life cycle. The highly prevalent, non-mutated strain of HIV is known as the ‘wild-type’ (WT) form of the virus. The wild-type epitope is likely to be identified by CD8 T-cell receptors and the virus carrying it eventually controlled. When this happens, one of the mutant viral variants present in the individual is likely to become dominant while the WT virus is controlled (this is called immune escape). Viral quasispecies that carry mutations in MHC class I-restricted epitopes can impair the ability of T-cell receptors to bind the epitope-MHC complex, relieving the immune pressure on that epitope [73].

HIV genetic variants differ in their ability to infect and replicate within host cells. Several studies have shown that wild-type viruses are the fittest, and hence the dominant, HIV strains among all the quasispecies. Thus they are the main targets of antiretroviral therapy. Typically, mutant viruses have lower replication rates than the wild-type virus [74, 75].

Viral latency

During the HIV life cycle, the cDNA enters the nucleus and uses a viral enzyme called integrase to integrate itself into the host cell genome. The integrated form of the viral DNA is called a provirus. As shown in Fig. 1.6, a small percentage of proviruses may not produce virus immediately and so are shielded from surveillance by the immune system, remaining in this inactive state for several years. By producing little or no viral protein, these proviruses show no evidence of infection to the immune system and persist in the body as a latent pool of virus, known as latently infected cells [23, 24]. Latently infected cells can become activated and start to generate large numbers of viral progeny at any time [71].

Figure 1.6: Model of HIV latent infection and active infection; A) the latently infected T-cell carries the proviral DNA. Because there is no evidence of infection at the surface of the infected T-cell, the immune system is not able to recognise it as an infected cell. B) A latently infected T-cell can be activated, via viral reactivation, and go on to produce new HIV-infected T-cells. Figure adapted from College, M.T., 2014 [76].

Latency is one of the main challenges in curing HIV infection. It enables the virus to hibernate and evade the immune system.

Vaccine and antiretroviral therapy

Vaccines are substances that incite an immune response, causing the production of immune memory, such as antibodies and memory T-cells, to protect against certain infections. The adaptive immune system is able to provide protection against repeat infections by the same pathogen. There is a delay of approximately 1-2 weeks before the adaptive immune response is fully activated towards an unfamiliar pathogen. In acute infection, this delay enables the pathogen to spread throughout the whole body. A vaccine is a preparation of either dead or weakened whole pathogens, or immunogenic fragments of a pathogen, used to stimulate the production of antibodies against this pathogen. Memory B-cell vaccines are the most common vaccines; they aim to develop a memory B-cell repertoire against a given infection. These memory B-cells can quickly differentiate into plasma cells and mount an antibody response against the pathogen if it emerges in the host. Since cellular immunity is thought to be more effective than antibody-mediated immunity for control of HIV, some have attempted to design T-cell vaccines against HIV [77, 78].

Unfortunately, an effective HIV vaccine has still not been developed, but there are treatments to control HIV infection, known as combination antiretroviral therapy (ART). Antiretroviral medications inhibit the activity of different viral enzymes, for example reverse transcriptase and protease, at various stages of the HIV life cycle. These therapies inhibit new infections of T-cells or the production of virus from infected cells. In this way, they reduce viral loads, delay disease progression and decrease the risk of HIV transmission. The effectiveness of ART is challenged by the large number of drug-resistance mutations that emerge in patients under therapy, which reduce susceptibility to antiretroviral drugs, especially in patients receiving long-term ART [79]. Latently infected cells play an important role in treatment failure and are considered the main barrier to curing HIV infection, since they are very long-lived and apparently unaffected by current antiretroviral therapy. Because these cells are unaffected by antiretroviral drugs and hidden from the immune system, they can reactivate and begin to produce virus within several weeks if therapy is stopped [80-83].

APOBEC proteins

Several years ago, a novel intracellular defence that protects against retroviral infection by inducing mutations in the genome of the pathogen was identified. It was discovered that human cells encode a group of enzymes, the apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like 3 (APOBEC3) proteins, which are incorporated into budding HIV virions and cause C-to-U mutations in the single-stranded regions of the minus-strand DNA formed during reverse transcription of the HIV RNA template [84-88].

The APOBEC3 genes are located on human chromosome 22 and encode a seven-membered family of cytidine deaminase enzymes (Fig. 1.7). All proteins in the APOBEC3 family contain a short α-helical domain that is followed by a cytidine deaminase (CDA) domain, a linker region, and a pseudo-active site. In APOBEC3B, APOBEC3D, APOBEC3F, and APOBEC3G, these domains are present in duplicate [88]. Several of these proteins, such as APOBEC3G, APOBEC3F and APOBEC3H haplotype II, are active against HIV and possibly hepatitis B virus (HBV).

Figure 1.7: Members of the APOBEC3 family of cytidine deaminase enzymes; Figure created with the NCBI genomic sequence viewer [89] (NC_000022.11: 38858996...39305748, Homo sapiens chromosome 22); accessed Nov 2015.

Host restriction factor APOBEC3 versus HIV viral infectivity factor (Vif) protein

APOBEC3 proteins have been widely studied and various antiviral mechanisms have been reported [90-97]. APOBEC proteins can act as cytidine deaminases and induce hypermutation in the HIV genome during reverse transcription. Moreover, as shown in Fig. 1.8, A3G (APOBEC3G) and possibly other members of this protein family can inhibit HIV reverse transcription and hinder the accumulation of viral reverse transcripts and cDNA in target cells in a deaminase-independent manner [98-100]. It has also been reported that A3G can block viral DNA integration into the host chromosome in a manner that is independent of deaminase activity [101].

APOBEC3G (A3G) and APOBEC3F (A3F) interact with high-molecular-weight ribonucleoprotein complexes in activated CD4 T-cells and immortalised cell lines (transformed cells that can grow and divide for an unlimited period), and they may also inhibit virus replication through deamination-independent mechanisms [99, 102]. Vif overcomes the antiviral defence of A3G and A3F by binding to these proteins and forming an E3 ubiquitin ligase complex (an enzyme complex that catalyses ubiquitination and so targets specific proteins for degradation) with cullin 5, elongin B and elongin C, which causes their proteasomal degradation [103, 104]. The interaction between Vif and the A3G or A3F proteins is highly specific, and this specificity has been mapped to a single amino-acid replacement [105].

Figure 1.8: Antiviral activity of APOBEC3G in the presence, or absence, of Vif protein; the Vif protein blocks the antiviral activity of APOBEC3 enzymes and prevents incorporation of endogenous APOBEC3 enzymes into progeny virions. In the absence of Vif protein, the APOBEC3 enzyme is packaged into the HIV virion and is able to restrict HIV infection in resting CD4 T-cells. Figure sourced from [106].

HIV hypermutation

In the absence of HIV Vif, the cytoplasmic A3G and A3F proteins are packaged into HIV virions; they can deaminate cytosine to uracil in viral minus-strand DNA during reverse transcription of the viral genome, resulting in G-to-A hypermutation of the plus strand and inactivation of the newly synthesised viral DNA [107-111]. The locations of mutations are not completely random. APOBEC3G substitutes G by A preferentially within a GG context, whereas other members of this family change G to A mainly within a GA context [100, 109, 112-114]. Importantly, in protein-coding regions, G-to-A mutations within tryptophan (TGG) codons frequently generate stop codons (TGG-to-TAG, TGG-to-TGA and TGG-to-TAA). The activity of APOBEC3 against HIV can result in upwards of 10% of guanosines being mutated to adenosines [109, 111, 115].
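The context dependence described above can be illustrated with a short Python sketch. It assumes a reference and a query sequence that are already aligned without gaps, classifies each G-to-A change by the dinucleotide formed by the mutated G and the following reference base, and lists in-frame codons that become stop codons. The function names and the toy sequences are hypothetical and serve only as an illustration; they are not taken from any tool or dataset used in this thesis.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def g_to_a_contexts(reference: str, query: str) -> Counter:
    """Count G-to-A changes by dinucleotide context (mutated G plus next reference base)."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    # The final base has no downstream neighbour, so it is skipped.
    for i in range(len(reference) - 1):
        if reference[i] == "G" and query[i] == "A":
            counts[reference[i:i + 2]] += 1
    return counts


def new_stop_codons(reference: str, query: str) -> list:
    """Return (codon_index, reference_codon, query_codon) for codons that become stops."""
    hits = []
    for i in range(0, len(reference) - 2, 3):
        ref_codon, qry_codon = reference[i:i + 3], query[i:i + 3]
        if ref_codon not in STOP_CODONS and qry_codon in STOP_CODONS:
            hits.append((i // 3, ref_codon, qry_codon))
    return hits


# Toy example (hypothetical sequences, reading frame starts at position 0).
reference = "ATGTGGGAGGGA"
query = "ATGTAGAAGAGA"
print(g_to_a_contexts(reference, query))
print(new_stop_codons(reference, query))
```

In this toy example two of the three G-to-A changes fall in a GG context and one in a GA context, and the TGG codon is converted into the stop codon TAG.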

Sequencing

High-throughput genome sequencing methods allow us to investigate a large number of DNA or RNA sequence reads simultaneously. These technologies produce a wealth of information about biological processes, but also bring many new challenges for biologists and data scientists: how can these massive amounts of complex information be processed and interpreted so that reliable biological inferences are made? Below I present a concise description of the most common sequencing technologies and their associated challenges.

Sanger sequencing

The Sanger sequencing method was developed by Frederick Sanger and colleagues in 1977 and was the most extensively used DNA sequencing technique for almost 25 years [117], until the development of next-generation sequencing techniques. This method relies on the selective incorporation of chain-terminating dideoxynucleotides (ddNTPs), labelled with different fluorescent dyes, by DNA polymerase during in vitro DNA replication [117, 118]. Sanger sequencing needs a single-stranded DNA template, a DNA primer for synthesis, a DNA polymerase, and a mixture of deoxynucleotide triphosphates (dNTPs) and dideoxynucleotide triphosphates (ddNTPs) that terminate DNA strand elongation. Dideoxynucleotides have a hydrogen atom on the 3’ carbon instead of a hydroxyl group (OH), so no further nucleotide can be attached. In this method, the DNA is denatured, primers are annealed to the DNA template, and the solution is split into four separate sequencing reactions, each containing a DNA polymerase, all four standard dNTPs (A, C, G and T) and one type of chain-terminating ddNTP (ddA, ddC, ddG or ddT). After the random incorporation of deoxynucleotides and dideoxynucleotides, the resulting DNA fragments of varying length are heat denatured and separated by size through gel electrophoresis.
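The logic of reading a sequence from chain-terminated fragments can be sketched in a few lines of Python. This is a deliberately idealised illustration: it assumes that every possible termination position is observed exactly once, ignores the primer length, and uses hypothetical function names.

```python
def termination_lengths(synthesised: str) -> dict:
    """For each ddNTP reaction, list the fragment lengths at which termination can occur."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesised, start=1):
        # A fragment ending at this position can only arise in the reaction
        # that contains the matching dideoxynucleotide.
        lanes[base].append(position)
    return lanes


def read_gel(lanes: dict) -> str:
    """Read the sequence by ordering fragments from shortest to longest across the four lanes."""
    by_length = {length: base for base, lengths in lanes.items() for length in lengths}
    return "".join(by_length[length] for length in sorted(by_length))


lanes = termination_lengths("GATTACA")
print(lanes)
print(read_gel(lanes))
```

Because each fragment length occurs in exactly one of the four reactions, sorting the fragments by size recovers the order of bases in the newly synthesised strand.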

The Sanger sequencing approach has the advantages of providing reliable data and long contiguous DNA sequence reads, but the limited volume of data and the cost per sequence pose limitations in larger-scale projects. For this reason, next-generation sequencing technologies are now a popular alternative in most large-scale research projects. Sanger sequencing continues to be used mostly for smaller-scale projects and to validate results obtained using next-generation sequencing technologies. In the remainder of this section I discuss the various high-throughput sequencing technologies and their relative advantages and disadvantages.

454 sequencing

454 sequencing [119] is a high-throughput DNA sequencing technology that utilises the large-scale parallel pyrosequencing method developed by Ronaghi et al. [120]. Genomic DNA is randomly fragmented and denatured into single-stranded DNA; adapter sequences are then added, and DNA templates are clonally amplified using emulsion PCR [120]. Following amplification, the clonally amplified DNA fragments are individually deposited into an array. In the pyrosequencing step, sequencing takes place via sequential addition of the four nucleotides. When the added nucleotide is complementary to the next base of the template strand, the DNA polymerase elongates the growing DNA strand by incorporating one or more nucleotides. Each incorporation emits a detectable light signal that is recorded by a camera in the machine. The sequencing cycle is repeated until the target read length has been achieved. For read lengths of up to 500 base pairs (bp), a single run on a 454 sequencing machine can generate approximately 500 megabases of DNA sequence in 10 hours. Currently, the most advanced platform in the 454 family (GS FLX Titanium Sequencing Kits XL+) can generate approximately 1 million reads per run, with reads of up to 1000 bp.

One disadvantage of the 454 method is that it often misidentifies the length of homopolymers (short stretches of identical bases). Additionally, 454 platforms are often considered not to be cost-effective. For some time after the advent of high-throughput technologies, 454 was a preferred option for studies requiring longer reads, but in recent years more cost-effective and reliable technologies have been developed that provide an option for longer read lengths at lower throughput, as I discuss in the following paragraphs.
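The following Python sketch illustrates why homopolymer lengths are difficult to call in pyrosequencing. It models an idealised flowgram in which each nucleotide flow produces a signal equal to the number of bases incorporated, and then shows how a small amount of noise on a long homopolymer signal changes the called length. The flow order, the function names and the noise values are assumptions made for illustration; real instruments involve additional signal processing that is not modelled here.

```python
def flowgram(read: str, flow_order: str = "TACG", cycles: int = 4) -> list:
    """Return (nucleotide, signal) per flow; the signal is the number of bases incorporated."""
    flows = []
    position = 0
    for _ in range(cycles):
        for nucleotide in flow_order:
            run = 0
            # A whole homopolymer run is incorporated within a single flow.
            while position < len(read) and read[position] == nucleotide:
                run += 1
                position += 1
            flows.append((nucleotide, run))
    return flows


def call_bases(flows: list) -> str:
    """Call bases by rounding each flow signal to the nearest whole number of bases."""
    return "".join(nucleotide * round(signal) for nucleotide, signal in flows)


ideal = flowgram("TTTAACG", cycles=1)
print(ideal)
print(call_bases(ideal))

# Hypothetical noisy signals for a run of five A bases: a small change in
# measured intensity is enough to change the called homopolymer length.
print(call_bases([("A", 5.4)]))
print(call_bases([("A", 5.6)]))
```

Because the light intensity, rather than a separate incorporation event per base, encodes the run length, the relative difference between neighbouring lengths shrinks as the homopolymer grows, which makes longer runs more prone to miscalls.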

Illumina sequencing

Illumina sequencing technology was originally invented and developed by Balasubramanian and Klenerman at Cambridge University, U.K. [www.illumina.com/technology/next-generation-sequencing/solexa-technology.html]. It is a high-throughput, short-read and massively parallel sequencing technology. The Illumina sequencing platform uses sequencing by synthesis (SBS) on an eight-channel flowcell and generates many millions of highly accurate reads in each channel, allowing large stretches of DNA spanning entire genomes to be sequenced; sequence reads are up to 150 bp (Genome Analyzer IIx) or 100 bp (HiSeq 2000) in length. An Illumina sequencing library is prepared by fragmenting genomic DNA and ligating adapters to both ends of each fragment. The DNA fragments are then amplified on a flowcell using bridge PCR (a step named ‘cluster generation’) to form clusters of identical fragments. Illumina platforms use reversible dye-terminators in the reaction mixture, which allow a single base to be read per sequencing cycle per cluster. This platform can be used for whole-genome resequencing, transcriptome analysis, genome-wide protein-nucleic acid interaction analysis, expression profiling, and epigenomic sequencing.

Pacific Biosciences

PacBio RS is a Single-Molecule Real-Time (SMRT) sequencing system developed by Pacific Biosciences. SMRT is a sequencing-by-synthesis technology in which a DNA polymerase incorporates four distinct fluorescently-labelled nucleotides. The fluorescent dye acts as a signal to identify the nucleotide type (A/C/G/T) [121]. In this technique, signals are recorded by a camera in a movie format, so that both the duration of each fluorescent signal and the length of the interval between consecutive signals can be measured. These differences in signal timing can be informative in epigenetic studies [122, 123]. PacBio allows sample preparation to be completed within 4-6 hours, rather than days. Additionally, it does not need a PCR step at the preparation stage, which could reduce the inaccuracy in library quantification caused by PCR. PacBio sequencing provides much longer reads (~1300 bp) than any second-generation sequencing technology; it also provides faster runs than second-generation sequencing methods. PacBio transcriptome sequencing provides very reliable gene isoform identification with high sensitivity; additionally, this method generates information that is helpful for direct identification of base modifications, such as DNA methylation, DNA damage and other epigenetic information [124]. Nonetheless, PacBio is hindered by a comparatively lower throughput, higher error rate, and higher per-base sequencing cost [122].

Nanopore sequencing techniques

SMRT and nanopore sequencing [125] are two nanotechnology-based single-molecule sequencing techniques. Nanopore sequencing, the newest, rapid and low-cost third-generation sequencing technology, uses a very small biological pore formed by a protein channel (which facilitates ion exchange), housed in a membrane bilayer; changes in the electrical signal across the pore [122] are recognised as A, C, G or T. The method works by threading a single-stranded DNA molecule through a protein nanopore, which self-assembles into a transmembrane channel with a heptameric structure and exceptional endurance to high voltage. The protein nanopore, α-hemolysin, is derived from Staphylococcus aureus. A constant ionic flow can be observed through the pore. One of the most noteworthy advantages of nanotechnology-based single-molecule sequencing is that minimal sample preparation is required and no enzymatic reactions are needed. Another advantage is its potential to generate long reads (useful in DNA methylation mapping). Nanopore sequencing is yet to reach the market. However, owing to its distinctive analytical capability (identifying the four kinds of bases at single-base resolution), it is expected to enable cost-efficient, rapid DNA sequencing in the near future, especially in the study of epigenomics [126].

In summary, the existing platforms have distinctive properties, advantages and disadvantages. Although the technologies still need significant improvement, the main challenge at present is effective data handling and processing. In the coming years, researchers will need to design sophisticated bioinformatic pipelines in order to use NGS data efficiently and carry out biological surveys. The ultimate goal of NGS could be defined as deciphering human genomic information and elucidating the function of the genome. This would be a stepping-stone towards implementing personalised medicine in the future. Importantly, with its high throughput, sensitivity, speed and cost-effectiveness, one could foresee a robust future for NGS as a clinical tool, surpassing current diagnostic standards.

NGS data analyses

Here, I present a brief review of some bioinformatics tools available for HIV genomic research, with a focus on next-generation sequencing technologies. It is not possible to describe all existing bioinformatics tools in this brief overview; instead, I discuss the main stages involved in NGS data analysis: read quality control, mapping and consensus sequence generation, and mutation calling.

Read quality control

Quality control is an important concern in NGS data analysis, because the larger volume of data generated by NGS technologies typically comes at the expense of lower per-base quality. However, NGS technologies provide quality scores as part of their data output; these represent the level of certainty that the sequencing technology has in identifying each base in a read. Phred quality scores are logarithmically related to the base-calling error probability. For example, a Phred quality score of 10 represents a 1 in 10 probability of an incorrect base call (i.e. 90% accuracy), a Phred quality score of 20 represents a 1 in 100 probability of an incorrect base call (i.e. 99% accuracy), and so on. This base quality information enables bioinformaticians to assess the quality of the sequence output and perform various quality control steps to safeguard the quality of the data analysis. These steps usually involve the filtering of reads and/or the trimming of lower-quality stretches of sequence.
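The logarithmic relationship between Phred scores and error probabilities, Q = -10 log10(P), can be made concrete with a small Python sketch. The function names are hypothetical, and the decoding helper assumes that quality characters are stored with an ASCII offset of 33 (the common Phred+33 FASTQ encoding); older files may use a different offset.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into the base-calling error probability P."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a base-calling error probability P into a Phred quality score Q."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into Phred scores (assumes a Phred+33 encoding)."""
    return [ord(character) - offset for character in quality_string]


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.3f}, accuracy {100 * (1 - p):.1f}%")

print(decode_fastq_quality("I5+"))
```

Under these assumptions, each increase of 10 in the Phred score corresponds to a ten-fold reduction in the probability of an incorrect base call.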

Because NGS reads contain adapter/primer sequences, homopolymer sequences and sequencing errors, the quality control step is very important for reliable downstream analysis. Adapter/primer sequences are needed for a successful sequencing run; however, after sequencing they need to be removed from the reads because they can hinder the accurate assembly or mapping of reads and affect SNP calling and other analyses. Two important tools for removing adapter/primer sequences are Cutadapt [127] and Trimmomatic [128].

NGS reads are also typically trimmed to remove lower-quality bases from the ends (e.g. the 3' end) of reads, and then filtered. Reads that are very short or very long compared to the expected read length for a platform tend to be of lower quality and include a large number of errors, such as ambiguous bases; these reads should be filtered out of the dataset. In some steps of NGS data analysis (for instance de novo assembly), it is also useful to identify and remove identical reads from the dataset. There are several tools for trimming and filtering NGS reads; two of the most commonly used tools for read trimming, filtering and duplicate removal are Trim Galore [www.bioinformatics.babraham.ac.uk/projects/trim_galore] and PRINSEQ [129].

Sometimes it can be useful to know which reads are mapped or unmapped. Many reads cannot be mapped to the reference genome (especially a viral genome) in the first run because the number of allowable mismatches has been exceeded. In a second run, only the unmapped reads are used for mapping. An additional quality control step can be performed for reads produced by the Ion Torrent and Roche 454 sequencing platforms, using tools such as RC454 [130] and Coral [131] to handle errors in homopolymeric regions caused by carry-forward and incomplete extension [132]. Moreover, tools such as PyroCleaner [133], pbh5tools and PoreTools [134] can handle 454 (SFF format), PacBio (HDF5 format) and Oxford Nanopore Technologies (FAST5 format) reads, respectively.
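The trimming, filtering and duplicate-removal steps described in this section can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions: bases are removed from the 3' end until one meets a quality threshold, which is simpler than the trimming algorithms implemented in the tools cited above, and all function names and threshold values are hypothetical.

```python
def trim_3prime(sequence: str, qualities: list, min_quality: int = 20) -> tuple:
    """Remove bases from the 3' end until the last base meets the quality threshold."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def passes_filters(sequence: str, min_length: int = 30, max_length: int = 300,
                   max_ambiguous: int = 0) -> bool:
    """Keep reads within the expected length range and without excess ambiguous (N) bases."""
    return (min_length <= len(sequence) <= max_length
            and sequence.count("N") <= max_ambiguous)


def remove_duplicates(sequences: list) -> list:
    """Keep the first occurrence of each identical read, preserving input order."""
    seen = set()
    unique = []
    for sequence in sequences:
        if sequence not in seen:
            seen.add(sequence)
            unique.append(sequence)
    return unique


trimmed, trimmed_qualities = trim_3prime("ACGTAC", [30, 30, 25, 20, 10, 5])
print(trimmed, trimmed_qualities)
print(passes_filters(trimmed, min_length=3, max_length=10))
print(remove_duplicates(["ACGT", "GGCA", "ACGT"]))
```

In practice the thresholds would be chosen according to the sequencing platform and the expected read length, and the trimming would be applied before the length and ambiguity filters, as in the order shown here.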

Mapping and consensus sequence generation

An important step in applying NGS to viral sequences is generating a consensus sequence of the whole viral genome from the NGS reads. After quality control, the reads are mapped to a known viral reference sequence, usually the genome of a closely related organism. Mapping is a fundamental step because it determines where each read is placed on the reference sequence and therefore affects all subsequent analyses, such as polymorphism calling and phylogenetic tree generation. The most widely used mapping and alignment tools are hash-based tools such as Mosaik [135] and Stampy [136], and Burrows-Wheeler transform (BWT)-based tools such as BWA [137] and Bowtie2 [138].

However, specialised alignment software such as BLASR [139] and LAST [140] is usually required for longer reads, such as those generated by single-molecule sequencers like the Oxford Nanopore MinION and Pacific Biosciences RS II. BWT-based aligners can quickly map a large FASTQ file of sample reads onto a viral reference sequence using modest computational resources (memory and multi-core processors). In contrast, hash-based aligners are slower but are generally more sensitive when mapping highly diverse viral genomes to the closest available viral reference sequence. Selection of a suitable viral reference sequence is therefore a critical step in mapping viral reads. A reference sequence from a too distantly related species will only allow the detection of reads that map to the highly conserved regions of the target species, resulting in low or incomplete coverage in other regions.
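To make the idea behind BWT-based mapping concrete, the following toy Python sketch builds a Burrows-Wheeler index of a short reference and finds exact occurrences of a read by backward search. It is only an illustration of the principle: it uses a naive suffix array, supports exact matches only, and does not reproduce how BWA or Bowtie2 handle mismatches, gaps or large genomes. The function names are assumptions.

```python
from typing import Dict, List, Tuple

Index = Tuple[List[int], Dict[str, int], Dict[str, List[int]]]


def build_index(reference: str) -> Index:
    """Return the suffix array, C table and Occ table of reference + '$'."""
    text = reference + "$"                      # '$' sorts before A, C, G, T
    # Naive suffix array: fine for a toy reference, too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)      # character preceding each suffix
    alphabet = sorted(set(text))
    # first[c] = number of characters in text that sort strictly before c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def exact_match(read: str, index: Index) -> List[int]:
    """Return the 0-based reference positions where read occurs exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)                         # half-open range of suffix-array rows
    for c in reversed(read):                    # extend the match one base leftwards
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

For example, `exact_match("ACG", build_index("ACGTACGA"))` returns `[0, 4]`. The search cost depends on the read length rather than the reference length, which is one reason BWT-based aligners are fast; tolerating the mismatches expected in diverse viral populations requires additional machinery that is not shown here.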

Most mapping programs store read alignments against the reference genome in the Sequence Alignment/Map (SAM) format [141]. A SAM file is typically converted into the Binary Alignment/Map (BAM) format, which is compressed, saves storage space and is faster to manipulate, making it more efficient than SAM for downstream analysis. A BAM file can be inspected with visualisation programs such as Tablet [142], MapView [143] and Geneious (Biomatters Ltd, Auckland, New Zealand), which enable researchers to examine coverage, polymorphisms and indels visually across the whole viral genome or segments of it. SAMtools [144] provides a set of utilities (e.g. sorting, merging, viewing, indexing, variant calling) for manipulating alignments in SAM and BAM format [141]. SAMtools can be used in combination with VCFtools [145], a toolkit for manipulating genetic variation data in VCF (variant call format) and BCF files, to find polymorphisms and indels relative to the viral reference sequence and to build a consensus sequence for the virus. An alternative program for generating viral consensus sequences from aligned reads is VarScan [146], a variant-calling and somatic-mutation/CNV detection tool (for targeted, exome and whole-genome data) for next-generation sequencing data from Illumina, SOLiD, Life/PGM, Roche/454 and similar platforms.
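As a rough illustration of how a consensus sequence can be derived from SAM-formatted alignments, the Python sketch below counts the bases aligned to each reference position and takes the majority base. It is a simplification, not the procedure implemented in SAMtools, VCFtools or VarScan: it ignores base and mapping qualities, does not add insertions to the consensus, and writes `N` where coverage is below an assumed `min_depth`. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def pileup(sam_lines: Iterable[str]) -> Dict[int, Counter]:
    """Count aligned bases per 1-based reference position from SAM text lines."""
    counts: Dict[int, Counter] = {}
    for line in sam_lines:
        if line.startswith("@"):              # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, pos, cigar, seq = int(fields[1]), int(fields[3]), fields[5], fields[9]
        if flag & 0x4 or cigar == "*":        # flag bit 0x4 marks an unmapped read
            continue
        ref_pos, read_pos = pos, 0
        for length, op in CIGAR_OP.findall(cigar):
            n = int(length)
            if op in "M=X":                   # aligned bases: consume read and reference
                for k in range(n):
                    counts.setdefault(ref_pos + k, Counter())[seq[read_pos + k]] += 1
                ref_pos += n
                read_pos += n
            elif op in "IS":                  # insertion or soft clip: consume read only
                read_pos += n
            elif op in "DN":                  # deletion or skip: consume reference only
                ref_pos += n
            # H (hard clip) and P (padding) consume neither
    return counts


def consensus(reference: str, counts: Dict[int, Counter], min_depth: int = 1) -> str:
    """Majority base at each reference position, or 'N' below min_depth."""
    out = []
    for i, _ref_base in enumerate(reference, start=1):
        column = counts.get(i)
        if column is None or sum(column.values()) < min_depth:
            out.append("N")
        else:
            out.append(column.most_common(1)[0][0])
    return "".join(out)
```

Raising `min_depth` makes the consensus more conservative in poorly covered regions, which is relevant when the chosen reference is distant from the sequenced virus.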

Variant calling

NGS technologies provide a higher depth of sequencing of viral species, which enables the detection of sequence variants present at low frequencies within the aligned reads, such as those that enhance virus evolution or confer drug resistance. For instance, NGS technologies have been used to examine high-pathogenicity avian influenza (HPAI) polymorphisms in low-pathogenicity individuals [147]. However, in viral genome populations it is difficult to distinguish real low-frequency nucleotide variants from errors introduced during sample preparation (e.g. in viral reverse transcription-polymerase chain reaction (RT-PCR)) and sequencing [148]. Other errors, such as base miscalls, arise from the sequencing platforms themselves [151].

Several bioinformatics programs are available for calling variants, at all frequencies, from viral populations: ShoRAH [149], V-Phaser 2 [150], LoFreq [151], deepSNV [152], RVD [153-154] and Geneious, each of which takes into account the sources from which errors may arise. LoFreq is a fast and sensitive variant caller that uses read quality scores to model base miscalls [151] and detects strand-biased variants. LoFreq can call variants at frequencies below the average sequencing error rate. V-Phaser is another tool for discovering variants in genetically heterogeneous populations; it identifies variants at lower frequencies by using information on the co-location of variants on sample reads [150, 155], and can identify rare variants in diverse populations at frequencies of less than 1%. However, these programs rely on specific assumptions and, in addition, do not take into account errors introduced by the RT or PCR processes [148]. Alternative approaches modify the standard protocols for variant calling, for example circular re-sequencing [156] and the incorporation of unique barcodes into individual DNA sequences [157-158]. These approaches have been used frequently for viral genome reads and virus fitness analyses.
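The general principle of separating low-frequency variants from sequencing errors using quality scores can be sketched as a simple statistical test. The Python example below is a deliberately simplified binomial test under stated assumptions (one mean Phred quality per site, independent errors); it is not the model implemented in LoFreq, V-Phaser or any other tool named above, and the function names and default thresholds (`mean_q`, `alpha`) are hypothetical.

```python
from math import exp, lgamma, log


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score into a base-call error probability."""
    return 10 ** (-q / 10)


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space; needs 0 < p < 1."""
    def log_pmf(i: int) -> float:
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))
    return min(1.0, sum(exp(log_pmf(i)) for i in range(k, n + 1)))


def is_variant(alt_count: int, depth: int, mean_q: float = 30,
               alpha: float = 0.01) -> bool:
    """Call a variant if alt_count non-reference bases among depth reads are
    unlikely to be explained by miscalls at the given mean quality."""
    if mean_q <= 0:
        raise ValueError("mean_q must be positive")
    error = phred_to_error(mean_q)            # assumed per-base miscall probability
    return binomial_tail(alt_count, depth, error) < alpha
```

At a depth of 1,000 and a mean quality of 30 (about one miscall expected per 1,000 bases), 30 supporting reads are called as a variant whereas 2 supporting reads are not. Errors introduced before sequencing, such as those from RT-PCR, are invisible to a test of this kind because they carry normal quality scores.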

Unfortunately, variant callers detect mutations across a single sample's genome (its mapped NGS reads) and do not determine which mutations occur together across different sample genomes. In other words, these tools cannot identify mutations that arise at the same positions across different sample genomes and are associated with biological properties such as the sample haplotype. In Chapter 2, I performed a bioinformatics analysis to identify those mutations (known as viral escape mutations) and the associated sample haplotypes, for both single mutations and groups of mutations, based on a 10-amino-acid window.

Geneious software

Two important functions of bioinformatics analysis are the organisation and the analysis of biological data through computational processes and software. Geneious is an integrated, cross-platform bioinformatics software package that is widely used in biological research. It was developed by Biomatters Ltd (www.biomatters.com) for working with biological data such as DNA, RNA and protein sequences. It is used for many types of analysis, including evolutionary relationships and phylogenetic analysis, 3D structure information, read mapping, variant calling, sequence alignment, contig assembly, primer design, genome annotation, restriction analysis, access to the NCBI and UniProt databases, BLAST, protein structure viewing, automated PubMed searching, and more. One of its major benefits is its user-friendly interface. For many bioinformatics analyses Geneious offers several different algorithms, enabling users to compare results from different approaches. For example, Geneious provides a number of heuristic approaches for aligning nucleic acid or protein sequences, such as ClustalW [159], MUSCLE [160], MAFFT [161], Mauve [162], the Geneious aligner and more.

For assembly and mapping, the Geneious assemblers can handle data from Sanger and NGS machines, and provide parameter settings intended to minimise errors arising from the different platforms. The Geneious assemblers work well with Sanger, Illumina, Ion Torrent, 454 and PacBio CCS data, and multiple parameters are available to improve the performance of mapping and assembly. With multiple options for mapping DNA/RNA-seq reads (BWA (0.6.2-r126) [163], Bowtie 1 and 2 (2.0.0) [164], SMALT (0.6.4) (www.sanger.ac.uk/science/tools/smalt-0), SOAP2 (2.20) [165], Geneious (6.0.3) and more), users can perform high-quality mapping of both short and long reads. The software also has options for read quality control, read trimming, and checking for and removing duplicate and chimeric reads, with different parameters for different sequencing platforms. Geneious also provides options and visualisation for finding SNPs and variants. Variant-finding parameters can be configured to report only variants that occur above a minimum threshold, in order to exclude disagreements due to read errors. Additional parameters restrict the search to disagreements in coding or non-coding regions, which helps users to analyse the impact of polymorphisms on protein translation and to quickly recognise synonymous and nonsynonymous polymorphisms. Geneious can also perform a statistical test to calculate P values for SNPs and filter SNPs by a specified maximum P value. Geneious also offers good visualisation of phylogenetic trees and two inbuilt methods, Neighbour-joining [166] and UPGMA [167], for generating them. More complex phylogenetic methods and programs, such as Maximum Likelihood, Bayesian MCMC, MrBayes, PhyML, GARLI, RAxML, FastTree and PAUP, are available as plugins for professional users. I used Geneious software to analyse NGS data for Chapter 1 of this thesis. For each of the NGS data analyses above, I tested different methods and parameters in Geneious to obtain a sensitive and conservative result. I also checked some of the Geneious results against other bioinformatics tools.

Aim of this thesis

One potential strategy for reducing the spread of HIV infection is to design cytotoxic T-lymphocyte (CTL)-specific HIV vaccines and antiretroviral therapies. Many researchers have studied [68, 168-171] the effects of major histocompatibility complex class I (MHC-I) molecules, which present peptide fragments of pathogens on the cell surface to T-cells, on the immune control of HIV and SIV disease. CTLs, in turn, have a major role in controlling infections and in mediating adaptive immunity [168-170, 172, 173]. CTLs select for HIV and SIV escape variants, and it is known that SIV mutates to escape recognition by MHC-I, which binds to and presents SIV epitopes. Patterns of CTL escape mutations have been studied in HIV-1-infected humans, but they are poorly defined in the animal models of HIV on which much HIV research is centred. There is therefore a need to develop an analytical approach to identify potential SIV-specific CTL epitopes along with their MHC-I restriction.

APOBEC3 proteins induce G-to-A mutations in the positive strand of the HIV genome during reverse transcription. This process is referred to as ‘hypermutation’ and the affected sequences are known as hypermutated sequences [96, 97, 100, 109, 112, 113, 174-176]. Most recent approaches for identifying hypermutated sequences rely on alignment-based methods [114, 177-180]. A major problem with alignment-based methods is that indels (insertions/deletions) can result in genomic sites being incorrectly identified or ignored in the analysis of hypermutation. It is therefore important to propose a new method for detecting hypermutated sequences that accounts for indels in the aligned sequences. There is also a need to develop and apply a quantitative approach to find the motif preferences of each of the APOBEC enzymes.
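The following Python sketch illustrates, in a simplified form, how an alignment-based analysis counts G-to-A changes by dinucleotide context and why gaps need explicit handling. It is not the G2A3 method developed in this thesis nor the procedure of any published tool; the function names and the choice to take the context from the next non-gap reference base are assumptions made for illustration.

```python
from collections import Counter
from typing import Tuple

GAP = "-"


def next_base(seq: str, start: int) -> str:
    """Return the first non-gap character at or after start, or '' if none."""
    for ch in seq[start:]:
        if ch != GAP:
            return ch
    return ""


def g_to_a_by_context(ref_aln: str, query_aln: str) -> Tuple[Counter, Counter]:
    """Count reference G sites and G-to-A changes per dinucleotide context.

    Both arguments are rows of the same pairwise alignment. The context of a
    site is the reference G plus the next non-gap reference base; sites where
    the query has a gap are skipped because no substitution can be scored.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    ref, query = ref_aln.upper(), query_aln.upper()
    sites, mutated = Counter(), Counter()
    for i, (r, q) in enumerate(zip(ref, query)):
        if r != "G" or q == GAP:
            continue
        downstream = next_base(ref, i + 1)
        if not downstream:
            continue                          # G at the very end has no context
        context = "G" + downstream
        sites[context] += 1
        if q == "A":
            mutated[context] += 1
    return sites, mutated
```

If the gap column were not skipped when looking up the downstream base, the G in `G-GT` would be assigned the context `G-` rather than `GG`, which is the kind of misassignment of target and non-target sites that indels can cause.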

It is well established that the dinucleotide CpG is underrepresented in the HIV genome, but the source of this depletion is not well characterised [181, 182]. In particular, the sequence context of CpG depletion has not yet been studied in detail and could provide valuable insights into the mechanisms contributing to CpG depletion in the HIV genome.
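CpG under-representation is commonly quantified as the ratio of the observed number of CpG dinucleotides to the number expected from the C and G content of the sequence. The short Python sketch below computes this ratio and tallies the bases flanking each CpG as one simple way of describing sequence context. It is an illustrative example only, not the analysis performed in Chapter 5, and the function names are hypothetical.

```python
from collections import Counter


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return float("nan")                   # ratio undefined without both C and G
    return seq.count("CG") * len(seq) / (c * g)


def cpg_flanking_contexts(seq: str) -> Counter:
    """Count each CpG together with its 5' and 3' neighbours (pattern NCGN)."""
    seq = seq.upper()
    return Counter(seq[i - 1:i + 3]
                   for i in range(1, len(seq) - 2)
                   if seq[i:i + 2] == "CG")
```

A ratio close to 1 indicates that CpG occurs about as often as expected from base composition, while values well below 1 indicate depletion. Comparing the flanking-context counts with those expected by chance is one possible starting point for asking whether depletion depends on sequence context.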

It has been shown that the insertion of HIV DNA into the host chromosome is not random. Previous studies have reported that HIV preferentially targets some genes multiple times [183-190]. However, the reasons behind these preferential sites of HIV integration are not yet well understood. Investigating the differences between genes that are preferentially targeted by HIV and other genes may provide valuable insights into the mechanisms contributing to this process.

The aim of this thesis is to develop and apply computational and bioinformatic methods to investigate the above-mentioned key questions about HIV.

The aims of the studies presented in this thesis are:

Aim 1. MHC class I alleles affect viral escape and diversity in the SIV-infected macaque model, but this interaction is under-studied and not well understood. In Chapter 2, I developed a bioinformatics analysis linking MHC-I haplotypes and SIV sequences to identify possible novel cytotoxic T-lymphocyte (CTL) epitopes and their associated patterns of escape mutations, using data from forty-four pigtailed macaques.

Aim 2. The human genome encodes a family of editing enzymes known as APOBEC3 (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3). They induce context-dependent G-to-A changes, referred to as ‘hypermutation’, in the genomes of viruses such as HIV, SIV and HBV, and of endogenous retroviruses. Hypermutation is characterised by aligning affected sequences to a reference sequence. I show that indels (insertions/deletions) in the sequences lead to incorrect assignment of APOBEC3 target and non-target sites. This can result in incorrect identification of hypermutated sequences and in erroneous biological inferences based on hypermutation analysis. To address these types of errors, I develop a method (G2A3) that correctly identifies hypermutation signatures.

Aim 3. The human genome encodes seven APOBEC3 genes, the products of some of which are known to induce G-to-A mutations in the HIV genome. Mutation by these enzymes is sequence-context dependent. Studies so far have usually identified targeted motifs using in vitro or ex vivo experiments, and usually within short sequence fragments. Here I perform a comprehensive analysis to investigate the motif preference and possible sequence hierarchy of mutation by APOBEC3 enzymes using a large number of full-genome HIV-1 sequences from a wide range of naturally infected patients. I develop a data matrix decomposition approach to discriminate between normal and hypermutated sequences in terms of the representation of mono- to tetra-nucleotide motifs. This allows the identification of motifs associated with hypermutation by different APOBEC3 enzymes.

Aim 4. The genome of human immunodeficiency virus (HIV) is highly depleted of CpG (cytosine followed by guanine) dinucleotides. Two hypotheses can be postulated to explain CpG depletion in HIV: A) CpG methylation-induced transcriptional silencing and B) CpG recognition by Toll-like receptors (TLRs). Here I investigate these two hypotheses by determining the sequence context dependency of CpG depletion and methylation.

Aim 5. HIV is a retrovirus that can infect billions of cells each day. Integration of HIV DNA into the host genome is an important step in the HIV life cycle. The inserted viral genome can persist in patients for many years. There is substantial evidence that integration of retroviral genomes into the host DNA is not random. Previous studies have shown that HIV preferentially integrates into transcriptionally active genes and mainly into introns. Recent studies demonstrated that a single infection event can produce multiple latently infected cells. It was also reported that HIV preferentially integrated into cancer genes in expanded clones of cells or genes related to cell proliferation. In Chapter 6, I considered gene expression to further investigate the relationship between cell proliferation, gene expression and the frequency and pattern of HIV integration into the human genome.

In this thesis, I applied bioinformatics, bio-data mining and statistical techniques to analyse a large amount of relevant biological data.

The rest of this thesis is organised into the following chapters:

Chapter 2 presents a linkage analysis between major histocompatibility complex class I (MHC-I) haplotypes and SIV sequences in macaques infected with SIV. In Chapter 3, I discuss hypermutation and my proposed method for detecting hypermutated sequences. In Chapter 4, I develop a data matrix decomposition approach to discriminate between normal and hypermutated sequences in terms of the representation of mono- to tetra-nucleotide motifs. Chapter 5 reports the methods, analyses and results concerning the source of CpG depletion in the HIV genome.

In Chapter 6, I investigate the relationship between cell proliferation, gene expression and the frequency and patterns of HIV integration into the human genome.

Finally, Chapter 7 summarises the findings of the study and outlines future work.

References

1. Malani, P.N., Mandell, Douglas, and Bennett’s Principles and Practice of Infectious Diseases. JAMA, 2010. 304(18): p. 2067-71.

2. Bennett, J.E., et al., Mandell, Douglas, and Bennett's principles and practice of infectious diseases. 2010, Philadelphia: Churchill Livingstone/Elsevier.

3. Barre-Sinoussi, F., et al., Isolation of a T-lymphotropic retrovirus from a patient at risk for acquired immune deficiency syndrome (AIDS). Science, 1983. 220(4599): p. 868-71.

4. Control, C.f.D., Update on acquired immune deficiency syndrome (AIDS) among patients with hemophilia A. MMWR Morb Mortal Wkly Rep, 1982. 31(48): p. 644.

5. Shum, K.T., et al., Aptamer-based therapeutics: new approaches to combat human viral diseases. Pharmaceuticals (Basel), 2013. 6(12): p. 1507-42.

6. Gallo, R., et al., HIV/HTLV gene nomenclature. Nature, 1988. 333(6173): p. 504.

7. Muesing, M.A., et al., Nucleic acid structure and expression of the human AIDS/lymphadenopathy retrovirus. Nature, 1985. 313(6002): p. 450-8.

8. Coffin, J., et al., What to call the AIDS virus? Nature, 1986. 321(6065): p. 10.

9. Clavel, F., HIV-2, the West African AIDS virus. Aids, 1987. 1(3): p. 135-40.

10. Morison, L., The global epidemiology of HIV/AIDS. Br Med Bull, 2001. 58: p. 7-18.

11. Jamison, D., et al., Disease and Mortality in Sub-Saharan Africa. 2006, US: World Bank Publications.

12. Reeves, J.D. and R.W. Doms, Human immunodeficiency virus type 2. J Gen Virol, 2002. 83(Pt 6): p. 1253-65.

13. Vallari, A., et al., Confirmation of putative HIV-1 group P in Cameroon. J Virol, 2011. 85(3): p. 1403-7.

14. McCutchan, F.E., Understanding the genetic diversity of HIV-1. Aids, 2000. 14 (Suppl 3): p. S31-44.

15. Peeters, M. and P.M. Sharp, Genetic diversity of HIV-1: the moving target. Aids, 2000. 14 (Suppl 3): p. S129-40.

16. Bobkov, A.F., et al., Temporal trends in the HIV-1 epidemic in Russia: predominance of subtype A. J Med Virol, 2004. 74(2): p. 191-6.

17. Goudsmit, G., Viral Sex; The Nature of AIDS. 1998: Oxford University Press.

18. Eberle, J. and L. Gurtler, HIV types, groups, subtypes and recombinant forms: errors in replication, selection pressure and quasispecies. Intervirology, 2012. 55(2): p. 79-83.

19. Hemelaar, J., et al., Global and regional distribution of HIV-1 genetic subtypes and recombinants in 2004. Aids, 2006. 20(16): p. 13-23.

20. Chan, D.C. and P.S. Kim, HIV entry and its inhibition. Cell, 1998. 93(5): p. 681-4.

21. Wyatt, R. and J. Sodroski, The HIV-1 envelope glycoproteins: fusogens, antigens, and immunogens. Science, 1998. 280(5371): p. 1884-8.

22. Pulsinelli, G.A. and H.M. Temin, High rate of mismatch extension during reverse transcription in a single round of retrovirus replication. Proc Natl Acad Sci U S A, 1994. 91(20): p. 9490-4.

23. Goff, S.P., Retroviral reverse transcriptase: Synthesis, structure and function. J Acquir Immune Defic Syndr, 1990. 3(8): p. 817-31.

24. Miller, M.D., et al., Human immunodeficiency virus type 1 preintegration complexes: studies of organization and composition. J Virol, 1997. 71(7): p. 5382-90.

25. Lindenbach, B.D. and C.M. Rice, Unravelling hepatitis C virus replication from genome to function. Nature, 2005. 436(7053): p. 933-38.

26. Greene, W.C. and B.M. Peterlin, Charting HIV's remarkable voyage through the cell: Basic science as a passport to future therapy. Nat Med, 2002. 8(7): p. 673-80.

27. Jones, K.A. and B.M. Peterlin, Control of RNA initiation and elongation at the HIV-1 promoter. Annu Rev Biochem, 1994. 63: p. 717-43.

28. Pollard, V.W. and M.H. Malim, The HIV-1 Rev protein. Annu Rev Microbiol, 1998. 52: p. 491-532.

29. Ganser-Pornillos, B.K., et al., The structural biology of HIV assembly. Curr Opin Struct Biol, 2008. 18(2): p. 203-17.

30. Zhang, C., et al., Hybrid spreading mechanisms and T cell activation shape the dynamics of HIV-1 infection. PLoS Comput Biol, 2015. 11(4): p. e1004179.

31. Van Heuverswyn, F., et al., Human immunodeficiency viruses: SIV infection in wild gorillas. Nature, 2006. 444(7116): p. 164.

32. Plantier, J.C., et al., A new human immunodeficiency virus derived from gorillas. Nat Med, 2009. 15(8): p. 871-2.

33. Santiago, M.L., et al., Simian immunodeficiency virus infection in free-ranging sooty mangabeys (Cercocebus atys atys) from the Tai Forest, Cote d'Ivoire: implications for the origin of epidemic human immunodeficiency virus type 2. J Virol, 2005. 79(19): p. 12515-27.

34. Fultz, P.N., et al., Prevalence of natural infection with simian immunodeficiency virus and simian T-cell virus type I in a breeding colony of sooty mangabey monkeys. Aids, 1990. 4(7): p. 619-25.

35. Kestler, H., et al., Induction of AIDS in rhesus monkeys by molecularly cloned simian immunodeficiency virus. Science, 1990. 248(4959): p. 1109-12.

36. Feinberg, M.B. and J.P. Moore, AIDS vaccine models: challenging challenge viruses. Nat Med, 2002. 8(3): p. 207-10.

37. Takehisa, J., et al., Origin and biology of simian immunodeficiency virus in wild-living western gorillas. J Virol, 2009. 83(4): p. 1635-48.

38. Wertheim, J.O. and M. Worobey, Dating the age of the SIV lineages that gave rise to HIV-1 and HIV-2. PLoS Comput Biol, 2009. 5(5): p. e1000377.

39. Pantaleo, G. and A.S. Fauci, Immunopathogenesis of HIV infection. Annu Rev Microbiol, 1996. 50: p. 825-54.

40. Guadalupe, M., et al., Severe CD4+ T-cell depletion in gut lymphoid tissue during primary human immunodeficiency virus type 1 infection and substantial delay in restoration following highly active antiretroviral therapy. J Virol, 2003. 77(21): p. 11708-17.

41. Brenchley, J.M., et al., CD4+ T cell depletion during all stages of HIV disease occurs predominantly in the gastrointestinal tract. J Exp Med, 2004. 200(6): p. 749-59.

42. Douek, D.C., et al., HIV preferentially infects HIV-specific CD4+ T cells. Nature, 2002. 417(6884): p. 95-8.

43. Clark, S.J., et al., High titers of cytopathic virus in plasma of patients with symptomatic primary HIV-1 infection. N Engl J Med, 1991. 324(14): p. 954-60.

44. McMichael, A.J., et al., The immune response during acute HIV-1 infection: clues for vaccine development. Nat Rev Immunol, 2010. 10(1): p. 11-23.

45. McMichael, A.J. and S.L. Rowland-Jones, Cellular immune responses to HIV. Nature, 2001. 410(6831): p. 980-7.

46. Streeck, H. and D.F. Nixon, T cell immunity in acute HIV-1 infection. J Infect Dis, 2010. 202 (Suppl 2): p. S302-8.

47. Marchetti, G., C. Tincati, and G. Silvestri, Microbial translocation in the pathogenesis of HIV infection and AIDS. Clin Microbiol Rev, 2013. 26(1): p. 2-18.

48. Mogensen, T.H., et al., Innate immune recognition and activation during HIV infection. Retrovirology, 2010. 7: p. 54.

49. Fevrier, M., et al., CD4+ T cell depletion in human immunodeficiency virus (HIV) infection: role of apoptosis. Viruses, 2011. 3(5): p. 586-612.

50. Okoye, A.A. and L.J. Picker, CD4(+) T-cell depletion in HIV infection: mechanisms of immunological failure. Immunol Rev, 2013. 254(1): p. 54-64.

51. Nowak, M.A., Evolutionary dynamics of HIV infections, in First European Congress of Mathematics. 1994: Paris. p. 311-26.

52. Wattal, C. and N. Khardori, Hospital Infection Prevention: Principles & Practices. 2014. Germany: Springer.

53. Ghate, M., et al., Incidence of common opportunistic infections in HIV-infected individuals in Pune, India: analysis by stages of immunosuppression represented by CD4 counts. Int J Infect Dis, 2009. 13(1): p. -8.

54. Cohen, M.S., et al., The detection of acute HIV infection. J Infect Dis, 2010. 202 (Suppl 2): p. S270-7.

55. Weber, J., The pathogenesis of HIV-1 infection. Br Med Bull, 2001. 58: p. 61-72.

56. Fauci, A.S. and R.C. Desrosiers, Pathogenesis of HIV and SIV, in Retroviruses, J.M. Coffin, S.H. Hughes, and H.E. Varmus, Editors. 1997, Cold Spring Harbor Laboratory Press: Cold Spring Harbor (NY).

57. Deacon, N.J., et al., Genomic structure of an attenuated quasi species of HIV-1 from a blood transfusion donor and recipients. Science, 1995. 270(5238): p. 988-91.

58. Ho, D.D., et al., Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature, 1995. 373(6510): p. 123-6.

59. Lewis, G.K., Role of Fc-mediated antibody function in protective immunity against HIV-1. Immunology, 2014. 142(1): p. 46-57.

60. Bosinger, S.E., et al., Gene expression profiling of host response in models of acute HIV infection. J Immunol, 2004. 173(11): p. 6858-63.

61. Iwasaki, A., Innate immune recognition of HIV-1. Immunity, 2012. 37(3): p. 389-98.

62. Schoenborn, J.R. and C.B. Wilson, Regulation of interferon-gamma during innate and adaptive immune responses. Adv Immunol, 2007. 96: p. 41-101.

63. Alberts, B., et al., Molecular Biology of the Cell. 2007, USA: Garland Science.

64. Abbas, A.K., et al., Cellular and molecular immunology. 2006, Saunders. p. 560.

65. Mauri, C. and A. Bosma, Immune regulatory function of B cells. Annu Rev Immunol, 2012. 30: p. 221-41.

66. Schmitz, J.E., et al., Control of viremia in simian immunodeficiency virus infection by CD8+ lymphocytes. Science, 1999. 283(5403): p. 857-60.

67. Coplan, P.M., et al., Cross-reactivity of anti-HIV-1 T cell immune responses among the major HIV-1 clades in HIV-1-positive individuals from 4 continents. J Infect Dis, 2005. 191(9): p. 1427-34.

68. Borrow, P., et al., Antiviral pressure exerted by HIV-1-specific cytotoxic T lymphocytes (CTLs) during primary infection demonstrated by rapid selection of CTL escape virus. Nat Med, 1997. 3(2): p. 205-11.

69. Phillips, R.E., et al., Human immunodeficiency virus genetic variation that can escape cytotoxic T cell recognition. Nature, 1991. 354(6353): p. 453-9.

70. Goulder, P.J. and D.I. Watkins, Impact of MHC class I diversity on immune control of immunodeficiency virus replication. Nat Rev Immunol, 2008. 8(8): p. 619-30.

71. Dimmock, N., A. Easton, and K. Leppard, Introduction to Modern Virology. 2007, Wiley-Blackwell: USA. p. 536.

72. Furutsuki, T., et al., Frequent transmission of cytotoxic-T-lymphocyte escape mutants of human immunodeficiency virus type 1 in the highly HLA-A24-positive Japanese population. J Virol, 2004. 78(16): p. 8437-45.

73. Bowen, D.G. and C.M. Walker, Adaptive immune responses in acute and chronic hepatitis C virus infection. Nature, 2005. 436(7053): p. 946-52.

74. Jessen, H. and H. Jaeger, Primary HIV Infection: Pathology, Diagnosis, Management. 2005: Georg Thieme Verlag.

75. Kelleher, A.D., et al., Clustered mutations in HIV-1 gag are consistently required for escape from HLA-B27-restricted cytotoxic T lymphocyte responses. J Exp Med, 2001. 193(3): p. 375-86.

76. College, M.T. Model of HIV Latent infection and active infection. 2014 [cited 2015; Disorders Associated With The Immune System]. Available from: www.classes.midlandstech.edu/carterp/Courses/bio225/chap19/lecture8.htm.

77. Robinson, H.L. and R.R. Amara, T cell vaccines for microbial infections. Nat Med, 2005. 11(4 Suppl): p. S25-32.

78. Korber, B.T., et al., T-cell vaccine strategies for human immunodeficiency virus, the virus with a thousand faces. J Virol, 2009. 83(17): p. 8300-14.

79. Gilks, C.F., et al., The WHO public-health approach to antiretroviral treatment against HIV in resource-limited settings. Lancet, 2006. 368(9534): p. 505-10.

80. Siliciano, J.D., et al., Long-term follow-up studies confirm the stability of the latent reservoir for HIV-1 in resting CD4+ T cells. Nat Med, 2003. 9(6): p. 727-8.

81. Katlama, C., et al., Barriers to a cure for HIV: new ways to target and eradicate HIV-1 reservoirs. Lancet, 2013. 381(9883): p. 2109-17.

82. Finzi, D., et al., Latent infection of CD4+ T cells provides a mechanism for lifelong persistence of HIV-1, even in patients on effective combination therapy. Nat Med, 1999. 5(5): p. 512-7.

83. Chun, T.W., et al., Re-emergence of HIV after stopping therapy. Nature, 1999. 401(6756): p. 874-5.

84. Sheehy, A.M., et al., Isolation of a human gene that inhibits HIV-1 infection and is suppressed by the viral Vif protein. Nature, 2002. 418(6898): p. 646-50.

85. Prochnow, C., et al., The APOBEC-2 crystal structure and functional implications for the deaminase AID. Nature, 2007. 445(7126): p. 447-51.

86. Wedekind, J.E., et al., Messenger RNA editing in mammals: new members of the APOBEC family seeking roles in the family business. Trends Genet, 2003. 19(4): p. 207-16.

87. Vasudevan, A.A., et al., Structural features of antiviral DNA cytidine deaminases. Biol Chem, 2013. 394(11): p. 1357-70.

88. Jarmuz, A., et al., An anthropoid-specific locus of orphan C to U RNA-editing enzymes on chromosome 22. Genomics, 2002. 79(3): p. 285-96.

89. NCBI, NCBI Sequence Viewer 3. 2015 [cited 2015]; Available from: www.ncbi.nlm.nih.gov/projects/sviewer.

90. Xu, H., et al., Stoichiometry of the antiviral protein APOBEC3G in HIV-1 virions. Virology, 2007. 360(2): p. 247-56.

91. Strebel, K. and M.A. Khan, APOBEC3G encapsidation into HIV-1 virions: which RNA is it? Retrovirology, 2008. 5: p. 55.

92. Stopak, K., et al., HIV-1 Vif blocks the antiviral activity of APOBEC3G by impairing both its translation and intracellular stability. Mol Cell, 2003. 12(3): p. 591-601.

93. Sawyer, S.L., M. Emerman, and H.S. Malik, Ancient adaptive evolution of the primate antiviral DNA-editing enzyme APOBEC3G. PLoS Biol, 2004. 2(9): p. E275.

94. Newman, E.N., et al., Antiviral function of APOBEC3G can be dissociated from cytidine deaminase activity. Curr Biol, 2005. 15(2): p. 166-70.

95. Browne, E.P., et al., Restriction of HIV-1 by APOBEC3G is cytidine deaminase-dependent. Virology, 2009. 387(2): p. 313-21.

96. Harris, R.S. and M.T. Liddament, Retroviral restriction by APOBEC proteins. Nat Rev Immunol, 2004. 4(11): p. 868-77.

97. Holmes, R.K., et al., APOBEC-mediated viral restriction: not simply editing? Trends Biochem Sci, 2007. 32(3): p. 118-28.

98. Aguiar, R.S. and B.M. Peterlin, APOBEC3 proteins and reverse transcription. Virus Res, 2008. 134(1-2): p. 74-85.

99. Bishop, K.N., et al., Antiviral potency of APOBEC proteins does not correlate with cytidine deamination. J Virol, 2006. 80(17): p. 8450-8.

100. Holmes, R.K., et al., APOBEC3F can inhibit the accumulation of HIV-1 reverse transcription products in the absence of hypermutation. Comparisons with APOBEC3G. J Biol Chem, 2007. 282(4): p. 2587-95.

101. Mbisa, J.L., et al., Human immunodeficiency virus type 1 cDNAs produced in the presence of APOBEC3G exhibit defects in plus-strand DNA transfer and integration. J Virol, 2007. 81(13): p. 7099-110.

102. Luo, K., et al., Cytidine deaminases APOBEC3G and APOBEC3F interact with human immunodeficiency virus type 1 integrase and inhibit proviral DNA formation. J Virol, 2007. 81(13): p. 7238-48.

103. Marin, M., et al., HIV-1 Vif protein binds the editing enzyme APOBEC3G and induces its degradation. Nat Med, 2003. 9(11): p. 1398-403.

104. Luo, K., et al., Primate lentiviral virion infectivity factors are substrate receptors that assemble with cullin 5-E3 ligase through a HCCH motif to suppress APOBEC3G. Proc Natl Acad Sci U S A, 2005. 102(32): p. 11444-9.

105. Bogerd, H.P., et al., A single amino acid difference in the host APOBEC3G protein controls the primate species specificity of HIV type 1 virion infectivity factor. Proc Natl Acad Sci U S A, 2004. 101(11): p. 3770-4.

106. Cullen, B.R., Role and mechanism of action of the APOBEC3 family of antiretroviral resistance factors. J Virol, 2006. 80(3): p. 1067-76.

107. Alce, T.M. and W. Popik, APOBEC3G is incorporated into virus-like particles by a direct interaction with HIV-1 Gag nucleocapsid protein. J Biol Chem, 2004. 279(33): p. 34083-6.

108. Harris, R.S., et al., DNA deamination mediates innate immunity to retroviral infection. Cell, 2003. 113(6): p. 803-9.

109. Mangeat, B., et al., Broad antiretroviral defence by human APOBEC3G through lethal editing of nascent reverse transcripts. Nature, 2003. 424(6944): p. 99-103.

110. Miyagi, E., et al., Enzymatically active APOBEC3G is required for efficient inhibition of human immunodeficiency virus type 1. J Virol, 2007. 81(24): p. 13346-53.

111. Zhang, H., et al., The cytidine deaminase CEM15 induces hypermutation in newly synthesized HIV-1 DNA. Nature, 2003. 424(6944): p. 94-8.

112. Chiu, Y.L. and W.C. Greene, The APOBEC3 cytidine deaminases: an innate defensive network opposing exogenous retroviruses and endogenous retroelements. Annu Rev Immunol, 2008. 26: p. 317-53.

113. Liddament, M.T., et al., APOBEC3F properties and hypermutation preferences indicate activity against HIV-1 in vivo. Curr Biol, 2004. 14(15): p. 1385-91.

114. Pathak, V.K. and H.M. Temin, Broad spectrum of in vivo forward mutations, hypermutations, and mutational hotspots in a retroviral shuttle vector after a single replication cycle: deletions and deletions with insertions. Proc Natl Acad Sci U S A, 1990. 87(16): p. 6024-8.

115. Mariani, R., et al., Species-specific exclusion of APOBEC3G from HIV-1 virions by Vif. Cell, 2003. 114(1): p. 21-31.

116. Land, A.M., et al., Human immunodeficiency virus (HIV) type 1 proviral hypermutation correlates with CD4 count in HIV-infected women from Kenya. J Virol, 2008. 82(16): p. 8172-82.

117. Sanger, F. and A.R. Coulson, A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol, 1975. 94(3): p. 441-8.

118. Sanger, F., et al., DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A, 1977. 74(12): p. 5463-7.

119. Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2005. 437(7057): p. 376-80.

120. Ronaghi, M., et al., Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem, 1996. 242(1): p. 84-9.

121. Flusberg, B.A., et al., Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods, 2010. 7(6): p. 461-5.

122. Liu, L., et al., Comparison of next-generation sequencing systems. BioMed Res Int, 2012. 251364.

123. Eid, J., et al., Real-time DNA sequencing from single polymerase molecules. Science, 2009. 323(5910): p. 133-8.

124. Rhoads, A. and K.F. Au, PacBio sequencing and its applications. Genomics Proteomics Bioinformatics, 2015. 13(5): p. 278-89.

125. Clarke, J., et al., Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol, 2009. 4(4): p. 265-70.

126. Branton, D., et al., The potential and challenges of nanopore sequencing. Nat Biotechnol, 2008. 26(10): p. 1146-53.

127. Martin, M., Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet Journal, 2011. 17(1): p. 10.

128. Bolger, A.M., et al., Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 2014. 30(15): p. 2114-20.

129. Schmieder, R. and R. Edwards, Quality control and preprocessing of metagenomic datasets. Bioinformatics, 2011. 27(6): p. 863-4.

130. Henn, M.R., et al., Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection. PLoS Pathog, 2012. 8(3): e1002529.

131. Salmela, L. and J. Schroder, Correcting errors in short reads by multiple alignments. Bioinformatics, 2011. 27(11): p. 1455-61.

132. Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2005. 437(7057): p. 376-80.

133. Jerome, M., et al., Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool. BMC Res Notes, 2011. 4(1): p. 149-58.

134. Loman, N.J. and A.R. Quinlan, Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics, 2014. 30(23): p. 3399-401.

135. Lee, W.P., et al., MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping. PLoS One, 2014. 9(3): p. e90581.

136. Lunter, G. and M. Goodson, Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res, 2011. 21(6): p. 936-39.

137. Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60.

138. Langmead, B., et al., Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 2009. 10(3): p. R25.

139. Chaisson, M.J. and G. Tesler, Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics, 2012. 13(1): p. 238.

140. Frith, M.C., et al., Parameters for accurate genome alignment. BMC Bioinformatics, 2010. 11(1): p. 80.

141. Li, H., et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078-79.

142. Milne, I., et al., Using Tablet for visual exploration of second-generation sequencing data. Brief Bioinform, 2013. 14(2): p. 193-202.

143. Bao, H., et al., MapView: visualization of short reads alignment on a desktop computer. Bioinformatics, 2009. 25(12): p. 1554-55.

144. Thorvaldsdottir, H., et al., Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform, 2013. 14(2): p. 178-192.

145. Danecek, P., et al., The variant call format and VCFtools. Bioinformatics, 2011. 27(15): p. 2156-58.

146. Koboldt, D.C., et al., VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics, 2009. 25(17): p. 2283-85.

147. Monne, I., et al., Emergence of a highly pathogenic avian influenza virus from a low-pathogenic progenitor. J Virol, 2014. 88(8): p. 4375-88.

148. Orton, R.J., et al., Distinguishing low frequency mutations from RT-PCR and sequence errors in viral deep sequencing data. BMC Genomics, 2015. 16(1): p. 229.

149. Zagordi, O., et al., ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinformatics, 2011. 12(1): p. 119.

150. Yang, X., et al., V-Phaser 2: variant inference for viral populations. BMC Genomics, 2013. 14(1): p. 674.

151. Wilm, A., et al., LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res, 2012. 40(22): p. 11189-201.

152. Gerstung, M., et al., Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat Commun, 2012. 3: p. 811.

153. Flaherty, P., et al., Ultrasensitive detection of rare mutations using next-generation targeted resequencing. Nucleic Acids Res, 2011. gkr861.

154. Cushing, A., et al., RVD: a command-line program for ultrasensitive rare single nucleotide variant detection using targeted next-generation DNA resequencing. BMC Res Notes, 2013. 6(1): p. 206.

155. Macalalad, A.R., et al., Highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data. PLoS Comput Biol, 2012. 8(3): p. e1002417.

156. Acevedo, A., et al., Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature, 2014. 505(7485): p. 686-90.

157. Wu, N.C., et al., High-throughput profiling of influenza A virus hemagglutinin gene at single-nucleotide resolution. Sci Rep, 2014. 4: p. 4942.

158. Mangul, S., et al., Accurate viral population assembly from ultra-deep sequencing data. Bioinformatics, 2014. 30(12): p. i329-37.

159. Thompson, J.D., et al., Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics, 2002. p. 2-3.

160. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.

161. Katoh, K., et al., MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res, 2002. 30(14): p. 3059-66.

162. Darling, A.C., et al., Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res, 2004. 14(7): p. 1394-403.

163. Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60.

164. Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat Methods, 2012. 9(4): p. 357-9.

165. Li, R., et al., SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 2009. 25(15): p. 1966-7.

166. Saitou, N. and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987. 4(4): p. 406-25.

167. Michener, C.D. and R.R. Sokal, A quantitative approach to a problem in classification. Evolution, 1957. 1: p. 130-62.

168. Price, D.A., et al., Positive selection of HIV-1 cytotoxic T lymphocyte escape variants during primary infection. Proc Natl Acad Sci U S A, 1997. 94(5): p. 1890-95.

169. Rowland-Jones, S., et al., The role of cytotoxic T-cells in HIV infection. Dev Biol Stand, 1998. 92: p. 209-14.

170. Goulder, P.J., et al., Late escape from an immunodominant cytotoxic T-lymphocyte response associated with progression to AIDS. Nat Med, 1997. 3(2): p. 212-7.

171. Borrow, P., et al., Virus-specific CD8+ cytotoxic T-lymphocyte activity associated with control of viremia in primary human immunodeficiency virus type 1 infection. J Virol, 1994. 68(9): p. 6103-10.

172. Almeida, C.-A.M., et al., Translation of HLA–HIV Associations to the Cellular Level: HIV Adapts To Inflate CD8 T Cell Responses against Nef and HLA-Adapted Variant Epitopes. J Immunol, 2011. 187(5): p. 2502-13.

173. Almeida, C.A.M., et al., Exploiting knowledge of immune selection in HIV-1 to detect HIV-specific CD8 T-cell responses. Vaccine, 2010. 28(37): p. 6052-7.

174. Hultquist, J.F., et al., Human and rhesus APOBEC3D, APOBEC3F, APOBEC3G, and APOBEC3H demonstrate a conserved capacity to restrict Vif-deficient HIV-1. J Virol, 2011. 85(21): p. 11220-34.

175. Jern, P., et al., Role of APOBEC3 in genetic diversity among endogenous murine leukemia viruses. PLoS Genet, 2007. 3(10): p. 2014-22.

176. Armitage, A.E., et al., Conserved footprints of APOBEC3G on Hypermutated human immunodeficiency virus type 1 and human endogenous retrovirus HERV-K(HML2) sequences. J Virol, 2008. 82(17): p. 8743-61.

177. Rose, P.P. and B.T. Korber, Detecting hypermutations in viral sequences with an emphasis on G --> A hypermutation. Bioinformatics, 2000. 16(4): p. 400-1.

178. Pace, C., et al., Population level analysis of human immunodeficiency virus type 1 hypermutation and its relationship with APOBEC3G and vif genetic variation. J Virol, 2006. 80(18): p. 9259-69.

179. Oliver, A., et al., Hypermutation and the preexistence of antibiotic-resistant Pseudomonas aeruginosa mutants: implications for susceptibility testing and treatment of chronic infections. Antimicrob Agents Chemother, 2004. 48(11): p. 4226-33.

180. Ulenga, N.K., et al., The level of APOBEC3G (hA3G)-related G-to-A mutations does not correlate with viral load in HIV type 1-infected individuals. AIDS Res Hum Retroviruses, 2008. 24(10): p. 1285-90.

181. Bauer, S., et al., Human TLR9 confers responsiveness to bacterial DNA via species-specific CpG motif recognition. Proc Natl Acad Sci U S A, 2001. 98(16): p. 9237-42.

182. Martinelli, E., et al., HIV-1 gp120 inhibits TLR9-mediated activation and IFN-{alpha} secretion in plasmacytoid dendritic cells. Proc Natl Acad Sci U S A, 2007. 104(9): p. 3396-401.

183. Brady, T., et al., HIV integration site distributions in resting and activated CD4+ T cells infected in culture. AIDS, 2009. 23(12): p. 1461-71.

184. Bushman, F., et al., Genome-wide analysis of retroviral DNA integration. Nat Rev Microbiol, 2005. 3(11): p. 848-58.

185. Lewinski, M.K., et al., Genome-wide analysis of chromosomal features repressing human immunodeficiency virus transcription. J Virol, 2005. 79(11): p. 6610-9.

186. Moalic, Y., et al., Porcine endogenous retrovirus integration sites in the human genome: features in common with those of murine leukemia virus. J Virol, 2006. 80(22): p. 10980-8.

187. Maldarelli, F., et al., HIV latency. Specific HIV integration sites are linked to clonal expansion and persistence of infected cells. Science, 2014. 345(6193): p. 179-83.

188. Wagner, T.A., et al., HIV latency. Proliferation of cells with HIV integrated into cancer genes contributes to persistent infection. Science, 2014. 345(6196): p. 570-3.

189. Wang, G.P., et al., HIV integration site selection: analysis by massively parallel pyrosequencing reveals association with epigenetic modifications. Genome Res, 2007. 17(8): p. 1186-94.

190. Taruscio, D. and L. Manuelidis, Integration site preferences of endogenous retroviruses. Chromosoma, 1991. 101(3): p. 141-56.


Chapter 2:

A linkage analysis between MHC-I alleles and SIV sequences in macaques infected with SIV using an in-house bioinformatics pipeline to identify new CTL escape mutations

Publication details:

S. L. Gooneratne &, H. Alinejad-Rokny &, D. Ebrahimi, P. S. Bohn, R. W. Wiseman, D. H. O'Connor, M. P. Davenport and Stephen J. Kent. (2014). "Linking Pig-Tailed Macaque Major Histocompatibility Complex Class-I Haplotypes and Cytotoxic T Lymphocyte Escape Mutations in SIV Infection", Journal of Virology, 88(24): 14310-14325.

&: SLG and HA-R contributed equally to this project.

Author contributions to thesis Chapter 2:

SLG and SJK designed and performed the laboratory experiments. PSB, RWW and DHO performed the MHC analysis for the animals. HA-R designed and implemented the computational algorithms, and performed the bioinformatics analysis and interpretation under the supervision of DE and MPD. HA-R wrote the chapter. DE, VV and MPD read and revised the chapter. HA-R created all Tables (except Tables 2.1 and 2.2) and all Figures.

Abstract

Previous studies of major histocompatibility complex Class-I (MHC-I) have shown that MHC-I alleles have a marked effect on the diversity of human immunodeficiency virus (HIV). Yet little is known about how MHC-I haplotypes affect viral diversity in the simian immunodeficiency virus (SIV)-infected pigtailed macaque (Macaca nemestrina) model. Here, I studied forty-four SIV-infected macaques with a range of MHC-I alleles (haplotypes). I implemented an in-house bioinformatics pipeline to investigate potential associations between cytotoxic T-lymphocyte (CTL) escape mutations in SIV-infected macaques and MHC Class-I haplotypes. I performed two sets of analyses to investigate: A) the association between MHC and single point mutations in the SIV genome, and B) the association between MHC and mutations within a 30 bp window representing viral epitopes. The method successfully identified previously reported epitopes and over 70 novel SIV mutations (single mutation and window analyses) linked to common MHC Class-I alleles. I also developed a permutation analysis to test the statistical significance of the discovered MHC-linked mutations.

Introduction

Human immunodeficiency virus (HIV) and the acquired immunodeficiency syndrome (AIDS) it causes remain global health concerns. Current therapies extend the life expectancy of patients but do not provide a cure. CD8+ T-cells recognise antigenic peptides (epitopes) bound to major histocompatibility complex Class-I (MHC-I) molecules on virus-infected cells. Cytotoxic T-lymphocytes (CTLs) play an important role in infections and in mediating adaptive immunity. HIV-specific CTLs can destroy HIV-infected cells by cytolytic and/or non-cytolytic mechanisms to control viremia [1-5]. However, HIV replicates using its low-fidelity reverse transcriptase, which generates mutations in the CTL epitopes [6]. The resulting viruses can evade CTLs, and are thus known as CTL escape mutants. Successful identification of CTL epitopes through analysis of HIV-1 quasispecies in humans [7-9] has provided knowledge that is essential for the development of HIV vaccines.

Investigating the association between CTL escape mutations and MHC Class-I haplotypes could reveal HIV-1 regions that are under active CTL selection pressure. This information will help to better understand the immune correlates of protection against HIV-1 infection. In addition, mapping CTL escape and its fitness cost onto the HIV-1 genome provides resources to facilitate HIV-1 epitope discovery, which can inform the development of effective CTL-based HIV-1 vaccines. The goal of finding T-cell epitopes is to detect epitope sequences that evoke a T-cell-mediated response when presented by MHC molecules on the cell surface [10].

Nevertheless, because factors such as the founder effect [11], the timing of infection and the infecting viral strain are not always known, studying HIV escape in humans remains challenging. Several animal models are therefore used to study HIV pathogenesis [12-17]. Non-human primate models, including Indian rhesus macaques (Macaca mulatta), have been used widely to study HIV pathogenesis. The pigtailed macaque (Macaca nemestrina) is an alternative model in which simian immunodeficiency virus (SIV) disease progression is similar to that of the Indian rhesus macaque model [18]. Thus, SIV infection of pigtailed macaques is a good model of HIV pathogenesis.

It is known that SIV mutates to escape recognition by CTLs, which recognise SIV epitopes bound to, and presented by, MHC-I molecules. However, the effect of MHC-I haplotypes on disease progression in the SIV macaque models is not well characterised, and so far only a few CTL epitopes and their associated escape mutations have been identified and characterised. For example, in pigtailed macaques, the Mane-A084 allele drives viral escape at KP9 in Gag [12, 13] and at KSA10 and KVA10 in Tat [14]. The majority of these CTL mutations are in the Env protein of the SIV genome, but they have not been associated with an MHC Class-I allele [12, 13]. Furthermore, these mutations showed delayed and variable escape kinetics, and they did not adversely impact the viral replicative susceptibility [19, 20].

An array of CTL responses against SIV epitopes may play an important role in the control of viremia in macaques. This was seen in two separate studies: one compared Mauritian cynomolgus macaques that were heterozygous or homozygous for MHC Class-I alleles after infection, and the other used Mane-A084 homozygous pigtailed macaques vaccinated with whole Gag or a single CTL epitope [20, 21]. Compared with the MHC Class-I homozygous macaques, the Mauritian cynomolgus macaques that were heterozygous at MHC Class-I loci controlled viremia better [19]. This ‘heterozygous advantage’ is also observed in humans with HIV-1 infection [22]. In the case of HIV-1, it is known that CTL responses are correlated with improved viral control [23-26]. Collectively, this evidence suggests that analysing the linkage between CTL escape mutations and MHC Class-I haplotypes in pigtailed macaques will help in designing CTL vaccines that restrict CTL escape.

The purpose of the present study is to develop a bioinformatics pipeline to discover the association between CTL escape mutations and specific MHC Class-I alleles (haplotypes) using data from forty-four pigtailed macaques that were challenged with SIVmac251.

Materials and methods

In this study, forty-four pigtailed macaques were infected with SIVmac251 (GenBank accession number M19499), and the progression of infection was monitored by measuring viral load and peripheral blood CD4 T-cell counts from week 0 to weeks 23-64 (sampling mean ~ 16 weeks). The vaccinations were administered from week 4 to week 10 after infecting the animals with SIVmac251. Details of the data are shown in Table 2.1. We used 44 infected macaques from 3 different experiments. Eleven animals received an ineffective Kunjin virus vector SIV or HIV vaccine [27]. Two animals received an ineffective influenza virus vaccine that expressed KP9 [28]. Thirty-one animals were infected with SIVmac251 and then received ART from week 3 to week 10 after infection (ten animals received no vaccination, ten animals received a Gag peptide-based SIV vaccine and eleven animals received a peptide-based SIV vaccine based on overlapping epitopes spanning all nine SIV proteins) [29]. Animals were infected with the same stock of SIVmac251 (GenBank accession number M19499), as displayed in Table 2.1.

Samples

Cellular DNA and viral RNA were extracted from frozen macaque PBMCs and infected macaque plasma samples collected in EDTA vacuettes, respectively.

Whole genome amplification

The whole viral genome was amplified from viral RNA by RT-PCR in four overlapping fragments. Fragments from a single genome were pooled, and libraries were made for each macaque using the Illumina Nextera XT dual indexing kit. Libraries were sequenced on the Illumina MiSeq system, and about 2.5 million reads of SIV sequence were generated for each animal and also for the stock virus, to be used as a reference (more details are given in the paper [30]).

MHC Class-I typing

RT-PCR was used to amplify MHC alleles from cellular RNA, and macaques were genotyped using 454 Roche pyrosequencing in a Junior run 430. The MHC-I genotyping data for the animals were provided by Dr. O'Connor's laboratory at the Wisconsin National Primate Research Centre [31, 32]. Table 2.2 shows the MHC Class-I genotyping of the 44 pigtailed macaques studied.


Table 2.1: Details of infected pig-tailed macaques. Table provided by S. Gooneratne [30].

| Study | n | Vaccines | ART | Challenge | Outcome of vaccine | Ref |
|---|---|---|---|---|---|---|
| 1 | 11 | Kunjin-SIV (n=11) | No | SIVmac251 intravenous* | No effect | [27] |
| 2.1 | 10 | Peptide immunotherapy (Gag peptide-based vaccine) | Yes (wks 3-10) | SIVmac251 intravenous* | Immunotherapy effective | [29] |
| 2.2 | 11 | Peptide immunotherapy (overlapping peptide-based vaccine) | Yes (wks 3-10) | SIVmac251 intravenous* | Immunotherapy effective | [29] |
| 2.3 | 10 | Peptide immunotherapy | Yes (no vaccination) | SIVmac251 intravenous* | Immunotherapy effective | [29] |
| 3 | 2 | Influenza SIV | No | SIVmac251 intrarectal* | No effect | [28] |
| Total | 44 | | | | | |


Table 2.2: The MHC-I genotype of the macaques that were used in this study. Table provided by S. Gooneratne [30].

| Mane-A or B haplotype | # Haplotypes observed | Haplotype frequency (%) | "Diagnostic" major allele | Other major and minor alleles |
|---|---|---|---|---|
| A084 | 16 | 36.3 | A1*084 | A3*13 |
| A082 | 13 | 29.5 | A1*082 | A3*13 |
| A052 | 5 | 11.3 | A1*052 | A3*13, A5*30 |
| A019 | 5 | 11.3 | A1*019 | A2*05, A4*14 |
| A010 | 4 | 9.0 | A1*010 | A3*13, A4*14 |
| A006 | 3 | 6.8 | A1*006 | A2*05 |
| A009 | 3 | 6.8 | A1*009 | A2*05 |
| A031 | 3 | 6.8 | A1*031 | A4*14 |
| A072 | 2 | 4.5 | A1*072 | A3*13 |
| A016 | 2 | 4.5 | A1*016 | A4*14 |
| A032 | 2 | 4.5 | A1*032 | A4*14 |
| A047 | 2 | 4.5 | A1*047 | A2*05, A4*01 |
| A053 | 2 | 4.5 | A1*053 | A2*05, A4*14 |
| A066 | 2 | 4.5 | A1*066 | A3*13 |
| A083 | 2 | 4.5 | A1*083 | A2*05, A3*13 |
| A114 | 2 | 4.5 | A1*114 | A2*05, A3*13 |
| A003 | 2 | 4.5 | A1*003 | A2*05 |
| A004 | 1 | 2.2 | A1*004 | A4*14 |
| A007 | 1 | 2.2 | A1*007 | A2*05 |
| A085 | 1 | 2.2 | A1*085 | A2*05, A4*14 |
| A018 | 1 | 2.2 | A1*018 | A2*05, A4*14 |
| A unknown | 8 | 18.1 | A1 novel | |
| B118 | 13 | 29.5 | B*118 | B*122, B*027, B*030, B*082 |
| B043 | 8 | 18.1 | B*043 | B*030, B*045, B*072 |
| B015 | 8 | 18.1 | B*015 | B*068, B*064, B*088 |
| B028 | 7 | 15.9 | B*028 | B*061, B*124, B*068, B*021, B*088, B*079, B*045 |
| B016 | 6 | 13.6 | B*016 | B*041, B*089, B*088 |
| B120 | 6 | 13.6 | B*120 | B*107, B*082, B*078 |
| B047 | 6 | 13.6 | B*047 | B*068, B*112 |
| B039 | 5 | 11.3 | B*039 | B*108, B*088 |
| B069 | 4 | 9.0 | B*069 | B*081, B*107, B*082, B*079, B*051 |
| B119 | 3 | 6.8 | B*119 | B*068 |
| B150 | 3 | 6.8 | B*150 | B*068, B*057, B*089, B*072 |
| B104 | 3 | 6.8 | B*104 | B*144, B*057, B*046 |
| B004 | 2 | 4.5 | B*004 | B*004 |
| B110 | 2 | 4.5 | B*110 | B*123, B*082, B*068, B*072 |
| B017 | 2 | 4.5 | B*017 | B*060 |
| B056 | 2 | 4.5 | B*056 | B*017, B*041, B*082, B*079 |
| B052 | 2 | 4.5 | B*052 | B*058 |
| B008 | 1 | 2.2 | B*008 | B*125, B*082, B*079, B*072 |
| B019 | 1 | 2.2 | B*019 | B*014, B*057, B*116 |
| B024 | 1 | 2.2 | B*024 | B*068, B*089, B*082 |
| B099 | 1 | 2.2 | B*099 | B*058, B*072, B*054, B*063 |
| B101 | 1 | 2.2 | B*101 | B*068, B*060, B*089 |
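The frequency column of Table 2.2 appears consistent with the number of animals carrying a haplotype divided by the 44 animals studied, truncated to one decimal place. The short Python sketch below reproduces that arithmetic; the function name and the truncation rule are assumptions inferred from the tabulated values, not a method stated in the text.

```python
def haplotype_frequency_percent(n_observed, n_animals=44):
    """Percentage of animals carrying a haplotype, truncated to one decimal.

    Assumption: the denominator is the number of animals (44), not the
    number of chromosomes, and values are truncated rather than rounded.
    """
    # Integer arithmetic keeps the truncation exact.
    return (n_observed * 1000 // n_animals) / 10


for name, count in [("A084", 16), ("B118", 13), ("A010", 4)]:
    print(name, haplotype_frequency_percent(count))
```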


Data analysis (identifying associations between MHC haplotypes and viral mutations)

I developed a bioinformatics pipeline to identify MHC Class-I haplotypes that were associated with nonsynonymous mutations (compared to the reference sequence) in the SIV genome. Fig. 2.1 shows an overview of the bioinformatics pipeline.


Figure 2.1: Flowchart of the bioinformatics pipeline to identify associations between MHC-I haplotypes and mutations.


I used the software Geneious (version 7.0.2; Biomatters, Auckland, New Zealand) to trim the Illumina reads of SIV sequences obtained from infected animals, map/assemble them onto the stock reference, and call variants. As indicated in the introduction chapter, Geneious is a powerful software package for the visualisation, management and analysis of biological data. It contains the commonly used trimming, mapping and SNP-calling algorithms and has one of the best user interfaces for visualising data and performing quick, simple tests. I chose Geneious after testing a group of bioinformatics packages and command-line tools for trimming, mapping, variant calling and other related tasks, because it offers many options for these tasks. For example, for mapping, one can choose between the standard Geneious assembler, Geneious for RNAseq, or the BBMap, Bowtie or TopHat mappers. As such, I did not develop new bioinformatics tools for most parts of the pipeline.

The stock reference reads were assembled by mapping to the publicly available reference sequence (SIVmac251). Using code written in MATLAB and Visual Basic (VB Studio), I identified mutated positions that were significantly associated with MHC haplotypes. I also used GraphPad Prism 6 for plotting. A summary of my data analysis method is given below:

Import animal sequences: Import the animal reads (forward and reverse reads) into the Geneious environment.

Set paired reads: 'Paired ends' refers to the two ends of a single DNA molecule. Paired-end reads are more likely to be aligned accurately to a reference. In this study, the paired reads were first identified, and the software assembler was then used to map the reads onto a reference genome.

Sequence trimming: The “Trim Ends” option in Geneious trims poor-quality bases off the ends of the reads. Regions with more than a user-defined chance of an error per base were trimmed. Low-quality regions are identified from the error probability assigned to each base. Several settings affect the trimmed data. For example, the “Error Probability Limit” option is used to trim by quality; the default setting of 0.05 usually trims sequences with quality less than Q20 (low quality is normally defined as a confidence of 20 or less). However, the user can change the default setting to get a more conservative result. I used an error probability limit of 0.05 for trimming reads. For each read, I looked for a consecutive set of bases with quality >= 20. This resulted in an average quality >= 20 (averaged over all reads) for all animal fastq files. Trimmed reads with length > 120 bp (the average length of the consecutive set of high-quality bases per read, over all animal reads) were kept; the rest were discarded. Figs. 2.2 and 2.3 are provided as examples of quality control of our data with the FASTQC (www.bioinformatics.babraham.ac.uk/projects/fastqc) tool before and after trimming. An overview of trimmed reads is shown in Fig. 2.4.
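The relation between quality scores and error probabilities, and the read-retention rule described above, can be sketched in Python as follows. Under the standard Phred definition, Q = -10 log10(p), so Q20 corresponds to an error probability of 0.01 and an error probability of 0.05 corresponds to roughly Q13. The function names are hypothetical and the run-finding logic is only an illustration of the "consecutive bases with quality >= 20, length > 120 bp" rule; it is not the trimming algorithm that Geneious implements.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(p)


def longest_high_quality_run(quals, min_q=20):
    """Return (start, end) of the longest run of bases with quality >= min_q.

    The end index is exclusive; (0, 0) is returned if no base qualifies.
    """
    best = (0, 0)
    start = None
    # A sentinel below any valid quality closes a run that reaches the read end.
    for i, q in enumerate(list(quals) + [-1]):
        if q >= min_q:
            if start is None:
                start = i
        else:
            if start is not None and i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def keep_read(quals, min_q=20, min_len=120):
    """Keep a read only if its longest high-quality run is longer than min_len."""
    start, end = longest_high_quality_run(quals, min_q)
    return (end - start) > min_len


print(round(error_prob_to_phred(0.05), 1))  # Phred score for p = 0.05
print(longest_high_quality_run([10, 30, 30, 5, 30, 30, 30, 2]))
```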

Figure 2.2: Quality scores across all bases for animal 1335. The x-axis shows position in read; the y-axis shows quality of the position. A) Before trimming; B) After trimming.

Figure 2.3: Quality score distribution over all sequences in animal 1335. The x-axis shows mean sequence quality; the y-axis shows number of bases with that quality. A) Before trimming; B) After trimming.


Figure 2.4: An overview of the trimming process for animal 1335. Blue colours indicate good quality base pairs and gold colours show trimmed regions.


Sequence assembly: I selected the animal reads and mapped them to the stock reference with the medium sensitivity setting, iterating up to 5 times. Fig. 2.5 shows a snapshot of mapped reads.

Figure 2.5: An overview of mapped reads for animal 1335.

Single nucleotide mutation calling: There are two types of single nucleotide mutations: synonymous, which do not change the protein sequence, and nonsynonymous, which change the protein sequence. The aim of the present study was to investigate MHC-induced viral escape through nonsynonymous mutations in viral epitopes. Therefore, only nonsynonymous mutations were considered in this research.
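The distinction between synonymous and nonsynonymous substitutions can be illustrated with a small Python sketch that translates the reference and mutated codons using the standard genetic code. The function name, the 0-based in-frame coordinate convention and the toy sequence are assumptions made for illustration; in the study itself this classification was obtained from the Geneious annotations.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(coding_seq, pos, alt_base):
    """Classify a single-base substitution in an in-frame coding sequence.

    pos is a 0-based index into coding_seq (assumed to start at codon
    position 1). Returns (kind, reference amino acid, mutated amino acid).
    """
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = coding_seq[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return kind, ref_aa, alt_aa


# Toy example: ATG GGA encodes Met-Gly.
print(classify_substitution("ATGGGA", 5, "G"))  # GGA -> GGG
print(classify_substitution("ATGGGA", 3, "A"))  # GGA -> AGA
```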

Geneious has good default settings for SNP calling on different kinds of data, and many published papers have used Geneious for SNP calling, for instance references [33-38]. Nevertheless, I performed several preliminary analyses to obtain optimal parameter settings. I also took advice from our collaborators, who have used this software with the same settings in several previous projects.

Minimum coverage: This is the number of reads that cover the polymorphic region. The coverage includes both the reads containing the polymorphism and the other reads at that region. The default value is 10; however, I changed it to 55 to be more conservative and to lower the risk of calling false variants.


Minimum variant frequency: This refers to the fraction of reads that have the SNP at that position. For every SNP that Geneious finds, a certain number of reads match the reference genome and a certain number match the alternative (polymorphism) call. A minimum variant frequency of 0.05 means that, for example, at least 5 of 100 reads must carry the alternative call (5/100 = 0.05). Based on our experimental collaborators' advice, I set it to 0.05.

Maximum variant P value: The P value is the probability of a sequencing error resulting in observing bases with at least the given sum of qualities. Overall, the lower the P value, the more likely the SNP is real. To find more variants, the user needs to increase the value of this parameter; the user can thus control the risk of SNP calling by setting the “Maximum Variant P value” parameter according to the risk they want to take. I set it to 10^-6 to have a low risk of calling false variants.

Minimum strand bias: False polymorphisms due to strand bias (when sequencing errors occur only on reads in one direction) can be removed by setting a threshold for the “Minimum Strand-Bias P value” parameter. Polymorphisms with a strand-bias P value smaller than this threshold are removed from the results. I set it to 10^-6 to have a low risk of calling false variants (however, increasing or decreasing the parameter did not change the results significantly).
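Taken together, the four thresholds above act as a filter on candidate variants. The Python sketch below shows one way such a filter could be expressed. The `VariantCall` record and its field names are hypothetical, the two P values are taken as precomputed inputs (the way Geneious calculates them is not reproduced here), and treating every threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical record for one candidate variant at one position."""
    position: int
    coverage: int          # total reads covering the position
    alt_reads: int         # reads carrying the alternative base
    p_value: float         # variant P value (precomputed)
    strand_bias_p: float   # strand-bias P value (precomputed)

    @property
    def frequency(self):
        return self.alt_reads / self.coverage if self.coverage else 0.0


def passes_filters(v, min_coverage=55, min_frequency=0.05,
                   max_p_value=1e-6, min_strand_bias_p=1e-6):
    """Apply the thresholds used in this chapter to one candidate variant."""
    return (v.coverage >= min_coverage
            and v.frequency >= min_frequency
            and v.p_value <= max_p_value
            and v.strand_bias_p >= min_strand_bias_p)


calls = [
    VariantCall(position=1200, coverage=100, alt_reads=5, p_value=1e-9, strand_bias_p=0.5),
    VariantCall(position=1300, coverage=54, alt_reads=27, p_value=1e-9, strand_bias_p=0.5),
]
print([c.position for c in calls if passes_filters(c)])
```

In this toy input only the first call is retained; the second fails the coverage threshold even though half of its reads carry the variant.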

Fig. 2.6 shows a snapshot of the called mutations within a selected region of the SIV genome in animal 1335.


Figure 2.6: A snapshot of called SNPs within a selected region of animal 1335.


Method validation

I generated 1000 simulated fastq reads based on our reference sequence, assigned random qualities to each base of the reads and inserted mutations in 6% of the reads. For these data, I therefore know which bases in which reads should be trimmed and where the SNPs are. I used Geneious (with the same algorithms and parameters) for read mapping, viral sequence assembly and variant calling of the simulated data. The results are shown in Figs. 2.7, 2.8, 2.9 and 2.10. As expected, Geneious trimmed the bases that had a quality less than 20 and found the SNPs that I had inserted into some of the reads.
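A simulation of this kind can be sketched in Python as below. This is an illustrative reconstruction under stated assumptions, not the code used for the thesis: the read length of 150, the quality range of 2 to 40, the single substitution per mutated read and all function and variable names are assumptions. The truth set (which reads were mutated, and where) is returned alongside the FASTQ records so that trimming and variant calls can be checked against it.

```python
import random


def simulate_reads(reference, n_reads=1000, read_len=150,
                   mutated_fraction=0.06, seed=0):
    """Simulate FASTQ records with random qualities and known substitutions.

    Returns a list of (fastq_record, mutation) tuples, where mutation is
    None or (reference position, reference base, alternative base).
    """
    rng = random.Random(seed)
    n_mutated = round(n_reads * mutated_fraction)
    mutated_ids = set(rng.sample(range(n_reads), n_mutated))
    records = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        mutation = None
        if i in mutated_ids:
            offset = rng.randrange(read_len)
            alt = rng.choice([b for b in "ACGT" if b != bases[offset]])
            mutation = (start + offset, bases[offset], alt)
            bases[offset] = alt
        # Random per-base Phred qualities, encoded with the Phred+33 offset.
        quals = [rng.randint(2, 40) for _ in range(read_len)]
        qual_str = "".join(chr(q + 33) for q in quals)
        fastq = f"@sim_read_{i}\n{''.join(bases)}\n+\n{qual_str}\n"
        records.append((fastq, mutation))
    return records


toy_reference = "ACGT" * 100  # placeholder, not the SIVmac251 sequence
reads = simulate_reads(toy_reference)
print(len(reads), sum(1 for _, m in reads if m is not None))
```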

Figure 2.7: Quality scores across all bases for the simulated data. The x-axis shows the position in the read; the y-axis shows the quality at that position. I assigned different qualities to the positions and, in particular, assigned the lowest qualities to positions 150-151 in all simulated reads. (A) Before trimming: positions 150-151 have the lowest qualities, and Geneious should trim these positions in all reads. (B) After trimming: all low-quality positions have been trimmed; Geneious also trimmed positions 150-151 in all reads.

Figure 2.8: Quality score distribution over all sequences in the simulated data. The x-axis shows mean sequence quality; the y-axis shows the number of bases with that quality. (A) Before trimming; (B) after trimming.

Figure 2.9: An overview of the trimming process for simulated independent sequences. Blue indicates good-quality bases and gold indicates trimmed regions. Geneious trimmed all bases that had a quality below 20. It also trimmed positions 150-151 (the positions with the lowest qualities) in all reads.

Figure 2.10: An overview of the mapping of the simulated reads. I used the same parameters that I set for the real reads. Mutations were inserted into some of the simulated reads, and Geneious SNP calling successfully identified those SNPs. In the figure, G-to-A mutations were inserted into sequences 1, 2, 21, 125 and 34, and Geneious identified all of them.

Association detection approach

I exported the mutations of all animals into a file and wrote a program in VB.NET to find the associations between mutations and MHC haplotypes. The pseudo code of the algorithm is as follows:

Sub AssociationDetection (SNP dataset)
    Read the file (all animal annotation details).
    For i = 1 to number of rows
        Identify which HIV gene the mutation is in.
        Read position of mutation and identify all samples with >5% mutated reads.
        Identify codon and the amino acid of the mutated position.
        Identify 10 amino acids upstream and 10 amino acids downstream of this position for motif identification of this position.
        Apply Fisher exact test to calculate p value for associations.
    End for
End Sub

Pseudo code of my method for finding associations for each haplotype.
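The Fisher's exact test step of the pseudo code can be illustrated with a short, dependency-free Python sketch. This is not the VB.NET program used in this thesis; the function name and argument names are hypothetical. It uses the common two-tailed convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_tailed(mut_pos, total_pos, mut_neg, total_neg):
    """Two-tailed Fisher's exact P value for a 2x2 table of
    haplotype (positive/negative) vs. mutation (present/absent).

    mut_pos / total_pos: mutated and total haplotype-positive animals.
    mut_neg / total_neg: mutated and total haplotype-negative animals.
    """
    n_total = total_pos + total_neg
    n_mut = mut_pos + mut_neg
    denom = comb(n_total, n_mut)

    def prob(k):
        # Hypergeometric probability that k of the mutated animals are haplotype positive
        return comb(total_pos, k) * comb(total_neg, n_mut - k) / denom

    p_obs = prob(mut_pos)
    lo = max(0, n_mut - total_neg)
    hi = min(total_pos, n_mut)
    # Sum over all tables with the same margins that are at most as likely as the observed one
    return sum(p for p in (prob(k) for k in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
```

For 11 of 16 haplotype-positive animals and 0 of 28 haplotype-negative animals carrying a mutation, `fisher_exact_two_tailed(11, 16, 0, 28)` gives approximately 5.7 × 10⁻⁷.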

Figure 2.11: Flowchart of the point mutation algorithm used to identify the correlations between CTL escape mutations and MHC type.

Figure 2.12: Flowchart of the window analysis algorithm used to identify the correlations between CTL escape mutations and MHC type.

I used my in-house association detection algorithm to identify the correlations. The method identified all positions with a non-synonymous mutation frequency > 5% within all SIV sequences. Additionally, I identified mutations within a moving 30 bp window (1 bp increment) using in-house code written in Matlab and VB.NET. I then used Fisher's exact test to identify mutations in the SIV genome that are associated with the animal MHC haplotypes. For each association a P value was obtained. I manually checked some of the associations to validate the results. The flowcharts of the algorithm are shown in Figs. 2.11 and 2.12.
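The moving-window step can be sketched in Python as follows. This is an illustrative sketch rather than the in-house Matlab/VB.NET code: the function names, the 1-based inclusive coordinates and the rule that an animal counts as mutated when it has at least one mutation inside the window are assumptions.

```python
def window_starts(genome_length, window=30, step=1):
    """1-based start positions of all full-length windows along the genome."""
    return range(1, genome_length - window + 2, step)


def animals_mutated_in_window(mutations_by_animal, start, window=30):
    """IDs of animals with at least one mutated position in [start, start + window - 1].

    mutations_by_animal: dict mapping animal ID -> set of mutated positions.
    """
    end = start + window - 1
    return {
        animal
        for animal, positions in mutations_by_animal.items()
        if any(start <= p <= end for p in positions)
    }
```

The set returned for each window can then be split into haplotype-positive and haplotype-negative animals and tested with Fisher's exact test in the same way as a single position.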

A permutation analysis to determine statistically significant associations

Analysing the macaques’ MHC haplotypes against around 10,000 bases of the SIV genome can produce a large number of associations, many of which will be spurious. To assess whether an association is likely to be genuine (i.e. not random), I used a permutation analysis to determine the probability of obtaining an association by chance. To do this, I performed a randomisation that aimed to answer the question: “How many random associations would I expect to see, given the Mane haplotype data structure (i.e., the number of animals with the different Mane alleles) and the virus data structure (number of sites mutated)?” [30]. My permutation method does not change the structure of the Mane haplotype data or of the viral mutation data during the permutation process. For this purpose, I permuted all the Mane haplotypes of a given animal en bloc, as well as all the viral sequences from a given animal en bloc. In other words, I permuted the order of the animals (and all of the associated Mane haplotypes), leaving the order of viral sequences intact [30]. For each permutation, I randomly distributed the monkey MHC haplotypes with respect to the viral sequences and then calculated the P values of all associations using Fisher’s exact test (exactly as I did with the original non-permuted experimental data). This gave the P values that would be obtained when analysing associations between Mane haplotypes and viral polymorphisms under a random matching of the Mane haplotypes and viral sequences. Fig. 2.13 shows the permutation flowchart and Fig. 2.14 shows the matrices of animals vs. mutated positions during permutations.
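A compact Python sketch of this permutation scheme is shown below. It is an illustration, not the code used in this thesis: all names are hypothetical, the Fisher's exact test is a simple pure-Python two-tailed version, and only the best (smallest) P value per permutation is recorded. The haplotype sets are shuffled between animals en bloc while the mutation matrix is left untouched.

```python
import random
from math import comb


def fisher_two_tailed(a, n_pos, b, n_neg):
    """Two-tailed Fisher's exact P value; a/n_pos and b/n_neg are the mutated/total
    counts for haplotype-positive and haplotype-negative animals."""
    n_mut = a + b
    denom = comb(n_pos + n_neg, n_mut)
    probs = [comb(n_pos, k) * comb(n_neg, n_mut - k) / denom
             for k in range(max(0, n_mut - n_neg), min(n_pos, n_mut) + 1)]
    p_obs = comb(n_pos, a) * comb(n_neg, b) / denom
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


def best_p_value(haplotype_sets, mutation_matrix):
    """Smallest P value over all haplotype x position pairs.

    haplotype_sets[i]: set of haplotypes carried by animal i.
    mutation_matrix[i][j]: 1 if animal i is mutated at position j, else 0.
    """
    best = 1.0
    for hap in set().union(*haplotype_sets):
        carriers = [hap in s for s in haplotype_sets]
        n_pos = sum(carriers)
        n_neg = len(carriers) - n_pos
        for j in range(len(mutation_matrix[0])):
            a = sum(row[j] for row, c in zip(mutation_matrix, carriers) if c)
            b = sum(row[j] for row, c in zip(mutation_matrix, carriers) if not c)
            best = min(best, fisher_two_tailed(a, n_pos, b, n_neg))
    return best


def permutation_null(haplotype_sets, mutation_matrix, n_perm=1000, seed=0):
    """Best P value from each of n_perm random re-matchings of animals to viral data."""
    rng = random.Random(seed)
    shuffled = list(haplotype_sets)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # each animal's haplotypes move together (en bloc)
        null.append(best_p_value(shuffled, mutation_matrix))
    return null
```

Comparing the observed P values with the distribution returned by `permutation_null` indicates how small a P value can become by chance alone for a given data structure.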

Figure 2.13: Flowchart of the permutation method. In Step 1, new random numbers are generated for each permutation and the animals are reordered based on these numbers; the animals' haplotypes are therefore redistributed as well. Figs. 2.14 and 2.15 show Step 1 in more detail. In Step 2, a P value for each association is calculated using Fisher's exact test. In Steps 3 and 4, after calculating the average of the P values of the associations at a position, the 3 best P values are reported.

[Figure 2.14 panels: binary matrices of animals (rows, labelled by animal ID and ordered by sorted random number) vs. mutated positions within the whole SIV genome (columns), shown for the first permutation, the second permutation and permutation N.]

Figure 2.14: Matrix of animals vs. mutated positions in the permutation process. The number of mutated animals at each mutated position stays fixed in each permutation, but after generating new random numbers and sorting them, the order of the animals may change in each permutation.

By performing this permutation 1000 times, I could estimate the probability that an association at a given level of significance might have arisen by chance. In other words, the permutation showed at which P values random associations can arise. By comparing, for a given P value, the number of associations observed in the experimental data with the number observed in the permutation analysis, I could estimate the proportion of associations that were likely to have arisen by chance. The permutation analysis therefore enabled me to generate three P value thresholds, corresponding to likelihoods of 96% (P value ≤ 1.05 × 10⁻³), 84% (P value ≤ 3.17 × 10⁻³) and 65% (P value ≤ 9.79 × 10⁻³) that an association is real. Any association with a P value below a threshold is expected to be a ‘true’ association at the corresponding likelihood (Figs. 2.15A and 2.15B). I used these thresholds for single-point mutations and for mutations within a 30 bp window to find the expected frequencies of associations. For example, if we set the threshold of significance to 1.05 × 10⁻³, then we expect only 2 associations with a P value this small by chance. Since we see 65 associations, we assume that two of these occurred by chance and that the rest were likely ‘real’. Therefore, the probability of an association being real at this threshold is (65 − 2)/65.
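The arithmetic behind these likelihood levels can be written out as a small Python sketch. The function names are hypothetical, and the assumption that the chance expectation is the mean count per permutation of P values at or below the threshold is an illustrative reading of the description above.

```python
def expected_by_chance(p_values_per_permutation, threshold):
    """Mean number of associations per permutation with a P value <= threshold.

    p_values_per_permutation: list with one list of P values per permutation.
    """
    counts = [sum(1 for p in perm if p <= threshold) for perm in p_values_per_permutation]
    return sum(counts) / len(counts)


def proportion_real(n_observed, n_expected_by_chance):
    """Estimated fraction of observed associations that did not arise by chance."""
    if n_observed <= 0:
        raise ValueError("no observed associations at this threshold")
    return max(0.0, (n_observed - n_expected_by_chance) / n_observed)
```

With 65 observed associations and 2 expected by chance, `proportion_real(65, 2)` evaluates (65 − 2)/65, which is about 0.97.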

Figure 2.15: (A) Comparison of the best P values in the 1000 permutations and in the real data (bold black line); the best P value after 1000 permutations is 1.05 × 10⁻⁵, which occurred only once within the 1000 permutations. However, I found several associations in the results with a P value less than 1.05 × 10⁻⁵. (B) Difference in the number of significant associations between the experimental result and the permutation analysis result; blue: P value for the original results; red: P value for the permutation. At the first cut-off level (P value ≤ 1.05 × 10⁻³), I found 65 significant associations in the real data, but in the permutation results (on average) only two significant associations occurred. At the second cut-off level (P value ≤ 3.17 × 10⁻³), there are 27 significant associations, but in the permutation results (on average) only 4 significant associations occurred. At the third cut-off level (P value ≤ 9.79 × 10⁻³), I found 8 significant associations in the real data; however, there is no significant association in the permutation results (on average).

Results

Results validation

Smith et al. [13] and Mason et al. [14] have previously reported KP9, KVA10 and KSA10 as CTL epitopes in macaques with haplotype A084. To validate my informatics pipeline, I first investigated the associations for these known epitopes in macaques with and without Mane-A084. For all of these known epitopes, my analysis found a strong association (see Table 2.3). For example, at position 165 of Gag, the method showed that 11 of 16 animals with Mane-A084 had the amino acid mutation K-to-R, whereas no animal without Mane-A084 had a mutation at this position (P value 5.7 × 10⁻⁷). These findings confirmed that my method can successfully identify other potential epitopes undergoing CTL escape mutations.

Table 2.3: Comparison and validation of results with known CTL responses.

Protein | Epitope | Region of epitope within protein | No. of Mane+ Mutation+ animals / Total Mane+ animals | No. of Mane- Mutation+ animals / Total Mane- animals | P value (two-tail) | Amino acid change | Escape motif
Gag | KP9 | 164-172 | 11/16 | 0/28 | 5.7 E-07 | K -> R | KRFGAEVVP
Tat | KSA10 | 87-96 | 06/16 | 0/28 | 1.1 E-03 | A -> T | KKAKTNTSSA
Tat | KSA10 | 87-96 | 06/16 | 0/28 | 1.1 E-03 | S -> T or S -> P | KKAKANTSTA or KKAKANTSPA
Tat | KSA10 | 87-96 | 05/16 | 0/28 | 4.0 E-03 | A -> T | KKTKANTSSA
Tat | KVA10 | 114-123 | 09/16 | 0/28 | 1.6 E-05 | K -> E | EKETVEKAVA
Tat | KVA10 | 114-123 | 07/16 | 0/28 | 3.0 E-04 | K -> E | KEETVEKAVA

New MHC-linked epitope mutations

Using the bioinformatics pipeline, I searched for associations for all Mane haplotypes that were present in at least two animals. The analysis of point mutations revealed a large number of associations. As described in the methods section, I used three cut-off levels to identify real (non-random) associations. After ignoring associations with P value > 9.79 × 10⁻³ and excluding the mutations that are known to be associated with Mane-A084, my analysis revealed 46 novel amino acid mutations within 44 macaques that were associated with specific MHC Class-I haplotypes at a likelihood level of > 66% of being real associations. No novel association with a likelihood of > 96% (P value < 1.05 × 10⁻³) was identified, but 27 associations within 18 different haplotypes had P value < 3.1 × 10⁻³ (based on my permutation results, these associations are 84% likely to be real associations). A total of 19 associations had P value < 9.79 × 10⁻³. As an example, all Mane-B017 positive animals had an I-to-T mutation at positions 109 and 9568 of the SIV Nef protein. Fig. 2.16 shows one of the Mane-B028 haplotype associations in the Nef protein.

To identify potential enrichment of mutations within particular motifs, I also investigated ten amino acids upstream and downstream of the mutated positions. Details of novel associations are given in Table 2.4.

Within the significant associations, I also found negative associations. A positive association means that a significantly larger number of samples with a given Mane haplotype (for example, samples with the A084 haplotype) have a mutation at the position than samples without that haplotype. A negative association means that a significantly larger number of samples without the haplotype (for instance, samples with a non-A084 haplotype) have a mutation at the position than samples with the haplotype. For example, at position 2496 of Pol (based on whole-genome numbering), 26 of 28 Mane-A084 negative animals had the R-to-Q mutation (P value < 2.09 × 10⁻³), but only 8 of 16 Mane-A084 positive animals had this mutation at the same position. A possible interpretation of this negative correlation is that the wild-type sequence (stock sequence) carries an escape mutation relative to the original reference sequence (SIVmac251). Under this interpretation, 8 of 16 Mane-A084 positive animals maintained the escape mutation, whereas the other 8 Mane-A084 positive animals and the 26 of 28 Mane-A084 negative animals with the R-to-Q mutation have reverted to the original reference sequence, which has amino acid Q at position 2496 (because the escape mutation is less fit in the absence of immune pressure). It is worth noting, however, that in our analysis we filtered out mutations in the stock sequence; as a result, this possibility is not very likely. Future studies are needed to better understand the negative correlation between MHC alleles and mutations.

Table 2.4: New SIVmac251 mutations associated with MHC Class-I haplotypes detected by the approach proposed in this study.

Haplotype | Protein | Nucleotide position | No. of Mane+ Mut+ / Total Mane+ | No. of Mane- Mut+ / Total Mane- | P value* (two tail) | Amino acid changes | Wild-type motif
Mane-A010 | GAG | 2179 | 2/4 | 0/40 | 0.0064 | P -> S | EALKEALAPVPIPFAAAQQRG
Mane-A010 | POL | 4405 | 2/4 | 0/40 | 0.0064 | N -> S | CPTESESRLVNQIIEEMIKKS
Mane-A010 | POL | 4408 | 2/4 | 0/40 | 0.0064 | Q -> R | PTESESRLVNQIIEEMIKKSE
Mane-A010 | ENV | 7483 | 2/4 | 0/40 | 0.0064 | D -> E | NRTYIYWHGRDNRTIISLNKY
Mane-A010 | ENV | 7493 | 2/4 | 0/40 | 0.0064 | I -> V | IYWHGRDNRTIISLNKYYNLT
Mane-A010 | ENV | 8019 | 2/4 | 0/40 | 0.0064 | T -> I | VTSLIANIDWTDGNQTNITMS
Mane-A032 | GAG | 2179 | 2/2 | 0/42 | 0.0011 | P -> S | EALKEALAPVPIPFAAAQQRG
Mane-A032 | POL | 4405 | 2/2 | 0/42 | 0.0011 | N -> S | CPTESESRLVNQIIEEMIKKS
Mane-A032 | POL | 4408 | 2/2 | 0/42 | 0.0011 | Q -> R | PTESESRLVNQIIEEMIKKSE
Mane-A032 | ENV | 7483 | 2/2 | 0/42 | 0.0011 | D -> E | NRTYIYWHGRDNRTIISLNKY
Mane-A032 | ENV | 7493 | 2/2 | 0/42 | 0.0011 | I -> V | IYWHGRDNRTIISLNKYYNLT
Mane-A032 | VIF | 5863 | 2/2 | 0/42 | 0.0064 | N -> D | NPTWKQWRRDNRRGLRMAKQN
Mane-A053 | NEF | 9676 | 2/2 | 0/42 | 0.0011 | S -> Y, S -> F | RHYLMQPAQTSKWDDPWGEVL
Mane-A082 | ENV | 7001 | 5/13 | 0/31 | 0.0012 | T -> A, T -> S, T -> P | LTKSSTTTAPTTTTAASKIDM
Mane-A082 | ENV | 7033 | 8/13 | 5/31 | 0.0087 | M -> I | TTTTAASKIDMVNETSSCITH
Mane-A114 | GAG | 2477 | 2/2 | 0/42 | 0.0011 | Q -> R | PAVDLLKNYMQLGKQQREKQR
Mane-B004 | VIF | 5935 | 2/2 | 1/41 | 0.0032 | K -> E | DKQRGGKPPTKGANFPGLAKV
Mane-B004 | ENV | 6996 | 2/2 | 1/41 | 0.0032 | A -> V | WGLTKSSTTTAPTTTTAASKI
Mane-B004 | ENV | 7017 | 2/2 | 2/42 | 0.0064 | A -> G, A -> V | TTTAPTTTTAASKIDMVNETS
Mane-B015 | ENV | 9236 | 4/8 | 1/36 | 0.0024 | L -> V | RIRQGLELTLL
Mane-B015 | ENV | 7280 | 3/8 | 0/36 | 0.0042 | T -> A | QESCDKHYWDTIRFRYCAPPG
Mane-B016 | GAG | 1360 | 3/6 | 0/38 | 0.0015 | I -> V | KVKHTEEAKQIVQRHLVVETG
Mane-B016 | ENV | 6789 | 3/6 | 0/38 | 0.0015 | G -> D | WGTTQCLPDNGDYSELALNVT
Mane-B017 | NEF | 9568 | 2/2 | 0/42 | 0.0011 | I -> T | DWQDYTSGPGIRYPKTFGWLW
Mane-B028 | ENV | 7049 | 3/7 | 0/37 | 0.0027 | S -> P | SKIDMVNETSSCITHDNCTGL
Mane-B028 | ENV | 9105 | 4/7 | 3/37 | 0.0075 | Q -> R | YGWSYFQEAVQAGWRSATETL
Mane-B028 | NEF | 9105 | 4/7 | 3/37 | 0.0075 | K -> E | MGGAISRRRSKPAGDLRQRLL
Mane-B039 | ENV | 7027 | 3/5 | 2/39 | 0.0070 | I -> M | APTTTTAASKIDMVNETSSCI
Mane-B043 | VIF | 5917 | 3/8 | 0/36 | 0.0043 | G -> S | KQNSRGDKQRGGKPPTKGANF
Mane-B043 | NEF | 9402 | 5/8 | 5/36 | 0.0091 | T -> A | VPVMPRVPLRTMSYKLAIDMS
Mane-B056 | POL | 4618 | 2/2 | 1/42 | 0.0032 | R -> K | KELVFKFGLPRIVARQIVDTC
Mane-B069 | GAG | 1880 | 2/4 | 0/40 | 0.0064 | R -> K | WIQLGLQKCVRMYNPTNILDV
Mane-B069 | TAT | 8808 | 2/4 | 0/40 | 0.0064 | I -> V, I -> L | KPISNRTRHCQPE
Mane-B069 | ENV | 8804 | 2/4 | 0/40 | 0.0064 | T -> A | FSSPPSYFQQTHIQQDPALPT
Mane-B069 | ENV | 8808 | 2/4 | 0/40 | 0.0064 | H -> R, H -> P | SSPPSYFQQTHIQQDPALPTR
Mane-B104 | TAT | 6408 | 2/3 | 0/41 | 0.0032 | G -> R | DATTPESANLGEEILSQLYRP
Mane-B104 | ENV | 8019 | 2/3 | 0/41 | 0.0032 | T -> I | VTSLIANIDWTDGNQTNITMS
Mane-B118 | ENV | 8981 | 4/13 | 0/31 | 0.0053 | G -> A | FSNCRTLLSRAYQILQPILQR
Mane-B119 | POL | 2538 | 2/3 | 0/41 | 0.0032 | D -> N | RKQREALQGGDRGFAAPQFSL
Mane-B119 | ENV | 9089 | 2/3 | 0/41 | 0.0032 | F -> V | LTYLQYGWSYFQEAVQAGWRS
Mane-B119 | ENV | 9093 | 2/3 | 0/41 | 0.0032 | Q -> R | TYLQYGWSYFQEAVQAGWRSA
Mane-B119 | NEF | 9089 | 2/3 | 0/41 | 0.0032 | I -> M | MGGAISRRRSKPAGD
Mane-B119 | NEF | 9093 | 2/3 | 0/41 | 0.0032 | R -> G | MGGAISRRRSKPAGDLR
Mane-B119 | TAT | 6364 | 2/3 | 1/41 | 0.0094 | C -> Y | SLESSNERSSCISEADATTPE
Mane-B119 | ENV | 7931 | 2/3 | 1/41 | 0.0094 | V -> I | IRQIINTWHKVGKNVYLPPRE
Mane-B150 | VPR | 6055 | 2/3 | 0/41 | 0.0032 | M -> I | YLCLIQKALFMHCKKGCRCLG

* P value is uncorrected P value (Fisher’s exact test); Mut+ = mutation positive animals

Figure 2.16: An example of number and type of mutations within all macaques, with and without Mane-B028 at position 2928 in Nef protein. (P value = 0.0075, Fisher’s exact test).

New MHC-linked mutations using sliding window analysis

To identify epitopes that escape in different ways in different macaques, I clustered different mutations that occurred in the same epitope in different macaques. I used the same concept described in previous studies [39-44] to identify MHC Class-I haplotypes that are associated with SIV mutations, by applying a 30-nucleotide sliding window with 3-nucleotide increments. In this case, I only considered windows in which no single mutation had a significant association with a specific MHC-I haplotype. This approach is expected to identify the sites that were discovered using my previous method of single-point mutation analysis; however, it might have less power to identify highly mutated single sites, because the probability of random mutations in MHC-negative animals is higher when considering a window than when considering a single position. As such, I only report ‘new’ sites identified by the sliding window approach. Here, the thresholds for the probability of regions being significantly associated are different from those of the single-point mutation analysis described above. The window analysis approach identified an additional 32 unique new mutated regions (142 windows with overlapping regions) that were > 86% probable to be genuinely associated with MHC haplotypes (P value < 3.1 × 10⁻³). Table 2.5 shows all SIV-specific mutations linked to animal MHC-I haplotypes using the sliding window analysis.

As an example, 4 of 5 Mane-A019 positive animals had mutations in a window with the start position 3048 and end position 3077 (P value 1.8 × 10⁻⁴), and only 1 of 39 Mane-A019 negative animals had a mutation in this region. Fig. 2.17 shows another example of a window-based association in Gag, which was linked to the Mane-A084 haplotype.

Similar to the single-point mutation analysis results, negative associations were found in the window analysis. For example, in the window that starts at position 198 and ends at position 227 (a window within protein Nef), 13 of 28 Mane-A084 negative animals had Q-to-H mutations (P value = 7.11 × 10⁻³), but only 1 of 16 Mane-A084 positive animals had this mutation at the same position. I also found the same window-based association in Nef between positions 9657 and 9686.

Table 2.5: New SIVmac251 mutations associated with MHC Class-I haplotypes detected by the sliding window approach proposed in this study.

Haplotype | Protein | Nucleotide region start | Nucleotide region end | No. of Mane+ Mut+ / Total Mane+ | No. of Mane- Mut+ / Total Mane- | P value (two tail) | No. of positive windows for the amino acid sequence* | Amino acid sequence spanning positive sliding windows
Mane-A006 | GAG | 2182 | 2226 | 2/3 | 1/41 | 0.0094 | 6 | IPFAAAQQRGPRKPI
Mane-A006 | POL | 4047 | 4076 | 2/3 | 1/41 | 0.0094 | 1 | QWWTDYWQVT
Mane-A006 | ENV | 7538 | 7588 | 3/3 | 6/41 | 0.0063 | 8 | RPGNKTVLPVTIMSGLV
Mane-A019 | POL | 3036 | 3095 | 4/5 | 2/39 | 0.0005 | 11 | AIKKKDKNKWRMLIDFRELN
Mane-A019 | ENV | 9023 | 9091 | 4/5 | 2/39 | 0.0005 | 10 | ALQRIREVLRTELTYLSYF
Mane-A032 | TAT | 6393 | 6425 | 2/2 | 2/42 | 0.0063 | 2 | ESANLGEEILS
Mane-A052 | ENV | 6911 | 6958 | 3/5 | 2/39 | 0.0070 | 7 | KLSPLCITMRCNKSET
Mane-A082 | ENV | 7256 | 7309 | 4/13 | 0/31 | 0.0053 | 9 | SCDKHYWDTIRFRYCAPP
Mane-A084 | GAG | 2284 | 2313 | 5/16 | 0/28 | 0.0040 | 1 | RQGCWKCGKM
Mane-A084 | ENV | 8795 | 8836 | 13/16 | 7/28 | 0.0005 | 5 | FQQTHIQQDPALPT
Mane-A114 | GAG | 2446 | 2475 | 2/2 | 1/42 | 0.0032 | 1 | PAVDLLKNYM
Mane-A114 | POL | 4044 | 4073 | 2/2 | 2/42 | 0.0063 | 1 | EQWWTDYWQV
Mane-A114 | VIF | 5701 | 5748 | 2/2 | 2/42 | 0.0063 | 7 | EVRRAIRGEQLLSCCK
Mane-B017 | POL | 3426 | 3455 | 2/2 | 2/42 | 0.0063 | 1 | DRTDLEHDRV
Mane-B039 | TAT | 8862 | 8900 | 3/5 | 1/39 | 0.0029 | 4 | EKAVATAPGLGR
Mane-B043 | POL | 5277 | 5306 | 3/5 | 0/36 | 0.0042 | 1 | ILKVGTDIKV
Mane-B043 | VIF | 5533 | 5562 | 4/8 | 1/35 | 0.0023 | 1 | SHLEVQGYWH
Mane-B043 | VIF | 5548 | 5589 | 4/8 | 2/36 | 0.0065 | 5 | QGYWHLTPERGWLS
Mane-B043 | ENV | 9137 | 9190 | 5/8 | 2/36 | 0.0009 | 8 | AGAWGDLWETLRRGRWI
Mane-B047 | ENV | 8837 | 8866 | 6/6 | 14/38 | 0.0055 | 1 | REGKEGDGGE
Mane-B047 | NEF | 9279 | 9314 | 5/6 | 9/38 | 0.0089 | 3 | PWRNPAEEREKL
Mane-B052 | POL | 2751 | 2780 | 2/2 | 2/42 | 0.0063 | 1 | VLGKRIKGTI
Mane-B052 | POL | 3351 | 3413 | 2/2 | 1/42 | 0.0032 | 12 | VLEPFRKANPDVTLVQYMDDI
Mane-B052 | VIF | 5869 | 5904 | 2/2 | 2/42 | 0.0063 | 3 | RGLRMAKQNSRG
Mane-B069 | GAG | 2272 | 2304 | 2/4 | 0/40 | 0.0063 | 2 | RAPRRQGCWKC
Mane-B069 | POL | 2568 | 2603 | 2/4 | 0/40 | 0.0063 | 3 | LWRRPVVTAHIE
Mane-B069 | ENV | 6815 | 6853 | 2/4 | 0/40 | 0.0063 | 4 | VTESFDAWENTVT
Mane-B104 | | 6619 | 6657 | 3/3 | 7/41 | 0.0091 | 4 | ESAAYRHLAFKCL
Mane-B104 | ENV | 7142 | 7189 | 2/3 | 1/41 | 0.0094 | 7 | KEYNETWYSTDLVCEQ
Mane-B104 | ENV | 8291 | 8341 | 2/3 | 1/41 | 0.0094 | 8 | QQQQLLDVVKRQQELLR
Mane-B119 | POL | 3822 | 3857 | 2/3 | 1/41 | 0.0094 | 3 | PLEATVIKSQDN
Mane-B119 | ENV | 7907 | 7939 | 2/3 | 1/41 | 0.0094 | 2 | QIINTWHKVGK

* Number of overlapped windows in the same region (around a potential epitope).

Figure 2.17: An example of number and type of mutations based on the window analyses within all macaques, with and without Mane-A084, in protein Gag between positions 2284 and 2313 (P value = 0.0040, Fisher’s exact test).

Comparison with previously reported epitopes

I compared the identified potential CTL epitopes with the reported SIV and HIV epitopes using the Immune Epitope Database (www.iedb.org) [45]. Using a 70% similarity-based comparison, I identified 8 previously reported HIV epitopes and 40 SIV epitopes. Table 2.6 shows the results of this comparison.

Table 2.6: Comparison of discovered epitopes in this study with known human epitopes via an Immune Epitope Database (www.iedb.org) [45]. A 70% similarity-based threshold has been applied to find similar epitopes.

Our potential epitope sequence Region (HIV/SIV) Database epitope sequence MHC Restriction Epitope ID

EALKEALAPVPIPFAAAQQRG GAG HIV ALAPVPIPFAAAQQR Mamu-DRB*w2:01 2443 CPTESESRLVNQIIEEMIKKS POL HIV ESELVNQIIEQLIKK 14143 NRTYIYWHGRDNRTIISLNKY ENV IYWHGRDNRTIISLNKYYNLT ENV VTSLIANIDWTDGNQTNITMS ENV NRTYIYWHGRDNRTIISLNKY ENV IYWHGRDNRTIISLNKYYNLT ENV SIV RDNRTIISL Mamu-A1*011:01 53404 PTWKQWRRDNRRGLRMAKQN VIF SIV WRRDNRRGLRMAKQN Mamu-DRB1*04:06 73048 RHYLMQPAQTSKWDDPWGEVL NEF LMQPAQTSKW Mamu-B*017:04 38103 LTKSSTTTAPTTTTAASKIDM ENV TTTTAASKIDMVNETSSCITH ENV VDMVNETSSCI Mamu-A1*011:01 68028 PAVDLLKNYMQLGKQQREKQR GAG DKQRGGKPPTKGANFPGLAKV VIF SIV KPPTKGANF Mamu-A1*001:01 32840 WGLTKSSTTTAPTTTTAASKI ENV TTTAPTTTTAASKIDMVNETS ENV RIRQGLELTLL ENV QESCDKHYWDTIRFRYCAPPG ENV KVKHTEEAKQIVQRHLVVETG GAG WGTTQCLPDNGDYSELALNVT ENV SIV CLPDNGDYSEL Mamu-A1*001:01 6617 DWQDYTSGPGIRYPKTFGWLW NEF HIV TPGPGIRYPL HLA-B*07:02 232348 SKIDMVNETSSCITHDNCTGL ENV

YGWSYFQEAVQAGWRSATETL ENV MGGAISRRRSKPAGDLRQRLL NEF APTTTTAASKIDMVNETSSCI ENV KQNSRGDKQRGGKPPTKGANF VIF HIV QVPLRPMTYK HLA-A11, HLA-A2, 52760 VPVMPRVPLRTMSYKLAIDMS NEF HLA-A33 and more KELVFKFGLPRIVARQIVDTC POL SIV GLPRIVARQIV Mamu-A1*001:01 21069 HIV WIILGLNKIVRMYSPTSI HLA-DRB1*15:01, 72624 HLA-DRB1*04:01, WIQLGLQKCVRMYNPTNILDV GAG HLA-DRB1*04:05 and more KPISNRTRHCQPE TAT SIV RTRHCQPEKA Mamu-A1*002:01 56167 FSSPPSYFQQTHIQQDPALPT ENV SIV SPPSYFQQTHI Mamu-A1*001:01 60193 SSPPSYFQQTHIQQDPALPTR ENV SIV THIQQDPAL HLA-B*38:01 508610 DATTPESANLGEEILSQLYRP TAT SIV GEEILSQLY Mamu-A1*011:01 19243 VTSLIANIDWTDGNQTNITMS ENV SIV IDWIDGNQTNI Mamu-B*001:01 25703 FSNCRTLLSRAYQILQPILQR ENV SIV LSRVYQILQPILQRL Mamu-DRB*w2:01 39642 RKQREALQGGDRGFAAPQFSL POL SIV REALQGGDRGF Mamu-A1*011:01 53458 SIV YFHEAVQAVW Mamu-B*017:04, 73820 LTYLQYGWSYFQEAVQAGWRS ENV Mamu-B*017:04 MGGAISRRRSKPAGD NEF MGGAISRRRSKPAGDLR NEF SLESSNERSSCISEADATTPE TAT SIV RSSCISEADA Mamu-A1*002:01 55954 IRQIINTWHKVGKNVYLPPRE ENV SIV NTWHKVGKNVY Mamu-A1*002:01 46314 YLCLIQKALFMHCKKGCRCLG VPR SIV YLCLIQKALFMHCKK Mamu-DRB*w2:01 74584

Windows analysis epitopes QWWTDYWQVT POL SIV WEQWWTDYWQV Mamu-A1*011:01 72369 RPGNKTVLPVTIMSGLV ENV SIV KTVLPVTIMS Mamu-A1*001:01 33903 AIKKKDKNKWRMLIDFRELN POL SIV KDKNKWRMLI Mamu-A1*011:01 30189 SIV REVLRTELTYL Mamu-A1*011:01, 53662 ALQRIREVLRTELTYLSYF ENV Mamu-B*001:01 ESANLGEEILS TAT SIV TTPESANLGEE Mamu-A1*001:01 66763 KLSPLCITMRCNKSET ENV SIV CVKLSPLCITMRCNK Mamu-DRB*w2:01 7275 SCDKHYWDTIRFRYCAPP ENV SIV CDKHYWDAI Mamu-A1*011:01 6114 FQQTHIQQDPALPT ENV SIV THIQQDPAL HLA-B*38:01 508610 PAVDLLKNYM GAG SIV VDLLKNYM Mamu-A1*011:01 68010 EQWWTDYWQV POL SIV WEQWWTDYWQV Mamu-A1*011:01 72369 EVRRAIRGEQLLSCCK VIF HIV FTAGEVRRAI Mamu-A1*001:01 17905 DRTDLEHDRV POL SIV TDLEHDRVVL Mamu-B*001:01 63188 EKAVATAPGLGR TAT SIV VEKAVATAPGL Mamu-A1*011:01 68230 ILKVGTDIKV POL SIV GTDIKVVPRRKAKII Mamu-DRB*w2:01 22623 SIV SHLEVQGYW Mamu-B*017:04, 58376 SHLEVQGYWH VIF Mamu-B*017:04 SIV LTPEKGWL Mamu-A1*001:01, 40086 QGYWHLTPERGWLS VIF Mamu-A1*001:01 SIV WETLRRGGRW Mamu-A1*011:01, 72380 AGAWGDLWETLRRGRWI ENV Mamu-B*017:04 REGKEGDGGE ENV PWRNPAEEREKL NEF SIV RNPAEEKEK Mamu-A1*001:01 55039 VLGKRIKGTI POL SIV VEIEVLGKRI Mamu-A1*011:01 68213

HIV MTKILEPFR HLA-A*68:01, HLA- 42821 A*31:01, HLA- VLEPFRKANPDVTLVQYMDDI POL A*11:01 and more RGLRMAKQNSRG VIF SIV WRRDNRRGLRMAK Mamu-DRB1*04:06 73048 RAPRRQGCWKC GAG HIV AKNCRAPRKKGCWRCG HLA-A*02:01 2248 SIV QFSLWRRPVVTAHIE Mamu-DRB*w2:01, 50817 LWRRPVVTAHIE POL Mamu-DRB1*04:06 VTESFDAWENTVT ENV SIV TESFDAWNNTV Mamu-A1*011:01 3524 ESAAYRHLAFKCL VPX SIV KEYNETWYSTDLVCEQ ENV SIV NETWYSADLV Mamu-A1*011:01 43774 QQQQLLDVVKRQQELLR ENV SIV IVQQQQQLLDVVKR Mamu-DRB*w2:01 29421 PLEATVIKSQDN POL SIV QEGKPLEATVI Mamu-A1*011:01 50601 QIINTWHKVGK ENV HIV KQIINMWQEVGKAMYA 32996

Association between viral load and ART/vaccination, MHC haplotype and mutation

In this study, I had 21 macaques that received ART and vaccination, 10 macaques that received ART only, and 13 macaques that received neither ART nor vaccination. The viral load was calculated for each animal by averaging viral loads from week 6 onwards (i.e. after peak viremia, when the viral load had reached a steady state). For each haplotype, a two-way ANOVA with Bonferroni correction (because of multiple comparisons) was employed to study the link between viral load and A) ART/vaccination and B) MHC haplotype.

a) Viral load and ART/Vaccination: My statistical analysis revealed that the viral load of the animals that received neither ART nor vaccination was significantly higher (P value < 0.0001) than that of the animals that received ART and vaccination (Fig. 2.18A). There was also a significant viral load difference between animals with ART and no vaccination and those that received neither ART nor vaccination (Fig. 2.18B); the viral load in the first group was lower than in the second group. Among the animals that received ART, there was no significant link between viral load and vaccination (Fig. 2.18C). On the other hand, animals that received ART had a significantly (P value < 0.0001) lower viral load than those that did not receive ART (Fig. 2.18D).

b) Viral load and MHC haplotype: To investigate whether MHC haplotypes can affect disease progression in terms of viral load, I compared the viral loads of animals having MHC haplotypes that induce escape with those of animals lacking such haplotypes. A corrected two-way ANOVA test did not show a significant link between MHC haplotypes and viral loads (see Table 2.7). Smith et al. [13] previously reported, in a much smaller study of only 8 pigtail macaques (3 KP9 responders and 5 KP9 non-responders), that Mane-A084 may reduce SIV viral levels. I also analysed the viral loads of Mane-A084 positive animals at the KP9 epitope but did not find a significant correlation. However, Mane-A084 animals that mutated at the KP9 epitope had a lower viral load than those that had no mutation.
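Two small steps of this analysis, the per-animal steady-state viral load and the Bonferroni correction, can be sketched in Python. This is only an illustration: the function names and the input layout are hypothetical, and the sketch does not reproduce the two-way ANOVA itself.

```python
def steady_state_viral_load(measurements, first_week=6):
    """Mean viral load of one animal over all time points from `first_week` onwards.

    measurements: iterable of (week, viral_load) pairs.
    """
    values = [vl for week, vl in measurements if week >= first_week]
    if not values:
        raise ValueError("no measurements at or after the first week")
    return sum(values) / len(values)


def bonferroni(p_values):
    """Bonferroni-corrected P values: each P value times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For example, an animal measured at weeks 2, 6 and 8 contributes only its week 6 and week 8 values to the mean.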

[Figure 2.18 panels; y-axis: viral load. (a) ART = Y, Vaccine = Y vs. ART = N, Vaccine = N (P value < 0.0001); (c) ART = Y, Vaccine = Y vs. ART = Y, Vaccine = N (P value = 0.12); (d) ART = Y vs. ART = N (P value < 0.0001).]

Figure 2.18: Viral load comparison in different categories of ART and vaccination.

Table 2.7: SIV viral load for different MHC Class I haplotypes (mean viral loads after week 6).

Haplotype | No. of macaques, haplotype positive | No. of macaques, haplotype negative | VL*, No ART, Not Vaccinated, Mane positive | VL*, No ART, Not Vaccinated, Mane negative | VL*, Early ART, Not Vaccinated, Mane positive | VL*, Early ART, Not Vaccinated, Mane negative | VL*, Early ART, Vaccinated, Mane positive | VL*, Early ART, Vaccinated, Mane negative | Corrected P value
Mane-A006 | 3 | 41 | 5.96 | 5.74 | NA | 4.84 | 3.31 | 4.43 | 0.56
Mane-A082 | 13 | 31 | 5.96 | 5.62 | 5.48 | 4.57 | 4.70 | 4.30 | 0.14
Mane-A084 | 16 | 28 | 5.61 | 5.83 | 4.55 | 5.04 | 4.25 | 4.48 | 0.22
Mane-A114 | 2 | 42 | NA | 5.78 | 5.42 | 4.78 | 3.31 | 4.43 | 0.62
Mane-B015 | 8 | 36 | 5.54 | 5.88 | 4.90 | 4.83 | 5.13 | 4.30 | 0.72
Mane-B016 | 7 | 37 | 5.98 | 5.69 | 4.75 | 4.85 | 5.14 | 4.30 | 0.30
Mane-B017 | 2 | 42 | 5.38 | 5.81 | NA | 4.84 | 4.70 | 4.36 | 0.94
Mane-B028 | 7 | 37 | 5.39 | 5.85 | 4.28 | 4.98 | 4.18 | 4.41 | 0.12
Mane-B043 | 7 | 37 | 5.67 | 5.82 | NA | 4.84 | 4.33 | 4.38 | 0.75
Mane-B047 | 6 | 38 | 6.07 | 5.75 | 4.90 | 4.84 | 3.80 | 4.51 | 0.24
Mane-B069 | 4 | 40 | 5.61 | 5.79 | 4.32 | 4.97 | 4.34 | 4.38 | 0.36
Mane-B118 | 13 | 31 | 6.06 | 5.75 | 5.63 | 4.65 | 4.65 | 4.13 | 0.08
Mane-B120 | 6 | 38 | 6.13 | 5.75 | 4.31 | 5.07 | 3.26 | 4.49 | 0.11

Validation of detected associations

I tested two novel mutations, linked to Mane-B028 (position 9105) and Mane-B017 (position 9568), in collaboration with Professor Stephen Kent's lab. Three macaques were infected with SIVmac239 (one macaque with a Mane-B028 positive haplotype and two macaques with a Mane-B028 negative haplotype). The Mane-B028 positive macaque had a positive response to the Nef peptide, whereas the macaques without Mane-B028 had no response to the Nef peptide. This experimental result confirmed that the novel CTL escape mutations identified by my pipeline are recognised in vivo.

Discussion

The HIV-1/SIV antigen-specific CD8+ T-cell response expands approximately 14 days after infection, which coincides with a decrease in viral load. The immune pressure exerted on the virus by CTLs can lead to the emergence of escape mutations [46, 47]. Genome-wide association studies have demonstrated that host MHC Class I haplotypes impact CTL escape, and this has been well characterised in humans at the population level [48, 49].

SIV-infected macaques are valuable models for studying HIV-1 immunity, candidate vaccines, and therapeutic drugs. Among macaque models, pigtailed macaque models of HIV infection are accessible in Australia and are a good resource for studying SIV infection. The association between MHC Class-I molecules and SIV infection has been documented for pigtailed macaques, but with limited MHC diversity [50, 51]. The impact of MHC Class-I alleles on disease progression in the SIV pigtailed macaque model has not been well characterised. This is largely because of the limited MHC Class-I immunogenetic detail available for this model; only a few CTL mutations and their restricted epitopes have been identified for the SIV macaque model. Smith et al. and Mason et al. [12-14] reported an important CTL response in pigtailed macaques, known as KP9, and its escape mutations in Gag; Mason et al. also reported the KSA10 and KVA10 CTL responses in Tat in pigtailed macaques [14]; all these CTL mutations and their restricted epitopes are associated with the Mane-A084 allele. Several Env CTL escape mutations have also been characterised in SIV-infected macaques, but they have not been restricted to a specific MHC Class-I haplotype [52]. Hence there is still a need for a comprehensive and systematic investigation to identify SIV-specific CTL epitope mutations along with their association with the simian MHC Class-I haplotypes in pigtailed macaque models. In addition, there has not been a comprehensive study that clusters mutations to identify epitopes that escape in different ways in different macaques.

In this Chapter I performed a bioinformatics analysis to identify novel associations between viral escape mutations and Mane alleles, for both single mutations and groups of mutations within a 10 amino-acid window (since escape is not always monomorphic) [53], in 44 SIV-infected pigtailed macaques that together carried 22 MHC Class-I haplotypes. The pipeline detected many potential CTL escape epitopes across the SIV proteome with P value < 0.05. However, these probably overestimate the actual epitopes and their escape motifs at these sites, since A) some of the discovered associations may have arisen by chance, B) some polymorphisms may represent compensatory mutations for CTL mutations elsewhere, and C) some polymorphisms may represent adaptation to high-avidity CD8 T-cell immune responses rather than CTL escape mutations [7, 54]. To separate genuine associations from random ones and to estimate the false discovery rate, I designed a permutation analysis. Based on the permutation results, I defined cut-off levels for the likelihood that associations were real, and ignored all single polymorphism associations with < 66% likelihood of being real and all window analysis associations with < 86% likelihood.

The pipeline showed a strong association for the previously reported CTL epitopes KP9, KSA10 and KVA10 in the Gag and Tat proteins [12-14], which are associated with Mane-A084. Importantly, these findings were validated by performing intracellular cytokine staining (ICS) assays on frozen peripheral blood mononuclear cell (PBMC) samples. Overall, the method found 46 novel non-synonymous point mutations that were associated with particular MHC Class-I alleles and had > 66% likelihood of being real associations. The method also found 32 novel non-silent mutated regions (regions in which amino acids changed at different positions in different macaques) that were associated with common MHC Class-I alleles with > 86% likelihood of being real associations. Such sharing of orthologous MHC alleles among multiple pigtailed macaques indicates that certain Class-I alleles may share a common ancestor and play an important role in CD8 T-cell immune responses.

Of the 78 point and window associations, 33 were in the envelope protein of the SIV genome. Previous studies [55-58] reported that the envelope region of HIV-1 is highly variable in sequence and length, so the larger number of associations in this region is presumably related to its higher level of sequence variability. However, at least some of the associations in this region appear to be real. As an example, all Mane-A032-positive macaques had D-to-E and I-to-V point mutations at positions 7483 and 7493 of the envelope region, respectively, and no Mane-A032-negative macaques had mutations at these positions (Fig. 2.19).

The approach also found 24 associations within the Gag and Pol genes. These genes are not very variable between HIV-1 and/or SIV genomes. Previous studies have also reported Gag-specific responses in both HIV-1 and SIV [59-61]. This suggests that a significant proportion of the associations identified for SIV in this study can be used in studies of CTL-based control of SIV. The experimental confirmation (using a peptide-stimulated CD8 T-cell assay) of both the Mane-B028 and Mane-B017 associated CTL mutations, at positions 9105 and 9568 respectively, verified that these two linked mutations are real and that they elicited positive responses to the Nef epitope. In the case of Mane-B017, previous research [62] also detected a CTL mutation in rhesus macaques. Nevertheless, there is still more work to be done to experimentally confirm our findings of SIV-specific CTL epitopes. These studies will be challenging because the stock virus and other alleles that may exist in a macaque will affect CTL escape. Notwithstanding these difficulties, it will still be useful to experimentally validate the remaining potential epitopes discovered in our study, to gain an in-depth understanding of SIV pathogenesis in the pigtailed macaque models.

Figure 2.19: An example of number and type of mutations based on my single polymorphism analyses within all macaques, with and without Mane-A032 in protein Env at positions 7483 and 7493 (P value = 0.0010, Fisher’s exact test).

Some CTL escape mutations carry a fitness cost to HIV-1/SIV, and thus may be associated with lower infectivity [63]. I investigated a potential link between MHC Class-I alleles and SIV control by analysing the mean viral loads 6 weeks post challenge with virus in our cohort. However, after correction for multiple comparisons, the analysis did not show any correlation. Specifically, my analysis did not demonstrate that Mane-A084 was associated with an improved outcome, although a correlation between Mane-A084 and SIV infection was reported in a previous study of 8 macaques [13]. Further, in agreement with my finding, another group with a cohort including 24 Mane-A084-positive macaques also found no reduction in the peripheral viral load of these SIV-infected macaques [64]. Altogether, these data indicate that we can use Mane-A084+ individuals in vaccine trials, as this group of macaques does not control SIV better than average.

As indicated before, this study involved data from different groups of SIVmac251-infected macaques that received various combinations of vaccination and treatment (as detailed in Table 2.1). However, it is unlikely that vaccination and treatment were important factors in our study that investigates an association between MHC-I alleles and CTL escape mutations in SIV infection. Firstly, the various MHC-I alleles were distributed across the groups of animals such that particular MHC-I alleles were not associated with a specific vaccination/treatment group of animals. In other words, a drug resistance mutation present only in ART animals would not be associated with a particular MHC-I allele because it was only observed in the ART animals. However, as treatment may delay CTL escape, mutations that may normally be associated with a particular MHC may not necessarily be observed in some animals with this MHC. Therefore, treatment may reduce our power to detect some MHC-mutation associations. It has also been reported that vaccination may enhance rapid escape mutations and speed up late responses, emerging as rapid early escape [65-66]. However, as these rapid escape mutations would not be seen in the unvaccinated animals, this would have reduced the power in our study to identify a significant association with MHC and these mutation positions. Moreover, none of the vaccines worked well and therefore are unlikely to be an effective factor in identifying MHC-mutation associations.

The pigtailed macaque is a promising SIV/HIV-1 model, and knowledge of its MHC Class-I restricted CD8 T-cell responses will be helpful in T-cell based vaccine design and future studies. Having a wider suite of SIV-specific CTL epitopes and their escape mutation patterns, as detected in this study, could assist in determining which CTL responses are capable of controlling SIV infection. It should also provide a valuable resource for researchers studying T-cell control of SIV infection and using macaque models for the future design of CTL-specific SIV/HIV-1 vaccines and antiretroviral therapy strategies. It should also be noted that our experiments confirmed that rhesus macaque MHC-I tetramers, such as those for Mane-B017, can identify T-cell responses in pigtailed macaques sharing very similar MHC-I alleles, suggesting that some of the tools and reagents created for rhesus macaque research can also be used in pigtailed macaque studies. Future work can focus on experimental validation of our findings; in particular, focusing on those MHC Class-I haplotypes that exist in both rhesus and pigtailed models will enable discoveries to be interpreted across the two macaque models.

To find patterns of MHC-induced escape mutations in the SIV genome, a complex assembly process needs to be done. In this process many parameters need to be set to make a final genome assembly (e.g. mutation calling). In my study, all mutated positions with a variant frequency of less than 5% were ignored. A systematic future study will include a new bioinformatic pipeline with multiple thresholds to call single nucleotide variants.
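The effect of the variant-frequency cut-off can be illustrated with a minimal Python sketch. The function name `call_variants`, the data layout (a dictionary mapping genome position to variant frequency) and the example frequencies are all hypothetical and are not taken from the pipeline used in this thesis; only the 5% threshold comes from the text above.

```python
def call_variants(frequencies, threshold=0.05):
    """Keep positions whose variant frequency is at or above `threshold`."""
    return {pos: freq for pos, freq in frequencies.items() if freq >= threshold}


# Hypothetical variant frequencies keyed by genome position (illustration only).
example_frequencies = {101: 0.62, 202: 0.04, 303: 0.11}

# Compare the set of called positions under several thresholds,
# including the 5% cut-off used in this study.
calls_by_threshold = {
    t: sorted(call_variants(example_frequencies, t)) for t in (0.01, 0.05, 0.10)
}
```

Running the same association analysis on each call set would show how sensitive the detected associations are to the chosen threshold.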

In summary, I performed a linkage analysis between MHC-I haplotypes and SIV genomes in SIV-infected pigtailed macaques to discover potential CTL escape mutations and related epitopes. My discovery of potential epitopes should be a good source to start new studies on CTL-based HIV-1 vaccine design.

References

1. Price, D.A., et al., Positive selection of HIV-1 cytotoxic T lymphocyte escape variants during primary infection. Proceedings of the National Academy of Sciences, 1997. 94(5): p. 1890-95.

2. Rowland-Jones, S., et al., The role of cytotoxic T-cells in HIV infection. Dev Biol Stand, 1998. 92: p. 209-14.

3. Goulder, P.J., et al., Late escape from an immunodominant cytotoxic T-lymphocyte response associated with progression to AIDS. Nat Med, 1997. 3(2): p. 212-7.

4. Borrow, P., et al., Virus-specific CD8+ cytotoxic T-lymphocyte activity associated with control of viremia in primary human immunodeficiency virus type 1 infection. J Virol, 1994. 68(9): p. 6103-10.

5. Borrow, P., et al., Antiviral pressure exerted by HIV-1-specific cytotoxic T lymphocytes (CTLs) during primary infection demonstrated by rapid selection of CTL escape virus. Nat Med, 1997. 3(2): p. 205-11.

6. O'Connell, K.A., et al., CD4+ T cells from elite suppressors are more susceptible to HIV-1 but produce fewer virions than cells from chronic progressors. Proceedings of the National Academy of Sciences, 2011. 108(37): p. E689-E698.

7. Almeida, C.-A.M., et al., Translation of HLA–HIV Associations to the Cellular Level: HIV Adapts To Inflate CD8 T Cell Responses against Nef and HLA-Adapted Variant Epitopes. The Journal of Immunology, 2011. 187(5): p. 2502-13.

8. Almeida, C.-A.M., et al., Exploiting knowledge of immune selection in HIV-1 to detect HIV-specific CD8 T-cell responses. Vaccine, 2010. 28(37): p. 6052-7.

9. Merani, S., et al., Effect of immune pressure on hepatitis C virus evolution: Insights from a single‐source outbreak. Hepatology, 2011. 53(2): p. 396-405.

10. Hammer, J., New methods to predict MHC-binding sequences within protein antigens. Curr Opin Immunol, 1995. 7(2): p. 263-9.

11. Bhattacharya, T., et al., Founder effects in the assessment of HIV polymorphisms and HLA allele associations. Science, 2007. 315(5818): p. 1583-6.

12. Smith, M.Z., et al., The pigtail macaque MHC class I allele Mane-A*10 presents an immundominant SIV Gag epitope: identification, tetramer development and implications of immune escape and reversion. Journal of medical primatology, 2005. 34(5-6): p. 282-93.

13. Smith, M.Z., et al., Analysis of pigtail macaque major histocompatibility complex class I molecules presenting immunodominant simian immunodeficiency virus epitopes. Journal of virology, 2005. 79(2): p. 684-95.

14. Mason, R.D., et al., Differential patterns of immune escape at Tat-specific cytotoxic T cell epitopes in pigtail macaques. Virology, 2009. 388(2): p. 315-23.

15. Evans, D.T., et al., Definition of five new simian immunodeficiency virus cytotoxic T-lymphocyte epitopes and their restricting major histocompatibility complex class I molecules: evidence for an influence on disease progression. J Virol, 2000. 74(16): p. 7400-10.

16. Friedrich, T.C., et al., Consequences of cytotoxic T-lymphocyte escape: common escape mutations in simian immunodeficiency virus are poorly recognized in naive hosts. J Virol, 2004. 78(18): p. 10064-73.

17. Su, J., et al., Novel simian immunodeficiency virus CTL epitopes restricted by MHC class I molecule Mamu-B*01 are highly conserved for long term in DNA/MVA-vaccinated, SHIV-challenged rhesus macaques. Int Immunol, 2005. 17(5): p. 637-48.

18. Klatt, N.R., et al., Dynamics of simian immunodeficiency virus SIVmac239 infection in pigtail macaques. Journal of virology, 2012. 86(2): p. 1203-13.

19. O'Connor, S.L., et al., MHC heterozygote advantage in simian immunodeficiency virus-infected Mauritian cynomolgus macaques. Sci Transl Med, 2010. 2(22): p. 22ra18.

20. Peut, V. and S.J. Kent, Substantial envelope-specific CD8 T-cell immunity fails to control SIV disease. Virology, 2009. 384(1): p. 21-7.

21. Peut, V. and S.J. Kent, Utility of human immunodeficiency virus type 1 envelope as a T-cell immunogen. Journal of virology, 2007. 81(23): p. 13125-34.

22. Carrington, M., et al., HLA and HIV-1: heterozygote advantage and B*35-Cw*04 disadvantage. Science, 1999. 283(5408): p. 1748-52.

23. Mothe, B., et al., CTL responses of high functional avidity and broad variant cross-reactivity are associated with HIV control. PLoS One, 2012. 7(1): p. e29717.

24. Kiepiela, P., et al., CD8+ T-cell responses to different HIV proteins have discordant associations with viral load. Nat Med, 2007. 13(1): p. 46-53.

25. Dzutsev, A.H., et al., Avidity of CD8 T cells sharpens immunodominance. Int Immunol, 2007. 19(4): p. 497-507.

26. Yerly, D., et al., Increased cytotoxic T-lymphocyte epitope variant cross-recognition and functional avidity are associated with hepatitis C virus clearance. J Virol, 2008. 82(6): p. 3147-53.

27. Kent, S.J., et al., Evaluation of recombinant Kunjin replicon SIV vaccines for protective efficacy in macaques. Virology, 2008. 374(2): p. 528-34.

28. Sexton, A., et al., Evaluation of recombinant influenza virus-simian immunodeficiency virus vaccines in macaques. Journal of virology, 2009. 83(15): p. 7619-28.

29. De Rose, R., et al., Control of viremia and prevention of AIDS following immunotherapy of SIV-infected macaques with peptide-pulsed blood. PLoS Pathog, 2008. 4(5): p. e1000055.

30. Gooneratne, S.L., et al., Linking pig-tailed macaque major histocompatibility complex class I haplotypes and cytotoxic T lymphocyte escape mutations in simian immunodeficiency virus infection. J Virol, 2014. 88(24): p. 14310-25.

31. Karl, J.A., et al., Identification of MHC class I sequences in Chinese-origin rhesus macaques. Immunogenetics, 2008. 60(1): p. 37-46.

32. Fernandez, C.S., et al., Screening and confirmatory testing of MHC class I alleles in pig-tailed macaques. Immunogenetics, 2011. 63(8): p. 511-21.

33. Dudley, D.M., et al., Low-cost ultra-wide genotyping using Roche/454 pyrosequencing for surveillance of HIV drug resistance. PLoS One, 2012. 7(5): p. e36494.

34. Lorenc, M.T., et al., Discovery of single nucleotide polymorphisms in complex genomes using SGSautoSNP. Biology, 2012. 1(2): p. 370-82.

35. McCormack, J.E., et al., Next-generation sequencing reveals phylogeographic structure and a species tree for recent bird divergences. Mol Phylogenet Evol, 2012. 62(1): p. 397-406.

36. Ruperao, P. and D. Edwards, Bioinformatics: identification of markers from next-generation sequence data. Plant Genotyping: Methods and Protocols. Springer, 2015. p. 29-47.

37. Nelson, C.W., et al., SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data. Bioinformatics, 2015. 31(22): p. 3709-11.

38. Roorkiwal, M., et al., Exploring germplasm diversity to understand the domestication process in Cicer spp. using SNP and DArT markers. PLoS One, 2014. 9(7): p. e102016.

39. Carlson, J., et al., Leveraging hierarchical population structure in discrete association studies. PLoS One, 2007. 2(7): p. e591.

40. Storey, J.D. and R. Tibshirani, Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 2003. 100(16): p. 9440-45.

41. Dong, T., et al., Extensive HLA-driven viral diversity following a narrow-source HIV-1 outbreak in rural China. Blood, 2011. 118(1): p. 98-106.

42. Liu, Y., et al., Evolution of human immunodeficiency virus type 1 cytotoxic T-lymphocyte epitopes: fitness-balanced escape. J Virol, 2007. 81(22): p. 12179-88.

43. Manocheewa, S., et al., Composite Sequence-Structure Stability Models as Screening Tools for Identifying Vulnerable Targets for HIV Drug and Vaccine Development. Viruses, 2015. 7(11): p. 5718-35.

44. Zanotto, P.M., et al., Genealogical evidence for positive selection in the nef gene of HIV-1. Genetics, 1999. 153(3): p. 1077-89.

45. Vita, R., et al., The immune epitope database 2.0. Nucleic Acids Res, 2010. 38(suppl 1): p. D854-D862.

46. O'Connor, D.H., et al., Acute phase cytotoxic T lymphocyte escape is a hallmark of simian immunodeficiency virus infection. Nat Med, 2002. 8(5): p. 493-9.

47. Fryer, H.R. and A.R. McLean, Modelling the spread of HIV immune escape mutants in a vaccinated population. PLoS Comput Biol, 2011. 7(12): p. e1002289.

48. Moore, C.B., et al., Evidence of HIV-1 adaptation to HLA-restricted immune responses at a population level. Science, 2002. 296(5572): p. 1439-43.

49. Limou, S., Genomewide association study of an AIDS-nonprogression cohort emphasizes the role played by HLA genes (ANRS Genomewide Association Study 02). J Infect Dis, 2009. 199(3): p. 419-26.

50. Allen, T.M., et al., Selective escape from CD8+ T-cell responses represents a major driving force of human immunodeficiency virus type 1 (HIV-1) sequence diversity and reveals constraints on HIV-1 evolution. J Virol, 2005. 79(21): p. 13239-49.

51. Berman, P.W., et al., Protection of chimpanzees from infection by HIV-1 after vaccination with recombinant glycoprotein gp120 but not gp160. Nature, 1990. 345(6276): p. 622-5.

52. Peut, V. and S.J. Kent, Substantial envelope-specific CD8 T-cell immunity fails to control SIV disease. Virology, 2009. 384(1): p. 21-7.

53. O'Connor, D.H., et al., Major histocompatibility complex class I alleles associated with slow simian immunodeficiency virus disease progression bind epitopes recognized by dominant acute-phase cytotoxic-T-lymphocyte responses. J Virol, 2003. 77(16): p. 9029-40.

54. Keane, N.M., et al., High-avidity, high-IFNgamma-producing CD8 T-cell responses following immune selection during HIV-1 infection. Immunol Cell Biol, 2012. 90(2): p. 224-34.

55. Hahn, B.H., et al., Genomic diversity of the acquired immune deficiency syndrome virus HTLV-III: different viruses exhibit greatest divergence in their envelope genes. Proc Natl Acad Sci U S A, 1985. 82(14): p. 4813-7.

56. Modrow, S., et al., Computer-assisted analysis of envelope protein sequences of seven human immunodeficiency virus isolates: prediction of antigenic epitopes in conserved and variable regions. J Virol, 1987. 61(2): p. 570-8.

57. Starcich, B.R., et al., Identification and characterization of conserved and variable regions in the envelope gene of HTLV-III/LAV, the retrovirus of AIDS. Cell, 1986. 45(5): p. 637-48.

58. Willey, R.L., et al., Identification of conserved and divergent domains within the envelope gene of the acquired immunodeficiency syndrome retrovirus. Proc Natl Acad Sci U S A, 1986. 83(14): p. 5038-42.

59. Nqoko, B., et al., HIV-specific gag responses in early infancy correlate with clinical outcome and inversely with viral load. AIDS Res Hum Retroviruses, 2011. 27(12): p. 1311-6.

60. Leligdowicz, A., et al., Robust Gag-specific T cell responses characterize viremia control in HIV-2 infection. J Clin Invest, 2007. 117(10): p. 3067-74.

61. Kiepiela, P., et al., Dominant influence of HLA-B in mediating the potential co-evolution of HIV and HLA. Nature, 2004. 432(7018): p. 769-75.

62. Maness, N.J., et al., Comprehensive immunological evaluation reveals surprisingly few differences between elite controller and progressor Mamu-B*17-positive simian immunodeficiency virus-infected rhesus macaques. J Virol, 2008. 82(11): p. 5245-54.

63. Fernandez, C.S., et al., Rapid viral escape at an immunodominant simian-human immunodeficiency virus cytotoxic T-lymphocyte epitope exacts a dramatic fitness cost. J Virol, 2005. 79(9): p. 5721-31.

64. Mankowski, J.L., et al., Natural host genetic resistance to lentiviral CNS disease: a neuroprotective MHC class I allele in SIV-infected macaques. PLoS One, 2008. 3(11): p. e3603.

65. Reece, J.C., et al., Timing of immune escape linked to success or failure of vaccination. PLoS One, 2010. 5(9): p. e12774.

66. Davenport, M.P., et al., Rates of HIV immune escape and reversion: implications for vaccination. Trends Microbiol, 2008. 16(12): p. 561-6.

Chapter 3:

G2A3: a method to avoid errors associated with the analysis of hypermutated viral sequences by alignment-based methods

Publication details:

H. Alinejad-Rokny, D. Ebrahimi. (2015). "A Method to Avoid Errors Associated with the Analysis of Hypermutated Viral Sequences by Alignment-Based Methods", Journal of Biomedical Informatics, 58(2015): 220-225.

Author contributions to thesis Chapter 3:

HA-R and DE: Conceived and designed the experiments. HA-R: Designed and implemented the computational algorithms, and performed the bioinformatics analysis. HA-R: Analysed the data. HA-R: Wrote the chapter. DE and MPD: Revised the chapter. HA-R: Created all Tables and Figures.

Abstract

The human genome encodes a family of editing enzymes known as APOBEC3 (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3). They induce context-dependent G-to-A changes, referred to as ‘hypermutation’, in the genomes of viruses such as HIV, SIV, HBV and endogenous retroviruses. Hypermutation is characterized by aligning affected sequences to a reference sequence. Here, I show that indels (insertions/deletions) in the sequences lead to an incorrect assignment of APOBEC3 target and non-target sites. This can result in incorrect identification of hypermutated sequences and in erroneous biological inferences based on hypermutation analysis.

Introduction

Human APOBEC3 is a seven-membered family of cytosine deaminase enzymes, several of which, such as APOBEC3G, APOBEC3F and APOBEC3H haplotype II, are active against HIV and possibly HBV [1-7]. These enzymes induce C-to-U mutations at multiple positions in the negative strand of viral genomes. This leads to multiple G-to-A replacements, referred to as ‘hypermutation’ [2, 8, 9], in the viral positive strand. APOBEC3G replaces G by A preferentially within a GG context. Other members of this family change G-to-A mainly within GA [2, 8-14]. Commonly, the first step in the analysis of hypermutated sequences is an alignment to a reference sequence [13, 14]. Hypermutation is then quantified by calculating the number and context of G-to-A changes. Different approaches have been reported for this latter step. Pace et al. [15] used the ratio of the number of G-to-A changes in the context of GG or GA over the total number of G-to-A changes to study hypermutation. Ulenga et al. [16] investigated hypermutation by normalizing the frequency of GG-to-AG and GA-to-AA changes using the frequency of all mutations. Oliver et al. [17] quantified hypermutation using APOBEC3-induced mutations within the trinucleotides TGG and TGA [17-19]. It is known that GG and GA sites flanked by a 3’ C are disfavoured APOBEC3 target sites [3, 20]. Armitage et al. took this motif preference into account and excluded GG and GA motifs flanked by C from the analysis [20].

The most commonly used alignment-based method for identifying hypermutated sequences is Hypermut, a freely available program from the Los Alamos National Laboratory [14, 21]. Hypermut uses a Fisher’s exact test [22, 23] to calculate, for each sequence, a probability associated with hypermutation, based on the number of G-to-A changes within the available APOBEC3 target sites GGN and GAN (N: A, G or T) and the number of G-to-N (N: A, C, G or T) changes within non-target sites GGC, GAC, GC and GT. In other words, target sites are GG and GA motifs followed by A, G or T but not C.

Additionally, there are a few reports on the analysis of hypermutated sequences without the need for aligning sequences. These methods are mostly based on the analysis of the frequency or representation of short sequence motifs [15, 24].

In alignment-based methods, gaps in the sequences that result from indels (insertions/deletions) can significantly affect the calculated number of target and non-target sites, and therefore the calculated frequencies of mutations within these sites. This is illustrated in Fig. 3.1 using two highlighted sites in a multiple sequence alignment. The sequences were aligned using MUSCLE (www.ebi.ac.uk/Tools/msa/muscle) and checked manually.

Figure 3.1: Presence of indels in aligned sequences; sequences were aligned by MUSCLE (www.ebi.ac.uk/Tools/msa/muscle) and checked manually.

As will be explained later, indels can result in genomic sites being incorrectly identified or ignored in the analysis of hypermutation. In this Chapter, I identify and report several of these errors that affect the outcome of the analysis by alignment-based methods. Such errors would adversely affect biological inferences that rely on the proportion and extent of hypermutated sequences. To address these types of errors, I developed a method (G2A3) that correctly identifies hypermutation signatures using a heuristic algorithm. I explain errors in the hypermutation analysis by referring to the widely used Hypermut program [14, 21] and the alignment-based methods proposed by Pace et al. [15] and Ulenga et al. [16].

Methods

Pace method

Pace et al. [15] used the following equations to identify hypermutated sequences:

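\[
\text{3G Score} = \frac{\#\,\text{GG}\rightarrow\text{AG replacements} \;/\; \#\,\text{consensus GG}}{\#\,\text{G}\rightarrow\text{A} \;/\; \#\,\text{G}} \tag{Eq. 1}
\]

\[
\text{3F Score} = \frac{\#\,\text{GA}\rightarrow\text{AA replacements} \;/\; \#\,\text{consensus GA}}{\#\,\text{G}\rightarrow\text{A} \;/\; \#\,\text{G}} \tag{Eq. 2}
\]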

These equations indicate the proportions of G-to-A mutations within the APOBEC3G or APOBEC3F preferred target dinucleotide contexts.
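As a minimal sketch of Eq. 1 and Eq. 2, the score can be computed from four counts per sequence. The function name `pace_score`, its parameter names and the example counts are hypothetical; returning `None` when a denominator is zero is my assumption and is not described by Pace et al.

```python
def pace_score(context_changes, consensus_context_sites,
               g_to_a_changes, consensus_g_sites):
    """Context-specific G-to-A rate divided by the overall G-to-A rate.

    Use GG->AG counts for the 3G score (Eq. 1) and GA->AA counts for the
    3F score (Eq. 2). Returns None when the ratio is undefined.
    """
    if consensus_context_sites == 0 or consensus_g_sites == 0 or g_to_a_changes == 0:
        return None  # assumption: treat an undefined ratio as "no score"
    context_rate = context_changes / consensus_context_sites
    overall_rate = g_to_a_changes / consensus_g_sites
    return context_rate / overall_rate


# Hypothetical counts for one sequence (illustration only, not thesis data):
# 10 GG->AG changes among 40 consensus GG sites, 20 G->A changes among 320 G sites.
score_3g = pace_score(10, 40, 20, 320)
```

A score above 1 means that G-to-A changes are enriched in the chosen dinucleotide context relative to G-to-A changes overall.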

Ulenga method

The method used by Ulenga et al. differs from the Pace method in that the frequency of context-specific G-to-A mutations is normalized by the frequency of all mutations in the genome [16]:

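\[
\text{G}\rightarrow\text{A mutation due to hA3G} = \frac{\#\,\text{GG}\rightarrow\text{AG replacements} \;/\; \#\,\text{consensus GG}}{\#\,\text{mutations} \;/\; \#\,\text{nucleotides sequenced}} \tag{Eq. 3}
\]

\[
\text{G}\rightarrow\text{A mutation due to hA3F} = \frac{\#\,\text{GA}\rightarrow\text{AA replacements} \;/\; \#\,\text{consensus GA}}{\#\,\text{mutations} \;/\; \#\,\text{nucleotides sequenced}} \tag{Eq. 4}
\]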
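Eq. 3 and Eq. 4 differ from the Pace scores only in the denominator, which can be sketched as follows. The function name `ulenga_score`, its parameter names and the example counts are hypothetical, and the handling of zero denominators is my assumption.

```python
def ulenga_score(context_changes, consensus_context_sites,
                 total_mutations, nucleotides_sequenced):
    """Context-specific G-to-A rate divided by the overall mutation rate.

    Use GG->AG counts for hA3G (Eq. 3) and GA->AA counts for hA3F (Eq. 4).
    Returns None when the ratio is undefined.
    """
    if consensus_context_sites == 0 or nucleotides_sequenced == 0 or total_mutations == 0:
        return None  # assumption: treat an undefined ratio as "no score"
    context_rate = context_changes / consensus_context_sites
    mutation_rate = total_mutations / nucleotides_sequenced
    return context_rate / mutation_rate


# Hypothetical counts (illustration only): 10 GG->AG changes among 40 consensus
# GG sites, and 25 mutations of any type among 400 nucleotides sequenced.
score_ha3g = ulenga_score(10, 40, 25, 400)
```

Because the numerator counts consensus GG or GA sites, any site dropped from the alignment because of an indel changes both the count of replacements and the count of consensus sites.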

Errors in Pace and Ulenga methods

The alignment of sequences containing indels creates gaps in the reference or query sequences. In both methods, sites having an indel within the APOBEC3 target motifs (GG, GA) or product motifs (AG, AA) are ignored.

Hypermut program

In this program for each sequence a probability associated with hypermutation is calculated using a Fisher’s exact test with the following four parameters [14, 21]:

a) Number of sites that are APOBEC3 target sites and have been mutated (T+M+), defined as: GGN→AGN and GAN→AAN changes (N: A, G or T).

b) Number of sites that are APOBEC3 target sites but have not been mutated (T+M-).

c) Number of sites that are not APOBEC3 target sites but have been mutated (T-M+), defined as: GGC→AGC, GAC→AAC, GC→AC and GT→AT changes.

d) Number of sites that are not APOBEC3 target sites and have not been mutated (T-M-).

To calculate these four parameters, the Hypermut program has three ‘enforce context’ options that determine which sequence (reference, query or both) is used to look for G-to-N changes. By default the program enforces context on the query sequence. An example is given in Fig. 3.2. When enforce context on the reference sequence is selected, the program finds GGT, within which a G-to-A change is, by definition, an APOBEC3 target site mutation. When enforce context on the query sequence is selected, the context is AGC, within which a G-to-A change is a non-target site mutation. When both reference and query sequences are selected, this site is ignored because there is a mismatch in the third position of the motif (T in reference vs. C in query).

Reference: AGTCGTGGTAG (enforce context on reference: target site)
Query: AGTCGTAGCAG (enforce context on query: non-target site)
Enforce context on both: site is ignored

Figure 3.2: Schematics of the context enforcement in the Hypermut program; By default the Hypermut program enforces context on the query sequence.
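The three context options can be sketched in Python using the sequences of Fig. 3.2. This is a simplified illustration and not the Hypermut source code: the function name `classify_site` is hypothetical, the sequences are assumed to be gap-free and of equal length, and a site is treated as a target when the G is followed by A or G and then by A, G or T.

```python
def classify_site(reference, query, pos, mode="query"):
    """Classify the site at `pos` as 'target', 'non-target' or 'ignored'.

    `pos` is the 0-based index of the candidate G in the reference; the two
    3' flanking bases are read from the sequence selected by `mode`.
    """
    ref_context = reference[pos + 1:pos + 3]
    query_context = query[pos + 1:pos + 3]
    if mode == "reference":
        context = ref_context
    elif mode == "query":
        context = query_context
    elif mode == "both":
        if ref_context != query_context:
            return "ignored"  # the context must agree in both sequences
        context = ref_context
    else:
        raise ValueError("mode must be 'reference', 'query' or 'both'")
    if len(context) < 2:
        return "ignored"  # too close to the end of the sequence
    is_target = context[0] in "AG" and context[1] in "AGT"
    return "target" if is_target else "non-target"


reference = "AGTCGTGGTAG"
query = "AGTCGTAGCAG"
site = 6  # the G of GGT in the reference, which is A in the query

results = {mode: classify_site(reference, query, site, mode)
           for mode in ("reference", "query", "both")}
```

For this site the reference context is GT and the query context is GC, so the three modes give three different answers, as described for Fig. 3.2.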

The Hypermut program identifies the number of mutated and non-mutated target sites as well as the number of mutated and non-mutated non-target sites. It then performs a Fisher’s exact test (Fig. 3.3) to calculate, for each sequence, a probability associated with hypermutation.

Figure 3.3: Hypermut 2-by-2 table as the input of Fisher’s exact test to calculate a probability associated with hypermutation for each sequence.
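The test on the 2-by-2 table can be sketched with the standard library alone. The function name `fisher_right_tail` is hypothetical, and using a one-sided (right-tail) test, which asks whether mutations are over-represented at target sites, is my assumption about how the table is evaluated.

```python
from math import comb


def fisher_right_tail(target_mut, target_unmut, nontarget_mut, nontarget_unmut):
    """One-sided Fisher's exact test on the table
    [[T+M+, T+M-], [T-M+, T-M-]].

    Returns the probability of observing at least `target_mut` mutated
    target sites given the table margins (hypergeometric upper tail).
    """
    n_target = target_mut + target_unmut
    n_nontarget = nontarget_mut + nontarget_unmut
    n_mutated = target_mut + nontarget_mut
    total_tables = comb(n_target + n_nontarget, n_mutated)
    upper = min(n_target, n_mutated)
    # math.comb returns 0 when k > n, so impossible tables contribute nothing.
    favourable = sum(
        comb(n_target, x) * comb(n_nontarget, n_mutated - x)
        for x in range(target_mut, upper + 1)
    )
    return favourable / total_tables


# Small hypothetical table (illustration only): 3 of 4 target sites mutated,
# 1 of 4 non-target sites mutated.
p_value = fisher_right_tail(3, 1, 1, 3)
```

A sequence would then be flagged as hypermutated when this probability falls at or below the chosen cut-off.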

Type of errors in the Hypermut program

Given that by default the Hypermut program enforces context on the query sequence, here I report the errors associated with indels based on this enforced context setting.

There are two scenarios in which alignment-based methods fail to correctly identify target and non-target sites resulting in underestimation or overestimation of hypermutation frequency. It is important to note that these errors are not due to a misalignment. They occur in perfectly aligned sequences as shown in Fig. 3.1.

A) Error type 1 (site is ignored)

In this scenario, when there is an indel after A in the query sequence the site is ignored (see sequence 2 in Fig. 3.1 and error type 1 in Table 3.1).

B) Error type 2 (site is misidentified)

In this scenario, when there is an indel followed by C at the 3’ end of the target dinucleotide, Hypermut incorrectly identifies a non-target site as a target site (see sequence 4 in Fig. 3.1 and error type 2 in Table 3.1). Depending on whether or not there is a G-to-A mutation at this site, this leads to an overestimation or underestimation of hypermutation frequency.

Table 3.1: Type of errors in alignment-based methods.

| Type of error | Query Seq. | Reference Seq. | Description |
|---|---|---|---|
| 1 | R(indel) | NN | Site is ignored (R = A or G) |
| 2 | AR(indel)C | GNNN | Non-target site misidentified as target site |

G2A3: My proposed method

The input of my proposed method, G2A3, is an aligned FASTA file in which the first sequence is considered the “reference” and the subsequent sequence(s) are query sequences. As in the Hypermut program, the user first selects the sequence on which to enforce the context. Assume that the query sequence is chosen. The program finds, within the query sequence, sites that start with an A. It then identifies the two 3' flanking nucleotides. If there is an indel at either of these two positions, the gap characters are skipped until a nucleotide is reached. To determine whether the site has undergone a G-to-A mutation, the program identifies within the reference sequence the nucleotide that corresponds to the A in the query sequence. If there is a G in this position, the site is either a mutated APOBEC3 target site or a mutated APOBEC3 non-target site. To clarify these steps, consider the reference sequence "GGCCTG" and the query sequence "AG-CTG". The algorithm identifies, in the query, the site "AGC" after removing the indel. This site is identified as a non-target site because of the C at the second 3’ position after the mutated position (the A in AGC). The program then refers to the reference sequence to find out whether a mutation has occurred at the position of the A. In this example there has been a G-to-A mutation. Therefore, this site is a mutated non-target site. The Hypermut program would incorrectly consider this site a mutated target site, because the indel prevents it from detecting the C at the 3’ position.

In the same way, G2A3 handles indels when the context is enforced on the reference sequence or on both sequences. The pseudo code of the proposed algorithm is given below. The flowchart of the proposed method is shown in Fig. 3.4.

G2A3(X, maxindel, mod, NS) {
Input: Data set X as FASTA format of aligned sequences with the reference sequence in the first row;
       maxindel as maximum gap that the method can ignore for each site;
       mod: reference, query or both;
       NS: number of sequences.
For i = 2 to NS
    SL = Len(X.Sequence(i))
    For j = 1 to SL
        If mod = ‘reference’
            If X.Sequence(1){j} == ‘G’
                k = j + 1
                While (k < j + maxindel) and (X.Sequence(1){k} == GAP)
                    k = k + 1;
                m = k + 1
                While (m < k + maxindel) and (X.Sequence(1){m} == GAP)
                    m = m + 1;
            end if
        Elseif mod = ‘query’
            If X.Sequence(i){j} == ‘G’
                k = j + 1
                While (k < j + maxindel) and (X.Sequence(i){k} == GAP)
                    k = k + 1;
                m = k + 1
                While (m < k + maxindel) and (X.Sequence(i){m} == GAP)
                    m = m + 1;
            end if
        Elseif mod = ‘both’
            If (X.Sequence(1){j} == ‘G’) and (X.Sequence(i){j} == ‘G’)
                k = j + 1
                While (k < j + maxindel) and (X.Sequence(1){k} == GAP and X.Sequence(i){k} == GAP)
                    k = k + 1;
                m = k + 1
                While (m < k + maxindel) and (X.Sequence(1){m} == GAP and X.Sequence(i){m} == GAP)
                    m = m + 1;
            end if
        end if
        target motif = X.Sequence(1){j} + X.Sequence(1){k} + X.Sequence(1){m}
        product motif = X.Sequence(i){j} + X.Sequence(i){k} + X.Sequence(i){m}
        If site is a mutated target site
            T+M+ counter ++;
        If site is a non-mutated target site
            T+M- counter ++;
        If site is a mutated non-target site
            T-M+ counter ++;
        If site is a non-mutated non-target site
            T-M- counter ++;
    end for
    perform Fisher’s exact test and obtain p-values;
    If p-value <= 0.05
        tag the sequence “hypermutated”;
end for
}

Pseudo code of G2A3.
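The gap-skipping step can be illustrated with a minimal Python sketch. It is not the G2A3 implementation: the function names `next_base` and `classify_sites` are hypothetical, the context is enforced on the query only, and a target site is assumed to be a G followed by A or G and then by any nucleotide other than C (a Hypermut-like default). Sites where the query carries neither G nor A opposite a reference G are simply skipped.

```python
GAP = "-"

def next_base(seq, start, maxindel):
    """Index of the first non-gap character at or after `start`,
    skipping at most `maxindel` gap characters; None if there is none."""
    k, skipped = start, 0
    while k < len(seq) and seq[k] == GAP:
        if skipped >= maxindel:
            return None
        k += 1
        skipped += 1
    return k if k < len(seq) else None

def classify_sites(reference, query, maxindel=10):
    """Count T+M+, T+M-, T-M+ and T-M- sites for one aligned query,
    reading the two 3' flanking nucleotides from the query (gaps skipped)."""
    counts = {"T+M+": 0, "T+M-": 0, "T-M+": 0, "T-M-": 0}
    for j, ref_base in enumerate(reference):
        # Candidate sites are reference G positions where the query has G or A.
        if ref_base != "G" or query[j] not in ("A", "G"):
            continue
        k = next_base(query, j + 1, maxindel)
        m = next_base(query, k + 1, maxindel) if k is not None else None
        if k is None or m is None:
            continue  # no complete 3' context available
        # Assumed target definition: G followed by A/G, then not C.
        is_target = query[k] in ("A", "G") and query[m] != "C"
        is_mutated = query[j] == "A"
        key = ("T+" if is_target else "T-") + ("M+" if is_mutated else "M-")
        counts[key] += 1
    return counts

# Worked example from the text: the first site is a mutated non-target site.
print(classify_sites("GGCCTG", "AG-CTG"))
```

For the worked example, the first position is read as the site "AGC" once the gap is skipped, so it is counted under T-M+ rather than T+M+.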


Figure 3.4: Flowchart of proposed method when enforce context on the reference sequence is selected.


Results

To illustrate the impact of indels in alignments on the output of hypermutation analysis by current alignment-based methods, I compared my proposed G2A3 method with Hypermut and with the methods of Pace and Ulenga using two data sets: a) 454 gag sequences of simian immunodeficiency virus (SIV) obtained from laboratory-infected pigtail macaques [26, 27], for which I used SIVmac251 as the reference; and b) single genome HIV-1 gag sequences obtained from naturally infected patients [28], for which I used a consensus sequence as the reference. The details of these two data sets are given in Table 3.2. A snapshot of a region of the 454 SIV sequences, with the positions identified incorrectly by the above-mentioned methods, is shown in Fig. 3.5. The results of the analysis are given in Tables 3.3, 3.4 and 3.5. Hypermut incorrectly identified 4 out of 10 hypermutated single genome HIV-1 sequences as non-hypermutated at P value < 0.05 (Table 3.3). Additionally, 5 of the 10 454 SIV sequences that Hypermut identified as hypermutated were non-hypermutated according to G2A3 (Table 3.4). Table 3.5 shows the number of mutations identified as an APOBEC signature by the Pace and Ulenga methods and by my G2A3 method. The last columns of Tables 3.3 - 3.5 show the performance of the Hypermut, Ulenga and Pace methods when gaps in the alignments are removed manually before the analysis.

Table 3.2: Details of the sequences used in this study.

Data                             Total number of sequences   Sequence length
Single genome HIV-1 sequences    95                          1100
454 SIV sequences                281                         350


Figure 3.5: A snapshot of the 454 SIV sequences and examples of the errors resolved by G2A3 (insertion and deletion).


Table 3.3: Comparison of Hypermut and G2A3 methods for the analysis of single genome HIV-1 sequences. Sequences identified incorrectly by Hypermut (grey in the original) are marked with an asterisk. Only hypermutated sequences are shown.

           G2A3                                 Hypermut                             P value after
Seq. ID    T+M+  T+M-  T-M+  T-M-  P value      T+M+  T+M-  T-M+  T-M-  P value      indels removed
548_20L    13    64    4     88    0.003        11    63    3     86    0.005        0.003
548_9L     9     64    3     85    0.030        9     64    3     84    0.022        0.030
548_3L     14    69    1     86    0.000        13    67    1     86    0.000        0.000
548_7L     12    70    1     86    0.001        12    67    2     86    0.001        0.001
548_27L*   9     66    3     88    0.030        9     68    4     87    0.051        0.030
548_5L*    6     60    1     87    0.019        5     66    1     86    0.055        0.019
548_12L    19    71    2     86    0.000        18    69    2     85    0.000        0.000
548_25L    17    70    5     88    0.001        18    71    5     87    0.001        0.001
548_28L*   6     67    1     86    0.044        3     66    1     86    0.223        0.044
548_16L*   4     70    0     87    0.038        3     65    0     87    0.076        0.038


Table 3.4: Comparison of Hypermut and G2A3 methods for analysis of 454 SIV sequences. Sequences identified incorrectly by Hypermut (grey in the original) are marked with an asterisk. Only hypermutated sequences are shown.

            G2A3                                 Hypermut                             P value after
Seq. ID     T+M+  T+M-  T-M+  T-M-  P value      T+M+  T+M-  T-M+  T-M-  P value      indels removed
kp9.f.16    5     26    1     42    0.027        6     36    0     40    0.009        0.027
kp9.f.36    5     26    1     41    0.029        6     37    0     39    0.011        0.029
kp9.f.51*   3     20    2     42    0.317        5     38    0     40    0.024        0.317
kp9.f.56*   4     30    1     42    0.153        5     36    0     40    0.020        0.153
kp9.f.76    4     25    0     43    0.016        4     36    0     41    0.044        0.016
kp9.f.86    5     26    1     42    0.027        6     36    0     40    0.009        0.027
kp9.r.107*  3     25    1     42    0.143        4     37    0     40    0.049        0.143
kp9.r.62*   3     20    2     42    0.317        5     38    0     40    0.024        0.317
kp9.r.68*   3     24    1     42    0.133        4     37    0     40    0.049        0.133
kp9.f.16    5     26    1     42    0.027        6     36    0     40    0.009        0.027


Table 3.5: Comparison of the Pace and Ulenga methods with G2A3 for the analysis of single genome HIV-1 sequences (only hypermutated sequences are shown). Values are the number of APOBEC3-induced mutations detected by each method.

Seq ID     G2A3   Pace method   Ulenga method   After indels removed (Ulenga and Pace methods)
548_20L    15     13            13              15
548_9L     11     11            11              11
548_3L     15     14            14              15
548_7L     12     13            13              12
548_27L    11     12            12              11
548_5L     7      6             6               7
548_12L    21     20            20              21
548_25L    17     20            20              17
548_28L    6      3             3               6
548_16L    3      3             3               3


Discussion

In alignment-based methods, gaps in the sequences that result from insertions/deletions (indels) can significantly affect the calculated number of target and non-target sites and also the calculated frequencies of mutations within these sites. In this Chapter, I identify and report several of these errors that affect the outcome of hypermutation analysis by alignment-based methods. These errors would adversely impact any biological inference that relies on the proportion and extent of hypermutated sequences. To address these types of errors, I developed a method (G2A3) that correctly identifies hypermutation signatures using a heuristic algorithm. I compared G2A3 with three existing methods: Hypermut [14, 21], Pace [15] and Ulenga [16].

Considering that Hypermut is the most comprehensive and widely used alignment-based method, I explain errors in the hypermutation analysis by referring to this program. This program compares a reference sequence with one or more query sequence(s) in an alignment by quantifying APOBEC3-induced G-to-A mutations and other mutations that are unrelated to APOBEC3. Based on this, the program calculates a probability (P value) associated with hypermutation. The lower the P value is, the more likely it is that the query sequence is hypermutated. Usually a given query sequence with P value < 0.05 is considered hypermutated. This study has shown that alignment-based methods are sensitive to indels, which are an unavoidable feature of any alignment. It is important to note that the indels I am referring to in this Chapter are not due to a poor alignment (see Figs. 3.1 and 3.5) that could be rectified before the analysis by programs such as Hypermut. These differences between the reference and query sequences result from natural insertions/deletions during viral replication.
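To make the role of the P value concrete, the following is a minimal sketch of a one-sided Fisher's exact test on the four site counts. It assumes the 2 x 2 table is arranged with target/non-target sites as rows and mutated/non-mutated sites as columns, and that T+M- and T-M- are counts of non-mutated sites; the function name `fisher_exact_greater` is hypothetical. The exact P values reported in the tables may come from a different variant of the test, so the numbers need not match exactly.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns the probability, with all margins fixed, of observing at least
    `a` counts in the top-left cell (here: mutated target sites).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return tail / total

# Counts (T+M+, T+M-, T-M+, T-M-) for sequence 548_28L taken from Table 3.3.
p_g2a3 = fisher_exact_greater(6, 67, 1, 86)      # G2A3 counts
p_hypermut = fisher_exact_greater(3, 66, 1, 86)  # Hypermut counts
print(p_g2a3 < 0.05, p_hypermut < 0.05)
```

Under these assumptions, the loss of three mutated target sites is enough to move the sequence from below to above the 0.05 threshold.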

I analysed two data sets of 454 and single genome viral sequences to illustrate how errors in detecting APOBEC3 target and non-target sites can lead to misclassification of sequences. Tables 3.3 and 3.4 show the single genome HIV-1 and 454 SIV sequences identified as hypermutated at P value < 0.05 using Hypermut and my proposed method, G2A3. In both cases a substantial number of sequences (4 out of 10 HIV-1 sequences and 5 out of 10 SIV sequences, highlighted in grey) are incorrectly identified using Hypermut. This is due to the incorrect assignment of sites shown in Tables 3.3 and 3.4. For example, the sequence 548_28L has 6 mutated target sites (T+M+), three of which are ignored by Hypermut due to the gaps in the alignment. Table 3.5 compares G2A3 with the Pace and Ulenga methods for the analysis of single genome HIV-1 sequences. With both the Pace and Ulenga methods, the number of APOBEC3-induced mutations is underestimated or overestimated, depending on the sequence.

When indels were manually corrected (see Tables 3.3 - 3.5, last column), all methods correctly identified the APOBEC3-mutated sites. For example, sequence 548_28L was identified as non-hypermutated by Hypermut (P value = 0.223). However, after the manual correction of gaps this sequence was correctly identified as hypermutated (P value = 0.044). This implies that indels need to be corrected first if alignment-based methods are used to investigate hypermutation.

Errors in the identification of hypermutated sequences can affect biological inferences, such as the link between the proportion of hypermutated sequences and viral load as a measure of viral control. G2A3 is a modified version of the Hypermut program that removes the impact of indels on the analysis of hypermutation by APOBEC3 enzymes.

In conclusion, when indels are present in the sequences, alignment-based methods either ignore APOBEC3 target and non-target sites or classify them incorrectly. This leads to an incorrect estimate of the frequency of GG-to-AG and GA-to-AA mutations. In this study, I identified and reported errors associated with indels in the analysis of hypermutated sequences. To avoid these errors, I proposed a method based on the Hypermut program and compared it, using two viral data sets, with three alignment-based methods.

References

1. Goila-Gaur, R. and K. Strebel, HIV-1 Vif, APOBEC, and intrinsic immunity. Retrovirology, 2008. 5: p. 51.

2. Harris, R.S. and M.T. Liddament, Retroviral restriction by APOBEC proteins. Nat Rev Immunol, 2004. 4(11): p. 868-77.


3. Hultquist, J.F., et al., Human and rhesus APOBEC3D, APOBEC3F, APOBEC3G, and APOBEC3H demonstrate a conserved capacity to restrict Vif-deficient HIV-1. J Virol, 2011. 85(21): p. 11220-34.

4. Ooms, M., et al., HIV-1 Vif adaptation to human APOBEC3H haplotypes. Cell Host Microbe, 2013. 14(4): p. 411-21.

5. Ooms, M., et al., The resistance of human APOBEC3H to HIV-1 NL4-3 molecular clone is determined by a single amino acid in Vif. PLoS One, 2013. 8(2): p. e57744.

6. Reuman, E.C., et al., A classification model for G-to-A hypermutation in hepatitis B virus ultra-deep pyrosequencing reads. Bioinformatics, 2010. 26(23): p. 2929-32.

7. Wang, X., et al., Analysis of human APOBEC3H haplotypes and anti-human immunodeficiency virus type 1 activity. J Virol, 2011. 85(7): p. 3142-52.

8. Chiu, Y.L. and W.C. Greene, The APOBEC3 cytidine deaminases: an innate defensive network opposing exogenous retroviruses and endogenous retroelements. Annu Rev Immunol, 2008. 26: p. 317-53.

9. Jern, P., et al., Role of APOBEC3 in genetic diversity among endogenous murine leukemia viruses. PLoS Genet, 2007. 3(10): p. 2014-22.

10. Holmes, R.K., et al., APOBEC3F can inhibit the accumulation of HIV-1 reverse transcription products in the absence of hypermutation. Comparisons with APOBEC3G. J Biol Chem, 2007. 282(4): p. 2587-95.

11. Liddament, M.T., et al., APOBEC3F properties and hypermutation preferences indicate activity against HIV-1 in vivo. Curr Biol, 2004. 14(15): p. 1385-91.

12. Mangeat, B., et al., Broad antiretroviral defence by human APOBEC3G through lethal editing of nascent reverse transcripts. Nature, 2003. 424(6944): p. 99-103.

13. Pathak, V.K. and H.M. Temin, Broad spectrum of in vivo forward mutations, hypermutations, and mutational hotspots in a retroviral shuttle vector after a single replication cycle: deletions and deletions with insertions. Proc Natl Acad Sci U S A, 1990. 87(16): p. 6024-8.

14. Rose, P.P. and B.T. Korber, Detecting hypermutations in viral sequences with an emphasis on G --> A hypermutation. Bioinformatics, 2000. 16(4): p. 400-1.

15. Pace, C., et al., Population level analysis of human immunodeficiency virus type 1 hypermutation and its relationship with APOBEC3G and vif genetic variation. J Virol, 2006. 80(18): p. 9259-69.


16. Ulenga, N.K., et al., The level of APOBEC3G (hA3G)-related G-to-A mutations does not correlate with viral load in HIV type 1-infected individuals. AIDS Res Hum Retroviruses, 2008. 24(10): p. 1285-90.

17. Oliver, A., et al., Hypermutation and the preexistence of antibiotic-resistant Pseudomonas aeruginosa mutants: implications for susceptibility testing and treatment of chronic infections. Antimicrob Agents Chemother, 2004. 48(11): p. 4226-33.

18. Ebrahimi, D., H. Alinejad-Rokny, and M.P. Davenport, Insights into the motif preference of APOBEC3 enzymes. PLoS One, 2014. 9(1): p. e87679.

19. Kitamura, K., et al., Uracil DNA glycosylase counteracts APOBEC3G-induced hypermutation of hepatitis B viral genomes: excision repair of covalently closed circular DNA. PLoS Pathog, 2013. 9(5): p. e1003361.

20. Armitage, A.E., et al., Conserved footprints of APOBEC3G on Hypermutated human immunodeficiency virus type 1 and human endogenous retrovirus HERV-K(HML2) sequences. J Virol, 2008. 82(17): p. 8743-61.

21. Rose, P.P. and B.T. Korber, Hypermut: Analysis & Detection of APOBEC-induced Hypermutation. 2010. Available from: www.hiv.lanl.gov/content/sequence/hypermut.

22. Fisher, R.A., On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 1922. 85(1): p. 87-94.

23. Fisher, R.A., Statistical methods for research workers. 1934, Oliver and Boyd, Edinburgh.

24. Ebrahimi, D., F. Anwar, and M.P. Davenport, APOBEC3 has not left an evolutionary footprint on the HIV-1 genome. J Virol, 2011. 85(17): p. 9139-46.

25. Centers for Disease Control, Update on acquired immune deficiency syndrome (AIDS) among patients with hemophilia A. MMWR Morb Mortal Wkly Rep, 1982. 31(48): p. 644.

26. De Rose, R., et al., Control of viremia and prevention of AIDS following immunotherapy of SIV-infected macaques with peptide-pulsed blood. PLoS Pathog, 2008. 4(5): p. e1000055.

27. Reece, J., et al., An "escape clock" for estimating the turnover of SIV DNA in resting CD4(+) T cells. PLoS Pathog, 2012. 8(4): p. e1002615.

28. Batorsky, R., et al., Estimate of effective recombination rate and average selection coefficient for HIV in chronic infection. Proc Natl Acad Sci U S A, 2011. 108(14): p. 5661-6.


Chapter 4:

Insights into the motif preference of APOBEC3 enzymes using multivariate analysis of full genome HIV-1

Publication details:

D. Ebrahimi, H. Alinejad-Rokny and M. Davenport. (2014). "Insights into the Motif Preference of APOBEC3 Enzymes", PLOS ONE, 9(1), pp. e87679.

Author contributions to thesis Chapter 4:

HA-R: Made modifications to the original publication (written by DE), including the introduction, presentation of methods, results and discussion, for presentation as a chapter here. Also performed some additional analyses not included in the original publication. DE, VV and MPD: Revised the chapter. DE and HA-R: Conceived and designed the experiments. HA-R and DE: Contributed reagents/materials and designed and implemented the computational algorithms, and performed the bioinformatics analysis. DE and HA-R: Analysed the data. HA-R: Created all Tables and Figures.

Author contributions to publication: DE and HA-R: Conceived and designed the experiments. HA-R and DE: Analysed the data. HA-R and DE: Contributed reagents/materials/analysis tools. DE: Wrote the paper. DE, MPD and HA-R: Revised the paper. HA-R: Created all Figures and Tables.


Abstract

The human genome encodes seven APOBEC3 genes, the products of some of which are known to induce G-to-A mutations in the HIV genome. Mutation by these enzymes is sequence context dependent. APOBEC3G preferentially mutates G within GG, TGG and TGGG, while for APOBEC3F, GA, TGA and TGAA have been reported as preferred targets. The mutagenic impact of the remaining members of this family is less studied, but they seem to mutate G within GA dinucleotides. Studies so far have usually identified targeted motifs using in-vitro or ex-vivo experiments, and usually within short sequence fragments. Here I perform a comprehensive analysis to investigate the motif preference and possible sequence hierarchy of mutation by APOBEC3 enzymes using a large number of full genome HIV-1 sequences from a wide range of naturally infected patients. I developed a data matrix decomposition approach to discriminate between normal and hypermutated sequences in terms of the representation of mono- to tetranucleotide motifs. This allowed the identification of motifs associated with hypermutation by different APOBEC3 enzymes.


Introduction

The human genome encodes a family of seven genes called APOBEC3 (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3). These proteins are packaged into nascent virions along with the HIV viral RNA and, upon release into a newly infected cell, induce C-to-U mutations in the minus strand of the HIV genome during reverse transcription. These mutations manifest as G-to-A changes in the HIV viral RNA. The HIV sequences targeted by these enzymes are referred to as ‘hypermutated’ [1-3]. APOBEC3 proteins induce G-to-A mutations within specific target sites. The preferred positive strand target for APOBEC3G is GG (the 5′ G is the targeted nucleotide). APOBEC3G also preferentially targets Gs with preceding T nucleotides [4]. Other enzymes, including APOBEC3F, preferentially target G within GA [3]. It has also been reported that APOBEC3F preferentially targets positions within longer motifs (e.g. TGA and TGAA) [5].

Several studies have described that G-to-A mutations by APOBEC3G/3F are not random and they display different target sequence preferences in in-vivo or ex-vivo experiments [5-7]. For example, several studies reported that APOBEC3G targeted GGG trinucleotides more than GGC 3-mers and APOBEC3F targeted GAA 3-mers more than GAC 3-mers [3, 4, 7]. This suggests that the GG and GA sites flanked by a 3′C are disfavoured APOBEC3 target sites. Armitage et al. considered this motif preference and excluded, from the analysis, GG and GA motifs flanked by C [3, 4, 7].

APOBEC3-induced mutations in hypermutated viral sequences are usually detected by aligning the sequences and then comparing the query sequences to a reference sequence. In several studies, hypermutated sequences were aligned to a constructed reference in order to identify APOBEC3G and APOBEC3F target motifs [4, 8]. Commonly, a multiple alignment is needed to provide a consensus sequence that can be used as a reference sequence. This consensus/reference sequence might not be the original ancestral sequence of the hypermutated sequences. Using alignment to detect hypermutated sequences is very useful when a parental reference sequence is available. However, when the parental sequence is not available (e.g., in individuals naturally infected with HIV and/or SIV), an alignment method may not be a suitable way to analyse hypermutated sequences. Therefore it is important to develop a method that can identify and quantify hypermutated sequences without needing to align to a reference sequence. Previous studies [9, 10] have shown that the use of the frequency of motifs, instead of motif representation, is misleading. For example, the two tetranucleotide motifs TGGG and GGGG both contain the APOBEC3G target site GG. Motif TGGG has two overlapping GG target sites and motif GGGG has three. Therefore GGGG is expected to be targeted by APOBEC3G at a higher rate than TGGG. This important difference in the nucleotide contents of motifs is ignored when ‘frequency’ is used instead of ‘representation’.
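The overlap argument can be checked with a short sketch; the helper name `count_overlapping` is hypothetical and is used here only for illustration.

```python
def count_overlapping(sequence, motif):
    """Count occurrences of `motif` in `sequence`, allowing overlaps."""
    width = len(motif)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == motif
    )

# Number of GG target sites inside each tetranucleotide.
for tetramer in ("TGGG", "GGGG"):
    print(tetramer, count_overlapping(tetramer, "GG"))
```

Note that Python's built-in `str.count` counts non-overlapping matches only, which is why an explicit sliding window is used.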

Studies have shown that a more accurate estimation of motif representation can be acquired using Markov models of conditional probabilities [10, 11]. In these methods [10, 11] the expected frequency of a motif is estimated using the observed frequencies of the motif constituents, considering the overlapping nucleotide(s).

Motifs such as GG and GGG, which are preferentially targeted and mutated by APOBEC3, are less represented in the genomes of hypermutated sequences than in those of non-hypermutated sequences (for product motifs such as AG and AGG the opposite is true). Therefore the representation of these motifs can be used to investigate hypermutation. Here I quantify the representation of short-sequence motifs and use a multivariate analysis to identify the motif preference of APOBEC3 proteins. In this study, I investigated the representation of all k‐mers up to length 4 (340 in total; A, C… AA, AC… TTTT) to identify those that are underrepresented (i.e. targeted motifs) or overrepresented (i.e. produced motifs) in the genomes of hypermutated HIV sequences compared to those of normal (non-hypermutated) HIV sequences. I used 2047 full genome HIV-1 sequences, including 54 hypermutated viral sequences, from the subtypes A1, B, C and the recombinant 01_AE. These sequences came from naturally infected patients and were obtained from the Los Alamos National Laboratory (LANL) database. I then computed the representation (also known as D-ratio) of all 340 k-mer motifs in those genomes. To identify motifs that are related to hypermutation, I used an analysis approach that reveals the association between HIV-1 sequences and motifs.
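As a quick check of the motif count, the 340 k-mers can be enumerated as follows. This is only an illustrative sketch; the function name `all_kmers` is hypothetical.

```python
from itertools import product

def all_kmers(max_k=4, alphabet="ACGT"):
    """List every k-mer over `alphabet` for k = 1 .. max_k, shortest first."""
    return [
        "".join(letters)
        for k in range(1, max_k + 1)
        for letters in product(alphabet, repeat=k)
    ]

motifs = all_kmers()
# 4 + 16 + 64 + 256 motifs of length 1 to 4.
print(len(motifs), motifs[0], motifs[-1])
```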

The representation data of 340 k-mer motifs in 2047 HIV sequences form a matrix in which k-mers and HIV sequences are variables and objects, respectively (see Fig. 4.1). Here each HIV-1 sequence is defined using 340 variables and each k-mer is described using 2047 HIV-1 sequences. The exploratory analysis of such a data matrix requires a multivariate approach such as Principal Component Analysis (PCA) [12]. This method reveals the differences among the HIV-1 subtypes in terms of the representation of k-mer motifs, as well as the differences among k-mer motifs in terms of the HIV-1 sequences. Most importantly, it enables the identification of k-mer motifs that are descriptive of the similarities and dissimilarities among HIV sequences and vice versa. For the purpose of this study, PCA is used to identify k-mer motifs that differ between normal and hypermutated HIV sequences and also to identify motifs that changed as a result of hypermutation.

Materials and methods

I defined ‘representation’ as the ratio of observed frequency (Pobs) of a motif over its expected frequency (Pexp) in the genome. The Pobs of a motif is defined as the number of times that motif appears in the sequence, divided by the total number of all motifs with the same length [13, 14]. The Pexp can be calculated in different ways.

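$$P_{exp}(CpGpT) = P_{obs}(C) \times P_{obs}(G) \times P_{obs}(T) \qquad \text{(Eq. 1)}$$

$$P_{exp}(CpGpT) = P_{obs}(CpG) \times P_{obs}(T) \qquad \text{(Eq. 2)}$$

$$P_{exp}(CpGpT) = P_{obs}(C) \times P_{obs}(GpT) \qquad \text{(Eq. 3)}$$

Examples of 1st and 2nd order Markov models are given in Equations 4 and 5.

$$P_{exp}(CpGpT) = \frac{P_{obs}(CpG) \times P_{obs}(GpT)}{P_{obs}(G)} \qquad \text{(Eq. 4)}$$

$$P_{exp}(CpGpTpA) = \frac{P_{obs}(CpGpT) \times P_{obs}(GpTpA)}{P_{obs}(GpT)} \qquad \text{(Eq. 5)}$$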

In this study I used 1st and 2nd order models to calculate the expected frequencies of trinucleotides and tetranucleotides, respectively. I then calculated for each motif its representation (D-ratio) by dividing the Pobs of the motif by its Pexp. This is shown using an example in Eq. 6. The expected probability of the tetranucleotide motif TGGG is computed using the observed probabilities of its trinucleotide (TGG, GGG) and dinucleotide (GG) constituents.

$$D(TGGG) = \frac{P_{obs}(TGGG)}{P_{exp}(TGGG)} = \frac{P_{obs}(TGGG)}{\dfrac{P_{obs}(TGG) \times P_{obs}(GGG)}{P_{obs}(GG)}} \qquad \text{(Eq. 6)}$$

The computed D-ratio in Eq. 6 is a pure ‘representation’ of TGGG, and importantly, it is independent of the changes that may have happened in TGG, GGG, and GG.
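The calculation in Eqs. 4 to 6 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in this study: the function names `observed_freqs` and `d_ratio` are hypothetical, the input is assumed to be a single ungapped sequence, and the toy sequence is made up for the example.

```python
from collections import Counter

def observed_freqs(sequence, k):
    """P_obs of every k-mer: its count divided by the number of k-mers of length k."""
    n = len(sequence) - k + 1
    counts = Counter(sequence[i:i + k] for i in range(n))
    return {kmer: count / n for kmer, count in counts.items()}

def d_ratio(sequence, motif):
    """D-ratio = P_obs / P_exp for a tri- or tetranucleotide motif.

    P_exp multiplies the P_obs of the two overlapping (k-1)-mers and divides
    by the P_obs of their shared (k-2)-mer, as in Eqs. 4 and 5.
    """
    k = len(motif)
    if k < 3:
        raise ValueError("this sketch covers motifs of length 3 or more")
    p_obs = observed_freqs(sequence, k).get(motif, 0.0)
    parts = observed_freqs(sequence, k - 1)
    overlap = observed_freqs(sequence, k - 2)
    p_exp = (
        parts.get(motif[:-1], 0.0)
        * parts.get(motif[1:], 0.0)
        / overlap.get(motif[1:-1], float("inf"))
    )
    # Undefined when a constituent is absent from the sequence.
    return p_obs / p_exp if p_exp > 0 else float("nan")

toy_sequence = "TGGGATGGGC"  # made-up example, not real HIV data
print(round(d_ratio(toy_sequence, "TGGG"), 4))
```

A D-ratio above 1 means the motif occurs more often than its constituents predict, and a value below 1 means it is underrepresented.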

PCA is a powerful statistical technique that is used to transform a set of observations of possibly correlated variables into a smaller number of uncorrelated variables, each of which is a linear combination of the original variables and is called a principal component (PC) [12, 15]. When there are a large number of variables in the data (i.e. high-dimensional datasets such as biological data sets), extracting information from the data can be difficult. As such, it is necessary to reduce the number of variables without losing useful information. In a PCA analysis, the number of principal components is less than or equal to the number of original variables. The first principal component has the largest possible variance, so it accounts for as much variation in the data as possible. The second PC accounts for the largest proportion of the variance not accounted for by the first PC, and each subsequent PC accounts for as much of the remaining variation as possible (i.e. the variation not covered by the previous components). Each observation (here, an HIV-1 sequence) is expressed as a linear combination of the first few PCs. A schematic of PCA applied to the motif representation data is shown in Fig. 4.1.


Figure 4.1: A schematic of principal component analysis applied to the motif representation data of HIV sequences.


How PCs can be obtained

Suppose that there are N different observations of a given set of random variables X = {X1, X2, …, XP}, organised in an N × P matrix. Each PC is then defined as a linear combination of X1, …, XP:

$$PC_1 = e_{11}X_1 + e_{12}X_2 + \dots + e_{1P}X_P$$

$$PC_2 = e_{21}X_1 + e_{22}X_2 + \dots + e_{2P}X_P$$

$$\vdots$$

$$PC_P = e_{P1}X_1 + e_{P2}X_2 + \dots + e_{PP}X_P$$

The coefficients (e) can be calculated from the eigenvalues and eigenvectors of the covariance matrix of the original dataset [12, 15]. The matrix that contains the e_ij coefficients is called the loading matrix, which determines the share of each variable in each PC. Another important matrix is the score matrix (with elements s_ij), which allows each observation to be expressed as a linear combination of PCs. For example, the i-th observation, O_i, can be expressed as:

$$O_i = s_{i1} PC_1 + s_{i2} PC_2 + \dots + s_{iP} PC_P$$
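The same relations can be summarised in matrix form. The block below is a sketch under the standard PCA conventions; the symbols X_c (the N × P data matrix after column centring), C (its covariance matrix), E (the matrix whose columns are the loading vectors e_i), lambda_i (the eigenvalues) and S (the score matrix) are introduced here for illustration only.

```latex
\begin{aligned}
C &= \frac{1}{N-1}\, X_c^{\top} X_c, \\
C\,\mathbf{e}_i &= \lambda_i\,\mathbf{e}_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_P \ge 0, \\
S &= X_c E, \qquad X_c = S E^{\top}, \\
\text{explained}_i &= 100 \times \frac{\lambda_i}{\sum_{j=1}^{P} \lambda_j}.
\end{aligned}
```

The identity X_c = S E^T holds when all P components are kept, because the loading vectors are orthonormal. Keeping only the first few columns of S and E gives the low-dimensional approximation that is plotted in score and loading plots.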

MATLAB has a built-in function for performing PCA with the following format:

[coeff, score, latent, tsquared, explained, mu] = pca(O);

where the input and output parameters are explained in Table 4.1.

Table 4.1: Input and output parameters of PCA analysis in Matlab 2015b.

     Parameter   Description
1    O           The dataset (a matrix of N*P)
2    coeff       Loading matrix (P*P)
3    score       Score matrix (N*P)
4    latent      Eigenvalues of the covariance matrix of O (P*1)
5    tsquared    Sum of squares of the standardised scores for each observation (N*1)
6    explained   Percentage of the total variance explained by each principal component (P*1)
7    mu          Estimated means of the variables in O


Here the scores matrix (HIV-1 sequences x principal components) describes the relationship (similarity/dissimilarity) between HIV-1 sequences in terms of latent variables (principal components) that are representative of the original variables (i.e. motifs). The loadings matrix (principal components x motifs) includes information about the similarity/dissimilarity between motifs in terms of latent variables (principal components) that are representative of the original objects (i.e. HIV-1 sequences).

To investigate the grouping of HIV sequences, columns of the scores matrix were plotted against one another. Similarly, by plotting different rows of the loadings matrix against one another, the groupings of motifs were investigated. The data were autoscaled as a pre-processing step. This was done by subtracting the column average from each value in the corresponding column of the matrix (i.e. each motif). After this centring step, I normalised the data by dividing the values in each column by the corresponding standard deviation.
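The autoscaling step can be sketched as follows. The function name `autoscale` is hypothetical, and the use of the sample standard deviation (denominator N - 1) is an assumption, since the text does not state which estimator was used. The toy matrix is made up for illustration.

```python
from statistics import mean, stdev

def autoscale(matrix):
    """Column-wise autoscaling of a list of rows.

    Each column (motif) is centred on its mean and divided by its sample
    standard deviation. Columns with zero variance are not handled here.
    """
    columns = list(zip(*matrix))
    means = [mean(column) for column in columns]
    sds = [stdev(column) for column in columns]
    return [
        [(value - mu) / sd for value, mu, sd in zip(row, means, sds)]
        for row in matrix
    ]

# Rows are sequences, columns are motif representations (made-up values).
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = autoscale(toy)
print(scaled)
```

After autoscaling, every motif has mean 0 and unit variance, so motifs with large raw variance do not dominate the principal components.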

The accession numbers of hypermutated sequences used in this study are listed in Table 4.2.

Table 4.2: Accession number of hypermutated sequences used in this study.

     Subtype   Accession numbers
1    A1        EF165366; EF165365; AF484484; AF457091; AF457076; AF457071; AF457057; FJ388907
2    B         EF165363; AY829213; AY037274; AY779556; AY781125; AY818643; AY818642; AY818641; AY531116; AY561241; FJ195087; FJ388922; FJ388897; JF689891; JF689888; JF689882; JF689881; JF689880; JF689878; JF689861; JF689858; JF689855; JN235961; EF178404
3    C         EF165360; EF165359; DQ275665; DQ164128; DQ164125; DQ164124; DQ164123; DQ056407; AY734561; AY734557; AY255828
4    01_AE     EF165361; AY945729; AY945723; AY945715; AY945714; AY358058; AY358055; AY358054; AY358053; GU201515; GU564226

Results

To examine whether there are principal components with information about the motif preference of hypermutated sequences, I considered all principal components up to 20.


As shown in Fig. 4.2 PC1 and PC20 have the maximum and minimum eigenvalues, respectively.

Figure 4.2: Principal components with their eigenvalue.

Identification of different HIV-1 subtypes

The subtype differences among the HIV-1 sequences result in a remarkable source of variation in the motif representation data. Fig. 4.3A shows the score plot of PC1 vs. PC2. This plot shows four different clusters of HIV-1 sequences, each of which includes one of the HIV-1 subtypes, confirming that the method can separate subtypes. An interesting feature is the difference between the scores of the cluster that includes subtype B sequences and those of the other clusters (subtypes A1, C and recombinant 01_AE). All subtype B sequences have positive scores on PC1, whereas the other subtypes have negative scores. PC2 carries further information about subtypes C, A1 and recombinant 01_AE. This PC separates the subtype C cluster, which has positive scores, from the A1 and 01_AE clusters, which have negative scores.


HIV-1 genetic diversity is generated by several factors, such as the error-prone viral-encoded polymerase [5, 16], host selective immune pressures [17], and frequent genomic recombination events during replication [18]. Overall, HIV-1 sequences are grouped into three major phylogenetic clusters: clade M (main), clade O (outlier), and clade N (non-M/non-O) [19-21]. Clade M, which is responsible for more than 90% of HIV cases, can be further sub-grouped into 10 different subtypes, A to K. Grouping of HIV-1 subtypes was originally based either on nucleotide sequences of multiple genes (e.g. gag, pol, and env) from the same isolates, sampled across different geographic regions, or on full-length HIV sequence analysis. Molecular epidemiological studies [22, 23] also indicated that subtype A viruses are predominant in central and eastern Africa and in eastern European countries formerly constituting the Soviet Union. Subtype B is dominant in Europe, North America, Japan, and Australia and is also common in several countries of northern Africa and the Middle East. Subtype C is the dominant form of HIV-1 in Southern Africa, Eastern Africa, India, and some regions of China [24]. Studies have shown evidence that HIV-1 subtypes have different phenotypic properties, such as coreceptor utilization [25-28], in-vitro replication fitness [29, 30], rate of disease progression [31-34], biology of transmission [35-38], and mutational motifs [39-41]. These differences often manifest as mutations in the HIV genes, especially in Env, which includes the envelope surface proteins gp120 and gp41 [42] that are responsible for binding to the host cell.

Lynch et al. [43] have shown that genomic differences between subtypes are not limited to a specific motif or group of motifs. They showed that these differences are detectable across the whole genomes of different subtypes. My analysis in Fig. 4.3B (loading plot of PC1 vs. PC2) also does not show distinct clusters of motifs for the different HIV-1 subtypes, which suggests that the differences among HIV-1 subtypes do not come from a different group of motifs for each subtype. To illustrate this, I aligned several whole-genome HIV-1 sequences (subtypes 01_AE, A1, B and C) to show that polymorphic sites are present everywhere in the HIV genome and are not confined to a particular motif (Fig. 4.4).

To find out more about the motif preferences of APOBEC3G and APOBEC3F, I also plotted other PCs.

Figure 4.3: Principal component analysis of the motif representation data of HIV-1 sequences: A) Scores plot (PC1 vs. PC2) of the motif representation data of HIV-1 sequences from subtypes/recombinant B, C, A1 and 01_AE. Each point is an HIV-1 full genome sequence. B) Loadings plot (PC1 vs. PC2) of the motif representation data of HIV-1 sequences from subtypes/recombinant B, C, A1 and 01_AE. Each point is a k-mer motif.

Figure 4.4: Multiple alignment of a randomly selected portion of HIV-1 sequences (Gag region, positions 300-650) from four different subtypes. Polymorphic sites are distributed evenly across the genome and are not confined to a specific motif (see colour coded consensus sequence above the graph). Green, red, blue and yellow bars indicate nucleotides T, A, C and G respectively.

Identification of hypermutation by APOBEC3G

While PC1 and PC2 discriminated the HIV-1 subtypes, the next PCs provided valuable information about the APOBEC3G motif preference. Fig. 4.5A shows the score plot of PC3 vs. PC4 for HIV-1 sequences related to the four subtypes A1, C, B and 01_AE. This plot shows that there is a main cluster of sequences in the centre of the plot and several outlier sequences. The outlier sequences that are indicated by “H” are sequences that are hypermutated by APOBEC3G (these sequences are marked as hypermutated sequences in the Los Alamos National Laboratory (LANL) HIV database as well). As will be discussed in the next section, there are two other outlier sequences, which are marked as hypermutated sequences by APOBEC3F in the LANL HIV database.

Fig. 4.5B displays the loading plot of PC3 vs. PC4. As in the score plot of PC3 vs. PC4, there is a main cluster of motifs and several outlier motifs in two opposite orientations. Based on the hypermutated sequences identified in the score plot, these outlier motifs can be used to identify the motif preference of APOBEC3G. It is known that APOBEC3G mutates G within 2-mers GG that are not flanked by a 3’ C [3, 4]. The motifs GG, GGG, TGG and TGGG are favoured target motifs for APOBEC3G and emerge at the top right-hand side of the loading plot; in the opposite orientation, the product motifs AG, AGG, TAG and TAGG emerge at the bottom left-hand side of the loading plot. The motifs GC, TGC and TGGC also appear among the product motifs. These are disfavoured target motifs, and I explain in the Discussion section why they appear among the product motifs.
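How outlier sequences and outlier motifs are read from score and loading plots can be sketched in Python. This is only an illustration on made-up representation values for four motifs. The function `top_components`, the toy matrix, and the use of power iteration with deflation (chosen here to avoid external libraries, and not necessarily the PCA routine used for this chapter) are all assumptions of the sketch.

```python
import math


def top_components(rows, n_components=2, n_iter=500):
    """PCA by power iteration with deflation on the covariance matrix.

    rows: list of samples (sequences), each a list of motif representation values.
    Returns (scores, loadings): scores[i][k] is sample i on PC k+1,
    loadings[k][j] is the weight of motif j in PC k+1.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    x = [[r[j] - means[j] for j in range(p)] for r in rows]  # mean-centre each motif
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1) for b in range(p)]
           for a in range(p)]
    loadings = []
    for _ in range(n_components):
        # Deterministic, non-uniform start vector, normalised to unit length.
        v = [float(j + 1) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _ in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
        loadings.append(v)
        # Deflate: remove the variance already explained by this component.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(p)] for a in range(p)]
    scores = [[sum(xi[j] * pc[j] for j in range(p)) for pc in loadings] for xi in x]
    return scores, loadings


# Hypothetical representation values; columns are the motifs below.
motifs = ["GG", "AG", "GA", "AA"]
rows = [
    [1.00, 1.00, 1.00, 1.00],
    [1.02, 0.98, 1.01, 0.99],
    [0.98, 1.02, 0.99, 1.01],
    [1.01, 0.99, 1.00, 1.00],
    [0.99, 1.01, 1.00, 1.00],
    [0.60, 1.40, 1.00, 1.00],  # toy "hypermutated" sequence: GG depleted, AG enriched
]
scores, loadings = top_components(rows)
outlier = max(range(len(scores)), key=lambda i: abs(scores[i][0]))
print("outlier sequence index:", outlier)
print("PC1 loadings:", dict(zip(motifs, loadings[0])))
```

In this toy case the outlier sequence has the largest absolute score on the first component, and the target and product motifs (GG and AG) carry loadings of opposite sign, which mirrors the way the opposite orientations in a loading plot are interpreted. The sign of a component is arbitrary, so only relative orientation is meaningful.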

Figure 4.5: Principal component analysis of the motif representation data of HIV-1 sequences. A) Scores plot (PC3 vs. PC4) of the motif representation data of HIV-1 sequences from subtypes/recombinant B, C, A1 and 01_AE. Each point is an HIV-1 full genome. HIV-1 from subtypes B, C, A1 and the recombinant 01_AE are shown by orange, green, blue and red, respectively, and hypermutated sequences recognised by the Los Alamos HIV Sequence Database are indicated by “H”. B) Loadings plot (PC3 vs. PC4) of the motif representation data of HIV-1 sequences from subtypes/recombinant B, C, A1 and 01_AE. Each point is a k-mer motif. The motifs related to hypermutation by APOBEC3G appear as outliers.

Based on previous studies [4, 44], the motif GGGG was expected to be a good target motif for APOBEC3G, yet it appeared in the product part of the loading plot in Fig. 4.5B, as a disfavoured target would. To examine this motif in more detail, I investigated its positional distribution in the hypermutated HIV-1 sequences. The result, shown in Fig. 4.6, demonstrates that many G nucleotides are preserved at the central (c) and 3’ polypurine tracts (PPTs). PPTs are regions of the HIV genome that remain double stranded during reverse transcription and are therefore protected from APOBEC3G, which deaminates only single-stranded DNA [45-47].
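Locating every occurrence of a motif along a sequence, as done for GGGG in Fig. 4.6, can be sketched as follows. The helper `motif_positions` and the toy sequence are hypothetical; overlapping occurrences are reported separately, which is an assumption about how occurrences are counted.

```python
def motif_positions(seq, motif):
    """Return 0-based start positions of every (overlapping) motif occurrence."""
    seq, motif = seq.upper(), motif.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)  # step by one to allow overlaps
    return positions


# Hypothetical toy sequence containing a purine-rich stretch.
toy_seq = "ATCAAAAGAAAAGGGGGGACTGGTA"
positions = motif_positions(toy_seq, "GGGG")
print(positions)
```

Plotting such positions for each hypermutated genome, one row per sequence, would give a picture comparable in spirit to Fig. 4.6.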

To investigate the effect of the preservation of G nucleotides at the cPPT and 3’PPT on the high representation of the motif GGGG in the HIV genome, I calculated, for each APOBEC3G-hypermutated sequence, the percentage of occurrences of the motif GGGG and of two control motifs, TGGG and TGGC, that were mutated to a different motif across the full genome (as the non-hypermutated sequence, I used the reference sequence provided in the same study [4]). Fig. 4.7A shows the result of this comparison when the whole-genome HIV sequences include the polypurine tract regions (cPPT and 3’PPT), and Fig. 4.7B shows the result when these regions are excluded. The analysis confirmed that the preservation of G nucleotides at the cPPT and 3’PPT is not the source of the high representation of the motif GGGG, and that this motif is a disfavoured target motif for APOBEC3G.
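The comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to produce Fig. 4.7: the function `gg_mutated_percent` is hypothetical, the reference and hypermutated sequences are assumed to be gap-free and aligned position by position, and the normalisation by the number of GG sites within the motif is one plausible reading of the "mutability" correction.

```python
def gg_mutated_percent(ref, query, motif, masked=()):
    """Percentage of `motif` occurrences in `ref` carrying at least one G-to-A
    change at a G followed by G, divided by the number of GG sites in the motif.

    ref, query: equal-length, gap-free aligned sequences (assumption of this sketch).
    masked: iterable of (start, end) half-open intervals to exclude, e.g. PPTs.
    """
    if len(ref) != len(query):
        raise ValueError("ref and query must be aligned to equal length")
    k = len(motif)
    gg_sites = [i for i in range(k - 1) if motif[i:i + 2] == "GG"]
    if not gg_sites:
        raise ValueError("motif contains no GG dinucleotide")
    total = mutated = 0
    for start in range(len(ref) - k + 1):
        if ref[start:start + k] != motif:
            continue
        if any(start < e and start + k > s for s, e in masked):
            continue  # occurrence overlaps a masked region
        total += 1
        if any(query[start + i] == "A" for i in gg_sites):
            mutated += 1
    if total == 0:
        return 0.0
    return 100.0 * mutated / total / len(gg_sites)


# Hypothetical toy data: the second GGGG sits in a purine-rich, PPT-like stretch.
ref = "ATGGGGTCCAAAAGGGGA"
query = ref[:2] + "A" + ref[3:]   # a single G-to-A change at position 2
ppt_like = [(9, 18)]              # made-up coordinates for the masked region

whole = gg_mutated_percent(ref, query, "GGGG")
without_ppt = gg_mutated_percent(ref, query, "GGGG", masked=ppt_like)
control = gg_mutated_percent(ref, query, "TGGG")
print(whole, without_ppt, control)
```

The toy numbers are arbitrary and only show how masking the PPT-like region changes the denominator; they say nothing about the real data.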

Figure 4.6: Conserved polypurine tracts shown in a representative alignment of 54 hypermutated HIV-1 sequences used in this study. Each line is a full genome hypermutated sequence within which the positions of GGGG motifs are shown by small triangles.

[Figure 4.7 plot, panels (A) and (B): bar charts of the percentage of mutated motifs, normalized by the number of available GG (vertical axis, 0 to 50), for the motifs TGGG, GGGG and TGGC; the horizontal axis ("Sample") lists the hypermutated sequences 3G6, 3G7, 3G8, 3G33, 3G40, 3G72, 3G105, 3G108, 3G111 and 3G113.]

Figure 4.7: Percentage change of motifs within which G-to-A change(s) in the context of GG has occurred. The number of available GGs (i.e. mutability) has been used to normalise the results. A) Analysis of whole HIV genome; B) Analysis of the HIV genome without polypurine tracts; hypermutated sequences (see reference [4] for details) are shown on the horizontal axes.

Identification of hypermutation by APOBEC3F

Researchers previously reported that APOBEC3F mutates G-to-A preferentially within GA contexts [5, 48-51]. To identify the motif preference of APOBEC3F, I searched all principal components for those in which the 2-mer motifs GA and AA appear as outlier motifs in two opposite orientations. In the plot of PC11 vs. PC13, I identified the 2-mer motifs GA and AA as outlier motifs in the two opposite orientations of the loading plot.

In the score plot of Fig. 4.8, there is a main cluster of HIV-1 sequences and also two outlier sequences in the top left-hand side of the plot. I checked the representation of the motifs GA and AA in these two sequences and found motifs GA and AA appeared as underrepresented and overrepresented motifs in these sequences, respectively, when compared to other HIV-1 sequences (APOBEC3G-hypermutated and non-hypermutated sequences). The motifs GA and AA appeared in the two opposite orientations of the loading plot, suggesting that they should be target and product motifs, respectively.

Figure 4.8: Principal component analysis of the motif representation data of HIV-1 sequences. A) Scores plot (PC11 vs. PC13) of the motif representation data of HIV-1 sequences from subtypes/recombinant B, C, A1 and 01_AE. Each point is an HIV-1 full genome. HIV-1 from subtypes B, C, A1 and the recombinant 01_AE are shown by orange, green, blue and red, respectively. The accession numbers of the two sequences that extend in the direction of mutation by APOBEC3F are indicated adjacent to their points. B) Loadings plot (PC11 vs. PC13) of the motif representation data of HIV-1 sequences from subtypes/recombinant B, C, A1 and 01_AE. Each point is a k-mer motif. The arrow shows the direction of GA-to-AA mutation by APOBEC3F and other APOBEC3 proteins.

I also used the Hypermut 2 software [52] and analysed 54 HIV-1 sequences that were reported in the LANL database as hypermutated sequences. I found that only these two outlier sequences in Fig. 4.8A have a clear signature of GA-to-AA mutations (I also confirmed the Hypermut 2 result by another method known as Hypersign [10]). These analyses confirmed that PC11 vs. PC13 provides some information about APOBEC3F mutations.
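A context-based test for a GA-to-AA signature can be sketched as a 2x2 contingency table analysed with a one-sided Fisher exact test. The sketch below is written in the spirit of such tools and is not a reimplementation of Hypermut 2 or Hypersign: the function names, the use of the reference sequence to define the context, and the toy sequences are all assumptions.

```python
from math import comb


def ga_context_table(ref, query):
    """Count G-to-A changes at reference G sites followed by A (target context)
    versus G sites followed by any other base (control context).

    Returns (mutated_target, unmutated_target, mutated_control, unmutated_control).
    Simplification: the context is read from the reference sequence.
    """
    a = b = c = d = 0
    for i in range(len(ref) - 1):
        if ref[i] != "G":
            continue
        mutated = query[i] == "A"
        if ref[i + 1] == "A":
            a, b = a + mutated, b + (not mutated)
        else:
            c, d = c + mutated, d + (not mutated)
    return a, b, c, d


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]],
    testing enrichment of mutations in the first row (target context)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, min(row1, col1) + 1))
    return tail / comb(n, col1)


# Hypothetical toy sequences: 10 GA sites and 10 GC sites in the reference.
ref = "GA" * 10 + "GC" * 10
query = "AA" * 8 + "GA" * 2 + "AC" * 1 + "GC" * 9
table = ga_context_table(ref, query)
p_value = fisher_exact_greater(*table)
print(table, p_value)
```

A small p-value indicates that G-to-A changes are concentrated in the GA context rather than spread evenly over all G sites.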

In the loading plot of Fig. 4.8, there is a main cluster of motifs and also some outlier motifs. It is clear that the motifs GA and AA are the target and product motifs, respectively, of APOBEC3F (and/or other members of the APOBEC3 family). However, no other outlier motifs could be identified as favoured target and product motifs of APOBEC3F. The motif TGA, which seems to be a good target motif for APOBEC3F [4, 45], appeared in the lower right-hand side of the loading plot, but TAA, which is the product motif of TGA, did not appear as an outlier motif in the upper left-hand side of the plot. I therefore compared the D-representation of target and product motifs for all tri- and tetra-nucleotides between APOBEC3F-mutated, APOBEC3G-mutated, and non-hypermutated sequences. As shown in Fig. 4.9, the difference (DR_PRODUCT - DR_TARGET) between the representation of the target motifs TGA and GAT and that of their respective product motifs TAA and AAT is greater in HIV sequences hypermutated by APOBEC3F than in the other two groups. In contrast, for the control motif TGC there is no large difference between the D-representation of TGC and that of its potential product motif TAC in any of the three groups (see Fig. 4.9). The relatively small representation differences between TGA and TAA (compared to GA and AA) may explain why these two motifs did not appear as outlier motifs in the loading plot of PC11 vs. PC13; however, the detailed analysis (see Fig. 4.9) shows a significant difference between these two APOBEC3F target and product motifs. Therefore the motifs TGA and TAA are indeed favoured target and product motifs of APOBEC3F, respectively.
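The group comparison summarised above amounts to averaging the difference between product and target representation within each group of sequences. A minimal sketch follows; the function `mean_product_minus_target`, the group labels and all numerical values are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean


def mean_product_minus_target(dr_values, target, product):
    """Average (DR_product - DR_target) per group.

    dr_values: {group_name: [ {motif: D-representation}, ... one dict per sequence ]}
    """
    return {
        group: mean(seq[product] - seq[target] for seq in seqs)
        for group, seqs in dr_values.items()
    }


# Made-up D-representation values for the target TGA and its product TAA.
toy_dr = {
    "non-hypermutated": [{"TGA": 1.10, "TAA": 1.00}, {"TGA": 1.05, "TAA": 1.05}],
    "APOBEC3G": [{"TGA": 1.05, "TAA": 1.05}, {"TGA": 1.00, "TAA": 1.10}],
    "APOBEC3F": [{"TGA": 0.70, "TAA": 1.40}, {"TGA": 0.80, "TAA": 1.30}],
}
diffs = mean_product_minus_target(toy_dr, target="TGA", product="TAA")
print(diffs)
```

A group in which the target is depleted and the product enriched shows the largest positive difference, which is the pattern reported for the APOBEC3F-mutated sequences in Fig. 4.9.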

Figure 4.9: Average D-representation of the motifs TGA, TAA, TGC, TAC, GAT, AAT, GAC, and AAC in APOBEC3F-mutated, APOBEC3G-mutated and non-hypermutated HIV-1 sequences. Non-hypermutated, APOBEC3G-mutated and APOBEC3F-mutated sequences are indicated by black, brown and green, respectively.

PC11 vs. PC13 (and the other PCs as well) could not explain a large proportion of the APOBEC3F-related variation. This is because very few hypermutated sequences are marked as APOBEC3F-induced hypermutation in the database (and hence in my analysed data).

Other principal components are not discussed here since they only explained inter-sequence mutations resulting from general HIV diversity.

Discussion

Restriction factors are a potent arm of innate immune responses and are considered to be the first line of defence against HIV-1. APOBEC3 enzymes are important restriction factors, which control viral infection either by mutating the HIV-1 genome or by other, mutation-independent strategies [1, 50, 53-55]. APOBEC3G and APOBEC3F, as well-known members of the APOBEC3 family, play important roles in HIV inhibition by inducing G-to-A changes in the HIV positive strand in specific target motifs [4-8]. Several methods have been proposed to identify hypermutated sequences; however, in these methods the hypermutation signatures are characterised by aligning query sequences to a constructed reference sequence, which may not be the correct ancestral genome of the hypermutated sequences. Importantly, the expected “mutability” of motifs is ignored in these analyses. In this study, I aimed to identify the motif preferences of the APOBEC3G and APOBEC3F proteins using an unbiased mathematical method based on Markov models of conditional probabilities. I quantified the representation of all motifs up to 4-mers (340 motifs) for 2047 HIV-1 sequences, including 54 hypermutated sequences, and then used a multivariate analysis to recognise the motif preferences of these APOBEC3 enzymes.
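The figure of 340 motifs follows from counting all 1-mers to 4-mers (4 + 16 + 64 + 256). The sketch below enumerates them and computes an observed-to-expected ratio for a motif. The function names are hypothetical, and the form of the expectation (a maximal-order Markov estimate built from the overlapping sub-motifs, in the spirit of refs [13, 14]) is an assumption of this sketch rather than a restatement of the equations used in this chapter.

```python
from collections import Counter
from itertools import product


def all_motifs(max_k=4, alphabet="ACGT"):
    """All k-mers for k = 1..max_k."""
    return ["".join(p) for k in range(1, max_k + 1)
            for p in product(alphabet, repeat=k)]


def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}


def markov_representation(seq, motif):
    """Observed / expected frequency of `motif` (length >= 2).

    Assumed expectation: f(w1..wk) ~ f(w1..w(k-1)) * f(w2..wk) / f(w2..w(k-1)).
    For 2-mers this reduces to the product of the two mononucleotide frequencies.
    """
    k = len(motif)
    if k < 2:
        raise ValueError("motif must have length >= 2")
    observed = kmer_freqs(seq, k).get(motif, 0.0)
    sub = kmer_freqs(seq, k - 1)
    expected = sub.get(motif[:-1], 0.0) * sub.get(motif[1:], 0.0)
    if k > 2:
        middle = kmer_freqs(seq, k - 2).get(motif[1:-1], 0.0)
        expected = expected / middle if middle else 0.0
    return observed / expected if expected else float("nan")


motifs = all_motifs()
print(len(motifs))
# Hypothetical toy sequence, used only to exercise the function.
toy_seq = "ACGT" * 5
print(markov_representation(toy_seq, "CG"))
```

Applying such a function to every motif of every sequence yields a sequences-by-motifs matrix of the kind that is then passed to the multivariate analysis.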

The data set of this project includes both normal (non-hypermutated) and hypermutated HIV-1 sequences from four different HIV-1 subtypes; thus the motif representation of the dataset could potentially include multiple sources of variation (e.g. variation related to clade diversity between the HIV-1 sequences and variation related to hypermutation by APOBEC3 enzymes). In the first step, PCA returned a number of principal components describing major and independent sources of variation in the data. The PCA detected four sequence clusters, corresponding to the four HIV-1 subtypes. PC1 in Fig. 4.3A shows that HIV-1 sequences in clade B are different from the HIV-1 sequences from clades C, A1 and 01_AE, suggesting that clade B HIV-1 sequences may be biologically more distant from the other HIV-1 clades. Moreover, the complete grouping of HIV-1 sequences by subtype indicates that these sequences have distinct motif representation profiles. The result in Fig. 4.3B shows a single motif cluster with no obvious outlier, suggesting that the differences between HIV-1 clades are not due to a subgroup of motifs.

The 54 hypermutated sequences that were examined in this study show at least two different signatures, GG-to-AG and GA-to-AA, which were induced by APOBEC3G and APOBEC3F respectively. Therefore, in addition to the HIV-1 clade variations illustrated above, we expect to see at least two sources of variation due to APOBEC3 enzyme activities on the sequences. However, the majority of the hypermutated HIV-1 sequences examined in this study (52 sequences) were hypermutated by APOBEC3G; therefore we expect to see a greater contribution to the variance in PCs describing the impact of APOBEC3G than in PCs describing the effect of APOBEC3F.

Principal components other than PC1 and PC2 provided information about APOBEC3G footprints. The loading plot of PC3 vs. PC4 provides information about the APOBEC3G motif preference (Fig. 4.5B). In this plot, there is a main motif cluster and two groups of outlier motifs. The motifs in the first group are overrepresented in hypermutated sequences as a result of G-to-A mutations by APOBEC3G; they are either favoured product k-mers or disfavoured target k-mers. For example, the motifs AG, TA, TAG, AGG, AAG and TAAG are favoured product motifs (i.e. motifs produced as a result of APOBEC3G targeting of the motifs GG, TG, TGG, GGG, GGG and TGGG respectively). The motifs GC and GGC are disfavoured target motifs that appear in this group. In the opposite direction, there are underrepresented motifs that are either favoured target motifs or disfavoured product motifs for APOBEC3G. The 2-mer GC is not a product motif, yet it appears as an overrepresented motif; this is because the frequencies of GC and C (Eq. 4) do not change as a result of hypermutation, whereas the frequency of G is decreased by APOBEC3G activity, so the expected frequency of GC falls while its observed frequency stays the same.
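The argument about GC can be checked numerically. The sketch below assumes that the representation of a 2-mer is its observed frequency divided by the product of the two mononucleotide frequencies, which is consistent with the description above but is stated here as an assumption; the toy sequence and the rule "mutate every G that is followed by G" are likewise illustrative only.

```python
def dinucleotide_representation(seq, dinuc):
    """Observed frequency of `dinuc` divided by the product of the
    frequencies of its two nucleotides (assumed form of the ratio)."""
    n = len(seq)
    f_pair = sum(seq[i:i + 2] == dinuc for i in range(n - 1)) / (n - 1)
    f_first = seq.count(dinuc[0]) / n
    f_second = seq.count(dinuc[1]) / n
    return f_pair / (f_first * f_second)


def mutate_gg(seq):
    """Replace every G that is followed by G in the original sequence with A."""
    return "".join("A" if seq[i:i + 2] == "GG" else base
                   for i, base in enumerate(seq))


# Hypothetical toy sequence.
original = "TGGCATGGGCTAGCTTGGAC"
mutated = mutate_gg(original)

before = dinucleotide_representation(original, "GC")
after = dinucleotide_representation(mutated, "GC")
print(original.count("GC"), mutated.count("GC"))  # GC count is unchanged
print(before, after)                               # representation rises
```

Because a G immediately followed by C is never mutated under this rule, the number of GC dinucleotides is preserved while the pool of G nucleotides shrinks, and the ratio therefore increases.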

My analysis also indicated that the motif GGGG, which has previously been reported as a favoured target site for APOBEC3G, did not emerge as a target site but instead emerged as a disfavoured target site for the APOBEC3G protein. To further investigate why GGGG is overrepresented, despite being reported as a favoured target [4, 44], I determined the positional distribution of the motif GGGG within the genome of hypermutated sequences. The analysis revealed that this motif stays intact in the genome of hypermutated sequences in the 3’ and central polypurine tracts (PPTs) (Fig. 4.6), but can undergo mutation in other regions of the genome. The frequencies of G-to-A changes within GGGG and within two other 4-mers, TGGG and TGGC, as favoured and disfavoured target motifs, respectively, with and without the two polypurine tracts, are depicted in Fig. 4.7. The figure shows that the percentage of targeted GGGGs is comparable to that of TGGC, which is a disfavoured target motif. It is known that the HIV genome remains double stranded at the two PPT regions during reverse transcription [45, 46]. Therefore the G nucleotides in these areas are immune from mutation by APOBEC3, whose substrate is single-stranded DNA. Taken together, this indicates that the infrequent targeting of the motif GGGG by APOBEC3G is not merely a consequence of its conservation at the PPTs; rather, it is a general feature of this motif across the HIV-1 genome.

Other motifs, such as GC, GT, GGC and TGGC, also appeared as overrepresented motifs in the product section of Fig. 4.5B. This shows that these motifs, despite containing the nucleotide G, are not usually targeted by APOBEC3G. It is known that the presence of a 3’ flanking C in the second position (+2) after G significantly reduces the rate of APOBEC3-induced mutation [4, 45]. My analysis showed that motifs in which G is flanked by C at either the first or the second 3’ position (+1 or +2) are disfavoured by APOBEC3G.

Fig. 4.8B displays the motif preference of APOBEC3F. In this loading plot of PC11 vs. PC13, the product motif AA appears as an overrepresented motif in the same direction as the sequences hypermutated by APOBEC3F (see Fig. 4.8A), and the most preferred APOBEC3F target motif, GA, appears as an underrepresented motif in the opposite direction. Unlike for APOBEC3G, no other motifs appear as outlier motifs in Fig. 4.8B. However, further analysis of the representation values of motifs in hypermutated and non-hypermutated HIV-1 sequences showed that the target motifs TGA and GAT, and their respective product motifs TAA and AAT, are favoured target and product motifs of APOBEC3F, respectively (see Fig. 4.9).

Multivariate exploratory analysis of the motif representation data confirmed the previously reported motif preferences and hierarchy of APOBEC3G up to 4-mers. It showed that APOBEC3G targets G mainly within GG, TG, TGG, GGG, TGGG and also GGGT. The k-mer motif TAAG, not TAGG, was found to be the product of APOBEC3G-induced mutation within TGGG. In other words, APOBEC3G usually appears to target two Gs concurrently within its most preferred target, TGGG. Stretches of up to three G nucleotides were shown to be favoured targets of APOBEC3G, in particular when they are flanked by a T nucleotide at either end. The analysis also revealed that the motif-dependent mutation of G within the HIV genome by other members of the APOBEC3 family, such as APOBEC3F, was limited to GA-to-AA changes. My analysis indicated that APOBEC3F (as a representative of the other APOBEC3 family members) preferentially mutates G within GA, TGA and GAT. There were only two HIV-1 sequences hypermutated by APOBEC3F in my study, and these two sequences did not provide evidence of any other motif preference. However, Liddament et al. [5] reported that the motif preference of APOBEC3F may extend beyond dinucleotides, suggesting that other members of the APOBEC3 family may be responsible for the GA-to-AA mutations observed in the HIV-1 sequences hypermutated by APOBEC3F. My results also suggest that APOBEC3G has a stronger motif preference than APOBEC3F. In other words, APOBEC3G mutations strongly inhibit HIV-1 viral activity; in contrast, APOBEC3F mutations may assist viral diversification.

The analysis also showed that the motif GA, which is the specific target motif of APOBEC3F, is also the second preferred target site of APOBEC3G. This may suggest that in most patients the dominant enzyme targeting HIV-1 for hypermutation is either APOBEC3G or one (or a few) of the other known enzymes, APOBEC3D, APOBEC3F and APOBEC3H. It is therefore not unexpected to observe zero or very few co-mutated viral sequences if, in the majority of individuals, either APOBEC3G or other APOBEC3 members are ‘active’ against HIV-1. It may also indicate that mutation by none of the APOBEC3 enzymes is exclusively motif specific.

In conclusion, a comprehensive and unbiased mathematical method for analysing the mutational footprints of APOBEC3G and APOBEC3F may help researchers to identify the source of these footprints and, in turn, the mechanisms responsible for them.

References

1. Goila-Gaur, R. and K. Strebel, HIV-1 Vif, APOBEC, and intrinsic immunity. Retrovirology, 2008. 5: p. 51.

2. Harris, R.S. and M.T. Liddament, Retroviral restriction by APOBEC proteins. Nat Rev Immunol, 2004. 4(11): p. 868-77.

3. Hultquist, J.F., et al., Human and rhesus APOBEC3D, APOBEC3F, APOBEC3G, and APOBEC3H demonstrate a conserved capacity to restrict Vif-deficient HIV-1. J Virol, 2011. 85(21): p. 11220-34.

4. Armitage, A.E., et al., Conserved footprints of APOBEC3G on Hypermutated human immunodeficiency virus type 1 and human endogenous retrovirus HERV-K(HML2) sequences. J Virol, 2008. 82(17): p. 8743-61.

5. Liddament, M.T., et al., APOBEC3F properties and hypermutation preferences indicate activity against HIV-1 in vivo. Curr Biol, 2004. 14(15): p. 1385-91.

6. Beale, R.C., et al., Comparison of the differential context-dependence of DNA deamination by APOBEC enzymes: correlation with mutation spectra in vivo. J Mol Biol, 2004. 337(3): p. 585-96.

7. Bishop, K.N., et al., Cytidine deamination of retroviral DNA by diverse APOBEC proteins. Curr Biol, 2004. 14(15): p. 1392-6.

8. Kijak, G.H., et al., Variable contexts and levels of hypermutation in HIV-1 proviral genomes recovered from primary peripheral blood mononuclear cells. Virology, 2008. 376(1): p. 101-11.

9. Ebrahimi, D., et al., APOBEC3 has not left an evolutionary footprint on the HIV-1 genome. J Virol, 2011. 85(17): p. 9139-46.

10. Ebrahimi, D., et al., APOBEC3G and APOBEC3F rarely co-mutate the same HIV genome. Retrovirology, 2012. 9: p. 113.

11. Anwar, F., et al., Footprint of APOBEC3 on the genome of human retroelements. J Virol, 2013. 87(14): p. 8195-204.

12. Ma, S. and Y. Dai, Principal component analysis based methods in bioinformatics studies. Brief Bioinform, 2011. 12(6): p. 714-22.

13. Burge, C., et al., Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci U S A, 1992. 89(4): p. 1358-62.

14. Leung, M.Y., et al., Over- and underrepresentation of short DNA words in herpesvirus genomes. J Comput Biol, 1996. 3(3): p. 345-60.

15. Jolliffe, I.T., Principal Component Analysis. Springer Series in Statistics. 2002, Berlin: Springer. 448.

16. Pathak, V.K. and H.M. Temin, Broad spectrum of in vivo forward mutations, hypermutations, and mutational hotspots in a retroviral shuttle vector after a single replication cycle: Substitutions, frameshifts, and hypermutations. Proc Natl Acad Sci USA, 1990. 87(16): p. 6019-23.

17. Michael, N.L., Host genetic influences on HIV-1 pathogenesis. Curr Opin Immunol, 11(4): p. 466-74.

18. Temin, H.M., Retrovirus variation and reverse transcription: abnormal strand transfers result in retrovirus genetic variation. Proc Natl Acad Sci USA, 1993. 90(15): p. 6900-3.

19. Ayouba, A., et al., HIV-1 group N among HIV-1-seropositive individuals in Cameroon. AIDS, 2000. 14(16): p. 2623-5.

20. Gurtler, L., et al., A new subtype of human immunodeficiency virus type 1 (MVP-5180) from Cameroon. J Virol, 1994. 68(3): p. 1581-5.

21. Simon, F., et al., Identification of a new human immunodeficiency virus type 1 distinct from group M and group O. Nat Med, 1998. 4(9): p. 1032-7.

22. Hemelaar, J., et al., Global and regional distribution of HIV-1 genetic subtypes and recombinants in 2004. AIDS, 2006. 20(16): p. W13-W23.

23. Taylor, B.S., et al., The challenge of HIV-1 subtype diversity. N Engl J Med, 2008. 358(15): p. 1590-602.

24. Taylor, B.S., et al., The challenge of HIV-1 subtype diversity. N Engl J Med, 2008. 358(15): p. 1590-602.

25. Bjorndal, A., et al., Phenotypic characteristics of human immunodeficiency virus type 1 subtype C isolates of Ethiopian AIDS patients. AIDS Res Hum Retroviruses, 1999. 15(7): p. 647-53.

26. De Wolf, F., et al., Syncytium-inducing and non-syncytium-inducing capacity of human immunodeficiency virus type 1 subtypes other than B: phenotypic and genotypic characteristics. WHO network for HIV Isolation and characterization. AIDS Res Hum Retroviruses, 1994. 10(11): p. 1387-400.

27. Choge, I., et al., Genotypic and phenotypic characterization of viral isolates from HIV-1 subtype C-infected children with slow and rapid disease progression. AIDS Res Hum Retroviruses, 2006. 22(5): p. 458-65.

28. Zhong, P., et al., Genetic and biological properties of HIV type 1 isolates prevalent in villagers of the Cameroon equatorial rain forests and grass fields: Further evidence of broad HIV type 1 genetic diversity. AIDS Res Hum Retroviruses, 2003. 19(12): p. 1167-78.

29. Ball, S.C., et al., Comparing the ex vivo fitness of CCR5-tropic human immunodeficiency virus type 1 isolates of subtypes B and C. J Virol, 2003. 77(2): p. 1021-38.

30. Marozsan, A.J., et al., Differences in the fitness of two diverse wild-type human immunodeficiency virus type 1 isolates are related to the efficiency of cell binding and entry. J Virol, 2005. 79(11): p. 7121-34.

31. Vasan, A., et al., Different rates of disease progression of HIV type 1 infection in Tanzania based on infecting subtype. Clin Infect Dis, 2006. 42(6): p. 843-52.

32. Baeten, J.M., et al., HIV-1 subtype D infection is associated with faster disease progression than subtype A in spite of similar plasma HIV-1 loads. J Infect Dis, 2007. 195(8): p. 1177-80.

33. Rangsin, R., et al., The natural history of HIV-1 subtype E infection in young men in Thailand with up to 14 years of follow-up. AIDS, 2007. 21(Suppl. 6): p. S39-S46.

34. Kiwanuka, N., et al., Effect of human immunodeficiency virus Type 1 (HIV-1) subtype on disease progression in persons from Rakai, Uganda, with incident HIV-1 infection. J Infect Dis, 2008. 197(5): p. 707-13.

35. Chohan, B., et al., Selection for human immunodeficiency virus type 1 envelope glycosylation variants with shorter V1–V2 loop sequences occurs during transmission of certain genetic subtypes and may impact viral RNA levels. J Virol, 2005. 79(10): p. 6528-31.

36. Frost, S.D., et al., Constrained neutralization during transmission of HIV-1 subtype B. J Virol, 2005. 79(10): p. 6523-7.

37. Derdeyn, C.A., et al., Envelope-constrained neutralization-sensitive HIV-1 after heterosexual transmission. Science, 2004. 303(5666): p. 2019-22.

38. Krachmarov, C.P., et al., Factors determining the breadth and potency of neutralization by V3-specific human monoclonal antibodies derived from subjects infected with clade A or clade B strains of human immunodeficiency virus type 1. J Virol, 2006. 80(14): p. 7127-35.

39. Travers, S.A., et al., Evidence for heterogeneous selective pressures in the evolution of the env gene in different human immunodeficiency virus type 1 subtypes. J Virol, 2005. 79(3): p. 1836-41.

40. Choisy, M., Comparative study of adaptive molecular evolution in different human immunodeficiency virus groups and subtypes. J Virol, 2004. 78(4): p. 1962-70.

41. Korber, B.T., et al., Mutational trends in V3 loop protein sequences observed in different genetic lineages of human immunodeficiency virus type 1. J Virol, 1994. 68(10): p. 6730-44.

42. Gaschen, B., et al. Diversity considerations in HIV-1 vaccine selection. Science, 2002. 296(5577): p. 2354-60.

41. Felsovalyi, K., et al., Distinct sequence patterns characterize the V3 region of HIV type 1 gp120 from subtypes A and C. AIDS Res Hum Retroviruses, 2006. 22(7): p. 703-8.

42. Hunter, E., Viral entry and receptors. Cold Spring Harbor Laboratory Press; Plainview, NY: 1997. p. 71-121.

43. Lynch, R.M., et al., Appreciating HIV type 1 diversity: subtype differences in Env. AIDS Res Hum Retroviruses, 2009. 25(3): p.237-48.

44. Zheng, Y.H., et al., Human APOBEC3F is another host factor that blocks human immunodeficiency virus type 1 replication. J Virol, 2004. 78(11): p. 6073-6.

45. Yu, Q., et al., Single-strand specificity of APOBEC3G accounts for minus-strand deamination of the HIV genome. Nat Struct Mol Biol, 2004. 11(5): p. 435-42.

46. Suspene, R., et al., Twin gradients in APOBEC3 edited HIV-1 DNA reflect the dynamics of lentiviral replication. Nucleic Acids Res, 2006. 34(17): p. 4677-84.

47. Jern, P., et al., Likely role of APOBEC3G-mediated G-to-A mutations in HIV-1 evolution and drug resistance. PLoS Pathog, 2009. 5(4): p. e1000367.

48. Jern, P., et al., Role of APOBEC3 in genetic diversity among endogenous murine leukemia viruses. PLoS Genet, 2007. 3(10): p. 2014-22.

49. Chiu, Y.L. and W.C. Greene, The APOBEC3 cytidine deaminases: an innate defensive network opposing exogenous retroviruses and endogenous retroelements. Annu Rev Immunol, 2008. 26: p. 317-53.

50. Holmes, R.K., et al., APOBEC3F can inhibit the accumulation of HIV-1 reverse transcription products in the absence of hypermutation. Comparisons with APOBEC3G. J Biol Chem, 2007. 282(4): p. 2587-95.

51. Mangeat, B., et al., Broad antiretroviral defence by human APOBEC3G through lethal editing of nascent reverse transcripts. Nature, 2003. 424(6944): p. 99-103.

52. Rose, P.P.K. and B.T. Korber, Hypermut, Analysis & Detection of APOBEC-induced Hypermutation. 2010; Available from: www.hiv.lanl.gov/content/sequence/hypermut.

53. Holmes, R.K., et al., APOBEC-mediated viral restriction: not simply editing? TIBS, 2007. 32(3): p. 118-28.

54. Iwatani, Y., et al., Deaminase-independent inhibition of HIV-1 reverse transcription by APOBEC3G. Nucleic Acids Res, 2007. 35(21): p. 7096-108.

55. Bishop, K.N., et al., APOBEC3G inhibits elongation of HIV-1 reverse transcripts. PLoS Pathog, 2008. 4(12): p. e1000231.

Chapter 5:

Source of CpG depletion in the HIV-1 genome

Publication details:

H. Alinejad-Rokny, F. Anwar, S. Waters, M. Davenport and D. Ebrahimi. (2016). "Source of CpG Depletion in the HIV-1 Genome", Molecular Biology and Evolution, 33(12): 3205-3212.

Author contributions to thesis Chapter 5:

HA-R, DE and MPD: Conceived and designed the experiments. HA-R: Designed and implemented the computational algorithms, and performed the bioinformatics analysis and interpretation under the supervision of DE and MPD. HA-R: Wrote the chapter. DE and MPD: Read and revised the chapter. HA-R: Created all Tables and Figures.

Abstract

The dinucleotide CpG is highly underrepresented in the genome of human immunodeficiency virus type 1 (HIV-1). To identify the source of CpG depletion in the HIV-1 genome I investigated two biological mechanisms: a) CpG methylation-induced transcriptional silencing and b) CpG recognition by Toll-like receptors (TLRs). I hypothesized that HIV-1 has been under selective evolutionary pressure by these mechanisms, leading to the reduction of CpG in its genome. A CpG-depleted genome would enable HIV-1 to avoid methylation-induced transcriptional silencing and/or to avoid recognition by TLRs that identify foreign CpG sequences. I investigated these hypotheses by determining the sequence context dependency of CpG depletion and comparing it with that of CpG methylation. I found that in both the human and HIV-1 genomes the CpG motifs flanked by T/A were depleted most and those flanked by C/G were depleted least. Similarly, my analyses of human methylome data revealed that the CpG motifs flanked by T/A were methylated most and those flanked by C/G were methylated least. Given that a similar CpG depletion pattern was observed for the human genome, within which CpGs are not likely to be recognized by TLRs, I argue that the main source of CpG depletion in HIV-1 is likely host-induced methylation. Analyses of CpG motifs in over 100 viruses revealed that this unique CpG representation pattern is specific to the human and simian immunodeficiency viruses.

Introduction

In the human immunodeficiency virus type 1 (HIV-1) genome the frequency of CpG is much lower than expected (see Fig. 5.1) based on the frequencies of its constituent mononucleotides C and G [1, 2]. Two hypotheses were considered to explain the negative selection against CpG. The first hypothesis is that HIV-1 has evolved under a selection pressure from the host methylation machinery, leading to a CpG-depleted genome, which enables HIV-1 to avoid methylation-induced transcriptional silencing [3]. The second hypothesis is that there are host factors, such as toll-like receptors (TLRs), that identify and/or target CpG within foreign genomes [4-11], thus putting pressure on HIV-1 to reduce its CpG level.

DNA methylation plays an important role in the regulation of gene expression. Methylation of CpG islands within gene promoters often leads to transcriptional suppression [12]. However, contradictory data exist with regard to the correlation between transcriptional suppression and promoter methylation in HIV-1 [13-19].

Methylation of cytosine within a CpG dinucleotide significantly increases its mutation rate through spontaneous deamination to thymine [20, 21]. As such, CpG depletion of the human genome is attributed to methylation-induced deamination of C to T within CpG [22, 23]. However, it is not clear whether this mechanism is also responsible for CpG depletion in HIV-1.

[Figure 5.1 plot. Y-axis: representation, HIV (scale 0.0 to 1.5). X-axis: 2-mer motifs AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT.]

Figure 5.1: Comparison of the representation of 2-mer motifs, averaged over 2000 HIV-1 sequences.


Toll-like receptors have been implicated as an alternative mechanism responsible for CpG depletion [7-10]. Human influenza A is thought to have originated from an avian source [8]. The frequency of CpG in the influenza A genome decreased significantly after the virus crossed the interspecies barrier from avian to human hosts, suggesting a role for host-related factors that identify CpG patterns. CpG motifs were reported to be depleted most when flanked by T and A nucleotides [7-9]. In addition, host immune factors such as toll-like receptor 9 (TLR-9) identify unmethylated CpG within the context of T and A in foreign DNA. Thus, this mechanism was postulated as a source of CpG depletion in influenza [7-9]. Nevertheless, being localized in the endoplasmic reticulum [24, 25], TLR-9 is not likely to have access to the HIV-1 DNA; thus TLR-9 is unlikely to have played a role in the depletion of CpG motifs in the HIV-1 genome.

The CpG depletion of the human genome, together with the reported sequence context dependency of CpG depletion in influenza virus and its possible link to components of the human immune system, inspired me to explore the sequence contexts of CpG depletion and methylation in the HIV-1 genome. To investigate the mechanism responsible for CpG depletion in HIV-1, I analysed the representation of all CpG motifs up to tetra-nucleotides in over 2000 complete genome HIV-1 sequences, and also in the human genome. I also analysed the CpG methylation profile of the human methylome to reveal the sequence context dependency of methylation. Finally, I performed the same analysis on a large number of other viruses to investigate whether the source of CpG depletion is specific to HIV-1. Overall, the patterns of CpG depletion and methylation suggest that methylation may be the major mechanism responsible for CpG depletion in the HIV-1 genome. Importantly, viral genome mimicry of the host CpG methylation-induced mutation appears to be specific to the human and simian immunodeficiency retroviruses.


Materials and methods

Data acquisition

I downloaded a total of 2368 near full-length HIV-1 sequences from the Los Alamos National Laboratory database (www.hiv.lanl.gov) in October 2014. The human genome sequence was obtained from the University of California, Santa Cruz (UCSC) genome browser (build 37). I obtained the whole genome methylation data of peripheral blood mononuclear cells (PBMC) [26] from the NGSmethDB database (http://bioinfo2.ugr.es:8080/NGSmethDB) [27, 28] in June 2015. Additionally, I downloaded from GenBank the available full genome sequences of over 100 human and non-human viruses. Table 5.1 shows representative examples of viruses from all four groups: single-stranded RNA (ssRNA), double-stranded RNA (dsRNA), single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA). The remaining viruses are listed in Table 5.2. I also downloaded influenza sequences (1000 human influenza type A subtype H1N1, 800 human influenza type A subtype H3N2, and 200 human influenza type B) from the Influenza Research Database (IRD) (www.fludb.org/brc/home.spg?decorator=influenza).


Table 5.1: List of representative viruses investigated in this study (a full list of the viruses studied is given in Table 5.2).

| Virus | Type | Host | ~Genome length (nt) | Number of sequences | CpG D-ratio |
|---|---|---|---|---|---|
| HIV-1 (retrovirus) | ssRNA | Human | 9000 | 2000 | 0.20 |
| HIV-2 (retrovirus) | ssRNA | Human | 9000 | 20 | 0.29 |
| SIVCPZ (retrovirus) | ssRNA | Simian | 9000 | 29 | 0.21 |
| SIVSM (retrovirus) | ssRNA | Simian | 9300 | 5 | 0.26 |
| HTLV-1 (retrovirus) | ssRNA | Human | 9000 | 20 | 0.60 |
| SFV (retrovirus) | ssRNA | Simian | 12900 | 28 | 0.32 |
| HERV-K (retrovirus) | ssRNA | Human | 9300 | 22 | 0.29 |
| SINV | ssRNA | Human | 11700 | 3 | 0.91 |
| Semliki Forest | ssRNA | Human | 11440 | 3 | 0.89 |
| Influenza A H1N1 | ssRNA | Human | 12500 | 2000 | 0.42 |
| HCV | ssRNA | Human | 9600 | 6 | 0.71 |
| CTFV (RNA virus) | dsRNA | Human | 25200 | 11 | 0.93 |
| RV Type C (RNA virus) | dsRNA | Human | 18500 | 2 | 0.59 |
| AAV | ssDNA | Human | 4700 | 11 | 0.89 |
| BoV | ssDNA | Human | 5300 | 6 | 0.46 |
| JC Polyomavirus | dsDNA | Human | 5120 | 40 | 0.08 |
| HBV | dsDNA | Human | 3300 | 400 | 0.54 |

HIV: Human Immunodeficiency Virus; SIVCPZ: Simian Immunodeficiency Virus (Chimpanzee); SIVSM: Simian Immunodeficiency Virus (Sooty Mangabey); HTLV: Human T-Lymphotropic Virus; SFV: Simian Foamy Virus; HERV-K: Human Endogenous Retrovirus Type K; SINV: Sindbis Virus; HCV: Hepatitis C Virus; CTFV: Colorado Tick Fever Virus; RV: Rotavirus; AAV: Adeno-associated Virus; BoV: Bocavirus; HBV: Hepatitis B Virus

* Retroviruses package single stranded RNA in their particles but their life cycle also includes a DNA intermediate.


Table 5.2: List of additional viruses used in this study.

| Virus | Host | ~Genome length (nt) | Number of sequences |
|---|---|---|---|
| SIV_AGI (SIV Agile Mangabey) | Monkey | 9500 | 1 |
| SIV_AGM (SIV African Green Monkey) | Monkey | 9200 | 5 |
| SIV_DEB (SIV De Brazza's Monkey) | Monkey | 9200 | 2 |
| SIV_DEN (SIV Dent’s Mona Monkey) | Monkey | 9600 | 1 |
| SIV_GOR (SIV Gorilla) | Monkey | 9200 | 5 |
| SIV_GSN (SIV Greater Spot-Nosed Monkey) | Monkey | 9400 | 2 |
| SIV_MAC239 (SIV Macaque 239) | Monkey | 9000 | 7 |
| SIV_MND (SIV Mandrill) | Monkey | 9500 | 2 |
| SIV_MON (SIV Mona Monkey) | Monkey | 8300 | 2 |
| SIV_MUS (SIV Mustached Monkey) | Monkey | 9500 | 5 |
| SIV_RCM (SIV Red-Capped Mangabey) | Monkey | 9500 | 2 |
| SIV_MAC251 (SIV Macaque 251) | Monkey | 10300 | 2 |
| SIV_TAL (SIV Talapoin Monkey) | Monkey | 9200 | 2 |
| Simian SRV-1 type D retrovirus | Monkey | 8100 | 1 |
| Human herpesvirus 1 | Human | 152000 | 27 |
| Influenza type A-H1N1 virus | Human | 12500 | 900 |
| Influenza type A-H3N2 virus | Human | 12500 | 600 |
| Influenza type B virus | Human | 12500 | 215 |
| Human parainfluenza virus 1 | Human | 15600 | 57 |
| Human Hendra virus | Human | 18200 | 15 |
| Human metapneumovirus | Human | 13300 | 62 |
| Human parechovirus 6 | Human | 7300 | 5 |
| Human poliovirus 1 | Human | 7400 | 104 |
| Human rhinovirus A | Human | 7200 | 4 |
| Human parechovirus 1 | Human | 7300 | 26 |
| BK polyomavirus | Human | 5100 | 39 |
| Hepatitis B virus | Human | 3000 | 13 |
| Eastern Equine encephalitis virus | Human | 11500 | 11 |
| Epstein-Barr virus (EBV) | Human | 171000 | 47 |
| Hepatitis E virus | Human | 7400 | 162 |
| Herpes simplex virus | Human | 152000 | 2 |
| Human cytomegalovirus | Human | 235000 | 1 |
| Human enterovirus | Human | 7400 | 87 |
| Human papillomavirus | Human | 8000 | 103 |
| Human Enterovirus D | Human | 7400 | 36 |
| Barmah Forest virus | Human | 11400 | 3 |
| Vesicular stomatitis Indiana virus | Human | 11200 | 27 |
| Human respiratory syncytial virus | Human | 15200 | 45 |
| Infectious flacherie virus | Human | 9600 | 6 |
| Crocuta papillomavirus | Human | 8300 | 2 |
| Measles virus | Human | 16700 | 40 |
| Rubella virus | Human | 9700 | 58 |
| Enzootic nasal tumour virus | Human | 7500 | 16 |
| Human Cowpox virus | Human | 17500 | 25 |
| Human Vaccinia virus | Human | 190000 | 2 |
| Human Variola virus | Human | 190000 | 55 |
| Human Ebola virus | Human | 18900 | 88 |
| Human zika virus | Human | 10800 | 103 |
| Human Salivirus A | Human | 7900 | 13 |
| Human echovirus | Human | 7500 | 13 |
| Human | Human | 31000 | 20 |
| Human parechovirus 2 | Human | 7300 | 2 |
| Human herpesvirus 2 | Human | 155000 | 2 |
| Human Aichi virus | Human | 8250 | 5 |
| polyomavirus | non-Human | 5300 | 10 |
| Bovine rhinitis B virus | non-Human | 7500 | 1 |
| Bovine rhinovirus 2 | non-Human | 7500 | 1 |
| Mouse mammary tumor virus | non-Human | 9800 | 5 |
| Western equine encephalitis virus | non-Human | 11400 | 28 |
| Mouse parvovirus 4 | non-Human | 4700 | 6 |
| Hog cholera virus | non-Human | 12300 | 11 |
| Avian endogenous | non-Human | 7000 | 2 |
| Avian leukosis virus | non-Human | 7600 | 81 |
| Dog canine parvovirus 2 | non-Human | 4200 | 4 |
| Pig classical swine fever virus | non-Human | 12300 | 7 |
| Gallus endogenous virus | non-Human | 7250 | 4 |
| Friend murine leukemia virus | non-Human | 8350 | 4 |
| Lelystad virus | non-Human | 15100 | 2 |
| Porcine parvovirus 2 | non-Human | 5500 | 7 |
| Porcine parvovirus 6 | non-Human | 6100 | 14 |
| Jaagsiekte sheep retrovirus | non-Human | 7460 | 6 |
| Feline calicivirus | non-Human | 7600 | 42 |
| Goatpox virus | non-Human | 149500 | 3 |
| Frog virus 3 | non-Human | 105900 | 4 |
| Equid herpesvirus 1 | non-Human | 149000 | 32 |
| CHK (Chicken) | non-Human | 7200 | 3 |
| Rous sarcoma virus | non-Human | 9600 | 8 |
| Chicken megrivirus | non-Human | 9500 | 4 |
| Duck megrivirus | non-Human | 9700 | 3 |
| Influenza avian type A-H1N1 | non-Human | 12300 | 78 |
| Influenza avian type A-H3N2 | non-Human | 12300 | 46 |
| Chicken astrovirus | non-Human | 7500 | 2 |
| Black beetle virus | non-Human | 3000 | 3 |
| Chikungunya virus | non-Human | 12230 | 106 |
| Kunjin virus | non-Human | 11200 | 14 |
| Abaca bunchy top virus | Plant | 1000 | 23 |
| Abutilon mosaic Bolivia virus | Plant | 2700 | 6 |
| Fig badnavirus 1 | Plant | 7140 | 8 |
| Bamboo mosaic virus | Plant | 6360 | 10 |
| Acanthamoeba polyphaga mimivirus | Plant | 1.2M | 1 |
| Drosophila C virus | Plant | 9260 | 2 |
| Alajuela virus | Plant | 6500 | 2 |
| Apium virus Y | Plant | 990 | 2 |
| Beet curly top virus | Plant | 3000 | 63 |
| Apple mosaic virus | Plant | 3000 | 10 |
| Grapevine virus A | Plant | 7350 | 15 |
| Cactus virus X | Plant | 6600 | 4 |
| Carnation ringspot virus | Plant | 3800 | 1 |
| Cedar virus | Plant | 18150 | 4 |
| Celery mosaic virus | Plant | 10000 | 2 |
| Blackberry virus Y | Plant | 11000 | 2 |
| Citrus leaf blotch virus | Plant | 8740 | 7 |
| Daphne virus S | Plant | 8700 | 3 |


Data analysis

Analysis of CpG representation

I define “representation” as the ratio of observed frequency (Pobs) of a motif over its expected frequency (Pexp) in the genome. The Pobs of a motif is defined as the number of times that motif appears in the sequence divided by the total number of all motifs with the same length [29, 30]. The Pexp can be calculated in different ways.

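\begin{align}
P_{exp}(\mathrm{CpGpT}) &= P_{obs}(\mathrm{C}) \times P_{obs}(\mathrm{G}) \times P_{obs}(\mathrm{T}) \tag{Eq. 1} \\
&= P_{obs}(\mathrm{CpG}) \times P_{obs}(\mathrm{T}) \tag{Eq. 2} \\
&= P_{obs}(\mathrm{C}) \times P_{obs}(\mathrm{GpT}) \tag{Eq. 3}
\end{align}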

Previous studies have shown that a more accurate estimation of motif representation can be acquired using Markov models of conditional probabilities [31, 32]. In this method [31] the expected frequency of a motif is estimated using the observed frequencies of the motif constituents, considering the overlapping nucleotide(s). Examples of 1st and 2nd order Markov models are given in Equations 4 and 5.

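\begin{align}
P_{exp}(\mathrm{CpGpT}) &= \frac{P_{obs}(\mathrm{CpG}) \times P_{obs}(\mathrm{GpT})}{P_{obs}(\mathrm{G})} \tag{Eq. 4} \\
P_{exp}(\mathrm{CpGpTpA}) &= \frac{P_{obs}(\mathrm{CpGpT}) \times P_{obs}(\mathrm{GpTpA})}{P_{obs}(\mathrm{GpT})} \tag{Eq. 5}
\end{align}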

In this study I used 1st and 2nd order models to calculate the expected frequencies of tri-nucleotides and tetra-nucleotides, respectively. I then calculated, for each motif, its representation (D-ratio) by dividing the Pobs of the motif by its Pexp. To compare the frequencies of different motifs I performed non-parametric Mann-Whitney tests. To determine the statistical significance of correlations, I used non-parametric Spearman correlation tests.
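The D-ratio calculation described above can be sketched in a few lines of Python. This is only an illustration of Equations 4 and 5 for a single sequence, not the code used in the thesis; the function names `kmer_freqs` and `d_ratio` are hypothetical, and averaging over many sequences is left out.

```python
from collections import Counter


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Observed frequency of each k-mer: its count over the total number of k-mers."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def d_ratio(seq: str, motif: str) -> float:
    """Representation (Pobs / Pexp) of a tri- or tetra-nucleotide motif.

    Pexp follows the form of Eq. 4 (tri-nucleotides) and Eq. 5 (tetra-nucleotides):
    Pexp(w) = Pobs(w without last base) * Pobs(w without first base) / Pobs(overlap).
    """
    k = len(motif)
    if k not in (3, 4):
        raise ValueError("this sketch only covers tri- and tetra-nucleotides")
    observed = kmer_freqs(seq, k).get(motif, 0.0)
    sub = kmer_freqs(seq, k - 1)       # frequencies of the two overlapping constituents
    overlap = kmer_freqs(seq, k - 2)   # frequencies of the shared middle part
    denom = overlap.get(motif[1:-1], 0.0)
    if denom == 0.0:
        return float("nan")
    expected = sub.get(motif[:-1], 0.0) * sub.get(motif[1:], 0.0) / denom
    return observed / expected if expected else float("nan")
```

For example, `d_ratio(genome, "CGT")` divides the observed CpGpT frequency by Pobs(CpG) × Pobs(GpT) / Pobs(G). A value below 1 indicates under-representation.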


Analysis of % methylation of CpG

I performed three different analyses to quantify the percentage of methylated CpG tri- and tetra-nucleotides using the whole genome PBMC DNA methylome data [26] from the NGSmethDB database (http://bioinfo2.ugr.es:8080/NGSmethDB) [27, 28]. In the first analysis I quantified %methylation in whole chromosomes 1, 2, 3 and 10. For the second and third analyses, I quantified %methylation inside and outside CpG islands, respectively. For these two analyses, I only used data from chromosomes 1 and 2. It is worth noting that analysis of data from a single chromosome at a time returned the same results, implying that the data reported here are independent of which chromosomes are used. For each analysis, I report %methylation as an average over all analysed chromosomes. To identify CpG islands I used the Matlab function ‘cpgisland’ with a 100 bp moving window, which is the default setting of this function. CpG islands were defined as regions within which the GC content is greater than 50% and the ratio of observed over expected CpG content (based on the frequencies of C and G) is greater than 0.6 (60%).
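The two CpG island criteria stated above can be illustrated with a short Python sketch. It is not a re-implementation of the Matlab ‘cpgisland’ function: it only flags individual 100 bp windows, and it assumes that the expected CpG count in a window is (number of C × number of G) / window length. The function name `cpg_island_windows` is hypothetical.

```python
def cpg_island_windows(seq: str, window: int = 100,
                       min_gc: float = 0.5, min_oe: float = 0.6) -> list[bool]:
    """For each window start, report whether the window meets both island criteria."""
    seq = seq.upper()
    flags = []
    for start in range(len(seq) - window + 1):
        w = seq[start:start + window]
        c, g = w.count("C"), w.count("G")
        cpg = w.count("CG")
        gc_content = (c + g) / window
        # Observed/expected CpG, assuming expected = (#C * #G) / window length.
        obs_exp = (cpg * window) / (c * g) if c and g else 0.0
        flags.append(gc_content > min_gc and obs_exp > min_oe)
    return flags
```

Consecutive flagged windows would still need to be merged into regions before CpGs can be assigned to "inside" or "outside" islands. That step is omitted here.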

Results

Analysis of the representation of CpG motifs shown in Figs. 5.2A-D indicates that CpG depletion is sequence context dependent. The CpG tri-nucleotides flanked by T or A (shown by circles) are significantly less represented than those flanked by C or G (shown by triangles) in both HIV-1 (see Fig. 5.2A) (P value = 0.03, Mann-Whitney) and human (see Fig. 5.2C) (P value = 0.03). The same pattern is observed for CpG tetra-nucleotides shown in Figs. 5.2B and 5.2D. The CpG tetra-nucleotides having T and/or A only (e.g. CpGpTpA, ApCpGpA, …) were significantly less represented compared to CpG motifs flanked by C and/or G only (CpGpCpG; GpCpGpG, …) (P value = 0.04 and 0.0002 for HIV-1 and human genome, respectively).
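As a rough check on the scale of these P values, note that there are only eight CpG tri-nucleotides: four flanked by T/A (ACG, TCG, CGA, CGT) and four flanked by C/G (CCG, GCG, CGC, CGG). The sketch below computes an exact two-sided Mann-Whitney P value by enumeration. It assumes an exact test without ties (the text does not state which variant was used), and the input values in the example are invented for illustration.

```python
from itertools import combinations


def mann_whitney_exact(x: list[float], y: list[float]) -> float:
    """Exact two-sided Mann-Whitney P value by enumerating rank assignments.

    Assumes all values are distinct (no ties) and that the samples are small.
    """
    pooled = sorted(x + y)
    n1, n = len(x), len(x) + len(y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    mean_rank_sum = n1 * (n + 1) / 2
    observed = abs(sum(rank[v] for v in x) - mean_rank_sum)
    # Every way of giving n1 of the n ranks to the first group is equally likely under H0.
    rank_sums = [sum(c) for c in combinations(range(1, n + 1), n1)]
    extreme = sum(1 for s in rank_sums if abs(s - mean_rank_sum) >= observed - 1e-12)
    return extreme / len(rank_sums)


# Hypothetical representation values: every T/A-flanked motif below every C/G-flanked one.
p = mann_whitney_exact([0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8])
```

With two groups of four and complete separation, the exact two-sided P value is 2/70 ≈ 0.029, the smallest value this test can return for samples of that size. Under these assumptions, that is consistent with the reported tri-nucleotide P value of 0.03.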


Figure 5.2: Comparison of the representation of CpG motifs flanked by different nucleotides within HIV-1 and Human genomes. A) Tri-nucleotides in HIV-1; B) Tetra-nucleotides in HIV-1; C) Tri-nucleotides in human; D) Tetra-nucleotides in human. In all cases CpG motifs flanked by T/A have a lower representation than those flanked by C/G (P value ≤ 0.04).

To show that the observed CpG depletion pattern of the HIV-1 genome is not an artefact of HIV-1 genome biases, such as A-richness and possible hypermutation signatures of APOBEC3 enzymes, I performed the following two analyses:


a) I performed a representation analysis for the motif GpC, which has the same composition as CpG but is not under methylation-induced mutation pressure. If the reported CpG depletion pattern (i.e. greater depletion of CpG motifs flanked by T/A compared to those flanked by C/G) were due to the A-richness of the HIV-1 genome, one would expect to see the same pattern for GpC. My analyses of GpC tri- and tetra-nucleotides did not show the pattern I observed for CpG (see Fig. 5.3).

b) I performed two separate representation analyses, one for hypermutated HIV-1 sequences and one for non-hypermutated HIV-1 sequences. As predicted, the CpG representation patterns of non-hypermutated and hypermutated sequences were not different (Fig. 5.4).

These two analyses indicate that genome composition biases such as A-richness and APOBEC-induced hypermutation signature do not affect my analysis.

I also investigated CpG representation in the HIV-1 long terminal repeats (LTRs), which are not transcribed. The results showed that similar to the rest of the HIV-1 genome, in the HIV-1 LTR regions, CpG motifs flanked by T/A are less represented than those flanked by C/G (Fig. 5.5).


Figure 5.3: Comparison of the representation of GpC motifs flanked by different nucleotides within HIV-1 sequences. A) GpC tri-nucleotide motifs; B) GpC tetra-nucleotide motifs.


Figure 5.4: Comparison of the representation of CpG motifs flanked by different nucleotides within non-hypermutated and hypermutated sequences, separately. A) Tri-nucleotides in non-hypermutated HIV-1 sequences; B) Tri-nucleotides in hypermutated HIV-1 sequences.


Figure 5.5: Comparison of the representation of CpG motifs flanked by different nucleotides within HIV-1 sequences with and without LTR region, separately. A) Tri-nucleotides in HIV-1 LTR regions; B) Tri-nucleotides in HIV-1 whole sequence excluding LTR.


To investigate whether codon bias has played a role in the observed lower representation of CpG motifs flanked by T/A, I compared the codon usage of Arginine, which is coded by all four possible CpG-containing codons, namely CGA, CGU, CGC, and CGG [33]. If the observed CpG depletion pattern (lower representation of CGT and CGA compared to CGC and CGG) were due to HIV-1 codon bias, one would expect to see a lower usage, for Arginine, of the codons CGU and CGA compared to CGC and CGG. My results showed that in the HIV-1 genes codon CGA is used the most (see Fig. 5.6). This suggests that the CpG depletion pattern that I report here is not an artefact of the codon bias of the HIV-1 genes.
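A minimal sketch of this codon comparison is given below. It assumes an in-frame coding sequence as input and reports usage relative to the four CGN codons only. Arginine is also encoded by AGA and AGG, which contain no CpG and are left out here, so the normalisation may differ from that used for Fig. 5.6. The function name `cgn_codon_usage` is hypothetical.

```python
from collections import Counter

CGN_CODONS = ("CGA", "CGT", "CGC", "CGG")  # CGT corresponds to CGU in RNA


def cgn_codon_usage(cds: str) -> dict[str, float]:
    """Relative usage of the four CpG-containing Arginine codons in an in-frame CDS."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codon for codon in codons if codon in CGN_CODONS)
    total = sum(counts.values())
    return {codon: (counts[codon] / total if total else 0.0) for codon in CGN_CODONS}
```

If codon bias were driving the depletion pattern, CGA and CGT would be expected to have the lowest shares in this output.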

Figure 5.6: Comparison of the codon usage of Arginine in HIV-1 genes and in human.

To investigate whether the observed pattern of CpG depletion is specific to HIV-1 or is common among viruses, I analysed full genome sequences from a wide range of viruses, including viruses with ssDNA, dsDNA, ssRNA, and dsRNA genomes having a wide range of CpG representations (see Tables 5.1 and 5.2). The results for a few viruses from each group are shown in Fig. 5.7 and those for an additional 103 viruses are given in Figs. 5.8A-J. Similar to HIV-1, in HIV-2 and in their simian immunodeficiency counterpart viruses (i.e. SIVCPZ and SIVSM) the representation of CpG was lower when it was flanked by T/A nucleotides. All other SIV sequences studied (e.g. SIVAGM, SIVMND, SIVMUS, …) also showed the same CpG depletion pattern. The only exception was SIVGOR (see Fig. 5.8A). This pattern was not observed in any of the other viruses, even in the human endogenous retrovirus HERV-K (see Figs. 5.8B-J). Contrary to what has been reported previously [8, 9], I found that in the influenza virus, CpG depletion is not sequence context dependent (see Figs. 5.7, 5.8B and 5.8H).


Figure 5.7: Comparison of the representation of CpG motifs flanked by T/A versus C/G in selected viruses from all four groups of ssRNA, dsRNA, ssDNA, and dsDNA. A full list of all viruses studied is given in Table 5.2 and Fig. 5.8 (A-J). Only in HIV and SIV genomes do CpG motifs flanked by T/A have a significantly lower representation compared to those flanked by C/G.


Figure 5.8: Comparison of the representation of CpG motifs flanked by T/A versus those flanked by C/G in 103 viruses.


The lifecycle of HIV-1 includes both RNA and DNA stages. To investigate whether the mechanism responsible for CpG depletion in HIV-1 has acted on the HIV-1 RNA genome or on the integrated HIV-1 DNA, I compared the representation of each CpG motif with that of its reverse complement motif (e.g. CpGpA with TpCpG), as shown in Fig. 5.9. If the mechanism has acted at the DNA level, and assuming there has been no strand targeting bias, I would expect to see reverse complementary motifs represented equally in the HIV-1 genome. As displayed in Fig. 5.9A, there is a positive correlation between the representations of reverse complementary tri-nucleotide CpG motifs. Strikingly, this pattern is also present for tetra-nucleotide CpG motifs (see Fig. 5.9B). These data suggest that the mechanism responsible for context-dependent CpG depletion has acted on the HIV-1 DNA.
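The pairing of each CpG motif with its reverse complement can be sketched as follows. This is an illustration only: the helper names are hypothetical, and motifs that are their own reverse complement (for example ACGT) are dropped because they cannot form a pair.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Reverse complement of a DNA motif, e.g. CGA -> TCG."""
    return motif.translate(COMPLEMENT)[::-1]


def cpg_revcomp_pairs(k: int) -> list[tuple[str, str]]:
    """Distinct (motif, reverse complement) pairs among CpG-containing k-mers."""
    motifs = ["".join(p) for p in product("ACGT", repeat=k)]
    motifs = [m for m in motifs if "CG" in m]
    pairs = {tuple(sorted((m, reverse_complement(m)))) for m in motifs}
    # Drop self-complementary motifs, which have no distinct partner.
    return sorted(pair for pair in pairs if pair[0] != pair[1])
```

For tri-nucleotides this yields four pairs (ACG/CGT, CCG/CGG, CGA/TCG and CGC/GCG). The two representation values of each pair are the quantities whose agreement is assessed in Fig. 5.9.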


Figure 5.9: Positive correlation between the representation of CpG reverse complementary tri-nucleotide motifs (A) and tetra-nucleotide motifs (B). ICC: Intra-Class Correlation. These data indicate that the mechanism responsible for the HIV-1 CpG depletion has acted on the viral DNA.


To investigate whether methylation has played a role in the CpG depletion of the HIV-1 genome, I analysed DNA methylation patterns in human PBMC. The percentage methylation of CpG tri- and tetra-nucleotide motifs in the human genome is shown in Figs. 5.10A and 5.10B, respectively. The results indicate that CpG methylation is a sequence context dependent mechanism: CpG motifs flanked by T/A are methylated at a higher rate than those flanked by C/G. To ensure that the observed pattern is a general feature of methylation and is not affected by clustering of CpGs in the human genome, I performed two additional analyses in which only CpGs inside CpG islands or outside CpG islands were included. The results of the analysis of CpGs inside CpG islands are shown in Figs. 5.10C and 5.10D for tri- and tetra-nucleotides, respectively. The results of the analysis of CpGs falling outside CpG islands are given in Figs. 5.10E and 5.10F for tri- and tetra-nucleotides, respectively. As indicated, regardless of the positioning of CpG sites inside or outside CpG islands, CpG motifs flanked by T/A exhibit a higher percentage of methylation. These data suggest that the sequence context dependency of methylation is independent of CpG distribution in the human genome.


Figure 5.10: Comparison of the % methylation of CpG motifs flanked by different nucleotides in the human genome. A) Whole genome/tri-nucleotides; B) Whole genome/tetra-nucleotides; C) Inside CpG islands/tri-nucleotides; D) Inside CpG islands/tetra-nucleotides; E) Outside CpG islands/tri-nucleotides; F) Outside CpG islands/tetra-nucleotides. Data indicate that the rate of methylation is higher when CpG is flanked by T/A.


I also computed the correlation between % methylation and depletion of CpG tri- and tetra-nucleotide motifs. Figs. 5.11A and 5.11B show a significant inverse correlation between CpG methylation and CpG depletion, confirming that CpG motifs that are methylated most are underrepresented most, and vice versa.
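The rank correlation underlying this comparison can be sketched without external libraries. The code below computes Spearman's rho as the Pearson correlation of average ranks; it does not compute a P value. The function names and the numbers in the usage example are illustrative assumptions, not data from this study.

```python
def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Made-up example: % methylation rising while representation falls.
rho = spearman_rho([60.0, 70.0, 80.0, 90.0], [0.9, 0.7, 0.4, 0.1])
```

A rho close to -1 corresponds to the inverse relationship described above, where motifs with higher % methylation have lower representation.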


Figure 5.11: Inverse correlation between % methylation and representation of CpG tri-nucleotide (A) and tetra-nucleotide (B) motifs. Data indicate that CpG motifs that are methylated more (those flanked by T/A) are represented less in the HIV-1 genome.


In Fig. 5.11B, the CpG tetra-nucleotide motifs of the forms NNCG, CGNN, and NCGN were included, where the flanking nucleotides N are either all C/G or all T/A. In other words, CpG motifs flanked by a mixture of T/A and C/G, such as ACGC, TCGC, and ACGG, were excluded from the analysis. I also performed two separate analyses, one for NCGN motifs and one for CGNN and NNCG motifs. The results, presented in Fig. 5.12, show a significant inverse correlation between % methylation and representation of NNCG and CGNN motifs in both human and HIV-1 genomes. However, for the NCGN motifs the inverse correlation was not significant at the 95% confidence level in either human or HIV-1.
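The motif grouping described above can be made explicit with a short enumeration. This sketch reflects my reading of the inclusion rule (both flanking nucleotides drawn only from T/A, or only from C/G); the function names are hypothetical.

```python
from itertools import product


def flank_pairs():
    """Yield two-nucleotide flanks drawn only from T/A or only from C/G, with a label."""
    for alphabet, label in (("AT", "T/A"), ("CG", "C/G")):
        for first, second in product(alphabet, repeat=2):
            yield first + second, label


def tetra_cpg_groups() -> dict[str, dict[str, str]]:
    """Map each group (NNCG, CGNN, NCGN) to its motifs and their flank class."""
    groups: dict[str, dict[str, str]] = {"NNCG": {}, "CGNN": {}, "NCGN": {}}
    for flanks, label in flank_pairs():
        groups["NNCG"][flanks + "CG"] = label
        groups["CGNN"]["CG" + flanks] = label
        groups["NCGN"][flanks[0] + "CG" + flanks[1]] = label
    return groups


groups = tetra_cpg_groups()
```

Each group contains eight motifs, four per flank class. CGCG belongs to both the NNCG and CGNN groups, so the three groups together cover 23 distinct motifs, and mixed-flank motifs such as ACGC do not appear.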


Figure 5.12: Correlation between % methylation and representation of CpG tetra-nucleotide motifs within NCGN, NNCG and CGNN groups, separately. A) NNCG and CGNN motifs in HIV-1; B) NCGN motifs in HIV-1; C) NNCG and CGNN motifs in human; D) NCGN motifs in human.


Additionally, I compared the representation of CpG motifs flanked by T/A and those flanked by C/G. The results revealed that, in both human and HIV-1, the representation is lower when CpG is flanked by T/A only for NNCG and CGNN motifs, not for NCGN motifs (see Fig. 5.13). Taken together, these data indicate that the methylation rate of CGNN and NNCG is higher when N is T/A than when N is C/G. As a result, the representation of CGNN and NNCG is lower when N is T/A than when N is C/G. Importantly, this pattern is present in both human and HIV-1 genomes. In contrast, the methylation rate and representation of NCGN motifs appear to be independent of the type of flanking nucleotides. Again, this pattern is present in both human and HIV-1 genomes. These data suggest that: a) methylation of CpG sites is non-random and sequence context dependent; and b) the mechanism responsible for CpG depletion of the human and HIV-1 genomes is likely the same.


Figure 5.13: Comparison of the representation of CpG motifs flanked by different nucleotides within NCGN, NNCG and CGNN groups, separately. A) NCGN in HIV-1; B) NCGN in Human; C) NNCG and CGNN in HIV-1; D) NNCG and CGNN in Human.


Discussion

CpG is highly underrepresented in the genome of HIV-1 and many other viruses [2, 3, 29, 34, 35]. Depletion of CpG is also a known feature of the human and other vertebrate genomes. CpG depletion in vertebrates is thought to be a result of CpG methylation followed by spontaneous deamination to TpG on the same strand or a mismatch repair to CpA on the complementary strand [22, 23]. However, it is not clear why CpG is depleted in the genome of viruses such as HIV-1. In this work I aimed to answer this question by investigating the motif dependency pattern of CpG depletion. I quantified the representation of CpG tri- and tetra-nucleotide motifs in a diverse population of full genome HIV-1 sequences and also in the human genome. I then compared the CpG depletion patterns with the CpG methylation and TLR recognition patterns to identify the source of CpG depletion in HIV-1. Analysis of the representation of CpG motifs revealed that in both human and HIV-1 genomes the CpG motifs flanked by T/A have been depleted more compared to those flanked by C/G (see Fig. 5.2). These data suggest that the mechanism responsible for CpG depletion has mutated CpG motifs in a sequence context dependent manner in both genomes.

The host defence molecule TLR-9 [36, 37] recognises foreign CpG DNA, mainly in the context of T and A [7-11]. It has been speculated that human influenza virus has evolved to reduce the level of CpG flanked by T/A in its genome to avoid recognition by TLR-9 [7-11]. Although I was inspired by the results reported previously for the influenza virus, my representation analyses of the whole genome and of different segments of influenza virus did not show the reported patterns. Figs. 5.7, 5.8B and 5.8H show that there is no difference between the representation of CpG motifs flanked by T/A and C/G in the genomes of human and avian influenza viruses. Nevertheless, given that the human genome exhibits the same CpG depletion pattern as HIV-1 and that the human genome is unlikely to have been affected by TLR-9 [24], TLR-9 recognition may not be the source of CpG depletion in HIV-1.

My results shown in Fig. 5.9 indicated that CpG reverse complementary motifs are nearly identically represented in the HIV-1 genome, suggesting that CpG depletion has occurred at the DNA stage of the HIV-1 life cycle, likely in HIV-1 proviruses. Additionally, I note that TLR-9 exists in the endolysosomal compartments of innate immune cells [24, 25]; thus its access to the HIV-1 DNA, during reverse transcription in the cytoplasm or after the HIV-1 DNA is integrated into the human genome, seems unlikely.

In addition to TLR, in this study I investigated methylation-induced mutation as a potential source of CpG depletion in HIV-1. I reasoned that if methylation-induced silencing (and/or mutation) is responsible for the depletion of CpG motifs in HIV-1, those motifs that are methylated more are expected to be depleted more. In other words, I would expect CpG motifs flanked by T/A to be methylated more frequently than CpG motifs flanked by C/G. Fig. 5.10 shows that the presence of T and A in the flanking regions of CpG significantly increases the rate of methylation. In addition, methylation of double-stranded DNA is expected to act equally on a motif and its reverse complement, which is what I observe in Fig. 5.9. The inverse correlation between % methylation and representation of CpG motifs (see Fig. 5.11) is another piece of evidence suggesting that methylation is the source of CpG depletion in HIV-1. The HIV-1 genome is only 9000 bp in size. In such a small genome, it is striking that even CpG tetra-nucleotides show patterns that point to methylation as the source of CpG depletion in HIV-1 (see Figs. 5.2B, 5.9B, and 5.11B).

The genomes of many viruses, including those studied here (e.g. Influenza, HERV-K, HBV, SFV, JC Polyomavirus), are CpG depleted. However, my results indicate that among the viruses studied only the immunodeficiency viruses HIV and SIV have a CpG depletion pattern in their genomes that mimics the pattern of human methylation-induced mutations. Surprisingly, even viruses such as HERV-K, which is endogenous to the human genome, and JC Polyomavirus, which has a dramatically reduced CpG level (see Table 5.1), do not show a CpG depletion pattern that can be attributed to methylation-induced mutation (see Figs. 5.7 and 5.8). The exclusive host CpG-mimicry in HIV and SIV may provide a unique advantage to these viruses; nevertheless, further evidence is needed to support a viral adaptation model. My study provides evidence that suggests CpG methylation is the source of CpG depletion in the HIV and SIV genomes. However, what drives CpG depletion in other viruses, including those infecting hosts that are not CpG depleted, remains to be understood.

My results indicate that among the viruses studied only the immunodeficiency viruses HIV and SIV have a unique CpG depletion pattern in their genomes that can be explained by host methylation-induced silencing. Other endogenous and exogenous viruses, including those having a CpG-depleted genome, do not show a similar CpG pattern (see Fig. 5.7), suggesting that adaptation to the host methylation machinery may provide a unique advantage to HIV and SIV.

References

1. Shpaer, E.G. and J.I. Mullins, Selection against CpG dinucleotides in lentiviral genes: a possible role of methylation in regulation of viral expression. Nucleic Acids Res, 1990. 18(19): p. 5793-7.

2. Karlin, S., et al., Why is CpG suppressed in the genomes of virtually all small eukaryotic viruses but not in those of large eukaryotic viruses? J Virol, 1994. 68(5): p. 2889-97.

3. Van der Kuyl, A.C. and B. Berkhout, The biased nucleotide composition of the HIV genome: a constant factor in a highly variable virus. Retrovirology, 2012. 9: p. 92.

4. Boehme, K.W. and T. Compton, Innate sensing of viruses by toll-like receptors. J Virol, 2004. 78(15): p. 7867-73.

5. Ohto, U., et al., Structural basis of CpG and inhibitory DNA recognition by Toll-like receptor 9. Nature, 2015. 520(7549): p. 702-U303.

6. Chang, J.J. and M. Altfeld, TLR-mediated immune activation in HIV. Blood, 2009. 113(2): p. 269-70.

7. Greenbaum, B.D., R. Rabadan, and A.J. Levine, Patterns of Oligonucleotide Sequences in Viral and Host Cell RNA Identify Mediators of the Host Innate Immune System. PLoS One, 2009. 4(6): p. e5969.

8. Greenbaum, B.D., et al., Patterns of Evolution and Host Gene Mimicry in Influenza and Other RNA Viruses. PLoS Pathog, 2008. 4(6): p. e1000079.

9. Jimenez-Baranda, S., et al., Oligonucleotide Motifs That Disappear during the Evolution of Influenza Virus in Humans Increase Alpha Interferon Secretion by Plasmacytoid Dendritic Cells. J Virol, 2011. 85(8): p. 3893-904.

10. Pezda, A.C., et al., Suppression of TLR9 Immunostimulatory Motifs in the Genome of a Gammaherpesvirus. J Immunol, 2011. 187(2): p. 887-96.

11. Bauer, S., et al., Human TLR9 confers responsiveness to bacterial DNA via species-specific CpG motif recognition. Proc Natl Acad Sci U S A, 2001. 98(16): p. 9237-42.


12. Jones, P.A., Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet, 2012. 13(7): p. 484-92.

13. Singh, M.K. and C.D. Pauza, Extrachromosomal Human-Immunodeficiency-Virus Type-1 Sequences Are Methylated in Latently Infected U937 Cells. Virology, 1992. 188(2): p. 451-8.

14. Pion, M., et al., Transcriptional suppression of in vitro-integrated human immunodeficiency virus type 1 does not correlate with proviral DNA methylation. J Virol, 2003. 77(7): p. 4025-32.

15. Ishida, T., et al., 5' long terminal repeat (LTR)-selective methylation of latently infected HIV-1 provirus that is demethylated by reactivation signals. Retrovirology, 2006. 3(1): p. 69.

16. Blazkova, J., et al., CpG methylation controls reactivation of HIV from latency. PLoS Pathog, 2009. 5(8): p. e1000554.

17. Blazkova, J., et al., Paucity of HIV DNA Methylation in Latently Infected, Resting CD4(+) T Cells from Infected Individuals Receiving Antiretroviral Therapy. J Virol, 2012. 86(9): p. 5390-2.

18. Duverger, A., et al., Determinants of the establishment of human immunodeficiency virus type 1 latency. J Virol, 2009. 83(7): p. 3078-93.

19. Weber, S., et al., Epigenetic analysis of HIV-1 proviral genomes from infected individuals: Predominance of unmethylated CpG's. Virology, 2014. 449: p. 181-9.

20. Zhang, X.L. and C.K. Mathews, Effect of DNA Cytosine Methylation Upon Deamination-Induced Mutagenesis in a Natural Target Sequence in Duplex DNA. J Biol Chem, 1994. 269(10): p. 7066-9.

21. Fryxell, K.J. and W.J. Moon, CpG mutation rates in the human genome are highly dependent on local GC content. Mol Biol Evol, 2005. 22(3): p. 650-8.

22. Bird, A.P., DNA Methylation and the Frequency of Cpg in Animal DNA. Nucleic Acids Res, 1980. 8(7): p. 1499-504.

23. Simmen, M.W., Genome-scale relationships between cytosine methylation and dinucleotide abundances in animals. Genomics, 2008. 92(1): p. 33-40.

24. Barton, G.M., et al., Intracellular localization of Toll-like receptor 9 prevents recognition of self DNA but facilitates access to viral DNA. Nature Immunol, 2006. 7(1): p. 49-56.

25. Chockalingam, A., et al., TLR9 traffics through the Golgi complex to localize to endolysosomes and respond to CpG DNA. Immunol Cell Biol, 2009. 87(3): p. 209-17.

26. Li, Y., et al., The DNA methylome of human peripheral blood mononuclear cells. PLoS Biol, 2010. 8(11): p. e1000533.

27. Geisen, S., et al., NGSmethDB: an updated genome resource for high quality, single-cytosine resolution methylomes. Nucleic Acids Res, 2014. 42(Database issue): p. D53-9.

28. Hackenberg, M., et al., NGSmethDB: a database for next-generation sequencing single-cytosine-resolution DNA methylation data. Nucleic Acids Res, 2011. 39(Database issue): p. D75-9.

29. Burge, C., et al., Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci U S A, 1992. 89(4): p. 1358-62.

30. Leung, M.Y., G.M. Marsh, and T.P. Speed, Over- and underrepresentation of short DNA words in herpesvirus genomes. J Comput Biol, 1996. 3(3): p. 345-60.

31. Anwar, F., et al., Footprint of APOBEC3 on the genome of human retroelements. J Virol, 2013. 87(14): p. 8195-204.

32. Ebrahimi, D., et al., APOBEC3G and APOBEC3F rarely co-mutate the same HIV genome. Retrovirology, 2012. 9(1): p. 113.

33. Pandit, A. and S. Sinha, Differential Trends in the Codon Usage Patterns in HIV-1 Genes. PLoS One, 2011. 6(12).

34. Kypr, J., et al., Nucleotide Composition Bias and Cpg Dinucleotide Content in the Genomes of HIV and HTLV 1/2. Biochim Biophys Acta, 1989. 1009(3): p. 280-2.

35. Cheng, X., et al., CpG usage in RNA viruses: data and hypotheses. PLoS One, 2013. 8(9): p. e74109.

36. Barton, G.M., Viral recognition by Toll-like receptors. Semin Immunol, 2007. 19(1): p. 33-40.

37. Hemmi, H., et al., A Toll-like receptor recognizes bacterial DNA. Nature, 2000. 408(6813): p. 740-5.

Chapter 6:

Relationship between cell division, gene expression and patterns of HIV integration into the human genome

Author contributions to thesis Chapter 6: HA-R and MPD: Conceived and designed the experiments. HA-R: Designed and implemented the computational algorithms, and performed the bioinformatics analysis and interpretation under the supervision of MPD. HA-R: Wrote the chapter. VV, DE and MPD: Read and revised the chapter. HA-R: created all Tables and Figures.

Abstract

Since the beginning of the HIV epidemic, almost 78 million people have been infected with HIV-1 and about 39 million people have died of HIV-1. HIV-1 is a retrovirus that can infect billions of vital immune cells each day. The insertion of HIV-1 DNA into the host chromosome is an important step in the HIV-1 life cycle, so understanding this process is important for the development of therapeutic strategies. There is evidence that insertion of HIV-1 DNA into the host DNA is not random. It has also been reported that, in expanded clones of cells, HIV-1 preferentially integrates into cancer genes or genes related to cell division. However, the factors controlling the specificity of CD4 T-cell division and the reason why HIV-1 selects certain genes in a target cell remain poorly understood. In this study, I investigated the relationship between cell division, gene expression and patterns of HIV-1 integration into the human genome.

Introduction

HIV-1 is a retrovirus that can infect billions of cells (10.3 × 10⁹) each day [1]. Like other retroviruses, HIV-1 replication requires its provirus to be integrated into the genome of the host cell [2]. Integration of the HIV-1 DNA into the host genome and the persistence of integrated viruses in latently-infected CD4 T-cells is the major barrier to HIV clearance [3-5].

Combination antiretroviral therapy can suppress the replication of HIV-1. However, it cannot eradicate the latently-infected CD4 T-cells [3, 6-8]. Although a large fraction of the HIV-1 proviruses that persist in infected individuals are defective, a small fraction of the latent reservoir remains intact and is able to reactivate from latency and begin a new infection [9]. The latent state is established very early during infection, but the exact timing is unknown [10, 11], and because of its extremely long half-life of forty-four months in infected patients [12], it is the major obstacle to achieving a cure for HIV-1 infection [3]. However, because the latent reservoir is very small, its formation is not easy to study. The establishment and persistence of the latent HIV-1 reservoir requires integration of intact, reverse-transcribed proviral DNA into the host genome and subsequent transcriptional silencing [3]. Whether or not the site of integration of HIV-1 into the host cell's genome affects latency is not clear [13-15]. Analysis of HIV-1 integration sites in infected patients undergoing suppressive antiretroviral therapy has revealed that clonal expansion of the HIV-1 proviral reservoir may contribute to the long-term persistence of viral latent reservoirs.

There is substantial evidence that integration of retroviral genomes into the host DNA is not random. Although each retrovirus has its own specific strategy and its specific target site preference, previous studies [16-20] have indicated that HIV-1 preferentially integrates into transcriptionally active genes and mainly into introns and/or regulatory regions [21-25]. It has also been reported that the sites of integration can influence the expression of the integrated HIV-1 provirus [19, 20, 22-25].

Two recent studies [26, 27] provide evidence that clonal proliferation of latently infected CD4 T-cells plays an important role in the maintenance of the HIV-1 latent reservoir. The DNA from CD4 T-cells of patients infected by HIV-1 was sequenced and the sites of integration in the human genome were determined. The observation of identical integration sites in multiple cells indicated that some infected cells can grow and divide extensively after infection, giving rise to expanded clones of HIV-1-infected cells, all of which have an HIV-1 provirus at the identical integration site [26-28]. Simonetti et al. [29] have recently reported that highly expanded CD4 T-cell clones can contain intact proviruses, which can give rise to infectious HIV-1 virus. Their study also demonstrated that expanded clones produced infectious virus that was recognised as persistent plasma viremia during combination antiretroviral therapy in an HIV-1-infected individual who had squamous cell cancer. CD4 T-cells with an intact HIV-1 provirus were broadly distributed and remarkably enriched in cancer metastases [29].

It has also been reported that HIV-1 preferentially integrated into cancer genes in expanded clones of cells or genes associated with cell proliferation (for example, MKL2) [26, 27]. In the individuals studied, there were several genes with multiple HIV-1 integration sites and many genes with only single HIV-1 integration sites. Some of these genes with multiple integration sites, such as MKL2 and BACH2, were observed in clonally expanded cells.

Based on the studies by Maldarelli et al. and Wagner et al. [26, 27] and on observations that cells bearing integrated HIV-1 undergo cell proliferation in infected individuals receiving c-ART, it has been proposed that the clonally expanded cells and preferential sites of HIV-1 integration in latently-infected CD4 T-cells play a critical role in maintaining the reservoir and the persistence of HIV-1 [26, 27]. However, there remains much to understand about the mechanisms controlling these processes. For example, gene expression may be one of the factors that affect clonal expansion. In previous studies [26, 27] it has been reported that some genes, such as MKL2, have been targeted by HIV-1 integration at several independent sites, while other genes have been targeted only at one site. However, those studies did not address why cells with particular HIV-1 integration sites have undergone more cell proliferation.

To gain insight into the regions of the human genome favoured for HIV-1 integration, and into the role of cell proliferation and clonal expansion in maintaining the latent reservoir, I investigated published data from two recent studies [26, 27] and used gene expression data from a third study [30] to further investigate the relationship between cell proliferation, gene expression, and the frequency and patterns of HIV-1 integration into the human genome.

Materials and methods

Genome sequence and gene expression data

I analysed HIV-1 integration sites in CD4 T-cells using published data from two studies. The first study includes five patients from the NIH clinical centre in Bethesda, Maryland [26]. In this study, proviral sequences from PBMCs or CD4 T-cells were obtained using single genome sequencing of 1100 bp segments near the end of the HIV-1 GAG gene. A total of 50 × 10⁶ cells were processed using the EasySep Human CD4 T-cell enrichment kit (Stem-cell Technologies) [26]. A total of 2050 integration sites were obtained from this study. I also used data from another study [27] to confirm the results of my analysis of data from the first study [26]; in the second study [27], 534 integration sites were obtained from three HIV-1 infected patients enrolled at the Seattle Children's Institutional Review Board for Human Subjects [27]. Table 6.1 shows a summary of the data from these two published studies [26, 27].

For the gene expression analysis, I used previously reported microarray data [30], which has been uploaded to the Gene Expression Omnibus (GEO) with accession number GSE23321; the data includes a list of transcript cluster IDs and, for each gene, gene expression values of TSCM (stem T-cell), TCM (memory T-cell), TEM (effector memory T-cell) and TN (naive T-cell) from three donors. In the gene-expression study, lymphocyte-enriched apheresis blood was generated from the blood of three healthy humans. For all individuals, the gene expression analysis was done using total RNA from sorted CD8 T-cell subsets, which was isolated using an RNEasy Micro Kit, processed and labelled using the Transcript Expression Kit and Transcript Terminal Labelling Kit, respectively, and finally hybridised to Transcript Human Gene arrays [30]. The raw data from the provided .CEL files were imported from Cmap Raw Data files into the Partek Genomics Suite using the RMA method [30]. I used the processed gene expression data, a BED file containing, for each gene, the region_ID, gene_ID and log2-transformed RMA signal intensities from Partek GS.

Table 6.1: Details of the data obtained from references 26 and 27 that were used in this study.

                                         NIH Clinical Centre in    Seattle Children's
                                         Bethesda study            Institutional study *
Number of patients                       5                         3
Number of integration sites              2050                      534
Number of genes                          1189                      262
Number of genes with expanded sites      217                       48
Number of genes with non-expanded sites  972                       214

There are many cell-surface markers that represent different T-cell subsets. Many of these markers are up-regulated or down-regulated rapidly after T-cell activation. Previous studies have shown that the pool of memory CD8 T-cells can be partitioned into two groups: TEM (effector memory T-cell) and TCM (memory T-cell) [30-32]. Downregulation of the lymphoid homing markers CD62L and CCR7 in the TEM group of cells restricts their capability to inhabit the lymph node, allowing them to distribute and home to non-lymphoid tissues. In addition, the TEM group of cells stays poised to generate immediate effector functions. The TCM subset expresses high levels of CD62L and CCR7, restricting its homing to lymphoid tissues. To generate stem T-cells, Gattinoni et al. [30] stimulated CD45RO−CD62L+ cells with α-CD3/CD28 beads in the presence of the GSK-3β inhibitor TWS119 for 2 weeks. An extensive phenotypic analysis using the established markers CD45RA (or CD45RO), CCR7, CD27, CD28, IL7Rα, CD95, CD3 and CD62L was done to characterise whether the CD45RO−CD62L+ T-cells that emerged in the presence of the TWS119 inhibitor were really naive cells or had acquired memory properties [30]. Most of the molecules (CD45RA, CCR7, CD27, IL-2Rα, IL-7Rα, CD69, 41BB, CCR5 and CD57) presented the same expression pattern between subsets [30]. However, some molecules showed a different expression pattern in different subsets. CD8+ T-cell subsets were determined as follows: TN cells: CD3+CD8+CD45RO−CCR7+CD45RA+CD62L+CD27+CD28+IL7Rα+CD95−; TSCM cells: CD3+CD8+CD45RO−CCR7+CD45RA+CD62L+CD27+CD28+IL7Rα+CD95+; TCM cells: CD3+CD8+CD45RO+CD45RA−CCR7+CD62L+; TEM cells: CD3+CD8+CD45RO+CD45RA−CCR7−CD62L− [30]. CD8 T-cells were enriched from healthy human peripheral blood mononuclear cells (PBMCs) by using the EasySep CD8 T-cell Enrichment Kit (Stem-cell Technologies), stained with various mixtures of fluorescently labelled monoclonal antibodies and sorted with a FACSAria (BD Biosciences) into various live (Live/Dead-; Invitrogen) CD8 T-cell subsets [30].

* Used to confirm the results obtained from analysing the data from reference [26].

Identification of sites by hierarchical clustering

To identify genes with expanded and non-expanded integration sites, I used a hierarchical clustering method known as ‘connectivity based clustering’ [33]. This method groups integration sites that are close together and share a common region of nucleotides longer than a specified threshold into a cluster (site). It then repeats this merging, creating bigger sites, until no integration site remaining in the original data shares a common region longer than the threshold with an existing site. Table 6.2 shows an example of sites that the method could detect.
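The merging rule described above can be illustrated with a minimal sketch. It is not the thesis implementation: the record types `Read` and `Site`, the function names, the default `min_overlap` of 20 bases and the toy coordinates are all assumptions made for illustration, and the sketch simplifies the clustering to a single sorted pass in which a sequence joins the current site when it shares more than `min_overlap` bases with that site's span.

```python
from collections import namedtuple

# Hypothetical record types; field names are illustrative, not from the thesis.
Read = namedtuple("Read", "patient chrom start end")
Site = namedtuple("Site", "patient chrom start end copies")


def cluster_reads(reads, min_overlap=20):
    """Merge sequences into sites when they share more than min_overlap bases.

    Simplifying assumption: sequences are processed in sorted order and
    compared with the span of the site currently being built.
    """
    ordered = sorted(reads, key=lambda r: (r.patient, r.chrom, r.start, r.end))
    sites = []
    for r in ordered:
        last = sites[-1] if sites else None
        same_locus = (
            last is not None
            and last.patient == r.patient
            and last.chrom == r.chrom
        )
        # Sequences are sorted by start, so the shared region begins at r.start.
        if same_locus and min(last.end, r.end) - r.start > min_overlap:
            sites[-1] = last._replace(
                end=max(last.end, r.end), copies=last.copies + 1
            )
        else:
            sites.append(Site(r.patient, r.chrom, r.start, r.end, 1))
    return sites


def site_name(site):
    """Format a site in the style of the 'Site name' column of Table 6.2."""
    return f"{site.patient}|{site.copies}|{site.chrom}:{site.start}-{site.end}"


# Toy coordinates, invented for illustration only.
toy_reads = [
    Read("PT1", "chr16", 100, 200),
    Read("PT1", "chr16", 150, 260),
    Read("PT1", "chr16", 230, 300),
    Read("PT1", "chr16", 1000, 1100),
    Read("PT1", "chr1", 50, 90),
]
toy_sites = cluster_reads(toy_reads)
for s in toy_sites:
    print(site_name(s))
```

In this sketch a site with `copies` greater than 1 corresponds to an expanded site, and a site with `copies` equal to 1 to a non-expanded site.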

Table 6.2: An example of the output from the first step of the method that identifies the integration sites using hierarchical clustering.

Site name                         #Copy  CHR    Start      End        ID_REF   Year  Strand  LTR  Gene name
PT3|62|chr22:30514808-30515408    62     chr22  30514808   30515408   8072304  7.2   -       3'   HORMAD2
PT1|55|chrX:13040508-13041459     55     chrX   13040508   13041459   8169904  11.4  -       3'
PT1|34|chr16:14307289-14308336    34     chr16  14307289   14308336   7993310  11.4  -       5'   MKL2
PT1|17|chr1:198506485-198506962   17     chr1   198506485  198506962  7923164  11.4  -       3'   ATP6V1G3
PT1|13|chr13:108927120-108927548  13     chr13  108927120  108927548  7969986  11.4  -       5'   TNFSF13B
......
PT4|13|chr5:138977164-138977377   13     chr5   138977164  138977377  8108435  12.2  +       3'   UBE2D2
PT1|10|chr15:39974857-39975376    10     chr15  39974857   39975376   7987426  0.2   -       3'   FSIP1
PT1|9|chr1:35948969-35949387      9      chr1   35948969   35949387   7914809  11.4  -       3'   KIAA0319L
PT1|9|chr16:14308469-14309049     9      chr16  14308469   14309049   7993310  11.4  -       5'   MKL2
PT1|8|chr20:43561095-43561610     8      chr20  43561095   43561610   8062890  11.4  +       5'   PABPC1L
......
PT1|2|chr1:150117257-150117310    2      chr1   150117257  150117310  7905099  11.4  -       3'   VPS45
PT1|2|chr10:69745232-69745549     2      chr10  69745232   69745549   7933947  11.4  -       5'   HERC4
PT4|2|chr12:107352896-107353077   2      chr12  107352896  107353077  7958346  12.2  -       3'   C12orf23
PT5|1|chr1:27056289-27056350      1      chr1   27056289   27056350   7899220  14.5  +       3'   ARID1A
PT1|1|chr1:116938051-116938166    1      chr1   116938051  116938166  7904254  11.4  -       3'   ATP1A1
PT4|1|chr8:101961400-101961551    1      chr8   101961400  101961551  8152096  12.2  +       3'   YWHAZ

Both HIV-1 integration site datasets [26, 27] and the gene details dataset [30] specify the start and end points of the integrated regions and of the genes, respectively. The next step of the algorithm is to find the number of expanded and non-expanded HIV-1 integration sites that exist in each gene. This step assigns genes to two groups:

Expanded sites genes (ES-Gene): Genes that include sites in which multiple HIV-1 sequences have been integrated (genes that include regions with multiple copies of an integration site).

Non-expanded sites genes (NES-Gene): Genes that include sites within which only one integrated viral genome is found (genes that include regions with only one copy of an integration site).

The purpose of this step is to classify genes into two groups (genes that include expanded sites and genes that include only non-expanded sites) and, within these two groups, to classify genes into subgroups based on the number of sites (see Fig. 6.1).
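A minimal sketch of this classification step is given below. The function name `classify_genes`, the tuple layouts, and the toy genes and sites are hypothetical; coordinates are assumed to be half-open intervals, and a site is assigned to a gene whenever the two intervals overlap, which is an assumption about a detail the text does not specify.

```python
from collections import defaultdict


def classify_genes(sites, genes):
    """Assign each gene with at least one site to the ES or NES group.

    sites: iterable of (chrom, start, end, copies)
    genes: iterable of (name, chrom, start, end)
    Returns {gene: (group, n_sites, n_expanded, n_non_expanded)}.
    """
    counts = defaultdict(lambda: {"expanded": 0, "non_expanded": 0})
    for name, g_chrom, g_start, g_end in genes:
        for chrom, start, end, copies in sites:
            # Assumed rule: a site belongs to a gene if the intervals overlap.
            if chrom == g_chrom and start < g_end and end > g_start:
                key = "expanded" if copies > 1 else "non_expanded"
                counts[name][key] += 1
    result = {}
    for name, c in counts.items():
        # ES: at least one expanded site; NES: only non-expanded sites.
        group = "ES" if c["expanded"] > 0 else "NES"
        total = c["expanded"] + c["non_expanded"]
        result[name] = (group, total, c["expanded"], c["non_expanded"])
    return result


# Toy data, invented for illustration only.
toy_genes = [
    ("GENE_A", "chr16", 1000, 5000),
    ("GENE_B", "chr1", 100, 900),
    ("GENE_C", "chr2", 0, 100),
]
toy_sites = [
    ("chr16", 1200, 1300, 34),
    ("chr16", 2000, 2100, 1),
    ("chr1", 200, 300, 1),
    ("chr1", 400, 450, 1),
]
toy_groups = classify_genes(toy_sites, toy_genes)
print(toy_groups)
```

Genes without any integration site (GENE_C in the toy data) do not appear in the result; the second element of each tuple is the number of sites used to form the subgroups.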

Figure 6.1: The method divides genes into two different groups: a group of genes that include only non-expanded sites and a group of genes that include at least one expanded site. Genes within each group are then further clustered into smaller groups based on the number of existing sites within genes. NES: Non-expanded sites genes; ES: Expanded sites genes.

As an example, Fig. 6.2 shows a distribution of expanded and non-expanded sites in the human gene MKL2 on chromosome 16. For this gene, there are 11 sites in which the HIV-1 genome has integrated. These include nine expanded and two non-expanded integration sites.

Figure 6.2: Expanded sites and non-expanded sites of the HIV-1 genome within the MKL2 gene on chromosome 16 using the published data from reference 26. Figure drawn using the Geneious software.

In the last step, the pipeline determines the expression level of each gene identified in the previous step of the pipeline (that is, ‘output 2’ in Fig. 6.4). Both the new dataset (‘output 2’ in Fig. 6.4) and the expression dataset [30] contain start and end points, of the genes and of the expressed regions respectively (i.e. gene coordinates in the pipeline output file and expressed region coordinates in the expression dataset). To characterise the expression level for each gene, the method searches the expression dataset for the region closest to the gene. For most of the genes, the method found one unique expression value because the region and gene had the same coordinates or the gene only overlapped with one region (see parts A and B in Fig. 6.3, respectively). In some cases, the gene overlapped with two regions; the method then selects the expression value associated with the region that has more base pairs in common with the gene. If the gene overlapped equally with both regions, the method selects the average expression of both regions (see parts C and D in Fig. 6.3, respectively). If a gene overlapped with more than two regions, the method assigns to the gene the average expression across all of these regions (see part E in Fig. 6.3). A schematic of the pipeline is shown in Fig. 6.4, and an example of the final output of the pipeline is shown in Table 6.3.
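The assignment rules for cases A to E can be sketched as follows. This is an illustrative sketch rather than the pipeline code: the function names, the tuple layouts and the toy values are hypothetical, all regions are assumed to lie on the same chromosome as the gene, coordinates are treated as half-open intervals, and returning `None` when no region overlaps the gene is an assumption (the text does not describe that case).

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of shared bases between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def gene_expression(gene, regions):
    """Expression value for a gene from overlapping expression regions.

    gene: (start, end); regions: list of (start, end, value).
    """
    hits = [(overlap(gene[0], gene[1], s, e), v) for s, e, v in regions]
    hits = [(o, v) for o, v in hits if o > 0]
    if not hits:
        return None  # assumption: no overlapping region, no value assigned
    if len(hits) == 1:
        return hits[0][1]  # cases A and B: one region
    if len(hits) == 2:
        (o1, v1), (o2, v2) = hits
        if o1 == o2:
            return (v1 + v2) / 2  # case D: equal overlap, take the average
        return v1 if o1 > o2 else v2  # case C: larger overlap wins
    # case E: more than two regions, average over all of them
    return sum(v for _, v in hits) / len(hits)


# Toy values, invented for illustration only.
toy_gene = (100, 200)
print(gene_expression(toy_gene, [(100, 200, 8.0)]))
print(gene_expression(toy_gene, [(50, 120, 6.0), (120, 300, 10.0)]))
print(gene_expression(toy_gene, [(50, 150, 6.0), (150, 300, 10.0)]))
print(gene_expression(toy_gene, [(90, 110, 3.0), (110, 150, 6.0), (150, 250, 9.0)]))
```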

Figure 6.3: Determining the expression level for each gene. A) The gene and region have the same coordinates, or B) the gene overlaps with only one region; in these situations, the method uses the region expression as the gene expression. C) When a gene overlaps with two regions, the method selects the expression value associated with the region that has more base pairs in common with the gene. D) If the gene overlaps equally with both regions, the method selects the average expression of both regions. E) If a gene overlaps with more than two regions, the method uses the average expression of the regions as the gene expression.

Figure 6.4: A schematic of the pipeline, which has been used in this study.

Table 6.3: An example of the final output of the method that specifies, for various genes, the average gene expression level across 3 donors for various types of T-cells and the identified sites of integration. TSCM: stem memory T-cell, TCM: memory T-cell, TEM: effector memory T-cell, and TN: naive T-cell.

Gene name  CHR    Gene size  #Copy  #Sites  Expanded sites  Non-expanded sites  #Sample  Sample name         TSCM    TCM     TEM     TN
MKL2       chr16  195453     121    11      9               2                   2        PT1, PT4            8.919   8.177   8.029   8.306
BACH2      chr6   370380     25     16      6               10                  2        PT1, PT3            8.494   7.967   7.769   8.510
STAT5B     chr17  77540      18     11      3               8                   4        PT1, PT3, PT4, PT5  11.802  11.172  11.115  10.723
NSD1       chr5   167191     7      4       3               1                   2        PT1, PT5            9.695   9.257   9.351   9.194
TNRC6B     chr22  290991     7      5       2               3                   2        PT1, PT5            10.420  10.252  10.218  10.276
......
MKL1       chr22  226422     9      4       2               2                   1        PT1                 6.357   6.109   6.170   6.189
PRKCB      chr16  384611     6      4       2               2                   3        PT1, PT2, PT3       10.890  10.662  10.679  10.240
KIAA0319L  chr1   124461     12     3       2               1                   1        PT1                 9.997   9.876   9.973   9.312
PAK2       chr3   92791      10     3       2               1                   2        PT1, PT2            9.945   10.090  9.976   9.866
HORMAD2    chr22  96902      64     3       1               2                   2        PT1, PT3            4.527   4.306   4.106   4.153
NFAT5      chr16  139573     4      3       1               2                   2        PT1, PT3            8.868   9.128   9.350   8.963
......
OSBPL3     chr7   185096     4      3       1               2                   1        PT1                 7.938   8.447   8.405   7.090
SHOC2      chr10  94125      4      3       1               2                   1        PT1                 7.897   7.620   7.474   7.669
PABPC1L    chr20  48974      9      2       1               1                   2        PT1, PT3            7.962   7.765   7.579   7.535
ATP6V1G3   chr1   17724      17     1       1               0                   1        PT1                 4.430   4.416   4.461   4.453
TNFSF13B   chr13  57245      13     1       1               0                   1        PT1                 6.9654  6.449   7.802   6.405
ZNF586     chr19  50285      2      1       1               0                   1        PT1                 7.568   7.573   7.418   7.814
......
CYTH1      chr17  108250     6      6       0               6                   2        PT1, PT5            11.368  11.368  11.349  11.428
KDM2A      chr11  138819     5      5       0               5                   2        PT1, PT5            10.055  10.171  10.141  10.102
PACS1      chr11  174385     5      5       0               5                   2        PT1, PT4            11.330  10.834  10.660  11.060
FOXK2      chr17  124950     4      4       0               4                   3        PT1, PT3, PT5       9.734   9.709   9.822   9.517
......
ANKRD12    chr18  148981     3      3       0               3                   2        PT1, PT5            10.726  10.477  10.621  10.123
ARHGEF18   chr19  77365      3      3       0               3                   2        PT1, PT3            9.407   9.3658  8.727   9.737
ATM        chr11  146619     3      3       0               3                   2        PT1, PT3            11.181  11.280  11.023  11.612
CDC42SE2   chr5   152955     3      3       0               3                   2        PT1, PT3            12.417  12.321  12.337  12.242
......
IQGAP1     chr15  114026     2      2       0               2                   1        PT5                 11.018  11.259  11.534  10.540
KIAA0195   chr17  58932      2      2       0               2                   2        PT1, PT3            9.495   9.545   9.684   9.380
ASCC3      chr6   373179     1      1       0               1                   1        PT1                 8.842   9.000   8.907   8.859
ADNP       chr20  42374      1      1       0               1                   1        PT1                 9.889   9.922   9.874   9.978
BCAT2      chr19  15968      1      1       0               1                   1        PT1                 6.815   6.660   6.632   7.008

The TSCM, TCM, TEM and TN columns give the average gene expression value of 3 donors.

Results

Gene and non-gene regions

I first analysed the frequency of intra- and inter-genic sites of HIV-1 integration in the published data of reference 26; I found that 87% of sites were within genes (see Fig. 6.5A). I also analysed the fraction of expanded sites and non-expanded sites inside and outside genes. It is shown in Fig. 6.5B that the fraction of genes that include at least one expanded site is slightly greater than that of genes that include only non-expanded sites within genes.

Figure 6.5: A): Fraction of integration sites across the whole genome in gene and non-gene regions; B): Fraction of expanded and non-expanded sites in gene and non-gene regions.

Table 6.4 shows a summarised list of genes that contain regions with multiple copies of an integration site in the Maldarelli study [26].

Table 6.4: A summarised list of genes with expanded sites in the Maldarelli study.

Gene name      Gene length  All sites  Multiple copies IS  Single copy IS  #Sample
BACH2          370380       16         6                   10              2
STAT5B         77540        10         2                   8               4
MKL2           195453       8          6                   2               2
RPTOR          421553       6          1                   5               4
TNRC6B         290991       5          2                   3               2
NSD1           167191       4          3                   1               2
TAOK1          161441       4          3                   1               1
MKL1           226422       4          2                   2               1
PRKCB          384611       4          2                   2               3
DNMT1          97942        4          1                   3               2
FKBP5          154999       4          1                   3               2
MAP4           238588       4          1                   3               1
VPS45          78137        4          1                   3               2
KIAA0319L      124461       3          2                   1               1
PAK2           92791        3          2                   1               2
ARID2          178376       3          1                   2               3
EPS15L1        116847       3          1                   2               2
FNBP1          156008       3          1                   2               1
GATAD2A        123102       3          1                   2               1
HORMAD2        96902        3          1                   2               2
NFAT5          139573       3          1                   2               2
NFATC3         144509       3          1                   2               2
NUMA1          77830        3          1                   2               3
OSBPL3         185096       3          1                   2               1
R3HDM2         181397       3          1                   2               1
RP11-67A1.2    44932        3          1                   2               2
SHOC2          94125        3          1                   2               1
SLC30A7        85678        3          1                   2               2
SMARCE1        23547        3          1                   2               2
STAT3          75245        3          1                   2               1
UBE2H          122218       3          1                   2               2
VPRBP          100713       3          1                   2               2
RSRC2          22358        2          2                   0               1
ACOX1          37928        2          1                   1               2
ALG12          15240        2          1                   1               1
ATF7IP         133088       2          1                   1               1
BTBD9          471698       2          1                   1               1
C12orf23       23060        2          1                   1               2
C17orf70       14077        2          1                   1               2
CDK6           231674       2          1                   1               1
CDKAL1         697948       2          1                   1               1
DIP2A          111115       2          1                   1               1
DNAJC7         44944        2          1                   1               1
ERCC6L2        138860       2          1                   1               2
ETS1           128782       2          1                   1               1
HERC4          153441       2          1                   1               2
HNRNPM         44348        2          1                   1               1
ILF3           38157        2          1                   1               2
IWS1           90680        2          1                   1               2
JMJD1C         298742       2          1                   1               1
KMT2E          100183       2          1                   1               1
LOH12CR1       109828       2          1                   1               2
LRRC14         7182         2          1                   1               1
MIB1           166001       2          1                   1               1
NAA25          82327        2          1                   1               1
NCOR1          189029       2          1                   1               1
NOSIP          34552        2          1                   1               2
PABPC1L        48974        2          1                   1               2
PARP8          180624       2          1                   1               1
PRDM2          124882       2          1                   1               1
RALGAPB        106046       2          1                   1               2
RB1            178236       2          1                   1               1
RP11-146F11.5  22687        2          1                   1               1
SSH2           304339       2          1                   1               1
STIM1          238683       2          1                   1               1
TMED2          14039        2          1                   1               1
TSKS           23578        2          1                   1               2
ZCCHC11        145206       2          1                   1               2

Gene expression analysis

I identified 2050 HIV-1 integration sites within 1189 genes in the published data of the Maldarelli paper [26]. To investigate the link between gene expression and HIV-1 integration, I performed an association analysis between the genes from the HIV-1 integration dataset and the gene expression data [30]. I found that the expression of genes that had at least one integration site was significantly higher than that of the genes that did not have any integration sites (see Fig. 6.6A).

Figure 6.6: A): Gene expression comparison between genes that have at least one HIV-1 integration site and genes that do not have any site. B): Gene expression comparison between genes that have at least one expanded site and genes that have only non-expanded sites. (P value calculated by non-parametric two-tailed Mann-Whitney test).

To investigate gene expression patterns in the genes that have at least one expanded site and in genes that have only non-expanded sites, I first made a comparison without setting a specific threshold on the number of sites. The result, shown in Fig. 6.6B, revealed that genes that include at least one expanded site had significantly lower gene expression levels than genes that include only non-expanded sites. However, this figure does not tell us whether this pattern holds for all expanded-site genes. For example, does a gene with one expanded site have a higher gene expression level than a gene with two expanded sites? To test this, I compared the expression of expanded-site genes and non-expanded-site genes having the same number of integration sites; that is, I asked how expanded-site genes and non-expanded-site genes differ at different thresholds of the number of sites. I found that, at all thresholds of the number of sites, expanded-site genes had lower gene expression than non-expanded-site genes, with a significant P value. The result is shown in Fig. 6.7.
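The group comparisons in this section rely on a two-tailed Mann-Whitney test. The thesis does not state which software computed it, so the following is only a self-contained sketch of the test itself, using the normal approximation with tie correction and no continuity correction. The function names and the toy expression values are hypothetical, and for small groups an exact test would be preferable to this approximation.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U statistic of x, two-tailed P value by normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2) / sigma
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_two_tailed


# Toy log2 expression values, invented for illustration only.
expanded_site_genes = [4.5, 6.4, 7.9, 8.3, 8.9]
non_expanded_site_genes = [9.4, 9.9, 10.1, 11.0, 11.4, 12.4]
u, p = mann_whitney_u(expanded_site_genes, non_expanded_site_genes)
print(u, p)
```

To reproduce the per-threshold comparison, the same function would be applied separately to the expanded-site and non-expanded-site genes at each number of sites.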

I also found that there is a positive correlation between the number of integration sites and gene expression in both groups (expanded-site and non-expanded-site genes). This analysis shows that, within each group, HIV-1 integration favours genes that have high gene expression. That is, most of the genes with higher numbers of integration sites had high gene expression within each group. In Fig. 6.7, the red dashed line shows significantly increasing gene expression at different thresholds of the number of sites within non-expanded-site genes (P value < 0.0001) and the green dashed line shows increasing gene expression at different thresholds of the number of sites within expanded-site genes.

Figure 6.7: Gene expression comparison between genes that include at least one expanded site and genes that include only non-expanded sites at different threshold levels in the number of integration sites per gene. (P value calculated by non-parametric two-tailed Mann-Whitney test and non-parametric one-way ANOVA).

Comparison of gene expression level between different kinds of T-cells

Two recently published papers in Science [26, 27] reported new evidence of the clonal proliferation of CD4 T-cells latently infected with HIV. By identifying identical integration sites of HIV in multiple cells, they demonstrated that a single infection event could produce multiple latently infected cells and that HIV preferentially integrated into cancer genes in expanded clones of cells. Considering these two papers, I tested an alternative hypothesis based on HIV infection of CD4 stem cells. That is, in short-lived effector cells the effector genes are enriched for HIV integration, and in precursor cells (stem cell memory and central memory cells) integration sites are preferentially within the genes that are associated with proliferation. With expansion over time, this will lead to an increasing proportion of integration sites in ‘cancer genes’, and multiple copies of each. To test this alternative hypothesis, I analysed the expression of genes with expanded and non-expanded integration sites separately, and investigated the patterns for stem cell memory, central memory, effector memory and naive cells within both groups of genes. In Fig. 6.8, the red line shows gene expression at different thresholds of the number of sites within non-expanded-site genes for the four different types of T-cells. The green line shows the gene expression at different thresholds of the number of sites within expanded-site genes. As shown in Fig. 6.8, there are no significant differences in the expression level between different cell types within either expanded-site or non-expanded-site genes.

Figure 6.8: Red line: Gene expression comparison of different T-cells within genes that include only non-expanded sites. Green line: Gene expression comparison of different T-cells within genes that include at least one expanded site. TSCM: stem T-cell, TCM: memory T-cell, TEM: effector memory T-cell, and TN: naive T-cell.

Cancer genes and non-cancer genes

In the previous section, it was shown that for some T-cell types there appears to be an association between gene expression level and expanded sites of HIV integration across all genes. It is known that cancer genes are good targets for HIV integration [26, 27, 34]. I therefore extended this analysis to consider cancer and non-cancer genes separately. I compared expanded-site genes with non-expanded-site genes, without any specific threshold on the number of sites. The results are shown in Fig. 6.9. My investigations showed that there were similar patterns of gene expression in cancer and non-cancer genes. In both groups, expanded-site genes had lower gene expression than non-expanded-site genes. This difference was statistically significant (P value < 0.0001) for non-cancer genes but not for cancer genes (possibly because of the low number of cancer genes).

Figure 6.9: Gene expression comparison between genes that include at least one expanded site and genes that include only non-expanded sites within cancer- and non-cancer genes (P value calculated by non-parametric two-tailed Mann Whitney test).

Expansion size ratio

It is important to know how the maximum copy number of integration sites within each patient, considered relative to the number of patient sequences, correlates with gene expression level. Using such a measure, one can establish the impact of the number of patient sequences on the identification of multiple-copy integration sites. Is there a positive correlation between the number of patient sequences and the number of sites? Does a patient with more sequences have more multiple-copy integration sites? To examine this, I defined the following ratio:

\[
\text{Expansion size} = \frac{\text{maximum copy number of integration sites within gene}}{\text{number of patient's sequences}} \qquad \text{(Eq. 1)}
\]

Expansion size estimates, for each gene, the maximum cell expansion relative to the total number of patient sequences. I applied a Spearman correlation test with a 95% confidence interval to obtain a two-tailed P value for the relationship between gene expression and Expansion Size for each patient.
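As a worked illustration of Eq. 1 and of the rank correlation, the sketch below computes the Expansion size for a few genes of one hypothetical patient and then the Spearman coefficient against gene expression. All names and numbers are invented for illustration, and only the coefficient is computed; the P value and confidence interval reported in the analysis are not reproduced here.

```python
import math


def expansion_size(max_copy_number, n_patient_sequences):
    """Eq. 1: maximum copy number within a gene / number of patient sequences."""
    return max_copy_number / n_patient_sequences


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical patient with 200 sequences; per-gene maximum copy numbers
# and log2 expression values are invented for illustration only.
n_sequences = 200
max_copies = [62, 13, 2, 1]
expression = [4.5, 6.9, 9.4, 11.0]
sizes = [expansion_size(c, n_sequences) for c in max_copies]
rho = spearman_rho(sizes, expression)
print(sizes, rho)
```

Because the denominator is the same for every gene of a given patient, genes with the same maximum copy number share the same Expansion size within that patient.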


The results of this analysis showed that for all five patients there was a negative correlation between gene expression and Expansion size (see Fig. 6.10). For one of the five patients (PT3), this negative correlation between gene expression and Expansion size was statistically significant.


Figure 6.10: Spearman correlation between gene expression and Expansion size in different patients.


Each dot in Fig. 6.10 represents a gene, and the charts show the correlation between the Expansion size ratio of a gene and its expression. Many of the genes have coinciding dots, because they have an identical maximum copy number of integration sites and the same number of patient sequences (because they originate from the same patient), so they have the same Expansion size ratio. As shown in Fig. 6.10, there is a significant negative correlation between gene expression and Expansion size in PT3 only. The negative direction of the correlation in every patient indicates that genes with more highly expanded integration sites tend to have lower gene expression (in each patient separately). This suggests that the number of sequenced integration sites within each patient does not drive our results.

Confirmation of the results using an alternative published data set

To confirm the results of my analysis of the published data from [26], I analysed data from another, smaller study [27]. In this second study, 534 integration sites located in 262 different genes were obtained from three HIV-1 infected patients enrolled in a study approved by the Seattle Children's Institutional Review Board for Human Subjects [27]. The gene expression data used for the analysis of this integration site dataset are the same as those used for the previous integration site dataset [26]; only the integration site data are new.

Similar to my analysis of the first published data set [26], in this second data set most of the integration sites (both non-expanded and expanded) were observed in gene regions. I analysed the expression level of genes with expanded and non-expanded integration sites at different thresholds of the number of sites. Similar to the first set of results, the analysis of the second data set showed that genes with expanded integration sites had lower expression than genes with non-expanded integration sites. However, as this second data set is smaller than the first, for some thresholds of the number of integration sites the differences in gene expression levels were not significant. I also compared expression across the different T-cell subsets (TSCM, TCM, TEM and TN) for this integration site dataset. Again, in this new dataset, there were no significant differences in expression level between cell types within either expanded- or non-expanded-site genes. These results confirm those obtained from my analysis of the first data set. Because of the small number of cancer genes represented in the second published study, I could not compare patterns of gene expression between expanded- and non-expanded-site genes within cancer and non-cancer genes. However, I observed similar patterns in the associations between Expansion size and gene expression to those obtained in the analysis of the first data set.

Discussion

Previous studies have shown that HIV-1 preferentially integrates into transcriptionally active genes, especially those activated against HIV-1 infection [16, 17, 34-37]. It has also been reported that the sites of integration can influence the expression of the integrated HIV-1 provirus [21, 24, 25, 36]. Recent studies [26, 27] presented new evidence of the clonal expansion of CD4 T-cells latently infected with HIV-1, by identifying identical HIV-1 integration sites in multiple cells. These studies reported that HIV-1 integration sites have an important effect on the clonal expansion and proliferation of infected cells in patients. They indicated that a single infection event can produce multiple latently infected cells and that HIV-1 was preferentially integrated into ‘cancer genes’ in expanded clones of cells.

Although it has been suggested that sites of HIV-1 integration have important effects on clonal expansion and that integrations within some genes are linked to clonal expansion, previous studies were unable to define what role the transcriptional activity of human genes plays in HIV-1 integration and clonal expansion. In this study, I investigated the level of expression of each gene with integrated virus. I grouped genes into two clusters: expanded-site genes (ES-genes) and non-expanded-site genes (NES-genes). Within these two clusters, I further classified genes according to the number of sites. I then investigated the level of expression of genes within these various clusters.
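The grouping described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the analysis: an integration site is treated as expanded when it was observed in two or more cells (copy number of at least 2), the input format and gene names are hypothetical, and `min_sites` stands in for the threshold on the number of sites per gene.

```python
from collections import defaultdict


def classify_genes(sites, min_sites=1):
    """Split genes into ES-genes and NES-genes.

    sites: iterable of (gene, copy_number) pairs, one per integration site.
    A gene is an ES-gene if at least one of its sites is expanded
    (copy_number >= 2, an assumed definition); otherwise it is an NES-gene.
    Genes with fewer than min_sites sites are excluded.
    """
    copies_by_gene = defaultdict(list)
    for gene, copy_number in sites:
        copies_by_gene[gene].append(copy_number)

    es_genes, nes_genes = set(), set()
    for gene, copies in copies_by_gene.items():
        if len(copies) < min_sites:
            continue
        if any(c >= 2 for c in copies):
            es_genes.add(gene)
        else:
            nes_genes.add(gene)
    return es_genes, nes_genes


# Hypothetical integration sites: (gene, copy number of that site).
example_sites = [
    ("GENE_A", 1), ("GENE_A", 3),
    ("GENE_B", 1),
    ("GENE_C", 1), ("GENE_C", 1),
]
for threshold in (1, 2):
    es, nes = classify_genes(example_sites, min_sites=threshold)
    print(threshold, sorted(es), sorted(nes))
```

Repeating the classification at increasing values of `min_sites` gives the series of gene sets whose expression levels are then compared.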

Several studies have reported that HIV-1 preferentially integrates into genes, especially highly expressed genes [2, 34]. My analysis supports these previous findings. In particular, my analysis showed that 87% of integration sites were within genes (see Fig. 6.5A) and that both expanded and non-expanded integration sites were preferentially found in gene regions (see Fig. 6.5B). Moreover, a slightly greater fraction of expanded integration sites than of non-expanded sites was found in genes. My further analysis also showed that genes that include at least one HIV-1 integration site were expressed at a significantly higher level than genes that did not have any integration sites (see Fig. 6.6A), confirming that HIV-1 preferentially integrates into highly expressed genes [2, 34].

To investigate whether the site of HIV-1 integration in the host genome is associated with clonal expansion, I used expression data previously reported by Gattinoni et al. [30] and compared genes with clonally expanded sites to genes that include only single integrations. My analyses showed that genes that included at least one expanded site were expressed at a significantly lower level than genes that included only non-expanded sites (see Fig. 6.6B). This is consistent with in vitro data in cell lines demonstrating that the amount of HIV-1 transcription is dependent on the status of the surrounding chromatin [13-15]. I also found that, at all thresholds of the number of sites, expanded-site genes had significantly lower expression than non-expanded-site genes (see Fig. 6.7), suggesting that HIV-1 single integrations were more likely than clonal integrations to be discovered in highly expressed genes. In other words, cells carrying HIV-1 proviruses integrated into lowly expressed genes are more likely to be clonally expanded. This observation is also supported by data from a recently published paper [34]. There are two possible interpretations: 1) Genes that include only non-expanded sites are more active regions of the genome and are more likely to support viral reactivation, so the cells may die and not expand. In contrast, most genes that include at least one expanded site had low levels of gene expression. 2) A number of studies have shown that HIV-1 integration can affect the gene expression of host cells [38-40], but it is not clear whether expression is enhanced or silenced. Previous studies have provided evidence that suggests HIV-1 integration contributes to the clonal expansion of infected cells. 
HIV-1 integration in the transcriptional regulator genes MKL2 and BACH2 has been observed in multiple individuals [26, 27, 41-43]. In one individual, for instance, 16 out of 1052 unique HIV-1 integration sites were in MKL2. A similar observation was made for HIV-1 integrations in two introns of the BACH2 gene in this individual [26]. All of the HIV-1 proviruses were in the same transcriptional orientation as the gene, suggesting that they may have affected the expression of these genes through the insertion of transcriptional control elements (promoters and enhancers), polyadenylation signal sequences or transcriptional termination sites that are included in the HIV-1 LTRs. On the other hand, infected cells in culture showed neither evidence of preferential integration in the same introns nor any preferential integration in one of the two possible orientations. In individuals, in the case of BACH2, all of the integrations were identified upstream of the coding exons; and for the MKL2 gene, all of the integrations were among the coding exons. Studies have shown that both MKL2 and BACH2 are human proto-oncogenes [44-46]. Thus, cells carrying HIV-1 proviruses in the introns of these genes were presumably selected in individuals because the integrated HIV-1 provirus affected the expression of these genes, modifying the growth and/or survival attributes of the infected cells.

Several studies [26, 27, 34] have reported an enrichment of HIV-1 integration in genes associated with cancer compared with normal genes. To examine the pattern of gene expression between expanded- and non-expanded-site genes within cancer and non-cancer genes, I analysed cancer and non-cancer genes separately. Fig. 6.9 shows that for both cancer and non-cancer genes the expression of genes with expanded integration sites is lower than that of genes with non-expanded sites. Because of the small number of cancer genes in my data set, the difference between expanded- and non-expanded-site genes within the cancer group is not significant, but the differences in the cancer and non-cancer groups are in the same direction. Therefore, the mechanisms of HIV-1 persistence and clonal expansion may be the same for cancer and non-cancer genes. However, the mechanisms by which HIV-1 integration into cancer-associated genes (e.g. BACH2 or MKL2) causes persistence and clonal expansion remain to be understood. Although it is unclear why some HIV-1 integration sites permit clonal expansion, the finding that expanded clones with HIV-1 integrations arise in cancer-associated genes has led to the suggestion that HIV-1 integration into genes that regulate cell division enhances expansion [27, 34].

Resting memory CD4 T-cells are thought to contribute most substantially to HIV-1 clonal expansion [4, 18, 47-48]. The main question following the identification of clonally expanded HIV-1-integrated cells is whether the HIV-1 from these infected cells contributes to the viral reservoir, which may facilitate long-term persistence of HIV-1 [26, 27]. To investigate my second hypothesis, that precursor cells (stem memory cells and central memory cells) preferentially have integration sites in genes associated with proliferation, I also compared expression levels of ES-genes and NES-genes within different T-cell subsets (TSCM, TCM, TEM and TN). The results showed that there are no significant differences between the expression levels of the different T-cell subsets within either expanded- or non-expanded-site genes (see Fig. 6.8), suggesting that there is no T-cell subset preference for those HIV-1 integrants that had undergone clonal expansion. In other words, my analysis has not demonstrated that HIV-1 proviruses that integrated into genes with lower expression in stem memory T-cells are more likely to undergo clonal expansion than HIV-1 proviruses that integrated into other cell types. In agreement with this result, Cohn et al. have recently shown that all three subsets of central, transitional and effector memory CD4 T-cells can harbour expanded clones of HIV-1 integrants [34].

However, a lack of standardisation in laboratory processes and technical differences (e.g., reagents, kinetics of stimulation, fresh vs. frozen cells) may have an impact on T-cell categorisation. There has also been a lack of separation between activated and resting T-cells in the study, and more work needs to be done to provide stronger approaches that allow sufficient identification of subdominant or low avidity CD8 T-cells. In addition, the choice of markers used to define the T-cell subsets is an ever-evolving area of research, as more subsets of these T-cell populations, and new populations that fall in the grey area between them (in terms of marker expression and new functionalities), are being discovered.

Although the association of HIV-1 integration sites with gene expression has provided some evidence of particular features of genomic sites that up-regulate or down-regulate gene expression and favour latency, clear observation of the same correlations in primary cells has been more difficult. An association between viral antigen expression and particular genomic regions has been reported in various in vitro models of HIV-1 latency, but this association has not been linked to general features of HIV-1 integration sites that extend across several primary CD4 T-cell models of HIV-1 latency. The expansion of particular HIV-1 integration sites in individual CD4 T-cells has provided new insights into HIV-1 latency in vivo. Two mechanisms may underlie this process. First, clonally expanded HIV-1-infected cells are identical both in insertion site and in HIV-1 sequence. This is consistent with the evidence provided by HIV-1 sequence analysis alone, and raises the question of a) whether the HIV-1 integration site lies in a region favouring latency or b) whether the HIV-1 is non-inducible because it is defective. The second, alternative mechanism, proposed following recent observations of an obvious enrichment of multiple HIV-1 integration sites in particular regions of a gene, is that HIV-1 integration in particular host genes enables cells carrying the provirus to expand. The observation that specific genes are associated with cell survival suggests that integration may change the function of those genes and play a role in cell survival. My analyses of HIV-1 integration sites and gene expression suggest that HIV-1 proviruses that integrated into lowly expressed genes are more likely to undergo clonal expansion. It will now be important to characterise whether this is a direct genomic effect or whether expression of specific HIV gene products is involved.

References

1. Perelson, A.S., et al., HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral generation time. Science, 1996. 271(5255): p. 1582-6.

2. Craigie, R. and F.D. Bushman, HIV DNA Integration. Cold Spring Harb Perspect Med, 2012. 2(7): p. a006890.

3. Siliciano, R.F. and W.C. Greene, HIV latency. Cold Spring Harb Perspect Med, 2011. 1(1): p. a007096.

4. Chomont, N., et al., Maintenance of CD4+ T-cell memory and HIV persistence: keeping memory, keeping HIV. Curr Opin HIV AIDS, 2011. 6(1): p. 30-6.

5. Bosque, A., et al., Homeostatic proliferation fails to efficiently reactivate HIV-1 latently infected central memory CD4+ T cells. PLoS Pathog, 2011. 7(10): p. e1002288.

6. Blow, J.J. and T.U. Tanaka, The chromosome cycle: coordinating replication and segregation. Second in the cycles review series. EMBO Rep, 2005. 6(11): p. 1028-34.

7. Gillet, N.A., et al., The host genomic environment of the provirus determines the abundance of HTLV-1-infected T-cell clones. Blood, 2011. 117(11): p. 3113-22.

8. Hacein-Bey-Abina, S., et al., LMO2-associated clonal T cell proliferation in two patients after gene therapy for SCID-X1. Science, 2003. 302(5644): p. 415-9.

9. Ho, Y.C., et al., Replication-competent noninduced proviruses in the latent reservoir increase barrier to HIV-1 cure. Cell, 2013. 155(3): p. 540-51.

10. Chun, T.W., et al., Early establishment of a pool of latently infected, resting CD4+ T cells during primary HIV-1 infection. Proc Natl Acad Sci U S A, 1998. 95(15): p. 8869-73.

11. Archin, N.M., et al., Immediate antiviral therapy appears to restrict resting CD4+ cell HIV-1 infection without accelerating the decay of latent infection. Proc Natl Acad Sci U S A, 2012. 109(24): p. 9523-8.


12. Finzi, D., et al., Latent infection of CD4+ T cells provides a mechanism for lifelong persistence of HIV-1, even in patients on effective combination therapy. Nat Med, 1999. 5(5): p. 512-7.

13. Jordan, A., et al., HIV reproducibly establishes a latent infection after acute infection of T cells in vitro. EMBO J, 2003. 22(8): p. 1868-77.

14. Jordan, A., et al., The site of HIV‐1 integration in the human genome determines basal transcriptional activity and response to Tat transactivation. EMBO J, 2001. 20(7): p. 1726-38.

15. Sherrill-Mix, S., et al., HIV latency and integration site placement in five cell-based models. Retrovirology, 2013. 10(1): p. 1.

16. Brady, T., et al., HIV integration site distributions in resting and activated CD4+ T cells infected in culture. Aids, 2009. 23(12): p. 1461-71.

17. Bushman, F., et al., Genome-wide analysis of retroviral DNA integration. Nat Rev Microbiol, 2005. 3(11): p. 848-58.

18. Lewinski, M.K., et al., Genome-wide analysis of chromosomal features repressing human immunodeficiency virus transcription. J Virol, 2005. 79(11): p. 6610-9.

19. Moalic, Y., et al., Porcine endogenous retrovirus integration sites in the human genome: features in common with those of murine leukemia virus. J Virol, 2006. 80(22): p. 10980-8.

20. Wang, G.P., et al., HIV integration site selection: analysis by massively parallel pyrosequencing reveals association with epigenetic modifications. Genome Res, 2007. 17(8): p. 1186-94.

21. Abel, U., et al., Real-time definition of non-randomness in the distribution of genomic events. PLoS One, 2007. 2(6): p. e570.

22. Hacein-Bey-Abina, S., et al., Insertional oncogenesis in 4 patients after retrovirus-mediated gene therapy of SCID-X1. J Clin Invest, 2008. 118(9): p. 3132-42.

23. Holman, A.G. and J.M. Coffin, Symmetrical base preferences surrounding HIV-1, avian sarcoma/leukosis virus, and murine leukemia virus integration sites. Proc Natl Acad Sci U S A, 2005. 102(17): p. 6103-7.

24. Howe, S.J., et al., Insertional mutagenesis combined with acquired somatic mutations causes leukemogenesis following gene therapy of SCID-X1 patients. J Clin Invest, 2008. 118(9): p. 3143-50.


25. Wu, X., et al., Weak palindromic consensus sequences are a common feature found at the integration target sites of many retroviruses. J Virol, 2005. 79(8): p. 5211-4.

26. Maldarelli, F., et al., HIV latency. Specific HIV integration sites are linked to clonal expansion and persistence of infected cells. Science, 2014. 345(6193): p. 179-83.

27. Wagner, T.A., et al., HIV latency. Proliferation of cells with HIV integrated into cancer genes contributes to persistent infection. Science, 2014. 345(6196): p. 570-3.

28. Hughes, S.H. and J.M. Coffin, What integration sites tell us about HIV persistence. Cell Host Microbe, 2016. 19(5): p. 588-98.

29. Simonetti, F.R., et al., Clonally expanded CD4+ T cells can produce infectious HIV-1 in vivo. Proc Natl Acad Sci U S A, 2016. 113(7): p. 1883-8.

30. Gattinoni, L., et al., A human memory T cell subset with stem cell-like properties. Nat Med, 2011. 17(10): p. 1290-7.

31. Youngblood, B., et al., Memory CD8 T cell transcriptional plasticity. F1000prime reports, 2015. 7.

32. Appay, V., et al., Phenotype and function of human T lymphocyte subsets: consensus and issues. Cytometry A, 2008. 73(11): p. 975-83.

33. Wei, D., et al., A novel hierarchical clustering algorithm for gene sequences. BMC Bioinformatics, 2012. 13: p. 174.

34. Cohn, L.B., et al., HIV-1 integration landscape during latent and active infection. Cell, 2015. 160(3): p. 420-32.

35. Osborne, C.S., et al., Active genes dynamically colocalize to shared sites of ongoing transcription. Nat Genet, 2004. 36(10): p. 1065-71.

36. Schroder, A.R., et al., HIV-1 integration in the human genome favors active genes and local hotspots. Cell, 2002. 110(4): p. 521-9.

37. Xing, Y., et al., Nonrandom gene organization: structural arrangements of specific pre-mRNA transcription and splicing with SC-35 domains. J Cell Biol, 1995. 131(6 Pt 2): p. 1635-47.

38. Bushman, F., Lateral DNA Transfer: Mechanisms and Consequences. 2001, USA: Cold Spring Harbor Laboratory Press.

39. Ciuffi, A. and F.D. Bushman, Retroviral DNA integration: HIV and the role of LEDGF/p75. Trends Genet, 2006. 22(7): p. 388-95.

40. Coffin, J.M., et al., Retroviruses. 1997, USA: Cold Spring Harbor Laboratory Press.


41. Han, Y., et al., Resting CD4+ T cells from human immunodeficiency virus type 1 (HIV-1)-infected individuals carry integrated HIV-1 genomes within actively transcribed host genes. J Virol, 2004. 78(12): p. 6122-33.

42. Ikeda, T., et al., Recurrent HIV-1 integration at the BACH2 locus in resting CD4+ T cell populations during effective highly active antiretroviral therapy. J Infect Dis, 2007. 195(5): p. 716-25.

43. Mack, et al., HIV insertions within and proximal to host cell genes are a common finding in tissues containing high levels of HIV DNA and macrophage-associated p24 antigen expression. J Acquir Immune Defic Syndr, 2003. 33(3): p. 308-20.

44. Flucke, U., et al., Presence of C11orf95–MKL2 fusion is a consistent finding in chondroid lipomas: a study of eight cases. Histopathology, 2013. 62(6): p. 925-30.

45. Kobayashi, S., et al., Identification of IGHCδ–BACH2 fusion transcripts resulting from cryptic chromosomal rearrangements of 14q32 with 6q15 in aggressive B‐cell lymphoma/leukemia. Genes Chromosomes Cancer, 2011. 50(4): p. 207-16.

46. Muehlich, S., et al., The transcriptional coactivators megakaryoblastic leukemia 1/2 mediate the effects of loss of the tumor suppressor deleted in liver cancer 1. Oncogene, 2012. 31(35): p. 3913-23.

47. Chomont, N., et al., HIV reservoir size and persistence are driven by T cell survival and homeostatic proliferation. Nat Med, 2009. 15(8): p. 893-900.

48. Churchill, M.J., et al., HIV reservoirs: what, where and how to target them. Nature Rev Microbiol, 2016. 14(1): p. 55-60.


Chapter 7:

Summary and future work

Author contributions to thesis Chapter 7:

HA-R: wrote the chapter. DE, MPD and VV: read and revised the chapter. HA-R: created all Tables and Figures (except Fig. 7.1).


Synopsis

Human immunodeficiency virus (HIV) continues to be a major global health problem. Although research in the past decades has significantly advanced our understanding of HIV biology, there remain many unanswered questions that continue to hinder the development of a successful vaccine. In this thesis, I used bioinformatics and applied computational approaches to address several important questions about HIV biology and infection. These questions relate to the cellular immune mechanisms that play a role in fighting HIV infection and to the mechanisms by which HIV evolves to evade the immune system and optimise its survival. Understanding these mechanisms will help inform future preventative and/or curative therapies.


Figure 7.1: A summary of the key questions addressed in this thesis and the key findings for each question.


Summary of findings

CTL epitopes undergo patterns of escape mutations

Antiretroviral therapy has substantially reduced the HIV-1 fatality rate; nevertheless, an effective approach to eliminate HIV-1 and/or to inhibit transmission (e.g. a vaccine) is yet to be developed. Many studies have indicated that cytotoxic T lymphocytes (CTL) play a central role in fighting HIV-1 infection [1-3]. As such, development of a CTL-based vaccine is currently being pursued as a potential curative strategy. Previous studies of CTL-based vaccines in SIV-infected macaques have shown decreased viral loads, increased levels of protection against infection, and greater survival in vaccinated animals, when compared to non-vaccinated controls [4-8]. As an example, Barouch et al. [9] performed a vaccine trial in simian/human immunodeficiency virus (SHIV) infected rhesus macaques, and observed that vaccinated monkeys had secondary immune responses, stable CD4+ T-cell counts, and low to undetectable set point viral loads. Additionally, the animals preserved SHIV-specific CD4+ T-cell responses, and no clinical disease or fatality was observed up to five months after viral challenge. However, the first HIV-1 vaccine trial in humans, by Flynn et al. in 2005, showed no benefit in terms of preventing HIV-1 acquisition or reducing viral loads in HIV-1 infected participants [10]. In that study, a recombinant HIV-1 envelope subunit gp120 protein was used as the vaccine antigen. In 2008, Uberla et al. performed another T-cell based HIV-1 vaccine trial and again did not observe significant protection [11]. Importantly, in that study, the risk of HIV infection in the vaccinated participants was higher than that in the placebo group. Eventually, in 2009, the “RV144 Phase III” vaccine trial showed promising results, with a 31% reduction in the incidence of HIV-1 infection [12]. This trial combined two previous vaccines that had failed on their own [13, 14]. The RV144 trial used a canary pox vector expressing gp120, Gag, and protease proteins followed by a gp120 protein boost. 
The RV144 Phase III trial was a landmark clinical trial; however, the protection was modest and the effective responses needed to inhibit HIV-1 infection remained unclear. Individuals who showed high levels of an anti-HIV-1 ADCC-mediating IgG antibody to a part of the HIV virus coat called V1V2 were less likely to be infected. In addition, individuals with higher levels of an anti-HIV-1 Env IgA antibody to another part of the outer coat, known as C1, were less likely to be protected.


Nevertheless, there is still a major barrier to developing an effective vaccine against HIV-1 infection: the ability of HIV-1 to mutate quickly in and around the epitopes targeted by CD8+ T-cells, and its ability to downregulate HLA (known as MHC in non-human primates) class I to reduce CD8+ T-cell recognition of HIV-infected CD4+ T cells [15, 16]. The immune pressure exerted by CD8+ T-cells can result in reduced viral load [17], but it can also result in the selection of escape mutations [18-21]. Some escape mutations may exact a considerable cost to virus fitness [21]. However, compensatory mutations that occur together with escape mutations can reduce the viral replication fitness cost, resulting in an overall virus fitness similar to that of a wild-type virus [22, 23]. Previous studies have identified associations between MHC/HLA alleles and viral escape mutations within SIV/HIV epitopes. For example, in pig-tailed macaques, escape mutations in the Gag epitope KP9 and in the Tat epitopes KSA10 and KVA10 are known to be associated with specific MHC alleles [24-27]. Most of the reported associations are related to MHC class I alleles, which have a crucial role in the antigen-specific CD8 T-cell response [28-31]. MHC class I molecules also act as ligands for the activation of natural killer cell receptors (e.g. killer-cell immunoglobulin-like receptors (KIRs)). Genome-wide association studies (GWAS) have also reported that HIV-1 polymorphisms are linked with specific human HLA class-I alleles [32-35]. These findings have provided important guidelines for the discovery of CTL epitopes.

CTL escape mutations mostly involve continuous selection of mutants, both within a single epitope and across different epitopes. Multiple studies have reported that some such epitopes are restricted by HLA class I alleles such as B*57, B*27, B*51, B*810/B*391; most of these escape mutations are found in the HIV-1 Gag regions [36-44]. However, there are anchor positions in the allele-specific epitope-binding motifs, which are conserved across all the epitopes presented by a specific HLA class-I allele [45, 46]. Thus, identification of the epitope-binding motifs of a specific MHC allele will enable the discovery of immunogenic epitopes presented by that MHC Class-I allele.

Studies have shown that some CTL escape mutations in MHC Class-I epitopes may persist for many years [47-51]. Subsequent studies have also shown that mutations that either occurred at very little fitness cost to HIV-1 or have been compensated (i.e. by compensatory mutations) may persist in populations with a specific MHC [39, 52, 53], resulting in an “MHC footprint” on HIV-1 genomes [35, 39, 50, 54, 55], which was initially reported by Moore et al. in 2002 [35]. In that study, a multivariable analysis was performed to investigate the relationship between HIV-1 reverse transcriptase nonsynonymous polymorphisms and HLA-A and HLA-B alleles in the early host response to HIV-1 infection. A Western Australian cohort of 473 HIV-1-infected individuals was genotyped for the HLA-A and HLA-B loci. The most recent HIV-1 reverse transcriptase sequence between amino acid positions 20 and 227 was aligned to an HIV-1 consensus sequence, and viral polymorphisms were detected and studied for potential association with different HLA-A or HLA-B alleles. After correction for multiple testing, they identified 14 associations between HLA alleles and nonsynonymous mutations. Several of these polymorphisms were located in known CTL epitopes, suggesting that people with the same HLA allele have a shared potential to recognise the same peptides. Thus, the HIV-1 sequences in those individuals are expected to display the same mutational signatures associated with the host HLA.

In 2004, Kiepiela et al. [32] designed a study to investigate the contributions of HLA-A and HLA-B alleles in HIV-1 infection. Their analyses of Nef and Gag proteins in a South African cohort of 375 HIV-1-infected individuals indicated 25, 12, and 9 associations of viral mutations with HLA-B, HLA-A and HLA-C alleles, respectively. Eleven of the mutations had been reported previously, of which 10 were presented by HLA-B alleles. Their findings indicated that HLA-B-restricted CD8 T-cells had significantly higher responses than HLA-A-restricted CD8 T-cells [32], which suggests that the HLA-B locus has a greater impact on viral set-point than the HLA-A locus. Nonetheless, investigation of the impact of HLA class I alleles on HIV-1 evolution at a population level is not trivial, because factors such as virus founder effects, inclusion of amino acids in several overlapping peptides, and compensatory mutations that maintain HIV-1 fitness may confound the analysis [23, 56-58]. In 2007, Korber's lab [59] applied two statistical strategies (maximum-likelihood phylogeny and a likelihood-ratio test) that account for phylogenetic relationships between HIV viruses, to investigate whether associations between HLA alleles and polymorphisms could be ascribed to confounding by viral lineage founder effects rather than to immune escape polymorphisms directly selected by the associated HLA allele. They found that viral lineage (“founder”) effects often mimicked HLA-mediated immune-escape mutations [59]. They analysed the cohort used in Moore et al.’s study [35], and showed that out of 80 associations, only 20 were not HIV-1 subtype driven.


Altogether, these studies highlight the importance of HLA-associated viral escape studies; however, they also point to a major source of false discovery rooted in viral polymorphism, i.e. the founder effect. In the study presented in Chapter 2, this issue is resolved, because in our cohort the macaques were infected with the same stock of SIV virus. Therefore, mutations/polymorphisms were detected by means of comparison to the same SIV reference sequence. Additionally, factors such as the timing of infection cannot be controlled in studies involving infected human subjects. In that respect, a non-human primate model is highly advantageous and is an excellent model for studies of epitope discovery, CTL-based vaccine design, and drug trials.

There are three macaque species (rhesus (Macaca mulatta), cynomolgus (Macaca fascicularis), and pigtail (Macaca nemestrina)) [60] that reproduce the course of AIDS-like disease [61]. Among SIV-infected non-human primate models, Indian rhesus macaques (Macaca mulatta) have been used widely to study HIV-1 pathogenesis. The pigtail macaque (Macaca nemestrina) is an alternative model that has a simian immunodeficiency virus (SIV) disease progression similar to that of the Indian rhesus macaque model and is accessible in Australia [24]. In our study we used pigtail macaques, whose MHC has a close phylogenetic relationship to that of humans. Also, pigtail macaques, unlike other macaques, have an extremely restricted MHC diversity, which makes them a well-suited model for large vaccine studies [62, 63].

The association between MHC Class-I haplotypes and disease progression in pigtailed macaque models of HIV-1 is understudied. Only a few CTL mutations and their restricted epitopes have been identified for the SIV macaque model. Smith et al. and Mason et al. [24-26] reported an important CTL response in pigtailed macaques, known as KP9, and its escape mutations in Gag; Mason et al. also reported KSA10 and KVA10 CTL responses in pigtailed macaques in Tat [26]. All of these CTL mutations and their restricted epitopes were identified for the Mane-A084 allele. Many Env CTL escape mutations have also been identified, but they have not been attributed to a specific MHC class-I haplotype [64]. As an example from other macaques, Allen et al. identified two very well defined immunodominant epitopes in SIVmac239-infected rhesus macaques in the Tat (Tat-SL8) and Gag (Gag-CM9) proteins [15, 65].

Investigation of CTL escape patterns in SIV genomes enables the identification of SIV regions under active CTL selection pressure. In addition, mapping CTL escape and the fitness cost to SIV genomes provides resources to facilitate SIV epitope discovery, which can help the selection of an effective CTL-based SIV/HIV-1 vaccine. To this aim, in the collaborative studies presented in Chapter 2 (as one of the questions described in Fig 7.1), I designed a combined computational/experimental study using non-human primate models. I developed a bioinformatics approach to identify novel associations between SIV escape mutations and MHC-I haplotypes in 44 SIV-infected pigtailed macaques. This comprehensive analysis of single point mutations and regions of mutations led to the discovery of 46 novel non-synonymous mutations. Our initial analysis only identified epitopes within which the same position underwent a non-synonymous mutation in multiple macaques. Given that an escape mutation within an epitope does not necessarily have to be confined to the same amino acid position, we sought to perform an additional and more inclusive analysis in which mutations within small windows were considered. To detect such variable escape regions, I analysed the frequency of mutations within windows of 30 nucleotides sliding in 3-base increments. This analysis led to the discovery of an additional 32 novel regions of non-synonymous mutations that were associated with specific classes of MHC-I haplotypes. Of the 78 potential epitopes, 24 were identified within the SIV Gag and Pol genes. These genes are not very variable between the HIV and SIV genomes. Previous studies have also reported Gag-specific responses in both HIV and SIV [16, 66, 67].
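The window-based scan described above can be sketched as follows. This is a minimal illustration only, not the pipeline used in Chapter 2: the function name window_mutation_counts, the input layout (a dictionary of mutated nucleotide positions per animal) and the 0-based coordinates are all assumptions made for the example, and the statistical test of association with MHC-I haplotypes is not shown.

```python
def window_mutation_counts(mutated_positions, genome_length, window=30, step=3):
    """Count, per window, how many animals carry at least one mutation in it.

    mutated_positions: dict mapping an animal id to a set of 0-based nucleotide
    positions carrying a non-synonymous mutation (hypothetical input layout).
    Returns a dict mapping each window start to the number of animals mutated
    anywhere inside [start, start + window).
    """
    counts = {}
    # Slide a 30-nucleotide window along the genome in 3-base (one codon) steps.
    for start in range(0, genome_length - window + 1, step):
        end = start + window
        counts[start] = sum(
            any(start <= pos < end for pos in positions)
            for positions in mutated_positions.values()
        )
    return counts


# Toy example with three hypothetical animals on a 120-nucleotide region.
animals = {"m1": {10, 40}, "m2": {12}, "m3": {100}}
counts = window_mutation_counts(animals, genome_length=120)
```

Counting animals rather than individual mutations per window reflects the point made above: two animals that escape at different positions of the same epitope still contribute to the same window.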

Having a wider suite of SIV-specific CTL epitopes and their escape mutation patterns, as detected in this study, could assist in determining which CTL responses are capable of controlling SIV infection. It should also provide a valuable resource for researchers studying T-cell control of SIV infection and using macaque models for the future design of CTL-specific SIV/HIV-1 vaccines and antiretroviral therapy strategies for SIV/HIV-1 infection.

Pigtailed macaques are emerging as a potential non-human primate model for SIV/HIV-1 pathogenesis and T-cell vaccine studies, and the newly mapped epitopes increase the potential of pigtailed macaques as an experimental model. Nevertheless, there is still more work to be done to experimentally confirm our findings of SIV-specific CTL epitopes. On the
other hand, because of the limited number of macaques and the low frequency of many MHC haplotypes in our studies, there was not adequate statistical power to identify escape mutations that are associated with less abundant MHC haplotypes. There may well be rare alleles associated with control of SIV in this model. A future study with a larger number of macaques could better establish the frequencies of such alleles and their role in MHC-restricted CD8 T-cell responses in pigtailed macaques, and may contribute to further expanding this potential HIV-1 model. An additional limitation of this work is that, to date, there are no firm rules about the structure of the MHC loci, their alleles and the functions of MHC Class-I-restricted CD8 T-cell responses in macaques, which makes epitope identification studies harder. For instance, alleles with a frequency of 9.1% were categorized as Mane-A allele unknown (Chapter 2, Table 2.2) and we were unable to map SIV genome mutations to these unknown alleles.

In conclusion, I performed a linkage analysis to identify potential CTL escape mutations and related epitopes in SIV in 44 pigtailed macaques. My discovery may enable a more refined level of analysis of CTL-based HIV-1 vaccine design.

APOBEC proteins as a host defence mechanism against HIV-1 infection

The immune response is generally characterised by innate and adaptive immune responses that partially limit viral replication. The innate immune response is the first line and dominant system of host defence and provides a rapid response to pathogens. When innate immune cells identify a pathogen, they can respond by phagocytising infected cells and secreting cytokines and chemokines, signalling to other immune responses, in particular adaptive responses, that pathogens have been detected. The innate immune system includes a wide range of components such as Toll-like receptors, natural killer cells, and restriction factors. Restriction factors such as APOBEC3G, Tetherin, SAMHD1 (Sterile Alpha Motif Histidine-Aspartic domain-containing protein 1), and TRIM5α (Tripartite motif-containing protein 5α) play important roles in inhibiting viral infection [68, 69]. These factors target various steps of the virus life cycle with distinct effects on HIV-1 (Fig 7.2).

Figure 7.2. A schematic of the HIV life cycle and the anti-viral role of the main host restriction factors (red circles). Figure adapted from Haller, Cell Host Microbe, 2013.

SAMHD1 is a cellular dGTP-regulated deoxynucleoside triphosphate (dNTP) triphosphohydrolase that is expressed in myeloid cells, where it blocks viral replication by preventing reverse transcription of the viral genomic RNA [70, 71]. However, lentiviruses such as human HIV-2 and sooty mangabey SIV encode an accessory protein known as Vpx, which counteracts the effect of SAMHD1. Tetherin (also called BST-2) is another restriction factor that plays a role in viral inhibition. Tetherin is a type II integral membrane protein, which has a rather non-specific antiviral role. This non-specificity permits tetherin to restrict a wide range of virus families, including retroviruses, filoviruses, herpesviruses, and paramyxoviruses [73, 74].

A third and arguably the most potent restriction factor is a family of editing enzymes called APOBEC3. These proteins are different from other restriction factors such as SAMHD1 and Tetherin in that they interact directly with viral genomes. In humans there are seven APOBEC3 enzymes, namely APOBEC3A, APOBEC3B, APOBEC3C, APOBEC3D, APOBEC3F, APOBEC3G, and APOBEC3H. They are responsible for
inhibiting endogenous and exogenous retroviruses by deaminating cytosine (C) to uracil (U) in viral genomes or by several other deamination-independent mechanisms [86-90]. Recent studies have also shown that members of this family, particularly APOBEC3A, APOBEC3B, and APOBEC3H, can act as DNA mutators in cancer cells. It is known that at least four members of this enzyme family (APOBEC3D/F/G/H) can package into the budding HIV-1 particles and induce C-to-T mutations in the HIV-1 single-stranded DNA (minus strand) during reverse transcription in a target cell. This is then manifested as G-to-A mutations in viral RNA. APOBEC3-induced mutations can create multiple stop codons in the targeted sequences (a.k.a. hypermutated sequences) that often lead to the formation of replication-defective HIV-1 proviruses. HIV-1 counteracts the antiviral activity of APOBEC3 enzymes using one of its genes called Vif, which binds to APOBEC3 enzymes and prevents their incorporation into budding virions [91-97]. It is well known that in the absence of HIV-1 Vif, APOBEC enzymes (particularly APOBEC3G) can suppress HIV-1 replication [98-103].

There are currently two viewpoints about the impact of APOBEC-induced mutations on HIV. Some studies have reported that APOBEC3 can induce a limited number of mutations and therefore has the potential to contribute to HIV-1 diversity, immune escape, and drug resistance. Mulder et al. [104] reported that non-synonymous mutations in the HIV-1 Vif can lead to G-to-A changes in the HIV-1 reverse transcriptase position 184, which is associated with lamivudine drug resistance (M184I). In another study, Jern et al. [105] designed an in silico model based on G-to-A mutations and reported that APOBEC3G-mediated G-to-A changes can potentially contribute to HIV-1 evolution. They also observed a small increase in mutations at known drug resistance sites. Sadler et al. [106] reached the same conclusion using a panel of cell lines that expressed APOBEC3G. Wood et al. [107] investigated early diversification of the HIV-1 Env protein in 81 HIV-1 subtype B infected individuals and showed that some codon sites diversified more rapidly than others. Out of 24 such codon sites, 14 sites were either in CTL epitopes and assisted escape from early CTL immune responses or they were in a sequence context that is preferentially targeted by APOBEC3.

On the other hand, there are multiple lines of evidence that suggest APOBEC3-induced hypermutation is an “all or nothing” phenomenon. In other words, a sequence that is targeted by APOBEC3 is a dead-end virus and is thus unlikely to be selected for. Deforche
et al. [108] used a mathematical likelihood approach to evaluate different models of nucleotide misincorporation bias in the HIV genome using in vivo data from 5614 pairs of HIV Pol sequences (1084 bp). They demonstrated that biased G-to-A substitutions are affected by imbalances in dNTP concentrations, not by APOBEC3G/F editing. Ebrahimi et al. [109] hypothesized that if the HIV genome has evolved under a G-to-A mutation pressure from APOBEC3 proteins, this should be evidenced as an APOBEC3 signature in the form of underrepresentation of APOBEC3 target motifs and overrepresentation of APOBEC3 product motifs in the HIV genome. To test this hypothesis, they developed Markov models and investigated the representation of 340 motifs among 1,932 whole HIV-1 sequences. They observed that highly targeted motifs were not underrepresented and that motifs produced in the HIV genome as a result of targeting by APOBEC3 were not overrepresented. As such, their analysis did not show any evidence to suggest that the HIV genome evolves under APOBEC-induced G-to-A mutation pressure. In a subsequent study, Armitage and colleagues [110] designed an in vitro experiment and showed that it is unlikely that APOBEC3G can increase HIV-1 evolution because: 1) hypermutated sequences were not found in plasma; 2) hypermutated sequences had many stop codons; and 3) the G-to-A replacement pattern of mutated sequences was bimodal (hypermutated or not mutated). In that study it was shown that a single APOBEC3G protein is sufficient to induce multiple mutations in the HIV genome [110].

As described above, there is no consensus in the research community; therefore, both of the above-mentioned scenarios (lethal hypermutation versus sub-lethal mutation) have been explored as potential anti-HIV strategies. The first approach is to block the activity of APOBEC3 enzymes in an effort to minimize HIV diversity and escape [111, 112]. The alternative approach is to enable lethal APOBEC3-mediated hypermutation by disrupting the interaction between HIV Vif and APOBEC3 [113-119]. For the first approach, small molecules that can abolish the enzymatic activity of APOBEC3G by targeting its key residues have been proposed [111, 112]. For the second approach, small molecules that can inhibit Vif, and thus increase APOBEC3 expression and viral encapsidation, have been proposed [113-119].

One of the main reasons for the conflicting reports in the APOBEC literature is related to the sequence contexts of sites targeted by APOBEC3 enzymes and the methods used to
identify those sites. APOBEC3G replaces G by A preferentially within a GG context, but other members of this family change G to A mainly in a GA context [120, 121]. Importantly, the most prevalent HIV-1 mutation in the absence of APOBEC3 enzymes is also the G-to-A change, resulting from error-prone HIV-1 reverse transcription. Because both GG and GA dinucleotide motifs have high abundances in the HIV-1 genome, the likelihood that a random G-to-A mutation occurs within a GG or GA site is high.
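To make the dinucleotide contexts concrete, the sketch below tallies G-to-A changes by the nucleotide immediately 3′ of the edited G. It is an illustration under stated assumptions, not a method from this thesis: the function name g_to_a_contexts is hypothetical, the two sequences are assumed to be pre-aligned and of equal length, and the context is read from the reference sequence, which is only one of several possible conventions.

```python
def g_to_a_contexts(reference, query):
    """Tally G-to-A changes by the reference nucleotide 3' of the edited G.

    Assumes `reference` and `query` are pre-aligned, gap-free strings of equal
    length (an assumption of this sketch). The final position is skipped
    because it has no 3' neighbour.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be pre-aligned to equal length")
    tally = {"GG": 0, "GA": 0, "other": 0}
    for i in range(len(reference) - 1):
        if reference[i] == "G" and query[i] == "A":
            context = reference[i:i + 2]  # edited G plus its 3' neighbour
            key = context if context in ("GG", "GA") else "other"
            tally[key] += 1
    return tally


# Toy example: one change in each context.
result = g_to_a_contexts("AGGTGACGCT", "AAGTAACACT")
```

A tally of this kind cannot by itself separate APOBEC3 editing from reverse transcriptase error: because GG and GA sites are abundant, random G-to-A changes also land in the "GG" and "GA" bins.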

Therefore, it is not trivial to distinguish between an APOBEC-induced mutation and a reverse transcriptase-induced random error in “lightly mutated” sequences. Additionally, editing by APOBEC3 enzymes is not 100% motif specific. For example, APOBEC3G preferentially targets GG sites but also induces G-to-A changes within GA sites, although to a lesser extent. This preference can vary significantly, but on average GG sites are targeted over 4 times more often than GA sites. For other APOBEC3 enzymes, GA sites are targeted at a rate that is on average >3 times greater than that of GG sites. This important point is typically ignored in the analysis of hypermutated sequences: all GG-to-AG mutations are mistakenly attributed to APOBEC3G and all GA-to-AA mutations are regarded as signatures of other APOBEC3 enzymes such as APOBEC3D/F/H. This again has created conflicting reports in the literature. For example, a study by Ebrahimi et al. and the study presented in Chapter 4 of this dissertation show that in almost all of the hypermutated HIV-1 sequences obtained from infected patients one signature (GG-to-AG or GA-to-AA) dominates, although both signatures are present in the sequences. We have shown that the ratio between the number of targeted GG and GA sites in sequences from naturally infected patients is almost identical to that of sequences obtained from HIV-infected cells that have been transfected in vitro with only one APOBEC3 enzyme, e.g. APOBEC3G or APOBEC3F. These data imply that co-mutation by APOBEC3G and one or a few of the other APOBEC3 enzymes is a rare event, possibly because an as yet unknown mechanism prevents APOBEC3 enzymes from co-packaging into viral particles in vivo. Nevertheless, a recent study by Desimmie et al. [122] has reported that co-packaging of APOBEC3 enzymes can occur in vitro. They used single-virion fluorescence microscopy to investigate the possibility of APOBEC3 co-packaging and viral co-mutation.
In this study the 3-mer sequence contexts of sites targeted by APOBEC3G, APOBEC3F, APOBEC3D and APOBEC3H in -based experiments were quantified and used to estimate the contribution of each APOBEC3 enzyme in the mutation of proviral DNA sequences derived from HIV-1 patients [122]. Their in-vitro results showed that
APOBEC3G can be co-packaged with any of APOBEC3D, APOBEC3F, or APOBEC3H and can co-mutate the same HIV-1 genome in a single cycle infection [122]. In this study, it was estimated that 83% of sequences obtained from infected patients were targeted by either A3G or one or a few of the other APOBEC3 enzymes and only 17% appeared to be co-mutated. Although the estimated proportion of co-mutated sequences was significantly lower than that of mono-mutated sequences in that study, it may still grossly overestimate the co-mutation rate, primarily because the threshold used to define co-mutation was 80% GG and 20% GA or vice versa. In other words, any sequence in which the percentage of GG is less than 80% and that of GA is greater than 20% was considered co-mutated. Based on such a definition, many sequences in which one signature is dominant but does not reach the 80% threshold may be incorrectly regarded as co-mutated.
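The effect of such a fixed cut-off can be illustrated with a short sketch. The function name classify_signature and the label strings are hypothetical, and the rule shown is only a reading of the 80%/20% definition summarised above, not code from the cited study.

```python
def classify_signature(gg_count, ga_count, threshold=0.8):
    """Label a hypermutated sequence from its GG-to-AG and GA-to-AA counts.

    A sequence is called mono-mutated only when one signature accounts for at
    least `threshold` of the G-to-A changes in GG or GA contexts; anything
    else is labelled co-mutated (assumed reading of the 80/20 rule).
    """
    total = gg_count + ga_count
    if total == 0:
        return "no G-to-A mutations"
    if gg_count / total >= threshold:
        return "GG-dominant"
    if ga_count / total >= threshold:
        return "GA-dominant"
    return "co-mutated"


# A sequence with 70 GG-to-AG and 30 GA-to-AA changes is labelled
# "co-mutated" even though one signature clearly dominates.
label = classify_signature(70, 30)
```

Given that APOBEC3G alone also edits GA sites at a lower rate, a 70:30 split is compatible with editing by a single enzyme, which is why a hard threshold of this kind can inflate the apparent co-mutation rate.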

Being able to correctly identify the number of mono-mutated and co-mutated sequences is highly important, because this information can help uncover the mechanism that prevents the full anti-viral capacity of APOBEC3 enzymes, and this limitation could well be a significant source of immune weakness against HIV. Accurate identification of co-mutated sequences requires an accurate analysis of G-to-A mutation and, as described above, this is not trivial. As such, our current understanding of the cooperative role of APOBEC3 enzymes, including their viral co-packaging in natural infection, is limited.

One approach to help answer some of the above issues, including the prevalence of co-mutation, is to use an un-biased mathematical method to analyse mutational footprints of APOBEC3 enzymes as described in Chapter 3. Such analyses have revealed that mutations by APOBEC3G and APOBEC3F have a unique sequence hierarchy that goes beyond a dinucleotide motif. For example, GG and GA sites flanked by a 3′C have been shown to be highly disfavoured APOBEC3 target sites [123-125]. Studies have also reported that motifs GG, TG, GGG, GGT, AGG, GGGG, AGGG and TGGG are favourite target sites for APOBEC3G proteins [121, 126-130]. Armitage et al., Zheng et al. and Kijak et al. have also reported that motifs GA, TGA and GAT are favourite target sites for APOBEC3F (or other members of the APOBEC3 family) [121, 126, 130]. However, these methods have usually used sequence alignment and compared the query sequences to a parental sequence for this purpose. In alignment-based methods, a consensus sequence needs to be provided as a reference sequence, and this consensus/reference
sequence might not be the original ancestral sequence of the hypermutated sequences in natural infection. Therefore, alignment-based methods may not be a suitable way to analyse hypermutations and motif preference of APOBEC3 proteins [131-133]. Additionally, analytical methods that are mostly based on the analysis of the frequency of short sequence motifs [131] are not efficient, because motif abundance is not a correct measure of motif representation and can be misleading [109, 134]. Also, analysis of short sequences (<10% of the whole HIV genome) [127, 135-139], which is very typical in the field of APOBEC research, should be done with caution, as it may not provide reliable hypermutation data.

Therefore, there is a great need for quantitative [140] approaches to study APOBEC3-induced mutational signatures. To be able to harness APOBEC mutagenesis as a potentially curative anti-HIV therapy, we first need to understand the role these enzymes play individually and in combination. APOBEC3 mutational signatures on hypermutated sequences are a great source of information and, if analysed correctly, can provide valuable information about the molecular mechanism of APOBEC3/HIV interaction. As an example, our analysis of full genome HIV-1 sequences from naturally infected patients (Chapter 4) revealed that hypermutated HIV-1 sequences harbouring both mutational signatures, GG-to-AG and GA-to-AA, are uncommon. This suggests that APOBEC3G acts independently of other APOBEC3 enzymes. This is unexpected given that members of the APOBEC3 family seem to co-express widely [140, 141], and stimulation of CD4 T-cells leads to the induction of all APOBEC3 enzymes except APOBEC3A [141]. Additionally, APOBEC3G and APOBEC3F, two of the most mutagenic enzymes, share highly similar promoter regions [141] and co-localise in mRNA processing bodies [142].

Future studies can include testing the following two hypotheses to investigate “why co-mutation by APOBEC3 enzymes is rare”.

Hypothesis 1: In most patients both APOBEC3 enzyme classes (3G and non-3G) hypermutate HIV-1, but the likelihood of co-mutation by multiple enzymes is low. The argument here is that HIV Vif inhibits APOBEC3 packaging into the virions; therefore, the probability of a virion with multiple different APOBEC3 enzymes is very low. Supporting evidence for this argument is the low frequency (typically <10%) of hypermutated HIV-1 DNA sequences within each patient. This hypothesis predicts the
presence of both mutational signatures (GG-to-AG and GA-to-AA) on different viral sequences, but not on the same sequence.

Hypothesis 2: In most patients one APOBEC3 enzyme class (3G or non-3G) hypermutates HIV-1. In other words, either APOBEC3G or one or a few of the other known enzymes is dominant in targeting HIV-1 for hypermutation. It is not unexpected to observe zero or very few co-mutated viral sequences if in the majority of individuals either APOBEC3G or other APOBEC3 members are “active” against HIV-1. It is worth noting that here “activity” is defined as the ability of APOBEC3 to hypermutate HIV-1. Therefore, it can be defined in terms of protein expression, functionality, competition for viral sequences or combinations of these. This hypothesis predicts the presence of only one type of mutational signature (either GG-to-AG or GA-to-AA) on viral sequences within a patient.

Additionally, in this thesis (Chapter 3) I discovered several errors that affect the outcome of hypermutation analysis using alignment-based approaches such as the commonly used Hypermut [143], and developed a new bioinformatics method called G2A3 to tackle this problem.

Also, in Chapter 4, I proposed a comprehensive multivariate data analysis method for an un-biased identification of motifs that are underrepresented or overrepresented as a result of mutation by APOBEC3 enzymes. I showed that motif representation data could also be used to determine the subtype of an unknown HIV-1 sequence. However, at the time of my study there were only two whole genome hypermutated sequences with a dominant GA-to-AA signature, and this is a limitation of this particular study. Future studies need to include more sequences with a non-APOBEC3G signature to enable better quantification of APOBEC3D/F/H motif preference in vivo.

Methylation may be a potential source of CpG depletion in the HIV-1 genome

HIV-1 can establish a state of latent infection in a reservoir of resting CD4 T-cells during primary HIV-1 infection to become undetectable by the host immune system and evade ART. HIV-1 latency is thus the main barrier to viral eradication [144-151] and, despite extensive research, our understanding of this complex process is limited. Methylation of
integrated viral genomes has been proposed as an HIV-1 transcriptional silencing mechanism that can augment latency. DNA methylation is a well-studied epigenetic mechanism in vertebrates, which is responsible for genomic imprinting, genome integrity, silencing of endogenous retroviral sequences, and gene regulation [152]. DNA methylation is catalysed by a family of DNA methyltransferase (Dnmt) enzymes and occurs predominantly in CpG dinucleotides. It has been suggested that CpG methylation of retroviral regulatory element regions located in the 5′ long terminal repeats (LTR) is associated with silencing of various retroviruses such as human T-cell leukemia virus type 1 [153, 154], Rous sarcoma virus [155, 156], Moloney murine leukemia virus [157, 158], and endogenous retroviruses HERV-H, HERV-K, and HERV-W [159, 160]. It has been reported that CpG methylation of the HIV-1 5′ LTR may prohibit the binding of transcription factors such as NF-κB and Sp1 [161]. Studies suggest that CpG methylation contributes to the maintenance of HIV-1 latency in long-term-infected U937 cells and in the latently infected ACH-2 cell line [162-165], by inducing changes in the HIV-1 promoter. Studies by Blazkova et al. and Kauder et al. demonstrated that hypermethylated CpG islands flanking the HIV-1 transcription start site are essential for HIV-1 latency in infected J-Lat T-cells and primary CD4 T-cells [162, 163]. MBD2 (Methyl-CpG-binding domain protein 2) and HDAC2 (Histone deacetylase 2) are detectable at one of these hypermethylated CpG islands during HIV-1 latency [163], which can promote a state of transcriptional repression. However, other studies have reported that CpG methylation of the 5′ LTR in latently infected cells (e.g. Jurkat and primary CD4+ T-cells) is generally infrequent and is not associated with the transcriptional activity of the HIV-1 promoter [166-168].
These contradicting observations imply that the role of proviral DNA methylation in HIV-1 latency is still unclear and more studies are needed to investigate the role of CpG methylation in HIV-1 latency. The first step in understanding latency and its potential association with methylation is to understand the impact of methylation on genomic patterns. In the HIV-1 genome, the frequency of CpG is much lower than expected based on the frequencies of its constituent mononucleotides C and G [169, 170]. Importantly, this motif is also depleted in the genomes of vertebrates and many other viruses. The CpG depletion of vertebrate genomes is thought to be a result of CpG methylation followed by spontaneous deamination to TpG on the same strand [171, 172]. However, it is not clear why CpG is depleted in the genomes of viruses such as HIV-1. Two hypotheses can be postulated to explain the negative selection against CpG sites:

One hypothesis is that CpG sites are identified by the immune system. Human cells are provided with various pattern recognition receptors, including Toll-like receptors (TLRs), that identify non-self DNA substrates [173]; specifically, TLR-9 has been reported to detect non-methylated CpG dinucleotides in foreign DNA [174]. Un-methylated CpG dinucleotides can induce immunostimulatory signals through TLR [175]. During TLR9 activation, plasmacytoid dendritic cells generate large amounts of type I interferons (IFNs-I) in HIV-1 infection [74, 176]. Monocytes secreting interleukins and transcription factor nuclear factor-kappaB (NF-kappaB) can also be activated by the TLR9 pathway [177]. Greenbaum et al. and Jimenez-Baranda et al. have speculated that Toll-like receptors may be responsible for CpG depletion in influenza virus [177, 178]. They observed lower levels of CpG motifs in the chicken genome and avian flu genome when compared to those in the human genome and human influenza genomes. They proposed that the influenza genome has undergone a host mimicry mechanism by reducing the frequency of its CpG dinucleotides during a cross-species transmission from avian to human. This was to evade detection by TLR9 in human cells [179, 180].

An alternative hypothesis is that depletion of HIV-1 CpG dinucleotides is a result of methylation of CpG and its spontaneous mutation to TpG. It is known that methylation of CpG motifs significantly increases their mutation rate [181, 182]. As such, CpG depletion of the human genome is attributed to methylation-induced deamination of C-to-T within CpG dinucleotides [171, 172]. Analysis of a wide range of animal genomes has shown that replacement of methyl-CpG by TpG is a relatively frequent mutation event and that the CpG depletion in those genomes is proportional to TpG/CpA excess [171].

Importantly, previous studies and also the study presented in Chapter 5 indicate that CpG depletion in the human genome is highly sequence context dependent. Analysis of multiple genomes including human, chimpanzee, mouse, pufferfish, zebrafish, sea squirt, fruit fly, mosquito, and nematode demonstrated that in the methylated genomes, there is a positive correlation between CpG depletion and G+C content of the sequence [172]. In other words, CpG is depleted less in regions with a higher G+C content [172]. Our analyses of the human genome revealed that indeed CpG sites flanked by C and G have a significantly higher representation compared to those flanked by A and T. Importantly, we observed the same pattern in the HIV-1 sequences, suggesting that the source of CpG depletion in human and HIV genomes is the same mechanism, likely methylation. It is
important to note that TLR-9 also identifies CpG sites that are flanked by T and/or A. Therefore, by only considering the context of CpG depletion, one cannot determine which of the above-mentioned two hypotheses (TLR9 or methylation) is the source of HIV-1 CpG depletion. Nevertheless, we reason that, being localized in the endoplasmic reticulum [183, 184], TLR-9 is not likely to have access to the HIV-1 DNA, which is primarily in the nucleus. Additionally, given that the human genome also shows the same CpG depletion pattern, TLR-9 is less likely to be responsible because there is no evidence to suggest that TLR-9 can put pressure on the human genome. Also, we found, using the analysis of methylation data, that CpG motifs that are methylated more are depleted more, further suggesting that methylation-induced mutation is responsible for the depletion of CpG motifs in HIV-1.

In our studies it was the use of an unbiased quantitative approach that enabled us to unravel the sequence context dependency of CpG methylation and depletion. These types of analyses can quantify the representation of a motif by taking into consideration the representation of its sub-motifs. This is essential particularly for the analysis of CpG motifs, which are highly underrepresented. Accurate identification and quantification of underrepresented or overrepresented sequence motifs is essential, because they might be associated with viral fitness and survival mechanisms. For example, in the case of HIV-1, selection of a CpG-depleted genome can be a mechanism to facilitate latency and escape from the human immune system. In fact, analysis of the distribution of CpG motifs in the HIV-1 genome indicates that the Gag, Pol, and Env regions are highly CpG-depleted [169]. However, CpG-enriched regions are present in the LTRs (Fig 7.3). This skewed CpG abundance in different regions of the HIV-1 genome may be an optimum arrangement. On one hand, HIV-1 benefits from a CpG-depleted genome that is less likely to be detected by the immune system and is less likely to be transcriptionally inactivated by methylation. On the other hand, for HIV-1 to become latent, it may need to have, in its genome, regions with enriched CpG dinucleotides that can be methylated [185].
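The idea of scoring a motif against its sub-motifs can be sketched as an observed/expected ratio. The code below is a simplified illustration and should not be read as the model used in Chapter 5: the function names are hypothetical, a dinucleotide is scored against its mononucleotide frequencies, and a longer motif is scored with a maximal-order Markov expectation, N(prefix) × N(suffix) / N(core), which is one common choice.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def representation(seq, motif):
    """Observed/expected ratio of motif, with expectation from its sub-motifs.

    Values below 1 indicate underrepresentation, values above 1
    overrepresentation. Returns NaN when the expectation is zero.
    """
    k = len(motif)
    observed = count_kmers(seq, k)[motif]
    if k == 2:
        # Expectation from the mononucleotide counts, e.g. N(C) * N(G) / L.
        expected = seq.count(motif[0]) * seq.count(motif[1]) / len(seq)
    else:
        # Maximal-order Markov expectation built from the overlapping
        # (k-1)-mers and the shared (k-2)-mer core.
        prefix = count_kmers(seq, k - 1)[motif[:-1]]
        suffix = count_kmers(seq, k - 1)[motif[1:]]
        core = count_kmers(seq, k - 2)[motif[1:-1]]
        expected = prefix * suffix / core if core else 0.0
    return observed / expected if expected else float("nan")


toy = "ACGTACGTAA"
cpg_ratio = representation(toy, "CG")      # dinucleotide vs. mononucleotides
acgt_ratio = representation(toy, "ACGT")   # flanked CpG vs. its sub-motifs
```

Scoring a flanked motif such as ACGT against ACG, CGT and CG asks whether the flanking context adds depletion beyond that of CpG itself, which is the kind of question raised by the context dependency described above.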

Figure 7.3: Distribution of CpG motifs in the HIV-1 subtype B reference genome (HXB2).

Our studies revealed important genomic patterns that shed light on the evolution of HIV but also raised several new questions. For example, our results suggest that CpG methylation is the source of CpG depletion in the HIV and SIV genomes only. Thus, future studies are needed to determine the source of CpG depletion in other viruses such as HTLV and HERV-K. These viruses integrate their DNA into the human genome and are thus exposed to methylation; however, they do not show the CpG depletion pattern we observed for HIV-1.

HIV-1 integration into lowly expressed genes is associated with clonal expansion

As described, a small fraction of HIV-1 sequences become transcriptionally silent after they integrate into the genome of T-cells [186]. This “latent” reservoir is established within the first few days of infection [187]. The integrated viral genomes can persist inactively in patients for many years. Although transcriptionally silent, the HIV-1 latent reservoir is fully capable of producing new infectious viruses upon reactivation by various cytokines or a recall antigen or when treatment is interrupted [188]. Studies have reported that transcription factors such as NFκB can rapidly translocate into the nucleus in response to T-cell receptor activation and can augment the transcription of HIV-1 LTR through the dual proximal NFκB/NFAT binding sites [189-193]. These studies suggest that latency is initiated in resting CD4 T-cells. Other studies suggest that HIV-1 latent proviruses can be established in all CD4 T-cell categories including naïve (TNA), stem cell memory (TSCM), central memory (TCM) and effector memory (TEM) T-cells [194-196] as well as in monocytes and macrophages [197-199]. It has been suggested that the heterogeneity of the HIV-1 reservoir may play a role in the stability of HIV-1 latency in
CD4 T-cells [200-202]. For HIV-1, there are several studies that established HIV-1 integration in latently infected cells through direct infection of activated CD4 T-cells or resting CD4 T-cells [203-206]. In one of these studies, Vatakis et al. showed that there is a considerable increase in the level of sequences with LTR defects and abnormal 2-LTR circles in resting CD4 T-cells compared to activated CD4 T-cells [203].

Multiple other studies have reported clonal expansion and proliferation of infected CD4 T-cells in individuals on long-term ART. Chomont et al. showed that more than 50% of cells harbouring replication-competent HIV-1 proviruses were memory T-cells [202]. HIV-1 proviruses were present predominantly in central memory T-cells (TCM) and transitional memory T-cells (TTM) but not in effector memory T-cells (TEM). Both TCM and TTM CD4 T-cell populations appear to have low levels of proliferation due to immune system activation caused by long-term ART. Therefore, persistence of HIV-1 infection in these cells seems to be assisted by their increased half-life, and this facilitates HIV-1 reservoir maintenance in these cell subsets [202]. In contrast, a study by Imamichi et al. showed long-term persistence of TEM cells and its potential association with defective proviruses [207]. In this in vitro study, the integration sites of a single HIV-1 provirus containing a stop codon at position 42 of the protease gene were mapped in a longitudinal analysis. The results showed that TEM cells could persist for 17 years and that HIV-1 integration sites carrying the W42Stop mutation could be identified. Given that TEM cells are usually terminally differentiated memory T-cells with a functional half-life of 3-6 days [208, 209], these data suggest a dramatic phenotypic change in TEM cells after viral integration; however, the underlying mechanism remains unknown.

An additional contradictory interpretation was presented by Buzon et al., who hypothesized that HIV-1 can use CD4 stem cell memory T-cells (TSCM) as a preferred niche to enhance long-term HIV-1 persistence [195]. They demonstrated that replication-competent viruses are found in TSCM sequenced from three individuals on long-term ART [195]. Higher per-cell levels of HIV-1 DNA sequences were found in TSCM compared to TCM, TEM, and TTD. Additionally, TSCM made an increasing contribution to the total HIV-1 CD4 T-cell reservoir over time, indicating that HIV-1 in infected CD4 TSCM may persist as a stable HIV-1 reservoir. It has been shown that CD4 TSCM are permissive to HIV-1 infection, have a low level of apoptosis, and can survive for a long time [210]; however, it is not known whether HIV-1 preferentially integrates into the genome of these cells.

One mechanism that plays a role in viral persistence (latency) and may also be responsible for the above-mentioned contradictory results is viral integration. It is known that different cell types have different histone modifications and chromatin structures. These cell-specific genomic features give rise to differential gene transcription activities in different cell types [211, 212]. It is reasonable to postulate that this phenomenon also plays a role in the selection of viral integration sites. Studies have shown that HIV-1 mostly favours integration within the gene body or within regulatory regions of transcriptionally active genes [213-218]. The location of integrated HIV-1 sequences within the host genome can potentially determine the latent or non-latent fate of an infected cell. As such, understanding the mechanisms that determine HIV-1 integration sites is important because it helps to better understand HIV-1 latency.

Maldarelli et al. studied a total of 2410 integration sites by analysing blood samples of five HIV-1 infected individuals on long-term ART. About 57% of the integration sites were detected once, indicating that they were derived from independent HIV-1 infection events in different cells, and 43% of sites were detected more than once. Since identical integration sites are unlikely to arise from independent infection events, multiple copies of the same integration site are assumed to result from clonal expansion of a parental infected cell [219]. In one of the individuals, 20% of integration sites were detected at the same site of the HORMAD2 gene. Another analysis indicated that almost 70% of genes with multiple different HIV-1 integration sites were known to be involved in cell growth (e.g. BACH2, STAT5B, SMG1 or MAP4). In another study, Wagner and colleagues performed a longitudinal analysis (3 time points) of integration sites in three patients. In that study, 534 HIV-1 proviral integration sites were investigated in multiple cells within each individual. Several identical HIV-1 integration sites were detected in more than 2 individual cells, providing evidence for the expansion of infected cells in these individuals [220]. They also observed that proviral integration sites in these individuals were overrepresented in cancer-associated genes. Almost 12% of genes with multiple different integration sites were associated with cancer and were found across multiple individuals [220]. Genes such as CREBBP, BACH2, C2CD3, MKL2 and STAT5B were identified in two or three individuals, suggesting that HIV-1 preferentially integrates into these chromosomal sites. Potential changes in the activity of these genes as a result of HIV-1 integration may induce long-term survival or clonal expansion in these cells during suppressive ART. To gain further insights into the preferred sites of integration and the effect of clonal expansion in maintaining the HIV-1 reservoir, Cohn et al. used a single-cell analysis approach and investigated a large number of HIV-1 integration sites from treated and untreated patients [221]. They demonstrated that clonally expanded T-cells represent the majority of HIV-1 integration sites and that their abundance increased during therapy. However, none of the expanded T-cell clones included intact, full-length HIV-1 sequences. In contrast, the cells carrying unique integration sites decreased in frequency with time on ART. The surviving cells were enriched for HIV-1 integration in transcriptionally silent regions of the host genome [221]. It was shown that dividing clonally expanded T-cells contain defective HIV-1 proviruses, and that the replication-competent HIV-1 reservoir is established in CD4 T-cells that remain quiescent [221]. However, a recent study by Simonetti et al. reported that highly expanded CD4 T-cell clones can contain intact proviruses, which can give rise to infectious HIV-1 particles [222]. This study also demonstrated that expanded clones produced infectious virus that was recognised as persistent plasma viremia during combination antiretroviral therapy in an HIV-1-infected individual who had squamous cell cancer. The CD4+ T-cells that contained an intact HIV-1 provirus were broadly distributed and remarkably enriched in cancer metastases [222]. Taken together, these results demonstrate that some HIV-1 infected cells can grow and divide extensively after infection, giving rise to expanded clones of HIV-1-infected cells, all of which have an HIV-1 provirus at an identical integration site [214].
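The inference of clonal expansion from repeated integration sites can be sketched as a simple counting step. The example below is a minimal illustration under stated assumptions: each sequenced provirus is represented by a hypothetical (chromosome, position, strand) tuple, and a site observed more than once is labelled as expanded. The function name, record layout and coordinates are invented for illustration and are not taken from the cited studies.

```python
from collections import Counter


def summarise_integration_sites(observations):
    """Classify integration sites as single or expanded.

    observations: iterable of (chromosome, position, strand) tuples,
    one per sequenced provirus (hypothetical record layout).
    """
    counts = Counter(observations)
    single = [site for site, c in counts.items() if c == 1]
    expanded = [site for site, c in counts.items() if c > 1]
    n_sites = len(counts)
    n_obs = sum(counts.values())
    return {
        "distinct_sites": n_sites,
        "single_sites": len(single),
        "expanded_sites": len(expanded),
        # Share of distinct sites seen more than once.
        "fraction_sites_expanded": len(expanded) / n_sites if n_sites else 0.0,
        # Share of all sequenced proviruses that sit in an expanded site.
        "fraction_observations_expanded": (
            sum(counts[s] for s in expanded) / n_obs if n_obs else 0.0
        ),
    }


# Invented toy data: one site seen three times, two sites seen once.
toy = [
    ("chr22", 30000000, "+"),
    ("chr22", 30000000, "+"),
    ("chr22", 30000000, "+"),
    ("chr6", 90000000, "-"),
    ("chr16", 14000000, "+"),
]
summary = summarise_integration_sites(toy)
```

The two fractions are reported separately because a percentage of "sites detected more than once" can be computed either over distinct sites or over all sequenced proviruses, and the two can differ substantially when a few clones are very large.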

Nevertheless, the source and mechanisms involved in the appearance of identical HIV-1 proviruses during long-term ART are not well understood. To shed light on the molecular mechanisms of HIV-1 integration into the human genome and the role of cell proliferation and clonal expansion in HIV-1 latency, I undertook a quantitative approach and developed a bioinformatics pipeline to analyse data from two previous studies [219, 220]. The specific aim of this project was to investigate the relationship between cell proliferation, gene expression, and the frequency and patterns of HIV integration in human cells.

My results showed that genes containing at least one integration site that was observed multiple times (i.e. in clonally expanded cells) had lower expression levels than genes that had only non-expanded integration sites. Previous studies have provided evidence that suggests HIV-1 integration contributes to the clonal expansion of infected cells. HIV-1 integration in the transcriptional regulator genes MKL2 and BACH2 has been observed in multiple individuals [219, 220, 223-225]. In one individual, for instance, 16 out of 1052 unique HIV-1 integration sites were in MKL2. The same observation was made for HIV-1 integrations in two introns of the BACH2 gene in this individual [219]. All of the HIV-1 proviruses were in the same transcriptional orientation as the gene, highlighting that they may have affected the expression of these genes through the insertion of transcriptional control elements (promoters, enhancers), polyadenylation signal sequences or transcriptional termination sites that are included in the HIV-1 LTRs. On the other hand, infected cells in culture showed neither evidence for preferential integration in the same introns nor any preferential integration in one of the two possible orientations. In individuals, all of the integrations in BACH2 were identified upstream of the coding exons, and all of the integrations in MKL2 were located among the coding exons. Studies have shown that both MKL2 and BACH2 are human proto-oncogenes [226-228]. Thus, cells carrying HIV-1 proviruses in the introns of these genes were presumably selected in individuals because the integrated provirus affected the expression of these genes, modifying the growth and/or survival attributes of the infected cells. My results showed that for both cancer and non-cancer genes, the expression of genes with expanded integration sites is lower than that of genes with non-expanded integration sites. Therefore, the mechanisms of HIV-1 persistence and clonal expansion may be the same for cancer and non-cancer genes. However, the mechanisms by which HIV-1 integration into cancer-associated genes (e.g. BACH2 or MKL2) causes persistence and clonal expansion remain to be understood. Although the HIV-1 integration sites and provirus orientations are well described, the features of the cells carrying these HIV-1 proviruses are not well specified. Moreover, given that only a region of the HIV-1 genome close to the integration location is typically sequenced, it is not known whether the proviruses integrated in genes with expanded integration sites are intact and/or infectious. In addition, the impact of HIV-1 provirus integration on the expression of genes with evidence of clonal expansion of integration sites remains unknown, and it is not clear whether expression is enhanced or silenced. It is possible that integrated HIV-1 proviral DNA disrupts gene activity and regulates the expression level of the host gene to promote persistence or clonal expansion of HIV-1 infected cells.
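The comparison of expression between genes with expanded and non-expanded integration sites can be illustrated with a rank-based summary. The sketch below is an assumption-laden example and not the pipeline used in this thesis: it computes group medians and the Mann-Whitney U statistic in plain Python, and the function names and expression values are hypothetical.

```python
from statistics import median


def mann_whitney_u(x, y):
    """U statistic for x versus y.

    Counts the (xi, yj) pairs in which xi > yj; ties contribute 0.5.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def compare_expression(expanded, non_expanded):
    """Summarise expression in two gene groups (hypothetical inputs)."""
    u = mann_whitney_u(expanded, non_expanded)
    n_pairs = len(expanded) * len(non_expanded)
    return {
        "median_expanded": median(expanded),
        "median_non_expanded": median(non_expanded),
        # Probability that a randomly chosen expanded-site gene has
        # higher expression than a randomly chosen non-expanded one.
        "prob_expanded_higher": u / n_pairs,
    }


# Invented expression values (arbitrary units) for illustration only.
result = compare_expression([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A value of `prob_expanded_higher` below 0.5 would be in the direction reported above, with lower expression in genes carrying expanded integration sites. A significance test would still be needed before drawing any conclusion from real data.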

In our studies, we used gene expression data from CD8+ T-cells rather than CD4+ T-cells because we could not find a suitable expression dataset from CD4+ T-cells. This limitation may affect the results. However, my analysis of gene expression data from the FANTOM5 (Functional Annotation of the Mammalian Genome 5) database [229, 230] showed a positive correlation between gene expression in CD4+ and CD8+ T-cells. I compared the average gene expression of all memory CD4+ T-cells (12 samples) and CD8+ T-cells (22 samples) across more than 201,000 promoters. The following plot shows that there is a positive correlation (R² = 0.873) between the gene expression of memory CD4+ and CD8+ T-cells. It is also important to note that our results are consistent with the data generated by Cohn et al. [221] in a recent study in which CD4+ expression data were used.

Figure 7.4. Positive correlation between gene expression of memory CD4+ and CD8+ T-cells in the FANTOM5 (Functional Annotation of the Mammalian Genome 5) dataset.
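The correlation summarised in Figure 7.4 can be sketched as follows: expression is averaged per promoter within each cell type, and the squared Pearson correlation of the two averaged profiles is reported. This is a minimal sketch, assuming each sample is a list of expression values in the same promoter order. The function names and numbers are hypothetical, and any normalisation or transformation applied to the real FANTOM5 data is not reproduced here.

```python
from math import sqrt


def mean_profile(samples):
    """Average expression per promoter across samples.

    samples: list of equal-length lists, one list per sample.
    """
    n = len(samples)
    return [sum(values) / n for values in zip(*samples)]


def r_squared(x, y):
    """Squared Pearson correlation coefficient of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    r = sxy / sqrt(sxx * syy)
    return r * r


# Invented toy data: two CD4+ samples and two CD8+ samples, three promoters.
cd4 = mean_profile([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
cd8 = mean_profile([[1.0, 3.0, 2.0], [1.0, 3.0, 2.0]])
toy_r2 = r_squared(cd4, cd8)
```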

Additionally, due to the limited number of HIV-1 integration sites in the present study, there was not enough statistical power to identify other potential genes with multiple copies of integration sites. A future study with a greater number of HIV-1 integration sites in PBMCs or CD4+ T-cells from patients on prolonged cART will help to better identify genes with multiple copies of integration sites.

If integration sites are situated close to an oncogene, the integrated viral genome can induce the expression of part or all of the gene, leading to aberrant cell proliferation and tumor formation [231-233]. For example, murine leukemia virus and avian leukosis virus proviruses induce tumors in mice and chickens by changing the expression of adjacent oncogenes [234]. The reported clonal expansions associated with HIV-1 integrations in cancer-associated genes (e.g. BACH2 or MKL2) may suggest that HIV-1 integration in genes that regulate cell growth and division and promote proliferation slows down viral decay during antiretroviral therapy.

It also remains to be determined what percentage of expanded CD4+ T-cell clones contain an infectious provirus. It has been shown that CD4+ T-cells containing an infectious HIV-1 provirus can clonally expand and persist for a long time in a patient on cART [222].

Conclusion

The computational biology research reported in my thesis addresses several important questions that advance the understanding of HIV and HIV infection. Furthermore, my work demonstrates that the design and use of computational methods can provide additional and important biological insights. With the rapid increase in the amount of experimental data that can be generated using new technologies, there is a growing need for computational methods to analyse and interpret these data. A limitation of current research is that researchers are often locked into a particular way of analysing the data because they are reliant on standard software. However, with increases in the size and type of data, researchers now need to combine standard software with new in-house pipelines to analyse all the available data and obtain the best results. In summary, computational research has much to contribute to the future advancement of our understanding of HIV and HIV infection.

References

1. Walker, C.M., et al., CD8+ lymphocytes can control HIV infection in vitro by suppressing virus replication. Science, 1986. 234(4783): p. 1563-6.

2. Kannagi, M., et al., Suppression of simian immunodeficiency virus replication in vitro by CD8+ lymphocytes. J Immunol, 1988. 140(7): p. 2237-42.

3. Nixon, D.F., et al., HIV-1 gag-specific cytotoxic T lymphocytes defined with recombinant vaccinia virus and synthetic peptides. Nature, 1988. 336(6198): p. 484-7.

4. Shiver, J.W., et al., Replication-incompetent adenoviral vaccine vector elicits effective anti-immunodeficiency-virus immunity. Nature, 2002. 415(6869): p. 331-5.

5. Berman, P.W., et al., Protection of chimpanzees from infection by HIV-1 after vaccination with recombinant glycoprotein gp120 but not gp160. Nature, 1990. 345(6276): p. 622-5.

6. Marthas, M.L., et al., Immunization with a live, attenuated simian immunodeficiency virus (SIV) prevents early disease but not infection in rhesus macaques challenged with pathogenic SIV. J Virol, 1990. 64(8): p. 3694-700.

7. Daniel, M.D., et al., Protective effects of a live attenuated SIV vaccine with a deletion in the nef gene. Science, 1992. 258(5090): p. 1938-41.

8. Gundlach, B.R., et al., Env-independent protection induced by live, attenuated simian immunodeficiency virus vaccines. J Virol, 1998. 72(10): p. 7846-51.

9. Barouch, D.H., et al., Control of viremia and prevention of clinical AIDS in rhesus monkeys by cytokine-augmented DNA vaccination. Science, 2000. 290(5491): p. 486-92.

10. Flynn, N.M., et al., Placebo-controlled phase 3 trial of a recombinant glycoprotein 120 vaccine to prevent HIV-1 infection. J Infect Dis, 2005. 191(5): p. 654-65.

11. Uberla, K., HIV vaccine development in the aftermath of the STEP study: re-focus on occult HIV infection? PLoS Pathog, 2008. 4(8): p. e1000114.

12. Rerks-Ngarm, S., et al., Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand. N Engl J Med, 2009. 361(23): p. 2209-20.

13. Belshe, R.B., et al., Induction of immune responses to HIV-1 by canarypox virus (ALVAC) HIV-1 and gp120 SF-2 recombinant vaccines in uninfected volunteers. NIAID AIDS Vaccine Evaluation Group. Aids, 1998. 12(18): p. 2407-15.

14. Karnasuta, C., et al., Antibody-dependent cell-mediated cytotoxic responses in participants enrolled in a phase I/II ALVAC-HIV/AIDSVAX B/E prime-boost HIV-1 vaccine trial in Thailand. Vaccine, 2005. 23(19): p. 2522-9.

15. Allen, T.M., et al., Tat-specific cytotoxic T lymphocytes select for SIV escape variants during resolution of primary viraemia. Nature, 2000. 407(6802): p. 386-90.

16. Goulder, P.J., et al., Late escape from an immunodominant cytotoxic T-lymphocyte response associated with progression to AIDS. Nat Med, 1997. 3(2): p. 212-7.

17. Harrer, T., et al., Cytotoxic T lymphocytes in asymptomatic long-term nonprogressing HIV-1 infection. Breadth and specificity of the response and relation to in vivo viral quasispecies in a person with prolonged infection and low viral load. J Immunol, 1996. 156(7): p. 2616-23.

18. Price, D.A., et al., Positive selection of HIV-1 cytotoxic T lymphocyte escape variants during primary infection. Proc Natl Acad Sci U S A, 1997. 94(5): p. 1890-5.

19. Phillips, R.E., et al., Human immunodeficiency virus genetic variation that can escape cytotoxic T cell recognition. Nature, 1991. 354(6353): p. 453-9.

20. Evans, D.T., et al., Virus-specific cytotoxic T-lymphocyte responses select for amino-acid variation in simian immunodeficiency virus Env and Nef. Nat Med, 1999. 5(11): p. 1270-6.

21. Fernandez, C.S., et al., Rapid viral escape at an immunodominant simian-human immunodeficiency virus cytotoxic T-lymphocyte epitope exacts a dramatic fitness cost. J Virol, 2005. 79(9): p. 5721-31.

22. Reece, J.C., et al., Trivalent live attenuated influenza-simian immunodeficiency virus vaccines: efficacy and evolution of cytotoxic T lymphocyte escape in macaques. J Virol, 2013. 87(8): p. 4146-60.

23. Friedrich, T.C., et al., Extraepitopic compensatory substitutions partially restore fitness to simian immunodeficiency virus variants that escape from an immunodominant cytotoxic-T-lymphocyte response. J Virol, 2004. 78(5): p. 2581-5.

24. Smith, M.Z., et al., Analysis of pigtail macaque major histocompatibility complex class I molecules presenting immunodominant simian immunodeficiency virus epitopes. J Virol, 2005. 79(2): p. 684-95.

25. Smith, M.Z., et al., The pigtail macaque MHC class I allele Mane-A*10 presents an immundominant SIV Gag epitope: identification, tetramer development and implications of immune escape and reversion. J Med Primatol, 2005. 34(5-6): p. 282-93.

26. Mason, R.D., et al., Differential patterns of immune escape at Tat-specific cytotoxic T cell epitopes in pigtail macaques. Virology, 2009. 388(2): p. 315-23.

27. Rosenberg, E.S., et al., Vigorous HIV-1-specific CD4+ T cell responses associated with control of viremia. Science, 1997. 278(5342): p. 1447-50.

28. O'Connor, D.H., et al., Major histocompatibility complex class I alleles associated with slow simian immunodeficiency virus disease progression bind epitopes recognized by dominant acute-phase cytotoxic-T-lymphocyte responses. J Virol, 2003. 77(16): p. 9029-40.

29. MacDonald, K.S., et al., Influence of HLA supertypes on susceptibility and resistance to human immunodeficiency virus type 1 infection. J Infect Dis, 2000. 181(5): p. 1581-9.

30. Martin, M.P. and M. Carrington, Immunogenetics of HIV disease. Immunol Rev, 2013. 254(1): p. 245-64.

31. Carrington, M. and R.E. Bontrop, Effects of MHC class I on HIV/SIV disease in primates. Aids, 2002. 16 (Suppl 4): p. S105-14.

32. Kiepiela, P., et al., Dominant influence of HLA-B in mediating the potential co-evolution of HIV and HLA. Nature, 2004. 432(7018): p. 769-75.

33. Fellay, J., et al., A whole-genome association study of major determinants for host control of HIV-1. Science, 2007. 317(5840): p. 944-7.

34. Pereyra, F., et al., The major genetic determinants of HIV-1 control affect HLA class I peptide presentation. Science, 2010. 330(6010): p. 1551-7.

35. Moore, C.B., et al., Evidence of HIV-1 adaptation to HLA-restricted immune responses at a population level. Science, 2002. 296(5572): p. 1439-43.

36. Boutwell, C.L., et al., Reduced viral replication capacity of human immunodeficiency virus type 1 subtype C caused by cytotoxic-T-lymphocyte escape mutations in HLA-B57 epitopes of capsid protein. J Virol, 2009. 83(6): p. 2460-8.

37. Brockman, M.A., et al., Early selection in Gag by protective HLA alleles contributes to reduced HIV-1 replication capacity that may be largely compensated for in chronic infection. J Virol, 2010. 84(22): p. 11937-49.

38. Frater, A.J., et al., Effective T-cell responses select human immunodeficiency virus mutants and slow disease progression. J Virol, 2007. 81(12): p. 6742-51.

39. Kawashima, Y., et al., Adaptation of HIV-1 to human leukocyte antigen class I. Nature, 2009. 458(7238): p. 641-5.

40. Rolland, M., et al., Amino-acid co-variation in HIV-1 Gag subtype C: HLA-mediated selection pressure and compensatory dynamics. PLoS One, 2010. 5(9).

41. Schneidewind, A., et al., Structural and functional constraints limit options for cytotoxic T-lymphocyte escape in the immunodominant HLA-B27-restricted epitope in human immunodeficiency virus type 1 capsid. J Virol, 2008. 82(11): p. 5594-605.

42. Schneidewind, A., et al., Escape from the dominant HLA-B27-restricted cytotoxic T-lymphocyte response in Gag is associated with a dramatic reduction in human immunodeficiency virus type 1 replication. J Virol, 2007. 81(22): p. 12382-93.

43. Wright, J.K., et al., Gag-protease-mediated replication capacity in HIV-1 subtype C chronic infection: associations with HLA type and clinical parameters. J Virol, 2010. 84(20): p. 10820-31.

44. Wright, J.K., et al., Impact of HLA-B*81-associated mutations in HIV-1 Gag on viral replication capacity. J Virol, 2012. 86(6): p. 3193-9.

45. Klein, J. and A. Sato, The HLA system. First of two parts. N Engl J Med, 2000. 343(10): p. 702-9.

46. Falk, K., et al., Allele-specific motifs revealed by sequencing of self-peptides eluted from MHC molecules. 1991. J Immunol, 2006. 177(5): p. 2741-7.

47. Chopera, D.R., et al., Transmission of HIV-1 CTL escape variants provides HLA-mismatched recipients with a survival advantage. PLoS Pathog, 2008. 4(3): p. e1000033.

48. Friedrich, T.C., et al., Reversion of CTL escape-variant immunodeficiency viruses in vivo. Nat Med, 2004. 10(3): p. 275-81.

49. Leslie, A.J., et al., HIV evolution: CTL escape mutation and reversion after transmission. Nat Med, 2004. 10(3): p. 282-9.

50. Matthews, P.C., et al., Central role of reverting mutations in HLA associations with human immunodeficiency virus set point. J Virol, 2008. 82(17): p. 8548-59.

51. Wright, J.K., et al., Influence of Gag-protease-mediated replication capacity on disease progression in individuals recently infected with HIV-1 subtype C. J Virol, 2011. 85(8): p. 3996-4006.

52. Draenert, R., et al., Immune selection for altered antigen processing leads to cytotoxic T lymphocyte escape in chronic HIV-1 infection. J Exp Med, 2004. 199(7): p. 905-15.

53. Crawford, H., et al., Compensatory mutation partially restores fitness and delays reversion of escape mutation within the immunodominant HLA-B*5703-restricted Gag epitope in chronic human immunodeficiency virus type 1 infection. J Virol, 2007. 81(15): p. 8346-51.

54. Leslie, A., et al., Differential selection pressure exerted on HIV by CTL targeting identical epitopes but restricted by distinct HLA alleles from the same HLA supertype. J Immunol, 2006. 177(7): p. 4699-708.

55. Klenerman, P. and A. McMichael, AIDS/HIV. Finding footprints among the trees. Science, 2007. 315(5818): p. 1505-7.

56. Yusim, K., et al., Clustering patterns of cytotoxic T-lymphocyte epitopes in human immunodeficiency virus type 1 (HIV-1) proteins reveal imprints of immune evasion on HIV-1 global variation. J Virol, 2002. 76(17): p. 8757-68.

57. Kelleher, A.D., et al., Clustered mutations in HIV-1 gag are consistently required for escape from HLA-B27-restricted cytotoxic T lymphocyte responses. J Exp Med, 2001. 193(3): p. 375-86.

58. Peyerl, F.W., et al., Fitness costs limit viral escape from cytotoxic T lymphocytes at a structurally constrained epitope. J Virol, 2004. 78(24): p. 13901-10.

59. Bhattacharya, T., et al., Founder effects in the assessment of HIV polymorphisms and HLA allele associations. Science, 2007. 315(5818): p. 1583-6.

60. Zhou, Y., et al., SIV infection of rhesus macaques of Chinese origin: a suitable model for HIV infection in humans. Retrovirology, 2013. 10: p. 89.

61. Daniel, M.D., et al., Isolation of T-cell tropic HTLV-III-like retrovirus from macaques. Science, 1985. 228(4704): p. 1201-4.

62. Krebs, K.C., et al., Unusually high frequency MHC class I alleles in Mauritian origin cynomolgus macaques. J Immunol, 2005. 175(8): p. 5230-9.

63. Wiseman, R.W., et al., Simian immunodeficiency virus SIVmac239 infection of major histocompatibility complex-identical cynomolgus macaques from Mauritius. J Virol, 2007. 81(1): p. 349-61.

64. Peut, V. and S.J. Kent, Substantial envelope-specific CD8 T-cell immunity fails to control SIV disease. Virology, 2009. 384(1): p. 21-7.

65. Allen, T.M., et al., Characterization of the peptide binding motif of a rhesus MHC class I molecule (Mamu-A*01) that binds an immunodominant CTL epitope from simian immunodeficiency virus. J Immunol, 1998. 160(12): p. 6062-71.

66. Wei, X., et al., Antibody neutralization and escape by HIV-1. Nature, 2003. 422(6929): p. 307-12.

67. Letvin, N.L. and B.D. Walker, Immunopathogenesis and immunotherapy in AIDS virus infections. Nat Med, 2003. 9(7): p. 861-6.

68. Neil, S.J., et al., HIV-1 Vpu promotes release and prevents endocytosis of nascent retrovirus particles from the plasma membrane. PLoS Pathog, 2006. 2(5): p. e39.

69. Neil, S.J., et al., An interferon-alpha-induced tethering mechanism inhibits HIV-1 and Ebola virus particle release but is counteracted by the HIV-1 Vpu protein. Cell Host Microbe, 2007. 2(3): p. 193-203.

70. Hrecka, K., et al., Vpx relieves inhibition of HIV-1 infection of macrophages mediated by the SAMHD1 protein. Nature, 2011. 474(7353): p. 658-61.

71. Laguette, N., et al., SAMHD1 is the dendritic- and myeloid-cell-specific HIV-1 restriction factor counteracted by Vpx. Nature, 2011. 474(7353): p. 654-7.

72. Goujon, C., et al., Human MX2 is an interferon-induced post-entry inhibitor of HIV-1 infection. Nature, 2013. 502(7472): p. 559-62.

73. Tokarev, A., et al., Antiviral activity of the interferon-induced cellular protein BST-2/tetherin. AIDS Res Hum Retroviruses, 2009. 25(12): p. 1197-210.

74. Zenner, H.L., et al., Herpes simplex virus 1 counteracts tetherin restriction via its virion host shutoff activity. J Virol, 2013. 87(24): p. 13115-23.

75. Le Tortorec, A., et al., Antiviral inhibition of enveloped virus release by tetherin/BST-2: action and counteraction. Viruses, 2011. 3(5): p. 520-40.

76. Perez-Caballero, D., et al., Tetherin inhibits HIV-1 release by directly tethering virions to cells. Cell, 2009. 139(3): p. 499-511.

77. Neil, S.J., et al., Tetherin inhibits retrovirus release and is antagonized by HIV-1 Vpu. Nature, 2008. 451(7177): p. 425-30.

78. Van Damme, N., et al., The interferon-induced protein BST-2 restricts HIV-1 release and is downregulated from the cell surface by the viral Vpu protein. Cell Host Microbe, 2008. 3(4): p. 245-52.

79. Jia, B., et al., Species-specific activity of SIV Nef and HIV-1 Vpu in overcoming restriction by tetherin/BST2. PLoS Pathog, 2009. 5(5): p. e1000429.

80. Zhang, F., et al., Nef proteins from simian immunodeficiency viruses are tetherin antagonists. Cell Host Microbe, 2009. 6(1): p. 54-67.

81. Iwabu, Y., et al., HIV-1 accessory protein Vpu internalizes cell-surface BST-2/tetherin through transmembrane interactions leading to lysosomes. J Biol Chem, 2009. 284(50): p. 35060-72.

82. Dube, M., et al., Antagonism of tetherin restriction of HIV-1 release by Vpu involves binding and sequestration of the restriction factor in a perinuclear compartment. PLoS Pathog, 2010. 6(4): p. e1000856.

83. Goffinet, C., et al., HIV-1 antagonism of CD317 is species specific and involves Vpu-mediated proteasomal degradation of the restriction factor. Cell Host Microbe, 2009. 5(3): p. 285-97.

84. Mangeat, B., et al., HIV-1 Vpu neutralizes the antiviral factor Tetherin/BST-2 by binding it and directing its beta-TrCP2-dependent degradation. PLoS Pathog, 2009. 5(9): p. e1000574.

85. Gupta, R.K., et al., Mutation of a single residue renders human tetherin resistant to HIV-1 Vpu-mediated depletion. PLoS Pathog, 2009. 5(5): p. e1000443.

86. Goila-Gaur, R. and K. Strebel, HIV-1 Vif, APOBEC, and intrinsic immunity. Retrovirology, 2008. 5: p. 51.

87. Holmes, R.K., et al., APOBEC-mediated viral restriction: not simply editing? Trends Biochem Sci, 2007. 32(3): p. 118-28.

88. Holmes, R.K., et al., APOBEC3F can inhibit the accumulation of HIV-1 reverse transcription products in the absence of hypermutation. Comparisons with APOBEC3G. J Biol Chem, 2007. 282(4): p. 2587-95.

89. Iwatani, Y., et al., Deaminase-independent inhibition of HIV-1 reverse transcription by APOBEC3G. Nucleic Acids Res, 2007. 35(21): p. 7096-108.

90. Bishop, K.N., et al., APOBEC3G inhibits elongation of HIV-1 reverse transcripts. PLoS Pathog, 2008. 4(12): p. e1000231.

91. Conticello, S.G., et al., The Vif protein of HIV triggers degradation of the human antiretroviral DNA deaminase APOBEC3G. Curr Biol, 2003. 13(22): p. 2009-13.

92. Kobayashi, M., et al., Ubiquitination of APOBEC3G by an HIV-1 Vif-Cullin5-Elongin B-Elongin C complex is essential for Vif function. J Biol Chem, 2005. 280(19): p. 18573-8.

93. Marin, M., et al., HIV-1 Vif protein binds the editing enzyme APOBEC3G and induces its degradation. Nat Med, 2003. 9(11): p. 1398-403.

94. Mehle, A., et al., Vif overcomes the innate antiviral activity of APOBEC3G by promoting its degradation in the ubiquitin-proteasome pathway. J Biol Chem, 2004. 279(9): p. 7792-8.

95. Stopak, K., et al., HIV-1 Vif blocks the antiviral activity of APOBEC3G by impairing both its translation and intracellular stability. Mol Cell, 2003. 12(3): p. 591-601.

96. Sheehy, A.M., et al., The antiretroviral enzyme APOBEC3G is degraded by the proteasome in response to HIV-1 Vif. Nat Med, 2003. 9(11): p. 1404-7.

97. Yu, X., et al., Induction of APOBEC3G ubiquitination and degradation by an HIV-1 Vif-Cul5-SCF complex. Science, 2003. 302(5647): p. 1056-60.

98. Alce, T.M. and W. Popik, APOBEC3G is incorporated into virus-like particles by a direct interaction with HIV-1 Gag nucleocapsid protein. J Biol Chem, 2004. 279(33): p. 34083-6.

99. Harris, R.S., et al., DNA deamination mediates innate immunity to retroviral infection. Cell, 2003. 113(6): p. 803-9.

100. Mangeat, B., et al., Broad antiretroviral defence by human APOBEC3G through lethal editing of nascent reverse transcripts. Nature, 2003. 424(6944): p. 99-103.

101. Miyagi, E., et al., Enzymatically active APOBEC3G is required for efficient inhibition of human immunodeficiency virus type 1. J Virol, 2007. 81(24): p. 13346-53.

102. Zhang, H., et al., The cytidine deaminase CEM15 induces hypermutation in newly synthesized HIV-1 DNA. Nature, 2003. 424(6944): p. 94-8.

103. Yu, Q., et al., APOBEC3B and APOBEC3C are potent inhibitors of simian immunodeficiency virus replication. J Biol Chem, 2004. 279(51): p. 53379-86.

104. Mulder, L.C., et al., Cytidine deamination induced HIV-1 drug resistance. Proc Natl Acad Sci U S A, 2008. 105(14): p. 5501-6.

105. Jern, P., et al., Likely role of APOBEC3G-mediated G-to-A mutations in HIV-1 evolution and drug resistance. PLoS Pathog, 2009. 5(4): p. e1000367.

106. Sadler, H.A., et al., APOBEC3G contributes to HIV-1 variation through sublethal mutagenesis. J Virol, 2010. 84(14): p. 7396-404.

107. Wood, N., et al., HIV evolution in early infection: selection pressures, patterns of insertion and deletion, and the impact of APOBEC. PLoS Pathog, 2009. 5(5): p. e1000414.

108. Deforche, K., et al., Estimating the relative contribution of dNTP pool imbalance and APOBEC3G/3F editing to HIV evolution in vivo. J Comput Biol, 2007. 14(8): p. 1105-14.

109. Ebrahimi, D., et al., APOBEC3 has not left an evolutionary footprint on the HIV-1 genome. J Virol, 2011. 85(17): p. 9139-46.

110. Armitage, A.E., et al., APOBEC3G-induced hypermutation of human immunodeficiency virus type-1 is typically a discrete "all or nothing" phenomenon. PLoS Genet, 2012. 8(3): p. e1002550.

111. Li, M., et al., First-in-class small molecule inhibitors of the single-strand DNA cytosine deaminase APOBEC3G. ACS Chem Biol, 2012. 7(3): p. 506-17.

112. Olson, M.E., et al., Small-molecule APOBEC3G DNA cytosine deaminase inhibitors based on a 4-amino-1,2,4-triazole-3-thiol scaffold. ChemMedChem, 2013. 8(1): p. 112-7.

113. Nathans, R., et al., Small-molecule inhibition of HIV-1 Vif. Nat Biotechnol, 2008. 26(10): p. 1187-92.

114. Cen, S., et al., Small molecular compounds inhibit HIV-1 replication through specifically stabilizing APOBEC3G. J Biol Chem, 2010. 285(22): p. 16546-52.

115. Nowotny, B., et al., Inducible APOBEC3G-Vif double stable cell line as a high-throughput screening platform to identify antiviral compounds. Antimicrob Agents Chemother, 2010. 54(1): p. 78-87.

116. Ejima, T., et al., An anti-HIV-1 compound that increases steady-state expression of apoplipoprotein B mRNA-editing enzyme-catalytic polypeptide-like 3G. Int J Mol Med, 2011. 28(4): p. 613-6.

117. Ali, A., et al., Synthesis and structure-activity relationship studies of HIV-1 virion infectivity factor (Vif) inhibitors that block viral replication. ChemMedChem, 2012. 7(7): p. 1217-29.

118. Mohammed, I., et al., SAR and Lead Optimization of an HIV-1 Vif-APOBEC3G Axis Inhibitor. ACS Med Chem Lett, 2012. 3(6): p. 465-9.

119. Matsui, Y., et al., Defining HIV-1 Vif residues that interact with CBFbeta by site-directed mutagenesis. Virology, 2014. 449: p. 82-7.

120. Hultquist, J.F., et al., Human and rhesus APOBEC3D, APOBEC3F, APOBEC3G, and APOBEC3H demonstrate a conserved capacity to restrict Vif-deficient HIV-1. J Virol, 2011. 85(21): p. 11220-34.

121. Armitage, A.E., et al., Conserved footprints of APOBEC3G on Hypermutated human immunodeficiency virus type 1 and human endogenous retrovirus HERV-K(HML2) sequences. J Virol, 2008. 82(17): p. 8743-61.

122. Desimmie, B.A., et al., APOBEC3 proteins can copackage and comutate HIV-1 genomes. Nucleic Acids Res, 2016. 44(16): p. 7848-65.

123. Harris, M., et al., Acute-phase CD8 T cell responses that select for escape variants are needed to control live attenuated simian immunodeficiency virus. J Virol, 2013. 87(16): p. 9353-64.

124. Daza-Vamenta, R., et al., Genetic divergence of the rhesus macaque major histocompatibility complex. Genome Res, 2004. 14(8): p. 1501-15.

125. Fukami-Kobayashi, K., et al., Genomic evolution of MHC class I region in primates. Proc Natl Acad Sci U S A, 2005. 102(26): p. 9230-4.

126. Zheng, Y.H., et al., Human APOBEC3F is another host factor that blocks human immunodeficiency virus type 1 replication. J Virol, 2004. 78(11): p. 6073-6.

127. Liddament, M.T., et al., APOBEC3F properties and hypermutation preferences indicate activity against HIV-1 in vivo. Curr Biol, 2004. 14(15): p. 1385-91.

128. Beale, R.C., et al., Comparison of the differential context-dependence of DNA deamination by APOBEC enzymes: correlation with mutation spectra in vivo. J Mol Biol, 2004. 337(3): p. 585-96.

129. Bishop, K.N., et al., Cytidine deamination of retroviral DNA by diverse APOBEC proteins. Curr Biol, 2004. 14(15): p. 1392-6.

130. Kijak, G.H., et al., Variable contexts and levels of hypermutation in HIV-1 proviral genomes recovered from primary peripheral blood mononuclear cells. Virology, 2008. 376(1): p. 101-11.

131. Pace, C., et al., Population level analysis of human immunodeficiency virus type 1 hypermutation and its relationship with APOBEC3G and vif genetic variation. J Virol, 2006. 80(18): p. 9259-69.

132. Ulenga, N.K., et al., The level of APOBEC3G (hA3G)-related G-to-A mutations does not correlate with viral load in HIV type 1-infected individuals. AIDS Res Hum Retroviruses, 2008. 24(10): p. 1285-90.

133. Oliver, A., et al., Hypermutation and the preexistence of antibiotic-resistant Pseudomonas aeruginosa mutants: implications for susceptibility testing and treatment of chronic infections. Antimicrob Agents Chemother, 2004. 48(11): p. 4226-33.

134. Ebrahimi, D., et al., APOBEC3G and APOBEC3F rarely co-mutate the same HIV genome. Retrovirology, 2012. 9: p. 113.

135. Kieffer, T.L., et al., G-->A hypermutation in protease and reverse transcriptase regions of human immunodeficiency virus type 1 residing in resting CD4+ T cells in vivo. J Virol, 2005. 79(3): p. 1975-80.

136. Land, A.M., et al., Human immunodeficiency virus (HIV) type 1 proviral hypermutation correlates with CD4 count in HIV-infected women from Kenya. J Virol, 2008. 82(16): p. 8172-82.

137. Piantadosi, A., et al., Analysis of the percentage of human immunodeficiency virus type 1 sequences that are hypermutated and markers of disease progression in a longitudinal cohort, including one individual with a partially defective Vif. J Virol, 2009. 83(16): p. 7805-14.

138. Gandhi, S.K., et al., Role of APOBEC3G/F-mediated hypermutation in the control of human immunodeficiency virus type 1 in elite suppressors. J Virol, 2008. 82(6): p. 3125-30.

139. Janini, M., et al., Human immunodeficiency virus type 1 DNA sequences genetically damaged by hypermutation are often abundant in patient peripheral blood mononuclear cells and may be generated during near-simultaneous infection and activation of CD4(+) T cells. J Virol, 2001. 75(17): p. 7973-86.

140. Koning, F.A., et al., Defining APOBEC3 expression patterns in human tissues and hematopoietic cell subsets. J Virol, 2009. 83(18): p. 9474-85.

141. Refsland, E.W., et al., Quantitative profiling of the full APOBEC3 mRNA repertoire in lymphocytes and tissues: implications for HIV-1 restriction. Nucleic Acids Res, 2010. 38(13): p. 4274-84.

142. Wichroski, M.J., et al., Human retroviral host restriction factors APOBEC3G and APOBEC3F localize to mRNA processing bodies. PLoS Pathog, 2006. 2(5): p. e41.

143. Rose, P.P. and B.T. Korber, Detecting hypermutations in viral sequences with an emphasis on G --> A hypermutation. Bioinformatics, 2000. 16(4): p. 400-1.

144. Chun, T.W., et al., Re-emergence of HIV after stopping therapy. Nature, 1999. 401(6756): p. 874-5.

145. Davey, R.T., et al., HIV-1 and T cell dynamics after interruption of highly active antiretroviral therapy (HAART) in patients with a history of sustained viral suppression. Proc Natl Acad Sci U S A, 1999. 96(26): p. 15109-14.

146. Strain, M.C., et al., Heterogeneous clearance rates of long-lived lymphocytes infected with HIV: intrinsic stability predicts lifelong persistence. Proc Natl Acad Sci U S A, 2003. 100(8): p. 4819-24.

147. Ramratnam, B., et al., The decay of the latent reservoir of replication-competent HIV-1 is inversely correlated with the extent of residual viral replication during prolonged anti-retroviral therapy. Nat Med, 2000. 6(1): p. 82-5.

148. Finzi, D., et al., Latent infection of CD4+ T cells provides a mechanism for lifelong persistence of HIV-1, even in patients on effective combination therapy. Nat Med, 1999. 5(5): p. 512-7.

149. Chun, T.W., et al., Presence of an inducible HIV-1 latent reservoir during highly active antiretroviral therapy. Proc Natl Acad Sci U S A, 1997. 94(24): p. 13193-7.

150. Finzi, D., et al., Identification of a reservoir for HIV-1 in patients on highly active antiretroviral therapy. Science, 1997. 278(5341): p. 1295-300.

151. Wong, J.K., et al., Recovery of replication-competent HIV despite prolonged suppression of plasma viremia. Science, 1997. 278(5341): p. 1291-5.

152. Li, E. and Y. Zhang, DNA methylation in mammals. Cold Spring Harb Perspect Biol, 2014. 6(5): p. a019133.

153. Koiwa, T., et al., 5'-long terminal repeat-selective CpG methylation of latent human T-cell leukemia virus type 1 provirus in vitro and in vivo. J Virol, 2002. 76(18): p. 9389-97.

154. Taniguchi, Y., et al., Silencing of human T-cell leukemia virus type I gene transcription by epigenetic mechanisms. Retrovirology, 2005. 2: p. 64.

155. Hejnar, J., et al., Inhibition of the rous sarcoma virus long terminal repeat-driven transcription by in vitro methylation: different sensitivity in permissive chicken cells versus mammalian cells. Virology, 1999. 255(1): p. 171-81.

156. Hejnar, J., et al., CpG island protects Rous sarcoma virus-derived vectors integrated into nonpermissive cells from DNA methylation and transcriptional suppression. Proc Natl Acad Sci U S A, 2001. 98(2): p. 565-9.

157. Harbers, K., et al., DNA methylation and gene expression: endogenous retroviral genome becomes infectious after molecular cloning. Proc Natl Acad Sci U S A, 1981. 78(12): p. 7609-13.

158. Robbins, P.B., et al., Consistent, persistent expression from modified retroviral vectors in murine hematopoietic stem cells. Proc Natl Acad Sci U S A, 1998. 95(17): p. 10182-7.

159. Lavie, L., et al., CpG methylation directly regulates transcriptional activity of the human endogenous retrovirus family HERV-K(HML-2). J Virol, 2005. 79(2): p. 876-83.

160. Matouskova, M., et al., CpG methylation suppresses transcriptional activity of human syncytin-1 in non-placental tissues. Exp Cell Res, 2006. 312(7): p. 1011-20.

161. Bednarik, D.P., et al., DNA CpG methylation inhibits binding of NF-kappa B proteins to the HIV-1 long terminal repeat cognate DNA motifs. New Biol, 1991. 3(10): p. 969-76.

162. Blazkova, J., et al., CpG methylation controls reactivation of HIV from latency. PLoS Pathog, 2009. 5(8): p. e1000554.

163. Kauder, S.E., et al., Epigenetic regulation of HIV-1 latency by cytosine methylation. PLoS Pathog, 2009. 5(6): p. e1000495.

164. Singh, M.K. and C.D. Pauza, Extrachromosomal human immunodeficiency virus type 1 sequences are methylated in latently infected U937 cells. Virology, 1992. 188(2): p. 451-8.

165. Ishida, T., et al., 5' long terminal repeat (LTR)-selective methylation of latently infected HIV-1 provirus that is demethylated by reactivation signals. Retrovirology, 2006. 3: p. 69.

166. Pion, M., et al., Transcriptional suppression of in vitro-integrated human immunodeficiency virus type 1 does not correlate with proviral DNA methylation. J Virol, 2003. 77(7): p. 4025-32.

167. Pannell, D., et al., Retrovirus vector silencing is de novo methylase independent and marked by a repressive histone code. EMBO J, 2000. 19(21): p. 5884-94.

168. Duverger, A., et al., Determinants of the establishment of human immunodeficiency virus type 1 latency. J Virol, 2009. 83(7): p. 3078-93.

169. Shpaer, E.G. and J.I. Mullins, Selection against CpG dinucleotides in lentiviral genes: a possible role of methylation in regulation of viral expression. Nucleic Acids Res, 1990. 18(19): p. 5793-7.

170. Karlin, S., et al., Why is CpG suppressed in the genomes of virtually all small eukaryotic viruses but not in those of large eukaryotic viruses? J Virol, 1994. 68(5): p. 2889-97.

171. Bird, A.P., DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res, 1980. 8(7): p. 1499-504.

172. Simmen, M.W., Genome-scale relationships between cytosine methylation and dinucleotide abundances in animals. Genomics, 2008. 92(1): p. 33-40.

173. Barton, G.M., Viral recognition by Toll-like receptors. Semin Immunol, 2007. 19(1): p. 33-40.

174. Hemmi, H., et al., A Toll-like receptor recognizes bacterial DNA. Nature, 2000. 408(6813): p. 740-5.

175. Goldberg, B., et al., Beyond danger: unmethylated CpG dinucleotides and the immunopathogenesis of disease. Immunol Lett, 2000. 73(1): p. 13-8.

176. Lund, J., et al., Toll-like receptor 9-mediated recognition of Herpes simplex virus-2 by plasmacytoid dendritic cells. J Exp Med, 2003. 198(3): p. 513-20.

177. Greenbaum, B.D., et al., Patterns of oligonucleotide sequences in viral and host cell RNA identify mediators of the host innate immune system. PLoS One, 2009. 4(6): p. e5969.

178. Jimenez-Baranda, S., et al., Oligonucleotide motifs that disappear during the evolution of influenza virus in humans increase alpha interferon secretion by plasmacytoid dendritic cells. J Virol, 2011. 85(8): p. 3893-904.

179. Greenbaum, B.D., et al., Patterns of evolution and host gene mimicry in influenza and other RNA viruses. PLoS Pathog, 2008. 4(6): p. e1000079.

180. Chi, H. and R.A. Flavell, Innate recognition of non-self nucleic acids. Genome Biol, 2008. 9(3): p. 211.

181. Zhang, X. and C.K. Mathews, Effect of DNA cytosine methylation upon deamination-induced mutagenesis in a natural target sequence in duplex DNA. J Biol Chem, 1994. 269(10): p. 7066-9.

182. Fryxell, K.J. and W.J. Moon, CpG mutation rates in the human genome are highly dependent on local GC content. Mol Biol Evol, 2005. 22(3): p. 650-8.

183. Barton, G.M., et al., Intracellular localization of Toll-like receptor 9 prevents recognition of self DNA but facilitates access to viral DNA. Nat Immunol, 2006. 7(1): p. 49-56.

184. Chockalingam, A., et al., TLR9 traffics through the Golgi complex to localize to endolysosomes and respond to CpG DNA. Immunol Cell Biol, 2009. 87(3): p. 209-17.

185. Bednarik, D.P., et al., Methylation as a modulator of expression of human immunodeficiency virus. J Virol, 1987. 61(4): p. 1253-7.

186. Lassen, K., et al., The multifactorial nature of HIV-1 latency. Trends Mol Med, 2004. 10(11): p. 525-31.

187. Chun, T.W., et al., Early establishment of a pool of latently infected, resting CD4(+) T cells during primary HIV-1 infection. Proc Natl Acad Sci U S A, 1998. 95(15): p. 8869-73.

188. Siliciano, R.F. and W.C. Greene, HIV latency. Cold Spring Harb Perspect Med, 2011. 1(1): p. a007096.

189. Nabel, G. and D. Baltimore, An inducible transcription factor activates expression of human immunodeficiency virus in T cells. Nature, 1987. 326(6114): p. 711-3.

190. Siekevitz, M., et al., Activation of the HIV-1 LTR by T cell mitogens and the trans-activator protein of HTLV-I. Science, 1987. 238(4833): p. 1575-8.

191. Bohnlein, E., et al., The same inducible nuclear proteins regulates mitogen activation of both the interleukin-2 receptor-alpha gene and type 1 HIV. Cell, 1988. 53(5): p. 827-36.

192. Kinoshita, S., et al., The T cell activation factor NF-ATc positively regulates HIV-1 replication and gene expression in T cells. Immunity, 1997. 6(3): p. 235-44.

193. Cron, R.Q., et al., NFAT1 enhances HIV-1 gene expression in primary human CD4 T cells. Clin Immunol, 2000. 94(3): p. 179-91.

194. Chun, T.W., et al., Relationship between pre-existing viral reservoirs and the re-emergence of plasma viremia after discontinuation of highly active anti-retroviral therapy. Nat Med, 2000. 6(7): p. 757-61.

195. Buzon, M.J., et al., HIV-1 persistence in CD4+ T cells with stem cell-like properties. Nat Med, 2014. 20(2): p. 139-42.

196. Blankson, J.N., et al., The challenge of viral reservoirs in HIV-1 infection. Annu Rev Med, 2002. 53: p. 557-93.

197. Koenig, S., et al., Detection of AIDS virus in macrophages in brain tissue from AIDS patients with encephalopathy. Science, 1986. 233(4768): p. 1089-93.

198. Koppensteiner, H., et al., Macrophages and their relevance in Human Immunodeficiency Virus Type I infection. Retrovirology, 2012. 9: p. 82.

199. Kumar, A., et al., HIV-1 latency in monocytes/macrophages. Viruses, 2014. 6(4): p. 1837-60.

200. Chomont, N., et al., Maintenance of CD4+ T-cell memory and HIV persistence: keeping memory, keeping HIV. Curr Opin HIV AIDS, 2011. 6(1): p. 30-6.

201. Bosque, A., et al., Homeostatic proliferation fails to efficiently reactivate HIV-1 latently infected central memory CD4+ T cells. PLoS Pathog, 2011. 7(10): p. e1002288.

202. Chomont, N., et al., HIV reservoir size and persistence are driven by T cell survival and homeostatic proliferation. Nat Med, 2009. 15(8): p. 893-900.

203. Vatakis, D.N., et al., Human immunodeficiency virus integration efficiency and site selection in quiescent CD4+ T cells. J Virol, 2009. 83(12): p. 6222-33.

204. Pace, M.J., et al., Directly infected resting CD4+T cells can produce HIV Gag without spreading infection in a model of HIV latency. PLoS Pathog, 2012. 8(7): p. e1002818.

205. Brady, T., et al., HIV integration site distributions in resting and activated CD4+ T cells infected in culture. AIDS, 2009. 23(12): p. 1461-71.

206. Sherrill-Mix, S., et al., HIV latency and integration site placement in five cell-based models. Retrovirology, 2013. 10: p. 90.

207. Imamichi, H., et al., Lifespan of effector memory CD4+ T cells determined by replication-incompetent integrated HIV-1 provirus. AIDS, 2014. 28(8): p. 1091-9.

208. Macallan, D.C., et al., Rapid turnover of effector-memory CD4(+) T cells in healthy humans. J Exp Med, 2004. 200(2): p. 255-60.

209. Lanzavecchia, A. and F. Sallusto, Dynamics of T lymphocyte responses: intermediates, effectors, and memory cells. Science, 2000. 290(5489): p. 92-7.

210. Gattinoni, L., et al., A human memory T cell subset with stem cell-like properties. Nat Med, 2011. 17(10): p. 1290-7.

211. Maruyama, R., et al., Epigenetic regulation of cell type–specific expression patterns in the human mammary epithelium. PLoS Genet, 2011. 7(1): p. e1001369.

212. LeRoy, G., et al., A quantitative atlas of histone modification signatures from human cancer cells. Epigenetics Chromatin, 2013. 6(1): p. 20.

213. Schroder, A.R., et al., HIV-1 integration in the human genome favors active genes and local hotspots. Cell, 2002. 110(4): p. 521-9.

214. Mitchell, R.S., et al., Retroviral DNA integration: ASLV, HIV, and MLV show distinct target site preferences. PLoS Biol, 2004. 2(8): p. E234.

215. Ciuffi, A., et al., A role for LEDGF/p75 in targeting HIV DNA integration. Nat Med, 2005. 11(12): p. 1287-9.

216. Lewinski, M.K., et al., Genome-wide analysis of chromosomal features repressing human immunodeficiency virus transcription. J Virol, 2005. 79(11): p. 6610-9.

217. Barr, S.D., et al., Integration targeting by avian sarcoma-leukosis virus and human immunodeficiency virus in the chicken genome. J Virol, 2005. 79(18): p. 12035-44.

218. Wu, X., et al., Transcription start regions in the human genome are favored targets for MLV integration. Science, 2003. 300(5626): p. 1749-51.

219. Maldarelli, F., et al., HIV latency. Specific HIV integration sites are linked to clonal expansion and persistence of infected cells. Science, 2014. 345(6193): p. 179-83.

220. Wagner, T.A., et al., HIV latency. Proliferation of cells with HIV integrated into cancer genes contributes to persistent infection. Science, 2014. 345(6196): p. 570-3.

221. Cohn, L.B., et al., HIV-1 integration landscape during latent and active infection. Cell, 2015. 160(3): p. 420-32.

222. Simonetti, F.R., et al., Clonally expanded CD4+ T cells can produce infectious HIV-1 in vivo. Proc Natl Acad Sci U S A, 2016. 113(7): p. 1883-8.

223. Han, Y., et al., Resting CD4+ T cells from human immunodeficiency virus type 1 (HIV-1)-infected individuals carry integrated HIV-1 genomes within actively transcribed host genes. J Virol, 2004. 78(12): p. 6122-33.

224. Ikeda, T., et al., Recurrent HIV-1 integration at the BACH2 locus in resting CD4+ T cell populations during effective highly active antiretroviral therapy. J Infect Dis, 2007. 195(5): p. 716-25.

225. Mack, K.D., et al., HIV insertions within and proximal to host cell genes are a common finding in tissues containing high levels of HIV DNA and macrophage-associated p24 antigen expression. J Acquir Immune Defic Syndr, 2003. 33(3): p. 308-20.

226. Flucke, U., et al., Presence of C11orf95-MKL2 fusion is a consistent finding in chondroid lipomas: a study of eight cases. Histopathology, 2013. 62(6): p. 925-30.

227. Kobayashi, S., et al., Identification of IGHCdelta-BACH2 fusion transcripts resulting from cryptic chromosomal rearrangements of 14q32 with 6q15 in aggressive B-cell lymphoma/leukemia. Genes Chromosomes Cancer, 2011. 50(4): p. 207-16.

228. Muehlich, S., et al., The transcriptional coactivators megakaryoblastic leukemia 1/2 mediate the effects of loss of the tumor suppressor deleted in liver cancer 1. Oncogene, 2012. 31(35): p. 3913-23.

229. Carninci, P., et al., The transcriptional landscape of the mammalian genome. Science, 2005. 309(5740): p. 1559-63.

230. Forrest, A.R., et al., A promoter level mammalian expression atlas. Nature, 2014. 507(7493): p. 462-70.

231. Hayward, W.S., et al., Activation of a cellular onc gene by promoter insertion in ALV-induced lymphoid leukosis. Nature, 1981. 290(5806): p. 475-80.

232. Kim, R., et al., Genome-based identification of cancer genes by proviral tagging in mouse retrovirus-induced T-cell lymphomas. J Virol, 2003. 77(3): p. 2056-62.

233. Payne, G.S., et al., Multiple arrangements of viral DNA and an activated host oncogene in bursal lymphomas. Nature, 1982. 295(5846): p. 209-14.

234. Rosenberg, N. and P. Jolicoeur, Retroviral Pathogenesis, in Retroviruses, J.M. Coffin, S.H. Hughes, and H.E. Varmus, Editors. 1997, Cold Spring Harbor Laboratory Press: Cold Spring Harbor (NY).
