Retroviral Long Terminal Repeats; Structure, Detection and Phylogeny
To my family

List of papers

I    Benachenhou, F., Jern, P., Oja, M., Sperber, G., Blikstad, V., Somervuo, P., Kaski, S. and Blomberg, J. Evolutionary conservation of orthoretroviral long terminal repeats (LTRs) and ab initio detection of single LTRs in genomic data. PLoS ONE 4 (2009b), p. e5179.
II   Benachenhou, F., Blikstad, V. and Blomberg, J. The phylogeny of orthoretroviral long terminal repeats (LTRs). Gene 448 (2009a), pp. 134-8.
III  Benachenhou, F., Sperber, G. and Blomberg, J. Universal structure and phylogeny of Long Terminal Repeats (LTRs). Manuscript.
IV   Blomberg, J., Benachenhou, F., Blikstad, V., Sperber, G. and Mayer, J. Classification and nomenclature of endogenous retroviral sequences (ERVs): problems and recommendations. Gene 448 (2009), pp. 115-23.
V    Blikstad, V., Benachenhou, F., Sperber, G.O. and Blomberg, J. Evolution of human endogenous retroviral sequences: a conceptual account. Cell Mol Life Sci 65 (2008), pp. 3348-65.

Contents

Introduction
  Retroviruses and LTR retrotransposons
    Background
    XRVs and ERVs
    Relation to other genomic elements
    LTRs
  Hidden Markov models
    Why HMMs?
    Generalities
    Building HMMs for LTRs
    Overfitting and regularisation
    Model surgery
    Scoring
    Evaluation
    SuperViterbi
Summary of papers
  Paper I
  Paper II
  Paper III
  Paper IV
  Paper V
Conclusions and future prospects
Acknowledgements
References

Abbreviations

env      Envelope gene
ERV      Endogenous retrovirus
gag      Group-specific antigen gene
HERV     Human endogenous retrovirus
HML      Human MMTV-like retrovirus
HMM      Hidden Markov model
HTLV     Human T-cell leukemia virus
IN       Integrase domain
JSRV     Jaagsiekte sheep retrovirus
LTR      Long terminal repeat
MAP      Maximum a posteriori
ML       Maximum likelihood
MLV      Mouse leukemia virus
MMTV     Mouse mammary tumor virus
ORF      Open reading frame
PAS      Polyadenylation site
PBS      Primer binding site
PLE      Penelope-like elements
pol      Polymerase gene
PPT      Polypurine tract
pro      Protease gene
pSIVgml  Grey mouse lemur prosimian immunodeficiency virus
R        Repeat region (in LTR)
RELIK    Rabbit endogenous lentivirus type K
RH       RNase H domain
RT       Reverse transcriptase domain
TSD      Target site duplication
TSS      Transcription start site
U3       Unique 3’ region
U5       Unique 5’ region
XRV      Exogenous retrovirus
YR       Tyrosine recombinase

Introduction

The present work is devoted to the study of a specific part of retroviruses and related retrotransposons, the long terminal repeats (LTRs). The study of LTRs is important for two reasons. First, very few systematic investigations of LTRs have been published. This is partly because conserved motifs are scarce and partly because LTRs do not code for proteins, so standard alignment techniques such as ClustalW cannot be used. Second, LTRs are ubiquitous components of many eukaryote genomes.
In the human genome, for example, seven to eight percent is of retroviral origin (Lander et al., 2001).

The aim of the present work is threefold:
i) To create reliable alignments of LTRs.
ii) To study the evolutionary relationships between LTR retrotransposons using the alignments.
iii) To detect LTRs in genomic DNA.
These aims were pursued using hidden Markov models (HMMs) as the mathematical tool.

Retroviruses and LTR retrotransposons

Background

Retroviruses are a family of RNA viruses which infect vertebrates (Coffin et al., 1997; Mager and Medstrand, 2003). Their genome consists of two identical positive-strand RNAs. One of their characteristic features is their ability to make a DNA copy of their RNA genome, a process called reverse transcription. Retroviruses integrate into the host genome as part of their replication. This property, although unique among viruses, is shared with other retroelements, the LTR retrotransposons (Boeke and Stoye, 1997; Eickbush and Jamburuthugoda, 2008).

It is remarkable that the Copia/Ty1 retrotransposons (Flavell et al., 1997; Peterson-Burch and Voytas, 2002), found in fungi, plants, invertebrates and some vertebrates, have almost the same genetic structure as retroviruses. The only difference is that they generally lack the env gene and therefore cannot leave the cell. They are, however, capable of reintegrating into the cellular genome and making multiple copies of their own genome. One lineage of Copia/Ty1, referred to as Agroviruses or Sireviruses (Peterson-Burch and Voytas, 2002; Llorens et al., 2009), nevertheless has an env-like gene.

The Gypsy/Ty3 elements (Kordis, 2005; Llorens et al., 2008), found in fungi, plants, invertebrates and some vertebrates, are even more similar to retroviruses. Like the Copia/Ty1 group, they are generally non-infectious transposable elements, except for the errantiviruses of insects, which have infectious properties (Terzian et al., 2001) thanks to an env-like gene.
The remaining group of LTR-containing retrotransposons is the Bel/Pao group, found in metazoans (Boeke and Stoye, 1997).

Retroviruses and other env-containing retroelements can occasionally infect the germ line and be transmitted in a Mendelian fashion, thereby becoming endogenous retroviruses (ERVs; HERVs in the human genome).

The genome of LTR retrotransposons comprises up to four protein-coding genes (Coffin et al., 1997): gag, pro, pol and env. Pol is the best conserved gene and encodes the reverse transcriptase (RT), RNase H (RH) and integrase (IN) proteins. The env-encoded proteins allow fusion between the retroelement and the cell, thus conveying infectivity.

In the full-length genomic RNA, the coding genes are flanked by two non-coding sequences. Each contains an identical terminal region, the repeat region R, together with a unique region adjacent to R: the U5 region at the 5’ end and the U3 region at the 3’ end. After reverse transcription the flanking non-coding sequences become identical and are called LTRs. Each LTR consists of U3 followed by R and U5. Downstream of the 5’ LTR is the primer binding site (PBS), which pairs with the host tRNA that primes synthesis of the first DNA strand in the reverse transcription reaction. The 3’ LTR is preceded by the polypurine tract (PPT), which primes synthesis of the second DNA strand.

ERVs are often present as solo LTRs, lacking the internal genes. They were formed through homologous recombination between the two LTRs, and elimination