VIROLOGY 224, 256–267 (1996) ARTICLE NO. 0527

Common Modular Structure of LTRs

View metadata, citation and similar papers at core.ac.uk brought to you by CORE KORNELIE FRECH,* RUTH BRACK-WERNER,*,† and THOMAS WERNER*,1 provided by Elsevier - Publisher Connector

*Institut fu¨rSa¨ugetiergenetik and †Institut fu¨r Molekulare Virologie, GSF-Forschungszentrum fu¨r Umwelt und Gesundheit GmbH, Ingolsta¨dter Landstrasse 1, D-85758 Oberschleißheim, Germany Received April 17, 1996; accepted August 5, 1996

Retroviruses are expressed under the control of viral control regions designated long terminal repeats (LTRs), which contain all signals for transcriptional initiation as well as transcriptional termination. However, retroviral LTRs from different species within a common genus, such as Lentivirus, do not show significant overall sequence homology. We compiled a model of the functional organization of 20 Lentivirus LTRs which we show to recognize all known Lentivirus LTRs. To this end we combined our previously published methods for identification of transcription elements with secondary structure element analysis in a novel modular approach. We deduced descriptions for three new Lentivirus-specific sequence elements present in most of the Lentivirus LTRs but absent in LTRs of other families (B, C, D-type, BLV-HTLV, Spuma). Four of the 10 elements defined in our study were primate-specific. We were able to deduce a phylogeny based on our model which agrees in general with the phylogeny derived from the polymerase genes of these . Our model indicated that more than 100 LTRs from the databases are of Lentivirus origin and can be clearly separated from all other LTR types (B, C, D, BLV-HTLV, Spuma). This selectivity appears to be a unique feature of our modular approach. ᭧ 1996 Academic Press, Inc.

INTRODUCTION which showed a clear spatial organization. This informa- tion included three newly defined sequence elements are first described in sheep and was compiled into a unique model which appeared in the form of Maedi–Visna (Narayan and Clements, to be suitable for LTR classification and phylogenetic 1989). They share several features that set them apart analysis. from other retroviruses like -encoded regulatory pro- teins, multiply spliced mRNAs, and slow development of disease (Narayan and Clements, 1989; Joag et al., 1996). MATERIALS AND METHODS Lentiviruses became prominent when the human and Data sets primate viruses HIV-1, HIV-2, and several forms of SIV were detected between 10 and 15 years ago (for review The training set of Lentivirus LTR sequences has been see Levy, 1994). Several studies have shown that the selected from the 1994 Release of the Human Retrovi- long terminal repeat (LTR) structures and their regulation ruses and AIDS database (Myers et al., 1994). Accession are of particular interest for HIV expression (Gaynor, numbers are indicated by ‘‘AC:’’ after the sequence identi- 1992). More than 40 potential and often experimentally fier. Our data set contains at least one LTR sequence verified binding sites for transcription factors have been from each subgroup of the primate lentiviruses and all found in Lentivirus LTRs. However, Lentivirus LTRs seem nonprimate Lentivirus LTRs. to have only a few features in common such as the TAR The primate data set consists of: HIV-1, human immu- region or some salient enhancer elements like NF-kBor nodeficiency virus type 1: hivmvp, AC: L20571 (Gu¨rtler et SP-1 (Perkins et al., 1993; Cullen and Garrett, 1992). In al., 1994), hivlai, AC: K02013 (Wain-Hobson et al., 1985); addition, Lentivirus LTRs also show low overall sequence HIV-2, human immunodeficiency virus type 2: hiv2ben, similarity when nonprimate lentiviruses are included. AC: M30502 (Kirchhoff et al., 1990); SIV, simian immuno- Thus, it is not clear which features are important for deficiency viruses found in chimpanzee: sivcpz, AC: general function of Lentivirus LTRs and whether Lentivi- X52154 (Huet et al., 1990), macaque: sivmne, AC: rus LTRs can be distinguished as a group from other M32741, sivmm251, AC: M19499, (Kestler et al., 1988), LTRs. Therefore, it was our objective to deduce a com- and mangabey: resivsmm, AC: X14307 (Hirsch et al., mon functional organization of Lentivirus LTRs which is 1989); MND, simian immunodeficiency viruses found in not based on comparison of total nucleotide sequences. mandrills: sivcps, AC: X15781 (Tsujimoto et al., 1989); We were able to define 10 Lentivirus-specific elements AGM, simian immunodeficiency viruses found in African green monkey: sivltr, AC: D10702 (Tomonaga et al., 1993), 1 To whom correspondence should be addressed. Fax: x89-3187- sivagm90, AC: M33718 (Johnson et al., 1990), sivsab1, 4400. E-mail: [email protected]. AC: U04005 (Jin et al., 1994); SYK, simian immunodefi-

0042-6822/96 $18.00 256 Copyright ᭧ 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.

AID VY 8162 / 6a1f$$$701 09-04-96 20:14:40 vira AP: Virology LENTIVIRUS LTR MODEL 257 ciency virus from Syke’s monkey: sivsyk, AC: L06042 steps by maximization of the consensus index of regions (Hirsch et al., 1993). around the tuple. Finally, CoreSearch creates the same The nonprimate data set consists of: FIV, feline immu- consensus descriptions as ConsInd. nodeficiency virus: fivltr, AC: J04541 (Olmstead et al., 1989); EIAV, virus: eia5ltr, AC: Definition and identification of complex elements M87594 (Perry et al., 1992); CAEV, caprine arthritis–en- cephalitis virus: ceavltr, AC: M14149 (Hess et al., 1986); For definition of complex elements like LTRs, a new BIV, bovine immunodeficiency virus: bimpevpp, AC: method is used which combines the consensus ele- L04972 (Carpenter et al., 1993); VLV, Visna Lentivirus: ments defined by the above-mentioned methods and vlvcg, AC: M10608 (Sonigo et al., 1985); OLV, ovine Len- other nonconsensus elements (e.g., direct repeats or tivirus: olvcg, AC: M31646 (Querat et al., 1990), olvtransac hairpin structures) to a model of the functional organiza- AC: L19199 (Campbell et al., 1993); PLV, puma Lentivirus: tion of such units. This complex model is composed of pumaltr, AC: U03982 (Langley et al., 1994). determining and nondetermining elements. Determining More than one representative of a subgroup was se- elements are elements with low probability of random lected, if the overall sequence similarity between the occurrence which are present in almost all sequences; LTRs within this subgroup was lower than the similarity nondetermining elements are either present in a subset between sequences of different subgroups (determined of the sequences only or have a higher probability of by the GCG program Gap). The resulting training set of random occurrence. For the latter reason analysis of non- Lentivirus LTRs consisted of 12 primate and 8 nonpri- determining elements is restricted to regions between mate LTR sequences. The test sets of other LTR se- determining elements. The relative frequency and aver- quences were selected from Release 45.0 of the EMBL age score of each element and distances distributions data library and Release 92.0 of GenBank. These LTR between the elements are part of the model. sequences were derived from the following genera of After definition of the functional model, new sequences the family of retroviridae: mammalian type B retroviruses can be scanned for matches to the described unit. If the (MMTV, HERV-K), mammalian type C retroviruses [en- sum of scores of determining elements exceeds a given dogenous retroviruses (human, mouse, baboon, African threshold (set as a parameter), nondetermining elements green monkey), murine and feline leukemia viruses], type are identified and the sum of the scores of all elements D retroviruses (Simian Mason–Pfizer retrovirus, SRV-1, is calculated. This total element score and the score for SRV-2), Spumavirus retroviruses (human, bovine, simian), element distances are then used to evaluate the fit of and BLV-HTLV retroviruses (human T-cell leukemia vi- complex elements to the model. Score thresholds are ruses I and II, BLV). expressed as a percentage of the mean scores of the model. A detailed description of the algorithm and the programs ModelGenerator and ModelInspector is be- Definition of consensus elements yond the scope of this paper and will be published else- For definition of consensus elements (e.g., transcrip- where (Frech et al., in preparation). tion factor binding sites) present in Lentivirus LTR se- quences several computer methods were used: ConsInd Phylogenetic analysis (Frech et al., 1993), MatInd (Quandt et al., 1995), and CoreSearch (Wolfertstetter et al., 1996). Most of these Each sequence of the sequence set S Å {S1 , S2 ,..., programs are available at http://www.gsf.de/biodv. Sn} on which the model is based can be represented as ConsInd constructs a description of a consensus from a vector of normalized element scores Ei Å (ei1 , ei2 ,..., a set of sequences containing a known binding site by eim) with employing a weight-corrected alignment of the se- kj quences and consecutive evaluation of the alignment. ∑ The resulting consensus description yields the degree eij Å wij/wV j wV j Å wij/kj , of conservation of each binding site position (consensus iÅ1 index, Ci ) and the extension of the consensus element. where wij are the scores assigned to the detected ele- MatInd is a simplified and faster version of ConsInd ments by the corresponding method (e.g., Ci-score of which is based on weight matrices and performs no de- ConsInspector or matrix similarity of MatInspector); wV j limitation of the binding site. However, it can be used if represents the mean score of element j; kj is the number input sequences are restricted to the binding sites. The of sequences in which element j occurs (kj £ n); n repre- program CoreSearch is used to detect unknown consen- sents the number of sequences and m the total number sus elements in a set of unaligned sequences. It starts of elements. with a search for n-tuples which occur in all or almost These element score vectors are used to create a all of the sequences. Since this initial search defines matrix D of pairwise distances within the group of se- more than one tuple, tuples are selected in sequential quences,

AID VY 8162 / 6a1f$$$702 09-04-96 20:14:40 vira AP: Virology 258 FRECH, BRACK-WERNER, AND WERNER

FIG. 1. The elements (position and scoring) for each of the 20 LTRs on which the Lentivirus LTR model is based. Determining elements (TATA upstream element, Bel-1 similar region, TATA-box, poly(A) signal, poly(A) downstream element) are boxed with heavy borders, nondetermining elements are boxed with thinner borders. Numbers within boxes designate the position of the element (core) within the LTR and the scores (in parentheses). Four nondetermining elements present only in primate sequences are marked by gray bars. The graphical representation shows that the Lentivirus LTR model is composed of consensus elements (black boxes defined by ConsInd, white boxes defined by MatInd) and hairpins (secondary structural elements). The relative frequency of the elements in all sequences analyzed is shown below the graphical representation (0.60 Å element present in 60% of all sequences).

d11 d12 — d1n types based on the analysis of a set of 20 Lentivirus LTR

d21 d22 — d2n sequences (Fig. 1). Some of these elements were known D Å ————, to be present in all LTRs (e.g., TATA box) or in primate ———— LTRs (e.g., SP-1). Six elements were either newly defined

dn1 dn2 —dnn or a Lentivirus-specific consensus was deduced in this study (Table 1). The LTR model recognized about 100 with d Å 0, d Å d , and d is the Euclidian distance of ii ij ji ij primate and nonprimate Lentivirus LTRs; all other LTR vectors E and E . i j types were rejected (Fig. 2). For phylogenetic analysis of this distance data we use the split decomposition method by Bandelt and Dress Definition of consensus elements contained in (1992). It constructs the global phylogenetic splits which Lentivirus LTRs separate one sequence group from all others. The collec- tion of splits is displayed as a splits-graph, indicating The TATA box and the poly(A) signal are known canon- how the different sequences are related to each other ical signals in Lentivirus LTRs. Lentivirus-specific de- (see Fig. 3). scriptions for these sites were generated with the pro- grams ConsInd (Frech et al., 1993) and MatInd (Quandt et RESULTS al., 1995). These element descriptions are clearly distinct We deduced a model of the functional organization of from the elements with similar function in C-type LTRs Lentivirus LTRs consisting of 10 elements of different (Frech et al., in preparation). Descriptions for NF-kB and

AID VY 8162 / 6a1f$$$702 09-04-96 20:14:40 vira AP: Virology LENTIVIRUS LTR MODEL 259

TABLE 1 Summary of Elements Present in Lentiviral LTR Sequences

Specificity in Identified Element Consensus lentiviral LTRs in lentiviral LTRs Reference

Primate element block WGATTGGM »13 bp… Primates HIV-1 Yang and Engel (1993); Gaynor (1992); AGGRCCAGGA »3bp… Cooney (1991); this study AGATACCC »7bp… TTTGGATK TATA upstream element CTAACCGCAAG All except HIV-1,2 VISNA Gabuzda et al. (1989); this study

and SIVcpz

Bel-1 similar region ANCWGCTGACRCW All except SIVsyk HIV-1 (Bel-1) Keller et al. (1992); this study EIAV, FIV, Puma

NF-kB GGGACTTTCCA Primates Cullen and Garrett (1992) SP1 GGGGCGGGGC Primates Cullen and Garrett (1992) TATA box TATAWAAGCAGCTGCTT All (lentivirus specific This study consensus) TAR region — All Garcia et al. (1989) Poly(A) hairpin — Primates Berkhout et al. (1995) Poly(A) signal AATAAAG All (lentivirus specific This study consensus)

Poly(A) downstream STGTGTNYYCWT All (lentivirus specific This study element consensus)

SP-1 were already available from the ConsInspector li- foamy virus Bel-1 protein and is present in most of the brary. Groups containing primate, nonprimate, and all LTRs (see Fig. 1 and Table 2c for details). A variety of Lentivirus LTR sequences were analyzed with the pro- other transcription factor binding sites like AP-1 (HIV-1: gram CoreSearch (Wolfertstetter et al., 1996) to reveal Lu et al., 1989; BIV: Fong et al., 1995; VISNA: Sargan et unknown conserved sequence motifs. Only motifs that al., 1995), AP-2 (HIV-1: Perkins et al., 1994), or AP-4 (BIV: were clearly restricted to one of three initial regions (up- Fong et al., 1995; VISNA: Sargan et al., 1995) have been stream of the TATA box, between TATA box and poly(A) reported to be functional in Lentivirus LTRs. Our study signal, and downstream of the poly(A) signal) in most of showed that these elements were not part of the clearly the sequences were incorporated into the model. This defined core organization of Lentivirus LTRs, despite method led to identification of a multicomponent element their functionality in individual LTR sequences. consisting of four binding sites (designated primate ele- ment block) in primate sequences (Fig. 1, Table 2a). Bind- Complete Lentivirus LTR description ing of cellular factors to these four binding sites in HIV- 1 LTRs (h-Gata-3: Yang and Engel, 1993; COUP: Gaynor The eight identified consensus elements and two hair- 1992; h-Gata-3, COUP-chick: Cooney 1991) has been pin elements, the TAR region and a previously described shown by different experimental approaches. No consen- hairpin structure for the poly(A) signal (Berkhout et al., sus elements specific for nonprimate Lentivirus LTR se- 1995, redefined by our program) were combined to estab- quences were found. CoreSearch analysis of all Lentivi- lish a complex model for Lentivirus LTRs. Five determin- rus sequences identified two additional common ele- ing elements were identified which were present in al- ments. The first element is present in all Lentivirus LTRs most all sequences (TATA upstream element, Bel-1 simi- except HIV-1, HIV-2, and the HIV-1 related SIVCPZ located lar region, TATA box, poly(A) signal, poly(A) downstream about 100 bp upstream of the TATA box and was desig- element). Five nondetermining elements were located nated TATA upstream element (Table 2b). This site has (primate element block, NF-kB, SP-1, TAR region, poly(A) been shown to be protected in DNase I footprinting ex- hairpin), which by definition occurred in the regions be- periments for the Visna virus LTR, indicating that it is a tween the determining elements (for definition see Mate- true binding site in vitro (Gabuzda et al., 1989). The sec- rials and Methods). The model moreover included the ond element (Bel-1 similar region) comprises parts of the relative frequency and average score of each element Bel-1 (bel: between env and LTR) element found in HIV- and the distance distributions between the elements. A 1 (Keller et al., 1992) which can be transactivated by the graphical representation of the LTR model is shown in

AID VY 8162 / 6a1f$$$703 09-04-96 20:14:40 vira AP: Virology 260 FRECH, BRACK-WERNER, AND WERNER

FIG. 2. Specificity of Lentivirus LTR model. The total element scores determined by ModelInspector with the Lentivirus LTR model for different LTR types. For each set of sequences the minimum, maximum, and mean scores are given. Element scores marked by black bars are reached with default parameter settings (element score threshold for determining elements, 50%; element score threshold for all elements, 50%; distance score threshold, 30%). Element scores indicated by gray bars were obtained with lower thresholds.

Fig. 1. The model identified both primate and nonprimate between primate and nonprimate Lentivirus LTRs (Table Lentivirus LTR sequences. Primate and nonprimate se- 3) are not clearly separated. For example the mean over- quences can be distinguished by the presence or ab- all similarity of sivmne with nonprimate Lentivirus LTRs sence of the primate-specific elements (primate element (39.2%) is about the same as the similarity with C-type block, SP-1, NF-kB, poly(A) hairpin) which leads to higher LTRs (37.9%). We also constructed a complete phyloge- element scores for primate LTRs. Non-Lentivirus LTR se- netic tree employing all pairwise alignments of the 20 quences were compared with the Lentivirus LTR descrip- Lentivirus LTRs and the MoMuLV LTR (a typical C-type tion by the newly developed program ModelInspector retrovirus). This analysis did not separate the MoMuLV (Frech et al., in preparation). When the default thresholds LTR from the Lentivirus LTRs, demonstrating the lack of for element and distance scores (provided by Model discernible overall sequence similarities (data not Inspector) were used, all non-Lentivirus LTRs were re- shown). Most notably, primate and nonprimate LTRs are jected (black bars on the right side in Fig. 2). Lowering not separated clearly despite the differences in nucleo- thresholds for the score of determining elements and the tide similarity (Table 3). The SplitsTree analysis that was distance score led to the detection of single elements based on element score distances (see Materials and in some LTR sequences, but the total element scores Methods) shown in Fig. 3b clearly distinguishes between achieved for all non-Lentivirus LTRs were well below the primate and nonprimate LTRs and agrees in general with lowest score obtained for a Lentivirus LTR (gray bars in the SplitsTree analysis based on polymerase sequence Fig. 2). This indicates that this Lentivirus LTR model is alignment (Fig. 3c). specific enough to distinguish Lentivirus LTRs from other The element-based phylogenetic order contains one LTR types. remarkable difference to that obtained by alignment of polymerase genes: HIV-2 was placed in the HIV group Phylogenetic analysis rather than in the SIV group (Fig. 3c). Restriction of the Alignment-based phylogenetic analysis by the Splits- analysis to primate LTRs clearly showed this assignment Tree program (Bandelt and Dress, 1992) did not yield a to be significant, although the split between the HIV significant phylogeny but rather resulted in a star-like group and the SIV groups appears small in the analysis structure (Fig. 3a). This is not surprising since the overall of all Lentivirus LTRs (data not shown). Closer relat- sequence similarities between different LTR types and edness of HIV-2 LTR to HIV-1 than to SIV LTRs is due

AID VY 8162 / 6a1f$$$703 09-04-96 20:14:40 vira AP: Virology LENTIVIRUS LTR MODEL 261

TABLE 2 Alignment of Lentivirus LTR Sequences to Consensus Elements (Consensus Sequences Are Given in IUPAC-Ambiguity Code; Matching Positions Are Printed in Uppercase Letters and Nonmatching Positions in Lowercase Letters)

Sequence name Position Sequence

(a) Primate element block consensus WGATTGGM »13 bp… AGGRCCAGGA »3bp… AGATACCC »7bp… TTTGGATK sivcpz 102 TGAcTGGC AGGACCAGGA AGATtCCC TTTGGATG hivmvp 102 TGATTGGC gGGACCAGGA AGATtCCC TTTGGATG hivlai 102 TGATTGGC AGGGCCAGGg AGATAtCC TTTGGATG hiv2ben 102 AGATTGGC tGGGCCAGGA AGgTACCC TTcGGgTG resivsmm 102 AGATTGGC AGGGCCAGGA AGATACCC TaTGGgTG sivmne 102 AGATTGGC gGGACCAGGA AGATACCC TTTGGcTG sivmm251 102 AGATTGGC AGGACCAGGA AGATACCC TTTGGcTG sivltr 98 TGATTGGA AGGGCCAGGA AGATACCC TTcGGcTT sivagm90 102 TGATTGGA AGGACCAGGc AGATAtCC TTTGGcTT isvsab1 102 TGgaTGGC tGGGCCAGGt AGgTACCC TTTGGATG sivcps 102 TGgaTGGC AGGtCCAGGA AGATAtCC TTTGGATT sivsyk 99 AGAcaGGA AGGGCCAGGA cGgTACCC TTTGGATG

(b) TATA upstream element consensus CTAACCGCAAG resivsmm 357 CTAACCGCAAG sivmne 357 CTAACCGCAAG sivmm215 356 CTAACCGCAAG sivltr 343 CTAACCGCAAG sivagm90 346 CTAACCGCAgG sivsab1 329 CTAACCaCAAa sivcps 363 CTAACCGCAAa sivsyk 317 aaAACCaCAAG ceavltr 225 gTAACCGCAAG vlvcg 151 gTAACCGCAAG olvcg 161 gTAACCGCAAG olvtransac 143 gTAACCGCAAG bimpevpp 81 aTAACC.CAga eia5ltr 77 aTAACCcaAAG fivltr 143 CTAACCGCAAa pumaltr 107 tTAACCGCAAa

(c) Bel-1 similar region consensus ANCWGCTGACRCW sivcpz 359 AACTGCTGACtCT hivmvp 358 gACTGCTGACACT hivlai 328 AACTGCTGACAtc hiv2ben 438 AACAGCTGAgGCT resivsmm 402 ACaAGCTGAgACA sivmne 403 ACtAGCTGAgACA sivmm251 402 ACtcGCTGAgAtA sivltr 393 AACTGCTGACAag sivagm90 398 AACTGCTGACA.. sivsab1 382 AACTGCTGACtgA sivcps 413 ACCTGCTtA.GCA ceavltr 265 ATCAGCTGAtGCT vlvcg 192 AGCAGCTGAtGCT olvcg 202 ATCAGCTGAtGCT olvtransac 183 AGCAGCTGAtGCT bimpevpp 293 AGCTGCTTACGCT

AID VY 8162 / 6a1f$$8162 09-04-96 20:14:40 vira AP: Virology 262 FRECH, BRACK-WERNER, AND WERNER

TABLE 2—Continued

Sequence name Position Sequence

(d) NF-kB consensus GGGACTTTCCA sivcpz 369 GGGACTTTCCA 381 GGGACTTTCCg hivmvp 342 GGGACTTTCCA 369 GGGACTTTCCA hivlai 349 GGGACTTTCCg 363 GGGACTTTCCA hiv2ben 449 GGGACTTTCCA resivsmm 413 GGGACTTTCCA sivmne 414 GGGACTTTCCA sivmm251 413 GGGACTTTCCA sivltr 412 GGGACTTTCCA sivagm90 405 GGGACTTTCCA 417 GGGACTTTCCA sivsab1 392 GGGACTTTCCA sivcps 448 GGGACTTTCCA

(e) SP-1 consensus GGGGCGGGGC sivcpz 394 GGGGaGaGGt 407 cGGGaGGcGg 410 GaGGCGGGGt 421 GGGGtGtGGC hivmvp 386 tGGGaGGGat 398 GGGGCGGt.t 408 GGGGaGtGGC hivlai 376 aGGGaGGcGt 390 tGGGCGGGaC 401 GGGGaGtGGC hiv2ben 479 aGGGaGGGaC 490 tGGGaGGaGC resivsmm 440 tGGGaGGtaC 451 GGGGaGGaGC sivmne 442 GGGGaGGtaC 453 GGGGaGGaGC sivmm251 441 GGGGaGGaGC sivltr 442 cGGGaGtGGC sivagm90 430 aGGGCGGGcC 441 tGGGCGGtaC 451 GGGGaGtGGt sivsab1 405 aGGGtGGaGa 416 tGGGCGGtaC 426 tGGGaGtGGC sivcps 465 aGGGaGGGGg 471 GGGGaGGctC sivsyk 393 GGGGaGGaGC 404 tGGGCGGGGg

(f) TATA box consensus TATAWAAG.CAGCTGCTT sivcps 445 cATATAAG.CAGCTGCTT hivmvp 432 cATATAAG.CAGCTGCTT hivlai 425 cATATAAG.CAGCTGCTT hiv2ben 525 TATAAAtG.tAcCcGCTT resivsmm 485 TATAAAta.CAaCTGCaT sivmne 488 TATAAAtatCA.CTGCaT

AID VY 8162 / 6a1f$$8162 09-04-96 20:14:40 vira AP: Virology LENTIVIRUS LTR MODEL 263

TABLE 2—Continued

Sequence name Position Sequence sivmm251 476 TATAAAtatCA.CTGCaT sivltr 466 cATAAAAG.CAGaTGCTc sivagm90 475 cATAAAAG.CAGaTGCTc sivsab1 450 aATATAAG.CAGCTGCTT sivcps 491 TATATAAG.CAGCTGCaT sivsyk 428 cATAAAAG.CcGaTGCTc ceavltr 304 TATATAAG.gAGaaGCTT vlvcg 233 TATATAAG.CcGCTtgcT olvcg 243 TATAAAAG.CtGCTtgcT olvtransac 224 TATATAAG.CtGCTtgcT eia5ltr 183 TATATAAG..tGCTtgTa fivltr 208 TATATAA..CcagTGCTT pumaltr 162 TtTAAAAGcCtGCaaCTT

(g) Poly(A) signal consensus AATAAAG sivcpz 545 AATAAAG hivmvp 533 AATAAAG hivlai 526 AATAAAG hiv2ben 706 AATAAAG resivsmm 667 AATAAAG sivmne 670 AATAAAG sivmm251 658 AATAAAG sivltr 592 AATAAAt sivagm90 602 AATAAAc sivsab1 664 AATAAAG sivcps 669 AATAAAG sivsyk 613 AATAAAG ceavltr 392 AATAAAG vlvcg 326 AATAAAG olvcg 327 AATAAAG olvtransac 321 AATAAAG bimpevpp 470 AATAAAG eia5ltr 266 AATAAAt fivltr 289 AATAAAt pumaltr 226 AATAAAG

(h) Poly(A) downstream element consensus STGTGTNYYCWT sivcpz 573 GTGTGTGCCtAT hivmvp 558 GTGTGTGCTCAT hivlai 553 GTGTGTGCCCgT hiv2ben 735 GTGTGTTCCCAT resivsmm 695 GTGTGTTCCCAT sivmne 696 GTGTGTGCTCcc sivmm251 687 GTGTGTTCCCAT sivltr 671 CTGgGTTCTCTc sivagm90 681 CTGgGTTCTCTc sivsab1 696 GTGTGTGCCCAT sivcps 735 tTGTGTTCTCAT ceavltr 425 GTGatTACTgAg vlvcg 355 CTGgtTATTaTc olvcg 356 CTGgtTATTaTc olvtransac 350 CTGgtTATTaTc bimpevpp 505 CTGTcTTgTCTT eia5ltr 290 CTGTcTTTagTT fivltr 320 CTGTGTAaTCTT pumaltr 259 CTGaGTTTTaTg

AID VY 8162 / 6a1f$$8162 09-04-96 20:14:40 vira AP: Virology 264 FRECH, BRACK-WERNER, AND WERNER

FIG. 3. Visualization of Lentivirus sequence distances with SplitsTree. Sequences are referred to by their name and the subgroup to which they belong. (a) Splits-graph of distances of Lentivirus LTR sequences based on overall sequence alignment (distances created by the GCG programs PileUp and Distances). (b) Splits-graph of distances of Lentivirus polymerase genes based on overall sequence alignment (distances created by the GCG programs PileUp and Distances based on the nucleotide sequences). (c) Splits-graph of distances of Lentivirus LTR element score vectors (determination of element distances is detailed under Materials and Methods). mainly to absence of the TATA upstream element in HIV- MuLV LTR, demonstrating the discriminative power of 1 and HIV-2 which is usually present in SIV LTRs (Fig. the approach. 1). The TATA upstream element was not present in the The model is based on occurrence of elements at 37 HIV-1 LTRs and was only found in one of the 13 HIV- distinct locations within the functional framework of Len- 2 LTRs analyzed. This HIV-2 virus isolate (HIV2FO784) is tivirus LTRs. A very simple basic model including only a highly divergent variant which is closer to SIVs than to two such elements (TATA box and poly(A) signal) was other HIV-2 isolates in all structural genes (Gao et al., sufficient to initiate the construction of the complex 1992). In contrast to the human immunodeficiency vi- model. Three new common consensus elements (pri- ruses 29 of 37 SIV LTRs analyzed contained the TATA mate block, Bel-1 similar region, TATA upstream ele- upstream element. ment) were identified solely by computer analysis. Three other elements were redefined (TAR, poly(A) downstream DISCUSSION element, and poly(A) hairpin) and also incorporated into the model. TAR was also identified by the program Mod- We identified a common modular structure for Lentivi- elGenerator independent of previous definitions. The rus LTRs based on 10 individual sequence elements, Lentivirus LTR model presented here is more than a com- resulting in the most general and complete compilation puterized representation of published results since it led published so far. Six of these elements have already to the definition of new sequence elements common in been shown to be involved in transcriptional regulation Lentivirus LTRs. of HIV (Gaynor 1992; Berkhout et al., 1995) and one of the We used 20 representatives of known lentiviruses for newly defined common elements binds cellular factors in the generation of the LTR model. However, we also used footprinting studies with the VISNA LTR (Gabudzda et al., all possible subsets of 19 LTRs to generate test models 1989). The model is capable of recognizing all known and all of them recognized all 20 LTRs (hold-one-out Lentivirus LTRs (more than 100) and clearly discriminat- strategy). This demonstrated that the presence of a par- ing these from all non-Lentivirus LTRs. Significant overall ticular LTR or a closely related sequence in the initial sequence similarity is not required for the identification training set is no prerequisite for recognition by our of an LTR, allowing the detection and classification of model in contrast to alignment-based approaches. The new LTRs that escape similarity based approaches. unique ability of our model to recognize all known Lentivi- Alignment-based phylogenetic analysis placed MoMuLV rus LTRs (more than 100) allows classification of LTRs LTR within the group of Lentivirus LTRs. In contrast, our by distinct LTR types (we also generated a C-type model, model clearly recognized all primate and nonprimate Frech et al., in preparation). Detection of additional ele- Lentivirus LTRs and completely rejected the C-type Mo- ments which are characteristic for Lentivirus LTRs should

AID VY 8162 / 6a1f$$$703 09-04-96 20:14:40 vira AP: Virology LENTIVIRUS LTR MODEL 265

FIG. 2—Continued facilitate direct experimental verification of their biologi- functional units rather than on direct sequence similarity cal function. to be a more correct representation of the evolution of The phylogenetic analysis based on our element defi- regulatory sequences. The grouping may reflect func- nitions allowed for the first time a meaningful analysis tional similarities with respect to transcriptional control. of functionally related sequences which show too low However, experimental evidence supporting this hypoth- overall similarity for conventional alignment-based ap- esis is not yet available. proaches. We believe phylogenetic analysis based on Closer relatedness of HIV-2 LTR with the LTR of HIV-

AID VY 8162 / 6a1f$$$703 09-04-96 20:14:40 vira AP: Virology 266 FRECH, BRACK-WERNER, AND WERNER

TABLE 3 REFERENCES Sequence Similarities among LTR Sequences Bandelt, H.-J., and Dress, A. (1992). Split decomposition: A new and useful approach to phylogenetic analysis of distance data. Mol. Phy- sivmne MoMuLV logenet. Evol. 1, 242–252. All Lentivirus LTRs 53.7% 37.8% Berkhout, B., Klaver, B. E. P., and Das, A. T. (1995). A conserved hairpin Primate Lentivirus LTRs 63.3% 38.1% structure predicted for the poly(A) signal of human and simian immu- Nonprimate Lentivirus LTRs 39.2% 37.4% nodeficiency viruses. Virology 207, 276–281. Mammalian type C LTRs 37.9% 53.9% Campbell, B. J., Thompson, D. R., Williams, J. R., Campbell, S. G., and Avery, R. J. (1993). Characterization of a New York ovine Lentivirus Note. The values are pairwise sequence similarities determined by isolate. J. Gen. Virol. 74, 201–210. the GCG program Gap (default parameters) averaged over all LTR Carpenter, S., Nadin-Davis, S. A., Wannemuehler, Y., and Roth, J. A. sequences in a group. (1993). Identification of transactivation-response sequences in the long terminal repeat of bovine Immunodeficiency-like virus. J. Virol. 67, 4399–4403. Cooney, A. J., Tsai, S. Y., Omalley, B. W., and Tsai, M. J. (1991). Chicken 1 rather than that of SIV may reflect adaptation of the ovalbumin upstream promoter transcription factor binds to a negative regulatory region in the human immunodeficiency virus type-1 long human immunodeficiency viruses to the human tran- terminal repeat. J. Virol. 65, 2853–2860. scription machinery. This is reflected in the common LTR Cullen, B. R., and Garrett, E. D. (1992). A comparison of regulatory organization we found and it is tempting to speculate features in primate Lentiviruses. AIDS Res. Human Retroviruses 8, that HIV-2 is a virus better adapted to the human host 387–393. than to SIV. The TATA upstream element appears to be Fong, S. E., Pallansch, L. A., Mikovits, J. A., Lackman-Smith, C. S., Ruscetti, F. W., and Gonda, M. S. (1995). cis-acting regulatory ele- ubiquitous for nonprimate LTRs, is still present in the ments in the bovine immunodeficiency virus long terminal repeat. majority of SIV LTRs, and is completely absent in human Virology 209, 604–614. and chimpanzee Lentivirus LTRs. This may indicate that Frech, K., Herrmann, G., and Werner, T. (1993). Computer-assisted pre- the element became lost during the adaption of the Len- diction, classification, and delimitation of protein binding sites in tivirus to human host organisms. The separation of CAEV nucleic acids. Nucleic Acids Res. 21, 1655–1664. Gabuzda, D. H., Hess, J. L., Small, J. A., and Clements, J. E. (1989). from OLV and VISNA is mainly caused by the presence Regulation of the Visna virus long terminal repeat in macrophages of the poly(A) hairpin in CAEV which is not contained involves cellular factors that bind sequences containing AP-1 sites. in the other nonprimate lentiviruses. Although this split Mol. Cell. Biol. 9, 2728–2733. appears to be more dramatic than the split of HIV-2 from Gao, F., Yue, L., White, A. T., Pappas, P. G., Barchue, J., Hanson, A. P., the SIVs we regard the difference in the HIV group to be Greene, B. M., Sharp, P. M., Shaw, G. M., and Hahn, B. H. (1992). Human infection by genetically-diverse SIVsm-related HIV-2 in West more significant because it was deduced from several Africa. Nature 358, 495–499. sequences rather than a single sequence as in the case Garcia, J. A., Harrich, D., Soultanakis, E., Wu, D., Mitsuyasu, R., and of CAEV. The principle of our analysis is by no means Gaynor, R. B. (1989). Human immunodeficiency virus type 1 LTR TATA confined to retroviral LTRs. Recently, we have shown that and TAR region sequences required for transcriptional regulation. even a set of only two or three elements can be sufficient EMBO J. 8, 765–778. Gaynor, R. (1992). Cellular transcription factors involved in the regula- to assign promoters of genes of glycolytic enzymes to a tion of HIV-1 gene expression. AIDS 6, 347–363. single group from a genomic sequence background of Gu¨rtler, L. G., Hauser, P. H., Eberle, J., von Brunn, A., Knapp, S., Zekeng, four complete yeast chromosomes (Quandt et al., 1996). L., Tsague, J. M., and Kaptue, L. (1994). A new subtype of human Any set of sequences with a detectable organization of immunodeficiency virus type 1 (Mvp-5180) from Cameroon. J. Virol. functional elements can be compiled into a similar 68, 1581–1585. Hess, J. L., Pyper, J. M., and Clements, J. E. (1986). Nucleotide sequence model. This suggests that our approach is well suited to and transcriptional activity of the caprine arthritis-encephalitis virus define and localize complex functional units in genomic long terminal repeat. J. Virol. 60, 385–393. sequences, even if only very simple models can be es- Hirsch, V. M., Olmsted, R. A., Murphy-Corb, M., Purcell, R. H., and tablished initially. The results obtained from the Lentivi- Johnson, P. R. (1989). An African primate Lentivirus (SIVsm) closely rus LTR model show that this type of sequence analysis related to HIV-2. Nature 339, 389–392. Hirsch, V. M., Dapolito, G. A., Goldstein, S., McClure, H., Emau, P., Fultz, can help to define a characteristic framework as demon- P. N., Isahakia, M., Lenroot, R., Myers, G., and Johnson, P. R. (1993). strated by the detection of the 10 most common se- A distinct african Lentivirus from Sykes’ monkeys. J. Virol. 67, 1517– quence motifs within the LTRs. 1528. Huet, T., Cheynier, R., Meyerhans, A., Roelants, G., and Wain-Hobson, S. (1990). Genetic organization of a chimpanzee Lentivirus related to HIV-1. Nature 345, 356–359. ACKNOWLEDGMENTS Jin, M. J., Hui, H., Robertson, D. L., Muller, M. C., Barre-Sinoussi, F., Hirsch, V. M., Allan, J. S., Shaw, G. M., Sharp, P. M., and Hahn, We thank Kerstin Quandt for critically reading the manuscript. This B. H. (1994). Mosaic genome structure of simian immunodeficiency work was supported in part by the BMBF Verbundprojekt GENUS 413- virus from west African green monkeys. EMBO J. 13, 2935–2947. 4001-01 IB 306 D (Fo¨rderschwerpunkt Bioinformatik) and in part by EU Joag, S. V., Stephens, E. B., and Narayan, O. (1996). Lentiviruses. In Grant BI04-CT95-0226 (TRADAT). Fields Virology (B. N. Fields, D. M. Knipe, P. M. Howley, R. M. Cha-

AID VY 8162 / 6a1f$$$704 09-04-96 20:14:40 vira AP: Virology LENTIVIRUS LTR MODEL 267

nock, J. L. Melnick, T. P. Monath, B. Roizman, and S. E. Straus, Eds.), kappaB and SP1 is required for HIV-1 enhancer activation. EMBO J. 3rd ed., Lippcott–Raven, Philadelphia. 9, 3551–3558. Johnson, P. R., Fomsgaard, A., Allan, J. S., Gravell, M., London, W. T., Perkins, N. D., Agranoff, A. B., Duckett, C. S., and Nabel, G. J. (1994). Olmstead, R. A., and Hirsch, V. M. (1990). Simian immunodeficiency Transcription factor AP-2 regulates human immunodeficiency virus viruses from African green monkeys display unusual genetic diver- type 1 gene expression. J. Virol. 68, 6820–6823. sity. J. Virol. 64, 1086–1092. Perry, S. T., Flaherty, M. T., Kelley, M. J., Clabough, D. L., Tronick, S. R., Coggins, L., Whetter, L., Lengel, C. R., and Fuller, F. (1992). The Keller, A., Garrett, E. D., and Cullen, B. R. (1992). The bel-1 protein of surface envelope protein gene region of equine infectious anemia activates human immunodeficiency virus type-1 virus is not an important determinant of tropism in vitro. J. Virol. 66, gene expression via a novel DNA target site. J. Virol. 66, 3946–3949. 4085–4097. Kestler, H. W., Li, Y., Naidu, Y. M., Butler, C. V., Ochs, M. F., Jaenel, G., Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. (1995). King, N. W., Daniel, M. D., and Desrosiers, R. C. (1988). Comparison MatInd and MatInspector—New fast and versatile tools for detec- of Simian immunodeficiency virus isolates. Nature 331, 619–622. tion of consensus matches in nucleotide sequence data. Nucleic Kirchhoff, F., Jentsch, K., Bachmann, B., Stuke, A., Laloux, C., Lueke, Acids Res. 23, 4878–4884. W., Stahl-Henning, C., Schneider, J., Nieselt, K., Eigen, M., and Huns- Quandt, K., Grote, K., and Werner, T. (1996). GenomeInspector: Basic mann, G. (1990). A novel proviral clone of HIV-2: Biological and phylo- software tools for analysis of spatial correlations between genomic genetic relationship to other primate immunodeficiency viruses. Vi- structures within megabase sequences. Genomics 33, 301–304. rology 177, 305–311. Querat, G., Audoly, G., and Sonigo, P. (1990). Nucleotide sequence Koken, S. E. C., Vanwamel, J. L. B., Goudsmit, J., Berkhout, B., and analysis of SA-OMVV, a visna-related ovine Lentivirus: Phylogenetic Geelen, J. L. M. C. (1992). Natural variants of the HIV-1 long terminal history of lentiviruses. Virology 175, 434–447. repeat—Analysis of promoters with duplicated DNA regulatory mo- Sargan, D. R., Sutton, K. A., Bennett, I. D., McConnell, I., and Harkiss, tifs. Virology 191, 968–972. G. D. (1995). Sequence and repeat structure variants in the long terminal repeat of maedi-visna virus EV1. Virology 208, 343–348. Langley, R. J., Hirsch, V. M., O’Brien, S. J., Adger-Johnson, D., Goeken, Sonigo, P., Alizon, M., Staskus, K., Klatzmann, D., Cole, S., Danos, O., R. M., and Olmsted, R. A. (1994). Nucleotide sequence analysis of Retzel, E., Tiollas, P., Haase, A., and Wain-Hobson, S. (1985). Nucleo- puma Lentivirus (PLV-14): Genomic organization and relationship to tide sequence of the Visna lentivirus: Relationship to the AIDS virus. other lentiviruses. Virology 202, 853–864. Cell 42, 369–382. Levy, J. A. (1994). HIV and the Pathogenesis of AIDS. American Society Tomonaga, K., Katahira, J., Fukasawa, M., Ahmed, H. M., Kawamura, of Microbiology Press, Washington, DC. M., Akari, H., Miura, T., Goto, T., Nakai, M., Suleman, M., et al. (1993). Lu, Y., Stenzel, M., Sodroski, J. G., and Haseltine, W. A. (1989). Effects Isolation and characterization of simian immunodeficiency virus from of long terminal repeat mutations on human immunodeficiency virus African white-crowned mangabey monkeys (Cercocebus torquatus type 1 replication. J. Virol. 63, 4115–4119. lunulatus). Arch. Virol. 129, 77–92. Myers, G., Wain-Hobson, S., Henderson, L. E., Korber, B., Jeang, K.-T., Tsujimoto, H., Hasegawa, A., Maki, N., Fukasawa, M., Miura, T., Speidel, and Pavlakis, G. N. (1994). Human Retroviruses and AIDS 1994. A S., Cooper, R. W., Moriyama, E. N., Gojobori, T., and Hayami, M. Compilation and Analysis of Nucleic Acid and Amino Acid Se- (1989). Sequence of a novel simian immunodeficiency virus from a quences. Database of the Los Alamos National Laboratory. wild-caught African mandrill. Nature 341, 539–541. Narayan, O., and Clements, J. E. (1989). Review article: Biology and Wain-Hobson, S., Sonigo, P., Danos, O., Cole, S., and Alizon, M. (1985). Nucleotide sequence of the AIDS virus, LAV. Cell 40, 9–17. pathogenesis of lentiviruses. J. Gen. Virol. 70, 1617–1639. Wolfertstetter, F., Frech, K., Herrmann, G., and Werner, T. (1996). Identifi- Olmstead, R. A., Barnes, A. K., Yamamoto, J. K., Hirsch, V. M., Purcell, cation of functional elements in unaligned nucleic acid sequences R. H., and Johnson, P. R. (1989). Molecular cloning of feline immuno- by a novel tuple search algorithm. Comp. Appl. Biosci. 12, 71–80. deficiency virus. Proc. Natl. Acad. Sci. USA 86, 2448–2452. Yang, Z. Y., and Engel, J. D. (1993). Human T-Cell transcription factor Perkins, N. D., Edwards, N. L., Duckett, C. S., Agranoff, A. B., Schmid, GATA-3 stimulates HIV-1 expression. Nucleic Acids Res. 21, 2831– R. M., and Nabel, G. J. (1993). A cooperative interaction between NF- 2836.

AID VY 8162 / 6a1f$$$704 09-04-96 20:14:40 vira AP: Virology