Recurrent of DNA-binding motifs in the Drosophila centromeric

Harmit S. Malik*, Danielle Vermaak*, and Steven Henikoff*†‡

†Howard Hughes Medical Institute, *Basic Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109

Communicated by Stanley M. Gartler, University of Washington, Seattle, WA, December 12, 2001 (received for review October 29, 2001) All contain centromere-specific variants being one of the most evolutionarily constrained classes of (CenH3s), which replace H3 in centromeric . We have eukaryotic proteins. Second, CenH3s are evolving adaptively, previously documented the adaptive evolution of the Drosophila presumably in concert with centromeres, which consist of the CenH3 (Cid) in comparisons of Drosophila melanogaster and Dro- most rapidly evolving DNA in the . sophila simulans, a divergence of Ϸ2.5 million years. We have Chromosomes (and their centromeres) may compete at mei- proposed that rapidly changing centromeric DNA may be driving osis I for inclusion into the oocyte (11). A centromeric satellite CenH3’s altered DNA-binding specificity. Here, we compare Cid expansion could bias centromeric strength, leading to preferred sequences from a phylogenetically broader group of Drosophila inclusion in the oocyte, but also to increased nondisjunction. A species to suggest that Cid has been evolving adaptively for at least subsequent alteration of CenH3’s DNA-binding preferences 25 million years. Our analysis also reveals conserved blocks not could restore parity among different chromosomes (2), thereby only in the histone-fold domain but also in the N-terminal tail. In alleviating the deleterious consequences of satellite drive. Suc- several lineages, the N-terminal tail of Cid is characterized by cessive rounds of satellite and CenH3 drive are detected as an subgroup-specific oligopeptide expansions. These expansions re- excess of replacement changes in CenH3. Adaptive evolution of semble minor groove DNA binding motifs found in various histone CenH3 has been documented in the histone fold domain of tails. Remarkably, similar oligopeptides are also found in N-termi- Drosophila (10). The N-terminal tails of CenH3 in both Dro- nal tails of human and mouse CenH3 (Cenp-A). The recurrent sophila and Arabidopsis are also subject to adaptive evolution, evolution of these motifs in CenH3 suggests a packaging function consistent with their proposed role in contacting centromeric for the N-terminal tail, which results in a unique chromatin orga- DNA (10, 12). In addition, deletion studies in Saccharomyces nization at the primary constriction, the cytological marker of cerevisiae have revealed essential protein–protein interaction centromeres. sites encoded within the N-terminal tail of CenH3 (13). Al- though sequence divergence precludes aligning the N-terminal entromeres are the chromosomal loci responsible for pole- tails of CenH3s from different lineages, a comprehensive study Cward movement at meiosis and mitosis, and are essential for of the molecular evolution within a particular lineage of CenH3s could help elucidate function for this important feature of the faithful segregation of genetic information. Despite their GENETICS crucial role in the cell cycle, relatively little is known about how centromeric chromatin by revealing the architecture of evolu- most centromeres are organized at the molecular level. In tionary constraint. complex eukaryotes, centromeres contain large arrays of tan- To this end, we present a detailed analysis of the evolution of demly repeated satellite sequences, yet their inheritance appears the Drosophila centromeric histone (Cid) in the melanogaster to be largely sequence-independent (reviewed in refs. 1–3). group of species. Comparison of Cid genes across this 25-million- Instead, the presence of specialized centromeric is year divergence reveal highly conserved blocks in the histone thought to be a key determinant of centromere identity and fold domain, as expected, with the DNA-binding Loop 1 region maintenance. Centromeric histones (CenH3s) belong to the (14) evolving most rapidly. This analysis also revealed the histone H3 family of proteins, based on homology in the histone unexpected presence of oligopeptide repeats in the N-terminal fold domain, and can replace H3 in nucleosomes (4, 5). However, tail of at least three different lineages of Drosophila Cids. The centromeric chromatin is distinct from the rest of the genome, occurrence of these expansions in some vertebrate CenH3s and as demonstrated by nuclease accessibility studies in both budding their similarity to previously documented peptides that interact and fission yeast (6–8), as well as by topoisomerase II cleavage with the minor groove of DNA suggest an attractive model for patterns of human centromeres (9). Instead of the regular the unique chromatin structure present at centromeres. nucleosomal arrays observed in surrounding heterochromatin, Methods centromeric DNA appears uniformly accessible to nucleases. These findings have led to suggestions that centromeric nucleo- Drosophila species (Table 1) were obtained from the Drosophila somes are randomly arranged (6, 7). A better understanding of Species Center (presently at Tucson, AZ) except for Drosophila this atypical chromatin might be gained by a closer examination trilutea (kind gift of Dr. M.T. Kimura, Hokkaido University, of CenH3 evolution. Japan). Genomic DNA was extracted from all flies and PCR was The structural organization of the C-terminal histone fold carried out to amplify the Cid gene and flanking regions as domain of CenH3 is believed to be very similar to that of H3 described in ref. 10. For more divergent species, additional itself, based on greater than 40% identity and analysis primers were designed to internal regions of Cid that are Ј Ј of reconstituted nucleosomes (2, 4). However, this domain relatively conserved, 5 -CGGCAGCTTGGGTATCAGTGT-3 evolves more rapidly in CenH3s than in canonical H3s. In addition, whereas H3 has a conserved 45-aa N-terminal tail, those of CenH3s vary in size from 20 to 200 aa, and cannot be Abbreviation: CenH3, centromeric histone H3. Data deposition: The sequences reported in this paper have been deposited in the GenBank aligned across different eukaryotic lineages (2). database (accession nos. AF435454–AF435467). This dichotomy in evolutionary rates between CenH3s and ‡To whom reprint requests should be addressed at: 1100 Fairview Avenue North A1-162, H3s is believed to have two causes. First, whereas the CenH3s are Seattle, WA 98109-1024. E-mail: [email protected]. only required to interact with centromeric DNA, conventional The publication costs of this article were defrayed in part by page charge payment. This H3 needs to interact with, and mediate the expression of, the rest article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. of the genome (10). This requirement results in the latter class §1734 solely to indicate this fact.

www.pnas.org͞cgi͞doi͞10.1073͞pnas.032664299 PNAS ͉ February 5, 2002 ͉ vol. 99 ͉ no. 3 ͉ 1449–1454 Downloaded by guest on October 2, 2021 Table 1. Species used in the present study bio.psu.edu͞adaptivevol.html). Briefly, the alignment and phy- Drosophila GenBank logenetic tree are used to infer ancestral codons at all interior species Species group Stock center ID accession nos. nodes of the tree. Next, the total number of synonymous and nonsynonymous substitutions are calculated for each codon and melanogaster melanogaster ref. 3 AF259371 a global average is computed. Adaptive evolution is inferred simulans melanogaster ref. 3 AF321923 when the rate of nonsynonymous substitutions per nonsynony- sechellia melanogaster 14021-0248.0 AF321925 mous site (rN) exceeds that of synonymous substitutions per Ͻ mauritiana melanogaster 14021-0241.0 AF321924 synonymous site (rS) at a significant level. Conversely, if rS rN, yakuba melanogaster 14021-0261.0 AF435454 purifying selection is inferred. The 9-aa expansions in both the teissieri melanogaster 14021-0257.0 AF321926 takahashii and ananassae groups were examined for evidence of orena melanogaster 14021-0245.0 AF435456 . We used adaptsite-d (Poisson) to calculate rN erecta melanogaster 14021-0224.0 AF435455 and rS followed by adaptsite-t to evaluate statistical significance. takahashii takahashii 14022-0311.0 AF435460 PDB structures of the nucleosome core particle (1AOI, ref. trilutea takahashii Dr. M.T. Kimura AF435457 14) were displayed using the SWISSPDB VIEWER (Version 3.7b2, prostipennis takahashii 14022-0291.0 AF435354 http:͞͞www.expasy.ch͞spdbv). paralutea takahashii 14022-0281.0 AF435458 lutescens takahashii 14022-0271.0 AF435459 Results lucipennis suzukii 14023-0331.0 AF435356 To amplify a phylogenetic range of Cid sequences (Table 1), we mimetica suzukii 14023-0341.0 AF435461 used a PCR strategy using internal primers in conjunction with rajasekhari suzukii 14023-0361.0 AF435355 primers to upstream and downstream genes (10). In all 22 species ananassae ananassae 14024-0371.0 AF435466 studied, the Cid gene lacks introns and varies from 639 to 879 bipectinata ananassae 14024-0381.0* AF435462 nucleotides in length. Most of the length variation in Cid can be malerkotliana ananassae 14024-0391.0 AF435463 attributed to the N-terminal tail, whereas the histone fold pseudoananassae ananassae 14024-0411.0 AF435465 domain is nearly constant in length (schematized in Fig. 1A). parabipectinata ananassae 14024-0401.0 AF435464 We present a multiple alignment of the histone fold domain pseudoobscura pseudoobscura 14011-0121.0 AF435467 from various Drosophila Cids along with the Drosophila H3 *In addition, strains 0381.1, 0381.2, 0381.3, and 0381.4 were partially charac- sequence for comparison (Fig. 1B). Most residues involved in terized for repeat expansions. mediating the histone–histone interactions of H3 (14, 22) are well conserved in Cid, consistent with their structural role. In contrast, the Loop 1 region of Cid, which is predicted to form an (CidmidR2), 5Ј-GGCCATCAATGCCATGTCNCGNACN- extensive DNA-interaction domain with Loop 2 of H4, is more TC-3Ј (degenerate primer H3degR1), and 5Ј-CTGCCGTTC- variable (Fig. 1B). All CenH3s have a longer Loop 1 region, TCGCGTCTNGTNCGNGA-3Ј (degenerate primer H3degF1). consistent with its predicted role in conferring some degree of The internal Cid-specific primers were used to amplify and DNA-binding specificity (23). We have previously shown that the sequence only an internal piece of Cid, and species-specific two replacements fixed between Drosophila melanogaster and primers were designed to amplify the entire gene. PCR products Drosophila simulans in Loop 1 have been subject to adaptive were cloned and sequenced using BigDye sequencing (Applied evolution (10). In the present analysis, comparison of pairs of Biosystems). At least three independent clones were sequenced closely related species indicate additional positions in Loop 1 on both strands for each species. All sequences have been that have undergone frequent episodes of amino acid replace- deposited in GenBank (accession nos. in Table 1). ments. For instance, two replacements separate Drosophila Multiple alignment of Cid sequences was carried out using yakuba from Drosophila teissieri [GAT GAA GCA (amino acids DEA) versus GAT GGAGAA (DGE), respectively], whereas CLUSTAL X (15). Phylogenetic trees were created using the neighbor-joining program (16). Bootstrap trials were conducted four replacements separate Drosophila erecta from Drosophila orena [TTG AGG GTC TCC GAG GGC (LRVSEG) versus using the PAUP* package of programs, Version 4.0b6 (17). In case TTG AAG ATC AGC GTGGCA (LKISVA), respectively] in of the repeat expansions in the takahashii and ananassae sub- their Loop 1 regions (Fig. 1B). The adaptive evolution that groups, oligopeptides from all species were aligned, and a characterizes Loop 1 may be indicative of changing DNA position-specific scoring matrix (PSSM) created using BLOCK- ͞͞ specificity (2, 10). In addition to regions subject to adaptive MAKER (http: blocks.fhcrc.org). The PSSMs were displayed as evolution, our analysis also highlights residues in Loop 1 that are Logos (18), a graphical representation of aligned sequences well conserved and likely play a structural role in Cid function where at each position the size of each letter is proportional to instead of conferring DNA specificity. the frequency of that particular residue in that position and the Based on the structure of the nucleosome (14, 22), there are total height of all of the letters in the position is proportional to additional DNA-interaction sites of structural significance on the conservation (information content) of the position. Letters the H3 molecule that are well conserved in Cid. From the in the logo are colored according to the physical and chemical ␣ multiple alignment, we find that both the N-terminal helix N characteristics of the amino acid residues they specify. PSSMs and the carboxy-terminus of the protein are also evolving ␣ generated were also used to search both the nonredundant rapidly. N is predicted to make specific contacts with the DNA database and a collection of known centromeric histones by gyres as the N-terminal tail of H3 exits the nucleosome (14, 22) using the Motif Alignment Search Tool (MAST, ref. 19). The and like Loop 1, this region may also confer some discrimination CenH3s database was created by collating information from the in DNA binding. various sequencing projects underway. PSSMs generated were Based on a multiple alignment of the nucleotide sequences of also compared with each other by using the LAMA search tool the histone fold domain (data not shown), we can reconstruct the (20), which reports significance of matches by using expected phylogeny of the Cid gene in the melanogaster group of species number based on searching 5,000 blocks and Z-scores (number (Fig. 1C). As expected for an essential single-copy gene, the of standard deviations away from the mean match to a set of phylogeny of Cid is congruent with previous phylogenies of this shuffled blocks). group of species (ref. 24, for example). The divergence between To detect natural selection on individual codon positions, we the melanogaster species group and the ananassae species used the ADAPTSITE package of programs (ref. 21, http:͞͞mep. groups is estimated as at least 12.7 million years (based on

1450 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.032664299 Malik et al. Downloaded by guest on October 2, 2021 GENETICS

Fig. 1. Phylogenetic conservation of Cid, the Drosophila CenH3, in the melanogaster group. (A) Schematic representation of Cid from different Drosophila species. The N-terminal tail is shown as a wavy blue line and the histone fold domain in red. The relative locations of three conserved sequence blocks in the tail are indicated as blue boxes (not drawn to scale). Each oligopeptide expansion is represented by red arrowheads; the amino acid repeat length is indicated by a number following the expansion and the number of iterations are indicated by the number of arrowheads. In addition, the consensus sequence of the repeats is shown in parentheses. (B) A multiple alignment of the histone fold domain of Drosophila Cids and D. melanogaster H3. Black dots represent gaps in the alignment. The alignment has been shaded to a 75% consensus by using MACBOXSHADE, with identical and similar residues shown in black and gray, respectively. Secondary structure assignments are from ref. 14 with the exception of Loop 1, which has been redefined to encompass the flanking variable sequence (between the dotted vertical lines). For instance, because of insertions͞deletions, it is unclear exactly where the ␣1 helix ends and the ␣2 helix begins within Loop 1 of Cid. H3 contacts with itself, another H3 molecule and H4 are indicated by blue diamonds, blue dots, and red dots, respectively (14). (C) Cid gene phylogeny. Bootstrap support (percentage of 1,000 trials) is shown next to each node. (D) The three conserved blocks in Cid’s N-terminal tail in Logos format (Methods).

paraphyly of ananassae, montium, and melanogaster species In contrast to the histone fold domain, only few segments of groups), whereas that between the obscura and melanogaster the N-terminal tail are conserved. We identified three stretches groups is estimated as 24 million years (25). Despite a total of conserved protein sequences which we have termed Blocks alignment length of only 300 nucleotides, we observe a high 1–3 respectively (Figs. 1 A and D). Block 1 is 44 aa long and resolution in the phylogeny. Because eukaryotes are presumed to occurs close to the N-terminal end of the protein. We have shown have one characteristic CenH3, it might serve as a universal the conservation of Block 1 in Logos format in Fig. 1D, where phylogenetic marker in groups that have been traditionally hard the ‘‘tallest’’ residues are invariant across all species sampled by to resolve. our study. Block 1 is unique in that we were unable to find any

Malik et al. PNAS ͉ February 5, 2002 ͉ vol. 99 ͉ no. 3 ͉ 1451 Downloaded by guest on October 2, 2021 other significant hits in the nr-database when using a MAST search. Block 2 is 20 aa long and occurs in the middle of the N-terminal tail. It primarily consists of a stretch of acidic residues flanked by a pair of well conserved serines. We could find no homologous sequences to Block 2 in the database other than those in very acidic proteins. Block 3 is 11 aa long and imme- diately abuts the histone fold domain. It is characterized by a basic region as well as invariant proline and arginine residues. The position of the third block is significant because the N- terminal tail of histone H3 is predicted to exit the nucleosome through a minor groove channel precisely at this junction (14). Block 3 resembles a putative nuclear localization signal encoded by four consecutive basic residues (26). There is precedent for some histones relying on an NLS-dependent pathway of nuclear import (27), but this has yet to be demonstrated for any CenH3. Most of the length in different Cid genes can be directly traced to oligopeptide repeat expansions in the N-terminal tail (Fig. 1A). These oligopeptides vary both in repeat lengths as well as the number of iterations. Thus, Dro- sophila lucipennis has a pentameric glutamine (Q) expansion, whereas D. teissieri and D. orena have three repeats of (QN) and (NPKS), respectively (Fig. 1A). The diverse location of these expansions and their absence in closely related species attest to their independent evolutionary origins. In the lineages leading to the takahashii, suzukii and ananassae species groups (Table 1), the expansions occur in the same location (dashed vertical line, Fig. 1A) and are each 9 aa residues long. In the takahashii and suzukii sister groups, copy number varies from 1 to 3, whereas in the ananassae group, it varies from 3 to 6 copies (Fig. 1A). To investigate the evolutionary significance of the oligopep- tide repeat expansions in Cid’s N-terminal tail, we aligned all instances of the repeat expansion in the takahashii͞suzukii and ananassae species groups (Fig. 2 A and B). From the multiple synonymous (blue) and replacement (red) substitutions, it is evident that the repeats are ancient and have not been subject to concerted evolution (which would lead to low nucleotide diversity). Despite their age, the amino acid residues encoded at many of the nine positions are subject to purifying selection (P values in Fig. 2 A and B). Indeed, in comparisons of these repeats Fig. 2. Molecular evolution of oligopeptide expansions in CenH3s. (A) within strains of Drosophila bipectinata, we find six synonymous Multiple alignment of the oligopeptide expansions in the pair of taka- and no replacement polymorphisms. The number of iterations of hashii͞suzukii species groups. All synonymous and replacement substitu- the repeat is not fixed in a species. For example, three strains of tions are shown in blue and orange relative to the first line of the align- D. bipectinata have five copies of each repeat, whereas two ment. Eight codon positions had substitutions that could be evaluated for strains have six. The Logo of the nine-residue stretch clearly natural selection by using Adaptsite; of these, seven positions had rn Ͻ rS Ͼ highlights conserved and rapidly evolving residues. The two (2 significant, P values shown), whereas codon 4 has rn rS, although only consensus oligopeptides from takahashii and ananassae groups marginally so. Also presented is a Logo of the repeat consensus, which are similar to each other in the last four positions (Z score: 4.2; clearly highlights the conservation of the first five codons relative to the last four. (B) Expansions in the ananassae species group. All positions have 5.7 expected hits in a LAMA search of 5,000 blocks). rn Ͻ rS (7 significant, P values shown), indicating a stringent purifying The first five positions of the ananassae consensus are selection. This conservation is confirmed by the Logo of the ananassae similar to previously defined four-residue-long SPKK motifs consensus shown in C.(C) Match of ananassae consensus to vertebrate (Z score: 3.4; 20 expected hits in a LAMA search of 5,000 Cenp-As. E-values reported are from a search of about 15 putative CenH3s. blocks), known to mediate histone interactions with linker The entire N-terminal tails of the human, mouse, and zebrafish Cenp-As DNA in the minor groove. SPKK motifs have been found in the (Danio rerio Cenp-A, accession nos. BF156113, BI877040, and BF158223; N-terminal tails of sea urchin sperm histones H1 and H2B, as Washington Univ. St. Louis Zebrafish EST Project) are shown upstream of well as the C-terminal tails of a class of angiosperm histone the histone fold domain. In addition to the ananassae consensus, we could H2As (Fig. 3A, refs. 28–30). A Logo consensus of these motifs identify matches to the SPKK motifs (blue boxes) as well as short dipeptide is shown in Fig. 3B. SPKK motifs are also found embedded in motifs that include a proline at every alternate site (dark arrowheads). larger repeats within the carboxy-terminal tails of linker histone H1s (not shown; ref. 31). main chain CO of the ith amino acid and the main chain NH of The SPKK class of minor groove binding motifs includes ϩ3 ϩ1 substantial variation at the primary sequence level. The structure i . The only strict requirement isaPinthei position. A of SPKK bound to DNA has not been solved, but NMR and variety of SPKK motif-containing peptides have been shown to structural modeling suggest that SPKK forms a turn stabilized by bind in the minor groove of DNA (29, 33, 34), with each turn one or two hydrogen bonds (29, 32). The structure is called an binding 2 bp. Given the inherent limitations in comparing four Asx turn if the hydrogen bond is between the OH or CO side residue motifs and the variety of SPKK motifs that can bind in chain of the ith amino acid (S, T, D, or N) and the main chain the minor groove (only a subset of which is represented in Fig. NH of the iϩ2 amino acid (A or a hydrophilic amino acid; e.g., 3B), the similarity of the ananassae oligonucleotide consensus to K,R,E,T). In a beta turn there is a hydrogen bond between the the SPKK motif is compelling.

1452 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.032664299 Malik et al. Downloaded by guest on October 2, 2021 Cenp-A N-terminal tails at approximately similar locations upstream of the histone fold domain. We confirmed these matches by a MAST search of a database consisting of 15 putative CenH3s obtained from various nucleotide databases (E-values in Fig. 2C). The ananassae consensus motifs were not found in CenH3s from fungi, nematodes, plants, or even D. melanogaster and Drosophila pseudoobscura. Therefore, the ananassae- vertebrate sequence similarity likely represents an example of evolutionary convergence. In addition to the ananassae consen- sus repeats, the three Cenp-A sequences have an additional SPKK motif (GPRR in human and mouse Cenp-A) as well as recurring dipeptide motifs that include a proline at every other position (GP, SP, etc.). Discussion Centromeric chromatin is distinct from bulk chromatin, most conspicuously by the presence of CenH3. CenH3 is the only protein that is constitutively present at the centromere of every examined so far (2). A specialized chromatin structure at the centromere is suggested by nuclease digestion experiments (6–8) and may be important for the epigenetic maintenance of the centromere through multiple cell divisions. In addition, this chromatin probably plays a key role in assembling the kineto- chore at meiosis and mitosis. Thus, CenH3 could be expected to both interact with the underlying centromeric DNA, as well as interact with other proteins (including other CenH3 molecules) to provide the foundation for the kinetochore (35). In S. cerevisiae, deletion analysis has shown that the N- terminal tail of CenH3 (Cse4) includes a 33-amino acid stretch (END domain) that is essential for function (13). By two-hybrid and dosage suppression of temperature-sensitive , the END domain has been shown to interact with kinetochore components (13). Using an evolutionary approach, we have identified three conserved blocks in the N-terminal tail of

Drosophila Cid. By analogy with END in Cse4, we suggest that GENETICS at least two of these three Cid blocks may also mediate inter- actions with other centromere–kinetochore determinants (Block 3 in Fig. 1D may interact with DNA). Unlike the single nucleosome centromere of S. cerevisiae, metazoan centromeres are hundreds of kilobases in length. Thus, some of the interac- tions involving Cid’s N-terminal tail may occur between two successive Cid molecules in a centromeric nucleosome array. These blocks can now be used in protein interaction screens to elucidate the next level of centromere organization. In two genera with complex centromeres, CenH3s appear to Fig. 3. SPKK motifs in histone tails. (A) Schematics of sea urchin sperm H1 and be undergoing adaptive evolution (10, 12). Because female H2B, and angiosperm H2A histones. Representative histones (28–30) are meiosis in plants and most metazoans is asymmetric, centro- shown from sea urchin (sperm): Parechinus angulosus, Echinolampas crassa, Strongylocentrotus nudus, Strongylocentrotus purpuratus, and Lytechinus meres have the opportunity to compete for inclusion into the pictus; and angiosperms: Oryza sativa, Triticum aestivum, Pisum sativum, and egg (11). Left unchecked, this competition could lead to Arabidopsis thaliana. Each instance of an SPKK motif is shown with a dark runaway satellite expansions and increased nondisjunction as arrowhead. For comparison, a histone of each type lacking these motifs is also centromeres of differing strengths undergo meiosis (2). By presented. (B) Logo of a multiple alignment of all SPKK motifs shown in A.(C) altering its DNA-binding specificity, CenH3 could suppress A schematic representation of the nucleosome core structure (14) highlighting such a drive process. This model predicts that the youngest the tails of H2A, H2B, and (Cen)H3 where SPKK expansions have been found satellites found in centromeric regions would represent the and their proximity to the minor groove of the DNA wrapped around the extant centromere, a prediction that has recently been con- . Note that the two molecules of H3 have different lengths of firmed by high-resolution mapping of the human X centromere N-terminal tails in the structure. For clarity, the two molecules of are omitted. (36). Under this model, fungal CenH3s are not expected to evolve adaptively both because they are not subject to meiotic asymmetries and because they rely on other factors for their centromeric localization (8, 37). Using the consensus created from the ananassae group, we Adaptive evolution of Cid has been mapped to both the Loop used the MAST program (20) to search the nonredundant data- 1 region of the histone fold domain as well as to multiple base of proteins. Surprisingly, the top hit found was the mouse locations of the N-terminal tail in comparisons of D. melano- ϭ CenH3, the Cenp-A protein (E-value 1). Given that our search gaster and D. simulans (10). In addition, adaptive evolution was query was only nine amino acids long and that the match to detected for the N-terminal tail in Arabidopsis CenH3 (12). We Cenp-A was in the N-terminal tail, we found this match espe- have proposed that altered DNA-binding specificity drives cially noteworthy. In addition to mouse, we could find matches CenH3’s adaptive evolution. Because this adaptation has oc- to the ananassae group consensus in zebrafish and human curred in multiple segments of Cid’s N-terminal tail, the N-

Malik et al. PNAS ͉ February 5, 2002 ͉ vol. 99 ͉ no. 3 ͉ 1453 Downloaded by guest on October 2, 2021 terminal tail must make extensive contacts with the linker DNA Two of the best characterized minor-groove binding motifs, in centromeric chromatin. SPKK and AT-hooks (42), show no sequence similarity to each One special instance of a likely CenH3-DNA interaction is other, suggesting that we have explored only a subset of potential evident in the case of SPKK-containing oligopeptides we find in minor-groove binding motifs. In addition to the SPKK-bearing the ananassae species’ Cid and vertebrate Cenp-A. Even in the motifs, we have found additional types of repeats both in Cenp-A absence of SPKK motifs, conventional histone tails interact as well as in the different lineages of Drosophila, such as the extensively with the linker DNA (38). When found in H2A takahashii consensus (Fig. 2A) and NPKS expansions in D. orena C-terminal or H2B N-terminal histone tails, SPKK-containing (Fig. 1A), that may all carry out the same function—i.e., bind in motifs confer an additional protection of 16 bp of linker DNA the minor groove of centromeric DNA. These peptides, like in reconstitution experiments (30, 39). The N-terminal tails of SPKK, are predicted to have a strong preference for AT-rich H2B and H3 are perfectly positioned to interact with the minor DNA, and may only recognize architectural DNA features such groove of DNA as they exit the nucleosome (Fig. 3C) in minor as minor groove shape rather than specific bases. groove channels (14). Similarly, while the winged helix domain In conclusion, our examination of evolutionary constraint in of linker histones is located at the nucleosome dyad, the posi- the Drosophila CenH3 lineage has revealed that the N-terminal tively charged N- and C-terminal tails interact extensively with tail comprises a combination of interaction sites. We have the linker DNA (40). This interaction is weaker in rat testis H1 inferred putative protein interaction blocks that might lay the (H1t), which lacks SPKK motifs, compared with somatic H1s, foundation for kinetochore formation, DNA binding regions which have them (41). Thus, the presence of minor groove subject to adaptive evolution, and nonspecific minor groove interactions mediated by CenH3’s N-terminal tail might result in binding motifs. The resulting complex with DNA may form the uniform nuclease accessibility of centromeric chromatin. A universal cytological feature of centromeres, the primary comparison of H3 and H1 tails is particularly relevant as the H1 constriction. binding site on the nucleosome is very close to where the N-terminal tails of H3s exit. We propose that the minor groove We thank the Drosophila Species Center for the various fly strains used, binding repeats in CenH3 N-terminal tails play a similar role to Jorja Henikoff for help with the Blocks, LAMA, MAST, and Adaptsite H1 tails by neutralizing phosphates in the linker DNA—i.e., they searches, and Judith O’Brien for technical assistance. We also thank shield charges to allow superhelical bending and collapse into a Kami Ahmad, Jennifer Cooper, and Paul Talbert for stimulating dis- higher order chromatin structure. This structure would be seen cussions and for comments on the manuscript. We are supported by as the primary constriction, the cytological marker for centro- the Helen Hay Whitney Foundation (H.S.M.), the Damon Runyon mere position. It will be interesting to determine whether Cancer Research Foundation (D.V.), and the Howard Hughes Medical centromeric chromatin is especially lacking in linker histones. Institute (S.H.).

1. Karpen, G. H. & Allshire, R. C. (1997) Trends Genet. 13, 489–496. 22. Harp, J. M., Hanson, B. L., Timm, D. E. & Bunick, G. J. (2000) Acta Crystallogr. 2. Henikoff, S., Ahmad, K. & Malik, H. S. (2001) Science 293, 1098–1102. D 56, 1513–1534. 3. Choo, K. H. (2001) Dev. Cell 1, 165–177. 23. Shelby, R. D., Vafa, O. & Sullivan, K. F. (1997) J. Cell Biol. 136, 501–513. 4. Yoda, K., Ando, S., Morishita, S., Houmura, K., Hashimoto, K., Takeyasu, K. 24. Goto, S. G. & Kimura, M. T. (2001) Mol. Phylogenet. Evol. 18, 404–422. & Okazaki, T. (2000) Proc. Natl. Acad. Sci. USA 97, 7266–7271. (First 25. Russo, C. A., Takezaki, N. & Nei, M. (1995) Mol. Biol. Evol. 12, 391–404. Published June 6, 2000; 10.1073͞pnas.130189697) 26. Gorlich, D. (1997) Curr. Opin. Cell Biol. 9, 412–419. 5. Palmer, D. K., O’Day, K., Wener, M. H., Andrews, B. S. & Margolis, R. L. 27. Mosammaparast, N., Jackson, K. R., Guo, Y., Brame, C. J., Shabanowitz, J., (1987) J. Cell Biol. 104, 805–815. Hunt, D. F. & Pemberton, L. F. (2001) J. Cell Biol. 153, 251–262. 6. Bloom, K. S. & Carbon, J. (1982) Cell 29, 305–317. 28. von Holt, C., de Groot, P., Schwager, S. & Brandt, W. F. (1984) in Histone 7. Polizzi, C. & Clarke, L. (1991) J. Cell Biol. 12, 191–201. Genes, eds. Stein, G. S., Stein, J. L. & Marzluff, W. F. (Wiley Interscience, New 8. Takahashi, K., Chen, E. S. & Yanagida, M. (2000) Science 288, 2215–2219. York), pp. 65–105. 9. Floridia, G., Zatterale, A., Zuffardi, O. & Tyler-Smith, C. (2000) EMBO Rep. 29. Suzuki, M. (1989) EMBO J. 8, 797–804. 1, 489–493. 30. Lindsey, G. G., Orgeig, S., Thompson, P., Davies, N. & Maeder, D. L. (1991) 10. Malik, H. S. & Henikoff, S. (2001) Genetics 157, 1293–1298. J. Mol. Biol. 218, 805–813. 11. Zwick, M. E., Salstrom, J. L. & Langley, C. H. (1999) Genetics 152, 31. Churchill, M. E. & Travers, A. A. (1991) Trends Biochem. Sci. 16, 92–97. 1605–1614. 32. Suzuki, M., Gerstein, M. & Johnson, T. (1993) Protein Eng. 6, 565–574. 12. Talbert, P. B., Masuelli, R., Tyagi, A. P., Comai, L. & Henikoff, S. (2001) Plant 33. Churchill, M. E. & Suzuki, M. (1989) EMBO J. 8, 4189–4195. Cell, in press. 34. Khadake, J. R. & Rao, M. R. (1997) Biochemistry 36, 1041–1051. 13. Chen, Y., Baker, R. E., Keith, K. C., Harris, K., Stoler, S. & Fitzgerald-Hayes, 35. Sullivan, K. F. (2001) Curr. Opin. Genet. Dev. 11, 182–188. M. (2000) Mol. Cell. Biol. 20, 7037–7048. 36. Schueler, M. G., Higgins, A. W., Rudd, M. K., Gustashaw, K. & Willard, H. F. 14. Luger, K., Mader, A. W., Richmond, R. K., Sargent, D. F. & Richmond, T. J. (2001) Science 294, 109–115. (1997) Nature (London) 389, 251–260. 37. Ortiz, J., Stemmann, O., Rank, S. & Lechner, J. (1999) Genes Dev. 13, 1140–1155. 15. Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F. & Higgins, D. G. 38. Angelov, D., Vitolo, J. M., Mutskov, V., Dimitrov, S. & Hayes, J. J. (2001) Proc. (1997) Nucleic Acids Res. 25, 4876–4882. Natl. Acad. Sci. USA 98, 6599–6604. (First Published May 29, 2001; 10.1073͞ 16. Saitou, N. & Nei, M. (1987) Mol. Biol. Evol. 4, 406–425. pnas.121171498) 17. Swofford, D. L. (2000) PAUP*: Phylogenetic Analysis Using Parsimony (*and 39. Lindsey, G. G. & Thompson, P. (1992) J. Biol. Chem. 267, 14622–14628. Other Methods) (Sinauer, Sunderland, MA), Version 4. 40. Zhou, Y. B., Gerchman, S. E., Ramakrishnan, V., Travers, A. & Muyldermans, 18. Schneider, T. D. & Stephens, R. M. (1990) Nucleic Acids Res. 18, 6097–6100. S. (1998) Nature (London) 395, 402–405. 19. Bailey, T. L. & Gribskov, M. (1998) J. Comput. Biol. 5, 211–221. 41. Khadake, J. R., Markose, E. R. & Rao, M. R. (1994) Ind. J. Biochem. Biophys. 20. Pietrokovski, S. (1996) Nucleic Acids Res. 24, 3836–3845. 31, 335–338. 21. Suzuki, Y., Gojobori, T. & Nei, M. (2001) Bioinformatics 17, 660–661. 42. Aravind, L. & Landsman, D. (1998) Nucleic Acids Res. 26, 4413–4421.

1454 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.032664299 Malik et al. Downloaded by guest on October 2, 2021