REVIEWS

Determining the specificity of protein–DNA interactions

Gary D. Stormo and Yue Zhao Abstract | Proteins, such as many transcription factors, that bind to specific DNA sequences are essential for the proper regulation of expression. Identifying the specific sequences that each factor binds can help to elucidate regulatory networks within cells and how genetic variation can cause disruption of normal gene expression, which is often associated with disease. Traditional methods for determining the specificity of DNA-binding proteins are slow and laborious, but several new high-throughput methods can provide comprehensive binding information much more rapidly. Combined with in vivo determinations of binding locations, this information provides more detailed views of the regulatory circuitry of cells and the effects of variation on gene expression.

In vivo Many proteins interact with DNA to accomplish a wide and are a major determinant of the regulated and In this context we use the variety of cellular processes. These include the crucial controlled expression of all . term in vivo to refer to any tasks of DNA replication, repair and recombination in In the last few years technological advances have experiments performed on all cells. In eukaryotes there are, in addition, histones greatly increased our knowledge of the locations of the living cells, whether within or outside a whole organism and other proteins that determine the structure of TF binding sites within genomes and the sequence- (sometimes referred to chromatin within the nucleus. The expression of genes specific binding preferences for many TFs. These as ex vivo). requires transcription by RNA polymerase, and usually advances include both in vivo and in vitro experimental there are a variety of associated proteins, referred to gen- methods and the development of new computational erally as transcription factors, that facilitate and regu- analyses. The in vivo approaches, such as chromatin late the transcription process. One class of transcription immuno precipitation followed by microarray (ChIP–chip) factors interacts directly with the DNA in a sequence- or by sequencing (ChIP–seq), determine the locations specific manner, binding only to locations within the within the genome that the transcription factors bind genome that contain the appropriate DNA sequences. and provide candidate genes that they are likely to regu- Upon binding, these transcription factors may regulate late6–8. The locations can be determined for different cell the expression of associated (usually nearby) genes, types, at different stages of development and in different either activating or repressing their RNA synthesis. environmental conditions to help uncover the regulatory Recently it has also become clear that many of the changes that are associated with changes in cell physiol- binding sites within the genome do not affect the expres- ogy1. This information alone can provide a framework sion of nearby genes and it is unknown whether they for the regulatory network, indicating which factors are perform some other function or are just non-functional likely to regulate which genes. However, the resolution binding events1–3. The analysis of the genomic locations of the binding locations is not sufficient to identify the of transcription factor binding sites, and how they differ binding sites precisely, only to indicate a region of 100 between distinct cell types, provides a wealth of infor- or more base pairs in which it resides. Applying motif Department of Genetics, Washington University School mation about the dynamics of the regulatory network discovery algorithms to the sequences of those binding of Medicine, 4444 Forest of cells, but that is not the topic of this Review. Rather, regions can usually identify the main features of the bind- Park Blvd, St. Louis, the focus of this Review is the intrinsic specificity of ing patterns and the highest affinity binding sequences Missouri 63110‑8510, USA. sequence-specific transcription factors (TFs) — their for the factors9,10, but the accuracy of those specificity Correspondence to G.D.S. ability to distinguish different DNA sequences in the models is often limited owing to confounding factors e‑mail: [email protected] doi:10.1038/nrg2845 absence of any other cooperative or competitive inter- that occur in vivo. Knowing the intrinsic specificity of Published online actions. In many species, from bacteria to humans, TFs the factors, together with the binding locations, provides 28 September 2010 represent 5–10% of the genes encoded in the genome4,5 additional valuable information. For example, if strong

NATURe RevIews | Genetics vOlUMe 11 | NOveMbeR 2010 | 751 © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS

binding sites are not bound it indicates that those regions of different DNA binding site sequences, each at mul- of the genome are inaccessible to the factor, probably tiple different concentrations. The total DNA binding owing to the local chromatin organization. Conversely, site concentration in each chamber is determined by its if the factor binds to regions of the genome that lack fluorescent signal, and any DNA that is not bound to strong binding sites, this indicates that the factor is bind- the TF that is attached to the surface is then flushed out. ing indirectly (that is, through interaction with another The amount of protein is determined by its fluorescent factor) or is binding to a weak binding site that requires signal, and the amount of DNA bound to the TF is deter- cooperativity with other factors11. In both cases, that mined by the remaining DNA signal. by determining information leads to the search for additional interacting the fraction of DNA that is bound at several different factors that are involved in the network. DNA concentrations, the relative affinities of each dif- For many years scientists have been measuring the ferent sequence can be determined. This does not, by

binding affinity of TFs to specific DNA sequences, but itself, determine the absolute dissociation constant (Kd) Chromatin these experiments have generally been undertaken one between the TF and the binding site (BOX 1), but compar- immunoprecipitation (ChIP). A technique that is at a time, each determining the affinity of the TF to a sin- ison to a reference with known Kd allows determination (BOX 1) used to identify the location gle DNA sequence . However, in vivo the affinity of the absolute Kd for each binding site sequence. of DNA-binding proteins of the TF is not as crucial as its specificity. Inside a bac- Using 17 such devices, Maerkl and Quake12 meas- and epigenetic marks in the terial cell or a eukaryotic nucleus, the concentration of ured the binding affinities to 464 different DNA sites genome. Genomic sequences containing the mark of interest DNA is so high (typically millimolar for potential bind- for four different proteins, all of which were members are enriched by binding soluble ing sites) that TFs will essentially always be bound to of the eukaryotic basic helix–loop–helix (bHlH) TF DNA chromatin extracts DNA, even if there are no high-affinity sites. binding family. Members of that family generally bind to sites (complexes of DNA and protein) to their regulatory sites — those that are crucial for the with CANNTG consensus sequences (in which N rep- to an antibody that recognizes proper regulation of gene expression — requires that resents any of the four bases). They took advantage of the mark. ChIP can be followed by analysis of the precipitated the TFs can distinguish those sites from the vast major- the fact that these bHlH TFs bind as homodimers and DNA through hybridization to a ity of non-regulatory DNA sequences. The information have symmetric specificities. by keeping one half-site microarray (ChIP–chip) or by required for understanding and modelling the regula- fixed, their set of 464 binding sites contained all pos- sequencing of the precipitated tory network is not the affinity to the preferred binding sible four-base-long sequences (256) for the other half- DNA (ChIP–seq). sites but the differences in binding affinity for all of the site plus many additional variants of the adjacent bases. Motif potential binding sites, which is what we mean by They showed that the specificities for two yeast proteins, In this Review motif refers the specificity of the TF (BOX 1). This requires knowing the centromere-binding protein 1 (Cbf1) and phosphate sys- to a representation of the affinity to the entire set of potential binding sites, or at tem positive regulatory protein 4 (Pho4), are similar but specificity of a transcription least enough of them to construct models that allow the have differences — especially in their preferences in the factor. For instance, a position (BOX 2) weight matrix is a motif for a complete specificity to be estimated . flanking sequences — that could help to distinguish their transcription factor, but there several recent technological advances have facilitated in vivo binding sites and better define their independent could be other ways to high-throughput determination of the intrinsic specifi- regulatory roles. represent the specificity. city of transcription factors. This information is useful

Microfluidic device not only for more detailed analysis of the regulatory net- Surface plasmon resonance. surface plasmon resonance A device in which fluids are works within genomes and the affects of genetic varia- (sPR) is often used to study protein interactions with conveyed to samples in tions on those networks but also for aiding the design of ligands (including other proteins) but can also be used channels with diameters in the new proteins with novel specificity. This Review focuses to measure protein–DNA interactions. As an approach order of 1 μm; these chambers on the recent technological advances that facilitate the for determining the affinity of a TF for a DNA sequence, can be used to precisely and dynamically control the rapid and accurate determination of the specificity of TFs, sPR can be made moderately high-throughput by using 13–15 microenvironment to which including several new high-throughput experimental arrays of DNA sequences . sPR is based on the fact cells are exposed. methods and how computational modelling can be used that the angle of light reflection from a thin surface of to interpret those data. gold, for example, depends on the mass of molecules Kd The dissociation constant attached to the other side of the surface. DNA can between two molecules (here, Direct affinity measurements be attached to the surface, the TF can be added to the for a transcription factor and a Microfluidics. One recent method that uses a microfluidic solution and the change in reflection can be measured DNA sequence). It is the ratio device is the mechanically induced trapping of molecular over time as the TF–DNA complex accumulates, until of the off-to-on rate for the interactions (MITOMI). It provides a means for deter- equilibrium is attained. Then the TF can be allowed to formation and dissolution of the complex. mining binding specificities accurately through direct dissociate from the DNA by washing the array and the measurements in a relatively high-throughput proc- change in reflection can be monitored until equilibrium ess, obtaining the binding affinities of a TF to each of is reached again. This ability to measure binding and A single sequence (possibly a few hundred different DNA sites per device12 (FIG. 1A). unbinding over time is one of the main advantages of degenerate) that represents (BOX 1) the specificity of a transcription synthetic genes for the TFs are flowed into each chamber sPR; the rate constants kon and koff can be deter- factor. It is usually the highest along with material to support the synthesis of the pro- mined and from them the affinity, Kd, can be determined. affinity sequence and would teins within the chamber, thus avoiding the problem of Multiple can be arrayed on the surface so that the be the most frequent base at purifying the TF (a step that is required in many in vitro affinities of many different DNAs to the same protein each position in a collection of assays). each chamber of the device has antibodies to can be determined in parallel13–15. so far, only medium- binding sites. It is also possible to use degeneracies, for attach the TF to its surface and is seeded with a specific sized arrays have been used but it seems feasible to example, ‘R = A or G’, if two DNA binding site at a specific concentration and with a increase the number of simultaneous measurements (or more) bases are equivalent. fluorescent tag. Overall, the device contains hundreds substantially.

752 | NOveMbeR 2010 | vOlUMe 11 www.nature.com/reviews/genetics © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS

Box 1 | Binding affinity and specificity Affinity a The binding of protein to DNA is a bimolecular process governed by two rates: an on-rate for the formation of the complex and an off-rate for its dissociation. This is shown in the figure (see the + figure, part a; the transcription factor (TF) is blue and DNA sequence (S) is red) and by the equation:

MQP 6(+ 5 6(“ 5  MQȭ TF Sequence The affinity of the TF for the sequence S is usually defined as the dissociation constant K d b 1.0 (or its reciprocal, the association constant):

MQƑ =6(?=5? -F== ∆)°= 46NP-F  0.8

“ y MQP =6( 5?

The square brackets represent the concentrations 0.6 o of those entities at equilibrium, ΔG is the Gibbs obabilit standard free energy, R is the gas constant and T is 0.4 the temperature in degrees Kelvin. The Kd is the

concentration of free TF for which the DNA is half Binding pr bound. Such experiments are usually performed 0.2 at excess TF so that the free TF is approximately the same as the total TF. The probability of the 0.0 sequence S being bound is: 0 500 1,000 1,500 2,000

=6(“ 5? =6(? 2 5DQWPF = =  [TF] “ =6( 5?+ =5? =6(?+ -F c The binding probability to the consensus sequence of the A isoform of the human TF MAX for different concentrations of MAX isoform A is also shown (see the figure, part b). + specificity The term ‘specificity’ refers to how well a protein can distinguish between different sequences. In vivo the TF has the entire genome of potential binding sites, and in order for the regulatory TF Sequences network to function properly the TF must be able to distinguish its functional binding sites from the vast d 1.0 majority of competing non-functional potential sites (see the figure, part c; each colour represents a different DNA sequence). The complete 0.8 y

specificity of a TF can be defined by the list of Kds to all possible binding sites, but it is also convenient 0.6 to have a single number to quantify the specificity. obabilit A useful definition is: 0.4

-F 5K -F 5K Binding pr 5RGE≡Σ NP  5K Σ-F 5K 〈-F 5K 〉 0.2 This is equal to the ‘information content’ of the 49 binding sites under certain simplifying conditions, 0.0 such as assuming additivity of the positions (BOX 2) 0 1 2 and that the binding sites are selected in proportion Relative 3 0 500 1,000 1,500 2,000 50 Gibbs standard free energy to their binding affinities . It is also useful to view binding energy [TF] The difference in free energy the specificity graphically with a three-dimensional between the equilibrium plot (see the figure, part d). This is similar to the Nature Reviews | Genetics state and the standard state two-dimensional plot in part b, but is for every different sequence. In the two-dimensional plot, a single sequence (all reactants and products is used and the concentration of TF ([TF]) is varied to obtain the Kd for that sequence; this is equivalent to one of at 1M concentration). the lines parallel to the [TF] axis in the three-dimensional plot for one particular sequence. However, in many of the high-throughput methods described in this Review, the [TF] is constant and the probabilities are measured over all Gas constant (Represented by R). The sequences. This corresponds to one of the lines parallel to the binding energy axis in the three-dimensional plot. thermodynamic constant By employing a model for binding, such as a position weight matrix (BOX 2), it is possible to estimate all of the 37 relating energy per mole parameters of the model, including [TF] and the nonspecific binding energy, from the data . Data in parts b and d to temperature. are from ReF. 12.

NATURe RevIews | Genetics vOlUMe 11 | NOveMbeR 2010 | 753 © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS

Microarray-based methods that every eight-base-long sequence occurs 32 times, Protein-binding microarrays. Protein-binding micro- taking both orientations into account. A TF, either arrays (PbMs) are a technology developed in the last purified from cells or synthesized in vitro, is added to 10 years that has greatly increased the throughput for the array, which is then washed to remove nonspecific assessing the binding specificities of TFs16–18 (FIG. 1B). As binding. The amount of protein binding to any specific with microarrays for gene expression, this technology DNA spot is determined with a fluorescent antibody to has made possible large-scale, high-throughput analy- the protein. ses to collect information that previously was much The specificity of the TF can be represented as posi- more laborious to acquire (typically on a gene-by-gene tion weight matrices (PwMs; BOX 2), but it is also com- basis)19,20. The current version of PbM uses arrays of mon to report enrichment scores for all eight-base-long over 44,000 spots that contain all possible ten-base-long sequences. These scores are based on the rank median DNA binding sites once on each array, which means intensities of array sequences that contain a particular

Box 2 | Modelling specificity

The specificity of a DNA binding protein is determined Position 1 2 3 4 5 6 by its relative affinity to all possible binding sites, or at A 1.4 2.4 5.8 4.8 2.1 6.7 least to a large number of high affinity sites, such that C 2.4 0 6.5 4.0 4.3 6.9 one can estimate the probability of binding to any G 0 3.7 0 0.8 0 0 particular site given the competition from all of the T 3.6 1.7 7.1 0 4.8 6.8 other sites (BOX 1). Using the high-throughput methods described in this Review it is possible to simply determine the binding affinity for all possible binding sites up to the length allowed by the method. However, even when that information is available, GCGTGG 0 determining a model for binding can have advantages, as described in the GCGGGG 0.8 main text. ACGTGG 1.4 The simplest model to represent the specificity of a DNA-binding protein, GCGTGG 1.7 other than just using the consensus sequence for its highest affinity sites, GCGTAG 2.1 is a position weight matrix (PWM). In the general form of this model there is ACGGGG 2.2 a score assigned to each possible base at each position in the binding site GAGTGG 2.4 (see the figure). The sum of the elements that correspond to a specific GTGGGG 2.5 sequence gives a total score for that sequence. This GCGGAG 2.9 allows the model to provide a score to all possible … … binding sites for the protein. The ‘logo’ provides a convenient graphical representation of the PWM51. The PWM model assumes that each position independently contributes to binding, but more 2 complex models are easily generated that do not assume additivity of the individual bases. For example, it is possible to make a matrix model that includes scores 1.5 for adjacent dinucleotides — this requires a PWM-like model with 16 rows for all of the possible dinucleotides

at each position — or other higher-order contributions ontent to the binding44. Simple fitting measures can 1 determine which model is best, given the number tion c of parameters45,52,53. One can also employ models orma

composed of two or more PWMs if the TF has multiple Inf 24 modes for binding to DNA . Different methods have 0.5 been developed to assign the elements of the PWM. The Microarray most useful are those methods that treat the elements A high-density array of DNA of the PWM as binding energy contributions (which molecules, typically with each makes the PWM an energy matrix), because this enables element of the array containing 0 a different DNA sequence. the probability of binding to be calculated for any 123456 37,54 Single-stranded DNA arrays are protein concentration (BOX 1): Position used to hybridize to labelled  Nature Reviews | Genetics DNA (or RNA) sequences 2 5KDQWPF =  ' sµ to determine the relative + G K abundances of different Ei is the difference in binding energy between the sequence Si and the reference (usually the preferred) sequence sequences. Double-stranded (which by convention is defined as having energy = 0, see the PWM above). The logarithm of the ratio of the free DNA arrays are used in concentration of the TF to the K of the reference sequence is: protein-binding microarray d and cognate site identifier =6(? methods to determine the µ=NP  -  5 binding preferences of F TGH transcription factors or other In general, even if the total concentration of the TF is known, [TF] is not known but μ can be estimated from the data DNA-binding molecules. along with the parameters of the PWM37.

754 | NOveMbeR 2010 | vOlUMe 11 www.nature.com/reviews/genetics © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS

A Equilibrium binding Mechanical trapping Wash off unbound molecules PDMS membrane Cy5-labelled target DNA

Transcription Microfluidic factor dimer channel

Penta-histidine antibody Substrate

B

C Immobilized protein Unbound

Sequencing Bound

Random pool of oligonucleotides Selection using target protein Amplification of bound sequences

Da RNA polymerase b Transcription α α β′ RNA polymerase factor ω β α α β′ σ Transcription ω β factor σ HIS3 HIS3 Randomized Weak Randomized Weak region region promoter

Figure 1 | Different methods for studying DnA–protein interactions. A | An overview of the mechanically induced | trapping of molecular interactions (MITOMI) device for binding affinity measurements. The transcriptionNature Revie factorws Genetics (TF) is bound to the surface by antibodies and some fraction of the DNA binds to the TF. The unbound DNA is expelled by washing, so the bound fraction can be measured. B | An overview of universal protein-binding microarray (PBM) design and use. All possible ten-base-long sequences are included on the array (left panel). Primer-directed DNA synthesis creates dsDNA on the array (middle panel). Proteins (shown by purple crescents) are bound, nonspecific binding is minimized by washing of the array and the remaining protein is quantified with a fluorescent antibody (right panel). Fluorescent groups are indicated by stars. c | An overview of the high-throughput (HT)-SELEX procedure. DNA molecules from the random library are exposed to the TF; some sequences are bound, whereas others flow through. The bound fraction is sequenced to determine the probability of being bound. Rounds of amplification of bound sequences may be used (see the main text). D | An overview of the bacterial one-hybrid (B1H) system. The TF is fused to the ω subunit of RNA polymerase and a sequence from a randomized library is inserted upstream of the promoter of the HIS3 gene (which encodes a component of the histidine biosynthesis pathway) (Da). When the TF binds to the randomized sequence, the HIS3 promoter becomes more active (Db) and this increases the growth rate of the cells. PDMS, polydimethylsiloxane. Part A is modified, with permission, from ReF. 12 © (2007) Highwire Press. Part B is modified, with permission, from ReF. 16 © (2006) Macmillan Publishers Ltd. All rights reserved. Part D is modified, with permission, from ReF. 43 © (2008) Oxford University Press.

NATURe RevIews | Genetics vOlUMe 11 | NOveMbeR 2010 | 755 © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS

8-mer compared to a background average. An online than 100 independent sites33. Alternatively, if the library database, UniProbe21, contains information from all is created from genomic DNA the bound fragments can of the published data sets from Martha bulyk’s group, be detected by hybridization to genomic microarrays34. including the 8-mer enrichment scores and PwMs These methods are capable of determining important for each TF that has been tested. The optimal method for aspects of the binding specificity, including the consen- modelling the specificity of TFs based on PbM data sus sequence and the relative variability in affinity for remains an open question, with different methods some- different bases at different positions within the binding times leading to different conclusions (see the section on sites, but if multiple rounds of selection are used it is computational modelling). TF specificities determined not straightforward to determine binding energy distri- by PbM data have been compared to in vivo binding butions directly. As sequencing methods are capable of location experiments in yeast to assess whether certain much longer read lengths than the size of a binding site, TFs appear to bind directly or indirectly through inter- ligating several sites together and sequencing them all actions with other factors11,22. PbMs have recently been at once — a technique known as seleX-serial analysis used to determine the variations in specificity among of gene expression (seleX-sAGe)35 — has made the most of the 42 bHlH proteins of Caenorhabditis elegans method much more efficient. seleX-sAGe is capa- (including those that form heterodimers), which has ble of obtaining thousands of sequenced binding sites helped to distinguish their functional roles23. The bind- and requires fewer rounds of selection, which makes ing specificities for many mouse TFs have also recently quantitative modelling more accurate35,36. been analysed24. by utilizing new sequencing technologies it is now possible to derive binding energy profiles from seleX Cognate site identifier. A similar method, called the methods quite efficiently using a method called high- cognate site identifier (CsI), also uses arrayed DNA throughput seleX (HT-seleX)37 or bind-and-seq38. An sequences to measure relative binding by TFs; current advantage of this approach is that the output (the number arrays include all possible ten-base-long sequences25,26. of counts observed for each sequence) is digital and there One difference between PbMs and CsIs is that in CsIs, is a very large dynamic range. From a total set of hun- single-stranded DNAs are synthesized that fold back to dreds of thousands or millions of individual sequences form dsDNA binding sites, thereby eliminating the need there will be many nonspecific sites, but they usually only for primer directed DNA synthesis to generate dsDNA, occur once, whereas the highest affinity sites may occur which is required for PbMs (FIG. 1B). The CsI develop- thousands of times. From millions of reads one can esti- ers also have an alternative method of visualizing the mate the binding model — such as a PwM, as well as specificity of the TFs, besides determining PwMs, that nonspecific binding energies and the free-TF concentra- is called ‘specificity landscapes’ and that may provide tion — after a single round of selection and obtain mod- useful additional information in some cases27. CsI fluo- els that fit the data well37. Performing additional rounds rescence intercalation displacement (CsIFID) is a modi- of selection may provide more information about specific fication in which the TF and a DNA-binding fluorescent segments of the energy distribution and may give more dye compete for binding to the DNA; this eliminates the accurate models, particularly for low-specificity TFs or need for tagged proteins or antibodies to the TF and may those with large non-independent contributions (see the simplify the procedure28. section on computational modelling). For example, the input to the second round of selection is enriched for In vitro selection higher affinity sites and the number of nonspecific sites Using purified proteins to select high-affinity binding is reduced. by comparing the distribution of sites before sites from random libraries in vitro is a very powerful and after selection, one can obtain better resolution of the Gel filtration technique. Although invented independently multiple binding energy differences among the high-affinity sites. A permeable gel, such a times, the term seleX seems to be the most commonly Multiple rounds may provide highly accurate specificity polyacrylamide or agarose, is 29–32 (FIG. 1C) used to separate molecules used name . The general strategy is to create models even for low-specificity TFs. and molecular complexes a library of potential binding sites, which may be from Another advantage is that there is no inherent limit based on their size. DNA will randomly synthesized DNA or created from genomic to the length of the binding site that can be selected, migrate faster through a gel sequences. both ends of the library sequences can have although the library size will limit the length that is than the same DNA that is primer binding sites so that they can be amplified by covered comprehensively; using a nanomole of DNA bound to a protein, so the 15 protein–DNA complex can be PCR. Purified TF is added to the library of DNA sites (about 10 individual sequences) nearly all possible separated from the free DNA. and the bound and unbound sequences are separated 25-base-long potential binding sites are included. Of by various means, such as gel filtration, filter binding or course the number of sites that are sequenced is lim- Barcoding binding to immobilized protein. Although higher affinity ited to about 108, but they can be selected from an enor- The process of adding the same unique DNA sequence sites have a higher probability of being bound by the TF, mously larger set of possible binding sites so that one to the ends of DNA molecules after a single selection most of the bound sequences are can study TFs with very long binding sites, such as bac- from the same experiment, so still low affinity because they greatly exceed the number terial TFs that commonly have sites longer than 16 base that the resulting DNA reads of high-affinity sequences. To increase the fraction of pairs. A recent publication39 has pushed this approach can be traced back to the high-affinity sites, the bound fraction can be amplified much further. Using tagged proteins the authors per- experiment that generated them, allowing several and rebound and those steps repeated as many times as formed HT-seleX from cell extracts (rather than puri- experiments to be sequenced needed. Typically, after several rounds, the selected sites fying the TF) and by barcoding individual experiments simultaneously (multiplexed). would be cloned and sequenced, often obtaining fewer they collected binding sites for several TFs in parallel.

756 | NOveMbeR 2010 | vOlUMe 11 www.nature.com/reviews/genetics © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS

In total they obtained binding site data for 19 different of the binding site, up to some saturation level, one can TFs, many of which have low specificity and required treat the counts similarly to HT-seleX or PbM data and multiple rounds of selection. This demonstrated that obtain good quantitative models (see below). HT-seleX can have very high throughput and gener- ate enormous amounts of specificity data quite rapidly. Computational modelling of specificity The optimal method for extracting specificity models Given the data produced by each of the experimental from the data, especially after multiple rounds of selec- methods one can model the specificity using computa- tion, remains an open question (see the section on tional approaches (BOX 2). One can ask why a model is computational modelling). needed if the affinities to all possible binding sites can be determined directly. However, even in cases in which Bacterial one-hybrid selections the binding sites are short enough that all possible vari- bacterial one-hybrid (b1H) selections are not in vitro ations can be measured, there are still some advantages assays (unlike the methods described above) and can to having a model. The model parameters are averaged be used for any TF that can be cloned and expressed over many independent measurements and can reduce in Escherichia coli; this method has the advantage that the uncertainty for any particular sequence compared the TF need not be purified or synthesized in vitro40–42. to the raw data, which are often fairly noisy. A simple The approach uses a library of randomized binding sites model provides an easy method for scanning a genome upstream of a weak promoter that drives the expression of and predicting the most likely binding sites as well as a selectable gene, typically the yeast HIS3 (which encodes the effects of genetic variations. A model has fewer para- a component of the histidine biosynthesis pathway; the meters than the entire data set, and the complexity of E. coli strain lacks the bacterial homologue) (FIG. 1D). the model required for a good fit to the data can provide when the cells are grown in medium lacking histidine, insight into the recognition mechanism. For example, if expression of the HIS3 gene is required for growth and a simple PwM (BOX 2) fits the data very well, this indi- the stringency of the selection can be modulated by the cates that the protein binds in an additive manner in addition of 3-amino-1,2,4-triazole (3AT), an inhibitor of which each position contributes independently to the the HIS3 enzyme. The TF is fused to the non-essential ω binding energy. If such a model does not fit the data well, subunit of RNA polymerase, so that TF binding recruits it indicates more complex interactions between the pro- RNA polymerase and increases promoter activity. tein and the DNA. simple models lend themselves to In published work the binding sites from selected predicting the effects of variants, both in the binding colonies were sequenced individually, typically obtain- sites and also in the protein itself, and can facilitate the ing about 50 sequences for each selection42,43. However, design of proteins with novel specificity. the application of new sequencing methods allows one Consider the example of studying the bHlH proteins to collect all of the cells on the plate — including the using MITOMI12. A simple additive model assumes large colonies, the microcolonies and even cells that that the binding energy changes at each position can be were plated but did not grow — and sequence them all summed to provide the binding energies to any sequence en masse, obtaining millions of binding sites for each with multiple changes. Under such a model one would

experiment (G.D.s., unpublished observations). The only need to determine the Kd of the consensus sequence high-affinity sites lead to more HIS3 expression and so and each single base variant of that sequence, a total allow the cells with those sites to grow the fastest, result- of 3l + 1 measurements for a binding site of length ing in a higher number of sequence reads. Therefore, l. Maerkl and Quake concluded that such an additive coupling b1H to high-throughput sequencing provides model does not fit the data very well for the bHlH TFs — similar to HT-seleX — a digital read-out with a very they studied. However, we showed44 that a simple exten- large dynamic range, the highest affinity sites occurring sion to the simple additive model fits the data well, within hundreds to thousands of times and the lowest affinity the variance of the measurements. The simple extension sites typically occurring only once or not at all. It can also is that energy changes are needed for each adjacent dinu- be multiplexed so that the data from several experiments, cleotide pair but not of more distant pairs; the binding for different TFs or under different selection stringencies seems to be additive over adjacent dinucleotides. For the can be obtained in parallel. four-base-long half sites the extended model requires In b1H there is no real limit to the length of the ran- only 40 measurements to capture all of the important domized site that can be selected for, although the total variables (the nonspecific binding energy may need to be library size is limited by the efficiency of transforma- determined separately44). In fact a TF that binds to a ten- tion into E. coli to about 108 independent sites. That is base-long site could be assayed directly for binding to enough to include all possible 12-base-long sequences, the preferred site and all possible adjacent dinucleotide but it is also possible to select from incomplete librar- variants with only 112 different sequences. This opens 42,43 Multiplexing ies of longer randomized regions . Using the data to up a method like MITOMI to comprehensive studies of The process of mixing several model the binding specificity is not as straightforward as TFs with much longer binding sites than have previously experimental samples together with PbM or HT-seleX data because the counts are not been studied, provided that the preferred binding site is and sequencing them (or some directly related to binding affinity but are confounded by known. This is also within the range of existing array other process) simultaneously. 14 each sample can be barcoded the effects of selection for cell growth over many genera- sizes for sPR methods . At this time it is not known to recover the information tions. by making some simplifying assumptions about what fraction of TFs can be well modelled using adjacent about its experimental origin. the growth rate being proportional to the TF occupancy base-pair energy contributions.

NATURe RevIews | Genetics vOlUMe 11 | NOveMbeR 2010 | 757 © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS

The other three methods described above do not leads to very similar parameters when each data set is determine affinities directly but rather measure some- analysed using the same method for estimating binding thing related to affinity: the fluorescence intensities energies37 (FIG. 2d–f), and in each case they fit the experi- for microarray-based methods and the counts of indi- mental data well (FIG. 2g–i). This suggests that the choice vidual binding sites for in vitro selection and bacterial of method should depend on factors other than one-hybrid methods (FIG. 2a–c). The data are usually col- the quality of the resulting data, such as the length of the lected at a fixed concentration of the TF, although the binding site, the ease of purifying or synthesizing the TF experiment can be repeated at different concentrations. and the cost of sequencing compared to microarrays. even if the total concentration of the TF is known this is not the concentration of free TF, which depends on Conclusions both the concentration of TF and the concentrations and The specificity of TFs — their ability to discriminate affinities of all the potential binding sites. It is the free between different sequences — is essential to the proper concentration of TF that is needed to relate the binding functioning of regulatory networks and gene expres- energy to the binding probability (BOXeS 1,2). sion. Recent technological advances in several different The raw data provide relative binding probabilities for approaches have greatly increased the speed at which PbM and seleX data but are related to the growth rate quantitative affinity measurements can be obtained for for b1H data. If one uses simplifying assumptions that many different sequences. Methods such as MITOMI the growth rate is proportional to the occupancy of the and sPR can determine binding affinities directly but binding site, up to a certain saturation level (or the point have relatively low throughput compared to the other at which the concentration of HIS3 is no longer limiting methods described here and require specialized equip- for growth) b1H data can be treated very similarly to PbM ment. Using purified or in vitro synthesized proteins, and seleX data, which are equivalent to the curves on PbM and CsI can be used on microarray equipment the three-dimensional plot of BOX 1 at a fixed value of the that is commonly available and these approaches have concentration of TF ([TF]). with sufficient data, and by many of the attributes of microarrays for gene expres- employing a model of how the binding energy depends sion studies. The cost of microarrays is now quite low, on the sequence (such as a PwM), it is possible to esti- which is one advantage of these approaches. HT-seleX mate all of the parameters of the model, including the also requires purified TFs, although recent results show [TF] and the nonspecific binding energy37. Models with that it can also be performed on cellular extracts of tagged various complexities can be compared to see which fits proteins39. The results are counts of individual binding the data the best, taking into account the number of esti- sites, which give HT-seleX a large dynamic range and mated parameters; more complex models should always allow assay binding sites of any length. single rounds of provide a better fit but there may be over-fitting (that is, selection are often sufficient, but low-specificity TFs may fitting the experimental noise) if there are excess para- require multiple rounds to determine good models. b1H meters. For example, one could test whether using energy methods do not require purified protein and are per- terms for adjacent dinucleotides provides a substantially formed by simple selections in bacterial cells. Unlimited better fit than a simple PwM, as was observed for the site sizes can be selected, although library sizes are limited bHlH proteins using MITOMI12,37. As noted above, by transformation efficiencies. For both HT-seleX and the UniProbe database of the bulyk group21 reports b1H only standard laboratory equipment, and access to PwMs for each TF that they have analysed using PbMs high-throughput sequencing facilities, which are now and also 8-mer enrichment scores, which are another widely available, are needed. form of model for the specificity of the TF. Using all 8-mer In the few years that these methods have been used, scores produces a binding model with 32,896 parameters information has rapidly accumulated regarding the (all unique 8-mers accounting for both orientations), specificities of human TFs as well as the TFs of several but in general only a small number of highly enriched important model organisms. Many are still unknown, 8-mers are used to define the specificity. we have found but in the coming years our knowledge of TF specifici- that in most cases a PwM, perhaps extended to include ties should expand enormously. That information will be adjacent dinucleotides, fits the PbM data as well as 8-mer very useful for the next phase of elucidating regulatory enrichment scores (G.D.s. and Y.Z., unpublished observa- networks: the identification of interacting TFs in vivo and tions), but there can be exceptions that indicate complex the combinations of TFs that control specific genes and interactions, perhaps multiple modes of binding. pathways. Comparison of the intrinsic specificities with FIGURe 2 shows the data and results for PbM, HT-seleX location experiments, such as ChIP–chip and ChIP–seq, and b1H using the same protein, the zinc finger protein will help in that analysis. In addition, as more specificity ZIF268 (also known as early growth response protein 1). information is obtained it will facilitate the prediction of As with other microarrays, PbMs tend to have higher specificities of proteins without experimental analysis backgrounds than the digital counting methods, but there and aid the design of new proteins with novel specifici- is still a very strong signal with the highest intensities being ties27,42,46,47. There is already considerable effort devoted about 15-fold higher than the mean background (about 65 to the design of zinc finger proteins coupled with nucle- standard deviations above the background distribution). ases to target specific locations in the genome48. As more ZIF268 has previously been shown to be well represented specificity information emerges for more families of TF by a simple PwM, with only a small energy contribution proteins, it is likely that many more can be modelled well from adjacent dinucleotides45. Furthermore, each method enough to develop design criteria.

758 | NOveMbeR 2010 | vOlUMe 11 www.nature.com/reviews/genetics © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS

a b c

200 3,000 1e+05 0 12 200 3,000 150 2,500 100 8e+04 150 80

y y y 2,000 100 60 100 2,000 6e+04 40

equenc 50 equenc equenc 1,500 50 Fr Fr Fr 20 4e+04 0 0 1,000 0 1,000 0 200 400 600 800 000 1,000 2, 3,000 4,000 5,000 50,000 100,000 150,000 200,000 250,000 300,000 2e+04 500

0 0e+00 0 050,000 100,000 150,000 200,000 250,000 0 200 400 600 800 0 1,0002,000 3,000 4,000 8-mer median intensityRead countRead count

d 2 e 2 f 2

1.5 1.5 1.5 ontent ontent ontent

1 1 1 tion c tion c tion c orma orma orma Inf Inf Inf 0.5 0.5 0.5

0 0 0 1234567891012345678910 123456 Position Position Position

g h i R-squared = 0.75 R-squared = 0.92 R-squared = 0.91

250,000 600 4,000 y 500 200,000 3,000 ounts 400 ounts

150,000 ed c ed c

300 2,000

100,000 Observ Observ

8-mer median intensit 200 1,000 50,000 100

0 0 0 50,000 100,000 150,000 200,000 250,000 0 100 200 300 400 500 600 0 1,0002,000 3,000 4,000 PWM predicted 8-mer median intensity PWM predicted counts PWM predicted counts

Figure 2 | comparison of data and analysis from three methods. a | Histogram of fluorescent Naintensitiesture Revie fromws | Genetics protein-binding microarray (PBM) data for the zinc finger transcription factor ZIF268 (also known as early growth response protein 1). b | Histogram of counts from high-throughput (HT)-SELEX data for ZIF268 (the total count is about 260,000). c | Histogram of counts from bacterial one-hybrid (B1H) data for ZIF268 in which just the first six positions were randomized (G.D.S., unpublished observations) (experimental conditions were 1 mM 3-amino-1,2,4-triazole, 50 μM isopropyl β-d-1-thiogalactopyranoside (IPTG), the total count is approximately 160,000). d | Logo of position weight matrix (PWM) for data shown in part a. e | Logo of PWM for data shown in part b. f | Logo of PWM for data shown in part c. g | Fit of the predicted to observed 8-mer median intensities using the model of part d on the data from part a. h | Fit of the predicted to observed counts using the model of part e on the data from part b. i | Fit of the predicted to observed counts using the model of part f on the data of part c. Data in part a are from ReF. 16. Data in part b are from ReF. 37.

NATURe RevIews | Genetics vOlUMe 11 | NOveMbeR 2010 | 759 © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS

Note added in proof the analysis55. This increases the throughput consider- A recent paper extended the capability of the MITOMI ably over the initial method, but the throughput is still approach by increasing the number of chambers in the substantially less than when all ten-base-long sites are array and by binding to longer oligonucleotides, such included in PbM and CsI, or than for the even longer that all eight-base-long binding sites are included in selected sites allowed by HT-seleX and b1H.

1. Farnham, P. J. Insights from genomic profiling of 22. Zhu, C. et al. High-resolution DNA-binding specificity 41. Meng, X. & Wolfe, S. A. Identifying DNA sequences transcription factors. Nature Rev. Genet. 10, 605–616 analysis of yeast transcription factors. Genome Res. recognized by a transcription factor using a bacterial (2009). 19, 556–566 (2009). one-hybrid system. Nature Protoc. 1, 30–45 (2006). 2. Li, X. Y. et al. Transcription factors bind thousands of 23. Grove, C. A. et al. A multiparameter network reveals 42. Noyes, M. B. et al. Analysis of homeodomain active and inactive regions in the Drosophila extensive divergence between C. elegans bHLH specificities allows the family-wide prediction of blastoderm. PLoS Biol. 6, e27 (2008). transcription factors. Cell 138, 314–327 (2009). preferred recognition sites. Cell 133, 1277–1289 3. Zhang, X. et al. Genome-wide analysis of cAMP-response 24. Badis, G. et al. Diversity and complexity in DNA (2008). element binding protein occupancy, phosphorylation, recognition by transcription factors. Science 324, 43. Noyes, M. B. et al. A systematic characterization of and target gene activation in human tissues. 1720–1723 (2009). factors that regulate Drosophila segmentation via a Proc. Natl Acad. Sci. USA 102, 4459–4464 (2005). 25. Puckett, J. W. et al. Quantitative microarray profiling bacterial one-hybrid system. Nucleic Acids Res. 36, 4. Madan Babu, M., Teichmann, S. A. & Aravind, L. of DNA-binding molecules. J. Am. Chem. Soc. 129, 2547–2560 (2008). Evolutionary dynamics of prokaryotic transcriptional 12310–12319 (2007). 44. Stormo, G. D. & Zhao, Y. Putting numbers on the regulatory networks. J. Mol. Biol. 358, 614–633 (2006). 26. Warren, C. L. et al. Defining the sequence-recognition network connections. Bioessays 29, 717–721 (2007). 5. Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A. profile of DNA-binding molecules. Proc. Natl Acad. Sci. 45. Benos, P. V., Bulyk, M. L. & Stormo, G. D. & Luscombe, N. M. A census of human transcription USA 103, 867–872 (2006). Additivity in protein–DNA interactions: how good factors: function, expression and evolution. This paper introduced the CSI method and an approximation is it? Nucleic Acids Res. 30, Nature Rev. Genet. 10, 252–263 (2009). its application to TFs as well as small 4442–4451 (2002). 6. Kharchenko, P. V., Tolstorukov, M. Y. & Park, P. J. Design DNA-binding molecules. 46. Alleyne, T. M. et al. Predicting the binding preference and analysis of ChIP-seq experiments for DNA-binding 27. Carlson, C. D. et al. Specificity landscapes of DNA of transcription factors to individual DNA k-mers. proteins. Nature Biotech. 26, 1351–1359 (2008). binding molecules elucidate biological function. 25, 1012–1018 (2009). 7. Park, P. J. ChIP-seq: advantages and challenges Proc. Natl Acad. Sci. USA 107, 4544–4549 (2010). 47. Benos, P. V., Lapedes, A. S. & Stormo, G. D. of a maturing technology. Nature Rev. Genet. 10, 28. Hauschild, K. E., Stover, J. S., Boger, D. L. & Probabilistic code for DNA recognition by proteins 669–680 (2009). Ansari, A. Z. CSI-FID: high throughput label-free of the EGR family. J. Mol. Biol. 323, 701–727 8. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. detection of DNA binding molecules. Bioorg Med. (2002). Genome-wide mapping of in vivo protein–DNA Chem. Lett. 19, 3779–3782 (2009). 48. Cathomen, T. & Joung, J. K. Zinc-finger nucleases: interactions. Science 316, 1497–1502 (2007). 29. Oliphant, A. R., Brandl, C. J. & Struhl, K. Defining the the next generation emerges. Mol. Ther. 16, 9. Pepke, S., Wold, B. & Mortazavi, A. Computation for sequence specificity of DNA-binding proteins by 1200–1207 (2008). ChIP-seq and RNA-seq studies. Nature Methods 6, selecting binding sites from random-sequence 49. Schneider, T. D., Stormo, G. D., Gold, L. & S22–S32 (2009). oligonucleotides: analysis of yeast GCN4 protein. Ehrenfeucht, A. Information content of binding sites 10. Ji, H. et al. An integrated software system for analyzing Mol. Cell. Biol. 9, 2944–2949 (1989). on nucleotide sequences. J. Mol. Biol. 188, 415–431 ChIP-chip and ChIP-seq data. Nature Biotech. 26, 30. Tuerk, C. & Gold, L. Systematic evolution of ligands (1986). 1293–1300 (2008). by exponential enrichment: RNA ligands to 50. Stormo, G. D. & Fields, D. S. Specificity, free energy 11. Gordan, R., Hartemink, A. J. & Bulyk, M. L. bacteriophage T4 DNA polymerase. Science 249, and information content in protein–DNA interactions. Distinguishing direct versus indirect transcription 505–510 (1990). Trends Biochem. Sci. 23, 109–113 (1998). factor–DNA interactions. Genome Res. 19, 31. Blackwell, T. K. & Weintraub, H. Differences and 51. Schneider, T. D. & Stephens, R. M. Sequence logos: 2090–2100 (2009). similarities in DNA-binding preferences of MyoD a new way to display consensus sequences. 12. Maerkl, S. J. & Quake, S. R. A systems approach and E2A protein complexes revealed by binding Nucleic Acids Res. 18, 6097–6100 (1990). to measuring the binding energy landscapes of site selection. Science 250, 1104–1110 (1990). 52. Stormo, G. D., Schneider, T. D. & Gold, L. transcription factors. Science 315, 233–237 (2007). 32. Wright, W. E., Binder, M. & Funk, W. Quantitative analysis of the relationship between This paper introduced the MITOMI method and Cyclic amplification and selection of targets (CASTing) nucleotide sequence and functional activity. demonstrated its application on four bHLH TFs. for the myogenin consensus binding site. Mol. Cell. Nucleic Acids Res. 14, 6661–6679 (1986). 13. Paul, S., Vadgama, P. & Ray, A. K. Surface plasmon Biol. 11, 4104–4110 (1991). 53. Lee, M. L., Bulyk, M. L., Whitmore, G. A. & resonance imaging for biosensing. IET 33. Fields, D. S., He, Y., Al-Uzri, A. Y. & Stormo, G. D. Church, G. M. A statistical model for investigating Nanobiotechnol. 3, 71–80 (2009). Quantitative specificity of the Mnt repressor. binding probabilities of DNA nucleotide 14. Shumaker-Parry, J. S., Aebersold, R. & Campbell, C. T. J. Mol. Biol. 271, 178–194 (1997). sequences using microarrays. Biometrics 58, Parallel, quantitative measurement of protein binding 34. Liu, X., Noll, D. M., Lieb, J. D. & Clarke, N. D. 981–988 (2002). to a 120-element double-stranded DNA array in real DIP-chip: rapid and accurate determination of 54. Djordjevic, M., Sengupta, A. M. & Shraiman, B. I. time using surface plasmon resonance microscopy. DNA-binding specificity. Genome Res. 15, 421–427 A biophysical approach to transcription factor Anal. Chem. 76, 2071–2082 (2004). (2005). binding site discovery. Genome Res. 13, 2381–2390 15. Campbell, C. T. & Kim, G. SPR microscopy and its 35. Roulet, E. et al. High-throughput SELEX SAGE (2003). applications to high-throughput analyses of method for quantitative modeling of transcription- 55. Fordyce, P. M. et al. De novo identification and biomolecular binding events and their kinetics. factor binding sites. Nature Biotech. 20, 831–835 biophysical characterization of transcription-factor Biomaterials 28, 2380–2392 (2007). (2002). binding sites with microfluidic affinity analysis. A review of SPR methods and applications, 36. Nagaraj, V. H., O’Flanagan, R. A. & Sengupta, A. M. Nature Biotech. 28, 970–975 (2010). including the study of protein–DNA interactions. Better estimation of protein–DNA interaction 16. Berger, M. F. et al. Compact, universal DNA parameters improve prediction of functional sites. Acknowledgements microarrays to comprehensively determine BMC Biotechnol. 8, 94 (2008). We thank S. Wolfe for providing the plasmids and cells for the transcription-factor binding site specificities. 37. Zhao, Y., Granas, D. & Stormo, G. D. Inferring binding B1H experiments and useful comments on the manuscript. Nature Biotech. 24, 1429–1435 (2006). energies from selected binding sites. PLoS Comput. We thank members of the Stormo Laboratory for comments This paper introduced the universal PBM that Biol. 5, e1000590 (2009). on the manuscript and especially R. Christensen, D. Granas, includes all possible ten-base-long binding sites Introduction of HT-SELEX and the maximum Z. Zuo and L. Schriefer for providing the data from the B1H and its application on several TFs. likelihood method ‘binding energy estimates using selections with ZIF268. 17. Mukherjee, S. et al. Rapid analysis of the DNA-binding maximum likelihood’ (BEEML) for obtaining binding specificities of transcription factors with DNA energy models from the data. Competing interests statement microarrays. Nature Genet. 36, 1331–1339 (2004). 38. Zykovich, A., Korf, I. & Segal, D. J. The authors declare no competing financial interests. 18. Bulyk, M. L., Huang, X., Choo, Y. & Church, G. M. Bind-n-Seq: high-throughput analysis of in vitro Exploring the DNA-binding specificities of zinc fingers protein-DNA interactions using massively parallel with DNA microarrays. Proc. Natl Acad. Sci. USA 98, sequencing. Nucleic Acids Res. 37, e151 (2009). FURTHER INFORMATION 7158–7163 (2001). 39. Jolma, A. et al. Multiplexed massively parallel Gary D. Stormo’s homepage: http://ural.wustl.edu 19. Berger, M. F. & Bulyk, M. L. Universal protein-binding SELEX for characterization of human transcription Berkeley database of Drosophila TF specificities: microarrays for the comprehensive characterization of factor binding specificities. Genome Res. 20, http://bdtnp.lbl.gov/Fly-Net/ the DNA-binding specificities of transcription factors. 861–873 (2010). Database of Drosophila TF DNA-binding specificities: Nature Protoc. 4, 393–411 (2009). This study describes the use of HT-SELEX in http://pgfe.umassmed.edu/TFDBS 20. Philippakis, A. A., Qureshi, A. M., Berger, M. F. & parallel to determine the binding specificities JASPAR database of TF binding specificities: Bulyk, M. L. Design of compact, universal DNA of several human TFs. http://jaspar.genereg.net microarrays for protein binding microarray 40. Meng, X., Brodsky, M. H. & Wolfe, S. A. A bacterial PAZAR database: http://www.pazar.info experiments. J. Comput. Biol. 15, 655–665 (2008). one-hybrid system for determining the DNA-binding SELEX-SAGE database: http://www.isrec.isb-sib.ch/htpselex 21. Newburger, D. E. & Bulyk, M. L. UniPROBE: an online specificity of transcription factors. Nature Biotech. 23, UniProbe database: database of protein binding microarray data on 988–994 (2005). http://the_brain.bwh.harvard.edu/uniprobe protein-DNA interactions. Nucleic Acids Res. 37, The introduction of an efficient B1H approach All links Are Active in the online pDf D77–D82 (2009). for determining TF binding specificities.

760 | NOveMbeR 2010 | vOlUMe 11 www.nature.com/reviews/genetics © 2010 Macmillan Publishers Limited. All rights reserved