
On the Concept of cis-Regulatory Information: From Sequence Motifs to Logic Functions

Dedicated to Grzegorz Rozenberg's 65th Birthday

Ryan Tarpine∗† Sorin Istrail∗‡

Abstract The regulatory genome is about the system level organization of the core genomic regulatory apparatus, and how this is the locus of causality underlying the twin phenomena of animal development and animal evolution [8]. Information processing in the regulatory genome is done through regulatory states, defined as sets of transcription factors (sequence-specific DNA binding proteins which determine gene expression) that are expressed and active at the same time. The core information processing machinery consists of modular DNA sequence elements, called cis-modules, that interact with transcription factors. The cis-modules read the information contained in the regulatory state of the cell through binding, process it, and directly or indirectly communicate with the basal transcription apparatus to determine gene expression. This endowment of each gene with the information-receiving capacity through its cis-regulatory modules is essential for the response to every possible regulatory state to which it might be exposed during all phases of the life cycle and in all cell types.

We present here a set of challenges addressed by our CYRENE research project aimed at studying the cis-regulatory code of the regulatory genome. The CYRENE Project is devoted to (1) the construction of a database, the cis-Lexicon, containing comprehensive information across species about experimentally validated cis-regulatory modules; and (2) the software development of a next-generation genome browser, the cis-Browser, specialized for the regulatory genome. The presentation is anchored on three main computational challenges: the Gene Naming Problem, the Bottleneck Problem, and the Logic Function Inference Problem.

∗Department of Computer Science and Center for Computational Molecular Biology, Brown University, 115 Waterman Street, Box 1910, Providence, RI 02912, USA
†[email protected]
‡[email protected]

1 Introduction

Gene expression is regulated largely by the binding of transcription factors to genomic sequence. Once bound, these factors interact with the transcription apparatus to activate or repress the gene. Unlike a restriction enzyme, which recognizes a single sequence or clearly defined set of sequences, a transcription factor binds to a family of similar sequences with varying strengths and effects. While the similarity between sequences bound by a single factor is usually obvious at a glance, there is as yet no reliable sequence-only method for determining functional binding sites. Even the seemingly conservative method of looking for exact sequence matches to known binding sites yields mostly false positives: since binding sites are small, usually less than 15 bases long, by chance alone millions of sequence matches are scattered throughout any genome. Very few of these have any influence on gene regulation. The activity of a putative binding site depends on more than the site sequence, and it is hoped that the site context (i.e., the surrounding sequence) contains the necessary information.

To predict new binding sites, we must understand the biology of gene regulation and incorporate the missing sources of regulatory information into our model. To this end, we have created a database for storing the location and function of all experimentally found and validated binding sites, the cis-Lexicon. The cis-Browser, a powerful visual environment for integrating many types of known genomic information and experimental data, together with the cis-Lexicon make up the core components of the CYRENE Project: a software suite, associated tools, and set of resources for regulatory genomics. Only by mining a large, high-quality collection of functional binding sites can we uncover the missing clues into what makes sites functional.

We have come across many fundamental issues as the CYRENE Project has developed. We crystallized three of these under the names the Gene Naming Problem, the Consensus Sequence Bottleneck Problem, and the Logic Function Inference Problem.
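As a rough illustration of why exact matching against known sites yields mostly false positives, the following sketch estimates how often a short site is expected to occur by chance. It assumes a genome of independent, equally likely bases, which real genomes are not; the function name and the genome size used in the note below are hypothetical and chosen only for illustration.

```python
def expected_chance_matches(genome_length, site_length, both_strands=True):
    """Expected number of exact occurrences of one fixed site in random sequence.

    Assumption: each base is A, C, G, or T with probability 1/4, independently.
    """
    # Number of start positions at which a site of this length can fit.
    positions = max(genome_length - site_length + 1, 0)
    # A site can be read on either strand of the DNA.
    strands = 2 if both_strands else 1
    # Each position matches a fixed site with probability (1/4) ** site_length.
    return strands * positions / 4 ** site_length
```

Under these assumptions, a single 8-base site in a hypothetical 800 Mb genome is expected about 24,000 times across both strands, so a collection of a hundred or more such sites would already produce millions of chance matches.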

2 A Case Study

A gene with one of the best-understood cis-regulatory regions is endo16 of the sea urchin Strongylocentrotus purpuratus [6]. A 2.3 kb sequence directly upstream of the transcription start site has been shown to contain binding sites for many transcription factors, clustered into six cis-regulatory modules [26], as seen in figure 1. Each module has its own independent function: Modules A and B drive expression in different periods of development, Modules DC, E, and F cause repression in certain regions of the embryo, and Module G is a general expression booster at all times. Each module has a similar organization: one or two binding sites for unique factors which don't bind anywhere else in the regulatory region, surrounded by several sites for factors which are found in other modules as well. Module A has been described as a hardwired logic processor, whose output in terms of gene expression depends entirely on the binding of its inputs. The factor which binds uniquely to Module A, Otx, is the driver of the module, meaning that the change in the output of the module over time is a result of the change in the concentration of Otx [24]. The other factors which bind to Module A, while absolutely necessary for correct behavior, are ubiquitous and typically do not play a role in its output changing.

Figure 1: The structure of endo16's regulatory region (from [25])

Figure 2: Quintessential diagram (from [25])

The precise structure of a regulatory region is visualized in a "quintessential diagram," as seen in figure 2. We call such an image the quintessential diagram because it displays at a glance the most fundamental knowledge we have of the organization of a gene's regulatory region. When searching for papers with information relevant to our database, we found that no useful paper lacks this diagram.

The sites within Module A carry out a complex logic program. Sites CG1 and P (highlighted green in figure 3) communicate the effects of Module B to the transcription apparatus: if either of them is disabled (by mutating the sequence), the gene expression is as if Module B were not present at all. Site Z (highlighted red) communicates the effects of Modules DC, E, and F. Sites CG2, CG3, and CG4 (highlighted blue) together boost the final output of the module. If even one of them is disabled, then the boost is completely removed.

Module B was shown to have a similar complexity [25]. In fact, all cis-regulatory modules execute information processing of their inputs, and the sum total of the computations performed by the regulatory regions of individual genes, when considered within the complex network established by the dependencies among genes, forms a "genomic computer" [13].
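The site behavior described above for Module A can be written as a small toy function. This is a simplified sketch, not the published model of [25]: the function and parameter names are hypothetical, the additive combination of the Otx and Module B inputs is an assumption, and the boost factor of 2.0 is only an illustrative default.

```python
def module_a_output(otx_input, module_b_input, upstream_repression,
                    linking_sites_intact=True, z_site_intact=True,
                    boosting_sites_intact=True, boost_factor=2.0):
    """Toy output of Module A at one time point, in arbitrary units."""
    # Module B is "heard" only if the linking sites (CG1 and P) are both intact.
    linked_b = module_b_input if linking_sites_intact else 0.0
    # Assumption: the driver input and the linked Module B input simply add.
    combined = otx_input + linked_b
    # Repression from Modules DC, E, and F is relayed through site Z.
    if upstream_repression and z_site_intact:
        combined = 0.0
    # The boost is all-or-nothing: one disabled boosting site removes it.
    factor = boost_factor if boosting_sites_intact else 1.0
    return factor * combined
```

Writing the module this way makes the perturbation results easy to read off: disabling a linking site gives the same output as removing Module B, and disabling site Z makes the module blind to upstream repression.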

Figure 3: Computational logic model for Modules A and B of endo16 (from [25])

3 The Gene Naming Problem

[...]
Charles Darwin [29]

Scientists would rather share a toothbrush than use the same name for the same gene.
Anonymous

As one tries to compile the information from the literature into a database for automatic analysis, one of the issues that immediately crops up is that of naming. If information was inserted verbatim, then in order to find something one would need to use the exact terminology of the author. It would also be impossible to bring related information together automatically, unless large translation tables were compiled manually, which would suffer from problems with both completeness and precision. Completeness, because the addition of a new term in the database would require a new entry in the translation table, which could easily be left out by accident. Precision, since many terms are ambiguous when taken out of context even if they were completely obvious in the original setting. Errors in completeness lead to missed relationships, while errors in precision lead to spurious ones.

Therefore, it is clear that a standard set of terms must be developed that all of the descriptions given in the literature can be translated to at the time the information is added to the database. This does lead to some loss of information, as the translation will always be imperfect, but the set of terms can always be expanded if systematic problems are noticed. A prime example of such a controlled vocabulary is the Gene Ontology, which describes gene and gene product attributes [2]. In the course of developing the cis-Lexicon, we composed standard terms for describing regulatory function, such as activation, repression, and cis-regulatory module linkage, called the Cis-Regulatory Ontology. We are in the process of defining a similar vocabulary for describing the functional interactions among transcription factor binding sites.

In all of the above cases, the translation of terms to a standard vocabulary is relatively straightforward, because the properties described are clearly distinct and understood, regardless of the terminology: if one paper says a transcription factor "increases" expression and another says a factor "boosts" expression, it is very clear that both phrases refer to our canonical term "activation." A much more difficult problem is the wide variety of names given to a single gene in the literature. A peek into GenBank for a well studied gene often yields more than 15 aliases; see figure 4 for the synonyms for dl and eve of D. melanogaster displayed in the cis-Browser. To attempt to use tables of gene synonyms to determine which gene is really meant by a name can only lead to the troubles outlined in our discussion of translation tables earlier.

Figure 4: Synonyms for dl and eve

The Gene Naming Problem arises from the many techniques biologists use to study genes: examining a gene from the phenotype of a mutation, linkage map, or similarity to other known genes all yield different views and consequently different names. Given the genomic sequence, the obvious unifier would be the gene's locus, but this is not always known. In addition, even the definition of a gene and its locus is in flux [11]. When a gene is named according to its similarity with other known genes, different conventions lead to similar but separate names, such as oct-3 versus oct3, which can both be found in the literature. One method which is less widespread is to prefix a homolog with an indicator of the new species, such as spgcm being a homolog of gcm in D. melanogaster. This method can be impaired by the difficulty in determining what qualifies as a new species: at least 26 definitions have been published [29].

Occasionally, different transcription factors can bind to the exact same sequence. This is especially likely for factors in the same family which share an evolutionary history: the DNA binding motif may not have diverged much between them. In this case, when a sequence is found to be a functional site, for example by a gel shift assay or perturbation analysis, it is not clear what factor is actually binding there. It's also possible for two putative binding sites to overlap, where it is not clear whether only one or both play a role in gene regulation. This can play havoc in determining the function of individual sites.

When biologists attempt to determine which factor is binding to a site by finding the molecular weights of proteins isolated by gel shifts or other similar techniques, often several weights are found. This can be caused by measuring different subunits of a single protein, alternative splicings of a single gene, different translation products from a single mRNA molecule, or even completely distinct factors. For example, when the structure of endo16 was first mapped, 8 of the 14 inputs tentatively identified were found with more than one molecular weight [26]. Only by cloning the cDNAs encoding the factors can it be determined whether the multiple weights are from the same gene or not.

The molecular weight can be thought of as a type of hash code for a protein: its value depends on the protein's sequence, but contains less information. If two proteins have a slightly different weight, they can still be from totally unrelated genes and their sequences can be very different. Analogously, even protein products from the same gene can have very different weights, if they differ in terms of splicing (entire exons added or subtracted, which might not affect the final function a great deal). Biologists depend, however, on the assumption that proteins from different genes will have different weights. This may be true with very high probability, but it cannot be universally true, especially since our methods of measurement are always imperfect. This is similar to how unique identifiers are created and used in computer science: IDs are given by random number generators from a large enough range (e.g., 64-bit numbers) that the probability of a clash is almost zero.
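The hash-code analogy can be made concrete with a minimal sketch. It assumes approximate average residue masses for a handful of amino acids and rounds the result to mimic limited measurement resolution; the table, the function name, and the peptide sequences are illustrative only and are not taken from the endo16 studies.

```python
# Approximate average residue masses in daltons (illustrative subset only).
RESIDUE_MASS = {"G": 57.05, "A": 71.08, "S": 87.08,
                "V": 99.13, "L": 113.16, "K": 128.17}
WATER_MASS = 18.02  # one water molecule is added for the free chain termini


def molecular_weight(protein):
    """Lossy 'hash' of a protein: its weight, rounded to 0.1 Da."""
    total = sum(RESIDUE_MASS[residue] for residue in protein) + WATER_MASS
    return round(total, 1)


# Two different sequences with the same composition collide on weight.
same_weight = molecular_weight("GASVLK") == molecular_weight("KLVSAG")
# A shorter product of the same hypothetical gene has a different weight.
isoform_differs = molecular_weight("GAS") != molecular_weight("GASVLK")
```

The weight depends only on composition, so any rearrangement of the same residues gives an identical value, while a product that lacks part of the sequence gives a different one. Both failure modes described above follow directly from how little of the sequence the weight retains.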

4 The Consensus Sequence Bottleneck Problem

[E]ssentially all predicted transcription-factor (TF) binding sites that are generated with models for the binding of individual TFs will have no functional role.
Wasserman and Sandelin [23]

Given the wide variety of sequences that a transcription factor binds to, there is no straightforward yet accurate model. The earliest model, the consensus sequence, while still preferred by most biologists, suffers from being forced to choose either selectivity (matching only known site sequences) or sensitivity (matching all known site sequences) [21]. As almost any pair of binding sites for a single factor differ in at least one base (see [19] for a study where out of seven binding sites for the same factor, only two sites have identical sequence and only three match the known consensus in all positions), the consensus sequence model allows for differences in two ways: (1) permitting a range of bases in certain positions through symbols such as Y (allowing C, T, or U); and (2) permitting a set number of global mismatches, such as allowing one or two bases which do not match the consensus at all. Both methods greatly increase the number of hits to random sequence: replacing an A with an R or V increases the probability of a match by chance by a factor of 2 or 3, respectively. Allowing for 1 or 2 mismatches anywhere in a consensus sequence increases it by a factor of up to 4 or 16.

Imagining the set of all DNA sequences of a single length to be a coordinate space, these two methods of allowing for differences act to create a bubble whose contents are the sequences matched by the model. The more differences allowed, the larger the bubble. The assumption is that since the set of sequences bound by a transcription factor are evidently similar in some unclear way, they should be clustered in this coordinate space and we should be able to find a bubble which contains this cluster tightly: it should contain all the sequences in the set, and none outside it. In practice, however, this is not the case. Stormo showed that given just 6 sites in E. coli, in order to match all of them the consensus sequence must be so vague that a match is found by chance every 30 bp [21]. It's clear that at such levels of generality, the sets accepted by consensus sequence models begin to overlap as well: the distinction between binding sites for individual factors effectively disappears.

To demonstrate the futility of searching for individual sites using known binding site sequences, we examined the gene endo16 from S. purpuratus. Two cis-regulatory modules of this gene have been studied, yielding 13 unique transcription factor inputs binding to 17 sites [24, 25]. We searched for additional sequences which look like binding sites for these same 13 factors within the two modules by searching for sequences like those we have recorded, except allowing for one single mismatch. For one of the inputs, otx, we knew of 4 other sites in other genes (blimp1/krox [16] and otx itself [28]), and we included those in our search, yielding 3 unique site sequences total for otx. For another input, brn1/2/4, we knew of one other site in blimp1/krox [16], so we included its sequence as well. The result of our search as visualized in the cis-Browser can be seen in figure 5. Exact matches are highlighted in red, and matches with one mismatch are drawn in gray.

Figure 5: Matches for known binding site sequence in endo16

The second major model of transcription factor binding sites is the position weight matrix (PWM). By incorporating not only the bases known to appear at each position but also their probabilities of occurrence, a much more precise model is made. While its assumptions that each position contributes independently and additively are imperfect, they are a good approximation of reality, especially for the simplicity of the model: while there are a few notable exceptions, most factors are described relatively well by a PWM [3]. The logarithm of the observed base frequencies has been shown to be proportional to the binding energy contribution of the bases [4], so there is clear biological significance to using these values as the weights of the matrix. But a matrix of coefficients is not sufficient to predict binding sites: there is still the question of the best cutoff score. Unlike a consensus sequence, a PWM assigns a score to every sequence, and a cutoff must be chosen as the minimum score to expect a sequence to be a functional binding site. Even with a cutoff chosen to best match the experimentally known sites, it is unclear how good a site is if it is barely over the limit, or how nonfunctional a sequence is if it is just under the cutoff. As true binding sites are not merely "on" or "off" but have an effect whose degree may depend on the strength of the bind, the concept of a cutoff may be incorrect.
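The two models discussed in this section can be put side by side in a short sketch. It assumes the standard IUPAC ambiguity symbols for the consensus model and, for the PWM, log-odds weights against a uniform background with a pseudocount of 1; the function names are hypothetical and any site sequences passed in are for illustration, not the sites recorded in the cis-Lexicon.

```python
import math

# Standard IUPAC nucleotide ambiguity symbols and the bases each one allows.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def consensus_matches(consensus, site, max_mismatches=0):
    """True if site fits the consensus with at most max_mismatches bad bases."""
    if len(consensus) != len(site):
        return False
    mismatches = sum(base not in IUPAC[symbol]
                     for symbol, base in zip(consensus, site))
    return mismatches <= max_mismatches


def chance_match_probability(consensus):
    """Probability that a uniformly random sequence matches with no mismatch."""
    probability = 1.0
    for symbol in consensus:
        probability *= len(IUPAC[symbol]) / 4
    return probability


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """One dict of log2-odds weights per position, from aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        pwm.append({base: math.log2((column.count(base) + pseudocount)
                                    / total / background)
                    for base in "ACGT"})
    return pwm


def pwm_score(pwm, site):
    """Additive score: sum of the per-position weights of the observed bases."""
    return sum(column[base] for column, base in zip(pwm, site))
```

The consensus functions return a yes/no answer whose chance probability grows with every ambiguity symbol, whereas `pwm_score` returns a graded value for every sequence and leaves the choice of cutoff to the user, which is exactly the open question raised above.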

4.1 Motif finding

None of the existing methods of representing a binding site can predict which sites are functional. Unlike restriction enzymes, which have an effect by themselves, transcription factors only work by affecting the transcription machinery. Some factors do this directly, while others only communicate via intermediaries. Even a short sequence will contain what look like many sites for many different transcription factors, and it is difficult to determine which of them actually determine gene expression.

One method to bypass this problem is to look at a set of genes that appear to be coregulated, i.e., they are expressed at the same time in the same location. It is very likely that the same transcription factor regulates these genes, either directly or indirectly. Therefore many of the regulatory regions of these genes should contain a binding site for that factor. By simply searching for a short sequence which is found to be overrepresented (i.e., more common than expected by chance), in principle we should be able to find such a binding site.

Unfortunately, the binding sites will probably not be identical. Some type of tolerance for mismatches must be added to the search algorithm, which complicates things considerably (otherwise a simple count of the number of occurrences of, e.g., every 8-mer would suffice). Some algorithms model the motif they are looking for combinatorially as a consensus string with a maximum number of mismatches [18], while others use a probabilistic or information-theoretic framework [14].

Even without being able to discern the regulatory regions of genes de novo, we know that they should be conserved between closely related species. Like the protein-coding sequence, regulatory sequence has a functional purpose, and most mutations to it will cause harm to an organism. Therefore few offspring who have any changes to the regulatory sequence will survive, in contrast to those who have changes to sequence outside the regulatory and coding regions, who should have no difficulty. Over generations, while a few minor changes occur within functional regions, large changes will accumulate in the rest. By examining the sequence of species at the right evolutionary distance, we should see clear conservation only where the sequence has a specific function. We can exclude the protein-coding sequence from our analysis, either by using predicted gene models (e.g., [10]) or by transcriptome analysis (e.g., [20]), and only look at the conserved patches of unknown function, which are likely to contain regulatory sequence (see, for example, [27, 17, 19, 1]). For the highest accuracy, several species can be compared simultaneously [5].
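The mismatch-tolerant search for overrepresented short sequences described earlier in this section can be sketched combinatorially. The example below is an illustrative brute-force version only, not the algorithm of [18] or [14]: the function names are hypothetical, every possible k-mer is tried as a candidate consensus, and "support" is simply the number of input sequences containing at least one window within the allowed Hamming distance. It makes no attempt to estimate how often a candidate would occur by chance, which a real method would need in order to call a motif overrepresented.

```python
from collections import Counter
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def motif_support(sequences, k, max_mismatches):
    """Count, for each candidate k-mer, the sequences that contain it.

    A sequence supports a candidate if any of its k-long windows differs
    from the candidate in at most `max_mismatches` positions.
    Exhaustive over all 4**k candidates, so only practical for small k.
    """
    support = Counter()
    for candidate in ("".join(p) for p in product("ACGT", repeat=k)):
        for seq in sequences:
            windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
            if any(hamming(candidate, w) <= max_mismatches for w in windows):
                support[candidate] += 1
    return support

def best_candidates(sequences, k, max_mismatches, top=5):
    """Candidates ranked by how many sequences support them."""
    return motif_support(sequences, k, max_mismatches).most_common(top)
```

With `max_mismatches=0` this reduces to the simple k-mer count mentioned above; allowing even one mismatch multiplies the work per candidate and sharply increases the number of candidates with chance support, which is why the tolerance complicates the search.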

4.2 Evaluations of Motif-Finding

Existing motif-finding algorithms perform reasonably well for yeast, but not for more complex organisms [7]. Several evaluations of the many proposed methods have been attempted, but the use of real genomic promoter sequences is hampered by the simple fact that no one knows the complete 'correct' answer [22, 15]. For an overview of the algorithms and the models they are based on, see [7].

5 The Logic Function Inference Problem

Development of Western Science is based on two great achievements: the invention of the formal logical system (in Euclidean geometry) by the Greek philosophers, and the discovery of the possibility to find out causal relationships by systematic experiment (during the Renaissance).
Albert Einstein

The observed expression of a gene in space and time is not determined by a single transcription factor binding site, but by a collection of sites whose effects are combined in nontrivial ways. Istrail & Davidson catalogued the known site interactions into four categories: transcriptional activation operators, BTA control operators, combinatorial logic operators, and external control operators [12]. To fully understand the regulation of any gene, we must (1) identify all binding sites, (2) understand the function of each site, and (3) understand the rules for combining their functions to infer the overall cis-regulatory output [8]. As discussed above, step 1 alone is an especially difficult problem, since knowledge of transcription factors is largely biased toward one category: the drivers. The drivers are those transcription factors whose expression varies in space and time, and which thus determine the expression of the genes they regulate. Other factors are more or less ubiquitous: they do not vary, but instead are always present, which makes their presence harder to detect. Only thorough studies such as perturbation analysis can expose them.

Perturbation analysis is currently the only method for determining the functional interactions among modules and sites. Biologists take the known regulatory sequence of a gene and modify it to determine the function of each part. When several modules are known to exist, a high-level understanding of the function of each module and how they interact can be found by observing the expression caused by different subsets of the full array of modules. For example, to discover the function of the modules of endo16, Yuh and Davidson tried each of the six modules individually and in many different combinations (A, B, DC, E, F, FA, EA, DCA, FB, EB, DCB, etc.) [26].

When putative sites are identified, biologists can tease out the complex hidden logic underlying even the simplest regulatory region by mutating both individual sites and groups of sites, each in independent experiments. This regulatory archaeology is performed one experiment (dig) at a time: after each experiment, possible models are postulated, and the next perturbation is chosen based on the results. When the S. purpuratus gene spgcm was studied by Ransick and Davidson [19], they identified three putative binding sites for Su(H) to investigate. They recognized that two of the sites were arranged in a manner similar to a pattern familiar from studies in Drosophila and mouse, so they treated the two as a single feature. To determine the precise function of the sites, they observed the expression resulting from mutating all three, only the pair, and only the third single site. Even with only two elements to examine, they found a non-trivial logic behind their function: having either element yielded nearly the full correct activation, while one element in particular (the solo site, not the pair) was clearly necessary for correct repression (mutating the pair resulted in impaired repression but did not remove it entirely).

There are two obvious strategies for choosing the next construct when in the midst of a series of experiments: pick the perturbation whose results can rule out the greatest number of putative models (i.e., binary search); or pick the perturbation whose expected results would confirm the experimenter's current hunch (i.e., an expert heuristic). The better the algorithm, the fewer the experiments that need to be performed to uncover the regulatory logic, and that means less time and less cost.

The intermediate results of perturbation analysis are visualized as a quintessential graph: with one row for each reporter construct tested, bars are drawn to show the expression and/or repression of each construct relative to the control, which contains the full regulatory region sequence.

It can be difficult even to test whether a model is correct or not: the measured output, gene expression, is not a simple value. It varies from organism to organism, and it cannot be characterized as a single number, but at best as a distribution with variance. This variance can cause the ranges of expression for perturbation constructs to overlap even when they are significantly different. More experiments would yield tighter distributions, but the cost cannot always be afforded. At times, biologists must choose the next construct to test and move on without achieving statistically significant results, although the results they achieved might have been significant to them in view of their domain knowledge.
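The first strategy, choosing the perturbation that rules out the most putative models, can be sketched as follows. This is a deliberately simplified illustration under strong assumptions: each candidate model is a hypothetical Boolean function from the set of intact sites to "expressed" or "not expressed", whereas real expression is a noisy quantitative measurement, as discussed above. The site names `pair` and `solo` and the four toy models are invented for the example and are not the conclusions of [19].

```python
from itertools import combinations

def all_perturbations(sites):
    """Every non-empty subset of sites that one construct could mutate."""
    return [frozenset(c)
            for r in range(1, len(sites) + 1)
            for c in combinations(sites, r)]

def worst_case_remaining(models, mutated, sites):
    """Largest group of models that predict the same outcome.

    Whatever the experiment shows, at most this many models survive.
    """
    intact = frozenset(sites) - mutated
    outcomes = {}
    for name, model in models.items():
        outcomes.setdefault(model(intact), []).append(name)
    return max(len(group) for group in outcomes.values())

def choose_next_perturbation(models, sites, already_tested=()):
    """Pick the untested perturbation with the smallest worst case."""
    candidates = [p for p in all_perturbations(sites)
                  if p not in already_tested]
    return min(candidates,
               key=lambda p: worst_case_remaining(models, p, sites))

# Hypothetical example: two elements and four competing toy models.
sites = ["pair", "solo"]
models = {
    "either suffices": lambda intact: len(intact) > 0,
    "both required": lambda intact: {"pair", "solo"} <= intact,
    "solo only": lambda intact: "solo" in intact,
    "pair only": lambda intact: "pair" in intact,
}
choice = choose_next_perturbation(models, sites)
```

In this toy setting, mutating both elements at once is uninformative because all four models predict no expression, while mutating a single element splits the models two against two. The expert-heuristic strategy would replace the `key` function with the experimenter's own ranking.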

Figure 6: Quintessential graph (from [19])

6 Conclusion

The concept of cis-regulatory information abstracts the wealth of types of data required for designing algorithms for computationally predicting the most likely organization of a piece of regulatory DNA into cis-regulatory modules. In this paper we presented three problems that highlight the computational difficulties encountered in extracting such information. Two other sources of cis-regulatory information of fundamental importance are: (1) gene regulatory network architecture (e.g., the endomesoderm network [9]), and (2) genomic data from species at the right evolutionary distance that preserve only the cis-regulatory modules (e.g., S. purpuratus and L. variegatus). The challenge of extracting cis-regulatory information is much amplified by the relatively scarce data sets and the depth of the analysis required to unveil such rich information sources.

Acknowledgments

We thank Eric Davidson for many inspiring discussions and other contributions to this project. This research was supported by National Science Foundation Award 0645955.

References

[1] Gabriele Amore and Eric H Davidson. cis-regulatory control of cyclophilin, a member of the ets-dri skeletogenic gene battery in the sea urchin embryo. Developmental Biology, 293(2):555-64, May 2006. PMID: 16574094.

[2] M. Ashburner, CA Ball, JA Blake, D. Botstein, H. Butler, JM Cherry, AP Davis, K. Dolinski, SS Dwight, JT Eppig, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25-9, 2000.

[3] P.V. Benos, M.L. Bulyk, and G.D. Stormo. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Research, 30(20):4442-4451, 2002.

[4] OG Berg and PH von Hippel. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol, 193(4):723-50, 1987.

[5] M. Blanchette and M. Tompa. Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, 12(5):739, 2002.

[6] Sea Urchin Genome Sequencing Consortium, Erica Sodergren, George M. Weinstock, Eric H Davidson, R. Andrew Cameron, Richard A. Gibbs, Robert C. Angerer, Lynne M. Angerer, Maria Ina Arnone, David R. Burgess, et al. The genome of the sea urchin Strongylocentrotus purpuratus. Science, 314(5801):941-952, Nov 2006.

[7] M.K. Das and H.K. Dai. A survey of DNA motif finding algorithms. feedback, 2007.

[8] E.H. Davidson. The Regulatory Genome: Gene Regulatory Networks In Development And Evolution. Academic Press, 2006.

[9] EH Davidson, JP Rast, P Oliveri, A Ransick, C Calestani, CH Yuh, T Minokawa, G Amore, V Hinman, C Arenas-Mena, et al. A genomic regulatory network for development. Science, 295(5560):1669-1678, Mar 2002.

[10] Christine Elsik, Aaron Mackey, Justin Reese, Natalia Milshina, David Roos, and George Weinstock. Creating a honey bee consensus gene set. Genome Biology, 8(1):R13, 2007.

[11] Mark B. Gerstein, Can Bruce, Joel S. Rozowsky, Deyou Zheng, Jiang Du, Jan O. Korbel, Olof Emanuelsson, Zhengdong D. Zhang, Sherman Weissman, and Michael Snyder. What is a gene, post-ENCODE? History and updated definition. Genome Res., 17(6):669-681, Jun 2007.

[12] S. Istrail and E.H. Davidson. Gene Regulatory Networks Special Feature: Logic functions of the genomic cis-regulatory code. Proceedings of the National Academy of Sciences, 102(14):4954, 2005.

[13] Sorin Istrail, Smadar Ben-Tabou De-Leon, and Eric H. Davidson. The regulatory genome and the computer. Developmental Biology, 310(2):187-195, Oct 2007.

[14] CE Lawrence, SF Altschul, MS Boguski, JS Liu, AF Neuwald, and JC Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262(5131):208-214, 1993.

[15] N. Li and M. Tompa. Analysis of computational approaches for motif discovery. Algorithms for Molecular Biology, 1(8), 2006.

[16] Carolina B Livi and Eric H Davidson. Regulation of spblimp1/krox1a, an alternatively transcribed isoform expressed in midgut and hindgut of the sea urchin gastrula. Gene Expression Patterns: GEP, 7(1-2):1-7, Jan 2007. PMID: 16798107.

[17] Takuya Minokawa, Athula H Wikramanayake, and Eric H Davidson. cis-regulatory inputs of the wnt8 gene in the sea urchin endomesoderm network. Developmental Biology, 288(2):545-58, Dec 2005. PMID: 16289024.

[18] P.A. Pevzner and S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 8:269-278, 2000.

[19] A. Ransick and E.H. Davidson. cis-regulatory processing of Notch signaling input to the sea urchin glial cells missing gene during mesoderm specification. Developmental Biology, 297(2):587-602, 2006.

[20] Manoj P. Samanta, Waraporn Tongprasit, Sorin Istrail, R. Andrew Cameron, Qiang Tu, Eric H. Davidson, and Viktor Stolc. The transcriptome of the sea urchin embryo. Science, 314(5801):960-962, Nov 2006.

[21] G.D. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16(1):16-23, 2000.

[22] M. Tompa, N. Li, T.L. Bailey, G.M. Church, B. De Moor, E. Eskin, A.V. Favorov, M.C. Frith, Y. Fu, W.J. Kent, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology, 23:137-144, 2005.

[23] Wyeth W. Wasserman and Albin Sandelin. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet, 5(4):276-287, Apr 2004.

[24] C. H. Yuh, H. Bolouri, and E. H. Davidson. Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science, 279(5358):1896-1902, March 1998.

[25] CH Yuh, H. Bolouri, and EH Davidson. Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development, 128(5):617-629, 2001.

[26] CH Yuh and EH Davidson. Modular cis-regulatory organization of endo16, a gut-specific gene of the sea urchin embryo. Development, 122(4):1069-1082, Apr 1996.

[27] Chiou-Hwa Yuh, C Titus Brown, Carolina B Livi, Lee Rowen, Peter J C Clarke, and Eric H Davidson. Patchy interspecific sequence similarities efficiently identify positive cis-regulatory elements in the sea urchin. Developmental Biology, 246(1):148-61, Jun 2002. PMID: 12027440.

[28] Chiou-Hwa Yuh, Elizabeth R Dorman, Meredith L Howard, and Eric H Davidson. An otx cis-regulatory module: a key node in the sea urchin endomesoderm gene regulatory network. Developmental Biology, 269(2):536-51, May 2004. PMID: 15110718.

[29] C. Zimmer. What is a species? Scientific American Magazine, 298(6):72-79, 2008.
