US006057101A United States Patent (19) 11 Patent Number: 6,057,101 Nandabalan et al. (45) Date of Patent: May 2, 2000

54) IDENTIFICATION AND COMPARISON OF 5,525,490 6/1996 Erickson et al...... 435/29 PROTEIN-PROTEIN INTERACTIONS THAT 5,580,721 12/1996 Brent et al...... 435/6 OCCUR IN POPULATIONS AND 5,580,736 12/1996 Brent et al...... 435/6 IDENTIFICATION OF INHIBITORS OF OTHER PUBLICATIONS THESE INTERACTORS Mendelsohn et al., 1994, Curr. Opin. Biotech. 5:482-486. 75 Inventors: Krishnan Nandabalan, Guilford; Jonathan Marc Rothberg, Primary Examiner Eggerton A. Campbell Woodbridge; Meijia Yang, East Lyme; Attorney, Agent, or Firm-Pennie & Edmonds LLP James Robert Knight; Theodore 57 ABSTRACT Samuel Kalbfleisch, both of Branford, all of Conn. Methods are described for detecting protein-protein interactions, among two populations of proteins, each hav 73 Assignee: CuraGen Corporation, New Haven, ing a complexity of at least 1,000. For example, proteins are Conn. fused either to the DNA-binding domain of a transcriptional activator or to the activation domain of a transcriptional 21 Appl. No.: 08/874,825 activator. Two yeast Strains, of the opposite mating type and carrying one type each of the fusion proteins are mated 22 Filed: Jun. 13, 1997 together. Productive interactions between the two halves due to protein-protein interactions lead to the reconstitution of Related U.S. Application Data the transcriptional activator, which in turn leads to the activation of a reporter gene containing a binding site for the 63 Continuation-in-part of application No. 08/663,824, Jun. 14, DNA-binding domain. This analysis can be carried out for 1996. two or more populations of proteins. The differences in the 51 Int. Cl." ...... C12O 1/68; C12O 1/70 genes encoding the proteins involved in the protein-protein 52 U.S. Cl...... 435/6; 435/5 interactions are characterized, thus leading to the identifi 58 Field of Search ...... 435/516, 91.2, cation of Specific protein-protein interactions, and the genes 435/69.1, 7.37; 536/23.1 encoding the interacting proteins, relevant to a particular tissue, Stage or disease. Furthermore, inhibitors that interfere 56) References Cited with these protein-protein interactions are identified by their ability to inactivate a reporter gene. The screening for Such U.S. PATENT DOCUMENTS inhibitors can be in a multiplexed format where a set of 4,593,002 6/1986 Dulbecco ...... 435/172.3 inhibitors will be Screened against a library of interactors. 5,198.346 3/1993 Ladner et al. ... 435/69.1 Further, information-processing methods and Systems are 5,223,409 6/1993 Ladner et al...... 435/69.7 described. These methods and systems provide for identifi 5,270,170 12/1993 Schatz et al...... 435/7.37 cation of the genes coding for detected interacting proteins, 5,283,173 2/1994 Fields et al...... 435/6 for assembling a unified database of protein-protein inter 5,322,801 6/1994 Kingston et al. . 435/5.01 5,403,712 4/1995 Crabtree et al...... 435/6 action data, and for processing this unified database to obtain 5,468,614 11/1995 Fields et al...... 435/6 protein interaction domain and protein pathway information. 5,482,835 1/1996 King et al.. 5,512,473 4/1996 Brent et al...... 435/240.2 190 Claims, 38 Drawing Sheets U.S. Patent May 2, 2000 Sheet 1 of 38 6,057,101

CONSTRUCT DNA-BINDING &t ACTIVATION DOMAIN FUSION LIBRARES FROM DIFFERENT POPULATIONS (TISSUE/DEVELOPMENT STAGE/DISEASE STAGE)

ERFORM YEAST MATING SCREENS TO DENTIFY INTERACTING PARS OF PROTEINS N EACH POPULATION - ARRIVE AT"INTERACTIVE POPULATIONS"

CHARACTERIZE INTERACTIVE POPULATIONS BY A COMBINATION OF POOLING STRATEGIES, QEA & SEQ-QEA

ARRIVE AT PAIRS OF INTERACTIVE PROTEINS UNIQUE TO APOPULATION

SCREEN FOR INHIBITORS OF THESE UNIQUE INTERACTORS USING A YEAST-BASED SELECTION OF INHIBITION OF INTERACTIONS

ARRIVE AT LEAD ORUG COMPOUNDS THAT ARE SPECIFICALLY TARGETED AGAINST PAIRS OF INTERACTING PROTEINS UNIQUE TO A PARTICULAR POPULATION FIG.1 U.S. Patent May 2, 2000 Sheet 2 of 38 6,057,101

D-DNA-BINDING DOMAIN

US HIS3/URA3AlagZ REPORTER GENE AACMATION DOMAIN X=DNA-BINDING DOMAINFUSION PROTEIN Y=ACTMATION DOMAINFUSION PROTEIN

UAS HIS3/URA37 lac2. REPORTER GENE

O/o (A) TRANSCRIPTION

HIS3 REPORTER GENE

GROWTH ON MEDIA THAT MONITORING (3-GALACTOSIDASE SELECTS FOR INTERACTING ACTMTY OF INTERACTING PROTEINS PROTEINS (e.g-HIS OR -URA MEDIUM) FIG.2 U.S. Patent May 2, 2000 Sheet 3 of 38 6,057,101

MXN SCREEN M&N ARE TWO POPULATIONS OF PROTEINS SPECIFIC TO A PARTICULAR STATE (e.g., CANCER) GROWTH ON -URA -HIS +3-AT MEDIUM (SELECTS FOR INTERACTORS) AND SCREEN FOR 6-GALACTOSIDASE ACTIVITY

CREATION OF INTERACTIVE POPULATION

TWO- OR THREE DIMENSIONAL POOLING STRATEGES TO FORM POOLS OF ALL INTERACTING PAIRS

YEAST COLONY PCR FOR M & N INSERTS ON POOLS

M-PCR N-PCR

QEA & a CHARACTERIZE PARS OF c PROTEINS IN ONE DISEASE-STATE PERFORMANOTHER MXN ANALYSIS FOR A DIFFERENT STATE (e.g., NORMAL) COMPARISON OF THE TWO M x N ANALYSES EADS TO THE DENTIFICATION OF PARS OF INTERACTING PROTEINS SPECIFIC TO CERTAIN STATE (e.g., CANCER) FG3 U.S. Patent May 2, 2000 Sheet 4 of 38 6,057,101

Gel IMAGES 1A,B,C,D 1A2A3A4A

PLATE POOLS p12 H p1/ROW POOLS A COLUMN POOLS FIG.4B U.S. Patent May 2, 2000 Sheet 5 of 38 6,057,101

2 3 4 5 6 7 8 9 10 11 12

ROW POOLS A-H

3 4 5 6 7 8 9 10 11 12

FG .4C U.S. Patent May 2, 2000 Sheet 6 of 38 6,057,101

My N SCREEN M & N ARE TWO POPULATIONS OF PROTEINS SPECIFIC TO A PARTICULAR STAGE (e.g., CANCER) GROWTH ON-URA MEDIUM (SELECT FOR INTERACTORS) YEAST COLONY PCR FOR M & N INSERTS

M-PCR N-PCR STORE HALF OF THE PCR PRODUCTS

COMBINE THE REMAINING HALVES IN AN ARRAYED FASHION TO CREATE PCR PAIRS

POOL. ALL THE PCR PRODUCTS SPOT ONE-HALF OF EACH \| PCR PAIR ONTO NYLON MEMBRANES

f & SEQ-QEA INTERACTIVE GRIDS IDENTIFY DIFFERENCE BANDS EACH SPOT CORRESPONDS TO GENES ENCODING A PAR OF INTERACTING PROTEINS USE TO PROBE INTERACTIVE GRIDS

INTERACTIVE GRIDS IDENTIFY PARS OF INTERACTIVE PROTEINS THAT ARE STAGE-SPECIFIC FIG.5 U.S. Patent May 2, 2000 Sheet 7 of 38 6,057,101

N CHEMICALS X M INTERACTANTS N XM SCREENS NORMAL x NORMAL CANCER X CANCER NORMAL X CANCER POOL ALL POSITIVE INTERACTANTS (COLONIES)

EACH O REPRESENTS THE POOL OF STAGE-SPECIFIC INTERACTANTS

SCREEN FOR INHIBITION BY SMALL ORGANICS USING THE INHIBITION-DEPENDENT GROWTH ASSAY

N O R M AL X NORMAL CANCER X CANCER NORMAL O O O O O O O O () O O O O O O O O O O O

AT THIS STEP EACH O RECEIVES A DIFFERENT CHEMICAL O INHIBITION DEPENDENT GROWTH > ATUSED THIS CAN STEP BE THEENCODED INHIBITORS SO AS TO o NO GROWTH (NO INHIBITION) INCREASE CHEMICAL DIVERSITY (COMBINATORIAL APPROACHES CAN BE USED HERE) ARRIVE AT STAGE-SPECIFIC (NORMAL/DISEASE) INTERACTING PARS OF PROTEINS AND SMALL MOLECULAR INHIBITORS OF THESE INTERACTIONS FIG.6 U.S. Patent May 2, 2000 Sheet 8 of 38 6,057,101

ADC1-P PEPTIDE NLS UAGADC1-T ES32 Sfi I Asc I

FIG.7 U.S. Patent May 2, 2000 Sheet 9 of 38 6,057,101

te areer s

a.

r rate

O rise.

O t U.S. Patent May 2, 2000 Sheet 10 Of 38 6,057,101

as e

- r - - as - as . . . .

------

T - - - we ------

arra-wis well rraraval a- - - we was sawa a m me m m. m s

t e N cf. e t -- U.S. Patent May 2, 2000 Sheet 11 of 38 6,057,101

210

205 2O7 206 FG.10A

5'-AGC ACT CTC CAG CCT CTC ACC GAA-3' (SEQ ID NO: 1) 250 3'- AG TGG CTT CTAG-5 (SEQ ID NO:7) 5'-ACC GAC GTC GAC TAT CCA TGA AGC-3' (SEQ ID NO:42) 25, 3'-GT ACT TCG TCGA-5' (SEQ ID NO:44) FG, 10B 211

212 FIG.10C U.S. Patent May 2, 2000 Sheet 12 of 38 6,057,101

S U.S. Patent May 2, 2000 Sheet 13 Of 38 6,057,101

W||'0|| (ÇG'ONQIDES)01x00\,000\,0101000,001010700W-G 8||'0|| Z09)909) 0100010\–?º U.S. Patent May 2, 2000 Sheet 14 of 38 6,057,101

407. - - - - - 408

CB) 405 405 406 404 FG, 12A

421

409

422 U.S. Patent May 2, 2000 Sheet 15 of 38 6,057,101 1002 1003 SELECTED

(HUMAN goNA) DB,

1004 1005 SELECTED DB. (PREDICTED COMPLETE EXPRESSED DATABASE SEQUENCES) 1007 (DB) 1006 SELECTED DB. (cDNACDS,EST)

1008 1009

SELECTED (CERTAIN DB. EXPRESSED SEQUENCES) F.G. 13A

1011 1012 1013

1014 HUMAN m TACT . . .

U.S. Patent May 2, 2000 Sheet 16 of 38 6,057,101

1101

1102 RE 1103

REACTION 1 RE LABEL

RE2 1104

REACTION 2 1105 1106 RE2 1107

RE3 1108

PROBE P ABEL4 1112

REACTION 15 1110

1111

1100 F.G. 14 U.S. Patent May 2, 2000 Sheet 17 Of 38 6,057,101

NOLIWTITWIS S00HIEW

ZOZ! No.BJ|tg ||| U.S. Patent May 2, 2000 Sheet 18 of 38 6,057,101

N O O V CN ves O Cy

s U.S. Patent May 2, 2000 Sheet 19 of 38 6,057,101 SAE) 1301

INPUT: A SEQ., TARGET SUB- 1302 SEQUENCES; PROBE FOREACH SUB- 1303 SEQ, MAKE VECTOR OF CUTS

FOR ALL SUB SEQ, MERGE 1304 AND SORT VECTORS

CREATE 1305 FRAGMENTS BETWEEN CUTS 1506

PROBE PRESENT YES NO FIG.16 FOR EACH FRAGMENT, FIND ITS SEQ.

MARK IF PROBE BINDS TO FRAGMENT

SORT ON 1309 FRAGMENT LENGTH output. 1310 FRAGMENT VECTOR END 1311 U.S. Patent May 2, 2000 Sheet 20 Of 38 6,057,101 yo 1401 N. V1 = (10,14) RE1 (62.66 RE1 (610,614)RE VECTORS OF 1402 N. V2 = (237.241) (388,392) RE2) CUTS

y y y y (101lrti (62.65)PE 12724)RE; , , , SORTED VECTORS (388,392), (610,614)) OF CUIS FG.17B

1408 1409 (RE1RE1,52) (RE1RE2,175) (RE2,RE2,151 RE1RE2,222) FRAGMENTS FG.17C 1410 10,62) (62.237 (237,388 (388,610 FRAGMENTS FIG.17D SEQUENCES U.S. Patent May 2, 2000 Sheet 21 of 38 6,057,101

1411 (E. E.N.5) , RE2,RE2y, 151), SORTED FRAGMENT RE1RE2N, 175), RE1,RE2.N.222) VECTOR 1412 FIG.17E

RE1RE1 R DIGEST (RE1RE1 (RE1,RE2 (RE2,RE2 TABLE U.S. Patent May 2, 2000 Sheet 22 of 38 6,057,101 SAE) 1501 INPUT: DATA BASE, EXP. 1502 DEFINITION

INTIALIZE DIGEST 1503 DATABASE

1504 DO FOR ALL SEQUENCES IN DATABASE

DO NEXT MOCK FRAGMENTATION 1505 REACTION

INSERT

FRAGMENT VECTOR 1506 IN DIGEST DB

1507 MOR

REACTIONS N EXP

1504 END-DO

OUTPUT: 1508 DATABASEDIGEST F.G. 18 NND 1509 U.S. Patent May 2, 2000 Sheet 23 of 38 6,057,101

1624 1656 - 1611 1630 le 1607ty. 1610 1634 EE 1612

1669 1633 1635 1634 1625 1613 1617 1616 1618 1604 g 1635 E 1604

1615 1606 SEENCE Int 1605 BASE SELECTED DAA BASE 1605 CSELECTED

E.BASE 1602

F.G. 19A gagggggggababerry \

1603 U.S. Patent May 2, 2000 Sheet 24 of 38 6,057,101

1652

FRAGMENT SEPARATION AND

DETECTION

1651

1660 RECOGNITION REACTIONS MACINTOSH UNY WINDOWS 3. O/S WIND95

1655 (1656 (1657 1650 U.S. Patent May 2, 2000 Sheet 25 of 38 6,057,101

OUTPUT: STATE, ENERGY GENERATE U707 SNEW ENEW-E (SNEw-U" 1709

1710

FG.20A EXPI(E-ENEW)/T) NO CRANDOM NUMBER FROM O TO 1 U.S. Patent May 2, 2000 Sheet 26 of 38 6,057,101

1720

INPUT: 1721 EXPERIMENTAL DESTINATION

CREATE 1722 DIGEST DAABASE

COUNT GOOD

1723 SEQUENCES (OF INTEREST)

ENERGY 1724 EQUALS INVERSE OF COUNT

1725 OUTPUT: ENERGY

1726

FIG.20B U.S. Patent May 2, 2000 Sheet 27 of 38 6,057,101

1801

INPUT:ACCESSION NUMBER FOR 1802 SIGNAL OF INTEREST

1803 DO FOR EACH ACCESSION NUMBER

DO SIMULATED 1804 EXPERIMENT ON SEQUENCE

RANK SIMULATION 1805 AGAINST OBSERVATION 1803

SORT RANK SCORES

OUTPUT: SORTED SCORES

1808

FG.21 U.S. Patent May 2, 2000 Sheet 28 Of 38 6,057,101

65C, 10'

4C, hold FIG.22A

U.S. Patent May 2, 2000 Sheet 29 Of 38 6,057,101

65C, 10'

16°C, 30' FIG.22C

U.S. Patent May 2, 2000 Sheet 30 of 38 6,057,101 2306.239 26. 2307 (N- a Y N\230 230S2305 FIG.25A

5 - 2307,292 PE25 b

N/ N-1 2321 305 2315 2325 FIG.23B 2304 2312y 2330 2333 fr14-1-2. A 6. N 2308 4 2323 - 21-- FIG.23C 2308 5 2324 t 2309 - O E)- 2308 , FIG.23D U.S. Patent May 2, 2000 Sheet 31 Of 38 6,057,101

09:22

19929,922 U.S. Patent May 2, 2000 Sheet 32 of 38 6,057,101

SOLATION OF INTERACTING PROTEINS FROM AN MXN ANALYSIS

ISOLATE URA+HIS+loc7+, CELLS; POSITIVE FOR PROTEIN-PROTEIN INTERACTION e.g., VEGF-VEGF, R4-FKBP-12, RAS-RAF etc.,

TRANSFER TO AND GROWTH IN NON-INDUCING MEDIA (LACTATE)

TRANSFER TO AND GROWTH IN INDUCING MEDIA (GLUCOSE) THAT IS SELECTIVE FOR INTERACTION (-URA), ALONG WITH INHIBITOR

TRANSFER TO AND GROWTH IN 5-FOA MEDIA, ALONG WITH INHIBITOR

SELECTION OF 5-FOARCELLSCELLS IN WHICH PROTEIN-PROTEIN INTERACTIONS HAVE BEEN INHIBITED

e.g., SELECTION OF R4-FKBP12 CELLS IN THE PRESENCE OF FK506 IN 5-FOA FIG.24 U.S. Patent May 2, 2000 Sheet 33 Of 38 6,057,101

ISOLATION OF INTERACTING PROTEINS FROMAN MX N ANALYSIS

o ISOLATATE URA+HIS+loc7+, CELLS: POSITIVE FOR PROTEIN-PROTEIN INTERACTION e.g., VEGF-VEGF, R4-FKBP-12, RAS-RAF etc., | SELECTING INHIBITION OF NOVEL PROTEIN-PROTEIN INTERACTIONS BY CANDIDATE INHIBITORS USING THE 5-FOA ASSAY

O POOL. ALL INTERACTANTS FROMAN MX N ANALYSIS e.g., R4-FKBP12 IN A POOL OF INTERACTANTS 9 SCREEN AGAINST INHIBITORS USING THE 5-FOA ASSAY e.g., FK506 AGAINST R4-FKBP12 IN A POOL OF INTERACTANTS

O SELECTION OF THOSE PROTEIN-PROTEIN INTERACTION EVENTS WHERE INHIBITION OCCURRED e.g., SELECTION OF R4-FKBP12 AS 5-FOA RESISTANT CELLS AMONG A POOL OF INTERACTANTS WHEN EXPOSED TO FK506

ISOLATION AND PROTEIN-PROTEIN INTERACTIONS AND INHIBITORS OF THESE INTERACTIONS

O CHARACTERIZATION OF THE GENES ENCODING THE INTERACTING PROTEINS BY SEQUENCE ANALYSIS 9 CONFIRMATION OF INHIBITION BY ENZYMATIC ASSAYS e.g. (3-GALACTOSIDASE ASSAYS FIG.25 U.S. Patent May 2, 2000 Sheet 34 of 38 6,057,101

INTERACTION METHOD STEPS INFORMATION PROCESSING STEPS 2601^n- opic FUSION LIBRARIES 26021N-TRANSFORM YEAST STRAINS 2603N- NEGATIVE SELECTION STEP 26041N-MATING & SOLATION OF g INTERACTANTS 2608 2615 QEAy; SIGNALS) f & 2605^n - CHARACTERIZES.E. res:Discription EXPERIMENT Fis basisSEQUENCE 2609-1 N N p o an/ 2616 1N IDENTIFICATION L00KASIDE -u 2618 DATABASE DATABASES

2610N - - CONTROLINTERACTION QUALITY| - 2619 112611 2606-- PLASMIDDROP-OUT TEST INTERACTION DATABASE - N/ 2620 &/ORMATRX/ MATING TEST Y 2612 N DATABASE 2607/Nu FURTHER STEPS BROWSING PATHWAY 2621 CONSTRUCTION IDENTIFICATIONDOMAIN as N 2622 FIG.26 U.S. Patent May 2, 2000 Sheet 35 0f 38 6,057,101 2701

CD 2705 SEQUENCE/ HOMOLOGY SEQUENCE SERVER DATABASE

2702 2706

IDENTIFICATION

DATABASE DATABASE SERVER COMPUTER O "ESSINTERACTION 2704

2703 2708

USER TERMINAL coverUSER 4 /\ zo

LABORATORY ROBOT

FIG.27

U.S. Patent May 2, 2000 Sheet 38 0f 38 6,057,101

UNTITLED VIEW DETALS

PROTEIN PROTEIN A B

MDM2

PROTEIN C FIG.29

3005 6,057,101 1 2 IDENTIFICATION AND COMPARISON OF Separate domains and can function in trans (Keegan et al., PROTEIN-PROTEIN INTERACTIONS THAT 1986, Science 231:699-704). The reconstitution of the tran OCCUR IN POPULATIONS AND Scriptional activator is monitored by the activation of a IDENTIFICATION OF INHIBITORS OF reporter gene like the lacZ gene that is under the influence of a promoter that contains a binding site (Upstream Acti THESE INTERACTORS vating Sequence or UAS) for the DNA-binding domain of This application is a continuation in part of co-pending the transcriptional activator. This method is most commonly application Ser. No. 08/663,824, filed Jun. 14, 1996, which used either to detect an interaction between two known is incorporated by reference herein in its entirety. proteins (Fields and Song, 1989, Nature 340:245–246) or to This invention was made with United States Government identify interacting proteins from a population that would support under award number 70NANB5H1066 awarded by bind to a known protein (Durfee et al., 1993, Genes Dev. the National Institute of Standards and Technology. The 7:555–569; Gyuris et al., 1993, Cell 75:791–803; Harper et United States Government has certain rights in the inven al., 1993, Cell 75:805–816; Vojtek et al., 1993, Cell tion. 74:205–214). 15 Another system that is similar to the Two-Hybrid system 1. INTRODUCTION is the “Interaction-Trap system' devised by Brent and col leagues (Gyuris et al., 1993, Cell 75:791–803). This system The present method relates to the identification of protein is similar to the Two-Hybrid system except that it uses a protein interactions and inhibitors of these interactions that, LEU2 reporter gene and a lacZ reporter gene. Thus protein preferably, are specific to a cell type, tissue type, Stage of protein interactions leading to the reconstitution of the development, or disease State or Stage. transcriptional activator also allow cells to grow in media lacking leucine and enable them to express B-galactosidase. 2. BACKGROUND OF THE INVENTION The DNA-binding domain used in this system is the Lex A Proteins and protein-protein interactions play a central DNA-binding domain, while the activator Sequence is role in the various essential biochemical processes. For 25 obtained from the B42 transcriptional activation domain example, these interactions are evident in the interaction of (Ma and Ptashne, 1987, Cell 51:113-119). The promoters of hormones with their respective receptors, in the intracellular the reporter genes contain LeXA binding Sequences and and extracellular Signalling events mediated by proteins, in hence will be activated by the reconstitution of the tran enzyme Substrate interactions, in intracellular protein Scriptional activator. Another feature of this System is that trafficking, in the formation of complex Structures like the gene encoding the DNA-binding domain fusion protein ribosomes, Viral coat proteins, and filaments, and in antigen is under the influence of an inducible GAL promoter so that antibody interactions. These interactions are usually facili confirmatory tests can be performed under inducing and tated by the interaction of Small regions within the proteins non-inducing conditions. that can fold independently of the rest of the protein. These In yet another version of this system developed by independent units are called protein domains. Abnormal or 35 Elledge and colleagues, the reporter geneS HIS3 and lacZ disease States can be the direct result of aberrant protein (Durfee et al., 1993, Genes Dev. 7:555–569) are used. The protein interactions. For example, oncoproteins can cause transcriptional activator that is reconstituted in this case is cancer by interacting with and activating proteins respon GAL4 and protein-protein interactions allow cells to grow in sible for cell division. Protein-protein interactions are also media lacking histidine and containing 3-aminotriazole central to the mechanism of a virus recognizing its receptor 40 (3-AT) and to express f-galactosidase. 3-AT inhibits the on the cell Surface as a prelude to infection. Identification of growth of his3 auxotrophs in media lacking histidine domains that interact with each other not only leads to a (Kishore and Shah, 1988, Annu. Rev. Biochem. broader understanding of protein-protein interactions, but 57:627-663). also aids in the design of inhibitors of these interactions. In a different two-hybrid assay, a URA3 reporter gene Protein-protein interactions have been studied by both 45 under the control of Estrogen Response Elements (ERE) has biochemical and genetic methods. The biochemical methods been used to monitor protein-protein interactions. Here, the are laborious and slow, often involving painstaking DNA-binding domain is derived from the human estrogen isolation, purification, Sequencing and further biochemical receptor. The authors of the ERE assay propose that inhi characterization of the proteins being tested for interaction. bition of the protein-protein interactions can be identified by AS an alternative to the biochemical approaches, genetic 50 negative Selection on 5-FOA medium (Le Douarin et al., approaches to detect protein-protein interactions have 1995, Nucleic Acids Res. 23:876-878), but do not provide gained in popularity as these methods allow the rapid any details. detection of the domains involved in protein-protein inter A version of the two-hybrid approach called the “Con actions. tingent Replication ASSay” that is applicable in mammalian An example of a genetic System to detect protein-protein 55 cells has also been reported (Nallur et al., 1993, Nucleic interactions is the “Two-Hybrid” system to detect protein Acids Res. 21:3867-3873; Vasavada et al., 1991, Proc. Natl. protein interactions in the yeast Saccharomyces cerevisiae Acad. Sci. USA 88: 10686-10690). In this case, the recon (Fields and Song, 1989, Nature 340:245–246; U.S. Pat. No. Stitution of the in mammalian cells due 5,283,173 by Fields and Song). This assay utilizes the to the interaction of the two fusion proteins leads to the reconstitution of a transcriptional activator like GAL4 60 activation of the SV40 T antigen. This antigen allows the (Johnston, 1987, Microbiol. Rev. 51:458–476) through the replication of the activation domain fusion plasmids. interaction of two protein domains that have been fused to Another modification of the two-hybrid approach using the two functional units of the transcriptional activator: the mammalian cells is the "Karyoplasmic Interaction Selection DNA-binding domain and the activation domain. This is Strategy' that also uses the reconstitution of a transcriptional possible due to the bipartite nature of certain transcription 65 activator (Fearon et al., 1992, Proc. Natl. Acad. Sci. USA factors like GAL4. Being characterized as bipartite Signifies 89:7958–7962). Reporter genes used in this case have that the DNA-binding and activation functions reside in included the gene encoding the bacterial chloramphenicol 6,057,101 3 4 acetyl transferase, the gene for cell-Surface antigen CD4, Accordingly, it is one of the objectives of this invention and the gene encoding resistance to Hygromycin B. In both to devise a genetic method to identify and isolate preferably of the mammalian Systems, the transcription factor that is all possible protein-protein interactions within a population reconstituted is a hybrid transcriptional activator in which of proteins, or between two different populations of proteins, the DNA-binding domain is from GAL4 and the activation be it a tissue/cell-type, disease State or Stage of development. domain is from VP16. In all of the assays described above, the identity of one (or It is another objective of the present invention to perform both) of the proteins being tested for interaction is known. a comparative analysis of the protein-protein interactions All of the assays mentioned above can be used to identify that occur two or more different tissue/cell-types, disease novel proteins that interact with a known protein of interest. States, or Stages of development. In a variation of the “Interaction Trap" System, a “mating 1O It is also an objective of this invention to identify and grid” strategy has been used to characterize interactions isolate in a rapid manner the genes encoding the proteins between proteins that are thought to be involved in the involved in interactions that are specific to a tissue/cell-type, Drosophila cell cycle (Finley and Brent, 1994, Proc. Natl. disease State, or Stage of development. Acad. Sci. USA'91:12980–12984). This strategy is based on 15 It is yet another objective of this invention to provide a a technique first established by Rothstein and colleagues method for the concurrent identification of inhibitors of the (Bendixen et al., 1994, Nucleic Acids Res. 22:1778–1779) protein-protein interactions that characterize a given who used a yeast-mating assay to detect protein-protein population, be it a tissue/cell type, disease State, or Stage of interactions. Here, the DNA-binding and activation domain fusion proteins were expressed in two different haploid yeast development. These inhibitors may have therapeutic value. Strains, a and C, and the two were brought together by Citation of a reference herein shall not be construed as an mating. Thus, interactions between proteins can be Studied admission that Such is prior art to the present invention. in this method. However, even in this method, the identities of at least one of the proteins in the interacting pairs of 3. SUMMARY OF THE INVENTION proteins was known prior to analyzing the interactions The present invention provides methods and means to between pairs of proteins. 25 detect and isolate the genes encoding the proteins that Stanley Fields and coworkers have recently performed an interact with each other between two populations of analysis of all possible protein-protein interactions that can proteins, using the reconstitution of a Selectable event. This take place in the E. coli bacteriophage T7 (Bartel et al., 1996, Selectable event is the formation of a transcription factor. In Nature Genet. 12:72–77). Randomly sheared fragments of contrast to the prior art, in which problems with false T7 DNA were used to make libraries in both the DNA positives and low throughput limited the complexity of the binding domain and the activation domain plasmids and a populations that could be analyzed, each of the two popu genome-wide two-hybrid assay was performed by use of a lations of proteins has a complexity of greater than 10, and mating Strategy. The DNA-binding and the activation preferably has a complexity of at least 1,000. The reconsti domain fusions were transformed into Separate yeast Strains tution of a transcription factor occurs by interaction of of opposite mating type. The DNA-binding domain hybrids 35 fusion proteins expressed by chimeric genes. In a preferred containing yeast transformants were then divided into embodiment, the types of fusion proteins used are DNA groups of 10. The groups were Screened (by the mating binding domain hybrids and activation domain hybrids of Strategy outlined above) against a library of activation transcriptional activators. Libraries of genes encoding domain hybrids numbering around 10 transformants. By hybrid proteins are preferably constructed in both a DNA this method, 25 interactions were characterized among the 40 binding domain hybrid plasmid vector and in an activation proteins of T7. While this study provides a method to screen domain hybrid plasmid vector. In a preferred embodiment, more than one DNA-binding domain hybrid against more two types of haploid yeast Strains, a and C. respectively, are than one activation domain hybrid, it does not address the each transformed with a different one of the two libraries to issues involved in Screening complex libraries against each create two yeast libraries. The two yeast libraries are then other. This is an important limitation due to the value of 45 mated together to create a diploid yeast Strain that contains enabling the detection and isolation of interactants from both the kinds of fusion genes encoding the hybrid proteins. cDNA libraries prepared from complex organisms like If the two hybrid proteins can interact (bind) with each other, human beings. Indeed, the prior art has taught away from the transcriptional activator is reconstituted due to the proX using complex populations of proteins as hybrids to the imity of the DNA-binding and the activation domains of the DNA-binding domain, since random hybrids to the DNA 50 transcriptional activator. This reconstitution causes tran binding domain produce a large percentage of false positives Scription of reporter genes that, by way of example, enable (hybrids that have transcriptional activity in the absence of the yeast to grow in Selective media. In a preferred aspect, an interacting protein) (Bartel et al., 1993, “Using the two the activity of a reporter gene is monitored enzymatically. hybrid System to detect protein-protein interactions,” in The isolation of the plasmids that encode these fusion genes Cellulaar Transduction in Development, Ch. 7, Hartley, D. 55 leads to the identification of the genes that encode proteins A. (ed.), Practical Approach Series xviii, IRL Press at that interact with each other. Oxford University Press, New York, N.Y., pp. 154–179 at Thus, in a specific embodiment, the invention is directed 171; Ma and Ptashne, 1987, Cell 51:113). to a method of detecting one or more protein-protein inter None of the prior art systems provides a method that not actions comprising (a) recombinantly expressing within a only isolates and catalogues all possible protein-protein 60 population of host cells (i) a first population of first fusion interactions within a population, be it a tissue/cell-type, proteins, each said first fusion protein comprising a first disease State, or Stage of development, but also allows the protein Sequence and a DNA binding domain in which the comparison of Such interactions between two Such popula DNA binding domain is the same in each said first fusion tions thereby allowing the identification of protein-protein protein, and in which Said first population of first fusion interactions unique to any particular tissue/cell-type, disease 65 proteins has a complexity of at least 1,000; and (ii) a second State, or Stage of development. In contrast, Such a method is population of Second fusion proteins, each said Second provided by the present invention. fusion protein comprising a Second protein Sequence and a 6,057,101 S 6 transcriptional regulatory domain of a transcriptional does not necessarily imply that the protein-protein interac regulator, in which the transcriptional regulatory domain is tion will be blocked by the same agent. Another advantage the same in each Said Second fusion protein, Such that a first of the method is that multiple protein-protein interactions fusion protein is co-expressed with a Second fusion protein can be Screened against a prospective inhibitor in a Single in host cells, and wherein Said host cells contain at least one asSay. nucleotide Sequence operably linked to a promoter driven by one or more DNA binding sites recognized by said DNA This invention also provides information-processing binding domain Such that interaction of a first fusion protein methods and Systems. One aspect of these methods provides with a Second fusion protein results in regulation of tran methods for interpreting detected protein-protein interac Scription of Said at least one nucleotide Sequence by Said tions by providing for identification of the genes that code regulatory domain, and in which Said Second population of for the library inserts in the activation domain and fusion Second fusion proteins has a complexity of at least 1,000; domain hybrids. Another aspect of these methods provides and (b) detecting said regulation of transcription of Said at for assembling protein-protein interaction data detected least one nucleotide Sequence, thereby detecting an interac from one or more pairs of libraries into a unified database. tion between a first fusion protein and a Second fusion Further aspects of these methods provide for use of this protein. 15 unified database to assemble individual, pair-wise protein In further Specific embodiments, this invention provides protein interactions into putative pathways and networks of for detecting experimentally significant protein-protein protein interaction, providing a more general view of cel interactions between highly complex libraries of proteins. In lular functioning. Also provided for is the use of this unified particular, the invention provides protocols which achieve database to delimit or determine the protein domains respon highly effective screening of the DNA binding domain or Sible for particular protein-protein interactions. activation domain hybrids to eliminate those hybrids that 4. DESCRIPTION OF THE FIGURES produce false positive indications of protein-protein inter actions. Additional Screening protocols eliminate those These and other features, aspects, and advantages of the hybrids which, due to non-specific association with many present invention will become better understood by refer proteins, produce leSS experimentally significant or Specific 25 ence to the accompanying drawings, following description, indications of protein-protein interactions. Further protocols and appended claims, where the drawings are described provide for the efficient mating of large numbers of yeast briefly as follows: cells useful for handling complex libraries. FIG.1. An overview of an exemplary strategy to identify The present invention also provides a method to isolate pairs of interacting proteins that are specific to a particular concurrently inhibitors of Such protein-protein interactions population and to identify inhibitors of these interactors in a that occur in, are characteristic of or are specific to a given high throughput fashion. population of proteins. By way of example, preferably all FIG. 2. A yeast interaction mating assay for the detection the yeast diploids that harbor fusion proteins that interact of protein-protein interactions. The two test proteins are with each other are pooled together and exposed to candi 35 indicated as X and Y. X=DNA binding domain fusion date inhibitors. Exemplary candidate inhibitors include protein, Y=activation domain fusion protein. The activation chemically Synthesized molecules and genetically encoded and DNA-binding domains are indicated as A and D respec peptides. After treatment with candidate inhibitors, the yeast tively. The two yeast cell types are a and C, while the diploid cells harboring interacting hybrid proteins are Selected for is marked as a? C. A blue color (not shown) indicates expres the inactivation of the reporter gene, preferably by transfer 40 Sion of B-galactosidase by conversion of the clear X-gal to appropriate Selective media. Preferably, the same media Substrate into an insoluble blue precipitate. also Selects for the presence of the plasmids that encode the FIG. 3. Exemplary scheme for the isolation of stage interacting proteins, and the peptide-encoding peptides in Specific pairs of interacting proteins. M and N are two the case of the Screening for peptide inhibitors expressed populations of proteins expressed in a particular state (e.g., from expression plasmids. Successful inhibition events are 45 cancer). The mating of two populations M and N results in thus monitored by the inactivation of the reporter gene. the creation of an interactive population that contains all The major advantages of these methods are as follows. possible pairs of interacting proteins in the two populations. From a population of proteins characteristic of a particular The reporter genes are URA3, HIS3, and lacZ. The inter tissue or cell-type, all possible detectable protein-protein active population is further characterized by methods Such interactions that occur can be identified and the genes 50 as the OEATM method, the SEQ-OEATM method, and encoding these proteins can be isolated. Thus, parallel Sequencing which aid in the identification of the pairs of analyses of two cell types enumerates the protein-protein interacting proteins. A comparison of two Such interactive interactions that are common to both and those that are populations leads to the identification of Stage-Specific or Specific to both (differentially expressed in one cell type and disease State-specific pairs of interacting proteins. not the other). Such an analysis has value since protein 55 FIGS. 4A-C. Pooling strategies to characterize the inter protein interactions Specific to a disease State can Serve as active populations. PCR reactions are performed on pooled therapeutic points of intervention. yeast cells and the PCR products are either analyzed directly Furthermore, inhibitors of Such protein-protein interac by electrophoresis or by the QEA ' method and SEQ tions can be isolated in a rapid fashion. Such inhibitors can QEATM method. These methods lead to the characterization be of therapeutic value or Serve as lead compounds for the 60 of an interactive population. (A) 2-dimensional pooling and Synthesis of therapeutic compounds. This System can also be deconvolution. (B) 3-dimensional pooling and deconvolu used to identify novel peptide inhibitors of protein-protein tion. In order to determine the location of each clone, wells interactions. One advantage of this method over existing are pooled along planes (as opposed to lines in the methods is that peptides or chemicals are identified by an 2-dimensional Strategy). The location of a specific gene can ability to block protein-protein interactions. In many exist 65 be determined by finding which pool from each axis con ing methods, molecules are identified by an ability only to tains it. (C) 3-dimensional pooling from 96 well plates. 1152 bind to one of a pair of interacting proteins, Such binding positive colonies are arrayed into individual Wells of twelve 6,057,101 7 8 microtiter plates. A total of 32 pools are produced: 12 pooled FIG. 16 shows the detail of a method for simulating a along the columns axis (each from all 12 plates), 8 pooled QEATM reaction; along rows (A-H), and 12 pooled plates (p1-p12). These FIGS. 17A-F show exemplary results of the action of the pools contain genes from 96, 144, and 96 Wells, respectively. method of FIG. 16; Two-dimensional pooling and deconvolution requires 36+24 5 pools, but no pool is from more than 36 wells (genes), So it FIG. 18 shows the detail of a method for determining a is easier to get clearly separate bands from a SEQ-QEATM simulated database of experimental results for a QEATM method reaction of pools than with the three-dimensional embodiment; Strategy. FIGS. 19A and 19B show an exemplary computer system FIG. 5. Isolation of Stage-specific pairs of interacting apparatus implementing the QEAM methods, proteins by probing interactive grids. M and N are two FIG. 20A shows exemplary detail of an experimental populations of proteins expressed in a particular state (e.g., design method for a QEATM method, and FIG. 20B shows cancer). The PCR products corresponding to M and N exemplary detail of an experimental design method for an partners from an MXN analysis are spotted onto a Solid embodiment of a QEATM method; Support like a nitrocellulose membrane to create an interac 15 FIG. 21 shows an exemplary method for ordering the tive grid. The interactive grids are then probed with DNA DNA sequences found to be likely causes of a QEATM that is unique to a specific Stage to isolate pairs of interacting method Signal in the order of their likely presence in the proteins that are unique to a specific Stage. Sample, FIG. 6. integration of the expression linkage analysis and FIGS. 22A, 22B, 22C, and 22D show exemplary reaction inhibitor Screen. Exemplary Steps in an integrated isolation temperature profiles for preferred manual and automated of inhibitors of protein-protein interactions are depicted. The implementations of a preferred RE embodiment of a QEATM interactive populations that arise from an MXN analysis are method; and Screened against many inhibitorS Such that only Successful FIGS. 23A-23E show details of a SEO-OEATM embodi inhibition events are selected. Thus, from an MXN analysis ment of a OEATM method. not only are obtained Stage-specific pairs of interacting 25 proteins, but also inhibitors of Such interactions. FIG. 24. Exemplary protocol for selection of inhibitors of FIG. 7. Peptide expression vector polylinker. The protein-protein interactions. polylinker region of the peptide expression vector (PEV) is FIG. 25. Exemplary protocol for selection of novel inter depicted. ADC1-P and ADC1-T refer to the ADC1 promoter acting proteins and inhibitors of these interacting proteins. and terminator, respectively. This is a yeast promoter that FIG. 26. Exemplary method steps for a particular alter promotes transcription of genes downstream of it. Sfi I and native embodiment for detecting protein-protein interactions ASc I sites demarcate the region within which the peptide and exemplary information processing Steps. coding regions will be inserted. UAG refers to the termina FIG. 27. Exemplary computer-implemented System for tion codon and NLS refers to the Nuclear Localization performing the information processing Steps of FIG. 26. Signal that provides transport of the peptides into the 35 FIGS. 28A and 28 B. Exemplary computer display screens nucleus. for data Selection according to the information processing FIG. 8. A QEATM Method Analysis. A comparison is steps of FIG. 26. depicted of a QEATM method pattern from an MXN analysis FIG. 29. Exemplary computer display screen for protein conducted in duplicate (Section 6.5). The PCR products that interaction pathways according to the information proceSS were p AS-like vector-specific were pooled and Subjected to 40 ing steps of FIG. 26. a QEATM method analysis. I and II refer to duplicate MXN analyses. The dotted peaks correspond to the molecular FIG. 30. Example of an exemplary method for finding weight markers and the solid peaks are the QEATM method domains responsible for interaction according to the infor products. mation processing Steps of FIG. 26. FIG. 9. The QEATM Method Comparison. A comparison 45 5. DETAILED DESCRIPTION OF THE is shown of the QEATM method patterns from an MXN INVENTION analysis conducted wherein one of the interactive popula tions had the RAS-RAF interaction. The RAF peak obtained In contrast to prior art methods of detecting protein in the OEATM method is shown in Solid black. protein interactions between two protein populations, 50 wherein the number of false positives and low throughput FIGS. 10A, 10B, 10C, and 10D show DNA adapters for limited the applicability of Such prior art methods to Situa an RE/ligation implementation of a Quantitative Expression tions in which the complexity of at least one of the popu Analysis (“OEATM) method, where the restriction endonu lations was no more than 10, the present invention allows cleases generate 5' overhangs, open blocks indicating detection of protein-protein interactions (and isolation and strands of DNA; 55 characterization of the interacting proteins) between popu FIGS. 11A and 11B show the DNA adapters for an lations in which both populations can have complexities of RE/ligation implementation of a QEATM method, where the orders of magnitude Significantly greater than 10, e.g., restriction endonucleases generate 3' overhangs, 1,000, 100,000, or in the range of 50,000-100,000 as is FIGS. 12A, 12B, and 12C show an exemplary biotin found in mammalian cDNA populations. Methods for alternative embodiment of the OEATM method; 60 detecting, isolating, and characterizing inhibitors of Such FIGS. 13A and 13B show a method for DNA sequence interactions are also provided. database selection according to a QEATM method; For purposes of convenience of description and not by FIG. 14 shows an exemplary experimental description for way of limitation, the detailed description is divided into the an embodiment of a QEATM method; Subsections set forth below. FIGS. 15A and 15B show an Overview of a method for 65 5.1. DETECTING INTERACTING PROTEINS determining a simulated database of experimental results for The present invention provides methods for detecting an embodiment of a QEATM method; interacting proteins (including peptides). Interacting pro 6,057,101 9 10 teins are detected based on the reconstitution of a transcrip protein interactions present in a different Species, cell type, tional regulator in the presence of a reporter gene ("Reporter age, tissue type, non-diseased State or a different disease Gene') whose transcription is then regulated by the recon Stage, or different State of development, respectively. For Stituted regulator. In contrast to prior art methods, the example, in one embodiment, interactions are detected protein-protein interactions can be detected, and the inter between identical populations of proteins in which the acting pairs of proteins isolated and identified, between two population of proteins is from cDNA of cancerous or pre populations of proteins wherein both of the populations have cancerous (e.g., hyperplastic, metaplastic, or dysplastic a complexity of at least 10 (i.e., both populations contain cells), e.g., of prostate cancer, breast cancer, Stomach cancer, more than ten distinct proteins). The populations are lung cancer, OVarian cancer, uterine cancer, etc.; these inter expressed as fusion proteins to a DNA binding domain, and actants are then compared to interacting proteins detected to a transcriptional regulatory domain, respectively. In Vari between two other identical populations of proteins in which ous Specific embodiments, one or both of the populations of the population of proteins is from cDNA of cells not having proteins has a complexity of at least 50, 100, 500, 1,000, the cancer or precancerous condition, as the case may be. In 5,000, 10,000, or 50,000; or has a complexity in the range a specific embodiment, cDNA may be obtained from a of 25 to 100,000, 100 to 100,000, 50,000 to 100,000, or 15 preexisting cDNA sample or may be prepared from a tissue 10,000 to 500,000. For example, one or both populations can Sample. When cDNA is prepared from tissue Samples, be mammalian cDNA populations, generally having a com methods commonly known in the art can be used. For plexity in the range of 50,000 to 100,000; in such popula example, these can consist of largely conventional Steps of tions from total mRNA, the detection of a protein in an RNA preparation from the tissue Sample, preferably total interacting pair that is expressed to a particular level can be poly(A) purified RNA is used but less preferably total optimized by the Statistical considerations described in Sec cellular RNA can be used, RNase extraction, DNase tion 5.2.7 below. In a specific embodiment, the invention is treatment, mRNA purification, and first and Second Strand capable of detecting Substantially all detectable interactions cDNA synthesis. that occur between the component proteins of two Preferably, the populations of proteins between which populations, each population having a complexity of at least 25 interactions are detected are provided by recombinant 50, 100, 500, 1000, 5000, 10,000 or 50,000. In a specific expression of nucleic acid populations (e.g., cDNA or embodiment, the two populations are samples (aliquots) of genomic libraries). Also preferably, the interactions occur at least 100 or 1000 members (e.g., expressed in host yeast intracellularly. In another Specific embodiment, recombinant cells) of a larger population (e.g., a mammalian cdNA biological libraries expressing random peptides can be used library) having a complexity of at least 100, 1000, 5,000, as the Source nucleic acid for one or both of the nucleic acid 10,000, or 50,000; in a particular embodiment, the sample is populations. uncharacterized in that the particular identities of all or most In a specific embodiment, presented by way of example of its member proteins are not known. and not limitation, the method of the invention comprises the The populations can be the same or different populations. StepS Schematically depicted in FIG. 1. If it is desired to detect interactions between proteins 35 In a preferred aspect, the present invention provides a encoded by a particular DNA copulation, both protein popu method for detecting unique protein-protein interactions that lations are expressed from chimeric genes comprising DNA characterize a population or library of proteins by comparing Sequences representative of that particular DNA population. all detectable protein-protein interactions that occur in a In another embodiment, one protein population is expressed population or library with those interactions that occur in from chimeric genes comprising cDNA sequences of dis 40 another population or library. Furthermore, the method also eased human tissue, and the other protein population is enables the identification of inhibitors of Such protein expressed from chimeric genes comprising cDNA sequences protein interactions. of non-diseased human tissue. In a specific embodiment, one Protein-protein interactions are detected according to the or more of the populations can be uncharacterized in that the invention by detecting transcriptional regulation (preferably identities of all or most of the members of the population are 45 activation) which occurs upon interaction of proteins not known. Preferably, the populations are proteins encoded between the two populations being tested (referred to here by DNA, e.g., cDNA or genomic DNA or synthetically inafter merely for purposes of convenience as the M popu generated DNA. For example, the populations can be lation and the N population). Proteins of each population expressed from chimeric genes comprising cDNA sequences (M, N) are provided as fusion (chimeric) proteins from an uncharacterized Sample of a population of cDNA 50 (preferably by recombinant expression of a chimeric coding from mammalian RNA. Preferably, a cDNA library is used. Sequence) containing each protein contiguous to a prese The cDNA can be, e.g., a normalized or subtracted cDNA lected Sequence. For one population, the preselected population. The cDNA of one or both populations can be sequence is a DNA binding domain. The DNA binding cDNA of total mRNA or polyA" RNA or a subset thereof domain can be any available, as long as it specifically from a particular Species, particular cell type, particular age 55 recognizes a DNASequence within a promoter. For example, of individual, particular tissue type, disease State or disorder the DNA binding domain is of a transcriptional activator or or Stage thereof, or Stage of development. Accordingly, the inhibitor. For the other population, the preselected Sequence invention provides methods of identifying and isolating is an activator or inhibitor domain of a transcriptional interacting proteins that are present in or Specific to particu activator or inhibitor, respectively. lar Species, cell type, age, tissue type, disease State, or 60 In a preferred embodiment, each protein in one population disease Stage, and also provides methods for comparing the (e.g., M) is provided as a fusion to a DNA binding domain protein-protein interactions present in Such particular of a transcriptional regulator (e.g., activator). Each protein in Species, cell type, age, tissue type, disease State, or disease the other population (N) is provided as a fusion to an Stage (by e.g., using a cDNA library of total mRNA par activator domain of a transcriptional activator. The regula ticular to Such Species, cell type, age, tissue type, disease 65 tory domain alone (not as a fusion to a protein sequence) and State, or disease Stage, respectively, as both the populations the DNA-binding domain alone (not as a fusion to a protein between which interactions are detected) with the protein Sequence) preferably do not detectably interact (so as to 6,057,101 11 12 avoid false positives in the assay). When binding occurs of deficient, and the Selection medium lackS Such nutrient. The a fusion protein in M to a fusion protein in N, reconstitution Reporter Gene used need not be a gene containing a coding of a transcriptional activator occurs Such that transcription is Sequence whose native promoter contains a binding Site for increased of a gene (“Reporter Gene') responsive to (whose the DNA binding protein, but can alternatively be a chimeric transcription is under the control of) the transcriptional gene containing a Sequence that is transcribed under the activator. Thus, the Reporter Gene comprises a nucleotide control of a promoter that is not the native promoter for the Sequence operably linked to a promoter regulated by a DNA transcribed Sequence. binding site for the DNA binding domain of the transcrip In a specific embodiment, to make the fusion constructs tional activator. The activation of transcription of the (encoding the fusion proteins Such that the fusion proteins Reporter Gene occurs intracellularly, e.g., in prokaryotic or are expressed in the desired host cell) from each population eukaryotic cells, preferably in cell culture. (e.g., library), the activation domain and DNA binding The Reporter Gene comprises a nucleotide Sequence domain of a wide variety of transcriptional activator proteins operably linked to a promoter that is operably linked to one can be used, as long as these transcriptional activators have or more nucleic acid binding Sites that are specifically bound Separable binding and transcriptional activation domains. by the DNA binding domain of the fusion protein that is 15 For example, the GAL4 protein of S. cerevisiae, the employed in the assay of the invention, Such that binding of protein of S. cerevisiae (Hope and Struhl, 1986, Cell a reconstituted transcriptional activator or inhibitor to the 46:885–894); the ARD1 protein of S. cerevisiae (Thukral et one or more DNA binding Sites increases or inhibits, al., 1989, Mol. Cell. Biol. 9:2360-2369), and the human respectively, transcription of the nucleotide Sequence under (Kumar et al., 1987, Cell 51:941-951) the control of the promoter. The promoter that is operably have separable DNA binding and activation domains. The linked to the nucleotide Sequence can be a native or non DNA binding domain and activation domain that are native promoter of the nucleotide Sequence, and the DNA employed in the fusion proteins need not be from the same binding site(s) that are recognized by the DNA binding transcriptional activator. In a specific embodiment, a GALA domain portion of the fusion protein can be native to the or LEXA DNA binding domain is employed. In another promoter (if the promoter normally contains such binding 25 specific embodiment, a GAL4 or herpes simplex virus VP16 site(s)) or non-native. Thus, for example, one or more (Triezenberg et al., 1988, Genes Dev. 2:730–742) activation tandem copies (e.g., 4 or 5 copies) of the appropriate DNA domain is employed. In a specific embodiment, amino acids binding site can be introduced upstream of the TATA box in 1-147 of GAL4 (Ma et al., 1987, Cell 48:847-853; Ptashne the desired promoter (e.g., in the area of position -100 to et al., 1990, Nature 346:329-331) is the DNA binding -400). In a preferred aspect, 4 or 5 tandem copies of the 17 domain, and amino acids 411-455 of VP16 (Triezenberg et bp UAS (GAL4 DNA binding site) are introduced upstream al., 1988, Genes Dev. 2:730–742; Cress et al., 1991, Science of the TATA box in the desired promoter, that is in turn 251:87–90) is the activation domain. upstream of the desired coding Sequence that encodes a In a preferred embodiment, the transcriptional activator Selectable or detectable marker. In a preferred embodiment, that is reconstituted in the manner described above is the the GAL1-10 promoter is operably fused to the desired 35 yeast transcription factor GAL4 (FIG. 2). The host strain nucleotide Sequence; the GAL1-10 promoter already con bears a mutant GAL4 gene (e.g., having a deletion or point tains 5 binding sites for GAL4. Thus, in a particular mutation) and as Such cannot express the GAL4 transcrip embodiment, the transcriptional activation binding site of tional activator. the desired gene(s) can be deleted and replaced with GAL4 In another embodiment, the DNA-binding domain is binding sites (Bartel et al., 1993, BioTechniques 14(6) 40 Ace1N, the DNA binding domain of the Acel protein. In : 920–924; Chasman et al., 1989, Mol. Cell. Biol. another embodiment, the activation domain is Ace 1C, the 9:4746–4749). Referring to use of a particular gene as a activation domain of Ace 1. Ace1 is a yeast protein that Reporter Gene herein thus means that, if the native promoter activates transcription from the CUP1 operon in the pres is not driven by binding site(s) recognized by the DNA ence of divalent copper. CUP1 encodes metallothionein, binding domain used in the interaction assay of the 45 which chelates copper; thus, CUP1 gene expression is invention, such DNA binding site(s) have been introduced Reporter Gene expression suitable for use with Ace1N, in into the gene. which Selection is carried out by using copper in the media The Reporter Gene preferably comprises a nucleotide of the growing host cells which would otherwise be toxic to Sequence, whose transcription is regulated by the transcrip the cells. Alternatively or additionally, the Reporter Gene tional activator, that is a coding Sequence that encodes a 50 can comprise a CUP1-lacz fusion such that the enzyme detectable marker or Selectable marker, facilitating detection B-galactosidase is expressed upon binding of a transcrip of transcriptional activation, thereby detecting a protein tional activator reconstituted with Ace1N (see Chaudhuri et protein interaction. Preferably, the assay is carried out in the al., 1995, FEBS Letters 357:221–226). absence of background levels of the transcriptional activator In another specific embodiment, the DNA binding domain (e.g., in a cell that is mutant or otherwise lacking in the 55 of the human estrogen receptor is used, with a Reporter transcriptional activator). Preferably, more than one different Gene driven by one or three estrogen receptor response Reporter Gene is used to detect transcriptional activation, elements (see Le Douarin et al., 1995, Nucl. Acids. Res. e.g., one encoding a detectable marker, and one or more 23:876-878). encoding different selectable markers. The detectable In an embodiment in which the interaction assay is carries marker can be any molecule that can give rise to a detectable 60 out in a prokaryotic cell and in which fusion proteins to a Signal, e.g., an enzyme or fluorescent protein. The Selectable transcriptional inhibition domain are used as one of the marker can be any molecule which can be selected for its populations of proteins, both the DNA binding domain expression, e.g., which gives cells a Selective advantage over fusion population and the inhibition domain fusion popula cells not having the Selectable marker under appropriate tion can be fusions to the cI . In this embodiment, (Selective) conditions. In preferred aspects, the Selectable 65 interaction of two fusion proteins via the non-cI protein marker is an essential nutrient in which the cell in which the portions promotes oligomerization of the 2. cI DNA binding interaction assay occurs is mutant or otherwise lacks or is domain sufficient to cause DNA binding and inhibition of 6,057,101 13 14 transcription from the two phage major early promoters, encodes the SV40 T antigen. In the presence of reconstituted preventing lytic growth and rendering the host bacterial cells GAL4 DNA binding domain-activation domain (composed immune to superinfection by 2 (Hu et al., 1995, Structure of two interacting fusion proteins), SV40 T antigen is 3:431–433). Alternatively, the DNA binding domains of the produced from the Reporter Gene. If a plasmid is present LexA repressor (Schmidt-Dörr et al., 1991, Biochemistry that contains the SV40 origin of replication, this plasmid 30:9657-9664), 434 repressor (Pu et al., 1993, Nucl. Acids will replicate only upon the production of SV40 T antigen. Res. 21:4348-4355), or AraC protein (Bustos et al., 1993, Thus, replication of Such a plasmid is used as an indicator of Proc. Natl. Acad. Sci. USA 90:5638-5642) can be used in protein-protein interaction. Constructing one or both of the both the DNA binding domain and the transcription inhibi plasmids encoding the fusion proteins of the assay to contain tion fusion populations. an SV40 origin of replication means that replication of these The DNA binding domain and the transcription activator/ plasmids will be an indication of Reporter Gene activity. inhibitor domain each preferably comprises a nuclear local Sensitivity to DpnI can be used to destroy unreplicated ization signal (see Ylikomi et al., 1992, EMBO J. plasmids according to the methods described in Vasavada et 11:3681–3694; Dingwall and Laskey, 1991, TIBS al. (1991, Proc. Natl. Acad. Sci. USA 88:10686–10690). In 16:479–481) functional in the cell in which the fusion 15 an alternative embodiment, alternatively to an SV40 origin proteins are to be expressed. of replication, a polyoma virus replicon can be employed In another embodiment, the fusion constructs further (id.) comprise Sequences encoding affinity tags Such as Preferably, the protein-protein interactions are assayed glutathione-S-transferase or maltose-binding protein or an according to the method of the invention in yeast cells, e.g., epitope of an available antibody, So as to facilitate isolation Saccharomyces cerevisiae or Schizo-Saccharomyces pombe. of the encoded proteins by affinity methods (e.g., binding to Various vectors for producing the two fusion protein popu glutathione, maltose, or antibody, respectively) (see Allen et lations and host Strains for conducting the assay are known al., 1995, TIBS 20:511–516). In another embodiment, the and can be used (see, e.g., Fields et al., U.S. Pat. No. fusion constructs further comprise bacterial promoter 5,468,614 dated Nov. 21, 1995; Bartel et al., 1993, “Using Sequences operably linked to the fusion coding Sequences to 25 the two-hybrid System to detect protein-protein facilitate the production of the fusion proteins also in interactions,” in Cellular Interactions in Development, bacterial cells (see Allen et al., 1995, TIBS 20:511–516). Hartley, D. A. (ed.), Practical Approach Series xviii, IRL The host cell in which the interaction assay occurs can be Press at Oxford University Press, New York, N.Y., pp. any cell, prokaryotic or eukaryotic, in which transcription of 153–179; Fields and Sternglanz, 1994, TIG 10:286–292). the Reporter Gene can occur and be detected, including but By way of example but not limitation, yeast Strains or not limited to mammalian (e.g., monkey, chicken, mouse, derivative strains made therefrom which can be used are (see rat, human, bovine), bacteria, and insect cells, and is pref Section 6.3 and its subsections) N105, N106, N105", N106", erably a yeast cell. Expression constructs encoding and and YULH, the respective genotypes of these Strains are set capable of expressing the binding domain fusion proteins, forth in Section 6.3, infra. Exemplary strains that can be the transcriptional activation domain fusion proteins, and the 35 modified to create reporter Strains (containing the desired Reporter Gene product(s) are provided within the host cell, Reporter Gene for use in the assay of the invention) also by mating of cells containing the expression constructs, or include the following: by cell fusion, transformation, electroporation, Y190: MATa, ura3-52, his3-200, lys2-801, ade2-101, trp1 microinjection, etc. For example, GAL4 and VP16 are 901, leu2-3,112, gal4A, gal80A, cyh'2, LYS2::GAL1 functional in animal cells and thus the desired binding or 40 HIS3-HIS3, URA3::GAL1A-GAL1-lacz, activation domain thereof can be used in, e.g., yeast or (available from Clontech, Palo Alto, Calif.; Harper et al., mammalian cells. In a specific embodiment in which the 1993, Cell 75:805–816). assay is carried out in mammalian cells (e.g., hamster cells), Y190 contains HIS3 and lacZ Reporter Genes driven by the DNA binding domain is the GAL4 DNA binding GAL4 binding sites. domain, the activation domain is the herpes simplex virus 45 CG-1945: MATa, ura3-52, his3-200, lys2-801, ade2-101, VP16 transcriptional activation domain, and the Reporter trp1-901, leu2-3,112, gal4-542, gal80-538, cyh'2, LYS2: Gene contains the desired coding Sequence operably linked GAL1As-GAL1A1A-HIS3, URA3::GAL1 as 17 ca to a minimal promoter element from the adenovirus ElB CYC1 a-lacz (available from Clontech). CG-1945 con gene driven by several GAL4 DNAbinding sites (see Fearon tains HIS3 and lacZ Reporter Genes driven by GALA et al., 1992, Proc. Natl. Acad. Sci. USA89:7958–7962). As 50 binding Sites. will be apparent, other DNA binding domains, activation Y187: MATC, ura3-52, his3-200, ade2-101, trp1-901, leu2 domains, promoters, and/or DNA binding sites can be used, 3,112, gala-A, galSOA, URA3::GAL1As-GAL1-lacz, as long as the DNA binding sites are recognized by the DNA (available from Clontech). Y187 contains a lacZ Reporter binding domains, and the promoter is operative in the cells Gene driven by GAL4 binding sites. chosen in which to carry out the assay of the invention. The 55 SFY526: MATa, ura3-52, his3-200, lys2-801, ade2-101, host cell used should not express an endogenous transcrip trp 1-901, leu2-3,112, gal4-542, gal80-538, can, tion factor that binds to the same DNA site as that recog URA3:GAL1-lacz (available from Clontech). SFY526 nized by the DNA binding domain fusion population. Also, contains HIS3 and lacZ Reporter Genes driven by GALA preferably, the host cell is mutant or otherwise lacking an binding Sites. endogenous, functional form of the Reporter Gene(s) used in 60 HF7c: MATa, ura3-52, his3-200, lys2-801, ade2-101, trp1 the assay. 901, leu2-3,112, gal4-542, gal80-538, LYS2::GAL1 In a Specific embodiment, transcription of the Reporter HIS3, URA::GAL1 as 17 vers a-CYC1-lacz Gene is detected by a linked replication assay. For example, (available from Clontech). HF7c contains HIS3 and lacZ as described by Vasavada et al. (1991, Proc. Natl. Acad. Sci. Reporter Genes driven by GALA binding sites. USA 88:10686–10690), for use in animal cells, a Reporter 65 YRG-2: MATC, ura3-52, his3-200, lys2-801, ade2-101, Gene under the control of the E1B promoter, which pro trp 1 - 901, leu 2-3, 112, ga14-542, ga180-538, moter in turn is controlled by GAL4 DNA binding sites, LYS2::GAL1As-GAL1-HIS3, URA3::GAL1s 7 6,057,101 15 16 mers (x3)-CYC1-lacz (available from Stratagene). YRG-2 plasmids that have the yeast centromere in them may also be contains HIS3 and lacZ Reporter Genes driven by GAL4 used to express the activation and DNA-binding domain binding Sites. fusions (Elledge et al., 1988, Gene 70:303-312). In another Many other strains commonly known and available in the embodiment of the invention the DNA-binding chimeric art can be used. genes are introduced directly into the yeast chromosome via Consistent with convention in the art, wild-type gene homologous recombination. The homologous recombina names throughout this application are all capitalized and tion for these purposes is mediated through yeast Sequences italicized; mutant gene names are lower case and that are not essential for vegetative growth of yeast, e.g., italicized-except for lac7, for which the functional, non MER2, MER1, ZIP1, REC102, or ME14 gene. mutant gene is written lower case, italicized. In yet another embodiment of the invention, alternatively If not already lacking in endogenous Reporter Gene to plasmids, bacteriophage vectorS Such as a vectors are activity, cells mutant in the Reporter Gene may be Selected used as the DNA binding domain vectors and/or activation by known methods, or the cells can be made mutant in the domain vectors to make, e.g., the respective cDNA libraries. target Reporter Gene by known gene-disruption methods The use of ). Vectors generally makes it faster and easier to prior to introducing the Reporter Gene (Rothstein, 1983, 15 generate Such libraries than with the use of plasmid vectors. Meth. Enzymol. 101:202–211). The Second type of yeast host, for example the Strain C, In a Specific embodiment, plasmids encoding the different hosts a library of chimeric genes encoding hybrid proteins fusion protein populations can be both introduced into a that are all fusions of different genes to the activation Single host cell (e.g., a haploid yeast cell) containing one or domain of a transcriptional activator (see by way of example more Reporter Genes, by cotransformation, to conduct the Section 6.1.7). Preferably, this library is plasmid-borne, and assay for protein-protein interactions. As a preferred alter the plasmids are capable of replication in both E. coli and native to cotransformation of expression constructs, mating yeast. The plasmid contains a promoter directing the tran (e.g., of yeast cells) or cell fusion (e.g., of mammalian cells) Scription of the activation domain fusion gene, and a tran can be employed for delivery of a binding domain fusion Scriptional termination Signal. The plasmid preferably also expression construct and an activation domain fusion 25 contains a Selectable marker gene, the expression of which expression construct into a single cell. In a mating-type in the host cell permits Selection of cells containing the assay, conjugation of haploid yeast cells of opposite mating marker gene from cells that do not contain the Selectable type that have been transformed with a binding domain marker. In another embodiment of the invention the DNA fusion expression construct (preferably a plasmid) and an binding chimeric genes are introduced directly into the yeast activation (or inhibitor) domain fusion expression construct chromosome via homologous recombination. The homolo (preferably a plasmid), respectively, delivers both constructs gous recombination for these purposes is mediated through into the same diploid cell. The mating type of a Strain may yeast Sequences that are not essential for vegetative growth be manipulated as desired, by transformation with the HO of yeast. gene (Herskowitz and Jensen, 1991, Meth. Enzymol. In one embodiment of the invention, the DNA-binding 194:132–146). 35 domain and the activation domain arise from the same In a preferred embodiment, a yeast interaction mating transcriptional activator where these functions reside in assay is employed, using two different types of host cells, Separate domains. In another embodiment, the DNA-binding Strain-types a and C., of the yeast Saccharomyces cerevisiae and the activation domains may be from different transcrip (FIG. 2). The host cell preferably contains at least two tional activators. Preferably, the two chimeric gene libraries Reporter Genes, containing a binding Site for the DNA 40 are made from cDNA from various Sources, for example, binding domain (e.g., of a transcriptional activator), Such different human tissues, fused to the DNA-binding or the that the Reporter Gene is transcriptionally activated when activation domains, respectively (see by way of example the DNA-binding domain is in proximity to an activator Section 6.1.6). These cDNA libraries may be derived from domain of a transcriptional activator. The activator domain Subtracted or normalized cDNA populations. In other spe and DNAbinding domain are each parts of chimeric proteins 45 cific embodiments, the fusions are of genomic, Synthetic, formed from the two respective populations of proteins. viral or bacterial DNA fused to the DNA-binding domains or One type of host cell, for example the a Strain, hosts a the activation domains of the transcriptional activator. library of chimeric genes that encode hybrid proteins that are In a specific embodiment, the invention provides a all fusions of different nucleotide sequences (e.g., gene method of detecting one or more protein-protein interactions Sequences) to the DNA-binding domain of a transcriptional 50 comprising (a) recombinantly expressing in a first popula activator like GAL4 (see by way of example Section 6.1.7). tion of yeast cells of a first mating type, a first population of These hybrid proteins are capable of recognizing the DNA first fusion proteins, each first fusion protein comprising a binding Site on the Reporter Gene. In a preferred embodi first protein Sequence and a DNA binding domain, in which ment of this invention, the library of DNA-binding domain the DNA binding domain is the same in each said first fusion chimeric genes is introduced into the host cell as a set of 55 protein; wherein Said first population of yeast cells contains plasmids. These plasmids are preferably capable of autono a first nucleotide Sequence operably linked to a promoter mous replication in a host yeast cell and preferably can also driven by one or more DNA binding sites recognized by said be propagated in E. coli. The plasmid contains a promoter DNA binding domain such that an interaction of a first directing the transcription of the DNA binding domain fusion protein with a Second fusion protein, Said Second fusion gene, and a transcriptional termination signal. The 60 fusion protein comprising a transcriptional activation plasmid preferably also contains a Selectable marker gene, domain, results in increased transcription of Said first nucle the expression of which in the host cell permits Selection of otide Sequence, and in which said first population of first cells containing the marker gene from cells that do not fusion proteins has a complexity of at least 1,000; (b) contain the Selectable marker, upon incubation of the cells in negatively Selecting to eliminate those yeast cells expressing an environment in which Substantial death of the cells occurs 65 Said first population of first fusion proteins in which Said in the absence of expression of the Selectable marker. The increased transcription of Said first nucleotide Sequence plasmid can be single-copy or multi-copy. Single-copy yeast occurs in the absence of Said Second fusion protein; (c) 6,057,101 17 18 recombinantly expressing in a Second population of yeast the immunospecific binding of an antibody to Such protein, cells of a Second mating type different from Said first mating which antibody can be labeled, or alternatively, which type, a Second population of Said Second fusion proteins, antibody can be incubated with a labeled binding partner to each Second fusion protein comprising a Second protein the antibody, So as to yield a detectable signal. Alam and Sequence and an activation domain of a transcriptional Cook (1990, Anal. Biochem. 188:245-254) disclose non activator, in which the activation domain is the same in each limiting examples of detectable marker genes that can be Said Second fusion protein, and in which Said Second popu constructed So as to be operably linked to a transcriptional lation of Second fusion proteins has a complexity of at least regulatory region responsive to a reconstituted transcrip 1,000; (d) mating said first population of yeast cells with tional activator used in the method of the invention, and thus Said Second population of yeast cells to form a population of used as Reporter Genes. AS will be apparent, use of a diploid yeast cells, wherein Said population of diploid yeast particular Reporter Gene should be conducted in cells cells contains a Second nucleotide Sequence operably linked mutant or otherwise lacking in functional versions of the to a promoter driven by a DNA binding site recognized by Reporter Gene. Thus, for example, for (positive or negative) said DNA binding domain such that an interaction of a first selection for URA3 Reporter Gene activity, the host cell fusion protein with a Second fusion protein results in 15 should be homozygous mutant (point mutation or deleted or increased transcription of Said Second nucleotide Sequence, otherwise lacking function of the gene in both alleles) So as in which the first and Second nucleotide Sequences can be the to lack endogenous URA3 activity. Similarly, in the use of same or different; and (e) detecting said increased transcrip a LYS2 Reporter Gene, the host cell should be homozygous tion of Said first and/or Second nucleotide Sequence, thereby mutant for LYS2, in the use of a CAN1 Reporter Gene for detecting an interaction between a first fusion protein and a negative Selection, the host cell should be homozygous Second fusion protein. mutant for CAN1, in the use of a CYH2 Reporter Gene for In a preferred embodiment, the two libraries of chimeric negative Selection, the host cell should be homozygous genes are combined by mating the two yeast Strains on Solid mutant for CYH2, etc., in cases in which the host cell has an media for a period of approximately 6–8 hours (see Section endogenous form of the Reporter Gene. 6.1.1). In a less preferred embodiment, the mating is per 25 The activation of Reporter Genes like URA3 or HIS3 formed in liquid media. The resulting diploids contain both enables the cells to grow in the absence of uracil or the kinds of chimeric genes, i.e., the DNA-binding domain bi-stidine, respectively, and hence Serves as a Selectable fusion and the activation domain fusion. The interaction marker. Thus, after mating, the cells exhibiting protein between the two hybrid proteins within a diploid cell causes protein interactions are Selected by their abilities to grow in the activation domain to be in close proximity to the media lacking the requisite ingredient like uracil or DNA-binding domain of the transcriptional activator. This histidine, respectively (referred to as -URA (minus URA) in turn causes reconstitution of the transcriptional activator and -HIS medium, respectively) (see by way of example and is monitored by the activity of the Reporter Gene. Thus, Section 6.3-6.5). In a specific embodiment, -HIS medium when two libraries M and N are mated together, an MXN preferably contains 3-amino-1,2,4-triazole (3-AT), which is Screen for interacting proteins is performed. 35 a competitive inhibitor of the HIS3 gene product and thus In a preferred embodiment, the two host Strains are requires higher levels of transcription in the Selection (see preferably of the mating type a and C. of the yeast Saccha Durfee et al., 1993, Genes Dev. 7:555-569). Similarly, romyces cerevisiae. Each mating type of the host preferably 6-azauracil, which is an inhibitor of the URA3 gene product, has at least two Reporter Genes that each contain one or can be included in-URA medium (Le Douarinet al., 1995, more recognition Sites for the DNA-binding domain. 40 Nucl. Acids Res. 23:876-878). Alternatively to detecting Preferably, the Reporter Gene(s) are the URA3, HIS3 and/or URA3 gene activity by selecting in -URA medium, URA3 the lacZ (see, e.g., Rose and Botstein, 1983, Meth. Enzymol. gene activity can be detected and/or measured by determin 101:167-180) gene that have been manipulated so as to ing the activity of its gene product, orotidine-5'- contain recognition sites (preferably at least two) in the monophosphate decarboxylase (Pierrat et al., 1992, Gene promoter for the DNA-binding domain of GAL4 (see by 45 119:237-245; Wolcott et al., 1966, Biochem. Biophys. Acta way of example Section 6.3.5) (FIG. 2). In other 122:532-534). In other embodiments of the invention, the embodiments, Reporter Genes comprising the functional activities of the reporter genes like lacZ or GFP are moni coding Sequences of genes, including but not limited to, tored by measuring a detectable signal (e.g., fluorescent or Green Fluorescent Protein (GFP) (Cubitt et al., 1995, Trends chromogenic) that results from the activation of these Biochem. Sci. 20:448-455), luciferase, LEU2, LYS2, 50 Reporter Genes. For example, lac7, transcription can be ADE2, TRP1, CAN1, CYH2, GUS, CUP1 (encoding met monitored by incubation in the presence of a chromogenic allothionein which conferS resistance to copper) or chloram substrate, such as X-gal (5-bromo-4-chloro-3-indolyl-B-D- phenicol acetyl transferase (CAT) may be used, operatively galactoside), for its encoded enzyme, B-galactosidase. The linked to a promoter driven by DNA binding site(s) recog pool of all interacting proteins isolated by this manner from nized by the DNA binding domain being employed in the 55 mating the two libraries is termed the “interactive popula assay to form a fusion population. LEU2, LYS2, ADE2 and tion” (see by way of example FIG. 3). TRP1 are selectable markers, i.e., their activity results in In a preferred embodiment of the invention, false posi prototrophic growth in media lacking the nutrients encoded tives arising from transcriptional activation by the DNA by these genes, while the activity of luciferase, GUS and binding domain fusion proteins in the absence of a tran CAT are preferably monitored enzymatically. Preferably, 60 Scriptional activator domain fusion protein are prevented or CAN1 and CYH2 Reporter Genes are used to carry out reduced by negative Selection for Such activation within a negative Selection in the presence of canavanine and host cell containing the DNA binding fusion population, cyloheximide, respectively (see infra), rather than to detect prior to exposure to the activation domain fusion population. an interacting pair of proteins. With respect to GFP, the By way of example, if such cell contains URA3 as a natural fluorescence of the protein is detected. In another 65 Reporter Gene, negative Selection is carried out by incubat embodiment, the expression of Reporter Genes that encode ing the cell in the presence of 5-fluoroorotic acid (5-FOA, proteins can be detected by immunoassay, i.e., by detecting which kills URA+ cells (Rothstein, 1983, Meth. Enzymol. 6,057,101 19 20 101:167-180). Hence, if the DNA-binding domain fusions has URA3 as a reporter gene. This library should not activate by themselves activate transcription, the metabolism of transcription by itself. To weed out DNA-binding domain 5-FOA will lead to cell death and the removal of self fusions that activate transcription by themselves (carry out activating DNA-binding domain hybrids. By way of another negative Selection), the yeast transformants containing the example, if LYS2 is present as a Reporter Gene in the cell, DNA-binding domain library are plated out on media that negative Selection is carried out by incubating the cell in the contain the chemical 5-fluoroorotic acid (5-FOA). In order presence of B-amino-adipate (Chatoo et al., 1979, Genetics to easily detect the protein-protein interactions between 93:51), which kills LYS' cells. In another embodiment, if proteins in complex populations as provided by the methods CAN1 is present as a Reporter Gene in the cell, negative of the present invention, it is preferred to use a host cell Selection is carried out by incubating the cell in the presence containing at least two, preferably three, Reporter Genes of canavanine (CAN1 encodes an arginine permease that (e.g., HIS3, URA3, lacZ operably linked to a DNA binding renders the cell sensitive to the lethal effects of canavanine) Site of a transcription activator that is recognized by the (Sikorski et al., 1991, Meth. Enzymol. 194:302–318). In yet DNA binding domain part of the fusion protein, in a yeast another embodiment, if CYH2 is present as a Reporter Gene host cell), and to carry out negative Selection among the in the cell, negative Selection is carried out by incubating the 15 DNA binding domain-fusion protein population (e.g., by use cell in the presence of cycloheximide (CYH2 encodes the of 5-FOA and a URA3 Reporter Gene); and to use a yeast L29 protein of the yeast ribosome; the wild-type L29 protein mating assay in which the mating is performed on a Solid is Sensitive to cycloheximide which thus blockS protein phase, which increases the percentage of productive mating synthesis, resulting in cell death) (Sikorski et al., 1991, events that can be recovered. Meth. Enzymol. 194:302–318). Such negative selection with In a specific embodiment, a DNA binding domain fusion the DNA-binding domain fusion population helps to avoid library is expressed from a first plasmid population, and a false positives that become amplified through the preferred transcription activation domain fusion library is expressed processing Steps of the invention, and which becomes more from a Second plasmid population, and each plasmid con troublesome as the complexity of the assayed populations tains a Selectable marker. For example, the first plasmid increases. In another embodiment, the DNA-binding domain 25 population can express TRP1, and the Second plasmid fusion population can be Subjected to negative immunose population can express LEU2, or Some other gene encoding lection by use of antibodies Specific to the expressed protein an essential amino acid So that the presence of the plasmid product of a Reporter Gene; in this embodiment, cells can be Selected for in medium lacking the amino acid. In a expressing a protein that is recognized by the antibody are preferred embodiment, the first plasmid population is removed and the fusion constructs from the remaining cells expressed in a yeast Strain of a first mating type (Selected are kept for use in the interaction assay. In yet another from between a and C), and which yeast strain is deficient in embodiment, negative Selection can be carried out by plating endogenous URA3 and HIS3, and contains URA3 as a the DNA-binding domain fusion population on medium Reporter Gene and optionally also lacz as a Reporter Gene. selective for interaction (e.g., minus URA or minus HIS In a preferred embodiment, the Second plasmid population is medium if the Reporter Gene is URA3 or HIS3, 35 expressed in a yeast Strain of a Second mating type different respectively), following which all the Surviving colonies are from the first mating type, which yeast Strain is deficient in physically removed and discarded. Negative Selection endogenous URA3 and HIS3, and contains HIS3 as a involving the use of a Selectable marker as a Reporter Gene Reporter Gene and optionally also lac7 as it Reporter Gene. and the presence in the cell medium of an agent toxic or Yeast cells of the first mating type are transformed with the growth inhibitory to the host cells in the absence of Reporter 40 first plasmid population, and are positively Selected for the Gene transcription is preferred, since it allows high plasmids and are negatively Selected for false positive throughput, i.e., a much greater number of cells to be transcriptional activation by incubating the cells in an envi processed much more easily than alternative methods. ronment (e.g., liquid medium, and/or Solid phase plates) AS will be apparent, negative Selection can also be carried lacking the Selectable marker (e.g., tryptophan) and contain out on the activation domain fusion population prior to 45 ing 5-FOA. Selected cells are pooled. Yeast cells of the interaction with the DNA binding domain fusion population, Second mating type are transformed with the Second plasmid by Similar methods, alone or in addition to negative Selection population, and are positively Selected for the plasmids by of the DNA binding fusion population. incubating the cells in an environment lacking the appro In another embodiment, negative Selection can also be priate Selectable marker, e.g., leucine. Selected cells are carried out on the recovered pairs of protein interactants, by 50 pooled. Both groups of pooled cells are mixed together and known methods (see, e.g., Bartel et al., 1993, BioTechniques mating is allowed to occur on a Solid phase. The resulting 14(6): 920–924) although pre-negative selection (prior to the diploid cells are then transferred to Selective media, that interaction assay), as described above, is preferred. For Selects for the presence of each plasmid and for activation of example, each plasmid encoding a protein (peptide or Reporter Genes, i.e., in this embodiment, medium lacking polypeptide) fused to the activation domain (one-half of a 55 uracil, histidine, tryptophan and leucine, and optionally, also detected interacting pair) can be transformed back into the containing 3-amino-1,2,4-triazole. original Screening Strain, either without any other plasmid, In Specific embodiments, the invention also provides or with a plasmid encoding only the DNA-binding domain, purified cells of a single yeast Strain of mating type a, that the DNA-binding domain fusion to the detected interacting is mutant in endogenous URA3 and HIS3, and contains protein (the Second half of the detected interacting pair), or 60 functional URA3 coding Sequences under the control of a the DNA-binding domain fusion to an irrelevant protein; a promoter containing GAL4 binding sites, and contains func positive interaction detected with any plasmid other than tional lac7, coding Sequences under the control of a promoter that encoding the DNA-binding domain fusion to the containing GAL4 binding sites, and also provides purified detected interacting protein is deemed a false positive and cells of a single yeast Strain of mating type C, that is mutant eliminated from further use. 65 in endogenous URA3 and HIS3, and contains functional In a preferred embodiment of the invention, the DNA URA3 coding Sequences under the control of a promoter binding domain library is introduced into a host Strain that containing GAL4 binding sites, and contains functional lacZ 6,057,101 21 22 coding Sequences under the control of a promoter containing BY A PARTICULAR TISSUE TYPE, DISEASE STATE GAL4 binding Sites. A kit is also provided, comprising in OR STAGE OF DEVELOPMENT, AND CREATION OF one or more containers cells of the foregoing Strains. In a “PROTEIN INTERACTION MAPS Specific embodiment, the kit further comprises in one or An important object of the present invention is to provide more containers (a) a first vector comprising (i) a promoter; a method to identify protein-protein interactions that are (ii) a nucleotide sequence encoding a DNA binding domain, unique to particular disease States, Stages of development, or operably linked to the promoter; (iii) means for inserting a tissue type. An analysis of the interacting proteins between DNA sequence encoding a protein into the vector in Such a two populations of proteins (“MxN analysis”) performed in manner that the protein is capable of being expressed as part parallel on two types of tissue or disease States, wherein both of a fusion protein containing the DNA binding domain; (iv) the M and N populations are preferably identical and are a transcription termination Signal operably linked to the derived from the same type of tissue or disease State, will nucleotide sequence, (v) a means for replicating in the cells yield the respective interactive protein populations for each of the above-described yeast Strains; and (c) a Second vector type. The differences between the two interactive popula comprising (i) a promoter; (ii) a nucleotide Sequence encod tions will yield the protein-protein interactions that are ing an activation domain of a transcriptional activator, 15 characteristic of or unique to a particular tissue type or operably linked to the promoter; (iii) means for inserting a disease State. Hence, it is desired to identify and isolate the DNA sequence encoding a protein into the vector in Such a protein-protein interactions that are unique to a complex manner that the protein is capable of being expressed as part population. This is preferably achieved by coding, pooling of a fusion protein containing the activation domain of a and arraying Strategies for the interactants as described transcriptional activator; (iv) a transcription termination below and deconvolution of the arrayed interactants by Signal operably linked to the nucleotide sequence; and (v) a Sequencing a Quantitative Expression Analysis (QEATM means for replicating in the cells of the above-described method), SEQ-QEATM method, and/or other methods that yeast Strains. The means for inserting a DNA sequence can facilitate analysis of the interactants (e.g., SAGE be one or more restriction endonuclease recognition Sites (Velculescu et al., 1995, Science 270:484-487). suitably located within the vector. 25 Alternatively, Sequencing of individual interactants provides In a preferred embodiment of the invention, after an a method for identifying the interacting genes that does not interactive population is obtained, the DNA sequences necessarily use pooling or require deconvolution. Thus, in encoding the pairs of interactive proteins are isolated by a this alternative embodiment, clones of interactants can be method wherein either the DNA-binding domain hybrids or recovered, e.g., from the interactant-positive yeast cells, the activation domain hybrids are amplified Specifically in amplified or grown up in bacteria, and Subjected to Sequence an individual reaction (see by way of example Section 6.9). analysis. Sequencing can be carried out by any of numerous Preferably, both the DNA-binding fusion sequences and the methods known in the art (see e.g., Sanger et al., 1977, Proc. activation domain fusion Sequences are amplified, in Sepa Natl. Acad. Sci. USA 74(12):5463-5467). In a specific rate respective reactions. Preferably, the amplification is 5 embodiment, to enhance throughput, a multiplex Sequencing carried out by polymerase chain reaction (PCR) (U.S. Pat. 35 analysis can be conducted. For example, in a multiplex Nos. 4,683,202. 4,683,195 and 4,889,818; Gyllenstein et al., Sequencing analysis, one can carry out dideOXy Sequencing 1988, Proc. Natl. Acad. Sci. USA85:7652–7656; Ochman et reactions with just one of the dideoxynucleotides, e.g., ddT, al., 1988, Genetics 120:621-623; Loh et al., 1989, Science using a different dye on the dideoxynucleotide in the reac 243:217-220; Innis et al., 1990, PCR Protocols, Academic tion with DNA of each of four Separate interactant pairs, Press, Inc., San Diego, Calif.), using pairs of oligonucleotide 40 which reaction products are then pooled together and Sub primers that are specific to either the DNA-binding domain jected to electrophoresis. Comparing the pattern of bands hybrids or the activation domain hybrids in the PCR reaction formed by DNA of interactant pairs from different popula (see by way of example Section 6.1.8). This PCR reaction tions identifies differences, indicating an interacting protein can also be performed on pooled cells expressing interacting Specific to that population. The DNA for Such a protein can protein pairs, preferably pooled arrays of interactants. Other 45 then be sequenced fully. Moreover, identical patterns of amplification methods known in the art can be used, includ bands for a single dye between pooled groups identifies ing but not limited to ligase chain reaction (see EP 320,308) interactions which share the same partners, thus Saving use of O?3 replicase, or methods listed in Kricka et al., 1995, Sequencing DNA encoding a common interacting protein Molecular Probing, Blotting, and Sequencing, chap. 1 and over and over again. This method would raise throughput table IX, Academic Press, New York. 50 four-fold. In another embodiment of the invention, the plasmids 5.2.1. DETERMINATION OF ALL THE DETECTABLE encoding the DNA-binding domain hybrid and the activa PROTEIN-PROTEIN INTERACTIONS tion domain hybrid proteins are isolated from yeast cells by Cells containing interacting protein pairs are identified as transforming the yeast DNA into E. coli and recovering the described above, by detecting Reporter Gene expression. plasmids from E. coli (see e.g., Hoffman et al., 1987, Gene 55 Determining all the detectable pairs of Linteractions then 57:267-272). This is possible when the plasmid vectors used employs pooling and two sets of deconvolution reactions. for both the DNA-binding domain and the activation domain The first set characterizes all the “M” interacting partners; hybrids are shuttle vectors that can replicate both in E. coli the Second Set characterizes the "N' interacting partners. and in yeast. Many Such Shuttle vectors are known in the art Preferably, DNA of cells containing interacting proteins is and can be used. Alternatively, if a Shuttle Vector is not used, 60 Subjected to an amplification reaction that Specifically the yeast vector can be isolated, and the insert encoding the amplifies the DNA-binding fusion Sequences and, in a fusion protein Subcloned into a bacterial expression vector Separate reaction, the activation domain fusion Sequences. In for growth in bacteria. Growing up the interacting clones in a preferred embodiment, the characterizations of interacting bacteria yields large quantities without the use of amplifi partners are performed by “the SEQ-QEATM method” (see cation reactions Such as PCR. 65 infra) on PCR products that were generated with “M” or “N” 5.2. CHARACTERIZATION OF INTERACTIVE POPU Specific amplification primers, respectively (see by way of LATIONS THAT ARE DIFFERENTIALLY EXPRESSED example FIG. 3). The “M”-specific amplification primers 6,057,101 23 24 hybridize Specifically to and amplify Sequences from one incorporated by reference herein in their entireties. QEATM type of fusion construct, e.g., the DNA binding fusion methods that can be used are also described in Section 5.4, construct (e.g., by hybridization to vector Sequences flank infra, and, by way of example, in Section 6.1.12. ing the inserted variant protein coding Sequences of popu A QEATM method reveals the distribution (both qualita lation M that are fused to the DNA binding domain tive and quantitative) of genes within a population. Thus sequences). The "N"-specific amplification primers hybrid when comparing two interactive populations to which a ize specifically to and amplify Sequences from the other type QEATM method is applied, the differential presence of genes of fusion construct, the activation domain fusion construct between two interactive populations is identified as unique (e.g., by hybridization to vector Sequences flanking the or increased or decreased intensity bands after size Separa inserted variant protein coding Sequence of population N tion Such as in a denaturing polyacrylamide gel (see by way that are fused to the activation domain sequences). The PCR of example Section 6.1.12). In a preferred embodiment of is preferably performed wherein DNA-binding and activa the invention, the identity of the gene producing each band tion domain fusion specific primers are used to amplify the is determined by a modification of the QEATM method called genes encoding the two interacting proteins directly from the SEQ-QEATM method (see by way of example Sections yeast (see by way of example Section 6.1.8). This PCR 15 6.1.12.2 and 6.1.12.5). The SEQ-QEATM method (for product Serves as a reservoir for further analysis, including Sequencing QEATM) provides a method to identify the 4 the QEATM method, the SEQ-QEATM method (see infra) and terminal nucleotides next to a Subsequence that was used as Sequencing, that leads to the identification of interacting a recognition site in the QEATM method. Thus, by combining proteins, in particular, those that are differentially expressed the information from the OEATM method and the SEQ (e.g., stage-specific). The primers used in the PCR reaction QEATM method, it is possible to classify and identify pre may be labelled, e.g., by biotinylation or addition of fluo cisely the DNA sequences present in an interactive popula rescent tags and may also serve to introduce Specific restric tion without sequencing. A description of SEQ-QEATM tion endonuclease Sites. The labels are useful tools in the methods is provided in Section 5.4.4 and, by way of Subsequent QEATM method and Sequencing. example, in Sections 6.1.12.2 (and its Subsections) and Thus, in a specific embodiment, DNA isolated from each 25 6.1.12.5. cell containing each individual pair of interactants is, in 5.23. ARRAYING AND CODING STRATEGIES FOR separate reactions, subjected to PCR to amplify the DNA AN INTERACTIVE POPULATION encoding the DNA-binding domain fusion protein, and DNA In a preferred embodiment, “interactive colonies' are encoding the activation domain fusion protein, respectively. arrayed into Wells on microtiter plates. “Interactive colo The DNA encoding the DNA-binding domain fusion protein nies' are those colonies that emerge as a result of the and the DNA encoding the activation domain fusion protein Selection of interacting proteins. A deconvolution Strategy are each Subjected to Sequencing analysis to determine its allows for a characterization of both members of each pair Sequence and thus the Sequence of the interacting protein of interacting proteins (from all the individual Wells) without that formed a part of the fusion protein. In this manner, each Sequencing each pair individually. In this way, the proteins individual pair of interactants is identified. Alternative meth 35 expressed in each well are characterized and Statistics can be ods that can be used to identify individual pairs of interac gathered as to the frequency of the types of interactions. We tants are described in Sections 5.2.2 to 5.2.6.2. refer to the catalog of interacting proteins as a “protein 5.2.2. CLASSIFICATION OF THE ARRAYED POOLS interaction map’. This characterization can be further used OF INTERACTANTS BY THE OEATM METHOD AND to identify the genes of interest directly or to indicate the THE SEO-OEATM METHOD 40 Specific physical locations in the array of clones that should A Quantitative Expression Analysis method (QEATM be sequenced to determine (or confirm) the identities. Thus, method) produces signals comprising target Subsequence this proceSS provides information on protein-protein inter presence and a representation of the length in base pairs actions characterizing a population of interest. along a nucleic acid between adjacent target Subsequences The differences between the patterns of protein interac by measuring the results of recognition reactions on DNA 45 tions in different types of tissue (e.g., diseased versus (e.g., cDNA or genomic DNA) mixtures. A QEAW method normal, different stages of development, etc.) provide infor provides an economical, quantitative, and precise classifi mation that can be much more valuable than the knowledge cation of DNA sequences, either in arrays of Single Sequence of the interactions in a single tissue alone. Similarly, expres clones or in mixtures of Sequences, without actually sion levels (e.g., as determined by the QEATM method) yield Sequencing the DNA. Preferably, all the Signals taken 50 greater value when they can be correlated with fundamental together have Sufficient discrimination and resolution So that differences in various tissue Samples. A protein interaction each particular DNA sequence in a Sample may be individu map of any given tissue or cell type will contain many ally classified by the particular Signals it generates, and with non-biological or unimportant interactions. However, a reference to a database of DNA sequences possible in the comparison of the interactions taking place between a dis Sample individually determined. These signals are prefer 55 ease State, and a "normal” State will be very informative, as ably optical, generated by fluorochrome labels and detected this process of comparison tends to eliminate the unimpor by automated optical detection technologies. The Signals are tant interactions. Identifying the genes encoding interacting generated by detecting the presence or absence of short proteins may also provide information on the putative bio DNA Subsequences within a nucleic acid Sequence of the logical functions of the genes of interest, which will help Sample analyzed. The Subsequences are detected by use of 60 assess which of the interactions detected awe likely to take recognition means, or probes for the Subsequences. A place physiologically (Some interactions in the protein inter detailed description of the QEATM methods is provided in action map might be artifacts of the method) and be of the U.S. patent applications Ser. No. 08/547,214 filed on heightened interest. It can also be valuable to review the Oct. 24, 1995, and Serial No. to be assigned, filed on even differences in protein interaction maps with the results of a date herewith, both by Rothberg et al. and entitled “Method 65 QEATM method or other method of analyzing expression and apparatus for classifying, identifying, or quantifying levels (e.g., SAGE (Velculescu et al., 1995, Science DNA sequences in a Sample without Sequencing, which are 270:484-487; Northern analysis) performed on a cDNA 6,057,101 25 26 population prior to performing an interaction Screen accord have been identified. Sequencing of each pair of interactors ing to the invention. For instance, the appearance of a new individually corresponds formally to a 1-dimensional Strat interaction in diseased tissue that is not present in normal egy in which each pool draws from one of the N Samples. tissue can be correlated with the OEATM method or SAGE This yields N pools in total. In higher dimensions, the or Northern analysis measurements of the expression levels number of pools required is of the genes involved in the interaction. Upregulation or co-regulation of the genes would serve to corroborate the DN1/P protein interaction maps. 5.24. MAINTAINING LINKAGE BETWEEN PAIRS where D is the number of dimensions. (This assumes a OF INTERACTING PROTEINS Square grid). The maximum number of genes in each pool is The most preferable QEATM method on the amplified the number of colonies contributing to each of the pools. products derived from a pool of interactive colonies identi Again assuming a Square grid, the maximum number is fies the interactions that take place in a Sample and identifies the differences between Samples while retaining the linkage between a Specific gene and the corresponding interacting 15 where N is the total number of colonies used in a partner gene of the interacting pair. If the QEATM method is D-dimensional pooling Strategy. done on the entire pooled, interacting population and this is Increasing the dimensionally D reduces the total number compared to another entire population, the linkage between of pools but increases the total number of genes that can be interacting partners in each individual Sample (i.e., from in each pool. It is preferable to choose the largest value for each individual colony containing a separate interacting DSuch that the genes in a pool can Still be identified. Thus, pair) is lost. Preferable pooling Strategies are coupled with the optimal pooling Strategy, i.e., the preferred choice for D, deconvolution Strategies that maintain linkage between the depends on the number of individual genes that can be interacting partners and allow identification of the interac identified in a single pool as well as on the total number of tive colony that gives rise to each Set of interacting partners. interactive colonies. The invention provides a method of determining one or 25 In order to Standardize the pooling and deconvolution more characteristics of or the identities of nucleic acids Strategy, it can be preferable to use a 2-dimensional pooling encoding an interacting pair of proteins from among a and deconvolution Strategy exclusively. If the size of the population of cells containing a multiplicity of different interactive population is in the hundreds, then a simple nucleic acids encoding different pairs of interacting proteins, two-dimensional pooling Strategy Suffices. said method comprising (a) designating each group of cells Further details of preferred pooling and deconvolution containing nucleic acids encoding an identical pair of inter Strategies are provided below. In a specific embodiment, acting proteins as one point of a multidimensional array in Strategies are automated. which the intersection of axes in each dimension uniquely 5.2.5. POOLING STRATEGIES identifies a single said group; (b) pooling all groups along a 2-dimensional pools Simple axis to form a plurality of pooled groups; (c) ampli 35 In a preferred embodiment of a 2-dimensional Strategy, fying from a first aliquot of each pooled group a plurality of the interactive colonies are arrayed in a 12x8 grid repre first nucleic acids, each first nucleic acid comprising a senting 96 different interactive colonies (as shown in FIG. Sequence encoding a first protein that is one-half of a pair of 4A). The cells from the rows and columns are their pooled interacting proteins; (d) amplifying from a Second aliquot of together and amplification (preferably PCR) is performed on each pooled group a plurality of Second nucleic acids, each 40 the pools of interactants. Two sets of amplification (e.g., Second nucleic acid comprising a sequence encoding a PCR) reactions, one specific for one kind of the fusion Second protein that is the other half of the pair of interacting protein (or M) and the other specific for the second kind of proteins; (e) Subjecting said first nucleic acids from each fusion protein (or N), are then performed. If the total number pooled group to size separation, (f) Subjecting said Second of interactants is Small (<20), then electrophoretic separation nucleic acids from each pooled group to size separation; (g) 45 (e.g., by polyacrylamide or agarose gel electrophoresis) of identifying which at least one of Said first nucleic acids are the amplified (e.g., PCR) products is generally Sufficient to present in Samples of first nucleic acids from a pooled group distinguish the interactants from one another (see FIG. 4A). from each axes in each dimension, thereby indicating that In that case, comparison of the amplified products from each Said at least one first nucleic acid is present in Said array in row and column identifies the interactive colony from which the group designated at the interSection of Said axes in each 50 the amplified product originated. That is, the presence of a dimension; and (h) identifying which at least one of Said band in both a Sample from a pooled row and a Sample from Second nucleic acids are present in Samples of a Second a pooled column indicates that the band is present in the nucleic acid from a pooled group from axes in each interactive colony present at the interSection of the row and dimension, thereby indicating that the Said at least one colony. A perfect symmetry (the same PCR product in two Second nucleic acid is present in Said array in the group 55 rows and columns) indicates either the same pair of inter designated at the interSection of Said axes in each dimension; actants repeating or two pairs of interactants that have insert in which the first and Second nucleic acids that are indicated DNAS of identical lengths. to be present in Said array in a group designated at the same When the number of interactants is greater than 20 and interSection are indicated to encode interacting proteins. In within a few hundred, a 2-dimensional Strategy is still preferred aspects, Such a method is applied to colonies of 60 Sufficient. However, distinct inserts may have the same yeast cells, each colony containing nucleic acids encoding a lengths and may not be separated to adequate resolution, for different pair or interacting proteins identified according to example, by electrophoresis of PCR products. Therefore, in a method of the invention. Exemplary pooling and decon a preferred embodiment, to aid in the deconvolution, the volution Strategies are described below. QEATM method applied to cDNA populations is performed Pooling and deconvolution Strategies can be characterized 65 with a 4-mer or 6-mer recognition Subsequence. The length by the dimensionally of the pooling array. We assume that N of the recognition Subsequence is adjusted to provide a distinct colonies containing interacting pairs of proteins resolvable number of the OEATM method bands. Because the 6,057,101 27 28 Size of the inserts in interactive populations tends to be in the of interacting proteins, as described hereinbelow. The range of 0.5 to 3 kb when using mammalian cDNA libraries QEATM method performed on the interactive populations as Source of the populations, the use of 6-mer Subsequences identifies “difference bands (bands that differentiate one can necessitate that a large number of reactions be per interactive population from the other). In a pooling Strategy, formed in order to ensure that every insert DNA contains in which different colonies are pooled together before the two Such Subsequences and thus has been included in the QEATM method, it is preferable to have means to indicate QEATM method. The use of 4-mer recognition Subsequences which colony gave rise to each band. This Section describes provides more frequent cutting and can alleviate this prob means for performing Sequencing Studies to identify which lem. AS 4-mer Subsequence "hits’ occur more frequently colony gives rise to each band. The methods in this Section than with 6-mer Subsequences, the probability of including are based on Sequencing, which also provides the identity of each interactant in the OEATM method increases. the Sequence generating each QEATM method band in Furthermore, by limiting the number of interactants in a question, the same Sequences that encode the proteins given pool to 10 to 15, the number of “bands” or genes in responsible for the interactions. a QEATM method can be limited to about 40, and thus A QEATM method band includes knowledge of specific provide an easily analyzable QEATM method readout that 15 Sub-Sequences (which the recognition means, used in the can be used to deconvolute the pools. Exemplary protocols QEATM method reaction, detect). Specific PCR primers are for a QEATM method that can be used are described in designed based on these Sub-Sequences So as to be able to Section 6.1.12 and its subsections (particularly 6.1.12.2). hybridize to and thus amplify only those bands in a pooled In a preferred embodiment, the addition of the SEQ population that contain these Sub-Sequences. Thus, these QEATM method to the above analysis further refines the PCR primers are used to screen by PCR the entire interactive deconvolution process by imparting more information to population. This is done by performing PCR with gene each band (see, by way of example, Section 6.1.12.2). specific primers, preferably on the original stored PCR Furthermore, the SEQ-QEATM method aids in uniquely products (both the DNA-binding domain-specific and acti identifying the bands from the QEATM method reaction. This Vation domain-specific PCR products), when pooled accord often is not possible using a the 4-mer QEATM method alone 25 ing to the two-dimensional or three-dimensional pooling as the information from Such a QEATM method reaction is strategies described above. A specific PCR product will be generally not Sufficient to uniquely identify genes within a observed only if the particular PCR pool contains the gene eukaryotic cDNA population made from total mRNA. The that gives rise to the QEATM method band. Deconvolution ability to identify unambiguously the bands in each pool and Strategies can be carried out as described above. Thus, e.g., those in common between pools is the desired outcome of a PCR product appearing at the interSection of a pooled row deconvolution. The methods of QEATM method (preferably and pooled column (or pooled plate, in a three-dimensional 4-mer), preferably in combination with the SEQ-QEATM strategy) indicates that such PCR product arose from the method, resolve the identity of the bands in each pool, thus colony Situated at Such interSection, and indicates that Such identifying the proteins that appear in an interacting pair, and PCR product contains the subsequences to which the prim in common between pools without the need for Sequencing 35 ers were designed to hybridize. By this method, the original of the bands. By such methods, the identified bands that mating pair that gives rise to the QEATM method band can appear, or appear at increased level, after the interaction be identified and the Sequence of the two genes that encode assay of the invention is carried out wherein a first cDNA the interacting proteins can be confirmed by Sequencing the population forms both N and M populations, compared to respective DNA-binding domain and activation domain the bands that appear after the interaction assay is carried out 40 plasmids after isolating these plasmids from the relevant with a second cDNA population forming both N and M, colony. identifies differentially expressed proteins between the first 5.2.6.2. CREATION OF INTERACTIVE-GRIDS and Second cDNA populations that mediate protein-protein As a variation of the PCR-based strategy, a hybridization interactions. based Strategy can also be used to identify interacting 3-dimensional pools 45 proteins that are in an interactive population, or that are In the case of large interactive populations, a unique to such population. The PCR products from each of 3-dimensional coding and pooling strategy (FIGS. 4B-4C) the interactive colonies (the DNA-binding domain-specific is used. In the illustrated example of FIG. 4B-4C, a total of amplified products and the activation domain-specific 32 pools are used: 12 (pooled columns, 8x12 wells each)+8 amplified products, respectively) are spotted onto a mem (pooled rows, 144 wells)+12 (pooled plates, 96). Each pool 50 brane thus creating an “interactive grid”. Preferably, the will have a maximum of 144 genes (FIGS. 4B-4C). The DNA binding domain-specific products and the DNA acti QEATM method and SEQ-QEATM method are performed on Vation domain Specific products from a single colony are the PCR products derived from each pool (separately for the Spotted together in a single spot. This interactive grid is then DNA-binding fusions, and the activation fusions, probed with a band of interest that has been identified and respectively), and the intersection of three pooling dimen 55 isolated through the QEATM method process. If the band of Sions is used to identify the gene at each location. The interest is a band that, through the QEATM method, has been SEQ-QEATM method based on 4-mer Subsequences may not identified as an interacting band that is present only in one be easy to interpret due to the large number of bands (genes) population and not another, this method yields the identity of in each pool. Therefore, it can be preferable to use a large interacting proteins unique to the population in which Such number of less common Subsequence pairs (6-mers instead 60 band is present. Probes for this purpose can be prepared by of 4-mers) to discriminate between all the genes present. labeling the QEATM method band(s) of interest with 5.2.6. ALTERNATIVE STRATEGIES TO CHARAC radioisotopes, degoxigenin, biotin (detectable by its ability TERIZE INTERACTIVE POPULATIONS to bind to Streptavidin, e.g., conjugated to an enzyme), 5.2.6.1. SEQUENCE-BASED STRATEGIES TO IDEN fluorescent tags, or other detectable labels known in the art. TIFY PARS OF INTERACTING PROTEINS 65 The Spots on the interactive grid are contacted with the probe An alternative Strategy involving gene-specific PCR pro under conditions conducive to hybridization. Spots that vides means to identify the pair of genes coding for each Set hybridize thus pinpoint the pair of interacting proteins that 6,057,101 29 30 are unique to an interactive population (FIG. 5). A sequence between two populations that are cDNA of substantially analysis of these genes yields the identities of the interacting total mRNA from a cell, at least 5.8x10, or more preferably proteins. at least 1x10', or 1.7x10” matings between yeast cells in the 5.2.7. STATISTICAL CONSIDERATIONS FOR preferred yeast interaction mating assays are done. (By way DETECTING ALLPOSSIBLE INTERACTIONS AMONG of clarification, 1.7x10” matings means mixing 1.7x10' GENES THAT ARE EXPRESSED AT DIFFERENT LEV cells together of each fusion population for a total of 3.4x10' ELS cells.) The methods described herein allow achievement and In a library of 1x10 individual clones, taking into Selection of these numbers of matings, as well as the account that only Sense Strand cDNAS are cloned and thus increased number of matings shown in Table 1. In various one in every three will be in the proper reading frame, and specific embodiments, at least 1x10, 1.7x10, 8.5x10, that each gene has approximately 4 domains, there will be 1.7x10', 8.5x10', or 1.7x10' matings are carried out and about 80 copies of each domain of a gene that is expressed Reporter Gene activity is tested for in the mated cells, per at the high level of 1 in a 1000 transcripts within a cell interaction assay. (1/3x1/4x1/1000x10). After transformation into yeast, if 5.28. ALTERNATIVE PREFERRED EMBODIMENTS there are 5x10 individual transformants, then there will be 15 This subsection describes specific alternative embodi 40 copies of each domain of a gene that was originally ments that are generally preferred for the detection and expressed at a 1 in a 1000 level 80x(5x10)+(1x10)). comparison of protein-protein interactions in the following These guidelines can be used to calculate the number of circumstance. The embodiments of this SubSection are par copies of genes expressed at other levels. For instance, it a ticularly preferred in cases where the binding domain library gene is expressed at a 1 in 5000 level, a library of 2.5x10' has a complexity greater than 10, 1,000, or 1,000,000, and transformants in yeast will be contain roughly 2.5x10'x(1/ where the number of pairs of interacting proteins discovered 3x1/4x1/5000)=40 copies of each gene. is no more than approximately 10, 50, 100, 200, or 500. For a given Sample size, it is possible to calculate the However, these embodiments are also applicable to binding number of matings that are expected to yield a pair of and activation domain libraries of complexities less than 10 interacting proteins. Suppose that gene X and gene Y are 25 and more than 1,000,000 and to less than 10 or more than expressed at a level of 1 in 1000, and that domains of these 500 discovered interacting protein pairs. This alternative two genes interact. The fractions of cells bearing the proper preferred embodiment is optionally but preferably associ domain of each protein are ated with certain information-processing Steps for recording, Fraction of cells bearing Gene X=1/(3x4x1000)=1/12,000; comparing, and analyzing the results of detected interac Fraction of cells bearing Gene Y=1/12,000. The number of tions. Although applicable in general to the results of matings that bring together the interacting domains of gene detected protein-protein interactions, these associated X and gene Y is information-processing Steps are especially preferable in cases where one or both libraries have complexity Sufficient X-Y matings=(total number of matings)x(mating efficiency)x to result in large numbers of interactive proteins (i.e., greater (fraction bearing gene X)x(fraction bearing gene Y). 35 than 100, or 200, or preferably 500 protein-protein interactions), and as will be apparent to one of skill in the art, Assuming a mating efficiency of 25%, this yields the number of these StepS are particularly preferred to record, compare, and X-Y matings as: analyze the combined results of protein-protein interactions X-Y matings=(total number of matings)/5.8x10 detected from more than one pair of libraries. Results from 40 multiple libraries can be from either repetitions of the same Therefore, the total number of matings that must be per pair of libraries or from different pairs of activation and formed to expect to See one productive X-Y mating is, on binding domain libraries. average, total number of matings=5.8x10. This is a statis The current SubSection describes generally these preferred tical estimate of the number of matings, performing this protocol steps to the extent that they differ from the previ number of matings will result in a productive X-Y mating 45 ously described embodiments. Particular protocols for per roughly 50% of the time. To raise the probability of obtain forming these StepS are presented in the SubSections of ing a productive mating, it is preferable to perform even Section 6.1. Unless otherwise noted, the Same choices and more matings. An exemplary goal is a 95% confidence level alternatives appropriate to the embodiments previously that an interaction will be retrieved, which requires 3X described in Sections 5.1 and 5.2 are also applicable to this over-Sampling according to probability theory arguments. 50 embodiment. The following subsection (5.2.9) describes the For genes expressed at a level of 1 in 1000, the number of data-processing aspects of this embodiment. FIG. 26 illus matings for 95% confidence is 1.7x10. trates exemplary orderings of both the preferred protocol For genes that are expressed at moderate to low levels, by StepS and the information-processing Steps, as well as their calculations similar to those described above, the number of interrelation. matings for 95% confidence is as follows: 55 The StepS up to and including the transformation of the yeast mating Strains with plasmid libraries capable of TABLE 1. expressing fusion proteins proceed generally as previously described in Sections 5.1 and 5.2. In particular, the previ Expression Level Number of Matings ously described choices, namely those of yeast Strains with 1 in SOOO 8.5 x 10 60 promoter Sequences and operably linked reporter genes and 1 in 10,000 1.7 x 1010 of plasmids with marker genes Selectable in the yeast Strains, 1 in 50,000 8.5 x 100 are also appropriate to this embodiment. Therefore, by way 1 in 100,000 1.7 x 10 of example and without limitation, this embodiment is described with respect to a first and a Second plasmid library Thus, in a preferred embodiment, to detect all detectable 65 and two yeast mating Strains, a and C. When transformed interactions that occur between genes that are highly into yeast, the first plasmid library recombinantly expresses expressed in mammalian cells, by assaying interactions TRP1 and chimeric proteins comprising a GAL4 DNA 6,057,101 31 32 binding domain fused to proteins to be assayed for protein grown on a medium Selective for reporter gene activation. protein interactions, and the Second plasmid library recom Therefore, the greater the complexity of the activation binantly expresses LEU2 and chimeric proteins comprising domain library, the more Such false positive progeny will be a GALA activating domain fused to the same of further formed from each Such fortuitously activating binding proteins to be assayed for protein-protein interactions. The domain transformant. Additionally, fortuitous activation can two yeast mating Strains are each constructed to be deficient occur at a rate up to 1-5% among all binding domain in TRP1 and LEU2 and bear reporter genes URA3, and/or transformants. Therefore, the greater the complexity of the HIS3, and/or lacZ whose expression is under control of a binding domain library, the more Such false positive progeny GAL1-10 promoter Sequence capable of binding the GAL4 will be formed. For binding domain libraries with complexi DNA binding domain. This embodiment is adaptable to the ties of greater than 10, 10, 10", or even 10, it is preferable other alternatives described in Sections 5.1 and 5.2, in that the rate of fortuitous activation be below at least 10, particular to the alternative choices for promoters, reporter more preferably less than approximately 5x10, and most genes, Selectable marker genes, plasmids, yeast, and So forth preferably less than approximately 1x10'. The “rate of therein described. fortuitous activation” means the fraction of binding domain Where the matrix-mating is performed in confirmatory 15 fusion transformants that activates reporter genes in the Step 2606, the plasmids used to construct the activation and absence of any protein-protein interaction. binding domain libraries preferably further have character A negative Selection protocol preferred for use with this istics which allow them to act as shuttle vectors between the embodiment achieves a much reduced fortuitous activation yeast Strains used and bacteria Such as E. coli. These rate by combining Separate and independent negative Selec characteristics include one or more Sequences permitting tion Steps. It is important that Such separate negative Selec replication in bacteria and yeast and one or more marker tion Steps be independent in order that their negative Selec genes capable of expression and Selection in bacteria and tion effects be cumulative. The preferred negative Selection yeast. The Selectable marker genes expressible in bacteria protocol achieves a fortuitous activation rate of preferably typically express proteins confering resistance to certain less than approximately 5x10", or less than approximately antibiotics. 25 4x10, or less than approximately 3x10, or less than In more detail, construction of the plasmid fusion approximately 2x10", or more preferably less than approxi libraries, step 2601 in FIG. 26, proceeds as generally mately 1x10, or even less. In a preferred embodiment, described in Sections 5.1 and 5.2. Genomic DNA or cDNA where URA3 is a reporter gene two or more passages are is prepared from any of various tissueS of various organisms made on media containing 5-fluoroorotic acid (5-FOA) (the according to appropriate protocols known in the art. For chemical agent creating the toxic environment for URA3), example, in the case of animal cl)NA, mRNA can be which inhibits or kills URA+ cells. In a first passage, binding extracted and purified as described in Sections 6.1.3, 6.1.4 domain transformants are plated on media Selective for the and 6.1.5, and cDNA synthesized as described in Section binding domain plasmid and containing 5-FOA. After a 6.1.6. The activation and binding domain plasmid fusion Sufficient time for growth, resulting colonies are replica libraries can be constructed according to protocols known in 35 plated onto Similar Selective media also containing 5-FOA. the art. For example, cDNA, having ends complementary to It has been found that two passages by replica plating those produced by digestion by certain restriction enzymes, achieve a fortuitous activation rate of no more than approxi Such as can be perhaps produced by ligating short oligo mately 1x10". Further passage via replica plating are nucleotides to previously produced cDNA, can be ligated possible, and can be performed if fortuitous activation rate into plasmid vectors having appropriate poly-linker Sites 40 greater than the preferred rate is found. digested by the Same restriction enzymes. The poly-linker Replica plating is a preferred embodiment of the general Sites are placed in-frame adjacent to Sequences coding for method of achieving independent negative Selection Steps activation or binding domain protein fragments. For according to this invention. The general method proceeds by example, the methods of 6.1.6 can be used to construct the using any appropriate means to definitively Separate those plasmid libraries. 45 cells, which are actively growing in a toxic environment, Transformation of the yeast strains, step 2602 of FIG. 26, from Substantively all other cells, including dead cells, cells also proceeds generally as described in Sections 5.1 and 5.2. which are living but not viable, and importantly, cells which Such methods as electroporation, microinjection, and trans are dormant in the toxic environment but still viable and formation can be used to introduce the activation and capable of future growth in a non-toxic environment. By binding domain plasmid libraries into yeast Strains of Sepa 50 way of example, it has been found that an important, rate mating types. In an exemplary method (described in although Small, fraction of yeast cells in a toxic Sections 6.1.2 and 6.1.7), the yeast Strains of Separate mating environment, Such as a medium containing 5-FOA for types are transformed with activation and binding domain URA+ cells, are not killed, but merely become dormant yet plasmid libraries by lithium acetate treatment followed by viable. Such viable dormant cells are fully capable of heat-Shock. 55 resuming normal growth upon being rescued to a new Following transformation Step 2602 is negative Selection non-toxic environment. In particular, in the case of an Step 2603. This Step Screens out those yeast transformants organism, Such as yeast, for which cells growing on a plate bearing binding domain plasmids in which the reporter create colonies forming a heap above the Surface of the genes are fortuitously activated by the fusion proteinbearing medium, actively growing cells on a plate containing a the binding domain alone. Such fortuitously activating trans 60 medium with a toxic chemical agent create Such heaped-up formants can make impractical the task of finding a tiny colonies, while dormant cells remain on the Surface of the number of colonies truly positive for protein-protein inter medium. Accordingly, definitive Separation of actively actions among an overwhelmingly large number of falsely growing cells can be achieved by physically removing cells positive colonies produced from libraries of large complex from the heaped-up colonies, and preferably from the tops of ity. For example, each Such fortuitously activating binding 65 these colonies, without removing cells from the Surface of domain transformant will mate with any activation domain the medium. Careful replica plating, the preferred means, transformant to form falsely positive progeny which will reliably and economically removes only cells from the tops 6,057,101 33 34 of heaped-up colonies. Alternatively, other physical means a test mating (as described below). AS used herein, the rate can be used to remove cells from heaped colonies, Such as of reporter gene activation in a mating is the fraction of careful colony picking, perhaps by a laboratory robot. On diploid cells in which one or more reporter genes are the other hand, Scraping cell from the Surface of Such a activated. Fortuitously activating binding domain fusion medium removes both growing cells and dormant cells, and proteins are recognized by a rate of reporter gene activation therefore, is ineffective in achieving independent negative that is close to 1, e.g., greater than or approximately 0.5. Selection Steps. The dormant cells later resume growth in a Sticky fusion proteins are recognized by a rate of reporter non-toxic environment. Also, growth in Successive liquid gene activation that is anomalously high compared with the media having the toxic agent, without additional plating, expected rate, as determined by observations of Similar (does not achieve independent Selection and improved nega matings. For example, in matings of mammalian and, tive Selection rates. particularly, of human Samples, it has been observed that the After careful Separation of actively growing cells, their protein-protein association and reporter gene activation is further growth in a further toxic environment results in typically less than approximately 10 (i.e., reporter genes further and independent Selection by killing remaining Sen are activated about 1 diploid cell in 1,000,000 diploid cells). sitive cells. Dormant cells which escaped death in the 15 Accordingly, for Similar matings, a Sticky fusion protein is previous toxic environment will not again escape Selection indicated by a rate of reporter gene activation preferably in this further toxic environment, Since Substantially none of greater than approximately 10, or preferably greater than these cells are transferred to the Second toxic environment. approximately 10", or more preferably greater than Accordingly, the results of both Selection StepS combine to approximately 10. Since it is generally advantageous to result in a much reduced fortuitous activation rate. detect as many weak protein-protein interactions as possible, Alternatively, other reporter genes and associated toxic a library member with a rate of reporter gene activation in environments, as described in Section 5.1 or known in the art a test mating of greater than a threshold of approximately can be used in this protocol. It is preferable that all Such 10 is considered “sticky.” Where only stronger protein combinations achieve a rate of fortuitous activation of leSS protein interactions are of interest, fusion proteins with than 5x10' and more preferably less than approximately 25 activation rates between 10 and 10" (or 10) can also be 1x10'. For example, an alternative protocol can use two or considered “sticky.” Limited-complexity-library members more passages by replica plating in the presence of cyclo are considered validated for performing full library mating heximide where CYH2 is present as a reporter gene only if they are neither 30 fortuitous activators nor are (cycloheximide is the toxic chemical agent for CYH2) in the Sticky, that is if their reported gene activation rates are leSS yeast. Alternatively, the two passages can involve growth on than the appropriate thresholds. media having different compounds that are toxic upon An exemplary protocol for bait validation performs a fortuitous activation of Separate reporter genes. For Separate mating, according to the protocols described herein, example, where both URA3 and CYH2 are used as reporter of each member of the limited complexity library with a genes, a first passage can be on media containing 5-FOA and Sample of the more complex library. For example, each a Second passage can be on media containing cyclohexim 35 member is mated preferably with between approximately ide. In a further alternative, where two reporter genes having 10,000 and 100,000 colonies from the more complex library, difference toxic environments are used, one or more pas and most preferably with approximately 50,000 colonies. Sages can be on media with both toxic environments. For The approximate rate of diploid colonies which are also example, where both URA3 and CYH2 are used as reporter positive for reporter gene activation for the member is genes, one or more passages can be on a medium containing 40 observed. Only library members which meet the preferred both 5-FOA and cycloheximide. In all of these alternatives, rate of reporter gene activation (where weak protein-protein as described, only actively growing cells must be carefully interactions are of interest, a rate of greater tan 10) are Selected for a further negative Selection Step. Selected for full mating. A further negative Selection Step, called bait validation, is This invention also comprises other negative Selection preferred in the case of libraries of limited complexity. Such 45 techniques performed before a full mating, directed to libraries have a complexity preferably less than approxi removing from the full mating any fusion proteins that mately 500, or less than approximately 200, or less than fortuitously activate reporter gene expression and/or have approximately 100, or most preferably less than approxi non-specific (sticky) association with other proteins, that mately 50. The goal of the Step, in the case of binding will be apparent to those of skill in the art upon reviewing domain libraries, is to provide a further Screen for fortu 50 this disclosure. itously activating binding domain fusion proteins, and in the Following transformation and negative Selection, the case of both binding domain and activation domain fusion libraries of yeast transformants are Dated and colonies proteins, is to provide a Screen for "Sticky' fusion proteins Selected for activation of the reporter genes in Step 2604 of (See, also, Section 6.1.13.2). Although a particular fusion FIG. 26. In general, a mating protocol useful in these protein may activate reporter genes due to true protein 55 embodiments has the following preferable characteristics. protein association, this association may be non-specific. First, it is preferable that the large numbers of cells neces Since Such non-specific association may be of less interest Sary for good mating of complex libraries, according to the than Specific association between proteins, it may be advan Statistical estimates of Section 5.2.7, can be mated using tageous to remove library members expressing Such Sticky only a limited number of plates, and limited media and fusion proteins before a full mating. After a full mating and 60 mating resources. Second, mating conditions chosen pro positive colony Selection, the matrix-mating protocol mote cell mating but inhibit cell doubling. Accordingly, each described Subsequently performs a similar Screen for fusion Separate mating event constituting a protein-protein interac proteins that associate non-specifically with many other tion is more likely to produce only a single resulting colony partners in a particular mating. upon Selection. Third, also for good Statistical Sampling, the For the bait validation protocol, fortuitously activating 65 mating efficiency, the percentage of diploids formed, is high. binding domain fusion proteins and Sticky fusion proteins Generally, high mating efficiencies are produced when are recognized by the rate of reporter gene activation during well mixed yeast cells of the two mating Strains are main 6,057,101 35 36 tained in fixed and close contact, as occurs when the mating art knows how to plate appropriate dilutions of the cells are packed together and retained on various Solid harvested, mated cells in View of these fractions and of a Supports. Accordingly, mating on the Surface of plates or measured cell density. The percentage of diploids, or the filter discs is preferred, with filter discs being more preferred mating efficiency, can be estimated by plating Serial dilu due to their ability to pack together and to mate a greater tions of the mated cells onto plates Selective for each of the number of cells per disc. plasmids and for both of the plasmids (for example, accord One aspect of this invention is the discovery that the ing to the protocol in Section 6.1.1). The expected rate of disclosed filter-disc mating protocol permits significantly protein-protein interactions can be estimated from experi higher cell densities during mating than can be achieved ence with similar libraries. In the case of libraries derived with prior mating protocols, in particular by mating on the from total mRNA of human cells, the rate is often approxi Surface of a plate. In particular, filter-disc mating can mately 107, or at least between 10 and 10. achieve approximately at least 5x10", at least 1x10, at least The positive colonies harvested at the end of step 2604 1.5x10, preferably 3.5x10, and up to 4–6x10 cells per can be processed according to Subsequent StepS and Square millimeter on the filter-disc during mating. Mating protocols, including the characterization of the fusion pro cell densities above 4-6x10" are less advantageous since 15 tein produced at step 2605, confirmatory tests at step 2606, mating efficiency declines. These densities correspond to at and other further steps indicated at 2607. The confirmatory least approximately 3x10 cells, to at least approximately tests Screen for false positive colonies due to fortuitously 6x10 cells, to approximately 1x10" cells, to approximately activating binding domain plasmids (plasmid drop-out test) 2x10 cells, and up to approximately 3.5x10 cells per 90 and eliminate non-specifically interacting proteins (matrix mm filter disc, respectively (obtained by multiplying the cell mating test). The other further steps are described in Sec densities by the approximately 6400 Square millimeters in a tions 5.1, 5.2, and 5.3 and illustrated in FIGS. 1, 3, 5, and 6, 90 mm filter disc). According to the preferred protocol, cells and include Screening for inhibitors of protein-protein inter can be packed to these densities on a filter-disc by vacuum actions (described in Section 5.4), finding lead compounds assisted filtration from a culture of known cell density by for drugs that inhibit protein-protein hinteractions, finding using various Standard filtration apparatuses. Filter discs of 25 Stage or tissue specific protein-protein interactions, and So different diameters can accommodate appropriately Scaled forth. numbers of cells. Prior methods can typically accommodate, These Subsequent Steps can be performed in any order or at most, a mating cell density of 6x10 cells per Square even eliminated if not needed. The order shown in FIG. 26 millimeter (for example 1x10 cells on a 150 millimeter is the preferred order, especially where associated informa plate). tion processing Steps assist the analysis of interesting inter Cell doublings during the mating in a filter disc are limited actions. In the preferred order, fusion protein characteriza by maintaining the mating cells in an environment of a rich tion is performed first and produces input that the but dilute medium, as can be readily achieved by placing information processing StepS use to control performance of filter discS with the packed yeast cells cell-free Side down on the confirmatory Steps, which are performed Second. Other the surface of a plate with rich medium (e.g., the YPAD 35 orderings can include performing all these Steps in parallel, medium described in Section 6.1, Supra). Mating efficiently performing confirmatory tests in advance of fusion protein is also promoted by “boosting the cells with a short growth characterization, eliminating the further Steps, or other varia period on rich medium prior to mixing and mating. In tions. contrast, plate mating places the cells on a rich medium Step 2605 characterizes the fusion proteins in each of the resulting, typically, in Several cell doublings and Several 40 positive colonies harvested from the mating Step. Informa colonies for each positive mating event. tion produced in this Step is input, as represented by input AS in the protocols of Sections 5.1 and 5.2, mated cells are arrow 2608, to the information processing Steps which harvested and further plated out on media selective both for generally act to further characterize the interaction. Sections activation and binding domain plasmids, and thus for diploid 5.1 and 5.2 describe several methods for this characteriza yeast cells, and for activation of the reporter genes. Cell 45 tion. The pooling and deconvolution described therein are from positive colonies are taken by, e.g., picking from the preferably not applied to this embodiment. Since it is plates containing medium Selective for the presence of both anticipated that less than approximately 10, or 50, or 100, or plasmids and reporter gene activity (mating plates) and 200, or 500 positive colonies are found, the identification stored in individual cultures selective for both plasmids, efficiencies provided by pooling and deconvolution are leSS which are, for example, arrayed in 96-well plates, 384-well 50 advantageous to this embodiment. plates, or other convenient Storage format. Cells for further According to Sections 5.1 and 5.2, and also in this analysis of the positive colonies can be removed from the embodiment, analysis of Separate and individual colonies Storage cultures. It is advantageous for removing colonies proceeds, preferably, with a first amplification Step followed from the mating plates that the number of expected positive by a Second characterization Step. The amplification Step colonies as well as the total number of diploid cells per plate 55 Specifically amplifies the variable inserts coding for the be controlled. Too many colonies per plate makes difficult interacting protein fragment in the fusion proteins, by, in the picking colonies from mating plates to place them in Storage case of PCR amplification, using primerS designed to cultures. Too few colonies per plate wastes mating plates. In hybridize to regions flanking the variable inserts. The Second a particular embodiment, directed to automatic colony pick Step, which characterizes the amplified inserts, can be by ing by robot apparatus guided by an automatic vision 60 direct sequencing, or by QEA or SEQ-QEA methods System, a preferred number of colonies per plate is approxi (described in Section 5.4), or by other methods. Direct mately 50-100 and a preferred number of diploids per plate Sequencing is preferred in this embodiment, especially is less than approximately 10. where adequate Sequencing facilities are available, and the These plating targets are attained by estimating the Sequence data is directly input to the information processing expected percentage of diploid cells among all the mated 65 StepS. Direct Sequencing can be by any method known in the cells and by estimating the expected rate of protein-protein art, but is preferably according to the Sanger chain interactions among all the diploid cells. One of skill in the termination method using ddNTPs labeled with four distin 6,057,101 37 38 guishable dyes and followed by electrophoretic Separation of taining the different plasmids are grown on lines arranged in the sequencing fragments. If QEA or SEQ-QEA methods are a grid that intersects (a matrix). A positive protein-protein employed, the QEA signals (described in Section 5.4) pro interaction appear, as growth as the interSection of the two duced are input to the information processing Steps, and lines having the plasmids hearing the components of the gene identification is preceded by the gene finding methods interaction. described in Sections 5.4.5 and 5.4.6. This invention also comprises other negative Selection In detail, the first PCR amplification step preferably uses techniques performed after a full nating, directed to remov DNA templates produced from yeast obtained from the ing from the Selected positive colonies any colonies with positive colony storage. The DNA templates are freed of fusion proteins that fortuitously activate reporter gene cellular debris by extracting DNA from the results of cell expression and/or have non-specific (sticky) association lysis and (as described in Section 6.1.8). Pre with other proteins, that will be apparent to those of skill in ferred hot-start PCR protocols are also described in Section the art upon reviewing this disclosure. 6.1.8. A most preferred protocol Separates components of the This embodiment further comprises observation of “bi PCR reaction mix by a solid wax layer, so that no amplifi directional” interactions (also called herein “bi-directional cation can occur until the wax layer is melted. To Start 15 Screens'). Two fusion inserts, a first and a second insert, amplification, the PCR reaction mix components are pre participate in a bi-directional interaction if they are observed heated, the wax layer is melted, and thereby, the amplifica to interact under the following two conditions or directions: tion is hot-Started. This latter protocol is easily adapted to one, with the first insert in a binding domain fusion protein performance by Standard laboratory robots. library and the Second insert in an activation domain fusion Finally, Step 2606 confirms certain aspects of positive protein library in a first direction; and two, with the first colonies found after the mating Step. In particular, the insert in an activation domain fusion protein library and the plasmid drop-out test performs a protocol (described in Second insert in a binding domain fusion protein library in detail in Section 6.1.13.1) that separates false positive a Second direction. Bi-directional interactions can be dis colonies, due to reporter gene activation Solely by the covered by performing an interaction detection assay twice, binding domain fusion protein, from true positive colonies, 25 first with a pair of libraries constructed to have the inserts in in which reporter gene activation requires protein-protein either the first or the Second direction, and Second, with asSociation. In embodiments accompanied by information another pair of libraries constructed to have the inserts in the processing, performance of this steps is controlled, as indi other direction. Finding two fusion inserts in a bi-directional cated by control arrows 2610 and 2612, by assessment of the interaction increases the likelihood that the observed inter quality and biological Significance of a particular interaction actions is experimentally significant, and not an artifact of at Step 2618 or by browsing the database of interactions at the fusion libraries. Step 2620. Results of these confirmatory Steps are input, In Summary, a particular embodiment of this preferred according to input arrow 2611, to the information process alternative embodiment of this SubSection proceeds accord ing. ing to the following Steps: construction of fusion piasmid Briefly, the plasmid drop-out protocol grows cells from a 35 libraries, transformation of yeast Strains, negative Selection positive colony, first, in rich complete medium, and Second, of the binding domain library; mating of the yeast Strains, in medium Selective for the binding domain plasmid in order Selection of colonies positive for activation of the reporter to Select for drop-out of the activation domain plasmid. The genes, characterization of fusion protein from positive colo Selected progeny are tested for Such drop-out by lack of nies, confirmatory tests Such as plasmid drop-out and/or growth in a medium Selective for the activation domain 40 matrix-mating, and optional further Steps. Where, as is plasmid. Progeny cells lacking the activation domain plas preferred, information processing accompanies these Steps, mid are then assayed for activation of one or more of the the fusion protein characterization and the confirmatory reporter genes. Any positive colonies having reporter genes Steps input information into information processing func activated only by the binding domain plasmid are considered tions for further control of these same Steps and for false positive for protein-protein interactions. 45 recording, analysis, and comparison of protein-protein inter The matrix mating test performs a protocol (described in actions observed. detail in Section 6.1.1.3.2) that assays for the specificity of 5.2.9. INFORMATION PROCESSING ASPECTS OF observed protein-protein interactions. Generally, this test DETECTING PROTEIN-PROTEIN INTERACTIONS reconstitutes a Second two-hybrid interaction test using only The information-processing aspects of detecting protein the activation and binding domain plasmids from colonies 50 protein interactions record, compare, and analyze protein positive during a first interaction test. If a protein-protein protein interactions detected in experiments (also referred to interaction is specific, then it is expected that the activation herein as "Screens' or “matings’) involving one or more and binding domain plasmids bearing the components of the pairs of libraries. These information-processing aspects are Specific interaction will form a positive colony only when important to manage the large amounts of information they are mated together, and will not form positive colonies 55 generated from interactions detected in complex libraries, when they are mated with other plasmids. On the other hand, and especially from interactions detected in many pairs of if a protein component interacts non-specifically, then it is complex libraries. Although the information-processing expected that the plasmid bearing that component will form aspects are described primarily with respect to the alterna positive colonies with many other plasmids. The interaction tive preferred embodiment of Section 5.2.8, they are appli test is reconstituted, in Summary, by rescuing and maintain 60 cable to all embodiments of identification and comparison of ing plasmids from the positive colonies into a bacteria, Such protein-protein interactions according to this invention. Fur as E. coli. Accordingly, it is advantageous that the plasmids ther it will become apparent to those of skill in the art, that used have characteristics of Shuttle plasmids. Separate yeast the data Structures and processes described are also usefully mating Strains are transformed with the activation and applicable to other biological Systems (and to non-biological binding domain plasmid DNA extracted from the bacteria. 65 Systems) consisting of many pair-wise interacting compo The Strains are mated and grown on media Selective for the nents. They are even more applicable to Such of those reporter genes. In a particular-embodiment, yeast cells con Systems where the pair-wise interactions are determined by 6,057,101 39 40 components which can be Systematically Sampled according includes data describing all the individual positive colonies to geometrically comparable parameters, Such as linearly (referred to herein as “interactants”) whose two fusion arrangeable nucleotide or amino acid Sequences. inserts (referred to herein as an “interacting pair”) are In this SubSection, the information-processing aspects are fragments of the proteins coded for by the genes character described, first, with respect to their functions and relevant izing the parent interaction. This third class is particularly data classes, and Second, with respect to detailed Structures useful and can be further processed, as described of their databases, detailed Sequences of information Subsequently, to yield useful additional information. processing Steps, and their relation to accompanying Prior to describing the processing Steps in more detail, a protein-protein interaction detection. preferred and exemplary hardware and Software implemen The information-processing aspects provide, among tation of the functions and data classes is presented. It is others, three groups of functions and employ, among others, understood that this invention includes other hardware and three classes of data. The first group of functions is directed Software implementations that achieve equivalent functions. to identifying, if possible; the genes coding for the protein The individual groups of functions and certain components Fragments which have been found to interact, or, at least, of these groups are preferably implemented as independent produce colonies positive for reporter gene activation. This 15 programs which are coordinated by client-Server Style com group also includes functions for organization and Storage of munication. Such client-Server implementations are known data returned from the experimental protocols for detecting in the information-processing arts. The individual client and protein-protein interaction, for example, the data describing Server components are distributed on hardware platforms in interaction experiments performed and results of fusion a convenient and economical manner. protein characterization from positive colonies. The Second FIG. 27 illustrates an exemplary hardware system con group of functions is directed to quality control of the results figuration implementing for an exemplary distribution of of protein-protein interaction detection. It assists a user to client-server function. Computer 2702, which can be two or assess the biological meaning of each positive colony, for more computers, hosts programs implementing the example, candidate identifications of the genes coding for previously-described groups of functions and connects data the interacting fusion fragments found, and to identify the 25 bases and files Storing the classes of data. AS is generally biological context of the interactions detected. These func understood in the art, information relating to the entities in tions also assist with management of Steps of the experi the files and databases of this invention is represented and mental protocols, in particular, Selection of the confirmatory Stored in digital form. The digital representation can be tests to be performed in View of the biological Significance according to any convenient code known in the art. The first and context found for an interaction. Such management is class of data is typically Stored largely in Structured user generally called “workflow.” The third group of functions maintained files 2708, for example in descriptive text files. assembles interactions deemed Significant, for example, Preferably, the Second and third classes of data are largely because they are detected from two or more separate library Stored in relational databases. Identification database 2706 mating experiments, and provides facilities for review and Stores the Second class of data, and interaction database 2707 analysis of the assembled protein-protein interactions. In 35 Stores the third data class. A preferred relational database particular, this group also provides for assembling detected system (version 7.0 or preferably 7.3) is available from the interactions between pairs of proteins into pathways linking Oracle Corporation. multiple proteins and for discovering the domains in the Computer 2701 connects to sequence database 2705 proteins responsible for observed interactions. which is consulted in the process of determining candidate With regard to the classes of data employed, the first class 40 gene identification. Where fusion inserts are Sequenced, as is includes principally raw data describing and/or returned preferred, computer 2701Searches for database Sequences front each protein-protein interaction experiment. The data homologous to insert Sequences. Where QEA Signals are describing a particular experiment includes at least unique available, computer 2701 performs a database Search pro identifiers for each mating experiment and for each colony cess similar or equivalent to that described in Section 5.4.5 Found to be positive for reporter gene expression. this data 45 (see especially Section 5.4.5.1). optionally also describes the DNA libraries used to construct User computer 2703 connects to user display and key the plasmid fusion libraries and the precise materials, board 2709 in order to provide user access to the information methods, and conditions used in this mating. Data returned processing aspects of this invention. Typically, multiple from a particular experiment includes at least Sequences of users access the information-processing System from mul the fusion inserts (the library DNA sequences joined with 50 tiple user computers similar to computer 2703. Where the activation domain and binding domain Sequences in the information-processing functions include workflow man plasmid libraries) found in positive colonies, or in the case agement components that control Steps of the interaction of QEA analysis, the QEA Signals generated from the experiments, user computers can be made available to the amplified fusion fragments. laboratory technicians responsible for actually performing The Second data class Supplements the first class by 55 the protocol steps. Where the steps involve routine adding both organization and indexing components built manipulations, laboratory robots 2710 can be directly inter over the first class of data, in order to make it accessible for faced to the user computers. Such robots can be controlled easy reference, and also candidate identifications of the by and can return data to the information processing func genes coding for the positive fusion inserts. If no currently tions. For example, positive colony identification and pick known gene codes for a particular fusion insert, an internal 60 ing can be performed by a robot. accession number is generated to refer to the putative new The computers are connected by communication links gene and the closest homologous genes are recorded. 2704, which are adapted to the actual physical distribution Finally, the third class of data records all the distinct of the computers as is common in the art. When the protein-protein interactions found, each of which is charac computers are collocated, link 2704 can be a local area terized at least by the genes coding for the particular 65 network, when the computers are remotely located, link interacting proteins. For each Such protein-protein interac 2704 can be, for example, the Internet. Combinations of tion (referred to herein as an “interaction”), this class also networks can be used when computers are variously located. 6,057,101 41 42 In detail, System computers are appropriately sized mental identifier. It is also advantageous that data from all according to their processing loads, but are preferably at experiments relating to Selected libraries, Species, tissue least 166 Mhz or greater Pentium TM-based computers (or types, diseases, treatments, and So forth be similarly easily computers of equivalent performance based in Sparc" or retrievable by Searches on the corresponding fields. Alpha" processors). The System computers are provided with Standard Software components, including an operating In addition, input information 2614 includes data from system, which can be a version of UNIX (for example, one each colony found to be positive for reporter gene activation. of the versions available from Sun Microsystems) or one of Each positive colony is assigned a unique identifier, and the WindowsTM family of operating systems from the information obtained from that colony is indexed for rapid Microsoft Corporation (Windows NTTM, or Windows 95TM). retrieval using this identifier. The combination of mating Implementation languages can be general purpose experiment identifier and colony identifier for a positive languages, Such as C, C++, Java", language directed to colony is preferably unique among all the mating experi relational database manipulation, such as PL/SQLTM (Oracle ments and positive colonies data Stored in a particular Corporation), or similar language. The preferred language implementation of these information processing aspects. for graphical presentation aspects of these methods is 15 Data available for a positive colony characterizes the fusion Java", and the preferred language for relational database inserts found in the colony, and preferably, also includes manipulation and text screen formatting is PL/SQLTM. Pre management information Such as the physical Storage loca Sentation Services at the user computer are preferably pro tion of the colony and so forth. The physical location for a vided by an internet browser, such as NetScape TM from the colony indicates to a laboratory technician the location from NetScape Corporation, or other equivalent program capable which to retrieve cell Samples for further experimental StepS. of interpreting HTML formatted screens. Preferably, nucleotide Sequences characterize the fusion This invention also includes computer readable media inserts found in a positive colony. Such Sequence data is which contain computer-readable instructions capable of commonly provided by commercially-available Sequencing causing one or more computers to perform the processes of machines in various output formats. Most simply, the this invention. Such media include magnetic discS and tapes, 25 Sequences of the activation domain and binding domain optical discS, and other media types. The computer-readable fusion inserts can be simply Stored as, e.g., a String of instructions on these media include both instructions for nucleotide identifiers along with an indication of the correct performing the processing Steps of this invention and also reading frame. Where the QEA or the SEQ-QEA methods instructions for defining and establishing the files and data are used, the fusion inserts are characterized by QEA Sig bases of this invention. nals. QEA Signals, described in detail in Section 5.4, com 5.2.91. IDENTIFICATION DATABASE AND PRO prise three pieces of information, namely, the Sequences of CESSING two Subsequences present in the fusion insert (each having In this and the following SubSections, the information a length of, typically, 4 to 6 nucleotides) and the distance processing functions and data classes are described in more between these subsequences. In the case of SEQ-QEA detail. First, the identification database and its processing 35 Signal, the Subsequences are typically from 8 to 12 nucle functions are described. Next, the interaction database and otides long. All data for a particular colony is preferably its creation and update are described. Lastly, functions are easily retrieved using its colony identifier. described which are capable of deriving further information beyond that literally contained in the interaction database. Identification Step 2616, which creates an identification Generally, the right hand column of FIG. 26 illustrates an 40 database for a particular mating experiment, also refers to implementation of these information-processing StepS and certain external databases, primarily external Sequence data their relationship to the Steps for protein-protein interaction bases 2615. Representative external Sequence databases are experiments. available from governmental organizations (for example, Identification database 2617 for a protein-protein interac GenBank from the National Institutes of Health and similar tion experiment is created by gene identification Step 2616 45 databases available from the European Molecular Biology using input data 2614 and (external) sequence databases Laboratory) and from private organizations. By way of 2615. Input data 2614 describes the protein-protein interac example, without limitation, the following description in tion (mating) experiment and characterizes the fusion pro this Subsection is in terms of GenBank available at the tein inserts from colonies positive for reporter gene activa Internet address: “http://www.ncbi.nlm.nih.gov” tion. A description for a protein-protein interaction 50 Prior to describing the processing which creates identifi experiment includes at least a unique identifier that permits cation database 2617, information present in this database is efficient retrieval of all information relating to this experi described. Exemplary contents of an identification database ment. Further descriptive information includes, most are presented in the following Table 1A. preferably, information on the DNA source libraries from which the activation domain and binding domain plasmid 55 TABLE 1A fusion libraries were made. DNA library description can recite animal and tissue origin, library complexity, disease IDENTIFICATION DATABASE State and/or treatment information, if any, methods of library FIELD DESCRIPTION production, Storage location of library Samples, and So forth. mating experiment Appropriate unique identification of Additional experimental description information can include 60 identification the mating experiment with links to the precise and particular materials, methods, and conditions description of libraries used, precise used in the protein-protein interaction protocols. This protocols, and so forth. descriptive information can be Stored in coded or in free-text positive colony Appropriate unique identification of identification this positive colony (in particular for form, in files or in a database System, and can be advanta future retrieval from storage) geously indexed according to certain fields for rapid 65 A list of candidate homologues for activation domain fusion retrieval. For example, all data relating to a particular mating protein insert experiment is easily retrievable by using the unique experi 6,057,101 43 44 in the art. Tables can be defined for relating experiments and TABLE 1A-continued their positive colonies and for relating colonies and their candidate genes. Alternatively, this database can be stored in IDENTIFICATION DATABASE other database formats, or in a Set of user maintained files. FIELD DESCRIPTION In a further less preferable embodiment, the data content of the identification database can be stored as files in a raw or gene-AD Identity of homologue for activation unprocessed form, perhaps with distinctive filenames. domain fusion insert (database or internal accession number) Gene identification Step 2616 creates or updates the description of Name, species origin, tissue origin, identification database for a particular mating experiment. gene-AD and so forth According to the above exemplary table definitions, first, 3'-5' position-AD Location of fusion insert sequence on the homologue sequence (nucleotide tables relating all the positive colonies to the particular positions of fragments ends) mating experiment are loaded. Thereby, the unique colony score-AD Probability of homologue (e.g., BLAST identifiers, and perhaps colony descriptive information, are probability) related to the unique experiment identifier. Next, tables A list of candidate homologues for binding domain fusion 15 relating the insert Sequences in the positive colonies to their protein insert gene-BD Identity of homologue for binding candidate genes acre loaded. During this step, the candidate domain fusion insert (database or Sequences need to be determined. internal accession number) The determination of candidate genes proceeds, in the description of Name, species origin, tissue origin, preferred embodiment, by using one of the Several homol gene-BD and so forth 3'-5' position-BD Location of fusion insert sequence on ogy Search programs existing in the art, and in the the homologue sequence (nucleotide alternative, by using the QEA experimental analysis meth positions of fragments ends) ods described in Section 5.4.5.1. In the preferred score-BD Probability of homologue (e.g., BLAST embodiment, candidate genes are Selected by Searching a probability) Sequence database with a homology Search program using 25 the determined fusion insert Sequences as queries. These The mating experiment and colony identification fields programs often function in a client-Server mode, accepting contain their previously described identifiers. For each posi formatted query Sequence queries, referencing a nucleotide tive colony, this database includes lists of one or more Sequence database, and returning output text files describing candidate genes that have been determined to possibly code the results of the homology search. The output text files for the inserts in the activation domain and binding domain typically contain a list of homologous sequences (genes) fusion proteins. “Genes' are used herein to refer to nucleic from the Sequence database together with, for each acid coding Sequences, which can be, for example, cDNA or genomic. These genes are identified by their database Sequence, an indication of the likelihood of the homology “accession numbers,” for example the widely-used Gen and an indication of how the query maps onto the Sequence. Bank accession number. It is well known in the art that A preferred homology program is BLAST (Altschul et al., accession numbers can redundantly identify certain 35 1990, Basic Local Alignment Search Tool, J. Mol. Biol. Sequences, for, at least, the reason that Sequences of various 215:403-410) which is available at the Internet address fragments of the same naturally-occurring nucleic acid can “http://www.ncbi.nlm.nih.gov.” BLAST returns text files have been entered multiple times. To obtain a unique gene with the preferred information, that is Sequence accession identifier, it is preferred to use the accession number of the number, query Sequence location, and a homology Score full length coding Sequence for the gene, or at least, the 40 (multiple possible locations and associated homology Scores accession number for the longest fragment including the can be provided). A copy of BLAST along with a sequence Section found to be a candidate for the fusion insert. For each database can be loaded onto local computers. candidate gene, the identification database preferably In this embodiment, when the fusion-insert Sequence data includes at least certain descriptive information, the location becomes available for the positive colonies of an of the insert in the putative gene, and a Score of the degree 45 experiment, it is collected (for example by retrieving output of homology with the insert. Descriptive information in turn, text files from BLAST located using the experiment preferably includes Species origin and, optionally, an indi identifier) into a set of queries formatted for the BLAST cation of the general function, if known, of the protein coded program, one query for each of the fusion insert Sequence. by the gane, for example, a cell cycle protein, a signaling The queries are sent to an instance of BLAST and the output pathway protein, transcription factor, an enzyme, or Struc 50 text files are received and Stored. These output text files are tural protein and so forth. The location of the insert on the then parsed in manners well known in the arts (for example gene Sequence is described by providing, e.g., the nucleotide by a program in the PERL language), the relevant data numbers in the gene Sequence of the 3' and 5' ends of the extracted, and the identification database accordingly insert. The score field is an estimate of the likelihood that updated. Alternatively, the output files can be used in the this gene actually includes the insert Sequence by providing, 55 received format, perhaps indexed by colony identifier for e.g., a degree of Sequence homology or a probability that the easy retrieval. two Sequences are randomly related. In the case of fusion insert characterization by QEA In alternative embodiments, the identification database Signals, gene identification proceeds according to the fol can also include Such other information that will aid in lowing Steps. A Sequence database is Searched using the quality control Step 2617, as will be apparent to those of skill 60 QEA (or SEQ-QEA) signals as queries according to the in the art. For example, from protein databases additional processes described in Section 5.4.5.1. The output is a set of information about proteins known to be associated with the candidate sequences (genes) that include fragments gener gene can be added. Information can also be added from Still ating the same Signals as generated by the fusion inserts. For other databaseS Such as, e.g., literature databases Searched each candidate Sequence, an approximate position of the with gene name or accession number. 65 fusion insert, even though not Sequenced, can be found from The identification database is preferably Stored in a the positions of all the fragments of the candidate gene tional format in an appropriate normal form, as is common known to generate the observed signals. FIGS. 17A-F and 6,057,101 45 46 the accompanying description illustrate how the observed control step 2619. See, e.g., Russell et al., 1995, Artificial Signals correspond to fragment with particular positions on Intelligence-A Modern Approach, Prentice Hall, chaps 1 the candidate Sequence. Since the Signals generated by the and 15, the entirety of this reference is hereby incorporated fusion insert originate from fragments at known locations on by reference. Based on these decisions, interaction database the candidate Sequence, the fusion insert must include at 2620 is updated in the following manner. If an interaction least the Overlap of all the fragments. Thereby, overlapping between the Selected genes is already defined in the database, the new colony information defines in the database on each candidate gene all the fragments corresponding to a new interacting pair of fusion inserts representing an Signals generated from a fusion insert leads to an approxi additional observation of that interaction. If Such an inter mate position of this insert on the candidate gene. In this action does not yet exist, the new information defines in the embodiment, the only accessible homology information of database both a new interaction between the Selected genes the candidate Sequence is according to the methods of and a new interacting pair representing an observation of Section 5.4.5.3, that is an indication of whether the candi that interaction. Also as part of quality control Step 2619, the date is ambiguously or unambiguously identified. user, or alternatively an automated decision System, can It is advantageous for a user to monitor the Sequence data request confirmation tests on the particular positive colony. together with the BLAST results. In view of these data, it As represented by workflow arrow 2610, this decision may be apparent that a particular Sequence contains exces 15 generates requests to perform the tests that displayed, for Sive Sequencing errors. In this case, as represented by example, at terminals of the responsible laboratory techni workflow arrow 2609, the user can send a request (to a cians. laboratory technician) to retrieve the original stored colony Prior to describing the processing of this Step in more and to perform again amplification and Sequencing Steps detail, the preferable information content of interaction 2605. The new Sequence data is then entered into the System, database 2620 is described. The interaction database is as indicated by data-flow arrow 2608, and candidate gene conceptually divided into two components. The two com Sequences again Sought. ponents can be represented by physical divisions, by Sepa During gene identification Step 2616, certain information rate groups of tables, by logical views, or by other means is returned that it is advantageous to cache. Look-aside known in the art. The first component (the “interaction” databases 2018 contain this cache one Such look-aside 25 component) represents interactions generally, and the Sec database is a table of accession number Synonyms. When ond component (the “interacting pair component) repre multiple accession numbers are obtained for a candidate Sents interacting pairs evidencing general interactions. An Sequence for a fusion insert, they can be Stored, along with interacting pair from a positive colony evidence an interac the preferred accession number used for gene reference in tion if the fusion inserts observed in that colony are identi the databases of this invention, as Synonyms for future fied as being coded by the genes defining the interaction. look-up. When a further accession number is received, this In more detail, the interaction component includes infor table can be Searched to determine if it has been encountered mation exemplified in Table 1B. previously, and if So, the corresponding, preferred accession number used in the databases. Another look-aside database TABLE 1B is a homology database. The results of homology Searches 35 can be Saved as tables of accession numbers of Sequences INTERACTION DATABASE - INTERACTIONS having homologies above certain thresholds. For BLAST searches, such thresholds can be probabilities of e', e”, FIELD DESCRIPTION e3, e-0, e-50, e-9, e-70, e-89, e-0, or e-109. This table gene-1 Identity of gene coding for one permits doing simple homology Searches efficiently by find interacting protein of pair (database 40 or internal accession number) ing the accession numbers of those Sequences having a description of If homology exists name, species certain homology with a query accession number. gene-2 origin, tissue origin, and so forth; if 5.2.92. INTERACTION DATABASE no close homology exists, same Based on candidate gene identifications and other infor information for closest homologue mation in the identification database, interaction quality gene-2 Identity of gene coding for other 45 interacting protein of pair (database control step 2619, on FIG. 26, updates interaction database or internal accession number) 2620. The interaction database stores two types of informa description of If homology exists name, species tion: information on Specific protein-protein interactions gene-1 origin, tissue origin, and so forth; if observed in one or more mating experiments, and informa no close homology exists, same tion on which colonies containing which interacting fusion information for closest homologue number of Total number of positive colonies with inserts were positive in one or more mating experiments 50 interacting pairs interacting fragments from this gene (i.e., providing evidence for the protein-protein interaction). pair In this SubSection, the contents and update of this database number of Number of colonies with sufficiently are described. In the following SubSection, the useful ways independent different interacting fragments from that this information can be used (“mined”) are exemplified. interacting pairs this gene pair Updates of the interaction database proceed according to bi-directional Appropriate links to observed 55 links to interacting pairs for this interaction at least two general embodiments. Briefly, in a first interacting pairs (e.g., unique colony identifiers) embodiment, interaction quality control Step 2619 formats interaction type For example, inhibition/activation of and presents the data in identification database 2617 relating function, direction of interaction, and to each positive colony to a user skilled in biology and, so forth (determined from biochemical preferably, also skilled in the biology applicable to the type protocols) of protein-protein interactions being presented. For a posi 60 interaction source Interaction observed in this facility, tive colony, the user decides, first, if it is biologically observed in other facility, entered interesting or important, and if So, Second, Selects from from literature reference, etc. among the candidates those genes, if any, that are actually involved in this interaction. In an alternative embodiment, A protein-protein interaction, according to this where the user's decision criteria can be reduced to rules, or 65 embodiment, is considered to occur between two proteins to other computer processible representation, the decisions coded for by the two genes described in the fields gene-1 and for a colony can be performed automatically by quality gene-2. Where interactions can be observed that Simulta 6,057,101 47 48 neously involve three or more proteins, the data Structures of information can be manually entered. Interaction type pro the interaction database can be adapted in Straightforward vides biochemical information, if available, on the interac ways apparent to those of skill in the art. If already known tion. genes can be identified for the interaction, they are identified Next, Table 1C provides more detail on the interacting by the preferred sequence database (GenBank) accession 5 pair components. number. The description field can contain (as in the identi fication database) a description of the gene and its function. TABLE 1C If coding Sequences already known in databases cannot be identified for an observed fusion insert Sequence, a distin INTERACTION DATABASE - INTERACTING PAIRS guishable internal Sequence number is generated and asso FIELD DESCRIPTION ciated with the observed insert. Perhaps, for example, the mating experiment For interaction observed in this only highly homologous currently known Sequence is from identification facility, appropriate unique the Wrong Species, e.g., a mouse Sequence highly homolo identification of the mating experiment gous to an insert from a human Sample. Alternatively, with links to description of libraries perhaps no currently known Sequence is Sufficiently homolo 15 used, precise protocols, and so forth colony Appropriate unique identification of gous to an insert Sequence to be its possible Source. identification the positive colony with this Advantageously, when a generated Sequence number for a interacting pair (in particular for new Sequence is used, the description field can point to the future retrieval from storage) most homologous known genes. gene-1 Identity of gene selected as coding for one interacting protein of pair Additionally, interaction database fields relate to the (database or internal accession number) observed colonies, or interacting fusion insert pairs, evi 3'-5' position-1 Location of fusion fragment on gene dencing a general interaction. First, at least, the total number (nucleotide positions of fragments (ends) of Such interacting pairs is recorded. Second, the total other identifying Fragment from activation/binding number of “independent' interacting pairS US also recorded. data domain, pointer to measured sequence, An independent interaction pair is defined as follows: Out of 25 and so forth the total number of interacting pairs, it is likely that Several gene-2 Identity of gene selected as coding for will in fact be substantially identical. For example, several other interacting protein of pair (database or internal accession number) observed positive colonies can arise from doublings of a 3'-5' position-2 Location of fusion fragment on gene Single mated cell, or a single insert from the original DNA (nucleotide positions of fragments colony can be cloned into Several different plasmids. ends) Accordingly, two interacting pairs are considered Substan other identifying Fragment from activation/binding data domain, pointer to measured sequence, tially identical if both of their fusion inserts are the same to and so forth Within expected Sequencing errors. Typically, two inserts are confirmation test For example, plasmid drop-out test identical if they are of approximately the same length (to data results, matrix-mating test results, within less than preferably 5% or 10% of the insert length) 35 and so forth and have Substantially homologous nucleotide Sequences (to within less than preferably 5% or 10% of the number of Interacting pair information includes the unique experiment nucleotides). Otherwise, the two interacting pairs are con and colony identifiers which identify the colony and its sidered not substantially identical and thus “independent.” generating experiment. It also includes information identi For example, an insert of a first interacting pair can be 40 fying both fusion inserts of the identified positive colony. different in that it is: longer and overlaps (as determined by The fields gene-1 and gene-2 contain the accession numbers comparing 3' and 5' end information) the gene sequence of of the gene chosen as the Source of the fusion insert. an insert of a Second interacting pair, or is displaced with Alternatively, these fields can be labeled as activation respect to the Second insert, or has Substantially different domain and binding domain, as is done in the interaction Sequence, or So forth. Using Such criteria, a new interacting 45 database (Table 1A). These accession numbers can be pair can be compared to the already recorded interaction Sequence database accession numbers identifying known pairs for an interaction in order to determine if it is a new genes, or alternatively, internal accession numbers identify independent interacting pair. The greater the number of ing previously putatively new nucleotide Sequences. The independent interacting pairs evidencing an interaction, the location of the insert Sequence in the gene Selected as most more Statistically significant the interaction is considered to 50 homologous to the insert is indicated by the 3' and 5 be. nucleotide numbers in the gene Sequence of each end of the The interaction database also maintains bi-directional insert. Further fusion-insert information includes, e.g., links between general interaction information and interact whether the insert was found in an activation domain or a ing fusion insert pair information. This can be done explic binding domain library, a pointer to the measured insert itly by providing unique identifiers for interactions and 55 Sequence, and So forth. Further information for the colony interacting pairs in the database and by Storing identifiers for includes, e.g., the results of confirmatory tests performed on pairs with the related general interaction and Vice versa. the colony, or where further steps 2607 are performed. The Alternatively, existing identifiers can be used to make this interaction pair data base component ran be directly Stored bi-directional link. Unique colony identifiers can point to or indirectly point to the resulting information. interacting pair information, and unique gene accession 60 The components of the interaction database, like the numbers can point to general interaction information. identification database, are preferably Stored in a relational Finally, interaction information can include optional asso format using an appropriate normal form. Appropriate table ciated information, e.g., interaction Source and interaction Structures and indices will be apparent to those of skill in the type. Interaction Source indicates where the observation of art. In particular, the interactions and the interacting pairs this interaction was made. For example, it is advantageous 65 can be Stored in Separate tables. LinkS between these tables for interactions observed in other laboratories or reported in can be implicit, depending on matching the contents of the literature to be available in the interaction database. Such fields, or explicit, depending on an additional table Storing 6,057,101 49 SO explicit pointers. Alternatively, this database can be Stored fusion inserts are considered to be from unknown genes and according to other database formats, or less preferably, as internal accession numbers are generated for updating the user-maintained files. interaction database. An alternative rule for this first deci The remainder of this SubSection describes the processes Sion is simply to Select, independently for each fusion insert, and methods used by interaction quality control Step 2619 in the gene with the highest homology Score to the Sense order to create or to update the interaction database based on Strand, if known, that is above a certain threshold and is from information in the identification database. the correct species. If no Such gene exists, an internal In View of the information content in these databases, accession number is generated. three decisions for each positive colony are made during this An exemplary rule for the Second decision, in the case of processing: (1) Selection of the genes coding for the inter BLAST homology searches, is simply to assign the 3' and 5 acting pair, or assignment of an internal accession number or ends of the insert to be the nucleotide numbers of the numbers in case one or both genes cannot be identified; (2) Subsequence matched from the most homologous gene location of the 3' and 5' ends of the fusion insert on the found by BLAST. In the case of QEA signals, an alternative Selected gene; (3) decision as to whether this interacting pair rule for finding the 3' and 5' ends has previously been is independent of already known interacting pairs. In one 15 described. An exemplary rule for the third decision, whether embodiment of this step, the quality control process or not the current insert pair is independent, proceeds by retrieves and displays to a user information for positive retrieving an example of each of the independent interacting colonies from the identification database on formatted pairs already found for this interaction and to compare the Screens. The user reviews this information, makes the three degree of homology of both the activation and binding decisions, and trigger updates to the interaction database domain fusion inserts, respectively, with the current pair. A accordingly. Advantageously, the user makes these decision Simple test for insert homology is to check, first, that the 3' largely according to articulated and established rules. and 5' ends of both inserts differ preferably by less than 1%, Accordingly, in another embodiment, the three decision or by less than 5%, or by less than 10%, of the total length rules are encoded in a format Suitable for a computer of the insert, and to check, Second, whether the number of implemented rule based processor (Such as one of the expert 25 different nucleotides is preferably less than 1%, or less than System packages known in the art or commercially 5%, or less than 10%, of the total number of nucleotides in available). The rule processor then makes proposed deci the insert. These ranges are to accommodate for expected Sions and proposed updates, which are displayed for user Sequencing errors. Two inserts falling within these bounds acceptance or revision. In a further embodiment, the three are taken not to be independent. Alternatively, a homology decisions and database update are entirely automated by the search tool (such as BLAST) can be used to estimate the rule processor, which the user only later reviews and, degree of homology between the current inserts and the perhaps, revises during, e.g., database browsing Step 2621. retrieved examples. Current inserts without any significant Accordingly, presented herein is- an exemplary Set of homology to the retrieved examples are a new independent decision rules for controlled update of the interaction data pair. base based on information in the identification database. 35 Thereby, interaction quality control Step 2619 updates These rules are capable of being applied by a computer interaction database 2620, whose information content is implemented rule based processor or by an individual user. Summarized in Tables 1B and 1C, with new interactions and An exemplary rule for the first decision proceeds, first, by new interacting pairs based on information in identification discarding candidate genes that are not from the same database 2617 by computer-implemented and rule-based Species as the Source of the library used in making the fusion 40 processing according to, e.g., the previously described rules plasmids. Thus if the positive colony resulted from a mouse defining the biological significance of candidate genes. The derived activation domain library, then only mouse genes are interaction database can also be manually updated with considered further as possibly coding for an activation interaction information received from others or reported in domain fusion insert. Second, homologies to anti-Sense the literature. Strands, if known, are also discarded. Third, optionally, 45 5.2.93. INTERACTION DATABASE FUNCTIONS candidate genes for both fusion inserts are grouped together The interaction database contains valuable information by the general functions of their encoded proteins. For that can be usefully accessed and analyzed for diverse example, general protein functions can include cell cycle purposes. In this SubSection, three particular analysis func control, intra- or inter-cellular signaling, cell-specific func tions applicable to this information are presented: database tion Such as metabolic or Synthetic activities, and So forth. 50 browsing function 2621, protein domain identification func Other possible classifications of protein function will be tion 2622, and interaction pathway construction function readily apparent to those of skill in the art. Proteins of 2623. However, included within this invention are the other unknown function can be assigned to a default group that diverse uses of the valuable protein-protein interaction infor corresponds to every other functional group. Fourth, the mation Stored in the interaction database that will be appar homology Scores of all pairs of genes with protein of 55 ent to a perSon Skilled in the art. corresponding function is retrieved, and that pair of corre Database browsing 2621 is, on the one hand, a general sponding genes with the highest homology Scores to the function for Selecting (also called “filtering herein) a Subset observed fusion inserts is assigned to the fusion inserts. The of information from the database for function processing. accession numbers of these genes are used to update the On the other hand, database browsing also includes the interaction database. Preferably, the Selected genes will have 60 formatted display of the selected or filtered information on homology Scores above a certain threshold. In the case of a user's terminal. Especially where the interaction database BLAST search results, such a preferred homology threshold is Stored in relational format, Such Subset Selection can be is between a probability of e and e' for shorter insert easily done according to any possible relational query, as is Sequences of no more than a few tens (e.g., 20) of known in the art. The selected Subset can be formatted for nucleotides, or alternately, in the case of longer insert 65 display by presenting the Selected information rows in a sequences may be greater than, e.g., e?',s Ce, s e",C s Ce. s labeled, tabular layout with a Scrolling capability useful to e', e7, e, e”, or e'. If no such genes exist, the view further rows and fields not immediately viewable on 6,057,101 S1 52 one Screen. Any data in the interaction database, either from represents one gene, or protein, and the Set of Vertices V is the interaction component, the interacting fision insert pair assembled by retrieving all the distinct proteins, or genes, component, or both components combined can be Selected present in the Selected Subset of the interaction database. and displayed. Each edge represents one protein-protein interaction Since In particular embodiments, it is useful to assist the user by each Such interaction links two genes in the Set of Vertices, providing pre-established queries, or filters, available for and the Set of edges, E, is assembled by retrieving the Set of easy selection. FIGS. 28A-B illustrate examples of Such Selected interactions. For example, a graph for the pathways useful and hierarchically arranged filters. FIG. 28A allows a illustrated in FIG.29 is defined by the set of vertices (Protein user to select a particular mating experiment (identified by A, Protein B, Protein C, MDM2) and the set of edges the column labeled “screen”) for further display. The addi ((Protein A, MDM2,...), (Protein B, MDM2,...) (MDM2, tional columns display further data characterizing each Protein C, . . . )) (“ . . . " represents additional interaction experiment: the “library” columns identify the DNA librar information). Having defined the interaction graph, each ies used to construct the plasmid libraries, the “name’ Separate pathway is represented by a connected component column displays the activation domain and binding domain of this graph. Two vertices are in the same connected library identities; the further columns display the laboratory 15 component, if they are connected by a path of edges. No path Status of the experiment. Once a user Selects a particular of edges connects two vertices in different connected com experiment (or “screen”), FIG. 28B illustrates a next filter ponents. Finding connected components of a graph is well that allows the user to further Select particular information known to those of skill in the art, and can be done by the for this experiment from the interaction database. The data basic depth-first Search algorithm. See, e.g., Sedgewick, can be Selected by the Status of confirmation tests, by the 1990, Algorithms. In C, Addison-Wesley Publishing Co., "Screen' type, by the Source of the interaction, or by the chap. 29, the entirety of this reference is incorporated herein number of independent interacting pairs (referred to as by reference. “isolates” on FIG. 28B). The “list” options control the Finally, each connected component is then Separately display of selected data. Finally, the “SUBMIT TO PATH formatted and displayed on a user's computer Screen. For MAKERTM' permits the selected interactions to be 25 ease of viewing, the graph is preferably displayed with the assembled into pathways by pathway construction function protein, or gene, Vertices well Separated on the Screen, and 2623. also preferably, if possible, with the edges, representing In particular, the “Screen” filter permits Selection of interactions not crossing (that is as a planar graph). Since "forward,” “reverse,” and “bi-directional” Screens. For for Such a display can be difficult to create in general, an ward Screens, data is Selected in which a chosen library was exemplary approximation is to place graph vertices on the used to make activation domain fusions. For reverse Screens, Screen according to a simulated annealing algorithm, which the data refers to results in which the library was used to approximately minimizes an “energy' function using Statis make binding domain fusions. Finally, for "bi-directional tical techniques. See, e.g., Press, et al., 1988, Numerical Screens, the data refers to results in which the library was Recipes in C, Cambridge University Press, Cambridge, used to make both activation domain and binding domain 35 U.K., which is herein incorporated by reference in its fusions, in the same or in Separate mating experiments. For entirely. The preferred display goals are approximately example, an interaction of gene-A and gene-B passing a achieved by minimization of an “energy' function, which “bi-directional” filter has both at least one interacting pair in grows large both when two vertices are close and also when which gene-A is present in an activation domain fusion and edges croSS. An exemplary Such function includes a term for gene-B is present in a binding domain fusion and also at 40 each vertex that depends on the inverse of the distance to the least one interacting pair in which gene-A is present in a nearest neighbor of that vertex as well as a large positive binding domain fusion and gene-B is present in an activation factor for each edge crossing. Simulated annealing then domain fusion. Finding interactions present in a "bi Successively perturbS vertex Screen placement in order to directional” Screen increases confidence in the biological Search for a placement approximately minimizing the energy Significance of the interaction, and decreases the possibility 45 function. that the interaction is merely an artifactual occurrence. Further preferable display features include coding gene Pathway construction function 2623 automates construc information in the appearance of its vertex by, e.g., the tion of protein interaction pathways, which represent the vertex color, or codina interaction information by, e.g., the links by which proteins can interact with distant proteins edge color or graphic, and So forth. Additional information, through intermediate proteins. Preferably, this function also 50 beyond that SO coded, on a gene or an interaction can be provides for graphical display of the resulting pathways. obtained by "clicking on their Screen representations. For FIG. 29 illustrates such a graphical display of a simple example. clicking on a edge representing an interaction can pathway, in which three proteins, Proteins A, B, and C, have call up a window in which Summary or graphical informa been found to all individually interact with the protein tion on the interacting pairs evidencing that interaction is MDM2. These individual, pair-wise interactions create three 55 presented. Such information can include a graphical repre possible pathways by which Protein A can interact with Sentation of where the fusion inserts are located on the Protein B or with Protein C and by which Protein B can coding Sequence of the gene. interact with Protein C, all mediated by protein MDM2. Finally, domain identification function 2622 automates Determination of Such pathways Start with Selection of a locating the actual protein domains responsible for an inter Subset of the interactions Stored in the interaction database 60 action. In a first simple embodiment, applicable to a single for pathway analysis. AS described with respect to the pair-wise interaction, for example that of Protein A with database browsing function, this Selection can be by a protein MDM2, the locations of all the fusion inserts on the relational query. Alternatively, all the interactions in the gene Sequence are simply intersected in order to obtain a database can be analyzed into pathways. Pathway analysis location common to and included in all the fusion inserts. begins with representing the Selected protein interactions as 65 The protein domain responsible for the interaction evi a graph, which is defined by a Set of Vertices, V, and a set of denced by these fusion inserts lies within the amino acid edges, E, each edge connecting two vertices. Each vertex sequence coded by this common region. FIG. 30 illustrates 6,057,101 S3 S4 this processing. Sequence 3001 represent the entire gene teristic of a particular situation, but also enables the identi coding Sequence for the one interacting protein participating fication of inhibitors of Such interactions. The invention in an interaction. Sequences 3002, 3003, and 3004 represent provides a method of detecting an inhibitor of a protein three fusion inserts fragments from this gene that were found protein interaction comprising (a) incubating a population of in three independent interacting pairs evidencing this inter cells, Said population comprising cells recombinantly action. They are illustrated aligned between their 3' and 5' expressing a pair of interacting proteins, Said pair consisting ends as determined in previous processing StepS. Subse of a first protein and a Second protein, in the presence of one quence 3005 of sequence 3001 is the intersection of the three or more candidate molecules among which it is desired to inserts. Clearly, the protein domain responsible for the identify an inhibitor of the interaction between said first interaction must be encoded by (all or perhaps a portion of) protein and Said Second protein, in an environment in which the Subsequence 3005, as this is the only common amino Substantial death of Said cells occurs (i) when said first acid Sequence to all the interacting protein fragments. Sub protein and Second protein interact, or (ii) if Said cells lack Sequence 3005, as illustrated, can be computed as the a recombinant nucleic acid encoding Said first protein or a Sequence lying between a 3' boundary, which is the mini recombinant nucleic acid encoding Said Second protein; and mum of the 3' ends of all the fusion inserts, and a 5' 15 (b) detecting those cells that Survive said incubating step, boundary, which is the maximum of all the 5' ends of the thereby detecting the presence of an inhibitor of Said inter fusion inserts. Only inserts from independent interacting action in Said cells. In a preferred aspect, the population of pairs need be retrieved for this determination. cells comprises a plurality of cells, each cell within Said Domain identification is more certain if the same domain plurality recombinantly expressing a different Said pair of is found in a bi-directional Screen, when the inserts from the interacting proteins. In various embodiments, the plurality of protein are fused with both activation domains and binding cells consists of at least 10, at least 100, or at least 1000 cells domains. Domain identification is also more certain if (corresponding to different pairs of interacting proteins known motifs can be identified in the domain. After domain being assayed in a single assay). In a preferred embodiment, location is determined, the amino acid Sequences encoded the pair(s) of interacting proteins in the cells being assayed can be searched for known motifs. 25 consist of a first fusion protein and a Second fusion protein, In a further embodiment, additional domain information each Said first fusion protein comprising a first protein can be obtained in certain cases. By way of example, Sequence and a DNA binding domain; each said Second referring to FIG. 29, the ternary interaction of Protein A and fusion protein comprising a Second protein Sequence and a Protein B intermediated by MDM2 can provide additional transcriptional activation domain of a transcriptional acti domain information according to the following procedure. Vator; and in which the cells contain a first nucleotide First, interSection domains are determined as previously Sequence operably linked to a promoter driven by one or described for Protein A and MDM2 and for Protein B and more DNA binding sites recognized by said DNA binding MDM2. If both Proteins A and B interact with the same or domain Such that an interaction of Said first fusion protein overlapping MDM2 domain, then more information may be with Said Second fusion protein results in increased tran obtained by comparing the domains found in Proteins A and 35 Scription of Said first nucleotide Sequence, and in which the B as follows. A BLAST comparison of these two domains cells are incubated in an environment in which Substantial may reveal homologous Structures of a probability which death of the cells occurs (i) when increased transcription might be ignored if the functional relationship revealed by occurs of the first nucleotide sequence or (ii) if the cells lack the interaction were not a priori known. The domains may be a recombinant nucleic acid encoding the first fusion protein compared by protein Search tools, especially Search tools 40 or a recombinant nucleic acid encoding the Second fusion capable of evaluating multiple alignment between the two protein. The cells in which the assay is carried out are domains, in order to reveal Structural relationships, the preferably (but need not be) yeast cells, which can be presence of motifs, and So forth, at the amino acid Sequence haploid or diploid. level. Further, other techniques for extracting domain infor In a specific embodiment, an assay for the presence of an mation from binary, ternary, and higher-order interactions 45 interacting protein pair is carried out as described in the will be apparent to those of skill in the art. Such techniques Sections Supra, except that it is done in the presence of one are within the applications of the interaction database of this or more candidate molecules which it is desired to Screen for invention. the ability to affect an interaction between a protein-protein The information-processing aspects of this invention also pair that results in transcription from the Reporter Gene. An include those variations and elaborations that are apparent to 50 increase or decrease in Reporter Gene activity relative to those of skill in the art in view of the disclosure herein. In that present when the one or more candidate molecules are particular, the experimental data and workflow controls can absent indicates that the candidate molecule has an effect on be extended to manage the additional Steps of mating the interacting pair. For example, a decrease in (e.g., absence experiments prior to fusion protein characterization or after of) Reporter Gene activity that would otherwise occur in the confirmation tests. Automation of Screening interaction ago 55 absence of a candidate molecule, due to the presence of an nists and antagonists is an especially advantageous exten interacting pair, indicates that the candidate molecule is an SO. inhibitor of the interaction exhibited by the protein pair. In 53. INTEGRATED ISOLATION OF INHIBITORS OF AN a preferred embodiment, Selection of positive interactants INTERACTIVE POPULATION (colonies) is carried out; these colonies are exposed to The present invention also provides methods for identi 60 candidate inhibitor molecule(s) and are Selected again, this fying inhibitors or enhancers of protein-protein interactions. time for lack of interaction (e.g., by Selection for Survival in The method of identifying inhibitors provided by the inven medium containing 5-FOA wherein URA3 is a Reporter tion provides for greater ease and higher throughput than Gene, or by Selection for Survival in medium containing prior art methods, inter alia, through the ability to Select for C.-amino-adipate wherein LYS2 is a Reporter Gene, or the inhibitors based on cell survival. The present invention is 65 other methods of negative selection described in Section 5.1 particularly valuable in that it enables one to identify not above, Selection of cells that do not display a signal gener only the interacting proteins that are unique to or charac ated by a Reporter Gene (e.g., in the case of lac7, by activity 6,057,101 SS S6 on the B-gal substrate X-gal (5-bromo-4-chloro-3-indolyl In the 96-well format of this assay, the activity of a lacZ f-D-galactoside)). The environment in which selection is Reporter Gene can also be assayed enzymatically. The carried out preferably also Selects for the presence of the activity of the lacZ gene can be determined by assaying the recombinant nucleic acids encoding the interacting pair of B-galactosidase levels. This can be done in a high through proteins. Thus, for example, the proteins are expressed from put fashion as chemiluminescent assays or fluorescent plasmids also expressing a Selectable marker, thus facilitat assays using Substrates that are chemiluminescent (Jain and ing this Selection. For detecting an inhibitor, candidate inhibitor molecules Magrath, 1991, Anal. Biochem. 199:119-124) or fluorescent can be directly provided to a cell containing an interacting (Fluoreporter lacz/B-galactosidase quantitation kit from pair, or, in the case of candidate protein inhibitors, can be Molecular Probes Inc.). provided by providing their encoding nucleic acids under Use of a Reporter Gene that encodes a Selectable marker conditions in which the nucleic acids are recombinantly (e.g., URA3 or LYS2) that can be negatively selected against expressed to produce the candidate proteins within the cell. is preferred over the sole use of a Reporter Gene that The recombinantly expressed candidate inhibitors prefer encodes a detectable marker (e.g., lac7), Since negative ably comprise a nuclear localization signal to facilitate their Selection for a Selectable marker can be carried out on each import into the nucleus and exposure to the interacting 15 of multiple interacting pairs within a Single well, thus protein pair. allowing "multiplex” analysis (analysis of pools of cells A preferred exemplary method for detecting the presence containing interacting pairs in one well), thus increasing of inhibitors of protein-protein interactions is shown in FIG. throughput. This is because in the use of negative Selection, 6. The interactive population is grown in a 96-well format Survival of any cells indicates that at least one inhibited pair with each well containing 200 ul of media. If the two is present; in contrast, lack of detection of a detectable interacting proteins are plasmid-borne then the media pref marker occurs only if all interacting pairs in the well are erably Selects for maintenance of the plasmids, e.g., the inhibited, while detection of a detectable marker indicates media lacks those markers, like tryptophan or leucine that that at least one interacting pair in the well is not inhibited allow selection for the plasmids bearing TRP1 or LEU2, but does not indicate whether or not any of the other respectively. (This maintenance of Selective pressure is 25 potential pairs present are inhibited. obviated if the genes encoding the two proteins are not This embodiment of the invention is well Suited to Screen plasmid-borne but have been integrated into the chromo chemical libraries for inhibitors of protein-protein interac Some instead). Each well contains all the colonies that were identified as containing protein interactants from an NXM tions. assay of protein interactions according to the invention. Exemplary libraries are commercially available from Sev Thus, each well is representative of all the interactive eral Sources (ArCule, Tripos/PanLabs, ChemDesign, proteins present in a particular population. In the preferred Pharmacopoeia). In Some cases, these chemical libraries are embodiment of the invention, the Reporter Gene used for generated using combinatorial Strategies that encode the selection of interaction and selection of inhibition of inter identity of each member of the library on a Substrate to action is the URA3 gene. Interaction between the two fusion which the member compound is attached, thus allowing proteins causes the yeast to grow in the absence of uracil, 35 direct and immediate identification of a molecule that is an allowing Selection of the interacting colonies. However, effective inhibitor. Thus, in many combinatorial approaches, activation of the URA3 gene causes the yeast to die in the position on a plate of a compound Specifies that com medium containing the chemical 5-fluoroorotic acid pounds composition. Also, in one example, a Single plate (5-FOA; (Rothstein, 1983, Meth. Enzymol. 101:167–180)). position may have from 1-20 chemicals that can be Screened After a growth period that is Sufficient for early log-phase 40 by administration to a well containing the interactions of growth (a cell density of about 1x10" cells/ml), the cells are interest. Thus, if positive inhibition is detected, Smaller and exposed to inhibitor(s) for 1–2 hours. Then an appropriate Smaller pools of interacting pairs can be assayed for inhi dilution of the cells is transferred to a 96-well plate con bition. By Such methods, many inhibitors can be Screened taining 200 ul media lacking uracil to activate the transcrip against many interactors (see, e.g., FIG. 6). tion of the URA3 gene as a result of interaction between the 45 Many diversity libraries suitable for use are known in the two hybrid proteins in the presence of inhibitor(s). After this, art and can be used to provide compounds to be tested as an appropriate dilution of the cells is transferred to a 96-well inhibitors according to the present invention. Alternatively, plate containing 200 ml media made up of 5-FOA and the libraries can be constructed using Standard methods. Chemi inhibitor(s). At this step an alternative is to transfer 1 ul onto cal (Synthetic) libraries, recombinant expression libraries, or a 96-slot grid on Solid media containing 5-FOA and the 50 polySome-based libraries are exemplary types of libraries inhibitor(s) at the desired concentration. that can be used. Growth will be evident only in those instances where The libraries can be constrained or Semirigid (having inhibition of the protein-protein interaction occurs. AS a Some degree of structural rigidity), or linear or noncon preferred control, all the cells should be able to grow in the Strained. The library can be a cDNA or genomic expression absence of 5-FOA but in the presence of the inhibitor. Thus, 55 library, random peptide expression library or a chemically in a Single Screen, the inhibitor and the pair of interacting Synthesized random peptide library. Expression libraries are proteins it inhibits are identified. The identities of the introduced into the cells in which the inhibition assay interacting proteins that are inhibited are revealed by char occurs, where the nucleic acids of the library are expressed acterizing the genes that encode these interacting proteins. to produce their encoded proteins. The presence of more than one inhibited pair in a well 60 In one embodiment, the peptide libraries used in the would be indicated, e.g., by Sequence analysis. In Such an present invention may be libraries that are chemically Syn instance, the cells Surviving in the presence of 5-FOA can be thesized in vitro. Examples of Such libraries are given in diluted, and the inhibition assay repeated. Ultimately, the Houghten et al., 1991, Nature 354:84-86, which describes cells are diluted and Streak-purified So as to isolate Single mixtures of free hexapeptides in which the first and Second colonies representing a Single pair of interacting proteins. 65 residues in each peptide were individually and Specifically Then the inhibition assay is repeated on these Streak-purified defined; Lam et al., 1991, Nature 354 82-84, which isolates. describes a “one bead, one peptide' approach in which a 6,057,101 57 58 Solid phase split Synthesis Scheme produced a library of amino acids, fluoro-amino acids and amino acid analogs in peptides in which each bead in the collection had immobi general. Furthermore, the amino acid can be D lized thereon a single, random Sequence of amino acid (dextrorotary) or L (levorotary). residues; Medynski, 1994, Bio/Technology 12:709–710, A Specific embodiment of this invention uses mutant which describes split Synthesis and T-bag Synthesis meth Strains of yeast that have a mutation in at least one gene ods; and Gallop et al., 1994, J. Medicinal Chemistry 37(9) coding for a cell wall component, thereby having modified : 1233-1251. Simply by way of other examples, a combina cell walls that are more permeable to exogenous molecules torial library may be prepared for use, according to the than are wild-type cell walls, thus facilitating the entry of methods of Ohlmeyer et al., 1993, Proc. Natl. Acad. Sci. chemicals into the cell, and rendering Such yeast cells USA 90:10922-10926; Erb et al., 1994, Proc. Natl. Acad. 1O preferred for an inhibition assay in which exogenous can Sci. USA91:11422-11426; Houghten et al., 1992, Biotech didate inhibitor compounds are provided directly to the cell. niques 13:412; Jayawickreme et al., 1994, Proc. Natl. Acad. In one embodiment, mutations in the gene KNR4 in Sac Sci. USA91:1614-1618; or Salmon et al., 1993, Proc. Natl. charomyces cerevisiae cause the cell wall to be more per Acad. Sci. USA 90:11708-11712. PCT Publication No. WO meable to chemicals like X-gal, while not affecting general 93/20242 and Brenner and Lerner, 1992, Proc. Natl. Acad. 15 growth (Hong et al., 1994, Yeast 10:1083–1092). The Sci. USA 89:5381-5383 describe “encoded combinatorial reporter Strains are made mutant with respect to gene KNR4 chemical libraries,” that contain oligonucleotide identifiers to facilitate entry of inhibitor compounds. Similarly, in other for each chemical polymer library member. Compounds embodiments, mutations in genes that influence the cell wall Synthesized So as to be immobilized on a Substrate are integrity (reviewed in Stratford, 1994, Yeast 10:1741–1752) released from the Substrate prior to use in the inhibition are incorporated into the reporter Strain So as to make the cell asSay. wall more permeable. Further, more general, Structurally constrained, organic In a specific embodiment of the invention, the prospective diversity (e.g., nonpeptide) libraries, can also be used. By inhibitors are peptides that are genetically encoded and way of example, a benzodiazepine library (see e.g., Bunin et either plasmid-borne or are introduced into the chromosome al., 1994, Proc. Natl. Acad. Sci. USA'91:4708–4712) may be 25 through homologous recombination. The peptides to be used. Screened are thus provided by recombinant expression Conformationally constrained libraries that can be used within the cell in which the inhibition assay occurs. The include but are not limited to those containing invariant peptides are preferably expressed as a fusion to a nuclear cysteine residues which, in an oxidizing environment, croSS localization Sequence. The interactive population link by disulfide bonds to form cystines, modified peptides (preferably the entire population) from an MxN screen is (e.g., incorporating fluorine, metals, isotopic labels, are pooled together and then transformed with a library of phosphorylated, etc.), peptides containing one or more non plasmids encoding peptides to be tested as potential inhibi naturally occurring amino acids, non-peptide Structures, and tors. Alternatively, genes encoding the peptides are intro peptide S containing a Significant fraction of duced directly into the chromosome by first cloning the Y-carboxyglutamic acid. 35 genes into an integration plasmid containing the yeast Libraries of non-peptides, e.g., peptide derivatives (for Sequences that donate the Site necessary for homologous example, that contain one or more non-naturally occurring recombination. The transformed yeast cells are then plated amino acids) can also be used. One example of these are on media that selects for inhibition events. In the preferred peptoid libraries (Simon et al., 1992, Proc. Natl. Acad. Sci. embodiment of the invention, the reporter gene for interac USA 89:9367–9371). Peptoids are polymers of non-natural 40 tion and inhibition of the interaction will be the URA3 gene. amino acids that have naturally occurring Side chains Thus, transformants that emerge in media containing 5-FOA attached not to the alpha carbon but to the backbone amino represent peptide inhibitors that inhibit specific protein nitrogen. Since peptoids are not easily degraded by human protein interactions. digestive enzymes, they are advantageously more easily In another embodiment, DNA from a microorganism that adaptable to drug use. Another example of a library that can 45 reconstitutes Synthetic pathways for a compound (see be used, in which the amide functionalities in peptides have Hutchinson, 1994, Bio/Technology 12:375-380; Alvarez et been permethylated to generate a chemically transformed al. 1996, Nature Biotechnology 14:335-338) can be intro combinatorial library, is described by Ostresh et al., 1994, duced into the cell in which the inhibition assay takes place, Proc. Natl. Acad. Sci. USA 91:11138-11142). So as to be recombinantly expressed by the cell Such that the The members of the libraries that can be screened accord 50 compound is synthesized within the cell. If the synthesized ing to the invention are not limited to containing the 20 compound blocks the protein interactants, Such cells con naturally occurring amino acids. In particular, chemically taining an inhibitor of the interacting pair can be detected by synthesized libraries and polysome based libraries allow the methods as described above. By sequencing the DNA in the use of amino acids in addition to the 20 naturally occurring cells in which inhibition of the interactants has thus amino acids (by their inclusion in the precursor pool of 55 occurred, a novel inhibitory compound can be identified. amino acids used in library production). In specific The identities of the peptide inhibitors are deciphered by embodiments, the library members contain one or more the isolation and Sequencing of the plasmids that encode non-natural or non-classical amino acids or cyclic peptides. these peptides. The identities of the pair of interacting Non-classical amino acids include but are not limited to the proteins, whose interaction has been inhibited by the D-isomers of the common amino acids, C.-amino isobutyric 60 peptide, are identified by isolation and Sequencing the plas acid, 4-aminobutyric acid, Abu, 2-amino butyric acid; mids that encode these two proteins. The Sequences of the Y-Abu, e-Ahk, 6-amino hexanoic acid; Aib, 2-amino isobu inhibitor peptide and those of the interacting proteins can tyric acid; 3-amino propionic acid; ornithine; norleucine; also be obtained by amplifying the protein and peptide norvaline, hydroxyproline, Sarcosine, citrulline, cysteic acid, encoding region by PCR or other methods and Sequencing of t-butylglycine, t-butylalanine, phenylglycine, 65 the same. Specific primers can be used to amplify the cyclohexylalanine, B-alanine, designer amino acids Such as peptide or the DNA-binding fusion protein or the activation B-methyl amino acids, CC-methyl amino acids, NC.-methyl domain fusion protein. 6,057,101 59 60 In a particular embodiment of the invention, cells are able to DNA made of, for example, peptide nucleic acids incubating in the presence of candidate inhibitor molecules (hereinafter called “PNAs”) (See, e.g., Egholm et al., 1993, by expressing Such molecules within the cell from recom Nature 365:566-67) or DNAS. The subsequence recognition binant nucleic acids comprising the following operably means allow recognition of Specific DNA Subsequences by linked components (a) an ADC1 promoter; (b) a nucleotide the ability to specifically bind to or react with such subse Sequence encoding a candidate molecule fused to a nuclear quences. A QEATM method, and particularly its computer localization signal; and (c) an ADC1 transcription termina methods, are adaptable to any Subsequence recognition tion signal (see e.g., Section 6.8). In a particular means available in the art. Acceptable Subsequence recog embodiment, the candidate molecules are expressed from nition means preferably precisely and reproducibly recog purified expression vectors comprising the following com nize target Subsequences and generate a recognition signal ponents: (a) a promoter active in yeast; (b) a first nucleotide of adequate Signal to noise ratio for all genes, however rare, Sequence encoding a peptide of 20 or fewer amino acids in a Sample, and can also provide information on the length fused to a nuclear localization Signal, Said first nucleotide between target Subsequences. Sequence being operably linked to the promoter; (c) a In some QEATM embodiments, the presence of target transcription termination Signal activ.e in yeast, operably 15 Subsequences is directly recognized by direct Subsequence linked to said first nucleotide Sequence; (d) means for recognition means, including, but not limited to, RES and replicating in a yeast cell; (e) means for replicating in E. coli, other DNA binding proteins, which bind and/or react with (f) a Second nucleotide Sequence encoding a selectable target Subsequences, and oligomers of, for example, PNAS marker for Selection in a yeast cell, operably linked to a or DNAS, which hybridize to target Subsequences. In other transcriptional promoter and transcription termination signal embodiments, the presence of effective target Subsequences active in yeast; and (g) a third nucleotide sequence encoding is recognized indirectly as a result of applying protocols, a Selectable marker for Selection in E. coli, operably linked such as a SEQ-QEATM method, or e.g., involving multiple to a transcriptional promoter and transcription termination DNA binding proteins together with hybridizing oligomers. Signal active in E. coli. The means for inserting is preferably In this latter case, each of the multiple proteins or oligomers one or more Suitably located restriction endonuclease rec 25 recognizes a separate Subsequence and an effective target ognition sites, the means for replicating in a yeast cell can Subsequence is the combination of the Separate Subse be any Suitable origin of replication; the means for replicat quences. A preferable combination is Subsequence concat ing in E. coli can be any Suitable origin of replication. The enation in the situation where all the Separately recognized invention provides expression vectors which can be used for Subsequences are adjacent. Such effective target Subse expression of candidate inhibitor molecules, Such as a puri quences can have advantageous properties not achievable fied expression vector comprising the following compo by, for example, REs or PNA oligomers alone. However, the nents: (a) an ADC1 promoter; (b) a first nucleotide sequence QEATM method, and particularly its computer methods, are encoding a nuclear localization Signal, operably linked to the adaptable to any acceptable Subsequence recognition means promoter; (c) means for inserting a DNA sequence into the available in the art. The computer implemented analysis and vector in such a manner that a protein encoded by the DNA 35 design methods treat target Subsequences and effective tar Sequence is capable of being expressed as part of a fusion get Subsequences in the same manner. protein containing the nuclear localization signal; (d) an The Signals contain representations of target Subsequence ADC1 transcription termination Signal, operably linked to occurrences and a representation of the length between the first nucleotide sequence; (e) means for replicating in a target Subsequence occurrences. In various embodiments of yeast cell, (f) means for replicating in E. coli, (g) a second 40 the QEATM method these representations may differ. In nucleotide Sequence encoding a Selectable marker for Selec embodiments where the target Subsequences are exactly tion in a yeast cell, operably linked to a transcriptional recognized, as where RES are used, Subsequence represen promoter and transcription termination signal active in tation may simply be the actual identity of the Subsequences. yeast; and (h) a third nucleotide Sequence encoding a In other embodiments where Subsequence recognition is leSS Selectable marker for Selection in E. coli, operably linked to 45 exact, as where short oligomers are used, this representation a transcriptional promoter and transcription termination Sig may be “fuzzy'. It may, for example, consist of all Subse nal active in E. coli. quences which differ by one nucleotide from the target, or 54. THE-OEATM METHOD Some other Set of possible Subsequences, perhaps weighted 5.4.1. QUANTITATIVE EXPRESSION ANALYSIS by the probability that each member of the set is the actual METHOD, GENERALLY 50 Subsequence in the Sample Sequence. Further, the length According to a Quantitative Expression Analysis representation may depend on the Separation and detection (“OEATM") method, to uniquely identify an expressed gene means used to generate the Signals. In the case of electro Sequence, full or partial, and many components of genomic phoretic Separation, the length observed electrophoretically DNA, it is not necessary to determine actual, complete may need to be corrected, perhaps up to 5 to 10%, for nucleotide Sequences of Samples. Full Sequences provide far 55 mobility differences due to average base composition dif more information than is needed to merely classify or ferences or due to effects of any labeling moiety used for determine a gene according to the QEATM method. detection. AS these corrections may rot be known until target In a QEAM method, expressed Sequences are recognized Sequence recognition, the Signal may contain the electro by codes which are constructed from Signals which represent phoretic length in base pairs (hereinafter called “bp’) and the presence of short nucleic acid (preferably DNA) subse 60 not the true physicallength in bp. For Simplicity and without quences (hereinafter called “target Subsequences”) in the limitation, in most of the following description unless oth Sample Sequence and include a representation of the length erwise noted the Signals are presumed to represent the along the Sample Sequence between adjacent target Subse information conveyed exactly, as if generated by exact quences. The presence of these Subsequences is recognized recognition means and error or bias free separation and by Subsequence recognition means, including, but not lim 65 detection means. However, in particular embodiments, tar ited to, restriction endonucleases (hereinafter called “REs”), get Subsequences may be represented in a fuzzy fashion and DNA binding proteins, and oligomers (“probes”) hybridiz length, if present, with Separation and detection bias present. 6,057,101 61 62 Target Subsequences recognized are typically contiguous. Sequences, and (ii) the length between occurrences of effec This is required for all known RES. However, oligomers tive Subsequences in Said nucleic acid or between one recognizing discontinuous Subsequences can be used and occurrence of one effective Subsequence and the end of Said can be constructed by inserting degenerate nucleotides in nucleic acid; and (c) Searching a nucleotide Sequence data any discontinuous region. base to determine Sequences that match or the absence of A QEATM method is adaptable to analyzing any DNA any Sequences that match Said one or more generated Sample for which exists an accompanying database listing Signals, Said database comprising a plurality of known possible Sequences in the Sample. More generally, a QEA." nucleotide Sequences of nucleic acids that may be present in method is adaptable to analyzing the Sequences of any the Sample, a Sequence from Said database matching a biopolymer, built of a Small number of repeating units, generated Signal when the Sequence from Said database has whose naturally occurring representatives are far fewer that both (i) the same length between occurrences of effective the number of possible, physical polymers and in which Subsequences or the same length between one occurrence of Small Subsequences can be recognized. Thus, it is applicable one effective Subsequence and the end of the Sequence as is to not only naturally occurring DNA polymers but also to represented by the generated Signal, and (ii) the same naturally occurring RNA polymers, proteins, glycans, etc. 15 effective Subsequences as are represented by the generated Typically and without limitation, however, a QEATM method Signal, or effective Subsequences that are members of the is applied to the analysis of cDNA Samples from any in vivo Same Sets of effective Subsequences as are represented by the or in Vitro Sources. generated Signal, whereby Said one or more nucleic acids in A QEATM method probes a sample with recognition Said Sample are identified, classified, or quantified. In a means, the recognition means generating Signals, a preferred preferred embodiment, the method comprises (a) digesting Signal being a triple comprising an indication of the presence Said Sample with one or more restriction endonucleases, of a first target Subsequence, an indication of the presence of each Said restriction endonuclease recognizing a Subse a Second target Subsequence, and a representation of the quence recognition site and digesting DNA at Said recogni length between the target Subsequences in the Sample tion site to produce fragments with 5' overhangs; (b) con nucleic acids Sequence. Each pair of target Subsequences 25 tacting Said produced fragments with Shorter and longer may occur more than once in a Sample nucleic acid, in which oligodeoxynucleotides, each said shorter oligodeoxynucle case the associated lengths are between adjacent target otide hybridizable with a said 5' overhang and having no Subsequence occurrences. terminal phosphates, each Said longer oligodeoxynucleotide The QEATM method is preferred for classifying and hybridizable with a said shorter oligodeoxynucleotide; (c) determining Sequences in cDNA mixtures, but is also adapt ligating Said longer oligodeoxynucleotides to Said 5' over able to Samples with only one Sequence. It is preferred for hangs on Said fragments to produce ligated DNA fragments, mixtures because it affords the relative advantage over prior (d) extending Said ligated DNA fragments by Synthesis with art methods that cloning of Sample nucleic acids is not a DNA polymerase to produce blunt-ended double Stranded required. Typically, enough distinguishable signals are gen DNA fragments; (e) amplifying said blunt-ended double erated from pairs of target Subsequences to recognize a 35 Stranded DNA fragments by a method comprising contacting desired Sequence in a Sample mixture. For example, first, said blunt-ended double stranded DNA fragments with a any pair of target Subsequences may hit more than once in DNA polymerase and primer oligodeoxynucleotides, each a single DNA molecule to be analyzed, thereby generating Said primer oligodeoxynucleotide having a Sequence com several signals with differing lengths from one DNA mol prising that of one of the longer oligodeoxynucleotides, (f) ecule. Second, even if the pair of target Subsequences hits 40 determining the length of the amplified DNA fragments only once in two different DNA molecules to be analyzed, produced in Step (e); and (g) Searching a DNA sequence the lengths between the hits may differ and thus distinguish database, Said database comprising a plurality of known able Signals may be generated. DNA sequences that may be present in the Sample, for The target subsequences used in the QEATM method are Sequences matching one or more of Said fragments of preferably optimally chosen by computer methods from 45 determined length, a Sequence from Said database matching DNA sequence databases containing Sequences likely to a fragment of determined length when the Sequence from occur in the sample to be analyzed. Efforts of the Human Said database comprises recognition sites of Said one or Genome Project in the United States, efforts abroad, and more restriction endonucleases Spaced apart by the deter efforts of private companies in the Sequencing of the human mined length, whereby DNA molecules in Said Sample are genome Sequences, both expressed and genetic, are being 50 identified, classified, or quantified. collected in several available databases (listed in Section A QEATM method can be conducted in a “query' mode or 5.4.2). a “tissue” mode. “In a QEATM method “query mode” In a specific embodiment, a QEATM method comprises (a) experiment, the focus is on determining the expression of probing a Sample comprising a plurality of nucleic acids Several genes, perhaps 1-100, of interest and of known having different nucleotide Sequences with one or more 55 Sequence. A minimal number of target Subsequences is recognition means, each recognition means causing recog chosen to generate Signals, with the goal that each of the nition of a target nucleotide Subsequence or a set of target Several genes is discriminated by at least one unique signal, nucleotide Subsequences; (b) generating one or more signals which also discriminates it from all the other genes likely to from Said Sample probed by Said recognition means, each occur in the Sample. In other words, the experiment is generated Signal arising from a nucleic acid in Said Sample 60 designed So that each gene generates at least one signal and comprising a representation of (i) the identities of unique to it (a “good” gene, see infra). In a QEATM method effective Subsequences, each said effective Subsequence “tissue model” experiment, the focus is on determining the comprising a Said target nucleotide Subsequence, or the expression of as many as possible, preferably a majority, of identities of Sets of effective Subsequences, each said Set the genes in a Sample, without the need for any prior having member effective Subsequences each of which com 65 knowledge or interest in their expression. Target Subse prises a different one or more member target nucleotide quences are optimally chosen to discriminate the maximum Subsequences from one of Said Sets of target nucleotide number of Sample DNA sequences into classes comprising 6,057,101 63 64 one or preferably at most a few Sequences. Signals are Optional alternatives can provide increased discrimina generated and detected as determined by the threshold and tion in a QEATM method. Two sequences producing two Sensitivity of a particular experiment. Some important deter fragments of identical end Subsequences and length can be minants of threshold and Sensitivity are the initial amount of discriminated by recognizing a third Subsequence present in mRNA and thus of cDNA, the amount of molecular ampli one of the fragments but not in the other. In one alternative, fication performed during the experiment, and the Sensitivity a labeled probe recognizing this third Subsequence can be of the detection means. Preferably, enough Signals are added before detection to generate unique signals from the produced and detected so that the QEATM computer methods fragment containing that Subsequence. In another can uniquely determine the expression of a majority, or more preferably most, of the genes expressed in a tissue. alternative, a probe can be added before amplification which QEATM method signals are generated by methods utiliz prevents amplification of the fragment with the third Sub ing recognition means that include, but are not limited to, Sequence and which thereby removes (Suppresses) its signal. RES in a preferred RE/ligase method or in a method utilizing By way of example, such a probe can be either an RE for a removal means, preferably contacting Streptavidin linked recognizing and cutting the fragment with the third to a solid phase with biotin-labeled DNA, for removal of Subsequence, or a PNA, or modified DNA probe which will unwanted DNA fragments. 15 hybridize with the third subsequence and prevent its PCR A preferred embodiment of an RE/ligase QEATM method amplification. is as follows. The method employs recognition reactions Further RE/ligase alternative methods increase Sample with a pair (or more) of RES which recognize target Subse Sequence discrimination in QEAM experiments, for quences with high Specificity and cut the Sequence at the example, by recognizing target Subsequences longer or less recognition Sites leaving fragments with Sticky ends char limited than those recognized by RES, Such target Subse acteristic of the particular RE. To each Sticky end, Special quences are termed herein effective target Subsequences or primers are ligated which are distinctively labeled with effective Subsequences. This added information can often fluorochromes identifying the particular RE making the cut, discriminate two Sample Sequences producing fragments and thus the particular target Subsequence. A DNA poly having identical original end Subsequences and lengths. The merase is used to form blunt-ended DNA fragments. The 25 effective Subsequences are used in the computer imple labeled fragments are then PCR amplified using the same mented database lookup methods of this invention in a Special primers a number of times, preferably just Sufficient manner Similar to the use of target Subsequences. In one to detect Signals from all Sequences of interest while making alternative, termed herein a SEQ-QEATM method, the target relatively Small signals from the linearly amplifying Singly Subsequences recognized are effectively lengthened by cut fragments. The amplified fragments are then Separated using an amplification primer with an internal Type IIS RE by length using electrophoresis, and the length and labeling recognition site so positioned that the Type IIS RE cuts the of the fragments is optically detected. In order to improve amplified fragments in a manner producing a Second over the quality of the QEATM method signals, it is preferable to hang contiguous with the recognition site of the initial RE. conjugate a capture moiety with one or more of the primers The Sequence of the Second overhang concatenated with the and then to Separate unwanted reaction products by a 35 initial target end Subsequence produces an effective Subse method comprising contacting the reaction products with a quence that comprises, and is longer than, the target Subse binding partner of the capture moiety, Washing away quence recognized by the RE. Alternatively, an effectively unbound products, and then Separating by length those longer target Subsequence can be recognized by using phas Single Strands which are denatured from the bound products. ing primers during PCR amplification. The PCR amplifica See Sec. 6.1.12.2.1 (“QEATM Method Preferred For Use In 40 tion Step can be divided into Several pools with each pool A SEQ-QEATM Method”). Optionally, single stranded frag using one phasing amplification primer constructed So as to ments can be removed by a binding hydroxyapatite, or other recognize one or more additional nucleotides beyond the Single Strand Specific, column or by digestion by a single original RE recognition site. These additional nucleotides strand specific nuclease. Also, the QEATM method is adapt then contribute to an effective Subsequence that comprises able to other functionally equivalent amplification and 45 the target Subsequence recognized by the RE. length separation means. In this manner, the identity of the In a SEQ-OEATM method embodiment, an additional 4-8 RES cutting a fragment, and thereby the Subsequences bp Subsequence is recognized at the end of a fragment by present, as well as the length between the cuts is determined. digestion of a primer by a type IIS RE. This resulting In an exemplary QEATM method utilizing a removal overhang is precisely contiguous with the RE cut end and is means, which has improved quantitative characteristics and 50 Sequenced in a Standard manner, as by conventional Sanger is also adapted to highly Sensitive detection Systems, cDNA reactions. The additional Subsequence information is com is amplified using at least one internally biotinylated primer. bined with the RE recognition Subsequence to generate an The cDNA is then cyclized, cut with a pair of REs, and effective longer end target Subsequence that is used as the Specifically labeled primers are ligated to the cut ends, as effective Subsequence. discussed in S 5.4.3.2 (entitled “Second Alternative RE 55 The Signals generated from the recognition reactions of a Embodiment”). The singly cut ends attached to the biotiny QEATM method experiment are analyzed by computer meth lated Synthesis primers are removed with Streptavidin or ods. The analysis methods simulate a QEATM method avidin beads leaving highly pure labeled double cut cDNA experiment using a database either of Substantially all fragments without any Singly cut and labeled background known DNA sequences or of Substantially all, or at least a fragments. With a Sufficiently Sensitive optical detection 60 majority of, the DNA sequences likely to be present in a System, these pure doubly cut and labeled fragments can be Sample to be analyzed and a description of the reactions to Separated by length (e.g., by electrophoresis or column be performed. The Simulation results in a digest database chromatography) and directly detected without amplifica which contains for all possible signals that can be generated tion. If amplification is needed, absence of the DNA singly the Sample Sequences responsible. Thereby, finding the cut fragment background improves Signal to noise ratio 65 Sequences that can generate a Signal involves a look-up in permitting fewer amplification StepS and, thereby, decreased the Simulated digest database. Computer implemented PCR amplification bias. design methods optimize the choice of target Subsequences 6,057,101 65 66 in the OEATMTM method reactions in order to maximize the usual B-units, the Strands run parallel to each other (Gautier information produced in an experiment. For the tissue mode, et al., 1987, Nucl. Acids Res. 15:6625–6641). the methods maximize the number of Sequences having Oligomers for use in a QEATM method can be synthesized unique Signals by which their quantitative presence can be by Standard methods known in the art, e.g., by use of an unambiguously determined. For the query mode, the meth automated DNA synthesizer (Such as are commercially ods maximize only the number of Sequences of interest available from BioSearch, Applied BioSystems, etc.). AS having unique Signals, ignoring other Sequences that might examples, phosphorothioate oligonucleotides may be Syn be present in a Sample. thesized by the method of Stein et al. (1988, Nucl. Acids Res. In QEATM method embodiments wherein high stringency 16:3209), methylphosphonate oligonucleotides can be pre hybridization is specified, Such conditions generally com pared by use of controlled pore glass polymer Supports prise a low Salt concentration, equivalent to a concentration (Sarin et al., 1988, Proc. Natl. Acad. Sci. USA of SSC (173.5 g. NaCl, 88.2g. Na Citrate, HO to 1 1. ) of 85:7448-7451), etc. less than approximately 1 mM, and a temperature near or In specific QEATM method embodiments, it is preferable above the T of the hybridizing DNA. In contrast, condi to use oligomers that can Specifically hybridize to Subse tions of low Stringency generally comprise a high Salt 15 quences of a DNA sequence too short to achieve reliably concentration, equivalent to a concentration of SSC of Specific recognition, Such that a set of target Subsequences is greater than approximately 150 mM, and a temperature recognized. Further where PCR is used, since Taq poly below the T of the hybridizing DNA. merase tolerates hybridization mismatches, PCR specificity In QEATM method embodiments wherein DNA oligomers is generally less than hybridization Specificity. Where Such are specified for performing functions, including hybridiza oligomers recognizing Short Subsequences are preferable, tion and chain elongation priming, alternatively oligomers they can be constructed in manners including, but not can be used that comprise those of the following nucleotide limited to, the following. To achieve reliable hybridization to mimics which perform similar functions. Nucleotide Shorter DNA Subsequences, degenerate Sets of DNA oligo mimics are subunits (other than classical nucleotides) which mers can be used which are constructed of a total length can be polymerized to form molecules capable of Specific, 25 Sufficient to achieve Specific hybridization with each mem Watson-Crick-like base pairing with DNA. The oligomers ber of the Set containing a shorter Sequence complementary can be DNA or RNA or chimeric mixtures or derivatives or to the common Subsequence to be recognized. Alternatively, modified versions thereof. The oligomers can be modified at a longer DNA oligomer can be constructed with a shorter the base moiety, Sugar moiety, or phosphate backbone. The Sequence complementary to the Subsequence to be recog oligomers may include other appending groupS. Such as nized and with additional universal nucleotides or nucleotide peptides, hybridization-triggered cleavage agents (see, e.g., mimics, which are capable of hybridizing to any naturally Krolet al., 1988, BioTechniques 6:958-976), or intercalating occurring nucleotide. Nucleotide mimics are Sub-units agents (see, e.g., Zon, 1988, Pharm. Res. 5:539-549). The which can be polymerized to form molecules capable of oligomers may be conjugated to another molecule, e.g., a specific, Watson-Crick-like base pairing with DNA. peptide, hybridization triggered croSS-linking agent, trans 35 Alternatively, the oligomers may be constructed from DNA port agent, hybridization-triggered cleavage agent, etc. mimics which have improved hybridization energetics com The oligomers may also comprise at least one nucleotide pared to naturally occurring nucleotides. mimic that is a modified base moiety which is selected from A preferred mimic is a peptido-nucleic acid (“PNA') the group including, but not limited to, 5-fluorouracil, based on a linked N-(2-aminoethyl)glycine backbone to 5-bromouracil, 5-chlorouracil, 5-iodouracil, ypoxanthine, 40 which normal DNA bases have been attached (Egholm et al., Xantine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl) 1993, Nature 365:566–67). This PNA obeys specific uracil, 5-carboxymethylaminomethyl-2-thiouridine, Watson-Crick base pairing, but with greater free energy of 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D- binding and correspondingly higher melting temperatures. galactosylqueoSine, inosine, N6-isopentenyladenine, Suitable oligomers may be constructed entirely from PNAS 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 45 or from mixed PNA and DNA oligomers. 2-methyladenine, 2-methylguanine, 3-methylcytosine, In QEATM method embodiments wherein DNA fragments 5-methylcytosine, N6-ade nine, 7-methylguanine, are separated by length, any length Separation means known 5-methylaminomethyluracil, 5-methoxyaminomethyl-2- in the art can be used. One alternative Separation means thiour a cil, be ta-D-man no Syl qu e o Sine, employs a Sieving medium for Separation by fragment length 5'-methoxycarboxymethyluracil, 5-methoxyuracil, 50 coupled with a force for propelling the DNA fragments 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic though the Sieving medium. The Sieving medium can be a acid (v), Wybutoxosine, pseudou racil, queosine, polymer or gel, Such a polyacrylamide or agarose in Suitable 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, concentrations to separate 10-1000 bp DNA fragments. In 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid this case the propelling force is a voltage applied acroSS the methylester, 3-(3-amino-3-N-2-carboxypropyl) uracil, and 55 medium. The gel can be disposed in electrophoretic con 2,6-diaminopurine. The oligomers may comprise at least one figurations comprising thick or thin plates or capillaries. The modified Sugar moiety Selected from the group including, gel can be non-denaturing or denaturing. Alternately, the but not limited to, arabinose, 2-fluoroarabinose, Xylulose, Sieving medium can be Such as used for chromatographic and hexose. The oligomerS may comprise at least one Separation, in which case a pressure is the propelling force. modified phosphate backbone Selected from the group con 60 Standard or high performance liquid chromatographic Sisting of a phosphorothioate, a phosphorodithioate, a (“HPLC) length separation means may be used. An alter phosphoramid othioate, a phosphoramidate, a native Separation means employs molecular characteristics phosphordiamidate, a methylphosphonate, an alkyl Such as charge, mass, or charge to mass ratio. Mass Spec phosphotriester, and a formacetal or analog thereof. trographic means capable of Separating 10-1000 bp frag The oligomer may be an O-anomeric oligomer. An 65 ments may be used. C.-anomeric oligomer forms Specific double-Stranded DNA fragment lengths determined by Such a separation hybrids with complementary RNA in which, contrary to the means represent the physical length in base pairs between 6,057,101 67 68 target Subsequences, after adjustment for biases or errors 5.4.2. DETAILS OF A QUANTITATIVE EXPRESSION introduced by the Separation means and length changes due ANALYSIS METHOD to experimental variables (e.g., presence of a detectable This embodiment of a QEATM method preferably gener label, ligation to an adopter molecule). A represented length ates one or more Signals unique to each cDNA sequence in is the same as the physical length between occurrences of a Sample containing a mixture of cDNAS, and to quantita target Subsequences in a Sequence from Said database when tively relate the Strength of Such a signal or Signals to the both Said lengths are equal after applying corrections for relative amount of that cDNA sequence in the Sample. LeSS biases and errors in Said Separation means and corrections based on experimental variables. For example, represented preferably, the Signals uniquely determine only Sets of a lengths determined by electrophoresis can be adjusted for Small number of Sequences, typically 2-10 Sequences. mobility biases due to average base composition or mobility QEATM method signals comprise an indication of the pres changes due to an attached labeling moiety and/or adapter ence of pairs of target Subsequences and the length between Strand by conventional Software programs, Such as Gene pairs of adjacent Subsequences in a DNA sample. Signals are Scan Software from Applied Biosystems, Inc. (Foster City, generated in a manner permitting Straightforward automa Calif.). tion with existing laboratory robots. For simplicity of In QEATM method embodiments wherein DNA fragments 15 disclosure, and not by way of limitation, the detailed are labeled and detected, any compatible labeling and detec description of this method is directed to the analysis of tion means known in the art can be used. Advances in Samples comprising a plurality of cDNA sequences. It is fluorochromes, in optics, and in optical Sensing now permit equally applicable to Samples comprising a single Sequence multiply labeled DNA fragments to be distinguished even if or Samples comprising Sequences of other types of DNA or they completely overlap in Space, as in a spot on a filter or nucleic acids generally. a band in a gel. Results of Several recognition reactions or While described in terms of cDNA hereinbelow, it will be hybridizations can be multiplexed in the same gel lane or understood that the DNA sample can be any DNA, e.g., filter spot. Fluorochromes are available for DNA labeling cDNA and/or genomic DNA, and preferably comprises a which permit distinguishing 6-8 Separate products Simulta mixture of DNA sequences. In Specific embodiments, the neously (Ju et al., 1995, Proc. Natl. Acad Sci. USA 25 DNA sample is an aliquot of cDNA of total cellular RNA or 92:4347–4351). total cellular mRNA, most preferably derived from human Exemplary fluorochromes adaptable to a QEATM method tissue. The human tissue can be diseased or normal. and methods of using such fluorochromes to label DNA are described in S 6.1.12.4 (entitled “Fluorescent Labels For The cDNA, or the mRNA from which it is synthesized, QEATM methods”). should be present at some threshold level in order to Single molecule detection by fluorescence (Eigen et al., generate Signals, this level being determined to Some degree 1994, Proc. Natl. AcadSci, USA91:5740–5747) can also be by the conditions of a particular QEATM method experiment. adapted for use. For example, Such a threshold is that preferably at least In QEATM method embodiments wherein intercalating 1000, and more preferably at least 10,000, mRNA molecules DNA dyes are utilized to detect DNA, any suitable dye of the Sequence to be detected be present in a Sample. In the known in the art can be used. In particular Such dyes include, 35 case where one or only a few mRNAs of a type of interest but are not limited to, ethidium bromide, propidium iodide, are present in each cell of a tissue from which it is desired Hoechst 33258, Hoechst 33342, acridine orange, and to derive the sample mRNA, at least a corresponding num ethidium bromide homodimers. Such dyes also include ber of such cells should be present in the initial tissue POPO, BOBO, YOYO, and TOTO from Molecular Probes Sample. In a Specific embodiment, the mRNA detected is (Eugene, Oreg.). 40 present in a ratio to total sample RNA of 1:10 to 1:10. With Alternative Sensitive detection means available include a lower ratio more Molecular amplification can be per Silver staining of polyacrylamide gels (Bassam et al., 1991, formed during a QEATM method experiment. Analytic Biochemistry 196:80-83), and the use of interca The cDNA sequences occurring in a tissue derived pool lating dyes. In this embodiment, the gel can be photographed include short untranslated Sequences and translated protein and the photograph Scanned by Scanner devices conven 45 coding Sequences, which, in turn, may be a complete protein tional in the computer art to produce a computer record of coding Sequence or Some initial portion of a coding the Separated and detected fragments. A further alternative is Sequence, Such as an expressed Sequence tag. A coding to blot an electrophoretic separating gel onto a filter (e.g., Sequence may represent an as yet unknown Sequence or gene nitrocellulose) and then to apply any visualization means or an already known Sequence or gene entered into a DNA known in the art to visualize adherent DNA. See, e.g., 50 Sequence database. Exemplary Sequence databases include Kricka et al., 1995, Molecular Probing, Blotting, and those made available by the National Center for Biotech Sequencing, Academic PreSS, New York. In particular, visu nology Information (“NCBI”) (Bethesda, Md.) (GenBank) alization means utilizing Secondary reactions with one or and by the European Bioinformatics Institute (“EMBL) more reagents or enzymes can be used. (Hinxton Hall, UK). A preferred Separation and detection apparatus for use in 55 A QEATM method is also applicable to samples of a QEATM method is found in copending U.S. patent appli genomic DNA in a manner Similar to its application to cation Ser. No. 08/438,231 filed May 9, 1995, which is cDNA. In g|DNA samples, information of interest includes herein incorporated by reference in its entirety. Other detec occurrence and identity of translocations, gene tion means adaptable to a QEATM method include the amplifications, loSS of heterozygosity for an allele, etc. This commercial electrophoresis machines from Applied BioSys 60 information is of interest in cancer diagnosis and Staging. In tems Inc. (Foster City, Calif.), Pharmacia (ALF), Hitachi, cancer patients, amplified Sequences might reflect an Licor. The Applied BioSystems machine is preferred among oncogene, while loSS of heterozygosity might reflect a tumor these as it is the only machine capable of Simultaneous 4 dye Suppressor gene. Such Sequences of interest can be used to resolution. Select target Subsequences and to predict Signals generated In the following SubSections and the accompanying 65 by a QEATM method experiment. Even without prior knowl examples sections a QEATM method embodiment is edge of the Sequences of interest, detection and classification described in detail. of the QEATM method signal patterns is useful for the 6,057,101 69 70 comparison of normal and diseased States or for observing The redundancy and resolution criteria are probabilisti the progression of a disease State. cally expressed in Eqns. 1 and 2 in an approximation Classification of QEATM method signal patterns, in an adequate to guide Subsequence choice. In these equations exemplary embodiment, can involve Statistical analysis to the number of genes in the cDNA sequence mixture is N, the determine significant differences between patterns of inter average gene length is L., the number of target Subsequence est. This can involve first grouping Samples that are similar pairs is M (the number of pairs of recognition means), and in one of more characteristics, Such characteristics the probability of each target Subsequence hitting a typical including, for example, epidemiological history, histopatho gene is p. Since each target Subsequences is preferably logical State, treatment history, etc. Signal patterns from Selected to independently hit each pooled Sequence, the Similar Samples are then compared, e.g., by finding the probability of an arbitrary Subsequence pair hitting is then average and Standard deviation of each individual signals. pf. Eqn. 1 expresses the redundancy condition of three hits Individual Signal which are of limited variability, e.g., for per gene, assuming the probabilities of target Subsequence which the Standard deviation is less than the average, then hits are independent. represent genetic constants of Samples of this particular characteristic. Such limited variability Signals from one Set of tissue Samples can then be compared to limited variability 15 Eqn 2 expresses the resolution condition of having frag Signals from another Set of tissue Samples. Signals which ments with lengths no closer on average than 3 base pairs. Significantly differ in this comparison then represent signifi This equation approximates the actual fragment length dis cant differences in the genetic expression between the tissue tribution with a uniform distribution. Samples and are of interest in reflecting the biological differences between the Samples, Such as the differences caused by the progression of a disease. For example, a L (2) Significant difference in expression is detected with the difference in the genetic expression between two tissues exceed the Sum of the Standard deviation of the expressions Given expected values of N, the number of Sequences in the in the tissues. Other Standard Statistical comparisons can 25 library or pool to analyze (library complexity), and L, the also be used to establish level of expression and the Signifi average expressed Sequence (or gene) length, Eqns 1 and 2 cance of differences in levels of expressions. Target Subsequence choice is important in the practice of are solved for the subsequence hit probability and number of a QEATM method. The two primary considerations for Subsequences required. This Solution depends on the par Selecting Subsequences are, first, redundancy, that is, that ticular redundancy and resolution criteria dictated by the there be enough Subsequence pair hits per gene that a unique particular experimental method chosen to implement the Signal is likely to be generated for each Sample Sequence, QEATM method. Alternative values may be required for and Second, resolution, that is, that there not be So many other implementations of a QEATM method. primer pairs hitting with very similar lengths in a sample that For example, it is estimated that the entire human genome the Signals cannot be discriminated. For Sufficient contains approximately 10 protein coding sequences with redundancy, it is preferable that there be on average, 35 an average length of 2000. The solution of Eqns. 1 and 2 for approximately three pair hits per gene or DNA sequence in these parameters is p=0.082 and M=450. Thereby the gene the Sample. It is highly preferable that there be at least one expression of all genes in all human tissues can be analyzed pair hit per each gene In a test of a database of eukaryotic with a tissue mode QEATM method using 450 target Subse expressed Sequences, it has been found that an average value quence pairs, each Subsequence having an independent of three hits per gene appears to be generally a Sufficient 40 probability of occurrence of 8.2%. In an embodiment in guarantee of this minimum criterion. which eight fluorescently labeled Subsequence pairs can be Sufficient resolution depends on the Separation and detec optically distinguished and detected per electrophoresis tion means chosen. For a particular choice of Separation and lane, Such as is possible when using the Separation and detection means, a recognition reaction preferably should detection apparatus described in copending U.S. patent not generate more fragments than can be separated and 45 application Ser. No. 08/438,231 filed May 9, 1995, 450 distinguishably detected. In a preferred embodiment, gel reactions can be analyzed in only 57 lanes. Thereby only one electrophoresis is the Separation means used to Separate electrophoresis plate is needed in order to completely deter DNA fragments by length. Existing electrophoretic tech mine all human genome expression levels. Since the best niques allow an effective resolution of three base pair (“bp') commercial machines known to the applicants can discrimi length differences in Sequences of up to 1000 bp length. 50 nate only four fluorescent labels in one lane, a corresponding Given knowledge of fragment base composition, effective increase in the number of lanes is required to perform a resolution down to 1 bp is possible by predicting and complete genome analysis with Such machines. correcting for the small differences in mobility due to AS a further example, it is estimated that a typically differing base composition. However and without limitation, complex human tissue expresses approximately 15,000 an easily achievable three bp resolution is assumed by way 55 genes. The solution for N=15000 and L=2000 is p=0.21 and of example in the description of the QEATM method. It is M=68. Thus expression in a typical tissue can be analyzed preferable for increased detection efficiency that the distin with a tissue mode QEATM method using 68 target Subse guishably labeled products from as many recognition reac quence pairs, each Subsequence having an independent tions as possible be combined for Separation in one gel lane. probability of occurrence of 21%. ASSuming 4 Subsequence This combination is limited by the number of labels distin 60 pairs can be run per gel electrophoresis lane, the 68 reactions guishable by the employed detection means. Any alternative can be analyzed in 17 lanes in order to determine the gene means for Separation and detection of DNA fragments by expression frequencies in any human tissue. Thus it is clear length, preferably with resolution of three bp or better, can that this method leads to greatly simplified quantitative gene be employed. For example, Such Separation means can be expression analysis within the capabilities of existing elec thick or thin plate or column electrophoresis, column chro 65 trophoretic Systems. matography or HPLC, or physical means Such as mass These equations provide an adequate guide to picking SpectroScopy. Subsequence pairs. Typically, preferred probabilities of tar 6,057,101 71 72 get Subsequence occurrence are from approximately 0.01 to terion of Eqns. 1 and 2 as closely as possible. If multiple RES 0.30. Probabilities of occurrence of Subsequences and RE Satisfy the Selection criteria, a Subset is Selected by including recognition sites can be determined from databases of DNA only those RES with independently occurring recognition Sample Sequences. Appropriate target Subsequences can be Sequences, determined, for example, by using conditional selected from these tables. Computer implemented, QEATM probabilities. checking for independence can be done, by, for method experimental design methods can then optimize this example, checking that the conditional probability for a hit initial Selection. by any Selected pair of Subsequences is the product of the Another use of a QEATM method is to compare directly probabilities of the individual subsequence hit probabilities. the expression of only a few genes, typically 1 to 10, An initial choice can be optionally optimized by the com between two different tissues, the query mode, instead of puter implemented experimental design methods. Seeking to determine the expression of all genes in a tissue, A number, R., of REs are preferably selected so that the the tissue mode. In this query mode, a few target Subse number of RE pairs is approximately M, where the relation quences are Selected to identify the genes of interest both between M and R is given by Eqn. 3. among themselves and from all other Sequences possibly (3) present. The computer design methods described hereinbe 15 low can make this Selection. If 4. Subsequence pairs are sufficient for identification, then the fragments from the 4 recognition reactions performed on each tissue are prefer For example, a set a set of 20 acceptable RES results in 210 ably Separated and detected on two Separate lanes in the Subsequence pairs. Same gel. If 2. Subsequence pairs are Sufficient for There are numerous REs currently available whose rec identification, the two tissues are preferably analyzed in the ognition Sequences have a wide range of occurrence Same gel lane. Such comparison of Signals from the same gel probabilities, from which REs can be selected for the improves quantitative results by eliminating measurement QEATM method. Asample of these are presented in Example variability due to differences between Separate electro 6.1.12.3 (entitled “Preferred QEATM Method Adapters And phoretic runs. For example, expression of a few target genes 25 RE Pairs”). in diseased and normal tissue samples can be rapidly and Restriction endonucleases (“RE”) generally bind with reliably analyzed. Specificity only to their short four to eight bp recognition A query mode of a QEATM method is also useful even if sites, cleaving the DNA preferably with 4bp complementary the Sequences of the particular genes of interest are not yet Sequences. It is preferable that RES used in this embodiment known. For example, fluorescent traces produced by Sub produce overhangs characteristic of the particular RE. Thus jecting Separate Samples to gel electrophoretic Separation RES, Such as those known as class IIS restriction enzymes, means and then fluorescent detection means are compared to which produce overhangs of unknown Sequence are leSS identify feature differences. Such differentially expressed preferable. Class IIS RES are adaptable to generate short features created in a particular recognition reaction are then Subsequences which may be sequenced to increase QEAM retrieved from the gel by methods known in the art (e.g. 35 method resolution by extending initial target Subsequences electro-elution from the gel) and their contained DNA into longer effective target Subsequences. This alternative fragments are analyzed by conventional techniques, Such as embodiment is known as the SEQ-QEATM method (see by Sequencing. If partial, Such Sequences can then be used as Section 5.4.4). Phasing primers can also be used to recog probes (e.g., in PCR or Southern blot hybridization) to nize longer effective target Subsequences. Further, ligases, recover full-length sequences. In this manner, QEATM 40 which are used in a QEATM method to ligate an adapter method techniques can guide the discovery of new differ Strand to a digested terminus, are highly Specific in their entially expressed cDNA or of changes of the state of hybridization requirements, even one bp mismatch near the gDNA. The Sequences of the newly identified genes, once ligation site will prevent ligation (U.S. Pat. No. 5,366,877, determined, can then be used to guide QEATM method target Nov. 22, 1994, to Keith et al., U.S. Pat. No. 5,093,245, Mar. Subsequence choice for further analysis of the differential 45 3, 1992, to Keith et al.). expression of the new genes. QEATM method experiments are also adaptable to distin Two specific embodiments of a QEATM method are guish Sequences into Small Sets, typically comprising 2 to 10 described herein. The specific embodiments described Sequences, which require fewer target Subsequence pairs. herein use RES to recognize and cleave target Subsequences Such coarser grain analysis of gene expression or genomic in the sample DNA. In one implementation, the desired 50 composition requires fewer recognition reactions and analy doubly cut fragments are amplified by an amplification sis time. Alternatively, Smaller numbers of target Subse means in order to dilute remaining, unwanted Singly cut quence pairs can be optimally chosen to distinguish indi fragments. Alternatively, the Singly cut fragments are vidually a specific Set of genes of interest from all the other removed by physical means (e.g., hydroxyapatite column genes in the Sample. These target Subsequences can be Separation) or enzymatic means (e.g., single Strand specific 55 chosen either from REs that produce fragments from the nucleases). In another implementation, the unwanted Singly desired genes. cut ends are removed by a removal means from the desired Detailed descriptions of exemplary implementations for doubly cut fragments without an amplification Step, as practicing QEAM method recognition reactions and related described in S 5.4.3.2 (entitled “Second Alternative RE computer implemented experimental analysis and design Embodiment”). For these implementations, RE recognition 60 methods are presented in the following SubSections followed Sites define the possible target Subsequences and are Selected by detailed experimental protocols in the Examples SubSec in a manner Similar to the above in order to meet the tions. The implementations are illustrative and not limiting, previous probability or occurrence and independence crite as a QEATM method can be practiced by any method ria. The probabilities of occurrence of various RE recogni generating the previously described QEAM method signals. tion sites are determined from a database of potential Sample 65 5.43. RE EMBODIMENTS OF A OEATM METHOD Sequences, and those RES are chosen with recognition Restriction endonuclease (“RE”) embodiments of a Sequences whose probabilities of occurrence meet the cri QEATM method use implementations of simultaneous RE 6,057,101 73 74 and ligase enzymatic reactions for generating labeled frag Press, New York. Of these alternative means, those employ ments of the genes or Sequences to be analyzed. These ing the T7 RNA polymerase are preferred. fragments are then Separated by length by a separation Two other Specific embodiments use a physical removal means and detected by a detection means to yield QEATM means to directly remove Singly cut fragments, preferably method signals comprising the identity of the RES cutting before amplification. This can be accomplished, e.g., by each fragment together with each fragment's length. The labeling DNA termini with a capture moiety prior to diges recognition reactions can Specifically and reproducibly gen tion. After digestion, the Singly cut fragments are removed erate QEAM method signals with good Signal to noise ratioS by contacting the Sample with a binding partner of the and without any intermediate extractions or buffer capture moiety, affixed to a Solid phase. The preferred eXchanges, which would hinder automatic execution. removal means is biotin-Streptavidin. Other removal means REs bind with specificity to short DNA target adaptable to this embodiment of a QEATM method include Subsequences, usually 4 to 8 bp long, that are termed various haptens, which are removed by their corresponding recognition sites and are characteristic of each RE. RES that antibodies. Exemplary haptens include digoxigenin, DNP, are used cut the Sequence at (or near) these recognition sites and fluorescein (Holtke et al., 1992, Sensitive chemilumi preferably producing characteristic (“sticky”) ends with 15 neScent detection of digoxigenin labeled nucleic acids: a fast Single-Stranded overhangs, which usually incorporate part of and Simple protocol for applications, Biotechniques the recognition site. 12(1):104-113 and Olesen et al., 1993, Chemiluminescent Preferred RES have a 6 bp recognition site and generate a DNA sequencing with multiple labeling, Biotechniques 4 bp 5' overhang. If more fragments are desired for the 15(3):480–485). Alternatively, single stranded fragments analysis of a particular Sample, RES with Shorter recognition can be removed by Single Stand Specific column Separation Sites can be used, for example with 4 or 5bp recognition or Single Strand Specific nucleases. sites. The RE embodiments are also adaptable to a 2 bp 5 RE embodiments of a QEATM method use recognition overhang, which is less preferred since 2 bp overhangs have moieties which are Specifically ligated to RE cut Sticky ends a lower ligase Substrate activity than 4bp overhangs. All RE So that in any one recognition reaction ends cut by a embodiments can be adapted to 3' overhangs of two and four 25 particular RE receive a unique moiety. Recognition moieties bp. Further preferred REs have the following additional comprise oligomers capable of Specifically hybridizing to properties. Their recognition sites and overhang Sequences the RE generated Sticky ends. In the preferred RE are preferably Such that an adapter can be designed whose embodiment, which uses PCR amplification, the recognition ligation does not recreate the recognition site. They prefer moieties also provide primer means for the PCR. ably have sufficient activity below 37 C. and are heat The recognition moieties also provide for labeling and inactivated at 65 C. Heat inactivation is preferable so that recognition of RE cut ends. For example, using a pair of RES RE inactivation can be performed prior to adding PCR in one recognition reaction generates doubly cut fragments reagents and conducting the PCR reaction in the same vial. Some with the recognition Sequence of the first RE on both They preferably have low non-Specific cutting and nuclease ends, Some with the recognition Sequence of the Second RE activities and cut to completion. Of course, RES Selected for 35 on both ends, and the remainder with one recognition a particular experiment preferably have recognition Sites Sequence of each RE on either end. Using more RES meeting the previously described occurrence and indepen generates doubly cut fragments with all pairwise combina dence criteria. Preferred pair of RES for analyzing human tions of RE cut ends from adjacent RE recognition Sites and mouse cDNA are listed in S 6.1.12.3 (entitled “Preferred along the Sample Sequences. All these cutting combinations QEATM Method Adapters and RE Pairs”). 40 need preferably to be distinguished since each provides Only doubly cut Sequence fragments are of interest, and unique information on the presence of different Subse thus in all RE OEATM method embodiments the desired quences pairs present in the original DNA sequence. Thus doubly cut fragments are distinguished from the unwanted the recognition moieties preferably have unique labels Singly cut fragments. Singly cut fragments have a non which label Specifically each RE cut made in a reaction. AS Specific and non-reproducible length distribution derived 45 many RES can be used in a single reaction as labeled from the distribution of overall cDNA lengths, which recognition moieties are available to uniquely label each RE depends strongly on cDNA synthesis conditions. Only the cut. If the detectable labeling in a particular System is, for doubly cut fragments have a specific and reproducible length example, by fluorochromes, then fragments cut with one RE distribution dependent only on the DNA sequence analyzed have a single fluorescent Signal from the one fluorochrome and independent of cDNA synthesis conditions. To make 50 associated with that RE, while fragments cut with two RES this distinction, the preferred RE embodiment of a QEATM have mixed signals, one from the fluorochrome associated method exponentially amplifies doubly cut fragments, So with each RE. Thus all possible pairs of fluorochrome labels that their Signals quickly overwhelm Signals from Singly cut are preferably distinguishable. Alternatively, if certain target fragments, which are at most linearly amplified. PCR is the Subsequence information is not needed, the recognition preferred amplification means. 55 moieties need not be distinctively labeled. In embodiments Alternative amplification means known in the art are using PCR amplification, corresponding primers would not adaptable to a QEATM method. If a removal means for singly be labeled. cut ends is not utilized in an embodiment, alternative ampli If Silver Staining is used to recognize fragments separated fication means should preferentially amplify doubly cut ends on an electrophoresis gel, no recognition moiety need be over Singly cut ends in order that signals from Singly cut 60 labeled, as fragments cut by the various RE combinations ends be relatively Suppressed. On the other hand, if a are not distinguishable. In this case, when PCR amplification removal means for Singly cut ends is utilized in an is used, only primers are required. embodiment, then alternative amplification means need have The recognition reaction conditions are preferably no amplification preference, as no Singly cut ends are present selected, as described in S 6.1.12.1 (entitled “QEATM Pre at the amplification Step. Known alternative amplification 65 ferred RE Method”), so that RE cutting and recognition means are listed in Kricka et al., 1995, Molecular Probing, moiety ligation go to full completion: all recognition sites of Blotting, and Sequencing, chap. 1 and table IX, Academic all RES in the reaction are cut and ligated to a recognition 6,057,101 75 76 moiety. It is more preferable, in general, to perform the recognition reaction Steps. In the preferred embodiment, the recognition reactions according to Sec. 6.1.12.2.1 (“QEATM adapters are partially double stranded DNA (“dsDNA”). Method Preferred For Use In ASEQ-QEATM Method”). This Alternatively, the adapters can be constructed as oligomers more preferred protocol describes performing the RE/ligase of any nucleic acid, with corresponding properties to the and PCR reactions in a single reaction vessel, with at least preferred DNA polymers. In an embodiment employing an one primer having a conjugated capture moiety, followed by alternative amplification means, any polymer that can Serve cleanup of certain reaction products. In this manner, the with a template as a primer for that amplification means can fragments generated from a sequence analyzed lie only be used in that embodiment. between adjacent recognition Sites of any RE in that reac FIG. 10A illustrates the DNA molecules involved in the tion. No fragments remain which include any RE recogni ligation reaction as conventionally indicated with the 5' ends tion site, Since Such a site is cut. Multiple RES can be used of the top strands and the 3' ends of the bottom strands at left. in one recognition reaction. Too many RES in one reaction dsDNA 201 is a fragment of a sample cDNA sequence with may cut the Sequences too frequently, generating a com an RE cut at the left end generating, preferably, a four bp 5' pressed length distribution with many short fragments of overhang 202. Adapter dsDNA 209 is a synthetic substrate lengths between 10 and a few hundred base pairs long. Such 15 provided by a QEATM method. a distribution may not be resolvable by the separation The precise characteristics of adapter 209 are Selected in means, for example gel electrophoresis, if the fragments are order to ensure that RE digestion and adapter ligation too close in length, for example less than 3 bp apart on the preferably go to completion, that generation of unwanted average. Too many RES also may generate fragments of the products and amplification biases are minimized, and that Same length and end Subsequences from different Sample unique labels are attached to cut ends (if needed). Adapter Sequences, thereby leading to non-unique signals. Finally, 209 comprises Strand 203, called a primer, and a partially where fragment labels are to be distinguished, no more RES complementary Strand 205, called a linker. The primer is can be used than can have distinguishably labeled Sticky also known as the longer Strand of the adapter, and the linker ends. These considerations limit the number of REs opti is also known as the shorter Strand of the adapter. mally useable in one recognition reaction. Preferably two 25 The linker, or shorter strand, links the end of a cDNA cut REs are used, with one, three and four REs less preferable. by an RE to the primer, or longer strand, by hybridization to Preferable pairs of REs for the analysis of human cl)NA the Sticky overhang of the cut end and to the primer in order samples are listed in S 6.1.12.3 (entitled “Preferred QEATM that the primer can be ligated to dsDNA 201. Therefore, Method Adapters and RE Pairs”). linker 205 comprises Sequence 206 complementary to the An additional level of Signal Specificity is possible by Sticky RE overhang 202 and Sequence 207 complementary Selecting or Suppressing fragments having a third internal to the 3' end of primer 203. Sequence 206 is preferably of the target Subsequence. Additional information on the presence Same length as the RE overhang. Sequence 207 is most or absence of Specific internal Subsequences can be used preferably eight nucleotides long, less preferably from 4 to along with the two end Subsequences and the length infor 12 nucleotides long, but can be of any length as long as the mation to further distinguish between otherwise identically 35 linker reliably hybridizes with only one top primer in any classified fragments. one recognition reaction and has an appropriate T. Other methods of providing third Subsequence informa (preferably less than approximately 68 C.). Linker 205 also tion are described below which label or SuppreSS fragments preferably has no 5' terminal phosphate so that it will not with third Subsequences. To Select fragments with a third ligate to the bottom strand of dsDNA 201. Lack of terminal internal Subsequence, probes with distinguishable labels 40 phosphate also prevents the annealed adapters from ligating which bind to this target Subsequence are added to the to each other, forming dimers, and thereby competing with fragments prior to detection, and alternatively prior to Sepa adapter ligation to RE cut Sample fragments. Adapter dimers ration and detection. On detection, fragments with this third would also be amplified in a Subsequent amplification Step Subsequence present will generate a signal, preferably generating unwanted fragments. Terminal phosphates can be fluorescent, from the probe. Such a probe can be a labeled 45 removed using phosphatases (e.g., alkaline phosphatase) PNA or DNA oligomer. Short DNA oligomers may need to known in the art, followed by Separation of the enzyme. be extended with a universal nucleotide or degenerate Sets of Further, the linker, or shorter strand, T should preferably natural nucleotides in order to provide for Specific hybrid be leSS than primer 203 Self-annealing T, This ensures that ization. Subsequent PCR amplification conditions can be controlled Fragments with a third Subsequence can be Suppressed in 50 So that linkers present in the reaction mixture will not various manners in embodiments using PCR amplification. hybridize and act as PCR primers, and, thereby, generate First, a probe hybridizing with this third Subsequence which Spurious fragment lengths. “The preferable T is less than prevents polymerase elongation in PCR can be added prior approximately 68 C. to amplification. Then Sequences with this Subsequence will Primer, or longer strand, 203 further has a 3' end sequence be at most linearly amplified and their signal thereby Sup 55 204 complementary to 3' end sequence 207 of bottom linker pressed. Such a probe could be a PNA or modified DNA 205. In a preferred aspect, in order that all RE cuts are oligomer (with the last nucleotide being addNTP). Second, properly ligated to a unique top primer, in any Single if the third Subsequence is recognized by an RE, this RE can reaction, each primer should be complementary to and be added to the RE-ligase reaction without any correspond hybridize with only one linker 205. Consequently, all the ing Specific primer. Fragments with the third Subsequence 60 linkers in any one reaction mixture preferably have unique will be at most linearly amplified. Sequences 207 for hybridizing with unique primers. In order Both these alternatives can be extended to multiple inter that the ligation reaction go to completion, primer 203 nal Sequences by using multiple probes to recognize the preferably should not recreate the recognition Sequence of Sequences or to disrupt exponential PCR amplification. any RE in the reaction mixture when it is ligated with cDNA Construction of the recognition moieties, also herein 65 end 202. Primer 203 has no 5' terminal phosphate in order called adapters or linker-primers, is important and is to prevent any Self-ligations. To minimize amplification of described here in advance of further details of the individual undesired Sequences, termed amplification noise, in any 6,057,101 77 78 Subsequent PCR step it is preferred that primer 203 not primer, or longer Strand, 304 and linker, or shorter Strand, hybridize with any Sequence present in the original Sample 305. Primer, or longer strand, 304 includes segment 306 mixture. The T of primer 203 is preferably high, in the complementary to and of the same length as 3' overhang 302 range from 50 to 80 C., and more preferably above 68 C. and section 307 complementary to linker 305. It also option This ensures that the Subsequent PCR amplification can be ally has label 308 which distinctively labels primer 304. As controlled So that only primers and not linkers initiate new in the case of adapters for 5' overhangs, primer 304 has no chains. For example, this T. can be achieved by use of a 5' terminal phosphate, in order to prevent Self-ligations, and primer having a combination of a G+C content preferably is Such that no recognition Site for any RE in one recognition from 40-60%, most preferably from 55-60%, and a primer reaction is created upon ligation of the primer with dsDNA length most preferably 24 nucleotides, and preferably from 301. These condition ensure that the RE digestion and 18 to 30 nucleotides. Primer 203 is optionally labeled with ligation reactions go to completion. Primer 304 should fluorochrome 208, although any DNA labeling system that preferably not hybridize with any Sequence in the initial preferably allows multiple labels to be simultaneously dis sample mixture. The T of primer 304 is preferably high, in tinguished is usable in the QEATM method. the range from 50° to 80° C. and more preferably above 68° Generally, the primer, or longer Strand, are constructed So 15 C. This ensures the Subsequent PCR amplification can be that, preferably, they are highly specific, free of dimers and controlled So that only primers and not linkers initiate new hairpins, and form stable duplexes under the conditions chains. For example, this T. can be achieved by using a Specified, in particular the desired T. primer having a G+C content preferably from 40–60%, most Software packages are available for primer construction preferably from 55-60%, and a primer length most prefer according to these principles, an example being OLIGOTM ably of 24 nucleotide and less preferably of 18-30 nicle Version 4.0 For Macintosh from National BioSciences, Inc. otides. Each primer 304 in a reaction can optionally have a (Plymouth, Minn.). In particular, a formula for T. can be distinguishable label 308, which is preferably a fluoro found in the OLIGOTM Reference Manual at Eqn. 1, page 2. chrome. FIG. 10B illustrates two exemplary adapters and their Linker, or Shorter Strand, 305 is complementary to and component primers and linkers constructed according to the 25 hybridizes with section 307 of primer 304 Such that it is above description. Adapter 250 is specific for the RE adjacent to 3' overhang 302. Linker 305 is most preferably BamHI, as it has a 3' end complementary to the 5' overhang 8 nucleotides long, leSS preferably from 4-16 nucleotides, generated by BamHI. Adapter 251 is similarly specific for and has no terminal phosphates to prevent any Self-ligation. the RE HindIII. This linker Serves only to promote ligation specificity and Example 6.1.12.3 (entitled “Preferred QEATM Method reaction Speed. It does riot perform the function of linking Adapters And RE Pairs') contains a more comprehensive, primer 304 to the cut dsDNA, as it did in the 5' case. Further, non-limiting list of adapters that can be used according to the linker 305 T should preferably be less than primer 304 QEA method. All Synthetic oligonucleotides used in the Self-annealing T. This insures that Subsequent PCR ampli QEATM method are preferably as short as possible for their fication conditions can be controlled So that linkers present functional roles in order to minimize Synthesis costs. 35 in the reaction mixture will not hybridize and act as PCR Alternatively, adapters can be constructed from hybrid primers, and, thereby, generate Spurious fragment lengths. primers which are designed to facilitate the direct Sequenc FIG. 11B illustrates an exemplary adapter with its primer ing of a fragment or the direct generation of RNA probes for and linker for the case of the RE NlaIII. As in the 5' overhang in situ hybridization with the tissue of origin of the DNA case, a 3' adapter can also be constructed from a hairpin loop Sample analyzed. Hybrid primers for direct Sequencing are 40 configuration. constructed by ligating onto the 5' end of existing primers RES generating 5' and 3' overhangs are preferably not the M13-21 primer, the M13 reverse primer, or equivalent used in the same recognition reaction. This is in order that Sequences. Fragments generated with Such hybrid adapters a complementary primer hybridization Site can be presented can be removed from the Separation means and amplified on each of the two strands of the product of the RE/ligase and Sequenced with conventional Systems. Such Sequence 45 recognition reaction. information can be used both for a previously known Turning now to a detailed description of a preferred RE Sequence to confirm the Sequence determination and for a embodiment of the QEATM method recognition reactions, previously unknown Sequence to isolate the putative new the Steps of this preferred embodiment comprise, first, gene. Hybrid primers for direct generation of RNA hybrid Simultaneously cleaving a mixed DNA sample (e.g., one of ization probes are constructed by ligating onto the 5' end of 50 the populations of proteins being assayed for interaction by existing primers the phage T7 promoter. Fragments gener the method of the invention with another protein population, ated with Such hybrid adapters can be removed using the or a pooled group of cDNAS encoding interacting proteins Separation means and transcribed into anti-Sense RNA with identified in the assay) with one or more REs and ligating conventional Systems. Such probes can be used for in Situ recognition moieties on the cut ends, Second, amplifying the hybridization with the tissue of origin of the DNA sample to 55 twice cut fragments, if necessary, and third, Separating the determine in precisely what cell types a Signal of interest is fragments by length and detecting the lengths and labels, and expressed. the identities of the RES cutting each fragment. Following A further alternative illustrated in FIG. 10C is to construct the amplification Step, optional Steps to remove unwanted an adapter by self hybridization of single stranded DNA in Singly Stranded DNA fragments prior to detection can hairpin loop configuration 212. The Subsequences of loop 60 increase the Signal to noise ratio of the following detection. 212 would have Similar properties to the corresponding Two alternative RE embodiments are described in following Subsequences of linker 205 and primer 203. Exemplary SubSections. The number of RES and associated adapters hairpin loop 211 Sequences are C to Co. preferably are limited So that both a compressed length RES generating 3' overhangs are leSS preferred and require distribution consisting of shorter fragments is avoided and the different adapter structure illustrated in FIG. 11A. 65 enough distinguishable labels are available for all the RES dsDNA 301 is a fragment of a sample cDNA cut with a RE used. Alternatively, RES can be used without associated generating 3' Sticky overhang 302. Adapter 309 comprises adapters in order that the amplified fragments not have the 6,057,101 79 80 asSociated recognition Sequences. Absence of these more REs. The amount of RE enzyme in the reaction is Sequences can be used to additionally differentiate genes that preferably approximately a 10 fold unit excess. Substantially happen to produce fragments of identical length with par greater quantities are leSS preferred because they can lead to ticular RES. Star activity (non-specific cutting) while Substantially lower A cDNA sample is prepared prior to carrying out a quantities are leSS preferred because they will result in leSS QEATM method by removal of terminal phosphates from all rapid and only partial digestion, and hence incomplete and the cDNA. This is important to improve the signal to noise inaccurate characterization of the Subsequence distribution. ratio in the Subsequent fragment length Separation and In the Same reaction, adapters and ligase enzyme are detection by eliminating amplification of unwanted, Singly present for Simultaneous adapter ligation to the RE cut ends. cut fragments. Significant background Signals arise from The method is adaptable to any ligase that is active in the exponential amplification of Singly cut fragments whose temperature range 10 to 37 C. T4 DNA ligase is the bluntends have ligated to form a single dsDNA with two cut preferred ligase. In other embodiments, cloned T4 DNA ends, an apparently doubly cut fragment, which is exponen ligase or T4 RNA ligase can also be used. In a further tially amplified like a normal doubly cut fragment. Since embodiment, thermostable ligases can be used, Such as cDNA lengths vary depending on Synthesis condition, these 15 Ampligase TM Thermostable DNA Ligase from Epicenpre unwanted, apparently doubly cut fragments have a wide (Madison, Wis.), which has a low blunt end ligation activity. range of lengths and produce a diffuse background on gel These ligases in conjunction with the repetitive cycling of electrophoresis which obscures Sharp bands from the nor the basic thermal profile for the RE-ligase reaction, mally doubly cut fragments. This background can be elimi described in the following, permit more complete RE cutting nated by preventing blunt end ligation of Singly cut frag and adapter ligation. ments by initially removing all terminal phosphates from the Ligase activity can both generate unwanted products and cDNA sample, without otherwise disrupting the integrity of also, if an RE recognition site is regenerated, can cause an the cDNA. endless cycle of further cutting and ligation. Terminal phos Terminal phosphate removal is preferably done with a phate removal during cDNA preparation prevents spurious phosphatase. To prevent interference with the intended liga 25 ligation of the blunt other ends of singly cut cDNA (and tion of adapters to doubly cut fragments, the phosphatase Subsequent exponential amplification of the results). Other activity preferably is removed prior to the RE digestion and unwanted products are fragment concatamers formed when adapter ligation Step. To avoid any phosphatase Separation or the Sticky ends of cut cDNA fragments hybridize and ligate. extraction Step, the preferred phosphatase is a heat labile Such fragment concatamers are removed by keeping the alkaline phosphatase which is heat inactivated prior to the restriction enzymes active during ligation, thus cutting RE/ligase Step. A preferred phosphatase comes from cold unwanted concatamers once they form. Further, adapters, living Barents Sea (arctic) shrimp (U.S. Biochemical Corp.) once ligated, terminate further RE cutting, Since adapters are ("shrimp alkaline phosphatase” or "SAP"). Terminal phos Selected Such that RE recognition sites are not recreated. A phate removal need be done only once for each population high molar excess of adapters also is preferable since it of cDNA being analyzed. 35 limits concatamer formation by driving the RE and ligase In other embodiments additional phosphatases may be reactions toward complete digestion and adapter ligation. used for terminal phosphate removal, Such as calf intestinal Finally, unwanted adapter Self-ligation is prevented Since phosphatase-alkaline from Boehringer Mannheim primers and linker also lack terminal phosphates (preferably (Indianapolis, Ind.). Those that are not heat inactivated due to Synthesis without phosphates or less preferably due to require the addition of a Step to Separate the phosphatase 40 pretreatment thereof with phosphatases). from the cDNA before the recognition reactions, such as by The temperature profile of the RE/ligase reaction is phenolchloroform extraction. important for achieving complete cutting and ligation. The Preferably, the prepared cDNA is then separated into preferred protocol has Several Stages. The first Stage is at the batches of from 1 picogram ("pg”) to 200 nanograms (“ng”) optimum RE temperature to achieve Substantially complete of cDNA each, and each batch is separately processed by the 45 cutting, for example 37 C. for 30 minutes. The second stage further Steps of the method. For a tissue mode experiment, is a ramp at -1C../min down to a third stage temperature for to analyze gene expression, preferably from a majority of Substantially compete annealing of adapters to the Sticky cut expressed genes, from a Single human tissue requires deter ends and primer ligation. During this ramp, cutting and mination of the presence of about 15,000 distinct cl)NA ligation continue. The third Stage is at the optimum tem Sequences. By way of example, one Sample is divided into 50 perature for adapter annealing and ligation to the Sticky approximately 50 batches, each batch is then Subject to the ends, and is, for example, at 16 C. for 60 minutes. The RE/ligase recognition reaction and generates approximately fourth Stage is again at the optimum RE to achieve complete 200-500 fragments, and more preferably 250 to 350 frag cutting of all recognition sites, for example at 37 C. for 15 ments of 10 to 1000 bp in length, the majority of fragments minutes. The fifth Stage is to heat inactivate the ligase and, preferably having a distinct length and being uniquely 55 preferably, also the RE enzymes, and is, for example, 10 derived from one cDNA sequence. A preferable example minutes at or above 65 C. If the PCR reaction is not to be analysis would entail 50 batches generating approximately immediately performed, the results are held at 4 C. If the 300 bands each. PCR amplification is to be immediately performed, as in the For the query mode, fewer recognition reactions are preferred single tube protocol of Sec. 6.1.12.2.1 (“QEATM employed since only a Subset of the expressed genes are of 60 Method Preferred For Use In A SEQ-QEATM Method”), this interest, perhaps approximately from 1 to 100. The number fifth stage is at 72 C. for 20 minutes. of recognition reactions in an experiment may then number A less preferred profile involves repetitive cycling of the approximately from 1 to 10 and an appropriate number of first four stages of the temperature protocol described above, cDNA batches is prepared. that is from an optimum RE temperature to optimum anneal Following cDNA preparation, the next Step is simulta 65 ing and ligation temperatures, and back to an optimum RE neous RE cutting of and adapter ligation to the Sample temperature. The additional cycles further drive the cDNA sequences. The prepared Sample is cut with one or RE/ligase reactions to completion. In this embodiment, it is 6,057,101 81 82 preferred to use thermostable ligase enzymes. The majority melting times reduces PCR amplification bias in favor of of restriction enzymes are active at the conventional 16 C. high G+C content. Further, large amplification Volumes are ligation temperature and hence prevent unwanted ligation preferred to reduce bias. Sufficient amplification cycles are events without thermal cycling. However, temperature pro performed, typically between 15 and 30 cycles. files consisting of optimum ligation conditions interspersed Any other techniques designed to raise Specificity, yield, with optimum RE cutting conditions cause both enzymatic or reproducibility of amplification are applicable to this reactions to proceed more rapidly than one constant tem method. One preferred technique is to include Betaine perature. An exemplary profile comprises periodically (Sigma) in both the RE/ligase reaction and in the PCR cycling between a 37 C. optimum RE temperature to a 16 amplification. Another technique that can be used is the use C. optimum annealing and ligation temperature at a ramp of of 7-deaza-2'-dGTP in the PCR reaction in place of dGTP. -1 C./min, and then back to the 37 C. optimum RE This has been shown to increase PCR efficiency for G+C temperature. Following completion of approximately 2 to 4 rich targets (Mutter et al., 1995, Nuc. Acid Res. of these temperature cycles, the RE and ligase enzymes are 23:1411–1418). As a further example, another technique that heat inactivated by a final stage at 65 C. for 10 minutes. can be used is the addition of tetramethylammonium chlo This avoids the need for Separation or extractions between 15 ride to the reaction mixture, which has the effect of raising steps. The results are held at 4 C. the T (Chevet et al., 1995, Nucleic Acids Research These thermal profiles are easily controlled and auto 23(16):3343-3344). mated by the use of commercially available computer con In a particular method of performing the PCR trolled thermocyclers, for example from MJ Research amplification, each RE/ligase reaction Sample is Sub-divided (Watertown, Mass.) or Perkin Elmer (Norwalk, Conn.). into multiple aliquots, and each aliquot is amplified with a These reaction conditions are designed to achieve Sub different number of cycles. Multiple amplifications with an Stantially complete cutting of all RE recognition Sites increasing number of amplification cycles, for example 10, present in the analyzed Sequence mixture and complete 15, and 20 cycles, are preferable. Amplifications with a ligation of reaction terminating adapters on the cut ends, lower number of cycles detect more prevalent messages in each adapter being unique in one reaction for a particular RE 25 a more quantitative manner. Amplification with a higher cut end. The fragments generated are limited by adjacent RE number of cycles detect the presence of leSS prevalent genes recognition sites and no fragment includes internal undi but leSS quantitatively. Multiple amplifications also serve as gested Sites. Further, a minimum of unwanted Self-ligation controls for checking the reliability and quantitative products and concatamerS is formed. response of the process by comparing the size of the same Following the RE/ligase Step is amplification of the Signal in each amplification. doubly cut cDNA fragments. Although PCR protocols are Other methods of performing the PCR amplification are described in the exemplary embodiment, any amplification more Suited to automation. For example, the content of a method that Selects fragments to be amplified based on end reaction vial can be configured as follows. First, 40 ul of the sequences is adaptable to a QEATM method (see above). PCR mix without Mg ions is added followed by a wax bead With high enough Sensitivity of detection means, or even 35 that melts approximately at 72 C., Such as Ampliwax beads Single molecule detection means, the amplification Step can (Perkin-Elmer, Norwalk, Conn.). This bead is melted at 75 be dispensed with entirely. This is preferable as amplifica C. for 5 minutes and solidified at 25 C. for 10 minutes. A tion inevitably distorts the quantitative response of the preferred wax is a 90:10 mixture of Paraffin:Chillout TM 14. method. The paraffin is a highly purified paraffin wax melting The PCR amplification protocol is designed to have 40 between 58 C. and 60° C. Such as can be obtained from maximum specificity and reproducibility. First, the PCR Fluka Chemical, Inc. (Ronkonkoma, N.Y.) as Paraffin Wax amplification produces fewer unwanted products if the cat. no. 76243. Chillout TM 14 Liquid Wax is a low melting, amplification Steps occur at a temperature above the T of purified paraffin oil available from MJ Research. It is the shorter linker so that it cannot initiate unwanted DNA preferred to coat the upper Sides of the reaction tubes with strands. The linker is preferably melted by an initial incu 45 this solidified wax, carefully add the PCR mix, then melt this bation at 72 C. without the Taq polymerase enzyme or wax onto the PCR mix by the temperature protocol in Sec. dNTP Substrates present. A further incubation at 72 C. for 6.1.12.2.1, which beginning with a 2 min incubation at 72 10 minutes with Taq polymerase and dNTPs is performed in C. then decreases the temperature by 5 C. every 2 min until order to complete partial double Strands to complete double 25 C. is reached. Then, the RE/ligase mix with Mgions is Strands. Alternatively, linker melting and double Strand 50 added. The RE/ligase and PCR reactions are carried out by completion can be performed by a single incubation at 72 following the preferred temperature profile in FIG. 22D. In C. for 10 minutes with Taq polymerase. Subsequent PCR this arrangement in the Same Vial, the RE/ligase reactions amplification Steps are carried out at temperatures Suffi can first be performed. The incubation at 72 C. for 20 ciently high to prevent re-hybridization of the bottom linker. minutes permits the wax layer Separating the mixtures to Second, primer strand 203 of FIG. 10A (and 304 of FIG. 55 melt, allows the RE/ligase mixture to mix with the PCR mix, 11A) are typically used as PCR primers. They are preferably and allows completion of the partial double Strands to designed for high amplification specificity and not to hybrid complete double strands. Then sufficient PCR cycles are ize with any native cDNA species to be analyzed. They have performed, typically between 15 and 30 cycles. This single high melting temperatures, preferably above 50 C. and tube implementation is well adapted to automation. Other So most preferably above 68 C., to ensure specific hybridiza 60 called PCR “hot-start” procedures can be used, such as those tion with a minimum of mismatches. employing heat Sensitive antibodies (Invitrogen, Calif.) to Third, the protocol's temperature profile is preferably initially block the activity of the polymerase. designed for Specificity and reproducibility. A preferred Following the amplification Step, optional Steps prior to profile is 95°C. for 30 seconds, then 57°C. for 1 minute, and length Separation and detection improve the methods signal then 72 C. for 2 minutes. High annealing temperatures 65 to noise ratio. It is preferable to use the protocol of Sec. minimize primer mis-hybridizations. Longer extension 6.1.12.2.1 referred to as “Biotin bead clean-up.” This times reduce PCR bias in favor of smaller fragments. Longer involves the use of a primer with a biotin (or capture moiety) 6,057,101 83 84 in the PCR amplification followed by binding to streptavidin with Specially placed Type IIS RE recognition Subsequence (or the capture moieties's binding partner) and washing to followed by digestion with the Type IIS RE and sequencing remove certain reaction products. The Single Strands dena of the generated overhang (in a SEQ-QEATM embodiment). tured from the bound products are then further analyzed. In a preferred alternative, the Type IIS recognition Subse Further, Single Strands produced as a result of linear ampli quence is placed So that the generated overhang is contigu fication from Singly cut fragments can be removed by the use ous with the original recognition Subsequence of the RE that of Single Strand Specific exonucleases. Mung Bean exonu cut the end to which the adapter hybridizes. In this clease (Exo) or Exo I can be used, with Exo I preferred embodiment, an effective target Subsequence is formed by because of its higher Specificity for Single Strands. Mung concatenating the Sequence of the Type IIS overhang and the bean is leSS preferred and even leSS preferred is S1 nuclease. original recognition Sequence. In another alternative, the LeSS preferably, the amplified products may be optionally Type IIS recognition Sequence is placed So that the Sequence concentrated by ethanol precipitation or column Separation. of the generated overhang is not contiguous with the original Alternate PCR primers illustrated in FIG. 10D can be recognition Sequence. Here, the Sequence of the Overhang is advantageously used. In that figure, Sample dsDNA 201 is used as an third internal Subsequence in the fragment. In illustrated after the RE/ligase reaction and after incubation at 15 both cases, the additionally recognized Subsequence is used 72°C. for 10 minutes but just prior to the PCR amplification in the computer implemented experimental analysis meth steps. dsDNA 201 has been cleaved by an RE recognizing ods to increase the capability of determining the Source Subsequence 227 at position 221 producing overhang 202 Sequence of a fragment. This enhancement is illustrated in and has been ligated to adapter primer Strand 203. For FIGS. 23A-E and is described in detail in Sec. 5.4.4 ("A definiteneSS and without limitation, a particular relative SEQ-QEATM Embodiment of a QEATM Method). position between RE recognition Subsequence 227 and over A subsequent QEATM method step is the separation by hang 202 is illustrated. Other relative positions are known. length of the amplified, labeled, cut cDNA fragments and The resulting DNA has been completed to a blunt ended observation of the length distribution. Lengths of the Sample double strand by completing strand 220 by incubation at 72 of cut fragments will typically span a range from a few tens C. for 10 minutes. Typically adapter primer strand 203 is 25 of bp to perhaps 1000 bp. For this range standard gel used as the PCR primer. electrophoresis is capable of resolving Separate fragments Alternatively, strand 222, illustrated with its 5' end at the which differ by three or more base pairs. Knowledge of left, can be advantageously used. Strand 222 comprises average fragment composition allows for correction of com Subsequence 223, with the same Sequence as Strand 203, position induced Small mobility differences and permits Subsequence 224, with the same Sequence as the RE over resolution down to 1 bp. Any Separation method with hang 202; Subsequence 225, with a Sequence consisting of a adequate length resolution, preferably at least to three base remaining portion of RE recognition Subsequence 227, if pairs in a 1000 base pair Sequence; can also be used. The any; and Subsequence 226 of P. nucleotides. Length P is length distribution is detected with means Sensitive to the preferably from 1 to 6 and more preferably either 1 or 2. primer labels. In the case of fluorochrome labels, Since Subsequences 223 and 224 hybridize for PCR priming with 35 multiple fluorochrome labels can be typically be resolved corresponding Subsequences of dsDNA 201. Subsequence from a single band in a gel, the products of one recognition 225 hybridizes with any remainder of recognition Subse reaction with Several RES or other recognition means or of quence 227. Subsequence 226 hybridizes only with frag Several Separate recognition reaction can be analyzed in a ments 201 having complementary nucleotides in corre Single lane. The detection apparatus resolution for different sponding positions 228. When P is 1, primer 223' selects for 40 labels limits the number of RE products that can be simul PCR amplification 1 of the 4 possible dsDNAS 201 which taneously detected. may be present; and when P is 2, 1 of the 16 is selected. If Preferred protocols for the specific RE embodiments are 4 (or 16) primers 223 are synthesized, each with one of the described in detail in S 6.1.12.1 (entitled “The QEATM possible (pairs of) nucleotides, and if the RE/ligase reactions Method Preferred RE Method”). mix is separated in 4 (16) aliquots for use with one of these 45 5.4.3.1. FIRST ALTERNATIVE RE EMBODIMENT 4 (16) primers, the 4 (16) PCR reactions will select for An alternative QEATM method protocol performs ampli amplification only one of the possible dsDNAS 201. Thus, fication prior to the RE/ligase Step. After the RE/ligase Step, these primers are similar to phasing primers (European further amplification is performed. Alternately, no further Patent Publication No. 0 534 858 A1, published Mar. 31, amplification is performed, and in this case unwanted Singly 1993). 50 cut ends are removed as they are not diluted by Subsequent The joint result of using primerS 223 with Subsequence amplification. 226 in multiple PCR reactions after one RE/ligase reaction Such removal is accomplished by first using primers that is to extend the effective target Subsequence from the RE are labeled with a capture moiety. A capture moiety is a recognition Subsequence by concatenating onto the recog Substance having a specific binding partner that can be nition Subsequence a Subsequence which is complementary 55 affixed to a Solid Substrate. For example, Suitable capture to Subsequence 226. Thereby, many additional target Sub moiety-binding partner pairs include, but are not limited to, Sequences can be recognized while retaining the Specificity biotin-Streptavidin, biotin-avidin, a hapten (Such as and exactness characteristic of the RE embodiment. For digoxigenin) and a corresponding antibody, or other removal example, RES recognizing 4 bp Subsequences can be used in means known in the art. For example, double Stranded Such a combined reaction with an effective 5 or 6 bp target 60 cDNA is PCR amplified using a set of biotin-labeled, Subsequence, which need not be palindromic. RES recog arbitrary primers with no net Sequence preference. The result nizing 6 bp Sequences can be used in a combined reaction to is partial cDNA sequences with biotin labels linked to both recognize 7 or 8bp Sequences. Such effective Sequences are ends. The amplified cDNA is cut with REs and ligated to then used in the computer implemented design and analysis recognition moieties uniquely for each particular RE cut methods Subsequently described. 65 end. The RE/ligase Step is performed by procedures identical In a further enhancement, additional Subsequence infor to those of the prior section in order to drive the RE digestion mation can be generated from adapters comprising primers and recognition moiety ligation to completion and to prevent 6,057,101 85 86 formation of concatamers and other unwanted ligation prod cut ends are then removed by contact with a Solid phase to ucts. The recognition moieties can be the adapters previ which a specific binding partner of the capture moiety is ously described. affixed. Next the unwanted singly cut fragments labeled with the FIGS. 12A, 12B, and 12C illustrate a second alternative capture moiety are removed by contacting them with the RE protocol, which uses biotin as Such a capture moiety for binding partner for the capture moiety affixed to a Solid direct removal of the singly cut 3' and 5' cDNA ends from phase, followed by removal of the Solid phase. For example, the RE/ligase mixture. cDNA Strands are amplified using, where biotin is the capture moiety, Singly cut fragments can for example, a primer with a biolin molecule linked to one be removed using Streptavidin or avidin magnetic beads, of the internal nucleotides as one of the two primers in PCR. leaving only doubly cut fragments that have RE-Specific Terminal phosphates are retained. recognition moieties ligated to each end. These products are FIG. 11A illustrates Such a cDNA 401 with ends 407 and then analyzed, also as in the previous Section, to determine 408, poly(da) sequence 402, poly(dT) primer 403 with biotin the distribution of fragment lengths and RE cutting combi 404 attached. 405 is a recognition sequences for RE; 406 is nations. a sequence for RE. Fragment 409 is the cDNA sequence Other direct removal means may alternatively be used in 15 defined by these adjacent RE recognition Sequences. Frag this embodiment of a QEATM method. Such removal means ments 423 and 424 are singly cut fragments resulting from include, but are not limited to, digestion by Single Strand RE cleavages at sites 405 and 406. Specific nucleases or passage though a single Strand Specific FIG. 12B illustrates that, next, the cDNA is ligated into a chromatographic column, for example, containing circle. Aligation reaction using, for example, T4DNAligase hydroxyapatite. is performed under Sufficiently dilute conditions So that 5.4.3.2. SECOND ALTERNATIVE RE EMBODIMENT predominantly intramolecular ligations occur circularizing A Second alternative embodiment in conjunction with the cDNA, with a only a minimum of intermolecular, Sufficiently Sensitive detection means can eliminate alto concatamer forming ligations. Reaction conditions favoring gether the amplification Step. In the preferred RE protocol, circularization verSuS concatamer formation are described in doubly cut fragments ligated to adapters are exponentially 25 Maniatis, 1982, Molecular Cloning A Laboratory Manual, amplified, while unwanted, Singly cut fragments are at best pp. 124-125,286-288, Cold Spring Harbor, N.Y. Preferably, linearly amplified. Thus amplification dilutes the unwanted a DNA concentration of less than approximately 1 lug/ml has fragments relative to the fragments of interest. After ten been found adequate to favor circularization. Concatamers cycles of amplification, for example, Signals from unwanted can be separated from circularized single molecules by size fragments are reduced to less than approximately 0.1% of Separation using gel electrophoresis, if necessary. FIG. 12B the Signals from the doubly cut fragments. Gene expression illustrates the circularized cDNA. Blunt end ligation can then be quantitatively determined down to at least this occurred between ends 407 and 408. level. A greater number of amplification cycles results in a Then the circularized, biotin end labeled, cDNA is cut greater relative dilution of Signals from unwanted Singly cut with RES and ligated to adapters uniquely recognizing and fragments and, thereby, a greater Sensitivity. But amplifica 35 perhaps uniquely labeled for each particular RE cut. The tion bias and non-linearities interfere with the quantitative RE/ligase Step is performed by procedures as described in response of the method. For example, certain fragments will the section hereinabove in order to drive RE digestion and be preferentially PCR amplified depending on Such factors primer ligation to completion over formation of concatamers as length and average base composition. and other unwanted ligation products. Next, the unwanted For improved quantitative response, it is preferred to 40 Singly cut ends are removed using Streptavidin or avidin eliminate the bias accompanying the amplification StepS. magnetic beads, leaving only doubly cut fragments that have Then output Signal intensity is linearly responsive to the RE-Specific recognition Sequences ligated to each end. number of input genes or Sequences generating that Signal. FIG. 12C illustrates these latter steps. Sequences 405 and In the case of common fluorescent detection means, a 406 are cut by RE and RE, respectively, and adapterS 421 minimum of 6x10" moles of fluorochrome (approximately 45 and 422 specific for cuts by RE and RE, respectively are 10 molecules) is required for detection. Since one gram of ligated onto the sticky ends. Thereby, fragment 409 is freed cDNA contains about 10 moles of transcripts, it is possible from the circularized cDNA and adapters 421 and 422 are to detect transcripts to at least a 1% relative level from ligated to it. The remaining Segment of the circularized microgram quantities of mRNA. With greater mRNA cDNA comprises singly cut ends 423 and 424 with ligated quantities, proportionately rarer transcripts are detectable. 50 adapters 421 and 422. Both Singly cut ends are joined to the Labeling and detection Schemes of increased Sensitivity primer sequence 403 with attached biotin 404. Removal is permit use of less mRNA. Such a scheme of increased accomplished by contact with streptavidin or avidin 420 sensitivity is described in Ju et al., 1995, Fluorescent energy which is fixed to Substrate 425, perhaps comprising mag transfer dye-labeled primers for DNA sequencing and netic beads. The doubly cut labeled fragment 409 can now analysis, Proc. Natl. Acad. Sci. USA 92:4347–4351. Single 55 be simply Separated from the Singly cut ends affixed to the molecule detection means are about 10 times more sensi Substrate. Thereby, Separation of the Singly and doubly cut tive than existing fluorescent means (Eigen et al., 1994, fragments is achieved. Proc. Natl. Acad. Sci. USA'91:5740–5747). Signals from the uniquely labeled doubly cut ends can be To eliminate amplification Steps, a preferred protocol uses directly detected without any unwanted contamination from a capture moiety Separation means to directly remove Singly 60 Signals from labeled Singly cut ends. Importantly, Since cut fragments from the desired doubly cut fragments. Only Signals originate only from cDNA sequences originally the doubly cut fragments have a discrete length distribution present in the Sample, the detected Signals will quantitatively dependent only on the input gene Sequences. The Singly cut reflect cDNA sequence content and thus gene expression fragments have a broad non-diagnostic distribution depend levels. If the expression level is too low for direct detection, ing on cdNA synthesis conditions. In this protocol, cDNA is 65 the Sample can be Subjected to just the minimum number of Synthesized using a primer labeled with a capture moiety, is cycles of amplification, according to the methods of circularized, cut with RES, and ligated to adapters. Singly Example 6.1.12.1 (entitled “Preferred QEATM RE Method”), 6,057,101 87 88 to detect the gene or Sequence of interest. For example, the Type IIS enzyme be contiguous to the initial target end number of cycles can be as Small as four to eight without any Subsequence. The placement of this additional Subsequence concern of background contamination or noise. Thus, in this is described with reference to FIGS. 23A-E, which illustrate embodiment, amplification is not needed to SuppreSS Signals steps in a SEQ-QEATM embodiment. FIG. 23B schemati from Singly cut ends, and preferred more quantitative cally illustrates dsDNA 2302, which is a fragment cut from response Signal intensities result. an original Sample Sequence on one end by a first initial RE 5.4.4. A SEQ-OEATM EMBODIMENT OF A OEATM and on the other end by a different second initial RE, with METHOD adapters fully hybridized but prior to primer ligation. Thus, SEQ-QEATM is an alternative embodiment to the pre linker strand 2311 has hybridized to primer strand 2312 and ferred method of practicing a QEATM method as described to the 5' overhang generated by the first initial RE, and now in Sec. 5.4.3 (“RE Embodiments Of a QEATM Method”). By fixes primer 2312 adjacent to fragment 2302 for Subsequent the use of recognition moieties, or adapters, comprising ligation. Primer 2312 has recognition Subsequence 2320 for Specially constructed primers bearing a recognition Site for Type IIS RE 2321. Linker 2311, to the extent it overlaps and a Type IIS RE, a SEQ-QEATM method is able to identify an hybridizes with recognition Subsequence 2320, has comple additional 4-6 terminal nucleotides adjacent to the recogni 15 mentary recognition Subsequence 2321. Additionally, primer tion Site, or recognition Subsequence, of the RE initially 2312 preferably has a conjugated label moiety 2334, e.g. a cutting a fragment. Thereby, the effective target Subsequence fluorescent FAM moiety. Similarly, linker strand 2313 has is the concatenation of the initial RE recognition Subse hybridized to primer strand 2314 and to the 5' overhang quence and the additional 4-6 terminal nucleotides, and has, generated by the second initial RE. Primer 2314 preferably therefore, a length of at least from 8 to 12 nucleotides and has a conjugated capture moiety 2332, e.g. a biotin moiety, preferably has a length of at least 10 nucleotides. This longer and a release means represented by Subsequence 2323 (to be effective target Subsequence is then used in the QEATM described subsequently). Primer 2312 is also called the “cut analysis methods as described in Sec. 5.4.5 (“QEATM' primer,” and primer 2314 the “capture primer.” Analysis and Design Methods”) which involve searching a Subsequence 2304 terminating at nucleotide 2307 in FIG. database of Sequences to identify the Sequence or gene from 25 23B is the portion of the recognition subsequence of the first which the fragment derived. The longer effective target initial RE remaining after its cutting of the original Sample Subsequence increases the capability of these methods to Sequence. The placement of the Type IIS RE recognition determine a unique Source Sequence for a fragment. Subsequence is determined by the length of this Subse In this Section, for ease of description and not by way of quence. FIG.23A schematically illustrates how the length of limitation, first shall be described Type IIS RES, next the Subsequence 2304 is determined by properties of the first Specially constructed primers, and then the additional initial RE. The first RE is chosen to be of a type that method steps of a SEQ-QEATM method used to recognize recognizes Subsequence 2303, terminating with nucleotide the additional nucleotides. 2307, of sample dsDNA 2301, and that cuts the two strands A Type IIS RE is a restriction endonuclease enzyme of dsDNA 2301 at locations 2305 that are located within which cuts a dsDNA molecule at locations outside of the 35 recognition subsequence 2303. In order that the first RE recognition site of the Type IIS RE (Szybalski et al., 1991, recognize a known target Subsequence, it is highly prefer Gene 100:13–26). FIG. 23C illustrates Type IIS RE 2331 able that subsequence 2303 be entirely determined by the cutting dsDNA 2330 outside of its recognition site, which is first RE and be without indeterminate nucleotides. As a recognition Subsequence 2320, at locations 2308 and 2309. result of this cutting, overhang Subsequence 2306 is gener The Type IIS RE preferably generates an overhang by 40 ated and has a known Sequence, Since it is entirely within the cutting the two dsDNA strands at locations differently determined recognition Subsequence 2303. Thereby, Subse displaced away on the two Strands from the recognition quence 2304, the portion of the recognition Subsequence Subsequence. Although the recognition Subsequence and the 2303 remaining on a fragment cut by the first RE, has a displacement(s) to the cutting site(s) are determined by the length not less than the length of overhang 2306 and is RE and are known, the Sequence of the generated overhang 45 typically longer. Typically and preferably, Subsequence 2303 is determined by the dsDNA cut, in particular by its nucle is of length 6 and is palindromic, locations 2305 are Sym otide Sequence outside of the Type IIS recognition region, metrically placed in Subsequence 2303; and overhang 2306 and is, at first, unknown. Thus, in a SEQ-OEATM is of length 4. Therefore, the typical length of the remaining embodiment, the overhangs generated by the Type IIS RES portion 2304 of the recognition Subsequence 2303 is of are sequenced. Table 9 in Sec. 6.1.12.5 (“Preferred Reac 50 length 5. In cases where shorter recognition Subsequences tants for SEQ-QEATM Methods”) lists several Type IIS RES 2303 are preferably, the remaining portion 2304 will have a adaptable for use in a SEQ-QEATM method and their corresponding length. evant characteristics, including their recognition Subse The preferred placement of Type IIS recognition Sequence quences on both DNA Strands and the displacements from 2320 is now described with reference to FIG. 23C, which these recognition Subsequences to the respective cutting 55 schematically illustrates dsDNA 2330, which derives from Sites. It is preferable to use RES of high Specificity and dsDNA 2302 of FIG. 23B after the further steps of primer generating an overhang of at least 4 bp displaced at least 4 ligation, PCR amplification with primers 2312 and 2314, or 5bp beyond the recognition Subsequence in order to Span binding of capture moiety 2332 to binding partner 2333 the remaining recognition Subsequence of the RE that ini affixed to a Solid-phase Substrate, and then binding of Type tially cut the fragment. FokI and Bbv are most preferred 60 IIS RE 2331 to its recognition subsequence 2320. Subse Type IIS REs for a SEQ-QEATM method. quence 2322 is the Subsequence between recognition Sub Next, the Special primers, and the Special linkers if sequence 232C and the end of primer 2312 at location 2305. needed, which hybridize to form the adapters for SEQ Type IIS RE is illustrated cutting dsDNA 2330 at nucleotide QEATM, have, in additional to the structure previously locations 2308 and 2309 and, thereby, generating an exem described in Sec. 5.4.3 (“RE Embodiments Of a QEATM 65 plary 5' overhang 2324 between these locations. For this Method”), a Type IIS recognition Subsequence whose place overhang to be contiguous with the remaining portion 2304 ment is important in order that the overhang generated by the of initial target end subsequence 2303, nucleotide 2309 is 6,057,101 89 90 adjacent to nucleotide 2307 terminating Subsequence 2304. hydrolysis of the Sugar backbone at an alkaline pH releases Therefore, Type IIS recognition sequence 2320 is preferably strand 2335. Second, Subsequence 2323 can be the recog placed on primer 2312 Such that the length of Subsequence nition Subsequence of an RE which cuts extremely rarely if 2304 plus the length of subsequence 2322 equals the dis at all in the Sequences of the Sample. A preferred RE of this tance of closest cutting of Type IIS RE 2331. For example, Sort is AScI, which has an 8 bp recognition Sequence that in the case of FokI, Since the closest cutting distance is 9 and rarely, if ever, occurs in mammalian DNA, and is active at the typical length of Subsequence 2304 is 5, its recognition the ends of molecules. In this case, digestion with the RE, Sequence is preferably placed 4 bp from the end of primer i.e. AScI., releases Strand 2335. These release means are 2312. In the case of Bbv, since the closest cutting distance particularly useful in the case of biotin-Streptavidin, which is 8, its recognition Sequence is preferably placed 3 bp from form a complex that is difficult to dissociate. the end of primer 2312. Table 10 of Sec. 6.1.12.5 (“Preferred Reactants for SEQ Finally, FIG. 23D schematically illustrates dsDNA 2330 QEATM Methods”) lists exemplary primers, linkers, and after cutting by Type IIS RE 2331. dsDNA has 5' overhang associated RES, for the preferred implementation of SEQ 2324 between and including nucleotides 2308 and 2309, QEATM in which contiguous effective target end Subse where the Type IIS RE cut dsDNA 2330 of FIG. 23C. This 15 quences are formed. This description has illustrated the overhang is contiguous with former Subsequence 2304, the generation of a 5'Type IIS generated overhang. Primers can remaining portion of the recognition Subsequence of the first equally be constructed to generate a leSS preferable 3' RE, which has been cut off. The shorter strand has primer overhang by using a Type IIS whose closest cutting distance 2314 including release means represented by Subsequence is on the 3' strand, rather than on the 5' strand. 2323. dsDNA 2330 remains bound to the solid-phase Sup Finally, the method steps of SEQ-QEATM are now port through capture moiety 2332 and binding partner 2324. described. SEQ-QEATM comprises, first, practicing the The absence of label moiety 2334 can be used to monitor the RE/ligase embodiment of QEATM using the special primers completeness of cutting by Type IIS RE 2331. The label and linkers previously described followed, Second, by cer moiety also advantageously assists in the determination of tain additional steps specific to SEQ-QEATM. More detailed the length of dsDNA 2330. 25 exemplary reaction protocols are found in the accompanying The QEATM method is also adaptable to other less pref examples in Sec. 6 (“Examples”). The protocols of Sec. erable placements of recognition Sequence 2320. If recog 6.1.12.1 (“Preferred QEATM RE Method”) are preferred for nition sequence 2320 is placed closer to the 3' end of primer performing a QEATM method, and the protocols of Sec. 2312 than the optimal and preferable distance, the overhang 6.1.12.2 (“Preferred Methods Of A SEQ-QEATM produced by Type IIS RE 2331 is not contiguous with Embodiment”) are preferred for performing the additional recognition subsequence 2303 of the first RE, and a con steps specific to SEQ-QEATM. FIGS. 23B–E illustrate vari tiguous effective target Subsequence is not generated. In this ous steps in a SEQ-QEATM method. FIG. 23B illustrates a case, optionally, the determined sequence of the Type IIS RE fragment from a Sample Sequence digested by two different generated overhang can be used as third internal Subse REs and just prior to primer ligation. FIG. 23C illustrates a quence information in QEAT experimental analysis meth 35 Sample Sequence after primer ligation, chain blunt-ending, ods in order to further resolve the Source Sequence of and PCR amplification. These QEATM steps are preferably fragment 2302, if necessary. If recognition Sequence 2320 is performed according to the alternative described in Sec. placed further from the 3' end of the cut primer than the 5.4.3 (“RE Embodiments Of a QEA.” Method”), but can optimal and preferable distance, the overhang produced by alternatively be performed by any RE/ligase alternative. The Type IIS RE overlaps with recognition subsequence 2303 of 40 additional steps unique to SEQ-QEATM include, first, bind the first RE. In this case, the length of the now contiguous ing the amplified fragments to a Solid-phase Support, also effective target Subsequence is less than the Sum of the illustrated in FIG. 23C, second, washing the bound lengths of the Type IIS overhang and the first RE recognition fragments, and third, digesting the bound fragments by the Subsequence. Effective target end Subsequence information Type IIS RE corresponding to primer 2312 used. The Type is, thereby, lost. In case recognition Sequence 2310 is placed 45 IIS digestion is preferably performed with reaction condi further from the 3' end than the distance of furthest cutting, tions Suitable to achieve complete digestion, which can be no additional information is obtained. checked by insuring the absence of optional label moiety Primer 2314 also has certain additional structure. First 2334 after washing the bound, digested sequences. FIG.23D additional Structure is capture moiety 2332 conjugated near illustrates dsDNA fragments 2330 remaining after complete or to the 5' end of primer 2314. The capture moiety coop 50 digestion by the Type IIS RE. Before Type IIS digestion, an erates with a corresponding binding partner affixed to a Solid aliquot of the bound, amplified RE/ligase reaction products support, an attachment means, to immobilize dsDNA 2330. is denatured and the Supernatant, containing the labeled 5' Biotin/streptavidin are the preferred capture moiety/binding Strands, are separated according to length by, e.g., gel partner pair, which are used in the following description electrophoresis, in order to determine the length of each without limitation to this invention. This embodiment is 55 fragment doubly cut by different REs as in the previous adaptable to any cooperating pair of capture moiety and OEATM embodiments. binding partner that remain bound under DNA denaturing The subsequent additional SEQ-QEATM step is sequenc conditions. Several Such pairs have been previously ing of overhang 2324. This can be done in any manner described. known in the art. In a preferred embodiment suitable for A Second additional Structure is a release means repre 60 lower fragment quantities, an alternative, herein called phas sented as Subsequence 2323 of primer 2314. The release ing QEAM, can be used to Sequence this overhang. A means allows controlled release of strand 2335 of FIG. 23D phasing QEAM method depends on the precise Sequence from the capture moiety/binding partner complex. This Specificity with which RE/ligase reactions recognize short alternative is adaptable to any Such controlled release means. overhangs, in this case the Type IIS generated overhang. Two Such means are preferable. First, Subsequence 2323 can 65 FIG. 23E illustrates a first step of this alternative in which be one or more uracil nucleotides. In this case, digestion a QEATM method adapter, which is comprised of primer with uracil DNA glycosylase (UDG) and subsequent 2351 with label moiety 2353 and linker 2350, has hybridized 6,057,101 91 92 to overhang 2324 in Type IIS digested fragment 2330 bound description is not limiting, as all the methods to be described to a Solid-phase Support. By way of example only, overhang are equally adaptable to all QEATM method embodiments. 2324 is here illustrated as being 4bp long. In this alternative, Further, the following descriptions are directed to the Special phasing linkers are used. For each nucleotide posi currently preferred embodiments of these methods. tion of overhang 2324, e.g. position 2354, 4 pools of linkers However, it will be readily apparent to those skilled in the 2350 are prepared. All linkers in each pool have one fixed computer and Simulation arts that many other embodiments nucleotide, i.e. one of either A, T, C, or G, at that position, of these methods are Substantially equivalent to those described and can be used to achieve Substantially the same e.g. position 2355, while random nucleotides in all combi results. The QEATM methods comprise such alternative nations are present at the other three positions. For each implementations as well as its currently preferred imple nucleotide position of the overhang, four RE/ligase reactions mentation. are performed according to QEAM protocols, one reaction 5.4.5.1. OEATM EXPERIMENTAL ANALYSIS METH using linkers from one of the four corresponding pools. ODS Linkers from only one pool, that having a nucleotide The analysis methods comprise, first, Selecting a database complementary to overhang 2324 at position 2354, hybrid of DNA sequences representative of the DNA sample to be ize without error, and only these linkers can cause ligation 15 analyzed, Second, using this database and a description of of primer 2351 to the 5' strand of fragment 2330. When the the experiment to derive the pattern of Simulated Signals, results of the four RE/ligase reactions are denatured and contained in a database of Simulated Signals, which will be Separated according to length, only one reaction of the four produced by DNA fragments generated in the experiment, can produce labeled products at a length corresponding to and third, for any particular detected Signal, using the pattern the length of fragment 2330, namely the reaction with or database of Simulated Signals to predict the Sequences in linkers complementary to position 2354 of overhang 2324. the original Sample likely to cause this signal. Further Thereby, by performing four RE/ligase reactions for each analysis methods present an easy to use user interface and nucleotide position of overhang 2324, this overhang can be permit determination of the Sequences actually causing a Sequenced. Optionally, the products of these four RE/ligase Signal in cases where the Signal may arise from multiple reactions can be further PCR amplified. In a further option, 25 Sequences, and perform Statistical correlations to quickly if linkers 2350 comprise Subsequence 2356 that is uniquely determine signals of interest in multiple Samples. related to the fixed nucleotide in subsequence 2352 and if The first analysis method is selecting a database of DNA four separately and distinguishably labeled primers 2351 Sequences representative of the Sample to be analyzed. In complementary to these unique Subsequences are used, all one use of a QEATM method, the DNA sequences to be four RE/ligase for one overhang position reactions can be analyzed will be derived from a tissue sample, typically a simultaneously performed in one reaction tube. With this human Sample examined for diagnostic or research pur overhang Sequencing alternative, release means 2323 can be poses. In this use, database Selection begins with one or omitted from primer 2314. more publicly available databases which comprehensively In an alternate embodiment, Sequencing of a 5' overhang record all observed DNA sequences. Such databases are can be done by standard Sanger reactions. Thus strand 2335 35 GenBank from the National Center for Biotechnology Infor is elongated by a DNA polymerase in the presence of labeled mation (Bethesda, Md.), the EMBL Data Library at the ddNTPs at a relatively high concentration to dNTPs in order European Bioinformatics Institute (Hinxton Hall, UK) and to achieve frequent incorporation in the short 4-6 bp elon databases from the National Center for Genome Research gation. Partially elongated strands 2335 are released by (Santa Fe, N.M.). However, as any sample of a plurality of denaturing fragment 2330, Washing, and then by causing 40 DNA sequences of any provenance can be analyzed by release means 2323 to release strands 2335 from the capture QEATM methods, any database containing entries for the moiety bound to the Solid phase Support. The released, Sequences likely to be present in Such a Sample to be partially elongated Strands are then Separated by length, e.g., analyzed is usable in the further Steps of the computer by gel electrophoresis, and the chain terminating ddNTP is methods. observed at the length previously observed for that frag 45 FIG. 13A illustrates the preferred database selection ment. In this manner, the 4-6 bp overhang 2324 of each method starting from a comprehensive tissue derived data fragment can be quickly Sequenced. base. Database 1001 is the comprehensive input database, The effective target Subsequence information, formed by having the exemplary flat-file or relational structure 1010 concatenating the Sequence of the Type IIS overhang to the shown in FIG. 13B, with one row, or record, 1014 for each Sequence of the recognition Subsequence of the first RE, is 50 entered DNA sequence. Column, or field, 1011 is the acces then input into QEATM Experimental Analysis methods, and Sion number field, which uniquely identifies each Sequence is used as a longer target Subsequence in order to determined in database 1001. Most Such databases contain redundant the Source of the fragment in question. This longer effective entries, that is multiple Sequence records are present that are target Subsequence information preferably permits exact and derived from one biological sequence. Column 1013 is the unique Sample Sequence identification. 55 actual nucleotide Sequence of the entry. The plurality of 5.45. OEATM ANALYSIS AND DESIGN METHODS columns, or fields, represented by 1012 contain other data Described hereinbelow are two groups of computer meth identifying this entry including, for example whether this is ods: first, methods for the QEATM method experimental a cDNA or g|DNA sequence, if cDNA, whether this is a full design; and second, methods for the QEATM method experi length coding Sequence or a fragment, the Species origin of mental analysis. Although, logically, design precedes 60 the Sequence or its product, the name of the gene containing analysis, the methods of experimental design depend on the Sequence, if known, etc. Although shown as one file, basic methods described herein as part of experimental DNA sequence databases often exits in divisions and Selec analysis. Consequently, experimental analysis methods are tion from all relevant divisions is contemplated by a QEATM described first. method. For example, GenBank has 15 different divisions, In the following, descriptions are often cast in terms of the 65 of which the EST division and the separate database, dbBST, preferred QEATM method embodiment, in which REs are that contain expressed sequence tags (“EST) are of par used to recognize target Subsequences. However, Such ticular interest, Since they contain expressed Sequences. 6,057,101 93 94 From the comprehensive database, all records are Selected as many as there are pairs of available recognition means to which meet criteria for representing particular experiments recognize Subsequences. FIG. 14 illustrates 15 reactions. For on particular tissue types. This is accomplished by conven example, reaction 1 Specified by major row 1102 generates tional techniques of Sequentially Scanning all records in the fragments between target Subsequences which are the rec comprehensive database, Selecting those that match the ognition Sites of restriction endonucleases 1 and 2 described criteria, and Storing the Selected records in a Selected data in minor rows 1103 and 1104. Further, the RE1 cut end is base. recognized by a labeling moiety labeled with LABEL1, and The following are exemplary Selection methods. To ana the RE2 end is recognized by LABEL2. Similarly, reaction lyze a genomic DNA sample, database 1001 is Scanned 15, 1109, utilizes restriction endonucleases 36 and 37 against criteria 1002 for human g|DNA to create selected labeled with labels 3 and 4, minor rows 1110 and 1111, database 1003. To analyze expressed genes (cDNA respectively. Sequences), several Selection alternatives are available. First, Major row 1105 describes a variant QEATM method a genomic Sequence can be Scanned in order to predict which reaction using three RES and a separate probe. AS described, Subsequences (exons) will be expressed. Thus selected data many RES can be used in a single recognition reaction as base 1005 is created by making Selections according to 15 long as a useful fragment distribution results. Too many RES expression predictions 1004. Second, observed expressed results in a compressed length distribution. Further, probes Sequences, Such as cDNA sequences, coding domain for target Subsequences that are not intended to be labeled sequences (“CDS”), and ESTs, can be selected 1006 to fragment ends, but rather occur within a fragment, can be create Selected database 1007 of expressed Sequences. used. For a further example, a labeled probe added after the Additionally, predicted and observed expressed Sequences QEATM method PCR amplification step (if present in a given can be combined into another, perhaps more comprehensive, embodiment), a post PCR probe, can recognize Subse Selected database of expressed Sequences. Third, expressed quences internal to a fragment and thereby provide an Sequences determined by either of the prior methods may be additional Signal which can be used to discriminate between further selected by any available indication of interest 1008 two Sample Sequences which produce fragments of the same in the database records to create more targeted Selected 25 length and end Sequence which otherwise have differing database 1009. Without limitation, selected databases can be internal Sequences. For another example, a probe added composed of Sequences that can be selected according to any before a QEATM method PCR step and which cannot be available relevant field, indication, or combination present in extended by DNA polymerase will prevent PCR amplifica Sequence databases. tion of those fragment containing the probe's target Subse The Second analysis method uses the previously Selected quences. If PCR amplification is necessary to generate database of Sequences likely to be present in a Sample and detectable signals (in a given embodiment), Such a probe a description of an intended experiment to derive a pattern will prevent the detection of Such a fragment. The absence of the Signals which will be produced by DNA fragments of a fragment may make a previously ambiguous detected generated in the experiment. This pattern can be stored in a band now unambiguous. Such PCR disruption probes can be computer implementation in any convenient manner. In the 35 PNA oligomers or degenerate sets of DNA oligomers, modi following, without limitation, it is described as being Stored fied to prevent polymerase extension (e.g., by incorporation as a table of information. This table may be stored as of a dideoxynucleotide at the 3' end). individual records or by using a database System, Such as In certain OEATM method embodiments an effective tar any conventionally available relational database. get Subsequence is available that is longer than the recog Alternatively, the pattern may simply be Stored as the image 40 nition Subsequence of the cutting RE. In these cases, the of the in-memory Structures which represent the pattern. effective target Subsequence is to be used in the analysis and A QEATM method experiment comprises several indepen design methods in place of the cutting RE recognition dent recognition reactions applied to the DNA sample Subsequence in order to obtain extra Specificity. One Such Sequences, where in each of the reactions labeled DNA embodiment is a SEQ-QEATM method, wherein an overhang fragments are produced from Sample Sequences, the frag 45 generated by a Type IIS RE is Sequenced to obtain a longer ments lying between certain target Subsequences in a Sample target end Subsequence. Another Such embodiment involves Sequence. The target Subsequences can be recognized and the use of alternative phasing PCR primers. In this case, their the fragments generated by the preferred RE embodiments extra recognition Subsequences and labeling are described in of the QEATM method. The following description is focused rows dependent to the RE/ligase reaction whose products on the RE embodiments. 50 they are used to amplify. FIG. 14 illustrates an exemplary description 1100 of a Next, FIG. 15A illustrates, in general, that from the preferred QEATM method embodiment. Field 1101 contains database Selected to best represent the likely DNA sequences a description of the tissue sample which is the Source of the in the Sample analyzed, 1201, and the description of the DNA sample. For example, one experiment could analyze a QEATM method experiment, 1202, the simulation methods, normal prostrate Sample, a Second otherwise identical 55 1203, determine a pattern of Simulated Signals Stored in a experiment could analyze a prostrate Sample with prema Simulated database, 1204, that represents the results of the lignant changes, and a third experiment could analyze a QEATM method experiment. The experimental simulation cancerous prostate Sample. Differences in gene expression generates the Same fragment lengths and end Subsequences between these samples, particularly among interacting pro from the input database that will be generated in an actual teins detected according to the method of the invention, then 60 experiment performed on the same Sample of DNA relate to the progreSS of the cancer disease State. Such Sequences. Samples could be drawn from any other human cancer or Alternately, the Simulated pattern or database may not be malignancy. needed, in which case the DNA database is searched Major rows 1102, 1105, and 1109 describe the separate Sequence by Sequence, mock digestions are performed and individual recognition reactions to which the DNA from 65 compared against the input signals. A simulated database is tissue Sample 1101 is Subjected. Any number of reactions preferable if Several Signals need to be searched or if the may be assembled into an experiment, from as few as one to same QEATM method experiment is run several times. 6,057,101 95 96 Conversely, the Simulated database can be dispensed with quence recognition means to the target DNA, to allow when few signals from a few experiments need to Searched. detection after Separation of the fragments by length. In an A quantitative Statement of when the Simulated database is embodiment using fluorescent detection, means, these labels more efficient depends upon an analysis of the costs of the are fluorochromes covalently attached to the primer Strands various operations and the Size of DNA database, and can be of the adapters, as previously described, or to hybridization performed as is well known in the computer arts. Without probes, if any. Typically, all the fluorochrome labels used in limitation, in the following the Simulated database is one reaction are simultaneously distinguishable So that frag described ments with all possible combinations of target Subsequences FIG. 15B illustrates an exemplary structure for the simu can be fluorescently distinguished. For example, fragments lated database. Here, the simulated results of all the indi at entry 1217 in table 1210 (FIG. 15B) occur at length 175 vidual recognition reactions defined for the experiment are and present Simultaneous fluorescent Signals LABEL1 and gathered into rectangular table 1210. The QEATM method is LABEL2 upon Stimulation, Since these are the labels used equally adaptable to other database Structures containing with adapters which recognize ends cuts by RE1 and RE2 equivalent information; Such an equivalent Structure would respectively. For a further example, in reaction 2, major row be one, for example, where each reaction was placed in a 15 1105 of experimental definition 1100 (FIG. 14), a fragment separate table. The rows of table 1210 are indexed by the with ends cut by RE2 and RE3 and hybridizing with probe lengths of possible fragments. For example, row 1211 con P will present simultaneous signals LABEL2, LABEL3, and tains fragments of length 52. The columns of table 1210 are LABEL4. Where effective target Subsequences are con indexed by the possible end Subsequences and probe hits, if structed with the SEQ-QEATM method or alternative phasing any, in a particular experimental reaction. For example, primers, this lookup is appropriately modified. columns 1212, 1213, and 1214 contain all fragments gen Other labelings are within the scope of the QEATM erated in reaction 1, R1, which have both end Subsequences method. For example, a certain group of target Subsequences recognized by RE1, one end Subsequence recognized by can be identically labeled or not labeled at all, in which case RE1 and the other by RE2, and both end Subsequences the corresponding group of fragments are not distinguish recognized by RE2, respectively. Other columns relate to 25 able. In this case, if RE1 and RE3 end subsequences were other reactions of the experiment. Finally, the entries in table identically labeled in table 1210 (FIG. 15B), a fragment of 1210 contain lists of the accession numbers of Sequences in length 151 may be generated by Sequence T163, AO1, or the database that give rise to a fragment with particular S003, or any combination of these sequences. In the length and end Subsequences. For example, entry 1215 extreme, if Silver (Ag) staining of an electrophoresis gel is indicates that only accession number AO1 generates a frag used in an embodiment to detect Separated fragments, then ment of length 52 with both end Subsequences recognized by all bands will be identically labeled and only band lengths RE1 in R1. Similarly, entry 1216 indicates that accession can be distinguished within one electrophoresis lane. numbers A01 and S003 generate a fragment of length 151 Thus the Simulated database together with the experimen with both end Subsequences recognized by RE3 in reaction tal definition can be used to predict experimental results. If 2. 35 a signal is detected in a recognition reaction, Say Rn, whose In alternative embodiments, the contents of the table can end labelings are LABEL1 and LABEL2 and whose repre be Supplemented with various information. In one aspect, Sentation of length is corrected to physical length in base this information can aid in the interpretation of results pairs of L, the length L row of the Simulated database is produced by the Separation and detection means used. For retrieved and it is scanned for Rn entries with the detected example, if Separation is by electrophoresis, then the 40 Subsequence labeling, by using the column headings indi detected electrophoretic DNA length can be corrected to cating observed Subsequences and the experimental defini obtain the true physical DNA length. Such corrections are tion indicating how each Subsequence is labeled. If no match well known in the electrophoretic arts and depend on Such is found, this fragment represents a new gene or Sequence factors as average base composition and fluorochrome not present in the Selected database. If a match is found, then labels. One commercially available package for making 45 this fragment, in addition to possibly being a new gene or these corrections is Gene Scan Software from Applied Sequence, can also have been generated by those candidate Biosystems, Inc. (Foster City, Calif.). In this case, each table Sequences present in the table entry(ies) found. entry for a fragment can contain additionally average base The Simulated database lookup is described herein as composition, perhaps expressed as percent G+C content, and using the physical length of a detected fragment. In cases the experimental definition can include primer average base 50 where the Separation and detection means returns an composition and fluorochrome label used. For a further approximation to the true physical fragment length, lookup example, if Separation is by mass spectroScopy or similar is augmented to account for Such as approximation. For method, the additional information can be the molecular example, electrophoresis, when used as the Separation weight of each fragment and perhaps a typically fragmen means, returns the electrophoretic length, which depending tation pattern. Use of other Separation and detection means 55 on average base composition and labeling moiety is typi can Suggest the use of other appropriate Supplemental data. cally within 10% of the physical length. In this case database Where phasing primers are used, Supplemental columns lookup can Search all relevant entries whose physical length are used with RE pairs in order to further identify the is within 10% of the reported electrophoretic length, per effective target Subsequence. A similar method can be form corrections to obtain electrophoretic length, and then employed to take account of the SEQ-QEATM method addi 60 check for a match with the detected Signal. Alternative tional Subsequence information. In this latter case, the lookup implementations are apparent, one being to precom additional information is not available until after the OEATM pute the electrophoretic length for all predicted fragments, method experiment is performed. construct an alternate table indeX over the electrophoretic Before describing how this simulated database is length, and then directly lookup the electrophoretic length. generated, it is useful first to describe how this database is 65 Other Separation and detection means can require corre used to predict experimental results. Returning to FIG. 14, sponding augmentations to lookup to correct for their par labels are used to detect binding reaction events by Subse ticular experimental biases and inaccuracies. It is understood 6,057,101 97 98 that where database lookup is referred to Subsequently, In embodiments of the QEATM method wherein target either simple physical lookup or augmented lookup is meant Subsequence are recognized with accuracy, Such as the RE as appropriate. embodiments, the comparison of target Subsequence against If matched candidate database Sequences are found, then input Sequence should be exact, that is the bases should the Selected database can be consulted to determine other match in a one-to-one manner. In embodiments where target information concerning these Sequences, for example, gene Subsequences are leSS accurately recognized, the String name, tissue origin, chromosomal location, etc. If an unpre match should be done in a leSS exact, or fuZZy, manner. In dicted fragment is found, this fragment can be optionally this case the String operation, which generates the vector of retrieved from the length Separation means, cloned or ends, should accept partial T-n matches as well as exact Sequenced, and used to Search for homologues in a DNA matches. In this, the String operations generate the false Sequence database or to isolate or characterize the previ positive matches expected from the experiments and permit ously unknown gene or Sequence. In this manner, the these fragments to be identified. Ambiguity in the Simulated QEATM method can be used to rapidly discover and identify database, however, increases, Since more fragments leads to new geneS. a greater chance of fragments of identical length and end The QEATM computer methods are also adaptable to other 15 labels. formats of an experimental definition. For example, the FIG. 17A illustrates end vectors 1401 and 1402, compris labeling of the target Subsequence recognition moieties can ing three and two ends, respectively, generated by RE1 and be stored in a table Separate from the table defining the RE2, which are for this example assumed to be REs with a experimental reactions. 4 bp overhang. The first overhang in vector 1401 occurs Now turning to the methods by which the simulated between nucleotide 10 and 14 in the input Sequence. database is generated, FIG. 16 illustrates a basic method, Step 1304 of FIG.16 merges all the end vectors for all the termed herein mock fragmentation, which takes one end Subsequences and Sorts the elements on the position of Sequence and the definition of one reaction of an experiment the end. Vector 1404 of FIG. 17B illustrates the result of this and produces the predicted results of the reaction on that step for example end vectors 1401 and 1402. Sequence. Generation of the entire Simulated database 25 Step 1305 of FIG. 16 then creates the fragments generated requires repetitive eXecution of this basic method. by the reaction by Selecting the parts of the full input Turning first to a description of mock fragmentation, the Sequence that are delimited by adjacent ends in the merged method commences at 1301 and at 1302 it inputs the and Sorted end Vector. Since the experimental conditions in Sequence to be fragmented and the definition of the frag conducting the QEATM method should be selected such that mentation reaction, in the following terms: the target end target end Subsequence recognition is allowed to go to Subsequences RE1 . . . REn, where n is typically 2 or 3, and completion, all possible ends are recognized. For the restric the Subsequences to be recognized by third Subsequence tion endonuclease embodiments, the cutting and ligase reac probes, P1 . . . Pn, where n is typically 0 or 1. Note that post tions should be conducted Such that all possible RE cuts are PCR disruption probes act as unlabeled end Subsequences made and to each cut end a labeled primer is ligated. These and are So treated for input to this method. The operation of 35 conditions insure that no fragments contain internal unrec the method is illustrated by example in FIGS. 17A-F for the ognized target end Subsequences and that only adjacent ends case RE1, RE2 and P1. in the merged and Sorted vector define generated fragments. At Step 1303, for each target end Subsequence, the method Where additional information is needed for simulated makes a “vector of ends', which has elements which are database entries to adapt to inaccuracies in particular Sepa pairs of nucleotide positions along the Sequence, each pair 40 ration and detection means, Such information can be col being labeled by the corresponding end Subsequence. For lected at this step. For example, in the case of electrophoretic embodiments where end Subsequences are recognized by Separation, fragment Sequence can be determined and per hybridizing oligonucleotides, the first member of each pair cent G+C content computed and entered in the database is the beginning of a target end Subsequence and the Second along with the fragment accession number. member is the end of a target end Subsequence. For embodi 45 For the PCR embodiments, the fragment length is the ments where target end Subsequences are recognized by difference between the end position of the second end restriction endonucleases, the first member of each pair is Subsequence and the Start position of the first end Subse the beginning of the Overhang region that corresponds to the quence. For RE embodiments, the fragment length is the RE recognition Subsequence and the Second member is the difference between the start position of the second end end of that overhang region. It is preferred to use RES that 50 Subsequence and the Start position of the first end Subse generate 4bp overhangs. The actual target end Subsequences quence plus twice the primer length (48 in the preferred are the RE recognition Sequences, which are preferably 4-8 primer embodiment). bp long. FIG. 17C illustrates the exemplary fragments generated, This vector is generated by a String operation which each fragment being represented by a 4 member tuple compares the target end Subsequence in a 5' to 3’ direction 55 comprising: the two end Subsequences, the length, and an against the input Sequence and SeekS String matches, that is indicator whether the third subsequence probe binds to this the nucleotides match exactly. Where effective target Sub fragment. In FIG. 17C the position of this indicator is sequences are formed by using the SEQ-QEATM method or indicated by a *. Fragment 1408 is defined by ends 1405 alternative phasing primers, it is the effective Subsequences and 1406, and fragment 1409 by ends 1406 and 1407. There that are compared. This can be done by Simply comparing 60 is no fragment defined by ends 1405 and 1407 because the the end Subsequence against the input Sequence Starting at intermediate end Subsequence is recognized and either fully one end and proceeding along the Sequence one base at time. cut in an RE embodiment or used as a fragment end priming However, it is preferable to use a more efficient String position in a PCR embodiment. For simplicity, the fragment matching algorithm, such as the Knuth-Morris-Pratt or the lengths are illustrated for the RE embodiment without the Boyer-Moore algorithms. These are described with sample 65 primer length addition. code in Sedgewick, 1990, Algorithms in C, chap. 19, Step 1306 of FIG. 16 checks if a hybridization probe is Addison-Wesley, Reading, Mass. involved in the experiment. If not, the method skips to Step 6,057,101 99 100 1309. If so, step 1307 determines the sequence of the If the SEQ-QEATM method or alternative phasing primers fragment defined in step 1305. FIG. 17D illustrates that the are used, effective target Subsequences are used. Any of fragment Sequences for this example are the nucleotide Several criteria can be used to ascertain the amount of Sequences within the input Sequence that are between the information obtained, and any of Several algorithms can be indicated nucleotide positions. For example, the first frag used to perform the reaction optimization. ment Sequence is the part of the input Sequence between A preferred criteria for ascertaining the amount of infor positions 10 and 62. Step 1308 then checks each third mation uses the concept of “good Sequence.” A good Subsequence probe Subsequence against each fragment Sequence for an experiment is a Sequence for which there is Sequence to determine whether there is any match (i.e., at least one reaction in the experiment that produces a unique whether the probe has a Sequence complementary enough to Signal from that Sequence, that is, a fragment is produced the fragment Sequence Sufficient for it to hybridize thereon). from that good Sequence, by at least one recognition If a match is found, an indication is made in the fragment 4 reaction, that has a unique combination of length and member tuple. This match is done by String Searching in a labeling. For example, returning to FIG. 15B, the sequence Similar manner to that described for generation of the end with accession number AO1 is a good Sequence because VectOrS. 15 reaction 1 produces Signal 1215, with length 52 and with Next at step 1309 of FIG. 16, all the fragment are sorted both target end Subsequences recognized by RE1, uniquely on length and assembled into a vector of Sorted fragments, from Sequence A01. However, Sequence S003 is not a good which is output from the mock fragmentation method at Step Sequence because there are no unique signals produced only 1310. This vector contains the complete list of all fragments, from S003: reaction R2 produces signal 1216 from both AO1 with probe information, defined by their end Subsequences and S003 and signal 1219 from both Q012 and S003. Using and lengths that the input reaction will generate from the the amount of good Sequences as an information measure, input Sequence. the greater the number of good Sequences in an experiment FIG. 17E illustrates the fragment vector of the example the better is the experimental design. Ideally, all possible Sorted according to length. For illustrative purposes, third Sequences in a Sample would be good Sequences. Subsequence probe P1 was found to hybridize only to the 25 Further, a quantitative measure of the expression of a third fragment 1412, where a Y is marked. 'N' is marked good Sequence can simply be determined from the detected in all the other fragments, indicating no probe binding. Signal intensity of the fragment unlguely produced from the The Simulated database is generated by iteratively apply good Sequence. Relative quantitative measures of the ing the basic mock fragmentation method for each Sequence expression of different good Sequences can be obtained by in the Selected database and each reaction in the experimen comparing the relative intensities of the Signal uniquely tal definition. FIG. 18 illustrates a simulated database gen produced from the good Sequences. An absolute quantitative eration method. The method starts at 1501 and at 1502 inputs measure of the expression of a good Sequence can be the Selected representative database and the experimental obtained by including a concentration standard in the origi definition with, in particular, the list of reactions and their nal Sample. Such a Standard for a particular experiment can related Subsequences. Step 1503 initializes the digest data 35 consist of Several different good Sequences known not to base table So that lists of accession numbers may be inserted occur in the original Sample and which are introduced at for all possible combinations of fragment length and target known concentrations. For example, exogenous good end Subsequences. Step 1504, a DO loop, causes the iterative sequence 1 is added at a 1:10 concentration in molar terms; execution of steps 1505, 1506, and 1507 for all sequences in exogenous good sequence 2 at a 1:10" in molar terms, etc. the input Selected database. 40 Then comparison of the relative intensity of the unique Step 1505 takes the next Sequence in the database, as Signal of a good Sequence in the Sample with the intensities Selected by the enclosing DO loop, and the next reaction of of the unique signal of the Standards allows determination of the experiment and performs the mock fragmentation the molar concentrations of the Sample Sequence. For method of FIG. 16, on these inputs. Step 1506 adds the example, if the good Sequence has a unique Signal intensity Sorted fragment vector to the Simulated database by taking 45 half way between the unique Signal intensities of good each fragment from the vector and adding the Sequence Sequences 1 and 2, then it is present at a concentration half accession number to the list in the database entry indexed by way between the concentrations of good Sequences 1 and 2. the fragment length and end Subsequences and probe (if Another preferred measure for ascertaining the amount of any). FIG. 17F represents the simulated database entry list information produced by an experiment is derived by lim additions that would result for the example mock fragmen 50 iting attention to a particular set of Sequences of interest, for tation reaction of FIGS. 17A-E. For example, accession example a set of known oncogenes or a Set of receptors number AO1 is added to the accession number list in the known or expected to be present in a particular tissue entry 1412 at length 151 and with both end Subsequences Sample. An experiment is designed according to this mea Finally, step 1507 tests whether there is another reaction Sure to maximize the number of Sequences of interest that in the input experiment that should be Simulated against this 55 are good Sequences. Whether other Sequences possibly Sequence. If So, Step 1505 is repeated with this reaction. If present in the Sample are good Sequences is not considered. not, the DO loop is repeated to Select another database These other Sequences are of interest only to the extent that Sequence. If all the database Sequences have been Selected, the Sequences of interest produce uniquely labeled frag the step 1508 outputs the simulated database and the method ments without any contribution from these other Sequences. ends at 1509. 60 The QEATM method experimental design is adaptable to 5.4.5.2. OEATM EXPERIMENTALDESIGN METHODS other measures for ascertaining information from an experi The goal of the experimental design methods is to opti ment. For example, another measure is to minimize on mize each experiment in order to obtain the maximum average the number of Sequences contributing to each amount of quantitative information. An experiment is detected Signal. A further measure is, for example, to mini defined by its component recognition reactions, which are in 65 mize for each possible Sequence the number of other turn defined by the target end Subsequences recognized, Sequences that occur in common in the same Signals. In that third Subsequences recognized, if any, and labels assigned. case each Sequence is linked by common occurrences in 6,057,101 101 102 fragment labelings to a minimum number of other taken to be 1.0, So that the energy equals the temperature. Sequences. This can simplify making unambiguous Signal The minimum of the energy and temperature, denoted Eo peaks of interest (see infra). and To, respectively, are defined by the maximum of the Having chosen an information measure, for example the information measure. For example, if the number of good number of good Sequences, for an experiment, the optimi Sequences of interest is G and is used as the information Zation methods choose target Subsequences, and possibly measure, then E, which equals T, equals 1/G. An initial probes, which optimize the chosen measure. One possible temperature, denoted T, is preferably chosen to be 1. An initial experimental definition, or State, is chosen, either optimization method is exhaustive Search, in which all randomly or guided by prior knowledge of previous experi Subsequences in lengths less than approximately 10 are mental optimizations. Finally, two execution parameters are tested in all combinations for that combination which is chosen. These parameters define the “annealing Schedule', optimum. This method requires considerable computing that is the manner in which the temperature is decreased power, and the upper bound is determined by the computa during the execution of the Simulated annealing method. tional facilities available and the average probability of They are the number of iterations in an epoch, denoted by N. occurrence of Subsequences of a given length. With adequate which is preferably taken to be 100 and the temperature resources, it is preferable to Search all Sequences down to a 15 decay factor, denoted by f, which is preferably taken to be probability of occurrence of about 0.005 to 0.01. Upper 0.95. Both N and f may be systematically varied case-by bounds may range from 8 to 11 or 12. case to achieve a better optimization of the experiment A preferred optimization method is known as Simulated definition with a lower energy and a higher information annealing. See Press et al., 1986, Numerical Recipes- The CSUC. Art of Scientific Computing, S. 10.9, Cambridge University With choices for the information measure or energy PreSS, Cambridge, U.K. Simulated annealing attempts to function, the moves for generating new experiments, an find the minimum of an “energy' function of the “state' of initial State or experiment, and the execution parameters a System by generating Small changes in the State and made as above, the general application of Simulated anneal accepting Such changes according to a probabilistic factor to ing to optimize an experimental definition is illustrated in create a “better new state. While the method progresses, a 25 FIG. 20A. The information measure used in this description simulated “temperature”, on which the probabilistic factor is the number of good Sequences of interest. Any informa depends and which limits acceptance of new States of higher tion measure, Such as those previously described, may be energy, is slowly lowered. used alternately. In the application to the methods of the QEATM method The method begins at step 1701. At step 1702 the tem experimental design, a “State', denoted by S, is the experi perature is Set to the initial temperature; the State to the mental definition, that is the target end Subsequences and initial State or experimental definition; and the energy is Set hybridization probes, if any, in each recognition reaction of to the energy of the initial state. At step 1703 the temperature the experiment. The "energy', denoted E, is taken to be 1.0 and energy are checked to determine whether either is less divided by the information measure, So that when the energy than or equal to the minima for the information measure is minimized, the information is maximized. Alternatively, 35 chosen, as the result of either a fortuitous initial choice or the energy can be any monotonically decreasing function of Subsequent computation Steps. If the energy is less than or the information measure. The computation of the energy is equal to the minimum energy, no further optimization is denoted by applying the function E() to a state. possible, and the final experimental definition and its energy The preferred method of generating a new experiment, or is output. If the temperature is less than or equal to the State, from an existing experiment, or State, is to make the 40 minimum temperature, the optimization is stopped. Then the following changes, also called moves to the experimental inverse of the energy is the number of good Sequences of definition: (1) randomly change a target end Subsequence in interest for this experimental definition. a randomly chosen recognition reaction; (2) add a randomly Step 1706 is a DO loop which executes an epoch, or N chosen target end Subsequence to a randomly chosen reac iterations, of the Simulated annealing algorithm. Each itera tion; (3) remove a randomly chosen target end Subsequence 45 tion consists of steps 1707 through 1711. Step 1707 gener from a randomly chosen reaction with three or more target ates a new experimental definition, or State, Se, according Subsequences; (4) add a new reaction with two randomly to the described generation moves. Step 1708 ascertains or chosen target end Subsequences; and (5) remove a randomly determines the information content, or energy, of S. Step chosen reaction. All target end Subsequences are to be 1709 tests the energy of the new state, and, if it is lower than chosen from available RE recognition Sequences. If the 50 the energy of the current State, at Step 1711, the new State and SEQ-QEATM method or alternative phasing primers are used new energy are accepted and replace the current State and to generate effective target Subsequences, all Subsequences current energy. If the energy of the new State is higher than must be chosen from among Such effective target Subse the energy of the current State, Step 1710 computes the quences that can be generated from available RES. In the following function. case of the SEQ-QEATM method, the extra subsequence 55 information is not known until the QEATM method experi EXP-(E-E)/T (4) ment is performed. To generate a new experimental This function defines the probabilistic factor controlling definition, one of these moves is randomly Selected and acceptance. If this function is less than a random chosen carried out on the existing experimental definition. number uniformly distributed between 0 and 1, then the new Alternatively, the various moves can be unequally weighted. 60 State is accepted at Step 1711. If not, then the newly In particular, if the number of reactions is to be fixed, moves generated State is discarded. These Steps are equivalent to (4) and (5) are skipped. The QEATM method is further accepting a new State if the energy is not increased by an adaptable to other moves for generating new experiments. amount greater than that determined by function (4) in Preferable generation methods will generate all possible conjunction with the Selection of a random number. Or in experiments. 65 other words, a new State is accepted if the new information Several additional Subsidiary choices are needed in order measure is not decreased by an amount greater than indi to apply Simulated annealing. The “Boltzman constant” is rectly determined by function (4). 6,057,101 103 104 Finally, after an epoch of the algorithm, at Step 1712 the Further extensions of the first manner can be made to temperature is reduced by the multiplicative factor f and the cases where more than one of the possible Sequences is not method loops back to the test at step 1703. a good Sequences if the Sequences which are not good appear Using this algorithm, Starting from an initial experimental as contributors to further signals involving good Sequences definition which has certain information content, the algo in a manner which allows their quantitative presences to be rithm produces a final experimental definition with a higher determined. For example, Suppose signal 1219 is of interest, information content, or lower energy, by repetitively and where both possible Sequences are not good Sequences. The randomly altering the experimental definition in order to quantitative presence of Sequence Q012 can be determined from signals 1220 and 1218 in the manner previously Search for a definition with a higher information content. outlined. The quantitative presence of Sequence S003 can be The computation of the energy of an experimental determined from signals 1216 and 1215. Thereby, the definition, or state, in step 1708 is illustrated more detail in Sequences contributing to Signal 1219 can be determined. FIG.20B. This method starts at step 1720. Step 1721 inputs More complex combinations can be Similarly made unam the current experimental definition. Step 1722 determines a biguous. complete digest database from this definition and a particu An alternative extension of the first manner of making lar selected database by the method of FIG. 18. Step 1723 15 unambiguous is by designing a further experiment in which Scans the entire digest database and counts the number of the possible Sequences contributing to a signal of interest are good Sequences of interest. If the total number of good good Sequences even if they were not originally So. Since Sequences is the measure used, the total number of good there are approximately 50 suitable REs that can be used in Sequences can be counted. Alternatively, other information the RE embodiment of the QEATM method (Section 6.2), measures may be applied to the digest database. Step 1724 there are approximately 600 RE reaction pairs that can be computes the energy as the inverse of the information performed, assuming that half of the theoretical maximum measure. Alternatively, another decreasing function of the of 1,250 (50x50/2=1,250) are not useable. Since most RE information content may be used as the energy. Step 1725 pairs produce on the average of 200 fragments and Standard outputs the energy, and the method ends at Step 1726. electrophoretic techniques can resolve at least approxi 5.4.5.3. THE OEATM METHOD AMBIGUITY RESO 25 mately 500 fragment lengths per lane, the RE QEATM LUTION method embodiment has the potential of generating over In one utilization of the OEATM method, DNA from two 100,000 signals (500x200-100,000). The number of pos related tissue samples can be Subject to the same experiment, Sible Signals is further increased by the use of reactions with perhaps consisting of only one recognition reaction, and the three or more REs and by the recognition of third Subse outcomes compared. The two tissue Samples may be other quences. Further, Since the average complex human tissue, wise identical except for one being normal and the other for example brain, is estimated to express no more than diseased, perhaps by infection or a proliferative process, approximately 25,000 genes, there is a 4 fold excess of Such as hyperplasia or cancer. One or more signals may be possible signals over the number of possible sequences in a detected in one Sample and not in the other Sample. Such Sample. Thus it is highly likely that for any Signal of interest, Signals might represent genetic aspects of the pathological 35 a further experiment can be designed and optimized for proceSS in one tissue. These signals are of particular interest. which all possible candidates of the Signal of interest are The candidate Sequences that can produce a Signal of good Sequences. This design can be made by using the prior interest are determined, as previously described, by look-up optimization methods with an information measure the in the digest database. The Signal may be produced by only Sequences of interest in the Signal of interest and Starting one Sequence, in which case it is unambiguously identified. 40 with an extensive initial experimental definition including However, even it the experiment has been optimized, the many additional reactions. In that manner, any signal of Signal may be ambiguous in that it may be produced by interest can be made unambiguous. Several candidate Sequences from the Selected database. A A Second manner of making unambiguous is by automati Signal of interest may be made unambiguous in Several cally ranking the likelihood that the Sequences possibly manners which are described herein. 45 present in a Signal of interest are actually present using In a first manner of making unambiguous assume the information from the remainder of the experimental reac Signal of interest is produced by Several candidate Sequences tions. FIG. 21 illustrates a preferred ranking method. The all of which are good Sequences for the particular experi method begins at step 1801 and at step 1802 inputs the list ment. Then which Sequences are present in the Signal of possible accession numbers in a Signal of interest, the interest can be ascertained by determining the quantitative 50 experimental definition, and the actual experimental results. presence of the good Sequences from their unique signals. DO-loop 1803 iterates once for each possible accession For example, referring to FIG. 15B, if the signal 1217 of number. Step 1804 performs a simulated experiment by the length 175 with the labeling 1213 is of interest, the method illustrated in FIG. 11 in which, however, only the Sequences actually present in the Signal can be determined current accession number is acted on. The output is a Single from the quantitative determination of the presence of 55 sequence digest table, such as illustrated in FIG. 17F. signals 1215 and 1218. Here, both the possible sequences Step 1805 determines a numerical Score of ranking the contributing to this signal are good Sequences for this Similarity of this digest table to the experimental results. One experiment. possible Scoring metric comprises Scanning the digest table The first manner of making unambiguous can be extended for all fragment Signals and adding 1 to the Score if Such a to the case where one of the Sequences possibly contributing 60 Signal appears also in the experimental results and Subtract to a signal is not a good Sequence. The quantitative presence ing 1 from the Score if Such signal does not appear in the of all the possible good Sequences can be determined from experimental results. Alternate Scoring metrics are possible. the quantitative Strength of their unique signals. The pres For example, the subtraction of 1 may be omitted. ence of the remaining Sequence which is not a good Step 1806 sorts the numerical scores of the likelihood that Sequences can be determined by Subtracting from the quan 65 each possible accession number is actually present in the titative presence of the Signal of interest the quantitative sample. Step 1807 outputs the sorted list and the method presences of all the good Sequences. ends at step 1808. 6,057,101 105 106 By this method likelihood estimates of the presence of the Input and output data are preferably Stored on disk devices various possible Sequences in a signal of interest can be such as 1604, 1605, 1624, and 1625 connected to computer determined. 1601 through links 1606. The data can be stored on any 5.4.6. APPARATUS FOR PERFORMING THE OEATM combination of disk devices as is convenient. Thereby, links METHODS 1606 can be either local attachments, whereby all the disks An apparatus for the QEATM method includes means for can be in the computer cabinet(s), LAN attachments, performing the computer implemented QEATM experimental whereby the data can be on other local Server computers, or analysis and design methods and optionally for performing remote links, whereby the data can be on distant Servers. the QEATM method recognition reactions in a preferably Instruments 1630 and 1631 exemplify laboratory devices automated fashion, for example by the protocols of S for performing, in a partly or wholly automatic manner, the 6.1.12.1 (entitled “QEATM Preferred RE Method”). In the QEATM method recognition reactions. These instruments embodiment herein presented both elements are described. can be, for example, automatic thermal cyclers, laboratory In an alternative embodiment, the laboratory methods can be robots, and controllable Separation and detection apparatus, performed by other means, for example manually, and the Such as is found in the applicants copending U.S. patent apparatus needed is limited to the computer apparatus 15 application 08/438,231 filed May 9, 1995, incorporated by described for performing the experimental design and analy reference herein in its entirety. Links 1632 exemplify control sis methods. and data links between computer 1601 and controlled FIG. 19A illustrates in exemplary apparatus for the devices 1631 and 1632. They can be special buses, standard QEATM method embodiments. Computer 1601 can be, LANs, or any suitable link known in the art. These links can alternatively, a UNIX based work Station type computer, an alternatively be computer readable medium or even manual MS-DOS or Windows based personal computer, a Macin input eXchanged between the instruments and computer tosh personal computer, or another equivalent computer. In 1601. Outline arrows 1634 and 1635 exemplify the physical a preferred embodiment, computer 1601 is a PowerPCTM flow of Samples through the apparatus for performing based Macintosh computer with Software Systems capable of experiments 1607 and 1613. Sample flow can be either running both Macintosh and MS-DOS/Windows programs. 25 automatic, manual, or any combination as appropriate. In FIG. 19B illustrates the general Software structure in alternative embodiments there may be fewer or more labo RAM memory 1650 of computer 1601 in a preferred ratory devices, as dictated by the current State of the labo embodiment. At the lowest Software level is Macintosh ratory automation art. operating system 1655. This system contains features 1656 On this complete apparatus, a QEATM method experiment and 1657 for permitting execution of UNIX programs and is designed, performed, and analyzed, preferably in a man MS-DOS or Windows programs alongside Macintosh pro ner as automatic as possible. First, a QEATM method experi grams in computer 1601. At the next higher Software level ment is designed, according to the methods specified in S are the preferred languages in which the QEATM computer 5.4.5 (entitled “QEATM Analysis And Design Methods”) as methods are implemented. LabView 1658, from National implemented by experimental routines 1662 on computer Instruments (Dallas, Tex.), is preferred for implementing 35 1601. Input to the design routines are databases of DNA control routines 1661 for the laboratory instruments, exem Sequences, which are typically representative Selected data plified by 1651 and 1652, which perform the recognition base 1605 obtained by selection from input comprehensive reactions and fragment Separation and detection. C or C++ sequence database 1604, as described in S 5.4.5 (entitled languages 1659 are preferred for implementing experimen “QEATM Analysis And Design Methods”). Alternatively, tal routines 1662, which are described in S 2 (entitled 40 comprehensive DNA databases 1604 can be used as input. “QEATM Analysis And Design Methods”). Less preferred Database 1604 can be local to or remote from computer but useful for rapid prototyping are various Scripting lan 1601. Database selection performed by processor 1601 guages known in the art. PowerBuilder 1660, from Sybase executing the described methods generates one or more (Denver, Colo.), is preferred for implementing the user representative selected databases 1605. Output from the interfaces to the computer implemented routines and meth 45 experimental design methods are tables, exemplified by ods. Finally, at the highest Software level are the programs 1609 and 1615, which, for a QEATM method RE implementing the described computer methods. These pro embodiment, Specify the recognition reaction and the RES grams are divided into instrument control routines 1661 and used for each recognition reaction. experimental analysis and design routines 1662 Control Second, the apparatus optionally performs the designed routines 1661 interact with laboratory instruments, exem 50 experiment. Exemplary experiment 1607 is defined by tissue plified by 1651 and 1652, which physically perform the Sample 1608, which may be normal or diseased, experimen QEATM method and CC protocols. Experimental routines tal definition 1609, and physical recognition reactions 1610 1662 interact with Storage devices, exemplified by devices as defined by 1609. Where instrument 1630 is a laboratory 1654 and 1653, which store DNA sequence databases and robot for automating reaction, computer 1601 commands experimental results. 55 and controls robot 1630 to perform reactions 1610 on cDNA Returning to FIG. 19A, although only one processor is samples prepared from tissue 1608. Where instrument 1631 illustrated, alternatively, the computer methods and instru is a Separation and detection instrument, the results of these ment control interface can be performed on a multiprocessor reactions are then transferred, automatically or manually, to or on Several Separate but linked processors, Such that 1631 for separation and detection. Computer 1601 com instrument control methods 1661, computational experi 60 mands and controls performance of the Separation and mental methods 1661, and the graphical interface methods receives detection information. The detection information is can be on different processors in any combination or Sub input to computer 1601 over links 1632 and is stored on combination. Storage device 1624, along with the experimental design Input/output devices include color display device 1620 tables and information on the tissue Sample Source for controlled by a keyboard and standard mouse 1603 for 65 processing. Since this experiment uses, for example, fluo output display of instrument control information and experi rescent labels, detection results are Stored as fluorescent mental results and input of user requests and commands. traces 1611. 6,057,101 107 108 Experiment 1613 is processed similarly along Sample reactions that could be produced by this accession number. pathway 1633, with robot 1630 performing recognition In a further window, these lengths and reactions are reactions 1616 on cDNA from tissue 1608 as defined by displayed, and the user allowed to Select further reactions for definition 1615, and device 1631 performing fragment Sepa display in order to confirm or refute the presence of this ration and detection. Fragment detection data is input by accession number in the tissue Sample. If one of these other computer 1601 and stored on storage device 1625. In this fragments are generated uniquely by this sequence (a "good case, for example, Silver Staining is used, and detection data Sequence', see Supra), that fragment can be highlighted as of is image 1617 of the stained bands. particular interest. By displaying the results of the generat During experimental performance, instrument control ing reaction of that unique fragment, a user can quickly and routines 1661 provide the detailed control Signals needed by unambiguously determine whether or not that particular instruments 1630 and 1631. These routines also allow opera accession number is actually present in the Sample. tor monitoring and control by displaying the progreSS of the In another interface alternative, the System displays two experiment in process, instrument Status, instrument excep experiments Side by Side, displaying two histological images tions or malfunctions, and Such other data that can be of use 1621 and two experimental results 1622. This allows the to a laboratory operator. 15 user to determine by inspection signals present in one Third, interactive experimental analysis is performed Sample and not present in the other. If the two Samples were using the database of Simulated Signals generated by analy diseased and normal Specimens of the same tissue, Such sis and design routines 1662 as described in S 5.4.5 (entitled Signals would be of considerable interest as perhaps reflect “QEATM Analysis And Design Methods”). Simulated data ing differences due to the pathological process. Having a base 1612 for experiment 1607 is generated by the analysis Signal of interest, preferably repeatable and reproducible, a methods executing on processor 1601 using as input the user can then determine the likely accession numbers caus appropriate Selected database 1605 and experimental defi ing it by invoking the previously described interface facili nition 1609, and is output in table 1612. Similarly table 1618 ties. In a further elaboration of this embodiment, system is the corresponding Simulated database of Signals for 1601 can aid the determination of signals of interest by experiment 1613, and is generated from appropriate Selected 25 automating the Visual comparison by performing Statistical database 1605 and experimental definition 1615. A signal is analysis of Signals from Samples of the same tissue in made unambiguous by experimental routines 1662 that different States. First, Signals reproducibly present in tissue implement the methods described in S 5.4.5 (entitled Samples in the same State are determined, and Second, “QEATM Analysis And Design Methods”). differences in these reproducible signals acroSS Samples Display device 1602 presents an exemplary user interface from the several states are compared. Display 1602 then for the QEATM method data. This user interface is pro shows which reproducible Signals vary acroSS the States, grammed preferably by using the Powerbuilder display front thereby guiding the user in the Selection of Signals of end. At 1620 are Selection buttons which can be used to interest. Select the particular experiment and the particular reaction of This apparatuS has been described above in an embodi the experiment whose results are to be displayed. Once the 35 ment adapted to a single Site implementation, where the experiment is Selected, histological images of the tissue various devices are substantially local to computer 1601 of Source of the Sample are presented for Selection and display FIG. 19A, although the various links shown could also in window 1621. These images are typically observed, represent remote attachments. Alternative, explicitly distrib digitized, and Stored on computer 1601 as part of Sample uted embodiments of this apparatus are possible as is preparation. The results of the Selected reaction of the 40 apparent to those of ordinary skill in the computer arts. Selected experiment are displayed in Window 1622. Here, a All the computer implemented QEATM methods can be fluorescent trace output of a particular labeling is made recorded for Storage and transport on any computer readable available. Window 1622 is indexed by marks 1626 repre memory devices known in the art. For example, these Senting the possible locations of DNA fragments of Succes include, but are not limited to, Semiconductor memories Sive integer lengths. 45 such as ROMs, PROMs, EPROMs, EEPROMS, etc., of Window 1623 displays contents from simulated database whatever technology or configuration-magnetic 1612. Using, for example, mouse 1603, a particular frag memories-Such as tapes, cards, disks, etc of whatever ment length indeX 1626 is Selected. The processor then density or size-optical memories-Such as optical read retrieves from the Simulated database the list of accession only memories, CD-ROM, or optical writable memories numbers that could generate a peak of that length with the 50 and any other computer readable memory technologies. displayed end labeling. This window can also contain further information about these Sequences, Such as gene name, 6. EXAMPLES bibliographic data, etc. This further information may be The following examples further illustrate the different available in Selected databases 1605 or may require queries features of the invention but do not in any way limit the to the complete Sequence database 1604 based on the 55 Scope of the invention which is defined by the appended accession numbers. In this manner, a user can interactively claims. This Section describing examples has been divided inquire into the possible Sequences causing particular results into a Section describing protocols that are common to and can then Scan to other reactions of the experiment by Several of the examples and another Section that is a descrip using buttons 1620 to Seek other evidence of the presence of tion of the examples themselves. these Sequences. 60 6.1. DESCRIPTION OF PROTOCOLS It is apparent that this interactive interface has further The following Sections describe protocols for use. alternative embodiments Specialized for classes of users of 6.1.1. MATING PROTOCOL differing interests and goals. For a user interested in deter Mating of the yeast a and C. Strains is preferably per mining tissue gene expression, in one alternative, a particu formed according to a filter disc mating protocol, which lar accession number is selected from window 1623 with 65 achieves efficient cell handling, limited cell doublings, and mouse 1603, and processor 1601 Scans the Simulated data high mating efficiencies. An alternate leSS preferred protocol base for all other fragment lengths and their recognition is a plate mating protocol, which has less favorable charac 6,057,101 109 110 teristics. After mating according to either protocol, the library members in Separate cultures and to Separately boost mating efficiency preferably is estimated according to a each member for 4-8 hours in YPAD medium. protocol which determines the ratio of the number of yeast Next, the boosted cells are mixed to form the mating mix. diploids to the total number of yeast cells. The number of cells from each of the binding and activation The filter disc protocol is preferred since more cells can domain libraries to be mixed is preferably determined be mated with high mating efficiencies and with fewer cell according to the Statistical considerations of Section 5.2.7. doublings during mating than can be achieved-by prior Alternately, and equally preferably, the number of cells to be protocols, in particular by the plate mating protocol. Accord mixed can be simply determined according to the relation F*M*N, where F is a factor, M is the complexity of the ing to filter disc mating, at least approximately 3x10" cells, binding domain library, and N is the complexity of the to at least approximately 6x10 cells, to approximately activation domain library. “Library complexity” is taken 1x10" cells, to approximately 2x10 cells, and up to approxi herein to mean the number of Separate clones in the library. mately 3.5x10° cells can be mated per 90 mm filter disc. The factor F is approximately at least 50, more preferably (These cell numbers correspond approximately to mating 75, or even more preferably 100 or greater. Cell number can cell densities of approximately at least 5x10", at least 1x10, be found from measurement of ODoo, where 1 ODoo unit at least 1.5x10, at least 3.5x10, and up to 4-6x10° cells, 15 equals approximately 2x107 cell/ml. Where one of the respectively, per Square millimeter on the filter-disc during libraries is of limited complexity and the library members mating.) In contrast, plate mating is limited to, for example, are maintained in Separate cultures, an equal number of each approximately 1x10 mating cells on each 10 mm plate (a library member is mixed to attain the required cell number. mating cell density of 6x10 cells per Square millimeter). Next, aliquots of cells from the mating mix are packed Therefore, the filter disc mating is more efficient in that it onto filter discS Soaked in rich medium, preferably, by uses fewer mating resources, and consequently is capable of vacuum aspiration. When a preferred 90 mm diameter filter processing mating experiments of greater complexity, which disc is used, the aliquots contain preferably between 1.5 and require a greater number of mated cells. Further, according 2.0x10 cells and more preferably approximately 1.8x10 to filter disc mating, no more than approximately one cell cells. For filter discs of other diameters, the preferred doubling occurs during the conditions prevailing during the 25 number of cells can be Scaled according to the relative areas mating period, whereas with plate mating Several cell dou of the discs. A Sufficient number of filter discs is used to blings can occur during mating. Thus, interacting colonies accommodate the total number of cells in the mating mix. AS observed after filter disc mating are more likely to represent Soon as the cells are packed on the filter disc, vacuum independent and unique protein-protein interactions than are aspiration is Stopped and the filter disc is placed on a large the colonies observed after cells mating. Finally, both filter YPAD plate, taking care that no air bubbles remain between disc and plate mating usually achieve Similarly high mating the filter disc and the plate. The plate(s) carrying the filter efficiencies (fraction of diploids formed) of approximately disc(s) is then incubated for approximately 6-10 hours at 25% to 50%. This invention is, however, adaptable to other approximately 30 C. to permit cell mating. A preferred filter mating protocols that achieve efficient cell handling, limited disc is Catalog no. HAWP 090 25 from the Millipore cell doublings during mating, and comparable mating effi 35 Corporation (Bedford, Mass.), and has a diameter of 90 mm ciencies. with a pore Size of 0.45 um. A preferred vacuum aspiration In Summary, according to the filter disc mating protocol, unit is a 500 ml large filtration unit from the Fisher Scientific transformed yeast cells are grown to from mid to late log Corporation (Pittsburgh, Pa.). phase to Stationary phase on media Selective for the appro Finally, after the mating incubation, the mated cells are priate transforming plasmids, and then are briefly boosted on 40 suspended in 1000 ml of sterile water by Swirling the filter rich medium immediately prior to mating. The boosted cells disc(s), and are then Screened for protein-protein interac of both mating Strains are mixed in numbers Sufficient tions by plating on appropriate media Selective for diploid according to the Statistical considerations disclosed in Sec cells bearing interacting binding domain and activating tion 5.2.7. Aliquots of the mixed cells are packed by, e.g., domain fusion proteins. For Screening efficiency and Vacuum Suction onto filter discS, which can be of paper, 45 effectiveness, it is advantageous to plate no more than nylon, or any other Suitable material capable of retaining approximately 50-100 expected interactant coclonies or no yeast cells. The filter discS with the packed cells are incu more the approximately 10 expected diploid cells per plate. bated at a temperature and for a time Sufficient to allow cell These expected numbers can be simply obtained as follows. mating. Finally, mated cells are harvested and transferred to The expected number of diploids can be simply found by media Selective for appropriate for diploids optionally, an 50 multiplying the density of mated cells and the mating aliquot of the harvested cells is used to estimate the mating efficiency, where the cell density can be estimated from the efficiency. ODoo and the mating efficiency can be estimated according In more detail, a preferred embodiment of this protocol to the following protocol. The expected number of interac proceeds according to the following detailed Steps. First, tants among the mated cells can be found by further multi prior to mating, yeast cells bearing activation and binding 55 plying the expected number of diploids by the expected rate domain fusion plasmid libraries are grown for at least two of protein-protein interactions. The latter rate can be esti days, or until Stationary phase, on media Selective for the mated from experience with various mating, and in appropriate plasmid. Stationary phase cells are then particular, it has been found for libraries of interest derived “boosted” just prior to mating by a brief growth period on from human Samples that the expected rate of protein rich media to numbers 3 to 5 fold higher than required for 60 protein interactions is approximately 2–6x10'. Using these mating. A Volume of 1–2 ml of Stationary phase library yeast expected numbers, one of skill in the art will be able to plate is diluted in 1000 ml of YPAD media (Sherman et al., eds., the mated cells according to the preferred criteria. Even with 1991, Getting Started with yeast, Vol. 194, Academic Press, Such careful plating, however, Screens of complex libraries, New York) and grown for 4–8 hours at approximately 30° C. which require large numbers of mated cells, can require Where one of the libraries is of limited complexity, for 65 many 10s or even a few hundred plates. example of a complexity, for example of complexity leSS Briefly, mating efficiency can be estimated by plating than 10 or less than 50, it is advantageous to maintain the Serial dilutions of an aliquot of the Suspended, mated cells. 6,057,101 111 112 The ODoo, and thus the cell concentration, is measured N.Y.). Example: 100 mg in 1 ml, 1 g in 10 ml. 0.2 volumes after resolving cell flocculation by adding EDTA up to a of chloroform are added and vortexed for 15 seconds, and concentration of 2 mM. Serial dilutions from 10° to 10' phases separated by centrifugation (5000xg, 15 min). The are then plated onto each of three plates, a first plate aqueous phase is precipitated with 0.6 volumes of Selective for activation domain plasmids, a Second plate 2-propanol. The precipitated RNA is pelleted at 10,000xg Selective for binding domain plasmids, and a third plate for 15 min, rinsed with 70% ethanol and dried. The RNA Selective for diploid cells. Mating efficiency is estimated pellet is resuspended in water to give a final concentration of 100 ng/ul. from the Set of plates with easily counted colonies as twice 6.14. DNASE TREATMENT the ratio of the number of diploid colonies to the sum of the 0.2 volumes of 5x reverse transcriptase buffer (Life number of colonies containing each of the plasmids. An Technologies), 0.1 volumes of 0.1 M DTT, and 5 units independent estimate of the cell density can also be obtained RNAguard/100 mg starting tissue (Pharmacia Biotech, from the Serial dilution plates. Uppsala, Sweden) are added to the RNA extracted according In addition to the filter disc protocol, mating is performed to Section 6.1.3. One unit RNase-free DNase I (Pharmacia as per standard protocols (Sherman et al., eds., 1991, Getting Biotech)/100 mg starting tissue is added, and the mixture is Started with yeast, Vol. 194, Academic Press, New York). 15 incubated at 37 C. for 20 min. 10 volumes of Triazol is Briefly, for the plate mating protocol, cells are grown until added and RNA extraction by addition of chloroform and mid to late log phase on Solid or liquid media that Select for precipitation is repeated. the appropriate plasmids. The two mating Strains, a and C, 6.1.5. MESSENGER RNA PURIFICATION are then mixed together as a paste onto a rich Solid media RNA concentration is estimated by measuring ODeo of a like YPAD (Sherman et al., eds., 1991, Getting Started with 100-fold dilution of extracted RNA mixture after DNase yeast, Vol. 194, Academic Press, New York) and incubated treatment. The Dynal oligo(dT) magnetic beads have a at 30° C. for 6-8 hr. The cells are then transferred to capacity of 1 ug poly(A+) per 100 ug of beads (1 mg/ml Selective media appropriate for the desired diploids. concentration). Assuming that 2% of the total RNA is In a preferred embodiment of the plate mating protocol, poly(A+), 5 volumes of Lysis/Binding buffer (Dynal, Oslo, 1x10 cells/ml of each mating type are mixed for minutes at 25 Norway) and sufficient beads to bind poly(A+) are added. room temperature and then plated onto a 150 mm diameter This mixture is heated at 65° C. for 2 min and then incubated at room temperature for 5 min. The beads are first washed YPAD plate and incubated at 30° C. for 6–8 hours. Then, the with 1 ml washing buffer/LiDS (Dynal), then with 1 ml contents of the plate are harvested in a volume of 1-2 ml in washing buffer (Dynal, Oslo, Norway) twice. The poly(A+) the appropriate Selective media and transferred to a 150 mm RNA is eluted with 1 ul water/ggbeads twice. diameter plate that has the Selective medium for Selecting 6.16. cDNASYNTHESIS AND CONSTRUCTION OF interactions. Alternatively, the YPAD plate with the mating FUSION-LIBRARIES mix can be replica-plated onto another 15 cm diameter plate cDNA synthesis is performed using the Hybrizap Two that has the Selective medium for Selecting interactions. Hybrid cDNA synthesis and Gigapack cloning kit cl)NA 6.1.2. TRANSFORMATION PROTOCOL Synthesis kit (Stratagene) according to the manufacturers Yeast transformations are performed by the lithium 35 protocol with the following modifications. The cDNA syn acetate procedure (I to et al., 1983, J. Bacteriol. thesis is performed substantially as per the Gubler-Hoffman 153:163–168) and the transformants are selected by plating method (Gubler and Hoffman, 1983, Gene 25:263-269). In on appropriate Selective media that are usually Synthetic the first strand synthesis step, MoMuLV reverse tran Complete (SC) media that lack the appropriate nutrients Scriptase is used for reverse transcription. The primer (from (Sherman et al., eds., 1991, Getting Started with yeast, Vol. 40 the kit) (GAGAGAGAGAGAGAGAGAGAACTAG 194, Academic Press, New York). TCTCGAGTTTTTTTTTTTTTTTTTT) (SEQ ID NO:36) In detail, lithium acetate transformation proceeds accord used in the first Strand Synthesis also adds an XhoI site near ing to the following StepS. Cells to be transformed are grown the end. After the Second Strand Synthesis, EcoRI adapters overnight in rich medium like YPAD medium (Sherman et (also from the kit) are ligated to the cDNAS using standard al., eds., 1991, Getting Started with yeast, Vol. 194, Aca 45 linker ligation conditions according to manufacturer's demic Press, New York) and then diluted two-fold in rich (Stratagene's) protocols. The identities of the EcoRI adapt medium and shaken for two hours at 30° C. The cells are ers are AATTCGGCACGAG (SEQID NO:37) and CTCGT. pelleted and washed with sterile water and with transforma GCCG (SEQ ID NO:38). Following this, the cDNA is tion buffer (0.1 M LiAc in 10x TE buffer at pH 7.5). The digested with EcoRI and XhoI and cloned into the EcoRI washed cells are pelleted and resuspended in three times the 50 and XhoI sites of the Hybrizap vector (Stratagene), which is pellet volume of transformation buffer. In an Eppendorf a lambda phage vector, using the manufacturer's protocols. tube, to 80 ul of this cell Suspension, are added 28 ug of The phagemid paD-GAL4 bearing the cDNA inserts is single-stranded salmon sperm DNA in 10x TE buffer, and removed by in Vivo excision using the reagents and proto 1-10 ug of appropriate, transforming plasmid DNA, and colS provided in the Hybrizap Gigapack cloning kit. This which is then incubated at room temperature for 5-10 55 creates a cDNA library, containing plasmid pad-GAL4, minutes. Then to each Eppendorf tube, are added 500 ul of with the sense strand being in frame with the GAL4 acti a mixture of 40% PEG with a molecular weight of 3350 and Vation domain of the plasmid pad-GAL4 (Stratagene). 60% transformation buffer, which are incubated at 30° C. for Plasmid paD-GAL4 contains LEU2 to facilitate selection in 20–20 minutes, after which 58 ul of DMSO is added. The media lacking leucine. cells are heat-shocked for 10-15 minutes in a 42-45 C. 60 In a different embodiment, the activation domain fusion water-bath, washed in TE buffer, resuspended in 200 ul of library is created in the vector pACT2 (Clontech). The TE buffer, and plated onto appropriate Selective media. EcoRI-XhoI linked cDNA is cloned between the EcoRI and 6.13. RNA EXTRACTION SalI sites in paCT2. This creates a cDNA library with the The tissue to be extracted is weighed and a 10-fold Sense Strand being in frame with the GALA activation Volume/weight of Triazol reagent (Life Technologies, 65 domain in the plasmid paCT2 (Clontech). Plasmid paCT2 Gaithersburg, Md.) is added and the tissue ground with a contains LEU2 to facilitate Selection in media lacking leu Polytron homogenizer (Brinkman Instruments, Westbury, CC. 6,057,101 113 114 In the case of cloning into pAS2-1 (Clontech) or pBD selection with 5-FOA according to the following protocol GAL (Stratagene) to create a library of DNA-binding has been observed to routinely reduce fortuitous activation domain fusion genes, the EcoRI-XhoI linked cDNA is to a rate of less than approximately 1x10'. If a fortuitous cloned between the EcoRI and SalI sites in paS2-1 or activation rate greater than approximately 1x10' is found, pBD-GAL to create a cDNA library in plasmid paS2-1 or further protocol steps replica plating (as described below) PBD-GAL, with the sense strand being in frame with the are performed. Accordingly, this embodiment is most pre GAL4 DNA-binding domain. Statistically, one in every ferred for binding domain libraries of any complexity, and especially for complex binding domain libraries. three clones will represent a true open reading frame. The preferred 5-FOA negative selection protocol pro Plasmids paS2-1 or pBD-GAL contain TRP1 to facilitate ceeds according to the following Steps. Approximately Selection in media lacking tryptophan. 2x10 cells transformed with the binding domain library are 6.17. TRANSFORMATION OF THE REPORTER shaken for approximately 2 hours at 30° C., pelleted, and STRAINS WITH THE BINDING DOMAIN FUSION then resuspended in 50 ml of sterile water. Using the cell cDNA LIBRARY AND ACTIVATION DOMAIN cDNA density calculated from the measured ODoo (1 ODoo unit LIBRARY TO CREATE MAND “N POPULATIONS equals approximately 2x10 cells/ml), an aliquot containing The strains YULH and N106' (see Sections 6.3.2 and 15 approximately 10 cells is plated on a large plate containing 6.3.4) are transformed with the paS2-1, pBD-GAL, and the media Selective for the binding domain plasmid and con pAD-GAL4 or paCT2 cdNA libraries, respectively, by taining 5-FOA. A sufficient number of plates is plated so that lithium acetate protocol (Section 6.1.2, Ito et al., 1983, J. the total number of cells plated equals approximately three Bacteriol. 153:163–168). One lug of library DNA generally times the complexity of the binding domain library. yields a maximum of 1x10° transformants. The transfor After overnight incubation at room temperature, the plate mants are selected on either media lacking leucine (for (s) is incubated at 30° C. until the colonies grow up. These pAD-GAL4/pACT2) or lacking tryptophan and containing colonies are then replica plated to another large plate with 5-FOA (for paS2-1 or pBS-GAL). In the latter case, all the same medium lacking tryptophan and containing 5-FOA, GAL4 DNA-binding domain (GBD)-fusions that fortu and once again the colonies are allowed to grow up. After itously activate transcription on their own will be eliminated 25 2-3 days, the colonies are harvested by Scraping and since 5-FOA kills the URA+ cells. It is preferred that 5-FOA pooling, and the cells are Stored in 15 glycerol and 3% negative-Selection be performed according to the protocol to DMSO. be Subsequently described. The transformants are harvested The replica plating Step is important in achieving the extra in the appropriate media (SC-Leu for paD-GAL4/pACT2 reduction in fortuitous activation rate. optionally, this replica and SC-TRP for paS2-1 or pBD-GAL) to a final cell density plating Step can be repeated until the fortuitous activation of 2x10 to 2x10" cells/ml and preferably 2x10 cells/ml rate no longer declines. The fortuitous activation rate at each and stored in aliquots at -70° C. after making them 10% in Step of replica plating can be estimated by plating Serial DMSO or glycerol. dilutions of a sample of harvested cells on medium Selective Negative Selection of the binding domain library trans for the reporter gene, and finding the ratio of positive formants to eliminate fortuitous activation of the reporter 35 colonies to the total cells plated known from the cell density. genes is, as has been described, always important but is It has been found that a single replica plating achieves most especially So in the case of complex activation or binding of the decrease in fortuitous activation, and that Subsequent domain libraries. Since fortuitous activation can occur in up replica platings generally do not result in further significant to 1-5% of binding domain transformants, without such decreases. Replica plating is the preferred method of Selec negative Selection, finding the occasional protein-protein 40 tively removing only yeast cells that are actively growing in interaction among the numerous false-positive, fortuitously the toxic environment from Substantially all other yeast activating binding domain transformants is virtually impos cells, including dead cells, cells which are living but not Sible. For example, a binding domain library of complexity viable, and cells which are dormant in the toxic environment 107 with a fortuitous activation rate of 1% results in approxi but still viable and capable of future growth in a non-toxic mately 10 false positive colonies for each activation 45 environment. Further, any dormant URA3' cells that are domain library member. Individually Screening Such a vast transferred into a new media will enter into a new growth number of false-positive colonies for true protein-protein phase and will, thereby, be inhibited or killed by the 5-FOA. interactions is clearly quite impractical. Effective use of Negative Selection can also be done according to a bait complex libraries depends on negative Screening protocols validation protocol, which Screens both fortuitously which greatly reduce fortuitously activating binding domain 50 activating binding domain fusion proteins and also fusion transformants. proteins in the activation domain library or the binding Since it has been found that fortuitous activation by domain-library that non-specifically associate with other activating-domain fusions with the GAL4 activating proteins, and thereby activate reporter gene expression. Bait domains are almost never observed, negative-Selection of validation is most advantageously applied to matings in activating-domain transformants is not usually useful. 55 which one library has such limited complexity (the “bait” In more detail, the preferred negative Selection protocol library) that each member can be separately manipulated and achieves a fortuitous activation rate of preferably less than Separately maintained in individual cultures. Briefly, bait approximately 5x10, or less than approximately 4x10, validation Separately mates each member of the bait library or less than approximately 3x10, or less than approxi with the more complex library and Selects out and removes mately 2x10, or preferably less than approximately 1x10 60 from further consideration those bait-library members that 6, or even less. Simple plating of binding domain libraries on too frequently activate reporter gene expression. AS plates that negatively Select for the expression of reporter described in Section 5.2.8, for mammalian or human genes such as URA3, LYS2, Can1, or CYH2 has been found Samples, it is most preferred to Select out those library to result in a fortuitous activation rate of no less than members that activate reporter gene(s) with a frequency approximately 10' to approximately 10 in the harvested 65 greater than approximately 10. cells. However, most advantageously, where URA3 is used In more detail, the more complex library is grown for 4-8 as one of the reporter genes, it has been found that negative hours by inoculating 1-2 ml of frozen library stock (or 6,057,101 115 116 enough Stock to achieve an ODoo of approximately 0.2) in microtiter plates. The following reactants are premixed and 500 ml of a rich medium like YPAD. After this growth, the are added to each well: cell density is measured, for example from ODoo values, and aliquots of approximately 50,000 colonies per plate are plated on plates Selective for the appropriate library plasmid. 41 ul Water Beginning on the Second day of complex library growth on 5 ul 10x PCR2 buffer (1x PCR2 Buffer = 20 mm these plates, each member of the low complexity library is Tris-HCl pH 8.55, 16 mM ammonium sulfate, grown to Stationary phase in media Selective for the appro 2.5 mM MgCl2, 150 ug/ml BSA) priate library plasmid, and then 300 ul aliquots of this 0.2 titl 50 pmful of M13-40AD5 + BACREVAD3 (Ab Peptides, St. Louis, MO) for amplifying Stationary-phase culture are plated onto YPAD mating activation domain fusions plates. Then, the more complex library is also replica plated 0.2 titl 50 pmful of paS3BacREV + pASForM13-40 for onto the mating plates, which are then incubated for 10 amplifying binding domain fusions hours and 30° C. for cell mating to occur. 0.3 ul 25 U/ml KlenTaq:Pfu (16:1 volume ratio) The mated cells are Screened by replica plating them onto two plates, one with media appropriately Selective for dip 15 Next add 1.5 ul of the appropriate yeast DNA template loid cells and the other with media appropriately Selective prepared according to the previous protocol to each well. for diploid cells with reporter gene activation. Each member Preferably, this contains approximately 1-10 ng of DNA. of the bait library for which the most preferred rate of The microtiter plate is briefly equilibrated to 94 C. for 15 reporter gene activation is exceeded is not used further. seconds and 2 ul of 5 mM dNTPs are added to each well. The 6.18. INTERACTANT PCR following thermal profile is then performed: After mating (Section 6.1.1), PCR can be performed using cells positive for protein-protein interactions in order to discover the fusion fragments responsible for the interaction. 94° C. for 4 minutes after adding dNTPs; PCR is preferably performed on DNA templates derived 94° C. for 40 seconds: from lysed yeast cells in a 96 (or greater) well format. Less 25 50° C. for 40 seconds: preferably, PCR is performed on whole cells, which are 72° C. for 3 minutes; then repeat 94-50-72 C. for five cycles; lysed at the denaturation temperature of the first PCR 94° C. for 40 seconds: thermal cycle. 58 C. for 40 seconds: The preferable PCR protocols proceed, first, by producing 72° C. for 4 minutes; then repeat 94-58-72 C. yeast DNA template, and second, by PCR amplification of for 28 cycles; this template. Yeast DNA template is produced by treating 72° C. for 5 minutes. an aliquot of yeast cells positive for interaction, first, with a cell-wall lytic enzyme, Such as Zymolase, to dissolve cell The PCR amplification is adaptable to certain variations of walls, and Second, with a proteolytic enzyme, Such as this thermal profile according to guidelines known in the art. Proteinase K, to inactivate all other lytic enzymes. Protein 35 For example, the reaction time at 72 C. can be adjusted for ase K Self inactivates and need not be separately inactivated. the expected length of products, generally allowing one In detail, 10 ul of Zymolase solution is added to each well minute for each kilo-base. A three minute time permits of a 384 or a 96 well PCR plate. Zymolase solution is 2.4M amplification of up to three kilo-base fragments. The cycle Sorbitol, 100 mM sodium phosphate buffer at pH 7.4, 60 mM numbers can be chosen according to the abundance of the B-mercaptoethanol, 1 mM EDTA, and 5 mg/ml of Zymolase. 40 yeast template and the PCR reaction efficiency. These num This Solution is made by adding Zymolase to Small aliquots bers can be Sufficiently large to detect products but not So of the Sorbitol/sodium phosphate buffer just before use. An large that amplification background interferes with product aliquot of 10 ul of yeast cells from colonies positive for detection. The most preferred hot-Start protocol is done in pre protein-protein interaction (Section 6.1.9) is added to each 45 well of the plate and the plate is incubated at room tem waxed 96-well PCR plates. A preferred wax, which melts at perature for 30 minutes or at 37 C. for 5 minutes. The approximately 72° C. is a 90:10 mixture of Paraffin:Chill samples are then held at 4 C. until the next step. Next, 10 out TM 14. The paraffin is a highly purified paraffin wax All of 30 tug/ml of Proteinase K is added to each well, and the melting between 58 C. and 60° C. Such as can be obtained plate is incubated sequentially at 50 C. for 10 minutes, 95 50 from Fluka Chemical, Inc. (Ronkonkoma, N.Y.) as Paraffin C. for 10 minutes, and then held at 4 C. Wax cat. no. 76243. Chillout TM14 Liquid Wax is a low Using this yeast DNA template product, PCR is prefer melting, purified paraffin oil available from MJ Research. ably performed with a hot-start protocol. Hot-start protocols Pre-waxed PCR plates are made by layering approximately 40 ul of the melted wax on the upper third of the wall of each are advantageous to reduce false priming and primer-dimer well in the PCR plate, and by allowing it to solidify. The formation. One preferred hot-Start protocol proceeds by 55 adding an essential PCR reaction component, preferably the PCR mix is divided into a “lower mix” and an “upper mix,” dNTPs, after the reaction mixture has reached the denatur which individually do not react, of the following composi ation temperature of, for example, 94 C. A most preferred tions. hot-Start protocol proceeds by Separating two components of LOWER MIX: the PCR reaction mix by a wax layer in a reaction wells. The 60 amplification only commences when the reaction mix has been sufficiently pre-heated to melt the wax layer allow the 25 ul Water two components to mix. 3 ul 10x PCR2 buffer (1x PCR2 Buffer = 20 mm Tris-HCl pH 8.55, 16 mM ammonium sulfate, A first preferred hot-start PCR reaction is done in a 2.5 mM MgCl2, 150 ug/ml BSA) reaction volume of approximately 50 ul in wells of a 96 well 65 2 till dNTPs (5 nM equi-molar mixture)) microtiter plate. It will be apparent to those of skill in the art how to Scale the reaction conditions for, e.g., 384 well 6,057,101 117 118 UPPER MIX: To amplify the fusion gene insert from p AS2, p AS2-1, pASSfil, pBD-GAL4, and other related vectors such as pAS1 (collectively referred to herein as “pAS-like vectors') 15.2 ul Water (pAS1 is a parental GAL4-DNA binding domain vector; see 2 till 10x PCR2 buffer Durfee et al., 1993, Genes Dev. 7:555–569), one of the 0.25 ul 100 pmful of primer (M13-40AD5 for following primer pairs can be used: activation domain fusions; pAS3BacREV for pAS3BacREV+pASForM13-40 binding domain fusions) (Ab Peptides, St. Louis, MO) pACTBAC+pASFOR 0.25 ul 100 pmful of primer (BACREVAD3 for pASSEQI--pASSEQII activation domain fusions; pASForM13-40 1O for binding domain fusions) (Ab Peptides, pASSEQIA+pASSEQII St. Louis, MO) pASForM13-40, PASSEQI, and paSSEQIA are inter 2 till SM Betaine changeable. p AS3BACREV and paCTBAC are inter 0.3 ul 25 U/ml KlenTaq:Pfu (16:1 volume ratio) changeable. 15 To amplify the fusion gene insert from paCT, paCT2, The protocol proceeds according to the following StepS. pACTS fil, p AD-GAL4 and other related vectors 30 ul of the lower mix is dispensed into each PCR reaction (collectively referred to herein as “pACT-like vectors”), one well. Any droplets on the Sides of the Wells are centrifuged of the following primer pairs can be used: down for approximately 10 seconds. The wax is then melted M13-40-BACREVAD3 and Solidified onto the top of the lower mix by carrying out pACTBAC+pACTFOR the following thermal program: 72 C. for 3 minutes; then 65 C., 55 C., and 50° C. in turn for 1 minute each; then 45° pACTBAC+pACTFORII C., 40° C., 35 C., 30° C. in turn for 30 seconds each; then pACTSEQI-pACTSEQII hold at 25 C. Next, 20 ul of the upper mix is carefully added pACTSEQI--pACTBAC to each PCR well on top of the wax layer. Next, 2 ul of the 25 pACTSEQII+PACTFOR appropriate yeast DNA template are added to each reaction pACTSEQII+pACTFORII well. PCR amplification is then performed according to the BACREVAD3, p.ACTBAC and pACTSEQII are inter following thermal program: changeable. M13-40AD5, p.ACTFORII, and pACTSEQI are inter changeable. 94 C. for 4 minutes after adding dNTPs; The identities of the above-listed primers are as follows: 94 C. for 40 seconds: pAS3BACREV=5'-AGG AAA CAG CTA TGA CCA 50° C. for 40 seconds; 72° C. for 3 minutes; then repeat 94-50-72 C. TCT GAG AAA GCA ACC TGA CCT (SEQ ID for five cycles; NO:118) 94 C. for 40 seconds: 35 pASForM13-40-5'-GTT TTC CCA GTC ACG ACG 58 C. for 40 seconds: 72° C. for 4 minutes; then repeat 94-58-72 C. GTG CGA CAT CAT CAT CGG AAG (SEQ ID for 28 cycles; NO:119) 72° C. for 5 minutes: M13-40AD5=5'-GTTTTC CCA GTCACG ACG AGG 4. C. hold. 40 GAT GTTTAATAC CAC TAC (SEQ ID NO:120) BACREVAD3=5'-AGGAAA CAG CTATGACCATGC The reaction time at 72 C. is chosen assuming that Some of ACA GTT GAA GTGAACTTG C (SEQ ID NO:121) the yeast DNA template will be up to 2 kb in size. pACTSEQII-5'-CGATGC ACA GTT GAA GTGAAC Advantageously, the fluid manipulation Steps of this pro 3' (SEQ ID NO:1) tocol can be performed by a Standard laboratory robot, Such 45 pACTFORII-5'-CGC GTTTGG AAT CAC TAC AGG as that available from the Tecan Corporation. GAT G-3 (SEQ ID NO:2) Finally, a less preferable, alternative, whole-cell PCR is pACTBAC=5'-CTA CCA GAA TTC GGC ATG CCG performed under the following conditions: GTA GAG GTG TGG TCA-3' (SEQ ID NO:3) Reaction volume: 100 ul pASFOR-5'-ATGAAG CTA CTG TCTTCT ATC GAA 10xPC2 Buffer for Klentaq polymerase: 10 ul (1xPC2 50 C-3' (SEQ ID NO:4) Buffer=20 mm Tris-HCl pH 8.55, 16 mM ammonium p ACT FOR = 5'- Sulfate, 2.5 mM MgCl, 150 tug/ml BSA) ATGGATGATGTATATAACTATCTATTC-3' (SEQ ID 10 mM dNTPs: 3 ul NO:122) 50 pmoles of each primer pair 55 pACTSEQI-5'-TTGGAATCACTACAGGGATG-3' 1.0 ul of Klentaq polymerase (a thermostable DNA poly (SEQ ID NO:49) p ASSEQI = 5'- merase sold by AB Peptides Inc., St. Louis, Mo.). GAATTCATGGCTTACCCATAC-3' (SEQ ID NO:50) 2-5 ul of Saturated culture of yeast in water. p ASSEQ II = 5'- PCR is performed at 94° C. for 30 sec, 45–55° C. for 30 AACCTGACCTACAGGAAAGAGTTAC-3' (SEQ ID sec and 72 C. for 2 min, with each being repeated for 60 NO:51) 20–30 cycles. The annealing temperature (i.e., the pASSEQIA=5'-CCTCTAACATTGAGACAGCATAG-3' 45-55 C. for 30 sec step) depends on the melting (SEQ ID NO:52) The primers can be used in sequenc temperature of the primers used. The PCR primers are ing as well as in PCR. designed in Such a way that the melting temperature 6.19. RECOVERY OF COLONIES POSITIVE FOR usually lies between 45–55° C. 65 PROTEIN-PROTEIN INTERACTION A primer pair suitable for use according to either PCR Colonies that are URA+HIS+, and 3-AT are selected as protocol can be Selected from among those described below. positive for protein-protein interactions and arrayed onto 6,057,101 119 120 96-well (or 384-well) plates in which each well contains 100 In particular, a preferred protocol for performing the All of the appropriate selective media like SC-URA-HIS filter-lift assay for B-galactosidase activity is presented TRP-LEU+3-AT (SC medium lacking uracil, histidine, herein. An assay Solution is prepared by combining 100 ml tryptophan, leucine, and containing 3-amino-1,2,4-triazole). of Z-buffer, 0.27 ml f-mercaptoethanol, and 1 ml X-gal In an equally preferred mode, colonies that are URA+ and stock (5-bromo-4-chloro-3-indolyl-d-D-galactoside at a HIS+ are Selected on plates lacking Tyr, Leu, Ura, His. Thus, concentration of 33.4 mg/ml in N,N-dimethylformamide). each well Serves as Source of a single colony positive for (Z-buffer is made by adding to 800 ml of water 16.1 g of protein-protein interactions, and each column or row in a NaHPO, 5.5g of NaHPO, 0.75 g of KCl, and 0.246 g of 96-well plate now serves as a pool of positive colonies. Cells MgSO.7HO, adjusting the pH to 7.0, and adding water to are grown at 30° C. until late log phase (ODoo of 1.5-2). 1000 ml.) For smaller yeast growth plates, a 75 mm filter These cells are processed further or stored frozen at -80 C. paper (Whatman 1 of VWR grade 413) is soaked in 1.8 ml after making them 10% in DMSO or glycerol. of assay Solution in a petri dish. For larger growth plates, Selection as above on plates with media entirely deficient 3–4 ml of assay Solution is used with a correspondingly in products of the reporter genes may cause certain weak larger filter paper. Yeast colonies are then lifted off the protein-protein interactions to be missed. In certain cases, it 15 growth plate with Optitran filter paper, Catalog no. BA-S 85 may be advantageous in order to detect Such weak protein Schleicher and Schull (Keene, N.H.), and the filter paper is protein interactions to Select on plates with trace quantities placed with the colonies facing up in a pool of liquid of the reporter gene products. In particular, in the case of the nitrogen for approximately 5 Seconds. Then the filter paper yeast strain YULH, the reporter gene URA3 can have a low is thawed at room temperature and then placed onto the filter level of natural expression. Thereby, Strong protein-protein paper Soaked with assay Solution, taking care that no air interactions are required for growth on media entirely lack bubbles remain between the two filter papers. The filter ing in uracil. To detect weaker protein interactions, it has papers are incubated at 30-37 C. for up to several hours. been found advantageous to include a trace amount of uracil Positive B-galactosidase activity is indicated by a blue color in the Selective media. It has been found that adding approxi appearing in from 1 minute to 10 hours. mately 1-10 uM, and preferably approximately 5 uM, of 25 ZZZZ, uracil to the Selective media allows the detection of weak 6.1.12. PROTOCOLS FOR OEATM METHODS AND protein-protein interactions that would otherwise have been SEO-OEATM METHODS missed. 6.1.12.1. PREFERRED OEATM RE METHOD 6.1.10. PRODUCTION OF PCR POOLS FOR CRE A DNA (preferably cDNA) population is input to the ATION OF PROTEIN INTERACTION MAPS QEATM method protocols described in this section. This If the total number of positive dolonies is less than 1500 DNA population can be pooled DNAS, each DNA encoding then they are readily pooled according to a two-dimensional an interactant protein identified according to the methods of pooling Scheme. 10 ul of each well in a given column or row the invention, or can be, or can be derived from, one or both are combined into a single pool and mixed well. The mix is of two DNA populations encoding the initial protein popu centrifuged at 1000 g for 2 minutes, resuspended in 100 ul 35 lations between which (in fusion form) protein interactions of water, centrifuged again as described above, and the are detected according to the invention. Supernatant discarded. The pelleted cells are preferably This protocol is designed to keep the number of individual lysed (Section 6.1.8), or less preferably, the PCR mix is manipulations down, and thereby raise the reproducibility of added directly to the pellet and mixed well. PCR is per the QEATM method procedure. In a preferred method, no formed wherein DNA-binding (pAS-specific or pBD-GAL 40 buffer changes, precipitations or organic (phenol/ Specific) and activation domain fusion specific primers chloroform) extractions are used, all of which lower the (pAD-GAL4/pACT-specific) amplify the genes encoding overall efficiency of the proceSS and reduce its utility for the two interacting proteins directly from yeast (Section general use and more specifically for its use in automated or 6.1.8). Thus, each PCR reaction refers to the “M” population robotic procedures. or the “N' population. Primers that can be used are 45 The protocol is described in terms of cDNA, but can be described in Section 6.1.8. used with any DNA. 6.1.11. B-GALACTOSIDASE ASSAYS 6.112.1.1 cDNA PREPARATION Filter-lift f8-galactosidase assays are performed as modi Terminal phosphate removal from cDNA is illustrated fied from the protocol of Breeden and coworkers (Breeden with the use of Barents Sea Shrimp alkaline phosphatase and Nasmyth, 1985, Cold Spring Harb. Symp. Quant. Biol. 50 (“SAP) (U.S. Biochemical Corp.) and 2.5 lug of cDNA. 50:643-650). The URA+HIS+ and 3-AT colonies are Substantially less (<10 ng) or more (>20 ug) of cDNA can patched onto SC-TRP-LEU-URA-HIS+3-AT plates, grown be prepared at a time with proportionally adjusted amounts overnight and replica plated onto Whatman no. 1 filter of enzymes. Volumes are maintained to preserve ease of papers overlayed onto SC-TRP-LEU plates and again grown handling. The quantities necessary are consistent with using overnight at 30° C. The filters with the grown colonies of 55 the method to analyze Small tissue Samples from normal or yeast are then assayed for B-galactosidase activity. Colonies diseased specimens. positive for B-galactosidase activity turn blue. Quantitative 1. Mix the following reagents B-galactosidase assays on yeast are performed as described previously by Coney and Roeder (Coney and Roeder, 1988, Mol. Cell. Biol. 8:4009-4017). Chemiluminescent 60 B-galactosidase assays are performed by using the Galacto 2.5 ul 200 mM Tris-HCL Light and Galacto-Light Plus Chemiluminescent reporter 23 ul cDNA assay System for the detection of B-galactosidase (Tropix, 2 till 2 unitsful Shrimp alkaline phosphatase Inc.) according to the manufacturer's protocols. Fluorescent B-galactosidase assays are performed using the FluoRe 65 The final resulting cDNA concentration is 100 ng/ul. porter lacz/Galactosidase Quantitation kit (Molecular 2. Incubate at 37° C. for 1 hour Probes) according to the manufacturer's protocols. 3. Incubate at 80 C. 15 minutes to inactivate the SAP. 6,057,101 121 122 6.1.12.1.2. PREFERRED RE/LIGASE AND AMPLIFI 1. Add 40 ul of the PCR reaction mix to each RE/ligation CATION REACTIONS reaction Once the cDNA has been prepared, including terminal 2. Perform the PCR temperature profile of FIG.22B using phosphate removal, it is Separated into a number of batches a PTC-100 thermal cycler (MJ Research, Watertown, of from 10 ng to 200 ng each, equal to the desired number 5 Mass.) 6.1.12.1.3. PREFERRED AUTOMATED of individual Samples that need to be analyzed and the extent RE/LIGASE REACTIONS of the analysis. For example, if six RE/ligase reactions and The reactions of the preceding Section can be automated Six analyses are needed to generate all necessary Signals. Six according to the following protocol which requires interme batches are made. Shown by example are 50 ng fractions. diate reagent additions or by a protocol note requiring Such RE/ligase reactions are performed as digestions by, additions. preferably, a pair of RES, alternatively, one or three or more Single Tube Protocol With Reaaent Additions RES can be used provided the four base pair overhangs Reactions are preformed in a standard 96 well thermal generated by each RE differ and can each be ligated to a cycler format using a Beckman Biomek 2000 robot uniquely adapter and a Sufficiently resolved length distribu (Beckman, Sunnyvale, Calif.). Typically 4 cDNA samples tion results. The amount of RE enzyme specified is sufficient 15 are analyzed in duplicate with 12 different RE pairs, for a for complete digestion while minimizing any other eXo- or total of 96 reactions. All steps are performed by the robot, endo-nuclease activity that may be present in the enzyme. including Solution mixing, from user provided Stock Adapters are chosen that are unique to each RE in a reagents, and temperature profile control. reaction. Thus, one uses a linker complementary to each Pre-annealed adapters are prepared as in the preceding unique RE Sticky overhang and a primer which uniquely Section. hybridized with that linker. The primer/linker combination is Restriction-Digestion/Ligation Reaction an adapter, which will preferably be uniquely and distin Mix per reaction: guishably labeled. 1. 1U of appropriate RE (New England Biolabs, Beverly, Adapter Annealing Mass.) Pairs of 12-mer linkers and 24-mer primers are pre 25 2. 1 ul of appropriate annealed adapter (10 pmoles) annealed to form adapters before they are used in the QEATM 3. 0.1 ul T4 DNA ligase 1. U?ul (Life Technologies method reactions, as follows: (Gaithersburg, Md.) 1. Add to water linker and primer in a 2:1 concentration ratio (12-mer:24-mer) with the primer at a total con 4. 1 ul ATP (Life Technologies, Gaithersburg, Md.) centration of 5 pM per ul. 5. 5 ng of subject prepared cDNA 2. Incubate at 50 C. for 10 minutes. 6. 1.5 ul 10x NEB2 buffer from New England Biolabs 3. Cool slowly to room temperature and store at -20° C. (Beverly, Mass.) Restriction-Digestion/Ligation Reaction 7. 0.5ul of 50 mM MgCl, Reactions are prepared for use in a 96 well thermal cycler. 8. Water to bring total volume to 10 ul and transfer to Add per reaction: 35 thermal cycler The robot requires 23 minutes total time to set up the 1. 1 U of appropriate REs (New England Biolabs, reactions. Then it performs the RE/ligation reaction by Beverly, Mass.) (preferred RE pair listing in S 6.1.12.3 following the temperature profile of FIG. 22C using a (entitled “Preferred QEATM Method Adapters and RE PTC-100 Thermal Cycler equipped with a mechanized lid Pairs”)) 40 from MJ Research (Watertown, Mass.). 2. 1 ul of appropriate annealed adapter Amplification Reaction 3. 1 ul of Ligase/ATP (0.2 tul T4DNA ligase 1. U?ul/0.8 Prepare the PCR reaction mix by combining: ul 10 mM ATP from Life Technologies (Gaithersburg, 1. 10 ul 5x E-Mg (300 mM Tris-HCl pH 9.0, 75 mM Md.)) (NHA)SO) 4. 0.5ul 50 mM MgCl 45 2. 100 pm of appropriate fluorescently labeled 24-mer 5. 10 ng of subject prepared cDNA primer 6. 1 ul 10x NEB2 buffer from New England Biolabs 3. 1 ul 10 mM dNTP mix (Life Technologies, (Beverly, Mass.) Gaithersburg, Md.) 7. Water to bring total volume to 10 ul 50 4. 2.5 U of 50:1 Taq polymerase (Life Technologies, Then perform the RE/ligation reaction by following the Gaithersburg, Md.) : Pfu polymerase (stratagene, La thermal profile in FIG. 22A using a PTC-100 Thermal Jolla, Calif.) Cycler from MJ Research (Watertown, Mass.). 5. Water to being volume to 35 ul per PCR reaction Amplification Reaction Preheat the PCR mix to 72° C. and transfer 35ul of the Prepare the PCR reaction mix by combining: 55 PCR mix to each digestion/ligation reaction and mix. The 1. 10 ul 5x E-Mg (300 mM Tris-Hcl pH 9.0, 75 mM robot requires 6 minutes for the transfer and mixing. (NH)SO, no Mg ions)) Then the robot performs the PCR amplification reaction 2. 100 pm of appropriate fluorescently labeled 24-mer by following the temperature profile of FIG. 22B using a primers PTC-100 thermal cycler equipped with a mechanized lid 60 (MJ Research, Watertown, Mass.). 3. 1 ul 10 mM dNTP mix (Life Technologies, The total elapsed time for the digestion/ligation and PCR Gaithersburg, Md.) amplification reactions is 179 minutes. No user intervention 4. 2.5 U of 50:1 Taq polymerase (Life Technologies, is required after initial experimental design and reagent Gaithersburg, Md.) : Pfu polymerase (Stratagene, La positioning. Jolla, Calif.) 65 Single Tube Protocol Without Reagent Additions 5. Water to bring volume to 40 ul per PCR reaction Then First, add the PCR reaction mix by combining in the perform the following Steps: reaction tube: