CONF-971146

Contact for queries about this publication:

Human Genome Program U.S. Department of Energy Office of Biological and Environmental Research ER-72GTN Washington, DC 20585 301/903-6488, Fax: 301/903-8521 E-mail: [email protected]

A limited number of print copies are available. Contact:

Betty K. Mansfield Oak Ridge National Laboratory 1060 Commerce Park, MS 6480 Oak Ridge, TN 37830 423/576-6669, Fax: 423/574-9888 E-mail: [email protected]

An electronic version of this document will be available, after the November 1997 meeting, at the Project Information Web site under Publications (http://www.ornl.gov/hgmis).

This report has been reproduced directly from the best obtainable copy.

Available to DOE and DOE contractors from the Office of Scientific and Technical Information; P.O. Box 62; Oak Ridge, TN 37831. Price information: 423/576-8401.

Available to the public from the National Technical Information Service; U.S. Department of Commerce; 5285 Port Royal Road; Springfield, VA 22161.

DOE Human Genome Program Contractor-Grantee Workshop VI November 9-13, 1997 Santa Fe, New Mexico

Date Published: October 1997

Prepared for the U.S. Department of Energy Office of Energy Research Office of Biological and Environmental Research Washington, D.C. 20585 under budget and reporting code KP 0404000

Prepared by Human Genome Management Information System Oak Ridge National Laboratory Oak Ridge, TN 37830-6480

Managed by LOCKHEED MARTIN ENERGY RESEARCH CORP. for the U.S. DEPARTMENT OF ENERGY UNDER CONTRACT DE-AC05-96OR22464

Contents

Introduction to Contractor-Grantee Workshop VI

Poster Number

Sequencing

1. Large Scale Genomic Sequencing at Washington University
    Stephanie L. Chissoe
2. The University of Texas Southwestern High Throughput DNA Project
    Glen A. Evans
3. Progress Towards Sequencing Selected Regions of Human Chromosome 19
    Jane Lamerdin
4. Analysis of Nine Tandem Zinc Finger-Containing Genes Located within a 339kb Region in Human Chromosome 19q13.2
    Mark Shannon
5. Genomic Sequencing of 16p13.3
    N.A. Doggett
6. The Beginning of the End: Mapping and Sequencing Human Telomeric DNA
    Robert K. Moyzis
7. 7q Telomere: Complete Sequencing
    Han-Chang Chi
8. Finished Sequence of 7q Telomere Region: Features, Validation and Polymorphism Detection
    P. Scott White
9. Large-scale High-quality Genomic Sequencing of Human Chromosome 7
    Shawn Iadonato
10. Sequencing Progress at Lawrence Berkeley National Lab and the Formation of JGI
    Chris Martin
11. Comparative Analysis of Human and Mouse Orthologous DNA to Identify Conserved Regulatory Regions
    Kelly A. Frazer
12. Comparative Genomic Sequencing of the Human and Mouse Fibroblast Growth Factor Receptor Type III
    Michael R. Altherr
13. Comparative Genomic Sequencing of the Murine Syntenic Segments Corresponding to Human Chromosome 16p13.3
    Michael R. Altherr
14. High-Throughput Sequencing of Eubacterial and Archeal Genomes
    Owen White
15. High-Throughput Transposon Mapping and Sequencing
    Robert B. Weiss
16. Complete Sequencing of the 2.3Mbp Genome of the Hyperthermophilic Archaean Pyrobaculum aerophilum
    Ung-Jin Kim
17. Microbial Genome Sequencing and Comparative Analysis
    D.R. Smith
18. Sequencing the Borrelia burgdorferi Genome
    John J. Dunn and F. William Studier
19. Sequence Analysis of the Plasmids of Bacillus anthracis
    R.T. Okinaka

Sequencing Technologies

20. Rhodobacter capsulatus Genome Sequencing Project: DENS Technology Testing Ground
    Mugasimangalam Raja
21. Cosmid Sized Templates for DENS Sequencing Technique
    Mugasimangalam Raja
22. Sequencing of Human Telomeric Region DNA by Differential Extension with Nucleotide Subsets (DENS)
    Dina Zevin-Sonkin
23. Primer Walking Through Alu Repeats Using DENS Sequencing Technique
    Dina Zevin-Sonkin
24. Structural Insights into the Properties of DNA Polymerases Important for DNA Sequencing
    Stanley Tabor
25. Semiautomated Mutagenesis and Screening of T7 RNA Polymerases
    Mark W. Knuth
26. Vectors and Biochemistry for Sequencing by Nested Deletions
    John J. Dunn
27. Trapping of DNA in Non-Uniform Oscillating Electric Fields
    Ger van den Engh
28. Energy Transfer Fluorescent Primers Optimized for DNA Sequencing and Diagnostics
    Alexander N. Glazer
29. Direct PCR Sequencing with Boronated Nucleotides
    Barbara Ramsay Shaw
30. Advances in DNA Analysis Using Capillary Array Electrophoresis
    Richard A. Mathies
31. Development of a Low Cost, High Throughput, Four Color DNA Sequencer
    Michael S. Westphall
32. Multiplexed Integrated On-line System for DNA Sequencing by Capillary Electrophoresis: From Template to Called Bases
    Edward S. Yeung
33. Production Sequencing Evaluation of a Nearly Automated 96-Capillary Array DNA Sequencer Developed at LBNL
    Jian Jin
34. DNA Sequencing by Capillary Electrophoresis from Sample Preparation to Resulting Sequence: A Robust High Throughput Procedure with Long Read Length
    B.L. Karger
35. A Fully Automated 96-Capillary DNA Sequencer
    Qingbo Li
36. Development of a Microchannel Based DNA Sequencer
    Courtney Davidson
37. Microchannel Process Development and Fabrication for DNA Sequencing
    Steve Swierkowski
38. An Optical Trigger for Locating Microchannel Position
    Laurence R. Brewer
39. Multiple Capillary DNA Sequencer Illuminated by a Waveguide
    Mark A. Quesada
40. Technology Development and Informatics Tools for Genome Sequencing
    E.R. Mardis
41. Single Molecule DNA Sequencing
    Richard A. Keller
42. On the Road to Rapid Exonuclease Screening for DNA Sequencing
    Richard A. Keller
43. Sizing of Individual DNA Fragments
    Yongseong Kim
44. Single Molecule DNA Detection in Microfabricated Capillary Electrophoresis Chips
    Richard A. Mathies
45. Sizing Megadalton DNA by Mass Spectrometry
    W. Henry Benner
46. Evaluation of New Membrane Surfaces for Chemiluminescent Assays
    Christopher S. Martin
47. Advances in Microfabricated Integrated DNA Analysis Systems
    Richard A. Mathies
48. The Flowthrough Genosensor Program at Oak Ridge National Laboratory
    Kenneth Beattie
49. Technical Aspects of Fabrication and Quantitative Analysis of DNA Micro-Arrays for Comparative Genomic Hybridization
    Damir Sudar
50. Multilabel SERS Gene Probes for DNA Sequencing
    T. Vo-Dinh
51. MicroArray of Gel Immobilized Compounds
    A. Mirzabekov
52. Flow Cytometry-Based Polymorphism Detection and Analysis
    John P. Nolan
53. DNA Characterization by Electrospray Ionization-Fourier Transform Ion Cyclotron Resonance Mass Spectrometry
    Richard D. Smith
54. Improved Mass Spectrometric Resolution for PCR Product Size Measurement
    Gregory B. Hurst
55. Laser Desorption Mass Spectrometry for Quick DNA Sequencing and Analysis
    C.H. Winston Chen
56. Genome Analysis Technologies
    George Church
57. 2'-Fluoro Modified Nucleic Acids: Polymerase-Directed Synthesis, Properties, and Stability to Analysis by Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry
    Tetsuyoshi Ono

Mapping and Resources

58. Development and Application of Subtractive Hybridization Strategies to Facilitate Gene Discovery
    Marcelo Bento Soares
59. EURO-IMAGE: the European IMAGE Consortium for Integrated Molecular Analysis of Human Gene Transcripts
    C. Auffray
60. The Functional Genomics Initiative at Oak Ridge National Laboratory
    Reinhold Mann
61. Defining the Function of the Human DNA Repair Protein XPF by Comparative Genomics and Protein Chemistry
    Sandra L. McCutchen-Maloney
62. Direct Isolation of Functional Copies of the Human Hprt Gene by TAR Cloning Using One Specific Targeting Sequence
    Vladimir Larionov
63. Mobile Elements and DNA Integration
    Jerzy Jurka
64. Study of Electric Field-Induced Transformation of Escherichia coli with Large DNA Using DH10B Cells
    Alexander Boitsov
65. Progress Towards the Construction of BAC Libraries from Flow Sorted Human
    Jonathan L. Longmire
66. Construction of Human Chromosome 5 and 16 Specific Libraries by TAR Cloning
    N. Kouprina
67. Construction of Genome-Wide Physical BAC Contigs Using Mapped cDNA as Probes: Toward an Integrated BAC Library Resource for Genome Sequencing and Analysis
    Ung-Jin Kim
68. Current Status of the Integrated Chromosome 16 Map
    N.A. Doggett
69. Construction of Sequence-Ready Physical Map of Human Chromosome 16p13.1-11.2 Region
    Ung-Jin Kim
70. Large-Scale BAC End Sequencing to Aid Sequence-Ready Map Construction
    Mark D. Adams
71. The Sequence Tagged Connector (STC) Approach to Genomic Sequencing: Accelerating the Complete Sequencing of the Human Genome
    Gregory G. Mahairas
72. Amplification and Sequencing of End Fragments from Bacterial Artificial Chromosome (BAC) Clones by Single-Primer PCR
    Joomyeong Kim
73. Construction of a High Resolution P1/PAC/BAC Map in the Distal Long Arm of Chromosome 5 for Direct Genomic Sequencing
    Jan-Fang Cheng
74. An Integrated Genetic and Physical Map of Human Chromosome 19: A Resource for Large Scale Genomic Sequencing and Gene Identification
    L.A. Gordon
75. Mapping by Two-Dimensional Hybridization
    Norman A. Doggett
76. Estimates of Gene Densities in Human Chromosome Bands
    Norman A. Doggett
77. Mapping and Functional Analysis of the Mouse Genome
    Dabney K. Johnson
78. The Albino (c) Region of Mouse Chr 7: A Model for Mammalian Functional Genomics
    M.J. Justice
79. Comparative Mapping and Expression Study of Two H19q13.4 Zinc-Finger Gene Clusters in Man and Mouse
    Joomyeong Kim
80. One Origin of Man: Primate Evolution Through Genome Duplication
    Julie R. Korenberg
81. Third-Strand In Situ Hybridization (TISH) to Non-Denatured Metaphase Spreads and Interphase Nuclei
    Marion D. Johnson
82. New Male-Specific, Polymorphic Tetranucleotide Microsatellites from the Human Y Chromosome
    P. Scott White
83. In Vivo Libraries of the Human Genome in Mice as a Reagent to Sift Genomic Regions for Function
    E.M. Rubin
84. Automatic BAC Contig Assembly by Three-Color Fluorescent Restriction Fragments
    H. Shizuya
85. Construction and Characterization of New BAC Libraries
    H. Shizuya
86. Completing the BAC-Based Physical Map of Human Chromosome 22
    Hiroaki Shizuya
87. A Subdermal Physiological Monitoring System For Mass Screening Mice In Gene Expression Studies
    M. N. Ericson
88. HLA Genotype Analysis of Chronic Beryllium Disease Sensitivity
    P. Scott White
89. A New Program to Develop a High-Resolution, High-Throughput Tomographic Imaging System for Mutagenized Mouse Phenotype Screening
    M.J. Paulus
90. Quantitative Detection of Nucleic Acids by Invasive Cleavage of Oligonucleotide Probes
    Mary Ann D. Brow
91. Quantitative DNA Fiber Mapping (QDFM) Techniques for Physical Map Construction and Quality Control
    Heinz-Ulrich G. Weier
92. Analysis of DNA Sequence Copy Number Variation in Breast Tumors and HPV16-Immortalized Cell Lines Using Comparative Genomic Hybridization to DNA Microarrays
    D. G. Albertson
93. Molecular Anatomy of a 20q13.2 Breast Cancer Amplicon
    C. Collins
94. 3D Genetic Analysis of Thick Tissue Specimens
    S.J. Lockett

Informatics

95. Informatics at the Center for Applied Genomics
    Donn Davy
96. The Genome Channel and Genome Annotation Consortium
    Edward C. Uberbacher
97. GRAIL and GenQuest Sequence Annotation Tools
    Edward C. Uberbacher
98. The Stationary Statistical Properties of Human Coding Sequences
    David C. Torney
99. WIT/WIT2: A System for Supporting Metabolic Reconstruction and Comparative Analysis of Sequenced Genomes
    Natalia Maltsev
100. Internet Release of the Metabolic Pathways Database, MPW
    Evgeni Selkov, Jr.
101. Metabolic Reconstruction from Sequenced Genomes
    Evgeni Selkov
102. From Genomic Sequence to Protein Expression: A Model for Functional Genomics
    Michael R. Altherr
103. FAKtory: A Customizable Fragment Assembly System
    Eugene W. Myers
104. Divide-and-Conquer Multiple Sequence Alignment
    Jens Stoye
105. Sequence Assembly Validation by Restriction Digest Analysis
    David J. States
106. Segmentation Based Analysis of Genomic Sequence
    Eric C. Rouchka
107. Swedish and Finnish Quality Based Finishing Tools for a Production Sequencing Facility
    Matt P. Nolan
108. Informatics to Support Increased Throughput and Quality Assurance in a Production Sequencing Facility
    Arthur Kobayashi
109. Software Tools for Data Analysis in Automated DNA Sequencing
    Michael C. Giddings
110. Treasures and Artifacts from Human Genome Sequence Analysis
    Darrell O. Ricke
111. Annotating and Masking Repetitive Sequences with RepeatMasker
    Arian F.A. Smit
112. Why Is Basecalling Hard to Do Well? Sources of Variability in DNA Sequencing Traces and Their Consequences
    David O. Nelson
113. A Statistical Model for Basecalling
    Lei Li
114. An Expert System for Base Calling in Four-Color DNA Sequencing by Capillary and Slab Gel Electrophoresis
    Arthur W. Miller
115. Web-Based Tools for the Analysis and Display of DNA Trace Data
    Judith D. Cohn
116. Joint Genome Institute's (JGI) Informatics Plans and Needs
    Darrell O. Ricke
117. Restriction Map Display on the World Wide Web
    Mark C. Wagner
118. The BCM Search Launcher - Providing Enhanced Sequence Analysis Search Services
    Kim C. Worley
119. SubmitData Data Submission Framework
    Sushil Nachnani
120. Graphical Ad hoc Query Interface for Federated Genome Databases
    Dong-Guk Shin
121. Towards a Comprehensive Conceptual Consensus of the Expressed Human Genome: A Novel Error Analytical Approach to EST Consensus Databases
    Winston Hide
122. Query and Display of Comprehensive Maps in GDB
    Stanley Letovsky
123. GSDB - New Capabilities and Unique Data Sets
    P.A. Schad
124. Database Transformations for Biological Applications
    Susan B. Davidson
125. Exploring Heterogeneous Biological Databases with the OPM Multidatabase Tools
    Victor M. Markowitz
126. The OPM Data Management Toolkits Anno 1997
    Victor M. Markowitz
127. Quality Control in Sequence Assembly Analysis
    Mark O. Mundt
128. Automating the Detection of Human DNA Variations
    Deborah A. Nickerson
129. Fast and More Accurate Distance-Based Phylogenetic Construction
    William J. Bruno
130. Determining the Important Physical-Chemical Parameters Within Various Local Environments of Proteins
    Jeffrey M. Koshi
131. Improving Software Usability and Accessibility
    Ryan Carroll
132. bioWidgets: Visualization Componentry for Genomics
    G. Christian Overton
133. Research on Data and Workflow Management for Laboratory Informatics
    Nathan Goodman
134. Method of Differentiation Between Similar Protein Folds
    I. Dubchak

Ethical, Legal, and Social Issues

135. Geneletter: An Internet Newsletter on Ethical, Legal, and Social Issues in Genetics
    Dorothy C. Wertz
136. Communicating Science in Plain Language: The Science + Literacy for Health: Human Genome Project
    Maria Sosa
137. The Arc's Human Genome Education Project
    Sharon Davis
138. Measuring the Effects of a Unique Law Limiting Employee Medical Records to Job-Related Matters
    Mark A. Rothstein
139. The Genetics Adjudication Resource Project
    Franklin M. Zweig
140. Upstream Patents and Downstream Products: A Tragedy of the Anticommons?
    Michael A. Heller and Rebecca S. Eisenberg
141. The Science and Issues of Human DNA Polymorphisms
    David Micklos
142. Implications of the Geneticization of Health Care for Primary Care Practitioners
    Mary B. Mahowald
143. Confidentiality Concerns Raised by DNA-Based Tests in the Market-Driven Managed Care Setting
    J.S. Kotval
144. Getting the Word Out on the Human Genome Project: A Course for Physicians
    Sara L. Tobin
145. Electronic Scholarly Publishing: Foundations of Classical Genetics
    Robert J. Robbins
146. The Community College Initiative
    Sylvia J. Spengler
147. The Hispanic Educational Genome Project
    Margaret C. Jefferson
148. Genes, Environment, and Human Behavior: Materials for High School Biology
    Joseph D. McInerney
149. Microbial Literacy Collaborative
    Cynthia Needham
150. High School Students as Partners in Sequencing the Human Genome
    Maureen Munn
151. Your World/Our World - Exploring the Human Genome. Creating Resource Materials for Teachers
    Jeff Alan Davidson
152. Human Genome Teacher Networking Project
    Debra L. Collins
153. A Question of Genes
    Noel Schwerin
154. The DNA Files: A Nationally Syndicated Series of Radio Programs on the Social Implications of Human Genome Research and its Applications
    Bari Scott
155. Human Genome Project Education & Outreach Component
    Molly Multedo

Infrastructure

156. Human Genome Management Information System
    Betty K. Mansfield and John S. Wassom
157. DOE Joint Genome Institute Public WWW site
    Robert Sutherland
158. Human Genome Program Coordination Activities
    Sylvia J. Spengler

Appendices

Appendix A: Author Index
Appendix B: National Laboratory Index

Introduction to Contractor-Grantee Workshop VI

Welcome to the Sixth Contractor-Grantee Workshop sponsored by the Department of Energy (DOE) Human Genome Program (HGP). This workshop provides a unique opportunity for HGP investigators to discuss and share the successes, problems, and challenges of their research as well as new material resources and software capabilities. The meeting also gives genome scientists and administrative staff an overview of the program's progress and content, a chance to assess the impact of new technologies, and, perhaps most important, a forum for initiating new collaborations. We hope you will take advantage of opportunities offered by this meeting and by the always-beautiful surroundings in Santa Fe.

The 158 abstracts in this booklet describe the most recent activities and accomplishments of grantees and contractors funded by DOE's human and microbial genome programs, as well as the research of a few invited guests. All projects will be represented at the poster sessions, so you will have an opportunity to meet with researchers. Plenary sessions will be held at the Eldorado Hotel and poster sessions at the La Fonda Hotel. New informatics resources also will be demonstrated during the poster sessions.

The main challenge facing the genome program today is high-throughput sequencing. DOE is addressing this challenge with the formation of the Joint Genome Institute (JGI) under the direction of Elbert Branscomb. JGI will take advantage of the complementary strengths of DOE's three largest genome programs and those at other laboratories and universities to make more efficient and effective use of diverse expertise and resources. A major step in JGI's inauguration is the creation of a DNA sequencing factory in Walnut Creek, California, located halfway between the Lawrence Berkeley and Lawrence Livermore National Laboratories. Factory output will be high-accuracy human DNA sequences that will be deposited into public databases in accordance with "Bermuda principles."

The U.S. Human Genome Project is nearing the end of its second 5-year plan, which was produced in 1993 by DOE and the National Institutes of Health following rapid progress toward the goals of the original 5-year plan of 1990. We look forward to defining and meeting the goals of the project's next 5 years, now being developed with input from the broad genome scientific community. Although many challenges lie ahead, we are optimistic about the success of this grand project and await its many contributions to science and society.

We anticipate a very interesting and productive meeting and offer our sincere thanks to all the organizers and to you, the scientists whose vision and efforts have realized the promise of the genome program.

Sincerely,

Ari Patrinos, Associate Director Office of Biological and Environmental Research U.S. Department of Energy

Sequencing

Large Scale Genomic Sequencing at Washington University

Stephanie L. Chissoe and the Genome Sequencing Center
Washington University School of Medicine, St. Louis, MO, USA 63108
[email protected]

Large scale genomic sequencing is ongoing at the Washington University Genome Sequencing Center, focusing on human and C. elegans DNA. The C. elegans project is in its final year with a total of 71 Mb finished sequence (in collaboration with The Sanger Centre, Hinxton, UK). The human project has been maturing since funding began last year, and we now have 15 Mb finished human genomic sequence. The goals for each project are similar: to achieve contiguous, base-perfect sequence. For C. elegans, a clone-based sequence-ready map has provided the majority of sequencing substrates, including cosmids and YACs. A minimal tiling path of cosmids was chosen, although recently we have selected YACs for direct sequencing. Additionally, direct probing of a C. elegans fosmid library has provided bacterial clones for regions previously spanned only by YACs, facilitating sequence contiguity. We have incorporated an up-front, sequence-ready mapping paradigm for human genomic DNA. Chromosome 7 STS information is being translated into sequence-ready bacterial contigs. As the sequencing progresses, we are actively working to close gaps in the bacterial clone map by re-screening the large insert libraries with probes based on end-sequence data, additional markers, or overlapping YACs. Candidate clones are chosen for sequencing after evaluating the fingerprint data to verify clone fidelity and overlap through a region.

Sequencing is performed using a mixed shotgun strategy, which includes initial sequencing of random subclones followed by directed approaches for gap closure and ambiguity resolution. The quality of the assembly and sequence editing are monitored by reassembly and comparison to the finished sequence. Additionally, the restriction digest sizes are compared to those based on the finished sequence. During analysis and annotation of the finished sequence, potential exons are identified by similarity to EST data and known protein sequences, and by gene prediction programs.

An overview of the projects will be presented, addressing our strategies, progress, and methods for quality control at each stage of the process. Additionally, recent software tools and technology developments (see abstract by E. Mardis) which have increased our throughput of high quality data will be included.

The UTSW High Throughput DNA Project

Glen A. Evans, Maria Athanasiou, Lisa Hahner, Sherri Osborne-Lawrence, Terry Franklin, Nina Federova, Cynthia English, Shelly Hinson-Cooper, Joel Durm, James McFarland, Juan Davie, Travis Ward, Paul Card, Parul Patel, Margaret Gordon, Jackie Newton, Danny Valenzuela, Jeff Schageman, Jeff Harris, Garrett Gotway, Mathenna Syed, Ken Kufer, Peter Schilling, Vanetta Gee, Mujeeb Basit, Stafford Brignac, Odell Grant, Ron Burmeister, Kevin O'Brien, Skip Garner, and Roger Schultz
Genome Science and Technology Center, University of Texas Southwestern Medical Center at Dallas, Texas 75235-8591

In order to complete the DNA sequence of the human genome, collaboration of a group of high throughput sequencing centers will be necessary. The UTSW Genome Sequencing Center has as its goals: 1) development of 100 Mb/year sequencing capacity, 2) complete sequencing of human chromosomes 11 and 15 and 3) developing exportable technology and robotics to support high throughput genomic sequencing. Aspects of this project include high throughput mapping to generate sequence ready PAC and BAC template sets, pilot projects to sequence megabase-sized contigs on chromosome 11p15.5, 11p14, 11q23, 15q26 and 15q11, and technology development projects in robotics and informatics. The strategy for integrated map construction, sequencing and data annotation involves assembly of a precise sequence-ready map using STSs, ESTs, regionally
mapped genes and other probes screened against a 10X total human BAC library which meets ethical and legal standards of informed consent and anonymity. BAC clones are isolated by array hybridization with radiolabeled pooled oligonucleotides, identity confirmed by PCR with specific markers and FISH, and clones fingerprinted. BAC/PAC end-sequencing is widely utilized for gap filling, contig establishment and sequence annotation. BAC clones are sequenced by M13 shotgun sequencing and fragments assembled into initial contigs using PHRED and PHRAP. Gap filling and quality improvement to an estimated sequence accuracy of 99.99% are accomplished by oligonucleotide-directed sequencing from BAC and PAC templates. Sequence processing, assembly, quality control and annotation are implemented through informatics tools developed in the center. The effort is augmented by the robotics and instrumentation developed in the center including MERMADE high throughput oligonucleotide synthesizers, the Sequencing Support System, a 3 m Sagian rail robot developed for automated sequencing, and a high capacity DNA sequencer under development. The center has produced 4.2 Mb of DNA sequence, and analysis and annotation of 750 kb of contiguous sequence for the 11p15.5 region have been completed, with an anticipated 2 Mb of continuous sequence by the end of 1997.

Progress Towards Sequencing Selected Regions of Human Chromosome 19

Jane Lamerdin, Aaron Adamson, Karolyn Burkhart-Schultz, Linda Danganan, Jeff Garnes, Ami Kyle, Melissa Ramirez, Stephanie Stilwagen, Glenda Quan, Pat Poundstone, Robert Bruce, Evan Skowronski, Arthur Kobayashi, David Ow, Anthony V. Carrano, and Paula McCready
Joint Genome Institute, Lawrence Livermore National Laboratory, Livermore, CA, 94550
lamerdin [email protected]

Chromosome 19 is the most GC-rich of the human chromosomes as determined by flow cytometry. It is predicted to be very gene rich, with an estimated 2000 genes contained within ~60 Mb of euchromatin. A high resolution physical map of human chromosome 19, constructed largely in bacterial-based clones, serves as a resource for targeted genomic sequencing in regions of high biological interest. One interesting feature of chromosome 19 is the high density of clustered gene families such as zinc finger genes (ZNFs), olfactory receptors (OLFRs) and cytochrome P-450s (CYPs). In order to understand their evolution and subsequent functional diversification, several of these clusters are current sequencing targets. We are also interested in genes involved in DNA repair, and have performed genomic analyses of 6 such loci, many in both human and mouse. To date we have completed over 2 Mb of genomic sequence from chromosome 19 and other human and rodent targets, using a shotgun strategy. Our largest completed sequence contig is ~1 Mb and is located in 19q13.1, flanked by the genetic markers D19S208 and COX7A1. Preliminary analysis of this contig indicates a relative gene density of ~1.7 per 40 kb, and an average Alu density of 1.1 Alu/kb (~31%), which is comparable to other previously sequenced regions of chromosome 19. Of the 43 putative genes identified, two appear to be pseudogenes, 7 encode putative cell surface proteins (e.g. glycoproteins), and 14 are completely novel. Progress towards completion of an >800 kb contig in 19p12 and regions containing clustered OLFR and ZNF gene family members will also be presented.

This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.

Analysis of Nine Tandem Zinc Finger-Containing Genes Located within a 339kb Region in Human Chromosome 19q13.2

Mark Shannon, Linda Ashworth, Laurie Gordon, Anne Olsen, and Lisa Stubbs
Human Genome Center, Lawrence Livermore National Laboratory, Livermore, CA 94551
[email protected]

Genetic and physical mapping studies indicate that hundreds, if not thousands, of zinc finger
(ZNF)-containing genes populate the human genome, and that many of these genes, including those with Kruppel-associated box (KRAB) motifs, are arranged in familial clusters. In a previous study, we identified a KRAB-containing ZNF gene family located near the XRCC1 gene in human chromosome 19q13.2 (H19q13.2). Preliminary characterization of this family indicated that the genes are arranged in a 'head-to-tail' tandem array with an average intergenic spacing of 20-30kb. Current estimates based upon physical mapping data suggest that this family is comprised of at least 15 members, which span a minimum of 600kb in 19q13.2. Restriction mapping studies, in combination with Southern blot analyses using ZNF consensus and KRAB sequence probes, have allowed a preliminary determination of this family's content and organization. In depth studies of the largest cosmid contig from the region have demonstrated that nine genes are present within 339kb, encompassing the proximal one-half to two-thirds of the gene family. A number of methods, including Southern blot analysis, PCR analysis and DNA sequencing, have been used to localize expressed sequences within an EcoRI restriction map (RMAP) of this region. Several sequences present in the Genbank database (ZNF45, ZNF155, and HZF4 as well as ESTs 429289, 20113, and 28165) have been localized to specific sites within the RMAP. The region also contains three previously unknown gene sequences. Sequence analysis of cDNA clones for eight of the genes indicates that while the KRAB A domains of the sibling genes are highly similar in structure, other portions, including the KRAB B and ZNF domains, are highly divergent in sequence. These observations suggest that this gene family may have evolved to encode a collection of transcription factors that bind to different target DNA sequences as a result of divergent ZNF arrays, while participating in common transcription control complexes due to their highly similar KRAB A domains. Interestingly, the genes are expressed widely in adult tissues and are co-expressed in many sites. However, tissue-specific variations in levels of transcript between the genes are also evident. Such overlapping, but not identical, expression patterns are consistent with the idea that, like coding sequences, the duplicated cis-acting regulatory regions of the sibling genes have diverged over evolutionary time as they acquired new biological functions. Comparative mapping studies have demonstrated that a homologous Xrcc1-linked gene family is present in the mouse genome. Preliminary studies indicate that three members of the murine gene cluster are orthologous to genes located in H19q13.2. Future studies will address whether the historically homologous mouse and human genes are subject to the same upstream regulation and whether they regulate the same downstream genes.

This work was supported by the U.S. Department of Energy under contract Nos. DE-AC05-96OR22464 and W-7405-ENG-48.

Genomic Sequencing of 16p13.3

D. C. Bruce, D. O. Ricke, J. L. Longmire, P. S. White, J. M. Buckingham, L. A. Chasteen, D. L. Robinson, M. D. Jones, A. C. Munk, J. D. Cohn, A. L. Williams, M. O. Mundt, L. L. Deaven, and N. A. Doggett
Los Alamos National Laboratory, Life Sciences Division and Center for Human Genome Studies, Los Alamos, New Mexico 87545
[email protected]

We have recently begun full genomic sequencing of a 3.0 Mb cosmid/P1 contig of the human chromosome region in 16p13.3 extending from the polycystic kidney disease 1 (PKD1) locus to the CREB binding protein (CREBBP) locus [responsible for Rubinstein-Taybi Syndrome and implicated in acute myeloid leukemias associated with translocations t(8;16)(p11;p13.3) and t(11;16)(q23;p13.3)]. This contig encompasses the recently cloned Familial Mediterranean Fever gene, and the syntenic breakpoint between mouse chromosomes 16 and 17. The average overlap between clones in the contig is 25%. Sample sequencing of this region has revealed that it is gene rich and G+C rich (>50% G+C), with the gene density approaching one gene/10 kb in some stretches. These observations are consistent with the cytogenetic designation of 16p13.3 as a G+C rich 'T' band (Dutrillaux, 1973; Holmquist, 1992). Our strategy for sequencing involves nebulization to randomly break DNA, size selection of 3 kb fragments, double adapter cloning into bluescript KS+ plasmid, and sequencing of both ends to 5X
random sequencing coverage. Assembly of sequence contigs is constrained by the inherent relationship of the end sequences being approximately 3 kb apart. Closure is achieved by a combination of primer walking, longer reads, and alternate chemistry reactions. To date we have sequenced to approximately 2X coverage for the first Mb of this contig and approximately 1X coverage for the remaining 2 Mb. Supported by the US DOE, OBER under contract W-7405-ENG-36.


The Beginning of the End: Mapping and Sequencing Human Telomeric DNA

Robert K. Moyzis, Han-Chang Chi, and Deborah L. Grady
Department of Biological Chemistry, University of California, Irvine, Irvine, CA 92697
[email protected]

The Human Genome Project is undergoing a rapid transition from an emphasis on generating physical maps to the large-scale finished sequencing of human DNA. Current technology will allow a large fraction of human DNA to be sequenced in the next 5-10 years by highly automated, high-throughput sequencing "factories". A significant fraction of the human genome, however, will be difficult to sequence to completion by such "factory" approaches. These are regions that: 1) contain a high percentage of repetitive DNA sequences, 2) contain internal tandem duplications, including multigene families, and/or 3) are unstable in all current sequencing vectors. This would be irrelevant if such regions were rare, or contained little of intrinsic informational value. Such is not the case. The first five years of the Human Genome Project mapping efforts have indicated that such regions represent approximately 20% of human DNA. This includes such critical regions as centromeres and telomeres, as well as a greater abundance of low-copy repeats and multigene families than previously anticipated. Producing quality DNA sequence of these regions, which faithfully represents genomic DNA, will be a continuing challenge.

We propose that a large-scale, yet distributed "boutique" approach to mapping and sequencing such regions is warranted, where individual laboratories specialize in genomic regions they have special expertise in investigating. Such efforts would complement and integrate with the few truly large-scale sequencing centers that are likely to evolve during the next few years, such as the proposed DOE Joint Genome Institute Sequencing Center. Our initial "boutique" target is telomeric regions, which exhibit high levels of repetitive DNA composition, cloning instability, and population heterogeneity. Numerous investigations have implicated genes near telomeres as likely targets for alterations during aging and cancer progression. Through the efforts of a number of laboratories, most notably Dr. Harold Riethman's at the Wistar Institute, nearly all human telomeres have now been cloned by functional complementation in yeast. My laboratory has finished the 0.23 Mb 7q telomere sequence (Chi et al., this meeting), the first RARE cleavage confirmed human telomere region to be sequenced directly up to the terminal (TTAGGG)n repeat. An important QC/QA aspect of this project was the extensive confirmation of the sequence against genomic DNA by PCR-sequencing (White et al., this meeting). Greater than 3 Mb of confirmed telomeres are now available for sequencing, with another 6-12 Mb currently being confirmed. These represent 43 of the 46 unique human chromosome ends. The finished and annotated sequence of these clones will "cap" the world-wide genome sequencing effort, and identify numerous important genes and polymorphic markers.


7q Telomere: Complete Sequencing

Han-Chang Chi(1,2), Elizabeth H. Saunders(1), Judy M. Buckingham(1), Darrell O. Ricke(1), A. Christine Munk(1), Rebecca Lobb(1), Samantha Y.-J. Ueng(1), Mark O. Mundt(1), P. Scott White(1), Owatha L. Tatum(1), and Robert K. Moyzis(1,2)
(1) Center for Human Genome Studies and Genomics Group, LS-3, Los Alamos National Laboratory, Los Alamos, NM 87545 and (2) Department of Biological Chemistry, College of Medicine, University of California, Irvine, Irvine, CA 92697

225,432 bp of DNA immediately internal to the (TTAGGG)n telomere repeat of human
chromosome 7q was sequenced. The telomeric end of chromosome 7q, unlike most human chromosomes, contains only a small amount (<1.5 kb in total) of subtelomeric repetitive DNA. The lack of subtelomeric repeats, therefore, suggested that this region would likely contain subtelomeric genes whose expression would be affected by telomere alterations accompanying aging and cancer progression (Moyzis et al., this meeting). Nine overlapping cosmids and two PCR products obtained from the 7q telomere YAC clone HTY146 (yRM2000) were sequenced using a Sample Sequencing (SASE)-parallel primer walking strategy. Sequence validation was performed on PCR amplified human genomic DNA from at least 6 individuals including the cell line used to construct HTY146. Primer pairs (spaced 300-900 bp) were randomly picked from 25 sites that are almost evenly distributed along the entire 7q terminal region. The QC/QA results confirmed that the cloned YAC/E. coli sequences were a faithful representation of genomic DNA, containing less than one error in 10,000 bases (White et al., this meeting). Computer analysis uncovered numerous open reading frames, expressed sequence tags (ESTs), and potential exons dispersed along the entire 226 kb region, as well as 19 variable number of tandem repeats (VNTRs) and 20 microsatellite repeats. Approximately 192 kb internal to the (TTAGGG)n terminal repeat, the first and second exons of the human vasoactive intestinal peptide receptor 2 (VIPR2) gene were located. This gene is involved in a diverse set of physiological functions including smooth muscle relaxation, electrolyte secretion, and vasodilation. SASE-parallel primer walking is efficient for finished sequencing and gene localization of a long range genomic target, especially a "difficult" region like telomeres.


Finished Sequence of 7q Telomere Region: Features, Validation and Polymorphism Detection

P. Scott White, Han Chi, Larry L. Deaven, Darrell O. Ricke, Elizabeth Saunders, Owatha L. Tatum, and Robert K. Moyzis
Los Alamos National Laboratory, Life Sciences Division and Center for Human Genome Studies, Los Alamos, New Mexico 87545
[email protected]

The 7q telomere region was sequenced to examine whether this region contained expressed genes close to the telomere or served as a buffer area between the telomere and expressed genes in the event of telomere shortening. A CpG island next to the first few exons of the vasoactive intestinal polypeptide receptor 2 precursor gene was discovered (the remaining exons are presumed to be centromeric of the sequenced region). This 225,432 bp sequence contains multiple EST clusters and evidence for additional candidate genes as well as a large array of repetitive sequences and one type I phosphatidylinositol-4-phosphate 5-kinase pseudogene. Interestingly, a second candidate gene exists in the form of 7 close regions of homology with chromosome 4 cosmid L191F1. Evidence for a third candidate gene can be pieced together from EST clones with homologies to both the 5' and 3' ends. These paired EST end sequences link together multiple candidate exons in what may be an alternatively spliced gene. An update of the sequence analysis of this region will be presented.

As part of the sequence verification for the 7q region, we have PCR amplified and sequenced ~300 bp from each of 19 sites along this region, from at least 6 individuals representing 4 diverse ethnic groups. In addition, we sequenced the genomic DNA used to build the 7q YAC library, and the supporting YAC clone, HTY-146. For each site, PCR primers were selected in regions free of known repeats. All PCR products were sequenced and compared to detect errors and polymorphisms. Sequence comparisons among individuals give us an idea of the amount of natural variation, as well as validate the clone sequence against the genomic DNA from which the clone originated. Out of 5722 nucleotides resequenced (2.5% of total) there was one heterozygous site found in the individual from which the library was constructed, and 9 single nucleotide polymorphisms (SNPs) among the other individuals. All other sequences were identical between the clone sequence and that of the individual from which the library was made. These results are in agreement with previous estimates that natural variation at the primary
sequence level is at least 1 in 1000, which needs to be considered when performing sequence validation aimed at detecting a one in 10,000 error frequency.


Large-Scale High-Quality Genomic Sequencing of Human Chromosome 7

Shawn Iadonato, Jun Yu, Gane K.-S. Wong, Charles Magness, Phil Green, and Maynard Olson
Human Genome Center, Department of Medicine, University of Washington, Seattle, WA 98195.

We are implementing an approach to large-scale sequencing of human genomic DNA that emphasizes high quality at a reasonable cost. Quality targets are: 1) a single-base-pair error rate of better than 0.01%, 2) no gaps, in either the shotgun sequence or the physical map, across megabase-sized regions, and 3) validation of the sequenced large-insert clones to an average resolution of 200 bp. This is accomplished by the systematic introduction of objective quality measures throughout the data production process, from the physical mapping through to the shotgun sequencing.

Detailed sequence-ready restriction maps are produced by the multiple-complete-digest (MCD) method. These maps have proven to be very accurate, with average sizing errors better than 1%. They allow us both to validate the large-insert clones to a 200-bp average resolution and to confirm the correctness of the subsequent sequence assemblies. The sequence production is distinguished by an emphasis on long read lengths instead of maximum machine utilization. Data and quality analysis are performed with the Phred/Phrap/Consed system and average Phrap-alignable read lengths are 735 bp. Overlapping large-insert clones are finished independently and the sequence overlaps are used to estimate the single-base-pair error rate, which has been better than 1 in 100,000 bp. We will present data from a CONTIGUOUS 2-Mbp region on human chromosome 7 (near 7q31.3). Evidence will be presented to support all of the above quality assertions.


Sequencing Progress at Lawrence Berkeley National Lab and the Formation of JGI

Chris Martin and Mohan Narla
Human Genome Sequencing Department, Life Sciences Division, Lawrence Berkeley National Laboratory and the DOE Joint Genome Institute, Berkeley, CA

The Genome Centers at DOE's Los Alamos, Berkeley, and Livermore National Laboratories have merged into the Joint Genome Institute (JGI). This reorganization is a significant undertaking that is aimed at achieving economies of scale in our genome efforts while also leveraging the expertise available at the three sites. In the area of genomic sequencing, the large majority of the JGI's effort will be moving to a new facility, located in Walnut Creek, California, in early to mid 1998. This will provide a custom designed space sufficient for the scale-up of the JGI's production sequencing effort in a factory-like setting, termed the PSF (Production Sequencing Facility). Until this time, the production sequencing efforts at the three laboratories will be scaled up in place, while also working towards the goal of finalizing a uniform production process for use within the PSF.

The sequencing approach at Berkeley has recently been altered by an increased emphasis on the up-front shotgun phase of the process, which utilizes double-end plasmid subclone sequencing. Additionally, we have adopted the use of additional bacterial strains for all of our subclone libraries that seem to alleviate under-representation of genomic regions due to cloning biases. These changes are helping us to significantly reduce the time required for the complete sequence of a given physical mapping clone to be determined. Data will be presented on the current status of a set of 16 human and mouse P1s, PACs and BACs that are now in progress using this new sequencing process.
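
The chromosome 7 abstract above states its accuracy targets as error rates (better than 0.01%; observed better than 1 in 100,000 bp) and estimates them from independently finished overlaps. The bookkeeping behind such an estimate can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure: the function names and toy sequences are hypothetical, the two overlap sequences are assumed to be already aligned without gaps, and the conversion Q = -10 log10(p) is the standard Phred-style definition rather than anything specified in the abstract.

```python
import math


def mismatch_rate(seq_a: str, seq_b: str) -> float:
    """Fraction of differing positions between two pre-aligned, equal-length overlap sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("overlap sequences must be aligned to equal length")
    if not seq_a:
        raise ValueError("overlap must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def per_sequence_error_rate(discrepancy_rate: float) -> float:
    """Assumption: errors are rare and independent, so each of the two
    independently finished sequences contributes about half the discrepancies."""
    return discrepancy_rate / 2.0


def phred_quality(error_rate: float) -> float:
    """Convert an error probability into a Phred-style quality value."""
    if error_rate <= 0.0:
        raise ValueError("error rate must be positive")
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    # Toy overlap: one discrepancy in ten aligned bases.
    rate = mismatch_rate("ACGTACGTAC", "ACGTACGTAA")
    print(rate, per_sequence_error_rate(rate))
    # The 0.01% target and the 1-in-100,000 observation, expressed as quality values.
    print(phred_quality(1e-4), phred_quality(1e-5))
```

On this scale the 0.01% target corresponds to a quality value of 40 and an error rate of 1 in 100,000 to a value of 50, which is one way to compare the stated target with the observed rate.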

Comparative Analysis of Human and Mouse Orthologous DNA to Identify Conserved Regulatory Regions

Kelly A. Frazer, Gabriela G. Cretz, Christopher H. Martin, Jan-Fang Cheng, and Edward M. Rubin
Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
[email protected]

Human chromosome 5q31 was chosen by the JGI for large-scale sequencing because it harbors a family of interleukin genes which are important regulators of the immune response. Interleukin-3 (IL-3), IL-4, IL-5, IL-13 and granulocyte-macrophage colony-stimulating factor (GM-CSF) are clustered within 600 kilobases of each other on human 5q31. Previously, we have computationally and biologically analyzed the available genomic sequence data in the 1 Mb region of human 5q31 containing the interleukin family cluster and identified 16 new genes, as well as 7 previously known genes. We also performed comparative mapping studies and demonstrated that 13 of the human genes in the 5q31 region are located in the syntenic region of mouse chromosome 11.

Numerous experimental studies have indicated that noncoding regulatory sequences controlling gene expression are evolutionarily conserved in mice and humans. Comparative analysis of human and mouse orthologous sequences is a powerful tool for identifying these conserved noncoding regulatory elements. To facilitate the annotation of coding regions and permit noncoding conserved regulatory elements to be readily identified in the human 5q31 sequence data, the JGI has chosen to sequence the 1 Mb region containing the interleukin growth factor cluster on mouse chromosome 11.

To date, we have isolated and begun sequencing a 150 kb mouse BAC which contains 5 complete genes: IL-4 and IL-13, genes with coordinated expression in T cells; septin, a ubiquitously expressed gene with complex alternative splicing; and cyclin-like and KIF3, genes which are both predominantly expressed in the brain and are physically near each other. Importantly, the expression patterns of these five genes suggest that human-mouse conserved noncoding regulatory elements are likely to be present in this 150 kb region and to be identified by comparative sequence analysis.


Comparative Genomic Sequencing of the Human and Mouse Fibroblast Growth Factor Receptor Type III Gene

Ana V. Perez-Castro, Julie Wilson, Cleo Naranjo, and Michael R. Altherr
Los Alamos National Laboratory, Genomics Group, Los Alamos, New Mexico 87545
[email protected]

We have sequenced the human and mouse genomic segments encoding the fibroblast growth factor receptor 3 gene (FGFR3). FGFR3 is a member of the receptor tyrosine kinase superfamily and a developmentally regulated transmembrane protein. Mutations in FGFR3 contribute to a number of significant human maladies. The human and mouse genes exhibit significant similarity in terms of their structural domains and genomic organization. In both species, the gene consists of 19 exons and 18 introns spanning greater than 15 kbp of sequence. The coding regions across the entire gene are 84% and 92% identical at the nucleic acid and amino acid levels respectively. The alternatively spliced exon 8 is 95% similar at the nucleic acid level and 93% similar at the amino acid level. While the sequence similarity of the introns is (on average) less than 50%, the size of the individual introns is very similar. The 5' flanking regions of the gene in both species show a similar high degree of conservation. In general, the analysis of a 400 bp segment preceding the initiator ATG shows only a 60% similarity. However, several common transcriptional regulatory sequences are present in both. Consensus binding sites for Sp1, AP2, Krox24, and IgHC.4 are located in this region. It is also worth noting that the position and spacing between these sites is conserved in these species. These data suggest that mouse comparative genomic sequencing can be used to identify and annotate significant functional domains in the human genome.

Support: DOE Contract W-7405-ENG-36 and LANL LDRD funds.
E-mail Contact: [email protected]
Day Phone: (505)665-6144


Comparative Genomic Sequencing of the Murine Syntenic Segments Corresponding to Human Chromosome 16p13.3

Darrell O. Ricke, Norman A. Doggett and Michael R. Altherr
Los Alamos National Laboratory, Genetics Group, Los Alamos, New Mexico 87545
[email protected]

The terminal short arm band of human chromosome 16 (16p13.3) is syntenic with three different mouse chromosomes (11, 16, and 17). At least 8 loci define the syntenic overlap between human 16 and mouse 17. Less than 1 Mbp from the proximal boundary of this segment is another cluster of eight loci that comprise 21 centimorgans of mouse chromosome 16. Our human genomic sequencing targets include this 1 Mbp segment of chromosome 16 where the syntenic breakpoint between mouse chromosomes 16 and 17 occurs. In addition, this segment overlaps at least one unidentified locus of medical significance, CATM, and contains the recently identified gene for familial Mediterranean fever. The CATM locus is a gene that, when defective, causes congenital cataract and microphthalmia. It is clear from previous comparative sequencing studies that both structural and regulatory regions of genes can be identified by similarity comparisons of human and mouse genes. In fact, our preliminary sequencing efforts of this region of 16p13.3 have already identified 33 mouse EST clusters with greater than 80% similarity to the human sequence. These ESTs are being used to develop primer pairs to identify mouse BAC clones. These mouse BAC clones will be subcloned and sequenced at low redundancy to identify and annotate potentially interesting regions of the human genome. This effort will be closely coordinated with higher redundancy efforts at LLNL and LBNL in an effort to optimize the depth of sequence coverage required to identify conserved sequence segments and presumably functionally important domains.

Support: DOE Contract W7405-ENG-36 and LANL LDRD funds.
E-mail Contact: [email protected]
Day Phone: (505)665-6144


High-Throughput Sequencing of Eubacterial and Archeal Genomes

Owen White, Mark D. Adams, Rebecca Clayton, Hans-Peter Klenk, Anthony R. Kerlavage, Karen E. Nelson, J. Craig Venter
The Institute for Genomic Research
owhite@tigr.org

The Institute for Genomic Research (TIGR) has completed the genomic sequences for the DOE-funded projects of the eubacterium Mycoplasma genitalium and two archeal genomes, Methanococcus jannaschii and Archaeoglobus fulgidus. The eubacterial genomes Deinococcus radiodurans and Thermotoga maritima are now in closure, while a fifth genome project, Shewanella putrefaciens, is currently in the random sequencing phase at TIGR. These projects result in over 11.5 Mb of finished sequence that contains approximately 10,500 genes. Data relating to each of these bacterial projects are located in our local database, disk-based files used for searching, an external web server, the public sequence archives, and other web servers (e.g. Argonne's biochemical pathways for Haemophilus influenzae). One challenge of the distributed architecture of bacterial information is to effectively propagate changes from one data location to another, preferably in an automated manner. We will report on a system under development that encodes the system-wide data dependencies necessary for updating information in a large-scale sequencing facility. We have also begun implementation of enhanced high-throughput data analysis that adds intergenomic and intragenomic gene families, hydrophobicity plots, block and motif searching, Medline and web-derived information to our previous database searching methodology. We will report on our enhanced annotation techniques and will discuss them in the context of the recently completed genomes.
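
The TIGR abstract above describes the problem of propagating a change from one data location to the others that depend on it, but gives no design details. One common way to encode such dependencies is a directed graph that is walked in topological order, sketched below in Python. Everything in the sketch is an assumption made for illustration: the location names are loosely modelled on the list in the abstract, the dependency edges are invented, and the use of the standard-library `graphlib` module (Python 3.9 or later) is a convenience choice, not a description of the system under development at TIGR.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each data location -> the locations it is derived from.
DEPENDS_ON = {
    "local_database": set(),
    "search_files": {"local_database"},
    "web_server": {"local_database", "search_files"},
    "public_archive": {"local_database"},
}


def update_order(depends_on):
    """Order all locations so that each one comes after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def stale_after_change(depends_on, changed):
    """List the locations that need refreshing after `changed` is edited,
    in an order that is safe to apply."""
    order = update_order(depends_on)
    stale = {changed}
    # Walking in topological order means a location is visited only after
    # all of its sources, so staleness propagates in a single pass.
    for location in order:
        if depends_on.get(location, set()) & stale:
            stale.add(location)
    return [loc for loc in order if loc in stale and loc != changed]


if __name__ == "__main__":
    print(update_order(DEPENDS_ON))
    print(stale_after_change(DEPENDS_ON, "search_files"))
```

In this toy graph a change to the search files only requires refreshing the web server, while a change to the local database touches every other location, which is the kind of distinction an automated propagation system would need to make.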

High-Throughput Transposon Mapping and Sequencing

Robert B. Weiss(1), Mark Stump(1), Joshua Cherry(1), Cindy Hamil(1), Frank Robb(2) and Diane Dunn(1)
(1) Department of Human Genetics, University of Utah, Salt Lake City, UT 84112, (2) Center of Marine Biotechnology, University of Maryland, Baltimore, MD 21202.
bob.weiss@genetics.utah.edu

The ability to sequence moderate-size plasmid inserts (10-15 kb) using mapped transposons is being tested on both microbial and human sub-clone libraries. The transposon serves as the priming site for initiating a set of bi-directional di-deoxy ladders within the plasmid insert. The priming site locations are mapped by automated Southern analysis using a restriction digest that releases the insert from the vector, while cutting in the center of the transposon. The inserts propagate on a vector designed to maintain the plasmid at a few copies per cell, while enabling efficient DNA purification after temperature-induced runaway plasmid replication. Sub-clone libraries are built in a multiplex family of this vector backbone, where each vector attaches unique sequence tags to the insert for use in mapping. Automated Hybridization and Imaging Instruments (AHII) are used to sequentially probe for the mapping tags. These instruments are fully automated devices for detecting enzyme-linked fluorescence from DNA hybrids on nylon membranes, and are used in the mapping and sequencing phase of the project.

We are in the process of using the transposon mapping and sequencing technique to complete the 2.1 Mb genomic sequence of Pyrococcus furiosus (DSM 3638), a hyperthermophilic heterotrophic member of the Archaea. This organism, isolated from hot marine sediments, grows vigorously at temperatures near 100°C. Its lifestyle is anaerobic and it derives energy by fermenting peptides and carbohydrates to organic acids and CO2. P. furiosus plasmid and cosmid libraries were built in a family of 21 multiplex vectors. These vectors provide multiplex tags for both the end sequencing and transposon mapping phases of the process. End sequences were determined on 2875 plasmid inserts and 400 cosmid inserts. The cosmid inserts provide a global scaffold of the genome, while the plasmid inserts are used in the primary process of transposon mapping and sequencing. Overlaps between plasmid inserts derive from the matching of end sequence onto transposon sequence contigs, and contigs are grown by re-feeding the mapping process with inserts that extend or bridge the sequence contigs.

Common growth and DNA prep formats feed both the mapping and sequencing process. Minimal spanning sets of clones are predicted from the mapping phase and fed to the sequencing process, where cycle sequencing is performed on double-stranded templates. A single hybridization instrument is capable of imaging 1728 lanes of mapping data in each eight-hour cycle; with a 20-fold deep multiplex set, these instruments acquire mapping data on 34,560 clones over the course of a single, unattended run. The use of this technique on the Pyrococcus genome is demonstrating its robustness and cost-effectiveness. The use of moderate-size plasmid insert libraries, providing a reduced clone feed and finishing information, results in a complementary approach to small-size M13 insert strategies. This project is an alpha test of methods, instrumentation and software under development at the Utah Center for Human Genome Research.

This work is funded by DOE grant DE-FG03-94ER-61950 (R.B. Weiss, P.I.).


Complete Sequencing of the 2.3Mbp Genome of the Hyperthermophilic Archaeon Pyrobaculum aerophilum

Sorel Fitz-Gibbon(1), Ung-Jin Kim(2), Heidi Ladner, Yongweo Cao(2), Gony H. Kim(2), Barbara Perry, Enrique Colayco, Ronald V. Swanson, Terry Gaasterland(4), Jeffrey H. Miller, and Melvin I. Simon(2)
(1) Department of Microbiology and Molecular Genetics, and Molecular Biology Institute, University of California, Los Angeles
(2) Biology, California Institute of Technology
(3) Now at Diversa Corp., San Diego
(4) Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL; Department of Computer Science, University of Chicago, Chicago, IL

Pyrobaculum aerophilum is a hyperthermophilic archaeon discovered in a boiling marine water hole at Maronti Beach, Italy that is capable of growth at 104°C. This microorganism can grow aerobically, unlike most of its thermophilic relatives, making it amenable to a variety of experimental manipulations and a potential candidate for a model organism for studying archaeal microbiology and thermophilism. Sequencing the entire genome of this organism provides a wealth of information on the evolutionary and phylogenetic relationship between archaea and other organisms as well as the basis of the thermophilic nature of this organism. We have constructed a physical map that covers the estimated 2.3 megabase pair genome using a 10X fosmid library. The map currently consists of 96 overlapping fosmid clones. We have completed sequencing the entire genome using a random shotgun approach supplemented with oligonucleotide primer directed sequencing. A total of 16,098 random sequences corresponding to approximately 3.5X genomic coverage were obtained by sequencing from both ends of 2-3 kbp genomic DNA fragments cloned into pUC18/19 vectors using vector-specific primers. These fragments were assembled into several hundred contigs using the Phrap program developed by Dr. Phil Green at University of Washington, Seattle. Gaps and regions of low quality base calls have been resolved using primers specifically designed to extend the problem regions. We have successfully closed all remaining gaps after a total of 2,300 directed sequencing reactions and reassembly. Our current full length genomic sequence still suffers from low data quality: only approximately 99% of the nucleotide sequences are accurate. This is mainly due to the low redundancy (3.5 fold) in random sequencing. We plan to perform 2-3,000 more directed sequencing reactions to polish the sequence to 99.99% accuracy. We are currently running MAGPIE, a Web-based system for sequence annotation, on our UltraSparc server (date.tree.caltech.edu) and plan to analyze and completely annotate the genome.


Microbial Genome Sequencing and Comparative Analysis

D.R. Smith, T. Aldredge, R. Bashirzadeh, H. Bochner, M. Boivin, S. Bross, D. Bush, A. Caron, A. Caruso, G. Church*, R. Cook, C.J. Daniels#, C. Deloughery, L. Doucette-Stamm, J. Dubois, J. Egan, D. Ellston, J. Ezedi, T. Ho, K. Holtham, P. Joseph, M. LaPlante, H-M. Lee, D. Blakely, R. Cook, R. Gibson, K. Gilbert, A. Goyal, J. Guerin, D. Harrison, J. Hitti, L. Hoang, N. Jiwani, P. Keagle, J. Kozlovsky, W. Lumm, J. Mao, P. Mank, A. Majeski, S. McDougall, J. Nolling, D. Patwell, J. Phillips, S. Pietrokovski@, B. Pothier, S. Prabhakar, D. Qiu, J.N. Reeve#, P. Rice, P. Richterich, M. Rossetti, M. Rubenfield, M. Sachdeva, H. Safer, G. Shimer, P. Snell, R. Spadafora, L. Spitzer, H-U. Thomann, R. Vicaire, Y. Wang, L. Wong, K. Weinstock, J. Wierzbowski, Q. Xu, L. Zhang
Genome Therapeutics Corp., Waltham, MA
*HHMI, Dept. of Genetics, Harvard Medical School, Boston, MA
@Fred Hutchinson Cancer Research Center, Seattle, WA
#Dept. of Microbiology, The Ohio State University, Columbus OH
[email protected]

This project is applying automated sequencing technology and bioinformatics tools to the analysis of microbial genomes with potential applications in energy production and bioremediation. Efforts have focused on two genomes in particular, those of Methanobacterium thermoautotrophicum strain delta H and Clostridium acetobutylicum ATCC 824.

Methanobacterium thermoautotrophicum is a thermophilic archaeon that grows at temperatures from 40-70°C, and was isolated in 1971 from sewage sludge. The complete 1,751,377 bp sequence of the genome of M. thermoautotrophicum was determined by a whole genome shotgun sequencing approach. Analysis of the sequence predicted 1,855 polypeptide-encoding ORFs, 807 (44%) of which could be classified according to function. The putative gene products were compared with sequences from Methanococcus jannaschii, as well as eucaryal,
bacterial and archaeal specific databases. These analyses indicated that most ORFs are most similar to sequences described previously in other Archaea but that there has been extensive divergence between the two sequenced methanogen genomes. Most gene products predicted to be involved in cofactor and small molecule biosyntheses, intermediary metabolism, transport, nitrogen fixation, regulatory functions and interactions with the environment are more similar to bacterial than eucaryal sequences, whereas the converse was true for most proteins predicted to be involved in DNA metabolism, transcription, and translation. There are 24 polypeptides that could form two-component sensor kinase-response regulator systems, homologs of bacterial DnaK and DnaJ, homologs of eucaryal DNA replication initiation Cdc6 proteins, an X-family repair-type DNA polymerase and an unusual archaeal B-type DNA polymerase. There are 39 tRNA genes, two rRNA gene clusters, one intein-containing gene, several repeated regions, two large clusters of short repetitive elements, and numerous other interesting features.

The Clostridia are a diverse group of gram-positive, rod-shaped, spore-forming anaerobes that include several toxin-producing pathogens and a large number of terrestrial species. The latter have been used extensively for industrial solvent production (acetone, butanol and ethanol) by fermentation of starches and sugars. C. acetobutylicum strain ATCC 824 has a 4.1 Mb, AT-rich genome and is one of the best-studied solventogenic clostridia. The shotgun sequencing phase has been completed, with 4.9 Mb of multiplex and 21.3 Mb of ABI raw sequence reads (6.3 fold total redundancy) that produced 551 contigs spanning 4,030,725 bases when assembled using PHRAP with quality scores. A total of 4018 putative polypeptide-encoding ORFs were identified and searched against public databases to provide preliminary annotation. The finishing phase of the project is currently underway utilizing a quality-based finishing paradigm and a set of integrated bioinformatics tools. The data are available at http://www.cric.com.


Sequencing the Borrelia burgdorferi Genome

John J. Dunn, Laura-Li Butler-Loffredo, Ting Chen, William C. Crockett, Jan Kieleczawa, Jeremy Medalle, Sean McCorkle, Keith H. Thompson, Jeanne R. Wysocki, Shiping Zhang and F. William Studier
Biology Department, Brookhaven National Laboratory, Upton, New York 11973
[email protected], [email protected]

The ~900 kbp linear chromosome of Borrelia burgdorferi, the bacterium that causes Lyme disease, is being sequenced to spur the development of an integrated system for accurate, high-throughput, low-cost genome sequencing with minimal human involvement. We are testing a modified whole-genome shotgun approach, using random 1st-end and directed 2nd-end sequencing on a clone set with good physical coverage. A network of linked clones is generated at relatively low sequence redundancy, and primer walking on linking clones is used to close the gaps and complete the sequence of both strands. This strategy can achieve highly accurate sequence at an overall redundancy of 4-fold or even less.

Two commercial fluorescent sequencers have been used in this development phase, with the intention of scaling up capacity with a capillary sequencing system currently under development. Data management and analysis capabilities tailored to this sequencing strategy have been developed and implemented, including software for managing the sequence process, selecting 2nd-ends for sequencing, assembling the sequence, and selecting walking primers. The routine biochemical protocols needed for Mbp-scale sequencing have also been implemented. Primers for walking are generated from a library of all 4096 hexamers, by ligating hexamers on hexamer templates to form 12-mers for cycle sequencing.

The Borrelia sequence is almost finished. As of August 21, approximately 5400 end sequences and 2700 primer-walking sequences had been obtained on approximately 2700 plasmid clones (average insert length of 2.1 kbp) and 106 fosmid clones (average insert length of 35 kbp), representing
about 4x sequence coverage and 10x physical coverage of the linear chromosome. The sequence redundancy was higher than necessary because of extra sequences obtained during protocol development. A single contig of approximately 900 kbp aligns with the published restriction map and contains all but an estimated few kbp at each end. This sequence is posted on our web site at www.bio.bnl.gov and will be updated periodically as the sequence of both strands is completed. As of August 21, about 92% had been sequenced on both strands and approximately 200 single-strand gaps remained to be filled. Our sequence agrees well with the sequence of the chromosome of the same Borrelia strain determined independently at The Institute for Genomic Research (TIGR).

Software and database support for this sequencing strategy continues to be refined, with the aim of removing as much human decision making as possible. Automation of sample selection and data entry will greatly reduce sources of human error, which cause more problems in primer walking than in shotgun sequencing.

A newly developed single-copy amplifiable vector in which nested deletions are easily produced should allow the use of clone libraries of longer average fragment length, improve sequencing efficiency, and help in resolving repeated sequences.

Sequence Analysis of the Plasmids of Bacillus anthracis

R. T. Okinaka, K. G. Cloud, O. A. Hampton, K. K. Hill, P. Keim, S. Kumano, D. Manter, J. J. Renouard, D. O. Ricke, and P. J. Jackson
Los Alamos National Laboratory, Life Sciences Division, Los Alamos, New Mexico 87545 and Northern Arizona University, Flagstaff, Arizona
[email protected]

Virulent strains of Bacillus anthracis contain two large plasmids, pXO1 and pXO2. These plasmids are known to carry the toxin genes (lethal factor, edema factor, and protective antigen) and the three capsule genes that enhance the virulence of the organism (Cap A, B and C). The vast majority of the genes on pXO1 and pXO2, however, have not yet been characterized. The two plasmids combined contain sufficient sequence information to code for an additional 200-250 microbial sized open reading frames (ORF), i.e., 185 kb and 85 kb of DNA in pXO1 and pXO2, respectively. As an initial step to identify and characterize potential ORFs we have begun to sequence, assemble and analyze the entire DNA sequence of each plasmid. The pXO1 and pXO2 genomes are being analyzed by random cloning of sheared or restriction digested 2-4 kb fragments followed by high throughput DNA sequence analysis on automated Applied Biosystems sequencing machines. We will report on our progress and sequence analysis of the assembled shotgun sequence contigs for both pXO1 and pXO2.

This work is sponsored by the U.S. Department of Energy.

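As a rough cross-check of the coverage figures quoted in the Borrelia burgdorferi abstract above, the arithmetic can be sketched as follows. The clone counts, insert lengths, read counts and genome size are taken from that abstract; the average read length is not stated there, so ASSUMED_READ_KBP is a hypothetical value chosen only for illustration.

```python
# Back-of-envelope check of the Borrelia coverage figures (a sketch, not the
# authors' own calculation).
GENOME_KBP = 900.0  # approximate chromosome length quoted in the abstract

# (number of clones, average insert length in kbp), as quoted in the abstract
CLONES = {
    "plasmid": (2700, 2.1),
    "fosmid": (106, 35.0),
}

END_READS = 5400          # end sequences, as quoted
WALKING_READS = 2700      # primer-walking sequences, as quoted
ASSUMED_READ_KBP = 0.45   # ASSUMPTION: average usable read length, not in the source


def physical_coverage(clones, genome_kbp):
    """Total cloned insert length divided by genome length."""
    total_insert_kbp = sum(count * insert for count, insert in clones.values())
    return total_insert_kbp / genome_kbp


def sequence_coverage(n_reads, read_kbp, genome_kbp):
    """Total sequenced bases divided by genome length."""
    return n_reads * read_kbp / genome_kbp


phys = physical_coverage(CLONES, GENOME_KBP)
seq = sequence_coverage(END_READS + WALKING_READS, ASSUMED_READ_KBP, GENOME_KBP)
print(f"physical coverage ~ {phys:.1f}x")
print(f"sequence coverage ~ {seq:.1f}x")
```

With these inputs the physical coverage comes out near the quoted 10x, and the sequence coverage lands near 4x only if reads average roughly 450 bases, which is why that parameter is flagged as an assumption.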
Sequencing Technologies

Rhodobacter capsulatus Genome Sequencing Project: DENS Technology Testing Ground

Mugasimangalam Raja, Vincent Molloy, Lev Lvovsky, James Akowski, Jonathan Schisler, Yakov Kogan#, Michael Fonstein#, Robert Haselkorn# and Levy Ulanovsky
Argonne National Laboratory, Argonne, IL
#University of Chicago, Chicago, IL

We are using DENS (Differential Extension with Nucleotide Subsets, see Ref. 1 and the accompanying abstract by Raja et al.) for sequencing the genome of Rhodobacter capsulatus by primer walking without custom primer synthesis. A cosmid library of R. capsulatus was constructed and mapped with high resolution (2). Of the total 3.5 Mb genome, 1.4 Mb has already been sequenced by limited shotgun sequencing followed by conventional primer walking. Forty-eight plasmid subclones containing 5 kb inserts were isolated from each cosmid and both ends sequenced. The resulting sequences were then assembled into contigs. In conventional primer walking, approximately 100 custom synthesized primers per cosmid are required to close the sequencing gaps and generate the second strand. Around 10 additional primers are needed to fill the physical gaps between subclones by direct primer walking on the cosmid.

In primer walking by DENS, the conventional custom synthesized walking primers are replaced with DENS octamers (containing two degenerate positions each) from our presynthesized library of 2,048 octamers (50% of all possible sequences). At the current DENS success rate of 62% on dsDNA templates, the use of DENS is cheaper than primer walking using conventional custom-made primers, even though the latter yields an 85% success rate. Future closed end automation of DENS primer walking, made possible by the instant availability of primers, should reduce sequencing cost by more than an order of magnitude. The results of the R. capsulatus sequencing show that DENS is a viable option. Updated results on this pilot project and implications of the DENS technique will be discussed.

1. Raja et al. (1997) Nucleic Acids Res. 25: 800-805.
2. Fonstein et al. (1995) EMBO J. 14: 1827-1841.

Cosmid Sized Templates for DENS Sequencing Technique

Mugasimangalam Raja*, Vincent Molloy*, Lev Lvovsky# and Levy Ulanovsky*#
*CMB, Argonne National Laboratory, Argonne, IL
#Weizmann Institute of Science, Rehovot, Israel

DENS (Differential Extension with Nucleotide Subsets) is a technique used for template directed enzymatic synthesis of unique primers, avoiding the chemical synthesis step in primer walking (1). DENS works by selectively extending a short primer and making it a long one at the intended site only. The procedure starts with an initial extension of the primer (at 20-30°C) in the presence of only two out of the four possible dNTPs. The primer is extended by 5 bases or longer at the intended priming site, which is deliberately selected (as is the 2 dNTP set) to maximize the extension length. The subsequent termination reaction at 60-65°C then accepts the primer extended at the intended site but not at alternative sites, where the initial extension (if any) is generally much shorter.

Until recently we were getting about 70% success rate on ssDNA plasmid and asymmetric PCR products and 62% on ds plasmids. Now, we have found that the usable template size for DENS is not limited to several kb (typically plasmids), as we thought previously. It turns out that DENS sequencing can be performed on cosmid-sized templates on the condition that the sequence of most of the template is known (as in filling physical gaps). Then the computer excludes candidate priming sites for which the chosen octamer primers have alternative priming sites in the known part of the template if at any alternative
site the extension crosses the failure threshold (about 4-5 bases). Of 17 DENS sequencing reactions performed using Lambda phage as the template, 10 worked well.

In addition, we have developed a novel version of DENS that uses only one out of the 4 possible dNTPs in the differential extension (either dCTP or dGTP extending the octamer by 3 or more bases). One-dNTP DENS allows primer walking on such a template as a whole cosmid, even if most of its sequence is completely unknown. What makes it possible is that the probability of the occurrence of alternative sites with long differential extensions using a single dNTP is dramatically lower than that of extensions using a two-dNTP subset. Of five octamer primers tested on Lambda using one-dNTP DENS, four produced good sequences at the intended unique sites. The occurrence of suitable priming sites for the one-dNTP DENS is lower than for the two-dNTP DENS but is still high enough to perform primer walking if all 4,096 octamers are used.

1. Raja et al. (1997) Nucleic Acids Res. 25, 800-805.

Sequencing of Human Telomeric Region DNA by Differential Extension with Nucleotide Subsets (DENS)

Dina Zevin-Sonkin#, Anahit Ghochikyan#, Arthur Liberzon#, Lev Lvovsky# and Levy Ulanovsky*#
# Dept. of Structural Biology, Weizmann Institute of Science, Rehovot, ISRAEL
* CMB, Argonne National Laboratory, Argonne, IL

DENS allows primer walking without custom primer synthesis and each walk involves a two step procedure. It starts with a limited initial extension of an 8-mer primer (degenerate in 2 positions) at 20-30°C in the presence of only 2 out of the 4 possible dNTPs. The primer is extended by 5 bases or longer at the intended priming site, which is deliberately selected (as is the two-dNTP set) to maximize the extension length. The subsequent termination reaction at 60°C then accepts the primer extended at the intended site, but not at alternative sites, where the initial extension (if any) is generally much shorter (see Ref. 1). Both steps involve thermocycling.

Here we show an example of DENS primer walking on three human genomic DNA subclones of the 3-4 kbp length each from telomeric region of human chromosome 7, kindly provided by Robert Moyzis group, LANL, Los Alamos, NM. The full length sequences (3024, 3473 and 3707 bp) were obtained from both strands of each subclone using dye-terminators and the ABI-373 sequencer. For these clones we tested an easy method for ss template preparation avoiding plasmid prep of any kind. The entire insert of a plasmid clone was amplified by PCR in which one of the two primers was 5'-phosphorylated. One of the two strands was digested with lambda exonuclease (Boehringer Mannheim, #1666908) which selectively digests 5'-phosphorylated strand only (Ref. 2). Using this method ssDNA template is produced by PCR using vector-specific primers directly from overnight bacterial culture.

For the DENS reaction, we have optimized parameters related to the stability of the extended primers and used them for the selection of DENS priming sites. Upon this optimization, the success rate of DENS primer walking using the ss template preparation by PCR was 70%.

1. Raja et al. (1997) Nucleic Acids Res. 25, 800-805.
2. Little et al. (1967) J. Biol. Chem. 242, 672-678.

Primer Walking Through Alu Repeats Using DENS Sequencing Technique

Dina Zevin-Sonkin#, Anahit Ghochikyan#, Arthur Liberzon#, Lev Lvovsky#, and Levy Ulanovsky*#
#Weizmann Institute of Science, Rehovot, Israel
*CMB, Argonne National Laboratory, Argonne, IL

Primer walking using conventional primers is believed to be problematic when tandem Alu repeats are present in the template. In contrast to conventional 18-20mer primers, DENS (Differential Extension with Nucleotide Subsets, see Ref. 1 and accompanying abstracts by Raja et
al.) uses octamers degenerate in 2 positions. We have found that DENS has an advantage over conventional primer walking in sequencing through tandem Alu repeats. A single mismatch in the octamer/template complex prevents priming, enabling discrimination between nearly identical repeats. It is possible to walk through repeat-rich regions by selecting hypervariable sites for DENS priming within the Alu consensus sequence. Additional mismatch discrimination is provided during the differential extension stage, since the extension will be shorter if a template base is non-complementary to either of the two dNTPs provided (e.g. if dATP and dGTP were the only nucleotides present, and a "G" were encountered in the template). Finally, the extended primer is destabilized by a single mismatch at the higher annealing temperature of the cycle sequencing stage. All of these factors cause DENS to be very effective at distinguishing similar, but not identical, priming sites within a DNA template, whereas a conventional long primer is less able to provide such discrimination.

1. Raja et al. (1997) Nucleic Acids Res. 25, 800-805.

Structural Insights into the Properties of DNA Polymerases Important for DNA Sequencing

Stanley Tabor, Sylvie Doublie, Tom Ellenberger and Charles Richardson
Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115
[email protected]

Current methods for DNA sequencing require a DNA polymerase to extend a primer with each of the four natural nucleotides, as well as a variety of analogs such as fluorescent and chain-terminating dideoxynucleotides. We have been characterizing the structure and function of DNA polymerases to better understand and to modify those properties important for DNA sequencing, in particular the incorporation of nucleotide analogs and the processivity of DNA synthesis. Our work has focused on the DNA polymerases belonging to the Pol I family, which includes T7 DNA polymerase, Taq DNA polymerase, and E. coli DNA polymerase I. An important advance in our understanding of these enzymes has resulted from our recent determination of the 2.2 Å crystal structure of T7 DNA polymerase locked in a replicating complex with a dideoxy-terminated primer-template, an incoming dNTP, and the processivity factor thioredoxin [1]. The incoming dNTP fits snugly into a pocket formed by the fingers of the polymerase closing on its palm and the 3'-terminus of the primer. Numerous interactions of the bound nucleotide with the template base, the polymerase, and two metal ions specify the correct base-pair in the active site, and also provide insight into the mechanism of discrimination against analogs with modifications in the sugar moiety (e.g. dideoxynucleotides, ribonucleotides and 3' fluoro derivatives) as well as bases containing bulky fluorescent substituents. We are making use of this structure to construct mutant polymerases that more efficiently incorporate nucleotide analogs modified in their sugar, base or triphosphate groups.

The processivity factor for T7 DNA polymerase, E. coli thioredoxin, is located at the tip of the thumb of the polymerase in a position poised to prevent dissociation of the DNA from the polymerase. It binds to a 74 residue domain in T7 DNA polymerase that is attached to the thumb by a flexible tether. While this domain is unique to T7 DNA polymerase, it is modular in that it can be transferred to other homologous polymerases by gene fusion to generate hybrid enzymes that have dramatically increased processivity for DNA synthesis [2]. The crystal structure of the T7 DNA polymerase complex suggests that thioredoxin is acting to stabilize the region it binds to, allowing a number of critical basic residues in T7 DNA polymerase to interact electrostatically with the DNA backbone and thus prevent its dissociation. We are carrying out extensive mutagenesis of this region in order to further define the critical structural features and to engineer new DNA polymerases that have increased processivity.

This work is funded in part by DOE grant DE-FG02-96ER62251 (Stanley Tabor, P.I.)

[1] Crystal Structure of Bacteriophage T7 DNA Polymerase Complexed to a Primer-Template, a Nucleoside Triphosphate, and its Processivity Factor Thioredoxin. Sylvie Doublie, Stanley Tabor, Alexander Long, Charles C. Richardson and Tom Ellenberger, Nature, in press.

[2] The Thioredoxin Binding Domain of Bacteriophage T7 DNA Polymerase Confers Processivity on Escherichia coli DNA Polymerase I. Ella Bedford, Stanley Tabor and Charles C. Richardson. Proc. Natl. Acad. Sci. USA 94, 479-484 (1997).

Semiautomated Mutagenesis and Screening of T7 RNA Polymerases

Mark W. Knuth, Scott A. Lesley, and Heath E. Klock
Promega Corporation, 2800 Woods Hollow Road, Madison, WI 53597
[email protected]

The objective of this grant is to develop an engineered RNA/DNA polymerase for primerless DNA sequencing, among other applications. In order to develop this enzyme, systems have been developed for facile semiautomated mutagenesis, expression, micropurification, and screening of mutants of T7 RNA polymerase.

Two publications have reported T7 RNAP mutations that confer some ability to incorporate dNTPs; we have further characterized their performance in parameters important to DNA sequencing and other applications, and have created screens to test for improved performance.

To date, clones representing all possible amino acid substitutions at 96 sites have been created, using a novel mutagenesis kit developed at Promega. These mutant enzymes are being expressed, purified and tested for increased dNTP incorporation. We anticipate that at full capacity, around 12 positions will be screened per week, using techniques and equipment that can be generalized to other enzymes requiring some purification before screening.

Vectors and Biochemistry for Sequencing by Nested Deletions

John J. Dunn and Matthew Randesi
Biology Department, Brookhaven National Laboratory, Upton, New York 11973
[email protected]

An ordered set of nested deletions whose ends are separated by ~400 bp allows rapid sequencing across one strand of a cloned fragment, using a universal primer. Any gaps remaining after this process can be closed by primer walking on the original clone. Even highly repeated DNA can easily be assembled correctly, knowing the relative locations of the sequences obtained. We have developed vectors and protocols that allow simple and reliable production of nested deletions suitable for such a sequencing strategy, from cloned fragments at least as large as 17 kbp and potentially 40 kbp.

We have made two single-copy, amplifiable vectors suitable for this strategy, and have validated and made reliable a previously described method for generating nested deletions enzymatically. In both vectors, clones are stably maintained at low copy number by the F replication and partitioning functions and can be amplified from an IPTG-inducible P1 lytic replicon to prepare DNA. A synthetic version of the phage f1 origin of replication is located a short distance upstream of the multiple cloning site. Vector pND-1 is used primarily for obtaining clones by transformation or electroporation; pND-2 has phage lambda cos sites that allow efficient cloning of 30-40 kbp fragments in a lambda packaging system.

Reaction conditions have been defined where purified f1 gene 2 protein efficiently introduces a strand-specific single nick in the f1 origin sequence with very little rejoining. Large amounts of stable gene 2 protein are easily obtained from a clone by a rapid purification procedure we developed. To create the nested deletions, the nick is expanded unidirectionally into the cloned fragment by 3' to 5' digestion with E. coli Exo III, the resulting single-stranded regions are digested with S1 nuclease, and the ends are repaired and ligated
with T4 DNA polymerase and ligase. The Exo III digestion is highly synchronous and processive, and the deletion lengths are proportional to incubation time. To prevent undeleted DNA from giving rise to clones, the treated DNA is digested with one of several restriction enzymes whose 8-base recognition sequences lie between the f1 origin and the cloning site. Nested deletion clones are then obtained by electroporation.

Pooling samples from several different times of Exo III digestion before subsequent treatment generates a good distribution of deletion clones. Growth and amplification of randomly selected clones in 1 ml of medium in 96-well format followed by a simple DNA preparation protocol provides ample DNA for analyzing deletion length by gel electrophoresis and for DNA sequencing reactions. Imaging and sizing software is now being tested for automated selection of an appropriate set of deletions for sequencing.

Exploratory work has demonstrated the effectiveness of this nested deletion strategy for sequencing fragments at least as large as 17 kbp cloned from a human BAC.

Trapping of DNA in Non-Uniform Oscillating Electric Fields

Charles L. Asbury, Ger van den Engh
Department of Molecular Biotechnology, University of Washington, Seattle, WA 98105

We present a method, analogous to optical trapping, which allows manipulation of DNA molecules in aqueous solution. Due to the induction of an electric dipole, DNA molecules are pulled by a gradient force to regions of high electric field strength. With the use of very thin gold films and oscillating fields, molecules can be trapped individually or locally concentrated. The molecules do not become permanently attached to the gold. Spatial control over the trapped molecules is achieved because they are confined to a width of ~5 µm perpendicular to the edges of the gold films. It is possible to mix static and oscillating electric fields in order to move trapped molecules from one edge to another, or to make them follow very precise trajectories along the edges. This phenomenon may be useful in microdevices for manipulation of small quantities or single molecules of DNA.

Energy Transfer Fluorescent Primers Optimized for DNA Sequencing and Diagnostics

Su-Chun Hung*, Yiwen Wang**, Richard A. Mathies** and Alexander N. Glazer
Departments of Molecular and Cell Biology* and Chemistry**, University of California, Berkeley, CA 94720

Energy transfer (ET) primers are markedly superior to single dye-labeled primers in DNA sequencing, and in multicomponent genetic analyses, such as forensic identification, and genetic typing of short tandem repeats [1,2]. We describe here improvements in the properties of ET fluorescent primers that enhance their value in all of these applications. In designing improved ET primers, we focused on the spectroscopic characteristics most important for DNA sequencing and PCR product analysis: the relative acceptor fluorescence emission intensity and the amount of residual donor fluorescence emission, and on improving the match in electrophoretic mobilities of the DNA fragments extended from ET primers. Hung et al. [3] showed that 3-(ε-carboxypentyl)-3'-ethyl-5,5'-dimethyloxacarbocyanine (CYA or C), a dye with a high absorption cross-section but a low fluorescence quantum yield, is superior to FAM (F) as a donor in ET primers and that 6-carboxyrhodamine-6G (R6G or G) and 5&6-carboxyrhodamine-110 (R110) can serve as alternative acceptors to JOE and FAM, respectively, in ET primers. CYA-labeled ET primers have stronger acceptor emission and higher spectral purity than ET primers with a FAM donor. The ET primer set C10R110, C10G, C10T, and C10R with only rhodamine derivatives (R110, R6G, TAMRA, and ROX) as the four acceptors has superior spectroscopic properties and gives results in DNA sequencing with near-perfect match in the electrophoretic mobility of single-base extension DNA fragments both in capillary and slab gel
electrophoresis [3,5]. ET primers with CYA as donor and a donor-acceptor spacing of 4-6 nucleotides offer excellent acceptor emission intensities coupled with negligible donor emissions. In multiplex separations, this allows precise quantitation of the ratio of the signals from DNA fragments labeled with one or another of two different primers. We have applied the two-color ET primer sets C4G/C4R and C6G/C6R to bladder cancer diagnosis based on electrophoretic analyses of polymerase chain reaction-amplified short tandem repeats (STRs) where the diagnosis depends on the detection of loss of heterozygosity at particular loci. The success of the analysis depends on the accurate multiplex quantitation of the amplified DNA fragments from two different samples, normal and tumor cell DNA. Multiplex analyses with the two-color primer sets C4G/C4R and C6G/C6R allowed quantitative determination of allelic ratios with a precision of ±10% [6]. We performed a quantitative comparison of sets of primers differing in the nature of the donor-acceptor combinations, CYA-ROX, FAM-ROX, and BODIPY503/512-BODIPY581/591. Variables examined included the length of the 5'-amino linker arm, the number of base pairs between the donor and acceptor, and the excitation wavelength (488 or 514 nm). Of the primers examined, CYA-ROX primers offer the best combination of acceptor fluorescence emission intensity and spectral purity [7].

Supported by the Director, Office of Energy Research, Office of Health and Environmental Research of the U.S. Department of Energy under Contract DE-FG-91ER61125. Financial support from the Amersham Life Science Inc. is also gratefully acknowledged.

1. J. Ju, A.N. Glazer, and R.A. Mathies Nature Medicine 2, 246-249 (1996).
2. A.N. Glazer and R.A. Mathies Curr. Opinion Biotechnol. 8, 94-102 (1997).
3. S-C. Hung, J. Ju, R.A. Mathies, and A.N. Glazer Anal. Biochem. 243, 15-27 (1997).
4. S-C. Hung, J. Ju, R.A. Mathies, and A.N. Glazer Anal. Biochem. 238, 165-170 (1996).
5. S-C. Hung, R.A. Mathies, and A.N. Glazer Anal. Biochem. 251, (1997), in press.
6. Y. Wang, S-C. Hung, J.F. Linn, G. Steiner, A.N. Glazer, D. Sidransky, and R.A. Mathies Electrophoresis 18, (1997), in press.
7. S-C. Hung, R.A. Mathies, and A.N. Glazer Anal. Biochem., in press.

Direct PCR Sequencing with Boronated Nucleotides

Kenneth W. Porter, Dima Sergueev, Ahmad Hasan, J. David Briley and Barbara Ramsay Shaw
Department of Chemistry, Duke University, Durham, NC 27708

DNA can be simultaneously amplified and sequenced using a new class of nucleotides containing boron. During the polymerase chain reaction, boron-modified nucleotides, i.e. 2'-deoxynucleoside 5'-alpha-[P-borano]-triphosphates, are incorporated into the product DNA. The boranophosphate linkages are resistant to nucleases and thus the positions of the boranophosphates can be revealed by exonuclease digestion, thereby generating a set of fragments that terminate in a boranophosphate linkage and define the DNA sequence. The boranophosphate method offers an alternative to current PCR sequencing methods. Single-sided primer extension with dideoxynucleotide chain terminators is avoided, with the consequence that the sequencing fragments are derived directly from the original PCR products. Boranophosphate sequencing has been demonstrated with the Pharmacia and the Applied Biosystems 373A automatic sequencers, producing data that is comparable to cycle sequencing.

The method has been improved recently by streamlining sample preparation and by employing modified boranophosphate nucleotides. Sample preparation is streamlined by implementing direct digestion of the PCR products immediately after amplification. The base-specific PCR products are combined and digested by direct addition of Exonuclease III and Phosphodiesterase I. Subsequently, the digestion products are purified by spin column chromatography, concentrated by isopropanol precipitation, and separated by gel
electrophoresis. The uniformity of the digestion products is increased by the addition of modified boranophosphates to the PCR amplification and by the addition of Phosphodiesterase I to the digestion mixture.

Additionally, we have synthesized a 2'-deoxyadenosine alpha-borano-triphosphate with a fluorescent tag attached at the C-8 position of adenine via an alkylamine. Experiments are in progress to determine its ability to be incorporated during amplification and its nuclease resistance, and to develop a screen that will allow for rapid evaluation of other naturally occurring or mutant exonucleases.

The boron method may find use in applications where high resolution of longer fragments requires stronger signals at longer read lengths, because the distribution of fragments produced by nuclease digestion should be skewed to long fragments. Direct sequencing of PCR products simplifies bidirectional sequencing and provides a simple, direct, and complementary method to cycle sequencing.

(Supported by DOE grant 94-ER61882 and continuing grant 97ER-62376 to B.R.S.)

Advances in DNA Analysis Using Capillary Array Electrophoresis

Indu Kheterpal, James R. Scherer, William W. Ja, Yiwen Wang and Richard A. Mathies
Department of Chemistry, University of California, Berkeley, CA 94720

Recent advancements in DNA analysis with capillary array electrophoresis (CAE) have included (i) the optimization of sample preparation, sample loading and separation matrices for DNA sequencing, (ii) the development of a scanner for detecting separations in large numbers of capillaries (~1000) in parallel, (iii) the development and evaluation of new 3-color extended binary coding strategies, and (iv) the development of rapid and sensitive methods for high-throughput cancer screening:

(1) Four-color confocal fluorescence CAE scanners using our standard flat bed design [1] are being used along with energy transfer (ET) primers for sequencing of mitochondrial (mt) DNA, Chlamydia, and Anabaena. Twelve motifs of the hypervariable region I of human mt DNA from a Sierra Leone population have been sequenced with an average accuracy of 99.7% [2]. Over 40 kilobases of Chlamydia clones have been resequenced and an 8 kilobase fragment has been assembled with 99.63% accuracy. Several genes from Anabaena have also been sequenced as a part of an undergraduate research project and fragments of up to 8 kilobases have been assembled.

(2) A capillary array scanner (CAE) capable of acquiring four-color data at the rate of 4 Hz from over 1000 capillaries has been constructed. The scanner has a rotating objective which excites and collects fluorescence from one to 1088 capillaries which are positioned in precisely machined grooves (spaced at 260 microns) in a cylindrical objective housing. Four-color data from four photomultipliers is obtained simultaneously with four independent ADCs. The acquired data is stripped of non-sample gaps, averaged across each capillary and also averaged across a variable number of successive rotations as it is acquired. The instrument has been designed to use replaceable matrices introduced by pressure filling. This scanner will be evaluated through sequencing of Chlamydia in collaboration with Ron Davis' group at Stanford.

(3) New three-color extended binary coding strategies for multiplex DNA sequencing have been developed using ET primers [3]. These three-color methods are found to be nearly as good as traditional four-color coding methods with sequencing accuracy rates of 99.6% and 99.9%, respectively. Methods have been developed to deconvolve the three-color data into the four base concentrations. This three-color approach is the first step towards the development of higher order multiplex coding schemes for DNA sequencing and other analyses.

(4) Finally, short tandem repeat (STR)-based bladder cancer diagnosis methods have been developed using two-color labeling with ET primers and CAE electrophoresis [4]. Rapid (< 35
min.) separations are achieved on capillary arrays using replaceable separation matrices and the allelic ratios are quantitatively determined with a precision of 10%. These methods provide a significant improvement in the speed, ease and precision of STR analyses.

Supported by the Director, Office of Energy Research, Office of Health and Environmental Research of the U.S. Department of Energy under Contract DE-FG-91ER61125.

1. R. A. Mathies, X. C. Huang, Nature (London), 359, 167-169, 1992.
2. I. Kheterpal, J. R. Scherer, S. M. Clark, A. Radhakrishnan, J. Ju, C. L. Ginther, G. F. Sensabaugh, R. A. Mathies, Electrophoresis, 17, 1852-1859, 1996.
3. I. Kheterpal, L. Li, T. P. Speed, R. A. Mathies, Anal. Chem., submitted.
4. Y. Wang, S.-C. Hung, J. F. Linn, G. Steiner, A. N. Glazer, D. Sidransky and R. A. Mathies, Electrophoresis 18, in press, 1997.

Development of a Low Cost, High Throughput, Four Color DNA Sequencer

Michael S. Westphall, David R. Rank and Lloyd M. Smith
University of Wisconsin-Madison, Department of Chemistry, 1101 University Ave., Madison, WI 53706
[email protected]

It has been recognized from the beginning of the Human Genome project that in order to successfully complete this tremendous undertaking the development of new and improved technologies for DNA sequencing would be required. We have focused on several aspects of this technology development from automated Front End sample preparation to base calling of collected data. Central to the development of this and any other automated sequencing system is the electrophoresis platform. To meet the electrophoresis throughput requirements of our automated Front End DNA sample preparation system, we have developed a 4 color fluorescent sequencing instrument emphasizing long read lengths, dense sample loading, and low construction cost.

The system employs a 250 micron thick cross-linked polyacrylamide gel, 10 inches wide x 24 inches long, which is temperature regulated on both sides. A scanning four color detection system is employed to collect the emitted fluorescence. Data is collected bi-directionally with one of four bandpass filters in position per scan. Image processing and base calling is performed using GelImager and BaseFinder software packages developed in our lab.

The system provides sufficient resolution to base-call DNA fragments beyond 1000 bases in length (albeit requiring high quality template preparations and sequencing chemistry) and to routinely deliver runs with 750 bases of useable sequence (98% accuracy). The system currently processes 88 samples in parallel and can be built at a cost of $23,000. Instrument design details will be presented along with performance characteristics (in combination with our analysis software) as obtained through in-house sequencing projects.

Multiplexed Integrated On-line System for DNA Sequencing by Capillary Electrophoresis: From Template to Called Bases

Edward S. Yeung* and Hongdong Tan
Ames Laboratory-USDOE and Department of Chemistry, Iowa State University, Ames, IA 50011
[email protected]

DNA sequencing as practiced today involves a series of steps starting from isolating DNA from the biological sample, cutting these into fragments with convenient sizes, amplifying the fragments biochemically, introducing a label for detection while generating a nested set of ordered fragments, separating the ordered set of fragments and identifying the nucleotide sequence, and reassembling the short sequence data into a continuous sequence. Recent developments of capillary electrophoresis, especially in multiplexed arrays, show great promise for substantially increasing the speed and throughput of the
separation and identification steps. The issue of cost when combined with the other steps in the whole sequencing process still remains. Our project is aimed towards the development of novel front-end strategies, whereby the speed and throughput of sample preparation can be significantly increased while the amount of manual operation and total cost can be significantly reduced. We will present our latest results on multiplexed sample preparation in small volumes without the use of robotics. This promises to reduce the cost of reagents and at the same time provide high-speed, high-throughput operation.

An integrated on-line prototype for coupling a microreactor to capillary electrophoresis for DNA sequencing has been demonstrated. A dye-labeled terminator cycle-sequencing reaction is performed in a fused-silica capillary. Subsequently, the sequencing ladder is directly injected into a size-exclusion chromatographic column operated at ~95 °C for purification. On-line injection to a capillary for electrophoresis is accomplished at a junction set at ~70 °C. High temperature at the purification column and injection junction prevents the renaturation of DNA fragments during on-line transfer without affecting the separation. The high solubility of DNA in, and the relatively low ionic strength of, 1x TE buffer permit both effective purification and electrokinetic injection of the DNA sample. The system is compatible with highly efficient separations by a replaceable poly(ethylene oxide) polymer solution in uncoated capillary tubes.

We will present data from an 8-capillary system where individual templates are simultaneously injected from standard microtiter wells. After injection, a completely computer controlled system takes the samples through terminator-labeled cycle sequencing, purification, introduction to 8 parallel electrophoresis capillaries for separation and detection, and base calling. Scaling up to 100 simultaneous channels is straightforward. Future research should allow starting from single bacterial colonies injected into each capillary and the nucleotide sequence identified in a fully on-line and automated system, perhaps multiplexed to 1000 at a time.

Production Sequencing Evaluation of a Nearly Automated 96-Capillary Array DNA Sequencer Developed at LBNL

Jian Jin, William F. Kolbe, Yunian Lou and Earl W. Cornell
Ernest Orlando Lawrence Berkeley National Laboratory, University of California, Engineering Science Department, Human Genome Group, 1 Cyclotron Road, Berkeley, CA 94720
[email protected]

As the Human Genome Project moves towards a scale-up of its sequencing phase, it is apparent that gel electrophoresis will remain the main technology for DNA sequencing. Although slab gel electrophoresis is currently the dominant technology used in production-level sequencing, recent rapid progress in capillary electrophoresis technology suggests that it soon will have the capability of replacing slab gel sequencers in production work. At LBNL we have developed a beta-test version of a 96-capillary system capable of production level sequencing at increased rates and, more importantly, improved automation. The system is based on an adaptation of the best available technology currently being developed by several laboratories. In particular, we employ a sheath-flow excitation/detection geometry derived from earlier work by Norman Dovichi at the University of Alberta. Our system, fully integrated from the assembling of the capillary array to automated DNA sequencing, to base-calling, has demonstrated the effectiveness of the sheath-flow approach using a 96-channel array, with a separation speed of 700 bases/hour/channel. The instrument's operation has been fully computerized with automatic control of the laser, high voltage power supply, data acquisition and processing. The gel replacement and sample loading were nearly automated, and the sequencing cycle time was less than 2 hours. A standardized baseline protocol for capillary coating and filling and sample preparation has also been developed. Currently, this system has undergone a series of beta-testing runs in the production-sequencing environment at LBNL's sequencing center. By loading our system with the sample remainders left by production sequencing at LBNL and directly comparing our sequencing results with those generated by the

ABI-377 machines, we have found that a comparable sequencing quality has been achieved by our instrument. Details of results will be presented.

This work was supported by the Director, Office of Energy Research, Office of Health and Environmental Research, Human Genome Program, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.

DNA Sequencing by Capillary Electrophoresis from Sample Preparation to Resulting Sequence: A Robust High Throughput Procedure with Long Read Length

Salas-Solano, O., Carrilho, E., Ruiz-Martinez, M.C. (1), Goetzinger, W. (2), Kotler, L., Sosic, Z., and Karger, B.L.
Barnett Institute and Department of Chemistry, Northeastern University, Boston, MA, 02115
[email protected]

Capillary electrophoresis (CE) is being developed in our lab and elsewhere for high speed automated DNA sequencing. However, while several multichannel sequencing instruments are currently under development by different groups, the significant issues of robustness and high overall throughput of the DNA sequencing analysis have not been adequately addressed. We are developing a fully automated multicapillary DNA sequencing system, starting from sample preparation to DNA sequence generation. To reduce the failure rate to a minimum and, hence, increase the throughput of such a system, proper care is necessary for every step of the process. We have developed a robust protocol for thorough purification of DNA samples that includes both desalting and template removal. This procedure also resulted in a 10-fold increase in the amount of DNA sequencing fragments compared to conventional desalting by ethanol precipitation or gel filtration columns. The purification procedure is economical, very reproducible and compatible with different sequencing chemistries. Conditions for electrokinetic injection of the purified samples have also been optimized. With respect to the column, we previously found that long read length sequencing runs may be achieved using 2% w/w linear polyacrylamide (LPA) with a molecular weight close to 9 MDa. A new, convenient method for high molecular weight LPA preparation using inverse emulsion polymerization has been developed. With this procedure, LPA forms a white powder of unlimited shelf life and allows fast and reproducible preparation of working polymer solutions. Additionally, base calling software has been improved to increase the accuracy of extended read length (see a separate abstract of A.W. Miller and B.L. Karger).

We are currently working on a fully automated multicapillary array DNA sequencing system, which will include a robotic sample preparation and purification system, separation matrix replacement and sample injection automated devices. We will present our latest results in this area. The described sequencing technology advances will greatly increase overall throughput of the automated multi-channel capillary DNA electrophoresis system.

This work is being supported by DOE grant #DE-FG02-90ER60895.

(1) Present address: Curagen Corp., Branford, CT 06504
(2) Present address: Arqule Inc., Medford, MA 02155

A Fully Automated 96-Capillary DNA Sequencer

Qingbo Li, Thomas E. Kane, Changsheng Liu, John Kernan, and James R. Hoyland
Premier American Technologies Corp.
[email protected]

A high throughput DNA sequencer is constructed, where the instrument operation (e.g., sample introduction, DNA separation, instrument reconditioning, data processing) is carried out and tightly controlled by the instrument computer. The complete instrument consists of two units. The main unit houses the optical detection system, the 96-capillary cartridge, the mechanical system for

sample introduction, and the associated electronic controllers. The other unit is the liquid handling module dedicated to gel delivering and capillary reconditioning. The two units interface through a liquid conducting tubing. In the detection system, an air-cooled argon ion laser is efficiently coupled with the optics to excite fluorescence from all 96 capillaries. The instrument is portable. In the sample introduction system, a carrousel assembly allows automatic processing of seven 96-well sample trays without human intervention.

The high throughput arises primarily from four advantages: (1) the instrument design that allows complete automatic operation; (2) the use of 96 discrete capillaries, with which 96 separate DNA samples can be analyzed simultaneously; (3) the use of a high speed CCD camera that allows simultaneous monitoring of fast separation in all 96 capillaries; (4) the use of dilute replaceable gel matrix that minimizes the time required for gel filling and capillary reconditioning. Preliminary sample runs of 400 - 500 DNA bases per capillary have been achieved, yielding 38,400 - 48,000 bases read per instrument run. Including the time for sample introduction, separation, and capillary reconditioning, the PATCO prototype is capable of one complete run within two hours. This sample analysis throughput is comparable to the demands of the Human Genome Project.

The 96-capillary prototype will be available by the end of the year. In the next stage, the instrument will be scaled up to 384 capillaries for 4X improvement in the instrument throughput.

Development of a Microchannel Based DNA Sequencer

Courtney Davidson, Joseph Balch, Larry Brewer, Joe Kimbrough, Steve Swierkowski, David Nelson, Ramkrishna Madabhushi, Ron Pastrone, Ann Lee, Paula McCready, Aaron Adamson, Bob Bruce, Ray Mariella, and Anthony Carrano
Lawrence Livermore National Laboratory, Human Genome Center
[email protected]

We are developing instrumentation for DNA sequencing based on the use of an array of microchannels fabricated on a glass substrate. Arrays with up to 101 microchannels, 48 cm long have been fabricated in plates of borosilicate float glass. Since last reporting we have further improved our microfabrication process to etch channels of arbitrary width and depth. We have etched substrates with 12, 24, and 101 channels per plate varying from 150 - 200 µm wide and 30 - 60 µm deep. The microchannels are constructed with two plates of glass (nominally 7.5 cm x 58 cm) which are fusion bonded. The channel plate is bonded to a top plate that has the input and output ports for sample introduction and buffer reservoir interconnects. Channels of various cross section sizes have been built and tested using low viscosity solutions of linear poly(dimethylacrylamide). This sieving media dynamically coats the walls of the glass microchannels to significantly reduce the electro-osmotic flow so that sequencing separations can be done in uncoated channels. Also, the relatively low viscosity of the sieving media allows it to be pumped into the channels via a simple syringe pump. This syringe pump is coupled directly to a common output port of the microchannel plate so that all channels can be filled with sieving media in one simple pump operation. A linear scanning confocal PMT-based detection system is used to detect the laser induced four color fluorescence from the DNA fragments. The data acquisition and control system is based on a Pentium class personal computer running LabVIEW. Data compression and transfer, signal analysis, and basecalling routines have been developed in S-PLUS and C on a Sun Microsystems Ultra Enterprise 2 server. Recent experimental results of microchannels 60 µm deep by 250 µm wide and having a 38 cm load-to-read length resulted in an electrophoretic resolution greater than 400 bases in about 90 min. for a 160 V/cm separation field. A 24 microchannel plate is presently operational providing electrophoretic resolution of 450 to 500 bases. Currently we are building and assembling a 96 channel system based on this technology into an "alpha-phase" DNA sequencing instrument which will be networked with the Sun for data analysis and base calling. We will report on-going results obtained from it for high throughput DNA sequencing.
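The per-run figure quoted above for the 96-capillary prototype (400 - 500 bases per capillary, 38,400 - 48,000 bases per run, one run within two hours) is simple arithmetic. The following is a minimal sketch of that calculation, not anything taken from the instrument software. The function names, the assumption of round-the-clock operation, and the genome size of about 3 x 10^9 bases are illustrative assumptions.

```python
# Back-of-envelope throughput for a capillary-array sequencer.
# From the abstract: 96 capillaries, 400-500 bases per capillary,
# one complete run within two hours. All other values are assumptions.

def bases_per_run(capillaries: int, bases_per_capillary: int) -> int:
    """Raw bases read in a single instrument run."""
    return capillaries * bases_per_capillary


def bases_per_day(capillaries: int, bases_per_capillary: int,
                  run_hours: float = 2.0, hours_per_day: float = 24.0) -> float:
    """Raw bases per day, assuming back-to-back runs with no downtime."""
    runs_per_day = hours_per_day / run_hours
    return bases_per_run(capillaries, bases_per_capillary) * runs_per_day


low = bases_per_run(96, 400)    # lower bound quoted in the abstract
high = bases_per_run(96, 500)   # upper bound quoted in the abstract

# Planned scale-up to 384 capillaries: 4x the capillaries, 4x the bases.
scaled_high = bases_per_run(384, 500)

# Hypothetical genome-scale estimate: single-pass raw reads only,
# ignoring redundancy, failed lanes and finishing.
GENOME_BASES = 3e9  # assumed approximate human genome size
days_single_pass = GENOME_BASES / bases_per_day(96, 500)

print(low, high, scaled_high, round(days_single_pass))
```

Because real projects read each base several times and runs fail, the single-pass estimate is only a lower bound on instrument time.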

Work is supported by a grant from the National Center for Human Genome Research, National Institutes of Health and by The Department of Energy Human Genome Program. Work was performed by Lawrence Livermore National Laboratory under auspices of the U.S. Department of Energy under contract no. W-7405-ENG-48.

Microchannel Process Development and Fabrication for DNA Sequencing

Steve Swierkowski, Joseph Balch, Courtney Davidson, and Lisa Tarte
Lawrence Livermore National Laboratory, Human Genome Center
swierkowski [email protected]

We have developed a process for the production of microchannel arrays on single glass substrates as an alternative electrophoresis technology to arrays of discrete capillaries for DNA sequencing. This technology approach provides a number of advantages for building large arrays of electrophoresis microchannels for DNA sequencing. By fabricating the array of microchannels on a single glass substrate, the arrays of microchannels are very robust mechanically and can be handled without any special care. By means of photolithography and chemical etching techniques the dimensions of rectangular cross-section channels can be optimized by making the channel depth thin to minimize the thermal dispersion of DNA bands while at the same time the channel width can be made large to increase the amount of dye-labeled DNA available for strong fluorescence signal generation and detection. The detection of the fluorescence signal is also made easier by having a flat optical window over the channels through which laser excitation of fluorescence occurs with less scattered light of the primary laser beam to contribute to the overall noise level.

Microchannel arrays for electrophoresis with up to 101 channels, 48 cm long have been fabricated in plates of borosilicate float glass. The channels are constructed with two plates of glass that are 7.6 cm wide and 58 cm long and are fusion bonded at 650 °C. The channel plate is 5 mm thick and typically has 12, 24, or 101 channels per plate; the channels are 150 - 200 µm wide and about 30 - 60 µm deep. The channel plate is bonded to a top plate that has the input and output ports in it. Two different types of input ports have been tested. For a 5 mm thick top plate, input ports 1 mm in diameter have been ultrasonically milled through the top plate and registered with the channels before the bonding process. For a 1.2 mm thick top plate, input ports as small as 150 µm have been fabricated (about the same as the channel diameter) and these have been registered to within 20 µm accuracy to the channels before bonding.

The patterning of these plates employs simple contact printing with flat panel display industry type photomasks onto standard photoresists. Special apparatus was constructed to coat the plates with photoresist and also to expose them with a simple contact printing method. A critical procedure was developed to eliminate microscopic damage to the glass before processing begins and to clean the glass at the beginning of the processing. This special procedure was essential to reduce the microchannel etching defects by many orders of magnitude that would have otherwise rendered the plates useless for high resolution genome sequencing. Extensions of this fabrication technology should enable very high (400 - 500) channel count plates to be made that would, in turn, greatly increase the throughput and efficiency of sequencing instruments.

Work is supported by a grant from the National Center for Human Genome Research, National Institutes of Health and by The Department of Energy Human Genome Program. Work was performed by Lawrence Livermore National Laboratory under auspices of the U.S. Department of Energy under contract no. W-7405-ENG-48.

An Optical Trigger for Locating Microchannel Position

Laurence R Brewer, Joseph Kimbrough, Courtney Davidson, and Joseph Balch
Lawrence Livermore National Laboratory, Human Genome Center
brewer [email protected]

We have developed a technique for automatically aligning a microchannel plate with a scanning fluorescence detector in a high throughput DNA sequencer. An optical signal from each microchannel can be used to dramatically reduce the amount of data collected while further eliminating the effects of hysteresis and velocity variation of the scanning motor. While the system can be run in a high resolution mode (~2800 points per scan) for diagnostic purposes such as evaluating band shape, the described technique can be adapted for the collection of a single data point per channel per scan. Such a reduction is prerequisite for practical application to high throughput DNA sequencing. The technique makes use of the difference in reflection of the air-glass, gel-glass, and bonded glass-glass interfaces present in the microchannel plate. A thin piece of glass is used to collect back-reflected light from the laser-illuminated microchannels and send it to a photodiode for detection. This signal delineates the position of the microchannels with a high signal to noise ratio and is used as an electronic trigger for data collection.

Work is supported by a grant from the National Center for Human Genome Research, National Institutes of Health and by The Department of Energy Human Genome Program. Work was performed by Lawrence Livermore National Laboratory under auspices of the U.S. Department of Energy under contract no. W-7405-ENG-48.

Multiple Capillary DNA Sequencer Illuminated by a Waveguide

Mark A. Quesada, Harbans S. Dhadwal, David J. Fisk, Janine S. Graves and F. William Studier
Biology Department, Brookhaven National Laboratory, Upton, New York 11973
quesada@genome1.bio.bnl.gov

Capillary electrophoresis through a replaceable polymer matrix has great promise for improving the speed and efficiency of DNA sequencing if many different capillaries can be analyzed simultaneously. However, illumination of the interiors of multiple capillaries and collection of the emitted fluorescence from each is complicated by the cylindrical shape and small radii of curvature of the capillaries. We have solved this problem by using the capillaries themselves as optical elements in a waveguide.

Refraction of light at the surfaces of a capillary depends on the radii of curvature of the capillary walls and the change in refractive index in crossing each surface. With appropriate dimensions and refractive indices, refractive effects can confine a beam of light to pass through the interior of each successive capillary in a parallel array. The illuminating beam must be in the plane of the array and normal to the first capillary, have minimal divergence, and have a radius comparable to or smaller than the internal radius of the capillary. This condition is readily achieved by delivering the beam through an integrated fiber optic transmitter.

Losses of light due to reflection at each successive capillary surface can be minimized by reducing the differences in refractive index across the surfaces while still satisfying the conditions for beam confinement. Illuminating with coaxial beams from opposite ends of the array also improves the uniformity of illumination. Using commercially available materials, it would be feasible to make 96-capillary waveguide sequencers that would be expected to have only a few percent variation in the intensity of illumination of each capillary across the entire array. The fluorescence can be collected by an array of optical fibers whose spacing is identical to that of the capillaries and whose ends are positioned normal to the capillaries. This configuration allows simultaneous alignment of all of the collection fibers.

A 12-capillary prototype sequencing apparatus was fabricated and tested, validating the key elements of this design. The capillaries are illuminated efficiently with only 30 mW of laser power, and the fluorescence is efficiently

collected by the matched optical fibers with little cross-talk between channels. The collected light is delivered to a spectrograph and the full fluorescence spectrum of all the capillaries in parallel is displayed on the surface of a CCD and read into a computer about 3 times per sec. Replacement of capillaries and alignment of the system is very simple.

Samples are electrokinetically injected simultaneously into all 12 capillaries from a single row of a microtiter plate. A dimethylpolyacrylamide polymer matrix is used in uncoated capillaries. Our current preparations can be used for about 30 sequencing runs per capillary before performance begins to degrade. We are developing base calling software based on Bayesian analytical methods. The system currently reads almost 500 bases per capillary when run at room temperature, with a cycle time of about 2 hr.

Technology Development and Informatics Tools for Genome Sequencing

E.R. Mardis, L. Hillier, A. Chinwalla, M. Cook, M. Holman, G. Marth, R. McGrane, D.A. Panussis, D.C. Peluso, L.L. Rifkin, J.E. Snider, J. Strong, E. Stuebe, R.H. Waterston, M. Wendl, R.K. Wilson
Washington University School of Medicine, Genome Sequencing Center, St. Louis, MO 63108
[email protected]

Efficient performance of high throughput, large-scale genome sequencing projects depends upon improvements to techniques and devices for their performance that streamline data production, as well as computer tools to complement these improvements. Thus, the ability to study processes and apply the appropriate modifications, devices and/or informatics tools is critical to our ability to increase productivity and efficiency. Several examples of technology development and informatics contributions to our processes will be highlighted. These include efforts to increase sample capacity on DNA sequencers, improve sample loading processes, and streamline liquid transfer steps, as well as scripts and algorithms to reduce time spent on gel retracking, to facilitate sequence data entry and to examine assembled projects.

Single Molecule DNA Sequencing

Hong Cai, Peter M. Goodwin, James H. Jett, Richard A. Keller, Nicholas P. Machara, Susan L. Riley, and David J. Semin
Los Alamos National Laboratory, Los Alamos, New Mexico 87545
[email protected]

Our flow-cytometry based approach to DNA sequencing involves: (1) labeling DNA fragments with base-specific fluorescent tags; (2) suspending a single labeled fragment in a flow stream; (3) digesting with an exonuclease that sequentially cleaves the end nucleotide and releases it into the flow stream; (4) detecting and identifying the individual cleaved nucleotides as they pass, in order of cleavage, through a focused laser beam.

We have implemented an optical trap to suspend the DNA laden microsphere upstream from the detection laser. This resulted in: reduced background; improved detection efficiency; and simplified sample handling. With this system, our detection efficiency of labeled nucleotides is ~90% and false positive signals from the background are a few per second. Enzymatic cleavage rates with Exo III are ~5 per second at 37 °C on fluorescently labeled substrates. Progress towards a two color sequencing demonstration will be described.

On the Road to Rapid Exonuclease Screening for DNA Sequencing

Hong Cai,# Susan L. Riley,# Kristina Kommander,* John Nolan,* Richard A. Keller#
Los Alamos National Laboratory, Chemical Science and Technology# and Life Sciences*

The limitation of current automated sequencers is 1800 base pairs/day, which results in the need for 4600 machine years to create the first finished sequence of one human genome (~3 x 10^9 base pairs). A rapid laser-based technique for

sequencing of 10 kb or larger fragments of DNA at a rate of 100 to 1000 bases per second is being developed in our laboratory. Successful completion of this would greatly reduce the time and effort needed in the sequencing of the human genome and other genomes. This new method relies on the attachment of fluorescent labeled DNA to a microsphere, introduction of this microsphere into a flowing sample stream, and detection of the individual labeled nucleotides as they are cleaved from the DNA fragment by an exonuclease. In order to increase analysis rates, an exonuclease with a fast digestion rate is required.

Extensive testing of commercially available exonucleases in our laboratory has not revealed a suitable exonuclease capable of rapidly cleaving fluorescently-labeled DNA. This has led us to search for either mutant forms of exonucleases or exonucleases derived from other bacterial strains. In order to screen the numerous exonucleases, a rapid screening assay based on flow cytometry has been developed. Compared to conventional techniques, this new assay is sensitive, rapid, and requires no radiation labeling or separation. This will allow the screening of hundreds of samples per day.

Sizing of Individual DNA Fragments

Zhengping Huang, Yongseong Kim, Jonathan L. Longmire, Nancy C. Brown, James H. Jett and Richard A. Keller
Los Alamos National Laboratory, Los Alamos, New Mexico 87545
[email protected]

Our flow-cytometry based approach to DNA fragment sizing involves: (1) staining a restriction digest of DNA with a dye that intercalates stoichiometrically with the fragments such that the number of incorporated dye molecules is proportional to the fragment length; (2) diluting the sample to ~10^-13 M; passing the sample through our single molecule detection apparatus and (3) measuring the magnitude of the fluorescence from individual, stained fragments. A histogram of the fluorescence intensities gives the size distribution of the DNA fragments, i.e. a DNA fingerprint. Samples containing less than a femtogram of DNA are sized in minutes with an accuracy of ~98%. We have demonstrated the applicability of this technique for sizing DNA fragments as small as 212 bp and as large as 340 kbp. In comparison with pulsed-field gel electrophoresis for the sizing of large DNA fragments, this approach is more accurate, much faster, requires much less DNA, and is independent of the DNA conformation. Applications to the characterization of PAC and BAC clones for DNA library construction and identification of bacteria strains by their DNA fingerprint will be described.

Single Molecule DNA Detection in Microfabricated Capillary Electrophoresis Chips

Brian B. Haab and Richard A. Mathies
Department of Chemistry, University of California, Berkeley, CA 94720

We have shown that single-molecule fluorescence burst counting is a highly sensitive method for detecting electrophoretic separations of ds-DNA fragments.[1] In previous work, methods for optimizing dye labeling, laser power and data analysis were developed, which enabled detection of single DNA fragments as small as 100 bp in capillary electrophoresis separations.[2] These separations were detectable when only 50-100 molecules passed through the probe volume. To further enhance the applicability of this method to low level pathogen and detection, we have now successfully performed single molecule detection of DNA separations in microfabricated glass capillary electrophoresis (CE) chips. By fabricating CE chips with a 280 µm thick top cover plate and by using a 40X NA 1.3 immersion microscope objective, the S/N ratio for single molecule detection was enhanced by more than two-fold over conventional capillaries. By constricting the sample in the detection region to a ~10 µm wide by ~5 µm deep cross section, approximately 10% of the molecules could be probed by the ~2 µm wide focused laser beam. Cross channel and separation channel dimensions were systematically varied to optimize the injection of the DNA sample. We have now achieved a detection limit of 500 molecules on-column or 200

attograms/ml for 500 bp DNA fragments.[3] This accomplishment is important because DNA-based methods are becoming increasingly important in environmental monitoring and in health care diagnostics to detect trace levels of pathogen contamination or DNA mutation.

Supported by the Director, Office of Energy Research, Office of Health and Environmental Research of the U.S. Department of Energy under Contract DE-FG-91ER61125.

1. B. B. Haab and R. A. Mathies, "Single molecule fluorescence burst detection of DNA fragments separated by capillary electrophoresis," Anal. Chem. 34, 3253-3260 (1995).
2. B. B. Haab and R. A. Mathies, "Optimization of single molecule fluorescence burst detection of ds-DNA: Application to capillary electrophoresis separations of 100-1000 bp fragments," Appl. Spec., 51, N10 (1997).
3. B. B. Haab and R. A. Mathies, in preparation.

Sizing Megadalton DNA by Mass Spectrometry

W. Henry Benner
LBNL, 1 Cyclotron Road, Berkeley, CA 94720
[email protected]

Mass spectrometry has important potential applications in the measurement of molecular masses relevant to the goals of the Human Genome Project. Thin gels, capillary gels and DNA chip technology appear to offer improvements in DNA analysis rates over slab gels but "mass spectrometry has perhaps demonstrated the greatest near-term potential [to increase through-put]." To the extent that direct instrumental measurements of molecular size could replace gel electrophoresis as a routine tool for DNA-sequencing separations and general molecular biology experimentation, a significant saving in time could be effected. Electrophoresis gels take hours to run but mass spectra are acquired in seconds to a few minutes. Additionally, mass spectrometers produce a mass measurement compared to gel electrophoresis which separates ions according to mobility, a relative measurement.

Electrospray mass spectrometry is an important type of mass spectrometry applicable to DNA analysis because it does not break apart DNA molecules. For relatively small electrospray DNA ions, the different charge states provide a way to calculate mass but for large ions with numerous charge states the calculation of ion mass is precluded because the charge states are not resolved in most mass spectrometers. We recently demonstrated the direct detection of charge on large electrospray ions as a way to solve this problem so that the mass of large DNA ions could be measured.

A patent application has been submitted for the invention of charge-detection-mass-spectrometry. More recently, we have significantly improved the measurement capability of this technique by implementing this detection system in a new type of ion trap. The gated electrostatic trap consists of a detector tube mounted between two sets of ion mirrors. The mirrors define symmetrically-opposing potential valleys which guide axially-injected ions to cycle back and forth through the charge-detection tube. A low noise charge-sensitive amplifier, connected to the tube, reproduces the image charge of individual ions as they pass through the detector tube. Ion mass is calculated from measurement of ion charge and velocity following each passage through the detector. Individual ions carrying more than 250 charges at an energy of 200 eV/charge have been trapped for 10 ms corresponding to 450 cycles through the detector tube. At this level of trapping time, a theoretical precision for charge measurement as small as 2 electrons RMS can be achieved. The operation of the system was demonstrated by trapping a 4.3 kilobase long circular DNA molecule of bacterial plasmid pBR322. The sodium form of this molecule has a molecular weight of 2.88 MDa. A mass value of 2.79 +/- 0.09 (ave +/- s.d.) MDa was determined. The accuracy of the mass measurements and the speed of this technique suggest that this measurement approach could be applied to the routine sizing of cloning vectors for the purpose of quality control of the cloning process.

Evaluation of New Membrane Surfaces for Chemiluminescent Assays

Christopher S. Martin, Jing Ying Lee, Betty Liu, Jeffrey Shumway, John C. Voyta and Irena Bronstein
Tropix, Inc., 47 Wiggins Ave., Bedford, MA 01730
[email protected]

Nylon membrane is the preferred support for a multitude of molecular biology applications due to its robustness and retention of high levels of bioanalytes. Chemiluminescent detection with 1,2-dioxetane substrates on nylon membrane is facilitated by the enhancement properties of nylon, but limited by high levels of non-specific binding of enzyme labeled reagents. We have developed polymers of quaternary amines for enhancement of chemiluminescence intensity from 1,2-dioxetanes in solution. These polymers also improve membrane assays when added to the substrate buffer. Poly(benzyltributyl)ammonium chloride (TBQ) (Sapphire II) and Poly(benzyldimethylvinylbenzyl)ammonium chloride (BDMQ) (Sapphire I) are current Tropix chemiluminescent enhancer products. Various quaternary amine polymers were screened to determine if membranes coated with such polymers would exhibit superior performance compared to commercially available nylon, PVDF, or nitrocellulose supports. Due to the superior chemiluminescent enhancement properties of THQ (Polyvinylbenzyltrihexylammonium chloride), this polymer was used to overcoat different membrane supports. After overcoating with THQ, biotinylated DNA was detected with a chemiluminescent dot blot assay. Polyethersulfone membranes exhibited the best performance in these assays. Recently, we have collaborated with an outside membrane manufacturer to bench cast membranes with THQ polymer. These membranes were subsequently tested by performing chemiluminescent detection of biotin or fluorescein labeled DNA in dot blot and Southern assays. The results indicate that sensitive detection and low background signal are attained on these membranes. Development of a superior membrane for chemiluminescent assays is of great benefit, enabling more rapid imaging of signals with x-ray film and electronic imaging devices (i.e. CCD cameras).

This work was funded by the DOE Genome Program. Contract No. DE FG05 92ER81389

Advances in Microfabricated Integrated DNA Analysis Systems

Adam T. Woolley, Peter C. Simpson, Shaorong Liu, Kaiqin Lao, Stephen J. Williams, Mary X. Tang, Lester Hutt, Alexander N. Glazer and Richard A. Mathies
Department of Chemistry and Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720

The microfabrication of DNA sample preparation, electrophoretic analysis and detection devices is making possible a new generation of high-speed, high-throughput DNA analysis systems. Our early work showed that high-quality fragment sizing as well as DNA sequencing separations could be performed on microfabricated capillary electrophoresis (CE) chips (1,2). We also demonstrated that PCR amplification could be directly performed on our CE chips to make the first integrated DNA sample preparation and analysis devices (3). It was also possible to increase the throughput of these microdevices by making capillary array electrophoresis (CAE) chips that could carry out parallel genotyping separations of up to 12 samples on a single chip in under 160 seconds (4). Recent advances in the use of replaceable denaturing separation matrices, in injection methodology, and in channel fabrication now enable DNA sequencing separations on chips with single base resolution to approximately 500 bases in only 12 minutes. Improvements in fabrication permit the construction of larger defect-free devices on 10 cm diameter glass wafers (5). We have also developed (i) novel injection modules for serially introducing multiple (2-4) samples onto the same capillary, (ii) elastomer sample well arrays for the facile loading of up to 96 samples, and (iii) electrode arrays for addressing up to 96 samples. These improvements have led to the development of a CAE chip that can separate 96 DNA fragment samples in less

than 8 minutes using 48 parallel capillaries, each capable of analyzing two different samples (6). These analyses have all been performed by using high-sensitivity, laser-excited confocal fluorescence detection. However, to produce truly portable high-speed microdevices it is desirable to eliminate expensive and bulky optical components and laser systems. Toward this end we have been working on the development of integrated electrochemical detection systems for CE chips (7). In these devices, the working electrode, formed by RF sputter deposition of Pt (2500 Å) on a 200 Å Ti adhesion layer, was photolithographically placed approximately 30 µm outside the end of the separation channel to avoid interference from the separation field. Electrophoretic separations of neurotransmitters were performed in under 100 seconds, demonstrating the speed, resolution and attomole detection sensitivity of these devices. Indirect electrochemical detection of DNA fragment separations was performed by using the redox-active intercalator Fe(1,10-phenanthroline)₃²⁺ in the separation buffer; transient dips in the constant background current from free intercalator indicated the migration of DNA-intercalator complexes through the detection region. On-chip electrochemical detection provided excellent sensitivity (approximately 10^3 molecules) for rapid (approximately 200 s) and high-quality separations of DNA restriction fragments and PCR products. This work is the harbinger of a paradigm shift in the application of CE chips to genomic sequencing and analysis.

Supported by the Director, Office of Energy Research, Office of Health and Environmental Research of the U.S. Department of Energy under Contract DE-FG-91ER61125.

1 Woolley, A. T.; Mathies, R. A. Proc. Natl. Acad. Sci. USA 91, 11348-11352 (1994).
2 Woolley, A. T.; Mathies, R. A. Anal. Chem. 67, 3676-3680 (1995).
3 Woolley, A. T.; Hadley, D.; Landre, P.; deMello, A. J.; Mathies, R. A.; Northrup, M. A. Anal. Chem. 68, 4081-4086 (1996).
4 Woolley, A. T.; Sensabaugh, G. F.; Mathies, R. A. Anal. Chem. 69, 2181-2186 (1997).
5 Simpson, P. C.; Woolley, A. T.; Mathies, R. A. BioMEMS I, in press (1997).
6 Simpson, P. C.; Roach, D.; Thorsen, T.; Johnston, R.; Sensabaugh, G. F.; Mathies, R. A. manuscript in preparation (1997).
7 Woolley, A. T.; Lao, K.; Glazer, A. N.; Mathies, R. A. submitted for publication (1997).

The Flowthrough Genosensor Program at Oak Ridge National Laboratory

Kenneth Beattie, Mitchel Doktycz, Ming Zhan, Gabriel Betanzos, William Bryan, John Turner
Oak Ridge National Laboratory, Oak Ridge, TN 37831-6123
[email protected]

A flowthrough genosensor instrument for ultrahigh throughput analysis of gene structure and function is being developed. The core of this microscale instrumentation is a microchannel hybridization array, containing a library of thousands of specific DNA sequences, immobilized within microporous cells in a thin layer of silicon or glass. A nucleic acid sample is passed through the microchannel genosensor at precisely controlled temperature and flow rate, and each porous hybridization cell binds nucleic acid sequences that are complementary to the immobilized DNA probe. The quantitative binding pattern reflects the base sequence of the nucleic acid strands present in the analyte and reveals the relative abundance of different sequences. The porous glass configuration has several important advantages over the flat surface DNA chip being developed by others: greatly improved hybridization kinetics, superior detection sensitivity, the ability to analyze dilute solutions of nucleic acids, and direct detection of heat-denatured PCR fragments without prior isolation of single strands.

The prototype genosensor system includes a temperature-controlled fluidics module and a CCD imaging system for quantitation of hybridized fluorescent-labeled strands. A key objective in the project is to develop important applications of the flowthrough genosensor for analysis of gene structure and function and DNA diagnostics. Feasibility studies for key genosensor applications are being pursued in collaboration with the mouse genetics program at ORNL and with several other

organizations. In one application area, a series of miniature "genochips" containing arrays of genomic DNA fragments are being prepared for use in gene discovery and mapping. Another application area, being pursued via a CRADA with Gene Logic, Inc. (Columbia, MD) and in collaboration with Dr. Jeffrey Trent (NHGRI), employs flowthrough arrays of DNA probes for transcriptional profiling, facilitating the discovery of genes that function in specific biological processes. A third application of the flowthrough genosensor, led by Dr. Mitch Doktycz, involves model hybridization studies with defined nucleic acid sequences, aimed at providing a more complete understanding of the specificity of oligonucleotide hybridization, which will facilitate intelligent selection of probes and valid interpretation of hybridization patterns. A fourth application of the flowthrough genosensor, being pursued in collaboration with Dr. James Weber at the Marshfield Medical Research Foundation, is high throughput genotyping. In this work miniature flowthrough genosensors will be fabricated to simultaneously analyze thousands of biallelic single nucleotide polymorphisms.

In another approach, the ultrahigh surface area of channel glass is being exploited to create arrays of "microreactor cells" containing immobilized BAC DNAs, for use in repetitive reactions needed for genome mapping and sequencing. These reactions will include: (1) Cycle sequencing reactions - Sequencing primers, nucleotides and Taq polymerase are flowed into each of the porous glass "microreaction cells," then the wafer will be sealed and placed into a thermocycler to carry out the cycle sequencing reactions. Products will be eluted and analyzed by Dr. Joe Balch at LLNL, using his parallel microcapillary array electrophoresis apparatus. This process will be carried out in numerous successive cycles, each time with a new set of primers, to acquire a large amount of sequence information from each BAC. The sequencing primers can be selected from oligonucleotide libraries as suggested by Ulanovsky and others, to achieve rapid primer walking along the entire set of BACs. (2) Mapping of expressed sequences in BAC libraries - Libraries of BACs immobilized in the channel glass array will be hybridized with individual cDNA clones or ESTs to localize each expressed sequence across the BAC array. This process will be repeated with numerous expressed sequences, to achieve rapid assignment of expressed sequences to their genomic clones.

Technical Aspects of Fabrication and Quantitative Analysis of DNA Micro-Arrays for Comparative Genomic Hybridization

Damir Sudar, Steve Clark, Ian Poole, Rick Segraves, Stephen Lockett, Arthur Jones, Donna Albertson, Joe Gray, Daniel Pinkel
Lawrence Berkeley National Laboratory
d [email protected]

We have developed a method of performing comparative genomic hybridization (CGH) to microarrays of genomic DNA clones (cosmids, P1s, BACs, etc.) that permits high resolution detection and mapping of DNA copy number variations in the human genome (Albertson et al., this meeting). In array CGH, spots of cloned DNA are arrayed onto a microscope slide and hybridized with total genomic DNA from a test specimen, labeled with one fluorochrome, and reference genomic DNA labeled with a spectrally different fluorochrome. The ratio of the fluorescence intensities on each target clone is proportional to the relative copy number of those sequences in the test and reference genomes. Efficient implementation of array CGH requires overcoming several major technical challenges including production of high density arrays, and rapid quantitative readout and analysis of the fluorescence signals. We have made considerable progress in both of these areas.

We are developing a robotic arraying system with a multi-pin tool to print DNA target solutions from 864 well plates onto fused silica slides mounted on a precision X-Y stage. The pins, each made from a segment of capillary electrophoresis tubing, are spring mounted on 3 mm centers. The upper end of each pin is connected to a manifold by flexible tubing to permit pressurization for printing and suction for cleaning. We have demonstrated the feasibility of printing targets on 100 µm centers

using this approach, providing a density of 10,000 targets/cm2.

Our imaging system was designed to provide: (a) large field-of-view (1 cm2, to match the print format), (b) sufficient resolution (10 pixels per target diameter), and (c) high sensitivity (accurate quantitation down to single-copy DNA quantities). A digital CCD camera and 2 high-speed lenses in a back-to-back configuration image the sample at 1x magnification. Fluorescence is excited using a mercury arc lamp with filter wheel for wavelength selection. A fiber-optic delivery system illuminates the sample at a 45 degree angle through the back side of the slide. A right-angle fused-silica prism is used in "total internal reflection" conditions to efficiently excite the fluorescent dyes without allowing excitation light to enter the imaging optics. A multi-band barrier filter was used in the emission path so it did not have to be changed when imaging different fluorochromes. We acquire a DNA counterstain (DAPI) image and the test and reference images with an exposure of several seconds or less. Software automatically identifies the spots using the DAPI counterstain and measures the background-corrected fluorescence intensities and intensity ratios of the spots.

We evaluated the performance of the system and analysis software using test samples made by spiking 200 ng of total human genomic DNA with 0, 1, 2, 20, 200, and 2000 pg of lambda DNA and a reference sample consisting of 200 ng of total genomic DNA spiked with 20 pg of lambda DNA. In this situation 3 pg of lambda DNA is equivalent to a single copy sequence. We found that the changes in the fluorescence ratios were detectable from below single copy equivalent level, and were quantitatively proportional to DNA sequence copy number over three orders of magnitude.

Partially supported by the Director, Office of Energy Research, Office of Health and Environmental Research of the U.S. Department of Energy under contract NO. DE-AC03-76SF00098.

Multilabel SERS Gene Probes for DNA Sequencing

T. Vo-Dinh*, D.L. Stokes¹, G.D. Griffin¹, N. Isola¹, J.P. Alarie¹, U.J. Kim², M.I. Simon², T. Bunde¹

* Corresponding Author; Oak Ridge National Laboratory, Oak Ridge, TN 37831-6101
¹ Advanced Monitoring Development Group, Life Sciences Div., Oak Ridge National Laboratory, Oak Ridge, TN 37831-6101
² Div. of Biology, Calif. Inst. of Technology, Pasadena, CA 91125
[email protected]

We describe a new type of DNA gene probe based on surface-enhanced Raman scattering (SERS) label detection. Raman spectroscopy is a useful tool for chemical analysis due to its excellent capability of chemical group identification. One limitation of conventional Raman spectroscopy is its low sensitivity, often requiring the use of powerful and costly laser sources for excitation. However, a renewed interest has recently developed among analytical spectroscopists as a result of the observation that Raman scattering efficiency can be enhanced by factors of up to 10^8 when a compound is adsorbed on or near special metal surfaces. The technique associated with this phenomenon is known as Surface-Enhanced Raman Scattering (SERS) spectroscopy. The surface-enhanced Raman gene (SERGen) probes described here do not require the use of radioactive labels and have great potential to provide both sensitivity and selectivity for DNA sequencing. The SERGen probes can be used to detect DNA biotargets (e.g., gene sequences, bacteria, viral DNA fragments) via hybridization to DNA sequences complementary to that probe. The use of stable clone resources containing large human DNA inserts has opened new possibilities to contig building for the Human Genome Project. The objective of this research is to apply the SERS multi-label technique for use in DNA mapping and bacterial artificial chromosomes (BAC) colony hybridization. The technology is based on a system that will integrate several concepts including: i)

multi-label SERS detection, ii) spectral multiplex mapping, and iii) BAC colony hybridization.

Emphasis is on detection techniques that minimize the time, expense and variability of preparing samples by combining the BAC mapping approach with SERS "label multiplex" detection. Large numbers of DNA samples can be simultaneously prepared by automated devices. With this device, multiple samples can be separated and directly analyzed using multiple SERS labels simultaneously. The results demonstrate the feasibility of the SERGen approach in the detection of two gene probes simultaneously.

MicroArray of Gel Immobilized Compounds

A. Mirzabekov
Joint Human Genome Program: Argonne National Laboratory, U.S.A., and Engelhardt Institute of Molecular Biology, Moscow

Technologies for robotic manufacturing of Microarrays of Gel-Immobilized Compounds on a chip (MAGIC™ chip) have been developed and the MAGIC™ chips are being tested for various applications. These microchips are polyacrylamide gel pads fixed and separated on a glass surface by hydrophobic spacers. The microchips can be used like an array of micro test tubes in which chemical and biochemical reactions can be carried out separately in each gel pad. Different oligonucleotides, DNA, antibodies, and proteins have been immobilized in specified gel pads to produce oligonucleotide, DNA, and protein microchips. The usual size of the gel pads is 100x100x20 µm. However, microchips with gel element sizes as small as 10x10x10 µm can be produced by photopolymerization. Applications of oligonucleotide microchips have been demonstrated for detection of mutation, identification of microorganisms, HLA allotyping, DNA fractionation, DNA enzymatic phosphorylation and ligation on specified or all microchip elements, and thermodynamic analysis of DNA duplexes. Generic microchips containing all 4,096 possible 6-mers have been manufactured, and their use for de novo sequencing and DNA sequence analysis will be described.

Flow Cytometry-Based Polymorphism Detection and Analysis

John P. Nolan, Hong Cai, Kristina Kommander, and P. Scott White
Los Alamos National Laboratory, Life Sciences Division, Los Alamos, New Mexico 87545
[email protected]

Single-base polymorphism analysis of the human genome on a large scale requires robust and sensitive screening methods which are amenable to automation and high throughput analysis. We are developing a suite of microsphere-based approaches employing fluorescence detection by flow cytometry to screen for and analyze single base polymorphisms. One approach being developed for polymorphism detection is to immobilize on microspheres proteins which recognize specific DNA structures. When a fluorescently labeled DNA molecule binds to the immobilized protein, the binding can be measured by flow cytometry. An example is the recognition of heteroduplex DNA by the mutS protein as a means to detect single base mismatches. To analyze nucleotide base frequencies at a polymorphic site, we are developing an approach based on minisequencing in which immobilized primers designed to interrogate a specific site are used to bind the region of interest in an unknown sample. The primers are then extended by polymerase using fluorescent ddNTPs and flow cytometry is used to read the frequency of each differently colored base. Alternately, the ligation of fluorescent oligonucleotides containing base variation at the site of interest is used to detect the base frequency at that site. Apart from the advantages of sensitivity and low sample consumption, the flow cytometric approaches have the advantages of the potential for multiplexed analysis using different color or size beads and automated sample handling. Multiplexed analysis could enable simultaneous analysis of base frequencies at dozens of different loci, which combined with automated sample handling could provide a powerful tool for high throughput

screening of single base polymorphisms. Supported by NIH, DOE, and LDRD.

DNA Characterization by Electrospray Ionization-Fourier Transform Ion Cyclotron Resonance Mass Spectrometry

David S. Wunschel, Ljiljana Pasa Tolic, David C. Muddiman, James E. Bruce, Steve A. Hofstadler and Richard D. Smith
Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352
rd_ [email protected]

Mass spectrometry offers the potential for high speed DNA sequencing and other applications. In addition to the development of sequencing approaches, ongoing work in the laboratory is exploring applications using Fourier transform ion cyclotron resonance (FTICR) mass spectrometry. These efforts include the characterization of polymerase chain reaction (PCR) products, enzymatically produced oligonucleotide mixtures, modified DNA and the development of methods for the analysis of large DNA fragments. The analysis time required is on the order of seconds, and is made possible by isotopic resolution of each component's charge states obtained using FTICR. High mass accuracy measurements for PCR products have been achieved for products up to 114 base pairs in length. As an example, the mass accuracy allowed single base substitutions to be detected with de novo identification of an unreported base substitution (1,2). This approach was extended to examine a multi-component reaction from a single primer pair where a base pair deletion was identified with the putative identification of inter-operon variability within a single bacterial strain (3). Recent efforts have focused on increasing the size of products amenable to analysis, with a goal of providing comparable "read-lengths" to traditional sequencing methods, and have involved improvements to sample preparation methods and the exploitation of improved methods for dynamic range expansion. We are also exploring the use of collision induced dissociation methods with PCR products to provide sequence information. This would allow for direct selection and analysis of individual components from within mixtures that may share a high degree of similarity without cloning. This alternative would not only eliminate that time-consuming step, but also potentially allow identification of low abundance products without an intensive screening process.

In this presentation the recent advances at our laboratory will be described. These include the development of new interfacing methods for realizing greatly improved sensitivity and the implementation of new high performance FTICR that has been designed to achieve greater sensitivity as well as resolution and mass measurement accuracy.

1. "Characterization of PCR products from bacilli using electrospray ionization FTICR mass spectrometry", D. C. Muddiman, D. S. Wunschel, C. L. Liu, L. Pasa Tolic, K. F. Fox, A. Fox, G. A. Anderson and R. D. Smith, Anal. Chem. 68, 3705-3712 (1996)

2. "Length and Base Composition of PCR-Amplified Nucleic Acids Using Mass Measurements from Electrospray Ionization Mass Spectrometry", D. C. Muddiman, G. A. Anderson, S. A. Hofstadler and R. D. Smith, Anal. Chem. 69, 1543-1549 (1997)

3. "Inter-Operon Variability in B. cereus by Normal PCR using ESI-FTICR Mass Spectrometry", D. S. Wunschel, D. C. Muddiman, K. F. Fox, A. Fox and R. D. Smith, submitted.

This research was supported by the Office of Biological and Environmental Research, U.S. Department of Energy. Pacific Northwest National Laboratory is operated by Battelle Memorial Institute through Contract No. DE-AC06-76RLO 1830.
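To give a sense of why high mass accuracy is needed to detect a single base substitution, the following sketch estimates the mass shift caused by swapping one base and expresses it as a fraction of the strand mass. It is an illustration only: the residue masses are standard approximate average values for nucleotide residues within a DNA chain, end-group and counter-ion corrections are ignored, and the 114-nucleotide sequence and function names are hypothetical rather than taken from the cited work.

```python
# Approximate average masses (Da) of nucleotide residues within a DNA chain.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def strand_mass(seq):
    """Approximate single-strand mass: sum of residue masses, end groups ignored."""
    return sum(RESIDUE_MASS[base] for base in seq)


def substitution_shift(old, new):
    """Mass change (Da) when one residue `old` is replaced by `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]


def required_accuracy_ppm(seq, old, new):
    """Relative size of the substitution shift, in parts per million."""
    return abs(substitution_shift(old, new)) / strand_mass(seq) * 1e6


# Hypothetical 114-nucleotide strand, used only to set the mass scale.
example = "ACGT" * 28 + "AC"
shift = substitution_shift("A", "T")              # the smallest of the six shifts
ppm = required_accuracy_ppm(example, "A", "T")    # a few hundred ppm
```

The A/T exchange is the hardest case because its shift of about 9 Da is the smallest among the possible substitutions; longer products make the same shift a smaller fraction of the total mass.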

Improved Mass Spectrometric Resolution for PCR Product Size Measurement

Gregory B. Hurst, Kristal Weaver, and Michelle V. Buchanan
Chemical and Analytical Sciences Division, Oak Ridge National Laboratory, Oak Ridge TN 37831-6365
[email protected]

While a wealth of biological and genetic information can be gleaned from properly-designed polymerase chain reaction (PCR) assays, currently-used technologies for analysis of the resulting oligonucleotides all suffer from limitations in speed, accuracy, safety, or convenience. Matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) offers considerable potential for rapid and accurate molecular mass determination of biopolymers, such as proteins and DNA. In order to achieve this potential, we are working to improve the utility of MALDI-MS for measurement of PCR product size.

The resolution (ability to distinguish products of similar molecular mass) of MALDI-MS is determined by both chemical and instrumental factors. The presence of reaction components necessary for polymerase activity, particularly metal ions, causes broadening of the observed peaks in MALDI mass spectra of PCR products. Post-PCR removal of these metal ions and other interferences can be performed efficiently using reverse-phase cartridges in syringe-mounted or microtiter plate formats. The wide energy distribution imparted to biomolecules by the laser desorption process also broadens mass spectrometric peaks. Delayed ion extraction has greatly improved the resolution of MALDI-MS by compensating for this energy spread. Combining these techniques, PCR products differing in length by a single base can be resolved up to a total length of 60 bases or more, and larger oligonucleotides can be detected if single-base resolution is not required. The design of PCR products that are shorter than typically used for gel electrophoresis is thus a high priority in improving the applicability of MALDI-MS.

Other important practical considerations are the reproducibility and throughput of MALDI-MS. Commercially-available instrumentation allows robotic loading onto multiple-sample plates followed by automated analysis. However, samples outside narrow constraints of concentration, size, and purity still require human intervention because of the inhomogeneity of the dried matrix/sample mixture. We are currently developing methods for combining oligonucleotides with the MALDI matrix to yield a homogeneous preparation resulting in a uniform signal at any point interrogated by the desorption laser.

K.W. acknowledges support through an appointment to the Oak Ridge National Laboratory Postgraduate Research Program administered jointly by the Oak Ridge Institute for Science and Education and Oak Ridge National Laboratory. Research supported by the Environmental Management Science Program and Office of Biological and Environmental Research, U.S. Department of Energy, and the Oak Ridge National Laboratory Director's Research and Development Funds. Oak Ridge National Laboratory is managed for the United States Department of Energy by Lockheed Martin Energy Research Corp. under contract DE-AC05-96OR22464.

Laser Desorption Mass Spectrometry for Quick DNA Sequencing and Analysis

C. H. Winston Chen, N. I. Taranenko, Y. F. Zhu, S. L. Allman, V. V. Golovlev and N. R. Isola
Life Science Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6378

During the past two years, we have used laser desorption mass spectrometry (LDMS) to obtain the following major achievements. They are (1) LDMS sequencing of ss-DNA longer than 100 nucleotides with DNA ladders (2) Direct DNA sequencing without ladders (3) LDMS for hybridization detection (4) STR detection for forensic applications and (5) Rapid disease diagnosis.
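The idea behind reading a sequence from a mass-resolved DNA ladder can be sketched as follows: in an idealized ladder, the mass gap between consecutive fragments equals the mass of the one added nucleotide residue, so each gap identifies a base. This is a simplified illustration, not the authors' procedure; the residue masses are standard approximate average values, while the tolerance, the primer mass, and the function name are hypothetical.

```python
# Approximate average masses (Da) of nucleotide residues within a DNA chain.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def read_ladder(peak_masses, tolerance=2.0):
    """Call one base per gap between consecutive ladder peaks.

    A gap that is not within `tolerance` Da of any residue mass is called "N".
    """
    ordered = sorted(peak_masses)
    calls = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        gap = heavier - lighter
        base, mass = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - gap))
        calls.append(base if abs(mass - gap) <= tolerance else "N")
    return "".join(calls)


# Build an idealized ladder from a known sequence and a hypothetical primer mass.
primer_mass = 5000.0
ladder = [primer_mass]
for base in "GATTACA":
    ladder.append(ladder[-1] + RESIDUE_MASS[base])

called = read_ladder(ladder)
```

Because the A and T residues differ by only about 9 Da, the mass resolution needed to call bases grows with fragment length, which is consistent with the emphasis on resolution throughout these abstracts.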

For conventional gel electrophoresis for DNA sequencing, major steps include (1) DNA ladders preparation (2) Separation of different sizes of DNAs and (3) detection by either autoradiogram or laser-induced fluorescence. Laser desorption mass spectrometry (LDMS) can be used to enhance DNA sequencing speed. One approach is to use a mass spectrometer as a detector only. In general multiplexing is used to increase the sequencing speed. However, gel electrophoresis and DNA ladders preparation are still required. Another approach is to use LDMS for both separation and detection. With this approach, gel electrophoresis is not needed but the preparation of DNA ladders is required. Recently, we succeeded in using LDMS in sequencing single-stranded DNA with the size longer than 100 nucleotides. With primers for both strands, a double-stranded DNA segment with the size up to 260 base pairs can be sequenced. Both cycle sequencing and standard Sanger's sequencing have been tried with successful results.

Since the preparation of DNA ladders is somewhat time consuming, it is very desirable to be able to sequence DNAs without the need of ladder preparation. We recently found that preferred bond cleavage can be obtained during the laser desorption process if adequate matrices and laser fluences are used. We took this approach and recently succeeded in sequencing an oligonucleotide with 35 bases. This direct sequencing by MALDI without the need to prepare DNA ladders can be conveniently used to sequence primers and short DNA probes.

In addition to the DNA sequencing, we also apply LDMS for the detection of DNA probes for hybridization. Preliminary results indicate that LDMS as detector for hybridization process can reduce the time and cost for DNA sequencing by hybridization (SBH). LDMS was also used to obtain short tandem repeats (STR) for forensic applications. Clinical applications for disease diagnosis such as cystic fibrosis due to the base deletion and point mutation have also been demonstrated. Different schemes for resolution and detection efficiency improvements will be pursued in the future to increase the sequencing and/or analysis speed. Experimental details will be presented in the meeting.

Research has been sponsored by the Office of Health and Environmental Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Lockheed Martin Energy Systems, Inc.

Genome Analysis Technologies

George Church, Martha Bulyk, Sonali Bose, Chris Harbison, Linxiao Xu, Poguang Wang, Laura Kutney, T. O'Keeffe, Dereth Phillips
Department of Genetics, Harvard Medical School, 200 Longwood Ave., Boston, MA 02115
church@salt2.med.harvard.edu

Together with Bruker Instruments Inc., Genome Therapeutics Corp., and Northeastern University, we have developed a laser-desorption electron capture mass spectrometry method to quantitate up to 400 "Mass-tags" every 10 to 200 milliseconds. Both DNA and proteins have been mass-tagged in ways analogous to fluorescent-tags. The tags are laser-heat-releasable electrophores. Advantages over fluorescence include more numerous cleanly separated spectral peaks, better detection, and higher throughput. Applications to capillary electrophoretic (CE) assays, DNA microarray chips, in situ hybridization, and in situ PCR imaging are under evaluation. The electrophoretic applications are designed (but not yet tested) to collect 100 sequence equivalents per second. To assess the capacity of our CE (75 micron inner diameter), we have obtained dideoxy sequence data from a multiplex PCR sequence of a mixture of 47 ds-templates. Another feature of the Mass-tag is the 400 internal standards, which should enhance the ability to computationally align and quantitate 400 images for chip and in situ applications. For the new complete microbial genome sequences, we have developed chip and CE technologies for systematic analyses of DNA-protein interactions and competitive growth phenotypes.

L. Xu, N. Bian, Z. Wang, S. Abdel-Baky, S. Pillai, D. Magiera, V. Murugaiah, R.W. Giese, P. Wang, T. O'Keeffe, H. Abushamaa, L. Kutney, G.M. Church, S. Carson, D. Smith, M. Park, J. Wronka, F. Laukien. Electrophore Mass Tag

Dideoxy DNA Sequencing. Analytical Chemistry in press.
http://arep.med.harvard.edu/

2'-Fluoro Modified Nucleic Acids: Polymerase-Directed Synthesis, Properties, and Stability to Analysis by Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry

Tetsuyoshi Ono, Mark Scalf, Lloyd M. Smith
University of Wisconsin-Madison Chemistry Dept.
[email protected]

DNA fragmentation is a major factor limiting mass range and resolution in the analysis of oligonucleotides by Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI-MS). Protonation of the nucleobase leads to base loss and backbone cleavage by a mechanism similar to the depurination reactions employed in the chemical degradation method of DNA sequencing. In a previous study, the stabilizing effect of substituting the 2' hydrogen with an electronegative group such as hydroxyl or fluorine was investigated. These 2' substitutions stabilized the N-glycosidic linkage, blocking base loss and subsequent backbone cleavage. For such chemical modifications to be of practical significance, it would be useful to be able to employ the corresponding 2'-modified nucleoside triphosphates in the polymerase-directed synthesis of DNA. This would provide an avenue to the preparation of 2'-modified PCR fragments and dideoxy sequencing ladders stabilized for MALDI analysis. In this paper methods are described for the polymerase-directed synthesis of 2'-fluoro modified DNA, using commercially available 2'-fluoronucleoside triphosphates. The ability of a number of DNA and RNA polymerases as well as reverse transcriptase to incorporate the 2'-fluoro analogs was tested. Four thermostable DNA polymerases (Pfu (exo-), Vent (exo-), Deep Vent (exo-) and UlTma) were found that were able to incorporate 2'-fluoronucleotides with reasonable efficiency. In order to perform Sanger sequencing reactions, the enzymes' ability to incorporate dideoxy terminators in conjunction with the 2'-fluoronucleotides was evaluated. UlTma DNA polymerase was found to be the best of the enzymes tested for this purpose. MALDI analysis of enzymatically produced 2'-fluoro modified DNA using the matrix 2,5-dihydroxy benzoic acid showed no base loss or backbone fragmentation, in contrast to the extensive fragmentation evident with unmodified DNA of the same sequence.


Mapping and Resources

Development and Application of Subtractive Hybridization Strategies to Facilitate Gene Discovery

Maria de Fatima Bonaldo, Greg Lennon and Marcelo Bento Soares
The University of Iowa, Iowa City, IA
[email protected]

The methods we originally developed to normalize directionally cloned cDNA libraries (Soares et al., 1994; Bonaldo, Lennon & Soares, 1996) have been successfully utilized to generate a number of human, mouse and rat cDNA libraries. All human and mouse libraries (and soon the rat libraries as well) have been contributed to the I.M.A.G.E. consortium and they have been extensively used for large scale generation of expressed sequence tags (ESTs). Both the ESTs and their respective clones are publicly available. Although the use of normalized libraries has proven most advantageous to minimize the redundant identification of the mRNAs of the super-prevalent and intermediate frequency classes within a particular tissue, it cannot prevent the redundant identification of mRNAs (of any frequency class) that are expressed in multiple tissues. In other words, normalization alone cannot avoid the redundant identification of ESTs that have been obtained previously from other libraries. This problem is becoming increasingly relevant as we approach completion of the ongoing human and mouse gene discovery efforts. Hence, we proposed to take advantage of subtractive hybridization strategies that we developed, to generate libraries enriched for novel cDNAs. Briefly, pools of I.M.A.G.E. clones from which ESTs have been derived are used as drivers in hybridizations with single or multiple normalized libraries, thus generating subtracted libraries enriched for cDNAs not yet represented in public databases. Subtracted libraries are characterized by Southern hybridization to assess reduction in representation of clones of the driver population and then contributed to the I.M.A.G.E. consortium for large-scale arraying and sequencing. Sequence analysis of two subtracted libraries that we have generated indicated a four-fold reduction in representation of the driver population. It is anticipated that the use of subtracted libraries will become increasingly advantageous as we strive towards the ultimate goal of identifying all human and mouse genes. This project has two major goals: (1) to optimize the method for construction of subtracted libraries, and (2) to generate subtracted libraries to facilitate the ongoing human and mouse EST programs.

EURO-IMAGE: the European IMAGE Consortium for Integrated Molecular Analysis of Human Gene Transcripts

Auffray, C.1, Devignes, M.D.1, Pietu, G.1, Ansorge, W.2, Schwager, C.2, Ballabio, A.3, Borsani, G.3, Banfi, S.3, Estivill, X.4, Lynch, M.4, Gibson, K.5, Mundy, C.5, Lehrach, H.6, Poustka, A.7, Wiemann, S.7, Korn, B.7, O'Brien, J.7, Uhlen, M.8, Lundeberg, J.8
1 CNRS UPR 420, BP 8, 94801 Villejuif cedex, France
2 EMBL, Postfach 10.2209, 6900 Heidelberg, Germany
3 TIGEM, Via Olgettina 58, 20132 Milano, Italy
4 IRO, Hospital Duran I Reynals, Autovia de Castelldefels 2,7, Barcelona, Spain
5 HGMP, Hinxton, CB10 1RQ, United Kingdom
6 MPI, Ihnerstrasse 73, Dahlem, 14195 Berlin, Germany
7 DKFZ, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
8 KTH, Teknikringen 34, 10044 Stockholm, Sweden
[email protected]

The general objectives of the European IMAGE Consortium are:
• To generate a minimal set of non redundant cDNA clones for most human gene transcripts and a master set of unique full-length cDNA clones representing 3,000 transcripts based upon the IMAGE Consortium resources (arrays of cDNA and CpG island clones).
• To characterize by DNA sequencing with high accuracy the complete sequence of the master set of 3,000 human gene transcripts (6 Mbases of finished sequence).

• To obtain high resolution and comparative functional mapping localization in man and model organisms of 1,000 genes represented in the master set.
• To develop the IMAGE Consortium Data Base to provide an easy access to an integrated view of the sequence, map and expression data generated.

The European IMAGE Consortium will devote 20% of the resources to the assembly of the physical resources (cDNA and CpG island arrays characterized by end sequencing and fingerprinting, minimal sets of clones selected after comprehensive sequence, clone and functional clustering, master set of 3,000 full-length clones). These resources will be available throughout the European Union for the user community in both academia and industry. The Consortium will serve the future needs of the scientific community in the systematic identification of all human genes and their regulatory sequences by deciphering in an efficient and economic manner 6 Mb of complete, finished sequence for 3,000 transcripts of average size 2 kb: 50% of the resources will be devoted to the sequencing of full-length cDNAs and CpG islands selected by the Consortium on the basis of their map position, similarity to known families or expression profiles. In order to ensure that advances in basic genetic knowledge are used to further enhance human health, the Consortium will seek to contribute to the identification of genes involved in human biology and diseases by correlating precise map location and phenotypic expression data, exploiting various comparative approaches for 1,000 of the genes represented in the master set: 10% of the resources will be devoted to high-resolution and comparative functional mapping in close interaction with the mapping consortia in order to obtain the most precise and evolutionarily relevant map location. Last but not least, 20% of the resources will be devoted to the IMAGE Consortium Data Base. This will provide the community with up to date, integrated sequence, mapping and expression data related to the IMAGE consortium arrays, as they are collected by IMAGE Consortium members in Europe, the United States and Japan, and will help in sharing and harmonizing such data.

Supported by the participating institutions and the European Union BIOMED 2 Program.

The Functional Genomics Initiative at Oak Ridge National Laboratory

Dabney Johnson, Monica Justice, Ken Beattie, Michelle Buchanan, Michael Ramsey, Rose Ramsey, Michael Paulus, Nance Ericson, David Allison, Reid Kress, Richard Mural, Ed Uberbacher, Reinhold Mann
Life Sciences, Instrumentation and Controls, Chemical Technology, and Robotics and Process Systems Divisions, Oak Ridge National Laboratory, Oak Ridge TN 37831
[email protected]

The Functional Genomics Initiative at the Oak Ridge National Laboratory integrates outstanding capabilities in mouse genetics, bioinformatics, and instrumentation. The 50 year investment by the DOE in mouse genetics/mutagenesis has created a one-of-a-kind resource for generating mutations and understanding their biological consequences. It is generally accepted that, through the mouse as a surrogate for human biology, we will come to understand the function of human genes. In addition to this world class program in mammalian genetics, ORNL has also been a world leader in developing bioinformatics tools for the analysis, management and visualization of genomic data. Combining this expertise with new instrumentation technologies will provide a unique capability to understand the consequences of mutations in the mouse at both the organism and molecular levels.

The goal of the Functional Genomics Initiative is to develop the technology and methodology necessary to understand gene function on a genomic scale and apply these technologies to megabase regions of the human genome. The effort is scoped so as to create an effective and powerful resource for functional genomics. ORNL is partnering with the Joint Genome Institute and other large scale sequencing centers to sequence several multimegabase regions of both human and mouse genomic DNA, to identify all the genes in these regions, and to conduct fundamental surveys to examine gene function at the molecular and organism level. The Initiative is designed to be a pilot for larger scale deployment in the post-genome era. Technologies will be applied to the examination of gene expression and regulation, metabolism, gene networks, physiology and development.

The initiative was launched in 1996 and is comprised of the following component efforts:

Directed High-Efficiency Mouse Mutagenesis (Monica Justice) - We have established a comprehensive high-efficiency mouse germline mutagenesis program to examine mammalian gene function. N-ethyl-N-nitrosourea (ENU) is currently the mutagen of choice because it induces point mutations that reflect single gene function. Large genomic regions can be scanned for mutations that may reflect loss of function, gain of function, or partial loss of function. In parallel, we are developing sequence-ready BAC contigs of a region that we are mutagenizing. Allelic series for loci will be obtained that will complement other mutagenesis approaches, such as gene disruptions. Our program incorporates many different methods of genetic screening to isolate mutations, and the mutations will serve as a resource to the mouse and human genome communities.

The Mouse Screenotyping Center (Dabney Johnson) - We have established a mouse phenotype-screening center to complement and extend current mouse mutational analyses by developing screening protocols for biochemical and behavioral abnormalities. High-throughput methodologies are being developed to efficiently screen for induced aberrations in behaviors, in locomotor and neuromuscular function, and in biochemical and hematological parameters. The screens are designed to detect mouse models of a variety of human diseases from among large populations of potentially mutant mice.

Comprehensive Cellular Protein Analysis using Microfluidic Devices (Rose Ramsey) - ORNL's unique lab-on-a-chip technology is being combined with mass spectrometric methods to provide uniquely powerful methods for comprehensive cellular protein analysis using microfluidic devices. The microchips will integrate on a single structure, [...] of protein mixtures with electrospray ionization of the analytes for direct, on-line interfacing with mass spectrometry. This system will provide unparalleled throughput and sensitivity, allowing robust protein identification in small samples.

Gene Expression Studies using Genosensor Microchip Technology (Ken Beattie) - Porous glass flowthrough Genosensor chip technology provides unique hybridization surface area and sensitivity for use in large-scale gene expression studies, mapping, and mutational surveillance.

Genome Lab-on-a-Chip (Michael Ramsey) - Microfluidic and microinstrumentation capabilities are being applied to labor intensive assays used in gene mapping and analysis of genetic mutations and polymorphisms. "Laboratory-on-a-chip" technology is being used to automate assay procedures, increase analysis rates, eliminate the use of radioactive isotopes, and reduce the consumption of limited DNA samples. Current efforts are focused on development of multiplexed microdevices for PCR-based analysis of simple tandem repeat markers used in the genetic and physical mapping of the mouse genome.

Mouse Physiological Monitoring Microchip (Nance Ericson) - Unique integrated circuitry is being created to provide a capability for large scale monitoring of mouse physiological parameters related to physiological or behavioral abnormalities due to mouse mutations. Application of this new technology to genome studies will accelerate mass specimen screening by providing automated detailed observation and reporting of multiple key physiological parameters, such as body temperature, heart rate, physical activity level, location and motion patterns using a microscale implanted device.

High-Throughput Tomographic Imaging of Mouse Phenotypes (Michael Paulus) - A new high-resolution high-throughput automated 3D mouse imaging and screening methodology is being created which will rapidly identify, quantify and record subtle phenotypes in mutagenized mice, based on a novel x-ray imaging technology with 50 micron spatial resolution and a 3D dataset acquisition time of [...]

Development of DNA Sequencing and Automated Genotyping Infrastructure (Richard Mural) - Basic capabilities for DNA sequencing and automated genotyping are being developed to support mouse genetics and molecular biology. Not only are these capabilities necessary for addressing basic problems in genetics and molecular biology, but they will also add flexibility to the program for initiating new projects. Having the DNA sequence of regions which are being mutagenized will also be important to mutation detection.

Event-Based Flexible Automated DNA Sequencing (Reid Kress) - By moving automated DNA sequencing from a sequential, batch-mode process to a continuous mode hands off process we will be able to maximize the efficiency and cost-effectiveness of our limited DNA sequencing resources at ORNL.

Bioinformatics Support for Gene Function (Ed Uberbacher) - A comprehensive system to collect and assemble information relevant to the function of newly discovered genes is being constructed for application to sequenced human and mouse regions in the Functional Genomics Initiative. This includes support for initial gene finding and informatics-based characterization, and compilation of experimental information generated related to mouse mutations, gene and protein expression studies, and computational and experimental information derived from numerous external databases and other model organisms. The resulting catalog of gene function information will be linked to genome-wide browsing being produced by the Genome Annotation Consortium.

Direct Visualization of Regulatory Protein-DNA complexes and Mutations using AFM (David Allison) - High-throughput atomic force microscopy will be used to precisely locate and visualize proteins bound to individual DNA molecules or genes, including complexes related to regulation of gene expression and repair enzymes site-specifically bound to lesion sites. Automated image analysis and interpretation systems are being applied to locate and characterize the protein-DNA complexes.

Mass Spectrometry for DNA Analysis (Michelle Buchanan) - To complement the genome-wide mutation screening efforts, we are examining parameters in mass spectroscopy-based DNA analysis. Our initial efforts have focused on optical imaging and diagnostics of the matrix assisted laser desorption ionization (MALDI) plume, and investigations of the crystallization of DNA for the MALDI matrix. In parallel, we are examining methods for higher-throughput DNA analysis necessary for large-scale genome-wide polymorphism analysis.

Defining the Function of the Human DNA Repair Protein XPF by Comparative Genomics and Protein Chemistry

Sandra L. McCutchen-Maloney, Mark Shannon, Jane Lamerdin and Michael P. Thelen
Molecular and Structural Biology Division, Lawrence Livermore National Laboratory, Livermore, CA 94550
[email protected]

The human XPF gene is required for the cellular response to DNA damage caused by ultraviolet radiation and mutagenic chemicals. Mutation in this gene leads to one form of the disease syndrome xeroderma pigmentosum. Homologs of the XPF protein are present in fly, worm, plant, yeast and archaea, indicating conservation of an essential function throughout all of evolution. The expression of mouse XPF in pachytene spermatocytes corroborates other indications that XPF has an additional role in genetic recombination during meiosis. Information from all of the homologs has been valuable in directing experiments to determine the specific characteristics of the XPF protein. We have demonstrated the predicted endonuclease activity, protein-protein interactions, and DNA binding in the recombinant XPF protein that was overexpressed and purified from E. coli. Fragments of XPF obtained during purification retain endonuclease activity. This result, together with the sequence analysis of homologous proteins, has led us to produce a series of deletion fragments of XPF in order to identify functional domains. The use of comparative genomics coupled with protein chemistry has thus enabled us to begin defining the function of this protein that is crucial to human health.

Work was funded by NIH grants to SLM-M and MPT; research was performed under the auspices of the U.S. DOE by LLNL under contract No. W-7405-ENG-48.

Direct Isolation of Functional Copies of the Human HPRT Gene by TAR Cloning Using One Specific Targeting Sequence

Natalay Kouprina*, Lois Annab**, Michael A. Resnick*, J. Carl Barrett** and Vladimir Larionov*
*Laboratory of Molecular Genetics and **Laboratory of Molecular Carcinogenesis, National Institute of Environmental Health Sciences, Research Triangle Park NC 27709, USA

Recently we demonstrated that transformation-associated recombination (TAR) in the yeast Saccharomyces cerevisiae can be used to selectively isolate single copy genes, BRCA2 and BRCA1, from total human DNA as large circular YACs (1). The TAR cloning method is based on co-penetration into yeast spheroplasts of gently isolated genomic DNA along with vector DNA that contains 5' and 3' sequences specific for the gene of interest, followed by recombination between the vector and the human DNA to establish a YAC (2). We investigated whether a single copy gene could be isolated directly from total human DNA by TAR cloning using only one piece of its sequence information. A TAR cloning vector was constructed that contained a small amount of 3' HPRT sequence and an Alu repeat. Transformation with the vector along with human DNA led to the selective isolation of large circular YACs containing the entire HPRT gene. YACs up to 400 kb were generated that extended from the unique position of 3' HPRT to various Alu's, similar to "genome walking". Based on transfection of the NeoR-retrofitted YAC clones into mouse cells, the YACs contained the functional HPRT. Use of the common Alu repeat as a second targeting sequence greatly expands the utility of TAR cloning. We propose that the new TAR cloning approach is readily applicable to direct isolation of single copy genes from mammalian genomes.

(1) Larionov, V., Kouprina, N., Solomon, G., Barrett, J. C. and Resnick, M. A. Direct isolation of human BRCA2 gene by transformation-associated recombination in yeast. Proc. Nat. Acad. Sci. USA, 94, p. 7384-7387, 1997.
(2) Larionov, V., Kouprina, N., Graves, J., Chen, X.-N., Korenberg, J. R. and Resnick, M. A. Specific cloning of human DNA as YACs by transformation-associated recombination. Proc. Nat. Acad. Sci. USA, 93, p. 491-496, 1996.

Mobile Elements and DNA Integration

Jerzy Jurka
Genetic Information Research Institute, Palo Alto, USA
e-mail: [email protected]

During the last two years or so, our DOE-sponsored research led to the discovery and characterization of DNA targets for retroposon integration (1,2). These targets are most likely recognized and cut by L1-ORF2 enzymes already existing in mammalian cells and have been postulated to be hot spots for homologous recombination (3). Recently, our team, in collaboration with Dr. Sun-Yu Ng, has demonstrated that the preferred integration targets are also hot spots for homologous recombination, which is increased by up to two orders of magnitude in test cancer cell lines (4). In another collaborative effort, we have discovered that mobile elements integrate primarily at kinkable DNA sites (5).

We continued our repeat annotation service to the community (6), as well as discovery and analysis of new repetitive families (7,8).

1. Jurka, J., Klonowski, P. J. Mol. Evol. 43, 685-689 (1996)
2. Jurka, J. Proc. Natl. Acad. Sci. 94, 1872-1877 (1997)
3. Jurka, J. Site-Directed Recombination. U.S. Patent No. 08/643,886
4. Ng, S.-Y., Ma, J., Jurka, J. (in preparation)
5. Jurka, J., Klonowski, P., Trifonov, E.N. Mammalian Retroposons Integrate at Kinkable DNA Sites (submitted)

6. CENSOR server: http://www.girinst.org
7. Jurka, J., Kapitonov, V.V., Klonowski, P., Walichiewicz, J. and Smit, A.F.A. Genetica 98, 235-247 (1996)
8. Kapitonov, V.V. and Jurka, J. DNA Sequence (in press)

Study of Electric Field-Induced Transformation of Escherichia coli with Large DNA Using DH10B Cells

Alexander Boitsov1,2, Boris Oskin1, Pieter J. de Jong2
1 Saint Petersburg State Technical University, Department of Biophysics, St. Petersburg 195251, Russia
2 Roswell Park Cancer Institute, Department of Human Genetics, Elm and Carlton St., Buffalo, NY 14263
[email protected]

Construction of large-insert libraries in bacterial hosts, such as those using PAC/BAC vectors and DH10B cells, has been limited by the inefficient transformation of Escherichia coli with large DNA molecules by electroporation. The goal of this project was to elucidate the mechanism and kinetics of the entrance of large DNA molecules into DH10B cells on exposure to electric fields and to exploit this information to increase the efficiency of PAC/BAC library construction. The essence of the approach used is the independent optimization of the three distinct stages of electrotransformation (ET): i) electrophoretic movement of DNA towards cells in cell-DNA suspension, ii) permeabilization of the cell wall (production of electric pores) and iii) electrophoretic permeation of DNA molecules into the cells through electric pores. This was feasible due to the apparatus constructed at Saint Petersburg State Technical University that can generate multiple independently-regulated electric pulses. The results obtained previously with different Escherichia coli strains, other than DH10B, have revealed that the optimal time of electrophoretic permeation of DNA molecules into the cells through electric pores is proportional to the square root of the molecular length of supercoiled plasmids in the range from 7 up to 250 kb. This explains the failure of previous large DNA ET work that had been based on results with smaller DNA. To more fully characterize the ET process of DH10B cells with large DNA, we have determined the effects of each of the electric field parameters on transformation with respect to plasmid size (and topology). Plasmids ranging in size up to 250 kb have been used, and the resulting transformed molecules have been examined to verify their fidelity. The data to be presented show that the kinetics of DNA penetration into DH10B cells on exposure to an electric field differ drastically from those obtained for the other E. coli strains used (C600, HB101, K802, DH5α). First of all, no clear dependence on molecule size of supercoiled plasmids was observed. The preliminary interpretation of the results is that the electropores appearing in DH10B are much bigger than those in other E. coli cells. This explains the high ET efficiency that can be achieved for DH10B cells with large supercoiled plasmids (up to 10^8 transformants per µg of 250 kb plasmid), which is, on a molar basis, only several times lower than the highest ET efficiency with the small pUC18 plasmid.

Acknowledgment: This work has been sponsored in part by a DOE humanitarian grant OR0033-93CIS015 awarded to A.B. ([email protected]) and by DE-FC03-96ER62294 to PdJ ([email protected])

Progress Towards the Construction of BAC Libraries from Flow Sorted Human Chromosomes

Jonathan L. Longmire, Nancy C. Brown, Evelyn W. Campbell, Mary L. Campbell, John J. Fawcett, Phil Jewett, Mary Maltbie, and Larry L. Deaven
Life Sciences Division and Center for Human Genome Studies, Los Alamos National Laboratory, Los Alamos, NM 87545.

Because of the advantages of large insert size and stability associated with BAC cloning systems, we have attempted to adapt the pBeloBAC vector for use with flow sorted human chromosomes. Compared to making genomic libraries (where large DNA mass is not limited), the cloning of chromosomes into the BAC vector is very challenging because only microgram quantities of DNA can be obtained even after extensive periods of sorting (months). The technical challenges involved in making chromosome-specific BAC libraries include developing methodologies for 1) efficient recovery of flow sorted chromosomes into agarose plugs; 2) predictable partial digestion of small masses of chromosomal DNA embedded in agarose; and 3) improved BAC cloning efficiencies to allow construction of multiple representation libraries from microgram quantities of chromosomal DNA.

We have found that partial digestions using HindIII can be performed in a predictable and reliable manner on microgram quantities of genomic DNA by carefully controlling incubation time and enzyme concentration-to-DNA mass ratios. In addition, these small amounts of partially digested genomic DNA can be size selected using PFG electrophoresis and cloned into the BAC vector with efficiencies that are sufficient for producing multiple representation chromosome libraries (10^3-10^4 cfu per µg starting DNA; average insert size approximately 90 kb).

Improvements have also been made in the collection of flow sorted chromosomes prior to DNA isolation. The previously used method for collecting chromosomes involved sorting into an agarose-coated tube until the tube was full of sheath fluid and then harvesting the relatively small number of chromosomes by centrifugation. We have developed a new method that allows larger numbers of chromosomes to be sorted into a single agarose coated tube. This is accomplished by using a series of centrifugation, decanting, and re-sorting steps to 'stack' chromosome pellets within the tube. A final spin followed by brief melting and regelling of the agarose is used to embed the chromosomes within the agarose plug. Higher DNA yields have resulted from using the stacked pellet method. However, chromosomes were lost during this process because the yield of chromosomal DNA was not directly proportional to the number of chromosomes that were originally sorted. This loss of chromosomes during the embedding step represents the single major problem that remains to be solved in order to allow the production of chromosome-specific BAC libraries. This work was supported by the US DOE under contract W-7405-ENG-36.

Construction of Human Chromosome 5 and 16 Specific Libraries by TAR Cloning

N. Kouprina, M. Campbell, J. Graves, E. Campbell, L. Meincke, J. Tesmer, N. Brown, J. Fawcett, P. Jewett, R. K. Moyzis, N. Doggett, L. Deaven, and V. Larionov
Los Alamos National Laboratory, Life Sciences Division and Center for Human Genome Studies, National Institute of Environmental Health Sciences, Laboratory of Molecular Genetics, Research Triangle Park, NC 27709
[email protected]

Transformation-associated recombination (TAR) was exploited in yeast for the selective isolation of human DNAs as circular YACs from monochromosomal mouse/human hybrid cell lines. Chromosome 5 and 16 specific YAC libraries were produced from the hybrid cell lines Q826-20 and CY18, respectively, using F-factor based TAR vectors containing human Alu repeats. The presence of an F-factor origin in the TAR vectors provides the opportunity for transfer of generated YACs to E. coli to produce BACs. Although <2% of the DNA in the hybrid cells was human, as many as 80% of the transformants had human DNA YACs, based on colony hybridization of a representative number of clones with both human and mouse probes. Thus, the level of enrichment of human DNA to mouse was nearly 3,000-fold. The YAC libraries of chromosome 5 and 16 consist of 299 and 1320 clones, respectively, with an average size of 150 kb. Approximately 266 chromosome 5 and 748 chromosome 16 specific BACs were obtained after electroporation of YACs into E. coli. Based on the size of the clones generated from chromosome 5, the YAC and BAC libraries are ~0.24X and ~0.21X, respectively, in chromosomal coverage. Based on the size of the clones generated from chromosome 16, the YAC and BAC libraries are ~2.4X and ~1.2X, respectively, in chromosomal coverage. The chromosomal distribution of 41 TAR BACs each from chromosomes 5 and 16 was evaluated by fluorescence in situ hybridization. The distribution of FISH signals was random along the length of each chromosome. We concluded that TAR cloning may provide an efficient means for generating YAC/BACs from specific chromosomes.

Construction of Genome-Wide Physical BAC Contigs Using Mapped cDNA as Probes: Toward an Integrated BAC Library Resource for Genome Sequencing and Analysis

Steve C. Mitchell, Diana Bocskai, Yicheng Cao2, Robert Xuequn Xu, Mei Wang, Troy Moore3, So Hee Dho4, Enrique Colayco2, Christie Gomez, Gabriella Rodriguez, Annabel Echeverria, Melvin I. Simon2, and Ung-Jin Kim2
2 Biology, California Institute of Technology
3 Research Genetics, Huntsville, AL
4 Seoul National University, Seoul, Korea

The goal of the human genome project is to characterize and sequence entire genomes of human and several model organisms, thus providing complete sets of information on the entire structure of transcribed, regulatory and other functional regions for these organisms. In the past years, a number of useful genetic and physical markers on human and mouse genomes have been made available along with the advent of BAC library resources for these organisms. The advances in technology and resource development have made it feasible to efficiently construct genome-wide physical BAC contigs for human and other genomes. Currently, over 30,000 mapped STSs and 27,000 mapped Unigenes are available for human genome mapping. ESTs and cDNAs are excellent resources for building contig maps for two reasons. First, they exist in two alternative forms, as both sequence information for PCR primer pairs and cDNA clones, thus making library screening by colony hybridization as well as pooled library PCR possible. We are now able to screen genomic libraries efficiently for large numbers of DNA probes by combining over 100 cDNA probes in each hybridization. Second, the linkage and order of genes are rather conserved among human, mouse and other model organisms. Therefore, gene markers have advantages over random anonymous STSs in building maps for comparative genomic studies.

As preliminary work for the ongoing "BAC-EST" project, we are currently screening our human BAC libraries with thousands of cDNA probes. We have thus far used over 3,000 Unigene probes and the number will increase to 7,000 this year. Our goal is to screen the library with at least 27,000 markers, most of which are in the form of cDNA probes. This represents at least 1 marker per every 100 kb of euchromatin regions. We plan to deconvolute the positive BACs to each marker by sorting the library into groups of BACs that are positive to specific pools, arraying each group on small hybridization filters, then hybridizing the filters with individual probes. We are also determining the end sequences for the positive BACs. BAC end sequence (BES) information can be extremely useful to precisely align these mapped BAC clones against any known sequence contigs by means of sequence match. Putative contigs or clone overlaps identified by markers or sequence match are verified via restriction fingerprint analysis. The BAC clone resources integrated with physical mapping information will be useful for building sequence-ready contigs on any chromosomal region.
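The screening figures quoted in the BAC-EST abstract above (27,000 markers at about one per 100 kb, with over 100 cDNA probes combined per hybridization) can be cross-checked with a little arithmetic. The sketch below is only an illustration and is not part of the authors' work: the function names are hypothetical, and the pool size is assumed to be exactly 100, so the hybridization counts are upper bounds given that the abstract says "over 100".

```python
import math


def implied_span_mb(n_markers: int, spacing_kb: float) -> float:
    """Span in Mb covered if markers sit one per `spacing_kb` (assumed uniform)."""
    return n_markers * spacing_kb / 1000.0


def pooled_hybridizations(n_probes: int, probes_per_pool: int) -> int:
    """Number of pooled hybridizations needed to use every probe once."""
    return math.ceil(n_probes / probes_per_pool)


if __name__ == "__main__":
    # 27,000 markers at one per 100 kb imply roughly 2,700 Mb of euchromatin.
    print(implied_span_mb(27_000, 100))
    # Assumed pool size of 100 probes per hybridization (hypothetical exact value).
    print(pooled_hybridizations(27_000, 100))  # full marker set
    print(pooled_hybridizations(3_000, 100))   # probes used so far
```

Under these assumptions the full 27,000-marker screen needs at most 270 pooled hybridizations, which shows why pooling makes a genome-wide screen practical.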

Current Status of the Integrated Chromosome 16 Map

N. A. Doggett, L. A. Goodwin, J. G. Tesmer, L. J. Meincke, D. C. Bruce, M. R. Altherr, R. D. Sutherland, U.-J. Kim, and L. L. Deaven
Los Alamos National Laboratory, Los Alamos, New Mexico 87545 and California Institute of Technology, Pasadena, California 91125
[email protected]

We have previously reported on the construction of an integrated physical map of human chromosome 16 (Doggett et al., Nature 377:Suppl:335-365, 1995). This map was constructed against a framework somatic cell hybrid breakpoint map which divides the chromosome into 90 intervals. The physical map consists of both a low resolution YAC contig map and a high resolution cosmid/P1/BAC contig map. The low resolution YAC contig map now comprises 900 CEPH megaYACs and 300 flow-sorted 16-specific miniYACs that are localized to and ordered within the breakpoint intervals with 1150 STSs. (These include 200 megaYACs and 300 STSs which were incorporated from the Whitehead Institute's total genome mapping effort.) The YAC/STS map provides practically complete coverage of the euchromatic arms of the chromosome and provides STS markers on average every 78 kb. The integrated map also includes 470 genes/ESTs/exons and 400 genetic markers--as part of an ongoing effort to incorporate all available loci into a single map of this chromosome. A high resolution 'sequence ready' cosmid contig map consisting of 4000 fingerprinted cosmids assembled into contigs covering 60% of the chromosome is anchored to the YAC and cytogenetic breakpoint maps via STSs developed from cosmid contigs and by hybridizations between YACs and cosmids. Current work is focused on completing the 'sequence ready' map using a combination of cosmids and BACs. IRS-bubble PCR products from a minimally tiling set of YACs are being hybridized to the chromosome 16 cosmid library to localize cosmids to gaps in the existing map; and over 200 BACs, identified by library screening, are now linked to the cosmid map. Several large contigs have been completed across disease gene regions in collaboration with several other investigators by supplying the available map resources (YACs, STSs, and cosmid contigs) and high density cosmid filter arrays for cosmid walking and YAC to cosmid hybridization experiments. The largest of these is a 4 Mb restriction mapped 'sequence ready' cosmid contig extending proximally from the p telomere. A 3.0 Mb region of this contig, extending proximally from the PKD1 locus, is the substrate for a finished sequencing effort underway at Los Alamos.

Supported by the US DOE, OBER under contract W-7405-ENG-36.

Construction of Sequence-Ready Physical Map of Human Chromosome 16p13.1-11.2 Region

Yicheng Cao, Diana Bocskai, Steve C. Mitchell, Robert Xuequn Xu, So Hee Dho#, Jun-Ryul Huh#, Byeong-Jae Lee#, Anna Glodek*, Mei Wang, Enrique Colayco, Gony H. Kim, Christie Gomez, Gabriella Rodriguez, Judith G. Tesmer**, Annabel Echeverria, Robin Hua Li, Melvin I. Simon, Norman A. Doggett**, Mark D. Adams*, and Ung-Jin Kim
* The Institute for Genomic Research, Rockville, MD
** Human Genome Center, LANL, Los Alamos, NM
# Seoul National University, Seoul, Korea

Extensive physical mapping efforts and advances in automated sequencing technology have resulted in the initiation of genomic sequencing of large human chromosomal regions. Currently, both NIH and DOE are supporting several centers in the U.S. to begin sequencing the genomes of human and model organisms systematically and in massive scale. In the past 1.5 years, Caltech has been building physical contig maps on the 20 Mbp region of the human chromosome 16p arm (16p13.1-11.2) jointly with TIGR and LANL as a pilot experiment to generate sequence-ready physical map using large insert human BAC libraries (A, B and C) that have been constructed at Caltech.

First, the pooled library A (with 3.5X genomic coverage) was screened with 98 ordered STS primer pairs taken from the integrated chromosome 16 YAC-STS map constructed by LANL. In this initial screening, 77 STS markers successfully identified 184 positive BACs. Positive BACs were characterized by checking multiple single colonies per clone, and restriction fingerprint analysis. For the clones to be sequenced, FISH mapping and genomic Southern hybridization steps were added for the verification of the chromosomal localization and genomic colinearity. Inserts from these positive BACs were used for screening library B and C (approximately 10X genomic coverage) by colony hybridization. The libraries were also screened with approximately 90 Unigene cDNA probes that have been localized to the 20 Mb region, and with approximately 250 Unigene cDNA probes mapped to the other regions on chromosome 16. Shotgun clones derived from the ends of completely sequenced BACs were used as probes to efficiently identify BACs that overlap minimally with the sequenced BACs. We have thus far identified nearly 1,000 putative BACs belonging to the 20 Mb region, and more than 2,000 BACs over the entire chromosome 16. End sequences were determined from all of these BACs. The BES (BAC end sequences) have been used to align these clones against the sequenced BACs by sequence match, thus allowing rapid and precise determination of the extent of overlaps between clones. The putative overlaps from the sequence match are verified by restriction fingerprint analysis.

Currently we are building BacDB database, a modified version of ACeDB 4_1, by entering and organizing all information related to human BAC clones and physical mapping data that are available from Caltech, LANL, TIGR, as well as public resources. The database contains available BAC related data and mapping information, STSs, ESTs, BES, and completed BAC sequences. BacDB will not only serve as an integrated database for mapping and sequencing, but will be a tool for the efficient identification of sequence-ready clones.

WEB sites
Caltech: http://www.tree.caltech.edu
TIGR: http://www.tigr.org/tdb/bumgen/humgen.html
LANL: http://www-ls.lanl.gov

Large-Scale BAC End Sequencing to Aid Sequence-Ready Map Construction

Mark D. Adams, Steve Rounsley, Casey Field, Jenny Kelley, Steve Bass, Brook Craven, and J. Craig Venter
The Institute for Genomic Research, Rockville, MD 20850
[email protected]

Libraries constructed in BAC vectors have become the choice for clone sets in high throughput genomic sequencing projects because of their higher stability as compared to their YAC or cosmid counterparts. We have proposed the use of BAC end sequences as a primary means of selecting minimally overlapping clones for sequencing large genomic regions. A necessary prerequisite of this is the collection of end sequences from all the clones in deep coverage BAC libraries. This is now being pursued for both the human and Arabidopsis genomes. BAC vectors are based on the E. coli F-factor replicon and offer strict copy number control limiting the number of BACs to 1-2 copies per cell. However, in addition to minimizing the chances of chromosomal rearrangements, the low copy number also poses a challenge for high throughput direct sequencing of the BAC clone ends because of the difficulty in obtaining sufficient quantities of high quality template from standard minipreps.

We have developed reliable approaches for both a multiprep for BAC DNA purification and a protocol for the direct sequencing of BAC DNA using Dye Terminator chemistry. The combination of these two methods allows us to produce daily about 400 high quality BAC end sequences with an average edited length of 400 bases using 4 ABD 377 sequencers and a small team of personnel. To aid in sample tracking and high throughput processing, the prep is processed completely in a 96 well format, from clone storage and growth through DNA purification, isopropanol precipitation, and final resuspension. Sequencing

reactions are also processed in a 96 well format, including the removal of excess dyes before loading onto the sequencers. Both high throughput methods are amenable to future automation. Our methods will be presented along with discussion regarding the advantages and disadvantages of various methods we tried.

The Sequence Tagged Connector (STC) Approach to Genomic Sequencing: Accelerating the Complete Sequencing of the Human Genome

Gregory G. Mahairas, Keith D. Zackrone, Stephanie Tipton, Sarah Schmidt, Alan Blanchard, Anne West, Joe Slagel and Leroy Hood
Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195

The STC approach has been proposed as an attractive strategy to provide a sequence ready scaffolding for the efficient and directed sequencing of the complete human genome (1). This effort has been undertaken through a collaborative effort between the California Institute of Technology, TIGR and the University of Washington, and funded through the U.S. Department of Energy. The approach entails the sequencing of the ends of 300,000 Bacterial Artificial Chromosomes (BACs) that constitute a 20X deep Human DNA library to construct a sequence ready scaffold of the human genome.

At the Univ. of Washington we have assembled a high throughput automated end sequencing and fingerprinting process with its associated informatics. BAC clones are robotically inoculated from 384 well plates into 4 ml 96 well culture format, grown and the BAC DNA robotically extracted using AutoGen 740 robots. BAC template DNA from the AutoGen is then robotically transferred into 96 well microtiter plates from which DNA sequencing and fingerprinting reactions are set up. DNA fingerprinting is performed using conventional agarose electrophoresis, digestion with a single restriction enzyme (EcoRV) followed by automated imaging and analysis. DNA sequencing is performed using PE-ABD High Sensitivity dye primers and ABI 377 DNA sequencers. Laboratory protocols, automated data production, data processing, quality control measures and LIMS will be described in detail. During a 50 day period the STC laboratory sequenced 23317 BAC ends (STCs). 19224 (82.4%) were greater than 100 bp non trimmed and the average nontrimmed read length was 388 bp for a total of 7.46 Mb (.25%) of the genome. 29% of the STCs contained repetitive DNA but less than 11% were entirely repeat. 12% of the repetitive DNA were LINE sequence, 4.6% LTR, 6.7% SINE sequence and 1.3% of the STCs contain a microsatellite or simple sequence repeat. The total G + C content was 40% and the average CpG content was .28, both expected numbers for human genomic DNA. 224 STCs had CpG scores of 1 representing CpG islands. 3103 STCs (16.8%) hit the EST, non-redundant nucleotide or Sixframe database. 1103 STCs hit the EST database (DB), 517 of which hit only the EST; 1087 STCs hit the nr nucleotide DB, 471 of which hit the nr nucleotide DB only, and 913 STCs hit the nr protein DB, 500 of which hit only the nr protein DB. 181 STCs (1%) hit all three databases, 131 hit nr nuc. and nr protein DBs, 101 hit the EST and nr prot. DBs, and 304 hit EST and nr nucleotide DBs, i.e., 4% hit more than one of these DBs and probably represent genes.

1. Venter, J. C., Smith, H. O., and Hood, L. (1996) Nature 381: 364-366
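
The read statistics and database-hit counts quoted in the STC abstract can be cross-checked with simple arithmetic. The Python sketch below uses only figures from the abstract, except for the genome size: the 3,000 Mb haploid genome is an assumption introduced here and is not stated in the text. The variable names are illustrative.

```python
# Figures quoted in the abstract.
total_reads = 23317      # BAC ends sequenced in the 50 day period
usable_reads = 19224     # reads longer than 100 bp, untrimmed
mean_read_bp = 388       # average untrimmed read length

# Assumption (not from the abstract): a haploid genome of about 3,000 Mb.
genome_bp = 3.0e9

usable_fraction = usable_reads / total_reads
total_mb = usable_reads * mean_read_bp / 1e6
genome_fraction = usable_reads * mean_read_bp / genome_bp

# Database hits, split into the mutually exclusive categories reported.
est_only, nuc_only, prot_only = 517, 471, 500
est_nuc, est_prot, nuc_prot, all_three = 304, 101, 131, 181

est_total = est_only + est_nuc + est_prot + all_three
nuc_total = nuc_only + est_nuc + nuc_prot + all_three
prot_total = prot_only + est_prot + nuc_prot + all_three
multi_db = est_nuc + est_prot + nuc_prot + all_three

print(f"usable reads:        {usable_fraction:.1%}")
print(f"sequence generated:  {total_mb:.2f} Mb")
print(f"share of genome:     {genome_fraction:.2%}")
print(f"EST / nr nuc / nr protein totals: {est_total} / {nuc_total} / {prot_total}")
print(f"hits in more than one database:   {multi_db} ({multi_db / usable_reads:.1%})")
```

Under these inputs the per-database totals reproduce the 1103, 1087 and 913 hits given in the abstract, and the reads hitting more than one database come to 717, or roughly 4% of the usable reads.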

Amplification and Sequencing of End Fragments from Bacterial Artificial Chromosome (BAC) Clones by Single-Primer PCR

Joomyeong Kim, Ethan A. Carver, and Lisa Stubbs
Human Genome Center, Lawrence Livermore National Lab., PO Box 808, L-452, Livermore, CA 94551
[email protected]

PCR-based methods have provided invaluable tools for the analysis of large genomic clones, such as YACs and BACs, that comprise the bulk of most existing and emerging genomic physical maps. However, applications of the available methods are often limited, or made more difficult to apply, because of the need for significant identity between primers and sequences located on both sides of a targeted site. We have developed a method that permits target sequences to be exponentially amplified, with a high degree of purity and specificity, from low-complexity templates when only a single sequence-specific anchor primer is present in the mixture. Amplification is efficiently driven by specific binding of the primer to one end of the target locus, with reverse priming initiated at nearby regions containing only a 5-6 bp sequence match with the anchor oligonucleotide. Using standard vector-derived primers, we have applied the single-primer protocol to amplify end-fragments directly from a number of different BAC-containing colonies, and have generated high quality DNA sequence information from the PCR mixtures without the need for further purification. This work introduces single-primer PCR as a simple, efficient and convenient alternative to existing methods for isolation of sequence-ready end fragments and other sequences from within BACs and other large-insert genomic clones.

Construction of a High Resolution P1/PAC/BAC Map in the Distal Long Arm of Chromosome 5 for Direct Genomic Sequencing

Ze Peng, Steve Lowry, Duncan Scott, Yiwen Zhu, Eddy Rubin and Jan-Fang Cheng
Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
[email protected]

One of the JGI genomic sequencing targets is the distal 45 Mb of the long arm of human chromosome 5. This region was chosen because it contains a cluster of cytokine growth factor (IL3, IL4, IL5, IL9, IL12, IL13, GM-CSF, FGFA, M-CSF) and receptor genes (GRL, ADRB2, M-CSFR, PDGFR) and is likely to yield new and functionally related genes through long range sequence analysis. This region is relatively rich in disease-associated genes including susceptibility to asthma, several autosomal dominant corneal dystrophies, low-frequency hearing loss, dominant limb-girdle muscular dystrophy, Treacher Collins syndrome, and myeloid disorders associated with the 5q- syndrome.

The mapping strategy is based on a combination of both hybridization and PCR approaches. Inter-Alu fragments generated from non-chimeric YACs covering 3-5 Mb of DNA were used to isolate regionally specific P1/PAC/BAC clones. These clones were sized by pulsed-field gel electrophoresis, and their map locations were confirmed using fluorescent in situ hybridization. We estimated that approximately 60% of the clones representing a targeted region in the P1, PAC or BAC libraries were identified by this approach. Overlaps between P1s, PACs and BACs were mainly established by PCR using STSs generated from ends of clones. Contigs were further oriented using STSs developed from known genes, ordered markers, and ends of P1s, PACs, BACs and YACs. The STS content mapping also allowed us to identify new clones to fill gaps.

We have used this strategy to map 198 P1s, 60 PACs and 1,407 BACs so far in the region of 5q23-q35. These clones were linked by 1,321 STSs to form 74 contigs. The contigs are

approximately 0.2-4 Mb in size. The average density of STSs is about 3.7 per 100 Kb in the proximal 20 Mb and about 2.3 per 100 Kb in the distal 25 Mb. Most STSs were derived from the ends of BACs which allowed detection of 3.4% chimeras by PCR analysis. The clone coverage of the proximal 20 Mb is over 95% and the coverage of the distal 25 Mb is between 70-80% at different locations. To date, 121 P1/PAC/BAC clones spanning the proximal 10 Mb were in the pipeline for production sequencing. This clone map with STS information is distributed through our Web site (http://www-hgc.lbl.gov/sequence-archive.html) and is being updated periodically.

An Integrated Genetic and Physical Map of Human Chromosome 19: A Resource for Large Scale Genomic Sequencing and Gene Identification

L.A. Gordon, A. Georgescu, M. Christensen, S. Ross, L. Woo, L.K. Ashworth, H. W. Mohrenweiser, A.V. Carrano and A. S. Olsen.
Human Genome Center, Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory, Livermore, CA 94550
[email protected]

We have developed an integrated genetic and physical map for human chromosome 19 consisting of metrically ordered, well-annotated cosmid/BAC contigs that provide a critical resource for sequencing.

At approximately 65 Mb, chromosome 19 represents 2% of the haploid genome; it is the most GC-rich human chromosome, suggesting an especially high gene density. The current map consists of 185 ordered cosmid/BAC contigs of average size 225 kb (range 40 kb to 3.3 Mb) spanning a total of 42 Mb, i.e. over 75% of the non-centromeric portion of chromosome 19. An additional 8 Mb of small EcoRI mapped cosmid contigs, average size 80 kb, have not yet been incorporated into the ordered map. The order of the constituent contigs has been determined by standard FISH techniques applied to a series of chromatin targets with increasing resolution. High-resolution FISH to decondensed human sperm pronuclei establishes genomic distances between 286 ordered FISH markers in 19p and q, thereby providing a "metric" framework that links the ordered contigs to the cytogenetic map. Complete digest EcoRI maps have been constructed for all contigs, which provide validation of the contig assembly and constituent clones, as well as an indication of contig size and extent of clone overlap.

The map currently includes 15 Mb of restriction mapped contigs greater than 500 kb (average size 1 Mb), which provide ideal substrates for large-scale genomic sequencing of this chromosome. About 7.5 Mb have been sequenced or are currently in the sequencing pipeline. The high depth of coverage (average 5X) and mix of cosmid and BAC clones enables selection of an optimum set of spanning clones with minimum overlap for sequencing.

The map is extensively annotated, with over 300 genes/cDNAs and 180 polymorphic markers that have been localized at the clone, and occasionally restriction fragment, level. Placement of the genetic markers in the physical map demonstrates excellent correspondence with existing genetic maps and provides relative order of many markers that cannot be distinguished by recombination. This map provides a unique resource for identification of disease genes mapped to this chromosome.

Work performed under the auspices of the US DOE by Lawrence Livermore National Laboratory under contract No. W-7405

Mapping by Two-Dimensional Hybridization

Cliff S. Han, Mark O. Mundt, Linda J. Meincke, Judy G. Tesmer, Robert K. Moyzis, Larry L. Deaven and Norman A. Doggett
Life Sciences Division and Center for Human Genome Studies, Los Alamos National Laboratory, Los Alamos, NM 87545

The framework genome-wide physical maps have largely been constructed with YAC clones. YACs

however, are often unstable and chimeric, and because of the difficulty of isolating cloned YAC DNA in a pure form, are unsuitable for DNA sequencing. Therefore, BAC libraries, which retain the advantages of large insert size, are stable, and easy to manipulate, were constructed. Physical mapping with these libraries is now underway and is a critical component to large scale genomic sequencing.

Clone based physical maps have been constructed by many methods, including clone fingerprinting and STS content mapping. We are developing an alternative method which is applicable to small to moderate complexity clone libraries such as chromosome specific cosmid and BAC libraries and plasmid subclone libraries of BACs. This two-dimensional hybridization method is based on clone hybridization with pooled clone DNA as probes. The principal experimental design is as follows: First, several grids of the library are made. Second, DNA of the clones is pooled by a two-dimensional strategy (rows and columns) and purified by subtractive hybridization to remove both low abundance and high abundance repetitive elements. Third, the grids are hybridized separately with the row and column pooled DNAs. Positive hybridizing clones that are in common between the row and column pool probes will overlap with the clone that is in the intersection point with these pools. We are using the program MAP (written by Mark Mundt) to construct contigs from the two-dimensional hybridization results.

This method was tested with 350 cosmids, chosen from the chromosome 16 specific cosmid library by grid hybridization with 3 YACs. Overlaps identified by two-dimensional hybridization were confirmed by restriction fingerprinting. We are currently implementing this approach for rapid construction of contigs from a chromosome 16 specific BAC library and for the ordering of plasmid subclones of cosmids and BACs into minimal tiling sets prior to their sequencing.

Supported by the US DOE, OBER under contract W-7405-ENG-36.

Estimates of Gene Densities in Human Chromosome Bands

Norman A. Doggett, Robert D. Sutherland and David C. Torney
Theoretical Division, Life Sciences Division, and OHER Joint Genome Institute, Los Alamos National Laboratory, Los Alamos, NM.

ISCN 1995 established a new set of Human chromosome ideograms that comprise 850 metaphase-chromosome bands, differentiated by five shades of staining intensity [ISCN (1995), Mitelman, F. (ed), S. Karger, Basel; and Francke, U. Cytogenet. Cell Genet. 65 206-219 (1994)]. The five band shades are referred to as white, light gray, medium gray, dark gray, and black. These shades, presumably, reflect different underlying states of chromatin. The publication of the mapping of 16,000 partially sequenced genes, or cDNAs, to a framework radiation hybrid map of the Human genome has established the chromosome assignment for as much as 20% of all Human genes [Schuler et al., Science 274 540-546 (1996)]. In this work, the genes were observed to be distributed non-uniformly along the chromosomes. Because none of these cDNAs were localized to a specific chromosome band, no conclusions were drawn about gene densities in the different types of bands.

To establish a relationship between gene density and band type we have used the proportion of each type of band on each chromosome to estimate the respective gene densities, by optimizing the consistency of the model with the numbers of cDNAs on the 22 autosomes. In detail, the theoretical prediction for the number of genes on chromosome number $j$ equals $\sum_{i=1}^{5} a_i p_{ij}$, in which the $a_i$ are gene density parameters for band type $i$ and the $p_{ij}$ are the proportions for the five types of bands on this chromosome. The statistical analysis used involves integrating over the posterior joint distribution for the parameters, given the linear model and the cDNA data. In one data-analysis we grouped the white and light gray bands together, assigning them a shared gene-density, and also grouped the two darkest bands together, thereby reducing the number of gene-density parameters to three. The inferred gene density for the white &

light gray bands equaled 7.3/Mb, with a standard deviation of 0.3, the inferred gene density in gray bands equaled 1.45/Mb, with a standard deviation of 1.1, and the inferred gene density in dark gray and black bands equaled 0.45/Mb, with a standard deviation of 0.4. If we knew the total number, T, of genes in the Human genome, the expected gene densities are, of course, our estimates multiplied by T/16,000. Reasonable choices for T range from 60,000 to 100,000. We will present the gene density estimates for all types of bands, plus confidence intervals.

It is reasonable to conclude that the expected gene density decreases with band darkness. These results have practical implications for large scale sequencing of the Human genome and have ramifications for the study of the evolution of genomes.

Mapping and Functional Analysis of the Mouse Genome

Dabney K. Johnson, Edward J. Michaud, and Monica J. Justice
Mammalian Genetics Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831-8080
[email protected]

As sequencing of the human genome progresses, the role of the mouse as proxy mammal for functional studies will become crucial. We at ORNL have designed a comprehensive program that will combine the power of our unparalleled mouse 'mutation machine' with our massive computational capability in bioinformatics and our integrated technology development effort in detection and analysis of new mutant mouse phenotypes. Our goal is to assign function to all the genes that reside in defined segments of the mouse genome, via efficient identification of phenotypic changes in behavior, biochemistry, and gene expression that accompany induced genetic changes.

The creation of functional maps of the mammalian genome must keep pace with the rapidly expanding physical and transcriptional maps. High-efficiency mutagenesis in mouse spermatogonial stem cells, using the supermutagen N-ethyl-N-nitrosourea (ENU) to induce point mutations in single genes, will provide the functional information for given segments of DNA sequence within the physical map. Other mutagens, for made-to-order mutagenesis, are also being tested, as are DNA-repair deficient mouse strains as mutagenesis targets. For an immediate target, we propose to 'saturate' a long and well-characterized chromosomal deletion with ENU-induced mutations to develop a gene-by-gene phenotype map of the region uncovered by the deletion. This efficient experimental protocol will generate phenotypes evident in deletion hemizygotes after only the second generation post-treatment, and all progeny in that generation will be 'color-coded' for instant genotyping. Submapping of phenotypes will be quite efficient because we have existing mutant stocks that carry many additional deletions whose endpoints nest within the large deletion. Physical and transcriptional mapping within this set of nested deletions are well under way.

We will also create similar deletion reagents in the mouse cognate for a human chromosomal region already being sequenced; good choices would be mouse chr 16/human chr 21, or mouse chr 8/human chrs 16 and 19. A complex of nested deletions will be generated molecularly using the Cre/loxP approach, or via radiation-induced mutations in ES cells.

Pilot experiments conducted here at ORNL demonstrated that high-efficiency ENU mutagenesis generated numerous visible and lethal mutations that fall within a large deletion encompassing the p locus in mouse chr 7. These experiments resulted in the isolation and fine localization of at least seven new loci in 1244 gametes tested; since only lethal and visible phenotypes were under scrutiny, more subtle mutations/alterations were missed. We propose to extend this analysis to a second p-region deletion, p30PUb, that has been shown to contain multiple interesting phenotypes. These phenotypes have been submapped to specific intervals between deletion breakpoints, but we now must create the necessary intragenic mutant alleles for any genes in the deletion. We expect also to generate new phenotypes as we disable additional genes within

the large deletion. DNA sequence of the deleted region will be determined in conjunction with the JGI to facilitate gene identification and mutational analysis.

Additional crucial components of our strategy are automation of phenotype screening by microfabrication of subdermal sensors, computer-aided imaging of both skeletal and soft tissues, and improvements in through-put for behavior-testing paradigms and GC/MS analysis of blood, urine, and breath for biochemical alterations. Changes in gene expression will be detected by the use of chip-based DNA/RNA analysis, and mass-spectrometry-based protein analysis from mutant vs normal cell populations.

We are also developing sperm-freezing protocols for mutation preservation and distribution, and efficient artificial insemination for recovery.

This work is supported by the U.S. Department of Energy FWP ERKP260 under contract no. DE-AC05-96OR22464 with Lockheed Martin Energy Research Corporation.

The Albino (c) Region of Mouse Chr 7: A Model for Mammalian Functional Genomics

M.J. Justice¹, E.M. Rinchik², D.A. Carpenter¹, S.E. Thomas¹, and D.K. Johnson¹
¹ Mammalian Genetics Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
² Sarah Lawrence College, Bronxville, New York 10708

The mouse, with its powerful genetic tools and its extensive comparative molecular linkage map with the human, is a useful model organism to study mammalian gene function. Multiple mutant alleles of genes can be derived by mutagenesis that may reflect loss of function, partial loss of function, or gain of function. The entire series of alleles must be studied together to dissect the function of the gene.

Overlapping deletions obtained at albino (c) are useful genetic reagents for functional genomics. Saturation mutagenesis with ethylnitrosourea (ENU) utilizing the deletions at the c region on mouse Chromosome 7 revealed many new functional units that reflect single gene changes (Rinchik et al. 1990; Rinchik et al. 1995). The region is homologous to human Chromosome 11q13-q21, and is linked to the mouse homologue of human oculocutaneous albinism, type IA. Because of the nature of the phenotypic screen, many of the new mutants die as embryos. The region is particularly valuable for functional studies because of the variety of genetic reagents, including overlapping deletions and point mutations, that are available. Our focus is a group of alleles isolated at a locus (axis) that affects the development of the body axis. The homozygous phenotypes of six alleles at axis that are likely to be point mutations range from early prenatal lethality to adult viability. The baseline function of axis is demonstrated by two alleles that arrest during formation of the primitive streak. However, two other alleles have a severely disorganized body axis later in development. One allele exhibits a variety of neural tube defects, including exencephaly. Together, these observations suggest that the axis locus functions in the formation of the rostral-caudal body axis and neurulation. Intriguingly, homozygotes of one allele of axis survive to adulthood, and have axial skeletal abnormalities. Complete complementation studies of these six alleles reveal complex genetic characteristics such as intragenic complementation among two of the lethal alleles and a maternal effect of the viable allele. The maternal effect is a phenotype of severe neural kinking, accompanied by somite abnormalities, suggesting again, a role in the determination or maintenance of the integrity of the body axis. The varied features of axis will provide important insights into the mechanism of gene function at this complex locus.

The chromosomal region that includes axis contains other genes that affect body axis and skeletal development. Predictions of the potential role of a gene or genes at axis will be presented, as well as a view of the functional organization and possible interactions of other loci in the region. These analyses will give us essential data for subsequent large-scale expansions of the functional

map of the mouse genome in parallel with human gene maps using additional induced and targeted deletions combined with chemical mutagenesis.

E.M. Rinchik, D.A. Carpenter & P.B. Selby. Proc. Natl. Acad. Sci. USA 87, 896-900 (1990).
E.M. Rinchik, D.A. Carpenter & M.A. Handel. Genetics 92, 6394-6398 (1995).

This work is supported by the U.S. Department of Energy FWP ERKP260 under contract with Lockheed Martin Energy Research Corporation.

Comparative Mapping and Expression Study of Two H19q13.4 Zinc-Finger Gene Clusters in Man and Mouse

Joomyeong Kim, Mark Shannon, Linda Ashworth, Elbert Branscomb, and Lisa Stubbs
Human Genome Center, Lawrence Livermore National Laboratory, Livermore, CA 94551.
kimj@bioaxl.bio.ornl.gov

One of the larger syntenically homologous blocks of human and mouse genomes includes the long arm of human chromosome 19 and the proximal portion of mouse chromosome 7. As part of an extensive comparative genome mapping of human chromosome 19, we have targeted a region spanning approximately 2.5 Mb near the telomere of H19q. As a first step for characterization of this region, we have assigned several known genes and cDNA sequences to the 19q13.4 physical map. Most genes assigned to this region are C2H2-type, zinc finger (ZNF)-containing genes, which include ZNF134, ZNF154, ZIK1, C2H2-25, and one EST (T26651). These genes are all localized to a centrally-located contig (577/1514) and are KRAB (Kruppel-associated box)-type ZNFs. To provide clues for the potential role of these genes, the expression-patterns of these ZNFs have been determined; most genes are expressed ubiquitously in all tissues examined but the highest expression level for each gene has been observed in different tissues. Using an interspecific backcross system, we have mapped ZNF134-related sequences in the mouse and confirmed that the mouse genome has similar zinc-finger gene clusters in proximal Mmu 7. We are currently constructing mouse BAC contigs and, at the same time, isolating the mouse homologs or related genes for the human ZNFs from these contigs.

Recently, we have also assigned the human homolog of a mouse imprinted gene, Peg3 (paternally expressed gene 3), to a contig (174) located approximately 1 Mb proximal of a ZNF134 contig. Studies of several imprinted domains, including Prader-Willi and Angelman syndrome-region (H15q11-13)/central Mmu 7 and Beckwith-Wiedemann syndrome-region (11p15.5)/distal Mmu 7, indicated that genomic imprinting is generally conserved among mammalian species and also that imprinted domains are large--spanning distances ranging from several hundred kilobases to two megabases. Due to these observations, we have decided to characterize PEG3/Peg3-containing regions in both human and mouse. In human, the adjacent contig to a PEG3-contig appears to have numerous ZNFs. In mouse, we have isolated two zinc finger genes located very close to Peg3 and are currently investigating the imprinting status and genomic organization of these genes relative to Peg3.

One Origin of Man: Primate Evolution Through Genome Duplication

Julie R. Korenberg, Xiao-Ning Chen, Steve Mitchell, Rajesh Puri, Zheng-Yang Shi and Dean Yimlamai
Medical Genetics Birth Defects Center, The CSMC Burns & Allen Research Institute, UCLA School of Medicine, Los Angeles, CA
[email protected]

Chromosome duplication is a force that drives evolution. We now suggest that this may also be true of the primates and that the resulting duplications in part determine the spectrum of human chromosomal rearrangements. To investigate the existence and origin of duplications in the human genome, and their consequences, 5,000 bacterial artificial chromosomes (BACs) were mapped at 2-5 Mb resolution on human high resolution chromosomes by using fluorescence in situ hybridization. A subset of 469 of these was defined that generated two or more signals,

excluding those located in regions of known repeated sequences, viz., the regions of centromeres, telomeres and ribosomal genes. Although a subset of these multiple site BACs represent the chimeric artifacts of cloning, derived from two different chromosomal regions, others reflect regions of true homology in the human genome.

Two questions were considered: first, the extent to which the multiple sites of hybridization of single BACs within single chromosomes reflected the breakpoints of naturally occurring human inversions, and second, the extent to which these same multiple hybridization points reflected the chromosomal inversion points in primate evolution.

For human inversions, the results of the analyses revealed a total of 124 BACs (2.5%) mapping to two or more sites on the same chromosome, of which 81 (65%) mapped to one of 27 distinct human inversion sites, the largest share of which recognized the well-established pericentromeric inversions of chromosomes 1, 2, 9, and 18, as well as the paracentric inverted region of chromosome 7q11/q22. From this, we infer that meiotic mispairing involving the homologous regions may be responsible for the inversions.

With respect to primate evolution, a significant proportion of inversion breakpoints that characterize the chromosomal changes seen in the evolution of the great apes through man are also reflected in the distribution of BAC multiple intrachromosomal sites. Further analyses of the 29 independent BACs recognizing the pericentromeric region of human chromosome 9 suggest at least three classes, two of which recognize only single sites in Pan troglodytes. These data suggest that inversions occurring through primate evolution may generate small duplications that, although they can cause chromosomal imbalance in single individuals, also provide the additional genetic material for speciation.

Third-Strand In Situ Hybridization (TISH) to Non-Denatured Metaphase Spreads and Interphase Nuclei

Marion D. Johnson and Jacques R. Fresco
Princeton University/Washington Road/Princeton, NJ 08544
[email protected]

An in situ methodology employing solution conditions has been developed for binding oligodeoxyribonucleotide 'third-strands' to chromosomal DNA targets in non-denatured protein-depleted metaphase spreads and interphase nuclei. Third-strand in situ hybridization (TISH) was performed on slides at pH 6.0 using a dual psoralen- and biotin-modified 17-nt pyrimidine-rich third strand to target a unique multicopy sequence in human chromosome 17 alpha-satellite (D17Z1 locus). UVA photofixed third strands, rendered fluorescent by FITC-labeled avidin, are reproducibly centromere-specific for chromosome 17 and visible without amplification in human lymphocyte and somatic cell hybrid spreads and interphase nuclei. Two D17Z1 haplotypes, one positive and the other negative for third-strand binding, were identified in three combinations (+/+, +/-, and -/-). Third-strand probes specific for unique multicopy alpha-satellite targets in human chromosomes X and 16 have also been developed. Similar alpha-satellite target sequences have been identified in 22 of the 24 human chromosomes, making centromere-specific chromosome identification by TISH applicable to virtually all human chromosomes. The technology is presently being applied to chromosomes of other eukaryotes. TISH has potential diagnostic, biochemical, and flow cytometric applicability to native metaphase and interphase chromatin.
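As a quick consistency check on the counts quoted in the primate evolution abstract above (5,000 BACs mapped, 124 with two or more sites on the same chromosome, 81 of those at known inversion sites), the stated percentages can be recomputed. The following is a minimal Python sketch; the function and variable names are chosen here for illustration and are not from the authors.

```python
# Counts quoted in the abstract (Korenberg et al.); names are illustrative only.
TOTAL_BACS = 5000         # BACs mapped by fluorescence in situ hybridization
INTRACHROMOSOMAL = 124    # BACs mapping to two or more sites on one chromosome
AT_INVERSION_SITES = 81   # of those, BACs at one of the 27 human inversion sites


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of all mapped BACs with multiple sites on the same chromosome.
intrachromosomal_pct = percent(INTRACHROMOSOMAL, TOTAL_BACS)

# Share of those intrachromosomal BACs that fall at known inversion sites.
inversion_pct = percent(AT_INVERSION_SITES, INTRACHROMOSOMAL)

print(f"intrachromosomal: {intrachromosomal_pct:.1f}%")
print(f"at inversion sites: {inversion_pct:.0f}%")
```

Both values agree with the rounded figures given in the abstract (2.5% and 65%).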

New Male-Specific, Polymorphic Tetranucleotide Microsatellites from the Human Y Chromosome

P. Scott White¹, Owatha L. Tatum¹,², Larry Deaven¹, Jonathan L. Longmire¹
¹Genomics Group and Center for Human Genome Studies, Life Sciences Division, Los Alamos National Laboratory, Los Alamos, NM 87545 and ²Dept. of Biological Sciences, Texas Tech University, Lubbock, TX 79409.

Human genetic polymorphisms are valuable for extracting information about population structure and evolutionary histories. Clonally inherited DNA such as mitochondrial or Y chromosome-specific DNA is of particular use due to lack of recombination. Microsatellites have been used to reconstruct human evolutionary histories. Until recently, only six male-specific tetranucleotide repeats have been publicly available as PCR markers. We have developed six additional microsatellite markers using a cosmid library of flow-sorted human Y-chromosomes. These microsatellites are tetranucleotide (GATA)n repeats of nine to twelve repeat units each, and are polymorphic among unrelated individuals. All markers were analyzed using Applied Biosystems Genescan fluorescent fragment sizing. At least three alleles were identified for each marker when diverse genomic DNAs were used as PCR template. The allele sizes range from 162 to 372 nucleotides. An additional marker was identified, having polymorphic alleles in both males and females of sizes 217-237. These markers fall into sets of non-overlapping alleles that allow for efficient gel multiplexing with fluorescent dyes. Because six of these markers are male-specific or have male-specific alleles, they are valuable for evolutionary and population studies where non-recombining DNA is desired.

In Vivo Libraries of the Human Genome in Mice as a Reagent to Sift Genomic Regions for Function

Rubin, E.M., Zhu, Y., Fraser, K., Ueda, Y., Smith, D.J., Symula, D., Cheng, J.F.
Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
[email protected]

Libraries of all or part of the mammalian genome have been propagated in single cells and have been used as tools in gene discovery through in vitro analyses. We have expanded upon this concept by the creation of panels of YAC and P1 transgenic mice containing defined contiguous regions of the mouse or human genome. Since each library member contains a large (80-700 kb) transgene, together several megabases of contiguous DNA from a defined region of the genome can be propagated using the mouse as a host. We have successfully used such libraries to sift through large genomic regions and to localize and clone genes based on phenotyping members of the library. Examples of our use of these libraries to link sequence with function include:

(1) Biological annotation of human 5q31 genomic sequence data. Computational analysis of 1.2 Mb of sequence from human 5q31 generated as part of the JGI sequencing program has revealed multiple putative genes and exons. As a tool to validate the computationally predicted novel genes in this region, and to determine their site and timing of expression, we have created and analyzed a 1.5 Mb in vivo library of human 5q31 and documented the expression patterns of the newly identified genes in the YAC transgenics.

(2) Fine mapping a 5q31 QTL for asthma. Several human studies have mapped a major QTL determining IgE levels in asthmatics to 5q31. Through analysis of IgE levels in members of the 5q31 in vivo library following an airway irritant we have identified a single YAC noted in two separate founder lines to be associated with a marked decrease in IgE levels.

We are in the process of looking at mice containing fragments of this YAC to move from this disease-associated 5q31 QTL to the causative gene.

(3) In vivo complementation for cloning mouse mutations. We developed an in vivo library of the 550 kb region to which the mouse recessive neurological mutation vibrator had been localized by meiotic mapping. Through the in vivo complementation of the phenotype with a member of the library, we have been able to fine map and then clone the gene (PITP-N) responsible for the disorder.

(4) Identifying genes contributing to defects in cognition on chromosome 21. We have created a 1.8 megabase in vivo library of human chromosome 21q22.2 in a panel of YAC transgenic mice. Analysis of these animals with regard to learning and behavior identified a 550 kb YAC responsible for specific deficits. Through fragmentation of the YAC we have been able to identify a gene (GIRK) whose altered level of expression is responsible for behavioral abnormalities in mice, a gene whose altered expression has also been linked to learning defects in Drosophila.

These studies on panels of transgenic mice containing large inserts have effectively enabled phenotypic assays at the organismal level to be performed on many genes at once. This, in essence, constitutes a multiplex analysis that permits increased throughput of data collection relating sequence to phenotype.

Automatic BAC Contig Assembly by Three-Color Fluorescent Restriction Fragments

Y. Ding, M. Johnson, Y. Chen, J. Colayco, J. Melnyk, S. Khan, D. Gilbert, and H. Shizuya
Division of Biology, Caltech, Pasadena, CA 91125, PE Applied Biosystems, Foster City, CA 94404
[email protected]

Bacterial Artificial Chromosomes (BACs) are extensively used for large-scale mapping and sequencing. In order to identify and characterize BAC clones for contig assembly, we have developed a rapid fingerprinting method using fluorescently labeled dideoxyadenosine triphosphates ([F]ddATP). Taq FS incorporates [F]ddATP at the first nucleotide of a 5' overhang generated by Hind III digestion. The labeled fragments are further digested by four-base cutters to generate even smaller fragments (less than 500 bp) for visualization on a sequencing gel. The reaction uses [F]ddATP labeled with one of three mobility-matched fluorescent dyes for the fragments of each four-base cutter, which allows the analysis to be multiplexed in a single gel lane. The fourth color in the same lane is used to identify the location of fragments of known molecular weight and then to calculate the size of each fragment based on that result. Fragment size data assigned by Genescan (ABI) are converted into FPC (Fingerprinted Contigs, Sanger Center, UK) format and then electronically transferred to FPC for automatic contig assembly. We have tested the system on one of the most well characterized BAC contigs of human chromosome 22: 96 BACs from the chromosome 22 regions where large-scale sequencing is underway. Without manual intervention, the system produced no false overlaps among the 96 BAC clones, and assembled 16 contigs and 31 singletons. The accuracy of overlaps constructed by the method is comparable to that established using BAC end sequence, and compared with completely sequenced BACs.

Fragment patterns determined by this method provide each BAC with a digital fingerprint, and may be stored for searching clones with minimal overlap, and for identification of clones.

Construction and Characterization of New BAC Libraries

Y. Sheng, Y-J. Chen, C. Neal, and H. Shizuya
Division of Biology, Caltech, Pasadena, CA 91125
[email protected]

Human BAC clones have been extensively used in a variety of research areas of the Human Genome Project because of their stability in E. coli, easy handling of BAC DNA, and relatively large insert size. Over the past four years, we have generated over 400,000 BAC clones from two individuals,

and arrayed them in 384-well microtiter plates to organize these clones. Recently we initiated a new round of construction of BAC libraries for the community involved in the high throughput sequencing efforts. In order to build high quality libraries, we have implemented many checks and extensive experiments to test for the degree of representation and the degree to which BACs accurately reflect the human genome. We extend our improved quality control procedures to the new library making endeavor. For these libraries, we made a new generation BAC vector, pIndigoBAC45, which gives much darker blue colored colonies on the X-gal plates. This feature enables us to identify clones with inserts more accurately, resulting in a low percentage of empty clones, and to shorten the time required for library construction. To estimate the number of empty wells and the number of clones that lack inserts, we stamp all the clones from each 384-well microtiter plate onto LB agar plates. This detects mistakes made during the colony picking procedure. To check for chimerism, co-habitation of wells by multiple clones, and representation of the human genome, we end-sequence BACs and carry out RH mapping based on PCR using primers designed from the sequences. In collaboration with Mark Adams at TIGR, we have thus far sequenced both ends of about 1,000 BAC clones in the newly constructed BAC library. Furthermore, we plan to compare new libraries with previously constructed BAC libraries by testing with the same probes used for these libraries. We will report the progress of this library construction and discuss the quality of these BAC clones.

Completing the BAC-Based Physical Map of Human Chromosome 22

Holger Schmitt, Mitzi Shpak, Yan Ding, Melvin Simon, and Hiroaki Shizuya
Division of Biology, Caltech, Pasadena, CA 91125
[email protected]

The major goals of the Human Genome Project are the identification and the localization of 50,000-100,000 genes expected in the human genome, the generation of physical maps for each individual chromosome, and finally, the determination of the nucleotide sequence for the entire 3,000 megabase pairs of DNA. There is a clear need to develop reliable clone resources which are accurately mapped and at the same time can provide templates for direct use in large scale sequencing. A new generation of cloning system, the Bacterial Artificial Chromosomes (BACs), has been developed in our lab and is now extensively used throughout the community in a variety of research areas.

To initially demonstrate the usefulness of the BAC system for long range physical mapping, a scaffold integrated contig map of the entire long arm of chromosome 22 was constructed. Individual clones were ordered into contigs by fingerprint analysis and placed along the genomic stretch according to their content of genetic anchorpoints. The map consists of more than 700 BAC clones and spans the length of approximately 45 megabase pairs. Each BAC clone is further characterized by fluorescence fingerprinting and end-sequencing the inserts. Sequence information of the clone ends is presently used to select new clones from the 15x coverage BAC library, which extend contigs and close the gaps.

Our ultimate goal is to generate a well characterized collection of BACs, which provides complete physical coverage of the entire long arm of chromosome 22 with minimal overlaps. This map will serve as an excellent resource to discover all of the transcripts mapped on the chromosome, and can readily be used for cost-efficient genomic DNA sequencing with minimal redundancy.

Collaborators in our current mapping project are Dr. N. Blin, Univ. of Tuebingen, and Dr. E. Meese, Univ. of Saarland.

A Subdermal Physiological Monitoring System For Mass Screening Mice In Gene Expression Studies

M. N. Ericson, D. K. Johnson, R. S. Burlage, T. L. Ferrell, D. E. McMillan, K. G. Falter, A. D. McMillan, S. F. Smith, G. E. Jellison, and C. L. Britton, Jr.

Oak Ridge National Laboratory, Oak Ridge, TN 37831-6006
[email protected]

Researchers at the Oak Ridge National Laboratory are developing a highly automated integrated-circuit based research tool for subdermal monitoring of physiological parameters in mice used for gene expression studies. Application of this new instrumentation capability to genome studies will accelerate mass specimen screening by providing automated detailed observation and reporting of multiple key physiological parameters of interest. Body temperature, heart rate, physical activity level, movement trajectory, and possibly blood pressure will be measured by an implanted integrated-circuit based instrument containing multiple integrated sensors. Measured data will be transmitted periodically via wireless techniques for subsequent data processing, visualization, fusion, and storage. The integrated sensor/telemetry package will be low-cost, reusable, and sufficiently miniaturized to be directly injectable. The system will provide detailed parameter measurement and analysis capabilities not presently available to genomics researchers. Advanced multi-parametric data presentation will permit improved detail and accuracy in high-volume phenotype screening and increased detectability of subtle genetic defects. This paper will present preliminary information on parameter measurement methods, sensor selection, instrument and system architectures, instrument miniaturization techniques, and data processing methods.

HLA Genotype Analysis of Chronic Beryllium Disease Sensitivity

P. Scott White, Michelle Petrovic, Owatha L. Tatum, Nancy Lehnert, Usha S-Nair, Zaolin Wang, Larry L. Deaven, and Babetta Marrone
Los Alamos National Laboratory, Life Sciences Division and Center for Human Genome Studies, Los Alamos, New Mexico 87545

Beryllium alloys are used in several industrial processes and products, including the synthesis of components of nuclear weapons. Chronic beryllium disease (CBD) affects a percentage of human individuals exposed to airborne particles of beryllium. CBD is an autoimmune disease affecting the lungs of susceptible individuals. Antigen-presenting cell surface proteins have been the focus of investigation into possible genetic susceptibility to this disease. It was discovered previously that certain HLA-DPB1 alleles correlated with the development of CBD. Because the suspected allele is also found in normal populations at frequencies of over 40%, it was necessary to collect sequence information from more individuals. In order to determine if the implicated genotype, with a Glu-69 mutation, was the sole contributing mutation, we sequenced over 30 individuals for exon II of this locus. More than 90% of disease individuals possessed the Glu-69 mutation, although most of these were in the heterozygous state. This larger data set supports previous results, but because of the high frequency of this mutation in normal individuals it will require more extensive investigation to determine if other contributing genetic factors are present. In addition to this locus, we are sequencing two other MHC Class II HLA loci, HLA-DQB1 and HLA-DRB1, for CBD correlating genotypes. Screening for susceptibility to CBD will require robust assays, and the development of a genetic screen is one goal of this research.

A New Program to Develop a High-Resolution, High-Throughput Tomographic Imaging System for Mutagenized Mouse Phenotype Screening

M.J. Paulus, H. Sari-Sarraf, D.K. Johnson, D.H. Lowndes, M. L. Simpson, C.L. Britton, Jr., F.F. Knapp, Jr., J.S. Hicks
Oak Ridge National Laboratory, Oak Ridge, TN 37831-6006
[email protected]

The Oak Ridge National Laboratory has recently begun the development of a new high-resolution, high-throughput 3-D mouse imaging and computer-aided screening methodology to rapidly identify, quantify and record subtle phenotypes in mutagenized mice. The research goals for this

program are to develop a novel x-ray imaging technology with <50 µm spatial resolution, a 3-D data acquisition time of <1 minute per mouse and an estimated cost of a few dollars per image. A key element of this new system will be a novel cadmium zinc telluride detector operating in pulse counting mode. This new detector technology provides spatial resolutions and x-ray energy discrimination capabilities unattainable with traditional x-ray computed tomography detectors. Due to its high atomic number, the new detector is also suitable for traditional nuclear medicine studies. Additionally, new image processing and pattern recognition algorithms will be developed for the system to assist researchers in identifying phenotypes. In this paper we present the objectives and approach for this research program and some preliminary data.

Quantitative Detection of Nucleic Acids by Invasive Cleavage of Oligonucleotide Probes

Jeff G. Hall, Andrea L. Mast, Victor Lyamichev, James R. Prudent, Michael W. Kaiser, Tsetska Takova, Bob Kwiatkowski, Bruce Neri, and Mary Ann D. Brow
Third Wave Technologies, Inc., 2800 S. Fish Hatchery Rd., Madison, WI 53711
[email protected]

We have developed an enzymatic assay for direct, sensitive and quantitative nucleic acid detection. This assay is based on cleavage of a unique secondary structure that can be formed between two DNA probe oligonucleotides and a target nucleic acid of interest. The assay depends on the coordinate action between the two synthetic oligonucleotides. By the extent of their complementarity to the target strand, each of these oligonucleotides defines a specific region of the target strand. These regions are oriented such that when the two oligonucleotides are hybridized to the target strand, the 3' end of the upstream oligonucleotide overlaps with the 5' end of the labeled downstream "signal" oligonucleotide. The resulting structure is recognized by a structure-specific nuclease, which cuts the signal oligonucleotide to release a labeled fragment. When the reaction contains an excess of the signal molecules and is performed at elevated temperature to promote rapid dissociation and association of these molecules, the cleaved signal oligonucleotide is readily replaced by an intact copy so that the process can be repeated. In this way many signal molecules are cleaved for each copy of the target nucleic acid. The amount of target present may then be calculated from the yield of cleavage product, the rate of product accumulation and the time of incubation.

We have found that the accumulation of cleaved product is linear over both time and target concentration, with the rate of accumulation dependent on the turnover rate of the enzyme. With a turnover rate of about 35 cleavage events per minute, our current system permits signal to be amplified by more than 1000-fold, compared to single round hybridization, in 30 minutes, thus allowing the quantitative detection of sub-attomole levels of target nucleic acid.

Quantitative detection of nucleic acids in this fashion has several advantages over other methods of oligonucleotide-based detection. Foremost, because the cleavage requires the precise coordination and hybridization of two oligonucleotides, this reaction has a high level of specificity for the intended target sequence. The specificity of the detection is further enhanced because the investigator can select the site of cleavage in the signal molecule by designing the appropriate amount of overlap with the upstream oligonucleotide. The production of a discrete cleavage product of expected size allows that product to be more easily distinguished from the products of oligonucleotide destruction that may arise from thermal degradation or from nuclease contaminants in diagnostic test samples. Further, the use of oligonucleotides that are mostly or completely composed of DNA, rather than RNA, eliminates background that could arise from target-independent degradation by ribonucleases. Finally, because each cleavage event is dependent on the presence of the actual target material, and not on the products of the cleavage reaction, contamination by material carried over from completed detection reactions cannot induce additional signal in subsequent reactions. We have applied this method to direct detection of DNA

targets such as DNA viral genomes, and to the detection of mRNA for monitoring of gene expression. Judicious placement of the oligonucleotide pair around splice junctions allows mRNA detection without prior destruction of genomic DNA. The products of the cleavage reaction can be analyzed by gel electrophoresis, or by non-gel methods such as capture to solid support. We will show how additional post-cleavage manipulation of the products can lower the limit of detection by 2 to 3 orders of magnitude when compared to gel-based readout.

Quantitative DNA Fiber Mapping (QDFM) Techniques for Physical Map Construction and Quality Control*

Heinz-Ulrich G. Weier, Stanislav Volik, Jenny Wu, Thomas Duell, Mei Wang, Ung-Jin Kim¹, Jan-Fang Cheng and Joe W. Gray
Resource for Molecular Cytogenetics and Human Genome Center, Life Sciences Division, University of California, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, and ¹California Institute of Technology, Pasadena, CA
[email protected]

The construction of high resolution physical maps and definition of a minimal tiling path are indispensable for directed approaches to large scale DNA sequencing. Similarly, closure of gaps and completion of shot-gun sequencing projects depends on knowledge of the physical location and size of gaps. We applied 'Quantitative DNA Fiber Mapping (QDFM)', an optical procedure for mapping based on hybridization of fluorescently labeled probes onto individual stretched DNA molecules, for construction of high resolution physical maps and definition of minimal tiling paths as well as for quality control of sequencing templates derived from human chromosome 20. Digital image analysis allowed localization of probes with near kilobase (kb) resolution in intervals of several hundred kb (1,2). When the technique was applied to construct physical maps for regions on the proximal long arms of human chromosomes 11 and 22 (3), respectively, we encountered numerous unstable yeast artificial chromosome (YAC) clones. Deletions in these YACs prohibited their use for map construction. We therefore modified our mapping scheme and prepared DNA fibers comprised of genomic DNA. This proved to be a rapid method for determination of clone overlap and genomic distances between genes or markers. Furthermore, the genomic fibers provide a 'gold standard' for validation of clones and their contigs as well as the delineation of deletions in large clones. Compared to mapping onto fibers that were prepared from cloned fragments and purified by pulsed field gel electrophoresis, genomic DNA fibers are expected to significantly shorten the mapping cycle time, because slides carrying these fibers can be prepared in large batches and stored. The use of genomic fiber slides for mapping will also facilitate the implementation of standardized protocols that are amenable to automation and increase the mapping throughput. Using clones derived from selected regions on human chromosomes 5 and 16, respectively, we are presently evaluating the utility of genomic fibers for clone/contig validation.

*Supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research, Department of Energy, under contract DE-AC03-76SF00098.

1. H.-U.G. Weier et al. Human Molecular Genetics 4, 1903-1910 (1995).
2. M. Wang et al. Bioimaging 4, 73-83 (1996).
3. T. Duell et al. Genomics (in press).

Analysis of DNA Sequence Copy Number Variation in Breast Tumors and HPV16-Immortalized Cell Lines Using Comparative Genomic Hybridization to DNA Microarrays

D. G. Albertson¹, R. Segraves², D. Sudar¹, S. Clark², C. Collins¹, C. Chen², W.-L. Kuo², D. Kowbel¹, S. H. Dairkee³, I. Poole⁴, M. Dürst⁵, J. W. Gray¹,², D. Pinkel¹,²
¹E. O. Lawrence Berkeley National Laboratory, ²University of California San Francisco, ³Geraldine Brush Cancer Research Institute, ⁴Vysis, Inc., ⁵Deutsches Krebsforschungszentrum
[email protected]

Gene dosage alterations underlie many diseases. For example, variations in DNA sequence copy number are associated with a significant proportion of the genetic aberrations involved in cancers, and also with certain developmental abnormalities. Comparative genomic hybridization (CGH) has proven to be an effective method for detecting and mapping these genetic alterations. In CGH, total genomic DNA from a test specimen and a normal genomic reference DNA are labeled with different fluorochromes and hybridized to normal metaphase chromosomes. The ratio of the fluorescence intensities at a location on the chromosomes is approximately proportional to the ratio of sequences in the test and reference genomes that bind there. The use of metaphase chromosomes as the hybridization target has previously limited the resolution of CGH to 10-20 Mb. However, we have now implemented a new form of high resolution CGH by replacing the normal metaphase spread with an array of genomic cosmid, P1, and BAC clones (Sudar et al., these Abstracts). This approach provides a resolution at least a factor of 100 better than standard CGH, as it is determined only by the size and spacing of the target genomic clones. Thus, measurements of copy number can be made at low resolution on clones spaced at several Mb, or at high resolution using closely spaced or overlapping clones. We have performed both low and high resolution analyses of the DNA sequence copy number variation occurring on chromosome 20 in breast cancer and in pre-cancerous models. A low resolution, 'scanning' array of the entire chromosome was constructed from genomic clones spaced at ~1-3 Mb intervals on human chromosome 20. The analysis of breast tumors and breast tumor cell lines revealed at least five independent regions of copy number increase and a region of decrease on 20q that were present in various combinations. Thus, multiple, interacting genes involved in breast and perhaps other cancers may be located on 20q. Two of these regions were also recurrently present at elevated copy number in HPV-immortalized keratinocytes, suggesting that the processes of tumorigenesis in vivo and immortalization of cells in culture may proceed through common pathways of amplification and overexpression of certain genes mapping to these two loci. High resolution analysis was also performed on breast tumors with elevated copy number occurring within a ~1 Mb region at 20q13.2. The array was composed of contiguous and overlapping clones from a contig that has been constructed spanning this region (Collins et al., these Abstracts). For some tumors, a constant level of elevated copy number was observed across the region. However, in others, an abrupt variation in copy number was recorded, which mapped the boundaries of different levels of amplification to within a fraction of a BAC or P1 clone. These copy number alterations measured with array CGH are concordant with data obtained by using the array target clones as probes for interphase fluorescent in situ hybridization. However, array CGH appeared to be both more quantitative and substantially faster, since only one hybridization was required to obtain the data at all loci. The increasing availability of genomic resources and the technology to print arrays with more than 10⁴ elements/cm² (Sudar et al., these Abstracts) make it reasonable to consider performing genome-wide analyses of copy number variation using arrays that would provide 1 Mb or better resolution for the entire human genome.

Molecular Anatomy of a 20q13.2 Breast Cancer Amplicon

C. Collins¹, J. Rommens², D. Kowbel¹, G. Nonet¹, L. Stubbs³, M. Shannon³, M. Wernick¹, J. Froula¹, G. Hutchinson, T. Godfrey, D. Polikoff, T. Cloutier¹, K. Myambo¹, C. Martin¹, M. Palazzolo¹, Dan Pinkel¹,⁵, D. Albertson¹,⁵, and J.W. Gray¹,⁵
¹Lawrence Berkeley National Laboratory, Berkeley, CA
²Hospital for Sick Children, Toronto, Canada
³Lawrence Livermore National Laboratory, Livermore, CA
⁴RabbitHutch Biotechnology, B.C., Canada
⁵University of California Cancer Center, San Francisco
[email protected]

High level amplification of chromosome 20 band 13.2 occurs in 10% of primary breast tumors and correlates with poor prognosis in node negative patients. This amplification is also detected in numerous other solid tumors including bladder, brain, colon, head and neck and melanoma. We

hypothesize that selection for overexpression of one or more genes encoded within this amplicon drives the 20q13.2 amplification event. To fine map the amplicon and identify the hypothesized oncogene(s), a 1.0 Mb interval spanning the amplicon was cloned in contiguous BAC, PAC, and P1 clones.

To fully explore the genomics of the 20q13.2 amplicon we have sequenced ~80% of the 1 Mb contig and analyzed it extensively for genes using a suite of bioinformatics tools. This combined with exon trapping and cDNA direct selection has led to the discovery of four genes, ZABC1, ZABC2, NABC1, PIC1-related, and a new cyclophilin pseudogene. ZABC2 may be the ortholog of the Drosophila melanogaster homeotic gene teashirt. PIC1-related is ~93% identical to PIC1 or sentrin, a gene that encodes a ubiquitin-homology domain protein. NABC1 encodes a novel cytoplasmic protein of unknown function. ZABC1 is especially interesting. The cDNA sequence of ZABC1 is predicted to encode a putative transcription factor containing eight C2H2 zinc finger domains. To manage this data and make it available to collaborators we have developed a tailored ACEDB database accessible via the World-Wide Web (Davy et al., these abstracts).

A ~260 kb "minimum common amplicon" was defined by performing interphase FISH using P1 and BAC probes from the sequence-ready contig on ~300 primary tumors. ZABC1 is centrally located in the 260 kb critical region. Quantitative PCR and Northern analysis were employed to analyze the expression of this gene in breast cancer cell lines and primary tumors. Expression of ZABC1 is elevated in all cell lines and tumors in which it is amplified. High level expression of ZABC1 was found in the breast cancer cell line 600MPE, a notable finding because 600MPE is not amplified at the ZABC1 locus. This suggests alternative mechanisms may elevate ZABC1 transcripts in some tumors. ZABC1 is the only transcribed sequence identified that maps in the 260 kb critical region with a pattern of expression so strongly correlated with copy number. The murine ortholog of ZABC1 has been cloned to study its expression in murine mammary tumors and its normal spatial and temporal pattern of expression. We have now constructed a BAC contig across the mouse ZABC1 locus. It is our goal to sequence this contig and identify conserved noncoding regulatory elements through comparative sequence analysis. BACs from the contig will also be used to determine if ZABC1 is amplified in murine mammary tumors. It is anticipated that the combination of comparative and functional genomics will elucidate key regulatory elements of the ZABC1 gene and lead to insights regarding the mechanism of amplification at 20q13.2. In addition, comparative genomic hybridization to DNA microarrays is being used to map this amplicon at unprecedented resolution, revealing new regions of consistent copy number abnormalities and suggesting additional co-selected and counter-selected genes involved in the evolution of breast tumors (Albertson et al., these abstracts).

Supported by grants from US DOE contract DEAC0376SF00098, USPHS grants CA44768, CA45919, CA52807 and Vysis.

3D Genetic Analysis of Thick Tissue Specimens

SJ Lockett+, C Ortiz de Solorzano+, A Jones+, E Rodriguez+, K Chin°, C Fernandez°, D Sudar+, D Pinkel+°, JW Gray+°
+Resource for Molecular Cytogenetics, Life Sciences Division, Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA 94720, °Cancer Center, University of California, San Francisco, CA 94143
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

The Resource for Molecular Cytogenetics is developing computer assisted microscopy and image analysis techniques to allow combined genotypic and phenotypic analysis of intact cells in tissue. The long range goal is to obtain quantitative information about the copy number of DNA sequences, levels of expression of RNA and proteins and their distributions in cells; to
morphologically characterize cells, nuclei and other organelles, and to analyze the cellular organization of tissue. This information will contribute to our understanding of functional processes in normal and diseased tissue involving candidate genes. Some of these genes will have been discovered using genome-wide surveillance techniques, such as array-based comparative genomic hybridization (CGH).

The technical procedure we have adopted for this project, which is at an early stage, is as follows: Adjacent thin (4 micron) and thick (30 micron) sections are cut from cancer specimens. The thin sections are used for standard histological staging, while the thick sections are used for quantifying the specific molecular species of interest in the individual cells. Thick sections are employed, because they contain intact cells, thus permitting accurate quantification at the individual cell level, and the cellular organization of the tissue is preserved. However, such sections require careful optimization of fluorescent in situ hybridization (FISH) and immunocytochemical procedures for labeling the particular molecular species, followed by 3D (confocal) microscopic image acquisition and 3D image analysis. We have developed several image analysis programs for this project. The first interactively enumerates punctate FISH signals in the individual intact nuclei of thick sections, and the second registers adjacent thick and thin sections. It is thus now possible to use this procedure to study the relationship between the copy number of specific genomic sequences in individual cells and histological stage of the tissue. In a preliminary application to a breast biopsy specimen, we demonstrated that histologically benign regions contained cells with two copies of a chromosome 1 alpha-satellite probe, whereas invasive cancer regions had variable copy numbers per cell of the probe. We are continuing these experiments in order to determine the degree of genetic heterogeneity in tumor cells and surrounding histologically normal tissue.

The image analysis programs mentioned above limit the procedure to the enumeration of punctate FISH signals in each cell. In order to expand the range of applications, we have developed algorithms for segmenting individual cell nuclei within thick sections. This enables quantification of diffuse molecular markers inside nuclei, the size and shape of nuclei, the spatial positions of FISH signals inside the nuclei, and the spatial relationships of cells to each other in the tissue. The input to the algorithms is a 3D image of nuclei labeled with a DNA counterstain. The first of the algorithms automatically thresholds the image into regions of background and nuclei. The next algorithm is a custom-designed, interactive 3D rendering program, which the analyst uses to inspect each nuclear region and indicate if each region is a single nucleus or a cluster of nuclei. Clusters are then divided by an automatic algorithm, which employs a variant of the Hough transform to shrink nuclei and consequently separate them. The resulting divided regions are presented to the analyst, who indicates if they are nuclei, are still clusters, or have been incorrectly divided and should be rejoined. The alternating steps of automatic cluster division and human inspection are repeated as many times as necessary to correctly segment all nuclei in the image.

This work was supported by the US DOE contract DEAC0376SF00098, NIH grant CA67412 and a contract with Carl Zeiss Inc.
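
The first stage of the segmentation procedure described in this abstract (thresholding a 3D counterstain image into background and nuclear regions) can be illustrated with a minimal sketch. The code below is not the authors' software. The function names, the fixed threshold, the 6-connected labelling, and the volume-based flag that stands in for the analyst's judgment of "single nucleus or cluster" are all assumptions made for illustration, and the Hough-transform-based cluster division is not attempted.

```python
from collections import deque

# The six face-adjacent neighbours of a voxel (6-connectivity, an assumption).
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def threshold(image, level):
    """Return a 3D boolean mask that is True where intensity exceeds `level`."""
    return [[[value > level for value in row] for row in plane] for plane in image]


def label_regions(mask):
    """Label connected foreground regions with a breadth-first flood fill.

    Returns (labels, sizes): a 3D grid of region labels (0 = background)
    and a dict mapping each label to its voxel count.
    """
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * width for _ in range(height)] for _ in range(depth)]
    sizes = {}
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if not mask[z][y][x] or labels[z][y][x]:
                    continue
                label = len(sizes) + 1
                labels[z][y][x] = label
                queue = deque([(z, y, x)])
                count = 0
                while queue:
                    cz, cy, cx = queue.popleft()
                    count += 1
                    for dz, dy, dx in NEIGHBOURS:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                        if inside and mask[nz][ny][nx] and not labels[nz][ny][nx]:
                            labels[nz][ny][nx] = label
                            queue.append((nz, ny, nx))
                sizes[label] = count
    return labels, sizes


def flag_possible_clusters(sizes, typical_volume, factor=1.5):
    """Hypothetical stand-in for the analyst: flag regions much larger than one nucleus."""
    return sorted(label for label, n in sizes.items() if n > factor * typical_volume)
```

In the procedure described above the cluster decision is made interactively by the analyst; a volume heuristic like `flag_possible_clusters` could at most pre-sort regions for inspection.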

Informatics

Informatics at the Center for Applied Genomics

Donn Davy, Tom Cloutier, Colin Collins, Manfred Zorn, Joe Gray
University of California at San Francisco/Lawrence Berkeley National Laboratory
[email protected]

The Resource for Molecular Cytogenetics/Center for Applied Genomics, in collaboration with the University of California at San Francisco-Cancer Center, has molecularly characterized and sequenced an interval of chromosome 20 band 13.2 found amplified in 10% of primary breast tumors and correlated with poor prognosis in node-negative patients. The goal is to determine the complete genomic organization of the one-megabase interval spanning this amplicon. In a follow-on project, the group will select DNA clones from all known oncogenes for use in a DNA chip, to be used as a diagnostic tool. Collaborators conduct their work in San Francisco, Berkeley, Vancouver, Toronto and Finland: sites so geographically distant as to make travel between them inconvenient and inefficient. Data concerning physical map assembly, genomic sequencing, Northern hybridization, tumor mapping, public database searches and extensive sequence annotation efforts including graphical maps and detailed data on clones, loci, exons, sequences and cDNA must be made equally available to all collaborators independent of location. The computing technology brought to bear on this problem combines a standard World-Wide Web server engine with a tailored ACEDB database and a number of supporting CGI scripts connecting Unix file system documents with retrieved data from the database and with related data both local and at remote sites on the WWW. The ACEDB-style database (CAGdb) is accessible three ways: as a normal X-window application, as an aceclient/aceserver application, and via a set of interface scripts, from the CAG Web page. Additional WWW services include the display of Genotater (LBNL-developed sequence annotation visualization tool) images, XGrail (ORNL gene-prediction tool) images, a set of HTML pages of annotated sequence, a second set of HTML pages providing first-pass results of dbEST and non-redundant nucleotide public database searches, the Northern hybridization images, two maps graphically representing the region of interest and connections to related databases. For tracking progress on the oncogene project, a second web server and database were deployed using off-the-shelf Java-language Rapid Application Development (RAD) tools. It is used to report progress in selecting targets, developing primers, and extracting clones, and in keeping track of collaborators' suggestions for further oncogene chip targets.

The Genome Channel and Genome Annotation Consortium

Ed Uberbacher, Richard Mural, Manesh Shah, Ying Xu, Sheryl Martin, Sergey Petrov, Jay Snoddy, Morey Parang - Oak Ridge National Laboratory
Manfred Zorn, Sylvia Spengler, Donn Davy - Lawrence Berkeley National Laboratory
Terry Gaasterland - Argonne National Laboratory
Peter Schad - The National Center for Genome Resources
Stan Letovsky, Bob Cottingham - The Genome Database
David Haussler - University of California Santa Cruz
Pavel Pevzner - University of Southern California
Chris Overton - University of Pennsylvania
[email protected]

Human and model organism sequencing projects will soon be producing data at a rate which will require new methods and infrastructure for users to be able to effectively view and understand the data. A multi-institutional project was recently funded to provide large-scale analytical processing capabilities and we will present the results of several pilot efforts related to this project. The goals of the project are as follows:

• Provide an environment where annotation can be constructed based on multiple interoperable analysis tools and significant available computing power.
• Provide an environment where characterization of long genomic sequence
regions can be facilitated and analysis can be maintained and updated over time.
• Provide an interactive graphical environment where predictions, features and evidence from many tools can be combined by users into high-quality annotation and visualized by the community.
• Provide high-throughput automated analysis methods which can be configured by genome centers for their use in constructing annotation and facilitating data submission.
• Provide high-quality annotation to large genomic sequence regions which would otherwise go unannotated.
• Provide the community with the best sequence level view of genomes possible.

The components of this system are a number of services, a broker that oversees the distribution and management of tasks and a data warehouse, with services implemented using distributed object technology. Multiple gene prediction is accomplished using several gene finding tools including the GRAIL-EST system, and gene annotation from databases such as Genbank is also captured. The data warehouse supporting the Genome Channel view is updated daily by automated Internet agents and event triggers which facilitate analysis procedures. Real time operation of the Genome Channel browser will be demonstrated. A more detailed description of the basic components follows:

The Genome Channel

The Genome Channel provides a graphical user interface to comprehensively browse and query assembled sequence placed in the public domain by the Human Genome Project and sequencing of model organisms. It is a JAVA interface tool which relies on a number of underlying data resources, analysis tools and data retrieval agents to provide an up-to-date view of genomic sequences, as well as computational and experimental annotation. Navigation from a whole chromosome view to contigs provided by sequencing centers allows one to zoom in on regions of interest to see information about clones, markers, ESTs, computationally and experimentally determined genes, the sequence and sequence source information, related homology and functional information, and hyperlinks to numerous underlying primary data resources.

Analysis Methods

Current analysis methods which combine pattern recognition with EST information and protein similarities are capable of accurate and automated analysis of large genomic regions containing complex multiple gene structures. Analysis methods will include the GRAIL EST/Protein homolog system, Procrustes (Pavel Pevzner et al.), and Genie (David Haussler et al.) as well as other tools. The results of multiple tools can be viewed in a common environment, combined mathematically in user specified ways, and used as the basis for the automated or interactive construction of annotation.

Data Mining Agents

Maintenance of an up to date description of genomic regions will be based on the use of data mining agents formulated for particular information resources. These agents will make use of several different technologies, such as OPM (Victor Markowitz), Kleisli (Chris Overton), and database indices to facilitate meaningful information retrieval from remote sources. We expect new information related to each gene or genome region to continue to be discovered and actively linked in a long-term ongoing process. The current state of these links will be maintained in the data warehouse.

The Data Warehouse

The data warehouse provides and maintains a snapshot of the current view or status of the genomes or genomic regions analyzed in the project. The warehouse is designed to facilitate rapid access by users, visualization tools, and analysis systems. The views contained in the warehouse will be constructed and maintained by processes within this project (such as sequence analysis and information retrieval agents), with additional help from central databases like GDB. It will contain and make available a synthesized best view of genomes from multiple underlying sources.
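
The Analysis Methods paragraph above says that the results of multiple tools can be "combined mathematically in user specified ways". The abstract does not say how, so the following is only one possible reading: a weighted per-base vote over the coding intervals reported by each tool. The tool names, the weights, the half-open interval convention, and the `cutoff` parameter are all hypothetical and are not taken from the Genome Channel software.

```python
def combine_predictions(length, predictions, weights, cutoff=0.5):
    """Merge per-tool coding intervals into consensus intervals by weighted vote.

    predictions: {tool_name: [(start, end), ...]} with half-open intervals on
                 [0, length); intervals from one tool are assumed not to overlap.
    weights:     {tool_name: non-negative weight}, chosen by the user.
    cutoff:      minimum fraction of total weight needed to call a base coding.
    """
    total = sum(weights[tool] for tool in predictions)
    if total == 0:
        return []

    # Accumulate the weight of every tool that calls each base coding.
    score = [0.0] * length
    for tool, intervals in predictions.items():
        for start, end in intervals:
            for i in range(max(0, start), min(length, end)):
                score[i] += weights[tool]

    # Collapse runs of bases that pass the cutoff back into intervals.
    consensus, begin = [], None
    for i, s in enumerate(score):
        passes = s / total >= cutoff
        if passes and begin is None:
            begin = i
        elif not passes and begin is not None:
            consensus.append((begin, i))
            begin = None
    if begin is not None:
        consensus.append((begin, length))
    return consensus
```

Raising `cutoff` demands agreement between more tools, while lowering it approaches the union of all predictions. A real system would more plausibly weight exon scores or evidence types than raw intervals.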

Internet Object Request Broker

Services for data input, analysis, visualization and submission will be facilitated with a distributed underlying Internet architecture using CORBA with an object request broker to manage processes. Compute platforms, analysis servers, databases, etc. will be at a variety of locations and in some cases duplicated depending on need. Specialized computing hardware will be used to facilitate some tasks.

This project is jointly sponsored by the Computational Grand Challenge program of the Office of Computational and Technology Research and the Human Genome Program of the Office of Biological and Environmental Research of the Department of Energy.

GRAIL and GenQuest Sequence Annotation Tools

Ying Xu, Manesh B. Shah, J. Ralph Einstein, Morey Parang, Jay Snoddy, Sergey Petrov, Victor Olman, Ge Zhang, Richard J. Mural and Edward C. Uberbacher
Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-1060
[email protected]

Our goal is to develop and implement an integrated intelligent system which can recognize biologically significant features in DNA sequence and provide insight into the organization and function of regions of genomic DNA. GRAIL is a modular expert system which facilitates the recognition of gene features and provides an environment for the construction of sequence annotation. The last several years have seen a rapid evolution of the technology for analyzing genomic DNA sequences. The current GRAIL systems (including the e-mail, XGRAIL, JAVA-GRAIL and genQuest systems) are perhaps the most widely used, comprehensive, and user friendly systems available for computational characterization of genomic DNA sequence. In the past 2 years of the project we have:

• Developed improved systems for recognition of exons, splice junctions, promoter elements and other features of biological importance, including greater sensitivity for exon prediction (especially in AT rich regions) and robust indel error detection capability.
• Developed improved and more efficient algorithms for constructing models of the spliced mRNA products of human genes.
• Developed and implemented methods for the analysis and visualization of sequence features including poly-A addition sites, potential Pol II promoters, CpG islands and repetitive DNA elements.
• Designed and implemented new methods for detecting potential sequence errors which can be used to "correct" frameshifts, add quality assurance to sequencing operations, and better detect coding regions in low pass sequences such as ESTs.
• Developed systems for a number of model organisms including mouse, Escherichia coli, Drosophila melanogaster, Arabidopsis thaliana, Saccharomyces cerevisiae and a number of microbial genomes.
• Implemented methods for the incorporation of protein, EST, and mRNA sequence evidence in the multiple gene modeling process.
• Constructed a powerful and intuitive graphical user interface and client-server architecture which supports Unix workstations and JAVA Web-based access from many platforms.
• Improved algorithms and infrastructure in the genQuest server, allowing characterization of newly obtained sequences by homology-based methods using a number of protein, DNA, and motif databases and comparison methods such as FASTA, BLAST, parallel Smith-Waterman, and algorithms which consider potential frameshifts during sequence comparison.
• An improved "batch" GRAIL client allows users to analyze groups of short (300-400 bp) sequences for coding character (with frameshift compensation
options) and automates database searches of translations of putative coding regions.
• Provided support for GRAIL use in more than a thousand laboratories and at a rate of over 4000 analysis requests per month.

The imminent wealth of genomic sequence data will present significant new challenges for sequence analysis systems. Our vision for the future entails incorporation of a more sophisticated view of biology into the GRAIL system. Computational systems for genome analysis have thus far focused on generic or textbook-like examples of single isolated genes which can be described fairly simply using the most usual assumptions, and fall far short of the intelligence necessary to interpret complex multiple gene domains. In its next phase the GRAIL project will involve the development of new pattern recognition methods and modeling algorithms for DNA sequence, expert systems for interpretation using experimental evidence and comparative genomics, and interoperation with other tools and databases. More specifically we will focus on several development areas:

(1) improved accuracy of feature recognition and greatly increased tolerance to sequencing errors,
(2) development of technology to describe the structure and regulation of large, complex genomic regions containing multiple genes,
(3) automated and interactive methods for the incorporation of experimental evidence such as ESTs, mRNAs, and protein sequence homologs in multi-gene domains (GRAIL-EXP),
(4) more comprehensive feature recognition and increased biological sophistication in the areas of expression and regulation,
(5) capabilities for direct comparison of genomes,
(6) a comprehensive suite of microbial genome analysis systems,
(7) infrastructure for use of high-performance computing systems and specialized hardware to facilitate analysis and annotation of large volumes of sequence data,
(8) improved interoperation with other tools, databases and methods for integrating information from multiple sources, particularly within the Genome Annotation Consortium framework and Genome Channel, and
(9) continued community and user support, technology transfer, and educational outreach.

These developments will enable GRAIL to become more comprehensive and biologically sophisticated, and yet remain a user-friendly analysis environment which can be used interactively or in fully automated modes.

GRAIL and genQuest related tools are available as a Motif Graphical client (anonymous ftp from grail.lsd.ornl.gov (134.167.140.9)), through WWW interfaces (URL http://www.compbio.lsd.ornl.gov/), or by email server ([email protected]) and genQuest at ([email protected]). Communications with the GRAIL staff should be addressed to [email protected]. (Supported by the Office of Health and Environmental Research, United States Department of Energy, under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc.)

The Stationary Statistical Properties of Human Coding Sequences

David C. Torney, Clive C. Whittaker, and Gouchun Xie
Los Alamos National Laboratory, Theoretical Division, Los Alamos, New Mexico 87545, and Merck and Company, Inc., West Point, Pennsylvania
[email protected]

We analyzed approximately 0.5 Mb of in-frame, nonredundant coding sequences from the NFRES database. These were reversibly encoded as binary sequences, using 1 1 for A, 1 -1 for C, -1 1 for G and -1 -1 for T. This binary encoding is appropriate for characterizing all stationary statistical features of the data, either in terms of moments or of cumulants.
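
The binary encoding described above, and the covariance of two digits (which the abstract goes on to identify as the simplest cumulant), can be sketched as follows. This is an illustrative sketch, not the authors' code. The function names are hypothetical, and measuring the separation in whole codons between a digit position `i` and a digit position `j` (each 0 to 5 within a codon) is an assumption about how the 36 pair types might be indexed.

```python
# Two +/-1 digits per base, as given in the abstract.
ENCODE = {"A": (1, 1), "C": (1, -1), "G": (-1, 1), "T": (-1, -1)}
DECODE = {digits: base for base, digits in ENCODE.items()}


def encode(seq):
    """Encode a DNA string as a flat list of +1/-1 digits (two per base)."""
    digits = []
    for base in seq.upper():
        digits.extend(ENCODE[base])
    return digits


def decode(digits):
    """Invert encode(), showing that the encoding is reversible."""
    return "".join(DECODE[(digits[k], digits[k + 1])] for k in range(0, len(digits), 2))


def digit_covariance(seqs, i, j, codon_gap):
    """Sample covariance between digit i of a codon and digit j of the codon
    `codon_gap` codons downstream, pooled over in-frame sequences.

    Each codon contributes six digits, so i and j range over 0..5.
    """
    xs, ys = [], []
    for seq in seqs:
        d = encode(seq)
        n_codons = len(d) // 6
        for c in range(n_codons - codon_gap):
            xs.append(d[6 * c + i])
            ys.append(d[6 * (c + codon_gap) + j])
    n = len(xs)
    if n == 0:
        raise ValueError("no codon pairs at this separation")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance = E[xy] - E[x]E[y]; it vanishes when the digits are independent.
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
```

Because every digit is +1 or -1, the product of any set of digits is also +1 or -1, so each moment is simply the average of such products over the data.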

Moments are expectations of products of digits. Cumulants are constructed from the moments, aiming to remove features that could have been predicted from subsets of the digits. The simplest cumulants are covariances of two digits, vanishing if these are independent. In fact, the cumulants offer a more concise description of the statistical properties of coding sequences than do the moments.

Turning to the covariances of digits from coding sequences, as each codon is represented by six binary digits, there are 36 types of pairs of digits for separate consideration. Some of these pairs exhibit a nonzero asymptote as the separation between the two digits increases. Most pairs exhibit interesting transients, out to about 50 bases. We will show plots of these 36 types of covariances.

We will also report results for cumulants of larger numbers of digits. For cumulants of three digits, the asymptotes are all effectively zero, as the spacings increase indefinitely, except for those cases in which two digits are derived from one base position. Similarly, the cumulants of four digits that have the largest asymptotic values are the ones in which these four are derived from two base positions. We will show plots for the cumulants corresponding to the nine pairs of codon bases. The cumulants of six digits, corresponding to three bases, appear to have a zero asymptotic value, as the two spacings between the bases increase.

A complete list of the nonzero cumulants (a tabulation of the stationary statistical properties of human coding sequences) is in hand. This tabulation could greatly facilitate the classification of anonymous DNA sequences and provide a natural starting point for the non-stationary analysis of DNA sequences.

WIT/WIT2: A System for Supporting Metabolic Reconstruction and Comparative Analysis of Sequenced Genomes

Ross Overbeek,* Natalia Maltsev,* Gordon Pusch,* and Evgeni Selkov* **
* Mathematics and Computer Science Division, Argonne National Laboratory, and ** Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142292 Pushchino, Moscow region, Russia
[email protected]

The WIT/WIT2 system has been developed to support metabolic reconstruction from sequenced genomes. Specifically, it supports
1. derivation of initial assignments of function to ORFs for an organism,
2. use of these assignments to construct an initial estimate of the metabolic pathways present in the organism,
3. use of consistency analysis to refine the functional assignments, and
4. provision of a framework for presenting and refining the emerging metabolic models for a set of organisms.

WIT2 is a UNIX-based system that is made available to anyone wishing to support Web-based access to genomic sequence data. It includes in the standard distribution a set of integrated genomes from the public archives. For each of the distributed organisms, the user has access to the ORFs, RNAs, contigs, function assignments, and asserted pathways that characterize the current state of the analysis of the genome. The user has the option of adding new public or proprietary genomes, and then analyzing the new genomes based on clustering of ORFs with the distributed genomes. The system supports both shared and non-shared annotation of features and the maintenance of multiple models of the metabolism for each organism. WIT2 comes in two parts: a Web-based system offering access to the data and a set of batch tools that offer extensible query access against the data. The Web-based tools integrate WIT2 with other sources of data on the Web, while the batch tools allow one to do pattern
matching, extract regions of sequence for analysis using other tools, look for operons, and so forth.

Currently, the released system offers access to data for the following organisms:

Archaeoglobus fulgidus, Caenorhabditis elegans, Deinococcus radiodurans, Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Methanobacterium thermoautotrophicum, Methanococcus jannaschii, Mycobacterium tuberculosis, Mycoplasma genitalium, Mycoplasma pneumoniae, Rhodobacter capsulatus SB1003, Saccharomyces cerevisiae, Synechocystis sp., and Treponema pallidum.

The release at Argonne National Laboratory is accessible via [1].

Acknowledgment. This work was supported by U.S. Department of Energy under Contract W-31-109-Eng-38.

[1] http://www.mcs.anl.gov/home/overbeek/WIT2/CGI/user.cgi

Internet Release of the Metabolic Pathways Database, MPW

Evgeni Selkov, Jr.,* Yuri Grechkin,* and Evgeni Selkov* **
* Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142292 Pushchino, Moscow region, Russia, and ** Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, 9700 S. Cass Ave., MCS-221, IL 60439-4844, USA
[email protected]

The Metabolic Pathway Database, MPW [1, 2], a subset of EMP [3, 4], plays a fundamental role in the technology of metabolic reconstructions from sequenced genomes under the PUMA, WIT, and WIT2 systems [5-7]. It is the largest and the most comprehensive metabolic database, including some 2,800 pathway diagrams covering primary and secondary metabolism, membrane transport, signal transduction pathways, intracellular traffic, translation, and transcription.

The MPW diagrams were originally encoded and distributed as a formatted ASCII text [2]. In the current public release of MPW [8], the encoding is based on the logical structure of the pathways and is represented by the objects commonly used in electronic circuit design. This facilitates drawing and editing the diagrams and makes possible automation of the basic simulation operations such as deriving stoichiometric matrices, rate laws, and, ultimately, dynamic models of metabolic pathways. Individual pathway diagrams, automatically derived from the original ASCII records [1, 9], are stored as SGML instances supplemented by relational indices. An auxiliary database of compound names, encoded as SMILES strings [10], is maintained to unambiguously connect the pathways to the chemical structures of their intermediates. In accordance with the IUPAC nomenclature for chemical compounds, the current release supports super- and subscripts, Greek letters, Italic fonts, etc.

Acknowledgment. This work was supported by U.S. Department of Energy under Contract W-31-109-Eng-38, and award OR00033-97CISOOI.

References
1. Selkov E., Galimova M., Goryanin I., Gretchkin Y., Ivanova N., Komarov Y., Maltsev N., Mikhailova N., Nenashev V., Overbeek R., Panyushkina E., Pronevitch L., Selkov E., Jr. Nucleic Acids Res., 1997, 25 (1), 37-38
2. http://www.biobase.com/emphome.html/homepage.html/pags/pathways.html
3. Selkov E., Basmanova S., Gaasterland T., Goryanin I., Gretchkin Y., Maltsev N., Nenashev V., Overbeek R., Panyushkina E., Pronevitch L., Selkov E., Jr., Yunus I. Nucleic Acids Res., 1996, 24 (1), 26-29
4. http://www.biobase.com/EMP
5. http://www.mcs.anl.gov/home/compbio/PUMA/Production/ReconstructedMetabolism/reconstruction.html
6. http://www.mcs.anl.gov/home/compbio/WIT/wit.html
7. http://www.mcs.anl.gov/home/overbeek/WIT2/CGI/user.cgi
8. http://beauty.isdn.mcs.anl.gov/MPW
9. http://www.cme.msu.edu/MPW/
10. http://www.daylight.com/dayhtml/smiles/index.html
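
The MPW abstract above lists the derivation of stoichiometric matrices among the operations that a structured pathway encoding makes automatable. As a minimal sketch of that one operation, assuming each reaction is already available as a pair of substrate and product dictionaries (a hypothetical in-memory form, not the MPW record format), the matrix can be built as shown below. The two-step example uses the first reactions of glycolysis.

```python
def stoichiometric_matrix(reactions):
    """Build a stoichiometric matrix from a list of reactions.

    reactions: list of (substrates, products) pairs, each a dict mapping a
               compound name to its stoichiometric coefficient.
    Returns (compounds, matrix) where matrix[r][c] is the net coefficient of
    compound r in reaction c: negative if consumed, positive if produced.
    """
    compounds = sorted({name for subs, prods in reactions
                        for name in list(subs) + list(prods)})
    row = {name: k for k, name in enumerate(compounds)}
    matrix = [[0] * len(reactions) for _ in compounds]
    for col, (subs, prods) in enumerate(reactions):
        for name, n in subs.items():
            matrix[row[name]][col] -= n
        for name, n in prods.items():
            matrix[row[name]][col] += n
    return compounds, matrix


# Illustrative input: hexokinase and glucose-6-phosphate isomerase steps.
EXAMPLE_REACTIONS = [
    ({"glucose": 1, "ATP": 1}, {"G6P": 1, "ADP": 1}),
    ({"G6P": 1}, {"F6P": 1}),
]
```

Each column of the result describes one reaction, so an intermediate such as G6P, which is produced by the first step and consumed by the second, has entries of opposite sign in its row.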

Metabolic Reconstruction from Sequenced Genomes

Evgeni Selkov,*+ Natalia Maltsev,* and Ross Overbeek*
* Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, 9700 S. Cass Ave., MCS-221, IL 60439-4844, USA,
+ Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142292 Pushchino, Moscow region, Russia
[email protected]

With the availability of increasing numbers of complete genomes, the possibility of developing accurate models of metabolism for these organisms becomes of central interest. We have initiated a project to "reconstruct the metabolism" of organisms from the sequence data supplemented by available biochemical and phenotypic data, and we have developed initial reconstructions for a number of organisms. These reconstructions for over twenty organisms (some based on incomplete sequence data) have been made available via the WIT and WIT2 systems [1-3].

The reconstructions are based on the collection of metabolic pathways, MPW. This collection now includes over 2,800 diagrams and is continually being enhanced. Each diagram represents a grouping of functional roles, and these groupings provide a resource in analyzing the functional assignments made to ORFs in newly-sequenced genomes; when several functions in a pathway have been clearly identified, more focused analysis on the "missing functions" provides a powerful means of improving function assignments that were originally made without access to an overall understanding of the metabolism of the organism.

The actual process of metabolic reconstruction involves a number of steps which we describe in detail. We have made the WIT/WIT2 system available to support such efforts, but the actual process is independent of this software and will undoubtedly be adopted by other efforts based on the same overall goal of using sequence data as a foundation to develop an accurate model of the metabolism of an organism.

We are entering a new phase of the project in which substantial benefit can be achieved via our growing understanding of which parts of metabolism are universal, of gene families (allowing more rapid development of initial function assignments made to ORFs), and of insights achieved from one genome that have obvious application in others. What is emerging is not just a large number of isolated metabolic portraits for a set of diverse microbial organisms, but rather an integrated understanding of the evolution of metabolism and the technology for developing higher-level functional models.

Acknowledgment. This work was supported by U.S. Department of Energy under Contract W-31-109-Eng-38.

References
1. http://www.mcs.anl.gov/home/compbio/WIT/wit.html
2. http://www.mcs.anl.gov/home/overbeek/WIT2/CGI/user.cgi
3. http://beauty.isdn.mcs.anl.gov/WIT2.pub/CGI/org.cgi

From Genomic Sequence to Protein Expression: A Model for Functional Genomics

Tracy J. Wright, Simone Abmayr, Min S. Park, Darrell O. Ricke, Cleo Naranjo, Becky Welsh-Breitinger, Karen Denison and Michael R. Altherr
Genomics Group, Mail Stop M888, Los Alamos National Laboratory, Los Alamos, New Mexico 87545

The goal of Functional Genomics is to provide useful and effective annotation of genomic sequence data. While effective annotation is likely to be defined by the needs of the end user, there are a variety of routine procedures that geneticists employ in the characterization of any gene or hypothetical gene. Included in these collections of procedures are techniques to identify: genetic variation (or polymorphism), transcript complexity and distribution, as well as protein coding capacity. However, the first step in the process is
the identification of putative coding segments within a genomic sequence. This can be accomplished by interrogating the expressed sequence tag data base (dbEST) with the genomic sequence of interest; or by submitting the genomic sequence to a gene identification program like GRAIL in an effort to identify potential transcriptional domains. Sequences and clones identified in this way should be considered putative genes until their functional significance is substantiated by additional biological data.

We have initiated a process to evaluate the biological function of putative genes identified in a large segment of genomic sequence. The genomic sequence we have focused on represents a portion (~200 kbp) of a 2.2 Mbp sequence contig derived from a gene rich region on human chromosome 4 (4p16.3). This region is significant because it has been demonstrated to represent the smallest region of overlap for deletions in Wolf Hirschhorn syndrome (WHS) patients. WHS is a multiple anomaly dysmorphic malady characterized by mental and developmental defects. Due to the complex and variable expression of this disorder, it is thought that WHS is a contiguous gene syndrome with an undefined number of genes contributing to the phenotype. The analysis of genomic sequence data by BLAST, FASTA and GRAIL identified a number of putative transcription units. Potential coding segments were subjected to full insert sequencing of cDNA (when available) and Northern blot analysis. These initial analyses identified two putative coding segments with a high probability (intron::exon structure and positive Northern results) of representing bona fide genes. In addition, highly similar clones have been isolated from mouse, providing additional support that these represent true coding segments. The human coding segments have been cloned into expression vectors and will ultimately be used to generate antibodies to confirm the existence of the hypothetical proteins.

This work has allowed us to evaluate coding potential and possible function of genes identified by genomic sequencing efforts. It has identified a number of potential bottlenecks and allowed us to conceptually develop a scheme to provide functional annotation for genomic sequence.

Support: DOE Contract W7405-ENG-36 and LANL LDRD funds.
E-mail Contact: [email protected]
Day Phone: (505)665-6144

FAKtory: A Customizable Fragment Assembly System

Eugene W. Myers, Susan J. Larson, Brad W. Traweek, Kedarnath A. Dubhashi
University of Arizona, Tucson, AZ
[email protected]

FAKtory supports a large range of sequencing protocols and specialized fragment database information. Customizations include a processing pipeline for clipping, tagging, and vector trimming stages (prescreeners), and a configurable fragments database. Each pipeline stage can run in Automatic, Supervised, or Manual mode depending on the degree of user control desired. The FAKII Fragment Assembler provides high-sensitivity overlap detection, near-perfect multialignments, alternate assemblies, and support of user specifiable assembly constraints. While initial development on FAKtory emphasized its customizability, recent work provides sophisticated prescreeners, an editor for contig layout manipulation, a Finishing editor, and I/O filter capabilities for easy translation of data formats and linking to post-analysis programs.

A pipeline can include any number of prescreener stages for vector removal, data trimming, and tagging. Each prescreener includes a number of recognizers, which locate within a fragment selected intervals, frequencies of specified bases, trace signal characteristics, overlap with reference sequences, or matches to regular expressions.

Because alternate assemblies may be generated for a given data set, FAKtory provides a Layout Edit panel for comparing and interacting with the potential solutions. Portions of contigs can be locked together, split apart, or checked for possible joins to other contigs. Constraints may be added or deleted, and reassembly with additional data generates additional assemblies to consider.
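The recognizer idea described for FAKtory prescreeners (locating intervals by base frequency or by regular-expression match) can be illustrated with a minimal Python sketch. This is not FAKtory code: the function names, the sliding-window rule, and the parameters `window` and `min_fraction` are assumptions chosen only for illustration.

```python
import re


def regex_intervals(fragment, pattern):
    """Return half-open (start, end) intervals where `pattern` matches."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, fragment)]


def base_frequency_intervals(fragment, bases, window, min_fraction):
    """Return merged half-open intervals covered by windows in which the
    fraction of characters drawn from `bases` is at least `min_fraction`.
    (Hypothetical recognizer rule, not taken from FAKtory.)"""
    hits = []
    for start in range(0, len(fragment) - window + 1):
        chunk = fragment[start:start + window]
        count = sum(1 for c in chunk if c in bases)
        if count / window >= min_fraction:
            end = start + window
            if hits and start <= hits[-1][1]:
                # Overlapping or adjacent window: extend the last interval.
                hits[-1] = (hits[-1][0], max(hits[-1][1], end))
            else:
                hits.append((start, end))
    return hits


fragment = "ACGTNNNNNNACGTGAATTCACGT"
print(regex_intervals(fragment, "GAATTC"))
print(base_frequency_intervals(fragment, "N", window=4, min_fraction=0.75))
```

A prescreener stage built this way would pass the returned intervals to a clipping or tagging step; how FAKtory itself combines recognizers is not described in the abstract.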

A Finishing editor displays a multialignment with a scrollable canvas of all trace data for the current location. FAKtory allows automatic tabbing to the previous or next unedited problem in the contig, minimizes the number of keystrokes needed in an editing sweep, and allows unlimited undo. All edited regions are shown to indicate finishing progress.

Divide-and-Conquer Multiple Sequence Alignment

Dan Gusfield, Jens Stoye
Department of Computer Science, University of California, Davis, CA 95616, USA
stoye@cs.ucdavis.edu

We present a fast heuristic algorithm for the simultaneous alignment of multiple sequences which provides near-to-optimal results for sufficiently homologous sequences.

The algorithm makes use of the optimal alignments of all pairs of sequences which give rise to secondary matrices containing additional charges imposed by forcing the alignment path to run through a particular vertex of the distance matrix. From these "additional-cost" matrices, we compute suitable positions for cutting all of the given sequences simultaneously, thus reducing the problem of aligning a family of sequences in a divide-and-conquer fashion to aligning two families of sequences, each of approximately half the original length. When, after re-iterating this division procedure sufficiently often in a recursive manner, the subsequences are sufficiently short, these are aligned optimally.

This procedure allows us to align simultaneously up to twelve amino acid sequences of the usual length (< 500) within a few minutes. The poster will also present several results concerning running time and memory usage as well as the quality of the obtained alignments.

Acknowledgment:
Part of this work was supported by Dept. of Energy grant DE-FG03-90ER60999 and by the German Academic Exchange Service (DAAD).

Sequence Assembly Validation by Restriction Digest Analysis

Eric C. Rouchka and David J. States
Washington University
ecr@ibc.wustl.edu states@ibc.wustl.edu

DNA sequence analysis depends on the accurate assembly of fragment reads for the determination of a consensus sequence. Genomic sequences frequently contain repeat elements that may confound the fragment assembly process, and errors in fragment assembly may seriously impact the biological interpretation of the sequence data. Validating the fidelity of sequence assembly by experimental means is desirable. This report examines the use of restriction digest analysis as a method for testing the fidelity of sequence assembly. A dynamic programming algorithm to determine the maximum likelihood alignment of error prone electrophoretic mobility data is derived and used to assess the likelihood of detecting rearrangements in genomic sequencing projects.

Restriction digest fingerprint matching is an established technology for high resolution physical map construction, but the requirements for assembly validation differ from those of fingerprint mapping. Fingerprint matching is a statistical process that is robust to the presence of errors in the data and independent of absolute fragment mass determination. Assembly validation depends on the recognition of a small number of discrepant fragments and is very sensitive to both false positive and false negative errors in the data. Assembly validation relies on the comparison of absolute masses derived from sequence with masses that are experimentally determined, making absolute accuracy as well as experimental precision important. As the size of a sequencing project increases, the difficulties in assembly validation by restriction fingerprinting become more severe. Simulation studies are used to demonstrate that large-scale errors in sequence assembly can escape detection in fingerprint pattern comparison. Alternative technologies for sequence assembly validation are discussed.

Segmentation Based Analysis of Genomic Sequence

Eric C. Rouchka and David J. States
Washington University
ecr@ibc.wustl.edu states@ibc.wustl.edu

The human genome is patchy and non-uniform in composition. We have developed a method for analyzing genomic sequence based on segmental variation in sequence composition using a heuristic algorithm employing classic changepoint methods and log-likelihood statistics. Our approach models the genome as composed of linear segments each characterized by its own compositional characteristics. The most informative description of a sequence is that description that maximizes the likelihood of the observed sequence given the model while simultaneously minimizing the cost of specifying the model. A Java interface has been developed to aid in visual inspection of the segmentation results (http://www.ibc.wustl.edu/~ecr/CPG/segment.html). The software has been tested using CpG dinucleotide distribution as a well-established example of biologically significant non-uniform composition.

Regions of DNA rich in CpG dinucleotides, also known as CpG islands, are often found upstream of the transcription start site in both tissue specific and housekeeping genes. Overall, CpG dinucleotides are observed at a density of 25% of the expected level from base composition alone, partially due to 5-methylcytosine decay. About 56% of human genes have associated CpG rich islands. Since CpG dinucleotides typically occur with low frequency, CpG islands can be distinguished statistically in the genome. Our model is tested using several sequences obtainable from GenBank, including a 220 Kb human fragment from the filamin (FLN) gene to the glucose-6-phosphate dehydrogenase (G6PD) gene which has been experimentally studied. Results demonstrate a breakpoint segmentation that is consistent with observable manual analysis.

In addition to segmenting DNA according to the location of CpG islands, other compositions are explored as well. Among these are C + G content, mononucleotide content, and dinucleotide content. Segmentation according to higher order oligomers is also under consideration.

Swedish and Finnish Quality Based Finishing Tools for a Production Sequencing Facility

Matt P. Nolan, Jane E. Lamerdin, Stephanie A. Stilwagen, Glenda G. Quan, Ami L. Kyle, Anthony V. Carrano
Joint Genome Institute, Lawrence Livermore National Laboratory
[email protected]

Our modified shotgun sequencing effort has three phases. In the random phase we sequence a fixed number of plates resulting in 80%-95% of the cosmid bases meeting our quality-based, double-stranded, finish criteria (QbDsFc). During pre-finishing we resequence clones attempting in one round of forwards and reverses to meet the QbDsFc for 95% of the bases and close most gaps. During directed closure we close any remaining gaps and complete double-stranding. To reduce finishing costs and speed time to completion for our ~40KB cosmid clone projects we created software to automate selection of finishing reads. We describe our SaF (Swedish and Finnish) software tools developed to 1) facilitate the specification of clones for resequencing and to 2) quantify the state of project contigs with respect to our QbDsFc.

For a project assemblage our SaF tools identify bases not meeting the QbDsFc, then conglomerate these problem bases into problem regions using parameterized filtering and clustering algorithms. They produce reports listing each problem region and a contig summary. Within each problem region, we quantify the problem with respect to our QbDsFc, the phrap consensus quality values and whether or not the consensus sequence overlaps a database of repeats. We can generate a list of sequences (candidates for resequencing) intersecting the problem region. For each sequence, we identify where the read intersects the region. Within the intersection we compute the mean and sigma of the phred basecall quality values, quantify how well it matches the consensus

sequence, and compute a value proportional to resequencing suitability.

In our production sequencing we use the SaF tools to fully automate clone selection in the pre-finishing phase and we require finishers to address each region identified during directed closure. The SaF applications consist of ~8000 lines of C++ and ~1000 lines of perl. We run the script Swedish-setup (~4000 lines of perl) to coordinate the conversion of assembly data through multiple formats to construct a CAF file representation of the phrap assemblage which the SaF applications input. One application produces a file used to direct a robotic workstation to rearray clones to prepare chemistries for finishing reads. We expect other new SaF functionality to migrate into our production environment to meet the ten fold increase projected for JGI sequencing in the upcoming year.

Work performed under the auspices of the US DOE by Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48

Informatics to Support Increased Throughput and Quality Assurance in a Production Sequencing Facility

Arthur Kobayashi, David J. Ow, Tom Slezak, Mark C. Wagner, Matt P. Nolan, T. Mimi Yeh, Stephan Trang, Anthony V. Carrano
LLNL HGC; Joint Genome Institute
kobayashii@llnl.gov

We are developing an integrated informatics infrastructure to support the increasing throughput and quality assurance demands of our production sequencing facility. The LLNL Human Genome Center utilizes a modified shotgun-based approach to sequence human chromosome 19 and targeted gene regions of interest. To date we have finished over 1.5 Mb of high-quality genomic sequence, including a contiguous 1-Mb region. Our laboratory strategies and protocols are described in more detail in other abstracts in these proceedings (e.g., Lamerdin, McCready, et al).

To support this increase in sequence data volume, we have designed and implemented our informatics system to support our current sample prep and sequencing strategy: bulk generation of dye primer and dye terminator reads early in the random phase, followed by increasing automation in the pre-finishing and directed closure phases. We have developed a number of software programs (predominantly Sybperl or PerlTk) which are integrated through our Sybase relational database.

To support increased throughput, we have developed Sybperl programs to generate HTML WWW forms to create and manage sequencing projects, track samples, and create sample sheets. We have also implemented an automated sample file sorting system that analyzes and distributes sample files from our 14 ABI sequencers, which currently are loaded and run twice each day. Analysis results for E. coli contamination, sequence read length, and percent vector are archived in our relational database for reports, trend analysis, and troubleshooting. We have recently completed a rearraying procedure for the pre-finishing and gap closure phases using laboratory robots, which are also integrated into our database and sample-tracking system.

To support quality assurance procedures, we have developed an automated reporting system which uses archived analysis results as well as project information to generate and distribute a series of reports which include an estimate of sequencing efficiencies and project consensus quality. We also have provided WWW access to project directories, which display current quality plots and assembly information for each project. In addition, we have developed a suite of tools which can be used for analyzing specific projects for sequencing efficiencies, read length, etc. We plan to continue to develop and expand our informatics capabilities as we continue to significantly scale up our throughput over the next several years.

This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.
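The SaF abstract above describes conglomerating bases that fail the finish criteria into problem regions, then computing the mean and sigma of phred quality values where a read intersects a region. A rough Python sketch of that idea follows. It is not the SaF implementation: a single quality threshold stands in for the QbDsFc (which the abstract does not define), and `min_quality`, `max_gap`, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def problem_regions(qualities, min_quality, max_gap):
    """Group positions whose quality is below `min_quality` into half-open
    regions, merging problem bases separated by at most `max_gap` good
    bases. (Assumed clustering rule, for illustration only.)"""
    regions = []
    for pos, q in enumerate(qualities):
        if q >= min_quality:
            continue
        if regions and pos - regions[-1][1] <= max_gap:
            regions[-1][1] = pos + 1
        else:
            regions.append([pos, pos + 1])
    return [tuple(r) for r in regions]


def read_overlap_stats(read_start, read_quals, region):
    """Mean and standard deviation of a read's quality values over the part
    of the read that intersects `region`, or None if there is no overlap."""
    start = max(read_start, region[0])
    end = min(read_start + len(read_quals), region[1])
    if start >= end:
        return None
    window = read_quals[start - read_start:end - read_start]
    return mean(window), pstdev(window)


consensus_quals = [40, 40, 10, 40, 12, 40, 40, 40, 40, 8, 40]
regions = problem_regions(consensus_quals, min_quality=20, max_gap=2)
print(regions)
print(read_overlap_stats(3, [30, 20, 10, 40], regions[0]))
```

A resequencing-suitability score could then be derived from these per-read statistics, but the abstract does not say how SaF weights them, so none is sketched here.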

Software Tools for Data Analysis in Automated DNA Sequencing

Michael C. Giddings, Jessica Severin, Michael Westphall, and Lloyd M. Smith
University of Wisconsin-Madison Chemistry Dept., 1101 University Ave., Madison, WI 53703
[email protected]

A crucial component of the automated DNA sequencing process is the analysis software. The software analysis can be roughly divided into four primary steps: gel analysis, base calling, assembly, and finishing. In each of these steps, the software is responsible for the difficult task of accurately analyzing large quantities of complex data and deriving information that is useful to end users of the data.

We have developed software for gel analysis and base calling utilizing a cross-platform, modular, object oriented architecture. The core of this software is BaseFinder, a framework for trace processing, analysis, and base-calling. BaseFinder is highly extensible, allowing the addition of trace analysis and processing modules without recompilation. Powerful scripting capabilities combined with modularity allow the user to customize BaseFinder to virtually any type of trace processing. It currently runs on Windows/NT (Microsoft) and OpenStep/Mach (NeXT), with ongoing work to port it to additional operating systems, including: Solaris (Sun), Rhapsody (Apple), and Linux (GNU/freeware). The Solaris port is in progress and is expected to be completed shortly.

Base calling is currently performed by an adaptive, iterative algorithm which can be quickly tuned to work on data from a large variety of sequencing instruments. The base calling module, when applied in combination with appropriate pre-processing of the data, provides fast, accurate base labeling along with confidence measures on the bases called. It has been extensively tested with data from a number of machines, including the ABI 373A with and without the "stretch" (long gel) upgrade, our in-house Horizontal Ultrathin Gel Electrophoresis (HUGE) system, our vertical slab-gel scanning system, and a capillary sequencing system.

We are also working on a project to incorporate these programs and software developed elsewhere, along with a relational database engine, utilizing distributed object technology. This system has the promise of providing a robust means of dealing with the large quantities of complex data that must be handled in a genome sequencing center, while providing a simple and consistent view of the entire process to a human operator.

Treasures and Artifacts from Human Genome Sequence Analysis

Darrell O. Ricke and Larry L. Deaven
Los Alamos National Laboratory, Center for Human Genome Studies, Los Alamos, New Mexico 87545
[email protected]

A very interesting aspect of the Human Genome Project (HGP) is the analysis of the sequences being generated. Sequence analysis of genomic sequence data is yielding treasures and interesting artifacts. Human genomic sequence analysis finds significant similarities to (1) known genes, (2) known repetitive sequences (Alu, L1, etc.), (3) human and mammalian ESTs, (4) trapped exons, (5) gene orthologs and paralogs (both at the DNA and protein level), (6) tRNAs, (7) snRNAs, (8) rRNAs, (9) CpG DNA sequences, etc. Multiple significant protein sequence similarities have been observed between translated human genomic regions (i.e., exons) and protein sequences from drosophila, yeast, and C. elegans. Similarities with less than 90% identity between a human genomic sequence and a human EST sequence usually represent two members of a multigene family or a possible pseudogene. Surprisingly, human genomic DNA includes multiple regions with significant similarities to mitochondrial DNA. Analysis of over 6 megabases of human genomic sequence data has detected 244 EST clusters, 21 known genes, 55 novel genes, and 35 trapped exons. These and many other treasures await us as the HGP progresses.

Identification of the treasures in the genomic sequence data is confounded by the multiple artifacts that are encountered. Artifacts that need to be ignored include: low complexity similarities, similarities to unannotated repeats, similarities to unannotated contaminating sequences (vector, yeast, and E. coli), exon predictions within repetitive sequences, etc. Many EST, intron, and genomic sequences contain novel repeats that have not been previously described or annotated. One hallmark of a novel repeat in an EST is the lack of sequence similarity between the rest of the EST sequence and genomic sequence surrounding the region of similarity. One particularly interesting artifact is the presence of a human Alu repeat in a bacterial database sequence (pva). Most likely this represents a bacterial sequence contaminated with human sequence. However, horizontal DNA transfer can not be ruled out. Other interesting examples include similarities between human DNA and plant DNA database sequences.

http://www.jgi.doe.gov and http://www.chgs.lanl.gov

Annotating and Masking Repetitive Sequences with RepeatMasker

Arian F.A. Smit and Phil Green
Human Genome Center, Department of Medicine, University of Washington, Seattle, WA 98195

Probably over half of mammalian genomic DNA is comprised of sequences that still can be recognized to be derived from transposable elements. These interspersed elements as well as simple repeats and low complexity DNA need to be masked before performing database searches, since interesting sequence similarities may be overlooked amidst the many spurious matches to these repeats. Furthermore, recognition of interspersed repeats can dramatically improve gene prediction and interspecies sequence comparison and is useful as a finishing tool in genomic sequencing. RepeatMasker is a widely used program to annotate and mask repetitive DNA sequences. Here, we report on recent improvements in the program and databases as well as development currently in progress. We also present examples of the usefulness of RepeatMasker in database searches, gene prediction and evolutionary studies. The program can be run locally on any computer with sufficient memory, or can be accessed via the web at http://ftp.genome.washington.edu/cgi-bin/RepeatMasker or by e-mail at [email protected].

Why Is Basecalling Hard to Do Well? Sources of Variability in DNA Sequencing Traces and Their Consequences

David O. Nelson
Human Genome Center
Lawrence Livermore National Laboratory
[email protected]

Terence P. Speed
Statistics Department
University of California, Berkeley

"Basecalling" is the process of inferring the sequence of a segment of DNA from data arising from electrophoresis traces. Assigning a realistic, probabilistic quality measure to the inferred bases in a DNA sequence requires knowledge of the variability of the data, given the underlying sequence. Treating the sequencing process as an attempt to decode a message in a digital communications system provides us with a framework for analyzing that variability, quantifying it, assigning it to various features of the system, and assessing the performance consequences.

We have examined traces of known sequences from locally produced ABI 377 data to assess the extent of variability in DNA traces. We present data showing the variability of interpeak distance and peak width as a function of base number. We also present data showing the relationship between base number and the amount of information in the signal available for basecalling. We show that, as the number of bases called increases, the primary stumbling block to accurate basecalling is similar to the problem of "synchronization" in a digital communications system. We examine the relationship between the probabilistic quality

scores produced by Phil Green's basecaller "phred" and the local resolution of the signal, defined for a consecutive pair of peaks as the ratio of the interarrival time between the two peaks and the sum of the scale parameters for the peaks. We derive a simple linear model for the probability that the resolution is less than one. Our data shows that beyond a few hundred bases phred quality scores appear to be driven mainly by the probability that the resolution of the signal is less than one. These statistical characteristics of the underlying signal help to explain the abruptness of the well-known "phase transition" from high quality to low quality decisions seen in DNA sequence data.

A Statistical Model for Basecalling

Lei Li, David O. Nelson, and Terence P. Speed
Department of Statistics, University of California, Berkeley
lilei@stat.berkeley.edu, [email protected], [email protected]

Base-calling is that part of automated DNA sequencing which takes the time-varying signal of four fluorescence intensities and produces an estimate of the underlying DNA sequence which gave rise to that signal.

We approach the problem of base-calling from a statistical perspective, hoping to make use of a statistical model to call bases, and also to attach suitable measures of uncertainty to the bases we call. We model the automated Sanger sequencing process by the following steps. First, the underlying sequence of bases is encoded by a hidden Markov model into a virtual signal, consisting of four spike trains. Second, each component of the virtual signal is distorted by a slowly-changing point spread function, to represent the diffusion effect of electrophoresis. Third, mobility shifts are applied to the components separately, with the result being the (model) time varying concentrations of the four bases. Finally, these dye concentrations are converted into fluorescence intensities by the application of an instrument-dependent cross-talk matrix.

Signals simulated according to this model exhibit many, but certainly not all of the features found in real sequencing traces. We hope to have captured the important ones. Our base-calling strategy is to invert the process described in the model. Specifically, we combine algorithms for color separation, mobility adjustment and deconvolution with a hidden Markov model decoder. The output of this analysis will be a "best" estimate of the sequence of bases that gave rise to the data. In addition, for each base called, the analysis will provide the marginal probability of any alternative base call at that position.

We will report our progress in the design of the hidden Markov model, and the deconvolution and color-separation steps.

An Expert System for Base Calling in Four-Color DNA Sequencing by Capillary and Slab Gel Electrophoresis

Arthur W. Miller and Barry L. Karger
Barnett Institute, Northeastern University, Boston, MA 02115
[email protected]

An important consideration in fluorescent DNA sequencing strategies is to extract as much information as possible from each run. In previous work in which we described the sequencing of more than 1000 bases per run by capillary electrophoresis (Carrilho et al., Anal. Chem. 1996, 68, 3305-3313), one characteristic contributing to the extended read length was the use of sophisticated base calling algorithms. We have recently developed a new base calling method that shows promise to increase read lengths for both capillary and slab gel (e.g., ABI) electrophoresis by decreasing the number of errors at long migration times. For example, errors between 800 and 1100 bases in the above cited paper are reduced by 40% relative to the graph-theoretic method employed at that time. Accuracy in this late region is better by the new procedure than by any other one we have tested to date. Data processing with the new system begins by determining the dye spectra from the raw data file in order to perform color separation. This step is

followed by baseline subtraction, and bases are subsequently assigned in a moving time window, roughly ten bases wide. Using a set of empirical rules, each channel in the window is divided into the smallest sections likely to contain at least one call, and then a second set of rules is applied to determine the final calls. Confidences on the base assignments are supplied by statistical correlation of the variables used in the rules, such as peak height and width, with errors made on known sequence. The system has been applied to four-dye separations using four-channel or full-spectral detection. Results on a large body of data will be shown, along with comparisons to several widely used base callers.

This work is being supported by D.O.E. grant #DE-FG02-90ER60895.

Web-Based Tools for the Analysis and Display of DNA Trace Data

Judith D. Cohn, Mark O. Mundt, A. Christine Munk, Larry L. Deaven and Darrell O. Ricke
Los Alamos National Laboratory, Los Alamos, New Mexico 87545
[email protected]

The Human Genome Project worldwide is currently in transition towards a new era of vastly increased production of DNA sequences. Anticipating the requirements of increased sequence production, the Center for Human Genome Studies (CHGS) at Los Alamos began work in 1996 on a new suite of integrated, web-based tools for processing and evaluating sequence data. Major development goals were: 1) to limit the need for human intervention in the analysis process; 2) to design highly modular software, which could run on multiple platforms while accessing a single, centralized (though possibly distributed) database; and 3) to build user-friendly GUI applications. In order to meet these goals, we chose the Java programming language as our primary development environment. Currently, a number of pieces are in everyday use and will be described here as well as in other posters/presentations by our Informatics Team.

Working applications for the analysis and display of DNA trace data from automated DNA sequencing instruments include Trace Viewer, Base Caller, Sequence Trimmer and Production Statistics. While these applications have been designed to work with data from a variety of sources, they have been implemented thus far for use with data from ABI sequencing instruments stored in a central flat-file database. We are in the process of moving the data to an Informix object-relational database.

The Trace Viewer displays trace data from a single sample file at multiple resolutions and includes such features as semantic zooming and coordinated scrolling. At present, it is possible to view ABI, phred and our own base calls along with associated quality numbers and output from the Sequence Trimmer.

The CHGS Base Caller was designed to replace our use of the ABI base caller. At the time, we did not have access to the phred base caller. Designing our own base caller, however, opened up the possibility of achieving additional goals: 1) more accurate base calling under a variety of conditions (e.g. dye-primer versus dye-terminator, new dye sets, new polymerases, new sequencing instrumentation, etc); 2) a robust set of quality numbers at each base position, which can be used to fine tune analysis further down the pipeline, e.g. assembly or primer picking. To date these quality numbers include an overall quality assessment (similar to phred), a quality number for each base and a gap statistic.

The Sequence Trimmer performs two automated functions, which together define a "clear" region for each trace sequence. The first function is to designate a "high quality" region. This is determined using the overall quality scores produced by our base caller and can be modified by manipulating one or more parameters. The second function is to trim vector sequence. Vector trimming has been optimized to recognize even very poor quality sequence if it appears at the appropriate position in the sequence. The vector trimming function is also used to confirm the status of gel control sequences.
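
As an illustration of how a "high quality" region and a "clear" region might be derived from per-base quality numbers, the following is a minimal sketch. It is not the CHGS Sequence Trimmer: the scoring rule (the contiguous run that maximizes the sum of quality minus a cutoff), the cutoff of 20, and the function names are all assumptions made for this example, and vector detection is reduced to a position supplied by the caller.

```python
def high_quality_region(quals, cutoff=20):
    """Return (start, end), half-open, of the contiguous run that maximizes
    sum(q - cutoff). Returns (0, 0) when no base exceeds the cutoff.
    The cutoff is the single tunable parameter in this sketch."""
    best_sum, best = 0, (0, 0)
    run_sum, run_start = 0, 0
    for i, q in enumerate(quals):
        if run_sum <= 0:
            # A run with non-positive score cannot help; restart here.
            run_sum, run_start = 0, i
        run_sum += q - cutoff
        if run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


def clear_region(hq_region, vector_end):
    """Combine the high quality region with the end of leading vector
    sequence (assumed to be known) to give the clear region."""
    start, end = hq_region
    start = max(start, vector_end)
    if start >= end:
        return (end, end)  # nothing usable remains after vector trimming
    return (start, end)


# Made-up quality values for a ten-base read.
quals = [5, 8, 30, 35, 40, 10, 38, 36, 6, 4]
hq = high_quality_region(quals)
print(hq, clear_region(hq, vector_end=4))
```

Raising the cutoff shrinks the region and lowering it lets isolated weak bases be bridged, which is one way a single parameter could modify the result.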

Joint Genome Institute's (JGI) Informatics Plans and Needs

Darrell O. Ricke, Tom Slezak, Sam Pitluck, and Elbert Branscomb
Joint Genome Institute
[email protected]

The DOE funded human genome centers at LANL, LBNL, and LLNL are joining together to form the DOE Joint Genome Institute (JGI). Under the JGI, the informatics teams at LANL, LBNL, and LLNL are working together to focus on meeting JGI needs. The JGI informatics needs are considerable. JGI projects include scaling up both shotgun and transposon-based sequencing, scaling up sequence ready physical map production, new functional genomics projects, systems and data integration, and building software systems for the new JGI Production Sequencing Facility (PSF). To illustrate this complexity, functional genomics projects will generate annotation information for JGI sequenced regions of the human genome in the form of (1) human cDNA sequences, (2) mouse cDNA sequences, (3) sequences for targeted syntenic mouse genomic regions, and (4) gene expression data. To manage data and software distributed across four sites, integrated databases and WWW software projects are being designed and planned. While these integrated systems are being put into place, existing software and systems will continue to be used and in some instances enhanced to meet production scale-up needs. A summary of plans, status of major projects, and informatics needs will be presented.

See URL: http://www.jgi.doe.gov

Restriction Map Display on the World Wide Web

Mark C. Wagner, Thomas R. Slezak, Arthur Kobayashi, David J. Ow, Linda K. Ashworth, Laurie A. Gordon, Anne S. Olsen, Anthony V. Carrano
Joint Genome Institute, Lawrence Livermore National Laboratory, Livermore, CA 94550
[email protected]

The amount of data generated by a Genome Center precludes anything but a graphical interface to the database. The graphical user interface used by the Human Genome Project at Lawrence Livermore National Laboratory predated the World Wide Web, being written for an environment consisting solely of Unix workstations, and was sufficient to meet the needs of our local researchers and more sophisticated collaborators.

However, current collaborative agreements require immediate public release of data, and this has necessitated a shift from a specific type of hardware platform to a much wider audience. The advent of Web based technologies (particularly Java) has made it possible to write interfaces to our database for public use, without the necessity of dealing with software upgrades and hardware dependency problems. We have written a WWW version of the Restriction Map display of our Genome Browser software which will enable access to our database from anywhere on the Internet, and from any type of system supporting Web browsers.

This software will serve the clone resources task of the Joint Genome Institute (JGI) in addition to the Genome Center at LLNL. It is our intention that this software will become a prototype for a larger package that will include other types of displays for data generated by the JGI.

This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.

The BCM Search Launcher - Providing Enhanced Sequence Analysis Search Services

Kim C. Worley, Pamela A. Culpepper, Daniel B. Davison
Department of Molecular and Human Genetics and Department of Cell Biology, Baylor College of Medicine, Houston, TX
[email protected]

We provide Genome Program investigators access to a variety of enhanced sequence analysis search tools via the BCM Search Launcher. The BCM Search Launcher (http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html) is an enhanced, integrated, and easy-to-use interface that organizes sequence analysis servers on the WWW by function, and provides a single point of entry for related searches. This organization makes it easier for individual researchers to access a wide variety of sequence analysis tools. The Search Launcher extends the functionality of other WWW services by adding additional hypertext links to provide easy access to Medline abstracts, links to related sequences, and additional information which can be extremely helpful when analyzing database search results.

For frequent users of sequence analysis services, the BCM Search Launcher Batch Client provides access to all of the searches available from the BCM Search Launcher web pages in a convenient drag and drop (on the Macintosh) or command line (Unix) interface. The BCM Search Launcher Batch Client is a Unix and Macintosh application that automatically 1) reads in sequences from one or more input files, 2) runs a specified search in the background for each sequence, and 3) stores each of the search output files as individual documents directly on a user's system. The HTML formatted result files can be browsed at any later date, or retrieved sequences can be used directly in further sequence analysis. For users who wish to perform a particular search on a number of sequences at a time, the batch client provides complete access to the Search Launcher with the convenience of batch submission and background operation, greatly simplifying and expediting the search process.

One of the tools unique to the Search Launcher is BEAUTY, our Blast Enhanced Alignment Utility. BEAUTY makes it much easier to identify weak, but functionally significant matches in BLAST protein database searches. BEAUTY generates an alignment display showing the relative locations of annotated domains and the local BLAST hits in each matched sequence, greatly facilitating the analysis of BLAST search results. Recent improvements make BEAUTY searches available for DNA queries (BEAUTY-X) and for gapped alignment searches (using WU-BLAST2). In addition, new releases of the Annotated Domains database used with BEAUTY are produced for each full GenBank release. These up-to-date versions of the database present annotation information for many more sequences than previous editions. From the Search Launcher, users can submit sequences to the NCBI's BLAST network server to search the non-redundant, daily-updated database, and have their search results returned with BEAUTY displays added.

Our future development will focus on the analysis of large scale genomic sequences to support the efforts of the Genome Annotation Collaboratory.

This research is supported by a grant from the U.S. Department of Energy Office of Health and Environmental Research (DE-FG03-95ER62097/A000), and grants from the National Human Genome Research Institute, National Institutes of Health (1F32-HG00133-03, 1R01-HG00973-03).

SubmitData Data Submission Framework

David Demirjian, Sushil Nachnani, Manfred Zorn
Lawrence Berkeley National Laboratory
[email protected]

SubmitData is a data translation and submission framework. It is being built to provide a common user representation to public genome databases having different internal formats. SubmitData parses in the schema of a particular database and provides the user with a forms-like display to enter his or her data. The user can define a number of different types of variables to incorporate data files generated by another program, e.g. Excel. SubmitData will replace the variables with actual fields read in from these data files when submitting a transaction. Thus the user can define a template consisting of fields having a constant value along with fields with variables defined as their value. This template can then be used to process a number of data files over a period of time. SubmitData also takes care of translating the transaction into a format expected by the specific database.
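
The template idea in the SubmitData abstract (constant fields mixed with variables that are filled from a data file at submission time) can be sketched as follows. This is only an illustration under stated assumptions, not SubmitData's actual design: the `$name` variable convention, the field names, the CSV input standing in for a spreadsheet export, and the flat `field = value` output format are all hypothetical.

```python
import csv
import io

# Hypothetical template: constant fields plus variables marked with a leading "$".
TEMPLATE = {
    "organism": "Homo sapiens",   # constant value
    "clone_name": "$clone",       # variable, filled from the data file
    "insert_length": "$length",   # variable, filled from the data file
}


def fill_template(template, rows):
    """Return one transaction (a dict) per data row, replacing $variables."""
    transactions = []
    for row in rows:
        record = {}
        for field, value in template.items():
            if value.startswith("$"):
                record[field] = row[value[1:]]  # column named by the variable
            else:
                record[field] = value           # constant copied unchanged
        transactions.append(record)
    return transactions


def print_transaction(record):
    """Render a transaction in a made-up 'field = value' flat format."""
    return "\n".join(f"{field} = {value}" for field, value in record.items())


# A data file as it might be exported from a spreadsheet (made-up values).
DATA = "clone,length\nc101,38000\nc102,41250\n"
rows = list(csv.DictReader(io.StringIO(DATA)))
transactions = fill_template(TEMPLATE, rows)
print(print_transaction(transactions[0]))
```

The same template can be reused for every new data file, and only the rendering step would need to change per target database.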

The framework, implemented initially in Smalltalk and subsequently in the Java programming language, contains five main categories of classes: data representation, user interface, parser/builder, printer and batch processing. The data representation objects are the unchanging internal common representation of objects and fields in the various databases. The user interface objects serve as the views of the data objects in a forms-like display. The parser/builder classes are responsible for parsing the schemas of public databases and building the definitions for the common data objects. The printer set of classes translate the internal data object representation into a format required by a particular public database for submitting a transaction. The batch process objects allow the user to define different types of variables and incorporate data files for submitting batch transactions.

Graphical Ad hoc Query Interface for Federated Genome Databases

Dong-Guk Shin,1 Lung-Yung Chu,1 Wally Grajewski,1 Joseph Leone, Thomas Barnes,2 and Rich Landers2
1 Computer Science & Eng., University of Connecticut, Storrs, CT 06269-3155
2 CyberConnect EZ, LLC, Storrs, CT 06268
[email protected]

We have been developing the Graphical SQL Query Editor capable of aiding genome scientists in learning and/or examining third-party database schemas in a relatively short time and assisting them in rapidly producing correct SQL queries. Specifically, our goal has been to allow a user to form an SQL query within a 5 - 10 minute time frame despite a lack of familiarity with the schemas of the public federated genome databases. Using the SQL Editor, genome scientists can construct queries targeting not only a single database but also distributed queries targeting multiple databases. Currently the SQL Editor interlinks GDB, GSDB, EGAD and SST at TIGR, MGD at Jackson Laboratory, and CHR 12, the chromosome 12 database at Yale University.

The SQL Editor is a client program written in Java which can be downloaded as an applet via the Internet or an Intranet. Database schema information is imported by clicking on the button representing a given database. For example, clicking on the EGAD button causes the EGAD table list to be imported, on-the-fly, from the remote server at TIGR. Similarly, the list of attributes for each table can also be imported, on-the-fly, with a single mouse click. Users express SQL queries by browsing imported database schema information, choosing the items of interest, and graphically specifying SQL selection clauses, restriction conditions, and join links. These join links are capable of linking tables in different databases, and are the means of constructing distributed queries.

One unique feature of this interface is the SchemaViewer. Through the SchemaViewer, the system can suggest to the user semantically correct ways of joining tables. This feature, which we call join path discovery, can even discover join paths that cross database boundaries (i.e. distributed join paths). Join path discovery is supported by a meta-data repository, which is constructed for each participating genome database in the federation. SchemaViewer's Selection, Zooming and Abstraction features allow the user to browse, and choose among multiple system-discovered join paths. Once a join path is chosen, it can be exported to the main SQL Editor window for refinement into a complete query. The SQL Editor also includes other features designed to enhance the user's query formulation.

Acknowledgements
(1) The author's work was supported in part by the NIH/NHGRI Grant No. HG00772-03.
(2) The author's work was supported in part by the DOE SBIR Phase II Grant No. DE-FG02-95ER81906.
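
Join path discovery across database boundaries can be pictured as a search over a graph whose nodes are (database, table) pairs and whose edges are known join links taken from a meta-data repository. The sketch below is an assumption-laden illustration, not the SchemaViewer algorithm: the table and column names are invented, and a plain breadth-first search for the shortest path stands in for whatever semantic criteria the real system applies.

```python
from collections import deque

# Hypothetical meta-data repository: each entry says two tables can be
# joined on a shared column. Table and column names are invented.
JOIN_LINKS = [
    (("GDB", "locus"), ("GDB", "map_element"), "locus_id"),
    (("GDB", "locus"), ("GSDB", "gene_feature"), "gene_symbol"),
    (("GSDB", "gene_feature"), ("GSDB", "sequence_entry"), "entry_id"),
]


def find_join_path(links, start, goal):
    """Breadth-first search for the shortest chain of join links from start
    to goal. Returns a list of (from_table, to_table, column) steps, an
    empty list if start == goal, or None if no path exists."""
    graph = {}
    for a, b, col in links:
        graph.setdefault(a, []).append((b, col))
        graph.setdefault(b, []).append((a, col))  # joins work in both directions
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for nxt, col in graph.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(table, nxt, col)]))
    return None


path = find_join_path(JOIN_LINKS, ("GDB", "map_element"), ("GSDB", "sequence_entry"))
for src, dst, col in path:
    print(f"{src[0]}.{src[1]} JOIN {dst[0]}.{dst[1]} ON {col}")
```

A step whose two tables belong to different databases corresponds to a distributed join link. A fuller version would enumerate several candidate paths so the user can choose among them.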

Towards a Comprehensive Conceptual Consensus of the Expressed Human Genome: A Novel Error Analytical Approach to EST Consensus Databases

Robert Miller, John Burke, Alan Christoffels and Winston Hide
South African National Bioinformatics Institute, The University of the Western Cape, Private Bag X17, Cape Town, South Africa
[email protected]

Human gene data is currently mostly available in fragments in the form of expressed sequence tags (ESTs) and only a relatively minor fraction of all human genes are completely sequenced. The fragmentary but redundant nature of ESTs provides a large but complex and error prone resource for gene discovery. We have developed a novel set of highly portable tools to manufacture and process a database of publicly available assembled consensi of Human ESTs and alignments. The database represents an easily distributable, core information resource upon which a comprehensive knowledgebase can be built. We maximise the coverage of accurate high quality consensus sequences of human genes, perform error compensation and analysis, and provide a measure of error so that we can derive the best possible estimate of the makeup of the human gene set from the data as it becomes available. The system has been designed to derive extended consensus representation of the expressed human genome.

The database differs markedly from indices such as TIGR Gene Index (1), and also databases of clusters of ESTs such as UniGene (2) because it does not discard noisy information. Instead, all possible information is used for clustering using d2-cluster (Burke, Davison, Hide, in prep), a global word based clustering methodology. The "dirty" information is carefully checked for useful constituent subsequences. As a result, extended gene consensi are manufactured containing both high quality and poor quality regions; these are duly annotated. The database has relational access to annotated alignments via the Genome Sequence Database (3). Alignment of the clusters can be on a system with a specific consensus builder using a combination of two error analysis systems: DRAW and CONTIGPROC. Each entry contains all fragments and isoforms of the gene in serial association, separated by spacers. Thus the maximum possible consensus for each gene exists, providing a useful reagent for functional analysis. Comparison of gene sequence, aligned clusters and "alternative splice" frequencies from STACK now allows a more comprehensive understanding of the nature of expressed genes to be performed. We are in the process of discovering what is artifact, and what is genome biology.

References
(1) http://www.tigr.org/tdb/hgi/
(2) http://www.ncbi.nlm.nih.gov/UniGene/index.html
(3) http://www.ncgr.org/gsdb/

Query and Display of Comprehensive Maps in GDB

Stanley Letovsky, Robert Cottingham and GDB Staff
Genome Database, Johns Hopkins University, Baltimore MD
[email protected]

GDB, the human Genome Database (http://www.gdb.org), stores whole chromosome and regional maps of the human genome from a variety of sources. Mapping methodologies represented in the database include linkage, radiation hybrid, content contig, and various forms of cytogenetic mapping. An important class of queries against GDB look for markers in some region of interest, possibly combined with additional restrictions on the types of markers or their properties. It is useful for such positional queries to be able to search all maps, regardless of type, source, or whether or not they happen to contain the markers used to specify the region of interest. We do this by combining the various maps of each chromosome into a single comprehensive map; positional queries are then expressed against the coordinates of the comprehensive map.

The comprehensive maps are generated by a novel integration algorithm that constructs nonlinear
warpings of maps in order to bring common markers into correspondence. The comprehensive maps produced by this algorithm are a significant improvement over the linear deformation method used previously for this purpose. The improvement translates into an increased accuracy of positional querying.

The comprehensive maps can also be viewed using our Mapview display program, which is accessible from the Web. A recent software enhancement allows the results of queries on any type of mapped object (gene, clone, amplimer, etc.) to be displayed as dynamically generated maps, as well as in tabular form. These map displays are based on the comprehensive maps. Queries which can be expressed in this fashion include "find genes in region", "find sequenced clones in region", "find polymorphic amplimers in region", and so on. As the content of GDB is extended in various areas, it becomes possible to ask other queries using this same basic capability. For example, we are currently adding the capability to store gene function and expression data; when this is in place it will be possible to ask for a map of genes that are highly expressed in a given tissue, or genes that have a certain function.

GSDB - New Capabilities and Unique Data Sets

A. Farmer, C. Harger, S. Hoisie, P. Hraber, D. Kiphart, L. Krakowski, M. McLeod, J. Schwertfeger, A. Siepel, G. Singh, M. Skupski, D. Stamper, P. Steadman, N. Thayer, R. Thompson, P. Wargo, M. Waugh, J.J. Zhuang, and P.A. Schad

The Genome Sequence DataBase (GSDB), at the National Center for Genome Resources (NCGR), is moving forward to provide the research community with new data access capabilities and unique data sets. GSDB has improved sequence data access by creating a web-based program (Excerpt) that allows researchers to extract desired portions of sequences from the database; designing and implementing a GSDB flatfile that displays all of the data which can be represented in text format; and improving a web-based query tool (Maestro). In addition, major improvements have been made to the software suite that imports data from the IC databases into GSDB.

GSDB has created unique data sets by constructing alignments to high-profile sequences (such as the complete E. coli genome with over 5000 alignments to other E. coli sequences in the database); and by constructing and maintaining discontiguous sequences that represent sequence-based chromosome maps (such as human chromosome X which is comprised of over 1400 sequences markers and several larger genomic sequences). In addition, there are ongoing projects to improve the quality of data within the database, such as the identification and subsequent removal of vector contamination from the 5' and 3' ends of sequences.

These tools and data are available to the public from the GSDB web site (www.ncgr.org/gsdb).

GSDB has been gaining momentum over the past six months and will continue to develop new and innovative ways to access data in the database. GSDB will also continue to create unique data sets to provide to the public. Planned enhancements include the development of a web-based sequence viewer to replace Annotator, improvements to Excerpt and Maestro, collaborations with JGI and ORNL, and the incorporation of unique data sets such as SANBI's STACK data.

Database Transformations for Biological Applications*

G. Christian Overton, Susan B. Davidson, Peter Buneman
Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104
Tel: (215) 898-3490; Fax (215) 898-0587
{susan,peter,coverton}@central.cis.upenn.edu

The overall goal of the Kleisli project is to develop a suite of tools to perform data source transformation and integration. The central components under development are the high-level query language and system, CPL/Kleisli, a general
schema description language and a transformation constraint language, TSL. (Development of TSL is funded through an NSF grant.) The two main tools--Morphase and CPL--complement one another; Morphase is a heavier-weight system designed to transform an entire database or dataset according to user specifications and constraints written in TSL, a language which allows one to naturally express a broad class of such operations. Conversely, CPL is of most use precisely when there are multiple distributed source databases and transforming each in its entirety to a uniform representation would be infeasible. In such situations, CPL permits complex queries over the distributed heterogeneous data sources, performing integration "on the fly", while simultaneously allowing powerful structural transformations to be done. CPL's extensible query optimizer ensures that this all takes place in a timely fashion.

Together, the tools can be used either to perform data integration by providing dynamic user-defined views, or to create specialized data warehouses. This layer can then be used underneath data mining, OLAP or other decision making tools.

Recent developments include:
• Coupling CPL to OPM-Based Databases. CPL/Kleisli can now query against GDB 6.0's Object Broker Server (documented at http://gdbwww.gdb.org/ob/top.html) using the OPM query language. Having reimplemented our existing Web-based queries to take advantage of GDB 6.0's richer OPM-based schema, we envision as a next step embedding some or all of OPM's query translator in the CPL optimizer in a modular fashion. Since OPM queries are ultimately translated to one or more relational SQL queries by the OPM tools anyway, we can perform this translation at an earlier stage of the query process, producing SQL to which CPL can apply its maximal subquery migration capabilities, pushing to the servers operations which would otherwise have to be performed locally. Furthermore, opening up the query in this manner allows CPL to apply other RDBMS-specific techniques, such as semijoins, essentially leveraging the work already done on relational multidatabase optimization to OPM queries.
• Complex Object Libraries. As a first step towards migrating CPL onto more ubiquitous and widely-accepted language platforms we have implemented a prototype set of complex object manipulation libraries in each of the C++, Perl, and Java(TM) languages. These libraries allow application programs written in any of the target languages to dispatch queries to a central CPL server (in the first stage of migration the server will still run under ML) and receive query results in an abstract representation appropriate to the specific host language, with an API which is uniform across all of them. The Perl and C++ complex object libraries are currently being used in an automated annotation testbed, GAIA (http://agave.humgen.upenn.edu/gaia/). This approach allows CPL queries to be seamlessly integrated as part of the annotation process.
• Local Storage Management. The customizable optimizer of CPL employs a multi-pass source-to-source rewriting strategy to attempt to minimize both the response time and intermediate storage consumption of queries. However, to be able to efficiently store and retrieve ever-larger intermediate query results, preferably along with one or more indices, we are currently exploring the use of a more sophisticated local backing store using either the freely-available SHORE system, the relational Sybase database system, or perhaps OPM.
• User Interfaces. As a first step towards developing a simple and powerful user interface to CPL, we have created a number of Web-based stereotypic queries. To date, we have focused on the implementation and optimization of queries which were chosen to test system performance in formulating and executing distributed queries while providing functionality to a growing user community. Eight major classes of queries are available with new ones being added by request from the user community.
These queries, shown below with the databases accessed indicated in parentheses, demonstrate how Kleisli can be used to integrate data stored in disparate formats and physical locations. More details on the queries can be found at the Kleisli homepage, http://agave.humgen.upenn.edu/cpl/cplhome.html.
  • "Complete Genome" query, e.g., "Return all complete mitochondrial genomes larger than 20kb." (GSDB.)
  • EST Location query, e.g., "Find the location of a mapped EST." (dbEST or GenBank or GSDB and GDB.)
  • Gene/Location query, e.g., "Find protein kinase genes on human chromosome 4." (GDB.)
  • Sequence/Size query, e.g., "Find mapped sequences longer than 100,000 base pairs on human chromosome 17." (GDB and GSDB.)
  • Mapped EST query, e.g., "Find ESTs mapped to chromosome 4 between q21.1 and q21.2." (GDB and GSDB.)
  • Primate alu query, e.g., "Find primate genomic sequences with alu elements located inside a gene domain." (BLAST and GSDB.)
  • BLAST Sequence/Feature query, e.g., "Find sequence entries with homologs of my sequence inside an mRNA region." (BLAST and GSDB.)
  • Human genome map search, e.g., "Find human sequence entries on human chromosome 22 overlapping q12." (GDB, GSDB and ASN.1 GenBank.)

*Supported by a grant from the Department of Energy, 94ER61923.

Recent Kleisli References

"Querying an Object-Oriented Database Using CPL," S.B. Davidson, C. Hara and L. Popa. Proceedings of the Brazilian Symposium on Databases (October 1997).
"WOL: A Language for Database Transformations and Constraints," S.B. Davidson and A. Kosky. Proceedings of the International Conference of Data Engineering, April 1997 (Glasgow, Scotland).
"BioKleisli: A Digital Library for Biomedical Researchers," S.B. Davidson, C. Overton, V. Tannen and L. Wong. Journal of Digital Libraries 1:1 (November 1996).
"A Data Transformation System for Biological Data Sources," P. Buneman, S.B. Davidson, K. Hart, C. Overton and L. Wong. Proceedings of VLDB, Sept. 1995 (Zurich, Switzerland).
"Challenges in Integrating Biological Data Sources," S.B. Davidson, C. Overton and P. Buneman. J. Computational Biology 2 (1995), pp 557-572.
"Semantics of Database Transformations," A. Kosky, S.B. Davidson and P. Buneman. Semantics of Databases, edited by L. Libkin and B. Thalheim.
"Transforming Databases with Recursive Data Structures," A. Kosky. PhD Thesis, December 1995.

Exploring Heterogeneous Biological Databases with the OPM Multidatabase Tools

Victor M. Markowitz, I-Min A. Chen, Anthony Kosky, Ernest Szeto
Lawrence Berkeley National Laboratory, Berkeley, CA 94720
[email protected]

The Object-Protocol Model (OPM) data management tools provide support for rapid development, documentation, and flexible exploration of scientific databases. Several archival molecular biology databases have been designed and implemented using the OPM tools, including the Genome Database (GDB) and the Resource Center Primary Database (RZPD) of the German Human Genome Project, while other databases, such as the Genome Sequence Database (GSDB) and NCBI's Genbank, have been retrofitted with semantically enhanced views using the OPM tools.

The multidatabase OPM tools provide powerful facilities for: (1) assembling (federating) heterogeneous databases into a multidatabase
system, while documenting their structure and inter-database links; (2) processing ad-hoc multidatabase queries via uniform OPM interfaces; and (3) assisting scientists in specifying and interpreting multidatabase queries. Incorporating a database into a multidatabase system involves constructing one or more OPM views of the database and entering information about the database and its views into a Multidatabase Directory (see http://gizmo.lbl.gov/DM_TOOLS/OPM/MBD/MBD.html). This Directory records information necessary for accessing and formulating queries over the component databases, including: general information required for accessing the database; structural information on the schemas of each database; and information on known links between databases, including semantic descriptions of the links, and data manipulations necessary in order to traverse such links.

Queries over an OPM based multidatabase system are expressed in an extension of the OPM query language that includes additional constructs necessary for accessing multiple databases. Multidatabase queries are processed by generating queries over individual databases, and combining the results using a local query processor. A Java graphical multidatabase query construction tool provides support for dynamic construction and automatic generation of Web query forms that can then be either used for further specifying conditions, or saved and customized. The OPM multidatabase tools have been applied to several projects, including the construction of a federation of molecular biology databases involving GDB, GSDB, and Genbank (http://gizmo.lbl.gov/jopmDemo/gdbs_mqs.html), and are currently used for setting up a federation of biocollections databases and a federation of several databases involved in the German Human Genome Project.

The main problem we are currently addressing consists of identifying, documenting, and formally defining a comprehensive set of relevant inter-database links that drive multidatabase queries and applications.

This work is supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research, of the U.S. Department of Energy under Contract DE-AC03-76SF00098.

The OPM Data Management Toolkits Anno 1997

Victor M. Markowitz, I-Min A. Chen, Anthony Kosky, Ernest Szeto, William Barber, Thodoros Topaloglou
Lawrence Berkeley National Laboratory, Berkeley, CA 94720
[email protected]

The Object-Protocol Model (OPM) data management tools provide support for rapid development, documentation, and exploration of scientific databases. These tools are based on OPM, an object data model that is similar to the ODMG standard, but also has additional constructs for modeling scientific data. Databases designed in OPM can be implemented with commercial relational DBMSs, using the OPM Database Development Toolkit that includes OPM schema translators for generating complete DBMS database definitions from OPM schemas, a Java based OPM schema editor and browser, tools for specifying and maintaining multiple OPM views, and OPM schema publishing tools for documenting OPM databases in a variety of formats and notations. OPM schemas can also be retrofitted on top of existing relational databases or structured files defined using notations such as the ASN.1 data exchange format. Native or retrofitted OPM databases can be queried using the OPM Database Query Toolkit that includes OPM query language translators that interpret queries expressed in the ODMG compliant OPM query language (OPM-QL) and translate them into the languages supported by the underlying DBMS. A Web-based OPM query interface allows graphical construction of ad-hoc OPM queries and can be used for generating Web query forms.

An OPM Multidatabase Toolkit contains tools that support: (1) assembling heterogeneous databases into an OPM based multidatabase system, while documenting their schemas and inter-database links; (2) processing ad-hoc multidatabase queries via uniform OPM interfaces; and (3) assisting
scientists in specifying and interpreting multidatabase queries.

Several archival molecular biology databases have been designed and implemented using the OPM tools, including the Genome Database (GDB) and the Resource Center Primary Database (RZPD) of the German Human Genome Project, while other biological databases, such as the Genome Sequence Database (GSDB) and NCBI's GenBank, have been retrofitted with semantically enhanced OPM views. The OPM multidatabase tools have been applied for constructing a molecular biology database federation and for providing a common interface on top of a variety of different biological databases.

Current OPM work includes further development of the OPM multidatabase tools and extending the OPM toolkit to support complex data types, such as DNA sequences and 3-dimensional crystallographic data, irrespective of the underlying DBMS facilities.

Examples, documentation, and papers regarding the OPM toolkits are available at http://gizmo.lbl.gov/opm.html.

This work is supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research, of the U.S. Department of Energy under Contract DE-AC03-76SF00098.

Quality Control in Sequence Assembly Analysis

Mark O. Mundt, Judith D. Cohn, Tracy L. Ricke, P. Scott White, Larry L. Deaven and Darrell O. Ricke
Los Alamos National Laboratory, Life Sciences Division and Center for Human Genome Studies, Los Alamos, New Mexico 87545
[email protected]

As participants in the Human Genome Project produce more sequence data at higher rates, there is less chance that any given base pair will be manually edited by interactive means. Moreover, assembly decisions made by automatic programs must be scrutinized in a more efficient manner. To further complicate matters, natural variation among individual human sequences is at least five to ten fold higher than the desired quality standard for finished sequence. On average, we expect to see approximately one polymorphism in every 1000 base pairs, while the community's sequencing error standard is now projected at one in every 10,000 base pairs. Assessment of quality and variation must therefore be efficiently automated.

At the Center for Human Genome Studies in Los Alamos, we are using our own base caller and quality values to derive input for the TIGR assembly program. Qualitative comparisons are then made between individual cosmid clone assemblies and ones using overlapping clones. Phred and phrap results are also contrasted by means of algorithms operating on a novel, standard assembly format. Differences in assemblies are now automatically detected, and our future objectives would include using quantitative measures and adding repeat profiles to make decisions favoring the results which best meet the goals of the project. These tools along with graphical analysis displays are being integrated into a web-based Java system.

In addition to quality assessment, we are concurrently searching for single nucleotide polymorphisms (SNPs). Because of the higher mutation rate relative to other sequences (17 to 25 fold higher) we are presently targeting CpG islands in our 7q telomeric sequence for resequencing. For this automated process, we have developed software that targets regions of high densities of CpG dinucleotides, and we use Whitehead Institute's Primer3 software to design PCR and sequencing primers. These are used to amplify and sequence genomic DNA templates from several diverse individuals in search of SNPs. Low quality regions of sequence are targeted in a similar manner, and the multiple assembly information is used to target weak joins.

In conclusion, an automated system for designing PCR and resequencing primers, targeting specific regions to assess sequence quality and natural variation, is being implemented and optimized.
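The idea of targeting regions with a high density of CpG dinucleotides can be pictured as a sliding-window scan over the sequence. The Python sketch below is only an illustration under stated assumptions: the function name, window size, step and count threshold are hypothetical and are not taken from the Los Alamos software, whose actual criteria are not given in the abstract.

```python
def cpg_dense_windows(seq, window=200, step=50, min_cpg=10):
    """Return (start, end, cpg_count) for windows holding at least min_cpg CG dinucleotides.

    window, step and min_cpg are illustrative defaults, not published values.
    """
    seq = seq.upper()
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        # "CG" cannot overlap itself, so str.count gives the exact dinucleotide count.
        cpg_count = chunk.count("CG")
        if cpg_count >= min_cpg:
            hits.append((start, start + len(chunk), cpg_count))
    return hits


if __name__ == "__main__":
    # Made-up toy sequence: a CpG-rich stretch flanked by AT repeats.
    toy = "AT" * 100 + "CG" * 50 + "AT" * 100
    for start, end, count in cpg_dense_windows(toy):
        print(start, end, count)
```

Windows reported by such a scan could then be handed to a primer-design step; how the authors pass regions to Primer3 is not described, so that hand-off is left out here.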

Automating the Detection of Human DNA Variations

Scott L. Taylor, Mark J. Rieder, and Deborah A. Nickerson
Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195
debnick@u.washington.edu

Fluorescence-based sequencing is playing an increasingly important role in efforts to identify DNA polymorphisms/mutations in genes of biological and medical interest. We have developed a computer program known as PolyPhred that automatically detects the presence of single nucleotide substitutions in their heterozygous state using fluorescence-based sequencing of PCR products. The operation of PolyPhred will be described as well as its integration with the Phred base-calling program, the Phrap assembly program, and the Consed viewing program. Additionally, we will illustrate how Consed can be leveraged to display a set of highly annotated reference sequences that greatly simplifies the analysis of DNA variations with respect to existing information on gene structure, PCR primers, and previously known DNA polymorphisms or mutations. Lastly, we will document the ease and speed of performing high quality and accurate fluorescence-based resequencing on long tracts of mitochondrial and nuclear DNA as well as the application of these new tools to automatically find and view DNA variations within these sequences.

Fast and More Accurate Distance-Based Phylogenetic Construction

William J. Bruno, Aaron L. Halpern, Nicholas D. Socci
Los Alamos National Laboratory
[email protected]

Phylogenetic reconstruction is relevant to genomic analysis whenever profile methods are used to assess gene function, because accurate profile construction depends on historical relationships [Bruno, 1996]. The Neighbor-Joining algorithm of Saitou and Nei [1987] has the advantage of being fast enough for constructing a profile from hundreds of sequences, but it does not do the best job of constructing the correct tree, and in fact we show it to be biased.

Gascuel recently [1997] introduced the BIONJ method, which improves the Neighbor-Joining method by using a weighted average to compute the new distances. BIONJ is still biased, and performs exactly the same as Neighbor-Joining in the case of 4 taxa.

We introduce a new, weighted neighbor-joining method called Weighbor. This method uses weights that accurately reflect the exponential dependence of variances and covariances on distance. The weights are used both in determining which pair is joined and in computing new distances. As a result, the method is not biased, and gives better results than Neighbor-Joining, even in the case of 4 taxa. The current implementation of Weighbor has a computational complexity of N^4, but an N^3 version, comparable in speed to Neighbor-Joining, is underway.

Bruno, W. J., "Modeling Residue Usage in Aligned Protein Sequences via Maximum Likelihood," Mol. Biol. Evol. 13:1368-75 (1996).
Gascuel, O., "BIONJ: an Improved Version of the NJ Algorithm Based on a Simple Model of Sequence Data," Mol. Biol. Evol. 14:685-95 (1997).
Saitou, N. and M. Nei, "The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees," Mol. Biol. Evol. 4:406-25 (1987).
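For readers unfamiliar with the baseline being improved upon, the pair-selection step of standard (unweighted) Neighbor-Joining can be sketched in Python as follows. This is the textbook criterion attributed to Saitou and Nei, not Weighbor: the abstract does not give the Weighbor weights, so no attempt is made to reproduce them. The function name and the example distance matrix are illustrative assumptions.

```python
def nj_pick_pair(d):
    """Pick the pair (i, j) minimizing the standard Neighbor-Joining criterion.

    d is a symmetric matrix (list of lists) of pairwise distances with a zero diagonal.
    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k]
    Returns (i, j, Q(i, j)); ties keep the first pair found.
    """
    n = len(d)
    if n < 3:
        raise ValueError("need at least three taxa")
    row_sums = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best[1], best[2], best[0]


if __name__ == "__main__":
    # Hypothetical distances among five taxa (illustrative numbers only).
    example = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_pick_pair(example))
```

In this unweighted form every distance contributes equally to the criterion. According to the abstract, Weighbor instead applies distance-dependent weights both when choosing the pair and when computing the new distances.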

Determining the Important Physical-Chemical Parameters Within Various Local Environments of Proteins

Jeffrey M. Koshi and William J. Bruno
Los Alamos National Laboratory, Theoretical Biology and Biophysics
[email protected]

In this work the importance of various physical-chemical parameters of the amino acids is analyzed in a position specific manner for various protein secondary structure and surface accessibility classes. This is done on the basis of a previously published maximum likelihood method that finds the optimal site-specific residue frequencies for a set of aligned homologs. These vectors representing the site-specific residue frequencies can be broken up into subsets based on position within various secondary structures and surface accessibility. An analysis of the eigenvectors (PCA analysis) of the resulting covariation matrix shows the most important physical-chemical parameters for a given subset of data. Preliminary results indicate the importance of hydrophobicity, agreeing with the conclusions of many other researchers, and we are investigating the importance of size, charge, and aromatic rings at specific positions within various secondary structures.

Work is also in progress to expand the model of evolution used in the above work. Rather than looking only at amino acid frequencies, we are implementing a method that uses a limited amino acid mutation matrix optimized for each position in a set of aligned homologs. This more realistic model of evolution is still computationally feasible and should improve the performance of the model in phylogenetic tree reconstruction, homolog detection, analysis of the importance of physical-chemical parameters, or any other application.

Improving Software Usability and Accessibility

Ryan Carroll and Ruth Ann Manning
ApoCom Inc., 1020 Commerce Park Drive, Suite F, Oak Ridge TN 37830-8026
[email protected]

Over recent years the number of software tools being produced to assist in research related to the Human Genome Project has increased rapidly. However, most of these tools are currently being under-utilized because either they are not available on a wide variety of computer platforms or they have an interface that cannot be learned without a significant time investment. An additional shortcoming that is inherent in most of the available sequence analysis tools is the use of opaque pattern analysis systems such as statistical, syntactic, Markovian, or artificial neural network methods. Although such systems can be very accurate, the reasoning methods they use cannot be described intuitively. This means that the user has very little information (typically only a single numerical score) to assist in appraising the value of system predictions. ApoCom is addressing these problems by implementing comprehensive Java and CORBA interfaces through which it will make accessible a large assortment of computational and database related tools. ApoCom is also developing a fuzzy logic-based gene hunting algorithm.

bioWidgets: Visualization Componentry for Genomics

Steve Fischer, Jonathan Crabtree, Mark Gibson and G. Christian Overton (PI)
Department of Genetics, University of Pennsylvania; Philadelphia, PA 19104
{sfischer,crabtree,gibson,coverton}@cbil.humgen.upenn.edu

bioWidgets is a software package for the rapid development and deployment of graphical user interfaces (GUIs) designed for the scientific visualization of molecular, cellular and genomics information. The overarching philosophy behind bioWidgets is componentry: that is, the creation of adaptable, reusable software, deployed in modules that are easily incorporated in a variety of applications, and in such a way as to promote interaction between those applications. This is in sharp distinction to the common practice of developing dedicated applications. The bioWidgets project additionally focuses on the development of specific applications based on bioWidget componentry, including chromosomes, maps, and nucleic acid and peptide sequences.

The current set of bioWidgets has been implemented in Java with the goal in mind of delivering local applications and distributed applets via Intranet/Internet environments as required. The immediate focus is on developing interfaces for information stored in distributed heterogeneous databases such as GDB, GSDB, Entry, and ACeDB. The issues we are addressing are database access, reflecting database schemas in bioWidgets, and performance. We are also directing our efforts into creating a consortium of bioWidget developers and end-users. This organization will create standards for and encourage the development of bioWidget components. Primary participants in the consortium include Gerry Rubin (UC Berkeley), Nat Goodman (Jackson Labs), Stan Letovsky (GDB) and Tom Flores (EBI).

Current progress includes the development of an Inter-widget Communication package. This package consists of a set of Java(tm) classes and interfaces, and is based on the Java(tm) JDK 1.1 Delegation Event Model. It defines a set of inter-widget events that control: (1) mutual selection of elements in multiple widgets and (2) coordinated scrolling and zooming of multiple widgets. We have used this package in our implementation which integrates our genome viewer and our sequence viewer.

We have also developed an object-oriented data specification for the Sequence and Map widgets. This specification will more generally apply to widgets that display sequence and sequence annotation. With the increasing use of object-oriented databases, object brokers and standardized remote object formats (CORBA and Java's RMI), application developers using our bioWidgets will have the capability to serve data objects directly to the widgets. Our object-oriented specification is written using Java(tm) interfaces, and will likely migrate to CORBA IDL. The interfaces we define are as generic as possible, with the goal of allowing diverse applications to use them.

The Inter-widget communication model, component-based design and object oriented data format are intended in the near future to be used with the Java Beans(tm) technology. This will allow users of bioWidgets to incorporate the widgets into an application simply by interacting with an application builder tool rather than writing actual integration code.

The current implementation includes widgets which display Sequence, Map, Blast results, Chromosomes and Sequence alignments.

For more details, please see http://agave.humgen.upenn.edu/bioWidgetsJava/.

*Supported by a grant from the Department of Energy, 92ER61371.

Research on Data and Workflow Management for Laboratory Informatics

Nathan Goodman
The Jackson Laboratory
[email protected]

We have been pursuing a strategy for laboratory informatics based on three main ideas: (1) component-based systems; (2) workflow management; and (3) domain-specific data management. The workflow and data management software we have developed pursuant to this strategy are called LabFlow and LabBase respectively. LabFlow provides an object-oriented framework for describing workflows, an engine for executing these, and a variety of tools for monitoring and controlling the executions. LabBase is implemented as middleware running on top of commercial relational database management systems (presently Sybase and ORACLE). It provides a data definition language for succinctly defining laboratory databases, and operations for conveniently storing and retrieving data in such databases.

Both LabFlow and LabBase are implemented in Perl5 and are designed to be used conveniently by Perl programs. The total quantity of code is modest, comprising about 10,000 lines of Perl5. The software is freely available and redistributable (see http://goodman.jax.org for details), but be forewarned that this is research software and is incomplete in many ways.

Method of Differentiation Between Similar Protein Folds

I. Dubchak, I. Muchnik, S. Spengler, M. Zorn
Lawrence Berkeley National Laboratory, Berkeley, CA 94720
[email protected]

Predicting the protein fold and implied function for a target sequence whose structure is unknown is a problem of significant interest. The information derived from such a prediction is substantial, guaranteed by the similarity between the three-dimensional (3D) structures and the functions of class members. Prediction is especially complex when a distinction of one particular fold from other highly similar three-dimensional folds is needed.

For the development of the prediction technique we used a particular example of a separation of 4α-helical cytokines (fold of interest, FI) from similar folds. We applied the method based on global descriptors of a protein in terms of the biochemical and structural properties of the constituent amino acids [1]. Neural networks were used to combine these descriptors in a specific way to discriminate members of the FI. The following steps were necessary:
1. Defining the neighborhood of the FI, i.e. spatially close protein folds which are hard to distinguish from the FI.
2. Collecting the non-redundant set of proteins to represent a wide range of available members of protein folds in the context of the comprehensive Structural Classification of Proteins (SCOP).
3. Selecting several attributes which represent various groups of physico-chemical and structural properties.
4. Analyzing the prediction accuracy achieved by each parameter set in order to choose the most efficient sets for separation of the FI from a particular neighbor.
5. Voting among predictions made by different sets of parameters to achieve higher reliability of prediction.

The developed procedure is simple and efficient. Further improvement of the prediction method greatly depends on the possibility of automation of the steps 1-3 and growth of existing databases.

1. I. Dubchak, I. Muchnik, S. R. Holbrook (1995). PNAS 92, 8700-8704.
2. Murzin, A. G., S. E. Brenner, T. Hubbard and C. Chothia. (1995). J. Mol. Biol., 247: 536-540.
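The final step of the fold-discrimination procedure, voting among predictions made by different parameter sets, can be pictured with a simple majority vote. The Python sketch below is an assumption-laden illustration: the abstract does not say how votes are weighted or how ties are broken, so equal weights are assumed and the labels and function name are hypothetical.

```python
from collections import Counter


def vote_on_fold(predictions):
    """Combine per-parameter-set predictions by unweighted majority vote.

    predictions: list of labels, one per predictor (e.g. "FI" or "neighbor").
    Returns (winning_label, share_of_votes). On a tie, the label that
    appeared first in the input wins (Counter keeps insertion order).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(predictions)


if __name__ == "__main__":
    # Hypothetical outputs of three networks trained on different descriptor sets.
    print(vote_on_fold(["FI", "FI", "neighbor"]))
```

The share of votes gives a crude indication of agreement among the predictors; a weighted scheme based on each set's measured accuracy (step 4) would be a natural variant, but it is not specified in the source.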

Ethical, Legal, and Social Issues

Geneletter: An Internet Newsletter on Ethical, Legal, and Social Issues in Genetics

Dorothy C. Wertz, Philip P. Reilly, Robin J.R. Blatt
The Eunice Kennedy Shriver Center for Mental Retardation, Inc.
[email protected]

Geneletter has reached a wide audience, with 348,412 hits and 54,083 user sessions between September 18, 1996 and October 6, 1997. Current readership is about 9000 user sessions per month (323 per day), with an average length of 7 minutes. Readers range from elementary school students to graduate and professional school students, and include people with genetic conditions, lawyers, and medical professionals. 17% are international, including Canada, Australia, United Kingdom, Sweden, Japan, France, Germany, Singapore, Malaysia, New Zealand, Israel, Norway, Brazil, and Italy, as the leading nations. Our content, in the seven issues published to date, includes ethics, science, medicine, law, international views, education, society, book reviews, and a digest of the news about genetics. Reader interest has focused on cloning, the genetics of homosexuality, "The Calico Cat, the Munchkin, and Achondroplasia," "Adam, Eve, and Mitochondria," and basic information about genetic testing. Reader queries have focused on popular science issues (cloning, Jurassic Park), accuracy of paternity testing, causes of miscarriage, insurance coverage, and material for student papers, rather than on ethical concerns. This project demonstrates the feasibility of using the Internet to educate people about genetics.

Communicating Science in Plain Language: The Science + Literacy for Health: Human Genome Project

Maria Sosa, Judy Kass, and Tracy Gath
American Association for the Advancement of Science, 1200 New York Avenue, NW, Washington, DC 20005

Recent literacy surveys have found that a large number of adults lack the skills to bring meaning to much of what is written about science. This, in effect, denies these adults access to vital information about their health and well-being. To address this need, the American Association for the Advancement of Science (AAAS) has developed a 2-year project to provide low-literate adults with the background knowledge necessary to address the social, ethical, and legal implications of the Human Genome Project.

With its Science + Literacy for Health: Human Genome Project, AAAS used its existing network of adult education providers and volunteer science and health professionals to pursue the following overall objectives: (1) to develop new materials for adult literacy classes, including a high-interest reading book, a short video providing background information on genetics, a database of resources, and fact sheets to assist other organizations and researchers in preparing easy-to-read materials about the HGP, and (2) to develop and conduct a campaign to disseminate the materials to libraries and community organizations carrying out literacy programs throughout the United States. To introduce the materials to low-literate adults, workshops using the materials were conducted in Washington, DC, Baltimore, MD, Chicago, IL, Miami, FL, and Cleveland, OH. In 1997, thousands each of the book and video, both titled Your Genes, Your Choices, have been sent to literacy educators, community colleges, church groups, libraries, and other organizations around the country. In each workshop, a genetic counselor or other genetic professional was present to answer questions and provide insight into genetic research and issues. The entire book is also available on the Web at http://ehr.aaas.org/ehr/books/index.html. Our model for helping scientists communicate in simple language will have impact beyond classrooms and learning centers. Since not every low-literate adult is enrolled in a literacy class, we developed a model that reaches out to community groups providing health services. These groups have indicated that easy-to-read materials on genetics are not only desirable but necessary; indeed, the groups we worked with often received requests for information on heredity and genetics. Your Genes, Your Choices enables medical and scientific organizations to communicate more effectively with economically disadvantaged populations, which often include a large number of low-literate individuals.

* Supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research of the U.S. Department of Energy under contract DE-FG02-95ER61988.

The Arc's Human Genome Education Project

Sharon Davis, Ph.D.; Leigh Ann Reynolds, Project Associate
The Arc of the U.S. 500 E. Border Street, Suite 300 Arlington, TX 76010
lreynold@metronet.com

New genetic findings from the Human Genome Project (HGP) pose unique ethical questions and legal/social concerns to those with disabilities and their family members. Many disorders associated with the disability of mental retardation have genetic causes, with Down syndrome and fragile X syndrome being the most common.

In an effort to begin addressing these complex issues, The Arc of the U.S. (the largest voluntary organization on mental retardation in the country with 140,000 members) developed a series of reports, fact sheets and a training package for use by the organization's leadership to educate members about these important and timely issues. The materials were distributed to all 1,100 chapters of The Arc and made available through the internet. Some topics covered include:

• An Introduction to Genetics and Mental Retardation
• Genetic Discrimination: Why Should We Care?
• Genetic Testing, Screening and Counseling - An Overview
• Protecting Genetic Privacy

The aim of this educational effort is, first, to provide a basic understanding of genetic inheritance and mental retardation. Next, unique concerns people with mental retardation and their families may face in light of new genetic research are addressed. Finally, inherited syndromes associated with mental retardation are highlighted to provide a practical example of how someone with this disability is affected by the HGP and to report on the latest research in genetic therapy.

The Arc's Human Genome Education Project: Examining Genetic Ethical, Legal and Social Issues is an interactive and comprehensive training package which includes a detailed script for the workshop leader, a pre and post-questionnaire to measure change in opinion and understanding of the issues, background information for the presenter, print copies of overheads, a project description to use in promoting the workshop, a 15 minute video and other handouts. The most important objective of the training (discussing highly complicated and sometimes controversial issues) is achieved through the use of case scenarios. Through this challenging exercise, members actively learn how difficult it can be to make decisions regarding genetic testing, therapy and discrimination on behalf of themselves or their children who have disorders either caused by or associated with mental retardation. By simplifying medical terminology and concepts, identifying core issues most threatening to those with disabilities and utilizing practical case scenarios, members of The Arc are better equipped to make informed decisions and form educated opinions on ethical, legal and social issues impacting the lives of people with mental retardation resulting from the HGP.

Measuring the Effects of a Unique Law Limiting Employee Medical Records to Job-Related Matters

Mark A. Rothstein, J.D., Steven G. Craig, Ph.D., Betsy D. Gelb, Ph.D.
University of Houston
[email protected]

Under the Americans with Disabilities Act, after a conditional offer of employment an employer is permitted to conduct medical examinations of unlimited scope and may require the release of all medical records in the possession of the individual's health care providers. Allowing unlimited access to personal medical records not only facilitates surreptitious discrimination, but it invades the privacy of individuals and discourages at-risk individuals from undergoing genetic testing in the clinical setting.

Pursuant to a law enacted in 1983, Minnesota is the only state to restrict the scope of all medical inquiries by employers to matters that are strictly job-related and consistent with business necessity. If this law has had no adverse effects on employers or other parties, then it may serve as a model for protecting genetic privacy in the workplace.

We are using the following three main methods to measure the effects of the law: (1) reviewing and analyzing all of the cases filed under this law in the last five years; (2) surveying human resource managers in Minnesota (using Ohio as a control state) to learn their opinions about limitations on the scope of preplacement medical examinations; and (3) doing an economic analysis in Minnesota (with Ohio as the control state) of workers' compensation claims rates, employee turnover rates, productivity rates, health insurance claims rates, and other data.

The data collection phase is well underway and should be completed by the end of 1997. Analysis of the data will help determine whether this legislative approach is a viable alternative to current proposals for prohibiting genetic discrimination in employment.

The Genetics Adjudication Resource Project

Franklin M. Zweig, Ph.D., J.D.

The Genetics Adjudication Resource Project (GARP) has been funded by the Ethical, Social and Legal Issues (ELSI) Program of the DOE Human Genome Project to provide foundation education for 1,000 judges of federal and state courts in genetics, molecular biology and biotechnology. The Project's over-arching objective is to familiarize judges, who have little scientific background, with the concepts, research findings, and vocabulary they are likely to encounter in civil lawsuits and criminal prosecutions. Expected evidence flows from parties' pleadings and from expert witness testimony. ELSI matters flow from the conferences' active, case-based, problem-solving curriculum.

The project's operational objective is to provide judges with tools by means of which to exercise their gatekeeping duties for scientific evidence. Those duties include, but are not limited to, determination of the fitness of the evidence for presentation to juries.

The Conference series was preceded by four preparatory "working conversations" ("WC's") intended to perfect a durable, effective, and adaptable educational technology template. One of the "WC's" was hosted by the Lawrence Livermore National Laboratory in November, 1996. The first large scale conference of 106 judges and 40 faculty was held in May, 1997 at Airlie House, Airlie, Virginia for judges of courts located in the Greater Washington, D.C. region.

The Project has been approved for conduct of six conferences in 1997 and 1998. Included are the National Capital Area, New England, Chicago, and Mountain West regions, and two conferences scheduled for June 27-July 3 and August 1-6 at Orleans, Cape Cod. Three additional conferences are on the drawing boards for the Mid-Atlantic region in 1998 and in 1999 for the Southeast regions, and for the California State Courts.

The GARP has also produced a Handbook for Judges for Cases Involving Genetics, Molecular Biology and Biotechnology Evidence. Under production is a nine cassette videofilm primer. Another primer has been produced by EINSHAC within the courts' own publication system. The Judges' Journal, a quarterly published by the American Bar Association, produced a dedicated theme issue late Summer, 1997 entitled "Genetics in the Courtroom." This primer is divided into scientific and adjudication perspectives.

Key to the GARP's conferencing success is the recruitment, training and deployment of science advisors. Scientists reside with the judges and the informal interaction is a key supportive device in making this unfamiliar subject material patent to lay judges. DOE scientists and ELSI personnel have been instrumental in this effort. GARP deploys one scientist for every four judicial participants. GARP has developed a training program for scientific faculty so that differences in communication content and style rooted in the judicial culture can be anticipated and managed. The training strategy is based upon the different professional paradigms (mental worlds unique to adjudication and to science) highlighted by the literature and experienced in the 1995 and 1996 WC's.

Science advisors are now being sought for the 1998 and 1999 judicial conferences. Six hour training seminars will be scheduled conveniently. GARP promises an interesting experience with an avid group of learners whose day-to-day responsibilities can make important contributions to our society's adaptation to advances in human genetics.

(DE-FG02-96ER62170, conducted by EINSHAC, the Einstein Institute for Science, Health and the Courts, Washington, D.C.)

Upstream Patents and Downstream Products: A Tragedy of the Anticommons?

Michael A. Heller & Rebecca S. Eisenberg¹
University of Michigan Law School

The past twenty years have witnessed significant privatization of pre-market or "upstream" stages of biomedical research. In an earlier era, such research was typically performed in the public sector and made freely available in the public domain to "downstream" users. Today, public and private institutions are increasingly likely to patent biomedical research discoveries that are several steps removed from end product development. Patent rights have been credited with a conspicuous increase in private funding for biotechnology research. Yet paradoxically, a proliferation of patent rights in upstream discoveries could create barriers to the development of new pharmaceutical products further downstream in the R&D process.

In this article we propose a theory of anticommons property to explain how too many upstream intellectual property rights may lead to too few downstream products. Anticommons property may be understood as the mirror image of commons property. In the familiar tragedy of the commons, too many owners have the privilege to use scarce resources, and the property is prone to overuse. By contrast, in a tragedy of the anticommons, too many owners have the right to exclude others from using a scarce resource, and the resource is prone to underuse. Anticommons property may arise whenever governments define new property rights that are too fragmented. Empty storefronts in Moscow provide one stark example of this phenomenon. Transition regimes charged with privatization have endowed multiple owners with overlapping rights in each storefront, so no owner holds a useable bundle of rights to convey to an entrepreneur who wishes to set up shop. As a result, scarce property is wasted. Once an anticommons emerges, collecting rights into useable bundles can be brutal and slow.

Like post-socialist transition regimes, intellectual property systems are constantly creating new property rights. An anticommons could arise if multiple owners obtain fragmented patent rights that are difficult to assemble into usable bundles. The anticommons model provides one way of understanding a widespread intuition that issuing patents on gene fragments makes little sense. Depending on their scope, patent rights in gene fragments could lead to the emergence of a genomic anticommons in which a product developer requiring use of a full-length gene or a set of polymorphic markers useful in diagnosing disease might be stymied by the costs of bundling licenses from many patent owners. Up to a point, privatization may enhance efficiency by spurring investment in upstream research. But privatization can go astray. When the legal system creates too many fragmented intellectual property rights, a tragedy of the anticommons could block the development of new pharmaceutical products.

¹ Supported by DOE Grant No. DE-FG02-94ER61792.

The Science and Issues of Human DNA Polymorphisms

David Micklos, Mark Bloom, Scott Bronson, and John Kruper
DNA Learning Center, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724
micklos@cshl.org

This ELSI training program introduces high school biology faculty to a laboratory-based unit on human DNA polymorphisms, which provides a uniquely personal perspective on the science and ELSI aspects of the Human Genome Project. By targeting motivated biology faculty who currently perform student laboratories with viral and bacterial DNA, this program offers a cost-effective means to bring high school biology education up-to-the-minute with genomic biology. In October-November 1997, we are instructing the first of 12 workshops nationwide at Mt. Sinai School of Medicine (New York), Boston University School of Medicine (Massachusetts), and Cañada College (California).

The program is based on lab and computer technology, developed at the DNA Learning Center and the University of Chicago, that makes human DNA "fingerprinting" by polymerase chain reaction (PCR) accessible and affordable for high school use. Program participants learn simplified lab techniques for amplifying two types of chromosomal polymorphisms: an Alu insertion (TPA-25) and a VNTR (D1S80). These polymorphisms illustrate the use of DNA variations in disease diagnosis, forensic biology, and identity testing, and provide a starting point for discussion of the uses and potential abuses of genetic technology.

Workshop participants amplify their own polymorphisms from DNA obtained from rapid preparations of buccal cells and hair sheaths. The loci are amplified using rapid cycling profiles and analyzed on agarose gels. To further reduce the cost of experiments, we have developed the Biogenerator, the first inexpensive ($900) thermal cycler licensed for precollege use. This "Rube Goldberg" thermal cycler articulates with a Macintosh computer and gives results comparable to commercial machines.

Using a facility at our WWW site (darwin.cshl.org), the Alu insertion data are further used as an entree into human population genetics and genome diversity. The "Student Allele Database" has forms for entering student-generated data, as well as archival data from populations around the world. Several statistical functions are available: testing Hardy-Weinberg equilibrium within a single population, measuring genetic distance between two populations, and comparing two populations using contingency chi-square. A "Monte Carlo" generator shows the effect of genetic drift in small populations.

During the summer, we developed reliable methods for generating mitochondrial DNA sequence from buccal and hair samples. We hope to introduce this technology at as many DOE workshop sites as possible. Workshop participants can then use their own mitochondrial DNA sequence as an entree to modern bioinformatics. Our WWW site has a step-by-step template for analyzing mitochondrial DNA sequence, including similarity searches, multiple sequence alignments, a recreation of the Neandertal DNA analysis, and the identification of the Romanov family remains. Ultimately, we envision students preamplifying their mitochondrial DNAs and performing dye terminator reactions at their own schools. The ready-to-sequence DNAs would then be sent to regional centers for sequencing and the results would be posted by Internet.

Thus, we are striving to develop a robust and accurate analog of human genome research that allows students to use their own chromosomal and mitochondrial DNA polymorphisms as the basis of explorations into contemporary genomic biology.
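
Two of the statistical functions described for the Student Allele Database, the Hardy-Weinberg test and the "Monte Carlo" drift generator, can be illustrated with a short sketch. The abstract does not say how the site implements them, so what follows is only one plausible reading: the function names `hardy_weinberg_test` and `simulate_drift` are hypothetical, the test is the textbook chi-square with one degree of freedom for a two-allele locus such as an Alu insertion (+/-), and the drift model is assumed to be a simple Wright-Fisher resampling of alleles each generation.

```python
import math
import random


def hardy_weinberg_test(n_pp, n_pm, n_mm):
    """Chi-square test of Hardy-Weinberg proportions at a two-allele locus.

    Arguments are observed genotype counts (+/+, +/-, -/-).
    Returns (frequency of the "+" allele, chi-square statistic, p-value).
    Illustrative only; not the Student Allele Database's actual code.
    """
    n = n_pp + n_pm + n_mm
    p = (2 * n_pp + n_pm) / (2 * n)      # frequency of the "+" allele
    q = 1.0 - p                          # frequency of the "-" allele
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_pp, n_pm, n_mm)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # With one degree of freedom, the chi-square upper tail probability
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


def simulate_drift(p0, pop_size, generations, rng):
    """Wright-Fisher drift: resample 2 * pop_size allele copies per generation.

    Returns the list of allele frequencies, starting with p0.
    The model choice is an assumption made for this sketch.
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        copies = sum(1 for _ in range(2 * pop_size) if rng.random() < p)
        p = copies / (2 * pop_size)
        freqs.append(p)
    return freqs


if __name__ == "__main__":
    # Made-up class data: 30 +/+, 40 +/-, 30 -/- students.
    print(hardy_weinberg_test(30, 40, 30))
    # Allele frequency trajectory in a small population of 10 individuals.
    print(simulate_drift(0.5, 10, 20, random.Random(1)))
```

Running the drift function several times with a small `pop_size` shows trajectories wandering toward 0 or 1, which is the effect the "Monte Carlo" generator is said to demonstrate; an allele that reaches either value stays there.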

Implications of the Geneticization of Health Care for Primary Care Practitioners*

Mary B. Mahowald, John Lantos, Mira Lessick, Robert Moss, Lainie Friedman Ross, Greg Sachs, Marion Verp
Department of Obstetrics and Gynecology and MacLean Center for Clinical Medical Ethics, University of Chicago, Chicago, IL 60637
[email protected]

Phase I (fall 1995): Generic topics in genetics in primary care presented to broad audience
Ten sessions: Goals, methods, & achievements of HGP; Typology of genetic conditions; Scientific, clinical, ethical, and legal aspects of gene therapy; Concepts of disease, Genetic Disabilities; Gender and socio-economic differences; Cultural and ethnic differences; Directive or nondirective genetic counseling.
Transcripts of presentations prepared for revision by authors

Phase II (Jan.-Mar. 1996): Grand rounds in specific areas of primary care
Topic: What every general practitioner should know about the new genetics
• Pediatrics - Stephen Friend
• Obstetrics/gynecology - Joe Leigh Simpson
• Medicine - Tom Caskey
• Family medicine - Loralane Lindor
• Nursing - Colleen Scanlon
Transcripts of presentations prepared for revision by authors. Bibliography developed on generic issues in genetics, and issues specific to areas of primary care

Phase III (Apr. 1996)
Policy issues presented by Sherman Elias and George Annas
Syllabi and chapters developed by each primary care team + policy team

Phase IV (Oct.-Dec. 1996)
Series on genetics in primary care for Clinical Ethics fellows and Robert Wood Johnson Clinical Scholars, presented by each team of primary caregivers (Co-PI + fellow)
Teaching sessions for faculty, house staff, students led by Co-PIs + fellows

Phase V (April 1997)
National Conference on The New Genetics in Primary Care, keynoted by Victor McKusick, with CME/CNE workshops for primary caregivers

Outreach (Summer 1997 and beyond)
CME conference, AMA planning for conference on genetics in primary care, Teaching sessions on genetics for new clinical ethics fellows, Preparation of materials for publication

* Supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research, U.S. Department of Energy, DE-FG02-95ER61990

Confidentiality Concerns Raised by DNA-Based Tests in the Market-Driven Managed Care Setting

J.S. Kotval, D. Dewar, and S. Brynildsen
School of Public Health, University at Albany, One University Place, Rensselaer, NY 12144-3456
E-mail: [email protected]

Previous policies to protect the confidentiality of DNA-based tests have centered on limiting the dispersion of genetic information. In general, the goal has been to keep information from passing outside a health care institution to third parties that might discriminate against patients in employment, access to health care and other areas of civic life. We are focusing our attention on the threats to patient welfare created within the setting of the market-driven managed care organization (MCO). This setting presents unique ethical dilemmas, since physicians (and, often, personnel at testing laboratories) are employees of the MCO, and since the payor and provider functions are contained within the same entity. In this context, the institutional imperative of cost savings in order to capture market share could lead to discrimination in health care access (either through outright denial of enrollment or prohibitive premiums) if DNA-based tests reveal the likelihood of future high-cost illnesses. Our group -- which spans the

disciplines of genetics, medical ethics, health economics and the law -- seeks to (i) formulate an ethical construct for medical confidentiality with a view to defining its core ethical functions and its limitations in the changing health care system; (ii) examine the practices and institutional imperatives of market-driven MCOs to understand the context in which the DNA-based tests would be used and to assess the cost of confidentiality measures; (iii) identify gaps in the policy framework that could allow misuse of confidential medical information within the MCO; and (iv) make recommendations to remediate these gaps in policy in a manner that is applicable and practical to MCOs. Progress to date includes a preliminary formulation of confidentiality concerns raised in the market-driven managed care setting and the design of a survey instrument to assess the institutional and marketing practices of MCOs. We are paying specific attention to the current practices of MCOs that encroach on the core ethical concerns of medical confidentiality.

Getting the Word Out on the Human Genome Project: A Course for Physicians

Sara L. Tobin, Ph.D., M.S.W.* and Ann Boughton#
*Stanford University and #Thumbnail Graphics
[email protected] and [email protected]

The knowledge gained from the Human Genome Project has the potential to correlate molecular diagnosis with effective treatments and to lead the way to novel medical interventions. The molecular tools that have emerged from genetic studies are changing the face of medical practice and inaugurating a transitional period that will be uncomfortable for both physicians and the public. There will be marketing pressures, health care industry changes, uneven supporting resources, variable training of physicians, and limited public understanding. We have designed an interactive, multimedia CD-ROM course to ease this transition for the majority of physicians, who have received little or no training in clinical applications of molecular genetics because the field has developed so recently. The courseware, entitled "The New Genetics: Courseware for Physicians. Molecular Concepts, Applications, and Implications," will provide accredited continuing education for practicing physicians through the Stanford University Office of Postgraduate Medical Education.

It is important for physicians to understand the modern clinical applications of molecular genetics for several reasons. First, physicians will be explaining genetic tests and their implications to their patients, selecting specific tests, and interpreting the results. Second, physicians must be able to work productively with other health professionals, including genetic counselors and psychologists. Third, standards of practice that will govern the application of molecular genetic diagnosis to patient care are currently under development, and physicians need to contribute to this evolution. Fourth, a lack of training in modern genetics prevents many physicians from understanding much current medical research. Finally, physicians with training in molecular medical genetics can serve as informed resource persons and enhance the level of public understanding of and appreciation for the Human Genome Project in their communities.

The CD format confers multiple advantages for continuing medical education. The delivery of course materials via CD-ROM frees the physician from a presentation schedule and does not require travel and time away from a busy practice. CDs function as an improved teaching resource because of their interactivity and multimedia capability. We have designed a streamlined, accessible navigational system tailored to the needs of the busy (and possibly computer-naive) physician. Engaging interactive features and animations have been created to convey complex concepts. The course content is supervised by a Board of Advisors. While the Internet is currently too slow to serve as the primary delivery system, we are programming the CD to accept updates and supplements from the Internet or from a subscription floppy.

The development of a prototype version of the courseware is funded by the Department of

Energy, and our current draft version of the courseware will be demonstrated at the Workshop.

Electronic Scholarly Publishing: Foundations of Classical Genetics

Robert J. Robbins
Fred Hutchinson Cancer Research Center, 1124 Columbia St., LV-101, Seattle, WA 98104
[email protected]
http://www.esp.org

With the continuing success of the Human Genome Project (HGP), more and more people are interested in the project and wish to understand its results. Since the HGP has significant ethical, legal, and social implications for all citizens, the number of individuals who do, or should wish to become familiar with the project is high. In addition to its importance in the training of professional geneticists, the HGP is of special relevance for undergraduate training in basic biology, and even for high-school and other K-12 education.

Interest, however, is not enough. Real understanding of the results of HGP research requires some familiarity with basic genetics notions. In particular, most of the methods and findings of molecular genetics are essentially inaccessible to those without appropriate training. On the other hand, both the methods and results of early work in classical genetics can be appreciated by virtually anyone: cross a black mouse with a white mouse, count the progeny of different colors, then try to figure out what might be going on.

A familiarity with the basic notions of classical genetics is essential for a simple appreciation of the significance of the HGP, and a more detailed knowledge of classical genetics (including an appreciation of the question, What is the chemical nature of the gene?), can provide a basis for the genuine understanding of HGP findings. Gaining this basic understanding of classical genetics is becoming more difficult, even for science majors. The runaway success of modern molecular genetics is driving the detailed presentation of classical genetics out of most text books, and the original literature is increasingly difficult to obtain.

To address these problems, we have established an electronic educational resource at which classic literature (both papers and monographs), that established the foundations of modern genetics, is being republished on-line, freely accessible to all. Works are being made available in a variety of formats including simple HTML, Adobe PDF files containing high-quality typeset republications, and Adobe PDF files containing image facsimiles of the original publication. The fundamental work of Gregor Mendel, for example, is available as both a typeset republication and as an image facsimile of the original 1865 publication.

Funding for the project was received earlier this year, and work thus far has involved (1) moving the original prototype site from Johns Hopkins to Seattle, (2) redesigning the site to make it easier to understand and use, and especially to prepare it to handle more data and more traffic, (3) establishing high-efficiency systems for converting paper publications into electronic form, and (4) acquiring material for republication. When full-scale publication begins in January, we anticipate publishing the equivalent of a 25-page paper every day.

The Community College Initiative

Sylvia J. Spengler* and Laurel Egenberger**
*Life Sciences Division, **Center for Science and Engineering Education, E. O. Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720
[email protected]

The Community College Initiative prepares community college students for work in biotechnology. A combined effort of Lawrence Berkeley National Laboratory (LBNL) and the California Community Colleges, we aim to develop mechanisms to encourage students to pursue science studies, to participate in forefront laboratory research, and to gain work experience. The initiative is structured to upgrade the skills of

students and their instructors through four components.

Summer Student Workshops: Four-week summer residential programs for students who have completed the first year of the biotechnology academic program. Ethical, legal and social concerns are integrated into the laboratory exercises and students learn to identify commonly shared values of the scientific community as well as increase their understanding of issues of personal and public concern. In the two year period of the grant, we have involved over twenty students. Students in the second summer have had laboratory internships.

Teacher Workshop Training: Seminars for biotechnology instructors to improve, upgrade, and update their understanding of current technology and laboratory practices, with emphasis on curriculum development in current topics in ethical, legal, and social issues in science. These workshops have involved the students as well.

Sabbatical Fellowships: For community college instructors to provide investigative and field experience in research laboratories. During the fellowship, teachers also assist in development of student summer research activities.

This work was supported by the Director, Office of Energy Research, Office of Biological and Environmental Research, Human Genome Program, ELSI program, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.

The Hispanic Educational Genome Project

Margaret C. Jefferson, Mary Ann Sesma, and Patricia Ordonez
Department of Biology & Microbiology, California State University, Los Angeles CA 90032
[email protected]

The primary objectives of this grant are to develop, implement, and distribute culturally competent, linguistically appropriate, and relevant curricula that lead to Hispanic student and family interactions regarding the science, ethical, legal, and social issues of the Human Genome Project. By opening up channels of familial dialogue between parents and their high school students, entire families can be exposed to genetic health and educational information and opportunities. In addition, greater interaction is anticipated between students and teachers, and parents and teachers.

Each participating high school has taken different approaches to exposing students and parents to the science and ELSI of HGP. Some schools have divided students into research teams from various levels of biology curricula with each team analyzing small segments of their own DNA. Other schools have science classes following the BSCS HGP-ELSI curricula or components of the University of Washington High School Human Genome Program. Still other schools have utilized the materials that were developed by various Los Angeles Unified School District science teachers. Several other classes (e.g., Spanish classes, English classes, Journalism classes, and Social Science classes) at each participating school have developed ELSI newsletters for distribution to the parents. In addition, we have math and computer science classes from one of the schools helping in the construction of our web page. Each participating school is also expected to have parent focus groups which usually meet once per month in the evening to discuss various genetic health issues and implications of HGP.

Genes, Environment, and Human Behavior: Materials for High School Biology

Joseph D. McInerney and Michael J. Dougherty
5415 Mark Dabling Blvd., Colorado Springs, CO 80918-3842
[email protected]

The Biological Sciences Curriculum Study (BSCS) has begun the development of an instructional module on genetics and human behavior for use in introductory high school biology. The module will rectify the deficient treatment of the biology of behavior in the current curriculum and will help to

dispel misconceptions about genes and human behavior that often pervade media reports of research in this area. The materials also will address some of the ethical, legal, and social issues generated by research into the biological basis of behavior and will help to change traditional assumptions about the teaching of genetics at the high school level.

The project employs the process of curriculum development that BSCS has refined continually since the inception of the organization in 1958. In addition, development of the module is drawing upon the experience BSCS acquired during the development, distribution, and implementation of three genome-related instructional modules between 1991 and 1996. This experience includes writing conferences, pilot and field testing of draft materials, and periodic reviews of progress by members of the education committees of the National Society of Genetic Counselors, the American Society of Human Genetics, the Council of Regional Networks for Genetic Services, and other independent experts in genetics.

In late July 1997, BSCS completed the first of two writing conferences during which experts in behavioral and medical genetics, ethics, and high school biology teaching produced a draft module containing instructional activities for students and extensive background materials for teachers. The draft instructional activities are designed to help students move through the following ideas:

1. variation in behavior exists in populations;
2. there are genetic and environmental components to this behavioral variation;
3. scientists have methods for investigating the source of differences in human behavior;
4. these methods have strengths and limitations; and
5. there are ethical, social, and legal implications to understanding that genes influence behavior.

To test the effectiveness of the draft module in helping students understand how genes and environment influence human behavior, BSCS pilot tested several activities in October. Following analysis of these data, BSCS will refine the materials, conduct a complete field test, convene a second writing conference, produce the final module, and distribute the module free of charge to all interested high school biology teachers.

Microbial Literacy Collaborative

Cynthia Needham
Lahey Hitchcock Medical Center, Burlington, MA
[email protected]

The Microbial Literacy Collaborative is a partnership of organizations dedicated to enhancing public understanding of microorganisms and the roles they play in sustaining the planet. Partners include the American Society for Microbiology, Baker & Simon, Oregon Public Broadcasting, the Association of Science and Technology Centers, and the American Association for the Advancement of Science. The MLC's initiatives include

• Intimate Strangers: Unseen Life on Earth, a television series airing in 1999
• Formal college and precollege educational programs, designed specifically to support and enhance teaching in settings where educational resources are limited
• Informal learning programs specifically targeted to youths at risk developed and disseminated in conjunction with the Association of Science Technology Centers (ASTC) Youth Alive! project, and the American Association for the Advancement of Science (AAAS) Black Churches Project.
• An interactive Website for linking scientists with students, their parents and teachers and fostering wide access to resources developed through the preceding initiatives.

The scientific messages delivered through these projects will promote a balanced view of our interactions with our microbial partners on the planet. One of the MLC's primary goals is to dispel public anxiety about microorganisms which has been created through intensive media focus on microbes as disease agents. Viewers and participants will be introduced to broad concepts such as

• The important microbial role as bioremediators and how we can use their natural processes to help clean up our environment,
• The biotechnological potential for microbial products and processes and how they can better our daily lives
• The microbial basis of all ecosystems and how we can sustain our environment while still developing the economic potential of our natural resources
• The linkages between global climate changes and microbial processes and how increasing our understanding of these linkages can influence public policy
• New strategies for understanding and controlling infectious diseases and how we can take advantage of these new developments to improve the human condition throughout the world.

Project Status:

The three initiatives of the MLC received a guarantee for complete funding in May, 1997. The intellectual framework for the science documentary was established at a seminal meeting of the Science Advisory Group held in Woods Hole, MA, three years prior to funding. The production staff officially began work Oct. 6 and are presently participating in "Microbiology school" and beginning to formulate story lines for the four part series to reflect the scientific content.

Planning for the Annenberg sponsored telecourse began mid-summer, with a meeting of the Science Education Advisory group at OPB. The group is nearing completion of curricular content and learning objectives for an accredited college level telecourse, both of which will be finalized at an upcoming meeting in October.

The Advisory Group to the informal set of initiatives met in September for a 3 day planning meeting to discuss the hands on activities that will be developed to accompany each of their dissemination plans.

The official kickoff meeting for the MLC will take place at the end of the month at Mt. Hood, Oregon, where all parties will come together to validate timelines and recognize who the responsible parties are for various aspects of each of the three initiatives.

High School Students as Partners in Sequencing the Human Genome

Maureen Munn, Leroy Hood
Department of MBT, University of Washington, Box 352145, Seattle, WA 98195
[email protected]

The High School Human Genome Program (HSHGP) encourages high school students to think constructively about the scientific and ethical issues of genomic research by enabling them to participate in both. This program supports many of the teaching objectives presented in the National Science Education Standards (1996), including meeting the content standards for genetics education, teaching science through inquiry and developing a learning community of teachers and scientists to promote better science education. Participating students learn about many career options in science through discussions with scientist mentors who assist during classroom experiments, and the lab experiences help to prepare them for future employment.

A. Program modules. The DNA Synthesis experiment is an introductory experiment that helps students learn about DNA structure and replication and develop their laboratory skills. During the DNA Sequencing experiment, students sequence a region of chromosome 5 that is involved in a hereditary form of deafness. This project is made possible through a collaboration with Eric Lynch and Mary-Claire King from the Departments of Genetics and Medicine at the University of Washington. The Ethics unit, which focuses on presymptomatic testing for Huntington's disease, helps students develop the skills to define ethical issues, ask and research relevant questions about a particular topic and make justifiable ethical decisions. The module was developed by Sharon Durfy and Robert Hansen from the UW Department of Medical History and Ethics.

B. Teacher Preparation and Classroom Implementation. Local, regional and national teachers attend a week-long summer workshop, which provides training in program modules, informal seminars and discussions of relevant topics.

C. Equipment Kit Loans. During the academic year, local teachers are provided with the necessary equipment, supplies and technical assistance to carry out the classroom experiments. Teachers from distant sites receive DNA templates and primers and ongoing technical advice. This program is currently serving 32 high school teachers at fifteen schools in Washington state and 13 other teachers nationally.

D. The HSHGP web-site (http://hshgp.genome.washington.edu). This site is intended as a resource for teachers and students everywhere and contains the following:

Program modules. These are available on-line or in a downloadable version.

On-line DNA assembly and data analysis. This tutorial enables students to carry out the assembly of the sequencing data from the web-site, using the demonstration version of the DNA assembly program, Sequencher, and a folder of student data files.

Virtual DNA sequencing. This tutorial enables classrooms that are unable to do the sequencing experiment to do many aspects of the sequencing process by providing scans of student sequencing ladders and highlighting the portions of our teaching modules that can be used to simulate DNA sequencing.

Future tutorials on the Web-site:

On-line research projects. In the future, we will develop tutorials that help students access DNA and protein databases and related analysis software and understand what the analysis software is doing. Once students understand these tools, they will be able to design and explore their own research problems.

Exploring ethical issues related to genomic research. We plan to develop additional modules that emphasize how the decision making process can be used to examine any ethical issue.

Communication among program participants. Discussion boards will be set up so that participants can discuss technical problems and solutions, ask research questions and exchange classroom tips.

E. Program Evaluation: Preliminary results of program evaluation will be discussed.

Your World/Our World - Exploring the Human Genome: Creating Resource Materials for Teachers

Jeff Alan Davidson
Pennsylvania Biotechnology Association, Alliance for Science Education
PA_Biotech@Compuserve.com

The Pennsylvania Biotechnology Association (PBA) in cooperation with the Alliance for Science Education (ASE) publishes the biotechnology science magazine YOUR WORLD/OUR WORLD to introduce middle and high school students to the underlying science and the social issues raised by modern biological research and technology. In the Spring of 1996, with partial DOE funding, a special enlarged issue of YOUR WORLD/OUR WORLD dealing with the underlying science of genomics, and the ethical, legal, and social issues raised by the Human Genome Project (HGP) was published.

PBA and ASE are now creating additional instructional materials for use by middle and high school students to facilitate a more extensive presentation of the subjects covered in the special issue. These materials are being built in two phases. First, by assembling and reviewing for usefulness materials from other publishers and by continuing the development of new materials by PBA to create a comprehensive supplemental plan materials package that provides resources in several different media. Second, by running a national contest for science teachers and students to encourage classroom development of new and original approaches to teaching the material. Materials from both phases will each in turn be packaged and made available to the 45,000 middle

school and high school biology teachers in the United States over the next 24 months.

This project is targeted at middle and high school teachers and students for several reasons: most of the biological information studied and learned in the United States occurs at this level, students are generally very interested in biology and science at these levels, and teachers can greatly assist students in considering this material, but need more support in teaching about the HGP and ELSI.

Phase I Materials are expected to include:

• A Further Exploration of the Science and Ethics, Legal and Societal Issues of Genomics
• Reference Materials for Genomics
• Teaching Plans, Colorful Overheads & Assessment Materials
• Biological Animations (Computer Animations available on Video & CD that illustrate important ideas)
• Experiments, Activities and Demonstrations
• Science Fair or Research Project Ideas
• Theatrical Vignettes that explore important questions in ELSI of HGP

Phase II Materials are expected to include:

• Teaching Plans
• Graphics, Art, Photographs
• Science and ELSI Text & Animations
• Additional Experiments, Demonstrations, & Activities
• Multimedia Presentations and Graphics, Web Sites, Games
• Role Playing Exercises
• Science Fair or Research Project Suggestions
• Plays or Literature - written or performed and videotaped
• Articles for the General Press or Radio or Television

Human Genome Teacher Networking Project

Debra L. Collins, M.S., and Rebecca Knetter, Ph.D.
Genetics Education Center, University of Kansas Medical Center, Kansas City, KS 66160-7318
[email protected]

Families, health care providers, and the general public are all increasingly aware of the human genome project discoveries. However, many do not have a background on basic genetic information, and therefore are not aware of, or prepared for, the ethical, legal, and social implications of this new technology and the applications. Our program helps prepare high school students for their future through updated information and resources from their biology teachers. Each teacher in this project spent time in educational activities over a two-year period, learning about updated information through two one-week workshops, preparing updated lesson plans, presenting peer teacher programs, and networking with other teachers, genetic professionals, and ELSI experts.

During 1993-1997, 177 teachers attended a series of Human Genome Teacher Networking education workshops which addressed the applications of Human Genome Project technology, with a focus on ethical, legal, and social implications (ELSI). After the teachers attended summer workshops, they used new materials with students, then conducted peer and community education programs, and contacted genetic and ELSI experts to enhance their classroom teaching. The networks which developed between teachers, liaisons with genetic professionals, peer teacher programs, and on-line computer communications helped educators and their students obtain current human genetics information.

We analyzed the improvement in teacher confidence and preparedness and measured student achievement as a result of the summer workshops. Following the workshops, teachers were prepared and confident teaching complex genome technology and applications (p < .05). They acquired new information to expand their knowledge of human genetics and integrate the complex information into

existing science curricula. Some teachers developed new school courses. Student achievement was significantly increased (p < .05) due to teacher attendance at the workshops. The teachers gained a new awareness of the scientific as well as the personal aspects of the genome information.

Participants were prepared to help students understand the ramifications of HGP discoveries and readily access information on many aspects of the Human Genome Project, including decisions regarding genetic testing. Their students scored significantly higher (p < .05) on a survey of knowledge than comparable students whose teachers did not attend the workshop.

Teachers, as part of their biology curricula, can integrate genome project concepts, and help students understand the ELSI issues which will be important in their future. Teachers are enthusiastic about education workshops, they increase the amount of curricula time devoted to genome/ELSI projects, and can present information at an appropriate pace for students. The students' science literacy is increased on timely topics, and they have knowledge of internet and other resources to answer new questions, not available in current published textbooks.

Workshop resource materials, lesson plans, the mentor network, and teachers who agreed to have their names listed on the internet are available through the web site for this project: http://www.kumc.edu/gec (Genetics Education Center). Links are provided to other ELSI sites, career information, and other genetic resources.

Funded by DOE #DE-FG02-92ER61392

A Question of Genes

Noel Schwerin
NoelEye Documentaries, 1327 Church Street, San Francisco, CA 94114
415.282.5620
[email protected]

A two-hour national PBS special (aired September 16, 1997), A Question of Genes looks for the first time at the ethical, social and legal implications of genetic testing. A Question of Genes enters the lives of a few individuals and families as they confront genetic risks for conditions like heart disease, Alzheimer's, breast cancer and genetic birth defects. A Question of Genes explores the profound challenges genetic information makes to a person's sense of self, family and future. Closely observing regular people over several years, A Question of Genes takes us inside the decisions and dilemmas of a range of personalities and perspectives: a woman who had a preventive mastectomy after losing three sisters to breast cancer tries to make sense of her just-discovered results; a poor African-American mother struggles to know more about the genetic legacy to her daughter; a physician who administers genetic tests wrestles with ethical dilemmas about his patients' privacy and rights; a pregnant woman makes startling decisions about the fate of her unborn children.

In seven stories told by the participants themselves, A Question of Genes captures both the profound emotional drama as well as the enormous social, legal and ethical implications of the powerful new technology of genetic testing. A Question of Genes was produced and directed by Noel Schwerin for the Chedd/Angier Production Company and Oregon Public Broadcasting.

The DNA Files: A Nationally Syndicated Series of Radio Programs on the Social Implications of Human Genome Research and its Applications*

Bari Scott and Jude Thilman
SoundVision Productions, 2991 Shattuck Ave., Ste. 304, Berkeley, CA 94705
[email protected]

The DNA Files is a series of nationally distributed public radio programs furthering public education on developments in genetic science. Program content is guided by a distinguished body of advisors and will include the voices of prominent genetic researchers, people affected by advances in the clinical application of genetic medicine, members of the biotech industry, and others from related fields. They will provide real-life examples of the complex social and ethical issues associated with new discoveries in genetics. In addition to the general public radio audience, the series will target educators, scientists, and involved professionals. Ancillary educational materials will be distributed in paper and digital form through over collaborative organizations and in fulfillment of listener requests.

With information linking major diseases such as breast cancer, colon cancer, and arteriosclerosis to genetic factors, new dangers in public perception emerge. Many people who hear about them mistakenly conclude that these diseases can now be easily diagnosed and even cured. On the other end of the public perception spectrum, unfounded fears of extreme, and highly unlikely, consequences also appear. Will society now genetically engineer whole generations of people with "designer genes" offering more "desirable physical qualities"? The DNA Files will ground public understanding of these issues in reality.

A live, two-hour call-in show will cover the broad scope of human genetic research and its applications. It will include a discussion with experts, and a chance for listeners at home to make comments and ask questions.

Nine one-hour documentaries will provide the basic science of DNA, genes and heredity, while illustrating the accompanying social and ethical issues. "DNA and the Law," for example, reviews the scientific basis for genetic fingerprinting and looks at cases of alleged genetic discrimination by insurance companies, employers and others. "Gene Therapy: Medicine for Our Genes" takes on popular descriptions of genetic therapy, derived from stories like Jurassic Park, and attempts to help us realistically understand the science, as well as its promise, limits, and social implications. Other shows include "The Commercialization of Genetics," "Prenatal Genetic Testing: Better Babies Through Science," and "Genetic Evolution, Diversity and Kinship."

*Supported by ELSI grant DE-FG03-95ER62003 from the Office of Health and Environmental Research of the U.S. Department of Energy.

Human Genome Project Education & Outreach Component

Molly Multedo
Self Reliance Foundation, 121 Sandoval Street, 3rd Floor, Santa Fe, NM 87501
(505) 984-0080

Self Reliance Foundation (SRF), in collaboration with Hispanic Radio Network (HRN) and the National Center for Genome Resources (NCGR), has developed Spanish radio programming and outreach services which will help inform Hispanics on the ethical, legal, and social issues related to the Human Genome Project and motivate them to access the resources available for further education on these issues. Funding by DOE-ELSI supports the production of 50 new episodes of "Buscando la Belleza" (BB) for broadcast over the period of two years; extensive, national outreach and referral services and a linked Web site.

Infrastructure

Human Genome Management Information System

Betty K. Mansfield, Anne E. Adamson, Denise K. Casey, Sheryl A. Martin, John S. Wassom, Judy M. Wyrick, Laura N. Yust, Murray Browne, and Marissa D. Mills
Life Sciences Division; Oak Ridge National Laboratory; 1060 Commerce Park; Oak Ridge, TN 37830
[email protected]

The Human Genome Management Information System (HGMIS), established in 1989 by the Human Genome Program Task Group of the DOE Office of Biological and Environmental Research, helps DOE fulfill its commitment to informing scientists, policymakers, and the public about the program's funded research. Through its communication role, HGMIS increases the use of resources generated in the Human Genome Project, reduces duplicative research efforts, and fosters collaborations and contributions to biology from other research disciplines.

HGMIS products, including the Web sites and newsletter, have won technical and electronic communication awards. In May 1997, DOE acknowledged the newsletter's value by presenting an exceptional service award to HGN's managing editor at a symposium celebrating 50 years of biological and environmental research.

Communicating scientific and societal issues to nonscientist audiences contributes to increased science literacy, thus laying a foundation for more informed decision making and public-policy development. For example, since 1995 HGMIS has been participating in a project to educate judges on the basics of genetics and gene testing to prepare for the coming flood of cases involving genetic evidence.

Information Resources

In keeping with its goals, HGMIS produces the following information resources in hard copy and on the Web:

Publications. A quarterly forum for interdisciplinary information exchange, Human Genome News (HGN) uniquely presents a broad spectrum of genome-related topics to a widely diverse body of 14,000 domestic and foreign HGN subscribers. HGMIS also produces the DOE Primer on Molecular Genetics, progress reports on the DOE Human Genome Program, Santa Fe contractor-grantee workshop proceedings, 1-page topical handouts, and other related resource documents.

Document Distribution. In addition to HGN, HGMIS has distributed more than 65,000 copies of items requested by subscribers, meeting attendees, and managers of genetics meetings and educational events.

Electronic Communication. Since November 1994, HGMIS has produced a comprehensive, text-based Web server called Human Genome Project Information, which is devoted to topics relating to the science and societal issues surrounding the genome project. The HGMIS Web sites contain more than 1700 text files that are accessed over 1.2 million times a year. Each month, about 10,000 host computers connect directly to the HGMIS site and through more than 1000 other Web sites.

All HGMIS documents are published on the Web site, along with other DOE-sponsored documents pertaining to the Human Genome Project. Collaborating with the Einstein Institute for Science, Health, and the Courts, HGMIS helps to produce CASOLM, the online magazine for judicial education in genetics and biomedical issues. HGMIS also maintains the Genetics section of the Virtual Library from CERN (Switzerland) and the DOE Human Genome Program pages and moderates the BioSci Human Genome Newsgroup.

Information Source

HGMIS staff members answer individual questions and supply general information about the Human Genome Project by telephone, fax, and e-mail and refer other questioners to experts in the Human Genome Project. HGMIS displays the DOE Human Genome Project traveling exhibit at occasional scientific conferences and genome-related meetings, makes presentations to educational, judicial, and other groups, and strives to strengthen the content relevancy of its services.

This work is sponsored by the Office of Biological and Environmental Research, U.S. Department of Energy, under contract No. DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp. 9/97

DOE Joint Genome Institute Public WWW site

Robert Sutherland*, Linda Ashworth+, Mark O. Mundt*, Sam Pitluck$, Tom Slezak+, and Darrell O. Ricke*
Joint Genome Institute: *Los Alamos National Laboratory, +Lawrence Livermore National Laboratory, $Lawrence Berkeley National Laboratory

The Joint Genome Institute (JGI) is integrating its information on its public WWW site. This information will combine data from LANL, LBNL, LLNL, and the new JGI PSF (Production Sequencing Facility). While this WWW site is under development, its planned content includes: (1) information about the JGI and its member institutions, (2) links to member institutions' WWW pages and other relevant WWW sites, (3) physical maps for regions being sequenced by the JGI, (4) links to entries submitted to public databases, (5) links to the JGI FTP server, (6) finished sequence data with associated quality information and feature annotations, (7) JGI informatics, (8) functional genomics information, and (9) information and WWW-links to promote public education and understanding of Genetics and the Human Genome Program. This work is funded by the United States Department of Energy.

URL: http://www.jgi.doe.gov

Human Genome Program Coordination Activities

Sylvia J. Spengler, Kelcey J. Poe, and Janice L. Mann
Human Genome Program Office, 459 Donner Laboratory, E. O. Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley CA 94720
[email protected]

The DOE Human Genome Program of the Office of Biological and Environmental Research (OBER) has developed a number of tools for management of the Program. Among these was the Human Genome Coordinating Committee (HGCC), established in 1988. In 1996, the HGCC was expanded to a broader vision of the role of genomic technologies in OBER programs, and the name was changed to reflect this broadening. The HGCC is now the Biotechnology Forum. The Forum is chaired by the Associate Director, OBER, Dr. A. Patrinos. Members of the Human Genome Program Management Task group are ex officio members, as are members of the Biological and Environmental Research Advisory Committee's subcommittee on the Human Genome. Responsibilities of the Forum include: assisting OBER in overall coordination of DOE-funded genome research; facilitating the development and dissemination of novel genome technologies; recommending establishment of ad hoc task groups in specific areas, such as informatics, technologies, model organisms; and evaluation of progress and consideration of long-term goals. Members also serve on the Joint DOE-NIH Subcommittee on the Human Genome, for interagency coordination. The coordination group also participates in interface programs with other facilities and provides scientific support for development of other OHER goals, as requested.

This work was supported by the Director, Office of Energy Research, Office of Biological and Environmental Research, Human Genome Program, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.

Appendices

Appendix A: Author Index

First authors are in bold.

A
Abmayr, Simone ...... 81
Adams, Mark D. ...... 12, 53, 54
Adamson, Aaron ...... 6, 27
Adamson, Anne E. ...... 123
Akowski, James ...... 17
Alarie, J.P. ...... 36
Albertson, Donna ...... 35, 68, 69
Aldredge, T. ...... 14
Allison, David ...... 46
Allman, S.L. ...... 39
Altherr, Michael R. ...... 11, 12, 53, 81
Annab, Lois ...... 49
Ansorge, W. ...... 45
Asbury, Charles L. ...... 21
Ashworth, Linda K. ...... 6, 57, 61, 90, 124
Athanasiou, Maria ...... 5
Auffray, C. ...... 45

B
Balch, Joseph ...... 27, 28, 29
Ballabio, A. ...... 45
Banfi, S. ...... 45
Barber, William ...... 97
Barnes, Thomas ...... 92
Barrett, J. Carl ...... 49
Bashirzadeh, R. ...... 14
Basit, Mujeeb ...... 5
Bass, Steve ...... 54
Beattie, Kenneth ...... 34, 46
Benner, W. Henry ...... 32
Betanzos, Gabriel ...... 34
Blakely, D. ...... 14
Blanchard, Alan ...... 55
Blatt, Robin J.R. ...... 105
Bloom, Mark ...... 109
Bochner, H. ...... 14
Bocskai, Diana ...... 52, 53
Boitsov, Alexander ...... 50
Boivin, M. ...... 14
Bonaldo, Maria de Fatima ...... 45
Borsani, G. ...... 45
Bose, Sonali ...... 40
Boughton, Ann ...... 111
Branscomb, Elbert ...... 61, 90
Brewer, Laurence R. ...... 27, 29
Brignac, Stafford ...... 5
Briley, J. David ...... 22
Britton, Jr., C.L. ...... 65, 66
Bronson, Scott ...... 109
Bronstein, Irena ...... 33
Bross, S. ...... 14
Brow, Mary Ann D. ...... 67
Brown, Nancy C. ...... 31, 50, 51
Browne, Murray ...... 123
Bruce, D.C. ...... 7, 53
Bruce, James E. ...... 38
Bruce, Robert ...... 6, 27
Bruno, William J. ...... 99, 100
Bryan, William ...... 34
Brynildsen, S. ...... 110
Buchanan, Michelle V. ...... 39, 46
Buckingham, Judy M. ...... 7, 8
Bulyk, Martha ...... 40
Bunde, T. ...... 36
Buneman, Peter ...... 94
Burke, John ...... 93
Burkhart-Schultz, Karolyn ...... 6
Burlage, R. S. ...... 65
Burmeister, Ron ...... 5
Bush, D. ...... 14
Butler-Loffredo, Laura-Li ...... 15

C
Cai, Hong ...... 30, 37
Campbell, Evelyn W. ...... 50, 51
Campbell, Mary L. ...... 50, 51
Cao, Yicheng ...... 52, 53
Cao, Yongweo ...... 13
Card, Paul ...... 5
Caron, A. ...... 14
Carpenter, D.A. ...... 60
Carrano, Anthony V. ...... 6, 27, 57, 84, 85, 90
Carrilho, E. ...... 26
Carroll, Ryan ...... 100
Caruso, A. ...... 14
Carver, Ethan A. ...... 56
Casey, Denise K. ...... 123
Chasteen, L. A. ...... 7
Chen, C. ...... 68
Chen, C. H. Winston ...... 39
Chen, I-Min A. ...... 96
Chen, Ting ...... 15
Chen, Xiao-Ning ...... 61
Chen, Y-J. ...... 64
Cheng, Jan-Fang ...... 11, 56, 63, 68
Cherry, Joshua ...... 13
Chi, Han-Chang ...... 8, 9
Chin, K. ...... 70
Chinwalla, A. ...... 30
Chissoe, Stephanie L. ...... 5
Christensen, M. ...... 57
Christoffels, Alan ...... 93
Chu, Lung-Yung ...... 92
Church, George ...... 14, 40
Clark, Steve ...... 35, 68
Clayton, Rebecca ...... 12
Cloud, K.G. ...... 16
Cloutier, Tom ...... 69, 75
Cohn, Judith D. ...... 7, 89, 98
Colayco, Enrique ...... 13, 52, 53
Colayco, J. ...... 64
Collins, Colin ...... 68, 69, 75
Collins, Debra L. ...... 117
Cook, M. ...... 30
Cook, R. ...... 14
Cornell, Earl W. ...... 25
Cottingham, Robert ...... 75, 93
Crabtree, Jonathan ...... 100
Craig, Steven G. ...... 106
Craven, Brook ...... 54
Cretz, Gabriela G. ...... lJ
Crockett, William C. ...... 15
Culpepper, Pamela A. ...... 90

D
Dairkee, S.H. ...... 68
Danganan, Linda ...... 6
Daniels, C.J. ...... 14
Davidson, Courtney ...... 27, 28, 29
Davidson, Jeff Alan ...... 116
Davidson, Susan B. ...... 94
Davie, Juan ...... 5
Davis, Sharon ...... 106

Davison, Daniel B. ...... 90
Davy, Donn ...... 75
de Jong, Pieter J. ...... 50
de Solorzano, C Ortiz ...... 70
Deaven, Larry L. ...... 7, 9, 50, 51, 53, 57, 63, 66, 86, 89, 98
Deloughery, C. ...... 14
Demirjian, David ...... 91
Denison, Karen ...... 81
Devignes, M.D. ...... 45
Dewar, D. ...... 110
Dhadwal, Harbans S. ...... 29
Dho, So Hee ...... 52, 53
Ding, Yan ...... 64, 65
Doggett, Norman A. ...... 7, 12, 51, 53, 57, 58
Doktycz, Mitchel ...... 34
Doublie, Sylvie ...... 19
Doucette-Stamm, L. ...... 14
Dougherty, Michael J. ...... 113
Dubchak, I. ...... 102
Dubhashi, Kedarnath A. ...... 82
Dubois, J. ...... 14
Duell, Thomas ...... 68
Dunn, Diane ...... 13
Dunn, Joel ...... 5
Dunn, John J. ...... 15, 20
Dürst, M. ...... 68

E
Echeverria, Annabel ...... 52, 53
Egan, J. ...... 14
Egenberger, Laurel ...... 112
Einstein, J. Ralph ...... 77
Eisenberg, Rebecca S. ...... 108
Ellenberger, Tom ...... 19
Ellston, D. ...... 14
English, Cynthia ...... 5
Ericson, M.N. ...... 65
Ericson, Nance ...... 46
Estivill, X. ...... 45
Evans, Glen A. ...... 5
Ezedi, J. ...... 14

F
Falter, K.G. ...... 65
Farmer, A. ...... 94
Fawcett, John J. ...... 50, 51
Federova, Nina ...... 5
Fernandez, C. ...... 70
Ferrell, T. L. ...... 65
Field, Casey ...... 54
Fischer, Steve ...... 100
Fisk, David J. ...... 29
Fitz-Gibbon, Sorel ...... 13
Fonstein, Michael ...... 17
Franklin, Terry ...... 5
Fraser, K. ...... 63
Frazer, Kelly A. ...... 11
Fresco, Jacques R. ...... 62
Froula, J. ...... 69

G
Gaasterland, Terry ...... 13, 75
Garner, Skip ...... 5
Garnes, Jeff ...... 6
Gath, Tracy ...... 105
GDB Staff ...... 93
Gee, Vanetta ...... 5
Gelb, Betsy D. ...... 106
Georgescu, A. ...... 57
Ghochikyan, Anahit ...... 18
Gibson, K. ...... 45
Gibson, Mark ...... 100
Gibson, R. ...... 14
Giddings, Michael C. ...... 86
Gilbert, D. ...... 64
Gilbert, K. ...... 14
Glazer, Alexander N. ...... 21, 33
Glodek, Anna ...... 53
Godfrey, T. ...... 69
Goetzinger, W. ...... 26
Golovlev, V.V. ...... 39
Gomez, Christie ...... 52, 53
Goodman, Nathan ...... 101
Goodwin, L. A. ...... 53
Goodwin, Peter M. ...... 30
Gordon, Laurie A. ...... 6, 57, 90
Gordon, Margaret ...... 5
Gotway, Garrett ...... 5
Goyal, A. ...... 14
Grady, Deborah L. ...... 8
Grajewski, Wally ...... 92
Grant, Odell ...... 5
Graves, J. ...... 51
Graves, Janine S. ...... 29
Gray, Joe W. ...... 35, 68, 69, 70, 75
Grechkin, Yuri ...... 80
Green, Phil ...... 10, 87
Griffin, G.D. ...... 36
Guerin, J. ...... 14
Gusfield, Dan ...... 83

H
Haab, Brian B. ...... 31
Hahner, Lisa ...... 5
Hall, Jeff G. ...... 67
Halpern, Aaron L. ...... 99
Hamil, Cindy ...... 13
Hampton, O.A. ...... 16
Han, Cliff S. ...... 57
Harbison, Chris ...... 40
Harger, C. ...... 94
Harris, Jeff ...... 5
Harrison, D. ...... 14
Hasan, Ahmad ...... 22
Haselkorn, Robert ...... 17
Haussler, David ...... 75
Heller, Michael A. ...... 108
Hicks, J.S. ...... 66
Hide, Winston ...... 93
Hill, K.K. ...... 16
Hillier, L. ...... 30
Hinson-Cooper, Shelly ...... 5
Hitti, J. ...... 14
Ho, T. ...... 14
Hoang, L. ...... 14
Hoisie, S. ...... 94
Holman, M. ...... 30
Holtham, K. ...... 14
Hood, Leroy ...... 55, 115
Hoyland, James R. ...... 26
Hraber, P. ...... 94
Huang, Zhengping ...... 31
Huh, Jun-Ryul ...... 53
Hung, Su-Chun ...... 21
Hurst, Gregory B. ...... 39
Hutchinson, G. ...... 69
Hutt, Lester ...... 33

I
Iadonato, Shawn ...... 10
Isola, N.R. ...... 36, 39

J
Ja, William W. ...... 23
Jackson, P.J. ...... 16
Jefferson, Margaret C. ...... 113
Jellison, G.E. ...... 65
Jett, James H. ...... 30, 31
Jewett, Phil ...... 50, 51
Jin, Jian ...... 25
Jiwani, N. ...... 14
Johnson, Dabney K. ...... 46, 59, 60, 65, 66
Johnson, M. ...... 64
Johnson, Marion D. ...... 62
Jones, Arthur ...... 35, 70
Jones, M.D. ...... 7
Joseph, P. ...... 14
Jurka, Jerzy ...... 49
Justice, Monica J. ...... 46, 59, 60

K
Kaiser, Michael W. ...... 67
Kane, Thomas E. ...... 26
Karger, Barry L. ...... 26, 88
Kass, Judy ...... 105
Keagle, P. ...... 14
Keim, P. ...... 16
Keller, Richard A. ...... 30, 31
Kelley, Jenny ...... 54
Kerlavage, Anthony R. ...... 12
Kernan, John ...... 26
Khan, S. ...... 64
Kheterpal, Indu ...... 23
Kieleczawa, Jan ...... 15
Kim, Gony H. ...... 13, 53
Kim, Joomyeong ...... 56, 61
Kim, Ung-Jin ...... 13, 36, 52, 53, 68
Kim, Yongseong ...... 31
Kimbrough, Joseph ...... 27, 29
Kiphart, D. ...... 94
Klenk, Hans-Peter ...... 12
Klock, Heath E. ...... 20
Knapp, Jr., F.F. ...... 66
Knetter, Rebecca ...... 117
Knuth, Mark W. ...... 20
Kobayashi, Arthur ...... 6, 85, 90
Kogan, Yakov ...... 17
Kolbe, William F. ...... 25
Kommander, Kristina ...... 30, 37
Korenberg, Julie R. ...... 61
Korn, B. ...... 45
Koshi, Jeffrey M. ...... 100
Kosky, Anthony ...... 96, 97
Kotler, L. ...... 26
Kotval, J.S. ...... 110
Kouprina, Natalay ...... 49, 51
Kowbel, D. ...... 68, 69
Kozlovsky, J. ...... 14
Krakowski, L. ...... 94
Kress, Reid ...... 46
Kruper, John ...... 109
Kufer, Ken ...... 5
Kumano, S. ...... 16
Kuo, W.-L. ...... 68
Kutney, Laura ...... 40
Kwiatkowski, Bob ...... 67
Kyle, Ami L. ...... 6, 84

L
Ladner, Heidi ...... 13
Lamerdin, Jane E. ...... 6, 48, 84
Landers, Rich ...... 92
Lantos, John ...... 110
Lao, Kaiqin ...... 33
LaPlante, M. ...... 14
Larionov, Vladimir ...... 49, 51
Larson, Susan J. ...... 82
Lee, Ann ...... 27
Lee, Byeong-Jae ...... 53
Lee, H-M. ...... 14
Lee, Jing Ying ...... 33
Lehnert, Nancy ...... 66
Lehrach, H. ...... 45
Lennon, Greg ...... 45
Leone, Joseph ...... 92
Lesley, Scott A. ...... 20
Lessick, Mira ...... 110
Letovsky, Stanley ...... 75, 93
Li, Lei ...... 88
Li, Qingbo ...... 26
Li, Robin Hua ...... 53
Liberzon, Arthur ...... 18
Liu, Betty ...... 33
Liu, Changsheng ...... 26
Liu, Shaorong ...... 33
Lobb, Rebecca ...... 8
Lockett, Stephen ...... 35, 70
Longmire, Jonathan L. ...... 7, 31, 50, 63
Lou, Yunian ...... 25
Lowndes, D.H. ...... 66
Lowry, Steve ...... 56
Lumm, W. ...... 14
Lundeberg, J. ...... 45
Lvovsky, Lev ...... 17, 18
Lyamichev, Victor ...... 67
Lynch, M. ...... 45

M
Machara, Nicholas P. ...... 30
Madabhushi, Ramkrishna ...... 27
Magness, Charles ...... 10
Mahairas, Gregory G. ...... 55
Mahowald, Mary B. ...... 110
Majeski, A. ...... 14
Maltbie, Mary ...... 50
Maltsev, Natalia ...... 79, 81
Mank, P. ...... 14
Mann, Janice L. ...... 124
Mann, Reinhold ...... 46
Manning, Ruth Ann ...... 100
Mansfield, Betty K. ...... 123
Manter, D. ...... 16
Mao, J. ...... 14
Mardis, E.R. ...... 30
Mariella, Ray ...... 27
Markowitz, Victor M. ...... 96, 97
Marrone, Babetta ...... 66
Marth, G. ...... 30
Martin, Christopher H. ...... 10, 11, 69
Martin, Christopher S. ...... 33
Martin, Sheryl A. ...... 75, 123
Mast, Andrea L. ...... 67
Mathies, Richard A. ...... 21, 23, 31, 33
McCorkle, Sean ...... 15
McCready, Paula ...... 6, 27
McCutchen-Maloney, Sandra L. ...... 48
McDougall, S. ...... 14
McFarland, James ...... 5
McGrane, R. ...... 30
McInerney, Joseph D. ...... 113
McLeod, M. ...... 94
McMillan, A.D. ...... 65
McMillan, D.E. ...... 65
Medalle, Jeremy ...... 15
Meincke, Linda J. ...... 51, 53, 57
Melnyk, J. ...... 64

Michaud, Edward J. ...... 59
Micklos, David ...... 109
Miller, Arthur W. ...... 88
Miller, Jeffrey H. ...... 13
Miller, Robert ...... 93
Mills, Marissa D. ...... 123
Mirzabekov, A. ...... 37
Mitchell, Steve ...... 61
Mitchell, Steve C. ...... 52, 53
Mohrenweiser, H.W. ...... 57
Molloy, Vincent ...... 17
Moore, Troy ...... 52
Moss, Robert ...... 110
Moyzis, Robert K. ...... 8, 9, 51, 57
Muchnik, I. ...... 102
Muddiman, David C. ...... 38
Multedo, Molly ...... 119
Mundt, Mark O. ...... 7, 8, 57, 89, 98, 124
Mundy, C. ...... 45
Munk, A. Christine ...... 7, 8, 89
Munn, Maureen ...... 115
Mural, Richard J. ...... 46, 75, 77
Myambo, K. ...... 69
Myers, Eugene W. ...... 82

N
Nachnani, Sushil ...... 91
Naranjo, Cleo ...... 11, 81
Narla, Mohan ...... 10
Neal, C. ...... 64
Needham, Cynthia ...... 114
Nelson, David O. ...... 27, 87, 88
Nelson, Karen E. ...... 12
Neri, Bruce ...... 67
Newton, Jackie ...... 5
Nickerson, Deborah A. ...... 99
Nolan, John P. ...... 30, 37
Nolan, Matt P. ...... 84, 85
Nolling, J. ...... 14
Nonet, G. ...... 69

O
O'Brien, J. ...... 45
O'Brien, Kevin ...... 5
O'Keeffe, T. ...... 40
Okinaka, R. T. ...... 16
Olman, Victor ...... 77
Olsen, Anne S. ...... 6, 57, 90
Olson, Maynard ...... 10
Ono, Tetsuyoshi ...... 41
Ordonez, Patricia ...... 113
Osborne-Lawrence, Sherri ...... 5
Oskin, Boris ...... 50
Overbeek, Ross ...... 79, 81
Overton, G. Christian ...... 75, 94, 100
Ow, David J. ...... 6, 85, 90

P
Palazzolo, M. ...... 69
Panussis, D.A. ...... 30
Parang, Morey ...... 75, 77
Park, Min S. ...... 81
Pastrone, Ron ...... 27
Patel, Parul ...... 5
Patwell, D. ...... 14
Paulus, Michael ...... 46, 66
Peluso, D.C. ...... 30
Peng, Ze ...... 56
Perez-Castro, Ana V. ...... 11
Perry, Barbara ...... 13
Petrov, Sergey ...... 75, 77
Petrovic, Michelle ...... 66
Pevzner, Pavel ...... 75
Phillips, Dereth ...... 40
Phillips, J. ...... 14
Pietrokovski, S. ...... 14
Pietu, G. ...... 45
Pinkel, Daniel ...... 35, 68, 69, 70
Pitluck, Sam ...... 90, 124
Poe, Kelcey J. ...... 124
Polikoff, D. ...... 69
Poole, I. ...... 68
Poole, Ian ...... 35
Porter, Kenneth W. ...... 22
Pothier, B. ...... 14
Poundstone, Pat ...... 6
Poustka, A. ...... 45
Prabhakar, S. ...... 14
Prudent, James R. ...... 67
Puri, Rajesh ...... 61
Pusch, Gordon ...... 79

Q
Qiu, D. ...... 14
Quan, Glenda G. ...... 6, 84
Quesada, Mark A. ...... 29

R
Raja, Mugasimangalam ...... 17
Ramirez, Melissa ...... 6
Ramsey, Michael ...... 46
Ramsey, Rose ...... 46
Randesi, Matthew ...... 20
Rank, David R. ...... 24
Reeve, J.N. ...... 14
Reilly, Philip P. ...... 105
Renouard, J.J. ...... 16
Resnick, Michael A. ...... 49
Reynolds, Leigh Ann ...... 106
Rice, P. ...... 14
Richardson, Charles ...... 19
Richterich, P. ...... 14
Ricke, Darrell O. ...... 7, 8, 9, 12, 16, 81, 86, 89, 90, 98, 124
Ricke, Tracy L. ...... 98
Rieder, Mark J. ...... 99
Rifkin, L.L. ...... 30
Riley, Susan L. ...... 30
Rinchik, E.M. ...... 60
Robb, Frank ...... 13
Robbins, Robert J. ...... 112
Robinson, D.L. ...... 7
Rodriguez, E. ...... 70
Rodriguez, Gabriella ...... 52, 53
Rommens, J. ...... 69
Ross, Lainie Friedman ...... 110
Ross, S. ...... 57
Rossetti, M. ...... 14
Rothstein, Mark A. ...... 106
Rouchka, Eric C. ...... 83, 84
Rounsley, Steve ...... 54
Rubenfield, M. ...... 14
Rubin, Edward M. ...... 11, 56, 63
Ruiz-Martinez, M.C. ...... 26

S
S-Nair, Usha ...... 66
Sachdeva, M. ...... 14
Sachs, Greg ...... 110
Safer, H. ...... 14
Salas-Solano, O. ...... 26
Sari-Sarraf, H. ...... 66
Saunders, Elizabeth H. ...... 8, 9
Scalf, Mark ...... 41
Schad, P.A. ...... 75, 94

Schageman, Jeff ...... 5
Scherer, James R. ...... 23
Schilling, Peter ...... 5
Schisler, Jonathan ...... 17
Schmidt, Sarah ...... 55
Schmitt, Holger ...... 65
Schultz, Roger ...... 5
Schwager, C. ...... 45
Schwerin, Noel ...... 118
Schwertfeger, J. ...... 94
Scott, Bari ...... 119
Scott, Duncan ...... 56
Segraves, Rick ...... 35, 68
Selkov, Evgeni ...... 79, 80, 81
Selkov, Jr., Evgeni ...... 80
Semin, David J. ...... 30
Sergueev, Dima ...... 22
Sesma, Mary Ann ...... 113
Severin, Jessica ...... 86
Shah, Manesh B. ...... 75, 77
Shannon, Mark ...... 6, 48, 61, 69
Shaw, Barbara Ramsay ...... 22
Sheng, Y. ...... 64
Shi, Zheng-Yang ...... 61
Shimer, G. ...... 14
Shin, Dong-Guk ...... 92
Shizuya, Hiroaki ...... 64, 65
Shpak, Mitzi ...... 65
Shumway, Jeffrey ...... 33
Siepel, A. ...... 94
Simon, M.I. ...... 36
Simon, Melvin I. ...... 13, 52, 53, 65
Simpson, M.L. ...... 66
Simpson, Peter C. ...... 33
Singh, G. ...... 94
Skowronski, Evan ...... 6
Skupski, M. ...... 94
Slagel, Joe ...... 55
Slezak, Thomas R. ...... 85, 90, 124
Smit, Arian F.A. ...... 87
Smith, D.J. ...... 63
Smith, D.R. ...... 14
Smith, Lloyd M. ...... 24, 41, 86
Smith, Richard D. ...... 38
Smith, S.F. ...... 65
Snell, P. ...... 14
Snider, J.E. ...... 30
Snoddy, Jay ...... 75, 77
Soares, Marcelo Bento ...... 45
Socci, Nicholas D. ...... 99
Sosa, Maria ...... 105
Sosic, Z. ...... 26
Spadafora, R. ...... 14
Speed, Terence P. ...... 87, 88
Spengler, Sylvia J. ...... 75, 102, 112, 124
Spitzer, L. ...... 14
Stamper, D. ...... 94
States, David J. ...... 83, 84
Steadman, P. ...... 94
Stilwagen, Stephanie A. ...... 6, 84
Stokes, D.L. ...... 36
Stoye, Jens ...... 83
Strong, J. ...... 30
Stubbs, Lisa ...... 6, 56, 61, 69
Studier, F. William ...... 15, 29
Stuebe, E. ...... 30
Stump, Mark ...... 13
Sudar, Damir ...... 35, 68, 70
Sutherland, Robert D. ...... 53, 58, 124
Swanson, Ronald V. ...... 13
Swierkowski, Steve ...... 27, 28
Syed, Mathenna ...... 5
Symula, D. ...... 63
Szeto, Ernest ...... 96, 97

T
Tabor, Stanley ...... 19
Takova, Tsetska ...... 67
Tan, Hongdong ...... 24
Tang, Mary X. ...... 33
Taranenko, N.I. ...... 39
Tarte, Lisa ...... 28
Tatum, Owatha L. ...... 8, 9, 63, 66
Taylor, Scott L. ...... 99
Tesmer, Judith G. ...... 51, 53, 57
Thayer, N. ...... 94
Thelen, Michael P. ...... 48
Thilman, Jude ...... 119
Thomann, H-U. ...... 14
Thomas, S.E. ...... 60
Thompson, Keith H. ...... 15
Thompson, R. ...... 94
Tipton, Stephanie ...... 55
Tobin, Sara L. ...... 111
Tolic, Ljiljana Pasa ...... 38
Topaloglou, Thodoros ...... 97
Torney, David C. ...... 58, 78
Traweek, Brad W. ...... 82
Trong, Stephan ...... 85
Turner, John ...... 34

U
Uberbacher, Edward C. ...... 46, 75, 77
Ueda, Y. ...... 63
Ueng, Samantha Y.-J. ...... 8
Uhlen, M. ...... 45
Ulanovsky, Levy ...... 17, 18

V
Valenzuela, Danny ...... 5
van den Engh, Ger ...... 21
Venter, J. Craig ...... 12, 54
Verp, Marion ...... 110
Vicaire, R. ...... 14
Vo-Dinh, T. ...... 36
Volik, Stanislav ...... 68
Voyta, John C. ...... 33

W
Wagner, Mark C. ...... 85, 90
Wang, Mei ...... 52, 53, 68
Wang, Poguang ...... 40
Wang, Y. ...... 14
Wang, Yiwen ...... 21, 23
Wang, Zaolin ...... 66
Ward, Travis ...... 5
Wargo, P. ...... 94
Wassom, John S. ...... 123
Waterston, R.H. ...... 30
Waugh, M. ...... 94
Weaver, Kristal ...... 39
Weier, Heinz-Ulrich G. ...... 68
Weinstock, K. ...... 14
Weiss, Robert B. ...... 13
Welsh-Breitinger, Becky ...... 81
Wendl, M. ...... 30
Wernick, M. ...... 69
Wertz, Dorothy C. ...... 105
West, Anne ...... 55
Westphall, Michael S. ...... 24, 86
White, Owen ...... 12
White, P. Scott ...... 7, 8, 9, 37, 63, 66, 98
Whittaker, Clive C. ...... 78

Wiemann, S. ...... 45
Wierzbowski, J. ...... 14
Williams, A.L. ...... 7
Williams, Stephen J. ...... 33
Wilson, Julie ...... 11
Wilson, R.K. ...... 30
Wong, Gane K.-S. ...... 10
Wong, L. ...... 14
Woo, L. ...... 57
Woolley, Adam T. ...... 33
Worley, Kim C. ...... 90
Wright, Tracy J. ...... 81
Wu, Jenny ...... 68
Wunschel, David S. ...... 38
Wyrick, Judy M. ...... 123
Wysocki, Jeanne R. ...... 15

X
Xie, Gouchun ...... 78
Xu, Linxiao ...... 40
Xu, Q. ...... 14
Xu, Robert Xuequn ...... 52, 53
Xu, Ying ...... 75, 77

Y
Yeh, T. Mimi ...... 85
Yeung, Edward S. ...... 24
Yimlamai, Dean ...... 61
Yu, Jun ...... 10
Yust, Laura N. ...... 123

Z
Zackrone, Keith D. ...... 55
Zevin-Sonkin, Dina ...... 18
Zhan, Ming ...... 34
Zhang, Ge ...... 77
Zhang, L. ...... 14
Zhang, Shiping ...... 15
Zhu, Y.F. ...... 39
Zhu, Yiwen ...... 56, 63
Zhuang, J.J. ...... 94
Zorn, Manfred ...... 75, 91, 102
Zweig, Franklin M. ...... 107

Appendix B: National Laboratory Index

U.S. Department of Energy Laboratories

Human Genome Program work at the national laboratories is described in the following abstracts.

Argonne National Laboratory: 13, 17, 18, 37, 75, 79, 80, 81

Brookhaven National Laboratory: 15, 20, 29

Lawrence Berkeley National Laboratory: 10, 11, 25, 32, 35, 56, 63, 68, 69, 70, 75, 91, 96, 97, 102, 112, 124

Lawrence Livermore National Laboratory: 6, 27, 28, 29, 48, 56, 57, 61, 69, 84, 85, 87, 90, 124

Los Alamos National Laboratory: 7, 8, 9, 12, 16, 30, 31, 37, 50, 51, 53, 57, 58, 62, 66, 78, 81, 86, 89, 98, 99, 100, 124

Oak Ridge National Laboratory: 34, 36, 39, 46, 59, 60, 65, 66, 75, 77, 123

Pacific Northwest National Laboratory: 38
